Pedagogy MeTTa-NLP Corpus Generation

Expert Rating 4.9
simuliinc
Project Owner


Overview

This proposal outlines the creation of a 20,000-pair MeTTa language corpus to enable training of an AI coding assistant. The approach involves generating instruction-output pairs through a combination of data collection, processing, and synthesis, followed by rigorous validation using both automated and human review. The corpus will cover six key areas including arithmetic, functional programming, and AGI-specific tasks. The $35,000, 6-month project delivers the corpus, validation tools, documentation, and a roadmap for future updates. A unique aspect is that the extraction/generation model can later be used to validate the resulting MeTTa LLM, reflecting the project's pedagogical approach.

RFP Guidelines

Create corpus for NL-to-MeTTa LLM

Complete & Awarded
  • Type SingularityNET RFP
  • Total RFP Funding $70,000 USD
  • Proposals 10
  • Awarded Projects 1
SingularityNET
Aug. 13, 2024

Develop a MeTTa language corpus to enable the training or fine-tuning of an LLM and/or LoRAs aimed at supporting developers by providing a natural language coding assistant for the MeTTa language.

Proposal Description

Company Name (if applicable)

Simuli Inc.

Project details

Deliverables

 

1. MeTTa Corpus: 

A ~20k-pair dataset of instruction-output pairs that comprehensively covers MeTTa features and functionality.
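As a hedged illustration (the proposal does not fix a schema), one instruction-output pair might be stored as a small JSON record like the following; the field names, the `id` convention, and the MeTTa snippet are assumptions for illustration only.

```python
import json

# Illustrative sketch of one corpus record; the schema, field names, and
# category labels are assumptions, not the project's specification.
pair = {
    "id": "arith-0001",
    "category": "Arithmetic and logic",
    "instruction": "Define a MeTTa function that doubles a number.",
    "output": "(= (double $x) (* 2 $x))",
}

# Serialize one record, e.g. for a JSON-lines corpus file.
record = json.dumps(pair)
print(record)
```

A JSON-lines layout like this keeps each pair independently parseable, which suits the per-pair validation described below.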

 

2. Corpus Validation:

A validation check of each pair, ensuring the pairs are accurate.

 

3. Codebase: 

Scripts and tools for corpus creation, data synthesis, and validation.

 

4. Documentation:

Detailed explanations of corpus creation processes.

Guidelines for using the corpus in AI model training or fine-tuning.

Tutorials for beginners and advanced users.

 

5. Future Roadmap: 

A plan for updating and expanding the corpus as the MeTTa language evolves.

Usefulness

The development of a MeTTa language corpus and specialized LLM will transform how developers interact with MeTTa code. This tool significantly speeds up the development process by providing intelligent, context-aware coding assistance in real-time. New developers benefit from a dramatically reduced learning curve as they receive immediate guidance on language syntax, patterns, and best practices. Such a system's automated code review capabilities help maintain code quality while reducing manual review time. Additionally, it would streamline documentation by automatically generating clear, consistent documentation from code, and support efficient code maintenance through intelligent refactoring suggestions based on established patterns in the corpus. Our methodology for creating the corpus is repeatable and lends itself directly to the MeTTa LLM.

 

Problem Description

MeTTa development currently faces significant efficiency challenges due to the lack of automated coding tools. Developers must navigate through development cycles that are unnecessarily prolonged by manual coding processes, while newcomers encounter steep learning curves without adequate assistance. The absence of standardized tools leads to inconsistent coding patterns across projects, making maintenance and collaboration more difficult. Developers also face the time-consuming burden of manual documentation, and without proper tooling, code reuse remains limited, forcing frequent recreation of common solutions. Having an automated coding tool that translates natural language into MeTTa code would speed up development, lower barriers to entry for developing in Hyperon, and help existing developers find more efficient code.

Solution Description

We propose to develop a comprehensive MeTTa language corpus that will enable the training or fine-tuning of a natural language-to-MeTTa large language model (LLM). This corpus will support the creation of an AI-powered coding assistant, which will help users generate accurate and functional MeTTa code, thereby lowering the barrier to entry for the MeTTa language and accelerating the development of Artificial General Intelligence (AGI) within the Hyperon framework. Our approach emphasizes quality, reproducibility, and alignment with the objectives of the SingularityNET Foundation, ensuring the deliverables meet all functional and non-functional requirements. Our main priority is to create a foundation corpus for the LLM using a method that itself carries over directly to the MeTTa LLM.

 

  • Corpus Development

    • We will create a structured corpus containing up to 20,000 high-quality instruction-output pairs, where instructions are in natural language and outputs are error-free, valid MeTTa code. The process includes:

      • Data Collection: Extracting MeTTa programs from available resources such as GitHub repositories, official documentation, and community tutorials using an extraction model.

      • Data Processing: Cleaning, formatting, and standardizing existing MeTTa resources to ensure consistency and usability.

      • Data Generation: Synthesizing new MeTTa code samples, covering diverse use cases and functionalities using a generation model (same as extraction).

    • The corpus is structured by the following headings:

  1. Arithmetic and logic

  2. Functional programming

  3. Symbolic reasoning and rules

  4. Graph operations

  5. AGI-specific tasks

  6. Probabilistic models and constraint solving 
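The collection, processing, and generation steps above can be sketched as a minimal Python pipeline. Everything here (the function names, the toy cleaning rule, and the placeholder instruction template) is an illustrative assumption, not the project's actual tooling.

```python
# Minimal sketch of the collect -> process -> generate flow described above.
# All names and the pair format are illustrative assumptions.

def collect(sources):
    """Gather raw MeTTa snippets from the listed resources."""
    return [snippet for src in sources for snippet in src["snippets"]]

def process(snippets):
    """Clean and standardize snippets (here: strip whitespace, drop empties)."""
    return [s.strip() for s in snippets if s.strip()]

def generate_pairs(snippets):
    """Attach a placeholder natural-language instruction to each snippet."""
    return [{"instruction": f"Explain and reproduce: {s}", "output": s}
            for s in snippets]

# Toy source standing in for GitHub repos, docs, and tutorials.
sources = [{"name": "docs", "snippets": ["(= (double $x) (* 2 $x))", "  "]}]
pairs = generate_pairs(process(collect(sources)))
print(len(pairs))  # -> 1 (the empty snippet is dropped)
```

In the real pipeline the generation step would be performed by the extraction/generation model rather than a string template.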

  • Quality Assurance

    • To ensure the corpus meets the highest standards:

      • All code will undergo validation to guarantee correctness and adherence to MeTTa best practices.

      • An iterative feedback process will involve experts from the Hyperon framework and the MeTTa community to refine outputs.

  • Documentation

    • We will provide thorough documentation detailing:

      • The corpus creation process, including methodologies for extraction, generation, and validation.

      • Known limitations and recommendations for future improvements.

      • Tutorials and guidelines to assist users in training or fine-tuning their own MeTTa coding assistants.

    • Open-Source Contribution

      • All code, scripts, and documentation developed during this project will be open-sourced to foster collaboration and reproducibility.

      • Version-controlled documentation ensuring transparency and reproducibility.

  • Roadmap for Future Corpus Updates

    • We will provide an explanatory video/tutorial of how to use our corpus generator, the process of validation, and other related items needed for future corpus updates.

 

Longer description

The project involves creating a structured corpus of MeTTa code by:

  1. Collecting existing codebases and documentation

  2. Generating new example code

  3. Annotating with natural language descriptions

  4. Processing and standardizing data

  5. Preparing training datasets

  6. Fine-tuning LLM models

  7. Developing evaluation frameworks

Our approach is to develop a generative extraction LLM that extracts, from our hand-curated selection of documentation, the knowledge required to generate natural-language-to-MeTTa pairs. These pairs are then reviewed in a systematic process of automatic and human validation before being fed back to the model for re-incorporation. Our unique approach relies on a “good cop – bad cop” negotiation style. This approach is interesting in our application because it drives the model toward a perfection mode where anything less than the exact answer isn't good enough, but trying and learning to get there is. In one sense, it is also a parenting style for training the LLM to perform correctly.

Our validation protocol is equally important as training the extraction and generation model. The pairs are first verified in a cross-over accuracy model in which separately trained models evaluate the pairs. Any pairs that do not reach full consensus are re-evaluated and reformatted until they do. Second, two human experts blindly sample the automated corpus from different sections (they do not know which section a pair relates to). Once a pair has been validated by at least two human validators, it is cleared for the corpus. Any unknown or uncertain pairs are further corrected by separate validators. Finally, we feed all corrected pairs back to update the core models in an iterative process.
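The clearance rule above can be sketched minimally: a pair enters the corpus only when all automated validators agree and at least two human validators approve. The toy `balanced` check below stands in for the separately trained validator models, which the proposal does not specify.

```python
# Sketch of the consensus clearance rule described above. The validator
# logic is a stand-in assumption; the real validators are trained models.

def consensus(pair, validators):
    """True only if every automated validator accepts the pair."""
    return all(v(pair) for v in validators)

def clear_for_corpus(pair, validators, human_approvals):
    """A pair is cleared only on full consensus plus >= 2 human approvals."""
    return consensus(pair, validators) and human_approvals >= 2

def balanced(pair):
    """Toy check: the MeTTa output has balanced parentheses."""
    depth = 0
    for ch in pair["output"]:
        depth += ch == "("
        depth -= ch == ")"
        if depth < 0:
            return False
    return depth == 0

validators = [balanced, balanced]  # stand-ins for separately trained models
good = {"output": "(= (double $x) (* 2 $x))"}
bad = {"output": "(= (double $x) (* 2 $x)"}  # missing closing paren
print(clear_for_corpus(good, validators, human_approvals=2))  # True
print(clear_for_corpus(bad, validators, human_approvals=2))   # False
```

Pairs failing either gate would loop back for re-evaluation, as described above.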

Competition and USP

We propose a cost-effective budget that reflects the scope and complexity of the project while ensuring high-quality outcomes. A detailed breakdown can be provided upon request.

Our proposal aligns with the objectives of the SingularityNET Foundation by addressing the need for a robust, scalable, and reproducible corpus for training MeTTa coding assistants. The deliverables will not only accelerate AGI development but also empower the community to adopt and innovate with the MeTTa language.

Our team consists of experts with extensive experience in:

  • Developing and fine-tuning LLMs for programming languages.

  • Re-building logic from underrepresented programming languages to popular ones.

  • Active contributions to AI and AGI research initiatives, including participation in the MeTTa study group and Hyperon framework development.

Our unique value is that the model which extracts and generates the pairs can be used to validate the resulting MeTTa LLM when it is developed.

 

Open Source Licensing

GNU GPL - GNU General Public License

The generation model can be provided as an offline tool.

Proposal Video

Not Available Yet

Check back later during the Feedback & Selection period for the RFP that this proposal is applied to.

  • Total Milestones

    4

  • Total Budget

    $35,000 USD

  • Last Updated

    7 Dec 2024

Milestone 1 - Fundamental MeTTa Corpus Foundation

Description

Generate and validate first batch of instruction-output pairs covering arithmetic operations and functional programming paradigms in MeTTa. Establish initial validation framework.

Deliverables

  • 6,000-7,000 validated instruction-output pairs
  • Initial extraction/generation model
  • First version of validation tooling
  • Documentation of processes used

Budget

$10,000 USD

Success Criterion

  • 95% pass rate on automated validation checks
  • Human expert validation of a random 10% sample
  • Successful execution of all code samples
  • Documentation peer reviewed by 2 team members
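The 95% pass-rate threshold and 10% human-review sample can be sketched as a simple check; the `milestone_check` helper and its parameters are hypothetical, not part of the proposal's tooling.

```python
import random

# Sketch of the milestone gate: compute the automated pass rate and draw a
# random 10% sample of pairs for human expert review. The threshold mirrors
# Milestone 1; the function itself is an illustrative assumption.

def milestone_check(results, threshold=0.95, sample_frac=0.10, seed=0):
    rate = sum(results) / len(results)
    rng = random.Random(seed)  # seeded so the sample is reproducible
    k = max(1, int(len(results) * sample_frac))
    sample = rng.sample(range(len(results)), k)  # indices for human review
    return rate >= threshold, sample

results = [True] * 97 + [False] * 3  # 97 of 100 pairs pass automated checks
ok, human_sample = milestone_check(results)
print(ok, len(human_sample))  # True 10
```

Later milestones would raise `threshold` to 0.97 and 0.99 per their success criteria.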

Milestone 2 - Symbolic & Graph Operations

Description

Develop and validate pairs focused on symbolic reasoning and graph operations. Enhance validation framework based on learnings.

Deliverables

  • Additional 6,000-7,000 validated pairs
  • Improved validation framework
  • Updated extraction/generation model
  • Integration tests for new pairs

Budget

$10,000 USD

Success Criterion

  • 97% pass rate on automated validation
  • Cross-validation by separate model implementations
  • All graph operations verified with test cases
  • Zero conflicts with existing corpus

Milestone 3 - Advanced AGI Features

Description

Complete corpus with AGI-specific tasks and probabilistic models while refining overall quality.

Deliverables

  • Final 6,000-7,000 validated pairs
  • Finalized validation system
  • Complete extraction/generation model
  • Comprehensive test suite

Budget

$10,000 USD

Success Criterion

  • 99% pass rate on automated validation
  • Full coverage of specified AGI tasks
  • Successful integration with Hyperon framework
  • All probabilistic models verified accurate

Milestone 4 - Documentation & Tools

Description

Package all tools, create comprehensive documentation, and establish future maintenance protocols.

Deliverables

  • Complete 20k-pair corpus
  • All source code and tools
  • Comprehensive documentation
  • Tutorial videos and examples

Budget

$5,000 USD

Success Criterion

  • Successful test runs by external developers
  • Documentation covers all major use cases
  • Tools successfully deployed in test environment
  • Positive feedback from user testing


Expert Ratings

Reviews & Ratings

Group Expert Rating (Final)

Overall

4.9

  • Feasibility 5.0
  • Desirability 5.0
  • Usefulness 5.0

Excellent submission with high ratings, but experts selected another proposal for strategic reasons.

  • Expert Review 1

    Overall

    5.0

    • Compliance with RFP requirements 5.0
    • Solution details and team expertise 5.0
    • Value for money 5.0

    Comprehensive and well-aligned with RFP goals. Strong on validation methodology, scalability, and open-source principles. Concerns include ambitious timeline, labor-intensive validation process, and sourcing details. Overall extremely promising!

  • Expert Review 2

    Overall

    5.0

    • Compliance with RFP requirements 5.0
    • Solution details and team expertise 5.0
    • Value for money 5.0
    It's a strong proposal that gives details on most aspects of what is to be done...

    Details are given on the human curation aspect and the automated validation aspect. More thoughts on the details of the modeling/synthesis process would have been good, but perhaps this will just need to be uncovered experimentally...

  • Expert Review 3

    Overall

    5.0

    • Compliance with RFP requirements 5.0
    • Solution details and team expertise 5.0
    • Value for money 5.0

    The proposal is detailed and comprehensive in scope. Unique to its approach is a breakdown of corpus topics via functionality.
