Project details
Deliverables
1. MeTTa Corpus:
A ~20,000-pair dataset of instruction-output pairs that comprehensively covers MeTTa features and functionality.
2. Corpus Validation:
A validation check of each pair, ensuring the pairs are accurate.
3. Codebase:
Scripts and tools for corpus creation, data synthesis, and validation.
4. Documentation:
Detailed explanations of corpus creation processes.
Guidelines for using the corpus in AI model training or fine-tuning.
Tutorials for beginners and advanced users.
5. Future Roadmap:
A plan for updating and expanding the corpus as the MeTTa language evolves.
Usefulness
The development of a MeTTa language corpus and specialized LLM will transform how developers interact with MeTTa code. Such a tool speeds up development by providing intelligent, context-aware coding assistance in real time. New developers benefit from a much shorter learning curve, receiving immediate guidance on language syntax, patterns, and best practices. The system's automated code review capabilities help maintain code quality while reducing manual review time. Additionally, it would streamline documentation by automatically generating clear, consistent documentation from code, and it supports efficient code maintenance through intelligent refactoring suggestions based on established patterns in the corpus. Our methodology for creating the corpus is repeatable and lends itself directly to the MeTTa LLM.
Problem Description
MeTTa development currently faces significant efficiency challenges due to the lack of automated coding tools. Developers must navigate development cycles that are unnecessarily prolonged by manual coding processes, while newcomers encounter steep learning curves without adequate assistance. The absence of standardized tools leads to inconsistent coding patterns across projects, making maintenance and collaboration more difficult. Developers also face the time-consuming burden of manual documentation, and without proper tooling, code reuse remains limited, forcing frequent recreation of common solutions. An automated coding tool that translates natural language into MeTTa code would speed up development, lower barriers to entry for developing in Hyperon, and help existing developers find more efficient code.
Solution Description
We propose to develop a comprehensive MeTTa language corpus that will enable the training or fine-tuning of a natural language-to-MeTTa large language model (LLM). This corpus will support the creation of an AI-powered coding assistant, which will help users generate accurate and functional MeTTa code, thereby lowering the barrier to entry for the MeTTa language and accelerating the development of Artificial General Intelligence (AGI) within the Hyperon framework. Our approach emphasizes quality, reproducibility, and alignment with the objectives of the SingularityNET Foundation, ensuring the deliverables meet all functional and non-functional requirements. Our main priority is to create a foundation corpus for the LLM using a method that itself feeds directly into the MeTTa LLM.
Corpus Development
We will create a structured corpus containing up to 20,000 high-quality instruction-output pairs, where instructions are in natural language and outputs are error-free, valid MeTTa code. The process includes:
- Data Collection: Extracting MeTTa programs from available resources such as GitHub repositories, official documentation, and community tutorials using an extraction model.
- Data Processing: Cleaning, formatting, and standardizing existing MeTTa resources to ensure consistency and usability.
- Data Generation: Synthesizing new MeTTa code samples covering diverse use cases and functionalities, using a generation model (the same model as for extraction).
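The processing step above can be pictured as a small normalization and deduplication pass. This is a minimal sketch only; the function names and record layout are our illustration, not part of the actual toolchain:

```python
import hashlib
import re

def normalize_pair(instruction: str, code: str) -> dict:
    """Standardize one instruction-output pair: collapse whitespace in the
    instruction, trim the code, and attach a content hash for deduplication."""
    instruction = " ".join(instruction.split())
    code = re.sub(r"\n{3,}", "\n\n", code.strip())
    digest = hashlib.sha256((instruction + "\0" + code).encode()).hexdigest()
    return {"instruction": instruction, "output": code, "id": digest[:12]}

def dedupe(pairs):
    """Drop exact duplicates while preserving first-seen order."""
    seen, unique = set(), []
    for p in pairs:
        if p["id"] not in seen:
            seen.add(p["id"])
            unique.append(p)
    return unique
```

Hashing the normalized pair means the same example harvested from two different sources collapses to a single corpus entry.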
The corpus is structured by the following headings:
- Arithmetic and logic
- Functional programming
- Symbolic reasoning and rules
- Graph operations
- AGI-specific tasks
- Probabilistic models and constraint solving
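As a concrete illustration of how pairs under these headings could be stored, each entry might be one JSON line tagged with its category. The field names here are an assumption for illustration, not a fixed schema:

```python
import json

# Hypothetical record layout for one instruction-output pair.
record = {
    "category": "Arithmetic and logic",
    "instruction": "Define a MeTTa function that doubles a number.",
    "output": "(= (double $x) (* 2 $x))",
}

# One record per line (JSONL) keeps the corpus streamable and easy to split
# into training and evaluation sets per category.
line = json.dumps(record)
restored = json.loads(line)
```

A JSONL layout like this makes it trivial to sample pairs per heading for the blind human-validation step described later.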
Longer description
The project involves creating a structured corpus of MeTTa code by:
- Collecting existing codebases and documentation
- Generating new example code
- Annotating with natural language descriptions
- Processing and standardizing data
- Preparing training datasets
- Fine-tuning LLM models
- Developing evaluation frameworks
Our approach is to develop a generative extraction LLM that extracts, from our hand-curated selection of documentation, the knowledge required to generate natural language-to-MeTTa pairs. These pairs are then reviewed in a systematic process of automatic and human validation before being fed back to the model for re-incorporation. Our unique approach relies on a “good cop – bad cop” negotiation style. This is interesting in our application because it drives the model toward a standard where anything less than the exact answer is not good enough, but trying and learning to get there is rewarded. In a sense, it is also a parenting style for training the LLM to perform correctly.
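The “good cop – bad cop” loop can be sketched as a refinement cycle: the critic rejects anything short of the exact target, while each rejection is fed back so the generator can try again. The generator and critic below are stand-in stubs, not our actual models:

```python
def refine(generate, critique, instruction, max_rounds=5):
    """Regenerate a candidate until the critic accepts it, feeding each
    critique back to the generator. Unresolved pairs go to human review."""
    feedback = None
    for _ in range(max_rounds):
        candidate = generate(instruction, feedback)
        ok, feedback = critique(instruction, candidate)
        if ok:
            return candidate
    return None

# Toy stand-ins for illustration only: the generator fixes its output
# once it receives the critic's feedback.
def toy_generate(instr, feedback):
    return "(+ 1 2)" if feedback else "(+ 1 1)"

def toy_critique(instr, candidate):
    return (candidate == "(+ 1 2)", "expected (+ 1 2)")
```

Capping the rounds keeps the loop from spinning forever; pairs that never converge are exactly the ones routed to the human validators described next.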
Our validation protocol is just as important as training the extraction and generation model. The pairs are first verified in a cross-over accuracy model in which separately trained models evaluate the pairs; any pair that does not reach full consensus is re-evaluated and reformatted until it does. Second, two human experts blindly sample the automated corpus from different sections (they do not know which section a pair belongs to). Once a pair has been validated by at least two human validators, it is cleared for the corpus. Any unknown or uncertain pairs are further corrected by separate validators. Finally, we feed all corrected pairs back in an iterative process to update the core models.
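The two validation gates above can be expressed as simple consensus checks. The vote sources here are placeholders for the separately trained models and the blind human validators:

```python
def model_consensus(verdicts):
    """A pair clears automated review only on full agreement
    among at least two independently trained models."""
    return len(verdicts) >= 2 and all(verdicts)

def human_cleared(approvals):
    """A pair enters the corpus after at least two blind human approvals."""
    return sum(approvals) >= 2

# Illustrative review queue: p2 lacks model consensus, so it is
# re-evaluated rather than passed to the human validators.
queue = [
    {"id": "p1", "model_votes": [True, True], "human_votes": [True, True]},
    {"id": "p2", "model_votes": [True, False], "human_votes": []},
]
needs_rework = [p["id"] for p in queue if not model_consensus(p["model_votes"])]
accepted = [p["id"] for p in queue
            if model_consensus(p["model_votes"]) and human_cleared(p["human_votes"])]
```

Requiring unanimity from the models before human sampling keeps expert time focused on pairs the automated stage already believes are correct.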
Competition and USP
We propose a cost-effective budget that reflects the scope and complexity of the project while ensuring high-quality outcomes. A detailed breakdown can be provided upon request.
Our proposal aligns with the objectives of the SingularityNET Foundation by addressing the need for a robust, scalable, and reproducible corpus for training MeTTa coding assistants. The deliverables will not only accelerate AGI development but also empower the community to adopt and innovate with the MeTTa language.
Our team consists of experts with extensive experience in:
- Developing and fine-tuning LLMs for programming languages.
- Re-building logic from underrepresented programming languages to popular ones.
- Active contributions to AI and AGI research initiatives, including participation in the MeTTa study group and Hyperon framework development.
Our unique value is that the model which extracts and generates the pairs can also be used to validate the resulting MeTTa LLM once it is developed.