iliachry
Project Owner: With his wide-ranging expertise in software architecture and the award-winning innovations he has developed, Ilias will provide invaluable insights and project leadership to the team.
This project seeks to develop a robust corpus for the MeTTa programming language to train an LLM capable of translating natural language into accurate, functional MeTTa code, thus serving as an AI-powered MeTTa coding assistant. It will consist of three main milestones: a diverse and validated dataset of instruction-output pairs, open-source scripts that automate the corpus-building processes, and comprehensive documentation. Led by a multidisciplinary team specializing in artificial intelligence, semantic data engineering, and back-end development, this initiative aims to lower adoption barriers, enhance usability, and advance AGI development within the Hyperon framework.
Develop a MeTTa language corpus to enable the training or fine-tuning of an LLM and/or LoRAs aimed at supporting developers by providing a natural language coding assistant for the MeTTa language.
The first milestone of the MeTTa CODE project will be to build a comprehensive, well-structured dataset that pairs natural language instructions with valid MeTTa code. The Corpus Creation milestone is divided into several systematic phases. Data Collection: gathering a wide range of resources to ensure varied representation and comprehensive coverage of what MeTTa has to offer; to capture how these features are put into practice, community-driven examples from GitHub repositories will be included to complement official resources. Data Validation: thorough correctness checking of all collected resources, which will involve finding and correcting mistakes in community-contributed material, updating code that is no longer current, and verifying that all materials conform to official MeTTa standards. Data Generation: filling the gaps in existing resources by writing original scripts that demonstrate advanced or underrepresented features, thereby expanding the scope of the corpus. The final stages of this milestone involve structuring the dataset for usability and performing extensive validation and quality assurance. The goal is to ensure that all MeTTa code within the corpus is correct, functional, and logically segmented into sections. These steps will produce a reliable dataset suited to training intelligent coding assistants and deepening understanding of the MeTTa programming language.
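As a concrete illustration of the structuring step, each instruction-output pair could be stored as a small record with a handful of required fields. The layout below is only a sketch; the field names are illustrative assumptions, not the project's actual schema.

```python
# Hypothetical record layout for one instruction-output pair; field names
# here are illustrative, not the project's adopted schema.
EXAMPLE_PAIR = {
    "instruction": "Define a function that doubles a number.",
    "output": "(= (double $x) (* $x 2))",
    "section": "arithmetic",   # logical section for browsing
    "source": "synthesized",   # documentation | community | synthesized
}

REQUIRED_FIELDS = ("instruction", "output", "section", "source")

def validate_pair(pair: dict) -> bool:
    """Return True if every required field is present and non-empty."""
    return all(
        isinstance(pair.get(field), str) and pair[field].strip()
        for field in REQUIRED_FIELDS
    )
```

A structural check like this runs cheaply over every record before the more expensive step of actually executing the MeTTa code.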
The deliverable for this milestone will be a robust, structured, and validated corpus for the MeTTa programming language. The corpus will comprise a comprehensive collection of instruction-output pairs in which natural language instructions are meticulously paired with valid, error-free MeTTa code. This dataset is designed to cover all aspects of MeTTa, from simple constructs to advanced features, edge cases, and practical applications. All instruction-output pairs will be divided into logical sections so that users and AI models can efficiently find and use the corpus for specific purposes. The dataset will include a wide variety of examples sourced from official documentation, community-contributed code, and newly synthesized scripts, making it as versatile and inclusive of real-world use cases as possible. Extensive testing protocols will ensure syntactic correctness and functional accuracy for all MeTTa scripts, and mistakes and inconsistencies present in the source material will be eliminated to maximize the quality of the corpus. This deliverable will serve as a cornerstone for AI coding assistants that generate error-free MeTTa code from natural language input. Beyond its use in training AI models, the corpus will also serve as part of the programmer's toolkit, encouraging wider exposure to and more advanced uses of MeTTa.
$15,000 USD
A rich collection of at least 10,000 natural language instructions each paired with valid, working MeTTa code. When someone unfamiliar with MeTTa picks a random pair and tries it out, they find that the code runs smoothly and matches what they expected from the instruction. The entire set feels organized and easy to browse—experienced users can quickly find examples covering basic to advanced features, while newcomers can effortlessly learn from simpler examples first. The data has passed a thorough “sanity check”: few to no errors remain, no outdated code is lurking, and every snippet has been double-checked for correctness.
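The random spot-check described above is easy to automate. The snippet below is a minimal sketch of such a QA aid under our own assumptions, not the project's actual test protocol.

```python
import random

def spot_check(pairs, k=5, seed=0):
    """Draw k random instruction-output pairs for manual review.

    A fixed seed makes the sample reproducible, so two reviewers
    looking at the same dataset version see the same pairs.
    """
    rng = random.Random(seed)
    return rng.sample(pairs, min(k, len(pairs)))
```

Each sampled pair would then be run through a MeTTa interpreter and compared against the behavior its instruction describes.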
This milestone focuses on transforming the tools and scripts developed during the first milestone into open-source resources, ensuring transparency, reproducibility, and accessibility. It will emphasize refining and optimizing the data extraction scripts (e.g. web scraping with Selenium and Beautiful Soup), validation tools (e.g. grammar parsing with Lark and ANTLR), and data synthesis utilities (e.g. transformer-based models for generating instruction-output pairs). Additionally, it will involve creating an evaluation framework to measure corpus completeness and accuracy using Python libraries such as NumPy and Matplotlib. All tools will be version-controlled with Git, hosted on GitHub, and provided with detailed documentation, in keeping with open-source best practices. This milestone will ensure that internal workflows evolve into polished resources accessible to the community, facilitating ongoing collaboration and improvement.
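To give a flavor of the validation step: the real tools would parse snippets against a full MeTTa grammar with Lark or ANTLR, but even a stdlib-only balanced-parenthesis check, used here as a deliberately simplified stand-in, catches a large class of malformed code before a full parse is attempted.

```python
def balanced_sexpr(code: str) -> bool:
    """Check that parentheses balance in an S-expression-style snippet.

    A toy pre-check, not a full MeTTa parser: it ignores strings and
    comments, and accepts anything whose parentheses merely match.
    """
    depth = 0
    for ch in code:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:   # a ")" appeared before its "("
                return False
    return depth == 0       # every "(" must have been closed
```

A grammar-based pass would then replace this pre-check with rules for atoms, variables, and expressions, rejecting code that balances but is not well-formed MeTTa.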
The deliverable for this milestone will be a fully open-source suite of tools and scripts designed to automate the creation, validation, and extension of the MeTTa corpus. This will include ready-to-use data extraction utilities for collecting instructional content and MeTTa code from diverse online sources, robust validation tools for syntax and functional correctness checks, and data synthesis tools that generate diverse instruction-output pairs using transformer-based models. Additionally, it will feature an evaluation framework to quantitatively assess the corpus's completeness and accuracy. These tools will be uploaded to GitHub with version control, user-friendly interfaces, and extensive documentation, ensuring they are easy to adopt, modify, and extend.
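One way the evaluation framework could report completeness is as a simple coverage ratio over a list of target feature sections. The sketch below assumes a hand-maintained placeholder list; the names are not an official MeTTa feature taxonomy.

```python
from collections import Counter

# Placeholder feature list; the real framework would enumerate
# actual MeTTa language features.
TARGET_SECTIONS = ["basics", "recursion", "pattern-matching", "nondeterminism"]

def section_coverage(pairs, min_examples=1):
    """Fraction of target sections with at least min_examples pairs."""
    counts = Counter(p["section"] for p in pairs)
    covered = sum(1 for s in TARGET_SECTIONS if counts[s] >= min_examples)
    return covered / len(TARGET_SECTIONS)
```

Raising `min_examples` turns the same metric into a depth check, flagging sections that exist but are thinly represented.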
$10,000 USD
The tools and scripts that help create, validate, and extend the corpus are all available on GitHub, and it’s straightforward for a first-time visitor to figure out how to run them. The code feels tidy, well-structured, and accompanied by helpful comments, making it friendly even to a curious newcomer who wants to understand or improve it. Users can run the scripts without complex setups; they can reliably extract new data, validate code, or generate new instruction-output pairs just by following the given instructions. Everything is version-controlled, so if someone wants to roll back to a previous version or suggest improvements, they can do so without hitting roadblocks.
This last milestone will cover the development of thorough documentation that makes the project transparent, reproducible, and usable for the widest possible audience. The documentation will describe in detail the methodology followed during the corpus creation cycle, from data collection and validation to generation and organization. It will contain explicit instructions and examples that help users, regardless of their technical background, use the open-source tools. It will outline how the corpus addresses all of MeTTa's main features, with clear examples demonstrating that depth and breadth. It will also discuss potential limitations of the corpus and offer advice on how the dataset can be refreshed as the language's standards shift over time. Finally, it will include tutorials showing end users how the corpus can be used to train or fine-tune an AI model.
The deliverable for this milestone will be a complete set of documentation showing, step by step, how to use the MeTTa corpus and tools. This will include a detailed description of the corpus creation process, examples of how the corpus addresses all MeTTa features, and instructions for using the open-source tools. The documentation will also include an evaluation of any limitations, with suggestions for future updates to keep the project responsive to changing requirements. It will also contain tutorials on how to train or fine-tune AI models using the corpus, making the tools and corpus user-friendly for both novice and advanced users.
$10,000 USD
The documentation reads like a helpful guide rather than a dry manual. It gently leads a beginner from “What is MeTTa?” to “How do I use these tools to create my own corpus?” Every step in the data collection, validation, and generation process is explained clearly, with examples that make sense. Users don’t have to guess what a certain step means—they’ll know exactly what to do. Common questions are answered right there in the documentation, reducing confusion and frustration. For example, if someone asks, “How do I add my own snippets?” or “What if I want to use the corpus to train my own model?” the answer is easy to find. After reading through the docs, both a total newbie and a seasoned developer feel confident enough to start using the corpus and the tools on their own.
© 2025 Deep Funding