MeTTa CODE (MeTTa COrpus DEvelopment)

Project Owner: iliachry
Expert Rating: 4.0

Overview

This project seeks to develop a robust corpus for the MeTTa programming language to train an LLM capable of translating natural language into accurate, functional MeTTa code, thus serving as an AI-powered MeTTa coding assistant. It will consist of three main milestones: a diverse and validated dataset of instruction-output pairs, open-source scripts that automate the corpus creation process, and comprehensive documentation. Led by a multidisciplinary team specialized in artificial intelligence, semantic data engineering, and back-end development, this initiative aims to lower adoption barriers, enhance usability, and advance AGI development within the Hyperon framework.

RFP Guidelines

Create corpus for NL-to-MeTTa LLM

Complete & Awarded
  • Type: SingularityNET RFP
  • Total RFP Funding: $70,000 USD
  • Proposals: 10
  • Awarded Projects: 1
SingularityNET
Aug. 13, 2024

Develop a MeTTa language corpus to enable the training or fine-tuning of an LLM and/or LoRAs aimed at supporting developers by providing a natural language coding assistant for the MeTTa language.

Proposal Description

Company Name (if applicable)

Metatopia

Project details

The MeTTa language is a multi-paradigm tool for declarative and functional computations over knowledge metagraphs, crucial in furthering the development of Artificial General Intelligence within the Hyperon framework. However, its innovative nature and unique syntax make it hard for new users to master. To address this challenge, this proposal focuses on creating a comprehensive MeTTa corpus: a high-quality dataset designed to train or fine-tune an AI model that will work as a coding assistant. Such a coding assistant would be capable of generating correct, functional MeTTa code from a developer's natural language input, simplifying the way the language is learned and used.

Beyond the corpus, this project will also provide open-source tools to automate the corpus creation process and extensive documentation to ensure transparency, reproducibility, and accessibility. Together, these efforts will lower barriers to MeTTa adoption, enhance usability, and accelerate AGI development within the Hyperon ecosystem. The resulting resources will not only serve immediate needs but also provide a sound foundation for further advances in the field.

Therefore, the project will be structured into three main milestones. The first milestone is the creation of the MeTTa corpus itself: a high-quality dataset of 10,000 instruction-output pairs, where instructions are expressed in natural language and their outputs are valid, error-free MeTTa code. This process will begin with data collection, leveraging resources such as the official MeTTa documentation, GitHub repositories, community-contributed code, and tutorials. To extract relevant data systematically, we will employ natural language processing (NLP) techniques, using libraries like SpaCy and NLTK to analyze and parse instructional text, and web scraping frameworks like Scrapy for gathering examples from online sources. Data validation will follow, ensuring that the dataset is free of errors, inconsistencies, and deprecated code. It will be performed by running syntax-checking algorithms based on the grammar rules of MeTTa, together with testing frameworks that verify the functional correctness of each code snippet. The checks will be implemented using technologies like PyTest and custom-written validators in Python.
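As a sketch of what such a functional-correctness check could look like, the PyTest-style test below executes a candidate pair with the `hyperon` Python bindings and compares the result against the expected output. The sample pair, its field names, and the expected-value format are illustrative assumptions, not entries from the actual corpus or its final schema.

```python
# Minimal sketch of a functional-correctness check for one corpus entry,
# assuming the `hyperon` Python bindings (pip install hyperon).
# Run with: pytest this_file.py
from hyperon import MeTTa

# Hypothetical instruction-output pair (illustrative only).
SAMPLE_PAIR = {
    "instruction": "Define a function that doubles a number and apply it to 3.",
    "code": "(= (double $x) (* $x 2)) !(double 3)",
    "expected": "6",
}

def test_pair_runs_and_matches():
    metta = MeTTa()  # a fresh interpreter per snippet keeps tests isolated
    results = metta.run(SAMPLE_PAIR["code"])
    # metta.run returns one list of result atoms per `!` expression
    assert str(results[0][0]) == SAMPLE_PAIR["expected"]
```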

In addition, when existing materials need supplementation to fill gaps, we will generate examples using generative approaches. Specifically, we will use fine-tuned language models to draft initial instruction-output pairs, refining the output to conform to MeTTa standards through rule-based post-processing steps. Combining data-driven generation with manual oversight will result in a highly diverse and representative dataset. Once the data is collected and validated, it will be categorized into sections such as basic constructs, advanced features, and practical applications. We will use technologies like Pandas for structuring the data and the JSON format for storing the output to ensure usability. Finally, the entire corpus will undergo rigorous testing through automated tools that confirm syntactic and functional accuracy, making it a reliable resource for training AI models.
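To illustrate the structuring step, here is a hedged sketch of how validated pairs could be organized with Pandas and written out as JSON Lines; the field names and category labels are assumptions for illustration, not the project's final schema.

```python
# Sketch of structuring validated pairs with Pandas and storing them as JSON.
import json
import pandas as pd

pairs = [
    {"instruction": "Add two numbers.",
     "output": "!(+ 1 2)",
     "category": "basic constructs"},
    {"instruction": "Define and query a parent relation.",
     "output": "(Parent Tom Bob) !(match &self (Parent $x Bob) $x)",
     "category": "practical applications"},
]

df = pd.DataFrame(pairs)
print(df.groupby("category").size())  # quick check of section balance

# One JSON object per line (JSON Lines) keeps the corpus easy to stream
# into fine-tuning pipelines.
with open("metta_corpus.jsonl", "w") as f:
    for record in df.to_dict(orient="records"):
        f.write(json.dumps(record) + "\n")
```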

The second milestone entails the release of open-source tools and scripts that automate the processes involved in creating, validating, and extending the corpus. This will include data extraction scripts that systematically source code snippets and instructions from various resources, for which we will use web scraping utilities like Selenium and Beautiful Soup, along with APIs where available. The validation tools will involve custom static code analysis algorithms to find syntactic errors, parsing and validating MeTTa-specific grammar using frameworks such as Lark and ANTLR. Functional validation will involve testing MeTTa code snippets in an isolated environment to verify their expected output.
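As an illustration of the grammar-based syntax check, the sketch below uses Lark with a deliberately simplified grammar that only enforces balanced S-expressions and `!`-prefixed evaluations; the real validators would encode the full MeTTa grammar rather than this reduced form.

```python
# Simplified grammar-based syntax check using Lark (pip install lark).
# The grammar below is a rough approximation, not the full MeTTa grammar.
from lark import Lark, LarkError

SEXPR_GRAMMAR = r"""
    start: item+
    item: "!" expr | expr
    expr: ATOM | "(" item* ")"
    ATOM: /[^\s()!;]+/
    COMMENT: /;[^\n]*/
    %import common.WS
    %ignore WS
    %ignore COMMENT
"""

parser = Lark(SEXPR_GRAMMAR)

def syntax_ok(snippet: str) -> bool:
    """Return True if the snippet parses under the simplified grammar."""
    try:
        parser.parse(snippet)
        return True
    except LarkError:
        return False

assert syntax_ok("(= (double $x) (* $x 2)) !(double 3)")
assert not syntax_ok("(= (double $x) (* $x 2)")  # unbalanced parenthesis
```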

To supplement existing examples, users of the open-source software will be able to use data synthesis tools leveraging transformer-based language models. Through prompt engineering, these tools will generate instruction-output pairs that address advanced use cases and fill gaps in underrepresented features. These will be refined and validated with the help of rule-based algorithms. An evaluation framework will also be developed, with metrics such as corpus completeness (coverage of key language features) and accuracy rates; it will be implemented in Python, with NumPy and Matplotlib used for quantitative analysis and reporting. All tools and scripts will be version-controlled with Git and hosted on GitHub to ensure openness and enable collaboration. This milestone ensures that the process of corpus creation is not only efficient but also reproducible and scalable for future updates.
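A minimal sketch of the kind of coverage metric such an evaluation framework might report is shown below; the feature categories, counts, and minimum-examples threshold are placeholder assumptions, not the project's final metric definitions.

```python
# Sketch of a corpus-completeness metric: per-feature coverage plus a
# simple completeness score. Numbers below are placeholders.
import numpy as np

FEATURES = ["basic constructs", "pattern matching", "recursion",
            "nondeterminism", "grounded atoms", "practical applications"]

# Hypothetical counts of validated pairs per category.
counts = {"basic constructs": 3200, "pattern matching": 2100,
          "recursion": 1500, "nondeterminism": 900,
          "grounded atoms": 1100, "practical applications": 1200}

coverage = np.array([counts.get(f, 0) for f in FEATURES])
total = coverage.sum()

print(f"total pairs: {total}")
for feature, n in zip(FEATURES, coverage):
    print(f"{feature:24s} {n:6d}  ({100 * n / total:5.1f}%)")

# Completeness: fraction of features meeting a minimum-examples threshold.
MIN_PER_FEATURE = 500
completeness = float((coverage >= MIN_PER_FEATURE).mean())
print(f"completeness (>= {MIN_PER_FEATURE} pairs/feature): {completeness:.2f}")
```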

The third milestone will focus on extensive documentation, ensuring the project is transparent, easily replicable, and usable by the widest possible audience. The documentation will describe the methodology employed throughout the corpus creation cycle, from data collection through validation and generation to organization. It will include instructions and example usages of the open-sourced tools, designed to be accessible to individuals of varying technical expertise. It will also detail, through examples, how the corpus addresses all MeTTa features, making the breadth and depth of coverage clear. Any limitations will be acknowledged, and recommendations for future updates needed to keep the corpus current with evolving language standards will also be discussed. Tutorials and practical guides on training or fine-tuning AI models using this corpus will also be provided.

To sum up, this project will have three main deliverables, each designed to meet the immediate needs of MeTTa developers and to foster long-term AGI development within the Hyperon framework:

1. Validated and Structured MeTTa Corpus:

A dataset of 10,000 instruction-output pairs, ensuring that the MeTTa code generated is accurate and error-free. The corpus will span the complete range of MeTTa features, categorized into sections like basic constructs, advanced functionalities, and practical applications.

2. Free and Open-Source Tools and Scripts:

Automation of data extraction, validation, and corpus extension. The following will be provided:

- Data extraction scripts

- Validation: Static code analysis and functional testing

- Data synthesis: Synthesis of instruction-output pairs by transformer-based models

All these tools will be publicly available and thus can be replicated or extended.

3. Comprehensive Documentation

Detailed documentation that covers all steps involved in the creation of the corpus, from data collection and validation to usage and generation. In particular, the following will be included:

- Step-by-step tutorials for new and experienced users alike

- Explanations of how the corpus addresses all MeTTa features

- Recommendations for future updates to maintain alignment with evolving standards

Moreover, to ensure sufficient material for the MeTTa corpus, we estimated the lines of MeTTa code available across GitHub repositories at 2,058,710 lines, excluding comments. After considering factors like relevance and usability, we project that 10-20% of this, approximately 205,871 to 411,742 lines, will be directly usable for natural language-to-MeTTa instruction-output pairs. With an average of 10 lines per pair, this would provide 20,587 to 41,174 pairs, well over the 10,000 needed for a high-quality dataset. This estimate confirms that the repositories provide a good basis for corpus creation while leaving room for refinement. The code for the estimation can be found in the references field.
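For transparency, the estimate works out as follows; this is a quick reproduction of the stated arithmetic, where the 10-20% usability rates and the 10-lines-per-pair figure are the proposal's own assumptions.

```python
# Reproduction of the corpus-size estimate above.
total_lines = 2_058_710          # MeTTa lines found on GitHub, excl. comments
usable_low, usable_high = 0.10, 0.20   # assumed usability rates
lines_per_pair = 10                    # assumed average pair length

low = int(total_lines * usable_low)    # 205_871 usable lines
high = int(total_lines * usable_high)  # 411_742 usable lines
print(low // lines_per_pair, high // lines_per_pair)  # 20587 41174 pairs
```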

This project represents a crucial step in lowering the barriers to adoption of the MeTTa programming language. It will deliver a robust, validated corpus, automated tools, and comprehensive documentation, creating a scalable framework that empowers developers and researchers alike. In turn, these resources will accelerate the creation of an AI-powered coding assistant that simplifies the generation of correct and functional MeTTa code while broadening the language's accessibility. In the wider perspective, this project will contribute to the long-term development of Artificial General Intelligence within an enhanced Hyperon ecosystem. The tools and methods developed here will provide a foundation for future innovation, enabling the sharing and collaboration needed for continuous improvement. At its core, the project's open-source principle ensures that progress made today forms the basis for breakthroughs in AGI research tomorrow.

Our multidisciplinary team combines expertise in artificial intelligence, semantic data engineering, and software development, making it perfectly suited to create the MeTTa corpus and associated tools. Ilias Chrysovergis, Co-founder, CEO, and CTO of Metatopia, has led high-impact projects that bridge complex technical challenges with user-focused solutions; most recently, he secured funding from the Greek Government (specifically from GRNET) to develop a specialized large language model for the Revit C# API. Anneza Bardani is a semantic data engineer specializing in ontology design and data modeling. Dimitris Kleitsas specializes in machine learning and NLP and has fine-tuned language models for automated content review. Iason Malkotsis's experience in backend architecture will ensure efficient and reliable infrastructure for the tools and resources developed in the MeTTa ecosystem. Finally, Dimitra Pazouli has worked on LLM-based Level Generation for Games, demonstrating her ability to apply AI to creative and technical challenges.

Open Source Licensing

MIT - Massachusetts Institute of Technology License

Proposal Video

Not Available Yet

Check back later during the Feedback & Selection period for the RFP this proposal is applied to.

  • Total Milestones: 3
  • Total Budget: $35,000 USD
  • Last Updated: 7 Dec 2024

Milestone 1 - Corpus Creation

Description

The first milestone of the MeTTa CODE project will be to build a comprehensive and well-structured dataset that pairs natural language instructions with valid MeTTa code. The Corpus Creation milestone is divided into several systematic phases:

- Data Collection: gathering a wide range of resources for varied representation and comprehensive coverage of what MeTTa has to offer. To capture how these features are put into practice, community-driven examples from GitHub repositories will be included to complement these resources.

- Data Validation: thoroughly checking the correctness of all collected resources. This will involve finding and correcting mistakes in community-contributed resources, updating code that is no longer state-of-the-art, and checking that all materials conform to official MeTTa standards.

- Data Generation: filling the gaps in existing resources. This task covers writing original scripts demonstrating advanced or underrepresented features and expanding the scope of the corpus.

The final stages of this milestone involve structuring the dataset to be usable and performing extensive validation and quality assurance. The goal is to ensure that all MeTTa code within the corpus is correct, functional, and logically segmented into sections. These steps produce a reliable dataset created for training intelligent coding assistants and deepening the understanding of the MeTTa programming language.

Deliverables

The deliverable for this milestone will be a robust, structured, and validated corpus for the MeTTa programming language. The corpus will comprise a comprehensive collection of instruction-output pairs, where natural language instructions are meticulously paired with valid, error-free MeTTa code. This dataset is designed to cover all aspects of MeTTa, from simple constructs to advanced features, edge cases, and practical applications. All instruction-output pairs will be divided into logical sections so that users and AI models can efficiently find and use the corpus for specific purposes. The dataset will include a wide variety of examples sourced from official documentation, community-contributed code, and newly synthesized scripts to make it as versatile and inclusive of real-world use cases as possible. Reliability will also be ensured: extensive testing protocols will verify syntactic correctness and functional accuracy for all MeTTa scripts, and mistakes and inconsistencies present in source material will be eliminated to maximize the quality of the corpus. This deliverable will serve as a cornerstone for AI coding assistants that produce error-free MeTTa code from natural language inputs. Aside from its use in training AI models, the corpus will also function well as part of the programmer's toolkit, encouraging wider exposure to and higher-level uses of MeTTa.

Budget

$15,000 USD

Success Criterion

A rich collection of at least 10,000 natural language instructions each paired with valid, working MeTTa code. When someone unfamiliar with MeTTa picks a random pair and tries it out, they find that the code runs smoothly and matches what they expected from the instruction. The entire set feels organized and easy to browse—experienced users can quickly find examples covering basic to advanced features, while newcomers can effortlessly learn from simpler examples first. The data has passed a thorough “sanity check”: few to no errors remain, no outdated code is lurking, and every snippet has been double-checked for correctness.

Milestone 2 - Open-Source Software Code

Description

This milestone focuses on transforming the tools and scripts developed during the first milestone into open-source resources to ensure transparency, reproducibility, and accessibility. The milestone will emphasize refining and optimizing data extraction scripts (e.g. web scraping using Selenium and Beautiful Soup), validation tools (e.g. grammar parsing with Lark and ANTLR), and data synthesis utilities (e.g. transformer-based models for generating instruction-output pairs). Additionally, it will involve creating an evaluation framework to measure corpus completeness and accuracy using Python libraries like NumPy and Matplotlib. All tools will be version-controlled with Git, hosted on GitHub, and provided with detailed documentation to adhere to open-source best practices. This milestone will ensure that internal workflows evolve into polished resources accessible to the community, facilitating ongoing collaboration and improvement.

Deliverables

The deliverable for this milestone will be a fully open-source suite of tools and scripts designed to automate the creation, validation, and extension of the MeTTa corpus. This will include ready-to-use data extraction utilities for collecting instructional content and MeTTa code from diverse online sources, robust validation tools for syntax and functional correctness checks, and data synthesis tools that generate diverse instruction-output pairs using transformer-based models. Additionally, it will feature an evaluation framework to quantitatively assess the corpus's completeness and accuracy. These tools will be uploaded to GitHub with version control, user-friendly interfaces, and extensive documentation, ensuring they are easy to adopt, modify, and expand upon.

Budget

$10,000 USD

Success Criterion

The tools and scripts that help create, validate, and extend the corpus are all available on GitHub, and it’s straightforward for a first-time visitor to figure out how to run them. The code feels tidy, well-structured, and accompanied by helpful comments, making it friendly even to a curious newcomer who wants to understand or improve it. Users can run the scripts without complex setups; they can reliably extract new data, validate code, or generate new instruction-output pairs just by following the given instructions. Everything is version-controlled, so if someone wants to roll back to a previous version or suggest improvements, they can do so without hitting roadblocks.

Milestone 3 - Documentation

Description

This last milestone will cover the development of thorough documentation that makes the project transparent, reproducible, and usable by the widest possible audience. The documentation will describe in detail the methodology followed during the corpus creation cycle, from data collection and validation to generation and organization. It will contain explicit instructions and examples to assist users, regardless of their technical background, in using the open-source tools. It will outline how the corpus addresses all of MeTTa's main features, with clear examples to show the depth and breadth of coverage. It will also discuss potential drawbacks of the corpus and include advice on how the dataset can be refreshed as the language's standards shift over time. Finally, it will provide tutorials showing end users how to train or fine-tune an AI model with the corpus.

Deliverables

The deliverable for this milestone will be a complete set of documentation that clearly shows, step by step, how to use the MeTTa corpus and tools. This will include a detailed description of the corpus creation process, examples of how the corpus addresses all MeTTa features, and instructions for using the open-source tools. The documentation will also include an evaluation of any limitations, with suggestions for future updates to keep the project open to changes in requirements. It will also contain tutorials on how to train or fine-tune AI models using the corpus, making the tools and corpus user-friendly for both novice and advanced users.

Budget

$10,000 USD

Success Criterion

The documentation reads like a helpful guide rather than a dry manual. It gently leads a beginner from “What is MeTTa?” to “How do I use these tools to create my own corpus?” Every step in the data collection, validation, and generation process is explained clearly, with examples that make sense. Users don’t have to guess what a certain step means—they’ll know exactly what to do. Common questions are answered right there in the documentation, reducing confusion and frustration. For example, if someone asks, “How do I add my own snippets?” or “What if I want to use the corpus to train my own model?” the answer is easy to find. After reading through the docs, both a total newbie and a seasoned developer feel confident enough to start using the corpus and the tools on their own.


Expert Ratings

Reviews & Ratings

Group Expert Rating (Final)

Overall

4.0

  • Feasibility 4.7
  • Desirability 4.0
  • Usefulness 3.7
  • Expert Review 1

    Overall

    4.0

    • Compliance with RFP requirements 5.0
    • Solution details and team expertise 4.0
    • Value for money 3.0
    Overall promising.

    Good focus on deliverables. Concerns around synthetic data quality and team expertise in this area. Overall promising.

  • Expert Review 2

    Overall

    3.0

    • Compliance with RFP requirements 4.0
    • Solution details and team expertise 3.0
    • Value for money 3.0
    A reasonable proposal but sketchy where one would like to see details

    It's proposed to curate existing examples and then fine-tune an LLM to make more examples, but how to make the fine-tuning work when it doesn't out of the box (I've tried based on the existing corpus of metta code already) is alluded to only overly sketchily ...

  • Expert Review 3

    Overall

    5.0

    • Compliance with RFP requirements 5.0
    • Solution details and team expertise 5.0
    • Value for money 5.0

A comprehensive approach using ML/NLP tools, with a good-sized team of 5.
