Building a MeTTa Corpus for NL-to-Code LLMs

Expert Rating 3.3
Anthony Oliko
Project Owner

Overview

This proposal aims to create a high-quality MeTTa corpus for training or fine-tuning a natural-language-to-MeTTa large language model (LLM). The corpus will feature up to 10,000 diverse instruction-code pairs, serving as a foundation for building an AI-powered coding assistant. Along with the corpus, we’ll deliver all necessary scripts and clear documentation to ensure transparency and reproducibility. This project is set to make MeTTa more accessible, simplify its learning curve, and contribute to AGI research within the Hyperon framework. Our team is made up of passionate members from the SingularityNET community, all deeply interested in the OpenCog Hyperon framework and its potential.

RFP Guidelines

Create corpus for NL-to-MeTTa LLM

Complete & Awarded
  • Type SingularityNET RFP
  • Total RFP Funding $70,000 USD
  • Proposals 10
  • Awarded Projects 1
SingularityNET
Aug. 13, 2024

Develop a MeTTa language corpus to enable the training or fine-tuning of an LLM and/or LoRAs aimed at supporting developers by providing a natural language coding assistant for the MeTTa language.

Proposal Description

Company Name (if applicable)

Trenches AI

Project details

This project aims to create a high-quality MeTTa corpus that can be used to train or fine-tune a natural-language-to-MeTTa large language model (LLM). The ultimate goal is to develop an AI-powered coding assistant that helps users generate accurate and functional MeTTa code with ease.

MeTTa is a unique, multi-paradigm language specifically designed for declarative and functional computations over knowledge metagraphs. Its innovative approach is tailored for building Artificial General Intelligence (AGI) applications. However, because MeTTa is still new and complex, it can present a steep learning curve for developers, especially those just starting out. This project aims to bridge that gap by making it easier for people to learn and use MeTTa through the help of an intelligent, AI-driven coding assistant.

Objectives:

  1. Build a Comprehensive Corpus:

    • We’ll collect and convert existing MeTTa resources, including documentation, tutorials, and community contributions, into a standardized format.
    • Where gaps exist, we’ll create new examples to ensure we cover all key features and use cases of MeTTa.
    • The final corpus will contain up to 10,000 diverse instruction-output pairs to provide comprehensive and practical examples for the assistant.
  2. Ensure Quality and Usability:

    • Every piece of data in the corpus will be carefully validated to ensure it’s accurate, diverse, and follows MeTTa best practices.
    • We’ll also develop scripts and processes that others can use to replicate or expand the corpus in the future.
  3. Support Open Collaboration:

    • Detailed, easy-to-follow documentation will accompany the corpus to ensure anyone can understand how it was created and how it can be used.
    • All code, data, and resources will be shared as open-source contributions, empowering the broader community to build on this work.
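
As an illustration of what a "standardized format" for instruction-output pairs could look like, the following is a minimal sketch that stores each pair as one JSON object per line (JSON Lines). The proposal does not specify a schema, so the class name `CorpusPair` and the fields `instruction`, `output`, `source`, and `feature` are assumptions made for this example only.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class CorpusPair:
    """One instruction-output pair. All field names are illustrative assumptions."""
    instruction: str  # natural-language request
    output: str       # MeTTa code answering the request
    source: str       # origin of the pair, e.g. "docs" or "synthetic"
    feature: str      # MeTTa feature the pair is meant to exercise


def to_jsonl(pairs):
    """Serialize pairs as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(p), ensure_ascii=False) for p in pairs)


def from_jsonl(text):
    """Parse JSON Lines text back into CorpusPair objects, skipping blank lines."""
    return [CorpusPair(**json.loads(line)) for line in text.splitlines() if line.strip()]


# A hypothetical pair; the MeTTa snippet is stored as plain text.
example = CorpusPair(
    instruction="Define a function that doubles a number and call it on 21.",
    output="(= (double $x) (* $x 2))\n!(double 21)",
    source="synthetic",
    feature="function-definition",
)
serialized = to_jsonl([example])
restored = from_jsonl(serialized)
```

Because `json.dumps` escapes the newline inside the MeTTa snippet, each pair stays on a single line, which keeps the file easy to stream, diff, and split into training and evaluation sets.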

Impact:

This project is about more than just data—it’s about making MeTTa accessible to everyone. By creating an AI coding assistant powered by this corpus, we’re helping developers spend less time struggling with syntax and more time solving meaningful problems. This work will lower the barrier to entry for MeTTa, encouraging adoption and innovation within the Hyperon AGI framework.

Ultimately, this project supports the larger vision of SingularityNET, the OpenCog Foundation, and TrueAGI: to advance AGI research through decentralized, collaborative tools that empower individuals and teams alike.

Open Source Licensing

Apache License

This project will be released under the Apache License 2.0, a permissive open-source license that allows anyone to use, modify, and distribute the code and data, provided that appropriate credit is given to the original authors.

Components Outside This License

At this stage, there are no planned components or resources in the project that fall outside the scope of the Apache License 2.0. However, the project will rely on external tools and libraries (e.g., Python packages, data processing utilities) that may be governed by their respective licenses. In such cases:

  • All dependencies and third-party resources will be clearly documented, along with their licenses.
  • Care will be taken to ensure compatibility between the Apache License 2.0 and any third-party licenses.

Proposal Video

Not Available Yet

Check back later during the Feedback & Selection period for the RFP that this proposal is applied to.

  • Total Milestones

    4

  • Total Budget

    $30,000 USD

  • Last Updated

    8 Dec 2024

Milestone 1 - Project Kickoff and Resource Collection

Description

The project begins with gathering and reviewing all available MeTTa resources, including official documentation, community contributions, GitHub repositories, and tutorials. This phase will also involve evaluating these resources for their quality, relevance, and coverage of MeTTa features. Additionally, a framework for data collection and formatting will be defined, and validation criteria will be established. This phase ensures a structured foundation for building the corpus.

Time: Month 1

Deliverables

  1. A comprehensive list of MeTTa resources, categorized by type and relevance.
  2. A detailed plan outlining the methods for data extraction, formatting, and validation.
  3. Initial drafts of scripts and tools for automated resource extraction, where applicable.
  4. Weekly progress updates and a milestone completion report.

Budget

$5,000 USD

Success Criterion

  1. Resource Compilation: A complete and well-organized list of MeTTa resources is compiled, categorized by type (e.g., documentation, tutorials, repositories) and assessed for relevance and coverage of key MeTTa features.
  2. Framework Definition: A clear and actionable plan is developed, outlining the methods for data extraction, formatting, and validation, including measurable validation criteria.
  3. Tool Development: Initial versions of scripts and tools for automated resource extraction are created, tested for functionality, and meet the requirements for scalability and accuracy.
  4. Progress Tracking: Weekly updates are provided, detailing completed tasks, challenges encountered, and planned actions, ensuring transparency and accountability throughout the milestone.
  5. Milestone Report: A comprehensive milestone completion report is delivered, summarizing achievements, insights, and any modifications to the project plan, demonstrating readiness to proceed to the next phase.
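
One plausible starting point for the automated resource extraction mentioned in this milestone is pulling fenced code blocks out of Markdown documentation. The sketch below is not the proposer's tooling; it assumes that source documents mark MeTTa snippets with a fence tagged `metta`, which may not hold for every resource. The function name `extract_code_blocks` is hypothetical.

```python
import re

# The fence marker is built indirectly so this snippet can itself sit inside a fence.
FENCE = "`" * 3

# Matches an opening fence with an optional language tag, the block body,
# and the closing fence. DOTALL lets the body span multiple lines.
BLOCK_RE = re.compile(
    FENCE + r"(?P<lang>[\w+-]*)[ \t]*\n(?P<code>.*?)\n?" + FENCE,
    re.DOTALL,
)


def extract_code_blocks(markdown, lang="metta"):
    """Return the bodies of fenced blocks whose language tag equals `lang`."""
    return [
        match.group("code").strip()
        for match in BLOCK_RE.finditer(markdown)
        if match.group("lang").lower() == lang
    ]


# A small made-up document with one MeTTa block and one Python block.
sample_doc = (
    "Intro text\n"
    + FENCE + "metta\n(= (f $x) $x)\n!(f 1)\n" + FENCE + "\n"
    + "More text\n"
    + FENCE + "python\nprint(1)\n" + FENCE + "\n"
)
metta_blocks = extract_code_blocks(sample_doc)
```

Snippets extracted this way still need a matching natural-language instruction, which could come from the surrounding prose or be written by hand.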

Milestone 2 - Corpus Development and Synthesis

Description

This phase focuses on transforming raw resources into usable data. The extracted material will be formatted into instruction-output pairs, with gaps addressed through the generation of synthetic examples. Coverage will include all key features and functionalities of MeTTa. The emphasis will be on creating diverse, accurate, and comprehensive data.

Time: Month 2

Deliverables

  1. A structured dataset of 5,000 validated instruction-output pairs.
  2. New synthetic examples to fill coverage gaps.
  3. Scripts for data formatting, generation, and validation.
  4. Midpoint evaluation report to ensure quality and alignment with project goals.

Budget

$12,000 USD

Success Criterion

  1. Dataset Creation: A structured dataset of 5,000 validated instruction-output pairs is developed, demonstrating accuracy, diversity, and alignment with MeTTa’s core features and functionalities.
  2. Synthetic Data Generation: New synthetic examples are created to address identified coverage gaps, ensuring comprehensive representation of MeTTa's capabilities.
  3. Tool Availability: Fully functional scripts for data formatting, synthetic generation, and validation are delivered, with thorough testing to confirm usability and reliability.
  4. Quality Assurance: All data passes validation checks based on predefined criteria, ensuring high standards of correctness and relevance.
  5. Midpoint Evaluation: A detailed evaluation report is submitted, assessing progress, highlighting achievements, identifying potential risks, and confirming alignment with project goals.
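
The proposal does not spell out its validation checks. As one example of a cheap structural check that could run before any deeper validation, the sketch below tests whether the parentheses in a MeTTa snippet are balanced, skipping `;` line comments and double-quoted strings. It is a pre-filter only: it does not prove that the code runs or does what the instruction asks. The function names and the dictionary keys `instruction` and `output` are assumptions.

```python
def is_balanced(code):
    """Return True if parentheses balance outside of comments and strings."""
    depth = 0
    in_string = False
    in_comment = False
    escaped = False
    for ch in code:
        if in_comment:
            # A ';' comment runs to the end of the line.
            if ch == "\n":
                in_comment = False
            continue
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == ";":
            in_comment = True
        elif ch == '"':
            in_string = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # closing parenthesis with no matching opener
                return False
    return depth == 0 and not in_string


def validate_pair(pair):
    """Return a list of problems found in one pair; an empty list means it passed."""
    problems = []
    if not pair.get("instruction", "").strip():
        problems.append("empty instruction")
    code = pair.get("output", "")
    if not code.strip():
        problems.append("empty output")
    elif not is_balanced(code):
        problems.append("unbalanced parentheses or unterminated string")
    return problems
```

For instance, `validate_pair({"instruction": "Double a number.", "output": "(= (double $x) (* $x 2)"})` reports the missing closing parenthesis. Pairs that pass could then go on to stricter checks, such as executing the snippet in a MeTTa interpreter.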

Milestone 3 - Corpus Finalization and Quality Assurance

Description

The primary task here is to complete the corpus by expanding it to 10,000 validated pairs. Rigorous quality assurance processes will be implemented to ensure that the corpus meets the defined standards for accuracy, diversity, and usability. Feedback from stakeholders will be incorporated during this phase.

Time: Month 3

Deliverables

  1. A finalized corpus with 10,000 high-quality, validated instruction-output pairs.
  2. Comprehensive quality assurance reports detailing validation processes and outcomes.
  3. Updated scripts/tools for corpus refinement and replication.
  4. Weekly progress updates and a milestone completion report.

Budget

$8,000 USD

Success Criterion

  1. Corpus Completion: A finalized corpus of 10,000 high-quality, validated instruction-output pairs is delivered, meeting predefined standards for accuracy, diversity, and usability.
  2. Quality Assurance: Comprehensive quality assurance processes are conducted, with detailed reports documenting validation criteria, methodologies, and outcomes, ensuring the dataset's reliability and robustness.
  3. Tool Updates: Scripts and tools for corpus refinement and replication are updated and optimized for efficiency and scalability, with clear documentation for future use.
  4. Stakeholder Feedback: Feedback from stakeholders is effectively integrated into the final corpus, addressing any concerns or suggestions to enhance its utility and relevance.
  5. Progress Transparency: Weekly updates are provided, tracking milestones and ensuring alignment with project objectives, culminating in a thorough milestone completion report.
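
Diversity is named as a quality standard but is not defined in the proposal. One simple way to make it measurable, sketched below under assumed field names (`instruction`, `output`, `feature`), is to drop near-verbatim duplicates and then count how many pairs cover each MeTTa feature. This is an illustration, not the team's stated QA process.

```python
from collections import Counter


def normalize_instruction(text):
    """Lowercase and collapse whitespace; instructions are natural language."""
    return " ".join(text.lower().split())


def normalize_code(text):
    """Collapse whitespace only; case is kept because symbol names are case-sensitive."""
    return " ".join(text.split())


def deduplicate(pairs):
    """Keep the first occurrence of each (instruction, output) pair after normalization."""
    seen = set()
    unique = []
    for pair in pairs:
        key = (normalize_instruction(pair["instruction"]), normalize_code(pair["output"]))
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique


def coverage_report(pairs):
    """Count pairs per feature label so under-covered features stand out."""
    return Counter(pair["feature"] for pair in pairs)


# Made-up records: the second differs from the first only in case and spacing.
records = [
    {"instruction": "Double a number.", "output": "(= (double $x) (* $x 2))", "feature": "functions"},
    {"instruction": "double  a number.", "output": "(= (double $x)  (* $x 2))", "feature": "functions"},
    {"instruction": "Add a fact to the space.", "output": "(Parent Tom Bob)", "feature": "facts"},
]
unique_records = deduplicate(records)
report = coverage_report(unique_records)
```

A report like this could feed the quality assurance documents directly, for example by flagging any feature with fewer pairs than an agreed minimum.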

Milestone 4 - Documentation and Delivery

Description

The final phase focuses on preparing and delivering the project outputs. Comprehensive documentation will be created, covering the corpus creation process, validation methods, known limitations, and guidelines for future use. All scripts, tools, and data will be organized, tested, and delivered as open-source resources.

Time: Month 4

Deliverables

  1. Full documentation detailing the project, including data sources, methods, and usage instructions.
  2. All scripts and tools necessary for replicating or extending the corpus creation process.
  3. Finalized corpus shared as an open-source deliverable.
  4. Presentation to stakeholders summarizing project outcomes.
  5. Final project report summarizing milestones, challenges, and future recommendations.

Budget

$5,000 USD

Success Criterion

  1. Comprehensive Documentation: Detailed, user-friendly documentation is delivered, clearly describing data sources, corpus creation methods, validation processes, known limitations, and guidelines for replication or extension.
  2. Script and Tool Delivery: All scripts and tools required for reproducing or extending the corpus creation process are finalized, thoroughly tested, and organized, ensuring they are functional and accessible as open-source resources.
  3. Corpus Accessibility: The finalized corpus is published and made available as an open-source deliverable, meeting all requirements for usability and proper licensing.
  4. Stakeholder Presentation: A well-structured presentation is conducted, effectively summarizing project goals, achievements, challenges, and the practical utility of deliverables.
  5. Final Project Report: A comprehensive final report is submitted, detailing milestones achieved, obstacles encountered, solutions implemented, and recommendations for future work.

Expert Ratings

Reviews & Ratings

Group Expert Rating (Final)

Overall

3.3

  • Compliance with RFP requirements 4.3
  • Solution details and team expertise 3.3
  • Value for money 3.0
  • Expert Review 1

    Overall

    4.0

    • Compliance with RFP requirements 5.0
    • Solution details and team expertise 3.0
    • Value for money 0.0
    Clear proposal

    Clear and structured plan with strong deliverables and adherence to RFP goals. Unclear team background and potentially unrealistic timelines raise concerns.

  • Expert Review 2

    Overall

    3.0

    • Compliance with RFP requirements 4.0
    • Solution details and team expertise 3.0
    • Value for money 0.0
    It's a solid proposal but the tricky part (synthetic generation) is just briefly alluded to...

    Not clear if the team has the expertise to make synthetic generation work here, simply fine-tuning models on the available metta code seems not to work ...

  • Expert Review 3

    Overall

    3.0

    • Compliance with RFP requirements 4.0
    • Solution details and team expertise 3.0
    • Value for money 0.0

    Vague -- what processes does the proposer intend to use? Lack of relevant detail.
