Building a MeTTa Corpus for NL-to-Code LLMs

Anthony Oliko
Project Owner

Overview

This proposal aims to create a high-quality MeTTa corpus for training or fine-tuning a large language model (LLM) that translates natural language into MeTTa code. The corpus will contain up to 10,000 diverse instruction-code pairs and serve as a foundation for an AI-powered coding assistant. Alongside the corpus, we will deliver all necessary scripts and clear documentation to ensure transparency and reproducibility. The project is intended to make MeTTa more accessible, ease its learning curve, and contribute to AGI research within the Hyperon framework. Our team consists of members of the SingularityNET community who share a deep interest in the OpenCog Hyperon framework and its potential.

RFP Guidelines

Create corpus for NL-to-MeTTa LLM

Internal Proposal Review
  • Type SingularityNET RFP
  • Total RFP Funding $70,000 USD
  • Proposals 10
  • Awarded Projects n/a
SingularityNET
Aug. 13, 2024

Develop a MeTTa language corpus to enable the training or fine-tuning of an LLM and/or LoRAs aimed at supporting developers by providing a natural language coding assistant for the MeTTa language.

Proposal Description

  • Total Milestones 4
  • Total Budget $30,000 USD
  • Last Updated 8 Dec 2024

Milestone 1 - Project Kickoff and Resource Collection

Description

The project begins with gathering and reviewing all available MeTTa resources, including official documentation, community contributions, GitHub repositories, and tutorials. Each resource will be evaluated for quality, relevance, and coverage of MeTTa features. In addition, a framework for data collection and formatting will be defined and validation criteria will be established. This phase lays a structured foundation for building the corpus.

Time: Month 1

Deliverables

1. A comprehensive list of MeTTa resources, categorized by type and relevance.
2. A detailed plan outlining the methods for data extraction, formatting, and validation.
3. Initial drafts of scripts and tools for automated resource extraction, where applicable.
4. Weekly progress updates and a milestone completion report.

Budget

$5,000 USD

Success Criterion

1. Resource Compilation: A complete and well-organized list of MeTTa resources is compiled, categorized by type (e.g., documentation, tutorials, repositories) and assessed for relevance and coverage of key MeTTa features.
2. Framework Definition: A clear and actionable plan is developed, outlining the methods for data extraction, formatting, and validation, including measurable validation criteria.
3. Tool Development: Initial versions of scripts and tools for automated resource extraction are created, tested for functionality, and meet the requirements for scalability and accuracy.
4. Progress Tracking: Weekly updates are provided, detailing completed tasks, challenges encountered, and planned actions, ensuring transparency and accountability throughout the milestone.
5. Milestone Report: A comprehensive milestone completion report is delivered, summarizing achievements, insights, and any modifications to the project plan, demonstrating readiness to proceed to the next phase.
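As an illustration of the resource list described in this milestone, the following minimal Python sketch groups resources by type and orders them by relevance. The `Resource` fields, the 1 to 5 relevance scale, and the sample titles are assumptions made for the example; the proposal does not prescribe a schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """One catalogued MeTTa resource (hypothetical schema)."""
    title: str
    kind: str       # e.g. "documentation", "tutorial", "repository"
    relevance: int  # assumed scale: 1 (low) to 5 (high)


def catalog(resources):
    """Group resources by kind, most relevant first within each kind."""
    grouped = defaultdict(list)
    for resource in resources:
        grouped[resource.kind].append(resource)
    return {
        kind: sorted(items, key=lambda r: r.relevance, reverse=True)
        for kind, items in sorted(grouped.items())
    }


if __name__ == "__main__":
    # Placeholder entries, not real resources.
    sample = [
        Resource("Intro tutorial", "tutorial", 3),
        Resource("Language reference", "documentation", 5),
        Resource("Advanced tutorial", "tutorial", 4),
    ]
    for kind, items in catalog(sample).items():
        print(kind, [r.title for r in items])
```

A flat list of typed records like this can be exported to CSV or JSON for the milestone report without further restructuring.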

Milestone 2 - Corpus Development and Synthesis

Description

This phase focuses on transforming raw resources into usable data. The extracted material will be formatted into instruction-output pairs, and coverage gaps will be filled with synthetic examples. Coverage will include all key features and functionalities of MeTTa. The emphasis will be on creating diverse, accurate, and comprehensive data.

Time: Month 2

Deliverables

1. A structured dataset of 5,000 validated instruction-output pairs.
2. New synthetic examples to fill coverage gaps.
3. Scripts for data formatting, generation, and validation.
4. A midpoint evaluation report to ensure quality and alignment with project goals.

Budget

$12,000 USD

Success Criterion

1. Dataset Creation: A structured dataset of 5,000 validated instruction-output pairs is developed, demonstrating accuracy, diversity, and alignment with MeTTa's core features and functionalities.
2. Synthetic Data Generation: New synthetic examples are created to address identified coverage gaps, ensuring comprehensive representation of MeTTa's capabilities.
3. Tool Availability: Fully functional scripts for data formatting, synthetic generation, and validation are delivered, with thorough testing to confirm usability and reliability.
4. Quality Assurance: All data passes validation checks based on predefined criteria, ensuring high standards of correctness and relevance.
5. Midpoint Evaluation: A detailed evaluation report is submitted, assessing progress, highlighting achievements, identifying potential risks, and confirming alignment with project goals.
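To make the formatting and validation step concrete, here is a minimal sketch of how instruction-output pairs might be written as JSON Lines after a cheap syntactic check. The field names `instruction` and `output`, the JSONL layout, and the balanced-parentheses criterion are assumptions for illustration; the actual validation criteria are to be defined in Milestone 1, and a real pipeline would likely also run the code through a MeTTa interpreter.

```python
import json


def parens_balanced(code: str) -> bool:
    """Return True if parentheses outside string literals are balanced."""
    depth = 0
    in_string = False
    escaped = False
    for ch in code:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a closing paren with no matching opener
                return False
    return depth == 0 and not in_string


def to_jsonl(pairs) -> str:
    """Serialize (instruction, output) pairs that pass the check, one JSON object per line."""
    lines = []
    for instruction, output in pairs:
        instruction, output = instruction.strip(), output.strip()
        if instruction and output and parens_balanced(output):
            record = {"instruction": instruction, "output": output}
            lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative pairs only; the second is rejected for an unclosed paren.
    pairs = [
        ("Define a function that doubles a number.", "(= (double $x) (* 2 $x))"),
        ("Add one and two.", "(+ 1 2"),
    ]
    print(to_jsonl(pairs))
```

The parenthesis check only catches structural damage such as truncated extractions. It says nothing about whether the code does what the instruction asks, so it would complement rather than replace semantic review.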

Milestone 3 - Corpus Finalization and Quality Assurance

Description

The primary task in this phase is to complete the corpus by expanding it to 10,000 validated pairs. Rigorous quality assurance processes will be applied to ensure that the corpus meets the defined standards for accuracy, diversity, and usability. Feedback from stakeholders will be incorporated during this phase.

Time: Month 3

Deliverables

1. A finalized corpus of 10,000 high-quality, validated instruction-output pairs.
2. Comprehensive quality assurance reports detailing validation processes and outcomes.
3. Updated scripts and tools for corpus refinement and replication.
4. Weekly progress updates and a milestone completion report.

Budget

$8,000 USD

Success Criterion

1. Corpus Completion: A finalized corpus of 10,000 high-quality, validated instruction-output pairs is delivered, meeting predefined standards for accuracy, diversity, and usability.
2. Quality Assurance: Comprehensive quality assurance processes are conducted, with detailed reports documenting validation criteria, methodologies, and outcomes, ensuring the dataset's reliability and robustness.
3. Tool Updates: Scripts and tools for corpus refinement and replication are updated and optimized for efficiency and scalability, with clear documentation for future use.
4. Stakeholder Feedback: Feedback from stakeholders is effectively integrated into the final corpus, addressing any concerns or suggestions to enhance its utility and relevance.
5. Progress Transparency: Weekly updates are provided, tracking milestones and ensuring alignment with project objectives, culminating in a thorough milestone completion report.
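One quality assurance check that bears on both diversity and the 10,000-pair target is duplicate removal. The sketch below is a simple illustration under stated assumptions: records are dictionaries with `instruction` and `output` keys, instructions are compared case-insensitively, and code is compared after whitespace normalization only. The proposal does not specify its QA procedure, so none of this should be read as the team's method.

```python
import re

TARGET_PAIRS = 10_000  # corpus size stated in the milestone


def collapse_whitespace(text: str) -> str:
    """Collapse runs of whitespace so formatting differences do not hide duplicates."""
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(pairs):
    """Keep the first occurrence of each (instruction, output) pair."""
    seen = set()
    unique = []
    for pair in pairs:
        # Instructions are compared case-insensitively; code keeps its case.
        key = (
            collapse_whitespace(pair["instruction"]).lower(),
            collapse_whitespace(pair["output"]),
        )
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique


def qa_summary(pairs):
    """Report how many unique pairs exist and how many are still needed."""
    unique = deduplicate(pairs)
    return {
        "total": len(pairs),
        "unique": len(unique),
        "duplicates_removed": len(pairs) - len(unique),
        "remaining_to_target": max(0, TARGET_PAIRS - len(unique)),
    }


if __name__ == "__main__":
    sample = [
        {"instruction": "Add one and two.", "output": "(+ 1 2)"},
        {"instruction": "add one  and two.", "output": "(+  1 2)"},
        {"instruction": "Multiply two by three.", "output": "(* 2 3)"},
    ]
    print(qa_summary(sample))
```

Exact-match deduplication is deliberately conservative. Detecting paraphrased instructions would need a similarity measure, which is outside the scope of this sketch.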

Milestone 4 - Documentation and Delivery

Description

The final phase focuses on preparing and delivering the project outputs. Comprehensive documentation will be created, covering the corpus creation process, validation methods, known limitations, and guidelines for future use. All scripts, tools, and data will be organized, tested, and delivered as open-source resources.

Time: Month 4

Deliverables

1. Full documentation detailing the project, including data sources, methods, and usage instructions.
2. All scripts and tools necessary for replicating or extending the corpus creation process.
3. The finalized corpus, shared as an open-source deliverable.
4. A presentation to stakeholders summarizing project outcomes.
5. A final project report summarizing milestones, challenges, and future recommendations.

Budget

$5,000 USD

Success Criterion

1. Comprehensive Documentation: Detailed, user-friendly documentation is delivered, clearly describing data sources, corpus creation methods, validation processes, known limitations, and guidelines for replication or extension.
2. Script and Tool Delivery: All scripts and tools required for reproducing or extending the corpus creation process are finalized, thoroughly tested, and organized, ensuring they are functional and accessible as open-source resources.
3. Corpus Accessibility: The finalized corpus is published and made available as an open-source deliverable, meeting all requirements for usability and proper licensing.
4. Stakeholder Presentation: A well-structured presentation is conducted, effectively summarizing project goals, achievements, challenges, and the practical utility of deliverables.
5. Final Project Report: A comprehensive final report is submitted, detailing milestones achieved, obstacles encountered, solutions implemented, and recommendations for future work.
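For the reproducibility goal, one option is to publish a manifest alongside the corpus so that anyone can verify they have the same files. The following sketch is an assumption about how that could look: the manifest layout, the file names, and the `"TBD"` license placeholder are all hypothetical, since the proposal does not name a license or a release format.

```python
import hashlib
import json


def build_manifest(files: dict, license_name: str) -> str:
    """Build a JSON manifest with a SHA-256 digest and size for each file.

    `files` maps a relative path to the file's raw bytes.
    """
    entries = [
        {
            "path": path,
            "sha256": hashlib.sha256(data).hexdigest(),
            "bytes": len(data),
        }
        for path, data in sorted(files.items())
    ]
    return json.dumps({"license": license_name, "files": entries}, indent=2)


if __name__ == "__main__":
    # Hypothetical release contents; real files would be read from disk.
    release = {
        "corpus.jsonl": b'{"instruction": "Add one and two.", "output": "(+ 1 2)"}\n',
        "README.md": b"# MeTTa corpus\n",
    }
    print(build_manifest(release, license_name="TBD"))
```

Recomputing the digests after download and comparing them with the manifest confirms that the corpus used for training matches the published release.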
