MeTTa Language Corpus

Expert Rating 3.0
amcmaster1988
Project Owner


Overview

This proposal outlines a strategic approach for developing a high-quality MeTTa language corpus that aligns with the objectives of the SingularityNET Foundation. The aim is to deliver a comprehensive dataset of 10,000 well-curated instruction-output pairs over a four-month period. Leveraging our team's significant expertise in natural language processing, corpus development, and AI training—coupled with extensive experience in Lisp-based languages similar to MeTTa—this proposal is designed to meet the project’s requirements with rigor and efficiency.

RFP Guidelines

Create corpus for NL-to-MeTTa LLM

Complete & Awarded
  • Type SingularityNET RFP
  • Total RFP Funding $70,000 USD
  • Proposals 10
  • Awarded Projects 1
SingularityNET
Aug. 13, 2024

Develop a MeTTa language corpus to enable the training or fine-tuning of an LLM and/or LoRAs aimed at supporting developers by providing a natural language coding assistant for the MeTTa language.

Proposal Description

Project details

Methodology and Corpus Development Strategy
The development process begins with a comprehensive review and systematic extraction of data from existing MeTTa resources, including official language documentation, community-authored code, tutorials, and repositories hosted on platforms like GitHub. Advanced scripting techniques will be employed to automate the extraction, conversion, and formatting of these resources into structured instruction-output pairs that reflect real-world applications of MeTTa.
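As an illustration of the extraction step, the following is a minimal sketch rather than the team's actual tooling. It assumes, hypothetically, that tutorials are Markdown files whose MeTTa examples sit in fenced code blocks tagged `metta`, and that the prose paragraph just before each block can serve as the instruction. The function name `extract_pairs` and the sample text are invented for this example.

```python
FENCE = "`" * 3  # Markdown code-fence marker


def extract_pairs(markdown_text, lang="metta"):
    """Pair each fenced `lang` block with the nearest preceding prose paragraph."""
    pairs = []
    paragraph = []       # lines of the prose paragraph being read
    last_paragraph = ""  # most recently completed prose paragraph
    code = None          # list of code lines while inside a `lang` fence
    skipping = False     # True while inside a fence of another language

    for line in markdown_text.splitlines():
        stripped = line.strip()
        if code is not None:
            if stripped == FENCE:  # closing fence of a MeTTa block
                if last_paragraph and code:
                    pairs.append({
                        "instruction": last_paragraph,
                        "output": "\n".join(code).strip(),
                    })
                code = None
            else:
                code.append(line)
        elif skipping:
            if stripped == FENCE:
                skipping = False
        elif stripped.startswith(FENCE):  # opening fence
            if paragraph:
                last_paragraph = " ".join(paragraph)
                paragraph = []
            if stripped[len(FENCE):].strip() == lang:
                code = []
            else:
                skipping = True
        elif stripped:
            paragraph.append(stripped)
        elif paragraph:  # blank line ends the current paragraph
            last_paragraph = " ".join(paragraph)
            paragraph = []
    return pairs


# Hypothetical tutorial fragment used only to exercise the sketch.
sample = (
    "Define a function that doubles a number.\n\n"
    + FENCE + "metta\n(= (double $x) (* 2 $x))\n" + FENCE + "\n"
)
print(extract_pairs(sample))
```

Pairs harvested this way would still need the manual review described in this section, since the paragraph before a code block is not always a usable instruction.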

Recognizing that the current volume of existing data may not provide the breadth needed for robust training, synthetic data will be generated to fill these gaps. The generation process will employ augmentation techniques that are mindful of maintaining the language's consistency and idiomatic structures. This approach ensures that the corpus covers a diverse range of scenarios and captures the intricacies of the MeTTa language.
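The proposal does not name specific augmentation techniques. One simple possibility, shown here only as a sketch, is consistent variable renaming: it produces a new surface form of a snippet while leaving its structure untouched. The sketch assumes that MeTTa variables are tokens beginning with `$`; the function name `rename_variables` and the `v` prefix are hypothetical.

```python
import re

# Assumption: a MeTTa variable is a `$` followed by an identifier-like token.
VARIABLE = re.compile(r"\$[A-Za-z_][A-Za-z0-9_-]*")


def rename_variables(code, prefix="v"):
    """Rename every distinct variable to $<prefix>0, $<prefix>1, ... in order of first use."""
    mapping = {}

    def substitute(match):
        name = match.group(0)
        if name not in mapping:
            mapping[name] = f"${prefix}{len(mapping)}"
        return mapping[name]

    # All variables are renamed in one pass, so two different variables
    # can never end up sharing a name.
    return VARIABLE.sub(substitute, code)


print(rename_variables("(= (add $a $b) (+ $a $b))"))
```

A known limitation of this sketch is that it also rewrites `$name` tokens that appear inside string literals or comments, so augmented samples would still need to pass validation.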

The validation of the corpus will occur through a dual-layer process. The first layer consists of automated checks using customized linters that identify and rectify structural errors and deviations from best coding practices. The second layer involves manual review by seasoned MeTTa developers, who will ensure the functional accuracy and adherence of the corpus to best practices. This dual approach balances efficiency with meticulous quality assurance.

Data Compatibility with Modern Linters and Machine Learning Pipelines
To ensure that the corpus is suitable for integration with contemporary development tools and machine learning frameworks, several measures will be taken. All MeTTa code outputs will be reviewed to meet coding standards that align with modern linting tools, facilitating seamless compatibility with popular integrated development environments and CI/CD pipelines. The dataset will be structured in universally recognized formats, such as JSON and CSV, with clearly defined fields for instructions, outputs, and accompanying metadata. This structure will enhance the usability of the corpus in machine learning workflows, including training with frameworks like TensorFlow and PyTorch.
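To make the intended structure concrete, here is a minimal sketch of one record and its serialization. The proposal names only the three fields (instruction, output, metadata). The metadata keys `source` and `validated` are hypothetical, as is the choice of JSON Lines (one JSON object per line) as the JSON variant.

```python
import csv
import io
import json

# One illustrative record; the metadata keys are assumptions, not a fixed schema.
records = [
    {
        "instruction": "Define a function that doubles a number.",
        "output": "(= (double $x) (* 2 $x))",
        "metadata": {"source": "synthetic", "validated": False},
    }
]


def to_jsonl(rows):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)


def to_csv(rows):
    """Serialize records as CSV, storing the nested metadata as a JSON string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["instruction", "output", "metadata"])
    writer.writeheader()
    for row in rows:
        writer.writerow({
            "instruction": row["instruction"],
            "output": row["output"],
            "metadata": json.dumps(row["metadata"], sort_keys=True),
        })
    return buffer.getvalue()


print(to_jsonl(records), end="")
print(to_csv(records), end="")
```

Because CSV has no nested types, the metadata column here carries a JSON string that a loader must decode again, which is one reason a line-oriented JSON format may be the more convenient primary format for training pipelines.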

Annotations will be included to provide context, explain code behavior, and note potential dependencies or relevant documentation links. These annotations will aid in model training by providing additional insights for debugging and optimization. The corpus will be formatted to align with machine learning libraries' requirements, supporting efficient batch processing and input-output mappings for streamlined training.

 

Open Source Licensing

GNU GPL - GNU General Public License

Proposal Video

Not Available Yet

Check back later during the Feedback & Selection period for the RFP that this proposal is applied to.

  • Total Milestones 3
  • Total Budget $30,000 USD
  • Last Updated 4 Nov 2024

Milestone 1 - Extraction of MeTTa resources/data structuring.

Description

In Month 1, the project will commence with the extraction and analysis of existing MeTTa resources, including official documentation, community-contributed code, and repository data. This phase will involve organizing and structuring the initial dataset to create a comprehensive foundation for the corpus.

Deliverables

  • Comprehensive review of MeTTa documentation, community code, and repositories.
  • Extraction scripts and initial structured dataset of instruction-output pairs.
  • Preliminary report detailing data sources and initial findings.

Budget

$10,000 USD

Milestone 2 - Initial assembly of the corpus

Description

Months 2-3 will focus on the development and rigorous testing of data processing scripts to facilitate efficient data extraction, conversion, and formatting. During this phase, the initial assembly of the corpus will take shape, with automated processes ensuring consistency and readiness for subsequent expansion.

Deliverables

  • Completed and tested data extraction and processing scripts.
  • Preliminary version of the corpus containing structured instruction-output pairs.
  • Interim validation report documenting the results of script testing and early-stage corpus quality.

Budget

$10,000 USD

Milestone 3 - Final validation, comprehensive documentation

Description

In Month 4, the project will enter its final phase, focusing on comprehensive validation of the entire corpus and the completion of detailed documentation. Integration testing will be conducted to ensure compatibility with modern linters and machine learning frameworks. This phase will culminate in the release of the complete corpus and associated open-source code, marking the successful conclusion of the project and enabling future development and applications.

Deliverables

  • Fully validated and finalized corpus of 10,000 instruction-output pairs.
  • Comprehensive, version-controlled documentation detailing the corpus creation process, validation steps, and known limitations.
  • Integration testing report showing compatibility with modern linters and machine learning frameworks.
  • Final open-source release package containing the complete corpus, data processing scripts, and associated documentation.

Budget

$10,000 USD


Expert Ratings

Reviews & Ratings

Group Expert Rating (Final)

Overall

3.0

  • Compliance with RFP requirements 4.0
  • Solution details and team expertise 3.0
  • Value for money 2.0
  • Expert Review 1

    Overall

    3.0

    • Compliance with RFP requirements 5.0
    • Solution details and team expertise 1.0
    • Value for money 0.0
    Prior experience unclear

    Clear milestones. Strong focus on usability and structured deliverables. Unverified team expertise, reliance on existing resources, and risks with synthetic data fidelity. Promising but lacks credibility.

  • Expert Review 2

    Overall

    2.0

    • Compliance with RFP requirements 2.0
    • Solution details and team expertise 2.0
    • Value for money 0.0
    What is proposed seems to be to gather and clean up existing MeTTa code samples

    What is proposed is worthwhile but easy to do and too simple to be worth this much $$

  • Expert Review 3

    Overall

    4.0

    • Compliance with RFP requirements 5.0
    • Solution details and team expertise 3.0
    • Value for money 0.0

    Straightforward and to the point. Could be more detailed and I would like to know more about the team.
