Create Quality corpus for NL-to-MeTTa LLM

Expert Rating 3.3
Yeabsira Derese
Project Owner

Overview

This proposal aims to develop a comprehensive and versatile MeTTa code corpus that serves as a critical resource for fine-tuning and training large language models (LLMs) within the Hyperon ecosystem. The corpus will include diverse examples covering MeTTa-specific features, algorithmic implementations, and problem-solving scenarios, ensuring compatibility with LLM training needs while showcasing MeTTa's advantages in addressing complex reasoning tasks.

RFP Guidelines

Create corpus for NL-to-MeTTa LLM

Complete & Awarded
  • Type: SingularityNET RFP
  • Total RFP Funding: $70,000 USD
  • Proposals: 10
  • Awarded Projects: 1
SingularityNET
Aug. 13, 2024

Develop a MeTTa language corpus to enable the training or fine-tuning of an LLM and/or LoRAs aimed at supporting developers by providing a natural language coding assistant for the MeTTa language.

Proposal Description

Project details

Approach and Methodology

We are preparing the MeTTa corpus using several resources, including MeTTa documentation, public MeTTa code repositories, repositories of similar paradigm languages (e.g., Haskell, Prolog), and common use case scenarios from MeTTa and other programming languages.

Our data preparation begins with the MeTTa documentation. This serves a dual purpose: introducing basic MeTTa syntax examples into the corpus and categorizing various MeTTa features. This approach ensures comprehensive coverage of MeTTa’s concepts, avoiding an overemphasis on certain features while neglecting others. Examples of these features include pattern matching, MeTTa’s non-deterministic nature, and its knowledge representation capabilities.

Before preparing the corpus, we will identify and gather existing MeTTa repositories and other relevant resources. However, MeTTa lacks a rich standard library compared to other functional programming languages. To address this limitation, we will incorporate repositories from similar languages like Haskell, Prolog, and Lisp. These repositories will provide quality working code that we will adapt into MeTTa. This step is critical, as it supplements the corpus with missing but essential data needed for training LLMs. Including simple yet foundational code snippets will enhance the LLM’s ability to solve more complex problems effectively and elegantly whenever possible. We will also focus on identifying and filtering diverse codebases to streamline the process of writing MeTTa code. By planning these repositories in advance, we aim to minimize duplication in instruction-code pairs and ensure that each entry is meaningfully distinct.
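
The goal of keeping instruction-code pairs meaningfully distinct can be sketched as an exact-match check after light normalization. This is a minimal illustration, not the team's actual tooling: the function names are hypothetical, and it assumes that case and whitespace differences in an instruction, and whitespace differences in code, should not count as distinct entries. It does not detect near-duplicates with different wording.

```python
import hashlib
import re


def normalize_instruction(text: str) -> str:
    """Collapse whitespace and lowercase, so trivial rewording of case/spacing matches."""
    return re.sub(r"\s+", " ", text).strip().lower()


def normalize_code(text: str) -> str:
    """Collapse whitespace only; case is kept because symbols are case-sensitive.

    Assumption: whitespace inside string literals is also collapsed, which is
    acceptable for duplicate detection but not for storing the code itself.
    """
    return re.sub(r"\s+", " ", text).strip()


def pair_key(instruction: str, code: str) -> str:
    """Stable fingerprint of a normalized instruction-code pair."""
    payload = normalize_instruction(instruction) + "\x00" + normalize_code(code)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def deduplicate(pairs):
    """Keep the first occurrence of each normalized pair, preserving input order."""
    seen = set()
    unique = []
    for instruction, code in pairs:
        key = pair_key(instruction, code)
        if key not in seen:
            seen.add(key)
            unique.append((instruction, code))
    return unique
```

Hashing the normalized pair keeps memory use small for a corpus of this size; a fuzzy similarity measure would be needed to catch paraphrased instructions.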

Another key aspect of our methodology is outlining and creating MeTTa code for relatively complex tasks, particularly those relevant to Hyperon-related projects. These use cases are uncommon in other programming languages but essential for the MeTTa corpus. Defining these scenarios and crafting solutions for them will enhance the corpus significantly. Additionally, solving common algorithmic problems in MeTTa adds great value, as it bridges problem-solving in MeTTa with approaches in well-known programming languages that LLMs are already familiar with.

Finally, we are incorporating error-handling examples into the corpus, such as entries addressing MeTTa code reduction errors. Including these scenarios is crucial for improving the robustness of LLM training and ensuring comprehensive language understanding.

Corpus Development Plan:

In general, the following steps will be taken:

  1. Data Collection: 
    • Automated Mining: Scrape MeTTa code from public GitHub repositories using automated scripts.

    • Code Reimplementation: Identify code snippets written in other programming languages, such as Haskell, Lisp, and Prolog, that can be reimplemented in MeTTa to diversify the corpus.

    • Scenario Implementation: Design and list problem-specific scenarios or algorithmic tasks that would be useful in MeTTa. These tasks will reflect the language's strengths and unique paradigms.

    • Testing: Test the collected code and keep only code that works.

  2. Data Organization and Preparation:

    • Start corpus preparation from the MeTTa documentation and MeTTa code repositories

    • Reimplement in MeTTa the code bases and standard libraries selected from other programming languages, with sufficient documentation and comments.

    • Implement solutions to the scenarios and selected algorithms in step 1 with sufficient documentation and comments. 

    • Prepare MeTTa code error handling entries for the corpus.

    • Consolidate and structure all collected MeTTa code in a standardized format.

    • Annotate the code with natural language descriptions to explain functionality and context.

  3. Data Documentation:

    • Provide comprehensive documentation detailing the collection, reimplementation, and organization processes.

    • Include guidelines on extending the corpus with new code or scenarios.

    • Offer a walkthrough of the structure and usability of the prepared MeTTa corpus.
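
As a rough illustration of the testing step in the plan above, a cheap static pre-check can discard obviously malformed snippets before anything is run through a MeTTa interpreter. The sketch below only checks that parentheses balance. It assumes `;` starts a line comment and that strings are double-quoted with backslash escapes; the function names are hypothetical. It complements execution-based testing and does not replace it.

```python
def is_balanced(source: str) -> bool:
    """Return True if parentheses balance, ignoring ';' comments and string literals."""
    depth = 0
    in_string = False
    in_comment = False
    escaped = False
    for ch in source:
        if in_comment:
            # A line comment ends at the next newline.
            if ch == "\n":
                in_comment = False
            continue
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == ";":
            in_comment = True
        elif ch == '"':
            in_string = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A closing parenthesis appeared before its opener.
                return False
    # Unclosed parentheses or an unterminated string both count as malformed.
    return depth == 0 and not in_string


def prefilter(snippets):
    """Keep only snippets that pass the structural check."""
    return [s for s in snippets if is_balanced(s)]
```

Snippets that survive this check would still need to be executed to confirm that they reduce as intended.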

Corpus Structure:

The corpus will adhere to the following structure for consistency and accessibility:

  • Category 1: Basic Examples

    • Syntax demonstrations.

    • Simple use cases, e.g., implementations of common pure and/or higher-order functions, and code examples covering different features of MeTTa.

  • Category 2: Algorithms and Data Structures

    • Common algorithms (e.g., sorting, graph traversal) reimplemented in MeTTa.

    • Fundamental data structures such as trees, stacks, and queues.

  • Category 3: Use Case based Implementations

    • Problem-solving scenarios (e.g., knowledge representation, functional programming logic).

    • Error handling in MeTTa (e.g., solving reduction errors)

    • AI-focused applications (e.g., learning algorithms, symbolic reasoning).

Each example will include:

  • Instruction: A concise, natural language description of the task.

  • Code: The corresponding MeTTa implementation.

  • Annotations: Additional comments explaining the logic or unique MeTTa features.
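
One possible way to store entries with the three fields listed above is one JSON object per line (JSONL), a format commonly used for fine-tuning data. The sketch below is an assumption about the storage format, not something the proposal specifies; the `category` field and its default value are hypothetical additions that mirror the corpus categories.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CorpusEntry:
    instruction: str  # concise natural language description of the task
    code: str  # the corresponding MeTTa implementation
    annotations: list = field(default_factory=list)  # comments on logic or features
    category: str = "basic"  # hypothetical label, not defined in the proposal


def to_jsonl(entries) -> str:
    """Serialize entries as one JSON object per line."""
    return "\n".join(json.dumps(asdict(e), ensure_ascii=False) for e in entries)


def from_jsonl(text: str):
    """Parse JSONL text back into CorpusEntry objects, skipping blank lines."""
    return [CorpusEntry(**json.loads(line)) for line in text.splitlines() if line.strip()]


example = CorpusEntry(
    instruction="Define a function that doubles a number.",
    code="(= (double $x) (* 2 $x))",
    annotations=["Functions are defined with equality expressions."],
)
```

Because `json.dumps` escapes newlines, multi-line MeTTa code still occupies a single line in the output file.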

Documentation Process:

The documentation will cover:

  1. Corpus Collection Overview: Tools and methods used for mining and collecting MeTTa examples, and the public repositories used as sources during the process.

  2. Standardization Practices: Steps to organize, reimplement, and annotate code.

  3. Replication Guide: Clear, step-by-step instructions for replicating the corpus preparation process.

Deliverables

Primary Deliverables:

  1. A structured MeTTa corpus with 10,000 categorized and annotated code-instruction pairs, diverse enough to provide good coverage of the language's features.

  2. Comprehensive documentation accompanying the corpus.
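
Whether the corpus has good coverage of the language's features could be checked with a simple tally of feature labels. The sketch below assumes each entry carries one feature label and flags features whose share falls below a threshold; the 5% default and the label names are illustrative assumptions, not targets stated in the proposal.

```python
from collections import Counter


def coverage_report(labels, expected, min_share=0.05):
    """Count entries per expected feature and flag underrepresented ones.

    labels: one feature label per corpus entry.
    expected: features the corpus is supposed to cover.
    min_share: hypothetical minimum fraction of entries per feature.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    report = {}
    for name in expected:
        n = counts.get(name, 0)
        share = n / total if total else 0.0
        report[name] = {
            "count": n,
            "share": share,
            "underrepresented": share < min_share,
        }
    return report
```

Running such a report after each milestone would show early whether some features dominate the corpus while others are neglected.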

Optional Deliverables:

  1. Automation scripts used for scraping and organizing code.

  2. A repository of code that extends the standard library, which could be useful if maintained centrally.

  3. Tutorials showcasing how to use the corpus effectively.

Team Competence

We bring a wealth of experience and expertise to this project, which makes us exceptionally well-suited for the task.

Experience with MeTTa

  • We have been working with MeTTa at iCog Labs for over six months, during which we have developed comprehensive knowledge of all its features.

  • Our expertise extends beyond the publicly documented functionalities, including familiarity with undocumented features, which eliminates any learning curve during corpus preparation and ensures high-quality data and code.

  • Our team leads core Hyperon projects such as MOSES and ECAN, where MeTTa has been integral. The skills and knowledge gained from these projects will directly contribute to developing a robust and useful dataset.

Experience with Haskell

  • With a strong foundation in functional programming, particularly in Haskell, we are well-equipped to write high-quality and efficient MeTTa code. MeTTa’s functional programming nature makes this experience invaluable for maintaining speed and precision in the corpus development process.

Experience with LLMs and Data Collection

  • Our team has prior experience working on LLM-based projects and data collection, equipping us with the skills necessary for handling the complexities of creating a dataset tailored to training language models.

Additional Strengths

  • We have a strong background in programming, with a particular emphasis on functional programming and problem-solving using MeTTa.

  • Access to a network of individuals with advanced MeTTa knowledge will further enhance the quality and volume of the generated code.

In summary, our combination of expertise in MeTTa, functional programming, and LLM-based projects, along with access to a highly skilled network, positions us as an ideal team to execute this project successfully.

Budget

The budget will cover:

  • Automation tool development and deployment.

  • Compensation for manual review and annotation tasks.

  • Miscellaneous expenses (e.g., infrastructure, software licenses).

Conclusion

This proposal outlines a robust plan to create a high-quality corpus tailored to the MeTTa programming language. Leveraging our deep expertise in MeTTa, functional programming, and LLM-based projects, we will deliver a structured dataset that enhances LLM training and broadens MeTTa’s usability.

By combining thorough data collection, reimplementation, and annotation, we aim to produce a comprehensive resource that empowers developers and addresses MeTTa’s unique challenges. With the necessary support, our team is confident in executing this project efficiently and delivering impactful results.

  • Total Milestones: 4

  • Total Budget: $35,000 USD

  • Last Updated: 7 Dec 2024

Milestone 1 - Planning & Data Collection

Description

In this initial phase, the primary task will be to identify relevant repositories and sources that can provide the necessary data for the corpus. This will involve searching through public GitHub repositories, documentation, and other available resources. Additionally, scraping scripts will be developed or refined to automate data extraction from these repositories. The team will also begin executing the automated mining and reimplementation tasks, ensuring that quality data is collected. This phase will also focus on drafting problem-specific scenarios and common error-handling cases that reflect the language's features and paradigms. Finally, algorithmic and data structure problems will be selected to ensure comprehensive coverage of MeTTa’s capabilities.

Deliverables

A collection of repositories and data sources, working scraping scripts, initial sets of problem-specific scenarios, error-handling examples, and a list of algorithmic problems selected for inclusion.

Budget

$7,000 USD

Success Criterion

Successful identification of at least five high-quality repositories, completion of functional scraping scripts, and the creation of at least 100 problem-specific scenarios, algorithm-solution pairs, and error-handling examples.

Milestone 2 - Corpus Organization

Description

This milestone involves structuring and annotating the collected examples to prepare them for integration into the corpus. The collected MeTTa code will be organized by category, with detailed annotations explaining the logic and context of the code. Additionally, the team will prepare MeTTa code examples by reimplementing code from other programming languages, such as Haskell, Prolog, and Lisp, to provide more variety and robustness to the corpus. This will involve carefully rewriting the selected code snippets into MeTTa while maintaining accuracy and functionality.

Deliverables

A structured and annotated collection of MeTTa code examples and reimplemented code repositories from other programming languages.

Budget

$12,000 USD

Success Criterion

Completion of at least 5,000 annotated examples, with successful reimplementations of code from Haskell, Prolog, and Lisp, ensuring the examples are accurate, well-documented, and properly categorized.

Milestone 3 - Scenario and Algorithmic Problem Solutions

Description

The third phase will focus on the preparation of scenario-based and error-handling MeTTa code, which will be essential for ensuring that the corpus represents real-world use cases and challenges. This includes writing solutions for algorithmic problems in MeTTa and focusing on areas such as pattern matching, knowledge representation, and error handling. Furthermore, a script will be developed to organize the MeTTa files into a common format that aligns with the corpus standards. Using this script, all previously prepared MeTTa code will be organized and standardized into a uniform format, facilitating easy access and integration into the training process.
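
A script of the kind described here might look roughly like the following. It is a sketch under several assumptions that the proposal does not state: source files use the `.metta` extension, the leading `;` comment lines of each file hold the natural language instruction, and the target format is JSONL. The function names and the `source` field are hypothetical.

```python
import json
from pathlib import Path


def split_header(source: str):
    """Split leading ';' comment lines (treated as the instruction) from the code."""
    lines = source.splitlines()
    header = []
    i = 0
    while i < len(lines) and lines[i].lstrip().startswith(";"):
        # Drop the comment marker(s) and surrounding whitespace.
        header.append(lines[i].lstrip().lstrip(";").strip())
        i += 1
    instruction = " ".join(h for h in header if h)
    code = "\n".join(lines[i:]).strip()
    return instruction, code


def build_corpus(root, out_path):
    """Walk root for .metta files and write one JSON record per usable file.

    Files without a leading comment or without code are skipped.
    Returns the number of records written.
    """
    root = Path(root)
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(root.rglob("*.metta")):
            instruction, code = split_header(path.read_text(encoding="utf-8"))
            if not instruction or not code:
                continue
            record = {
                "instruction": instruction,
                "code": code,
                "source": path.relative_to(root).as_posix(),
            }
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Sorting the paths makes the output deterministic, so repeated runs over the same files produce identical corpus files.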

Deliverables

A set of scenario-based MeTTa code, algorithmic problem solutions, and a script to organize and standardize the MeTTa code.

Budget

$10,000 USD

Success Criterion

Completion of at least 100 scenario-based and algorithmic problem solutions, successful development and execution of the script that organizes the corpus into the required format, and the addition of 3,000 data entries to the corpus.

Milestone 4 - Final Deliverables

Description

In the final phase, comprehensive documentation will be written to detail the corpus creation process, including the steps taken for data collection, organization, and annotation. The documentation will also include guidelines for extending the corpus with additional data or scenarios in the future. The corpus will be validated by test users to ensure its accuracy and usability. A final review of the deliverables will be conducted to ensure consistency and quality across all materials. Upon validation and quality checks, the finalized corpus, along with the documentation, will be submitted.

Deliverables

A complete and finalized corpus with full documentation, including validation feedback and any revisions based on test user input.

Budget

$6,000 USD

Success Criterion

Successful submission of the finalized corpus with 10,000 instruction-code pair data entries.

Expert Ratings

Reviews & Ratings

Group Expert Rating (Final)

Overall

3.3

  • Feasibility 4.3
  • Desirability 4.7
  • Usefulness 5.0
  • Expert Review 1

    Overall

    4.0

    • Compliance with RFP requirements 5.0
    • Solution details and team expertise 4.0
    • Value for money 5.0
    Strong proposal

    Strong proposal aligned with RFP goals. Well-structured plan and diverse deliverables to advance model training. The team’s expertise and inclusion of error-handling and algorithmic tasks add depth to the corpus. Ambitious timelines, vague validation strategies, and reliance on external sources introduce minor risks.

  • Expert Review 2

    Overall

    2.0

    • Compliance with RFP requirements 3.0
    • Solution details and team expertise 5.0
    • Value for money 5.0
    The proposal seems to suggest hand-coding the whole corpus rather than synthesizing the programs... I don't think this is feasible

    The approach suggested may make sense for beefing up a training corpus, but I don't really think it's viable to hand-code 10K MeTTa programs as suggested... (I also am not sure if the proposer is already working for SNet on a different contract or if this is OK, maybe it is..?)

  • Expert Review 3

    Overall

    4.0

    • Compliance with RFP requirements 5.0
    • Solution details and team expertise 5.0
    • Value for money 5.0

    The proposal is detailed and comprehensive in scope. Unique to its approach is a use of other functional languages to diversify the training corpus. While this could be useful, it might also be somewhat difficult in practice.
