Large MeTTa corpus for LLM fine-tuning

Seb Wiechers
Project Owner

Status

  • Overall Status

    ⏳ Contract Pending

  • Funding Transferred

    $0 USD

  • Max Funding Amount

    $27,000 USD

Funding Schedule

  • Milestone Release 1: $4,000 USD (Pending, TBD)
  • Milestone Release 2: $6,000 USD (Pending, TBD)
  • Milestone Release 3: $8,500 USD (Pending, TBD)
  • Milestone Release 4: $8,500 USD (Pending, TBD)

Project AI Services

No Service Available

Overview

We propose to curate two datasets of (natural language <-> MeTTa) expression pairs: the *silver* dataset, consisting of 20,000 AI-generated, probabilistically verified pairs, and the *gold* dataset, consisting of 10,000 human-labeled, high-quality pairs. The proposed timeline is 4 months. Funding will cover a) compute costs, b) scoring of output pairs, and c) development of algorithms for estimating the probability that a predicted pair is correct. Our knowledge of the MeTTa language, NLP, and linguistics, combined with our background in logic and AI and our real-world organizational experience, puts us in a strong position to have a compounding effect on the SNET ecosystem.

RFP Guidelines

Create corpus for NL-to-MeTTa LLM

Complete & Awarded
  • Type: SingularityNET RFP
  • Total RFP Funding: $70,000 USD
  • Proposals: 10
  • Awarded Projects: 1
SingularityNET
Aug. 13, 2024

Develop a MeTTa language corpus to enable the training or fine-tuning of an LLM and/or LoRAs aimed at supporting developers by providing a natural language coding assistant for the MeTTa language.

Proposal Description

Company Name (if applicable)

Pearstop

Project details

We propose to curate two datasets of (natural language <-> MeTTa) expression pairs:

  • the silver dataset, consisting of 20,000 AI-generated, probabilistically verified (NL <-> MeTTa) pairs,
  • the gold dataset, consisting of 10,000 human-verified, high-quality pairs.

The proposed timeline is 4 months.

Funding will be used to cover the following expenses:

  • a) compute costs
  • b) scoring of output pairs
  • c) development of algorithms for estimating the probability that a predicted pair is correct
  • d) human capital
  • e) tooling
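
The proposal does not specify how the probability of a correct prediction in item c) would be estimated. As one possible illustration only, the sketch below assumes that a random sample of silver pairs is audited by hand and reports a Wilson score interval for the share of correct pairs. The function name `wilson_interval` and the sample numbers are hypothetical and do not come from the proposal.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%).

    correct: number of audited pairs judged correct (hypothetical audit result)
    n:       number of audited pairs
    """
    if n <= 0 or not 0 <= correct <= n:
        raise ValueError("need n > 0 and 0 <= correct <= n")
    p = correct / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical audit: 90 of 100 sampled silver pairs were judged correct.
low, high = wilson_interval(90, 100)
print(f"estimated accuracy of the silver set: {low:.3f} to {high:.3f}")
```

An interval is used here rather than a single point estimate because a small audit sample leaves real uncertainty about the quality of the full 20,000-pair set.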

Our knowledge of the MeTTa language, NLP, and linguistics, combined with our background in logic and AI and our real-world organizational experience, puts us in a strong position to have a compounding effect on the SNET ecosystem.

Milestones

  • 1 month: gathering of team, resources and tooling. Defining the human labeling workflow. Selecting appropriate datasets and preparing the data pipeline in Python (deliverable)
  • 2 months: small-scale testing of different generative approaches and development of a 'common sense' algorithm that checks parsed MeTTa expressions against the logical properties of the input statements (deliverable)
  • 3 months: start of the human labeling process, while incrementally using the findings to produce the silver dataset.
  • 4 months: delivery of the silver dataset (minimum 20,000 labeled pairs) and the gold dataset (10,000 pairs)

We will deliver not only the correct NL <-> MeTTa pairs, but also the incorrect pairs that we accumulate in the process. These can serve as negative examples when defining a loss function for potential DNN approaches in future projects.
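
The proposal leaves open how the incorrect pairs would enter a loss function. One common option, shown here only as an illustration, is a margin ranking loss that asks a model to score each correct pair higher than its rejected counterpart. The function `margin_ranking_loss`, the margin value, and the example scores are all assumptions made for this sketch.

```python
from typing import Sequence


def margin_ranking_loss(pos_scores: Sequence[float],
                        neg_scores: Sequence[float],
                        margin: float = 1.0) -> float:
    """Mean hinge loss over (correct, incorrect) score pairs.

    pos_scores[i] is a model's score for a correct NL <-> MeTTa pair and
    neg_scores[i] its score for an incorrect pair with the same NL input.
    The loss is zero once every correct pair wins by at least `margin`.
    """
    if len(pos_scores) != len(neg_scores) or not pos_scores:
        raise ValueError("need equally sized, non-empty score lists")
    losses = [max(0.0, margin - p + n) for p, n in zip(pos_scores, neg_scores)]
    return sum(losses) / len(losses)


# Hypothetical scores: the first pair is well separated, the second is not.
print(margin_ranking_loss([2.0, 0.25], [0.5, 0.5]))
```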

Open Source Licensing

MIT - Massachusetts Institute of Technology License

Default MIT License (code and dataset)

Group Expert Rating (Final)

Overall

5.0

  • Compliance with RFP requirements 5.0
  • Solution details and team expertise 5.0
  • Value for money 4.0

New reviews and ratings are disabled for Awarded Projects

Overall Community

4.3

from 4 reviews
  • 5
    1
  • 4
    2
  • 3
    0
  • 2
    0
  • 1
    0

Feasibility

5

from 4 reviews

Viability

5

from 4 reviews

Desirability

4

from 4 reviews

Usefulness

0

from 4 reviews

4 ratings
  • Expert Review 1

    Overall

    4.0

    • Compliance with RFP requirements 5.0
    • Solution details and team expertise 4.0
    • Value for money 0.0
    Robust solution

    Robust, cost-effective proposal offering dual datasets and reusable tools. Clear milestones and deliverables. Risks include reliance on AI-generated data, ambitious gold dataset timeline, and unsubstantiated team expertise.

  • Expert Review 2

    Overall

    4.0

    • Compliance with RFP requirements 5.0
    • Solution details and team expertise 4.0
    • Value for money 0.0
    It's a sensible proposal though light on some critical details

    This proposal understands the size and nature of the task, and takes seriously the magnitude of the human labeling process ... a little more concreteness on how the human labeling and the synthesis will be done would have been better but at least the proposers are taking the nature of the task seriously...

  • Expert Review 3

    Overall

    5.0

    • Compliance with RFP requirements 5.0
    • Solution details and team expertise 4.0
    • Value for money 0.0

    A great idea using a mix of human-curation and ML tools with two datasets (Silver and Gold). I was really hoping for more detail of the processes proposed for the ML group.

  • Total Milestones

    4

  • Total Budget

    $27,000 USD

  • Last Updated

    3 Feb 2025

Milestone 1 - Python data pipeline

Status
😐 Not Started
Description

1 month: gathering of team, resources and tooling. Defining the human labeling workflow. Selecting appropriate datasets and preparing the data pipeline in Python (deliverable)

Deliverables

We will open source a data pipeline that can be used to generate and curate NL <-> MeTTa pairs.
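
The pipeline itself is not described in the proposal. The sketch below is a minimal guess at what the curation step could look like, assuming pairs are stored as JSON Lines records. The `Pair` record, its field names, and the example expressions are hypothetical, and the parenthesis check is a deliberate simplification (it ignores string literals, for instance) rather than a real MeTTa parser.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, Iterator


@dataclass
class Pair:
    nl: str                # natural-language statement
    metta: str             # candidate MeTTa expression
    tier: str = "silver"   # "silver" (AI-generated) or "gold" (human-verified)


def is_balanced(expr: str) -> bool:
    """Cheap syntactic filter: parentheses must be balanced."""
    depth = 0
    for ch in expr:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def curate(pairs: Iterable[Pair]) -> Iterator[Pair]:
    """Drop empty, unbalanced, and duplicate pairs, keeping first occurrences."""
    seen = set()
    for p in pairs:
        key = (p.nl.strip().lower(), p.metta.strip())
        if not key[0] or not key[1] or not is_balanced(p.metta) or key in seen:
            continue
        seen.add(key)
        yield p


def to_jsonl(pairs: Iterable[Pair]) -> str:
    """Serialize pairs as one JSON object per line."""
    return "\n".join(json.dumps(asdict(p), ensure_ascii=False) for p in pairs)


raw = [
    Pair("Sam is a human", "(Human Sam)"),
    Pair("sam is a human ", "(Human Sam)"),   # duplicate after normalization
    Pair("Broken example", "(Human Sam"),     # unbalanced, so it is dropped
]
print(to_jsonl(curate(raw)))
```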

Budget

$4,000 USD


Milestone 2 - 'Common Sense' verification algorithm

Status
😐 Not Started
Description

2 months: small-scale testing of different generative approaches and development of a 'common sense' algorithm that checks parsed MeTTa expressions against the logical properties of the input statements (deliverable)

Deliverables

We will open source an algorithm and approach that can be used to perform a 'common sense' test on generated MeTTa statements, if the logical relation between input expressions is known beforehand.
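
The proposal gives no details of the 'common sense' test, so the following is only a sketch of the general idea under strong assumptions. It parses two generated expressions as plain S-expressions and checks them against a relation that is known for the two input sentences. The relation names `paraphrase` and `negation`, the use of a `not` symbol, and the example expressions are all hypothetical.

```python
from typing import List, Union

Tree = Union[str, List["Tree"]]


def parse(expr: str) -> Tree:
    """Parse one S-expression into nested lists of symbol strings."""
    tokens = expr.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read() -> Tree:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items: List[Tree] = []
            while pos < len(tokens) and tokens[pos] != ")":
                items.append(read())
            if pos >= len(tokens):
                raise ValueError("missing ')'")
            pos += 1  # consume the closing parenthesis
            return items
        if tok == ")":
            raise ValueError("unexpected ')'")
        return tok

    tree = read()
    if pos != len(tokens):
        raise ValueError("trailing tokens after expression")
    return tree


def consistent(relation: str, metta_a: str, metta_b: str) -> bool:
    """Check two generated expressions against a known input relation."""
    try:
        a, b = parse(metta_a), parse(metta_b)
    except ValueError:
        return False  # unparsable output fails the test outright
    if relation == "paraphrase":
        return a == b
    if relation == "negation":
        return b == ["not", a] or a == ["not", b]
    raise ValueError(f"unknown relation: {relation}")


# Hypothetical inputs: "Sam is a human" / "Sam is not a human".
print(consistent("negation", "(Human Sam)", "(not (Human Sam))"))
```

Structural equality is a strict criterion: two logically equivalent expressions that are written differently would fail it, so a real checker would likely need some normalization step first.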

Budget

$6,000 USD


Milestone 3 - Halfway milestone

Status
😐 Not Started
Description

3 months: start human labeling-process, while incrementally using findings to produce the silver dataset. 

Deliverables

By this time we expect to be able to deliver 2,000 gold pairs, as well as 20,000 silver pairs.

Budget

$8,500 USD


Milestone 4 - Project delivery (10,000 gold pairs)

Status
😐 Not Started
Description

4 months: delivery of the silver dataset (minimum 20,000 labeled pairs) and the gold dataset (10,000 pairs)

Deliverables

We finish the project, delivering the remaining 8,000 gold pairs.

Budget

$8,500 USD

