Leyu: Crowdsourcing Datasets

Betelhem_Dessie
Project Owner

Funding Requested

$150,000 USD

Expert Review
0
Community
3.1 (9)

Overview

Leyu (leyu.ai) is a platform dedicated to crowdsourcing datasets for low-resource languages, meticulously crafting language datasets to fuel innovation, particularly in AI development. It addresses data ownership and bias concerns while ensuring equitable compensation for contributors and combating harmful stereotypes. Through micro-work opportunities, Leyu empowers communities to actively participate in shaping the future of AI, giving them a voice and a stake in the process.

Proposal Description

How Our Project Will Contribute To The Growth Of The Decentralized AI Platform

  • Expanding Dataset Scope: Leyu enriches datasets for low-resource languages by crowdsourcing local language data through hybrid labeling methods, enhancing AI model accuracy.

  • Ethical Data Collection: Leyu ensures ethical data collection practices, prioritizing privacy, fairness, and fair compensation for contributors.

  • Driving Innovation: Leyu democratizes data creation, fosters local job opportunities, and accelerates AI adoption, expanding the platform's impact.

Our Team

Our team comprises 20 developers with CS degrees and knowledge of ML and NLP technology, plus 35 undergraduate interns. Team members have worked on products and systems including OpenCog's PLN inference framework, the back end of the Rejuve app, and Amharic language capability for the Desi/Desta robot. We have also worked on grassroots training projects reaching more than 30k young people in Ethiopia.

AI services (New or Existing)

Multilingual Speech Recognition

How it will be used

The crowdsourced data will be used with this AI service to train and develop models for the target languages.

Generative Language Models

How it will be used

Generate a corpus of text that users will then read aloud and record as speech data.

Multilingual Speech Recognition

How it will be used

To identify and organize the text and speech data, with assistance from human reviewers.

Company Name (if applicable)

Leyu (leyu.ai)

The core problem we are aiming to solve

The core problem Leyu aims to solve is the absence of quality and accurate datasets tailored to Ethiopia's linguistic nuances. This deficiency hampers AI system development, limiting adaptability, precision, and ethical considerations. Additionally, the limited scale and accuracy of existing datasets across various sectors like agriculture, education, and health hinder the creation of robust AI applications. Scaling efforts are hindered by cost, time constraints, and a lack of suitable platforms. Furthermore, concerns regarding data privacy, ownership, and copyright underscore the need for responsible and ethical AI development.

Our specific solution to this problem

Leyu offers a solution by providing a platform tailored to crowdsource datasets for underrepresented languages, starting in Ethiopia. We tackle crucial issues of data ownership and bias and ensure equitable compensation for contributors, thus empowering communities and promoting inclusivity in AI development. Through micro-task opportunities, Leyu meticulously constructs comprehensive datasets for the Amharic language. Moreover, we establish a marketplace where companies can access and license these datasets to drive innovation. By employing a hybrid human-automated labeling approach, Leyu expands dataset scale while upholding accuracy standards, thereby enabling the training of Large Language Models (LLMs). Upholding stringent data privacy and ethical standards, Leyu actively promotes responsible AI development. Furthermore, our crowdsourcing model not only fosters social responsibility but also fuels job creation, democratizing the data creation process and catalyzing positive socio-economic change at the local level.

Competition and USPs

Our solution's uniqueness lies in its specialization in crowdsourcing datasets for low-resource languages. While similar initiatives like Karya exist in other countries, they are notably absent from the African continent. Leyu's pioneering approach fills this gap, focusing on languages often overlooked by traditional methods. Initially targeting Amharic and later expanding to other languages, Leyu draws on untapped linguistic resources, providing valuable datasets for AI development. This positions Leyu as a trailblazer, attracting attention from organizations in need of comprehensive datasets for underrepresented languages. Our commitment to fair compensation and ethical data practices instills trust and supports sustained success. Leyu's emphasis on social responsibility and job creation resonates with stakeholders, enhancing its appeal and long-term viability.

Needed resources

We need senior advisory services in AI, ML and Data Science.

Existing resources

We will be using shared company resources, such as operations, office space, amenities, existing market expertise, and networks, to enhance efficiency, reduce costs, and leverage established connections and knowledge.

Open Source Licensing

Custom

We offer two licenses:

  1. Research (Non-Commercial): Free for researchers, academics, and non-profits. For educational, scientific use only. Prohibits commercial use.

  2. Commercial: Paid license for businesses and individuals. Allows use in products, services, and analyses for profit generation. Pricing varies.

 

Revenue Sharing Model

Custom Model

Custom Description:

Data will be available for purchase to those utilizing it for commercial purposes.

 

Proposal Video

Placeholder for Spotlight Day pitch presentations. Videos will be added by the DF team when available.

  • Total Milestones: 4

  • Total Budget: $150,000 USD

  • Last Updated: 20 May 2024

Milestone 1 - API Calls & Hostings

Description

This milestone represents the required reservation of 25% of your total requested budget for API calls or hosting costs. Because it is required we have prefilled it for you and it cannot be removed or adapted.

Deliverables

You can use this amount for payment of API calls on our platform. Use it to call other services or use it as a marketing instrument to have other parties try out your service. Alternatively you can use it to pay for hosting and computing costs.

Budget

$37,500 USD

Milestone 2 - Data Collection Platform Design Requirements

Description

This milestone establishes the project's foundation by completing critical design and analysis tasks:

  • Requirements Specification: Defines the platform's purpose, functionality, data sources, user roles, performance needs, and security measures.

  • Data Ingestion Platform Design: Outlines the architecture, technologies, data flow, transformation, validation, and error handling mechanisms.

  • Database Schema Design: Specifies the database structure, tables, relationships, constraints, and indexing for optimal storage and retrieval.

  • API Design: Details the interfaces for interacting with the platform, including endpoints, formats, authentication, and error codes.

Completing this milestone provides a clear roadmap for development, ensuring alignment with project goals and user needs.
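To make the Database Schema Design item more concrete, here is a minimal sketch of what tables, relationships, constraints, and an index could look like for speech collection. It is not the team's actual schema: every table name, column name, and sample value below is a hypothetical placeholder, and SQLite is used only because it ships with Python.

```python
import sqlite3

# Hypothetical schema: contributors record audio for text prompts.
SCHEMA = """
CREATE TABLE contributors (
    id INTEGER PRIMARY KEY,
    display_name TEXT NOT NULL
);
CREATE TABLE prompts (
    id INTEGER PRIMARY KEY,
    language TEXT NOT NULL,
    prompt_text TEXT NOT NULL
);
CREATE TABLE recordings (
    id INTEGER PRIMARY KEY,
    contributor_id INTEGER NOT NULL REFERENCES contributors(id),
    prompt_id INTEGER NOT NULL REFERENCES prompts(id),
    audio_uri TEXT NOT NULL UNIQUE,
    duration_seconds REAL NOT NULL CHECK (duration_seconds > 0)
);
CREATE INDEX idx_recordings_prompt ON recordings(prompt_id);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript(SCHEMA)

# Illustrative rows only; these are not real data.
conn.execute("INSERT INTO contributors (id, display_name) VALUES (1, 'demo-contributor')")
conn.execute("INSERT INTO prompts (id, language, prompt_text) VALUES (1, 'am', 'sample prompt')")
conn.execute(
    "INSERT INTO recordings (contributor_id, prompt_id, audio_uri, duration_seconds) "
    "VALUES (1, 1, 'audio/demo-0001.wav', 7.5)"
)

# Total recorded seconds for one prompt, served by the index on prompt_id.
total_seconds = conn.execute(
    "SELECT SUM(duration_seconds) FROM recordings WHERE prompt_id = 1"
).fetchone()[0]
```

The foreign keys express the relationships between contributors, prompts, and recordings, while the CHECK and UNIQUE constraints reject zero-length audio and duplicate file references at insert time.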

Deliverables

Requirements specification and software design documents for the data crowdsourcing platform

Budget

$59,961 USD

Milestone 3 - Launch platform to collect 500 hrs of speech data

Description

Launch a robust platform to collect 500+ hours of diverse speech data. Design includes user-friendly interfaces for recording, secure storage, data annotation tools, and quality assurance mechanisms. Scalable architecture supports future growth and ensures data privacy compliance.

Deliverables

This project delivers a comprehensive platform tailored for large-scale speech data collection. Key components include:

  • User Interface (UI): Intuitive UI for recording high-quality audio from various devices (mobile, web, etc.) with clear instructions and progress tracking.

  • Data Storage: Secure, scalable cloud storage for raw audio files and associated metadata, ensuring data integrity and availability.

  • Annotation Tools: Integrated tools for efficient transcription, labeling, and segmentation of speech data, supporting various annotation formats.

  • Quality Assurance: Automated and manual QA processes to verify audio quality, transcription accuracy, and data integrity, ensuring a high-quality dataset.

  • Data Privacy: Robust data anonymization and encryption mechanisms to protect user privacy and comply with relevant regulations (e.g. GDPR).

  • Scalability: Cloud-based architecture designed to handle large volumes of data and accommodate future growth in data collection efforts.

  • API: Integration with external tools and services via API for seamless data processing and analysis workflows.

This platform will empower researchers, developers, and organizations to gather diverse and high-quality speech data, accelerating advancements in speech recognition, natural language processing, and other voice-related technologies.
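The combination of automated and manual QA described above, together with the 500-hour target, could be sketched roughly as follows. This is an illustration under stated assumptions rather than the platform's implementation: the Recording fields, the duration bounds, the minimum number of peer ratings, and the rating threshold are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass, field
from statistics import mean

TARGET_HOURS = 500  # collection target stated in the milestone


@dataclass
class Recording:
    """One crowdsourced utterance. All field names are hypothetical."""
    recording_id: str
    duration_seconds: float
    transcript: str
    peer_ratings: list[int] = field(default_factory=list)  # 1-5 scores from reviewers


def passes_automated_checks(rec, min_seconds=1.0, max_seconds=30.0):
    """Automated QA: reject empty transcripts and implausible durations."""
    return bool(rec.transcript.strip()) and min_seconds <= rec.duration_seconds <= max_seconds


def passes_manual_review(rec, min_reviews=2, min_mean_rating=4.0):
    """Manual QA: require enough peer ratings and a high enough average."""
    return len(rec.peer_ratings) >= min_reviews and mean(rec.peer_ratings) >= min_mean_rating


def accepted_hours(recordings):
    """Total hours of audio that pass both QA stages."""
    seconds = sum(
        r.duration_seconds
        for r in recordings
        if passes_automated_checks(r) and passes_manual_review(r)
    )
    return seconds / 3600


# Illustrative sample, not real data.
sample = [
    Recording("r1", 12.0, "sample transcript", [5, 4]),
    Recording("r2", 0.3, "too short", [5, 5]),   # fails the automated check
    Recording("r3", 8.0, "low rated", [2, 3]),   # fails manual review
    Recording("r4", 6.0, "unreviewed"),          # not enough ratings yet
]
hours = accepted_hours(sample)
progress = hours / TARGET_HOURS  # fraction of the 500-hour target reached
```

Counting only recordings that pass both stages keeps the progress figure tied to usable data rather than raw uploads.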

Budget

$44,970 USD

Milestone 4 - Dataset of 500 hours

Description

This involves the delivery of a comprehensive dataset comprising 500 hours of audio and text data.

Deliverables

A comprehensive dataset consisting of 500 hours of audio and text data ready for use in various applications such as training machine learning models or conducting research.

Budget

$7,569 USD
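As a quick consistency check on the figures above, the short sketch below confirms that the four milestone budgets add up to the requested total and that Milestone 1 matches the required 25% reservation. The amounts are copied from the proposal; the variable and function names are only illustrative.

```python
# Milestone budgets in USD, copied from the proposal.
MILESTONE_BUDGETS = {
    "Milestone 1 - API Calls & Hostings": 37_500,
    "Milestone 2 - Data Collection Platform Design Requirements": 59_961,
    "Milestone 3 - Launch platform to collect 500 hrs of speech data": 44_970,
    "Milestone 4 - Dataset of 500 hours": 7_569,
}
TOTAL_REQUESTED = 150_000
RESERVED_SHARE = 0.25  # required reservation for API calls or hosting


def budget_shares(budgets, total):
    """Return each milestone's share of the total, as a percentage."""
    return {name: round(100 * amount / total, 1) for name, amount in budgets.items()}


milestone_total = sum(MILESTONE_BUDGETS.values())
reserved = int(TOTAL_REQUESTED * RESERVED_SHARE)
shares = budget_shares(MILESTONE_BUDGETS, TOTAL_REQUESTED)
```

The milestones sum to exactly $150,000, and they account for roughly 25%, 40%, 30%, and 5% of the budget respectively.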

Join the Discussion (4)

4 Comments
  • 0
    Gombilla
    Jun 2, 2024 | 6:05 PM

    Hey Dessie. This is some awesome stuff. But how will you ensure the accuracy and quality of the crowdsourced datasets, particularly for low-resource languages? Also, I think it will be important to address compensation for contributors and to ensure fairness and transparency in the process. Thanks

    • 0
      Betelhem_Dessie
      Jun 3, 2024 | 12:46 PM

      Hey, we will have peer reviews (for ratings) and also paid reviewers to validate the data. In terms of compensation, we have priced the data to ensure that a person working on the data gets at least double the minimum wage in Ethiopia.

  • 0
    HenriqC
    May 19, 2024 | 6:48 AM

    This is indeed a problem for some languages, and organized crowdsourcing could be at least part of the solution. In addition, micro-work opportunities are a great way to get people involved in the space in general. The 150k budget is 10% of the entire round. That gives the investment a bit of a risky flavor in my eyes, even though from the technical side this sounds relatively straightforward to implement. Nevertheless, I recall that the low-resource language empowerment idea has existed in SNET circles for a while and would definitely add real value. Your website could be added to the proposal. Is it icogacc.com?

    • 0
      Betelhem_Dessie
      May 19, 2024 | 6:11 PM

      Hey, thank you for the feedback. Our company websites are icogacc.com and icog-labs.com. The product has its own website, which is leyu.ai.

Reviews & Rating

9 ratings
  • 0
    Joseph Gastoni
    May 20, 2024 | 3:38 PM

    Overall

    4

    • Feasibility 4
    • Viability 3
    • Desirability 3
    • Usefulness 4
    Leyu, a platform for crowdsourcing datasets

    This proposal outlines Leyu, a platform for crowdsourcing datasets in low-resource languages. Here's a breakdown of its strengths and weaknesses:

    Feasibility:

    • High: Leveraging crowdsourcing and existing labeling methods makes this project highly feasible.
    • Strengths: The focus on a specific language (Amharic) and hybrid labeling approaches reduces complexity.
    • Weaknesses: Securing a large enough pool of contributors and ensuring data quality requires careful recruitment and training strategies.

    Viability:

    • Moderate: Success depends on attracting contributors, creating a sustainable compensation model, and convincing companies to purchase datasets.
    • Strengths: The focus on a market gap (low-resource languages) and ethical data practices offers a unique selling proposition.
    • Weaknesses: The proposal lacks details on the specific compensation model and its long-term financial sustainability.

    Desirability:

    • Moderate: For researchers, companies working with AI in Ethiopia, and potential contributors, this project can be desirable.
    • Strengths: The focus on improving AI accuracy for underrepresented languages and ethical data collection addresses current concerns.
    • Weaknesses: The proposal needs to clearly articulate the value proposition for contributors beyond micro-work opportunities.

    Usefulness:

    • High (potential): This project has the potential to significantly improve AI development for low-resource languages and empower local communities.
    • Strengths: The focus on crowdsourcing, data quality, and responsible AI development aligns with important trends.
    • Weaknesses: The proposal lacks details on how the project will measure its impact on AI development and community empowerment.

    Additional Points:

    • Developing a clear recruitment and training strategy for contributors is crucial for ensuring data quality and participant engagement.
    • Establishing a transparent and sustainable compensation model that incentivizes participation is essential.
    • Demonstrating the value proposition for companies by showcasing the quality and impact of the datasets is key to attracting buyers.

    Overall, the Leyu project has a strong potential to be a valuable tool for AI development and community empowerment in Ethiopia. Focusing on a clear compensation model, robust data quality measures, and demonstrating the value proposition for both contributors and companies can increase its effectiveness. By outlining a sustainable financial model and impact measurement strategy, this proposal can become even more compelling.

    Here are some strengths of this project:

    • Focuses on a specific market gap - crowdsourcing datasets for underrepresented languages, starting with Amharic.
    • Emphasizes ethical data collection practices, data privacy, and fair compensation for contributors.
    • Offers a hybrid human-automated labeling approach to ensure data quality and scalability.

    Here are some challenges to address:

    • Securing a large enough and engaged pool of contributors in Ethiopia for consistent data collection.
    • Developing a sustainable compensation model that incentivizes participation and ensures financial viability.
    • Clearly demonstrating the value proposition for companies and researchers who would purchase these datasets.

  • 0
    CLEMENT
    Jun 2, 2024 | 6:10 PM

    Overall

    4

    • Feasibility 3
    • Viability 3
    • Desirability 4
    • Usefulness 5
    Solves lack of dataset for low-resource languages

    Firstly, I think the budget ask is quite large for the task at hand. The team may want to look into this aspect, but I believe the Leyu project has the potential to make significant contributions to the SNET AI community by addressing the lack of comprehensive datasets for low-resource languages.

    Overall, by employing a hybrid human-automated labeling approach, upholding stringent data privacy and ethical standards, democratizing the data creation process, and promoting social responsibility, Leyu aligns with the principles of responsible AI development.

    Kudos to the team

  • 0
    Max1524
    Jun 8, 2024 | 12:43 AM

    Overall

    3

    • Feasibility 2
    • Viability 3
    • Desirability 3
    • Usefulness 3
    Project team has not yet specified the individual

    I see the team mentions 3 members of the project team, but I can only see the profile of Betelhem_Dessie as the project leader; the remaining 2 members, the Chief Advisor and the Production Director, have no personal information. Can the team soon provide enough personal information for me to have confidence in the feasibility?

  • 0
    TrucTrixie
    Jun 9, 2024 | 1:11 PM

    Overall

    3

    • Feasibility 3
    • Viability 2
    • Desirability 3
    • Usefulness 3
    Add a basis for evaluating work progress

    The advantage is that 4 important milestones are clearly presented with a budget for each milestone.
    What the team needs to improve, in my opinion, is to add more timelines for each milestone so that the community has a full basis to evaluate the progress of the work.

  • 0
    BlackCoffee
    Jun 10, 2024 | 12:03 AM

    Overall

    3

    • Feasibility 3
    • Viability 3
    • Desirability 3
    • Usefulness 3
    Necessary resources are not sufficient

    I notice that the necessary resources have not been fully presented by the team. It seems like they are still recruiting more staff and are still looking into additional services? This affects Feasibility.

  • 0
    Nicolad2008
    Jun 7, 2024 | 4:12 AM

    Overall

    3

    • Feasibility 3
    • Viability 3
    • Desirability 3
    • Usefulness 3
    The participation of the DF community

    The project has a lot of positive potential for AI development by building datasets for languages that few people use and by promoting fairness in data ownership. However, I find that this project also faces significant challenges. Quality assurance and information security when collecting data from the crowd are complicated and expensive tasks that require strict control mechanisms. The risk of data abuse is always present, especially as personal data becomes a valuable resource. Another difficulty is maintaining the long-term participation of the community, which requires Leyu to build strong trust and commitment among contributors, and that is not easy to achieve. The project also needs strong support from reputable organizations to build trust and provide necessary resources. Realizing these goals therefore requires continuous effort and careful management of quality, security, and community participation.

  • 0
    JeyGarg23
    May 18, 2024 | 10:32 AM

    Overall

    3

    • Feasibility 1
    • Viability 2
    • Desirability 4
    • Usefulness 5
    Too raw of a proposal

    I think if the milestones had been split, there would be more clarity.

    Betelhem_Dessie
    May 19, 2024 | 6:14 PM
    Project Owner

    Thank you for the feedback. We have elaborated more on the milestones.

  • 0
    Tu Nguyen
    May 23, 2024 | 7:20 AM

    Overall

    3

    • Feasibility 3
    • Viability 4
    • Desirability 3
    • Usefulness 4
    Crowdsourcing Datasets

    This proposal will address the issue of lack of quality and accurate data sets appropriate to the linguistic nuances of Ethiopia. This deficiency hinders the development of AI systems, limiting adaptability, accuracy, and ethical considerations. The solution of this proposal: they will provide a suitable platform with crowdsourced datasets for underrepresented languages, starting with Ethiopia. This is a useful solution for the Ethiopian community. Hopefully they will successfully implement this solution. 
    However, information about the team should be more detailed, and they should also share members' social network links. A budget of 150,000 USD is quite a lot, so I think they should define a more detailed budget based on milestones. Additionally, they should identify the start and end times of each milestone.

  • 0
    pindiyaa
    May 21, 2024 | 1:57 PM

    Overall

    2

    • Feasibility 3
    • Viability 2
    • Desirability 3
    • Usefulness 4
    Reviews & Rating for Leyu

    While the team has partnered with SingularityNET, there are concerns about their ability to deliver the project effectively. The team is still in the pending process, which suggests possible communication issues among team members. Additionally, the proposal mentions "democratizing the data creation process," yet the dataset is stored in a centralized database, which contradicts SingularityNET's decentralized AI service platform. Although the project has great potential to benefit the community, some improvements are needed in the proposal.

    Good luck with the process.

Summary

Overall Community

3.1

from 9 reviews
  • 5 stars: 0
  • 4 stars: 2
  • 3 stars: 6
  • 2 stars: 1
  • 1 star: 0

Feasibility

2.8

from 9 reviews

Viability

2.8

from 9 reviews

Desirability

3.2

from 9 reviews

Usefulness

3.8

from 9 reviews
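The summary figures can be reproduced from the nine individual reviews listed above. The sketch below assumes each summary value is a plain arithmetic mean rounded to one decimal place, which is an assumption about how the page computes them; the scores themselves are copied from the reviews.

```python
from statistics import mean

CRITERIA = ("overall", "feasibility", "viability", "desirability", "usefulness")

# Per-review scores in the order the reviews appear on the page.
REVIEWS = [
    # (overall, feasibility, viability, desirability, usefulness)
    (4, 4, 3, 3, 4),  # Joseph Gastoni
    (4, 3, 3, 4, 5),  # CLEMENT
    (3, 2, 3, 3, 3),  # Max1524
    (3, 3, 2, 3, 3),  # TrucTrixie
    (3, 3, 3, 3, 3),  # BlackCoffee
    (3, 3, 3, 3, 3),  # Nicolad2008
    (3, 1, 2, 4, 5),  # JeyGarg23
    (3, 3, 4, 3, 4),  # Tu Nguyen
    (2, 3, 2, 3, 4),  # pindiyaa
]

# Mean of each column, rounded to one decimal place.
averages = {
    name: round(mean(review[i] for review in REVIEWS), 1)
    for i, name in enumerate(CRITERIA)
}
```

Under that assumption the computed averages match the summary: 3.1 overall, 2.8 for feasibility, 2.8 for viability, 3.2 for desirability, and 3.8 for usefulness.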
