Customizable Code Assistant

Ammar Khairi
Project Owner

Funding Awarded

$30,000 USD

Expert Review
0
Community
0 (0)

Status

  • Overall Status

    🛠️ In Progress

  • Funding Transferred

    $25,000 USD

  • Max Funding Amount

    $30,000 USD

Funding Schedule

Milestone Release 1
$1,000 USD Transfer Complete 12 Jan 2024
Milestone Release 2
$4,000 USD Transfer Complete 12 Jan 2024
Milestone Release 3
$5,000 USD Transfer Complete 02 Feb 2024
Milestone Release 4
$15,000 USD Transfer Complete 01 Mar 2024
Milestone Release 5
$5,000 USD Pending TBD

Status Reports

Mar. 27, 2024

Status
😀 Excellent
Summary

All milestones up to onboarding are complete; onboarding should also be completed soon.


Video Updates

Customizable Code Assistant

9 February 2024

Project AI Services

No Service Available

Overview

This project addresses the accessibility gap in using and developing large language models (LLMs) by making them more usable and reducing computational costs. The aim is to make LLMs practical for more users by exploring transfer learning on small pre-trained models for new programming languages and tasks. The solution entails generating datasets, conducting performance-cost analysis, fine-tuning and evaluating models, creating GPU inference environments, and deploying a modular pipeline for customizable Code Assistants. The budget of $30,000 covers these milestones. Risks are mitigated through privacy measures, licensing compliance, and adapting to computation costs. The project is inspired by previous collaborative work and aims to contribute to the democratization of AI through accessible, open-source models and datasets.

Proposal Description

Company Name

Enigma AI

Service Details

The aim of this project is to address the accessibility gap and further efforts towards democratising the use and development of code LLMs. To make LLMs more usable, we first need to address their high computational cost. Smaller models are cheaper to run, but they are usually limited to a few popular programming languages. Hence, we will explore the feasibility of using transfer learning to adapt small pre-trained models to new programming languages and tasks. For more challenging tasks that require larger models and thus higher costs, we will use efficient training and inference methods to achieve good performance at reasonable cost.

Problem Description

Access to code LLMs is one of the important goals of AI development. The democratization of AI is an important factor in the progress of the field, and AI companies such as Meta and Stability are showing their commitment to it by publicly sharing their models and releasing their weights. However, this may not be enough, as the barrier to tuning and using LLMs is higher than just access to their weights. For the practitioner, the choices of model architecture and learning algorithm are not obvious, and exploring these options is costly because of the computation involved. Users of these tools, in turn, often have to choose between the high inference cost of running models locally and the monetary cost of access via hosted APIs (e.g. Copilot). To achieve the benefits of democratizing AI use and development, these issues need to be resolved.

Solution Description

Our approach to solving these issues addresses various factors:

Training Data: High-quality training data is essential to lowering the cost of training LLMs. Studies have shown that fine-tuning LLMs on smaller but more task-specific and higher-quality data is more beneficial to the performance of the model. Hence, we will focus on creating automated and semi-automated approaches to collect, clean, and filter our data in such a way that we can achieve significant performance in less training time with less data, thus lowering the overall cost of training.

Model Size: The scaling laws of LLMs show that better generalization is expected with larger models. However, our previous work has also shown the feasibility of the alternative approach of using relatively small LLMs on more specific tasks. This approach lets us significantly cut the cost and time of fine-tuning LLMs, which makes access to these tools more equitable.

Privacy: The use of hosted APIs in code generation and understanding tasks poses significant privacy concerns for companies working with sensitive data, as these APIs require sending code outside the company's servers. Our proposed API addresses this problem through its open-source and privacy-oriented setup. This setup is divided into two main paths. In the first path, customers call the API directly to obtain code suggestions for a variety of tasks. The second path is for more sensitive cases: users apply our open-sourced efficient fine-tuning and data filtering methods to fine-tune their own code LLMs, which can then be used locally.

Milestone & Budget

Generate Datasets:

Description: Create an automated (or semi-automated) scraping pipeline on GitHub for permissively licensed code that represents a set of selected tasks. The scraping will be based on programming languages, documentation, dates, keywords, and other criteria relevant to the task. The pipeline should also include a filtering step where scraped files are filtered based on specific quality measures such as number of GitHub stars, length of the file, comments, recency, and additional relevant quality standards.

Status: Previous work includes pipelines for scraping based only on programming languages, filtering based on the aforementioned standards, and sharing the datasets on HuggingFace.

Cost: $1,000 USD

Estimated Time: 10 working days (2 weeks)
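The filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the `ScrapedFile` schema, the function names, and every threshold (minimum stars, line counts, comment ratio, maximum age) are assumptions chosen only to show how the listed quality measures might combine.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ScrapedFile:
    """Metadata for one scraped source file (hypothetical schema)."""
    path: str
    repo_stars: int
    last_commit: date
    content: str


def comment_ratio(content: str, comment_prefix: str = "#") -> float:
    """Fraction of non-empty lines that start with the comment prefix."""
    lines = [ln.strip() for ln in content.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return sum(ln.startswith(comment_prefix) for ln in lines) / len(lines)


def passes_quality_filter(
    f: ScrapedFile,
    today: date,
    min_stars: int = 10,              # assumed threshold
    min_lines: int = 5,               # assumed threshold
    max_lines: int = 2000,            # assumed threshold
    min_comment_ratio: float = 0.05,  # assumed threshold
    max_age_days: int = 3 * 365,      # assumed threshold
) -> bool:
    """Apply the star, length, comment, and recency checks together."""
    n_lines = len(f.content.splitlines())
    return (
        f.repo_stars >= min_stars
        and min_lines <= n_lines <= max_lines
        and comment_ratio(f.content) >= min_comment_ratio
        and (today - f.last_commit).days <= max_age_days
    )
```

In practice the thresholds would be tuned per task and per programming language, and the comment prefix would depend on the language being scraped.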

-------------------------------------

Performance Cost Analysis:

Description: The objective of this milestone is to establish a roadmap for fine-tuning LLMs on selected tasks. The roadmap should include details about the optimal training hyperparameters that balance training cost and model performance. We intend to create these recommendations based on current literature and empirical experiments.

Status: Previous work has established a roadmap for smaller code LLMs (< 1 billion parameters) in the task of autoregressive code completion in different programming languages.

Cost: $4,000 USD

Estimated Time: 20 working days (4 weeks)
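One way to read "balance training cost and model performance" is as a Pareto selection over experiment runs: keep only the configurations that no other run beats on both cost and score. The sketch below assumes a hypothetical `Run` record with a cost in GPU hours and a single benchmark score; the proposal does not specify how the roadmap is actually derived.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    """One fine-tuning experiment (hypothetical record)."""
    name: str
    gpu_hours: float  # training cost
    score: float      # benchmark score, higher is better


def pareto_front(runs: List[Run]) -> List[Run]:
    """Return runs that are not dominated by any other run.

    A run is dominated if another run costs no more, scores no worse,
    and is strictly better on at least one of the two.
    """
    front = []
    for r in runs:
        dominated = any(
            o.gpu_hours <= r.gpu_hours
            and o.score >= r.score
            and (o.gpu_hours < r.gpu_hours or o.score > r.score)
            for o in runs
        )
        if not dominated:
            front.append(r)
    return sorted(front, key=lambda r: r.gpu_hours)
```

The surviving runs form the menu of recommended settings: each one is the best score available at its cost level.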

-------------------------------------

Finetuning and Evaluation:

Description: Fine-tune our off-the-shelf efficient and specialised models and evaluate each model on its respective benchmark. In this stage, we also establish the steps of the second part of our API for one-call training of models with sensitive data.

Status: Not Started

Cost: $5,000 USD

Estimated Time: 15 working days (3 weeks)

-------------------------------------

Hosting/API Calls:

Description: Create a GPU inference environment where different models can be accessed through either API calls or an interactive dashboard. The hosting will be on Amazon SageMaker to enable efficient loading and invoking of models with minimal costs. The interactive dashboard is intended to showcase the versatility and performance of the models, while the direct API call can be used for integration in IDEs such as VS Code. Finally, the one-call API will also be developed to allow customers to train the models on their own datasets and use the models locally.

Status: Previous work has explored options for optimised inference from code LLMs as well as establishing a CPU-hosted interactive code generation interface. We expect existing knowledge and architecture to be beneficial when transferring from HuggingFace to other hosting platforms.

Cost: $15,000 USD

Estimated Time: 30 working days (6 weeks)
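A direct API call against a hosted model might look like the sketch below. It uses the `invoke_endpoint` call of the boto3 SageMaker Runtime client, but the endpoint name, the request payload (`inputs` / `parameters`), and the response shape (`[{"generated_text": ...}]`) are assumptions modelled on common text-generation containers, not the project's published interface. A stub client is included so the example runs without AWS credentials.

```python
import io
import json


def complete_code(client, endpoint_name: str, prompt: str, max_new_tokens: int = 64) -> str:
    """Send a prompt to a hosted model and return the generated text.

    `client` is expected to expose invoke_endpoint(), as the boto3
    "sagemaker-runtime" client does. Payload and response formats are assumed.
    """
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    result = json.loads(response["Body"].read())
    return result[0]["generated_text"]


class StubClient:
    """Offline stand-in that mimics the shape of an invoke_endpoint response."""

    def invoke_endpoint(self, EndpointName, ContentType, Body):
        prompt = json.loads(Body)["inputs"]
        reply = json.dumps([{"generated_text": prompt + "    pass"}])
        return {"Body": io.BytesIO(reply.encode("utf-8"))}
```

Against a real deployment, the stub would be replaced with `boto3.client("sagemaker-runtime")` and the name of an existing endpoint; an IDE extension could wrap the same call.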

-------------------------------------

Onboarding:

Description: Onboarding the services into the Singularity Marketplace

Status: Not Started

Cost: $5,000 USD

Estimated Time: 15 working days (3 weeks)

-------------------------------------

Total:

Cost: $30,000 USD

Estimated Time: 90 working days (18 weeks)
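As a quick consistency check, the totals can be recomputed from the per-milestone figures listed above. The only assumption is a five-day working week, which matches the day and week counts given for each milestone.

```python
# (milestone, cost in USD, estimated working days), copied from the breakdown above
milestones = [
    ("Generate Datasets", 1000, 10),
    ("Performance Cost Analysis", 4000, 20),
    ("Finetuning and Evaluation", 5000, 15),
    ("Hosting/API Calls", 15000, 30),
    ("Onboarding", 5000, 15),
]

total_cost = sum(cost for _, cost, _ in milestones)
total_days = sum(days for _, _, days in milestones)
total_weeks = total_days // 5  # five working days per week

print(f"Total cost: ${total_cost:,} USD")
print(f"Total time: {total_days} working days ({total_weeks} weeks)")
```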

Marketing & Competition

Our marketing strategy centres around the unique perks our approach provides:

  • Quality Training Data: Our solution focuses on automating the collection, cleaning, and filtering of data. By curating task-specific, superior-quality data, we reduce training time and costs significantly, making LLMs more accessible to all.
  • Open Source, Simplicity, and Privacy: Complexity and lack of privacy are common problems with LLM APIs; our solution addresses them by offering two distinct paths. First, users can call our API directly for code suggestions across a range of tasks. Second, for sensitive data, we provide open-source tools for efficient fine-tuning and data filtering, allowing users to fine-tune their own LLMs locally while maintaining data privacy.
  • Customization and Versatility: Our solution is designed to adapt to any code or task that users require. It offers a high degree of customization, ensuring that it meets the unique needs of each practitioner or organisation.
  • Cost-Effective Pricing: We stand out with the option for a cost-effective pricing model that charges only for training, unlike the traditional pay-per-API-call or subscription models. This significantly reduces the financial burden on users, making LLMs more accessible.

Related Links

Google-Colab Notebooks: These notebooks were used to train the models and generate the results. Copies of these notebooks can be found in the

. The notebooks are also linked below:

Links to Trained Models, Datasets, and Inference Engine:

Long Description


Funding Amount

$30,000 USD


Risk and Mitigation

Privacy Concerns:

One of the primary risks associated with our project is the potential exposure of sensitive information when fine-tuning models with personal datasets. To mitigate this risk, we have implemented a comprehensive privacy safeguard protocol. Before any data is utilized in the fine-tuning process, we recommend a thorough inspection of personal datasets to identify and remove any potential secrets or sensitive information. Additionally, we employ state-of-the-art AI-based code analysis tools to scan the code for any inadvertent disclosures. This proactive approach ensures that user data remains confidential and secure throughout the fine-tuning process.
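The inspection step can be partly automated. The sketch below is a simple regex-based scan, deliberately swapped in for the AI-based analysis tools mentioned above, which the proposal does not name. The pattern set is illustrative and far from exhaustive; the function and pattern names are assumptions.

```python
import re

# Illustrative patterns only; a real scanner would cover many more secret types.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
    "hardcoded_credential": re.compile(
        r"(?i)\b(?:password|secret|api_key|token)\b\s*=\s*['\"][^'\"]{8,}['\"]"
    ),
}


def find_secrets(source: str):
    """Return (line_number, pattern_name) pairs for every suspicious line."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings
```

Files with any finding could be dropped from the fine-tuning set or flagged for manual review before training starts.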

Intellectual Property (IP) Infringement:

To prevent any inadvertent usage of unlicensed code during the model training phase, we have adopted a stringent policy of exclusively scraping permissively licensed code from reputable sources. This policy is aimed at minimizing the risk of inadvertently including copyrighted or proprietary code in our training datasets. By adhering strictly to this policy, we reduce the likelihood of IP infringement issues arising during the project.
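A licence allowlist is one straightforward way to enforce such a policy during scraping. The sketch below assumes each repository record carries an SPDX licence identifier under a hypothetical `license` key; the allowlist itself is an example set of well-known permissive licences, not the project's confirmed list.

```python
# Example allowlist of permissive SPDX identifiers (assumed, not the project's list).
PERMISSIVE_SPDX = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "ISC", "Unlicense"}
_ALLOWED = {s.casefold() for s in PERMISSIVE_SPDX}


def is_permissive(spdx_id) -> bool:
    """True only for a known permissive licence; missing licences are rejected."""
    return spdx_id is not None and spdx_id.casefold() in _ALLOWED


def keep_permissive(repos):
    """Filter repository records (dicts with a 'license' key) to permissive ones."""
    return [r for r in repos if is_permissive(r.get("license"))]
```

Rejecting repositories with no detectable licence is the conservative choice here, since unlicensed code is not permissively licensed by default.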

High Computation Costs:

Another potential risk is the unforeseen increase in computation costs, particularly when it comes to GPU resources required for inference and training, which may exceed our estimated budget. To address this risk, we have contingency plans in place. In the event that computational costs exceed expectations, we will implement limitations on the size of the baseline Large Language Models (LLMs) we fine-tune. This proactive measure ensures that we can stay within budget constraints without compromising the quality of our service or overburdening users with unexpected costs. Additionally, we continually monitor and optimize our computational resource usage to maintain cost-efficiency throughout the project's lifecycle.
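The contingency of limiting baseline model size can be expressed as a small selection rule: estimate the cost of fine-tuning each candidate and keep the largest one that fits the remaining budget. The candidate fields, GPU-hour estimates, and hourly rate below are all hypothetical inputs; the proposal gives no such figures.

```python
def largest_affordable_model(candidates, budget_usd: float, usd_per_gpu_hour: float):
    """Pick the largest candidate whose estimated fine-tuning cost fits the budget.

    Each candidate is a dict with 'name', 'params_b' (billions of parameters)
    and 'gpu_hours' (estimated fine-tuning time). Returns None if nothing fits.
    """
    affordable = [
        c for c in candidates
        if c["gpu_hours"] * usd_per_gpu_hour <= budget_usd
    ]
    if not affordable:
        return None
    return max(affordable, key=lambda c: c["params_b"])
```

If GPU prices rise, re-running the rule with the new hourly rate shows directly which baseline size remains within budget.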

Open Source

This project was inspired by previous work done as part of an MSc dissertation at the University of Edinburgh, in collaboration with the Amazon Data Centre Edinburgh.

That work was a concerted effort to bridge the gap between advanced AI technologies and their practical usability, particularly in the domain of code intelligence. By focusing on accessibility, usability, and empirical understanding, it contributed to the ongoing narrative of democratisation in AI. The empirical insights gained through extensive experimentation shed light on the intricacies of fine-tuning code LLMs. These insights help practitioners navigate the complexities of model training, save resources, and ultimately drive innovation more effectively. The shared models and datasets are among the most downloaded for their specific task on the Hugging Face platform.

In the same spirit, we aim to open-source our off-the-shelf datasets and smaller models.

 

 

Our Team

Ammar Khairi: MSc Artificial Intelligence, University of Edinburgh - Machine Learning Engineer / Data Scientist.

Mukhtar Mohammed: MSc Artificial Intelligence, University of Edinburgh - Machine Learning Engineer / Deployment Engineer.

Muhammed Saeed: MSc Artificial Intelligence, University of Saarland - Machine Learning Engineer.


AI Services

Proposal Video

Customizable Code Assistant - #DeepFunding IdeaFest Round 3

26 September 2023
  • Total Milestones

    5

  • Total Budget

    $30,000 USD

  • Last Updated

    3 Apr 2024

Milestone 1 - Generate Datasets

Status
😀 Completed
Description

Create an automated (or semi-automated) scraping pipeline on GitHub for permissively licensed code that represents a set of selected tasks. The scraping will be based on programming languages, documentation, dates, keywords, and other criteria relevant to the task. The pipeline should also include a filtering step where scraped files are filtered based on specific quality measures such as number of GitHub stars, length of the file, comments, recency, and additional relevant quality standards.

Deliverables

Budget

$1,000 USD

Milestone 2 - Performance Cost Analysis

Status
😀 Completed
Description

The objective of this milestone is to establish a roadmap for fine-tuning LLMs on selected tasks. The roadmap should include details about the optimal training hyperparameters that balance training cost and model performance. We intend to create these recommendations based on current literature and empirical experiments.

Deliverables

Budget

$4,000 USD

Milestone 3 - Finetuning and Evaluation

Status
😀 Completed
Description

Fine-tune our off-the-shelf efficient and specialised models and evaluate each model on its respective benchmark. In this stage, we also establish the steps of the second part of our API for one-call training of models with sensitive data.

Deliverables

Budget

$5,000 USD

Milestone 4 - Hosting/API Calls

Status
😀 Completed
Description

Create a GPU inference environment where different models can be accessed through either API calls or an interactive dashboard. The hosting will be on Amazon SageMaker to enable efficient loading and invoking of models with minimal costs. The interactive dashboard is intended to showcase the versatility and performance of the models, while the direct API call can be used for integration in IDEs such as VS Code. Finally, the one-call API will also be developed to allow customers to train the models on their own datasets and use the models locally.

Deliverables

Budget

$15,000 USD

Milestone 5 - Onboarding

Status
🧐 In Progress
Description

Onboarding the services into the Singularity Marketplace

Deliverables

Budget

$5,000 USD

