Milestone & Budget
Generate Datasets:
Description: Create an automated (or semi-automated) scraping pipeline on GitHub for permissively licensed code that represents a set of selected tasks. Scraping will be keyed on programming languages, documentation, dates, keywords, and other criteria relevant to the task. The pipeline will also include a filtering step that screens scraped files against quality measures such as the number of GitHub stars, file length, comments, recency, and other relevant quality standards (an illustrative sketch of such a filter follows this milestone).
Status: Previous work includes pipelines for scraping based only on programming languages, filtering based on the aforementioned standards, and sharing the datasets on HuggingFace.
Cost: $1,000
Estimated Time: 10 working days (2 weeks)
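To make the filtering step concrete, here is a minimal sketch of a quality filter, assuming each scraped file is represented as a dict of metadata; all field names and thresholds are illustrative placeholders rather than our final criteria:

```python
from datetime import datetime, timezone

# Hypothetical thresholds; the real values would be tuned per task.
MIN_STARS = 50                   # repository popularity proxy
MIN_LINES, MAX_LINES = 10, 2000  # drop trivial or enormous files
MIN_COMMENT_RATIO = 0.05         # require at least some documentation
MAX_AGE_DAYS = 5 * 365           # recency cut-off

def passes_quality_filter(record: dict) -> bool:
    """Return True when a scraped file meets the (illustrative) quality bar.

    `record` is assumed to carry 'stars', 'content' (source text),
    'comment_lines', and a timezone-aware ISO-8601 'last_modified' field.
    """
    n_lines = len(record["content"].splitlines())
    if record["stars"] < MIN_STARS:
        return False
    if not MIN_LINES <= n_lines <= MAX_LINES:
        return False
    if record["comment_lines"] / max(n_lines, 1) < MIN_COMMENT_RATIO:
        return False
    age = datetime.now(timezone.utc) - datetime.fromisoformat(record["last_modified"])
    return age.days <= MAX_AGE_DAYS

# Usage: keep only files that pass before sharing the dataset.
# filtered = [r for r in scraped_records if passes_quality_filter(r)]
```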
-------------------------------------
Performance Cost Analysis:
Description: The objective of this milestone is to establish a roadmap for fine-tuning LLMs on the selected tasks. The roadmap will detail the training hyperparameters that best balance training cost against model performance, with recommendations drawn from the current literature and from our own empirical experiments (an illustrative cost comparison follows this milestone).
Status: Previous work has established a roadmap for smaller code LLMs (< 1 billion parameters) in the task of autoregressive code completion in different programming languages.
Cost: $4,000
Estimated Time: 20 working days (4 weeks)
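As an illustration of how the roadmap would weigh cost against performance, the sketch below compares candidate fine-tuning configurations by estimated GPU cost; every rate, configuration, and score here is a hypothetical placeholder, not a measured result:

```python
# Illustrative cost/performance bookkeeping for the fine-tuning roadmap.
GPU_HOURLY_RATE_USD = 1.50  # assumed on-demand price for one training GPU

candidate_configs = [
    {"lr": 5e-5, "batch_size": 8,  "epochs": 3, "est_gpu_hours": 12.0},
    {"lr": 1e-4, "batch_size": 16, "epochs": 2, "est_gpu_hours": 7.0},
    {"lr": 3e-4, "batch_size": 32, "epochs": 1, "est_gpu_hours": 4.0},
]

def cheapest_within_tolerance(results, tolerance=0.01):
    """Pick the cheapest configuration whose validation score is within
    `tolerance` (relative) of the best score observed.

    `results` is a list of dicts carrying 'score' and 'est_cost_usd'.
    """
    best_score = max(r["score"] for r in results)
    good_enough = [r for r in results if r["score"] >= best_score * (1 - tolerance)]
    return min(good_enough, key=lambda r: r["est_cost_usd"])

# After each configuration is trained and evaluated, attach its cost and
# a (made-up) validation score, then let the roadmap recommend one setting.
results = []
for cfg, score in zip(candidate_configs, [0.41, 0.40, 0.36]):
    results.append({**cfg,
                    "score": score,
                    "est_cost_usd": cfg["est_gpu_hours"] * GPU_HOURLY_RATE_USD})

print(cheapest_within_tolerance(results, tolerance=0.05))  # -> the 7 GPU-hour config
```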
-------------------------------------
Fine-tuning and Evaluation:
Description: Fine-tune our efficient, specialised off-the-shelf models and evaluate each model on its respective benchmark (a fine-tuning sketch follows this milestone). In this stage, we also lay out the second part of our API: one-call training of models on sensitive data.
Status: Not Started
Cost: $5,000
Estimated Time: 15 working days (3 weeks)
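A minimal sketch of one plausible fine-tuning setup, using the Hugging Face transformers and peft libraries for parameter-efficient (LoRA) fine-tuning; the base model and hyperparameters are illustrative assumptions, not our final choices:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE_MODEL = "Salesforce/codegen-350M-mono"  # placeholder small code LLM

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# LoRA trains a small set of injected adapter weights instead of the full
# model, which is what keeps fine-tuning cheap. Values are illustrative.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["qkv_proj"],  # attention projection in the CodeGen architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights

# Training would then proceed with the usual transformers Trainer on the
# task-specific dataset from the first milestone, followed by evaluation
# on that task's benchmark.
```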
-------------------------------------
Hosting/API Calls:
Description: Create a GPU inference environment where different models can be accessed through either API calls or an interactive dashboard. Hosting will be on Amazon SageMaker to enable efficient loading and invocation of models at minimal cost. The interactive dashboard is intended to showcase the versatility and performance of the models, while the direct API call can be used for integration into code IDEs such as VS Code (an illustrative client sketch follows this milestone). Finally, the one-call API will also be developed to allow customers to train the models on their own datasets and use the resulting models locally.
Status: Previous work has explored options for optimised inference from code LLMs as well as establishing a CPU-hosted interactive code generation interface. We expect existing knowledge and architecture to be beneficial when transferring from HuggingFace to other hosting platforms.
Cost: $15,000
Estimated Time: 30 working days (6 weeks)
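As a sketch of the direct API-call path, the snippet below invokes a deployed SageMaker endpoint with boto3; the endpoint name and the request/response schema are assumptions for illustration, since the real contract depends on how the inference container is written:

```python
import json
import boto3

# Hypothetical endpoint name, created when the model is deployed to SageMaker.
ENDPOINT_NAME = "enigma-code-llm-endpoint"

runtime = boto3.client("sagemaker-runtime")

def complete_code(prompt: str, max_new_tokens: int = 64) -> str:
    """Call the hosted model once and return its completion."""
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps({"inputs": prompt, "max_new_tokens": max_new_tokens}),
    )
    return json.loads(response["Body"].read())["generated_text"]

# An IDE plugin (e.g. for VS Code) would call complete_code() on the text
# before the user's cursor and surface the returned suggestion inline.
```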
-------------------------------------
Onboarding:
Description: Onboard the services onto the Singularity Marketplace.
Status: Not Started
Cost: $5,000
Estimated Time: 15 working days (3 weeks)
-------------------------------------
Total:
Cost: $30,000
Estimated Time: 90 working days (18 weeks)
Long Description
Company Name
Enigma AI
Summary
The aim of this project is to close the accessibility gap and further the democratisation of the use and development of code LLMs. To make LLMs more usable, we must first address their high computational cost. Using smaller models, however, typically limits coverage to a few popular programming languages. We will therefore explore the feasibility of using transfer learning to adapt small pre-trained models to new programming languages and tasks. For more challenging tasks that require larger models, and thus higher costs, we will use efficient training and inference methods to achieve good performance at reasonable cost.
Funding Amount
$30,000
The Problem to be Solved
Access to code LLMs is one of the important goals of AI development. The democratisation of AI is an important factor in the progress of the field, and AI companies such as Meta and Stability AI are showing their commitment to it by publicly sharing their models and releasing their weights. However, this may not be enough, as the barrier to tuning and using LLMs is higher than mere access to their weights. For the practitioner, the choices of model architecture and learning algorithm are not obvious, and exploring these options is expensive due to high computation costs. Users of these tools, meanwhile, often have to choose between the high inference cost of running models locally and the monetary cost of access via hosted APIs (e.g. GitHub Copilot). To realise the benefits of democratised AI use and development, these issues need to be resolved.
Our Solution
Our approach to solving these issues addresses various factors:
Training Data: High-quality training data is essential to lowering the cost of training LLMs. Studies have shown that fine-tuning LLMs on smaller but more task-specific, higher-quality datasets is more beneficial to model performance. Hence, we will focus on creating automated and semi-automated approaches to collect, clean, and filter our data so that we can achieve strong performance with less data and less training time, lowering the overall cost of training.
Model Size: The scaling laws of LLMs show that better generalisation is expected from larger models. However, our previous work has also shown the feasibility of the alternative approach: using relatively small LLMs on more specialised tasks. This approach lets us significantly cut the cost and time of fine-tuning LLMs, ensuring more equitable access to these tools.
Privacy: The use of hosted APIs for code generation and understanding poses significant privacy concerns for companies working with sensitive data, as these APIs require sending code outside the company's own servers. Our proposed API addresses this problem through its open-source and privacy-oriented setup, which is divided into two main paths (sketched schematically after this list). In the first path, customers call the API directly to get code suggestions for a variety of tasks. The second path is for more sensitive cases: users apply our open-sourced efficient fine-tuning and data-filtering methods to fine-tune their own code LLMs, which can then be used locally.
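The split between the two paths can be pictured as follows; the stub classes stand in for the real hosted endpoint and a locally fine-tuned model, and all names are hypothetical:

```python
class HostedAPI:
    """Path 1: thin client for the hosted inference endpoint (stub)."""
    def complete(self, prompt: str) -> str:
        return f"<completion from hosted endpoint for {prompt!r}>"

class LocalModel:
    """Path 2: a model fine-tuned with our open-source tools, run on-premises (stub)."""
    def generate(self, prompt: str) -> str:
        return f"<completion from local model for {prompt!r}>"

def get_completion(prompt: str, sensitive: bool) -> str:
    # Sensitive code never leaves the user's own infrastructure (path 2);
    # everything else takes the one-call hosted path (path 1).
    return LocalModel().generate(prompt) if sensitive else HostedAPI().complete(prompt)

print(get_completion("def parse_config(", sensitive=True))
```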
Marketing Strategy
Our marketing strategy centres on the unique advantages our approach provides:
- Quality Training Data: Our solution focuses on automating the collection, cleaning, and filtering of data. By curating task-specific, superior-quality data, we reduce training time and costs significantly, making LLMs more accessible to all.
- Open Source, Simplicity, and Privacy: Complexity and lack of privacy are common problems with LLM APIs; our solution addresses both by offering two distinct paths. First, users can call our API directly for code suggestions across a range of tasks. Second, for sensitive data, we provide open-source tools for efficient fine-tuning and data filtering, allowing users to fine-tune their own LLMs locally while maintaining data privacy.
- Customization and Versatility: Our solution is designed to adapt to any code or task that users require. It offers a high degree of customization, ensuring that it meets the unique needs of each practitioner or organisation.
- Cost-Effective Pricing: We stand out with the option for a cost-effective pricing model that charges only for training, unlike the traditional pay-per-API-call or subscription models. This significantly reduces the financial burden on users, making LLMs more accessible.
Our Project Milestones and Cost Breakdown
The milestone-by-milestone breakdown, with costs and timelines, is given in the Milestone & Budget section above.
Risk and Mitigation
Privacy Concerns:
One of the primary risks associated with our project is the potential exposure of sensitive information when fine-tuning models on personal datasets. To mitigate this risk, we will follow a comprehensive privacy-safeguard protocol. Before any data is used in the fine-tuning process, we recommend a thorough inspection of personal datasets to identify and remove any secrets or sensitive information. Additionally, we will employ state-of-the-art AI-based code-analysis tools to scan the code for inadvertent disclosures. This proactive approach ensures that user data remains confidential and secure throughout the fine-tuning process.
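To make the inspection step concrete, here is a minimal secret-scanning sketch; the patterns are a small illustrative subset of what a dedicated scanner (e.g. detect-secrets or gitleaks) would cover:

```python
import re

# Illustrative patterns only; a production scanner would use a far larger rule set.
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "private_key": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    "generic_api_key": re.compile(
        r"(?i)(api[_-]?key|secret|token)\s*[:=]\s*['\"][^'\"]{8,}['\"]"
    ),
}

def find_secrets(source: str) -> list[tuple[str, int]]:
    """Return (pattern_name, line_number) pairs for suspected secrets."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((name, lineno))
    return hits

# Files with any hit are excluded (or redacted) before fine-tuning begins.
```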
Intellectual Property (IP) Infringement:
To prevent any inadvertent usage of unlicensed code during the model training phase, we have adopted a stringent policy of exclusively scraping permissively licensed code from reputable sources. This policy is aimed at minimizing the risk of inadvertently including copyrighted or proprietary code in our training datasets. By adhering strictly to this policy, we reduce the likelihood of IP infringement issues arising during the project.
High Computation Costs:
Another potential risk is an unforeseen increase in computation costs, particularly for the GPU resources required for inference and training, which could exceed our estimated budget. To address this risk, we have contingency plans in place: should computational costs exceed expectations, we will limit the size of the baseline LLMs we fine-tune. This measure ensures that we can stay within budget constraints without compromising the quality of our service or burdening users with unexpected costs. Additionally, we will continually monitor and optimise our computational resource usage to maintain cost efficiency throughout the project's lifecycle.
Open Source
This project was inspired by previous work carried out as part of an MSc dissertation at the University of Edinburgh, in collaboration with the Amazon Data Centre Edinburgh.
That work presented a concerted effort to bridge the gap between advanced AI technologies and their practical usability, particularly in the domain of code intelligence. By focusing on accessibility, usability, and empirical understanding, it contributed to the ongoing narrative of democratisation in AI. The empirical insights gained through extensive experimentation shed light on the intricacies of fine-tuning code LLMs, equipping practitioners with valuable knowledge to navigate the complexities of model training, save resources, and ultimately drive innovation more effectively. The shared models and datasets are among the most downloaded for their specific tasks on the popular Hugging Face platform.
In the same spirit, we aim to open-source our off-the-shelf datasets and smaller models.
Our Team
Ammar Khairi: MSc Artificial Intelligence, University of Edinburgh. Machine Learning Engineer / Data Scientist.
Mukhtar Mohammed: MSc Artificial Intelligence, University of Edinburgh. Machine Learning Engineer / Deployment Engineer.
Muhammed Saeed: MSc Artificial Intelligence, University of Saarland. Machine Learning Engineer.
Related Links
Google Colab Notebooks: These notebooks were used to train the models and generate the results; copies of the notebooks are linked below:
Links to Trained Models, Datasets, and Inference Engine: