Memory-augmented LLMs: Retrieving Information using a Gated Encoder LLM (RIGEL)

Luke Mahoney
Project Owner


Funding Awarded

$140,000 USD

Expert Review: 0
Community: 0 (0)

Status

  • Overall Status

    🛠️ In Progress

  • Funding Transferred

    $40,000 USD

  • Max Funding Amount

    $140,000 USD

Funding Schedule

  • Milestone Release 1: $20,000 USD, Transfer Complete, 23 May 2024
  • Milestone Release 2: $20,000 USD, Transfer Complete, 30 May 2024
  • Milestone Release 3: $30,000 USD, Pending, TBD
  • Milestone Release 4: $35,000 USD, Pending, TBD
  • Milestone Release 5: $35,000 USD, Pending, TBD

Video Updates

Memory-augmented LLMs: Retrieving Information using a Gated Encoder LLM (RIGEL)

9 February 2024

Project AI Services

No Service Available

Overview

MLabs, an AI and Blockchain consultancy, seeks funding through RFP3 - Memory-augmented LLMs. They propose to address a key shortcoming of large language models (LLMs) such as ChatGPT: the questionable veracity of generated responses. MLabs aims to develop a new module for LLMs that enables them to reference the source material, providing explanation and evidence to support responses and thereby reducing the risk of factually incorrect outputs, or hallucinations. Their solution involves compressing context vectors and storing them externally, using hierarchical self-attention for efficient data compression and retrieval.

The project includes three parts: large-scale contextual compression experiments, training and evaluating the compression and retrieval module, and integration with the existing LLM. The total funding request is USD 140,000 for a 36-week project. Risks include the novelty of the concept, vector retrieval system performance, machine learning training, software development, and project management, which they aim to mitigate with their experienced team and contingency plans.

MLabs has a proven track record in ML/AI deployments and aims to leverage their expertise to enhance LLMs.

Proposal Description

Company Name

MLabs – AI and Blockchain Consultancy

Service Details

Large Language Models based on Generative Pre-trained Transformers (GPTs) have taken the AI world and the larger community by storm over the past few years. Networks like ChatGPT have established a new level of human-like natural language production. Despite the recent advances and public excitement, such models have been shown to possess profound shortcomings regarding the veracity of the information in generated responses. Both users and governing bodies have correctly identified this as a major source of concern. Indeed, this single factor may lead to restrictions on the use of LLMs and prevent their widespread adoption.

We have developed a new module for GPT-style LLMs that utilises the linguistic and semantic knowledge elicited in the initial "encode" phase of the LLM as an index to the factual information used to generate a response. Crucially, this gives us a much-needed reference to the source material, which in turn provides both explanation and evidence to support the response. This is an essential augmentation, releasing the LLM from much of the negative implications of storing semantics and facts in the same network structure (resulting in so-called hallucination: production of unsupported or factually incorrect output), and enables users to effectively create domain-specific LLMs without having to retrain or fine-tune the billions of parameters in existing networks.

We are excited by the far-reaching potential for our work and look forward to bringing a fully deployed solution to life through the SingularityNET platform.

Solution Description

Implementation Details

The proposed architecture reuses moving parts that are now well-established components of successful LLMs: embeddings, position encoding, and attention for a paired encoder/decoder system.

Our solution is to feed articles into the network, further compress the output of the encoder, known as the context vector, and store it in a library external to the main parameters of the network. When the user makes a read query, we fetch the original context vector for the article that is most similar to the query’s context vector and feed both into a slightly modified decoder.
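The write/read flow described above can be sketched as a minimal external library of (context vector, article pointer) pairs. This is an illustration only: the class and names are hypothetical, the encoder and compression steps are omitted, and the linear scan stands in for the clustered retrieval described later.

```python
import math

def cosine(a, b):
    # Cosine similarity between two context vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class ContextLibrary:
    """External store of (context vector, article pointer) pairs,
    kept outside the main parameters of the network."""
    def __init__(self):
        self.entries = []  # list of (vector, article_id)

    def write(self, vector, article_id):
        self.entries.append((vector, article_id))

    def read(self, query_vector):
        # Fetch the stored entry most similar to the query's context vector.
        return max(self.entries, key=lambda e: cosine(e[0], query_vector))

lib = ContextLibrary()
lib.write([1.0, 0.0, 0.0], "article-A")
lib.write([0.0, 1.0, 0.0], "article-B")
vec, aid = lib.read([0.9, 0.1, 0.0])
print(aid)  # article-A
```

In the full system both the retrieved vector and the query vector are then fed into the slightly modified decoder.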

Our breakthrough is that we have found that multiple levels of self-attention can be used effectively as a contextual compression of quite large amounts of input data - whole articles, for example. In write mode, our extra module hierarchically compresses the information being stored into a vector data store and maintains a pointer to the original text. In read mode, we efficiently locate the most relevant stored information in this store and use the lowest level of vector compression as context to the decoder. In every way, the LLM behaves exactly as if the information had been supplied at the same time as the question, and the response is thereby focused and more factual.

If required, the decoder can produce a response directly from the retrieved information - essentially providing a summary of the original text that can be reproduced, together with any metadata stored alongside. Context vectors are fetched from the library using a neural network that compresses the context vectors into clusters of vectors, each with broadly similar areas of discourse. This clustering acts as a kind of locality-sensitive hashing that mitigates the explosion in complexity caused by the large dimensionality in the retrieval module.
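The clustering idea above can be illustrated with a toy bucketed store: vectors are assigned to their nearest centroid, and a read query is compared only against the bucket of its own nearest centroid. The centroids and article names are invented for illustration; the real module learns the clusters with a neural network rather than taking them as given.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class ClusteredStore:
    """Context vectors bucketed by nearest centroid. Restricting search
    to one bucket acts like locality-sensitive hashing, avoiding a scan
    of the whole library on every read."""
    def __init__(self, centroids):
        self.centroids = centroids
        self.buckets = {i: [] for i in range(len(centroids))}

    def _nearest_centroid(self, v):
        return max(range(len(self.centroids)),
                   key=lambda i: cosine(self.centroids[i], v))

    def write(self, v, article_id):
        self.buckets[self._nearest_centroid(v)].append((v, article_id))

    def read(self, q):
        # Search only the bucket whose centroid best matches the query.
        bucket = self.buckets[self._nearest_centroid(q)]
        return max(bucket, key=lambda e: cosine(e[0], q))[1]

store = ClusteredStore([[1.0, 0.0], [0.0, 1.0]])
store.write([0.9, 0.1], "politics-article")
store.write([0.1, 0.9], "biology-article")
print(store.read([0.8, 0.2]))  # politics-article
```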

The hierarchical compression module needs to be able to compress articles such that the compressed version encodes the subject matter of the article at increasing levels of compression and abstraction. This needs to be consistent regardless of the number of words in the article or the original level of abstraction. Unless this is explicitly handled, we will end up comparing compressions of short, detailed articles with long, general articles - even if these are about the same subject, the compression vectors will not line up. Therefore, in addition to simply compressing the input data, we also need to know the original level of abstraction, i.e., how generic or specific the information is. This is not encoded in the embedding and must instead be learned from text data tagged with its level of abstraction. Wikipedia "rabbit holes" provide ideal training data for exactly this problem: the abstraction-level signal is given by depth in the Wikipedia hierarchy. The rest of the compression module is trained using a stack of autoencoders in the normal way.

The retrieval module has two jobs to do: retrieve the correct article stored during the write phase, and do so efficiently. The second can be evaluated by plotting the relationship between the amount of stored information and retrieval time. We expect this to be bounded by O(log N). The multiplier on this retrieval time needs to be determined by experiment at full scale. The accuracy of retrieval is more difficult to assess, since the read input does not contain the exact same text as the written article. To evaluate accuracy, we intend to input several Wikipedia articles into the module and read them back using extracted information from the same articles. The extracts will be of successively smaller size to evaluate the point at which the retrieval fails. We will first check that reading the whole article correctly selects the correct context. We will then test a single paragraph from each article, and then a single sentence containing one of the keywords from the article. Finally, we will attempt retrieval using only the article keywords - in combinations or singly.
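The shrinking-extract evaluation protocol can be sketched as a small harness. The bag-of-words `embed` is a deliberately crude stand-in for the encoder, and the two-article corpus is invented; only the shape of the evaluation loop (hit rate per extract granularity) reflects the plan above.

```python
import math

def embed(text, vocab):
    # Toy stand-in for the LLM encoder: bag-of-words counts over a fixed vocabulary.
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, store):
    # Return the stored article id whose vector best matches the query.
    return max(store, key=lambda aid: cosine(store[aid], query_vec))

def hit_rate(articles, extracts, vocab):
    """articles: {id: full text}; extracts: {id: [successively smaller probes]}.
    Returns the retrieval hit rate at each extract granularity."""
    store = {aid: embed(text, vocab) for aid, text in articles.items()}
    rates = []
    for level in range(len(next(iter(extracts.values())))):
        hits = sum(retrieve(embed(extracts[aid][level], vocab), store) == aid
                   for aid in articles)
        rates.append(hits / len(articles))
    return rates

articles = {"Rabbit": "rabbit burrow grass rabbit",
            "Star": "star fusion light star"}
extracts = {"Rabbit": ["rabbit burrow grass", "rabbit"],   # paragraph, then keyword
            "Star": ["star fusion light", "star"]}
vocab = ["rabbit", "burrow", "grass", "star", "fusion", "light"]
print(hit_rate(articles, extracts, vocab))  # [1.0, 1.0]
```

The real evaluation would plot these hit rates against extract size to find the point at which retrieval fails.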

Our research and development work has established that our general approach is valid, and with suitable funding, can be developed into a new gold standard for contextual, domain-constrained LLMs. We intend to optimise our algorithm for retrieving similar context vectors and have identified several variations to our main approach including permutation-based context vector encoding, anisotropic vector similarity search, and techniques borrowed from biological genetic search algorithms such as BLAST.

Our final, deployable, memory-augmented LLM will be based on Llama or an equivalent open-source LLM available at the time the project is underway, with our new read/write compression and attention module acting as an intelligent gating network for the information store. Our current work suggests that the augmented LLM will require additional computation, which is roughly constant time for writing and scales logarithmically with the size of the data store for retrieving new-context read queries. Due to our store organisation, retrieval of sequences of related contexts is much faster.

Milestone & Budget

Milestones and Plan of Work

The project requires the development and training of the new compression, retrieval and gating module at a large scale and suitable for use with a general-purpose LLM. To facilitate this process we also require the associated data preparation, testing and evaluation. There are three main parts.

Part 1: Large-scale Contextual Compression Experiments

In this part, we will prepare a large corpus of text data to evaluate the compression characteristics of large amounts of semi-structured data. This will validate our approach at scale and guide the training of the compression and retrieval module in the next part.

Milestone 1: At-scale Document Compression: 5 weeks ($20K)

Summary: Feed all Wikipedia articles into Llama 2 and store the context vectors in a database.

Collate and prepare all of Wikipedia as an at-scale set of source documents and transform them into the context vector representation generated by Llama's encoder output. Store these vectors in a database for future retrieval and processing. A database storing links to Wikipedia articles and their corresponding context vectors will also be maintained for testing purposes.
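A minimal sketch of this pipeline, assuming an `encode` function standing in for Llama's encoder output: each context vector is stored (here as JSON in SQLite) alongside the link back to its source article. The table layout and the toy encoder are illustrative assumptions, not the project's actual schema.

```python
import sqlite3
import json

def build_vector_store(articles, encode, db_path=":memory:"):
    """articles: iterable of (url, text) pairs; encode: the encoder step,
    stubbed below. Stores each context vector alongside the link to the
    source article, indexed by URL for later retrieval and testing."""
    db = sqlite3.connect(db_path)
    db.execute("""CREATE TABLE IF NOT EXISTS context_vectors (
                      url TEXT PRIMARY KEY,
                      vector TEXT NOT NULL)""")
    with db:  # commit the batch insert as one transaction
        db.executemany(
            "INSERT OR REPLACE INTO context_vectors VALUES (?, ?)",
            ((url, json.dumps(encode(text))) for url, text in articles))
    return db

# Toy encoder standing in for the real context-vector computation.
toy_encode = lambda text: [float(len(text)), float(text.count(" "))]

db = build_vector_store([("wiki/Rabbit", "rabbit text"),
                         ("wiki/Star", "star")], toy_encode)
n, = db.execute("SELECT COUNT(*) FROM context_vectors").fetchone()
print(n)  # 2
```

At Wikipedia scale the vectors would live in a dedicated vector database rather than a single SQLite table, but the link-plus-vector record is the same.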

Engineering Hours: 430

KPI: Successful transformation of 99.5% of Wikipedia articles into context vector representations with corresponding links stored in the database, properly indexed, with no data loss or corruption.

Milestone 2: Calibration of Hierarchical Document Embedding: 5 weeks ($20K)

Summary: Identify how similar articles are based on the graph of Wikipedia article links. Store the graph and similarities for future use.

Build a directed multi-graph capturing where articles are linked to each other. Use graph theory algorithms to identify the path length through the lowest common ancestor between articles and identify articles with high betweenness. This data will be used in the next part to train the contextual compression such that similar articles produce similar compressed vectors at suitable granularity. This graph can be made available as a deliverable if desired. Training data will then be produced ready for the next part of the project - we are happy to supply this data as a deliverable, too.
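The link graph and path-length computation can be sketched in a few lines; the article names are invented, links are treated as undirected for distance purposes, and betweenness centrality (which a library such as NetworkX provides) is omitted for brevity.

```python
from collections import defaultdict, deque

def build_link_graph(links):
    """links: iterable of (from_article, to_article) pairs, one per
    hyperlink. Treated as undirected here so BFS gives link distance."""
    graph = defaultdict(set)
    for src, dst in links:
        graph[src].add(dst)
        graph[dst].add(src)
    return graph

def path_length(graph, a, b):
    """Shortest link path between two articles via breadth-first search;
    the articles on this path play the role of common ancestors."""
    seen, frontier, depth = {a}, deque([a]), {a: 0}
    while frontier:
        node = frontier.popleft()
        if node == b:
            return depth[node]
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                depth[nxt] = depth[node] + 1
                frontier.append(nxt)
    return None  # articles are in disconnected components

graph = build_link_graph([("Rabbit", "Mammal"), ("Mammal", "Animal"),
                          ("Leporidae", "Rabbit"), ("Star", "Astronomy")])
print(path_length(graph, "Rabbit", "Animal"))  # 2
print(path_length(graph, "Rabbit", "Star"))   # None
```

Short path lengths between articles then serve as the similarity signal for training the contextual compression.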

Engineering Hours: 430

KPI: Creation of a directed multi-graph with 99.5% of Wikipedia articles accurately represented. Calculation of path lengths through lowest common ancestors for at least 98% of article pairs. Identification of top 0.5% of articles with the highest betweenness centrality.

Part 2: Training and Evaluation of Compression and Retrieval Module

Our context compression and retrieval module will be trained on the large-scale datasets produced in the previous part and evaluated on held-out data. This will experimentally prove the concept at a deployable scale and identify any algorithmic developments necessary.

Milestone 3: Context Compression Neural Network Training: 8 weeks ($30K)

Summary: Train a neural network to compress the context vectors at multiple levels in a hierarchical fashion.

We now train a neural network to compress context vectors hierarchically so that similar articles produce vectors with cosine similarity near 1, and dissimilar ones produce vectors with cosine similarity near 0. (As a technical sidenote: since cosine similarity ranges from -1 to 1, it may seem intuitive that dissimilar articles should score -1. That would be misleading, however, as a cosine similarity of -1 indicates vectors that are negatively correlated, not merely unrelated. A score of 0, meaning the vectors are orthogonal, is the right target for uncorrelated articles.)

This network will be able to produce multiple levels of compressed vectors at increasing abstraction, as informed by our calibration experiments in part one, while allowing linear and modified softmax layers to identify matches.

We will deliver a trained neural network that can compress the context vectors into a conceptually related metric space at multiple levels of abstraction. It will be implemented in PyTorch and consist of a cascade of attention, linear and normalisation layers, each producing a smaller and more abstract vector than the last. These compressed vectors will be used to identify related articles for quick context vector retrieval in the next step.
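The training targets above can be illustrated with a simple pairwise loss. The deliverable itself is a PyTorch cascade of attention, linear and normalisation layers; this pure-Python `pair_loss` is only a sketch of the objective, penalising deviation from cosine 1 for similar pairs and cosine 0 for dissimilar pairs.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def pair_loss(va, vb, similar):
    """Squared error against the milestone's targets: cosine ~1 for
    similar article pairs, ~0 (orthogonal, not opposite) for dissimilar."""
    target = 1.0 if similar else 0.0
    return (cosine(va, vb) - target) ** 2

print(pair_loss([1.0, 0.0], [1.0, 0.0], similar=True))    # 0.0: identical, labelled similar
print(pair_loss([1.0, 0.0], [0.0, 1.0], similar=False))   # 0.0: orthogonal, labelled dissimilar
print(pair_loss([1.0, 0.0], [0.0, 1.0], similar=True))    # 1.0: orthogonal but labelled similar
```

The similarity labels themselves come from the link-graph calibration in Part 1.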

Engineering Hours: 650

KPI: Successful training of a neural network that:

  • Achieves an average cosine similarity score of >= 0.95 for similar articles and <= 0.05 for dissimilar articles in validation datasets.
  • Produces context compression at a minimum of three hierarchical levels, each resulting in consistent reductions in vector dimensions as demonstrated by size benchmarks.
  • Has a well-documented PyTorch implementation, including comprehensive model architecture details, training methodology, and achieved metrics on training and validation datasets.
  • Is able to retrieve compressed vectors within a latency of less than 50ms per article from the provided dataset during testing.

Milestone 4: Context Vector Retrieval Training: 8 weeks ($35K)

Summary: Use the hierarchical vectors to generate a tree that allows for quick vector retrieval.

The hierarchy of contextually compressed abstraction vectors produced in the previous task will be used for retrieval. The retrieval query is simply the context vector output from the standard encoder part of the LLM. This query will also be compressed at multiple levels, corresponding to the levels used in the original storage operations. Each abstraction level vector provides a search vector for the retrieval system.

Retrieval proceeds as follows: bottom-up context compression of the query, then bottom-up retrieval of the most similar stored vectors. When the most similar vector (or vectors) is identified, the lowest-level (leaf) vector, corresponding to the original output of the encoder for that piece of information, can be passed on to the decoder.

This system can be thought of as a cascade of similarity networks and a gating mechanism to avoid comparison with every piece of information stored in the system. Using a modified version of softmax, we can, in principle, keep a small set of the most similar vectors active at each stage.

If more than one article is identified by the retrieval network we can choose to either:

  • Generate a response using each article as context and combine the output of the decoder.
  • Combine the context of the two articles before sending them to the decoder (using another attention head as a data fusion module).
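The cascade can be sketched as a beam search over the hierarchy, shown here as a descent from the most abstract level: at each depth only the `beam` most similar nodes stay active, mimicking the modified-softmax gating. The node layout, the two-level query, and `beam` are illustrative assumptions, not the project's actual data structures.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(roots, query_levels, beam=2):
    """roots: most-abstract nodes of the hierarchy. Each node is a dict
    with 'vec' (its compressed vector at that level) and either
    'children' (deeper, less abstract nodes) or 'article' (a leaf).
    query_levels[d] is the query compressed to match depth d."""
    active, depth = roots, 0
    while True:
        q = query_levels[min(depth, len(query_levels) - 1)]
        # Gate: keep only the `beam` most similar nodes at this level.
        active = sorted(active, key=lambda n: cosine(n["vec"], q),
                        reverse=True)[:beam]
        if "article" in active[0]:
            return active[0]["article"]
        active = [c for node in active for c in node["children"]]
        depth += 1

roots = [
    {"vec": [1.0, 0.0], "children": [
        {"vec": [1.0, 0.2], "article": "A1"},
        {"vec": [1.0, -0.2], "article": "A2"}]},
    {"vec": [0.0, 1.0], "children": [
        {"vec": [0.2, 1.0], "article": "B1"}]},
]
print(retrieve(roots, [[0.9, 0.1], [1.0, 0.3]], beam=1))  # A1
```

With `beam` greater than 1, several leaves can survive to the end, which is exactly the multi-article case handled by the two combination strategies above.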

Engineering Hours: 760

KPI:

  • Compression-Driven Retrieval: Demonstrate that the hierarchical compression of the query consistently results in faster retrievals than non-hierarchical methods. Target: Average retrieval times should be reduced by at least 75%.
  • Similarity Accuracy: When retrieving vectors, the system should have a hit rate of at least 95% in finding the most similar vectors from the database during testing.
  • Decisioning between Retrievals: Implementation of both described retrieval mechanisms (response generation using each article and context combination) with clear documentation on performance metrics for each method.

Part 3: Integration and System-Level Testing and Evaluation

The components developed in the first two parts will be brought together and integrated with the existing LLM. The resulting memory-augmented LLM will be thoroughly tested at scale.

Milestone 5: LLM Modifications and Integration: 10 weeks ($35K)

Summary: Integrate the additional document context vector into the decoder section of Llama’s network and retrain.

The above components will be brought together with the original LLM to provide memory augmentation. We will need some gating logic for the encoder so that write, read and query prompts are properly dealt with, and further gating logic for the second attention layer in the decoder so that the context vector for the correctly retrieved article is used for generating the response.

Once integration is completed the whole system will be tested so that accuracy and speed of operation can be evaluated. Any fine-tuning of parameters can take place at this time to ensure that appropriate weight is given to the query and the retrieved information. As a cross-check, we can use the metadata pointing to the original data to determine that the new module is operating adequately.

The delivery for this stage will be a modified version of a Llama-like LLM that can accept articles for reading, writing and as constrained context for query/response generation.

Engineering Hours: 760

KPI:

  • Seamless Integration: Successful integration of context vector retrieval and gating mechanisms into the encoder and decoder sections of Llama's network without any functional disruptions or anomalies.
  • Functional Testing: Post-integration, the system should demonstrate at least a 95% success rate in correctly gating write, read, and query prompts during testing.
  • Attention Layer Gating Accuracy: Ensure the second attention layer in the decoder correctly utilizes the retrieved article's context vector for response generation with an accuracy rate of at least 98%.
  • Operational Speed and Accuracy: Post-integration, the LLM should exhibit no more than a 5% increase in response time and maintain or improve upon its original accuracy metrics.

TOTAL for all 5 Milestones: 36 Weeks (USD 140K)


Long Description

Company Name

MLabs – AI and Blockchain Consultancy

Request for Proposal Pool

RFP3 - Memory-augmented LLMs


Funding Amount

USD 140,000

Our Solution

Implementation Details

The proposed architecture reuses moving parts that are now well-established components of successful LLMs: embeddings, position encoding, and attention for a paired encoder/decoder system.

Our solution is to feed articles into the network, further compress the output of the encoder, known as the context vector, and store it in a library external to the main parameters of the network. When the user makes a read query, we fetch the original context vector for the article that is most similar to the query’s context vector and feed both into a slightly modified decoder.

Our breakthrough is that we have found that multiple levels of self-attention can be used effectively as a contextual compression of quite large amounts of input data - whole articles, for example. In write mode, our extra module hierarchically compresses the information being stored into a vector data store and maintains a pointer to the original text. In read mode, we efficiently locate the most relevant stored information in this store and use the lowest level of vector compression as context to the decoder. In every way, the LLM behaves exactly as if the information had been supplied at the same time as the question, and the response is thereby focused and more factual.

If required, the decoder can produce a response directly from the retrieved information - essentially providing a summary of the original text that can be reproduced, together with any metadata stored alongside. Context vectors are fetched from the library using a neural network that compresses the context vectors into clusters of vectors, each with broadly similar areas of discourse. This clustering acts as a kind of locality-sensitive hashing that mitigates the explosion in complexity caused by the large dimensionality in the retrieval module.

The hierarchical compression module needs to be able to compress articles such that the compressed version encodes the subject matter of the article at increasing levels of compression and abstraction. This needs to be consistent regardless of the number of words in the article or the original level of abstraction. Unless this is explicitly handled, we will end up comparing compressions of short, detailed articles with long, general articles - even if these are about the same subject the compression vectors will not line up. Therefore, in addition to simply compressing the input data, we also need to know the original level of abstraction; i.e., how generic or specific the information is. This is not explicitly encoded in the embedding and needs to be explicitly learned from text data which is tagged with this level of abstraction. Wikipedia "rabbit holes" provide perfect training data for exactly this problem. The abstraction level signal is provided by the depth in the Wikipedia hierarchy. The rest of the compression module is trained using a stack of autoencoders in the normal way.

The retrieval module has two jobs to do: retrieve the correct article stored during the write phase, and do so efficiently. The second can be evaluated by plotting the relationship between the amount of stored information and retrieval time. We expect this to be bounded by O(logN). The multiplier on this retrieval time needs to be determined by experiment at full scale. The accuracy of retrieval is more difficult to assess since the read input does not contain the exact same text as the written article. To evaluate accuracy we intend to input several Wikipedia articles into the module and read them back using extracted information from the same articles. The extracts will be of successively smaller size to evaluate the point at which the retrieval fails. We will first check that reading the whole article correctly selects the correct context. We will then test a single paragraph from each article, and then a single sentence containing one of the keywords from the article. Finally, we will attempt retrieval using only the article keywords - in combinations or singly.

Our research and development work has established that our general approach is valid, and with suitable funding, can be developed into a new gold standard for contextual, domain-constrained LLMs. We intend to optimise our algorithm for retrieving similar context vectors and have identified several variations to our main approach including permutation-based context vector encoding, anisotropic vector similarity search, and techniques borrowed from biological genetic search algorithms such as BLAST.

Our final, deployable, memory-augmented LLM will be based on Llama or an equivalent open-source LLM available at the time the project is underway, with our new read/write compression and attention module acting as an intelligent gating network for the information store. Our current work suggests that the augmented LLM will require additional computation, which is roughly constant time for writing and scales logarithmically with the size of the data store for retrieving new-context read queries. Due to our store organisation, retrieval of sequences of related contexts is much faster.

Our Project Milestones and Cost Breakdown

Milestones and Plan of Work

The project requires the development and training of the new compression, retrieval and gating module at a large scale and suitable for use with a general-purpose LLM. To facilitate this process we also require the associated data preparation, testing and evaluation. There are three main parts.

Part 1: Large-scale Contextual Compression Experiments

In this part, we will prepare a large corpus of text data to evaluate the compression characteristics of large amounts of semi-structured data. This will validate our approach at scale and guide the training of the compression and retrieval module in the next part.

Milestone 1: At-scale Document Compression: 5 weeks ($20K)

Summary: Feed all Wikipedia articles into Llama 2 and store the context vectors in a database.

Collate and prepare all of Wikipedia as an at-scale set of source documents and transform them into the context vector representation generated by Llama's encoder output. Store these vectors in a database for future retrieval and processing. A database storing links to Wikipedia articles and their corresponding context vectors will also be maintained for testing purposes.

Engineering Hours: 430

KPI: Successful transformation of 99.5% of Wikipedia articles into context vector representations with corresponding links stored in the database, properly indexed, with no data loss or corruption.

Milestone 2: Calibration of Hierarchical Document Embedding: 5 weeks ($20K)

Summary: Identify how similar articles are based on the graph of Wikipedia article links. Store the graph and similarities for future use.

Build a directed multi-graph capturing where articles are linked to each other. Use graph theory algorithms to identify the path length through the lowest common ancestor between articles and identify articles with high betweenness. This data will be used in the next part to train the contextual compression such that similar articles produce similar compressed vectors at suitable granularity. This graph can be made available as a deliverable if desired. Training data will then be produced ready for the next part of the project - we are happy to supply this data as a deliverable, too.

Engineering Hours: 430

KPI: Creation of a directed multi-graph with 99.5% of Wikipedia articles accurately represented. Calculation of path lengths through lowest common ancestors for at least 98% of article pairs. Identification of top 0.5% of articles with the highest betweenness centrality.

Part 2: Training and Evaluation of Compression and Retrieval Module

Our context compression and retrieval module will be trained on the large-scale datasets produced in the previous part and evaluated on held-out data. This will experimentally prove the concept at a deployable scale and identify any algorithmic developments necessary.

Milestone 3: Context Compression Neural Network Training: 8 weeks ($30K)

Summary: Train a neural network to compress the context vectors at multiple levels in a hierarchical fashion.

We now train a neural network to compress context vectors hierarchically so that similar articles produce vectors with cosine similarity near 1, and dissimilar ones produce vectors that have a cosine similarity near 0. (As a technical sidenote, since the cosine similarity can range from -1 to 1, it may make intuitive sense that dissimilar articles should have a cosine similarity of -1. This, however, would be misleading as a cosine similarity of -1 indicates that the articles are negatively correlated, not completely uncorrelated. A cosine similarity score of 0 truly means that the articles have no correlation.)

This network will be able to produce multiple levels of compressed vectors at increasing abstraction, as informed by our calibration experiments in part one, while allowing linear and modified softmax layers to identify matches.

We will deliver a trained neural network that can compress the context vectors into a conceptually related metric space at multiple levels of abstraction. It will be implemented in PyTorch and consist of a cascade of attention, linear and normalisation layers, each producing a smaller and more abstract vector than the last. These compressed vectors will be used to identify related articles for quick context vector retrieval in the next step.

Engineering Hours: 650

KPI: Successful training of a neural network that:

  • Achieves an average cosine similarity score of >= 0.95 for similar articles and <= 0.05 for dissimilar articles in validation datasets.
  • Produces context compression at a minimum of three hierarchical levels, each resulting in consistent reductions in vector dimensions as demonstrated by size benchmarks.
  • Has a well-documented PyTorch implementation, including comprehensive model architecture details, training methodology, and achieved metrics on training and validation datasets.
  • Is able to retrieve compressed vectors within a latency of less than 50ms per article from the provided dataset during testing.

Milestone 4: Context Vector Retrieval Training: 8 weeks ($35K)

Summary: Use the hierarchical vectors to generate a tree that allows for quick vector retrieval.

The hierarchy of contextually compressed abstraction vectors produced in the previous task will be used for retrieval. The retrieval query is simply the context vector output from the standard encoder part of the LLM. This query will also be compressed at multiple levels, corresponding to the levels used in the original storage operations. Each abstraction level vector provides a search vector for the retrieval system.

Retrieval proceeds as follows: bottom-up context compression of the query, then bottom-up retrieval of the most similar stored vectors. When the most similar vector (or vectors) is identified the lowest level (leaf) vector, corresponding to the original output of the encoder for that piece of information, can be passed on to the decoder.

This system can be thought of as a cascade of similarity networks and a gating mechanism to avoid comparison with every piece of information stored in the system. Using a modified version of softmax, we can, in principle, keep a small set of the most similar vectors active at each stage.

If more than one article is identified by the retrieval network we can choose to either:

  • Generate a response using each article as context and combine the output of the decoder.
  • Combine the context of the two articles before sending them to the decoder (using another attention head as a data fusion module).

Engineering Hours: 760

KPI:

  • Compression-Driven Retrieval: Demonstrate that the hierarchical compression of the query consistently results in faster retrievals than non-hierarchical methods. Target: Average retrieval times should be reduced by at least 75%.
  • Similarity Accuracy: When retrieving vectors, the system should have a hit rate of at least 95% in finding the most similar vectors from the database during testing.
  • Decisioning between Retrievals: Implementation of both described retrieval mechanisms (response generation using each article and context combination) with clear documentation on performance metrics for each method.

Part 3: Integration and System-Level Testing and Evaluation

The components developed in the first two parts will be brought together and integrated with the existing LLM. The resulting memory-augmented LLM will be thoroughly tested at scale.

Milestone 5: LLM Modifications and Integration: 10 weeks ($35K)

Summary: Integrate the additional document context vector into the decoder section of Llama’s network and retrain.

The above components will be brought together with the original LLM to provide memory augmentation. We will need gating logic for the encoder so that write, read, and query prompts are handled correctly, and further gating logic for the second attention layer in the decoder so that the context vector of the correctly retrieved article is used when generating the response.
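As a toy illustration of the write/read/query gating, the dispatcher below routes prompts to one of three modes. The keyword routing and the `store:`/`recall:` prefixes are invented for this sketch; the actual system would gate on encoder states rather than surface text.

```python
from enum import Enum

class Mode(Enum):
    WRITE = "write"   # compress and store a new article's context vectors
    READ = "read"     # retrieve a stored article's context directly
    QUERY = "query"   # retrieve context, then generate a response

def gate_prompt(prompt: str) -> Mode:
    """Toy prompt router standing in for the learned encoder gating logic."""
    text = prompt.lower()
    if text.startswith("store:"):
        return Mode.WRITE
    if text.startswith("recall:"):
        return Mode.READ
    return Mode.QUERY  # default path: retrieval-augmented generation

print(gate_prompt("Store: Wikipedia article on Orion"))  # Mode.WRITE
print(gate_prompt("Who discovered the Orion Nebula?"))   # Mode.QUERY
```

Whatever form the real gate takes, the important property is the same: each prompt activates exactly one of the three paths, so writes never leak into generation and queries never mutate the store.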

Once integration is complete, the whole system will be tested so that accuracy and speed of operation can be evaluated. Any fine-tuning of parameters can take place at this stage to ensure that appropriate weight is given to the query and the retrieved information. As a cross-check, we can use the metadata pointing to the original data to verify that the new module is operating correctly.

The deliverable for this stage will be a modified Llama-like LLM that accepts articles for writing (storage), reading (retrieval), and as constrained context for query/response generation.

Engineering Hours: 760

KPI:

  • Seamless Integration: Successful integration of context vector retrieval and gating mechanisms into the encoder and decoder sections of Llama's network without any functional disruptions or anomalies.
  • Functional Testing: Post-integration, the system should demonstrate at least a 95% success rate in correctly gating write, read, and query prompts during testing.
  • Attention Layer Gating Accuracy: Ensure the second attention layer in the decoder correctly utilizes the retrieved article's context vector for response generation with an accuracy rate of at least 98%.
  • Operational Speed and Accuracy: Post-integration, the LLM should exhibit no more than a 5% increase in response time and maintain or improve upon its original accuracy metrics.

TOTAL for all 5 Milestones: 36 Weeks (USD 140K)

Risk and Mitigation

Novelty Induced Risk

This is a very new concept that, to our knowledge, no one has attempted at this scale before. While we accept that there is associated technical risk, we are confident that the basic approach is sound and that we can overcome unforeseen blockers. We have been realistic in our resource estimates and have included technical contingency as appropriate. Additionally, our team comprises professionals who are experienced in the realm of ML and neural networks and have a solid track record of producing novel solutions on time and within budget.

Vector Retrieval System Performance At Scale

Our vector retrieval module uses a specific approach that has not been tested at full scale. While we believe it will work as described, we have identified several other strong candidate solutions. Should it prove necessary, we can mitigate this risk by falling back to one of these contingency techniques; provision for this has been made in our plan.

Machine Learning

Training large deep-learning models on new data carries a degree of risk owing to the intricacies of selecting network architectures and optimising parameters. Our team is well versed in this process and has decades of experience in following a rigorous and principled approach. We believe the residual risk to be minimal.

Software Development

MLabs specialises in mission-critical software development and has well-established procedures and guidelines for delivering robust, production-level code deployed at scale. Any risks in this area are sufficiently mitigated by our organizational expertise and procedures.

Project Management

We have included appropriate project management resources in our plan to ensure that the project runs smoothly and that any course corrections are taken effectively and promptly. The number of tasks is relatively small, and the interdependency is straightforward. The team has worked well together before and feels comfortable with the project plan. We believe the risks are adequately mitigated.

Our Team

MLabs is a cutting-edge consultancy specialising in both AI and blockchain technologies. We have over 40 years of combined experience in successful ML/AI deployments as well as an extensive background in helping bootstrap Web3 communities such as Cardano. More importantly, we specialise in pinpointing where AI and automation can yield the most substantial ROI for clients and communities, followed by the seamless implementation of systems to achieve those gains. Our past ML/AI work spans a range of impactful projects, including:

  • UK Vaccine Programme: Offering actionable insights that have directly influenced public policy.
  • NHS COVID-19 App: Conducting time-series analyses that have informed crucial budgetary decisions.
  • Major Aircraft Manufacturer: Utilising corporate data-mining strategies to eliminate redundancy and boost efficiency.
  • Large E-commerce Platform: Implementing automated tax categorisation systems to ensure legislative compliance.

Principal Technical Team

Dr. Mark Bedworth

Mark Bedworth is an internationally respected innovator and neural network visionary with over 40 years of experience in machine learning. He has designed and deployed neural networks, deep learning, and natural language processing algorithms since the early 1980s, and has had the privilege of working with many of the luminaries of the neural network community. He has written and published numerous academic articles, white papers, and patents. He also developed the small-footprint speech recognition algorithm that was later acquired by Siri for inclusion in the iPhone.

Jonny Edwards

Jonathan Edwards is an enthusiastic and highly motivated Machine Learning and Artificial Intelligence expert who excels in problem-solving and invention. As a researcher with more than 20 years of commercial/R&D experience in AI, Jonathan provides advanced algorithmic experience across several industry sectors, and he is a specialist in neural networks, machine vision, classifier design and implementation.

 

Nathaniel Lane

Nathaniel Lane graduated summa cum laude from the Colorado School of Mines in 2015 with a B.S. in Computer Science. He earned a Master's in Computer Science at Montana State University, studying how neural networks can predict whether an amino acid sequence represents an anti-cancer peptide. The results were published in the paper "DeepACPPred," co-authored with his advisor, Indika Kahanda.

 
 
 

Proposal Video

Retrieving Information using a Gated Encoder LLM (RIGEL) - #DeepFunding IdeaFest Round 3

26 September 2023
  • Total Milestones: 5
  • Total Budget: $140,000 USD
  • Last Updated: 2 Jun 2024

Milestone 1 - At-scale Document Compression

Status: 😀 Completed
Description: Feed all Wikipedia articles into Llama 2 and store the context vectors in a database.
Budget: $20,000 USD

Milestone 2 - Calibration of Hierarchical Document Embedding

Status: 😀 Completed
Description: Identify how similar articles are based on the graph of Wikipedia article links. Store the graph and similarities for future use.
Budget: $20,000 USD

Milestone 3 - Context Compression Neural Network Training

Status: 🧐 In Progress
Description: Train a neural network to compress the context vectors at multiple levels in a hierarchical fashion.
Budget: $30,000 USD

Milestone 4 - Context Vector Retrieval Training

Status: 😐 Not Started
Description: Use the hierarchical vectors to generate a tree that allows for quick vector retrieval.
Budget: $35,000 USD

Milestone 5 - LLM Modifications and Integration

Status: 😐 Not Started
Description: Integrate the additional document context vector into the decoder section of Llama’s network and retrain.
Budget: $35,000 USD
