Proposal Description
Solution Description
This project is about developing a Python package to retrieve, inspect, convert, and import up-to-date knowledge graphs from the fields of biomedicine and in particular from pharmacology into OpenCog Hyperon. The following sections provide an overview of the project's scientific context, its underlying motivation, alignment with the RFP proposition, and technical details to enable the assessment of its validity and feasibility.
1. Background
Since the inception of receptor theory around 1900, the prevalent paradigm in pharmacology has been that of "one drug, one target, one disease" [3]. This means that a medicinal compound such as Imatinib is assumed to bind to a single target like the BCR-ABL kinase and thereby influence a single disease state such as Chronic Myeloid Leukemia (CML). In recent decades, systems biology has increasingly shed light on the complex interaction networks between various biological entities in living organisms, including their genes, RNAs, proteins, metabolites, signaling molecules, etc. This partly shifted the focus from investigating isolated parts, which made molecular biology so successful, to studying systems of interacting parts and their emergent behavior, which requires large-scale experiments, well-curated collections of data, and careful computational modeling.
The topic of this project is network pharmacology [3], a relatively young discipline that was born from the influence of systems biology and from the finding that most drugs in current use actually bind to multiple targets, a phenomenon called polypharmacology. This implies that active compounds usually influence entire modules in biological networks in subtle ways, rather than modulating a single target, and therefore the new paradigm of network pharmacology is beginning to replace the old one. The hope and promise is that by integrating and studying vast amounts of experimental data, it will become possible to understand chemical and biological interaction networks sufficiently well to design drugs or combination therapies with higher efficacy and lower toxicity, allowing us to target challenging diseases such as different cancers with higher precision and fewer side effects. For this purpose, scientists in various fields of the life sciences are collecting experimental data and sharing it through many specialized databases. A much smaller number of academic teams are exploring ways to integrate carefully selected data sources consistently into knowledge graphs, with the aim of serving the ambitious tasks of what is often interchangeably called "systems", "network" or "precision" pharmacology and medicine.
2. Motivation
In preparation for this proposal, a review of recent academic literature was carried out, focusing on the design and use of knowledge graphs in the field of biomedicine, and more specifically, in its subfield pharmacology and the modern branch of network pharmacology. This examination resulted in the following observations:
- There are multiple high-quality academic projects that actively work on the integration of data from chemical, biological, and pharmacological databases into specialized knowledge graphs, often for addressing specific pharmacological tasks like drug repurposing, target identification, disease-gene prioritization, or predicting negative drug-drug interactions [1, 2].
- The data integration process requires considerable domain expertise for making well-informed design decisions that are suitable for the types of data to be integrated and for the types of tasks to be solved [1].
- The rationale, methodology, tools and data produced by these projects are shared in commendable openness with the research community and interested public, inviting reuse and further research.
In light of these findings, the goal of this project is to tap into the rich source of well-curated knowledge graphs available in biomedicine, with a focus on modern pharmacological applications, by bringing several of these knowledge graphs to OpenCog Hyperon. This will be achieved by implementing a Python package in accordance with the guidelines of the Python Packaging Authority (PyPA). The package will allow users to 1) retrieve the latest published versions of the knowledge graphs from the web, 2) convert them from the various formats used by different projects into a suitable common intermediate representation, and 3) import the data into OpenCog Hyperon in the form of MeTTa expressions. Since there are many ways to represent knowledge in general, and especially in the highly flexible MeTTa language, the structure of the generated expressions is not fixed but open to parameterization by users of the package. This will enable experimentation with different representations, for example to see how suitable they are for formulating queries with OpenCog's pattern matcher, for reasoning on them with PLN, for learning on them with evolutionary methods, for creating projections of them with graph embeddings, and so on.
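To make the idea of a parameterizable output structure more concrete, the following is a minimal sketch and not the package's actual design. The `Triple` class, the two formatter functions, and the relation names `targets` and `associated_with` are hypothetical and chosen only for illustration; the entity names are taken from the PrimeKG example in Figure 1. The sketch assumes that names containing characters other than letters, digits, underscores, or hyphens are best emitted as quoted strings.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Triple:
    """Hypothetical intermediate representation: one edge of a knowledge graph."""
    subject: str
    predicate: str
    obj: str


_SYMBOL = re.compile(r"[A-Za-z0-9_\-]+")


def atom(name: str) -> str:
    """Emit a bare symbol if the name is a simple token, else a quoted string."""
    if _SYMBOL.fullmatch(name):
        return name
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


Formatter = Callable[[Triple], str]


def predicate_first(t: Triple) -> str:
    """Representation A (assumed): (predicate subject object)."""
    return f"({atom(t.predicate)} {atom(t.subject)} {atom(t.obj)})"


def reified_edge(t: Triple) -> str:
    """Representation B (assumed): (Edge subject predicate object)."""
    return f"(Edge {atom(t.subject)} {atom(t.predicate)} {atom(t.obj)})"


def to_metta(triples: Iterable[Triple], formatter: Formatter = predicate_first) -> str:
    """Serialize triples to expression text, one expression per line."""
    return "\n".join(formatter(t) for t in triples)


# Illustrative edges only; relation names are placeholders.
triples = [
    Triple("Geldanamycin", "targets", "Hsp90AA1"),
    Triple("Hsp90AA1", "associated_with", "breast cancer"),
]

print(to_metta(triples))
print(to_metta(triples, reified_edge))
```

Because the formatter is an ordinary function argument, a user could supply a custom one without touching the retrieval or conversion steps, which is the kind of experimentation with representations described above.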
3. Alignment with RFP requirements
This project adheres to the RFP requirements in the following narrow senses:
- It implements a tool (i.e. a Python package) for importing data (i.e. existing knowledge graphs) into a graph database (i.e. OpenCog Hyperon's Atomspace).
- As part of its functionality, it is able to retrieve up-to-date data (i.e. latest builds of knowledge graphs) from reliable sources (i.e. high-quality academic projects) on the web (i.e. data repositories on Zenodo, Harvard Dataverse, GitHub, etc.).
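The retrieval step could look roughly like the sketch below, which downloads a file once, caches it locally, and optionally verifies a SHA-256 checksum. The function names `fetch` and `sha256_of`, the `.part` temporary-file convention, and the idea of pinning a checksum per release are assumptions made for illustration; the actual repositories (Zenodo, Harvard Dataverse, GitHub) each have their own URLs and metadata, which are not shown here.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path
from typing import Optional


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch(url: str, dest: Path, expected_sha256: Optional[str] = None) -> Path:
    """Download `url` to `dest` unless a valid cached copy already exists.

    If `expected_sha256` is given, both the cached copy and a fresh download
    are checked against it; a mismatching download raises ValueError.
    """
    if dest.exists() and (expected_sha256 is None or sha256_of(dest) == expected_sha256):
        return dest  # cache hit, nothing to download

    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, tmp.open("wb") as out:
        shutil.copyfileobj(response, out)

    if expected_sha256 is not None and sha256_of(tmp) != expected_sha256:
        tmp.unlink()
        raise ValueError(f"checksum mismatch for {url}")

    tmp.replace(dest)  # move into place only after verification
    return dest


# Usage (placeholder URL, not a real data source):
# path = fetch("https://example.org/kg.csv", Path("cache/kg.csv"))
```

Writing to a temporary file first means an interrupted or corrupted download never replaces a good cached copy.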
In a broader sense, it is intended to be an R&D project investigating ways of transforming existing knowledge graphs coming in different formats, importing them into OpenCog Hyperon with MeTTa expressions of varying structures, enabling queries with OpenCog Hyperon's built-in pattern matcher, and providing a basis for testing different learning and reasoning algorithms on well-curated real-world data. As such, the hope is to exchange ideas with and support the development teams of MeTTa and PLN, as well as the research team of Rejuve, which is working in a partially overlapping domain but with a different focus. Initial conversations with Alexey Potapov, Nil Geisweiller and Michael Duncan were encouraging and informed this proposal.
4. Technical details
In preparation for this proposal, a small prototype was implemented to demonstrate the main steps in utilizing an existing knowledge graph. It can be inspected in the form of a Jupyter notebook here:
This prototype loads, samples, transforms, visualizes and imports the main knowledge graph of the project PrimeKG (last updated in July 2023) into OpenCog Hyperon. It does so by generating simple MeTTa expressions, without making use of the flexible typing features of the language or other advanced constructs yet. This limited demonstration shows that it is possible to prepare and load a state-of-the-art knowledge graph into OpenCog Hyperon and perform basic queries on it, such as
- identifying protein targets of a specific drug,
- listing all pairs of drugs that have a shared target and therefore might influence each other's activities,
- finding protein-protein interaction partners, or
- identifying anatomical regions in which a specific protein is expressed.
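The logic behind two of the queries listed above can be sketched in plain Python on a toy edge list. This is only an illustration of what the queries compute; the prototype itself runs them inside OpenCog Hyperon, and its expressions are not reproduced here. Apart from the Geldanamycin and Hsp90AA1 pair from Figure 1, all drug and protein names below are placeholders, not data from PrimeKG.

```python
from collections import defaultdict
from itertools import combinations
from typing import Iterable, List, Tuple

Edge = Tuple[str, str]  # (drug, protein target)

# Toy data: only the first pair comes from the proposal's example figure;
# DrugA/DrugB/DrugC and ProteinX/ProteinY are placeholders.
DRUG_TARGETS: List[Edge] = [
    ("Geldanamycin", "Hsp90AA1"),
    ("DrugA", "Hsp90AA1"),
    ("DrugA", "ProteinX"),
    ("DrugB", "ProteinX"),
    ("DrugC", "ProteinY"),
]


def targets_of(drug: str, edges: Iterable[Edge]) -> List[str]:
    """Return the sorted protein targets of one drug."""
    return sorted({target for d, target in edges if d == drug})


def drugs_sharing_a_target(edges: Iterable[Edge]) -> List[Tuple[str, str]]:
    """Return all unordered drug pairs that have at least one target in common."""
    drugs_by_target = defaultdict(set)
    for drug, target in edges:
        drugs_by_target[target].add(drug)

    pairs = set()
    for drugs in drugs_by_target.values():
        # Sorting first makes each pair canonical, so (a, b) and (b, a) coincide.
        pairs.update(combinations(sorted(drugs), 2))
    return sorted(pairs)


print(targets_of("DrugA", DRUG_TARGETS))
print(drugs_sharing_a_target(DRUG_TARGETS))
```

Grouping drugs by target first avoids comparing every drug against every other drug, which matters once the edge list grows to the size of a full knowledge graph.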
Figure 1: Visualization of a tiny subset of entities (nodes) and relations (edges) contained in the knowledge graph "PrimeKG" by the Zitnik lab at Harvard. An isoform of Heat Shock Protein 90 named Hsp90AA1 (green) is shown together with a) drugs (orange) it interacts with such as the polyketide Geldanamycin, b) diseases (red) it is associated with such as breast cancer, c) pathways (dark blue) it is part of, and d) bioprocesses (light blue) it partakes in.
In further preparation of this proposal, a preliminary literature review was conducted to get an overview of available knowledge graphs in the fields of biomedicine and pharmacology. Among the many projects identified so far, the following stood out for their high quality and for being kept up to date. Part of this project will be an extension of this initial literature review to ensure that no promising project is overlooked. Therefore the following list is preliminary and contains only candidate knowledge graphs currently considered for porting to OpenCog Hyperon:
PrimeKG
PheKnowLator
Otter-Knowledge
- Integration of 7 databases, covering >30,000,000 triples in four knowledge graphs
- Publication: arXiv, 2023
Hetionet
- Integration of 29 databases, covering 2,250,197 relationships in one knowledge graph
- Publication: eLife, 2017
CROssBAR
RTX-KG2
BioCypher
Long Description
Company Name
Antecedens e.U.
Request for Proposal Pool
RFP4: Tools for Knowledge Graphs and LLMs Integration
Summary
The goal of this project is to develop a tool for bringing several well-curated, state-of-the-art knowledge graphs from the modern field of network pharmacology to OpenCog Hyperon. This shall serve the following purposes:
- Support the ongoing R&D of OpenCog Hyperon's language MeTTa and the machine reasoning approach Probabilistic Logic Networks (PLN) by experimenting with different options of representing knowledge in these formalisms.
- Contribute to SingularityNET's new web3 initiative of building "Blockchain Layer 3: The Internet of Knowledge" by providing high-quality data from a discipline that is at the center of expanding human healthspan and lifespan.
- Share ideas and insights from the process of converting different knowledge graphs to MeTTa expressions with the team of Rejuve, which is working in the same broad domain of biomedicine, yet with a different focus.
- In an ideal scenario, this project will enable the derivation of novel pharmacological relationships and hypotheses on the converted knowledge graphs by leveraging OpenCog Hyperon's advanced querying, reasoning and learning algorithms as they are becoming available and mature.
Prototype for this project:
Funding Amount
$40,000
RFP4 Requirements
The main goal of RFP4 is "to stimulate research and development in the area of Knowledge Graphs and their integration with LLMs".
Suggested examples include:
- "Tools for importing data to a graph database."
- "Online web retrieval tools [...] with a selection of reliable sources to provide generative language models with relevant and up-to-date data".
Project Milestones and Cost Breakdown
Milestone 1: Kick-off
- Description: Setting up a contract with SingularityNET. Expanding the preliminary literature review performed in preparation of the proposal to provide a comprehensive overview of available knowledge graphs and tooling in the field of biomedicine.
- Deliverables: 1) Signed contract, 2) Report with literature review as PDF file or HTML site
- Budget: $4,000
- Estimated time: 2-3 weeks
Milestone 2: Design
- Description: Creating a code repository and outlining a scaffold for the Python package. Discussing potential knowledge representations with the development teams of MeTTa and PLN and testing them on the basis of the prototype that was generated in preparation of this proposal.
- Deliverables: 1) GitHub repository with Python package, 2) Discussions with SingularityNET's tech teams and tests on the prototype
- Budget: $6,000
- Estimated time: 4-5 weeks
Milestone 3: Implementation Phase 1
- Description: Writing modules for integrating the first two knowledge graphs into OpenCog Hyperon. Working out shared representations and interfaces for a consistent package.
- Deliverable: GitHub repository with Python package v0.1 covering two KGs
- Budget: $12,000
- Estimated time: 8-10 weeks
Milestone 4: Implementation Phase 2
- Description: Writing modules for integrating the remaining three knowledge graphs into OpenCog Hyperon.
- Deliverable: GitHub repository with Python package v0.2 covering five KGs
- Budget: $12,000
- Estimated time: 8-10 weeks
Milestone 5: Documentation and Finalization
- Description: Creation of a documentation website for the Python package, including installation instructions, getting started guide, example notebooks and an API reference. Cleaning up and finalizing the Python package.
- Deliverable: GitHub repository with Python package v1.0 covering five KGs and documentation
- Budget: $6,000
- Estimated time: 4-6 weeks
Total
- Budget: $40,000
- Estimated time: 26-34 weeks
Risk and Mitigation
API stability of OpenCog Hyperon
- Risk: MeTTa is under active development and therefore potentially subject to change in its surface syntax or in its advanced features, including the Pattern Matcher and PLN.
- Mitigation: This project aims to use an intermediate representation in the conversion process, and to generate the final MeTTa expressions from it only in a last step. This means that if adaptations to a changed MeTTa syntax become necessary, the required code modifications should remain minimal.
Unforeseen issues with a knowledge graph
- Risk: On close inspection, a candidate knowledge graph may turn out to be inadequate for the purpose of this proposal due to some subtle reason.
- Mitigation: In a first step, a complete list of recent high-quality projects will be collected, so that there are more candidate knowledge graphs than can be covered in the scope of this proposal. If a chosen one fails, it is easily possible to switch to the next best one on the list.
Redundancy
- Risk: Rejuve is a project in the SingularityNET ecosystem that is also active in the domain of biomedicine, so there could be redundancies in the performed work.
- Mitigation: Biomedicine is a very broad field and available knowledge graphs from academic projects are designed with specific tasks in mind that in turn influence how the data is preprocessed and structured [1]. This project focuses on network pharmacology and associated tasks like drug repurposing and drug-drug interaction prediction. So while it is conceivable that on the surface there could be overlaps, e.g. using data from the same primary database (e.g. DrugBank, ChEMBL), the details of the knowledge representation will most likely differ in several aspects and as a consequence enable other inferences and insights. Besides a potential for redundancies, there is also a significant potential for synergies, e.g. sharing ideas and learnings about what kinds of MeTTa expressions and constructs are especially suitable for querying, reasoning or learning algorithms.
Team
Robert Haas
- Formal education in molecular biology and computational science, with a focus on cheminformatics, computer-aided drug design and evolutionary computation.
- Independent software developer of several open source and commercial tools.
- Long-term follower of SingularityNET and active contributor to its vision of open and beneficial ML/AI applications.
Related Links
Literature
- [1] Bonner, S. et al. A review of biomedical datasets relating to drug discovery: a knowledge graph perspective. Briefings in Bioinformatics 23, bbac404 (2022).
- [2] Callahan, T. J., Tripodi, I. J., Pielke-Lombardo, H. & Hunter, L. E. Knowledge-Based Biomedical Data Science. Annual Review of Biomedical Data Science 3, 23–41 (2020).
- [3] Nogales, C. et al. Network pharmacology: curing causal mechanisms instead of treating symptoms. Trends in Pharmacological Sciences 43, 136–150 (2022).