Augmenting BMKGs With Cheminformatics And CADD

Robert Haas
Project Owner

Status

  • Overall Status

    ⏳ Contract Pending

  • Funding Transferred

    $0 USD

  • Max Funding Amount

    $70,000 USD

Funding Schedule

Milestone Release 1
$10,000 USD Pending TBD
Milestone Release 2
$10,000 USD Pending TBD
Milestone Release 3
$23,000 USD Pending TBD
Milestone Release 4
$15,000 USD Pending TBD
Milestone Release 5
$12,000 USD Pending TBD

Project AI Services

No Service Available

Overview

Biomedical knowledge graphs (BMKG) contain chemical compounds such as drugs, toxins, metabolites, cofactors or signaling molecules. These entities and some of their relations can be richly augmented with qualitative & quantitative properties by methods from cheminformatics, computer-aided drug design (CADD) and related fields. This enables numerical queries and many analyses such as filtering, clustering, embedding, similarity/outlier detection, QSAR modeling, ML, etc. The aim of this project is to make existing but scattered methods available in a Python package with a unified functional API, expose it to OpenCog Hyperon, and apply it in a PoC study to annotate and analyze Hetionet in MORK.

RFP Guidelines

Advanced knowledge graph tooling for AGI systems

Complete & Awarded
  • Type SingularityNET RFP
  • Total RFP Funding $350,000 USD
  • Proposals 39
  • Awarded Projects 5
SingularityNET
Apr. 16, 2025

This RFP seeks the development of advanced tools and techniques for interfacing with, refining, and evaluating knowledge graphs that support reasoning in AGI systems. Projects may target any part of the graph lifecycle — from extraction to refinement to benchmarking — and should optionally support symbolic reasoning within the OpenCog Hyperon framework, including compatibility with the MeTTa language and MORK knowledge graph. Bids are expected to range from $10,000 - $200,000.

Proposal Description

Our Team

Robert Haas

  • Formal education in molecular biology and computational science, with a focus on cheminformatics, computer-aided drug design and evolutionary computation.
  • Independent software developer of several open source and commercial tools
  • Long-term community member of SingularityNET and active contributor to its vision of open and beneficial ML/AI applications.

Company Name (if applicable)

Antecedens e.U.

Project details

Relations to other work

This project aims to extend upon previous work that brought pharmacologically-relevant BMKGs to OpenCog Hyperon with an ETL-pipeline named kgw. The long-term idea is to explore their potential for applications such as drug repurposing, side-effect prediction or optimizing drug combinations with machine learning and reasoning algorithms that gradually become available in MeTTa:

  1. https://deepfunding.ai/proposal/bringing-network-pharmacology-to-opencog-hyperon
  2. https://deepfunding.ai/proposal/knowledge-graph-workflows

The project also aims to provide concrete and near-term value to other initiatives in the SingularityNET ecosystem:

  1. Rejuve's work on BMKGs in the context of longevity research can use the annotation functionality implemented in this project, and the case study can serve as guidance on how this can be done. The first part was discussed briefly with Michael Duncan before deciding to write this proposal.
  2. BMKGs ported to MeTTa have been used for testing early iterations of MORK. This helped to discover some parsing deviations related to atypical string encodings and supported early performance benchmarking. In discussions with Adam Vandervorst, MORK support for high-performance numerical range queries was found to be a relevant feature for large BMKGs, even more so if they are annotated with extra properties. This feature and some early ideas for embedding methods could likely be tested well on the Hetionet case study suggested in this proposal.

Finally, this project also aims to support the global research community in cheminformatics and computer-aided drug discovery:

  • The challenge described in the following sections also applies to researchers in both fields, especially those without a computer science background. Since the resulting package will be published openly and freely, anyone interested can draw on and extend this work, which may in turn benefit projects within the SingularityNET ecosystem if they find use for it. The author's previous BMKG projects mentioned above have already found some resonance, as evidenced by a few email correspondences with researchers who provided feedback, as well as some stars and forks by GitHub users, e.g. on the repository awesome-biomedical-knowledge-graphs.

Scientific background

Cheminformatics and computer-aided drug design are two active and closely related scientific fields. Over decades, various research groups and companies have developed computational methods to represent small molecules in different ways and perform calculations on them. Today this includes a very wide range of tasks, such as 1) conversion between file formats, 2) derivation of numerical molecular descriptors and binary fingerprints to characterize compounds, 3) measuring similarity and distance values between compounds with different metrics, 4) estimation of realistic embeddings of atoms and bonds in 3D space to generate chemically plausible conformers, 5) visualization of compounds in 2D and 3D, 6) QSAR modeling and supervised machine learning to find predictive models that relate structures to relevant experimental properties such as a particular biological activity of interest, 7) molecular docking to predict binding poses and energies of small molecules to target macromolecules, 8) molecular dynamics simulations to calculate detailed trajectories of binding processes and evaluate them statistically, 9) heuristic quantum chemical calculations to estimate experimental measurements very accurately, as well as many other approaches that tackle chemical, pharmacological and medicinal questions.
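Task 3) above, measuring similarity between compounds, is commonly done on binary fingerprints with the Tanimoto (Jaccard) coefficient. The following is a minimal sketch for illustration only and not part of the proposed package: fingerprints are represented as sets of on-bit indices, and the bit positions are made-up values rather than output of a real toolkit.

```python
from typing import AbstractSet


def tanimoto(fp_a: AbstractSet[int], fp_b: AbstractSet[int]) -> float:
    """Tanimoto coefficient |A ∩ B| / |A ∪ B| for two sets of on-bit indices."""
    union = len(fp_a | fp_b)
    if union == 0:
        # Two empty fingerprints are treated as identical here (a convention
        # chosen for this sketch, toolkits may handle this case differently).
        return 1.0
    return len(fp_a & fp_b) / union


# Hypothetical fingerprints (illustrative bit indices only)
fp_1 = {1, 4, 7, 9}
fp_2 = {1, 4, 8, 9, 12}

similarity = tanimoto(fp_1, fp_2)  # 3 shared bits out of 6 distinct bits
distance = 1.0 - similarity        # a simple distance derived from the similarity
```

Real fingerprints would typically be fixed-length bit vectors produced by a toolkit; the set representation is used here only to keep the arithmetic visible.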

The notion of chemical space is a central concept in cheminformatics and CADD. It refers to the property space spanned by either a) all thermodynamically stable molecules or b) a concrete subcollection of compounds such as the set of all approved drugs, some class of natural products, or some commercially available library of physical molecules. Mathematically this property space is simply a high-dimensional vector space, where each molecule is represented by one vector, and each element of such a vector is a numerical molecular descriptor that characterizes an aspect of a molecule (e.g. its molecular weight, logP, number of hydrogen bond donors or acceptors, etc.). Collections of molecules can be studied and compared in this very information-rich setting in many ways. This has for example led to simple statistical observations such as approved drugs tending to occupy particular sub-regions of the overall space, which are often characterized by simple rules, e.g. Lipinski's rule of five and several others named after their authors. These findings can be used to estimate the "druglikeness" of new chemical structures. In pharmaceutical settings, such heuristics are often used as a filter in virtual screening campaigns that aim to reduce large molecule collections to a few candidate compounds that could bind a drug target such as a protein involved in a disease. This difficult selection task is often compared to the search for a needle in a haystack. It can profit from algorithmic improvements and new scientific insights, with one possible direction of innovation being the integration of more heterogeneous data in KGs to find and utilize novel patterns.
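To make the idea of a rule-based druglikeness filter concrete, here is a minimal sketch of Lipinski's rule of five applied to descriptor vectors. It assumes the four descriptors have already been calculated by some toolkit; the two compounds and their values are hypothetical, and the tolerance of one violation is a common convention that is exposed as a parameter rather than fixed.

```python
from typing import List, Mapping

# Limits of Lipinski's rule of five; a violation is a value above the limit.
RULE_OF_FIVE_LIMITS = {
    "molecular_weight": 500.0,  # daltons
    "logp": 5.0,                # octanol-water partition coefficient
    "h_bond_donors": 5,
    "h_bond_acceptors": 10,
}


def rule_of_five_violations(descriptors: Mapping[str, float]) -> List[str]:
    """Return the names of all descriptors that exceed their limit."""
    return [
        name
        for name, limit in RULE_OF_FIVE_LIMITS.items()
        if descriptors[name] > limit
    ]


def is_druglike(descriptors: Mapping[str, float], max_violations: int = 1) -> bool:
    """Accept a compound if it has at most `max_violations` violations (assumed default: 1)."""
    return len(rule_of_five_violations(descriptors)) <= max_violations


# Hypothetical descriptor vectors (made-up values for illustration)
compound_a = {"molecular_weight": 320.4, "logp": 2.1, "h_bond_donors": 2, "h_bond_acceptors": 5}
compound_b = {"molecular_weight": 712.9, "logp": 6.3, "h_bond_donors": 3, "h_bond_acceptors": 9}

candidates = {"compound_a": compound_a, "compound_b": compound_b}
druglike = [name for name, desc in candidates.items() if is_druglike(desc)]
```

In a virtual screening setting, a filter of this kind would be one early and cheap step before more expensive methods such as docking are applied to the remaining candidates.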

Problems

Biomedical knowledge graphs (BMKGs) typically contain small molecules, but in the majority of cases they are not annotated with chemical properties, even if the goals of the KG construction effort explicitly included tasks such as drug repurposing or missing link prediction, where this information could clearly support the purpose. This is surprising, because freely available software can be used 1) to calculate thousands of relevant numerical and binary properties for a given molecule, i.e. deliver additional node annotations, and 2) to predict binding characteristics between a candidate molecule and a protein, i.e. deliver additional edge annotations or add entirely new edges.
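As a small illustration of point 1), the sketch below annotates compound nodes with a molecular weight and then runs a numerical range query on the new property. It assumes each node carries an InChI string, derives the weight from the formula layer of the InChI with rounded standard atomic weights, and handles only simple single-component formulas; the node identifiers and property names are hypothetical, and a real annotation pipeline would delegate the calculation to a cheminformatics toolkit.

```python
import re

# Rounded standard atomic weights for a few common elements; extend as needed.
ATOMIC_WEIGHTS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999,
                  "F": 18.998, "P": 30.974, "S": 32.06, "Cl": 35.45}


def formula_from_inchi(inchi: str) -> str:
    """Return the formula layer, i.e. the text between the first and second '/'."""
    parts = inchi.split("/")
    if len(parts) < 2 or not parts[0].startswith("InChI="):
        raise ValueError(f"not a usable InChI string: {inchi!r}")
    return parts[1]


def molecular_weight(formula: str) -> float:
    """Sum atomic weights for a simple formula such as 'C2H6O'."""
    if "." in formula or formula[:1].isdigit():
        raise ValueError("multi-component formulas are not handled in this sketch")
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        # An unknown element raises KeyError instead of being silently skipped.
        total += ATOMIC_WEIGHTS[element] * (int(count) if count else 1)
    return total


# Hypothetical compound nodes keyed by made-up identifiers
nodes = {
    "node-water": {"inchi": "InChI=1S/H2O/h1H2"},
    "node-ethanol": {"inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"},
    "node-aspirin": {"inchi": "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)"},
}

# Annotation step: add a numerical property to every node
for properties in nodes.values():
    properties["molecular_weight"] = molecular_weight(formula_from_inchi(properties["inchi"]))

# Numerical range query on the new annotation
in_range = sorted(
    node_id for node_id, properties in nodes.items()
    if 40.0 <= properties["molecular_weight"] <= 200.0
)
```

The same pattern extends to any descriptor: once a value is stored as a node property, filtering, clustering and other numerical analyses can operate on the graph directly.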

One reason for this lack of use may be the limited accessibility of existing functionality. The landscape of cheminformatic and CADD tools is highly scattered and contains many historical and deprecated projects. Installing and reliably using existing methods often requires some skill, even more so when several of them are combined in a shared environment. APIs are often intricate and insufficiently documented, sanity checks are sometimes needed to ensure that inputs and outputs conform to expectations, and several other complications can arise.

Solutions

Coming from the problem description, three main motivations for this project are:

1) Identification of existing methods that are relevant to the topic, supported by proper research, actively maintained, and can be installed together with other methods in a shared environment without running into major compatibility issues.

2) Implementation of a Python package that provides a unified API to several existing tools and a relevant portion of the functionality they provide. The overall package design aims to be functional and stateless as far as possible, in contrast to the object-oriented and stateful style chosen by some toolkits. The latter often increases mental load, because users have to think about toolkit-specific classes, their dependencies, and in some cases the method call order required for reproducible results, which can lead to subtle errors. A more functional style also has the advantage that it is well suited to exposing the calculation methods, either manually or automatically, in the form of grounded atoms in MeTTa and thereby bringing cheminformatics and computer-aided drug design to OpenCog Hyperon, which is a central drive behind this project.

3) Application of the package in a proof-of-concept case study in OpenCog Hyperon, if feasible with MORK as backend, on a well-known and medium-sized BMKG that contains drug molecules which come with their chemical structure but lack numerical annotations. One candidate is Hetionet, which isn't too large and contains small molecules that come with InChI strings in their node properties, a standard format supported by many toolkits.
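The difference between the stateful and the stateless style mentioned in point 2) can be shown with a toy example. The class below is a hypothetical stand-in for a stateful toolkit object and does not correspond to any real toolkit; its result depends on how often it was called before. The wrapper function hides that state by creating a fresh object per call, so equal inputs always give equal outputs.

```python
import random


class ToyStatefulGenerator:
    """Hypothetical stand-in for a stateful toolkit object (not a real toolkit class)."""

    def __init__(self, seed: int) -> None:
        self._rng = random.Random(seed)

    def score(self, smiles: str) -> float:
        # Each call advances the internal random state, so the result
        # depends on the call history of this particular object.
        return len(smiles) + self._rng.random()


def score_molecule(smiles: str, seed: int = 0) -> float:
    """Stateless wrapper: a fresh generator per call makes the result reproducible."""
    return ToyStatefulGenerator(seed).score(smiles)


# Stateful use: the same input gives different results on successive calls
generator = ToyStatefulGenerator(seed=0)
first = generator.score("CCO")
second = generator.score("CCO")

# Stateless use: the same input always gives the same result
repeat_a = score_molecule("CCO")
repeat_b = score_molecule("CCO")
```

A function with this shape, plain values in and plain values out, is also straightforward to expose as a single callable to another runtime, which is the property the package design relies on.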

Preliminary tests in preparation for this proposal

cheminformatics_in_hyperon.html is a Jupyter notebook that contains a preliminary exploration of a few cheminformatic toolkits to answer some basic questions:

  • Which projects have Python bindings and can be installed in the same conda environment without errors such as outdated installation scripts, missing dependencies, or clashes in their dependency version requirements?
    • The result contains a combination of OpenBabel, RDKit, Indigo and PaDEL-Descriptor (via PaDELPy) that worked.
    • Some other candidates had to be dropped for different reasons. For example, mordred had strong dependencies on highly outdated versions of numpy and, even when provided with them, still produced some errors in calculations. Cinfony couldn't be installed in a conda environment with a recent Python 3 interpreter.
  • Do some basic format conversions (SMILES to InChI, InChI to SMILES) and descriptor calculations work as expected on a test molecule?
    • Yes, at least with a very simple ethanol molecule ("CCO" in SMILES) everything seemed to behave as intended.
  • Can the functionality first be abstracted into functions with simple interfaces, then be registered as grounded atoms in OpenCog Hyperon, and finally be used to perform nested calculations with simple MeTTa code?
    • Yes, the registering and calling of Python functions in form of grounded atoms worked and enabled some basic chaining of functions from different toolkits due to unified interfaces. This serves as one basic guideline for the overall package design that will cover much more functionality.
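The chaining of functions with unified interfaces described above can be sketched in plain Python. This is not MeTTa code and does not use the Hyperon API: the registry and the tuple-based evaluator are hypothetical stand-ins that only mimic registering functions by name and evaluating a nested expression. The SMILES to InChI conversion is a toy lookup that covers ethanol only, in place of a real toolkit call.

```python
from typing import Callable, Dict, Tuple, Union

# Name -> function registry, a stand-in for registering grounded atoms
REGISTRY: Dict[str, Callable[[str], str]] = {}


def register(name: str) -> Callable[[Callable[[str], str]], Callable[[str], str]]:
    """Decorator that makes a string-in/string-out function callable by name."""
    def decorator(func: Callable[[str], str]) -> Callable[[str], str]:
        REGISTRY[name] = func
        return func
    return decorator


# Toy lookup standing in for a real toolkit conversion; only ethanol is covered.
_SMILES_TO_INCHI = {"CCO": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"}


@register("smiles-to-inchi")
def smiles_to_inchi(smiles: str) -> str:
    return _SMILES_TO_INCHI[smiles]


@register("inchi-to-formula")
def inchi_to_formula(inchi: str) -> str:
    # The formula layer is the text between the first and second '/'.
    return inchi.split("/")[1]


Expression = Union[str, Tuple[str, "Expression"]]


def evaluate(expression: Expression) -> str:
    """Evaluate a nested (function-name, argument) expression from the inside out."""
    if isinstance(expression, str):
        return expression
    name, argument = expression
    return REGISTRY[name](evaluate(argument))


formula = evaluate(("inchi-to-formula", ("smiles-to-inchi", "CCO")))  # "C2H6O"
```

Because every registered function shares the same simple signature, functions that would come from different toolkits can be nested freely without adapter code between them.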

Open Source Licensing

GNU GPL - GNU General Public License

GNU General Public License v3.0 will be applied to the developed Python package in order to accommodate all licenses of currently covered toolkits and potential future additions:

  • OpenBabel: https://github.com/openbabel/openbabel - GPL-2.0 license
  • RDKit: https://github.com/rdkit/rdkit - BSD-3-Clause license
  • Indigo: https://github.com/epam/Indigo - Apache-2.0 license
  • PaDELPy: https://github.com/ecrl/padelpy - MIT license

Describe the particulars.

Discussions with Michael Duncan (Rejuve: Application of chemical annotations in their BMKG) and Adam Vandervorst (MORK: Making use of upcoming numerical range queries and KG embedding methods).

Proposal Video

Not Available Yet

Check back later during the Feedback & Selection period for the RFP that this proposal is applied to.

Reviews & Rating

New reviews and ratings are disabled for Awarded Projects

Overall Community

0

from 0 reviews
  • 5
    0
  • 4
    0
  • 3
    0
  • 2
    0
  • 1
    0

Feasibility

0

from 0 reviews

Viability

0

from 0 reviews

Desirability

0

from 0 reviews

Usefulness

0

from 0 reviews


  • Total Milestones

    5

  • Total Budget

    $70,000 USD

  • Last Updated

    10 Sep 2025

Milestone 1 - Kick-off and research

Status
😐 Not Started
Description

1) Set up and sign the contract. 2) Extend the preliminary research done in preparation for this proposal. This means a broad literature and code review of existing functionality in cheminformatics and computer-aided drug discovery will be performed in order to find out which implementations are actively maintained, backed by publications, deliver reliable results, and can be used in combination in a shared conda environment.

Deliverables

The results of the broad literature and code review are provided in form of a GitHub repository rather than a PDF report so that other researchers can find and extend it in an easy way. If determined suitable, it may adhere to the style of an "awesome list" repository to make it easily findable and recognizable.

Budget

$10,000 USD

Link URL

Milestone 2 - Design

Status
😐 Not Started
Description

Decide which 5 to 7 external projects are going to be covered and what structure the unified API is going to take. The precise number of projects will depend on how much functionality each of them provides and how complex it is to abstract it into a functional interface that maximizes mutual compatibility. From the current point of view, good candidates seem to be OpenBabel, RDKit, Indigo and PaDEL-Descriptor, which cover a wide range of functionality, e.g. format conversions, descriptor and fingerprint calculations, 3D structure generation, 2D and 3D visualization, tautomer enumeration, etc. Reasonable additions from the perspective of broad methodological coverage could be a dedicated 3D conformer generator like Balloon, a molecular docking program like AutoDock Vina, and perhaps a quantum chemical toolkit like Psi4 for slower but more accurate geometry and electronic structure prediction that could be used as a basis for molecular dynamics simulation.

Deliverables

The results of the project selection and API design will be a Python package with a scaffold for the covered toolkits and a few sample functions already implemented to ensure the outline works as intended.

Budget

$10,000 USD

Link URL

Milestone 3 - Implementation

Status
😐 Not Started
Description

Fully implement the Python package that provides a unified API. The aim is to cover a large portion of the methods provided in the chosen external projects, though some very specific methods may not be of broad interest and therefore omitted.

Deliverables

Completed Python package, including a test suite with high coverage of the codebase, and everything required for distributing it as an easily installable package on a suitable repository.

Budget

$23,000 USD

Link URL

Milestone 4 - Application

Status
😐 Not Started
Description

Apply the Python package in a proof-of-concept case study. A reasonably sized biomedical knowledge graph will be ported to MeTTa and then augmented with various functions provided in the package, by registering them either manually or automatically as grounded atoms in OpenCog Hyperon. Ideally MORK will be used as backend if the project is mature enough at this point. The goal is not only to annotate the BMKG but also to perform interesting queries and analyses on it, ranging from basic filtering to embedding and QSAR modeling or supervised ML.

Deliverables

Proof-of-concept case study that applies the package on a BMKG such as Hetionet and performs some downstream analyses.

Budget

$15,000 USD

Link URL

Milestone 5 - Documentation

Status
😐 Not Started
Description

Generate a technical documentation website for the Python package and a summary PDF report for the entire project.

Deliverables

Code documentation so that external developers can use it without further help. PDF report so the project and its results can be understood by anyone interested.

Budget

$12,000 USD

Link URL
