Graphito: Tools for Entity & Predicate Mgmt in KGs

Overview

There are many large, generic semantic graphs (e.g. WikiData, DBpedia) alongside a growing number of domain-specific ones. LLMs offer a quick path to KG extraction but introduce several problems: inconsistency (erratic, non-deterministic entity resolution); inaccuracy (missing or hallucinated predicates); context loss (unstable frames of reference); misplaced confidence (the same confidence is displayed whether conclusions are valid or not); and speed (extraction is impractical at scale). We are developing an AGI-focused KG tooling suite tackling these challenges via four modules: Context Identification, Entity Management, Predicate Management, and Confidence Management. This proposal addresses entity, predicate, and confidence management.

RFP Guidelines

Advanced knowledge graph tooling for AGI systems

Complete & Awarded
  • Type: SingularityNET RFP
  • Total RFP Funding: $350,000 USD
  • Proposals: 39
  • Awarded Projects: 5
SingularityNET
Apr. 16, 2025

This RFP seeks the development of advanced tools and techniques for interfacing with, refining, and evaluating knowledge graphs that support reasoning in AGI systems. Projects may target any part of the graph lifecycle — from extraction to refinement to benchmarking — and should optionally support symbolic reasoning within the OpenCog Hyperon framework, including compatibility with the MeTTa language and MORK knowledge graph. Bids are expected to range from $10,000 - $200,000.

Proposal Description

Our Team

MLabs AI regards AGI as a key strategic goal, and our senior technical team already devotes a portion of its time to pursuing this outcome. We see our collaborations with SingularityNET as an important part of this endeavor.

Company Name (if applicable)

MLabs LTD

Project details

There are a number of large, generic semantic graphs (e.g. WikiData, DBpedia, BabelNet), with an ever-increasing number of domain-specific knowledge graphs appearing on a monthly basis. In addition to these more traditional graphs, there has recently been a move towards using LLMs to produce knowledge graphs.

While this capability provides a useful shortcut to a difficult problem, there are several problems with using LLMs for knowledge graph extraction:

  • inconsistency - LLMs often fail to correctly resolve entities, and the output is non-deterministic
  • accuracy - they can miss valid predicates, or (more often) hallucinate invalid predicates
  • context - LLMs often fail to maintain a frame of reference specific to the knowledge domain
  • confidence - LLMs display the same level of confidence about valid and invalid conclusions
  • speed - knowledge graph extraction from even a single website can take many minutes, and is impractical for very large knowledge graphs

Our KG tooling roadmap assumes that most of the structured graph information being operated on will be noisy, of variable quality, and not subject to ontological best practices. It is likely to have been created manually, using ad hoc processes, or from the output of an LLM.

We have designed requirements that specifically address the concerns listed above. Our tooling needs to maintain a suitable frame of reference (and identify multiple frames of reference in large KGs), resolve entities contextually, attach quantified confidence to relationships, and operate efficiently at the very large scales that AGI systems will require. There are four types of tooling:

  • Context Identification
  • Entity Management
  • Predicate Management
  • Confidence Management

This is one of two KG tooling projects submitted to the current RFP; it concentrates on entity management, predicate management, and confidence management. We will open source not only the tooling developed during the project but also the tooling we have worked on thus far. We will continue to develop these tools as an important component of our AGI roadmap and may deliver further enhancements to the SingularityNET community as well.

 

Node (Entity) Management

Of interest to us in this tooling set are the split and merge functions: the ability to split nodes that carry different nuances in different frames of reference, and the ability to simplify KGs by merging nodes that are maximally similar in terms of connectivity and frame of reference. We use a graph-theoretic approach to allocating edges to the new entities so that frame-of-reference separation is maximised.
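
As a minimal illustration of the merge side of this functionality, the sketch below finds merge candidates by the overlap of their neighbour sets and folds one node into another. The Jaccard measure, the threshold, and the adjacency-dict representation are assumptions for illustration only; they stand in for, and do not reproduce, the graph-theoretic, frame-of-reference-aware criterion described above.

```python
# Minimal sketch (assumptions: adjacency-dict KG representation, Jaccard similarity,
# fixed threshold). The project's actual criterion also weighs frame of reference.
from itertools import combinations

def merge_candidates(adjacency: dict[str, set[str]], threshold: float = 0.8):
    """Return node pairs whose neighbour sets overlap strongly (merge candidates)."""
    candidates = []
    for a, b in combinations(adjacency, 2):
        union = adjacency[a] | adjacency[b]
        if not union:
            continue
        jaccard = len(adjacency[a] & adjacency[b]) / len(union)
        if jaccard >= threshold:
            candidates.append((a, b, jaccard))
    return sorted(candidates, key=lambda t: -t[2])

def merge_nodes(adjacency: dict[str, set[str]], keep: str, drop: str) -> None:
    """Fold `drop` into `keep`, redirecting all of `drop`'s edges."""
    adjacency[keep] |= adjacency.pop(drop)
    for neighbours in adjacency.values():
        if drop in neighbours:
            neighbours.discard(drop)
            neighbours.add(keep)
    adjacency[keep].discard(keep)  # no self-loop after the merge
```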

It is also important in entity management to maintain an alias list for naming an entity, so that different expressions for the same entity can be reconciled. Here we concentrate on the dual of this problem: how to detect entities in a corpus of unstructured text when a (single) entity may be referred to by a multi-word phrase (e.g. “United States of America” is clearly a single entity with a four-word alias). MLabs has an approach that builds arbitrary-length n-grams efficiently, so that a variation of the Nevill-Manning-Witten algorithm can be used to build a plausible compound-term list. This can subsequently be used with standard tools such as Latent Semantic Analysis (LSA) to attach the compound terms to existing entities.
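
The statistical half of this pipeline can be sketched with a plain bigram association score, as below. This is a simplification for illustration: it uses pointwise mutual information rather than our Nevill-Manning-Witten variant, and the tokenisation, counts, and thresholds are assumed values.

```python
# Illustrative sketch only: candidate compound terms from raw text via a simple
# bigram association score (PMI), not the Nevill-Manning-Witten variant itself.
import math
from collections import Counter

def compound_candidates(tokens: list[str], min_count: int = 5, min_pmi: float = 3.0):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    candidates = []
    for (w1, w2), n12 in bigrams.items():
        if n12 < min_count:
            continue
        # High PMI: the pair co-occurs far more often than chance would predict.
        pmi = math.log2((n12 * total) / (unigrams[w1] * unigrams[w2]))
        if pmi >= min_pmi:
            candidates.append((f"{w1} {w2}", n12, round(pmi, 2)))
    return sorted(candidates, key=lambda t: -t[2])

# Longer aliases such as "United States of America" emerge by repeatedly replacing
# accepted bigrams with single tokens and re-running the counts.
```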

Edge (Predicate) Management

A number of standard tools are required to maintain a properly curated KG, such as resolving contradictory predicates, removing directed-edge cycles, and pruning redundant or low-confidence edges. We will assume that these functions are handled using standard methods and do not include them in this project.

Of most interest is predicate confidence management - attaching a confidence score to each predicate in the KG. There are three main reasons why a predicate should be down-weighted in confidence:

  • the external evidence is poor - it is from an unreliable source or is not independently confirmed
  • the internal evidence is poor - it does not make sense within the context of the current frame of reference
  • the evidence (internal or external) is stale - the predicate is fluid, and the support for it is not up to date

Source reliability and information credibility are measures that have been studied in great detail and can be manually coded according to the Admiralty code (a two-dimensional code in which A5 indicates a highly reliable source providing information lacking in credibility, and E1 indicates a source with a poor reputation providing very credible information).

We use the same principle extended to three dimensions: reliability, credibility, and timeliness. The confidence in a predicate is then a combination of these three factors.
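
As a minimal sketch of how these three dimensions might be combined, the snippet below maps each to a [0, 1] score and multiplies them. The multiplicative rule and the class layout are assumptions for illustration; the proposal deliberately leaves the exact combination rule open.

```python
# Minimal sketch (assumption: each dimension already mapped to [0, 1]; the simple
# product is a placeholder for whatever combination rule is finally adopted).
from dataclasses import dataclass

@dataclass
class PredicateAssessment:
    reliability: float   # how trustworthy the source is
    credibility: float   # how well supported the statement is
    currency: float      # how up to date the supporting evidence is

    def confidence(self) -> float:
        return self.reliability * self.credibility * self.currency

# Example: a fairly reliable source, a well-supported claim, slightly stale evidence.
print(PredicateAssessment(0.8, 0.9, 0.7).confidence())  # ≈ 0.504
```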

Reliability

We base most of our source reliability assessment on reputation. Our reputation score is a combination of three factors:

  • Historical accuracy - the mean confidence of all predicates produced by this source i.e. how trustworthy this source has shown itself to be
  • Fit to frame of reference - the proportion of facts from this source that are within the frame of reference for the predicate under consideration i.e. how subject-specific this source is
  • Access to relevant evidence - the proportion of the current frame of reference that has been covered by predicates from this source i.e. how completely the source knows the subject

All of these factors are values between zero and one, including the degree of belonging of a predicate to a frame of reference. We currently combine these values by summing over products, but it is likely that some degree of weighting or non-linearity may produce better reliability estimates.
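
A minimal sketch of these three factors is given below, assuming per-predicate records that carry a confidence value and a degree of membership in the current frame of reference. The field names, the coverage approximation, and the equal-weight average (standing in for the sum over products mentioned above) are illustrative assumptions.

```python
# Minimal sketch (assumptions: per-predicate records with confidence and frame
# membership; equal-weight average in place of the weighted sum-over-products).
from dataclasses import dataclass

@dataclass
class PredicateRecord:
    confidence: float   # current confidence of the predicate, in [0, 1]
    membership: float   # degree of belonging to the frame of reference, in [0, 1]

def source_reliability(records: list[PredicateRecord], frame_size: int) -> float:
    """Combine historical accuracy, fit to frame, and coverage for one source."""
    if not records or frame_size <= 0:
        return 0.0
    historical_accuracy = sum(r.confidence for r in records) / len(records)
    fit_to_frame = sum(r.membership for r in records) / len(records)
    coverage = min(1.0, sum(r.membership for r in records) / frame_size)
    return (historical_accuracy + fit_to_frame + coverage) / 3.0
```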

Credibility

We base our assessment of credibility on the degree to which the predicate is supported by evidence, consistency and context.

  • The evidence for a predicate is related to the length of the chain of predicates which comprise the piece of information. A standalone fact is regarded as less credible than a chain of connected facts. We currently measure the length of the chain by the sum of the confidence values for the indicated predicates.
  • Consistency is determined by comparing the predicate to the local connections in the KG. There are three possible forms of support we calculate:
    • deductive support (inference) - a chain of predicates exists from the premise to the conclusion; the support is the product of the confidence values in the chain e.g. “an eagle is an instance of a bird, all birds lay eggs; this provides deductive support that eagles lay eggs”
    • inductive support (generalisation) - chains of predicates link other instances of the premise to the same conclusion; the support is the confidence-weighted proportion of instances that share this conclusion e.g. “an eagle is an instance of a bird, an eagle can fly; a lark is an instance of a bird, a lark can fly; this provides inductive support that birds can fly”
    • abductive support (explanation) - a chain of predicates exists from the conclusion to the premise; the support is related to the number of premises with connections to this conclusion e.g. “birds can fly; the bat is flying; this provides abductive support that a bat is a bird”
  • If the predicate is indicated by multiple sources we adjust the credibility according to the complementary product, i.e. the falsehood is computed as the product of the falsehoods for each source, and the credibility is then 1 minus that falsehood. This assumes source independence, and we are aware that information incest can cause overconfidence in such situations. We have a separate strand of work to address this.
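
Two of these components can be sketched directly, as below: support from a chain of predicate confidences, and the complementary-product fusion of multiple sources. The function names are illustrative; the independence assumption is the one flagged above.

```python
# Minimal sketch: chain support as a product of confidences, and multi-source
# fusion via the complementary product (assumes source independence, as noted).
from math import prod

def deductive_support(chain_confidences: list[float]) -> float:
    """Support from a chain of predicates linking a premise to a conclusion."""
    return prod(chain_confidences) if chain_confidences else 0.0

def fuse_sources(per_source_credibility: list[float]) -> float:
    """Falsehood is the product of per-source falsehoods; credibility is 1 minus that."""
    return 1.0 - prod(1.0 - c for c in per_source_credibility)

# Example: a two-step inference chain, then two independent sources for one predicate.
print(deductive_support([0.9, 0.8]))  # ≈ 0.72
print(fuse_sources([0.6, 0.7]))       # 1 - 0.4 * 0.3 ≈ 0.88
```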

Currency

For some predicates the evidential support decays over time, for others the support is long-lived. For example, Pythagoras’ Theorem is as true today as it was 2,000 years ago, but the fact that Donald Trump is the US President will expire in a little less than 4 years’ time.

To ascertain the currency of a predicate we need to determine several factors:

  • a timeline for when the predicate became true, and if there is a predefined expiration date on the predicate - as above for a presidential term (the window of confidence)
  • a timeline for when fresh evidence is provided in support of or against the predicate (the changes in confidence)
  • a model of the rate at which the predicate becomes out of date (confidence decay)

We use an exponential model of confidence decay with a rate factor given by the confidence half-life (the time interval after which confidence will have fallen by 50% in the absence of new evidence). This decay model is combined multiplicatively with the window of confidence to provide the currency of the information. New evidence resets the clock on confidence decay. We are aware that information incest can occur across time just as it can across sources, and care must be taken not to recycle old information.
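
A minimal sketch of this decay model follows. The function signature, the day-based units, and the boolean window check are assumptions for illustration; the per-predicate calibration of the half-life is part of the project itself.

```python
# Minimal sketch: exponential confidence decay with a half-life, gated by the
# window of confidence. Units (days) and the boolean window flag are assumptions.
from math import exp, log

def currency(days_since_evidence: float, half_life_days: float,
             within_window: bool = True) -> float:
    """Currency factor in [0, 1]; halves every `half_life_days` without new evidence."""
    if not within_window:
        return 0.0  # outside the predicate's window of confidence
    decay_rate = log(2) / half_life_days
    return exp(-decay_rate * days_since_evidence)

# New supporting evidence resets `days_since_evidence` to zero.
print(currency(365, half_life_days=365))  # ≈ 0.5
print(currency(730, half_life_days=365))  # ≈ 0.25
```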

Background & Experience

Dr Bedworth is the Chief Scientist of MLabs AI with over 40 years of experience. He co-authored a 2000 AGI paper with psychologist Carl Frankel. He worked with Nobel Laureate Geoff Hinton on Boltzmann machines, and his work on Bayesian probability received worldwide recognition. Two of his patents were acquired by Apple for use in Siri.

Ibrahim received his Master's degree in Computer Engineering and Machine Intelligence from the American University of Beirut. He specializes in optimization, search algorithms, and deep learning. He has worked at MLabs AI on novel approaches to deep learning and is currently leading the LLM workflow.

Dr Sarma received her Doctorate in Data Science and Artificial Intelligence from the Indian Institute of Technology. She specializes in NLP and reinforcement learning. She has built context embedding models, and LLMs for a number of languages, and is currently a member of the NLP team at MLabs AI, as well as handling our reinforcement learning activities.

Links and references

Website: https://www.mlabs.city/

RIGEL DFR3 project: https://github.com/mlabs-haskell/rigel

NEURAL SEARCH DFR4 project: https://github.com/mlabs-ai/neural-search

Proposal Video

Not Available Yet


  • Total Milestones: 4
  • Total Budget: $60,000 USD
  • Last Updated: 27 May 2025

Milestone 1 - Detection of Compound Terms from Unstructured Text

Description

This milestone deploys our novel IP for the automatic discovery of aliases for entities in a KG. It uses two elements: our variation of the N-M-W algorithm to produce a list of statistically plausible alias candidates, and a semantic analysis to refine this list into a set of entity aliases. The algorithm is computationally efficient and produces final candidate lists that require minimal additional curation.

Deliverables

Software implementation of the compound alias detection algorithm and experimental results.

Budget

$15,000 USD

Success Criterion

Fully operational software implementation, with a demonstration on a large unstructured or semi-structured text corpus and the WikiData KG, together with accompanying documentation.

Milestone 2 - Source Reliability Modelling

Description

In this milestone we will show how source reliability can be bootstrapped using credibility and re-estimation. When constructing AGI KGs we can initialise using a trusted well-curated source. This becomes the baseline. When new information is included in the KG we judge the credibility of the new knowledge and the reliability of its source using the baseline as ground truth. Having assigned values for confidence we can use consistency to recompute the credibility and reliability metrics. In this way we iteratively refine our estimates of source reliability as our knowledge base grows.
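
The loop below is a minimal sketch of this bootstrapping procedure. The scoring and estimation callables, the initial reliability values, and the convergence test are all placeholder assumptions; only the iterate-until-stable structure reflects the description above.

```python
# Minimal sketch of the bootstrap loop: score new predicates against the current
# state, re-estimate source reliability, repeat until estimates stabilise.
# `score_credibility` and `estimate_reliability` are placeholder callables.

def bootstrap_reliability(baseline_kg, new_predicates, sources,
                          score_credibility, estimate_reliability,
                          max_iterations: int = 10, tolerance: float = 1e-3):
    # Trust the curated baseline fully; start other sources at a neutral prior.
    reliability = {s: (1.0 if s == "baseline" else 0.5) for s in sources}
    credibility = {}
    for _ in range(max_iterations):
        # 1. Judge each new predicate against the baseline and current reliabilities.
        credibility = {p: score_credibility(p, baseline_kg, reliability)
                       for p in new_predicates}
        # 2. Re-estimate each source's reliability from the credibility of its output.
        updated = {s: estimate_reliability(s, credibility) for s in sources}
        if max(abs(updated[s] - reliability[s]) for s in sources) < tolerance:
            reliability = updated
            break
        reliability = updated
    return reliability, credibility
```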

Deliverables

Source reliability estimation implementation and an illustration of the system working with a small number of variable-quality sources (such as WordNet, WikiData, and automatically generated KGs such as those from AutoKG).

Budget

$15,000 USD

Success Criterion

Fully operational software implementation, with a demonstration on selected KGs, together with accompanying documentation.

Milestone 3 - Information Credibility Modelling

Description

This milestone is a production-level implementation of the deductive, inductive, and abductive scoring for self-consistency, together with the fusion of evidence from multiple sources of varying reliability. The system will be incorporated into the earlier system for curating KGs and evaluated on the same variable-quality graphs.

Deliverables

Credibility estimation implementation and an illustration of the system working with a small number of variable-quality sources (such as WordNet, WikiData, and automatically generated KGs such as those from AutoKG).

Budget

$15,000 USD

Success Criterion

Fully operational software implementation, with a demonstration on selected KGs, together with accompanying documentation.

Milestone 4 - Confidence Currency Modelling

Description

In the final milestone we include the currency estimation algorithm. The approach is simple and is based on derivatives: how quickly the knowledge is likely to be changing (the derivative) and how long ago the knowledge was acquired (from timestamps) together give an estimate of how out of date the knowledge is likely to be. The decay of currency is modelled as an inverse exponential function, appropriately calibrated to the type of predicate, the source, and the frame of reference.

Deliverables

Currency estimation implementation and an illustration of the approach operating on information extracted from Wikipedia plus associated documentation.

Budget

$15,000 USD

Success Criterion

Fully operational software implementation, with a demonstration on Wikipedia, together with accompanying documentation.
