Knowledge Graph Construction from Raw Data

Seyed Mohammad
Project Owner


Overview

We propose a framework that converts streaming raw observations into a continuously refined knowledge graph for AGI use cases. Self-supervised models learn embeddings that are discretised into MeTTa atoms housed in the MORK hypergraph. A symbolic engine reasons over the graph while neural encoders continuously supply fresh concepts, forming a tight neuro-symbolic loop. Ongoing refinement prunes noise, resolves contradictions, and adds causal links, giving agents a live, compact, and explainable world model.

RFP Guidelines

Advanced knowledge graph tooling for AGI systems

Complete & Awarded
  • Type: SingularityNET RFP
  • Total RFP Funding: $350,000 USD
  • Proposals: 39
  • Awarded Projects: 5
SingularityNET
Apr. 16, 2025

This RFP seeks the development of advanced tools and techniques for interfacing with, refining, and evaluating knowledge graphs that support reasoning in AGI systems. Projects may target any part of the graph lifecycle — from extraction to refinement to benchmarking — and should optionally support symbolic reasoning within the OpenCog Hyperon framework, including compatibility with the MeTTa language and MORK knowledge graph. Bids are expected to range from $10,000 to $200,000.

Proposal Description

Our Team

We are a recently graduated student team with complementary strengths in artificial intelligence and theoretical computer science. Our collaboration brings together deep expertise in applied AI/ML research and rigorous algorithmic thinking. We are driven by curiosity, technical excellence, and a shared goal of solving complex, real-world problems through interdisciplinary innovation.

Project details

We propose an open-source framework that transforms raw observations {X₁, …, Xₙ} (text, image, and audio streams) into a continuously evolving knowledge graph (KG) that supports causal, neuro-symbolic reasoning in the OpenCog Hyperon ecosystem. The system meets the RFP’s call for end-to-end tooling—from extraction through refinement to benchmarking—while remaining domain-agnostic and fully compatible with MeTTa and the MORK hypergraph backend.

An asynchronous ingestion layer normalises heterogeneous inputs into a common event format ⟨id, time, payload, meta⟩. Modular adapters handle UTF-8 text, RGB images, and 16-kHz audio. Each adapter performs light pre-processing (tokenisation, patching, spectrogramming) and forwards the result to its modality-specific encoder. The adapters expose a uniform gRPC interface so new modalities (e.g., time-series sensors) can be snapped in without recompiling the pipeline.
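As an illustration of the adapter contract, the Python sketch below shows the common event format and a minimal text adapter; the class and method names (Event, Adapter, to_event) are placeholders rather than the final API.

from dataclasses import dataclass, field
from typing import Any, Dict
import time
import uuid

@dataclass
class Event:
    """Common event format ⟨id, time, payload, meta⟩ produced by every adapter."""
    id: str
    time: float
    payload: Any                       # pre-processed payload (tokens, patches, spectrogram)
    meta: Dict[str, Any] = field(default_factory=dict)

class Adapter:
    """Base class for modality adapters; subclasses override preprocess()."""
    modality = "generic"

    def preprocess(self, raw: Any) -> Any:
        raise NotImplementedError

    def to_event(self, raw: Any, **meta) -> Event:
        return Event(id=str(uuid.uuid4()), time=time.time(),
                     payload=self.preprocess(raw),
                     meta={"modality": self.modality, **meta})

class TextAdapter(Adapter):
    modality = "text"
    def preprocess(self, raw: str) -> list:
        # Light tokenisation; a production adapter would call the text encoder's tokenizer.
        return raw.lower().split()

In production the same interface would be exposed over gRPC so a new adapter can be registered without touching the rest of the pipeline.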

Three self-supervised encoders—a Transformer-based model for text [1], a masked-token ViT for images [2], and a contrastive speech transformer for audio [3]—learn modality-specific embeddings without labels. The pre-text tasks (masked language modelling, masked patch prediction, future audio contrast) force each encoder to capture high-level semantics rather than surface patterns, delivering unit-normalised 768-D vectors zₖ. A joint projection head maps all embeddings into one aligned latent space, enabling cross-modal comparison via CLIP-style losses [4]. This architecture is agnostic to the specific self-supervised recipe, satisfying the requirement to remain flexible to future representation advances.
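To make the alignment step concrete, here is a hedged PyTorch sketch of the joint projection head and a CLIP-style symmetric contrastive loss [4]; dimensions and module names are illustrative, not the final design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Maps a 768-D modality embedding into the shared, unit-normalised latent space."""
    def __init__(self, in_dim=768, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, z):
        return F.normalize(self.net(z), dim=-1)

def clip_style_loss(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired cross-modal embeddings."""
    logits = (z_a @ z_b.t()) / temperature
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))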

A lightweight, incremental clustering module discretises incoming embeddings. A new vector is merged into the nearest prototype if its cosine distance falls below an adaptive threshold τ, or seeded as a fresh concept node otherwise. Each accepted assignment generates a MeTTa S-expression:

(concept :id 8743 :type Entity :label "violin")
(contextLink 8743 9021 :relation usedIn :confidence 0.87)

where 9021 may be a latent “classical_music” node inferred from co-occurrence statistics. Edges are hyperedges when n-ary relations are detected (e.g., (play Person Instrument Location)). Provenance, timestamps, and encoder uncertainty are stored as atom-level fields, enabling later integrity checks.
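The sketch below shows, in simplified form, how the incremental clusterer could assign embeddings to prototypes and emit the corresponding MeTTa atoms; the fixed threshold and running-mean update stand in for the adaptive policy described above.

import numpy as np

class IncrementalClusterer:
    """Merges each embedding into the nearest prototype or seeds a new concept node."""
    def __init__(self, tau=0.25):
        self.tau = tau                  # cosine-distance threshold (adaptive in the real system)
        self.prototypes = []            # list of (concept_id, unit vector, count)
        self.next_id = 0

    def assign(self, z, label="unknown"):
        z = z / np.linalg.norm(z)
        if self.prototypes:
            dists = [1.0 - float(z @ p) for _, p, _ in self.prototypes]
            best = int(np.argmin(dists))
            if dists[best] < self.tau:
                cid, p, n = self.prototypes[best]
                merged = p * n + z
                merged /= np.linalg.norm(merged)      # keep the prototype unit-length
                self.prototypes[best] = (cid, merged, n + 1)
                return cid, None                      # merged into an existing concept
        cid = self.next_id
        self.next_id += 1
        self.prototypes.append((cid, z, 1))
        atom = f'(concept :id {cid} :type Entity :label "{label}")'
        return cid, atom                              # fresh concept plus its MeTTa atom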

The symbolic layer (PLN + ECAN) consumes the graph to perform logical inference, similarity-guided retrieval, and planning. It can issue graph queries back to the neural side—for instance, “find an embedding most similar to the prototype for cat but closer to water_terrain” to hypothesise fishing_cat. Conversely, the neural layer consults the graph as a semantic prior, biasing its predictions toward entities already grounded with high-confidence links. This bidirectional flow realises the cognitive synergy advocated in neuro-symbolic literature [5], while maintaining clear audit trails for every symbol introduced.
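Queries such as “most similar to cat but closer to water_terrain” reduce to simple arithmetic over concept prototypes; the sketch below is one possible realisation, with the prototype store passed in as a plain dictionary.

import numpy as np

def analogical_query(prototypes, anchor, attractor, alpha=0.3, k=5):
    """Return the k concepts nearest to the anchor prototype shifted toward the attractor."""
    target = (1 - alpha) * prototypes[anchor] + alpha * prototypes[attractor]
    target = target / np.linalg.norm(target)
    scores = {name: float(target @ (v / np.linalg.norm(v)))
              for name, v in prototypes.items() if name != anchor}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# e.g. analogical_query(protos, "cat", "water_terrain") might surface "fishing_cat"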

A dedicated stream engine ingests atoms on commodity hardware. Every Δt seconds, a refinement job executes three passes: (i) duplicate detection and alias merging via incremental disjoint-set union; (ii) contradiction spotting using rule-based SAT templates [6]; (iii) redundancy pruning guided by a compression-coverage objective similar to PG-T [7]. Obsolete atoms decay by exponential ageing unless refreshed by new evidence. This keeps the KG compact, semantically rich, and internally consistent as mandated by the RFP’s quality goals.
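Two of the refinement passes can be sketched compactly: alias merging with an incremental disjoint-set union, and exponential ageing of stale atoms. The half-life and the atom field names below are assumptions, not fixed design decisions.

import math
import time

class DSU:
    """Incremental disjoint-set union used for duplicate detection and alias merging."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def decayed_confidence(atom, half_life=86_400.0, now=None):
    """Exponential ageing: confidence halves every half_life seconds without fresh evidence."""
    now = now if now is not None else time.time()
    age = now - atom["last_seen"]          # assumed atom-level provenance field
    return atom["confidence"] * math.exp(-math.log(2) * age / half_life)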

Temporal edges carry lag metadata, enabling automatic extraction of candidate causal graphs. We implement an algorithm inspired by PCMCI+ [8] to propose candidate causal links, which the symbolic layer verifies through constraint-based testing. Counterfactual queries are answered by running abduction on the causal subgraph, followed by forward simulation using learned conditional distributions. Benchmarks will include TETRAD synthetic sets and MetaQA multi-hop causal subsets, reporting AUROC and average intervention score to satisfy the “evaluate KG utility for reasoning” requirement.
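As a deliberately simplified stand-in for the PCMCI+-inspired procedure [8], the sketch below scores lagged correlations between node activation series and proposes candidate causal edges for the symbolic layer to verify; the real module adds conditional-independence testing.

import numpy as np

def propose_causal_edges(series, max_lag=5, threshold=0.4):
    """series: dict mapping node_id to an equal-length 1-D activation time series.
    Returns (cause, effect, lag, score) candidates for constraint-based verification."""
    candidates = []
    for cause in series:
        for effect in series:
            if cause == effect:
                continue
            for lag in range(1, max_lag + 1):
                x = np.asarray(series[cause][:-lag], dtype=float)
                y = np.asarray(series[effect][lag:], dtype=float)
                if x.std() == 0 or y.std() == 0:
                    continue
                r = float(np.corrcoef(x, y)[0, 1])
                if abs(r) >= threshold:
                    candidates.append((cause, effect, lag, r))
    return candidates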

The entire knowledge‐graph layer is built on MeTTa and backed by the high-performance MORK engine, with a simple API (REST/JSON-LD) for integration. It’s delivered as a modular, containerized package for straightforward deployment and maintenance. The code is organized into clear neural and symbolic components under an open-source license, thoroughly tested, and optimized to meet the RFP’s performance and scalability expectations.
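A minimal FastAPI sketch of the kind of REST endpoint the package could expose is shown below; the route, payload fields, and the run_mork_query helper are illustrative placeholders, with the MORK call stubbed out.

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="KG query API (illustrative)")

class Query(BaseModel):
    metta: str                  # a MeTTa pattern, e.g. '(contextLink $x 9021 :relation usedIn)'
    limit: int = 20

def run_mork_query(pattern: str, limit: int = 20):
    """Placeholder for the dispatch to the MORK backend; returns an empty result set here."""
    return []

@app.post("/query")
def query_graph(q: Query):
    results = run_mork_query(q.metta, limit=q.limit)
    return {"@context": "https://schema.org/", "results": results}   # JSON-LD-style envelope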

We will report: extraction F₁ on Wikidata subsets, compression-coverage ratio, contradiction resolution accuracy, causal inference AUROC, and average query latency. These metrics map directly to the RFP’s emphasis on compactness, integrity, reasoning support, and performance. Public leaderboards and Jupyter notebooks will accompany each release for third-party replication.

 

[1] Devlin J. et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” NAACL-HLT 2019.
[2] He K. et al., “Masked Autoencoders Are Scalable Vision Learners,” CVPR 2022.
[3] Baevski A. et al., “Wav2Vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” NeurIPS 2020.
[4] Radford A. et al., “Learning Transferable Visual Models From Natural Language Supervision,” ICML 2021.
[5] Liu Z. et al., “Neuro-Symbolic Methods and Knowledge Graph Reasoning: A Survey,” AAAI 2023.
[6] Zhang J. et al., “Rule-Based Knowledge Graph Consistency Checking,” ISWC 2022.
[7] Bourhis P. et al., “Pruning Knowledge Graphs with Pattern-Graph Truncation,” WWW 2023.
[8] Runge J. et al., “Detecting and Quantifying Causal Associations in Large Nonlinear Time Series Datasets,” Science Advances, 2019.

Background & Experience

Seyed Mohammad Seyed Javadi is a Master’s student in Computer Science at York University, specializing in theoretical computer science. He has published in top-tier conferences such as IJCAI and ICALP, and is a National Gold Medalist in the Iranian Computer Olympiad. His strengths include competitive programming, algorithm design, and formal theoretical analysis.

Amirhossein Mohammadi is a Master’s student in Artificial Intelligence at York University with over three years of focused research experience in AI and machine learning. His work spans model development, deep learning, and applied machine learning systems. Amir has hands-on experience building AI solutions and is passionate about advancing practical applications of intelligent systems.

 



  • Total Milestones: 3
  • Total Budget: $65,000 USD
  • Last Updated: 28 May 2025

Milestone 1 - Research Plan & Architecture Definition

Description

This phase establishes the foundation of the entire project by crystallizing requirements, data modalities, and the high-level design. We will map out the end-to-end pipeline—from ingestion adapters through self-supervised encoders to graph population and reasoning loops—in a detailed architecture diagram. Risks, dependencies, and success metrics are identified, ensuring that all stakeholders share a common understanding of scope and deliverables.

Deliverables

A comprehensive research plan document (20–30 pages) will be delivered including: (1) an annotated architecture diagram showing data flows and component interactions; (2) a prioritized backlog of technical tasks with estimated effort and risk ratings; and (3) a prototype configuration for running initial data ingestion and embedding experiments. This document will be presented in a review meeting with the RFP committee and iterated based on feedback. All materials will be shared in a public GitHub repository with version control and issue tracking enabled.

Budget

$15,000 USD

Success Criterion

The milestone is successful when the review committee formally approves the research plan with no major open issues (i.e., all “critical” or “high” risks are mitigated or have clear contingency plans). The backlog must cover ≥ 90% of required technical tasks, and the architecture diagram must receive sign-off from both the neural-model and symbolic-reasoning leads. Finally, the GitHub repository must be populated with initial issues and milestones, demonstrating that the team is ready to begin prototype development.

Milestone 2 - Prototype Implementation & Preliminary Testing

Description

In this stage we build and assemble the core pipeline components: multimodal ingestion adapters, self-supervised encoders, embedding-to-symbol grounding, and basic MeTTa export into MORK. The prototype will process a controlled dataset (≈100 000 atoms) end-to-end and generate an initial knowledge graph with live streaming updates. We will also integrate a simple causal inference module to demonstrate generation of candidate “causes” relationships from temporal data.

Deliverables

A working codebase (Python + Rust) will be delivered as a Docker-Compose package that, when launched, ingests sample text, image, and audio streams and outputs a browsable MORK hypergraph. We will provide a Jupyter notebook illustrating data ingestion, embedding visualization, node creation, and causal edge proposals. In addition, a preliminary benchmark report will show extraction F₁, prototype clustering purity, and update latency metrics on the test dataset.

Budget

$30,000 USD

Success Criterion

This milestone is deemed complete when the prototype successfully ingests ≥ 80 % of test observations without errors, produces a knowledge graph of ≥ 100 000 nodes/edges, and maintains end-to-end latency below 250 ms per observation. The preliminary benchmarks must show F₁ ≥ 0.7 for concept extraction and clustering purity ≥ 0.65. Finally, the delivered notebook and Docker package must run out-of-the-box on a standard GPU-equipped machine following documented steps.

Milestone 3 - Full-System Delivery & Benchmark Validation

Description

The final phase delivers the scalable, production-ready framework capable of handling ≥ 10 million atoms with continuous streaming updates, refinement passes, and causal reasoning. We will implement advanced graph refinement strategies (alias merging, contradiction resolution, pruning) and integrate a robust causal inference engine that supports counterfactual queries. Full developer and user documentation, API references, and example applications will complete the system.

Deliverables

We will provide: (1) the full codebase with CI/CD pipelines; (2) Docker images and Helm charts for Kubernetes deployment; (3) a suite of automated benchmarks—covering extraction accuracy, compression-coverage ratios, causal inference AUROC, and query latencies—run on large synthetic and real-world datasets; and (4) comprehensive documentation including a user guide, API reference, and tutorial workflows. All artifacts will be published under an open-source license in a public repository.

Budget

$20,000 USD

Success Criterion

The project is successful if the system processes 10 million+ atoms with streaming ingest throughput ≥ 5 000 atoms/s and average query latencies ≤ 100 ms. Benchmark results must meet or exceed: extraction F₁ ≥ 0.8, compression-coverage ≥ 0.5, and causal AUROC ≥ 0.7 on standard test suites. User acceptance is confirmed via a demo session with the RFP team, demonstrating deployment, end-to-end ingestion, and execution of representative reasoning tasks.


