Benchmind

Prasad Kumkar
Project Owner


Expert Rating

n/a

Overview

Benchmind is a benchmarking suite for symbolic and neuro-symbolic reasoning in AGI systems built on Hyperon/MeTTa. It provides a standardized set of reasoning tasks—multi-hop QA, analogical reasoning, and hypothesis generation—along with datasets, evaluation metrics, and tooling to assess the effectiveness of knowledge graphs and reasoning agents. Fully integrated with MeTTa and MORK, Benchmind enables developers to stress-test and compare symbolic, probabilistic, and hybrid LLM-based reasoning strategies at scale. It establishes a shared foundation for measuring reasoning performance, uncovering bottlenecks, and driving progress toward robust, interpretable AGI.

RFP Guidelines

Advanced knowledge graph tooling for AGI systems

Complete & Awarded
  • Type: SingularityNET RFP
  • Total RFP Funding: $350,000 USD
  • Proposals: 39
  • Awarded Projects: 5
SingularityNET
Apr. 16, 2025

This RFP seeks the development of advanced tools and techniques for interfacing with, refining, and evaluating knowledge graphs that support reasoning in AGI systems. Projects may target any part of the graph lifecycle — from extraction to refinement to benchmarking — and should optionally support symbolic reasoning within the OpenCog Hyperon framework, including compatibility with the MeTTa language and MORK knowledge graph. Bids are expected to range from $10,000 - $200,000.

Proposal Description

Our Team

Prasad - 6+ years of engineering experience, with deep expertise in decentralized systems. Bachelor’s in CS Engineering.

Harsh - Oversees backend architecture and service design. 

Akash - Specializes in scalable backend services. Designs & implements core components for indexing, storage, and event processing pipelines.

Kartik - Delivers across backend and frontend layers. Works on API systems, dashboard integrations, and end-to-end feature development with a focus on reliability and consistency.

Company Name (if applicable)

Chainscore Labs

Project details

The Benchmind project proposes the creation of a robust, extensible, and purpose-built benchmarking suite to evaluate the reasoning capabilities of symbolic and neuro-symbolic AGI systems within the Hyperon framework. Designed specifically for the MeTTa language and MORK hypergraph backend, Benchmind will serve as a standard for measuring progress in reasoning-rich AGI by providing a suite of realistic cognitive tasks, datasets, metrics, and tooling. The project will offer quantitative evaluations across reasoning types — including multi-hop question answering, analogical reasoning, and hypothesis generation — and support diverse reasoning agents (symbolic, probabilistic, and LLM-augmented). Benchmind will accelerate research and development in AGI systems built atop Hyperon by making reasoning performance measurable, comparable, and improvable.

Introduction

As AGI development accelerates, symbolic knowledge representation and reasoning systems — such as OpenCog Hyperon’s Atomspace and the MeTTa language — are playing a central role in organizing and manipulating structured knowledge at scale. While much effort has been focused on building and maintaining these symbolic knowledge graphs (e.g., via MORK), there is currently no standardized, system-aligned framework for evaluating how well such graphs and their associated agents perform in actual reasoning tasks.

Existing KG benchmarks (e.g., for link prediction or triple classification) do not test the kind of advanced, compositional reasoning that AGI systems require. They also rarely align with the semantics or architecture of Hyperon, making them unsuitable for teams working in the MeTTa ecosystem.

This lack of a standard evaluation protocol:

  • Slows down iteration and progress in symbolic AGI tooling.

  • Obscures bottlenecks in reasoning accuracy, tractability, or coverage.

  • Prevents effective comparison of approaches across teams and methods.

Benchmind is designed to fill this gap.

Goals and Scope

Benchmind will develop:

  1. A suite of benchmark tasks for symbolic and neuro-symbolic AGI reasoning, designed around the MeTTa/MORK environment and covering a range of reasoning capabilities crucial to AGI development.

  2. Task categories and datasets that reflect real-world and AGI-relevant cognitive challenges:

    • Multi-hop factual and logical question answering.

    • Analogical reasoning and pattern completion.

    • Hypothesis generation and abductive reasoning.

  3. Evaluation metrics and tooling to automatically assess reasoning performance, path efficiency, accuracy, and robustness.

  4. Interfaces and APIs to plug in different reasoning agents, including symbolic engines (e.g., MeTTa queries, PLN), probabilistic solvers, and LLM-augmented modules.

  5. End-to-end pipeline allowing other developers to benchmark their knowledge graphs or agents within the Hyperon ecosystem using minimal setup.

Benchmind will act as a “reasoning stress-test” for symbolic AGI agents, offering interpretability, reproducibility, and diagnostic insight into the strengths and weaknesses of different approaches.

Technical Architecture

Benchmind consists of four primary components:

1. Task Library (Cognitive Benchmarks)

A modular task framework where each benchmark task is defined via:

  • A knowledge graph scenario or dataset (loaded into MORK).

  • A reasoning objective (e.g., answer a multi-hop query, complete a relational analogy, generate a plausible new link).

  • Ground-truth answers or expected outputs for evaluation.
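Concretely, such a task definition could be captured in a small record type. The sketch below is illustrative only; the class and field names are assumptions, not part of the proposal, and the real suite would load `graph_facts` into MORK rather than keep them in a Python list:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkTask:
    """One Benchmind task: a KG scenario, a reasoning objective, and ground truth."""
    task_id: str
    category: str                            # e.g. "multi_hop_qa", "analogy", "abduction"
    graph_facts: list                        # (subject, relation, object) triples for the KG
    objective: str                           # the question posed to the reasoning agent
    ground_truth: set                        # accepted answers, used by the evaluator

# A toy multi-hop task over two facts.
task = BenchmarkTask(
    task_id="qa-001",
    category="multi_hop_qa",
    graph_facts=[("aspirin", "inhibits", "cox1"), ("cox1", "produces", "thromboxane")],
    objective="What does aspirin indirectly affect?",
    ground_truth={"thromboxane"},
)
```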

Initial task categories:

  • Multi-hop QA: Answering questions that require traversing 2–5 relational hops in the graph.

  • Analogical reasoning: Given A:B::C:?, identify D such that the relation between C and D mirrors A and B.

  • Abductive/hypothesis generation: Given partial or incomplete facts, suggest plausible inferences that complete the knowledge context.

2. Reasoning Agent Interfaces

A flexible plug-in interface to test different types of reasoners:

  • Pure MeTTa logic-based agents.

  • Probabilistic inference engines (e.g., PLN).

  • LLM-augmented retrieval or ranking agents (LLM queries the KG via a prompt-aware interface).
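The plug-in boundary could be a minimal interface that every reasoner implements, whether symbolic, probabilistic, or LLM-backed. The sketch below is a hypothetical shape for that contract, not the proposed API; the baseline agent shown does no inference at all:

```python
from abc import ABC, abstractmethod

class ReasoningAgent(ABC):
    """Minimal contract a Benchmind harness might require of any reasoner."""

    @abstractmethod
    def answer(self, objective: str):
        """Return (candidate answers, reasoning trace) for a task objective."""

class LookupAgent(ReasoningAgent):
    """Trivial baseline: answers from a fixed fact table, with no inference."""
    def __init__(self, facts):
        self.facts = facts

    def answer(self, objective):
        # Match any fact whose subject is mentioned in the question.
        hits = {obj for subj, _rel, obj in self.facts if subj in objective}
        return hits, [f"matched {len(hits)} direct fact(s)"]

agent = LookupAgent([("aspirin", "inhibits", "cox1")])
answers, trace = agent.answer("What does aspirin inhibit?")
```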

3. Evaluation Engine

For each benchmark task, Benchmind will:

  • Track reasoning chains or paths taken.

  • Score correctness, path length, response time, and confidence.

  • Log failure cases for inspection.

Evaluation metrics will include:

  • Accuracy/precision/recall.

  • Average path length to solution.

  • Inference steps vs. brute force.

  • Coverage and robustness over time.
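The set-based metrics above reduce to comparisons between an agent's candidate answers and the ground truth. A minimal scoring helper, as an illustration of the idea rather than the proposed implementation:

```python
def score(predicted: set, truth: set) -> dict:
    """Precision/recall/F1 over answer sets, plus exact-match accuracy."""
    true_pos = len(predicted & truth)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": 1.0 if predicted == truth else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# One correct answer plus one spurious answer: recall is perfect, precision suffers.
metrics = score({"thromboxane", "cox1"}, {"thromboxane"})
```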

4. Execution Environment

  • Fully integrated with MeTTa/MORK stack.

  • CLI interface and config-based task execution.

  • Containerized deployment for reproducibility.

Key Features and Deliverables

  • MeTTa-native benchmarks: Tasks will be formulated as MeTTa expressions, runnable directly by Hyperon agents.

  • MORK-optimized execution: Fast path lookup and atom matching using MORK’s zipper graph operations.

  • Hybrid reasoning support: Enable evaluation of agents that combine symbolic inference with neural language models.

  • Debugging & analytics: Visual or textual summaries of reasoning chains, failure types, and knowledge gaps.

  • Documentation & extensibility: Clear developer documentation, plus support for custom task and agent modules.

Workflow Example

  1. A developer loads a knowledge graph into MORK (e.g., biomedical or general-purpose).

  2. They choose a task set, e.g., “Analogical Reasoning.”

  3. They define or select a reasoning agent (e.g., a MeTTa query runner).

  4. Benchmind executes all tasks, logs results, and computes evaluation metrics.

  5. The developer receives a diagnostic report showing:

    • Task accuracy and failure cases.

    • Bottlenecks (e.g., missing graph links, inefficient traversals).

    • Comparisons to baseline solvers.
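The five steps above amount to a simple harness loop. A schematic version, with all names and the task/agent/scorer shapes hypothetical:

```python
def run_suite(tasks, agent, scorer):
    """Execute every task against one agent and aggregate a diagnostic report."""
    report = {"results": [], "failures": []}
    for task in tasks:
        predicted = agent(task["objective"])        # steps 3-4: agent answers the task
        metrics = scorer(predicted, task["truth"])  # step 4: score against ground truth
        report["results"].append((task["id"], metrics))
        if metrics["accuracy"] < 1.0:               # step 5: log failure cases
            report["failures"].append(task["id"])
    return report

tasks = [{"id": "qa-001", "objective": "2-hop question", "truth": {"thromboxane"}}]
agent = lambda objective: {"thromboxane"}                       # stand-in reasoner
scorer = lambda p, t: {"accuracy": 1.0 if p == t else 0.0}      # stand-in evaluator
report = run_suite(tasks, agent, scorer)
```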

Use Cases and Value

  • Benchmarking New Reasoners: Test new symbolic or hybrid inference engines under realistic AGI tasks.

  • KG Quality Testing: Compare performance of different KGs on the same task set — useful for graph builders.

  • Debugging AGI Agents: Identify why reasoning failed (missing links? ambiguous paths? bad confidence ranking?).

  • Community Comparison: Create a shared standard for reasoning performance in the Hyperon/MeTTa ecosystem.

Datasets and Ground Truth

Benchmind will include:

  • Curated mini-knowledge graphs with handcrafted reasoning challenges.

  • Synthetic datasets for stress testing.

  • Converted subsets of public graphs (e.g., ConceptNet, Wikidata) adapted to MORK/MeTTa.

All datasets will be open-source and extensible.
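Adapting a public triple store could be as simple as serializing each (subject, relation, object) triple as an S-expression atom. This is a rough sketch under the assumption that MeTTa-style atoms take a `(relation subject object)` form; the actual MORK loading format may differ:

```python
def triples_to_metta(triples):
    """Render (subject, relation, object) triples as S-expression atom strings."""
    return [f"({rel} {subj} {obj})" for subj, rel, obj in triples]

# Toy ConceptNet-like triples.
conceptnet_like = [("cat", "IsA", "animal"), ("cat", "CapableOf", "purr")]
print(triples_to_metta(conceptnet_like))
# → ['(IsA cat animal)', '(CapableOf cat purr)']
```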

Integration with MORK & MeTTa

Benchmind is explicitly built for:

  • MORK: Used as the execution backend for atomspace storage and traversal.

  • MeTTa: Tasks encoded as MeTTa expressions. Agents must return MeTTa-compatible output or logs.

This ensures deep, native integration rather than loose interfacing.

Competitive Edge

  • First standardized benchmark for reasoning in Hyperon/MeTTa.

  • Covers neuro-symbolic scenarios explicitly (e.g., hybrid LLM+KG agents).

  • Designed for scale: Supports small experiments and billion-atom KGs via MORK.

  • Community-aligned: Easily reusable by other RFP teams and Hyperon researchers.

Open Source Licensing

MIT - Massachusetts Institute of Technology License

Background & Experience

Chainscore Labs is a specialist Web3 R&D firm with deep expertise in blockchain infrastructure, distributed systems, and AI. Our team combines seasoned software engineers (Python, Rust, smart contracts) with AI researchers experienced in symbolic reasoning, probabilistic logic networks, and meta-programming. 

Proposal Video

Not Available Yet

Check back later during the Feedback & Selection period of the RFP that this proposal applies to.

  • Total Milestones: 3

  • Total Budget: $60,000 USD

  • Last Updated: 27 May 2025

Milestone 1 - Benchmark Design & Prototyping

Description

In this initial phase we will design the core structure and categories of the Benchmind benchmarking suite. This includes identifying AGI-relevant reasoning tasks (e.g., multi-hop QA, analogical reasoning, hypothesis generation) and defining benchmark formats, metrics, and evaluation criteria. We will analyze the cognitive capabilities needed for each task and design sample MeTTa-compatible benchmarks for each category. This milestone also includes technical planning for MORK integration and establishing baseline agents for comparative evaluation.

Deliverables

  • A technical specification document detailing benchmark categories, task types, input/output formats, and scoring metrics.

  • A prototype of 2–3 benchmark tasks for MeTTa/Hyperon agents, including test datasets and expected outputs.

  • Documentation of evaluation protocols and examples of reasoning queries.

  • A project roadmap and timeline breakdown for Milestones 2 and 3.

Budget

$12,000 USD

Success Criterion

  • At least three reasoning task categories are defined and prototyped.

  • All sample tasks are executable using MeTTa expressions and operate on a small-scale MORK-based knowledge graph.

  • The evaluation logic is demonstrably functional on prototype tasks, scoring outputs for accuracy and reasoning depth.

  • Internal testing confirms end-to-end flow from task definition to result scoring.

  • Documentation is clear, complete, and ready for integration with future milestones.

Milestone 2 - Core Suite Implementation & Integration

Description

This milestone focuses on the development of the core Benchmind benchmarking framework. We will build the task execution engine, scoring/evaluation logic, and modular interfaces for plugging in symbolic, probabilistic, and hybrid reasoning agents. Each task category (QA, analogies, hypothesis generation) will be expanded with real and synthetic datasets and configured for batch testing within the MORK environment. We will implement benchmarking support for MeTTa queries and symbolic workflows, optimizing graph interaction using MORK APIs for traversal and querying. Initial performance baselines will be gathered for standard agents to validate the framework.

Deliverables

  • A fully functional CLI-based benchmarking engine with support for running task suites and recording results.

  • 10+ finalized benchmark tasks across 3 categories, with ground-truth answers and scoring metrics.

  • Integration with the MORK hypergraph backend for fast reasoning task execution.

  • Interfaces for symbolic agents (e.g. MeTTa scripts) and hybrid agents (LLM+KG).

  • Evaluation logs capturing success/failure modes, reasoning paths, and diagnostic metrics.

  • Internal baseline benchmark runs (e.g. using MeTTa-only or hardcoded traversal agents).

Budget

$24,000 USD

Success Criterion

• The benchmarking engine runs reliably across all included task types. • Integration with MeTTa and MORK is confirmed — benchmark tasks can be loaded, queried, and scored via native interfaces. • Benchmind successfully logs execution traces and performance scores for each benchmarked agent. • Baseline agents (symbolic and/or LLM-augmented) are benchmarked on at least one full task per category. • Codebase is modular, documented, and ready for open-sourcing.

Milestone 3 - Finalization, Optimization & Public Release

Description

In this final phase we will complete the Benchmind suite with full optimization, documentation, and packaging for open-source release. This includes refining performance (query speed, memory usage), stress-testing against larger graphs in MORK, and ensuring stable support for hybrid agents (LLM + symbolic). We will containerize the project for easy deployment, write detailed usage guides, and validate the suite’s utility by benchmarking external or community agents. This milestone aims to deliver a robust, extensible, and production-ready benchmarking standard for symbolic AGI reasoning within the Hyperon ecosystem.

Deliverables

  • Full Benchmind codebase tested on large-scale graphs with MORK.

  • Containerized deployment setup (e.g. Docker) and CLI install scripts.

  • Finalized benchmark task sets, including datasets and evaluation configs.

  • Performance-tuned agent interfaces (symbolic, probabilistic, LLM-augmented).

  • Comprehensive documentation (user manual, dev guide, extension API).

  • Final benchmarking report with comparative evaluation of agents.

Budget

$24,000 USD

Success Criterion

  • Benchmind runs consistently across environments (local + containerized).

  • Final benchmarks execute on large MORK graphs (≥500k atoms) without failure.

  • Documentation enables external developers to run and extend the suite independently.

  • All benchmark categories show measurable reasoning outcomes across multiple agents.

  • Project is open-sourced, publicly accessible, and presented with a reproducible demo or walkthrough.


