Medical Knowledge Graph from Papers Using the STC

ivan reznikov
Project Owner

Funding Requested

$25,000 USD

Expert Review

0

Community

3.8 (6)

Overview

The objective of this project is to develop documentation and a PoC for a comprehensive medical knowledge graph built from research papers using the Standard Template Construct (STC). The project will involve cleaning the data, identifying and filling in missing values, and creating a recommendation system based on citations, large language models (LLMs), and embeddings. We previously built an STC subgraph during DFR4-beta (https://deepfunding.ai/proposal/metta-driven-kg-service-with-llm-integration/), but noticed that the STC has missing data and incorrect links that prevent building a fully useful medical knowledge graph.

Proposal Description

Company Name (if applicable)

KNAI

How our project will contribute to the growth of the decentralized AI platform

Our project will contribute to the growth of the AI platform by enhancing its data integration and knowledge representation capabilities. By creating a comprehensive and structured knowledge graph of medical research papers, the platform will facilitate more efficient data discovery and retrieval for users. The enriched and cleaned dataset, combined with semantic relationship modeling and recommendation systems, will improve the accuracy and relevance of AI-driven insights.

The core problem we are aiming to solve

The core problem we aim to solve is the difficulty in efficiently discovering and understanding the relationships between medical research papers. Researchers often struggle to navigate the vast and growing volume of literature, leading to missed connections and redundant efforts. By creating a comprehensive knowledge graph that accurately represents the relationships between research papers, authors, and institutions, and by implementing a recommendation system based on citation analysis and semantic relationships, we can significantly enhance researchers' ability to find relevant work, gain insights, and foster collaboration, ultimately advancing medical research more effectively.

Our specific solution to this problem

The proposed solution involves two main milestones: Data Cleaning and Enrichment, and Knowledge Graph Creation. Initially, research papers will be collected from sources like PubMed and Google Scholar using APIs or web scraping frameworks. The data will be cleaned by removing duplicates, correcting formatting issues, handling missing values, and standardizing author and institution names. For the Knowledge Graph Creation, the cleaned data will be imported into a graph database, where nodes and edges representing papers and their relationships will be established. Embeddings will be used to identify semantic relationships, and a simple recommendation system will be developed to suggest related papers based on citation analysis, validated through user feedback or benchmark datasets.

Project details

Step 1: Data Scraping

Objective (RFP Documentation): Collect research papers from various sources.

    • List potential sources such as PubMed, Google Scholar, and institutional repositories.
    • Use APIs, the requests library, or web scraping tools such as BeautifulSoup to build scraper pipelines.
    • Implement scrapers to extract data points including titles, authors, abstracts, keywords, and citations.
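
As a rough illustration of the extraction step, the sketch below parses a small XML snippet into records with a title, authors, and abstract. The element names are an assumption modeled on the layout of PubMed-style XML exports and should be checked against real API responses; the sample data and the function name parse_articles are made up for this example. No network call is made.

```python
import xml.etree.ElementTree as ET

# Made-up sample shaped like a PubMed-style XML export (assumed layout).
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000001</PMID>
      <Article>
        <ArticleTitle>Example title one</ArticleTitle>
        <Abstract><AbstractText>Example abstract.</AbstractText></Abstract>
        <AuthorList>
          <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
          <Author><LastName>Roe</LastName><ForeName>Richard</ForeName></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000002</PMID>
      <Article>
        <ArticleTitle>Example title two</ArticleTitle>
        <AuthorList/>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def parse_articles(xml_text):
    """Extract one flat record per article; missing fields become None or []."""
    root = ET.fromstring(xml_text)
    records = []
    for art in root.iter("PubmedArticle"):
        authors = [
            f"{a.findtext('ForeName', default='')} {a.findtext('LastName', default='')}".strip()
            for a in art.iter("Author")
        ]
        records.append({
            "pmid": art.findtext(".//PMID"),
            "title": art.findtext(".//ArticleTitle"),
            "abstract": art.findtext(".//AbstractText"),  # None when absent
            "authors": authors,
        })
    return records


records = parse_articles(SAMPLE_XML)
for rec in records:
    print(rec["pmid"], rec["title"], rec["authors"])
```

Keeping missing fields as None rather than empty strings makes the gaps easy to count later, during the cleaning step.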

Step 2: Data Cleaning

Objective (RFP Documentation): Ensure the data is accurate and complete.

  1. Remove Duplicates:

    • Identify and eliminate duplicate entries using techniques such as fuzzy matching.
    • Ensure unique identifiers are assigned to each paper.
  2. Correct Formatting Issues:

    • Standardize text data, ensuring consistent casing and removal of special characters.
    • Normalize date formats and citation styles.
  3. Handle Missing Values:

    • Apply imputation techniques such as mean, median, or predictive imputation.
    • Assess the impact of missing data and document any assumptions made.
  4. Standardize Author and Institution Names:

    • Use reference datasets to standardize author names (e.g., ORCID IDs) and institution names.
    • Implement algorithms to resolve name variations and affiliations.
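
The duplicate-removal and text-standardization items above could look roughly like the following sketch. It uses difflib from the Python standard library as one possible fuzzy matcher; the 0.9 similarity threshold, the record layout, and the helper names normalize_title and deduplicate are assumptions for illustration, not choices stated in the proposal.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def deduplicate(papers, threshold=0.9):
    """Keep the first paper of every group of near-identical titles.

    The pairwise comparison is O(n^2), so a large corpus would need
    blocking or indexing first; this is only a small-scale sketch.
    """
    kept = []
    for paper in papers:
        norm = normalize_title(paper["title"])
        is_duplicate = any(
            SequenceMatcher(None, norm, normalize_title(k["title"])).ratio() >= threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(paper)
    return kept


# Hypothetical records; ids and titles are invented for the example.
papers = [
    {"id": "p1", "title": "Stroke Risk in Oncology Patients: A Review"},
    {"id": "p2", "title": "Stroke risk in oncology patients - a review."},
    {"id": "p3", "title": "Graph Embeddings for Citation Networks"},
]
unique = deduplicate(papers)
print([p["id"] for p in unique])
```

Title matching alone can merge distinct papers that share a title, so in practice identifiers such as DOIs would be compared first where they exist.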

Step 3: Building the Knowledge Graph

Objective (PoC): Represent relationships between research papers and entities.

  1. Populate the Knowledge Graph:

    • Use a graph database to store the knowledge graph.
    • Import cleaned data into the graph database, creating nodes and edges based on the defined schema.
  2. Use Embeddings to Establish Semantic Relationships:

    • Generate embeddings for abstracts and research papers if possible.
    • Use these embeddings to identify and strengthen semantic relationships between entities.
  3. Implement a Simple Recommendation System:

    • Develop algorithms to suggest related papers based on citation analysis.
    • Use graph traversal and similarity measures to provide recommendations.
    • Validate the recommendations through user feedback or benchmark datasets.
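
One way to combine citation analysis with embedding similarity is sketched below. It scores candidate papers by a weighted sum of reference-list overlap (Jaccard similarity, a form of bibliographic coupling) and cosine similarity of embedding vectors. The weighting parameter alpha, the toy two-dimensional embeddings, and the data layout are all assumptions for illustration; the proposal does not specify a scoring formula.

```python
import math


def jaccard(a, b):
    """Overlap of two reference sets, from 0.0 (disjoint) to 1.0 (identical)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def recommend(target_id, papers, alpha=0.5, top_k=2):
    """Rank other papers by alpha * citation overlap + (1 - alpha) * embedding similarity."""
    target = papers[target_id]
    scored = []
    for pid, paper in papers.items():
        if pid == target_id:
            continue
        score = (alpha * jaccard(target["refs"], paper["refs"])
                 + (1 - alpha) * cosine(target["emb"], paper["emb"]))
        scored.append((pid, round(score, 3)))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical papers: "refs" are cited-paper ids, "emb" are toy embeddings.
papers = {
    "A": {"refs": {"r1", "r2", "r3"}, "emb": [1.0, 0.0]},
    "B": {"refs": {"r2", "r3", "r4"}, "emb": [1.0, 0.0]},
    "C": {"refs": {"r9"}, "emb": [0.0, 1.0]},
    "D": {"refs": {"r1"}, "emb": [1.0, 1.0]},
}
top = recommend("A", papers)
print(top)
```

In a graph database the same overlap could be computed by traversing citation edges instead of comparing in-memory sets; the scoring idea stays the same.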

The competition and our USPs

Yes

Describe how your solution distinguishes itself from other solutions (if exist) and how it will succeed in the market.

Our solution distinguishes itself by combining advanced data cleaning techniques, the use of embeddings for semantic analysis, and the creation of a dynamic, scalable knowledge graph tailored specifically for medical research papers. Unlike existing solutions such as Google Scholar and PubMed, which primarily offer basic search functionalities and static lists of related papers, our approach leverages sophisticated algorithms to uncover deeper, context-rich relationships between papers, authors, and institutions. By integrating these features into a user-friendly interface with a recommendation system that evolves with user feedback, our solution offers a more intuitive and powerful tool for researchers. This comprehensive and proactive approach will not only improve research efficiency but also foster new discoveries and collaborations, ensuring its success in the market.

  1. PubMed
  2. Google Scholar
  3. Semantic Scholar
  4. ResearchGate
  5. Dimensions
  6. Microsoft Academic
  7. Scopus
  8. Web of Science

Our team

Our team consists of two Data Engineers and one Frontend and Graph Network specialist. We also have a medical domain expert on board.

What we still need besides budget?

No

  • Total Milestones

    2

  • Total Budget

    $25,000 USD

  • Last Updated

    20 May 2024

Milestone 1 - Data Cleaning and Enrichment

Description

Refine data imputation techniques. Leverage external databases and APIs for data enrichment. Employ LLMs to infer missing values where applicable.

Deliverables

Documentation related to the following topics:

    • Cleaned and standardized dataset of research papers.
    • Report detailing the cleaning process, including methods used for data imputation and enrichment.

Budget

$10,000 USD

Milestone 2 - Knowledge Graph Creation

Description

Documentation related to the following topics:

    • Use graph databases (MeTTa or Neo4j) to store and manage the knowledge graph.
    • Utilize embeddings to capture semantic relationships between entities.

Proof of Concept:

    • Implement a simple recommendation system using machine learning techniques and algorithms suitable for graph data.

Deliverables

Fully functional knowledge graph representing the relationships between research papers, authors, and other entities. A recommendation system leveraging the knowledge graph to provide insights and suggestions. Documentation detailing the knowledge graph schema, the processes used to establish relationships, and the functionality of the recommendation system.

Budget

$15,000 USD

Join the Discussion (4)

4 Comments
  • 0
    ivan reznikov
    May 20, 2024 | 10:26 AM

    The purpose of this project is to make more effective medical findings through graph search

  • 0
    Jan Horlings
    May 19, 2024 | 9:36 AM

    Is this proposal compliant with the actual RFP: https://deepfunding.ai/rfp/content-knowledge-graph/ ?

    If not, please add it to another pool such as Miscellaneous or, in case you are utilizing/creating an AI service, to 'new services.'

    • 0
      ivan reznikov
      May 20, 2024 | 10:27 AM

      This is an RFP, but it is not related to the Content Knowledge Graph.

      • 0
        Jan Horlings
        May 20, 2024 | 10:48 AM

        Clear. Thanks for moving. BTW, this pool offers grants for in-depth, quality RFPs, not for their development. Looking at the budget, that is probably clear.

Reviews & Rating

6 ratings
  • 1
    IndependentStream
    May 21, 2024 | 10:41 AM

    Overall

    5

    • Feasibility 5
    • Viability 5
    • Desirability 5
    • Usefulness 5
    Improvement of Healthcare

    As a neurologist, it can be challenging to shift to modern, intuitive systems that require some preparation. I am deeply researching the non-obvious correlations and relationships between oncology and stroke using machine learning methods, which are innovative compared to traditional methods that offer only a single perspective.

    I consistently face obstacles such as time constraints and a lack of resources, hindering my ability to improve patient quality of life through accessible technology. This product allows for rapid, high-quality, and structured insights and results. Searching for articles, encountering imprecise formulations, studying vast amounts of literature, and losing focus; having limitations in skills and knowledge to move from simple data collection to a new qualitative level and implementing solutions that will advance prevention rather than merely providing help to those who could have learned about their pathology earlier; spending much time on theories and so little on practice, consuming more and more new information and facing cognitive biases—all this can be avoided by learning to trust and understand new technology.

    This project resonates with me, and I hope that through its development and implementation, I will be able to make the world a better place for people. Great idea, I look forward to your success!

  • 0
    Nick
    May 20, 2024 | 9:58 AM

    Overall

    5

    • Feasibility 5
    • Viability 5
    • Desirability 5
    • Usefulness 5
    Very impactful product!

    This solution presents a promising opportunity to boost research efficiency and discovery. Automating the data cleaning process and constructing a comprehensive knowledge graph will enable researchers to seamlessly explore and identify connections between papers, leading to improved citation analysis and more accurate recommendations.

  • 0
    Joseph Gastoni
    May 19, 2024 | 9:14 AM

    Overall

    3

    • Feasibility 3
    • Viability 3
    • Desirability 3
    • Usefulness 4
    Building a comprehensive medical knowledge graph

    This project proposes building a comprehensive medical knowledge graph from research papers using the Standard Template Construct (STC). Here's a breakdown of its strengths and weaknesses:

    Feasibility:

    • Moderate: The project leverages existing data sources, natural language processing techniques, and graph database technologies. Complexity depends on chosen data cleaning methods and embedding techniques.
    • Strengths: The concept builds on established tools, and development can be efficient with clear technical choices.
    • Weaknesses: Data cleaning, especially handling missing values and inconsistent formatting, can be time-consuming and require expertise. Building robust embeddings for medical concepts might require additional research or pre-trained models.

    Viability:

    • Moderate: Success depends on acquiring high-quality data, building a user base, and demonstrating the value proposition compared to existing solutions.
    • Strengths: The focus on a valuable domain (medical research) with a clear need for knowledge organization can be attractive.
    • Weaknesses: Competition from existing medical databases and knowledge graphs, user adoption within the medical research community, and ongoing maintenance of the knowledge graph need to be addressed.

    Desirability:

    • High: For medical researchers and practitioners seeking a comprehensive and well-organized source of medical knowledge, this can be valuable.
    • Strengths: The project addresses a critical need for efficient access to and exploration of medical research findings.
    • Weaknesses: Building awareness and demonstrating the user benefits compared to existing resources requires effort.

    Usefulness:

    • Moderate-High: The project can be very useful if it delivers a high-quality knowledge graph with accurate data, relevant recommendations, and a user-friendly interface.
    • Strengths: This knowledge graph can improve research efficiency, identify connections between studies, and potentially accelerate medical discoveries.
    • Weaknesses: The long-term impact on user engagement, effectiveness of the recommendation system, and integration with existing medical research workflows needs evaluation.

    Additional Points:

    • A focus on data quality and curation throughout the data cleaning process is essential.
    • Collaboration with medical professionals and domain experts for data validation and knowledge graph development is crucial.
    • A clear strategy for user adoption and integration with existing research tools can enhance project value.

    Overall, this project has potential to be a valuable resource for the medical research community. Focusing on high-quality data acquisition, robust data cleaning strategies, collaboration with medical experts, and a user-centric approach can increase the project's value and impact.

    Here are some strengths of this project:

    • Addresses a critical need for efficient knowledge organization and access to medical research findings.
    • Leverages existing technologies for data scraping, cleaning, and building a knowledge graph.
    • Emphasizes data quality and utilizes techniques like embeddings and recommendation systems for improved navigation.

  • 0
    Max1524
    May 18, 2024 | 12:36 PM

    Overall

    3

    • Feasibility 3
    • Viability 2
    • Desirability 3
    • Usefulness 3
    Pay more attention to human resources work

    The two milestones established are relatively brief compared to the scope of this proposal. Maybe the author will add more in the next few days?
    Besides, I can currently see only one author listed for this proposal. Surely more staff are needed to complete the proposal properly?
    My current advice is that the author should pay a lot of attention to the collaborators so that it is worthy of the expectations that this proposal brings.
    The STC standard sample structure may be a trend applied to other projects.

    ivan reznikov
    May 18, 2024 | 12:54 PM
    Project Owner

    Thanks for the review. As listed in the project description, our team consists of two Data Engineers and a Graph Network specialist. We have a medical domain expert on board as well. I haven't included all of them in the project team, that's true. I see how this affects the score for one or two of the criteria, but I don't understand how it affects the others.

  • 0
    JeyGarg23
    May 18, 2024 | 10:29 AM

    Overall

    5

    • Feasibility 4
    • Viability 5
    • Desirability 5
    • Usefulness 5
    I see how this might be useful

    I'm far from academia now, but I see how experts might benefit from such a graph.

  • 0
    TrucTrixie
    May 17, 2024 | 9:59 AM

    Overall

    2

    • Feasibility 2
    • Viability 1
    • Desirability 1
    • Usefulness 2
    Parts need to be completed

    I assess that this proposal still needs to add a lot of details before the deadline expires.
    Additional information on team members and transparency.
    Present more about the milestone with what needs to be done.
    Complete the proposed description steps more specifically. Only then will the proposal be more likely to be accepted. Good luck.

    ivan reznikov
    May 17, 2024 | 11:15 AM
    Project Owner

    Thank you for the useful feedback. We've updated the proposal and the plan, and added more details for the project.

Summary

Overall Community

3.8

from 6 reviews
  • 5 stars: 3
  • 4 stars: 0
  • 3 stars: 2
  • 2 stars: 1
  • 1 star: 0

Feasibility

3.7

from 6 reviews

Viability

3.5

from 6 reviews

Desirability

3.7

from 6 reviews

Usefulness

4

from 6 reviews