Medical Literature Knowledge Graph Schema Review

Date: 2026-01-02
Repository: wware/med-lit-graph
Reviewer: GitHub Copilot
Developer: wware


Initial Question

Developer asked for a review of the schema directory (https://github.com/wware/med-lit-graph/tree/main/schema), acknowledging known bugs but seeking feedback on the overall direction.


Schema Overview

The schema implements a three-layer knowledge graph for medical literature:

  1. Extraction Layer - Raw output from LLMs/NER (noisy, model-dependent, reproducible)
  2. Claim Layer - What papers assert (paper-level, versioned, contradictory by nature)
  3. Evidence Layer - Empirical evidence from experiments (fine-grained, weighted, reusable)
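
For orientation, a single finding might flow through these layers roughly as shown below. This is a hedged illustration using simplified dicts with invented values, not the repo's actual classes or data:

# Illustrative only: one finding viewed through the three layers,
# using simplified dicts rather than the schema's actual classes.
extraction = {   # Extraction Layer: what the model pulled out of the text
    "subject": "olaparib", "object": "BRCA-mutated breast cancer",
    "relation": "treats", "extractor": "gpt-4o", "confidence": 0.91,
}
claim = {        # Claim Layer: what the paper asserts
    "subject_id": "DRUG:olaparib", "object_id": "DISEASE:brca_breast_cancer",
    "predicate": "TREATS", "asserted_by": "PMID:12345", "polarity": "supports",
}
evidence = {     # Evidence Layer: the empirical backing for the claim
    "evidence_type": "rct_evidence", "strength": 0.85, "sample_size": 302,
}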

Key Files Reviewed:

  • schema/README.md - Comprehensive documentation
  • schema/base.py - Base classes and edge hierarchy
  • schema/entity.py - Entity definitions (Disease, Gene, Drug, etc.)
  • schema/sql_mixin.py - SQL generation utilities

Overall Assessment: ✅ Headed in the Right Direction

Strong Points 💪

  1. Edge Class Hierarchy - Implementation of Edge, ExtractionEdge, ClaimEdge, and EvidenceEdge correctly captures epistemic roles rather than just different types

  2. Evidence-First Philosophy - Provenance as a first-class citizen with detailed ExtractionProvenance tracking (see the sketch after this list):

    • Git commit hashes
    • Model versions and parameters
    • Prompt versions with checksums
    • Execution metadata
  3. Predicate Organization - Well-structured categories:

    • CausalPredicateType (causes, prevents, increases_risk)
    • TreatmentPredicateType (treats, manages, contraindicated_for)
    • BiologicalPredicateType (binds_to, inhibits, activates)
    • DiagnosticPredicateType (diagnoses, indicates, co_occurs_with)
  4. Ontology Integration - Standards-based identifiers:

    • UMLS (diseases)
    • HGNC (genes)
    • RxNorm (drugs)
    • UniProt (proteins)
    • IAO, OBI, STATO, ECO, SEPIO (scientific methodology)
  5. Entity Collection as Registry - Canonical entity ID approach for entity resolution across papers

  6. Embedded Design Rationale - The ChatGPT conversation in base.py (lines 8-798) preserves crucial design decisions
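
To make points 2 and 3 concrete, here is a hedged sketch of what the provenance record and one predicate category might look like. The class names match those cited above, but the field names only approximate the items listed in this review and are not copied from base.py:

from datetime import datetime
from enum import Enum
from pydantic import BaseModel

class CausalPredicateType(str, Enum):
    CAUSES = "causes"
    PREVENTS = "prevents"
    INCREASES_RISK = "increases_risk"

class ExtractionProvenance(BaseModel):
    git_commit: str               # pipeline commit hash
    model_name: str               # LLM/NER model identifier
    model_version: str
    model_parameters: dict = {}   # temperature, max tokens, etc.
    prompt_version: str
    prompt_checksum: str          # checksum of the exact prompt text
    executed_at: datetime         # execution metadata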

Areas to Work Through 🔧

  1. Incomplete Base Classes - Several stub classes need implementation:

    • ClaimPredicate (line 803)
    • Provenance (line 808)
    • EvidenceType (line 822)
  2. Predicate Type Hierarchy Inconsistency - Both old PredicateType enum and new category-based predicates exist; needs unification

  3. Edge vs Relationship Duality - Two parallel systems need reconciliation:

    • New: Edge/ClaimEdge/EvidenceEdge
    • Old: AssertedRelationship and various *Relationship classes
  4. Entity vs EntityReference Clarity - Distinction between BaseMedicalEntity (canonical) and EntityReference (lightweight pointer) could be clearer in practice


SQLMixin Review (schema/sql_mixin.py)

Overall: 7.5/10 - Solid foundation but needs critical fixes

What Works ✅

  1. Clean Separation - Pure mixin, no coupling to specific database libraries
  2. Dialect Support - Handles PostgreSQL and SQLite differences correctly
  3. Smart Type Mapping - Handles Optional types, lists, nested Pydantic models
  4. Bidirectional Conversion - Both to_db_dict() and from_db_dict()
  5. Pydantic v2 APIs - Uses correct modern methods

Critical Issues 🔴

  1. Primary Key Heuristic Too Simple (line 133)

    pk_clause = " PRIMARY KEY" if field_name == "id" else ""

    Breaks for paper_id, entity_id, multi-column keys

    Fix: Use Field metadata:

    paper_id: str = Field(..., json_schema_extra={"primary_key": True})
  2. Enum Handling Missing - The schema uses many enums (EntityType, Polarity), but _python_type_to_sql() doesn't explicitly handle them

  3. from_db_dict() Doesn't Reconstruct Nested Models

    paper = Paper.from_db_dict(row_dict, dialect="sqlite")
    # paper.extraction_provenance will be dict, not ExtractionProvenance!

    Fix: Use model_validate() to let Pydantic reconstruct nested objects (see the sketch after this list)

  4. Naive Table Name Pluralization

    • Hypothesis → hypothesiss 😬
    • Should use an explicit table_name in model_config or the inflect library
  5. No Index/Constraint Support - Missing foreign keys, indexes, and unique constraints

  6. No DELETE SQL - Generates INSERT/SELECT/UPDATE but no DELETE
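
A hedged sketch of fixes 1-3 follows. It keeps the method names cited in this review, but the helper _is_primary_key and all the bodies below are illustrative, not the repo's code:

from enum import Enum
from pydantic import BaseModel

class SQLMixin(BaseModel):
    @classmethod
    def _is_primary_key(cls, field_name: str) -> bool:
        # Fix 1: read Field(json_schema_extra={"primary_key": True})
        # instead of assuming the column is literally named "id".
        extra = cls.model_fields[field_name].json_schema_extra
        return isinstance(extra, dict) and bool(extra.get("primary_key"))

    @classmethod
    def _python_type_to_sql(cls, py_type, dialect: str) -> str:
        # Fix 2: map Enum subclasses explicitly (store their values as TEXT).
        if isinstance(py_type, type) and issubclass(py_type, Enum):
            return "TEXT"
        ...  # fall through to the existing type-mapping logic

    @classmethod
    def from_db_dict(cls, row: dict, dialect: str = "sqlite"):
        # Fix 3: let Pydantic rebuild nested models from plain dicts.
        # (JSON columns read back as text may need json.loads first.)
        return cls.model_validate(row)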

Recommendation

Consider using SQLAlchemy instead of maintaining custom SQL generation. The complexity will only grow with relationships, migrations, and advanced queries.


Database ORM Strategy Discussion

Evaluated Options

Option 1: SQLModel (Initial Attempt)

Result: ❌ Doesn't handle inheritance hierarchy

SQLModel struggles with:

  • BaseMedicalEntity → Disease/Gene/Drug structure
  • Polymorphic queries
  • Table-per-class with shared base fields

Developer tried SQLModel and confirmed: "It doesn't handle the inheritance tree for my schema."

Option 2: SQLAlchemy 2.0 with Separate Models

Approach: Keep Pydantic for API/validation, SQLAlchemy for persistence

# schema/entity.py - Pydantic (validation, API)
from pydantic import BaseModel, ConfigDict

class DiseaseSchema(BaseModel):
    entity_id: str
    name: str
    model_config = ConfigDict(from_attributes=True)

# db/models.py - SQLAlchemy (persistence)
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class DiseaseORM(Base):
    __tablename__ = "diseases"
    entity_id: Mapped[str] = mapped_column(primary_key=True)
    name: Mapped[str]

# Conversion
def to_schema(orm: DiseaseORM) -> DiseaseSchema:
    return DiseaseSchema.model_validate(orm, from_attributes=True)

Pros:

  • ✅ SQLAlchemy handles inheritance correctly (see the sketch after this list)
  • ✅ Full ORM features (relationships, lazy loading)
  • ✅ Battle-tested for complex schemas
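
To illustrate the first point, here is a minimal, self-contained sketch of joined-table inheritance in SQLAlchemy 2.0; the class and table names are hypothetical, not taken from the repository:

from sqlalchemy import ForeignKey
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class EntityORM(Base):
    __tablename__ = "entities"
    entity_id: Mapped[str] = mapped_column(primary_key=True)
    name: Mapped[str]
    entity_type: Mapped[str]
    __mapper_args__ = {"polymorphic_on": "entity_type", "polymorphic_identity": "entity"}

class DiseaseEntityORM(EntityORM):
    __tablename__ = "disease_entities"
    entity_id: Mapped[str] = mapped_column(ForeignKey("entities.entity_id"), primary_key=True)
    umls_id: Mapped[str | None]
    __mapper_args__ = {"polymorphic_identity": "disease"}

# A polymorphic query across all entity subtypes then works as expected:
#     session.query(EntityORM).all()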

Cons:

  • 🔴 Code duplication
  • 🔴 Conversion boilerplate
  • 🔴 Must maintain both in sync

Option 3: Custom Decorator

Approach: Decorator that generates SQLAlchemy mappings from Pydantic models

@sqlalchemy_mapped(polymorphic_on="entity_type")
class BaseMedicalEntity(BaseModel):
    entity_id: str
    # Disease._sqla_class available for queries

Pros:

  • ✅ Single source of truth
  • ✅ No duplication

Cons:

  • 🔴 Complex to implement correctly
  • 🔴 Edge cases will be painful
  • 🔴 Debugging generated classes is hard

Breakthrough: No Base Class for Edges

Key Insight from Developer: "What if I don't require my different edges to share a base class? Then SQLModel would probably work."

Analysis: This is Brilliant ✨

What You Lose

  1. Polymorphic queries: session.query(BaseEdge).all()
  2. Shared interface guarantees
  3. Type hints on mixed collections: edges: list[BaseEdge]

What You Gain

  1. ✅ SQLModel works perfectly - No inheritance complexity
  2. ✅ Simpler mental model - Each edge type is independent
  3. ✅ Flexibility - Each layer evolves independently
  4. ✅ Clearer separation - Epistemic layers are distinct by design

The Critical Question: Do You Need Polymorphism?

Answer: NO - Your queries are naturally layer-specific:

# You'll always query by specific layer:
claims = session.query(ClaimEdge).filter_by(subject_id="DRUG:olaparib").all()
evidence = session.query(EvidenceEdge).filter_by(claim_id=claim.id).all()

# You'll never need:
all_edges = session.query(BaseEdge).all()  # Mixing layers defeats the purpose

Looking at the three-layer design (Extraction → Claim → Evidence), queries naturally target specific layers because the layers have fundamentally different semantics.

Test Against Canonical Questions

Clinician: "Which FDA-approved drugs treat Disease X with high-quality evidence?"

claims = session.query(ClaimEdge).filter(
    ClaimEdge.predicate == "TREATS",
    ClaimEdge.object_id == disease_x_id,
    ClaimEdge.polarity == "supports"
).all()

# then, for each candidate claim:
evidence = session.query(EvidenceEdge).filter(
    EvidenceEdge.object_id == claim.id,
    EvidenceEdge.evidence_type == "rct_evidence",
    EvidenceEdge.strength > 0.8
).all()

✅ Works - layer-specific queries

Researcher: "Which hypotheses have both supporting and refuting evidence?"

claims = session.query(ClaimEdge).filter(
    ClaimEdge.subject_type == "hypothesis"
).all()

for claim in claims:
    supporting = session.query(EvidenceEdge).filter(
        EvidenceEdge.object_id == claim.id,
        EvidenceEdge.polarity == "supports"
    ).count()

✅ Works - still specific queries

Auditor: "Why does this claim exist?"

claim = session.get(ClaimEdge, claim_id)
extraction = session.query(ExtractionEdge).filter(
    ExtractionEdge.paper_id == claim.asserted_by
).first()
paper = session.get(Paper, claim.asserted_by)

✅ Works - explicit layer traversal


Final Recommendation

✅ Use SQLModel WITHOUT Base Class Inheritance

Proposed Structure:

from sqlmodel import SQLModel, Field, Column, JSON
from datetime import datetime

# Three independent edge types (no base class)

class ExtractionEdge(SQLModel, table=True):
    __tablename__ = "extraction_edges"
    
    id: str = Field(primary_key=True)
    subject_id: str = Field(index=True)
    object_id: str = Field(index=True)
    
    # Extraction-specific
    extractor_name: str
    confidence: float
    paper_id: str = Field(foreign_key="papers.paper_id")
    extracted_at: datetime


class ClaimEdge(SQLModel, table=True):
    __tablename__ = "claim_edges"
    
    id: str = Field(primary_key=True)
    subject_id: str = Field(index=True)
    object_id: str = Field(index=True)
    
    # Claim-specific
    predicate: str  # TREATS, CAUSES, etc.
    asserted_by: str = Field(foreign_key="papers.paper_id")
    polarity: str  # supports/refutes/neutral
    evidence_ids: list[str] = Field(default_factory=list, sa_column=Column(JSON))


class EvidenceEdge(SQLModel, table=True):
    __tablename__ = "evidence_edges"
    
    id: str = Field(primary_key=True)
    subject_id: str = Field(index=True)
    object_id: str = Field(index=True)
    
    # Evidence-specific
    evidence_type: str
    strength: float
    study_type: str | None = None
    sample_size: int | None = None


# Entities also independent (no BaseMedicalEntity)

class Disease(SQLModel, table=True):
    __tablename__ = "diseases"
    
    entity_id: str = Field(primary_key=True)
    name: str
    synonyms: list[str] = Field(default_factory=list, sa_column=Column(JSON))
    umls_id: str | None = None
    mesh_id: str | None = None
    embedding: list[float] | None = Field(None, sa_column=Column(JSON))


class Gene(SQLModel, table=True):
    __tablename__ = "genes"
    
    entity_id: str = Field(primary_key=True)
    name: str
    synonyms: list[str] = Field(default_factory=list, sa_column=Column(JSON))
    symbol: str | None = None
    hgnc_id: str | None = None
    embedding: list[float] | None = Field(None, sa_column=Column(JSON))


class Paper(SQLModel, table=True):
    __tablename__ = "papers"
    
    paper_id: str = Field(primary_key=True)
    title: str
    abstract: str
    authors: list[str] = Field(default_factory=list, sa_column=Column(JSON))
    extraction_provenance: dict = Field(default_factory=dict, sa_column=Column(JSON))

Why This Works

  1. SQLModel compatibility - No inheritance = no problems
  2. Conceptual clarity - The three layers are philosophically distinct, not just subclasses
  3. Query patterns match - All your queries are layer-specific anyway
  4. Independent evolution - Each layer can change without affecting others
  5. Simpler code - No polymorphic complexity, no multiple dispatch
  6. Full validation - Each model is still a Pydantic model with all validation

If You Need Shared Fields

Use composition instead of inheritance:

from datetime import datetime
from pydantic import BaseModel

# Not a table, just a validator
class EdgeFields(BaseModel):
    """Shared fields that all edges should have"""
    id: str
    subject_id: str
    object_id: str
    created_at: datetime

# Verify each edge type includes these (can be enforced in tests)
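
For instance, a minimal test along these lines could enforce the contract, assuming Pydantic v2's model_fields API; note that the edge tables sketched above would also need a created_at column for this particular check to pass:

def test_edges_include_shared_fields():
    shared = set(EdgeFields.model_fields)
    for edge_cls in (ExtractionEdge, ClaimEdge, EvidenceEdge):
        missing = shared - set(edge_cls.model_fields)
        assert not missing, f"{edge_cls.__name__} is missing {missing}"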

Next Steps

  1. Remove SQLMixin - Replace with SQLModel (no inheritance)
  2. Flatten entity hierarchy - Each entity type (Disease, Gene, Drug) as independent SQLModel class
  3. Flatten edge hierarchy - ExtractionEdge, ClaimEdge, EvidenceEdge as independent SQLModel classes
  4. Fill in stub classes - Complete ClaimPredicate, Provenance, EvidenceType definitions
  5. Unify predicate system - Reconcile old PredicateType enum with new category predicates
  6. Write expressibility tests - Start with 2-3 canonical queries, as discussed in the base.py ChatGPT conversation (a sketch follows below)
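
For step 6, here is a hedged sketch of one such test against an in-memory SQLite database, using the SQLModel classes proposed above; the IDs, titles, and predicate strings are illustrative:

from sqlmodel import Session, SQLModel, create_engine, select

def test_clinician_question_is_expressible():
    engine = create_engine("sqlite://")          # in-memory database
    SQLModel.metadata.create_all(engine)
    with Session(engine) as session:
        # Seed one paper and one claim asserting a treatment relationship.
        session.add(Paper(paper_id="PMID:12345", title="Example trial", abstract="..."))
        session.add(ClaimEdge(
            id="claim-1",
            subject_id="DRUG:olaparib",
            object_id="DISEASE:breast_cancer",
            predicate="TREATS",
            asserted_by="PMID:12345",
            polarity="supports",
        ))
        session.commit()
        # The clinician question maps directly onto a layer-specific query.
        stmt = select(ClaimEdge).where(
            ClaimEdge.predicate == "TREATS",
            ClaimEdge.object_id == "DISEASE:breast_cancer",
            ClaimEdge.polarity == "supports",
        )
        assert session.exec(stmt).all()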

Key Design Insights Preserved

From the embedded ChatGPT conversation in base.py:

"Edges are not predicates. Predicates are meanings; edges are events."

This is why the three edge types work better as independent classes rather than subclasses. They represent different kinds of events in the scientific process:

  • ExtractionEdge = "What did the model extract?"
  • ClaimEdge = "What does the paper claim?"
  • EvidenceEdge = "What empirical evidence exists?"

These aren't variations of the same thing; they're fundamentally different roles in the knowledge graph's epistemology.


Conclusion

Your schema is architecturally sound and headed in an excellent direction. The design shows sophisticated thinking about:

  • Evidence quality and provenance
  • Scientific methodology tracking
  • Contradiction handling
  • Multi-hop reasoning

The shift from an inheritance-based design to independent SQLModel classes will:

  • Eliminate technical complexity
  • Preserve conceptual clarity
  • Enable SQLModel to work perfectly
  • Keep code maintainable as the project scales

Overall Assessment: 8.5/10 - Strong foundation with clear path forward.


End of Review
