Medical Literature Knowledge Graph Schema Review

Date: 2026-01-02
Repository: wware/med-lit-graph
Reviewer: GitHub Copilot
Developer: wware


Initial Question

Developer asked for a review of the schema directory (https://github.com/wware/med-lit-graph/tree/main/schema), acknowledging known bugs but seeking feedback on the overall direction.


Schema Overview

The schema implements a three-layer knowledge graph for medical literature:

  1. Extraction Layer - Raw output from LLMs/NER (noisy, model-dependent, reproducible)
  2. Claim Layer - What papers assert (paper-level, versioned, contradictory by nature)
  3. Evidence Layer - Empirical evidence from experiments (fine-grained, weighted, reusable)
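
For orientation, a single finding might flow through these layers roughly as shown below. This is a hedged illustration using simplified dicts with invented values, not the repo's actual classes or data:

# Illustrative only: one finding viewed through the three layers,
# using simplified dicts rather than the schema's actual classes.
extraction = {   # Extraction Layer: what the model pulled out of the text
    "subject": "olaparib", "object": "BRCA-mutated breast cancer",
    "relation": "treats", "extractor": "gpt-4o", "confidence": 0.91,
}
claim = {        # Claim Layer: what the paper asserts
    "subject_id": "DRUG:olaparib", "object_id": "DISEASE:brca_breast_cancer",
    "predicate": "TREATS", "asserted_by": "PMID:12345", "polarity": "supports",
}
evidence = {     # Evidence Layer: the empirical backing for the claim
    "evidence_type": "rct_evidence", "strength": 0.85, "sample_size": 302,
}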

Key Files Reviewed:

  • schema/README.md - Comprehensive documentation
  • schema/base.py - Base classes and edge hierarchy
  • schema/entity.py - Entity definitions (Disease, Gene, Drug, etc.)
  • schema/sql_mixin.py - SQL generation utilities

Overall Assessment: ✅ Headed in the Right Direction

Strong Points 💪

  1. Edge Class Hierarchy - Implementation of Edge, ExtractionEdge, ClaimEdge, and EvidenceEdge correctly captures epistemic roles rather than just different types

  2. Evidence-First Philosophy - Provenance as a first-class citizen with detailed ExtractionProvenance tracking (see the sketch after this list):

    • Git commit hashes
    • Model versions and parameters
    • Prompt versions with checksums
    • Execution metadata
  3. Predicate Organization - Well-structured categories:

    • CausalPredicateType (causes, prevents, increases_risk)
    • TreatmentPredicateType (treats, manages, contraindicated_for)
    • BiologicalPredicateType (binds_to, inhibits, activates)
    • DiagnosticPredicateType (diagnoses, indicates, co_occurs_with)
  4. Ontology Integration - Standards-based identifiers:

    • UMLS (diseases)
    • HGNC (genes)
    • RxNorm (drugs)
    • UniProt (proteins)
    • IAO, OBI, STATO, ECO, SEPIO (scientific methodology)
  5. Entity Collection as Registry - Canonical entity ID approach for entity resolution across papers

  6. Embedded Design Rationale - The ChatGPT conversation in base.py (lines 8-798) preserves crucial design decisions
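
To make points 2 and 3 concrete, here is a hedged sketch of what the provenance record and one predicate category might look like. The class names match those cited above, but the field names only approximate the items listed in this review and are not copied from base.py:

from datetime import datetime
from enum import Enum
from pydantic import BaseModel

class CausalPredicateType(str, Enum):
    CAUSES = "causes"
    PREVENTS = "prevents"
    INCREASES_RISK = "increases_risk"

class ExtractionProvenance(BaseModel):
    git_commit: str               # pipeline commit hash
    model_name: str               # LLM/NER model identifier
    model_version: str
    model_parameters: dict = {}   # temperature, max tokens, etc.
    prompt_version: str
    prompt_checksum: str          # checksum of the exact prompt text
    executed_at: datetime         # execution metadata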

Areas to Work Through 🔧

  1. Incomplete Base Classes - Several stub classes need implementation:

    • ClaimPredicate (line 803)
    • Provenance (line 808)
    • EvidenceType (line 822)
  2. Predicate Type Hierarchy Inconsistency - Both old PredicateType enum and new category-based predicates exist; needs unification

  3. Edge vs Relationship Duality - Two parallel systems need reconciliation:

    • New: Edge/ClaimEdge/EvidenceEdge
    • Old: AssertedRelationship and various *Relationship classes
  4. Entity vs EntityReference Clarity - Distinction between BaseMedicalEntity (canonical) and EntityReference (lightweight pointer) could be clearer in practice


SQLMixin Review (schema/sql_mixin.py)

Overall: 7.5/10 - Solid foundation but needs critical fixes

What Works ✅

  1. Clean Separation - Pure mixin, no coupling to specific database libraries
  2. Dialect Support - Handles PostgreSQL and SQLite differences correctly
  3. Smart Type Mapping - Handles Optional types, lists, nested Pydantic models
  4. Bidirectional Conversion - Both to_db_dict() and from_db_dict()
  5. Pydantic v2 APIs - Uses correct modern methods

Critical Issues 🔴

  1. Primary Key Heuristic Too Simple (line 133)

    pk_clause = " PRIMARY KEY" if field_name == "id" else ""

    Breaks for paper_id, entity_id, multi-column keys

    Fix: Use Field metadata:

    paper_id: str = Field(..., json_schema_extra={"primary_key": True})
  2. Enum Handling Missing - The schema uses many enums (EntityType, Polarity), but _python_type_to_sql() doesn't explicitly handle them

  3. from_db_dict() Doesn't Reconstruct Nested Models

    paper = Paper.from_db_dict(row_dict, dialect="sqlite")
    # paper.extraction_provenance will be dict, not ExtractionProvenance!

    Fix: Use model_validate() to let Pydantic reconstruct nested objects (see the sketch after this list)

  4. Naive Table Name Pluralization

    • Hypothesis → hypothesiss 😬
    • Should use an explicit table_name in model_config or the inflect library
  5. No Index/Constraint Support - Missing foreign keys, indexes, and unique constraints

  6. No DELETE SQL - Generates INSERT/SELECT/UPDATE but no DELETE
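
A hedged sketch of fixes 1-3 follows. It keeps the method names cited in this review, but the helper _is_primary_key and all the bodies below are illustrative, not the repo's code:

from enum import Enum
from pydantic import BaseModel

class SQLMixin(BaseModel):
    @classmethod
    def _is_primary_key(cls, field_name: str) -> bool:
        # Fix 1: read Field(json_schema_extra={"primary_key": True})
        # instead of assuming the column is literally named "id".
        extra = cls.model_fields[field_name].json_schema_extra
        return isinstance(extra, dict) and bool(extra.get("primary_key"))

    @classmethod
    def _python_type_to_sql(cls, py_type, dialect: str) -> str:
        # Fix 2: map Enum subclasses explicitly (store their values as TEXT).
        if isinstance(py_type, type) and issubclass(py_type, Enum):
            return "TEXT"
        ...  # fall through to the existing type-mapping logic

    @classmethod
    def from_db_dict(cls, row: dict, dialect: str = "sqlite"):
        # Fix 3: let Pydantic rebuild nested models from plain dicts.
        # (JSON columns read back as text may need json.loads first.)
        return cls.model_validate(row)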

Recommendation

Consider using SQLAlchemy instead of maintaining custom SQL generation. The complexity will only grow with relationships, migrations, and advanced queries.


Database ORM Strategy Discussion

Evaluated Options

Option 1: SQLModel (Initial Attempt)

Result: ❌ Doesn't handle inheritance hierarchy

SQLModel struggles with:

  • BaseMedicalEntity → Disease/Gene/Drug structure
  • Polymorphic queries
  • Table-per-class with shared base fields

Developer tried SQLModel and confirmed: "It doesn't handle the inheritance tree for my schema."

Option 2: SQLAlchemy 2.0 with Separate Models

Approach: Keep Pydantic for API/validation, SQLAlchemy for persistence

# schema/entity.py - Pydantic (validation, API)
from pydantic import BaseModel, ConfigDict

class DiseaseSchema(BaseModel):
    entity_id: str
    name: str
    model_config = ConfigDict(from_attributes=True)

# db/models.py - SQLAlchemy (persistence)
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class DiseaseORM(Base):
    __tablename__ = "diseases"
    entity_id: Mapped[str] = mapped_column(primary_key=True)
    name: Mapped[str]

# Conversion
def to_schema(orm: DiseaseORM) -> DiseaseSchema:
    return DiseaseSchema.model_validate(orm, from_attributes=True)

Pros:

  • ✅ SQLAlchemy handles inheritance correctly (see the sketch after this list)
  • ✅ Full ORM features (relationships, lazy loading)
  • ✅ Battle-tested for complex schemas
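
To illustrate the first point, here is a minimal, self-contained sketch of joined-table inheritance in SQLAlchemy 2.0; the class and table names are hypothetical, not taken from the repository:

from sqlalchemy import ForeignKey
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class EntityORM(Base):
    __tablename__ = "entities"
    entity_id: Mapped[str] = mapped_column(primary_key=True)
    name: Mapped[str]
    entity_type: Mapped[str]
    __mapper_args__ = {"polymorphic_on": "entity_type", "polymorphic_identity": "entity"}

class DiseaseEntityORM(EntityORM):
    __tablename__ = "disease_entities"
    entity_id: Mapped[str] = mapped_column(ForeignKey("entities.entity_id"), primary_key=True)
    umls_id: Mapped[str | None]
    __mapper_args__ = {"polymorphic_identity": "disease"}

# A polymorphic query across all entity subtypes then works as expected:
#     session.query(EntityORM).all()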

Cons:

  • 🔴 Code duplication
  • 🔴 Conversion boilerplate
  • 🔴 Must maintain both in sync

Option 3: Custom Decorator

Approach: Decorator that generates SQLAlchemy mappings from Pydantic models

@sqlalchemy_mapped(polymorphic_on="entity_type")
class BaseMedicalEntity(BaseModel):
    entity_id: str
    # Disease._sqla_class available for queries

Pros:

  • ✅ Single source of truth
  • ✅ No duplication

Cons:

  • 🔴 Complex to implement correctly
  • 🔴 Edge cases will be painful
  • 🔴 Debugging generated classes is hard

Breakthrough: No Base Class for Edges

Key Insight from Developer: "What if I don't require my different edges to share a base class? Then SQLModel would probably work."

Analysis: This is Brilliant ✨

What You Lose

  1. Polymorphic queries: session.query(BaseEdge).all()
  2. Shared interface guarantees
  3. Type hints on mixed collections: edges: list[BaseEdge]

What You Gain

  1. ✅ SQLModel works perfectly - No inheritance complexity
  2. ✅ Simpler mental model - Each edge type is independent
  3. ✅ Flexibility - Each layer evolves independently
  4. ✅ Clearer separation - Epistemic layers are distinct by design

The Critical Question: Do You Need Polymorphism?

Answer: NO - Your queries are naturally layer-specific:

# You'll always query by specific layer:
claims = session.query(ClaimEdge).filter_by(subject_id="DRUG:olaparib").all()
evidence = session.query(EvidenceEdge).filter_by(claim_id=claim.id).all()

# You'll never need:
all_edges = session.query(BaseEdge).all()  # Mixing layers defeats the purpose

Looking at the three-layer design (Extraction → Claim → Evidence), queries naturally target specific layers because the layers have fundamentally different semantics.

Test Against Canonical Questions

Clinician: "Which FDA-approved drugs treat Disease X with high-quality evidence?"

claims = session.query(ClaimEdge).filter(
    ClaimEdge.predicate == "TREATS",
    ClaimEdge.object_id == disease_x_id,
    ClaimEdge.polarity == "supports"
).all()

# then, for each candidate claim:
evidence = session.query(EvidenceEdge).filter(
    EvidenceEdge.object_id == claim.id,
    EvidenceEdge.evidence_type == "rct_evidence",
    EvidenceEdge.strength > 0.8
).all()

✅ Works - layer-specific queries

Researcher: "Which hypotheses have both supporting and refuting evidence?"

claims = session.query(ClaimEdge).filter(
    ClaimEdge.subject_type == "hypothesis"
).all()

for claim in claims:
    supporting = session.query(EvidenceEdge).filter(
        EvidenceEdge.object_id == claim.id,
        EvidenceEdge.polarity == "supports"
    ).count()

✅ Works - still specific queries

Auditor: "Why does this claim exist?"

claim = session.get(ClaimEdge, claim_id)
extraction = session.query(ExtractionEdge).filter(
    ExtractionEdge.paper_id == claim.asserted_by
).first()
paper = session.get(Paper, claim.asserted_by)

✅ Works - explicit layer traversal


Final Recommendation

✅ Use SQLModel WITHOUT Base Class Inheritance

Proposed Structure:

from sqlmodel import SQLModel, Field, Column, JSON
from datetime import datetime

# Three independent edge types (no base class)

class ExtractionEdge(SQLModel, table=True):
    __tablename__ = "extraction_edges"
    
    id: str = Field(primary_key=True)
    subject_id: str = Field(index=True)
    object_id: str = Field(index=True)
    
    # Extraction-specific
    extractor_name: str
    confidence: float
    paper_id: str = Field(foreign_key="papers.paper_id")
    extracted_at: datetime


class ClaimEdge(SQLModel, table=True):
    __tablename__ = "claim_edges"
    
    id: str = Field(primary_key=True)
    subject_id: str = Field(index=True)
    object_id: str = Field(index=True)
    
    # Claim-specific
    predicate: str  # TREATS, CAUSES, etc.
    asserted_by: str = Field(foreign_key="papers.paper_id")
    polarity: str  # supports/refutes/neutral
    evidence_ids: list[str] = Field(default_factory=list, sa_column=Column(JSON))


class EvidenceEdge(SQLModel, table=True):
    __tablename__ = "evidence_edges"
    
    id: str = Field(primary_key=True)
    subject_id: str = Field(index=True)
    object_id: str = Field(index=True)
    
    # Evidence-specific
    evidence_type: str
    strength: float
    study_type: str | None = None
    sample_size: int | None = None


# Entities also independent (no BaseMedicalEntity)

class Disease(SQLModel, table=True):
    __tablename__ = "diseases"
    
    entity_id: str = Field(primary_key=True)
    name: str
    synonyms: list[str] = Field(default_factory=list, sa_column=Column(JSON))
    umls_id: str | None = None
    mesh_id: str | None = None
    embedding: list[float] | None = Field(None, sa_column=Column(JSON))


class Gene(SQLModel, table=True):
    __tablename__ = "genes"
    
    entity_id: str = Field(primary_key=True)
    name: str
    synonyms: list[str] = Field(default_factory=list, sa_column=Column(JSON))
    symbol: str | None = None
    hgnc_id: str | None = None
    embedding: list[float] | None = Field(None, sa_column=Column(JSON))


class Paper(SQLModel, table=True):
    __tablename__ = "papers"
    
    paper_id: str = Field(primary_key=True)
    title: str
    abstract: str
    authors: list[str] = Field(default_factory=list, sa_column=Column(JSON))
    extraction_provenance: dict = Field(default_factory=dict, sa_column=Column(JSON))

Why This Works

  1. SQLModel compatibility - No inheritance = no problems
  2. Conceptual clarity - The three layers are philosophically distinct, not just subclasses
  3. Query patterns match - All your queries are layer-specific anyway
  4. Independent evolution - Each layer can change without affecting others
  5. Simpler code - No polymorphic complexity, no multiple dispatch
  6. Full validation - Each model is still a Pydantic model with all validation

If You Need Shared Fields

Use composition instead of inheritance:

from datetime import datetime
from pydantic import BaseModel

# Not a table, just a validator
class EdgeFields(BaseModel):
    """Shared fields that all edges should have"""
    id: str
    subject_id: str
    object_id: str
    created_at: datetime

# Verify each edge type includes these (can be enforced in tests)
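
For instance, a minimal test along these lines could enforce the contract, assuming Pydantic v2's model_fields API; note that the edge tables sketched above would also need a created_at column for this particular check to pass:

def test_edges_include_shared_fields():
    shared = set(EdgeFields.model_fields)
    for edge_cls in (ExtractionEdge, ClaimEdge, EvidenceEdge):
        missing = shared - set(edge_cls.model_fields)
        assert not missing, f"{edge_cls.__name__} is missing {missing}"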

Next Steps

  1. Remove SQLMixin - Replace with SQLModel (no inheritance)
  2. Flatten entity hierarchy - Each entity type (Disease, Gene, Drug) as independent SQLModel class
  3. Flatten edge hierarchy - ExtractionEdge, ClaimEdge, EvidenceEdge as independent SQLModel classes
  4. Fill in stub classes - Complete ClaimPredicate, Provenance, EvidenceType definitions
  5. Unify predicate system - Reconcile old PredicateType enum with new category predicates
  6. Write expressibility tests - Start with 2-3 canonical queries, as discussed in the base.py ChatGPT conversation (a sketch follows below)
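
For step 6, here is a hedged sketch of one such test against an in-memory SQLite database, using the SQLModel classes proposed above; the IDs, titles, and predicate strings are illustrative:

from sqlmodel import Session, SQLModel, create_engine, select

def test_clinician_question_is_expressible():
    engine = create_engine("sqlite://")          # in-memory database
    SQLModel.metadata.create_all(engine)
    with Session(engine) as session:
        # Seed one paper and one claim asserting a treatment relationship.
        session.add(Paper(paper_id="PMID:12345", title="Example trial", abstract="..."))
        session.add(ClaimEdge(
            id="claim-1",
            subject_id="DRUG:olaparib",
            object_id="DISEASE:breast_cancer",
            predicate="TREATS",
            asserted_by="PMID:12345",
            polarity="supports",
        ))
        session.commit()
        # The clinician question maps directly onto a layer-specific query.
        stmt = select(ClaimEdge).where(
            ClaimEdge.predicate == "TREATS",
            ClaimEdge.object_id == "DISEASE:breast_cancer",
            ClaimEdge.polarity == "supports",
        )
        assert session.exec(stmt).all()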

Key Design Insights Preserved

From the embedded ChatGPT conversation in base.py:

"Edges are not predicates. Predicates are meanings; edges are events."

This is why the three edge types work better as independent classes rather than subclasses. They represent different kinds of events in the scientific process:

  • ExtractionEdge = "What did the model extract?"
  • ClaimEdge = "What does the paper claim?"
  • EvidenceEdge = "What empirical evidence exists?"

These aren't variations of the same thing; they're fundamentally different roles in the knowledge graph's epistemology.


Conclusion

Your schema is architecturally sound and headed in an excellent direction. The design shows sophisticated thinking about:

  • Evidence quality and provenance
  • Scientific methodology tracking
  • Contradiction handling
  • Multi-hop reasoning

The shift from an inheritance-based design to independent SQLModel classes will:

  • Eliminate technical complexity
  • Preserve conceptual clarity
  • Enable SQLModel to work perfectly
  • Keep code maintainable as the project scales

Overall Assessment: 8.5/10 - Strong foundation with clear path forward.


End of Review
