Date: 2026-01-02
Repository: wware/med-lit-graph
Reviewer: GitHub Copilot
Developer: wware
Developer asked for a review of the schema directory (https://github.com/wware/med-lit-graph/tree/main/schema), acknowledging bugs but seeking feedback on overall direction.
The schema implements a three-layer knowledge graph for medical literature:
- Extraction Layer - Raw output from LLMs/NER (noisy, model-dependent, reproducible)
- Claim Layer - What papers assert (paper-level, versioned, contradictory by nature)
- Evidence Layer - Empirical evidence from experiments (fine-grained, weighted, reusable)
Key Files Reviewed:
- schema/README.md - Comprehensive documentation
- schema/base.py - Base classes and edge hierarchy
- schema/entity.py - Entity definitions (Disease, Gene, Drug, etc.)
- schema/sql_mixin.py - SQL generation utilities
- Edge Class Hierarchy - Implementation of Edge, ExtractionEdge, ClaimEdge, and EvidenceEdge correctly captures epistemic roles rather than just different types
- Evidence-First Philosophy - Provenance as a first-class citizen with detailed ExtractionProvenance tracking:
  - Git commit hashes
  - Model versions and parameters
  - Prompt versions with checksums
  - Execution metadata
- Predicate Organization - Well-structured categories:
  - CausalPredicateType (causes, prevents, increases_risk)
  - TreatmentPredicateType (treats, manages, contraindicated_for)
  - BiologicalPredicateType (binds_to, inhibits, activates)
  - DiagnosticPredicateType (diagnoses, indicates, co_occurs_with)
- Ontology Integration - Standards-based identifiers:
  - UMLS (diseases)
  - HGNC (genes)
  - RxNorm (drugs)
  - UniProt (proteins)
  - IAO, OBI, STATO, ECO, SEPIO (scientific methodology)
- Entity Collection as Registry - Canonical entity ID approach for entity resolution across papers
- Embedded Design Rationale - The ChatGPT conversation in base.py (lines 8-798) preserves crucial design decisions
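To make the registry idea above concrete, here is a minimal sketch of resolving surface mentions to a canonical entity ID. It uses plain dataclasses rather than your Pydantic models, and the class names, fields, and methods (CanonicalEntity, EntityCollection.add, EntityCollection.resolve) are assumptions for illustration, not the API in schema/entity.py.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class EntityReference:
    """Lightweight pointer to a canonical entity (hypothetical fields)."""
    entity_id: str
    name: str


@dataclass
class CanonicalEntity:
    """Full canonical record, stored once per real-world entity."""
    entity_id: str
    name: str
    synonyms: list[str] = field(default_factory=list)


class EntityCollection:
    """Registry that resolves surface mentions to canonical entity IDs."""

    def __init__(self) -> None:
        self._by_id: dict[str, CanonicalEntity] = {}
        self._by_alias: dict[str, str] = {}

    def add(self, entity: CanonicalEntity) -> None:
        # Index the canonical name and every synonym, case-insensitively.
        self._by_id[entity.entity_id] = entity
        for alias in (entity.name, *entity.synonyms):
            self._by_alias[alias.lower()] = entity.entity_id

    def resolve(self, mention: str) -> Optional[EntityReference]:
        # Return a lightweight reference, or None for unknown mentions.
        entity_id = self._by_alias.get(mention.strip().lower())
        if entity_id is None:
            return None
        entity = self._by_id[entity_id]
        return EntityReference(entity_id=entity.entity_id, name=entity.name)


registry = EntityCollection()
registry.add(CanonicalEntity("DRUG:olaparib", "olaparib", ["Lynparza"]))
```

Two papers that mention "Lynparza" and "olaparib" would both resolve to the same reference, which is what lets edges from different papers point at one node.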
- Incomplete Base Classes - Several stub classes need implementation:
  - ClaimPredicate (line 803)
  - Provenance (line 808)
  - EvidenceType (line 822)
- Predicate Type Hierarchy Inconsistency - Both the old PredicateType enum and the new category-based predicates exist; they need unification
- Edge vs Relationship Duality - Two parallel systems need reconciliation:
  - New: Edge/ClaimEdge/EvidenceEdge
  - Old: AssertedRelationship and various *Relationship classes
- Entity vs EntityReference Clarity - Distinction between BaseMedicalEntity (canonical) and EntityReference (lightweight pointer) could be clearer in practice
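One possible way to unify the predicate system is to keep the category enums and route every raw predicate string through a single lookup. The sketch below is an assumption about how that could look: the enum members are only the ones named in this review, and PREDICATE_CATEGORIES and parse_predicate are hypothetical names.

```python
from enum import Enum


class CausalPredicateType(str, Enum):
    CAUSES = "causes"
    PREVENTS = "prevents"
    INCREASES_RISK = "increases_risk"


class TreatmentPredicateType(str, Enum):
    TREATS = "treats"
    MANAGES = "manages"
    CONTRAINDICATED_FOR = "contraindicated_for"


class BiologicalPredicateType(str, Enum):
    BINDS_TO = "binds_to"
    INHIBITS = "inhibits"
    ACTIVATES = "activates"


class DiagnosticPredicateType(str, Enum):
    DIAGNOSES = "diagnoses"
    INDICATES = "indicates"
    CO_OCCURS_WITH = "co_occurs_with"


# Hypothetical registry: the single place that knows every category.
PREDICATE_CATEGORIES = (
    CausalPredicateType,
    TreatmentPredicateType,
    BiologicalPredicateType,
    DiagnosticPredicateType,
)


def parse_predicate(value: str) -> Enum:
    """Resolve a raw string (any case) to its category enum member."""
    normalized = value.strip().lower()
    for category in PREDICATE_CATEGORIES:
        try:
            return category(normalized)
        except ValueError:
            continue  # not in this category, try the next one
    raise ValueError(f"Unknown predicate: {value!r}")
```

With a lookup like this, values from the old PredicateType enum could be migrated by passing them through parse_predicate once, after which the old enum can be removed.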
Overall: 7.5/10 - Solid foundation but needs critical fixes
- Clean Separation - Pure mixin, no coupling to specific database libraries
- Dialect Support - Handles PostgreSQL and SQLite differences correctly
- Smart Type Mapping - Handles Optional types, lists, nested Pydantic models
- Bidirectional Conversion - Both to_db_dict() and from_db_dict()
- Pydantic v2 APIs - Uses correct modern methods
- Primary Key Heuristic Too Simple (line 133)
  pk_clause = " PRIMARY KEY" if field_name == "id" else ""
  Breaks for paper_id, entity_id, and multi-column keys.
  Fix: Use Field metadata:
  paper_id: str = Field(..., json_schema_extra={"primary_key": True})
- Enum Handling Missing - Schema uses many enums (EntityType, Polarity) but _python_type_to_sql() doesn't explicitly handle them
- from_db_dict() Doesn't Reconstruct Nested Models
  paper = Paper.from_db_dict(row_dict, dialect="sqlite")
  # paper.extraction_provenance will be dict, not ExtractionProvenance!
  Fix: Use model_validate() to let Pydantic reconstruct nested objects
- Table Name Pluralization Naive
  - Hypothesis → hypothesiss 😬
  - Should use explicit table_name in model_config or the inflect library
- No Index/Constraint Support - Missing foreign keys, indexes, unique constraints
- No DELETE SQL - Has INSERT/SELECT/UPDATE but no DELETE
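The primary-key and enum points above could be addressed together. The following is a rough sketch, not the code in sql_mixin.py: ClaimRow, SQL_TYPES, python_type_to_sql, and create_table_sql are hypothetical names, the Polarity values are taken from the supports/refutes/neutral comment later in this review, and only SQLite-style types are covered.

```python
from enum import Enum
from typing import Any

from pydantic import BaseModel, Field


class Polarity(str, Enum):
    SUPPORTS = "supports"
    REFUTES = "refutes"
    NEUTRAL = "neutral"


class ClaimRow(BaseModel):
    # Composite key flagged through Field metadata rather than the name "id".
    paper_id: str = Field(..., json_schema_extra={"primary_key": True})
    claim_index: int = Field(..., json_schema_extra={"primary_key": True})
    polarity: Polarity
    strength: float


SQL_TYPES: dict[type, str] = {str: "TEXT", int: "INTEGER", float: "REAL", bool: "INTEGER"}


def python_type_to_sql(annotation: Any) -> str:
    """Map an annotation to a SQLite column type, treating enums explicitly."""
    if isinstance(annotation, type) and issubclass(annotation, Enum):
        return "TEXT"  # enums are stored by value
    return SQL_TYPES.get(annotation, "TEXT")


def create_table_sql(model: type[BaseModel], table_name: str) -> str:
    """Build CREATE TABLE, collecting every field marked primary_key."""
    columns: list[str] = []
    pk_columns: list[str] = []
    for name, info in model.model_fields.items():
        columns.append(f"{name} {python_type_to_sql(info.annotation)} NOT NULL")
        extra = info.json_schema_extra
        if isinstance(extra, dict) and extra.get("primary_key"):
            pk_columns.append(name)
    if pk_columns:
        columns.append(f"PRIMARY KEY ({', '.join(pk_columns)})")
    return f"CREATE TABLE {table_name} ({', '.join(columns)})"
```

Emitting the key as a table-level PRIMARY KEY (...) clause is what makes multi-column keys possible; the per-column " PRIMARY KEY" suffix in the current heuristic cannot express them.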
Consider using SQLAlchemy instead of maintaining custom SQL generation. The complexity will only grow with relationships, migrations, and advanced queries.
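If the custom mixin stays in place for now, the from_db_dict() issue noted above could be handled roughly as follows. This is a simplified sketch: the ExtractionProvenance fields are a hypothetical subset, the dialect argument is omitted, and the nested model is assumed to be stored as a JSON text column.

```python
import json
from typing import Any

from pydantic import BaseModel


class ExtractionProvenance(BaseModel):
    # Hypothetical subset of the provenance fields described in the review.
    git_commit: str
    model_version: str


class Paper(BaseModel):
    paper_id: str
    title: str
    extraction_provenance: ExtractionProvenance

    def to_db_dict(self) -> dict[str, Any]:
        """Flatten to column values; the nested model is stored as JSON text."""
        data = self.model_dump(mode="json")
        data["extraction_provenance"] = json.dumps(data["extraction_provenance"])
        return data

    @classmethod
    def from_db_dict(cls, row: dict[str, Any]) -> "Paper":
        """Decode JSON columns, then let Pydantic rebuild nested models."""
        decoded = dict(row)
        decoded["extraction_provenance"] = json.loads(row["extraction_provenance"])
        return cls.model_validate(decoded)
```

Because model_validate() runs full validation, a row with a malformed provenance blob raises an error at load time instead of surfacing later as an unexpected dict.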
Result: ❌ Doesn't handle inheritance hierarchy
SQLModel struggles with:
- BaseMedicalEntity → Disease/Gene/Drug structure
- Polymorphic queries
- Table-per-class with shared base fields
Developer tried SQLModel and confirmed: "It doesn't handle the inheritance tree for my schema."
Approach: Keep Pydantic for API/validation, SQLAlchemy for persistence
# schema/entity.py - Pydantic (validation, API)
from pydantic import BaseModel, ConfigDict

class DiseaseSchema(BaseModel):
    entity_id: str
    name: str
    model_config = ConfigDict(from_attributes=True)

# db/models.py - SQLAlchemy (persistence)
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class DiseaseORM(Base):
    __tablename__ = "diseases"
    entity_id: Mapped[str] = mapped_column(primary_key=True)
    name: Mapped[str]

# Conversion
def to_schema(orm: DiseaseORM) -> DiseaseSchema:
    return DiseaseSchema.model_validate(orm, from_attributes=True)

Pros:
- ✅ SQLAlchemy handles inheritance correctly
- ✅ Full ORM features (relationships, lazy loading)
- ✅ Battle-tested for complex schemas
Cons:
- 🔴 Code duplication
- 🔴 Conversion boilerplate
- 🔴 Must maintain both in sync
Approach: Decorator that generates SQLAlchemy mappings from Pydantic models
# Hypothetical decorator (sketch only; not an existing library API)
@sqlalchemy_mapped(polymorphic_on="entity_type")
class BaseMedicalEntity(BaseModel):
    entity_id: str

# Disease._sqla_class available for queries

Pros:
- ✅ Single source of truth
- ✅ No duplication
Cons:
- 🔴 Complex to implement correctly
- 🔴 Edge cases will be painful
- 🔴 Debugging generated classes is hard
Key Insight from Developer: "What if I don't require my different edges to share a base class? Then SQLModel would probably work."
What a shared base class would provide:
- Polymorphic queries: session.query(BaseEdge).all()
- Shared interface guarantees
- Type hints on mixed collections: edges: list[BaseEdge]
What independent edge classes provide:
- ✅ SQLModel works perfectly - No inheritance complexity
- ✅ Simpler mental model - Each edge type is independent
- ✅ Flexibility - Each layer evolves independently
- ✅ Clearer separation - Epistemic layers are distinct by design
Answer: NO - Your queries are naturally layer-specific:
# You'll always query by specific layer:
claims = session.query(ClaimEdge).filter_by(subject_id="DRUG:olaparib").all()
evidence = session.query(EvidenceEdge).filter_by(claim_id=claim.id).all()
# You'll never need:
all_edges = session.query(BaseEdge).all()  # Mixing layers defeats the purpose

Looking at the three-layer design (Extraction → Claim → Evidence), queries naturally target specific layers because they have fundamentally different semantics.
claims = session.query(ClaimEdge).filter(
    ClaimEdge.predicate == "TREATS",
    ClaimEdge.object_id == disease_x_id,
    ClaimEdge.polarity == "supports"
).all()
evidence = session.query(EvidenceEdge).filter(
    EvidenceEdge.object_id == claim.id,
    EvidenceEdge.evidence_type == "rct_evidence",
    EvidenceEdge.strength > 0.8
).all()
✅ Works - layer-specific queries

claims = session.query(ClaimEdge).filter(
    ClaimEdge.subject_type == "hypothesis"
).all()
for claim in claims:
    supporting = session.query(EvidenceEdge).filter(
        EvidenceEdge.object_id == claim.id,
        EvidenceEdge.polarity == "supports"
    ).count()
✅ Works - still specific queries

claim = session.get(ClaimEdge, claim_id)
extraction = session.query(ExtractionEdge).filter(
    ExtractionEdge.paper_id == claim.asserted_by
).first()
paper = session.get(Paper, claim.asserted_by)
✅ Works - explicit layer traversal
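The same layer-specific pattern can be tried without any ORM. The sketch below swaps in the standard-library sqlite3 module for the SQLModel session, uses only the columns the queries above touch, and fills the tables with made-up rows; supporting_claims and strong_rct_evidence are hypothetical helper names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE claim_edges (
        id TEXT PRIMARY KEY,
        subject_id TEXT, object_id TEXT,
        predicate TEXT, polarity TEXT
    );
    CREATE TABLE evidence_edges (
        id TEXT PRIMARY KEY,
        object_id TEXT,          -- points at claim_edges.id
        evidence_type TEXT, strength REAL, polarity TEXT
    );
    """
)
# Made-up sample rows, for illustration only.
conn.executemany("INSERT INTO claim_edges VALUES (?, ?, ?, ?, ?)", [
    ("claim-1", "DRUG:olaparib", "DISEASE:x", "TREATS", "supports"),
    ("claim-2", "DRUG:olaparib", "DISEASE:y", "TREATS", "refutes"),
])
conn.executemany("INSERT INTO evidence_edges VALUES (?, ?, ?, ?, ?)", [
    ("ev-1", "claim-1", "rct_evidence", 0.9, "supports"),
    ("ev-2", "claim-1", "rct_evidence", 0.6, "supports"),
    ("ev-3", "claim-2", "rct_evidence", 0.95, "refutes"),
])


def supporting_claims(disease_id: str) -> list[str]:
    """Claim layer only: which claims say something TREATS this disease?"""
    rows = conn.execute(
        "SELECT id FROM claim_edges "
        "WHERE predicate = 'TREATS' AND object_id = ? AND polarity = 'supports' "
        "ORDER BY id",
        (disease_id,),
    )
    return [row[0] for row in rows]


def strong_rct_evidence(claim_id: str, min_strength: float = 0.8) -> list[str]:
    """Evidence layer only: strong RCT evidence attached to one claim."""
    rows = conn.execute(
        "SELECT id FROM evidence_edges "
        "WHERE object_id = ? AND evidence_type = 'rct_evidence' AND strength > ? "
        "ORDER BY id",
        (claim_id, min_strength),
    )
    return [row[0] for row in rows]
```

Each helper touches exactly one table, and the hop from claim to evidence is an explicit second query, so nothing here needs a shared base table.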
Proposed Structure:
from sqlmodel import SQLModel, Field, Column, JSON
from datetime import datetime

# Three independent edge types (no base class)
class ExtractionEdge(SQLModel, table=True):
    __tablename__ = "extraction_edges"
    id: str = Field(primary_key=True)
    subject_id: str = Field(index=True)
    object_id: str = Field(index=True)
    # Extraction-specific
    extractor_name: str
    confidence: float
    paper_id: str = Field(foreign_key="papers.paper_id")
    extracted_at: datetime

class ClaimEdge(SQLModel, table=True):
    __tablename__ = "claim_edges"
    id: str = Field(primary_key=True)
    subject_id: str = Field(index=True)
    object_id: str = Field(index=True)
    # Claim-specific
    predicate: str  # TREATS, CAUSES, etc.
    asserted_by: str = Field(foreign_key="papers.paper_id")
    polarity: str  # supports/refutes/neutral
    evidence_ids: list[str] = Field(default_factory=list, sa_column=Column(JSON))

class EvidenceEdge(SQLModel, table=True):
    __tablename__ = "evidence_edges"
    id: str = Field(primary_key=True)
    subject_id: str = Field(index=True)
    object_id: str = Field(index=True)
    # Evidence-specific
    evidence_type: str
    strength: float
    study_type: str | None
    sample_size: int | None

# Entities also independent (no BaseMedicalEntity)
class Disease(SQLModel, table=True):
    __tablename__ = "diseases"
    entity_id: str = Field(primary_key=True)
    name: str
    synonyms: list[str] = Field(default_factory=list, sa_column=Column(JSON))
    umls_id: str | None
    mesh_id: str | None
    embedding: list[float] | None = Field(None, sa_column=Column(JSON))

class Gene(SQLModel, table=True):
    __tablename__ = "genes"
    entity_id: str = Field(primary_key=True)
    name: str
    synonyms: list[str] = Field(default_factory=list, sa_column=Column(JSON))
    symbol: str | None
    hgnc_id: str | None
    embedding: list[float] | None = Field(None, sa_column=Column(JSON))

class Paper(SQLModel, table=True):
    __tablename__ = "papers"
    paper_id: str = Field(primary_key=True)
    title: str
    abstract: str
    authors: list[str] = Field(default_factory=list, sa_column=Column(JSON))
    extraction_provenance: dict = Field(default_factory=dict, sa_column=Column(JSON))

- SQLModel compatibility - No inheritance = no problems
- Conceptual clarity - The three layers are philosophically distinct, not just subclasses
- Query patterns match - All your queries are layer-specific anyway
- Independent evolution - Each layer can change without affecting others
- Simpler code - No polymorphic complexity, no multiple dispatch
- Full validation - Each model is still a Pydantic model with all validation
Use composition instead of inheritance:
from datetime import datetime
from pydantic import BaseModel

# Not a table, just a validator
class EdgeFields(BaseModel):
    """Shared fields that all edges should have"""
    id: str
    subject_id: str
    object_id: str
    created_at: datetime

# Verify each edge type includes these (can be enforced in tests)

- Remove SQLMixin - Replace with SQLModel (no inheritance)
- Flatten entity hierarchy - Each entity type (Disease, Gene, Drug) as independent SQLModel class
- Flatten edge hierarchy - ExtractionEdge, ClaimEdge, EvidenceEdge as independent SQLModel classes
- Fill in stub classes - Complete ClaimPredicate, Provenance, EvidenceType definitions
- Unify predicate system - Reconcile old PredicateType enum with new category predicates
- Write expressibility tests - Start with 2-3 canonical queries as discussed in base.py ChatGPT conversation
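The EdgeFields check mentioned earlier ("can be enforced in tests") might look something like the sketch below. It relies on Pydantic v2's model_fields; the edge classes here are plain Pydantic stand-ins for the SQLModel tables, and missing_shared_fields is a hypothetical helper name.

```python
from datetime import datetime

from pydantic import BaseModel


class EdgeFields(BaseModel):
    """Shared fields that all edges should have."""
    id: str
    subject_id: str
    object_id: str
    created_at: datetime


class ClaimEdge(BaseModel):  # stand-in for the SQLModel table class
    id: str
    subject_id: str
    object_id: str
    created_at: datetime
    predicate: str


class EvidenceEdge(BaseModel):  # stand-in; created_at deliberately missing
    id: str
    subject_id: str
    object_id: str
    strength: float


def missing_shared_fields(edge_cls: type[BaseModel]) -> set[str]:
    """Return shared fields that are absent or annotated differently."""
    missing: set[str] = set()
    for name, info in EdgeFields.model_fields.items():
        other = edge_cls.model_fields.get(name)
        if other is None or other.annotation != info.annotation:
            missing.add(name)
    return missing
```

A unit test can then assert that missing_shared_fields(cls) is empty for every edge class, which gives the shared-interface guarantee without reintroducing a base class.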
From the embedded ChatGPT conversation in base.py:
"Edges are not predicates. Predicates are meanings; edges are events."
This is why the three edge types work better as independent classes rather than subclasses. They represent different kinds of events in the scientific process:
- ExtractionEdge = "What did the model extract?"
- ClaimEdge = "What does the paper claim?"
- EvidenceEdge = "What empirical evidence exists?"
These aren't variations of the same thing; they're fundamentally different roles in the knowledge graph's epistemology.
Your schema is architecturally sound and headed in an excellent direction. The design shows sophisticated thinking about:
- Evidence quality and provenance
- Scientific methodology tracking
- Contradiction handling
- Multi-hop reasoning
The shift from inheritance-based to independent SQLModel classes will:
- Eliminate technical complexity
- Preserve conceptual clarity
- Enable SQLModel to work perfectly
- Keep code maintainable as the project scales
Overall Assessment: 8.5/10 - Strong foundation with clear path forward.
End of Review