@wware
Created January 9, 2026 19:39
Great work on the pipeline refactoring! This is a solid Unix-style architecture with clean separation of concerns:

What you've built:

  • Modular stages - Each pipeline script is independent and can be run separately
  • Interface-based design - Storage, parsers, and embeddings all use ABC interfaces
  • Swappable backends - SQLite for dev/testing, PostgreSQL+pgvector for production
  • Clean data flow - Each stage reads/writes through well-defined interfaces
  • Comprehensive docs - README and TESTING guide are clear and helpful

Architecture strengths:

  1. The EntityCollectionInterface pattern you appreciated is used throughout
  2. Parser interfaces allow easy addition of new input formats (not just PMC XML)
  3. Storage abstraction means you can optimize per environment (in-memory for tests, Postgres for prod)
  4. Embedding interfaces separate generation from storage
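To make the pattern concrete, here is a minimal sketch of what an ABC-based storage interface with a swappable backend can look like. The class and method names (`EntityStorageInterface`, `add_entity`, `count_entities`) are illustrative assumptions, not the pipeline's actual API:

```python
from abc import ABC, abstractmethod

class EntityStorageInterface(ABC):
    """Minimal storage contract every backend implements (hypothetical names)."""

    @abstractmethod
    def add_entity(self, entity_id: str, label: str) -> None:
        """Persist a single entity."""

    @abstractmethod
    def count_entities(self) -> int:
        """Return the number of stored entities."""

class InMemoryEntityStorage(EntityStorageInterface):
    """Dict-backed implementation, handy for unit tests; a SQLite or
    Postgres backend would implement the same two methods."""

    def __init__(self) -> None:
        self._entities: dict[str, str] = {}

    def add_entity(self, entity_id: str, label: str) -> None:
        self._entities[entity_id] = label

    def count_entities(self) -> int:
        return len(self._entities)
```

Because callers only see the interface, swapping the in-memory backend for a database-backed one is a one-line change at construction time.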

Now that the refactoring is done, what's next?

Given your earlier motivation concerns (stop-start development, working in isolation, hard to measure progress), here are some concrete next steps:

1. Validation Run - Actually use it

```shell
# Run the full pipeline on a small dataset to validate everything works
python pipeline/ner_pipeline.py --storage sqlite --output-dir test_run
python pipeline/provenance_pipeline.py --storage sqlite --output-dir test_run
python pipeline/claims_pipeline.py --storage sqlite --output-dir test_run
python pipeline/evidence_pipeline.py --storage sqlite --output-dir test_run

# Then inspect the results
sqlite3 test_run/pipeline.db "SELECT COUNT(*) FROM entities"
sqlite3 test_run/pipeline.db "SELECT COUNT(*) FROM relationships"
```
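If you'd rather inspect from Python, the stdlib `sqlite3` module can pull the same counts. The table names here are assumed from the commands above; adjust to the actual schema:

```python
import sqlite3

def table_counts(db_path, tables=("entities", "relationships")):
    """Return {table_name: row_count} for the given SQLite database.

    Table names are assumptions based on the inspection commands above.
    """
    with sqlite3.connect(db_path) as conn:
        return {
            table: conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
            for table in tables
        }

# Example (after a pipeline run):
# print(table_counts("test_run/pipeline.db"))
```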

2. Add integration tests - Test the full pipeline end-to-end with fake data
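An end-to-end test doesn't need real documents or a real database. A sketch of the idea, using a fake in-memory store and a stand-in for one stage (the class and function names here are invented for illustration, not the pipeline's real API):

```python
class FakeInMemoryStorage:
    """Test double standing in for the real storage interface."""

    def __init__(self):
        self.entities = []

    def add_entity(self, entity):
        self.entities.append(entity)

def run_ner_stage(storage, documents):
    """Stand-in for the NER stage: treats capitalized tokens as entities.

    The real stage would call the actual parser and NER model; the point
    is only that it writes through the storage interface.
    """
    for doc in documents:
        for token in doc.split():
            if token[0].isupper():
                storage.add_entity(token)

def test_pipeline_end_to_end():
    storage = FakeInMemoryStorage()
    run_ner_stage(storage, ["BRCA1 regulates repair", "TP53 is a gene"])
    assert storage.entities == ["BRCA1", "TP53"]
```

Run under pytest, tests like this catch interface drift between stages without any database setup.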

3. Create example notebooks - Jupyter notebooks showing how to query the results would make progress visible

4. Build a simple query interface - A script that lets you ask questions of your graph
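The query interface can start as a single function wrapping a parameterized SQL query; a CLI or notebook can call it directly. The `entities` table and its columns are assumptions here, to be matched to the real schema:

```python
import sqlite3

def query_entities(conn, term):
    """Return (id, label) rows whose label contains term.

    Assumes an entities(id, label) table; adapt to the actual schema.
    """
    cursor = conn.execute(
        "SELECT id, label FROM entities WHERE label LIKE ?",
        (f"%{term}%",),  # parameterized to avoid SQL injection
    )
    return cursor.fetchall()

# Example usage against a pipeline run:
# conn = sqlite3.connect("test_run/pipeline.db")
# for row in query_entities(conn, "kinase"):
#     print(row)
```

Even this small a tool makes progress visible: you can ask a question of the graph and get an answer back the same day.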

Which direction feels most valuable to you right now? Or is there a specific pain point in the current setup you want to address?
