This is a substantial and high-quality piece of engineering. It makes several critical improvements to the training pipeline that address real-world operational concerns. My review is strongly positive.
This PR hardens the nbcore training pipeline by tackling two fundamental issues: the risk of OOM errors from loading large training windows into memory, and the temporal information leakage from using a random train/test split.
It introduces:
- A week-by-week streaming pattern (`stream_training_weeks`) to cap peak memory usage.
- A temporal train/test split to ensure offline metrics more closely reflect production performance.
- A significant cleanup, removing ~550 lines of dead code related to a legacy backfill system.
This is a great example of production-aware model engineering. The focus on data integrity, scalability, and verifiability is evident.
Coming from a background of deploying models at scale, I see the two main changes here not as "nice to haves" but as essential for a system to be considered production-ready.
- **Memory Scalability:** Loading an entire dataset into a pandas DataFrame is a time bomb. The move to `stream_training_weeks` defuses it. Querying week-by-week and immediately slimming each chunk by dropping the `inference` column is a pragmatic and effective pattern. It demonstrates foresight about how systems fail as data volume grows.
- **Temporal-Aware Evaluation:** A random split on time-series data gives a false sense of security. The model learns from "future" data, leading to inflated offline metrics that don't hold up in production. The switch to a temporal split is a non-negotiable requirement for any serious time-series modeling effort, and it shows a mature understanding of the gap between theoretical modeling and operational reality. (A minimal sketch of the distinction follows this list.)
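For readers who haven't been bitten by this, here is a minimal sketch of the distinction, assuming an event-timestamp column (`ts` is an illustrative name, not from this PR):

```python
import pandas as pd

def temporal_split(df: pd.DataFrame, test_fraction: float = 0.2):
    """Split so every test row is no earlier than every train row.

    Assumes a `ts` event-timestamp column (illustrative name).
    """
    df = df.sort_values("ts", kind="stable")
    cutoff = int(len(df) * (1 - test_fraction))
    train, test = df.iloc[:cutoff], df.iloc[cutoff:]
    # The invariant a random split violates: training never sees events
    # that occur after the evaluation window begins.
    assert train["ts"].max() <= test["ts"].min()
    return train, test
```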
The PR also contains a subtle but crucial bug fix that speaks to a rigorous, first-principles approach to engineering.
The `query_observed_variant_ids` pre-flight check is a world-class bug fix.
An ACTIVE variant in the database that happens to have zero rows in the training window is a classic edge case. Many pipelines would proceed silently, leading to a misalignment between the model's action indices and the variant IDs. This would cause the model to mis-attribute predictions, a notoriously difficult bug to track down post-deployment.
This pre-emptive query and intersection (`variant_ids = sorted(variant_id for variant_id in active_variant_ids if variant_id in observed_variant_ids)`) is a beautiful piece of defensive programming. The accompanying test, `test_prepare_week_dataframe_action_indices_consistent_with_variant_ids`, is exactly how this hard-won operational knowledge should be codified to prevent future regressions.
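To make the mechanism concrete, here is a sketch of the pattern; the intersection line is quoted from the PR, while the index-mapping helper around it is my own illustration:

```python
def resolve_action_space(active_variant_ids, observed_variant_ids):
    """Drop ACTIVE variants with zero rows in the training window.

    Sorting makes the variant_id -> action-index mapping deterministic.
    Without the intersection, a zero-row ACTIVE variant would shift every
    subsequent action index and silently mis-attribute predictions.
    """
    variant_ids = sorted(
        variant_id
        for variant_id in active_variant_ids
        if variant_id in observed_variant_ids
    )
    return variant_ids, {vid: i for i, vid in enumerate(variant_ids)}
```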
The code is excellent, but in the spirit of challenging the design, I have one point of discussion and a minor suggestion.
1. Is this true streaming?
The implementation materializes all the weekly chunks into a list before concatenation:
```python
# packages/nbcore/src/nbcore/model_training/train_vw.py
slim_chunks = list(stream_training_weeks(...))
...
df = pd.concat(slim_chunks, ignore_index=True)
```

While this dramatically reduces peak memory by slimming the chunks, the orchestrator still holds the entire dataset (in its slim form) in memory. This is a pragmatic trade-off, but it is not a constant-memory streaming implementation.
Question for the author: Does Vowpal Wabbit's learner require the full dataset for this training configuration? If so, the current approach is the right one. If not, a future architectural refinement could pass the generator directly to the learner, processing one week at a time; that would make the pipeline scale to arbitrarily long training windows. A comment clarifying this constraint would teach future readers why this design decision was made.
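If the answer turns out to be "no, it can learn incrementally" (VW's core reductions are online learners), the refinement might look roughly like the sketch below; `stream_training_weeks` is from this PR, while `to_vw_lines`, the VW flags, and the overall wiring are hypothetical:

```python
from vowpalwabbit import Workspace

def train_streaming(weeks, to_vw_lines):
    """Constant-memory variant: feed the online learner one week at a time.

    `weeks` would be the generator from stream_training_weeks(...);
    `to_vw_lines` is a hypothetical converter from a weekly DataFrame to
    VW text-format example strings.
    """
    vw = Workspace("--quiet")  # real flags would come from the training config
    for week_df in weeks:
        for example in to_vw_lines(week_df):
            vw.learn(example)  # VW updates its weights per example
        del week_df  # only one slim week is ever resident
    return vw
```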
2. Purity in the Functional Core
The docstring for `prepare_week_dataframe` honestly states that it mutates its input:

```python
# packages/nbcore/src/nbcore/model_training/train_vw.py
# Note: Mutates the input DataFrame by adding feature_set and probability
# columns. The caller should treat df as consumed after this call...
```

This is good documentation, but to better adhere to the "functional core, imperative shell" pattern, the function could be made pure.
Suggestion: Refactor `prepare_week_dataframe` to return a new DataFrame instead of modifying the input `df` in place. Since the caller immediately deletes the old `df` (`del df; gc.collect()`), the performance impact of the copy should be negligible, and the gain in architectural purity would be significant. It would make the function's signature, `(pd.DataFrame, ...) -> pd.DataFrame`, a more honest and predictable contract.
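A sketch of the pure variant, assuming `pd.DataFrame.assign` suffices (the column expressions below are placeholders; only the function name and the two column names come from the PR):

```python
import pandas as pd

def prepare_week_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Pure variant: returns a new DataFrame; the input is never mutated.

    assign() copies the frame, so the caller's existing
    `del df; gc.collect()` reclaims the original immediately.
    """
    return df.assign(
        feature_set="default",  # placeholder for the PR's actual feature logic
        probability=1.0,        # placeholder for the PR's actual probability logic
    )
```

The cost is at most one extra week-sized copy held briefly, which the weekly streaming design already keeps small.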
Strongly Approve.
This is an excellent PR that significantly improves the production-readiness of the training pipeline. The changes are well-reasoned, robustly implemented, and demonstrate a deep understanding of building reliable ML systems.