Problem: ClickHouse consumer stuck since 1/27, massive lag preventing event ingestion
Root Cause: 4 bad test messages blocking the consumer across all 3 Kafka partitions:
- Partition 0, Offset 58: {"test":"connection_test",...}
- Partition 1, Offset 56: {"test":"direct_publish",...}
- Partition 2, Offset 113: {"test":"message",...}
- Error: Invalid DateTime parsing - messages missing required schema fields
Fix Applied (Raymond + Kenji):
- Stopped ClickHouse consumer
- Reset consumer group 'clickhouse-raw' offsets:
- Partition 0 → 59
- Partition 1 → 57
- Partition 2 → 114
- Restarted consumer
- Result: Lag reduced to 0, events flowing again
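For reference, the offset reset above can also be scripted. Below is a minimal sketch using the confluent-kafka Python client; the broker address and topic name are placeholders (the actual topic isn't named in these notes), and the group must have no active members when the offsets are committed, which is why the consumer was stopped first.

```python
# Minimal sketch: commit explicit offsets for the 'clickhouse-raw' group so the
# consumer resumes one past each bad test message. Broker and topic names are
# placeholders; run this only while the consumer is stopped (empty group).
from confluent_kafka import Consumer, TopicPartition

BOOTSTRAP = "kafka:9092"      # placeholder broker address
TOPIC = "cdp-events"          # placeholder topic name
GROUP_ID = "clickhouse-raw"   # consumer group from this incident

# One past each bad message: 58 -> 59, 56 -> 57, 113 -> 114.
new_offsets = [
    TopicPartition(TOPIC, 0, 59),
    TopicPartition(TOPIC, 1, 57),
    TopicPartition(TOPIC, 2, 114),
]

consumer = Consumer({
    "bootstrap.servers": BOOTSTRAP,
    "group.id": GROUP_ID,
    "enable.auto.commit": False,
})
consumer.commit(offsets=new_offsets, asynchronous=False)
consumer.close()
```

The same thing can be done with `kafka-consumer-groups --reset-offsets --to-offset ... --execute`, run once per partition.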
Frank's confirmation: "yay, events!"
Current Issues:
- ❌ No schema validation - invalid messages can block entire pipeline
- ❌ No DLQ (Dead Letter Queue) - failed messages have nowhere to go
- ❌ ClickHouse materialized views (MVs) can't easily handle conditional validation
- ❌ Consumer not resilient to malformed messages
Proposed Solutions:
Short-term:
- Schema Registry implementation for pre-publish validation
- DLQ topic setup for failed messages
- Resilient consumer pattern: validate → DLQ on failure → commit offset (see the sketch after this list)
Long-term (Will's recommendation):
- Apache Spark or Kafka Streams for enrichment layer
- Advanced enrichment capabilities:
- Text analysis (tone, topics)
- Video analysis
- IP to Geohash (simpler - can do in ClickHouse)
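A minimal sketch of the resilient consumer pattern from the short-term list, assuming the confluent-kafka and jsonschema Python libraries; the topic names, broker address, and schema fields are placeholders rather than the real message contract:

```python
# Sketch of: validate -> send to DLQ on failure -> commit offset, so a single
# malformed message can no longer block a partition. All names are placeholders.
import json
from confluent_kafka import Consumer, Producer
from jsonschema import FormatChecker, ValidationError, validate

SOURCE_TOPIC = "cdp-events"    # placeholder
DLQ_TOPIC = "cdp-events-dlq"   # placeholder DLQ topic
EVENT_SCHEMA = {               # placeholder; the real schema comes from the schema work below
    "type": "object",
    "required": ["eventType", "subjectType", "createdAt"],
    "properties": {"createdAt": {"type": "string", "format": "date-time"}},
}

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",   # placeholder
    "group.id": "clickhouse-raw",
    "enable.auto.commit": False,         # commit only after the message is handled
})
producer = Producer({"bootstrap.servers": "kafka:9092"})
consumer.subscribe([SOURCE_TOPIC])

def insert_into_clickhouse(event: dict) -> None:
    """Placeholder for the real ClickHouse insert."""

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    try:
        event = json.loads(msg.value())
        validate(event, EVENT_SCHEMA, format_checker=FormatChecker())
        insert_into_clickhouse(event)
    except (ValueError, ValidationError) as exc:
        # Bad message: park it on the DLQ with the error attached instead of blocking.
        producer.produce(DLQ_TOPIC, msg.value(), headers={"error": str(exc)})
        producer.flush()
    # Commit either way so the consumer never gets stuck on this offset again.
    consumer.commit(msg, asynchronous=False)
```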
Infrastructure Option (Kenji):
- ClickHouse Kafka Connect Sink with built-in:
- Schema validation via Confluent Schema Registry
- Automatic DLQ forwarding
- Comes free with Confluent Kafka deployment
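As a concrete but hedged illustration, such a sink could be registered through the Kafka Connect REST API roughly like this. The `errors.*` keys are standard Kafka Connect error-handling options that provide the DLQ forwarding, and the JSON Schema converter points at a Confluent Schema Registry; hostnames, the topic, and credentials are placeholders, and the ClickHouse connection keys should be checked against the connector version in use.

```python
# Sketch: register the ClickHouse Kafka Connect Sink with Schema Registry
# validation and a DLQ. Hosts, topic, and credentials are placeholders; verify
# the connection keys against the connector docs for the deployed version.
import requests

connector = {
    "name": "clickhouse-raw-sink",
    "config": {
        "connector.class": "com.clickhouse.kafka.connect.ClickHouseSinkConnector",
        "topics": "cdp-events",                 # placeholder topic
        "hostname": "clickhouse.internal",      # placeholder ClickHouse host
        "port": "8443",
        "database": "default",
        "username": "default",
        "password": "***",
        # Validate record values against the registry instead of trusting raw JSON.
        "value.converter": "io.confluent.connect.json.JsonSchemaConverter",
        "value.converter.schema.registry.url": "http://schema-registry:8081",
        # Standard Kafka Connect error handling: tolerate bad records and
        # forward them, with context headers, to a DLQ topic.
        "errors.tolerance": "all",
        "errors.deadletterqueue.topic.name": "cdp-events-dlq",
        "errors.deadletterqueue.context.headers.enable": "true",
    },
}

resp = requests.post("http://kafka-connect:8083/connectors", json=connector, timeout=30)
resp.raise_for_status()
```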
Schema Definition Phase:
- Requesting JSON Schema for all Kafka message types
- Working on cdp.yaml - machine-readable schema config
- Generated 105 tables from schema so far
- Covers ~30% of Will's original spreadsheet
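Once those JSON Schemas exist, they can be registered so messages are validated before publish. A hedged sketch against the Confluent Schema Registry REST API; the registry URL, subject name, and example fields are hypothetical and would come out of the cdp.yaml work:

```python
# Sketch: register a JSON Schema for one message type with a Confluent Schema
# Registry so it can be enforced pre-publish. URL, subject, and fields are
# placeholders; the real schemas come from the cdp.yaml work.
import json
import requests

REGISTRY_URL = "http://schema-registry:8081"   # placeholder
SUBJECT = "cdp-events-value"                   # placeholder: <topic>-value naming

event_schema = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "required": ["eventType", "subjectType", "createdAt"],   # hypothetical fields
    "properties": {
        "eventType": {"type": "string"},
        "subjectType": {"type": "string"},
        "createdAt": {"type": "string", "format": "date-time"},
    },
    "additionalProperties": True,
}

resp = requests.post(
    f"{REGISTRY_URL}/subjects/{SUBJECT}/versions",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    json={"schemaType": "JSON", "schema": json.dumps(event_schema)},
    timeout=30,
)
resp.raise_for_status()
print("registered schema id:", resp.json()["id"])
```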
Analytics Architecture Components:
- Facts: Raw events (what we're ingesting now)
- Metrics: Aggregations (count, sum, distinct counts)
- Dimensions: Categorical values for slicing (GROUP BY fields)
Will's 200+ metrics need to be defined in terms of these components.
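For illustration only, here is how one hypothetical metric would map onto those components as a ClickHouse query (via the clickhouse-connect Python client); the table and column names are placeholders, not the real schema:

```python
# Illustration of facts / metrics / dimensions on a hypothetical raw-events table.
# Table and column names are placeholders; assumes the clickhouse-connect client.
import clickhouse_connect

client = clickhouse_connect.get_client(host="clickhouse.internal")  # placeholder host

# Facts:      rows in raw_events (the events being ingested now)
# Metric:     uniqExact(user_id)   -- a distinct-count aggregation
# Dimensions: event_date, platform -- the GROUP BY fields used for slicing
result = client.query(
    """
    SELECT
        toDate(created_at) AS event_date,
        platform,
        uniqExact(user_id) AS daily_active_users
    FROM raw_events
    GROUP BY event_date, platform
    ORDER BY event_date
    """
)
for row in result.result_rows:
    print(row)
```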
New documentation for exposing NodePort services in the manually deployed cluster: https://github.com/parler-tech/customer-data-platform/blob/production/docs/adding_nodeport_services.md
Our work validated:
- ✅ PR #313 events ARE flowing correctly to Kafka
- ✅ New schema with rich context fields working as designed
- ✅ SubjectType changes not causing issues
- ✅ Pipeline blockage was infrastructure issue (test messages), NOT PR #313
Next: Kenji's schema work will formalize the event structure we've been validating, so that proper schema validation prevents future pipeline blockages.
Ready for merge to development after:
- Kenji confirms the pipeline can handle the subjectType change (the infrastructure issue is now resolved)
- R4wm approval
The application is correctly sending events with the new schema. All app-side testing is complete.