@deepakjois
Created February 2, 2026 16:21
RAS (Recording Analytics Service) Deep Dive - Temporal Migration Analysis
╭─── Claude Code v2.1.20 ──────────────────────────────────────────────────────╮
│ │ Tips for getting │
│ Welcome back Deepak! │ started │
│ │ Run /init to create a … │
│ ▐▛███▜▌ │ ─────────────────────── │
│ ▝▜█████▛▘ │ Recent activity │
│ ▘▘ ▝▝ │ No recent activity │
│ │ │
│ Opus 4.5 · Claude Pro · deepak.jois@gmail.com's │ │
│ Organization │ │
│ ~/code/veo/recording-analytics │ │
╰──────────────────────────────────────────────────────────────────────────────╯
❯ do a deep dive into this codebase and describe the various parts of it.
give a conceptual overview of the codebase with data flows and ascii
diagrams wherever possible
Here is the ticket where a dev from another infrastructure team will
assist us with migrating the codebase over to use Temporal workflows.
Organize the explanation of the codebase around themes in terms of the
ticket overview below
# Centralized orchestration > Move analytics orchestration from RAS to
temporal
Overall responsibilities of RAS:
1. Orchestrating AI jobs, i.e. sending requests to EMA and keeping track
of the order of jobs. In this process some validation and a few
transformations of the EMA output happens (e.g. changing a pts column to
a timestamp column).
2. Delivering data to clients (Zola+MES) for features in the product.
Here some transformation of the data happens on-the-fly.
3. Storing output files from AI jobs and storing the location of those
files.
---
Stages of how to roll out:
1. Move orchestration to temporal
2. Move delivery to MES
3. Move what remains of storage to recordings-api
---
Migration plan:
* RAS should be able to handle both ways of doing orchestration (temporal
+ RAS's internal orchestration)
* Shadow the existing processing workflow. Send traffic through both
flows and check that the temporal workflow produces the right stuff.
* Send a percentage of "orchestration traffic" to temporal instead of
RAS's internal orchestration
---
Questions to figure out:
* Should we completely reuse the storage in RAS in the first round?
* Should temporal query RAS for a presigned URL of where to upload
files (to the RAS S3 bucket), i.e. reuse storage in RAS completely?
* Or should the temporal workflow store the files in veo-content-ii?
* Should temporal update the job states of RAS? This will make sure
that RAS can still serve its clients exactly as it does today.
* Where does RAS "stop"?
* What is the interface between RAS/MES, RAS/Zola?
* Which transformations in RAS go to temporal and which go to MES?
* Should the first part of the project not include the transformations
that RAS currently performs today?
* RAS is producing files based on EMA output. These files are deleted
after 7 days. So RAS is essentially caching files. Should those files be
produced as part of the temporal workflow?
* Figure out if we should add separate asset types for "raw AI output"
and "RAS transformed" files?
---
For the future project about moving storage out of RAS:
* Figure out which clients of RAS use the file locations that are stored
in RAS. Ideally we would like RAS not to store file locations. Can the
clients simply look up file locations in recordings-api instead?
* Make sure that RAS job states are not required by any client
* When we start to orchestrate via temporal, RAS will not store the
state of jobs. We should make sure that no client breaks if the job
state is no longer updated. Also, L2/L3 supporters use the Zola admin
page to see if RAS/analytics failed for some reason. We need to make
sure that L2/L3 can still get information about whether or not analytics
failed. Coordinate with L2/L3 about a solution for this.
## Metadata
- URL: https://linear.app/veo-technologies/project/centralized-orchestration-move-analytics-orchestration-from-ras-to-2d39bcc34b44
- Status: Now
- Lead: Tobias Slott Jensen
- Members: Tobias Slott Jensen
- Start date: Jan 12th
- Target date: Not set
⏺ I'll do a thorough exploration of the codebase to build a comprehensive
picture. Let me launch several exploration agents in parallel to cover
different aspects.
⏺ 5 Explore agents finished (ctrl+o to expand)
├─ Explore top-level structure · 36 tool uses · 39.1k tokens
│ ⎿  Done
├─ Explore orchestration/jobs system · 55 tool uses · 67.4k tokens
│ ⎿  Done
├─ Explore API/delivery layer · 52 tool uses · 57.0k tokens
│ ⎿  Done
├─ Explore storage layer · 60 tool uses · 68.4k tokens
│ ⎿  Done
└─ Explore external integrations · 59 tool uses · 61.3k tokens
⎿  Done
⏺ Now let me put together the comprehensive overview organized around the
migration themes.
---
RAS (Recording Analytics Service) — Deep Dive
Tech Stack at a Glance
- Language: TypeScript on Node.js 22
- Runtime: AWS Lambda (serverless)
- IaC: AWS CDK
- Storage: DynamoDB (state), S3 (files)
- Messaging: SQS (job queues), EventBridge (domain events)
- Architecture: Hexagonal (Ports & Adapters), Event Sourcing
---
High-Level Architecture
┌─────────────────────────────────────────────────┐
│                     CLIENTS                     │
│    Zola (Admin UI)        MES (via EventBridge) │
└──────────┬──────────────────────┬───────────────┘
│ HTTP (X-Token) │ EventBridge
▼ ▲
┌─────────────────────────┐ │
│ CloudFront CDN │ │
│ (7-14 day cache on GET) │ │
└───────────┬─────────────┘ │
│ │
┌───────────▼─────────────┐ │
│ API Gateway │ │
│ (REST, token auth) │ │
└───────────┬─────────────┘ │
│ │
┌───────────────────────────────┼────────────────────────┤
│ RAS (Lambda Functions) │
│ │
│ ┌──────────────────────────────────────────────┐ │
│ │ 1. ORCHESTRATION │ │
│ │ │ │
│ │ jobPrepper ──► jobScheduler ──► jobTrigger │ │
│ │ │ ▲ │ │ │
│ │ │ DynamoDB │ Streams │ │ │
│ │ ▼ │ ▼ │ │
│ │ [AnalyticsState] ◄──┘ EMA (AI) │ │
│ │ │ │ │
│ │ jobImporter ◄── jobCallback ◄─────┘ │ │
│ └──────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────┐ │
│ │ 2. DELIVERY (API Handlers) │ │
│ │ │ │
│ │ GET /player-tracking (normalize + merge) │ │
│ │ GET /events (filter by type) │ │
│ │ GET /match-details (serve from S3) │ │
│ │ GET /shot-details (serve from S3) │ │
│ │ GET /match-momentum (serve from DynamoDB)│ │
│ │ GET /match-report (serve from S3) │ │
│ │ GET /analytics (job state summary) │──────┘
│ └──────────────────────────────────────────────┘ publishes
│ domain events
│ ┌──────────────────────────────────────────────┐
│ │ 3. STORAGE │
│ │ │
│ │ S3: raw files (CSV, JSON, Parquet) │
│ │ S3: processed files (JSON) │
│ │ S3: cache bucket (7-day TTL) │
│ │ DynamoDB: analytics state + change log │
│ │ DynamoDB: recording metadata │
│ └──────────────────────────────────────────────┘
└─────────────────────────────────────────────────────────┘
---
Responsibility 1: Orchestrating AI Jobs
This is the core of what the Temporal migration targets. RAS manages a
pipeline of 6 AI job types with dependency ordering, communicating with the
EMA inference service.
Job Types & Dependencies
INDEPENDENT (run in parallel) DEPENDENT (need all 3 independent
jobs)
┌──────────────────┐ ┌──────────────────┐
│ event-detection │──────────┐ │ match-details │
└──────────────────┘ │ └──────────────────┘
┌──────────────────┐ ├─────►┌──────────────────┐
│ player-tracking │──────────┤ │ shot-details │
└──────────────────┘ │ └──────────────────┘
┌──────────────────┐ │ ┌──────────────────┐
│ ball-tracking │──────────┘ │ match-report │
└──────────────────┘ └──────────────────┘
The three independent jobs can run concurrently. The three dependent jobs each
require all three independent jobs to reach succeeded/completed before they
can be triggered.
Job State Machine
Each job goes through these states:
null ──► pending ──► scheduled ──► starting ──► running ──► succeeded ──► completed
                                                            (import finishes)

(any stage) ──► failed
- pending → job is validated and ready
- scheduled → EMA has accepted the job
- starting / running → EMA is processing
- succeeded → EMA finished; raw output is on S3
- completed → RAS has imported and transformed the output
- failed → error at any stage
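Expressed as a type, a sketch only (the real model lives in src/core/models/analyticState.ts):

// The job lifecycle above as a union type (sketch, not the actual definition).
type JobStatus =
  | 'pending'    // validated and ready
  | 'scheduled'  // EMA has accepted the job
  | 'starting'
  | 'running'    // EMA is processing
  | 'succeeded'  // raw output is on S3
  | 'completed'  // RAS imported and transformed the output
  | 'failed';    // error at any stage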
The 5-Lambda Orchestration Pipeline
The orchestration is driven by SQS queues and DynamoDB Streams, spread across
5 Lambda functions:
Zola (POST /jobs)
┌────────────┐ PrepJobQueue ┌────────────┐
│ API Handler │ ──── (SQS) ────► │ jobPrepper │
└────────────┘ └──────┬─────┘
│ Validates recording metadata
│ Sets jobs to "pending"
│ Writes AnalyticsChange to DynamoDB
┌──────────────┐
│ DynamoDB │
│ (analytics │
│ changes) │
└──────┬───────┘
│ DynamoDB Stream
┌──────────────┐
│ jobScheduler │ Compares old vs new state
│ │ Evaluates dependencies
└──┬───────┬──┘
│ │
┌──────────────────────┘ └──────────────────┐
▼ ▼
JobTriggerQueue JobImporterQueue
(SQS) (SQS)
│ │
▼ ▼
┌─────────────┐ ┌──────────────┐
│ jobTrigger │ │ jobImporter │
│ │ │ │
│ Calls EMA │ │ Reads raw S3 │
│ via HTTP │ │ Transforms │
└──────┬──────┘ │ Saves processed│
│ └──────────────┘
┌─────────────┐
│ EMA │ (external AI inference service)
│ /v2, /v3 │
└──────┬──────┘
│ HTTP webhook callback
JobCallbackQueue (SQS)
┌─────────────┐
│ jobCallback │
│ │
│ Maps EMA │
│ status to │
│ Analytics │
│ Change │──────► DynamoDB ──► Stream ──► jobScheduler
└─────────────┘ (loop back to evaluate next steps)
Key design pattern: The jobScheduler is triggered by DynamoDB Streams on the
analytics changes table. Every state change (from any source) triggers it to
re-evaluate what should happen next. This is the central decision point that
would be replaced by Temporal workflow logic.
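A sketch of what that stream-driven decision point looks like, with illustrative names only (the real handler is src/apps/jobScheduler/analyticsStateDynamoDBStreamHandler.ts):

import type { DynamoDBStreamEvent } from 'aws-lambda';
import { unmarshall } from '@aws-sdk/util-dynamodb';
import type { AttributeValue } from '@aws-sdk/client-dynamodb';

export async function handler(event: DynamoDBStreamEvent): Promise<void> {
  for (const record of event.Records) {
    if (!record.dynamodb?.NewImage) continue;
    // Stream AttributeValue types differ slightly from the SDK's, hence the cast.
    const newState = unmarshall(record.dynamodb.NewImage as Record<string, AttributeValue>);
    const oldState = record.dynamodb.OldImage
      ? unmarshall(record.dynamodb.OldImage as Record<string, AttributeValue>)
      : undefined;
    // Re-evaluate dependencies: enqueue ready jobs on JobTriggerQueue,
    // succeeded jobs on JobImporterQueue (illustrative helper, not the real one).
    await evaluateNextSteps(oldState, newState);
  }
}

declare function evaluateNextSteps(
  oldState: Record<string, unknown> | undefined,
  newState: Record<string, unknown>,
): Promise<void>;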
EMA Communication Details
Triggering a job (src/integrations/ema/client.ts; a sketch follows this list):
- RAS generates presigned S3 PUT URLs where EMA should upload results
- RAS sends HTTP POST to EMA with job inputs, metadata, and a callback URL
- EMA versions: v2 for
event-detection/shot-details/match-details/match-report, v3 for
ball-tracking/player-tracking
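A minimal sketch of the trigger call shape, assuming hypothetical endpoint paths, env vars, and request body fields; the real contract lives in src/integrations/ema/client.ts:

import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';
import { getSignedUrl } from '@aws-sdk/s3-request-presigner';

const s3 = new S3Client({});

async function triggerEmaJob(recordingId: string, jobType: string): Promise<void> {
  // Presigned PUT URL where EMA uploads its raw output (3-day expiry, see the storage section).
  const uploadUrl = await getSignedUrl(
    s3,
    new PutObjectCommand({
      Bucket: process.env.RAW_ANALYTICS_BUCKET, // ras-raw-analytics-bucket-{env}
      Key: `${jobType}/${recordingId}.csv`,
    }),
    { expiresIn: 3 * 24 * 60 * 60 },
  );

  // POST job inputs, metadata, and the webhook callback URL to EMA.
  const res = await fetch(`${process.env.EMA_BASE_URL}/v3/jobs`, { // hypothetical path
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      job_kind: jobType, // field names assumed from the callback payload
      recording_id: recordingId,
      upload_url: uploadUrl,
      callback_url: `${process.env.RAS_BASE_URL}/recordings/${recordingId}/jobs/callback`,
    }),
  });
  if (!res.ok) throw new Error(`EMA trigger failed: ${res.status}`);
}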
Callback from EMA (src/integrations/ema/callbackHandler.ts):
- EMA POSTs to /recordings/{id}/jobs/callback when job state changes
- The callback goes through SQS → jobCallback Lambda
- RAS validates the job_id is still current (prevents stale callbacks)
- Maps EMA status to an AnalyticsChange record in DynamoDB
Transformations During Import
When a job succeeds, the importer reads raw EMA output from S3 and transforms
it:
Job Type: player-tracking
Raw Format: CSV (v2/v3)
Transformation: Normalize coordinates to [0,1] range using field dimensions;
merge ball tracking data by PTS/timestamp; v2: convert
frameId→timestamp via FPS
────────────────────────────────────────
Job Type: ball-tracking
Raw Format: CSV
Transformation: Stored raw; merged into player-tracking during that import
────────────────────────────────────────
Job Type: event-detection
Raw Format: JSON
Transformation: Validated against schema (v2, v2.1, v2.2, v3)
────────────────────────────────────────
Job Type: match-details
Raw Format: CSV
Transformation: Parsed and stored as JSON
────────────────────────────────────────
Job Type: shot-details
Raw Format: JSON
Transformation: Validated and stored
────────────────────────────────────────
Job Type: match-report
Raw Format: JSON
Transformation: Validated and stored
────────────────────────────────────────
Job Type: match-momentum
Raw Format: CSV
Transformation: Parsed and stored in DynamoDB
The PTS-to-timestamp conversion mentioned in the ticket happens in the player
tracking importer (src/core/playerTracking/index.ts): v2 format uses frameId /
fps to get seconds, while v3 has timestamp_in_seconds directly.
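A sketch of that conversion (the real logic is in src/core/playerTracking/index.ts; the row field names are assumptions, and fps comes from the recordingMetadata table):

interface TrackingRow { frameId?: number; timestamp_in_seconds?: number; }

function rowTimestampSeconds(row: TrackingRow, fps: number): number {
  // v3 rows carry a timestamp directly; v2 rows only have a frame index.
  if (row.timestamp_in_seconds !== undefined) return row.timestamp_in_seconds;
  if (row.frameId !== undefined) return row.frameId / fps;
  throw new Error('Row has neither timestamp_in_seconds nor frameId');
}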
---
Responsibility 2: Delivering Data to Clients
RAS serves processed analytics data via REST API behind CloudFront.
API Endpoint Map
/recordings/{recording_id}/
├── analytics GET (job state summary) No cache
├── analytics POST (replay state - Zola auth) No cache
├── analytics/logs GET (processing logs) No cache
├── player-tracking GET (merged + normalized data) 7-14 day cache
├── player-tracking/
│ └── jersey-numbers GET (jersey numbers by period) 7-14 day cache
├── player-moments GET (key player moments) 7-14 day cache
├── match-momentum GET (momentum timeline) 7-14 day cache
├── match-details GET (match stats) 7-14 day cache
├── match-details/
│ └── attacking-moments GET (sub-analysis) 7-14 day cache
├── events GET (event detections, ?type=) 7-14 day cache
├── shot-details GET (shot/goal data) 7-14 day cache
├── match-report GET (match summary) No cache
├── jobs POST (trigger jobs - Zola auth)
├── jobs DELETE (cancel jobs - Zola auth)
├── jobs/callback POST (EMA callback → SQS)
├── upload-url GET (presigned S3 URL - EMA auth)
├── raw-files GET (raw file links - Zola auth) 60s cache
└── duplicate POST (copy analytics - Zola auth)
On-the-Fly Transformations in Delivery
The GET endpoints don't just serve stored files — several do real-time
processing:
/player-tracking — The most complex; the normalization step is sketched after this list:

Raw player tracking CSV (S3)
  + Raw ball tracking CSV (S3)
  + Recording metadata (field dims)
  + Query params (?start, ?end, ?jersey, ?side, ?filters)
        │
        ▼
  Merge by timestamp/PTS
        │
        ▼
  Normalize coordinates to [0,1]
        │
        ▼
  Filter by time range, jersey, side, etc.
        │
        ▼
  JSON response with merged player+ball data
/events — Filters raw event JSON by ?type= query parameter
/jersey-numbers — Extracts and groups jersey numbers by team side and period
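A sketch of the normalization step from the /player-tracking flow above, assuming pitch dimensions come from recording metadata; the field names and units are assumptions:

interface PitchDimensions { length: number; width: number; }

function normalizePosition(x: number, y: number, pitch: PitchDimensions): { x: number; y: number } {
  // Map raw field coordinates into the [0,1] range that the API serves.
  return { x: x / pitch.length, y: y / pitch.width };
}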
Authentication
┌────────────┐  X-Token header   ┌───────────────────┐
│    Zola    │ ────────────────► │   ZolaTokenAuth   │ ──► POST/DELETE endpoints
└────────────┘                   │ (Secrets Manager) │
                                 └───────────────────┘

┌────────────┐  X-Token header   ┌───────────────────┐
│    EMA     │ ────────────────► │   EmaTokenAuth    │ ──► /upload-url endpoint
└────────────┘                   │ (Secrets Manager) │
                                 └───────────────────┘
GET endpoints (analytics data) ──► No auth (public, cached by CloudFront)
Caching Architecture
Client
  ▼
CloudFront (Layer 1)
  │ Cache-Control: public, max-age=604800 (7 days)
  │ Cache keys include query params: start, end, jersey, side, type, etc.
  ▼
API Gateway → Lambda (Layer 2)
  │ Application-level cache for player tracking (CacheRepository → S3)
  ▼
Recordings API calls (Layer 3)
  │ Undici HTTP cache interceptor: 5-minute TTL on GET 200s
  ▼
S3 / DynamoDB (data source)
Domain Event Publishing
When jobs complete, RAS publishes to EventBridge (this is how MES and other
services learn about results):
DynamoDB Stream (analytics state changes)
OutboundEventPublisher Lambda
DomainEventBroadcaster evaluates:
├── shot-details completed? ──► ShotDetailsProcessedV1
├── event-detection completed? ──► EventDetectionProcessedV1
├── match-report completed? ──► MatchReportProcessedV1
└── ALL jobs completed? ──► RecordingAnalysisProcessedV1
(includes links to all endpoints)
Event format: Source: "com.veo.recording-analytics", published to shared
EventBridge bus.
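A sketch of publishing one of these events; Source and DetailType come from above, while the detail payload shape and bus env var are assumptions:

import { EventBridgeClient, PutEventsCommand } from '@aws-sdk/client-eventbridge';

const eventBridge = new EventBridgeClient({});

async function publishShotDetailsProcessed(recordingId: string): Promise<void> {
  await eventBridge.send(new PutEventsCommand({
    Entries: [{
      EventBusName: process.env.SHARED_EVENT_BUS, // shared bus name (assumed env var)
      Source: 'com.veo.recording-analytics',
      DetailType: 'ShotDetailsProcessedV1',
      Detail: JSON.stringify({ recordingId }),
    }],
  }));
}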
---
Responsibility 3: Storage
S3 Bucket Layout
ras-raw-analytics-bucket-{env}/
├── ball-tracking/{recordingId}.csv
├── player-tracking/{recordingId}.csv
├── match-details/{recordingId}.csv
├── match-momentum/{recordingId}.csv
├── events/{recordingId}.json
├── shot-details/{recordingId}.json
├── match-report/{recordingId}.json
└── physical-metrics/{recordingId}.parquet
ras-events-bucket-{env}/
└── {recordingId}.json (processed events)
ras-player-tracking-bucket-{env}/
└── {recordingId}.json (processed player tracking)
ras-match-details-bucket-{env}/
└── {recordingId}.json (processed match details)
ras-shot-details-bucket-{env}/
└── {recordingId}.json (processed shot details)
ras-match-report-bucket-{env}/
└── {recordingId}.json (processed match report)
ras-cache-bucket-{env}/ (7-day auto-expiry lifecycle rule)
└── playerTracking:{recordingId} (cached transformed data)
DynamoDB Tables
Table: analyticsStates
PK: id (recordingId)
SK: —
Purpose: Current job statuses (projected view)
────────────────────────────────────────
Table: analyticsStateChanges
PK: id (recordingId)
SK: version
Purpose: Event sourcing log (immutable)
────────────────────────────────────────
Table: recordingMetadata
PK: id (recordingId)
SK: —
Purpose: FPS, duration, pitch dims, camera URIs
────────────────────────────────────────
Table: matchMomentum
PK: id (recordingId)
SK: —
Purpose: Momentum timeline data
────────────────────────────────────────
Table: dataDuplication
PK: id (recordingId)
SK: —
Purpose: Tracks duplicated recordings
The analyticsStateChanges table has DynamoDB Streams enabled — this is the
backbone of the orchestration, triggering the jobScheduler on every state
change.
Presigned URL Flow
EMA needs to upload results:
EMA ──► GET /upload-url?jobType=ball-tracking
RAS generates PutObject presigned URL (3-day expiry)
for: ras-raw-analytics-bucket/{jobType}/{recordingId}.csv
EMA uploads file directly to S3
Client needs to read raw files:
Zola ──► GET /raw-files
RAS generates GetObject presigned URLs (15-min expiry)
for each raw file that exists
Returns JSON with download links
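A sketch of generating the 15-minute GET links served by /raw-files; bucket and key follow the S3 layout above, and the handler around this is omitted:

import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3';
import { getSignedUrl } from '@aws-sdk/s3-request-presigner';

const s3Client = new S3Client({});

async function rawFileDownloadUrl(recordingId: string, jobType: string): Promise<string> {
  return getSignedUrl(
    s3Client,
    new GetObjectCommand({
      Bucket: process.env.RAW_ANALYTICS_BUCKET, // ras-raw-analytics-bucket-{env}
      Key: `${jobType}/${recordingId}.csv`,
    }),
    { expiresIn: 15 * 60 }, // 15-minute expiry, per the flow above
  );
}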
File Lifecycle
- Raw files: Created by EMA upload, persist indefinitely (no lifecycle rule)
- Processed files: Created by importers, persist indefinitely
- Cache files: Auto-deleted after 7 days (S3 lifecycle rule on cache bucket)
- Data duplication: DataDuplicator can copy all files from one recording to
another
---
Mapping to Temporal Migration Concerns
Here's how the codebase maps to the questions in the ticket:
What moves to Temporal?
The entire orchestration pipeline (the 5 Lambdas + DynamoDB Streams trigger
loop):
TODAY (RAS internal orchestration):
PrepJobQueue → jobPrepper → DynamoDB → Stream → jobScheduler
→ JobTriggerQueue → jobTrigger → EMA → callback → jobCallback
→ DynamoDB → Stream → jobScheduler → ImporterQueue → jobImporter
FUTURE (Temporal):
Temporal Workflow:
1. Validate recording (activity)
2. Trigger independent jobs in parallel (activities)
3. Wait for EMA callbacks (signals/activities)
4. When all 3 independent done → trigger dependent jobs (activities)
5. Wait for EMA callbacks
6. Import results (activities)
7. Publish domain events (activity)
What stays in RAS?
- Delivery layer: All GET API endpoints with on-the-fly transformations
- Storage: S3 buckets, DynamoDB tables, presigned URL generation
- Authentication: Token-based authorizers for Zola/EMA
Shadow mode / dual orchestration
The ticket mentions running both flows in parallel. Key integration points:
- Job state updates: Temporal would need to write to the same
analyticsStateChanges DynamoDB table (or RAS needs an API to accept state
updates)
- EMA callbacks: Currently go to /recordings/{id}/jobs/callback → SQS.
Temporal could either receive these callbacks directly or RAS could forward
them
- Domain events: Currently published via DynamoDB Streams → EventBridge. If
Temporal manages state, it needs to trigger these events too
Transformations: Temporal vs MES vs RAS?
┌───────────────────────┬─────────────────────────────┬──────────────────────┐
│ Transformation │ Current Location │ Migration Target │
├───────────────────────┼─────────────────────────────┼──────────────────────┤
│ PTS → timestamp │ jobImporter │ Temporal activity or │
│ │ (player-tracking) │ MES │
├───────────────────────┼─────────────────────────────┼──────────────────────┤
│ Coordinate │ Delivery (player-tracking │ MES │
│ normalization │ GET) │ │
├───────────────────────┼─────────────────────────────┼──────────────────────┤
│ Ball+player merge │ Delivery (player-tracking │ MES │
│ │ GET) │ │
├───────────────────────┼─────────────────────────────┼──────────────────────┤
│ Event type filtering │ Delivery (events GET) │ MES │
├───────────────────────┼─────────────────────────────┼──────────────────────┤
│ Jersey number │ Delivery (player-tracking │ MES │
│ extraction │ GET) │ │
├───────────────────────┼─────────────────────────────┼──────────────────────┤
│ JSON schema │ jobImporter (events) │ Temporal activity │
│ validation │ │ │
└───────────────────────┴─────────────────────────────┴──────────────────────┘
File storage questions from ticket
- "Should Temporal query RAS for presigned URLs?" — RAS already has GET
/upload-url (EMA-auth protected). Temporal could use this, or generate URLs
directly if it has S3 access.
- "Raw AI output vs RAS transformed files?" — Today these are in separate S3
buckets (raw-analytics-bucket vs per-type processed buckets). Making them
distinct asset types in recordings-api would be clean.
- "7-day cached files" — Only the ras-cache-bucket has the 7-day lifecycle.
Raw and processed files persist indefinitely. The cache stores pre-computed
player-tracking responses.
Client interfaces (where RAS "stops")
Zola ◄──── RAS API (GET endpoints + POST /jobs, /analytics, /duplicate)
├── Zola uses: player-tracking, events, match-details,
│ shot-details, match-report, match-momentum,
│ player-moments, analytics state, raw-files
MES ◄──── EventBridge domain events
├── MES listens for: RecordingAnalysisProcessedV1,
│ EventDetectionProcessedV1, ShotDetailsProcessedV1,
│ MatchReportProcessedV1
EMA ◄──── RAS (HTTP POST to trigger jobs)
EMA ────► RAS (HTTP webhook callback + S3 upload)
Key files for the migration team
Concern: Job types & dependencies
Key Files: src/core/models/types/index.ts
────────────────────────────────────────
Concern: Orchestration logic
Key Files: src/core/job/jobScheduler.ts
────────────────────────────────────────
Concern: DynamoDB stream trigger
Key Files: src/apps/jobScheduler/analyticsStateDynamoDBStreamHandler.ts
────────────────────────────────────────
Concern: EMA client
Key Files: src/integrations/ema/client.ts
────────────────────────────────────────
Concern: EMA callback handling
Key Files: src/integrations/ema/callbackHandler.ts
────────────────────────────────────────
Concern: Job state model
Key Files: src/core/models/analyticState.ts
────────────────────────────────────────
Concern: Data importers
Key Files: src/core/dataImporters/ (6 importers)
────────────────────────────────────────
Concern: Domain event publishing
Key Files: src/core/domainEventBroadcaster/domainEventBroadcaster.ts
────────────────────────────────────────
Concern: API endpoints (delivery)
Key Files: src/adapters/driving/AWSLambda/controllers/
────────────────────────────────────────
Concern: Storage infrastructure
Key Files: stack/recording-analytics-storage-stack.ts
────────────────────────────────────────
Concern: Queue definitions
Key Files: stack/recordingAnalyticsQueue.ts
────────────────────────────────────────
Concern: Full stack wiring
Key Files: stack/recording-analytics-stack.ts
✻ Crunched for 4m 19s
❯ /rename
⎿ Please provide a name for the session. Usage: /rename
❯ /rename ras-deep-dive
⎿ Session renamed to: ras-deep-dive
❯ /export
⎿ Conversation copied to clipboard
❯ ok let's focus on just the /jobs POST endpoint, and the best way to
deprecate that by incorporating a Temporal workflow. For now, let's
keep the rest of the GET endpoints as APIs in RAS. All the orchestration
with DynamoDB streams needs to go away. Also explore the question of
whether we need any of the SQS queues involved, or whether we handle that
within Temporal itself.
⏺ I'll explore the specific components involved in the /jobs POST endpoint and
the orchestration flow to understand exactly what needs to change for
Temporal.
⏺ 3 Explore agents finished (ctrl+o to expand)
├─ Explore /jobs POST endpoint · 35 tool uses · 39.7k tokens
│ ⎿  Done
├─ Explore SQS queues and purposes · 57 tool uses · 54.3k tokens
│ ⎿  Done
└─ Explore EMA callback flow · 48 tool uses · 59.3k tokens
⎿  Done
⏺ Excellent - I now have a detailed understanding of the orchestration flow. Let
me launch a Plan agent to design the migration approach.
⏺ Plan(Design Temporal migration plan)
⎿  Done (45 tool uses · 63.0k tokens · 2m 46s)
⏺ Let me write the plan to the plan file and then ask some clarifying questions.
⏺ Updated plan
⎿  /plan to preview
⏺ I've written the migration plan. Before finalizing, I have a few questions to
clarify:
⏺ User answered Claude's questions:
⎿  · Where will Temporal be hosted? → Temporal Cloud (Recommended)
· How should the Temporal worker be deployed? → TBD with infra team
· For shadow mode, EMA will receive duplicate job requests. How should
this be handled? → Skip shadow mode
· How should EMA callbacks reach the Temporal workflow? → Keep SQS
buffer initially
⏺ Thanks for the clarifications. Let me update the plan with these decisions.
⏺ Updated plan
⎿  /plan to preview
⏺ Updated plan
⎿  /plan to preview
⏺ Updated plan
⎿  /plan to preview
⏺ Updated plan
⎿  /plan to preview
⏺ Read 1 file (ctrl+o to expand)
⏺ Let me fix the verification section to remove the shadow mode reference since
we're skipping it.
⏺ Updated plan
⎿  /plan to preview
⏺ The plan is complete and reflects your decisions:
Summary:
- Temporal Cloud for hosting (managed)
- Keep SQS buffer for EMA callbacks (SQS → Lambda → Temporal signal)
- Skip shadow mode - validate in staging, then cut over to production
- Worker deployment TBD with infrastructure team
Key changes:
- POST /jobs will start a Temporal workflow instead of publishing to SQS
- 4 Lambdas eliminated: jobPrepper, jobScheduler, jobTrigger, jobImporter
- 3 SQS queues eliminated: PrepJobQueue, JobTriggerQueue, JobImporterQueue
- JobCallbackQueue kept as buffer for EMA callbacks
- DynamoDB streams orchestration removed; workflow handles dependency logic
- Activities wrap existing services (PrepareJobsService, TriggerJobService,
importers)
⏺ User approved Claude's plan
⎿  Plan saved to: ~/.claude/plans/silly-questing-turtle.md · /plan to edit
RAS Temporal Migration Plan
Goal
Replace the /jobs POST endpoint orchestration with Temporal workflows,
removing DynamoDB streams-based orchestration and SQS queues while
keeping GET endpoints unchanged.
---
Current State
POST /jobs → InitJobService → PrepJobQueue (SQS)
                    ↓
             jobPrepper Lambda
                    ↓
      DynamoDB (analyticsStateChanges)
                    ↓
              DynamoDB Stream
                    ↓
            jobScheduler Lambda
              ↓           ↓
   JobTriggerQueue   JobImporterQueue
              ↓           ↓
        jobTrigger     jobImporter
              ↓
         EMA (async)
              ↓
/jobs/callback → JobCallbackQueue → jobCallback
                        ↓
      DynamoDB → Stream → jobScheduler (loop)
4 SQS queues involved:
- PrepJobQueue (to eliminate)
- JobTriggerQueue (to eliminate)
- JobImporterQueue (to eliminate)
- JobCallbackQueue (kept as buffer for EMA callbacks)
5 Lambdas to replace:
- jobPrepper
- jobScheduler
- jobTrigger
- jobImporter
- jobCallback (partial - keeps HTTP endpoint)
---
Target Architecture
POST /jobs → CreateJobs Controller
                 ↓
        Temporal Client.start()
                 ↓
     RecordingAnalyticsWorkflow
     ┌───────────┴──────────────────────────────┐
     ↓                                          ↓
prepareJobs activity           [parallel] triggerJob × 3
     ↓                         (player-tracking, ball-tracking,
DynamoDB (metadata)             event-detection)
                                                ↓
                                        EMA HTTP calls
                                                ↓
                               await signal (emaJobCallback)
                                                ↓
                                   [parallel] importers
                                                ↓
                               [parallel] triggerJob × 3
                               (match-details, shot-details, match-report)
                                                ↓
                                await signals + importers
                                                ↓
                                    Workflow complete
EMA Callbacks (keep SQS buffer):
/jobs/callback → SQS → Lambda → Temporal signal to workflow
---
Workflow Design
Single Workflow per Recording
recordingAnalyticsWorkflow(recordingId, jobTypes[], actor)
├── Activity: prepareJobs()
│   └── Validates recording, fetches metadata, writes to DynamoDB
├── Phase 1: Independent Jobs (parallel)
│   ├── triggerJob("player-tracking") → await signal → importPlayerTracking()
│   ├── triggerJob("ball-tracking")   → await signal
│   └── triggerJob("event-detection") → await signal → importEvents()
└── Phase 2: Dependent Jobs (parallel, after Phase 1 succeeds)
    ├── triggerJob("match-details")  → await signal → importMatchDetails()
    ├── triggerJob("shot-details")   → await signal → importShotDetails()
    └── triggerJob("match-report")   → await signal → importMatchReport()
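A minimal TypeScript sketch of this workflow shape, assuming the activity and signal names from this plan; retries, timeouts, and failure handling are deliberately elided:

import { proxyActivities, defineSignal, setHandler, condition } from '@temporalio/workflow';
import type * as activities from '../activities';

// Assumed signal payload shape, mirroring the EMA callback fields below.
interface EmaCallback { jobType: string; jobId: string; status: string; }

export const emaJobCallback = defineSignal<[EmaCallback]>('emaJobCallback');

const {
  prepareJobs, triggerJob, importPlayerTracking, importEvents,
  importMatchDetails, importShotDetails, importMatchReport,
} = proxyActivities<typeof activities>({ startToCloseTimeout: '10 minutes' });

export async function recordingAnalyticsWorkflow(recordingId: string): Promise<void> {
  // Latest EMA-reported status per job type, updated by incoming signals.
  const status = new Map<string, string>();
  setHandler(emaJobCallback, (cb) => {
    status.set(cb.jobType, cb.status);
  });

  await prepareJobs(recordingId);

  // Phase 1: independent jobs in parallel, then import once EMA reports success.
  const independent = ['player-tracking', 'ball-tracking', 'event-detection'];
  await Promise.all(independent.map((t) => triggerJob(recordingId, t)));
  await condition(() => independent.every((t) => status.get(t) === 'succeeded'));
  await Promise.all([importPlayerTracking(recordingId), importEvents(recordingId)]);

  // Phase 2: dependent jobs, gated on all of Phase 1 succeeding.
  const dependent = ['match-details', 'shot-details', 'match-report'];
  await Promise.all(dependent.map((t) => triggerJob(recordingId, t)));
  await condition(() => dependent.every((t) => status.get(t) === 'succeeded'));
  await Promise.all([
    importMatchDetails(recordingId),
    importShotDetails(recordingId),
    importMatchReport(recordingId),
  ]);
}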
EMA Callback → SQS → Temporal Signal
Keep SQS buffer for durability during transition:
EMA HTTP callback → API Gateway → JobCallbackQueue (SQS)
                                        ↓
                        jobCallback Lambda (modified)
                                        ↓
                        Temporal signal to workflow
// Modified SQSJobCallbackConsumer.processSQSEvent
// (imports at module top; the emaCallbackType import path is assumed)
import type { SQSEvent } from 'aws-lambda';
import { getTemporalClient } from '../../temporal/client';
import type { emaCallbackType } from '../../integrations/ema/types';

async processSQSEvent(event: SQSEvent): Promise<void> {
  // Handle every record in the batch, not just the first.
  for (const record of event.Records) {
    const recordingId = record.messageAttributes.recordingId?.stringValue;
    if (!recordingId) throw new Error('SQS message missing recordingId attribute');
    const emaData = JSON.parse(record.body) as emaCallbackType;

    // Send a signal to the Temporal workflow instead of writing to DynamoDB.
    const client = getTemporalClient();
    const handle = client.workflow.getHandle(`ras-${recordingId}`);
    await handle.signal('emaJobCallback', {
      jobType: emaData.job_kind,
      jobId: emaData.job_id,
      status: emaData.job_state,
      exitCode: emaData.job_exit_code,
      reason: emaData.job_reason,
    });
  }
}
State Persistence (for GET endpoints)
Activities write to DynamoDB so existing GET endpoints work unchanged:
- analyticsStateChanges table (event sourcing)
- analyticsStates table (projected view via existing stream handler)
- recordingMetadata table
---
Activities to Create
Activity: prepareJobs
Wraps: PrepareJobsService.prepareJobs()
Purpose: Validation, metadata fetch
────────────────────────────────────────
Activity: triggerJob
Wraps: TriggerJobService.triggerJob()
Purpose: Call EMA to start job
────────────────────────────────────────
Activity: importEvents
Wraps: EventsImporter
Purpose: Import event detection results
────────────────────────────────────────
Activity: importPlayerTracking
Wraps: PlayerTrackingImporter + MatchMomentumImporter
Purpose: Import tracking data
────────────────────────────────────────
Activity: importMatchDetails
Wraps: MatchDetailsImporter
Purpose: Import match details
────────────────────────────────────────
Activity: importShotDetails
Wraps: ShotDetailsImporter
Purpose: Import shot details
────────────────────────────────────────
Activity: importMatchReport
Wraps: MatchReportImporter
Purpose: Import match report
────────────────────────────────────────
Activity: persistJobState
Wraps: AnalyticsChangeRepository.put()
Purpose: Write state to DynamoDB
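For example, one activity is a thin wrapper over the existing service. A sketch only; the constructor wiring, import path, and exact service signature are assumptions:

// src/temporal/activities/prepareJobs.ts (sketch)
import type { PrepareJobsService } from '../../core/job/prepareJobsService'; // assumed path

export function makePrepareJobsActivity(service: PrepareJobsService) {
  return async function prepareJobs(recordingId: string, jobTypes: string[]): Promise<void> {
    // Delegate to the existing service so validation and DynamoDB writes
    // behave exactly as in the legacy path.
    await service.prepareJobs(recordingId, jobTypes);
  };
}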
---
Files to Create
src/temporal/
├── worker.ts # Temporal worker entry point
├── client.ts # Temporal client factory
├── workflows/
│ └── recordingAnalytics.ts # Main workflow
├── activities/
│ ├── index.ts # Activity exports
│ ├── prepareJobs.ts # Wrap PrepareJobsService
│ ├── triggerJob.ts # Wrap TriggerJobService
│ ├── importers.ts # Wrap all data importers
│ └── persistState.ts # DynamoDB state writes
└── signals/
└── emaCallback.ts # Signal type definitions
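A minimal sketch of src/temporal/worker.ts, assuming the layout above; Temporal Cloud connection details (namespace, mTLS certs) are elided:

import { Worker, NativeConnection } from '@temporalio/worker';
import * as activities from './activities';

async function run(): Promise<void> {
  const connection = await NativeConnection.connect({
    address: process.env.TEMPORAL_ADDRESS!, // e.g. <namespace>.tmprl.cloud:7233
  });
  const worker = await Worker.create({
    connection,
    namespace: process.env.TEMPORAL_NAMESPACE!,
    taskQueue: 'ras-orchestration', // hypothetical task queue name
    workflowsPath: require.resolve('./workflows/recordingAnalytics'),
    activities,
  });
  await worker.run();
}

run().catch((err) => {
  console.error(err);
  process.exit(1);
});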
Files to Modify
File: src/adapters/driving/AWSLambda/controllers/job/createJobs.ts
Change: Add Temporal workflow start
────────────────────────────────────────
File: src/apps/jobCallback/SQSJobCallbackConsumer.ts
Change: Replace SQS processing with Temporal signal
────────────────────────────────────────
File: stack/recording-analytics-stack.ts
Change: Add Temporal worker infrastructure
────────────────────────────────────────
File: package.json
Change: Add @temporalio/client, @temporalio/worker,
@temporalio/workflow
Files to Remove (after migration)
- src/apps/jobPrepper/*
- src/apps/jobScheduler/*
- src/apps/jobTrigger/*
- src/apps/jobImporter/*
- src/core/job/jobScheduler.ts
- src/core/job/initJobService.ts
- src/core/publishers/prepJobQueue.ts
- src/core/publishers/jobTriggerQueue.ts
- src/core/publishers/jobImporterQueue.ts
- Queue definitions in stack/recordingAnalyticsQueue.ts
---
Migration Phases
Phase 1: Foundation
- Set up Temporal Cloud namespace
- Create src/temporal/ directory structure
- Implement activities wrapping existing services
- Implement workflow with dependency logic
- Unit test workflows with mocked activities
Phase 2: Integration (Staging)
- Deploy Temporal worker to staging
- Modify callback handler to send Temporal signals (keep SQS buffer)
- Add ORCHESTRATION_MODE env var (legacy | temporal); see the sketch after this list
- Disable legacy flow in staging, enable Temporal
- End-to-end testing: trigger jobs, verify callbacks, check GET
endpoints
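A sketch of the ORCHESTRATION_MODE switch in the createJobs controller; startJobsViaSqs stands in for the existing InitJobService → PrepJobQueue path, and the task queue name is hypothetical:

import { Connection, Client } from '@temporalio/client';

// Stand-in for the legacy path (not the real function name).
declare function startJobsViaSqs(recordingId: string, jobTypes: string[]): Promise<void>;

export async function startOrchestration(recordingId: string, jobTypes: string[]): Promise<void> {
  if (process.env.ORCHESTRATION_MODE === 'temporal') {
    const connection = await Connection.connect(); // Temporal Cloud options elided
    const client = new Client({ connection });
    await client.workflow.start('recordingAnalyticsWorkflow', {
      taskQueue: 'ras-orchestration',
      workflowId: `ras-${recordingId}`, // matches the handle used by the callback Lambda
      args: [recordingId, jobTypes],
    });
  } else {
    await startJobsViaSqs(recordingId, jobTypes); // legacy flow unchanged
  }
}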
Phase 3: Production Cutover
- Deploy Temporal worker to production
- Switch ORCHESTRATION_MODE to temporal
- Monitor for issues
- Keep legacy code available for rollback (2 weeks)
Phase 4: Cleanup
- Remove legacy Lambdas (jobPrepper, jobScheduler, jobTrigger,
jobImporter)
- Remove SQS queues (PrepJobQueue, JobTriggerQueue, JobImporterQueue)
- Keep JobCallbackQueue (still used for SQS → Signal flow)
- Update CDK stack
- Update documentation
---
Verification
1. Unit tests: Mock activities, test workflow logic (dependency
ordering, signal handling, timeout handling)
2. Integration tests (staging):
- Trigger jobs via POST /jobs
- Verify EMA receives requests and sends callbacks
- Verify callbacks signal the correct workflow
- Verify DynamoDB state is updated correctly
3. GET endpoint verification: Ensure all GET endpoints return same data
after migration
4. Rollback test: Verify switching ORCHESTRATION_MODE back to legacy
works
---
Decisions
- Temporal hosting: Temporal Cloud (managed)
- Worker deployment: TBD with infrastructure team
- Shadow mode: Skip - go directly from staging validation to production
cutover
- Callback approach: Keep SQS buffer initially (SQS → Lambda → Temporal
signal)
⏺ The plan is approved. Let me create tasks to track the implementation.
⏺ Read 1 file (ctrl+o to expand)
⎿  Interrupted · What should Claude do instead?