Ride Hailer

DAW Problem Statement – Multi-Region Ride-Hailing Platform

Context

You are tasked with designing a multi-tenant, multi-region ride-hailing system (think Uber/Ola).

The platform must handle driver–rider matching, dynamic surge pricing, trip lifecycle management, and payments at scale with strict latency requirements.

Functional Requirements:

  • Real-time driver location ingestion: each online driver sends 1–2 updates per second
  • Ride request flow: pickup, destination, tier, payment method
  • Dispatch/Matching: assign drivers within <1s p95; reassign on decline/timeout
  • Dynamic surge pricing: per geo-cell, based on supply–demand
  • Trip lifecycle: start, pause, end, fare calculation, receipts
  • Payments orchestration: integrate with external PSPs, retries, reconciliation
  • Notifications: push/SMS for key ride states
  • Admin/ops tooling: feature flags, kill-switches, observability

Non-Functional Requirements:

  • Latency SLOs:
    • Dispatch decision: p95 ≤ 1s
    • End-to-end request→acceptance: p95 ≤ 3s
    • Availability: 99.95% for dispatch APIs
  • Scale assumptions:
    • 300k concurrent drivers globally
    • 60k ride requests/minute peak
    • 500k location updates/second globally
  • Multi-region: region-local writes, failover handling, no cross-region sync on hot path
  • Compliance: PCI, PII encryption, GDPR/DPDP

Constraints:

  • Use Kafka/Pulsar for events, Redis for hot KV, Postgres/CockroachDB for transactions
  • Clients are mobile with flaky networks; all APIs must be idempotent
  • Payments through external PSPs; their latency is outside your control

Deliverables

  • HLD: components, data flow, scaling, storage, trade-offs
  • LLD: deep dive into either Dispatch/Matching, Surge Pricing, or Trip Service
  • APIs & Events: request/response schemas and event topics
  • Data Model: ERD for the chosen component
  • Resilience plan: retries, backpressure, circuit breakers, failure modes

Multi-Region Ride-Hailing Platform

Designing a multi-tenant, multi-region ride-hailing system.

The platform must handle driver–rider matching, dynamic surge pricing, trip lifecycle management, and payments at scale with strict latency requirements.

Actors

  • Customer
  • Driver

Workflows

Customers

1. Customer Opens the App

  • Trigger: Customer opens the app
  • Actions:
    • Check and wait for location permissions
    • Check if the service is available in the region
    • Check if a payment is pending
    • Get a list of nearby locations, up to 5 km from the user
    • Show a list of available vehicles, and maybe nearby events or POIs
  • Pre-condition:
    • Authenticated
    • Location permissions enabled
    • Trips never cross regions at runtime
  • Success:
  • Failure:

2. Customer Intends to Book Ride

  • Trigger: Customer searches for and selects a destination
  • Action:
    • Check if the target location is within geo-boundaries
    • Check which vehicle types can serve the route
    • System fetches the current surge multiplier for the pickup geo-cell
    • An estimated price can be calculated (base + distance + surge); the base part can be cached as well (see the sketch at the end of this workflow)
  • Success:
    • Customer sees rides available in the area
    • Customer sees estimated prices and approximate durations (from Google or internal tracking)
    • Additional metadata, like whether tolls are included or paid separately, can be shown
  • Failure:
    • Customer sees an appropriate message: ride not available
    • Price estimation failed, but booking can still proceed
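A minimal sketch of that estimate in Python; the rate-card numbers are made up, and a real service would fetch them per city and vehicle type from the Pricing/Policy services.

def estimate_fare(distance_km, duration_min, surge):
    # Hypothetical rate card; real values come from the Policy service.
    BASE_FARE = 50.0   # flat pickup charge
    PER_KM = 12.0
    PER_MIN = 1.5
    subtotal = BASE_FARE + PER_KM * distance_km + PER_MIN * duration_min
    return round(subtotal * surge, 2)

# 8.2 km, 20 min, 1.2x surge -> shown to the rider as an estimate
print(estimate_fare(8.2, 20, 1.2))  # 214.08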

3. Customer Sends a Booking Request

  • Trigger: User selects a ride type and clicks "Book"
  • Action:
    • Create a Booking Request, status = pending, radius = 5km
    • Publish to Q
    • Set a TTL of 15s
  • Success:
    • Driver accepts the booking, status = 'assigned'
    • Driver can also cancel after accepting, status = 'cancelled' (in which case a new booking request is created)
  • Failure:
    • Retry 2x with increasing radius.
    • REQUEST_EXPIRED, REQUEST_ABORTED, REQUEST_RETRY

4. Real Time ETAs

  • Trigger: App polls the driver's location and booking status every 5s (via the booking handshake)
  • Pre-condition:
    • Driver–Booking–User relationship should exist and status = assigned
  • Action:
    • System tracks driver location in context of the Booking

5. Customer Payments

  • Trigger: Trip ends (triggered by driver), Booking status = completed
  • Action:
    • Booking status changes to completed
    • Calculate actual amount for ride
    • Create Payment Request for the Booking (Payment.status = pending)
    • Redirect to appropriate PSP page, wait for completion
    • Enqueue a reconciliation process as a delayed job (runs after 5 minutes)
    • On PSP Callback, Enqueue to update the Payment status and timestamps
  • Success:
    • Update Booking.status = paid
  • Failure:

Driver

1. Driver toggles duty status

  • Trigger: Driver taps Go Online / Go Offline
  • Pre-condition:
    • Driver has registered and KYC verified
  • Actions:
    • Update Driver.status = online/offline
    • If online: start sending location pings (batch them on the client side to reduce ping rate)
    • If offline: stop location stream, clear from matching pool

2. Driver receives & accepts ride

  • Trigger: New BookingRequest assigned to driver
  • Actions:
    • Push notification sent (booking details)
    • Driver has 10s to accept
      On accept:
      • Booking.status = accepted
      • Driver.status = on_trip
      • Start navigation
      • Clear other offers
    • On decline/timeout:
      • Add the driver id to the offered set (so a new driver is selected)

3. Driver views hotspot map

  • Trigger: Driver opens "Hotspots" tab
  • Data: Aggregated DemandHeatMap (geo-cells with high booking rate)
  • Source: Stream processor (Kafka -> Flink -> Redis Hash)
  • Update freq: Every 2 min
  • Consistency: Eventual

Toolings

  • Observability via metric collection, and alerting policies
  • Logs and Request Tracing to debug problems better
    • Some logging libraries come with PII handling mechanisms, masking emails/cards/phone etc.
  • Feature flags:
    • Toggle entire modules, like chat (UI + APIs)
    • Toggle business-level features, like turning surge pricing on/off
    • Toggle to A/B test existing flows
  • Localisation:
    • Because the platform is multi-region, English may not always be the preferred language
    • Currencies can be displayed properly
    • Cultural context can be helpful in designing surge pricing
  • Policy Engines:
    • Different regions can have different pricing policies: per-km rates, surge caps, minimum wage, etc.
    • Specific regions can benefit from static surge prices
    • Systems need periodic maintenance, like certificate renewal, domain renewal, key rotation, etc.
    • The logging library can apply PII policies per region
    • Data archival policies for different kinds of data
    • KYC archival policies
    • Data replication policies
    • Rate limiting

Security

  • Databases and access keys must be encrypted
  • Keys must be rotated
  • APIs should be protected via WAF
  • Rate limits should be present on most APIs
  • TLS certificates must be renewed before they expire; all communication over TLS/WSS

Failure Points

  • Driver device connection issues
  • Redis/Database/Store Crashes
  • PSP / External Providers outage
  • Service Failure Cascade
  • Internet outage
  • Flaky Networks (tuning tcp params, reduce data transfer)
  • Bad Deployments

Entities

  • Customer
  • Driver
  • Vehicle
  • Pin
  • Location
  • Offer
  • Fare
  • Booking
  • Pricing
  • Payment
  • History
  • DemandHeatMap
  • Session
  • UserProfile
  • Device
  • Wallet
  • Policy

Relationships

Trip

belongs_to :booking
belongs_to :rider (User)
belongs_to :driver (User)
belongs_to :fare
belongs_to :payment

has_many :state_transitions
has_many :locations
has_many :events

Booking

belongs_to :rider (User)

has_one :trip
has_many :offers

User (Customer/Driver)

has_many :trips_as_rider (Trip)
has_many :trips_as_driver (Trip)
has_many :bookings

Offer

belongs_to :booking
belongs_to :driver (User)

TripStateTransition

belongs_to :trip
belongs_to :actor (User)

TripLocation

belongs_to :trip

TripEvent

belongs_to :trip

Fare

has_one :trip
belongs_to :booking

Payment

has_one :trip
belongs_to :booking
belongs_to :rider (User)

Vehicle

belongs_to :driver (User)

has_many :trips

erDiagram
    Trip ||--|| Booking : "belongs_to"
    Trip ||--|| User : "belongs_to (rider)"
    Trip ||--|| User : "belongs_to (driver)"
    Trip ||--o| Fare : "belongs_to"
    Trip ||--o| Payment : "belongs_to"
    Trip ||--o| Vehicle : "belongs_to"
    Trip ||--o{ TripStateTransition : "has_many"
    Trip ||--o{ TripLocation : "has_many"
    Trip ||--o{ TripEvent : "has_many"
    
    Booking ||--|| User : "belongs_to (rider)"
    Booking ||--o| Trip : "has_one"
    Booking ||--o{ Offer : "has_many"
    
    Offer ||--|| Booking : "belongs_to"
    Offer ||--|| User : "belongs_to (driver)"
    
    User ||--o{ Trip : "has_many (as rider)"
    User ||--o{ Trip : "has_many (as driver)"
    User ||--o{ Booking : "has_many"
    User ||--o{ Offer : "has_many"
    User ||--o{ Vehicle : "has_many"
    
    TripStateTransition ||--|| Trip : "belongs_to"
    TripStateTransition ||--o| User : "belongs_to (actor)"
    
    TripLocation ||--|| Trip : "belongs_to"
    
    TripEvent ||--|| Trip : "belongs_to"
    
    Fare ||--|| Booking : "belongs_to"
    Fare ||--o| Trip : "has_one"
    
    Payment ||--|| Booking : "belongs_to"
    Payment ||--|| User : "belongs_to (rider)"
    Payment ||--o| Trip : "has_one"
    
    Vehicle ||--|| User : "belongs_to (driver)"
    Vehicle ||--o{ Trip : "has_many"

    Trip {
        uuid trip_id PK
        varchar booking_id UK "FK to Booking"
        varchar rider_id "FK to User"
        varchar driver_id "FK to User"
        varchar vehicle_id "FK to Vehicle"
        decimal pickup_lat
        decimal pickup_lng
        text pickup_address
        decimal dropoff_lat
        decimal dropoff_lng
        text dropoff_address
        varchar vehicle_type
        varchar city_id
        varchar status "enum"
        timestamptz assigned_at
        timestamptz started_at
        timestamptz paused_at
        timestamptz resumed_at
        timestamptz ended_at
        decimal distance_km
        int duration_sec
        varchar fare_id "FK to Fare"
        varchar payment_id "FK to Payment"
        varchar cancelled_by "enum"
        text cancellation_reason
        decimal cancellation_fee
        varchar otp
        timestamptz created_at
        timestamptz updated_at
    }

    Booking {
        varchar booking_id PK
        varchar rider_id "FK to User"
        decimal pickup_lat
        decimal pickup_lng
        text pickup_address
        decimal destination_lat
        decimal destination_lng
        text destination_address
        varchar vehicle_type "enum"
        varchar city_id
        varchar status "enum"
        decimal estimated_fare
        decimal surge_multiplier
        varchar payment_method
        timestamptz created_at
        timestamptz updated_at
    }

    User {
        varchar user_id PK
        varchar name
        varchar email
        varchar phone
        varchar type "RIDER, DRIVER"
        decimal rating
        int total_trips
        varchar city_id
        varchar status "enum"
        timestamptz created_at
    }

    Offer {
        varchar offer_id PK
        varchar booking_id "FK to Booking"
        varchar driver_id "FK to User"
        varchar status "enum"
        int distance_m
        int eta_sec
        int rank
        decimal score
        timestamptz created_at
        timestamptz expires_at
    }

    TripStateTransition {
        bigint id PK
        uuid trip_id "FK to Trip"
        varchar from_status
        varchar to_status
        varchar changed_by "enum"
        varchar actor_id "FK to User"
        text reason
        jsonb metadata
        timestamptz created_at
    }

    TripLocation {
        bigint id PK
        uuid trip_id "FK to Trip"
        decimal lat
        decimal lng
        int accuracy_m
        decimal speed_kmh
        int bearing
        timestamptz recorded_at
        timestamptz received_at
    }

    TripEvent {
        bigint id PK
        uuid trip_id "FK to Trip"
        varchar booking_id
        varchar event_type
        jsonb event_data
        int sequence_number
        timestamptz created_at
    }

    Fare {
        varchar fare_id PK
        varchar booking_id "FK to Booking"
        decimal base_fare
        decimal distance_fare
        decimal time_fare
        decimal surge_multiplier
        decimal subtotal
        decimal tax
        decimal total
        varchar currency
        timestamptz created_at
    }

    Payment {
        varchar payment_id PK
        varchar booking_id "FK to Booking"
        varchar rider_id "FK to User"
        decimal amount
        varchar currency
        varchar status "enum"
        varchar payment_method
        varchar psp_transaction_id
        timestamptz created_at
        timestamptz completed_at
    }

    Vehicle {
        varchar vehicle_id PK
        varchar driver_id "FK to User"
        varchar vehicle_type "enum"
        varchar make
        varchar model
        varchar plate_number
        varchar color
        int year
        varchar status "enum"
        timestamptz created_at
    }

P.S. This is a generated diagram, from the blobs of text above

Scaling, Consistency and Availability

  • Scale assumptions:
    • Each online driver sends 1–2 updates per second
    • 300k concurrent drivers globally
    • 60k ride requests/minute peak
    • 500k location updates/second globally
    • Dispatch decision: p95 ≤ 1s
    • End-to-end request->acceptance: p95 ≤ 3s

End-to-end request->acceptance: p95 ≤ 3s. (Hmm, not sure about this one: a human driver is involved, and incremental search and retry can run up to 1 minute. Average: ~20s.)

Ride Requests:

  • 60k rpm => 1k rps; at ~20s per booking session, ~20k concurrent users booking

Location Updates: assuming most of these are driver pings; user pings are not that useful right now.

  • 300K concurrent drivers * 1.5 UPS = 450K UPS
  • Ping data: (id, lat, long, speed, sig, acc, unix_ts) ≈ 100B (with TCP headers)
  • 500K UPS * 100B = 50 MB/s bandwidth (checked in the sketch below)
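A back-of-envelope check of the numbers above (Python); the 100-byte ping size is the estimate from the list, not a measurement.

# Capacity check for the scale assumptions above
DRIVERS = 300_000
UPS_PER_DRIVER = 1.5
PING_BYTES = 100                                # (id, lat, long, speed, sig, acc, unix_ts) + headers

updates_per_sec = DRIVERS * UPS_PER_DRIVER      # 450,000 UPS
bandwidth_mb_s = 500_000 * PING_BYTES / 1e6     # 50 MB/s at the 500K UPS ceiling

RIDE_REQUESTS_PER_MIN = 60_000
rps = RIDE_REQUESTS_PER_MIN / 60                # 1,000 rps
concurrent_bookings = rps * 20                  # ~20,000 at ~20s per booking session

print(updates_per_sec, bandwidth_mb_s, concurrent_bookings)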

Dispatch Decision:

  • Dispatch involves receiving the request
  • Writing a booking request to the DB
  • Enqueuing to the Q
  • Consuming from the Q
  • Driver matching
  • Offer creation
  • Push notification

Timings:

  • Kafka tail worst case 200ms p95 (from blogs) (i3en.large: 2 vCPU, 16GB RAM, 25Gbps network, acks=all, replication >= 3)
    • Redpanda can reduce this by a factor of 3.
  • Redis (single instance, 8GB RAM): ~10ms. Benchmark: 100,000 requests completed in 0.50 seconds, 50 parallel clients, 3-byte payload
  • A single instance supported 400k ops/s; plan 2-4 Redis instances per city, with replicas
  • Redis read and write: 10ms (combined)
  • API roundtrip: 100ms
  • Database write: 50ms-100ms; DynamoDB promises 20ms p95

200 + 50 + 100 + 100 = 450ms (can be lower because of multi-region)

Push notification delivery can affect the tail latency; I am assuming 1s as the worst case.

Consistency, Availability and Storage

  • Authentication and Sessions: HA and Strong Consistency
  • Booking: Strong Consistency
  • Locations: HA, EC
  • Payment: HA, SC
  • Surge Pricing: EC

Query Patterns:

  • CanBook(user)
  • RidePossible(srcLoc, dstLoc)
  • GetRides(srcLoc, dstLoc)
  • GetFareEstimates(srcLoc, dstLoc)
  • SurgeApplied(srcLoc, dstLoc)
  • InitBooking(user, srcLoc, dstLoc, vehicleTypes, radius, shownEstimate, timeWithZone)
  • GetMatchingDrivers(loc, radius)
  • AcceptBooking(booking, driver)
  • Decline/CancelBooking(booking, driver)
  • UpdateBookingStatus(bookingID, status)
  • CheckBookingStatus(bookingID)
  • InitiatePayment(bookingID)
  • UpdatePaymentStatus(bookingID, pricingID?, status)
  • GetBookings(driverID/userID, cursor)
  • GetActiveBookings(userID/driverID)
  • UpdateOfferStatus(driverID, offerID/bookingID)
  • BroadCastOffers(bookingID)
  • GetOffers(driverID, srcLoc)
  • UnsubscribeBroadCast(bookingID)
  • CanAcceptBookings(driverID)
  • UserProfile(userID/driverID)
  • UpdateLocation(driverID, bookingID, []coorD)

Services

  • Identity Service (Sessions + User). Initially this can be a shared service. (User = Customer + Driver)
    • Responsibilities:
      • Authn & Authz
      • Banned or blocked status
      • KYC status
      • Assigned PIN
      • User profile (gender, talkativity and such)
    • Traffic:
      • AuthN+Z: read heavy, C > A
      • User status: read heavy, A > C
    • Verdict:
      • Centralize and cache across AZs

Identity and UserProfile can be deployed separately, because CAP requirements are Different

  • Policy Service:

    • Responsibilities:
      • Surge Rules
      • City Rules
      • Cross city Rules
      • Cancellation penalties (fairness meter)
      • Matchmaking feature flags
      • PII policies, App update policies
      • Vehicle Eligibility
      • Scaling Policies
    • Traffic:
      • Mostly read heavy, C > A
      • Changes need near-strong consistency, depending on which database fields they touch
  • Pricing Service

    • Responsibilities:
      • Fare Estimation
      • Surge Pricing
      • Final Fare computation
    • Traffic Patterns:
      • Fare Estimation: A > C
      • Final Fare: C > A
  • Booking Service:

    • Responsibilities:
      • CreateBooking Record 🤯
      • Idempotency Handling
      • Booking State Machine
      • SMS based bookings
    • Traffic Patterns:
      • CreateBooking, Complete, Accept, Cancel, Decline: write heavy
      • GetBooking: low read volume
      • GetActiveBookings: should be the lowest; spikes might indicate production issues
      • BookingHistory: consistent, read skewed
      • Overall both C and A are needed, but C > A
      • Allow service degradation with proper cutoffs
  • Payments Service

    • Responsibilities:
      • Payment intents
      • PSP integration
      • Idempotency Handling
      • Payments History (internal)
    • Traffic Patterns:
      • Proportional to successful bookings (assuming ~90% of bookings succeed)
      • Fairly write heavy
      • Payment intent: C > A (strong)
      • Durable + replayable (+ idempotent)
      • Confirmation: eventually consistent (with reconciliation) + high correctness
  • Match Making Service

    • Responsibilities:
      • Nearby Driver Search
      • Ranking
      • Offer Broadcast
      • Offer listing
      • Maps (Mapping can be moved to a different service)
      • Timeouts and Retry
    • Traffic Patterns:
      • Search drivers, offer listing: read heavy
      • Broadcasting: write heavy, blazing fast
      • Ranking: compute/CPU heavy
      • Timeouts and retries: write skewed
      • A > C
      • Allow service degradation
  • Driver Supply Service:

    • Responsibilities:
      • Driver online/offline
      • Location ingestion
      • Supply snapshots
      • Hotspot generation
    • Traffic Patterns:
      • Location ingestion: I/O, write heavy
      • Hotspot generation: eventually consistent, compute heavy, read heavy
      • Supply snapshots: analytical
  • Trip Service (Co-ordinator):

    • Responsibilities:
      • Accept/Decline offers
      • Driver Customer binding
      • Cancel semantics (emit events to stop broadcasting)
      • Final trip state
      • Central co-ordinator to avoid race-conditions
      • HA (Critical)
  • Notification Service

    • Responsibilities:
      • SMS
      • Push Notifications
      • Emails for invoices (if needed)
      • Most of the system is event driven
      • Idempotency Handling
    • Traffic Patterns:
      • Endpoint is both read and write heavy, because of the Transactional Outbox
      • Should be HA, but shouldn't affect core booking and payment
      • At-most-once delivery guarantee
  • Reconciliation Jobs

    • Responsibilities:
      • It would be nice to keep the ActiveBookings and ActivePayments tables separate and small, to reduce data size. For databases, that also means smaller indexes. Recon jobs can clean up and move this data across partitions.
      • Cleanup notifications from Outbox
      • Rebuild Locks in Memory when system Crashes
      • Payment recon workers to validate transaction status and see missed or double charges
  • Experiment Service (control plane)

    • Responsibilities:
      • Flags
      • Experiments
      • Kill switches
      • Audits
    • Reads HA - Shouldn't block, Eventually C
    • Writes: Strong C
  • Localisation Engine

Flow Diagrams

Workflows

Ride Discovery:

sequenceDiagram
    participant Rider
    participant Gateway
    participant Identity
    participant Policy
    participant Pricing
    participant Supply
    participant Maps

    Rider->>Gateway: App Launch (lat, lng)
    Gateway->>Identity: Validate session
    Identity-->>Gateway: OK

    par Parallel Reads
        Gateway->>Policy: Serviceable? city rules
        Gateway->>Supply: Nearby vehicle supply snapshot
        Gateway->>Pricing: Fare estimate request
    end

    Pricing->>Maps: Distance + ETA (approx)
    Maps-->>Pricing: distance, duration

    Pricing->>Policy: Surge multiplier
    Policy-->>Pricing: surge

    Pricing-->>Gateway: Fare ranges per vehicle type
    Supply-->>Gateway: Vehicle availability
    Policy-->>Gateway: Service flags

    Gateway-->>Rider: Launch screen payload

MatchMaking:

sequenceDiagram
    Kafka->>Matchmaking: booking.created

    Matchmaking->>Redis: drivers:cell + neighbors
    Matchmaking->>Matchmaking: rank & filter

    loop Candidates
        Matchmaking->>Redis: SET lock:driver:{id} TTL
        Matchmaking->>Redis: SET offer:{offer_id} TTL
    end

    Matchmaking->>Kafka: offers.created

Ride Booking, Customer:

sequenceDiagram
    Rider->>Gateway: POST /bookings
    Gateway->>BookingService: create booking
    BookingService->>DB: insert booking
    BookingService->>Kafka: booking.created

    Kafka->>Matchmaking: booking.created
    Matchmaking->>Matchmaking: create matches

    Kafka->>NotificationService: offers.created
    NotificationService->>Workers: enqueue push
    Workers->>DriverApp: push offer
  

Ride Booking, Driver:

sequenceDiagram
    participant Driver
    participant Gateway
    participant TripService
    participant Redis
    participant DB
    participant Kafka
    participant Matchmaking
    participant NotifService
    participant OtherDrivers

    Note over Driver,OtherDrivers: Multiple drivers have pending offers

    Driver->>Gateway: POST /trips/accept<br/>{offer_id, idempotency_key}
    Gateway->>TripService: accept(offer_id, driver_id)
    
    TripService->>Redis: HGETALL offer:{offer_id}
    Redis-->>TripService: {booking_id, status:PENDING, expires_at}
    
    alt Offer Expired
        TripService-->>Gateway: 409 OFFER_EXPIRED
        Gateway-->>Driver: Offer expired
    else Offer Valid
        TripService->>Redis: EVAL Lua Script<br/>CHECK lock:driver:{id}<br/>CHECK offer status<br/>SET offer ACCEPTED<br/>SET lock:driver:{id}
        
        alt Race Condition: Driver Already Locked
            Redis-->>TripService: {false, DRIVER_BUSY}
            TripService-->>Gateway: 409 DRIVER_BUSY
            Gateway-->>Driver: Already on another trip
        else Race Condition: Offer Already Taken
            Redis-->>TripService: {false, OFFER_TAKEN}
            TripService-->>Gateway: 410 OFFER_TAKEN
            Gateway-->>Driver: Another driver accepted
        else Success
            Redis-->>TripService: {true, OK}
            
            TripService->>DB: INSERT INTO trips<br/>(booking_id, driver_id, status:ASSIGNED)<br/>ON CONFLICT DO NOTHING
            DB-->>TripService: trip_id (or existing if duplicate)
            
            TripService->>Kafka: publish trip.created<br/>{trip_id, booking_id, driver_id}
            TripService-->>Gateway: 201 Created<br/>{trip_id, rider_info, pickup}
            Gateway-->>Driver: Trip assigned!
            
            Note over Kafka,Matchmaking: Async cleanup begins
            
            Kafka->>Matchmaking: trip.created event
            Matchmaking->>Redis: SMEMBERS offers:booking:{booking_id}
            Redis-->>Matchmaking: [off_1, off_2, off_3, off_4, off_5]
            
            loop For each other offer
                Matchmaking->>Redis: HSET offer:{other_id} status CANCELLED
                Matchmaking->>Redis: HGETALL offer:{other_id}
                Redis-->>Matchmaking: {driver_id: drv_X}
                Matchmaking->>Redis: DEL lock:driver:{drv_X}
            end
            
            Matchmaking->>Redis: DEL offers:booking:{booking_id}
            Matchmaking->>Kafka: publish offers.cancelled<br/>{booking_id, cancelled_offer_ids[]}
            
            Kafka->>NotifService: offers.cancelled event
            NotifService->>OtherDrivers: Push: "Ride filled by another driver"
            
            Note over OtherDrivers: If they try to accept now
            OtherDrivers->>Gateway: POST /trips/accept
            Gateway->>TripService: accept(offer_id)
            TripService->>Redis: HGETALL offer:{offer_id}
            Redis-->>TripService: {status: CANCELLED}
            TripService-->>Gateway: 410 OFFER_TAKEN
            Gateway-->>OtherDrivers: Already assigned
        end
    end

LLD

Dispatch/MatchMaking

Responsibilities:

  • Consume booking.created
  • Fetch candidate drivers
  • Rank & filter
  • Create offers (Fan Out)
  • Expire stale offers- Retry / backoff / radius expansion
  • Ephemeral state only

Data Store

Matchmaking is mostly stateless but stores minimal metadata for debugging and analytics.

CREATE TABLE dispatch_attempts (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  booking_id VARCHAR(64) NOT NULL,
  attempt_number INT NOT NULL,
  search_radius_m INT NOT NULL,
  candidates_found INT NOT NULL,
  offers_created INT NOT NULL,
  status VARCHAR(32) NOT NULL,
    -- SEARCHING, OFFERS_SENT, EXPIRED, CANCELLED, ASSIGNED
  started_at TIMESTAMPTZ NOT NULL,
  completed_at TIMESTAMPTZ,
  error_code VARCHAR(64)
);

-- Postgres has no inline INDEX clause; create indexes separately
CREATE INDEX idx_dispatch_booking_id ON dispatch_attempts (booking_id);
CREATE INDEX idx_dispatch_started_at ON dispatch_attempts (started_at);
-- Partitioned Postgres. TimescaleDB would be better

CREATE TABLE hotspot_snapshots (
  id BIGSERIAL PRIMARY KEY,
  city_id VARCHAR(16) NOT NULL,
  h3_cell VARCHAR(32) NOT NULL,
  vehicle_type VARCHAR(16) NOT NULL,
  snapshot_time TIMESTAMPTZ NOT NULL,
  
  -- Metrics
  booking_count_5m INT NOT NULL,
  active_drivers INT NOT NULL,
  avg_surge DECIMAL(4,2),
  demand_score DECIMAL(4,2) -- 0.0 to 1.0
);

CREATE INDEX idx_hotspot_city_time ON hotspot_snapshots (city_id, snapshot_time DESC);
CREATE INDEX idx_hotspot_h3_time ON hotspot_snapshots (h3_cell, snapshot_time DESC);
  • Redis clusters are deployed per region
  • Hot keys are possible; city-wise partitioning reduces the chances

Events

Topic: booking.created Partition Key: booking_id

Schema:

{
  "event_id": "evt_123",
  "event_type": "booking.created",
  "timestamp": "2025-12-20T10:00:00Z",
  "booking_id": "bkg_123",
  "rider_id": "usr_456",
  "city_id": "blr",
  "pickup": {
    "lat": 12.934523,
    "lng": 77.610234,
    "h3_cell": "89283082837ffff",
    "address": "Koramangala, Bangalore"
  },
  "destination": {
    "lat": 12.956789,
    "lng": 77.634567,
    "address": "Indiranagar, Bangalore"
  },
  "vehicle_type": "AUTO",
  "estimated_fare": 150,
  "surge_multiplier": 1.2,
  "payment_method": "WALLET",
  "rider_preferences": {
    "female_driver_only": false,
    "pet_friendly": false,
    "ac_required": false
  }
}

Topic: trip.created Partition Key: booking_id


Topic: booking.cancelled Partition Key: booking_id Schema:

{
  "event_id": "evt_126",
  "event_type": "booking.cancelled",
  "timestamp": "2025-12-20T10:00:15Z",
  "booking_id": "bkg_123",
  "city_id": "blr",
  "cancelled_by": "RIDER",
  "reason": "CHANGED_MIND",
  "cancellation_fee": 0
}

Topic: offer.declined Partition Key: booking_id


Topic: offers.created Partition Key: booking_id Schema:

{
  "event_id": "evt_124",
  "event_type": "offers.created",
  "timestamp": "2025-12-20T10:00:01Z",
  "booking_id": "bkg_123",
  "city_id": "blr",
  "attempt": 1,
  "search_radius_m": 1000,
  "candidates_evaluated": 23,
  "offers": [
    {
      "offer_id": "off_789",
      "driver_id": "drv_001",
      "distance_m": 500,
      "eta_sec": 120,
      "rank": 1,
      "score": 0.92,
      "expires_at": "2025-12-20T10:00:16Z"
    },
    {
      "offer_id": "off_790",
      "driver_id": "drv_002",
      "distance_m": 800,
      "eta_sec": 180,
      "rank": 2,
      "score": 0.87,
      "expires_at": "2025-12-20T10:00:16Z"
    }
  ]
}

Topic: offers.cancelled Partition Key: booking_id Schema:

{
  "event_id": "evt_128",
  "event_type": "offers.cancelled",
  "timestamp": "2025-12-20T10:00:12Z",
  "booking_id": "bkg_123",
  "city_id": "blr",
  "reason": "TRIP_ASSIGNED | BOOKING_CANCELLED | EXPIRED",
  "assigned_driver_id": "drv_001", // if reason = TRIP_ASSIGNED
  "cancelled_offers": [
    {
      "offer_id": "off_790",
      "driver_id": "drv_002",
      "status_before": "PENDING"
    },
    {
      "offer_id": "off_791",
      "driver_id": "drv_003",
      "status_before": "PENDING"
    }
  ]
}

Topic: dispatch.retry_scheduled Partition Key: booking_id Schema:

{
  "event_id": "evt_130",
  "event_type": "dispatch.retry_scheduled",
  "timestamp": "2025-12-20T10:00:20Z",
  "booking_id": "bkg_123",
  "city_id": "blr",
  "attempt": 2,
  "new_radius_m": 3000,
  "retry_after_ms": 5000,
  "reason": "ALL_OFFERS_EXPIRED | ALL_OFFERS_DECLINED"
}

The partition key, booking_id, will be hashed to choose one of the partitions.

  • Using booking_id as the partition key gives ordered events per booking (a keyed-producer sketch follows this list)
  • Including city_id helps prevent hot partitions
  • For reconstruction, the Postgres outbox/snapshot table dispatch_attempts can be used for ordering
  • Given the traffic, we need to benchmark whether pgpool would handle it, or, treating Kafka as the durable store, make the database writes async
  • Another option is to use only the append-only log, or an LSM-tree-backed database like RocksDB, which handles heavy writes better at the cost of compaction
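A minimal keyed-producer sketch (Python, assuming the kafka-python client and a local broker); keying by booking_id is what keeps per-booking events ordered on one partition.

import json
from kafka import KafkaProducer  # kafka-python client

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode(),
    acks="all",  # matches the replication >= 3 assumption above
)

event = {"event_type": "booking.created", "booking_id": "bkg_123", "city_id": "blr"}
# All events keyed by bkg_123 hash to the same partition, preserving order.
producer.send("booking.created", key=event["booking_id"], value=event)
producer.flush()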

Caches

Geo-spatial index using H3 cells, updated by the Driver Supply Service

Key: drivers:h3:{city_id}:{h3_cell}

Type: ZSET
Members: driver_id
Score: last_seen_timestamp_ms
TTL: None (members auto-prune based on score in queries)

Example:
ZADD drivers:h3:blr:89283082837ffff 1703001234000 drv_001
ZADD drivers:h3:blr:89283082837ffff 1703001235000 drv_002

Query:
ZRANGEBYSCORE drivers:h3:blr:89283082837ffff 
  (now_ms - 30000) now_ms  # Only fresh entries
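A lookup sketch for this index (Python, assuming redis-py and the v4 h3 bindings); resolution 9 and the 30s freshness window are illustrative.

import time
import h3      # Uber H3 bindings (v4 API)
import redis

r = redis.Redis()

def fresh_drivers(city_id, lat, lng, max_age_ms=30_000):
    now_ms = int(time.time() * 1000)
    cell = h3.latlng_to_cell(lat, lng, 9)   # resolution-9 cell for the pickup
    drivers = set()
    for c in h3.grid_disk(cell, 1):         # the cell plus its immediate neighbors
        key = f"drivers:h3:{city_id}:{c}"
        # Score is last_seen_ms, so a range query drops stale members.
        drivers.update(r.zrangebyscore(key, now_ms - max_age_ms, now_ms))
    return drivers

print(fresh_drivers("blr", 12.934523, 77.610234))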

Driver Metadata

Key: driver:meta:{city_id}:{driver_id}

Type: HASH
Fields:
  - status: online | on_trip | offline
  - vehicle_type: AUTO | SEDAN | SUV | BIKE
  - lat: 12.934523
  - lng: 77.610234
  - rating: 4.7
  - acceptance_rate: 0.85
  - trip_count_today: 12
  - last_updated: 1703001234000
  - current_booking_id: bkg_123 (if on_trip)
TTL: 60 seconds

HSET driver:meta:blr:drv_001 status online vehicle_type AUTO lat 12.934523 lng 77.610234 rating 4.7 acceptance_rate 0.85

Offers

Key: offer:{city_id}:{offer_id}

Type: HASH
Fields:
  - offer_id: off_789
  - booking_id: bkg_123
  - driver_id: drv_001
  - created_at: 1703001234
  - expires_at: 1703001249
  - status: PENDING | ACCEPTED | DECLINED | EXPIRED | CANCELLED
  - attempt: 1
  - distance_m: 500
  - eta_sec: 120
  - rank: 2
TTL: 20 seconds (slightly > acceptance window)

---

Key: offers:booking:{city_id}:{booking_id}

Type: SET
Members: offer_id_1, offer_id_2, ...
TTL: 300 seconds (5 minutes)

Dispatcher State

Key: dispatch:state:{city_id}:{booking_id}

Type: HASH
Fields:
  - booking_id: bkg_123
  - attempt: 2
  - current_radius_m: 3000
  - max_radius_m: 5000
  - status: SEARCHING | OFFERS_SENT | EXPIRED | CANCELLED
  - started_at: 1703001234
  - last_retry_at: 1703001250
  - offers_pending: 5
  - offers_declined: 2
TTL: 300 seconds

HSET dispatch:state:blr:bkg_123 
  attempt 2 
  current_radius_m 3000 
  status OFFERS_SENT
EXPIRE dispatch:state:blr:bkg_123 300

This state can be reconstructed by replaying the Kafka events

Locking

Intermediate lock to prevent multiple ride assignments to a driver

Key: lock:driver:{city_id}:{driver_id}

Type: STRING
Value: booking_id (which booking locked this driver)
TTL: 15 seconds (offer acceptance window)

SET lock:driver:blr:drv_001 bkg_123 EX 15

Lock to prevent multiple Matchmaking consumers from processing the same booking. Similar to caching for the Transactional Outbox.

SET lock:dispatch:{booking_id} {worker_id} NX EX 60

If lock acquired:
  - Process dispatch
  - Release lock after completion
Else:
  - Skip (another worker handling it)
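A minimal sketch of this lock (Python, redis-py); the compare-and-delete release is the standard guard so a worker that overruns the TTL cannot delete a lock now held by someone else.

import uuid
import redis

r = redis.Redis()

def try_dispatch(booking_id, process):
    worker_id = str(uuid.uuid4())
    key = f"lock:dispatch:{booking_id}"
    # NX: only set if absent; EX: auto-expire if the worker dies mid-dispatch
    if not r.set(key, worker_id, nx=True, ex=60):
        return False  # another worker is handling it
    try:
        process(booking_id)
    finally:
        # Release only our own lock (compare-and-delete, atomically via Lua)
        release = """
        if redis.call('GET', KEYS[1]) == ARGV[1] then
            return redis.call('DEL', KEYS[1])
        end
        return 0
        """
        r.eval(release, 1, key, worker_id)
    return True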

This can be reconstructed from the database and events

Hotspot Cache

Key: hotspot:{city_id}:{vehicle_type}

Type: SORTED SET
Members: h3_cell_id
Score: demand_score (0.0 to 1.0)
TTL: 120 seconds (refreshed by stream processor)

Example:
ZADD hotspot:blr:AUTO 0.85 89283082837ffff 0.72 89283082838ffff
EXPIRE hotspot:blr:AUTO 120

# Top 10 hotspots
ZREVRANGE hotspot:blr:AUTO 0 9 WITHSCORES

sequenceDiagram
    participant Kafka as Kafka<br/>(Event Log)
    participant Worker as Matchmaking<br/>Worker
    participant Postgres as Postgres<br/>(Snapshots)
    participant Redis as Redis<br/>(Ephemeral)

    Note over Kafka,Redis: Normal Flow - Events + Snapshots

    Kafka->>Worker: booking.created<br/>(bkg_123, 10:00:00)
    Worker->>Worker: state = {status: SEARCHING}
    Worker->>Postgres: INSERT dispatch_attempts<br/>{booking_id: bkg_123,<br/>status: SEARCHING,<br/>attempt: 1,<br/>started_at: 10:00:00}
    Worker->>Redis: SET dispatch:state:blr:bkg_123<br/>(TTL 300s)

    Note over Worker: Snapshot #1 saved to Postgres

    Kafka->>Worker: offers.created<br/>(5 drivers, 10:00:01)
    Worker->>Worker: state = {status: OFFERS_SENT,<br/>offers_pending: 5}
    Worker->>Postgres: UPDATE dispatch_attempts<br/>SET status=OFFERS_SENT,<br/>offers_pending=5,<br/>last_updated=10:00:01
    Worker->>Redis: HSET dispatch:state:blr:bkg_123<br/>status OFFERS_SENT

    Note over Worker: Snapshot #2 updated in Postgres

    Kafka->>Worker: offer.declined<br/>(drv_001, 10:00:08)
    Worker->>Worker: state.offers_pending--
    Worker->>Postgres: UPDATE dispatch_attempts<br/>SET offers_pending=4,<br/>offers_declined=1,<br/>last_updated=10:00:08
    Worker->>Redis: HSET offers_pending 4

    Note over Worker: Snapshot #3 updated in Postgres

    rect rgb(255, 200, 200)
        Note over Worker: WORKER CRASHES at 10:00:09
    end

    Note over Kafka,Redis: Worker Recovery - Using Snapshots

    Worker->>Worker: New worker starts (10:00:15)
    
    Worker->>Postgres: SELECT * FROM dispatch_attempts<br/>WHERE booking_id='bkg_123'
    Postgres-->>Worker: Latest snapshot:<br/>{status: OFFERS_SENT,<br/>offers_pending: 4,<br/>offers_declined: 1,<br/>last_updated: 10:00:08}
    
    Note over Worker: State restored instantly<br/>from snapshot!

    Worker->>Kafka: Fetch events AFTER 10:00:08<br/>(optional - for freshness)
    Kafka-->>Worker: [offer.declined at 10:00:09,<br/>trip.created at 10:00:11]
    
    Worker->>Worker: Apply recent events:<br/>state.offers_pending = 3<br/>state.status = ASSIGNED

    Worker->>Redis: Rebuild cache:<br/>SET dispatch:state:blr:bkg_123

    Note over Worker: Resume processing from<br/>current state

    rect rgb(200, 255, 200)
        Note over Worker: Recovery complete<br/>Total time: <1 second
    end

    Note over Kafka,Redis: Without Snapshots (Pure Event Sourcing)

    Worker->>Kafka: Must replay ALL events<br/>from topic beginning
    Kafka-->>Worker: Event 1: booking.created<br/>Event 2: offers.created<br/>Event 3-100: offer.declined...<br/>(replay 100s of events)
    
    Note over Worker: ❌ Slow reconstruction<br/>Total time: 10-30 seconds
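A compact sketch of the snapshot-plus-replay recovery above (Python, assuming kafka-python and psycopg2). The single dispatch.events topic, the snapshot columns, and the apply_event reducer are illustrative; the real flow spans several topics.

import json
from kafka import KafkaConsumer, TopicPartition
import psycopg2

conn = psycopg2.connect("dbname=dispatch")

def apply_event(status, event):
    # Trivial reducer: assume the event carries the authoritative status
    return event.get("status", status)

def recover(booking_id, partition=0):
    # 1. Restore from the latest Postgres snapshot (fast path)
    with conn.cursor() as cur:
        cur.execute(
            "SELECT status, started_at FROM dispatch_attempts "
            "WHERE booking_id = %s ORDER BY attempt_number DESC LIMIT 1",
            (booking_id,))
        status, started_at = cur.fetchone()
    snapshot_ms = int(started_at.timestamp() * 1000)

    # 2. Replay only events newer than the snapshot, not the whole topic
    consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
    tp = TopicPartition("dispatch.events", partition)
    consumer.assign([tp])
    offsets = consumer.offsets_for_times({tp: snapshot_ms})
    if offsets[tp]:
        consumer.seek(tp, offsets[tp].offset)
        for msgs in consumer.poll(timeout_ms=1000).values():
            for msg in msgs:
                event = json.loads(msg.value)
                if event.get("booking_id") == booking_id:
                    status = apply_event(status, event)
    return status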

Trip Service

Database Tables:

CREATE TABLE trips (
  trip_id UUID PRIMARY KEY,
  booking_id VARCHAR(64) NOT NULL UNIQUE,
  
  -- Participants
  rider_id VARCHAR(64) NOT NULL,
  driver_id VARCHAR(64) NOT NULL,
  
  -- Location
  pickup_lat DECIMAL(10, 8) NOT NULL,
  pickup_lng DECIMAL(11, 8) NOT NULL,
  dropoff_lat DECIMAL(10, 8),
  dropoff_lng DECIMAL(11, 8),
  pickup_address TEXT,
  dropoff_address TEXT,
  
  vehicle_type VARCHAR(16),
  city_id VARCHAR(16) NOT NULL,
  
  status VARCHAR(32) NOT NULL,
  -- DRIVER_ASSIGNED, IN_PROGRESS, PAUSED, COMPLETED, CANCELLED, PAID
  
  -- Timestamps
  assigned_at TIMESTAMPTZ NOT NULL,
  started_at TIMESTAMPTZ,
  paused_at TIMESTAMPTZ,
  resumed_at TIMESTAMPTZ,
  ended_at TIMESTAMPTZ,
  
  -- Trip metrics (filled on completion)
  distance_km DECIMAL(6, 2),
  duration_sec INT,
  
  -- References
  fare_id VARCHAR(64),  -- FK to pricing service
  payment_id VARCHAR(64),  -- FK to payment service
  
  -- Cancellation
  cancelled_by VARCHAR(16),  -- DRIVER, RIDER, SYSTEM
  cancellation_reason TEXT,
  cancellation_fee DECIMAL(10, 2),
  
  -- Audit
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Indexes (Postgres has no inline INDEX clause)
CREATE INDEX idx_trips_booking_id ON trips (booking_id);
CREATE INDEX idx_trips_rider_id ON trips (rider_id, created_at DESC);
CREATE INDEX idx_trips_driver_id ON trips (driver_id, created_at DESC);
CREATE INDEX idx_trips_status ON trips (status, created_at DESC);
CREATE INDEX idx_trips_city_created ON trips (city_id, created_at DESC);

-- Composite index for the active-trips query
CREATE INDEX idx_active_trips 
ON trips(status, rider_id) 
WHERE status IN ('DRIVER_ASSIGNED', 'IN_PROGRESS', 'PAUSED');

-- Partitioning by created_at (monthly) is an option for scalability; note the
-- parent would then need PARTITION BY RANGE (created_at), and the PK/UNIQUE
-- constraints would have to include the partition key:
-- CREATE TABLE trips_2025_12 PARTITION OF trips
-- FOR VALUES FROM ('2025-12-01') TO ('2026-01-01');
CREATE TABLE trip_state_transitions (
  id BIGSERIAL PRIMARY KEY,
  trip_id UUID NOT NULL REFERENCES trips(trip_id),
  
  from_status VARCHAR(32),
  to_status VARCHAR(32) NOT NULL,
  
  changed_by VARCHAR(16) NOT NULL,
  actor_id VARCHAR(64),
  
  reason TEXT,
  metadata JSONB,
  
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_transitions_trip_id ON trip_state_transitions (trip_id, created_at DESC);

-- Partition by created_at (monthly)
-- Retention: 90 days
  • trip.locations
  • trip.events (although present in Kafka, could be async write to database)

Trip Event Types:

  -- OFFER_ACCEPTED, TRIP_STARTED, LOCATION_UPDATE, 
  -- TRIP_PAUSED, TRIP_RESUMED, TRIP_ENDED, TRIP_CANCELLED

Caches:

Key: trip:active:{city_id}:{trip_id}

Type: HASH
Fields:
  - trip_id: trp_001
  - booking_id: bkg_123
  - rider_id: usr_456
  - driver_id: drv_001
  - status: IN_PROGRESS
  - started_at: 1703001234
  - pickup_lat: 12.934523
  - pickup_lng: 77.610234
  - dropoff_lat: 12.956789
  - dropoff_lng: 77.634567
TTL: 4 hours (max trip duration + buffer)

Inverse Indexes

Driver-Trip Binding

Key: trip:by-driver:{city_id}:{driver_id}
Type: STRING

Rider-Trip Binding

Key: trip:by-rider:{city_id}:{rider_id}
Type: STRING

Trip Location

Key: trip:location:{trip_id}

Type: GEO (a sorted set under the hood)
Members: timestamped location points
TTL: 1 hour after trip ends

Purpose: Fast location queries for ETA

Example:
GEOADD trip:location:trp_001 77.610234 12.934523 "loc_1703001234"

Trip Lock

Key: lock:trip:{trip_id}
Type: STRING

Value: {operation}:{worker_id}
TTL: 30 seconds

Purpose: Prevent concurrent state transitions
induced by driver, rider modifying the same trip record

Example:
SET lock:trip:trp_001 "start:worker-3" NX EX 30

States

SEARCHING -> ASSIGNED -> IN_PROGRESS -> COMPLETED
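That is the happy path; together with the status enums above (DRIVER_ASSIGNED, PAUSED, CANCELLED, PAID), it implies a guarded state machine. A minimal sketch in Python, with the transition table inferred from those statuses:

# Allowed transitions, inferred from the trip status enum above.
ALLOWED = {
    "SEARCHING":   {"ASSIGNED", "CANCELLED"},
    "ASSIGNED":    {"IN_PROGRESS", "CANCELLED"},
    "IN_PROGRESS": {"PAUSED", "COMPLETED", "CANCELLED"},
    "PAUSED":      {"IN_PROGRESS", "COMPLETED", "CANCELLED"},
    "COMPLETED":   {"PAID"},
}

def transition(current, target):
    # Reject illegal jumps (e.g. COMPLETED -> IN_PROGRESS) up front,
    # mirroring the 409/422 errors in the APIs below.
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target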

Events

  • trip.created

    • Consumed by:
      • Matchmaking Service (cancel other offers)
      • Notification Service (notify rider "Driver assigned")
      • Analytics Service
  • trip.started

    • Consumers:
      • Push Notification/SMS Service (notify rider "Trip started")
      • Analytics Service (start tracking metrics)
      • Driver Supply Service (mark driver as on_trip)
  • trip.location_updated

    • Consumers:
      • Rider App (via WebSocket server that subscribes to this, or http poll)
      • Analytics Service (tracking, heatmaps)
      • Trip Service itself (async write to trip_locations table)
  • trip.completed

    • Consumers:
      • Payment Service (initiate payment)
      • Notification Service (send receipt, rating prompt)
      • Driver Supply Service (mark driver available)
      • Analytics Service
  • trip.cancelled:

    • Consumers:
      • Payment Service (charge cancellation fee if applicable)
      • Driver Supply Service (mark driver available)
      • Notification Service (notify both parties)
      • Policy Service (update fairness metrics)

APIs

1. Accept Offer (POST /trips/accept)

POST /trips/accept
Headers:
  X-Driver-ID: drv_001
  X-Idempotency-Key: idmp_abc123
  Authorization: Bearer <driver_token>

Body:
{
  "offer_id": "off_789"
}

Response: 201 Created
{
  "trip_id": "trp_001",
  "booking_id": "bkg_123",
  "status": "DRIVER_ASSIGNED",
  "assigned_at": "2025-12-20T10:00:10Z",
  
  "rider": {
    "rider_id": "usr_456",
    "name": "John Doe",
    "phone": "+91XXXXXXXXXX",
    "rating": 4.8,
    "profile_pic_url": "https://..."
  },
  
  "pickup": {
    "lat": 12.934523,
    "lng": 77.610234,
    "address": "123 Main St, Koramangala"
  },
  
  "destination": {
    "lat": 12.956789,
    "lng": 77.634567,
    "address": "456 Park Ave, Indiranagar"
  },
  
  "estimated_fare": 150,
  "estimated_distance_km": 8.2,
  "estimated_duration_min": 20,
  
  "otp": "1234"
}

Errors:
409 Conflict - OFFER_EXPIRED
409 Conflict - OFFER_ALREADY_ACCEPTED
409 Conflict - DRIVER_BUSY
410 Gone - OFFER_TAKEN
500 Internal Server Error
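The X-Idempotency-Key header above implies the server stores and replays the response for retried requests. A minimal sketch, assuming a Redis-backed store; the idem:{key} layout is illustrative.

import json
import redis

r = redis.Redis()

def idempotent(key, ttl_sec, handler):
    cache_key = f"idem:{key}"
    cached = r.get(cache_key)
    if cached:
        return json.loads(cached)   # replay the original response
    response = handler()            # e.g. the accept-offer logic
    # NX: if a concurrent retry raced us, keep the response it stored.
    r.set(cache_key, json.dumps(response), nx=True, ex=ttl_sec)
    return response

A stricter variant reserves the key with SET NX before running the handler, so two concurrent retries cannot both execute it.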

2. Start Trip (POST /trips/{trip_id}/start)

POST /trips/trp_001/start
Headers:
  X-Driver-ID: drv_001
  Authorization: Bearer <driver_token>

Body:
{
  "otp": "1234",  // Rider's OTP
  "location": {
    "lat": 12.934523,
    "lng": 77.610234,
    "accuracy_m": 10
  }
}

Response: 200 OK
{
  "trip_id": "trp_001",
  "status": "IN_PROGRESS",
  "started_at": "2025-12-20T10:05:00Z",
  "otp_verified": true
}

Errors:
400 Bad Request - INVALID_OTP
403 Forbidden - UNAUTHORIZED_DRIVER
404 Not Found - TRIP_NOT_FOUND
409 Conflict - TRIP_ALREADY_STARTED
422 Unprocessable - TRIP_CANCELLED

3. Update Location (POST /trips/{trip_id}/location)

POST /trips/trp_001/location
Headers:
  X-Driver-ID: drv_001

Body:
{
  "locations": [
    {
      "lat": 12.935000,
      "lng": 77.611000,
      "accuracy_m": 10,
      "speed_kmh": 35.5,
      "bearing": 45,
      "recorded_at": "2025-12-20T10:05:10Z"
    },
    {
      "lat": 12.935500,
      "lng": 77.611500,
      "accuracy_m": 8,
      "speed_kmh": 40.0,
      "bearing": 50,
      "recorded_at": "2025-12-20T10:05:15Z"
    }
  ]
}

Response: 202 Accepted
{
  "processed": 2,
  "trip_status": "IN_PROGRESS"
}

Note: Batched updates to reduce API calls

4. End Trip (POST /trips/{trip_id}/end)

POST /trips/trp_001/end
Headers:
  X-Driver-ID: drv_001

Body:
{
  "dropoff": {
    "lat": 12.956789,
    "lng": 77.634567,
    "accuracy_m": 10
  },
  "odometer_km": 8.5,
  "actual_duration_sec": 1200
}

Response: 200 OK
{
  "trip_id": "trp_001",
  "status": "COMPLETED",
  "ended_at": "2025-12-20T10:25:00Z",
  
  "trip_summary": {
    "distance_km": 8.5,
    "duration_sec": 1200,
    "avg_speed_kmh": 42.5
  },
  
  "fare": {
    "base": 50,
    "distance": 85,
    "time": 20,
    "surge_multiplier": 1.2,
    "subtotal": 155,
    "tax": 31,
    "total": 186,
    "currency": "INR"
  },
  
  "payment_status": "PENDING"
}

Errors:
403 Forbidden - UNAUTHORIZED_DRIVER
404 Not Found - TRIP_NOT_FOUND
409 Conflict - TRIP_NOT_STARTED
409 Conflict - TRIP_ALREADY_ENDED

Cancel Trip (POST /trips/{trip_id}/cancel)

POST /trips/trp_001/cancel
Headers:
  X-User-ID: usr_456 OR X-Driver-ID: drv_001

Body:
{
  "cancelled_by": "RIDER",  // or DRIVER
  "reason": "CHANGED_MIND",
  "location": {
    "lat": 12.934523,
    "lng": 77.610234
  }
}

Response: 200 OK
{
  "trip_id": "trp_001",
  "status": "CANCELLED",
  "cancelled_at": "2025-12-20T10:03:00Z",
  "cancelled_by": "RIDER",
  
  "cancellation_fee": 20,
  "reason": "Cancellation within 2 minutes of assignment"
}

Errors:
403 Forbidden - UNAUTHORIZED
404 Not Found - TRIP_NOT_FOUND
409 Conflict - TRIP_ALREADY_COMPLETED
422 Unprocessable - CANNOT_CANCEL_IN_PROGRESS

Get Trip Details (GET /trips/{trip_id})

GET /trips/trp_001
Headers:
  X-User-ID: usr_456 OR X-Driver-ID: drv_001

Response: 200 OK
{
  "trip_id": "trp_001",
  "booking_id": "bkg_123",
  "status": "IN_PROGRESS",
  
  "rider": {
    "rider_id": "usr_456",
    "name": "John Doe",
    "rating": 4.8
  },
  
  "driver": {
    "driver_id": "drv_001",
    "name": "Jane Smith",
    "rating": 4.9,
    "vehicle": {
      "make": "Maruti",
      "model": "Swift",
      "plate": "KA01AB1234",
      "color": "White"
    }
  },
  
  "pickup": {...},
  "destination": {...},
  
  "started_at": "2025-12-20T10:05:00Z",
  "estimated_arrival": "2025-12-20T10:25:00Z",
  
  "current_location": {
    "lat": 12.945000,
    "lng": 77.620000,
    "updated_at": "2025-12-20T10:15:30Z"
  }
}

Get Active Trip (GET /trips/active)

GET /trips/active
Headers:
  X-User-ID: usr_456

Response: 200 OK
{
  "trip": {
    "trip_id": "trp_001",
    "status": "IN_PROGRESS",
    ...
  }
}

Response: 404 Not Found (if no active trip)
{
  "error": "NO_ACTIVE_TRIP"
}

Accept Offer & Create Trip

Driver clicks "Accept" in app
  |
POST /trips/accept {offer_id}
  |
Trip Service:
  1. Validate offer exists in Redis
  2. Check offer not expired
  3. Atomic Redis operation (Lua, sketched after this flow):
     - Check driver not locked
     - Check offer status = PENDING
     - Set offer = ACCEPTED
     - Lock driver
  4. Insert trip record in Postgres (idempotent)
  5. Write to Redis caches (active trip bindings)
  6. Publish trip.created event to Kafka
  7. Return 201 with trip details
  ↓
Matchmaking Service consumes trip.created:
  - Cancel other pending offers
  - Release other driver locks
  |
Notification Service:
  - Send push to rider "Driver assigned"
  - Send SMS with driver details
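Step 3's atomic check-and-set could look like the sketch below (Python with embedded Lua, via redis-py); the key names follow the cache section above, and the return codes map to the 409/410 errors.

import redis

r = redis.Redis()

ACCEPT_LUA = """
-- KEYS[1] = offer:{city}:{offer_id}, KEYS[2] = lock:driver:{city}:{driver_id}
-- ARGV[1] = booking_id, ARGV[2] = lock TTL (seconds)
if redis.call('EXISTS', KEYS[2]) == 1 then
    return 'DRIVER_BUSY'
end
if redis.call('HGET', KEYS[1], 'status') ~= 'PENDING' then
    return 'OFFER_TAKEN'
end
redis.call('HSET', KEYS[1], 'status', 'ACCEPTED')
redis.call('SET', KEYS[2], ARGV[1], 'EX', ARGV[2])
return 'OK'
"""

result = r.eval(ACCEPT_LUA, 2,
                "offer:blr:off_789", "lock:driver:blr:drv_001",
                "bkg_123", 15)
print(result)  # b'OK', b'DRIVER_BUSY', or b'OFFER_TAKEN'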

Start Trip

Driver arrives at pickup
  |
Rider shares OTP: "1234"
  |
Driver enters OTP in app
  |
POST /trips/{trip_id}/start {otp, location}
  |
Trip Service:
  1. Validate trip exists, status = DRIVER_ASSIGNED
  2. Verify OTP matches Redis cache
  3. Acquire trip lock in Redis
  4. Update Postgres:
     - trips.status = IN_PROGRESS
     - trips.started_at = NOW()
  5. Update Redis cache
  6. Write to trip_state_transitions (audit)
  7. Publish trip.started event
  8. Release lock
  |
Driver Supply Service:
  - Mark driver.status = on_trip
  |
Notification Service:
  - Notify rider "Trip started"
  |
Driver app starts sending location updates every 5s

Near Real-time Location Updates

Driver app (every 5 seconds):
  |
Batch 6 location points
  |
POST /trips/{trip_id}/location {locations[]}
  |
Trip Service:
  1. Validate trip IN_PROGRESS
  2. Write to Redis GEOHASH (fast cache)
  3. Publish trip.location_updated event (async)
  4. Return 202 Accepted
  |
Async Worker consumes event:
  - Batch insert to trip_locations table (worker sketched after this flow)
  - 1000 locations per batch
  |
SSE/WebSocket Server consumes event:
  - Push to rider's active WebSocket connection
  - Rider sees driver moving on map
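A minimal sketch of that async worker (Python, assuming kafka-python and psycopg2); the topic name and columns follow the sections above, and a real worker would also flush partial batches on a timer.

import json
from kafka import KafkaConsumer
import psycopg2
from psycopg2.extras import execute_values

consumer = KafkaConsumer(
    "trip.location_updated",
    bootstrap_servers="localhost:9092",
    value_deserializer=json.loads,
    enable_auto_commit=False,
)
conn = psycopg2.connect("dbname=trips")

batch = []
for msg in consumer:
    e = msg.value
    batch.append((e["trip_id"], e["lat"], e["lng"], e["recorded_at"]))
    if len(batch) >= 1000:                 # flush every 1000 locations
        with conn.cursor() as cur:
            execute_values(cur,
                "INSERT INTO trip_locations (trip_id, lat, lng, recorded_at) VALUES %s",
                batch)
        conn.commit()
        consumer.commit()                  # commit offsets only after the DB write
        batch.clear()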

End Trip & Payment

Driver arrives at destination
  |
Driver clicks "End Trip"
  |
POST /trips/{trip_id}/end {dropoff, odometer}
  |
Trip Service:
  1. Validate trip IN_PROGRESS
  2. Acquire trip lock
  3. Calculate metrics:
     - distance_km (from GPS breadcrumbs)
     - duration_sec (now - started_at)
  4. Call Pricing Service:
     GET /pricing/calculate-fare
     {distance_km, duration_sec, surge, vehicle_type}
  5. Update Postgres:
     - trips.status = COMPLETED
     - trips.ended_at = NOW()
     - trips.distance_km = 8.5
     - trips.duration_sec = 1200
     - trips.fare_id = fare_123
  6. Publish trip.completed event
  7. Return fare breakdown
  |
Payment Service consumes trip.completed:
  - Create payment intent
  - Redirect rider to payment
  |
Driver Supply Service:
  - Mark driver.status = online (available)
  |
Notification Service:
  - Send receipt to rider
  - Prompt for rating

Cancel Trip

Rider clicks "Cancel" (or Driver cancels)
  |
POST /trips/{trip_id}/cancel {cancelled_by, reason}
  |
Trip Service:
  1. Validate trip not COMPLETED
  2. Acquire trip lock
  3. Call Policy Service:
     GET /policies/cancellation-fee
     {status, cancelled_by, time_since_assigned}
     -> Returns: {fee: 20, reason: "..."}
  4. Update Postgres:
     - trips.status = CANCELLED
     - trips.cancelled_by = RIDER
     - trips.cancellation_fee = 20
  5. Publish trip.cancelled event
  6. Return cancellation details
  |
Payment Service:
  - Charge cancellation fee if applicable
  |
Driver Supply Service:
  - Mark driver available
  |
Matchmaking Service:
  - Release offer locks (if not started yet)
  |
Notification Service:
  - Notify both parties

Cache Strategy

Request: GET /trips/{trip_id}

Layer 1: Redis (Hot Cache)
  ├─ Check: trip:active:{city}:{trip_id}
  ├─ Hit: Return immediately (5ms)
  └─ Miss: Go to Layer 2

Layer 2: Postgres (Source of Truth)
  ├─ Query: SELECT * FROM trips WHERE trip_id = ?
  ├─ Hit: Return + Write to Redis (50ms)
  └─ Miss: 404 Not Found

Cache Invalidation:
  ├─ On state change: Update Redis + Postgres
  ├─ On trip end: TTL expires in 1 hour
  └─ On trip cancel: Delete from Redis

State Change (e.g., trip started):
  1. Acquire lock
  2. Write to Postgres (source of truth)
  3. Write to Redis (invalidate + update)
  4. Publish event to Kafka
  5. Release lock
  6. Return to client

If Redis write fails:
  - Log error
  - Continue (Postgres is source of truth)
  - Next read will cache
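A read-through helper matching the two layers above (Python, assuming redis-py and psycopg2); the key layout follows the cache section, and the 4-hour TTL mirrors the active-trip cache.

import redis
import psycopg2
from psycopg2.extras import RealDictCursor

r = redis.Redis()
conn = psycopg2.connect("dbname=trips")

def get_trip(city_id, trip_id):
    key = f"trip:active:{city_id}:{trip_id}"
    cached = r.hgetall(key)
    if cached:  # Layer 1 hit (~5ms)
        return {k.decode(): v.decode() for k, v in cached.items()}
    with conn.cursor(cursor_factory=RealDictCursor) as cur:  # Layer 2
        cur.execute("SELECT * FROM trips WHERE trip_id = %s", (trip_id,))
        row = cur.fetchone()
    if row is None:
        return None  # -> 404 Not Found
    # Populate the cache for the next read; failures here are non-fatal.
    r.hset(key, mapping={k: str(v) for k, v in row.items()})
    r.expire(key, 4 * 3600)
    return row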
        decimal surge_multiplier
        decimal subtotal
        decimal tax
        decimal total
        varchar currency
        timestamptz created_at
    }

    Payment {
        varchar payment_id PK
        varchar booking_id "FK to Booking"
        varchar rider_id "FK to User"
        decimal amount
        varchar currency
        varchar status "enum"
        varchar payment_method
        varchar psp_transaction_id
        timestamptz created_at
        timestamptz completed_at
    }

    Vehicle {
        varchar vehicle_id PK
        varchar driver_id "FK to User"
        varchar vehicle_type "enum"
        varchar make
        varchar model
        varchar plate_number
        varchar color
        int year
        varchar status "enum"
        timestamptz created_at
    }

P.S. This is a generated diagram, from the blobs of text above

Scaling, Consistency and Availability

  • Scale assumptions:
    • Each online driver sends 1–2 updates per second
    • 300k concurrent drivers globally
    • 60k ride requests/minute peak
    • 500k location updates/second globally
    • Dispatch decision: p95 ≤ 1s
    • End-to-end request->acceptance: p95 ≤ 3s

End-to-end request->acceptance: p95 ≤ 3s (hmm, not sure about this; it depends on the driver, a human is involved, with incremental search and retries up to 1 minute) Average: 20s

Ride Requests:

  • 60k rpm => 1k rps * 20s = 20k Concurrent Users Booking

Location Updates: Assuming most of this is driver pings, user pings are not that useful right now.

  • 300K Concurrent Drivers * 1.5 UPS = 450K UPS
  • Ping Data: (id, lat, long, speed, sig, acc, unix_ts) 100B (with tcp headers)
  • 500K UPS * 100B = 50 MB/s (bandwidth requirement)
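
As a quick sanity check, the same arithmetic as a runnable Python sketch (concurrency via Little's law: L = arrival rate × time in system; all inputs are the stated assumptions, not measurements):

# Back-of-envelope for the numbers above
ride_requests_per_min = 60_000
avg_accept_window_s = 20                           # assumed human accept loop
rps = ride_requests_per_min / 60                   # 1,000 rps
concurrent_bookings = rps * avg_accept_window_s    # ~20k concurrent bookings

drivers = 300_000
updates_per_driver_per_s = 1.5                     # 1-2 pings/sec
ping_bytes = 100                                   # id, lat, lng, speed, sig, acc, ts
ingest_ups = drivers * updates_per_driver_per_s    # 450k updates/sec
bandwidth_mb_per_s = 500_000 * ping_bytes / 1e6    # 50 MB/s at the 500k cap

print(concurrent_bookings, ingest_ups, bandwidth_mb_per_s)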

Dispatch Decision:

  • Dispatch involves sending the request
  • Writing a booking request to db
  • Enqueuing to Q
  • Consume from Q
  • Driver Matching
  • Offer Creation
  • Push Notification

Timings:

  • Kafka tail worst case 200ms p95 (from blogs) (i3en.large, 16GB RAM, 2 vCPU, 25Gbps, acks=all, replication >= 3)
    • Redpanda can reduce this by a factor of 3.
  • Redis (single instance, 8GB RAM): ~10ms. (Benchmark: 100000 requests completed in 0.50 seconds, 50 parallel clients, 3 bytes payload)
  • A single instance supported 400k ops/s; 2-4 Redis instances per city, with replicas.
  • Redis read and write: 10ms (combined)
  • API Roundtrip: 100ms
  • Database Write: 50ms-100ms; DynamoDB ensures 20ms p95.

200 + 50 + 100 + 100 = 450ms (can be lower because of multi-region)

Push requests can affect the tail latency; I am assuming 1s as the worst case.

Consistency, Availability and Storage

  • Authentication and Sessions: HA and Strong Consistency
  • Booking: Strong Consistency
  • Locations: HA, EC
  • Payment: HA, SC
  • Surge Pricing: EC

Query Patterns:

  • CanBook(user)
  • RidePossible(srcLoc, dstLoc)
  • GetRides(srcLoc, dstLoc)
  • GetFareEstimates(srcLoc, dstLoc)
  • SurgeApplied(srcLoc, dstLoc)
  • InitBooking(user, srcLoc, dstLoc, vehicleTypes, radius, shownEstimate, timeWithZone)
  • GetMatchingDrivers(loc, radius)
  • AcceptBooking(booking, driver)
  • Decline/CancelBooking(booking, driver)
  • UpdateBookingStatus(bookingID, status)
  • CheckBookingStatus(bookingID)
  • InitiatePayment(bookingID)
  • UpdatePaymentStatus(bookingID, pricingID?, status)
  • GetBookings(driverID/userID, cursor)
  • GetActiveBookings(userID/driverID)
  • UpdateOfferStatus(driverID, offerID/bookingID)
  • BroadCastOffers(bookingID)
  • GetOffers(driverID, srcLoc)
  • UnsubscribeBroadCast(bookingID)
  • CanAcceptBookings(driverID)
  • UserProfile(userID/driverID)
  • UpdateLocation(driverID, bookingID, []coord)

Services

  • Identity Service (Sessions + User) Initially can be a shared service. (User = Customer + Driver)
    • Responsibilities:
      • Authn & Authz
      • Banned or Blocked
      • KYC status
      • Assigned PIN
      • User Profile (Gender, Talkativity and stuff)
    • Traffic:
      • AuthN+Z: Read heavy, C > A
      • User Status: Read heavy, A > C
    • Verdict:
      • Centralize and cache across AZs

Identity and UserProfile can be deployed separately, because CAP requirements are Different

  • Policy Service:

    • Responsibilities:
      • Surge Rules
      • City Rules
      • Cross city Rules
      • Cancellation penalties (Fairness meter)
      • Matchmaking feature flags
      • PII policies, App update policies
      • Vehicle Eligibility
      • Scaling Policies
    • Traffic:
      • Mostly read heavy, C > A
      • Changes need near-strong consistency, depending on which database fields they touch
  • Pricing Service

    • Responsibilities:
      • Fare Estimation
      • Surge Pricing
      • Final Fare computation
    • Traffic Patterns:
      • Fare Estimation: A > C
      • Final Fare: C > A
  • Booking Service:

    • Responsibilities:
      • CreateBooking Record 🤯
      • Idempotency Handling
      • Booking State Machine
      • SMS based bookings
    • Traffic Patterns:
      • CreateBooking, Complete, Accept, Cancel, Decline WriteHeavy
      • GetBooking, Low Read
      • GetActiveBookings, Should be lowest, might indicate production issues
      • BookingHistory: Consistent, read skewed
      • Overall both C and A are needed, but C > A
      • Allow service degradation with proper cutoffs
  • Payments Service

    • Responsibilities:
      • Payment intents
      • PSP integration
      • Idempotency Handling
      • Payments History (internal)
    • Traffic Patterns:
      • Proportional to successful bookings (assuming 90% successful bookings)
      • Fairly Write heavy
      • Payment Intent, C > A (Strong)
      • Durable + Replayability (+Idempotent)
      • Confirmation: Eventual Consistent (with Reconciliation) + High Correctness
  • Match Making Service

    • Responsibilities:
      • Nearby Driver Search
      • Ranking
      • Offer Broadcast
      • Offer listing
      • Maps (Mapping can be moved to a different service)
      • Timeouts and Retry
    • Traffic Patterns:
      • Search drivers, Offer listing (read heavy)
      • Broadcasting (write heavy, blazing fast)
      • Ranking (compute cpu heavy)
      • Timeouts and Retries, write skewed
      • A > C
      • Allow service degradation
  • Driver Supply Service:

    • Responsibilities:
      • Driver online/offline
      • Location ingestion
      • Supply snapshots
      • Hotspot generation
    • Traffic Patterns:
      • Location ingestion (I/O) write heavy
      • Hotspot generation (eventual consistency, compute heavy, read heavy)
      • Supply snapshots (analytical)
  • Trip Service (Co-ordinator):

    • Responsibilities:
      • Accept/Decline offers
      • Driver Customer binding
      • Cancel semantics (emit events to stop broadcasting)
      • Final trip state
      • Central co-ordinator to avoid race-conditions
      • HA (Critical)
  • Notification Service

    • Responsibilities:
      • SMS
      • Push Notifications
      • Emails for invoices (if needed)
      • Most of the system is event driven
      • Idempotency Handling
    • Traffic Patterns:
      • Endpoint is both read and write heavy, because of the Transactional Outbox
      • Should be HA, but shouldn't affect core booking and payment.
      • At-most-once delivery guarantee
  • Reconciliation Jobs

    • Responsibilities:
      • It would be nice to keep the ActiveBookings, ActivePayments tables separate and small, to reduce the data size. For databases, it would also mean reduce the index size as well. Recon jobs can cleanup/move these data across partitions.
      • Cleanup notifications from Outbox
      • Rebuild Locks in Memory when system Crashes
      • Payment recon workers to validate transaction status and see missed or double charges
  • Experiment Service (control plane)

    • Responsibilities:
      • Flags
      • Experiments
      • Kill switches
      • Audits
    • Reads HA - Shouldn't block, Eventually C
    • Writes: Strong C
  • Localisation Engine

Flow Diagrams

Workflows

Ride Discovery:

sequenceDiagram
    participant Rider
    participant Gateway
    participant Identity
    participant Policy
    participant Pricing
    participant Supply
    participant Maps

    Rider->>Gateway: App Launch (lat, lng)
    Gateway->>Identity: Validate session
    Identity-->>Gateway: OK

    par Parallel Reads
        Gateway->>Policy: Serviceable? city rules
        Gateway->>Supply: Nearby vehicle supply snapshot
        Gateway->>Pricing: Fare estimate request
    end

    Pricing->>Maps: Distance + ETA (approx)
    Maps-->>Pricing: distance, duration

    Pricing->>Policy: Surge multiplier
    Policy-->>Pricing: surge

    Pricing-->>Gateway: Fare ranges per vehicle type
    Supply-->>Gateway: Vehicle availability
    Policy-->>Gateway: Service flags

    Gateway-->>Rider: Launch screen payload

MatchMaking:

sequenceDiagram
    Kafka->>Matchmaking: booking.created

    Matchmaking->>Redis: drivers:cell + neighbors
    Matchmaking->>Matchmaking: rank & filter

    loop Candidates
        Matchmaking->>Redis: SET lock:driver:{id} TTL
        Matchmaking->>Redis: SET offer:{offer_id} TTL
    end

    Matchmaking->>Kafka: offers.created

Ride Booking, Customer:

sequenceDiagram
    Rider->>Gateway: POST /bookings
    Gateway->>BookingService: create booking
    BookingService->>DB: insert booking
    BookingService->>Kafka: booking.created

    Kafka->>Matchmaking: booking.created
    Matchmaking->>Matchmaking: create matches

    Kafka->>NotificationService: offers.created
    NotificationService->>Workers: enqueue push
    Workers->>DriverApp: push offer
  

Ride Booking, Driver:

sequenceDiagram
    participant Driver
    participant Gateway
    participant TripService
    participant Redis
    participant DB
    participant Kafka
    participant Matchmaking
    participant NotifService
    participant OtherDrivers

    Note over Driver,OtherDrivers: Multiple drivers have pending offers

    Driver->>Gateway: POST /trips/accept<br/>{offer_id, idempotency_key}
    Gateway->>TripService: accept(offer_id, driver_id)
    
    TripService->>Redis: HGETALL offer:{offer_id}
    Redis-->>TripService: {booking_id, status:PENDING, expires_at}
    
    alt Offer Expired
        TripService-->>Gateway: 409 OFFER_EXPIRED
        Gateway-->>Driver: Offer expired
    else Offer Valid
        TripService->>Redis: EVAL Lua Script<br/>CHECK lock:driver:{id}<br/>CHECK offer status<br/>SET offer ACCEPTED<br/>SET lock:driver:{id}
        
        alt Race Condition: Driver Already Locked
            Redis-->>TripService: {false, DRIVER_BUSY}
            TripService-->>Gateway: 409 DRIVER_BUSY
            Gateway-->>Driver: Already on another trip
        else Race Condition: Offer Already Taken
            Redis-->>TripService: {false, OFFER_TAKEN}
            TripService-->>Gateway: 410 OFFER_TAKEN
            Gateway-->>Driver: Another driver accepted
        else Success
            Redis-->>TripService: {true, OK}
            
            TripService->>DB: INSERT INTO trips<br/>(booking_id, driver_id, status:ASSIGNED)<br/>ON CONFLICT DO NOTHING
            DB-->>TripService: trip_id (or existing if duplicate)
            
            TripService->>Kafka: publish trip.created<br/>{trip_id, booking_id, driver_id}
            TripService-->>Gateway: 201 Created<br/>{trip_id, rider_info, pickup}
            Gateway-->>Driver: Trip assigned!
            
            Note over Kafka,Matchmaking: Async cleanup begins
            
            Kafka->>Matchmaking: trip.created event
            Matchmaking->>Redis: SMEMBERS offers:booking:{booking_id}
            Redis-->>Matchmaking: [off_1, off_2, off_3, off_4, off_5]
            
            loop For each other offer
                Matchmaking->>Redis: HSET offer:{other_id} status CANCELLED
                Matchmaking->>Redis: HGETALL offer:{other_id}
                Redis-->>Matchmaking: {driver_id: drv_X}
                Matchmaking->>Redis: DEL lock:driver:{drv_X}
            end
            
            Matchmaking->>Redis: DEL offers:booking:{booking_id}
            Matchmaking->>Kafka: publish offers.cancelled<br/>{booking_id, cancelled_offer_ids[]}
            
            Kafka->>NotifService: offers.cancelled event
            NotifService->>OtherDrivers: Push: "Ride filled by another driver"
            
            Note over OtherDrivers: If they try to accept now
            OtherDrivers->>Gateway: POST /trips/accept
            Gateway->>TripService: accept(offer_id)
            TripService->>Redis: HGETALL offer:{offer_id}
            Redis-->>TripService: {status: CANCELLED}
            TripService-->>Gateway: 410 OFFER_TAKEN
            Gateway-->>OtherDrivers: Already assigned
        end
    end

LLD

Dispatch/MatchMaking

Responsibilities:

  • Consume booking.created
  • Fetch candidate drivers
  • Rank & filter
  • Create offers (Fan Out)
  • Expire stale offers
  • Retry / backoff / radius expansion
  • Ephemeral state only

Data Store

Matchmaking is mostly stateless but stores minimal metadata for debugging and analytics.

CREATE TABLE dispatch_attempts (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  booking_id VARCHAR(64) NOT NULL,
  attempt_number INT NOT NULL,
  search_radius_m INT NOT NULL,
  candidates_found INT NOT NULL,
  offers_created INT NOT NULL,
  status VARCHAR(32) NOT NULL,
    -- SEARCHING, OFFERS_SENT, EXPIRED, CANCELLED, ASSIGNED
  started_at TIMESTAMPTZ NOT NULL,
  completed_at TIMESTAMPTZ,
  error_code VARCHAR(64)
);

-- Postgres has no inline INDEX clause; create indexes separately
CREATE INDEX idx_dispatch_booking_id ON dispatch_attempts (booking_id);
CREATE INDEX idx_dispatch_started_at ON dispatch_attempts (started_at);
-- Partitioned Postgres. TimescaleDB would be better

CREATE TABLE hotspot_snapshots (
  id BIGSERIAL PRIMARY KEY,
  city_id VARCHAR(16) NOT NULL,
  h3_cell VARCHAR(32) NOT NULL,
  vehicle_type VARCHAR(16) NOT NULL,
  snapshot_time TIMESTAMPTZ NOT NULL,

  -- Metrics
  booking_count_5m INT NOT NULL,
  active_drivers INT NOT NULL,
  avg_surge DECIMAL(4,2),
  demand_score DECIMAL(4,2) -- 0.0 to 1.0
);

CREATE INDEX idx_hotspot_city_time ON hotspot_snapshots (city_id, snapshot_time DESC);
CREATE INDEX idx_hotspot_h3_time ON hotspot_snapshots (h3_cell, snapshot_time DESC);

  • Redis clusters are multi-region
  • Hot keys are possible; city-wise partitioning reduces the chances

Events

Topic: booking.created Partition Key: booking_id

Schema:

{
  "event_id": "evt_123",
  "event_type": "booking.created",
  "timestamp": "2025-12-20T10:00:00Z",
  "booking_id": "bkg_123",
  "rider_id": "usr_456",
  "city_id": "blr",
  "pickup": {
    "lat": 12.934523,
    "lng": 77.610234,
    "h3_cell": "89283082837ffff",
    "address": "Koramangala, Bangalore"
  },
  "destination": {
    "lat": 12.956789,
    "lng": 77.634567,
    "address": "Indiranagar, Bangalore"
  },
  "vehicle_type": "AUTO",
  "estimated_fare": 150,
  "surge_multiplier": 1.2,
  "payment_method": "WALLET",
  "rider_preferences": {
    "female_driver_only": false,
    "pet_friendly": false,
    "ac_required": false
  }
}

Topic: trip.created Partition Key: booking_id


Topic: booking.cancelled Partition Key: booking_id Schema:

{
  "event_id": "evt_126",
  "event_type": "booking.cancelled",
  "timestamp": "2025-12-20T10:00:15Z",
  "booking_id": "bkg_123",
  "city_id": "blr",
  "cancelled_by": "RIDER",
  "reason": "CHANGED_MIND",
  "cancellation_fee": 0
}

Topic: offer.declined Partition Key: booking_id


Topic: offers.created Partition Key: booking_id Schema:

{
  "event_id": "evt_124",
  "event_type": "offers.created",
  "timestamp": "2025-12-20T10:00:01Z",
  "booking_id": "bkg_123",
  "city_id": "blr",
  "attempt": 1,
  "search_radius_m": 1000,
  "candidates_evaluated": 23,
  "offers": [
    {
      "offer_id": "off_789",
      "driver_id": "drv_001",
      "distance_m": 500,
      "eta_sec": 120,
      "rank": 1,
      "score": 0.92,
      "expires_at": "2025-12-20T10:00:16Z"
    },
    {
      "offer_id": "off_790",
      "driver_id": "drv_002",
      "distance_m": 800,
      "eta_sec": 180,
      "rank": 2,
      "score": 0.87,
      "expires_at": "2025-12-20T10:00:16Z"
    }
  ]
}

Topic: offers.cancelled Partition Key: booking_id Schema:

{
  "event_id": "evt_128",
  "event_type": "offers.cancelled",
  "timestamp": "2025-12-20T10:00:12Z",
  "booking_id": "bkg_123",
  "city_id": "blr",
  "reason": "TRIP_ASSIGNED | BOOKING_CANCELLED | EXPIRED",
  "assigned_driver_id": "drv_001", // if reason = TRIP_ASSIGNED
  "cancelled_offers": [
    {
      "offer_id": "off_790",
      "driver_id": "drv_002",
      "status_before": "PENDING"
    },
    {
      "offer_id": "off_791",
      "driver_id": "drv_003",
      "status_before": "PENDING"
    }
  ]
}

Topic: dispatch.retry_scheduled Partition Key: booking_id Schema:

{
  "event_id": "evt_130",
  "event_type": "dispatch.retry_scheduled",
  "timestamp": "2025-12-20T10:00:20Z",
  "booking_id": "bkg_123",
  "city_id": "blr",
  "attempt": 2,
  "new_radius_m": 3000,
  "retry_after_ms": 5000,
  "reason": "ALL_OFFERS_EXPIRED | ALL_OFFERS_DECLINED"
}

The partition key (booking_id) is hashed to choose one of the partitions.

  • Using booking_id as the partition key gives ordered events per booking
  • Having a city_id in the key helps prevent hot partitions
  • For reconstruction, the Postgres snapshot table dispatch_attempts can be used for ordering
  • Given the traffic, we need to benchmark whether pgpool would handle it, or, treating Kafka as the durable store, make database writes async.
  • Another option is to use only the append-only log, or an LSM-tree-backed database like RocksDB, which handles writes better at the cost of compaction.
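
To make the hashing concrete: Kafka's default partitioner hashes the key bytes (murmur2) modulo the partition count. A minimal Python illustration of the property we rely on (md5 here is a stand-in, not Kafka's actual hash):

import hashlib

def partition_for(booking_id: str, num_partitions: int = 12) -> int:
    # Any stable hash works for illustration: every event with the same
    # booking_id maps to the same partition, so it stays ordered.
    digest = hashlib.md5(booking_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

assert partition_for("bkg_123") == partition_for("bkg_123")  # stable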

Caches

Geo-spatial index using H3 cells, updated by the Driver Supply Service

Key: drivers:h3:{city_id}:{h3_cell}

Type: ZSET
Members: driver_id
Score: last_seen_timestamp_ms
TTL: None (members auto-prune based on score in queries)

Example:
ZADD drivers:h3:blr:89283082837ffff 1703001234000 drv_001
ZADD drivers:h3:blr:89283082837ffff 1703001235000 drv_002

Query:
ZRANGEBYSCORE drivers:h3:blr:89283082837ffff 
  (now_ms - 30000) now_ms  # Only fresh entries

Driver Metadata

Key: driver:meta:{city_id}:{driver_id}

Type: HASH
Fields:
  - status: online | on_trip | offline
  - vehicle_type: AUTO | SEDAN | SUV | BIKE
  - lat: 12.934523
  - lng: 77.610234
  - rating: 4.7
  - acceptance_rate: 0.85
  - trip_count_today: 12
  - last_updated: 1703001234000
  - current_booking_id: bkg_123 (if on_trip)
TTL: 60 seconds

HSET driver:meta:blr:drv_001 status online vehicle_type AUTO lat 12.934523 lng 77.610234 rating 4.7 acceptance_rate 0.85

Offers

Key: offer:{city_id}:{offer_id}

Type: HASH
Fields:
  - offer_id: off_789
  - booking_id: bkg_123
  - driver_id: drv_001
  - created_at: 1703001234
  - expires_at: 1703001249
  - status: PENDING | ACCEPTED | DECLINED | EXPIRED | CANCELLED
  - attempt: 1
  - distance_m: 500
  - eta_sec: 120
  - rank: 2
TTL: 20 seconds (slightly > acceptance window)

---

Key: offers:booking:{city_id}:{booking_id}

Type: SET
Members: offer_id_1, offer_id_2, ...
TTL: 300 seconds (5 minutes)

Dispatcher State

Key: dispatch:state:{city_id}:{booking_id}

Type: HASH
Fields:
  - booking_id: bkg_123
  - attempt: 2
  - current_radius_m: 3000
  - max_radius_m: 5000
  - status: SEARCHING | OFFERS_SENT | EXPIRED | CANCELLED
  - started_at: 1703001234
  - last_retry_at: 1703001250
  - offers_pending: 5
  - offers_declined: 2
TTL: 300 seconds

HMSET dispatch:state:blr:bkg_123 
  attempt 2 
  current_radius_m 3000 
  status OFFERS_SENT
EXPIRE dispatch:state:blr:bkg_123 300

This can be reconstructed by replaying the Kafka events

Locking

Intermediate lock to prevent multiple ride assignments to a driver

Key: lock:driver:{city_id}:{driver_id}

Type: STRING
Value: booking_id (which booking locked this driver)
TTL: 15 seconds (offer acceptance window)

SET lock:driver:blr:drv_001 bkg_123 EX 15

Lock to prevent multiple Matchmaking consumers from processing the same booking. Similar to caching for the Transactional Outbox

SET lock:dispatch:{booking_id} {worker_id} NX EX 60

If lock acquired:
  - Process dispatch
  - Release lock after completion
Else:
  - Skip (another worker handling it)

This can be reconstructed from the database and events
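
A minimal sketch of this worker lock with redis-py, under the key layout above. The release is done via a compare-and-delete Lua script so a worker never deletes a lock that expired and was re-acquired by someone else:

import uuid
import redis

r = redis.Redis()

# Delete the lock only if this worker still owns it
RELEASE = r.register_script("""
if redis.call('GET', KEYS[1]) == ARGV[1] then
  return redis.call('DEL', KEYS[1])
end
return 0
""")

def with_dispatch_lock(booking_id, process):
    worker_id = str(uuid.uuid4())
    key = f"lock:dispatch:{booking_id}"
    if not r.set(key, worker_id, nx=True, ex=60):
        return False  # another worker is handling it
    try:
        process(booking_id)
        return True
    finally:
        RELEASE(keys=[key], args=[worker_id])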

Hotspot Cache

Key: hotspot:{city_id}:{vehicle_type}

Type: SORTED SET
Members: h3_cell_id
Score: demand_score (0.0 to 1.0)
TTL: 120 seconds (refreshed by stream processor)

Example:
ZADD hotspot:blr:AUTO 0.85 89283082837ffff 0.72 89283082838ffff
EXPIRE hotspot:blr:AUTO 120

# Top 10 hotspots
ZREVRANGE hotspot:blr:AUTO 0 9 WITHSCORES
sequenceDiagram
    participant Kafka as Kafka<br/>(Event Log)
    participant Worker as Matchmaking<br/>Worker
    participant Postgres as Postgres<br/>(Snapshots)
    participant Redis as Redis<br/>(Ephemeral)

    Note over Kafka,Redis: Normal Flow - Events + Snapshots

    Kafka->>Worker: booking.created<br/>(bkg_123, 10:00:00)
    Worker->>Worker: state = {status: SEARCHING}
    Worker->>Postgres: INSERT dispatch_attempts<br/>{booking_id: bkg_123,<br/>status: SEARCHING,<br/>attempt: 1,<br/>started_at: 10:00:00}
    Worker->>Redis: SET dispatch:state:blr:bkg_123<br/>(TTL 300s)

    Note over Worker: Snapshot #1 saved to Postgres

    Kafka->>Worker: offers.created<br/>(5 drivers, 10:00:01)
    Worker->>Worker: state = {status: OFFERS_SENT,<br/>offers_pending: 5}
    Worker->>Postgres: UPDATE dispatch_attempts<br/>SET status=OFFERS_SENT,<br/>offers_pending=5,<br/>last_updated=10:00:01
    Worker->>Redis: HSET dispatch:state:blr:bkg_123<br/>status OFFERS_SENT

    Note over Worker: Snapshot #2 updated in Postgres

    Kafka->>Worker: offer.declined<br/>(drv_001, 10:00:08)
    Worker->>Worker: state.offers_pending--
    Worker->>Postgres: UPDATE dispatch_attempts<br/>SET offers_pending=4,<br/>offers_declined=1,<br/>last_updated=10:00:08
    Worker->>Redis: HSET offers_pending 4

    Note over Worker: Snapshot #3 updated in Postgres

    rect rgb(255, 200, 200)
        Note over Worker: WORKER CRASHES at 10:00:09
    end

    Note over Kafka,Redis: Worker Recovery - Using Snapshots

    Worker->>Worker: New worker starts (10:00:15)
    
    Worker->>Postgres: SELECT * FROM dispatch_attempts<br/>WHERE booking_id='bkg_123'
    Postgres-->>Worker: Latest snapshot:<br/>{status: OFFERS_SENT,<br/>offers_pending: 4,<br/>offers_declined: 1,<br/>last_updated: 10:00:08}
    
    Note over Worker: State restored instantly<br/>from snapshot!

    Worker->>Kafka: Fetch events AFTER 10:00:08<br/>(optional - for freshness)
    Kafka-->>Worker: [offer.declined at 10:00:09,<br/>trip.created at 10:00:11]
    
    Worker->>Worker: Apply recent events:<br/>state.offers_pending = 3<br/>state.status = ASSIGNED

    Worker->>Redis: Rebuild cache:<br/>SET dispatch:state:blr:bkg_123

    Note over Worker: Resume processing from<br/>current state

    rect rgb(200, 255, 200)
        Note over Worker: Recovery complete<br/>Total time: <1 second
    end

    Note over Kafka,Redis: Without Snapshots (Pure Event Sourcing)

    Worker->>Kafka: Must replay ALL events<br/>from topic beginning
    Kafka-->>Worker: Event 1: booking.created<br/>Event 2: offers.created<br/>Event 3-100: offer.declined...<br/>(replay 100s of events)
    
    Note over Worker: ❌ Slow reconstruction<br/>Total time: 10-30 seconds

Trip Service

Database Tables:

trips (
  trip_id UUID,
  booking_id VARCHAR(64) NOT NULL UNIQUE,
  
  -- Participants
  rider_id NOT NULL,
  driver_id NOT NULL,
  
  -- Location
  pickup_lat DECIMAL(10, 8) NOT NULL,
  pickup_lng DECIMAL(11, 8) NOT NULL,
  dropoff_lat <same>,
  dropoff_lng <same>,
  pickup_address TEXT,
  dropoff_address TEXT,
  
  vehicle_type ENUM/VARCHAR,
  city_id VARCHAR NOT NULL,
  
  status VARCHAR NOT NULL,
  -- DRIVER_ASSIGNED, IN_PROGRESS, PAUSED, COMPLETED, CANCELLED, PAID
  
  -- Timestamps
  assigned_at TIMESTAMPTZ NOT NULL,
  started_at TIMESTAMPTZ,
  paused_at TIMESTAMPTZ,
  resumed_at TIMESTAMPTZ,
  ended_at TIMESTAMPTZ,
  
  -- Trip metrics (filled on completion)
  distance_km DECIMAL(6, 2),
  duration_sec INT,
  
  -- References
  fare_id VARCHAR(64),  -- FK to pricing service
  payment_id VARCHAR(64),  -- FK to payment service
  
  -- Cancellation
  cancelled_by VARCHAR(16),  -- DRIVER, RIDER, SYSTEM
  cancellation_reason TEXT,
  cancellation_fee DECIMAL(10, 2),
  
  -- Audit
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  
  -- Indexes
  INDEX idx_booking_id (booking_id),
  INDEX idx_rider_id (rider_id, created_at DESC),
  INDEX idx_driver_id (driver_id, created_at DESC),
  INDEX idx_status (status, created_at DESC),
  INDEX idx_city_created (city_id, created_at DESC)
);

-- Composite index for active trips query
CREATE INDEX idx_active_trips 
ON trips(status, rider_id) 
WHERE status IN ('DRIVER_ASSIGNED', 'IN_PROGRESS', 'PAUSED');

-- Partition by created_at (monthly) for scalability
CREATE TABLE trips_2025_12 PARTITION OF trips
FOR VALUES FROM ('2025-12-01') TO ('2026-01-01');
trip_state_transitions (
  id BIGSERIAL PRIMARY KEY,
  trip_id UUID NOT NULL REFERENCES trips(trip_id),
  
  from_status VARCHAR(32),
  to_status VARCHAR(32) NOT NULL,
  
  changed_by VARCHAR(16) NOT NULL,
  actor_id VARCHAR(64),
  
  reason TEXT,
  metadata JSONB,
  
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  
  INDEX idx_trip_id (trip_id, created_at DESC)
);

-- Partition by created_at (monthly)
-- Retention: 90 days
  • trip.locations
  • trip.events (although present in Kafka, could be async write to database)

Trip Event Types:

  -- OFFER_ACCEPTED, TRIP_STARTED, LOCATION_UPDATE, 
  -- TRIP_PAUSED, TRIP_RESUMED, TRIP_ENDED, TRIP_CANCELLED

Caches:

Key: trip:active:{city_id}:{trip_id}

Type: HASH
Fields:
  - trip_id: trp_001
  - booking_id: bkg_123
  - rider_id: usr_456
  - driver_id: drv_001
  - status: IN_PROGRESS
  - started_at: 1703001234
  - pickup_lat: 12.934523
  - pickup_lng: 77.610234
  - dropoff_lat: 12.956789
  - dropoff_lng: 77.634567
TTL: (e.g. 4 hours) [max trip duration + buffer]

Inverse Indexes

Driver-Trip Binding

Key: trip:by-driver:{city_id}:{driver_id}
Type: STRING

Rider-Trip Binding

Key: trip:by-rider:{city_id}:{rider_id}
Type: STRING

Key: trip:location:{trip_id}

Type: GEOHASH
Members: timestamped location points
TTL: 1 hour after trip ends

Purpose: Fast location queries for ETA

Example:
GEOADD trip:location:trp_001 77.610234 12.934523 "loc_1703001234"
Key: lock:trip:{trip_id}
Type: STRING

Value: {operation}:{worker_id}
TTL: 30 seconds

Purpose: Prevent concurrent state transitions
induced by driver, rider modifying the same trip record

Example:
SET lock:trip:trp_001 "start:worker-3" NX EX 30

States

SEARCHING -> ASSIGNED -> IN_PROGRESS -> COMPLETED
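
A minimal transition table for this lifecycle; the PAUSED/CANCELLED/PAID edges are taken from the status enum on the trips table, the happy path from the line above:

ALLOWED = {
    "SEARCHING":   {"ASSIGNED", "CANCELLED"},
    "ASSIGNED":    {"IN_PROGRESS", "CANCELLED"},
    "IN_PROGRESS": {"PAUSED", "COMPLETED"},
    "PAUSED":      {"IN_PROGRESS", "COMPLETED"},
    "COMPLETED":   {"PAID"},
}

def transition(current: str, new: str) -> str:
    # Reject anything not on an allowed edge; callers persist the
    # change to trip_state_transitions for audit.
    if new not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {new}")
    return new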

Events

  • trip.created

    • Consumed by:
      • Matchmaking Service (cancel other offers)
      • Notification Service (notify rider "Driver assigned")
      • Analytics Service
  • trip.started

    • Consumers:
      • Push Notification/SMS Service (notify rider "Trip started")
      • Analytics Service (start tracking metrics)
      • Driver Supply Service (mark driver as on_trip)
  • trip.location_updated

    • Consumers:
      • Rider App (via WebSocket server that subscribes to this, or http poll)
      • Analytics Service (tracking, heatmaps)
      • Trip Service itself (async write to trip_locations table)
  • trip.completed

    • Consumers:
      • Payment Service (initiate payment)
      • Notification Service (send receipt, rating prompt)
      • Driver Supply Service (mark driver available)
      • Analytics Service
  • trip.cancelled:

    • Consumers:
      • Payment Service (charge cancellation fee if applicable)
      • Driver Supply Service (mark driver available)
      • Notification Service (notify both parties)
      • Policy Service (update fairness metrics)

APIs

1. Accept Offer (POST /trips/accept)

POST /trips/accept
Headers:
  X-Driver-ID: drv_001
  X-Idempotency-Key: idmp_abc123
  Authorization: Bearer <driver_token>

Body:
{
  "offer_id": "off_789"
}

Response: 201 Created
{
  "trip_id": "trp_001",
  "booking_id": "bkg_123",
  "status": "DRIVER_ASSIGNED",
  "assigned_at": "2025-12-20T10:00:10Z",
  
  "rider": {
    "rider_id": "usr_456",
    "name": "John Doe",
    "phone": "+91XXXXXXXXXX",
    "rating": 4.8,
    "profile_pic_url": "https://..."
  },
  
  "pickup": {
    "lat": 12.934523,
    "lng": 77.610234,
    "address": "123 Main St, Koramangala"
  },
  
  "destination": {
    "lat": 12.956789,
    "lng": 77.634567,
    "address": "456 Park Ave, Indiranagar"
  },
  
  "estimated_fare": 150,
  "estimated_distance_km": 8.2,
  "estimated_duration_min": 20,
  
  "otp": "1234"
}

Errors:
409 Conflict - OFFER_EXPIRED
409 Conflict - OFFER_ALREADY_ACCEPTED
409 Conflict - DRIVER_BUSY
410 Gone - OFFER_TAKEN
500 Internal Server Error

2. Start Trip (POST /trips/{trip_id}/start)

POST /trips/trp_001/start
Headers:
  X-Driver-ID: drv_001
  Authorization: Bearer <driver_token>

Body:
{
  "otp": "1234",  // Rider's OTP
  "location": {
    "lat": 12.934523,
    "lng": 77.610234,
    "accuracy_m": 10
  }
}

Response: 200 OK
{
  "trip_id": "trp_001",
  "status": "IN_PROGRESS",
  "started_at": "2025-12-20T10:05:00Z",
  "otp_verified": true
}

Errors:
400 Bad Request - INVALID_OTP
403 Forbidden - UNAUTHORIZED_DRIVER
404 Not Found - TRIP_NOT_FOUND
409 Conflict - TRIP_ALREADY_STARTED
422 Unprocessable - TRIP_CANCELLED

3. Update Location (POST /trips/{trip_id}/location)

POST /trips/trp_001/location
Headers:
  X-Driver-ID: drv_001

Body:
{
  "locations": [
    {
      "lat": 12.935000,
      "lng": 77.611000,
      "accuracy_m": 10,
      "speed_kmh": 35.5,
      "bearing": 45,
      "recorded_at": "2025-12-20T10:05:10Z"
    },
    {
      "lat": 12.935500,
      "lng": 77.611500,
      "accuracy_m": 8,
      "speed_kmh": 40.0,
      "bearing": 50,
      "recorded_at": "2025-12-20T10:05:15Z"
    }
  ]
}

Response: 202 Accepted
{
  "processed": 2,
  "trip_status": "IN_PROGRESS"
}

Note: Batched updates to reduce API calls

4. End Trip (POST /trips/{trip_id}/end)

POST /trips/trp_001/end
Headers:
  X-Driver-ID: drv_001

Body:
{
  "dropoff": {
    "lat": 12.956789,
    "lng": 77.634567,
    "accuracy_m": 10
  },
  "odometer_km": 8.5,
  "actual_duration_sec": 1200
}

Response: 200 OK
{
  "trip_id": "trp_001",
  "status": "COMPLETED",
  "ended_at": "2025-12-20T10:25:00Z",
  
  "trip_summary": {
    "distance_km": 8.5,
    "duration_sec": 1200,
    "avg_speed_kmh": 42.5
  },
  
  "fare": {
    "base": 50,
    "distance": 85,
    "time": 20,
    "surge_multiplier": 1.2,
    "subtotal": 155,
    "tax": 31,
    "total": 186,
    "currency": "INR"
  },
  
  "payment_status": "PENDING"
}

Errors:
403 Forbidden - UNAUTHORIZED_DRIVER
404 Not Found - TRIP_NOT_FOUND
409 Conflict - TRIP_NOT_STARTED
409 Conflict - TRIP_ALREADY_ENDED

Cancel Trip (POST /trips/{trip_id}/cancel)

POST /trips/trp_001/cancel
Headers:
  X-User-ID: usr_456 OR X-Driver-ID: drv_001

Body:
{
  "cancelled_by": "RIDER",  // or DRIVER
  "reason": "CHANGED_MIND",
  "location": {
    "lat": 12.934523,
    "lng": 77.610234
  }
}

Response: 200 OK
{
  "trip_id": "trp_001",
  "status": "CANCELLED",
  "cancelled_at": "2025-12-20T10:03:00Z",
  "cancelled_by": "RIDER",
  
  "cancellation_fee": 20,
  "reason": "Cancellation within 2 minutes of assignment"
}

Errors:
403 Forbidden - UNAUTHORIZED
404 Not Found - TRIP_NOT_FOUND
409 Conflict - TRIP_ALREADY_COMPLETED
422 Unprocessable - CANNOT_CANCEL_IN_PROGRESS

Get Trip Details (GET /trips/{trip_id})

GET /trips/trp_001
Headers:
  X-User-ID: usr_456 OR X-Driver-ID: drv_001

Response: 200 OK
{
  "trip_id": "trp_001",
  "booking_id": "bkg_123",
  "status": "IN_PROGRESS",
  
  "rider": {
    "rider_id": "usr_456",
    "name": "John Doe",
    "rating": 4.8
  },
  
  "driver": {
    "driver_id": "drv_001",
    "name": "Jane Smith",
    "rating": 4.9,
    "vehicle": {
      "make": "Maruti",
      "model": "Swift",
      "plate": "KA01AB1234",
      "color": "White"
    }
  },
  
  "pickup": {...},
  "destination": {...},
  
  "started_at": "2025-12-20T10:05:00Z",
  "estimated_arrival": "2025-12-20T10:25:00Z",
  
  "current_location": {
    "lat": 12.945000,
    "lng": 77.620000,
    "updated_at": "2025-12-20T10:15:30Z"
  }
}

Get Active Trip (GET /trips/active)

GET /trips/active
Headers:
  X-User-ID: usr_456

Response: 200 OK
{
  "trip": {
    "trip_id": "trp_001",
    "status": "IN_PROGRESS",
    ...
  }
}

Response: 404 Not Found (if no active trip)
{
  "error": "NO_ACTIVE_TRIP"
}

Accept Offer & Create Trip

Driver clicks "Accept" in app
  |
POST /trips/accept {offer_id}
  |
Trip Service:
  1. Validate offer exists in Redis
  2. Check offer not expired
  3. Atomic Redis operation (Lua):
     - Check driver not locked
     - Check offer status = PENDING
     - Set offer = ACCEPTED
     - Lock driver
  4. Insert trip record in Postgres (idempotent)
  5. Write to Redis caches (active trip bindings)
  6. Publish trip.created event to Kafka
  7. Return 201 with trip details
  |
Matchmaking Service consumes trip.created:
  - Cancel other pending offers
  - Release other driver locks
  |
Notification Service:
  - Send push to rider "Driver assigned"
  - Send SMS with driver details
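
Step 3 above, sketched with redis-py's script helper. Key and field names mirror the cache layout described earlier; treat this as illustrative, not the exact production script:

import redis

r = redis.Redis()

ACCEPT_OFFER = r.register_script("""
local offer_key  = KEYS[1]   -- offer:{city}:{offer_id}
local lock_key   = KEYS[2]   -- lock:driver:{city}:{driver_id}
local booking_id = ARGV[1]
local ttl_sec    = tonumber(ARGV[2])

if redis.call('EXISTS', lock_key) == 1 then
  return {0, 'DRIVER_BUSY'}
end
if redis.call('HGET', offer_key, 'status') ~= 'PENDING' then
  return {0, 'OFFER_TAKEN'}
end
redis.call('HSET', offer_key, 'status', 'ACCEPTED')
redis.call('SET', lock_key, booking_id, 'EX', ttl_sec)
return {1, 'OK'}
""")

ok, reason = ACCEPT_OFFER(
    keys=["offer:blr:off_789", "lock:driver:blr:drv_001"],
    args=["bkg_123", 15],  # 15s = offer acceptance window
)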

Start Trip

Driver arrives at pickup
  |
Rider shares OTP: "1234"
  |
Driver enters OTP in app
  |
POST /trips/{trip_id}/start {otp, location}
  |
Trip Service:
  1. Validate trip exists, status = DRIVER_ASSIGNED
  2. Verify OTP matches Redis cache
  3. Acquire trip lock in Redis
  4. Update Postgres:
     - trips.status = IN_PROGRESS
     - trips.started_at = NOW()
  5. Update Redis cache
  6. Write to trip_state_transitions (audit)
  7. Publish trip.started event
  8. Release lock
  |
Driver Supply Service:
  - Mark driver.status = on_trip
  |
Notification Service:
  - Notify rider "Trip started"
  |
Driver app starts sending location updates every 5s

Near Real-time Location Updates

Driver app (every 5 seconds):
  |
Batch 6 location points
  |
POST /trips/{trip_id}/location {locations[]}
  |
Trip Service:
  1. Validate trip IN_PROGRESS
  2. Write to Redis GEOHASH (fast cache)
  3. Publish trip.location_updated event (async)
  4. Return 202 Accepted
  |
Async Worker consumes event:
  - Batch insert to trip_locations table
  - 1000 locations per batch
  |
SSE/WebSocket Server consumes event:
  - Push to rider's active WebSocket connection
  - Rider sees driver moving on map

End Trip & Payment

Driver arrives at destination
  |
Driver clicks "End Trip"
  |
POST /trips/{trip_id}/end {dropoff, odometer}
  |
Trip Service:
  1. Validate trip IN_PROGRESS
  2. Acquire trip lock
  3. Calculate metrics:
     - distance_km (from GPS breadcrumbs)
     - duration_sec (started_at - now)
  4. Call Pricing Service:
     GET /pricing/calculate-fare
     {distance_km, duration_sec, surge, vehicle_type}
  5. Update Postgres:
     - trips.status = COMPLETED
     - trips.ended_at = NOW()
     - trips.distance_km = 8.5
     - trips.duration_sec = 1200
     - trips.fare_id = fare_123
  6. Publish trip.completed event
  7. Return fare breakdown
  |
Payment Service consumes trip.completed:
  - Create payment intent
  - Redirect rider to payment
  |
Driver Supply Service:
  - Mark driver.status = online (available)
  |
Notification Service:
  - Send receipt to rider
  - Prompt for rating

Cancel Trip

Rider clicks "Cancel" (or Driver cancels)
  |
POST /trips/{trip_id}/cancel {cancelled_by, reason}
  |
Trip Service:
  1. Validate trip not COMPLETED
  2. Acquire trip lock
  3. Call Policy Service:
     GET /policies/cancellation-fee
     {status, cancelled_by, time_since_assigned}
     -> Returns: {fee: 20, reason: "..."}
  4. Update Postgres:
     - trips.status = CANCELLED
     - trips.cancelled_by = RIDER
     - trips.cancellation_fee = 20
  5. Publish trip.cancelled event
  6. Return cancellation details
  |
Payment Service:
  - Charge cancellation fee if applicable
  |
Driver Supply Service:
  - Mark driver available
  |
Matchmaking Service:
  - Release offer locks (if not started yet)
  |
Notification Service:
  - Notify both parties

Cache Strategy

Request: GET /trips/{trip_id}

Layer 1: Redis (Hot Cache)
  ├─ Check: trip:active:{city}:{trip_id}
  ├─ Hit: Return immediately (5ms)
  └─ Miss: Go to Layer 2

Layer 2: Postgres (Source of Truth)
  ├─ Query: SELECT * FROM trips WHERE trip_id = ?
  ├─ Hit: Return + Write to Redis (50ms)
  └─ Miss: 404 Not Found

Cache Invalidation:
  ├─ On state change: Update Redis + Postgres
  ├─ On trip end: TTL expires in 1 hour
  └─ On trip cancel: Delete from Redis
State Change (e.g., trip started):
  1. Acquire lock
  2. Write to Postgres (source of truth)
  3. Write to Redis (invalidate + update)
  4. Publish event to Kafka
  5. Release lock
  6. Return to client

If Redis write fails:
  - Log error
  - Continue (Postgres is source of truth)
  - Next read will cache miss -> rebuild from Postgres
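
The read path as a hedged Python sketch; db.query_one is a hypothetical DB helper standing in for whatever Postgres client is used:

import redis

r = redis.Redis()
TRIP_TTL_S = 4 * 3600  # active-trip TTL from the cache layout above

def get_trip(city, trip_id, db):
    key = f"trip:active:{city}:{trip_id}"
    cached = r.hgetall(key)
    if cached:  # Layer 1 hit (~5ms)
        return {k.decode(): v.decode() for k, v in cached.items()}
    row = db.query_one("SELECT * FROM trips WHERE trip_id = %s", (trip_id,))
    if row is None:
        return None  # 404 Not Found
    # Rebuild the hot cache on miss (~50ms path)
    r.hset(key, mapping={k: str(v) for k, v in row.items()})
    r.expire(key, TRIP_TTL_S)
    return row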

Race cases

10:00:00 - Driver clicks "Cancel" (poor network, slow request)
10:00:01 - Rider clicks "Cancel" (request arrives first)

Expected Behaviour:

  • Request 1 (Rider): Acquire lock -> Process -> Update -> Release
  • Request 2 (Driver): Try lock -> LOCKED -> Wait/Retry
  • Request 2: Acquire lock -> Read status = CANCELLED -> Abort
  • Result: Only one cancellation processed

10:05:00 - Driver clicks "Start Trip" (at pickup location)
10:05:00 - Rider clicks "Cancel" (changed mind)

Expected:

  • Request 1 (Start): Acquire lock -> Update to IN_PROGRESS -> Release
  • Request 2 (Cancel): Try lock -> LOCKED -> Wait
  • Request 2: Acquire lock -> Read status = IN_PROGRESS
  • Request 2: Validation fails: "Cannot cancel in-progress trip"
  • Result: Trip correctly started, cancel rejected

10:25:00 - Driver clicks "End Trip"
10:25:00 - Driver clicks "End Trip" again (double-click / network retry)

Expected:

  • Request 1: Acquire lock -> Process -> Update -> Publish -> Release
  • Request 2: Try lock -> LOCKED -> Wait
  • Request 2: Acquire lock -> Read status = COMPLETED -> Abort (idempotent)
  • Result: Only one payment



Matchmaking

(MVP)

We start off with a 3-stage pipeline:

  • Candidate List Generation (which drivers are nearby)
  • Filtering (eligibility check)
  • Ranking

Candidate Generation

Inputs:

  • pickup location (lat/lng)
  • vehicle_type + city_id

Mechanism:

  • H3 cell lookup (2-3 resolutions)
  • kRing expansion
cells = h3.kRing(pickup_cell, ring=0..N)
drivers = UNION(ZRANGEBYSCORE drivers:h3:{cell})

The sorted set works because it automatically filters out drivers whose location pings aren't fresh, much like an LRU.
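
A minimal candidate-generation sketch, assuming the h3-py v3 API (geo_to_h3/k_ring; v4 renamed these to latlng_to_cell/grid_disk) and redis-py, against the drivers:h3:{city}:{vehicle}:{cell} ZSET layout from the Driver Supply section:

import time
import h3
import redis

r = redis.Redis()
FRESHNESS_MS = 30_000  # drop drivers whose last ping is older than 30s

def candidate_drivers(lat, lng, city, vehicle, rings=2):
    # Union fresh drivers from the pickup cell and its kRing neighbors
    pickup_cell = h3.geo_to_h3(lat, lng, 8)  # r8 matching resolution
    now_ms = int(time.time() * 1000)
    candidates = set()
    for cell in h3.k_ring(pickup_cell, rings):
        key = f"drivers:h3:{city}:{vehicle}:{cell}"
        # ZSET score = last_seen_ts, so a range query drops stale entries
        fresh = r.zrangebyscore(key, now_ms - FRESHNESS_MS, now_ms)
        candidates.update(d.decode() for d in fresh)
    return candidates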

Filtering

Inputs:

  • driver metadata

Filtering Criteria:

  • driver.status == online
  • vehicle_type matches
  • last_seen < 10s
  • not in_trip
  • no pending offer (has_offer == false)

For each of these we can have feature flags

  • enable_fairness: Turn fairness on/off
  • enable_multi_offer: Fan-out vs sequential
  • max_offers_per_attempt: Control blast radius
  • max_radius_per_attempt: Retry aggressiveness
  • enable_maps_eta: Future upgrade

Ranking

score =
  w_distance * distance_score +
  w_recency  * recency_score +
  w_fairness * fairness_score +
  w_quality  * quality_score
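
A sketch of the scorer with every feature normalized to [0, 1]; the weights and normalization caps are illustrative defaults, to be tuned per city:

def rank_score(distance_m, last_seen_ms_ago, offers_since_last_trip, rating,
               w_distance=0.4, w_recency=0.2, w_fairness=0.2, w_quality=0.2):
    distance_score = max(0.0, 1 - distance_m / 5_000)         # 5km search cap
    recency_score  = max(0.0, 1 - last_seen_ms_ago / 30_000)  # 30s freshness
    fairness_score = min(1.0, offers_since_last_trip / 10)    # starved drivers rank up
    quality_score  = rating / 5.0
    return (w_distance * distance_score + w_recency * recency_score +
            w_fairness * fairness_score + w_quality * quality_score)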

Reward Control

This controls how the system rewards good customers and drivers.

There could be multiple ways to control this:

  • Tiered Queues with differing consumption Rate
  • Better feature controls to improve the search

Here, I am using the second option; it's easier and probably cheaper if tuned.

Feature Controls:

  • Search Radius deltas
  • More Results in Candidates

Matchmaking must never scan the world. It must only touch pre-shaped memory that already reflects reality.

Driver Supply maintains:

  1. Spatial locality
  2. Temporal freshness
  3. Driver eligibility
  4. Pre-aggregation
  5. Shard alignment

Matchmaking consumes only the already curated view + some real time data


Partitioning for location data

Region -> City -> Vehicle Type -> H3 Cell -> Drivers
Level     Why
Region    Legal + latency + isolation
City      Surge, policy, ops ownership
Vehicle   Matching correctness
H3 cell   Spatial pruning
Driver    Unit of supply

GEO strategy

Using H3, because it's easier and cheaper, with a bit of an accuracy trade-off

two resolutions, minimum.

Purpose          H3 Resolution
Matching         r8 (~460m)
Pre-aggregation  r7 (~1.2km)
  • r8 is small enough to avoid false positives
  • r7 is stable enough to compute trends and load

There are more granular ways to compute precise locations, to determine things like which side of the road a driver is on, or finer positions at airports etc.

OpenStreetMap XML can be converted to nodes and edges, and we can triangulate based on location data and OSM node values.

Driver Supply in-memory layout

Hot path: matching index

drivers:h3:{city}:{vehicle}:{h3_r8}
  - ZSET(driver_id -> last_seen_ts)

This is what Matchmaking hits. There can be 2 instances of Driver Supply Service - following a CQRS pattern.

A cluster of instances can read the data from cache. This way, the ownership of the data doesn't change, and we don't hit the write path.

Incoming Request -> Nginx -> (API Based Route Splitting) -> [Write Cluster + Read Cluster]

Guarantees provided:

  • Only online drivers
  • Only correct vehicle
  • Time-filterable
  • O(log N)

Metadata side-channel

driver:meta:{city}:{driver_id}
  - HASH

Pre-computation:

Cell-level supply snapshot (updated continuously)

For each (city, vehicle, h3_r7):

{
  "active_drivers": 42,
  "avg_last_seen_sec": 3.1,
  "busy_ratio": 0.28,
  "acceptance_rate": 0.81
}

Stored as:

supply:snapshot:{city}:{vehicle}:{h3_r7}

Updated incrementally, not recomputed.

Demand vs supply signal (for surge + throttling)

Driver Supply emits:

{
  "h3_r7": "89283082",
  "vehicle": "AUTO",
  "drivers": 42,
  "bookings_5m": 71,
  "pressure": 1.69
}
  • Published to Kafka
  • Consumed by Pricing
  • Consumed by Matchmaking (for radius / offer limits)

Partitioning strategy (Redis, Kafka, CPU)

Redis

Shard by city first. Always.

Redis Cluster
 ├── blr-shard-1
 ├── blr-shard-2
 ├── del-shard-1
 ├── mum-shard-1

Key pattern ensures co-location:

drivers:h3:blr:AUTO:89283082837ffff

No cross-shard fan-out for matching.

Kafka

Topics

Topic             Partition key
driver.location   driver_id
booking.created   booking_id
supply.snapshot   city + h3_r7
offers.created    booking_id

Why not city-only?

  • booking lifecycle must be ordered
  • driver updates must be independent

Pre-Aggregators and Ownership

Task               Ownership
Spatial bucketing  Driver Supply
Driver freshness   Driver Supply
Busy/idle          Driver Supply
Hotspots           Stream processor
Candidate ranking  Matchmaking
Offer lifecycle    Matchmaking
Trip binding       Trip Service

Handling Spikes

This assumes generic scenarios which can lead to spikes, ignoring region-specific market behaviour

  • Rains
  • Large Events
  • Traffic or Road Closures

Application Metrics to Track:

  • N(=10)x booking.created events
  • Same cells hit repeatedly
  • Offer fanout increases
  • Drivers get spammed
  • Acceptance drops further
  1. Control fanout
fanout = min(
  MAX_FANOUT,
  available_drivers * acceptance_rate_estimate
)
  2. Hard Cap on retries
  • Max Retry Attempts
  • Max Radius
  • Max Time
  3. Driver Supply Tanks
  • Define SLA for Demand Supply Ratio of time buckets like 1m, 5m.
  • Bail early if thresholds exceed
  • Or Degrade Experience by Queueing
  4. Punishment Rewards (Rate Limiter)
  • Driver location updates have stopped without logging off
  • Driver rejects too many bookings: temporarily ban
  • Customer books and cancels multiple times in a time interval
  • Historic behavior analysis, which can feed Tiered Queues if needed.
    • Such people get lower priority

Driver Movement Tracking

Depending on Supply Demand, we can adaptively increase the resolution of the H3.

  1. Driver app sends a location ping

Example payload (batched, every ~1s or 2s):

{ "driver_id": "drv_001", "lat": 12.93491, "lng": 77.61088, "speed": 32.4, "accuracy": 8, "ts": 1703001234567 }

This hits:

PUT /drivers/location

Handled only by Driver Supply Service.

  2. Driver Supply computes the H3 cell (pure function)

Inside Driver Supply:

new_cell = h3.geo_to_h3(lat, lng, RESOLUTION_R8)
  3. Driver Supply loads the last known cell

From Redis (or in-memory cache):

driver:last_cell:{driver_id} -> old_cell

driver:last_cell:drv_001 -> 89283082837ffff
  4. Cell comparison (this is the gate)

if new_cell == old_cell:
    # driver still in same cell
    return 204  # do nothing
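
The whole gate as a hedged Python sketch (h3-py v3 API; GETSET makes read-and-swap of the last known cell a single round trip). Key names follow the layout above:

import h3
import redis

r = redis.Redis()
RESOLUTION_R8 = 8

def ingest_ping(driver_id, city, vehicle, lat, lng, ts_ms):
    new_cell = h3.geo_to_h3(lat, lng, RESOLUTION_R8)
    old_cell = r.getset(f"driver:last_cell:{driver_id}", new_cell)
    if old_cell is not None:
        old_cell = old_cell.decode()
    # Always refresh freshness in the current cell's ZSET
    r.zadd(f"drivers:h3:{city}:{vehicle}:{new_cell}", {driver_id: ts_ms})
    if old_cell == new_cell:
        return 204  # same cell: nothing else to do
    if old_cell:
        # Crossed a boundary: drop the stale index entry
        r.zrem(f"drivers:h3:{city}:{vehicle}:{old_cell}", driver_id)
    return 200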

Failure semantics

Driver Supply down?

  • Matching uses last-known state
  • Time filters naturally decay supply
  • Graceful degradation

Redis shard down?

  • City-level isolation
  • Surge throttles bookings
  • No cross-city blast radius

Kafka lag?

  • Supply snapshots lag -> pricing increases
  • Matching still works locally

Service Cascade Failure

  • Circuit Breakers for internal APIs
  • CB for external endpoints
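
A minimal in-process breaker to make the idea concrete (thresholds are illustrative; a real deployment would likely use a library or mesh-level policy):

import time

class CircuitBreaker:
    def __init__(self, max_failures=5, reset_after_s=30):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open")  # fail fast, no downstream call
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()
            raise
        self.failures = 0  # success closes the breaker
        return result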

SET NX PX

Although SET NX can be used, along with Redlock, to implement an expirable lock, it might become slow because of locking.

This can be avoided by using an atomic list acting as a queue (LPUSH/BLPOP), or a ZSET with a counter (INCR)

ScratchPad

Requirements

Functional Requirements:

  • Real-time driver location ingestion: each online driver sends 1–2 updates per second
  • Ride request flow: pickup, destination, tier, payment method
  • Dispatch/Matching: assign drivers within <1s p95; reassign on decline/timeout
  • Dynamic surge pricing: per geo-cell, based on supply–demand
  • Trip lifecycle: start, pause, end, fare calculation, receipts
  • Payments orchestration: integrate with external PSPs, retries, reconciliation
  • Notifications: push/SMS for key ride states
  • Admin/ops tooling: feature flags, kill-switches, observability

Non-Functional Requirements:

  • Latency SLOs:
    • Dispatch decision: p95 ≤ 1s
    • End-to-end request→acceptance: p95 ≤ 3s
    • Availability: 99.95% for dispatch APIs
  • Scale assumptions:
    • 300k concurrent drivers globally
    • 60k ride requests/minute peak
    • 500k location updates/second globally
  • Multi-region: region-local writes, failover handling, no cross-region sync on hot path
  • Compliance: PCI, PII encryption, GDPR/DPDP

Assumptions:

  • Most rides will be within a 60-80km radius.
  • Instead of OTP, we use static PINs
  • Drivers are charged platform fees per ride
  • Fleet management is not considered; how a driver gets a cab is out of scope.
  • Discount codes are not considered
  • Chat can be considered (but real-time is out of scope)
  • Assign drivers within <1s p95, wut!! (from the scaling req.) Assuming this means the time to dispatch the "Book" request from the user to the drivers.
    • We can't really assign a driver within <1s, unless they are self-driving cars, in which case, maybe 3s.
    • The search radius to broadcast the ride booking can also be extended
  • Assuming 1 minute TTL for a ride request, every 5-15-30s it expands the broadcast radius by 5km.
  1. Build the workflows, think about points of entry, from the UI
  2. Identify entities (stateful nouns)
  3. Data modelling based on Access patterns
  4. Consistency Evaluation and Concurrency support
  5. Scale and Durability
  6. Tech decisions
  7. Observability

All users of the platform have to be logged in from their devices. All endpoints are authenticated, requiring some form of tokens.

Ridees:

  • Users can search for destination locations (within the max radius, and intercity/interstate travel policies and boundaries)
  • User sees a list of ride types, with estimated prices or price-ranges
    • Ride types can be sorted by most available, cheapest, or user preference. (this can be controlled based on country/culture)
    • Pricing can vary based on surge (supply demand, peak hours in specific regions)
  • Ride requests are sent to drivers in the area nearby
  • Ride request gets accepted or times-out or user cancels it.
  • User can look at ETA of the ride to the location
  • User can also look at ETA to destination while in transit
  • Users are charged when the ride ends. All bookings are blocked if payment is pending from a previous trip.
  • User will have payment options, similar to order payment flow, cards, upi and other PSP integrations with callback mechanisms and retries
  • User can share their trip location updates with Friends/Point of Contacts.

Drivers:

  • Drivers can go on/off duty
  • Drivers get booking notifications containing (to-from-estimated fare)
  • Drivers can look at hotspots within X km radius
  • Drivers can see ride history and amount received - platform fees
  • Drivers can be banned based on policies and behaviours.

Others:

  • Notifications for logins
  • Send notifications to users on ride booking (Booked, Retry, Ended)
  • Send notifications for payment completions, payment status
  • Send notifications to Point of Contacts, periodically sending user location and trip details.
  • Send notifications to drivers for ride bookings, nearby hotspots, different POIs at certain times like movie theaters, stations etc.
  • Send notifications to users based on travel history.
  • Alert drivers or users to confirm for safety based on certain factors: (booking in progress too long, route and destination not aligning, other patterns drivers use to game system)
  • Fair scoring policies, based on identified factors, for drivers.

Ridees, Drivers, Vehicles, Booking, Location, Pricing

Idempotency Keys for:

  • BookingRequest
  • Payments
  • ChatSession

Critical points:

  • Storing geo coordinates
  • Driver location tracking, moving between geo regions
  • Area specific geo resolution selection
  • Matching scores (pre-computed + on the fly)
  • Pricing policies attached to users, locations, (pre-computed and dynamic on the fly)
  • Re-hydration of in-memory data in case of crashes
  • Total reconciliation of orders and money transactions
  • Handling hotspots

System Numbers and Estimates

  • Kafka numbers with group commit
  • Benchmark numbers from RedPanda
  • 32G RAM, 1Gbps NIC, 100B messages, assuming a 20% drop from 130K msg/s = 50K msg/s per partition across 6 partitions
  • Ride Request: 1K RPS. Assuming 3 events per booking, 3K msgs/sec; with 12 partitions, that's 250 msgs/sec per partition.
  • Kafka tail latency is around 200ms at p95 (on i3en.large, 16GB RAM, 25Gbps, 2 vCPU), given leader re-election or broker crashes on high-throughput systems.
  • API requests Round Trip - 100ms
  • 2 DB Writes - 200ms

Since the application is global and requires no cross-region syncs, these numbers will be lower per region because of region-specific deployments.

Matching Engine:

Caching Opts: Per driver location update (worst case)

Typical pipeline:

  • HGET driver:state:{id} last_cell
  • ZREM driver:geo:{old_cell} (only if cell changed)
  • ZADD driver:geo:{new_cell}
  • HSET driver:state:{id}
  • (Optional) EXPIRE driver:state:{id}

500,000 pings/sec × 4 ops = 2M Ops (Globally)

When split across regions this number comes down, depending on partitioning. Assuming 10 major regions and 10 hotspots: 200K ops

Redis benchmarks:

$> redis-benchmark -t get -n 100000 -c 50
====== GET ======
  100000 requests completed in 0.90 seconds
  50 parallel clients
  3 bytes payload
  keep alive: 1
  host configuration "save": 3600 1 300 100 60 10000
  host configuration "appendonly": no
  multi-thread: no

Latency by percentile distribution:
0.000% <= 0.087 milliseconds (cumulative count 1)
50.000% <= 0.223 milliseconds (cumulative count 59475)
75.000% <= 0.239 milliseconds (cumulative count 75730)
87.500% <= 0.263 milliseconds (cumulative count 89721)
93.750% <= 0.295 milliseconds (cumulative count 94374)
96.875% <= 0.343 milliseconds (cumulative count 97031)
98.438% <= 0.383 milliseconds (cumulative count 98599)
99.219% <= 0.423 milliseconds (cumulative count 99280)
99.609% <= 0.559 milliseconds (cumulative count 99617)
99.805% <= 1.015 milliseconds (cumulative count 99809)
99.902% <= 1.455 milliseconds (cumulative count 99903)
99.951% <= 1.951 milliseconds (cumulative count 99952)
99.976% <= 2.279 milliseconds (cumulative count 99976)
99.988% <= 2.455 milliseconds (cumulative count 99988)
99.994% <= 2.527 milliseconds (cumulative count 99995)
99.997% <= 2.823 milliseconds (cumulative count 99997)
99.998% <= 3.391 milliseconds (cumulative count 99999)
99.999% <= 3.543 milliseconds (cumulative count 100000)
100.000% <= 3.543 milliseconds (cumulative count 100000)

Cumulative distribution of latencies:
0.003% <= 0.103 milliseconds (cumulative count 3)
20.062% <= 0.207 milliseconds (cumulative count 20062)
94.939% <= 0.303 milliseconds (cumulative count 94939)
99.104% <= 0.407 milliseconds (cumulative count 99104)
99.542% <= 0.503 milliseconds (cumulative count 99542)
99.658% <= 0.607 milliseconds (cumulative count 99658)
99.697% <= 0.703 milliseconds (cumulative count 99697)
99.733% <= 0.807 milliseconds (cumulative count 99733)
99.780% <= 0.903 milliseconds (cumulative count 99780)
99.804% <= 1.007 milliseconds (cumulative count 99804)
99.835% <= 1.103 milliseconds (cumulative count 99835)
99.853% <= 1.207 milliseconds (cumulative count 99853)
99.873% <= 1.303 milliseconds (cumulative count 99873)
99.896% <= 1.407 milliseconds (cumulative count 99896)
99.913% <= 1.503 milliseconds (cumulative count 99913)
99.926% <= 1.607 milliseconds (cumulative count 99926)
99.934% <= 1.703 milliseconds (cumulative count 99934)
99.942% <= 1.807 milliseconds (cumulative count 99942)
99.948% <= 1.903 milliseconds (cumulative count 99948)
99.956% <= 2.007 milliseconds (cumulative count 99956)
99.963% <= 2.103 milliseconds (cumulative count 99963)
99.997% <= 3.103 milliseconds (cumulative count 99997)
100.000% <= 4.103 milliseconds (cumulative count 100000)

Summary:
  throughput summary: 110987.79 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.233     0.080     0.223     0.311     0.407     3.543




$> redis-benchmark -t get -n 100000 -c 50 -P 8
====== GET ======
  100000 requests completed in 0.14 seconds
  50 parallel clients
  3 bytes payload
  keep alive: 1
  host configuration "save": 3600 1 300 100 60 10000
  host configuration "appendonly": no
  multi-thread: no

Latency by percentile distribution:
0.000% <= 0.111 milliseconds (cumulative count 16)
50.000% <= 0.263 milliseconds (cumulative count 54552)
96.875% <= 0.647 milliseconds (cumulative count 96912)
99.994% <= 2.607 milliseconds (cumulative count 100000)
100.000% <= 2.607 milliseconds (cumulative count 100000)

Cumulative distribution of latencies:
0.000% <= 0.103 milliseconds (cumulative count 0)
70.888% <= 0.303 milliseconds (cumulative count 70888)
83.032% <= 0.407 milliseconds (cumulative count 83032)
100.000% <= 3.103 milliseconds (cumulative count 100000)

Summary:
  throughput summary: 724637.69 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.314     0.104     0.263     0.543     1.063     2.607
      
$> redis-benchmark -t get -n 100000 -c 50 --threads 4
====== GET ======
  100000 requests completed in 0.50 seconds
  50 parallel clients
  3 bytes payload
  keep alive: 1
  host configuration "save": 3600 1 300 100 60 10000
  host configuration "appendonly": no
  multi-thread: yes
  threads: 4

Latency by percentile distribution:
0.000% <= 0.023 milliseconds (cumulative count 9)
50.000% <= 0.111 milliseconds (cumulative count 51821)
99.999% <= 22.911 milliseconds (cumulative count 100000)
100.000% <= 22.911 milliseconds (cumulative count 100000)

Cumulative distribution of latencies:
32.283% <= 0.103 milliseconds (cumulative count 32283)
99.531% <= 1.807 milliseconds (cumulative count 99531)
99.992% <= 10.103 milliseconds (cumulative count 99992)
100.000% <= 23.103 milliseconds (cumulative count 100000)

Summary:
  throughput summary: 198807.16 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.141     0.016     0.111     0.199     0.791    22.911
      

$> redis-benchmark -t get -n 100000 -c 50 --threads 4 -P 4

Summary:
  throughput summary: 396825.38 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.245     0.024     0.207     0.399     1.463     4.535
$>  redis-benchmark -t set -n 100000 -c 50 --threads 4 -P 4
====== SET ======
  100000 requests completed in 0.25 seconds
  50 parallel clients
  3 bytes payload
  keep alive: 1
  host configuration "save": 3600 1 300 100 60 10000
  host configuration "appendonly": no
  multi-thread: yes
  threads: 4

Latency by percentile distribution:
0.000% <= 0.031 milliseconds (cumulative count 24)
50.000% <= 0.391 milliseconds (cumulative count 50984)
75.000% <= 0.431 milliseconds (cumulative count 75104)
87.500% <= 0.455 milliseconds (cumulative count 89184)
93.750% <= 0.479 milliseconds (cumulative count 94160)
96.875% <= 0.551 milliseconds (cumulative count 96888)
98.438% <= 0.751 milliseconds (cumulative count 98452)
99.219% <= 1.135 milliseconds (cumulative count 99220)
99.609% <= 1.399 milliseconds (cumulative count 99620)
99.805% <= 1.543 milliseconds (cumulative count 99812)
99.902% <= 2.031 milliseconds (cumulative count 99912)
99.951% <= 2.199 milliseconds (cumulative count 99956)
99.976% <= 2.447 milliseconds (cumulative count 99976)
99.988% <= 2.487 milliseconds (cumulative count 99988)
99.994% <= 2.503 milliseconds (cumulative count 99996)
99.997% <= 2.511 milliseconds (cumulative count 100000)
100.000% <= 2.511 milliseconds (cumulative count 100000)

Cumulative distribution of latencies:
2.216% <= 0.103 milliseconds (cumulative count 2216)
10.160% <= 0.207 milliseconds (cumulative count 10160)
17.368% <= 0.303 milliseconds (cumulative count 17368)
59.720% <= 0.407 milliseconds (cumulative count 59720)
95.780% <= 0.503 milliseconds (cumulative count 95780)
97.556% <= 0.607 milliseconds (cumulative count 97556)
98.128% <= 0.703 milliseconds (cumulative count 98128)
98.764% <= 0.807 milliseconds (cumulative count 98764)
98.964% <= 0.903 milliseconds (cumulative count 98964)
99.064% <= 1.007 milliseconds (cumulative count 99064)
99.180% <= 1.103 milliseconds (cumulative count 99180)
99.376% <= 1.207 milliseconds (cumulative count 99376)
99.488% <= 1.303 milliseconds (cumulative count 99488)
99.628% <= 1.407 milliseconds (cumulative count 99628)
99.752% <= 1.503 milliseconds (cumulative count 99752)
99.864% <= 1.607 milliseconds (cumulative count 99864)
99.872% <= 1.703 milliseconds (cumulative count 99872)
99.884% <= 2.007 milliseconds (cumulative count 99884)
99.920% <= 2.103 milliseconds (cumulative count 99920)
100.000% <= 3.103 milliseconds (cumulative count 100000)

Summary:
  throughput summary: 398406.41 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.382     0.024     0.391     0.495     0.975     2.511

From the above, $> redis-benchmark -t get -n 100000 -c 50 --threads 4 -P 4

seems like a good trade-off: ~0.4ms p95 with pipelining, batching 3–4 ops per round trip (which matches the 4–5 ops of the location-update pipeline)

CockroachDB benchmark blog: Blog

From the blog: "Next, we examined an apples-to-apples comparison of latency under the same 850-warehouse workload between CockroachDB 2.0 and CockroachDB 1.1 with a three-node (16 vCPUs/node) setup:"

| Metric | CRDB 1.1 (ms) | CRDB 2.0 (ms) | % Improvement |
| --- | --- | --- | --- |
| Average Latency (p50) | 201 | 67 | 67% |
| 95% Latency (p95) | 671 | 151 | 77% |
| 99% Latency (p99) | 1,140 | 210 | 82% |

These results were achieved at the same isolation level (serializable), node count (3), and warehouse count (850).

For the same load, CockroachDB 2.0 reduces latency by as much as 82% when compared to CockroachDB 1.1. Viewed another way, CockroachDB 2.0 improves response time by 544% (CockroachDB 1.1 p99/CockroachDB 2.0 p99) when compared to 1.1.

Isolation Levels

Most databases present a choice of several transaction isolation levels, offering a tradeoff between correctness and performance. CockroachDB provides strong (“SERIALIZABLE”) isolation by default to ensure that your application always sees the data it expects.

With years of improvement in databases, if SERIALIZABLE can be the default: thank you for your service, RoachDB.
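
Operationally, SERIALIZABLE means clients must be prepared to retry on serialization conflicts (SQLSTATE 40001). A minimal sketch with psycopg2 against a hypothetical trips table:

```python
# Sketch: retry loop for CockroachDB's default SERIALIZABLE isolation.
# Assumes psycopg2 and a hypothetical `trips` table.
import psycopg2
from psycopg2 import errors

def add_fare(conn, trip_id: str, amount_cents: int, max_retries: int = 3):
    for _ in range(max_retries):
        try:
            with conn:  # commit on success, rollback on exception
                with conn.cursor() as cur:
                    cur.execute(
                        "UPDATE trips SET fare_cents = fare_cents + %s WHERE id = %s",
                        (amount_cents, trip_id),
                    )
            return
        except errors.SerializationFailure:
            # SQLSTATE 40001: another txn conflicted; safe to retry from the top.
            continue
    raise RuntimeError("serialization conflicts exhausted retries")
```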
