@r33drichards
Created December 17, 2025 17:42
# CUA SDK Telemetry - Deployment Strategy
## Overview
This document outlines the deployment strategy for CUA SDK telemetry instrumentation (OpenTelemetry + Sentry).
## Architecture
```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────┐     ┌─────────┐
│ CUA SDK         │────▶│ otel.cua.ai      │────▶│ Prometheus  │────▶│ Grafana │
│ (Python)        │     │ (OTLP Collector) │     │             │     │         │
└─────────────────┘     └──────────────────┘     └──────┬──────┘     └─────────┘
                                                        │
                                                        ▼
                                                 ┌─────────────┐
                                                 │ Alertmanager│────▶ Slack
                                                 └─────────────┘
```
## Pull Requests
| Repo | PR | Branch | Description | Status |
|------|-----|--------|-------------|--------|
| cloud | [#562](https://github.com/trycua/cloud/pull/562) | `feat/cua-sdk-dashboard` | Grafana dashboard + alert rules | ✅ Merged |
| cua | [#661](https://github.com/trycua/cua/pull/661) | `feat/otel-sentry-core` | Core OTEL/Sentry modules | Pending |
| cua | [#662](https://github.com/trycua/cua/pull/662) | `feat/otel-sentry-agent` | Agent callback instrumentation | Pending |
| cua | [#663](https://github.com/trycua/cua/pull/663) | `feat/otel-sentry-computer` | Computer interface instrumentation | Pending |
## Deployment Order
### Step 1: Cloud Infrastructure (✅ Complete)
The Grafana dashboard and Prometheus alert rules have been deployed via PR #562.
**Verification:**
- Dashboard: https://grafana.cua.ai/d/cua-sdk-metrics
- Alert rules: 4 CuaSDK* rules active in Prometheus
### Step 2: CUA SDK Packages (PyPI)
**Merge order matters due to dependencies:**
```
1. PR #661 (cua-core) ← Merge FIRST - contains core telemetry modules
2. PR #662 (cua-agent) ← Merge after core
3. PR #663 (cua-computer) ← Merge after core
```
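The ordering constraint can be checked mechanically. The following is a minimal sketch, assuming the only dependencies are the ones listed above (agent and computer each depend on core); `DEPENDS_ON` and `merge_order` are hypothetical names used for illustration and are not part of the CUA repositories.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each package maps to the set of packages it depends on.
# Assumption: these edges mirror the merge list above and nothing else.
DEPENDS_ON = {
    "cua-core": set(),
    "cua-agent": {"cua-core"},
    "cua-computer": {"cua-core"},
}


def merge_order(depends_on):
    """Return one valid merge order in which dependencies come first."""
    return list(TopologicalSorter(depends_on).static_order())
```

Calling `merge_order(DEPENDS_ON)` always places `cua-core` first; the relative order of `cua-agent` and `cua-computer` is not constrained.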
**Publish to PyPI after merging:**
```bash
# Option 1: Git tag trigger
git tag core-v0.1.10 && git push origin core-v0.1.10
git tag agent-v<version> && git push origin agent-v<version>
git tag computer-v<version> && git push origin computer-v<version>
# Option 2: Manual workflow dispatch
# Go to GitHub Actions → "Publish Core Package" → Run workflow
```
## Metrics Reference
### Four Golden Signals
| Signal | Metric | Type |
|--------|--------|------|
| Latency | `cua_sdk_operation_duration_seconds` | Histogram |
| Traffic | `cua_sdk_operations_total` | Counter |
| Errors | `cua_sdk_errors_total` | Counter |
| Saturation | `cua_sdk_concurrent_operations` | Gauge |
### Additional Metrics
| Metric | Type | Description |
|--------|------|-------------|
| `cua_sdk_tokens_total` | Counter | Token usage by model and type |
### Labels
- `operation`: Operation name (e.g., `agent.run`, `computer.action.screenshot`)
- `status`: `success` or `error`
- `error_type`: Exception class name (for errors)
- `model`: LLM model name (for tokens)
- `token_type`: `prompt` or `completion` (for tokens)
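As an illustration of how these metrics are typically queried, the sketch below builds two PromQL expressions as strings. The exact expressions used by the dashboard and alert rules are not given in this document, so treat these as assumptions. The `_bucket` suffix and the `le` label follow the usual Prometheus histogram convention, and the helper names `error_ratio_query` and `latency_quantile_query` are hypothetical.

```python
def error_ratio_query(window="5m"):
    """PromQL for the fraction of operations that errored over `window`."""
    return (
        f"sum(rate(cua_sdk_errors_total[{window}])) / "
        f"sum(rate(cua_sdk_operations_total[{window}]))"
    )


def latency_quantile_query(quantile=0.99, window="5m"):
    """PromQL for a latency quantile computed from the duration histogram.

    Assumes the histogram is exposed with the conventional `_bucket` series.
    """
    return (
        f"histogram_quantile({quantile}, "
        f"sum by (le) (rate(cua_sdk_operation_duration_seconds_bucket[{window}])))"
    )
```

Either string can be pasted into the Prometheus expression browser or passed as the `query` parameter of the HTTP API.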
## Alert Rules
| Alert | Condition | Severity |
|-------|-----------|----------|
| CuaSDKHighErrorRate | Error rate > 5% for 5m | warning |
| CuaSDKHighLatency | P99 latency > 30s for 5m | warning |
| CuaSDKHighLatencyCritical | P99 latency > 60s for 5m | critical |
| CuaSDKHighSaturation | Concurrent ops > 50 for 5m | warning |
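To make the thresholds concrete, here is a small sketch that checks a single snapshot of values against the table above. It is not how Prometheus evaluates the rules: it ignores the "for 5m" hold period and works on numbers you supply. `Snapshot`, `RULES`, and `breached` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    error_rate: float      # fraction of failed operations, 0.0 to 1.0
    p99_latency_s: float   # P99 operation latency in seconds
    concurrent_ops: float  # operations currently in flight


# (alert name, severity, condition) copied from the alert table.
RULES = [
    ("CuaSDKHighErrorRate", "warning", lambda s: s.error_rate > 0.05),
    ("CuaSDKHighLatency", "warning", lambda s: s.p99_latency_s > 30),
    ("CuaSDKHighLatencyCritical", "critical", lambda s: s.p99_latency_s > 60),
    ("CuaSDKHighSaturation", "warning", lambda s: s.concurrent_ops > 50),
]


def breached(snapshot):
    """Return (name, severity) for every rule whose threshold is exceeded."""
    return [(name, severity) for name, severity, cond in RULES if cond(snapshot)]
```

For example, `breached(Snapshot(0.01, 45, 10))` reports only `CuaSDKHighLatency`, while a P99 above 60s exceeds both latency thresholds at once.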
## Configuration
### Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `CUA_TELEMETRY_DISABLED` | `false` | Set to `true` to disable telemetry |
| `CUA_OTEL_ENDPOINT` | `https://otel.cua.ai` | OTLP HTTP endpoint |
| `CUA_OTEL_SERVICE_NAME` | `cua-sdk` | Service name for metrics |
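The sketch below shows one way these variables and defaults could be resolved. It is not the SDK's actual loader: `TelemetryConfig` and `load_config` are hypothetical names, and treating only the case-insensitive string `true` as "disabled" is an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class TelemetryConfig:
    disabled: bool
    endpoint: str
    service_name: str


def load_config(env=None):
    """Resolve telemetry settings from an environment mapping.

    Falls back to the documented defaults when a variable is unset.
    """
    env = os.environ if env is None else env
    raw_disabled = env.get("CUA_TELEMETRY_DISABLED", "false")
    return TelemetryConfig(
        # Assumption: only "true" (any case) disables telemetry.
        disabled=raw_disabled.strip().lower() == "true",
        endpoint=env.get("CUA_OTEL_ENDPOINT", "https://otel.cua.ai"),
        service_name=env.get("CUA_OTEL_SERVICE_NAME", "cua-sdk"),
    )
```

Passing a plain dict, such as `load_config({"CUA_TELEMETRY_DISABLED": "true"})`, makes the behaviour easy to check without touching the real environment.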
### Installation
```bash
# Install with telemetry support
# (quote the extras so shells such as zsh do not expand the brackets)
pip install "cua-core[telemetry]"
pip install "cua-agent[telemetry]"
pip install "cua-computer[telemetry]"
# Or install all telemetry deps
pip install "cua-core[otel,sentry]"
```
## Testing Locally
```python
import os

# Set the endpoint before importing the telemetry module
os.environ["CUA_OTEL_ENDPOINT"] = "https://otel.cua.ai"

from core.telemetry import otel
from core.telemetry.otel import (
    _initialize_otel,
    record_error,
    record_operation,
    record_tokens,
)

_initialize_otel()

# Record test metrics
record_operation("agent.run", duration_seconds=0.5, status="success")
record_error("TestError", "agent.run")
record_tokens(prompt_tokens=100, completion_tokens=50, model="claude-3-5-sonnet")

# Force flush so the metrics are exported before the script exits
otel._meter_provider.force_flush(timeout_millis=30000)
```
**Verify in Prometheus:**
```bash
ssh -i ~/.ssh/rw.pem root@35.92.213.109 \
"curl -s 'http://localhost:9090/api/v1/query?query=cua_sdk_operations_total' | jq '.data.result'"
```
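If you capture the raw API response instead of piping it through `jq`, the result can be summarized in Python. This is a sketch under assumptions: it relies on the standard Prometheus instant-query response shape (`status`, then `data.result[]` entries with `metric` and `value`), and `operations_by_name` is a hypothetical helper.

```python
import json


def operations_by_name(response_text):
    """Sum counter values per `operation` label from an instant-query response."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")

    totals = {}
    for series in payload["data"]["result"]:
        operation = series["metric"].get("operation", "<unknown>")
        # Instant-vector samples are [unix_timestamp, "value-as-string"].
        _, value = series["value"]
        totals[operation] = totals.get(operation, 0.0) + float(value)
    return totals
```

After running the test script above, an `agent.run` entry in the returned dict indicates that the metric reached Prometheus.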
## Monitoring URLs
| Service | URL |
|---------|-----|
| Grafana Dashboard | https://grafana.cua.ai/d/cua-sdk-metrics |
| Alertmanager | https://am.cua.ai |
| OTLP Endpoint | https://otel.cua.ai |
## Rollback
### Cloud (NixOS)
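The NixOS instance auto-rebuilds from main. To roll back to the previous system generation, run the command below. Because the instance rebuilds from main, revert the offending commit as well; otherwise the next rebuild will likely reapply it.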
```bash
ssh -i ~/.ssh/rw.pem root@35.92.213.109 \
"nix-env --profile /nix/var/nix/profiles/system --rollback && /nix/var/nix/profiles/system/bin/switch-to-configuration switch"
```
### SDK (PyPI)
Publish a new version with the fix, or have users pin the previous version:
```bash
pip install cua-core==0.1.9 # Previous version
```
## Verified Test Results
| Test | Status | Evidence |
|------|--------|----------|
| SDK → OTLP export | ✅ | Metrics received at otel.cua.ai |
| OTLP → Prometheus | ✅ | `cua_sdk_operations_total` queryable |
| Dashboard loads | ✅ | "CUA SDK Metrics" in Grafana |
| Alert rules healthy | ✅ | All 4 rules state=inactive, health=ok |
| Queries valid | ✅ | rate(), histogram_quantile() return success |