Created
December 17, 2025 17:42
# CUA SDK Telemetry - Deployment Strategy

## Overview

This document outlines the deployment strategy for CUA SDK telemetry instrumentation (OpenTelemetry + Sentry).
## Architecture

```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────┐     ┌─────────┐
│    CUA SDK      │────▶│   otel.cua.ai    │────▶│ Prometheus  │────▶│ Grafana │
│    (Python)     │     │ (OTLP Collector) │     │             │     │         │
└─────────────────┘     └──────────────────┘     └─────────────┘     └─────────┘
                                                        │
                                                        ▼
                                                 ┌─────────────┐
                                                 │ Alertmanager│────▶ Slack
                                                 └─────────────┘
```
## Pull Requests

| Repo | PR | Branch | Description | Status |
|------|-----|--------|-------------|--------|
| cloud | [#562](https://github.com/trycua/cloud/pull/562) | `feat/cua-sdk-dashboard` | Grafana dashboard + alert rules | ✅ Merged |
| cua | [#661](https://github.com/trycua/cua/pull/661) | `feat/otel-sentry-core` | Core OTEL/Sentry modules | Pending |
| cua | [#662](https://github.com/trycua/cua/pull/662) | `feat/otel-sentry-agent` | Agent callback instrumentation | Pending |
| cua | [#663](https://github.com/trycua/cua/pull/663) | `feat/otel-sentry-computer` | Computer interface instrumentation | Pending |
## Deployment Order

### Step 1: Cloud Infrastructure (✅ Complete)

The Grafana dashboard and Prometheus alert rules have been deployed via PR #562.

**Verification:**

- Dashboard: https://grafana.cua.ai/d/cua-sdk-metrics
- Alert rules: 4 CuaSDK* rules active in Prometheus
### Step 2: CUA SDK Packages (PyPI)

**Merge order matters due to dependencies:**

```
1. PR #661 (cua-core)     ← Merge FIRST - contains core telemetry modules
2. PR #662 (cua-agent)    ← Merge after core
3. PR #663 (cua-computer) ← Merge after core
```
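
The ordering constraint above is a small dependency graph, so a valid merge order can also be derived mechanically. The following is only an illustrative sketch using Python's standard `graphlib`; the `DEPENDS_ON` mapping and the `merge_order` helper are hypothetical names written for this document and are not part of the CUA repositories.

```python
from graphlib import TopologicalSorter

# Hypothetical mapping: package -> packages that must be merged before it.
# The edges mirror the merge order listed above (agent and computer need core).
DEPENDS_ON = {
    "cua-core": set(),
    "cua-agent": {"cua-core"},
    "cua-computer": {"cua-core"},
}


def merge_order(depends_on):
    """Return one valid merge order in which dependencies come first."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    print(merge_order(DEPENDS_ON))
```

`cua-core` always comes out first. The relative order of `cua-agent` and `cua-computer` is not constrained, which matches the note that both only need to merge after core.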
**Publish to PyPI after merging:**

```bash
# Option 1: Git tag trigger (replace <version> with the version being released)
git tag core-v0.1.10 && git push origin core-v0.1.10
git tag agent-v<version> && git push origin agent-v<version>
git tag computer-v<version> && git push origin computer-v<version>

# Option 2: Manual workflow dispatch
# Go to GitHub Actions → "Publish Core Package" → Run workflow
```
## Metrics Reference

### Four Golden Signals

| Signal | Metric | Type |
|--------|--------|------|
| Latency | `cua_sdk_operation_duration_seconds` | Histogram |
| Traffic | `cua_sdk_operations_total` | Counter |
| Errors | `cua_sdk_errors_total` | Counter |
| Saturation | `cua_sdk_concurrent_operations` | Gauge |

### Additional Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `cua_sdk_tokens_total` | Counter | Token usage by model and type |

### Labels

- `operation`: Operation name (e.g., `agent.run`, `computer.action.screenshot`)
- `status`: `success` or `error`
- `error_type`: Exception class name (for errors)
- `model`: LLM model name (for tokens)
- `token_type`: `prompt` or `completion` (for tokens)
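
To make the relationship between the `operation`, `status`, and `error_type` labels concrete, here is a minimal stdlib-only sketch of how a timed operation could produce them. `track_operation` and the `sink` list are hypothetical names used for illustration. This is not the SDK's instrumentation code, which would record to OpenTelemetry instruments rather than append to a list.

```python
import time
from contextlib import contextmanager


@contextmanager
def track_operation(operation, sink):
    """Time the wrapped block and append one record to `sink` (a plain list).

    Hypothetical helper: each record holds the label set described above
    plus the measured duration in seconds.
    """
    labels = {"operation": operation, "status": "success"}
    start = time.perf_counter()
    try:
        yield
    except Exception as exc:
        # On failure, flip the status and attach the exception class name.
        labels["status"] = "error"
        labels["error_type"] = type(exc).__name__
        raise
    finally:
        sink.append(
            {"labels": labels, "duration_seconds": time.perf_counter() - start}
        )


if __name__ == "__main__":
    events = []
    with track_operation("agent.run", events):
        pass
    print(events[0]["labels"])
```

The exception is re-raised after the record is written, so wrapping a call this way would not change the caller's error handling.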
## Alert Rules

| Alert | Condition | Severity |
|-------|-----------|----------|
| CuaSDKHighErrorRate | Error rate > 5% for 5m | warning |
| CuaSDKHighLatency | P99 latency > 30s for 5m | warning |
| CuaSDKHighLatencyCritical | P99 latency > 60s for 5m | critical |
| CuaSDKHighSaturation | Concurrent ops > 50 for 5m | warning |
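
The thresholds in the table can be read as simple strict comparisons. The sketch below encodes them in plain Python to show which alerts a given reading would match. `Snapshot` and `matching_alerts` are hypothetical names, and the sketch deliberately ignores the `for 5m` hold period that Prometheus applies before an alert fires. The actual rules are the ones deployed via PR #562.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """One hypothetical point-in-time reading of the SDK metrics."""
    error_rate: float      # fraction of failed operations, 0.0 to 1.0
    p99_latency_s: float   # 99th percentile latency in seconds
    concurrent_ops: int    # operations currently in flight


def matching_alerts(s):
    """Return (alert name, severity) pairs whose threshold is exceeded."""
    alerts = []
    if s.error_rate > 0.05:
        alerts.append(("CuaSDKHighErrorRate", "warning"))
    if s.p99_latency_s > 30:
        alerts.append(("CuaSDKHighLatency", "warning"))
    if s.p99_latency_s > 60:
        alerts.append(("CuaSDKHighLatencyCritical", "critical"))
    if s.concurrent_ops > 50:
        alerts.append(("CuaSDKHighSaturation", "warning"))
    return alerts


if __name__ == "__main__":
    print(matching_alerts(Snapshot(error_rate=0.10, p99_latency_s=45.0, concurrent_ops=10)))
```

As written in the table, a P99 above 60s satisfies both latency conditions, so the warning and the critical alert would match together unless Alertmanager is configured to inhibit one of them.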
## Configuration

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `CUA_TELEMETRY_DISABLED` | `false` | Set to `true` to disable telemetry |
| `CUA_OTEL_ENDPOINT` | `https://otel.cua.ai` | OTLP HTTP endpoint |
| `CUA_OTEL_SERVICE_NAME` | `cua-sdk` | Service name for metrics |
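
As a rough illustration of how these variables and their defaults could be resolved, the sketch below reads them into a small config object. `TelemetryConfig` and `load_telemetry_config` are hypothetical names, and the case-insensitive handling of `true` is an assumption. The SDK's own parsing may differ.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class TelemetryConfig:
    disabled: bool
    endpoint: str
    service_name: str


def load_telemetry_config(env=None):
    """Resolve telemetry settings from a mapping (defaults to os.environ).

    Defaults are taken from the table above. Treating "TRUE" or " true "
    as true is an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    raw_disabled = env.get("CUA_TELEMETRY_DISABLED", "false")
    return TelemetryConfig(
        disabled=raw_disabled.strip().lower() == "true",
        endpoint=env.get("CUA_OTEL_ENDPOINT", "https://otel.cua.ai"),
        service_name=env.get("CUA_OTEL_SERVICE_NAME", "cua-sdk"),
    )


if __name__ == "__main__":
    print(load_telemetry_config({"CUA_TELEMETRY_DISABLED": "true"}))
```

Passing a dict instead of relying on `os.environ` keeps the function easy to test without mutating the process environment.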
### Installation

```bash
# Install with telemetry support
# (quotes keep shells such as zsh from treating the brackets as a glob pattern)
pip install "cua-core[telemetry]"
pip install "cua-agent[telemetry]"
pip install "cua-computer[telemetry]"

# Or install all telemetry deps
pip install "cua-core[otel,sentry]"
```
## Testing Locally

```python
import os

# Set the endpoint before importing the telemetry module.
os.environ["CUA_OTEL_ENDPOINT"] = "https://otel.cua.ai"

from core.telemetry import otel
from core.telemetry.otel import (
    _initialize_otel,
    record_error,
    record_operation,
    record_tokens,
)

# Private initializer, called directly here for a local smoke test.
_initialize_otel()

# Record test metrics
record_operation("agent.run", duration_seconds=0.5, status="success")
record_error("TestError", "agent.run")
record_tokens(prompt_tokens=100, completion_tokens=50, model="claude-3-5-sonnet")

# Force flush so the metrics are exported before the script exits (30 s timeout)
otel._meter_provider.force_flush(timeout_millis=30000)
```
**Verify in Prometheus:**

```bash
ssh -i ~/.ssh/rw.pem root@35.92.213.109 \
  "curl -s 'http://localhost:9090/api/v1/query?query=cua_sdk_operations_total' | jq '.data.result'"
```
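
The same check can be scripted. The sketch below builds a URL for the Prometheus instant-query endpoint (`/api/v1/query`) and flattens an instant-vector response into a dict. `build_query_url` and `series_values` are hypothetical helpers, and the sample payload is made-up data in the documented response shape, not output from the CUA Prometheus instance.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, promql):
    """Build an instant-query URL with the PromQL expression URL-encoded."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def series_values(payload):
    """Map each series' labels (as a 'k=v,k=v' string) to its sample value."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    values = {}
    for series in body["data"]["result"]:
        labels = {k: v for k, v in series["metric"].items() if k != "__name__"}
        key = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
        _timestamp, value = series["value"]  # the value arrives as a string
        values[key] = float(value)
    return values


if __name__ == "__main__":
    # Made-up sample response, for illustration only.
    sample = json.dumps({
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [{
                "metric": {
                    "__name__": "cua_sdk_operations_total",
                    "operation": "agent.run",
                    "status": "success",
                },
                "value": [1734457320.0, "3"],
            }],
        },
    })
    print(build_query_url("http://localhost:9090", "cua_sdk_operations_total"))
    print(series_values(sample))
```

Fetching the URL is left out on purpose, since Prometheus is queried on `localhost` over SSH in the command above. Any HTTP client could supply the response body to `series_values`.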
## Monitoring URLs

| Service | URL |
|---------|-----|
| Grafana Dashboard | https://grafana.cua.ai/d/cua-sdk-metrics |
| Alertmanager | https://am.cua.ai |
| OTLP Endpoint | https://otel.cua.ai |
## Rollback

### Cloud (NixOS)

The NixOS instance auto-rebuilds from main. To roll back:

```bash
ssh -i ~/.ssh/rw.pem root@35.92.213.109 \
  "nix-env --profile /nix/var/nix/profiles/system --rollback && /nix/var/nix/profiles/system/bin/switch-to-configuration switch"
```

### SDK (PyPI)

Publish a new version with the fix, or have users pin the previous version:

```bash
pip install cua-core==0.1.9  # Previous version
```
## Verified Test Results

| Test | Status | Evidence |
|------|--------|----------|
| SDK → OTLP export | ✅ | Metrics received at otel.cua.ai |
| OTLP → Prometheus | ✅ | `cua_sdk_operations_total` queryable |
| Dashboard loads | ✅ | "CUA SDK Metrics" in Grafana |
| Alert rules healthy | ✅ | All 4 rules state=inactive, health=ok |
| Queries valid | ✅ | `rate()`, `histogram_quantile()` return success |