r33drichards · December 17, 2025 17:42
diff --git a/gistfile0.txt b/gistfile0.txt
 # CUA SDK Telemetry - Deployment Strategy

 ## Overview

 This document outlines the deployment strategy for CUA SDK telemetry instrumentation (OpenTelemetry + Sentry).

 ## Architecture

 ```
 ┌─────────────────┐     ┌──────────────────┐     ┌─────────────┐     ┌─────────┐
 │  CUA SDK        │────▶│  otel.cua.ai     │────▶│  Prometheus │────▶│ Grafana │
 │  (Python)       │     │  (OTLP Collector)│     │             │     │         │
 └─────────────────┘     └──────────────────┘     └─────────────┘     └─────────┘
                                                        │
                                                        ▼
                                                 ┌─────────────┐
                                                 │ Alertmanager│────▶ Slack
                                                 └─────────────┘
 ```

 ## Pull Requests

 | Repo | PR | Branch | Description | Status |
 |------|-----|--------|-------------|--------|
 | cloud | [#562](https://github.com/trycua/cloud/pull/562) | `feat/cua-sdk-dashboard` | Grafana dashboard + alert rules | ✅ Merged |
 | cua | [#661](https://github.com/trycua/cua/pull/661) | `feat/otel-sentry-core` | Core OTEL/Sentry modules | Pending |
 | cua | [#662](https://github.com/trycua/cua/pull/662) | `feat/otel-sentry-agent` | Agent callback instrumentation | Pending |
 | cua | [#663](https://github.com/trycua/cua/pull/663) | `feat/otel-sentry-computer` | Computer interface instrumentation | Pending |

 ## Deployment Order

 ### Step 1: Cloud Infrastructure (✅ Complete)

 The Grafana dashboard and Prometheus alert rules have been deployed via PR #562.

 **Verification:**
 - Dashboard: https://grafana.cua.ai/d/cua-sdk-metrics
 - Alert rules: 4 `CuaSDK*` rules loaded in Prometheus

 ### Step 2: CUA SDK Packages (PyPI)

 **Merge order matters: PRs #662 and #663 depend on the core telemetry modules introduced in #661.**

 ```
 1. PR #661 (cua-core)      ← Merge FIRST - contains core telemetry modules
 2. PR #662 (cua-agent)     ← Merge after core
 3. PR #663 (cua-computer)  ← Merge after core
 ```
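The ordering constraint above can be checked mechanically. The following is a minimal sketch using the standard-library `graphlib` module; the dependency map is an assumption derived from the merge list above (agent and computer each depend only on core), not something read from the actual package metadata.

```python
from graphlib import TopologicalSorter

# Each package maps to the set of packages it depends on.
# Assumed from the merge list: agent and computer both need core.
dependencies = {
    "cua-core": set(),
    "cua-agent": {"cua-core"},
    "cua-computer": {"cua-core"},
}

# static_order() yields every package after all of its dependencies.
merge_order = list(TopologicalSorter(dependencies).static_order())

for position, package in enumerate(merge_order, start=1):
    print(f"{position}. {package}")
```

Any order that places `cua-core` first is valid; the relative order of `cua-agent` and `cua-computer` does not matter because neither depends on the other.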

 **Publish to PyPI after merging:**

 ```bash
 # Option 1: Git tag trigger
 git tag core-v0.1.10 && git push origin core-v0.1.10
 git tag agent-v<version> && git push origin agent-v<version>
 git tag computer-v<version> && git push origin computer-v<version>

 # Option 2: Manual workflow dispatch
 # Go to GitHub Actions → "Publish Core Package" → Run workflow
 ```

 ## Metrics Reference

 ### Four Golden Signals

 | Signal | Metric | Type |
 |--------|--------|------|
 | Latency | `cua_sdk_operation_duration_seconds` | Histogram |
 | Traffic | `cua_sdk_operations_total` | Counter |
 | Errors | `cua_sdk_errors_total` | Counter |
 | Saturation | `cua_sdk_concurrent_operations` | Gauge |

 ### Additional Metrics

 | Metric | Type | Description |
 |--------|------|-------------|
 | `cua_sdk_tokens_total` | Counter | Token usage by model and type |

 ### Labels

 - `operation`: Operation name (e.g., `agent.run`, `computer.action.screenshot`)
 - `status`: `success` or `error`
 - `error_type`: Exception class name (for errors)
 - `model`: LLM model name (for tokens)
 - `token_type`: `prompt` or `completion` (for tokens)

 ## Alert Rules

 | Alert | Condition | Severity |
 |-------|-----------|----------|
 | CuaSDKHighErrorRate | Error rate > 5% for 5m | warning |
 | CuaSDKHighLatency | P99 latency > 30s for 5m | warning |
 | CuaSDKHighLatencyCritical | P99 latency > 60s for 5m | critical |
 | CuaSDKHighSaturation | Concurrent ops > 50 for 5m | warning |
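The thresholds in this table can be expressed as a simple check. This is a sketch only: the real rules are Prometheus alert rules deployed via PR #562, the `Snapshot` type and `firing_alerts` function are hypothetical, and the "for 5m" hold duration is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """One point-in-time reading of the signals the alerts look at."""
    error_rate: float      # fraction of failed operations, 0.0 to 1.0
    p99_latency_s: float   # 99th percentile operation duration in seconds
    concurrent_ops: float  # operations currently in flight


def firing_alerts(snapshot: Snapshot) -> list[str]:
    """Return the alert names whose threshold is exceeded by this snapshot."""
    alerts = []
    if snapshot.error_rate > 0.05:
        alerts.append("CuaSDKHighErrorRate")
    if snapshot.p99_latency_s > 30:
        alerts.append("CuaSDKHighLatency")
    if snapshot.p99_latency_s > 60:
        alerts.append("CuaSDKHighLatencyCritical")
    if snapshot.concurrent_ops > 50:
        alerts.append("CuaSDKHighSaturation")
    return alerts


print(firing_alerts(Snapshot(error_rate=0.10, p99_latency_s=45, concurrent_ops=10)))
```

Note that a P99 above 60s exceeds both latency thresholds, so the warning and critical latency alerts would fire together unless Alertmanager is configured to inhibit the lower severity.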

 ## Configuration

 ### Environment Variables

 | Variable | Default | Description |
 |----------|---------|-------------|
 | `CUA_TELEMETRY_DISABLED` | `false` | Set to `true` to disable telemetry |
 | `CUA_OTEL_ENDPOINT` | `https://otel.cua.ai` | OTLP HTTP endpoint |
 | `CUA_OTEL_SERVICE_NAME` | `cua-sdk` | Service name for metrics |
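As a rough illustration of how these variables and defaults fit together, the sketch below resolves them into a config object. `TelemetryConfig` and `load_telemetry_config` are hypothetical names, and treating any casing of `true` as "disabled" is an assumption; the SDK's actual parsing may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TelemetryConfig:
    disabled: bool
    endpoint: str
    service_name: str


def load_telemetry_config(env: Optional[Mapping[str, str]] = None) -> TelemetryConfig:
    """Resolve telemetry settings from the environment, applying the documented defaults."""
    env = os.environ if env is None else env
    return TelemetryConfig(
        # Assumption: only the string "true" (any casing) disables telemetry.
        disabled=env.get("CUA_TELEMETRY_DISABLED", "false").strip().lower() == "true",
        endpoint=env.get("CUA_OTEL_ENDPOINT", "https://otel.cua.ai"),
        service_name=env.get("CUA_OTEL_SERVICE_NAME", "cua-sdk"),
    )


# With no variables set, the documented defaults apply.
print(load_telemetry_config({}))
# Opting out only requires one variable.
print(load_telemetry_config({"CUA_TELEMETRY_DISABLED": "true"}))
```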

 ### Installation

 ```bash
 # Install with telemetry support (quotes keep shells such as zsh from globbing the brackets)
 pip install "cua-core[telemetry]"
 pip install "cua-agent[telemetry]"
 pip install "cua-computer[telemetry]"

 # Or install all telemetry deps
 pip install "cua-core[otel,sentry]"
 ```

 ## Testing Locally

 ```python
 import os

 # Set the endpoint before importing the telemetry module, in case the
 # module reads the variable at import time.
 os.environ["CUA_OTEL_ENDPOINT"] = "https://otel.cua.ai"

 from core.telemetry import otel
 from core.telemetry.otel import (
     _initialize_otel,  # private helper, called directly only for this test
     record_error,
     record_operation,
     record_tokens,
 )

 _initialize_otel()

 # Record one test data point for each metric family
 record_operation("agent.run", duration_seconds=0.5, status="success")
 record_error("TestError", "agent.run")
 record_tokens(prompt_tokens=100, completion_tokens=50, model="claude-3-5-sonnet")

 # Flush pending metrics before the script exits (30 s timeout)
 otel._meter_provider.force_flush(timeout_millis=30000)
 ```

 **Verify in Prometheus:**
 ```bash
 ssh -i ~/.ssh/rw.pem root@35.92.213.109 \
  "curl -s 'http://localhost:9090/api/v1/query?query=cua_sdk_operations_total' | jq '.data.result'"
 ```
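If the full query response is saved instead of being piped through `jq`, it can be checked with a few lines of Python. This sketch relies on the standard Prometheus HTTP API instant-query response shape (`status`, `data.resultType`, `data.result[].metric`, `data.result[].value`); the `SAMPLE` payload and the `summarize` helper are made up for illustration and are not output captured from this deployment.

```python
import json

# Illustrative payload in the Prometheus instant-query format (not real data).
SAMPLE = """
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "cua_sdk_operations_total",
          "operation": "agent.run",
          "status": "success"
        },
        "value": [1700000000, "1"]
      }
    ]
  }
}
"""


def summarize(payload: str) -> list:
    """Return (labels, value) pairs from an instant-query response."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    # Each sample is [unix_timestamp, "value-as-string"].
    return [(item["metric"], float(item["value"][1])) for item in body["data"]["result"]]


for labels, value in summarize(SAMPLE):
    print(labels["operation"], labels["status"], value)
```

An empty `result` list means the query succeeded but no series matched, which usually indicates the metrics have not been exported or scraped yet.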

 ## Monitoring URLs

 | Service | URL |
 |---------|-----|
 | Grafana Dashboard | https://grafana.cua.ai/d/cua-sdk-metrics |
 | Alertmanager | https://am.cua.ai |
 | OTLP Endpoint | https://otel.cua.ai |

 ## Rollback

 ### Cloud (NixOS)
 The NixOS instance auto-rebuilds from main. To roll back to the previous system generation:
 ```bash
 ssh -i ~/.ssh/rw.pem root@35.92.213.109 \
  "nix-env --profile /nix/var/nix/profiles/system --rollback && /nix/var/nix/profiles/system/bin/switch-to-configuration switch"
 ```

 ### SDK (PyPI)
 Publish a new version with fixes, or have users pin to the previous version:
 ```bash
 pip install cua-core==0.1.9  # Previous version
 ```

 ## Verified Test Results

 | Test | Status | Evidence |
 |------|--------|----------|
 | SDK → OTLP export | ✅ | Metrics received at otel.cua.ai |
 | OTLP → Prometheus | ✅ | `cua_sdk_operations_total` queryable |
 | Dashboard loads | ✅ | "CUA SDK Metrics" in Grafana |
 | Alert rules healthy | ✅ | All 4 rules state=inactive, health=ok |
 | Queries valid | ✅ | rate(), histogram_quantile() return success |