Anthropic API limits
The logic is actually quite simple. When Claude Code calls the /v1/messages API, the backend server returns something like
```
anthropic-ratelimit-unified-5h-reset: 1765944000
anthropic-ratelimit-unified-5h-status: allowed
anthropic-ratelimit-unified-5h-utilization: 0.042598363636363636
anthropic-ratelimit-unified-7d-reset: 1766030400
anthropic-ratelimit-unified-7d-status: allowed
anthropic-ratelimit-unified-7d-utilization: 0.3068459187383675
anthropic-ratelimit-unified-fallback-percentage: 0.5
```
in the response headers.
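A minimal sketch of reading these headers into a per-window structure, using the sample values above. The function name `parse_unified_limits` is made up for illustration, and treating `reset` as a Unix timestamp in seconds is an assumption based on the magnitude of the values, not something the API documents here.

```python
from datetime import datetime, timezone

PREFIX = "anthropic-ratelimit-unified-"

def parse_unified_limits(headers):
    """Group the 5h/7d header fields by window; other headers are ignored."""
    windows = {}
    for name, value in headers.items():
        name = name.lower()
        if not name.startswith(PREFIX):
            continue
        window, _, field = name[len(PREFIX):].partition("-")
        if window not in ("5h", "7d"):
            continue  # e.g. fallback-percentage, which is not interpreted here
        if field == "utilization":
            parsed = float(value)
        elif field == "reset":
            # Assumption: the value is a Unix timestamp in seconds.
            parsed = datetime.fromtimestamp(int(value), tz=timezone.utc)
        else:
            parsed = value
        windows.setdefault(window, {})[field] = parsed
    return windows

# Header values copied from the sample response above.
headers = {
    "anthropic-ratelimit-unified-5h-reset": "1765944000",
    "anthropic-ratelimit-unified-5h-status": "allowed",
    "anthropic-ratelimit-unified-5h-utilization": "0.042598363636363636",
    "anthropic-ratelimit-unified-7d-reset": "1766030400",
    "anthropic-ratelimit-unified-7d-status": "allowed",
    "anthropic-ratelimit-unified-7d-utilization": "0.3068459187383675",
    "anthropic-ratelimit-unified-fallback-percentage": "0.5",
}

limits = parse_unified_limits(headers)
print(limits["5h"]["utilization"], limits["7d"]["utilization"])
```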
By comparing the change in 5h/7d utilization between requests against each request's input/output/cache token counts, you can derive how each token type is weighted, provided you sample enough data points. You can also try this yourself with a MITM proxy, but be aware that it is a potential ToS violation because you have to get past certificate pinning.
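One way to turn such samples into weights is an ordinary least-squares fit of `Δutil ≈ W_in * input_tokens + W_out * output_tokens`. The sketch below is not necessarily how the numbers in this gist were produced. The helper `fit_weights` is hypothetical, and the samples are synthetic, generated from assumed weights purely to show that the fit recovers them.

```python
def fit_weights(samples):
    """Least-squares fit of delta_util = w_in * in_tok + w_out * out_tok.

    samples: list of (input_tokens, output_tokens, delta_util) tuples.
    Solves the 2x2 normal equations with Cramer's rule.
    """
    s_ii = sum(i * i for i, o, d in samples)
    s_io = sum(i * o for i, o, d in samples)
    s_oo = sum(o * o for i, o, d in samples)
    s_id = sum(i * d for i, o, d in samples)
    s_od = sum(o * d for i, o, d in samples)
    det = s_ii * s_oo - s_io * s_io
    if det == 0:
        raise ValueError("samples do not separate input from output tokens")
    w_in = (s_id * s_oo - s_od * s_io) / det
    w_out = (s_od * s_ii - s_id * s_io) / det
    return w_in, w_out

# Synthetic data: NOT measurements. Generated from assumed weights.
ASSUMED_W_IN = 6.07e-08
ASSUMED_W_OUT = 3.03e-07
token_mixes = [(1000, 10), (200, 500), (3000, 50), (50, 1200)]
samples = [(i, o, i * ASSUMED_W_IN + o * ASSUMED_W_OUT) for i, o in token_mixes]

w_in, w_out = fit_weights(samples)
print(w_in, w_out, w_out / w_in)
```

With real data the samples should vary the input/output mix as much as possible, otherwise the two weights cannot be told apart.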
Additional findings:
Unlike standard API pricing, 5m cache writes weigh the same as standard input tokens (1x instead of 1.25x), and cache reads are completely free (instead of 0.1x standard input). So the actual API-equivalent value of the quota is even higher.
The weighting between standard input and standard output tokens matches API pricing: output counts 5x input (standard rates are $5/$25 per MTok).
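As a rough illustration of these findings, the sketch below computes a weighted token count under the observed quota weights and under the standard API pricing multipliers. The dictionary and function names are made up for this example, the token counts are taken from the usage object quoted later in this gist, and the 1x cache-write weight applies only to 5m writes, since 1h writes were not measured here.

```python
# Weights relative to one standard input token.
OBSERVED_QUOTA_WEIGHTS = {
    "input_tokens": 1.0,
    "cache_creation_input_tokens": 1.0,  # 5m cache writes only
    "cache_read_input_tokens": 0.0,
    "output_tokens": 5.0,
}
API_PRICE_WEIGHTS = {
    "input_tokens": 1.0,
    "cache_creation_input_tokens": 1.25,  # 5m cache writes
    "cache_read_input_tokens": 0.1,
    "output_tokens": 5.0,
}

def weighted_tokens(usage, weights):
    """Sum token counts scaled by their weight; missing fields count as 0."""
    return sum(w * usage.get(field, 0) for field, w in weights.items())

usage = {
    "input_tokens": 3,
    "cache_creation_input_tokens": 4325,
    "cache_read_input_tokens": 0,
    "output_tokens": 1,
}

quota_units = weighted_tokens(usage, OBSERVED_QUOTA_WEIGHTS)
api_units = weighted_tokens(usage, API_PRICE_WEIGHTS)
print(quota_units, api_units)
```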
Some of my experiment logs, to prove that I didn't make up the numbers /s
```
Δutil_5h (probe before -> cache creation) = 0.00026263636363636246
Δutil_5h (cache creation -> cache read) = 5.454545454544601e-07
cache_creation_input_tokens=4325
cache_read_input_tokens=4325
Cost_cache_create_5h = 6.072517078297398e-08
Cost_cache_read_5h = 1.261166579085457e-10
Cost_cache_create_5h / Standard_input_cost_in_5h = 1.0006936416184973
================================================================================
[compare] derived weights (5h)
W_in (Cost_in_5h) = 6.068307847420574e-08
W_out (Cost_out_5h) = 3.030372626449337e-07
Ratio W_out/W_in = 4.993768778123942
```
Note that Δutil_5h (cache creation -> cache read) is actually caused by the dummy input_tokens and the single output token. The full usage object for that response is:
"usage": {
  "input_tokens": 3,
  "cache_creation_input_tokens": 4325,
  "cache_read_input_tokens": 0,
  "cache_creation": {
    "ephemeral_5m_input_tokens": 4325,
    "ephemeral_1h_input_tokens": 0
  },
  "output_tokens": 1,
  "service_tier": "standard"
}
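The arithmetic behind the log can be replayed from the logged values alone. In the sketch below every number is copied from the log and usage object above. Treating each per-token cost as delta divided by token count is my reading of how the log lines relate, and the variable names are made up.

```python
# Values copied from the experiment log above.
delta_create = 0.00026263636363636246   # probe before -> cache creation
delta_read = 5.454545454544601e-07      # cache creation -> cache read
cached_tokens = 4325
w_in = 6.068307847420574e-08            # W_in (Cost_in_5h)
w_out = 3.030372626449337e-07           # W_out (Cost_out_5h)

# Per-token utilization cost of a cache write and of a cache read.
cost_create = delta_create / cached_tokens
cost_read = delta_read / cached_tokens

# Cache write cost relative to a standard input token (about 1x).
create_vs_input = cost_create / w_in

# Output weight relative to input weight (about 5x).
out_vs_in = w_out / w_in

# Utilization expected from the 3 dummy input tokens and 1 output token.
dummy_cost = 3 * w_in + 1 * w_out

print(cost_create, cost_read, create_vs_input, out_vs_in, dummy_cost)
```

The dummy tokens alone predict roughly 4.85e-07 of utilization, which accounts for most of the 5.45e-07 observed on the cache-read request and is consistent with cache reads costing nothing.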
@LawtonRescue
I love it.

@RahulSDeshpande
🤯 🤯 🤯