first under-load deploy of catalog happened today. Shot up to 100+ pods. We got a bunch of errors. Deploy happened b/c tf/reliability change (aka zero code change). Why?
⏺ Let me investigate. I'll look at the deployment configuration and recent changes to understand what might have caused this.
⏺ Task(Investigate deploy scaling issue) ⎿ The user doesn't want to proceed with this tool use. The tool use was rejected (eg. if it was a file edit, the new_string was NOT written to the file). STOP what you are doing and wait for the user to tell you how to proceed.
coworker suggested: After the E2E test passes on 1 pod, the rollout immediately shifts to 100% traffic weight and scales all pods at once. With 36 replicas currently running, this means ~36 new pods come up while 36 old pods are terminated.
That seems weird/bad. This is definitely a k8s thing, not a java thing.
⏺ That does sound like a problematic rollout strategy. Let me look at the Argo Rollouts configuration.
⏺ Glob(/rollout.yaml)
  ⎿ /Users/justinabrahms/src/github.com/thrivemarket/tm-catalog/k8s/clusters/prod/patch-rollout.yaml
    /Users/justinabrahms/src/github.com/thrivemarket/tm-catalog/k8s/base/rollout.yaml
⏺ Glob(/argo*.yaml)
  ⎿ No files found
⏺ Glob(**/*.yaml)
  ⎿ /Users/justinabrahms/src/github.com/thrivemarket/tm-catalog/k8s/base/configmap.yaml
    /Users/justinabrahms/src/github.com/thrivemarket/tm-catalog/k8s/base/pdb.yaml
    /Users/justinabrahms/src/github.com/thrivemarket/tm-catalog/k8s/clusters/prod/patch-configmap.yaml
    /Users/justinabrahms/src/github.com/thrivemarket/tm-catalog/k8s/clusters/prod/patch-scaling.yaml
    /Users/justinabrahms/src/github.com/thrivemarket/tm-catalog/k8s/clusters/prod/patch-rollout.yaml
    /Users/justinabrahms/src/github.com/thrivemarket/tm-catalog/k8s/clusters/dev/patch-configmap.yaml
    /Users/justinabrahms/src/github.com/thrivemarket/tm-catalog/k8s/base/scaling.yaml
    /Users/justinabrahms/src/github.com/thrivemarket/tm-catalog/k8s/base/analysistemplate.yaml
    /Users/justinabrahms/src/github.com/thrivemarket/tm-catalog/k8s/base/ingress.yaml
    /Users/justinabrahms/src/github.com/thrivemarket/tm-catalog/k8s/base/kustomization.yaml
    /Users/justinabrahms/src/github.com/thrivemarket/tm-catalog/k8s/base/service.yaml
    /Users/justinabrahms/src/github.com/thrivemarket/tm-catalog/k8s/clusters/dev/kustomization.yaml
    /Users/justinabrahms/src/github.com/thrivemarket/tm-catalog/k8s/clusters/dev/patch-ingress.yaml
    /Users/justinabrahms/src/github.com/thrivemarket/tm-catalog/k8s/clusters/prod/kustomization.yaml
    /Users/justinabrahms/src/github.com/thrivemarket/tm-catalog/k8s/clusters/prod/patch-ingress.yaml
    /Users/justinabrahms/src/github.com/thrivemarket/tm-catalog/k8s/base/rollout.yaml
⏺ Read(/Users/justinabrahms/src/github.com/thrivemarket/tm-catalog/k8s/base/rollout.yaml)
  ⎿  1→---
     2→apiVersion: argoproj.io/v1alpha1
     3→kind: Rollout
     4→metadata:
     5→  name: tm-catalog
     6→spec:
     7→  selector:
     8→    matchLabels:
     9→      app.kubernetes.io/name: tm-catalog
    10→  strategy:
    11→    canary:
    12→      canaryService: tm-catalog-canary
    13→      stableService: tm-catalog-stable
    14→      trafficRouting:
    15→        alb:
    16→          ingress: tm-catalog
    17→          servicePort: 80
    18→          rootService: tm-catalog-root
    19→      canaryMetadata:
    20→        labels: [..snip.. 104 more lines]
⏺ Read(/Users/justinabrahms/src/github.com/thrivemarket/tm-catalog/k8s/clusters/prod/patch-rollout.yaml)
  ⎿  1→---
     2→apiVersion: argoproj.io/v1alpha1
     3→kind: Rollout
     4→metadata:
     5→  name: tm-catalog
     6→spec:
     7→  template:
     8→    spec:
     9→      securityContext:
    10→        runAsUser: 1000
    11→        runAsNonRoot: true
    12→        seccompProfile:
    13→          type: RuntimeDefault
    14→      containers:
    15→        - name: tm-catalog
    16→          resources:
    17→            requests:
    18→              cpu: "2274m"
    19→              memory: "6Gi"
    20→            limits: [..snip.. 7 more lines]
⏺ Read(/Users/justinabrahms/src/github.com/thrivemarket/tm-catalog/k8s/base/scaling.yaml)
  ⎿  1→---
     2→apiVersion: keda.sh/v1alpha1
     3→kind: ScaledObject
     4→
     5→metadata:
     6→  name: tm-catalog
     7→  annotations:
     8→    scaledobject.keda.sh/transfer-hpa-ownership: "true"
     9→
    10→spec:
    11→  scaleTargetRef:
    12→    apiVersion: argoproj.io/v1alpha1
    13→    kind: Rollout
    14→    name: tm-catalog
    15→  minReplicaCount: 3
    16→  maxReplicaCount: 10
    17→  triggers:
    18→    - type: cpu
    19→      metricType: AverageValue
    20→      metadata: [..snip.. 10 more lines]
⏺ Read(/Users/justinabrahms/src/github.com/thrivemarket/tm-catalog/k8s/clusters/prod/patch-scaling.yaml)
  ⎿  1→---
     2→apiVersion: keda.sh/v1alpha1
     3→kind: ScaledObject
     4→metadata:
     5→  name: tm-catalog
     6→spec:
     7→  minReplicaCount: 10
     8→  maxReplicaCount: 100
     9→
⏺ Found it. Your coworker is correct. Here's the problem:
The rollout strategy in k8s/base/rollout.yaml lines 25-44:
  steps:
    - setCanaryScale:
        replicas: 1               # Start with 1 canary pod
    - analysis:                   # Run E2E on that 1 pod
        templates:
          - templateName: tm-catalog-e2e
    - setWeight: 100              # ← BOOM: Jump straight to 100% traffic
    - setCanaryScale:
        matchTrafficWeight: true  # Now scale to match 100% traffic
    - analysis:                   # Capybara test after full cutover

What happens:
- Canary starts with 1 pod
- E2E passes on that 1 warm pod
- Traffic immediately shifts to 100%
- matchTrafficWeight: true triggers KEDA to scale up
- With prod config (minReplicaCount: 10, maxReplicaCount: 100), KEDA tries to scale to handle all traffic
- 36+ cold pods spin up simultaneously, all trying to:
  - warm up the JVM (Spring Boot is slow to start)
  - establish connections to Redis, MySQL, and external services
  - handle production traffic immediately
This is a classic thundering herd problem. Cold JVMs under load = high latency = failed health checks = more pod churn = more errors.
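For reference, here is roughly what prod's effective scaling config looks like once kustomize merges patch-scaling.yaml over base/scaling.yaml (a sketch; the CPU trigger target is snipped above, so it's elided here too):

  apiVersion: keda.sh/v1alpha1
  kind: ScaledObject
  metadata:
    name: tm-catalog
  spec:
    scaleTargetRef:
      apiVersion: argoproj.io/v1alpha1
      kind: Rollout
      name: tm-catalog
    minReplicaCount: 10    # prod patch overrides base value of 3
    maxReplicaCount: 100   # prod patch overrides base value of 10
    triggers:
      - type: cpu
        metricType: AverageValue
        # metadata (target value) elided above

So when 100% of traffic cuts over to cold canary pods at once, KEDA can push the Rollout toward its 100-replica ceiling, and since stable and canary pods coexist mid-rollout the cluster total can exceed that, which lines up with the 100+ pods you saw.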
The fix: Add intermediate traffic weight steps with pauses for warmup:
  steps:
    - setCanaryScale:
        replicas: 1
    - analysis: ...               # E2E
    - setWeight: 10
    - setCanaryScale:
        matchTrafficWeight: true
    - pause: { duration: 60s }
    - setWeight: 25
    - pause: { duration: 60s }
    - setWeight: 50
    - pause: { duration: 60s }
    - setWeight: 100
    - analysis: ...               # Capybara

Want me to create a ticket or PR for this?
Make a PR for it.
⏺ Glob(**/pull_request_template) ⎿ /Users/justinabrahms/src/github.com/thrivemarket/tm-catalog/.github/pull_request_template.md
⏺ Bash(git checkout -b gradual-rollout-traffic) ⎿ Switched to a new branch 'gradual-rollout-traffic'
⏺ Read(/Users/justinabrahms/src/github.com/thrivemarket/tm-catalog/.github/pull_request_template.md)
  ⎿  1→# Pull Request
     2→
     3→## What did you do?
     4→
     5→
     6→
     7→## Why?
     8→
     9→
    10→
    11→## Jira
    12→
    13→MOK-?????
    14→
⏺ Edit(/Users/justinabrahms/src/github.com/thrivemarket/tm-catalog/k8s/base/rollout.yaml) ⎿ The file /Users/justinabrahms/src/github.com/thrivemarket/tm-catalog/k8s/base/rollout.yaml has been updated successfully.
⏺ Now let me create a Jira ticket and the PR.
⏺ Skill(...) ⎿ Launching skill: create-jira-ticket
That's in base. Should it just be in prod?
⏺ Good point. Dev has min 3, max 10 replicas - the thundering herd is less of a problem there, and faster deploys help iteration. Let me move it to prod only.
⏺ Bash(git checkout -- k8s/base/rollout.yaml) ⎿
⏺ Edit(/Users/justinabrahms/src/github.com/thrivemarket/tm-catalog/k8s/clusters/prod/patch-rollout.yaml) ⎿ The file /Users/justinabrahms/src/github.com/thrivemarket/tm-catalog/k8s/clusters/prod/patch-rollout.yaml has been updated successfully.
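⏺ A sketch of what the prod-only override could look like, assuming the gradual steps now sit in clusters/prod/patch-rollout.yaml alongside the existing securityContext/resources settings (Rollout is a CRD, so kustomize should swap in this steps list wholesale rather than merging it with base):

  ---
  apiVersion: argoproj.io/v1alpha1
  kind: Rollout
  metadata:
    name: tm-catalog
  spec:
    strategy:
      canary:
        steps:
          - setCanaryScale:
              replicas: 1
          - analysis:                   # E2E on the single canary pod
              templates:
                - templateName: tm-catalog-e2e
          - setWeight: 10
          - setCanaryScale:
              matchTrafficWeight: true
          - pause: { duration: 60s }
          - setWeight: 25
          - pause: { duration: 60s }
          - setWeight: 50
          - pause: { duration: 60s }
          - setWeight: 100
          - analysis: ...               # Capybara suite (template name not shown above)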
⏺ Now let me create the Jira ticket and PR.
⏺ Bash(jira issue create
--debug
--project TM
--type "Debt"
--compon...)
⎿ Using config file: /Users/justinabrahms/.config/.jira/.config.yml
REQUEST DETAILS
------------------------------------------------------------
POST /rest/api/2/issue HTTP/1.1
Host: thrivemarket.atlassian.net
Accept: application/json
Authorization: Basic sekret=
Content-Type: application/json
✓ Issue created
https://thrivemarket.atlassian.net/browse/TM-1924
⏺ Bash(git add k8s/clusters/prod/patch-rollout.yaml && git commit -m "$(cat <<'EOF' ...)
  ⎿ [gradual-rollout-traffic 63e71cad] TM-1924: Gradual traffic rollout for catalog prod deploys
    1 file changed, 30 insertions(+)
    │ No config files with names ["lefthook" ".lefthook" ".config/lefthook"] have been found in "/Users/justinabrahms/src/github.com/thrivemarket/tm-catalog"
continue
⏺ Bash(git push -u origin gradual-rollout-traffic 2>&1)
⎿ │ No config files with names ["lefthook" ".lefthook" ".config/lefthook"] have been found in "/Users/justinabrahms/src/github.com/thrivemarket/tm-catalog"
remote:
remote: Create a pull request for 'gradual-rollout-traffic' on GitHub by visiting:
remote: https://github.com/ThriveMarket/tm-catalog/pull/new/gradual-rollout-traffic
remote:
remote: GitHub found 10 vulnerabilities on ThriveMarket/tm-catalog's default branch (3 high, 4 moderate, 3 low). To find out more, visit:
remote: https://github.com/ThriveMarket/tm-catalog/security/dependabot
remote:
To https://github.com/ThriveMarket/tm-catalog.git
* [new branch] gradual-rollout-traffic -> gradual-rollout-traffic
branch 'gradual-rollout-traffic' set up to track 'origin/gradual-rollout-traffic'.
⏺ Bash(gh pr create --title "TM-1924: Gradual traffic rollout for catalog prod deplo...) ⎿ https://github.com/ThriveMarket/tm-catalog/pull/682
⏺ Done.
- Jira: https://thrivemarket.atlassian.net/browse/TM-1924
- PR: https://github.com/ThriveMarket/tm-catalog/pull/682
To answer your earlier question about setCanaryScale matchTrafficWeight:
When matchTrafficWeight: true, Argo Rollouts scales the canary ReplicaSet proportionally to its traffic weight. So if traffic is at 25% and KEDA wants 40 total pods, the canary gets ~10 pods while stable keeps ~30.
Without it (or with replicas: 1), the canary stays at a fixed count regardless of traffic—which is why the E2E test runs on just 1 pod even though prod runs 36+.
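A minimal illustration of the two modes, using standard Argo Rollouts step syntax (not copied from the repo):

  steps:
    # fixed size: canary stays at exactly 1 pod regardless of traffic weight
    - setCanaryScale:
        replicas: 1
    # proportional: after shifting 25% of traffic, size the canary to ~25% of the desired replica count
    - setWeight: 25
    - setCanaryScale:
        matchTrafficWeight: true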