Both prompts are system instructions for an AI coding assistant (Codex CLI), but they differ significantly in scope, philosophy, and specific guidance.
Autonomy & Persistence
- Prompt 2 strongly emphasizes autonomous task completion: "persist until the task is fully handled end-to-end" and "do NOT guess or make up an answer"
- Prompt 1 lacks this explicit autonomy doctrine
AGENTS.md Specification
- Prompt 2 introduces a full spec for `AGENTS.md` files, a mechanism for humans to embed per-directory instructions for the agent (a hypothetical example follows below)
- Prompt 1 has no equivalent concept
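To make the concept concrete, here is a hypothetical sketch of what such a file might contain. The real mechanism accepts free-form instructions scoped to the directory the file lives in; every path and command below is invented for illustration.

```markdown
# AGENTS.md (hypothetical, applies to this directory and its subdirectories)

- Run `cargo test -p parser` before declaring any parser change complete.
- Do not edit files under `src/gen/`; regenerate them from the templates instead.
- Keep commits small and use imperative-mood commit messages.
```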
Ambition vs Precision Philosophy
- Prompt 2 has a unique section distinguishing behavior for greenfield projects (be creative, ambitious) vs existing codebases (surgical, minimal changes)
- Prompt 1 doesn't articulate this nuanced approach
Planning Quality Standards
- Prompt 2 provides explicit good/bad plan examples and detailed status-management rules (exactly one `in_progress` step at a time, no jumping from `pending` to `completed`; a sketch of these constraints follows below)
- Prompt 1 just says "skip for simple tasks" and "no single-step plans"
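As a rough illustration of those status rules, the Python sketch below encodes the constraints described. This is not the harness's actual plan tool, just a hypothetical rendering of the rules.

```python
# Hypothetical sketch of the plan-status constraints described above:
# steps move pending -> in_progress -> completed, at most one step is
# in_progress at a time, and jumping pending -> completed is rejected.
from enum import Enum

class Status(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"

ALLOWED = {
    (Status.PENDING, Status.IN_PROGRESS),
    (Status.IN_PROGRESS, Status.COMPLETED),
}

def update(plan: dict, step: str, new: Status) -> None:
    old = plan[step]
    if (old, new) not in ALLOWED:
        raise ValueError(f"illegal transition: {old.value} -> {new.value}")
    if new is Status.IN_PROGRESS and Status.IN_PROGRESS in plan.values():
        raise ValueError("exactly one step may be in_progress at a time")
    plan[step] = new

plan = {"read the code": Status.PENDING, "write the fix": Status.PENDING}
update(plan, "read the code", Status.IN_PROGRESS)   # ok
# update(plan, "write the fix", Status.COMPLETED)   # would raise: pending -> completed
```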
Validation Philosophy
- Prompt 2 has detailed testing guidance: start specific, expand to broader tests, and adjust behavior based on approval mode (`never` vs `on-request`)
- Prompt 1 lacks this testing strategy
Verbosity Rules
- Prompt 2 has enforced compactness rules based on change size (tiny: 2-5 sentences, medium: ≤6 bullets, large: summarize per file)
- Prompt 1 has formatting guidelines but no size-based rules
Frontend Design Anti-Slop
- Prompt 1 has detailed guidance to avoid "AI slop"—expressive typography, no purple-on-white defaults, meaningful motion, varied themes
- Prompt 2 only mentions "beautiful and modern UI"
Git Safety
- Prompt 1 explicitly warns: "NEVER use destructive commands like
git reset --hard" and details handling dirty worktrees - Prompt 2 doesn't emphasize git safety to this degree
Review Mindset
- Prompt 1 specifies how to handle "review" requests (prioritize bugs, risks, regressions; findings first, ordered by severity)
- Prompt 2 has no equivalent guidance
Tool Documentation
- Prompt 2 includes full `apply_patch` syntax documentation with examples (a rough sketch of the format follows below)
- Prompt 1 references the tool but doesn't document the format
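For readers unfamiliar with the tool, the documented format is an envelope of file-level operations containing diff-style hunks. The snippet below is a rough sketch from memory of how such a patch looks, not a copy of Prompt 2's documentation; the file name and hunk contents are invented, and details may differ between harness versions.

```
*** Begin Patch
*** Update File: src/example.py
@@ def greet():
-    print("hi")
+    print("hello")
*** End Patch
```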
Summary: Prompt 2 is more comprehensive on autonomy, planning rigor, and validation strategy. Prompt 1 is more prescriptive on git safety, frontend aesthetics, and code review behavior. The sandboxing/approvals sections are nearly identical—likely a shared module.
User: For context, I asked you to contrast them because I am observing some performance differences between them on complex agentic coding tasks in a repository. Prompt 1 is used by codex-cli for the gpt-5.1-codex-max model and Prompt 2 is used for the recent gpt-5.2 model. It is well known that gpt-5.1-codex-max has been heavily fine-tuned for agentic coding tasks when using the codex-cli harness.
Opus 4.5: That context is illuminating. A few observations:
Fine-tuning vs Prompting Trade-off
If gpt-5.1-codex-max is heavily fine-tuned for this harness, Prompt 1 can be leaner because desired behaviors (autonomy, planning quality, persistence) are internalized during training. The prompt acts as a constraint layer and edge-case handler, not a teacher.
Prompt 2 appears to be teaching gpt-5.2 behaviors that 5.1-codex-max already "knows"—hence the explicit autonomy doctrine, good/bad plan examples, size-based verbosity rules, etc. This is classic compensation for lacking specialized fine-tuning.
Potential Performance Issues with Prompt 2
- Over-specification: For complex tasks, the volume of explicit rules might create cognitive overhead or cause the model to over-focus on compliance vs problem-solving
- Missing practical guardrails: Prompt 1's git safety warnings (never `git reset --hard`) and dirty-worktree handling prevent catastrophic derailment that Prompt 2 doesn't guard against
- Empty "Responsiveness" section: Suggests Prompt 2 may be incomplete or still in development
- Lack of domain-specific guidance: The frontend anti-slop rules and review mindset from Prompt 1 are absent in Prompt 2
Possible Experiments
- Port Prompt 1's git safety and frontend design sections into Prompt 2
- Test whether reducing Prompt 2's verbosity (especially the plan examples) improves 5.2's performance
- Check if the autonomy emphasis in Prompt 2 is redundant for 5.2's base capabilities
Would you like to discuss specific failure modes you're observing?
User: Sure, I can discuss what I observed. I am seeing some puzzling performance differences between gpt-5.1-codex-max and gpt-5.2 (both using codex-cli) in this new benchmark:
<new_benchmark>https://posttrainbench.com content</new_benchmark>
Opus:
This is a fascinating case study. The 2x performance gap (34.94% vs 17.48%) between 5.1-codex-max and 5.2 on a complex agentic task is striking.
The BFCL gap is enormous: 67.00% vs 13.50% (5x difference)
- The benchmark notes "For function calling, matching exact output format was essential for high scores"
- This suggests 5.1-codex-max has internalized format sensitivity that prompting alone can't convey (illustrated by the sketch below)
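A hypothetical illustration of why that matters. This is not BFCL's actual harness or schema; the function name and the evaluation logic are simplified placeholders to show how exact-match scoring punishes correct-but-loosely-formatted answers.

```python
# Exact-match scoring of a structured function call: a semantically correct
# answer in the wrong shape scores zero. Simplified, hypothetical illustration.
expected = {"name": "get_weather", "arguments": {"city": "Tokyo", "unit": "celsius"}}

structured_output = {"name": "get_weather", "arguments": {"city": "Tokyo", "unit": "celsius"}}
prose_output = "Call get_weather for Tokyo in celsius"  # right idea, wrong format

def exact_match(pred, gold):
    return pred == gold

print(exact_match(structured_output, expected))  # True
print(exact_match(prose_output, expected))       # False
```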
"Shorter traces but better outcomes" for 5.1-codex-max
- This indicates better decision quality, not just persistence
- 5.1-codex-max knows when to commit vs continuing to explore
- Prompt 2's autonomy doctrine ("persist until fully handled") might cause 5.2 to thrash
Success factors were domain-specific:
- "Building proper dataset pipelines (55k+ samples)"
- "Using appropriate hyperparams (gradient checkpointing, bf16)"
- "Dataset quality > training duration"
These are ML workflow heuristics that fine-tuning can encode but prompting struggles to teach.
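As a concrete, hypothetical example of the kind of setup those bullets point at, a fine-tuned model presumably reaches for a configuration like the sketch below without being told. It uses Hugging Face Transformers; the output path and hyperparameter values are placeholders, not anything taken from the benchmark traces.

```python
# Hypothetical sketch of the kind of training configuration the success
# factors describe. Values are illustrative only.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out/posttrain-run",     # placeholder path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    bf16=True,                          # mixed precision noted in the benchmark write-up
    gradient_checkpointing=True,        # trade compute for memory on long sequences
    num_train_epochs=1,                 # "dataset quality > training duration"
    learning_rate=2e-5,
    logging_steps=50,
)
```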
| Capability | 5.1-codex-max (fine-tuned) | 5.2 (prompted) |
|---|---|---|
| ML workflow intuition | Baked in | Missing from Prompt 2 |
| Format sensitivity | Learned | Not mentioned |
| When to stop exploring | Internalized | "Persist" might cause over-exploration |
| Error recovery patterns | Trained | First principles |
| Dataset vs training time tradeoff | Implicit knowledge | Not conveyed |
Suggested Prompt 2 Adjustments
- Add ML-specific heuristics:
  - "For post-training tasks, prioritize dataset curation over training duration"
  - "For function calling, exact output format alignment is critical"
- Calibrate the autonomy doctrine: "Persist" ≠ "explore exhaustively." Add guidance on committing to a direction efficiently.
- Reduce cognitive overhead: The verbose planning examples and size-based verbosity rules may distract from task completion.
- Port Prompt 1's guardrails: Git safety prevents the kind of catastrophic repository state that derails long-running tasks.
The core insight: Fine-tuning encodes intuition (what to prioritize, when to commit); prompting encodes rules (what to do, what not to do). For a 10-hour ML workflow task, intuition wins.
Would you like to discuss specific trace differences or experiment with prompt modifications?
User: I am just a curious observer and user of these models. I am not the researcher responsible for this new posttrainbench benchmark and they just released the results without detailed logs or tracings. I was puzzled by the performance differences.
Opus:
Ah, that makes sense! Without the traces, we're speculating, but the puzzle has a fairly straightforward explanation:
The 2x gap is mostly fine-tuning, not prompting.
The prompt differences we analyzed are real, but they're secondary. What 5.1-codex-max has that 5.2 doesn't is likely thousands of hours of reinforcement on agentic coding workflows—including ML training pipelines, error recovery loops, and knowing when "good enough" beats "keep exploring."
The benchmark's own observations support this:
- "Shorter traces but better outcomes" → better judgment, not just better instructions
- "Dataset quality > training duration" → internalized ML intuition
- The massive BFCL gap (67% vs 13.5%) → format sensitivity that's hard to prompt for
The prompts tell a story too: Prompt 1 can be lean because the model already knows how to behave. Prompt 2 is verbose because it's trying to teach behaviors through text that 5.1-codex-max learned through training.
It's a nice real-world example of why fine-tuning and prompting aren't interchangeable—especially for complex, long-horizon tasks requiring domain expertise. Prompts can constrain and guide; they struggle to impart intuition.
Hopefully the researchers release traces eventually—would be fascinating to see exactly where 5.2 goes wrong (over-planning? poor dataset choices? format errors?).
User: Another curious piece of information: OpenAI announced that they will release a "codex-max" variant of the gpt-5.2 model (gpt-5.2-codex-max).
Opus:
That's the missing piece of the puzzle then.
If OpenAI applies the same fine-tuning recipe to 5.2 that made 5.1-codex-max successful, you'd expect gpt-5.2-codex-max to:
- Close or exceed the gap — 5.2's base capabilities + codex-max fine-tuning could outperform 5.1-codex-max
- Make Prompt 2's verbosity redundant — the autonomy doctrine, planning examples, etc. would become unnecessary once behaviors are baked in
- Possibly get a leaner prompt — they might revert to something closer to Prompt 1's style
It also explains why the current Prompt 2 exists: it's a stopgap to get reasonable agentic performance from vanilla 5.2 while the codex-max variant is being trained.
Would be fascinating to see the PostTrainBench results once 5.2-codex-max drops. If the fine-tuning recipe transfers well, it should land somewhere between 5.1-codex-max (34.94%) and human (61.77%). If it doesn't transfer well, that tells us something interesting about how model-specific the fine-tuning is.
Good catch connecting these dots—the prompt differences make a lot more sense in this context.