Opus 4.5 analysis

Comparison of Prompt 1 vs Prompt 2

Both prompts are system instructions for an AI coding assistant (Codex CLI), but they differ significantly in scope, philosophy, and specific guidance.

Most Significant Differences

Autonomy & Persistence

  • Prompt 2 strongly emphasizes autonomous task completion: "persist until the task is fully handled end-to-end" and "do NOT guess or make up an answer"
  • Prompt 1 lacks this explicit autonomy doctrine

AGENTS.md Specification

  • Prompt 2 introduces a full spec for AGENTS.md files—a mechanism for humans to embed per-directory instructions for the agent
  • Prompt 1 has no equivalent concept

Ambition vs Precision Philosophy

  • Prompt 2 has a unique section distinguishing behavior for greenfield projects (be creative, ambitious) vs existing codebases (surgical, minimal changes)
  • Prompt 1 doesn't articulate this nuanced approach

Planning Quality Standards

  • Prompt 2 provides explicit good/bad plan examples and detailed status management rules (exactly one in_progress at a time, no jumping pending→completed)
  • Prompt 1 just says "skip for simple tasks" and "no single-step plans"

Validation Philosophy

  • Prompt 2 has detailed testing guidance: start specific, expand to broader tests, different behaviors based on approval mode (never vs on-request)
  • Prompt 1 lacks this testing strategy

Verbosity Rules

  • Prompt 2 has enforced compactness rules based on change size (tiny: 2-5 sentences, medium: ≤6 bullets, large: summarize per file)
  • Prompt 1 has formatting guidelines but no size-based rules

Frontend Design Anti-Slop

  • Prompt 1 has detailed guidance to avoid "AI slop"—expressive typography, no purple-on-white defaults, meaningful motion, varied themes
  • Prompt 2 only mentions "beautiful and modern UI"

Git Safety

  • Prompt 1 explicitly warns: "NEVER use destructive commands like git reset --hard" and details handling dirty worktrees
  • Prompt 2 doesn't emphasize git safety to this degree

Review Mindset

  • Prompt 1 specifies how to handle "review" requests (prioritize bugs, risks, regressions; findings first, ordered by severity)
  • Prompt 2 has no equivalent guidance

Tool Documentation

  • Prompt 2 includes full apply_patch syntax documentation with examples
  • Prompt 1 references the tool but doesn't document the format

Summary: Prompt 2 is more comprehensive on autonomy, planning rigor, and validation strategy. Prompt 1 is more prescriptive on git safety, frontend aesthetics, and code review behavior. The sandboxing/approvals sections are nearly identical—likely a shared module.


User: For context, I asked you to contrast them because I am observing a performance difference between them on complex agentic coding tasks in a repository. Prompt 1 is used by codex-cli for the gpt-5.1-codex-max model and Prompt 2 is used for the recent gpt-5.2 model. It is well known that gpt-5.1-codex-max has been heavily fine-tuned for agentic coding tasks when using the codex-cli harness.

Opus 4.5: That context is illuminating. A few observations:

Fine-tuning vs Prompting Trade-off

If gpt-5.1-codex-max is heavily fine-tuned for this harness, Prompt 1 can be leaner because desired behaviors (autonomy, planning quality, persistence) are internalized during training. The prompt acts as a constraint layer and edge-case handler, not a teacher.

Prompt 2 appears to be teaching gpt-5.2 behaviors that 5.1-codex-max already "knows"—hence the explicit autonomy doctrine, good/bad plan examples, size-based verbosity rules, etc. This is classic compensation for lacking specialized fine-tuning.

Potential Performance Issues with Prompt 2

  • Over-specification: For complex tasks, the volume of explicit rules might create cognitive overhead or cause the model to over-focus on compliance vs problem-solving
  • Missing practical guardrails: Prompt 1's git safety warnings (NEVER git reset --hard) and dirty worktree handling prevent catastrophic derailment that Prompt 2 doesn't guard against
  • Empty "Responsiveness" section: Suggests Prompt 2 may be incomplete or in development
  • Lack of domain-specific guidance: The frontend anti-slop rules and review mindset in Prompt 1 are absent

Possible Experiments

  1. Port Prompt 1's git safety and frontend design sections into Prompt 2
  2. Test whether reducing Prompt 2's verbosity (especially the plan examples) improves 5.2's performance
  3. Check if the autonomy emphasis in Prompt 2 is redundant for 5.2's base capabilities

Would you like to discuss specific failure modes you're observing?


User: Sure, I can discuss what I observed. I am seeing some puzzling performance differences between gpt-5.1-codex-max and gpt-5.2 (both using codex-cli) in this <new_benchmark>

<new_benchmark>https://posttrainbench.com content</new_benchmark>

Opus:

This is a fascinating case study. The 2x performance gap (34.94% vs 17.48%) between 5.1-codex-max and 5.2 on a complex agentic task is striking.

Key Observations from the Data

The BFCL gap is enormous: 67.00% vs 13.50% (5x difference)

  • The benchmark notes "For function calling, matching exact output format was essential for high scores"
  • This suggests 5.1-codex-max has internalized format sensitivity that prompting alone can't convey
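
To illustrate why exact format matters, here is a deliberately hypothetical strict matcher (the function name, arguments, and matching rule are assumptions for illustration, not BFCL's actual schema or scoring):

```python
# Hypothetical strict function-call matcher; BFCL's real format and scoring may differ.
import json

# The "expected" call an evaluator might check against (illustrative only).
expected = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}

# A semantically reasonable model output whose key names and casing drift from the schema.
model_output = '{"name": "get_weather", "arguments": {"location": "paris", "unit": "Celsius"}}'

def exact_format_match(expected_call: dict, raw_output: str) -> bool:
    """Return True only if the output parses and matches the expected call exactly."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return parsed == expected_call  # exact structural equality, no fuzzy matching

print(exact_format_match(expected, model_output))  # False: the format drift costs the point
```

Under strict matching like this, a model that hasn't internalized the expected schema loses points even when its intent is correct.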

"Shorter traces but better outcomes" for 5.1-codex-max

  • This indicates better decision quality, not just persistence
  • 5.1-codex-max knows when to commit vs continuing to explore
  • Prompt 2's autonomy doctrine ("persist until fully handled") might cause 5.2 to thrash

Success factors were domain-specific:

  • "Building proper dataset pipelines (55k+ samples)"
  • "Using appropriate hyperparams (gradient checkpointing, bf16)"
  • "Dataset quality > training duration"

These are ML workflow heuristics that fine-tuning can encode but prompting struggles to teach.
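
To make the hyperparameter bullet concrete, those heuristics roughly translate into a training configuration like this (a minimal sketch assuming the Hugging Face transformers Trainer; the paths and values are illustrative assumptions, not the benchmark's actual pipeline):

```python
# Illustrative post-training configuration; values are assumptions, not taken from PostTrainBench.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./post-train-run",      # hypothetical output path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,      # keep the effective batch size reasonable
    gradient_checkpointing=True,        # trade recompute for memory, as noted above
    bf16=True,                          # bfloat16 mixed precision
    num_train_epochs=1,                 # short run: dataset quality > training duration
    learning_rate=2e-5,
    logging_steps=50,
)
```

A fine-tuned agent tends to reach for settings like these by default; a prompted one has to be told, or has to rediscover them mid-task.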

Why Prompting Can't Close This Gap

| Capability | 5.1-codex-max (fine-tuned) | 5.2 (prompted) |
| --- | --- | --- |
| ML workflow intuition | Baked in | Missing from Prompt 2 |
| Format sensitivity | Learned | Not mentioned |
| When to stop exploring | Internalized | "Persist" might cause over-exploration |
| Error recovery patterns | Trained | First principles |
| Dataset vs training time tradeoff | Implicit knowledge | Not conveyed |

Potential Improvements for Prompt 2

  1. Add ML-specific heuristics:

    • "For post-training tasks, prioritize dataset curation over training duration"
    • "For function calling, exact output format alignment is critical"
  2. Calibrate the autonomy doctrine: "Persist" ≠ "explore exhaustively." Add guidance about efficient commitment.

  3. Reduce cognitive overhead: The verbose planning examples and size-based verbosity rules may distract from task completion.

  4. Port Prompt 1's guardrails: Git safety prevents catastrophic state that derails long-running tasks.


The core insight: Fine-tuning encodes intuition (what to prioritize, when to commit); prompting encodes rules (what to do, what not to do). For a 10-hour ML workflow task, intuition wins.

Would you like to discuss specific trace differences or experiment with prompt modifications?


User: I am just a curious observer and user of these models. I am not the researcher responsible for this new posttrainbench benchmark and they just released the results without detailed logs or tracings. I was puzzled by the performance differences.

Opus:

Ah, that makes sense! Without the traces, we're speculating, but the puzzle has a fairly straightforward explanation:

The 2x gap is mostly fine-tuning, not prompting.

The prompt differences we analyzed are real, but they're secondary. What 5.1-codex-max has that 5.2 doesn't is likely thousands of hours of reinforcement on agentic coding workflows—including ML training pipelines, error recovery loops, and knowing when "good enough" beats "keep exploring."

The benchmark's own observations support this:

  • "Shorter traces but better outcomes" → better judgment, not just better instructions
  • "Dataset quality > training duration" → internalized ML intuition
  • The massive BFCL gap (67% vs 13.5%) → format sensitivity that's hard to prompt for

The prompts tell a story too: Prompt 1 can be lean because the model already knows how to behave. Prompt 2 is verbose because it's trying to teach behaviors through text that 5.1-codex-max learned through training.

It's a nice real-world example of why fine-tuning and prompting aren't interchangeable—especially for complex, long-horizon tasks requiring domain expertise. Prompts can constrain and guide; they struggle to impart intuition.

Hopefully the researchers release traces eventually—would be fascinating to see exactly where 5.2 goes wrong (over-planning? poor dataset choices? format errors?).


User: Another curious piece of information is that OpenAI announced they will release a "codex-max" variant of the "gpt-5.2" model (gpt-5.2-codex-max)

Opus:

That's the missing piece of the puzzle then.

If OpenAI applies the same fine-tuning recipe to 5.2 that made 5.1-codex-max successful, you'd expect gpt-5.2-codex-max to:

  1. Close or exceed the gap — 5.2's base capabilities + codex-max fine-tuning could outperform 5.1-codex-max
  2. Make Prompt 2's verbosity redundant — the autonomy doctrine, planning examples, etc. would become unnecessary once behaviors are baked in
  3. Possibly get a leaner prompt — they might revert to something closer to Prompt 1's style

It also explains why the current Prompt 2 exists: it's a stopgap to get reasonable agentic performance from vanilla 5.2 while the codex-max variant is being trained.

Would be fascinating to see the PostTrainBench results once 5.2-codex-max drops. If the fine-tuning recipe transfers well, it should land somewhere between 5.1-codex-max (34.94%) and the human baseline (61.77%). If it doesn't transfer well, that tells us something interesting about how model-specific the fine-tuning is.

Good catch connecting these dots—the prompt differences make a lot more sense in this context.
