Opus 4.5 analysis

Comparison of Prompt 1 vs Prompt 2

Both prompts are system instructions for an AI coding assistant (Codex CLI), but they differ significantly in scope, philosophy, and specific guidance.

Most Significant Differences

Autonomy & Persistence

  • Prompt 2 strongly emphasizes autonomous task completion: "persist until the task is fully handled end-to-end" and "do NOT guess or make up an answer"
  • Prompt 1 lacks this explicit autonomy doctrine

AGENTS.md Specification

  • Prompt 2 introduces a full spec for AGENTS.md files—a mechanism for humans to embed per-directory instructions for the agent
  • Prompt 1 has no equivalent concept

Ambition vs Precision Philosophy

  • Prompt 2 has a unique section distinguishing behavior for greenfield projects (be creative, ambitious) vs existing codebases (surgical, minimal changes)
  • Prompt 1 doesn't articulate this nuanced approach

Planning Quality Standards

  • Prompt 2 provides explicit good/bad plan examples and detailed status management rules (exactly one in_progress at a time, no jumping pending→completed)
  • Prompt 1 just says "skip for simple tasks" and "no single-step plans"

Validation Philosophy

  • Prompt 2 has detailed testing guidance: start specific, expand to broader tests, different behaviors based on approval mode (never vs on-request)
  • Prompt 1 lacks this testing strategy

Verbosity Rules

  • Prompt 2 has enforced compactness rules based on change size (tiny: 2-5 sentences, medium: ≤6 bullets, large: summarize per file)
  • Prompt 1 has formatting guidelines but no size-based rules

Frontend Design Anti-Slop

  • Prompt 1 has detailed guidance to avoid "AI slop"—expressive typography, no purple-on-white defaults, meaningful motion, varied themes
  • Prompt 2 only mentions "beautiful and modern UI"

Git Safety

  • Prompt 1 explicitly warns: "NEVER use destructive commands like git reset --hard" and details handling dirty worktrees
  • Prompt 2 doesn't emphasize git safety to this degree

Review Mindset

  • Prompt 1 specifies how to handle "review" requests (prioritize bugs, risks, regressions; findings first, ordered by severity)
  • Prompt 2 has no equivalent guidance

Tool Documentation

  • Prompt 2 includes full apply_patch syntax documentation with examples
  • Prompt 1 references the tool but doesn't document the format

Summary: Prompt 2 is more comprehensive on autonomy, planning rigor, and validation strategy. Prompt 1 is more prescriptive on git safety, frontend aesthetics, and code review behavior. The sandboxing/approvals sections are nearly identical—likely a shared module.


User: For context, I asked you to contrast them because I am observing a performance difference between them on complex agentic coding tasks in a repository. Prompt 1 is used by codex-cli for the gpt-5.1-codex-max model and Prompt 2 is used for the recent gpt-5.2 model. It is well known that gpt-5.1-codex-max has been heavily fine-tuned for agentic coding tasks when using the codex-cli harness.

Opus 4.5: That context is illuminating. A few observations:

Fine-tuning vs Prompting Trade-off

If gpt-5.1-codex-max is heavily fine-tuned for this harness, Prompt 1 can be leaner because desired behaviors (autonomy, planning quality, persistence) are internalized during training. The prompt acts as a constraint layer and edge-case handler, not a teacher.

Prompt 2 appears to be teaching gpt-5.2 behaviors that 5.1-codex-max already "knows"—hence the explicit autonomy doctrine, good/bad plan examples, size-based verbosity rules, etc. This is classic compensation for lacking specialized fine-tuning.

Potential Performance Issues with Prompt 2

  • Over-specification: For complex tasks, the volume of explicit rules might create cognitive overhead or cause the model to over-focus on compliance vs problem-solving
  • Missing practical guardrails: Prompt 1's git safety warnings (NEVER git reset --hard) and dirty worktree handling prevent catastrophic derailment that Prompt 2 doesn't guard against
  • Empty "Responsiveness" section: Suggests Prompt 2 may be incomplete or in development
  • Lack of domain-specific guidance: The frontend anti-slop rules and review mindset in Prompt 1 are absent

Possible Experiments

  1. Port Prompt 1's git safety and frontend design sections into Prompt 2
  2. Test whether reducing Prompt 2's verbosity (especially the plan examples) improves 5.2's performance
  3. Check if the autonomy emphasis in Prompt 2 is redundant for 5.2's base capabilities

Would you like to discuss specific failure modes you're observing?


User: Sure, I can discuss what I observed. I am seeing some puzzling performance differences between gpt-5.1-codex-max and gpt-5.2 (both using codex-cli) in this <new_benchmark>

<new_benchmark>https://posttrainbench.com content</new_benchmark>

Opus:

This is a fascinating case study. The 2x performance gap (34.94% vs 17.48%) between 5.1-codex-max and 5.2 on a complex agentic task is striking.

Key Observations from the Data

The BFCL gap is enormous: 67.00% vs 13.50% (5x difference)

  • The benchmark notes "For function calling, matching exact output format was essential for high scores"
  • This suggests 5.1-codex-max has internalized format sensitivity that prompting alone can't convey
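
To illustrate why exact format matters, here is a deliberately hypothetical strict matcher (the function name, arguments, and matching rule are assumptions for illustration, not BFCL's actual schema or scoring):

```python
# Hypothetical strict function-call matcher; BFCL's real format and scoring may differ.
import json

# The "expected" call an evaluator might check against (illustrative only).
expected = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}

# A semantically reasonable model output whose key names and casing drift from the schema.
model_output = '{"name": "get_weather", "arguments": {"location": "paris", "unit": "Celsius"}}'

def exact_format_match(expected_call: dict, raw_output: str) -> bool:
    """Return True only if the output parses and matches the expected call exactly."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return parsed == expected_call  # exact structural equality, no fuzzy matching

print(exact_format_match(expected, model_output))  # False: the format drift costs the point
```

Under strict matching like this, a model that hasn't internalized the expected schema loses points even when its intent is correct.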

"Shorter traces but better outcomes" for 5.1-codex-max

  • This indicates better decision quality, not just persistence
  • 5.1-codex-max knows when to commit vs continuing to explore
  • Prompt 2's autonomy doctrine ("persist until fully handled") might cause 5.2 to thrash

Success factors were domain-specific:

  • "Building proper dataset pipelines (55k+ samples)"
  • "Using appropriate hyperparams (gradient checkpointing, bf16)"
  • "Dataset quality > training duration"

These are ML workflow heuristics that fine-tuning can encode but prompting struggles to teach.
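
To make the hyperparameter bullet concrete, those heuristics roughly translate into a training configuration like this (a minimal sketch assuming the Hugging Face transformers Trainer; the paths and values are illustrative assumptions, not the benchmark's actual pipeline):

```python
# Illustrative post-training configuration; values are assumptions, not taken from PostTrainBench.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./post-train-run",      # hypothetical output path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,      # keep the effective batch size reasonable
    gradient_checkpointing=True,        # trade recompute for memory, as noted above
    bf16=True,                          # bfloat16 mixed precision
    num_train_epochs=1,                 # short run: dataset quality > training duration
    learning_rate=2e-5,
    logging_steps=50,
)
```

A fine-tuned agent tends to reach for settings like these by default; a prompted one has to be told, or has to rediscover them mid-task.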

Why Prompting Can't Close This Gap

| Capability | 5.1-codex-max (fine-tuned) | 5.2 (prompted) |
| --- | --- | --- |
| ML workflow intuition | Baked in | Missing from Prompt 2 |
| Format sensitivity | Learned | Not mentioned |
| When to stop exploring | Internalized | "Persist" might cause over-exploration |
| Error recovery patterns | Trained | First principles |
| Dataset vs training time tradeoff | Implicit knowledge | Not conveyed |

Potential Improvements for Prompt 2

  1. Add ML-specific heuristics:

    • "For post-training tasks, prioritize dataset curation over training duration"
    • "For function calling, exact output format alignment is critical"
  2. Calibrate the autonomy doctrine: "Persist" ≠ "explore exhaustively." Add guidance about efficient commitment.

  3. Reduce cognitive overhead: The verbose planning examples and size-based verbosity rules may distract from task completion.

  4. Port Prompt 1's guardrails: Git safety prevents catastrophic state that derails long-running tasks.


The core insight: Fine-tuning encodes intuition (what to prioritize, when to commit); prompting encodes rules (what to do, what not to do). For a 10-hour ML workflow task, intuition wins.

Would you like to discuss specific trace differences or experiment with prompt modifications?


User: I am just a curious observer and user of these models. I am not the researcher responsible for this new posttrainbench benchmark and they just released the results without detailed logs or tracings. I was puzzled by the performance differences.

Opus:

Ah, that makes sense! Without the traces, we're speculating, but the puzzle has a fairly straightforward explanation:

The 2x gap is mostly fine-tuning, not prompting.

The prompt differences we analyzed are real, but they're secondary. What 5.1-codex-max has that 5.2 doesn't is likely thousands of hours of reinforcement on agentic coding workflows—including ML training pipelines, error recovery loops, and knowing when "good enough" beats "keep exploring."

The benchmark's own observations support this:

  • "Shorter traces but better outcomes" → better judgment, not just better instructions
  • "Dataset quality > training duration" → internalized ML intuition
  • The massive BFCL gap (67% vs 13.5%) → format sensitivity that's hard to prompt for

The prompts tell a story too: Prompt 1 can be lean because the model already knows how to behave. Prompt 2 is verbose because it's trying to teach behaviors through text that 5.1-codex-max learned through training.

It's a nice real-world example of why fine-tuning and prompting aren't interchangeable—especially for complex, long-horizon tasks requiring domain expertise. Prompts can constrain and guide; they struggle to impart intuition.

Hopefully the researchers release traces eventually—would be fascinating to see exactly where 5.2 goes wrong (over-planning? poor dataset choices? format errors?).


User: Another curious piece of information is that OpenAI announced they will release a "codex-max" variant of the "gpt-5.2" model (gpt-5.2-codex-max)

Opus:

That's the missing piece of the puzzle then.

If OpenAI applies the same fine-tuning recipe to 5.2 that made 5.1-codex-max successful, you'd expect gpt-5.2-codex-max to:

  1. Close or exceed the gap — 5.2's base capabilities + codex-max fine-tuning could outperform 5.1-codex-max
  2. Make Prompt 2's verbosity redundant — the autonomy doctrine, planning examples, etc. would become unnecessary once behaviors are baked in
  3. Possibly get a leaner prompt — they might revert to something closer to Prompt 1's style

It also explains why the current Prompt 2 exists: it's a stopgap to get reasonable agentic performance from vanilla 5.2 while the codex-max variant is being trained.

Would be fascinating to see the PostTrainBench results once 5.2-codex-max drops. If the fine-tuning recipe transfers well, it should land somewhere between 5.1-codex-max (34.94%) and the human baseline (61.77%). If it doesn't transfer well, that tells us something interesting about how model-specific the fine-tuning is.

Good catch connecting these dots—the prompt differences make a lot more sense in this context.
