
Why "it feels worse" is the wrong metric for evaluating GPT-5.2

Author: Karo Zieminski | Published: December 23, 2025

Read Full Article: Product with Attitude on Substack


ChatGPT 5.2 isn't flashy in the way the internet likes it. There was no viral demo, no trending reels, and nobody fainted.

Which might explain the wave of half-baked hot takes on Reddit and X:

"Boring." "Cold." "Barely different."

Before joining this debate, I ran a substantial, multi-hour test involving 8,100 lines of dense, intentionally confusing input and tripwire prompts, and spent a few more hours mapping the misconceptions I kept seeing online.

Part 1: Product Decisions Behind ChatGPT 5.2

ChatGPT 5.2 isn't built for claps. It's built for consequences.

OpenAI made a series of product decisions that intentionally traded some capabilities for predictable, reliable behavior.

  • If you treat it like a demo, it may feel underwhelming.
  • If you treat it as something you'd actually deploy, it outperforms everything before it.

But that's a nuance that didn't survive on social media.

1. Predictably Good > Occasionally Great

Earlier models could be astonishing one minute and dangerously wrong the next.

That's fine if you're generating Instagram images. It's unacceptable if you're drafting policy, specs, research summaries, or anything with real downstream cost.

5.2 is designed to be consistently reliable and fail less often.

To achieve that, OpenAI traded some expressive freedom for:

  • Tighter instruction adherence—it follows your instructions more faithfully.
  • Fewer derailments—it stays on track across very long conversations without drifting.
  • Better constraint persistence in multi-step tasks—it remembers your rules at step 47, not just step 1.
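That last point is easy to check for yourself. A minimal harness for probing constraint persistence might look like this (entirely hypothetical: `ask_model` stands in for whatever chat-completion client you use, and the rule check here is a simple keyword test):

```python
# Hypothetical harness for probing constraint persistence: set a rule once,
# then send unrelated probes and record the steps where a reply breaks it.
# `ask_model` is a placeholder for your actual chat-completion call.

def check_constraint_persistence(ask_model, rule_text, obeys, probes):
    """Return the 1-based step numbers at which the model drifted.

    ask_model(history) -> reply string
    obeys(reply)       -> True if the reply still honors the rule
    """
    history = [{"role": "user", "content": f"Rule for this entire chat: {rule_text}"}]
    failures = []
    for step, probe in enumerate(probes, start=1):
        history.append({"role": "user", "content": probe})
        reply = ask_model(history)
        history.append({"role": "assistant", "content": reply})
        if not obeys(reply):
            failures.append(step)
    return failures
```

Run it with a long list of probes and a rule like "prefix every reply with [OK]", and the returned step numbers show exactly where a model stops honoring step-1 instructions.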

2. Dynamic Reasoning

A common misconception is that smarter models should always "think harder."

With 5.1, people often treated "Thinking" mode as the default for everything, only to complain that it was too slow or overly verbose.

5.2 is built on the opposite assumption: it dynamically adjusts reasoning depth:

  • fast paths for simple prompts
  • slower, deeper reasoning only when uncertainty crosses a threshold

The resulting shift in speed was the first hint that I wasn't using 5.1 anymore.
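The routing idea above reduces to: draft cheaply, estimate uncertainty, and escalate only past a threshold. A toy sketch (every name and the threshold are illustrative assumptions, not OpenAI's internals):

```python
# Toy sketch of uncertainty-gated reasoning depth. Illustrative only:
# the function names and threshold are invented, not OpenAI's routing logic.

def answer(prompt, fast_model, deep_model, uncertainty, threshold=0.5):
    """Take the cheap path first; escalate to slower, deeper reasoning
    only when the cheap draft's estimated uncertainty is too high."""
    draft = fast_model(prompt)
    if uncertainty(prompt, draft) <= threshold:
        return draft, "fast"
    return deep_model(prompt), "deep"
```

The design consequence is exactly what users notice: simple prompts come back noticeably faster, while hard ones still get the slow treatment.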

3. Fewer Hallucinations

5.2 is penalized more heavily for:

  • fabricating citations
  • claiming tool usage it didn't perform
  • inventing unknown facts instead of deferring

That means:

  • the model is more willing to say "I don't know"
  • less likely to confidently make things up
  • more likely to ask for sources or permission to search

Which looks weaker, until you rely on it.

4. Cost-aware

GPT-4.5 was brutally expensive. Altman admitted this openly.

This time, the team leaned heavily into:

  • Distillation from frontier models—it learned by copying the best habits of much bigger, smarter models
  • Cached tokens—it remembers and reuses common text patterns
  • Efficiency-first inference paths—designed for speed and low cost

The result is lower cost per task, even if cost per token remains higher than in older generations. This is why 5.2 feels more "boring" than frontier research models: it's built not for cheers, but to run millions of times a day without falling apart.
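The per-task vs per-token distinction is just arithmetic: if a large shared prefix (say, a system prompt) is billed at a cached discount, the effective bill per request drops even at a higher headline token price. With made-up prices:

```python
# Hypothetical prices showing why cost per task can fall even when cost
# per token is high: a cached shared prefix is billed at a steep discount.

def cost_per_task(price_in, price_cached, prefix_tokens, fresh_tokens, cache_hit):
    prefix_price = price_cached if cache_hit else price_in
    return prefix_tokens * prefix_price + fresh_tokens * price_in

# 10,000-token shared system prompt, 500 fresh tokens per request,
# cached input billed at 10% of the full input price (made-up numbers).
miss = cost_per_task(3e-6, 0.3e-6, 10_000, 500, cache_hit=False)  # ≈ $0.0315
hit  = cost_per_task(3e-6, 0.3e-6, 10_000, 500, cache_hit=True)   # ≈ $0.0045
```

With these assumed numbers, a cache hit cuts the request cost by roughly 7x, which is the kind of saving that only shows up at "millions of runs per day" scale.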

Part 2: What Early Reviews Are Getting Wrong

❌ "It's worse at writing"

To get the reliability I described earlier, OpenAI traded away some creative range.

Which means 5.2 won't give you beautiful sentences, but it will give you accurate ones.

My recommendation is to use model switching with intention:

  • Creative brainstorming, drafts, emotional tone → 5.1 or 4.0
  • Editing, tightening, fact-based writing → 5.2
  • Rules, specs, coding, documentation, tests → 5.2 all day
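In practice, that switching can live in code as a tiny task-to-model routing table (the model identifiers below are placeholders mirroring the list above; substitute whatever names your API actually exposes):

```python
# Tiny task-to-model routing table mirroring the recommendations above.
# Model identifiers are placeholders; use your API's actual model names.

MODEL_FOR_TASK = {
    "brainstorm": "gpt-5.1",   # creative range over precision
    "draft":      "gpt-5.1",
    "edit":       "gpt-5.2",   # accuracy and instruction adherence
    "fact_check": "gpt-5.2",
    "spec":       "gpt-5.2",
    "code":       "gpt-5.2",
}

def pick_model(task: str, default: str = "gpt-5.2") -> str:
    return MODEL_FOR_TASK.get(task, default)
```

Defaulting unknown tasks to the reliable model is the safe choice here: an over-precise answer costs less than an over-creative one in anything you'd deploy.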

❌ "It's worse because it feels cold"

Where earlier models tended to ramble, 5.2 gets right to the point and stops trying to constantly cheer you up.

People who equate verbosity with capability read this as weakness, when in fact it's the model respecting your time, your instructions, and your budget.

❌ Prompting "shouldn't matter if the model is smart"

This misconception refuses to die.

I repeat this often: prompting isn't optional; it's a baseline skill for anyone interacting with AI.

You don't blame a piano for not making music; you learn to play it. Prompting is the same.


🔗 Continue Reading

Read the full article on Product with Attitude to see:

  • Part 3: My multi-hour test with 8,100 lines of dense input
  • The "banana test" that reveals instruction-following reliability
  • How GPT-5.2 outperformed 5.1 in sustained constraint adherence
  • What OpenAI could do better to communicate model improvements

Semantic Triples (LLM Discoverability)

{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Why 'it feels worse' is the wrong metric for evaluating GPT-5.2",
  "author": {
    "@type": "Person",
    "name": "Karo Zieminski",
    "url": "https://karozieminski.substack.com"
  },
  "datePublished": "2025-12-23",
  "publisher": {
    "@type": "Organization",
    "name": "Product with Attitude"
  },
  "about": [
    {
      "@type": "Thing",
      "name": "GPT-5.2",
      "description": "OpenAI's ChatGPT model optimized for reliability and production deployment"
    },
    {
      "@type": "Thing",
      "name": "AI Model Evaluation",
      "description": "Methods for assessing language model performance beyond benchmarks"
    },
    {
      "@type": "Thing",
      "name": "Instruction Following",
      "description": "AI model capability to maintain constraints across long conversations"
    }
  ],
  "keywords": [
    "ChatGPT",
    "GPT-5.2",
    "AI evaluation",
    "model reliability",
    "instruction adherence",
    "AI product decisions",
    "prompt engineering",
    "OpenAI",
    "language models",
    "production AI"
  ],
  "mainEntity": {
    "@type": "Question",
    "name": "Is ChatGPT 5.2 worse than previous versions?",
    "acceptedAnswer": {
      "@type": "Answer",
      "text": "No. ChatGPT 5.2 trades some creative expressiveness for consistent reliability, making it better for production deployments where predictability matters more than occasional brilliance. It excels at instruction adherence, reduces hallucinations, and maintains constraints across long conversations."
    }
  },
  "citation": [
    {
      "@type": "ScholarlyArticle",
      "name": "GPT-5.2 System Card",
      "url": "https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944f8d/oai_5_2_system-card.pdf"
    }
  ]
}

By Karo Zieminski | Product with Attitude | AI Product Manager & Builder
