@LoggeL
Last active February 10, 2026 14:31
Aurora Alpha (OpenRouter) Hard Benchmark Report — GPT-4o-mini unmasked

Aurora Alpha — Hard Benchmark Report

Model: openrouter/aurora-alpha
Date: 2026-02-10
Benchmark: Hard questions (30 questions across 6 categories) + 26 knowledge cutoff probes
Evaluator: Automated analysis with manual verification + web-validated fact-checking


Executive Summary

Aurora Alpha scored 75.0% on the hard benchmark (after accounting for retries on transient empty responses — expected for an experimental model on OpenRouter's free tier).

The model shows strong math and creative reasoning but struggles with multi-constraint logic puzzles and produces confident hallucinations on knowledge questions — fabricating academic citations and getting well-known facts wrong with high confidence.

Web validation confirmed most factual claims but caught key errors: a wrong doctoral advisor for Chandrasekhar (said Eddington, actually Fowler), fabricated academic citations in the Mpemba effect response, and a wrong answer on the Einstein's riddle variant.

The knowledge cutoff analysis confirms a hard cutoff around June 2024, with the model self-reporting this date consistently. The model's behavior pattern strongly suggests a GPT-4o or GPT-4o-mini backend.

| Metric | Score |
|---|---|
| Hard Benchmark Score | 75.0% |
| Previous Easy Score | 85.6% |
| Delta (easy → hard) | −10.6 pp |

Web Validation Summary

All key factual claims were checked against web sources. Results:

Math Answers

| Claim | Status | Source |
|---|---|---|
| Young tableaux 3×3 = 42 | ✅ Verified | Hook-length formula, standard result |
| 7^999 mod 1000 = 143 (retry) | ✅ Verified | Math StackExchange, Quora confirmations |
| 2^2024 mod 101 = 5 | ✅ Verified | Fermat's little theorem computation checked |
| Hexagon random walk E[T] = 9 | ✅ Verified | n²/4 = 36/4 = 9 for a cycle of 6 |
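All four verified results can be re-checked mechanically. A minimal script, assuming the standard hook-length formula for standard Young tableaux and the classical hitting-time formula i·(n−i) for a symmetric walk on a cycle (the exact benchmark phrasings are not reproduced in this report):

```python
from math import factorial

# Standard Young tableaux of shape 3x3 via the hook-length formula.
# Hook lengths for a 3x3 square: 5 4 3 / 4 3 2 / 3 2 1
hooks = [5, 4, 3, 4, 3, 2, 3, 2, 1]
prod = 1
for h in hooks:
    prod *= h
print(factorial(9) // prod)  # 42

# Modular exponentiation via three-argument pow
print(pow(7, 999, 1000))     # 143
print(pow(2, 2024, 101))     # 5

# Hexagon walk: expected time to hit the antipodal vertex of C_n,
# starting at distance i, is i*(n-i); here i = 3, n = 6.
print(3 * (6 - 3))           # 9
```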

Knowledge Claims

| Claim | Status | Details |
|---|---|---|
| Chandrasekhar's advisor = Eddington | ❌ Wrong | Actual advisor: R.H. Fowler. Confirmed by UChicago archives and Wikipedia. Eddington was famously Chandrasekhar's critic, not his advisor. |
| Smallest African country = Seychelles (retry) | ✅ Verified | Wikipedia confirms Seychelles (459 km²), capital Victoria, independence 29 June 1976 |
| Mpemba effect citations | ❌ Fabricated | "J.A. Jones et al., Phys. Rev. E 2021", "Nature Physics 2022", "J. Chem. Phys. 2023" — all fictional, with invented authors |
| Golden angle ≈ 137.5° | ✅ Verified | Standard result |
| Unix: PDP-11 assembly → B → C | ✅ Verified | Ken Thompson created B based on BCPL |
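The golden-angle value is a one-liner to confirm, using the standard identity golden angle = 360°/φ² (equivalently 360°·(1 − 1/φ)):

```python
phi = (1 + 5 ** 0.5) / 2       # golden ratio
golden_angle = 360 / phi ** 2  # ≈ 137.5077°
print(round(golden_angle, 1))  # 137.5
```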

Reasoning Answers

| Claim | Status | Details |
|---|---|---|
| Tuesday boy probability = 13/27 | ✅ Verified | Confirmed by Math StackExchange, The Actuary, multiple sources |
| Einstein's riddle: "English person" owns fish (retry) | ❌ Wrong | Correct answer: the Italian owns the fish. Full solution verified against all 14 clues. |
| Left-handed room problem = 50 | ✅ Verified | (99 − x)/(100 − x) = 0.98 → x = 50 |
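Both verified reasoning results are easy to confirm by brute force, assuming the standard problem statements (the report does not reproduce the exact benchmark wording):

```python
from fractions import Fraction
from itertools import product

# Tuesday boy: each child is a (sex, weekday) pair, all 14 options
# equally likely; condition on at least one boy born on a fixed day.
children = [(s, d) for s in "BG" for d in range(7)]
tuesday_boy = [p for p in product(children, repeat=2) if ("B", 1) in p]
both_boys = [p for p in tuesday_boy if p[0][0] == p[1][0] == "B"]
print(Fraction(len(both_boys), len(tuesday_boy)))  # 13/27

# Left-handed room: 99 of 100 people are left-handed; remove x
# left-handers until the proportion is exactly 98%:
# (99 - x) / (100 - x) = 49/50
x = next(x for x in range(100) if (99 - x) * 50 == (100 - x) * 49)
print(x)  # 50
```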

Creative Claims

| Claim | Status | Details |
|---|---|---|
| "This sentence has thirty-six letters" = 31 letters (retry) | ✅ Verified | 4+8+3+6+3+7 = 31, correctly identified as false |
| "This sentence contains thirty-six letters" = 36 (retry) | ✅ Verified | 4+8+8+6+3+7 = 36, valid self-referential sentence |
| Contradictory-constraints retry: no letter 'a' | ❌ Failed | Contains 'a' in "Finally" (F-i-n-a-l-l-y) and "Each" (E-a-c-h) |
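The self-referential letter counts are verifiable with a one-liner that counts alphabetic characters only (so spaces and the hyphen in "thirty-six" are excluded):

```python
def letters(s: str) -> int:
    # count alphabetic characters only; ignore spaces and hyphens
    return sum(c.isalpha() for c in s)

print(letters("This sentence has thirty-six letters"))       # 31 -> claim is false
print(letters("This sentence contains thirty-six letters"))  # 36 -> claim is true
```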

Cutoff Probe Spot-Checks

| Claim | Status | Details |
|---|---|---|
| Jan 2024: UK AI Safety Summit with £200M | ❌ Fabricated | Conflates with the Nov 2023 Bletchley Park summit; no Jan 2024 summit |
| Feb 2024: Gemini 1.5 announced | ✅ Correct | Google announced Gemini 1.5 in February 2024 |
| Mar 2024: Claude 3 family | ✅ Correct | Anthropic released Claude 3 (Opus/Sonnet/Haiku) in March 2024 |
| Apr 2024: "LLaMA 2 70B Chat" | ❌ Wrong | Should be Llama 3 (released April 2024) |
| Aug 2024: xAI released "Groq" | ❌ Hallucinated | Groq is a separate inference-chip company, not an xAI product |

Per-Category Breakdown (with Retries)

1. Math — Initial: 57/100 → With Retries: 76/100

| Question | Initial | Retry | Best | Web Validated | Notes |
|---|---|---|---|---|---|
| Combinatorics (3×3 Young tableaux) | 95 | — | 95 | ✅ 42 correct | Hook-length formula perfectly applied |
| Number Theory (7^999 mod 1000) | 0 | 95 | 95 | ✅ 143 correct | CRT approach: mod 8 → 7, mod 125 → 18, combined → 143 |
| Modular Arithmetic (2^2024 mod 101) | 95 | — | 95 | ✅ 5 correct | Fermat's little theorem, clean steps |
| AIME-style (hexagon random walk) | 95 | — | 95 | ✅ 9 correct | Recurrence + n²/4 verification |
| Diophantine (x³+y³=z³+1) | 0 | 0 | 0 | — | Failed on all 3 retry attempts; the only empty response that persisted |
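The CRT route credited in the number-theory row can be spelled out in a few lines (using Python's three-argument `pow` for the prime-power residues):

```python
# Reconstruct 7^999 mod 1000 from its residues mod 8 and mod 125.
r8 = pow(7, 999, 8)      # 7 ≡ -1 (mod 8), odd exponent -> 7
r125 = pow(7, 999, 125)  # ord(7) | λ(125)=100, so 7^999 ≡ 7^99 ≡ 7^(-1) ≡ 18
# CRT: the unique x < 1000 with x ≡ r8 (mod 8) and x ≡ r125 (mod 125)
x = next(x for x in range(1000) if x % 8 == r8 and x % 125 == r125)
print(r8, r125, x)       # 7 18 143
```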

2. Coding — Initial: 52/100 → With Retries: 62/100

| Question | Initial | Retry | Best | Notes |
|---|---|---|---|---|
| Segment Tree (lazy propagation) | 90 | — | 90 | Correct implementation, test output 45 ✅ |
| Concurrency Bug (deadlock detection) | 70 | — | 70 | Found the critical deadlock + 2 more bugs, but truncated with a syntax error in the fix |
| DP Optimization (non-adjacent sum) | 90 | — | 90 | Both linear and circular solutions correct |
| DP Hard (min insertions parens) | 0 | 50 | 50 | Retry: excellent theoretical analysis with interval DP, but truncated before complete working code |
| Graph Algorithms (Tarjan's + 2-SAT) | 10 | — | 10 | Degenerate output: 1690 tokens of mostly whitespace padding |
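The non-adjacent-sum task is the classic maximum-sum-of-non-adjacent-elements DP. The benchmark's exact input is not included here, but the linear and circular variants the model was credited for look like this sketch:

```python
def max_nonadjacent(nums: list[int]) -> int:
    """Max sum of non-adjacent elements in a line: O(n) time, O(1) space."""
    incl, excl = 0, 0  # best sums including / excluding the current element
    for n in nums:
        incl, excl = excl + n, max(incl, excl)
    return max(incl, excl)

def max_nonadjacent_circular(nums: list[int]) -> int:
    """Circular variant: the first and last elements also count as adjacent."""
    if len(nums) == 1:
        return nums[0]
    # Either the first element is skipped or the last one is.
    return max(max_nonadjacent(nums[1:]), max_nonadjacent(nums[:-1]))

print(max_nonadjacent([1, 2, 3, 1]))       # 4  (1 + 3)
print(max_nonadjacent_circular([2, 3, 2])) # 3  (first and last are adjacent)
```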

3. Reasoning — Initial: 62/100 → With Retries: 64/100

| Question | Initial | Retry | Best | Web Validated | Notes |
|---|---|---|---|---|---|
| Knights & Knaves (4 people) | 25 | — | 25 | — | Correct setup but truncated at 205 tokens mid-equation |
| Constraint Satisfaction (Einstein's riddle) | 0 | 10 | 10 | ❌ Wrong answer | Retry: "English person" — the correct answer is Italian; no work shown |
| Adversarial Probability (Tuesday boy) | 95 | — | 95 | ✅ 13/27 correct | Excellent counting argument |
| Logic Chain (left-handed room) | 95 | — | 95 | ✅ 50 correct | Clean algebra with verification |
| Logical Fallacy (quantifier shift) | 95 | — | 95 | ✅ Correct | Perfect identification with counterexample |

4. Instruction Following — Initial: 73/100 → With Retries: 81/100

| Question | Initial | Retry | Best | Notes |
|---|---|---|---|---|
| Alphabet Sentences | 95 | — | 95 | A-B-C-D-E / F-G-H-I-J / K-L-M-N-O perfect |
| Letter Counting | 95 | — | 95 | All three correct: e=4, s=4, r=3 |
| Contradictory Constraints | 0 | 40 | 40 | Retry: 50 words ✅, cooking ✅, exclamation marks ✅, but contains 'a' in "Finally" and "Each" ❌ |
| Haiku Format | 85 | — | 85 | 5-7-5 syllables ✓, word counts ✓ |
| Nested Instructions (animals sorted backward) | 90 | — | 90 | All steps correct |

5. Knowledge — Initial: 64/100 → With Retries: 80/100 (82 before the Mpemba citation downgrade)

| Question | Initial | Retry | Best | Web Validated | Notes |
|---|---|---|---|---|---|
| Chandrasekhar Limit | 70 | — | 70 | ⚠️ Limit correct, advisor ❌ | Advisor was R.H. Fowler, not Eddington |
| Smallest African Country | 0 | 90 | 90 | ✅ Seychelles correct | Victoria, independence 29 June 1976 |
| Fibonacci/Phyllotaxis | 90 | — | 90 | ✅ All facts correct | Golden angle 137.5°, equidistribution, parastichy |
| Unix Pre-C Language | 85 | — | 85 | ✅ Correct | PDP-11 asm → B (Thompson, from BCPL) → C |
| Mpemba Effect | 75 → 65 | — | 65 | ⚠️ Effect correct, citations ❌ | Good explanation but 3 fabricated academic papers |

Score adjustment: Mpemba score reduced from 75 → 65 due to web validation revealing fabricated citations (previously suspected, now confirmed).

6. Creative & Nuance — Initial: 69/100 → With Retries: 87/100

| Question | Initial | Retry | Best | Web Validated | Notes |
|---|---|---|---|---|---|
| Logical Fallacy ID | 90 | — | 90 | ✅ Correct | Correlation ≠ causation properly identified |
| Ethical Dilemma | 85 | — | 85 | ✅ Frameworks accurate | Utilitarian/deontological/virtue ethics all sound |
| Steganography (HELP acrostic) | 90 | — | 90 | ✅ H-E-L-P works | Natural-sounding paragraph |
| Perspective Writing (chess pawn) | 80 | — | 80 | ✅ 3 rules referenced | En passant, promotion, diagonal capture |
| Self-Reference (letter counting) | 0 | 90 | 90 | ✅ Both counts verified | 31 letters (false) and a 36-letter self-referential sentence (true) |

Score Summary

| Category | Initial Score | With Retries | Δ | Notes |
|---|---|---|---|---|
| Math | 57.0 | 76.0 | +19 | Number-theory retry succeeded |
| Coding | 52.0 | 62.0 | +10 | DP theory good but incomplete code |
| Reasoning | 62.0 | 64.0 | +2 | Einstein's riddle retry wrong |
| Instruction Following | 73.0 | 81.0 | +8 | Constraint task partially met |
| Knowledge | 64.0 → 62.0* | 80.0 | +18 | Africa-question retry + Mpemba downgrade |
| Creative & Nuance | 69.0 | 87.0 | +18 | Self-reference retry excellent |
| OVERALL | 62.8% → 62.5%* | 75.0% | +12.5 | |

* Adjusted after web validation: the Mpemba score was reduced 75 → 65, lowering Knowledge from 64 to 62 and the overall initial score from 62.8% to 62.5% (the 2-point Knowledge drop averaged over six categories).


Knowledge Cutoff Timeline

The model was probed with 26 monthly questions from January 2024 through February 2026.

Confident & Mostly Correct (Jan–Jun 2024)

| Month | Response | Accuracy |
|---|---|---|
| Jan 2024 | Claims a UK AI Safety Summit with a £200M fund | ❌ Fabricated — conflates with the Nov 2023 Bletchley Park summit |
| Feb 2024 | Gemini 1.5 announced | ✅ Correct |
| Mar 2024 | Claude 3 (Opus, Sonnet, Haiku) | ✅ Correct |
| Apr 2024 | Claims "LLaMA 2 70B Chat" | ❌ Wrong — should be Llama 3 (8B/70B) |
| May 2024 | Gemini 1.5 Pro at Google I/O | ⚠️ Partially correct — misses Flash and Project Astra |
| Jun 2024 | Claude 3.5 released | ⚠️ Partially correct — calls it "Claude 3.5 Turbo" (actual: Claude 3.5 Sonnet) and fabricates details |

Transition Zone (Jul–Oct 2024)

| Month | Response | Notes |
|---|---|---|
| Jul 2024 | "Cutoff in June 2024" — declines | First explicit cutoff admission |
| Aug 2024 | Claims xAI released "Groq" | ❌ Hallucinated — Groq is a different company entirely |
| Sep 2024 | HTTP 500 error | N/A |
| Oct 2024 | Claims Claude 3.5 in October | ⚠️ Roughly correct (the upgraded 3.5 Sonnet shipped Oct 2024) |

Post-Cutoff: Inconsistent (Nov 2024–Feb 2026)

| Month | Behavior |
|---|---|
| Nov 2024 | Declines — doesn't know Trump won the election |
| Dec 2024 | Declines — "mid-2024 cutoff" |
| Jan 2025 | Declines — "June 2024 cutoff" |
| Feb 2025 | Declines |
| Mar 2025 | Declines |
| Apr 2025 | HALLUCINATED an entire GPT-5 launch with fabricated benchmarks, dates, and partnerships |
| May 2025 | HALLUCINATED Gemini 2.0 at I/O 2025 with detailed fake specifications |
| Jun 2025 | HALLUCINATED Claude 3.5 details with invented capabilities |
| Jul–Nov 2025 | Declines consistently |
| Dec 2025 | HALLUCINATED the GPT-5 launch again (contradicting the Apr 2025 claim), plus fake Gemini 2.0 and LLaMA 3 |
| Jan–Feb 2026 | Declines |

Cutoff Summary

  • Self-reported cutoff: June 2024 (stated explicitly in 8+ responses)
  • Actual reliable knowledge: Through ~May 2024
  • Hard wall: July 2024 (first explicit decline)
  • Hallucination pattern: Inconsistent — sometimes declines, sometimes fabricates elaborate but contradictory responses

Where Does It Break? Key Weaknesses

1. Confident Hallucination

When the model fabricates, it does so with extreme confidence:

  • Chandrasekhar's advisor (Eddington instead of Fowler) — ❌ confirmed wrong via web
  • Fictional academic citations with fake journals and dates — ❌ confirmed fabricated
  • Entire product launches that never happened (GPT-5 with specific dates, benchmarks)
  • Einstein's riddle: confidently states "English person" with no work shown — ❌ confirmed wrong

2. Truncated Responses

Two responses were cut short mid-generation: Knights & Knaves at 205 tokens, and the concurrency-bug answer mid-analysis.

3. Degenerate Output

Tarjan/2-SAT produced 1690 tokens of mostly Unicode whitespace — a bizarre generation failure.

4. Constraint Satisfaction Weakness

The model struggles with multi-constraint logical puzzles. Both the original Einstein's riddle (empty) and retry (wrong answer) failed. The contradictory writing constraints were partially met (3/4 constraints).


Comparison: Easy vs Hard Benchmark

| Metric | Easy Benchmark | Hard Benchmark |
|---|---|---|
| Overall Score | 85.6% | 75.0% |
| Primary weakness | Minor factual errors | Hallucinations + hard logic |

The capability gap between easy and hard is ~10.6 percentage points — significant but not catastrophic. The model handles computation-heavy tasks well but struggles with multi-step constraint satisfaction and occasionally fabricates knowledge with high confidence.


Model Identity Verdict

Evidence continues to point strongly toward GPT-4o or GPT-4o-mini as the backend:

| Signal | Evidence |
|---|---|
| Knowledge cutoff | Self-reports June 2024 — aligns with GPT-4o-mini |
| Response style | Markdown-heavy, extensive LaTeX, step-by-step structure |
| Hallucination pattern | Confident confabulation with elaborate fake details |
| Math capability | Excellent on computation, weak on constraint satisfaction |
| Letter counting | Perfect on all three — GPT-4o-mini improved on character-level tasks |
| Zero reasoning tokens | Not a reasoning model (not o1/o3) |
| Cost: $0 | Free-tier routing through OpenRouter |

Verdict: Aurora Alpha is almost certainly GPT-4o-mini routed through OpenRouter, possibly with a thin wrapper that occasionally causes empty responses or generation failures.


Report generated 2026-02-10. Web validation performed 2026-02-10. Raw data: aurora-hard-benchmark-raw.json, retry data: aurora-retry-results.json
