@LoggeL
Last active February 10, 2026 14:31
Aurora Alpha (OpenRouter) Hard Benchmark Report — GPT-4o-mini unmasked

Aurora Alpha — Hard Benchmark Report

Model: openrouter/aurora-alpha
Date: 2026-02-10
Benchmark: Hard questions (30 questions across 6 categories) + 26 knowledge cutoff probes
Evaluator: Automated analysis with manual verification + web-validated fact-checking


Executive Summary

Aurora Alpha scored 75.0% on the hard benchmark (after accounting for retries on transient empty responses — expected for an experimental model on OpenRouter's free tier).

The model shows strong math and creative reasoning but struggles with multi-constraint logic puzzles and produces confident hallucinations on knowledge questions — fabricating academic citations and getting well-known facts wrong with high confidence.

Web validation confirmed most factual claims but caught key errors: a wrong doctoral advisor for Chandrasekhar (said Eddington, actually Fowler), fabricated academic citations in the Mpemba effect response, and a wrong answer on the Einstein's riddle variant.

The knowledge cutoff analysis confirms a hard cutoff around June 2024, with the model self-reporting this date consistently. The model's behavior pattern strongly suggests a GPT-4o or GPT-4o-mini backend.

| Metric | Score |
|---|---|
| Hard Benchmark Score | 75.0% |
| Previous Easy Score | 85.6% |
| Delta (easy → hard) | −10.6 pp |

Web Validation Summary

All key factual claims were checked against web sources. Results:

Math Answers

| Claim | Status | Source |
|---|---|---|
| Young tableaux 3×3 = 42 | ✅ Verified | Hook-length formula, standard result |
| 7^999 mod 1000 = 143 (retry) | ✅ Verified | Math StackExchange, Quora confirmations |
| 2^2024 mod 101 = 5 | ✅ Verified | Fermat's little theorem computation checked |
| Hexagon random walk E[T] = 9 | ✅ Verified | n²/4 = 36/4 = 9 for a cycle of 6 |
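All four verified results can be re-checked mechanically. A minimal script, assuming the standard hook-length formula for standard Young tableaux and the classical hitting-time formula i·(n−i) for a symmetric walk on a cycle (the exact benchmark phrasings are not reproduced in this report):

```python
from math import factorial

# Standard Young tableaux of shape 3x3 via the hook-length formula.
# Hook lengths for a 3x3 square: 5 4 3 / 4 3 2 / 3 2 1
hooks = [5, 4, 3, 4, 3, 2, 3, 2, 1]
prod = 1
for h in hooks:
    prod *= h
print(factorial(9) // prod)  # 42

# Modular exponentiation via three-argument pow
print(pow(7, 999, 1000))     # 143
print(pow(2, 2024, 101))     # 5

# Hexagon walk: expected time to hit the antipodal vertex of C_n,
# starting at distance i, is i*(n-i); here i = 3, n = 6.
print(3 * (6 - 3))           # 9
```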

Knowledge Claims

| Claim | Status | Details |
|---|---|---|
| Chandrasekhar's advisor = Eddington | ❌ Wrong | Actual advisor: R.H. Fowler. Confirmed by UChicago archives and Wikipedia. Eddington was famously Chandrasekhar's critic, not his advisor. |
| Smallest African country = Seychelles (retry) | ✅ Verified | Wikipedia confirms Seychelles (459 km²), capital Victoria, independence 29 June 1976 |
| Mpemba effect citations | ❌ Fabricated | "J.A. Jones et al., Phys. Rev. E 2021", "Nature Physics 2022", "J. Chem. Phys. 2023" — all fictional, with invented authors |
| Golden angle ≈ 137.5° | ✅ Verified | Standard result |
| Unix: PDP-11 assembly → B → C | ✅ Verified | Ken Thompson created B based on BCPL |
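The golden-angle value is a one-liner to confirm, using the standard identity golden angle = 360°/φ² (equivalently 360°·(1 − 1/φ)):

```python
phi = (1 + 5 ** 0.5) / 2       # golden ratio
golden_angle = 360 / phi ** 2  # ≈ 137.5077°
print(round(golden_angle, 1))  # 137.5
```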

Reasoning Answers

| Claim | Status | Details |
|---|---|---|
| Tuesday boy probability = 13/27 | ✅ Verified | Confirmed by Math StackExchange, The Actuary, multiple sources |
| Einstein's riddle: "English person" owns fish (retry) | ❌ Wrong | Correct answer: the Italian owns the fish. Full solution verified against all 14 clues. |
| Left-handed room problem = 50 | ✅ Verified | (99 − x)/(100 − x) = 0.98 → x = 50 |
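Both verified reasoning results are easy to confirm by brute force, assuming the standard problem statements (the report does not reproduce the exact benchmark wording):

```python
from fractions import Fraction
from itertools import product

# Tuesday boy: each child is a (sex, weekday) pair, all 14 options
# equally likely; condition on at least one boy born on a fixed day.
children = [(s, d) for s in "BG" for d in range(7)]
tuesday_boy = [p for p in product(children, repeat=2) if ("B", 1) in p]
both_boys = [p for p in tuesday_boy if p[0][0] == p[1][0] == "B"]
print(Fraction(len(both_boys), len(tuesday_boy)))  # 13/27

# Left-handed room: 99 of 100 people are left-handed; remove x
# left-handers until the proportion is exactly 98%:
# (99 - x) / (100 - x) = 49/50
x = next(x for x in range(100) if (99 - x) * 50 == (100 - x) * 49)
print(x)  # 50
```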

Creative Claims

| Claim | Status | Details |
|---|---|---|
| "This sentence has thirty-six letters" = 31 letters (retry) | ✅ Verified | 4+8+3+6+3+7 = 31, correctly identified as false |
| "This sentence contains thirty-six letters" = 36 (retry) | ✅ Verified | 4+8+8+6+3+7 = 36, valid self-referential sentence |
| Contradictory-constraints retry: no letter 'a' | ❌ Failed | Contains 'a' in "Finally" (F-i-n-a-l-l-y) and "Each" (E-a-c-h) |
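The self-referential letter counts are verifiable with a one-liner that counts alphabetic characters only (so spaces and the hyphen in "thirty-six" are excluded):

```python
def letters(s: str) -> int:
    # count alphabetic characters only; ignore spaces and hyphens
    return sum(c.isalpha() for c in s)

print(letters("This sentence has thirty-six letters"))       # 31 -> claim is false
print(letters("This sentence contains thirty-six letters"))  # 36 -> claim is true
```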

Cutoff Probe Spot-Checks

| Claim | Status | Details |
|---|---|---|
| Jan 2024: UK AI Safety Summit with £200M | ❌ Fabricated | Conflates with the Nov 2023 Bletchley Park summit; no Jan 2024 summit |
| Feb 2024: Gemini 1.5 announced | ✅ Correct | Google announced Gemini 1.5 in February 2024 |
| Mar 2024: Claude 3 family | ✅ Correct | Anthropic released Claude 3 (Opus/Sonnet/Haiku) in March 2024 |
| Apr 2024: "LLaMA 2 70B Chat" | ❌ Wrong | Should be Llama 3 (released April 2024) |
| Aug 2024: xAI released "Groq" | ❌ Hallucinated | Groq is a separate inference-chip company, not an xAI product |

Per-Category Breakdown (with Retries)

1. Math — Initial: 57/100 → With Retries: 76/100

| Question | Initial | Retry | Best | Web Validated | Notes |
|---|---|---|---|---|---|
| Combinatorics (3×3 Young tableaux) | 95 | — | 95 | ✅ 42 correct | Hook-length formula perfectly applied |
| Number Theory (7^999 mod 1000) | 0 | 95 | 95 | ✅ 143 correct | CRT approach: mod 8 → 7, mod 125 → 18, combined → 143 |
| Modular Arithmetic (2^2024 mod 101) | 95 | — | 95 | ✅ 5 correct | Fermat's little theorem, clean steps |
| AIME-style (hexagon random walk) | 95 | — | 95 | ✅ 9 correct | Recurrence + n²/4 verification |
| Diophantine (x³+y³=z³+1) | 0 | 0 | 0 | — | Failed on all 3 retry attempts; the only empty response that persisted |
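The CRT route credited in the number-theory row can be spelled out in a few lines (using Python's three-argument `pow` for the prime-power residues):

```python
# Reconstruct 7^999 mod 1000 from its residues mod 8 and mod 125.
r8 = pow(7, 999, 8)      # 7 ≡ -1 (mod 8), odd exponent -> 7
r125 = pow(7, 999, 125)  # ord(7) | λ(125)=100, so 7^999 ≡ 7^99 ≡ 7^(-1) ≡ 18
# CRT: the unique x < 1000 with x ≡ r8 (mod 8) and x ≡ r125 (mod 125)
x = next(x for x in range(1000) if x % 8 == r8 and x % 125 == r125)
print(r8, r125, x)       # 7 18 143
```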

2. Coding — Initial: 52/100 → With Retries: 62/100

| Question | Initial | Retry | Best | Notes |
|---|---|---|---|---|
| Segment Tree (lazy propagation) | 90 | — | 90 | Correct implementation, test output 45 ✅ |
| Concurrency Bug (deadlock detection) | 70 | — | 70 | Found the critical deadlock + 2 more bugs, but truncated with a syntax error in the fix |
| DP Optimization (non-adjacent sum) | 90 | — | 90 | Both linear and circular solutions correct |
| DP Hard (min insertions parens) | 0 | 50 | 50 | Retry: excellent theoretical analysis with interval DP, but truncated before complete working code |
| Graph Algorithms (Tarjan's + 2-SAT) | 10 | — | 10 | Degenerate output: 1690 tokens of mostly whitespace padding |
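The non-adjacent-sum task is the classic maximum-sum-of-non-adjacent-elements DP. The benchmark's exact input is not included here, but the linear and circular variants the model was credited for look like this sketch:

```python
def max_nonadjacent(nums: list[int]) -> int:
    """Max sum of non-adjacent elements in a line: O(n) time, O(1) space."""
    incl, excl = 0, 0  # best sums including / excluding the current element
    for n in nums:
        incl, excl = excl + n, max(incl, excl)
    return max(incl, excl)

def max_nonadjacent_circular(nums: list[int]) -> int:
    """Circular variant: the first and last elements also count as adjacent."""
    if len(nums) == 1:
        return nums[0]
    # Either the first element is skipped or the last one is.
    return max(max_nonadjacent(nums[1:]), max_nonadjacent(nums[:-1]))

print(max_nonadjacent([1, 2, 3, 1]))       # 4  (1 + 3)
print(max_nonadjacent_circular([2, 3, 2])) # 3  (first and last are adjacent)
```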

3. Reasoning — Initial: 62/100 → With Retries: 64/100

| Question | Initial | Retry | Best | Web Validated | Notes |
|---|---|---|---|---|---|
| Knights & Knaves (4 people) | 25 | — | 25 | — | Correct setup but truncated at 205 tokens mid-equation |
| Constraint Satisfaction (Einstein's riddle) | 0 | 10 | 10 | ❌ Wrong answer | Retry: "English person" — the correct answer is Italian; no work shown |
| Adversarial Probability (Tuesday boy) | 95 | — | 95 | ✅ 13/27 correct | Excellent counting argument |
| Logic Chain (left-handed room) | 95 | — | 95 | ✅ 50 correct | Clean algebra with verification |
| Logical Fallacy (quantifier shift) | 95 | — | 95 | ✅ Correct | Perfect identification with counterexample |

4. Instruction Following — Initial: 73/100 → With Retries: 81/100

| Question | Initial | Retry | Best | Notes |
|---|---|---|---|---|
| Alphabet Sentences | 95 | — | 95 | A-B-C-D-E / F-G-H-I-J / K-L-M-N-O perfect |
| Letter Counting | 95 | — | 95 | All three correct: e=4, s=4, r=3 |
| Contradictory Constraints | 0 | 40 | 40 | Retry: 50 words ✅, cooking ✅, exclamation marks ✅, but contains 'a' in "Finally" and "Each" ❌ |
| Haiku Format | 85 | — | 85 | 5-7-5 syllables ✓, word counts ✓ |
| Nested Instructions (animals sorted backward) | 90 | — | 90 | All steps correct |

5. Knowledge — Initial: 64/100 → With Retries: 80/100 (82 before the Mpemba citation downgrade)

| Question | Initial | Retry | Best | Web Validated | Notes |
|---|---|---|---|---|---|
| Chandrasekhar Limit | 70 | — | 70 | ⚠️ Limit correct, advisor ❌ | Advisor was R.H. Fowler, not Eddington |
| Smallest African Country | 0 | 90 | 90 | ✅ Seychelles correct | Victoria, independence 29 June 1976 |
| Fibonacci/Phyllotaxis | 90 | — | 90 | ✅ All facts correct | Golden angle 137.5°, equidistribution, parastichy |
| Unix Pre-C Language | 85 | — | 85 | ✅ Correct | PDP-11 asm → B (Thompson, from BCPL) → C |
| Mpemba Effect | 75 → 65 | — | 65 | ⚠️ Effect correct, citations ❌ | Good explanation but 3 fabricated academic papers |

Score adjustment: Mpemba score reduced from 75 → 65 due to web validation revealing fabricated citations (previously suspected, now confirmed).

6. Creative & Nuance — Initial: 69/100 → With Retries: 87/100

| Question | Initial | Retry | Best | Web Validated | Notes |
|---|---|---|---|---|---|
| Logical Fallacy ID | 90 | — | 90 | ✅ Correct | Correlation ≠ causation properly identified |
| Ethical Dilemma | 85 | — | 85 | ✅ Frameworks accurate | Utilitarian/deontological/virtue ethics all sound |
| Steganography (HELP acrostic) | 90 | — | 90 | ✅ H-E-L-P works | Natural-sounding paragraph |
| Perspective Writing (chess pawn) | 80 | — | 80 | ✅ 3 rules referenced | En passant, promotion, diagonal capture |
| Self-Reference (letter counting) | 0 | 90 | 90 | ✅ Both counts verified | 31 letters (false) and a 36-letter self-referential sentence (true) |

Score Summary

| Category | Initial Score | With Retries | Δ | Notes |
|---|---|---|---|---|
| Math | 57.0 | 76.0 | +19 | Number-theory retry succeeded |
| Coding | 52.0 | 62.0 | +10 | DP theory good but incomplete code |
| Reasoning | 62.0 | 64.0 | +2 | Einstein's riddle retry wrong |
| Instruction Following | 73.0 | 81.0 | +8 | Constraint task partially met |
| Knowledge | 64.0 → 62.0* | 80.0 | +18 | Africa-question retry + Mpemba downgrade |
| Creative & Nuance | 69.0 | 87.0 | +18 | Self-reference retry excellent |
| OVERALL | 62.8% → 62.5%* | 75.0% | +12.5 | |

* Adjusted after web validation: the Mpemba score was reduced 75 → 65, lowering Knowledge from 64 to 62 and the overall initial score from 62.8% to 62.5% (the 2-point Knowledge drop averaged over six categories).


Knowledge Cutoff Timeline

The model was probed with 26 monthly questions from January 2024 through February 2026.

Confident & Mostly Correct (Jan–Jun 2024)

| Month | Response | Accuracy |
|---|---|---|
| Jan 2024 | Claims a UK AI Safety Summit with a £200M fund | ❌ Fabricated — conflates with the Nov 2023 Bletchley Park summit |
| Feb 2024 | Gemini 1.5 announced | ✅ Correct |
| Mar 2024 | Claude 3 (Opus, Sonnet, Haiku) | ✅ Correct |
| Apr 2024 | Claims "LLaMA 2 70B Chat" | ❌ Wrong — should be Llama 3 (8B/70B) |
| May 2024 | Gemini 1.5 Pro at Google I/O | ⚠️ Partially correct — misses Flash and Project Astra |
| Jun 2024 | Claude 3.5 released | ⚠️ Partially correct — calls it "Claude 3.5 Turbo" (actual: Claude 3.5 Sonnet) and fabricates details |

Transition Zone (Jul–Oct 2024)

| Month | Response | Notes |
|---|---|---|
| Jul 2024 | "Cutoff in June 2024" — declines | First explicit cutoff admission |
| Aug 2024 | Claims xAI released "Groq" | ❌ Hallucinated — Groq is a different company entirely |
| Sep 2024 | HTTP 500 error | N/A |
| Oct 2024 | Claims Claude 3.5 in October | ⚠️ Roughly correct (the upgraded 3.5 Sonnet shipped Oct 2024) |

Post-Cutoff: Inconsistent (Nov 2024–Feb 2026)

| Month | Behavior |
|---|---|
| Nov 2024 | Declines — doesn't know Trump won the election |
| Dec 2024 | Declines — "mid-2024 cutoff" |
| Jan 2025 | Declines — "June 2024 cutoff" |
| Feb 2025 | Declines |
| Mar 2025 | Declines |
| Apr 2025 | HALLUCINATED an entire GPT-5 launch with fabricated benchmarks, dates, and partnerships |
| May 2025 | HALLUCINATED Gemini 2.0 at I/O 2025 with detailed fake specifications |
| Jun 2025 | HALLUCINATED Claude 3.5 details with invented capabilities |
| Jul–Nov 2025 | Declines consistently |
| Dec 2025 | HALLUCINATED the GPT-5 launch again (contradicting the Apr 2025 claim), plus fake Gemini 2.0 and LLaMA 3 |
| Jan–Feb 2026 | Declines |

Cutoff Summary

  • Self-reported cutoff: June 2024 (stated explicitly in 8+ responses)
  • Actual reliable knowledge: Through ~May 2024
  • Hard wall: July 2024 (first explicit decline)
  • Hallucination pattern: Inconsistent — sometimes declines, sometimes fabricates elaborate but contradictory responses

Where Does It Break? Key Weaknesses

1. Confident Hallucination

When the model fabricates, it does so with extreme confidence:

  • Chandrasekhar's advisor (Eddington instead of Fowler) — ❌ confirmed wrong via web
  • Fictional academic citations with fake journals and dates — ❌ confirmed fabricated
  • Entire product launches that never happened (GPT-5 with specific dates, benchmarks)
  • Einstein's riddle: confidently states "English person" with no work shown — ❌ confirmed wrong

2. Truncated Responses

Two responses were cut short mid-generation: Knights & Knaves at 205 tokens, and the concurrency-bug answer mid-analysis.

3. Degenerate Output

Tarjan/2-SAT produced 1690 tokens of mostly Unicode whitespace — a bizarre generation failure.

4. Constraint Satisfaction Weakness

The model struggles with multi-constraint logical puzzles. Both the original Einstein's riddle (empty) and retry (wrong answer) failed. The contradictory writing constraints were partially met (3/4 constraints).


Comparison: Easy vs Hard Benchmark

| Metric | Easy Benchmark | Hard Benchmark |
|---|---|---|
| Overall Score | 85.6% | 75.0% |
| Primary weakness | Minor factual errors | Hallucinations + hard logic |

The capability gap between easy and hard is ~10.6 percentage points — significant but not catastrophic. The model handles computation-heavy tasks well but struggles with multi-step constraint satisfaction and occasionally fabricates knowledge with high confidence.


Model Identity Verdict

Evidence continues to point strongly toward GPT-4o or GPT-4o-mini as the backend:

| Signal | Evidence |
|---|---|
| Knowledge cutoff | Self-reports June 2024 — aligns with GPT-4o-mini |
| Response style | Markdown-heavy, extensive LaTeX, step-by-step structure |
| Hallucination pattern | Confident confabulation with elaborate fake details |
| Math capability | Excellent on computation, weak on constraint satisfaction |
| Letter counting | Perfect on all three — GPT-4o-mini improved on character-level tasks |
| Zero reasoning tokens | Not a reasoning model (not o1/o3) |
| Cost: $0 | Free-tier routing through OpenRouter |

Verdict: Aurora Alpha is almost certainly GPT-4o-mini routed through OpenRouter, possibly with a thin wrapper that occasionally causes empty responses or generation failures.


Report generated 2026-02-10. Web validation performed 2026-02-10. Raw data: aurora-hard-benchmark-raw.json, retry data: aurora-retry-results.json
