Model: openrouter/aurora-alpha
Date: 2026-02-10
Benchmark: Hard questions (30 questions across 6 categories) + 26 knowledge-cutoff probes
Evaluator: Automated analysis with manual verification and web-validated fact-checking
Executive Summary
Aurora Alpha scored 75.0% on the hard benchmark (after accounting for retries on transient empty responses — expected for an experimental model on OpenRouter's free tier).
The model shows strong math and creative reasoning, but it struggles with multi-constraint logic puzzles and hallucinates confidently on knowledge questions, fabricating academic citations and misstating well-known facts.
Web validation confirmed most factual claims but caught three key errors: a wrong doctoral advisor for Chandrasekhar (the model said Eddington; it was actually R. H. Fowler), fabricated academic citations in the Mpemba effect response, and a wrong answer to the Einstein's riddle variant.
The knowledge cutoff analysis confirms a hard cutoff around June 2024, with the model self-reporting this date consistently. The model's behavior pattern strongly suggests a GPT-4o or GPT-4o-mini backend.
| Metric | Score |
| --- | --- |
| Hard Benchmark Score | 75.0% |
| Previous Easy Score | 85.6% |
| Delta (easy → hard) | -10.6 pp |
Web Validation Summary
All key factual claims were checked against web sources. Results:
Math Answers
| Claim | Status | Source |
| --- | --- | --- |
| Young tableaux 3×3 = 42 | ✅ Verified | Hook-length formula, standard result |
| 7^999 mod 1000 = 143 (retry) | ✅ Verified | Math StackExchange and Quora confirmations |
| 2^2024 mod 101 = 5 | ✅ Verified | Fermat's little theorem computation checked |
| Hexagon random walk E[T] = 9 | ✅ Verified | n²/4 = 36/4 = 9 for a cycle of 6 |
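All four math rows can be reproduced in a few lines of Python. This sketch was added for this report and is not part of the benchmark harness; it checks the two modular-exponentiation answers directly, counts 3×3 standard Young tableaux with the hook-length formula, and solves the hexagon hitting-time recurrence numerically.

```python
import math

# Modular exponentiation claims, checked with Python's three-argument pow()
assert pow(7, 999, 1000) == 143
assert pow(2, 2024, 101) == 5

# Standard Young tableaux of shape 3x3, counted by the hook-length formula:
# count = 9! / (product of all hook lengths)
shape = [3, 3, 3]
hooks = [
    (shape[r] - c - 1)                                   # arm length
    + sum(1 for rr in range(r + 1, 3) if shape[rr] > c)  # leg length
    + 1
    for r in range(3)
    for c in range(shape[r])
]
assert math.factorial(9) // math.prod(hooks) == 42

# Expected hitting time of the opposite vertex on the 6-cycle:
# Jacobi-iterate h(v) = 1 + (h(v-1) + h(v+1)) / 2 with the target absorbing
h = [0.0] * 6
for _ in range(10_000):
    h = [0.0 if v == 3 else 1 + (h[(v - 1) % 6] + h[(v + 1) % 6]) / 2
         for v in range(6)]
print(round(h[0], 6))  # → 9.0, matching n²/4 = 36/4 for the 6-cycle
```

The hitting-time iteration converges geometrically because vertex 3 is absorbing; the exact answer k(n−k) = 3·3 = 9 falls out of the same recurrence.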
Knowledge Claims
| Claim | Status | Details |
| --- | --- | --- |
| Chandrasekhar's advisor = Eddington | ❌ Wrong | Actual advisor: R. H. Fowler, confirmed by UChicago archives and Wikipedia. Eddington was famously Chandrasekhar's critic, not his advisor. |
| Smallest African country = Seychelles (retry) | ✅ Verified | Wikipedia confirms Seychelles (459 km²), capital Victoria, independence 29 June 1976 |
| Mpemba effect citations | ❌ Fabricated | "J.A. Jones et al., Phys. Rev. E 2021", "Nature Physics 2022", "J. Chem. Phys. 2023" — all fictional, with invented authors |
| Golden angle ≈ 137.5° | ✅ Verified | Standard result |
| Unix: PDP-11 assembly → B → C | ✅ Verified | Ken Thompson created B based on BCPL |
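The golden-angle row is easy to reproduce independently. A quick check added here (not from the original validation run): the golden angle is the full circle divided by φ², which also equals 180°·(3 − √5).

```python
import math

# Golden angle: the circle split in the golden ratio,
# 360 / φ² = 360 · (1 − 1/φ) = 180 · (3 − √5) degrees
phi = (1 + math.sqrt(5)) / 2
golden_angle = 360 / phi**2
assert abs(golden_angle - 180 * (3 - math.sqrt(5))) < 1e-9
print(round(golden_angle, 1))  # → 137.5
```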
Reasoning Answers
| Claim | Status | Details |
| --- | --- | --- |
| Tuesday boy probability = 13/27 | ✅ Verified | Confirmed by Math StackExchange, The Actuary, and multiple other sources |
| Einstein's riddle: "English person" owns fish (retry) | ❌ Wrong | Correct answer: the Italian owns the fish. Full solution verified against all 14 clues. |
| Left-handed room problem = 50 | ✅ Verified | (99−x)/(100−x) = 0.98 → x = 50 |
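The two numeric rows here can be verified by brute force. The sketch below (added for this report, not part of the benchmark) enumerates all 196 equally likely (sex, weekday) pairs for the Tuesday-boy problem and checks the room equation exactly with rationals.

```python
from fractions import Fraction
from itertools import product

# Tuesday-boy: each child is one of 14 equally likely (sex, weekday) states
states = [(sex, day) for sex in "BG" for day in range(7)]
tuesday_boy = ("B", 2)  # day 2 stands in for Tuesday; any fixed day works
pairs = [p for p in product(states, repeat=2) if tuesday_boy in p]
both_boys = [p for p in pairs if p[0][0] == "B" and p[1][0] == "B"]
assert Fraction(len(both_boys), len(pairs)) == Fraction(13, 27)

# Left-handed room: (99 - x) / (100 - x) = 0.98 is solved by x = 50,
# checked exactly rather than in floating point
x = 50
assert Fraction(99 - x, 100 - x) == Fraction(98, 100)
print(len(both_boys), len(pairs))  # → 13 27
```

The enumeration makes the counterintuitive 13/27 transparent: 27 of the 196 pairs contain a Tuesday boy, and 13 of those are boy-boy.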
Creative Claims
| Claim | Status | Details |
| --- | --- | --- |
| "This sentence has thirty-six letters" = 31 letters (retry) | ⚠️ Partially correct | Calls it "Claude 3.5 Turbo" (actual: Claude 3.5 Sonnet); fabricates details |
Transition Zone (Jul–Oct 2024)
| Month | Response | Notes |
| --- | --- | --- |
| Jul 2024 | "Cutoff in June 2024" — declines | First explicit cutoff admission |
| Aug 2024 | Claims xAI released "Groq" | ❌ Hallucinated — Groq is a different company entirely |
| Sep 2024 | HTTP 500 error | N/A |
| Oct 2024 | Claims Claude 3.5 in October | ⚠️ Roughly correct (upgraded 3.5 Sonnet was Oct 2024) |
Post-Cutoff: Inconsistent (Nov 2024–Feb 2026)
| Month | Behavior |
| --- | --- |
| Nov 2024 | Declines — doesn't know Trump won the election |
| Dec 2024 | Declines — "mid-2024 cutoff" |
| Jan 2025 | Declines — "June 2024 cutoff" |
| Feb 2025 | Declines |
| Mar 2025 | Declines |
| Apr 2025 | HALLUCINATED an entire GPT-5 launch with fabricated benchmarks, dates, and partnerships |
| May 2025 | HALLUCINATED Gemini 2.0 at I/O 2025 with detailed fake specifications |
| Jun 2025 | HALLUCINATED Claude 3.5 details with invented capabilities |
| Jul–Nov 2025 | Declines consistently |
| Dec 2025 | HALLUCINATED a GPT-5 launch again (contradicting the Apr 2025 claim), plus fake Gemini 2.0 and LLaMA 3 details |
| Jan–Feb 2026 | Declines |
Cutoff Summary
- Self-reported cutoff: June 2024 (stated explicitly in 8+ responses)
- Actual reliable knowledge: through ~May 2024
- Hard wall: July 2024 (first explicit decline)
- Hallucination pattern: inconsistent — sometimes declines, sometimes fabricates elaborate but contradictory responses
Where Does It Break? Key Weaknesses
1. Confident Hallucination
When the model fabricates, it does so with extreme confidence:
- Chandrasekhar's advisor (Eddington instead of Fowler) — ❌ confirmed wrong via web
- Fictional academic citations with fake journals and dates — ❌ confirmed fabricated
- Entire product launches that never happened (GPT-5 with specific dates and benchmarks)
- Einstein's riddle: confidently states "English person" with no work shown — ❌ confirmed wrong
2. Truncated Responses
Two responses were cut off mid-generation: the Knights & Knaves answer at 205 tokens, and the concurrency-bugs answer mid-analysis.
3. Degenerate Output
Tarjan/2-SAT produced 1690 tokens of mostly Unicode whitespace — a bizarre generation failure.
4. Constraint Satisfaction Weakness
The model struggles with multi-constraint logic puzzles. Both the original Einstein's riddle attempt (empty response) and the retry (wrong answer) failed, and the contradictory writing-constraints task was only partially satisfied (3 of 4 constraints met).
Comparison: Easy vs Hard Benchmark
| Metric | Easy Benchmark | Hard Benchmark |
| --- | --- | --- |
| Overall Score | 85.6% | 75.0% |
| Primary weakness | Minor factual errors | Hallucinations + hard logic |
The capability gap between easy and hard is ~10.6 percentage points — significant but not catastrophic. The model handles computation-heavy tasks well but struggles with multi-step constraint satisfaction and occasionally fabricates knowledge with high confidence.
Model Identity Verdict
Evidence continues to point strongly toward GPT-4o or GPT-4o-mini as the backend:
- Confident confabulation with elaborate fake details
- Math capability: excellent on computation, weak on constraint satisfaction
- Letter counting: perfect on all three tasks — GPT-4o-mini improved on character tasks
- Zero reasoning tokens: not a reasoning model (not o1/o3)
- Cost: $0 — free-tier routing through OpenRouter
Verdict: Aurora Alpha is almost certainly GPT-4o-mini routed through OpenRouter, possibly with a thin wrapper that occasionally causes empty responses or generation failures.
Report generated 2026-02-10. Web validation performed 2026-02-10. Raw data: aurora-hard-benchmark-raw.json, retry data: aurora-retry-results.json