
Mass Formalization of Erdos Problems

I have used Claude and Gemini to formalize conjectures for all remaining Erdos problems from https://github.com/teorth/erdosproblems

This is marked as a key milestone in the DeepMind Formal Conjectures repo: https://github.com/google-deepmind/formal-conjectures/milestone/1

Models Used

  • Gemini 3 Flash
  • Gemini 3 Pro
  • Claude Sonnet 4.5
  • Claude Opus 4.6

Gemini 3 Pro

This was a reliably good model, but on multiple days I very quickly slammed into the 250-requests-per-day limit on my free-trial GCP account.

Gemini 3 Flash

To get around the limited requests available for 3 Pro, I tried out both Gemini 2.5 Pro and Gemini 3 Flash, and found the latter to perform much better: it was surprisingly good at autonomously accomplishing the formalization and build tasks.

Claude Sonnet 4.5

Once I began hitting quota errors even with Gemini 3 Flash, I signed up for a Claude Pro account and used it alongside Gemini to continue the work.

Claude Opus 4.6

Opus 4.6 dropped mid-project, and Anthropic offered $50 of extra usage, which was a godsend for getting to use the new hotness. I stuck with Sonnet for the vast majority of tasks but switched explicitly to high-effort Opus 4.6 for final reviews, focusing on the actual implementation of each conjecture (as opposed to style-guide adherence, etc.).

Methodology

Identify unformalized problems

  1. Fetch the canonical set of 1179 Erdos problems from the repo listed above
  2. Identify the remaining problems to work on by diffing that list against the existing formalizations in FormalConjectures/ErdosProblems/*.lean (827 total), as sketched below
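A minimal sketch of that diff in Python; the canonical-list file and its format are assumptions, since the real erdosproblems repo layout may differ:

```python
from pathlib import Path

# Canonical problem numbers, assumed here to be one per line in a local
# dump (hypothetical file; the real repo layout may differ).
canonical = set(Path("erdos_problem_numbers.txt").read_text().split())

# Problems already formalized: one NUM.lean file per problem.
formalized = {p.stem for p in Path("FormalConjectures/ErdosProblems").glob("*.lean")}

remaining = sorted(canonical - formalized, key=int)
print(f"{len(remaining)} problems left to formalize")
```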

Formalization

The "harness" for focusing the model's efforts on a particular problem NUM looked like:

  1. Fetch data from https://www.erdosproblems.com/NUM
  2. Produce lean formalization in FormalConjectures/ErdosProblems/NUM.lean (adhere to style guide)
  3. Run build using lake build FormalConjectures/ErdosProblems/NUM.lean (debug and rebuild until successful)
  4. Commit changes
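A minimal sketch of that loop as a driver script. Only the URL, file path, and lake invocation come from the steps above; the run_model helper (standing in for whichever agent is in use) and everything else are assumptions:

```python
import subprocess
import urllib.request

def run_model(prompt: str) -> None:
    """Placeholder for invoking the agent (Claude or Gemini) on a prompt."""
    raise NotImplementedError

def formalize(num: int, max_attempts: int = 5) -> None:
    # 1. Fetch the problem statement.
    statement = urllib.request.urlopen(f"https://www.erdosproblems.com/{num}").read().decode()

    # 2. Have the model produce the Lean formalization.
    run_model(f"Formalize Erdos problem {num}, adhering to the style guide:\n{statement}")

    # 3. Build; feed errors back to the model until the file compiles (or we give up).
    for _ in range(max_attempts):
        build = subprocess.run(
            ["lake", "build", f"FormalConjectures/ErdosProblems/{num}.lean"],
            capture_output=True, text=True,
        )
        if build.returncode == 0:
            break
        run_model(f"The build failed; fix these errors and rebuild:\n{build.stderr}")

    # 4. Commit the result.
    subprocess.run(["git", "add", f"FormalConjectures/ErdosProblems/{num}.lean"], check=True)
    subprocess.run(["git", "commit", "-m", f"Formalize Erdos problem {num}"], check=True)
```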

Critical Review

Prior to submitting for review to the DeepMind team, I'd like to take care that the formalization work is up to snuff and can avoid hangups at the PR review phase - namely, around adherence to the style guidelines and correctness.

  1. Summarize style guide adherence and fix any outstanding issues accordingly
  2. Critically analyze conjecture adherence using Opus 4.6

Batching causes models to seek sub-optimal efficiency gains

As batches got larger, I noticed various shortcuts taken by different models to accomplish the task. These included:

  • Implementing for loops in bash
  • Overlooking strict adherence to style guidelines
  • Haiku literally just implementing trivial proofs for a batch of ~80 problems just to get builds passing
  • Firing off multiple lake builds in parallel, which caused issues on the small VM being used
  • Going deep for a batch of 10 files, but remaining fairly surface-level for a batch of 50-100 files

A future state of affairs would involve rigorous-enough guidelines (and a corporate budget for AI usage, lol). This process did not look like firing up Claude Opus with --dangerously-skip-permissions and just saying "get back to me when the work is complete".

Followup - Adhering to style guide

I used Claude Sonnet 4.5 to create a style-guide checklist document from the README's style guidelines. I then had it iterate through the problems and produce a summary document for each one, highlighting its state of style-guide adherence. Finally, I instructed the model to fix any outstanding issues until all conjecture files were cleaned up (and builds passed).
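A sketch of that iteration loop, assuming the Claude Code CLI's non-interactive -p mode and a hypothetical style_checklist.md distilled from the README:

```python
import subprocess
from pathlib import Path

CHECKLIST = Path("style_checklist.md").read_text()  # hypothetical distilled checklist

for lean_file in sorted(Path("FormalConjectures/ErdosProblems").glob("*.lean")):
    prompt = (
        f"Check {lean_file} against this style checklist, write an adherence "
        f"summary to reviews/{lean_file.stem}.md, then fix any issues and "
        f"verify the build still passes:\n\n{CHECKLIST}"
    )
    # `claude -p` runs Claude Code on a single prompt and exits.
    subprocess.run(["claude", "-p", prompt], check=True)
```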

Followup - Review

Given the iterative nature of the effort and the spread of models used, I wanted an expert-level review step at the end of the process to provide some semblance of a quality guarantee: that the AI output is complete and correct, and that it conforms to the conjecture Erdos actually posed in the first place.

To accomplish this, I switched to Opus 4.6 and instructed it to compare each formalization against the original conjecture, point out areas that are vague or could be implemented better, and provide a confidence grade from 0-100% that the conjecture was implemented as stated.
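A sketch of what that review prompt might look like; the wording and output format here are assumptions, not the exact prompt used:

```python
# Intended usage: REVIEW_PROMPT.format(num=..., statement=..., lean_source=...)
REVIEW_PROMPT = """\
You are reviewing a Lean formalization of Erdos problem {num}.

Original statement (from https://www.erdosproblems.com/{num}):
{statement}

Formalization (FormalConjectures/ErdosProblems/{num}.lean):
{lean_source}

1. Point out anything in the original statement that is vague.
2. Point out anywhere the formalization could be implemented better.
3. End with one line: CONFIDENCE: <0-100>% that the conjecture is
   implemented as stated.
"""
```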

Compute Layer - GitHub Codespaces

All development was done within a 4-core GitHub Codespace. I've been paranoid about AI systems since reading Bostrom's Superintelligence back in 2015; I've read much of Eliezer Yudkowsky since then, and I follow the latest research, mostly coming out of Anthropic, on the still-unsolved challenges of emergent misalignment and data exfiltration. I have never run an AI agent on a local machine and don't plan to any time soon. Codespaces is therefore a perfect fit: an isolated cloud environment with access to nothing I care about keeping secret, other than my API keys, which can be rotated easily enough.

Commentary

It's not unreasonable to expect that, for each model, a $/conjecture figure could be approximately deduced, perhaps clustered by problem type.
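A back-of-the-envelope sketch of that calculation; every number below is hypothetical and would come from the providers' usage dashboards:

```python
# Hypothetical per-model totals (spend and completed conjectures).
usage = {
    "gemini-3-pro":      {"spend_usd": 40.0, "conjectures": 120},
    "claude-sonnet-4.5": {"spend_usd": 25.0, "conjectures": 150},
}

for model, u in usage.items():
    print(f"{model}: ${u['spend_usd'] / u['conjectures']:.2f}/conjecture")
```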

It seems like a make-like approach is handy for a project like this: keeping state around in a structured way, being explicit about the particular commands used to carry out each step, and so on, is a virtuous thing. The methodologies used for the same operation varied wildly depending on the model in use and the batch size it was working with at any given time.
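A sketch of what make-like state tracking could look like here; the target names and the freshness check are assumptions:

```python
from pathlib import Path

def up_to_date(target: Path, *sources: Path) -> bool:
    """make-style check: target exists and is newer than all of its sources."""
    return target.exists() and all(
        target.stat().st_mtime >= s.stat().st_mtime for s in sources if s.exists()
    )

def process_problem(num: int) -> None:
    statement = Path(f"cache/{num}.html")                       # fetched problem page
    lean = Path(f"FormalConjectures/ErdosProblems/{num}.lean")  # formalization
    review = Path(f"reviews/{num}.md")                          # final review artifact

    if not up_to_date(lean, statement):
        ...  # re-run the formalization step for this problem
    if not up_to_date(review, lean):
        ...  # re-run the review step
```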

Inbox

2/12 - Ran out of Claude credits. Gemini 3 Pro appears to be really good at figuring out how to batch requests when working on dozens or even 100 problems at a time. Avoiding model switching for now and just having it conduct the final style and conjecture review parts. On to 900-999, since the latest batch worked. Wild.

TODO - it will be interesting to compare review summaries for previously-committed files versus AI-generated ones. An AI review of all previously-committed files should probably provide the default benchmark for how complete a conjecture needs to be before being considered done.

2/13 - Passing batches of 10 or 20 to Gemini 3 appears to route to Pro properly and actually produces more in-depth reviews. Anecdotally, reviews are more thorough for batches of 10 than for batches of 20 (~4 bullet points versus ~2). Unclear whether this is limited by thinking horsepower or by total tokens sent back across the wire; also unclear whether it's truly an issue or merely anecdotal.

Worth it to start over from scratch?

Given the slapdash way this has all panned out, how inefficient Gemini 3 Pro will remain, the fact that I'm out of Claude credits for the rest of the week, and that there's now probably an arms race to be the first to accomplish this, it might be easier to scrap the artifacts produced so far, take what I've learned, and go full send with a more rigorous and consistent approach on a separate branch. What would that plan look like?

Working around token minimization is a crazy limitation here. It reminds me of learning to program on an Arduino, with 2 KB of RAM available for everything.

What does process look like?

  1. Human - curl / git clone erdosproblems.com/NUM
  2. Model (Pro) - Batch - formalize and adhere to style guide
  3. Human - lake build NUM > log
  4. Model (Pro) - Batch - formalize again and fix errors

Gotchas - be clear about what tool use is allowed and shouldn't require oversight (writefile, editfile, fetchUrl?).
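If using Claude Code specifically, tool permissions can be pre-approved in the project's .claude/settings.json; a sketch, where the specific patterns are assumptions about what this workflow would need:

```json
{
  "permissions": {
    "allow": [
      "Read",
      "Edit",
      "Bash(lake build:*)",
      "Bash(git add:*)",
      "Bash(git commit:*)",
      "WebFetch(domain:www.erdosproblems.com)"
    ]
  }
}
```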

Doing all of this in Docker would be cool.

Iterating on a standalone project in a Google Cloud VM to avoid Codespaces budget constraints.
