>The results of this paper should not be interpreted as suggesting that AI can consistently solve
research-level mathematics questions. In fact, our anecdotal experience is the opposite: success cases
are rare, and an apt intuition for autonomous capabilities (and limitations) may currently be important
for finding such cases. The papers (ACGKMP26; Feng26; LeeSeo26) grew out of spontaneous positive
outcomes in a wider benchmarking effort on research-level problems; for most of these problems, no
autonomous progress was made.
The ridiculous resources being thrown at this, and the ability through RLVR to throw gigatons of spaghetti at the wall to see what sticks, should make it very clear just how incredibly inefficient frontier AI reasoning is - however spectacular it may be that it can reason at this level at all.
Long term, though, AI will win out. The thing is that you can improve capability. You can make the context window bigger. You can throw more compute at it. Improve the efficiency of chips. Throw more power at it. And indeed, that has worked so far to turn the GPTs of 2017 into the GPTs of 2026 that can actually do stuff.
Meanwhile, human thoughtpower cannot really be improved. Once the tipping point is reached where computers exceed humans, humans will never be able to catch up by definition.
Humans can also only maintain so much contextual information and scope. They can only learn so much in the time they have to get up to speed. They can only do so much within the span of their own mental peak before they fall off and go senile or die. While these limits are set by evolution, they change on the order of thousands of generations, and require strong selection for those changes at that.
The turtle has marched far already, but the hare, riding in a car they continually improve, is not far behind. Efficiency doesn't matter. What is inefficient now will be trivial to parallelize and scale in the future, as it always has been in the history of compute. We'd have to engage in something like the Bene Gesserit breeding program for human thoughtpower to stay competitive with compute in the future.
You're presupposing an answer to what is actually the most interesting question in AI right now: does scaling continue at a sufficiently favorable rate, and if so, how?
The AI companies and their frontier models have already ingested the whole internet and reoriented economic growth around data center construction. Meanwhile, Google throttles my own Gemini Pro usage with increasingly tight constraints. The big firms are feeling the pain on the compute side.
Substantial improvements must now come from algorithmic efficiency, which is bottlenecked mostly by human ingenuity. AI-assisted coding will help somewhat, but only with the drudgery, not the hardest parts.
If we ask a frontier AI researcher how they do algorithmic innovation, I am quite sure the answer will not be "the AI does it for me."
Of course it continues. Look at the investment in hardware going on. Even with no algorithmic efficiency improvements, that alone will brute-force more performance out of the system, the way a massive, inefficient V8 with paltry horsepower-per-liter figures still makes plenty of power through sheer displacement.
I believe it continues, but I don't know if the rate is that favorable. Today's gigawatt-hungry models that can cost $10-100 per task or more to run... still can't beat Pokémon without a harness. And Pokémon is far from one task.
I believe AGI is probably coming, but not on a predictable timeline or via blind scaling.
After three major generations of models, the "intuition" I've built isn't about what AI can do, but about what a specific model family can do.
No one cares what the gotchas in GPT-3 are because it's a stupid model. In two years no one will care what they were for GPT-5 or Claude 4, for the same reason.
We currently have the option of wasting months of our lives getting good at a specific model, or burning millions to try to get those models to do things by themselves. Neither option is viable long term.
My philosophy is to try to model the trajectories of these systems and build rigging around where the curve is flat (e.g., models have been producing big balls of mud since the beginning, and this hasn't improved meaningfully). Models also have a strong bias toward the mean that I don't expect to go away any time soon.
Trying to outsmart the models at core behaviors over time is asking to re-learn the bitter lesson though.
The idea of autonomous math is fascinating because it implies a shift from search to verification. If an AI can traverse the proof space faster than a human, the bottleneck becomes checking the work.
From a risk perspective this feels safer than autonomous code generation. A bad math proof is simply invalid (and demonstrably so); a bad code snippet is a vulnerability. Math has a built-in truth layer that software engineering often lacks.
This is not really true. It can become very difficult for mathematicians to review the work of their peers. Many times, mistakes evade even scrupulous verification, and the test of time, and of many readers, is the best test.
Why would you not have the bot write in a formal language (e.g. Lean) and then just typecheck it? Then you only need to decide that your definitions were interesting. LLMs are now extremely good at programming given a compiler/typechecker, so I'd expect them to be good at formal math as well. It's nearly (if not precisely) the exact same activity.
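To make "just typecheck it" concrete, here is a minimal Lean 4 sketch (the theorem is a toy example supplied here, not anything from the paper): the kernel accepts the file only if every step actually checks, so a generated proof that compiles is valid by construction.

    -- Toy example: accepted only because each step typechecks.
    theorem add_comm_example (a b : Nat) : a + b = b + a :=
      Nat.add_comm a b

    -- A bogus claim is rejected at compile time:
    -- theorem bogus (a : Nat) : a + 1 = a := rfl   -- error: type mismatch

The open question is then the one raised above: whether the formalized definitions and statements are actually the interesting ones.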
You're confusing yourself with fancy words like "proof space". The LLM is not doing any kind of traversal in any meaningful sense of the word, because the "proof" is often just grammatically coherent gibberish, whereas an actual traversal of an actual space of proofs would never land on incorrect proofs.
My reading of their comment is that a "proof space" is a concept where a human guesses that a proof of some form q exists, and the AI searches a space S(q) where most points may not be valid proofs, but if there is a valid proof, it will hopefully be found.
So it is not a space of proofs in the sense that everything in a vector space is a vector. More like a space of sequences of statements, which have some particular pattern, and one of which might be a proof.
So it's not a proof space, then. It's some computable graph where the edges are defined by standard autoregressive single-step LLM execution, and some of the vertices can be interpreted by theorem provers like Lean, Agda, Isabelle/HOL, Rocq, etc. That's still not any kind of space of proofs. Actually specifying the real logic of what is going on is much less confusing and does not lead readers astray with vague terms like "proof spaces".
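For what it's worth, that candidates-plus-filter framing fits in a few lines. The sketch below is hypothetical (sample_candidate stands in for an LLM call, and it assumes the candidate proofs use only Lean's core library so the bare lean binary can check them); nothing here is from any published system.

    import subprocess
    import tempfile

    def first_checked_proof(conjecture: str, sample_candidate, n_samples: int = 100):
        """Generate-and-verify: most candidates will not check; return the first that does."""
        for _ in range(n_samples):
            candidate = sample_candidate(conjecture)  # hypothetical LLM call returning Lean source
            if typechecks(candidate):
                return candidate  # the checker, not the model, decides what counts as a proof
        return None  # no verified proof within the sampling budget

    def typechecks(lean_source: str) -> bool:
        """Write the candidate to a temporary file and ask the Lean toolchain to check it."""
        with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
            f.write(lean_source)
            path = f.name
        result = subprocess.run(["lean", path], capture_output=True)
        return result.returncode == 0

Under that framing, landing on incorrect candidates is expected and harmless; the only number that matters is the sampler's hit rate.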
This is what everyone who uses LLMs regularly expected. Good results require a human in the loop, and the internet is so big that just about everything has been done there by someone. Most often you.
I still don't get how achieving 96% on some benchmark means it's a super genius but that last 4% is somehow still out of reach. The people who constantly compare robots to people should really ponder how a person who manages to achieve 90% on some advanced math benchmark still misses that last 10% somehow.
This feels like a maybe interesting position, but I don’t really follow what you mean. Is it possible to just state it directly? Asking us to ponder is sort of vague.
These math LLMs seem very different from humans. A person has a specialty. An LLM that was as skilled as, say, a middling PhD recipient (not superhuman), but was that skilled in literally every field, could maybe be argued to be superhuman ("smarter" than any one human). By this standard a room full of people or an academic journal could also be seen as superhuman. Which is not unreasonable; communication is our superpower.
Yeah - it's interesting where the edge is. In theory, an LLM trained on everything should be better positioned to make cross-field connections. But doing that well requires a certain kind of translation and problem-selection work which is hard even for humans. (I would even say, beyond PhD level: knowing which problems are worth throwing PhD students at is the domain of professors... and many of them are bad at it, as well.)
On the human side, mathematical silos reduce our ability to notice opportunities for cross-silo applications. There should be lots of opportunity available.
[1] https://www.scientificamerican.com/article/first-proof-is-ai...
Had no idea it was possible to put a live URL in the abstract of an arXiv listing.
https://en.wikipedia.org/wiki/List_of_cognitive_biases
LLMs are good at search, but plagiarism is not "AI".
Leonhard Euler discovered many things by simply attempting proofs everyone knew were impossible at the time. Additionally, folks like Isaac Newton and Gottfried Leibniz simply invented new approaches to solve general problems.
The folks who assume LLMs are "AI"... are also biased to turn a blind eye to clear isomorphic plagiarism in the models. Note too that LLM activation capping only reduces aberrant offshoots from a reasoning model's expected behavioral vector (it can never be trusted). Thus, it will spew nonsense when faced with some unknown domain search space.
Most exams do not have ambiguous or unknown contexts in the answer key, and a machine should score 100% matching documented solutions without fail. However, an LLM would also require >75% of our galaxy's energy output to reach the error rates of one human-level intelligence in general.
YC has too many true believers in the "AI" hype, and it is really disturbing. =3
https://www.youtube.com/watch?v=X6WHBO_Qc-Q
citation needed
https://www.anthropic.com/research/assistant-axis
The estimated energy consumption versus error rate is likely projected from agent tests and hidden-agent coverage.
You are correct in that such a big number likely includes large errors itself, given that models change daily. =3
The word "energy" does not appear on that page, though, so I'm not sure where you get the galaxy energy consumption figure from.
The activation-capping effect on LLM behavior is described in this paper:
https://www.anthropic.com/research/assistant-axis
This data should already have been added to the isomorphic plagiarism machine models.
Some seem to want to bury this thread, but I think you are hilarious. =3