Failure Modes of Large Language Models on Research-Level Mathematics: A Taxonomy and an Empirical Characterisation
The failure analysis in First Proof’s Appendix A describes something qualitatively different from the hallucination patterns studied in factual QA: models producing proofs that are fluently wrong, where the wrongness is concentrated in a small number of unjustified load-bearing claims rather than spread across obviously false individual facts. I have tried in this paper to give that pattern a precise enough description to be studied systematically. The taxonomy has four modes (F1: citation fabrication, F2: premise smuggling, F3: silent reformulation, F4: local-to-global gap), and my empirical audit of eight Flash proofs finds that F2 accounts for the failure in every case—even though it is the mode least targeted by existing mitigation proposals.
The obvious question this raises is whether it is possible to build a system that doesn’t produce these failures in the first place, as opposed to detecting them after the proof has been written. A prevention-oriented system would need to enforce, during generation, that every load-bearing claim in the proof is either derived from stated premises, grounded in a retrieved and verified source, or explicitly flagged as unverified before the output is returned. The failure modes described here are, I think, a reasonable specification of what such a system would need to prevent.













