In the world of George Orwell's 1984, two and two make five. And large language models are not much better at math.
Though AI models have been trained to emit the correct answer and to recognize that "2 + 2 = 5" might be a reference to the errant equation's use as a Party loyalty test in Orwell's dystopian novel, they still can't calculate reliably.
Scientists affiliated with Omni Calculator, a Poland-based maker of online calculators, and with universities in France, Germany, and Poland, devised a math benchmark called ORCA (Omni Research on Calculation in AI), which poses a series of math-oriented natural language questions in a wide variety of technical and scientific fields. Then they put five leading LLMs to the test.
ChatGPT-5, Gemini 2.5 Flash, Claude Sonnet 4.5, Grok 4, and Deep

The Register
