Why LLMs struggle with basic math

Why large language models fail at basic math, what 2025 research revealed, and how ops leaders should deploy AI for accurate, data-driven decisions.

What LLM arithmetic struggles mean for ops leaders

Large language models are not bad at math by accident. 

They are bad at it by design.

Here is the part that matters for anyone running an operation. 

The fix is not a better prompt. 

The fix is pairing the model with systems built for exact numbers. 

That is the whole purpose behind augmented analytics, where AI handles language and structured systems handle arithmetic.

  • Why this happens
  • What new research revealed about how models actually compute
  • The specific places ops leaders should never trust a raw model with a number
  • Where agentic analytics changes the equation

Why do large language models fail at math?

Large language models fail at arithmetic because they were built to predict language, not to calculate.

They generate the next likely token. 

Math rewards the one exact answer, not the most plausible-looking one.

Five forces behind AI struggles

Numbers are tokens 

A model sees 87439 as symbols, not as eighty-seven thousand.

Prediction beats precision 

The objective is fluency. 

Close enough reads fine in a sentence and fails in a ledger.

No scratchpad by default 

Without an external place to carry steps, multi-step math drifts.

No built-in error check 

Nothing flags that 437 times 892 is nowhere near 200.

Training data is the internet

Plenty of 2 plus 2. 

Far less five-digit multiplication done correctly.

None of this makes the model useless.

It makes it the wrong tool for one specific job. 

The same logic shows up when teams ask which AI chatbot is most accurate, then quietly discover that accuracy on numbers is a separate question from accuracy on words.

Numbers are not numbers to an LLM

To a model, a number is text. 

It gets chopped into tokens, the same way words do.

Write 437 x 892 and the model may split it into pieces like 437, x, and 892, or into stranger fragments depending on the tokenizer.

The string 123.45 can break into several tokens that carry no sense of place value.

A 2025 research line put a name on the core issue: existing tokenizers parse numbers left to right and split them on a fixed vocabulary, which is poorly suited to comparing or calculating values.

The token for a number does not inherently mean its magnitude.

Picture doing long multiplication while treating every digit as an unrelated word. 

That is the starting position.

  • Tokenization optimizes for language coverage, not numeric structure.
  • Place value, the foundation of arithmetic, is not preserved in the tokens.
  • This is one reason machine learning in data analytics treats numbers as numbers from the start, not as text to be guessed.

Numbers split into tokens stop behaving like numbers. They behave like vocabulary.
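To make the tokenization point concrete, here is a minimal sketch of a greedy, left-to-right splitter. The vocabulary `TOY_VOCAB` and the function `greedy_tokenize` are invented for illustration and do not reproduce any real model's tokenizer.

```python
# Toy illustration only: TOY_VOCAB is a made-up vocabulary, not a real tokenizer's.
TOY_VOCAB = {"874", "39", "123", "45", "."} | set("0123456789")


def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Split text left to right, always taking the longest piece found in vocab."""
    tokens = []
    i = 0
    max_len = max(len(piece) for piece in vocab)
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return tokens


print(greedy_tokenize("87439", TOY_VOCAB))   # ['874', '39']
print(greedy_tokenize("39874", TOY_VOCAB))   # ['39', '874']
print(greedy_tokenize("123.45", TOY_VOCAB))  # ['123', '.', '45']
```

In this toy setup, 87439 and 39874 are built from the same two pieces in a different order. Nothing in the piece "39" says whether it stands for thirty-nine or thirty-nine thousand.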

They are built for patterns, not precision

An LLM is a prediction engine.

It learned that 2 + 2 = is usually followed by 4. 

It did not learn addition.

Think of the classmate who memorized the answer key without learning the formulas.

Fine until the test changes the numbers.

Researchers describe this as the stochastic parrot problem: the model continues text in a way that matches patterns it has seen, without an internal notion of the rule underneath. 

  • Language is statistical. 
  • Arithmetic is rule based. 

There is no leeway in a five-digit multiplication or in compounding 2.75 percent for 13 months with banker's rounding.

A confident paragraph can still hide one wrong digit. 

That digit is what fails a reconciliation, trips a compliance exception, or quietly distorts a forecast.

  • Plausible and correct are different targets. Math only rewards correct.
  • This gap is why agentic BI separates the reasoning layer from the calculation layer instead of asking one model to do both.
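For contrast, here is roughly what rule-based arithmetic looks like when code does it. This is a sketch under stated assumptions: the 2.75 percent is treated as a monthly rate, and rounding to cents happens once at the end with banker's rounding (round half to even). A real contract would spell out both choices, and `compound` is a hypothetical helper name.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def compound(principal: Decimal, monthly_rate: Decimal, months: int) -> Decimal:
    """Compound monthly, then round to cents with banker's rounding.

    Assumption: rounding happens once, at the end, not after every period.
    """
    balance = principal
    for _ in range(months):
        balance = balance * (Decimal(1) + monthly_rate)
    return balance.quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)


# Banker's rounding sends exact halves to the nearest even digit.
print(Decimal("0.125").quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN))  # 0.12
print(Decimal("0.135").quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN))  # 0.14

# Illustrative principal; 2.75 percent per month for 13 months.
print(compound(Decimal("10000.00"), Decimal("0.0275"), 13))
```

The same inputs give the same output on every run, which is the property a ledger needs and a next-token predictor does not guarantee.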

How does Claude actually add two numbers? 

The 2025 research

In 2025, Anthropic's interpretability team traced the internal steps a model takes to add numbers, and the method was nothing like school arithmetic.

Using a technique they compare to a brain scanner, researchers watched the model add 36 and 59. 

There was no carrying the one. 

Two strategies ran in parallel:

  • One path estimated a rough total, adding approximate values to land near 92ish.
  • A second path focused only on the last digits, 6 and 9, and worked out that the answer must end in 5.
  • The two combined to produce 95.

Stranger still: 

When asked how it solved the problem, the model described the standard carrying method. 

It did not report the parallel-estimation trick it actually used. 

The explanation and the mechanism did not match.

A model can give a convincing account of its reasoning that does not reflect what it did. 

For anyone weighing how AI supports decision making, the lesson is blunt: a fluent explanation is not evidence of a correct calculation.

The model can describe carrying the one while doing something else entirely. Mimicry is not mastery.
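The two-path idea can be mimicked in a few lines of code. This is a toy analogy, not a reconstruction of the model's internal circuitry. The function names are invented, and the rough total is hard-coded to the "92ish" figure described above.

```python
def last_digit_path(a: int, b: int) -> int:
    """Exact path: only the final digits decide the final digit of the sum."""
    return (a % 10 + b % 10) % 10


def combine(rough_total: int, last_digit: int) -> int:
    """Pick the number nearest the rough total that ends in the required digit."""
    base = rough_total - (rough_total % 10) + last_digit
    candidates = [base - 10, base, base + 10]
    return min(candidates, key=lambda n: abs(n - rough_total))


digit = last_digit_path(36, 59)  # 6 + 9 = 15, so the sum ends in 5
print(combine(92, digit))        # 95, the correct sum
print(combine(88, digit))        # 85, a plausible-looking but wrong answer
```

The second call shows the weakness of the shortcut in this toy version: when the rough estimate drifts a little too far, the result still ends in the right digit and still looks reasonable, but it is wrong.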

No working memory means broken multi-step math

Humans carry intermediate steps. 

A base model does not hold a running total the way you would on paper.

Generate part of an answer and the model cannot reliably reference its own earlier work mid-calculation.

On a multi-step problem, it can lose the thread halfway through.

This is the failure that compounds in real work. 

A single arithmetic step is risky. 

A chain of them, the kind any forecast or margin calculation requires, multiplies the risk.

  • Multi-step problems amplify small errors into wrong conclusions.
  • Structured systems hold state by design, which is why fusing machine learning with LLMs outperforms a language model working alone.
  • It is also why predictive analytics leans on models trained for numeric reasoning, not text prediction.
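A back-of-the-envelope sketch shows how quickly chained steps erode reliability. It assumes every step fails independently with the same probability, and the 99 percent per-step figure is purely illustrative, not a measured accuracy for any model.

```python
def chain_accuracy(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain is right, assuming independent steps."""
    return per_step_accuracy ** steps


for steps in (1, 5, 10, 20, 50):
    print(steps, round(chain_accuracy(0.99, steps), 3))
```

Under these assumptions, a step that is right 99 percent of the time yields a fully correct ten-step chain only about 90 percent of the time, and a fifty-step chain only about 60 percent of the time.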

There is no internal error-checking

People sense when a number looks wrong. The human double-check. A base model doesn’t.

There is no internal voice saying 437 times 892 is obviously not 200. 

The model does not cross-reference mathematical rules.

It continues, confidently, whether right or wrong.

This is exactly why model makers bolt on external calculators and code execution.

The model is good at deciding what to compute. It needs a real tool to compute it.
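The missing double-check is easy to supply in code. Here is a minimal sketch, assuming positive integers and using hypothetical helper names: first a cheap order-of-magnitude test, then the exact comparison.

```python
def digit_count_is_plausible(a: int, b: int, claimed: int) -> bool:
    """The product of an m-digit and an n-digit positive integer has m+n-1 or m+n digits."""
    expected = len(str(a)) + len(str(b))
    return len(str(claimed)) in (expected - 1, expected)


def verify_product(a: int, b: int, claimed: int) -> str:
    """Check a claimed product: sanity-test the size first, then compare exactly."""
    if not digit_count_is_plausible(a, b, claimed):
        return "reject: wrong order of magnitude"
    if a * b != claimed:
        return "reject: right size, wrong digits"
    return "accept"


print(verify_product(437, 892, 200))     # reject: wrong order of magnitude
print(verify_product(437, 892, 389204))  # reject: right size, wrong digits
print(verify_product(437, 892, 389804))  # accept
```

The first check is the code version of the gut feeling that 437 times 892 cannot be 200. The second catches the quieter failure, a number of the right size with a wrong digit.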

Confidence is not correctness

The tone stays steady even when the math is off.

Anomaly detection's role

Catching wrong numbers before they spread is the job of anomaly detection, which screens for values that break the expected pattern.
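As one hedged illustration of screening for values that break the expected pattern, the sketch below flags outliers with a modified z-score based on the median absolute deviation. The article does not say which method any particular product uses. The 3.5 threshold is a common rule of thumb, and the weekly totals are made up.

```python
from statistics import median


def flag_outliers(values: list[float], threshold: float = 3.5) -> list[float]:
    """Flag values whose modified z-score (median/MAD based) exceeds the threshold."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # No spread at all: anything different from the median stands out.
        return [v for v in values if v != center]
    return [v for v in values if 0.6745 * abs(v - center) / mad > threshold]


# Made-up weekly totals; one looks like a dropped digit.
weekly_totals = [10120, 9980, 10240, 10050, 1005, 10190]
print(flag_outliers(weekly_totals))  # [1005]
```

A median-based score is used here because a single extreme value distorts the mean and standard deviation it would otherwise be judged against.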

The internet is not a math textbook

Training data is mostly the open web, and the web is not overflowing with correct, worked math.

Models see endless basic addition and far fewer examples of large or unusual calculations done right. 

Accuracy gets patchy on: 

  • Bigger numbers
  • Rare formats
  • Edge cases

Benchmarks bear this out. 

One widely cited evaluation found that even strong models nail multiplication up to a point, then fail completely as the operands grow. 

One model returned a five-digit product that was close but wrong, with the right first and last digits and an error in the middle. 

Close, and useless for a ledger.

  • Performance drops sharply as problems grow more complex.
  • Middle digits are where multiplication tends to break.
  • Reliable numbers come from data accuracy practices applied to structured sources, not from a model recalling patterns.
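The "right ends, wrong middle" failure is simple to detect once an exact value is available. A small sketch follows. The wrong answer in it is invented for illustration and is not taken from any benchmark.

```python
def mismatched_positions(claimed: int, exact: int) -> list[int]:
    """Return the digit positions (0 = leftmost) where claimed differs from exact."""
    c, e = str(claimed), str(exact)
    if len(c) != len(e):
        # Different lengths: treat every position as suspect.
        return list(range(max(len(c), len(e))))
    return [i for i, (x, y) in enumerate(zip(c, e)) if x != y]


exact = 437 * 892   # 389804
claimed = 389204    # invented wrong answer: correct first and last digits
print(mismatched_positions(claimed, exact))  # [3]
```

A reviewer skimming the first and last digits would pass this number. A digit-by-digit comparison would not.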

Why do models ace math benchmarks but fail at adding two numbers?

Because high benchmark scores and reliable arithmetic are not the same thing.

Recent studies surfaced a disconnect that should make any ops leader cautious. 

Models post strong results on well-known math benchmarks, from grade-school word problems to competition sets, then stumble on basic tasks like adding or sorting a list of numbers.

Two patterns explain the gap:

Pattern matching, not reasoning 

High scores can reflect familiarity with benchmark-style problems rather than genuine arithmetic ability.

Overthinking simple problems

Models often wrap trivial arithmetic in long reasoning chains, and longer chains have been shown to produce more wrong answers, not fewer.

The takeaway for buyers:

A model's benchmark headline tells you little about whether it will add your revenue lines correctly. 

Treat the two as separate questions.

This is the same trap behind why so much data is useless: impressive output that does not survive contact with a real decision.

What fixes the problem? 

Tools, code, and structured systems

The reliable fix is to stop asking the model to be the calculator. 

Hand the arithmetic to a system built for it.

Three approaches have matured, and they stack:

| Approach | What it does | Where it falls short |
| --- | --- | --- |
| Tool and calculator hooks | The model calls an external calculator or symbolic engine for exact math. | Only as good as knowing when to call the tool. |
| Code interpreters | The model writes and runs code to compute the answer. | Adds latency, and the code itself can be wrong. |
| Structured data systems | Numbers live in a real data layer. The model handles language, the system handles math. | Requires connected, governed data to work. |
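A tool hook can be sketched as a small, locked-down calculator that the model hands expressions to. This is a minimal illustration, not any vendor's tool-calling API. The shape of `tool_request` is hypothetical, and only `+`, `-`, `*`, and `/` on numeric literals are accepted.

```python
import ast
import operator

_ALLOWED = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculate(expression: str):
    """Evaluate +, -, *, / on numeric literals only; anything else is rejected."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _ALLOWED:
            return _ALLOWED[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")

    return walk(ast.parse(expression, mode="eval"))


def answer(tool_request: dict) -> str:
    # tool_request stands in for whatever structured call a model emits (hypothetical shape).
    result = calculate(tool_request["expression"])
    return f"{tool_request['expression']} = {result}"


print(answer({"tool": "calculator", "expression": "437 * 892"}))  # 437 * 892 = 389804
```

Walking the parsed syntax tree instead of calling `eval` means the model can ask for arithmetic and nothing else. Function calls, names, and attribute access all raise `ValueError`.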

Important note from researchers 

Tools and reasoning chains are useful crutches, but they let a model solve arithmetic without ever gaining real numeric skill. 

For business use, that is fine. 

You do not need the model to understand math. You need the answer to be right.

The durable pattern is division of labor. 

A language model for narrative and intent. Machine learning analytics for the numbers.

The combination beats either one alone.
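The division of labor might look something like the sketch below, where a database does the sum and a separate step writes the sentence. The table, the figures, and the function names are all made up. `narrate` is a plain template standing in for the language model's role.

```python
import sqlite3

# In-memory database with made-up figures, stored as integer cents to avoid float drift.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revenue (region TEXT, amount_cents INTEGER)")
conn.executemany(
    "INSERT INTO revenue VALUES (?, ?)",
    [("North", 1_254_310), ("South", 987_655), ("West", 2_003_099)],
)


def total_revenue_cents(connection: sqlite3.Connection) -> int:
    """The data layer computes: the sum is done by the database, not by a model."""
    (total,) = connection.execute("SELECT SUM(amount_cents) FROM revenue").fetchone()
    return total


def narrate(total_cents: int) -> str:
    """The language layer explains: it formats a number it was given and never derives one."""
    dollars, cents = divmod(total_cents, 100)
    return f"Total revenue across all regions was ${dollars:,}.{cents:02d}."


print(narrate(total_revenue_cents(conn)))
# Total revenue across all regions was $42,450.64.
```

The narrative step receives a finished number. Swapping the template for a language model changes the wording, not the arithmetic.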

See how a fully agentic analytics stack separates these jobs from data prep through ML.

Why this matters for ops leaders

If a model touches your revenue projections, forecasts, or pipeline health, you need to know exactly when not to trust it.

You are not building a robot accountant. Fair.

But the moment AI feeds a number into a decision, the math has to be exact, and a raw model cannot promise that.

The risk is not that the model refuses. 

It is that it answers, confidently, with a number that is slightly off. 

Slightly off is what fails an audit or misprices a quarter.

Knowing the blind spots is not a reason to ditch AI. 

It is the reason to deploy it correctly.

  • Never let a raw model do the arithmetic behind a financial decision.
  • Do let it interpret, summarize, and investigate, which is the heart of how data drives decisions in finance.
  • Watch for the same gap in sales forecasting, where a confident wrong number is worse than no number.
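One way to enforce these rules is a guard that refuses any model-written summary containing a number the data layer did not produce. The sketch below is deliberately simple. The regular expression and the helper name are assumptions, and the check would also flag harmless digits such as the 3 in "Q3", so a real pipeline would need an allowlist.

```python
import re
from decimal import Decimal

# Matches plain numbers with optional thousands separators and decimals.
NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def unverified_numbers(summary: str, computed: set[Decimal]) -> list[Decimal]:
    """Return every number in the summary that is not among the system-computed values."""
    found = [Decimal(m.group().replace(",", "")) for m in NUMBER.finditer(summary)]
    return [n for n in found if n not in computed]


computed = {Decimal("42450.64"), Decimal("3.2")}  # values from the data layer (made up)

good = "Revenue reached $42,450.64, up 3.2 percent."
bad = "Revenue reached $42,450.46, up 3.2 percent."

print(unverified_numbers(good, computed))  # []
print(unverified_numbers(bad, computed))   # [Decimal('42450.46')]
```

The second summary reads just as confidently as the first. Only the comparison against computed values reveals the transposed digits.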

The issue is not math. It is interpretation

Even when the numbers are exact, a report still does not tell you what to do. 

That gap is where most value leaks out.

A VP of Quality at a roughly one-billion-dollar operation put the real problem plainly during a customer conversation:

We have a gold mine of data. How do I explore it and translate it into a gold bar?

He went further, describing a frustration that nearly everyone has and almost no one admits:

I'm reading the distribution reports and it makes sense to them, but it doesn't make sense to me.

That is not a math problem. 

The numbers were correct. 

The interpretation did not scale.

Your BI shows what happened.

You still need to know what it means and what to do next.

This is the layer Domain Intelligence is built for.

It captures how your best operator reads the data, then runs that interpretation across every location, every cycle, automatically.

The arithmetic is handled by structured systems.

The judgment comes from your people, encoded once and scaled everywhere.

As one founder describes the capture step: if you recorded everything a senior manager thought while reading their reports, you could stick that into the system so it runs on their behalf. 

Not the chart. 

Not the click. 

The interpretation.

Operators do not log in to run queries.

A report arrives, already interpreted. 

See how Scoop investigates rather than just monitoring.

It sits on top of the BI you already run and adds the meaning layer.

That is the premise behind agentic vs augmented analytics.

Where Scoop comes in

Scoop pairs the narrative power of AI with the mathematical rigor of real data systems. 

It does not ask a chatbot to do arithmetic.

Scoop uses your structured data to build real insights, with presentation-ready outputs behind every conclusion. 

You define what matters. 

The system does the calculation, then explains the result in plain language.

  • It is AI-powered and human-approved.
  • The model interprets. 
  • The data layer computes. 

Neither is asked to do the other's job.

Ask questions in plain English and get answers, not dashboards, through natural language analytics.

Connect your: 

  • CRM
  • Financial systems
  • Marketing data 

All without a migration. 

So what do we do with this?

Understanding the limits of AI does not make it weaker. It makes you sharper.

A model can: 

  • Brainstorm
  • Write
  • Categorize
  • Summarize 

Pair it with systems built for accuracy and it becomes dependable. 

Skip that step and you risk decisions built on math that could not pass a pop quiz.

The future is not LLM-only

The future is: 

LLM + structured systems + smart workflows

So next time your model flubs a math problem, laugh a little. 

Then ask the real question: 

How is my team combining human logic, machine intelligence, and structured data?

Frequently asked questions

Why are LLMs bad at math but good at writing?

Language models are trained to predict the next likely word, which suits writing and fails arithmetic. Writing rewards plausible patterns. Math rewards one exact answer. The same design that makes them fluent makes them unreliable with numbers.

Can LLMs do math if you give them a calculator?

Yes, much better. When a model calls an external calculator, code interpreter, or symbolic engine, accuracy improves sharply. The model decides what to compute and the tool computes it. The remaining risk is the model knowing when to reach for the tool.

Why does an AI count the letters in strawberry wrong?

Because the model sees strawberry as a few tokens, not as a sequence of individual letters. Counting characters requires a structure the tokenizer strips away. The same tokenization issue that breaks letter counting also breaks number handling.

  • Tokenization optimizes for language, which discards the per-character and per-digit structure that counting and arithmetic need.
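For contrast, counting letters is trivial for code, because code works on characters. The token split below is hypothetical, chosen only to illustrate the idea, and is not the output of any real tokenizer.

```python
word = "strawberry"
toy_tokens = ["str", "aw", "berry"]  # hypothetical split, not a real tokenizer's output

# Code sees individual characters, so the count is exact.
print(word.count("r"))  # 3

# The pieces still spell the word, but a model receives them as opaque units,
# not as the letters inside them.
print("".join(toy_tokens) == word)  # True
```

The letters are all still there. What changes is the unit the system works with, which is why routing a counting question to code settles it.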

Do newer models like GPT and Claude still struggle with arithmetic?

They are better, especially with tools and reasoning enabled, but the core limitation remains. Strong benchmark scores do not guarantee reliable basic arithmetic, and studies in 2025 found models still failing at adding or sorting simple numbers. For business decisions, route the math through a structured system.

How should ops teams use AI for forecasting and reporting?

Use AI to interpret, summarize, and investigate, and use structured data systems to do the math. Never let a raw model produce the numbers behind a revenue, margin, or compliance decision. The reliable pattern is a language model for meaning and a data layer for calculation.

Scoop Team

At Scoop, we make it simple for ops teams to turn data into insights. With tools to connect, blend, and present data effortlessly, we cut out the noise so you can focus on decisions—not the tech behind them.
