
Thursday, December 25, 2025

Why can’t powerful AIs learn basic multiplication?

Image Credit: Scientific Frontline / Stock image

These days, large language models can handle increasingly complex tasks, from writing intricate code to engaging in sophisticated reasoning.

But when it comes to four-digit multiplication, a task taught in elementary school, even state-of-the-art systems fail. Why? 

A new paper by Xiaoyan Bai, a computer science Ph.D. student at the University of Chicago, and Chenhao Tan, faculty co-director of the Data Science Institute’s Novel Intelligence Research Initiative, finds answers by reverse-engineering both failure and success.

They worked with collaborators from MIT, Harvard University, University of Waterloo and Google DeepMind to probe AI’s “jagged frontier”—a term for its capacity to excel at complex reasoning yet stumble on seemingly simple tasks.

As you may remember (or have forgotten), multiplying larger numbers requires carrying digits and mentally “holding on” to partial products so you can add them up to get your final total. Dependencies like these, where information produced early in a computation must be stored for use much later, are called “long-range dependencies.”
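To make that bookkeeping concrete, here is a short Python sketch of grade-school long multiplication, written for this article rather than taken from the paper, that keeps every partial product around until the very end:

```python
def long_multiply(a: int, b: int) -> int:
    """Grade-school long multiplication, with the partial products kept explicit."""
    a_digits = [int(d) for d in str(a)][::-1]  # least-significant digit first
    b_digits = [int(d) for d in str(b)][::-1]

    partial_products = []
    for i, db in enumerate(b_digits):
        row = 0
        for j, da in enumerate(a_digits):
            # each digit pair contributes at position i + j
            row += da * db * 10 ** (i + j)
        partial_products.append(row)

    # the final answer needs every partial product held "in memory" until the end
    return sum(partial_products)

assert long_multiply(1234, 5678) == 1234 * 5678
```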

Standard large language models work by learning to recognize patterns in the data they’re trained on. But the more complex a problem gets, the less likely a model is to have seen it specifically. So how do you teach a model to not just memorize answers but learn a process?

Why standard training fails

Models are often taught new tasks through a process known as standard fine-tuning, which relies on scaling up the training data, running more training steps, or adding more “layers” to the model.
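For a sense of what that standard setup looks like in practice, here is a small, self-contained PyTorch sketch written for this article, with made-up names and a toy vocabulary rather than the team’s actual code: a tiny decoder-only transformer is trained to predict the answer string directly from “a*b=”, with no intermediate steps.

```python
import random
import torch
import torch.nn as nn

# Toy "direct answer" setup: the model sees "1234*5678=" and must emit the
# product with no intermediate steps -- the standard fine-tuning recipe the
# article describes, shrunk to an illustration.
VOCAB = list("0123456789*=")
stoi = {ch: i for i, ch in enumerate(VOCAB)}

def make_example():
    a, b = random.randint(1000, 9999), random.randint(1000, 9999)
    text = f"{a}*{b}={a * b}"
    ids = torch.tensor([stoi[ch] for ch in text])
    return ids[:-1], ids[1:]  # next-token prediction: inputs vs. shifted targets

class TinyDecoder(nn.Module):
    """A minimal decoder-only stand-in for the small transformers in the study."""
    def __init__(self, d_model=64, n_layers=2, n_heads=4, max_len=32):
        super().__init__()
        self.tok = nn.Embedding(len(VOCAB), d_model)
        self.pos = nn.Embedding(max_len, d_model)
        block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, n_layers)
        self.head = nn.Linear(d_model, len(VOCAB))

    def forward(self, x):  # x: (batch, seq) of token ids
        seq_len = x.size(1)
        h = self.tok(x) + self.pos(torch.arange(seq_len, device=x.device))
        # causal mask so each position only attends to earlier tokens
        causal = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=x.device), diagonal=1
        )
        h = self.blocks(h, mask=causal)
        return self.head(h)

model = TinyDecoder()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(200):  # a real run would train much longer on far more data
    x, y = make_example()
    logits = model(x.unsqueeze(0))
    loss = loss_fn(logits.squeeze(0), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```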

But even when the research team tested models with two layers all the way up to 12 layers, they all achieved less than 1% accuracy when multiplying two four-digit numbers. The standard approaches were clearly failing, and researchers wanted to understand why.

“As AI is increasingly integrated into critical decision-making, it’s essential to understand its unique ways of learning and thinking. Our research is trying to chart that terrain.”
Chenhao Tan

They found that under the standard approach, models converge on a “local optimum”: a solution that fits the training data well enough that training stops improving, even though it never captures the full procedure. But tasks like multi-digit multiplication require a model to remember earlier computations while producing later digits.

Without an architecture that can store and retrieve intermediate information, a model gets stuck, unable to move beyond that local optimum—no matter how long it trains or how large it scales.

Next, the researchers identified a model trained using a different method: Implicit Chain of Thought (ICoT). 

Where standard fine-tuning achieved less than 1% accuracy, the ICoT model was able to achieve 100% accuracy. To understand what this approach was doing differently, the team took both apart to uncover some fundamental insights.

First, they saw that the ICoT model learns to remember what matters. 

Unlike the standard fine-tuning model, the ICoT model learned to track those long-range dependencies, or the information it gradually put together to solve a problem. The team verified this by testing whether they could decode intermediate values, such as running sums, from the models’ internal states. In the ICoT model, they could—but in the standard model, they couldn’t. 
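A probe of that kind can be as simple as a linear classifier trained to read a value, such as a running sum, out of the hidden states. The sketch below is our own illustration of the general technique, using random tensors in place of real hidden states, not the authors’ code:

```python
import torch
import torch.nn as nn

def train_linear_probe(hidden_states, labels, n_classes, steps=500, lr=1e-3):
    """Fit a linear map from hidden states to an intermediate quantity.

    hidden_states: (num_examples, d_model) activations from one layer/position
    labels: (num_examples,) integer-coded intermediate values (e.g. running sums)
    If the probe reaches high accuracy, the quantity is linearly decodable,
    which is evidence the model actually keeps it around.
    """
    probe = nn.Linear(hidden_states.size(-1), n_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        loss = loss_fn(probe(hidden_states), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    accuracy = (probe(hidden_states).argmax(-1) == labels).float().mean().item()
    return probe, accuracy

# Toy usage with random data, just to show the shapes involved; a real analysis
# would measure probe accuracy on held-out examples.
h = torch.randn(256, 64)             # hidden states from some layer
sums = torch.randint(0, 19, (256,))  # e.g. a digit-level running sum in [0, 18]
_, acc = train_linear_probe(h, sums, n_classes=19)
```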

The ICoT method gradually removes intermediate reasoning steps during training, in a sense forcing the model to internalize the reasoning process in its hidden states rather than relying on explicit step-by-step tokens.
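In rough terms, the training targets might evolve the way this short Python sketch shows; it is our paraphrase of the idea, with made-up example strings, not the authors’ implementation:

```python
def icot_target(question, cot_steps, answer, epoch, tokens_removed_per_epoch=2):
    """Build the training target for one epoch under an ICoT-style schedule.

    Early epochs keep the full chain of intermediate steps; as training goes on,
    reasoning tokens are deleted from the front, until only the question and the
    final answer remain and the "reasoning" has to happen in the hidden states.
    """
    cot_tokens = " ".join(cot_steps).split()
    keep = max(0, len(cot_tokens) - epoch * tokens_removed_per_epoch)
    visible = cot_tokens[len(cot_tokens) - keep:]
    return " ".join([question] + visible + [answer])

steps = ["12*4=48", "12*30=360", "48+360=408"]
print(icot_target("12*34=", steps, "408", epoch=0))   # full worked solution
print(icot_target("12*34=", steps, "408", epoch=50))  # question and answer only
```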

Next, they saw that the ICoT model organizes its attention into distinct pathways across time.

Think of it like a well-organized filing system: In early layers, the model computes products of digit pairs and stores them at specific locations. In later layers, it retrieves exactly the values it needs to calculate each digit of the final answer. The result is an efficient internal structure for carrying out multiplication, one that never emerges in the standard model.

Finally, and perhaps most remarkably, the researchers found the ICoT model internally represents these operations using elegant structures. Instead of treating digits as symbols alone, the model encodes them as wave-like patterns known as Fourier bases and organizes its arithmetic in a visual, spatial way.

When multiplying digit pairs, the model uses a geometric operation called a Minkowski sum, something the researchers didn’t program but that emerged on its own during training in the ICoT model. It’s as if the successful model derived its own efficient mathematical language for arithmetic.
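For readers who want the underlying math, here is the general shape of those two ideas, written in our own notation; the paper’s exact construction may differ. A digit can be encoded as a handful of sinusoidal (Fourier) features, and a Minkowski sum simply adds every element of one set to every element of another:

```latex
% Fourier-basis encoding of a digit d in {0, ..., 9}: a stack of sinusoids at a
% few frequencies k, rather than a purely symbolic one-hot code.
\[
  \phi(d) \;=\; \Big[\, \cos\!\Big(\tfrac{2\pi k d}{10}\Big),\ \sin\!\Big(\tfrac{2\pi k d}{10}\Big) \,\Big]_{k = 1, \dots, K}
\]

% Minkowski sum of two sets A and B: every element of A added to every element
% of B, a natural way to combine sets of partial values.
\[
  A \oplus B \;=\; \{\, a + b \;:\; a \in A,\ b \in B \,\}
\]
```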

A simple fix

The researchers reasoned that if the standard fine-tuning models failed because they lacked the right built-in guidance, then providing the right training signal should fix it. To test this, the team introduced a simple solution: an added training objective that teaches the model to track running sums at each step, allowing it to carry intermediate values and partial products forward.

The result: 99% accuracy without explicit chain-of-thought supervision.

It turned out that making this one addition to the two-layer model that had completely failed under standard training was enough to do the trick.
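One way such an auxiliary objective could be wired up is sketched below, with hypothetical names and random tensors standing in for a real model and dataset; this illustrates the general recipe, not the paper’s code. A small head must read the running sum out of the hidden state at every step, and its loss is added to the usual next-token loss.

```python
import torch
import torch.nn as nn

class RunningSumHead(nn.Module):
    """Auxiliary head that predicts the current running sum from each hidden state."""
    def __init__(self, d_model, max_sum):
        super().__init__()
        self.proj = nn.Linear(d_model, max_sum + 1)

    def forward(self, hidden):      # hidden: (batch, seq, d_model)
        return self.proj(hidden)    # logits over possible running-sum values

def combined_loss(token_logits, token_targets, sum_logits, sum_targets, aux_weight=0.5):
    """Standard next-token loss plus a weighted penalty for forgetting the running sum."""
    ce = nn.CrossEntropyLoss()
    lm_loss = ce(token_logits.flatten(0, 1), token_targets.flatten())
    aux_loss = ce(sum_logits.flatten(0, 1), sum_targets.flatten())
    return lm_loss + aux_weight * aux_loss

# Shapes only, with random tensors, to show how the pieces fit together.
B, T, D, V, MAX_SUM = 2, 16, 64, 12, 200
head = RunningSumHead(D, MAX_SUM)
hidden = torch.randn(B, T, D)  # hidden states from the base model
loss = combined_loss(
    token_logits=torch.randn(B, T, V),
    token_targets=torch.randint(0, V, (B, T)),
    sum_logits=head(hidden),
    sum_targets=torch.randint(0, MAX_SUM + 1, (B, T)),
)
```

In a real setup the running-sum labels would come from the arithmetic itself, and the auxiliary head could be dropped once training is done.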

When the researchers examined the model’s attention patterns, they found it had learned mechanisms similar to ICoT’s—structures that store and retrieve partial products as needed. The model also developed additional strategies, including a way to track multiple digit pairs at the same time.

Novel intelligence 

While multiplication might seem like a narrow task, the findings illuminate fundamental aspects of how large language models learn and “think.”

The long-range dependency problem isn’t unique to arithmetic—it appears throughout language modeling and other sequential tasks. The UChicago team’s approach asks foundational questions about the distinctions between memorization and learning, and what architectural constraints help or hinder models’ performance.

“As AI is increasingly integrated into critical decision-making, it’s essential to understand its unique ways of learning and thinking,” said Tan. “Our research is trying to chart that terrain.”

This paper’s key contribution: architectural insights and training techniques can overcome obstacles that scaling alone cannot address. The right built-in guidance, not just more parameters or data, is key to pushing AI capabilities forward.

While the solution for the multiplication issue is task-specific, the researchers anticipate future work will develop more general approaches to improve learning on tasks requiring models to keep track of information across many steps.

Published on: arXiv (preprint)

Title: Why Can’t Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls

Authors: Xiaoyan Bai, Itamar Pres, Yuntian Deng, Chenhao Tan, Stuart Shieber, Fernanda Viegas, Martin Wattenberg, and Andrew Lee

Source/Credit: University of Chicago | Manasa Reddy

Reference Number: ai122525_01
