The Latent State Window and the Limits of Reasoning in LLMs

A detailed look at recent papers on the limits of reasoning in LLMs.

Jon Stokes
Jul 05, 2025
There are a few papers out recently that investigate AI’s ability to reason, and on my X timeline they’ve been framed mainly in terms of whether they successfully deflate the hype around LLMs and AGI (artificial general intelligence). Some would-be AI hype debunkers look to these papers as proof of Yann LeCun’s stance that autoregressive LLMs will never scale to AGI or ASI (artificial super-intelligence), while others have pushed back on them in various ways as overstating the case against LLMs.

Right now, I don’t have very many thoughts about the AGI discourse around any of this work, at least none that are in any state to share with readers. Ultimately, I am a guy who is trying to get work done with LLMs and who is in the business of trying to help others get work done with LLMs. This AGI or ASI stuff will happen or it won’t, but in the meantime, my customers and I have jobs to do.

💭 Three of these papers, listed below, are useful because they reinforce intuitions that many of us already have from working closely with state-of-the-art (SoTA) language models.

  • The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

  • Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models

  • Potemkin Understanding in Large Language Models

These papers also provide a bit of formal and experimental justification for the kind of folk psychology of practical LLM context management I’ve presented in previous articles:

  • How to Do Real Work With LLMs, Part 1

  • Doing Real Work With LLMs: How to Manage Context

  • The Reasoning Revolution In AI: How We Got Here & Why It Matters

💰 But the real payoff of these papers, especially when read together with some other recent work on LLM reasoning that I’ll highlight in this post, is the following set of hypotheses that I think are worth testing and that have implications for how we use LLMs to do real work:

  1. LLM reasoning involves managing a set of latent variables that we might refer to in the aggregate as latent state.

  2. The amount of latent state any one LLM can manage on an inference pass is limited, and once you overfill that latent state window, the model’s ability to reason collapses.

  3. The size of an LLM’s latent state window is only loosely tied to its context window limits, mainly in the sense that thinking tokens are the primary storage medium that a reasoning model uses to represent latent state at time t for an inference step at time t + n.

  4. More abstract concepts use more latent state than simpler, more concrete concepts. This is why a large-context model that crushes a simple Q&A benchmark even with a full context window might fail on a far smaller prompt (measured in tokens) that nonetheless overloads its latent state window.

I’ll take these papers one at a time in the remainder of this post, drawing some lessons from them about how to (and how not to) do things with large language models.

🍎 Apple’s “Illusion of Thinking” paper

This Apple paper was widely discussed online when it came out, and while I read much of the discussion I have to confess I barely remember any of it. I personally didn’t think anything in this paper was at all surprising, and it all tracked closely with my own experience of using SoTA LLMs to solve problems.

Here’s my brief summary of the paper and its findings:

  • Most attempts to measure the quality of an LLM’s thinking are pretty fuzzy, and it would be better to benchmark a model’s thinking by posing problems that increase in complexity in some measurable, straightforward way.

  • The authors propose a set of puzzles that have basically a single “complexity knob” (e.g., the number of checkers to jump or the number of disks to move) that you can turn to dial up the difficulty.

  • When this knob is set at low levels, non-thinking models actually outperform thinking models.

  • When this knob is set at medium levels, thinking models outperform non-thinking models.

  • There’s a threshold in all the thinking models where, if you dial the knob up just high enough, the model chokes and it can’t solve the problems at all.

The findings detailed in this paper track with my general sense of how these models work, which is something like the following: the more latent variables you’re asking the model to juggle in an extended inference pass, the worse the model will perform.

What I mean is, in order to solve the problem the model has to think in something like the following manner: “For this river crossing problem, there are two sides of the river that I have to keep track of, and I also have to keep track of who is in the boat, and who is on each side of the river or who can go in the boat depends on the previous moves I made to solve this...”

So there’s a lot of what we computer types would call “state” in these problems — “state” being a computer science catch-all term for “details the computer has to remember in order to complete a task.” As you add more checkers or towers or would-be river-crossers, you increase the amount of state the model needs to somehow represent (in language, via thinking tokens) in order to solve the problem.

It’s important to recall that the model has no internal “memory” where it can store any of the details it needs to solve the problem. The only read/write memory it can use is the token window, and it “remembers” all the details it needs to manage by spelling them out in English as thinking tokens.
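To make that concrete, here’s a toy sketch (my own illustration, not anything from the paper): the entire river-crossing state has to be spelled out as text, and every additional traveler makes that spelled-out state longer, meaning more tokens the model has to regenerate and keep consistent at every step of its reasoning.

```python
# Toy illustration: the model's only working "memory" for a puzzle is whatever
# it writes out as text, so the full problem state must be re-expressed in
# tokens each time the model needs to "remember" it. (The class and function
# names here are mine, purely for illustration.)
from dataclasses import dataclass, field
from typing import List

@dataclass
class RiverState:
    left_bank: List[str]
    right_bank: List[str]
    boat: List[str] = field(default_factory=list)

def spell_out(state: RiverState) -> str:
    """Serialize the state into the kind of English a reasoning trace has to carry."""
    return (
        f"On the left bank: {', '.join(state.left_bank) or 'nobody'}. "
        f"In the boat: {', '.join(state.boat) or 'nobody'}. "
        f"On the right bank: {', '.join(state.right_bank) or 'nobody'}."
    )

# Every traveler you add makes this description longer, i.e. more latent state
# the model must carry forward in its thinking tokens on every move.
state = RiverState(left_bank=["farmer", "wolf", "goat", "cabbage"], right_bank=[])
print(spell_out(state))
```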

(See also my earlier post, “What Is It Like To Be ChatGPT?”, April 5, 2023.)

So when you load the token window with a prompt where the target completion can only be found by juggling too much latent state, you’re going to hit the limits of what the model can do.

🎓 Main lesson: Limit abstractions & state

The Apple team flagged a really interesting behavior shared by all the models in the high-complexity regime, where they weren’t able to solve the problems at all: the models gave up early in the inference run and didn’t even use up their thinking token budgets. The LLMs sort of threw up their virtual hands once the complexity dial was turned too high.

So right at the outset, before a single token was generated, the prompt on these more complex problems already had too much latent state for the model to handle.

What I’m suggesting here is that in the problems used in this paper, there were two types of latent state the model needed to manage:

  1. Synchronic latent state: Latent variables, like the positions of checkers, disks, or river-crossers.

  2. Diachronic latent state: A time-ordered sequence of moves.

In the medium-complexity prompts, the model progressively builds up additional diachronic state by playing out moves in its thinking tokens before either succeeding or failing. At some point it may reach a number of moves where the amount of state it’s being asked to manage overwhelms its abilities, and at that point it fails. But my point is that it builds up to this tipping point gradually, by thinking as it works its way through the problem.

In the high-complexity prompts, the model starts out with too much latent state in the form of synchronic latent variables. Because the model is already at or near the limit of its latent state window, it can’t do very many sequential problem-solving moves before failing.
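Here’s a back-of-the-envelope way to picture the difference between those two regimes. This is a toy model of my own framing, and the budget and cost numbers are invented for illustration, not measured from any real model: if the starting synchronic state already consumes most of some fixed latent-state budget, there’s very little room left for the diachronic state that a long sequence of moves requires.

```python
# Toy model (my framing, not the paper's): assume a fixed "latent state budget",
# a per-variable cost for the synchronic state (checkers, disks, travelers),
# and a per-move cost for the diachronic state that accumulates as the model
# plays out its solution. All numbers below are made up for illustration.
def max_moves_before_collapse(num_variables: int,
                              budget: float = 100.0,
                              cost_per_variable: float = 4.0,
                              cost_per_move: float = 1.0) -> int:
    """How many moves fit before synchronic + diachronic state exceeds the budget."""
    synchronic = num_variables * cost_per_variable
    headroom = budget - synchronic
    return max(0, int(headroom // cost_per_move))

print(max_moves_before_collapse(10))  # 60 -> medium complexity: room for a long move trace
print(max_moves_before_collapse(24))  # 4  -> high complexity: the starting state alone nearly fills the budget
```

Under this toy model, the medium-complexity regime only collapses after a long trace of moves, while the high-complexity regime collapses almost immediately, which is roughly the give-up-early pattern the Apple team observed.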

My deeper point here is that there’s a limit to the aggregate amount of abstraction a model can manage — this is what I’m calling the “latent state window” — and this limit is not necessarily tied to token window size. My guess is that it scales with some other number or set of numbers, like parameter count, training tokens, or training runs. Further (and likely costly) investigation would clarify this, though.

To rephrase this in less specialized language as a practical recommendation: Just don’t ask the models to keep track of too many things at once, and you’ll be fine.

I also think there’s another dimension to this abstraction limit that all LLMs have: the higher-level the abstractions, the fewer of them a model can successfully manage in an inference pass.

This wrinkle comes from my own experience of working with models in a content-writing context. Let me give an example to illustrate the point.

✅ If you fill up Gemini 2.5’s 1-million-token context window with an S-1 filing and ask it for a specific number from that filing, it will probably do pretty well at this task. These so-called “needle in a haystack” tasks are a core part of the benchmarking that LLM providers perform on models and publish results for. So there are a ton of context tokens, but you’re asking a very narrowly defined, concrete, lookup-type question.

❌ If you fill up that same token window with a copy of “Anna Karenina” and ask it something really subtle and detailed about the intersection of 19th-century Russian politics and Christianity as it plays out in the spiritual evolution of four of the main characters, you will get a smart-sounding answer that has a lot of words in it but that would probably strike a Tolstoy scholar as at best superficial, or at worst flat-out wrong.

I don’t think this second example is strictly about world knowledge, either. Rather, as you move up the abstraction ladder and ask the model to work with very high-order, often contested concepts like “love”, “justice”, “honor”, “salvation”, etc., you’re asking it to manage too much state at once.

The above is just something I have observed, but it would be very hard to benchmark this in some objective way.

🛠️ Practical recommendations

The practical recommendations that flow naturally from this paper are the same ones I’ve been pushing in this newsletter since I rebooted it:

  • Break complex problems up into smaller chunks of work, where each chunk involves a minimal number of latent variables and requires a minimal number of operations on those variables to complete (I sketch this workflow in code after this list).

  • Constantly reset the context whenever you’re at a stopping point (i.e., you’ve completed a sub-task), so you can start the next sub-task with minimal state.

  • Don’t let the model generate for too long before you intervene and check the work. (More on all of this in previous posts, though.)
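Here’s a minimal sketch of that workflow in code. The `complete` function is a stand-in for whatever chat or completion call you actually use; it’s a placeholder of mine, not a real API. The shape of the loop is the point: each sub-task gets a small, fresh prompt instead of accumulating into one long session.

```python
# A minimal sketch of the "small chunks, fresh context" workflow described above.
# `complete` is a placeholder for your actual model call (OpenAI, Anthropic, a
# local model, etc.); everything else here is just the structure of the loop.
from typing import Callable, List

def run_in_chunks(subtasks: List[str],
                  complete: Callable[[str], str],
                  shared_brief: str) -> List[str]:
    """Run each sub-task in its own fresh context instead of one long session."""
    results: List[str] = []
    for task in subtasks:
        # Start every sub-task from a clean prompt: just the shared brief and
        # the task itself, so the model carries minimal latent state.
        prompt = f"{shared_brief}\n\nTask: {task}\n"
        output = complete(prompt)
        # This is where you'd intervene and check the work (tests, review,
        # spot-checks) before letting the next sub-task build on it.
        results.append(output)
    return results
```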
