Something changed when OpenAI released o1 in September 2024. It was not simply a better language model in the conventional sense; it was a different kind of system. Where previous models committed to an answer token by token, with no room to step back, o1 visibly "thought" before answering, producing long internal chains of reasoning that it used to check its work, reconsider approaches, and arrive at solutions that earlier models simply could not reach.

The benchmark numbers were striking. On AIME (the American Invitational Mathematics Examination), o1 scored around 83%, a result OpenAI reported as placing it among the top 500 students in the United States. On Codeforces competitive programming problems, it reached the 89th percentile of human competitors. These were not tasks where incremental language-model improvements made much difference; they demanded sustained multi-step reasoning that prior models failed at, even models far more capable on other metrics.

What happened, and what has happened since?

Test-Time Compute: The Core Idea

The fundamental innovation in o1 and its successors is the redirection of compute from training to inference. Standard language models have a fixed computation budget per generated token: one forward pass through the network. If the answer requires 10 steps of reasoning, the model has to somehow encode all of that in the process of generating each token, which is an extremely demanding constraint.
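
To make the arithmetic concrete, here is a back-of-the-envelope sketch in Python. It uses the standard estimate of roughly 2N FLOPs per generated token for a dense transformer with N parameters; the model size and token counts are made up for illustration.

```python
# Per-token compute is fixed, so total compute scales with how many
# tokens the model is allowed to generate. Uses the standard ~2 * N
# FLOPs-per-token estimate for a dense transformer with N parameters.
# All numbers are illustrative, not o1's actual configuration.

N_PARAMS = 70e9                  # hypothetical 70B-parameter model
FLOPS_PER_TOKEN = 2 * N_PARAMS   # one forward pass per generated token

direct_tokens = 50               # answer immediately
reasoned_tokens = 4000 + 50      # think for 4,000 tokens, then answer

print(f"direct:   {direct_tokens * FLOPS_PER_TOKEN:.2e} FLOPs")
print(f"reasoned: {reasoned_tokens * FLOPS_PER_TOKEN:.2e} FLOPs "
      f"({reasoned_tokens / direct_tokens:.0f}x)")
```

The per-token budget never changes; a longer chain simply buys the model more forward passes to work with.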

Reasoning models break this constraint by generating a long chain of thought before producing the final answer. This chain can be hundreds or even thousands of tokens long. Each step in the chain is visible (in some systems) or internal (in o1's case), but crucially, it gives the model working memory: the ability to write down intermediate results, check calculations, try alternative approaches, and backtrack when something does not work.
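
For systems that expose the chain, separating the thinking from the answer is mechanical. A minimal sketch, assuming R1-style `<think>...</think>` delimiters (o1 keeps its chain hidden, so nothing like this applies there):

```python
import re

def split_reasoning(completion: str) -> tuple[str, str]:
    """Split a completion into (reasoning, answer), assuming the chain
    of thought is wrapped in <think>...</think> as in DeepSeek-R1."""
    match = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    if match is None:
        return "", completion.strip()  # no visible chain: all of it is answer
    return match.group(1).strip(), completion[match.end():].strip()

reasoning, answer = split_reasoning(
    "<think>27 * 14 = 270 + 108 = 378. Double-check: 27 * 10 = 270, "
    "27 * 4 = 108, total 378. Yes.</think>The answer is 378."
)
print(answer)  # -> The answer is 378.
```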

This is not a new idea. Chain-of-thought prompting, which I'll discuss in a separate post, demonstrated that prompting models to "think step by step" improved performance on reasoning tasks. What reasoning models add is a training procedure that makes this extended thinking systematic, reliable, and deeply integrated with the model's behavior — not just a prompting trick that works some of the time.

How o1 Was Trained

OpenAI has been characteristically opaque about o1's training details, but the technical report and subsequent analysis suggest a picture something like this: o1 was trained using reinforcement learning, where the reward signal came from outcome verification — specifically, whether the model's final answer was correct on problems with checkable answers (math, code, logic puzzles). The model was trained to generate internal reasoning chains, and those chains were optimized not by directly supervising their content but by training toward getting the right answer at the end.
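
As a sketch of what outcome-level RL looks like, here is a minimal REINFORCE-style training step. OpenAI has not published o1's algorithm, so treat this as an illustration of the idea rather than the recipe: `policy.sample` and `policy.log_prob` are an assumed interface, and each sampled completion is assumed to carry its full text and an extracted final answer.

```python
# Minimal sketch of outcome-level RL (REINFORCE with a mean baseline).
# Not OpenAI's actual recipe, which is unpublished. `policy` is assumed
# to expose .sample(prompt) and .log_prob(prompt, text); each sampled
# completion is assumed to have .text and .final_answer fields.

def outcome_reward(final_answer: str, ground_truth: str) -> float:
    # Only the end result is graded; the chain's content never is.
    return 1.0 if final_answer.strip() == ground_truth.strip() else 0.0

def training_step(policy, optimizer, problem, n_samples=8):
    completions = [policy.sample(problem.prompt) for _ in range(n_samples)]
    rewards = [outcome_reward(c.final_answer, problem.answer) for c in completions]
    baseline = sum(rewards) / len(rewards)  # center rewards within the group
    # Raise the log-likelihood of chains that ended correctly,
    # lower it for chains that ended wrong.
    loss = -sum(
        (r - baseline) * policy.log_prob(problem.prompt, c.text)
        for r, c in zip(rewards, completions)
    ) / len(completions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```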

The approach is outcome supervision rather than process supervision: the reasoning chain is never graded step by step, but is shaped indirectly by whatever leads to correct final answers. The model learns that longer, more careful thinking tends to produce correct answers. It also learns to be skeptical of its own first attempts: to double-check, to try alternative approaches, to recognize when it is uncertain.

The result is a model with qualitatively different behavior from previous systems. o1 sometimes gets problems wrong by answering too quickly; given more tokens to think, it corrects itself. This scaling of performance with inference compute, governed by what is often called a "thinking budget," is new, and it has significant implications for how we deploy these systems.

DeepSeek-R1: Transparency and Open Replication

In January 2025, DeepSeek published "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning," which quickly became one of the most discussed papers of the year. The attention was not just technical: DeepSeek had replicated and in some ways surpassed o1's reasoning capabilities, described its methodology in remarkable detail, and released the weights under a permissive license.

The DeepSeek-R1 paper describes two models. The first, R1-Zero, applied reinforcement learning directly to a base model (DeepSeek-V3-Base) without any supervised fine-tuning warm-up. The rewards were simple and rule-based: format correctness and answer accuracy on math and code problems. Remarkably, R1-Zero spontaneously developed behaviors like extended reasoning, self-verification, and reflection without being explicitly trained on any of them. The model learned that thinking harder led to more correct answers, and so it thought harder.
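
The paper is concrete enough about the rewards that a sketch is easy to write. R1-Zero's prompt template asks for reasoning inside `<think>` tags and the final answer inside `<answer>` tags, and both rewards are rule-based; the simple sum at the end is my assumption, not the paper's exact weighting.

```python
import re

def format_reward(completion: str) -> float:
    """Reward the R1-Zero template: reasoning inside <think> tags,
    followed by the answer inside <answer> tags."""
    pattern = r"^\s*<think>.+?</think>\s*<answer>.+?</answer>\s*$"
    return 1.0 if re.match(pattern, completion, re.DOTALL) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """Rule-based correctness check. Real graders normalize math
    expressions or run generated code against test cases; exact
    string match stands in for that here."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

def total_reward(completion: str, ground_truth: str) -> float:
    # Simple sum of the two components (my assumption, not the paper's).
    return format_reward(completion) + accuracy_reward(completion, ground_truth)
```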

The emergent behaviors were striking enough that the paper describes the team observing the model "discover" these reasoning strategies during training. This has echoes of the AlphaGo moment — a system developing novel strategies through reinforcement learning that had not been anticipated or explicitly programmed.

The second model, R1, added supervised fine-tuning on curated chain-of-thought "cold start" data before reinforcement learning, which improved readability and reduced some failure modes of the raw RL approach. The final R1 model matched or exceeded o1 on several benchmarks while being open-weight and, by DeepSeek's account, far cheaper to develop.

What These Models Are Actually Doing

A natural question is whether the "thinking" in these models is genuine reasoning or an elaborate pattern-matching mimicry of reasoning. This is partly a philosophical question and partly an empirical one.

Empirically, several findings are worth noting. First, the reasoning chains in these models are not always coherent when examined in detail — they sometimes contain false steps that happen to lead to correct answers, or correct reasoning about the wrong sub-problem. The chains are not a transparent window into the model's computation; they are themselves learned behaviors that co-evolved with the answer generation.

Second, the performance of reasoning models scales with the length of the reasoning chain, up to a point. Giving a model more "thinking budget" generally helps on hard problems. But the scaling is not unlimited: returns diminish, and on some problems longer chains lead to worse answers as the model wanders down unproductive paths. Controlling this, knowing when to think more versus less, is an active research problem; a sketch of how to measure it follows the third point below.

Third, reasoning models are much more robust to adversarial perturbations that break standard models. If you rephrase a math problem in an unusual way, o1 and R1 are far less likely to fail than GPT-4 without extended thinking. This suggests the reasoning process is doing real work, not just pattern-matching on surface features.
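
To make the second point measurable, the natural experiment is a budget sweep: run the same problem set at several reasoning-token caps and watch where accuracy saturates or turns over. A sketch, where `client.solve` and its `max_thinking_tokens` parameter are hypothetical stand-ins for whatever budget control a given provider exposes:

```python
def sweep_thinking_budgets(client, problems, budgets=(512, 2048, 8192, 32768)):
    """Measure accuracy at several thinking budgets. `client.solve` and
    `max_thinking_tokens` are hypothetical; real APIs differ."""
    results = {}
    for budget in budgets:
        correct = 0
        for problem in problems:
            answer = client.solve(problem.prompt, max_thinking_tokens=budget)
            correct += int(answer.strip() == problem.answer.strip())
        results[budget] = correct / len(problems)
    return results  # accuracy typically rises with budget, then flattens or dips
```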

Successors and the Current Landscape

By early 2026, the reasoning-model paradigm has proliferated widely. OpenAI's o3 succeeded o1 and pushed capabilities further, particularly on scientific and programming tasks. Anthropic's Claude 3.7 Sonnet introduced extended thinking as an optional mode, letting users see the reasoning chain and set thinking budgets. Google's Gemini 2.0 Flash Thinking brought similar capabilities to Google's ecosystem.

DeepSeek-R1 spawned a wave of open replication and extension work. Smaller distilled versions — R1-Distill-Qwen, R1-Distill-Llama — demonstrated that reasoning capabilities could be transferred from large reasoning models to smaller base models through supervised fine-tuning on reasoning traces, dramatically lowering the compute cost of capable reasoning.
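
Mechanically, this distillation is just supervised fine-tuning on the teacher's traces. A minimal sketch of the data-construction step, keeping only traces whose final answer checks out (`teacher.generate` is an assumed interface; the R1 paper reports curating roughly 800k samples for its distilled models):

```python
def extract_answer(trace: str) -> str:
    """Naive extraction: everything after the closing think tag."""
    return trace.split("</think>")[-1].strip()

def build_distill_dataset(teacher, problems):
    """Collect full reasoning traces from a large reasoning model and
    keep the verified-correct ones as (prompt, completion) SFT pairs."""
    dataset = []
    for problem in problems:
        trace = teacher.generate(problem.prompt)  # includes <think>...</think>
        if extract_answer(trace) == problem.answer.strip():
            dataset.append({"prompt": problem.prompt, "completion": trace})
    return dataset  # then: plain supervised fine-tuning on the smaller model
```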

The most interesting recent development is the exploration of whether reasoning models can be trained without checkable rewards — that is, whether we can get these reasoning benefits on tasks where there is no clear right or wrong answer. Mathematical and coding tasks have the advantage of being automatically verifiable; open-ended writing, strategic analysis, and scientific hypothesis generation do not. Extending the reinforcement learning approach to these domains requires either better automated evaluators or carefully structured human feedback, and this is a frontier of active research.

Implications for the Field

Reasoning models represent a genuine paradigm shift for several reasons.

First, they suggest that the path to more capable AI is not solely through larger models and more training data. Inference-time scaling, spending more compute per query, is an alternative axis that had been systematically underexplored. This has significant implications for deployment economics: a smaller, cheaper-to-train model that uses extended thinking might outperform a larger, expensive model that answers immediately.
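
The economics are worth a quick worked example, with made-up prices and token counts; none of these numbers describe a real provider, and only the ratios matter.

```python
# Hypothetical prices and token counts, for illustration only.
small_price = 0.50 / 1e6   # $ per output token, small reasoning model
large_price = 15.00 / 1e6  # $ per output token, large direct-answer model

small_tokens = 6000        # long chain of thought plus a short answer
large_tokens = 300         # direct answer, no chain

print(f"small + thinking: ${small_tokens * small_price:.4f}")  # $0.0030
print(f"large + direct:   ${large_tokens * large_price:.4f}")  # $0.0045
# A 20x token overhead still wins when the per-token price differs by 30x,
# provided the extended thinking actually closes the quality gap.
```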

Second, they raise new questions about evaluation. Standard benchmarks measure model capability at fixed inference compute. A model that scores 60% on a benchmark without extended thinking might score 85% with a large thinking budget. This makes benchmark comparison across model families more complex.

Third, they create new opportunities for interpretability. The reasoning chains in these models, while not perfectly transparent, provide much more information about how the model arrived at an answer than a single-token prediction. That is valuable both for debugging and for building user trust.

The models that arrived in 2024 and 2025 with the label "reasoning model" are early instantiations of a broader principle: that intelligence benefits from the ability to deliberate, reconsider, and check one's work. We knew this about humans. We are now discovering how to build it into machines.