In early 2022, researchers at Google Brain published a paper with a deceptively simple observation: if you include examples in your prompt that show step-by-step reasoning before the final answer, large language models become dramatically better at reasoning tasks. The paper, "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (Wei et al., 2022), introduced the term "chain-of-thought" and demonstrated improvements on arithmetic, commonsense, and symbolic reasoning benchmarks that were significant enough to shift how the entire community thought about prompting.
The finding was not just that models could be nudged to produce more verbose outputs — it was that the explicit reasoning steps appeared to genuinely help the model arrive at correct answers that it could not reach through direct prediction. Something about generating intermediate reasoning steps changed the computation that the model performed, enabling it to solve problems that were previously beyond it.
The Original Finding and Why It Matters
The core experiment in Wei et al. is elegant. Take a set of mathematical word problems. In standard few-shot prompting, you provide examples as (question, answer) pairs. In chain-of-thought prompting, you provide examples as (question, reasoning chain, answer) triples, where the reasoning chain walks through the solution step by step.
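To make the contrast concrete, here is a minimal sketch of the two prompt formats in Python. The tennis-ball problem echoes the paper's canonical example; the helper functions and the second question are illustrative, not from the paper.

```python
# Standard few-shot: (question, answer) pairs vs.
# chain-of-thought: (question, reasoning chain, answer) triples.

def build_standard_prompt(examples, question):
    """Standard few-shot: each example shows only the question and final answer."""
    parts = [f"Q: {q}\nA: The answer is {a}." for q, _chain, a in examples]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

def build_cot_prompt(examples, question):
    """Chain-of-thought few-shot: each example includes the worked reasoning."""
    parts = [f"Q: {q}\nA: {chain} The answer is {a}." for q, chain, a in examples]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

examples = [
    ("Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
     "Each can has 3 tennis balls. How many tennis balls does he have now?",
     "Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. 5 + 6 = 11.",
     "11"),
]

print(build_cot_prompt(examples, "A baker made 23 muffins and sold 17. How many are left?"))
```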
The improvement on the math benchmarks was substantial — sometimes more than 20 percentage points — but what was more striking was the pattern of where improvements appeared. Chain-of-thought prompting helped most on problems that required multiple steps, logical composition, and intermediate computation. On simple single-step problems, it made little difference. This suggested that the chains were not just scaffolding for the output format; they were enabling a more powerful computational process.
The authors also observed that chain-of-thought prompting appeared to exhibit a strong scaling effect: small models showed little benefit, while large models (roughly 100B parameters and above at the time) showed dramatic improvements. This led to the framing of chain-of-thought as an emergent capability of scale — something that only manifests once a model is large enough to learn to use the reasoning structure.
Zero-Shot Chain of Thought: "Let's Think Step by Step"
A follow-up paper (Kojima et al., 2022) made the story even simpler. Rather than providing hand-crafted few-shot examples with reasoning chains, they simply appended "Let's think step by step" to prompts and evaluated whether models would spontaneously reason in ways that improved their answers.
They would.
For large language models, this two-stage zero-shot approach (first eliciting reasoning, then extracting the final answer from the reasoning) produced substantial improvements on reasoning benchmarks without any task-specific example engineering. The "Let's think step by step" instruction became briefly famous as one of the most effective prompts ever discovered — a meme that also contained genuine substance.
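Concretely, the procedure is two model calls: one to elicit the chain, one to read the answer off it. A minimal sketch, where generate() is a placeholder for whatever completion API you use, and the extraction phrasing follows the paper's general pattern, lightly simplified:

```python
def generate(prompt: str) -> str:
    """Placeholder for a model completion call; plug in your own client here."""
    raise NotImplementedError

def zero_shot_cot(question: str) -> str:
    # Stage 1: append the trigger phrase and let the model produce a reasoning chain.
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = generate(reasoning_prompt)

    # Stage 2: condition on the chain and extract just the final answer.
    extraction_prompt = f"{reasoning_prompt} {reasoning}\nTherefore, the answer is"
    return generate(extraction_prompt).strip()
```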
This zero-shot CoT finding has a provocative implication: large language models have reasoning capabilities latent in their training that they do not use by default when answering directly, but that can be activated by explicit instruction. The model "knows how" to reason through a problem but defaults to answering directly, and the instruction changes this default.
How Chain-of-Thought Actually Works
The mechanistic explanation for why chain-of-thought prompting helps is still not fully settled, but several hypotheses have empirical support.
One view is that CoT works by offloading computation to the context window. A language model's per-token computation is fixed — one forward pass through the network to produce each next token. For a multi-step problem, the model would need to compress all the intermediate reasoning into a single forward pass to answer directly. By generating intermediate steps, the model effectively has more computation available: each step can be conditioned on the output of previous steps, enabling deeper reasoning than direct single-step prediction.
This view predicts that CoT should help more for problems requiring many computation steps, which matches the empirical findings. It also predicts that longer, more detailed chains should help more than shorter ones for complex problems, which is generally observed.
A second view emphasizes pattern matching: large models have seen vast numbers of worked examples in their training data (textbook solutions, commented code, mathematical derivations, argued essays), and chain-of-thought prompting activates these patterns. On this account, the model is not reasoning in a fully general sense; it is pattern-matching to the structure of problem-solution pairs it has seen.
These views are not mutually exclusive. The evidence suggests that both are partially correct: CoT exploits real computational advantages of sequential token generation, and it also benefits from training distribution patterns.
Self-Consistency: Sampling Many Chains
A significant extension to chain-of-thought prompting is self-consistency (Wang et al., 2022), which generates multiple diverse reasoning chains for the same problem and then selects the final answer by majority vote. Rather than relying on a single chain that might contain errors, self-consistency exploits the observation that correct reasoning tends to converge on correct answers across different reasoning paths, while errors tend to produce diverse wrong answers.
Self-consistency substantially improves performance on reasoning benchmarks, often by an additional 5-15 percentage points over single-chain CoT. The cost is proportional to the number of chains sampled — typically 20-40 samples. This makes it expensive but powerful for high-stakes applications where accuracy matters more than efficiency.
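A minimal sketch of the voting step, assuming sample_chain() draws one chain at nonzero temperature and extract_answer() parses the final answer out of the text (both are placeholders here):

```python
from collections import Counter

def self_consistent_answer(question, sample_chain, extract_answer, n_samples=20):
    """Sample n_samples chains and majority-vote over their final answers."""
    answers = []
    for _ in range(n_samples):
        chain = sample_chain(question)   # diversity comes from sampling, not greedy decoding
        answer = extract_answer(chain)   # e.g. the text after "The answer is"
        if answer is not None:
            answers.append(answer)
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]
```

In practice, normalizing answers before voting (stripping units, canonicalizing numbers) matters, since superficially different strings can encode the same answer.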
The self-consistency finding also provides evidence that model reasoning is genuinely stochastic and diverse: different sampling runs produce meaningfully different chains, not just minor surface variations. This diversity is what makes majority voting effective.
Tree of Thoughts and Beyond
Chain-of-thought prompting treats reasoning as a linear sequence. Tree of Thoughts (Yao et al., 2023) generalized this to a tree structure, where the model generates multiple candidate reasoning steps at each point, evaluates them, and explores the most promising branches while pruning others.
This is inspired by classical AI search methods such as beam search and Monte Carlo tree search (MCTS), applied to language model reasoning. For problems where the correct next step is not obvious and where you might need to backtrack — certain combinatorial puzzles, multi-step planning tasks, open-ended math problems — tree search over reasoning steps can find solutions that linear chain-of-thought misses.
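As a sketch, a breadth-first variant looks roughly like the following, where propose() asks the model for candidate next steps and score() rates partial solutions; both are assumed helpers, and real implementations vary considerably in how they search and evaluate:

```python
def tree_of_thoughts(problem, propose, score, depth=3, beam_width=5):
    """Breadth-first search over reasoning steps, keeping the best partial chains."""
    frontier = [problem]                    # a state = the problem plus the steps so far
    for _ in range(depth):
        candidates = []
        for state in frontier:
            for step in propose(state):     # model proposes several candidate next steps
                candidates.append(state + "\n" + step)
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]  # prune to the most promising branches
    return max(frontier, key=score)         # best complete chain found
```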
The overhead is significant: tree-structured reasoning with exploration requires many more model calls than standard CoT, making it expensive for routine tasks. But for hard problems where success matters more than cost, the improvement can be substantial.
Graph of Thoughts, Program of Thoughts (which converts reasoning into executable code rather than natural language), and Skeleton of Thought (which generates an answer outline before expanding it) are further variants, each adapting the basic chain-of-thought insight to specific problem structures.
Chain-of-Thought in Training, Not Just Prompting
The prompting research spawned a parallel line of work: what if you trained models specifically to produce high-quality reasoning chains? This is the line that leads to the reasoning models discussed elsewhere in this series.
The key transition is from CoT as a prompting technique (eliciting existing model capabilities at inference time) to CoT as a training signal (explicitly training models to reason in chains). When training data includes high-quality reasoning chains, models internalize the reasoning process rather than merely being prompted toward it. When reinforcement learning rewards correct answers, models learn to generate reasoning chains that lead to correct answers across diverse problems.
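In its simplest form, the reward in that second setting checks only the outcome, not the chain. A toy sketch (extract_answer() is an assumed parser; actual training pipelines are far more involved than this):

```python
def outcome_reward(chain: str, gold_answer: str, extract_answer) -> float:
    """Reward 1.0 if the chain's extracted final answer matches the reference."""
    return 1.0 if extract_answer(chain) == gold_answer else 0.0
```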
The resulting models, such as o1 and DeepSeek-R1, are qualitatively different from base models with CoT prompting, even though the surface behavior looks similar. Their reasoning is more reliable, scales with thinking budget in predictable ways, and generalizes better to novel problem types.
Practical Guidance for CoT Prompting
For practitioners using chain-of-thought prompting with current models:
Modern instruction-tuned models often apply chain-of-thought reasoning by default on complex problems, especially when explicitly asked to be thorough or to explain their reasoning. You may not need elaborate few-shot chain-of-thought examples if you simply prompt the model to "work through this step by step."
For structured reasoning tasks — math, code, logic — being explicit about the format of the reasoning chain helps. Asking the model to label steps, number its reasoning, or flag uncertainty at each stage can improve final answer quality.
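One illustrative way to encode that advice in a prompt template (the exact wording here is a hypothetical example, not a canonical recipe):

```python
STRUCTURED_COT_TEMPLATE = """Solve the problem below.
Number each reasoning step as "Step 1:", "Step 2:", and so on.
If a step relies on an assumption or an estimate, mark it [UNCERTAIN].
Finish with a line of the form "Final answer: <answer>".

Problem: {question}"""

prompt = STRUCTURED_COT_TEMPLATE.format(
    question="A train travels 120 km in 1.5 hours. What is its average speed in km/h?"
)
```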
Self-consistency sampling is the highest-value extension when you need maximum accuracy and can afford the cost. For classification tasks, question answering, and any problem with a verifiable correct answer, majority-voting over multiple chains is worth the additional inference cost.
For problems with very long chains (many steps), consider breaking the problem into explicit sub-problems and chaining the model's outputs programmatically, rather than hoping the model maintains coherence over a very long monolithic reasoning chain. Long chains accumulate errors.
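A minimal sketch of that programmatic chaining, where each stage is a short prompt that receives the previous stage's answer; generate() is again a placeholder model call:

```python
def solve_in_stages(stage_templates, generate):
    """Run each sub-problem in its own call, feeding earlier answers forward."""
    results = []
    for template in stage_templates:
        prompt = template.format(*results)   # substitute answers produced so far
        results.append(generate(prompt).strip())
    return results[-1]                       # the last stage yields the final answer

# Hypothetical two-stage decomposition: stage 2 consumes stage 1's output as {0}.
stages = [
    "List the quantities needed to answer: how many seconds are in a leap year?",
    "Using these quantities, compute the answer step by step:\n{0}",
]
# final = solve_in_stages(stages, generate)
```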
The Deeper Lesson
Chain-of-thought prompting revealed something that, in retrospect, should have been obvious: the right way to evaluate a language model's reasoning capability is not to ask it to jump to conclusions, but to give it room to think. The "room to think" in CoT is the tokens of the intermediate reasoning chain.
This seems almost trivially true of how humans reason: we work through hard problems on paper, in conversation, in structured notes, precisely because working memory is limited and sequential reasoning reaches solutions that direct recall cannot. Discovering that language models share this property was nonetheless genuinely surprising, and understanding why has deepened our picture of what kind of computation language models actually perform.
The impact of this single insight — let the model think out loud — continues to ripple through the field, from prompting best practices to training procedures to the architecture of reasoning systems that are now at the frontier of AI capability.



