Around 2022, AI researchers began noticing something strange. As language models were scaled up in parameters and training data, certain capabilities did not improve gradually: they appeared nearly absent in smaller models and then materialized quite suddenly as scale crossed certain thresholds. A model with 10 billion parameters could barely perform multi-digit arithmetic, while a model with 50 billion parameters could often do it correctly. No one had explicitly trained either model to do arithmetic; both had been trained on the same objective, predicting the next token in text. But somewhere between those scales, the arithmetic capability emerged.
This pattern — the "emergent abilities" of large language models — was documented systematically in a landmark 2022 paper by Wei et al. from Google Brain and subsequently discussed extensively in the GPT-4 technical report. It has become one of the most discussed and debated phenomena in AI research, generating strong claims on multiple sides.
What "Emergence" Means in This Context
The term "emergence" comes from complex systems theory, where it describes properties of a system that cannot be predicted from the properties of its components. In the LLM context, Wei et al. defined emergent abilities as capabilities that "are not present in smaller models and are present in larger models, such that they cannot be predicted by simply extrapolating the performance of smaller models."
The key empirical claim is that the ability appears as a near-discontinuity when plotted against model scale: performance is near-random (i.e., the model is essentially guessing) up to some scale threshold, and then rises sharply. This is different from the smooth, predictable improvement in perplexity and similar metrics that scaling laws describe.
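For contrast, the smooth regime can be written down. A minimal sketch of the standard form, following the shape of the Kaplan et al. scaling-law results (the constant and exponent here are illustrative, not fitted to any particular model family):

```latex
% Loss as a smooth power law in parameter count N:
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad \alpha_N \approx 0.076
```

Under a curve like this, doubling N buys a small, predictable reduction in loss; nothing in the formula anticipates a capability switching on.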
The paper documented dozens of such abilities: multi-step arithmetic, analogical reasoning, specific symbolic-manipulation tasks such as word unscrambling, answering questions about uncommon words, language-to-code translation, and many others. The examples were compelling because they were specific and verifiable.
The Controversy: Are Emergent Abilities Real?
A significant counter-argument came from Schaeffer et al. (2023) in "Are Emergent Abilities of Large Language Models a Mirage?" Their core claim: apparent emergence is an artifact of evaluation metrics, not a genuine property of the underlying capabilities.
The argument is subtle but important. If you evaluate model performance with a metric that is nonlinear in accuracy, such as exact-match accuracy on a multi-step task where every step must be correct to get any credit, then smooth underlying improvement in individual-step accuracy can look like a discontinuity at the aggregate level. For a task with k independent steps and per-step accuracy p, full-task accuracy is roughly p^k, which stays near zero for most values of p and then rises steeply as p approaches 1. The "threshold" where the model starts getting the full task right is simply the point where per-step accuracy gets close enough to 100% that all k steps succeed at once: a mathematical artifact of the all-or-nothing scoring, not a genuine discontinuity in what the model has learned.
The authors showed that for many tasks where emergence had been reported, switching to a smoother metric (like partial credit scoring, or log-probability rather than accuracy) revealed smooth scaling curves with no discontinuity. The apparent emergence disappeared.
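A small numerical sketch makes the point concrete. Everything below is a toy construction, not data from either paper: a logistic per-step accuracy curve over hypothetical model scales, scored two ways.

```python
import numpy as np

# Toy setup: per-step accuracy improves smoothly (logistically) with
# log model size. Scales and curve parameters are made up.
scales = np.logspace(8, 12, 9)  # 1e8 .. 1e12 parameters
p_step = 1 / (1 + np.exp(-1.5 * (np.log10(scales) - 10)))

k = 10  # steps that must ALL be correct under exact-match scoring

exact_match = p_step ** k  # all-or-nothing metric: looks like a cliff
per_step = p_step          # linear "partial credit" metric: smooth

for n, em, ps in zip(scales, exact_match, per_step):
    print(f"{n:10.0e} params | exact-match {em:5.3f} | per-step {ps:5.3f}")
```

Both columns are computed from the same underlying curve, yet the per-step column climbs gradually across the whole range while the exact-match column sits near zero and then jumps.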
This is a strong methodological point that should make everyone more careful about interpreting benchmark results. Many evaluation metrics are poorly calibrated for the purpose of measuring learning progress — they are designed to be informative about whether a model can do a task, not about how close it is to being able to do it. For detecting emergence, metric choice matters enormously.
What Remains After the Critique
Does the Schaeffer et al. argument fully explain away all reported emergent abilities? I would argue not entirely.
Some emergent phenomena seem genuinely discontinuous in ways that are hard to explain as metric artifacts. The ability to follow complex multi-step instructions coherently, for example, involves a qualitative shift: at small scales, the model fails to maintain the structure of the task across many steps regardless of how you score individual steps. At large scales, it succeeds reliably. This does not feel like a continuous improvement in underlying capability being masked by a harsh metric — it feels like a qualitative capability that did not exist at small scales.
In-context learning is another example. The ability to learn from examples provided in the prompt, extracting a pattern from a handful of demonstrations and applying it to new cases, appears much more strongly in large models than small ones and does not seem to reduce to metric artifacts. Small models can follow formats shown in examples but fail to extract abstract patterns; large models can do genuine in-context learning. Shown a few pairs like "lamp → pmal", for instance, a large model can infer that the rule is string reversal and apply it to a new word, while a small model tends to reproduce the arrow format without recovering the rule.
The most defensible position is that some reported "emergent" abilities are indeed metric artifacts, and researchers should be more careful about this. But some capabilities are genuinely qualitatively different at large scales from what smaller models exhibit, and these are real and important.
Specific Emergent Capabilities and Their Significance
The emergent abilities that matter most for practical AI deployment include:
Instruction following. Smaller language models struggle to follow complex, multi-part instructions reliably. Larger models exhibit qualitatively better instruction comprehension and execution, which is why RLHF-based alignment works much better at large scale.
In-context learning. As discussed, the ability to extract patterns from few-shot examples and generalize them is significantly stronger in large models. This is foundational to the prompting paradigm.
Chain-of-thought reasoning. As Wei et al.'s original chain-of-thought paper documented, CoT prompting shows little benefit at small scales and large benefits at large scales. Small models do not learn to use intermediate reasoning steps even when prompted to produce them; a minimal prompt contrast is sketched after this list.
Calibration. Larger models are better at knowing what they do not know — they express appropriate uncertainty and are less likely to confabulate confidently. This appears to improve with scale, though the relationship is non-monotonic and depends heavily on training procedure.
Analogical reasoning. Abstract pattern recognition — "A is to B as C is to D" — improves markedly at large scales, with qualitative differences in the complexity of analogies the model can handle.
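To make the prompting distinction concrete, here is a minimal sketch of a direct prompt versus a chain-of-thought prompt. The wording paraphrases the style of examples in the chain-of-thought paper; the specific questions here are illustrative.

```python
# Direct prompting: the question alone, an immediate answer expected.
direct_prompt = (
    "Q: A cafeteria had 23 apples. It used 20 to make lunch and bought "
    "6 more. How many apples does it have?\n"
    "A:"
)

# Chain-of-thought prompting: a worked example demonstrates intermediate
# reasoning steps before the target question is asked.
cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 cans with 3 balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
    "Q: A cafeteria had 23 apples. It used 20 to make lunch and bought "
    "6 more. How many apples does it have?\n"
    "A:"
)
```

The scale-dependent finding is that only sufficiently large models exploit the worked example; small models imitate its surface form and still answer incorrectly.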
Why Emergence Happens: Theoretical Accounts
The theoretical understanding of why emergence occurs is incomplete, but several accounts have been proposed.
Threshold crossing. Many tasks require a collection of sub-capabilities to be present simultaneously. Arithmetic might require carrying in addition, knowing multiplication tables, handling multi-digit number formats, and maintaining a running computation. Each of these improves continuously with scale, but the full task requires all of them to be above some functional threshold. The task "emerges" when the last required sub-capability crosses its threshold.
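This account is easy to simulate. In the toy sketch below (my construction: made-up onset points and a shared sigmoid shape), four sub-capabilities each improve smoothly with scale, but the task succeeds only when all four function at once.

```python
import numpy as np

# Hypothetical onset points (in log10 parameters) for four sub-capabilities,
# each improving smoothly with scale.
scales = np.logspace(8, 12, 9)
log_n = np.log10(scales)
onsets = [9.0, 9.8, 10.5, 11.0]
sub_caps = [1 / (1 + np.exp(-3 * (log_n - c))) for c in onsets]

# The full task works only if every sub-capability works.
task_success = np.prod(sub_caps, axis=0)

for n, t in zip(scales, task_success):
    print(f"{n:10.0e} params | task success {t:.3f}")
# Output stays near zero until the last sub-capability turns on, then
# rises sharply: smooth parts, abrupt whole.
```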
Representation quality. Capabilities may require reaching certain thresholds of representation quality rather than just scaling a specific skill. Language may need to be represented at sufficient granularity to support complex compositional operations before certain tasks become possible. This would produce discontinuities tied to the quality of learned representations rather than to any specific training signal.
Circuit formation. Work in mechanistic interpretability (particularly from Anthropic and academic groups) suggests that specific capabilities are implemented by recognizable circuits: patterns of attention and MLP computation that implement specific algorithms. These circuits may form reliably above certain model sizes because smaller models lack the representational capacity to implement them cleanly.
Implications for Safety and Evaluation
The emergence phenomenon has profound implications for AI safety and evaluation. If capabilities can appear suddenly with scale, safety evaluations performed on smaller models may fail to anticipate capabilities in larger models. A capability that was absent or weak at 50B parameters might appear robustly at 200B parameters, including capabilities that are concerning from a safety perspective.
This is sometimes called the "unknown unknowns" problem of AI evaluation: we can test for capabilities we know to look for, but emergent capabilities by definition are things we might not think to test for. The GPT-4 technical report explicitly notes this challenge — the model passed tests that smaller models failed, and some of these unexpected capabilities were only discovered through extensive red-teaming after the model was trained.
The policy implication is that capability evaluations need to include model families across a range of scales, and that concerning capabilities observed in small models at low rates might indicate capabilities that will be robust and reliable at larger scales. The inverse — assuming that capabilities absent at small scale will remain absent at large scale — is not safe.
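One way to operationalize the point about low rates, sketched here as an evaluation habit rather than a prescribed protocol (the counts and chance rate are hypothetical), is to test whether a small model's occasional successes already exceed chance:

```python
from scipy.stats import binomtest

# Hypothetical numbers for one small model in a cross-scale evaluation.
successes = 9       # times the model completed the concerning task
trials = 100        # evaluation attempts
chance_rate = 0.02  # estimated success rate of blind guessing on this task

# One-sided binomial test: is the observed rate already above chance?
result = binomtest(successes, trials, chance_rate, alternative="greater")
print(f"observed rate {successes / trials:.2f}, p-value {result.pvalue:.4f}")
# A small p-value flags the capability for scrutiny at larger scales;
# it does not prove the capability will become reliable there.
```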
What Emergence Tells Us About Language Models
The emergence debate has deepened our understanding of language models in several ways. First, it has forced more careful thinking about evaluation methodology — the lesson that metric choice can determine whether you see emergence or smooth scaling is methodologically important beyond the emergence debate itself.
Second, it has provided evidence that language models are not simply interpolating between training examples in a shallow sense. The development of capabilities that were never explicit training targets, at scales that cross qualitative behavioral thresholds, suggests that these systems are doing something more complex than sophisticated text matching.
Third, it has highlighted the limitations of our theoretical understanding. We lack a principled predictive theory of which capabilities will emerge at which scales, which makes safety planning for future models genuinely difficult. Building that theory — understanding the internal structure of how capabilities are represented and composed in neural networks — is one of the most important open problems in AI research.
Emergence may or may not be the right word for what we are observing. But the observations themselves — that scale produces qualitative behavioral shifts that go beyond smooth quantitative improvements — are real and require explanation.