Cover image for 'The Anatomy of a Hallucination' article

Published: June 14, 2025


In this post, I'd like to share my working mental model for why LLMs hallucinate.

This is intended for users of LLMs via ChatGPT and other AI assistants who aren't necessarily familiar with the underlying technology. I'm by no means an LLM expert, but I do enjoy finding intuitive ways to explain their inner workings with visuals, concrete examples, and analogies. Many of the concepts here are simplified to make them easier to understand, so please treat this post as more of an accessible introduction and less of a definitive guide.

Here's a high-level description of my hallucination mental model:

  1. LLMs generate text by repeatedly predicting the next piece of text using everything that has come before it, a process known as next-token prediction.
  2. The LLM training process shapes those predictions to reflect both factual knowledge and broader language patterns that generalize to a variety of contexts. The latter can sometimes look like "making stuff up".
  3. LLM training can also bias LLMs into making "claims" about unfamiliar topics.
  4. Hallucinations are the result of making such claims and relying on generalized language patterns to fill in the details.

The rest of the post explains each point in greater detail by showing concrete examples from Mistral-7b-v0.1, a popular open-source LLM. Mistral is an older, smaller model, which makes it more prone to hallucinations - we can get it to hallucinate simply by asking it to Tell me about Blake Bourne (a name I made up). The LLMs that currently power AI assistants are a bit "smarter" now - for instance, ChatGPT will say it doesn't recognize the name for reasons I talk about at the end. Nonetheless, the general explanation for why LLMs hallucinate is applicable to all models.

#Next-Token Prediction

LLMs generate text by repeatedly predicting the next piece of text in a sequence, an operation known as next-token prediction.

This operation starts with input text that is converted to a token sequence. Tokens are numbers that represent bits of text. These bits might be words, parts of words, or symbols and punctuation. LLMs have a fixed set of tokens they can work with, which is known as a vocabulary. Mistral's vocabulary contains 32,000 tokens - 20 of which are shown below:

A visualization showing 20 tokens from Mistral-7b-v0.1's vocabulary of 32,000 tokens

"Vocabulary of Mistral-7b-v0.1"

What we recognize as single words will often look like multiple tokens to an LLM. Notice how vocabulary is split into two tokens here:

Example showing how the word 'vocabulary' is split into multiple tokens in the model
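
If you'd like to see this for yourself, here's a rough sketch using the Hugging Face transformers library (my tooling choice for these examples - the exact sub-word split may look slightly different depending on your version):

```python
# A minimal sketch of Mistral's tokenizer, assuming the `transformers`
# library is installed and the checkpoint is accessible.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

print(tokenizer.vocab_size)              # 32000 - the size of the vocabulary
print(tokenizer.tokenize("vocabulary"))  # one word, split into multiple sub-word tokens
print(tokenizer.encode("vocabulary"))    # the token ids the model actually sees
```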

#Probability Distributions

As the name implies, next-token prediction involves predicting the next token in a given input sequence. LLMs express their predictions by assigning a probability to every token in their vocabulary. These probabilities indicate how likely each token is to follow the input, and together they are known as a next-token probability distribution.

Visualization of next-token probability distribution showing top 4 and bottom 3 probabilities

The next-token probabilities are shown on the right. Each token in the vocabulary has a probability assigned to it, but only the top 4 and the bottom 3 are shown here.

LLMs can calculate this distribution for virtually any input sequence:

Example of how the model calculates probability distributions for different input sequences
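
Here's what calculating one of these distributions looks like in code - a rough sketch using transformers and PyTorch, with Once upon a as an example input (loading the full 7B model takes a decent amount of memory):

```python
# A sketch of computing a next-token probability distribution.
# Assumes `transformers` and `torch`; loading the full model needs plenty of RAM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

inputs = tokenizer("Once upon a", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # shape: (1, sequence_length, vocab_size)

# The last position holds the prediction for what comes *next*.
probs = torch.softmax(logits[0, -1], dim=-1)   # one probability per token in the vocabulary

top = torch.topk(probs, k=4)
for p, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id.item())!r}: {p.item():.2%}")   # "time" should be near the top
```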

#Text Generation

LLMs can sample from probability distributions to produce a new token. Imagine a hat with 10,000 slips of paper, with a token written on each slip. In the twinkle twinkle little example, star would be on 7,373 slips, stars would be on 639 slips, bat on 332, and so on. A token is drawn from the hat at random and is called a sample.
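
Here's the hat analogy as a few lines of Python (only the three tokens mentioned above are shown; the remaining slips are omitted for brevity):

```python
import random

# The "hat" for the twinkle twinkle little example: one entry per token,
# weighted by how many slips of paper carry it (remaining slips omitted).
hat = {"star": 7373, "stars": 639, "bat": 332}

# Draw one slip at random - this is a sample.
sampled_token = random.choices(list(hat.keys()), weights=list(hat.values()), k=1)[0]
print(sampled_token)   # usually "star", occasionally "stars" or "bat"
```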

This sampled token can now be added to the initial sequence to form a new input, which the model can use to calculate a new probability distribution.

Visualization showing token sampling process and subsequent probability distribution calculation

TOP: 94% of the time, the model will sample "skills" and add it to the input.
BOTTOM: It then calculates a new probability distribution for the new sequence.

We can now sample from this new distribution to produce the next token in the sequence, and so on, until a special end-of-sequence (</s>) token is sampled, signaling the model to stop. The visual below shows how Mistral samples from probability distributions to generate text about LeBron James:

Step-by-step visualization of Mistral's text generation process about LeBron James, with sampled tokens highlighted in green

The probability distribution for each step of the text generation. The sampled tokens are highlighted in green.
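
In code, the whole generation loop is surprisingly small. Here's a sketch (the LeBron James prompt is a stand-in for the one in the visual above, and I've capped the length so it doesn't run forever):

```python
# A sketch of autoregressive text generation: sample a token, append it,
# recompute the distribution, and stop when </s> is sampled.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

token_ids = tokenizer("LeBron James is", return_tensors="pt").input_ids

for _ in range(50):                                    # cap the length for this sketch
    with torch.no_grad():
        logits = model(token_ids).logits
    probs = torch.softmax(logits[0, -1], dim=-1)       # next-token probability distribution
    next_id = torch.multinomial(probs, num_samples=1)  # draw from the "hat"
    if next_id.item() == tokenizer.eos_token_id:       # </s> signals the model to stop
        break
    token_ids = torch.cat([token_ids, next_id.unsqueeze(0)], dim=-1)

print(tokenizer.decode(token_ids[0], skip_special_tokens=True))
```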

Sampling is non-deterministic: two draws from the same exact hat won't always produce the same token. This explains why re-running the same prompt in ChatGPT twice can produce two different results.

First example of model's response to a prompt showing non-deterministic sampling
Second example of model's response to the same prompt, demonstrating different output due to non-deterministic sampling

Two different responses to the same prompt.

When we use ChatGPT, our initial prompt serves as the starting token sequence, and the model generates a response by predicting what comes after our prompt. For the most part, these responses are quite helpful, but they sometimes contain hallucinations - statements that sound plausible but are factually incorrect. To understand why hallucinations occur, we need to look at how LLMs are trained.

#LLM Training

LLMs have weights (a.k.a. parameters) that are responsible for calculating the next-token probability distribution for any given input token sequence. We can think of each of these weights as an individual dial that can be turned. Turning any single dial (changing the value of any single weight) changes how the next-token probability distributions across all input sequences are calculated.

LLMs are initially untrained, with weights initialized to random values. In this state, next-token probability distributions are meaningless and sampling from them generates incoherent text. Training refers to the process of getting the values of these weights and their associated distributions into a more usable state.

Visualization of untrained model's random weights and meaningless probability distributions

The model's weights are initially random, and the next-token probability distributions are meaningless.

#Pretraining

Pretraining is the first stage in an LLM's training pipeline. It involves having the LLM predict the next token using a text sequence taken from the internet, and adjusting its weights when it's wrong. Here's a simplified description of how pretraining works:

  1. Take a snippet from the internet and split it into tokens. Remove the last token, and feed the rest of the tokens into the model as input.

     Example of splitting text into tokens and removing the last token for training

  2. Use the current configuration of the model's weights to calculate a next-token probability distribution. This distribution represents the model's prediction of what the next token will be.

     Visualization of model's prediction process during training

  3. Look at the distribution and use the probability assigned to the removed token to calculate a loss - the lower the probability, the higher the loss (and vice versa).

     Example of loss calculation based on probability assigned to the removed token

     Note: The red bar isn't the actual loss, but it is proportional to it.

  4. Adjust the weights according to the loss so the removed token gets a slightly higher probability next time, and the incorrect ones get slightly lower probabilities.

     Visualization of weight adjustment process during training

  5. Repeat this process billions and billions of times, across billions of different snippets. The snippets used here are referred to as the model's training data. Note how the same snippet can be "chopped up" and reused multiple times.

     Example of how training snippets can be reused multiple times

The objective of these billions and billions of updates is to minimize prediction error. Intuitively, this means if you were to run the same training snippets through a model after pretraining, it would consistently assign high probabilities to the correct next tokens.

Visualization of model's improved prediction accuracy after pretraining
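
For the curious, here's a toy sketch of what a single pretraining update might look like, numbered to match the steps above. It's deliberately simplified: real pretraining uses batches, more sophisticated optimizers, and predicts every position in the snippet at once rather than just the last token.

```python
# A toy sketch of one pretraining step on one made-up snippet.
# Assumes `transformers` and `torch`; illustrative, not how labs actually train.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5)

snippet = tokenizer("LeBron James plays professional basketball", return_tensors="pt").input_ids
inputs = snippet[:, :-1]   # step 1: remove the last token, feed the rest in
target = snippet[:, -1]    # ...the removed token is what we want predicted

logits = model(inputs).logits                      # step 2: the model's prediction
loss = F.cross_entropy(logits[:, -1, :], target)   # step 3: low probability for the removed token -> high loss

loss.backward()        # step 4: work out how each weight ("dial") should change...
optimizer.step()       # ...and nudge the weights so the removed token scores higher next time
optimizer.zero_grad()

# Step 5 is just this loop repeated billions of times over billions of snippets.
```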

#Learning Language Patterns

When pretraining is finished, the weights are in a more usable state. A pretrained LLM is referred to as a base model, and its weights encode language patterns present in the training data. These patterns can be leveraged to fluently continue text.

I'd like to differentiate between two types of patterns in particular, which I refer to as specific and generalized language patterns. Hallucinations happen, in part, when LLMs use generalized language patterns to generate responses about topics they lack specific patterns to draw from.

The example probability distributions below come from the Mistral-7b-v0.1 base model (there's another variant, Mistral-7b-Instruct-v0.1, which we'll use shortly).

#Specific Language Patterns

If we feed Mistral the input Once upon a, here are the 4 most likely next tokens:

Example of specific language pattern showing high probability for 'time' after 'Once upon a'

This is an example of a specific language pattern the LLM has learned. Specific language patterns are characterized by peaked probability distributions where only one or a handful of next tokens dominate, and include factual knowledge the LLM has picked up on. For example, the high probability assigned to basketball reflects the model's knowledge of LeBron James.

Example of factual knowledge pattern showing high probability for 'basketball' in LeBron James context

These high probabilities are the result of minimizing prediction error: each time Mistral sees a snippet about LeBron James in its training data, its weights are adjusted to assign higher probabilities to the tokens that follow (e.g. "professional basketball player", "drafted in 2003", "Los Angeles Lakers"). With enough repetition, the next-token probability distributions related to LeBron concentrate around these tokens. Sampling from those distributions produces those tokens consistently, and when Mistral generates text about LeBron James, it effectively functions as if it's recalling factual knowledge.

#Generalized Patterns

Generalized language patterns, on the other hand, are patterns that generalize beyond a specific entity. As an example, here's what happens if I make up a name (Blake Bourne), and ask Mistral to predict the next token:

Example of flat probability distribution for made-up name 'Blake Bourne'

Unlike the LeBron James example, these probabilities are flat and even, reflecting the model's uncertainty. I think the intuition for why these probabilities are flat looks something like this: while there are no mentions of Blake Bourne in the training data, there are countless examples of similar phrases.

Visualization of how the model learns general patterns from similar phrases

Each time it sees a phrase such as (name of person) competes professionally as a (name of sport) during pretraining, the model not only associates the specific sport with that person, it also learns the general pattern that competes professionally as a is usually followed by a class of sports such as bodybuilding and mixed martial arts. And when it comes to making a prediction about Blake Bourne, the model's "best bet" is to distribute probabilities evenly amongst those activities, since it has no specific associations for the name.

This allows the model to make a contextually relevant prediction about Blake Bourne, which means the model can generate coherent text about new, unseen inputs. As an analogy, I remember playing Mad Libs in elementary school, a game involving templates like this one:

Example of a Mad Libs template showing how language patterns work

I think of generalized language patterns as learning a variety of Mad Libs templates. With a little imagination, the next-token probability distributions can be thought of as the start of a template:

Visualization of how next-token probability distributions can be thought of as Mad Libs templates

I could generate coherent stories in elementary school by filling in the blanks of a Mad Libs template. LLMs do something similar; they just do so by noticing which tokens typically come next in similar contexts from their training data.

This is the foundation of my mental model for why LLMs hallucinate, especially when asked about obscure topics: the lack of representation in the training data means the model has to generate responses using generalized language patterns, rather than drawing from specific language patterns that reflect factual knowledge. They're essentially "playing Mad Libs".
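
One way to see this contrast directly is to measure how concentrated a next-token distribution is. Entropy is my own choice of summary number here (the post's figures don't use it): low entropy means a peaked, "specific" distribution, and high entropy means a flatter, more "Mad Libs"-like one.

```python
# A sketch comparing how "peaked" vs "flat" the next-token distributions are
# for a familiar phrase and a made-up name, using entropy as a rough flatness score.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

def next_token_entropy(prompt: str) -> float:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    return -(probs * torch.log(probs + 1e-12)).sum().item()

print(next_token_entropy("Once upon a"))                                # expect: low (specific pattern)
print(next_token_entropy("Blake Bourne competes professionally as a"))  # expect: higher (generalized pattern)
```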

But it isn't the complete mental model: starting with Blake Bourne competes professionally as a is a bit like dropping the LLM off in the middle of a steep hill - it basically forces the LLM to hallucinate. To continue it coherently, the model has to guess the name of a sport. It's too late to stop, climb back to the top, and say "I don't know who that is."

This is where the Instruction Fine-Tuning (IFT) training stage comes into play. At a high level, I think of this stage as teaching the model which "hills" to go down in the first place.

#Instruction Fine-Tuning

Instruction Fine-Tuning (IFT) happens after pretraining. The goal here is to take the weights learned during pretraining and fine-tune (update) them so that the LLM can "behave" like an assistant - meaning it can provide helpful answers in response to instructions.

Recall that pretraining teaches an LLM how to fluently continue text based on patterns in its training data. This alone doesn't automatically make an LLM an assistant. To see why, let's take the prompt: What's the weather like today? A purely pretrained model might reasonably continue this as part of a dialogue in a story ("she asked, looking out the window…") or as part of a quiz ("A. Sunny B. Rainy C. …"). If we want the LLM to act like an assistant, we need to teach the model to treat prompts as instructions to be immediately followed by helpful answers.

Instruction Fine-Tuning does this by showing the model multiple examples of instructions that are immediately followed by helpful answers - these examples are known as the instruction-following dataset. Crucially, these training examples include special formatting tokens [INST] and [/INST] that indicate the start and end of an instruction:

Example of instruction-following dataset format with special tokens
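
As a quick sketch of what that formatting looks like in practice, recent versions of transformers bundle a chat template for the Mistral instruct models that wraps a prompt in these tokens (the exact output may vary slightly by version):

```python
# A sketch of wrapping a prompt in the [INST] ... [/INST] formatting tokens
# using the chat template that ships with the instruct checkpoint.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

messages = [{"role": "user", "content": "Tell me about LeBron James"}]
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
print(formatted)   # something like: "<s>[INST] Tell me about LeBron James [/INST]"
```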

The training process looks a lot like pretraining. Chop up these instruction-answer pairs, remove the last token, have the LLM predict the next token, and adjust weights according to the loss.

Visualization of instruction fine-tuning process and weight adjustment

Top: IFT training "snippets".
Bottom: Training on those snippets adjusts the model's weights to minimize prediction error.

Visualization of how instruction fine-tuning shapes model responses

Just as pretraining involves learning patterns in the training data, IFT involves learning patterns present in the instruction-following dataset which get encoded into the model's weights. This updated model is referred to as an instruction-tuned model, and I'll use probability distributions from an instruction-tuned Mistral-7B-Instruct-v0.1 to demonstrate two of these patterns.

#Instruction Boundaries

If we give Mistral-7b-Instruct the text Tell me about Blake Bourne without any formatting tokens, the next-token probability distribution favors tokens that promote continuation (it generates Tell me about Blake Bourne, part 1. This article was originally posted…)

Example of model's response without instruction boundary tokens

\n means a new line - the equivalent of hitting Enter.

Wrap the text in instruction boundary tokens, however, and the probability distribution shifts dramatically:

Example of model's response with instruction boundary tokens

Instead of continuing the text, the model has learned that the next token should be the start of an answer. The model's certainty in "Blake" reflects a pattern I think of as a response template, or the start of a "claim" the model makes about Blake Bourne.
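
Here's a sketch of that comparison in code, printing the top next-token candidates for the same text with and without the boundary tokens (the exact tokens and probabilities will differ from the figures above):

```python
# A sketch comparing the next-token distribution with and without
# the instruction boundary tokens, using the instruct model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

def top_next_tokens(text: str, k: int = 4):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        probs = torch.softmax(model(ids).logits[0, -1], dim=-1)
    top = torch.topk(probs, k)
    return [(tokenizer.decode(i.item()), round(p.item(), 3)) for p, i in zip(top.values, top.indices)]

print(top_next_tokens("Tell me about Blake Bourne"))                 # tends to favor "continuation" tokens
print(top_next_tokens("[INST] Tell me about Blake Bourne [/INST]"))  # tends to favor the start of an answer
```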

#Response Templates

Many of the example answers in an instruction-following dataset will share a similar structure. Although we don't know the exact contents of Mistral's instruction-tuning dataset, it's reasonable to assume it might contain a few examples that look like this:

Example of biographical response templates in instruction-following dataset

If it sees these examples during IFT, the model will learn these answers share a structural pattern, which we might loosely think of as a "biographical response" template:

Visualization of learned biographical response template pattern

There are glimpses of this template in the response Mistral-7b-Instruct generates about LeBron James:

[INST] Tell me about LeBron James [/INST]

LeBron James is an American professional basketball player who is widely considered to be one of the greatest...

Here are the underlying probability distributions Mistral samples from to generate the response:

Probability distributions for Mistral's response about LeBron James

We can see that the model is quite certain in the "response template" portion of its answer, which sets it up to make a claim about LeBron James. Thanks to pretraining, the model has specific language patterns about LeBron James to draw from, and the claim is factually correct.

Here's the key: if the instruction-tuning dataset only includes examples that follow these biographical templates, the model will apply the template to every "Tell me about Person X" prompt, even when Person X doesn't exist. For example, here's what it generates about Blake Bourne:

[INST] Tell me about Blake Bourne [/INST]

Blake Bourne is an American football coach and former player. He was born on June 3,...

And here are the probability distributions used to generate the response:

Probability distributions for Mistral's response about Blake Bourne

Just like the LeBron James example, the model is certain that it should start with a "response template" to make a claim about Blake Bourne. But this drops the model in the middle of a steep hill, and when it comes to filling in the details about who he is, our model has no specific patterns to draw from. It has to rely on a general language pattern (actor, professional, football?), and we get a hallucination.

#Complete Mental Model

The LLMs we use in AI assistants such as ChatGPT have undergone both Pretraining and Instruction Fine-Tuning. Pretraining teaches the LLM how to fluently continue text based on language patterns found on the internet; Instruction Fine-Tuning teaches the model to recognize instructions and how to answer them helpfully.

After we type a prompt into ChatGPT and hit Enter, our prompt is first wrapped with [INST] … [/INST] tokens and sent to the LLM. The model, having learned to treat these tokens as an instruction boundary, now switches to answer mode - this activates a "response template" it has for how it should answer the prompt. If the topic of the prompt is not well-represented in the training data, the model lacks specific patterns to draw from, and it will be forced to "fill in the blanks" of the template using generalized language patterns. Generalized language patterns produce tokens that are contextually relevant but not factually grounded, and when they are used to make "claims", you get a hallucination.

#Saying I Don't Know

If you were to ask ChatGPT today to "Tell me about Blake Bourne", it would say something along the lines of "I don't know".

Example of ChatGPT's 'I don't know' response to query about Blake Bourne

This is something that is also learned during IFT, as instruction-tuning datasets now contain examples of when the model should respond with uncertainty or "I don't know". But even with this ability, LLMs still hallucinate - OpenAI even reports that its latest reasoning model o3 hallucinates more than its predecessor o1 [5].

Graph showing hallucination rates in different OpenAI models

So while this post explains why LLMs hallucinate, there's an equally important question to consider, which is when do LLMs hallucinate? Are there certain conditions and prompts that make hallucinations more likely? Recent research from Anthropic provides some insight here that I'd love to talk about, but this post is already too long, so I'll save that for next time.

Thanks for reading!

References

[1] Highly recommend watching Large Language Models explained briefly by 3Blue1Brown if you want an animated explanation.

[2] 01:01:06 from Andrej Karpathy's YouTube video Deep Dive into LLMs like ChatGPT.

[3] The OpenAssistant dataset contains examples of an instruction-following dataset. Karpathy mentions this at 01:12:46.

[5] OpenAI o3 and o4-mini System Card. See 3.3 Hallucinations.