Published: June 14, 2025
In this post, I'd like to share my working mental model for why LLMs hallucinate.
This is intended for users of LLMs via ChatGPT and other AI assistants who aren't necessarily familiar with the underlying technology. I'm by no means an LLM expert, but I do enjoy finding intuitive ways to explain their inner workings with visuals, concrete examples, and analogies. Many of the concepts here are simplified to make them easier to understand, so please treat this post as more of an accessible introduction and less of a definitive guide.
Here's a high-level description of my hallucination mental model:
The rest of the post explains each point in greater detail by showing concrete examples from Mistral-7b-v0.1, a popular open-source LLM. Mistral is an older, smaller model, which makes it more prone to hallucinations - we can get it to hallucinate simply by asking it to Tell me about Blake Bourne (a name I made up). The LLMs that power today's AI assistants are a bit "smarter" - for instance, ChatGPT will say it doesn't recognize the name, for reasons I talk about at the end. Nonetheless, the general explanation for why LLMs hallucinate is applicable to all models.
LLMs generate text by repeatedly predicting the next piece of text in a sequence, an operation known as next-token prediction.
This operation starts with input text that is converted to a token sequence. Tokens are numbers that represent bits of text. These bits might be words, parts of words, or symbols and punctuation. LLMs have a fixed set of tokens they can work with, which is known as a vocabulary. Mistral's vocabulary contains 32,000 tokens - 20 of which are shown below:
"Vocabulary of Mistral-7b-v0.1"
What we recognize as single words will often look like multiple tokens to an LLM. Notice how vocabulary is split into two tokens here:
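If you'd like to see this for yourself, here's a minimal sketch using the Hugging Face transformers library (assuming it's installed and you can download Mistral's tokenizer):

```python
# A minimal sketch of tokenization with the Hugging Face `transformers` library.
# Assumes `pip install transformers` and access to the Mistral-7B-v0.1 tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

print(tokenizer.vocab_size)  # 32000 - the size of Mistral's vocabulary

# A single English word often maps to more than one token.
ids = tokenizer.encode("vocabulary", add_special_tokens=False)
print(ids)                                   # the token ids (numbers)
print(tokenizer.convert_ids_to_tokens(ids))  # the text pieces those ids stand for
```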
As the name implies, next-token prediction involves predicting the next token in a given input sequence. LLMs express their predictions by assigning a probability to every token in their vocabulary. These probabilities indicate how likely each token is to follow the input, and together they are known as a next-token probability distribution.
The next-token probabilities are shown on the right. Each token in the vocabulary has a probability assigned to it, but only the top 4 and the bottom 3 are shown here.
LLMs can calculate this distribution for virtually any input sequence:
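For the curious, here's a rough sketch of how you could compute one of these distributions yourself with transformers and PyTorch. It assumes you can load the full Mistral-7B-v0.1 weights (roughly 15 GB), and the exact probabilities you see will vary; later sketches in this post reuse the model and tokenizer defined here:

```python
# Sketch: computing the next-token probability distribution for an input sequence.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)

input_ids = tokenizer("twinkle twinkle little", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(input_ids).logits   # shape: (1, sequence_length, 32000)

# The scores at the last position cover every token in the vocabulary;
# softmax turns them into probabilities that sum to 1.
probs = torch.softmax(logits[0, -1], dim=-1)

top = torch.topk(probs, k=4)
for p, i in zip(top.values, top.indices):
    print(f"{tokenizer.decode(i.item())!r}: {p.item():.3f}")
```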
LLMs can sample from probability distributions to produce a new token. Imagine a hat with 10,000 slips of paper, with a token written on each slip. In the twinkle twinkle little example, star would be on 7,373 slips, stars would be on 639 slips, bat on 332, and so on. A token is drawn from the hat at random and is called a sample.
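Here's the "hat" as a tiny Python sketch, using the counts from the example above (every other token in the vocabulary is omitted for brevity):

```python
# Sketch: sampling from the "hat". Each token appears on a number of slips
# proportional to its probability; random.choices draws one slip at random.
import random

slips = {"star": 7373, "stars": 639, "bat": 332}  # ...plus the rest of the vocabulary
tokens, counts = zip(*slips.items())

sample = random.choices(tokens, weights=counts, k=1)[0]
print(sample)  # usually "star", occasionally "stars" or "bat"
```

In practice this draw happens over the full 32,000-token distribution, e.g. with something like torch.multinomial(probs, num_samples=1), but the idea is the same.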
This sampled token can now be added to the initial sequence to form a new input, which the model can use to calculate a new probability distribution.
TOP: 94% of the time, the model will sample "skills" and add it to the input.
BOTTOM: It then calculates a new probability distribution for the new sequence.
We can now sample from this new distribution to produce the next token in the sequence, and so on, until a special end-of-sequence (</s>) token is sampled, signaling the model to stop. The visual below shows how Mistral samples from probability distributions to generate text about LeBron James:
The probability distribution for each step of the text generation. The sampled tokens are highlighted in green.
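In code, that loop looks roughly like the sketch below, reusing the model and tokenizer from the earlier sketch (in practice you'd call model.generate, which does the same thing more efficiently):

```python
# Sketch: generate text by repeatedly sampling the next token and appending it,
# stopping when the end-of-sequence token </s> is drawn.
import torch

input_ids = tokenizer("Tell me about LeBron James", return_tensors="pt").input_ids

while input_ids.shape[1] < 200:  # safety cap on length
    with torch.no_grad():
        logits = model(input_ids).logits
    probs = torch.softmax(logits[0, -1], dim=-1)
    next_id = torch.multinomial(probs, num_samples=1)   # one draw from the "hat"
    if next_id.item() == tokenizer.eos_token_id:        # </s> -> stop
        break
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))
```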
Sampling is non-deterministic: two draws from the same exact hat won't always produce the same token. This explains why re-running the same prompt in ChatGPT can produce two different results.
Two different responses to the same prompt.
When we use ChatGPT, our initial prompt serves as the starting token sequence, and the model generates a response by predicting what comes after our prompt. For the most part, these responses are quite helpful, but they sometimes contain hallucinations - statements that sound plausible but are factually incorrect. To understand why hallucinations occur, we need to look at how LLMs are trained.
LLMs have weights (a.k.a parameters) that are responsible for calculating the next-token probability distribution for any given input token sequence. We can think of each of these weights as an individual dial that can be turned. Turning any single dial (changing the value of any single weight) changes how the next-token probability distributions across all input sequences are calculated.
LLMs are initially untrained, with weights initialized to random values. In this state, next-token probability distributions are meaningless and sampling from them generates incoherent text. Training refers to the process of getting the values of these weights, and the distributions they produce, into a more usable state.
The model's weights are initially random, and the next-token probability distributions are meaningless.
Pretraining is the first stage in an LLM's training pipeline. Pretraining involves having an LLM predict the next token in text sequences taken from the internet, and adjusting its weights when it's wrong. Here's a simplified description of how pretraining works:
Note: The red bar isn't the actual loss, but it is proportional to it.
The objective of these billions and billions of updates is to minimize prediction error. Intuitively, this means if you were to run the same training snippets through a model after pretraining, it would consistently assign high probabilities to the correct next tokens.
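To make "predict, measure the error, nudge the weights" concrete, here's a heavily simplified sketch of a single pretraining update, reusing the model and tokenizer from earlier. Real pretraining uses full precision, enormous batches, and trillions of tokens; the training snippet below is just an illustrative sentence, not an actual line from Mistral's training data:

```python
# Sketch: one simplified pretraining step on a single training snippet.
import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

snippet = tokenizer("LeBron James is a professional basketball player",
                    return_tensors="pt").input_ids
inputs, targets = snippet[:, :-1], snippet[:, 1:]   # predict each token from the ones before it

logits = model(inputs).logits
# The loss is high when the model assigns low probability to the actual next tokens.
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

loss.backward()        # work out which way to turn each "dial"
optimizer.step()       # turn every dial a tiny bit to reduce the error
optimizer.zero_grad()
```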
When pretraining is finished, the weights are in a more usable state. A pretrained LLM is referred to as a base model, and its weights encode language patterns present in the training data. These patterns can be leveraged to fluently continue text.
I'd like to differentiate between two types of patterns in particular, which I refer to as specific and generalized language patterns. Hallucinations happen, in part, when LLMs use generalized language patterns to generate responses about topics they lack specific patterns to draw from.
The example probability distributions below come from the Mistral-7b-v0.1 base model (there's another variant, Mistral-7b-Instruct-v0.1, which we'll use shortly).
If we feed Mistral the input Once upon a, here are the 4 most likely next tokens:
This is an example of a specific language pattern the LLM has learned. Specific language patterns are characterized by peaked probability distributions where only one or a handful of next tokens dominate, and include factual knowledge the LLM has picked up on. For example, the high probability assigned to basketball reflects the model's knowledge of LeBron James.
These high probabilities are the result of minimizing prediction error: each time Mistral sees a snippet about LeBron James in its training data, it assigns higher probabilities to the tokens that follow (e.g. "professional basketball player", "drafted in 2003", "Los Angeles Lakers"). With enough repetition, the next-token probability distributions related to LeBron concentrate around these tokens. Sampling from those distributions produces those tokens consistently, and when Mistral generates text about LeBron James, it effectively functions as if it's recalling factual knowledge.
Generalized language patterns, on the other hand, are patterns that generalize beyond a specific entity. As an example, here's what happens if I make up a name (Blake Bourne), and ask Mistral to predict the next token:
Unlike the LeBron James example, these probabilities are flat and even, reflecting the model's uncertainty. I think the intuition for why these probabilities are flat looks something like this: while there are no mentions of Blake Bourne in the training data, there are countless examples of similar phrases.
Each time it sees a phrase such as (name of person) competes professionally as a (name of sport) during pretraining, the model not only associates the specific sport with that person, it also learns the general pattern that competes professionally as a is usually followed by a class of sports such as bodybuilding and mixed martial arts. So when it comes to making a prediction about Blake Bourne, its "best bet" is to distribute probability evenly amongst those activities, since it has no specific associations for the name.
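One way to see the difference between a specific and a generalized pattern is to compare how concentrated the next-token probabilities are for a well-known name versus a made-up one. Here's a sketch reusing the base model from earlier; the two prompts are my own illustrative choices, and the exact tokens and numbers you get will differ:

```python
# Sketch: peaked vs flat next-token distributions for two similar prompts.
import torch

def top_next_tokens(prompt, k=4):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    probs = torch.softmax(logits[0, -1], dim=-1)
    top = torch.topk(probs, k=k)
    return [(tokenizer.decode(i.item()), round(p.item(), 3))
            for p, i in zip(top.values, top.indices)]

# Specific pattern: expect a peaked distribution dominated by basketball-related tokens.
print(top_next_tokens("LeBron James competes professionally as a"))
# Generalized pattern: expect a flatter spread across many plausible sports.
print(top_next_tokens("Blake Bourne competes professionally as a"))
```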
This fallback allows the model to make a contextually relevant prediction about Blake Bourne, which means it can generate coherent text about new, unseen inputs. As an analogy, I remember playing Mad Libs in elementary school, a game involving templates like this one:
I think of generalized language patterns as learning a variety of Mad Libs templates. With a little imagination, the next-token probability distributions can be thought of as the start of a template:
I could generate coherent stories in elementary school by filling in the blanks of a Mad Libs template. LLMs do something similar; they just fill in the blanks by noticing which tokens typically come next in similar contexts from their training data.
This is the foundation of my mental model for why LLMs hallucinate, especially when asked about obscure topics: the lack of representation in the training data means the model has to generate responses using generalized language patterns, rather than drawing from specific language patterns that reflect factual knowledge. They're essentially "playing Mad Libs".
But it isn't the complete mental model: starting with Blake Bourne competes professionally as a is a bit like dropping the LLM off in the middle of a steep hill - it basically forces the LLM to hallucinate. To continue the sentence coherently, the model has to guess the name of a sport. It's too late to stop, climb back to the top, and say "I don't know who that is."
This is where the Instruction Fine-Tuning (IFT) training stage comes into play. At a high level, I think of this stage as teaching the model which "hills" to go down in the first place.
Instruction Fine-Tuning (IFT) happens after pretraining. The goal here is to take the weights learned during pretraining and fine-tune (update) them so that the LLM can "behave" like an assistant - meaning it can provide helpful answers in response to instructions.
Recall that pretraining teaches an LLM how to fluently continue text based on patterns in its training data. This alone doesn't automatically make an LLM an assistant. To see why, let's take the prompt: What's the weather like today? A purely pretrained model might reasonably continue this as part of a dialogue in a story ("she asked, looking out the window…") or as part of a quiz ("A. Sunny B. Rainy C. …"). If we want the LLM to act like an assistant, we need to teach the model to treat prompts as instructions to be immediately followed by helpful answers.
Instruction Fine-Tuning does this by showing the model multiple examples of instructions that are immediately followed by helpful answers - these examples are known as the instruction-following dataset. Crucially, these training examples include special formatting tokens [INST] and [/INST] that indicate the start and end of an instruction:
The training process looks a lot like pretraining. Chop up these instruction-answer pairs, remove the last token, have the LLM predict the next token, and adjust weights according to the loss.
Top: IFT training "snippets".
Bottom: Training on those snippets adjusts the model's weights to minimize prediction error.
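As a rough sketch of what one of those formatted training snippets might look like in code (the boundary tokens below follow Mistral-Instruct's published convention; other models use different formats, and the answer text is just an illustrative placeholder):

```python
# Sketch: assembling a hypothetical instruction-tuning example. The instruction is
# wrapped in boundary tokens, and the helpful answer follows immediately after.
instruction = "Tell me about LeBron James"
answer = "LeBron James is an American professional basketball player..."

training_example = f"<s>[INST] {instruction} [/INST] {answer}</s>"
print(training_example)

# Training on this text works just like pretraining: the model predicts each
# next token of the sequence, and its weights are nudged whenever it's wrong.
```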
Just as pretraining involves learning patterns in the training data, IFT involves learning patterns present in the instruction-following dataset which get encoded into the model's weights. This updated model is referred to as an instruction-tuned model, and I'll use probability distributions from an instruction-tuned Mistral-7B-Instruct-v0.1 to demonstrate two of these patterns.
If we give Mistral-7b-Instruct the text Tell me about Blake Bourne without any formatting tokens, the next-token probability distribution favors tokens that promote continuation (it generates Tell me about Blake Bourne, part 1. This article was originally posted…).
\n means a new line - the equivalent of hitting Enter.
Wrap the text in instruction boundary tokens, however, and the probability distribution shifts dramatically:
Instead of continuing the text, the model has learned that the next token should be the start of an answer. The model's certainty in "Blake" reflects a pattern I think of as a response template, or the start of a "claim" the model makes about Blake Bourne.
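If you'd like to reproduce that comparison, a sketch like this should work, assuming you can load the instruct variant (outputs are sampled, so they'll differ run to run):

```python
# Sketch: comparing generations with and without the instruction boundary tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mistralai/Mistral-7B-Instruct-v0.1"
tok = AutoTokenizer.from_pretrained(name)
mdl = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)

def generate(text):
    ids = tok(text, return_tensors="pt").input_ids
    out = mdl.generate(ids, do_sample=True, max_new_tokens=40)
    return tok.decode(out[0], skip_special_tokens=True)

print(generate("Tell me about Blake Bourne"))                # tends to continue the text
print(generate("[INST] Tell me about Blake Bourne [/INST]")) # tends to answer it
```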
Many of the example answers will share a similar structure. Although we don't know the exact contents of Mistral's instruction-tuning dataset, it's reasonable to assume it might contain a few examples that look like this:
If it sees these examples during IFT, the model will learn these answers share a structural pattern, which we might loosely think of as a "biographical response" template:
There are glimpses of this template in the response Mistral-7b-Instruct generates about LeBron James:
[INST] Tell me about LeBron James [/INST]
LeBron James is an American professional basketball player who is widely considered to be one of the greatest...
Here are the underlying probability distributions Mistral samples from to generate the response:
We can see that the model is quite certain in the "response template" portion of its answer, which sets it up to make a claim about LeBron James. Thanks to pretraining, the model has specific language patterns about LeBron James to draw from, and the claim is factually correct.
Here's the key: if the instruction-tuning dataset only includes examples that follow these biographical templates, the model will apply that template to all such prompts, even when the person being asked about doesn't exist. For example, here's what it generates about Blake Bourne:
[INST] Tell me about Blake Bourne [/INST]
Blake Bourne is an American football coach and former player. He was born on June 3,...
And here are the probability distributions used to generate the response:
Just like the LeBron James example, the model is certain that it should start with a "response template" to make a claim about Blake Bourne. But this drops the model in the middle of a steep hill, and when it comes to filling in the details about who he is, our model has no specific patterns to draw from. It has to rely on generalized language patterns (actor? professional? football?), and we get a hallucination.
The LLMs we use in AI assistants such as ChatGPT have undergone both Pretraining and Instruction Fine-Tuning. Pretraining teaches the LLM how to fluently continue text based on language patterns found on the internet; Instruction Fine-Tuning teaches it to recognize instructions and answer them helpfully.
After we type a prompt into ChatGPT and hit Enter, our prompt is first wrapped in special instruction-boundary tokens (the equivalent of the [INST] … [/INST] tokens above) and sent to the LLM. The model, having learned to treat these tokens as an instruction boundary, switches into answer mode - this activates a "response template" for how it should answer the prompt. If the topic of the prompt is not well-represented in the training data, the model lacks specific patterns to draw from and is forced to "fill in the blanks" of the template using generalized language patterns. Generalized language patterns produce tokens that are contextually relevant but not factually grounded, and when they are used to make "claims", you get a hallucination.
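In code, that wrapping step looks roughly like this sketch; it uses Mistral-Instruct's chat template via the transformers tokenizer, while ChatGPT uses its own (different) format that plays the same role:

```python
# Sketch: how an assistant wraps a user prompt in instruction-boundary tokens
# before handing it to the model, using the tokenizer's built-in chat template.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
messages = [{"role": "user", "content": "Tell me about Blake Bourne"}]
print(tok.apply_chat_template(messages, tokenize=False))
# Expect something like: <s>[INST] Tell me about Blake Bourne [/INST]
```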
If you were to ask ChatGPT today to "Tell me about Blake Bourne", it would say it doesn't know who that is.
This is something that is also learned during IFT, as instruction-tuning datasets now contain examples of when the model should respond with uncertainty or "I don't know". But even with this ability, LLMs still hallucinate - OpenAI even reports that its latest reasoning model o3 hallucinates more than its predecessor o1.
So while this post explains why LLMs hallucinate, there's an equally important question to consider, which is when do LLMs hallucinate? Are there certain conditions and prompts that make hallucinations more likely? Recent research from Anthropic provides some insight here that I'd love to talk about, but this post is already too long, so I'll save that for next time.
Thanks for reading!
Highly recommend watching Large Language Models explained briefly by 3Blue1Brown if you want an animated explanation.
01:01:06 from Andrej Karpathy's YouTube video Deep Dive into LLMs like ChatGPT
The OpenAssistant dataset contains examples of an instruction-following dataset. Karpathy mentions this at 01:12:46.
OpenAI o3 and o4-mini System Card. See 3.3 Hallucinations.