How Chatbots Actually Work: Predicting the Next Word, At Scale, With Transformers, Attention, and Human Feedback

Imagine you stumble on a short movie script where a person is chatting with their AI assistant. You can read every line the person says. But right when the AI is about to reply, the page is torn off. What would the assistant have said next? If you had a magical machine that could read text and guess the next word in the sequence, you could finish the scene. You would feed the dialogue into this machine, let it predict the first word of the AI's reply, then the next, then the next, watching the answer unfold one word at a time.

That is exactly what a chatbot does. The magic looks like fiction when you see it as a movie scene, but underneath, it is a very particular kind of math, a very large function with an appetite for patterns, and a very serious amount of computation. Let me walk you through it in the precise order it actually works, so the whole picture clicks into place.

The Movie Script Thought Experiment

Let us sit with that torn-off script for a moment. You have a dialogue where a person asks a question, maybe something specific like how to fix a leaky faucet, or maybe something more open-ended like why the sky is blue. The scene is vivid in your head. The lighting, the pause before a reply, the expectation that the assistant will answer helpfully. The only problem is that the assistant’s reply is missing. It is an empty beat begging to be completed.

Now imagine you own a machine that, given any text, can tell you what word most plausibly comes next. This machine does not care about the plot or the characters. It only cares about sequences and the next step in those sequences. You pass the user’s line into it, and it gives you a probability distribution over all possible next words. Maybe it thinks “Sure,” is likely. Maybe “Here’s” is even more likely. You pick one, add it to the script, then ask again for the next word, and the next. Before long, the answer is filled in. The reply sounds coherent because each word flowed from the previous context.

Read that process slowly and you will notice that the machine is not writing an answer in one go. It is predicting one word at a time, repeatedly, and the growing text becomes its own context. That is not just a useful analogy. That is literally what is happening each time you interact with a modern chatbot.
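
If you like seeing the loop spelled out, here is a minimal sketch of that one-word-at-a-time process in Python. The `next_word_distribution` function is a hypothetical stand-in for the model itself: given the text so far, it returns a mapping from candidate words to probabilities.

```python
import random

def complete_scene(script, next_word_distribution, max_words=100):
    # `next_word_distribution` is a hypothetical stand-in for the model:
    # given the text so far, it returns {word: probability}.
    text = script
    for _ in range(max_words):
        probs = next_word_distribution(text)             # what could come next?
        words = list(probs)
        weights = list(probs.values())
        next_word = random.choices(words, weights=weights)[0]
        if next_word == "<end>":                         # the model signals it is done
            break
        text += " " + next_word                          # the output becomes new context
    return text
```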

From Autocomplete to a Conversation: What a Large Language Model Really Is

A large language model is a sophisticated mathematical function that takes in some text and outputs probabilities for what word should come next. Think of it like a huge weighted web that lights up differently depending on the words you feed it. It does not just point to one word with certainty. Instead it assigns a probability to every possible next word in its vocabulary. If the context is “peanut butter and,” then “jelly” might get a very high probability, while “galaxy” might get a tiny one, though still nonzero.

To build a chatbot, we structure the input text to look like a conversation. We write a little system preface that describes an interaction between a user and a helpful AI assistant. Then we append whatever the user actually types as the first part of that dialogue. The model is asked to predict the next word that the assistant would say in response. After the first word is chosen, the model is asked again with the updated context. The process repeats, and the assistant’s answer streams out one token at a time.
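
To make that concrete, here is a toy sketch of the dialogue framing. The preface wording and role labels are purely illustrative; they are not any particular product's actual template.

```python
def build_prompt(user_message):
    # The preface and labels are illustrative, not a real vendor template.
    return (
        "What follows is a conversation between a user and a helpful,\n"
        "knowledgeable AI assistant.\n"
        f"User: {user_message}\n"
        "Assistant:"  # the model predicts what comes after this label
    )

print(build_prompt("Why is the sky blue?"))
```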

There is a subtle trick that makes the output feel more natural. If you always pick the single most likely next word, you get stiff, repetitive prose. Real language has quirks, detours, and a bit of surprise. So during generation, we let the model sometimes pick less likely words at random, guided by the probabilities it assigns. You can picture it like reaching into a weighted bag where common words are heavy marbles and rare words are light ones. You usually pull a heavy marble, but now and then a lighter one slips out, and that keeps the voice lively.
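
You can simulate that weighted bag in a few lines. The probabilities below are made up for the "peanut butter and" example; the point is just that sampling, rather than always taking the top word, lets the lighter marbles slip out now and then.

```python
import random

# Made-up probabilities for the context "peanut butter and".
next_word_probs = {"jelly": 0.85, "chocolate": 0.10, "crackers": 0.04, "galaxy": 0.01}

words = list(next_word_probs)
weights = list(next_word_probs.values())
for _ in range(10):
    # Usually the heavy marble ("jelly"), occasionally a lighter one.
    print(random.choices(words, weights=weights)[0], end=" ")
```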

Here is an odd twist. Even though the model itself is deterministic, meaning the underlying function gives the same probabilities for the same input, the sampling process introduces randomness. So the same prompt will often yield different answers on different runs. That wiggle room is a feature, not a bug. It helps with creativity, it helps avoid getting stuck in a single phrasing, and it makes the assistant feel human without pretending to be human.

How Models Learn: Training on a Mountain of Text

Models learn how to make these predictions by processing an enormous amount of text, usually pulled from the public internet along with licensed sources. When I say enormous, I do not mean a big bookshelf. I mean a body of text that would flatten a reader. If a typical human tried to read the amount of text used to train GPT-3, nonstop, no breaks, 24 hours a day, 7 days a week, the journey would take more than 2,600 years. That is not a typo. Two thousand six hundred years of continuous reading, and that was a model from years ago. Larger models since then train on much, much more.
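
A quick back-of-the-envelope check makes that number believable. Both inputs below are assumptions: the roughly 300 billion training tokens reported for GPT-3, treated as about one word each, and a brisk reading pace of 200 words per minute.

```python
# Assumptions: ~300 billion training tokens (the figure reported for
# GPT-3), treated as roughly one word each, read at 200 words per
# minute around the clock.
tokens = 300e9
words_per_minute = 200

minutes = tokens / words_per_minute
years = minutes / (60 * 24 * 365)
print(f"{years:,.0f} years of nonstop reading")  # roughly 2,850 years
```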

What does learning look like inside the model? Picture a machine with a mind-bending number of dials, knobs, and sliders. Each dial represents one continuous value inside the model. These values are called parameters or weights. The way the model behaves is entirely determined by where those dials are set. Turn a few and the model gets better at grammar. Turn a different cluster and it starts to pick up on style. With hundreds of billions of these parameters, the machine has a huge capacity to encode patterns about language.

No human sits there and sets these parameters. The model begins at random, which means at the start it outputs gibberish. It has no idea what words go together. There is no concept of subject-verb agreement, no knowledge that “cat” and “cats” are related, no notion of world facts. All of that is learned. The learning happens by showing the model example after example where it is challenged to predict the next word. Then we score how well it did and nudge all the dials to make the right answer a little more likely next time.

Each training example can be short, like a handful of words, or long, like thousands of tokens. The mechanics are always the same. We pass in all but the last word from the example. The model outputs a probability distribution for what the last word might be. We compare that distribution against the true last word from the example. Where it got things wrong, an algorithm called backpropagation computes how to tweak the parameters so that the model becomes slightly more likely to choose the true word and slightly less likely to choose all the others.
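
Here is what one of those nudges might look like, sketched in PyTorch. The `model` is a hypothetical stand-in that maps token ids to scores over the vocabulary; everything else is the predict, compare, backpropagate, update cycle described above.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, tokens):
    # `model` is a hypothetical stand-in: token ids in, vocabulary logits out.
    context, true_next = tokens[:-1], tokens[-1:]   # all but the last word, and the answer
    logits = model(context.unsqueeze(0))[0, -1]     # scores over the whole vocabulary
    # Cross-entropy measures how little probability the model gave the true word.
    loss = F.cross_entropy(logits.unsqueeze(0), true_next)
    optimizer.zero_grad()
    loss.backward()    # backpropagation: compute how each dial should move
    optimizer.step()   # nudge all the dials a tiny bit
    return loss.item()
```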

Do this not just for a few examples, but for many trillions of them. The model starts by flailing and missing obvious answers. Over time, it makes fewer blunders. Then it starts to generalize. Not only does it become accurate on the training examples, it gets good at predicting text it has never seen before. It learns grammar from repetition. It learns idioms from context. It learns that “bank” can mean money or river, and that the surrounding words resolve the ambiguity.

Mind-Boggling Compute: A 100 Million Year Thought Experiment

The scale of computation involved in training a large language model is hard to wrap your head around. Let us try anyway with a mental experiment. Imagine that you, personally, could perform 1 billion additions and multiplications every single second. You are basically a superhero calculator with perfect focus. How long would it take you to perform all the operations needed to train the largest language models we build today?

Would it be a year? That already sounds wild. Would it be something like 10,000 years? That feels like geologic time. The answer is actually much more than that. It is well over 100 million years of nonstop arithmetic at that blistering personal pace. That is the kind of raw computational work that goes into tuning those hundreds of billions of parameters. It gives you a sense of why this field depends so heavily on specialized hardware and huge, coordinated compute clusters.
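
Running the thought experiment in reverse shows where a number like that comes from. The total operation count below is an assumption, chosen at the 10^24-ish scale often cited for today's largest training runs.

```python
# Assumption: a training run on the order of 4e24 floating-point
# operations, at the scale often cited for today's largest models.
total_ops = 4e24
ops_per_second = 1e9     # our superhero calculator's pace

seconds = total_ops / ops_per_second
years = seconds / (60 * 60 * 24 * 365)
print(f"{years / 1e6:,.0f} million years")  # roughly 127 million years
```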

Pretraining vs Being Helpful: Why We Add Reinforcement Learning from Human Feedback

Everything I have described so far is called pretraining. The goal during pretraining is simple. Autocomplete text from the internet. That objective teaches the model syntax, style, facts, and patterns. But it is not the same objective as being a good AI assistant. The internet contains plenty of passages that are unhelpful, off-topic, or worse. If you only trained on that objective, your assistant would often be clever but not aligned with what users want.

So we add another step called reinforcement learning from human feedback, or RLHF. Humans read model outputs and flag which ones are helpful, honest, and safe, and which ones are not. They suggest better answers when something feels off. Their judgments are used to further change the model’s parameters. In other words, we take the raw predictive ability learned from pretraining and refine it so the assistant is more likely to produce the responses that people actually prefer in a conversation setting.

This does not mean the model suddenly reasons like a person. It still predicts words. But that prediction process gets a new nudge from human preferences. If two plausible outputs exist, and one is curt while the other is helpful, the training shifts probability mass toward the helpful one. Over many rounds of this, the assistant’s tone and behavior shift toward what users expect when they ask for help.
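
One common ingredient in this stage is a reward model trained on human comparisons. The sketch below shows the pairwise idea: `reward_model` is a hypothetical scorer for a prompt and response, and the loss pushes the preferred answer's score above the rejected one, in the spirit of the Bradley-Terry-style objective used in RLHF pipelines.

```python
import torch.nn.functional as F

def preference_loss(reward_model, prompt, preferred, rejected):
    # `reward_model` is a hypothetical scorer: (prompt, response) -> scalar.
    score_good = reward_model(prompt, preferred)
    score_bad = reward_model(prompt, rejected)
    # Push the preferred answer's score above the rejected one;
    # the chatbot is later tuned to make high-scoring outputs more likely.
    return -F.logsigmoid(score_good - score_bad).mean()
```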

Why GPUs Matter, and Why Parallelization Changed Everything

Given the staggering amount of computation in pretraining, none of this would be practical without special chips called GPUs. GPUs are designed to run many operations in parallel. Instead of doing one addition then the next, they do thousands or millions at once. This parallelism matches the math behind deep learning, which is mostly big matrix multiplications and element-wise operations that can be spread across lots of cores.
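
To see why this math maps so well onto parallel hardware, note that a single matrix multiply already bundles billions of independent multiply-and-add operations. The sizes below are arbitrary but in the right ballpark for one layer of a large model.

```python
import numpy as np

x = np.random.randn(512, 4096)    # 512 token vectors, 4096 numbers each
W = np.random.randn(4096, 4096)   # one learned weight matrix

# One call, ~512 * 4096 * 4096 ≈ 8.6 billion multiply-adds, all of
# which are independent and can run in parallel on a GPU.
y = x @ W
print(y.shape)  # (512, 4096)
```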

Here is the catch. Not all language models can be easily parallelized over sequences. Before 2017, many models processed text one word at a time from left to right. That sequential dependency makes training slow because you have to wait for one step to finish before starting the next. You can still parallelize across examples in a batch, but not across positions in a single example, so a long sentence becomes a long wait.

Then a team of researchers at Google introduced a new architecture known as the Transformer. The key idea is simple to say and powerful in practice. Transformers do not read text strictly from start to finish. They soak in the whole sequence at once, in parallel, and let different parts of the input directly influence each other. That change unlocked huge efficiency gains and made it practical to train on much longer contexts with much more compute humming in parallel.

Inside a Transformer: Vectors, Attention, and Feedforward Networks

Let us walk cleanly through what a transformer does with text. The very first step is to turn each word or token into a long list of numbers. You can think of this list as a vector, an array that sits in a high-dimensional space. The training process only works with continuous values. So we need to encode language as numbers. These vectors are learned during training so that tokens with related meanings end up with related vectors. That way, math on vectors can mirror relationships in meaning.
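
In code, this first step is just a table lookup. The table below is tiny and randomly initialized for illustration; in a real model it holds tens of thousands of rows, each thousands of numbers long, all learned during training.

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2, "bank": 3}        # toy vocabulary
embedding_table = np.random.randn(len(vocab), 8)         # learned in practice

tokens = ["the", "cat", "sat"]
vectors = embedding_table[[vocab[t] for t in tokens]]    # one row per token
print(vectors.shape)  # (3, 8): three tokens, eight numbers each
```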

What makes transformers special is their reliance on an operation called attention. Attention lets each token’s vector look around at other tokens in the sequence and ask, who should I care about for this context, and by how much? All those vectors get a chance to talk to one another and refine what they encode, in parallel. If the word is “bank,” attention looks at neighbors like “river” or “loan” to decide which sense is relevant. The numbers for “bank” shift so that the vector prioritizes the right meaning for that instance.
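
Here is a minimal single-head version of that operation in numpy. Real transformers use many attention heads plus a causal mask, and the weight matrices are learned rather than random, but the core "who should I care about, and by how much" computation looks like this.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # how relevant is each token to each other?
    weights = softmax(scores)                # each row sums to 1
    return weights @ V                       # each vector becomes a weighted blend

d = 8
X = np.random.randn(5, d)                    # five token vectors
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
print(attention(X, Wq, Wk, Wv).shape)        # (5, 8): refined vectors
```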

There is another ingredient in each layer called a feedforward neural network. This is a small, independent network that takes the refined vector and transforms it through a few more learned nonlinear steps. You can think of attention as mixing information across positions, and the feedforward part as processing that information at each position to store useful patterns. The two operations together give the model the capacity to soak up structure, style, and facts over training.
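
And here is the feedforward half, again as a toy numpy sketch with random stand-ins for the learned weights. Notice that the same small network is applied to every position independently.

```python
import numpy as np

def feedforward(X, W1, b1, W2, b2):
    hidden = np.maximum(0.0, X @ W1 + b1)  # expand, then apply a nonlinearity (ReLU)
    return hidden @ W2 + b2                # project back to the model dimension

d, hidden_dim = 8, 32                      # real models are vastly wider
W1, b1 = np.random.randn(d, hidden_dim), np.zeros(hidden_dim)
W2, b2 = np.random.randn(hidden_dim, d), np.zeros(d)

X = np.random.randn(5, d)                  # the vectors coming out of attention
print(feedforward(X, W1, b1, W2, b2).shape)  # (5, 8): same shape, richer content
```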

In practice, a transformer has many layers stacked on top of one another. The data repeatedly flows through attention and feedforward blocks, again and again. With each pass, the vectors become more enriched with context. Early layers might focus on local grammar. Later layers might capture long-range dependencies, topic structure, and subtle cues that matter for a good next-word prediction. Everything is tuned so that, by the time we get to the end of the stack, the model has what it needs to pick the next word wisely.
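
Putting the two sketches together, the stack is just a loop. Real transformers also wrap each block with residual connections and normalization, which I omit here.

```python
# Reusing the attention and feedforward sketches from above.
def transformer_stack(X, layers):
    for attn_weights, ff_weights in layers:
        X = attention(X, *attn_weights)    # mix information across positions
        X = feedforward(X, *ff_weights)    # process each position on its own
    return X                               # vectors, now enriched with context
```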

From Enriched Vectors to a Next-Word Probability

After the sequence has moved through all those layers, we take the last vector in the sequence, the one that now encodes the context of the input and everything the model has learned. We apply one final transformation to turn that vector into a probability distribution over the entire vocabulary. The output does not say, the answer is this word. It says, here is how likely each possible next word is, given everything so far.
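
That final transformation is often called the unembedding: one more learned matrix turns the last vector into a score for every word in the vocabulary, and a softmax turns the scores into probabilities. The sizes and weights below are illustrative.

```python
import numpy as np

d, vocab_size = 8, 50_000
last_vector = np.random.randn(d)              # the context-enriched final vector
unembedding = np.random.randn(d, vocab_size)  # learned in practice

logits = last_vector @ unembedding            # one score per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                          # softmax: scores -> probabilities
print(probs.sum(), probs.argmax())            # 1.0, and the id of the top word
```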

Researchers design the framework for each step. We choose that there will be attention, that there will be feedforward layers, that the model will be trained to predict the next token. But the specific behavior is an emergent phenomenon from how those hundreds of billions of parameters settle during training. That is why it is so challenging to say exactly why the model chose one particular phrasing over another. We can point to patterns and sometimes interpret parts of the network, but on the whole, the behavior emerges rather than being hand-coded.

Why The Outputs Feel So Fluent

When you use a large language model to autocomplete a prompt, the words it generates often feel uncannily fluent. You get sentences that flow, paragraphs that stay on theme, and references that make sense. That happens because the model has seen so many examples of how people write, ask, and explain. It has learned the rhythms and transitions that make text readable. It has also learned that good answers are not just correct but shaped for the reader’s expectation.

Sometimes the outputs are more than just fluent. They are fascinating or downright useful. The model can combine patterns from different domains, navigate ambiguity, and show you a framing you had not considered. That pairing of fluency and usefulness is why this technology feels like magic when you first see it, even though underneath it is a mountain of statistics, a sea of vectors, and a lot of multiply-and-add.

If You Want More on Transformers and Attention

If you are new here and curious about more details on how transformers and attention work, boy, do I have some material for you. One option is to jump into a series I made about deep learning where we visualize and motivate the details of attention and all the other steps in a transformer. In that series, I go frame by frame through the math and the intuition, showing how the pieces click together. You will see how the vectors move, how the weights change, and why the architecture is built the way it is. It is designed so that you can feel the computations, not just read formulas.

Also, on my second channel, I just posted a talk that I gave a couple of months ago about this topic for the company TNG in Munich. The vibe of a talk is different from a produced video. There is a bit of live energy, some off-the-cuff clarifications, and a pacing that follows the audience’s curiosity. I sometimes prefer that format because it lets me linger where people naturally have questions and speed past what is already clear. If you want the subject to feel conversational and grounded, that talk might be your thing.

I will leave it up to you which one feels like the better follow-up. If you like polished visuals and carefully staged animations, the produced series is a great path. If you prefer the flow of a casual walkthrough with a room full of people nodding along, take the Munich talk for a spin. Either way, you will come away with the inner picture that makes transformers and attention feel tangible instead of mysterious.

Closing Thoughts

Let us tie the threads. A chatbot is a next-word prediction engine trained on a staggering corpus, refined with human feedback, and powered by architectures that let it process context in parallel. You give it a prompt shaped like a dialogue. It predicts one word at a time, with a bit of randomness to keep things lively, and it streams out an answer that reads like it was composed in one sitting.

The parameters start random. Backpropagation tunes them over trillions of examples until the model generalizes to new text it has never seen. The compute bill is huge, which is why GPUs and parallelization are not just nice to have but essential. Transformers make that parallelization practical by letting tokens pay attention to each other directly instead of waiting in a line.

Researchers design the scaffolding. The behavior emerges from how the parameters settle. That is why specific predictions can be hard to explain in mechanistic detail, and also why the results can feel fluent, fascinating, and useful in practice. If this piqued your curiosity, the deep learning series and the TNG Munich talk give you two paths to see the machinery up close. Pick the style that fits you best, and enjoy watching the pieces lock into place.

Thanks for reading. If you want to keep going, explore the attention visuals, try the talk, or reread the movie script thought experiment and notice how your mental model sharpened. Once you see chatbots as next-word predictors with very smart training, the whole field feels less like magic and more like a clear set of ideas you can hold in your head.