
How Large Language Models Work

Inside the transformer architecture — tokens, attention, and why 'next token prediction' produces remarkably capable systems.

To use language models effectively, you don't need a PhD in machine learning. But a basic mental model of what's happening under the hood changes how you interact with them. This lesson builds that model.

Everything Starts with Tokens

LLMs don't process words — they process tokens: chunks of text that range from a single character to an entire word. The word "beyond" might be one token. "Extraordinarily" might be three.

Why does this matter? Because the model's context window is measured in tokens, not words. A 100,000-token context window holds roughly 75,000 English words. And because usage is billed per token, token count is also why some prompts cost more than others.

"The quick brown fox" → ["The", " quick", " brown", " fox"]
"ChatGPT"            → ["Chat", "G", "PT"]
"tokenization"       → ["token", "ization"]

The Core Task: Next Token Prediction

Here's the fundamental insight: language models are trained to predict the next token.

Given "The quick brown", what comes next? " fox" — with very high probability. Given "The patient presents with acute", what comes next? Probably something medical.

By training on trillions of examples of this task — essentially, compressing the internet into a model — LLMs learn extraordinarily rich representations of language, facts, reasoning patterns, and world knowledge.
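
To see the task at miniature scale, here is a toy "model" that simply counts which token follows which in a tiny, invented corpus and predicts the most frequent continuation. Real LLMs replace the lookup table with a neural network and the ten-word corpus with trillions of tokens, but the objective is the same:

from collections import Counter, defaultdict

# Toy training corpus (invented for illustration).
corpus = "the quick brown fox jumps over the lazy dog . the quick brown fox sleeps .".split()

# "Training": count which token follows each token.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(token):
    # Return the continuation seen most often after this token.
    return follows[token].most_common(1)[0][0]

print(predict_next("quick"))   # 'brown'
print(predict_next("brown"))   # 'fox'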

The emergent capabilities (code generation, translation, reasoning) aren't explicitly trained — they arise from doing next-token prediction very, very well at scale.

The Transformer Architecture

The transformer, introduced in the 2017 paper "Attention Is All You Need," has three key components, sketched in code below:

1. Embeddings: Each token is converted into a vector of numbers (an embedding) — a point in high-dimensional space where similar tokens cluster together. "King" and "Queen" are nearby; "King" and "automobile" are far apart.

2. Attention Mechanism: This is the core innovation. Self-attention lets every token in the input "attend to" every other token simultaneously — weighing how relevant each token is to understanding the current one.

When processing the word "it" in "The animal didn't cross the street because it was too tired," the attention mechanism figures out that "it" refers to "animal," not "street" — by learning which tokens are most relevant in context.

3. Feed-Forward Layers: After attention, each position passes through a neural network layer that transforms the representation. These layers are where much of the factual knowledge is believed to be stored.
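
To make the three pieces concrete, here is a heavily simplified numpy sketch of a single transformer block. Random vectors stand in for learned embeddings and weights, so the numbers are meaningless; only the shapes and the flow of the computation are the point:

import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_hidden = 4, 8, 32           # e.g. the 4 tokens of "The quick brown fox"

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# 1. Embeddings: one vector per token (random here; learned in a real model).
X = rng.normal(size=(seq_len, d_model))

# 2. Self-attention: every position weighs every other position.
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv
weights = softmax(Q @ K.T / np.sqrt(d_model))   # rows sum to 1: attention weights
attended = weights @ V                          # each position becomes a weighted mix of all positions

# 3. Feed-forward: expand, apply a nonlinearity, project back (per position).
W1, W2 = rng.normal(size=(d_model, d_hidden)), rng.normal(size=(d_hidden, d_model))
out = np.maximum(0, attended @ W1) @ W2         # ReLU here; many models use GELU variants

print(weights.shape, out.shape)                 # (4, 4) attention map, (4, 8) transformed token vectors

A real block also adds residual connections, layer normalization, and multiple attention heads, and dozens of such blocks are stacked on top of each other; those details are omitted here to keep the data flow visible.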

Training vs. Inference

Training is when the model learns. Billions of examples, gradient descent, weeks of GPU time, enormous cost. This happens once (or periodically, to update the model).
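
As a toy illustration of what one training step looks like, here is a sketch with a single linear layer, an invented context vector, and a known "true next token"; real training repeats updates like this across billions of examples and vastly larger models:

import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, lr = 10, 8, 0.1

context = rng.normal(size=d_model)          # stand-in for "everything seen so far"
true_next = 3                               # index of the correct next token
W = rng.normal(size=(d_model, vocab_size))  # the parameters being learned

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for step in range(3):
    probs = softmax(context @ W)            # predicted distribution over the vocabulary
    loss = -np.log(probs[true_next])        # cross-entropy: low probability on the truth = high loss
    grad = np.outer(context, probs - np.eye(vocab_size)[true_next])
    W -= lr * grad                          # gradient descent: nudge weights to reduce the loss
    print(step, round(float(loss), 3))      # the loss shrinks step by step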

Inference is when you use the model. You send a prompt, the model predicts one token at a time until it's done. Much cheaper, but still compute-intensive — which is why API calls cost money.
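
The inference loop itself is conceptually simple. Here is a sketch with a stand-in predict_next_token function (a hypothetical placeholder for the real neural network):

def generate(prompt_tokens, predict_next_token, max_new_tokens=20, stop_token="<eos>"):
    # Autoregressive decoding: predict one token, append it, repeat.
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next_token(tokens)   # the model sees everything generated so far
        if next_token == stop_token:
            break
        tokens.append(next_token)
    return tokens

# Usage with a trivial stand-in "model" that says " fox" after " brown", then stops.
fake_model = lambda toks: " fox" if toks[-1] == " brown" else "<eos>"
print(generate(["The", " quick", " brown"], fake_model))   # ['The', ' quick', ' brown', ' fox']

Because every new token requires another full pass through the model, longer outputs take proportionally more compute.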

Temperature and Sampling

When the model predicts the next token, it outputs a probability distribution. "70% chance it's 'the,' 15% chance it's 'a,' 5% chance it's 'an'..."

Temperature controls how you sample from this distribution:

  • Low temperature (0.1–0.4): Almost always pick the most likely token. Nearly deterministic, predictable, "safer."
  • High temperature (0.8–1.2): Sample more broadly, pick less likely tokens. More creative, more variable, sometimes surprising.
  • Temperature 0: Always pick the maximum probability token (greedy decoding).

For factual tasks, lower temperature. For creative writing, higher. Most APIs default around 0.7–1.0.
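
Here is a small sketch of how temperature changes sampling, using an invented set of candidate tokens and scores:

import numpy as np

def sample_with_temperature(logits, temperature, rng):
    if temperature == 0:
        return int(np.argmax(logits))            # greedy decoding
    scaled = np.array(logits) / temperature      # low T sharpens the distribution, high T flattens it
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

tokens = [" the", " a", " an", " this"]
logits = [2.0, 0.5, -0.5, -1.0]                  # invented scores from the model
rng = np.random.default_rng(0)

for t in (0, 0.2, 1.0):
    picks = [tokens[sample_with_temperature(logits, t, rng)] for _ in range(5)]
    print(t, picks)   # temperature 0 always picks " the"; higher temperatures vary more often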

Why This Mental Model Matters

Understanding that LLMs are next-token predictors helps explain several behaviors:

  • Hallucination: The model predicts plausible-sounding tokens, even when it doesn't "know" the answer. It's pattern-matching, not retrieving facts from a database.
  • Prompt sensitivity: Small changes in wording shift the probability distributions, leading to different outputs.
  • Context window limits: Everything must fit in the window. Beyond that, the model has no memory.
  • Inconsistency: Different runs at non-zero temperature produce different outputs.

In the next lesson, we'll cover how training shapes behavior — RLHF, system prompts, and why different models feel so different to interact with.