What this unit solvesCopying configuration without understanding the principles isn’t engineering; it’s copy-paste engineering. This unit explains how LLMs generate, why output floats and hallucination happens, and how the agentic loop differs from chat, all clearly enough that you can reason independently about every configuration decision instead of relying on “someone online wrote it this way”. The goal isn’t for you to hand-code a transformer; it’s for you to know which lever you’re moving when you adjust temperature, write a rule file, or set permissions.
Learning objectives
After this unit, you should be able to:- Explain tokens, tokenization, and the context window, and roughly estimate the token cost and window usage of a piece of text.
- Describe how probabilistic generation and temperature cause output to float, and where hallucination fundamentally comes from.
- Distinguish the three message roles (system prompt, user prompt, tool result) and their priority order, and understand why prompt injection is dangerous.
- Describe the agentic loop (observe, decide, act) and its essential difference from plain conversation, and identify who controls each step.
1. Tokens and tokenization
The model doesn’t read “words”; it reads tokens. A token is a subword unit, carved out by the tokenizer based on corpus statistics, sitting somewhere between a character and a full word. In English, one token commonly maps to one word or word-root (running might split into run + ning); in Chinese, one character typically maps to one or two tokens. Every piece of text you submit is first chopped into a token sequence before the model starts computing.
The engineering implication is direct: your cost and context usage are counted in tokens, not characters or words. Two rough rules of thumb to remember first (both are order-of-magnitude estimates; different tokenizers vary, so use the official tokenizer for precise figures):
- English: roughly 1 token per 4 characters, or roughly 1 token per 0.75 words.
- Chinese: typically 1 to 2 tokens per character, noticeably higher token density than English.
2. The context window
The context window is the maximum number of tokens the model can see in one pass; input and output share this budget. Conversation history, system prompt, files you paste in, and the reply the model is currently generating all compete for the same window. When the window fills, the oldest content gets pushed out and the model “forgets” what was said earlier. As of 2026-05, most flagship models have a standard window around 200K tokens, with some offering 1M token long-context variants (often beta or specific models, and priced higher past a threshold; see the tool comparison in this unit and 02-2). There is one key counterintuitive point here: Practical principle: reserve a buffer on important tasks (a common heuristic is roughly 20%, but there is no universal standard; calibrate to your task and model). Don’t pile rule files, conversation history, and pasted documents up to the limit. When you need long context, prioritize “only what’s relevant” over “everything”. This is why 01-4 Context engineering is a discipline in its own right, not just “a bigger window solves it”.3. Probabilistic generation and output variation
LLMs are autoregressive models: they generate one token at a time, feed what they’ve generated back into the input, predict the next token, and repeat until they produce an end token or hit the limit. At each step, the model doesn’t compute “the one correct answer”; it computes a probability distribution over the entire vocabulary, then samples one token from that distribution. “Sampling” is where variation comes from. The main parameters controlling sampling behavior:- temperature: scales the sharpness of the probability distribution. Low temperature (toward 0) makes the distribution more peaked, almost always picking the highest-probability token, producing stable, convergent output. High temperature flattens the distribution, giving low-probability tokens a chance, producing more divergent, creative, but also less controllable output.
- top-p (nucleus sampling): samples only from the set of candidate tokens whose cumulative probability reaches p, dynamically trimming the long tail.
Why the same question gets different answers twiceBecause every step is “sampling”, not “table lookup”. Unless temperature is set to 0 with all other conditions fixed, two generations take different random paths. This isn’t a bug; it’s how generative models are designed. For reproducibility, lower temperature and fix the other variables. For brainstorming multiple options, raise it.
This is the technical basis for ‘verify what’s verifiable’Hallucination is not a flaw you can fix with a better prompt; it’s an inherent property of probabilistic generation. That’s why this Playbook repeats: verify every programmatically verifiable claim (numbers, whether a citation exists, whether code runs). The model’s fluency and its correctness are two different things. Don’t conflate the former with the latter.
4. Message roles and priority order
The model doesn’t receive a blob of text; it receives a sequence of messages with role labels. Four common roles:| Role | Written by | Purpose |
|---|---|---|
| system | platform / developer | Sets the model’s identity, rules, and constraints; highest priority |
| developer | developer (some platforms split this out) | Application-layer instructions, between system and user |
| user | end user | The current request and input |
| tool | tool execution result | Data returned by tool calls, used as the basis for the next round of reasoning |
CLAUDE.md or AGENTS.md is not ordinary conversation; it’s injected as a high-priority instruction layer, which is why it can stably constrain model behavior rather than being washed out by subsequent turns.
5. Tool calls (tool use / function calling)
Tool calls are the key mechanism that upgrades a “chat assistant” into an “agent”. You give the model a set of tool descriptions (name, purpose, parameter schema); during generation the model can decide “this step I want to call a tool” and produce a structured call request (tool name plus parameters). The flow:- The model judges that it needs an external capability (read a file, search, run code) and outputs a tool call request instead of plain text.
- Your execution environment (the harness) intercepts this request, actually runs the tool, and gets the result.
- The result is injected back into the context under the
toolrole, becoming the basis for the model’s next round of reasoning. - The model reads the result and decides the next step: call another tool, or produce the final reply.
The model doesn’t ‘execute’ tools; it ‘requests’ themThe key division of responsibility: the model only decides “what to call and what parameters to pass”; the actual execution happens in the external harness. This boundary is where all permission control operates. You can intercept, audit, or reject a specific tool call in the harness; the model itself can’t touch your filesystem or network unless the harness permits it. Understanding this layer is what lets you make sense of the permission settings and Hooks in 02-2.
6. The agentic loop
The essential difference between an agent and a chat is who controls each step and who decides when to stop. Plain conversation is one question, one answer: you ask, the model answers, the turn ends, and control returns to you. An agentic loop has the model running multiple rounds autonomously toward a goal:7. Inference parameters and extended thinking
Extended thinking, streaming, and caching directly determine your latency and bill, and affect cost more than temperature does. Breaking each one down. Extended thinking (reasoning effort): lets the model generate an internal reasoning trace before producing an answer. It makes “thinking” explicit and provides real benefit for multi-step reasoning, debugging, and architectural decisions, at the cost of more tokens, more latency, and more expense. The criterion: worth enabling when the task requires multi-step derivation or is the kind where one wrong step breaks everything; unnecessary for fact lookups and simple rewrites, where enabling it just burns tokens. Streaming: returns tokens as they are generated rather than waiting for the full completion. It doesn’t save cost, but it improves perceived latency because the user sees the first token sooner. Prompt caching: caches repeated prefixes (long system prompts, fixed rule files, large reference documents), dramatically reducing cost and latency for subsequent requests that hit the cache. If your application sends a long, fixed context every request, enabling caching can be a substantial saving. This is especially noticeable for agentic workflows with long rule files; the fixedCLAUDE.md and tool definitions are ideal cache targets.
These are cost levers, not decorationExtended thinking, streaming, and caching aren’t “advanced toys”; they directly determine your latency and bill per interaction. Knowing they exist and knowing when to enable which one is part of using the tool correctly. For the specific toggles and pricing per tool, see 02-2.
8. Multimodality (brief overview)
Most modern flagship models are multimodal: beyond text, they can accept image, PDF, and audio input, with some supporting image output. For everyday workflows, this means you can paste a screenshot for the model to read an error screen, drop in an architecture diagram for it to analyze, or upload a PDF for summarization. For researchers doing visual image analysis, multimodal models are an additional path worth evaluating. They don’t replace purpose-trained specialized vision models (precision, controllability, and reproducibility remain strengths of specialized models), but they can cut significant upfront cost for rapid prototyping, image description, and cross-modal Q&A. When to use a general multimodal model versus when to train a specialized one is a typical tradeoff question for the Part III judgment chapter.Tool comparison (flagship models and context windows, as of 2026-05)
Model versions and window sizes are highly time-sensitive figures. The table below is a snapshot as of 2026-05; verify against official pages before citing.| Dimension | Anthropic Claude (primary) | OpenAI | |
|---|---|---|---|
| Flagship models | Claude Opus 4.8 (strongest) / Sonnet 4.6 / Haiku 4.5 | GPT-5.5 (latest) / Codex line (GPT-5.2-Codex, etc.) | Gemini 3.1 Pro |
| Standard context window | 200K tokens | Varies by model; flagship reaches million-token scale | 1M tokens |
| Long context | 1M token beta (Opus 4.6+, Sonnet 4.5/4.6) | GPT-5.5 over 1M tokens (~1.05M); surcharge past ~270K input | 1M tokens (3.1 Pro) |
| Notes | Three-tier division: Opus for deep reasoning / Sonnet as main workhorse / Haiku for high-frequency low-cost | Codex line is coding-specialized | Gemini 3 Pro preview was deprecated 2026-03-26; replaced by 3.1 Pro |
Hands-on exercises
Take the tool you use most and do three things to ground this unit in your own environment:- Measure a piece of text in tokens: find a recent Chinese prompt or spec you’ve written, count it with the official tokenizer (or the tool’s built-in token counter), compare to what you assumed from word count, and calibrate your intuition.
- Run a variation experiment: ask the same open-ended question twice each at high and low temperature, and observe how much the output diverges. Keep that difference in mind; next time you pick a temperature, you’ll have a basis for the decision.
- Spot one agentic loop: next time you use a coding agent, watch each step and identify which stage of observe, decide, act it’s in; count how many decisions it makes in a row before coming back to you. That number is your “gating handed over”.
Common pitfalls
Self-check
The bar for passing this unit
- Can you explain to a non-technical colleague “why the same question gets different answers twice” and make clear this isn’t a bug?
- Can you describe the priority order of the system / user / tool roles, and explain why prompt injection is a matter of mistaking data for instructions?
- Can you describe the four stages of an agentic loop and explain why the configuration focus shifts from prompt to gating?
Sources and further reading
Model versions and context windows are fast-changing facts. The following are official sources verified as of 2026-05:- Anthropic: Claude models overview: model tiers, token limits, and capabilities;
max_input_tokens/max_tokensare programmatically queryable via the Models API. - Anthropic: Claude Code model configuration: model selection and settings within Claude Code (including 1M context and effort level mapping).
- OpenAI: GPT-5.5 model: GPT-5.5 specs, over-1M-token window, and surcharge past the threshold.
- Google: Gemini API models and long context docs: Gemini 3.1 Pro and long-context support.
- ReAct original paper: Yao, et al. (2022), ReAct: Synergizing Reasoning and Acting in Language Models (ICLR 2023).
- Remaining concepts (tokens, autoregressive generation, temperature, tool use) are established knowledge; the author’s synthesis, not dependent on a single external source.
- Connections: 01-4 Context engineering (how to put the right things in a finite window) and 02-2 Anthropic Claude setup (the actual toggles for these parameters).