What this unit solves
Prompt engineering teaches you “how to phrase this one question.” Context engineering teaches you “what should and should not be in the model’s working memory before it answers.” It covers system memory, external data injection, real-time RAG retrieval, tool-call state, and more, all designed through the lens of the context window. In agentic tools, the model runs many turns on its own, accumulating file reads and search results as it goes. What determines output quality at that point is not the wording of any single message, but your management of everything entering the window.

Learning objectives

  • Name the four sources that enter the context window and assess each source’s relevance and cost.
  • Explain context rot and the “lost in the middle” phenomenon, and state their practical implications for how much to include.
  • Describe how compaction, just-in-time loading, and subagent isolation manage information within a limited context budget.
  • Distinguish “short-term context” from “cross-session long-term memory” and map them to the mechanisms and risks of specific tools.

1. From prompt engineering to context engineering

Prompt engineering’s implicit assumption is “one question, one answer”: you craft the question precisely, the model returns an answer, and the interaction ends. That assumption held in the chat-box era; it breaks down in agentic tools. After a Coding Agent receives a task, it reads files, searches, executes, and then decides its next step based on the results, potentially running dozens of turns per task. Every turn’s output is pushed back into the context window, and the next turn’s reasoning is built on top of that accumulation. The lever you control shifts from “how do I word this sentence” to “what is in the window right now, and what is its signal-to-noise ratio.”
Anthropic defines context engineering as: within the model’s limited “attention budget,” finding the minimal set of high-signal tokens that maximizes goal-achievement rate [1]. The key words are minimal and high-signal. More is not better; “just enough, and all of it relevant” is. This runs counter to engineering instinct. The first impulse for most people is “throw in everything that might be relevant and let the model sort it out,” which is precisely where quality collapse begins. See Section 3 for why.

2. The four sources of context

Everything entering the window, regardless of tool, falls into four categories. Each has a cost (how many tokens it occupies, how much attention it dilutes) and a relevance (how much it helps the current task). You should be able to make that trade-off for every category.
  • Instructions: system prompt, rule files (CLAUDE.md / AGENTS.md / .claude/rules/). These have the highest signal density and most deserve their place, because they define the model’s role and constraints and are active for the entire session.
  • Knowledge: attached files, RAG-retrieved chunks, @-referenced content. The most variable in relevance: the right reference is critical evidence; the wrong one is pure noise.
  • History: the prior conversation in the current session, plus cross-session long-term memory. Monotonically expands with each interaction and needs the most active compaction.
  • Tool results: file reads, grep output, command execution returns. Often enormous in volume (a single ls -R or a full test log can run to thousands of tokens) yet only one or two lines are genuinely useful.
Treat the four sources as a budgeted procurement, not an all-you-can-eat buffet
The attention budget is zero-sum. Every “might be useful” file you add takes attention away from the instructions and evidence you know are useful. Every item you put in should be able to answer the question: “why does this need to be here?”
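The procurement idea can be sketched as a small admission gate. This is only an illustration under stated assumptions: `ContextItem`, `admit`, the token figures, and the budget are all hypothetical, and a real tool would measure tokens with its own tokenizer.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    source: str   # one of: instructions, knowledge, history, tool_results
    name: str
    tokens: int   # estimated cost in tokens (made-up figures below)
    reason: str   # answer to "why does this need to be here?" ("" if none)


def admit(items: list[ContextItem], budget: int) -> list[ContextItem]:
    """Admit items that state a reason, in order, while they still fit the budget."""
    admitted, used = [], 0
    for item in items:
        if not item.reason:
            continue  # no justification, no entry
        if used + item.tokens > budget:
            continue  # would exceed the remaining budget
        admitted.append(item)
        used += item.tokens
    return admitted


candidates = [
    ContextItem("instructions", "CLAUDE.md", 300, "defines naming and error-handling rules"),
    ContextItem("knowledge", "base_loader.py", 1200, "template the new module must match"),
    ContextItem("tool_results", "ls -R output", 4000, ""),
    ContextItem("history", "earlier debugging turns", 6000, "records a design decision"),
]
selected = admit(candidates, budget=5000)
print([item.name for item in selected])
print(sum(item.tokens for item in selected))
```

The unjustified directory listing is rejected outright, and the oversized history is rejected because it does not fit; in practice that history would be a candidate for compaction rather than inclusion as-is.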

3. Context rot and lost in the middle

“The window has 200K tokens, so I can fill 200K” is wrong. Window size is physical capacity, not effective capacity; the portion the model actually uses well is far smaller than what it can hold. Two empirically supported phenomena back this up.
Lost in the middle: Liu et al. (2023) found that model performance on long contexts is strongly correlated with the position of key information. Accuracy is highest when key information appears at the beginning or end and drops significantly when it falls in the middle; beyond a certain length, information in the middle is treated almost as if it had never been provided [2].
Context rot: A Chroma (2025) technical report tested 18 models (including GPT-4.1, Claude 4, Gemini 2.5, and Qwen3) and found that models do not use context uniformly; the longer the input, the less reliable the performance. Degradation is especially pronounced when the task requires semantic matching rather than literal matching [3].
The shared implication is direct: more context does not mean better, and often means worse. A precise 5K-token context beats a 100K-token context diluted with “might be relevant” noise. Two operational conclusions follow:
  1. Control volume: fewer, more precise. Break the habit of “throw everything in and let the model sort it out.”
  2. Control position: the most critical instructions and evidence belong at the beginning or end of the context, not buried in the middle under a pile of file contents. Part of why rule files work is that they are typically placed first.
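The position rule can be sketched as a simple ordering function. The function name `assemble_context` and the sample strings are hypothetical; the only point is the ordering: rules first, low-priority bulk in the middle, key evidence and the task last.

```python
def assemble_context(rules: str, bulk: list[str], evidence: list[str], task: str) -> str:
    """Order the window so critical items sit at the edges, not in the middle.

    rules    -> first (beginning of the window)
    bulk     -> middle (lowest-priority material)
    evidence -> near the end, directly before the task
    task     -> last
    """
    parts = [rules, *bulk, *evidence, task]
    return "\n\n".join(part for part in parts if part)


window = assemble_context(
    rules="RULE: new modules follow the conventions under src/loaders/.",
    bulk=["BULK: directory listing", "BULK: old test log"],
    evidence=["EVIDENCE: base_loader.py defines the load() contract."],
    task="TASK: implement the new data-loading module.",
)
sections = window.split("\n\n")
print(sections[0])
print(sections[-1])
```

Volume control still comes first: the fewer items passed as `bulk`, the less there is to bury anything under.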

4. Compaction and just-in-time loading: continuing work within budget

A long task will inevitably hit the window limit. There are two main approaches.
Compaction: summarize a conversation approaching the limit into a structured digest, then open a new window with that digest and continue. This is the primary mechanism for maintaining long-horizon coherence [1]. In Claude Code, part of this happens automatically (when the limit approaches, old tool outputs are cleared first, then the conversation is summarized), and you can also trigger it manually with /compact, optionally with a focus hint such as /compact focus on the API changes to specify what to preserve [4].
One mechanism detail you must know: early conversation turns may be discarded during automatic compaction, but the project root’s CLAUDE.md will not be. After compaction, Claude re-reads it from disk and re-injects it. Subdirectory CLAUDE.md files are not automatically re-injected; they reload only when Claude next reads a file in that subdirectory. The compaction summary also only carries over skills you actually used; descriptions of unused skills are not brought into the new window (as of 2026-06) [4]. This is why “write persistent rules into CLAUDE.md rather than stating them verbally in the conversation” is a hard recommendation. Verbally stated constraints quietly evaporate after a compaction [4].
Use the official interactive visualization to see what is in your window
Anthropic provides an interactive context-window simulation that lists every item auto-loaded at session start (system prompt, auto-memory MEMORY.md, environment info, MCP tools, skill descriptions, ~/.claude/CLAUDE.md, project CLAUDE.md) along with token estimates for each, and demonstrates which items survive a /compact and which do not: Explore the context window (official) (as of 2026-06). Use it as your “window budget” reference chart.
Just-in-time loading: rather than loading all potentially useful data at the start, have the agent hold lightweight pointers (file paths, queries, links) and pull the corresponding data into the window with a tool call only when needed [1]. For a Coding Agent, this means providing file paths and search capability rather than pasting the entire repo; for a research workflow, it means providing the dataset location and a read tool rather than dumping the full dataset into the conversation.
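The pointer idea above can be sketched as lazy loading. Everything here is a stand-in: `Pointer`, `read_file`, and the in-memory `FAKE_REPO` are hypothetical names used so the example runs without touching disk; a real agent would use its file-read or search tool instead.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Pointer:
    label: str                  # short description kept in the window
    load: Callable[[], str]     # fetches the full content only on demand
    _cache: Optional[str] = None

    def resolve(self) -> str:
        """Load the content the first time it is needed, then reuse it."""
        if self._cache is None:
            self._cache = self.load()
        return self._cache


# Hypothetical in-memory "repo" standing in for files on disk.
FAKE_REPO = {
    "src/loaders/base_loader.py": "class BaseLoader: ...",
    "README.md": "# Project",
}
load_log = []


def read_file(path: str) -> str:
    load_log.append(path)  # record which files were actually pulled in
    return FAKE_REPO[path]


pointers = {path: Pointer(path, lambda p=path: read_file(p)) for path in FAKE_REPO}

# At the start, the window holds only lightweight pointers.
window = [f"available: {path}" for path in pointers]

# Later, one file turns out to be needed and is pulled in by a "tool call".
window.append(pointers["src/loaders/base_loader.py"].resolve())
print(load_log)
```

Only the file that was actually needed is ever read; the README stays outside the window as a one-line pointer.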
Reserve a buffer for important tasks
A rule of thumb (not an official figure; see 01-2): for important multi-file tasks, aim to converge or actively compact before the window reaches roughly 80% full, leaving about 20% as a buffer. The closer the window is to full, the stronger the lost-in-the-middle and context-rot effects, and the more likely the model is to drop the ball at exactly the finishing stages where you can least afford errors.
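A minimal sketch of the 80% rule of thumb combined with compaction follows. It illustrates the general idea only and is not how Claude Code implements compaction: `estimate_tokens` is a crude character-count heuristic rather than a real tokenizer, and `summarize` is a hypothetical stand-in for what would normally be a model call.

```python
from typing import Callable


def estimate_tokens(text: str) -> int:
    """Crude heuristic (about 4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)


def needs_compaction(turns: list[str], window_size: int, threshold: float = 0.8) -> bool:
    """True once estimated usage reaches the threshold share of the window."""
    used = sum(estimate_tokens(turn) for turn in turns)
    return used >= threshold * window_size


def compact(
    turns: list[str],
    persistent_rules: str,
    summarize: Callable[[list[str]], str],
    keep_recent: int = 2,
) -> list[str]:
    """Replace older turns with a digest and put persistent rules back on top."""
    split = max(0, len(turns) - keep_recent)
    old, recent = turns[:split], turns[split:]
    return [persistent_rules, summarize(old), *recent]


rules = "RULE: new modules follow the conventions under src/loaders/."
turns = ["x" * 400 for _ in range(10)]  # ten turns of roughly 100 tokens each

if needs_compaction(turns, window_size=1200):
    turns = compact(turns, rules, lambda old: f"DIGEST of {len(old)} earlier turns")
print(len(turns))
```

Passing the rules in separately mirrors the point made above: whatever lives in a file can be re-injected after compaction, while constraints that exist only in old turns survive only if the digest happens to keep them.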

5. Memory: short-term vs. long-term

The word “memory” is used loosely across tools, but there are really only two kinds. Keeping them distinct prevents misuse.
Short-term context: the conversation inside a single session window. It is gone when the session closes and is essentially working memory. Everything in the preceding four sections operates at this layer.
Long-term memory: information persisted across sessions so it is still there when a new conversation opens. The three major tools differ in mechanism and naming (as of 2026-05):
  • Anthropic Claude: Claude.ai has cross-conversation memory; Claude Code uses files on disk as the persistence layer: project-level CLAUDE.md, user-level ~/.claude/, plus auto-memory under ~/.claude/projects/<repo>/memory/ (on by default since v2.1.59; disable with autoMemoryEnabled or an env var) [4]. Long-term memory here consists of files that are human-readable, auditable, and version-controllable, which is an advantage for reproducibility.
  • OpenAI ChatGPT: split into “Saved Memories” (explicit facts and preferences you ask it to remember) and “Reference chat history” (automatic reference to past conversations). Settings under Settings > Personalization; turning off “Reference saved memories” also turns off “Reference chat history” [5].
  • Google Gemini: under “Personal context” in the “Memory” setting, on by default, learns from past conversations; also supports importing memories from other apps into Gemini [6].
Turning on long-term memory does not mean you can stop managing context
Two costs are commonly overlooked. First, long-term memory is a security surface: cross-session persistent memory is a vector for memory poisoning and prompt injection. Malicious content can be implanted across multiple sessions and assembled later to trigger an action. See 03-3 for details and mitigations. Second, auto-memory silently injects stale preferences you did not intend to bring into the current task, which is a form of context pollution you did not authorize. When handling a one-off sensitive task, using a temporary conversation or turning memory off is the default action.

6. Retrieval and references: connecting external knowledge to the window

When the knowledge a task needs is not in the model’s training data (your papers, your codebase, the latest documentation), you have to connect it to the context. There are two approaches.
Active reference: you explicitly specify what to load. @-reference a file, #-index a repo, or upload files into the Projects (Claude / ChatGPT) / Gems (Gemini) knowledge base. Best when you already know which documents are relevant: precise and controllable.
RAG (retrieval-augmented generation): when the data volume is too large to fit and you are not sure which segment is relevant, chunk the data, vectorize it into a database, and at query time retrieve only the most relevant chunks to place in the window. It is essentially an automated version of just-in-time loading.
You do not need RAG for every task. First ask: is the relevant data too large to @-reference directly? If not, direct reference is simpler and more accurate. For researchers, the practical mapping is: @-reference the two or three papers you need to read closely (precise); connect the full literature corpus as a knowledge base or RAG pipeline (scale). Do not mix these up. Stuffing everything into the window triggers the degradation described in Section 3; putting closely read papers through RAG risks not retrieving the critical passage.
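The chunk, vectorize, retrieve loop can be sketched in a few lines. This is a toy under loud assumptions: word counts stand in for learned embeddings, a Python list stands in for a vector database, and the function names and sample corpus are made up for illustration.

```python
import math
from collections import Counter


def chunk(text: str, size: int = 7) -> list[str]:
    """Split text into fixed-size word windows (ignores sentence boundaries)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def vectorize(text: str) -> Counter:
    """Toy bag-of-words vector; a real pipeline would use an embedding model."""
    return Counter(word.lower().strip(".,") for word in text.split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = vectorize(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, vectorize(c)), reverse=True)
    return ranked[:k]


corpus = (
    "The loader reads parquet files in batches. "
    "Learning rate warmup stabilizes early training. "
    "The tokenizer lowercases all input text."
)
chunks = chunk(corpus)
top = retrieve("how does learning rate warmup work", chunks, k=1)
print(top[0])
```

Only the top-ranked chunk would be placed in the window. Note that the fixed-size chunker cuts across a sentence boundary here, which is one reason closely read documents are better referenced directly.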

Vector retrieval is not the only option: PageIndex as an example

RAG has become nearly synonymous with “connecting external knowledge,” but vector retrieval has structural weaknesses, especially in high-precision domains like finance and law:
  • Semantic similarity does not equal content relevance: querying “operating margin” may surface many scattered paragraphs that mention the term while missing the table that actually contains the answer.
  • Chunking destroys the original document’s structure and context.
  • Retrieval results are not traceable: you cannot ask “why was this chunk selected?”
PageIndex (open-source under MIT license [8]) takes a different path. It is vectorless: reasoning replaces similarity search. Rather than embedding and chunking, it first parses the document into a semantic tree (each node carrying a heading, page-number range, and LLM-generated summary), then at query time lets the LLM read the table of contents, locate sections, and reason downward layer by layer, much as a human expert would. The design is inspired by AlphaGo’s tree search [8]. According to its official repo, a system built on this approach achieves 98.7% accuracy on FinanceBench (self-reported SOTA, substantially outperforming traditional vector RAG; as of 2026-05) [8]. The trade-off between the two approaches:
| Dimension | PageIndex (tree-based reasoning) | Traditional vector RAG |
|---|---|---|
| Retrieval mechanism | LLM reads tree and reasons layer by layer | Embedding + cosine similarity |
| Preprocessing | Parse structure, generate node summaries | Chunk, vectorize |
| Explainability | High (retrieval path follows document structure) | Low (only a numerical distance) |
| Inference cost and latency | Higher (multiple LLM reasoning passes over nodes) | Lower (vector distance computation is very fast) |
| Best fit | High-precision, traceable Q&A on a single deep document (financial reports, contracts) | Fuzzy retrieval across large, multi-document corpora |
The selection criterion follows this section’s main thread: ask about scale and precision requirements first, then choose the approach. For precise, traceable Q&A on a single well-structured long document, tree-based reasoning retrieval is worth trying; for fuzzy search across a massive multi-document corpus, vector retrieval still wins on speed and cost, and the two can be combined. PageIndex uses MCP and can be connected directly to MCP-compatible clients such as Claude and Cursor (as of 2026-05).
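The tree-descent idea can be sketched generically. This is not PageIndex’s API or implementation: `Node`, `navigate`, and the sample report are hypothetical, and the keyword-overlap scorer `overlap` is a deliberately crude stand-in for the LLM’s reasoning at each level.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    title: str
    pages: tuple[int, int]   # page-number range covered by this section
    summary: str             # would be LLM-generated in a real system
    children: list["Node"] = field(default_factory=list)


def overlap(query: str, node: Node) -> int:
    """Stand-in for LLM judgment: count query words found in title or summary."""
    node_words = set((node.title + " " + node.summary).lower().split())
    return len(set(query.lower().split()) & node_words)


def navigate(root: Node, query: str, choose=overlap) -> list[Node]:
    """Descend from the root, picking the best child at each level; return the path."""
    path, node = [root], root
    while node.children:
        node = max(node.children, key=lambda child: choose(query, child))
        path.append(node)
    return path


report = Node("Annual report", (1, 120), "full filing", [
    Node("Business overview", (1, 40), "products markets strategy"),
    Node("Financial statements", (41, 120), "income statement balance sheet operating margin", [
        Node("Income statement", (41, 60), "revenue operating income operating margin table"),
        Node("Balance sheet", (61, 80), "assets liabilities equity"),
    ]),
])

path = navigate(report, "operating margin table")
print(" > ".join(node.title for node in path))
print(path[-1].pages)
```

The returned path is itself the explanation of why a section was selected, which is the traceability property the comparison table attributes to tree-based retrieval.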

7. Subagents and context isolation

A counterintuitive but highly effective technique: assign subtasks to a clean-context subagent and have the main agent receive only its conclusions.
Why does this work? Suppose the main task requires “first understand how this module works.” If the main agent explores it directly, the twenty files it reads, the ten grep runs it issues, the three dead ends it walks, all pile into the main window, pushing the actual work to the middle (triggering lost in the middle) and burning a large number of tokens. Instead, dispatch a subagent to do the exploration. The subagent works through all the files in its own independent window, then returns just a “summary of how this module works” to the main agent. The main agent’s window contains only the task itself and that clean summary from beginning to end.
This is the dual benefit of isolation: token savings (the exploration process does not occupy the main window) and quality improvement (the main agent is not disturbed by exploration noise). In tooling terms, Claude’s Subagent mechanism is designed exactly for this: each subagent has its own context, its own toolset, and returns conclusions rather than process [7]. For detailed usage, see 02-5.
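The isolation pattern can be sketched without any agent framework. This is not Claude’s Subagent API: `run_subagent`, `read`, and `summarize` are hypothetical stand-ins, with `summarize` taking the place of a model call that condenses the exploration.

```python
from typing import Callable


def run_subagent(
    task: str,
    files: list[str],
    read: Callable[[str], str],
    summarize: Callable[[list[str]], str],
) -> str:
    """Explore inside a private context and return only the conclusion."""
    private_context = [f"task: {task}"]
    for path in files:
        private_context.append(read(path))  # exploration noise stays here
    return summarize(private_context)       # private_context is then discarded


def fake_read(path: str) -> str:
    return f"<contents of {path}> " * 50    # stands in for a large file


def fake_summarize(context: list[str]) -> str:
    return f"Summary: explored {len(context) - 1} files; loaders share a BaseLoader template."


main_context = ["task: implement the new data-loading module"]
files = [f"src/loaders/file_{i}.py" for i in range(20)]

summary = run_subagent("understand how the loader module works", files, fake_read, fake_summarize)
main_context.append(summary)

print(len(main_context))
print(main_context[-1])
```

Twenty file reads happened, but the main window grew by a single summary line, which is both the token saving and the noise reduction described above.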

8. Practical principle: minimum necessary context

Distill the preceding seven sections into an actionable checklist. Run through this before putting anything into the window:
  • Minimum necessary: default to not including it unless it can answer “why does this need to be here?” If a pointer and just-in-time loading will do, do not preload everything.
  • High-signal first: keep instructions and direct evidence; cut “might be relevant” file bundles. Signal-to-noise ratio matters more than absolute count.
  • Manage position: the most critical items go at the beginning or end, not buried in the middle.
  • Proactive compaction: on long tasks, do not wait for the window to overflow and trigger passive compaction. At natural convergence points, actively run /compact and specify a focus.
  • Persist to files: constraints that need to survive compaction and cross-session must go into CLAUDE.md, not be stated verbally in the conversation.
  • Isolate noise: exploratory, divergent subtasks go to a subagent; the main window receives only the conclusions.
The same research task, two approaches to context management
Task: “Based on the existing architecture of this repo, implement a new data-loading module whose style is consistent with the existing code, referencing best practices from these ten papers.”
Before (stuff everything, go on instinct): drag the entire repo (80 files) into the conversation, paste all ten PDFs, then state the task. The window instantly fills with 150K tokens; the actual task instruction at the top is immediately pushed toward the “middle” by piles of file content. The model starts implementing, aligns the style with some unrelated old module (because that file happened to land where the model could still “see” it), and never cites the key practice from the papers (buried in the middle, context rot).
After (minimum necessary + isolation):
  1. Write four lines in CLAUDE.md: the project’s naming conventions, error-handling contract, and “new modules align with the existing implementations under src/loaders/.”
  2. Use @src/loaders/base_loader.py to reference precisely the one template file to align with.
  3. Dispatch a subagent to read the ten papers and return “three data-loading best practices with citations.”
  4. The main agent’s window now contains: four rule lines (at the top), one template file, one paper summary, and the task instruction. About 8K tokens, all high-signal.
Window contents (After)
┌─────────────────────────────────────────────┐
│ CLAUDE.md four rule lines      ← top        │
│ base_loader.py template                     │
│ Subagent summary: three best practices      │
│ Task instruction               ← bottom     │
│ Total ≈ 8K tokens, all high-signal          │
└─────────────────────────────────────────────┘
The difference in outcome is not “a smarter prompt.” The “After” approach shifts the model’s attention from “retrieve signal from 150K of noise” to “execute against 8K of all-signal.” The extra cost is thirty seconds of upfront organization. The saving is avoiding an entire pass of misaligned output that would need to be redone from scratch.

Tool comparison

Each vendor packages the same context engineering concepts under different names (as of 2026-05; for precise configuration locations and verification, see 02-2 and 02-6).
| Mechanism | Anthropic Claude (primary) | OpenAI | Google | GitHub Copilot | Cursor |
|---|---|---|---|---|---|
| Long-term memory | Claude.ai memory / Claude Code CLAUDE.md, ~/.claude/ | Saved Memories + Reference chat history | Personal context → Memory | instructions file (not auto-memory) | User Rules / Memories |
| Knowledge attachment | Projects knowledge base / @ file reference | Projects / Custom GPT Knowledge | Gems Knowledge / Drive connection | # files, repo index | @ files / @Docs / codebase index |
| Active compaction | Auto compaction + /compact (with optional focus) | Auto conversation summary | Auto summary | Automatic | Automatic |
| Subagent isolation | Subagent (independent context, returns conclusions) | Multi-agent / Agents | Antigravity agents | (no user-layer equivalent) | Background agents |
The comparison table gives coordinates, not details
This table’s purpose is to tell you what your tool calls the same concept. The precise configuration path, toggle location, and version differences for each cell are fast-changing facts; they are verified with dates in 02-2 and 02-6, and are not duplicated here to avoid version inconsistencies across multiple locations. Some mechanism names in the Copilot and Cursor columns are expressed as stable descriptions; for exact names, consult each tool’s official documentation.

Common pitfalls

Anti-pattern list
  • Treating window capacity as effective capacity: “I have 200K so I’ll fill 200K.” Effective capacity is far smaller than physical capacity; the fuller the window, the worse the degradation (Section 3).
  • Stuffing everything and letting the model sort it out: you think you’re giving the model more information; you’re actually lowering signal-to-noise ratio and pushing critical instructions into the middle. Fewer, more precise beats more, noisier.
  • Thinking long-term memory means you don’t need to manage context: auto-memory silently injects stale preferences you did not intend to include, a form of pollution you did not control; and it is a security vector for memory poisoning and prompt injection (03-3).
  • Relying on verbal statements for persistent constraints: automatic compaction discards early conversation. Constraints that need to survive must go into CLAUDE.md, which is re-read and re-injected after compaction.
  • Running everything through RAG: when the relevant data is small enough to @-reference directly, RAG only adds the risk of imprecise retrieval. Ask about scale first, then choose the approach.
  • Having the main agent do divergent exploration itself: exploration noise fills up the main window and interferes with reasoning. Send divergent subtasks to a subagent; the main window receives only the conclusions.

Self-check

The bar for passing this unit
  1. Given a complex multi-file task, can you state which items should enter the context and which should stay outside to be loaded just-in-time via a pointer, along with your reasoning?
  2. Can you explain why “writing critical constraints into CLAUDE.md” is more reliable than “stating them once in the conversation”? The answer relates to the compaction mechanism.
  3. Can you identify where the long-term memory toggle is in your primary tool, and explain why you should turn it off when handling a sensitive one-off task?
  4. The next time your task window starts swelling, will your first move be to keep stuffing, or to actively compact or dispatch a subagent? Why?

Sources and further reading

Factual claims are grounded in official documentation and original research; fast-changing items are dated to 2026-05. Citations use IEEE numbering style.