What this unit solves
Prompt engineering teaches you “how to phrase this one question”; context engineering teaches you “what should and should not be in the model’s working memory before it answers,” covering system memory, external data injection, real-time RAG retrieval, tool-call state, and more, all designed through the lens of the context window. In agentic tools, the model runs many turns on its own, accumulating file reads and search results as it goes. What determines output quality at that point is not the wording of any single message, but your management of everything entering the window.
Learning objectives
- Name the four sources that enter the context window and assess each source’s relevance and cost.
- Explain context rot and the “lost in the middle” phenomenon, and state their practical implications for how much to include.
- Describe how compaction, just-in-time loading, and subagent isolation manage information within a limited context budget.
- Distinguish “short-term context” from “cross-session long-term memory” and map them to the mechanisms and risks of specific tools.
1. From prompt engineering to context engineering
Prompt engineering’s implicit assumption is “one question, one answer”: you craft the question precisely, the model returns an answer, and the interaction ends. That assumption held in the chat-box era; it breaks down in agentic tools. After a Coding Agent receives a task, it reads files, searches, executes, and then decides its next step based on the results, potentially running dozens of turns per task. Every turn’s output is pushed back into the context window, and the next turn’s reasoning is built on top of that accumulation. The lever you control shifts from “how do I word this sentence” to “what is in the window right now, and what is its signal-to-noise ratio.”

Anthropic defines context engineering as: within the model’s limited “attention budget,” finding the minimal set of high-signal tokens that maximizes goal-achievement rate [1]. The key words are minimal and high-signal. More is not better; “just enough, and all of it relevant” is. This runs counter to engineering instinct. The first impulse for most people is “throw in everything that might be relevant and let the model sort it out,” which is precisely where quality collapse begins. See Section 3 for why.

2. The four sources of context
Everything entering the window, regardless of tool, falls into four categories. Each has a cost (how many tokens it occupies, how much attention it dilutes) and a relevance (how much it helps the current task). You should be able to make that trade-off for every category.

- Instructions: system prompt, rule files (`CLAUDE.md` / `AGENTS.md` / `.claude/rules/`). These have the highest signal density and most deserve their place, because they define the model’s role and constraints and are active for the entire session.
- Knowledge: attached files, RAG-retrieved chunks, @-referenced content. The most variable in relevance: the right reference is critical evidence; the wrong one is pure noise.
- History: the prior conversation in the current session, plus cross-session long-term memory. Monotonically expands with each interaction and needs the most active compaction.
- Tool results: file reads, grep output, command execution returns. Often enormous in volume (a single `ls -R` or a full test log can run to thousands of tokens) yet only one or two lines are genuinely useful.
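
One way to make the cost-versus-relevance trade-off concrete is to keep a rough ledger per source. The sketch below is illustrative only: `ContextItem`, `estimate_tokens`, and `summarize` are hypothetical names, the four-characters-per-token estimate is a crude assumption rather than a real tokenizer, and the relevance scores are judgment calls you supply yourself.

```python
from collections import defaultdict
from dataclasses import dataclass

SOURCES = ("instructions", "knowledge", "history", "tool_results")


@dataclass
class ContextItem:
    source: str       # one of SOURCES
    text: str
    relevance: float  # hypothetical 0..1 score assigned by your own judgment


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token for English text.
    return max(1, len(text) // 4)


def summarize(items):
    """Return {source: (total_tokens, token-weighted mean relevance)}."""
    tokens = defaultdict(int)
    weighted = defaultdict(float)
    for item in items:
        if item.source not in SOURCES:
            raise ValueError(f"unknown source: {item.source}")
        n = estimate_tokens(item.text)
        tokens[item.source] += n
        weighted[item.source] += n * item.relevance
    return {s: (tokens[s], weighted[s] / tokens[s]) for s in tokens}


items = [
    ContextItem("instructions", "Use snake_case. Raise LoaderError on bad input.", 1.0),
    ContextItem("knowledge", "x" * 4000, 0.8),      # stand-in for one attached file
    ContextItem("tool_results", "y" * 40000, 0.05),  # stand-in for a full test log
]
for source, (n_tokens, relevance) in summarize(items).items():
    print(f"{source:13s} {n_tokens:6d} tokens  relevance={relevance:.2f}")
```

In this toy ledger the tool output dominates the token count while carrying the lowest relevance, which is the pattern the list above warns about.
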
3. Context rot and lost in the middle
“The window has 200K tokens, so I can fill 200K” is wrong. Window size is physical capacity, not effective capacity; the portion the model actually uses well is far smaller than what it can hold. Two empirically supported phenomena back this up.

Lost in the middle: Liu et al. (2023) found that model performance on long contexts is strongly correlated with the position of key information. Accuracy is highest when key information appears at the beginning or end, drops significantly when it falls in the middle, and beyond a certain length, information in the middle is almost as if it were never provided [2].

Context rot: A Chroma (2025) technical report tested 18 models (including GPT-4.1, Claude 4, Gemini 2.5, and Qwen3) and found that models do not use context uniformly; the longer the input, the less reliable the performance. Degradation is especially pronounced when the task requires semantic matching rather than literal matching [3].

The shared implication is direct: more context does not mean better, and often means worse. A precise 5K-token context beats a 100K-token context diluted with “might be relevant” noise. Two operational conclusions follow:

- Control volume: fewer, more precise. Break the habit of “throw everything in and let the model sort it out.”
- Control position: the most critical instructions and evidence belong at the beginning or end of the context, not buried in the middle under a pile of file contents. Part of why rule files work is that they are typically placed first.
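
The position rule can be sketched as an assembly order: critical text at the edges, bulk material in between. This is a minimal sketch, not any tool's actual behavior; `assemble_context` and the `# Rules` / `# Task` section labels are made up for illustration, and whether repeating the rules near the end helps a given model is an assumption you would need to test.

```python
def assemble_context(rules, task, bulk_items, repeat_rules_at_end=True):
    """Order a context so critical text sits at the edges and bulk in the middle."""
    parts = ["# Rules", rules]                  # beginning: highest-signal instructions
    for title, body in bulk_items:              # middle: reference material
        parts.append(f"# Reference: {title}")
        parts.append(body)
    if repeat_rules_at_end:                     # optional: restate rules near the end
        parts += ["# Reminder of rules", rules]
    parts += ["# Task", task]                   # end: the actual request
    return "\n\n".join(parts)


rules = "Use snake_case. Raise LoaderError on bad input."
task = "Add a CSV loader."
bulk = [
    ("base_loader.py", "class BaseLoader: ..."),
    ("notes", "long exploration notes"),
]
print(assemble_context(rules, task, bulk))
```
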
4. Compaction and just-in-time loading: continuing work within budget
A long task will inevitably hit the window limit. There are two main approaches.

Compaction: summarize a conversation approaching the limit into a structured digest, then open a new window with that digest and continue. This is the primary mechanism for maintaining long-horizon coherence [1]. In Claude Code, part of this happens automatically (when the limit approaches, old tool outputs are cleared first, then the conversation is summarized), and you can also trigger it manually with `/compact`, optionally with a focus hint such as `/compact focus on the API changes` to specify what to preserve [4].
One mechanism detail you must know: early conversation turns may be discarded during automatic compaction, but the project root’s CLAUDE.md will not be. After compaction, Claude re-reads it from disk and re-injects it. Subdirectory CLAUDE.md files are not automatically re-injected; they reload only when Claude next reads a file in that subdirectory. The compaction summary also only carries over skills you actually used; descriptions of unused skills are not brought into the new window (as of 2026-06) [4]. This is why “write persistent rules into CLAUDE.md rather than stating them verbally in the conversation” is a hard recommendation. Verbally stated constraints quietly evaporate after a compaction [4].
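
The two-stage order described above (clear old tool output first, then summarize) and the fate of verbally stated constraints can be illustrated with a toy model. This is a sketch under stated assumptions, not Claude Code's implementation: `Turn`, `compact`, and the stub `summarize` are hypothetical, the token estimate is a crude four-characters-per-token guess, and `persistent_rules` is a plain string standing in for rules re-read from a file on disk.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str  # "user", "assistant", or "tool"
    text: str


def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude assumption: ~4 characters per token


def total_tokens(turns):
    return sum(estimate_tokens(t.text) for t in turns)


def summarize(turns):
    # Placeholder: a real agent would ask a model for a structured digest.
    return f"Summary of {len(turns)} earlier turns."


def compact(turns, persistent_rules, budget, keep_recent=2):
    """Toy two-stage compaction: clear old tool output, then summarize."""
    if total_tokens(turns) <= budget:
        return turns
    cutoff = max(0, len(turns) - keep_recent)
    # Stage 1: replace old tool output with a short stub.
    cleared = [
        Turn(t.role, "[tool output cleared]") if t.role == "tool" and i < cutoff else t
        for i, t in enumerate(turns)
    ]
    if total_tokens(cleared) <= budget:
        return cleared
    # Stage 2: summarize everything before the recent turns and re-inject
    # the rules from their persistent source, not from the conversation.
    old, recent = cleared[:cutoff], cleared[cutoff:]
    return [Turn("user", persistent_rules), Turn("assistant", summarize(old))] + recent


turns = [
    Turn("user", "Never touch the migrations folder."),  # stated verbally only
    Turn("tool", "x" * 8000),
    Turn("assistant", "a" * 4000),
    Turn("user", "Now update the API client."),
    Turn("assistant", "Working on it."),
]
after = compact(turns, persistent_rules="Rules from CLAUDE.md", budget=500)
for t in after:
    print(t.role, "|", t.text[:60])
```

In this toy run the verbal constraint sits in an early turn, so it is replaced by the stub summary, while the rules that live in a persistent source are re-injected at the top of the new window.
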
Just-in-time loading: rather than loading all potentially useful data at the start, have the agent hold lightweight pointers (file paths, queries, links) and pull the corresponding data into the window with a tool call only when needed [1]. For a Coding Agent, this means providing file paths and search capability rather than pasting the entire repo; for a research workflow, it means providing the dataset location and a read tool rather than dumping the full dataset into the conversation.
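
A minimal sketch of the pointer idea, assuming a hypothetical `Pointer` class and `build_initial_context` helper (neither belongs to any real agent framework). The window starts with only a file listing; content is read, and capped, when a step actually needs it.

```python
import tempfile
from pathlib import Path


class Pointer:
    """A lightweight reference: costs a few tokens until it is resolved."""

    def __init__(self, path, note):
        self.path = Path(path)
        self.note = note

    def describe(self) -> str:
        return f"{self.path.name}: {self.note}"

    def load(self, max_chars=2000) -> str:
        # Read only on demand, and cap how much enters the window.
        return self.path.read_text(encoding="utf-8")[:max_chars]


def build_initial_context(task, pointers):
    listing = "\n".join(f"- {p.describe()}" for p in pointers)
    return f"Task: {task}\nAvailable files (not loaded yet):\n{listing}"


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "test.log"
    log.write_text("line\n" * 10000, encoding="utf-8")  # stand-in for a large log
    pointers = [Pointer(log, "full test log from the last run")]

    context = build_initial_context("Find the failing test", pointers)
    print(len(context), "chars in the window before any load")

    context += "\n" + pointers[0].load(max_chars=200)  # pulled in only when needed
    print(len(context), "chars after one just-in-time load")
```

The cap in `load` is a design choice for this sketch: even when a pointer is resolved, only a bounded slice enters the window.
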
5. Memory: short-term vs. long-term
The word “memory” is used loosely across tools, but there are really only two kinds. Keeping them distinct prevents misuse.

Short-term context: the conversation inside a single session window. Gone when the session closes. Essentially working memory. Everything in the preceding four sections operates at this layer.

Long-term memory: information persisted across sessions so it is still there when a new conversation opens. The three major tools differ in mechanism and naming (as of 2026-05):

- Anthropic Claude: Claude.ai has cross-conversation memory; Claude Code uses files on disk as the persistence layer: project-level `CLAUDE.md`, user-level `~/.claude/`, plus auto-memory under `~/.claude/projects/<repo>/memory/` (on by default since v2.1.59; disable with `autoMemoryEnabled` or an env var) [4]. Long-term memory here is files, human-readable, auditable, and version-controllable, which is an advantage for reproducibility.
- OpenAI ChatGPT: split into “Saved Memories” (explicit facts and preferences you ask it to remember) and “Reference chat history” (automatic reference to past conversations). Settings under Settings > Personalization; turning off “Reference saved memories” also turns off “Reference chat history” [5].
- Google Gemini: under “Personal context” in the “Memory” setting, on by default, learns from past conversations; also supports importing memories from other apps into Gemini [6].
6. Retrieval and references: connecting external knowledge to the window
When the knowledge a task needs is not in the model’s training data (your papers, your codebase, the latest documentation), you have to connect it to the context. There are two approaches.

Active reference: you explicitly specify what to load. @-reference a file, #-index a repo, or upload files into the Projects (Claude / ChatGPT) / Gems (Gemini) knowledge base. Best when you already know which documents are relevant: precise and controllable.
RAG (retrieval-augmented generation): when the data volume is too large to fit and you are not sure which segment is relevant, chunk the data, vectorize it into a database, and at query time retrieve only the most relevant chunks to place in the window. It is essentially an automated version of just-in-time loading. You do not need RAG for every task. First ask: is the relevant data too large to @-reference directly? If not, direct reference is simpler and more accurate.
For researchers, the practical mapping is: @-reference the two or three papers you need to read closely (precise); connect the full literature corpus as a knowledge base or RAG pipeline (scale). Do not mix these up. Stuffing everything into the window triggers the degradation described in Section 3; putting closely read papers through RAG risks not retrieving the critical passage.
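
The chunk, vectorize, retrieve loop can be sketched in a few lines. Everything here is a simplification for illustration: a bag-of-words count vector stands in for a real embedding model, an in-memory list stands in for a vector database, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text, size=40):
    """Split text into chunks of about `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def vectorize(text):
    # Stand-in for an embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = vectorize(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, vectorize(c)), reverse=True)
    return ranked[:k]


corpus = [
    "Operating margin rose to 12 percent in the third quarter.",
    "The company opened two new offices in Europe.",
    "Headcount grew while margin pressure eased.",
]
print(retrieve("What was the operating margin?", corpus, k=1))
```

Only the top-ranked chunks would be placed in the window; the rest of the corpus stays outside it.
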
Vector retrieval is not the only option: PageIndex as an example
RAG has become nearly synonymous with “connecting external knowledge,” but vector retrieval has structural weaknesses, especially in high-precision domains like finance and law. Semantic similarity does not equal content relevance (querying “operating margin” may surface many scattered paragraphs that mention the term while missing the table that actually contains the answer). Chunking destroys the original document’s structure and context. Retrieval results are not traceable (you cannot ask “why was this chunk selected?”).

PageIndex (open-source under MIT license [8]) takes a different path: vectorless, reasoning replaces similarity search. Rather than embedding and chunking, it first parses the document into a semantic tree (each node carrying a heading, page-number range, and LLM-generated summary), then at query time lets the LLM read the table of contents, locate sections, and reason downward layer by layer, like a human expert, inspired by AlphaGo’s tree search [8]. According to its official repo, a system built on this approach achieves 98.7% accuracy on FinanceBench (self-reported SOTA, substantially outperforming traditional vector RAG; as of 2026-05) [8].

The trade-off between the two approaches:

| Dimension | PageIndex (tree-based reasoning) | Traditional vector RAG |
|---|---|---|
| Retrieval mechanism | LLM reads tree and reasons layer by layer | Embedding + cosine similarity |
| Preprocessing | Parse structure, generate node summaries | Chunk, vectorize |
| Explainability | High (retrieval path follows document structure) | Low (only a numerical distance) |
| Inference cost and latency | Higher (multiple LLM reasoning passes over nodes) | Lower (vector distance computation is very fast) |
| Best fit | High-precision, traceable Q&A on a single deep document (financial reports, contracts) | Fuzzy retrieval across large, multi-document corpora |
7. Subagents and context isolation
A counterintuitive but highly effective technique: assign subtasks to a clean-context subagent and have the main agent receive only its conclusions. Why does this work? Suppose the main task requires “first understand how this module works.” If the main agent explores it directly, the twenty files it reads, the ten grep runs it issues, the three dead ends it walks, all pile into the main window, pushing the actual work to the middle (triggering lost in the middle) and burning a large number of tokens.

Instead, dispatch a subagent to do the exploration. The subagent works through all the files in its own independent window, then returns just a “summary of how this module works” to the main agent. The main agent’s window contains only the task itself and that clean summary from beginning to end.

This is the dual benefit of isolation: token savings (the exploration process does not occupy the main window) and quality improvement (the main agent is not disturbed by exploration noise). In tooling terms, Claude’s Subagent mechanism is designed exactly for this: each subagent has its own context, its own toolset, and returns conclusions rather than process [7]. For detailed usage, see 02-5.

8. Practical principle: minimum necessary context
Distill the preceding seven sections into an actionable checklist. Run through this before putting anything into the window:

- Minimum necessary: default to not including it unless it can answer “why does this need to be here?” If a pointer and just-in-time loading will do, do not preload everything.
- High-signal first: keep instructions and direct evidence; cut “might be relevant” file bundles. Signal-to-noise ratio matters more than absolute count.
- Manage position: the most critical items go at the beginning or end, not buried in the middle.
- Proactive compaction: on long tasks, do not wait for the window to overflow and trigger passive compaction. At natural convergence points, actively run `/compact` and specify a focus.
- Persist to files: constraints that need to survive compaction and cross-session must go into `CLAUDE.md`, not be stated verbally in the conversation.
- Isolate noise: exploratory, divergent subtasks go to a subagent; the main window receives only the conclusions.

- Write four lines in `CLAUDE.md`: the project’s naming conventions, error-handling contract, and “new modules align with the existing implementations under `src/loaders/`.”
- Use `@src/loaders/base_loader.py` to reference precisely the one template file to align with.
- Dispatch a subagent to read the ten papers and return “three data-loading best practices with citations.”
- The main agent’s window now contains: four rule lines (at the top), one template file, one paper summary, and the task instruction. About 8K tokens, all high-signal.
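
The isolation step in this example can be modeled as a function whose private window never leaves it. This is a toy sketch, not Claude's subagent mechanism: `run_subagent` and `fake_summarize` are hypothetical, the paper and template contents are placeholder strings of arbitrary length, the task wording is invented, and the token estimate is a crude four-characters-per-token guess, so the printed numbers are not meant to reproduce the figure above.

```python
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude assumption: ~4 characters per token


def run_subagent(task, documents, summarize):
    """Explore in a private window; only the conclusion leaves this function."""
    private_window = [task] + list(documents)  # exploration noise stays here
    used = sum(estimate_tokens(t) for t in private_window)
    return summarize(documents), used


def fake_summarize(documents):
    # Placeholder for a model call that condenses the documents.
    return f"Three best practices drawn from {len(documents)} papers."


papers = ["p" * 20000 for _ in range(10)]  # stand-ins for ten long papers
summary, subagent_tokens = run_subagent("Read these papers", papers, fake_summarize)

main_window = [
    "Rules: naming conventions, error-handling contract, align with src/loaders/.",
    "t" * 8000,                        # stand-in for the one template file
    summary,                           # the only thing the subagent hands back
    "Task: implement the new module.",  # hypothetical task wording
]
main_tokens = sum(estimate_tokens(t) for t in main_window)
print(f"subagent window: {subagent_tokens} tokens, main window: {main_tokens} tokens")
```

The main window grows by the size of the summary, not by the size of what the subagent had to read to produce it.
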
Tool comparison
Each vendor packages the same context engineering concepts under different names (as of 2026-05; for precise configuration locations and verification, see 02-2 and 02-6).

| Mechanism | Anthropic Claude (primary) | OpenAI | Google Gemini | GitHub Copilot | Cursor |
|---|---|---|---|---|---|
| Long-term memory | Claude.ai memory / Claude Code CLAUDE.md, ~/.claude/ | Saved Memories + Reference chat history | Personal context → Memory | instructions file (not auto-memory) | User Rules / Memories |
| Knowledge attachment | Projects knowledge base / @ file reference | Projects / Custom GPT Knowledge | Gems Knowledge / Drive connection | # files, repo index | @ files / @Docs / codebase index |
| Active compaction | Auto compaction + /compact (with optional focus) | Auto conversation summary | Auto summary | Automatic | Automatic |
| Subagent isolation | Subagent (independent context, returns conclusions) | Multi-agent / Agents | Antigravity agents | (no user-layer equivalent) | Background agents |

The comparison table gives coordinates, not details
This table’s purpose is to tell you what your tool calls the same concept. The precise configuration path, toggle location, and version differences for each cell are fast-changing facts; they are verified with dates in 02-2 and 02-6, and are not duplicated here to avoid version inconsistencies across multiple locations. Some mechanism names in the Copilot and Cursor columns are expressed as stable descriptions; for exact names, consult each tool’s official documentation.
Self-check
The bar for passing this unit
- Given a complex multi-file task, can you state which items should enter the context and which should stay outside to be loaded just-in-time via a pointer, along with your reasoning?
- Can you explain why “writing critical constraints into `CLAUDE.md`” is more reliable than “stating them once in the conversation”? The answer relates to the compaction mechanism.
- Can you identify where the long-term memory toggle is in your primary tool, and explain why you should turn it off when handling a sensitive one-off task?
- The next time your task window starts swelling, will your first move be to keep stuffing, or to actively compact or dispatch a subagent? Why?
Sources and further reading
Factual claims are grounded in official documentation and original research; fast-changing items are dated to 2026-05. IEEE numbering style.

- [1] Anthropic, “Effective context engineering for AI agents,” Anthropic Engineering, 2025. (attention budget, just-in-time loading, compaction) https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents (as of 2026-05)
- [2] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, P. Liang, “Lost in the Middle: How Language Models Use Long Contexts,” Transactions of the ACL, 2024 (arXiv:2307.03172, 2023). https://arxiv.org/abs/2307.03172
- [3] Chroma, “Context Rot: How Increasing Input Tokens Impacts LLM Performance,” Chroma Technical Report, 2025. https://www.trychroma.com/research/context-rot (as of 2026-05)
- [4] Anthropic, “How Claude remembers your project” (memory, `/compact`, re-reading `CLAUDE.md` after compaction) and “Explore the context window” (window load order and compaction survival rules), Claude Code Docs. https://code.claude.com/docs/en/memory and https://code.claude.com/docs/zh-TW/context-window (as of 2026-06)
- [5] OpenAI, “What is Memory?” / “Memory FAQ” (Saved Memories, Reference chat history), OpenAI Help Center. https://help.openai.com/en/articles/8983136-what-is-memory (as of 2026-05)
- [6] Google, “Get personalization in Gemini Apps” (Personal context → Memory), Gemini Apps Help. https://support.google.com/gemini/answer/16598623 (as of 2026-05)
- [7] Anthropic, “Create custom subagents” (subagent independent context, toolset, returns conclusions), Claude Code Docs. https://code.claude.com/docs/en/sub-agents (as of 2026-05)
- [8] VectifyAI, “PageIndex: vectorless, reasoning-based RAG” (tree-based reasoning retrieval, AlphaGo-inspired, FinanceBench 98.7% self-reported, MIT license, MCP support), GitHub, 2025. https://github.com/VectifyAI/PageIndex (as of 2026-05)
- Related: 01-1 mental models, 01-2 token and window technical basis, 02-2 Claude configuration, 02-5 Subagents, 03-3 memory poisoning and supply-chain risk.