What this unit solvesA prompt is not a wish; it is a task written as a model-executable spec. This unit gives you actionable prompt-design methods: assemble role, task, constraints, output format, and examples using a structured skeleton; use XML tags to cleanly separate blocks; use Few-Shot examples to calibrate abstract style and format quality into model-alignable templates; when manual iteration yields diminishing returns, use Automated Prompt Optimization (APO) to drive automated iteration via evaluation; finally, use a Skill to lock in the optimization flow as a reusable, callable tool.
Learning objectives
After this unit, you should be able to:- Write a structured prompt containing all five components (role, task, constraints, output format, examples) and explain the purpose of each.
- Use XML tags to separate instructions, data to process, and examples, preventing instructions from being contaminated by data.
- Use 2 to 3 Few-Shot examples to turn abstract style or format requirements into model-alignable templates.
- Judge when manual prompt tuning is sufficient and when to switch to APO automated optimization.
- Encapsulate the “evaluate a prompt and rewrite it” flow as a reusable, callable Skill to reduce repeated work.
1. The nature of a prompt: an executable spec, not a wishing well
A prompt is not a wish to the model; it is shrinking the solution space small enough that the model does not have to guess. Every constraint you provide prunes the model’s search; the more precise the spec, the less the model has to wander among “plausible but not what you wanted” outputs, and the more stable the output quality. Think of this as writing a spec rather than writing a prayer, and all the techniques that follow have a foothold. Most dissatisfaction with prompts traces not to a dumb model but to treating a task that needs constraints as a wishing well. Here is a real comparison. Before / After: same task, wish vs. structured spec Task: ask the model to summarize a paper. Wish-style prompt:2. Structured skeleton: assembling the five components
A reliable prompt typically consists of five components. Not every task needs all five, but you should know which ones are missing and why. Role: sets the task perspective and tonal baseline, shifting the model’s output distribution toward a specific professional subset. It is not a magic incantation. “You are a top expert” carries almost no information; the model has nothing to prune with. “You are the first author preparing a rebuttal for reviewers” actually pins the perspective, tone, and what to emphasize. A role is only useful when it is specific enough to change what the model should attend to. Task: a clear verb plus input description plus success criterion. “Summarize this paper” is a verb with input but no success criterion. Adding “so a reader can decide in thirty seconds whether it is worth reading in full” tells the model for whom and why it is writing. The success criterion is the most commonly omitted component and the most critical. Constraints: format rules, prohibited behaviors, length limits. The rule of thumb is that “do not do X” is harder to miss than “please do Y”, because the positive behavior you want can be achieved in many ways, but once you name a prohibited negative behavior (adding external knowledge, adding commentary, changing the format), the boundary is clear. Write your worst failure modes as explicit prohibitions. Output Format: Markdown structure, JSON schema, table skeleton. If you can paste a template, paste an empty skeleton for the model to fill; do not describe the format in prose. “Use bullet points” is far weaker than just giving it3. Structuring prompts with XML tags (especially recommended for Claude)
When a prompt includes instructions, background context, examples, and data to process, the model may conflate them. The classic failure is “instructions contaminated by data”: a document you paste in happens to contain a sentence that looks like an instruction, and the model follows it. The purpose of XML tags is to wrap each block of content in its own tag so the boundaries are unambiguous. Anthropic officially recommends using XML tags to separate different components of a prompt, and notes that when a prompt contains multiple types of content (context, instructions, examples, variable input), tags let Claude parse more precisely and produce higher-quality output [1] (as of 2026-05). Claude shows high conformance to tag boundaries. Common tags include<instructions>, <context>, <example>, <document>; tag names have no fixed dictionary, and any descriptively named tag (<customer_data>, <lameness_notes>) the model reads correctly, as long as the naming is consistent and semantically clear.
Three key uses:
- Wrapping data: put long text inside
<document>, separating it from instructions to reduce the chance of instructions being drowned out by data. - Framing output: ask the model to place its answer inside
<answer>for easy programmatic extraction and post-processing. - Nested hierarchy:
<examples>containing multiple<example>tags for Few-Shot; multiple documents wrapped in<documents>, each as<document index="1">.
<comment> is data not a command, and correctly outputs “neutral” or “positive” (depending on your definition). This isolation habit is a baseline defense when handling user input, scraped web pages, or text extracted from PDFs, and ties directly to the prompt injection discussion in 03-3.
Trade-offs with Markdown and JSON: all three can coexist in one prompt, each serving its role. XML is suited to dividing blocks and expressing nested semantics (which section is instructions, which is data); JSON is suited to requesting structured output from the model (for direct downstream parsing); Markdown is suited to human-readable formatting (headings, bullet lists). A common combination is using XML to cut the big sections of a prompt and asking the model to output a JSON object inside <answer>. Do not obsess over the choice; decide by “is this content for a person to read, for a machine to read, or does it need a clean boundary”.
4. Calibrating style and format with Few-Shot (key section)
Few-Shot is the most powerful way to concretize abstract requirements. For requirements like “use the academic Taiwanese Mandarin register” or “match our team’s commit format”, no matter how long you describe them in text, the model still can’t quite nail it. Throw in 2 to 3 demonstrations and the model aligns directly to the template. The information density of examples far exceeds the same word count of description. Core purpose: externalize the “right look” in your head into concrete output the model can imitate. Selection strategy: examples should cover the key boundaries of output variability, not just a few “looks correct” picks. If your task has several typical input types (normal case, edge case, a case that should return “cannot determine”), the examples should cover those representative situations; otherwise the model can only guess when it encounters unseen patterns. Covering boundaries matters more than accumulating quantity: 2 to 3 well-chosen examples beat 10 that all look the same. Worked example: prompt skeleton plus two demonstrations, converting rough notes to fixed JSON Task: normalize field-observation notes into a JSON with a fixed schema.severity with a number, the other with a string), the model will not pick one and stick to it; it will waver between them and produce a hybrid you did not expect. Internal consistency across examples is as important as selecting the right examples. When reviewing Few-Shot, start by asking “do any of my examples conflict with each other?“
5. Automated Prompt Optimization (APO): evaluation-driven automated iteration
Manual prompt tuning hits a ceiling: you change one sentence on intuition, run a few times, feel like it might be slightly better, but have no quantitative basis, and the later iterations become guesswork. The core idea of APO is to automate this: given an evaluation function (eval), automatically modify the prompt to maximize the eval score, instead of relying on human intuition. Two representative academic methods:- OPRO (Optimization by PROmpting): uses the LLM itself as the optimizer. At each step it feeds “past prompts and their scores” back to the model, asking it to generate a better new prompt, which is evaluated and added to the next round’s context. Yang et al. (2023) found that on GSM8K and Big-Bench Hard tasks, OPRO-generated prompts beat human-designed prompts by 8% to 50% [2].
- DSPy: abstracts the LLM pipeline into programmable modules and uses a compiler to auto-optimize against a given metric (automatically generating and selecting few-shot demonstrations, rewriting instructions). Khattab et al. (2023) showed that a few lines of DSPy let a model bootstrap a pipeline that outperforms standard few-shot [3]. In practice, DSPy is one of the most mature frameworks for making “prompt as an optimizable parameter” work in production (as of 2026-05).
APO is a fast-moving area; here is a directional map (as of 2026-06)This space changes quickly. A three-way split by how much evaluation cost you are willing to invest:
- Pursuing academic-grade accuracy: GEPA (Genetic-Pareto reflective evolution, ICLR 2026 Oral, outperforms DSPy’s existing MIPROv2 by more than 10% on multiple benchmarks [11]).
- Wanting practical balance and production readiness: DSPy’s MIPROv2, the trade-off point among performance, infrastructure, and production readiness [3].
- Wanting fast setup with low configuration cost: TextGrad (iterates via a text-based “automatic differentiation”, simple setup, sufficient when task difficulty is uniform; original paper source needs confirmation), or one-click optimization (OpenAI Optimize [4], Vertex AI Prompt Optimizer [5]).
6. Encapsulating the prompt optimization flow as a Skill
Once you find that an “evaluate and rewrite” flow for a certain task type keeps recurring, lock it down rather than rebuilding from scratch each time. The concept is to encapsulate “score a prompt and rewrite it” as a Skill: reusable, version-controlled, shareable. The ecosystem already has ready-made prompt optimization capabilities to leverage: OpenAI has built Optimize as a Playground button [4]; CLI agents like Claude Code, Codex, Antigravity, and Copilot CLI let you write the optimization flow as a Skill (aSKILL.md under .claude/skills/ with references and scripts), turning “load a prompt plus eval set, run evaluation, output rewrite suggestions” into a one-command tool. There are also community prompt-optimizer skills in circulation (exact name and source depend on your installed plugins; check with the 03-3 supply-chain principles before use).
The benefits of encapsulation are direct: the same task type no longer requires retuning; apply the Skill and get rewrite suggestions. Once the flow is in version control, everyone on the team uses the same optimization logic and gets reproducible output. For the full process of creating and configuring a Skill (frontmatter, references, scripts, per-tool differences), see 04-4 Skills. The one thing to remember here: any prompt-tuning flow you have repeated three or more times is the signal to encapsulate it as a Skill.
7. Positioning on the concept ladder: prompt is the bottom layer of four-layer engineering
Prompt engineering is the bottom layer of AI engineering, but not the whole of it. Every sentence you write in a prompt operates inside a larger context design. The engineering mental model broken into four layers, bottom to top:Tool comparison
Where the same prompt concepts land in each tool (as of 2026-05; fast-moving items are cited):| Concept | Anthropic Claude (primary) | OpenAI | GitHub Copilot | Cursor | |
|---|---|---|---|---|---|
| System prompt location | system parameter (API); Claude.ai Project Instructions; CLAUDE.md and .claude/rules/ (Claude Code) | system/developer role (API); Custom GPT Instructions | systemInstruction (Gemini API [6]); Gems instructions (requires Gemini Advanced [7]) | .github/copilot-instructions.md (repo-level); GitHub.com personal instructions [8] | User Rules (settings GUI, global); Project Rules (.cursor/rules/*.mdc, version-controlled [9]) |
| Structured separation | XML tags (officially recommended, high conformance [1]) | Markdown sections or XML both work | Markdown or XML | Markdown (instructions file) | Markdown (.mdc with frontmatter) |
| Few-Shot format | user/assistant alternating turns, or example section inside system | Same: user/assistant turns | Same | Examples inside the instructions file | Examples inside the rule file |
| Built-in prompt auto-optimization | No native APO; can integrate DSPy [3], OPRO [2] | Optimize and Generate in Playground (free [4]) | Vertex AI Prompt Optimizer (GA, data-driven [5]) | Original source needs confirmation | Original source needs confirmation |
| Prompt flow encapsulation | Claude Code Skill (.claude/skills/; see 04-4) | Custom GPTs / GPT Actions | Gems [7] | Copilot prompt files (*.prompt.md, version needs confirmation) | .cursor/rules/ rule files |
The direction of cross-tool convergenceThe system-prompt mechanisms across vendors are converging on
AGENTS.md as a cross-tool standard (see 02-1 and 04-2). Every vendor has a Claude Code-equivalent CLI agent: OpenAI’s Codex, Google’s Antigravity (CLI written in Go), GitHub’s Copilot CLI. Copilot has already folded AGENTS.md into its instruction layer [8], and Codex and Antigravity both support it. In practice an optimization flow written to AGENTS.md and Skill conventions can be reused across these CLI agents; migration is mainly about moving files (as of 2026-06). Learn to write prompts as specs in Claude and the core method carries over to other tools.Hands-on exercises
- Rewrite your most-used task prompt: pick a task you drop into AI every week (writing commits, cleaning up meeting notes, editing SQL), lay out your current prompt, and check it against the five components from Section 2. Fill in the missing pieces (usually the success criterion and the output format skeleton). Run before and after three times each, and compare consistency of output: not which run was best, but “how stable is it?”
- Add two Few-Shot examples: for the same task, add 2 demonstration input/output pairs: one normal case, one edge case (the kind that should return “cannot determine”). Observe how the model changes in style consistency and format adherence, paying special attention to whether it makes things up when it encounters an unseen input.
- Do one XML isolation: if your task requires feeding in external text (pasted web pages, someone else’s message), separate instructions and data with XML tags, deliberately plant a fake instruction inside the data, and see whether the model falls for it before and after isolation.
Common pitfalls
Self-check
Ask yourself these questions
- For one task you recently sent to AI, which of the five components does its prompt contain? Do the missing ones correspond to exactly where you feel the output is unstable?
- If a colleague had to take over your prompt and get consistent output on the same task, is your current prompt precise enough, or does half of it depend on tacit knowledge in your head?
- Are your Few-Shot examples internally consistent in format? Do they cover the “should return cannot determine” edge case?
- The last time you were dissatisfied with output, was it really a prompt wording problem, or was context overloaded, or the task should have been split into multiple steps?
- What prompt-tuning task have you already repeated three or more times? That is the signal to encapsulate it as a Skill.
Sources and further reading
- [1] Anthropic, “Use XML tags to structure your prompts,” Claude Docs. Accessed: 2026-05. [Online]. Available: https://platform.claude.com/docs/en/build-with-claude/prompt-engineering/use-xml-tags
- [2] C. Yang, et al., “Large Language Models as Optimizers,” arXiv:2309.03409, 2023. [Online]. Available: https://arxiv.org/abs/2309.03409
- [3] O. Khattab, A. Singhvi, P. Maheshwari, Z. Zhang, et al., “DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines,” arXiv:2310.03714, 2023. [Online]. Available: https://arxiv.org/abs/2310.03714
- [4] OpenAI, “Prompt management in Playground,” OpenAI Help Center. Accessed: 2026-05. [Online]. Available: https://help.openai.com/en/articles/9824968-prompt-management-in-playground
- [5] Google Cloud, “Data-driven prompt optimizer,” Generative AI on Vertex AI Documentation. Accessed: 2026-05. [Online]. Available: https://docs.cloud.google.com/vertex-ai/generative-ai/docs/learn/prompts/data-driven-optimizer
- [6] Google, “Gemini API: System instructions,” Gemini Cookbook. Accessed: 2026-05. [Online]. Available: https://github.com/google-gemini/cookbook/blob/main/quickstarts/System_instructions.ipynb
- [7] Google, “Gemini Gems.” Accessed: 2026-05. Original official source needs confirmation (features and subscription requirements are per Gemini Advanced plan, as of 2026-05).
- [8] GitHub, “Adding repository custom instructions for GitHub Copilot,” GitHub Docs. Accessed: 2026-05. [Online]. Available: https://docs.github.com/copilot/customizing-copilot/adding-custom-instructions-for-github-copilot
- [9] Cursor, “Rules,” Cursor Docs. Accessed: 2026-05. [Online]. Available: https://cursor.com/docs/rules
- [10] Anthropic, “Adaptive thinking,” Claude Docs. Accessed: 2026-05. [Online]. Available: https://platform.claude.com/docs/en/build-with-claude/adaptive-thinking
- [11] “GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning,” arXiv:2507.19457, 2025 (ICLR 2026 Oral). Accessed: 2026-06. [Online]. Available: https://arxiv.org/abs/2507.19457
- DSPy framework documentation (Stanford NLP, including MIPROv2 and GEPA optimizer): https://dspy.ai/ (as of 2026-06)
- Further reading: 01-4 Context Engineering, 01-5 Workflow Engineering, 01-6 Harness Engineering, 04-4 Skills