01-5 Workflow Engineering: orchestrating multi-step tasks into repeatable flows

What this unit solves

Workflow orchestration handles multi-step tasks that a single prompt cannot: decomposition, chaining, parallelism, verification, and convergence. Prompt engineering governs “how to phrase this one request”; context engineering governs “what belongs in the window”; workflow engineering governs “what structure connects these steps, who goes first, and where to place checkpoints”. Getting this layer right is what makes a multi-step flow repeatable and auditable, rather than luck-dependent and different on every run.

Learning objectives

Identify when a task exceeds the scope of a single prompt and requires a workflow to run reliably, and distinguish the boundary between “workflow” and “agent”.
Decompose a compound goal into steps with dependency relationships, and identify which can run in parallel and which must be sequential (DAG thinking).
Apply primitives such as pipeline, fan-out, barrier, and loop-until, and map them to the five named patterns Anthropic identifies.
Place adversarial verification and convergence checkpoints at critical nodes to prevent upstream errors from propagating silently.

1. When you need a workflow: where the single-prompt ceiling is

When any step’s output is the next step’s input, and that intermediate result is hard to maintain reliably within the same context, you need a workflow. Conversely, anything that can be completed stably within one prompt and one context window should not be split; splitting only adds latency and state management overhead (see the over-decomposing entry under Common pitfalls). A single prompt hits three ceilings:

Output length: the model’s per-response token limit means asking it to produce a long report plus complete code plus tests in one shot either truncates or skims every section.
Context depth: when a task requires “remembering the intermediate result computed earlier” before proceeding, those intermediates crowd the same window and dilute each other, triggering the lost-in-the-middle and context-rot effects covered in 01-4.
External retrieval: some steps must first query a database, run tests, or fetch a page, then use the result to decide what comes next. A single-round prompt has no structure for “pause, retrieve, then continue”.

Typical tasks that exceed the boundary: large code refactors (multi-file, requiring understand then change then verify), multi-document summarization and comparison (summarize individually then cross-compare), iterative quality improvement (generate → evaluate → revise → re-evaluate), data pipelines (fetch → clean → transform → load). The shared trait is that intermediate results must land somewhere and be consumed by the next step.

Workflow and agent are not the same thing

Anthropic’s distinction in Building Effective Agents is: a workflow is a system where “LLMs and tools are orchestrated through predefined code paths”; an agent is a system where “LLMs dynamically direct their own processes and tool usage, controlling how they accomplish tasks” [1]. The difference is who decides the next step. A workflow’s steps and transitions are defined in code in advance; the model only works within each node. An agent receives a goal and tools, and plans the steps itself.

This unit focuses on getting workflows right because they are predictable, reproducible, and easy to debug; the flexibility of agents comes at the cost of controllability, which is the territory of 01-6 harness engineering. In practice, the vast majority of tasks that appear to require an agent are solved by a well-designed workflow, and solved more reliably.

2. Decomposition and dependencies: DAG thinking

The most reliable way to break a compound goal into steps is to work backwards from the end: “What does this final output directly require?” For each answer, trace one layer further up: “And where does that come from?” Continue until you reach atomic steps that cannot be further divided. Compared to working forward from “what is step one”, backward reasoning is less likely to miss hidden prerequisites.

After decomposing, draw the steps as a directed graph: nodes are steps, edges are dependencies (A → B means B uses A’s output). This graph must be acyclic; a cycle means A waits on B and B waits on A, which deadlocks. This is DAG (directed acyclic graph) thinking.

Once the DAG is drawn, parallelism opportunities surface on their own: nodes with no path between them can run simultaneously. Take a graph in which B cleans the data, C1, C2, and C3 are three summarization steps, and D compares the summaries. C1, C2, and C3 are independent and can run in parallel; but all three must wait for B to finish (they need the cleaned data), and all three must finish before D (the comparison needs all three summaries). D is a natural barrier point (see Section 3).

Decomposition most commonly produces two opposite mistakes:

Chaining parallelizable steps: C1/C2/C3 are entirely independent but are run one after another. No error occurs; you just pay three times the time with nothing to show for it.
Parallelizing dependent steps: launching D (comparison) at the same time as the C steps (summarization) means D receives incomplete summaries and builds its output on partial or empty inputs. This mistake is harder to spot because it “runs fast”, but the results are wrong.
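The parallel and sequential parts of a DAG can be read off mechanically. The following is a minimal sketch using Python's standard `graphlib` module; the step names, and the assumption that A is a fetch step feeding B, are hypothetical and chosen only to mirror the B → C1/C2/C3 → D example.

```python
from graphlib import TopologicalSorter

# Hypothetical example graph: each key depends on the steps in its value set.
DEPENDS_ON = {
    "A_fetch": set(),
    "B_clean": {"A_fetch"},
    "C1_summarize": {"B_clean"},
    "C2_summarize": {"B_clean"},
    "C3_summarize": {"B_clean"},
    "D_compare": {"C1_summarize", "C2_summarize", "C3_summarize"},
}


def parallel_batches(depends_on):
    """Group steps into batches; steps in the same batch have no path between them."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every step whose dependencies are all done
        batches.append(ready)
        sorter.done(*ready)  # mark the whole batch finished, unlocking the next one
    return batches


if __name__ == "__main__":
    for number, batch in enumerate(parallel_batches(DEPENDS_ON), start=1):
        print(number, batch)
```

Each batch is a set of steps that may run at the same time; the boundary between two batches is where a barrier would sit if the next step needs every result from the previous batch.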

3. Common patterns: from mechanical primitives to named patterns

Four primitives are the actual building blocks you manipulate when orchestrating. Three of Anthropic’s five named patterns are direct combinations of them; the remaining two (routing and orchestrator-workers) add dynamic dispatch and are covered at the end of this section. Master the primitives first, recognize the combinations next, and encountering a new framework is just a syntax change. Four mechanical primitives:

Pipeline (chaining): A → B → C, each step’s output feeds the next. Suited for linear transformations. Corresponds to Anthropic’s prompt chaining: decompose a complex task into sequential steps, with programmatic checkpoints between steps to verify progress before continuing [1].
Fan-out: the same input is sent to multiple parallel steps simultaneously. Suited for analyzing the same data from multiple perspectives (checking security, performance, and types in parallel).
Barrier: wait for all parallel branches to complete before proceeding. Fan-out is usually followed by a barrier and then aggregation. Fan-out + barrier together make up Anthropic’s parallelization, which has two variants: sectioning (split the task into independent sub-tasks and run them simultaneously) and voting (run the same task multiple times to get independent results and take a vote) [1].
Loop-until (convergence loop): repeat until verification passes or the limit is reached. Suited for quality iteration and self-correction. Corresponds to Anthropic’s evaluator-optimizer: one LLM call generates output, another scores and gives feedback, iterating toward a passing result [1].

Barriers have a cost; default to pipeline

A common over-synchronization: fan out three branches, use a barrier to wait for all to finish, then forward the combined result. But if one branch is much slower, the barrier makes the two fast ones sit idle. Unless the next step genuinely needs all branch results together (for example, deduplicating across all results or comparing them against each other), letting each branch independently continue (pipeline style) avoids waste. The criterion: does the next step need global information across branches? If yes, set a barrier; if not, don’t block.
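The cost of an unnecessary barrier can be made visible with a small timing experiment. This is a sketch using `asyncio`, where `asyncio.sleep` stands in for an LLM or tool call; the branch names and delays are made up for illustration. Each function returns, per branch, how long after launch that branch's follow-up step was able to start.

```python
import asyncio
import time

# Hypothetical branch durations in seconds; "slow" plays the slow branch.
BRANCH_DELAYS = {"fast_a": 0.01, "fast_b": 0.02, "slow": 0.20}


async def branch(name, delay):
    await asyncio.sleep(delay)  # stand-in for an LLM or tool call
    return name


async def with_barrier():
    """Fan-out, wait for ALL branches, only then start each follow-up step."""
    start = time.monotonic()
    names = await asyncio.gather(*(branch(n, d) for n, d in BRANCH_DELAYS.items()))
    # Every follow-up starts only after the slowest branch has returned.
    return {name: time.monotonic() - start for name in names}


async def as_pipeline():
    """Fan-out, but each branch starts its follow-up as soon as it finishes."""
    start = time.monotonic()
    started = {}

    async def chain(name, delay):
        await branch(name, delay)
        started[name] = time.monotonic() - start  # follow-up begins here

    # This gather only collects completion; it does not delay any follow-up.
    await asyncio.gather(*(chain(n, d) for n, d in BRANCH_DELAYS.items()))
    return started


if __name__ == "__main__":
    print("barrier :", asyncio.run(with_barrier()))
    print("pipeline:", asyncio.run(as_pipeline()))
```

With the barrier, all three follow-ups start only once the slow branch is done; in the pipeline version the two fast branches move on almost immediately. If the follow-up step needed all three results at once, the barrier version would be the correct one.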

When does a pattern “upgrade” to an agent?

The transitions in the four primitives above are all predefined by you. When “what sub-tasks to decompose into” must be determined by the model after seeing the input and cannot be written in advance, you need two more of Anthropic’s patterns:

Routing: use a classification step first to determine which category the input falls into, then direct it to the appropriate specialized handler. This is still fundamentally a workflow (you define the routing table), with an added layer of dynamic dispatch.
Orchestrator-workers: a central LLM dynamically decomposes the task, dispatches sub-tasks to workers, and synthesizes the results [1]. The key difference from parallelization is: the sub-tasks are not predefined; the orchestrator decides them after seeing the input. At this point you have stepped into agent territory and partially surrendered control to the model.
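Routing stays on the workflow side of the line because the routing table is written by you. A minimal sketch follows; the categories, the keyword rules inside `classify`, and the handler functions are all hypothetical, and in a real flow `classify` would be an LLM classification call rather than string matching.

```python
from typing import Callable, Dict


def classify(request: str) -> str:
    """Stand-in for an LLM classification step; keyword rules are illustrative only."""
    text = request.lower()
    if "refund" in text or "invoice" in text:
        return "billing"
    if "error" in text or "crash" in text:
        return "technical"
    return "general"


def handle_billing(request: str) -> str:
    return f"billing: {request}"


def handle_technical(request: str) -> str:
    return f"technical: {request}"


def handle_general(request: str) -> str:
    return f"general: {request}"


# The routing table is fixed in code: the classifier picks a row, it cannot add one.
ROUTES: Dict[str, Callable[[str], str]] = {
    "billing": handle_billing,
    "technical": handle_technical,
    "general": handle_general,
}


def route(request: str) -> str:
    category = classify(request)
    handler = ROUTES.get(category, handle_general)  # unknown labels fall back safely
    return handler(request)


if __name__ == "__main__":
    print(route("The app shows an error on login"))
```

The fallback in `route` is the design choice worth noting: even if the classifier returns a label you did not anticipate, the flow still ends in a handler you wrote. An orchestrator-workers design has no such fixed table, which is what moves it toward agent territory.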

A concrete workflow you can build right now: adversarial code review

Scenario: review changes on a branch. Use a pipeline to chain a “review” phase and a “verification” phase, and fan out across multiple dimensions.

dimensions = [correctness, security, performance, style]
for each dimension (pipeline, independent, no barrier):
  phase 1: spawn a sub-agent to review that dimension -> return findings (structured)
  phase 2: for each finding, spawn an independent sub-agent to adversarially verify -> return verdict
final: collect all findings where verdict.isReal == true

Why this design: the four dimensions are independent, so pipelining lets the correctness verification start running while performance is still being reviewed, without being blocked by the slowest dimension (contrast with the barrier warning above). The adversarial verification in phase 2 is the focus of Section 4; note its position here. Claude Code’s sub-agents (the Task tool) and the SDK-layer orchestrator pattern are designed exactly for this kind of orchestration (as of 2026-05) [2].
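The same shape can be sketched in plain Python with `asyncio`. Everything here is a stand-in: `review` and `verify` are stubs with canned results where a real flow would spawn sub-agents, and the `Finding` fields are an assumed schema, not one defined by Claude Code.

```python
import asyncio
from dataclasses import dataclass

DIMENSIONS = ["correctness", "security", "performance", "style"]


@dataclass
class Finding:
    dimension: str
    claim: str
    evidence: str = ""  # e.g. a file/line reference or failing test (assumed field)


async def review(dimension, diff):
    """Phase 1 stub: a real version would spawn a sub-agent for this dimension."""
    canned = {
        "security": [Finding("security", "SQL built by string concatenation", "db.py:12")],
        "style": [Finding("style", "function is too long")],  # no evidence attached
    }
    return canned.get(dimension, [])


async def verify(finding):
    """Phase 2 stub: try to refute; when uncertain, default to not-established."""
    return bool(finding.evidence)  # a claim without evidence is treated as refuted


async def review_dimension(dimension, diff):
    """One pipeline: review a dimension, then verify its findings right away."""
    findings = await review(dimension, diff)
    verdicts = await asyncio.gather(*(verify(f) for f in findings))
    return [f for f, is_real in zip(findings, verdicts) if is_real]


async def review_branch(diff):
    # Fan out across dimensions; no dimension waits for another before verifying.
    per_dimension = await asyncio.gather(*(review_dimension(d, diff) for d in DIMENSIONS))
    return [f for findings in per_dimension for f in findings]


if __name__ == "__main__":
    for finding in asyncio.run(review_branch("<diff text>")):
        print(finding)
```

The final `gather` in `review_branch` only collects results for the report; verification inside each dimension has already started as soon as that dimension's review returned.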

4. Verification and convergence: placing adversarial verification at critical nodes

Skipping verification has concrete consequences: every downstream step treats one small upstream error as a correct premise and amplifies it. By the time the output is obviously wrong three steps later, tracing which step broke is difficult, and the tokens spent on every intermediate step are wasted. The value of a verification checkpoint is catching errors while they are still cheap.

The weakest form of “verification” is having the same model check its own output: models tend to agree with what they just produced, so this catches little. Stronger is adversarial verification: use a different perspective, or an independent model call, to challenge the previous step’s output rather than confirm it. Two key design moves:

Switch the role to critic: don’t write “check whether this result is correct” for the verification step; write “try to refute this conclusion; when uncertain, default to not-established”. Reversing the burden of proof is what suppresses false positives.
Multiple perspectives rather than multiple copies: if a conclusion can fail in multiple ways, give each verifier a different lens (correctness, security, reproducibility) rather than running three identical checks. This is the voting variant of Anthropic’s parallelization [1], applied with a different prompt per voter instead of identical reruns.

Convergence checkpoints (especially loop-until) must have three explicit exit conditions; missing any one can turn into a resource sink in production:

Pass criterion: use a structured pass / fail schema rather than a vague judgment like “looks acceptable”.
Maximum retry count: if the Nth iteration still fails, stop; never iterate indefinitely.
Timeout exit: abort if a single round or the whole loop exceeds the time limit.

A loop-until + adversarial verification convergence loop

Task: produce a code segment to merge into the main branch, required to pass tests and have no obvious security issues.

# generation_step, parallel, aggregate, and failure are placeholders for your own steps
def converge(task):
    feedback = None                      # no rebuttals exist before round 1
    for round_no in range(1, MAX_ROUNDS + 1):    # maximum retry cap
        candidate = generation_step(task, feedback)
        # adversarial: 3 independent reviewers each try to refute "this is safe and correct"
        votes = parallel(reviewer_1, reviewer_2, reviewer_3, candidate=candidate)
        refute_count = sum(1 for v in votes if v.refuted)
        if refute_count == 0:            # pass criterion: nobody can refute it
            return candidate             # converged, exit
        feedback = aggregate(votes)      # feed the rebuttals back into the next round
    return failure("did not converge within limit")  # exit: don't let it spin forever

The three reviewers run in parallel (fan-out + barrier to wait for votes); if any reviewer can refute, the candidate is sent back with the rebuttals as feedback. Without the MAX_ROUNDS line, any case the model can never pass will burn until you notice the bill. Claude Code’s Stop hook can serve as this convergence sentinel, running a deterministic pass / fail check at the end of each round (as of 2026-05) [2].
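A runnable variant that carries all three exit conditions (pass criterion, retry cap, timeout) might look like the sketch below. The `Verdict` schema, the `converge` signature, and the demo generator and reviewer are assumptions made for illustration; reviewers are called sequentially here to keep the example short, and the timeout is only checked between rounds, so a single hung call would still need its own per-call timeout.

```python
import time
from dataclasses import dataclass


@dataclass
class Verdict:
    refuted: bool     # structured pass / fail flag instead of "looks acceptable"
    reason: str = ""


def converge(generate, reviewers, max_rounds=3, timeout_s=60.0, clock=time.monotonic):
    """Return (candidate, status); status is 'passed', 'max_rounds', or 'timeout'."""
    deadline = clock() + timeout_s
    feedback = []
    candidate = None
    for _ in range(max_rounds):                 # exit 2: maximum retry count
        if clock() >= deadline:                 # exit 3: timeout, checked per round
            return candidate, "timeout"
        candidate = generate(feedback)
        verdicts = [review(candidate) for review in reviewers]
        refutations = [v for v in verdicts if v.refuted]
        if not refutations:                     # exit 1: nobody could refute it
            return candidate, "passed"
        feedback = [v.reason for v in refutations]  # rebuttals feed the next round
    return candidate, "max_rounds"


def demo_generate(feedback):
    """Hypothetical generator: fixes the issue once it has been told about it."""
    return "v2 with input validation" if feedback else "v1"


def demo_reviewer(candidate):
    """Hypothetical reviewer: refutes any candidate lacking input validation."""
    return Verdict(refuted="validation" not in candidate, reason="input is not validated")


if __name__ == "__main__":
    print(converge(demo_generate, [demo_reviewer] * 3))
```

Returning a status alongside the candidate lets the caller tell a genuine pass apart from a loop that simply ran out of rounds or time.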

5. Positioning on the concept ladder

In the four-layer engineering stack, workflow occupies L3: it depends downward on the quality of L2 context, and requires L4 harness as a safety net above it (from shallow to deep: 01-3 prompt, 01-4 context, this unit’s workflow, 01-6 harness):

L1 Prompt engineering   -- how to phrase a single request
L2 Context engineering  -- what belongs in the window
L3 Workflow engineering -- what structure connects multiple steps (this unit)
L4 Harness engineering  -- the shell that keeps the whole thing running stably (monitoring, permissions, kill switch)

Two boundaries to remember:

Connects downward to context: every node in a workflow is fundamentally a context assembly. The quality ceiling of a node’s output is set by the quality of the context you give it (see 01-4). No matter how elegant the workflow structure, if a single node’s context design is poor, its output is poor. Workflow amplifies good design and amplifies bad design equally.
Connects upward to harness: a workflow only describes “how steps are chained”; it does not handle “who applies the brakes when something breaks”. Adding monitoring, permission boundaries, memory persistence, and a kill switch is what makes a complete agentic harness (see 01-6). Building a solid L3 without an L4 kill switch is fitting a racing engine with no brakes.

This positioning also directly dismantles two common misjudgments: treating workflow as a cure-all that fixes bad prompts (it cannot; it only amplifies), and treating some agentic framework as the only option for doing workflows (a program with loops and conditionals is sufficient in most cases).

Tool comparison

Three main vehicle types for putting multi-step orchestration into practice, with different applicable scenarios (as of 2026-05):

Claude Code sub-agent orchestration: spawn multiple sub-agents using the Task tool inside an orchestrator prompt; each sub-agent has its own independent context and tool set and returns only the final conclusion, not the process [2]. Best for situations where the orchestration target is LLM work and you want to stay within the Claude ecosystem. Anthropic launched a research preview of Dynamic Workflows as an orchestration layer for larger tasks, and Agent Teams that collaborate via a shared git workspace, on 2026-05-28 (research preview as of 2026-05) [2].
n8n: visual low-code workflow where nodes correspond to steps; strongest at integrating a large number of external services (databases, APIs, messaging platforms). Its AI capability is built on LangChain, providing 70+ AI nodes; the AI Agent node defaults to Tools Agent and supports human-in-the-loop approval before tool execution (as of 2026-05) [4]. Best for flows that span many SaaS services and where non-engineering roles need to read and modify the flow.
LangGraph: describes flows as graphs in Python / TypeScript; the three primitives are State (shared across all nodes), Node (a function that reads and writes state), and Edge (direct or conditional transitions). Version 1.2 (released 2026-05-11) treats an agent run as a persistent graph execution (durable execution), checkpointing at every node so the server can resume from the last breakpoint after a restart [3]. Best for agentic flows requiring fine-grained control, state persistence, and human-in-the-loop.

Trade-offs: n8n (visual) is most intuitive for debugging visibility; LangGraph is strongest for fine-grained control and testability; Claude Code sub-agents are most convenient when the orchestration target is LLM sub-tasks and you want to stay entirely within the Claude ecosystem.

Concept	Anthropic Claude (primary)	OpenAI	Google	GitHub Copilot	Cursor
Multi-step orchestration entry point	Claude Code Task tool / sub-agents; SDK orchestrator pattern	Agents SDK / Responses API (with tool calls)	Vertex AI Agent Builder; Gemini API function calling	Copilot coding agent (cloud-autonomous, opens PRs)	Background agents
Parallel sub-tasks (fan-out)	Spawn multiple Task sub-agents inside the orchestrator prompt, each with independent context	Agents SDK multi-agent handoff	Official documentation does not provide equivalent detail	Official documentation does not provide equivalent detail	Background agents parallel
Loop-until / self-correction	Orchestrator conditional re-call; Stop hook as convergence sentinel	Agents SDK loop + guardrails	Official documentation does not provide equivalent detail	Official documentation does not provide equivalent detail	Official documentation does not provide equivalent detail
External workflow integration	n8n, LangGraph both call the Claude API	n8n, LangGraph both support OpenAI	n8n, LangGraph both support Gemini	Via API integration	Via API integration

The comparison table gives coordinates, not detail

The precise mechanism names, versions, and configuration paths in each cell change quickly, so some cells use more stable, generic descriptions instead. Cells where official documentation does not cover the equivalent concept say so explicitly; use each tool’s official documentation for exact names. The table’s purpose is to tell you “what your tool calls the same orchestration concept”; for deep configuration see Part II and Part IV.

Hands-on exercises

Two exercises, from paper decomposition to actual orchestration:

Draw an existing manual task as a DAG. Pick a multi-step flow you run manually (for example: collect data → summarize individually → cross-compare → write report). Draw the nodes and dependency edges, label which nodes can run in parallel (no path between them) and which must be sequential. Then ask three questions for every node: where does the input come from, who consumes the output, and what is the pass criterion? Which node most deserves a verification checkpoint (usually the one where an error would be amplified by downstream steps as a correct premise)?

Implement a minimal pipeline in Claude Code. Three chained steps; the second step verifies the first step’s output:

# orchestrator prompt skeleton (executed by the Claude Code main agent)
Step 1: dispatch sub-agent A to read src/ and produce a "module dependency list" (structured JSON).
Step 2: dispatch sub-agent B to adversarially verify that list, confirming item by item that files
        and symbols actually exist; mark non-existent items as invalid.
        If the schema fails, return to step 1 and redo (max 2 retries).
Step 3: produce refactoring suggestions using only verified items.

Making “verification” a standalone node with a pass / fail flag and a retry cap is the point of this exercise. Add a Stop hook for a deterministic end-of-run check, and this pipeline has a convergence sentinel.
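Part of step 2 can be made deterministic rather than left to a sub-agent. The sketch below checks only the schema and file existence; it assumes the dependency list is a JSON array of objects with a `file` field, which is a hypothetical schema, and it does not attempt the symbol check.

```python
import json
from pathlib import Path


def verify_dependency_list(raw_json, root):
    """Return (valid_items, invalid_items); raise ValueError if the schema is wrong."""
    data = json.loads(raw_json)  # malformed JSON raises a ValueError subclass
    well_formed = isinstance(data, list) and all(
        isinstance(item, dict) and isinstance(item.get("file"), str) for item in data
    )
    if not well_formed:
        raise ValueError("expected a JSON list of objects with a string 'file' field")

    valid, invalid = [], []
    for item in data:
        # An item is kept only if the file it names exists under the project root.
        target = valid if (Path(root) / item["file"]).is_file() else invalid
        target.append(item)
    return valid, invalid


if __name__ == "__main__":
    sample = '[{"file": "src/app.py"}, {"file": "src/ghost.py"}]'
    print(verify_dependency_list(sample, "."))
```

A `ValueError` here corresponds to the “schema fails, return to step 1” branch of the skeleton, and only the `valid` list would be passed on to step 3.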

Common pitfalls

Anti-pattern list

Treating workflow as a cure-all: a workflow is only structure. If a node’s prompt / context design is poor, no flow can recover it; it will faithfully pass the bad output downstream and amplify it. Get the single node right before talking about orchestration.
Skipping verification nodes: assume upstream output is correct and chain straight to the next step. Errors typically surface three steps later, making attribution hard and wasting all the tokens consumed at every intermediate step. Catch errors while they are still cheap.
Over-decomposing: force-splitting a task that one prompt could handle into five steps, adding latency and state management complexity. Confirm a single prompt genuinely cannot handle it before splitting.
Overusing barriers: applying a barrier after every fan-out to wait for all branches. If the next step does not need global cross-branch information, the barrier just makes the fast branches wait on the slowest. Default to pipeline; barriers are the exception.
Loops with no exit: loop-until with no maximum iteration count and no timeout. Any case that never converges will consume production resources indefinitely until you notice the bill.
Forcing an agent where a workflow suffices: handing control to the model for dynamic decision-making (orchestrator-workers) looks convenient, but the cost is predictability and reproducibility. If a workflow with a predefined path can solve it, don’t let the model improvise.

Self-check

The bar for passing this unit

Given a multi-step task, can you explain clearly why it exceeds the scope of a single prompt (output length, context depth, or external retrieval)? If you can’t, it may not need splitting at all.
For each node in a workflow you design, can you answer “where does the input come from, who consumes the output, what is the pass criterion”? Any node you can’t answer for means the design is not yet mature.
Is the barrier after your fan-out genuinely needed because the next step requires global cross-branch information, or did you just put it there out of habit?
Does your loop-until have an explicit maximum iteration count and a timeout exit?
Can you distinguish whether what you are building is a workflow (path predefined) or an agent (model decides the path)? The reproducibility cost of the two differs.

Sources and further reading

Facts are sourced from official documentation; fast-changing items are annotated “as of 2026-05”.

[1] E. Schluntz and B. Zhang, “Building Effective Agents,” Anthropic Engineering, Dec. 19, 2024. (definitions of workflow and agent; the five named patterns: prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer) https://www.anthropic.com/engineering/building-effective-agents (as of 2026-05)
[2] Anthropic, “Create custom subagents” (sub-agent independent context and tool set, returns conclusion only; Task tool orchestration; Dynamic Workflows and Agent Teams research preview, 2026-05-28), Claude Code Docs. https://code.claude.com/docs/en/sub-agents (as of 2026-05)
[3] LangChain, “Durable execution,” LangGraph Docs (State / Node / Edge, checkpointing, durable execution; LangGraph 1.2 released 2026-05-11). https://docs.langchain.com/oss/python/langgraph/durable-execution (as of 2026-05)
[4] n8n, “AI Agent node documentation” (LangChain-based, Tools Agent default, human-in-the-loop approval before tool execution; 70+ AI nodes), n8n Docs. https://docs.n8n.io/integrations/builtin/cluster-nodes/root-nodes/n8n-nodes-langchain.agent/ (as of 2026-05)

Connections: 01-3 prompt quality at a single node; 01-4 context assembly per node and sub-agent isolation; 01-6 wrapping a workflow into a harness with monitoring and a kill switch; 05-1 and 05-2 real orchestration designs from two open-source agents.
