TW
← Back to writing

Writing

Designing Agentic Workflows That Don’t Fall Apart In Production

Tallha WaheedJanuary 4, 20253 min read

Patterns for reliable tool-calling and multi-step agents for ops, support, and data teams.

Designing Agentic Workflows That Don’t Fall Apart In Production

Agents look great in demos and brittle in production. This is how I design resilient, observable workflows that use tools, memory, and control logic without creating ticket fire drills.

1) Keep The Control Plane Deterministic

  • Use a graph-based orchestrator (LangGraph-style) so routing is explicit and versioned.
  • Avoid letting the LLM decide routing for critical branches; gate with rules, classifiers, or business logic.
  • Make every tool idempotent or add compensating actions; retries should not duplicate side effects.
  • Add step ceilings and time-boxing so agents cannot loop forever.

2) Separate Short-Term And Long-Term Memory

  • Short-term (scratchpad): structured messages describing prior steps; trim aggressively and redact secrets.
  • Long-term: vector store + metadata filters; index tool outputs and decisions so they’re discoverable later.
  • Caching: cache expensive tool results by parameters; include TTLs and version keys to avoid stale writes.

3) Guardrails Before And After Tool Calls

  • Pre-validate tool arguments (schema, ranges, required fields). Reject early with a friendly recovery prompt.
  • Post-validate tool results (shape, missing fields, anomalies). If validation fails, fall back to a safe template or escalate.
  • Normalize and redact PII before logging or passing to other tools.
  • For high-impact actions (refunds, contract edits), require dual confirmation or human-in-the-loop.

4) Break Down By Business Capability

  • Create capability-specific subagents: “Refund Policy,” “Contract Clause Extractor,” “Call Summarizer,” “KYC Verifier.”
  • Each has its own prompts, tools, guardrails, and tests. This keeps debugging localized and avoids prompt sprawl.
  • Share core utilities (auth, logging, tracing, PII scrubbers) so every node is consistent.

5) Observability That Matches Real Incidents

  • Log every transition: node name, chosen tool, args, output snippet, latency, token cost.
  • Emit metrics per node: success rate, validation failures, retries, user escalations, timeouts.
  • Store a replayable trace (inputs, tool calls, outputs) to reproduce an interaction without hitting production APIs.
  • Add alerting on: rising retry rates, timeout spikes, safety/PII violations, and cost anomalies.

6) Test Like A Workflow, Not A Model

  • Build scenario tests that mirror real user intents end-to-end, not just single prompts.
  • Include “ugly” inputs: partial data, conflicting asks, missing fields, long narratives, and prompt-injection attempts.
  • Regression-test prompts, tool contracts, and classifiers whenever you change dependencies or upgrade models.
  • Keep a golden set per capability; fail the build if any critical scenario regresses.

7) Rollout And Safety

  • Start in shadow mode on live traffic; compare completions, tool usage, and safety outcomes.
  • Canary to a small slice; auto-rollback if success rates or safety drop.
  • Enforce cost and latency budgets per path; downgrade or short-circuit when ceilings are hit.
  • Add human escalation paths for ambiguous or high-risk cases.

8) Patterns That Keep Agents Stable

  • Prefer deterministic routers and small prompts with explicit schemas.
  • Use function/tool calling with typed arguments; reject on validation failures instead of “hoping” the model fixes itself.
  • Keep domain prompts short; move long policy text into retrieval with filters + rerankers.
  • Version everything: prompts, tools, policies, embeddings, and routers. Make rollback a one-flag change.

9) Example Reference Stack

  • Orchestration: LangGraph-style DAGs with typed nodes and explicit edges.
  • Memory: Scratchpad for short-term; vector store + metadata filters for long-term; TTL caches for expensive tools.
  • Guardrails: JSON schema validators, PII scrubbers, toxicity filters, allow/deny lists for tools.
  • Observability: OpenTelemetry tracing, structured logs, replay endpoints, dashboards for success/timeout/safety/cost.
  • Rollout: Shadow → canary → gradual ramp; feature flags per model/tool; auto-rollback on regressions.

Build agents this way and they’ll feel like dependable coworkers instead of unpredictable interns. If you need this in your product, let’s chat.