Agentic AI Systems

How do we turn a large agentic AI reference into systems we can build, evaluate, and improve?

This method turns agentic AI research into a bounded system design. It uses Haggai Roitman's 2026 The Hitchhiker's Guide to Agentic AI: From Foundations to Systems as the source trail, then translates the useful parts into Dreamineering operating language.

The core move is simple: build the harness before trusting the agent.

Inputs

  • A target outcome the agent must produce.
  • The action boundary: tools, data, cost, time, authority, and stop rule.
  • A source-grounded recall matrix for architecture, memory, tools, planning, coordination, evaluation, safety, and production.
  • Local proof surfaces: traces, tests, golden trajectories, logs, or receipts.
  • A decision on capability state: REALITY, DREAM, or CONSUMED.

Problem

An agent is not a model with a better prompt. It is a stateful loop around a model.

The source frames the agent harness as runtime infrastructure that wraps a stateless LLM with tool use, memory, orchestration, state, communication, and observability (ch. 18, pp. 343-360).

That is the useful system boundary. The model reasons. The harness executes, records, gates, and recovers.
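
A minimal sketch of that boundary, assuming a scripted stand-in for the model. `Harness`, `scripted_model`, the `lookup` tool, and the shape of the decision dictionary are hypothetical names chosen for illustration; the source does not prescribe them.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Harness:
    """Hypothetical harness: owns tools, history, trace, and the stop rule."""
    model: Callable[[list[str]], dict]        # stateless: history in, decision out
    tools: dict[str, Callable[[str], str]]
    max_steps: int = 5                        # stop rule lives in the harness, not the prompt
    trace: list[dict] = field(default_factory=list)

    def run(self, task: str) -> str:
        history = [f"task: {task}"]
        for step in range(self.max_steps):
            decision = self.model(history)                            # the model reasons
            self.trace.append({"step": step, "decision": decision})   # the harness records
            if decision["type"] == "final":
                return decision["content"]
            tool = self.tools.get(decision["tool"])
            if tool is None:                                          # the harness gates
                observation = f"error: unknown tool {decision['tool']}"
            else:
                try:
                    observation = tool(decision["input"])             # the harness executes
                except Exception as exc:                              # the harness recovers
                    observation = f"error: {exc}"
            history.append(f"observation: {observation}")
        return "stopped: step budget exhausted"


def scripted_model(history: list[str]) -> dict:
    """Stand-in for an LLM call: request one lookup, then answer with the result."""
    if not any(item.startswith("observation:") for item in history):
        return {"type": "tool", "tool": "lookup", "input": "order-17"}
    return {"type": "final", "content": history[-1].removeprefix("observation: ")}


harness = Harness(model=scripted_model, tools={"lookup": lambda key: f"{key} shipped"})
result = harness.run("Where is order 17?")
```

The model function holds no state between calls. Everything that must survive a step (history, trace, step count) sits in the harness, which is why a runaway model still stops at `max_steps`.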

Do not start with a multi-agent team, a tool marketplace, or a memory layer. Start with one unit of work and ask whether a harness can make that work repeatable.

Operating Model

Use five surfaces. Each one must leave proof.

| Surface | Job | Source grounding | Proof |
| --- | --- | --- | --- |
| Harness | Wrap the model with state, tools, memory, routing, and observability. | Agent harness and context budget, ch. 18, pp. 343-359. | Token budget, state schema, trace, stop rule. |
| Memory | Store only what improves future action. | Working, episodic, semantic, and procedural memory, ch. 17, pp. 320-330. | Recall, precision, staleness, provenance. |
| Tools | Expose actions with typed inputs, outputs, constraints, and side effects. | Tool signatures, routing, output processing, and sandboxing, ch. 18, pp. 348-352; MCP, ch. 21, pp. 392-410. | Tool schema, permission level, audit log. |
| Control | Choose the simplest orchestration that works. | ReAct, plan-and-execute, workflow graphs, HITL, ch. 18, pp. 353-356; design patterns, ch. 19, pp. 371-374. | Plan DAG, loop guard, approval gate. |
| Evaluation | Measure the policy, not one answer. | Agent environments, ch. 20, pp. 375-386; testing pyramid and golden trajectories, ch. 25, pp. 475-479. | Unit tests, trajectory tests, behavior tests, cost and latency bounds. |

If one surface has no proof, the agent is still a demo.
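
The harness surface lists a token budget as proof. One way to make that budget checkable is sketched below, assuming the five allocations are fixed integers decided before the first run. `ContextBudget`, `trim_history`, and every number are illustrative assumptions, not values from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextBudget:
    window: int             # total tokens the model accepts (illustrative)
    system_prompt: int
    memory_rag: int
    tool_definitions: int
    history: int
    reserved_output: int

    def allocated(self) -> int:
        return (self.system_prompt + self.memory_rag + self.tool_definitions
                + self.history + self.reserved_output)

    def check(self) -> None:
        """Raise loudly instead of letting anything be truncated silently."""
        over = self.allocated() - self.window
        if over > 0:
            raise ValueError(f"context budget exceeds window by {over} tokens")


def trim_history(turn_tokens: list[int], limit: int) -> tuple[list[int], int]:
    """Keep the newest turns that fit in `limit`; return (kept, dropped_count)."""
    kept: list[int] = []
    total = 0
    for tokens in reversed(turn_tokens):
        if total + tokens > limit:
            break
        kept.append(tokens)
        total += tokens
    kept.reverse()
    return kept, len(turn_tokens) - len(kept)


budget = ContextBudget(window=8000, system_prompt=800, memory_rag=2000,
                       tool_definitions=1200, history=3000, reserved_output=1000)
budget.check()
kept, dropped = trim_history([1500, 1200, 900, 700], budget.history)
```

Only history is trimmed, and the number of dropped turns is returned so the trace can record the loss. The system prompt allocation is never touched by trimming.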

Build Pattern

  1. Name the unit of work.

    Output: one sentence that says what the agent must change in the world.

    Check: a human can tell whether the work is done without reading the trace.

  2. Draw the harness boundary.

    Output: a small contract for model, prompt blocks, memory, tools, state, observability, and stop condition.

    Check: the context budget names system prompt, memory/RAG, tool definitions, history, and reserved output before the first run. The source calls silent truncation a trap because the model can lose instructions without an error signal (ch. 18, pp. 344-347).

  3. Attach tools as contracts.

    Output: each tool has a verb-noun name, typed input, return shape, constraints, permission level, and failure mode.

    Check:

    • tool outputs are treated as untrusted data;
    • destructive tools require approval;
    • read-only tools say they are read-only;
    • use MCP when tool discovery and multi-provider reuse matter;
    • use direct integration when latency and tight coupling dominate (ch. 21, pp. 398-408).

  4. Add memory only where it changes the next action.

    Output: a memory policy for write, read, update, reflect, and delete.

    Check: memory entries carry provenance, timestamp, confidence, and retention logic. The source warns that RAG can introduce hallucination when retrieved content is stale or shallowly relevant, so provenance and faithfulness checks are mandatory (ch. 17, pp. 322-326).

  5. Prove the loop.

    Output: a test and observability pack: tool unit tests, agent-loop integration tests, golden trajectories, behavioral constraints, trace fields, cost bounds, and latency bounds.

    Check: the trace can replay the run, and the failure taxonomy can classify each failure as a tool error, reasoning error, hallucination, loop, context overflow, or false refusal (ch. 25, pp. 475-479).
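
Step 3 can be sketched as a small contract type plus one gated call path. This is an illustration under assumptions: `ToolContract`, `Permission`, `call_tool`, `ApprovalRequired`, and the two example tools are hypothetical names, and the in-memory `records` dictionary stands in for a real system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Permission(Enum):
    READ_ONLY = "read_only"
    WRITE = "write"
    DESTRUCTIVE = "destructive"


@dataclass(frozen=True)
class ToolContract:
    name: str                       # verb-noun, e.g. "delete_record"
    input_type: type                # typed input, checked before the call
    permission: Permission
    failure_mode: str               # what the caller sees when the tool fails
    run: Callable[[object], str]


class ApprovalRequired(Exception):
    """Raised when a destructive tool is called without approval."""


audit_log: list[dict] = []


def call_tool(tool: ToolContract, arg: object, approved: bool = False) -> dict:
    if not isinstance(arg, tool.input_type):
        raise TypeError(f"{tool.name} expects {tool.input_type.__name__}")
    if tool.permission is Permission.DESTRUCTIVE and not approved:
        audit_log.append({"tool": tool.name, "status": "blocked"})
        raise ApprovalRequired(tool.name)
    output = tool.run(arg)
    audit_log.append({"tool": tool.name, "status": "ok"})
    # The output is returned as data and flagged untrusted; it is never run as an instruction.
    return {"tool": tool.name, "trusted": False, "data": output}


records = {"a1": "draft"}   # stand-in data store for the example
read_record = ToolContract("read_record", str, Permission.READ_ONLY,
                           "returns 'not found'", lambda key: records.get(key, "not found"))
delete_record = ToolContract("delete_record", str, Permission.DESTRUCTIVE,
                             "returns 'not found'", lambda key: records.pop(key, "not found"))

preview = call_tool(read_record, "a1")
```

Calling `call_tool(delete_record, "a1")` without `approved=True` raises `ApprovalRequired` and leaves `records` unchanged, while the blocked attempt still lands in `audit_log`.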

Pattern Choice

Start with the lowest control surface that can pass the proof.

| Need | Use | Avoid |
| --- | --- | --- |
| One known sequence | Prompt chain or state machine | Multi-agent routing |
| Unknown next step after each observation | ReAct loop | Fixed plan |
| Long task with stable dependencies | Plan-and-execute DAG | Free-form tool loop |
| High-stakes or irreversible action | Human approval gate | Autonomous execution |
| Many roles with clear handoffs | Supervisor or hierarchy | Peer mesh |
| Reliability through independent checks | Ensemble or reviewer agent | One unchecked generalist |

The source's own pattern guide says to move up the complexity ladder only when the simpler pattern fails (ch. 19, pp. 371-374). Dreamineering adopts that rule.
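
The selection rule can be sketched as a walk over an ordered list that stops at the first pattern passing the proof. The ordering in `LADDER` is an assumption made for this example; the source's own ladder may rank patterns differently. `choose_pattern` and `passes_proof` are hypothetical names.

```python
from typing import Callable, Optional

# Assumed ordering, simplest first. Not taken from the source.
LADDER = ["prompt_chain", "react_loop", "plan_and_execute", "supervisor"]


def choose_pattern(passes_proof: Callable[[str], bool],
                   ladder: list[str] = LADDER) -> Optional[str]:
    """Return the first (simplest) pattern whose proof passes, or None if none do."""
    for pattern in ladder:
        if passes_proof(pattern):
            return pattern
    return None


# Illustrative proof results, e.g. collected from a trajectory test run per pattern.
proof_results = {"prompt_chain": False, "react_loop": True, "plan_and_execute": True}
chosen = choose_pattern(lambda pattern: proof_results.get(pattern, False))
```

Here `chosen` is `"react_loop"`: the plan-and-execute pattern also passes, but it is never reached because a simpler pattern already cleared the proof.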

Evaluation Loop

Run this loop before promoting a capability from DREAM to REALITY.

  1. Define the target tasks, allowed actions, refusal cases, and success metric.
  2. Instrument every model call, tool call, state transition, token count, cost, and error.
  3. Test tools first, then full agent loops, then golden trajectories, then behavior constraints.
  4. Run the agent in a sandbox or shadow mode before live authority.
  5. Review failures by class and change the smallest surface that caused the variance.

Useful metrics:

  • Task success rate.
  • Tool-call accuracy.
  • Recovery rate after initial failure.
  • Human escalation rate.
  • Cost per task.
  • Time to first useful output.
  • Memory recall, precision, staleness, and token efficiency.
  • Faithfulness of claims to retrieved context.
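
Several of these metrics reduce to ratios over run records. The sketch below assumes one record per run with the fields shown; `RunRecord`, `summarize`, the field names, and the sample values are hypothetical, and the exact definitions (for example, recovery rate as successes among runs whose first attempt failed) are assumptions rather than definitions from the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunRecord:
    success: bool
    initial_failure: bool      # the first attempt at the task failed
    escalated: bool            # a human had to step in
    cost_cents: int
    tool_calls: int
    correct_tool_calls: int


def ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return None rather than a misleading 0.0 when there is nothing to measure."""
    return numerator / denominator if denominator else None


def summarize(runs: list[RunRecord]) -> dict[str, Optional[float]]:
    failed_first = [r for r in runs if r.initial_failure]
    return {
        "task_success_rate": ratio(sum(r.success for r in runs), len(runs)),
        "tool_call_accuracy": ratio(sum(r.correct_tool_calls for r in runs),
                                    sum(r.tool_calls for r in runs)),
        "recovery_rate": ratio(sum(r.success for r in failed_first), len(failed_first)),
        "human_escalation_rate": ratio(sum(r.escalated for r in runs), len(runs)),
        "cost_per_task_cents": ratio(sum(r.cost_cents for r in runs), len(runs)),
    }


runs = [
    RunRecord(True, False, False, 10, 4, 4),
    RunRecord(True, True, False, 30, 6, 4),
    RunRecord(False, True, True, 20, 5, 3),
    RunRecord(True, False, False, 20, 5, 5),
]
metrics = summarize(runs)
```

Computing the metrics over many runs, not one, is what separates measuring the policy from grading a single answer.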

Failure Modes

  • Harness gap - the model is asked to own state, retries, permissions, or observability.
  • Tool blur - the agent can call tools whose side effects, scope, or auth owner are unclear.
  • Memory bloat - the system stores everything, then retrieves noise.
  • Context overflow - history or tool output silently pushes out the system prompt or source.
  • Loop drift - the agent repeats work because there is no max-iteration guard or progress check.
  • Coordination tax - multi-agent messages cost more than the value they add.
  • Eval theater - one happy-path answer passes, but the policy fails under varied runs.
  • Safety afterthought - guardrails are prompt text only, not enforced constraints.

Source Matrix

| Question | Adopt | Reject or rename |
| --- | --- | --- |
| What is an agent? | A model inside a harness with tools, memory, state, and observations. | "Agent" as any chat model with a prompt. |
| What components matter? | Harness, memory, tools, orchestration, state, observability, evaluation. | Framework-first design. |
| What memory matters? | Working, episodic, semantic, and procedural memory with provenance. | Storing every transcript as useful memory. |
| How should tools be exposed? | Typed contracts with constraints, side-effect labels, and audit logs. | Vague tool names and invisible permissions. |
| How should planning be bounded? | DAGs, stop rules, replanning triggers, and human gates. | Open-ended "keep going" loops. |
| How should multi-agent systems coordinate? | Start centralized, use shared state, add hierarchy when scale demands it. | Peer meshes before traceability. |
| What evals prove it works? | Tool tests, loop tests, golden trajectories, behavior tests, traces, cost bounds. | Single response grading. |
| What safety gates are required? | Sandboxing, least privilege, approval gates, prompt-injection handling, halt rules. | Trusting tool output as instruction. |

Proof Of Done

An agentic AI system is ready to ship when:

  • the harness boundary is explicit;
  • every action has a contract and permission level;
  • memory writes are selective and auditable;
  • the loop has a stop rule and recovery path;
  • the evaluation pack can replay success and failure;
  • the capability state is marked REALITY only after proof exists.
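
The checklist can be turned into a promotion gate. The sketch below assumes each proof is recorded as a pointer to an artifact; `REQUIRED_PROOFS`, `missing_proofs`, `capability_state`, and the file paths are hypothetical. The CONSUMED state is left out because this page does not define how a capability reaches it.

```python
from enum import Enum


class CapabilityState(Enum):
    DREAM = "DREAM"
    REALITY = "REALITY"


# One key per checklist item that must exist before promotion (names are assumptions).
REQUIRED_PROOFS = (
    "harness_boundary",
    "tool_contracts",
    "memory_policy",
    "stop_rule_and_recovery",
    "replayable_eval_pack",
)


def missing_proofs(evidence: dict[str, str]) -> list[str]:
    """List required proofs that have no artifact recorded."""
    return [name for name in REQUIRED_PROOFS if not evidence.get(name)]


def capability_state(evidence: dict[str, str]) -> CapabilityState:
    """REALITY only when every required proof points at an artifact."""
    return CapabilityState.DREAM if missing_proofs(evidence) else CapabilityState.REALITY


# Illustrative evidence: the evaluation pack has not been produced yet.
evidence = {
    "harness_boundary": "docs/harness-contract.md",
    "tool_contracts": "tools/contracts.json",
    "memory_policy": "docs/memory-policy.md",
    "stop_rule_and_recovery": "tests/test_loop_guard.py",
}
state = capability_state(evidence)
gaps = missing_proofs(evidence)
```

With the evaluation pack missing, `state` stays `CapabilityState.DREAM` and `gaps` names the one proof that blocks promotion.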

Context

  • Agent Operating Model - accountability loop for intent, capability, action, receipt, and consequence.
  • MCP Toolkit - tool reality, auth, scope, and remediation.
  • Skills - procedural memory as reusable workflow capability.
  • Agent-Operable Codebase - codebase surfaces that let agents act with proof.
  • Agentic Workflows - when a workflow earns more structure.
  • Roitman, H. (2026). The Hitchhiker's Guide to Agentic AI: From Foundations to Systems, arXiv:2606.24937v1.

Questions

Which missing proof keeps this agent in DREAM state?

  • What state must the harness own instead of the model?
  • Which tool needs a tighter permission contract?
  • Which memory write rule would reduce future noise?
  • Which golden trajectory should catch the next regression?
  • Which human gate protects irreversible action?