Claude Code Tools

How does Claude Code connect judgment to computation?

Tool Use

Claude Code's power comes from tool calling — the agent reasons about what to do, then calls tools to execute.

Fact	Detail
What	Claude selects and calls tools (Read, Write, Edit, Bash, Grep, Glob, Task, WebFetch) to accomplish tasks.
Pattern	Reasoning → tool selection → execution → result → next reasoning step.
Docs	Tool use

input_examples

Concrete examples that anchor tool calling consistency. Instead of describing what a tool does abstractly, you show it real input/output pairs.

Fact	Detail
What	Concrete input/output examples attached to tool definitions. Make LLM tool calling consistent across sessions.
Pattern	"Is this more like Pain=5 ($12K/month lost) or Pain=1 (calendar-could-be-better)?"
Docs	input_examples

Our use: The score-prds skill uses calibration examples as input_examples — concrete PRD anchors at each score level so different agents produce convergent scores.

Programmatic Tool Calling

Fact	Detail
What	Claude writes code that calls tools in a sandboxed container. Reduces round-trips between reasoning and execution.
Pattern	Instead of Claude → tool → result → Claude → tool → result, it becomes Claude → writes code that calls N tools → single result.
Sandbox	`code_execution_20260120` tool runs Python in an isolated container with `$TOOL_CALL` function.
Docs	Programmatic tool calling

How It Maps

Layer	Manual (our system)	Programmatic
Who writes code	Human wrote `retired PRD scorer`	Claude writes Python at runtime
Who calls it	Skill tells Claude to run `node scripts/...`	Claude's code calls tools via `$TOOL_CALL`
Execution	Local Node process	Sandboxed container
Determinism	Script does pure math, zero judgment	Code does pure computation, zero judgment
The split	LLM scores → script ranks	LLM reasons → code executes

The core principle is identical: separate judgment from computation.

Where Each Wins

Local scripts (retired PRD scorer): Version-controlled, testable, reviewable by humans, runs without API calls. Better when the computation is stable and reusable.

Programmatic tool calling: Better when computation needs to happen inside a Claude API session — agent platforms where orchestrators compose tool calls dynamically. Relevant for the Agent Platform PRD and BOaaS dispatch.

Scoring Pipeline

The scoring pipeline is the proof-of-concept for both patterns:

JUDGMENT (skill)                  MATH (script)
─────────────────                 ─────────────────
1. Skill loads calibration        4. Script reads frontmatter
   examples (Pain=5 → $12K/mo)   5. Geometric mean, weighted sum
2. Agent reads PRD, scores 1-5    6. Gate checks (BLOCKED, NO PLAYERS)
3. Writes to frontmatter          7. Sort by final score + diff

Layer	What	Deterministic?
Calibration	input_examples pattern anchors scoring	No — but constrained by concrete examples
Scores	10 dimensions per PRD in frontmatter	No — human/LLM judgment
Algorithm	`retired PRD scorer`	Yes — pure math
Enforcement	Gate checks, dependency bonuses	Yes — binary pass/fail

The script is the commissioning principle applied to code: the team that scores (LLM) is never the team that computes the ranking (script).

Extended Thinking

Fact	Detail
What	Claude uses additional "thinking tokens" before responding — visible reasoning that improves complex tasks.
When	Architecture decisions, multi-file refactors, ambiguous requirements.
Config	`alwaysThinkingEnabled: true` in settings.json
Docs	Extended thinking

Our config: Always on. Cost justified by fewer rework cycles on complex tasks.

Token-Efficient Tools

Fact	Detail
What	Optimized tool result formats that reduce token consumption without losing information.
Pattern	Tools return structured, compact results instead of verbose output.
Docs	Token-efficient tools

Relevant when building API-level agent orchestration — every token saved compounds across tool calls in long-running agent sessions.

Context

Claude Code — Concepts, config, best practices
Skills — Reusable procedures and quality gates
Cron — Scheduled automation patterns

Questions

Which engineering decision related to this topic has the highest switching cost once made — and how do you make it well with incomplete information?

At what scale or complexity level does the right answer to this topic change significantly?
How does the introduction of AI-native workflows change the conventional wisdom about this technology?
Which anti-pattern in this area is most commonly introduced by developers who know enough to be dangerous but not enough to know what they don't know?

Tool Use​

input_examples​

Programmatic Tool Calling​

How It Maps​

Where Each Wins​

Scoring Pipeline​

Extended Thinking​

Token-Efficient Tools​

Context​

Questions​