
Claude Code Tools

How does Claude Code connect judgment to computation?


Tool Use

Claude Code's power comes from tool calling: the agent reasons about what to do, then calls tools to carry it out.

| Fact | Detail |
|---|---|
| What | Claude selects and calls tools (Read, Write, Edit, Bash, Grep, Glob, Task, WebFetch) to accomplish tasks. |
| Pattern | Reasoning → tool selection → execution → result → next reasoning step. |
| Docs | Tool use |
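
The loop in the Pattern row can be sketched in a few lines of Python. This is a toy model, not Claude Code's implementation: `fake_model`, the `TOOLS` registry, and the canned plan are all hypothetical stand-ins, and the tool bodies only return strings.

```python
from typing import Callable, Optional

# Hypothetical tool registry. The names mirror tools listed above, but the
# bodies are stand-ins that return canned strings.
TOOLS: dict[str, Callable[[str], str]] = {
    "Grep": lambda arg: f"3 matches for '{arg}'",
    "Read": lambda arg: f"contents of {arg}",
}

def fake_model(history: list[str]) -> Optional[tuple[str, str]]:
    """Stand-in for the reasoning step: pick the next tool, or None to stop."""
    plan = [("Grep", "TODO"), ("Read", "src/app.py")]
    step = len(history)
    return plan[step] if step < len(plan) else None

def run_loop() -> list[str]:
    history: list[str] = []
    while True:
        choice = fake_model(history)         # reasoning + tool selection
        if choice is None:
            return history
        name, arg = choice
        result = TOOLS[name](arg)            # execution
        history.append(f"{name}: {result}")  # result feeds the next reasoning step
```

Each pass through the loop is one round-trip between reasoning and execution, which is the cost that programmatic tool calling tries to reduce.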

input_examples

input_examples are concrete examples that anchor tool-calling consistency. Instead of describing abstractly how a tool should be called, you show the model real example inputs.

| Fact | Detail |
|---|---|
| What | Concrete example inputs attached to tool definitions. They make LLM tool calling more consistent across sessions. |
| Pattern | "Is this more like Pain=5 ($12K/month lost) or Pain=1 (calendar-could-be-better)?" |
| Docs | input_examples |

Our use: The score-prds skill uses calibration examples as input_examples — concrete PRD anchors at each score level so different agents produce convergent scores.
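
A rough sketch of what anchored examples on a tool definition could look like, written as a plain Python dict. The tool name `record_prd_score`, its schema, and the `rationale` field are invented for illustration; only the Pain=5 and Pain=1 anchors come from the text above. Check the linked docs for the exact field placement before relying on it.

```python
# Hypothetical tool definition. Name, schema, and field names are assumptions;
# the two anchor values are the calibration examples quoted above.
score_tool = {
    "name": "record_prd_score",
    "description": "Record a 1-5 Pain score for a PRD with a one-line rationale.",
    "input_schema": {
        "type": "object",
        "properties": {
            "pain": {"type": "integer", "minimum": 1, "maximum": 5},
            "rationale": {"type": "string"},
        },
        "required": ["pain", "rationale"],
    },
    "input_examples": [
        {"pain": 5, "rationale": "$12K/month lost"},
        {"pain": 1, "rationale": "calendar could be better"},
    ],
}

def examples_match_schema(tool: dict) -> bool:
    """Check every example has the required fields and an in-range pain score."""
    schema = tool["input_schema"]
    required = set(schema["required"])
    pain = schema["properties"]["pain"]
    return all(
        required <= set(ex) and pain["minimum"] <= ex["pain"] <= pain["maximum"]
        for ex in tool["input_examples"]
    )
```

Validating the anchors against the schema keeps a calibration example from drifting out of range when the scale changes.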


Programmatic Tool Calling

| Fact | Detail |
|---|---|
| What | Claude writes code that calls tools in a sandboxed container. Reduces round-trips between reasoning and execution. |
| Pattern | Instead of Claude → tool → result → Claude → tool → result, it becomes Claude → writes code that calls N tools → single result. |
| Sandbox | code_execution_20260120 tool runs Python in an isolated container with $TOOL_CALL function. |
| Docs | Programmatic tool calling |
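
To make the round-trip saving concrete, here is a local simulation of the kind of script a model might write inside the sandbox. Nothing here touches a real API: `tool_call` is a hypothetical stand-in for the $TOOL_CALL hook, and the tool name, file names, and scores are made up.

```python
# Stand-in for the sandbox's tool-calling hook. The name, signature, and
# canned data are assumptions for illustration only.
CALLS: list[str] = []

def tool_call(name: str, **kwargs) -> dict:
    CALLS.append(name)  # count how many tool invocations happen
    fake_scores = {"prd-a.md": 4, "prd-b.md": 2, "prd-c.md": 5}
    return {"path": kwargs["path"], "pain": fake_scores[kwargs["path"]]}

def generated_script(paths: list[str]) -> dict:
    """N tool calls inside one script, reduced to one compact result."""
    rows = [tool_call("read_frontmatter", path=p) for p in paths]
    top = max(rows, key=lambda r: r["pain"])
    return {"count": len(rows), "top": top["path"]}

summary = generated_script(["prd-a.md", "prd-b.md", "prd-c.md"])
```

Three tool calls run, but only `summary` would go back to the model, so the intermediate rows never pass through a reasoning step.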

How It Maps

| Layer | Manual (our system) | Programmatic |
|---|---|---|
| Who writes code | Human wrote retired PRD scorer | Claude writes Python at runtime |
| Who calls it | Skill tells Claude to run node scripts/... | Claude's code calls tools via $TOOL_CALL |
| Execution | Local Node process | Sandboxed container |
| Determinism | Script does pure math, zero judgment | Code does pure computation, zero judgment |
| The split | LLM scores → script ranks | LLM reasons → code executes |

The core principle is identical: separate judgment from computation.

Where Each Wins

Local scripts (retired PRD scorer): Version-controlled, testable, reviewable by humans, runs without API calls. Better when the computation is stable and reusable.

Programmatic tool calling: Better when computation needs to happen inside a Claude API session — agent platforms where orchestrators compose tool calls dynamically. Relevant for the Agent Platform PRD and BOaaS dispatch.


Scoring Pipeline

The scoring pipeline is the proof-of-concept for both patterns:

JUDGMENT (skill)                     MATH (script)
─────────────────                    ─────────────────
1. Skill loads calibration           4. Script reads frontmatter
   examples (Pain=5 → $12K/mo)       5. Geometric mean, weighted sum
2. Agent reads PRD, scores 1-5       6. Gate checks (BLOCKED, NO PLAYERS)
3. Writes to frontmatter             7. Sort by final score + diff

| Layer | What | Deterministic? |
|---|---|---|
| Calibration | input_examples pattern anchors scoring | No, but constrained by concrete examples |
| Scores | 10 dimensions per PRD in frontmatter | No (human/LLM judgment) |
| Algorithm | retired PRD scorer | Yes (pure math) |
| Enforcement | Gate checks, dependency bonuses | Yes (binary pass/fail) |

The script is the commissioning principle applied to code: the team that scores (LLM) is never the team that computes the ranking (script).
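
The MATH side might look roughly like the sketch below. The retired scorer's actual formula is not shown on this page, so everything specific is an assumption: the three dimension names (the real system uses 10), the weights, and the choice to average the geometric mean with the weighted sum. The original ran as a Node script; Python is used here only for brevity.

```python
from math import prod

# Hypothetical dimensions and weights; the real scorer's values are not given.
WEIGHTS = {"pain": 0.5, "demand": 0.3, "edge": 0.2}
GATES = ("BLOCKED", "NO PLAYERS")  # gate labels taken from the pipeline above

def geometric_mean(values: list[float]) -> float:
    return prod(values) ** (1 / len(values))

def final_score(prd: dict) -> float:
    """Combine geometric mean and weighted sum (combination rule is assumed)."""
    scores = prd["scores"]
    gm = geometric_mean(list(scores.values()))
    ws = sum(WEIGHTS[k] * v for k, v in scores.items())
    return round((gm + ws) / 2, 3)

def rank(prds: list[dict]) -> list[dict]:
    """Drop gated PRDs (binary pass/fail), then sort by final score, highest first."""
    passing = [p for p in prds if p.get("gate") not in GATES]
    return sorted(passing, key=final_score, reverse=True)
```

The function takes scores as input and never produces them, so the same frontmatter always yields the same ranking regardless of which agent did the scoring.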


Extended Thinking

| Fact | Detail |
|---|---|
| What | Claude uses additional "thinking tokens" before responding: visible reasoning that improves complex tasks. |
| When | Architecture decisions, multi-file refactors, ambiguous requirements. |
| Config | alwaysThinkingEnabled: true in settings.json |
| Docs | Extended thinking |

Our config: Always on. Cost justified by fewer rework cycles on complex tasks.
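
As a config fragment, the setting named in the table would look like this in settings.json. Placement at the top level of the file is assumed here; any other keys already in the file stay as they are.

```json
{
  "alwaysThinkingEnabled": true
}
```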


Token-Efficient Tools

| Fact | Detail |
|---|---|
| What | Optimized tool result formats that reduce token consumption without losing information. |
| Pattern | Tools return structured, compact results instead of verbose output. |
| Docs | Token-efficient tools |

Relevant when building API-level agent orchestration — every token saved compounds across tool calls in long-running agent sessions.
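
A small illustration of the compact-result pattern, under stated assumptions: the match records are invented, and character count is used only as a rough proxy for tokens, since actual token counts depend on the tokenizer.

```python
import json

def verbose_result(matches: list[dict]) -> str:
    """One prose sentence per match, the kind of output that bloats context."""
    return "\n".join(
        f"Found a match in file {m['file']} on line number {m['line']}: {m['text']}"
        for m in matches
    )

def compact_result(matches: list[dict]) -> str:
    """Same information as compact JSON rows of [file, line, text]."""
    rows = [[m["file"], m["line"], m["text"]] for m in matches]
    return json.dumps(rows, separators=(",", ":"))

# Hypothetical search results for illustration.
matches = [
    {"file": "a.py", "line": 3, "text": "TODO: fix"},
    {"file": "b.py", "line": 9, "text": "TODO: test"},
]
```

Both functions carry the same three fields per match; the compact form drops only the repeated prose, so nothing the agent needs is lost.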

Context

  • Claude Code — Concepts, config, best practices
  • Skills — Reusable procedures and quality gates
  • Cron — Scheduled automation patterns

Questions

Which engineering decision related to this topic has the highest switching cost once made — and how do you make it well with incomplete information?

  • At what scale or complexity level does the right answer to this topic change significantly?
  • How does the introduction of AI-native workflows change the conventional wisdom about this technology?
  • Which anti-pattern in this area is most commonly introduced by developers who know enough to be dangerous but not enough to know what they don't know?