In March 2023, a developer built a ChatGPT-powered assistant that could check the weather, look up flight prices, and book restaurant reservations — all within a single conversation. The trick? The AI never actually called a single API itself. Instead, it told the developer’s code exactly which function to call and with which arguments, received the results, and wove them into a seamless natural language response. The user had no idea they were talking to a text generator that couldn’t actually do anything on its own. That trick has a name: tool calling. And it’s the single most important capability that transformed large language models from impressive text generators into agents that can interact with the real world.
Here’s the uncomfortable truth about LLMs: they are fundamentally trapped. An LLM doesn’t know today’s date. It can’t check a stock price. It can’t query your database, send an email, or read a file on your computer. It only knows what was in its training data (which is months or years old) and whatever you include in the current conversation. Without tool calling, asking an LLM “What’s NVIDIA’s stock price right now?” gets you a polite apology and a reminder of its knowledge cutoff date.
Tool calling changed everything. It’s the mechanism that lets an AI model say, “I don’t know the answer to this, but I know which function to call to get the answer — and here are the exact arguments.” Your code then executes that function, feeds the result back to the model, and the model responds to the user as if it knew all along. This is how ChatGPT plugins work. This is how Claude Code reads and writes files. This is how every AI agent operates under the hood.
In this guide, I’m going to break down tool calling from the ground up. You’ll learn exactly how it works, see complete code examples for Claude and OpenAI, understand the differences between providers, and walk away with everything you need to build your own tool-calling applications. Whether you’re a developer building AI-powered products or an investor evaluating AI companies, understanding tool calling is essential — it’s the bridge between “AI that talks” and “AI that acts.”
What Is Tool Calling?
Tool calling (also called function calling) is a mechanism where a large language model can request the execution of external functions or APIs during a conversation. Instead of trying to answer everything from memory, the model can reach out to the real world — checking databases, calling APIs, performing calculations, or executing code — by asking your application to run specific functions on its behalf.
The key insight is deceptively simple: the model doesn’t execute the tools itself. It generates a structured request — a function name plus arguments in JSON format — and your code is responsible for actually executing it. The result gets sent back to the model, which then incorporates it into its response.
Think of it like a brain and hands. The LLM is the brain: it plans, reasons, and decides what needs to happen. The tools are the hands: they actually do things in the physical world. The brain can’t pick up a cup of coffee on its own, but it can tell the hands exactly how to do it. Similarly, an LLM can’t check the weather, but it can tell your code to call a weather API with specific coordinates and interpret the result.
The Three-Step Loop
Every tool calling interaction follows the same fundamental pattern:
- User asks something → “What’s the weather in Tokyo right now?”
- Model decides to call a tool → Outputs structured JSON: `{"name": "get_weather", "arguments": {"city": "Tokyo"}}`
- Your code executes the tool → Calls the weather API, gets the result → Sends it back to the model
- Model responds naturally → “It’s currently 22°C and sunny in Tokyo with a light breeze from the east.”
Here’s the full flow described step by step:
```
┌─────────┐   "What's the weather  ┌─────────┐
│         │    in Tokyo?"          │         │
│  User   │ ─────────────────────→ │  Your   │
│         │                        │  App    │
└─────────┘                        └────┬────┘
                                        │
                      Sends message +   │
                      tool definitions  │
                                        ▼
                                   ┌─────────┐
                                   │         │
                                   │  LLM    │
                                   │  (API)  │
                                   └────┬────┘
                                        │
                      Returns:          │
                      tool_use:         │
                      get_weather       │
                      {"city":"Tokyo"}  │
                                        ▼
                                   ┌─────────┐
                                   │  Your   │
                                   │  App    │ ──→ Calls weather API
                                   │(execute)│ ←── Gets result: 22°C
                                   └────┬────┘
                                        │
                      Sends tool_result │
                      back to LLM       │
                                        ▼
                                   ┌─────────┐
                                   │  LLM    │
                                   │  (API)  │
                                   └────┬────┘
                                        │
                      Final response:   │
                      "It's 22°C and    │
                      sunny in Tokyo"   │
                                        ▼
                                   ┌─────────┐
                                   │  User   │
                                   │  sees   │
                                   │ response│
                                   └─────────┘
```
Why This Is Revolutionary
Before tool calling: LLMs could only generate text. They were extraordinarily good at it, but they were fundamentally disconnected from the world. Ask for today’s weather and you’d get a hallucinated guess or an apology. Ask them to send an email and they’d write you a draft you’d have to copy-paste yourself.
After tool calling: LLMs can take actions. They can check real-time data, interact with databases, control software, browse the web, manage files, send messages, and orchestrate complex multi-step workflows. The same text-generation capability that was previously limited to chat responses now powers decision-making about which actions to take and how to interpret the results.
This single capability — the ability for a model to say “call this function with these arguments” — is what turned LLMs from chatbots into agents. Every AI agent framework, every chatbot plugin system, and every autonomous AI workflow is built on tool calling.
How Tool Calling Works Under the Hood
Let’s walk through each step of the tool calling process in detail, with the actual data structures you’ll encounter when building with the APIs.
Step 1: Tool Definition
Before the model can use any tools, you have to tell it what tools are available. You do this by including a tool definition in your API request. Each tool definition is a JSON Schema that describes the function’s name, what it does, and what parameters it accepts.
```json
{
  "name": "get_current_weather",
  "description": "Get the current weather conditions for a specific city. Returns temperature in Celsius, weather condition, humidity, and wind speed. Use this when the user asks about current weather, temperature, or atmospheric conditions for any location.",
  "input_schema": {
    "type": "object",
    "properties": {
      "city": {
        "type": "string",
        "description": "The city name, e.g. 'Tokyo', 'New York', 'London'"
      },
      "units": {
        "type": "string",
        "enum": ["celsius", "fahrenheit"],
        "description": "Temperature units. Defaults to celsius.",
        "default": "celsius"
      }
    },
    "required": ["city"]
  }
}
```
The description is critically important — it’s what the model reads to decide when to use this tool. A vague description like “weather stuff” will lead to the model using the tool at the wrong times or not using it when it should. A detailed description like the one above helps the model make precise decisions.
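Because the arguments the model produces are still generated text, it's also worth validating them against the schema before execution. In production you'd typically use a full JSON Schema library such as `jsonschema`; here is a minimal hand-rolled sketch of the idea (the `validate_args` helper is illustrative, not part of any SDK):

```python
# Minimal argument-validation sketch — the schema dict mirrors the
# get_current_weather definition above. A real app would use a full
# JSON Schema validator instead of this hand-rolled check.
schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["city"],
}

def validate_args(args: dict, schema: dict) -> list:
    """Return a list of validation errors; an empty list means valid."""
    errors = []
    for key in schema.get("required", []):
        if key not in args:
            errors.append(f"missing required argument: {key}")
    for key, value in args.items():
        prop = schema["properties"].get(key)
        if prop is None:
            errors.append(f"unexpected argument: {key}")
        elif prop["type"] == "string" and not isinstance(value, str):
            errors.append(f"{key} must be a string")
        elif "enum" in prop and value not in prop["enum"]:
            errors.append(f"{key} must be one of {prop['enum']}")
    return errors

print(validate_args({"city": "Tokyo", "units": "kelvin"}, schema))
# → ["units must be one of ['celsius', 'fahrenheit']"]
```

Rejecting bad arguments with a clear error message — and sending that error back as the tool result — gives the model a chance to correct itself on the next turn.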
Step 2: Tool Selection
When the model receives a user message along with tool definitions, it makes a decision: should it respond directly, or should it call one or more tools first? This decision is made by the model itself — it’s part of the model’s inference process, not a separate system.
The model considers:
- Does the user’s request require information I don’t have?
- Is there a tool that can provide this information?
- What arguments should I pass to the tool?
- Do I need to call multiple tools?
- Should I call tools in parallel or sequentially?
If the user asks “What’s 2 + 2?”, the model will answer directly — no tool needed. If the user asks “What’s the weather in Tokyo?”, and a get_current_weather tool is available, the model will decide to call it.
Step 3: Structured Output
When the model decides to call a tool, it doesn’t output free-form text. Instead, it outputs a structured tool_use block with the function name and arguments as valid JSON:
```json
{
  "role": "assistant",
  "content": [
    {
      "type": "tool_use",
      "id": "toolu_01A09q90qw90lq917835lq9",
      "name": "get_current_weather",
      "input": {
        "city": "Tokyo",
        "units": "celsius"
      }
    }
  ]
}
```
This is not a suggestion or a natural language request — it’s a precisely structured instruction. The function name matches exactly what you defined, and the arguments conform to the JSON Schema you provided. This is what makes tool calling reliable: the model doesn’t say “maybe try checking the weather”; it says call `get_current_weather` with `{"city": "Tokyo", "units": "celsius"}`.
Step 4: Execution
Your application code receives this tool_use block, parses it, and executes the actual function. This is where the real work happens — you make the API call, run the database query, perform the calculation, or whatever the tool does:
```python
# Your code — NOT the model's code
import requests

API_KEY = "your_openweathermap_api_key"

def get_current_weather(city: str, units: str = "celsius") -> dict:
    response = requests.get(
        "https://api.openweathermap.org/data/2.5/weather",
        params={"q": city, "units": "metric", "appid": API_KEY}
    )
    data = response.json()
    return {
        "city": city,
        "temperature": data["main"]["temp"],
        "condition": data["weather"][0]["description"],
        "humidity": data["main"]["humidity"],
        "wind_speed": data["wind"]["speed"]
    }
```
Step 5: Result Injection
You send the tool result back to the model as a tool_result message:
```json
{
  "role": "user",
  "content": [
    {
      "type": "tool_result",
      "tool_use_id": "toolu_01A09q90qw90lq917835lq9",
      "content": "{\"city\": \"Tokyo\", \"temperature\": 22, \"condition\": \"clear sky\", \"humidity\": 45, \"wind_speed\": 3.6}"
    }
  ]
}
```
Step 6: Final Response
The model reads the tool result and generates a natural language response for the user. It doesn’t just parrot the raw data — it interprets it, adds context, and presents it conversationally:
“Right now in Tokyo, it’s a beautiful 22°C with clear skies. Humidity is at a comfortable 45%, and there’s a gentle breeze at 3.6 m/s. Perfect weather for a walk!”
Multi-Tool and Iterative Tool Use
Modern models can call multiple tools in a single turn. If a user asks “What’s the weather in Tokyo and New York?”, the model can output two tool_use blocks simultaneously — a parallel tool call. Your code executes both and sends both results back.
Models can also use tools iteratively. In a complex task, the model might call tool A, examine the result, decide it needs more information, call tool B, examine that result, and then finally respond. This iterative capability is the foundation of AI agents — the model keeps calling tools in a loop until it has enough information to complete the task.
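To make the iterative loop concrete, here is a provider-agnostic sketch. Everything in it is simulated — `fake_model` is a stand-in for a real API call, and the message format is simplified — but the control flow is exactly the shape described above:

```python
import json

def get_weather(city):  # stand-in tool implementation
    return {"city": city, "temperature": 22}

TOOLS = {"get_weather": get_weather}

def fake_model(messages):
    """Stand-in for an LLM API call (hypothetical, for illustration).
    Requests a tool on the first turn, then answers using whatever tool
    result it finds in the transcript."""
    for msg in messages:
        if msg["role"] == "tool":
            data = json.loads(msg["content"])
            return {"tool_calls": [],
                    "text": f"It's {data['temperature']}°C in {data['city']}."}
    return {"tool_calls": [{"name": "get_weather",
                            "arguments": {"city": "Tokyo"}}],
            "text": None}

def agent_loop(user_message, max_turns=5):
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):
        reply = fake_model(messages)
        if not reply["tool_calls"]:       # model is done: return final text
            return reply["text"]
        for call in reply["tool_calls"]:  # execute every requested tool
            result = TOOLS[call["name"]](**call["arguments"])
            messages.append({"role": "tool", "content": json.dumps(result)})
    return "max turns reached"

print(agent_loop("What's the weather in Tokyo?"))
# → It's 22°C in Tokyo.
```

Swap `fake_model` for a real API call and `TOOLS` for real implementations, and this is essentially a complete agent.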
Tool Calling Across Major AI Providers
The core concept is the same across providers, but the API formats differ. Let’s look at each major provider in turn, with complete, runnable code for Claude and OpenAI and a summary of how Gemini differs.
Anthropic Claude (Messages API)
Claude’s tool calling uses a clean, content-block-based format. Tools are defined with input_schema (standard JSON Schema), and the model responds with tool_use content blocks.
Here’s a complete, runnable Python example:
```python
import anthropic
import json

client = anthropic.Anthropic()  # Uses ANTHROPIC_API_KEY env var

# Define tools
tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a city. Returns temperature (Celsius), condition, humidity, and wind speed.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {
                    "type": "string",
                    "description": "City name, e.g. 'Tokyo', 'London'"
                }
            },
            "required": ["city"]
        }
    },
    {
        "name": "get_stock_price",
        "description": "Get the current stock price for a given ticker symbol. Returns price in USD, daily change, and percentage change.",
        "input_schema": {
            "type": "object",
            "properties": {
                "ticker": {
                    "type": "string",
                    "description": "Stock ticker symbol, e.g. 'AAPL', 'NVDA', 'GOOGL'"
                }
            },
            "required": ["ticker"]
        }
    }
]

# Simulated tool implementations
def get_weather(city: str) -> dict:
    # In production, call a real weather API
    return {"city": city, "temperature": 22, "condition": "sunny", "humidity": 45}

def get_stock_price(ticker: str) -> dict:
    # In production, call a real stock API
    return {"ticker": ticker, "price": 875.30, "change": +12.50, "percent_change": "+1.45%"}

# Map function names to implementations
tool_functions = {
    "get_weather": get_weather,
    "get_stock_price": get_stock_price,
}

# Send initial message with tools
messages = [{"role": "user", "content": "What's the weather in Tokyo and NVIDIA's stock price?"}]

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    tools=tools,
    messages=messages
)

print(f"Stop reason: {response.stop_reason}")

# Process tool calls
while response.stop_reason == "tool_use":
    # Collect all tool use blocks
    tool_results = []
    for block in response.content:
        if block.type == "tool_use":
            # Execute the tool
            func = tool_functions[block.name]
            result = func(**block.input)
            print(f"Called {block.name}({block.input}) → {result}")
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": json.dumps(result)
            })
    # Send results back to Claude
    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": tool_results})
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        tools=tools,
        messages=messages
    )

# Print final response
for block in response.content:
    if hasattr(block, "text"):
        print(f"\nClaude's response:\n{block.text}")
```
Claude also accepts a `tool_choice` parameter to control tool usage: `"auto"` (model decides), `"any"` (must use at least one tool), or `{"type": "tool", "name": "get_weather"}` (must use a specific tool). Use `"auto"` for most cases.
Claude-specific features:
- Parallel tool calls: Claude can output multiple `tool_use` blocks in a single response, allowing you to execute them in parallel
- Streaming with tools: Tool calls work with streaming — you receive `content_block_start` events for `tool_use` blocks as they’re generated
- Tool choice control: Fine-grained control over when the model uses tools via `tool_choice`
- Large tool sets: Claude handles large numbers of tools well, though keeping it under 20 is recommended for optimal performance
OpenAI GPT (Chat Completions API)
OpenAI’s format uses a tools array with type: "function" wrappers. The response includes a tool_calls array, and results are sent back as messages with role: "tool".
```python
from openai import OpenAI
import json

client = OpenAI()  # Uses OPENAI_API_KEY env var

# Define tools — note the different format from Claude
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "City name, e.g. 'Tokyo'"
                    }
                },
                "required": ["city"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_stock_price",
            "description": "Get the current stock price for a ticker symbol.",
            "parameters": {
                "type": "object",
                "properties": {
                    "ticker": {
                        "type": "string",
                        "description": "Stock ticker, e.g. 'NVDA'"
                    }
                },
                "required": ["ticker"]
            }
        }
    }
]

# Same tool implementations as above
def get_weather(city):
    return {"city": city, "temperature": 22, "condition": "sunny"}

def get_stock_price(ticker):
    return {"ticker": ticker, "price": 875.30, "change": "+1.45%"}

tool_functions = {"get_weather": get_weather, "get_stock_price": get_stock_price}

messages = [{"role": "user", "content": "What's the weather in Tokyo and NVIDIA's stock price?"}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools,
    tool_choice="auto"
)

message = response.choices[0].message

# Process tool calls
while message.tool_calls:
    messages.append(message)  # Add assistant message with tool calls
    for tool_call in message.tool_calls:
        func = tool_functions[tool_call.function.name]
        args = json.loads(tool_call.function.arguments)
        result = func(**args)
        # Note: OpenAI uses role="tool" instead of tool_result content blocks
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result)
        })
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools
    )
    message = response.choices[0].message

print(message.content)
```
Google Gemini
Gemini’s function calling follows a similar pattern but with its own API format. Tool definitions use FunctionDeclaration objects, and responses include function_call parts. Gemini supports both automatic and manual function calling modes, and can handle parallel function calls similar to Claude and GPT.
The key difference with Gemini is its tight integration with Google’s ecosystem — function calling works seamlessly with Google Search, Google Maps, and other Google APIs as built-in tools.
Provider Comparison
| Feature | Claude (Anthropic) | GPT (OpenAI) | Gemini (Google) |
|---|---|---|---|
| Tool definition key | `input_schema` | `parameters` | `parameters` |
| Tool call format | `tool_use` content block | `tool_calls` array | `function_call` part |
| Result format | `tool_result` content block | `role: "tool"` message | `function_response` part |
| Parallel tool calls | Yes | Yes | Yes |
| Streaming with tools | Yes | Yes | Yes |
| Tool choice control | auto / any / specific | auto / none / required / specific | auto / none / specific |
| JSON reliability | Excellent | Excellent | Good |
| Stop reason indicator | `stop_reason: "tool_use"` | `finish_reason: "tool_calls"` | Part type check |
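One practical consequence of the comparison above: because both Claude and OpenAI accept standard JSON Schema for parameters, translating a tool definition between formats is purely mechanical. A small illustrative converter (the `claude_to_openai` helper is my own, not an SDK function):

```python
def claude_to_openai(tool: dict) -> dict:
    """Convert a Claude-style tool definition (input_schema) into
    OpenAI's Chat Completions format (type=function wrapper, parameters)."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            "parameters": tool["input_schema"],
        },
    }

claude_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

print(claude_to_openai(claude_tool)["function"]["name"])  # → get_weather
```

Keeping a single canonical tool registry and converting at the edges is a common way to stay provider-agnostic.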
Practical Tool Calling Examples (with Complete Code)
Theory is great, but let’s build real things. Here are four complete examples that demonstrate increasingly complex tool calling patterns.
Example 1: Chained Tools — Weather by City Name
This example shows tool chaining: the model calls one tool to get coordinates, then uses those coordinates to call a second tool for weather data. The model autonomously decides it needs both calls.
```python
import anthropic
import json
import requests

client = anthropic.Anthropic()

tools = [
    {
        "name": "get_coordinates",
        "description": "Convert a city name to latitude/longitude coordinates using geocoding.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Paris'"},
                "country_code": {"type": "string", "description": "ISO country code, e.g. 'FR'"}
            },
            "required": ["city"]
        }
    },
    {
        "name": "get_weather_by_coords",
        "description": "Get weather data for specific latitude/longitude coordinates.",
        "input_schema": {
            "type": "object",
            "properties": {
                "latitude": {"type": "number", "description": "Latitude coordinate"},
                "longitude": {"type": "number", "description": "Longitude coordinate"}
            },
            "required": ["latitude", "longitude"]
        }
    }
]

API_KEY = "your_openweathermap_api_key"

def get_coordinates(city: str, country_code: str | None = None) -> dict:
    params = {"q": city if not country_code else f"{city},{country_code}",
              "limit": 1, "appid": API_KEY}
    resp = requests.get("http://api.openweathermap.org/geo/1.0/direct", params=params)
    data = resp.json()[0]
    return {"city": data["name"], "lat": data["lat"], "lon": data["lon"],
            "country": data["country"]}

def get_weather_by_coords(latitude: float, longitude: float) -> dict:
    params = {"lat": latitude, "lon": longitude, "units": "metric", "appid": API_KEY}
    resp = requests.get("https://api.openweathermap.org/data/2.5/weather", params=params)
    data = resp.json()
    return {
        "temperature": data["main"]["temp"],
        "feels_like": data["main"]["feels_like"],
        "condition": data["weather"][0]["description"],
        "humidity": data["main"]["humidity"],
        "wind_speed": data["wind"]["speed"]
    }

tool_map = {"get_coordinates": get_coordinates, "get_weather_by_coords": get_weather_by_coords}

def chat_with_tools(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-20250514", max_tokens=1024,
            tools=tools, messages=messages
        )
        if response.stop_reason == "end_turn":
            return "".join(b.text for b in response.content if hasattr(b, "text"))
        # Process tool calls
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = tool_map[block.name](**block.input)
                print(f"  Tool: {block.name}({block.input}) → {result}")
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": json.dumps(result)
                })
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})

# The model will first call get_coordinates("Paris"),
# then use the result to call get_weather_by_coords(48.85, 2.35)
print(chat_with_tools("What's the weather like in Paris right now?"))
```
The model doesn’t need to be told to chain these calls — it reads the tool descriptions, understands that get_weather_by_coords needs coordinates, and autonomously calls get_coordinates first. This is emergent reasoning, not hard-coded logic.
Example 2: Database Query Tool
This example gives the model the ability to query a SQLite database. The model generates SQL, the tool executes it safely, and the model interprets the results.
```python
import anthropic
import json
import sqlite3

client = anthropic.Anthropic()

# Create a sample database
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT,
                    signup_date DATE, plan TEXT);
INSERT INTO users VALUES (1, 'Alice', 'alice@example.com', '2026-03-15', 'pro');
INSERT INTO users VALUES (2, 'Bob', 'bob@example.com', '2026-03-20', 'free');
INSERT INTO users VALUES (3, 'Charlie', 'charlie@example.com', '2026-02-10', 'pro');
INSERT INTO users VALUES (4, 'Diana', 'diana@example.com', '2026-03-25', 'enterprise');
INSERT INTO users VALUES (5, 'Eve', 'eve@example.com', '2026-01-05', 'free');

CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER,
                     amount DECIMAL, order_date DATE);
INSERT INTO orders VALUES (1, 1, 99.99, '2026-03-16');
INSERT INTO orders VALUES (2, 3, 199.99, '2026-03-01');
INSERT INTO orders VALUES (3, 4, 499.99, '2026-03-26');
INSERT INTO orders VALUES (4, 1, 49.99, '2026-03-28');
""")

tools = [
    {
        "name": "query_database",
        "description": """Execute a READ-ONLY SQL query against the database.
Available tables:
- users (id, name, email, signup_date, plan) — plan is 'free', 'pro', or 'enterprise'
- orders (id, user_id, amount, order_date) — user_id references users.id
Only SELECT statements are allowed. Returns rows as a list of dictionaries.""",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "SQL SELECT query to execute"
                }
            },
            "required": ["query"]
        }
    }
]

def query_database(query: str) -> dict:
    # Security: only allow SELECT statements
    if not query.strip().upper().startswith("SELECT"):
        return {"error": "Only SELECT queries are allowed"}
    try:
        cursor.execute(query)
        columns = [desc[0] for desc in cursor.description]
        rows = [dict(zip(columns, row)) for row in cursor.fetchall()]
        return {"columns": columns, "rows": rows, "row_count": len(rows)}
    except Exception as e:
        return {"error": str(e)}

# Ask a natural language question about the data
messages = [{"role": "user", "content": "How many users signed up in March 2026, and what's the total revenue from orders that month?"}]

response = client.messages.create(
    model="claude-sonnet-4-20250514", max_tokens=1024,
    tools=tools, messages=messages
)

# Process (the model will likely make two queries)
while response.stop_reason == "tool_use":
    tool_results = []
    for block in response.content:
        if block.type == "tool_use":
            result = query_database(**block.input)
            print(f"SQL: {block.input['query']}")
            print(f"Result: {result}\n")
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": json.dumps(result)
            })
    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": tool_results})
    response = client.messages.create(
        model="claude-sonnet-4-20250514", max_tokens=1024,
        tools=tools, messages=messages
    )

for block in response.content:
    if hasattr(block, "text"):
        print(block.text)
```
Example 3: Multi-Tool Agent
This example builds a mini agent that can search the web, read URLs, and send emails. It demonstrates the agentic loop — the model calls tools iteratively until the task is complete.
```python
import anthropic
import json

client = anthropic.Anthropic()

tools = [
    {
        "name": "search_web",
        "description": "Search the web for current information. Returns a list of results with titles, URLs, and snippets.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"}
            },
            "required": ["query"]
        }
    },
    {
        "name": "read_url",
        "description": "Read the text content of a web page given its URL.",
        "input_schema": {
            "type": "object",
            "properties": {
                "url": {"type": "string", "description": "Full URL to read"}
            },
            "required": ["url"]
        }
    },
    {
        "name": "send_email",
        "description": "Send an email to a recipient with a subject and body.",
        "input_schema": {
            "type": "object",
            "properties": {
                "to": {"type": "string", "description": "Recipient email address"},
                "subject": {"type": "string", "description": "Email subject line"},
                "body": {"type": "string", "description": "Email body (plain text)"}
            },
            "required": ["to", "subject", "body"]
        }
    }
]

# Simulated tool implementations
def search_web(query):
    return {"results": [
        {"title": "NVIDIA Q4 2026 Earnings", "url": "https://example.com/nvidia-earnings",
         "snippet": "NVIDIA reported revenue of $45B, up 78% YoY..."},
        {"title": "NVIDIA Earnings Analysis", "url": "https://example.com/nvidia-analysis",
         "snippet": "Data center revenue drove growth at $38B..."}
    ]}

def read_url(url):
    return {"content": "NVIDIA reported Q4 2026 revenue of $45 billion, beating estimates of $42B. "
                       "Data center revenue reached $38B (+95% YoY). Gaming revenue was $4.2B (+15%). "
                       "Gross margin was 73.5%. The company announced a $50B buyback program."}

def send_email(to, subject, body):
    return {"status": "sent", "message_id": "msg_abc123"}

tool_map = {"search_web": search_web, "read_url": read_url, "send_email": send_email}

def run_agent(task: str, max_iterations: int = 10) -> str:
    """Run the agent loop until task completion or max iterations."""
    messages = [{"role": "user", "content": task}]
    for i in range(max_iterations):
        response = client.messages.create(
            model="claude-sonnet-4-20250514", max_tokens=4096,
            tools=tools, messages=messages
        )
        if response.stop_reason == "end_turn":
            return "".join(b.text for b in response.content if hasattr(b, "text"))
        # Execute all tool calls
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = tool_map[block.name](**block.input)
                print(f"  [{i+1}] {block.name}({json.dumps(block.input)[:80]}...)")
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": json.dumps(result)
                })
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})
    return "Max iterations reached"

# The agent will: search → read article → compose email → send
result = run_agent(
    "Research the latest NVIDIA earnings and email a summary to investor@example.com"
)
print(result)
```
Notice the run_agent function — it’s a simple loop that keeps calling the model until the task is done (or a maximum iteration count is reached). The model autonomously decides the sequence: search first, read the most relevant article, compose an email, and send it. This is the core pattern behind every AI agent framework.
Example 4: Calculator and Code Execution
LLMs are notoriously bad at arithmetic. Tool calling solves this by offloading computation to actual code:
```python
import anthropic
import json
import math

client = anthropic.Anthropic()

tools = [
    {
        "name": "calculate",
        "description": "Evaluate a mathematical expression. Supports standard math operations (+, -, *, /, **, %), functions (sqrt, sin, cos, log, abs), and constants (pi, e). Examples: '2**10', 'sqrt(144)', 'log(1000, 10)'",
        "input_schema": {
            "type": "object",
            "properties": {
                "expression": {"type": "string", "description": "Math expression to evaluate"}
            },
            "required": ["expression"]
        }
    },
    {
        "name": "run_python",
        "description": "Execute a Python code snippet and return stdout output. Use for complex calculations, data processing, or generating formatted results. The code runs in a sandboxed environment.",
        "input_schema": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Python code to execute"}
            },
            "required": ["code"]
        }
    }
]

def calculate(expression: str) -> dict:
    # Safe math evaluation with limited namespace
    allowed = {k: v for k, v in math.__dict__.items() if not k.startswith('_')}
    allowed.update({"abs": abs, "round": round, "min": min, "max": max})
    try:
        result = eval(expression, {"__builtins__": {}}, allowed)
        return {"expression": expression, "result": result}
    except Exception as e:
        return {"error": str(e)}

def run_python(code: str) -> dict:
    # WARNING: In production, use a proper sandbox (Docker, gVisor, etc.)
    import io, contextlib
    output = io.StringIO()
    try:
        with contextlib.redirect_stdout(output):
            exec(code, {"__builtins__": __builtins__})
        return {"stdout": output.getvalue(), "status": "success"}
    except Exception as e:
        return {"error": str(e), "status": "error"}

tool_map = {"calculate": calculate, "run_python": run_python}

# Ask something that requires precise computation
messages = [{"role": "user", "content":
    "If I invest $10,000 at 7.5% annual return compounded monthly, "
    "how much will I have after 20 years? Show the year-by-year breakdown."}]

response = client.messages.create(
    model="claude-sonnet-4-20250514", max_tokens=4096,
    tools=tools, messages=messages
)

while response.stop_reason == "tool_use":
    tool_results = []
    for block in response.content:
        if block.type == "tool_use":
            result = tool_map[block.name](**block.input)
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": json.dumps(result)
            })
    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": tool_results})
    response = client.messages.create(
        model="claude-sonnet-4-20250514", max_tokens=4096,
        tools=tools, messages=messages
    )

for block in response.content:
    if hasattr(block, "text"):
        print(block.text)
```
A security warning: the `run_python` tool above uses `exec()`, which is dangerous in production. Always sandbox code execution using containers, WebAssembly, or dedicated code execution services. Never run LLM-generated code with full system access.
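As a stopgap short of a real sandbox, you can at least run the snippet in a separate interpreter process with a timeout, so a runaway loop can’t hang your application and crashes stay isolated. A minimal sketch (the `run_python_subprocess` helper is illustrative, not a library API, and is still not a security boundary):

```python
import subprocess
import sys

def run_python_subprocess(code: str, timeout: float = 5.0) -> dict:
    """Run code in a separate interpreter process with a timeout.
    This contains runaway loops and crashes, but it is NOT a security
    boundary — use containers or a dedicated service for untrusted code."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return {"status": "error", "error": "timed out"}
    if proc.returncode != 0:
        return {"status": "error", "error": proc.stderr.strip()}
    return {"status": "success", "stdout": proc.stdout}

print(run_python_subprocess("print(2 ** 10)"))
```

The timeout alone is worth having: a model that emits `while True: pass` will otherwise freeze your whole agent loop.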
The Agentic Loop: From Tool Calling to AI Agents
Tool calling is a single request-response interaction. An AI agent is what happens when you put tool calling in a loop. The agent keeps thinking, calling tools, observing results, and thinking again — until the task is complete.
The Basic Agent Loop
```
while task is not complete:
    1. THINK   → Model analyzes the current state and decides what to do next
    2. SELECT  → Model chooses a tool and generates arguments
    3. EXECUTE → Application runs the tool and captures the result
    4. OBSERVE → Result is fed back to the model
    5. REPEAT  → Model decides: need more info? Call another tool. Done? Respond.
```
```
┌──────────────────────────────────────────────┐
│                 AGENT LOOP                   │
│                                              │
│  ┌─────────┐     ┌──────────┐    ┌─────────┐ │
│  │  THINK  │────→│  SELECT  │───→│ EXECUTE │ │
│  │         │     │   TOOL   │    │  TOOL   │ │
│  └────▲────┘     └──────────┘    └────┬────┘ │
│       │                               │      │
│       │          ┌──────────┐         │      │
│       └──────────│ OBSERVE  │◀────────┘      │
│                  │  RESULT  │                │
│                  └─────┬────┘                │
│                        │                     │
│             Done? ─────┤                     │
│               No ──────┘ (loop back)         │
│              Yes ─────→ RESPOND to user      │
└──────────────────────────────────────────────┘
```
This pattern is everywhere:
- Claude Code — the tool you might be reading this post through — uses exactly this pattern. When you ask Claude Code to “fix the bug in auth.py”, it calls tools like `Read` (to read files), `Grep` (to search code), `Edit` (to modify files), and `Bash` (to run tests), iterating until the bug is fixed.
- ChatGPT with plugins follows the same loop — the model decides which plugins to invoke, executes them, reads the results, and continues.
- GitHub Copilot’s agent mode reads your codebase, makes edits, runs tests, and iterates — all through tool calling.
How Claude Code Uses Tool Calling
Claude Code is a perfect real-world example. When you give it a task, it has access to tools like:
| Tool | What It Does | Example Use |
|---|---|---|
| `Read` | Reads a file from disk | Read `src/auth.py` to understand the code |
| `Write` | Creates or overwrites a file | Write a new test file |
| `Edit` | Makes targeted edits to a file | Fix a specific line in a function |
| `Bash` | Runs a shell command | Run `pytest` to check if the fix works |
| `Grep` | Searches file contents | Find all usages of a function |
| `Glob` | Finds files by pattern | Find all `*.test.py` files |
A typical Claude Code session might involve 20-50 tool calls for a single task. The model reads a file, identifies the problem, searches for related code, makes an edit, runs the tests, sees a test fail, reads the error, makes another edit, runs the tests again, and finally reports success. Every step is a tool call. The “intelligence” is in deciding which tool to call and what arguments to use — the actual execution is done by your computer.
The Progression: Tool Call to Agent
Understanding tool calling lets you see the full progression of AI capability:
- Simple tool call: User asks a question → model calls one tool → responds. (Weather lookup)
- Multi-tool call: Model calls several tools in parallel or sequence within one turn. (Weather + stock price)
- Multi-step chain: Model calls tools iteratively across multiple turns, using each result to inform the next call. (Research → read → summarize → email)
- Autonomous agent: Model operates in a loop with minimal human intervention, using tools to accomplish complex goals. (Claude Code fixing a bug across multiple files)
Each step builds on the one before it. If you understand step 1, you understand the foundation for step 4. Tool calling is the atomic unit of AI agency.
Model Context Protocol (MCP): The Standard for Tool Calling
If every AI application defines its tools in a different format, the ecosystem becomes fragmented. That’s the problem the Model Context Protocol (MCP) solves.
MCP is an open standard created by Anthropic that provides a universal way to connect AI models to external tools, data sources, and services. Think of it as USB-C for AI tools — a single standard that works everywhere, instead of every device having its own proprietary connector.
How MCP Works
MCP defines a client-server architecture:
- MCP Clients (like Claude Code, Claude Desktop, or your custom app) connect to MCP servers and expose the available tools to the AI model
- MCP Servers expose three types of capabilities:
- Tools: Functions the model can call (same concept as function calling)
- Resources: Data the model can read (files, database records, API responses)
- Prompts: Pre-defined prompt templates for common tasks
┌──────────────┐      ┌─────────────┐      ┌─────────────┐
│   Claude     │      │     MCP     │      │  External   │
│  Desktop /   │─────→│   Server    │─────→│   Service   │
│ Claude Code  │      │ (your app)  │      │  (DB, API)  │
│ (MCP Client) │      │             │      │             │
└──────────────┘      └─────────────┘      └─────────────┘
The MCP Server exposes:
- Tools: query_database, create_ticket, send_slack_message
- Resources: customer_data, product_catalog
- Prompts: summarize_ticket, generate_report
Building a Simple MCP Server
Here’s a minimal MCP server that exposes a database query tool:
from mcp.server import Server
from mcp.types import Tool, TextContent
import sqlite3
import json

server = Server("database-server")

@server.list_tools()
async def list_tools():
    return [
        Tool(
            name="query_database",
            description="Run a read-only SQL query against the customer database.",
            inputSchema={
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "SQL SELECT query"}
                },
                "required": ["query"]
            }
        )
    ]

@server.call_tool()
async def call_tool(name: str, arguments: dict):
    if name != "query_database":
        return [TextContent(type="text", text=f"Error: Unknown tool '{name}'")]
    # Validate before opening a connection, so rejected queries don't leak one
    if not arguments["query"].strip().upper().startswith("SELECT"):
        return [TextContent(type="text", text="Error: Only SELECT queries allowed")]
    conn = sqlite3.connect("customers.db")
    try:
        cursor = conn.cursor()
        cursor.execute(arguments["query"])
        columns = [d[0] for d in cursor.description]
        rows = [dict(zip(columns, row)) for row in cursor.fetchall()]
    finally:
        conn.close()
    return [TextContent(type="text", text=json.dumps(rows, indent=2))]

# Run with: python -m mcp.server.stdio database_server
Once this MCP server is running, any MCP-compatible client (Claude Code, Claude Desktop, custom applications) can connect to it and the AI model will be able to query your database through tool calling — with the MCP protocol handling all the communication plumbing.
MCP vs. Other Approaches
| Approach | Standardized? | Multi-Client | Discovery | Status |
|---|---|---|---|---|
| MCP | Open standard | Yes | Built-in | Growing adoption |
| OpenAI Plugins | OpenAI-specific | No | Plugin manifest | Deprecated in favor of GPTs |
| Custom function calling | No | No | Manual | Most flexible |
MCP is gaining significant momentum in 2026. Major IDE extensions, AI coding tools, and enterprise platforms are adopting it as the standard way to connect AI to external systems. If you’re building tools for AI models, building them as MCP servers future-proofs your work.
Best Practices for Designing Tools
The quality of your tools directly determines how well your AI application performs. A well-designed tool is like a well-written function: clear name, documented parameters, predictable behavior. A poorly designed tool leads to hallucinated arguments, incorrect tool selection, and frustrated users.
Naming and Descriptions
The model reads your tool’s name and description to decide when and how to use it. Invest time in these — they’re essentially prompts for the model.
| Aspect | Bad | Good |
|---|---|---|
| Function name | weather | get_current_weather |
| Function name | do_stuff | create_calendar_event |
| Description | “Gets weather” | “Get current weather conditions (temperature, humidity, wind) for a specific city. Use when the user asks about weather or atmospheric conditions.” |
| Parameter description | “The city” | “City name, e.g. ‘Tokyo’, ‘New York’, ‘London’. Use the English name.” |
Key Design Principles
One tool per action. Don’t create a manage_database tool that can query, insert, update, and delete. Create separate tools: query_database, insert_record, update_record, delete_record. This gives the model clearer choices and reduces errors.
Detailed JSON Schema. Use types, required fields, enums, defaults, and descriptions for every parameter. The more constrained the schema, the more reliable the model’s output:
{
  "properties": {
    "priority": {
      "type": "string",
      "enum": ["low", "medium", "high", "critical"],
      "description": "Task priority level. Use 'critical' only for production outages.",
      "default": "medium"
    },
    "due_date": {
      "type": "string",
      "description": "Due date in ISO 8601 format (YYYY-MM-DD), e.g. '2026-04-15'"
    }
  }
}
Structured error messages. When a tool fails, return a structured error message that the model can understand and act on — not a stack trace:
# Bad: raises exception that crashes the loop
raise Exception("Connection timeout")
# Good: returns error the model can understand
return {"error": "Database connection timed out after 30s. The database may be under heavy load. Try again in a few minutes."}
Separate read and write tools. This is crucial for safety. A query_database tool (read-only) is safe to call freely. A delete_record tool (destructive) should require confirmation. By separating them, you can apply different safety policies.
Confirmation for dangerous actions. Before deleting data, sending emails, or making payments, have the model ask for user confirmation. You can implement this by having the tool return a “confirmation required” response that the model must present to the user before proceeding.
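One way to implement that confirmation gate is to make the destructive tool itself two-phase: the first call returns a “confirmation required” result instead of acting, and the action only runs once the user’s approval is passed back. A minimal sketch (the delete_record tool and PENDING store are hypothetical names for illustration):

```python
# Hypothetical two-phase destructive tool: first call asks for confirmation,
# second call (with confirmed=True, set only after the user approves) acts.
PENDING = {}  # record_id -> pending action name

def delete_record(record_id, confirmed=False):
    if not confirmed:
        PENDING[record_id] = "delete_record"
        return {"status": "confirmation_required",
                "message": f"Deleting record {record_id} is permanent. Confirm?"}
    PENDING.pop(record_id, None)
    # ... perform the actual deletion here ...
    return {"status": "deleted", "record_id": record_id}

first = delete_record("cust_42")                    # model surfaces this to the user
second = delete_record("cust_42", confirmed=True)   # only after explicit approval
print(first["status"], second["status"])
```

Crucially, your application code — not the model — decides when confirmed=True is passed, so a prompt injection cannot skip the gate.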
Common Pitfalls and How to Avoid Them
Even with well-designed tools, things can go wrong. Here are the most common issues and their solutions:
| Pitfall | Cause | Solution |
|---|---|---|
| Model hallucinating tool calls | Tool name similar to a known concept | Use strict tool definitions; validate tool name before execution |
| Wrong argument types | Vague or missing JSON Schema | Add detailed types, enums, and descriptions; include examples |
| Infinite tool loops | Model keeps calling tools without converging | Set max_iterations limit; add “no more info needed” guidance |
| Unnecessary tool calls | Overly broad tool description | Write precise descriptions about when to use the tool |
| Ignoring tool errors | Error returned as exception, not tool result | Always return errors as tool results so the model can handle them |
| SQL injection via tool args | LLM-generated SQL executed without validation | Parameterized queries; read-only database user; query allowlists |
| Command injection | LLM-generated shell commands executed directly | Sandboxing; allowlisted commands only; never pass to shell=True |
| Token cost explosion | Tool results too large (e.g., full database dumps) | Paginate results; limit response size; summarize large outputs |
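Several of these pitfalls — hallucinated tool names, wrong argument types — share one defense: validate every tool call against your own registry before executing anything. A minimal sketch (tool registry and names are illustrative; in production you might validate against the full JSON Schema with a library like jsonschema):

```python
# Validate tool name and required argument types before execution.
TOOLS = {
    "get_weather": {
        "required": {"city": str},
        "func": lambda args: {"city": args["city"], "temperature_c": 22},
    }
}

def safe_execute(name, args):
    spec = TOOLS.get(name)
    if spec is None:                                 # hallucinated tool name
        return {"error": f"Unknown tool '{name}'"}
    for param, typ in spec["required"].items():
        if param not in args:
            return {"error": f"Missing required argument '{param}'"}
        if not isinstance(args[param], typ):
            return {"error": f"Argument '{param}' must be {typ.__name__}"}
    return spec["func"](args)

print(safe_execute("get_weather", {"city": "Tokyo"}))  # executes
print(safe_execute("drop_tables", {}))                 # rejected, returned as error
```

Note that validation failures come back as structured error results, not exceptions — so the model can see what went wrong and retry with corrected arguments.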
Security Considerations
Security deserves special attention because tool calling gives an LLM the ability to take real actions. A prompt injection attack that convinces the model to call delete_all_users() is no longer a theoretical concern — it’s a real risk.
Key security practices:
- Input validation: Validate all tool arguments before execution. Don’t trust the model to always provide safe inputs.
- Least privilege: Give tools the minimum permissions necessary. Database tools should use read-only credentials unless writes are required.
- Rate limiting: Limit how often tools can be called to prevent abuse or runaway loops.
- Audit logging: Log every tool call with its arguments and results. This is essential for debugging and security auditing.
- Sandboxing: Code execution tools must run in isolated environments (containers, VMs, or WebAssembly sandboxes).
- Confirmation gates: Destructive operations (delete, send, pay) should require human confirmation before execution.
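The rate-limiting practice above can be as simple as a sliding window per tool. A minimal sketch (class and parameter names are illustrative; a production system might use Redis for shared state across workers):

```python
import time
from collections import deque

class ToolRateLimiter:
    """Sliding-window limiter: at most max_calls per window_seconds per tool."""
    def __init__(self, max_calls=10, window_seconds=60):
        self.max_calls = max_calls
        self.window = window_seconds
        self.calls = {}  # tool name -> deque of call timestamps

    def allow(self, tool_name):
        now = time.monotonic()
        q = self.calls.setdefault(tool_name, deque())
        while q and now - q[0] > self.window:  # drop timestamps outside window
            q.popleft()
        if len(q) >= self.max_calls:
            return False                       # blocked: over the limit
        q.append(now)
        return True

limiter = ToolRateLimiter(max_calls=3, window_seconds=60)
results = [limiter.allow("send_email") for _ in range(5)]
print(results)  # first 3 allowed, rest blocked
```

When a call is blocked, return that as a structured tool result (“rate limit exceeded, try later”) so the model can inform the user instead of looping.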
Tool Calling in Production
Moving from a prototype to production requires additional engineering around reliability, observability, and cost management.
Reliability Patterns
Caching: Cache tool results to avoid redundant API calls. If the model asks for the weather in Tokyo twice in the same conversation, return the cached result. Use time-based expiration (e.g., 5-minute TTL for weather data).
import json
from datetime import datetime, timedelta

_cache = {}  # (tool name + args) -> (result, timestamp)

def cached_tool_call(name: str, args: dict, ttl_seconds: int = 300):
    key = f"{name}:{json.dumps(args, sort_keys=True)}"
    if key in _cache:
        result, timestamp = _cache[key]
        if datetime.now() - timestamp < timedelta(seconds=ttl_seconds):
            return result
    result = execute_tool(name, args)  # your existing tool dispatcher
    _cache[key] = (result, datetime.now())
    return result
Retry with backoff: External APIs fail. Implement retries with exponential backoff for transient errors (timeouts, rate limits, 5xx errors).
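A minimal retry wrapper might look like this (function names are illustrative; the exception types you retry on depend on your HTTP client):

```python
import time
import random

def call_with_retry(fn, max_attempts=4, base_delay=0.5):
    """Retry fn() with exponential backoff plus jitter on transient errors."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except (TimeoutError, ConnectionError) as exc:
            if attempt == max_attempts - 1:
                # Out of retries: return a structured error the model can act on
                return {"error": f"Tool failed after {max_attempts} attempts: {exc}"}
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)

# Demo: a tool that times out twice, then succeeds on the third attempt
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("upstream timeout")
    return {"temperature_c": 22}

result = call_with_retry(flaky, base_delay=0.01)
print(result)
```

The jitter matters: without it, many concurrent conversations retrying in lockstep can hammer an already-struggling API at the exact same moments.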
Fallback strategies: When a tool fails after retries, return a structured error message that lets the model inform the user gracefully, rather than crashing the entire interaction.
Observability
Logging: Log every tool call with a structured format:
{
  "timestamp": "2026-04-03T10:30:00Z",
  "conversation_id": "conv_abc123",
  "tool_name": "get_weather",
  "arguments": {"city": "Tokyo"},
  "result_summary": "success, temperature=22",
  "latency_ms": 245,
  "tokens_used": {"input": 150, "output": 45}
}
Monitoring: Track key metrics:
- Tool call success rate (should be above 95%)
- Average tool latency (directly impacts user experience)
- Tool calls per conversation (indicates complexity)
- Token cost per tool call cycle (each call adds tokens to the context)
- Error rates by tool (identifies problematic tools)
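Tracking the first two metrics per tool needs very little code. A minimal in-process sketch (class name and fields are illustrative; production systems would export these to Prometheus, Datadog, or similar):

```python
from collections import defaultdict

class ToolMetrics:
    """Track success rate and average latency per tool."""
    def __init__(self):
        self.stats = defaultdict(lambda: {"calls": 0, "errors": 0, "total_ms": 0.0})

    def record(self, tool_name, latency_ms, ok=True):
        s = self.stats[tool_name]
        s["calls"] += 1
        s["total_ms"] += latency_ms
        if not ok:
            s["errors"] += 1

    def summary(self, tool_name):
        s = self.stats[tool_name]
        return {
            "success_rate": 1 - s["errors"] / s["calls"],
            "avg_latency_ms": s["total_ms"] / s["calls"],
        }

metrics = ToolMetrics()
metrics.record("get_weather", 200)
metrics.record("get_weather", 300)
metrics.record("get_weather", 250, ok=False)
print(metrics.summary("get_weather"))
```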
Cost Optimization
Every tool call adds tokens to your context window. The tool definitions themselves are included in every API request, so 20 detailed tools might add 2,000-3,000 tokens before the conversation even starts.
Strategies to manage costs:
- Dynamic tool loading: Only include relevant tools based on the conversation context. A weather conversation doesn't need database tools.
- Result compression: Truncate or summarize large tool results before sending them back to the model. A full database dump is rarely necessary — send summary statistics instead.
- Conversation pruning: In long multi-tool conversations, summarize earlier tool results and remove the raw data from the context.
- Model selection: Use cheaper, faster models (like Claude Haiku or GPT-4o-mini) for simple tool-calling tasks, and reserve expensive models for complex reasoning.
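Dynamic tool loading can start as simple keyword routing. This sketch picks a relevant subset of tools per request instead of sending the full catalog every time (tool names and keywords are illustrative; in practice you might route with embeddings or a cheap classifier instead):

```python
# Route each request to a small, relevant subset of the tool catalog.
TOOL_CATALOG = {
    "get_weather":    {"keywords": ["weather", "temperature", "rain"]},
    "query_database": {"keywords": ["customer", "order", "database"]},
    "send_email":     {"keywords": ["email", "send", "notify"]},
}

def select_tools(user_message, max_tools=5):
    text = user_message.lower()
    selected = [name for name, spec in TOOL_CATALOG.items()
                if any(kw in text for kw in spec["keywords"])]
    # Fallback: if nothing matched, offer the (capped) full catalog
    return selected[:max_tools] or list(TOOL_CATALOG)[:max_tools]

print(select_tools("Will it rain in Tokyo tomorrow?"))  # -> ['get_weather']
```

With 20 tools costing 2,000-3,000 tokens per request, trimming to the 2-3 relevant ones pays for itself on every single API call.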
Testing Tool-Calling Applications
Test tools independently before integrating them with the LLM:
- Unit tests: Test each tool function with various inputs, including edge cases and invalid arguments.
- Integration tests: Test the tool with the actual API or database it connects to.
- LLM integration tests: Test the full loop with the model. Provide a set of test prompts and verify the model calls the right tools with correct arguments.
- Adversarial tests: Test with prompts designed to trick the model into misusing tools (prompt injection).
# Example: testing that the model calls the right tool
# (assumes `client` is an Anthropic client instance and `tools` is the
# tool list defined earlier)
def test_weather_tool_selection():
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        tools=tools,
        messages=[{"role": "user", "content": "What's the weather in London?"}]
    )
    tool_calls = [b for b in response.content if b.type == "tool_use"]
    assert len(tool_calls) == 1
    assert tool_calls[0].name == "get_weather"
    assert tool_calls[0].input["city"] == "London"

def test_no_tool_for_general_question():
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        tools=tools,
        messages=[{"role": "user", "content": "What is the capital of France?"}]
    )
    # Model should answer directly, no tool call
    assert response.stop_reason == "end_turn"
The Future of Tool Calling
Tool calling is evolving rapidly. Here's where it's heading:
Computer Use
Anthropic's computer use capability takes tool calling to its logical extreme: instead of calling specific APIs, the model can control an entire computer desktop. It sees the screen (via screenshots), moves the mouse, clicks buttons, and types text. The "tools" become the entire computer interface — every application, every website, every file. This is the most general form of tool use: rather than building a specific tool for every task, you give the model the same tools a human uses.
More Reliable Structured Output
Constrained decoding is making tool calling more reliable. Instead of hoping the model produces valid JSON, the decoding process itself enforces the JSON Schema — the model literally cannot produce invalid output. OpenAI's "strict mode" and Anthropic's improvements in JSON reliability are steps in this direction.
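For reference, OpenAI's strict mode is opt-in per tool definition. A sketch of the shape (based on the Structured Outputs documentation at the time of writing — verify against the current docs before relying on it): strict mode requires additionalProperties to be false and every property to be listed as required.

```python
# Tool definition with OpenAI-style strict mode enabled. With strict=True,
# the decoder guarantees arguments conform to this schema exactly.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "strict": True,  # enables constrained decoding against the schema
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city", "unit"],       # strict mode: all properties required
            "additionalProperties": False,      # strict mode: no extra keys allowed
        },
    },
}
```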
Tool Learning and Discovery
Current models use tools that are explicitly defined in the request. Future models may be able to discover tools dynamically — browsing an API directory, reading documentation, and figuring out how to use a new tool without it being pre-defined. MCP is laying the groundwork for this with its discovery protocol.
Multi-Agent Tool Sharing
As multi-agent systems become more common (multiple AI agents collaborating on a task), tool sharing becomes important. One agent might specialize in database queries while another handles email. MCP's architecture supports this by allowing multiple agents to connect to the same tool servers.
Standardization
MCP adoption is accelerating. In the same way that REST APIs standardized web service communication, MCP is standardizing how AI models interact with external tools. For developers and companies building AI tools, this means writing your tool once and making it available to every AI model and client that supports MCP.
Conclusion
Tool calling is the invisible infrastructure behind every AI agent, every chatbot plugin, and every autonomous AI system. It's deceptively simple — a model outputs a function name and arguments, your code executes it, and the result goes back to the model — but this simple loop is what transformed LLMs from text generators into systems that can do things in the real world.
Let's recap what we covered:
- The core concept: Tool calling lets LLMs request the execution of external functions. The model plans, your code acts.
- The three-step loop: User asks → model calls tool → your code executes → model responds with the result.
- Provider implementations: Claude, GPT, and Gemini all support tool calling with slightly different formats but the same underlying pattern.
- Practical patterns: From simple weather lookups to chained tool calls, database queries, and multi-tool agents.
- The agentic loop: Tool calling in a loop is the foundation of AI agents. Claude Code, ChatGPT plugins, and GitHub Copilot all work this way.
- MCP: The open standard that's making tool definitions universal and interoperable.
- Best practices: Clear naming, detailed schemas, error handling, security, and the read/write separation principle.
- Production concerns: Caching, logging, cost optimization, and testing strategies.
If you're a developer, start building with tool calling today. Pick an API you already use, define it as a tool, and hook it up to Claude or GPT. You'll be surprised at how quickly you go from "AI that chats" to "AI that acts." If you're an investor, understand that tool calling is not a feature — it's the foundation of the entire AI agent ecosystem. Companies that master tool integration will win the next phase of AI.
The era of AI that only talks is over. The era of AI that does is just beginning — and tool calling is the mechanism that makes it possible.
References
- Anthropic. "Tool use (function calling) — Claude Documentation." docs.anthropic.com/en/docs/build-with-claude/tool-use
- OpenAI. "Function calling — OpenAI API Documentation." platform.openai.com/docs/guides/function-calling
- Google. "Function calling — Gemini API Documentation." ai.google.dev/gemini-api/docs/function-calling
- Anthropic. "Model Context Protocol — Documentation." modelcontextprotocol.io
- Anthropic. "Computer use — Claude Documentation." docs.anthropic.com/en/docs/build-with-claude/computer-use
- Anthropic. "Claude Code — Documentation." docs.anthropic.com/en/docs/claude-code
- Schick, T., et al. "Toolformer: Language Models Can Teach Themselves to Use Tools." arXiv:2302.04761, 2023.
- Qin, Y., et al. "ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs." arXiv:2307.16789, 2023.