Summary
What this post covers: An end-to-end guide to tool calling (function calling) in LLMs—how it works, how Claude, GPT, and Gemini implement it, complete code examples, the agentic loop, MCP, and the production patterns that turn a chatbot into an AI agent.
Key insights:
- The model never executes tools itself. It emits structured JSON (function name + arguments), your code runs the actual function and feeds the result back, and the model weaves that result into a natural response. This single loop is what transforms text generators into agents.
- Every major provider (Anthropic, OpenAI, Google) follows the same three-step pattern (user asks, model requests tool, your code executes and returns), but their wire formats differ enough that abstraction layers like LangChain or MCP are worth the indirection.
- The Model Context Protocol (MCP) is becoming for AI tools what REST became for web services: a universal interface that lets you write a tool once and expose it to every MCP-compatible client.
- Tool design quality drives agent performance more than model choice: clear naming, detailed JSON schemas, error handling, and separating read-only from mutating operations are the difference between a reliable agent and one that hallucinates calls.
- Putting tool calling in an open-ended loop is the foundation of every modern AI agent (Claude Code, ChatGPT, GitHub Copilot), but in production that loop must be paired with caching, logging, rate limits, and explicit halt criteria to control cost and risk.
Main topics: What Is Tool Calling, How Tool Calling Works Internally, Tool Calling Across Major AI Providers, Practical Tool Calling Examples (with Complete Code), The Agentic Loop: From Tool Calling to AI Agents, Model Context Protocol (MCP): The Standard for Tool Calling, Best Practices for Designing Tools, Common Pitfalls and How to Avoid Them, Tool Calling in Production, The Future of Tool Calling, Final Thoughts, References.
In March 2023, a developer built a ChatGPT-powered assistant that could check the weather, look up flight prices, and book restaurant reservations within a single conversation. The mechanism deserves scrutiny: the AI itself never called a single API. Instead, it told the developer’s code exactly which function to call and with which arguments, received the results, and incorporated them into a seamless natural language response. The user could not have known that they were conversing with a text generator unable to act on its own. The mechanism has a name: tool calling. It is the single most important capability that transformed large language models from impressive text generators into agents capable of interacting with the real world.
A central limitation of LLMs should be stated plainly: on their own, they cannot observe or act on the outside world. An LLM does not know today’s date. It cannot check a stock price. It cannot query a database, send an email, or read a file on the user’s computer. It knows only what was in its training data (which is months or years old) and whatever appears in the current conversation. Without tool calling, asking an LLM “What is NVIDIA’s stock price now?” yields a polite apology and a reminder of its knowledge cutoff date.
Tool calling changed this situation. It is the mechanism that allows an AI model to indicate, “I do not know the answer, but I know which function to call to obtain it, and here are the exact arguments.” The user’s code then executes that function, feeds the result back to the model, and the model responds as if it had known all along. This is how ChatGPT plugins operate, how Claude Code reads and writes files, and how every AI agent functions internally.
This guide examines tool calling from the ground up. It explains exactly how the mechanism works, presents complete code examples for Claude and OpenAI, describes the differences between providers, and provides what is required to build tool-calling applications. For developers building AI-powered products and for analysts evaluating AI companies, understanding tool calling is essential: it is the bridge between “AI that talks” and “AI that acts.”
What Is Tool Calling
Tool calling (also referred to as function calling) is a mechanism by which a large language model can request the execution of external functions or APIs during a conversation. Rather than attempting to answer entirely from memory, the model can reach into the real world—checking databases, calling APIs, performing calculations, or executing code—by asking the application to run specific functions on its behalf.
The central insight is deceptively simple: the model does not execute the tools itself. It generates a structured request—a function name plus arguments in JSON format—and the user’s code is responsible for actually executing it. The result is sent back to the model, which then incorporates it into its response.
The relationship can be likened to that of a brain and hands. The LLM is the brain: it plans, reasons, and decides what should happen. The tools are the hands: they perform actions in the world. The brain cannot lift a cup of coffee by itself, but it can direct the hands precisely. Similarly, an LLM cannot check the weather directly, but it can instruct a code path to call a weather API with specific coordinates and then interpret the result.
The Three-Step Loop
Every tool calling interaction follows the same fundamental pattern:
- User asks something. “What is the weather in Tokyo right now?”
- The model decides to call a tool. It outputs structured JSON: {"name": "get_weather", "arguments": {"city": "Tokyo"}}.
- The user’s code executes the tool. It calls the weather API, obtains the result, and sends it back to the model.
- The model responds naturally. “It is currently 22°C and sunny in Tokyo, with a light breeze from the east.”
The full flow is described step by step below:
┌───────────┐
│   User    │
└─────┬─────┘
      │  "What's the weather in Tokyo?"
      ▼
┌───────────┐
│ Your App  │
└─────┬─────┘
      │  Sends message + tool definitions
      ▼
┌───────────┐
│ LLM (API) │
└─────┬─────┘
      │  Returns tool_use: get_weather {"city":"Tokyo"}
      ▼
┌───────────┐
│ Your App  │──→ Calls weather API
│ (execute) │←── Gets result: 22°C
└─────┬─────┘
      │  Sends tool_result back to LLM
      ▼
┌───────────┐
│ LLM (API) │
└─────┬─────┘
      │  Final response: "It's 22°C and sunny in Tokyo"
      ▼
┌───────────┐
│ User sees │
│ response  │
└───────────┘
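The same round trip can be sketched without any network access. In the sketch below, `fake_model` is a hypothetical stand-in for a provider API call (it is not part of any SDK), the message shapes are simplified, and `get_weather` returns canned data; only the control flow is meant to be representative.

```python
import json

def get_weather(city: str) -> dict:
    # Canned data standing in for a real weather API call
    return {"city": city, "temperature": 22, "condition": "sunny"}

TOOLS = {"get_weather": get_weather}

def fake_model(messages: list) -> dict:
    """Hypothetical stand-in for an LLM API call.

    Requests a tool while no tool result is in the history;
    otherwise turns the latest tool result into a sentence.
    """
    last = messages[-1]
    if last["role"] == "tool":
        data = json.loads(last["content"])
        text = f"It is {data['temperature']}°C and {data['condition']} in {data['city']}."
        return {"type": "text", "text": text}
    return {"type": "tool_call", "name": "get_weather", "arguments": {"city": "Tokyo"}}

def answer(question: str) -> str:
    messages = [{"role": "user", "content": question}]
    reply = fake_model(messages)                      # model sees the question
    while reply["type"] == "tool_call":
        # The application, not the model, runs the function
        result = TOOLS[reply["name"]](**reply["arguments"])
        messages.append({"role": "tool", "content": json.dumps(result)})
        reply = fake_model(messages)                  # model sees the result
    return reply["text"]

print(answer("What is the weather in Tokyo right now?"))
```

Swapping `fake_model` for a real API call changes the message format but not the shape of the loop.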
Why This Is a Significant Development
Before tool calling: LLMs could only generate text. They were highly capable in that respect, but they were fundamentally disconnected from the world. A request for today’s weather produced a hallucinated guess or an apology. A request to send an email produced a draft that the user had to copy and send manually.
After tool calling: LLMs can take actions. They can check real-time data, interact with databases, control software, browse the web, manage files, send messages, and orchestrate complex multi-step workflows. The same text-generation capability previously limited to chat responses now drives decision-making about which actions to take and how to interpret the results.
This single capability—the ability for a model to say “call this function with these arguments”—is what turned LLMs from chatbots into agents. Every AI agent framework, every chatbot plugin system, and every autonomous AI workflow is built on tool calling.
How Tool Calling Works Internally
The following walkthrough describes each step of the tool calling process in detail, using the actual data structures encountered when building with these APIs.
Step 1: Tool Definition
Before the model can use any tools, the available tools must be declared. This is done by including tool definitions in the API request. Each tool definition consists of a name, a natural-language description of the tool’s purpose, and a JSON Schema describing its parameters.
{
"name": "get_current_weather",
"description": "Get the current weather conditions for a specific city. Returns temperature in Celsius, weather condition, humidity, and wind speed. Use this when the user asks about current weather, temperature, or atmospheric conditions for any location.",
"input_schema": {
"type": "object",
"properties": {
"city": {
"type": "string",
"description": "The city name, e.g. 'Tokyo', 'New York', 'London'"
},
"units": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "Temperature units. Defaults to celsius.",
"default": "celsius"
}
},
"required": ["city"]
}
}
The description is critically important: it is what the model reads to determine when to use a given tool. A vague description such as “weather stuff” will lead the model to use the tool at the wrong times, or not at all when it should. A detailed description, like the one above, supports precise decisions.
Step 2: Tool Selection
When the model receives a user message along with tool definitions, it makes a decision: respond directly, or call one or more tools first. This decision is made by the model itself; it is part of the model’s inference process, not a separate system.
The model considers the following questions:
- Does the user’s request require information that the model does not have?
- Is there a tool that can provide that information?
- What arguments should be passed to the tool?
- Are multiple tool calls required?
- Should tools be called in parallel or sequentially?
If the user asks “What is 2 + 2?”, the model answers directly, with no tool needed. If the user asks “What is the weather in Tokyo?” and a get_current_weather tool is available, the model will determine that the tool should be called.
Step 3: Structured Output
When the model decides to call a tool, it does not output free-form text. Instead, it outputs a structured tool_use block with the function name and arguments as valid JSON:
{
"role": "assistant",
"content": [
{
"type": "tool_use",
"id": "toolu_01A09q90qw90lq917835lq9",
"name": "get_current_weather",
"input": {
"city": "Tokyo",
"units": "celsius"
}
}
]
}
This is not a suggestion or a natural language request; it is a precisely structured instruction. The function name matches exactly what was defined, and the arguments conform to the JSON Schema provided. This is what makes tool calling reliable: the model does not say “maybe try checking the weather”; it says “call get_current_weather with {"city": "Tokyo", "units": "celsius"}”.
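Even though arguments normally conform to the schema, an application may still want to check them before executing anything. The following is a minimal sketch of such a check, assuming only the `required`, `type`, and `enum` keywords matter; `validate_arguments` is a hypothetical helper, and a full JSON Schema validator covers far more.

```python
def validate_arguments(schema: dict, args: dict) -> list:
    """Return a list of problems; an empty list means the arguments look valid."""
    type_map = {"string": str, "number": (int, float), "integer": int,
                "boolean": bool, "object": dict, "array": list}
    problems = []
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            problems.append(f"missing required argument: {name}")
    for name, value in args.items():
        spec = properties.get(name)
        if spec is None:
            problems.append(f"unexpected argument: {name}")
            continue
        expected = type_map.get(spec.get("type"))
        # bool is a subclass of int in Python, so reject it for numeric fields
        bool_as_number = isinstance(value, bool) and spec.get("type") in ("number", "integer")
        if expected and (not isinstance(value, expected) or bool_as_number):
            problems.append(f"{name}: expected {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"{name}: must be one of {spec['enum']}")
    return problems

weather_schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["city"],
}

print(validate_arguments(weather_schema, {"city": "Tokyo", "units": "celsius"}))  # []
print(validate_arguments(weather_schema, {"units": "kelvin"}))
```

Returning the problems as text, rather than raising, makes it easy to send them back to the model as a tool result so it can correct the call.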
Step 4: Execution
The application code receives this tool_use block, parses it, and executes the actual function. This is where the real work occurs: the API call is made, the database query is run, the calculation is performed, or whatever else the tool does:
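# Your code, NOT the model's code
import requests

API_KEY = "your_openweathermap_api_key"  # placeholder

def get_current_weather(city: str, units: str = "celsius") -> dict:
    # OpenWeatherMap uses "metric" for Celsius and "imperial" for Fahrenheit
    api_units = "imperial" if units == "fahrenheit" else "metric"
    response = requests.get(
        "https://api.openweathermap.org/data/2.5/weather",
        params={"q": city, "units": api_units, "appid": API_KEY},
        timeout=10,
    )
    response.raise_for_status()  # surface HTTP errors instead of parsing an error body
    data = response.json()
    return {
        "city": city,
        "temperature": data["main"]["temp"],
        "condition": data["weather"][0]["description"],
        "humidity": data["main"]["humidity"],
        "wind_speed": data["wind"]["speed"]
    }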
Step 5: Result Injection
The tool result is sent back to the model as a tool_result message:
{
"role": "user",
"content": [
{
"type": "tool_result",
"tool_use_id": "toolu_01A09q90qw90lq917835lq9",
"content": "{\"city\": \"Tokyo\", \"temperature\": 22, \"condition\": \"clear sky\", \"humidity\": 45, \"wind_speed\": 3.6}"
}
]
}
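When a tool fails, the failure can be reported through the same channel. The sketch below wraps a function call and builds the content block shown above; it assumes the Anthropic format, where a `tool_result` block may carry `is_error: true`. The helper name `make_tool_result` is hypothetical.

```python
import json

def make_tool_result(tool_use_id: str, func, arguments: dict) -> dict:
    """Run `func` and wrap the outcome as a tool_result content block."""
    try:
        content = json.dumps(func(**arguments))
        failed = False
    except Exception as exc:
        # Report the failure as text so the model can react to it
        content = f"{type(exc).__name__}: {exc}"
        failed = True
    block = {"type": "tool_result", "tool_use_id": tool_use_id, "content": content}
    if failed:
        block["is_error"] = True
    return block

def lookup(city: str) -> dict:
    if city != "Tokyo":
        raise ValueError("unknown city")
    return {"city": city, "temperature": 22}

print(make_tool_result("toolu_1", lookup, {"city": "Tokyo"}))
print(make_tool_result("toolu_2", lookup, {"city": "Atlantis"}))
```

Returning the error instead of letting the exception propagate keeps the conversation alive, so the model can retry with different arguments or explain the problem to the user.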
Step 6: Final Response
The model reads the tool result and generates a natural language response for the user. The model does not simply repeat the raw data; it interprets the data, adds context, and presents it conversationally:
“At present in Tokyo the temperature is 22°C with clear skies. Humidity is 45%, and there is a light breeze at 3.6 m/s.”
Multi-Tool and Iterative Tool Use
Modern models can call multiple tools in a single turn. If a user asks “What is the weather in Tokyo and New York?”, the model can output two tool_use blocks simultaneously—a parallel tool call. The application executes both and returns both results.
Models can also use tools iteratively. In a complex task, the model may call tool A, examine the result, determine that more information is required, call tool B, examine that result, and only then respond. This iterative capability is the foundation of AI agents: the model continues to call tools in a loop until it has enough information to complete the task.
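When a single turn contains several independent tool calls, the application is free to run them concurrently. The following is one possible sketch using a thread pool. The tool calls are represented as plain dictionaries with `id`, `name`, and `input` keys, which is an assumption made for illustration; real SDKs return objects with equivalent fields.

```python
import json
from concurrent.futures import ThreadPoolExecutor

def run_tool_calls(tool_calls: list, tool_functions: dict) -> list:
    """Execute independent tool calls concurrently, preserving request order."""
    def run_one(call: dict) -> dict:
        result = tool_functions[call["name"]](**call["input"])
        return {"type": "tool_result", "tool_use_id": call["id"],
                "content": json.dumps(result)}

    workers = max(1, min(8, len(tool_calls)))  # ThreadPoolExecutor needs at least 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the inputs
        return list(pool.map(run_one, tool_calls))

tool_functions = {"get_weather": lambda city: {"city": city, "temperature": 22}}
calls = [
    {"id": "call_1", "name": "get_weather", "input": {"city": "Tokyo"}},
    {"id": "call_2", "name": "get_weather", "input": {"city": "New York"}},
]
for block in run_tool_calls(calls, tool_functions):
    print(block["tool_use_id"], block["content"])
```

Threads suit I/O-bound tools such as HTTP requests; tools with side effects that depend on one another should still run sequentially.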
Tool Calling Across Major AI Providers
The core concept is the same across providers, although the API formats differ. The following sections present complete, runnable examples for each major provider.
Anthropic Claude (Messages API)
Claude’s tool calling uses a clean, content-block-based format. Tools are defined with input_schema (standard JSON Schema), and the model responds with tool_use content blocks.
A complete, runnable Python example follows:
import anthropic
import json
client = anthropic.Anthropic() # Uses ANTHROPIC_API_KEY env var
# Define tools
tools = [
{
"name": "get_weather",
"description": "Get the current weather for a city. Returns temperature (Celsius), condition, humidity, and wind speed.",
"input_schema": {
"type": "object",
"properties": {
"city": {
"type": "string",
"description": "City name, e.g. 'Tokyo', 'London'"
}
},
"required": ["city"]
}
},
{
"name": "get_stock_price",
"description": "Get the current stock price for a given ticker symbol. Returns price in USD, daily change, and percentage change.",
"input_schema": {
"type": "object",
"properties": {
"ticker": {
"type": "string",
"description": "Stock ticker symbol, e.g. 'AAPL', 'NVDA', 'GOOGL'"
}
},
"required": ["ticker"]
}
}
]
# Simulated tool implementations
def get_weather(city: str) -> dict:
    # In production, call a real weather API
    return {"city": city, "temperature": 22, "condition": "sunny", "humidity": 45}

def get_stock_price(ticker: str) -> dict:
    # In production, call a real stock API
    return {"ticker": ticker, "price": 875.30, "change": +12.50, "percent_change": "+1.45%"}
# Map function names to implementations
tool_functions = {
"get_weather": get_weather,
"get_stock_price": get_stock_price,
}
# Send initial message with tools
messages = [{"role": "user", "content": "What's the weather in Tokyo and NVIDIA's stock price?"}]
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
tools=tools,
messages=messages
)
print(f"Stop reason: {response.stop_reason}")
# Process tool calls
while response.stop_reason == "tool_use":
    # Collect all tool use blocks
    tool_results = []
    for block in response.content:
        if block.type == "tool_use":
            # Execute the tool
            func = tool_functions[block.name]
            result = func(**block.input)
            print(f"Called {block.name}({block.input}) → {result}")
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": json.dumps(result)
            })
    # Send results back to Claude
    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": tool_results})
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        tools=tools,
        messages=messages
    )

# Print final response
for block in response.content:
    if hasattr(block, "text"):
        print(f"\nClaude's response:\n{block.text}")
Use the tool_choice parameter to control tool usage: {"type": "auto"} (the model decides), {"type": "any"} (at least one tool must be used), or {"type": "tool", "name": "get_weather"} (a specific tool must be used). Use auto for most cases.
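As a small illustration, the three modes can be produced by one helper. In the Anthropic Messages API the value is passed as an object with a `type` field; `build_tool_choice` itself is a hypothetical convenience function, not part of the SDK.

```python
from typing import Optional

def build_tool_choice(mode: str, tool_name: Optional[str] = None) -> dict:
    """Build a tool_choice value: 'auto', 'any', or 'tool' (a specific tool)."""
    if mode in ("auto", "any"):
        return {"type": mode}
    if mode == "tool":
        if not tool_name:
            raise ValueError("a specific tool requires tool_name")
        return {"type": "tool", "name": tool_name}
    raise ValueError(f"unknown mode: {mode}")

print(build_tool_choice("auto"))
print(build_tool_choice("tool", "get_weather"))
```

The returned dictionary would be passed as the `tool_choice` argument alongside `tools` in the request.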
Claude-specific features:
- Parallel tool calls. Claude can output multiple tool_use blocks in a single response, allowing parallel execution.
- Streaming with tools. Tool calls work with streaming; the application receives content_block_start events for tool_use blocks as they are generated.
- Tool choice control. Fine-grained control over when the model uses tools via tool_choice.
- Large tool sets. Claude handles large numbers of tools well, though keeping the count below approximately 20 is recommended for optimal performance.
OpenAI GPT (Chat Completions API)
OpenAI’s format uses a tools array with type: "function" wrappers. The response includes a tool_calls array, and results are sent back as messages with role: "tool".
from openai import OpenAI
import json
client = OpenAI() # Uses OPENAI_API_KEY env var
# Define tools — note the different format from Claude
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a city.",
"parameters": {
"type": "object",
"properties": {
"city": {
"type": "string",
"description": "City name, e.g. 'Tokyo'"
}
},
"required": ["city"]
}
}
},
{
"type": "function",
"function": {
"name": "get_stock_price",
"description": "Get the current stock price for a ticker symbol.",
"parameters": {
"type": "object",
"properties": {
"ticker": {
"type": "string",
"description": "Stock ticker, e.g. 'NVDA'"
}
},
"required": ["ticker"]
}
}
}
]
# Same tool implementations as above
def get_weather(city):
    return {"city": city, "temperature": 22, "condition": "sunny"}

def get_stock_price(ticker):
    return {"ticker": ticker, "price": 875.30, "change": "+1.45%"}
tool_functions = {"get_weather": get_weather, "get_stock_price": get_stock_price}
messages = [{"role": "user", "content": "What's the weather in Tokyo and NVIDIA's stock price?"}]
response = client.chat.completions.create(
model="gpt-4o",
messages=messages,
tools=tools,
tool_choice="auto"
)
message = response.choices[0].message
# Process tool calls
while message.tool_calls:
    messages.append(message)  # Add assistant message with tool calls
    for tool_call in message.tool_calls:
        func = tool_functions[tool_call.function.name]
        # OpenAI returns arguments as a JSON string, so parse before calling
        args = json.loads(tool_call.function.arguments)
        result = func(**args)
        # Note: OpenAI uses role="tool" instead of tool_result content blocks
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result)
        })
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools
    )
    message = response.choices[0].message

print(message.content)
Google Gemini
Gemini’s function calling follows a similar pattern but uses its own API format. Tool definitions use FunctionDeclaration objects, and responses include function_call parts. Gemini supports both automatic and manual function calling modes and can handle parallel function calls, as Claude and GPT do.
The principal difference with Gemini is its tight integration with the Google ecosystem: function calling works seamlessly with Google Search, Google Maps, and other Google APIs as built-in tools.
Provider Comparison
| Feature | Claude (Anthropic) | GPT (OpenAI) | Gemini (Google) |
|---|---|---|---|
| Tool definition key | input_schema | parameters | parameters |
| Tool call format | tool_use content block | tool_calls array | function_call part |
| Result format | tool_result content block | role: "tool" message | function_response part |
| Parallel tool calls | Yes | Yes | Yes |
| Streaming with tools | Yes | Yes | Yes |
| Tool choice control | auto / any / specific | auto / none / required / specific | auto / none / specific |
| JSON reliability | Excellent | Excellent | Good |
| Stop reason indicator | stop_reason: "tool_use" | finish_reason: "tool_calls" | Part type check |
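Because the definitions differ mainly in their wrapper and key names, converting between the Claude and OpenAI shapes shown earlier is mechanical. The two helper functions below are hypothetical and cover only the `name`, `description`, and schema fields used in this guide.

```python
def anthropic_to_openai(tool: dict) -> dict:
    """Wrap a Claude-style definition in OpenAI's function envelope."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            "parameters": tool["input_schema"],   # same JSON Schema, different key
        },
    }

def openai_to_anthropic(tool: dict) -> dict:
    """Unwrap an OpenAI-style definition into Claude's flat shape."""
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        "input_schema": fn["parameters"],
    }

claude_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}
openai_tool = anthropic_to_openai(claude_tool)
print(openai_tool["function"]["parameters"]["required"])  # ['city']
```

Keeping one canonical definition and converting at the edge is one way to avoid maintaining the same tool twice.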
Practical Tool Calling Examples (with Complete Code)
The following four examples build progressively more complex tool calling patterns.
Example 1: Chained Tools—Weather by City Name
This example illustrates tool chaining: the model calls one tool to obtain coordinates, then uses those coordinates to call a second tool for weather data. The model autonomously determines that both calls are required.
import anthropic
import json
import requests
client = anthropic.Anthropic()
tools = [
{
"name": "get_coordinates",
"description": "Convert a city name to latitude/longitude coordinates using geocoding.",
"input_schema": {
"type": "object",
"properties": {
"city": {"type": "string", "description": "City name, e.g. 'Paris'"},
"country_code": {"type": "string", "description": "ISO country code, e.g. 'FR'"}
},
"required": ["city"]
}
},
{
"name": "get_weather_by_coords",
"description": "Get weather data for specific latitude/longitude coordinates.",
"input_schema": {
"type": "object",
"properties": {
"latitude": {"type": "number", "description": "Latitude coordinate"},
"longitude": {"type": "number", "description": "Longitude coordinate"}
},
"required": ["latitude", "longitude"]
}
}
]
API_KEY = "your_openweathermap_api_key"
def get_coordinates(city: str, country_code: str = None) -> dict:
    params = {"q": city if not country_code else f"{city},{country_code}",
              "limit": 1, "appid": API_KEY}
    resp = requests.get("https://api.openweathermap.org/geo/1.0/direct", params=params)
    results = resp.json()
    if not results:
        # Return an error the model can read instead of raising IndexError
        return {"error": f"No coordinates found for {city}"}
    data = results[0]
    return {"city": data["name"], "lat": data["lat"], "lon": data["lon"],
            "country": data["country"]}

def get_weather_by_coords(latitude: float, longitude: float) -> dict:
    params = {"lat": latitude, "lon": longitude, "units": "metric", "appid": API_KEY}
    resp = requests.get("https://api.openweathermap.org/data/2.5/weather", params=params)
    data = resp.json()
    return {
        "temperature": data["main"]["temp"],
        "feels_like": data["main"]["feels_like"],
        "condition": data["weather"][0]["description"],
        "humidity": data["main"]["humidity"],
        "wind_speed": data["wind"]["speed"]
    }
tool_map = {"get_coordinates": get_coordinates, "get_weather_by_coords": get_weather_by_coords}
def chat_with_tools(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-20250514", max_tokens=1024,
            tools=tools, messages=messages
        )
        # Any stop reason other than "tool_use" (end_turn, max_tokens, ...) ends the loop;
        # checking only for "end_turn" would loop forever on a truncated response.
        if response.stop_reason != "tool_use":
            return "".join(b.text for b in response.content if hasattr(b, "text"))
        # Process tool calls
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = tool_map[block.name](**block.input)
                print(f" Tool: {block.name}({block.input}) → {result}")
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": json.dumps(result)
                })
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})

# The model will first call get_coordinates("Paris"),
# then use the result to call get_weather_by_coords(48.85, 2.35)
print(chat_with_tools("What's the weather like in Paris right now?"))
The model is not instructed to chain these calls. It reads the tool descriptions, recognises that get_weather_by_coords requires coordinates, and autonomously calls get_coordinates first. This represents emergent reasoning rather than hard-coded logic.
Example 2: Database Query Tool
This example provides the model with the ability to query a SQLite database. The model generates SQL, the tool executes it safely, and the model interprets the results.
import anthropic
import json
import sqlite3
client = anthropic.Anthropic()
# Create a sample database
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT,
signup_date DATE, plan TEXT);
INSERT INTO users VALUES (1, 'Alice', 'alice@example.com', '2026-03-15', 'pro');
INSERT INTO users VALUES (2, 'Bob', 'bob@example.com', '2026-03-20', 'free');
INSERT INTO users VALUES (3, 'Charlie', 'charlie@example.com', '2026-02-10', 'pro');
INSERT INTO users VALUES (4, 'Diana', 'diana@example.com', '2026-03-25', 'enterprise');
INSERT INTO users VALUES (5, 'Eve', 'eve@example.com', '2026-01-05', 'free');
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER,
amount DECIMAL, order_date DATE);
INSERT INTO orders VALUES (1, 1, 99.99, '2026-03-16');
INSERT INTO orders VALUES (2, 3, 199.99, '2026-03-01');
INSERT INTO orders VALUES (3, 4, 499.99, '2026-03-26');
INSERT INTO orders VALUES (4, 1, 49.99, '2026-03-28');
""")
tools = [
{
"name": "query_database",
"description": """Execute a READ-ONLY SQL query against the database.
Available tables:
- users (id, name, email, signup_date, plan) — plan is 'free', 'pro', or 'enterprise'
- orders (id, user_id, amount, order_date) — user_id references users.id
Only SELECT statements are allowed. Returns rows as a list of dictionaries.""",
"input_schema": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "SQL SELECT query to execute"
}
},
"required": ["query"]
}
}
]
def query_database(query: str) -> dict:
    # Security: only allow SELECT statements
    if not query.strip().upper().startswith("SELECT"):
        return {"error": "Only SELECT queries are allowed"}
    try:
        cursor.execute(query)
        columns = [desc[0] for desc in cursor.description]
        rows = [dict(zip(columns, row)) for row in cursor.fetchall()]
        return {"columns": columns, "rows": rows, "row_count": len(rows)}
    except Exception as e:
        return {"error": str(e)}
# Ask a natural language question about the data
messages = [{"role": "user", "content": "How many users signed up in March 2026, and what's the total revenue from orders that month?"}]
response = client.messages.create(
model="claude-sonnet-4-20250514", max_tokens=1024,
tools=tools, messages=messages
)
# Process (the model will likely make two queries)
while response.stop_reason == "tool_use":
    tool_results = []
    for block in response.content:
        if block.type == "tool_use":
            result = query_database(**block.input)
            print(f"SQL: {block.input['query']}")
            print(f"Result: {result}\n")
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": json.dumps(result)
            })
    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": tool_results})
    response = client.messages.create(
        model="claude-sonnet-4-20250514", max_tokens=1024,
        tools=tools, messages=messages
    )

for block in response.content:
    if hasattr(block, "text"):
        print(block.text)
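A prefix check on the query text is a thin guard. One complementary option, sketched here as an assumption rather than as part of the example above, is to let SQLite itself refuse anything that is not a read by installing an authorizer callback on the connection after the data has been loaded. The table contents are illustrative.

```python
import sqlite3

# Action codes a read-only query legitimately needs. SQLITE_FUNCTION covers
# calls such as COUNT() and SUM(); 31 is its value in SQLite's C header.
READ_ACTIONS = {
    sqlite3.SQLITE_SELECT,
    sqlite3.SQLITE_READ,
    getattr(sqlite3, "SQLITE_FUNCTION", 31),
}

def read_only_authorizer(action, arg1, arg2, db_name, source):
    # Called by SQLite while it compiles each statement
    return sqlite3.SQLITE_OK if action in READ_ACTIONS else sqlite3.SQLITE_DENY

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, plan TEXT);
    INSERT INTO users VALUES (1, 'Alice', 'pro');
    INSERT INTO users VALUES (2, 'Bob', 'free');
""")
conn.set_authorizer(read_only_authorizer)  # installed only after setup is done

def query_database(query: str) -> dict:
    try:
        cur = conn.execute(query)
        columns = [d[0] for d in cur.description]
        return {"rows": [dict(zip(columns, row)) for row in cur.fetchall()]}
    except sqlite3.Error as exc:
        return {"error": str(exc)}

print(query_database("SELECT COUNT(*) AS n FROM users WHERE plan = 'pro'"))
print(query_database("DELETE FROM users"))
```

With this in place, a write is rejected by the database engine even if it slips past string-level filtering. For file-backed databases, opening the connection in read-only mode is another option.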
Example 3: Multi-Tool Agent
This example builds a small agent that can search the web, read URLs, and send emails. It demonstrates the agentic loop: the model calls tools iteratively until the task is complete.
import anthropic
import json
client = anthropic.Anthropic()
tools = [
{
"name": "search_web",
"description": "Search the web for current information. Returns a list of results with titles, URLs, and snippets.",
"input_schema": {
"type": "object",
"properties": {
"query": {"type": "string", "description": "Search query"}
},
"required": ["query"]
}
},
{
"name": "read_url",
"description": "Read the text content of a web page given its URL.",
"input_schema": {
"type": "object",
"properties": {
"url": {"type": "string", "description": "Full URL to read"}
},
"required": ["url"]
}
},
{
"name": "send_email",
"description": "Send an email to a recipient with a subject and body.",
"input_schema": {
"type": "object",
"properties": {
"to": {"type": "string", "description": "Recipient email address"},
"subject": {"type": "string", "description": "Email subject line"},
"body": {"type": "string", "description": "Email body (plain text)"}
},
"required": ["to", "subject", "body"]
}
}
]
# Simulated tool implementations
def search_web(query):
    return {"results": [
        {"title": "NVIDIA Q4 2026 Earnings", "url": "https://example.com/nvidia-earnings",
         "snippet": "NVIDIA reported revenue of $45B, up 78% YoY..."},
        {"title": "NVIDIA Earnings Analysis", "url": "https://example.com/nvidia-analysis",
         "snippet": "Data center revenue drove growth at $38B..."}
    ]}

def read_url(url):
    return {"content": "NVIDIA reported Q4 2026 revenue of $45 billion, beating estimates of $42B. "
                       "Data center revenue reached $38B (+95% YoY). Gaming revenue was $4.2B (+15%). "
                       "Gross margin was 73.5%. The company announced a $50B buyback program."}

def send_email(to, subject, body):
    return {"status": "sent", "message_id": "msg_abc123"}
tool_map = {"search_web": search_web, "read_url": read_url, "send_email": send_email}
def run_agent(task: str, max_iterations: int = 10) -> str:
    """Run the agent loop until task completion or max iterations."""
    messages = [{"role": "user", "content": task}]
    for i in range(max_iterations):
        response = client.messages.create(
            model="claude-sonnet-4-20250514", max_tokens=4096,
            tools=tools, messages=messages
        )
        # Stop on any reason other than "tool_use" (end_turn, max_tokens, ...)
        if response.stop_reason != "tool_use":
            return "".join(b.text for b in response.content if hasattr(b, "text"))
        # Execute all tool calls
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = tool_map[block.name](**block.input)
                print(f" [{i+1}] {block.name}({json.dumps(block.input)[:80]}...)")
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": json.dumps(result)
                })
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})
    return "Max iterations reached"
# The agent will: search → read article → compose email → send
result = run_agent(
"Research the latest NVIDIA earnings and email a summary to investor@example.com"
)
print(result)
The run_agent function is a simple loop, capped at max_iterations, that continues calling the model until the task is complete or the cap is reached. The model autonomously determines the sequence: search first, read the most relevant article, compose an email, and send it. This is the core pattern underlying every AI agent framework.
Example 4: Calculator and Code Execution
LLMs are notably poor at arithmetic. Tool calling resolves this by offloading computation to actual code:
import anthropic
import json
import math
client = anthropic.Anthropic()
tools = [
{
"name": "calculate",
"description": "Evaluate a mathematical expression. Supports standard math operations (+, -, *, /, **, %), functions (sqrt, sin, cos, log, abs), and constants (pi, e). Examples: '2**10', 'sqrt(144)', 'log(1000, 10)'",
"input_schema": {
"type": "object",
"properties": {
"expression": {"type": "string", "description": "Math expression to evaluate"}
},
"required": ["expression"]
}
},
{
"name": "run_python",
"description": "Execute a Python code snippet and return stdout output. Use for complex calculations, data processing, or generating formatted results. The code runs in a sandboxed environment.",
"input_schema": {
"type": "object",
"properties": {
"code": {"type": "string", "description": "Python code to execute"}
},
"required": ["code"]
}
}
]
def calculate(expression: str) -> dict:
    # Restricted namespace: this narrows what eval() can reach, but it is
    # not a complete sandbox, so treat it as a demonstration only.
    allowed = {k: v for k, v in math.__dict__.items() if not k.startswith('_')}
    allowed.update({"abs": abs, "round": round, "min": min, "max": max})
    try:
        result = eval(expression, {"__builtins__": {}}, allowed)
        return {"expression": expression, "result": result}
    except Exception as e:
        return {"error": str(e)}

def run_python(code: str) -> dict:
    # WARNING: In production, use a proper sandbox (Docker, gVisor, etc.)
    import io, contextlib
    output = io.StringIO()
    try:
        # Capture anything the snippet prints so it can be returned to the model
        with contextlib.redirect_stdout(output):
            exec(code, {"__builtins__": __builtins__})
        return {"stdout": output.getvalue(), "status": "success"}
    except Exception as e:
        return {"error": str(e), "status": "error"}
tool_map = {"calculate": calculate, "run_python": run_python}
# Ask something that requires precise computation
messages = [{"role": "user", "content":
"If I invest $10,000 at 7.5% annual return compounded monthly, "
"how much will I have after 20 years? Show the year-by-year breakdown."}]
response = client.messages.create(
model="claude-sonnet-4-20250514", max_tokens=4096,
tools=tools, messages=messages
)
while response.stop_reason == "tool_use":
    tool_results = []
    for block in response.content:
        if block.type == "tool_use":
            result = tool_map[block.name](**block.input)
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": json.dumps(result)
            })
    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": tool_results})
    response = client.messages.create(
        model="claude-sonnet-4-20250514", max_tokens=4096,
        tools=tools, messages=messages
    )

for block in response.content:
    if hasattr(block, "text"):
        print(block.text)
The run_python tool above uses exec(), which is unsafe in production. Code execution should always be sandboxed using containers, WebAssembly, or dedicated code execution services. LLM-generated code should never be run with full system access.
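For the narrower calculator case, one alternative to eval() is to parse the expression and evaluate only an explicit allow-list of syntax nodes. The following is a minimal sketch of that technique using Python's `ast` module; `safe_calculate` is a hypothetical replacement rather than the implementation used above, and it does not bound the size of results (an expression such as a very large power can still be slow).

```python
import ast
import math
import operator

BINARY_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
              ast.Div: operator.truediv, ast.Pow: operator.pow, ast.Mod: operator.mod}
UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}
FUNCTIONS = {"sqrt": math.sqrt, "sin": math.sin, "cos": math.cos,
             "log": math.log, "abs": abs}
CONSTANTS = {"pi": math.pi, "e": math.e}

def safe_calculate(expression: str) -> float:
    """Evaluate arithmetic by walking the syntax tree; reject everything else."""
    def evaluate(node):
        if (isinstance(node, ast.Constant) and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in BINARY_OPS:
            return BINARY_OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in UNARY_OPS:
            return UNARY_OPS[type(node.op)](evaluate(node.operand))
        if isinstance(node, ast.Name) and node.id in CONSTANTS:
            return CONSTANTS[node.id]
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in FUNCTIONS and not node.keywords):
            return FUNCTIONS[node.func.id](*[evaluate(arg) for arg in node.args])
        # Attribute access, subscripts, lambdas, imports, etc. all land here
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return evaluate(ast.parse(expression, mode="eval").body)

print(safe_calculate("2**10"))       # 1024
print(safe_calculate("sqrt(144)"))   # 12.0
```

Because only listed node types are evaluated, an input such as `__import__('os').system('ls')` is rejected with a ValueError instead of being executed.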
The Agentic Loop: From Tool Calling to AI Agents
Tool calling is a single request-response interaction. An AI agent is what results when tool calling is placed within a loop. The agent continues to think, call tools, observe results, and think again, until the task is complete.
The Basic Agent Loop
while task is not complete:
    1. THINK   → Model analyzes the current state and decides what to do next
    2. SELECT  → Model chooses a tool and generates arguments
    3. EXECUTE → Application runs the tool and captures the result
    4. OBSERVE → Result is fed back to the model
    5. REPEAT  → Model decides: need more info? Call another tool. Done? Respond.

┌──────────────────────────────────────────────┐
│                  AGENT LOOP                  │
│                                              │
│  ┌─────────┐     ┌──────────┐    ┌─────────┐ │
│  │  THINK  │────→│  SELECT  │───→│ EXECUTE │ │
│  │         │     │   TOOL   │    │  TOOL   │ │
│  └────▲────┘     └──────────┘    └────┬────┘ │
│       │                               │      │
│       │          ┌──────────┐         │      │
│       └──────────│ OBSERVE  │◀────────┘      │
│                  │  RESULT  │                │
│                  └─────┬────┘                │
│                        │                     │
│               Done? ───┤                     │
│               No ──────┘ (loop back)         │
│               Yes ─────→ RESPOND to user     │
└──────────────────────────────────────────────┘
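The loop can be sketched without any provider SDK. In the illustration below, fake_model is a hypothetical stand-in for a real model call, get_time is a made-up tool, and MAX_ITERATIONS is an assumed safety cap; a real agent would replace fake_model with an API request and keep the rest of the structure.

```python
from typing import Callable

MAX_ITERATIONS = 10  # assumed cap; tune per application


def get_time(city: str) -> dict:
    """Hypothetical tool used only for this illustration."""
    return {"city": city, "time": "12:00"}


TOOLS: dict[str, Callable[..., dict]] = {"get_time": get_time}


def fake_model(history: list[dict]) -> dict:
    """Stand-in for a real model call: request one tool, then answer."""
    observations = [m for m in history if m["role"] == "tool"]
    if not observations:  # THINK + SELECT: nothing observed yet, so ask for a tool
        return {"type": "tool_call", "name": "get_time", "args": {"city": "Tokyo"}}
    return {"type": "final", "text": f"It is {observations[-1]['content']['time']} in Tokyo."}


def run_agent(user_message: str) -> str:
    history = [{"role": "user", "content": user_message}]
    for _ in range(MAX_ITERATIONS):
        decision = fake_model(history)
        if decision["type"] == "final":  # Done? Yes: respond to the user
            return decision["text"]
        tool = TOOLS.get(decision["name"])
        if tool is None:  # unknown tool: report it instead of crashing
            result = {"error": f"Unknown tool: {decision['name']}"}
        else:
            result = tool(**decision["args"])  # EXECUTE
        history.append({"role": "tool", "content": result})  # OBSERVE
    return "Stopped: iteration limit reached."


print(run_agent("What time is it in Tokyo?"))
```

The for loop with a fixed bound, rather than an open-ended while, is a deliberate choice: it guarantees the agent halts even if the model never converges on a final answer.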
This pattern is widespread:
- Claude Code, the tool through which a reader may be encountering this post, uses exactly this pattern. When Claude Code is asked to fix a bug in auth.py, it calls tools such as Read (to read files), Grep (to search code), Edit (to modify files), and Bash (to run tests), iterating until the bug is fixed.
- ChatGPT with plugins follows the same loop: the model decides which plugins to invoke, the application executes them, and the model reads the results and continues.
- GitHub Copilot’s agent mode reads the codebase, makes edits, runs tests, and iterates, all through tool calling.
How Claude Code Uses Tool Calling
Claude Code is an effective real-world example. Given a task, it has access to tools such as:
| Tool | What It Does | Example Use |
|---|---|---|
| Read | Reads a file from disk | Read src/auth.py to understand the code |
| Write | Creates or overwrites a file | Write a new test file |
| Edit | Makes targeted edits to a file | Fix a specific line in a function |
| Bash | Runs a shell command | Run pytest to check if the fix works |
| Grep | Searches file contents | Find all usages of a function |
| Glob | Finds files by pattern | Find all *.test.py files |
A typical Claude Code session may involve 20 to 50 tool calls for a single task. The model reads a file, identifies the problem, searches for related code, makes an edit, runs the tests, observes a test failure, reads the error, makes another edit, runs the tests again, and finally reports success. Every step is a tool call. The intelligence is in determining which tool to call and which arguments to use; the actual execution is performed by the user’s computer.
The Progression: Tool Call to Agent
Understanding tool calling makes the full progression of AI capability visible:
1. Simple tool call: A user asks a question, the model calls one tool, and the model responds. (Weather lookup.)
2. Multi-tool call: The model calls several tools in parallel or sequence within a single turn. (Weather plus stock price.)
3. Multi-step chain: The model calls tools iteratively across multiple turns, using each result to inform the next call. (Research, read, summarise, email.)
4. Autonomous agent: The model operates in a loop with minimal human intervention, using tools to accomplish complex goals. (Claude Code fixing a bug across multiple files.)
Each step builds on the one before. Understanding step 1 establishes the foundation for step 4. Tool calling is the atomic unit of AI agency.
Model Context Protocol (MCP): The Standard for Tool Calling
If every AI application defines its tools in a different format, the ecosystem becomes fragmented. The Model Context Protocol (MCP) addresses this problem.
MCP is an open standard, developed by Anthropic, that provides a universal way to connect AI models to external tools, data sources, and services. It can be understood as a USB-C equivalent for AI tools: a single standard that works across systems, in place of each system requiring its own proprietary connector.
How MCP Works
MCP defines a client-server architecture:
- MCP Clients (such as Claude Code, Claude Desktop, or a custom application) connect to MCP servers and expose the available tools to the AI model.
- MCP Servers expose three types of capabilities:
- Tools: Functions the model can call (the same concept as function calling).
- Resources: Data the model can read (files, database records, API responses).
- Prompts: Pre-defined prompt templates for common tasks.
┌──────────────┐     ┌─────────────┐     ┌─────────────┐
│ Claude       │     │ MCP         │     │ External    │
│ Desktop /    │────→│ Server      │────→│ Service     │
│ Claude Code  │     │ (your app)  │     │ (DB, API)   │
│ (MCP Client) │     │             │     │             │
└──────────────┘     └─────────────┘     └─────────────┘
The MCP Server exposes:
- Tools: query_database, create_ticket, send_slack_message
- Resources: customer_data, product_catalog
- Prompts: summarize_ticket, generate_report
Building a Simple MCP Server
The following is a minimal MCP server that exposes a database query tool:
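import asyncio
import json
import sqlite3

from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import Tool, TextContent

server = Server("database-server")

@server.list_tools()
async def list_tools():
    return [
        Tool(
            name="query_database",
            description="Run a read-only SQL query against the customer database.",
            inputSchema={
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "SQL SELECT query"}
                },
                "required": ["query"]
            }
        )
    ]

@server.call_tool()
async def call_tool(name: str, arguments: dict):
    if name != "query_database":
        return [TextContent(type="text", text=f"Error: Unknown tool '{name}'")]
    query = arguments["query"]
    # Naive guard only; pair it with read-only database credentials in production
    if not query.strip().upper().startswith("SELECT"):
        return [TextContent(type="text", text="Error: Only SELECT queries allowed")]
    conn = sqlite3.connect("customers.db")
    try:
        cursor = conn.execute(query)
        columns = [d[0] for d in cursor.description]
        rows = [dict(zip(columns, row)) for row in cursor.fetchall()]
    except sqlite3.Error as e:
        # Return the failure as a tool result so the model can react to it
        return [TextContent(type="text", text=f"Error: {e}")]
    finally:
        conn.close()
    return [TextContent(type="text", text=json.dumps(rows, indent=2))]

async def main():
    # Serve over stdio so an MCP client can launch this script as a subprocess
    async with stdio_server() as (read_stream, write_stream):
        await server.run(read_stream, write_stream, server.create_initialization_options())

if __name__ == "__main__":
    asyncio.run(main())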
Once an MCP-compatible client (Claude Code, Claude Desktop, or a custom application) is configured to connect to this server, the AI model can query the underlying database through tool calling. MCP handles the communication.
MCP Compared with Other Approaches
| Approach | Standardized? | Multi-Client | Discovery | Status |
|---|---|---|---|---|
| MCP | Open standard | Yes | Built-in | Growing adoption |
| OpenAI Plugins | OpenAI-specific | No | Plugin manifest | Deprecated in favor of GPTs |
| Custom function calling | No | No | Manual | Most flexible |
MCP is gaining substantial adoption in 2026. Major IDE extensions, AI coding tools, and enterprise platforms are adopting it as the standard means of connecting AI systems to external systems. For developers building tools for AI models, implementing them as MCP servers helps future-proof the work.
Best Practices for Designing Tools
The quality of the tools directly determines how well an AI application performs. A well-designed tool is comparable to a well-written function: a clear name, documented parameters, predictable behaviour. A poorly designed tool produces hallucinated arguments, incorrect tool selection, and unsatisfactory user experiences.
Naming and Descriptions
The model reads the tool’s name and description to determine when and how to use it. Investment in these elements is worthwhile, since they function effectively as prompts for the model.
| Aspect | Bad | Good |
|---|---|---|
| Function name | weather | get_current_weather |
| Function name | do_stuff | create_calendar_event |
| Description | “Gets weather” | “Get current weather conditions (temperature, humidity, wind) for a specific city. Use when the user asks about weather or atmospheric conditions.” |
| Parameter description | “The city” | “City name, e.g. ‘Tokyo’, ‘New York’, ‘London’. Use the English name.” |
Key Design Principles
One tool per action. Avoid creating a single manage_database tool that can query, insert, update, and delete. Instead, create separate tools: query_database, insert_record, update_record, delete_record. This provides the model with clearer choices and reduces errors.
Detailed JSON Schema. Use types, required fields, enums, defaults, and descriptions for every parameter. The more constrained the schema, the more reliable the model’s output:
{
  "properties": {
    "priority": {
      "type": "string",
      "enum": ["low", "medium", "high", "critical"],
      "description": "Task priority level. Use 'critical' only for production outages.",
      "default": "medium"
    },
    "due_date": {
      "type": "string",
      "description": "Due date in ISO 8601 format (YYYY-MM-DD), e.g. '2026-04-15'"
    }
  }
}
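A schema also gives the application something to check arguments against before a tool runs. The sketch below is a deliberately small, hand-rolled check covering only required fields, basic types, and enums; validate_arguments and TYPE_MAP are hypothetical names, and a production system would more likely rely on a full JSON Schema validation library.

```python
TYPE_MAP = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def validate_arguments(schema: dict, args: dict) -> list[str]:
    """Return a list of problems; an empty list means the arguments passed."""
    problems = []
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            problems.append(f"missing required field '{name}'")
    for name, value in args.items():
        spec = properties.get(name)
        if spec is None:
            problems.append(f"unexpected field '{name}'")
            continue
        expected = TYPE_MAP.get(spec.get("type"))
        if expected is not None and not isinstance(value, expected):
            problems.append(f"'{name}' should be of type {spec['type']}")
        if "enum" in spec and value not in spec["enum"]:
            problems.append(f"'{name}' must be one of {spec['enum']}")
    return problems


schema = {
    "type": "object",
    "properties": {
        "priority": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
        "due_date": {"type": "string"},
    },
    "required": ["priority"],
}
print(validate_arguments(schema, {"priority": "urgent"}))
```

Returning the list of problems to the model as a tool result, rather than raising, gives it a chance to correct the call on the next turn.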
Structured error messages. When a tool fails, return a structured error message that the model can understand and act on, rather than a stack trace:
# Bad: raises exception that crashes the loop
raise Exception("Connection timeout")
# Good: returns error the model can understand
return {"error": "Database connection timed out after 30s. The database may be under heavy load. Try again in a few minutes."}
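One way to apply this consistently is to wrap every tool invocation so that exceptions never escape into the agent loop. The wrapper below is a sketch: safe_tool_call, the ok/error/retryable keys, and the flaky_lookup tool are all assumptions made for illustration.

```python
def safe_tool_call(tool, **kwargs) -> dict:
    """Run a tool and convert any exception into a structured error result."""
    try:
        return {"ok": True, "result": tool(**kwargs)}
    except TimeoutError:
        return {"ok": False, "error": "The tool timed out.", "retryable": True}
    except Exception as exc:  # last-resort catch so the agent loop keeps running
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}", "retryable": False}


def flaky_lookup(customer_id: int) -> dict:
    """Hypothetical tool that always fails, for illustration."""
    raise TimeoutError("Connection timeout")


print(safe_tool_call(flaky_lookup, customer_id=42))
```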
Separate read and write tools. This separation is essential for safety. A query_database tool (read-only) is safe to call freely. A delete_record tool (destructive) should require confirmation. Separation allows different safety policies to be applied to each.
Confirmation for dangerous actions. Before deleting data, sending emails, or making payments, the model should ask for user confirmation. This can be implemented by having the tool return a “confirmation required” response that the model must present to the user before proceeding.
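A confirmation gate of this kind might look like the following sketch. The names DESTRUCTIVE_TOOLS, delete_record, and call_with_confirmation are hypothetical, and the confirmed flag is assumed to be set by application code after the user approves, never by the model itself.

```python
DESTRUCTIVE_TOOLS = {"delete_record"}  # assumed policy: these need explicit approval


def delete_record(record_id: int) -> dict:
    """Hypothetical destructive tool."""
    return {"deleted": record_id}


TOOLS = {"delete_record": delete_record}


def call_with_confirmation(name: str, args: dict, confirmed: bool = False) -> dict:
    """Return a confirmation request instead of executing, unless already confirmed."""
    if name in DESTRUCTIVE_TOOLS and not confirmed:
        return {
            "status": "confirmation_required",
            "message": f"About to run {name} with {args}. Ask the user to confirm.",
        }
    return {"status": "done", "result": TOOLS[name](**args)}


first = call_with_confirmation("delete_record", {"record_id": 7})
second = call_with_confirmation("delete_record", {"record_id": 7}, confirmed=True)
print(first["status"], second["status"])
```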
Common Pitfalls and How to Avoid Them
Even with well-designed tools, problems can arise. The following are the most common issues, along with their remedies:
| Pitfall | Cause | Solution |
|---|---|---|
| Model hallucinating tool calls | Tool name similar to a known concept | Use strict tool definitions; validate tool name before execution |
| Wrong argument types | Vague or missing JSON Schema | Add detailed types, enums, and descriptions; include examples |
| Infinite tool loops | Model keeps calling tools without converging | Set max_iterations limit; add “no more info needed” guidance |
| Unnecessary tool calls | Overly broad tool description | Write precise descriptions about when to use the tool |
| Ignoring tool errors | Error returned as exception, not tool result | Always return errors as tool results so the model can handle them |
| SQL injection via tool args | LLM-generated SQL executed without validation | Parameterized queries; read-only database user; query allowlists |
| Command injection | LLM-generated shell commands executed directly | Sandboxing; allowlisted commands only; never run them with shell=True |
| Token cost explosion | Tool results too large (e.g., full database dumps) | Paginate results; limit response size; summarize large outputs |
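The command injection remedy can be illustrated with a short sketch. The allowlist contents and the function name run_allowlisted are assumptions; the point is that the command is parsed into an argument list and run without a shell, so shell metacharacters such as ; are never interpreted.

```python
import shlex
import subprocess

ALLOWED_COMMANDS = {"ls", "cat", "pytest"}  # assumed allowlist for illustration


def run_allowlisted(command_line: str, timeout_seconds: float = 10.0) -> dict:
    """Run a command only if its executable is allowlisted; never uses a shell."""
    try:
        argv = shlex.split(command_line)
    except ValueError as exc:  # e.g. unbalanced quotes
        return {"error": f"Could not parse command: {exc}"}
    if not argv or argv[0] not in ALLOWED_COMMANDS:
        return {"error": f"Command not allowed: {argv[0] if argv else '(empty)'}"}
    try:
        completed = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_seconds
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return {"error": f"{type(exc).__name__}: {exc}"}
    return {"returncode": completed.returncode, "stdout": completed.stdout}
```

An allowlist on the executable alone does not constrain its arguments (cat can still read sensitive files), so this complements sandboxing rather than replacing it.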
Security Considerations
Security warrants particular attention because tool calling enables an LLM to take real actions. A prompt injection attack that convinces the model to call delete_all_users() is no longer a theoretical concern; it is a real risk.
Key security practices include:
- Input validation. Validate all tool arguments before execution. The model should not be trusted to provide safe inputs consistently.
- Least privilege. Provide tools with the minimum permissions necessary. Database tools should use read-only credentials unless writes are required.
- Rate limiting. Limit how often tools can be called to prevent abuse or runaway loops.
- Audit logging. Log every tool call with its arguments and results. This is essential for debugging and security audits.
- Sandboxing. Code execution tools must run in isolated environments (containers, VMs, or WebAssembly sandboxes).
- Confirmation gates. Destructive operations (delete, send, pay) should require human confirmation before execution.
Tool Calling in Production
Moving from a prototype to production requires additional engineering around reliability, observability, and cost management.
Reliability Patterns
Caching: Cache tool results to avoid redundant API calls. If the model requests the weather in Tokyo twice in the same conversation, the cached result should be returned. Use time-based expiration (for example, a 5-minute TTL for weather data).
import json
from datetime import datetime, timedelta

_cache = {}

def cached_tool_call(name: str, args: dict, ttl_seconds: int = 300):
    # Key on tool name plus canonicalised arguments so equivalent calls share an entry
    key = f"{name}:{json.dumps(args, sort_keys=True)}"
    if key in _cache:
        result, timestamp = _cache[key]
        if datetime.now() - timestamp < timedelta(seconds=ttl_seconds):
            return result
    # execute_tool is the application's own dispatcher, assumed to be defined elsewhere
    result = execute_tool(name, args)
    _cache[key] = (result, datetime.now())
    return result
Retry with backoff: External APIs fail. Implement retries with exponential backoff for transient errors (timeouts, rate limits, 5xx errors).
Fallback strategies: When a tool fails after retries, return a structured error message that allows the model to inform the user appropriately, rather than crashing the entire interaction.
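Retry and fallback fit naturally into one helper. The sketch below assumes that TimeoutError and ConnectionError are the transient failures worth retrying; call_with_retry, its defaults, and the injectable sleep parameter (useful for tests) are illustrative choices rather than a prescribed design.

```python
import random
import time


def call_with_retry(tool, args: dict, max_attempts: int = 3,
                    base_delay: float = 0.5, sleep=time.sleep) -> dict:
    """Retry transient failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return {"ok": True, "result": tool(**args)}
        except (TimeoutError, ConnectionError) as exc:  # treated as transient here
            if attempt == max_attempts:
                # Fallback: a structured error the model can relay to the user.
                return {"ok": False,
                        "error": f"{type(exc).__name__} after {max_attempts} attempts"}
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))  # jitter spreads out retries
    return {"ok": False, "error": "max_attempts must be at least 1"}
```

Other exception types propagate unchanged in this sketch, on the assumption that retrying a validation or permission error would only waste time.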
Observability
Logging: Log every tool call in a structured format:
{
  "timestamp": "2026-04-03T10:30:00Z",
  "conversation_id": "conv_abc123",
  "tool_name": "get_weather",
  "arguments": {"city": "Tokyo"},
  "result_summary": "success, temperature=22",
  "latency_ms": 245,
  "tokens_used": {"input": 150, "output": 45}
}
Monitoring: Track key metrics:
- Tool call success rate (should remain above 95%).
- Average tool latency (directly affects user experience).
- Tool calls per conversation (indicative of complexity).
- Token cost per tool call cycle (each call adds tokens to the context).
- Error rates by tool (useful for identifying problematic tools).
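Several of these metrics can be derived directly from the structured logs. The sketch below uses made-up records loosely modelled on the log format above, with an assumed boolean ok field in place of the free-text result summary; summarize is a hypothetical helper and expects a non-empty list.

```python
from collections import defaultdict

# Hypothetical log records; the "ok" field is an assumption for this sketch.
records = [
    {"tool_name": "get_weather", "ok": True, "latency_ms": 245},
    {"tool_name": "get_weather", "ok": True, "latency_ms": 310},
    {"tool_name": "query_database", "ok": False, "latency_ms": 1200},
    {"tool_name": "query_database", "ok": True, "latency_ms": 95},
]


def summarize(records: list[dict]) -> dict:
    """Compute overall success rate, mean latency, and error rate per tool."""
    per_tool = defaultdict(lambda: {"calls": 0, "errors": 0})
    for record in records:
        stats = per_tool[record["tool_name"]]
        stats["calls"] += 1
        stats["errors"] += 0 if record["ok"] else 1
    total = len(records)
    return {
        "success_rate": sum(r["ok"] for r in records) / total,
        "avg_latency_ms": sum(r["latency_ms"] for r in records) / total,
        "error_rate_by_tool": {
            name: stats["errors"] / stats["calls"] for name, stats in per_tool.items()
        },
    }


print(summarize(records))
```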
Cost Optimisation
Every tool call adds tokens to the context window. The tool definitions themselves are included in every API request, so 20 detailed tools may add 2,000 to 3,000 tokens before the conversation begins.
Strategies to manage costs include:
- Dynamic tool loading. Include only relevant tools, based on the conversation context. A weather conversation does not require database tools.
- Result compression. Truncate or summarise large tool results before returning them to the model. A full database dump is rarely necessary; summary statistics are usually sufficient.
- Conversation pruning. In long multi-tool conversations, summarise earlier tool results and remove the raw data from the context.
- Model selection. Use cheaper, faster models (such as Claude Haiku or GPT-4o-mini) for simple tool-calling tasks, and reserve expensive models for complex reasoning.
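Result compression is often the simplest of these to implement. The following sketch caps the number of rows returned to the model and tells it how many were left out; compress_result and its default limits are assumptions chosen for illustration.

```python
import json


def compress_result(rows: list[dict], max_rows: int = 5, max_chars: int = 2000) -> str:
    """Return at most max_rows rows plus a note about what was omitted."""
    payload = {"total_rows": len(rows), "rows": rows[:max_rows]}
    if len(rows) > max_rows:
        payload["note"] = f"{len(rows) - max_rows} rows omitted; refine the query to see more."
    text = json.dumps(payload)
    if len(text) > max_chars:  # hard cap as a last resort; output is no longer valid JSON
        text = text[:max_chars] + "... [truncated]"
    return text


rows = [{"id": i, "name": f"customer-{i}"} for i in range(100)]
print(compress_result(rows, max_rows=3))
```

Including the total row count and an explicit note matters: without them, the model may assume the truncated sample is the complete result.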
Testing Tool-Calling Applications
Tools should be tested independently before they are integrated with the LLM:
- Unit tests. Test each tool function with a variety of inputs, including edge cases and invalid arguments.
- Integration tests. Test the tool against the actual API or database to which it connects.
- LLM integration tests. Test the full loop with the model. Provide a set of test prompts and verify that the model calls the correct tools with correct arguments.
- Adversarial tests. Test with prompts designed to trick the model into misusing tools (prompt injection).
# Example: testing that the model calls the right tool
def test_weather_tool_selection():
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        tools=tools,
        messages=[{"role": "user", "content": "What's the weather in London?"}]
    )
    tool_calls = [b for b in response.content if b.type == "tool_use"]
    assert len(tool_calls) == 1
    assert tool_calls[0].name == "get_weather"
    assert tool_calls[0].input["city"] == "London"

def test_no_tool_for_general_question():
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        tools=tools,
        messages=[{"role": "user", "content": "What is the capital of France?"}]
    )
    # Model should answer directly, no tool call
    assert response.stop_reason == "end_turn"
The Future of Tool Calling
Tool calling is evolving rapidly. Several directions are notable:
Computer Use
Anthropic's computer use capability extends tool calling to its logical conclusion: instead of calling specific APIs, the model controls an entire computer desktop. It views the screen (via screenshots), moves the mouse, clicks buttons, and types text. The "tools" become the entire computer interface: every application, website, and file. This is the most general form of tool use: rather than building a specific tool for every task, the model is given the same tools a human uses.
More Reliable Structured Output
Constrained decoding is making tool calling more reliable. Rather than relying on the model to produce valid JSON, the decoding process itself enforces the JSON Schema; the model is mechanically prevented from producing invalid output. OpenAI's strict mode and Anthropic's improvements in JSON reliability move in this direction.
Tool Learning and Discovery
Current models use tools that are explicitly defined in the request. Future models may be able to discover tools dynamically—browsing an API directory, reading documentation, and determining how to use a new tool without it being predefined. MCP is laying the groundwork for this through its discovery protocol.
Multi-Agent Tool Sharing
As multi-agent systems become more common (multiple AI agents collaborating on a task), tool sharing becomes important. One agent may specialise in database queries while another handles email. MCP's architecture supports this by allowing multiple agents to connect to the same tool servers.
Standardisation
MCP adoption is accelerating. In the same way that REST APIs standardised web service communication, MCP is standardising how AI models interact with external tools. For developers and companies building AI tools, this means writing the tool once and making it available to every AI model and client that supports MCP.
Final Thoughts
Tool calling is the underlying infrastructure behind every AI agent, every chatbot plugin, and every autonomous AI system. The mechanism is deceptively simple—a model outputs a function name and arguments, the application's code executes the function, and the result is returned to the model—but this simple loop is what transformed LLMs from text generators into systems that can act in the real world.
To summarise the material covered:
- The core concept. Tool calling allows LLMs to request the execution of external functions. The model plans; the application acts.
- The three-step loop. The user asks, the model requests a tool call, and the application executes the tool and returns the result, which the model then weaves into its response.
- Provider implementations. Claude, GPT, and Gemini all support tool calling with slightly different formats but the same underlying pattern.
- Practical patterns. Examples range from simple weather lookups to chained tool calls, database queries, and multi-tool agents.
- The agentic loop. Tool calling in a loop is the foundation of AI agents. Claude Code, ChatGPT plugins, and GitHub Copilot all operate on this basis.
- MCP. The open standard that is making tool definitions universal and interoperable.
- Best practices. Clear naming, detailed schemas, error handling, security, and the read/write separation principle.
- Production concerns. Caching, logging, cost optimisation, and testing strategies.
Developers should begin building with tool calling immediately. Select an API already in use, define it as a tool, and connect it to Claude or GPT. The transition from "AI that converses" to "AI that acts" is more rapid than expected. For analysts and investors, the relevant observation is that tool calling is not merely a feature; it is the foundation of the entire AI agent ecosystem. Companies that master tool integration will define the next phase of AI.
The era of AI that only converses has passed. The era of AI that acts is beginning, and tool calling is the mechanism that makes it possible.
References
- Anthropic. "Tool use (function calling): Claude Documentation." docs.anthropic.com/en/docs/build-with-claude/tool-use
- OpenAI. "Function calling: OpenAI API Documentation." platform.openai.com/docs/guides/function-calling
- Google. "Function calling: Gemini API Documentation." ai.google.dev/gemini-api/docs/function-calling
- Anthropic. "Model Context Protocol: Documentation." modelcontextprotocol.io
- Anthropic. "Computer use: Claude Documentation." docs.anthropic.com/en/docs/build-with-claude/computer-use
- Anthropic. "Claude Code: Documentation." docs.anthropic.com/en/docs/claude-code
- Schick, T., et al. "Toolformer: Language Models Can Teach Themselves to Use Tools." arXiv:2302.04761, 2023.
- Qin, Y., et al. "ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs." arXiv:2307.16789, 2023.