Benchmarks Claim 95%. Production Disagrees.
The Berkeley Function Calling Leaderboard (BFCL V4) reports that GPT-4o achieves over 90% accuracy on single-function tool calls. Add a second tool to the context, and accuracy drops by double digits. Add five, and you’re in a different regime entirely. The gap between benchmark function calling performance and what teams observe in production is not a rounding error — it’s the difference between shipping an agent and babysitting one.
Function calling — the mechanism by which LLMs select and populate external tool schemas — is the load-bearing wall of every agentic system. If the model calls the wrong function, passes a malformed parameter, or hallucinates a required field, the entire downstream workflow breaks. And in production, with real API schemas, nested parameters, and multi-step orchestration, it breaks far more often than leaderboard numbers suggest.
Where Benchmarks Mislead
The BFCL evaluates function calling across several categories: simple single-function calls, multiple function selection, parallel invocation, and relevance detection (knowing when not to call a function). On the simple category, top models like GPT-4o and Claude Opus 4.8 regularly exceed 90% accuracy. That number sells well in release notes. It’s also nearly useless for planning production systems.
Consider what a production function schema looks like. Not the clean, self-contained test functions from BFCL. Real enterprise schemas — Stripe payment intents, AWS SDK calls, Kubernetes API operations — have dozens of optional fields, nested objects, conditional requirements, and interdependencies between parameters. The BFCL leaderboard itself acknowledges this: when you move from simple to multiple-function scenarios, accuracy degrades noticeably, and open-source models fall behind proprietary ones by 15-20 percentage points.
The original Gorilla paper from UC Berkeley documented this problem precisely: even GPT-4 struggles with generating accurate input arguments for API calls and has a persistent tendency to hallucinate incorrect API usage. The paper introduced APIBench, and the results showed that retrieval-augmented fine-tuning was necessary to get reliable API usage — prompting alone wasn’t cutting it.
The Schema Complexity Wall
Here’s what actually kills function calling accuracy in production systems:
- Schema size: Most benchmarks test with 2-10 functions in context. Real agent systems expose 20-50 tools. Every additional function definition competes for attention, and the model’s ability to discriminate between similar functions degrades non-linearly.
- Nested parameters: A Stripe payment_intent.create call has deeply nested objects (e.g., payment_method_options.card.request_three_d_secure). LLMs frequently flatten or omit nested structures.
- Enum constraints: When a parameter accepts one of 30 possible values, models often generate plausible-sounding but invalid options. BFCL’s error analysis shows “Value Errors” as a dominant failure category.
- Conditional fields: Field A is required only when field B has value X. This kind of constraint is trivially expressed in JSON Schema but poorly handled by most LLMs.
- Multi-step orchestration: BFCL V3 and V4 added multi-turn evaluation precisely because single-turn accuracy was painting an optimistic picture. In multi-turn scenarios, errors compound — a wrong function call at step 2 cascades through all subsequent steps.
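To make the nested, enum, and conditional failure modes from the list above concrete, here is a minimal sketch of a hand-rolled argument checker. It is not a full JSON Schema validator, and everything in it is an assumption for illustration: the allowed enum values, the return_url rule, and the function name are hypothetical and only loosely modeled on the Stripe example, not taken from the real Stripe schema.

```python
from typing import Any

# Hypothetical enum values, for illustration only.
ALLOWED_3DS = {"automatic", "any", "challenge"}


def check_payment_args(args: dict[str, Any]) -> list[str]:
    """Return a list of problems found in model-generated arguments."""
    problems: list[str] = []

    # Nested parameters: the field must sit at its full nested path,
    # not be flattened to the top level.
    if "request_three_d_secure" in args:
        problems.append(
            "flattened: request_three_d_secure belongs under "
            "payment_method_options.card"
        )
    options = args.get("payment_method_options")
    card = options.get("card") if isinstance(options, dict) else None
    value = card.get("request_three_d_secure") if isinstance(card, dict) else None

    # Enum constraints: plausible-sounding values are still invalid.
    if value is not None and value not in ALLOWED_3DS:
        problems.append(f"invalid enum value: {value!r}")

    # Conditional fields (hypothetical rule): return_url is required
    # only when confirm is true.
    if args.get("confirm") is True and "return_url" not in args:
        problems.append("return_url is required when confirm is true")

    return problems
```

A call that flattens the nested field and sets confirm without a return_url yields two problems, while a correctly nested call with a valid enum value yields none.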
The APIGen paper from Salesforce Research demonstrated something instructive: a 7B-parameter model trained on their verified function-calling dataset outperformed GPT-4 on the BFCL benchmark. The key wasn’t model size — it was data quality. Their three-stage verification pipeline (format checking, actual execution, semantic verification) produced training data that taught the model what correct function calls actually look like. This suggests that the problem is less about model capability and more about the distribution gap between training data and production schemas.
Strict Mode Changes the Game
Anthropic recently introduced strict tool use — adding strict: true to tool definitions guarantees that Claude’s output conforms to the provided JSON Schema. This is a meaningful engineering primitive, not just a checkbox. It means you can rely on the output being parseable and structurally valid, even if the semantic content might be wrong.
OpenAI offers a similar guarantee with structured outputs via JSON Schema. These are not trivial features. Before strict modes, every production system needed a defensive validation layer between the LLM’s output and the actual function call — parsing try/catch blocks, schema validators, retry loops. Strict mode eliminates the structural error class entirely. What it doesn’t eliminate is semantic errors: the model fills in a valid but incorrect value, calls the right function with wrong parameters, or selects an inappropriate tool.
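As a rough illustration, a strict tool definition might look like the dictionary below. The tool name, its fields, and the exact placement of the strict flag are assumptions and should be checked against your provider's documentation. The helper is a hypothetical pre-flight check based on the rules OpenAI documents for strict schemas (every object sets additionalProperties to false and lists all of its properties as required); other providers may impose different constraints.

```python
from typing import Any

# Hypothetical tool definition; verify the field layout against provider docs.
get_weather_tool = {
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "strict": True,
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city", "units"],
        "additionalProperties": False,
    },
}


def strict_schema_issues(schema: dict[str, Any], path: str = "$") -> list[str]:
    """Recursively flag objects that a strict mode would likely reject."""
    issues: list[str] = []
    if schema.get("type") == "object":
        props = schema.get("properties", {})
        if schema.get("additionalProperties") is not False:
            issues.append(f"{path}: additionalProperties must be false")
        missing = sorted(set(props) - set(schema.get("required", [])))
        if missing:
            issues.append(f"{path}: not listed in required: {missing}")
        for name, sub in props.items():
            issues.extend(strict_schema_issues(sub, f"{path}.{name}"))
    elif schema.get("type") == "array" and isinstance(schema.get("items"), dict):
        issues.extend(strict_schema_issues(schema["items"], f"{path}[]"))
    return issues
```

Running a check like this in CI catches schemas that would be rejected at request time, before they reach production.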
The hierarchy of function calling errors looks like this (building on our earlier analysis of schema-valid LLM output still getting 20% of values wrong):
| Error Type | Strict Mode Fixes? | Frequency in Prod |
|---|---|---|
| Invalid JSON / malformed output | Yes | Low (was ~5-10%) |
| Schema non-conformance (wrong types) | Yes | Medium (was ~10-15%) |
| Wrong function selection | No | Medium-High (~15-20%) |
| Valid but incorrect parameter values | No | High (~20-30%) |
| Missing required context for multi-step | No | High in agents (~25-40%) |
Strict mode solves the first two rows. The bottom three — which represent the bulk of production failures — require different strategies.
Defensive Patterns That Actually Work
Teams running function-calling agents at scale converge on a small set of defensive patterns. Here’s what works, based on Anthropic’s own engineering guidance and production reports from gateway providers:
Reduce the Tool Surface Per Call
Don’t expose your entire tool catalog to every LLM call. Use a routing layer that selects 3-5 relevant tools based on the current task context. This follows the general principle in Anthropic’s guidance of “finding the simplest solution possible, and only increasing complexity when needed.” Fewer tools in context means higher discrimination accuracy. Portkey’s Agent Gateway implements this at the infrastructure level by registering agents with scoped tool sets and enforcing access control per invocation.
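A routing layer can start out very simple. The sketch below ranks tools by word overlap between the task and each tool's name and description. The function name, the catalog format, and the scoring rule are all assumptions for illustration; a production router would more likely use embeddings or a small classifier, and would filter stopwords.

```python
import re


def select_tools(task: str, catalog: dict[str, str], k: int = 5) -> list[str]:
    """Pick up to k tool names whose name and description overlap most with the task."""
    task_words = set(re.findall(r"[a-z0-9]+", task.lower()))
    scored = []
    for name, description in catalog.items():
        tool_words = set(re.findall(r"[a-z0-9]+", f"{name} {description}".lower()))
        overlap = len(task_words & tool_words)
        if overlap:
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically so results are stable.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:k]]


# Hypothetical catalog mapping tool names to descriptions.
catalog = {
    "create_refund": "Refund a charge to the customer",
    "get_weather": "Current weather for a city",
    "list_invoices": "List invoices for a customer",
}
print(select_tools("refund the customer for order 42", catalog, k=2))
```

Only the selected tool definitions are then passed to the model for that call; the rest of the catalog stays out of context.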
Validate Before You Execute
Even with strict mode, run every function call through a validation layer that checks semantic constraints: are the parameter values within acceptable ranges? Does the referenced resource exist? Is the requested operation permitted for this user? This is standard API gateway practice — apply it to LLM-generated calls too. LiteLLM’s proxy architecture supports guardrails and custom plugins that can intercept and validate function calls before they reach the target API.
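One way to structure that layer is as a list of small check functions, each returning an error message or nothing. The sketch below assumes a hypothetical refund tool and a hypothetical context dictionary (max_amount, known_customers, allowed_tools); real checks would query your own data stores and permission system.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ToolCall:
    name: str
    args: dict[str, Any]


# A check returns an error message, or None when the call passes.
Check = Callable[[ToolCall, dict[str, Any]], Optional[str]]


def amount_in_range(call: ToolCall, ctx: dict[str, Any]) -> Optional[str]:
    amount = call.args.get("amount")
    is_int = isinstance(amount, int) and not isinstance(amount, bool)
    if not is_int or not 0 < amount <= ctx["max_amount"]:
        return f"amount {amount!r} outside 1..{ctx['max_amount']}"
    return None


def customer_exists(call: ToolCall, ctx: dict[str, Any]) -> Optional[str]:
    customer = call.args.get("customer_id")
    if customer not in ctx["known_customers"]:
        return f"unknown customer {customer!r}"
    return None


def user_may_call(call: ToolCall, ctx: dict[str, Any]) -> Optional[str]:
    if call.name not in ctx["allowed_tools"]:
        return f"user is not permitted to call {call.name}"
    return None


def validate(call: ToolCall, ctx: dict[str, Any], checks: list[Check]) -> list[str]:
    """Run every check and collect all failures, so the agent sees them at once."""
    errors = []
    for check in checks:
        message = check(call, ctx)
        if message is not None:
            errors.append(message)
    return errors
```

Returning all failures rather than stopping at the first gives the model more to work with if the errors are fed back for a corrected attempt.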
Distinguish Between Recoverable and Fatal Errors
A function call that returns a 404 because the model used the wrong resource ID is recoverable — the agent can retry with corrected context. A function call that debits the wrong account is not. Map your function catalog to a risk matrix, and implement different execution policies accordingly. High-risk functions should require human confirmation or additional verification steps.
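A risk matrix can be as small as a lookup table plus two policy functions. In the sketch below, the tool names, the set of recoverable status codes, and the retry limit are hypothetical; the point is only that high-risk tools take a different path both before execution and after a failure.

```python
from enum import Enum


class Risk(Enum):
    LOW = "low"
    HIGH = "high"


# Hypothetical mapping; unknown tools are treated as high risk.
RISK_BY_TOOL = {"get_invoice": Risk.LOW, "debit_account": Risk.HIGH}

# Hypothetical set of statuses the agent may retry with corrected arguments.
RECOVERABLE_STATUSES = {404, 409, 422, 429}


def before_execute(tool: str, confirmed: bool = False) -> str:
    """Decide whether a call may run now or needs human confirmation."""
    risk = RISK_BY_TOOL.get(tool, Risk.HIGH)
    if risk is Risk.HIGH and not confirmed:
        return "ask_human"
    return "execute"


def after_failure(tool: str, status: int, attempt: int, max_attempts: int = 2) -> str:
    """Decide whether a failed call may be retried automatically."""
    risk = RISK_BY_TOOL.get(tool, Risk.HIGH)
    if risk is Risk.LOW and status in RECOVERABLE_STATUSES and attempt < max_attempts:
        return "retry"
    return "abort"
```

Defaulting unknown tools to high risk means a newly added tool cannot run unattended until someone has classified it.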
Log Everything, Trace the Chain
Function calling in agents is inherently multi-step. A failure at step 5 might be caused by a subtly wrong parameter at step 2. Full distributed tracing across the agent’s tool call chain is not optional — it’s the only way to debug non-deterministic failures. OpenTelemetry integration with LLM observability tools (Langfuse, Phoenix, Portkey traces) should be wired in from day one, not bolted on after the first production incident.
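The shape of such a trace can be sketched with the standard library alone. The ToolTrace class below is a hypothetical stand-in, not the OpenTelemetry API: it records one span per tool call under a shared trace ID so a failure at a late step can be traced back to the arguments used earlier. In a real deployment the same information would be emitted as OpenTelemetry spans.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Any, Iterator, Optional


@dataclass
class Span:
    trace_id: str
    step: int
    tool: str
    args: dict[str, Any]
    result: Any = None
    error: Optional[str] = None
    duration_ms: float = 0.0


class ToolTrace:
    """Collects one span per tool call for a single agent run."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex
        self.spans: list[Span] = []

    @contextmanager
    def step(self, tool: str, args: dict[str, Any]) -> Iterator[Span]:
        span = Span(self.trace_id, len(self.spans) + 1, tool, dict(args))
        self.spans.append(span)
        start = time.perf_counter()
        try:
            yield span
        except Exception as exc:
            # Record the failure on the span, then let it propagate.
            span.error = repr(exc)
            raise
        finally:
            span.duration_ms = (time.perf_counter() - start) * 1000
```

Each tool call is wrapped in `with trace.step(name, args) as span:` and sets `span.result` inside the block, so arguments, results, errors, and timings end up in one ordered record per run.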
Multi-Turn Is Where Things Break
BFCL V3 introduced multi-turn function calling evaluation for a reason. Single-turn accuracy is a poor predictor of multi-turn reliability. If each turn has 85% accuracy (a reasonable estimate for a capable model with a moderate tool set) and errors are independent across turns, the probability of a 10-step agent completing all steps without error is roughly 0.85^10 ≈ 20%. This is the same compounding error problem we’ve documented before, but it hits function calling particularly hard because each tool call’s output becomes the next turn’s input.
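The arithmetic is easy to reproduce and extend. The sketch below assumes errors are independent across steps and, for the retry case, that every failure is detected so the step can be attempted again; both are simplifying assumptions, and the function name is illustrative.

```python
def chain_success(p_step: float, steps: int, retries: int = 0) -> float:
    """Probability that every step succeeds, assuming independent errors.

    With retries, a step fails only if all (retries + 1) attempts fail,
    which further assumes that each failure is detected.
    """
    p_effective = 1 - (1 - p_step) ** (retries + 1)
    return p_effective ** steps


print(round(chain_success(0.85, 10), 3))             # ≈ 0.197
print(round(chain_success(0.85, 10, retries=1), 3))  # ≈ 0.796
```

Under these assumptions, a single detected-and-retried failure per step lifts a 10-step chain from about 20% to about 80% end-to-end success, which is why detection matters as much as raw accuracy.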
The practical implication: you cannot design a multi-step agent workflow assuming each step will succeed. You need retry logic, fallback tools, and graceful degradation paths. As discussed in our analysis of why agents fail at step 47, Anthropic’s agent patterns document recommends prompt chaining with programmatic validation gates between steps: explicitly checking intermediate results before proceeding, rather than trusting the agent to self-correct.
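A validation gate is just a predicate that each step's output must satisfy before the next step runs. The following is a minimal sketch under stated assumptions: the step format, the GateFailed exception, and the retry limit are hypothetical, and exceptions raised by a step itself are left unhandled for brevity.

```python
from typing import Any, Callable

# Each step is (name, run, gate): run transforms the previous output,
# gate decides whether the result is good enough to continue.
Step = tuple[str, Callable[[Any], Any], Callable[[Any], bool]]


class GateFailed(RuntimeError):
    """Raised when a step cannot pass its gate within the retry budget."""


def run_chain(steps: list[Step], initial: Any, max_attempts: int = 2) -> Any:
    value = initial
    for name, run, gate in steps:
        for _ in range(max_attempts):
            candidate = run(value)
            if gate(candidate):
                value = candidate
                break
        else:
            # No attempt passed the gate: stop instead of feeding bad
            # output into later steps.
            raise GateFailed(f"step {name!r} failed its gate after {max_attempts} attempts")
    return value
```

Stopping at the first unrecoverable gate keeps one bad intermediate result from cascading, and the raised exception is a natural hook for a fallback tool or a human handoff.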
This is also where LLM gateway failover becomes relevant. If your primary model’s function calling accuracy degrades (whether due to provider issues, rate limiting, or model updates), the ability to route to an alternative model at the infrastructure level — without changing agent code — is a production safety net that takes one afternoon to set up and saves weeks of fire-fighting.
The Training Data Gap
The APIGen result — a 7B model beating GPT-4 on function calling benchmarks — points to a deeper issue. Current frontier models are trained on broad corpora where function calling is a small fraction of the data. Their function calling capability comes from instruction tuning and RLHF, not from deep exposure to real API schemas. This means:
- Models generalize well to simple schemas they’ve seen during training.
- They struggle with novel, complex, or domain-specific schemas.
- The gap between simple and complex is where production systems live.
Salesforce’s approach was to generate verified training data from 3,673 real APIs across 21 categories, then filter it through format checking, execution, and semantic verification, which produced 60,000 high-quality training examples. Models trained on this data outperformed much larger general-purpose models on BFCL. The lesson for engineering teams: if you’re building agents that call your company’s internal APIs, invest in generating a verified dataset of correct function calls for those specific APIs. The APIGen result suggests that fine-tuning on your actual schema distribution can beat a general-purpose model’s out-of-the-box function calling on those schemas.
What to Do on Monday
If you’re running function-calling agents in production, or about to deploy one:
- Enable strict mode on every tool definition. It eliminates an entire class of structural errors at little runtime cost, though providers may add latency on the first request with a new schema while they process it.
- Audit your tool catalog. Count how many tools you expose per call. If it’s more than 10, implement a routing or selection layer.
- Add semantic validation between the LLM output and execution. Structural validity ≠ correct behavior.
- Implement full tracing across multi-step tool call chains. You will need this to debug the first production incident.
- Benchmark on your actual schemas, not on BFCL. Create a test set from your real API definitions and measure accuracy on those. The number will be lower than you expect.
- Build a retry and fallback budget into every agent workflow. Plan for failure at every step.
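The benchmarking item in the list above needs very little machinery to get started. The harness below is a sketch: the Case structure, the metric names, and the call_model callable (a wrapper you would write around your own model client, returning the chosen tool name and its arguments) are all assumptions, not part of any existing benchmark.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Case:
    prompt: str
    expected_tool: str
    expected_args: dict[str, Any]


def evaluate(
    cases: list[Case],
    call_model: Callable[[str], tuple[str, dict[str, Any]]],
) -> dict[str, float]:
    """Measure tool selection accuracy and exact-match accuracy on your own cases."""
    if not cases:
        raise ValueError("need at least one test case")
    tool_hits = 0
    exact_hits = 0
    for case in cases:
        tool, args = call_model(case.prompt)
        if tool == case.expected_tool:
            tool_hits += 1
            # Arguments only count when the right tool was selected.
            if args == case.expected_args:
                exact_hits += 1
    total = len(cases)
    return {"tool_accuracy": tool_hits / total, "exact_match": exact_hits / total}
```

Reporting tool selection and exact-match separately shows whether failures come from picking the wrong function or from filling in the wrong values, which call for different fixes.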
Function calling is the most important primitive in agentic AI, and it’s getting better — strict mode, better training data, and improved model capabilities are all real advances. But the gap between benchmark performance and production reality remains substantial. Designing for that gap, rather than hoping it doesn’t exist, is what separates working agent systems from expensive demos.
References
- Berkeley Function Calling Leaderboard (BFCL V4) — UC Berkeley RISELab, last updated April 2026
- Gorilla: Large Language Model Connected with Massive APIs — Patil et al., May 2023
- APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets — Liu et al., June 2024
- Tool Use with Claude — Anthropic Documentation
- Building Effective Agents — Anthropic Engineering, December 2024
- Introducing the Agent Gateway — Portkey, April 2026
- LiteLLM Proxy: Docker, Helm, Terraform Deployment