Function Calling Accuracy Plummets in Production Workflows

Benchmarks Claim 95%. Production Disagrees.

The Berkeley Function Calling Leaderboard (BFCL V4) reports that GPT-4o achieves over 90% accuracy on single-function tool calls. Add a second tool to the context, and accuracy drops by double digits. Add five, and you’re in a different regime entirely. The gap between benchmark function …