A Reddit post from a former Manus backend lead landed because it named a problem many teams already feel but rarely describe clearly: function calling works beautifully in product demos, then starts to creak when an agent needs dozens of tools, long context, and real state. The interesting part was not the provocation. It was the architecture behind it. More builders are collapsing bloated tool catalogs into a few durable command surfaces. That shift may define the next generation of agent systems.
Reddit surfaced a bottleneck that product teams already know
The trigger for this piece was a recent r/LocalLLaMA post by a former Manus backend lead who now works on the open-source projects Pinix and agent-clip. His claim was blunt: after two years building agents, he stopped using classic function calling as the primary interface and moved toward a single run(command="...") surface backed by Unix-style commands, pipes, help text, and predictable error messages.
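To make the shape of that interface concrete, here is a minimal sketch of what a single run(command="...") surface could look like as a function-calling schema. The schema format is the widely used OpenAI-style tool definition; the description text and example command are illustrative assumptions, not the actual Pinix or agent-clip schema.

```python
# Illustrative only: the entire tool catalog collapses into one "run" verb.
# Everything else (files, memory, browser, search) lives behind the command
# string, discoverable via help text rather than via separate schemas.
RUN_TOOL = {
    "type": "function",
    "function": {
        "name": "run",
        "description": (
            "Execute one command in a sandboxed, terminal-like runtime. "
            "Commands support pipes and chaining; use 'help' to list verbs."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "command": {
                    "type": "string",
                    "description": (
                        "Command line to execute, "
                        "e.g. 'files read app.log | grep ERROR'"
                    ),
                }
            },
            "required": ["command"],
        },
    },
}
```

The point of the design is visible even in this toy version: the model sees one schema instead of dozens, and the complexity moves into a command grammar it can explore incrementally.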
That is easy to misread as anti-tool-calling theater. It is not. The more careful reading is architectural. The author explicitly says typed APIs still make sense in strongly structured or high-security contexts. His argument is narrower and more interesting: once an agent has to navigate a large and growing tool universe, choosing between twenty or fifty schemas becomes its own cognitive burden. The model spends attention on “which tool should I call?” instead of “what sequence of actions would solve this task?”
Pinix makes the idea concrete. The platform exposes a compact Clip interface with three operations: Invoke, ReadFile, and GetInfo. Agent-clip then layers a terminal-like action surface on top of that, including file operations, memory search, browser actions, clip calls, and chained commands. In other words, the system is not throwing away structure. It is compressing structure into a smaller number of reusable verbs.
That distinction matters. Many teams are discovering the same thing from a different angle: the problem is not tools in the abstract. The problem is tool sprawl. We made a related argument recently in CloudAI’s coverage of portfolio-based model stacks and AI ROI in 2026. Model choice matters, but interface design is increasingly where reliability is won or lost.
Why function calling starts to crack at scale
Function calling remains a major advance. It gives developers typed parameters, schema validation, explicit permissions, and clean observability. For narrow workflows, that is exactly what you want. If an agent only needs six well-defined actions to book travel, check inventory, or open a support ticket, typed tools are often the adult choice.
The trouble begins when the toolset stops being narrow. The 2025 BFCL paper, which formalized the Berkeley Function Calling Leaderboard, found that state-of-the-art models are strong on single-turn calls but still struggle with memory, dynamic decision-making, and long-horizon reasoning in more agentic settings. Tau-bench, a benchmark built around realistic tool-agent-user interactions, pushes the point further: even leading function-calling agents succeeded on fewer than half of the tasks, and pass^8 in the retail domain remained below 25%. That is not a small miss. It is a warning that realistic multi-step execution is a different regime from schema completion.
Why does performance fall off? First, prompt real estate is finite. Every additional tool definition takes tokens, and every extra parameter description adds ambiguity. Second, tools rarely live in isolation. Real work involves composition: read a log, filter it, count failures, open a ticket, attach evidence, notify someone, then update memory. Function calling can do that, but often through a series of brittle handoffs where the model must repeatedly re-orient itself across unrelated schemas.
Third, function catalogs often age badly. What starts as ten elegant tools becomes forty overlapping ones: search_docs, query_kb, find_article, search_confluence, search_slack, each with slightly different parameter rules and return shapes. Humans can barely keep that straight. Expecting a language model to do it perfectly under latency, token, and context pressure is optimistic at best.
This is where the Reddit post connects with a broader research pattern. Anthropic’s guidance on building effective agents argues for the simplest workable pattern and warns against adding abstraction layers that obscure prompts and responses. In a separate engineering note on writing tools for agents, Anthropic highlights three principles that sound almost mundane until you try to scale them: clear boundaries, meaningful returned context, and token-efficient outputs. Those are precisely the qualities that large, ad hoc tool catalogs tend to lose first.
What command layers get right
The most compelling case for command layers is not that shell syntax is magical. It is that a command surface gives the agent a more compact action language. A single call can express composition natively: read, filter, transform, fall back, then continue. The Manus engineer’s examples were simple but revealing. Counting error lines in a log can be one chained command instead of a three-call choreography. That saves tokens, reduces intermediate state handling, and shrinks the number of decision points where the model can drift.
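The log-counting example can be made concrete with a toy executor. The `run` helper below is a deliberately naive stand-in, assumed for illustration: a real agent runtime would restrict the available verbs, sandbox the filesystem, and shape the output, none of which is shown here.

```python
import subprocess


def run(command: str) -> dict:
    """Toy executor for one chained command.

    Sandboxing and verb whitelisting are omitted; a production runtime
    would never hand the model a raw shell like this.
    """
    proc = subprocess.run(
        ["sh", "-c", command], capture_output=True, text=True, timeout=30
    )
    return {
        "stdout": proc.stdout.strip(),
        "stderr": proc.stderr.strip(),
        "exit_code": proc.returncode,
    }


# One decision point, one call:
#   run("grep -c ERROR app.log")
#
# versus the three-call choreography a per-function catalog forces,
# with the model re-orienting itself between each step:
#   read_file("app.log") -> filter_lines(lines, "ERROR") -> count(matches)
```

The token math is what matters: one call means one schema in context, one output to parse, and no intermediate state the model has to carry between turns.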
There is also a training-data advantage. LLMs have seen enormous volumes of shell commands, CLI help text, README snippets, CI scripts, and troubleshooting transcripts. That does not make any specific command safe or correct, but it does mean the syntax and operating style are familiar territory. A terminal-like interface lets the model operate inside a behavioral grammar it already partially understands: verbs, flags, pipes, stderr, exit codes, retries.
The idea extends beyond Bash. The CodeAct paper from ICML 2024 made a related argument from the Python side: a unified executable action space can outperform rigid JSON or text formats by allowing composition, revision, and self-debugging, with gains of up to 20% in success rate on evaluated tasks. Different implementation, same architectural lesson: richer action surfaces can beat tool menus when the job requires adaptation rather than single-step dispatch.
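A stripped-down version of the CodeAct idea fits in a few lines: the action is a Python snippet, the observation is whatever it prints, and state persists across turns so the agent can revise rather than restart. This is a minimal sketch of the pattern, not the paper's implementation, and the bare `exec` stands in for what would need to be a real sandbox.

```python
import contextlib
import io


def execute_action(code: str, namespace: dict) -> str:
    """Run one model-proposed Python snippet; return printed output
    (or the error) as the observation for the next turn."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)  # a real system would sandbox this
    except Exception as exc:
        # Errors come back as observations, enabling self-debugging.
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue()


# Turn 1: the model builds state; turn 2: it composes on top of it.
ns: dict = {}
execute_action("errors = ['timeout', 'oom', 'timeout']", ns)
obs = execute_action("print(sum(1 for e in errors if e == 'timeout'))", ns)
# obs == "2\n"
```

The persistent namespace is the key design choice: it is what turns a sequence of isolated tool calls into an evolving computation the model can inspect and correct.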
What the best versions of this pattern do well is preserve navigation. In the Reddit design, help text, error messages, and consistent output metadata are not afterthoughts. They are the user interface for the agent itself. A bad tool description forces blind guessing. A good error message points to recovery. A compact, informative output keeps the model moving without flooding the window. That is also why this trend connects naturally with CloudAI’s recent argument about tool-use benchmarks mattering more than generic benchmark theater.
The trade-offs are real, and sometimes decisive
This is the part that command-layer evangelists sometimes underplay. A terminal-like interface is not automatically better. It is often more dangerous. Strings are looser than typed parameters. Injection risks are more obvious. Permission boundaries can get fuzzy fast. The top comments under the Reddit thread zeroed in on exactly the right concern: if the shell becomes the universal action surface, what stops it from becoming the fastest possible way to do the wrong thing?
That concern is not theoretical. The ToolEmu paper, which evaluated tool-using language model agents in risky scenarios, found that even the safest agent tested still exhibited a non-trivial risky failure rate. The exact number matters less than the lesson: when agents can act, not just talk, interface quality and safety design become inseparable.
There are also domains where typed interfaces remain plainly superior. Payments, identity changes, compliance workflows, HR systems, and anything with strict authorization semantics benefit from explicit schemas, strong parameter validation, and hard permission boundaries. If an agent is closing a refund, rotating credentials, or editing a production firewall rule, “natural fit for language models” is not the only requirement. Auditability and constrained side effects matter more.
The mature view, then, is not “replace function calling.” It is use function calling at transactional edges and compress complexity everywhere else. Let the agent reason and compose inside a safe, ergonomic action layer, then require typed, policy-aware handoffs for sensitive commits. That is a much more credible production pattern than either extreme.
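The "typed commits at transactional edges" half of that pattern can be sketched as follows. The names here (`RefundRequest`, `commit_refund`, the approver gate) are hypothetical, chosen to show the contrast with a free-form `run()` layer: validated fields, hard preconditions, and an explicit approval requirement rather than a command string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RefundRequest:
    """Typed commit payload: every field validated, nothing free-form."""
    order_id: str
    amount_cents: int
    reason: str


def commit_refund(req: RefundRequest, approved_by: str) -> dict:
    """Policy-aware handoff for a sensitive action.

    Unlike the exploratory command layer, this edge enforces hard
    constraints and records who authorized the side effect.
    """
    if req.amount_cents <= 0:
        raise ValueError("refund amount must be positive")
    if not approved_by:
        raise PermissionError("refunds require a named approver")
    return {
        "status": "committed",
        "order_id": req.order_id,
        "amount_cents": req.amount_cents,
        "approved_by": approved_by,
    }
```

The agent can reason its way to the refund inside the command layer, but the commit itself only happens through this narrow, auditable gate.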
An implementation framework that can actually ship
If you are building or refactoring an agent system today, the most practical move is not to choose a religion. It is to redesign the action surface around a few operational rules.
- Start with workflows, not tools. Map the real jobs your agent must finish end to end. If three tools are always called together, that is evidence they may belong behind one higher-level verb.
- Collapse tool sprawl into action families. In many systems, five to eight well-named command families are more usable than fifty raw functions. Think files, memory, browser, search, tickets, deploy, not a long tail of near-duplicates.
- Make discovery part of the interface. Help output, examples, defaults, and next-step error messages should be designed for agent recovery, not just human convenience.
- Return small, decision-ready outputs. Anthropic’s tool-writing guidance is right here: meaningful context matters, but token-efficient responses matter too. Give the model what it needs to decide the next step, not a data dump.
- Separate execution from presentation. The Reddit post’s two-layer design is smart. Raw execution semantics should stay clean; truncation, summaries, and binary guards belong in a presentation layer built for model cognition.
- Use typed commits for risky actions. Keep explicit schemas and approval gates for payments, deletes, security changes, or anything legally or operationally sensitive.
- Evaluate on long-horizon tasks. Do not stop at “the tool call was valid.” Measure completion, retries, total tool calls, latency, token cost, and recovery behavior on messy, multi-step tasks taken from real logs.
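The last rule is the easiest to operationalize. A per-run metrics record along these lines, aggregated over messy multi-step tasks, answers the question the framework poses; the field names are illustrative, not a standard.

```python
from dataclasses import dataclass


@dataclass
class RunMetrics:
    """One record per end-to-end task attempt, not per tool call."""
    completed: bool = False
    tool_calls: int = 0
    retries: int = 0
    latency_s: float = 0.0
    tokens: int = 0


def summarize(runs: list[RunMetrics]) -> dict:
    """Aggregate the numbers that distinguish demos from systems."""
    n = len(runs)
    ok = [r for r in runs if r.completed]
    return {
        "completion_rate": len(ok) / n if n else 0.0,
        "avg_tool_calls_per_success": (
            sum(r.tool_calls for r in ok) / len(ok) if ok else float("nan")
        ),
        "avg_retries": sum(r.retries for r in runs) / n if n else 0.0,
        "total_tokens": sum(r.tokens for r in runs),
    }
```

Run the same workload against the large tool catalog and the compressed command surface: if completion rises while tool calls, retries, and tokens fall, the smaller surface has earned its place.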
This framework is less glamorous than a benchmark chart, but it is what separates demos from systems. The teams getting value from agents in 2026 are increasingly the ones treating interface design as product infrastructure.
Where this architecture is most likely to win first
Three categories stand out. First, research and operations copilots, where the work is mostly reading, filtering, comparing, extracting, and documenting across many sources. Second, coding and incident agents, where files, logs, tests, and browser actions already resemble a command-oriented world. Third, internal back-office automation, where most of the flow is open-ended until the last transactional step.
What these categories share is not industry. It is workflow shape. They reward composition, recovery, and adaptive sequencing more than rigid one-shot tool dispatch. That makes them ideal places to test a post-function-calling architecture before rolling it into higher-risk domains.
FAQ
Is function calling obsolete?
No. It is still the best fit for many bounded workflows, especially where schemas, permissions, and audit logs matter more than open-ended composition.
Are command layers just “giving the model a shell”?
At the shallowest level, yes. In serious systems, though, the shell metaphor sits inside a constrained runtime with curated commands, help text, output shaping, permissions, and often sandboxing. The value is not raw shell access. It is a compact action grammar.
What is the biggest mistake teams make here?
They confuse more tools with more capability. In practice, oversized catalogs often make agents less reliable because the model spends effort choosing interfaces instead of solving the task.
What should teams measure to decide?
Measure task completion on multi-step workloads, retry count, average tool calls per successful run, latency, token consumption, and failure recovery. If a smaller action surface cuts those numbers without increasing risk, you have your answer.
The editorial verdict
The Reddit thread matters because it captures an architectural turn that is becoming visible across the market. The next AI stack is not abandoning tools. It is moving away from tool menus as the main abstraction. The winning pattern is likely to be smaller, more ergonomic action surfaces for reasoning and exploration, combined with typed, policy-aware interfaces for sensitive commits.
That is a healthier direction than both extremes. It avoids the fantasy that JSON schemas alone solve long-horizon agent reliability, and it avoids the equally naive belief that unrestricted shells are a production strategy. The real opportunity is in the middle: compact command layers, explicit boundaries, and evaluation on the messy workflows that matter. If you are building agents this year, that is where the next serious gains are likely to come from.
References
- Reddit: “I was backend lead at Manus. After building agents for 2 years, I stopped using function calling entirely”
- Pinix GitHub repository
- agent-clip GitHub repository
- Anthropic: Building effective agents
- Anthropic: Writing effective tools for agents
- Patil et al. (2025), The Berkeley Function Calling Leaderboard (BFCL)
- Tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
- Wang et al. (2024), Executable Code Actions Elicit Better LLM Agents
- ToolEmu: Evaluating the Safety and Security Risks of Large Language Model Agents
- Featured image source: Utilities-terminal.svg, Wikimedia Commons