Sixty-eight percent of organizations running self-hosted AI ingest those models transitively through third-party software, according to Wiz’s State of AI in the Cloud 2026 report. The models your platform team approved are not the only models running in your environment — and those unattributed calls bypass your LLM gateway, evade your observability stack, inflate your cloud bill, and silently distort every capacity model you maintain.
This is not a governance talking point. It is a hard engineering problem with a measurable cost, and most teams have no instrumentation to see it. If your GPU utilization, egress spend, and p99 latency move for reasons you cannot attribute, transitive AI is a prime suspect.
What Transitive AI Actually Means
Transitive AI is the model equivalent of a transitive dependency in your package tree: an AI capability that enters your runtime not because you deployed a model, but because some library, SDK, SaaS connector, IDE extension, or MCP server pulled it in. Wiz found that 81% of organizations now use managed AI services and 90% run self-hosted models, so the surface for this is enormous. But 68% of those self-hosting ingest models through third-party software rather than deploying them directly.
The practical consequence: the model registry your MLOps team curates is a fiction. It lists the models you chose. It does not list the models a vendor appliance embeds, the ones an npm package calls after you imported it for an unrelated reason, or the ones invoked against a database by an MCP server your developers wired up last sprint. Your dependency tree now contains inference endpoints that bill you per token.
This compounds fast. The 2026 Gravitee adoption report, cited by AI Outlooks, found that 80.9% of technical teams have moved past planning into active testing or production with AI agents, while only 14.4% say all of their live agents carry full security and IT approval. The gap between “deployed” and “approved” is where transitive AI lives.
Four Vectors Through Which AI Sneaks In
Unattributed AI enters through a small number of predictable paths. A breakdown of the shadow MCP problem identifies four insertion points that map cleanly onto what most platform teams see in production:
| Vector | How it enters | Why it persists |
|---|---|---|
| Direct SDK use | Developer wires an AI client into a prototype to ship a feature fast. | The prototype ships, no review follows, it becomes load-bearing. |
| Transitive dependency | A package you imported pulls in an AI SDK through its own dependency tree. | You never added it; your lock file did. SCA tools rarely flag inference calls. |
| Config files | YAML, TOML, or JSON carries a model endpoint URL and an API key. | Looks like ordinary config, passes code review and static analysis. |
| Reused snippets | Hardcoded LLM calls copied from a notebook or tutorial into tooling. | Nobody tracks provenance; the snippet outlives the person who pasted it. |
None of these require malice. They require velocity without an inventory. The MCP ecosystem alone has grown past 10,000 public servers, and the SDKs are propagating through npm and PyPI. Each one is a potential inference call leaving your environment that your gateway never sees.
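The transitive-dependency vector in the table above can be sketched as a graph walk. This is a minimal illustration, not a replacement for an SCA tool: the `find_ai_paths` helper, the `AI_SDKS` set, and the toy dependency graph are all assumptions made for the example. In practice the graph would come from your lock file or package metadata.

```python
from collections import deque

# Assumed list of package names to treat as AI SDKs; extend it for your stack.
AI_SDKS = {"openai", "anthropic", "mcp"}


def find_ai_paths(graph, roots, ai_sdks=AI_SDKS):
    """Return one shortest dependency path from each root to each reachable AI SDK."""
    paths = []
    for root in roots:
        seen = {root}
        queue = deque([[root]])
        while queue:
            path = queue.popleft()
            node = path[-1]
            if node in ai_sdks and len(path) > 1:
                paths.append(path)
                continue  # stop here; no need to walk below the SDK itself
            for dep in graph.get(node, ()):
                if dep not in seen:
                    seen.add(dep)
                    queue.append(path + [dep])
    return paths


# Toy graph (package -> direct dependencies), invented for illustration.
graph = {
    "billing-service": ["pdf-report-kit", "requests"],
    "pdf-report-kit": ["smart-summarizer"],
    "smart-summarizer": ["openai"],
    "requests": ["urllib3"],
}

for path in find_ai_paths(graph, ["billing-service"]):
    print(" -> ".join(path))
```

Reporting the full path rather than just the SDK name matters here: it shows which package you actually imported is responsible for the inference client in your lock file.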
Why Your LLM Gateway Misses It
The standard cost-control pattern in 2026 is an LLM gateway: route every model call through one egress, enforce budgets, cache, and log. It works — for calls you wrote. Transitive AI calls do not traverse it because nothing told them to. A vendor appliance talks to its own hosted model over HTTPS. An MCP server in a developer’s Claude Desktop authenticates as that user and hits an external API. From your identity provider’s perspective it looks like ordinary user activity, which is exactly why it evades egress controls.
Security researchers at Operant demonstrated this with an attack they call Shadow Escape: a zero-click exfiltration chain in which a compromised MCP connection surfaces private data and transmits it without any user interaction, all inside sanctioned identity boundaries. The security dimension is serious, but the engineering lesson is more mundane and more expensive: if your observability is gateway-based, you have a blind spot precisely where unattributed traffic flows. You cannot alert on, cache, rate-limit, or fail over for calls you cannot see.
That blind spot is why your cost dashboards and your actual cloud bill diverge. The gateway reports token spend for the providers you configured. The finance team’s invoice includes charges from providers you did not.
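One way to make that divergence concrete is to reconcile the invoice against the gateway's own report, provider by provider. The sketch below assumes you can export both as simple provider-to-dollars mappings; the function name, provider labels, and amounts are illustrative only.

```python
def unattributed_spend(invoice_usd, gateway_usd):
    """Per-provider spend on the invoice that the gateway did not report."""
    gaps = {}
    for provider, billed in invoice_usd.items():
        gap = round(billed - gateway_usd.get(provider, 0.0), 2)
        if gap > 0:
            gaps[provider] = gap
    return gaps


# Illustrative numbers, not real billing data.
invoice = {"provider-a": 12400.00, "provider-b": 3100.00, "provider-c": 870.50}
gateway = {"provider-a": 12150.00, "provider-b": 3100.00}

print(unattributed_spend(invoice, gateway))
```

A provider that appears on the invoice but not in the gateway export (here `provider-c`) is spend with no attributed caller at all, which makes it the first place to look.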
The Cost You Are Not Tracking
Unattributed AI is not free, and it does not degrade gracefully. Every hidden inference call is billed somewhere — to a vendor that baked model access into its license, to an API key an engineer hardcoded, or to your own infrastructure when a transitive model runs self-hosted on capacity you provisioned for the approved workload. You are paying for compute and egress you cannot attribute, which means you cannot optimize it.
The pattern mirrors the early shadow-IT era, but with a sharper edge. The 2024 Stack Overflow Developer Survey found 75% of developers use AI assistants regularly, and Knostic’s analysis notes these tools operate outside approved monitoring and bypass the logging systems teams rely on to detect misuse. When the calls are invisible, the spend is invisible, and invisible spend does not get reviewed or retired.
This is where transitive AI intersects with a problem we have covered before: model sprawl that nobody knows how to retire. If you cannot enumerate the models running in your environment, you cannot kill the expensive one. The cost compounds quarter over quarter because the inventory never gets built.
Capacity Planning Breaks Down
Cost is the symptom you notice first; capacity is the one that bites in an incident. GPU scheduling assumes you know your inference workload. When a transitive dependency drives a load spike on a self-hosted model during a vendor’s batch job, that load competes for the same HBM and KV-cache memory your approved serving stack needs. Your p99 latency climbs, your autoscaler scales out, and nobody connects the dots because the noisy tenant is not in your deployment manifest.
Wiz reports that 57% of organizations now deploy self-hosted AI agents and 80% adopt MCP servers. Every one of those components is a potential control plane that moves compute and data on its own schedule. Capacity models built from your declared workloads will be wrong in direct proportion to how much transitive AI you carry, and the error only surfaces under load — the worst possible time to discover you have unattributed tenants on your accelerators.
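A small worked example shows how an undeclared tenant turns planned headroom into oversubscription. Every number here is invented for illustration, and `headroom_gb` is a hypothetical helper; real planning would also account for fragmentation and per-request KV-cache growth.

```python
def headroom_gb(capacity_gb, declared_peak_gb, undeclared_peak_gb=0):
    """GPU memory left at peak; a negative value means the fleet is oversubscribed."""
    return capacity_gb - declared_peak_gb - undeclared_peak_gb


capacity_gb = 8 * 80  # illustrative: eight accelerators with 80 GB of HBM each

# What the capacity model predicts from declared workloads alone.
planned = headroom_gb(capacity_gb, declared_peak_gb=520)

# What happens when a transitive tenant peaks at the same time.
actual = headroom_gb(capacity_gb, declared_peak_gb=520, undeclared_peak_gb=150)

print(planned, actual)  # 120 -30
```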
Build an AI Bill of Materials
The fix starts with an inventory you do not have. An AI Bill of Materials, or AIBOM, is a continuously maintained list of every AI component in your environment: models, datasets, embeddings, prompts, APIs, and the infrastructure each depends on. OWASP defines it as a structured, machine-readable inventory of AI components — models, datasets, agent tools, guardrails, and runtime elements — along with evidence of origin, rights, integrity, and evaluation. Treat it as the SBOM’s successor for anything that emits a token.
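To make "structured, machine-readable" concrete, here is a minimal sketch of what a single inventory record might look like. The field names are assumptions chosen for this example and are not the OWASP schema; the component names are invented.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class AIBOMEntry:
    name: str
    kind: str            # e.g. "model", "dataset", "agent-tool", "api"
    origin: str          # where the component came from
    introduced_by: str   # package, vendor, or team that pulled it in
    approved: bool = False
    depends_on: list[str] = field(default_factory=list)


entries = [
    AIBOMEntry("support-chat-llm", "model", "internal registry",
               "platform-team", approved=True, depends_on=["gpu-pool-a"]),
    AIBOMEntry("summarizer-api", "api", "third-party SaaS",
               "pdf-report-kit (transitive)"),
]

# Serialize the inventory so other tooling can consume it.
print(json.dumps([asdict(e) for e in entries], indent=2))

# The entries nobody approved are the ones to review first.
unapproved = [e.name for e in entries if not e.approved]
print(unapproved)
```

Recording `introduced_by` alongside each component is the design choice that matters: it separates the models you deployed from the ones a dependency brought with it.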
Generating one is not a manual exercise. It requires three automated inputs: a dependency-graph scan that flags packages importing AI SDKs, a network-level audit of which hosts are talking to inference endpoints, and a static analysis pass over config files for hardcoded model URLs and keys. You will not catch everything on the first pass, and that is acceptable. The first deliverable is a baseline, and the baseline is what makes the next unattributed call visible.
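The config-file pass can start as a simple line-by-line pattern match. The sketch below is an assumption-laden starting point: the two regular expressions cover only a couple of well-known provider hosts and the common OpenAI-compatible `/v1/chat/completions` path, and the key pattern is a heuristic that will produce both false positives and misses. The sample config and key are fake.

```python
import re

# Assumed patterns; extend the host list and path hints for your environment.
ENDPOINT_RE = re.compile(
    r"https?://[^\s\"']*"
    r"(?:api\.openai\.com|api\.anthropic\.com|/v1/chat/completions)"
    r"[^\s\"']*"
)
KEY_RE = re.compile(
    r"(?:api[_-]?key|token|secret)\s*[:=]\s*[\"']?([A-Za-z0-9_\-]{16,})",
    re.IGNORECASE,
)


def scan_config(text):
    """Return (line_number, kind, evidence) for each suspicious config line."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        m = ENDPOINT_RE.search(line)
        if m:
            findings.append((line_no, "endpoint", m.group(0)))
        m = KEY_RE.search(line)
        if m:
            # Report only a short prefix so the scan output does not leak the secret.
            findings.append((line_no, "credential", m.group(1)[:4] + "..."))
    return findings


sample = """\
service:
  name: report-builder
  summarizer:
    endpoint: https://api.openai.com/v1/chat/completions
    api_key: "sk-EXAMPLE-not-a-real-key-0000"
"""

for finding in scan_config(sample):
    print(finding)
```

Because the scan works on raw text, the same function applies to YAML, TOML, and JSON without a parser for each, at the cost of precision.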
Force a Single Exit Point
An AIBOM tells you what exists; a forced egress tells you what runs. The durable engineering control is to make one network path the only legal way to reach any inference endpoint, whether it is your self-hosted vLLM cluster or a third-party API. Egress filtering at the network layer blocks direct calls to model providers that bypass the gateway, and a transparent proxy on the self-hosted path captures calls from transitive components that share your GPU fleet.
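The decision logic behind that egress filter is small enough to sketch. Actual enforcement belongs in your firewall or proxy, not in application code; the function below only illustrates the rule. The gateway hostname and the provider list are assumptions for the example.

```python
from urllib.parse import urlsplit

GATEWAY_HOSTS = {"llm-gateway.internal"}  # hypothetical gateway address
INFERENCE_HOSTS = ("api.openai.com", "api.anthropic.com")  # example provider list


def egress_decision(url):
    """Classify an outbound request as 'allow', 'deny', or 'pass'."""
    host = (urlsplit(url).hostname or "").lower()
    if host in GATEWAY_HOSTS:
        return "allow"  # the one legal path to inference
    if any(host == h or host.endswith("." + h) for h in INFERENCE_HOSTS):
        return "deny"   # direct call to a model provider that bypasses the gateway
    return "pass"       # not inference traffic; other egress policy applies


print(egress_decision("https://llm-gateway.internal/v1/chat/completions"))
print(egress_decision("https://api.openai.com/v1/chat/completions"))
print(egress_decision("https://example.com/health"))
```

Logging every `deny` with its source is what turns the filter into an inventory feed: each blocked call names a component the AIBOM should already contain.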
This is the same principle that tamed outbound cloud egress costs: you cannot control what you cannot see, and you cannot see what you cannot route. The payoff is that every model call — approved or transitive — becomes observable, cacheable, rate-limitable, and billable to a team. A call that cannot justify its existence when it hits the gateway is a call you can retire, and retiring unattributed calls is the fastest cost reduction most AI platforms have left.
What to Do on Monday
Start narrow. Pick one production service and run a dependency scan for any package importing an LLM or MCP SDK. Audit its config files for model endpoint URLs and API keys. Check network logs for hosts reaching known inference providers. The gap between what you find and what your gateway logged is your transitive-AI surface — and it is almost always larger than teams expect. Quantify that delta in dollars and in latency before you build policy around it. You cannot govern AI you have not first learned to count.
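The network-log check above reduces to a filter: any source other than the gateway that reaches a known inference host is part of the surface. This sketch assumes flow logs can be reduced to (source, destination host) pairs; the source names and host list are invented for illustration.

```python
def bypass_flows(flows, gateway_sources, inference_hosts):
    """Map each non-gateway source to the inference hosts it reached directly."""
    surface = {}
    for source, dest in flows:
        if dest in inference_hosts and source not in gateway_sources:
            surface.setdefault(source, set()).add(dest)
    return surface


# Illustrative flow records: (source workload, destination host).
flows = [
    ("llm-gateway-1", "api.openai.com"),
    ("report-builder-7f9", "api.openai.com"),
    ("report-builder-7f9", "pypi.org"),
    ("vendor-appliance-2", "api.anthropic.com"),
]

surface = bypass_flows(
    flows,
    gateway_sources={"llm-gateway-1"},
    inference_hosts={"api.openai.com", "api.anthropic.com"},
)

for source, hosts in sorted(surface.items()):
    print(source, sorted(hosts))
```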
References
- Wiz — State of AI in the Cloud 2026
- AI Outlooks — Shadow MCP: What it is, why it’s dangerous, and how to find it
- Knostic — 6 Biggest Shadow AI Risks and How to Mitigate Them
- OWASP — AI Bill of Materials (AIBOM)
- BeyondScale — AI Bill of Materials: Enterprise Guide 2026
- Elektor — 2026: An AI Odyssey, the Vibe Coding Hangover