Running Claude, Codex, and Gemini in One Shop

The single-model shop is a fiction. Every production AI system I have built or audited ends up multi-model, the same way every working garage ends up with more than one wrench. Not because engineers love complexity, but because the job mix demands it. I run Claude, Codex, and Gemini in one shop every day, distributed and governed through a layer I built called Tool-Bag. This is what that takes, why it is unavoidable, and the patterns I would defend in a code review.

Why one model is never enough

Four forces push every serious shop toward multiple models, and none of them are fashion.

Cost tiers. Running frontier models on routine work is paying surgeon rates to take a temperature. Most production tasks are routine: extraction, classification, formatting, summarization at known shapes. Routing those to lighter tiers and reserving the heavy models for genuinely hard reasoning is not optimization theater. It is the difference between a system you can afford to run at volume and one you quietly throttle until it stops mattering.

Strengths. The model families have different temperaments, and anyone who works across them daily learns the contours within a week. One is the first pick for one shape of work, another for the next, and the rankings shift with every release cycle. I am not going to publish a benchmark table. Plenty exist, and most are stale before you finish reading them, which is itself the argument: if the leaderboard turns over monthly, hard-wiring your stack to a single vendor is a bet you have to re-win every month. Build so that switching to this week's best model for a task is a routing change, not a migration.

Refusal behavior. Models decline different requests, in different ways, with different amounts of explanation. In production, a refusal cannot be an exception that stalls a pipeline at two in the morning. It has to be a routable event: catch it, log it, hand the task to a model with a different temperament, escalate to a human if they all balk. A shop with one model has no second opinion.

Latency. Interactive surfaces need answers at conversation speed. Batch pipelines can wait, and should be priced like they can wait. One model gives you one latency profile, which means you are either overpaying on batch or underdelivering on chat, and usually both.

Underneath all four sits the unglamorous one: outages happen. A single-vendor AI stack is a single point of failure you chose on purpose, and on the day it matters, nobody will remember why.

What the layer actually has to do

Once you accept multiple models, the hard problem moves. It is no longer which model. It is the layer that decides, permits, equips, and verifies. Four jobs, none of them optional.

Routing. Call sites ask for a capability, never a vendor. The layer maps the request to a model based on task shape, cost tier, and current behavior. The moment a model name gets hard-coded at a call site, you have signed a contract you did not read, and you will renegotiate it during an incident.

Permissions. Which agent, running which model, may touch which tool. This is where the Docker MCP Gateway earns its place in my stack: every tool call goes through one controlled door instead of a dozen direct integrations. One place to grant, one place to revoke, one place to audit. On the worst day, and that day comes, you want one door.

Shared skills. A skill written once should run no matter which model picks it up. Skills, not prompts, are the unit of reuse. Prompts pasted across three vendors drift into three dialects within a month, and nobody notices until the outputs disagree in front of someone who matters.

CI. Agent configuration is code and deserves the discipline of code: versioned, reviewed, exercised on every change. If a skill loads differently after an update, or a permission quietly widens, I want a pipeline to fail loudly before a human discovers it expensively.

Lessons from building Tool-Bag

Tool-Bag is my answer to those four jobs: 14 plugins, 108 skills, a Docker MCP Gateway, and 12 CI workflows orchestrating Claude, Codex, and Gemini. It is named publicly and it is not open-sourced. Building it taught me more about production AI than any single model ever has. Four lessons survived contact.

Governance is the product. The models grow more interchangeable every quarter. The layer that routes, permits, and audits them does not. Whoever owns that layer owns the leverage, which is exactly why I built mine instead of renting somebody else's.

Skills compound, prompts rot. The 108 skills exist because every repeated task got captured once, made model-agnostic, and put under version control. The library gets more valuable with every addition, because each skill works across every model behind the gateway. A folder of pasted prompts moves in the opposite direction, and it moves fast.

CI is what separates a system from a pile of configurations. The 12 workflows are not ceremony. They are the reason a change to routing or permissions is a reviewed event instead of a quiet surprise, and the reason I can ship a change and sleep the same night.

Build for the expert who does not code. Tool-Bag is built for a specific person: the one who has the context and knows exactly what they need but does not write software. That is the real distribution problem in AI. The bottleneck was never model capability. It is putting capability in the hands of the people who hold the judgment, with governance strong enough that nobody has to hover.

Patterns I would defend in a code review

Opinions, earned the usual way:

Route by task shape with a cheap default tier, and escalate on failure. Starting expensive feels safe and compounds into a bill nobody can explain.
Expose capabilities, not model names. Vendor identity is configuration, never code.
Treat refusals as routable events with a fallback chain, not exceptions that page a human.
One gateway for all tool access. Direct integrations multiply attack surface and divide accountability.
Version skills like code. If it is not in CI, it is not governed. It is vibes.
Log every routing decision. When an output surprises you, the first question is which model and why, and the answer should take seconds, not a forensic weekend.

One shop, many wrenches

Multi-model is not a flex. It is what production looks like once cost, strengths, refusals, and latency stop being slide bullets and start being your pager. The shop that treats models as interchangeable labor behind an owned governance layer will outrun the shop that married a vendor, and the gap widens with every release cycle. The rest of what I build, including the platform this layer feeds, lives at /work. If you are wrestling three models into one shop and the duct tape is starting to show, I have been there. Reach out.