Governing Agents at Work: A Field Guide

The demo agent lives in a sandbox. The useful agent does not.

The useful agent writes to your CRM, answers your customers at midnight, files records, moves data between systems, and opens work in your codebase. The day an agent touches a real system, governance stops being a compliance slide and becomes an operational question with a dollar sign on it. Not because the models are reckless, but because any system that acts at machine speed amplifies whatever you failed to specify.

I run agents across three model families every working day, and I build the systems that keep that safe. This guide is the operating model: the four questions every agent program has to answer, how I implement them in my own stack, and why your meeting notes and your agent logs are about to become the same document.

The four questions

Every agent deployment, from one chat assistant in a dealership BDC (business development center) to a fleet across an enterprise, has to answer four questions. Skip one and you will find out which one you skipped at the worst possible time, in front of a customer or an auditor.

Permissions: what may this agent touch?
Audit: what did it do, and why?
Distribution: who in the organization gets which capability?
Human gates: what always requires a person's sign-off?

Most teams obsess over model choice and prompt quality. Those matter, but they are the easy third of the problem. The four questions are the governance layer, and the governance layer is what separates a tool from a liability.

Permissions: what an agent may touch

Permissions for agents work like permissions for people, with one difference: an agent will use everything you give it, tirelessly, at scale, without the judgment pause a human applies before doing something unusual. So the rule is least privilege, scoped by task, not by trust.

Trust in the model is not a permission strategy. The best model available still should not hold credentials it does not need for the task in front of it. Concretely: the lead-response agent can read inventory and write CRM notes. It cannot touch pricing, cannot issue refunds, cannot email the whole database. Not because it would, but because the cost of being wrong about "would" is unbounded, and the cost of scoping the credential is an afternoon.
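A minimal sketch of what task-scoped least privilege can look like in code, assuming a deny-by-default allowlist of (system, action) pairs. The names `Scope`, `LEAD_RESPONSE`, and `authorize` are hypothetical and only illustrate the lead-response example above; a real deployment would enforce this at the credential level as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scope:
    """Hypothetical scope record: the (system, action) pairs one agent may use."""
    agent: str
    allowed: frozenset  # of (system, action) tuples

    def permits(self, system: str, action: str) -> bool:
        # Deny by default: anything not listed is out of scope.
        return (system, action) in self.allowed


# The lead-response agent from the text: read inventory, write CRM notes, nothing else.
LEAD_RESPONSE = Scope(
    agent="lead-response",
    allowed=frozenset({("inventory", "read"), ("crm_notes", "write")}),
)


def authorize(scope: Scope, system: str, action: str) -> None:
    """Raise before the call is made instead of trusting the model to abstain."""
    if not scope.permits(system, action):
        raise PermissionError(f"{scope.agent} may not {action} {system}")
```

Calling `authorize(LEAD_RESPONSE, "pricing", "write")` raises `PermissionError`. The scope is the boundary, and the boundary does not depend on how the model behaves.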

Tier the permissions by blast radius:

Tier	What the agent does	Examples	Default gate
Read	Query, summarize, draft for a human	Pull a report, draft a reply, brief the morning meeting	Logging only
Reversible write	Create or update internal records	Log a call note, update a lead stage, schedule a task	Owner samples the output weekly
External action	Communicate outside the building	After-hours lead response, review replies	Tight scope, templates or rules, instant handoff path
Commitments	Spend, price, promise, sign	Nothing, today	Always a human

The tiers are not about distrust of AI. They are how you let an agent be genuinely useful at tiers one and two while you earn the evidence to expand tier three.
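The tier table can be written down as data, so the default gate is looked up, not remembered. This is a sketch under assumptions: the enum names, the `DEFAULT_GATE` mapping, and the `max_autonomous` ceiling are hypothetical, and the default ceiling of tier two is my reading of the table, not a rule every store must share.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Blast-radius tiers, ordered from smallest to largest."""
    READ = 1
    REVERSIBLE_WRITE = 2
    EXTERNAL_ACTION = 3
    COMMITMENT = 4


# Default gate per tier, taken from the table above.
DEFAULT_GATE = {
    Tier.READ: "logging only",
    Tier.REVERSIBLE_WRITE: "owner samples the output weekly",
    Tier.EXTERNAL_ACTION: "tight scope, templates or rules, instant handoff path",
    Tier.COMMITMENT: "always a human",
}


def may_run_unattended(tier: Tier, max_autonomous: Tier = Tier.REVERSIBLE_WRITE) -> bool:
    """True if an action at this tier may run without a person in the loop.

    max_autonomous is the ceiling the evidence has earned so far (assumed
    default: tier two). Commitments never run unattended, whatever the ceiling.
    """
    if tier is Tier.COMMITMENT:
        return False
    return tier <= max_autonomous
```

Expanding tier three then becomes a one-line, reviewable change to the ceiling, and no setting of that ceiling opens tier four.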

Audit: what it did and why

Every action an agent takes should produce a record: what it saw, what it decided, what it did, and under whose authority it was operating. If you cannot reconstruct an agent's afternoon, you do not have automation. You have a mystery generator with API keys.

The audit trail has three requirements that sound obvious and are routinely violated:

It must be complete. Actions, not summaries of actions. The draft that was sent, not a count of drafts.
It must be queryable. "Show me every customer-facing message this agent sent last Tuesday" should take a minute, not a support ticket.
It must live in a system the agent, and the agent's vendor, cannot edit. The referee does not play for either team. If you read my writing on owning your systems, you already know this rule: nobody grades their own homework.
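The three requirements can be sketched in a few lines. The version below is an in-memory illustration with hypothetical field names, not a storage recommendation. It uses hash chaining, a technique I am swapping in to make edits detectable. Chaining is only tamper-evident: it does not replace keeping the log in a system the agent and its vendor cannot write to.

```python
import hashlib
import json
from dataclasses import dataclass, field


def _digest(body: dict) -> str:
    # Stable serialization so the same record always hashes the same way.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


@dataclass
class AuditLog:
    """Append-only, hash-chained log. Illustrative only; all names are hypothetical."""
    entries: list = field(default_factory=list)

    def append(self, agent, saw, decided, did, authority, day):
        # Complete: store the action itself (for example the message sent), not a count.
        prev = self.entries[-1]["hash"] if self.entries else ""
        body = {"agent": agent, "saw": saw, "decided": decided,
                "did": did, "authority": authority, "day": day, "prev": prev}
        self.entries.append({**body, "hash": _digest(body)})

    def query(self, **filters):
        # Queryable: filter on any recorded field.
        return [e for e in self.entries
                if all(e.get(k) == v for k, v in filters.items())]

    def verify(self) -> bool:
        # Tamper-evident: any edited or reordered entry breaks the chain.
        prev = ""
        for e in self.entries:
            body = {k: v for k, v in e.items() if k != "hash"}
            if e["prev"] != prev or e["hash"] != _digest(body):
                return False
            prev = e["hash"]
        return True
```

With this shape, "every customer-facing message this agent sent last Tuesday" is `log.query(agent=..., day=...)`, and `log.verify()` tells the referee whether anyone rewrote the afternoon.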

In a dealership this is concrete, not abstract. The after-hours agent that answered forty conversations last month should leave you able to answer three questions: which ones it converted to appointments, which ones it handed to a human and why, and exactly what it said in the one a customer complained about. If those answers live in a vendor dashboard you cannot export, you have stacked an audit problem on top of a vendor-dependence problem, and you will meet both on the same bad day.

Audit is also where the economics quietly favor you. The same logs that make agents accountable make them improvable. The handoffs, the failures, the weird edge cases: that is your tuning data, and it is yours only if you kept it.

Distribution: who gets which capability

This is the question almost everyone skips, and it is the one that scales worst.

The day one agent capability works, people want it. What happens next in most organizations is capability sprawl: prompts pasted between chat threads, configurations copied with stale instructions, every department running a slightly different version of the thing that worked, with no versioning and no way to revoke any of it. That is not adoption. That is shadow IT with a language model attached.

If that sounds theoretical, audit your own building: count the places agent instructions live today. If the answer involves screenshots, sticky notes, or a group chat, you already have a distribution problem. You just have not had the incident that names it yet.

Distribution means agent capabilities ship like software, even when no one involved writes software. Versioned, so you know who runs what. Centrally updated, so a fix lands everywhere. Scoped, so the capability arrives with its permissions attached instead of its credentials exposed. Revocable, so offboarding an employee or retiring a workflow is one action, not an archaeology project.
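Those four properties fit in a very small registry. The sketch below is illustrative only, with hypothetical class and method names. It assumes capabilities are resolved from one central place at run time, which is what makes "centrally updated" and "revocable" cheap.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Capability:
    name: str
    version: int
    scope: frozenset  # (system, action) pairs that ship with the capability


@dataclass
class Registry:
    """Hypothetical central registry: one source of truth for who runs what."""
    capabilities: dict = field(default_factory=dict)  # name -> latest Capability
    grants: dict = field(default_factory=dict)        # user -> set of capability names

    def publish(self, name, scope) -> int:
        # Versioned: every publish bumps the version.
        current = self.capabilities.get(name)
        version = current.version + 1 if current else 1
        self.capabilities[name] = Capability(name, version, frozenset(scope))
        return version

    def grant(self, user, name) -> None:
        if name not in self.capabilities:
            raise KeyError(f"unknown capability: {name}")
        self.grants.setdefault(user, set()).add(name)

    def resolve(self, user) -> list:
        # Centrally updated and scoped: users always receive the latest
        # version, with its scope attached, never a pasted copy.
        return [self.capabilities[n] for n in sorted(self.grants.get(user, ()))]

    def who_runs(self, name) -> list:
        return sorted(u for u, names in self.grants.items() if name in names)

    def revoke_user(self, user) -> None:
        # Revocable: offboarding is one action.
        self.grants.pop(user, None)
```

Because users hold a grant and not a copy, a fix published once lands for everyone, and `revoke_user` leaves nothing behind to dig up later.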

Human gates: what always needs sign-off

Some actions never go autonomous. Money leaving the building. Pricing commitments. Anything legal or HR. Anything irreversible and customer-visible at scale. The gate is not a failure of automation; it is a design feature, and the organizations that get this right make the gate cheap: one click, full context attached, decision logged.

A gate that takes effort gets rubber-stamped, and a rubber-stamped gate is worse than no gate, because it produces the paperwork of oversight without the oversight. Design the gate so the human reviewing it can actually review it in the time they will actually spend.

Where the gates go is a business decision, not a technical one, and the right people to draw them are the people who own the consequences. The desk decides what touches pricing. The controller decides what touches money. The principle is that someone with authority decided on purpose, in writing, instead of the boundary defaulting to whatever the tool shipped with.
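A cheap gate is mostly a data-shape problem: the request carries its full context, the decision is one call, and the record is written whether the answer is yes or no. A minimal sketch, assuming gate ownership is kept as a written mapping. `GATE_OWNERS`, `GateRequest`, and `decide` are hypothetical names, and the two owners shown are the examples from the paragraph above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Written down on purpose: which role owns which gate (examples from the text).
GATE_OWNERS = {"pricing": "desk", "money": "controller"}

decision_log = []  # in a real system this lives in the audit store


@dataclass(frozen=True)
class GateRequest:
    category: str   # must be a key in GATE_OWNERS
    summary: str    # one line the reviewer can read in seconds
    context: dict   # everything needed to decide, attached to the request


def decide(request: GateRequest, reviewer_role: str, approved: bool) -> bool:
    """Record a sign-off. Only the role that owns the gate may decide."""
    owner = GATE_OWNERS.get(request.category)
    if owner is None:
        raise KeyError(f"no written owner for gate: {request.category}")
    if reviewer_role != owner:
        raise PermissionError(f"{reviewer_role} does not own the {request.category} gate")
    decision_log.append({
        "category": request.category,
        "summary": request.summary,
        "context": request.context,
        "reviewer_role": reviewer_role,
        "approved": approved,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return approved
```

A category with no written owner fails loudly here. That is deliberate: the boundary should never default to whatever the tool shipped with.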

How I run it: Tool-Bag

Tool-Bag is the distribution and governance layer I built for my own multi-model work. It orchestrates Claude, Codex, and Gemini through 14 plugins, 108 skills, a Docker MCP Gateway, and 12 CI workflows. It is not open-sourced and I am not going to walk through internals. What is worth sharing is the shape, because the shape answers the four questions and you can reproduce the shape with your own tooling.

Plugins are capability bundles. A capability arrives with its boundaries attached: what it may touch is part of what it is. That is the permissions answer.
Skills are written procedures. 108 of them means an agent runs a documented, versioned play, not an improvisation. When the play changes, it changes in one place for every model.
The gateway is one door. Every call from any model to any external system passes through a single controlled layer. One door means one place to enforce policy and one place to watch. That is most of the audit answer.
CI workflows are enforcement. Twelve of them, so capabilities get tested, validated, and shipped the way software does. Governance that depends on people remembering rules fails. Governance that is enforced by pipeline does not care who is busy this week.
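To make "one door" concrete without describing Tool-Bag's internals, here is a generic sketch of the shape. It is not how Tool-Bag or the Docker MCP Gateway is implemented, and every name in it is hypothetical. The only claim is structural: policy is checked and the call is logged in the same place, for every agent.

```python
class Gateway:
    """Hypothetical single choke point between agents and external systems."""

    def __init__(self, policy: dict):
        self.policy = policy   # agent name -> set of (system, action) pairs
        self.handlers = {}     # (system, action) -> callable
        self.log = []          # every attempt, allowed or not

    def register(self, system: str, action: str, handler) -> None:
        self.handlers[(system, action)] = handler

    def call(self, agent: str, system: str, action: str, **kwargs):
        allowed = (system, action) in self.policy.get(agent, set())
        # One place to watch: denied attempts are logged too.
        self.log.append({"agent": agent, "system": system, "action": action,
                         "args": kwargs, "allowed": allowed})
        # One place to enforce: nothing reaches a handler without passing policy.
        if not allowed:
            raise PermissionError(f"{agent} may not {action} {system}")
        return self.handlers[(system, action)](**kwargs)
```

Agents never hold handlers or credentials directly. They hold a reference to the gateway, which is why the system keeps the keys.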

Tool-Bag exists for a specific kind of user: experts who have the context and know exactly what they need but do not write code. That is precisely the user who needs governance most, because capability without governance forces an impossible choice between handing over raw credentials and locking experts out entirely. Distribution done right dissolves that choice: the expert gets the capability, the system keeps the keys.

None of this requires my stack. The shape is the point: bundle capability with its boundaries, write the plays down, route every external call through one door you watch, and let a pipeline enforce what people forget. A dealer group can build the same shape from tools its team already runs. What it cannot do is skip the shape and stay safe at scale.

Meeting records and agent logs are one story

Here is the convergence I did not expect when I started building this.

At Strolid I built an internal meeting-intelligence pipeline in TypeScript and Python: meetings go in, structured records come out, with the decisions and commitments preserved instead of evaporating when the call ends. Ask what a meeting record is for and the answer is accountability: who decided what, when, with what reasoning, on whose authority.

Now read that sentence again as a description of an agent log. It is the same artifact. One captures the decisions humans make in rooms; the other captures the decisions software makes in systems. As agents take on real work, the two streams describe a single operation, and any accountability story that covers only one of them has a hole in the middle exactly where the hard questions land: who approved this, what did the agent do with it, and what did we know at the time?
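If the two artifacts really are one, they should fit one schema. The sketch below is an assumption about what such a shared record could look like, not the schema of the Strolid pipeline. The field names and the `ref` link between an approval and the action taken under it are hypothetical.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class DecisionRecord:
    """One shape for decisions made in rooms and decisions made in systems."""
    source: Literal["meeting", "agent"]
    actor: str      # who decided: a person or an agent
    decided: str    # what was decided or done
    reasoning: str  # why
    authority: str  # on whose authority
    at: str         # ISO 8601 UTC timestamp in one fixed format, so strings sort by time
    ref: str        # shared id tying an approval to the actions taken under it


def trace(records, ref: str) -> list:
    """Everything known about one thread of work, oldest first."""
    return sorted((r for r in records if r.ref == ref), key=lambda r: r.at)
```

A single `trace` call then answers the three hard questions in order: who approved this, what the agent did with it, and what was known at the time.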

Dealers already understand this instinct better than most industries. Every store keeps a deal jacket, because when a question surfaces in month eleven, the answer has to be in the file, not in someone's memory. Agent governance is the deal jacket for decisions made by software. The stores that would never deliver a unit without paper are about to run thousands of customer interactions without any, unless they decide otherwise now.

That is the direction I am taking Meeting Intelligence, and it is why the project is headed for open source as an AI safety, governance, and accountability tool. The audit layer is the one part of the stack that should be inspectable by the people it holds accountable. Closed-source accountability asks for exactly the trust it exists to replace.

The governance starter checklist

You do not need an enterprise program to start governing agents well. You need these, written down, before the first agent touches a production system:

Every agent has a named human owner. A person, not a department.
Every agent has a written scope: which systems, which actions, which tier.
Every action is logged in a place neither the agent nor its vendor can edit.
Customer-facing messages run on templates or rules until sampled output earns wider autonomy.
The human gates are listed explicitly: money, pricing, legal, anything irreversible at scale.
A kill switch exists, and someone has actually tested it this quarter.
Capabilities are versioned and revocable; offboarding revokes agent access the same hour it revokes passwords.
Log review has an owner and a cadence. Unread logs are decoration.
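The checklist is short enough to run as a check before anything ships. A sketch under assumptions: the manifest keys are hypothetical, and "this quarter" is approximated as the last 90 days. The useful part is that a missing answer counts as a gap, not a pass.

```python
def checklist_gaps(m: dict) -> list:
    """Return the checklist items a (hypothetical) agent manifest fails to satisfy."""
    days = m.get("kill_switch_tested_days_ago")
    checks = [
        ("named human owner", bool(m.get("owner"))),
        ("written scope",
         bool(m.get("systems")) and bool(m.get("actions")) and m.get("tier") is not None),
        ("log not editable by agent or vendor",
         m.get("log_editable_by_agent_or_vendor") is False),
        ("customer messages on templates or rules",
         m.get("customer_messages") in ("none", "templates", "rules")
         or m.get("autonomy_earned_by_sampling") is True),
        ("human gates listed", bool(m.get("human_gates"))),
        ("kill switch tested this quarter", isinstance(days, int) and days <= 90),
        ("versioned and revocable",
         m.get("version") is not None and m.get("revocable") is True),
        ("log review owner and cadence",
         bool(m.get("log_review_owner")) and bool(m.get("log_review_cadence"))),
    ]
    return [name for name, ok in checks if not ok]
```

An empty manifest fails all eight items, which is the right default: an agent earns production by answering them, not by omitting them.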

The one-line version, and the rule I hold every deployment to: an agent without an owner and an audit trail is not an asset. It is a liability that has not been invoiced yet.

Govern early, scale calmly

The stores and teams that win with agents will not be the ones that moved first. They will be the ones that could expand agent autonomy quickly because the governance was already in place, while everyone else was frozen by their first incident. Permissions, audit, distribution, gates: four questions, answered in writing, before the stakes arrive.

If you are putting agents to work and want the governance built right, that is the work I do. See the work for what I have shipped and pricing for how an engagement starts.