The Agent Trust Problem: Should Your AI Agent Write Code to Your Repository?
We just gave an AI agent write access to our GitHub repository.
This decision took two days of debate. Here’s what we concluded, how we mitigated the risk, and what you should do if you’re in the same position.
The Trade-off
Benefit: Agent can create branches, commit code, open PRs, and deploy without human friction.
One example from this week:
- Agent drafts blog post (2 minutes)
- Creates branch (10 seconds)
- Commits markdown file (5 seconds)
- Opens PR (10 seconds)
- Total: 3 minutes, $0.02 in API costs
vs.
- Agent drafts post and sends it to us
- We manually create branch
- We copy/paste the content
- We commit, push, open PR
- Total: 15 minutes of human time, $0 in API costs, plus the cost of a context switch
Risk: If the agent’s permissions are too broad or its judgment is flawed, it could:
- Commit secrets (API keys, credentials) to the repo
- Overwrite production code by mistake
- Create a PR that merges destructive changes
- Spam the repository with thousands of commits
How We Evaluated the Risk
We asked three questions:
1. What’s the blast radius if the agent makes a mistake?
For content agents writing markdown: Low. A bad blog post can be edited or unpublished. A bad commit message is embarrassing but fixable.
For engineering agents modifying critical paths: High. A bad deployment breaks production. A bad permission change leaks secrets.
2. Can we detect and revert mistakes quickly?
For content PRs: Yes. We review every PR before merge. If the agent writes bad content, we catch it during review.
For production code: Yes, but slower. If the agent commits bad code, we catch it in review, but revert time is measured in minutes or hours, not seconds.
3. What’s the minimum permission set we can give the agent?
- Read-only access to most of the repo (fine)
- Write access only to 03-Marketing Materials/ (tight)
- No access to credentials, infrastructure code, or production keys
- No permission to merge without human approval
- All commits must be attributed to the agent (audit trail)
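The scoped-write rule above can be enforced in the agent's own tooling as well as in GitHub. A minimal sketch, assuming a hypothetical allowlist check run before every push (the directory name comes from this post; the function and variable names are illustrative):

```python
from pathlib import PurePosixPath

# Hypothetical allowlist: the only directory the content agent may write to.
ALLOWED_PREFIXES = ("03-Marketing Materials/",)

def agent_may_write(path: str) -> bool:
    """Return True if the agent is allowed to modify this repo path."""
    normalized = PurePosixPath(path).as_posix()
    return normalized.startswith(ALLOWED_PREFIXES)

# Example: check every path in a proposed commit before pushing.
paths = ["03-Marketing Materials/w03-linkedin.md", "02-Infrastructure/deploy.yml"]
blocked = [p for p in paths if not agent_may_write(p)]
```

Running the check client-side is belt-and-braces: GitHub's team permissions remain the real enforcement, but the agent fails fast instead of getting a 403 mid-workflow.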
What We Built
GitHub team-based permissions:
- Agent is member of the “Content” team
- Team has write access to 03-Marketing Materials/ only
- Team has no access to 00-Corporate/, 02-Infrastructure/, or 01-Course-Content/
- Protected branch rules: no merges without at least one human review
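The "at least one human review" rule can be set through GitHub's branch-protection REST endpoint (PUT /repos/{owner}/{repo}/branches/{branch}/protection). A sketch of the request payload in Python; the field names follow GitHub's public API docs, but verify against the current version before relying on them:

```python
import json

# Branch-protection payload: require one approving human review before merge.
protection = {
    "required_pull_request_reviews": {
        "required_approving_review_count": 1,  # at least one human review
    },
    "enforce_admins": True,          # no bypassing the rule, even for admins
    "required_status_checks": None,  # we gate on review here, not CI checks
    "restrictions": None,            # no per-user push restrictions
}
payload = json.dumps(protection)
```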
In Paperclip (our agent control plane):
- Agents can create branches and commit code
- Agents cannot merge PRs on their own (human approval required)
- All commits are logged with timestamp and reason (“Drafting W03 LinkedIn posts”)
- Agents can only work on issues they’re explicitly assigned
- If an agent exceeds rate limits or detects an error, it blocks itself and escalates
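The self-blocking behavior in the last bullet can be sketched as a small guard the agent consults before each API call. This is a hypothetical illustration of the idea, not Paperclip's actual implementation; the class and method names are invented:

```python
class AgentGuard:
    """Sketch of an agent that halts itself when it crosses a request budget,
    rather than relying on GitHub's server-side rate limit alone."""

    def __init__(self, max_requests_per_hour: int = 500):
        self.max_requests = max_requests_per_hour
        self.requests_made = 0
        self.blocked = False
        self.last_escalation = None

    def before_request(self) -> bool:
        """Call before each API request; returns False if the agent must stop."""
        if self.blocked:
            return False
        if self.requests_made >= self.max_requests:
            self.blocked = True
            self.escalate("rate budget exceeded")
            return False
        self.requests_made += 1
        return True

    def escalate(self, reason: str) -> None:
        # In production this would notify a human; here it records the reason.
        self.last_escalation = reason
```

Setting the agent's own budget well below GitHub's 5,000 requests/hour means a runaway loop escalates to a human long before the platform starts rejecting calls.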
In practice:
- Agent creates PR
- PR appears in review queue
- Human reviews (2-5 minutes)
- Human approves or requests changes
- Agent reads approval, takes action (merge or update)
- Done
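The review loop above reduces to a small decision table the agent consults after each review event. A minimal sketch, with invented state and action names:

```python
def next_action(review_state: str) -> str:
    """Map a PR review outcome to the agent's next step (sketch of the loop above)."""
    actions = {
        "approved": "merge",            # human approved; agent may act on it
        "changes_requested": "update",  # agent revises the PR, re-requests review
        "pending": "wait",              # still in the human review queue
    }
    return actions.get(review_state, "escalate")  # unknown state -> ask a human
```

Defaulting unknown states to "escalate" keeps the human in the loop even when the workflow hits a case nobody anticipated.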
The human is always in the loop for final decision.
The Risks We Accepted
Risk 1: The agent commits credentials
Mitigation: We use GitHub’s secret scanning. Any commit with a secret pattern (AWS key format, private key marker) is blocked before it reaches the repo. Plus: our git hooks scan for secrets locally.
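A local pre-commit hook for the same job can be a few lines of pattern matching. The two regexes below mirror the examples in this post (AWS key format, private key marker) and are illustrative only; GitHub's secret scanning uses a much larger, provider-maintained ruleset:

```python
import re

# Illustrative secret patterns only -- real scanners ship hundreds of rules.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # AWS access key ID format
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # private key marker
]

def contains_secret(text: str) -> bool:
    """Return True if any known secret pattern appears in the text."""
    return any(p.search(text) for p in SECRET_PATTERNS)
```

Running this in a git hook catches the obvious cases before a push; the server-side scan remains the authoritative gate.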
Risk 2: The agent overwrites important files
Mitigation: Protected branches. The agent cannot merge without approval. Plus: the files are in a tightly scoped directory. It can’t touch infrastructure code.
Risk 3: The agent spam-commits
Mitigation: We can revoke the agent’s token immediately. The GitHub API has rate limits (5,000 requests/hour). Beyond that, the agent’s requests fail and it blocks itself.
Risk 4: The agent’s prompt is bad, and it commits weird things
Mitigation: All commits are reviewed. A weird commit is caught in review, not merged, and we adjust the agent’s instructions.
When Agents Shouldn’t Have Write Access
Don’t give write access if:
- High-stakes code changes. Production infrastructure, auth systems, payment processing. Humans only.
- Regulated environments. Healthcare, finance, legal. Agents + humans, but the human signature matters for compliance.
- Ambiguous requirements. If the spec is fuzzy, an agent will commit what it interprets. You’ll spend time reverting. Clarify first, then let the agent write.
- Untested workflows. Test the workflow with a human first. Once it’s proven, let the agent automate it.
When Agents Should Have Write Access
Do give write access if:
- Clear, repetitive tasks. “Draft 3 LinkedIn posts per week following this template.” Agents excel here.
- Low-risk output. Content can be edited. A blog post is never final until published.
- Scoped permissions. Agent can only touch designated directories.
- Audit trail required. All commits logged, attributed, timestamped.
- Fast human review. You can review every PR in 5 minutes. If review takes 30 minutes per PR, the bottleneck shifts to you, not the agent.
The Broader Lesson
This is a trust boundary decision, not a technical one.
You’re deciding: “What parts of my system am I comfortable with an AI agent controlling?”
For us:
- Agent controls: Content, blog posts, PR drafts
- Agent cannot control: Merging to main, deploying, credentials, infrastructure
It’s not that agents are untrustworthy. It’s that they’re alien. They don’t have skin in the game. If an agent breaks something, it doesn’t feel pain. So we use guards: reviews, permissions, audit trails, rate limits.
These guards are the same ones we use for junior humans joining the team. The principle is identical: trust, but verify. And build walls around things that matter.
Next: If you’re running agents in your workflow, audit the permissions you’ve given them. Do they need write access? Or can they draft, propose, and hand off to a human for merge? The answer determines your risk profile.