The Docker Trap That Erased Our AI Company's Memory
We lost everything overnight.
Not from a cyberattack. Not from a corrupted hard drive. From a Docker default that millions of developers use every day without thinking about it: named volumes.
Here’s what happened, why it matters for anyone running AI infrastructure, and the three-layer backup system we built so it can never happen again.
What We Were Running
We’d spent a full day setting up Paperclip AI — a self-hosted agent orchestration platform — on a $12/month AWS EC2 instance. CEO agent configured. CTO agent configured. Projects created, issues seeded, corporate context loaded. The whole org bootstrapped from scratch.
It took about 16 hours across two days. Configuration, debugging ARM64 compatibility issues, getting Claude CLI authenticated inside the container, connecting GitHub, onboarding the instance. Real work.
Then we restarted the containers.
What Named Volumes Actually Do
When you run docker compose down and bring the stack back up under a changed configuration, Docker Compose can provision brand-new named volumes instead of reattaching the old ones. Compose namespaces volumes by project name (by default, the directory containing the compose file), so a renamed project, a different working directory, or changed volume keys all resolve to fresh, empty volumes. The old ones still exist on disk, orphaned and invisible to your current compose configuration.
In our case, the old containers were named src-paperclip-1 and src-postgres-1. The new compose spec generated src-server-1 and src-db-1. Two different names. Two different volume sets. The Postgres data — every agent config, every project, every issue, every onboarding record — was sitting in the old volumes, completely unreachable.
The database the new containers saw was empty. Fresh install. Day zero.
```yaml
# Old compose (what we had)
volumes:
  pgdata:
  paperclip-data:

# These volumes on disk:
#   src_pgdata
#   src_paperclip-data

# New compose (what we ran)
# Generated container names: src-server-1, src-db-1
# Its volume references resolved to new keys, so Docker created
# fresh, empty volumes instead of reusing the old ones.
# src_pgdata and src_paperclip-data stayed on disk, orphaned.
```
The subtle part: this doesn’t always break. When you restart without changing names, volumes reconnect fine. The trap springs when you change your compose configuration even slightly — a service rename, a project directory change, running compose from a different folder.
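If you suspect you have sprung the trap, the old data is usually still sitting on disk and worth checking before you touch anything. A quick triage sketch (the volume names here are from our setup; yours will differ):

```shell
# List every volume Docker knows about, including orphans
# from previous compose configurations
docker volume ls

# Inspect a suspect volume to see where its data lives on the host
docker volume inspect src_pgdata
# Look at the "Mountpoint" field in the output, e.g.
# /var/lib/docker/volumes/src_pgdata/_data

# Peek inside with a throwaway container to confirm the data is intact
docker run --rm -v src_pgdata:/mnt alpine ls /mnt
```

If the old volume still holds your Postgres data directory, nothing is lost yet; you just need to point the new containers (or a bind mount) back at it.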
Why This Hits AI Infrastructure Harder
Traditional web apps store most state in external databases with connection strings you control. Lose the container, the data is still in RDS, or MongoDB Atlas, or wherever.
Self-hosted AI platforms are different. They store agent configurations, conversation history, workspace mappings, authentication state, and instance configuration inside the Docker environment itself. Paperclip, for example, writes everything to PostgreSQL running in a sibling container. If that container loses its volume mapping, you don’t just lose a cache — you lose the entire organizational memory of your AI team.
Your agents forget who they are. Your projects disappear. Your issues evaporate. You’re back to day zero.
The Fix: Bind Mounts at Explicit Paths
Named volumes are convenient. Bind mounts are reliable.
The difference: a named volume is managed by Docker, stored at a path Docker chooses, referenced by a name that can become inconsistent. A bind mount is an explicit path on the host filesystem — always at /data/paperclip/pgdata, regardless of what your compose file is named, regardless of which container is running.
```yaml
# Named volume — fragile
volumes:
  - pgdata:/var/lib/postgresql/data

# Bind mount — explicit and durable
volumes:
  - /data/paperclip/pgdata:/var/lib/postgresql/data
```
With bind mounts, you can delete every container, recreate them from scratch with different names, and the data is still sitting at /data/paperclip/pgdata waiting to be remounted.
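Moving existing data out of an orphaned named volume and into an explicit host path can be done with a throwaway container. A sketch, assuming the volume and path names from our setup:

```shell
# Create the explicit host path that the bind mount will use
sudo mkdir -p /data/paperclip/pgdata

# Stop Postgres first so the files are consistent, then copy the
# named volume's contents to the bind-mount path via a temporary
# container that mounts both
docker run --rm \
  -v src_pgdata:/from \
  -v /data/paperclip/pgdata:/to \
  alpine sh -c "cp -a /from/. /to/"
```

After the copy, update the compose file to use the bind mount and bring the stack back up; the old named volume can stay around as a fallback until you have verified the migration.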
What We Actually Lost
The irony: we lost the entire Paperclip database, but almost nothing of real business value. Why?
Because we keep everything in the repo.
The agent system prompts are in 02-Infrastructure/agents/. The corporate strategy is in 00-Corporate/. The infrastructure configuration is in Terraform. The docker-compose.yml is version controlled and uploaded to S3. Even the deployment instructions — the exact steps to rebuild from scratch — are committed to the repo.
What used to take 16 hours to set up took under 2 hours to rebuild. And we came out of it with better infrastructure than we had before.
The Three Backup Layers We Added
After this incident, we implemented three layers of protection:
Layer 1 — Bind mounts on encrypted EBS. All data lives at explicit paths on a separate 20GB EBS volume. Postgres at /data/paperclip/pgdata, app data at /data/paperclip/appdata. The EBS volume is encrypted and persists independently of the EC2 instance.
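The one-time setup for that data volume is short. A sketch, with the caveat that the device name is an assumption (check lsblk on your instance; NVMe-based instances typically expose the attached volume as /dev/nvme1n1):

```shell
# Format the attached EBS volume and mount it at /data
sudo mkfs -t ext4 /dev/nvme1n1
sudo mkdir -p /data
sudo mount /dev/nvme1n1 /data

# Mount by UUID in /etc/fstab so it survives reboots;
# nofail keeps the instance bootable if the volume is detached
echo "UUID=$(sudo blkid -s UUID -o value /dev/nvme1n1) /data ext4 defaults,nofail 0 2" \
  | sudo tee -a /etc/fstab

# Create the explicit paths the containers bind-mount
sudo mkdir -p /data/paperclip/pgdata /data/paperclip/appdata
```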
Layer 2 — Daily EBS snapshots. AWS Data Lifecycle Manager takes automated snapshots of the data volume every day at 3 AM UTC, retaining 14 days. Even if the EBS volume itself is corrupted, we can restore from snapshot.
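A policy like this can be created through the AWS CLI. A sketch only: the role ARN, account ID, and tag values below are placeholders, and our actual policy lives in Terraform:

```shell
# Daily 3 AM UTC snapshots of any volume tagged Backup=paperclip,
# retaining the most recent 14
aws dlm create-lifecycle-policy \
  --description "Daily snapshots of the Paperclip data volume" \
  --state ENABLED \
  --execution-role-arn arn:aws:iam::123456789012:role/AWSDataLifecycleManagerDefaultRole \
  --policy-details '{
    "ResourceTypes": ["VOLUME"],
    "TargetTags": [{"Key": "Backup", "Value": "paperclip"}],
    "Schedules": [{
      "Name": "daily-3am",
      "CreateRule": {"Interval": 24, "IntervalUnit": "HOURS", "Times": ["03:00"]},
      "RetainRule": {"Count": 14}
    }]
  }'
```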
Layer 3 — Nightly pg_dump to S3. A cron job at 2 AM UTC runs pg_dump and uploads a compressed SQL file to S3. Portable, inspectable, restorable to any Postgres instance anywhere.
```shell
#!/usr/bin/env bash
# /usr/local/bin/paperclip-backup.sh (simplified)
set -euo pipefail   # fail the job if pg_dump, gzip, or the upload fails

DATE=$(date +%Y%m%d-%H%M)
docker exec src-db-1 pg_dump -U paperclip paperclip \
  | gzip \
  | aws s3 cp - "s3://your-bucket/backups/pg-$DATE.sql.gz"
```
Total cost of this backup stack: about $3.50/month.
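A backup you have never restored is a hope, not a backup. The mirror-image restore, sketched with the same placeholder bucket and container names as the backup script (the object key shown is hypothetical; list the bucket to find the dump you want):

```shell
# Find the dump to restore
aws s3 ls s3://your-bucket/backups/

# Stream it straight into a running Postgres container
aws s3 cp "s3://your-bucket/backups/pg-20250101-0200.sql.gz" - \
  | gunzip \
  | docker exec -i src-db-1 psql -U paperclip paperclip
```

Running this against a scratch database once a month is a cheap way to confirm the whole pipeline still works.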
The Lesson
AI infrastructure has the same failure modes as any infrastructure, but the blast radius is higher. When a traditional web app loses its database, you lose user data. When an AI orchestration platform loses its database, you lose the agents themselves — their identities, their context, their operational history.
Treat AI agent configuration like you treat production databases: explicit paths, multiple backup layers, everything in version control.
The 16 hours we lost cost us nothing except time. Because the repo had everything we needed to rebuild.
That’s the real lesson: the backup that saved us wasn’t EBS snapshots or pg_dump. It was Git.