The Three-Layer Backup Strategy for Self-Hosted AI Infrastructure

by Alien Brain Trust AI Learning

After losing our AI agent configuration overnight to a Docker named-volume issue, we built a three-layer backup system that covers every failure mode we could think of. Total cost: under $5/month.

This is the exact setup — Terraform, bash scripts, and cron included.

The Three Failure Modes We’re Protecting Against

Before choosing backup tools, we mapped out what could actually go wrong:

  1. Container mishap — compose file changes, orphaned volumes, accidental docker compose down -v
  2. Instance failure — EC2 hardware failure, accidental termination, corrupt root volume
  3. Data corruption — application bug writes bad data, Postgres corruption, filesystem issue

Each layer of the backup stack addresses a different failure mode.

Layer 1: Bind Mounts on a Separate EBS Volume

This is the foundation. Everything Docker writes goes to explicit paths on a dedicated 20GB EBS volume mounted at /data, not to Docker-managed named volumes.

# docker-compose.yml
services:
  db:
    image: postgres:17-alpine
    volumes:
      - /data/paperclip/pgdata:/var/lib/postgresql/data  # not pgdata:

  server:
    image: paperclip:latest
    user: "1000:1000"
    volumes:
      - /data/paperclip/appdata:/paperclip
      - /data/paperclip/appdata:/app/data

The EBS volume is separate from the root volume and persists independently. Terminate the EC2 instance, the EBS volume survives. Recreate the instance, reattach the volume, and every container sees its data exactly where it left it.

What this protects against: Container mishaps, instance replacement, compose file changes.

What this doesn’t protect against: EBS volume corruption, accidental aws ec2 delete-volume.
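Provisioning that dedicated volume is a one-time step. A sketch of the mount setup, assuming the attached EBS volume shows up as /dev/nvme1n1 (an assumption; check `lsblk` on your instance, since the device name varies by instance type):

```shell
#!/bin/bash
# One-time setup for the dedicated /data volume. The device name
# /dev/nvme1n1 is an assumption -- verify with `lsblk` first.
setup_data_volume() {
  local device=/dev/nvme1n1 mount_point=/data

  if [ ! -b "$device" ]; then
    echo "device $device not attached"
    return 1
  fi

  # Format only if the volume has no filesystem yet (blkid exits non-zero)
  blkid "$device" >/dev/null 2>&1 || mkfs.ext4 "$device"

  mkdir -p "$mount_point"
  # Mount by UUID so the fstab entry survives device renames across reboots;
  # `nofail` keeps the instance bootable even if the volume is detached
  local uuid
  uuid=$(blkid -s UUID -o value "$device")
  grep -q "$uuid" /etc/fstab \
    || echo "UUID=$uuid $mount_point ext4 defaults,nofail 0 2" >> /etc/fstab
  mount -a
}

setup_data_volume || true
```

The `nofail` mount option matters more than it looks: without it, a detached data volume leaves the instance stuck in emergency mode at boot.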

Layer 2: Automated EBS Snapshots (14-Day Retention)

AWS Data Lifecycle Manager takes daily snapshots of the data volume. We manage this in Terraform so it’s reproducible:

resource "aws_dlm_lifecycle_policy" "ebs_snapshots" {
  description        = "Daily EBS snapshots - 14-day retention"
  execution_role_arn = aws_iam_role.dlm.arn
  state              = "ENABLED"

  policy_details {
    resource_types = ["VOLUME"]

    schedule {
      name = "14-day daily snapshots"

      create_rule {
        interval      = 24
        interval_unit = "HOURS"
        times         = ["03:00"]
      }

      retain_rule {
        count = 14
      }
    }

    target_tags = {
      Name = "paperclip-app-data"
    }
  }
}

Snapshots are incremental after the first full snapshot. Cost for a 20GB volume with daily snapshots: about $1.60/month.
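The policy above references `aws_iam_role.dlm` as its execution role. A minimal sketch of that role, leaning on the AWS-managed `AWSDataLifecycleManagerServiceRole` policy rather than hand-written permissions (the role name here is arbitrary):

```hcl
# Execution role that DLM assumes to create and delete snapshots
resource "aws_iam_role" "dlm" {
  name = "dlm-lifecycle-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "dlm.amazonaws.com" }
    }]
  })
}

# AWS-managed policy covering snapshot create/delete/describe and tagging
resource "aws_iam_role_policy_attachment" "dlm" {
  role       = aws_iam_role.dlm.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSDataLifecycleManagerServiceRole"
}
```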

What this protects against: EBS volume corruption, regional failures, accidental volume deletion.

What this doesn’t protect against: Application-level data corruption that persists across snapshot windows, or restoring into a different AWS account or environment.

Layer 3: Nightly pg_dump to S3

EBS snapshots are block-level backups — fast to restore, but tied to AWS, and usable in another region or account only after an explicit snapshot copy. A SQL dump is portable. You can restore it to any Postgres instance anywhere, which matters if you’re ever migrating infrastructure.

We run this as a cron job at 2 AM UTC (before the 3 AM EBS snapshot):

#!/bin/bash
# /usr/local/bin/paperclip-backup.sh
set -euo pipefail  # pipefail: a failing pg_dump aborts the job instead of uploading an empty dump

DATE=$(date +%Y%m%d-%H%M)
BUCKET="your-scripts-bucket"
DEST="s3://$BUCKET/backups/pg-$DATE.sql.gz"

echo "[$DATE] Starting pg_dump backup to $DEST"
docker exec src-db-1 pg_dump -U paperclip paperclip \
  | gzip \
  | aws s3 cp - "$DEST" --region us-east-1

echo "[$DATE] Backup complete"

# Prune backups older than 30 days
cutoff=$(date -d "30 days ago" +%Y%m%d)
aws s3 ls "s3://$BUCKET/backups/" \
  | awk '{print $4}' \
  | while read -r key; do
      created=$(echo "$key" | grep -oP '\d{8}' | head -1 || true)
      [[ -n "$created" ]] || continue
      if [[ "$created" < "$cutoff" ]]; then
        aws s3 rm "s3://$BUCKET/backups/$key"
      fi
    done

Cron entry (/etc/cron.d/paperclip-backup):

0 2 * * * root /usr/local/bin/paperclip-backup.sh >> /var/log/paperclip-backup.log 2>&1

What this protects against: Application-level corruption (restore a dump taken before the bad writes landed) and migration — the SQL dump restores to any Postgres instance anywhere. Note that it covers only the database, not the files under /data/paperclip/appdata.

Cost: S3 storage for 30 days of compressed dumps is under $0.50/month.
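The prune loop in the backup script leans on a small trick worth calling out: zero-padded YYYYMMDD stamps sort correctly as plain strings, so no date parsing is needed for the comparison. A quick illustration:

```shell
# Lexicographic comparison works because YYYYMMDD is fixed-width and
# zero-padded: "20240105" < "20241231" both as dates and as strings.
cutoff=$(date -d "30 days ago" +%Y%m%d)
created="20200101"  # date stamp from an old backup key, e.g. pg-20200101-0200.sql.gz

if [[ "$created" < "$cutoff" ]]; then
  echo "prune"
else
  echo "keep"
fi
```

This is also why the timestamp format matters: a stamp like `2026315` (no zero-padding) would break the comparison.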

How to Restore

From bind mounts (fastest — container restart)

cd /data/paperclip/src
docker compose down && docker compose up -d
# Data is still at /data/paperclip/pgdata — containers remount it

From EBS snapshot (instance replacement)

# In AWS Console or CLI:
# 1. Create new EBS volume from snapshot
# 2. Attach to new EC2 instance as /dev/sdf
# 3. Mount at /data
# 4. docker compose up -d

From pg_dump (point-in-time restore or migration)

aws s3 cp s3://your-bucket/backups/pg-20260315-0200.sql.gz - \
  | gunzip \
  | docker exec -i src-db-1 psql -U paperclip paperclip
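Replaying a dump into a database that still contains objects produces duplicate-object errors. One approach (sketched here; stop the app container first so nothing writes mid-restore) is to recreate the database from the `postgres` maintenance DB before piping the dump in:

```sql
-- Run via: docker exec -i src-db-1 psql -U paperclip -d postgres
-- (DROP requires that no other sessions are connected to the database)
DROP DATABASE IF EXISTS paperclip;
CREATE DATABASE paperclip OWNER paperclip;
```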

The Infrastructure-as-Code Layer

The fourth layer, which costs nothing: keeping everything in version control.

The entire infrastructure — Terraform files, docker-compose.yml, user_data.sh bootstrap script, deployment instructions — lives in a private GitHub repo. When we lost our Paperclip configuration, we didn’t lose the knowledge of how to rebuild it. Everything was committed.

What took 16 hours to build the first time took under 2 hours to rebuild. And we came out with better infrastructure: recursive chown for container permissions, explicit bind mounts instead of named volumes, and this three-layer backup system baked into the Terraform from the start.

The Full Cost Breakdown

| Layer | Resource | Monthly Cost |
|---|---|---|
| Layer 1 | 20GB EBS data volume (gp3) | $1.60 |
| Layer 2 | EBS snapshots, 14-day incremental | ~$1.60 |
| Layer 3 | S3 pg_dump storage, 30-day retention | < $0.50 |
| Bonus | KMS encryption key for secrets | $1.00 |
| **Total** | | ~$4.70 |

For $4.70/month, you get three independent recovery paths covering every failure mode. For a self-hosted AI platform running your company’s agent infrastructure, that’s not optional.
