We Broke Our AWS Credentials — Then Fixed Them Properly

by Alien Brain Trust AI Learning

Meta Description: How ABT went from root CLI credentials and accidental IAM user deletion to SSM Parameter Store, dedicated service accounts, and SSO — the unfiltered story.

We deleted the IAM user by accident. Then realized we were running the AWS CLI as root. Then discovered our GitHub Actions workflows had been failing for weeks because of a stale key from the deleted user. All of this happened in the same session.

Here’s what we learned.

The Starting State (It Was Bad)

When we ran aws configure list to debug a workflow failure, we got this:

NAME       VALUE                    TYPE      LOCATION
access_key ****************LFKH     login
secret_key ****************ykTr     login

Type: login. That sounds fine. Then we ran aws sts get-caller-identity:

{
  "UserId": "630287363158",
  "Account": "630287363158",
  "Arn": "arn:aws:iam::630287363158:root"
}

Root. We were running every local AWS CLI command, including pulling secrets from SSM, as the root account. Not because we meant to: we had deleted and recreated the terraform-deployer IAM user mid-session, the new credentials never got wired up, and the CLI fell back to our daily SSO login, which happens to authenticate as the management account root.
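A guard like the following would have caught this before any command ran. It is a minimal sketch, not something from our repo: the function names is_root_arn and assert_not_root are hypothetical, and the only AWS call is the same aws sts get-caller-identity shown above.

```shell
# Hypothetical helper: succeeds when the ARN is an account's root user.
is_root_arn() {
  case "$1" in
    arn:*:iam::*:root) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical guard: stop early if the active credentials resolve to root.
# Returns 2 if the identity lookup itself fails, 1 if the caller is root.
assert_not_root() {
  local arn
  arn=$(aws sts get-caller-identity --query Arn --output text) || return 2
  if is_root_arn "$arn"; then
    echo "Refusing to run: caller is $arn" >&2
    return 1
  fi
  echo "Caller identity: $arn"
}
```

Calling assert_not_root || exit 1 at the top of a script is one way to use it.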

We had also been rotating GitHub Actions secrets from the same user, so when the user got deleted, every Terraform CI run started failing with an “invalid security token” error. The error had been sitting there for days before we dug into it.

The Circular Secret Problem

While debugging this, we hit a pattern we now call the circular secret problem.

We store secrets in AWS SSM Parameter Store — that’s the right place for them. But to read from SSM, you need AWS credentials. And we were trying to figure out which credentials were configured where, without being able to see the actual secret values (which is correct — you should never print them).

The sequence that trips people up:

  1. You rotate an IAM key
  2. You update SSM with the new key value
  3. Something downstream still uses the old key to read SSM
  4. It fails
  5. You can’t easily tell which key is stale because you can’t print them

The fix is separating concerns clearly: bootstrap credentials (the ones needed to reach SSM in the first place) must live outside SSM. That means GitHub secrets for CI/CD, and local SSO credentials for developers. Everything else — API keys, tokens, passwords — lives in SSM only.
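One way to tell which key is stale without printing anything secret is to compare the last four characters, the same suffix aws configure list already shows. The sketch below assumes that approach. key_suffix and user_has_key_suffix are hypothetical names, and it compares access key IDs only, never secret values.

```shell
# Print only the last four characters of a value, mirroring the masking
# that `aws configure list` applies.
key_suffix() {
  printf '%s' "$1" | tail -c 4
}

# Hypothetical check: does any access key on the IAM user end with this suffix?
# Returns 1 when there is no match, including when the user no longer exists.
user_has_key_suffix() {
  local user="$1"
  local suffix="$2"
  local id
  for id in $(aws iam list-access-keys --user-name "$user" \
      --query 'AccessKeyMetadata[].AccessKeyId' --output text 2>/dev/null); do
    if [ "$(key_suffix "$id")" = "$suffix" ]; then
      return 0
    fi
  done
  return 1
}
```

For example, user_has_key_suffix github-actions-aws-deployer LFKH reports whether the key the CLI is using still belongs to that user.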

What We Actually Fixed

Separated GitHub Actions from local dev credentials. Created a dedicated github-actions-aws-deployer IAM user with only the permissions Terraform needs. This user’s key lives in GitHub secrets and nowhere else. Local dev uses SSO temporary credentials. Rotating one no longer breaks the other.

Deleted the orphaned terraform-deployer user. Nothing was using it. No reason to keep it. Dead credential = attack surface with no value.

Validated that local CLI is SSO, not a static key. When aws configure list shows type: login, those are temporary credentials from the daily SSO session. They expire when the session ends — no long-term key sitting in a file. That’s the right model for human access.

Confirmed all secrets are in SSM. Not in .env files, not hardcoded in scripts, not in GitHub secrets (except the bootstrap ones). Every script that needs a secret pulls it at runtime:

KEY=$(aws ssm get-parameter \
  --name "/paperclip-app/api-key" \
  --with-decryption \
  --region us-east-1 \
  --query Parameter.Value \
  --output text)
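Since our CI failures sat unnoticed for days, it may be worth wrapping that call so a failed read stops the script with a clear message. This is a sketch under assumptions: get_secret is a hypothetical name, and the default region simply mirrors the example above.

```shell
# Hypothetical wrapper around `aws ssm get-parameter`.
# Prints the decrypted value on stdout; on failure prints a hint to stderr
# and returns non-zero so the caller can stop.
get_secret() {
  local name="$1"
  local region="${2:-us-east-1}"
  local value
  if ! value=$(aws ssm get-parameter \
      --name "$name" \
      --with-decryption \
      --region "$region" \
      --query Parameter.Value \
      --output text); then
    echo "Could not read $name from SSM; check which credentials are active" >&2
    return 1
  fi
  printf '%s' "$value"
}
```

Usage would look like KEY=$(get_secret "/paperclip-app/api-key") || exit 1.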

Created AWS Organizations and prepared for IAM Identity Center. The long-term answer for human access is SSO with named permission sets, not long-term keys at all. We created the org this session. Identity Center setup is next.

The Pattern That Holds

After working through this, here’s the model that actually makes sense for a small team running infrastructure with AI agents:

Who/What             Auth Method          Key Type
Jared (local)        SSO daily session    Temporary (expires)
GitHub Actions       Dedicated IAM user   Static key in GH secrets
EC2 instances        IAM Role             No key (instance profile)
Agents (Paperclip)   API key from SSM     Static, rotated periodically

The rule is: the fewer long-term static keys that exist, the smaller your blast radius when something goes wrong.

What We’d Do Differently

Don’t rotate credentials in the middle of a debugging session. When you’re already confused about which key is which, adding a new key to the mix is how you end up deleting the user entirely.

Set up SSO before you need to rotate anything. Once IAM Identity Center is running, developers never touch static keys at all — and there’s nothing to accidentally delete.

Keep a credential inventory. We’re building a runbook that lists every secret, where it lives, when it was last rotated, and what breaks if it expires. It’s boring documentation but it would have saved two hours of debugging.
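The "where it lives" and "when it was last rotated" columns can be seeded from metadata alone, without reading any secret value. The following is a rough sketch, assuming a hypothetical credential_inventory function, one IAM user to inspect, and us-east-1 as the default region. The "what breaks if it expires" column still has to be written by hand.

```shell
# Hypothetical inventory: one tab-separated row per credential.
# Columns: source, name or key ID, then the metadata AWS reports.
# Only metadata is listed; no secret values are read.
credential_inventory() {
  local user="$1"
  local region="${2:-us-east-1}"
  # SSM parameters with their last-modified timestamps.
  aws ssm describe-parameters --region "$region" \
    --query 'Parameters[].[Name,LastModifiedDate]' --output text |
    awk '{ print "ssm\t" $0 }'
  # Access keys on the IAM user with creation date and status.
  aws iam list-access-keys --user-name "$user" \
    --query 'AccessKeyMetadata[].[AccessKeyId,CreateDate,Status]' --output text |
    awk '{ print "iam\t" $0 }'
}
```

Running credential_inventory github-actions-aws-deployer > inventory.tsv gives a starting file to annotate.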

What’s Next

Root account access is being locked down now — not eventually. After this incident we did a full security review: MFA is confirmed enabled on root, static root credentials do not exist, and root is not used for any day-to-day operations.
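Two of those checks can be repeated from the CLI. The sketch below assumes the account summary is an acceptable source for them: aws iam get-account-summary reports AccountMFAEnabled (1 when root has MFA) and AccountAccessKeysPresent (1 when root has access keys). The function name root_posture_ok is hypothetical.

```shell
# Hypothetical check: root should have MFA enabled and no access keys.
# Returns 0 when both hold, 1 when either does not, 2 if the lookup fails.
root_posture_ok() {
  local summary mfa keys
  summary=$(aws iam get-account-summary \
    --query 'SummaryMap.[AccountMFAEnabled,AccountAccessKeysPresent]' \
    --output text) || return 2
  # The text output is two whitespace-separated numbers.
  set -- $summary
  mfa="$1"
  keys="$2"
  if [ "$mfa" = "1" ] && [ "$keys" = "0" ]; then
    echo "root: MFA enabled, no access keys"
    return 0
  fi
  echo "root: MFA=$mfa access_keys=$keys (expected 1 and 0)" >&2
  return 1
}
```

This does not prove root is unused day to day; that still needs CloudTrail review.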

IAM Identity Center is the final piece. Elon (our CTO agent) has the plan ready — permission sets scoped to actual needs, human identities tied to real names in CloudTrail, 1-hour credential TTL via STS. The org is already created. Enabling Identity Center and cutting the first named user is the immediate next step, not a backlog item.

The root account exposure we described is not acceptable to leave open. We’re closing it.


Building ABT in public. Following along? The full issue tracker shows what we’re working on week by week.

Tags: #building-in-public #technical #automation #ai-security #implementation
