When Automation Fails

The "Auto-Fail" Scenario

The most dangerous tool in an SRE's arsenal is a script that works in standard scenarios but fails catastrophically in edge cases it wasn't designed for. Recently, a major CDN experienced a global outage not because of a hacker or a hardware failure, but because an automated "cleanup" script misidentified active configuration as stale and deleted it across the entire network in moments.

This is why we emphasize that **Reasoning > Remediation**. Automation without reasoning is just a faster way to break things.

Analogy // The Overzealous Gardener

1. **The Mower**: A smart mower programmed to "shred anything not grass" (the blind script).
2. **The Roses**: Prize-winning flowers that are mistakenly identified as weeds (the critical system migration).
3. **The Gardener**: A human who stops the mower because they understand the intent of the garden (the Brain Agent).

Reasoning ensures we don't "prune" healthy, revenue-generating migrations during a perceived outage.

Why Blind Scripts Fail

Traditional automation follows an "If-This-Then-That" (IFTTT) pattern: *If CPU is saturated, then restart the container.* The rule sees only the symptom, not the cause. What if the CPU is saturated because you are running a critical migration? Restarting the container mid-migration could corrupt your database.

Our **Brain Agent** looks at the *intent* of the system. It asks: "Is the CPU high due to an anomaly, or due to a scheduled task?" Only if the answer is "Anomaly" does it trigger the **Fixer Agent**.
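
The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration, not the actual Brain Agent: the `ContainerState` type, the `active_scheduled_tasks` field, the 90% threshold, and the function names are all assumptions made for the example, and the real classification step would likely draw on far richer signals than a single field.

```python
from dataclasses import dataclass
from enum import Enum


class Cause(Enum):
    ANOMALY = "anomaly"
    SCHEDULED_TASK = "scheduled_task"


@dataclass
class ContainerState:
    name: str
    cpu_percent: float
    # Hypothetical field: scheduled tasks (e.g. migrations) currently running.
    active_scheduled_tasks: tuple = ()


CPU_THRESHOLD = 90.0  # assumed saturation threshold, in percent


def blind_rule(state):
    """IFTTT-style rule: restart whenever CPU is saturated."""
    return state.cpu_percent >= CPU_THRESHOLD


def classify_cause(state):
    """Stand-in for the Brain Agent's question: anomaly or scheduled task?"""
    if state.active_scheduled_tasks:
        return Cause.SCHEDULED_TASK
    return Cause.ANOMALY


def should_trigger_fixer(state):
    """Hand off to the Fixer Agent only when saturation is an anomaly."""
    if state.cpu_percent < CPU_THRESHOLD:
        return False
    return classify_cause(state) is Cause.ANOMALY


migrating = ContainerState("db-1", 97.0, ("schema_migration",))
runaway = ContainerState("api-3", 97.0)

print(blind_rule(migrating), should_trigger_fixer(migrating))  # True False
print(blind_rule(runaway), should_trigger_fixer(runaway))      # True True
```

Both containers look identical to the blind rule. Only the intent-aware check leaves the migrating database alone while still remediating the runaway container.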

Designing for Failure

At SRE-Space, our automation follows the "Principle of Least Sovereignty": every automated action must be reversible. Before a container is restarted, a snapshot of its state is taken. If a configuration is changed, the GitOps history allows for a one-click revert. We build systems that assume our own automation will eventually make a mistake.
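
As a rough sketch of the snapshot-before-action idea, the toy model below keeps container state in a plain dictionary and records a copy before each restart. The `ReversibleRestart` class, its method names, and the state fields are hypothetical and exist only for this example. A real system would snapshot through its container runtime or storage layer.

```python
import copy


class ReversibleRestart:
    """Hypothetical wrapper: snapshot state before acting so the action can be undone."""

    def __init__(self, containers):
        self.containers = containers  # maps container name -> state dict
        self.snapshots = []           # stack of (name, saved_state) pairs

    def restart(self, name):
        # Take the snapshot first, so a faulty restart can still be rolled back.
        self.snapshots.append((name, copy.deepcopy(self.containers[name])))
        self.containers[name] = {"status": "running", "uptime_s": 0, "sessions": []}

    def revert_last(self):
        # Restore the most recent snapshot and report which container it was.
        name, saved = self.snapshots.pop()
        self.containers[name] = saved
        return name


containers = {
    "api-3": {"status": "degraded", "uptime_s": 86400, "sessions": ["a", "b"]}
}
actions = ReversibleRestart(containers)

actions.restart("api-3")
print(containers["api-3"]["status"])  # running

actions.revert_last()
print(containers["api-3"]["status"])  # degraded
```

The deep copy matters here: a shallow copy would share the nested `sessions` list with the live state, so later mutations could silently alter the snapshot.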