The Sovereignty Loop vs. the Kill-Switch
In the pursuit of "No-Ops," there is a dangerous temptation to grant AI agents full access to production. This is a mistake. An autonomous agent is only as reliable as its guardrails. Without limits, a small error in reasoning can lead to a mass container deletion or a global configuration rollback.
At SRE-Space, our **Fixer Agent** operates under a set of rigid boundaries. It cannot act if the **Brain Agent's** confidence falls below the safety threshold, and it cannot touch more than a bounded fraction of the fleet at once (its blast radius).
Think of a smart thermostat controlling a furnace:

1. **The Logic**: The thermostat decides what temperature to set (the Autonomous Agent).
2. **The Guardrail**: A hard-coded limit prevents setting the temperature above the safety limit (the Decision Boundary).
3. **The Fuse**: A physical thermal fuse melts and cuts power if the system overheats (the Hard Constraint).
Guardrails are the physical reality that prevents autonomous logic from burning the "house" down.
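The same layering can be sketched in code. This is a minimal illustration of a soft decision boundary plus an irreversible hard constraint; all names (`GuardedThermostat`, `maxSetpoint`, `fuseTemp`) are hypothetical and not part of any SRE-Space API.

```typescript
// Layered guardrails: a soft decision boundary plus a hard fuse.
// All identifiers here are illustrative, not a real thermostat API.

type Result = "applied" | "rejected" | "fuse-tripped";

class GuardedThermostat {
  private fuseBlown = false; // the Hard Constraint: once tripped, stays off

  constructor(
    private readonly maxSetpoint: number, // the Decision Boundary
    private readonly fuseTemp: number     // temperature at which the fuse melts
  ) {}

  requestSetpoint(target: number, currentTemp: number): Result {
    if (this.fuseBlown || currentTemp >= this.fuseTemp) {
      this.fuseBlown = true; // irreversible, like a melted thermal fuse
      return "fuse-tripped";
    }
    if (target > this.maxSetpoint) {
      return "rejected"; // the logic may ask, but cannot exceed the limit
    }
    return "applied";
  }
}
```

Note that the fuse check runs before anything else and latches permanently: the autonomous logic has no code path that can reset it, which is the whole point of a hard constraint.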
Human-in-the-Loop: When to Hand Over
For high-risk actions—like database schema changes or primary region failovers—we require a human-in-the-loop. The **Brain Agent** presents the evidence, the proposed fix, and the risk assessment to an on-call engineer, who provides the final "GO" signal.
```
// Confidence & boundary check
if (brain.confidence < safety_threshold || action.impact_radius > max_blast_radius) {
  await_human_approval(action_plan);
} else {
  fixer.execute(action_plan);
}
```
Blast-Radius Control
Every remediation is executed as a "canary fix." The **Fixer Agent** applies the change to a single node, waits for **Scout** to confirm that the SLIs have normalized, and only then proceeds to the rest of the cluster. This ensures that even a "bad" fix is contained before it becomes a customer-facing catastrophe.
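The canary sequence above can be sketched as follows. `applyFix` and `sliHealthy` are hypothetical stand-ins for the **Fixer** and **Scout** interfaces, which the post does not specify.

```typescript
// Sketch of a canary fix: remediate one node, verify SLIs, then roll out.
// applyFix and sliHealthy are illustrative stand-ins, not real agent APIs.

async function canaryFix(
  nodes: string[],
  applyFix: (node: string) => Promise<void>,
  sliHealthy: () => Promise<boolean>
): Promise<string[]> {
  const remediated: string[] = [];
  if (nodes.length === 0) return remediated;

  // Step 1: apply the fix to a single canary node.
  const [canary, ...rest] = nodes;
  await applyFix(canary);
  remediated.push(canary);

  // Step 2: let Scout confirm SLIs have normalized before touching the fleet.
  if (!(await sliHealthy())) {
    return remediated; // contained: only the canary saw the bad fix
  }

  // Step 3: proceed to the rest of the cluster.
  for (const node of rest) {
    await applyFix(node);
    remediated.push(node);
  }
  return remediated;
}
```

A halted rollout returns only the canary node, so the worst case for a bad fix is a single degraded node rather than a fleet-wide incident.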