sre.space

SRE · AIOps

Learning. Experimenting. Sharing reliable stories.

Engineering reliability into complex systems
Autonomous Tactical Squad

The Agent Roster

A coordinated system of specialized agents working together to ensure reliability

Scout Agent: Observability

Continuous monitoring of service health and business SLIs. Scout filters irrelevant metrics to identify true signals of degradation across the fleet.

Analogy // The Fire Lookout: Identifies the first plume of smoke amidst thousands of signals before the forest burns.

CAG Agent: Instant Cache

Cache-Augmented Generation. Checks the tactical knowledge cache for known failure patterns. If a match is found, bypasses heavy reasoning for sub-second recovery.

Analogy // Muscle Memory: Pulling your hand away from a hot stove before your brain even processes the heat.
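
A minimal sketch of the cache-first lookup described above, assuming a failure pattern can be identified by a few anomaly fields. The `Anomaly` fields, the `TacticalCache` class, and the playbook names are hypothetical and only illustrate the idea. The site does not describe how the real cache is keyed or stored.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Anomaly:
    # Hypothetical fields; a real signature would likely use richer context.
    service: str
    error_type: str
    endpoint: str


def signature(anomaly: Anomaly) -> str:
    """Build a stable key from the fields that identify a failure pattern."""
    raw = f"{anomaly.service}|{anomaly.error_type}|{anomaly.endpoint}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class TacticalCache:
    """In-memory stand-in for the tactical knowledge cache (assumption)."""

    def __init__(self) -> None:
        self._playbooks: dict[str, str] = {}

    def remember(self, anomaly: Anomaly, playbook: str) -> None:
        self._playbooks[signature(anomaly)] = playbook

    def lookup(self, anomaly: Anomaly) -> Optional[str]:
        # A hit returns the stored playbook; a miss returns None so the
        # caller can fall back to the slower reasoning path.
        return self._playbooks.get(signature(anomaly))


cache = TacticalCache()
known = Anomaly("checkout", "ConnectionPoolExhausted", "/pay")
cache.remember(known, "restart-connection-pool")

hit = cache.lookup(known)
miss = cache.lookup(Anomaly("checkout", "Timeout", "/pay"))
```

Here `hit` holds the stored playbook name and `miss` is `None`, which is the signal to hand the anomaly to the reasoning layer.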

Brain Agent: Reasoning

The reasoning layer powered by LangGraph. It correlates Jaeger traces and OpenTelemetry signals to perform multi-step cognitive evaluation before requesting any high-impact fix.

Analogy // The Detective: Examines fingerprints (OTel traces) and checks the alibi (logs) to identify the true culprit.
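
One simple form of trace correlation can be sketched in plain Python. This is an illustrative heuristic, not the LangGraph workflow itself: given the spans of a single trace, it picks an errored span none of whose children errored, on the assumption that the deepest failure is the likeliest origin. The `Span` shape is a simplified, hypothetical stand-in for real Jaeger or OpenTelemetry span data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    # Simplified, hypothetical span record for illustration only.
    span_id: str
    parent_id: Optional[str]
    service: str
    error: bool


def probable_origin(spans: list[Span]) -> Optional[Span]:
    """Return the first errored span that has no errored child."""
    # Parents of errored spans are likely just propagating a failure upward.
    errored_parents = {s.parent_id for s in spans if s.error and s.parent_id}
    for span in spans:
        if span.error and span.span_id not in errored_parents:
            return span
    return None


trace = [
    Span("a", None, "gateway", error=True),
    Span("b", "a", "checkout", error=True),
    Span("c", "b", "payments-db", error=True),
    Span("d", "a", "catalog", error=False),
]
origin = probable_origin(trace)
```

In this example the gateway and checkout spans only relay the failure, so `origin.service` is `"payments-db"`. A real reasoning step would weigh this against logs and historical data before proposing a fix.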

Guardrail Agent: Safety

Enforces safety policies and architectural constraints. Guardrail evaluates every proposed action against the blast-radius policy to prevent cascading failures.

Analogy // The Brake System: Ensuring the car doesn't go off the cliff, no matter how hard the driver hits the gas.
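
A blast-radius check of this kind might look like the sketch below. The policy fields (`max_affected_fraction`, `protected_targets`) and the action shape are assumptions made for illustration. The site does not publish the actual policy format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedAction:
    kind: str               # e.g. "restart", "rollback", "scale"
    target: str             # service or component the action touches
    affected_replicas: int
    total_replicas: int


@dataclass(frozen=True)
class BlastRadiusPolicy:
    # Hypothetical policy shape: a replica-fraction cap plus a deny list.
    max_affected_fraction: float
    protected_targets: frozenset[str]


def evaluate(action: ProposedAction, policy: BlastRadiusPolicy) -> tuple[bool, str]:
    """Return (permitted, reason) for a proposed action."""
    if action.target in policy.protected_targets:
        return False, f"{action.target} is a protected target"
    if action.total_replicas <= 0:
        return False, "unknown replica count"
    fraction = action.affected_replicas / action.total_replicas
    if fraction > policy.max_affected_fraction:
        return False, f"touches {fraction:.0%} of replicas"
    return True, "within blast-radius policy"


policy = BlastRadiusPolicy(
    max_affected_fraction=0.25,
    protected_targets=frozenset({"production-ingress"}),
)
blocked = evaluate(ProposedAction("restart", "production-ingress", 10, 10), policy)
allowed = evaluate(ProposedAction("restart", "checkout", 1, 8), policy)
too_wide = evaluate(ProposedAction("restart", "checkout", 4, 8), policy)
```

Returning a reason alongside the verdict keeps every rejection explainable, which matters when the decision is later archived for review.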

Fixer Agent: Remediation

Executes remediation only after multi-step decision loops. Uses GitOps via GitHub MCP to ensure all actions are auditable, version-controlled, and reversible.

Analogy // Controlled Demolition: Precise, approved, zero collateral damage.

Curator Agent: Memory

The Improved Memory Agent. Manages the RAG knowledge base by evaluating, indexing, and merging triage logs into ChromaDB for future retrieval.

Analogy // The Librarian: Archiving every troubleshooting journey so the system never repeats the same mistake.

Tier-3 Architectural Authority

Jules — The Architect

Google Jules operates as the final architectural authority, triggered only for chronic or systemic failures. Jules performs deep code refactoring, introducing circuit breakers, safety patterns, and structural fixes that address root causes rather than just symptoms.
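
For readers unfamiliar with the circuit breaker pattern mentioned above, here is a minimal generic sketch. It is not code produced by Jules. The class name, thresholds, and injectable clock are illustrative assumptions.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit is open; failing fast")
            # Cool-down elapsed: allow one trial call (half-open state).
            self._opened_at = None
            self._failures = self.failure_threshold - 1
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        return result


now = [0.0]  # fake clock so the example needs no real waiting
breaker = CircuitBreaker(failure_threshold=2, reset_after=10.0,
                         clock=lambda: now[0])


def failing() -> str:
    raise ValueError("downstream unavailable")


outcomes = []
for _ in range(3):
    try:
        breaker.call(failing)
    except CircuitOpenError:
        outcomes.append("rejected")
    except ValueError:
        outcomes.append("failed")

now[0] = 11.0  # cool-down has elapsed
outcomes.append(breaker.call(lambda: "ok"))
```

After two failures the third call is rejected without touching the downstream dependency, and once the cool-down passes a single successful trial call closes the circuit again.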

System Sync

Cognitive Loop.

Scout -> CAG -> Brain -> Guardrail -> Fixer -> Curator

Orchestrated Reliability

Whenever an anomaly is detected by Scout, it is first routed to CAG for an instant troubleshooting match. If the cache misses, the Brain (LangGraph) initiates a multi-step triage cycle, correlating traces with historical data. Every decision is then passed through the Guardrail for safety validation before Fixer implements the change. Finally, Curator indexes the entire lifecycle to improve future triage.

PIPELINE /start/0x2FA
STEP 01: SCOUT_TRIGGER -> CAPTURING_TRACE
STEP 02: CAG_CHECK -> MISS
STEP 03: BRAIN_REASON -> ANALYZING_LANGGRAPH
STEP 04: GUARDRAIL_VERIFY -> PERMITTED
STEP 05: FIXER_APPLY -> GITOPS_PR_CREATED
STEP 06: CURATOR_INDEX -> MEMORY_OPTIMIZED
PIPELINE /end/SUCCESS
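
The orchestration flow above can be sketched as a single function that wires the agents together. Each agent is passed in as a plain callable, and all parameter names are hypothetical. The sketch also assumes that cache hits still pass through Guardrail, which the description implies ("every decision") but does not state outright.

```python
from typing import Callable, Optional


def run_pipeline(
    anomaly: str,
    cag_lookup: Callable[[str], Optional[str]],
    brain_reason: Callable[[str], str],
    guardrail_permits: Callable[[str], bool],
    fixer_apply: Callable[[str], None],
    curator_index: Callable[[str, list[str]], None],
) -> list[str]:
    """Run one triage cycle and return the audit trail of steps taken."""
    steps = ["SCOUT_TRIGGER"]

    # Try the cache first; only fall back to reasoning on a miss.
    plan = cag_lookup(anomaly)
    if plan is not None:
        steps.append("CAG_CHECK -> HIT")
    else:
        steps.append("CAG_CHECK -> MISS")
        plan = brain_reason(anomaly)
        steps.append("BRAIN_REASON")

    # Nothing is applied unless the guardrail permits it.
    if guardrail_permits(plan):
        steps.append("GUARDRAIL_VERIFY -> PERMITTED")
        fixer_apply(plan)
        steps.append("FIXER_APPLY")
    else:
        steps.append("GUARDRAIL_VERIFY -> BLOCKED")

    # The lifecycle is archived whether or not a fix was applied.
    curator_index(anomaly, list(steps))
    steps.append("CURATOR_INDEX")
    return steps


applied: list[str] = []
indexed: list[tuple[str, list[str]]] = []
trail = run_pipeline(
    "checkout-latency-spike",
    cag_lookup=lambda a: None,
    brain_reason=lambda a: "rollback checkout to previous release",
    guardrail_permits=lambda plan: "ingress" not in plan,
    fixer_apply=applied.append,
    curator_index=lambda a, s: indexed.append((a, s)),
)
```

Passing the agents in as callables keeps the control flow testable without any real tracing backend, LLM, or Git provider attached.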

Operational Evidence

Featured Work

Production systems and experiments in autonomous reliability engineering

GET /health/nominal HTTP/1.1
Host: sre.space
Tracing: Jaeger-0xFA2
Status: RESOLVED
RCA: TRACE_CORRELATION
Scout: DETECTED
Brain: REASONED
Fixer: PATCHED

Operational Excellence

Autonomous Issue Triage.

Built an AI-assisted system using LangChain to automate root cause analysis through trace correlation. The system uses multi-step evaluation to ensure safe action selection before resolution.

Operational Reality

Substantial reduction in manual intervention and engineer toil.

QUERY /chromadb/memory
Retrieving: Confluence_Runbook
Orchestrator: LangChain
Context: GROUNDED

Operational Intelligence

RAG-Grounded Issue Resolution.

Integrated Confluence runbooks into a RAG retrieval layer, orchestrated via LangChain. This gives agents verified operational context, so each triage step is grounded in an approved runbook.

Operational Outcome

Safe issue resolution by grounding decisions in approved enterprise knowledge.

SCAN /submission/image
DETECTION: Synthetic_Check
ANALYSIS: Metadata_Consistency
STATUS: AUTHENTICATING

Operational Integrity

Visual Authenticity Validation.

Built an AI-assisted proof-of-concept to detect synthetic, manipulated, or inauthentic images in enterprise submissions—addressing a real operational challenge in insurance.

Focus Area

Building trust, integrity, and scale through explainable validation systems.

Institutional Knowledge

Reliable Stories

Real-world experiences and lessons learned from building reliable systems

CAG / Performance
The CAG Speedrun.

How caching tactical issue signatures reduced resolution time from minutes to 2 seconds for known recurring anomalies.

Speed Analysis

Guardrails / Safety
Safety Breach Prevented.

How Guardrail stepped in and blocked the request when the Brain Agent proposed a cascading restart of the production ingress.

Safety Audit

Curator / Memory
Curator's Deep Recall.

Solving a "ghost error" in the payment gateway by linking current traces to a similar event from a 2024 deployment archive.

Recall Depth

Case Study
Systemic AWS Management.

Regional failure analysis: Why high availability is not a substitute for architectural discipline during cloud degradations.

Triage Analysis

Chaos Engineering
Fault Injection Lessons.

Validating structural resilience through controlled fault injection. Identifying dormant failure modes before they trigger customer-facing issues.

Read Experiment

Memory Architecture
RAG in SRE Systems.

Bridging real-time observability and long-term memory to reduce triage errors during critical system events.

Deep Architecture

Control Systems
Autonomous Guardrails.

Why agents must never have full sovereignty. Balancing self-healing with strictly defined blast-radius control.

Safety Strategy

Risk Management
Automation Risks.

Managing the risks of high-velocity automation. Why intent-based reasoning must precede triage actions to prevent failure loops.

Risk Analysis

Decision Architecture
Signals to Decisions.

Deconstructing the operational loop from signal detection to reasoned resolution. Aligning technical observability with user health.

Flow Depth

Integrity
Visual Authenticity.

An integrity verification POC addressing the real-world risk of inauthentic submissions. Protecting automated triage logic through trust validation.

Integrity Depth

User Experience
Beyond The Dashboard.

Shifting from infrastructure metrics to CUJ health. Why your CPU graph is lying about user satisfaction.

UX Depth