sre.space

SRE · AIOps

Learning. Experimenting. Sharing reliable stories.

Engineering reliability into complex systems


Autonomous Tactical Squad

The Agent Roster

A coordinated system of specialized agents working together to ensure reliability

Scout Agent— Observability

Continuous monitoring of service health and business SLIs. Scout filters irrelevant metrics to identify true signals of degradation across the fleet.

Analogy // The Fire Lookout. Identifies the first plume of smoke amidst thousands of signals before the forest burns.
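
The filtering idea behind Scout can be sketched in a few lines. This is a minimal illustration, not the production agent: the `Sample` type, the metric names, and the threshold values are all hypothetical, and a real deployment would more likely use burn rates over a time window than a single static limit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    service: str
    metric: str
    value: float


# Hypothetical SLI thresholds: metric name -> maximum acceptable value.
SLI_THRESHOLDS = {
    "error_rate": 0.01,       # at most 1% failed requests
    "latency_p99_ms": 500.0,  # at most 500 ms at the 99th percentile
}


def find_signals(samples):
    """Return only the samples that track an SLI and breach its threshold."""
    signals = []
    for sample in samples:
        limit = SLI_THRESHOLDS.get(sample.metric)
        if limit is None:
            continue  # not an SLI, so this sketch treats it as noise
        if sample.value > limit:
            signals.append(sample)
    return signals


samples = [
    Sample("checkout", "cpu_percent", 91.0),      # noisy, but not an SLI
    Sample("checkout", "error_rate", 0.04),       # SLI breach
    Sample("search", "latency_p99_ms", 220.0),    # SLI within limits
]
signals = find_signals(samples)
print(signals)
```

Only the checkout error-rate sample survives: the high CPU reading is ignored because it is not tied to an SLI in this sketch.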

CAG Agent— Instant Cache

Cache-Augmented Generation. Checks the tactical knowledge cache for known failure patterns. If a match is found, bypasses heavy reasoning for sub-second recovery.

Analogy // Muscle Memory. Pulling your hand away from a hot stove before your brain even processes the heat.
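
A cache lookup of this kind might look like the sketch below. It assumes a failure pattern can be identified by a service name plus an error type; the `TacticalCache` class, the signature fields, and the playbook names are hypothetical and stand in for whatever the real knowledge cache stores.

```python
import hashlib


def signature(service, error_type):
    """Build a stable cache key from the fields that identify a failure pattern."""
    raw = f"{service}|{error_type}".lower()
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class TacticalCache:
    """In-memory map from failure signature to a known remediation playbook."""

    def __init__(self):
        self._playbooks = {}

    def remember(self, service, error_type, playbook):
        self._playbooks[signature(service, error_type)] = playbook

    def lookup(self, service, error_type):
        """Return the cached playbook, or None on a cache miss."""
        return self._playbooks.get(signature(service, error_type))


cache = TacticalCache()
cache.remember("checkout", "ConnectionPoolExhausted", "restart-connection-pool")

hit = cache.lookup("Checkout", "connectionpoolexhausted")  # case-insensitive match
miss = cache.lookup("checkout", "DiskFull")                # falls through to reasoning
print(hit, miss)
```

A hit returns a playbook immediately; a miss returns `None`, which is the point where heavier reasoning would take over.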

Brain Agent— Reasoning

The reasoning layer powered by LangGraph. It correlates Jaeger traces and OpenTelemetry signals to perform multi-step cognitive evaluation before requesting any high-impact fix.

Analogy // The Detective. Examines fingerprints (OTel traces) and checks the alibi (logs) to identify the true culprit.
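
One simple form of trace correlation is to look for the deepest failing span in a trace, since errors tend to propagate upward to callers. The sketch below uses plain Python rather than LangGraph or a tracing SDK; the `Span` fields and the heuristic itself are assumptions chosen for illustration, not the Brain Agent's actual logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    error: bool


def likely_culprits(spans):
    """Return failing spans that have no failing child span in the same trace."""
    failing = [s for s in spans if s.error]
    # Every (trace, span) pair that is the parent of some failing span.
    parents_of_failures = {
        (s.trace_id, s.parent_id) for s in failing if s.parent_id is not None
    }
    return [s for s in failing if (s.trace_id, s.span_id) not in parents_of_failures]


spans = [
    Span("t1", "a", None, "gateway", True),      # fails because checkout fails
    Span("t1", "b", "a", "checkout", True),      # fails because payments-db fails
    Span("t1", "c", "b", "payments-db", True),   # deepest failure
    Span("t1", "d", "a", "search", False),       # healthy sibling
]
culprits = likely_culprits(spans)
print([s.service for s in culprits])
```

Here the gateway and checkout errors are treated as symptoms, and `payments-db` is surfaced as the span worth investigating first.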

Guardrail Agent— Safety

Enforces safety policies and architectural constraints. Guardrail evaluates every proposed action against the blast-radius policy to prevent cascading failures.

Analogy // The Brake System. Ensuring the car doesn't go off the cliff, no matter how hard the driver hits the gas.
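
A blast-radius check can be expressed as a small pure function. The sketch below is illustrative only: the `ProposedAction` fields, the 25% limit, and the protected target list are hypothetical values, not the policy Guardrail actually enforces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedAction:
    kind: str               # e.g. "restart", "scale", "rollback"
    target: str
    affected_replicas: int
    total_replicas: int


# Hypothetical policy values, chosen for illustration.
MAX_BLAST_RADIUS = 0.25
PROTECTED_TARGETS = {"production-ingress"}


def evaluate(action):
    """Return (permitted, reason) for a proposed action."""
    if action.total_replicas <= 0:
        return False, "unknown fleet size"
    if action.target in PROTECTED_TARGETS and action.kind == "restart":
        return False, "restart of protected target"
    radius = action.affected_replicas / action.total_replicas
    if radius > MAX_BLAST_RADIUS:
        return False, f"blast radius {radius:.0%} exceeds {MAX_BLAST_RADIUS:.0%}"
    return True, "within policy"


print(evaluate(ProposedAction("restart", "checkout", 1, 8)))
print(evaluate(ProposedAction("restart", "checkout", 4, 8)))
print(evaluate(ProposedAction("restart", "production-ingress", 1, 8)))
```

Keeping the check free of side effects makes every verdict easy to log and replay during an audit.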

Fixer Agent— Remediation

Executes remediation only after multi-step decision loops. Uses GitOps via GitHub MCP to ensure all actions are auditable, version-controlled, and reversible.

Analogy // Controlled Demolition. Precise, approved, zero collateral damage.

Curator Agent— Memory

The Improved Memory Agent. Manages the RAG knowledge base by evaluating, indexing, and merging triage logs into ChromaDB for future retrieval.

Analogy // The Librarian. Archiving every troubleshooting journey so the system never repeats the same mistake.

Tier-3 Architectural Authority

Jules — The Architect

Google Jules operates as the final architectural authority, triggered only for chronic or systemic failures. Jules performs deep code refactoring, introducing circuit breakers, safety patterns, and structural fixes that address root causes rather than just symptoms.

System Sync

Cognitive Loop.

Scout -> CAG -> Brain -> Guardrail -> Fixer -> Curator

Orchestrated Reliability

When Scout detects an anomaly, it is first routed to CAG for an instant match against known failure patterns. If the cache misses, the Brain (LangGraph) initiates a multi-step triage cycle, correlating traces with historical data. Every proposed action then passes through Guardrail for safety validation before Fixer implements the change. Finally, Curator indexes the entire lifecycle to improve future triage.

PIPELINE /start/0x2FA
STEP 01: SCOUT_TRIGGER -> CAPTURING_TRACE
STEP 02: CAG_CHECK -> MISS
STEP 03: BRAIN_REASON -> ANALYZING_LANGGRAPH
STEP 04: GUARDRAIL_VERIFY -> PERMITTED
STEP 05: FIXER_APPLY -> GITOPS_PR_CREATED
STEP 06: CURATOR_INDEX -> MEMORY_OPTIMIZED
PIPELINE /end/SUCCESS
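
The control flow of the loop can be sketched with plain callables standing in for each agent. This is a simplified illustration, not the LangGraph implementation: `run_pipeline` and its parameters are hypothetical names, and two behaviours are assumptions rather than stated facts, namely that cached plans still pass through Guardrail and that blocked attempts are still indexed by Curator.

```python
def run_pipeline(anomaly, cache, reason, guardrail, apply_fix, index):
    """Walk one anomaly through CAG -> Brain -> Guardrail -> Fixer -> Curator."""
    steps = ["SCOUT_TRIGGER"]

    plan = cache.get(anomaly)
    if plan is not None:
        steps.append("CAG_CHECK:HIT")
    else:
        steps.append("CAG_CHECK:MISS")
        plan = reason(anomaly)  # heavy reasoning runs only on a cache miss
        steps.append("BRAIN_REASON")

    # Assumption: every plan is validated, whether cached or freshly reasoned.
    if not guardrail(plan):
        steps.append("GUARDRAIL_VERIFY:BLOCKED")
        index(anomaly, plan, steps)  # assumption: blocked attempts are recorded too
        return steps

    steps.append("GUARDRAIL_VERIFY:PERMITTED")
    apply_fix(plan)
    steps.append("FIXER_APPLY")
    index(anomaly, plan, steps)
    steps.append("CURATOR_INDEX")
    return steps


memory = []
steps = run_pipeline(
    anomaly="checkout:error_rate",
    cache={},                                        # empty cache forces a miss
    reason=lambda anomaly: "rollback-last-deploy",
    guardrail=lambda plan: plan != "restart-ingress",
    apply_fix=lambda plan: None,
    index=lambda anomaly, plan, trail: memory.append((anomaly, plan)),
)
print(steps)
```

Passing the agents in as callables keeps the ordering logic testable without any model, cache, or Git backend attached.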

Operational Evidence

Featured Work

Production systems and experiments in autonomous reliability engineering

GET /health/nominal HTTP/1.1
Host: sre.space
Tracing: Jaeger-0xFA2
Status: RESOLVED
RCA: TRACE_CORRELATION
Scout: DETECTED
Brain: REASONED
Fixer: PATCHED

Operational Excellence

Autonomous Issue Triage.

Built an AI-assisted system using LangChain to automate root cause analysis through trace correlation. The system uses multi-step evaluation to ensure safe action selection before resolution.

Operational Reality

Substantial reduction in manual intervention and engineer toil.

QUERY /chromadb/memory
Retrieving: Confluence_Runbook
Orchestrator: LangChain
Context: GROUNDED

Operational Intelligence

RAG-Grounded Issue Resolution.

Integrated Confluence runbooks into a RAG retrieval layer, orchestrated via LangChain. This provides agents with verified operational context, ensuring reasoned triage flows.

Operational Outcome

Safe issue resolution by grounding decisions in approved enterprise knowledge.

SCAN /submission/image
DETECTION: Synthetic_Check
ANALYSIS: Metadata_Consistency
STATUS: AUTHENTICATING

Operational Integrity

Visual Authenticity Validation.

Built an AI-assisted proof-of-concept to detect synthetic, manipulated, or inauthentic images in enterprise submissions—addressing a real operational challenge in insurance.

Focus Area

Building trust, integrity, and scale through explainable validation systems.

Institutional Knowledge

Reliable Stories

Real-world experiences and lessons learned from building reliable systems

CAG / Performance

The CAG Speedrun.

How caching tactical issue signatures reduced resolution time from minutes to 2 seconds for known recurring anomalies.

Speed Analysis

Guardrails / Safety

Safety Breach Prevented.

How Guardrail stepped in and blocked the request when the Brain Agent proposed a cascading restart of the production ingress.

Safety Audit

Curator / Memory

Curator's Deep Recall.

Solving a "ghost error" in the payment gateway by linking current traces to a similar event from a 2024 deployment archive.

Recall Depth

Case Study

Systemic AWS Management.

Regional failure analysis: why high availability is not a substitute for architectural discipline during cloud degradations.

Triage Analysis

Chaos Engineering

Fault Injection Lessons.

Validating structural resilience through controlled fault injection. Identifying dormant failure modes before they trigger customer-facing issues.

Read Experiment

Memory Architecture

RAG in SRE Systems.

Bridging real-time observability and long-term memory to reduce triage errors during critical system events.

Deep Architecture

Control Systems

Autonomous Guardrails.

Why agents must never have full sovereignty. Balancing self-healing with strictly defined blast-radius control.

Safety Strategy

Risk Management

Automation Risks.

Managing the risks of high-velocity automation. Why intent-based reasoning must precede triage actions to prevent failure loops.

Risk Analysis

Decision Architecture

Signals to Decisions.

Deconstructing the operational loop from signal detection to reasoned resolution. Aligning technical observability with user health.

Flow Depth

Integrity

Visual Authenticity.

An integrity verification POC addressing the real-world risk of inauthentic submissions. Protecting automated triage logic through trust validation.

Integrity Depth

User Experience

Beyond The Dashboard.

Shifting from infrastructure metrics to critical user journey (CUJ) health. Why your CPU graph is lying about user satisfaction.

UX Depth
