An SRE Playbook: Diagnosing Intermittent Failures
Most production issues described as “flaky” are not truly random. They are systems crossing a boundary that has not yet been measured.
Before investigating, ask a more precise question:
Is this truly intermittent, or deterministic under conditions we have not yet identified?
When something fails “sometimes”, a boundary is usually involved. Systems change behavior when limits are crossed. The task is to identify the constraint.
Working Model
Treat every intermittent failure as a threshold event.
Typical boundaries:
- CPU scheduling
- Memory limits
- File descriptor ceilings
- I/O throughput caps
- cgroup enforcement
- Network variability
- Execution-order differences
Do not start only with logs. Start with limits, then use logs to confirm where the boundary was crossed.
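A minimal first pass, assuming a Linux host: snapshot the limits in force before opening any logs.
ulimit -a                 # shell-imposed limits
cat /proc/self/limits     # the kernel's view of the current process
cat /sys/fs/cgroup/cpu.max /sys/fs/cgroup/memory.max 2>/dev/null   # cgroup v2 limits, if present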
Failure Categories
Hard Deterministic
Fails every time.
Common causes:
- Bad configuration
- Missing shared object
- Schema mismatch
- Version drift
Basic inspection:
ldd ./binary                          # replace ./binary with the failing executable; confirm shared objects resolve
strace -f ./binary                    # trace syscalls, following forks, up to the failure
journalctl -xe                        # recent journal entries with explanatory context
diff <(env | sort) <(reference_env)   # reference_env: any command that prints a known-good environment
Timing Sensitivity
Fails depending on execution order or runtime state.
Common triggers:
- Parallel builds
- Thread scheduling
- Race conditions
- Cold vs warm cache
Remove parallelism:
MAKEFLAGS="-j1" make   # or simply: make -j1
Probe the stack boundary:
ulimit -s 4096   # stack limit in KiB; 4096 = 4 MiB
If failure frequency changes, timing or stack depth is involved.
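One way to probe both at once, sketched with run_tests standing in for the failing command: sweep the stack limit downward in subshells, so the change does not leak into the parent shell.
for s in 8192 4096 2048 1024; do
  ( ulimit -s "$s"; run_tests ) && echo "ok at ${s} KiB" || echo "failed at ${s} KiB"
done
A failure that appears at a consistent limit points to stack depth; one that appears erratically points to timing.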
Resource Pressure
Appears under load or after long runtime.
Common causes:
- Memory pressure
- Heap fragmentation
- File descriptor exhaustion
- CPU contention
Baseline checks:
ulimit -n                # fd ceiling for the current shell
lsof -p "$PID" | wc -l   # open files in the suspect process; bare lsof counts system-wide
free -m                  # memory overview in MiB
vmstat 1                 # paging, run queue, and I/O wait over time
top -H                   # per-thread CPU view
In containers:
cat /sys/fs/cgroup/memory.max   # cgroup v2; on v1, check memory/memory.limit_in_bytes
cat /proc/self/limits
Always verify cgroup limits separately from host capacity.
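Exhaustion builds over time, so sample it rather than checking once. A minimal sketch, assuming PID holds the suspect process id; VmHWM is the kernel's record of the process's peak memory use.
PID=12345   # the suspect process
while sleep 10; do
  printf '%s fds=%s hwm=%s\n' "$(date -Is)" \
    "$(ls "/proc/$PID/fd" | wc -l)" \
    "$(awk '/VmHWM/{print $2}' "/proc/$PID/status")"
done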
Dependency Instability
Fails when something external becomes slow, unavailable, or inconsistent.
Common causes:
- DNS resolution failure
- Upstream timeout
- Rate limiting
- TLS handshake issues
- Stale service discovery
- Partial network partition
Basic checks:
dig service.internal
curl -v https://dependency.example
ss -tan
If only one dependency path is unstable, the problem is not random. It is conditional.
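To demonstrate that, probe each dependency path on a timer and keep the timings. A sketch using the endpoints above:
while sleep 30; do
  printf '%s dns=%s connect=%ss\n' "$(date -Is)" \
    "$(dig +short service.internal | head -1)" \
    "$(curl -s -o /dev/null -w '%{time_connect}' https://dependency.example)"
done
A failure that correlates with one path's timings is conditional, not random.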
Platform-Imposed Constraints
CPU Steal Time
mpstat 1
If steal time rises during degradation, hypervisor contention is likely.
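Throttling by the container runtime looks similar from inside the workload but has a different fix. On a cgroup v2 host, the enforcement counters are readable directly:
cat /sys/fs/cgroup/cpu.stat   # rising nr_throttled and throttled_usec mean quota enforcement
cat /sys/fs/cgroup/cpu.max    # the quota itself; "max" means unlimited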
Burstable CPU Behavior
Check provider metrics:
- CPU credit balance
- Baseline vs burst usage
Sustained usage beyond the baseline drains credits; once they are exhausted, the instance is throttled to its baseline and performance degrades.
OOM Events
dmesg -T | grep -Ei 'killed process|out of memory'
journalctl -k | grep -Ei 'oom|killed process'
Kernel OOM kills indicate memory boundary violations.
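Inside a container the kernel log may not be visible, but the cgroup keeps its own counters. On cgroup v2:
cat /sys/fs/cgroup/memory.events             # oom and oom_kill increment on each event
cat /sys/fs/cgroup/memory.peak 2>/dev/null   # memory high-water mark, on kernels that expose it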
Disk Throughput Caps
df -h         # rule out a full filesystem first; df shows capacity, not throughput
iostat -x 1   # watch %util, await, and queue depth per device
Cloud storage often enforces burst limits or throughput caps.
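To separate application slowness from a storage cap, measure the device directly. A sketch using fio, assuming it is installed and /data is the volume under suspicion:
fio --name=probe --filename=/data/fio.probe --rw=read --bs=1M \
    --size=1g --direct=1 --runtime=60 --time_based
Throughput that starts high and collapses mid-run points to a burst allowance being exhausted. Remove /data/fio.probe afterwards.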
Node Comparison
When only some nodes fail, compare:
- Kernel version
- Instance type
- CPU architecture
- cgroup limits
- Swap configuration
uname -a
cat /proc/cpuinfo
ulimit -a
cat /proc/self/limits
Assume nodes differ until proven identical.
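A sketch of that comparison, assuming ssh access and hypothetical host names good-node and bad-node:
fingerprint() { ssh "$1" 'uname -a; nproc; ulimit -a; cat /proc/self/limits'; }
diff <(fingerprint good-node) <(fingerprint bad-node)
Any line in the diff is a candidate hidden variable.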
What to Measure
Make the boundary visible in metrics, not just symptoms.
Useful signals:
- Latency percentile shifts
- Error rate by node or instance type
- Memory high-water mark
- Open file descriptors over time
- Queue depth
- Retry count
- CPU steal or throttling
- OOM or cgroup events
A graph that moves with the failure is usually more useful than another page of logs.
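One way to get that graph, with run_command standing in for the workload: record one row per run pairing the outcome with boundary signals, then plot the columns.
run_command; rc=$?
echo "$(date -Is),$rc,$(awk '/MemAvailable/{print $2}' /proc/meminfo),$(cut -d' ' -f1 /proc/loadavg)" >> runs.csv
Columns: timestamp, exit code, available memory in kB, one-minute load average.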
Increasing Reproducibility
Make the boundary visible:
- Lower stack limits
- Constrain memory
- Increase concurrency
- Loop execution paths
for i in {1..100}; do run_command || { echo "failed on iteration $i"; break; }; done
Prefer reproducing in a controlled environment before increasing pressure in production.
Reproducibility is more valuable than log volume.
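Constraining memory can be done per run without touching system settings; a sketch, with the 512 MiB cap being an assumption to tune:
( ulimit -v $((512 * 1024)); run_command )   # cap address space at 512 MiB for this run only
The subshell keeps the limit from affecting anything else in the session.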
Common Threshold Events
Most intermittent failures reduce to:
- Stack exceeded
- Memory limit reached
- File descriptors exhausted
- CPU credits depleted
- I/O throttled
- Thread contention
- Dependency timeout
- Network instability
Identify the exact limit. Confirm it under controlled conditions. Adjust capacity or design accordingly.
Operating Principles
- Random usually means unmeasured.
- A restart resets state. It does not explain the failure.
- Cloud infrastructure introduces hidden enforcement layers.
- Stability improves as hidden variables are removed.
An intermittent failure becomes solvable the moment you can name the limit it crosses.