Most production issues described as “flaky” are not truly random. They are systems crossing a boundary that has not been measured yet.

Intermittent failures are rarely random.

Before investigating, ask a more precise question:
Is this truly intermittent, or deterministic under conditions we have not yet identified?

When something fails “sometimes”, a boundary is usually involved.
Systems change behavior when limits are crossed. The task is to identify the constraint.


Working Model

Treat every intermittent failure as a threshold event.

Typical boundaries:

  • CPU scheduling
  • Memory limits
  • File descriptor ceilings
  • I/O throughput caps
  • cgroup enforcement
  • Network variability
  • Execution-order differences

Do not start only with logs. Start with limits, then use logs to confirm where the boundary was crossed.
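
A limits-first pass can be a single snapshot on Linux; a minimal sketch (the cgroup path assumes cgroup v2 and is absent on v1 hosts):

```shell
# Capture the limits in effect for the current process before log diving.
ulimit -a                                       # shell soft limits
cat /proc/self/limits                           # kernel-enforced soft and hard limits
nproc                                           # CPUs actually visible here
cat /sys/fs/cgroup/cpu.max 2>/dev/null || true  # cgroup v2 CPU quota, if any
```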


Failure Categories

Hard Deterministic

Fails every time.

Common causes:

  • Bad configuration
  • Missing shared object
  • Schema mismatch
  • Version drift

Basic inspection:

ldd ./binary
strace -f ./binary
journalctl -xe
diff <(env | sort) <(reference_env)
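
The environment diff can be made two-step: capture once on a known-good host, then compare on the failing one. A sketch, where the file name good.env is an assumption:

```shell
# Step 1, on the working host: record the sorted environment.
env | sort > good.env

# Step 2, on the failing host: any < or > lines are environment drift.
env | sort | diff good.env - || echo "environment drift found"
```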

Timing Sensitivity

Fails depending on execution order or runtime state.

Common triggers:

  • Parallel builds
  • Thread scheduling
  • Race conditions
  • Cold vs warm cache

Remove parallelism:

MAKEFLAGS="-j1" make

Probe the stack boundary:

ulimit -s 4096

If failure frequency changes, timing or stack depth is involved.
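
Frequency changes are easier to judge as counts than as impressions. A sketch that runs the same step a fixed number of times per setting, where run_build is a placeholder for the flaky command:

```shell
# Count failures over 20 runs at a given parallelism level.
count_failures() {
  fails=0
  for i in $(seq 1 20); do
    MAKEFLAGS="$1" run_build >/dev/null 2>&1 || fails=$((fails + 1))
  done
  echo "$fails"
}

echo "with -j1: $(count_failures -j1)/20 failed"
echo "with -j8: $(count_failures -j8)/20 failed"
```

A rate that drops at -j1 points at ordering or contention rather than the code path itself.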


Resource Pressure

Appears under load or after long runtime.

Common causes:

  • Memory pressure
  • Heap fragmentation
  • File descriptor exhaustion
  • CPU contention

Baseline checks:

ulimit -n
lsof | wc -l
free -m
vmstat 1
top -H

In containers:

cat /sys/fs/cgroup/memory.max
cat /proc/self/limits

Always verify cgroup limits separately from host capacity.
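
The memory-limit file moved between cgroup versions, so a check that handles both is worth sketching (paths are the standard kernel locations):

```shell
# cgroup v2 exposes memory.max; v1 used memory/memory.limit_in_bytes.
if [ -f /sys/fs/cgroup/memory.max ]; then
  cat /sys/fs/cgroup/memory.max
elif [ -f /sys/fs/cgroup/memory/memory.limit_in_bytes ]; then
  cat /sys/fs/cgroup/memory/memory.limit_in_bytes
else
  echo "no cgroup memory limit file visible"
fi
```

"max" (v2) or an enormous number (v1) means no effective limit at this level.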


Dependency Instability

Fails when something external becomes slow, unavailable, or inconsistent.

Common causes:

  • DNS resolution failure
  • Upstream timeout
  • Rate limiting
  • TLS handshake issues
  • Stale service discovery
  • Partial network partition

Basic checks:

dig service.internal
curl -v https://dependency.example
ss -tan

If only one dependency path is unstable, the problem is not random. It is conditional.
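
Conditional failures show up as patterns when a single path is probed on an interval. A sketch against a placeholder URL:

```shell
# Probe one dependency repeatedly, logging status code and latency.
# "dependency.example" is a placeholder for the suspect endpoint.
for i in $(seq 1 5); do
  result=$(curl -s -o /dev/null -w '%{http_code} %{time_total}s' \
           --max-time 2 https://dependency.example || echo 'FAIL')
  echo "$(date +%T) $result"
  sleep 1
done
```

Failures clustering at certain times, or latency stepping up before errors start, both name the condition.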


Platform-Imposed Constraints

CPU Steal Time

mpstat 1

If steal time rises during degradation, hypervisor contention is likely.
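
If mpstat is not installed, the same signal can be read from /proc/stat directly: field 9 of the aggregate cpu line is cumulative steal time in jiffies. A minimal two-sample sketch:

```shell
# Sample the steal counter twice, one second apart; a nonzero delta
# means the hypervisor withheld CPU during that second.
a=$(awk '/^cpu /{print $9}' /proc/stat)
sleep 1
b=$(awk '/^cpu /{print $9}' /proc/stat)
echo "steal jiffies over 1s: $((b - a))"
```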


Burstable CPU Behavior

Check provider metrics:

  • CPU credit balance
  • Baseline vs burst usage

Sustained usage beyond baseline will degrade performance.


OOM Events

dmesg -T | grep -Ei 'killed process|out of memory'
journalctl -k | grep -Ei 'oom|killed process'

Kernel OOM kills indicate memory boundary violations.


Disk Throughput Caps

df -h
iostat -x 1

Cloud storage often enforces burst limits or throughput caps.
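
A rough sequential-write probe can make a cap visible (scratch file name is arbitrary; conv=fsync forces the data to the device so the page cache does not flatter the number):

```shell
# Write 64 MiB and report throughput; the last line of dd's output has the rate.
dd if=/dev/zero of=./io_probe bs=1M count=64 conv=fsync 2>&1 | tail -n1
rm -f ./io_probe
```

Run in a loop, this is one way to watch burst credits drain: throughput drops sharply once the cap engages.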


Node Comparison

When only some nodes fail, compare:

  • Kernel version
  • Instance type
  • CPU architecture
  • cgroup limits
  • Swap configuration

uname -a
cat /proc/cpuinfo
ulimit -a
cat /proc/self/limits

Assume nodes differ until proven identical.
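
A fingerprint turns "proven identical" into a string comparison: run the same hash on each node and diff the results. A sketch using standard Linux paths:

```shell
# Hash the configuration surface; identical nodes yield identical hashes.
{
  uname -r
  grep -m1 'model name' /proc/cpuinfo || true
  cat /proc/self/limits
  cat /sys/fs/cgroup/memory.max 2>/dev/null
} | sha256sum
```

When hashes differ, drop the sha256sum and diff the raw output to see which line diverged.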


What to Measure

Make the boundary visible in metrics, not just symptoms.

Useful signals:

  • Latency percentile shifts
  • Error rate by node or instance type
  • Memory high-water mark
  • Open file descriptors over time
  • Queue depth
  • Retry count
  • CPU steal or throttling
  • OOM or cgroup events

A graph that moves with the failure is usually more useful than another page of logs.
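
Open descriptors over time is the easiest of these to self-serve without a metrics stack; a sketch that samples a target process once per second ($TARGET_PID is a placeholder):

```shell
# Log the open-fd count until the process exits; a steady climb toward
# the `ulimit -n` ceiling is the boundary announcing itself.
while kill -0 "$TARGET_PID" 2>/dev/null; do
  echo "$(date +%T) fds=$(ls "/proc/$TARGET_PID/fd" | wc -l)"
  sleep 1
done
```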


Increasing Reproducibility

Make the boundary visible:

  • Lower stack limits
  • Constrain memory
  • Increase concurrency
  • Loop execution paths

for i in {1..100}; do run_command; done

Prefer reproducing in a controlled environment before increasing pressure in production.
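
Counting failures across the loop, rather than stopping at the first, yields a rate you can compare before and after tightening a limit. A sketch using the same run_command placeholder:

```shell
# Failure rate over 100 runs under a tightened stack limit.
ulimit -s 4096 2>/dev/null || true   # ignore if the shell cannot lower it
fails=0
for i in $(seq 1 100); do
  run_command >/dev/null 2>&1 || fails=$((fails + 1))
done
echo "$fails/100 runs failed"
```

A rate that moves with the limit is the boundary made measurable.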

Reproducibility is more valuable than log volume.


Common Threshold Events

Most intermittent failures reduce to:

  • Stack exceeded
  • Memory limit reached
  • File descriptors exhausted
  • CPU credits depleted
  • I/O throttled
  • Thread contention
  • Dependency timeout
  • Network instability

Identify the exact limit. Confirm it under controlled conditions. Adjust capacity or design accordingly.


Operating Principles

  • Random usually means unmeasured.
  • A restart resets state. It does not explain the failure.
  • Cloud infrastructure introduces hidden enforcement layers.
  • Stability improves as hidden variables are removed.

An intermittent failure becomes solvable the moment you can name the limit it crosses.