Intermittent failures are rarely random.

Before investigating, ask a more precise question: Is this truly intermittent, or deterministic under conditions we have not yet identified?

When something fails “sometimes”, a boundary is usually involved. Systems change behavior when limits are crossed. The task is to identify the constraint.


Core Assumption

Treat every intermittent failure as a threshold event.

Typical boundaries:

  • CPU scheduling
  • Memory limits
  • File descriptor ceilings
  • I/O throughput caps
  • cgroup enforcement
  • Network variability
  • Execution order differences

Do not start with logs. Start with limits.
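"Start with limits" can itself be scripted. A minimal sketch, assuming Linux paths; `limits_snapshot` is an illustrative helper name, and the cgroup file exists only under a cgroup v2 hierarchy, hence the fallback.

```shell
# Snapshot every limit in force before opening a single log.
limits_snapshot() {
  ulimit -a                                          # shell-visible soft limits
  cat /proc/self/limits                              # soft/hard limits as the kernel sees them
  cat /sys/fs/cgroup/memory.max 2>/dev/null || true  # cgroup v2 memory ceiling, if present
}
limits_snapshot
```

Run it once at failure time and once on a healthy run; diff the two snapshots before reading any log line.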


Failure Categories

Hard Deterministic

Fails every time.

Common causes:

  • Bad configuration
  • Missing shared object
  • Schema mismatch
  • Version drift

Basic inspection:

ldd ./binary
strace -f ./binary
journalctl -xe
diff <(env | sort) <(reference_env)
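The shared-object check can be made mechanical. A sketch; `check_libs` is an illustrative helper, not an established tool, and `./binary` stands for the failing executable.

```shell
# Print unresolved shared objects for a binary; exit nonzero if any are missing.
check_libs() {
  ldd "$1" 2>/dev/null | awk '/not found/ {print; missing=1} END {exit missing}'
}
```

Invoke as `check_libs ./binary || echo "library resolution is the problem"` to rule this cause in or out in one step.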

Timing Sensitivity

Fails depending on execution order or runtime state.

Common triggers:

  • Parallel builds
  • Thread scheduling
  • Race conditions
  • Cold vs warm cache

Remove parallelism:

export MAKEFLAGS="-j1"

Test stack boundary:

ulimit -s 4096

If failure frequency changes, timing or stack depth is involved.
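Both constraints can be applied together without polluting the parent shell. A sketch; `constrained` is an illustrative helper name, and the values are starting points, not prescriptions.

```shell
# Re-run a workload serially with a reduced stack, inside a subshell
# so the limits do not leak into the calling shell.
constrained() {
  (
    export MAKEFLAGS="-j1"   # force serial make, including recursive invocations
    ulimit -s 4096           # shrink the stack to 4 MiB
    "$@"
  )
}
```

Invoke as `constrained make test` and compare failure frequency with and without the limits in place.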


Resource Pressure

Appears under load or long runtime.

Common causes:

  • Memory pressure
  • Heap fragmentation
  • File descriptor exhaustion
  • CPU contention

Baseline checks:

ulimit -n
lsof | wc -l
free -m
vmstat 1
top -H

In containers:

cat /sys/fs/cgroup/memory.max
cat /proc/self/limits

Always verify cgroup limits separately from host capacity.
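Putting the two numbers side by side makes a mismatch obvious. A sketch assuming cgroup v2; under v1 the path is /sys/fs/cgroup/memory/memory.limit_in_bytes, and a reading of "max" means no limit is set.

```shell
# Show the cgroup memory ceiling next to host capacity.
cgroup_mem=$(cat /sys/fs/cgroup/memory.max 2>/dev/null || echo "not visible")
host_mem_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
echo "cgroup limit: ${cgroup_mem}"
echo "host memory:  $((host_mem_kb / 1024)) MiB"
```

A process can be killed at the cgroup ceiling while the host still reports gigabytes free; only the smaller of the two numbers matters.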


Cloud-Induced Constraints

CPU Steal Time

mpstat 1

If steal time rises during degradation, hypervisor contention is likely.
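When mpstat (from the sysstat package) is not installed, steal time can be read directly from /proc/stat: it is the 9th field of the aggregate "cpu" line, in cumulative jiffies, so two samples give a rate. A sketch; `steal_jiffies` is an illustrative helper name.

```shell
# Sample cumulative steal time twice to observe the steal rate.
steal_jiffies() {
  awk '/^cpu / {print $9}' /proc/stat
}
before=$(steal_jiffies); sleep 1; after=$(steal_jiffies)
echo "steal jiffies in the last second: $((after - before))"
```

A value that rises only during the degradation window points at the hypervisor, not the workload.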


Burstable CPU Behavior

Check provider metrics:

  • CPU credit balance
  • Baseline vs burst usage

Sustained usage beyond baseline will degrade performance.


OOM Events

dmesg | grep -iE "out of memory|killed process"

Kernel OOM kills indicate memory boundary violations.


Disk Throughput Caps

df -h
iostat -x 1

Cloud storage often enforces burst limits or throughput caps.


Node Comparison

When only some nodes fail, compare:

  • Kernel version
  • Instance type
  • CPU architecture
  • cgroup limits
  • Swap configuration

uname -a
cat /proc/cpuinfo
ulimit -a
cat /proc/self/limits

Assume nodes differ until proven identical.
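The comparison can be scripted so that "proven identical" means an empty diff. A sketch; `node_fingerprint` is an illustrative helper name, and `uname -srvm` is used instead of `uname -a` so the hostname does not force a spurious difference.

```shell
# Capture the same facts in the same order on every node; identical nodes
# should then produce byte-identical files.
node_fingerprint() {
  uname -srvm                                  # kernel and architecture, hostname omitted
  grep -m1 "model name" /proc/cpuinfo || true  # CPU model, where reported
  ulimit -a
  cat /proc/self/limits
}
node_fingerprint > "/tmp/fingerprint.$(hostname)"
```

Collect the files in one place and run `diff` pairwise; an empty diff is the only acceptable proof that two nodes match.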


Increasing Reproducibility

Make the boundary visible:

  • Lower stack limits
  • Constrain memory
  • Increase concurrency
  • Loop execution paths

for i in {1..100}; do run_command; done
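A counting variant of the loop above turns "sometimes" into a measurable rate. A sketch; `fail_count` is an illustrative helper, and `run_command` remains the placeholder for the flaky workload.

```shell
# Run a command n times and report how often it fails.
fail_count() {
  local n=$1 fails=0 i; shift
  for ((i = 1; i <= n; i++)); do
    "$@" >/dev/null 2>&1 || fails=$((fails + 1))
  done
  echo "${fails}/${n} runs failed"
}
```

Invoke as `fail_count 100 run_command` under each constraint in turn; a failure rate that moves with a limit identifies the boundary.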

Reproducibility is more valuable than log volume.


Common Threshold Events

Most intermittent failures reduce to:

  • Stack exceeded
  • Memory limit reached
  • File descriptors exhausted
  • CPU credits depleted
  • I/O throttled
  • Thread contention

Identify the exact limit. Confirm it under controlled conditions. Adjust capacity or design accordingly.


Operating Principles

  • Random usually means unmeasured.
  • Restart resets state; it does not fix root cause.
  • Cloud infrastructure introduces hidden enforcement layers.
  • Stability improves as entropy is removed.