When etcd Crashes, Check Your Disks First: A Pod CrashLoopBackOff Debugging Story
Summary
The article walks through a real-world debugging session in which etcd kept crashing in a distributed Kubernetes setup because of storage I/O latency. It traces the root cause to slow disk performance in an environment shared among VMs and shows how ZFS tuning (sync disabled, compression enabled, atime off, recordsize set to 8k) stabilized the cluster. The takeaway is that storage performance is a critical factor in etcd reliability.
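As a rough illustration of the ZFS tuning listed above, the following Python sketch builds the corresponding `zfs set` command lines. It only prints the commands and does not run them. The dataset name `tank/etcd` is hypothetical, and `lz4` is an assumed compression algorithm, since the summary does not say which one was used.

```python
import shlex

# Hypothetical dataset name; substitute the dataset that backs etcd's data directory.
DATASET = "tank/etcd"

# Properties named in the summary. The compression algorithm is not stated
# in the source, so lz4 here is an assumption.
ETCD_ZFS_PROPERTIES = {
    "sync": "disabled",    # do not wait for synchronous writes (trades durability for latency)
    "compression": "lz4",  # assumed algorithm
    "atime": "off",        # skip access-time updates on reads
    "recordsize": "8k",    # smaller records for etcd's small writes
}


def zfs_set_commands(dataset, properties):
    """Return one `zfs set` command line per property, in insertion order."""
    return [
        shlex.join(["zfs", "set", f"{name}={value}", dataset])
        for name, value in properties.items()
    ]


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    for command in zfs_set_commands(DATASET, ETCD_ZFS_PROPERTIES):
        print(command)
```

Printing rather than executing keeps the sketch safe to run anywhere. Note that `sync=disabled` means ZFS acknowledges writes before they reach stable storage, so recently acknowledged writes can be lost on power failure or a host crash.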