Crawling a billion web pages in just over 24 hours, in 2025
Summary
Andrew Chan documents building a billion-page web crawler in just over 24 hours using a 12-node Redis-backed cluster. The piece covers architectural choices, scaling experiments, bottlenecks (notably the CPU cost of parsing and SSL), politeness considerations, and practical lessons from running large-scale crawls on commodity hardware.
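The politeness consideration mentioned above usually means rate-limiting fetches per domain. The article's crawler uses a Redis-backed frontier across its cluster; as a rough illustration only, here is a minimal in-memory sketch of per-domain politeness scheduling (the class name, delay value, and heap-based design are assumptions for this example, not details from the article):

```python
import heapq
import time
from collections import defaultdict, deque

class PolitenessFrontier:
    """Minimal in-memory URL frontier enforcing a per-domain crawl delay.

    Illustrative only: a real distributed crawler would back this with
    shared state (e.g. Redis) rather than process-local structures.
    """

    def __init__(self, delay_seconds=1.0):
        self.delay = delay_seconds
        self.queues = defaultdict(deque)   # domain -> pending URLs
        self.ready = []                    # min-heap of (next_fetch_time, domain)
        self.scheduled = set()             # domains currently in the heap

    def add(self, domain, url, now=None):
        now = time.monotonic() if now is None else now
        self.queues[domain].append(url)
        if domain not in self.scheduled:
            heapq.heappush(self.ready, (now, domain))
            self.scheduled.add(domain)

    def pop(self, now=None):
        """Return a URL whose domain is due, or None if nothing is ready."""
        now = time.monotonic() if now is None else now
        if not self.ready or self.ready[0][0] > now:
            return None
        _, domain = heapq.heappop(self.ready)
        url = self.queues[domain].popleft()
        if self.queues[domain]:
            # Re-schedule the domain only after the politeness delay elapses.
            heapq.heappush(self.ready, (now + self.delay, domain))
        else:
            self.scheduled.discard(domain)
        return url
```

With a 1-second delay, two URLs queued for the same domain are released at least a second apart, while URLs from other domains can be fetched in between.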