How Much Linear Memory Access Is Enough?
Summary
The article measures experimentally how memory contiguity affects performance. It finds that 1 MB blocks are sufficient for most workloads, that 128 KB blocks are adequate when processing costs about one cycle per byte, and that 4 KB blocks suffice when processing costs exceed roughly 10 cycles per byte. It describes the experimental setup, the kernels, and the results, and suggests a streaming, chunk-based approach to data processing.
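
As a rough illustration of the chunk-based approach, the sketch below walks a buffer in fixed-size contiguous blocks and hands each block to a kernel. It is not the article's code: the names pick_block_size, iter_blocks, and process_in_blocks are hypothetical, and the cutoffs in pick_block_size are only one possible reading of the thresholds quoted above (in particular, "about one cycle per byte" is treated here as "one or more").

```python
from typing import Callable, Iterator

KIB = 1024


def pick_block_size(cycles_per_byte: float) -> int:
    """Hypothetical helper: map per-byte processing cost to a block size.

    The cutoffs are an assumed reading of the summary's thresholds,
    not values taken from the article's code.
    """
    if cycles_per_byte > 10:
        return 4 * KIB        # expensive kernels: 4 KB blocks suffice
    if cycles_per_byte >= 1:
        return 128 * KIB      # about one cycle per byte: 128 KB blocks
    return 1024 * KIB         # cheaper kernels: fall back to 1 MB blocks


def iter_blocks(data: bytes, block_size: int) -> Iterator[memoryview]:
    """Yield consecutive slices of `data`, each at most `block_size` bytes."""
    view = memoryview(data)   # slicing a memoryview does not copy the bytes
    for start in range(0, len(view), block_size):
        yield view[start:start + block_size]


def process_in_blocks(
    data: bytes,
    block_size: int,
    kernel: Callable[[memoryview], int],
) -> int:
    """Apply `kernel` to each block in order and sum the per-block results."""
    total = 0
    for block in iter_blocks(data, block_size):
        total += kernel(block)
    return total


if __name__ == "__main__":
    payload = bytes(range(256)) * 40            # 10240 bytes of sample data
    size = pick_block_size(cycles_per_byte=20)  # assumed cost of the kernel
    print(size, process_in_blocks(payload, size, sum))
```

Here the kernel is simply the built-in sum over the bytes of each block; any function that accepts a contiguous block and returns a number could take its place. Python is used only to show the control flow, since the measurements summarized above concern much lower-level code.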