NeFut Logo NeFut
Admin Login

Understanding CPU Cache Mechanisms and C++ Array Access Optimization

Published at: 2026-05-26 02:45 Last updated: 2026-06-06 13:04
#C++ #Tutorial

Core Logic and Mathematical Principles

Modern CPUs do not directly read data from physical memory (RAM), but instead fetch data through a cache hierarchy consisting of L1, L2, and L3 caches. The smallest unit of data exchange between physical memory and the cache is the Cache Line, which is fixed at $64$ bytes in mainstream x86_64 architectures.

When a program attempts to access memory address $A$, the CPU triggers a cache hit check at the hardware level. If the data block containing $A$ is already in the cache, it is called a Cache Hit, with a delay of only a few clock cycles; if it is not, a Cache Miss occurs, and the CPU must suspend the current execution pipeline to issue a read request to physical memory via the bus, leading to a delay that can increase by two orders of magnitude (approximately 50-200 clock cycles).

To maximize the probability of a Cache Hit, CPUs are equipped with hardware prefetchers that exploit spatial locality: when accessing address $A$, the hardware automatically pulls all adjacent data that fills a Cache Line (64 bytes) from physical memory into the cache.

In C++, multi-dimensional arrays are laid out in memory in a row-major order. Let’s consider a two-dimensional array matrix[N][M], where each element is $S$ bytes in size. The mathematical formula that maps the two-dimensional coordinates $(i, j)$ to a one-dimensional physical memory address $ ext{Addr}(i, j)$ is given by:

$$ ext{Addr}(i, j) = ext{BaseAddress} + (i \times M + j) \times S$$

If the outer loop iterates over the row index $i$ and the inner loop iterates over the column index $j$ (with a stride of $ ext{Stride} = 1 \times S$), the memory access pattern aligns perfectly with the CPU's spatial locality. Once matrix[i][j] is read, the entire Cache Line is filled, and subsequent accesses to matrix[i][j+1] will achieve a $100\%$ pure memory-level Cache Hit.

Conversely, if the outer loop iterates over $j$ and the inner loop iterates over $i$ (with a stride of $ ext{Stride} = M \times S$), when $M \times S > 64$, each inner loop iteration will forcibly cross one or more Cache Lines, causing each memory access to degrade into a high-latency Cache Miss. In this case, the constant time complexity of the algorithm will be magnified several times.


Cache Miss and Array Stride Alignment Derivation

1. Mathematical Degradation of Stride Alignment

Assuming we declare a three-dimensional DP state array int dp[100][100][100]. In physical memory, the rightmost dimension is contiguous in address space. The correct traversal topology should strictly follow the increasing depth of dimensions:

for (int i = 0; i < 100; ++i)
    for (int j = 0; j < 100; ++j)
        for (int k = 0; k < 100; ++k)
            // Here, the stride of k is 1, perfectly triggering cache prefetching
            dp[i][j][k] = ...

If, while implementing tree DP or state compression DP, due to logical confusion, we write a reverse traversal (e.g., with k in the outermost loop and i in the innermost loop), each increment of i will cause the memory address to jump forward by $100 \times 100 \times 4 = 40000$ bytes. This far exceeds the $64$ bytes Cache Line capacity, leading to frequent Cache Line Pollution and Eviction within L1/L2 Cache, resulting in a dramatic drop in throughput.

2. False Sharing and Boundary Alignment

In multi-dimensional arrays or arrays of structures, if two unrelated critical control variables happen to fall within the same $64$ byte Cache Line, even if they logically do not interfere, when the CPU attempts to repeatedly modify one of the variables via registers, the entire Cache Line will be marked as invalid, forcing the other variable's reads to resynchronize with physical memory. When implementing globally sensitive counters, aligning the array dimensions to powers of $2$ can ensure that the memory addresses naturally fit the CPU boundaries.


C++ Standard Source Code

The following source code demonstrates the significant efficiency differences between Cache-friendly and Cache-unfriendly access patterns at the hardware execution level using the Floyd-Warshall algorithm on an adjacency matrix. The code can be compiled in a NOIP local Linux environment using g++ -O2.

#include <iostream>
#include <vector>
#include <algorithm>
#include <chrono>

using std::cin;
using std::cout;

// Setting the array size to 1500, 1500 * 1500 * 4 bytes ≈ 9MB, far exceeding L1/L2 cache capacity, clearly amplifying the lethality of Cache Miss
const int N = 1500;
int dist[N][N];

// Cache-friendly Floyd algorithm: strictly adheres to physical contiguous memory layout
void cache_friendly_floyd() {
    for (int k = 0; k < N; ++k) {
        for (int i = 0; i < N; ++i) {
            // Critical pitfall: move the non-frequently changing dist[i][k] outside the innermost loop to avoid repeated addressing
            int tmp = dist[i][k]; 
            for (int j = 0; j < N; ++j) {
                // Here, the innermost loop aligns completely with the rightmost dimension j, stride is 1, memory is read continuously, Cache hit rate approaches 100%
                if (dist[i][j] > tmp + dist[k][j]) {
                    dist[i][j] = tmp + dist[k][j];
                }
            }
        }
    }
}

// Cache-unfriendly Floyd algorithm: reverse the order of the innermost loop
void cache_unfriendly_floyd() {
    for (int k = 0; k < N; ++k) {
        for (int i = 0; i < N; ++i) {
            for (int j = 0; j < N; ++j) {
                // Critical pitfall: if this inner loop is written as dist[j][i], then the rightmost dimension i remains unchanged,
                // while the left dimension j changes. This causes the stride to increase to N, completely breaking the CPU's spatial locality, resulting in a high Cache Miss rate
                if (dist[j][i] > dist[j][k] + dist[k][i]) {
                    dist[j][i] = dist[j][k] + dist[k][i];
                }
            }
        }
    }
}

int main() {
    std::ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Initialize matrix data
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            if (i == j) dist[i][j] = 0;
            else dist[i][j] = (i + j) % 12345;
        }
    }

    // Execute cache-friendly test
    auto start = std::chrono::high_resolution_clock::now();
    cache_friendly_floyd();
    auto end = std::chrono::high_resolution_clock::now();
    auto d1 = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();

    cout << "Cache Friendly Time: " << d1 << " ms\n";

    return 0;
}

NOIP Practical Pitfall Guide

1. Matrix Multiplication/Graph Adjacency Matrix Reverse Traversal Leading to Constant Explosion

The result is that even with -O2 enabled at $N = 1000$, it still times out severely. The reason is that when accessing b[k][j] in the innermost loop, since $k$ is changing, the rightmost dimension $j$ remains fixed, causing the addressing space to jump vertically with a stride of $N$, which completely negates the CPU prefetch mechanism.

2. Performance Collapse Due to Dimension Order Inversion in Multi-dimensional State Arrays


Classic NOIP/Luogu Problems

1. Luogu P2886 [USACO07NOV] Cow Relays G

2. Luogu P3959 [NOIP2017 Advanced Group] Treasure

Original Source: Local_Import

Previous: None
[h] Back to Home