Matrix Multiplication, Caching, and Related Stuff


After years of not coding in C and not thinking much about low-level optimizations, I decided to review some of this stuff, for (self-)pedagogical reasons and, of course, for fun. I’m mostly interested in topics related to neural network training and inference, so it makes sense to start with matrix multiplication.

I will focus on the computation of the product $C \gets \alpha AB + \beta C$, for matrices $A, B, C$ and scalars $\alpha, \beta$, known as general matrix multiplication (gemm in BLAS, see wiki).
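Written element-wise (with 0-based indices, matching the code below), for $A \in \mathbb{R}^{M\times K}$, $B \in \mathbb{R}^{K\times N}$, $C \in \mathbb{R}^{M\times N}$ the operation is

\[C_{ij} \gets \alpha \sum_{k=0}^{K-1} A_{ik} B_{kj} + \beta\, C_{ij}, \qquad 0 \leq i < M,\ 0 \leq j < N.\]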

Starting from a quick review of computer architecture and caching, we will see why a straightforward implementation of gemm

/*
 * \brief Naive implementation of matrix multiplication: computes C = alpha * A * B + beta * C.
 *        Assumes storage in row-major order.
 * \note  A has dimensions (M, K), B has (K, N), C has (M, N).
 * \param M     Number of rows of A and C.
 * \param N     Number of columns of B and C.
 * \param K     Number of columns of A and rows of B.
 * \param alpha Coefficient of A * B in the result.
 * \param beta  Coefficient of C in the result.
 * \param A     Array containing matrix A.
 * \param B     Array containing matrix B.
 * \param C     Array containing matrix C, also used to store the result.
 */
void gemm(const int M, const int N, const int K, const float alpha, const float beta, const float* A, const float* B, float* C)
{
	for(int m=0; m < M; m++)
	{
		for(int n=0; n < N; n++)
		{
			float acc = 0;
			for(int k=0; k < K; k++)
			{
				acc += alpha * A[m*K + k] * B[k * N + n];
			}
			C[m * N + n] = acc + beta * C[m * N + n];
		}
	}
}

where $A \in \mathbb{R}^{M\times K}$ , $B \in \mathbb{R}^{K\times N}$, $C \in \mathbb{R}^{M\times N}$ are stored in one-dimensional arrays (assumed in row-major order), can be faster if we first make a copy of the columns of $B$:

float* Bcol = (float*)malloc(K * sizeof(float));  /* buffer reused for each column of B */
for(int n=0; n < N; n++)
{
	/* Copy column n of B into contiguous memory. */
	for(int k=0; k < K; k++)
	{
		Bcol[k] = B[k * N + n];
	}
	for(int m=0; m < M; m++)
	{
		float acc = 0;
		for(int k=0; k < K; k++)
		{
			acc += alpha * A[m*K + k] * Bcol[k];
		}
		C[m * N + n] = acc + beta * C[m * N + n];
	}
}
free(Bcol);

and we’ll also discuss some other directions of optimization.

Computer Architecture

Most computers follow the von Neumann architecture, which consists, basically, of a central processing unit (CPU) and a memory unit. Computer programs (instructions) and data are stored in the memory. CPUs also contain registers (small, fast memory) that hold the data required for the current operation. As an example of a typical operation, consider the following (pseudo-)assembly program:

LOAD R1, [0x1000]  ; Load the first integer from memory address 0x1000 into register R1
LOAD R2, [0x1004]  ; Load the second integer from memory address 0x1004 into register R2

ADD R1, R2  ; Add the values in R1 and R2, store the result in R1

STORE [0x1008], R1  ; Store the result from R1 into memory address 0x1008

Assuming that there are two integers stored at the memory addresses 0x1000 and 0x1004, the program above loads the two values into the CPU registers R1 and R2. Once they are in registers, the CPU can add the values and store the result in register R1. Finally, the result is written back to memory address 0x1008, and the registers are free to hold other values for future operations.

Memory Speeds and Caching

As illustrated above, data transfer between memory and the CPU is a key routine performed by computers. If the transfer is slow, it becomes a major bottleneck for programs. A typical arithmetic operation in a CPU takes about 5 clock cycles (see here, for instance), and modern processors operate at frequencies above 2 GHz (cycles per second), which comes to less than $2.5 \times 10^{-9}$ seconds per arithmetic operation.

The transfer speed (bandwidth) between modern RAM and the CPU is typically around 64 GB/s (see here for DDR5 SDRAM). Hence, transferring a float16 (a floating point number with 16 bits = 2 bytes) takes about $0.03 \times 10^{-9}$ seconds. However, to initiate a transfer there is a latency of about $14 \times 10^{-9}$ seconds (for DDR5), roughly 5 times longer than an arithmetic operation. When transferring a contiguous block of memory (e.g., many floating point numbers stored sequentially), we only pay the latency cost once.
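Putting these ballpark figures side by side (rough orders of magnitude, not the specs of any particular chip):

\[t_{\text{op}} \approx \frac{5\ \text{cycles}}{2\ \text{GHz}} = 2.5\ \text{ns}, \qquad t_{\text{transfer}} \approx \frac{2\ \text{bytes}}{64\ \text{GB/s}} \approx 0.03\ \text{ns}, \qquad t_{\text{latency}} \approx 14\ \text{ns}.\]

So for isolated, non-contiguous reads, the memory latency dominates both the arithmetic and the transfer itself.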

To mitigate latency issues, CPUs also include cache memory. From Wikipedia, “a cache is a hardware or software component that stores data so that future requests for that data can be served faster”. Accessing the cache is much faster than accessing RAM, with a latency of typically 1-5 clock cycles. So, instead of copying only the data required for the next operation, the CPU copies a larger block of memory into its cache. Intuitively, if an operation requires data at memory address $X$, it is likely that the next operations will require data at addresses $X+1$, $X+2$, and so on. Once that subsequent data is in the cache, we don’t pay the RAM latency again until the next block of data is needed.
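As a toy illustration of this effect, consider summing the elements of the same array sequentially versus with a large stride. The sketch below is hypothetical (the array length and stride are arbitrary choices): both functions touch every element exactly once, but the strided version keeps jumping across cache lines (typically 64 bytes, i.e., 16 floats) and therefore misses the cache much more often.

#include <stddef.h>

/* Sequential access: consecutive elements share cache lines,
 * so most loads are served from the cache. */
float sum_sequential(const float* data, size_t len)
{
	float acc = 0;
	for(size_t i = 0; i < len; i++)
	{
		acc += data[i];
	}
	return acc;
}

/* Strided access: with a stride larger than a cache line,
 * nearly every load touches a new line and is likely to miss. */
float sum_strided(const float* data, size_t len, size_t stride)
{
	float acc = 0;
	for(size_t start = 0; start < stride; start++)
	{
		for(size_t i = start; i < len; i += stride)
		{
			acc += data[i];
		}
	}
	return acc;
}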

Thinking about caching can be a powerful strategy to optimize our code, as I will illustrate through matrix multiplication.

Using Cache to Optimize Matrix Multiplication

Let’s look at the simplest implementation of gemm ($C \gets \alpha AB + \beta C$):

for(int m=0; m < M; m++)
{
	for(int n=0; n < N; n++)
	{
		float acc = 0;
		for(int k=0; k < K; k++)
		{
			acc += alpha * A[m*K + k] * B[k * N + n];
		}
		C[m * N + n] = acc + beta * C[m * N + n];
	}
}

Here, matrices are stored in arrays in row-major order, such that the $ij$ element of a matrix $P \in \mathbb{R}^{M \times N}$ stored in an array arr is $P_{ij} = \text{arr}[i \times N + j]$ for $0 \leq i < M$ and $0 \leq j < N$.
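As a tiny illustration of this convention, a hypothetical helper (not used in the listings below) would read:

static inline float mat_get(const float* arr, const int N, const int i, const int j)
{
	/* Element (i, j) of a row-major matrix with N columns per row. */
	return arr[i * N + j];
}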

Now, let’s focus on the inner loop

for(int k=0; k < K; k++)
{
	acc += alpha * A[m*K + k] * B[k * N + n];
}

and consider, for simplicity, the case where $m = n = 0$ (i.e., we focus on the computation of $C_{00}$). In this loop, we read the following elements of the arrays $A$ and $B$ from memory, respectively:

\[A[0],\ A[1],\ A[2],\ \ldots,\ A[K-1], \quad \text{and}\]

\[B[0],\ B[N],\ B[2N],\ \ldots,\ B[(K-1)N].\]

Matrix $A$ elements are stored contiguously, allowing the CPU to efficiently cache them. However, matrix $B$ elements are not contiguous, leading to frequent cache misses (i.e., not finding the requested data in the cache memory) and increased latency.

To optimize, we can copy columns of $B$ into contiguous memory before the main computation:

float* Bcol = (float*)malloc(K * sizeof(float));  /* buffer reused for each column of B */
for(int n=0; n < N; n++)
{
	/* Copy column n of B into contiguous memory. */
	for(int k=0; k < K; k++)
	{
		Bcol[k] = B[k * N + n];
	}
	for(int m=0; m < M; m++)
	{
		float acc = 0;
		for(int k=0; k < K; k++)
		{
			acc += alpha * A[m*K + k] * Bcol[k];
		}
		C[m * N + n] = acc + beta * C[m * N + n];
	}
}
free(Bcol);

which should significantly decrease the total latency of the program: each column is copied into a small contiguous buffer of $K$ floats, which is then read sequentially and reused for all $M$ rows of $A$ before the next column is needed.

Now let’s put some numbers on it. Using square matrices of dimension $d = M = N = K$, I tested both implementations for different values of $d$ on a Raspberry Pi 5, and here are my results:

| Dimension $d$ | Runtime without optimization (s) | Runtime with optimization (s) |
| --- | --- | --- |
| 100 | 0.001 (0.000) | 0.001 (0.000) |
| 200 | 0.006 (0.000) | 0.006 (0.000) |
| 500 | 0.114 (0.002) | 0.105 (0.001) |
| 1000 | 1.544 (0.008) | 0.872 (0.001) |
| 2000 | 12.599 (0.007) | 7.027 (0.003) |

Runtimes are in seconds, averaged over 10 independent runs; values in parentheses are standard deviations.

There it is: for the larger matrices, we see roughly a 45% decrease in runtime from this simple optimization.
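For reference, a minimal timing harness for experiments like these could look something like the sketch below (a hypothetical main with arbitrary test data; it calls the gemm function defined at the top and uses the POSIX clock_gettime for wall-clock time):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
	const int d = 1000;  /* M = N = K = d */
	float* A = (float*)malloc((size_t)d * d * sizeof(float));
	float* B = (float*)malloc((size_t)d * d * sizeof(float));
	float* C = (float*)malloc((size_t)d * d * sizeof(float));
	for(int i = 0; i < d * d; i++)
	{
		A[i] = 1.0f;  /* arbitrary test data */
		B[i] = 2.0f;
		C[i] = 0.0f;
	}

	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	gemm(d, d, d, 1.0f, 0.0f, A, B, C);  /* the naive gemm defined above */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double elapsed = (double)(t1.tv_sec - t0.tv_sec) + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
	printf("d = %d: %.3f s\n", d, elapsed);

	free(A);
	free(B);
	free(C);
	return 0;
}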

Other Optimizations

Following the same idea, there are other things we can do to decrease the runtime and make the code a bit more elegant. The general strategy for cache-friendly code is to split the total computation into smaller chunks, where each chunk only needs a small amount of data. If that data fits in the cache, the chunk runs without repeated trips to RAM.

Matrix multiplication, for instance, can be split as follows:

\[\begin{pmatrix} A_{1} \\ A_{2} \end{pmatrix} \begin{pmatrix} B_{1} & B_{2} \end{pmatrix} = \begin{pmatrix} A_{1}B_{1} & A_{1}B_{2} \\ A_{2}B_{1} & A_{2}B_{2} \end{pmatrix}\]

and generalized to multiple tiles. In pseudocode, the cache-optimized version would look like:

for i in [0, ..., N_BLOCKS-1]:      # row block index
	for j in [0, ..., N_BLOCKS-1]:  # column block index
		# copy Ai to contiguous memory
		# copy Bj to contiguous memory
		# compute Ai * Bj and store in output array 
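A concrete C sketch of this blocking idea might look like the following. It is an illustration under simplifying assumptions: BLOCK is an arbitrary tile size, M and N are assumed to be multiples of it, and only the panel of $B$ is packed, since row blocks of $A$ are already contiguous in row-major order.

#include <stdlib.h>

#define BLOCK 64  /* tile size; an arbitrary choice for illustration */

/* Blocked version of C = alpha * A * B + beta * C (row-major),
 * assuming M and N are multiples of BLOCK. */
void gemm_blocked(const int M, const int N, const int K, const float alpha, const float beta, const float* A, const float* B, float* C)
{
	/* Packed copy of one column panel of B: K rows x BLOCK columns, contiguous. */
	float* Bpanel = (float*)malloc((size_t)K * BLOCK * sizeof(float));

	for(int j = 0; j < N; j += BLOCK)  /* column block of B and C */
	{
		/* Pack B[:, j : j+BLOCK] into contiguous memory. */
		for(int k = 0; k < K; k++)
		{
			for(int jj = 0; jj < BLOCK; jj++)
			{
				Bpanel[k * BLOCK + jj] = B[k * N + (j + jj)];
			}
		}
		for(int i = 0; i < M; i += BLOCK)  /* row block of A and C */
		{
			/* Compute the BLOCK x BLOCK tile C[i:i+BLOCK, j:j+BLOCK]. */
			for(int ii = 0; ii < BLOCK; ii++)
			{
				for(int jj = 0; jj < BLOCK; jj++)
				{
					float acc = 0;
					for(int k = 0; k < K; k++)
					{
						acc += alpha * A[(i + ii) * K + k] * Bpanel[k * BLOCK + jj];
					}
					C[(i + ii) * N + (j + jj)] = acc + beta * C[(i + ii) * N + (j + jj)];
				}
			}
		}
	}
	free(Bpanel);
}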

Going beyond cache optimization, notice that we could parallelize the code and compute each $A_{i}B_{j}$ product in a different thread (or block of threads). That’s how GPUs massively accelerate matrix multiplication: by parallelizing the computation across many threads, each working on a small tile held in fast, localized memory!
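On the CPU side, a minimal way to get a taste of this is OpenMP (compile with -fopenmp). The sketch below simply splits the rows of $C$ across threads; each thread writes a disjoint part of the output, so no synchronization is needed. This is a sketch of thread-level parallelism, not of how a GPU kernel is actually written.

void gemm_parallel(const int M, const int N, const int K, const float alpha, const float beta, const float* A, const float* B, float* C)
{
	/* Each thread handles a disjoint set of rows of C. */
	#pragma omp parallel for
	for(int m = 0; m < M; m++)
	{
		for(int n = 0; n < N; n++)
		{
			float acc = 0;
			for(int k = 0; k < K; k++)
			{
				acc += alpha * A[m * K + k] * B[k * N + n];
			}
			C[m * N + n] = acc + beta * C[m * N + n];
		}
	}
}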

Also, other algorithms, such as the Strassen algorithm, can be used to further accelerate matrix multiplication.


Conclusion

It’s always useful to keep the cache in mind! For linear algebra, it’s highly recommended to stick to existing implementations, since they have been heavily optimized. But it’s nice to have an idea of what’s happening under the hood, and to get some inspiration for life.