## Task 3.3 - Block Matrix Multiplication [optional]

Looking at the matrix multiplication algorithm, you will notice that the complete computation performs 2\*n^3 arithmetic operations but reads and writes only 3\*n^2 values, so each value is reused many times. Since a complete matrix often does not fit into a small CPU cache, it helps to break the computation into chunks, each small enough to fit within the cache. For matrix multiplication these chunks are sub-matrices (blocks) of the complete matrices, and the pseudo-code for a block-matrix multiplication of square n×n matrices A, B, C with block size b is

```
for r0 = 1 to n step b
  for c0 = 1 to n step b
    for l0 = 1 to n step b
      for r = r0 to min(r0 + b - 1, n)
        for c = c0 to min(c0 + b - 1, n)
          for l = l0 to min(l0 + b - 1, n)
            C[r][c] += A[r][l] * B[l][c]
```
• Implement this algorithm for matrix multiplication
• Find a good block size b by timing several sizes and comparing their performance
• Use the loop-ordering results from the previous task to choose the best ordering of the three inner loops

Upload a graph of your timing results and describe the optimal loop ordering.