Matrix Multiplication and Gaussian Elimination C optimization.

This is a Practice of the Class Parallel Computing at UBC ECE528 given in 2015 fall, solely purpose of sharing, and as a guide for Students who are studying C Optimizations for these algorithms.

gauss_forward_improved_linux.c	Gaussian Elimination Compatible Optimized for Linux GCC compiler
gauss_forward_not_improved_linux.c	Gaussian Elimination Compatible Not Optimized
improved.c	Matrix Multiplication Optimized for Windows GCC compiler
improved_linux.c	Matrix Multiplication Optimized for Linux GCC compiler
not_improved.c	Matrix Multiplication Not Optimized for Windows GCC compiler
not_improved_linux.c	Matrix Multiplication Not Optimized for Linux GCC compiler

Some useful Guidelines:

Optimizing for the serial processors

Lecture 2: Memory Hierarchies and Optimizing Matrix Multiplication

01-6 Compiler Optimizations UIUC.pdf

01-5 Compiler Optimizations Wikipedia.pdf

Objectives:

Write a Vanilla Matrix Multiply Program in C

Parameterize the data type to support, int float and double
initialize with random data
measure and display de run-time of the matrix multiply

Excute 3 executables version in, float and double, each with 4 compiler optimization settings(none, -o,-o2,-o3), use matrix sizes with 128x128 to 4096x4096
Rewrite your vanilla code by applying as many typical compiler transformation as you can: document the ones you applied in a list.
Repeat execution 2 with the other transformations.
Write a 1-Page Report, including a table to show runtime results of step 2 and 4.
Write Gaussian Elimination and repeat the same procedure.

Procedure:

For this assignment I had to review about matrix multiply, so I decided to study the concept first in order to get involved with the procedure that requires a matrix multiply. For that purpose, I used the following link which I consider a ease way to refresh my mind.

Afterwards, I decided to move on the C algorithm. Firstly, I must know about measuring performance using C libraries, this is important because we must be sure about the demanded time by the algorithm when the processor is multiplying the matrix. In this case the initialization process will be not taken into account hence all the measured times will be taken after the matrix initialization, and before the end of the algorithm.

Step 1(Execute 3 executables versions)

I have written the code (see attachements) to perform a matrix multiplication, I decided to use macros so that I could change the data type easily without rewriting sections of code. For the tests I used INT data type, FLOAT TYPE, and the DOUBLE TYPE, and this is what the results show.

In order to measure performance I use I high resolution timer base in <sys/time.h> library as you can observe on the code, so all the results are shown in microseconds. Furthermore, I decided to use args to create a single software which is capable of creating a dynamic matrix therefore is possible to change the size of the matrices without recompiling the original source.

I have to clarify the all the measurements were run on the g32.ece machine which has the following characteristics:

Model: AMD Opteron(tm) Processor 6128

Architecture: x86_64

CPU op-mode(s): 64-bit

CPU(s): 32

Thread(s) per core: 1

Core(s) per socket: 8

CPU socket(s): 4

NUMA node(s): 8

Vendor ID: AuthenticAMD

CPU family: 16

Model: 9

Stepping: 1

CPU MHz: 2000.228

Virtualization: AMD-V

L1d cache: 64K

L1i cache: 64K

L2 cache: 512K

L3 cache: 5118K

LEVELS OF OPTIMIZATION USING INT DATA TYPE

This is the table of results for the first runtime before applying any compiler transformation:

The table shows that the bigger the size of the matrix the more time is required to perform the multiplication. Furthermore, it is clear that there is an speed up when it is applied the compiler optimization.The performance of the matrix of 2048x2048 has been risen in 20% using the third level of optimization. Nevertheless, if it is compared the third level of optimization with the second level, they are slightly different, and the reliability of the code might have been reduced using the third level of optimization, which is unsafe and aggressive. Moreover, the time for the matrices under 2048 reached a better speed up than the matrices of 2048 and 4096, which could be inferred by taking into account the size of the cache memory. In this case the processor has three cache sizes, L1(64K), L2(512K), and L3(5118K), for the matrices under 2048, there is needed at least 1024x1024*(INT size=4)=4096K of cache memory whereas for bigger matrices, it is needed more than 2048x2048x4=19872K which is greater than the biggest cache (L3) memory and as consequence there are more cache misses.

Without compiler transformation.

	128x128	256x256	512x512	1024x1024	2048x2048	4096x4096
--(us)	29335	255607	2753045	71837305	797579689	2467845089
-O1(us)	13264	125123	1644491	72929674	782077242	2454478250
-O2(us)	12160	117564	1513549	63214368	735030301	1949429030
-O3(us)	12117	117758	1525796	52434348	704332290	1827743086

Step 2(Execute 3 executables versions, applying compiler transformations)

In order to improve my code I have applied the next two compiler transformation to my matrix multiplication algorithm:

The Method called Loop Invariant Code motion:

This Method is use to avoid using a variable that slightly changes within a loop, as an example I show the improvement within my code:

as you can observe in the image, there is a multiplication within the k look, where it is performed the multiplication of N by c. N it is a constant and c is changed only by the c loop, thereby removing the multiplication from the k loop and moving it to the c loop. The unnecessary multiplication is just made N times instead of NxN times.

Principle of Memory Locality.

The cache is a fast memory access whereby the computer is capable of accessing often required data to perform different operations repeatedly. The cache is used to to reduce the time required by the processor to access local memory such as RAM. In order to exploit the principle of Memory Locality we must know how cache actually works when you are implementing loops and reading data from a array, so if we are able to read the values from the Matrix in a proper way to exploit computer's cache, we are going to increase in the performance of the matrix multiplication Algorithm.
As the preview image shows, there are 3 loops, c loop , d loop , and k look.

The first code (on the left): To illustrate the behavior of this code I will show 2 matrices, of 2x2

A(0,0)	A(0,1)
A(1,0)	A(1,1)

B(0,0)	B(0,1)
B(1,0)	B(1,1)

C(0,0)	C(0,1)
C(1,0)	C(1,1)

First Iteration(d loop)
C[0][0]+=A[0][0]*B[0][0];
C[0][0]+=A[0][1]*B[1][0]; ==> Where C[0][0] was accessed N times

Second Iteration(d loop)
C[0][1]+=A[0][0]*B[0][1]; ==> Where C[0][0] was accessed N times
C[0][1]+=A[0][1]*B[1][1];

Third Iteration(d loop)
C[1][0]+=A[1][0]*B[0][0]; ==> Where C[0][0] was accessed N times
C[1][0]+=A[1][1]*B[1][0];

Fourth Iteration(d loop)
C[1][1]+=A[1][0]*B[0][1]; ==> Where C[0][0] was accessed N times
C[1][1]+=A[1][1]*B[1][1];

Now imagine what happens when you have a matrix of 4096 elements, you will naturally access C[0][0] ==> N(4096) times to write it which only gives to the cache an often data or an array of a sequence of data.
Each element of the array A is being accessed N times each on the d loop, and each k loop there's a sequence from A[0][0] to A[0][N] where the array A could be stored on the cache memory, and then it could be replaced when the c loop changes.
Each element of B will be hold each d loop

Now if we compared the not improved algorithm with the improved one we have that:

First Iteration(k loop)
C[0][0]+=A[0][0]*B[0][0];
C[0][1]+=A[0][0]*B[0][1]; ==> Where the Array is accessed in sequence C[0][0] --> C[0][n]

Second Iteration(k loop)
C[0][0]+=A[0][1]*B[1][0];
C[0][1]+=A[0][1]*B[1][1];

Third Iteration(k loop)
C[1][0]+=A[1][0]*B[0][0];
C[1][1]+=A[1][0]*B[0][1];

Fourth Iteration(k loop)
C[1][0]+=A[1][1]*B[1][0];
C[1][1]+=A[1][1]*B[1][1];

Where the array C is accessed in sequence and it will be stored on the cache memory because it stays steadily accessed using the c loop and d loop.
On the other hand the array A is accessed in sequence as well and can be stored on the cache memory as well whereas the B array could be replaced every k loop

Table of Results with Compiler Transformations

	128x128	256x256	512x512	1024x1024	2048x2048	4096x4096
--(us)	28930	230222	1843271	14975535	118977048	971977444
-O1(us)	9640	74382	629343	4986690	41345095	330860175
-O2(us)	9059	71859	572221	4700699	37936452	301893036
-O3(us)	9073	72141	577498	4695937	37729592	303559212

As the Runtime improvements table shows the performance has been rocketed up considerably, specially for the matrices of 1024x1024, 2048x2048 and 4096x4096 whereas the performance for the smallest matrices hasn't been dramatically increased, notwithstanding, there is a huge difference regarding the time reached by the matrix of 1024 (0.5 Seconds) compared with the matrix of 2048(4.7 seconds). Once more, the maximum performance was reached by using the -O2 using the matrix 2048x2048. The performance wasn't affected dramatically for the matrices under 2048 due to the cache size, the available space for those matrices allows them to fit without having cache misses which in comparison with the bigger sizes there are more cache misses as it was explained in the first table.

Runtime Improvements or Terms of Speedup (With Compiler Transformations)

	128x128	256x256	512x512	1024x1024	2048x2048	4096x4096
Not Improved /Improved --	1x	1.1x	1.5x	4.8x	6.7x	2.5x
Not Improved /Improved --O1	1.3x	1.6x	2.6x	14.6x	18.9x	7.4x
Not Improved /Improved --O2	1.3x	1.6x	2.6x	13.4x	19.3x	6.5x
Not Improved /Improved --O3	1.3x	1.6x	2.6x	11.2x	18.6x	6.0x

Part 2, Gaussian Elimination.

For this part of the assignment I have reviewed the Gaussian forward Elimination to code the first algorithm, which does not have compiler transformations. Once more, the machine in which I have tested the code was the G32.ECE.UBC.CA.

The results can be observed in the following table with different levels of optimization using GCC.

Gaussian Forward Elimination

Without compiler transformations

1024x1024 2048x2048 4096x4096

--(us) 6589499 53412624 436066621

-O1(us) 3793153 30941944 248007318

-O2(us) 1060157 9410766 76096890

-O3(us) 917158 8595758 72623577

As the first part of this assignment it is clear that the different level of optimization gives speed up to the algorithm, the best results can be found with the level of optimization -O3 where the code was sped up 6 times for the biggest matrix (4096), and 7 times for the smallest matrix in the table (1024). However, the reliability of the code is much better for the -O2 as I mentioned before.

In order to enhance the code for the Gaussian forward elimination I have implemented the method loop invariant code motion, and to reduce the cache misses I have used the principle of locality by mapping the matrix in sections using blocks of 8 columns of it instead of calculating all the results in the same loop.The improvement by using blocks of 8 columns, instead of using all the columns to calculate Gaussian Forward Elimination, brings speed up for the code and avoids cache misses by reducing the sequence that is being accessed and stored on the cache memory.

Table of Results with Compiler Transformations
using blocks of 8 columns within the matrix

1024x1024 2048x2048 4096x4096

--(us) 1139853 6443612 46158971

-O1(us) 578313 3508618 29553864

-O2(us) 308846 1990565 19968219

-O3(us)
289194
1854847
19256503

In comparison with the first table in which was not applied any Compiler transformations, and the one which has compiler transformations, there was a really good enhancement. In the case of the matrix 4096x4096, without using any compiler optimization, the algorithm was performed 9.4 times, which is faster than the first attempt with the initial algorithm.

Runtime Improvements

1024x1024 2048x2048 4096x4096

Not Improved /Improved
-- 5.7x 8.2x 9.4x

Not Improved /Improved
-O1 6.5x 8.8x 8.3x

Not Improved /Improved
-O2 3.4x 4.7x 3.8x

Not Improved /Improved
-O3 3.1x 4.6x 3.7x

Back Sustitution Results

Without compiler transformations

	1024x1024	2048x2048	4096x4096
--(us)	10333	50822	218825
-O1(us)	6951	39515	171568
-O2(us)	5290	39444	176992
-O3(us)	5290	39486	177109

Conclusions:

One of the conclusion that I can infer about the given results, it is that the more we know about how to manage memory into the cache, the much faster our algorithm becomes, as it can be observed on the result table. Other thing to think about, it is the amount of data or chunk that the computer cache can really store, as it is seen in the Result Table the bigger the Matrix, the lower the result in terms of speed up we get. This is due to the cache size, so if we really want to improve this, we have to be aware of how to use wisely smalls chunks of data (blocking) which can be store more efficiently on cache memory in order to avoid cache misses. Furthermore, knowing how to implement the different ways to optimize your code using the compiler flags could be useful in order to get good result, and it is proved that there is a really good increase in the speed up when we pass from an not optimized code to an optimized code using the compiler flags. In addition, the computer architecture is crucial in order to reach better speed up, in this case the cache size brings a lot of benefits to perform matrix multiplications, and Gaussian elimination.