This is a Practice of the Class Parallel Computing at UBC ECE528 given in 2015 fall, solely purpose of sharing, and as a guide for Students who are studying C Optimizations for these algorithms.
Objectives:
- Write a Vanilla Matrix Multiply Program in C
- Parameterize the data type to support, int float and double
- initialize with random data
- measure and display de run-time of the matrix multiply
- Excute 3 executables version in, float and double, each with 4 compiler optimization settings(none, -o,-o2,-o3), use matrix sizes with 128x128 to 4096x4096
- Rewrite your vanilla code by applying as many typical compiler transformation as you can: document the ones you applied in a list.
- Repeat execution 2 with the other transformations.
- Write a 1-Page Report, including a table to show runtime results of step 2 and 4.
- Write Gaussian Elimination and repeat the same procedure.
Procedure:
For this assignment I had to review about matrix multiply, so I decided to study the concept first in order to get involved with the procedure that requires a matrix multiply. For that purpose, I used the following link which I consider a ease way to refresh my mind.
Afterwards, I decided to move on the C algorithm. Firstly, I must know about measuring performance using C libraries, this is important because we must be sure about the demanded time by the algorithm when the processor is multiplying the matrix. In this case the initialization process will be not taken into account hence all the measured times will be taken after the matrix initialization, and before the end of the algorithm.
Step 1(Execute 3 executables versions)
I have written the code (see attachements) to perform a matrix multiplication, I decided to use macros so that I could change the data type easily without rewriting sections of code. For the tests I used INT data type, FLOAT TYPE, and the DOUBLE TYPE, and this is what the results show.
In order to measure performance I use I high resolution timer base in <sys/time.h> library as you can observe on the code, so all the results are shown in microseconds. Furthermore, I decided to use args to create a single software which is capable of creating a dynamic matrix therefore is possible to change the size of the matrices without recompiling the original source.
I have to clarify the all the measurements were run on the g32.ece machine which has the following characteristics:
Model:
AMD Opteron(tm) Processor 6128
Architecture: x86_64
CPU op-mode(s): 64-bit
CPU(s): 32
Thread(s) per core: 1
Core(s) per socket: 8
CPU socket(s): 4
NUMA node(s): 8
Vendor ID: AuthenticAMD
CPU family: 16
Model: 9
Stepping: 1
CPU MHz: 2000.228
Virtualization: AMD-V
L1d cache: 64K
L1i cache: 64K
L2 cache: 512K
L3 cache: 5118K
LEVELS OF OPTIMIZATION USING INT DATA TYPE
This is the table of results for the first runtime before applying any compiler transformation:
The table shows that the bigger the size of the matrix the more time is required to perform the multiplication. Furthermore, it is clear that there is an speed up when it is applied the compiler optimization.The performance of the matrix of 2048x2048 has been risen in 20% using the third level of optimization. Nevertheless, if it is compared the third level of optimization with the second level, they are slightly different, and the reliability of the code might have been reduced using the third level of optimization, which is unsafe and aggressive. Moreover, the time for the matrices under 2048 reached a better speed up than the matrices of 2048 and 4096, which could be inferred by taking into account the size of the cache memory. In this case the processor has three cache sizes, L1(64K), L2(512K), and L3(5118K), for the matrices under 2048, there is needed at least 1024x1024*(INT size=4)=4096K of cache memory whereas for bigger matrices, it is needed more than 2048x2048x4=19872K which is greater than the biggest cache (L3) memory and as consequence there are more cache misses.
Without compiler transformation.
|
128x128 |
256x256 |
512x512 |
1024x1024 |
2048x2048 |
4096x4096 |
--(us) |
29335 |
255607 |
2753045 |
71837305 |
797579689 |
2467845089 |
-O1(us) |
13264 |
125123 |
1644491 |
72929674 |
782077242 |
2454478250 |
-O2(us) |
12160 |
117564 |
1513549 |
63214368 |
735030301 |
1949429030 |
-O3(us) |
12117 |
117758 |
1525796 |
52434348 |
704332290 |
1827743086
|
Step 2(Execute 3 executables versions, applying compiler transformations)
In order to improve my code I have applied the next two compiler transformation to my matrix multiplication algorithm:
- The Method called Loop Invariant Code motion:
- This Method is use to avoid using a variable that slightly changes within a loop, as an example I show the improvement within my code:
as you can observe in the image, there is a multiplication within the k look, where it is performed the multiplication of N by c. N it is a constant and c is changed only by the c loop, thereby removing the multiplication from the k loop and moving it to the c loop. The unnecessary multiplication is just made N times instead of NxN times.
- Principle of Memory Locality.
- The cache is a fast memory access whereby the computer is capable of accessing often required data to perform different operations repeatedly. The cache is used to to reduce the time required by the processor to access local memory such as RAM. In order to exploit the principle of Memory Locality we must know how cache actually works when you are implementing loops and reading data from a array, so if we are able to read the values from the Matrix in a proper way to exploit computer's cache, we are going to increase in the performance of the matrix multiplication Algorithm.
- As the preview image shows, there are 3 loops, c loop , d loop , and k look.
Part 2, Gaussian Elimination.
For this part of the assignment I have reviewed the Gaussian forward Elimination to code the first algorithm, which does not have compiler transformations. Once more, the machine in which I have tested the code was the G32.ECE.UBC.CA.
The results can be observed in the following table with different levels of optimization using GCC.
Gaussian Forward Elimination
Without compiler transformations
|
1024x1024 |
2048x2048 |
4096x4096 |
--(us) |
6589499 |
53412624 |
436066621 |
-O1(us) |
3793153 |
30941944 |
248007318 |
-O2(us) |
1060157 |
9410766 |
76096890 |
-O3(us) |
917158 |
8595758 |
72623577 |
As the first part of this assignment it is clear that the different level of optimization gives speed up to the algorithm, the best results can be found with the level of optimization -O3 where the code was sped up 6 times for the biggest matrix (4096), and 7 times for the smallest matrix in the table (1024). However, the reliability of the code is much better for the -O2 as I mentioned before.
In order to enhance the code for the Gaussian forward elimination I have implemented the method loop invariant code motion, and to reduce the cache misses I have used the principle of locality by mapping the matrix in sections using blocks of 8 columns of it instead of calculating all the results in the same loop.The improvement by using blocks of 8 columns, instead of using all the columns to calculate Gaussian Forward Elimination, brings speed up for the code and avoids cache misses by reducing the sequence that is being accessed and stored on the cache memory.
Table of Results with Compiler Transformations
using blocks of 8 columns within the matrix
|
1024x1024 |
2048x2048 |
4096x4096 |
--(us) |
1139853 |
6443612 |
46158971 |
-O1(us) |
578313 |
3508618 |
29553864 |
-O2(us) |
308846 |
1990565 |
19968219 |
-O3(us) |
289194 |
1854847 |
19256503 |
In comparison with the first table in which was not applied any Compiler transformations, and the one which has compiler transformations, there was a really good enhancement. In the case of the matrix 4096x4096, without using any compiler optimization, the algorithm was performed 9.4 times, which is faster than the first attempt with the initial algorithm.
|
1024x1024 |
2048x2048 |
4096x4096 |
Not Improved /Improved -- |
5.7x |
8.2x |
9.4x |
Not Improved /Improved -O1 |
6.5x |
8.8x |
8.3x |
Not Improved /Improved -O2 |
3.4x |
4.7x |
3.8x |
Not Improved /Improved -O3 |
3.1x |
4.6x |
3.7x
|
Back Sustitution Results
Without compiler transformations
|
1024x1024 |
2048x2048 |
4096x4096 |
--(us) |
10333 |
50822 |
218825 |
-O1(us) |
6951 |
39515 |
171568 |
-O2(us) |
5290 |
39444 |
176992 |
-O3(us) |
5290 |
39486 |
177109
|
Conclusions:
One of the conclusion that I can infer about the given results, it is that the more we know about how to manage memory into the cache, the much faster our algorithm becomes, as it can be observed on the result table. Other thing to think about, it is the amount of data or chunk that the computer cache can really store, as it is seen in the Result Table the bigger the Matrix, the lower the result in terms of speed up we get. This is due to the cache size, so if we really want to improve this, we have to be aware of how to use wisely smalls chunks of data (blocking) which can be store more efficiently on cache memory in order to avoid cache misses. Furthermore, knowing how to implement the different ways to optimize your code using the compiler flags could be useful in order to get good result, and it is proved that there is a really good increase in the speed up when we pass from an not optimized code to an optimized code using the compiler flags. In addition, the computer architecture is crucial in order to reach better speed up, in this case the cache size brings a lot of benefits to perform matrix multiplications, and Gaussian elimination.