This is a practice assignment from the Parallel Computing class (UBC ECE528, Fall 2015), shared solely as a guide for students who are studying C optimizations for these algorithms.
 
 
Download this file (gauss_forward_improved_linux.c): Gaussian elimination, optimized for the Linux GCC compiler
Download this file (gauss_forward_not_improved_linux.c): Gaussian elimination, not optimized
Download this file (improved.c): Matrix multiplication, optimized for the Windows GCC compiler
Download this file (improved_linux.c): Matrix multiplication, optimized for the Linux GCC compiler
Download this file (not_improved.c): Matrix multiplication, not optimized, for the Windows GCC compiler
Download this file (not_improved_linux.c): Matrix multiplication, not optimized, for the Linux GCC compiler

 

Some useful Guidelines:

Download this file (matmul.ppt): Optimizing for serial processors

Download this file (Lect_02_1999b.pdf): Lecture 2: Memory Hierarchies and Optimizing Matrix Multiplication

 
 
Objectives:
  1. Write a vanilla matrix multiply program in C:
    1. Parameterize the data type to support int, float, and double.
    2. Initialize with random data.
    3. Measure and display the run-time of the matrix multiply.
  2. Execute the 3 executable versions (int, float, and double), each with 4 compiler optimization settings (none, -O1, -O2, -O3), using matrix sizes from 128x128 to 4096x4096.
  3. Rewrite your vanilla code by applying as many typical compiler transformations as you can; document the ones you applied in a list.
  4. Repeat step 2 with the transformed code.
  5. Write a 1-page report, including a table showing the runtime results of steps 2 and 4.
  6. Write Gaussian elimination and repeat the same procedure.
 
Procedure:
 
For this assignment I had to review matrix multiplication, so I studied the concept first in order to become familiar with the procedure it requires. For that purpose, I used the following link, which I consider an easy way to refresh my mind.

Afterwards, I moved on to the C algorithm. First, I needed to learn how to measure performance using C libraries; this is important because we must measure exactly the time the processor spends multiplying the matrices. The initialization process is not taken into account: all measurements start after the matrix initialization and stop at the end of the multiplication.
 
 
 
Step 1 (Execute 3 executable versions)
 
I have written the code (see attachments) to perform a matrix multiplication. I decided to use macros so that I could change the data type easily without rewriting sections of code (sketched below). For the tests I used the int, float, and double data types, and the results are shown below.
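As a rough illustration of the macro approach, here is a minimal sketch under my own naming: the macro name DATA_TYPE and the loop names are my reconstruction, and the attached files are authoritative.

    /* Change DATA_TYPE to float or double and recompile to test another type. */
    #define DATA_TYPE int

    /* Vanilla multiply of two N x N matrices stored row-major in flat arrays.
       Assumes C has been zero-initialized by the caller. */
    void matmul(int N, const DATA_TYPE *A, const DATA_TYPE *B, DATA_TYPE *C)
    {
        for (int c = 0; c < N; c++)             /* row of C */
            for (int d = 0; d < N; d++)         /* column of C */
                for (int k = 0; k < N; k++)     /* dot-product index */
                    C[N*c + d] += A[N*c + k] * B[N*k + d];
    }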
 
To measure performance I use a high-resolution timer based on the <sys/time.h> library, as you can observe in the code, so all results are reported in microseconds. Furthermore, I use command-line arguments to build a single program that allocates the matrices dynamically, so the matrix size can be changed without recompiling the source.
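A minimal sketch of how such a timer can be built on top of <sys/time.h> (the helper name usec_now is my own; the attached files may structure this differently):

    #include <sys/time.h>

    /* Current wall-clock time in microseconds. */
    static long long usec_now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return (long long)tv.tv_sec * 1000000LL + (long long)tv.tv_usec;
    }

    /* Usage: start timing after initialization, stop right after the multiply.
        long long t0 = usec_now();
        matmul(N, A, B, C);
        printf("%lld us\n", usec_now() - t0);
    */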
 
 
 
I have to clarify that all the measurements were run on the g32.ece machine, which has the following characteristics:
Model: AMD Opteron(tm) Processor 6128
Architecture:          x86_64
CPU op-mode(s):        64-bit
CPU(s):                32
Thread(s) per core:    1
Core(s) per socket:    8
CPU socket(s):         4
NUMA node(s):          8
Vendor ID:             AuthenticAMD
CPU family:            16
Model:                 9
Stepping:              1
CPU MHz:               2000.228
Virtualization:        AMD-V
L1d cache:             64K
L1i cache:             64K
L2 cache:              512K
L3 cache:              5118K
 
LEVELS OF OPTIMIZATION USING INT DATA TYPE
 
The following table shows the results of the first runs, before applying any compiler transformations to the source:
 
The table shows that the bigger the matrix, the more time is required to perform the multiplication. Furthermore, there is a clear speedup when compiler optimization is applied: the performance for the 2048x2048 matrix improved by about 12% with the third optimization level. Nevertheless, comparing the third level with the second, the results are only slightly different, and the reliability of the code may be reduced at the third level, which applies more aggressive (and potentially unsafe) transformations. Moreover, the matrices under 2048 reached a better speedup than the 2048 and 4096 ones, which can be explained by the cache sizes. This processor has three cache levels: L1 (64K), L2 (512K), and L3 (5118K). A 1024x1024 matrix needs 1024x1024x(int size = 4) = 4096K, which still fits in L3, whereas a 2048x2048 matrix needs 2048x2048x4 = 16384K, which is larger than the biggest cache (L3); as a consequence there are more cache misses.
 
Without compiler transformations:
 
             128x128   256x256   512x512    1024x1024   2048x2048    4096x4096
 --  (us)    29335     255607    2753045    71837305    797579689    2467845089
 -O1 (us)    13264     125123    1644491    72929674    782077242    2454478250
 -O2 (us)    12160     117564    1513549    63214368    735030301    1949429030
 -O3 (us)    12117     117758    1525796    52434348    704332290    1827743086



 
Step 2 (Execute 3 executable versions, applying compiler transformations)
 
In order to improve my code I have applied the following two compiler transformations to my matrix multiplication algorithm:
  • Loop-invariant code motion:
    • This method avoids recomputing, inside a loop, a value that does not change within that loop. As an example, the improvement to my code is sketched below:

      In the original code there is a multiplication inside the k loop: the product N*c, used to index the flat arrays. N is a constant and c changes only in the c loop, so the multiplication can be removed from the k loop and hoisted to the c loop; it is then performed only N times instead of NxN times, as the sketch below shows.
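      A reconstruction of the transformation (the original screenshot is not reproduced here; the loop names follow the c, d, k convention used in this write-up, and the matrices are assumed to be stored row-major in flat arrays):

        /* Before: N*c is recomputed on every iteration of the inner loops. */
        for (c = 0; c < N; c++)
            for (d = 0; d < N; d++)
                for (k = 0; k < N; k++)
                    C[N*c + d] += A[N*c + k] * B[N*k + d];

        /* After loop-invariant code motion: N*c is hoisted to the c loop. */
        for (c = 0; c < N; c++) {
            int row = N * c;
            for (d = 0; d < N; d++)
                for (k = 0; k < N; k++)
                    C[row + d] += A[row + k] * B[N*k + d];
        }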

  • Principle of memory locality:
    • The cache is a small, fast memory that lets the computer repeatedly access frequently used data; it reduces the time the processor needs to spend accessing main memory (RAM). To exploit the principle of memory locality we must know how the cache behaves when loops read data from an array: if we read the matrix values in an order that suits the cache, we will increase the performance of the matrix multiplication algorithm.
    • As the code sketched above shows, there are 3 loops: the c loop, the d loop, and the k loop.
      • The original (not improved) ordering: to illustrate its behavior I will use two 2x2 matrices
         A(0,0)  A(0,1)
         A(1,0)  A(1,1)
        x
         B(0,0)  B(0,1)
         B(1,0)  B(1,1)
        =
         C(0,0)  C(0,1)
         C(1,0)  C(1,1)

        First iteration (d loop):
        C[0][0] += A[0][0]*B[0][0];
        C[0][0] += A[0][1]*B[1][0];  ==> C[0][0] is accessed N times in a row

        Second iteration (d loop):
        C[0][1] += A[0][0]*B[0][1];  ==> C[0][1] is accessed N times in a row
        C[0][1] += A[0][1]*B[1][1];

        Third iteration (d loop):
        C[1][0] += A[1][0]*B[0][0];  ==> C[1][0] is accessed N times in a row
        C[1][0] += A[1][1]*B[1][0];

        Fourth iteration (d loop):
        C[1][1] += A[1][0]*B[0][1];  ==> C[1][1] is accessed N times in a row
        C[1][1] += A[1][1]*B[1][1];

        Now imagine a 4096x4096 matrix: each element of C is written N (4096) times in a row, which keeps only that single element hot in the cache instead of a sequence of consecutive data.
        Each row of A is read N times (once per d iteration); within each k loop there is a sequential walk from A[c][0] to A[c][N-1], so the current row of A can stay in the cache until the c loop advances.
        The array B, however, is read down a column (B[0][d] to B[N-1][d]) on every d iteration, a strided access pattern that must be fetched again each time.

        Now, comparing the not-improved algorithm with the improved one:

        First iteration (k loop):
        C[0][0] += A[0][0]*B[0][0];
        C[0][1] += A[0][0]*B[0][1];  ==> the array C is accessed in sequence, C[0][0] --> C[0][N-1]

        Second iteration (k loop):
        C[0][0] += A[0][1]*B[1][0];
        C[0][1] += A[0][1]*B[1][1];

        Third iteration (k loop):
        C[1][0] += A[1][0]*B[0][0];
        C[1][1] += A[1][0]*B[0][1];

        Fourth iteration (k loop):
        C[1][0] += A[1][1]*B[1][0];
        C[1][1] += A[1][1]*B[1][1];

        Here the array C is accessed in sequence, so its current row stays in the cache while the c and d loops run. The array A is also read sequentially and caches well, while each k iteration reads one row of B sequentially (instead of striding down a column), and that row may be replaced on the next k iteration. A sketch of this improved ordering follows.
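A sketch of the improved loop ordering (again my reconstruction; the attached improved.c/improved_linux.c files are authoritative):

    /* Improved (c, k, d) ordering: the inner d loop now walks both C and B
       row-wise, so all three arrays are accessed sequentially. */
    for (c = 0; c < N; c++) {
        int row = N * c;                  /* loop-invariant code motion */
        for (k = 0; k < N; k++) {
            DATA_TYPE a = A[row + k];     /* constant in the inner loop */
            for (d = 0; d < N; d++)
                C[row + d] += a * B[N*k + d];
        }
    }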

Table of Results with Compiler Transformations

             128x128   256x256   512x512   1024x1024   2048x2048   4096x4096
 --  (us)    28930     230222    1843271   14975535    118977048   971977444
 -O1 (us)    9640      74382     629343    4986690     41345095    330860175
 -O2 (us)    9059      71859     572221    4700699     37936452    301893036
 -O3 (us)    9073      72141     577498    4695937     37729592    303559212


As the table shows, the performance has improved considerably, especially for the 1024x1024, 2048x2048, and 4096x4096 matrices, whereas the performance for the smallest matrices has not increased as dramatically. Notwithstanding, there is still a large jump between the 512x512 matrix (about 0.5 seconds at -O2) and the 1024x1024 matrix (about 4.7 seconds). Once more, the largest speedup over the untransformed code (19.3x) was obtained with -O2 on the 2048x2048 matrix. The performance of the matrices under 2048 was not affected as dramatically because of the cache size: those matrices fit in the cache without suffering many cache misses even before the transformation, whereas the bigger sizes incur far more misses, as explained for the first table.

Runtime Improvements in Terms of Speedup (Not Improved / Improved, with compiler transformations)

         128x128   256x256   512x512   1024x1024   2048x2048   4096x4096
 --      1x        1.1x      1.5x      4.8x        6.7x        2.5x
 -O1     1.3x      1.6x      2.6x      14.6x       18.9x       7.4x
 -O2     1.3x      1.6x      2.6x      13.4x       19.3x       6.5x
 -O3     1.3x      1.6x      2.6x      11.2x       18.6x       6.0x





Part 2, Gaussian Elimination.
 
For this part of the assignment I reviewed Gaussian forward elimination in order to code the first (untransformed) algorithm. Once more, the machine on which I tested the code was g32.ece.ubc.ca.
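A minimal sketch of the untransformed forward elimination (my reconstruction; the attached gauss_forward_not_improved_linux.c is authoritative, and the right-hand-side vector b is an assumption of mine):

    /* Forward elimination on an N x N matrix stored row-major in a flat array.
       Assumes nonzero pivots (no pivoting), which is reasonable for random data. */
    for (int p = 0; p < N - 1; p++)             /* pivot row */
        for (int r = p + 1; r < N; r++) {       /* rows below the pivot */
            DATA_TYPE m = A[N*r + p] / A[N*p + p];
            for (int c = p; c < N; c++)
                A[N*r + c] -= m * A[N*p + c];
            b[r] -= m * b[p];                   /* update the right-hand side */
        }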
 
The results, with the different GCC optimization levels, can be observed in the following table.
 
Gaussian  Forward Elimination 
 
Without compiler transformations
             1024x1024   2048x2048   4096x4096
 --  (us)    6589499     53412624    436066621
 -O1 (us)    3793153     30941944    248007318
 -O2 (us)    1060157     9410766     76096890
 -O3 (us)    917158      8595758     72623577
 
 
As in the first part of this assignment, it is clear that the optimization levels speed up the algorithm. The best results are obtained with -O3, where the code runs about 6 times faster for the biggest matrix (4096) and about 7 times faster for the smallest matrix in the table (1024). However, the reliability of the code is better with -O2, as I mentioned before.
 
In order to enhance the Gaussian forward elimination code I applied loop-invariant code motion, and to reduce cache misses I used the principle of locality by processing the matrix in blocks of 8 columns instead of updating all the columns in the same loop. Working on 8-column blocks, instead of sweeping the whole row at once, speeds up the code and avoids cache misses by shortening the sequence of data that must be held in the cache at any one time. A sketch of this blocking is shown below.
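A sketch of how the two transformations might combine (my reconstruction of the idea only; the attached gauss_forward_improved_linux.c is authoritative, and the multiplier array m[] is an assumption of mine):

    #define BLOCK 8
    for (int p = 0; p < N - 1; p++) {
        int prow = N * p;                           /* loop-invariant code motion */
        for (int r = p + 1; r < N; r++)
            m[r] = A[N*r + p] / A[prow + p];        /* one multiplier per row */
        for (int cb = p; cb < N; cb += BLOCK) {     /* walk the rows in 8-column blocks */
            int cend = (cb + BLOCK < N) ? cb + BLOCK : N;
            for (int r = p + 1; r < N; r++) {
                int rrow = N * r;
                for (int c = cb; c < cend; c++)
                    A[rrow + c] -= m[r] * A[prow + c];
            }
        }
        for (int r = p + 1; r < N; r++)
            b[r] -= m[r] * b[p];                    /* update the right-hand side */
    }

With this structure, the 8-element slice of the pivot row stays hot in the cache while every row below it is updated, instead of the whole pivot row competing for cache space.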
 
 
 
 
Table of Results with Compiler Transformations (using blocks of 8 columns within the matrix)

             1024x1024   2048x2048   4096x4096
 --  (us)    1139853     6443612     46158971
 -O1 (us)    578313      3508618     29553864
 -O2 (us)    308846      1990565     19968219
 -O3 (us)    289194      1854847     19256503
 
Comparing the table without any compiler transformations with the one with them, there is a substantial improvement. In the case of the 4096x4096 matrix, without using any compiler optimization flags, the transformed algorithm runs 9.4 times faster than the initial algorithm.
 
Runtime Improvements (Not Improved / Improved)

         1024x1024   2048x2048   4096x4096
 --      5.7x        8.2x        9.4x
 -O1     6.5x        8.8x        8.3x
 -O2     3.4x        4.7x        3.8x
 -O3     3.1x        4.6x        3.7x

 

Back Substitution Results
 
Without compiler transformations
             1024x1024   2048x2048   4096x4096
 --  (us)    10333       50822       218825
 -O1 (us)    6951        39515       171568
 -O2 (us)    5290        39444       176992
 -O3 (us)    5290        39486       177109
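For reference, a minimal sketch of the back substitution step that produced these timings (my reconstruction, assuming the upper-triangular matrix left in A by forward elimination and a solution vector x):

    /* Solve U x = b from the last row upward. */
    for (int r = N - 1; r >= 0; r--) {
        int row = N * r;                 /* loop-invariant code motion */
        DATA_TYPE sum = b[r];
        for (int c = r + 1; c < N; c++)
            sum -= A[row + c] * x[c];
        x[r] = sum / A[row + r];
    }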

 
 
 
 
 
Conclusions:
 
One conclusion I can draw from these results is that the more we know about how memory is managed by the cache, the faster our algorithms become, as the result tables show. Another thing to consider is how much data the cache can actually hold: as the tables show, the bigger the matrix, the lower the speedup we obtain. This is due to the cache size, so if we really want to improve performance we have to work wisely with small chunks of data (blocking), which can be kept in the cache more efficiently and so avoid cache misses. Furthermore, knowing how to use the compiler optimization flags is useful for getting good results: there is a clear increase in speed when we go from unoptimized code to code built with the optimization flags. In addition, the computer architecture is crucial for reaching better speedups; in this case the cache hierarchy brings large benefits to both matrix multiplication and Gaussian elimination.
 
 
 
 
 