Thinking Parallel

A Blog on Parallel Programming and Concurrency by Michael Suess

Locality optimization experiments

Chris over at HPCAnswers has written a nice introductory article about locality optimization. He uses matrix multiplication as an example. The rest of this article assumes you have read his post and therefore know at least some basics about loop reordering and locality optimization.

I remember very clearly, when my supervisor first introduced the concept of locality to us in her parallel programming class a couple of years ago: I was stunned. I could not believe that such a simple change made such a huge difference in performance (I was younger then 😉 ). And above all: I could not believe that the compiler was not smart enough to reorder that loop by itself. After all, there are no complicated dependencies or function calls in there! But I did some tests, and the performance difference between the loop versions was there and clearly visible.

Reading Chris’ post I wanted to know whether that was still true, even for a super-simple example like matrix multiplication. So I turned Chris’ example into a program once more (now where are so many of my programs from the time of my studies? And why didn’t I use source control then? Gosh, I must have been way younger then :D). If you want to play with it on your compiler, it is right here:

/* This program basically multiplies the matrix A with the
* matrix B and stores the result into matrix C. A and B
* are filled with (not so random) numbers, as we are not
* interested in the actual result but more in the time it
* takes to compute it.
*
* For verification purposes and to stop the compiler from
* throwing out the whole calculation, the sum of all
* elements from C is calculated and printed (will probably
* overrun, but that’s OK).
*/

#include

/* size of our matrices */
#define N 1000

/* our matrices */
int A[N][N];
int B[N][N];
int C[N][N];

int main()
{
int i, j, k;
/* the sum of all elements of C */
long sum = 0;

/* fill A, B and C */
for (i = 0; i < N; i++) for (j = 0; j < N; j++) { A[i][j] = i; B[i][j] = j; C[i][j] = 0; } /* calculate C */ for (i = 0; i < N; i++) for (j = 0; j < N; j++) for (k = 0; k < N; k++) C[i][j] = A[i][k] * B[k][j] + C[i][j]; /* calculate sum of all elements of C */ for (i = 0; i < N; i++) for (j = 0; j < N; j++) sum += C[i][j]; /* print out the sum */ printf("sum: %ld\n", sum); return 0; } [/c] Yes, I know, global variables are evil, but the arrays are too large to fit on the stack in the main function and dynamic allocation on the heap would have made the program longer. So please live with it or feel free to correct it yourself ;-). Where was I? Oh yes, compiled the program on my laptop (Intel Pentium M 725 processor with 1.6 screaming GHz, Intel Compiler 9.0 on Linux, optimization level -O3), ran it a couple times, and got a best runtime of about 8,59s. I had to run the the program a couple of times because I saw quite some variations in the measured wall-clock times and wanted to make sure to minimize their impact – I am not sure about the cause of these variations, though, might have to check that later. Now switch the j and k-loops, compile again on the same computer, same options and get a best runtime of: *drumroll*: 3,60s.

I call that quite impressive for such a tiny change, a speedup of 2,39 just by changing two loop-variables. Nevertheless, I am still surprised the compiler does not do the reordering by itself. But then again, I am by no means a compiler expert, so there may be good reasons for that (if you know any, please be sure to leave a comment, thanks a lot!).

I am almost at the end of my little trip into the world of locality optimization. There is one thing some people may have noticed in my article: the optimization level -O3 I used for my experiments is not the highest one for the Intel compiler. There is a convenient switch called -fast, which is supposed to turn on even more compiler magic. It does not work on my Pentium M though, because it produces code that can only be run on a Pentium 4. This is quickly fixed by giving the compiler the option -xB, which is basically a switch that optimizes my code for the Pentium M exclusively.

Let’s try again with this option: The best result I was able to achieve for the unoptimized loop was 1,70s (again with huge variations) and about the same for the optimized one. That’s a speedup of 5,0 from the original program! And the best part is: I did not even have to do any loop reorderings as described in the original article. The reason this version is so fast can be found in the compiler output while compiling:

$ icc matrix_mult.c -o matrix_mult -O3 -xB
matrix_mult.c(29) : (col. 3) remark: LOOP WAS VECTORIZED.
matrix_mult.c(43) : (col. 2) remark: LOOP WAS VECTORIZED.

The -xB switch causes the compiler to use the SIMD-units of my processor (you may know them as MMX, SSE or SSE2-instructions), which are of course a natural fit for a calculation like this, because there are no dependencies and therefore it can process multiple calculations at the same time. I won’t go into the details of vector processing here, as the article is long enough as is.

I will try a short summary instead: Building upon Chris’ original article, I have tried to show that locality is something to keep in mind when programming, even with our advanced compilers today. Of course, I only tested with one compiler (Intel), but I am fairly sure that there are always cases, when the compiler does not know enough about the dependencies of my code to be able to reorder loops by itself or do other optimizations that benefit locality. There are cases of course, when the compiler is smart enough already, especially with the highest optimizations turned on (in our case thanks to vectorization).

There are two things left that I don’t understand, though: Obviously, the compiler is smart enough to detect that there are no dependencies in the main loops, or else it could not vectorize them. If it is smart enough for that, why is it not able to reorder the loops by itself?? And the second question is: where are the tools to help the programmers with locality optimization? Searching the web for the term locality optimization tool still yields our own work on locality optimization as top result and all the other top hits look very much like research projects as well and not like ready-to-use tools. Will have to talk to my supervisor about this, as she has done a whole lot of work on locality optimization some time ago and probably knows more about this question. Maybe you know an answer to these questions as well, and if you do I would appreciate it if you left a comment!

P.S.: Regarding Europar

I will be at the Europar conference in Dresden for the rest of this week, where I will be presenting our work on irregular algorithms and cancelling parallel regions in OpenMP. If any of you will be there as well, I would be delighted to have a talk!

P.P.S.: Regarding Spam

With popularity comes spam. As this blog is raising more and more attention (according to my weblogs and subscriber statistics), unfortunately spammers seem to have found it as well. I have therefore deployed some countermeasures in the form of anti-spam plugins. Since I do not have any experience with any of them and just had to trust their reputation on the web, I do not have a feeling about how intrusive these are for regular commenters (that means you 😉 ), yet. Therefore, if you have any problems commenting on any articles, please be sure to CFLink and I will try to work things out as fast as possible! Thanks!

P.P.P.S: Regarding Feedback

While I am talking about feedback, if you have any be sure to let me know, either by commenting on the article in question directly or by contacting me by email! I am always looking to improve my knowledge, writing style or whatever other skill you can think of ;-).

5 Responses to Locality optimization experiments


Comments