Thinking Parallel

A Blog on Parallel Programming and Concurrency by Michael Suess

Matrix Optimization Gone Wrong

The other day, two students of mine showed me some of their code. Part of their task was to fill a two-dimensional matrix using the formula: A[i][j] = 2 * i + 3 * j + 1; They showed me an optimization they did to speed up that calculation, and I was skeptical. This article shows their optimization and my thoughts about why it failed.

But let’s start from the beginning. The code in question was supposed to look like this:

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        A[i][j] = 2 * i + 3 * j + 1;
    }
}

The students did this optimization:

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        if (j == 0) {
            A[i][j] = 2 * i + 1;
        } else {
            A[i][j] = A[i][j - 1] + 3;
        }
    }
}

Technically, this is correct. And yes, it does save a couple of arithmetic operations, among them many multiplications, which are expensive, as many people know. But if you actually turn this into a program, the second version is about twenty times slower when compiled with the Intel compiler at optimization level 3. It is about the same speed when compiled with gcc -O3. I have looked briefly at the assembler code to make sure the whole calculation was not optimized away by the compiler, and it was not.

So why is the code so much slower? I see two reasons: First, today's architectures don't like branches. One wrongly predicted branch and your whole pipeline goes to hell. Therefore, if-clauses don't help when you want to be really fast. Second, although a couple of multiplications are saved, every calculation now has to wait for its predecessor to complete. If I remember my computer architecture class correctly, there are ways to feed results back into the pipeline without even going through the registers, but I am not sure this does not lose a few cycles in the process.

These are my guesses as to why this optimization has gone so wrong. The students should have taken my advice from an earlier article to heart and tested each and every one of their optimizations for effectiveness. But as I have mentioned multiple times, I am not really a hardware guy, so maybe I am totally off-base and you know different reasons why this is so slow? If yes, please share them here and leave a comment…
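
If you want to check the numbers yourself, here is a minimal timing harness for the two variants. The array size, the repetition count, the use of clock() and the checksum printout are merely one way to set this up, not the exact benchmark I ran:

#include <stdio.h>
#include <time.h>

#define N 2000

static int A[N][N];

/* Variant 1: direct formula, no branch, no loop-carried dependency. */
static void fill_direct(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = 2 * i + 3 * j + 1;
}

/* Variant 2: the students' version with a branch and a dependency
   on the element written in the previous iteration. */
static void fill_incremental(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (j == 0)
                A[i][j] = 2 * i + 1;
            else
                A[i][j] = A[i][j - 1] + 3;
}

static double time_it(void (*fill)(void))
{
    enum { REPS = 50 };  /* repeat so the timing is not lost in clock() resolution */
    clock_t start = clock();
    for (int r = 0; r < REPS; r++)
        fill();
    clock_t end = clock();
    /* Print one element so the compiler cannot throw the work away. */
    printf("checksum: %d\n", A[N - 1][N - 1]);
    return (double)(end - start) / CLOCKS_PER_SEC;
}

int main(void)
{
    printf("direct:      %.3f s\n", time_it(fill_direct));
    printf("incremental: %.3f s\n", time_it(fill_incremental));
    return 0;
}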

Update: The reason why the program is so slow is clarified in my next post.

9 Responses to Matrix Optimization Gone Wrong


Comments

  1. Comment by Rory Geoghegan | 2007/01/28 at 19:58:37

    You are right: with pipelining, most modern processors do not handle branching very well. Also, even though two multiplications would intuitively take twice as long as one, most modern processors, once again, pipeline instructions, so they are more or less done in parallel, especially if they are floating-point operations.

    Actually, the less branching you perform, the better it is, especially when you consider things like the optimisations done by the compiler and the processor.

  2. Comment by brainy | 2007/01/28 at 20:03:39

    Ok, one final attempt (isn’t there any proper way to post code as a comment?):

    Following your suggestion, I was thinking how efficient the code would be when eliminating the unnecessary if clause like the following:

    for (i = 0; i < N; i++) {
        A[i][0] = 2 * i + 1;
        for (j = 1; j < N; j++) {
            A[i][j] = A[i][j - 1] + 3;
        }
    }

    And then maybe taking it one step further and instead of referring back to the array for each assignment, keeping the previous value in a local variable:

    for (i = 0; i < N; i++) {
        val = 2 * i + 1;
        A[i][0] = val;
        for (j = 1; j < N; j++) {
            val += 3;
            A[i][j] = val;
        }
    }

  3. Comment by gwenhwyfaer | 2007/01/28 at 20:40:33

    The obvious point that jumps out at me is that you don’t need any multiplies to work out 2*i + 3*j + 1:

    mov ebx, [i]
    mov ecx, [j]
    lea ebx, [2*ebx+ecx+1]   ; ebx = 2*i + j + 1
    lea eax, [2*ecx+ebx]     ; eax = 2*j + ebx = 2*i + 3*j + 1

    leaves the value in eax. The loop indices can equally simply be hoisted into registers and incremented as a block; similarly, 2*i doesn’t change inside the inner loop and can be computed in advance. It’s quite likely that the code could be compiled without any multiplications in it at all.

  4. Comment by gwenhwyfaer | 2007/01/28 at 20:58:41

    Ah. Just noticed that you say the calculation wasn’t optimised away. And I don’t think it’s branching – there’ll be a couple of mispredictions each go round the outer loop, but overall the jumps settle into repeated patterns.

    In that case, I’d suggest that because of the way it’s written, the reference to A[i][j-1] is forcing the CPU to wait until the write to it in the previous iteration is complete before it can read from it again. Presumably gcc spots that it can just use a temp instead and hoists it, but the Intel compiler doesn’t.

    Does replacing the section of code with:

    /* temp declared before the inner loop */
    if( j == 0 ) {
        temp = 2*i + 1;
        A[i][j] = temp;
    } else {
        temp = temp + 3;
        A[i][j] = temp;
    }

    improve matters? (I know the code is still clunky, but I’m trying to leave it as close to form 2 as possible.)

  5. Comment by Mark Dominus | 2007/01/28 at 23:39:40

    If the problem is really with the added “if” test, then I wonder whether the following variation would be an improvement:

    for (i = 0; i < N; i++) {
        A[i][0] = 2*i + 1;
        for (j = 1; j < N; j++) {
            A[i][j] = A[i][j - 1] + 3;
        }
    }

    Since the “if” test is only to distinguish the special case of j=0,
    we can eliminate it by moving that case out of the loop.

  6. Comment by Mark Dominus | 2007/01/29 at 00:05:42

    It seems that moving the j=0 case out of the loop produces a small but measurable improvement, on my machine, for the cases I tried.

    I’ve placed my source code and makefile at http://www.plover.com/~mjd/misc/c/matrix-opt.tgz .

  7. Comment by Bjoern Knafla | 2007/01/29 at 04:20:49

    Okay, some “have fun with optimizations I don’t know will work, while making the code harder to understand and harder for future compilers to do their magic” 😉

    // Include std::memset
    #include <cstring>

    // Include std::size_t
    #include <cstddef>

    std::size_t const N = ????;

    typedef int elem_t;
    elem_t A[ N ][ N ];

    // Set all elements of A to 1 to optimize away one add operation per loop.
    // Though it has been a long time since I used memset, and the following line needs to be tested to see if this is the way memset works…
    std::memset( A, 1, N * N * sizeof( elem_t ) );

    The rest just in words:
    Initialize the array with 1 to save N*N add-operations.

    Save the 2*i value in a const before entering the j-loop to save re-calculating it N times.
    Perhaps try i << 1 (a shift-op) instead of 2*i, but a good compiler would optimize multiplying by 2 by itself.
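
    In code, the hoisting might look roughly like this. A sketch only: the array size and the name row_base are illustrative, and the shift is shown purely as an example, since a compiler will turn 2 * i into a shift on its own:

    enum { N = 2000 };               /* illustrative size */
    static int A[N][N];

    void fill_hoisted(void)
    {
        for (int i = 0; i < N; i++) {
            /* 2*i + 1 is constant for the whole row, so compute it once. */
            const int row_base = (i << 1) + 1;
            for (int j = 0; j < N; j++) {
                /* No branch and no dependency on A[i][j - 1]. */
                A[i][j] = row_base + 3 * j;
            }
        }
    }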

    Good night.

  8. Comment by Christopher | 2007/01/29 at 08:14:54

    I knew what the problem was as soon as I saw the conditionals: branching. Bjoern is right; just move the 2*i from the inner loop and the program will be fine. I suppose one could also try Duff’s device to be fancy, or maybe even use OpenMP on a multi-core machine if the arrays are really huge.
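
    For the OpenMP idea, a minimal sketch of what that could look like on the original, branch-free loop. The array size is illustrative, and the code has to be compiled with OpenMP enabled, e.g. gcc -fopenmp:

    enum { N = 2000 };               /* illustrative size */
    static int A[N][N];

    void fill_parallel(void)
    {
        /* Each row is independent of the others, so the outer loop
           can be split across threads without any synchronization. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                A[i][j] = 2 * i + 3 * j + 1;
            }
        }
    }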

  9. Comment by MiguelB | 2007/01/31 at 19:33:44

    I think the number of memory accesses is important too. The first program does NxN memory accesses, while the second one does Nx(2N-1). Memory access is by far the slowest operation a CPU can do.

