Thinking Parallel

A Blog on Parallel Programming and Concurrency by Michael Suess

Matrix Optimization Gone Wrong – Reloaded

Who would have thought that such a small article on a matrix optimization and its impact on execution time would generate so many comments in such a short time – thanks for all of them! The original point I was trying to make in that article was to always check whether an optimization actually improves matters before you employ it. But I guess most of you found the question of why it was going wrong more interesting to discuss. And since I don’t want to disappoint my readers (that's you :P), I decided to put some of your suggestions to the test and see what actually caused the slowdown on the Intel Compiler.

The first thing that came to my mind when I saw the code was the additional branch and how it may be causing problems. As many readers have suggested, one can avoid it like this:

[c]
for (i = 0; i < N; i++) {
    A[i][0] = 2 * i + 1;
    for (j = 1; j < N; j++) {
        A[i][j] = A[i][j - 1] + 3;
    }
}
[/c]

Who would have thought: this gives a small but noticeable improvement. But the program is still about 20 times slower than the original one on the Intel compiler (all these numbers are just rough measurements, of course). So let's see if we can improve performance by using a temporary variable:

[c]
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        if (j == 0) {
            temp = 2 * i + 1;
            A[i][j] = temp;
        } else {
            temp += 3;
            A[i][j] = temp;
        }
    }
}
[/c]

The if-branch is still left in there to stick to one change per program. This program is about as fast as the original one. Wow. Who would have thought that today's compilers still have issues with register allocation. If my students had presented me with this program first, I would also have had issues with it, because they would have obfuscated a perfectly fine loop nest for an optimization that I thought every compiler did today. I guess I was wrong.

P.S.: I am looking for a solution to the disappearing-code problem that bit many of you in the comments, but have not found anything yet. If you have an idea how to allow people to post code (including <-signs) in WordPress comments, please let me know. Thanks!

7 Responses to Matrix Optimization Gone Wrong – Reloaded


Comments

  1. Comment by Daniel Michels | 2007/01/31 at 03:20:35
  2. Comment by david howard | 2007/01/31 at 04:13:26

    >Who would have thought that today's compilers still have issues with register allocation.

    I would expect the compiler is not allowed to move the expression from the line

    A[i][j] = A[i][j - 1] + 3;

    into a register because A[i][j-1] could be aliased with a pointer somewhere else. Using the ‘temp’ variable eliminates the aliasing problem. So the slow version with the A[i][j-1] is forced to do an access to memory in the inner loop that is not present in the original or in the final optimization.
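    The effect David describes can be sketched in isolation with C99’s restrict qualifier – the function names here are made up for illustration:

    ```c
    #include <stdio.h>

    #define LEN 8

    /* Plain pointer: in general the compiler must allow for the
     * possibility that other pointers in scope overlap 'row', so
     * row[j - 1] tends to be reloaded from memory each iteration. */
    static void fill_plain(int *row, int n, int start) {
        row[0] = start;
        for (int j = 1; j < n; j++)
            row[j] = row[j - 1] + 3;
    }

    /* 'restrict' promises the compiler that 'row' does not overlap any
     * other object accessed here, which licenses keeping the running
     * value in a register -- the same effect as the explicit temp. */
    static void fill_restrict(int *restrict row, int n, int start) {
        int temp = start;
        row[0] = temp;
        for (int j = 1; j < n; j++) {
            temp += 3;
            row[j] = temp;
        }
    }

    int main(void) {
        int a[LEN], b[LEN];
        fill_plain(a, LEN, 1);
        fill_restrict(b, LEN, 1);
        for (int j = 0; j < LEN; j++)
            if (a[j] != b[j]) { printf("mismatch at %d\n", j); return 1; }
        printf("identical results\n");
        return 0;
    }
    ```

    Both functions compute the same row; the qualifier only changes what the optimizer is allowed to assume.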

  3. Comment by franjesus | 2007/01/31 at 12:47:16

    test

    #include

    #include \

    #include \

    Preview of comments would be nice 😉

  4. Comment by Michael Suess | 2007/01/31 at 22:08:37

    Thanks for the link, Daniel, it appears to work. Just wrap your code into code-tags and they should not be formatted anymore.

    franjesus: the comments preview plugin I used turns all my articles into blank pages, not going to turn it on again.

  5. Comment by Marc Brooks | 2007/02/01 at 10:58:04

    Why not hoist that if out of the inner loop, huh?


    for (i = 0; i < N; i++) {
        int temp = 2 * i + 1;
        A[i][0] = temp;
        for (j = 1; j < N; j++) {
            temp += 3;
            A[i][j] = temp;
        }
    }

  6. Comment by gwenhwyfaer | 2007/02/09 at 02:54:58

    So I was right then? 🙂

    As I said, it’s not register allocation; it’s whether or not the read from A[i][j-1] has to wait for the previous write to A[i][j] (when j was 1 less) to complete, because the processor *itself* can’t tell that the read is just recycling a value. In modern CPU architectures, writes can proceed asynchronously – but any subsequent reads from the same address (especially in the presence of multiprocessor systems) must wait for the write to complete first. I’m guessing that gcc spots that the read will actually fetch back the same value that was just written and substitutes in a temp to avoid that – but that’s actually not a safe optimisation, so I’m not too surprised the Intel compiler didn’t go for it.

  7. Comment by Michael Suess | 2007/02/09 at 17:17:31

    @gwenhwyfaer: yes, it appears you were right. Sorry, I don’t have a prize to award :-).

    What I don’t get is why this is not a safe optimization. Obviously, there is some room for discussion here, since the Intel people seem to consider it unsafe and the gcc people do not. David Howard hints in an earlier comment that “A[i][j-1] could be aliased with a pointer somewhere else” – but where should that somewhere else be? After all, this is not a volatile variable, and we are not even talking about multiple threads here (and even if we were – most memory models of threaded systems I know would still allow it). So I am afraid I don’t get it. Can anyone enlighten me as to why this optimization is not safe? Thanks a lot.
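    One place such a “somewhere else” can at least exist is when the matrix is an array of row pointers rather than a true two-dimensional array – with `int **`, nothing stops two row pointers from overlapping. A minimal sketch, assuming that layout (a true `int A[N][N]` rules this out):

    ```c
    #include <assert.h>
    #include <stdlib.h>

    /* With an array of row pointers, two rows can legally point into
     * the same storage, so a store through one row can change what a
     * load through another row sees. The compiler therefore has less
     * freedom than with a true 2D array, whose rows cannot overlap. */
    int main(void) {
        int *block = calloc(8, sizeof *block);
        if (block == NULL)
            return 1;

        int *A[2];
        A[0] = block;       /* row 0 = block[0..3] */
        A[1] = block + 2;   /* row 1 overlaps row 0 */

        A[0][2] = 42;
        assert(A[1][0] == 42);   /* same memory, seen through row 1 */

        free(block);
        return 0;
    }
    ```

    Whether this possibility actually applies to the loop in the article (where A is a plain 2D array) is exactly the open question above.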

