## Matrix Optimization Gone Wrong

The other day, two students of mine showed me some of their code. Part of their task was to fill a two-dimensional matrix using the formula: `A[i][j] = 2 * i + 3 * j + 1;`

They showed me an optimization they did to speed up that calculation and I was skeptical. This article shows this optimization and my thoughts about why it failed.

But let’s start from the beginning. The code in question was supposed to look like this:

for (i = 0; i < N; i++) {
for (j = 0; j < N; j++) {
A[i][j] = 2 * i + 3 * j + 1;
}
}
[/c]
The students did this *optimization*:

for (i = 0; i < N; i++) { for (j = 0; j < N; j++) { if (j == 0) { A[i][j] = 2 * i + 1; } else { A[i][j] = A[i][j - 1] + 3; } } } [/c] Technically, this is correct. And yes, it does safe a couple of arithmetic operations, among them many multiplications, which are expensive as many people know. But if you actually turn this into a program, the second version is about twenty times slower, when compiled with the Intel compiler and optimization level 3. Its about the same speed when compiled with gcc -O3. I have looked briefly at the assembler code to make sure the whole calculation was not optimized away by the compiler, and it was not. So why is the code so much slower? I see two reasons: First of all, todays architectures don't like branches. One wrongly predicted branch and your whole pipeline goes to hell. Therefore if-clauses don't help, when you want to be really fast. Second, although a couple of multiplications are saved, now every calculations has to wait for its predecessor to complete. If I remember my computer architecture class correctly, there are ways to feed results back into the pipeline without even going through the registers, but I am not sure it does not loose a few cycles in the process. These are my guesses as to why this optimization has gone so wrong. Should have taken my advice given in an earlier article to their heart and tested each and every one of their optimizations for effectiveness. But as I have mentioned multiple times, I am not really a hardware-guy, so maybe I am totally off-base and you know different reasons why this is so slow? If yes, please share them here and leave a comment…

**Update:** The reason why the program is so slow is clarified in my next post.