Thinking Parallel

A Blog on Parallel Programming and Concurrency by Michael Suess

Parallel Programming Fun with Loop Carried Dependencies

There was a very interesting discussion on the OpenMP Mailing List last week about a for-loop and how to parallelize / optimize it. I will comment on the problems and solutions presented there, and I also have an interesting observation to add at the end of this article. Although the examples are in OpenMP, the problem and solutions are applicable to any threads-based parallel programming system.

But let’s not get ahead of ourselves and start at the very beginning. The original mail from Hicham Mouline shows the following code snippet:

const double up = 1.1 ;
double Sn=1000.0;
double opt[N+1];
int n;
for (n = 0; n <= N; ++n) {
    opt[n] = Sn;
    Sn *= up;
}

He wanted to parallelize the for-loop with OpenMP, but could not due to the Sn-variable, which depends on the previous loop iteration. This case is so common in parallel programming that it even has a name: Loop Carried Dependency. I would not have written about it here if that were the end of the story and there was no way to parallelize the loop. Examining it closely, you will notice that the dependency is breakable, and helpful as I am, I went and implemented a solution. After some testing to make sure I got the exact same results, I wanted to submit it to the mailing list, only to realize that I had been beaten by Clay Breshears (who, by the way, also maintains a fine blog about parallel programming – I have been meaning to write about it for some time and will in a future post), who had just posted a solution very similar to what I had done – don't you just hate it when that happens ;-). This is what he came up with:

const double up = 1.1 ;
double Sn, origSn=1000.0;
double opt[N+1];
int n;

opt[0] = origSn;
#pragma omp parallel for private(n) lastprivate(Sn)
for (n = 1; n <= N; ++n) {
    Sn = origSn * pow(up, n);
    opt[n] = Sn;
}
Sn *= up;

Actually, this is not quite what he posted, as his original solution had a different outcome than the original code – a slight off-by-one error had obviously slipped in (my old computer science teacher used to say: “programmers are always off by one” – and I can support that thesis from my own experience, as this happens a lot to me as well), which I took the liberty of correcting here. I bet this is why he had his solution up faster than I did :P. The difference between his solution and mine is that he used lastprivate, while I pulled the update of Sn out of the loop entirely. It does not make a big difference either way, though, so I am not posting it here.

Anyways, what’s important to note here is that with a simple reorganization the loop is suddenly parallelizable and the loop carried dependency has disappeared. Unfortunately, these dependencies are not always as easy to break as in this case. This solution has another problem, though – it uses the power function quite extensively, for each and every iteration. Power is an expensive operation, therefore a few posts later a guy called Jakub Jelinek (from Red Hat, judging from his email address – the very helpful people who have built OpenMP support into the next gcc release [4.2]) stepped up to post an even better solution, which is sketched below:

const double up = 1.1;
double Sn, origSn=1000.0;
double opt[N+1];
int n, lastn;

lastn = -2;
#pragma omp parallel for private(n) firstprivate(lastn) lastprivate(Sn)
for (n = 0; n <= N; ++n) {
    if (lastn != n - 1) {
        /* first iteration of this thread's chunk: compute Sn directly */
        Sn = origSn * pow(up, n);
    } else {
        Sn *= up;
    }
    opt[n] = Sn;
    lastn = n;
}
Sn *= up;

A very nice solution, which divides the iteration space into chunks and avoids recalculating the power function for each iteration this way. And it even works for all kinds of schedules, although I suspect it is fastest with a static one. So, are we done now? This particular loop carried dependency has been done away with, a solution has been posted, and the thread petered out after a few more posts. Unfortunately, I still had some doubts. I could hear the words of the wise Sanjiv ringing in my head, who had told me in his interview:

Michael: What is the worst mistake you have ever encountered in an OpenMP program?

Sanjiv: The most common mistake: parallelizing loops that are too fine grained or consume a lot of bandwidth and contribute negatively to overall scaling. Even very experienced OpenMP programmers make this mistake. Programmers must understand the performance and scaling of every parallel region to avoid such mistakes. This is one of the worst mistakes because people are just shooting themselves in the foot.

I suspected this might be the case here. To make a not-so-long story short, I turned all the code snippets into full programs, made N as big as possible (it won’t go much higher than 7000 on the architectures I tried this on before the variables start overflowing), threw in some well-placed time measurement functionality (only the actual loop was measured, not the supporting code around it), fired up our trusty 4-way Opteron system and let the horses run, as they say. Here are the results for 4 threads:

  • Clay’s version: 0.637ms
  • Jakub’s version: 0.330ms
  • Original (non-parallel) version: 0.054ms

What a bummer! I usually try to get well into the seconds when measuring performance, but unfortunately the overflows start much sooner for this example, and a millisecond is an eternity for current processors anyway, so these numbers will do. Since I have quoted Sanjiv so prominently, the results should come as no surprise at this point: by parallelizing the code, we have made it an order of magnitude slower. Even Jakub’s version (which is twice as fast as Clay’s for this architecture and number of threads) is still way behind the original one.

What are the reasons here? I suspect it’s the overhead of creating the threads, coupled with the very expensive power function, which still needs to be calculated at least once per thread. But without examining the example more closely, e.g. with a parallel profiler, it’s hard to tell.

Let me summarize the moral of the story: loop carried dependencies look like a show stopper for parallelism, but sometimes they are not and can be worked around. Unfortunately, that alone is no guarantee of good performance. Sometimes the smartest optimization there is when it comes to parallelism – is not to use it. And this is a lesson well worth learning. I find it so useful that sometimes my students get assignments from me for which no satisfactory parallel solution that is faster than the sequential one can be found. This is always very confusing for them. After all, we have been telling them so much about parallelism and performance, only for them to discover that it is not the cure for everything. “Life is not a lollipop”, as my colleague Bjoern likes to say in moments like these ;).

7 Responses to Parallel Programming Fun with Loop Carried Dependencies


  1. Comment by Bernd | 2007/05/02 at 13:39:29

    Hi Michael,

    If you need more iterations for better statistics, why not reduce the multiplier ‘up’ from 1.1 to something like 1.001 (or even smaller)? If floating point overflow is your concern, this should allow for about 700,000 iterations.


  2. Comment by Jason | 2007/05/02 at 20:40:48

    If the cost is in the library call, you could write a loop to find the power, but it seems like you are going from O(N) to O(N^2), so unless you have N processors…

  3. Comment by Sanjiv | 2007/05/03 at 07:31:15

    I feel compelled to point out a couple of caveats which I always used to do back when I was doing OpenMP tutorials:

    1) there are some instances when parallelizing such loops makes sense – the classic case is to cause a side effect, like data distribution across multiple processors or to preserve a data distribution from previous loops. This particular case doesn’t access much data, so the point is moot here.

    2) It sometimes makes sense to parallelize loops with recurrences, even when the recurrence cannot be eliminated as it can in this case. Typical examples are when a LARGE, otherwise fully parallel loop contains a small number of recurrences, i.e., the ratio of parallel to synchronized code is large, especially on small numbers of processors.

    Good post, Michael!

  4. Comment by Hicham | 2007/05/03 at 10:52:00

    To allow a larger N, you can change the allocation of opt to dynamic, because the usual stack size limit is low, and change the multiplicative factor to a number much closer to 1, as suggested above, say 1.0000001…. You can then try N in the range of tens of millions, assuming you have more than 100 MB of RAM.

    An important factor is how much time is spent in user code and how much in system code. A utility like time can be used to determine that. The higher the proportion of time spent in user code, the higher the benefit of parallelization.

    In this particular case, it seems the memory access opt[n] = Sn; is the costly part, which reduces the benefit of parallelization.

  5. Comment by vabun | 2007/05/12 at 04:39:19

    Loop carried dependencies are made for caching. With a too atomic parallel approach this is ignored. You might want to use something like Distributed Carried Dependencies (I just made that up):

    const double up = 1.1 ;
    double Sn=1000.0;
    double opt[N+1];
    int n, n_seq, seq_dim, hardware_threads;
    hardware_threads = /* insert number of hardware-parallel executable
                          threads (dependent on the number of cores,
                          floating point units etc.) */;
    seq_dim = N / hardware_threads; /* hoping N % hardware_threads == 0 */
    opt[0] = Sn;
    #pragma omp parallel for private(n_seq) num_threads(hardware_threads)
    for (n = 0; n < hardware_threads; n++) {
        for (n_seq = 2 + n*seq_dim; n_seq <= (n+1)*seq_dim; n_seq++) {
            opt[n_seq] = opt[n_seq-1] * up;
        }
    }
    Sn = opt[N] * up;

Trackbacks & Pingbacks »»

  1. Trackback by Codeplay | 2007/05/08 at 13:39:39

    Loop Carried Dependencies in Sieve…

    Recently Michael Suess over at Thinking Parallel posted an interesting article on resolving Loop Carried Dependencies using OpenMP. These dependencies are so named because variables depend on previous iterations within a loop. If one wants to paralleli…

  2. […] This is a great example of why neither of the simplistic approaches to parallelization (“everything’s a future” or “let the programmer decide”) will ultimately prevail and how something akin to run-time optimization (a la HotSpot) will have to be used. […]

Comments are closed