Comments on: Parallel Programming Fun with Loop Carried Dependencies

By: Knowing .NET » Blog Archive » Thread Creation Overhead Can Trip Up Pros

Wed, 08 Jul 2009 13:25:27 +0000

[…] This is a great example of why neither of the simplistic approaches to parallelization (http://www.knowing.net/CommentView%2cguid%2c677872c4-05cd-48f1-bbfb-237f3a0c8b05.aspx), “everything’s a future” or “let the programmer decide”, will ultimately prevail and how something akin to run-time optimization (a la HotSpot) will have to be used. […]

By: vabun

vabun — Sat, 12 May 2007 02:39:19 +0000

Loop-carried dependencies are made for caching.
A parallel approach that is too atomic ignores this.
You might want to use something like
"Distributed Carried Dependencies" (I just made that up):

#include <math.h>  /* pow */

const double up = 1.1;
double Sn = 1000.0;
double opt[N+1];
int n, n_seq, seq_dim, hardware_threads;
hardware_threads = /* insert the number of threads the hardware can run in parallel (depends on the number of cores, floating-point units, etc.) here */;
seq_dim = N / hardware_threads;  /* hoping N % hardware_threads == 0 */
opt[0] = Sn;
/* each thread seeds its own chunk with pow(), then runs the recurrence inside the chunk */
#pragma omp parallel for private(n, n_seq) num_threads(hardware_threads)
for (n = 0; n < hardware_threads; n++) {
  opt[1+n*seq_dim] = opt[0] * pow(up, 1+n*seq_dim);
  for (n_seq = 2+n*seq_dim; n_seq <= (n+1)*seq_dim; n_seq++) {
    opt[n_seq] = opt[n_seq-1] * up;
  }
}
Sn = opt[N] * up;

By: Codeplay

Codeplay — Tue, 08 May 2007 11:39:39 +0000

Loop Carried Dependencies in Sieve…

Recently Michael Suess over at Thinking Parallel posted an interesting article on resolving Loop Carried Dependencies using OpenMP. These dependencies are so named because variables depend on previous iterations within a loop. If one wants to paralleli…

By: Hicham

Hicham — Thu, 03 May 2007 08:52:00 +0000

To use a larger N, you can allocate opt dynamically,
because the usual stack size limit is low, and change the multiplicative factor to a number much closer to 1, as suggested above, say 1.0000001.
You can then try N in the range of tens of millions, assuming you have more than 100 MB of RAM.

An important factor is how much time is spent in user code, and how much in system code. A time-like utility can be used to determine that.
The higher the proportion of time spent in user code, the higher the benefit of parallelization.

In this particular case, it seems memory access
opt [n] = Sn;
is the costly part, which reduces the benefit of parallelization.
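
A minimal sketch of the heap-allocation suggestion. It assumes the loop under discussion has the shape opt[n] = Sn; Sn *= up; (only the assignment is quoted above, the rest is an assumption), and the function name make_opt is hypothetical:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: builds opt[0..N] on the heap instead of the stack.
   Assumes the loop body is opt[n] = Sn; Sn *= up; (an assumption).
   Returns NULL if the allocation fails; the caller must free() the result. */
double *make_opt(size_t N, double Sn, double up)
{
    assert(up > 0.0);
    double *opt = malloc((N + 1) * sizeof *opt);
    if (opt == NULL)
        return NULL;
    for (size_t n = 0; n <= N; ++n) {
        opt[n] = Sn;   /* the memory store discussed above */
        Sn *= up;
    }
    return opt;
}
```

With N = 10,000,000 the array needs (N + 1) * 8 bytes, roughly 80 MB, which fits the "more than 100 MB" estimate. With up = 1.0000001 the final value stays near 1000 * e, so overflow is not a concern at that size.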

By: Sanjiv

Sanjiv — Thu, 03 May 2007 05:31:15 +0000

I feel compelled to point out a couple of caveats, as I always did back when I was giving OpenMP tutorials:

1) There are some instances when parallelizing such loops makes sense. The classic case is to cause a side effect, such as distributing data across multiple processors or preserving a data distribution from previous loops. This particular case doesn’t access much data, so the point is moot here.

2) It sometimes makes sense to parallelize loops with recurrences, even when the recurrence cannot be eliminated, as in this case. Typical examples are when a LARGE, otherwise fully parallel loop contains a small number of recurrences, i.e., the ratio of parallel to synchronized code is large, especially on small numbers of processors.
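
One way the second caveat might look in code is OpenMP's ordered construct. This is only a sketch: expensive_work is a hypothetical stand-in for the large parallel part, and the running sum stands in for the small recurrence.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the large, fully parallel part of the loop body. */
static double expensive_work(double x)
{
    return x * x;
}

/* Writes the running sum of expensive_work(in[i]) into out[i] and returns
   the final sum. The calls to expensive_work can run in parallel; only the
   short ordered region is executed one iteration at a time, in loop order. */
double running_sum_of_work(const double *in, double *out, size_t n)
{
    assert(in != NULL && out != NULL);
    double running = 0.0;
    #pragma omp parallel for ordered schedule(static, 1)
    for (long i = 0; i < (long)n; i++) {
        double v = expensive_work(in[i]);  /* parallel part */
        #pragma omp ordered
        {
            running += v;                  /* recurrence, kept sequential */
            out[i] = running;
        }
    }
    return running;
}
```

The larger the cost of expensive_work relative to the ordered region, the less the threads wait on each other. Compiled without OpenMP support, the pragmas are ignored and the loop simply runs sequentially.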

Good post, Michael!

By: Jason

Jason — Wed, 02 May 2007 18:40:48 +0000

If the cost is in the library call, you could write a loop to compute the power, but it seems like you would be going from O(N) to O(N^2), so unless you have N processors…
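
A sketch of the trade-off, with hypothetical function names. Computing up^n with a plain loop costs n multiplications, so filling all N entries independently costs on the order of N^2 in total. Repeated squaring, a different technique from the plain loop, needs only about log2(n) multiplications per entry and keeps every iteration independent:

```c
#include <assert.h>
#include <stddef.h>

/* up^n by repeated multiplication: n multiplications per call. */
double pow_loop(double up, unsigned long n)
{
    double result = 1.0;
    for (unsigned long k = 0; k < n; ++k)
        result *= up;
    return result;
}

/* up^n by repeated squaring: about log2(n) multiplications per call. */
double pow_squaring(double up, unsigned long n)
{
    double result = 1.0;
    while (n > 0) {
        if (n & 1UL)
            result *= up;
        up *= up;
        n >>= 1;
    }
    return result;
}

/* Fills opt[0..N] with Sn * up^n; no iteration depends on another,
   so the loop can be split across threads freely. */
void fill_opt(double *opt, size_t N, double Sn, double up)
{
    assert(opt != NULL);
    #pragma omp parallel for
    for (long n = 0; n <= (long)N; ++n)
        opt[n] = Sn * pow_squaring(up, (unsigned long)n);
}
```

The values may differ in the last bits from those of the sequential recurrence, because the multiplications are rounded in a different order.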

By: Bernd

Bernd — Wed, 02 May 2007 11:39:29 +0000

Hi Michael,

If you need more iterations for better statistics, why not reduce the multiplier ‘up’ from 1.1 to something like 1.001 (or even smaller)? If floating-point overflows are your concern, this should allow for about 700000 iterations.
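
The iteration estimate can be checked with a small sketch. It assumes double precision and a starting value of 1000.0, as in the code posted elsewhere in this thread; the function name max_iterations is hypothetical.

```c
#include <assert.h>
#include <float.h>

/* Counts how many times start can be multiplied by up before the
   next multiplication would exceed DBL_MAX. Assumes start > 0 and up > 1. */
long max_iterations(double start, double up)
{
    assert(start > 0.0 && up > 1.0);
    long n = 0;
    double x = start;
    while (x <= DBL_MAX / up) {
        x *= up;
        ++n;
    }
    return n;
}
```

Calling max_iterations(1000.0, 1.1) gives a count in the low thousands, while max_iterations(1000.0, 1.001) lands a little above 700000, in line with the estimate. The same follows from ln(DBL_MAX / 1000) / ln(up), with ln(DBL_MAX) being roughly 709.8.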

Bernd