<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: Parallel Programming Fun with Loop Carried Dependencies</title>
	<atom:link href="http://www.thinkingparallel.com/2007/05/02/parallel-programming-fun-with-loop-carried-dependencies/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.thinkingparallel.com/2007/05/02/parallel-programming-fun-with-loop-carried-dependencies/</link>
	<description>A Blog on Parallel Programming and Concurrency by Michael Suess</description>
	<pubDate>Wed, 09 Jul 2008 02:29:12 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5</generator>
		<item>
		<title>By: vabun</title>
		<link>http://www.thinkingparallel.com/2007/05/02/parallel-programming-fun-with-loop-carried-dependencies/#comment-8846</link>
		<dc:creator>vabun</dc:creator>
		<pubDate>Sat, 12 May 2007 02:39:19 +0000</pubDate>
		<guid isPermaLink="false">http://www.thinkingparallel.com/2007/05/02/parallel-programming-fun-with-loop-carried-dependencies/#comment-8846</guid>
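		<description>Loop carried dependencies are made for caching.
A too atomic (fine-grained) parallel approach ignores this.
You might want to use something like
distributed carried dependencies (I just made that up): give each hardware thread one contiguous chunk of the array, compute the first element of each chunk directly with pow(), and fill in the rest of the chunk with the cheap recurrence.

&lt;code&gt;
#include &lt;math.h&gt; /* pow() */
#include &lt;omp.h&gt;  /* omp_get_num_procs() */

const double up = 1.1;
double Sn = 1000.0;
double opt[N+1];
int n, n_seq, seq_dim, hardware_threads;
/* number of hardware threads that can run in parallel
   (depends on the number of cores, floating-point units etc.) */
hardware_threads = omp_get_num_procs();
seq_dim = N/hardware_threads; /* assumes N % hardware_threads == 0 */
opt[0] = Sn;
#pragma omp parallel for private(n, n_seq) num_threads(hardware_threads)
for (n = 0; n &lt; hardware_threads; n++) {
  /* first element of chunk n, computed directly */
  opt[1+n*seq_dim] = opt[0]*pow(up, 1+n*seq_dim);
  /* rest of chunk n, via the cheap recurrence */
  for (n_seq = 2+n*seq_dim; n_seq &lt;= (n+1)*seq_dim; n_seq++) {
    opt[n_seq] = opt[n_seq-1]*up;
  }
}
Sn = opt[N]*up;
&lt;/code&gt;</description>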
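		<content:encoded><![CDATA[<p>Loop carried dependencies are made for caching.<br />
A too atomic (fine-grained) parallel approach ignores this.<br />
You might want to use something like<br />
distributed carried dependencies (I just made that up): give each hardware thread one contiguous chunk of the array, compute the first element of each chunk directly with <code>pow()</code>, and fill in the rest of the chunk with the cheap recurrence.</p>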
<pre><code>#include &lt;math.h&gt; /* pow() */
#include &lt;omp.h&gt;  /* omp_get_num_procs() */

const double up = 1.1;
double Sn = 1000.0;
double opt[N+1];
int n, n_seq, seq_dim, hardware_threads;
/* number of hardware threads that can run in parallel
   (depends on the number of cores, floating-point units etc.) */
hardware_threads = omp_get_num_procs();
seq_dim = N/hardware_threads; /* assumes N % hardware_threads == 0 */
opt[0] = Sn;
#pragma omp parallel for private(n, n_seq) num_threads(hardware_threads)
for (n = 0; n &lt; hardware_threads; n++) {
  /* first element of chunk n, computed directly */
  opt[1+n*seq_dim] = opt[0]*pow(up, 1+n*seq_dim);
  /* rest of chunk n, via the cheap recurrence */
  for (n_seq = 2+n*seq_dim; n_seq &lt;= (n+1)*seq_dim; n_seq++) {
    opt[n_seq] = opt[n_seq-1]*up;
  }
}
Sn = opt[N]*up;
</code></pre>
]]></content:encoded>
	</item>
	<item>
		<title>By: Codeplay</title>
		<link>http://www.thinkingparallel.com/2007/05/02/parallel-programming-fun-with-loop-carried-dependencies/#comment-8500</link>
		<dc:creator>Codeplay</dc:creator>
		<pubDate>Tue, 08 May 2007 11:39:39 +0000</pubDate>
		<guid isPermaLink="false">http://www.thinkingparallel.com/2007/05/02/parallel-programming-fun-with-loop-carried-dependencies/#comment-8500</guid>
		<description>&lt;strong&gt;Loop Carried Dependencies in Sieve...&lt;/strong&gt;

Recently Michael Suess over at Thinking Parallel posted an interesting article on resolving Loop Carried Dependencies using OpenMP. These dependencies are so named because variables depend on previous iterations within a loop. If one wants to paralleli...</description>
		<content:encoded><![CDATA[<p><strong>Loop Carried Dependencies in Sieve...</strong></p>
<p>Recently Michael Suess over at Thinking Parallel posted an interesting article on resolving Loop Carried Dependencies using OpenMP. These dependencies are so named because variables depend on previous iterations within a loop. If one wants to paralleli...</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Hicham</title>
		<link>http://www.thinkingparallel.com/2007/05/02/parallel-programming-fun-with-loop-carried-dependencies/#comment-8135</link>
		<dc:creator>Hicham</dc:creator>
		<pubDate>Thu, 03 May 2007 08:52:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.thinkingparallel.com/2007/05/02/parallel-programming-fun-with-loop-carried-dependencies/#comment-8135</guid>
		<description>To get a larger N, you can allocate opt dynamically, because the usual stack size limit is low, and change the multiplicative factor to a number much closer to 1, as suggested above, say 1.0000001.
You can then try N in the range of tens of millions, assuming you have more than 100 MB of RAM.

An important factor is how much time is spent in user code and how much in system code. A utility such as time can be used to determine that.
The higher the proportion of time spent in user code, the higher the benefit of parallelization.

In this particular case, the memory access
    opt[n] = Sn;
seems to be the costly part, which reduces the benefit of parallelization.</description>
		<content:encoded><![CDATA[<p>To get a larger N, you can allocate opt dynamically, because the usual stack size limit is low, and change the multiplicative factor to a number much closer to 1, as suggested above, say 1.0000001.<br />
You can then try N in the range of tens of millions, assuming you have more than 100 MB of RAM.</p>
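A minimal sketch of the heap allocation, assuming the same fill loop as in the post (make_opt is a hypothetical name): malloc() takes the array off the stack, so its size is bounded by available memory rather than by the stack limit. With N = 10,000,000 the array of doubles takes about 80 MB.

```c
#include <stdlib.h>

/* Hypothetical helper: builds opt[0..N] with opt[n] = Sn0 * up^n on the heap.
   Returns NULL if the allocation fails; the caller must free() the result. */
double *make_opt(long N, double Sn0, double up)
{
    double *opt = malloc((size_t)(N + 1) * sizeof *opt);  /* heap, not stack */
    if (opt == NULL)
        return NULL;
    double Sn = Sn0;
    for (long n = 0; n <= N; n++) {
        opt[n] = Sn;
        Sn *= up;
    }
    return opt;
}
```

With up = 1.0000001 the values grow slowly enough that even billions of iterations stay far from overflow.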
<p>An important factor is how much time is spent in user code and how much in system code. A utility such as <code>time</code> can be used to determine that.<br />
The higher the proportion of time spent in user code, the higher the benefit of parallelization.</p>
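Within a program, the same user/system split can be measured for a single region. A sketch using POSIX getrusage(), so it is POSIX-only; cpu_times and tv_seconds are hypothetical helper names.

```c
#include <sys/resource.h>
#include <sys/time.h>

/* Converts a struct timeval to seconds. */
static double tv_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Hypothetical helper: user and system CPU time (in seconds) consumed so far
   by this process. With RUSAGE_SELF the figures cover all of the process's
   threads, so OpenMP worker threads are included.
   Returns 0 on success, -1 if getrusage() fails. */
int cpu_times(double *user, double *sys)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    *user = tv_seconds(ru.ru_utime);
    *sys  = tv_seconds(ru.ru_stime);
    return 0;
}
```

Calling cpu_times() before and after the loop and subtracting gives the user and system time spent in that loop alone, roughly what time reports for a whole program.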
<p>In this particular case, the memory access<br />
<code>opt[n] = Sn;</code><br />
seems to be the costly part, which reduces the benefit of parallelization.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sanjiv</title>
		<link>http://www.thinkingparallel.com/2007/05/02/parallel-programming-fun-with-loop-carried-dependencies/#comment-8125</link>
		<dc:creator>Sanjiv</dc:creator>
		<pubDate>Thu, 03 May 2007 05:31:15 +0000</pubDate>
		<guid isPermaLink="false">http://www.thinkingparallel.com/2007/05/02/parallel-programming-fun-with-loop-carried-dependencies/#comment-8125</guid>
		<description>I feel compelled to point out a couple of caveats, as I always used to back when I was giving OpenMP tutorials:

1) There are some instances when parallelizing such loops makes sense: the classic case is to cause a side effect, like distributing data across multiple processors or preserving a data distribution from previous loops. This particular case doesn't access much data, so the point is moot here.

2) It sometimes makes sense to parallelize loops with recurrences, even when the recurrence cannot be eliminated the way it can in this case. Typical examples are LARGE, otherwise fully parallel loops that contain only a small amount of recurrence, i.e., where the ratio of parallel to synchronized code is large, especially on small numbers of processors.

Good post, Michael!</description>
		<content:encoded><![CDATA[<p>I feel compelled to point out a couple of caveats, as I always used to back when I was giving OpenMP tutorials:</p>
<p>1) There are some instances when parallelizing such loops makes sense: the classic case is to cause a side effect, like distributing data across multiple processors or preserving a data distribution from previous loops. This particular case doesn't access much data, so the point is moot here.</p>
<p>2) It sometimes makes sense to parallelize loops with recurrences, even when the recurrence cannot be eliminated the way it can in this case. Typical examples are LARGE, otherwise fully parallel loops that contain only a small amount of recurrence, i.e., where the ratio of parallel to synchronized code is large, especially on small numbers of processors.</p>
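A minimal sketch of the second case using the OpenMP ordered construct: the expensive part of each iteration runs in parallel, and only the small recurrence is executed in iteration order. expensive() and prefix_of_work() are hypothetical stand-ins, not code from the post.

```c
/* Hypothetical stand-in for the large, fully parallel part of each iteration. */
static double expensive(int i)
{
    double x = 0.0;
    for (int k = 1; k <= 1000; k++)
        x += (double)(i % 7) / k;
    return x;
}

/* prefix[i] = expensive(0) + ... + expensive(i).
   The running sum is a recurrence across iterations; the ordered region runs
   it one iteration at a time, in loop order, while the calls to expensive()
   can overlap on different threads. */
void prefix_of_work(double *prefix, int n)
{
    double running = 0.0;   /* shared; only touched inside the ordered region */
    #pragma omp parallel for ordered schedule(static, 1)
    for (int i = 0; i < n; i++) {
        double w = expensive(i);        /* parallel part */
        #pragma omp ordered
        {
            running += w;               /* serialized recurrence */
            prefix[i] = running;
        }
    }
}
```

This only pays off when expensive() dominates each iteration; if the ordered part is a large fraction, the threads mostly wait for each other. schedule(static, 1) hands out iterations round-robin, so threads tend to reach the ordered region in turn.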
<p>Good post, Michael!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jason</title>
		<link>http://www.thinkingparallel.com/2007/05/02/parallel-programming-fun-with-loop-carried-dependencies/#comment-8096</link>
		<dc:creator>Jason</dc:creator>
		<pubDate>Wed, 02 May 2007 18:40:48 +0000</pubDate>
		<guid isPermaLink="false">http://www.thinkingparallel.com/2007/05/02/parallel-programming-fun-with-loop-carried-dependencies/#comment-8096</guid>
		<description>If the cost is in the library call, you could write a loop to compute the power yourself, but it seems like you would be going from O(N) to O(N^2), so unless you have N processors...</description>
		<content:encoded><![CDATA[<p>If the cost is in the library call, you could write a loop to compute the power yourself, but it seems like you would be going from O(N) to O(N^2), so unless you have N processors...</p>
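A naive loop of n multiplications for element n does add up to O(N^2) work for the whole array. A technique the comment does not mention, exponentiation by squaring, needs only O(log n) multiplications per element, so O(N log N) in total. A minimal sketch (power_by_squaring is a hypothetical name); its rounding may differ slightly from pow().

```c
/* x^n by exponentiation by squaring: O(log n) multiplications
   instead of n for a naive repeated-multiplication loop. */
double power_by_squaring(double x, unsigned n)
{
    double result = 1.0;
    while (n > 0) {
        if (n & 1u)
            result *= x;   /* include the factor for this bit of n */
        x *= x;            /* x, x^2, x^4, x^8, ... */
        n >>= 1;
    }
    return result;
}
```

Each element is still computed independently, so a loop filling opt[n] = Sn0 * power_by_squaring(up, n) stays free of loop-carried dependencies.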
]]></content:encoded>
	</item>
	<item>
		<title>By: Bernd</title>
		<link>http://www.thinkingparallel.com/2007/05/02/parallel-programming-fun-with-loop-carried-dependencies/#comment-8073</link>
		<dc:creator>Bernd</dc:creator>
		<pubDate>Wed, 02 May 2007 11:39:29 +0000</pubDate>
		<guid isPermaLink="false">http://www.thinkingparallel.com/2007/05/02/parallel-programming-fun-with-loop-carried-dependencies/#comment-8073</guid>
		<description>Hi Michael,

If you need more iterations for better statistics, why not reduce the multiplier 'up' from 1.1 to something like 1.001 (or even smaller)? If floating-point overflow is your concern, this should allow for about 700000 iterations.

Bernd</description>
		<content:encoded><![CDATA[<p>Hi Michael,</p>
<p>If you need more iterations for better statistics, why not reduce the multiplier 'up' from 1.1 to something like 1.001 (or even smaller)? If floating-point overflow is your concern, this should allow for about 700000 iterations.</p>
<p>Bernd</p>
]]></content:encoded>
	</item>
</channel>
</rss>
