<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: Thoughts on Larry O&#8217;Briens article at devx.com</title>
	<atom:link href="http://www.thinkingparallel.com/2006/08/24/thoughts-on-larry-obriens-article-at-devxcom/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.thinkingparallel.com/2006/08/24/thoughts-on-larry-obriens-article-at-devxcom/</link>
	<description>A Blog on Parallel Programming and Concurrency by Michael Suess</description>
	<pubDate>Wed, 09 Jul 2008 02:35:59 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5</generator>
		<item>
		<title>By: Joe</title>
		<link>http://www.thinkingparallel.com/2006/08/24/thoughts-on-larry-obriens-article-at-devxcom/#comment-30</link>
		<dc:creator>Joe</dc:creator>
		<pubDate>Mon, 28 Aug 2006 18:30:57 +0000</pubDate>
		<guid isPermaLink="false">http://www.thinkingparallel.com/2006/08/24/thoughts-on-larry-obriens-article-at-devxcom/#comment-30</guid>
		<description>Hi Christopher:

  I like the idea of measuring multi-core systems rather than postulating the overall impact.  More than a year ago, we ran some experiments and wrote some white papers on what we found for "normal" threaded apps using OpenMP, pthreads, and other models on a single node with dual Opteron 275 processors.  While there was contention, as you can see in the papers, it was not a major issue.

  That does not mean it isn't a major issue for some codes.  Any code that saturates the memory pipeline (or any other shared resource) is likely to have serious contention issues.  What I measured was the performance of codes built for parallel execution, run on a parallel machine with realistic data sets.

  You can pull the PDFs from the bottom of my company home page (&lt;a href="http://www.scalableinformatics.com" rel="nofollow"&gt;http://www.scalableinformatics.com&lt;/a&gt;).  We are working on expanding the testing.  Quad cores will, I believe, have a more severe contention issue, and we are trying to develop tests which allow us to explore this with real codes.

Joe</description>
		<content:encoded><![CDATA[<p>Hi Christopher:</p>
<p>  I like the idea of measuring multi-core systems rather than postulating the overall impact.  More than a year ago, we ran some experiments and wrote some white papers on what we found for &#8220;normal&#8221; threaded apps using OpenMP, pthreads, and other models on a single node with dual Opteron 275 processors.  While there was contention, as you can see in the papers, it was not a major issue.</p>
<p>  That does not mean it isn&#8217;t a major issue for some codes.  Any code that saturates the memory pipeline (or any other shared resource) is likely to have serious contention issues.  What I measured was the performance of codes built for parallel execution, run on a parallel machine with realistic data sets.</p>
<p>  You can pull the PDFs from the bottom of my company home page (<a href="http://www.scalableinformatics.com" rel="nofollow">http://www.scalableinformatics.com</a>).  We are working on expanding the testing.  Quad cores will, I believe, have a more severe contention issue, and we are trying to develop tests which allow us to explore this with real codes.</p>
<p>Joe</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Michael Suess</title>
		<link>http://www.thinkingparallel.com/2006/08/24/thoughts-on-larry-obriens-article-at-devxcom/#comment-25</link>
		<dc:creator>Michael Suess</dc:creator>
		<pubDate>Fri, 25 Aug 2006 08:13:30 +0000</pubDate>
		<guid isPermaLink="false">http://www.thinkingparallel.com/2006/08/24/thoughts-on-larry-obriens-article-at-devxcom/#comment-25</guid>
		<description>Christopher:
I am not sure I fully understand your comment. That said, I think Larry knows where superlinear speedups come from ;-).  And I do not see a reason why they should be any less common with multicore chips, because they usually have more cache than their single-core counterparts. But as I said, I am not even sure you were suggesting that :-).

What I am sure about, though, is that you are probably right in your assumption that 32-core CPUs are not going to be very efficient without changes to the memory subsystems. I am also sure that the smart people designing these processors will come up with a solution. Maybe we will see NUMAs on a chip by that time :-).

I also agree in part with your comments regarding locality - this is very important for performance. Nevertheless, Larry stated in his article that it is not about "pedal-to-the-metal" optimizations (and unfortunately, I still regard locality optimization as one of those as long as there are no better tools to help). Therefore, I do not see locality optimizations as a prerequisite for parallelization, but rather as an orthogonal technique. Both can be done independently to gain performance, and for &lt;em&gt;best&lt;/em&gt; results on multi-core architectures you need to employ both - but on the other hand, sometimes you do not need &lt;em&gt;best&lt;/em&gt; results, and in that case I do not see anything wrong with doing only parallelization or only locality optimization.

Cobo:
This blog is about parallelization in general and I will be blogging about other parallel programming systems - or else I would have probably called it &lt;em&gt;Thinking OpenMP&lt;/em&gt; :-). This blog has an emphasis on OpenMP because it is the system I know the most about. But I am always interested in playing with other systems, and in fact I went through the first half of the Erlang tutorial again just yesterday to refresh my memory. You will definitely see posts about other systems. You may even see them very soon, since I am going to &lt;a href="http://www.europar2006.de/" rel="nofollow"&gt;Europar&lt;/a&gt; next week and there are talks about a lot of exciting systems happening there...</description>
		<content:encoded><![CDATA[<p>Christopher:<br />
I am not sure I fully understand your comment. That said, I think Larry knows where superlinear speedups come from ;-).  And I do not see a reason why they should be any less common with multicore chips, because they usually have more cache than their single-core counterparts. But as I said, I am not even sure you were suggesting that :-).</p>
<p>What I am sure about, though, is that you are probably right in your assumption that 32-core CPUs are not going to be very efficient without changes to the memory subsystems. I am also sure that the smart people designing these processors will come up with a solution. Maybe we will see NUMAs on a chip by that time :-).</p>
<p>I also agree in part with your comments regarding locality - this is very important for performance. Nevertheless, Larry stated in his article that it is not about &#8220;pedal-to-the-metal&#8221; optimizations (and unfortunately, I still regard locality optimization as one of those as long as there are no better tools to help). Therefore, I do not see locality optimizations as a prerequisite for parallelization, but rather as an orthogonal technique. Both can be done independently to gain performance, and for <em>best</em> results on multi-core architectures you need to employ both - but on the other hand, sometimes you do not need <em>best</em> results, and in that case I do not see anything wrong with doing only parallelization or only locality optimization.</p>
<p>Cobo:<br />
This blog is about parallelization in general and I will be blogging about other parallel programming systems - or else I would have probably called it <em>Thinking OpenMP</em> :-). This blog has an emphasis on OpenMP because it is the system I know the most about. But I am always interested in playing with other systems, and in fact I went through the first half of the Erlang tutorial again just yesterday to refresh my memory. You will definitely see posts about other systems. You may even see them very soon, since I am going to <a href="http://www.europar2006.de/" rel="nofollow">Europar</a> next week and there are talks about a lot of exciting systems happening there&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Cobo</title>
		<link>http://www.thinkingparallel.com/2006/08/24/thoughts-on-larry-obriens-article-at-devxcom/#comment-24</link>
		<dc:creator>Cobo</dc:creator>
		<pubDate>Fri, 25 Aug 2006 00:20:06 +0000</pubDate>
		<guid isPermaLink="false">http://www.thinkingparallel.com/2006/08/24/thoughts-on-larry-obriens-article-at-devxcom/#comment-24</guid>
		<description>To Michael:
I have been wanting to ask you this for a long time, but thought it wasn't the right time. Now that you have mentioned it... Do you have any background in, or plans to blog about, other parallelization techniques and languages such as the aforementioned Erlang?
I find things like its process implementation quite interesting, since it seems to be much more efficient than mapping to OS threads, and all that kind of stuff.
Also (and I suppose as a consequence of my inexperience), I see OpenMP as a great short-term solution for existing sequential languages, but I think the future belongs to languages where you don't have to consider which blocks of code should be concurrent, and instead have concurrency by default.

Just the thoughts of a newbie...

Both Larry's article and Christopher's comment on locality of reference and future data fetching from memory (an interesting topic in its own right) are very interesting.

Cheers!</description>
		<content:encoded><![CDATA[<p>To Michael:<br />
I have been wanting to ask you this for a long time, but thought it wasn&#8217;t the right time. Now that you have mentioned it&#8230; Do you have any background in, or plans to blog about, other parallelization techniques and languages such as the aforementioned Erlang?<br />
I find things like its process implementation quite interesting, since it seems to be much more efficient than mapping to OS threads, and all that kind of stuff.<br />
Also (and I suppose as a consequence of my inexperience), I see OpenMP as a great short-term solution for existing sequential languages, but I think the future belongs to languages where you don&#8217;t have to consider which blocks of code should be concurrent, and instead have concurrency by default.</p>
<p>Just the thoughts of a newbie&#8230;</p>
<p>Both Larry&#8217;s article and Christopher&#8217;s comment on locality of reference and future data fetching from memory (an interesting topic in its own right) are very interesting.</p>
<p>Cheers!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Christopher Aycock</title>
		<link>http://www.thinkingparallel.com/2006/08/24/thoughts-on-larry-obriens-article-at-devxcom/#comment-23</link>
		<dc:creator>Christopher Aycock</dc:creator>
		<pubDate>Thu, 24 Aug 2006 23:31:51 +0000</pubDate>
		<guid isPermaLink="false">http://www.thinkingparallel.com/2006/08/24/thoughts-on-larry-obriens-article-at-devxcom/#comment-23</guid>
		<description>Larry, superlinear speedup comes from better cache exploitation; memory-intensive applications work better on multiple processors simply because there's more cache. As for multicore chips, it's more likely the case that improved locality (from sharing memory regions among the threads) is the primary contributor.

There is the contention issue to worry about on an SMP, which is why I doubt that 32-core CPUs will ever make it without a serious change to the way data is fetched from memory.

Therefore, if you really want to speed up an application, improve the locality of reference. There are a number of tricks that unfortunately aren't exploited by most compilers, such as a simple reordering of iterations in a nested loop. Only once the cache misses have been significantly reduced should a programmer investigate multicore execution.</description>
		<content:encoded><![CDATA[<p>Larry, superlinear speedup comes from better cache exploitation; memory-intensive applications work better on multiple processors simply because there&#8217;s more cache. As for multicore chips, it&#8217;s more likely the case that improved locality (from sharing memory regions among the threads) is the primary contributor.</p>
<p>There is the contention issue to worry about on an SMP, which is why I doubt that 32-core CPUs will ever make it without a serious change to the way data is fetched from memory.</p>
<p>Therefore, if you really want to speed up an application, improve the locality of reference. There are a number of tricks that unfortunately aren&#8217;t exploited by most compilers, such as a simple reordering of iterations in a nested loop. Only once the cache misses have been significantly reduced should a programmer investigate multicore execution.</p>
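<p>A minimal sketch of the kind of loop reordering mentioned above, assuming a matrix stored in row-major order, as in C. The function names sum_strided and sum_contiguous are hypothetical and chosen only for illustration. Both functions visit every element once; the second simply interchanges the two loops so that consecutive iterations touch adjacent memory, which typically reduces cache misses on large matrices.</p>

```c
#include <stddef.h>

/* Sums a rows x cols matrix stored in row-major order.
 * The inner loop walks down a column, so consecutive accesses are
 * `cols` elements apart and tend to land on different cache lines. */
double sum_strided(const double *m, size_t rows, size_t cols)
{
    double total = 0.0;
    for (size_t j = 0; j < cols; j++) {
        for (size_t i = 0; i < rows; i++) {
            total += m[i * cols + j];
        }
    }
    return total;
}

/* Same computation with the two loops interchanged.
 * The inner loop now walks along a row, so consecutive accesses are
 * adjacent in memory (unit stride) and reuse each fetched cache line.
 * Note: the additions happen in a different order, so floating-point
 * rounding may differ slightly from sum_strided on general data. */
double sum_contiguous(const double *m, size_t rows, size_t cols)
{
    double total = 0.0;
    for (size_t i = 0; i < rows; i++) {
        for (size_t j = 0; j < cols; j++) {
            total += m[i * cols + j];
        }
    }
    return total;
}
```

<p>Whether a compiler performs this interchange on its own depends on the compiler and the optimization level, so timing both versions on the target machine is the safest way to judge the benefit.</p>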
]]></content:encoded>
	</item>
	<item>
		<title>By: Larry O'Brien</title>
		<link>http://www.thinkingparallel.com/2006/08/24/thoughts-on-larry-obriens-article-at-devxcom/#comment-22</link>
		<dc:creator>Larry O'Brien</dc:creator>
		<pubDate>Wed, 23 Aug 2006 23:27:04 +0000</pubDate>
		<guid isPermaLink="false">http://www.thinkingparallel.com/2006/08/24/thoughts-on-larry-obriens-article-at-devxcom/#comment-22</guid>
		<description>Thanks for the comments. Re "the only way..." well, yeah, it _is_ a hook and the way I justify it is the "mainstream" qualifier. There are _lots_ of programming languages that have models that allow for automatic parallelization, but I think it's fair to say that they are not in the mainstream. Whether they will _enter_ the mainstream is another question... 

As to the 70% number, I guess I did use it a couple times, because it happened to be the number I saw on this example and a couple others I've recently measured. Familiarity breeds... I don't know ... expectation, I guess.

Even since writing the article, I have looked at some results showing superlinear speedup. It seems more common today than it was "back in the day." I don't know if that's a reflection of more complexity in computer architectures or a reflection of more sophisticated algorithms. A little of both, I suppose. 

Anyway, thanks for the comments!</description>
		<content:encoded><![CDATA[<p>Thanks for the comments. Re &#8220;the only way&#8230;&#8221; well, yeah, it _is_ a hook and the way I justify it is the &#8220;mainstream&#8221; qualifier. There are _lots_ of programming languages that have models that allow for automatic parallelization, but I think it&#8217;s fair to say that they are not in the mainstream. Whether they will _enter_ the mainstream is another question&#8230; </p>
<p>As to the 70% number, I guess I did use it a couple times, because it happened to be the number I saw on this example and a couple others I&#8217;ve recently measured. Familiarity breeds&#8230; I don&#8217;t know &#8230; expectation, I guess.</p>
<p>Even since writing the article, I have looked at some results showing superlinear speedup. It seems more common today than it was &#8220;back in the day.&#8221; I don&#8217;t know if that&#8217;s a reflection of more complexity in computer architectures or a reflection of more sophisticated algorithms. A little of both, I suppose. </p>
<p>Anyway, thanks for the comments!</p>
]]></content:encoded>
	</item>
</channel>
</rss>
