<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	>
<channel>
	<title>Comments on: Matrix Optimization Gone Wrong</title>
	<atom:link href="http://www.thinkingparallel.com/2007/01/28/matrix-optimization-gone-wrong/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.thinkingparallel.com/2007/01/28/matrix-optimization-gone-wrong/</link>
	<description>A Blog on Parallel Programming and Concurrency by Michael Suess</description>
	<pubDate>Sat, 31 Jul 2010 08:30:05 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.7</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: MiguelB</title>
		<link>http://www.thinkingparallel.com/2007/01/28/matrix-optimization-gone-wrong/comment-page-1/#comment-1519</link>
		<dc:creator>MiguelB</dc:creator>
		<pubDate>Wed, 31 Jan 2007 17:33:44 +0000</pubDate>
		<guid isPermaLink="false">http://www.thinkingparallel.com/2007/01/28/matrix-optimization-gone-wrong/#comment-1519</guid>
		<description>I think the number of memory accesses is important too. The first program does NxN memory accesses, while the second one does Nx(2N-1). Memory access is by far the slowest operation a CPU can do.</description>
		<content:encoded><![CDATA[<p>I think the number of memory accesses is important too. The first program does NxN memory accesses, while the second one does Nx(2N-1). Memory access is by far the slowest operation a CPU can do.</p>
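<p>To put numbers on that, here is a rough C sketch that only counts the array accesses each variant asks for at the source level. It assumes the first program computes A[i][j] = 2*i + 3*j + 1 directly and the second one reads A[i][j-1] for every j &gt; 0. The function names are made up for illustration, and a compiler that keeps the previous value in a register may well remove the loads.</p>

```c
#include <stddef.h>

/* Hypothetical helper: array accesses for the direct formula,
   assumed to be A[i][j] = 2*i + 3*j + 1 (one store per element). */
size_t accesses_direct(size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            count += 1;          /* store to A[i][j] */
        }
    }
    return count;
}

/* Hypothetical helper: array accesses for the incremental form,
   assumed to be A[i][j] = A[i][j-1] + 3 for every j > 0. */
size_t accesses_incremental(size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            if (j > 0) {
                count += 1;      /* load of A[i][j-1] */
            }
            count += 1;          /* store to A[i][j] */
        }
    }
    return count;
}
```

<p>For n = 4 this gives 16 accesses for the direct form and 28 = 4*(2*4-1) for the incremental one.</p>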
]]></content:encoded>
	</item>
	<item>
		<title>By: Christopher</title>
		<link>http://www.thinkingparallel.com/2007/01/28/matrix-optimization-gone-wrong/comment-page-1/#comment-1404</link>
		<dc:creator>Christopher</dc:creator>
		<pubDate>Mon, 29 Jan 2007 06:14:54 +0000</pubDate>
		<guid isPermaLink="false">http://www.thinkingparallel.com/2007/01/28/matrix-optimization-gone-wrong/#comment-1404</guid>
		<description>I knew what the problem was as soon as I saw the conditionals: branching. Bjoern is right; just move the 2*i from the inner loop and the program will be fine. I suppose one could also try Duff's device to be fancy, or maybe even use OpenMP on a multi-core machine if the arrays are really huge.</description>
		<content:encoded><![CDATA[<p>I knew what the problem was as soon as I saw the conditionals: branching. Bjoern is right; just move the 2*i from the inner loop and the program will be fine. I suppose one could also try Duff&#8217;s device to be fancy, or maybe even use OpenMP on a multi-core machine if the arrays are really huge.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Bjoern Knafla</title>
		<link>http://www.thinkingparallel.com/2007/01/28/matrix-optimization-gone-wrong/comment-page-1/#comment-1396</link>
		<dc:creator>Bjoern Knafla</dc:creator>
		<pubDate>Mon, 29 Jan 2007 02:20:49 +0000</pubDate>
		<guid isPermaLink="false">http://www.thinkingparallel.com/2007/01/28/matrix-optimization-gone-wrong/#comment-1396</guid>
		<description>Okay, some "have fun with optimizations I don't know will work, while making the code harder to understand and making it harder for future compilers to do their magic" ;-)

// Include std::fill_n
#include &lt;algorithm&gt;

// Include std::size_t
#include &lt;cstddef&gt;

std::size_t const N = ????;

typedef int elem_t;
elem_t A[ N ][ N ];

// Set all elements of A to 1 to optimize away one add operation per loop.
// std::memset( A, 1, ... ) would not do this: memset fills bytes, so every 4-byte int would end up as 0x01010101 rather than 1.
std::fill_n( &amp;A[ 0 ][ 0 ], N * N, elem_t( 1 ) );

The rest just in words:
Initialize the array with 1 to save N*N add-operations.

Save the 2*i value in a const before entering the j-loop to save re-calculating it N times.
Perhaps try i &lt;&lt; 1 (shift-op) instead of 2*i, but a good compiler would optimize multiplying by 2 by itself.

Good night.</description>
		<content:encoded><![CDATA[<p>Okay, some &#8220;have fun with optimizations I don&#8217;t know will work, while making the code harder to understand and making it harder for future compilers to do their magic&#8221; <img src='http://www.thinkingparallel.com/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> </p>
<p>// Include std::fill_n<br />
#include &lt;algorithm&gt;</p>
<p>// Include std::size_t<br />
#include &lt;cstddef&gt;</p>
<p>std::size_t const N = ????;</p>
<p>typedef int elem_t;<br />
elem_t A[ N ][ N ];</p>
<p>// Set all elements of A to 1 to optimize away one add operation per loop.<br />
// std::memset( A, 1, ... ) would not do this: memset fills bytes, so every 4-byte int would end up as 0x01010101 rather than 1.<br />
std::fill_n( &amp;A[ 0 ][ 0 ], N * N, elem_t( 1 ) );</p>
<p>The rest just in words:<br />
Initialize the array with 1 to save N*N add-operations.</p>
<p>Save the 2*i value in a const before entering the j-loop to save re-calculating it N times.<br />
Perhaps try i &lt;&lt; 1 (shift-op) instead of 2*i, but a good compiler would optimize multiplying by 2 by itself.</p>
<p>Good night.</p>
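<p>On the memset question: a small C sketch of what actually happens, in case it helps. memset writes the given value into every byte, so an int filled this way is not 1. The helper names here are invented for the example, and the element-wise loop is just one plain alternative.</p>

```c
#include <stddef.h>
#include <string.h>

/* Fills a small int buffer with memset(..., 1, ...) and returns the
   first element. Every byte becomes 0x01, so with 4-byte ints the
   element holds 0x01010101, not 1. */
int memset_one_result(void) {
    int a[4];
    memset(a, 1, sizeof a);
    return a[0];
}

/* Plain element-wise fill: this really does store the int value 1. */
void fill_ones(int *a, size_t n) {
    for (size_t i = 0; i < n; i++) {
        a[i] = 1;
    }
}
```

<p>memset is still fine for zeroing, since an all-zero byte pattern is also the int value 0.</p>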
]]></content:encoded>
	</item>
	<item>
		<title>By: Mark Dominus</title>
		<link>http://www.thinkingparallel.com/2007/01/28/matrix-optimization-gone-wrong/comment-page-1/#comment-1387</link>
		<dc:creator>Mark Dominus</dc:creator>
		<pubDate>Sun, 28 Jan 2007 22:05:42 +0000</pubDate>
		<guid isPermaLink="false">http://www.thinkingparallel.com/2007/01/28/matrix-optimization-gone-wrong/#comment-1387</guid>
		<description>It seems that moving the j=0 case out of the loop produces a small but measurable improvement, on my machine, for the cases I tried.

I've placed my source code and makefile at http://www.plover.com/~mjd/misc/c/matrix-opt.tgz .</description>
		<content:encoded><![CDATA[<p>It seems that moving the j=0 case out of the loop produces a small but measurable improvement, on my machine, for the cases I tried.</p>
<p>I&#8217;ve placed my source code and makefile at <a href="http://www.plover.com/~mjd/misc/c/matrix-opt.tgz" rel="nofollow">http://www.plover.com/~mjd/misc/c/matrix-opt.tgz</a> .</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mark Dominus</title>
		<link>http://www.thinkingparallel.com/2007/01/28/matrix-optimization-gone-wrong/comment-page-1/#comment-1385</link>
		<dc:creator>Mark Dominus</dc:creator>
		<pubDate>Sun, 28 Jan 2007 21:39:40 +0000</pubDate>
		<guid isPermaLink="false">http://www.thinkingparallel.com/2007/01/28/matrix-optimization-gone-wrong/#comment-1385</guid>
		<description>If the problem is really with the added "if" test, then I wonder whether the following variation would be an improvement:

      for (i = 0; i &#60; N; i++) {
          A[i][0] = 2*i + 1;
          for (j = 1; j &#60; N; j++) {
              A[i][j] = A[i][j - 1] + 3;
          }
      }

Since the "if" test is only to distinguish the special case of j=0,
we can eliminate it by moving that case out of the loop.</description>
		<content:encoded><![CDATA[<p>If the problem is really with the added &#8220;if&#8221; test, then I wonder whether the following variation would be an improvement:</p>
<p>      for (i = 0; i &lt; N; i++) {<br />
          A[i][0] = 2*i + 1;<br />
          for (j = 1; j &lt; N; j++) {<br />
              A[i][j] = A[i][j - 1] + 3;<br />
          }<br />
      }</p>
<p>Since the &#8220;if&#8221; test is only to distinguish the special case of j=0,<br />
we can eliminate it by moving that case out of the loop.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: gwenhwyfaer</title>
		<link>http://www.thinkingparallel.com/2007/01/28/matrix-optimization-gone-wrong/comment-page-1/#comment-1380</link>
		<dc:creator>gwenhwyfaer</dc:creator>
		<pubDate>Sun, 28 Jan 2007 18:58:41 +0000</pubDate>
		<guid isPermaLink="false">http://www.thinkingparallel.com/2007/01/28/matrix-optimization-gone-wrong/#comment-1380</guid>
		<description>Ah. Just noticed that you say the calculation wasn't optimised away. And I don't think it's branching - there'll be a couple of mispredictions each go round the outer loop, but overall the jumps settle into repeated patterns.

In that case, I'd suggest that because of the way it's written, the reference to A[i][j-1] is forcing the CPU to wait until the write to it in the previous iteration is complete before it can read from it again. Presumably gcc spots that it can just use a temp instead and hoists it, but the Intel compiler doesn't.

Does replacing the section of code with:

   /* before the j loop: */
   int temp = 2*i + 1;

   /* inside the j loop: */
   if( j == 0 ) {
      A[i][j] = temp;
   } else {
      temp = temp + 3;
      A[i][j] = temp;
   }

improve matters? (I know the code is still clunky, but I'm trying to leave it as close to form 2 as possible.)</description>
		<content:encoded><![CDATA[<p>Ah. Just noticed that you say the calculation wasn&#8217;t optimised away. And I don&#8217;t think it&#8217;s branching - there&#8217;ll be a couple of mispredictions each go round the outer loop, but overall the jumps settle into repeated patterns.</p>
<p>In that case, I&#8217;d suggest that because of the way it&#8217;s written, the reference to A[i][j-1] is forcing the CPU to wait until the write to it in the previous iteration is complete before it can read from it again. Presumably gcc spots that it can just use a temp instead and hoists it, but the Intel compiler doesn&#8217;t.</p>
<p>Does replacing the section of code with:</p>
<p>   /* before the j loop: */<br />
   int temp = 2*i + 1;<br />
   /* inside the j loop: */<br />
   if( j == 0 ) {<br />
      A[i][j] = temp;<br />
   } else {<br />
      temp = temp + 3;<br />
      A[i][j] = temp;<br />
   }</p>
<p>improve matters? (I know the code is still clunky, but I&#8217;m trying to leave it as close to form 2 as possible.)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: gwenhwyfaer</title>
		<link>http://www.thinkingparallel.com/2007/01/28/matrix-optimization-gone-wrong/comment-page-1/#comment-1378</link>
		<dc:creator>gwenhwyfaer</dc:creator>
		<pubDate>Sun, 28 Jan 2007 18:40:33 +0000</pubDate>
		<guid isPermaLink="false">http://www.thinkingparallel.com/2007/01/28/matrix-optimization-gone-wrong/#comment-1378</guid>
		<description>The obvious point that jumps out at me is that you don't need any multiplies to work out 2*i + 3*j + 1:

  mov ebx, [i]
  mov ecx, [j]
  lea ebx, [2*ebx+ecx+1]
  lea eax, [2*ecx+ebx]

leaves the value in eax. The loop indices can equally simply be hoisted into registers and incremented as a block; similarly, 2*i doesn't change inside the inner loop, so it can be computed ahead of it. It's quite likely that the code could be compiled without any multiplications in it at all.</description>
		<content:encoded><![CDATA[<p>The obvious point that jumps out at me is that you don&#8217;t need any multiplies to work out 2*i + 3*j + 1:</p>
<p>  mov ebx, [i]<br />
  mov ecx, [j]<br />
  lea ebx, [2*ebx+ecx+1]<br />
  lea eax, [2*ecx+ebx]</p>
<p>leaves the value in eax. The loop indices can equally simply be hoisted into registers and incremented as a block; similarly, 2*i doesn&#8217;t change inside the inner loop, so it can be computed ahead of it. It&#8217;s quite likely that the code could be compiled without any multiplications in it at all.</p>
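<p>For anyone who doesn't read x86, the same two lea steps can be written in C with shifts and adds only. This is just a sketch that mirrors the assembly above; the function name is made up, and it assumes non-negative i and j so the shifts are well defined.</p>

```c
/* Computes 2*i + 3*j + 1 without a multiply, following the two lea
   instructions step by step. Assumes i >= 0 and j >= 0. */
int value_no_multiply(int i, int j) {
    int t = (i << 1) + j + 1;   /* lea ebx, [2*ebx+ecx+1]  ->  2i + j + 1  */
    return (j << 1) + t;        /* lea eax, [2*ecx+ebx]    ->  2i + 3j + 1 */
}
```

<p>A compiler will normally do this kind of strength reduction on its own, so the plain 2*i + 3*j + 1 in the source is usually the better thing to write.</p>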
]]></content:encoded>
	</item>
	<item>
		<title>By: brainy</title>
		<link>http://www.thinkingparallel.com/2007/01/28/matrix-optimization-gone-wrong/comment-page-1/#comment-1374</link>
		<dc:creator>brainy</dc:creator>
		<pubDate>Sun, 28 Jan 2007 18:03:39 +0000</pubDate>
		<guid isPermaLink="false">http://www.thinkingparallel.com/2007/01/28/matrix-optimization-gone-wrong/#comment-1374</guid>
		<description>Ok, one final attempt (isn't there any proper way to post code as a comment?):

Following your suggestion, I was thinking how efficient the code would be when eliminating the unnecessary if clause like the following:

for (i = 0; i &#60; N; i++) {
	A[i][0] = 2 * i + 1;
	for (j = 1; j &#60; N; j++) {
		A[i][j] = A[i][j - 1] + 3;
	}
}

And then maybe taking it one step further and instead of referring back to the array for each assignment, keeping the previous value in a local variable:

for (i = 0; i &#60; N; i++) {
	val = 2 * i + 1;
	A[i][0] = val;
	for (j = 1; j &#60; N; j++) {
		val += 3;
		A[i][j] = val;
	}
}</description>
		<content:encoded><![CDATA[<p>Ok, one final attempt (isn&#8217;t there any proper way to post code as a comment?):</p>
<p>Following your suggestion, I was thinking how efficient the code would be when eliminating the unnecessary if clause like the following:</p>
<p>for (i = 0; i &lt; N; i++) {<br />
	A[i][0] = 2 * i + 1;<br />
	for (j = 1; j &lt; N; j++) {<br />
		A[i][j] = A[i][j - 1] + 3;<br />
	}<br />
}</p>
<p>And then maybe taking it one step further and instead of referring back to the array for each assignment, keeping the previous value in a local variable:</p>
<p>for (i = 0; i &lt; N; i++) {<br />
	val = 2 * i + 1;<br />
	A[i][0] = val;<br />
	for (j = 1; j &lt; N; j++) {<br />
		val += 3;<br />
		A[i][j] = val;<br />
	}<br />
}</p>
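<p>Before timing anything it is probably worth checking that the rewritten loop still fills the matrix with the same values. Here is a minimal C sketch of such a check, assuming the original formula is A[i][j] = 2*i + 3*j + 1 and that the first column is written before a j-loop that starts at 1. The size N = 8 and the function names are placeholders for the example.</p>

```c
enum { N = 8 };  /* small size picked only for this sketch */

/* Assumed original form: closed-form expression for every element. */
void fill_direct(int A[N][N]) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            A[i][j] = 2 * i + 3 * j + 1;
        }
    }
}

/* First column peeled out of the j-loop, previous value kept in a local. */
void fill_running(int A[N][N]) {
    for (int i = 0; i < N; i++) {
        int val = 2 * i + 1;
        A[i][0] = val;
        for (int j = 1; j < N; j++) {
            val += 3;
            A[i][j] = val;
        }
    }
}

/* Returns 1 if both matrices hold the same values, 0 otherwise. */
int same_matrix(int A[N][N], int B[N][N]) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            if (A[i][j] != B[i][j]) {
                return 0;
            }
        }
    }
    return 1;
}
```

<p>Fill one matrix with each function and compare them with same_matrix; only if that returns 1 do the timings of the two versions mean anything.</p>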
]]></content:encoded>
	</item>
	<item>
		<title>By: Rory Geoghegan</title>
		<link>http://www.thinkingparallel.com/2007/01/28/matrix-optimization-gone-wrong/comment-page-1/#comment-1372</link>
		<dc:creator>Rory Geoghegan</dc:creator>
		<pubDate>Sun, 28 Jan 2007 17:58:37 +0000</pubDate>
		<guid isPermaLink="false">http://www.thinkingparallel.com/2007/01/28/matrix-optimization-gone-wrong/#comment-1372</guid>
		<description>You are right, with pipelining, most modern processors do not handle branching very well. Also, even if two multiplications would intuitively take twice as long, because most modern processors, once again, pipeline instructions, they are more or less done in parallel, especially if they are floating point operations.

Actually, the less branching you perform, the better it is, especially when you consider things like the optimisations done by the compiler and the processor.</description>
		<content:encoded><![CDATA[<p>You are right, with pipelining, most modern processors do not handle branching very well. Also, even if two multiplications would intuitively take twice as long, because most modern processors, once again, pipeline instructions, they are more or less done in parallel, especially if they are floating point operations.</p>
<p>Actually, the less branching you perform, the better it is, especially when you consider things like the optimisations done by the compiler and the processor.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
