Thinking Parallel

A Blog on Parallel Programming and Concurrency by Michael Suess

Thoughts on Larry O’Brien’s article at

Larry O’Brien has written an introductory article on parallel programming with OpenMP on Windows and announced it in his blog. I enjoyed reading the article and think it is a really nice resource for people new to parallel programming. I would like to comment on some parts of his article, and since it does not have a comment section (and my comments would be quite long anyway), I will do it here:

Many programmers don’t understand this simple reality: When using mainstream programming languages the only way to take advantage of multiple cores is to explicitly use multithreading.

Wow, what an introduction :D. I hope this is meant as an opening phrase to catch some attention, because I sincerely hope it is not true. I know most of my students understand the fact that one way of taking advantage of multiple cores is to use multithreading. Most of my friends involved in programming do. But then again, this is not sufficient proof of anything. But I can imagine Larry has a hard time proving his assumption as well ;-). Also please note the emphasis in my version of the assumption. There are certainly other ways to take advantage of multiple cores (think MPI or Erlang). It is not for me to judge how mainstream these are, though…

OpenMP, a multiplatform API for C++ and Fortran

Don’t forget the C support :P…

simply wrapping a processor-intensive loop in a #pragma block can lead to about a 70 percent performance increase on a dual-core or dual-processor system

I just don’t know why he is so keen on the 70 percent number. He mentioned it earlier in one of his blog posts, which got picked up by Eric Sink. And I am sitting here scratching my head because this is quite contrary to my experience. He talks about embarrassingly parallel programs, and I do not see a reason why they should not scale to a 100 percent performance increase on a dual-core machine. In fact, I have witnessed speedups like that. Sometimes you can even get better performance (called superlinear speedup). On the other hand, of course, it depends on what you measure. Just the parallel region? Then these speedups are possible. The whole program? Then Amdahl’s Law will bite and your performance increases will probably be lower. Still, there is no guarantee they will be anywhere near seventy percent. They may as well be sixty percent, twenty percent or ninety percent. It totally depends on your application, algorithms and parallelization skills, and I sometimes get the feeling that even the phases of the moon are involved from time to time 8O.
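
To make the point about measuring the whole program a bit more concrete, here is a minimal C sketch of Amdahl’s Law. The function names amdahl_speedup and percent_increase are made up for this illustration and do not come from Larry’s article, and the model assumes that the parallel part scales perfectly with no threading overhead, which real programs rarely manage.

```c
#include <assert.h>

/* Amdahl's Law: predicted speedup of the whole program when a fraction
 * `parallel_fraction` (0.0 to 1.0) of the serial runtime is divided
 * evenly among `cores` cores and the remainder stays serial.
 * Assumption: perfect scaling of the parallel part, no overhead. */
double amdahl_speedup(double parallel_fraction, int cores)
{
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(cores > 0);
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores);
}

/* Convert a speedup factor into a "percent performance increase",
 * e.g. a speedup of 2.0 is a 100 percent increase. */
double percent_increase(double speedup)
{
    return (speedup - 1.0) * 100.0;
}
```

Under this simplified model, percent_increase(amdahl_speedup(1.0, 2)) gives 100 for a fully parallel program on two cores, a parallel fraction of 0.5 gives only about 33, and a 70 percent increase corresponds to a parallel fraction of roughly 0.82. Superlinear speedups fall outside the model entirely, which is one more reason not to expect any fixed number.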

To err is human, but to really screw up you need shared state.

😆 Oh so true. Luckily for us, there are some tools available today to help…

In Visual C++ 2005, using OpenMP is as simple as adding the #pragma and compiling with a command-line switch (/openmp)

Do not try this with the free Express Edition, though. I downloaded it once to play with the OpenMP support, and since I usually develop on Linux, this seemed like a great way to toy around with it. The switch is still there as described, yet OpenMP support is completely missing from the Express Edition. It took me some time to figure this out. And I am still not sure why they could not at least have deactivated the switch on the preferences page…
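
For readers who have never seen it, the kind of loop the quote refers to might look like the following sketch. This is not the example from Larry’s article; the function name sum_of_squares is hypothetical and only serves as an illustration. The reduction clause gives every thread its own partial sum and combines them at the end, so the threads do not fight over a shared variable.

```c
#include <assert.h>
#include <stddef.h>

/* Returns data[0]^2 + ... + data[n-1]^2.
 * The iterations are independent of each other, so OpenMP may split
 * them among threads. Hypothetical example, not from the article. */
double sum_of_squares(const double *data, int n)
{
    double sum = 0.0;
    int i; /* signed loop index; OpenMP makes it private per thread */

    assert(n >= 0 && (data != NULL || n == 0));

    /* Each thread accumulates a private copy of `sum`;
     * the copies are added together when the loop finishes. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++) {
        sum += data[i] * data[i];
    }
    return sum;
}
```

Compiled with /openmp (or the equivalent switch of your compiler, such as -fopenmp with GCC), the loop is distributed across the available cores. Without the switch the pragma is simply ignored and the same code runs serially, which is also roughly what the Express Edition experience described above amounts to. Note that the order in which the partial sums are combined is not fixed, so floating point results can differ in the last digits between runs.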

Beyond that, what about when machines start having 16 and 32 cores? Today you might be able to get by without parallelizing your code, but that’s certainly not going to be the case in the not-so-distant future.

That future is here already when you look beyond Intel and AMD, e.g. in Sun’s UltraSPARC T1 processor, which supports 32 threads running at the same time (not exactly mainstream, I admit, and its floating point performance is still not satisfactory, but you will not have to sell your soul to get one today either :D).

This closes my short musings on the article, and as I already said in my introduction, it was a true pleasure to read. I was quite glad to see an article about OpenMP and parallel programming in general generating responses and discussions in the blogosphere, and I sure hope this does not die down quickly. As I already said in one of my previous posts: We are in the middle of a (parallel) revolution and this time it’s for real!

5 Responses to Thoughts on Larry O’Brien’s article at


  1. Comment by Larry O'Brien | 2006/08/24 at 01:27:04

    Thanks for the comments. Re “the only way…” well, yeah, it _is_ a hook and the way I justify it is the “mainstream” qualifier. There are _lots_ of programming languages that have models that allow for automatic parallelization, but I think it’s fair to say that they are not in the mainstream. Whether they will _enter_ the mainstream is another question…

    As to the 70% number, I guess I did use it a couple times, because it happened to be the number I saw on this example and a couple others I’ve recently measured. Familiarity breeds… I don’t know … expectation, I guess.

    Even since I wrote the article, I was looking at some results showing superlinear speedup. It seems more common today than it was “back in the day.” I don’t know if that’s a reflection of more complexity in computer architectures or a reflection of more sophisticated algorithms. A little of both, I suppose.

    Anyway, thanks for the comments!

  2. Comment by Christopher Aycock | 2006/08/25 at 01:31:51

    Larry, superlinear speedup comes from better cache exploitation; memory-intensive applications work better on multiple processors simply because there’s more cache. As for multicore chips, it’s more likely the case that improved locality (from sharing memory regions among the threads) is the primary cause.

    There is the contention issue to worry about on an SMP, which is why I doubt that 32-core CPUs will ever make it without a serious change to the way data is fetched from memory.

    Therefore, if you really want to speed up an application, improve the locality of reference. There are a number of tricks that unfortunately aren’t exploited by most compilers, such as a simple reordering of iterations in a nested loop. Only once the cache misses have been significantly reduced should a programmer investigate multicore execution.

  3. Comment by Cobo | 2006/08/25 at 02:20:06

    To Michael:
    I have been wanting to ask you this for a long time, but thought it wasn’t the time. Now, as you mentioned it… Do you have any background or plan on blogging about other parallelization techniques and languages such as the aforementioned Erlang?
    I find things like their process implementation quite interesting, which seems to be much more efficient than mapping to OS threads, and all that kind of stuff.
    Also (and I suppose as a consequence of my inexperience), I see OpenMP as a great short-term solution for existing sequential languages, but I think the future should belong to languages where you don’t have to consider which blocks of code should be concurrent or not, and instead have concurrency by default.

    Just the thoughts of a newbie…

    Both Larry’s article and Christopher’s comment on locality of reference and future data fetching from memory (which is a very interesting topic too) are very interesting.


  4. Comment by Michael Suess | 2006/08/25 at 10:13:30

    I am not entirely sure I understand your comment fully. That said, I think Larry knows where superlinear speedups come from ;-). And I do not see a reason why they should be any less common with multicore-chips, because they usually have more cache than their single-core counterparts. But as I said, I am not even sure you were suggesting that :-).

    What I am sure about, though, is that you are probably right with your assumption that 32-core CPUs are not going to be very efficient without changes to the memory subsystems. I am also sure the smart people designing these processors will come up with a solution. Maybe we will see NUMAs on a chip by that time :-).

    I also agree in part with your comments regarding locality – this is very important for performance. Nevertheless, Larry stated in his article that it is not about “pedal-to-the-metal” optimizations (and unfortunately, I still regard locality optimization as one of those as long as there are no better tools to help). Therefore, I do not see locality optimizations as a prerequisite for parallelization, but rather as an orthogonal technique. Both can be done independently to gain performance, and for best results on multi-core architectures you need to employ both – but on the other hand, sometimes you do not need best results, and in that case I do not see anything wrong with doing either parallelization or locality optimization.

    This blog is about parallelization in general and I will be blogging about other parallel programming systems – or else I would have probably called it Thinking OpenMP :-). This blog has an emphasis on OpenMP because it is the system I know the most about. But I am always interested in playing with other systems, and in fact I have done the first half of the Erlang tutorial again just yesterday to refresh my memory on it. You will definitely see posts about other systems. You may even see them very soon, since I am going to Europar next week and there are talks about a lot of exciting systems happening there…

  5. Comment by Joe | 2006/08/28 at 20:30:57

    Hi Christopher:

    I like the idea of measuring the multi core systems as compared to postulating the overall impact. We did some experiments and wrote some white papers on what we found for “normal” threaded apps using OpenMP, pthreads, and other models on a single node of dual Opteron 275 more than a year ago. While there was contention, as you could see if you looked at the papers, this was not a major issue.

    It does not mean that it isn’t a major issue for some codes. Any code which fills the memory pipeline (or any other shared resource) is likely to have serious contention issues. What I was measuring was the performance of these codes built for parallel execution on a parallel machine using realistic data sets.

    You can pull the PDF’s off my company home page ( towards the bottom. We are working on expanding the testing. Quad cores will, I believe, have a more severe contention issue, and we are trying to develop tests which allow us to explore this with real codes.

