Thinking Parallel

A Blog on Parallel Programming and Concurrency by Michael Suess

Ten Questions with Sanjiv Shah about Parallel Programming and OpenMP

This is the third post in my Interviewing the Parallel Programming Idols series. My interview partner today is Sanjiv Shah, whom I was lucky enough to meet at various OpenMP workshops. I have come to know him as the most knowledgeable person about OpenMP and parallel programming ever :P. Let me add a little about his background: Sanjiv Shah is a Senior Principal Engineer in the Software and Solutions Group at Intel, specializing in multi-threaded computing, and the Director of the Performance, Analysis and Threading Lab. During his career, Sanjiv has worked on and managed many aspects of performance and correctness analysis tools, compilers and runtimes for parallel applications and systems. He has been extensively involved in the creation of the OpenMP specifications and of the industry standards organization known as the OpenMP Architecture Review Board. He is a former CEO of the OpenMP ARB and continues to serve on its Board of Directors. What a long list of titles :D. Besides that, he is also a really nice guy and a joy to talk to.

I think I have praised him enough for now; let’s start with the interview. The first five questions are about parallel programming in general:

Michael: As we are entering the many-core era, do you think parallel computing is finally going to be embraced by the mainstream? Or is this just another phase, and soon the only people interested in parallel programming will be the members of the high performance community (once again)?

Sanjiv: I do believe there has been a fundamental shift (the so-called power wall), such that parallel computing will be adopted by the mainstream. While there is some very interesting work in many different areas that might eventually speed up sequential computing again (optical, quantum, molecular), for mass manufacturing all of this work seems a couple to a few decades away.

Michael: From time to time a heated discussion evolves on the net regarding the issue of whether shared-memory programming or message passing is the superior way to do parallel programming. What is your opinion on this?

Sanjiv: Religious arguments such as these are fun to listen to, aren’t they? Many of the arguments on the net comparing the two are flawed in that they often compare apples to oranges: different algorithms coded in each paradigm.

Which one is appropriate depends very highly on the application: shared memory makes some things very easy, whereas distributed memory makes other things easy. SETI@home is becoming the classical distributed example: very little shared state and millions of computers worldwide can run independently. Many such applications map naturally to distributed memory. On the other hand, when there is a lot of continuously changing state to be shared, shared memory programming makes a lot of sense. Many of the applications we use every day, like the editor/email system I am typing this in, word processors, web browsers, spreadsheet programs, video and image processing, computer games and so on depend heavily on shared state.

Michael: From your point of view, what are the most exciting developments/innovations regarding parallel programming going on presently or during the last few years?

Sanjiv: Multi-core processors. But that’s not what you were asking about. To be honest, I do not see much new innovation on the general purpose software side. A lot of what many call new is really old ideas being recycled, either from programming or from adjacent fields. Transactional memory comes from the world of databases and journaling file systems. Futures, Lambdas, tasks, etc. have all been present in various programming languages. Lock free algorithms are not new but quite attractive. Vector extensions are becoming popular for attached processors, but these are old ideas.

OpenMP is a very nice standardization effort that is now widely available to programmers, with implementations in EVERY major compiler. There is some good work going on with regard to tasking in OpenMP, which will make OpenMP much more accessible to C++ programmers.

Threading Building Blocks is another nice way to represent parallelism. It is a parallel “language” embedded in a C++ template library for control and data parallelism and provides concurrent versions of some of the more commonly used data structures. It is a nice capture of the state of the art in an easily usable form.

Michael: Where do you see the future of parallel programming? Are there any “silver bullets” on the horizon?

Sanjiv: I do not see any technological “silver bullets”. However, parallel programming will be ubiquitous, let there be no doubt about it. ISVs will have to think parallel or they will perish. Fast movers will use parallelism as a competitive advantage, and sequential apps may survive for a while due to installed base, features, etc. But over time, as more and more cores become available, sequential applications will be at a bigger and bigger disadvantage.

Look at it this way: with 2 cores, a sequential application is not taking advantage of 50% of the available computing power. With 4 cores, 75%. With 8 cores, 87.5%. With 16, 93.75%. Around 4-8 cores, sequential applications will be ignoring too much of the available power to continue to compete and survive.

And ultimately, that is the “silver bullet”. Need. Humans adapt infinitely in order to survive. Programmers will adapt to be very adept at parallel programming.

Michael: One of the most pressing issues presently appears to be that parallel programming is still harder and less productive than its sequential counterpart. Is there any way to change this in your opinion?

Sanjiv: Three things are necessary: the need mentioned above, coupled with changes in University curricula to teach parallel programming in undergraduate courses and the availability of parallel languages and tools for the entire programming life cycle (from experiment and design to coding to testing and validation to maintenance).

4 and 8 core processors bring the need, University curricula are starting to change, parallel languages are becoming widely available and tools are starting to become available for some of the more common languages.

However, I’d like to point out that sequential programming will likely always be easier than parallel programming because the environment is more constrained.

So much for the first part of this interview. Without further ado, here are the questions about OpenMP:

Michael: What are the specific strengths and weaknesses of OpenMP as compared to other parallel programming systems? Where would you like it to improve?

Sanjiv: The ability to add OpenMP incrementally is a huge strength, not just for existing sequential applications but also for new applications. Another powerful strength is that OpenMP applications can be coded so as to contain two entire encodings of an application, a sequential one and a parallel one. This fact is not well understood, but it enables some really powerful, almost magical, tools like Assure and Thread Checker that eliminate a lot of the difficulty you allude to in question 5.

A weakness of OpenMP is that it is trying to serve too broad a market. On one hand, you have HPC experts trying to squeeze every FLOP out of large systems because of system cost; on the other, millions of programmers happy with relatively small gains on modest sized systems that are virtually free. In catering to both, we may end up catering to neither. The expert wants total control of where threads are running and what they are doing. The ordinary user is blissfully ignorant. Thread IDs are another specific example of the dichotomy; I wish we could do away with them.

The language needs to improve in its expressive power, in its error handling, in coexisting with other threading models, and in its C++ support. The current OpenMP library has only the very basics necessary; programmers need to build upon these basics to get anything done. The library should be usable out of the box without having to build upon it.

Michael: If you could start again from scratch in designing OpenMP, what would you do differently?

Sanjiv: The list is long. Some of the routines between Fortran and C/C++ are badly named. Features like WORKSHARE shouldn’t exist. It should be much harder to use thread IDs, in order to encourage people to program abstractly. An ABI should be specified for a compiler to target, making inter-operability between different implementations much easier. The ability to work with user-specified thread pools would be nice. A much richer standard library should be included. Interoperability with underlying threads should be better specified. Better hooks for performance and correctness tools should be built into the language and library. Better synchronization primitives are needed. Rely more on language scoping in languages that allow it, instead of on the private keyword. Eliminate some things the programmer can do trivially, like firstprivate. Do not allow privatization of globals. As I said, the list is long (and did I say controversial?).

Michael: Are there any specific tools you would like to recommend to people who want to program in OpenMP? IDEs? Editors? Debuggers? Profilers? Correctness Tools? Any others?

Sanjiv: I am very biased here, but it comes from 2 decades of experience (and many of the best independent OpenMP programmers out there agree with me): every OpenMP programmer needs to learn how to program in a “thread count independent” way that allows both a sequential program and a parallel program to coexist in their source code, so that they can benefit from the magic of Thread Checker (Assure). It is amazingly powerful. You get the productivity of sequential programming.

The multiple run comparison feature of Thread Profiler for OpenMP (GuideView) is also very powerful for performance tuning. It lets you dive to individual parallel regions and the sequential regions between parallel regions and understand the scaling and non-scaling behavior at this level. There are other performance profilers out there from Bernd Mohr, Al Malony and others that have also become quite good at OpenMP.

Michael: Please share a little advice on how to get started for programmers new to OpenMP! How would you start? Any books? Tutorials? Resources on the internet? And where is the best place to ask questions about it?

Sanjiv: Get your hands dirty! Good programmers should skim a book, look at some of the introductory tutorials available on the web, look at some examples and dive in with some real programming. For a deeper understanding, it pays to get a basic grasp of what the compiler does to their program; OpenMP is incredibly simple when you look at it from this perspective. And do not forget tools, both correctness and performance; they are critical. Thread Checker for OpenMP remains the only tool of its kind.

Look at how Thread Checker can be used to add parallelism very quickly. I follow a very simple recipe:

  1. Get sequential program correct.
  2. Identify loops I want to parallelize via profiling.
  3. Eyeball the loop to identify obvious private objects and use language-specific scoping as much as possible (or the private clause in Fortran).
  4. Use parallel for and Thread Checker to get a worklist of remaining work!

It’s that simple. People, including myself, have parallelized million-line apps using this simple recipe.

Michael: What is the worst mistake you have ever encountered in an OpenMP program?

Sanjiv: The most common mistake: parallelizing loops that are too fine grained or consume a lot of bandwidth and contribute negatively to overall scaling. Even very experienced OpenMP programmers make this mistake. Programmers must understand the performance and scaling of every parallel region to avoid such mistakes. This is one of the worst mistakes because people are just shooting themselves in the foot.

I would be remiss if I didn’t point out that sometimes it is very important to parallelize such loops to get or preserve side effects. For example, on NUMA systems, people parallelize such loops for the memory allocation side effect (due to the first-touch allocation policy). And when data is already spread out among different threads, it pays to keep the data spread out by parallelizing fine-grained loops, even if it costs a little overhead.

Michael: Thank you very much for the interview!

7 Responses to Ten Questions with Sanjiv Shah about Parallel Programming and OpenMP


Comments

  1. Comment by bong | 2007/04/03 at 14:53:42

    Is Intel Thread Checker really the _only_ tool? Does anyone know of any open source tools out there? How about the other processor manufacturers?

  2. Comment by franjesus | 2007/04/03 at 17:45:19

    Does Intel Thread Checker work for other x86 platforms? (eg AMD Opteron).

    Is it NUMA-aware?

  3. Comment by franjesus | 2007/04/03 at 18:26:52

    You might be interested in this.

    I just managed to install the intel thread checker under Ubuntu. I’ll reproduce it here:

    In short:

    tar xzvf tcheck3.0_007cli_lin.tar.gz
    cd itt_tc_cl/data/
    vi install-tc.sh
    :1399
    :s/rpm -q/dpkg -s/
    :wq
    sudo apt-get install rpm
    sudo touch /etc/SuSE-release
    sudo ./install-tc.sh --nonrpm

    Accept all defaults and copy the license file to /opt/intel/licenses

    sudo rm /etc/SuSE-release

    Done!

  4. Comment by Michael Suess | 2007/04/04 at 10:09:42

    @bong: I know of the Sun Studio Data-Race detection tool, but have not found the time to try it out. I think Valgrind does some limited checking as well (mostly for sequential correctness, don’t know if they have extended it to threads yet), but all others are not as comprehensive as the Intel solution as far as I know.

    @franjesus: the thread-checker works on AMD-processors just fine. I don’t think it is NUMA-aware. Thanks for the instructions on installing on Ubuntu, I will make sure to try them out shortly!

  5. Comment by franjesus | 2007/04/04 at 11:55:50

    Yes, now that I installed it, I know it works fine :-). Thanks!!

  6. Comment by pradeep | 2007/04/05 at 11:52:24

    Hi folks,

    Sun Studio has the Sun Thread Analyzer, which is a data race detection tool.
    Here is how to use it:

    While compiling, use -xinstrument=datarace, but without using -c.
    Then collect the experiment using
    $ collect -r all source arguments
    By default the experiment name starts with tha.1.er.
    Check for data races using
    $ tha tha.1.er
    A graphical window will pop up.

    I think it’s pretty easy to work with the Sun Thread Analyzer.


Trackbacks & Pingbacks

  1. […] ThinkingParallel has published the third installment of the Interviewing the Parallel Programming Idols series, Ten Questions with Sanjiv Shah about Parallel Programming and OpenMP, which gave me some insight into other approaches to parallel computing (this time OpenMP). […]
