Thinking Parallel

A Blog on Parallel Programming and Concurrency by Michael Suess

Scoped locking vs. critical in OpenMP – a personal shootout

When reading any recent book about using C++ and parallel programming, you will probably be told that scoped locking is the greatest way to handle mutual exclusion and is in general the greatest invention since hot water. Yet, most of these writers are not talking about OpenMP, but about lower level threading systems. OpenMP has its own rock-star directive for taking care of synchronisation, and it is called critical. Nevertheless, it is possible to implement scoped locking in OpenMP and I was curious: What would an honest comparison between the two contenders result in?

Before I start, a word of warning: This article requires some C++ and OpenMP knowledge, as well as some knowledge on general parallel programming concepts. It also concentrates on a very special problem, which may not be interesting for everyone. You have been warned :P.

Lately, I have been experimenting with concurrent patterns in C++ and OpenMP. One of the first things I tried was to implement a Guard Object. The technique behind this is also commonly referred to as Scoped Locking. I start this article by shortly explaining what scoped locking means (if you already know that feel free to skip the first few paragraphs) and afterwards compare it to what can be done with the critical-directive in OpenMP. My personal conclusion is at the end of this article.

An introduction to Scoped Locking

Mutual exclusion is the most common form of synchronisation in threaded parallel programs. Most threading systems provide data types and API-functions to guarantee mutual exclusion (e.g. mutex-variables in Pthreads) and I will refer to these types as locks for the rest of this article. A common error when using these functions is to forget to unset the lock. Has never happened to you, you say? Think of the case, when there are break, exit or return statements inside the protected regions. These give us multiple ways to exit that region and screw up by forgetting the lock, especially when multiple authors work on the same code-base.

Enter guard objects to the rescue. These are using the Resource Acquisition Is Initialization technique to guarantee that the unset function is always called. What does this concretely mean? A guard object is a simple object, its constructor takes a lock as parameter and locks it right in the constructor. The lock is unlocked in the destructor of the guard object. Since in C++ the destructor is automatically called as soon as the object goes out of scope, the lock will always be unlocked as soon as the program leaves the protected code region. Sometimes an example says more than a thousand words, so I am providing one here (just a snippet, in OpenMP):

  1. #include "omp_guard.hpp"
  2. ...
  3. omp_lock_t my_lock;
  4. ...
  5. // initialize lock
  6. omp_init_lock (&my_lock);
  7. ...
  8. // go parallel
  9. #pragma omp parallel
  10. {          
  11.     // do something sensible in parallel
  12.     ...
  13.     {
  14.         omp_guard my_guard (my_lock);
  15.         // protected region starts here
  16.         // only one thread at a time works here
  17.     }
  18.     // more parallel work
  19.     ...
  20. }
  21. omp_destroy_lock (&my_lock);

The important parts are on line 14 to 17. On line 14 the guard object is declared, its constructor is automatically called (you will see that one in a second) and the lock is acquired. On line 17, the guard object goes out of scope and its destructor is called automatically, thereby releasing the lock. Not too difficult to understand, nevertheless quite a lot of overhead involved. We will get to that in a second, but before that as promised the actual implementations of the guard object in OpenMP. Lets start with the header:

  1. #ifndef OMP_GUARD_HPP_
  2. #define OMP_GUARD_HPP_
  3.  
  4. #include <omp.h>
  5.  
  6. /** This is a class for guard objects using OpenMP
  7.  *  It is adapted from the book
  8.  *  "Pattern-Oriented Software Architecture". */
  9. class omp_guard {
  10. public:
  11.     /** Acquire the lock and store a pointer to it */
  12.     omp_guard (omp_lock_t &lock);
  13.     /** Set the lock explicitly */
  14.     void acquire ();
  15.     /** Release the lock explicitly (owner thread only!) */
  16.     void release ();
  17.     /** Destruct guard object */
  18.     ~omp_guard ();
  19.  
  20. private:
  21.     omp_lock_t *lock_;  // pointer to our lock
  22.     bool owner_;   // is this object the owner of the lock?
  23.    
  24.     // Disallow copies or assignment
  25.     omp_guard (const omp_guard &);
  26.     void operator= (const omp_guard &);
  27. };
  28.  
  29. #endif /*OMP_GUARD_HPP_*/

And here goes the actual implementation:

  1. /** The class contained in this file is a guard object */
  2.  
  3. #include "omp_guard.hpp"
  4.  
  5. /** Construct guard object and acquire our lock */
  6. omp_guard::omp_guard (omp_lock_t &lock) : lock_ (&lock)
  7.     , owner_ (false)
  8. {
  9.     acquire ();
  10. }
  11.  
  12. /** Explicitly set our lock */
  13. void omp_guard::acquire ()
  14. {
  15.     omp_set_lock (lock_);
  16.     owner_ = true;
  17. }
  18.  
  19. /** Explicitly unset our lock.
  20.  * Only unset it, though, if we are still the owner.
  21.  */
  22. void omp_guard::release ()
  23. {
  24.     if (owner_) {
  25.         owner_ = false;
  26.         omp_unset_lock (lock_);
  27.     }
  28. }
  29.  
  30. /** Destruct guard object, release the lock */
  31. omp_guard::~omp_guard ()
  32. {
  33.     release ();
  34. }

Figured it out? Not much to explain there for me (of course you need to know a little C++ to understand the example). The owner variable is added so I can release the mutex manually using the release()-function, without unsetting the lock twice when the destructor is called (which is illegal). Bear with me, we are almost through with the code examples. Nevertheless, I have one more and this time it is the first example rewritten to use the critical-directive for comparison. Here it is:

  1. ...
  2. // go parallel
  3. #pragma omp parallel
  4. {          
  5.     // do something sensible in parallel
  6.     ...
  7.     #pragma omp critical
  8.     {
  9.         // protected region starts here
  10.         // only one thread at a time works here
  11.     }
  12.     // more parallel work
  13.     ...
  14. }

Quite a bit shorter, isn’t it? OK, I am done posting code snippets now, let’s get to work comparing them.

Scoped Locking vs. critical – the comparison

First of all, let’s see what both solutions do:

  • cannot forget to unset the lock, as it is done automatically for scoped locking or detected as a compile-time error for critical

So the initial problem appears to be solved by both. But wait, there is (unfortunately) more to it. Here are some more reasons to use scoped locking and NOT the critical-directive:

  1. it is allowed to leave the locked region with jumps (e.g. break, continue, return), this is forbidden in regions protected by the critical-directive
  2. scoped locking is exception safe, critical is not
  3. all criticals wait for each other, with guard objects you can have as many different locks as you like – named critical sections help a bit, but name must be given at compile-time instead of at run-time like for scoped locking

Let’s go through them one by one: the first point is already a serious blow to the usability of critical. In my introduction I mentioned that it is easy to forget to unset a lock, when the protected region of code has multiple points of exit. Well, fact is, if you have code like that you cannot use the critical directive, because it is forbidden to jump out of a critical region by the OpenMP specification! Of course there are workarounds for that (such as filling the critical region with conditional clauses and flags or resorting to OpenMP-locks), but they take away the ease of use of critical in the process. The second point is closely related: It is not allowed to throw an exception out of a critical region when using the critical directive. Exceptions are not as common in C++ as they are in other languages (e.g. Java), but this is still a serious limitation. All of this works flawlessly with scoped locking.

The third point is a little more difficult to explain. Let’s say I want to implement a library function which can be called from within a parallel region. The function has been made thread-safe, data is passed into it and the critical directive is used to prevent concurrent access to this data. The problem with this scenario is that even though the programmer may pass different data into the function, the critical section will prevent the threads from working on them at the same time! I think I need one more piece of example code to clear the waters a bit here, first the example library function:

  1. void threadsafe_foo (int& data)
  2. {
  3.     // do some work
  4.     ...
  5.     #pragma omp critical (threadsafe_foo_1)
  6.     {
  7.         data = rand ();
  8.     }
  9. }

And now a piece of code that uses it:

  1. int data1, data2;
  2.  
  3. #pragma omp parallel
  4. {
  5.     threadsafe_foo (data1);
  6.     // do some work
  7.     threadsafe_foo (data2);
  8. }

See the problem? Even though threadsafe_foo() operates on completely different data, the critical region inside the function does not and cannot know that and therefore the threads are stalling each other unnecessarily. With scoped locking this can be avoided by passing a different lock into the function, when different data are to be protected – and the problem disappears. OK, so does this mean that everyone is supposed to forget about the critical directive in OpenMP and use scoped locking? I am afraid it is not that easy, since there are also advantages to using critical:

  1. no need to declare, initialize and destroy a lock
  2. no need to include, declare and define a guard object yourself, as critical is a part of the language (OpenMP)
  3. you always have explicit control over where your critical section ends, where the end of the scoped lock is implicit
  4. works for C and C++ (and FORTRAN)
  5. less overhead, as the compiler can translate it directly to lock and unlock calls, where it has to construct and destruct an object for scoped locking

The first two points on this list are about simplicity and ease of use. It is just less overhead to use the critical directive, because it is part of the language. You can count the lines in the examples above if you do not believe me (but actually it is fairly obvious without counting :roll: ). The third point is easily worked around by opening a new scope around the protected code as done in the examples above, but if you are not used to that (and braces that seem to exist without a reason) this will look strange. The fourth point is only valid if you are (like me) also doing work in other languages, as scoped locking does not work in e.g. C, while the critical directive works in every language supported by OpenMP (C / C++ / FORTRAN).

The fifth point may be a serious one for your application. A critical directive can usually be translated directly to lock operations at compile-time by the compiler, there is no (or at least very little) overhead involved at run-time. This is not the case for scoped locking. For every thread entering the protected region, the run-time systems needs to construct and destruct an object – and I am not aware of any magic the compiler could do to reduce this overhead, but then again I am by no means a compiler expert, if you know of a technique, please be sure to leave a comment. This overhead may be negligible, but then again it may not be and may seriously impact your performance. The only way to be sure is to measure it yourself for your specific application.

Conclusion

Congratulations for reading this far :lol: . I promised you a shootout, what I did not promise you was a clear winner. I cannot deliver one either, as both solutions have their merits as explained in the previous sections. The ability to jump out of protected regions (via exceptions, break() or whatever) speaks heavily for scoped locking, yet the simplicity and performance of critical is nothing to sneeze at. I will put up a recommendation here that at least I am willing to adhere to: Start by using critical in your OpenMP programs and beware of its limitations. When you encounter any of the problems mentioned here, don’t spend your valuable time implementing ugly workarounds, but try switching to the more general solution of scoped locking instead. When you do, be sure to measure if you take a performance hit.

P.S.

That’s it for today. This has been my longest article so far, and a lot more low-level than the previous ones as well. If I have forgotten anything regarding the two techniques, if something was not sufficiently clear or if you have anything else to add, please do not hesitate to leave a comment and I will try to clarify / discuss the issues.

9 Responses to Scoped locking vs. critical in OpenMP – a personal shootout »»


Comments

  1. Joe
    Comment by Joe | 2006/08/23 at 20:49:13

    Nice article. A point that I try to emphasize to students when I teach a course on HPC programming using OpenMP and MPI, is that critical sections of any sort are points of serialization in code, and should be avoided whenever possible. Some things cannot be done without locks and critical sections. You gave rand as an example. It turns out in the rand case that there are better solutions. Given the typical use case for rand, you can actually light off numerous independent PRNGs which don’t depend upon global hidden state (which is what the lock attempts to serialize for you).

    Basically what I am advocating is, whenever possible, try to avoid locking, and try to avoid serialized/serializing regions or function calls. In any parallel program, serialized regions are responsible for the “s” part in Amdahl’s law, and you want that as small as possible.

  2. Comment by Michael Suess | 2006/08/23 at 23:11:22

    Joe,
    thanks for your comment. The article presented here is for an audience that are not beginners in parallel programming. I thought about whether or not to include a section discussing the more general issue of locking and when it is needed, but this would have blown it way out of proportion and would have bored the heck out of the more advanced readers who (hopefully) find the rest of the article more interesting.

    That said, I totally agree with you, locking should be avoided whenever possible, I am also telling this to my students quite frequently and do not even want to think about the amount of time I have spent optimizing out critical regions in my code ;-).

    That said, there is another thing I obviously did not make sufficiently clear: The critical region in the last example does not serve to protect the rand()-function, but was rather meant to protect the data-variable. I should have made that more clear I guess. In OpenMP it is not necessary to protect the call to rand (), as all library functions supplied by the base language are required to be thread-safe by the specification. Nevertheless, you are still right, using rand() was not the brightest thing to do, I should have rather used something less specific like some_function_foo() ;-).

  3. Comment by Jeremy Rogers | 2007/03/23 at 05:21:11

    Thanks for the excellent article! As a student starting to learn about parallel programming, I was wondering exactly what this article addresses. Thanks for a well-put-together article. For once, my question got answered with exactly one search. Thank you again for the great article.

  4. Comment by Jeremy Rogers | 2007/03/23 at 05:22:40

    ^^^ Gee, gush much? :)

  5. Comment by Michael Suess | 2007/03/23 at 14:40:00

    Always glad to help…

  6. Comment by Brandon Barker | 2007/07/09 at 04:35:29

    Thanks for the very helpful article!

  7. Comment by Chirag Singla | 2007/11/12 at 19:04:07

    Really nice article:)

  8. Comment by Some random guy | 2013/04/05 at 14:35:36

    If data is passed as arguments, then the per-thread stack is used, then each thread gets the local copy of the data (in this case copy of reference to data). Therefore no synchronization is needed, unless rand() is not thread-safe. Then however common lock on all threads is required.

  9. Comment by Pawel | 2014/03/19 at 21:07:58

    Thanks! Very helpful comparison.


Leave a Reply

HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>