Thinking Parallel

A Blog on Parallel Programming and Concurrency by Michael Suess

10 Ways to Reduce Lock Contention in Threaded Programs

I see this mistake quite often: people have a performance or scalability problem with their threaded code and start blaming locks, because everyone knows that locks are slow and limit scalability (yes, that was ironic :grin:). They start to optimize their code by getting rid of the locks entirely, e.g. by messing around with memory barriers, or by just not using locks at all. Most of the time the result happens to work on their machine, but has subtle bugs that surface later. Let’s see what the experts say about using locks, in this case Brian Goetz in his excellent book Java Concurrency in Practice:

When assessing the performance impact of synchronization, it is important to distinguish between contended and uncontended synchronization. The synchronized mechanism is optimized for the uncontended case (volatile is always uncontended), and at this writing, the performance cost of a “fast-path” uncontended synchronization ranges from 20 to 250 clock cycles for most systems. While this is certainly not zero, the effect of needed, uncontended synchronization is rarely significant in overall application performance, and the alternative involves compromising safety and potentially signing yourself (or your successor) up for some very painful bug hunting later.

I could not have put it any better myself, and I can assure you this man knows what he is talking about. Although he speaks about Java specifically, the advice is the same for other parallel programming systems: the first step when you have synchronization problems is not to get rid of locks, but to reduce lock contention. This article will show you 10 (+1) ways to do just that.

Update: As Brian correctly notes in his comment below this article, these are advanced techniques and should only be employed if you are absolutely sure that locks are the problem! The only way to be sure, of course, is to use your profiler of choice, or else you will be guilty of the number one sin of every programmer: premature optimization (and you don’t want to go to programmers’ hell for that one, do you? :lol:).

  1. Protect data, not code: A common technique to achieve thread-safety quickly is to slap a big lock around a whole function call, or in Java’s case, to declare the whole method synchronized. Remember that it is merely the data that needs to be protected from concurrent access, not the code! Following this advice alone can bring the time spent in your locks down significantly (Java sketches for this and several other tips follow the list).
  2. Get rid of expensive calculations while in locks: This may seem obvious, but a little instruction reordering and the use of temporary variables can bring down the time spent in locks considerably, especially when I/O is involved (sketched after the list).
  3. Use lock striping: Does your whole array really need to be protected by the same lock? Or can you give each element its own lock? Or, if that’s too many, can you use e.g. the modulo function to distribute array elements among a fixed set of locks, making sure your threads are not contending for the same one (sketched after the list)? I bet you can in many cases. The more general advice here is: use different locks for different data whenever possible.
  4. Employ interlocked / atomic operations: There is no need to use locking just to increment a counter. Most parallel programming systems offer atomic operations for simple tasks like that. Use them whenever available (sketched after the list).
  5. Use synchronized data structures: Even if you cannot use atomic operations directly, maybe you can use a data structure that uses them internally, a lock-free message queue for example (sketched after the list). Data structures like that are available for a variety of systems now and give you the power to synchronize your threads without worrying about locks.
  6. Use Reader-Writer Locks where applicable: For the cases where many threads read a memory location that is rarely changed, the reader-writer lock was invented, and it is available in many parallel programming systems. It ensures that multiple readers can enter the lock at the same time; only when a writer enters the picture does the lock become exclusive (sketched after the list).
  7. Use Read-Only data whenever possible: In some parallel programming systems (e.g. in Java), the system will make sure read-only data is properly published to all threads, eliminating the need for locking entirely, since the data cannot be changed anyway (sketched after the list).
  8. Avoid Object Pooling: In some parallel programming systems, object creation is expensive (note: I am not talking about Java today. Maybe the Java of 5 years ago.) Many programmers have therefore become used to reusing objects and storing them away for later reuse in pools. In the multi-threaded case this turns out to be a problem, because of course the pool needs to be protected from concurrent access as well. Depending on how fast objects can be created in your system, it may be a good idea to forget about object pooling while working on multi-threaded code (or in general).
  9. Use local variables or thread-local storage: In many cases, a shared variable can be substituted by one that is local to each thread. Think of the simple case of a reduction: when you want to find the biggest number in an array, you can either store it immediately in a max-variable that is updated by all threads as they go (this is the contended case), or you can have each thread store its own result locally and combine the partial results at the end (the uncontended case; sketched after the list).
  10. Avoid hotspots: A hotspot is a resource (e.g. a shared variable) that needs to be updated frequently. Think of a linked list: the size of the list is stored somewhere in many list implementations, because the implementers want the size()-operation to complete in constant time (if the size is not stored and kept up to date, the size()-method needs to walk the whole list each time it is called). What seems like a good performance optimization at first sight may shoot you in the foot in the multi-threaded case: now each and every list operation that changes the number of elements needs to update the size-variable, an update that must be protected from concurrent access. This is what is called a hotspot. Avoiding it is simple in this case: just don’t store the size of the list, and the hotspot is gone. Software development is the art of choosing the right tradeoffs ;-).
  11. Bonus: use a transactional memory system: Transactional memory is supposed to give you all the advantages of fine-grained locking at a complexity comparable to coarse-grained locking. Unfortunately, I have yet to see a transactional memory system that is ready for production (which is why this is merely a bonus). But maybe you have? If so, be sure to let me know!
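
Here is a minimal Java sketch of tip 1 (the class and names are my own invention for illustration): instead of declaring the whole method synchronized, only the access to the shared map is guarded, and the purely local work stays outside the lock.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical example: lock around the shared data only, not the whole method.
    public class RequestStats {
        private final Map<String, Integer> hits = new HashMap<String, Integer>();

        public void record(String rawUrl) {
            // Purely thread-local work: no shared data touched, no lock needed.
            String key = rawUrl.toLowerCase().trim();

            // Only the check-then-update of the shared map must be atomic.
            synchronized (hits) {
                Integer old = hits.get(key);
                hits.put(key, old == null ? 1 : old + 1);
            }
        }
    }

As a nice side effect, the lock object here is private, which also addresses Kent’s point in the comments below.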
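
A sketch of tip 2, again with invented names: the expensive formatting is done before the lock is taken, so the lock covers only the actual access to the shared writer.

    import java.io.PrintWriter;

    // Hypothetical example: hoist expensive work out of the critical section.
    public class EventLog {
        private final PrintWriter out;

        public EventLog(PrintWriter out) { this.out = out; }

        public void log(String user, String action) {
            // Expensive string formatting happens outside the lock...
            String line = String.format("%d %s %s",
                    System.currentTimeMillis(), user, action);
            synchronized (out) {
                out.println(line); // ...the lock covers only the shared writer.
            }
        }
    }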
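
Tip 3 as a Java sketch (the stripe count of 16 is an arbitrary choice): array indices are distributed over a fixed set of locks via the modulo function, so threads working on indices in different stripes never contend.

    // Hypothetical example of lock striping over an array of counters.
    public class StripedCounters {
        private static final int STRIPES = 16;
        private final long[] counters;
        private final Object[] locks = new Object[STRIPES];

        public StripedCounters(int size) {
            counters = new long[size];
            for (int i = 0; i < STRIPES; i++) {
                locks[i] = new Object();
            }
        }

        public void add(int index, long delta) {
            // Indices that fall into different stripes use different locks.
            synchronized (locks[index % STRIPES]) {
                counters[index] += delta;
            }
        }
    }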
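
Tip 4 is the simplest of them all in Java, since the standard library ships atomic classes:

    import java.util.concurrent.atomic.AtomicLong;

    // A counter that needs no lock at all: the increment is a single atomic operation.
    public class HitCounter {
        private final AtomicLong hits = new AtomicLong();

        public void hit()     { hits.incrementAndGet(); }
        public long current() { return hits.get(); }
    }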
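
For tip 5, Java’s ConcurrentLinkedQueue is exactly such a lock-free message queue from the standard library; the caller never touches a lock (the wrapper class is invented for illustration).

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // Hypothetical producer/consumer hand-off without any explicit locking.
    public class MessagePump {
        private final Queue<String> messages = new ConcurrentLinkedQueue<String>();

        public void post(String msg) { messages.offer(msg); }

        public String take() {
            return messages.poll(); // returns null if the queue is empty
        }
    }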
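
A sketch of tip 6 with Java’s ReentrantReadWriteLock (the cache itself is made up): any number of readers may hold the read lock at the same time, while the rare writer gets exclusive access.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.locks.ReadWriteLock;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    public class SettingsCache {
        private final Map<String, String> settings = new HashMap<String, String>();
        private final ReadWriteLock lock = new ReentrantReadWriteLock();

        public String get(String key) {
            lock.readLock().lock(); // shared: many readers at once
            try {
                return settings.get(key);
            } finally {
                lock.readLock().unlock();
            }
        }

        public void put(String key, String value) {
            lock.writeLock().lock(); // exclusive: the writer blocks everyone
            try {
                settings.put(key, value);
            } finally {
                lock.writeLock().unlock();
            }
        }
    }

But do read Paul’s and Matan’s comments below before reaching for this one: reader-writer locks have costs and pitfalls of their own.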
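
Tip 7 in Java means immutable objects, sketched here as an invented configuration class: because all fields are final and the reference does not escape during construction, the Java memory model guarantees safe publication, and no locking is ever needed to read such an object.

    // Hypothetical immutable object: safe to share between threads without locks.
    public final class ServerConfig {
        private final String host;
        private final int port;

        public ServerConfig(String host, int port) {
            this.host = host;
            this.port = port;
        }

        public String host() { return host; }
        public int port()    { return port; }
    }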
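
And finally a sketch of tip 9, the reduction example from above in Java (the thread count and the random data are made up): each thread computes a local maximum without any synchronization, and the partial results are combined after the threads have been joined.

    import java.util.concurrent.ThreadLocalRandom;

    public class MaxReduction {
        public static void main(String[] args) throws InterruptedException {
            final int[] data = ThreadLocalRandom.current().ints(1_000_000).toArray();
            final int nThreads = 4;
            final int[] partial = new int[nThreads]; // one slot per thread, never contended
            Thread[] workers = new Thread[nThreads];

            for (int t = 0; t < nThreads; t++) {
                final int id = t;
                workers[t] = new Thread(() -> {
                    int localMax = Integer.MIN_VALUE; // thread-local, uncontended
                    for (int i = id; i < data.length; i += nThreads) {
                        localMax = Math.max(localMax, data[i]);
                    }
                    partial[id] = localMax;
                });
                workers[t].start();
            }
            for (Thread w : workers) {
                w.join(); // join makes the partial results visible to the main thread
            }

            int max = Integer.MIN_VALUE; // sequential combine step
            for (int p : partial) {
                max = Math.max(max, p);
            }
            System.out.println("max = " + max);
        }
    }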

Disclaimer

The tips are not ordered by importance; each one may help in different situations. Some of them I have already posted in my article about thread-safety, some are inspired by the excellent book Java Concurrency in Practice (pages 229-245), and some I have not seen in written form anywhere else. Of course, not all of them are applicable to every parallel programming system. POSIX Threads, for example, has no atomic operations. I still find this a useful list to have in one place, no matter which threading system is closest to your heart :smile:.

13 Responses to 10 Ways to Reduce Lock Contention in Threaded Programs


Comments

  1. Comment by Kent Boogaart | 2007/08/01 at 04:46:07

    Great post – 5 stars.

    What about keeping locks private? I.e., not locking on public objects. If you lock on a public object (such as the this pointer), then you’re opening yourself up to contention issues, because anyone can lock on the same object. This is another reason to avoid synchronized methods.

  2. Comment by Revence | 2007/08/01 at 09:48:34

    Nice post, here.

    Shared-nothing is the best default when parallelisation is going to happen.
    This shouldn’t be about the system design. It can be about the language, as well. As they say, “In a parallel world, imperative is the wrong default”.

  3. Comment by Brian Goetz | 2007/08/01 at 16:39:31

    Nice list. Just one caveat…

    These are all good recommendations — but some of them should be pursued ONLY IF you are experiencing performance or scalability problems. However, the majority of the time I see some of these techniques in use (lock striping and RW locks especially), they are premature optimization. I’d prefer to see some disclaimers to that effect, because programmers are drawn to “optimization” like flies to you-know-what.

    There is a significant trade-off associated with these techniques. I’ll take lock striping as an example, but the reasoning works for RW locks, excessive splitting of synchronized blocks, etc. Lock striping gives you finer-grained access control, enabling greater concurrency — but is harder to get right. With one big fat lock, it’s much easier to verify thread safety; with many, it’s harder. So this is a tradeoff where you get better potential scalability at the cost of maintainability / fragility.

    This is sometimes a great trade — but often is a lousy one. Do you _know_ that the code in question is the source of performance problems? If not, you’ve made your code more complicated, more fragile, and less maintainable for no benefit. Add this to the fact that most programmers have terrible intuition about where the bottlenecks are. Without a good measurement program in place, or a deep understanding of the issues (in which case the reader probably doesn’t need this advice), most of the time applying these techniques is (at best) premature optimization.

    I would rename item #5 “Trust the libraries”.

  4. Comment by Michael Suess | 2007/08/01 at 17:14:16

    @Kent: you are right, this is a valid technique for some cases as well, although partly covered by point no. 1 already.

    @Revence: I am not going to get myself into a language war with you, that’s for sure :-).

    @Brian: thanks for your insightful comment (and for the book, for that matter :-) ), you are right, a warning about the applicability of the techniques is in order. I have added it in big bold letters. I must not forget that people don’t read all of my articles, but merely some of them, so I need to repeat myself frequently (for example, I have a whole post warning about premature optimization).

  5. Comment by Kyle | 2007/08/01 at 18:32:20

    Haskell (GHC, specifically) comes standard with a transactional memory library (STM), with all sorts of bells and whistles (orElse, etc) and it is as production ready as anything else in Haskell.

    For reference, see SPJ’s OSCon talk or http://haskell.org/ghc/docs/latest/html/libraries/stm/Control-Monad-STM.html

  6. Comment by Paul Betts | 2007/08/01 at 21:54:11

    Great article. I didn’t think of using the mod operator to protect an array like that, so you have n/k locks instead of n.

    Of course, like every other dev, I’ve got an opinion to put in about the article: while RW locks seem like a good idea in any scenario that fits conceptually, you should remember that a RW lock typically uses two locks internally and involves more waiting, so it could be costing you more than just using a simple lock. However, if you’ve got a resource that has a *lot* of readers and only occasional writers, it’s a win.

    But how much is a lot? Don’t guess, profile! There are far too many factors to accurately guess which one to use.

  7. Comment by Saurabh | 2007/08/03 at 00:59:32

    Nice one, but I don’t like your example in item 10, for two reasons:
    1. If the append/remove of a link is going to be locked anyway, how much harm can there be in slipping the counter increment/decrement inside the lock? I would say that surely won’t violate the 80-20 rule, and in almost all realistic usage cases it’d be better to store the size inside the list.
    2. Item 4 already talks about atomic operations for counters.

    Also similar to the object pooling in item 8 is reference counting.

  8. Comment by Michael Suess | 2007/08/03 at 21:48:02

    @Paul: I fully agree, every decision has its tradeoffs, and when in doubt, only your profiler can tell for sure what’s better.

    @Saurabh: Who says that the list is protected by a lock anyway? :smile:

  9. Comment by Matan | 2011/11/03 at 12:57:47

    Just stumbled across this article, better late than never :)
    While I’m mostly aware of your points and have had the (mis)fortune to do a lot of concurrent programming, I have to say this is an excellent list!

    Let me pitch in with 2 comments regarding read-write locks:

    1) When using them, find out if and how the locking functions you’re using handle starvation. For example, say you have 3 threads on 3 CPUs (one on each). Threads 1 and 2 pretty much do their entire work under the lock in read mode (they release it between tasks but are constantly doing tasks). The 3rd thread runs very rarely, but when it does it needs the lock in write mode. Now, imagine an implementation where access is always granted when possible (e.g. Linux 2.6 kernel rwlock_t locks). So thread 1 obtains the lock for read and does a task. Now thread 3 wants the lock for write, so it waits. Midway through thread 1’s work, thread 2 obtains the lock for read. Now thread 1 finishes its task and releases the (read) lock, but thread 2 still holds it (for read), so thread 3 can’t get it (for write). Now, while thread 2 is in the middle of its task, thread 1 starts another task and takes the lock for read. And you can see where I’m going with this: thread 3 is starved “forever”.

    2) Assuming you’re familiar with the need to define a hierarchy between different locks, read-write locks present some extra difficulty (they actually do so even if you’re unfamiliar with locking hierarchies). Suppose you hold a read lock on a data structure, do some work, and now decide you need to change the data structure, which requires the lock in write mode. Can you upgrade the read lock you hold to write mode? You can, but you have to first release the lock for read and then obtain it for write to prevent hierarchy issues (at least in some implementations; otherwise two readers trying to upgrade to write will get into a deadlock). What’s the added complication here? Between the point where you released the read lock and the point where you obtained the write lock, another thread could have modified the data structure, and whatever you decided to do while holding the read lock, and for which you obtained the write lock, might not be relevant anymore (which is a nice way of saying that your code can crash). Thus, you may need to re-verify some assumptions after you obtain the lock for write.


Trackbacks & Pingbacks

  1. [...] locks kill scalability and make threaded programs slow. Here are 10 techniques to reduce lock contention… Bookmark [...]

  2. Pingback by Legendary Diplomacy » Blog Archive » Scarlet news… | 2007/12/06 at 14:02:04

    [...] synchronized home-grown caches forced us to write a common cache adapter on top of them, using a lock-striping solution and then TC-ize that. Not exactly the kind of stuff you code in the [...]

  3. [...] Suess has a great blog on parallel programming and concurrency. Even though this post has already got some months, I still recommend it to everyone that wants to improve their knowledge [...]

  4. [...] [Chen Shuo] This point really resonates with me; the muduo thread lib currently provides no read-write lock wrapper at all. Besides, this point also helps tell that another article on thread contention is unreliable. [...]
