I see this mistake quite often: people have a performance or scalability problem with their threaded code and start blaming locks. Because everyone knows that locks are slow and limit scalability (yes, that was ironic ). They start to optimize their code by getting rid of the locks entirely, e.g. by messing around with memory barriers. Or by just not using locks at all. Most of the time the result happens to work on their machine but has subtle bugs that surface later. Let’s see what the experts say about using locks, in this case Brian Goetz in his excellent book on Java Concurrency in Practice:
When assessing the performance impact of synchronization, it is important to distinguish between contended and uncontended synchonization. The synchronized mechanism is optimized for the uncontended case (volatile is always uncontended), and at this writing, the performance cost of a “fast-path” uncontended synchronization ranges from 20 to 250 clock cycles for most systems. While this is certainly not zero, the effect of needed, uncontended synchronization is rarely significant in overall application performance, and the alternative involves compromising safety and potentially signing yourself (or your successor) up for some very painful bug hunting later.
Could not have told you any better than that and I can assure you this man knows what he is talking about. Although he speaks about Java specifically, the result is the same for other parallel programming systems: the first step when you have synchronization problems is not to get rid of locks, but to reduce lock contention. This article will tell you 10 (+1) ways to do just that.
Update: As Brian notes correctly in his comment below this article, these are advanced techniques and should only be employed if you are absolutely sure, locks are the problem! The only way to be sure of course is to use your profiler of choice, or else you will be guilty of the number one sin of ever programmer: premature optimization (and you don’t want to go to programmers hell for this one, do you? ).
- Protect data, not code: A common technique to achieve thread-safety quickly is to slap a big lock around a whole function call. Or in Java’s case, declare the whole method synchronized. Remember that it is merely the data that needs to be protected from concurrent access, not the code! Following this advice alone can take the time spent in you locks down significantly.
- Get rid of expensive calculations while in locks: This may seem obvious, but a little instruction reordering and use of temporary variables can bring down the time spent in locks considerably, especially as soon as I/O is involved.
- Use lock striping: Does your whole array really need to be protected by the same lock? Or can you give each element its own lock? Or if thats too many, use e.g. the modulo-function to distribute array elements to different locks and make sure your threads are not contending for the same lock? I bet you can in many cases. The more general advice here is: use different locks for different data whenever possible.
- Employ interlocked / atomic operations: There is no need to use locking just to increase a counter. Most parallel programming systems offer atomic operations to achieve simple tasks like that. Use them whenever available.
- Use synchronized data structures: Even if you cannot use atomic operations directly, maybe you can use a data structure that uses them internally. A lock-free message queue for example. Data structures like that are available for a variety of systems now and give you the power to synchronize your threads without worrying about locks.
- Use Reader-Writer Locks where applicable: For the cases where a lot of threads read a memory location that is rarely changed, the reader-writer lock was invented and is available in many parallel programming systems. It ensures that multiple readers can enter the lock at the same time. Only when a writer enters the picture, the lock becomes exclusive.
- Use Read-Only data whenever possible: In some parallel progamming systems (e.g. in Java), the system will make sure these are properly published to all threads, eliminating the need for locking entirely (since they cannot be changed anyways).
- Avoid Object Pooling: In some parallel programming systems, object creation is expensive (note: I am not talking about Java today. Maybe the Java of 5 years ago.) Many programmers have therefore become used to reusing objects and storing them away for later reuse in pools. In the multi-threaded case this turns out to be a problem, because of course the pool needs to be protected from concurrent access as well. Depending on how fast objects can be created in your system, it may be a good idea to forget about object pooling while working on multi-threaded code (or in general).
- Use local variables or thread-local storage: In many cases, a shared variable can be substituted by another one, which is local to each thread. Think of the simple case of a reduction: when you want to find the biggest number in an array, you can store either store it immediately in a max-variable that is increased by all threads as they go (this is the contended case), or you can have each thread store its own result locally and combine them in the end (uncontended case).
- Avoid hotspots: A hotspot is a resource (e.g. a shared variable) that needs to be updated frequently. Think of a linked list. The size of the list is stored somewhere in many list-implementations, because the implementers want the size()-operation to complete in constant time (if the size is not stored and updated constantly, the size()-method needs to walk the list each time it is called). What seems like a good performance optimization at first sight might shoot you in the foot in the multi-threaded case: now each and every list operation that changes the number of elements in the list needs to update the size-variable, an operation that needs to be protected from concurrent access. This is what is called a hotspot. Avoiding it is simple in this case: just don’t store the size of the list and the hotspot is gone. Software development is the art of choosing the right tradeoffs .
- Bonus: use a transactional memory system: transactional memory is supposed to bring you all the advantages of fine-grained locking while getting away with a complexity comparable to coarse grain locking. Unfortunately, I have yet to see a transactional memory system that is ready for production (and that’s the reason why this is merely a bonus). But maybe you do? If yes, be sure to let me know!
The tips are not ordered according to importance, each one may help in different situations. Some of them I have already posted in my post about thread-safety, some of them are inspired by the excellent book on Java Concurrency in Practice (pages 229-245) and some I have not seen in a written format anywhere else. Of course, not all of them are applicable to each parallel programming system. POSIX Threads for example has no atomic operations. I still find this a useful list to have in one place, no matter what threading system is closest to your heart .