Thinking Parallel

A Blog on Parallel Programming and Concurrency by Michael Suess

Please Don’t Rely on Memory Barriers for Synchronization!

I was innocently browsing the net when I found this on reddit. It’s an article about synchronization using memory barriers. And it made me sad (I was going to write: it almost made me cry, but I figured that would be a little exaggerated). This article is a warning: it tells you why you should please not use the techniques described there, unless you really, really know what you are doing!

I usually don’t do this, but I would like to step through this article a bit with you to make it clear why it bugs me so much. But before I do, let me make one thing perfectly clear: I don’t know the author of the article. I hear he makes a fine hex editor for Mac OS X. This is not in any way meant as a personal attack on him or his views; I just think this particular post is in dire need of warnings, clarifications and editorial changes, because the conclusions it draws are dangerous. It is an interesting article for people with experience in multithreading. For beginners to the topic, I believe it to be dangerous.

It starts with the definition of terms:

Threads are just preemptively scheduled contexts of execution that share an address space.
[snip]
Multithreading is the physically simultaneous execution of multiple threads for increased performance, which requires a dualie or more.

Threads are not necessarily preemptively scheduled. Multithreading has nothing to do with simultaneous execution. There is a nice and easy-to-find article on Wikipedia that explains the terms correctly, so I don’t have to; so does any good book on a parallel programming system using threads.

Yeah, I know. “Multithreading is hard” is a cliché, and it bugs me, because it is not some truism describing a fundamental property of nature, but it’s something we did. We made multithreading hard because we optimized so heavily for the single threaded case.

This may be one of the reasons. Multithreading is considered hard because it requires the programmer to keep multiple execution paths in his head simultaneously. Multithreading is considered hard because threads share state and can therefore potentially change each other’s values in memory at any time. Multithreading is considered hard because the debuggers and profilers are not as mature as those for the single-threaded case. And there are a multitude of other reasons. None of them are show-stoppers for me. None of them keep my students from writing correct programs after they have had some practice and guidance (but that’s a topic for another post). But please, please don’t make it look like the only thing to be aware of when dealing with threads is instruction reordering by the compiler!

Let’s face it: until here, everything I have said can be considered hair-splitting. Nothing that a few well-placed edits in the article cannot fix. I am not going to go through each of his implementations, but will jump a little further down instead into one of his explanatory boxes, where it starts to get interesting again:

What’s the usual response to cross-processor synchronization issues like this? A mutex! Let’s try it.
fish ) ./a.out
0 failures (0.0 percent of the time) in 479.5 seconds

It made the tests pass, all right – but it was 130 times slower! A spinlock does substantially better, at 20 seconds, but that’s still 440% worse than no locking – and spinlocks won’t scale. Surely we can do better.

Ouch. Until this part of the article, everything can be considered toying around with threads. But now it is getting dangerous. As people read this, this point will stick: locks are slow and should be avoided. And this is so wrong! Locks are the only safe and portable solution to the toy problem described in this article. Let me repeat: there is no other way when programming with POSIX Threads to solve this problem, except with locks. Unfortunately, the article does not even show code for this solution. Every other solution relies on architecture specifics (that can change anytime the processor vendor feels like it!) and should not be employed. I am sure the original author wanted to improve on that solution, but to the reader, this leaves a very wrong impression (more on that later). Is the solution with locks slower? Sure it is, but in a real program that does real work, the percentage of time spent in the locks will go down, and even though locks may be slower than toying with memory barriers, you may not even notice the difference then. And even if you do: there are ways to reduce the need for locking, and if that’s not enough, there are lock-free libraries available. But please don’t start to play with memory barriers and the like until you are sure you actually have a problem with synchronization speed!

I have really gotten into preaching mode here, I am sorry about that, but this really, really gets to me. The methodology shown here is wrong as well. It is obvious that the author has not really dug deep into POSIX Threads, the system he is using (if he had, he would have known pthread_join() or pthread_exit() – and would not have needed the hack he used to keep the main thread alive). When he saw a problem with synchronization, he chose to dig into the architecture description of his Mac and to rely on unportable, unsafe hacks to satisfy his one test case. Why not read a good book on POSIX Threads instead? In Programming with POSIX Threads (the bible when it comes to POSIX Threads) he would not have found advice like his. Instead, he would have read the advice to always create a correct program first, then optimize later when the need arises. A point that is very important for beginners to understand.

Messing around with memory barriers is hard. It is so hard that it is, for example, considered a mistake by many members of the OpenMP language committee to have exposed them in the OpenMP API via the flush directive. I found the article I am talking about here on the frontpage of programming.reddit.com, where it was read and voted up by hundreds of programmers, many of whom have only a vague idea about what programming with threads means. What will they learn from this article? Locks are slow and should be avoided? Memory barriers are a good and fast way to guarantee synchronization? When in doubt, use volatile? Double-checked locking is a good idea (it’s not. never. ever.)? Each of these is a fatal misconception, if you ask me – at least for beginners to threads programming. When these programmers go out and do their first project with threads and try to program using the advice given, they will undoubtedly produce a Mount Everest of errors. And they will try to debug them and are most likely not going to make it. The project is then canceled, and the next time somebody asks them, they will happily tell everyone that threading is just too hard to get right.

Am I exaggerating a little? Sure I am. But I am also sick of hearing this over and over again. It is possible to program with threads, if you stick to the tried and known rules. People have been doing it for ages in the field of High Performance Computing. Sure, it is harder than normal programming, but it’s not rocket science either.

This post sounds harsh on the original author of the ridiculousfish article. It’s not meant to be. Unfortunately, what began as an interesting experiment on his side (and that’s what it is – an interesting experiment) has ended with all the wrong conclusions. Conclusions that are dangerous and cannot be drawn from it, if you ask me. And that’s why I feel the need to speak up here. If nothing else, please, at least add big, fat warning signs to your post that it is not intended for beginners! Right now, it sounds like a nice introduction to threading. But it is far from that; it’s material that should only be considered by advanced programmers in the field of concurrent programming who need every bit of performance they can get on their specific architecture. Memory barriers are not toys for inexperienced programmers and should be used as a last resort only! Sorry for the rant, but I just had to get this off my chest…

24 Responses to Please Don’t Rely on Memory Barriers for Synchronization! »»


Comments

  1. Comment by Håkan | 2007/02/19 at 12:58:52

    I think it would be far more beneficial to all the readers of the ridiculous_fish article (that may end up reading this response) if your reply contained at least summaries of why his code is “very wrong”.

  2. Comment by nex | 2007/02/19 at 13:18:05

    Hi Michael, I’ll try to play devil’s advocate and point out everything that’s wrong with your post; not because I hate it, but just for the heck of it — well, you already know what I’m talking about :-)

    In most of the issues you address, beginners are your main concern. But the article states right at the beginning that (a) it is about multiprocessing and (b) “I admit it – I don’t get multiprocessing”. A beginner wouldn’t see this and think, “cool, here’s a beginners’ tutorial that teaches me all I need to know about threads.” Unless he’s also an idiot … someone that stupid won’t ever figure out correct multithreading anyways.

    That article points out interesting pitfalls that someone who _does_ know, in theory, how to write multithreaded programs correctly, might so far only have avoided because they’re irrelevant to her particular CPU architecture (say, x86). You won’t read it if you have no clue about the topic, because in that case, it would be clear that the subject is over your head. Otherwise, you already know that instruction (or I/O) reordering isn’t really the only important issue.

    And it’s not a mistake to rely on architecture specifics! One issue here is having more cores than you know what to do with, which means having to take the threads you already have and splitting them up in very fine-grained ways. So fine-grained that the speed difference between using a lock and not using a lock is measured in orders of magnitude. It would be nice if there was a portable, platform-independent way to handle this, but as you said yourself, the tools we have aren’t even quite ready yet for ordinary multithreading. Here, in this context (where we might have 80 cores), relying on architecture specifics is a necessity. And neither is there any problem with premature optimization here. We already have a correct program to begin with. The whole article is about optimizing it (in a correct way), from the start. We’ve already determined that it’s a problem that it only runs on a fraction of the available processors. Also, even if double checked locking is never a good idea, the author does point out circumstances where it can’t possibly work correctly, and what to do about it. So maybe he dealt with it in a silly way, but not in a wrong way.

    One more thing:
    “Threads are not necessarily preemptively scheduled. Multithreading has nothing to do with simultaneous execution.”
    That’s why he didn’t say this was an explanation — as you say yourself, it’s a definition. I’m not above having pity for people who can’t read and might misunderstand this, but you’re implying that the quoted text is wrong, which it is not.

  3. Comment by Michael Suess | 2007/02/19 at 13:57:33

    @Håkan: if that’s what you have gotten out of my article, then I am to blame for not making myself clear enough. I am not saying his code is wrong. As far as I can see, the code is right for his particular problem on his particular platform. Yet what he does is not portable and not a safe thing to do in general, and I suspect many readers will assume it is. And this is what bothers me about the article…

  4. Comment by Ryan | 2007/02/19 at 16:45:11

    Dude,

    You still have not been specific. You are saying “Yet what he does is not portable and not a safe thing to do in general” …

    “general” is not specific. I guess what we need is for you to mention an example or two of how and where things can go wrong.

    Without a few specific examples, your article has no bite. And you’re not teaching us anything new. At least the guy who toyed with memory barriers gave us the possibility of something new.

  5. Comment by scott | 2007/02/19 at 17:16:57

    Ryan,

    so using the word “general” in a sentence means that the sentence could not possibly be communicating something specific?

    how about this for a specific statement:
    “Specifically, what he does is not portable and not a safe thing to do in general (ie – other processors/platforms)”

  6. Comment by Blain | 2007/02/19 at 18:11:43

    I admit that it’s rather well hidden, but he admits what he does isn’t portable and not a safe thing to do:

    Ok, here’s the things I mean when I say the following, uh, things


    How do we know that the reader thread won’t see a variable in some intermediate state, midway through being updated? We have to know that these particular operations are atomic. On PowerPC and x86 machines...


    The answer is that, yes, amazingly, dependent reads like that can be performed seemingly out of order, but not on any processors that Apple ships.


    [memory barriers] take more thought, and aren’t always applicable...

    So yes, you’re both right, although he probably should have made a disclaimer that this code only addresses macs. (The objective-c and PPC-only assembly were big hints, however.) Although that makes me wonder, in the extremely-limited context of non-POSIX darwin systems running on PowerPC and Core Duos, do any of the pitfalls you illustrate apply?

  7. Comment by Jason | 2007/02/19 at 19:10:32

    Your implication seems to be that people shouldn’t ever write articles that aren’t noob friendly. That’s plainly ridiculous. The article in question had all the right caveats, and the blog in question has a history of tackling similar non-beginner topics. If you want to make the basic point that locks are more portable and easier to use than memory barriers, fine, but don’t shroud your sherlockism* in a criticism of what amounts to a good programmer using his personal blog to work through an issue that we’re all going to need to become intimately familiar with over the next few years.

    (*as in, no shit, sherlock)

  8. Comment by King Chung Huang | 2007/02/19 at 19:13:56

    It might help to note that ridiculous_fish is an Apple engineer, which would explain the Mac-centric focus. While using memory barriers isn’t truly portable and is hardware-dependent, the code presented in the article is targeted at hardware that Apple ships, as noted in the article.

  9. Comment by Michael Suess | 2007/02/19 at 21:55:24

    Thanks for all of your responses. I am sorry, I don’t have the time to answer each of you individually, but will point out a couple of things again, since obviously I did not make them clear enough before:

    The topic of parallel programming has gotten a lot of attention lately, so I think there are a lot of people out there who think it is important and who, especially since ridiculous_fish and his blog have such a fine reputation, will use the knowledge presented there without getting “the big picture” first. I also know that ridiculous_fish probably did not intend this article as a tutorial for beginners, but at least to me it leaves the impression that it is one, although the content presented there is advanced and even dangerous for a beginner, because it is so hard to get right. And as you might have noticed, this was my main plea to him: please make it clearer that this is advanced material for the Mac only. Maybe regular readers of his blog know this, but since it was also featured on reddit, many people who did not know his blog before also got to read it (like e.g. me).

    Of course I am not saying nobody should be writing articles that are for experts and not suited for beginners, but if you do, please make it not sound like the opposite (which is of course, a personal impression again, but one that seems to be shared by many people).

    @Ryan: you wanted examples of where the code may break. Like I have said before, it probably won’t break on Macs (or if it does, I am not the one to judge that, because I don’t own one). From a pure POSIX perspective (without taking architecture specifics into account), there are several things wrong. You cannot rely on atomicity, therefore you have to protect each and every write to a memory location that could be written to by another thread with locks. And it’s even worse than that: it is not even guaranteed that the reader thread ever gets to see the new values at all (this has to do with the POSIX memory model and is a topic too big for a comment – I have been meaning to write an article about that for some time). Double-checked locking may work on one specific platform with one specific compiler, but is generally considered broken (I already gave the links in the article). All of these things are probably OK as long as you know your platform and it is not changed by your vendor – but they will break as soon as the platform changes. And that’s why I consider them dangerous for beginners, because things like that should only be used as a last resort, by people who really know what they are doing. And that’s the main point I was trying to make. Thanks for listening.

  10. Comment by pongba | 2007/02/20 at 05:40:30

    I’m sorry, Michael, but isn’t the PThread approach proven inferior (as in Hans Boehm’s paper “Threads Cannot Be Implemented as a Library”)?
    What made me so surprised was that the PThread memory model doesn’t allow one to rely on atomicity, I mean, how’s that gonna work?

  11. Comment by Michael Suess | 2007/02/20 at 10:25:21

    @pongba: the question is, inferior to what? C/C++ has no concept of threads inside the language (the C++-committee is trying to change that at the moment, but it will take some years). Therefore you pretty much have no choice but to use a library (e.g. pthreads) or use something like OpenMP that is built on top of C/C++ (and needs a compiler, that understands OpenMP – which is a barrier for many programmers that don’t want to switch compilers). Or switch the language altogether, and use one where threads are included (Java and C# probably being the most prominent). Or, as many other programmers will probably tell you, switch away from threads entirely and use message passing. It has been shown that the “threads as a library” approach is suboptimal, but the alternatives for C/C++ have their problems, too.

    Regarding your question, Pthreads does not need atomicity, because it just says if you need to change a variable, protect it with a lock. This is the slower, albeit more portable approach.

  12. Comment by Benjamin Ragheb | 2007/02/20 at 20:16:42

    I am so glad you posted this, since you are so smart you can point out how dangerous the information on the Ridiculous Fish blog is when read by people who aren’t as smart as you. I will admit I was disappointed, however, that you didn’t make your criticism more portable, so I wouldn’t have to modify it to apply to the specific details of other blogs I read.

  13. Comment by Shawn Erickson | 2007/02/21 at 01:18:59

    Wow, why all the anger at what Michael posted… he is correct in his assessment that you should only start doing stuff like this when you find that you really need to (for performance reasons), and that the post on ridiculous fish is a little too light on the warnings about doing this.

    You will often get better performance increases by changing algorithms (after profiling) than by attempting to avoid locks using various tricks.

    Oh Benjamin nice attitude… *rolls eyes*

  14. Comment by Paul Davis | 2007/02/21 at 18:01:44

    Just a quick note: the fact that POSIX doesn’t define atomic operations doesn’t mean that they cannot be implemented in a portable fashion. glib, the linux kernel and several other libraries all contain a useful set of atomic operations that use architecture specific memory barriers and other techniques.

    It’s unfortunate that PThreads didn’t include this functionality. As much of an advocate of PThreads as I am, I see this as the POSIX committee’s issue and not mine, and feel free to use atomic operations in my code, portably.

    I also think that those of us who do real-time development with threads, in which we cannot afford to take locks and have thus been forced to grapple with lock-free but safe programming techniques, need to spread the news about what we do a little more widely.

  15. Comment by Benjamin Ragheb | 2007/02/21 at 21:15:07

    Shawn why are you rolling your eyes at my attitude? I am not angry at all. I am very grateful that there are people like Michael on the Internet to warn people like myself about information that could be dangerous to people who are not as smart as he is. The only thing that would be better is if the smarter people could keep that stuff away from the Internet to begin with so people like me would not run the risk of reading it in the first place.

  16. Comment by alexr | 2007/02/22 at 19:33:28

    “Locks are the only safe and portable solution”

    Unfortunately, I can’t agree with you on this. There is a time and place for hardware atomic primitives including memory barriers. Perhaps in a Cocoa app as shown on the ridiculous_fish site is not the best place, but assuming a POSIX threading model there would be incorrect as well.

    Cocoa (Foundation) has the NSThread abstraction, which is the proper place for synchronization techniques related to whatever the underlying threading model is. (NSLock, NSConditionLock, etc.) It is an oversight by the AppKit team that lower-level operations are not supplied in the framework, although portable ones are supplied by libkern below it that could probably be assumed to exist for a Cocoa application. ($5 says they’re already implemented on the ARM port for the iPhone.)

    Additionally, recent versions of GCC implement quite a full suite of atomic operations like compare_and_swap in the backend such that if one can assume GCC as a compiler, one can assume that these operations are available in a fairly portable fashion. It is a simple exercise to produce a set of macros that convert these calls into those implemented in Visual Studio or many other compilers which now increasingly provide access to hardware atomic primitives.

  17. Comment by Andrew Miehs | 2007/03/01 at 14:15:24

    I don’t think that the ridiculous fish article was trying to say that ‘memory barriers’ are the only way to do things – I think it was more a case of ‘be careful’ and ‘look – when I poke the machine, it pokes me back’.

    Most of the articles on his page are to do with ‘performance’ – and I think most people would see this article as a performance vs portability tradeoff.

  18. Comment by Sitsofe Wheeler | 2007/06/10 at 22:34:21

    You aren’t the only one warning against casual use of memory barriers. There’s a warning that many people wind up misusing them in Linux kernel drivers too over here: http://zaitcev.livejournal.com/144041.html . It seems that intuitive usage of them doesn’t tend towards correctness even with experienced (but non-parallel expert) programmers…

  19. Comment by Crest | 2007/08/04 at 23:38:36

    The instructions used (sync, lwsync, eieio) are not vendor-specific but architecture-specific, as all assembler code is by design. The PowerPC specification does offer some freedom to implementors, e.g. whether to implement fsel, but the synchronisation primitives aren’t optional and their semantics are well defined. Nevertheless, the criticised article is not suited for beginners, as it encourages premature optimizations.

  20. Comment by Dmitriy V'jukov | 2009/03/04 at 11:05:20

  21. Comment by Kaya | 2010/12/03 at 22:28:08

    Corensic has a tool, Jinx, that can help reproduce and debug concurrency problems like the one elaborated on by ridiculous_fish. Jinx makes your code “unlucky” by forcing hard to find concurrency bugs to occur with more frequency while you are debugging and testing.

    In a nutshell, Jinx monitors the program at runtime and intelligently decides on thread timings to run in order to force a bug to manifest itself. For each of its chosen thread timings, Jinx creates a simulation to explore a different path of execution. Once Jinx finds a simulation that will cause a bug, Jinx “replays” the simulation into reality so that you can debug it, stopping each thread at a point which is likely causally related to the bug. Unlike many other tools, there are no false positives, and because the search is intelligently directed, Jinx can speed up repro rates by orders of magnitude, and find bugs which haven’t yet been seen in-house.

    More detail on how Jinx works can be found here: http://www.corensic.com/WhyYouNeedJinx/CorensicHasaUniqueTechnologyforFindingBugs.aspx

    Disclaimer: I work for the company, but I think the underlying technology is pretty amazing.

  22. Comment by Hugh | 2013/01/16 at 22:10:44

    I think there are valuable issues discussed in each of these articles. let me say that from the outset.

    However Mike’s article’s title discourages the use of memory barriers for synchronization whereas the article by Fish isn’t really about synchronization but ordering.

    A barrier never causes a thread to block and never can under any circumstances – whereas a lock can and will sometimes cause a thread to block – by design.

    Ordering and synchronization are different issues and should not be lumped together as they seem to be here.

    Look at the simple example in this wikipedia write-up:

    http://en.wikipedia.org/wiki/Memory_barrier

    If the barrier operations are inserted correctly then no lock is required, period; and there never will be any performance impact that we might get with a lock (yes there may be a tiny degradation because we are denying the processor the opportunity to reorder as it sees fit – but this will be tiny).

    The simple wikipedia example could be solved with a lock of some form but why do that when it is only the ordering that is an issue?

    This is one of the problems I see often with all this – there are several issues at play typically and the problem at hand must be 100% understood.


Trackbacks & Pingbacks »»

  1. [...] Then a followup re-adjusting newbies to the complexities of using memory barriers for synchronisation. [...]

  2. Trackback by Peter Holditch's Blog | 2007/04/04 at 23:58:17

    The morning after……

    Roaming the blogosphere, as I all too seldom do, I often feel like a party-goer the morning after a major rave-up… Stumbling between different points of view and commentary, finding interesting tidbits here and there as I pass by. On…
