We have a dual-processor, dual-core Opteron system at work (that makes 4 cores in total 😉 ). I often use it for performance measurements of my parallel codes, and I expected to see performance variations in my benchmarks when only two threads are used. The explanation would be simple: the Opteron system is actually a NUMA system. Half of the memory belongs to processor one, the other half belongs to processor two. Both processors are connected through the HyperTransport bus (running at 1000 MHz, if I am not mistaken). If processor one wants to access memory belonging to processor two, it can do so, but it takes longer than a local access, because the request cannot go to that memory directly, only via processor two. That gives us two different memory access times.
It gets even more complicated with four processors (it does not matter how many cores they have, as the two cores of a dual-core processor share a single memory controller). On many boards, they are connected in a ring. That gives us three different memory access times, as a memory access may have to travel across zero, one or, in the worst case, two HyperTransport links. OK, enough of the NUMA basics already, let's return to my main thread.
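As a quick sanity check of the zero, one or two links in a four-processor ring, here is a minimal sketch that treats the processors as numbered nodes on a ring and counts the links a memory request has to cross. The function name ring_hops and the node numbering are made up for illustration; real boards may be wired differently.

```python
def ring_hops(src: int, dst: int, nodes: int = 4) -> int:
    """Number of links between two processors on a ring of `nodes` processors."""
    clockwise = (dst - src) % nodes
    # A request can travel either way around the ring; take the shorter way.
    return min(clockwise, nodes - clockwise)


# Distinct hop counts seen from processor 0 on a four-processor ring.
hops_from_zero = sorted({ring_hops(0, dst) for dst in range(4)})
print(hops_from_zero)
```

With four nodes the set of distances has exactly three members, which is where the three different access times come from. With only two processors (nodes=2) the same function yields just zero or one hop.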
When I run a program with two threads on this machine, and those threads communicate from time to time, I should see different wall-clock times depending on where the threads are scheduled (on different cores of the same processor, or on different processors). Or at least this is what I suspected.
Trying to confirm or deny this once and for all, I did some experiments using three different programs:
- Watergap – a relatively big (a couple of thousand LOC) program that simulates water balances and their changes over time (hybrid version parallelized with OpenMP and MPI).
- Quicksort – the fastest parallel OpenMP-implementation I could come up with
Before I could start my little comparison, there was one problem left to solve, though: I needed a way to bind threads to processors (another option would be to wait until the scheduler decides to schedule my threads the way I like, but I do not like waiting). Luckily for me, there are now two options for that on Linux:
- use the NPTL-functionality (pthread_setaffinity_np() and friends)
- use taskset out of the schedutils package
I toyed around with pthread_setaffinity_np() and it works (as can be observed, e.g., by using top). But since I would have had to change my programs for it, and they are usually written with OpenMP anyway, I quickly abandoned that idea. The other option is nicer anyway: if you start your program like this: taskset -c 0,2 programname, it will only run on, e.g., processors 0 and 2. Nice and easy, no need to change the program, and it works nicely with OpenMP. Problem solved.
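For programs that can set their own affinity, a similar effect can be had from inside the process. The sketch below uses Python's os.sched_setaffinity(), which is only available on Linux and is a different interface from pthread_setaffinity_np(). The helper name pin_to and the choice of CPU are assumptions for illustration; which CPU numbers belong to which physical processor differs from machine to machine.

```python
import os


def pin_to(cpus):
    """Restrict the caller to the given CPU ids and return the resulting mask.

    Hypothetical helper; relies on os.sched_setaffinity (Linux only).
    """
    os.sched_setaffinity(0, cpus)  # 0 refers to the caller itself
    return os.sched_getaffinity(0)


if hasattr(os, "sched_setaffinity"):
    allowed = sorted(os.sched_getaffinity(0))  # CPUs we may use right now
    target = {allowed[0]}                      # demo: pick the first of them
    print("pinned to", pin_to(target))
    os.sched_setaffinity(0, allowed)           # restore the original mask
```

The effect can be watched in top, just as with taskset. For unmodified OpenMP programs, taskset remains the less intrusive route.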
OK, back to business. Now that I knew how to schedule my threads in any way I liked, I ran my experiments using the above-mentioned programs. The results were surprising (at least for me): I could not measure a noticeable performance difference between using two different cores on the same processor and using two different processors altogether. Since I did not expect this, I wondered whether something was wrong with my methodology, and therefore I tried a third program.
Finally, I was able to reliably measure a small difference for the barrier tests contained therein (1.43 microseconds when on the same processor vs. 1.56 microseconds when on two different processors, which is a difference of less than 10%). 10% might sound like a lot, but I usually do not have a whole lot of barriers in my programs (and if I do, I usually have bigger problems to worry about anyway 😉 ).
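The "less than 10%" figure can be checked with a few lines of arithmetic. This sketch only uses the two barrier timings quoted above; the variable names are made up for illustration.

```python
same_processor_us = 1.43        # barrier time, both threads on one processor
different_processors_us = 1.56  # barrier time, threads on two processors

extra_us = different_processors_us - same_processor_us
relative_to_fast = extra_us / same_processor_us * 100
relative_to_slow = extra_us / different_processors_us * 100

print(f"extra cost per barrier: {extra_us:.2f} microseconds")
print(f"relative to the faster case: {relative_to_fast:.1f}%")
print(f"relative to the slower case: {relative_to_slow:.1f}%")
```

The difference comes out at roughly 9.1% when measured against the faster time and roughly 8.3% when measured against the slower one, so it stays below 10% either way.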
This brings me to the end of this short experiment. What I have learned is that the Opteron is indeed a NUMA system and behaves like one, but I will probably not notice it in my programs. Therefore, I can usually treat the system like an SMP machine and not worry about, e.g., placing some threads close together and others further apart (a possible reason to spread threads out would be to increase memory bandwidth, as it doubles when using the second processor, and yes, I confirmed this with a short test as well). If doubts arise again, I can use taskset to check. Your results may of course vary, for example when you have a truly memory-bound application.
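If you want to repeat the check on your own machine, a small harness around taskset is enough. The following is a minimal sketch, not the setup used for the measurements above: the function names, the number of runs and the idea of keeping the fastest run are all assumptions, and you have to find out yourself which CPU ids share a physical processor.

```python
import subprocess
import time


def taskset_command(cpus, command):
    """Build the argument list for `taskset -c <cpus> <command...>`."""
    return ["taskset", "-c", cpus, *command]


def best_wall_time(cpus, command, runs=5):
    """Run `command` pinned to `cpus` `runs` times; return the fastest time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(taskset_command(cpus, command), check=True)
        timings.append(time.perf_counter() - start)
    return min(timings)


def compare(command, placements, runs=5):
    """Map each placement label to the best wall-clock time of `command`."""
    return {label: best_wall_time(cpus, command, runs)
            for label, cpus in placements.items()}


# Example call (hypothetical CPU lists and program name):
# compare(["./programname"], {"same processor": "0,1", "two processors": "0,2"})
```

Keeping the fastest of several runs is only one way to reduce noise from other processes; the mean or median may suit your benchmark better.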