Thinking Parallel

A Blog on Parallel Programming and Concurrency by Michael Suess

Opteron NUMA-effects

We have a dual-processor, dual-core Opteron system at work (that makes 4 cores in total 😉). I often use it for performance measurements of my parallel codes. I expected performance variations in my benchmarks when only two threads are used. The explanation would of course be simple: the Opteron-system is actually a NUMA-system. Half of the memory belongs to processor one, the other half belongs to processor two. Both processors are connected through the HyperTransport bus (running at 1000 MHz if I am not mistaken). If processor one wants to access memory belonging to processor two, it can do so. It will take longer than usual though, because it cannot do so directly, but only via processor two. That gives us two different memory access times. It gets even more complicated with four processors (no matter how many cores they have, as the two cores on a dual-core processor share a single memory controller). On many boards, they are connected in a ring. That gives us three different memory access times, as a memory access may have to travel across zero, one or in the worst case two HyperTransport connections. OK, enough of the NUMA-basics already, let's return to my main thread.
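If you want to see these different access times on your own machine, here is a minimal sketch that queries the NUMA distance matrix through libnuma (this assumes the libnuma development headers are installed; the distances are relative cost factors as reported by the firmware, not nanoseconds, and this is not a tool I used for the experiments below):

    /* numa_distances.c - print the NUMA distance matrix via libnuma.
     * Build (assumes libnuma is installed): gcc numa_distances.c -lnuma
     */
    #include <stdio.h>
    #include <numa.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }

        int max_node = numa_max_node();
        printf("NUMA nodes: %d\n", max_node + 1);

        /* numa_distance() reports the relative cost of accessing memory on
         * node 'to' from node 'from' (10 means local, larger means more hops). */
        for (int from = 0; from <= max_node; from++) {
            for (int to = 0; to <= max_node; to++)
                printf("%4d", numa_distance(from, to));
            printf("\n");
        }
        return 0;
    }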

When I run a program with two threads on this machine, and those threads communicate from time to time, I should see different wall-clock times for my program depending on where the threads are scheduled (on two cores of the same processor, or on two different processors). Or at least that is what I suspected.

Trying to confirm or deny this once and for all, I did some experiments using three different programs:

  • Watergap – a relatively big program (a couple of thousand lines of code) that simulates water households and their changes over time (a hybrid version parallelized with OpenMP and MPI).
  • Quicksort – the fastest parallel OpenMP-implementation I could come up with

Before I could start my little comparison, there was one problem left to solve, though: I needed a way to bind threads to processors (another option would be to wait until the scheduler decides to schedule my threads the way I like, but I do not like waiting). Luckily for me, there are two options for that on Linux now:

  • use the NPTL-functionality (pthread_setaffinity_np() and friends)
  • use taskset out of the schedutils package

I toyed around with pthread_setaffinity_np() and it works (as can be observed e.g. with top). But since I had to change my programs for it, and they are usually written with OpenMP anyway, I quickly abandoned that idea. The other option is nicer anyway: start your program like this: taskset -c 0,2 programname, and it will only run on processors 0 and 2. Nice and easy, no need to change the program, and it plays nicely with OpenMP. Problem solved.
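For the curious, a minimal sketch of the pthread_setaffinity_np() route might look like the following (Linux/NPTL only, hence the _GNU_SOURCE define; the core number is just an example and not meant to match my machine):

    /* pin_self.c - pin the calling thread to one core via pthread_setaffinity_np().
     * Linux/NPTL only; build with: gcc -D_GNU_SOURCE -pthread pin_self.c
     */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    static int pin_current_thread_to(int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        /* pthread_self() lets a thread restrict its own affinity mask. */
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main(void)
    {
        int rc = pin_current_thread_to(0);   /* core 0 chosen arbitrarily */
        if (rc != 0) {
            fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(rc));
            return 1;
        }
        printf("now restricted to core 0 (placement can be checked with top)\n");
        /* ... do the actual work here ... */
        return 0;
    }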

OK, back to business. Now that I knew how to schedule my threads any way I liked, I ran my experiments using the above-mentioned programs. The results were surprising (at least for me): I could not measure a noticeable performance difference between using two different cores on the same processor and using two different processors altogether. Since I did not expect this, I wondered whether something was wrong with my methodology, and therefore I tried a third program:

Finally, I was able to reliably measure a small difference for the barrier tests contained therein (1.43 microseconds when on the same processor vs. 1.56 microseconds when on two different processors, a difference of less than 10%). 10% might sound like a lot, but I usually do not have a whole lot of barriers in my programs (and if I do, I usually have bigger problems to worry about anyway 😉).
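If you want to repeat the latency part, a barrier micro-measurement can be sketched in a few lines of OpenMP. This is not the benchmark I used, just an illustration of the idea; the iteration count is arbitrary, and which core numbers share a processor differs per machine (check /proc/cpuinfo):

    /* barrier_time.c - rough per-barrier cost for 2 threads.
     * Build: gcc -fopenmp barrier_time.c
     * Run pinned, e.g.: taskset -c 0,1 ./a.out   vs.   taskset -c 0,2 ./a.out
     */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        const int iterations = 100000;

        omp_set_num_threads(2);
        double start = omp_get_wtime();

        #pragma omp parallel
        {
            for (int i = 0; i < iterations; i++) {
                #pragma omp barrier
            }
        }

        double elapsed = omp_get_wtime() - start;
        /* Every loop iteration synchronizes both threads once; the number
         * also contains the (small) loop overhead. */
        printf("%.2f microseconds per barrier\n", elapsed / iterations * 1e6);
        return 0;
    }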

This brings me to the end of this short experiment. What I have learned is that the Opteron is indeed a NUMA-system and behaves like one, but I will probably not notice it in my programs. Therefore, I can usually treat the system like an SMP-machine and not worry about e.g. placing some threads close together and others further apart (a possible reason to do this would be to increase memory bandwidth, as it doubles when the second processor is used; yes, I confirmed this with a short test as well). If doubts arise again, I can use taskset to check. Your results may of course vary, for example when you have a truly memory-bound application.
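For reference, a short bandwidth check like the one mentioned above can be done with a STREAM-style triad loop. The following is only a rough sketch and not my actual test; the array size is arbitrary and just needs to be much larger than the caches:

    /* triad_bw.c - crude memory bandwidth test in the spirit of STREAM.
     * Build: gcc -O2 -fopenmp triad_bw.c
     * Compare: taskset -c 0,1 ./a.out   vs.   taskset -c 0,2 ./a.out
     * (which core ids share a processor differs per machine)
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N (8 * 1024 * 1024)   /* 8M doubles per array, far larger than cache */

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c = malloc(N * sizeof(double));
        if (!a || !b || !c) return 1;

        #pragma omp parallel for
        for (long i = 0; i < N; i++) {      /* first touch distributes the pages */
            a[i] = 0.0; b[i] = 1.0; c[i] = 2.0;
        }

        double start = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];       /* triad: 2 loads + 1 store per element */
        double elapsed = omp_get_wtime() - start;

        /* three arrays of doubles are streamed through memory once */
        printf("%.2f GB/s\n", 3.0 * N * sizeof(double) / elapsed / 1e9);
        return 0;
    }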

3 Responses to Opteron NUMA-effects


Comments

  1. Comment by Dieter | 2006/08/23 at 18:22:23

    We typically observe ccNUMA effects on the Opteron when using more than 2 threads.
    It seems that the memory architecture can sufficiently support 2 threads, wherever they are running.
    But with more than 2 threads things are quite different!

    It may get worse with memory benchmarks like STREAM. There you can see the difference between good and bad memory placement with 2 threads already.

    See
    http://www.rz.rwth-aachen.de/computing/events/2005/parco05/drops_slides.pdf

  2. Comment by Michael Suess | 2006/08/23 at 22:33:26

    Dieter,
    thanks for your comment. I probably should have mentioned in this article that this is not a comprehensive evaluation of NUMA-effects on the Opteron. If it were, I would not have published it here but in a research paper :-), like e.g. these guys (a good read, by the way). I have only investigated the part of the effects related to memory latency and left out the part about memory bandwidth (mainly because I do not have any memory-bound applications, and therefore this case is not so interesting for me). Reading through the article again, I should have made that more clear, as the last sentence is probably not enough of a warning. Anyway, I will not change the article now, as the correction can be seen in the comments section.


Trackbacks & Pingbacks

  1. […] Skimming through the activity logs of this blog, I can see that many people come here looking for information about pthread_setaffinity_np. I mentioned it briefly in my article about Opteron NUMA-effects, but barely touched it because I had found a more satisfying solution for my personal use (taskset). And while I do not have in depth knowledge of the function, maybe the test programs I wrote will be of some help to someone to understand the function better. I will also post my test program for sched_setaffinity here while I am at it, simply because the two offer similar functionality. […]
