Thinking Parallel

A Blog on Parallel Programming and Concurrency by Michael Suess

Thoughts on “Valve goes Multicore”

GamesA little while ago, an event was held by Valve Software, Inc. at their headquarters in Bellevue, Washington. If you don’t know Valve, it is the company behind the Source Engine, which powers Half Life 2. If you still don’t know what I am talking about, this short article is probably of no interest to you. 🙂

Anyways, they have invited a couple of tech-journalists to tell them how they have parallelized their game-engine. And while I wasn’t invited (now who would have guessed :P), I have stumbled across two articles covering the event: one from bit-tech.net and one from ArsTechnica. Reading through both of these articles, I could not resist the urge to comment on some of the points they made (need some practice in the self-control department, I guess). 😉

As usual, let me add a little disclaimer beforehand: I don’t know too much about game-programming. I have been to a seminar or two on the subject, and have had regular conversations with a colleague of mine who is very knowledgeable about it. But of course, this does not make me an expert or anything, therefore I am going to try to stick to the parallel and threading part, because I am way more at home there.

I have thought about whether or not to give you an executive summary of the articles beforehand, but decided it was not worth it because you can always read through the referenced articles yourselves (they are not that long). Sorry, I am a bit on the lazy side today. Or you can of course just try to make sense of my comments without the context.

Enough of the preface, let’s start with the article on bit-tech.net:

“Fundamentally, games are about what you do, not what you see. So in terms of things which make games fundamentally profound experiences – and differentiate them from non-interactive entertainment such as TV and film – it’s more about what you do on the CPU than on the GPU.”

I wonder if NVIDIA agrees? But no, I am not going to get into this topic now. 🙂 I just found it funny to hear this quote and all the announcements about doing traditional CPU-tasks on the GPU in such a short time-frame. 🙂

To do this, Valve created its own set of tools. Whilst it evaluated using existing programmes – such as operating system-level threading engines, or open source threading engines – it found that these were inefficient when it came to games.

Unfortunately, both articles are very light on details about what exactly makes these threading systems so inefficient for games – a point I would have very much liked to know more about. Too heavy-weight? Not enough features? Too slow? Or even the old “not invented here”-syndrome? Its a shame they don’t go into this further, especially since I would find it a considerable pain to have to maintain my own threading library along with my application when there are other, probably more tested alternatives available. But then again, these companies operate in another league when it comes to resources, so what do I know. 🙂

“We gravitated towards creating a custom work management system that’s designed specifically for gaming problems – its job is to keep the cores 100% utilised. We make it so that there are N-1 threads for N cores, with the other thread being the master synchronisation thread. The system allows the main thread to throw off work to as many cores as there might be available to run on, and it also allows itself to be subjugated to other threads if needs be.”

Or they could have just called it Master-Slave, like the High-Performance-Computing-world has done for ages :). They did not have to invent their own threading-system for that, that’s for sure…

The engine uses lock-free algorithms to solve one of the major problems of threading – access to data.

“We realised that, 95% of the time, you’re trying to read from a data set, and only 5% of the time are you actually trying to write to it, to modify it. So, we just allowed all the threads to read any time they want, and only write when the data was locked to that thread.”

This part is very misleading, and I hope the journalists misunderstood something here. What he describes are not lock-free algorithms in any way, he describes good old Reader-Writer locks. These are far easier to write and far more portable than lock-free algorithms (in fact, one of my students just wrote a Reader-Writer-Lock implementation with C++ and OpenMP without too many problems). And therefore the terms should not be freely mixed in my humble opinion, because we are talking a whole new level of difficulty here (and by the way, Reader-Writer Locks are also available in many threading systems, so I am starting to think it was not lack of features that made them implement their own).

That’s the bits I wanted to comment on from the first article. Let’s take a look at the second one from ArsTechnica:

To accomplish this, programmer Leonard made use of a technique called lock-free algorithms. He implemented a spin lock to replace the more traditional mutexes (mutual-exclusion algorithms) and semaphores that are used to flag threads as being “safe” to multithread. The spin loop utilizes a new interlock instruction that is built into the CPU.

Now they claim lock-free algorithms can be implemented with spin-locks. *sigh*. Same mistake as in the other article.

Tom examined the threads’ activity with profiling tools, and it turned out that 95 percent of the time the threads were reading memory while only spending 5 percent of their time writing. A read/write lock allowed many threads to read the same bit of memory, but only one thread to write to it.

This sounds better and more correct: so they do know what a Reader-Writer lock is after all, and the above blurbs were probably things the journalists misunderstood. *sigh of relief* 🙂

Not only was the speedup on the four-core Kentsfield chips nearly linear in most cases (an average of 3.2 times over a single core)

Please repeat after me: Speedups are measured against the fastest sequential version, not against the parallel version run on one core. Speedups are measures against…
If I had gotten a dollar for every time I heard this mistake in the past I would be a rich man by now – or maybe it’s just selective memory :?…

The demo was run on a 2.6GHz Kentsfield CPU with four cores and 2GB of RAM. On a single-core 3.2 GHz Pentium 4, fewer than 100 critters could run around at the same frame rate, which looked much less impressive.

Let’s not talk about comparing apples with oranges, shall we? Different architecture, different number of cores, different clock-speed – these numbers tell us exactly – not so much 8O.

The tools don’t completely remove the responsibility of the programmer to think about the best approach to adding multithreading support to their code, but they do help programmers avoid common pitfalls and greatly diminish the time it takes for coders who are new to multithreading to get up to speed.

How come they always stop when it gets interesting? Why do these tools help so much? How are they different from ordinary threading-systems? These are the questions that are interesting to a multi-threading-junkie like me (but I guess the target audience of both articles does not exactly fall into that category).

A question was asked about supporting executing game code directly on GPUs, an idea that some have considered due to the improvements in GPU speed over the last few years, Valve does not believe that this is the correct approach. GPUs can’t natively execute existing x86 code whereas multiple cores can execute it directly, meaning less work for the programmers.

What a nice coincidence: the closing quote from this article is about the same topic as the first one: the power of GPUs and how to harness it. I don’t find this last sentence too convincing, though: I am regularly writing code that runs quite nice on non-x86-hardware and as long as I have a compiler that understands and translates my programming language of choice into the target machine language, I could not care less about x86-compatibility. I would only expect problems if there were assembler-instructions scattered throughout the code, which of course would make a port extremely difficult. On the other hand, porting code to a whole new programming paradigm (like e.g. stream-processing) is of course hard – but if the graphics hardware companies do deliver, what they have been promising lately, programmers should not have to worry about this sometime in the not so distant future.

This closes my little commentary on the two articles covering the Valve-event, hope you enjoyed my perspective a little!

4 Responses to Thoughts on “Valve goes Multicore”


Comments