Thinking Parallel

A Blog on Parallel Programming and Concurrency by Michael Suess

Stream processing for the Masses? I don’t think so!

Joe from points to an interesting article. Apparently, ATI is moving into the field of stream processing. In this post, I will tell you a little about what stream processing is, what your graphics processor has to do with it and also what problems I see with it. But let me start by having a little fun with their announcement:

ATI has invited reporters to a Sept. 29 event in San Francisco at which it will reveal “a new class of processing known as Stream Computing.”

Wow. Looks like they invented a whole new paradigm. Or maybe not? I guess some smart guys at Stanford have been into stream processing for almost ten years now. And I bet they weren’t the first ones either.

Ok, so stream processing is not entirely new and I will stop giving their poor marketing people a hard time. ATI is expected to reveal a product called Firestream, which enables stream processing on their graphics cards. The concept behind that is called GPGPU, or general-purpose computing on graphics processing units. And this is where it gets really exciting. In fact, it gets so exciting that a colleague of mine did a seminar (in German) on the topic with some very motivated students last semester, and this is basically where I learned everything I am going to tell you next. Therefore, please take everything I will tell you here with a grain of salt, as I am certainly NOT an expert in this field.

You probably know that your graphics adapter has a lot of power (at least you may know that it eats a lot of power, but that’s a different, although related, story :P). Modern GPUs have a complexity (measured in transistors) that rivals, if not surpasses, the complexity of your CPU. And it gets even better: while your CPU has to be able to do a whole lot of things, the GPU specializes in a very narrow task: it takes a huge number of triangles and some textures, applies some relatively small operations to the triangles (in its so-called vertex shaders), splits them up into pixels, applies some more operations (this time in the pixel shaders) and spits out a whole lot of pixels for your viewing pleasure. I can almost hear you scream in pain if you know your way around graphics programming and have to listen to this very gross oversimplification of the graphics pipeline, but since I am not an expert in this field and I expect most of you aren’t either, I will leave it at that :D. Feel free to flame me in the comments section. No dependencies are involved when applying the shader operations, and therefore the GPU has many shader units that can work in parallel (which is good, because that makes the operations described above really, really fast).

Now, how does stream computing fit into the picture? Let’s say you have a huge amount of data and want to apply the same operation to each element. Maybe in the game of life. Or if you want to simulate a waterfall. Your CPU is not really well suited for that task, because it processes only one element (or a few) at a time. Your graphics adapter on the other hand is a perfect fit: just feed the data into the graphics pipeline (turning them into a stream of data, which is really just a one-dimensional array) and program the shaders with the appropriate operations (called kernels in this context) and the GPU will happily crunch through the data, using all its data-parallel processing power. And because that’s what it can do best, it will do so fast. Really fast. We have seen a speedup by a factor of 30 in a simple experiment with matrix multiplication done by one of our students, but of course your results may vary.
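To make the idea of a kernel applied to a stream a bit more concrete, here is a minimal sketch in plain Python rather than in a real shader language. The names `run_kernel` and `brighten` are invented for this illustration and have nothing to do with ATI's actual interface. The only point is that each output element depends on exactly one input element, so the elements may be processed in any order (or all at once) without changing the result.

```python
import random

def run_kernel(kernel, stream):
    """Apply `kernel` to every element of `stream` independently.

    A GPU would hand the elements to many shader units at once. Here we
    visit them in a random order instead, which is only safe because no
    element depends on any other.
    """
    out = [None] * len(stream)
    order = list(range(len(stream)))
    random.shuffle(order)          # the order must not matter
    for i in order:
        out[i] = kernel(stream[i])
    return out

def brighten(pixel):
    """Hypothetical kernel: double one value and clamp it to 255."""
    return min(255, pixel * 2)

stream = [0, 10, 100, 200]         # the "stream": a one-dimensional array
result = run_kernel(brighten, stream)
print(result)                      # [0, 20, 200, 255]
```

If the kernel needed the result of a neighbouring element, the shuffled loop would give different answers from run to run, and that is roughly the situation the GPU cannot handle.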

Unfortunately, the story has downsides as well. If you think parallel programming is hard, then programming GPGPUs is hell :(. At least that was my personal conclusion after listening to some of the experiences of our students. Although there are some higher-level languages available (e.g. Sh or Brook), you basically need to be a graphics expert anyway to program in them. You will also trip over countless bugs, as well as limitations of your particular graphics adapter that you never had to think about before.

And it gets even worse: Because all the fun is happening on the graphics adapter, your kernels have no way to access main memory! Therefore, all the additional data you may need has to be converted into textures somehow, which can then be read by your kernels. Very strange, if you ask me. Your data also must not have any dependencies inside the stream, because the data is worked on in parallel by the different shaders on the GPU.
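The Game of Life shows what these restrictions look like in practice. In the rough Python sketch below (the function names are invented for this sketch, and a real implementation would be a pixel shader), the previous generation plays the role of a read-only texture: the kernel for one cell may look up any neighbour in it, but it writes only its own output cell and never sees what the other kernel invocations are producing.

```python
def life_kernel(texture, width, height, i):
    """Compute the next state of cell i from the read-only `texture`."""
    x, y = i % width, i // width
    neighbours = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dx == 0 and dy == 0:
                continue
            nx, ny = x + dx, y + dy
            # Cells outside the grid count as dead.
            if 0 <= nx < width and 0 <= ny < height:
                neighbours += texture[ny * width + nx]
    alive = texture[i] == 1
    if neighbours == 3 or (alive and neighbours == 2):
        return 1
    return 0

def life_step(grid, width, height):
    """One generation: every output cell is computed independently."""
    texture = tuple(grid)          # frozen copy, standing in for a texture
    return [life_kernel(texture, width, height, i)
            for i in range(width * height)]

# A "blinker" on a 3x3 grid flips between a row and a column.
row = [0, 0, 0,
       1, 1, 1,
       0, 0, 0]
column = life_step(row, 3, 3)
print(column)                      # [0, 1, 0, 0, 1, 0, 0, 1, 0]
```

Updating the grid in place would break this, because a kernel might then read a neighbour that another kernel had already overwritten. That is exactly the kind of dependency inside the stream that is not allowed.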

And finally, expressing your algorithms in a streamy way is not an easy task when you look at the limitations sketched above. I will even go as far as to say that most known algorithms and data structures today are not usable at all for stream processing and new ways to express them have to be found first. A very rewarding task for a research project, but not so good when you want to utilize it in a product and your boss is bugging you with a deadline.
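A running sum is a small example of the problem. The usual loop is a chain of dependencies, since every result needs the previous one, so it does not fit the model at all. One known way around this is a scan in the style of Hillis and Steele, which replaces the loop with roughly log2(n) passes, each of which is a dependency-free kernel over the output of the pass before. The Python below is only a sketch of that idea with invented function names; it does more additions in total than the simple loop, which is the price for making every pass parallel.

```python
def sequential_scan(data):
    """The usual loop: element i needs the finished result for i - 1."""
    out = []
    total = 0
    for value in data:
        total += value
        out.append(total)
    return out

def scan_pass(prev, offset):
    """One data-parallel pass: out[i] depends only on the read-only `prev`."""
    return [prev[i] + prev[i - offset] if i >= offset else prev[i]
            for i in range(len(prev))]

def stream_scan(data):
    """Inclusive running sum built from repeated dependency-free passes."""
    current = list(data)
    offset = 1
    while offset < len(current):
        current = scan_pass(current, offset)   # output becomes next input
        offset *= 2
    return current

data = [3, 1, 4, 1, 5, 9, 2, 6]
print(stream_scan(data))           # [3, 4, 8, 9, 14, 23, 25, 31]
```

Both functions return the same list, but only `scan_pass` has the shape of a kernel. Finding this kind of reformulation for every algorithm you need is the hard part.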

A very pessimistic outlook so far. But maybe ATI has something more than just hardware up its sleeve. Some kind of Stream Building Blocks would be nice (kind of like what the Intel Threading Building Blocks are trying to do for threads). Or a language that hides at least some of the hardware specifics and raises the level of abstraction considerably. And actually works. And if it’s not ATI, maybe somebody else manages to make GPGPU programming at least achievable without losing all your hair beforehand ;).

I want to close this post with a quote from a commercial for a large tire manufacturer, which I find very fitting in this context:

Power is nothing without control.

I am hoping at least some people at ATI worry about control as much as about power.

9 Responses to Stream processing for the Masses? I don’t think so!


  1. Comment by Christopher | 2006/09/22 at 20:42:58

    Wow, EVERYONE is talking about streams this week. I even wrote my post about it yesterday (which made the front of dzone no less). ATI’s announcement even got onto Digg.

    The PeakStream stuff looks pretty interesting, and the fact that they are funded by both Sequoia and KPCB means that streams will have a strong push in the market.

  2. Comment by Vivek | 2006/09/27 at 23:32:31

    When the NVIDIA GeForce first came out with the GPU concept, I predicted it would one day be used for non-graphics tasks!

    I wish I had a million $ for every time my predictions came true…

  3. Comment by PB | 2006/10/03 at 08:32:38

    The following SIGGRAPH talk from ATI might have some answers to issues brought up by Michael.

    - PB

  4. Comment by franjesus | 2006/10/17 at 17:04:48

    A short paper.

    Interesting stuff, I’m just wondering what nVidia’s answer will be. Will they keep going down the road of a dedicated physics co-processor??

    Also, AMD’s purchase of ATi and AMD’s push for HyperTransport-attached co-processors make me hope for the day when big simulations won’t be carried out on clusters of thousands of computers, but on relatively cheap vector processing units, with the central CPU acting only as director and feeding instructions to the units. Let’s hope memory and buses manage to keep pace with the processors.

  5. Comment by franjesus | 2006/10/17 at 17:06:11

    I forgot the paper.

  6. Comment by Vince | 2006/12/01 at 11:14:28

    Very Nice introduction… I am trying to get started with GPGPU

  7. Comment by Ian | 2007/05/23 at 21:08:39

    It’s nice that ATi have finally given people the opportunity to unleash some of the power in their graphics cards.

    But the real speed is currently lying dormant in the thousands of computers that don’t have GPUs or whose owners can’t be bothered to install the correct software.

    Stream processing with AJAX

  8. Comment by Rashed | 2007/12/28 at 22:20:58

    It’s interesting to see companies claiming inventions and concepts developed decades ago as their own. ATI is not first!! :)

    I would agree with Michael that his post is a bit too pessimistic, especially when referring to GPGPU programming: the glass half empty as opposed to half full. Kindly let me add that programmers have been doing GPGPU-based programming for a while now, often using good abstractions and therefore without needing to be too aware of the internal details of GPUs.

    Also, this discussion would be incomplete without a mention of CUDA (by NVIDIA), which provides a nice abstraction layer for GPGPU programmers. There are others already active in the field of GPU, multicore, and Cell processor based programming (e.g. RapidMind), providing a flat abstraction for the three kinds of compute hardware.

    Of course, the jury is still out on how these various threads of innovation will take shape in the future!

    Cheers and Happy New Year!


  9. Comment by Rashed | 2007/12/28 at 22:33:53

    Oops! I thought this was a recent post!! My comments therefore are not relevant!!

