Understanding Modern ( Pixel ) Pipelines, v1.1




Note: I've done my best to minimize mistakes and errors in this article. However, should any arise, I *will* correct them and note it in the update log. Remember, there's a feedback form! The update log is at the end of the article.

It’s just plain sickening. Small sites, big sites ( won’t give names… must resist… ), individuals, organizations: practically everyone is making gigantic mistakes about pixel pipelines ( I'm going to refer to them simply as pipelines in the rest of the article ) in modern GPUs. Or rather, about how you should imagine pipelines, because in some cases, pipelines just don’t exist. More on that later.

First, let’s look at this from an architectural standpoint. When it exists, what’s a pipeline?
A pipeline is an implementation technique whereby multiple instructions are overlapped in execution, and in the case of GPUs a pixel pipeline works on one pixel. ( this is my definition, please feel free to oppose it )

From this definition, we may easily conclude that the pipelines of one architecture will practically never be identical to those of another. They may differ in several ways, including:
- Native instruction support ( instructions that can be run at full speed, thus 1 op/clock )
- Speed of non-native instructions ( that is, instructions supported with or without macros, but that cannot be run at full speed )
- Precision of the operations in the pipeline ( may vary in some architectures )
- Number of instructions and registers available for Fragment Shading ( = PS )
- Number of TMUs ( if any... )

As you may see, it is false to say, for example, that an 8 pipeline architecture is ALWAYS better than a 4 pipeline one.

I believe there are 3 basic factors influencing the quality of a pipeline:

- Quality (precision): this is harder to determine for architectures having more than one possible precision type. Let’s take the NV30. It supports FX12 ( 12-bit integer ), FP16 ( 16-bit floating point ) and FP32 ( 32-bit floating point ). Let’s compare it to the R300, which supports only FP24. In a way, the NV30 can look better when at maximum quality, but practically, to have equal speed to the R300, it has to use mostly FX12. Thus, you are required to consider the NV30's quality as lower than the R300’s. As you may see, this is quite approximate.

- Ops/clock ( operations per clock cycle = raw speed ) for all instructions, both native and non-native. This matters because many architectures natively support instructions that others do not, and that has to be counted as an advantage. Also, in architectures such as the NV3x, other factors influence speed, such as register usage ( the NV3x is slower when more registers are used ); you should count that as a disadvantage.
Ideally, you should consider this for every instruction ( and at every possible precision on the architecture ) and consider the real-world frequency of all those instructions to determine overall quality. This is practically impossible, and again this will thus likely remain approximate.
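
To make the "weight every instruction by its real-world frequency" idea a bit more concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the two architectures, the per-instruction costs and the instruction mix are made-up numbers for illustration, not measurements of any real GPU.

```python
# Hypothetical numbers throughout: the costs and the instruction mix below are
# invented for illustration and do not describe any real GPU.

def average_clocks_per_instruction(cost_in_clocks, instruction_mix):
    """Weight each instruction's cost (in clocks) by how often it shows up."""
    total_weight = sum(instruction_mix.values())
    weighted = sum(cost_in_clocks[op] * share
                   for op, share in instruction_mix.items())
    return weighted / total_weight

# 1 clock = native (full speed); more than 1 = non-native (macro, multi-pass...).
arch_a = {"MAD": 1, "DP3": 1, "TEX": 1, "SIN": 1}
arch_b = {"MAD": 1, "DP3": 1, "TEX": 1, "SIN": 8}

# Share of each instruction in an imaginary "typical" shader workload.
mix = {"MAD": 50, "DP3": 25, "TEX": 20, "SIN": 5}

for name, arch in (("A", arch_a), ("B", arch_b)):
    cpi = average_clocks_per_instruction(arch, mix)
    print(f"arch {name}: {cpi:.2f} clocks/instruction, {1 / cpi:.2f} ops/clock")
```

The point of the sketch is only that a slow non-native instruction hurts in proportion to how often it is actually used. Getting a trustworthy instruction mix is the hard part, which is why this stays approximate.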

- Flexibility. There are two types of flexibility you should consider: real-time flexibility and offline flexibility. No one in his right mind expects a game to ship shaders that hit the CineFX limit of 1024 instructions; it’d be way too slow even if it ran on a minuscule fraction of the scene. However, something like 200 instructions might be possible with a highly optimized shader, and only on a quite small portion of the scene. Offline flexibility is still important for workstations, however: it’s a big win for the Quadro FX. Like the other two factors, real-time flexibility can only be judged approximately.

I guess you got my point: the quality of a pipeline is very approximate. That’s why you really shouldn’t try to say a pipeline is more future oriented than another simply based on a few facts: you need a hefty load of them to do that. It is practically impossible to do that alone, and this is why discussion forums such as Beyond3D’s ( www.beyond3d.com ) can be very helpful to determine that.

Now, it would obviously be impractical to only have a “better than” or “worse than” opinion: it is also required to have a good way to summarize this. I’ll get back to that soon.

I initially mentioned that some architectures do not truly have pipelines. Some of you might think I’m on crack right now. But let me assure you I’m not. It is widely speculated that the NV3x might not have multiple pipelines, more like one uber one. Nothing’s certain, however.
The NV30 obviously is still *pipelined*: vertex and pixel work is still done at the same time, and that pixel system is still pretty much one pipeline. And please don't go out and say the NV30 has 1 pipeline and ATI has 8 pipelines, because that's obviously not fully comparable.

It is believed that the NV30 *might* use control logic and a pool of units. That would mean you’ve got *one* unit coordinating the use of many calculation units, as well as output units linked to that coordination unit.
In the NV30 case, it is believed you would have:
- Control logic
- 4 FP32 ops/clock or 8 TEX ops/clock units
- 8 FX12 register combiners
- 4 Color Output units
- 8 Z Output units

(Note: there is likely an error about the organization of the FX12 units, and we don't have the details on the special-purpose units, so this makes little sense, and as I said, this is just speculation, not facts! The control logic part is more likely than that.)

Such architectures have advantages and disadvantages, but that is beyond the scope of this article. The register usage problems of the NV3x quite likely come from this.
Register usage on the NV30 is a very significant problem. It’s not “only a few %”. It can be five times slower than normal operation at full register usage! Please refer to http://www.beyond3d.com/forum/viewtopic.php?t=5150 for similar details.

Now, let’s go back to the traditional system with pipelines. How to describe this in an accurate and complete way? Well, I proposed a quite complex system a while ago: http://www.beyond3d.com/forum/viewtopic.php?t=5649

This is a very complete system IMO, but it is also much too complex. I fail to see any complete and simple system however, but suggestions are welcome (remember, there’s a feedback page on NFI.) – for newbies, I believe synthetic benchmarks such as ShaderMark, or shader-limited games, are the only way to look at this issue. Even that is much more accurate than saying 4x2 or 8x1.

Now, how should we define the NV30?
Well, the easiest way to see this is as a very good 4 pipelines architecture. Why?
Because in order to use all units as effectively as possible, more precisely the output units, it is best to "emulate" as many traditional GPU pixel pipelines ( a traditional GPU pipeline works on only one pixel, with possible loopback ) as there are output units. Thus, in most cases, it has 4 pipelines ( traditional pipelines, obviously, but you get the point, right ). However, sometimes it has 8 pipelines, such as for Z passes. In games like Doom 3, it often has 8 pipelines. Of course, other conditions, such as register usage, may also determine how many pipelines to emulate, so this obviously isn’t perfectly accurate.
Sadly, I see no other roughly correct simplification. Also, saying it’s “very good” is not very accurate either. That’s because if you need mostly FP & Texturing power, the NV3x is a very bad 4 pipelines architecture. So, for highly future-oriented games ( that means games at least 1 or 2 years away; Doom 3 is far from one of those ), it’s a very bad 4 pipelines architecture. Right now, it’s a very good one.
As you see, it is also required to judge the quality of a pipeline based on time!
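
Going back to the "one emulated pipeline per output unit" idea, here is a tiny Python sketch of it. The unit counts are the speculated NV30 figures quoted earlier in this article ( 4 color output units, 8 Z output units ), not confirmed facts. The function name and the pass names are mine, purely for illustration, and the sketch deliberately ignores other conditions such as register usage.

```python
# Speculated NV30 output unit counts, as quoted earlier in the article.
# These are not confirmed hardware facts.
NV30_OUTPUT_UNITS = {"color": 4, "z": 8}

def emulated_pipelines(pass_kind, output_units=NV30_OUTPUT_UNITS):
    """One emulated traditional pipeline per output unit the pass writes through.

    Hypothetical helper: it ignores register usage and any other condition
    that could change how many pipelines are worth emulating.
    """
    if pass_kind not in output_units:
        raise ValueError(f"unknown pass kind: {pass_kind!r}")
    return output_units[pass_kind]

print("color pass:", emulated_pipelines("color"), "pipelines")
print("Z-only pass:", emulated_pipelines("z"), "pipelines")
```

This is really just a lookup, but it shows why a single "NxM" label can't describe such a design: the answer depends on what the pass is writing.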

Now that we've got all this info, let’s get to what most people care about. How does this translate into performance?
Well, there's one keyword here: internal parallelism.
Imagine ( because it doesn't exist ) an architecture which only has one FP processing unit per pipeline. No separate TMUs, no FX units, nothing. Well, there, the internal parallelism will always be 100% - there's never any waste.
Now, let's imagine an architecture with one FP unit and one TMU per pipeline ( the R300 is very near that, more info on it later ). In such an architecture, you have either 100% internal parallelism ( the TMU and the FP unit work on different instructions in parallel ) or 50% internal parallelism ( either the TMU or the FP unit works at once - not both ) at ANY point of the fragment program.
That means that the worst case scenario is 50%, the best case scenario is 100%, but it'll practically always be something between both.
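
Here is a rough Python sketch of that 50%-to-100% range for the imaginary "one FP unit + one TMU per pipeline" design. It assumes perfect scheduling with no dependencies between instructions, which real fragment programs won't give you, so take it as the best a given instruction mix could reach. The function name is made up for this example.

```python
def internal_parallelism(fp_ops, tex_ops):
    """Fraction of unit-slots kept busy with 1 FP unit + 1 TMU per pipeline.

    Assumes perfect scheduling and no dependencies between instructions, so
    this is an upper bound for the given mix, not a prediction.
    """
    if fp_ops < 0 or tex_ops < 0 or fp_ops + tex_ops == 0:
        raise ValueError("need a non-empty fragment program")
    clocks = max(fp_ops, tex_ops)  # the busier unit sets the program length
    unit_slots = 2 * clocks        # two units are available every clock
    return (fp_ops + tex_ops) / unit_slots

for fp, tex in ((8, 0), (6, 2), (4, 4)):
    print(f"{fp} FP + {tex} TEX -> {internal_parallelism(fp, tex):.0%}")
```

A program made only of FP work sits at the 50% floor, a perfectly balanced one reaches 100%, and anything else lands in between.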

Now, let's get to the interesting part - why is an 8 pipelines architecture ALWAYS better than a 4 pipelines architecture having the same number of processing units?
Well, let's take the NV30, assuming it has 4 pipelines. Each one has 1 FP/TEX unit and 2 FX units.
When it cannot do FX operations, it actually only does 4 operations/clock. The R300, on the other hand, will do between 8 and 16 operations ( or even 24, more on that later again ).

So, this results in the following conclusion: the minimum number of native instructions run in a clock cycle is equal to the number of (practical) pipelines in the architecture.

Now, we know the minimum - but there's also the maximum: the maximum number run in a clock cycle is equal to the number of (independent) processing units in the architecture. The "independent" is required to make sure you can't inflate that value thanks to resource sharing.
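
The minimum and maximum rules can be sketched in a few lines of Python, using the unit counts given in this article: 8 pipelines with 3 units each for the R300, and 4 ( emulated ) pipelines with 1 FP/TEX + 2 FX units each for the NV30. The function name is made up, and the NV30 numbers rest on the speculation discussed above, so treat the result as an illustration of the rule rather than a spec.

```python
def native_ops_per_clock_bounds(pipelines, independent_units_per_pipeline):
    """Return (minimum, maximum) native instructions per clock.

    minimum: one instruction per (practical) pipeline
    maximum: one instruction per independent processing unit
    """
    if pipelines < 1 or independent_units_per_pipeline < 1:
        raise ValueError("need at least one pipeline and one unit per pipeline")
    return pipelines, pipelines * independent_units_per_pipeline

# R300: TMU + Vec3 FP unit + Scalar FP unit per pipeline, 8 pipelines.
r300 = native_ops_per_clock_bounds(8, 3)
# NV30 seen as 4 pipelines: 1 FP/TEX unit + 2 FX units each (speculative).
nv30 = native_ops_per_clock_bounds(4, 3)

print("R300:", r300)
print("NV30:", nv30)
```

Where an architecture actually lands between its two bounds is the efficiency question, and that part can't be computed this easily.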

Okay, so there's one last required factor to conclude the performance of an architecture based on pipelines - what's the efficiency in each pipeline? That is, how often can it deliver its maximum throughput?
Well, there's some bad news here: no magic formula exists to determine this. It's all about estimation.

Now, as I said before, the R300 can do between 8 and 24 native instructions per clock. It has three units per pipeline, across eight pipelines: 8 TMUs, 8 FP24 Vec3 units and 8 FP24 Scalar units.

nVidia, on the other hand, always uses Vec4 processing units. So why did ATI use Vec3 + Scalar? Because some instructions only use 3 elements, and some instructions only use one element. So in these cases, you've got 2 FP instructions done in parallel. But when you need 4 elements, you simply use both units to compute the information and do 1 FP instruction per clock cycle. This is more efficient than nVidia's Vec4 architecture, but it's also obviously more expensive (transistor-wise) to implement.
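
To illustrate the Vec3 + Scalar argument, here is a simplified Python model that counts clocks for a stream of FP instructions described only by how many elements each one uses. The assumptions are mine and deliberately crude: no dependencies between instructions, 1-element instructions run only on the scalar unit, 3-element ones only on the Vec3 unit, and a 4-element instruction occupies both units for one clock. Real hardware scheduling is surely more subtle than this.

```python
def clocks_vec4(widths):
    """A single Vec4 unit: one FP instruction per clock, whatever its width."""
    return len(widths)

def clocks_vec3_plus_scalar(widths):
    """Vec3 + Scalar units under the simplified pairing rules described above.

    A 3-element and a 1-element instruction can share a clock; a 4-element
    instruction uses both units. Dependencies are ignored (best case).
    """
    wide = sum(1 for w in widths if w == 4)
    vec3 = sum(1 for w in widths if w == 3)
    scalar = sum(1 for w in widths if w == 1)
    if wide + vec3 + scalar != len(widths):
        raise ValueError("this toy model only handles widths 1, 3 and 4")
    return wide + max(vec3, scalar)

# Hypothetical instruction stream: number of elements used by each instruction.
stream = [3, 1, 3, 1, 4, 3, 1, 4]
print("Vec4 unit:         ", clocks_vec4(stream), "clocks")
print("Vec3 + Scalar unit:", clocks_vec3_plus_scalar(stream), "clocks")
```

With a stream made only of 4-element instructions, both designs take the same number of clocks. The gain comes entirely from how often 3-element and 1-element work can be paired.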
Okay, so to summarize, an efficient 4 pipelines design *CAN* be better than an inefficient 8 pipelines design. I do not believe however that this is the case for the R300 vs NV30 - the R300 is even more efficient per-pipeline, IMO!
Obviously, as I said earlier in the article, there's also the speed of non-native instructions and the differences in native instruction sets. For example, the NV30 is *significantly* more efficient than the R300 when doing trigonometric work such as COS and SIN. But such instructions are quite rare compared to other ones, really.


I hope this article was helpful. I also hope it isn't full of errors and mistakes. If it is, I'll make sure to correct it later. Thanks for reading!


Uttar
Updated, 21st of May 2003: After some suggestions from B3D people, I changed the definition of a pipeline based on this paper - also, this means the NV30 has one pixel pipeline, not zero. Furthermore,
