Nvidia Streaming Multiprocessor History


I spent last weekend learning CUDA and SIMT programming. The time was fruitfully spent: it ended with an almost 700-fold speedup of my "business card raytracer" [1], from 101 seconds down to 150 ms.

Such a pleasant experience was a good excuse to study the topic further and explore the evolution of Nvidia's architecture. Thanks to the wealth of documentation the "green" team has published over the years, I was able to go back in time and briefly walk through the remarkable evolution of its streaming multiprocessors.

In this article, we will cover:

Year   Generation   Series        Die     Process   Most Powerful Card
======================================================================
2006   Tesla        GeForce 8     G80     90 nm     8800 GTX
2010   Fermi        GeForce 400   GF100   40 nm     GTX 480
2012   Kepler       GeForce 600   GK104   28 nm     GTX 680
2014   Maxwell      GeForce 900   GM200   28 nm     GTX 980 Ti
2016   Pascal       GeForce 10    GP102   16 nm     GTX 1080 Ti
2018   Turing       GeForce 20    TU102   12 nm     RTX 2080 Ti

Dead end


Until 2006, Nvidia's GPU design correlated with the logical stages of the rendering API [2]. The GeForce 7900 GTX, powered by the G71 die, consisted of three sections dedicated to vertex processing (8 units), fragment generation (24 units), and fragment merging (16 units).


The G71 die. Note the Z-Cull optimization, which discards fragments that would fail the Z-test.

This correlation forced the designers to guess where the pipeline bottlenecks would be in order to balance each layer correctly. With the arrival of yet another stage in DirectX 10, the geometry shader, Nvidia engineers faced the difficult task of balancing a die without knowing how heavily that stage would be used. It was time for a change.

Tesla



Nvidia solved the problem of escalating complexity with its "unified" Tesla architecture, released in 2006.

In the G80 die, there was no longer any distinction between layers. The Streaming Multiprocessor (SM) replaced all the previously existing units, thanks to its ability to run vertex, fragment, and geometry "kernels" without distinction. Load balancing happened automatically: the "kernel" executed by each SM was swapped depending on the demands of the pipeline.

"In fact, we threw out the entire NV30/NV40 shader architecture and created a new one from scratch, with a new general-purpose SIMT processor architecture that also introduced new processor design methodologies."

Jonah Alben (interviewed by extremetech.com)

No longer able to execute SIMD instructions, "shader units" became "cores", each capable of executing one integer or one float32 instruction per clock. The SM receives threads in groups of 32 called warps. Ideally, all threads of a warp execute the same instruction at the same time, only on different data (hence the name SIMT). The Multi-threaded Instruction unit (MT Issue) takes care of enabling and disabling threads in a warp when their instruction pointers (Instruction Pointer, IP) converge or diverge.
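
To make the SIMT model concrete, here is a minimal kernel of my own (not from Nvidia's documentation): every thread runs exactly the same code and differs only in its index.

    // Minimal SIMT illustration: one kernel, thousands of threads.
    // Threads 0..31 form warp 0, threads 32..63 form warp 1, and so on.
    __global__ void scale(float *data, float k, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x; // unique thread index
        if (i < n)
            data[i] *= k; // same instruction for the whole warp, different data
    }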

Two SFUs (Special Function Units) help with complex mathematical calculations such as the inverse square root, sin, cos, exp, and rcp. These units can also execute one instruction per clock, but since there are only two of them, warp throughput is divided by four. There is no hardware support for float64; those calculations are performed in software, which greatly affects execution speed.

An SM reaches its full potential when it can hide memory latency thanks to a constant supply of schedulable warps, but also when the threads in a warp do not diverge (when control flow keeps them on the same instruction path). Thread states are stored in a 4 KB Register File (RF). Threads that consume too much stack space reduce the number of threads that can run concurrently, lowering performance.
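
A hypothetical kernel showing the divergence penalty described above: when even and odd lanes of a warp take different branches, the MT unit masks threads on and off and the two paths run one after the other.

    __global__ void diverge(float *data) {
        int i = threadIdx.x;
        if (i % 2 == 0)          // even and odd lanes of the same warp...
            data[i] += 1.0f;     // ...take different paths, so the warp
        else                     // executes both branches serially,
            data[i] -= 1.0f;     // roughly halving throughput
    }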

The flagship die of the Tesla generation was the 90 nm G80, introduced in the GeForce 8800 GTX. Two SMs are bundled into a Texture Processor Cluster (TPC) together with a Texture Unit and a Tex L1 cache. With 8 TPCs and 128 cores, the G80 advertised 345.6 gigaflops [3]. The 8800 GTX card was extremely popular in its time; it received stellar reviews and was beloved by those who could afford it. It was such a good product that, thirteen months after its release, it remained one of the fastest GPUs on the market.
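
The advertised figure is easy to reconstruct, assuming the 8800 GTX's 1.35 GHz shader clock and one multiply-add (two floating-point operations) per core per cycle:

    128 cores x 2 flops (multiply-add) x 1.35 GHz = 345.6 gigaflops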


G80 installed in the 8800 GTX. Render Output Units (ROPs) perform anti-aliasing.

Together with Tesla, Nvidia introduced the Compute Unified Device Architecture (CUDA) C programming language, a superset of C99. It was a relief for GPGPU enthusiasts, who welcomed an alternative to tricking the GPU with GLSL textures and shaders.
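
For flavor, here is a sketch of what that looked like: a complete (if simplified) CUDA C routine that adds two vectors, instead of encoding the computation as a GLSL shader. Error handling is omitted for brevity.

    #include <cuda_runtime.h>

    __global__ void add(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    // Host side: copy inputs in, launch one thread per element, copy out.
    void vector_add(const float *a, const float *b, float *c, int n) {
        float *da, *db, *dc;
        size_t bytes = n * sizeof(float);
        cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
        cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);
        add<<<(n + 255) / 256, 256>>>(da, db, dc, n);
        cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
        cudaFree(da); cudaFree(db); cudaFree(dc);
    }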

Although this section has mostly been about the SM, it is only half of the story. The SMs must be fed the instructions and data stored in GPU memory. To avoid stalls, GPUs do not try to minimize memory trips with large caches and speculation the way CPUs do. GPUs embrace latency, saturating the memory bus to satisfy the I/O needs of thousands of threads. To this end, a die such as the G80 implements high memory throughput via six bidirectional DRAM memory buses.


GPUs embrace memory latency, while CPUs hide it with enormous caches and prediction logic.
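
A sketch of how a programmer plays along with this design: a grid-stride loop launches far more threads than there are cores, so while some warps wait on DRAM others are scheduled, and consecutive threads touch consecutive addresses to keep the wide buses saturated.

    __global__ void copy(const float *src, float *dst, int n) {
        // Grid-stride loop: correct for any n and any launch configuration.
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += gridDim.x * blockDim.x)
            dst[i] = src[i]; // adjacent threads read adjacent addresses (coalesced)
    }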

Fermi



Tesla was a risky move that proved very successful. So successful, in fact, that it became the foundation of Nvidia GPUs for the next two decades.

In the same extremetech.com interview, Jonah Alben noted that every architecture since then, Fermi and Maxwell included, has still built on the foundation laid by the G80, all the way up to [Pascal].

In 2010, Nvidia released the GF100, based on the brand-new Fermi architecture. The internals of its latest chip are described in detail in the Fermi whitepaper [4].

The execution model is still based on warps of 32 threads scheduled onto an SM. Nvidia managed to double or quadruple every figure thanks to the 40-nanometer process. With two arrays of 16 CUDA cores, an SM could now schedule two half-warps (16 threads each) at the same time. Although each core executed one instruction per clock, the SM could effectively retire one warp instruction per clock (four times the throughput of a Tesla-era SM).

The number of SFUs also increased, though not as much: it only doubled. One can conclude that instructions of this type were not used very actively.

There is partial hardware support for float64, with operations combining two CUDA cores. Thanks to a 32-bit ALU (Tesla's was 24-bit), the GF100 can perform integer multiplication in a single clock, and thanks to the move from IEEE 754-1985 to IEEE 754-2008, the float32 pipeline gained accuracy by using Fused Multiply-Add (FMA), which is more precise than the MAD used in Tesla.
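
The accuracy difference is easy to state: MAD rounds twice, FMA rounds once. A small illustration of my own (__fmul_rn and fmaf are standard CUDA intrinsics):

    // MAD (Tesla):  round(round(a * b) + c)  - two roundings
    // FMA (Fermi):  round(a * b + c)         - one rounding
    __device__ float mad_like(float a, float b, float c) {
        return __fmul_rn(a, b) + c; // product rounded, then sum rounded
    }
    __device__ float fma_like(float a, float b, float c) {
        return fmaf(a, b, c);       // single rounding at the end
    }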

From a programming point of view, Fermi's unified memory system made it possible to augment CUDA C with C++ features such as objects, virtual methods, and exceptions.
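
A hypothetical sketch of what that unlocked: device-side objects with virtual dispatch (the object must be constructed on the device for its vtable to be valid there).

    struct Shape {
        __device__ virtual float area() const = 0;
        __device__ virtual ~Shape() {}
    };

    struct Circle : public Shape {
        float r;
        __device__ explicit Circle(float radius) : r(radius) {}
        __device__ float area() const override { return 3.14159265f * r * r; }
    };

    __global__ void areas(float *out) {
        Circle c(2.0f);               // object constructed on the device,
        Shape *s = &c;                // so its vtable is valid there
        out[threadIdx.x] = s->area(); // virtual dispatch inside a kernel
    }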

With the texture units now part of the SM, the concept of the TPC was abandoned. It was replaced by Graphics Processing Clusters (GPC), each containing four SMs. Last but not least, the SM gained a PolyMorph Engine, which handles vertex fetch, viewport transform, and tessellation. The flagship GeForce GTX 480, based on the GF100, was advertised as containing 512 cores and delivering 1,345 gigaflops [5].


GF100 installed in the GeForce GTX 480. Note the six memory controllers serving the GPCs.

Kepler



In 2012, Nvidia released the Kepler architecture, named after the astronomer best known for discovering the laws of planetary motion. As usual, the GK104 whitepaper [6] lets us look inside.

With Kepler, Nvidia significantly improved the energy efficiency of the chip by lowering the clock speed and unifying the core clock with the card clock (previously, the cores ran at twice the card clock).

Such changes should have led to a decrease in performance. However, thanks to the smaller 28-nanometer process and the replacement of the hardware scheduler with a software one, Nvidia was able not only to place more SMs on the chip, but also to improve their design.

The next-generation Streaming Multiprocessor (SMX) is a monster in which almost every figure has been doubled or tripled.

Thanks to four warp schedulers, each able to process a whole warp in one clock (Fermi's could only handle half-warps), the SMX now contained 192 cores. Each scheduler featured dual dispatch, allowing a second instruction from a warp to be executed if it was independent of the instruction currently executing. Dual dispatch was not always possible, because one column of 32 cores was shared between two dispatch units.
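
A hypothetical kernel showing the instruction-level parallelism that dual dispatch relies on: the two FMAs below are independent and can be issued together, while the final add depends on both and must wait.

    __global__ void ilp(float *out, const float *a, const float *b) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float x = fmaf(a[i], 2.0f, 1.0f); // independent of y: dual-dispatch candidate
        float y = fmaf(b[i], 3.0f, 1.0f); // independent of x
        out[i] = x + y;                   // depends on both, must wait
    }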

Such a scheme complicated the scheduling logic (we will return to this later), but with up to six warp instructions executed per clock, the SMX delivered twice the performance of the Fermi-era SM.

The flagship NVIDIA GeForce GTX 680, with a GK104 die and eight SMX, advertised 1536 cores, reaching 3,250 gigaflops [7]. The elements of the die became so intricate that I had to remove all the labels from the diagram.


GK104 installed in the GeForce GTX 680.

Note the completely redesigned memory subsystem, running at a breathtaking 6 GHz. It made it possible to reduce the number of memory controllers from six to four.

Maxwell


In 2014, Nvidia released its tenth-generation GPU, named Maxwell. As stated in the GM107 whitepaper [8], the motto of the first generation of the architecture was "maximum energy efficiency and exceptional performance per watt consumed." The cards were positioned for "power-limited environments such as laptops and small form factor (SFF) PCs."

The most important decision was to abandon Kepler's layout, in which the number of CUDA cores per SM was not a power of two and some cores were shared between dispatchers; the cores returned to working in half-warp mode. For the first time in the history of the architecture, the SMM had fewer cores than its predecessor: "only" 128.

Matching the core count to the warp size improved die partitioning, resulting in savings of area and power.


One 2014 SMM had as many cores (128) as an entire 8800 GTX from 2006.

The second generation of Maxwell (described in the GM200 whitepaper [9]) significantly increased performance while maintaining the energy efficiency of the first generation.

The process remained at 28 nanometers, so Nvidia engineers could not resort to simple miniaturization to increase performance. However, the reduced SMM core count shrank each SMM, making it possible to place more of them on the chip. Compared to Kepler, second-generation Maxwell doubled the SMM count while increasing die area by only 25%.

The list of improvements also includes simplified dispatch logic, which reduced redundant rescheduling and computation latency, improving warp occupancy. The memory clock was also raised by 15%.

Studying the Maxwell GM200 block diagram already starts to strain the eyes, but we will examine it dutifully anyway. The flagship NVIDIA GeForce GTX 980 Ti card, with a GM200 die and 24 SMMs, advertised 3072 cores and 6,060 gigaflops [10].


GM200 installed in the GeForce GTX 980 Ti.

Pascal


In 2016, Nvidia introduced Pascal. The GP104 whitepaper [11] gives a sense of déjà vu, because the Pascal SM looks exactly like the Maxwell SMM. The lack of SM changes did not mean stagnating performance: the 16-nanometer process made it possible to place more SMs on the die and double the gigaflops once again.

Among the other major improvements was a memory system built around the all-new GDDR5X. Providing 10 Gbit/s transfer rates thanks to eight memory controllers, the 256-bit memory interface increased memory bandwidth by 43% and reduced warp starvation.
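
The quoted numbers multiply out as follows (for the GP104-based GTX 1080 with its 256-bit bus):

    256 bits / 8 = 32 bytes per transfer
    32 bytes x 10 GT/s = 320 GB/s of peak memory bandwidth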

The flagship NVIDIA GeForce GTX 1080 Ti, with a GP102 die and 28 SMs, advertised 3584 cores and 11,340 gigaflops [12].


GP104 installed in the GeForce GTX 1080.

Turing


With the release of Turing in 2018, Nvidia made its "largest architectural leap in ten years" [13]. The Turing SM gained not only specialized Tensor cores for AI, but also ray tracing (RT) cores. Such a partitioned structure reminds me of the layered architecture that existed before Tesla, proving once again that history likes to repeat itself.


In addition to the new cores, Turing gained three important features. First, the CUDA core became superscalar, allowing integer and floating-point instructions to execute in parallel. If you were around in 1996, this may remind you of Intel's "groundbreaking" architecture of the time.
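
A hypothetical kernel of the kind this helps: the integer address arithmetic can now execute on the INT pipe alongside the floating-point math instead of stalling it.

    __global__ void mixed(float *out, const float *in, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            int j = (i * 31 + 7) % n;     // integer pipe: address arithmetic
            out[i] = in[j] * 0.5f + 1.0f; // float pipe, overlapped with the above
        }
    }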

Second, the new GDDR6 memory subsystem, backed by 16 controllers, can now deliver 14 Gbit/s.

Third, threads in a warp no longer share a common instruction pointer (IP). Thanks to the Independent Thread Scheduling introduced in Volta, each thread has its own IP. As a result, SMs can schedule the threads of a warp more flexibly, without needing to re-converge them as soon as possible.
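
This has a practical consequence for programmers: since diverged threads are no longer guaranteed to run in lockstep, warp-level exchanges must name the participating lanes explicitly. A minimal sketch, assuming 32-thread blocks:

    // Warp-wide sum using the *_sync shuffle variants required since Volta.
    __global__ void warp_sum(float *out, const float *in) {
        float v = in[blockIdx.x * 32 + threadIdx.x];
        for (int offset = 16; offset > 0; offset /= 2)
            v += __shfl_down_sync(0xffffffffu, v, offset); // explicit lane mask
        if (threadIdx.x == 0) out[blockIdx.x] = v;
    }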

The flagship NVIDIA GeForce RTX 2080 Ti, with a TU102 die and 68 SMs, advertised 4352 cores, reaching 13,450 gigaflops [14]. I did not draw a block diagram, because it would look like a blurred green spot.

What awaits us next


According to rumors, the next architecture, code-named Ampere, will be announced in 2020. Since Intel showed with Ice Lake that there is still room for miniaturization, there is little doubt that Nvidia will use a 7-nanometer process to shrink its SMs further and double performance once again.


Teraflops for each Nvidia chip/card (data source: techpowerup.com).

It will be interesting to see how Nvidia continues to evolve the idea of dies with three types of cores performing different tasks. Will we see dies made entirely of Tensor cores or RT cores? I am curious.

Reference materials


[1] Source: Revisiting the Business Card Raytracer
[2] Source: Fermi: The First Complete GPU Computing Architecture
[3] Source: NVIDIA GeForce 8800 GTX (techpowerup.com)
[4] Source: Fermi (GF100) whitepaper
[5] Source: NVIDIA GeForce GTX 480
[6] Source: Kepler (GK104) whitepaper
[7] Source: NVIDIA GeForce GTX 680
[8] Source: Maxwell Gen1 (GM107) whitepaper
[9] Source: Maxwell Gen2 (GM200) whitepaper
[10] Source: NVIDIA GeForce GTX 980 Ti
[11] Source: Pascal (GP102) whitepaper
[12] Source: NVIDIA GeForce GTX 1080 Ti
[13] Source: Turing (TU102) whitepaper
[14] Source: NVIDIA GeForce RTX 2080 Ti
