Nvidia Turing GPU deep dive: What's inside the radical GeForce RTX 2080 Ti
- 14 September, 2018 23:00
It’s time to pull back the curtain on the Turing GPU inside Nvidia’s radical new GeForce RTX 20-series, the first-ever graphics cards designed to handle real-time ray tracing thanks to the inclusion of dedicated tensor and RT cores. But the GeForce RTX 2080 and RTX 2080 Ti were also designed to significantly improve performance in traditionally rendered games, with enough power to feed those blazing-fast 4K, 144Hz G-Sync HDR gaming monitors.
Nvidia revealed plenty of numbers during the GeForce RTX 2080 Ti announcement. Clock speeds, memory bandwidth, CUDA core counts—it was all there. This deeper dive explains the underlying, architectural changes that make Nvidia’s Turing GPU more potent than its Pascal predecessor. We’ll also highlight some new Nvidia tools that developers can embrace to speed up performance even more, or bring the AI-boosted power of Nvidia’s Saturn V supercomputer into your graphics card.
This isn’t a review of the GeForce RTX 2080 or GeForce RTX 2080 Ti, both of which are scheduled to hit the streets on September 20. But with RTX preorders already open, hopefully this gives you a better idea of what you’re spending your money on. Again, be sure to read our coverage of the GeForce GTX 2080 Ti’s announcement for feeds, speeds, and product details about the graphics cards themselves.
Nvidia Turing GPU overview
Before we dig in, here’s a high-level specifications overview for the Turing TU102 GPU inside the flagship GeForce RTX 2080 Ti.
Here’s Nvidia’s high-level overview, in case what you’re looking at isn’t clear:
“The TU102 GPU includes six Graphics Processing Clusters (GPCs), 36 Texture Processing Clusters (TPCs), and 72 Streaming Multiprocessors (SMs). Each GPC includes a dedicated raster engine and six TPCs, with each TPC including two SMs. Each SM contains 64 CUDA Cores, eight Tensor Cores, a 256 KB register file, four texture units, and 96 KB of L1/shared memory which can be configured for various capacities depending on the compute or graphics workloads… Tied to each memory controller are eight ROP units and 512 KB of L2 cache.”
You’ll also find a single RT processing core within each SM, so there are 72 in the GeForce RTX 2080 Ti. Because the RT and tensor cores are baked right into each streaming multiprocessor, the lower you go in the GeForce RTX 20-series lineup, the fewer you’ll find of each. The RTX 2080 has 46 RT cores and 368 tensor cores, for example, and the RTX 2070 will have 36 RT cores and 288 tensor cores.
With all that cutting-edge stuff packed in, all requiring dedicated hardware, it shouldn’t come as a surprise that Turing is absolutely massive. The die measures in at a whopping 754mm, compared to the 471mm Pascal GPU inside the GTX 1080 Ti.
Inside Nvidia Turing GPU: Shading and memory improvements
Let’s explain the improvements to the long-established stuff before digging into the exotic new tensor and RT cores.
Nvidia says the GeForce RTX 2080 can be roughly 50 percent faster than the GTX 1080 in traditional games. Many of the comparisons occur in games with HDR enabled, which take a performance hit on current GTX 10-series cards. The GeForce RTX 2080 can be more than twice as fast as the GTX 1080 in games that support Nvidia’s DLSS technology, Nvidia claims (we’ll talk more about DLSS later), and surpass 60 frames per second in several triple-A games at 4K resolution with HDR visuals enabled.
The real-world performance of Nvidia’s high-end RTX duo remains to be seen. Nvidia hasn’t uttered a peep about the GeForce RTX 2080 Ti’s performance in traditional games. We still have no idea how the GeForce RTX 2080 compares against the older GTX 1080 Ti in non-HDR games. Nvidia’s frame rates in the 4K/60 HDR games listed above don’t mention what graphics settings they were tested with.
Because the RTX 2080 is at least in the same performance ballpark as the GTX 1080 Ti despite having nearly 20 percent fewer overall CUDA cores, it’s clear those CUDA cores have been upgraded.
Scratch that: The Turing GPU’s simultaneous multiprocessors haven’t just been upgraded, they’ve been overhauled. Beyond the addition of tensor and RT cores, Nvidia also added a new integer pipeline (INT32) alongside the floating point pipeline (FP32) traditionally used to process shading.
When Nvidia examined how real-world games behaved, it found that for every 100 floating point instructions performed, an average of 36 and as many as 50 non-floating point instructions were also processed, jamming things up. The new integer pipeline handles those extra instructions separately from and concurrently with the FP32 pipeline. Executing the two tasks at the same time results in a big speed boost, according to Jonah Alben, Nvidia’s VP of GPU engineering.
Nvidia also rejiggered how the memory caches inside its simultaneous multiprocessors work. Now, smaller SMs each feed into a unified pool of L1 and shared memory, which in turn feeds into an L2 cache that’s twice as large as before. The shake-up means Turing has almost three times more L1 memory available than the Pascal GPUs in the GTX 10-series, with twice as much bandwidth and lower latency.
Add it all up and Nvidia claims that the Turing GPU performs traditional shading a whopping 50 percent better than Pascal. That’s a massive architectural improvement, though the actual gains will vary from game to game, as shown in the slide above.
But games aren’t bound by shading performance alone. Memory bandwidth can directly affect how well your games play. Turing improves upon Pascal’s superb memory compression technology, and the GeForce RTX 2080 and 2080 Ti build atop that with the introduction of Micron’s next-gen GDDR6 memory—the first time it’s appeared in a GPU. GDDR6 blazes along at 14Gbps despite being 20 percent more power-efficient than GDDR5X, and Nvidia optimized Turing’s RAM for 40 percent lower crosstalk than in its predecessor.
The grab-bag of improvements gives the RTX 2080 Ti a 50-percent increase in effective memory bandwidth over the GTX 1080 Ti, Nvidia says. In real-world terms, the GeForce RTX 2080 Ti hits a total memory bandwidth of 616GBps, versus the GTX 1080 Ti’s 484GBps, even though both cards offer identical memory capacities and bus sizes. That’s the power of GDDR6.
Turing’s new shading technologies
As with most major GPU architecture launches, Nvidia also introduced some new shading technologies that developers can take advantage of to improve performance, visuals, or both.
Mesh shading help take some of the burden off your CPU during very visually complex scenes, with tens or hundreds of thousands of objects. It consists of two new shader stages. Task shaders perform object culling to determine which elements of a scene need to be rendered. Once that’s decided, Mesh Shaders determine the level of detail at which the visible objects should be rendered. Ones that are farther away need a much lower level of detail, while closer objects need to look as sharp as possible.
Nvidia showed off mesh shading with an impressive, playable demo where you flew a spaceship through a massive field of 300,000 asteroids. The demo ran around 50 frames per second despite that gargantuan object count because mesh shading reduced the number of drawn triangles at any given point down to around 13,000, from a maximum of 3 trillion potential drawn triangles. Intriguing stuff.
Variable rate shading is sort of like a supercharged version of the multi-resolution shading that Nvidia’s supported for years now. Human eyes only see the focal points of what’s in their vision at full detail; objects at the periphery or in motion aren’t as sharp. Variable rate shading takes advantage of that to shade primary objects at full resolution, but secondary objects at a lower rate, which can improve performance.
One potential use case for this is Motion Adaptive Shading, where non-critical parts of a moving scene are rendered with less detail. The image above shows how it could be handled in Forza Horizon. Traditionally, every part of the screen would be rendered at full detail, but with Motion Adaptive Shading, only the blue sections of the scene get such lofty treatment.
Content Adaptive Shading applies the same principles, but it dynamically identifies portions of the screen that have low detail or large swathes of similar colors, and shades those at lower detail, and more so when you’re in motion. It looked damned fine in action during a playable Wolfenstein II demo that let you toggle the feature on and off. I couldn’t perceive any change in visual quality, but Nvidia’s Alben says that activating CAS boosts imaging speed by 20 fps or more in situations where you’re targeting 60 fps on a mainstream GPU. Fingers crossed developers support this sort of technology with more gusto than they did multi-res shading, which blew me away in Shadow Warrior 2 but didn’t gain any traction beyond that.
Variable rate shading can also help in virtual reality workloads by tailoring the level of detail to where you’re looking. Another new VR tech, Multi-View Rendering, expands upon the Simultaneous Multi-Projection technology introduced with the GTX 10-series to allow “developers to efficiently draw a scene from multiple viewpoints or even draw multiple instances of a character in varying poses, all in a single pass.”
Finally, Nvidia also introduced Texture Space Shading, which shades an area around an object rather than a single scene to let developers reuse shading in multiple perspectives and frames.
What happens inside a frame with Turing
For a standard GPU architecture, that’d be all you need to know. But we’re just getting started with Turing. In fact, before we dig in, let’s quickly cover how Turing handles workloads that take advantage of all its capabilities, such as a ray traced game.
The lines up top marked “1 Turing frame” are the key. For Nvidia’s other GPUs, it would consist of just one portion—the yellow FP32 line. But it’s more complicated in Turing.
While that standard FP32 shader processing takes place, the dedicated RT cores and integer pipeline are executing their own specialized tasks at the same time. Once all that’s done, everything’s handed off to the tensor cores for the final 20 percent of a frame, which perform their machine learning-enhanced magic—such as denoising ray traced images or applying Deep Learning Super Sampling—in games that utilize Nvidia’s RTX technology stack.
Next page: RT cores, tensor cores, and video/display upgrades.
Inside Nvidia’s Turing GPU: RT cores
Let’s start with the RT cores. Nvidia never said so explicitly, but “RT” likely stands for “ray tracing,” as this hardware is included in Turing GPUs solely to accelerate ray tracing performance. Ray tracing’s the marquee feature of the GeForce RTX 20-series—enough so that these new GPUs drop the longstanding “GTX” name.
Ray tracing simulates the way light works in the real world, casting out rays to interact with objects in a scene, rather than “faking it” like rasterization, the rendering method traditionally used in games. It’s what makes CGI movies so beautiful. But ray tracing is incredibly computationally expensive, which is why it’s been considered gaming’s exalted and unattainable Holy Grail for decades now. Even Nvidia’s RTX implementation, built on top of Microsoft’s forthcoming DirectX Raytracing API, doesn’t provide “true” ray tracing. Instead, it’s a hybrid rendering model that leans on traditional rasterization techniques for most of the scene, then turns to ray tracing for light, shadow, and reflections. It’s visually impressive even in this form, though.
Nvidia’s RTX ray tracing implementation is built around bounding volume hierarchy (BVH) traversal, using a technique that tests whether an object was hit by a ray of light by breaking it into a series of large boxes rather than scanning each of its gajillion small triangles. Each box quickly reports whether it was hit by the ray; if one was, it’s split into a smaller set of boxes, and the one in there that’s hit gets split into another series of small boxes, and so on. Finally, when you get down to the smallest set of boxes, it’s split out into the object’s individual triangles. When the final triangle’s hit, it provides location and color information.
Testing ray tracing via BVH groupings lets games go from scanning millions of potential triangle targets down to mere dozens, according to Nvidia’s Jonah Alben. But performing BVH with traditional shader hardware bogs doesn’t work well, because it takes several thousand software instruction slots per ray to work through all the boxes and find those final triangles. The shaders sit around and wait while all those boxes and triangles get discovered. Here’s a diagram of how it works with the GTX 10-series’ Pascal GPU.
Turing behaves differently. The shaders launch a ray probe, and instead of having to process the thousands of instructions via software emulation, it boots the testing over to the dedicated RT core hardware, leaving the shaders free to perform other tasks.
RT cores consist of two specialized parts. First, box intersection evaluators plow through all the boxes created during BVH traversal to find the final set of triangles. After that, the workload’s handed off to a second unit to perform ray-triangle intersection tests. The RT cores pass the final information back to the shaders for image processing.
Splitting the computationally expensive ray tracing task up between shaders and dedicated RT cores works wonders for performance. Nvidia says that while the GTX 1080 Ti can generate 1.1 Giga Rays per second, the Turing GPU in the GeForce RTX 2080 Ti can run ray tracing at a clip of 10-plus Giga Rays. It’s the dedicated hardware that makes real-time ray tracing—a feat the industry expected to be years away—possible with GeForce RTX today, albeit in hybrid form.
That’s not the end of the special sauce, though. Ray tracing and other Nvidia RTX features also wouldn’t be possible if Turing lacked tensor cores.
Inside Nvidia’s Turing GPU: Tensor cores and NGX
“Tensor Cores are specialized execution units designed specifically for performing the tensor / matrix operations that are the core compute function used in Deep Learning,” Nvidia says. They’re used for machine learning tasks.
The GeForce RTX 2080 and 2080 Ti are the first consumer graphics cards to include tensor cores, which originally appeared in Nvidia’s pricey Volta data center GPUs. Unlike Volta tensor cores, however, Turing’s include “new INT8 and INT4 precision modes for inferencing workloads that can tolerate quantization and don’t require FP16 precision.” FP16 is crucial for gaming uses, though, and Turing supports it to the tune of 114 teraflops.
That’s a lot of numbers and specialized terms. What do Turing’s tensor cores actually do? Well, to start, there’s a reason they mop things up at the end of a frame. Tensor cores basically leverage AI to clean up or alter the images created by the CUDA and RT cores. Crucially, they help “denoise” ray tracing. Real-time ray tracing isn’t new; it’s been around for ages but didn’t perform well. Ray traced scenes often appear visually grainy due to the limited number of rays you can cast without choking performance, as you can see in this ray traced Quake 2 video (which was not created for or tested on Nvidia’s RTX). Turing’s tensor cores use machine learning to help identify and eliminate that graininess.
That’s not the only application though. Nvidia’s new NGX (“neural graphics framework”) suite leverages tensor cores in multiple ways, and some are pretty damned interesting.
“NGX utilizes deep neural networks (DNNs) and set of ‘Neural Services’ to perform AI-based functions that accelerate and enhance graphics, rendering, and other client-side applications,” Nvidia says. Basically, the company uses its Saturn V supercomputer to train AI on how best to optimize a game around an NGX feature, and then pushes that pre-trained information out to RTX 2080 and 2080 Ti owners via the GeForce Experience app. Once installed, the graphics card’s tensor cores use that AI training model in-game, performing the same tasks on your local hardware.
Some of the first NGX features include AI-boosted slow-mo video, and an “AI Inpainting” tool that lets you “remove existing content from an image and use an NGX AI algorithm to replace the removed content with a realistic computer-generated alternative. For example, Inpainting could be used to automatically remove power lines from a landscape image, replacing them seamlessly with the existing sky background.” Sounds nifty!
Other NGX features prove more relevant for gamers. AI Super Rez increases the effective resolution of an image or video by 2x, 4x, or 8x by training the NGX AI on a “ground truth” image as near to perfect as possible. It can then upgrade even low-resolution images by using that “ground truth” information to interpret what’s being shown, and intelligently adding pixels where they’re need to improve the resolution. Nvidia says its video super-resolution network can upscale 1080p to 4K in real time at 30 frames per second.
(I suggest clicking the images above and below to examine them in finer detail.)
But the most intriguing NGX feature must be Deep Learning Super Sampling (DLSS), an antialiasing method that provides much sharper results than traditional Temporal Anti-Aliasing (TAA) at a much lower performance cost. Epic’s Infiltrator demo runs at just under 40 frames per second on the GTX 1080 Ti at 4K resolution with TAA enabled; it runs just shy of 60 fps on the RTX 2080 Ti using the same settings, or at 80 fps with DLSS enabled. Hot damn!
Nvidia’s Turing architecture whitepaper explains how it works:
“The key to this result is the training process for DLSS, where it gets the opportunity to learn how to produce the desired output based on large numbers of super-high-quality examples. To train the network, we collect thousands of “ground truth” reference images rendered with the gold standard method for perfect image quality, 64x supersampling (64xSS). 64x supersampling means that instead of shading each pixel once, we shade at 64 different offsets within the pixel, and then combine the outputs, producing a resulting image with ideal detail and anti-aliasing quality. We also capture matching raw input images rendered normally. Next, we start training the DLSS network to match the 64xSS output frames, by going through each input, asking DLSS to produce an output, measuring the difference between its output and the 64xSS target, and adjusting the weights in the network based on the differences, through a process called back propagation. After many iterations, DLSS learns on its own to produce results that closely approximate the quality of 64xSS, while also learning to avoid the problems with blurring, disocclusion, and transparency that affect classical approaches like TAA.
In addition to the DLSS capability described above, which is the standard DLSS mode, we provide a second mode, called DLSS 2X. In this case, DLSS input is rendered at the final target resolution and then combined by a larger DLSS network to produce an output image that approaches the level of the 64x super sample rendering - a result that would be impossible to achieve in real time by any traditional means.”
More than a dozen games are already lined up for DLSS support. Here’s a full list of every game that supports ray tracing or DLSS.
Developers need to actively support NGX features in their games and applications. Nvidia offers an NGX API and SDK, and it doesn’t charge developers to train NGX features. Tony Tamasi, Nvidia’s senior VP of content and technology, says the company has successfully integrated NGX features into existing games in under a day.
Inside Nvidia’s Turing GPU: Display and video
Finally, lets wrap things up with changes to Turing’s display and video handling. The GeForce RTX 2070, 2080, and 2080 Ti all support VirtualLink, a new USB-C alternate-mode standard that contains all the video, audio, and data connections necessary to power a virtual reality headset. Turing also supports low-latency native HDR output with tone mapping, and 8K at up to 60Hz on dual displays. The DisplayPorts on GeForce RTX graphics cards are also DisplayPort 1.4a ready.
Turing also includes improvements in video encoding and decoding, which should be of particular interest to streamers and other video enthusiasts. Check out the details in the screenshot below.
That wraps up our deep-dive into the Turing GPU, the cutting-edge monster at the heart of the GeForce RTX 2080 and 2080 Ti. Check out our roundup of the custom GeForce RTX graphics cards you can preorder right now ahead of their launch on September 20.