Here’s how the Ampere architecture changes the underlying elements of the GPU. Get ready for the next round of ray tracing.
The Ampere architecture will power the GeForce RTX 3090, GeForce RTX 3080, GeForce RTX 3070, and other upcoming Nvidia GPUs. It represents the next major upgrade from Team Green and promises a massive leap in performance. Based on current details (the cards come out later this month and in October for the 3070), these GPUs should easily move to the top of our GPU hierarchy, and knock a few of the best graphics cards down a peg or two. Let’s get into the details of what we know about the Ampere architecture, including specifications, features, and other performance enhancements.
[Note: We’ve updated some of the information on the CUDA cores and how they affect performance, provided accurate die sizes and transistor counts, and added details on DLSS 2.1 and ray tracing improvements.]
The Ampere architecture marks an important inflection point for Nvidia. It’s the company’s first 7nm GPU, or 8nm for the consumer parts. Either way, the process shrink allows for significantly more transistors packed into a smaller area than before. It’s also the second generation of consumer ray tracing and third generation deep learning hardware. The smaller process provides a great opportunity for Nvidia to radically improve on the previous RTX 20-series hardware and technologies.
We know the Ampere architecture will find its way into upcoming GeForce RTX 3090, RTX 3080, and RTX 3070 graphics cards, and we expect to see RTX 3060 and RTX 3050 next year. It’s also part of the Nvidia A100 data center GPUs, which are a completely separate category of hardware. Here we’ll break down both the consumer and data center variations of the Ampere architecture and dig into some of the differences.
The launch of Nvidia’s Ampere GPUs feels like a blend of 2016’s Pascal and 2018’s Turing GPUs. Nvidia CEO Jensen Huang unveiled the data center focused A100 on May 14, giving us our first official taste of what’s to come, but the A100 isn’t designed for GeForce cards. It’s the replacement for the Volta GV100 (which replaced the GP100). The consumer models have a different feature set, powered by separate GPUs like the GA102, GA104, and so on. The consumer cards also use GDDR6X/GDDR6, where the A100 uses HBM2.
Besides the underlying GPU architecture, Nvidia has revamped the core graphics card design, with a heavy focus on cooling and power. As an Nvidia video notes, “Whenever we talk about GPU performance, it all comes from the more power you can give and can dissipate, the more performance you can get.” A reworked cooling solution, fans, and PCB (printed circuit board) are all part of improving the overall performance story of Nvidia’s Ampere GPUs. Of course, third party designs are free to deviate from Nvidia’s designs.
With the shift from TSMC’s 12nm FinFET node to TSMC N7 and Samsung 8N, many expected Ampere to deliver better performance at lower power levels. Instead, Nvidia is taking all the extra transistors and efficiency and simply offering more, at least at the top of the product stack. GA100, for example, has 54 billion transistors and an 826mm² die. That’s a massive 156% increase in transistor count from the GV100, while the die is only 1.3% larger. The consumer GPUs also increase in transistor counts while greatly reducing die sizes.
While 7nm/8nm does allow for better efficiency at the same performance, it also allows for much higher performance at the same power. Nvidia is taking the middle route and offering even more performance at still higher power levels. The V100 was a 300W part for the data center model, and the new Nvidia A100 pushes that to 400W. We see the same on the consumer models. GeForce RTX 2080 Ti was a 250/260W part, and the Titan RTX was a 280W part. The RTX 3090 comes with an all-time high TDP for a single GPU of 350W (that doesn’t count the A100, obviously), while the RTX 3080 has a 320W TDP.
What does that mean to the end users? Besides potentially requiring a power supply upgrade, and the use of a 12-pin power connector on Nvidia’s own models, it means a metric truckload of performance. It’s the largest single generation jump in performance I can recall seeing from Nvidia. Combined with the architectural updates, which we’ll get to in a moment, Nvidia says the RTX 3080 has double the performance of the RTX 2080. And if those workloads include ray tracing and/or DLSS, the gulf might be even wider.
Thankfully, pricing isn’t going to be significantly worse than the previous generation GPUs, though that depends on how you want to compare. The GeForce RTX 3090 is set to debut at $1,499, which is a record for a single-GPU GeForce card, effectively replacing the Titan family. The RTX 3080 meanwhile costs $699, and the RTX 3070 will launch at $499, keeping the same pricing as the previous generation RTX 2080 Super and RTX 2070 Super. Does the Ampere architecture justify the pricing? We’ll have to wait a bit longer to actually test the hardware ourselves, but the specs at least look extremely promising.
Let’s also tackle the efficiency question quickly. At one point in his presentation, Jensen said that Ampere delivers 1.9X the performance per watt of Turing. That sounds impressive, but it appears to be more of a theoretical uplift than what we’ll see on the initial slate of GPUs.
Take the RTX 3080 as an example. It has a 320W TDP, which is nearly 50% more than the 215W TDP of the RTX 2080. Even if it really is double the performance of the RTX 2080, that’s still only a 35% improvement in performance per watt.
Nvidia gets the 1.9X figure not from fps/W, but rather by looking at the amount of power required to achieve the same performance level as Turing. If you take a Turing GPU and limit performance to 60 fps in some unspecified game, and do the same with Ampere, Nvidia claims Ampere would use 47% less power.
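The two efficiency figures are easy to conflate, so here is a minimal Python sketch of the arithmetic. It uses only numbers quoted in the text (320W, 215W, the 2X performance claim, and the 1.9X figure); the function names are ours, purely for illustration.

```python
def perf_per_watt_gain(perf_ratio: float, new_watts: float, old_watts: float) -> float:
    """Relative performance per watt of a new part versus an old one."""
    power_ratio = new_watts / old_watts
    return perf_ratio / power_ratio


def iso_performance_power_saving(efficiency_ratio: float) -> float:
    """Fraction of power saved when both parts are held to the same performance."""
    return 1.0 - 1.0 / efficiency_ratio


# RTX 3080 (320W) vs. RTX 2080 (215W), taking the claimed 2X performance at face value.
gain_3080 = perf_per_watt_gain(2.0, 320, 215)

# Nvidia's 1.9X efficiency claim, read as "same fps, less power".
saving = iso_performance_power_saving(1.9)

print(f"RTX 3080 vs. RTX 2080 perf/W: {gain_3080:.2f}x")
print(f"Power saved at equal performance: {saving:.0%}")
```

The first figure comes out at roughly 1.34X and the second at roughly 47%, matching the numbers in the text. The two describe different operating points on the power curve, which is why they don’t contradict each other.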
That’s not all that surprising. We’ve seen power limited GPU designs for a long time in laptops. The RTX 2080 laptops for example can theoretically clock nearly as high as the desktop parts, but they’re restricted to a much lower power level, which means actual clocks and performance are lower. A 10% reduction in performance can often deliver a 30% gain in efficiency when you near the limits of a design.
AMD’s R9 Nano was another example of how badly efficiency decreases at the limit of power and voltage. The R9 Fury X was a 275W TDP part with 4096 shaders clocked at 1050 MHz. R9 Nano took the same 4096 shaders but clocked them at a maximum of 1000 MHz, and applied a 175W TDP limit. Clock speeds were usually closer to 925 MHz in practice, but at roughly one third less power.
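As a rough illustration of that trade-off, the sketch below compares the two cards using the figures quoted above. It assumes performance scales linearly with clock speed, which is our simplification rather than a measured result, and it treats TDP as actual power draw.

```python
# Figures from the text; 925 MHz is the typical in-practice Nano clock.
fury_x = {"clock_mhz": 1050, "tdp_w": 275}
nano = {"clock_mhz": 925, "tdp_w": 175}

# Assumption: performance is proportional to clock speed (same 4096 shaders).
relative_perf = nano["clock_mhz"] / fury_x["clock_mhz"]
relative_power = nano["tdp_w"] / fury_x["tdp_w"]
relative_efficiency = relative_perf / relative_power

print(f"Nano performance vs. Fury X: {relative_perf:.0%}")
print(f"Nano power vs. Fury X:       {relative_power:.0%}")
print(f"Nano perf/W vs. Fury X:      {relative_efficiency:.2f}x")
```

Under those assumptions, giving up about 12% of the performance buys a perf/W improvement of nearly 40%.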
Nvidia Ampere Architecture Specifications
Along with the GA100 for data center use, Nvidia has at least three other Ampere GPUs slated to launch in 2020. There will potentially be as many as three additional Ampere solutions during the coming year, though those are as yet unconfirmed (and not in this table). Here’s the high-level overview.
| Graphics Card | Nvidia A100 | GeForce RTX 3090 | GeForce RTX 3080 | GeForce RTX 3070 |
|---|---|---|---|---|
| Process (nm) | TSMC N7 | Samsung 8N | Samsung 8N | Samsung 8N |
| Die Size (mm^2) | 826 | 628.4 | 628.4 | 392.5 |
| Boost Clock (MHz) | 1410 | 1700 | 1710 | 1730 |
| VRAM Speed (Gbps) | 2.43 | 19.5 (GDDR6X) | 19 (GDDR6X) | 14 (GDDR6) |
| VRAM (GB) | 40 (48 max) | 24 | 10 | 8 |
| Bus Width (bits) | 5120 (6144 max) | 384 | 320 | 256 |
| Tensor TFLOPS FP16 (sparsity) | 312 (624) | 143 (285) | 119 (238) | 81 (163) |
| TBP (watts) | 400 (250 PCIe) | 350 | 320 | 220 |
| Launch Date | May 2020 | September 24, 2020 | September 17, 2020 | October 15, 2020 |
| Launch Price | $199K for DGX A100 (with 8x A100) | $1,499 | $699 | $499 |
The biggest and baddest GPU is the GA100. It has up to 128 SMs and six HBM2 stacks of 8GB each, of which only 108 SMs and five HBM2 stacks are currently enabled in the Nvidia A100. Future variations could have the full GPU and RAM configuration. However, the GA100 isn’t going to be a consumer part, just like the GP100 and GV100 before it were only for data center and workstation use. Without ray tracing hardware, the GA100 isn’t remotely viable as a GeForce card, never mind the cost of the massive die, HBM2, and silicon interposer.
Stepping down to the consumer models, Nvidia makes some big changes. Nvidia apparently doubled the number of FP32 CUDA cores per SM, which results in huge gains in shader performance. With the GA102, Nvidia has a total of seven GPCs, each with 12 SMs, giving a maximum configuration of 84 SMs. Of these, 82 are enabled in the RTX 3090 while the RTX 3080 only has 68 enabled. The HBM2 and silicon interposer are also gone, replaced by 24 GDDR6X chips, each running on a 16-bit half-width interface for the 3090, or 10 GDDR6X chips, each running on a 32-bit interface, for the 3080.
With the doubled CUDA cores per SM, the RTX 3090’s 82 SMs equate to 10496 CUDA cores (128 per SM), with two FP64 capable CUDA cores per SM. In other words, FP64 performance is 1/64 the FP32 performance. Nvidia strips out the remaining FP64 functionality, and in its place adds 2nd generation RT cores. Each SM also has four 3rd gen tensor cores, each of which has four times the throughput per clock of the previous gen Turing tensor cores.
The boost clock of 1700 MHz gives a potential 35.7 TFLOPS of FP32 compute performance, and the 19.5 Gbps GDDR6X delivers 936 GBps of bandwidth. In case that’s not clear, potentially the RTX 3090 will have more than double the performance of the RTX 2080 Ti.
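For readers who want to check those figures, here is a small Python sketch of the usual back-of-the-envelope formulas. The helper names are ours, and the factor of two FLOPs per core per clock is the standard fused multiply-add convention rather than something spelled out in the text.

```python
def fp32_tflops(cuda_cores: int, boost_mhz: float) -> float:
    """Theoretical FP32 TFLOPS: cores x 2 FLOPs (one FMA) per clock x clock."""
    flops_per_second = cuda_cores * 2 * boost_mhz * 1e6
    return flops_per_second / 1e12


def memory_bandwidth_gbps(data_rate_gbps: float, bus_width_bits: int) -> float:
    """Peak memory bandwidth in GB/s: per-pin data rate x bus width / 8 bits per byte."""
    return data_rate_gbps * bus_width_bits / 8


# RTX 3090: 10496 CUDA cores at 1700 MHz, 19.5 Gbps GDDR6X on a 384-bit bus.
print(f"FP32:      {fp32_tflops(10496, 1700):.1f} TFLOPS")
print(f"Bandwidth: {memory_bandwidth_gbps(19.5, 384):.0f} GBps")
```

The same two functions reproduce the RTX 3080 and RTX 3070 figures quoted later when given their core counts, clocks, memory speeds, and bus widths.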
Considering the RTX 3090 is very nearly a full GA102 chip, there’s not much room for anything faster right now. Could there be a future Titan card with a fully enabled GA102? Absolutely, but it would only be 2.4% faster at the same clocks as the 3090. Maybe 21 Gbps memory would help, but realistically we don’t see Nvidia doing a Titan card for Ampere. Instead, the RTX 3090 is an extreme performance consumer-focused card, and it’s now open for third parties to create custom designs (unlike the Titan cards of previous generations).
There’s more to it than a simple doubling of CUDA cores, however. Specifically, Nvidia’s Ampere architecture for consumer GPUs now has one set of CUDA cores that can handle FP32 and INT instructions, and a second set of CUDA cores that can only do FP32 instructions.
To understand how this affects performance, we need to go back to the Turing architecture where Nvidia added concurrent FP32 + INT support. If you’re thinking Ampere can now do concurrent FP32 + FP32 + INT, that’s incorrect. Instead, it’s concurrent FP32 + (FP32 or INT). That means that while theoretical TFLOPS has increased dramatically, we won’t see gaming performance scale directly with TFLOPS.
With Turing, Nvidia said that in many games (looking at a broad cross section of games), roughly 35% of the CUDA core calculations were integer workloads. Memory pointer lookups are a typical example of this. If that ratio still holds, one third of all GPU calculations in a game will be INT calculations, which potentially occupy more than half of the FP32+INT portion of the SMs.
Nvidia’s own performance numbers reflect this. It has shown a generational performance increase of up to 2X when comparing RTX 3080 to RTX 2080, but if you look just at TFLOPS, the RTX 3080 has nearly triple the theoretical performance. The reality is that the RTX 2080 could do FP32 + INT at around 10 tera-OPS each, whereas the RTX 3080 has nearly 30 tera-OPS of FP32 available and only 15 tera-OPS of INT available. Using the one-third ratio from above, that means it might end up doing 10 TOPS of INT on the one set of cores (two thirds of their capacity), and 15+5 TFLOPS of FP32 spread across the FP32 cores.
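That split can be sketched as a toy model in Python. It assumes two equal datapaths per SM (one FP32-only, one shared FP32/INT), perfect scheduling, and no memory bottlenecks. Those are our simplifications, not a description of how the hardware actually schedules work, and the function and key names are hypothetical.

```python
def ampere_split(path_tops: float, int_fraction: float) -> dict:
    """Split work across an FP32-only path and a shared FP32/INT path.

    path_tops:    peak tera-ops per second of EACH path (about 15 for the RTX 3080).
    int_fraction: share of all operations that are integer (0.0 to 1.0).
    """
    if not 0.0 <= int_fraction <= 1.0:
        raise ValueError("int_fraction must be between 0 and 1")

    if int_fraction <= 0.5:
        # INT fits on the shared path with room to spare, so both paths stay busy.
        total = 2 * path_tops
    else:
        # The shared path is saturated by INT and becomes the bottleneck.
        total = path_tops / int_fraction

    int_tops = total * int_fraction
    fp32_tops = total - int_tops
    fp32_on_shared_path = path_tops - int_tops
    return {
        "int_tops": int_tops,
        "fp32_tops": fp32_tops,
        "fp32_on_dedicated_path": fp32_tops - fp32_on_shared_path,
        "fp32_on_shared_path": fp32_on_shared_path,
    }


# RTX 3080: roughly 15 tera-ops per path, with one third of the work being INT.
for name, value in ampere_split(15, 1 / 3).items():
    print(f"{name}: {value:.0f}")
```

With those inputs the model lands on the same 10 TOPS of INT and 15+5 TFLOPS of FP32 described above, which is double the RTX 2080’s roughly 10 TFLOPS of FP32 rather than triple.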
Even though compute performance has received a massive increase, it’s also important to note that bandwidth hasn’t grown as much. Compared to the RTX 2080, the RTX 3080 has triple the FP32 potential and 1.5X the INT potential, but only about 1.7X the bandwidth (760 GBps vs. 448 GBps), or 1.53X relative to the RTX 2080 Super. There are probably improvements in memory compression that make the effective bandwidth higher, but overall we likely will never see anything close to a 3X increase in FP32 performance, unless someone can make a pure FP32 theoretical test.
In a bit of a surprise move, the RTX 3080 also uses the same GA102 chip as the 3090, only this time with 68 SMs enabled. Nvidia says yields are great for Ampere, but obviously part of that is being able to use partially enabled GPUs. That gives the RTX 3080 a still very impressive 8704 CUDA cores. Two of the memory channels are also disabled, giving it 10GB of GDDR6X memory.
Unlike in previous generations, the clocks on all three RTX 30-series GPUs are relatively similar: 1700-1730MHz. In terms of theoretical performance, the RTX 3080 can do 29.8 TFLOPS and has 760 GBps of bandwidth, and Nvidia says it’s twice as fast as the outgoing RTX 2080.
That doesn’t quite add up, as we noted above. The theoretical FP32 TFLOPS performance is nearly tripled, but the split in FP32 vs. FP32/INT on the cores, along with other elements like memory bandwidth, means a 2X improvement is going to be at the higher end overall.
The RTX 3070 switches over to the GA104 GPU, and it continues the trimming relative to the GA102. Where GA102 has seven GPCs with 12 SMs each, GA104 has six GPCs with 8 SMs each, giving a maximum of 48 SMs. The RTX 3070, similar to the 3090, has two SMs disabled to improve yields, leaving 46 active SMs and 5888 CUDA cores. Naturally, it has a smaller die and a lower transistor count as well: 17.4 billion transistors and a 392.5mm² die.
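The SM and CUDA core counts for all three cards follow from the same simple arithmetic, sketched below in Python. The 128 cores per SM figure is derived from the numbers in the text (10496 cores across 82 SMs); the function name and the disabled-SM counts passed in are ours, taken as full configuration minus enabled SMs.

```python
CORES_PER_SM = 128  # FP32 CUDA cores per consumer Ampere SM (10496 / 82)


def gpu_config(gpcs: int, sms_per_gpc: int, disabled_sms: int) -> tuple[int, int, int]:
    """Return (maximum SMs on the die, active SMs, active CUDA cores)."""
    max_sms = gpcs * sms_per_gpc
    active_sms = max_sms - disabled_sms
    return max_sms, active_sms, active_sms * CORES_PER_SM


cards = {
    "RTX 3090 (GA102)": gpu_config(gpcs=7, sms_per_gpc=12, disabled_sms=2),
    "RTX 3080 (GA102)": gpu_config(gpcs=7, sms_per_gpc=12, disabled_sms=16),
    "RTX 3070 (GA104)": gpu_config(gpcs=6, sms_per_gpc=8, disabled_sms=2),
}

for name, (max_sms, active_sms, cores) in cards.items():
    print(f"{name}: {active_sms}/{max_sms} SMs, {cores} CUDA cores")
```

Seen this way, the RTX 3080 gives up 16 of the GA102’s 84 SMs, far more than the two SMs trimmed from the 3090 and 3070.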
Unlike the 3090/3080, the RTX 3070 uses GDDR6 and has eight channels with 8GB of memory on a 256-bit bus. Does GA104 support both GDDR6 and GDDR6X? We don’t know. Curiously, the GDDR6 memory speed remains at 14Gbps, the same as the Turing GPUs, which means it could run into bandwidth bottlenecks in some workloads. However, it also has the same 96 ROPs as the 3080, and 50% more ROPs than the previous generation RTX 2070 Super. The RTX 3070 will launch on October 15, so we’ll receive additional details over the coming days.
The RTX 3070 delivers 20.4 TFLOPS and 448 GBps of bandwidth. Nvidia says the RTX 3070 will end up faster than the RTX 2080 Ti as well, though there might be cases where the 11GB vs. 8GB VRAM allows the former heavyweight champion to come out ahead. Again, architectural enhancements will definitely help, so without further ado, let’s talk about the Ampere architecture.