GeForce RTX 3080 Founders Edition: Hail to the King!
Nvidia’s GeForce RTX 3080 Founders Edition is here, claiming the top spot on our GPU benchmarks hierarchy, and ranking as the best graphics card currently available — provided you’re after performance first, with price and power being lesser concerns. After months of waiting, we finally have independent benchmarks and testing data. Nvidia has thrown down the gauntlet, clearly challenging AMD’s Big Navi to try and match or beat what the Ampere architecture brings to the table.
We’re going to hold off on a final verdict for now, as we also have third-party RTX 3080 cards to review, with those reviews starting as soon as tomorrow. That’s good news, as it means customers won’t be limited to Nvidia’s Founders Edition for the first month or so like we were with the RTX 20-series launch. Another piece of good news is that there’s no Founders Edition ‘tax’ this time: The RTX 3080 FE costs $699, direct from Nvidia, and that’s the base price of RTX 3080 cards for the time being. The bad news is that we fully expect supply to fall short of what should be exceptionally high demand.
The bottom line, if you don’t mind spoilers, is that the RTX 3080 FE is 33% faster than the RTX 2080 Ti, on average. Or, if you prefer other points of comparison, it’s 57% faster than the RTX 2080 Super, 69% faster than the RTX 2080 FE — heck, it’s even 26% faster than the Titan RTX!
But there’s a catch: We measured all of those ‘percent faster’ results across our test suite running at 4K ultra settings. The lead narrows if you drop down to 1440p, and it decreases even more at 1080p. It’s still 42% faster than a 2080 FE at 1080p ultra, but this is very much a card made for higher resolutions. Also, you might need a faster CPU to get the full 3080 experience — check out our companion GeForce RTX 3080 CPU Scaling article for the full details.
| Graphics Card | RTX 3080 FE | RTX 2080 Super FE | RTX 2080 FE |
| --- | --- | --- | --- |
| Process | Samsung 8N | TSMC 12FFN | TSMC 12FFN |
| Die size (mm^2) | 628.4 | 545 | 545 |
| FP32 CUDA Cores | 8704 | 3072 | 2944 |
| Boost Clock (MHz) | 1710 | 1815 | 1800 |
| VRAM Speed (Gbps) | 19 | 15.5 | 14 |
| VRAM Bus Width (bits) | 320 | 256 | 256 |
| Tensor TFLOPS FP16 (Sparsity) | 119 (238) | 89 | 85 |
We have a separate article going deep into the Ampere architecture that powers the GeForce RTX 3080 and other related GPUs. If you want the full rundown of everything that’s changed compared to the Turing architecture, we recommend starting there. But here’s the highlight reel of the most important changes:
The GA102 is the first GPU from Nvidia to drop into the single digits on lithography, using Samsung’s 8N process. The general consensus is that TSMC’s N7 node is ‘better’ overall, but it also costs more and is currently in very high demand, including from Nvidia’s own A100. Could the consumer Ampere GPUs have been even better with 7nm? Perhaps. But they might have cost more, only been available in limited quantities, or maybe they would have been delayed a few more months. Regardless, GA102 is still a big and powerful chip, boasting 28.3 billion transistors packed into a 628.4 mm² die. If you’re wondering, that’s 52% more transistors than the TU102 chip used in RTX 2080 Ti, but in a 17% smaller area.
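For anyone who wants to check those two percentages, here's a quick sketch of the arithmetic. The GA102 figures come from the paragraph above; the TU102 figures (18.6 billion transistors, 754 mm²) are Nvidia's published specs rather than numbers quoted in this article, so treat them as an assumption of the example.

```python
# Rough check of the GA102 vs. TU102 comparison.
# GA102 figures are from the article. The TU102 figures are Nvidia's
# published specs and are an assumption here (not quoted in the article).
GA102 = {"transistors_billion": 28.3, "die_mm2": 628.4}
TU102 = {"transistors_billion": 18.6, "die_mm2": 754.0}


def percent_change(new: float, old: float) -> float:
    """Signed percentage change going from old to new."""
    return (new / old - 1.0) * 100.0


def density_mtr_per_mm2(chip: dict) -> float:
    """Transistor density in millions of transistors per mm^2."""
    return chip["transistors_billion"] * 1000.0 / chip["die_mm2"]


if __name__ == "__main__":
    transistors = percent_change(GA102["transistors_billion"],
                                 TU102["transistors_billion"])
    area = percent_change(GA102["die_mm2"], TU102["die_mm2"])
    print(f"Transistor count change: {transistors:+.1f}%")
    print(f"Die area change:         {area:+.1f}%")
    print(f"GA102 density: {density_mtr_per_mm2(GA102):.1f} MTr/mm^2")
    print(f"TU102 density: {density_mtr_per_mm2(TU102):.1f} MTr/mm^2")
```

The density figures fall out of the same numbers and suggest that, under these assumptions, GA102 packs a bit less than twice as many transistors into each square millimeter as TU102.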
Ampere ends up as a split architecture, with the GA100 taking on data center ambitions while the GA102 and other consumer chips have significant differences. The GA100 focuses far more on FP64 performance for scientific workloads, as well as doubling down on deep learning hardware. Meanwhile, the GA102 drops most of the FP64 functionality and instead includes ray tracing hardware, plus some other architectural enhancements. Let’s take a closer look at the Ampere SM found in the GA102 and GA104.
One thing that hasn’t changed much is the video ports. Okay, that’s only partially true. There’s still a single HDMI port, but it’s now HDMI 2.1 instead of Turing’s HDMI 2.0b, while the three DisplayPort connections remain 1.4a. Also, there’s no VirtualLink port this round; apparently, VirtualLink is dead. RIP. The various ports are all capable of 8K60 using DSC (Display Stream Compression), a “visually lossless” technique that’s not actually lossless. But you might not notice at 8K.
Getting back to the cores, Nvidia’s third-gen tensor cores in GA102 work on 8x4x4 FP16 matrices, so up to 128 multiply-accumulate operations per cycle. (Turing’s tensor cores used 4x4x4 matrices, while the GA100 uses 8x4x8 matrices.) Since each FMA (fused multiply-add) counts as two floating-point operations, that’s 256 FP operations per cycle, per tensor core. Multiply by the 272 total tensor cores and the 1710 MHz boost clock, and that gives you 119 TFLOPS of FP16 compute. However, Ampere’s tensor cores also add support for fine-grained sparsity: basically, the hardware avoids wasting time doing multiplications by 0, since the answer is always 0. Sparsity can provide up to twice the FP16 performance in applications that can use it.
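The tensor throughput math is easy to reproduce. The sketch below uses only the figures from the text and the spec table (8x4x4 matrices, 272 tensor cores, 1710 MHz boost clock); the function and constant names are ours, purely for illustration.

```python
# Peak FP16 tensor throughput, using the figures quoted in the article.
MATRIX_DIMS = (8, 4, 4)   # GA102 tensor core matrix shape
TENSOR_CORES = 272        # RTX 3080
BOOST_CLOCK_MHZ = 1710    # RTX 3080 FE boost clock from the spec table


def tensor_tflops(dims, cores, clock_mhz, sparsity=False):
    """Theoretical peak FP16 tensor TFLOPS at the given clock."""
    m, n, k = dims
    fma_per_cycle = m * n * k            # 128 fused multiply-adds per core
    flops_per_cycle = fma_per_cycle * 2  # each FMA is a multiply plus an add
    total = flops_per_cycle * cores * clock_mhz * 1e6
    if sparsity:
        total *= 2                       # best case with fine-grained sparsity
    return total / 1e12


if __name__ == "__main__":
    dense = tensor_tflops(MATRIX_DIMS, TENSOR_CORES, BOOST_CLOCK_MHZ)
    sparse = tensor_tflops(MATRIX_DIMS, TENSOR_CORES, BOOST_CLOCK_MHZ,
                           sparsity=True)
    print(f"Dense FP16:  {dense:.1f} TFLOPS")
    print(f"With sparsity: {sparse:.1f} TFLOPS")
```

These are theoretical peaks at the rated boost clock. Real workloads rarely hit them, and the sparsity doubling only applies to applications that can actually use it.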
The RT cores receive similar enhancements, with up to double the ray/triangle intersection calculations per clock. The RT cores also support a time variable, which is useful for calculating things like motion blur. All told, Nvidia says the 3080’s new RT cores are 1.7 times faster than the RTX 2080’s, and they can be up to five times as fast for motion blur.
There are plenty of other changes as well. The L1 cache/shared memory capacity and bandwidth have been increased to better feed the cores (8704KB total on the RTX 3080 vs. 4416KB on the RTX 2080), and the L2 cache is also 25% larger than before (5120KB vs. 4096KB). The L1 cache can also be configured as varying amounts of L1 vs. shared memory, depending on the needs of the application. Register file size is also nearly 50% larger (17408KB vs. 11776KB) with the RTX 3080. GA102 can also do concurrent RT + graphics + DLSS (previously, using the RT cores would stop the CUDA cores).
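To put the cache and register figures side by side, here's a small sketch that computes the relative increases from the capacities quoted above. Only the numbers in the text are used; the dictionary layout and helper name are just for illustration.

```python
# On-chip memory capacities quoted in the article, in KB: (new, old).
CAPACITIES_KB = {
    "L1/shared memory": (8704, 4416),
    "L2 cache": (5120, 4096),
    "Register file": (17408, 11776),
}


def percent_larger(new_kb: float, old_kb: float) -> float:
    """How much larger new_kb is than old_kb, as a percentage."""
    return (new_kb / old_kb - 1.0) * 100.0


if __name__ == "__main__":
    for name, (new_kb, old_kb) in CAPACITIES_KB.items():
        gain = percent_larger(new_kb, old_kb)
        print(f"{name}: {new_kb}KB vs. {old_kb}KB ({gain:+.1f}%)")
```

Run as written, this puts the L1/shared memory increase at close to double, which is the largest of the three jumps.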
Finally, the raster operators (ROPs) have been moved out of the memory controllers and into the GPCs. Each GPC has two ROP partitions of eight ROP units each. This provides more flexibility in performance, so where the full GA102 has seven GPCs and up to 112 ROPs total, the RTX 3080 disables two memory controllers but only one GPC and ends up with 96 ROPs. This is more critical for the RTX 3070 / GA104, however, which still has 96 ROPs even though it only has eight memory controllers. Each GPC also includes six TPCs (Texture Processing Clusters), each with eight TMUs (Texture Mapping Units) and a polymorph engine, though Nvidia only enables 34 TPCs for the 3080.
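The unit counts in that paragraph all follow from a handful of per-GPC figures, as the sketch below shows. The ROP and TPC numbers are from the text; the two SMs per TPC and 128 FP32 CUDA cores per SM are Nvidia's published GA102 layout and are an assumption here, since this section doesn't state them.

```python
# Per-GPC building blocks described in the article.
ROP_PARTITIONS_PER_GPC = 2
ROPS_PER_PARTITION = 8
TPCS_PER_GPC = 6
TMUS_PER_TPC = 8

# Assumptions (Nvidia's published GA102 layout, not stated in this section).
SMS_PER_TPC = 2
FP32_CORES_PER_SM = 128


def rop_count(active_gpcs: int) -> int:
    """ROPs now scale with the number of active GPCs."""
    return active_gpcs * ROP_PARTITIONS_PER_GPC * ROPS_PER_PARTITION


def tmu_count(active_tpcs: int) -> int:
    """Texture units for a given number of enabled TPCs."""
    return active_tpcs * TMUS_PER_TPC


def cuda_core_count(active_tpcs: int) -> int:
    """FP32 CUDA cores for a given number of enabled TPCs."""
    return active_tpcs * SMS_PER_TPC * FP32_CORES_PER_SM


if __name__ == "__main__":
    print("Full GA102 ROPs:", rop_count(7))
    print("RTX 3080 ROPs:  ", rop_count(6))
    print("RTX 3080 TMUs:  ", tmu_count(34))
    print("RTX 3080 FP32 CUDA cores:", cuda_core_count(34))
```

With 34 TPCs enabled, the CUDA core result lines up with the 8704 figure in the spec table, which is a decent sign the assumed per-TPC layout is right.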
With the core enhancements out of the way, let’s also quickly discuss the memory subsystem. GA102 supports up to twelve 32-bit memory channels, of which ten are enabled on the RTX 3080. Nvidia teamed up with Micron to use its GDDR6X memory, which uses PAM4 signaling to boost data rates even higher than before. Where the RTX 20-series cards topped out at 15.5 Gbps in the 2080 Super and 14 Gbps in the other RTX cards, GDDR6X runs at 19 Gbps in the RTX 3080. Combined with the 320-bit interface, that yields 760 GBps of bandwidth – a 70% improvement over RTX 2080.
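The bandwidth figure is just data rate times bus width, divided by eight to convert bits to bytes. Here's a minimal sketch using the speeds and bus widths from the spec table; the function names are ours.

```python
CHANNEL_WIDTH_BITS = 32  # each GA102 memory channel is 32 bits wide


def bus_width_bits(active_channels: int) -> int:
    """Total memory bus width for a number of enabled channels."""
    return active_channels * CHANNEL_WIDTH_BITS


def memory_bandwidth_gbytes(data_rate_gbps: float, width_bits: int) -> float:
    """Peak memory bandwidth in GBps (per-pin Gbps x bus width / 8)."""
    return data_rate_gbps * width_bits / 8


if __name__ == "__main__":
    rtx_3080 = memory_bandwidth_gbytes(19, bus_width_bits(10))
    rtx_2080_super = memory_bandwidth_gbytes(15.5, 256)
    rtx_2080 = memory_bandwidth_gbytes(14, 256)
    print(f"RTX 3080:       {rtx_3080:.0f} GBps")
    print(f"RTX 2080 Super: {rtx_2080_super:.0f} GBps")
    print(f"RTX 2080:       {rtx_2080:.0f} GBps")
    print(f"3080 vs. 2080:  {(rtx_3080 / rtx_2080 - 1) * 100:+.0f}%")
```

Enabling all twelve channels would give a 384-bit bus, which is why the ten-channel RTX 3080 lands at 320 bits.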
The RTX 3080’s memory controller has also been improved, with a new feature called EDR: Error Detection and Replay. When the memory detects a failed transmission, rather than crashing or corrupting data, it simply tries again. It will do this until it’s successful, though it’s still possible to cause a crash with memory overclocking. The interesting bit is that with EDR, higher memory clocks might be achievable, but still result in lower performance. That’s because the EDR ends up reducing memory performance when failed transmissions occur. We’ll have more to say on this in the overclocking section.
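To see why a higher memory clock can end up slower with EDR, consider a deliberately simplified toy model: assume every failed transmission is replayed once per failure, and failures are independent, so the useful bandwidth is the raw bandwidth scaled by the success rate. This is our own illustration, not Nvidia's or Micron's description of how EDR is implemented, and the 21 Gbps clock and 12% error rate below are hypothetical inputs, not measurements.

```python
def raw_bandwidth_gbytes(data_rate_gbps: float, width_bits: int = 320) -> float:
    """Peak bandwidth in GBps before any replays."""
    return data_rate_gbps * width_bits / 8


def effective_bandwidth_gbytes(raw_gbytes: float, error_rate: float) -> float:
    """Toy model: a fraction error_rate of transfers is wasted on replays."""
    if not 0.0 <= error_rate < 1.0:
        raise ValueError("error_rate must be in [0, 1)")
    return raw_gbytes * (1.0 - error_rate)


def break_even_error_rate(base_gbps: float, overclock_gbps: float) -> float:
    """Error rate above which the overclock is slower than an error-free base."""
    return 1.0 - base_gbps / overclock_gbps


if __name__ == "__main__":
    stock = effective_bandwidth_gbytes(raw_bandwidth_gbytes(19), 0.0)
    # Hypothetical overclock and error rate, chosen only to illustrate.
    overclocked = effective_bandwidth_gbytes(raw_bandwidth_gbytes(21), 0.12)
    print(f"Stock 19 Gbps, no errors:     {stock:.1f} GBps")
    print(f"21 Gbps with 12% replays:     {overclocked:.1f} GBps")
    print(f"Break-even error rate: {break_even_error_rate(19, 21):.1%}")
```

In this model, the overclock only pays off if the replay rate stays below the break-even value; past that point, the extra clock speed is spent resending data.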