
Nvidia unveils the A100 Tensor Core GPU

14 May 2020


GPU compute beast for DGX A100 AI system

The focus of GTC 2020 was Nvidia's new A100 Tensor Core GPU, which is based on the new Ampere-architecture GA100 GPU and will be part of Nvidia's new DGX A100 AI system. As the direct successor to the V100 accelerator, the A100 brings both significant compute performance improvements and new features.

As we wrote earlier, the A100 Tensor Core GPU is based on the GA100 GPU, and compared to the fully enabled chip it comes with one fewer GPC, leaving it with a total of 108 SMs and 6912 FP32 CUDA cores, as well as 432 Tensor Cores and 40GB of HBM2 on a 5120-bit memory interface with 1.6TB/s of memory bandwidth.

The A100 Tensor Core GPU is a compute beast

The story of the day is the A100 GPU accelerator, or as Nvidia calls it, the A100 Tensor Core GPU. Designed as a successor to the V100, the Ampere-based A100 GPU brings several new features and delivers significantly higher performance in HPC, AI, and data analytics workloads.

According to Nvidia, it also provides strong scaling for GPU compute and AI deep learning (DL) applications in single- and multi-GPU workstations, servers, clusters, cloud data centers, edge systems, and supercomputers.


The A100, which is based on the GA100 GPU, runs at up to 1410MHz and packs seven enabled GPCs, for a total of 108 SMs and 6912 FP32 CUDA cores per GPU.

It also comes with 432 third-generation Tensor Cores and features five HBM2 stacks paired with ten 512-bit memory controllers, leaving it with 40GB of 1215MHz (double data rate) HBM2 on a 5120-bit memory interface, for a total of 1.6TB/s of memory bandwidth.
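That headline figure checks out: a 5120-bit interface moves 640 bytes per transfer, and at 1215MHz with two transfers per clock that works out to 640 bytes x 2 x 1.215GHz ≈ 1,555GB/s, which Nvidia rounds up to 1.6TB/s.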

The GPU packs 40MB (40960KB) of L2 cache, along with up to 164KB of shared memory per SM.

The Nvidia A100 comes in an SXM4 board form factor and has a TDP of 400W. It also carries the third generation of Nvidia's high-speed NVLink interconnect with a total of 12 links, giving it 600GB/s of total bandwidth, double that of the V100. The A100 also supports PCI Express Gen 4, offering 31.5GB/s of bandwidth.
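The NVLink arithmetic is straightforward: each third-generation link carries 25GB/s in each direction, or 50GB/s bidirectional, so 12 links x 50GB/s = 600GB/s, against the V100's six links and 300GB/s.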

When it comes to compute performance, the A100 does not bring a dramatic improvement in standard single-precision (FP32) throughput, peaking at 19.5 TFLOPs against 15.7 TFLOPs on the V100, while double-precision (FP64) compute peaks at 9.7 TFLOPs (7.8 TFLOPs for the V100). This is where the Tensor Cores come in.
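Those peaks follow directly from the core counts and clocks: 6912 FP32 cores x 2 FLOPs per clock (one fused multiply-add) x 1.41GHz ≈ 19.5 TFLOPs, and half as many FP64 units yield the 9.7 TFLOPs double-precision figure.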


Tensor Cores are what matters

The Volta-based V100 was only capable of pushing FP16 Tensor computations, peaking at 125 TFLOPs, far behind the 312 TFLOPs the A100 can achieve. Unlike the V100's 640 Tensor Cores, the 432 Tensor Cores in the A100 GPU can handle every format, including FP16, BF16, TF32, FP64, INT8, INT4, and binary.

The performance gain is incredible, with the A100 hitting 19.5 TFLOPs for double-precision (FP64), 156 TFLOPs for TF32, and 312 TFLOPs for BF16/FP16 Tensor compute.
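For developers, the Tensor Cores are reached through CUDA's warp-level matrix (WMMA) API, in which a whole warp cooperatively multiplies small tiles. Below is a minimal, illustrative sketch of an FP16-input, FP32-accumulate tile multiply of the kind those throughput figures describe; it is our own example, not Nvidia sample code, and assumes CUDA 11 with an sm_80 target.

    #include <mma.h>
    #include <cuda_fp16.h>

    using namespace nvcuda;

    // One warp computes a single 16x16 tile of D = A x B on the Tensor
    // Cores, with FP16 inputs and FP32 accumulation.
    __global__ void tile_mma(const half *a, const half *b, float *d)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

        wmma::fill_fragment(acc_frag, 0.0f);                 // start from C = 0
        wmma::load_matrix_sync(a_frag, a, 16);               // leading dimension 16
        wmma::load_matrix_sync(b_frag, b, 16);
        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // one Tensor Core op
        wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
    }

    // Launched with a single warp: tile_mma<<<1, 32>>>(dev_a, dev_b, dev_d);

Ampere adds TF32 and BF16 variants of the same fragments, which is how existing FP32 code paths can pick up the Tensor Core speedup with minimal changes.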


Plenty of other improvements, including the Multi-Instance GPU (MIG)

In addition to the architectural improvements, third-generation Tensor Cores, more VRAM, a larger L1, and 40MB of L2 cache, all of which add up to an impressive compute performance uplift, the A100 GPU brings several other new features. Chief among them is Multi-Instance GPU (MIG), which allows the A100 Tensor Core GPU to be securely partitioned into as many as seven separate GPU instances for CUDA applications.

With MIG, each instance's processors have separate and isolated paths through the entire memory system, with uniquely assigned on-chip crossbar ports, L2 cache banks, memory controllers, and DRAM address buses. This allows better utilization of the GPU hardware while isolating different clients, such as VMs, containers, and processes.
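From a CUDA application's point of view, a MIG slice simply looks like a smaller GPU. A minimal sketch of how a program could confirm which slice it landed on, assuming the card has already been partitioned by an administrator (typically via nvidia-smi):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        // A full A100 reports 108 SMs and about 40GB; inside a one-seventh
        // MIG instance the visible SM count and memory shrink accordingly,
        // with the isolation enforced in hardware.
        printf("Device: %s\n", prop.name);
        printf("SMs:    %d\n", prop.multiProcessorCount);
        printf("Memory: %.1f GB\n", prop.totalGlobalMem / 1e9);
        return 0;
    }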


In addition to MIG, the A100 brings the aforementioned third generation of Nvidia's high-speed NVLink interconnect with 600GB/s of total bandwidth, support for NVIDIA Magnum IO and Mellanox interconnect solutions, PCIe Gen 4 with single root input/output virtualization (SR-IOV), improved error and fault detection, isolation, and containment, a new asynchronous copy instruction paired with hardware-accelerated asynchronous barriers in shared memory, task graph acceleration, and more.
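That asynchronous copy path is exposed in CUDA 11 through the cooperative groups memcpy_async call, which streams data from global to shared memory without staging it through the register file. A rough sketch of the pattern, under the assumption of CUDA 11 on an A100:

    #include <cooperative_groups.h>
    #include <cooperative_groups/memcpy_async.h>

    namespace cg = cooperative_groups;

    __global__ void stage_and_process(const float *global_in, size_t count)
    {
        extern __shared__ float staged[];
        cg::thread_block block = cg::this_thread_block();

        // Hardware-accelerated copy from global to shared memory; the
        // block is free to do unrelated work while the copy is in flight.
        cg::memcpy_async(block, staged, global_in, count * sizeof(float));

        cg::wait(block);  // asynchronous barrier: block until the data lands

        // ... operate on staged[] ...
    }

    // Launched with dynamic shared memory sized to fit the staged data:
    // stage_and_process<<<grid, threads, count * sizeof(float)>>>(in, count);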

The A100 GPU comes in the DGX A100 AI System

The A100 GPU will be available in Nvidia's DGX A100 AI system, which features eight A100 Tensor Core GPUs, providing 5 PFLOPs of AI performance along with 320GB of GPU memory and 12.4TB/s of aggregate memory bandwidth. It also includes six NVIDIA NVSwitch interconnect fabrics, nine Mellanox ConnectX-6 HDR 200Gb/s network interfaces, 15TB of Gen4 NVMe internal storage, and the NVIDIA DGX software stack.
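Those aggregate figures are simply eight times the per-GPU numbers: 8 x 40GB = 320GB of HBM2, and 8 x ~1.56TB/s ≈ 12.4TB/s of bandwidth (the single-GPU 1.6TB/s is a round-up of roughly 1,555GB/s).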

The Nvidia DGX A100 should be immediately available and shipping worldwide, with the first customer being the U.S. Department of Energy's (DOE) Argonne National Laboratory. It starts at $199,000 and will be sold through NVIDIA Partner Network resellers worldwide.
