CONTACT US: icc@iitr.ac.in | ckausik@iitr.ac.in

Introduction

The NVIDIA DGX-1 is a deep learning system, architected for high throughput and high interconnect bandwidth to maximize neural network training performance. The core of the system is a complex of eight Tesla V100 GPUs connected in a hybrid cube-mesh NVLink network topology. In addition to the eight GPUs, DGX-1 includes two CPUs for boot, storage management, and deep learning framework coordination. DGX-1 is built into a three-rack-unit (3U) enclosure that provides power, cooling, networking, multi-system interconnect, and an SSD file system cache, balanced to optimize throughput and deep learning training time.

NVLink is an energy-efficient, high-bandwidth interconnect that enables NVIDIA GPUs to connect to peer GPUs or other devices within a node at an aggregate bidirectional bandwidth of up to 300 GB/s per GPU (six links, each 25 GB/s per direction): over nine times the bandwidth of a PCIe Gen3 x16 connection. The NVLink interconnect and the DGX-1 architecture's hybrid cube-mesh GPU network topology enable the highest achievable data-exchange bandwidth among a group of eight Tesla V100 GPUs.


HARDWARE OVERVIEW

GPUs: 8x Tesla V100
GPU Memory: 256 GB total (32 GB per GPU)
CPUs: Dual 20-core Intel Xeon E5-2698 v4, 2.2 GHz
NVIDIA CUDA Cores: 40,960
NVIDIA Tensor Cores (V100-based systems): 5,120
System Memory: 512 GB DDR4-2133 RDIMM
Storage: 4x 1.92 TB SSD in RAID 0
Network: Dual 10 GbE
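
The per-GPU figures above can be cross-checked on the system with a short device query. Below is a minimal sketch using the standard CUDA runtime API; nothing here is DGX-specific, and the printed values depend on the installed hardware and driver.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);          // expect 8 on DGX-1
        printf("Detected %d GPU(s)\n", count);
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            printf("GPU %d: %s, %d SMs, %.1f GB memory, compute capability %d.%d\n",
                   i, prop.name, prop.multiProcessorCount,
                   prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0),
                   prop.major, prop.minor);
        }
        return 0;
    }

Each Tesla V100 reports 80 SMs and compute capability 7.0; at 64 FP32 cores and 8 Tensor Cores per SM, that is 5,120 CUDA cores and 640 Tensor Cores per GPU, consistent with the system totals of 40,960 and 5,120 above.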

Performance: 1 PFLOPS (Mixed Precision)

TESLA V100 GPU (NVLink) Performance
Single V100 GPU: Single Precision up to 15.7 TFLOPS | Double Precision up to 7.8 TFLOPS | Deep Learning (Mixed Precision) up to 125 TFLOPS
Total (8x V100 GPUs): Single Precision up to 125.6 TFLOPS | Double Precision up to 62.4 TFLOPS | Deep Learning (Mixed Precision) up to 1,000 TFLOPS (1 PFLOPS)
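
For reference, the headline figure follows directly from the Tensor Core throughput described below: assuming the Tesla V100 SXM2 boost clock of roughly 1,530 MHz, 640 Tensor Cores × 64 FMA per clock × 2 FLOPs per FMA × ~1.53 GHz ≈ 125 TFLOPS per GPU, and 8 × 125 TFLOPS ≈ 1,000 TFLOPS, i.e. 1 PFLOPS for the full system.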


DGX-1 NVLink Network Topology for Efficient Application Scaling

DGX-1 includes eight NVIDIA Tesla V100 accelerators, providing the highest compute density available in an air-cooled 3U chassis. Application scaling across this many highly parallel GPUs can be hampered by the PCIe interconnect alone. NVLink provides the communications performance needed to achieve good scaling on deep learning and other applications. Each Tesla V100 GPU has six NVLink connection points, each providing a point-to-point connection to another GPU at a peak bandwidth of 25 GB/s in each direction. Multiple NVLink connections can be aggregated, multiplying the available interconnect bandwidth between a given pair of GPUs. The result is that NVLink provides a flexible interconnect that can be used to build a variety of network topologies among multiple GPUs. V100 also supports 16 lanes of PCIe 3.0; in DGX-1, these lanes connect the GPUs to the CPUs and to the high-speed InfiniBand (IB) network interface cards.
The design of the NVLink network topology for DGX-1 aims to optimize a number of factors, including the bandwidth achievable for a variety of point-to-point and collective communications primitives, the ability to support a variety of common communication patterns, and the ability to maintain performance when only a subset of the GPUs is utilized.
The hybrid cube-mesh topology can be thought of as a cube with GPUs at its corners and with all twelve edges connected through NVLink (some edges have two NVLink connections), and with two of the six faces having their diagonals connected as well. The topology can also be thought of as three interwoven rings of single NVLink connections.
The cube-mesh topology provides the highest bandwidth of any 8-GPU NVLink topology for multiple collective communication primitives, including broadcast, gather, all-reduce, and all-gather, which are important to deep learning.
Figure: DGX-1 uses an 8-GPU hybrid cube-mesh interconnection network topology. The corners of the mesh-connected faces of the cube are connected to the PCIe tree network, which also connects to the CPUs and NICs.
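
To see how this topology is exposed to software, the sketch below uses the standard CUDA runtime API to query and enable peer-to-peer access between GPU pairs; on DGX-1, pairs joined by an NVLink edge of the cube mesh can use the direct high-bandwidth path, while traffic between other pairs typically goes through PCIe and the CPUs. The device indices here are illustrative.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);

        // Report which GPU pairs can address each other's memory directly.
        for (int i = 0; i < count; ++i) {
            for (int j = 0; j < count; ++j) {
                if (i == j) continue;
                int canAccess = 0;
                cudaDeviceCanAccessPeer(&canAccess, i, j);
                printf("GPU %d -> GPU %d: peer access %s\n", i, j, canAccess ? "yes" : "no");
            }
        }

        // Illustrative: enable peer access from GPU 0 to GPU 1 so that
        // cudaMemcpyPeer and direct loads/stores between them use the direct path.
        int canAccess01 = 0;
        cudaDeviceCanAccessPeer(&canAccess01, 0, 1);
        if (canAccess01) {
            cudaSetDevice(0);
            cudaDeviceEnablePeerAccess(1, 0);   // second argument is reserved and must be 0
        }
        return 0;
    }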

Tensor Core

Tesla V100's Tensor Cores are programmable matrix-multiply-and-accumulate units that can deliver up to 125 Tensor TFLOPS for training and inference applications. The Tesla V100 GPU contains 640 Tensor Cores: 8 per SM (Streaming Multiprocessor). Tensor Cores and their associated data paths are custom-crafted to dramatically increase floating-point compute throughput at only modest area and power costs. Clock gating is used extensively to maximize power savings.
Each Tensor Core provides a 4x4x4 matrix processing array which performs the operation D = A * B + C, where A, B, C and D are 4x4 matrices. The matrix multiply inputs A and B are FP16 matrices, while the accumulation matrices C and D may be FP16 or FP32 matrices.
Each Tensor Core performs 64 floating point FMA mixed-precision operations per clock (FP16 input multiply with full-precision product and FP32 accumulate) and 8 Tensor Cores in an SM perform a total of 1024 floating point operations per clock. This is a dramatic 8X increase in throughput for deep learning applications per SM compared to Pascal GP100 using standard FP32 operations, resulting in a total 12X increase in throughput for the Volta V100 GPU compared to the Pascal P100 GPU. Tensor Cores operate on FP16 input data with FP32 accumulation. The FP16 multiply results in a full-precision result that is accumulated in FP32 operations with the other products in a given dot product for a 4x4x4 matrix multiply.
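
Tensor Cores are exposed to CUDA C++ programmers through the warp-level WMMA (warp matrix multiply-accumulate) API, which operates on 16x16x16 tiles built from the 4x4x4 hardware operation. Below is a minimal sketch with FP16 inputs and FP32 accumulation as described above; the pointers and leading dimensions are illustrative, and the kernel assumes 16x16 row-major matrices already resident in device memory.

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // One warp computes D = A * B + C for a single 16x16x16 tile:
    // A and B are FP16, the accumulator C/D is FP32.
    __global__ void wmma_tile(const half *A, const half *B, const float *C, float *D) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

        wmma::load_matrix_sync(a_frag, A, 16);                      // leading dimension 16
        wmma::load_matrix_sync(b_frag, B, 16);
        wmma::load_matrix_sync(acc_frag, C, 16, wmma::mem_row_major);

        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);         // executed on Tensor Cores

        wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
    }

Such a kernel is compiled for compute capability 7.0 or later (nvcc -arch=sm_70) and launched with a single 32-thread warp, for example wmma_tile<<<1, 32>>>(A, B, C, D). In practice most users never write these kernels directly: cuBLAS, cuDNN, and the framework containers from NGC dispatch FP16 GEMMs and convolutions to Tensor Cores automatically.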

NVIDIA GPU Cloud

NGC (NVIDIA GPU Cloud) is the hub for GPU-optimized software for deep learning, machine learning, and HPC. It takes care of the plumbing so that data scientists, developers, and researchers can focus on building solutions, gathering insights, and delivering business value.
NGC provides a range of options that meet the needs of data scientists, developers, and researchers with various levels of AI expertise: build models from scratch with framework containers, leverage model-training scripts and pre-trained models to speed up projects, or use complete workflows for the fastest AI implementation. Whichever option you pick, you get a faster start and a faster time-to-solution.

 TRAINING SCHEDULE

 Date: 6th August 2019 | Venue: LHC-002

10:00-10:45  Introduction to DGX-1 | Kausik C, IITR; Mandeep Kumar, Locuz | Slides and demonstration
10:45-11:15  Introduction to NGC | Mandeep Kumar, Locuz | Slides and demonstration
11:15-13:00  Running applications using NVIDIA Docker | Mandeep Kumar, Locuz | Slides and demonstration
13:00-14:00  Lunch break
14:00-15:00  Train a model: image-classification workload using DIGITS | Mandeep Kumar, Locuz | Demonstration
15:00-16:00  Demonstration of TensorFlow/MXNet containers | Mandeep Kumar, Locuz | Demonstration
16:00-17:00  Running containers using the SLURM workload manager | Mandeep Kumar, Locuz | Demonstration
17:00        Q&A and conclusion | Kausik C, IITR; Mandeep Kumar, Locuz

Docker

What is Docker?

A Docker container is a mechanism for bundling a Linux application with all of its libraries, data files, and environment variables so that the execution environment is always the same, on whatever Linux system it runs and between instances on the same host.

Why NVIDIA Docker?

Docker containers are platform- and hardware-agnostic, so plain Docker cannot by itself expose specialized hardware such as NVIDIA GPUs, which require kernel modules and matching user-space driver libraries inside the container. NVIDIA Docker (nvidia-docker, now the NVIDIA Container Toolkit) addresses this by mounting the host's driver libraries and GPU device files into the container at launch, so the GPU-accelerated images distributed through NGC remain driver-agnostic and run unmodified on systems such as DGX-1.