CONTACT US: icc@iitr.ac.in | ckausik@iitr.ac.in

Introduction

The NVIDIA DGX-1 is a deep learning system, architected for high throughput and high interconnect bandwidth to maximize neural network training performance. The core of the system is a complex of eight Tesla V100 GPUs connected in a hybrid cube-mesh NVLink network topology. In addition to the eight GPUs, DGX-1 includes two CPUs for boot, storage management, and deep learning framework coordination. DGX-1 is built into a three-rack-unit (3U) enclosure that provides power, cooling, networking, multi-system interconnect, and an SSD file system cache, balanced to optimize throughput and deep learning training time.

NVLink is an energy-efficient, high-bandwidth interconnect that enables NVIDIA GPUs to connect to peer GPUs or other devices within a node at an aggregate bidirectional bandwidth of up to 300 GB/s per GPU (six links, each 25 GB/s per direction): over nine times the bandwidth of a PCIe Gen3 x16 connection. The NVLink interconnect and the DGX-1 architecture's hybrid cube-mesh GPU network topology enable the highest achievable data-exchange bandwidth among a group of eight Tesla V100 GPUs.


HARDWARE OVERVIEW

GPUs: 8x Tesla V100
GPU Memory: 256 GB total (32 GB per GPU)
CPUs: Dual 20-core Intel Xeon E5-2698 v4, 2.2 GHz
NVIDIA CUDA Cores: 40,960
NVIDIA Tensor Cores (V100-based systems): 5,120
System Memory: 512 GB DDR4-2133 RDIMM
Storage: 4x 1.92 TB SSD in RAID 0
Network: Dual 10 GbE
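
The per-GPU figures above can be cross-checked on the system with a short device query. Below is a minimal sketch using the standard CUDA runtime API; nothing here is DGX-specific, and the printed values depend on the installed hardware and driver.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);          // expect 8 on DGX-1
        printf("Detected %d GPU(s)\n", count);
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            printf("GPU %d: %s, %d SMs, %.1f GB memory, compute capability %d.%d\n",
                   i, prop.name, prop.multiProcessorCount,
                   prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0),
                   prop.major, prop.minor);
        }
        return 0;
    }

Each Tesla V100 reports 80 SMs and compute capability 7.0; at 64 FP32 cores and 8 Tensor Cores per SM, that is 5,120 CUDA cores and 640 Tensor Cores per GPU, consistent with the system totals of 40,960 and 5,120 above.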

Performance: 1 PFLOPS (Mixed Precision)

TESLA V100 GPU (NVLink) Performance
Single V100 GPU: Single Precision up to 15.7 TFLOPS | Double Precision up to 7.8 TFLOPS | Deep Learning (Mixed Precision) up to 125 TFLOPS
Total (8x V100 GPUs): Single Precision up to 125.6 TFLOPS | Double Precision up to 62.4 TFLOPS | Deep Learning (Mixed Precision) up to 1,000 TFLOPS (1 PFLOPS)
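
For reference, the headline figure follows directly from the Tensor Core throughput described below: assuming the Tesla V100 SXM2 boost clock of roughly 1,530 MHz, 640 Tensor Cores × 64 FMA per clock × 2 FLOPs per FMA × ~1.53 GHz ≈ 125 TFLOPS per GPU, and 8 × 125 TFLOPS ≈ 1,000 TFLOPS, i.e. 1 PFLOPS for the full system.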


DGX-1 NVLink Network Topology for Efficient Application Scaling

DGX-1 includes eight NVIDIA Tesla V100 accelerators, providing the highest compute density available in an air-cooled 3U chassis. Application scaling across this many highly parallel GPUs can be hampered by the PCIe interconnect alone. NVLink provides the communications performance needed to achieve good scaling on deep learning and other applications. Each Tesla V100 GPU has six NVLink connection points, each providing a point-to-point connection to another GPU at a peak bandwidth of 25 GB/s in each direction. Multiple NVLink connections can be aggregated, multiplying the available interconnect bandwidth between a given pair of GPUs. The result is that NVLink provides a flexible interconnect that can be used to build a variety of network topologies among multiple GPUs. V100 also supports 16 lanes of PCIe 3.0; in DGX-1, these lanes connect the GPUs to the CPUs and to the high-speed InfiniBand (IB) network interface cards.
The design of the NVLink network topology for DGX-1 aims to optimize a number of factors, including the bandwidth achievable for a variety of point-to-point and collective communications primitives, the ability to support a variety of common communication patterns, and the ability to maintain performance when only a subset of the GPUs is utilized.
The hybrid cube-mesh topology can be thought of as a cube with GPUs at its corners and with all twelve edges connected through NVLink (some edges have two NVLink connections), and with two of the six faces having their diagonals connected as well. The topology can also be thought of as three interwoven rings of single NVLink connections.
The cube-mesh topology provides the highest bandwidth of any 8-GPU NVLink topology for multiple collective communication primitives, including broadcast, gather, all-reduce, and all-gather, which are important to deep learning.
Figure: DGX-1 uses an 8-GPU hybrid cube-mesh interconnection network topology. The corners of the mesh-connected faces of the cube are connected to the PCIe tree network, which also connects to the CPUs and NICs.
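
To see how this topology is exposed to software, the sketch below uses the standard CUDA runtime API to query and enable peer-to-peer access between GPU pairs; on DGX-1, pairs joined by an NVLink edge of the cube mesh can use the direct high-bandwidth path, while traffic between other pairs typically goes through PCIe and the CPUs. The device indices here are illustrative.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);

        // Report which GPU pairs can address each other's memory directly.
        for (int i = 0; i < count; ++i) {
            for (int j = 0; j < count; ++j) {
                if (i == j) continue;
                int canAccess = 0;
                cudaDeviceCanAccessPeer(&canAccess, i, j);
                printf("GPU %d -> GPU %d: peer access %s\n", i, j, canAccess ? "yes" : "no");
            }
        }

        // Illustrative: enable peer access from GPU 0 to GPU 1 so that
        // cudaMemcpyPeer and direct loads/stores between them use the direct path.
        int canAccess01 = 0;
        cudaDeviceCanAccessPeer(&canAccess01, 0, 1);
        if (canAccess01) {
            cudaSetDevice(0);
            cudaDeviceEnablePeerAccess(1, 0);   // second argument is reserved and must be 0
        }
        return 0;
    }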

Tensor Core

Tesla V100's Tensor Cores are programmable matrix-multiply-and-accumulate units that can deliver up to 125 Tensor TFLOPS for training and inference applications. The Tesla V100 GPU contains 640 Tensor Cores: 8 per SM (Streaming Multiprocessor). Tensor Cores and their associated data paths are custom-crafted to dramatically increase floating-point compute throughput at only modest area and power costs. Clock gating is used extensively to maximize power savings.
Each Tensor Core provides a 4x4x4 matrix processing array which performs the operation D = A * B + C, where A, B, C and D are 4x4 matrices. The matrix multiply inputs A and B are FP16 matrices, while the accumulation matrices C and D may be FP16 or FP32 matrices.
Each Tensor Core performs 64 floating point FMA mixed-precision operations per clock (FP16 input multiply with full-precision product and FP32 accumulate) and 8 Tensor Cores in an SM perform a total of 1024 floating point operations per clock. This is a dramatic 8X increase in throughput for deep learning applications per SM compared to Pascal GP100 using standard FP32 operations, resulting in a total 12X increase in throughput for the Volta V100 GPU compared to the Pascal P100 GPU. Tensor Cores operate on FP16 input data with FP32 accumulation. The FP16 multiply results in a full-precision result that is accumulated in FP32 operations with the other products in a given dot product for a 4x4x4 matrix multiply.
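
Tensor Cores are exposed to CUDA C++ programmers through the warp-level WMMA (warp matrix multiply-accumulate) API, which operates on 16x16x16 tiles built from the 4x4x4 hardware operation. Below is a minimal sketch with FP16 inputs and FP32 accumulation as described above; the pointers and leading dimensions are illustrative, and the kernel assumes 16x16 row-major matrices already resident in device memory.

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // One warp computes D = A * B + C for a single 16x16x16 tile:
    // A and B are FP16, the accumulator C/D is FP32.
    __global__ void wmma_tile(const half *A, const half *B, const float *C, float *D) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

        wmma::load_matrix_sync(a_frag, A, 16);                      // leading dimension 16
        wmma::load_matrix_sync(b_frag, B, 16);
        wmma::load_matrix_sync(acc_frag, C, 16, wmma::mem_row_major);

        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);         // executed on Tensor Cores

        wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
    }

Such a kernel is compiled for compute capability 7.0 or later (nvcc -arch=sm_70) and launched with a single 32-thread warp, for example wmma_tile<<<1, 32>>>(A, B, C, D). In practice most users never write these kernels directly: cuBLAS, cuDNN, and the framework containers from NGC dispatch FP16 GEMMs and convolutions to Tensor Cores automatically.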

NVIDIA GPU Cloud

NGC (NVIDIA GPU Cloud) is the hub for GPU-optimized software for deep learning, machine learning, and HPC. It takes care of the plumbing so that data scientists, developers, and researchers can focus on building solutions, gathering insights, and delivering business value.
NGC provides a range of options that meet the needs of data scientists, developers, and researchers with various levels of AI expertise: build models from scratch with framework containers, leverage model-training scripts and pre-trained models to speed up projects, or use complete workflows for the fastest AI implementation. Whichever option you pick, you get a faster start and a faster time-to-solution.

 TRAINING SCHEDULE

 Date: 6th August 2019 | Venue: LHC-002

10:00-10:45  Introduction to DGX-1 | Kausik C, IITR; Mandeep Kumar, Locuz | Slides and demonstration
10:45-11:15  Introduction to NGC | Mandeep Kumar, Locuz | Slides and demonstration
11:15-13:00  Running applications using NVIDIA Docker | Mandeep Kumar, Locuz | Slides and demonstration
13:00-14:00  Lunch break
14:00-15:00  Train a model: image-classification workload using DIGITS | Mandeep Kumar, Locuz | Demonstration
15:00-16:00  Demonstration of TensorFlow/MXNet containers | Mandeep Kumar, Locuz | Demonstration
16:00-17:00  Running containers using the SLURM workload manager | Mandeep Kumar, Locuz | Demonstration
17:00        Q&A and conclusion | Kausik C, IITR; Mandeep Kumar, Locuz

Docker

What is Docker?

A Docker container is a mechanism for bundling a Linux application with all of its libraries, data files, and environment variables so that the execution environment is always the same, on whatever Linux system it runs and between instances on the same host.

Why NVIDIA Docker?

Docker containers are platform- and hardware-agnostic, so plain Docker cannot by itself expose specialized hardware such as NVIDIA GPUs, which require kernel modules and matching user-space driver libraries inside the container. NVIDIA Docker (nvidia-docker, now the NVIDIA Container Toolkit) addresses this by mounting the host's driver libraries and GPU device files into the container at launch, so the GPU-accelerated images distributed through NGC remain driver-agnostic and run unmodified on systems such as DGX-1.