The combination of the state-of-the-art NVIDIA GPUs, Mellanox's InfiniBand, GPUDirect RDMA and NCCL to train neural networks has already become a de-facto standard when scaling out deep learning frameworks, such as Caffe, Caffe2, Chainer, MXNet, TensorFlow, and PyTorch. With the Mellanox SHARP technology and HDR InfiniBand, deep learning training's data aggregation operations can be offloaded and accelerated by the InfiniBand network, resulting in improving their performance by two times.
The joint effort with NVIDIA and testing performed in Mellanox's performance labs, using the Mellanox HDR InfiniBand Quantum connecting four system hosts, each with eight NVIDIA V100 Tensor Core GPUs with NVLink interconnect technology and a single ConnectX-6 HDR adapter per host, have achieved an effective reduction bandwidth of 19.6GB/s by integrating SHARP's native streaming aggregation capability with NVIDIA's latest NCCL 2.4 library, which now takes full advantage of the bi-directional bandwidth available from the Mellanox interconnect. This implementation is effectively two times higher bandwidth than NVIDIA's current tree-based implementation using the same hardware configuration.
In the more common setup for this configuration, four HCAs in each system host are used for balanced performance across a variety of workloads where the initial SHARP and NCCL results yielded an expected 70.3GB/s. For more densely populated GPU-based systems, like NVIDIA DGX-2, which houses 16 NVIDIA V100 Tensor Core GPUs with NVLink in each system node, the in-network capabilities and available bidirectional bandwidth of the Mellanox fabric can be fully leveraged.
"Our long-standing collaboration with NVIDIA has again delivered a robust solution that takes full advantage of the best-of-breed capabilities from Mellanox InfiniBand, including GPUDirect RDMA and now extending in-network computing to NCCL, which delivers two times better performance for AI", stated Gilad Shainer, Vice President of Marketing at Mellanox Technologies. "HDR InfiniBand in-network computing acceleration engines, including the SHARP technology, provide the highest performance and scalability for HPC and AI workloads."
"Mellanox solutions amplify NVIDIA's unmatched CUDA-X acceleration libraries using NCCL, our open source collective communication library", stated Ian Buck, vice president and general manager of Accelerated Computing at NVIDIA. "Together, we offer solutions that ensure the most demanding AI applications in the data centre benefit from cutting-edge performance and scaling efficiency."