The challenges of large-scale AI training

Mon, 17th Dec 2018


Baidu has released X-MAN3.0, a super AI computing platform optimised for deep neural networks. Jointly developed by Inspur and Baidu, the X-MAN3.0 solution can achieve 2,000 trillion deep neural network operations per second.

Innovation in computing technologies, alongside data and algorithms, is one of the most important factors that have propelled the advancement of deep learning.

As one of Baidu's most important strategic partners in the field of data center computing and storage infrastructure, Inspur has been working with Baidu to develop AI-specific computing platforms, including X-MAN3.0, a specialised platform for ultra-large-scale AI training.

The first generation of the product was released in 2016, and the platform is now in its third generation.

The 8U X-MAN3.0 consists of two independent 4U AI modules, each supporting 8 of the latest NVIDIA V100 Tensor Core GPUs.

The two AI modules are connected by high-speed interconnect backplanes with 48 NVLink links. The GPUs can communicate directly through NVIDIA NVSwitch, and the overall unidirectional bandwidth among all GPUs is up to 2,400 GB/s.

X-MAN3.0 is also equipped with two levels of PCIe switches supporting interconnections among CPUs, AI accelerators and other IO.
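As a rough sanity check, the headline figures can be reproduced with simple arithmetic. The sketch below is one plausible reading, not a breakdown confirmed by Baidu or Inspur: the per-GPU numbers (125 teraFLOPS peak tensor throughput, six NVLink links at 25 GB/s per direction) are assumptions taken from NVIDIA's published V100 specifications rather than from this article.

```python
# Figures stated in the article.
MODULES = 2
GPUS_PER_MODULE = 8

# Assumed per-GPU figures from NVIDIA's public V100 specs (not from the article).
V100_PEAK_TENSOR_TERA_OPS = 125      # peak mixed-precision tensor throughput
NVLINK_LINKS_PER_GPU = 6
NVLINK_GB_PER_S_ONE_WAY = 25         # per link, per direction


def platform_totals():
    """Aggregate peak compute and one-way NVLink bandwidth across all GPUs."""
    gpus = MODULES * GPUS_PER_MODULE
    return {
        "gpus": gpus,
        "peak_tera_ops": gpus * V100_PEAK_TENSOR_TERA_OPS,
        "unidirectional_gb_per_s": gpus * NVLINK_LINKS_PER_GPU * NVLINK_GB_PER_S_ONE_WAY,
    }


print(platform_totals())
# {'gpus': 16, 'peak_tera_ops': 2000, 'unidirectional_gb_per_s': 2400}
```

Under these assumptions, 16 GPUs give 2,000 trillion operations per second and 2,400 GB/s of aggregate unidirectional bandwidth, matching the quoted figures. These are theoretical peaks, not measured training performance.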

The relationship between CPU and GPU can be set in a software-defined manner, so as to flexibly support diversified AI workloads without system bottlenecks. This is a significant difference between X-MAN3.0 and other products in the industry.

Super AI computing platform optimised for deep neural networks

Large-scale and distributed training is bringing increasing challenges for computing platforms. To improve the accuracy of AI models, the average size of training datasets has increased by more than 300 times.

By the end of 2017, the number of labelled images in Google's Open Images dataset had reached 9 million. Model complexity has grown so quickly that some Internet companies' AI models have reached 100 billion parameters.

This surge in data requires users to deploy larger GPU computing platforms with a greater scale-up capability to solve the increasing challenges in communication between GPUs.

For example, the three-dimensional Fast Fourier Transform, an algorithm commonly used in AI models, requires one global communication for every three operations in GPU parallel processing, which makes it heavily dependent on the communication bandwidth between GPUs.

With NVIDIA NVSwitch, the platform can supposedly alleviate communication bottlenecks, delivering greater-than-expected value to Internet companies' ultra-large-scale AI training.
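To illustrate why a parallel 3D FFT leans on inter-GPU bandwidth, here is a minimal sketch that simulates a slab decomposition in plain Python. It is one common way to distribute a 3D FFT and only one reading of the "one global communication for every three operations" remark; it is not the implementation used on X-MAN3.0. The function names and the list-based "devices" are hypothetical, a naive O(n²) DFT stands in for a real FFT library, and the cube size is assumed to be divisible by the device count.

```python
import cmath


def dft(xs):
    """Naive 1D discrete Fourier transform (stand-in for a real FFT)."""
    n = len(xs)
    return [sum(x * cmath.exp(-2j * cmath.pi * j * k / n) for j, x in enumerate(xs))
            for k in range(n)]


def transpose(plane):
    return [list(col) for col in zip(*plane)]


def dft_rows(plane):
    """Transform along the last axis of a 2D list."""
    return [dft(row) for row in plane]


def dft_cols(plane):
    """Transform along the first axis of a 2D list."""
    return transpose(dft_rows(transpose(plane)))


def distributed_fft3d(cube, devices):
    """3D DFT of cube[x][y][z], simulating `devices` workers with x-slabs.

    Returns (out[kx][ky][kz], number_of_global_exchanges).
    """
    n = len(cube)
    chunk = n // devices
    # Each simulated device owns a slab of consecutive x-planes.
    slabs = [cube[d * chunk:(d + 1) * chunk] for d in range(devices)]

    # Operations 1 and 2: 1D passes along z, then y. Purely local.
    slabs = [[dft_cols(dft_rows(plane)) for plane in slab] for slab in slabs]

    # Global communication: an all-to-all exchange so that each device
    # holds every x value for its own range of y. On real hardware every
    # device sends data to every other device at this point.
    full = [plane for slab in slabs for plane in slab]          # full[x][y][z]
    by_y = [[[full[x][y] for x in range(n)]
             for y in range(d * chunk, (d + 1) * chunk)]
            for d in range(devices)]                            # by_y[d][y][x][z]
    exchanges = 1

    # Operation 3: 1D pass along x, local again after the exchange.
    by_y = [[dft_cols(plane) for plane in slab] for slab in by_y]

    # Reassemble into out[kx][ky][kz] for inspection.
    flat = [plane for slab in by_y for plane in slab]           # flat[ky][kx][kz]
    out = [[flat[y][x] for y in range(n)] for x in range(n)]
    return out, exchanges
```

The exchange step moves essentially the whole dataset between devices once per transform, so its cost is set by the interconnect rather than by the arithmetic units.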

With the rapid development of deep learning, silicon chip giants, as well as start-ups, are developing new AI accelerators which are expected to be deployed in late 2019, and this brings more choices for large internet companies.

In light of this, X-MAN3.0 is supposedly designed around modular hardware components, standard interfaces, and flexible topologies, which is meant to provide a key technical foundation for Baidu to quickly adopt more competitive AI training solutions.
