
The challenges of large-scale AI training

Baidu has released X-MAN3.0, a super AI computing platform optimised for deep neural networks. Jointly developed by Inspur and Baidu, the X-MAN3.0 solution can achieve 2,000 trillion deep neural network operations per second.

Innovation in computing technology, alongside data and algorithms, is one of the most important factors propelling the advancement of deep learning.

As one of Baidu's most important strategic partners in the field of data centre computing and storage infrastructure, Inspur has been working with Baidu to develop AI-specific computing platforms, including X-MAN3.0, a specialised platform for ultra-large-scale AI training. 

The first generation of the product was released in 2016, and the platform is now in its third generation.

The 8U X-MAN3.0 consists of two independent 4U AI modules, each supporting 8 of the latest NVIDIA V100 Tensor Core GPUs. 

The two AI modules are connected by high-speed interconnect backplanes with 48 NVLink links. The GPUs can communicate directly through NVIDIA NVSwitch, and the overall unidirectional bandwidth among all GPUs is up to 2,400 GB/s.
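
The headline figures can be sanity-checked with simple arithmetic. The sketch below is only a back-of-the-envelope estimate: the GPU count comes from the article, while the per-GPU link count, per-link bandwidth and peak Tensor Core rate are assumptions taken from NVIDIA's published V100 specifications, not figures confirmed by Baidu or Inspur.

```python
# Back-of-the-envelope check of the quoted platform figures.
GPUS_PER_MODULE = 8               # from the article
MODULES = 2                       # from the article
NVLINKS_PER_GPU = 6               # assumption: NVLink links per V100
GB_S_PER_LINK_ONE_WAY = 25        # assumption: GB/s per link, per direction
TENSOR_TFLOPS_PER_GPU = 125       # assumption: peak Tensor Core rate per V100


def aggregate_unidirectional_bandwidth_gb_s() -> int:
    """Sum of one-way NVLink bandwidth over every GPU in the chassis."""
    gpus = GPUS_PER_MODULE * MODULES
    return gpus * NVLINKS_PER_GPU * GB_S_PER_LINK_ONE_WAY


def aggregate_tensor_tflops() -> int:
    """Sum of peak Tensor Core throughput over every GPU in the chassis."""
    return GPUS_PER_MODULE * MODULES * TENSOR_TFLOPS_PER_GPU


print("aggregate one-way bandwidth (GB/s):", aggregate_unidirectional_bandwidth_gb_s())
print("aggregate peak tensor throughput (TFLOPS):", aggregate_tensor_tflops())
```

Under these assumptions the totals line up with the 2,400 GB/s and 2,000 trillion operations per second quoted for the platform. Both are peak values, so real workloads would be expected to see less.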

X-MAN3.0 is also equipped with two levels of PCIe switches supporting interconnections among CPUs, AI accelerators and other I/O devices.

The relationship between CPU and GPU can be set in a software-defined manner, so as to flexibly support diversified AI workloads without system bottlenecks. This is a significant difference between X-MAN3.0 and other products in the industry.

Super AI computing platform optimised for deep neural networks

Large-scale and distributed training is bringing increasing challenges for computing platforms. To improve the accuracy of AI models, the average size of training datasets has increased by more than 300 times. 

By the end of 2017, the number of labelled pictures in Google's Open Images dataset had reached 9 million. Model complexity has surged so quickly that some Internet companies' AI models have reached 100 billion parameters.

This surge in data requires users to deploy larger GPU computing platforms with a greater scale-up capability to solve the increasing challenges in communication between GPUs. 
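
A rough calculation shows why a model of that size cannot sit on a single accelerator. The sketch below is illustrative only: the parameter count is the article's figure, while FP32 weights and a 32 GB GPU are assumptions, and the estimate ignores gradients, optimiser state and activations, all of which add to the real footprint.

```python
import math

PARAMETERS = 100_000_000_000   # from the article: 100 billion parameters
BYTES_PER_PARAMETER = 4        # assumption: weights stored as FP32
GPU_MEMORY_GB = 32             # assumption: 32 GB V100 variant


def weights_size_gb(parameters: int, bytes_per_parameter: int) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return parameters * bytes_per_parameter / 1e9


def min_gpus_for_weights(parameters: int, bytes_per_parameter: int,
                         gpu_memory_gb: int) -> int:
    """Smallest GPU count whose combined memory can hold the weights."""
    size = weights_size_gb(parameters, bytes_per_parameter)
    return math.ceil(size / gpu_memory_gb)


print("weights (GB):", weights_size_gb(PARAMETERS, BYTES_PER_PARAMETER))
print("minimum GPUs for weights alone:",
      min_gpus_for_weights(PARAMETERS, BYTES_PER_PARAMETER, GPU_MEMORY_GB))
```

Once the weights are spread over many GPUs, every training step involves traffic between them, which is where the interconnect becomes the limiting factor.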

For example, the three-dimensional Fast Fourier Transform, an algorithm commonly used in AI models, requires one global communication for every three operations in GPU parallel processing, which makes it heavily dependent on the communication bandwidth between GPUs.
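
The communication pattern can be illustrated with a small simulation. The sketch below assumes a slab decomposition, which is one common way to parallelise a 3-D FFT; the article does not say which scheme it has in mind. Plain Python lists stand in for GPUs, a naive DFT stands in for a real FFT library, and `slab_fft3d` and the other names are hypothetical. Each worker performs two local 1-D transform passes, then a single all-to-all exchange, then the third pass.

```python
import cmath
import random


def dft(values):
    """Naive 1-D discrete Fourier transform (O(n^2), fine for a tiny demo)."""
    n = len(values)
    return [sum(values[j] * cmath.exp(-2j * cmath.pi * j * k / n)
                for j in range(n)) for k in range(n)]


def fft_axis(grid, axis):
    """Apply the 1-D DFT along one axis of a nested-list 3-D grid, in place."""
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    if axis == 2:
        for x in range(nx):
            for y in range(ny):
                grid[x][y] = dft(grid[x][y])
    elif axis == 1:
        for x in range(nx):
            for z in range(nz):
                col = dft([grid[x][y][z] for y in range(ny)])
                for y in range(ny):
                    grid[x][y][z] = col[y]
    else:
        for y in range(ny):
            for z in range(nz):
                col = dft([grid[x][y][z] for x in range(nx)])
                for x in range(nx):
                    grid[x][y][z] = col[x]
    return grid


def slab_fft3d(grid, workers):
    """Simulated slab-decomposed 3-D FFT; returns (result, elements_exchanged)."""
    n = len(grid)              # cubic grid, n divisible by workers
    step = n // workers
    # Each worker starts with a contiguous slab of x-planes.
    slabs = [[[row[:] for row in plane] for plane in grid[w * step:(w + 1) * step]]
             for w in range(workers)]
    for slab in slabs:         # passes 1 and 2: purely local work
        fft_axis(slab, 2)
        fft_axis(slab, 1)
    exchanged = 0
    blocks = []
    for w in range(workers):   # all-to-all: worker w gathers its y-range for every x
        ys = range(w * step, (w + 1) * step)
        block = [[slabs[x // step][x % step][y][:] for y in ys] for x in range(n)]
        exchanged += sum(step * n for x in range(n) if x // step != w)
        blocks.append(fft_axis(block, 0))   # pass 3: local again
    result = [[blocks[y // step][x][y % step] for y in range(n)] for x in range(n)]
    return result, exchanged


def reference_fft3d(grid):
    """Single-device 3-D transform used to check the simulated result."""
    full = [[row[:] for row in plane] for plane in grid]
    for axis in (2, 1, 0):
        fft_axis(full, axis)
    return full


random.seed(0)
N, WORKERS = 4, 2
grid = [[[complex(random.random(), random.random()) for _ in range(N)]
         for _ in range(N)] for _ in range(N)]
distributed, moved = slab_fft3d(grid, WORKERS)
reference = reference_fft3d(grid)
error = max(abs(distributed[x][y][z] - reference[x][y][z])
            for x in range(N) for y in range(N) for z in range(N))
print(f"max error vs single-device result: {error:.2e}")
print(f"elements sent between workers: {moved} of {N ** 3}")
```

In this scheme a fraction (P - 1)/P of the whole grid crosses the interconnect for P workers, so the exchange grows with both problem size and GPU count while the local passes shrink per device. That is why the transform is sensitive to GPU-to-GPU bandwidth.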

With NVIDIA NVSwitch, the platform is said to alleviate these communication bottlenecks, delivering greater-than-expected application value for Internet companies' ultra-large-scale AI training.

With the rapid development of deep learning, chip giants and start-ups alike are developing new AI accelerators, which are expected to be deployed in late 2019. This gives large Internet companies more choices.

In light of this, X-MAN3.0 is reportedly designed around modular hardware components, standard interfaces and flexible topologies, which is meant to give Baidu a key technical foundation for quickly adopting more competitive AI training solutions.
