The challenges of large-scale AI training

Baidu has released X-MAN3.0, a super AI computing platform optimised for deep neural networks. Jointly developed by Inspur and Baidu, the X-MAN3.0 solution can achieve 2,000 trillion deep neural network operations per second.

Innovation in computing technologies, alongside data and algorithms, is one of the most important factors that have propelled the advancement of deep learning.

As one of Baidu's most important strategic partners in the field of data centre computing and storage infrastructure, Inspur has been working with Baidu to develop AI-specific computing platforms, including X-MAN3.0, a specialised platform for ultra-large-scale AI training. 

The first generation of the product was released in 2016, and the platform has since been upgraded to its third generation.

The 8U X-MAN3.0 consists of two independent 4U AI modules, each supporting 8 of the latest NVIDIA V100 Tensor Core GPUs. 

The two AI modules are connected by high-speed interconnect backplanes with 48 NVLink links. The GPUs can communicate directly through NVIDIA NVSwitch, and the overall unidirectional bandwidth among all GPUs is up to 2,400 GB/s.
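The headline figures can be sanity-checked with simple arithmetic. The sketch below is a back-of-the-envelope check, not a statement of how Inspur or Baidu derived their numbers. The GPU and module counts come from the article; the per-GPU values (six NVLink links per V100, 25 GB/s per link in each direction, 125 TFLOPS peak mixed-precision Tensor Core throughput) are assumptions taken from NVIDIA's published V100 specifications.

```python
# Back-of-the-envelope check of the article's headline figures.
# Counts marked "article" are quoted in the text; values marked "assumption"
# are NVIDIA's published V100 specifications, not figures from the article.

GPUS_PER_MODULE = 8           # article
MODULES = 2                   # article
NVLINKS_PER_GPU = 6           # assumption: NVLink links per V100
GB_S_PER_LINK_ONE_WAY = 25    # assumption: per-link, per-direction rate
TENSOR_TFLOPS_PER_GPU = 125   # assumption: peak mixed-precision throughput


def total_gpus():
    """Number of GPUs in one 8U system."""
    return GPUS_PER_MODULE * MODULES


def aggregate_unidirectional_bandwidth_gb_s():
    """Sum of one-way NVLink bandwidth over all GPUs, in GB/s."""
    return total_gpus() * NVLINKS_PER_GPU * GB_S_PER_LINK_ONE_WAY


def aggregate_tensor_tflops():
    """Sum of peak Tensor Core throughput over all GPUs, in TFLOPS."""
    return total_gpus() * TENSOR_TFLOPS_PER_GPU


print("GPUs:", total_gpus())
print("One-way NVLink bandwidth (GB/s):", aggregate_unidirectional_bandwidth_gb_s())
print("Peak tensor throughput (TFLOPS):", aggregate_tensor_tflops())
```

Under these assumptions the totals line up with the quoted 2,400 GB/s of unidirectional bandwidth and 2,000 trillion operations per second. Both are theoretical peaks rather than measured training throughput.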

X-MAN3.0 is also equipped with two levels of PCIe switches that support interconnections among CPUs, AI accelerators and other I/O devices.

The relationship between CPU and GPU can be set in a software-defined manner, so as to flexibly support diversified AI workloads without system bottlenecks. This is a significant difference between X-MAN3.0 and other products in the industry.

Super AI computing platform optimised for deep neural networks

Large-scale, distributed training poses growing challenges for computing platforms. To improve the accuracy of AI models, the average size of training datasets has increased by more than 300 times.

By the end of 2017, the number of labelled pictures in Google's Open Images dataset had reached 9 million. Model complexity has grown so quickly that some Internet companies' AI models have reached 100 billion parameters.

This surge in data requires users to deploy larger GPU computing platforms with a greater scale-up capability to solve the increasing challenges in communication between GPUs. 

For example, the three-dimensional Fast Fourier Transform, an algorithm commonly used in AI models, requires one global communication for every three operations in GPU parallel processing, which makes it heavily dependent on the communication bandwidth between GPUs.
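To illustrate why a parallel 3D transform leans on interconnect bandwidth, here is a minimal pure-Python sketch of one common approach, slab decomposition. The article does not say which decomposition is used on X-MAN3.0, so this is an assumption for illustration only. Workers are simulated as list entries, the 1D transform is a naive DFT rather than a true FFT, and all function names are hypothetical. Each worker performs two local 1D passes, then a single all-to-all exchange makes the third axis local for the final pass.

```python
import cmath


def dft(seq):
    """Naive 1D discrete Fourier transform (O(n^2), for illustration only)."""
    n = len(seq)
    return [sum(v * cmath.exp(-2j * cmath.pi * j * k / n) for k, v in enumerate(seq))
            for j in range(n)]


def transpose(plane):
    """Swap the two axes of a 2D plane."""
    return [list(col) for col in zip(*plane)]


def transform_last_axis(plane):
    """Apply the 1D transform to every row of a 2D plane."""
    return [dft(row) for row in plane]


def transform_first_axis(plane):
    """Apply the 1D transform along the first axis of a 2D plane."""
    return transpose(transform_last_axis(transpose(plane)))


def all_to_all(x_slabs, workers):
    """Simulated global exchange: regroup slabs split along x into slabs split along y.

    On real hardware every worker sends a block to every other worker here,
    which is the step that stresses GPU-to-GPU bandwidth.
    """
    full = [plane for slab in x_slabs for plane in slab]   # full[x][y][z]
    step = len(full[0]) // workers
    return [[[full[x][y] for x in range(len(full))]        # slab[y_local][x][z]
             for y in range(q * step, (q + 1) * step)]
            for q in range(workers)]


def distributed_fft3d(data, workers):
    """3D transform of data[x][y][z] using slab decomposition."""
    nx, ny = len(data), len(data[0])
    assert nx % workers == 0 and ny % workers == 0
    step_x, step_y = nx // workers, ny // workers
    x_slabs = [data[p * step_x:(p + 1) * step_x] for p in range(workers)]
    # Operations 1 and 2: transforms along z and y are local to each worker.
    x_slabs = [[transform_first_axis(transform_last_axis(plane)) for plane in slab]
               for slab in x_slabs]
    # The single global communication: make the x axis local.
    y_slabs = all_to_all(x_slabs, workers)
    # Operation 3: transform along x, again local to each worker.
    y_slabs = [[transform_first_axis(plane) for plane in slab] for slab in y_slabs]
    # Reassemble into result[x][y][z] so the output can be inspected.
    return [[y_slabs[y // step_y][y % step_y][x] for y in range(ny)] for x in range(nx)]


def reference_dft3d(data):
    """Direct triple-sum 3D DFT, used only to check the sketch."""
    nx, ny, nz = len(data), len(data[0]), len(data[0][0])

    def coeff(u, v, w):
        return sum(data[x][y][z]
                   * cmath.exp(-2j * cmath.pi * (u * x / nx + v * y / ny + w * z / nz))
                   for x in range(nx) for y in range(ny) for z in range(nz))

    return [[[coeff(u, v, w) for w in range(nz)] for v in range(ny)] for u in range(nx)]


def max_abs_diff(a, b):
    """Largest element-wise difference between two 3D arrays."""
    return max(abs(p - q) for pa, pb in zip(a, b) for ra, rb in zip(pa, pb)
               for p, q in zip(ra, rb))


grid = [[[float(x + 2 * y + 3 * z) for z in range(4)] for y in range(4)] for x in range(4)]
print("max error vs direct DFT:", max_abs_diff(distributed_fft3d(grid, 2), reference_dft3d(grid)))
```

In this sketch the exchange moves essentially the whole dataset between workers once per transform. That is why the cost of the all-to-all step, rather than the local arithmetic, tends to dominate as the number of GPUs grows.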

With NVIDIA NVSwitch, the platform is said to alleviate these communication bottlenecks and to deliver greater value for Internet companies' ultra-large-scale AI training.

With the rapid development of deep learning, chip giants and start-ups alike are developing new AI accelerators, which are expected to be deployed in late 2019. This gives large Internet companies more choices.

In light of this, X-MAN3.0 is said to be designed around modular hardware components, standard interfaces and flexible topologies, which is meant to give Baidu a key technical foundation for quickly adopting more competitive AI training solutions.
