
NVIDIA launches eighth-generation AI software TensorRT 8

By Ryan Morris-Reade, Wed 21 Jul 2021

NVIDIA has launched the eighth generation of the company’s AI software, TensorRT 8, which now completes language queries in half the time. 

TensorRT 8’s optimisations deliver record-setting speed for language applications, enabling developers to build high-performance search engines, ad recommendations, and chatbots from the cloud to the edge. 

The new software runs BERT Large, a widely used transformer-based model, in 1.2 milliseconds. NVIDIA says companies previously had to shrink such models, which degraded accuracy. With TensorRT 8, it says, companies can double or triple their model size and improve accuracy dramatically. 

“AI models are growing exponentially more complex, and worldwide demand is surging for real-time applications that use AI,” says NVIDIA VP of developer programs, Greg Estes.

“That makes it imperative for enterprises to deploy state-of-the-art inferencing solutions. The latest version of TensorRT introduces new capabilities that enable companies to deliver conversational AI applications to their customers with a level of quality and responsiveness never before possible.” 

According to NVIDIA, over five years, more than 350,000 developers across 27,500 companies in wide-ranging areas, including healthcare, automotive, finance and retail, have downloaded TensorRT nearly 2.5 million times. TensorRT applications can be deployed in hyperscale data centres and on embedded or automotive product platforms. 

In addition to transformer optimisations, TensorRT 8’s AI inference gains come from two other key features: 

  • Sparsity is a new performance technique in NVIDIA Ampere architecture GPUs to increase efficiency, allowing developers to accelerate their neural networks by reducing computational operations. 
  • Quantisation-aware training enables developers to use trained models to run inference in INT8 precision without losing accuracy. This significantly reduces compute and storage overhead for efficient inference on Tensor Cores. 
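The sparsity feature on Ampere GPUs is built around a 2:4 structured pattern: in every group of four consecutive weights, at most two are non-zero, which the hardware can exploit to skip computation. As a rough illustration only (a minimal NumPy sketch, not NVIDIA's implementation or the TensorRT API), pruning a weight matrix to that pattern might look like this:

```python
import numpy as np

def prune_2_4(weights: np.ndarray) -> np.ndarray:
    """Illustrative 2:4 structured pruning: in every group of 4
    consecutive weights, keep the 2 largest magnitudes, zero the rest.
    Assumes the total number of weights is a multiple of 4."""
    flat = weights.reshape(-1, 4)
    # indices of the 2 smallest-magnitude weights in each group of 4
    drop = np.argsort(np.abs(flat), axis=1)[:, :2]
    pruned = flat.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16)).astype(np.float32)
w_sparse = prune_2_4(w)
print(np.count_nonzero(w_sparse) / w_sparse.size)  # exactly half remain
```

In practice the pruned model is typically fine-tuned afterwards to recover accuracy, and the sparse weights are handed to the inference engine in a hardware-friendly compressed layout.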

Hugging Face is an open-source AI company that works with AI service providers across multiple industries. The company is working closely with NVIDIA to introduce new AI services to enable text analysis, neural search and conversational applications at scale. 

“We’re closely collaborating with NVIDIA to deliver the best possible performance for state-of-the-art models,” says Hugging Face product director, Jeff Boudier. 

“The Hugging Face Accelerated Inference API already delivers up to 100x speedup for transformer models powered by NVIDIA GPUs. With TensorRT 8, Hugging Face achieved 1ms inference latency on BERT, and we’re excited to offer this performance to our customers later this year.” 

GE Healthcare, a global medical technology, diagnostics and digital solutions company, uses TensorRT to help accelerate computer vision applications for ultrasounds, a critical tool for the early detection of diseases. This enables clinicians to deliver high-quality care through its healthcare solutions. 

“When it comes to ultrasound, clinicians spend valuable time selecting and measuring images,” says GE Healthcare chief engineer of Cardiovascular Ultrasound, Erik Steen.

“During the R&D project leading up to the Vivid Patient Care Elevated Release, we wanted to make the process more efficient by implementing automated cardiac view detection on our Vivid E95 scanner. The cardiac view recognition algorithm selects appropriate images for analysis of cardiac wall motion. 

“TensorRT, with its real-time inference capabilities, improves the performance of the view detection algorithm, and it also shortened our time to market during the R&D project,” he says.

TensorRT 8 is now generally available and free of charge to members of the NVIDIA Developer program. The latest versions of plug-ins, parsers and samples are also available open-source from the TensorRT GitHub repository.
 
