On July 20, NVIDIA launched TensorRT 8, a software development kit (SDK) designed to help companies build smarter, more interactive language apps from cloud to edge. The latest version of the SDK is available for free to members of NVIDIA’s developer program. Plug-ins, parsers, and samples are also available to developers from the TensorRT GitHub repository.
TensorRT 8 features the latest innovations in deep learning inference, the process of running a trained neural network model on new data to generate predictions. TensorRT 8 cuts inference time in half for language queries using two key features, illustrated in the sketch after this list:
- Sparsity is a new performance technique in NVIDIA Ampere architecture graphics processing units (GPUs) that increases efficiency by reducing the number of computations performed. Not all parts of a deep learning model are equally important, and some weights can be set to zero, so computations don’t need to be performed on those parameters. By exploiting sparsity on its GPUs, NVIDIA is able to zero out nearly half of the weights in certain models for improved performance, throughput, and latency.
- Quantization allows developers to run inference with trained models in eight-bit integer precision (known as INT8), which significantly reduces the compute and storage required for inference on Tensor Cores. INT8 has grown in popularity for optimizing machine learning frameworks like TensorFlow and NVIDIA’s TensorRT because it lowers memory and compute requirements. By applying this technique, NVIDIA is able to retain accuracy while offering exceedingly high performance in TensorRT 8.
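As a rough illustration, the following minimal sketch shows how both features might be enabled when building an engine with TensorRT 8’s Python API. The ONNX file path is a placeholder, the INT8 calibrator is left as a hypothetical object (a real build would supply an IInt8Calibrator implementation), and error handling is kept to a minimum:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)

# Parse a trained model exported to ONNX ("model.onnx" is a placeholder).
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError("failed to parse ONNX model")

config = builder.create_builder_config()

# Sparsity: let TensorRT pick sparse Tensor Core kernels for weights
# that follow Ampere's 2:4 structured-sparse pattern.
config.set_flag(trt.BuilderFlag.SPARSE_WEIGHTS)

# Quantization: allow INT8 kernels. Post-training quantization also needs
# a calibrator, e.g.:
#   config.int8_calibrator = my_calibrator  # hypothetical IInt8Calibrator
config.set_flag(trt.BuilderFlag.INT8)

engine_bytes = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine_bytes)
```

Both switches are opt-in builder flags, so the same build script can produce dense, sparse, or quantized engines simply by toggling the configuration.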
TensorRT is widely deployed across many industries
Over the past five years, developers in industries spanning healthcare, automotive, financial services, and retail have downloaded TensorRT nearly 2.5 million times.
For example, GE Healthcare is using TensorRT to power its cardiovascular ultrasound systems. The digital diagnostics solutions provider implemented automated cardiac view detection on its Vivid E95 scanner, accelerated with TensorRT. With an improved view detection algorithm, cardiologists can make more accurate diagnoses and identify diseases at earlier stages. Other companies using TensorRT include Verizon, Ford, the US Postal Service, American Express, and other large brands.
NVIDIA also introduced in TensorRT 8 a flexible set of compiler optimizations that deliver twice the performance of TensorRT 7, regardless of which transformer model a company uses. TensorRT 8 can run BERT-Large—a widely used transformer-based model—in 1.2 milliseconds, which means companies can double or triple their model size for greater accuracy.
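To make that latency figure concrete, here is a minimal sketch of how a deployed engine might be timed with TensorRT 8’s Python API. The engine filename is a placeholder, the sketch assumes the engine was built with static input shapes, and pycuda is used for device buffers:

```python
import time
import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

# "bert_large.engine" is a placeholder for an engine produced by a build
# step like the one sketched earlier.
with open("bert_large.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Allocate one device buffer per input/output binding (static shapes assumed).
bindings = []
for i in range(engine.num_bindings):
    dtype = trt.nptype(engine.get_binding_dtype(i))
    nbytes = trt.volume(engine.get_binding_shape(i)) * np.dtype(dtype).itemsize
    bindings.append(int(cuda.mem_alloc(nbytes)))

# Warm up once, then time a single synchronous inference pass.
context.execute_v2(bindings)
start = time.perf_counter()
context.execute_v2(bindings)
print(f"latency: {(time.perf_counter() - start) * 1e3:.2f} ms")
```

Because execute_v2 is synchronous, the wall-clock difference covers the full inference pass; a production benchmark would average over many runs and use CUDA events for finer-grained timing.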
Numerous inference services already use language models like BERT-Large behind the scenes. However, language-based apps typically don’t understand nuance or emotion, which creates a subpar experience across the board. With TensorRT 8, companies can now run an entire workflow within a millisecond. These advancements could enable a new generation of conversational AI apps that offer a smarter, lower-latency experience to users.
“This is a huge improvement beyond what we have ever delivered in the past,” said Siddharth Sharma, NVIDIA’s head of product marketing for AI software. “We look forward to seeing how developers are going to use TensorRT 8.”
Real-time apps with AI
Real-time applications that use artificial intelligence (AI), like chatbots, are on the rise. But as AI gets smarter and better at delivering new kinds of services, the models behind it grow more complex and more expensive to compute. This creates challenges for those building AI-based services.
Today’s developers must make hard trade-offs between accuracy and latency when dealing with complex AI models. A data center could serve hundreds of models at once, each expected to respond within just a few milliseconds.
“This is one of the biggest challenges in deploying AI apps today. How do you maximize or retain the amount of accuracy that you train with and then offer it to your customers with the least amount of latency?” said Sharma during a news briefing.
AI has the potential to have the most transformative effect on society since the birth of the Internet. Its success depends on the quality of the models and the speed of execution. NVIDIA’s GPUs are widely regarded as the best silicon for AI processing, but its software, such as TensorRT, is equally important in making AI mainstream and usable in everyday life.