An open source library designed to accelerate recommender systems

NVIDIA Merlin

NVIDIA Merlin is an open source library designed to accelerate recommender systems on NVIDIA’s GPUs. It enables data scientists, machine learning engineers, and researchers to build high-performing recommenders at scale. Merlin includes tools to address common ETL, training, and inference challenges. Each stage of the Merlin pipeline is optimized to support hundreds of terabytes of data, which is all accessible through easy-to-use APIs. With Merlin, better predictions and increased click-through rates (CTRs) are within reach. For more information, see NVIDIA Merlin.

Benefits

NVIDIA Merlin is a scalable and GPU-accelerated solution, making it easy to build recommender systems from end to end. With NVIDIA Merlin, you can:

transform data (ETL) for preprocessing and engineering features.
accelerate existing training pipelines in TensorFlow, PyTorch, or FastAI by leveraging optimized, custom-built dataloaders.
scale large deep learning recommender models by distributing large embedding tables that exceed available GPU and CPU memory.
deploy data transformations and trained models to production with only a few lines of code.

Components of NVIDIA Merlin

NVIDIA Merlin is a collection of open source libraries:

NVTabular
HugeCTR
Triton Inference Server

NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data. NVTabular is essentially the ETL component of the Merlin ecosystem. It is designed to quickly and easily manipulate terabyte-size datasets that are used to train deep learning based recommender systems. NVTabular offers a high-level API that can be used to define complex data transformation workflows. NVTabular is also capable of transformation speedups that can be 100 times to 1,000 times faster than transformations taking place on optimized CPU clusters. With NVTabular, you can:

prepare datasets quickly and easily for experimentation so that more models can be trained.
process datasets that exceed GPU and CPU memory without having to worry about scale.
focus on what to do with the data and not how to do it by using abstraction at the operation level.

NVTabular DataLoaders

NVTabular provides seamless integration with common deep learning frameworks, such as TensorFlow, PyTorch, and HugeCTR. When training deep learning recommender system models, dataloading can be a bottleneck. Therefore, we’ve developed custom, highly-optimized dataloaders to accelerate existing TensorFlow and PyTorch training pipelines. The NVTabular dataloaders can lead to a speedup that is 9 times faster than the same training pipeline used with the GPU. With the NVTabular dataloaders, you can:

remove bottlenecks from dataloading by processing large chunks of data at a time instead of item by item.
process datasets that don’t fit within the GPU or CPU memory by streaming from the disk.
prepare batches asynchronously into the GPU to avoid CPU-GPU communication.
integrate easily into existing TensorFlow or PyTorch training pipelines by using a similar API.

HugeCTR

HugeCTR is a GPU-accelerated framework designed to distribute training across multiple GPUs and nodes and estimate click-through rates. HugeCTR contains optimized dataloaders that can be used to prepare batches with GPU-acceleration. In addition, HugeCTR is capable of scaling large deep learning recommendation models. The neural network architectures often contain large embedding tables that represent hundreds of millions of users and items. These embedding tables can easily exceed the CPU and GPU memory. HugeCTR provides strategies for scaling large embedding tables beyond available memory. With HugeCTR, you can:

scale embedding tables over multiple GPUs or nodes.
load a subset of an embedding table into the GPU in a coarse grained, on-demand manner during the training stage.

Triton

NVTabular and HugeCTR both support the Triton Inference Server to provide GPU-accelerated inference. The Triton Inference Server is open-source inference serving software that can be used to simplify the deployment of trained AI models from any framework to production. With Triton, you can:

deploy NVTabular ETL workflows and trained deep learning models to production with a few lines of code.
deploy an ensemble of NVTabular ETL and trained deep learning models to ensure that the same data transformations are applied in production.
deploy models concurrently on GPUs to maximize utilization.
enable low latency inferencing in real time or batch inferencing to maximize GPU and CPU utilization.
scale the production environment with Kubernetes for orchestration, metrics, and auto-scaling using a Docker container.

Examples

A collection of end-to-end examples is available within this repository in the form of Jupyter notebooks. The example notebooks demonstrate how to:

download and prepare the dataset.
use preprocessing and engineering features.
train deep learning recommendation models with TensorFlow, PyTorch, FastAI, or HugeCTR.
deploy the models to production.

These examples are based on different datasets and provide a wide range of real-world use cases.