Pylomin

Pylomin (PYtorch LOw-Memory INference) is a deep learning optimization library for low-memory inferencing in PyTorch.

Motivation

The scale of deep learning models has grown exponentially in recent years, which has greatly increased the difficulty of product deployment.

(Figure: growth of model scale in recent years. Image source: Microsoft Research Blog)

The goal of this library is to enable low-cost deployment of large-scale models:

  • Minimize memory requirements
    • For example, we can reduce the peak memory requirement for the inference of a BERT-like model (with 1.6 GiB parameters) to 46 MiB.
  • Minimize memory requirements while maintaining the model throughput

Peak memory is the maximum amount of memory needed to store model parameters and hidden states at any time during the model inference.
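To make the notion of peak memory concrete, here is a small illustration using only Python's standard tracemalloc module. The helper name `peak_memory_of` is hypothetical (not a Pylomin API), and the bytearray allocation merely stands in for a model forward pass:

```python
import tracemalloc

# Hypothetical helper (not part of Pylomin): report the peak Python-heap
# memory used while running a callable, e.g. a model forward pass.
def peak_memory_of(fn, *args):
    tracemalloc.start()
    fn(*args)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak  # bytes

# Toy workload standing in for model inference: allocate ~8 MiB,
# which is freed on return but still counted in the peak.
peak = peak_memory_of(lambda: bytearray(8 * 1024 * 1024))
print(peak >= 8 * 1024 * 1024)  # True
```

Note that tracemalloc only tracks Python-level allocations; for a real PyTorch model, resident-set size (RSS) is usually the more relevant measurement.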

Getting Started

1. Lazy-loading

Load model parameters only when needed and delete them immediately after use.

Provide a list of target_classes or target_modules to be converted to lazy-loading mode.
In addition, when using target_classes, you can also provide a list of modules to be skipped.

# Use target_classes: convert all nn.Linear and nn.Embedding layers
model = pylomin.lazy_loading(model, target_classes=[nn.Linear, nn.Embedding])

# Optionally, skip some modules when using target_classes
model = pylomin.lazy_loading(model,
                             target_classes=[nn.Linear, nn.Embedding],
                             skip_modules=[model.embeddings.word_embeddings])

# Use target_modules: provide the modules to convert explicitly
target_modules = [module for module in model.modules() if some_condition]
model = pylomin.lazy_loading(model, target_modules=target_modules)
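Pylomin's actual implementation may differ, but the core idea of lazy-loading can be sketched with PyTorch's public hook API: materialize a layer's parameters in a forward pre-hook and release them in a forward hook, so only one layer's weights are resident at a time. Everything below (the `make_lazy` helper, the `stash` dict) is illustrative, not part of Pylomin:

```python
import torch
from torch import nn

# Sketch only: keep a layer's weights "elsewhere" (here, detached CPU
# copies in `stash`; in practice they would live on disk), restore them
# just before forward, and drop them right after.
def make_lazy(module, stash):
    def load(mod, inputs):
        for name, saved in stash.items():
            setattr(mod, name, nn.Parameter(saved.clone()))

    def unload(mod, inputs, output):
        for name in stash:
            setattr(mod, name, None)  # release the materialized tensor

    module.register_forward_pre_hook(load)
    module.register_forward_hook(unload)
    return module

layer = nn.Linear(4, 4)
stash = {n: p.detach().clone() for n, p in layer.named_parameters()}
# Drop the live parameters; they are restored per forward call.
layer.weight = None
layer.bias = None
make_lazy(layer, stash)

out = layer(torch.randn(2, 4))
print(out.shape)     # torch.Size([2, 4])
print(layer.weight)  # None again after the call
```

A real implementation would stream weights from disk (e.g. memory-mapped files) rather than keeping copies in RAM, since the latter defeats the purpose.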

2. Chunked-embedding

Attempts to split a torch.nn.Embedding layer into multiple torch.nn.Embedding chunks, each with a smaller num_embeddings.

model = pylomin.chunked_embedding(model, target_module_name='embeddings.word_embeddings', chunk_size=2048)
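The splitting itself can be sketched as follows. This is an illustration of the idea, not Pylomin's implementation: slice the weight matrix into row chunks and route each input id to the chunk that owns its range. The `ChunkedEmbedding` class below is hypothetical:

```python
import torch
from torch import nn

# Sketch: replace one large nn.Embedding with several smaller ones.
class ChunkedEmbedding(nn.Module):
    def __init__(self, weight, chunk_size):
        super().__init__()
        self.chunk_size = chunk_size
        self.chunks = nn.ModuleList(
            nn.Embedding.from_pretrained(weight[i:i + chunk_size], freeze=True)
            for i in range(0, weight.size(0), chunk_size)
        )

    def forward(self, ids):
        out = torch.empty(*ids.shape, self.chunks[0].embedding_dim)
        for k, chunk in enumerate(self.chunks):
            mask = (ids // self.chunk_size) == k    # ids owned by chunk k
            if mask.any():
                out[mask] = chunk(ids[mask] - k * self.chunk_size)
        return out

big = nn.Embedding(10, 3)
chunked = ChunkedEmbedding(big.weight.detach(), chunk_size=4)
ids = torch.tensor([0, 5, 9])
print(torch.allclose(chunked(ids), big(ids)))  # True
```

Combined with lazy-loading, each chunk can then be loaded only when one of its ids actually occurs in the input, so the full embedding table never needs to be resident at once.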

Demo

We provide a script that measures the peak memory and throughput of model inference under the different optimization approaches.

bash run.sh
python3 plot.py

The environment used to run the above benchmarks:

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              6
On-line CPU(s) list: 0-5
Thread(s) per core:  1
Core(s) per socket:  6
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
Stepping:            1
CPU MHz:             2593.992
BogoMIPS:            5187.98
Hypervisor vendor:   Microsoft
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            35840K
NUMA node0 CPU(s):   0-5
