A transformer that does not hog your GPU memory

This is an early in-development codebase: if you want a stable and documented hivemind codebase, look at CALM or dalle-hivemind.

Readme under construction

LeanTransformer implements a specific version of the transformer with two goals in mind:

  • using as little GPU memory as possible
  • stable training for very large models

The core philosophy of LeanTransformer is to replace torch.autograd with grad students. Automatic differentiation is
great if you want to test ideas quickly, less so if a single training run can cost over $4 million (or >1000 years in grad school).

Related work: GSO

Our implementation partially replaces automatic differentiation with Grad Student Optimization (GSO) – a biologically inspired black box optimization algorithm.
In the past, GSO has seen widespread adoption thanks to its strong theoretical foundations
and unparalleled cost efficiency (Chom et al).
Previous work successfully applied GSO for hyperparameter tuning
and natural language generation.
To the best of our knowledge, ours is the first work to successfully
apply distributed fault-tolerant GSO to optimizing the memory footprint of transformers. We summarize our findings below:

Memory saving features:

Other features:

Not implemented:

  • In reversible mode, one can further save memory by computing backward in chunks:
    • a few tokens at a time for feedforward layers, since grad(concat(mlp(x1), mlp(x2))) = concat(grad(mlp(x1)), grad(mlp(x2)))
    • a few heads at a time for self-attention, since grad(head1 + head2) = grad(head1) + grad(head2), where head1 and head2 are attention outputs after linear projection
  • Attention could be computed in O(sqrt(n)) memory (Rabe et al, 2021)
  • No sparse or linear attention: they are great for very long sequences. However, for large models, attention is not a bottleneck in typical NLP and vision tasks (tested with GPT-3 up to sequence length 4096).
  • Per-block grad scaling as described in (Ramesh et al, 2021) – we rely on Sandwich Norm to maintain stability up to 96 layers (did not test more). However, it would be nice to
    have per-block scaling to avoid the need for an extra LayerNorm.
  • Something else that we missed – please find us on discord.
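
To illustrate the chunked backward idea for feedforward layers listed above, here is a minimal PyTorch sketch. It is not part of LeanTransformer (the feature is listed as not implemented); the function name `chunked_mlp_backward` and the toy MLP sizes are assumptions made for this example. It relies on the MLP acting on each token independently, so input gradients can be concatenated across chunks while parameter gradients accumulate in `.grad`.

```python
import torch
import torch.nn as nn


def chunked_mlp_backward(mlp, x, grad_output, chunk_size):
    """Backward through a token-wise MLP a few tokens at a time (hypothetical helper).

    x, grad_output: tensors of shape [num_tokens, hidden].
    Parameter gradients accumulate into p.grad across chunks.
    Returns the gradient with respect to x.
    """
    grad_input_chunks = []
    for x_chunk, g_chunk in zip(x.split(chunk_size), grad_output.split(chunk_size)):
        # Make the chunk a fresh leaf so its gradient is stored in x_chunk.grad
        x_chunk = x_chunk.detach().requires_grad_(True)
        # Recompute activations for this chunk only; they are freed after backward
        out = mlp(x_chunk)
        out.backward(g_chunk)
        grad_input_chunks.append(x_chunk.grad)
    return torch.cat(grad_input_chunks)


torch.manual_seed(0)
# Toy sizes, chosen only for the demonstration; float64 keeps the comparison tight
mlp = nn.Sequential(nn.Linear(8, 32), nn.GELU(), nn.Linear(32, 8)).double()
x = torch.randn(10, 8, dtype=torch.float64)
grad_output = torch.randn(10, 8, dtype=torch.float64)

# Reference: ordinary backward over all tokens at once
x_ref = x.clone().requires_grad_(True)
mlp(x_ref).backward(grad_output)
ref_param_grads = [p.grad.clone() for p in mlp.parameters()]
mlp.zero_grad()

# Chunked version: at most 3 tokens' activations are alive at any time
grad_x = chunked_mlp_backward(mlp, x, grad_output, chunk_size=3)

print(torch.allclose(grad_x, x_ref.grad))
print(all(torch.allclose(p.grad, g) for p, g in zip(mlp.parameters(), ref_param_grads)))
```

The same pattern should carry over to attention heads, with per-head input gradients summed rather than concatenated, since the heads share the same input.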

A day will come when we explain all these modifications and provide instructions on how to tune them.
But it is not this day! Until then, we’ll happily answer any questions on our discord.

Running the code

[under construction] – use the instructions from CALM readme


Acknowledgements

  • Most of the architecture and stability optimizations were learned through the BigScience research workshop
  • YSDA community helped us survive through the early messy versions of this code
  • NeuroPark trained the first practical model (SahajBERT-XL, SoTA in Bengali, details here)
  • TODO DALLE community: at least mention the demo, maybe we end up training something even cooler
  • TODO NCAI community: ask them how best to acknowledge them
  • TODO Hugging Face: ask them how best to acknowledge them
  • TODO Personal: stas00, samyam, jared, more? (this does not include co-authors: Tim, Lucile, Quentin, Denis, Gennady, etc.; also, this does not include hivemind contributors)

