
A CUDA implementation of fused LSTM and GRU layers


Haste is a CUDA implementation of fused LSTM and GRU layers with built-in DropConnect and Zoneout regularization. These layers are exposed through C++ and Python APIs for easy integration into your own projects or machine learning frameworks.

What's included in this project?

  • a standalone C++ API (libhaste)
  • a TensorFlow Python API (haste_tf)
  • examples for writing your own custom C++ inference / training code using libhaste

For questions or feedback about Haste, please open an issue on GitHub or send us an email at [email protected].


Here's what you'll need to get started:

Once you have the prerequisites, run the following to build the code and install the TensorFlow API:

make && pip install haste_tf-*.whl


Getting started with the TensorFlow API is easy:

import haste_tf as haste

lstm_layer = haste.LSTM(num_units=256, direction='bidirectional', zoneout=0.1, dropout=0.05)
gru_layer = haste.GRU(num_units=256, direction='bidirectional', zoneout=0.1, dropout=0.05)

# `x` is a tensor with shape [N,T,C] (batch size, time steps, channels)
y, state = lstm_layer(x)
y, state = gru_layer(x)

The TensorFlow Python API is documented in docs/tf/haste_tf.md.
The C++ API is documented in lib/haste.h and there are code samples in examples/.

Code layout

  • docs/tf/: API reference documentation for haste_tf
  • examples/: examples for writing your own C++ inference / training code using libhaste
  • frameworks/tf/: TensorFlow Python API and custom op code
  • lib/: CUDA kernels and C++ API

Implementation notes

  • the GRU implementation is based on arXiv:1406.1078v1 (the same formulation cuDNN uses) rather than arXiv:1406.1078v3
  • Zoneout on LSTM cells is applied to the hidden state only, and not the cell state
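
The two notes above can be made concrete with a small sketch. It is plain Python rather than Haste's CUDA kernels, and the function names (`gru_candidate_v1`, `gru_candidate_v3`, `lstm_zoneout`) are illustrative only; they are not part of any Haste API. The sketch assumes the commonly described difference between the two paper revisions (v1 applies the reset gate after the recurrent matrix multiply, v3 applies it before), omits biases, and assumes the usual training-time Zoneout rule in which a unit keeps its previous value with probability equal to the zoneout rate:

```python
import math
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def gru_candidate_v1(wx, u, h_prev, r):
    """Candidate state in the 1406.1078v1 style: tanh(Wx + r * (U h)).

    The reset gate `r` scales the result of the recurrent matmul.
    `wx` is the precomputed input projection W x (biases omitted).
    """
    uh = matvec(u, h_prev)
    return [math.tanh(a + g * b) for a, g, b in zip(wx, r, uh)]


def gru_candidate_v3(wx, u, h_prev, r):
    """Candidate state in the 1406.1078v3 style: tanh(Wx + U (r * h)).

    The reset gate `r` scales the hidden state before the recurrent matmul.
    """
    rh = [g * h for g, h in zip(r, h_prev)]
    uh = matvec(u, rh)
    return [math.tanh(a + b) for a, b in zip(wx, uh)]


def lstm_zoneout(h_prev, h_new, c_new, rate, rng):
    """Apply Zoneout to the hidden state only (assumed training-time rule).

    Each hidden unit keeps its previous value with probability `rate`;
    the cell state `c_new` is returned untouched.
    """
    h = [hp if rng.random() < rate else hn for hp, hn in zip(h_prev, h_new)]
    return h, c_new


# Toy values (hypothetical) chosen so the two GRU variants disagree.
wx = [0.1, 0.2]
u = [[0.5, -1.0], [2.0, 0.25]]
h_prev = [1.0, -2.0]
r = [0.0, 1.0]

v1 = gru_candidate_v1(wx, u, h_prev, r)
v3 = gru_candidate_v3(wx, u, h_prev, r)
print("v1 candidate:", v1)
print("v3 candidate:", v3)

h, c = lstm_zoneout([1.0, 2.0], [3.0, 4.0], [5.0, 6.0], 0.1, random.Random(0))
print("zoned-out hidden state:", h, "cell state:", c)
```

With a diagonal recurrent matrix the two GRU variants coincide; they diverge once units interact through off-diagonal weights, so weights trained under one formulation generally cannot be loaded into the other without retraining.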


References

  1. Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
  2. Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv:1406.1078 [cs, stat]. http://arxiv.org/abs/1406.1078
  3. Wan, L., Zeiler, M., Zhang, S., LeCun, Y., & Fergus, R. (2013). Regularization of Neural Networks using DropConnect. In International Conference on Machine Learning (pp. 1058–1066). http://proceedings.mlr.press/v28/wan13.html
  4. Krueger, D., Maharaj, T., Kramár, J., Pezeshki, M., Ballas, N., Ke, N. R., et al. (2017). Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations. arXiv:1606.01305 [cs]. http://arxiv.org/abs/1606.01305

Citing this work

To cite this work, please use the following BibTeX entry:

@misc{haste2020,
  title  = {Haste: a fast, simple, and open RNN library},
  author = {Sharvil Nanavati},
  year   = 2020,
  month  = "Jan",
  howpublished = {\url{https://github.com/lmnt-com/haste/}},
}