This is an easy-to-use Python module that helps you extract BERT embeddings for a large text dataset efficiently. It is intended for Bengali and English texts.
It is specifically optimized for usability in limited computational setups (e.g. free Colab/Kaggle GPUs). Extracting embeddings for the IMDB dataset (a list of 25,000 texts) took less than 28 minutes on Colab's GPU. (I haven't run any rigorous benchmarks, so take these numbers with a grain of salt.)
$ pip install git+https://github.com/khalidsaifullaah/BERTify
from bertify import BERTify

# Example 1: Bengali Embedding Extraction
bn_bertify = BERTify(
    lang="bn",  # language of your text.
    last_four_layers_embedding=True  # to get richer embeddings.
)

# By default, `batch_size` is set to 64. Set `batch_size` higher to make things even faster,
# but a value higher than 96 may throw `CUDA out of memory` on Colab's GPU, so try at your own risk.
# bn_bertify.batch_size = 96

# A list of texts that we want the embeddings for; it can hold one text or many.
# (You can turn your whole dataset into a list of texts and pass it into the method for faster embedding extraction.)
texts = ["বিখ্যাত হওয়ার প্রথম পদক্ষেপ", "জীবনে সবচেয়ে মূল্যবান জিনিস হচ্ছে", "বেশিরভাগ মানুষের পছন্দের জিনিস হচ্ছে"]
bn_embeddings = bn_bertify.embedding(texts)  # returns numpy matrix
# shape of the returned matrix in this example is 3x4096 (3 -> num. of texts, 4096 -> embedding dim.)

# Example 2: English Embedding Extraction
en_bertify = BERTify(
    lang="en",
    last_four_layers_embedding=True
)
# en_bertify.batch_size = 96

texts = ["how are you doing?", "I don't know about this.", "This is the most important thing."]
en_embeddings = en_bertify.embedding(texts)
# shape of the returned matrix in this example is 3x3072 (3 -> num. of texts, 3072 -> embedding dim.)
- Try passing all your text data through the `.embedding()` function at once by turning it into a list of texts.
- For faster inference, make sure you're using your Colab/Kaggle GPU while making the `.embedding()` call.
- Try increasing the `batch_size` to make it even faster. By default we're using `64` (to be on the safe side), which doesn't throw any `CUDA out of memory` error, but I believe we can go even further. Thanks to Alex's empirical findings, it seems it can be pushed up to `96`. So, before making the `.embedding()` call, you can do `bertify.batch_size = 96` to set a larger `batch_size`.
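To make the effect of `batch_size` concrete, here is a minimal sketch of how a list of texts is typically split into batches before being fed to a model. It is not BERTify's internal code: `iter_batches` and `num_batches` are hypothetical helpers introduced only for illustration, and the 25,000-item list merely stands in for a dataset of that size.

```python
import math
from typing import Iterator, List


def iter_batches(texts: List[str], batch_size: int = 64) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each at most `batch_size` long.

    Hypothetical helper for illustration; not part of BERTify.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


def num_batches(num_texts: int, batch_size: int) -> int:
    """Number of model forward passes needed to cover `num_texts` texts."""
    return math.ceil(num_texts / batch_size)


# Stand-in for a dataset column that has been turned into a plain list of texts.
texts = [f"review number {i}" for i in range(25000)]

batches_64 = list(iter_batches(texts, batch_size=64))
batches_96 = list(iter_batches(texts, batch_size=96))

print(len(batches_64), num_batches(len(texts), 64))  # 391 391
print(len(batches_96), num_batches(len(texts), 96))  # 261 261
```

With 25,000 texts, going from 64 to 96 texts per batch reduces the number of forward passes from 391 to 261. That is roughly where any speed-up would come from, provided the larger batch still fits in GPU memory.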
class BERTify(lang: str = "en", last_four_layers_embedding: bool = False)
A module for extracting embeddings from a BERT model for Bengali or English text datasets.
'en' -> English data; it uses `bert-base-uncased` model embeddings.
'bn' -> Bengali data; it uses `sahajBERT` model embeddings.
lang (str, optional): language of your data. Currently supports only 'en' and 'bn'. Defaults to 'en'.
last_four_layers_embedding (bool, optional): The BERT paper reports that the best results were reached by concatenating the outputs of the last four layers, so if this argument is set to True, your embedding vector would be (for a `bert-base` model, for example) 4*768=3072 dimensional; otherwise it'd be 768 dimensional. Defaults to False.
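The dimensionality arithmetic for `last_four_layers_embedding` can be sketched with NumPy. This is an illustration under stated assumptions, not BERTify's implementation: random arrays stand in for real hidden states, and each layer is assumed to already provide one vector per text (how BERTify pools token vectors into a single text vector is not described here).

```python
import numpy as np

rng = np.random.default_rng(seed=0)

num_texts = 3
hidden_size = 768  # hidden size of a bert-base model
num_layers = 12    # number of encoder layers in a bert-base model

# Stand-in data (assumption): one already-pooled vector per text for every layer.
layer_outputs = [
    rng.standard_normal((num_texts, hidden_size)) for _ in range(num_layers)
]

# Without concatenation: use only the final layer.
last_layer_only = layer_outputs[-1]

# With concatenation: join the last four layers along the feature axis.
last_four_concat = np.concatenate(layer_outputs[-4:], axis=-1)

print(last_layer_only.shape)   # (3, 768)
print(last_four_concat.shape)  # (3, 3072)
```

The same arithmetic with a hidden size of 1024 gives 4*1024=4096, which matches the 3x4096 shape shown for the Bengali example.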
def BERTify.embedding(texts: List[str])
The embedding function takes a list of texts, feeds them through the model, and returns a matrix of embeddings.
texts (List[str]): A list of texts that you want to extract embeddings for (e.g. ["This movie was a total waste of time.", "Whoa! Loved this movie, totally loved all the characters"]).
Returns an np.ndarray: a NumPy matrix of shape num_of_texts x embedding_dimension.
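Because the result is a plain NumPy matrix with one row per text, it can be passed straight to ordinary array code. As one possible use, the sketch below computes pairwise cosine similarity between rows. The random matrix is only a stand-in for the output of `.embedding()`, so the similarity values themselves carry no meaning.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Stand-in (assumption) for `en_bertify.embedding(texts)`: 3 texts, 768 dimensions.
embeddings = rng.standard_normal((3, 768))

# Normalize each row to unit length, then take dot products between all row pairs.
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
unit_rows = embeddings / norms
similarity = unit_rows @ unit_rows.T

print(similarity.shape)  # (3, 3)
```

Entry `similarity[i, j]` is the cosine similarity between text `i` and text `j`; the diagonal is 1 because every text is identical to itself.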