Image captioning

End-to-end image captioning with EfficientNet-b3 + LSTM with Attention

The model follows a seq2seq architecture.
The encoder uses a pretrained EfficientNet-b3 to extract image features.
The decoder is an LSTM with Bahdanau attention.
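
The attention step can be sketched in NumPy: for each decoder step, an additive (Bahdanau) score is computed between every encoder feature vector and the current hidden state, softmaxed into weights, and used to form a context vector. The shapes and parameter names below are illustrative assumptions, not the repo's actual code.

```python
import numpy as np

def bahdanau_attention(features, hidden, W1, W2, v):
    """Additive (Bahdanau) attention sketch.

    features: (num_pixels, encoder_dim) -- encoder feature vectors
    hidden:   (decoder_dim,)            -- current decoder hidden state
    W1: (encoder_dim, attn_dim), W2: (decoder_dim, attn_dim), v: (attn_dim,)
    Returns (context, weights).
    """
    # score_i = v^T tanh(W1 f_i + W2 h)
    scores = np.tanh(features @ W1 + hidden @ W2) @ v   # (num_pixels,)
    # softmax over encoder positions (numerically stabilized)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # context vector: attention-weighted sum of encoder features
    context = weights @ features                         # (encoder_dim,)
    return context, weights

# Toy example with random parameters (dimensions are made up for illustration).
rng = np.random.default_rng(0)
features = rng.normal(size=(49, 8))   # e.g. a flattened 7x7 encoder feature map
hidden = rng.normal(size=(6,))
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(6, 4))
v = rng.normal(size=(4,))
context, weights = bahdanau_attention(features, hidden, W1, W2, v)
```

At each decoding step the context vector is concatenated with the word embedding and fed to the LSTM cell.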


The dataset is available on Kaggle and contains 8,000 images, each paired with five different captions.
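
A common layout for such a caption file (and an assumption here, so check your copy) is a CSV with an `image,caption` header and one caption per row. Grouping the five captions per image could then look like:

```python
import csv
from collections import defaultdict

def load_captions(path):
    """Group captions by image filename: {filename: [caption, ...]}.

    Assumes a CSV with an `image,caption` header and one caption per row
    (five rows per image) -- verify against your captions.txt.
    """
    captions = defaultdict(list)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            captions[row["image"]].append(row["caption"])
    return dict(captions)
```
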


Run in the terminal: python -m img_caption


The user interface consists of a single file:

  • config.yaml – general configuration with data and model parameters

Default config.yaml:

  path_to_data_folder: "data"
  caption_file_name: "captions.txt"
  images_folder_name: "Images"
  output_folder_name: "output"
  logging_file_name: "logging.txt"
  model_file_name: ""

  batch_size: 32
  num_worker: 1
  gensim_model_name: "glove-wiki-gigaword-200"

  embedding_dimension: 200
  decoder_hidden_dimension: 300
  learning_rate: 0.0001
  momentum: 0.9
  n_epochs: 50
  clip: 5
  fine_tune_encoder: false
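
Since the configuration is a flat YAML mapping, it can be read with PyYAML (assuming that is the parser used; the function name here is illustrative):

```python
import yaml  # PyYAML

def load_config(path="config.yaml"):
    """Read the flat key/value configuration shown above into a dict."""
    with open(path, encoding="utf-8") as f:
        return yaml.safe_load(f)
```

Note that YAML parses `false` to a Python boolean and `""` to an empty string, so the values can be used directly without further conversion.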


After training the model, the pipeline produces the following files:

  • a model checkpoint containing:
    • epoch – the last completed epoch
    • model_state_dict – the model parameters
    • optimizer_state_dict – the state of the optimizer
    • train_history – the training history of the model
    • valid_history – the validation history of the model
    • best_valid_loss – the best validation loss
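
Such a checkpoint is conventionally written as a single dict with torch.save. A minimal sketch, assuming PyTorch and using the key names listed above (the function names themselves are hypothetical):

```python
import torch

def save_checkpoint(path, epoch, model, optimizer,
                    train_history, valid_history, best_valid_loss):
    """Bundle the training state into one checkpoint file."""
    torch.save({
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "train_history": train_history,
        "valid_history": valid_history,
        "best_valid_loss": best_valid_loss,
    }, path)

def load_checkpoint(path, model, optimizer):
    """Restore model and optimizer state; return (epoch, best_valid_loss)."""
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state_dict"])
    optimizer.load_state_dict(ckpt["optimizer_state_dict"])
    return ckpt["epoch"], ckpt["best_valid_loss"]
```

Loading the checkpoint this way allows training to resume from the last epoch instead of starting over.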