A machine learning project that generates captions for video frames describing the relationships between objects in the video.


In our framework, we use a sequence-to-sequence model to perform video visual relationship prediction: the input is a sequence of video frames and the output is a relation triplet <object1, relationship, object2> describing the video. We extend the sequence-to-sequence modelling approach to inputs that are sequences of video frames.


Figure: Bidirectional LSTM layer (coloured red) encodes visual feature inputs, and the LSTM layer (coloured green) decodes the features into a sequence of words.
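The encoder-decoder layout in the figure can be sketched in Keras roughly as follows. This is a minimal illustration, not the repository's actual code: the feature dimensions, layer widths, and vocabulary size are assumptions.

```python
import numpy as np
from tensorflow.keras import layers, models

# Assumed shapes for illustration: 30 frames of 2048-d CNN features,
# a 100-token vocabulary, and a 3-token <object1, relationship, object2> output.
NUM_FRAMES, FEAT_DIM = 30, 2048
VOCAB_SIZE, TRIPLET_LEN = 100, 3

frames = layers.Input(shape=(NUM_FRAMES, FEAT_DIM), name="frame_features")

# Encoder: a bidirectional LSTM summarises the frame sequence into one vector.
encoded = layers.Bidirectional(layers.LSTM(256))(frames)

# Decoder: repeat the encoding once per output token, then decode with an LSTM
# and predict a vocabulary distribution at each of the three triplet positions.
repeated = layers.RepeatVector(TRIPLET_LEN)(encoded)
decoded = layers.LSTM(256, return_sequences=True)(repeated)
outputs = layers.TimeDistributed(
    layers.Dense(VOCAB_SIZE, activation="softmax"))(decoded)

model = models.Model(frames, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```

Feeding a batch of frame features of shape `(batch, 30, 2048)` yields predictions of shape `(batch, 3, 100)`, one softmax per triplet slot.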



Python Dependencies

  1. pandas
  2. Keras
  3. TensorFlow
  4. NumPy
  5. albumentations
  6. Pillow



For training the model, run the script


For training on your own dataset:
Save your data in a directory (see the data folder for the expected format).
Update the JSON files:

  1. object1_object2.json:
    A dictionary of objects, with object labels as keys and ids as values.

  2. relationship.json:
    A dictionary of relationships, with relationship labels as keys and ids as values.

  3. training_annotations.json:
    A dictionary of the training videos, with video ids as keys and lists of <object1, relationship, object2> triplets as values.
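The three files might look like the following. The labels and ids here are invented for illustration; check the data folder for the real vocabulary.

```python
import json

# Hypothetical contents matching the descriptions above
object1_object2 = {"person": 0, "dog": 1, "bicycle": 2}   # object label -> id
relationship = {"ride": 0, "chase": 1, "next_to": 2}      # relation label -> id
training_annotations = {                                  # video id -> triplets
    "video_0001": [["person", "ride", "bicycle"],
                   ["dog", "chase", "person"]],
}

# Write each file in the expected layout
for name, data in [("object1_object2.json", object1_object2),
                   ("relationship.json", relationship),
                   ("training_annotations.json", training_annotations)]:
    with open(name, "w") as f:
        json.dump(data, f, indent=2)

# Reading back: map one annotated triplet to its ids
obj = json.load(open("object1_object2.json"))
rel = json.load(open("relationship.json"))
o1, r, o2 = training_annotations["video_0001"][0]
print(obj[o1], rel[r], obj[o2])  # -> 0 0 2
```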

While running the script, provide your directory path:

  python <script_name> --train_data <directory_path>


For testing the model or making predictions on your own dataset, run the script

  python <script_name> --test_data <directory_path>

Results will be saved to a CSV file, `test_data_predictions.csv`.
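The predictions file can be inspected with pandas. The column names below are assumptions for illustration (the snippet writes a stand-in file first so it is self-contained); check the header of the CSV the script actually produces.

```python
import pandas as pd

# Stand-in file with assumed columns, for illustration only
pd.DataFrame({"video_id": ["video_0001"],
              "object1": ["person"],
              "relationship": ["ride"],
              "object2": ["bicycle"]}).to_csv("test_data_predictions.csv",
                                              index=False)

# Load and inspect the predicted triplets
preds = pd.read_csv("test_data_predictions.csv")
print(preds.head())
```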

