OFA

[Paper] [Blog] [Colab]

Overview

OFA is a unified multimodal pretrained model that unifies modalities (i.e., cross-modality, vision, and language) and tasks
(e.g., image generation, visual grounding, image captioning, image classification, text generation, etc.)
within a simple sequence-to-sequence learning framework. For more information, please refer to our paper: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework.
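
The unifying idea is that every task is expressed as a plain-text instruction (plus an image, when the task has one), and a single encoder-decoder always produces its answer as a token sequence. The Python sketch below is purely illustrative and is not OFA's actual interface; the instruction templates are paraphrased from the paper, and build_instruction is a hypothetical helper.

# Illustrative sketch (not OFA's real API): each task is rephrased as a text instruction,
# and one seq2seq model maps (instruction, optional image) to an output token sequence.
def build_instruction(task, **kwargs):
    templates = {
        "caption": "what does the image describe?",
        "vqa": kwargs.get("question", ""),
        "grounding": 'which region does the text "{}" describe?'.format(kwargs.get("phrase", "")),
        "image_gen": "what is the complete image? caption: {}".format(kwargs.get("caption", "")),
    }
    return templates[task]

# Depending on the task, the model would emit text tokens, location tokens, or image codes.
print(build_instruction("caption"))
print(build_instruction("vqa", question="what color is the car?"))
print(build_instruction("grounding", phrase="a man in a red shirt"))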

News

  • 2022.2.10: Released the inference code & finetuned checkpoint for image captioning, which can reproduce the results on the COCO Karpathy test split (149.6 CIDEr).

TODO

  • To release finetuning and inference code for multimodal downstream tasks, including image captioning, VQA, text-to-image generation, SNLI-VE, referring expression comprehension, etc. Expected time: Mar. 2022.
  • To release pretraining code. Expected time: Apr. 2022.

Approach

[Figure: overview of the OFA unified sequence-to-sequence approach]

Requirements

  • Python 3.7.4
  • PyTorch 1.8.1
  • Java 1.8 (for COCO evaluation)

Installation

git clone https://github.com/OFA-Sys/OFA
cd OFA
pip install -r requirements.txt
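
After installation, a quick sanity check can confirm that the pinned versions listed under Requirements are in place. This is a minimal sketch, not a script shipped with the repository:

# Verify the Python and PyTorch versions listed under Requirements.
import sys
import torch

print("Python:", sys.version.split()[0])            # expected: 3.7.x
print("PyTorch:", torch.__version__)                # expected: 1.8.x
print("CUDA available:", torch.cuda.is_available())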

Datasets and Checkpoints

See datasets.md and checkpoints.md.

Pretraining

To be released soon :)

Finetuning & Inference

Below we provide procedures for finetuning and inference on different downstream tasks. At the moment we only provide the scripts for inference; the finetuning scripts will be released soon.

Caption

  1. Download the data and the finetuned checkpoint (see datasets.md and checkpoints.md) and put them in the correct directories.
  2. Run the commands below:

cd run_scripts/caption
sh evaluate_caption.sh
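
COCO caption evaluation relies on a Java 1.8 runtime (see Requirements). The optional pre-flight check below is only an illustration, not part of the evaluation script; it simply confirms that java is on your PATH:

# Confirm a Java runtime is available for the COCO caption metrics.
import subprocess

try:
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    # Most JVMs print their version information to stderr.
    print(result.stderr.strip() or result.stdout.strip())
except FileNotFoundError:
    print("Java not found on PATH; install Java 1.8 before running COCO evaluation.")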

Gallery

Below we provide examples of OFA on text-to-image generation and open-ended VQA. We also demonstrate its performance on an unseen task (grounded QA) and an unseen domain (visual grounding on images from unseen domains).

Text-to-Image Generation (normal query)

[Image: text-to-image generation samples for normal queries]

Text-to-Image Generation (counterfactual query)

[Image: text-to-image generation samples for counterfactual queries]

Open-Ended VQA

[Image: open-ended VQA samples]

Grounded QA (unseen task)

[Image: grounded QA samples]

Visual Grounding (unseen domain)

[Image: visual grounding samples on images from unseen domains]

Citation

Please cite our paper if you find it helpful.

@article{wang2022OFA,
  title={Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework},
  author={Wang, Peng and Yang, An and Men, Rui and Lin, Junyang and Bai, Shuai and Li, Zhikang and Ma, Jianxin and Zhou, Chang and Zhou, Jingren and Yang, Hongxia},
  journal={arXiv e-prints},
  pages={arXiv--2202},
  year={2022}
}

Related Codebase

License

Apache-2.0