ViTSTR is a simple single-stage model that uses a pre-trained Vision Transformer (ViT) to perform Scene Text Recognition (STR). Its accuracy is comparable to that of state-of-the-art STR models, although it uses significantly fewer parameters and FLOPS. ViTSTR is also fast due to the parallel computation inherent in the ViT architecture.


ViTSTR is built using a fork of the CLOVA AI Deep Text Recognition Benchmark, whose original documentation is at the bottom. Below we document how to train and evaluate ViTSTR-Tiny and ViTSTR-Small.

Install requirements

pip3 install -r requirements.txt


Download lmdb dataset. See CLOVA AI original documentation below.

Quick validation using a pre-trained model


CUDA_VISIBLE_DEVICES=0 python3 --eval_data data_lmdb_release/evaluation 
--benchmark_all_eval --Transformation None --FeatureExtraction None 
--SequenceModeling None --Prediction None --Transformer
--sensitive --data_filtering_off  --imgH 224 --imgW 224
--TransformerModel=vitstr_small_patch16_224 --saved_model 

Available model weights:

| Tiny | Small | Base |
| --- | --- | --- |
| vitstr_tiny_patch16_224 | vitstr_small_patch16_224 | vitstr_base_patch16_224 |
| ViTSTR-Tiny | ViTSTR-Small | ViTSTR-Base |
| ViTSTR-Tiny+Aug | ViTSTR-Small+Aug | ViTSTR-Base+Aug |

Benchmarks (top-1 accuracy, %)

| Model | IIIT 3000 | SVT 647 | IC03 860 | IC03 867 | IC13 857 | IC13 1015 | IC15 1811 | IC15 2077 | SVTP 645 | CT 288 | Acc (%) | Std (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TRBA (Baseline) | 87.7 | 87.4 | 94.5 | 94.2 | 93.4 | 92.1 | 77.3 | 71.6 | 78.1 | 75.5 | 84.3 | 0.1 |
| ViTSTR-Tiny | 83.7 | 83.2 | 92.8 | 92.5 | 90.8 | 89.3 | 72.0 | 66.4 | 74.5 | 65.0 | 80.3 | 0.2 |
| ViTSTR-Tiny+Aug | 85.1 | 85.0 | 93.4 | 93.2 | 90.9 | 89.7 | 74.7 | 68.9 | 78.3 | 74.2 | 82.1 | 0.1 |
| ViTSTR-Small | 85.6 | 85.3 | 93.9 | 93.6 | 91.7 | 90.6 | 75.3 | 69.5 | 78.1 | 71.3 | 82.6 | 0.3 |
| ViTSTR-Small+Aug | 86.6 | 87.3 | 94.2 | 94.2 | 92.1 | 91.2 | 77.9 | 71.7 | 81.4 | 77.9 | 84.2 | 0.1 |
| ViTSTR-Base | 86.9 | 87.2 | 93.8 | 93.4 | 92.1 | 91.3 | 76.8 | 71.1 | 80.0 | 74.7 | 83.7 | 0.1 |
| ViTSTR-Base+Aug | 88.4 | 87.7 | 94.7 | 94.3 | 93.2 | 92.4 | 78.5 | 72.6 | 81.8 | 81.3 | 85.2 | 0.1 |
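
The Acc column appears to be an average over the ten evaluation subsets weighted by the number of test images in each (the sample counts listed with each dataset name in the table header), rather than a plain mean of the columns. This is an inference from the numbers, not something the table states. The sketch below checks it for the ViTSTR-Tiny row; the function and constant names are illustrative only.

```python
# Sample counts per evaluation subset, copied from the table header.
SUBSET_SIZES = [3000, 647, 860, 867, 857, 1015, 1811, 2077, 645, 288]

# Per-subset accuracy (%) of ViTSTR-Tiny, copied from the table.
VITSTR_TINY = [83.7, 83.2, 92.8, 92.5, 90.8, 89.3, 72.0, 66.4, 74.5, 65.0]


def weighted_accuracy(accuracies, sizes):
    """Average per-subset accuracies, weighting each by its sample count."""
    if len(accuracies) != len(sizes):
        raise ValueError("need exactly one accuracy per subset")
    total = sum(sizes)
    return sum(acc * n for acc, n in zip(accuracies, sizes)) / total


if __name__ == "__main__":
    weighted = weighted_accuracy(VITSTR_TINY, SUBSET_SIZES)
    plain = sum(VITSTR_TINY) / len(VITSTR_TINY)
    # The weighted value is the one that agrees with the reported 80.3.
    print(f"weighted: {weighted:.1f}, plain mean: {plain:.1f}")
```

Because the per-subset values in the table are already rounded, the recomputed figure can only be expected to agree with the Acc column to about one decimal place.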

Comparison with other STR models

Accuracy vs Number of Parameters

[Figure: Acc vs Parameters]

Accuracy vs Speed (2080Ti GPU)

[Figure: Acc vs Speed]

Accuracy vs FLOPS

[Figure: Acc vs FLOPS]


ViTSTR-Tiny without data augmentation


CUDA_VISIBLE_DEVICES=0 python3 --train_data data_lmdb_release/training
--valid_data data_lmdb_release/evaluation --select_data MJ-ST 
--batch_ratio 0.5-0.5 --Transformation None --FeatureExtraction None 
--SequenceModeling None --Prediction None --Transformer 
--TransformerModel=vitstr_tiny_patch16_224 --imgH 224 --imgW 224 
--manualSeed=$RANDOM  --sensitive

Multi-GPU training

ViTSTR-Small on a 4-GPU machine

It is recommended to train larger networks like ViTSTR-Small and ViTSTR-Base on a multi-GPU machine. To keep a fixed batch size at 192, use the --batch_size option. Divide 192 by the number of GPUs. For example, to train ViTSTR-Small on a 4-GPU machine, this would be --batch_size=48.

python3 --train_data data_lmdb_release/training 
--valid_data data_lmdb_release/evaluation --select_data MJ-ST 
--batch_ratio 0.5-0.5 --Transformation None --FeatureExtraction None 
--SequenceModeling None --Prediction None --Transformer 
--TransformerModel=vitstr_small_patch16_224 --imgH 224 --imgW 224 
--manualSeed=$RANDOM --sensitive --batch_size=48
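
The per-GPU batch size arithmetic described above can be sketched as a small helper. The function name is hypothetical and not part of the repository; it only reproduces the division of the fixed total batch size of 192 by the number of GPUs.

```python
def per_gpu_batch_size(total_batch_size: int, num_gpus: int) -> int:
    """Return the value to pass to --batch_size so the total stays fixed."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if total_batch_size % num_gpus != 0:
        # An uneven split would silently change the effective batch size.
        raise ValueError("total batch size must be divisible by num_gpus")
    return total_batch_size // num_gpus


if __name__ == "__main__":
    print(per_gpu_batch_size(192, 4))  # 48, as in the 4-GPU example above
```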

Data augmentation

ViTSTR-Tiny using rand augment

It is recommended to use more workers (e.g., 32 instead of the default of 4) since the data augmentation process is CPU intensive. As a rule of thumb, set the number of workers to between 25% and 50% of the total number of CPU cores. For example, on a system with 64 CPU cores, setting the number of workers to 32 uses 50% of all cores. On multi-GPU systems, the number of workers must be divided by the number of GPUs. For example, for 32 workers in a 4-GPU system, use --workers=8. For convenience, simply use --workers=-1, and 50% of all cores will be used. Lastly, instead of using a constant learning rate, a cosine scheduler improves the performance of the model during training.
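
The rule of thumb for workers can be written out as a short calculation. This is a sketch of the arithmetic in the paragraph above, not the repository's own handling of --workers=-1; the function name and the default fraction of 0.5 are assumptions made for illustration.

```python
import os


def workers_per_gpu(cpu_cores: int, num_gpus: int, fraction: float = 0.5) -> int:
    """Suggest a --workers value: a fraction of all cores, split across GPUs."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if cpu_cores < 1 or num_gpus < 1:
        raise ValueError("cpu_cores and num_gpus must be at least 1")
    total_workers = int(cpu_cores * fraction)
    # Always keep at least one worker per GPU.
    return max(1, total_workers // num_gpus)


if __name__ == "__main__":
    print(workers_per_gpu(64, 4))  # 8: 50% of 64 cores, divided by 4 GPUs
    # On the current machine; os.cpu_count() may return None, hence the fallback.
    print(workers_per_gpu(os.cpu_count() or 1, 1))
```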

Below is a sample configuration for a 4-GPU system using batch size of 192.

python3 --train_data data_lmdb_release/training
--valid_data data_lmdb_release/evaluation --select_data MJ-ST 
--batch_ratio 0.5-0.5 --Transformation None --FeatureExtraction None 
--SequenceModeling None --Prediction None --Transformer 
--TransformerModel=vitstr_tiny_patch16_224 --imgH 224 --imgW 224 
--manualSeed=$RANDOM  --sensitive
--batch_size=48 --isrand_aug --workers=-1 --scheduler
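
For readers unfamiliar with cosine scheduling, the sketch below shows the standard cosine annealing formula in plain Python. How the --scheduler flag is implemented is not described in this document, so treat this as an illustration of the general technique only; the base learning rate, step counts, and function name are hypothetical.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float, min_lr: float = 0.0) -> float:
    """Learning rate at `step`, decaying from base_lr to min_lr along a half cosine."""
    if total_steps < 1:
        raise ValueError("total_steps must be at least 1")
    # Clamp so that steps outside the schedule stay at the end points.
    step = min(max(step, 0), total_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return min_lr + (base_lr - min_lr) * cosine


if __name__ == "__main__":
    # Hypothetical schedule: 1000 steps starting from a learning rate of 1.0.
    for step in (0, 250, 500, 750, 1000):
        print(step, round(cosine_lr(step, 1000, base_lr=1.0), 4))
```

The rate falls slowly at first, fastest around the midpoint, and flattens again near the end, in contrast to a constant learning rate.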


ViTSTR-Tiny. Find the path to best_accuracy.pth checkpoint file (usually in saved_model folder).

CUDA_VISIBLE_DEVICES=0 python3 --eval_data data_lmdb_release/evaluation 
--benchmark_all_eval --Transformation None --FeatureExtraction None  
--SequenceModeling None --Prediction None --Transformer 
--sensitive --data_filtering_off  --imgH 224 --imgW 224
--saved_model <path_to/best_accuracy.pth>


If you find this work useful, please cite:

  title={Vision Transformer for Fast and Efficient Scene Text Recognition},
  author={Atienza, Rowel},
  booktitle = {International Conference on Document Analysis and Recognition (ICDAR)},

What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis

| paper | training and evaluation data | failure cases and cleansed label | pretrained model | Baidu ver(passwd:rryk) |

Official PyTorch implementation of our four-stage STR framework, that most existing STR models fit into.

Using this framework, the module-wise contributions to performance in terms of accuracy, speed, and memory demand can be analyzed under one consistent set of training and evaluation datasets.

Such analyses remove the obstacles in current comparisons that make it hard to understand the performance gain of the existing modules.


Based on this framework, we took 1st place in ICDAR2013 focused scene text and ICDAR2019 ArT, and 3rd place in ICDAR2017 COCO-Text and ICDAR2019 ReCTS (task1).

The difference between our paper and ICDAR challenge is summarized here.


Aug 3, 2020: added guideline to use Baidu warpctc which reproduces CTC results of our paper.

Dec 27, 2019: added FLOPS in our paper, and minor updates such as log_dataset.txt and ICDAR2019-NormalizedED.

Oct 22, 2019: added confidence score, and arranged the output form of training logs.

Jul 31, 2019: The paper is accepted at International Conference on Computer Vision (ICCV), Seoul 2019, as an oral talk.

Jul 25, 2019: For code for 16-bit floating-point calculation, check @YacobBY's pull request

Jul 16, 2019: added a dataset of word images that contain special characters from the SynthText (ST) dataset, see this issue

Jun 24, 2019: added gt.txt of failure cases that contains path and label of each image, see

May 17, 2019: uploaded resources in Baidu Netdisk also, added Run demo. (check @sharavsambuu's colab demo also)

May 9, 2019: PyTorch version updated from 1.0.1 to 1.1.0, use torch.nn.CTCLoss instead of torch-baidu-ctc, and various minor updates.

Getting Started


  • This work was tested with PyTorch 1.3.1, CUDA 10.1, Python 3.6 and Ubuntu 16.04.
    You may need pip3 install torch==1.3.1.

    In the paper, experiments were performed with PyTorch 0.4.1, CUDA 9.0.
  • requirements : lmdb, pillow, torchvision, nltk, natsort
    pip3 install lmdb pillow torchvision nltk natsort

Download the lmdb dataset for training and evaluation from here. It contains the following.

training datasets : MJSynth (MJ)[1] and SynthText (ST)[2]
validation datasets : the union of the training sets IC13[3], IC15[4], IIIT[5], and SVT[6].
evaluation datasets : benchmark evaluation datasets, consisting of IIIT[5], SVT[6], IC03[7], IC13[3], IC15[4], SVTP[8], and CUTE[9].

Run demo with pretrained model

  1. Download pretrained model from here
  2. Add image files to test into demo_image/
  3. Run (add --sensitive option if you use case-sensitive model)
--Transformation TPS --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction Attn \
--image_folder demo_image/ \
--saved_model TPS-ResNet-BiLSTM-Attn.pth

prediction results

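| demo images | TRBA (TPS-ResNet-BiLSTM-Attn) | TRBA (case-sensitive version) |
| --- | --- | --- |
| [image] | available | Available |
| [image] | shakeshack | SHARESHACK |
| [image] | london | Londen |
| [image] | greenstead | Greenstead |
| [image] | toast | TOAST |
| [image] | merry | MERRY |
| [image] | underground | underground |
| [image] | ronaldo | RONALDO |
| [image] | bally | BALLY |
| [image] | university | UNIVERSITY |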

Training and evaluation

  1. Train CRNN[10] model
--train_data data_lmdb_release/training --valid_data data_lmdb_release/validation \
--select_data MJ-ST --batch_ratio 0.5-0.5 \
--Transformation None --FeatureExtraction VGG --SequenceModeling BiLSTM --Prediction CTC
  2. Test CRNN[10] model. If you want to evaluate IC15-2077, check data filtering part.
--eval_data data_lmdb_release/evaluation --benchmark_all_eval \
--Transformation None --FeatureExtraction VGG --SequenceModeling BiLSTM --Prediction CTC \
--saved_model saved_models/None-VGG-BiLSTM-CTC-Seed1111/best_accuracy.pth
  3. Try to train and test our best accuracy model TRBA (TPS-ResNet-BiLSTM-Attn) also. (download pretrained model)
--train_data data_lmdb_release/training --valid_data data_lmdb_release/validation \
--select_data MJ-ST --batch_ratio 0.5-0.5 \
--Transformation TPS --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction Attn
--eval_data data_lmdb_release/evaluation --benchmark_all_eval \
--Transformation TPS --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction Attn \
--saved_model saved_models/TPS-ResNet-BiLSTM-Attn-Seed1111/best_accuracy.pth


  • --train_data: folder path to training lmdb dataset.
  • --valid_data: folder path to validation lmdb dataset.
  • --eval_data: folder path to evaluation lmdb dataset.
  • --select_data: select training data. default is MJ-ST, which means MJ and ST are used as training data.
  • --batch_ratio: assign ratio for each selected data in the batch. default is 0.5-0.5, which means 50% of the batch is filled with MJ and the other 50% of the batch is filled with ST.
  • --data_filtering_off: skip data filtering when creating LmdbDataset.
  • --Transformation: select Transformation module [None | TPS].
  • --FeatureExtraction: select FeatureExtraction module [VGG | RCNN | ResNet].
  • --SequenceModeling: select SequenceModeling module [None | BiLSTM].
  • --Prediction: select Prediction module [CTC | Attn].
  • --saved_model: assign saved model to evaluation.
  • --benchmark_all_eval: evaluate with 10 evaluation dataset versions, the same as Table 1 in our paper.
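
To make the interaction of --select_data and --batch_ratio concrete, the sketch below computes how many samples each selected dataset would contribute to one batch. It is an illustration of the description above, not the repository's code: the function name is hypothetical, and the rounding and the minimum of one sample per dataset are assumptions.

```python
def split_batch(batch_size: int, select_data: str = "MJ-ST",
                batch_ratio: str = "0.5-0.5") -> dict:
    """Map each selected dataset to its share of one training batch."""
    names = select_data.split("-")
    ratios = [float(r) for r in batch_ratio.split("-")]
    if len(names) != len(ratios):
        raise ValueError("--select_data and --batch_ratio need the same number of entries")
    # Assumption: shares are rounded and every dataset gets at least one sample.
    return {name: max(round(batch_size * ratio), 1)
            for name, ratio in zip(names, ratios)}


if __name__ == "__main__":
    print(split_batch(192))  # {'MJ': 96, 'ST': 96}
```

With the defaults, a batch of 192 is filled with 96 MJ samples and 96 ST samples, matching the 50%/50% description of --batch_ratio.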

Download failure cases and cleansed labels from here. The download contains failure case images and benchmark evaluation images with cleansed labels.

When you need to train on your own dataset or Non-Latin language datasets.

  1. Create your own lmdb dataset.
pip3 install fire
python3 --inputPath data/ --gtFile data/gt.txt --outputPath result/

The data folder is structured as below.

├── gt.txt
└── test
    ├── word_1.png
    ├── word_2.png
    ├── word_3.png
    └── ...

Each line of gt.txt should have the form {imagepath}\t{label}\n

For example

test/word_1.png Tiredness
test/word_2.png kills
test/word_3.png A
  2. Modify --select_data, --batch_ratio, and opt.character, see this issue.
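
A ground-truth file in the {imagepath}\t{label}\n format can be produced with a few lines of Python. This is a minimal sketch under the assumption that image paths and labels contain no tabs or newlines; the helper names are hypothetical and not part of the repository.

```python
def format_gt_line(image_path: str, label: str) -> str:
    """Build one gt.txt line: image path, a tab, the label, a newline."""
    for field in (image_path, label):
        if "\t" in field or "\n" in field:
            raise ValueError("tabs and newlines would break the gt.txt format")
    return f"{image_path}\t{label}\n"


def parse_gt_line(line: str) -> tuple:
    """Split one gt.txt line back into (image_path, label)."""
    image_path, label = line.rstrip("\n").split("\t", 1)
    return image_path, label


def write_gt_file(gt_path: str, samples) -> None:
    """Write (image_path, label) pairs to gt_path, one per line."""
    with open(gt_path, "w", encoding="utf-8") as gt_file:
        for image_path, label in samples:
            gt_file.write(format_gt_line(image_path, label))


if __name__ == "__main__":
    samples = [("test/word_1.png", "Tiredness"),
               ("test/word_2.png", "kills"),
               ("test/word_3.png", "A")]
    for image_path, label in samples:
        # repr() makes the tab and newline separators visible.
        print(repr(format_gt_line(image_path, label)))
```

Calling write_gt_file("data/gt.txt", samples) would then create the file expected by the --gtFile option shown above. Note that the separator is a tab character, even where the example listing displays it as a space.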


This implementation is based on the repositories crnn.pytorch and ocr_attention.


[1] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. In Workshop on Deep Learning, NIPS, 2014.

[2] A. Gupta, A. Vedaldi, and A. Zisserman. Synthetic data for text localisation in natural images. In CVPR, 2016.

[3] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, and L. P. De Las Heras. ICDAR 2013 robust reading competition. In ICDAR, pages 1484–1493, 2013.

[4] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, et al. ICDAR 2015 competition on robust reading. In ICDAR, pages 1156–1160, 2015.

[5] A. Mishra, K. Alahari, and C. Jawahar. Scene text recognition using higher order language priors. In BMVC, 2012.

[6] K. Wang, B. Babenko, and S. Belongie. End-to-end scene text recognition. In ICCV, pages 1457–1464, 2011.

[7] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young. ICDAR 2003 robust reading competitions. In ICDAR, pages 682–687, 2003.

[8] T. Q. Phan, P. Shivakumara, S. Tian, and C. L. Tan. Recognizing text with perspective distortion in natural scenes. In ICCV, pages 569–576, 2013.

[9] A. Risnumawan, P. Shivakumara, C. S. Chan, and C. L. Tan. A robust arbitrary text detection system for natural scene images. In ESWA, volume 41, pages 8027–8048, 2014.

[10] B. Shi, X. Bai, and C. Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. In TPAMI, volume 39, pages 2298–2304, 2017.


Please consider citing this work in your publications if it helps your research.

  title={What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis},
  author={Baek, Jeonghun and Kim, Geewook and Lee, Junyeop and Park, Sungrae and Han, Dongyoon and Yun, Sangdoo and Oh, Seong Joon and Lee, Hwalsuk},
  booktitle = {International Conference on Computer Vision (ICCV)},


Feel free to contact us if you have any questions:

for code/paper Jeonghun Baek [email protected]; for collaboration [email protected] (our team leader).