
Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers

The code is coming soon.


Figure 1: Pipeline of token-based pre-training.


Figure 2: The visualization of the proposed 5 tasks.
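The five pre-training tasks corrupt the input image and ask the model to recover token-level targets. As an illustration only (the function names, factors, and NumPy-based implementations below are our own assumptions, not the paper's released code), the five corruptions can be sketched as simple array operations:

```python
import numpy as np

# Hedged sketch of the five corruption targets: zoomed-in, zoomed-out,
# distorted, blurred, de-colorized. All parameters are illustrative guesses.

def zoom_in(img, factor=2):
    """Crop the central 1/factor region, then nearest-neighbor upsample."""
    h, w = img.shape[:2]
    ch, cw = h // factor, w // factor
    top, left = (h - ch) // 2, (w - cw) // 2
    crop = img[top:top + ch, left:left + cw]
    return np.repeat(np.repeat(crop, factor, axis=0), factor, axis=1)

def zoom_out(img, factor=2):
    """Downsample, then pad back to the original size with zeros."""
    small = img[::factor, ::factor]
    out = np.zeros_like(img)
    h, w = small.shape[:2]
    top, left = (img.shape[0] - h) // 2, (img.shape[1] - w) // 2
    out[top:top + h, left:left + w] = small
    return out

def distort(img, rng, shift_max=2):
    """Toy geometric distortion: roll each row by a random horizontal shift."""
    shifts = rng.integers(-shift_max, shift_max + 1, size=img.shape[0])
    return np.stack([np.roll(row, s, axis=0) for row, s in zip(img, shifts)])

def blur(img):
    """3x3 box blur built from shifted copies of the image."""
    acc = np.zeros(img.shape, dtype=float)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            acc += np.roll(np.roll(img.astype(float), dy, 0), dx, 1)
    return (acc / 9.0).astype(img.dtype)

def decolorize(img):
    """Replace RGB with the per-pixel channel mean (grayscale)."""
    g = img.mean(axis=2, keepdims=True)
    return np.repeat(g, 3, axis=2).astype(img.dtype)
```

Each corruption preserves the spatial grid, so the same tokenizer and decoder head can be reused across all five tasks.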

Main Results

All results are obtained after 300 epochs of pre-training, with ViT-Base as the default backbone.

| Task     | zoomed-in | zoomed-out | distorted | blurred | de-colorized |
|----------|-----------|------------|-----------|---------|--------------|
| finetune | 82.7      | 82.5       | 82.1      | 81.8    | 81.4         |

| Task     | zoomed-in (a) | mask (m) | (a)+(m) |
|----------|---------------|----------|---------|
| finetune | 82.7          | 82.9     | 83.2    |

We note that the integrated version does not require any extra computational cost.
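The integration of the zoomed-in target (a) with masking (m) can be applied as a single corruption pass over each image, which is why it adds no extra compute. A minimal self-contained sketch, assuming our own patch size, mask ratio, and function name (none of these come from the paper's code):

```python
import numpy as np

def zoom_in_and_mask(img, patch=8, mask_ratio=0.5, rng=None):
    """Hedged sketch: apply the zoomed-in corruption (a), then zero out a
    random subset of non-overlapping patches (m), in one pass."""
    rng = rng if rng is not None else np.random.default_rng()
    h, w = img.shape[:2]
    # (a) zoom in: crop the central half and nearest-neighbor upsample by 2x
    crop = img[h // 4:h // 4 + h // 2, w // 4:w // 4 + w // 2]
    out = np.repeat(np.repeat(crop, 2, axis=0), 2, axis=1)
    # (m) mask: zero a random mask_ratio fraction of the patch grid
    gh, gw = h // patch, w // patch
    chosen = rng.permutation(gh * gw)[: int(gh * gw * mask_ratio)]
    for idx in chosen:
        r, c = divmod(int(idx), gw)
        out[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = 0
    return out
```

Because both corruptions act on the same tensor before tokenization, the model still performs exactly one forward pass per image.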



Figure 3: Efficiency of the integrated task.

