MetaFormer Baselines for Vision

This is a PyTorch implementation of several MetaFormer baselines, including IdentityFormer, RandFormer, ConvFormer and CAFormer, proposed in our paper “MetaFormer Baselines for Vision”.

Figure 1: Performance of MetaFormer baselines and other state-of-the-art models on ImageNet-1K at 224×224 resolution. The architectures of our proposed models are shown in Figure 2. (a) IdentityFormer/RandFormer achieve over 80%/81% accuracy, indicating that MetaFormer has a solid lower bound of performance and works well with arbitrary token mixers. The accuracy of the well-trained ResNet-50 is from “ResNet strikes back”. (b) Without novel token mixers, the pure CNN-based ConvFormer outperforms ConvNeXt, while CAFormer sets a new record of 85.5% accuracy on ImageNet-1K at 224×224 resolution under normal supervised training without external data or distillation.

Figure 2: (a-d) Overall frameworks of IdentityFormer, RandFormer, ConvFormer and CAFormer. Similar to ResNet, the models adopt a hierarchical architecture of 4 stages, and stage $i$ has $L_i$ blocks with feature dimension $D_i$. Each downsampling module is implemented by a convolution layer. The first downsampling has a kernel size of 7 and a stride of 4, while the last three have a kernel size of 3 and a stride of 2. (e-h) Architectures of IdentityFormer, RandFormer, ConvFormer and Transformer blocks, which use identity mapping, global random mixing, separable depthwise convolutions, or vanilla self-attention as the token mixer, respectively.
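The downsampling schedule in the caption fixes the spatial resolution of each stage. The following is a minimal sketch of that arithmetic in plain Python. The kernel sizes and strides come from the caption; the padding values (2 for the first layer, 1 for the others) are an assumption, chosen so that each layer divides the resolution exactly by its stride, and the function names are illustrative only.

```python
from typing import List

def conv_out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1

# (kernel, stride, padding) per downsampling layer.
# Kernel and stride follow the Figure 2 caption; padding is assumed.
DOWNSAMPLING = [
    (7, 4, 2),  # first downsampling
    (3, 2, 1),  # last three downsamplings
    (3, 2, 1),
    (3, 2, 1),
]

def stage_resolutions(input_size: int = 224) -> List[int]:
    """Feature-map side length at the output of each of the 4 stages."""
    sizes = []
    size = input_size
    for kernel, stride, padding in DOWNSAMPLING:
        size = conv_out_size(size, kernel, stride, padding)
        sizes.append(size)
    return sizes

print(stage_resolutions(224))  # [56, 28, 14, 7]
```

Under these assumptions a 224×224 input yields 56, 28, 14 and 7 pixels per side across the four stages, and a 384×384 input yields 96, 48, 24 and 12.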



  title={MetaFormer Baselines for Vision},
  author={Yu, Weihao and Si, Chenyang and Zhou, Pan and Luo, Mi and Zhou, Yichen and Feng, Jiashi and Yan, Shuicheng and Wang, Xinchao},
  journal={arXiv preprint arXiv:2210.13452},


Requirements: torch>=1.7.0; torchvision>=0.8.0; pyyaml; timm (pip install timm==0.6.11)

Data preparation: arrange ImageNet in the following folder structure. You can extract ImageNet with this script.

│imagenet/
├──train/
│  ├── n01440764
│  │   ├── n01440764_10026.JPEG
│  │   ├── n01440764_10027.JPEG
│  │   ├── ......
│  ├── ......
├──val/
│  ├── n01440764
│  │   ├── ILSVRC2012_val_00000293.JPEG
│  │   ├── ILSVRC2012_val_00002138.JPEG
│  │   ├── ......
│  ├── ......
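Before launching a long run, it can help to confirm that the dataset matches this layout. The sketch below is one possible check and is not part of the repository: it assumes the two groups of class folders sit under split directories named `train` and `val`, and the helper name `count_images` is hypothetical.

```python
from pathlib import Path
from typing import Dict

def count_images(root: str) -> Dict[str, Dict[str, int]]:
    """Map each split to {class folder name: number of .JPEG files}."""
    root_path = Path(root)
    summary: Dict[str, Dict[str, int]] = {}
    for split in ("train", "val"):  # assumed split directory names
        split_dir = root_path / split
        if not split_dir.is_dir():
            raise FileNotFoundError(f"missing split directory: {split_dir}")
        summary[split] = {
            class_dir.name: sum(1 for _ in class_dir.glob("*.JPEG"))
            for class_dir in sorted(split_dir.iterdir())
            if class_dir.is_dir()
        }
    return summary

# Example usage (path is a placeholder):
# summary = count_images("/path/to/imagenet")
# print(len(summary["train"]), "training classes")
```

A class folder that reports zero images, or a split with an unexpected number of class folders, usually points to an incomplete extraction.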

MetaFormer baselines

Models with common token mixers trained and finetuned (at 384) on ImageNet

| Model | Resolution | Params | MACs | Top1 Acc | Download |
| --- | --- | --- | --- | --- | --- |
| caformer_s18 | 224 | 26M | 4.1G | 83.6 | here |
| caformer_s18_384 | 384 | 26M | 13.4G | 85.0 | here |
| caformer_s36 | 224 | 39M | 8.0G | 84.5 | here |
| caformer_s36_384 | 384 | 39M | 26.0G | 85.7 | here |
| caformer_m36 | 224 | 56M | 13.2G | 85.2 | here |
| caformer_m36_384 | 384 | 56M | 42.0G | 86.2 | here |
| caformer_b36 | 224 | 99M | 23.2G | 85.5 | here |
| caformer_b36_384 | 384 | 99M | 72.2G | 86.4 | here |
| convformer_s18 | 224 | 27M | 3.9G | 83.0 | here |
| convformer_s18_384 | 384 | 27M | 11.6G | 84.4 | here |
| convformer_s36 | 224 | 40M | 7.6G | 84.1 | here |
| convformer_s36_384 | 384 | 40M | 22.4G | 85.4 | here |
| convformer_m36 | 224 | 57M | 12.8G | 84.5 | here |
| convformer_m36_384 | 384 | 57M | 37.7G | 85.6 | here |
| convformer_b36 | 224 | 100M | 22.6G | 84.8 | here |
| convformer_b36_384 | 384 | 100M | 66.5G | 85.7 | here |

Models with common token mixers pretrained on ImageNet-21k and finetuned on ImageNet-1K

| Model | Resolution | Params | MACs | Top1 Acc | Download |
| --- | --- | --- | --- | --- | --- |
| caformer_b36_in21ft1k | 224 | 99M | 23.2G | 87.4 | here |
| caformer_b36_384_in21ft1k | 384 | 99M | 72.2G | 88.1 | here |
| convformer_b36_in21ft1k | 224 | 100M | 22.6G | 87.0 | here |
| convformer_b36_384_in21ft1k | 384 | 100M | 66.5G | 87.6 | here |

Models with common token mixers pretrained on ImageNet-21k

| Model | Resolution | Params | MACs | Download |
| --- | --- | --- | --- | --- |
| caformer_b36_in21k | 224 | 99M | 23.2G | here |
| convformer_b36_in21k | 224 | 100M | 22.6G | here |

Models with basic token mixers trained on ImageNet-1K

| Model | Resolution | Params | MACs | Top1 Acc | Download |
| --- | --- | --- | --- | --- | --- |
| identityformer_s12 | 224 | 11.9M | 1.8G | 74.6 | here |
| identityformer_s24 | 224 | 21.3M | 3.4G | 78.2 | here |
| identityformer_s36 | 224 | 30.8M | 5.0G | 79.3 | here |
| identityformer_m36 | 224 | 56.1M | 8.8G | 80.0 | here |
| identityformer_m48 | 224 | 73.3M | 11.5G | 80.4 | here |
| randformer_s12 | 224 | 11.9 + 0.2M | 1.9G | 74.6 | here |
| randformer_s24 | 224 | 21.3 + 0.5M | 3.5G | 78.2 | here |
| randformer_s36 | 224 | 30.8 + 0.7M | 5.2G | 79.3 | here |
| randformer_m36 | 224 | 56.1 + 0.7M | 9.0G | 80.0 | here |
| randformer_m48 | 224 | 73.3 + 0.9M | 11.9G | 80.4 | here |
| poolformerv2_s12 | 224 | 11.9M | 1.8G | 78.0 | here |
| poolformerv2_s24 | 224 | 21.3M | 3.4G | 80.7 | here |
| poolformerv2_s36 | 224 | 30.8M | 5.0G | 81.6 | here |
| poolformerv2_m36 | 224 | 56.1M | 8.8G | 82.2 | here |
| poolformerv2_m48 | 224 | 73.3M | 11.5G | 82.6 | here |

For the RandFormer models, the second number in the Params column (after the "+") is the number of parameters that are frozen after random initialization.
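As a rough sanity check on the frozen-parameter counts, the sketch below counts the entries of one fixed N×N mixing matrix per block, where N is the number of tokens. Everything specific here is an assumption rather than something stated above: that random mixing is used only in the last two stages, that those stages see 14×14 and 7×7 tokens at 224×224 input, and that they contain 6 and 2 blocks. The helper name is hypothetical.

```python
from typing import Sequence

def frozen_mixing_params(depths: Sequence[int], resolutions: Sequence[int]) -> int:
    """Total entries of per-block N x N mixing matrices, with N = resolution ** 2."""
    return sum(
        depth * (res * res) ** 2
        for depth, res in zip(depths, resolutions)
    )

# Assumed configuration: 6 blocks on 14x14 tokens and 2 blocks on 7x7 tokens.
print(frozen_mixing_params(depths=[6, 2], resolutions=[14, 7]))  # 235298
```

Under these assumptions the total is 235,298 entries, which is about 0.2M and in line with the smallest RandFormer row.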


To evaluate a model, set MODEL to one of the model names listed above and run:

python3 validate.py /path/to/imagenet --model $MODEL -b 128 \
  --checkpoint /path/to/checkpoint


We use a batch size of 4096 by default and show how to train models with 8 GPUs. For multi-node training, adjust --grad-accum-steps according to your setup.

CODE_PATH=/path/to/code/metaformer # modify code path here

GRAD_ACCUM_STEPS=4 # Adjust according to your number of GPUs and memory size.

--model convformer_s18 --opt adamw --lr 4e-3 --warmup-epochs 20 \
-b $BATCH_SIZE --grad-accum-steps $GRAD_ACCUM_STEPS \
--drop-path 0.2 --head-dropout 0.0
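The per-GPU batch size passed with `-b` follows from the total batch size, the number of GPUs and the gradient accumulation steps. A minimal sketch of that arithmetic is shown below, assuming the total batch size is the product of the three quantities; the function name is illustrative and not part of the training scripts.

```python
def per_gpu_batch_size(total_batch_size: int, num_gpus: int, grad_accum_steps: int) -> int:
    """Per-GPU, per-step batch size that reproduces the total batch size."""
    denom = num_gpus * grad_accum_steps
    if total_batch_size % denom != 0:
        raise ValueError(
            f"total batch size {total_batch_size} is not divisible by "
            f"num_gpus * grad_accum_steps = {denom}"
        )
    return total_batch_size // denom

# Default setting described above: 4096 total, 8 GPUs, 4 accumulation steps.
print(per_gpu_batch_size(4096, 8, 4))  # 128
```

With two 8-GPU nodes, for example, halving the accumulation steps to 2 keeps the same per-GPU batch size of 128 and the same total of 4096.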

Training (fine-tuning) scripts of other models are shown in scripts.


Weihao Yu would like to thank the TPU Research Cloud (TRC) program for supporting part of the computational resources. Our implementation is mainly based on pytorch-image-models.

