TAble PArSing (TAPAS)

Code and checkpoints for training the transformer-based Table QA models introduced in the paper TAPAS: Weakly Supervised Table Parsing via Pre-training.



  • Added a colab to try predictions on open domain question answering.







  • Small change to WTQ training example creation
    • Questions with ambiguous cell matches will now be discarded
    • This improves denotation accuracy by ~1 point
    • For more details see this issue.
  • Added option to filter table columns by textual overlap with question



  • Added a colab to try predictions on WTQ


  • New pre-trained models (see Data section below)
  • reset_position_index_per_cell: New option for training models that reset the position index at the start of each cell instead of using absolute position indices.


  • Bump TensorFlow to v2.2



  • Added a colab to try predictions on SQA


The easiest way to try out TAPAS with free GPU/TPU is in our Colab, which shows how to do predictions on SQA.

The repository uses protocol buffers, and requires the protoc compiler to run. You can download the latest binary for your OS here. On Ubuntu/Debian, it can be installed with:

sudo apt-get install protobuf-compiler

Afterwards, clone and install the git repository:

git clone
cd tapas
pip install -e .

To run the test suite we use the tox library, which can be installed and run by calling:

pip install tox
tox


We provide pre-trained models for different model sizes.

The metrics are computed by our tool and are not the official metrics of the respective tasks. We provide them so that you can verify whether your own runs are in the right ballpark. They are medians over three individual runs.

Models with intermediate pre-training (2020/10/07).

New models based on the ideas discussed in Understanding tables with intermediate pre-training. Learn more about the methods used here.


Trained from Mask LM, intermediate data, SQA, WikiSQL.

Size Reset Dev Accuracy Link
LARGE noreset 0.5062
LARGE reset 0.5097
BASE noreset 0.4525
BASE reset 0.4638
MEDIUM noreset 0.4324
MEDIUM reset 0.4324
SMALL noreset 0.3681
SMALL reset 0.3762
MINI noreset 0.2783
MINI reset 0.2854
TINY noreset 0.0823
TINY reset 0.1039


Trained from Mask LM, intermediate data, SQA.

Size Reset Dev Accuracy Link
LARGE noreset 0.8948
LARGE reset 0.8979
BASE noreset 0.8859
BASE reset 0.8855
MEDIUM noreset 0.8766
MEDIUM reset 0.8773
SMALL noreset 0.8552
SMALL reset 0.8615
MINI noreset 0.8063
MINI reset 0.82
TINY noreset 0.3198
TINY reset 0.6046


Trained from Mask LM, intermediate data.

Size Reset Dev Accuracy Link
LARGE noreset 0.8101
LARGE reset 0.8159
BASE noreset 0.7856
BASE reset 0.7918
MEDIUM noreset 0.7585
MEDIUM reset 0.7587
SMALL noreset 0.7321
SMALL reset 0.7346
MINI noreset 0.6166
MINI reset 0.6845
TINY noreset 0.5425
TINY reset 0.5528


Trained from Mask LM, intermediate data.

Size Reset Dev Accuracy Link
LARGE noreset 0.7223
LARGE reset 0.7289
BASE noreset 0.6737
BASE reset 0.6874
MEDIUM noreset 0.6464
MEDIUM reset 0.6561
SMALL noreset 0.5876
SMALL reset 0.6155
MINI noreset 0.4574
MINI reset 0.5148
TINY noreset 0.2004
TINY reset 0.2375


Trained from Mask LM.

Size Reset Dev Accuracy Link
LARGE noreset 0.9309
LARGE reset 0.9317
BASE noreset 0.9134
BASE reset 0.9163
MEDIUM noreset 0.8988
MEDIUM reset 0.9005
SMALL noreset 0.8788
SMALL reset 0.8798
MINI noreset 0.8218
MINI reset 0.8333
TINY noreset 0.6359
TINY reset 0.6615

Small Models & position index reset (2020/08/08)

Based on the pre-trained checkpoints available at the BERT GitHub page. See the page or the paper for detailed information on the model dimensions.

Reset refers to whether the parameter reset_position_index_per_cell was set to true or false during training. In general it’s recommended to set it to true.
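To make the difference concrete, here is a minimal sketch that computes position indices for a flattened table under both settings. It is an illustration only, not the repository's implementation; the function name and the `cell_ids` input format are assumptions.

```python
from typing import List


def position_indices(cell_ids: List[int], reset_per_cell: bool) -> List[int]:
    """Return one position index per token.

    cell_ids[i] identifies the table cell that token i belongs to
    (hypothetical input format, chosen for this illustration).
    """
    positions = []
    position = 0
    for i, cell in enumerate(cell_ids):
        # With reset enabled, restart counting whenever a new cell begins.
        if reset_per_cell and i > 0 and cell != cell_ids[i - 1]:
            position = 0
        positions.append(position)
        position += 1
    return positions


# Three cells with 2, 3 and 1 tokens respectively.
cells = [0, 0, 1, 1, 1, 2]
absolute = position_indices(cells, reset_per_cell=False)
per_cell = position_indices(cells, reset_per_cell=True)
print(absolute)
print(per_cell)
```

With the reset, the index only ever grows as large as the longest cell, regardless of how many cells the table has.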

The accuracy depends on the respective task. It is denotation accuracy for WTQ and WIKISQL, average position accuracy (with gold labels for the previous answers) for SQA, and Mask-LM accuracy for Mask-LM.
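As a rough illustration of denotation accuracy, the sketch below counts an example as correct when the predicted set of answer strings equals the reference set. The normalization (lower-casing and stripping whitespace) is an assumption made here; the repository's tool and the official evaluation scripts may normalize values differently.

```python
from typing import List, Sequence


def _normalize(answer: Sequence[str]) -> frozenset:
    # Compare answers as sets of lower-cased, whitespace-stripped strings.
    return frozenset(a.strip().lower() for a in answer)


def denotation_accuracy(
    predictions: List[Sequence[str]], references: List[Sequence[str]]
) -> float:
    """Fraction of examples whose predicted answer set equals the reference set."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        _normalize(p) == _normalize(r) for p, r in zip(predictions, references)
    )
    return correct / len(references)


# Hypothetical predictions and references for three questions.
preds = [["Paris"], ["3"], ["Alice", "Bob"]]
golds = [["paris"], ["4"], ["Bob", "Alice"]]
accuracy = denotation_accuracy(preds, golds)
print(accuracy)
```

Only the answer values are compared, so a prediction counts as correct no matter which cells or aggregation produced it.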

The models were trained in a chain as indicated by the model name. For example, sqa_masklm means the model was first trained on the Mask-LM task and then on SQA. No distillation was performed.


Size Reset Dev Accuracy Link
LARGE noreset 0.4822
LARGE reset 0.4952
BASE noreset 0.4288
BASE reset 0.4433
MEDIUM noreset 0.4158
MEDIUM reset 0.4097
SMALL noreset 0.3267
SMALL reset 0.3670
MINI noreset 0.2275
MINI reset 0.2409
TINY noreset 0.0901
TINY reset 0.0947


Size Reset Dev Accuracy Link
LARGE noreset 0.8862
LARGE reset 0.8917
BASE noreset 0.8772
BASE reset 0.8809
MEDIUM noreset 0.8687
MEDIUM reset 0.8736
SMALL noreset 0.8285
SMALL reset 0.8550
MINI noreset 0.7672
MINI reset 0.7944
TINY noreset 0.3237
TINY reset 0.3608


Size Reset Dev Accuracy Link
LARGE noreset 0.7002
LARGE reset 0.7130
BASE noreset 0.6393
BASE reset 0.6689
MEDIUM noreset 0.6026
MEDIUM reset 0.6141
SMALL noreset 0.4976
SMALL reset 0.5589
MINI noreset 0.3779
MINI reset 0.3687
TINY noreset 0.2013
TINY reset 0.2194


Size Reset Dev Accuracy Link
LARGE noreset 0.7513
LARGE reset 0.7528
BASE noreset 0.7323
BASE reset 0.7335
MEDIUM noreset 0.7059
MEDIUM reset 0.7054
SMALL noreset 0.6818
SMALL reset 0.6856
MINI noreset 0.6382
MINI reset 0.6425
TINY noreset 0.4826
TINY reset 0.5282

Original Models

The pre-trained TAPAS checkpoints can be downloaded here:

The first two models are pre-trained on the Mask-LM task and the last two on the Mask-LM task first and SQA second.

Fine-Tuning Data

You also need to download the task data for the fine-tuning tasks:


Note that you can skip pre-training and just use one of the pre-trained checkpoints provided above.

Information about the pre-training data can be found here.

The TF examples for pre-training can be created using Google Dataflow:

python3 sdist
python3 tapas/ \
  --input_file="gs://tapas_models/2020_05_11/interactions.txtpb.gz" \
  --vocab_file="gs://tapas_models/2020_05_11/vocab.txt" \
  --output_dir="gs://your_bucket/output" \
  --runner_type="DATAFLOW" \
  --gc_project="your-project" \
  --gc_region="us-west1" \
  --gc_job_name="create-pretrain" \
  --gc_staging_location="gs://your_bucket/staging" \
  --gc_temp_location="gs://your_bucket/tmp" \

You can also run the pipeline locally but that will take a long time:

python3 tapas/ \
  --input_file="$data/interactions.txtpb.gz" \
  --output_dir="$data/" \
  --vocab_file="$data/vocab.txt" \

This will create two tfrecord files for training and testing. The pre-training can then be started with the command below. The init checkpoint should be a standard BERT checkpoint.

python3 tapas/experiments/ \
  --eval_batch_size=32 \
  --train_batch_size=512 \
  --tpu_iterations_per_loop=5000 \
  --num_eval_steps=100 \
  --save_checkpoints_steps=5000 \
  --num_train_examples=512000000 \
  --max_seq_length=128 \
  --input_file_train="${data}/train.tfrecord" \
  --input_file_eval="${data}/test.tfrecord" \
  --init_checkpoint="${tapas_data_dir}/model.ckpt" \
  --bert_config_file="${tapas_data_dir}/bert_config.json" \
  --model_dir="..." \
  --compression_type="" \

Here, compression_type should be set to GZIP if the tfrecords are compressed. You can start a separate eval job by setting --nodo_train --doeval.
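If you are unsure whether your tfrecord files are compressed, you can check for the gzip magic bytes before choosing the flag value. The helper below is a hypothetical convenience and not part of the repository; it uses only the Python standard library.

```python
import gzip
import os
import tempfile


def guess_compression_type(path: str) -> str:
    """Return "GZIP" if the file starts with the gzip magic bytes, else ""."""
    with open(path, "rb") as f:
        return "GZIP" if f.read(2) == b"\x1f\x8b" else ""


# Demonstration on two temporary files (stand-ins for real tfrecords).
with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "plain.tfrecord")
    gzip_path = os.path.join(tmp, "compressed.tfrecord")
    with open(plain_path, "wb") as f:
        f.write(b"not compressed")
    with gzip.open(gzip_path, "wb") as f:
        f.write(b"compressed payload")
    plain_flag = guess_compression_type(plain_path)
    gzip_flag = guess_compression_type(gzip_path)

print(repr(plain_flag), repr(gzip_flag))
```

The returned string can be passed directly as the value of --compression_type.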

Running a fine-tuning task

We need to create the TF examples before starting the training. For example, for SQA that would look like:

python3 tapas/ \
  --task="SQA" \
  --input_dir="${sqa_data_dir}" \
  --output_dir="${output_dir}" \
  --bert_vocab_file="${tapas_data_dir}/vocab.txt" \

Optionally, to handle big tables, we can add a --prune_columns flag to apply the HEM method described in Section 3.3 of our paper, which discards some columns based on textual overlap with the sentence.
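The general idea of pruning columns by textual overlap can be sketched as follows. This is a deliberately simplified token-overlap heuristic and not the HEM method from the paper; the function name, the table representation, and the `max_columns` parameter are assumptions made for illustration.

```python
import re
from typing import Dict, List


def _tokens(text: str) -> set:
    # Lower-cased word tokens.
    return set(re.findall(r"\w+", text.lower()))


def prune_columns(
    question: str, table: Dict[str, List[str]], max_columns: int
) -> List[str]:
    """Keep the columns whose header and cells share the most tokens with the question."""
    question_tokens = _tokens(question)

    def overlap(name: str) -> int:
        column_tokens = _tokens(name)
        for cell in table[name]:
            column_tokens |= _tokens(cell)
        return len(question_tokens & column_tokens)

    # sorted() is stable, so columns with equal overlap keep their table order.
    ranked = sorted(table, key=overlap, reverse=True)
    kept = set(ranked[:max_columns])
    # Return the surviving columns in their original order.
    return [name for name in table if name in kept]


table = {
    "Player": ["Alice", "Bob"],
    "Country": ["France", "Spain"],
    "Notes": ["injured", "captain"],
    "Goals": ["3", "5"],
}
question = "How many goals did the player from Spain score?"
kept_columns = prune_columns(question, table, max_columns=3)
print(kept_columns)
```

Here the "Notes" column shares no tokens with the question and is dropped, which shortens the flattened input sequence.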

Afterwards, training can be started by running:

python3 tapas/ \
  --task="SQA" \
  --output_dir="${output_dir}" \
  --init_checkpoint="${tapas_data_dir}/model.ckpt" \
  --bert_config_file="${tapas_data_dir}/bert_config.json" \
  --mode="train" \

This will use the preset hyper-parameters set in

It’s recommended to start a separate eval job to continuously produce predictions for the checkpoints created by the training job. Alternatively, you can run the eval job after training to only get the final results.

python3 tapas/ \
  --task="SQA" \
  --output_dir="${output_dir}" \
  --init_checkpoint="${tapas_data_dir}/model.ckpt" \
  --bert_config_file="${tapas_data_dir}/bert_config.json" \

Another tool to run experiments is It’s more flexible than but also requires setting all the hyper-parameters (via the respective command line flags).


Here we explain some details about different tasks.


By default, SQA will evaluate using the reference answers of the previous questions. The numbers in the paper (Table 5) are computed using the more realistic setup where the previous answers are model predictions. will output additional prediction files for this setup as well if run on GPU.
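The two evaluation setups differ only in what is fed back as the previous answer. The sketch below shows this with a toy stand-in model; `run_sequence`, `toy_model`, and the data are hypothetical and are not the repository's code.

```python
from typing import Callable, List, Optional

# A model maps (question, previous answer) to an answer.
Model = Callable[[str, Optional[str]], str]


def run_sequence(
    model: Model,
    questions: List[str],
    gold_answers: List[str],
    use_gold_previous: bool,
) -> List[str]:
    """Answer a question sequence, feeding back gold or predicted answers."""
    predictions = []
    previous = None
    for question, gold in zip(questions, gold_answers):
        prediction = model(question, previous)
        predictions.append(prediction)
        # Default setup: reference answer. Realistic setup: model prediction.
        previous = gold if use_gold_previous else prediction
    return predictions


def toy_model(question: str, previous: Optional[str]) -> str:
    # Toy stand-in: wrong on the first question, afterwards it extends
    # whatever previous answer it is given, so early errors propagate.
    if previous is None:
        return "wrong"
    return previous + "+"


questions = ["q1", "q2", "q3"]
golds = ["a", "a+", "a++"]
with_gold = run_sequence(toy_model, questions, golds, use_gold_previous=True)
with_pred = run_sequence(toy_model, questions, golds, use_gold_previous=False)
print(with_gold)
print(with_pred)
```

With reference answers the toy model recovers after its first mistake, whereas with its own predictions the mistake carries through the whole sequence. This is why the realistic setup typically yields lower numbers.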


For the official evaluation results one should convert the TAPAS predictions to the WTQ format and run the official evaluation script. This can be done using


As discussed in the paper, our code will compute evaluation metrics that deviate from the official evaluation script (Table 3 and 10).

Hardware Requirements

TAPAS is essentially a BERT model and thus has the same requirements. This means that training the large model with 512 sequence length will require a TPU. You can use the option max_seq_length to create shorter sequences. This will reduce accuracy but also make the model trainable on GPUs. Another option is to reduce the batch size (train_batch_size), but this will likely also affect accuracy. We added an option gradient_accumulation_steps that allows you to split the gradient over multiple batches. Evaluation with the default test batch size (32) should be possible on GPU.
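The idea behind gradient accumulation can be shown on a one-parameter model: averaging the gradients of equally sized micro-batches gives the same gradient as the full batch, while only one micro-batch has to fit in memory at a time. This is a sketch of the general technique in plain Python, not of how gradient_accumulation_steps is implemented in the repository; all names and data are illustrative.

```python
from typing import List, Tuple

Example = Tuple[float, float]  # (input x, target y)


def gradient(w: float, batch: List[Example]) -> float:
    """Gradient of the mean squared error of the model y = w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_gradient(w: float, batch: List[Example], steps: int) -> float:
    """Average the gradients of `steps` equally sized micro-batches."""
    if len(batch) % steps != 0:
        raise ValueError("batch size must be divisible by the number of steps")
    size = len(batch) // steps
    total = 0.0
    for i in range(steps):
        micro_batch = batch[i * size:(i + 1) * size]
        total += gradient(w, micro_batch)
    return total / steps


data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 8.0)]
full = gradient(0.5, data)
accumulated = accumulated_gradient(0.5, data, steps=2)
print(full, accumulated)
```

The equivalence holds exactly only when the loss is a mean over examples and the micro-batches have equal size.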

How to cite TAPAS?

You can cite the ACL 2020 paper, and the EMNLP 2020 Findings paper for the later work on pre-training objectives.


This is not an official Google product.

Contact information

For help or issues, please submit a GitHub issue.