Scalable implementation of dense retrieval.
Training on cluster
By default it trains locally:
PYTHONPATH=.:$PYTHONPATH python dpr_scale/main.py trainer.gpus=1
SLURM Training
To train the model on SLURM, run:
PYTHONPATH=.:$PYTHONPATH python dpr_scale/main.py -m trainer=slurm trainer.num_nodes=2 trainer.gpus=2
Reproduce DPR on 8 GPUs
PYTHONPATH=.:$PYTHONPATH python dpr_scale/main.py -m --config-name nq.yaml +hydra.launcher.name=dpr_stl_nq_reproduce
Generate embeddings on Wikipedia
PYTHONPATH=.:$PYTHONPATH python dpr_scale/generate_embeddings.py -m --config-name nq.yaml datamodule=generate datamodule.test_path=psgs_w100.tsv +task.ctx_embeddings_dir=<CTX_EMBEDDINGS_DIR> +task.checkpoint_path=<CHECKPOINT_PATH>
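Once the job finishes, the embeddings directory can be sanity-checked with a minimal sketch like the one below. It only assumes that each worker drops pickled embedding shards into CTX_EMBEDDINGS_DIR; the file naming and payload layout are assumptions for illustration, not guarantees of dpr-scale.
# Minimal sketch: count the passage embeddings written to CTX_EMBEDDINGS_DIR.
# Assumption: each shard is a pickle file holding a tensor or list of embeddings.
import glob
import pickle

ctx_embeddings_dir = "<CTX_EMBEDDINGS_DIR>"  # same dir passed to generate_embeddings.py

total = 0
for path in sorted(glob.glob(f"{ctx_embeddings_dir}/*.pkl")):
    with open(path, "rb") as f:
        shard = pickle.load(f)  # assumed: a tensor or list of passage embeddings
    count = shard.shape[0] if hasattr(shard, "shape") else len(shard)
    total += count
    print(path, count)
print("total passages embedded:", total)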
Get retrieval results
Currently this runs on 1 GPU. Use CTX_EMBEDDINGS_DIR from above.
PYTHONPATH=.:$PYTHONPATH python dpr_scale/run_retrieval.py --config-name nq.yaml trainer=gpu_1_host trainer.gpus=1 +task.output_path=<PATH_TO_OUTPUT_JSON> +task.ctx_embeddings_dir=<CTX_EMBEDDINGS_DIR> +task.checkpoint_path=<CHECKPOINT_PATH> +task.passages=psgs_w100.tsv datamodule.test_path=<PATH_TO_QUERIES_JSONL>
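To eyeball the results before running the evaluation script, a small sketch like the following can be used. It assumes the output JSON follows the common DPR results schema (a list of entries with question, answers, and ctxs carrying score/text fields); those field names are assumptions, not documented guarantees.
# Minimal sketch: print the top retrieved passages for the first query.
# Assumption: DPR-style output, i.e. a list of
# {"question", "answers", "ctxs": [{"score", "text", ...}]} records.
import json

with open("<PATH_TO_OUTPUT_JSON>") as f:
    results = json.load(f)

print("queries:", len(results))
first = results[0]
print("question:", first.get("question"))
for ctx in first.get("ctxs", [])[:3]:
    print(round(float(ctx.get("score", 0.0)), 2), str(ctx.get("text", ""))[:80])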
Generate query embeddings
Alternatively, query embedding generation and retrieval can be separated. After the query embeddings are generated with the following command, the run_retrieval_fb.py or run_retrieval_multiset.py script can be used to perform retrieval.
PYTHONPATH=.:$PYTHONPATH python dpr_scale/generate_query_embeddings.py -m --config-name nq.yaml trainer.gpus=1 datamodule.test_path=<PATH_TO_QUERIES_JSONL> +task.ctx_embeddings_dir=<CTX_EMBEDDINGS_DIR> +task.checkpoint_path=<CHECKPOINT_PATH> +task.query_emb_output_path=<OUTPUT_TO_QUERY_EMB>
Get evaluation metrics for a given JSON output file
python dpr_scale/eval_dpr.py --retrieval <PATH_TO_OUTPUT_JSON> --topk 1 5 10 20 50 100
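For reference, the reported top-k numbers are retrieval accuracies: the fraction of queries for which at least one of the top-k retrieved passages contains an answer. Below is a sketch of that metric, not the eval_dpr.py implementation; the has_answer field is an assumed part of the output schema.
# Illustrative top-k retrieval accuracy over a DPR-style results file.
# Assumption: each retrieved ctx carries a boolean "has_answer" field.
import json

def topk_accuracy(results, ks=(1, 5, 10, 20, 50, 100)):
    hits = {k: 0 for k in ks}
    for r in results:
        ranks = [i for i, c in enumerate(r.get("ctxs", [])) if c.get("has_answer")]
        best = ranks[0] if ranks else None
        for k in ks:
            if best is not None and best < k:
                hits[k] += 1
    return {k: hits[k] / max(len(results), 1) for k in ks}

with open("<PATH_TO_OUTPUT_JSON>") as f:
    print(topk_accuracy(json.load(f)))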
Get evaluation metrics for MSMARCO
python dpr_scale/msmarco_eval.py ~data/msmarco/qrels.dev.small.tsv PATH_TO_OUTPUT_JSON
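The metric reported here is MRR@10: the reciprocal rank of the first relevant passage within the top 10, averaged over queries. The sketch below illustrates the metric itself, independent of msmarco_eval.py.
# Illustrative MRR@10 computation (metric only, not the msmarco_eval.py script).

def mrr_at_10(qrels, rankings):
    """qrels: {qid: set of relevant pids}; rankings: {qid: [pid, ...] best first}."""
    total = 0.0
    for qid, ranked in rankings.items():
        for rank, pid in enumerate(ranked[:10], start=1):
            if pid in qrels.get(qid, set()):
                total += 1.0 / rank
                break
    return total / max(len(rankings), 1)

# Example: the relevant passage is ranked 2nd, so MRR@10 == 0.5
print(mrr_at_10({"q1": {"p9"}}, {"q1": ["p3", "p9", "p7"]}))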
Domain-matched Pre-training Tasks for Dense Retrieval
Paper: https://arxiv.org/abs/2107.13602
The sections below provide links to datasets and pretrained models, as well as instructions to prepare the datasets, pretrain, and fine-tune models.
Q&A Datasets
PAQ
Download the dataset from here
Conversational Datasets
You can download the datasets from the respective tables.
Reddit
ConvAI2
DSTC7
Prepare the dataset by downloading the tarball linked here and running the command below.
DSTC7_DATA_ROOT=<path_of_dir_where_the_data_is_extracted>
python dpr_scale/data_prep/prep_conv_datasets.py \
--dataset dstc7 \
--in_file_path $DSTC7_DATA_ROOT/ubuntu_train_subtask_1_augmented.json \
--out_file_path $DSTC7_DATA_ROOT/ubuntu_train.jsonl
Ubuntu V2
Prepare the dataset by downloading the tarball linked here and running the command below.
UBUNTUV2_DATA_ROOT=<path_of_dir_where_the_data_is_extracted>
python dpr_scale/data_prep/prep_conv_datasets.py \
--dataset ubuntu2 \
--in_file_path $UBUNTUV2_DATA_ROOT/train.csv \
--out_file_path $UBUNTUV2_DATA_ROOT/train.jsonl
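To sanity-check either converted file, a minimal sketch such as the following can be used; it assumes only that each line of the output is a standalone JSON object and makes no assumption about the field names the prep script emits.
# Minimal sanity check on a converted JSONL file: every line should parse as JSON.
import json
import sys

path = sys.argv[1]  # e.g. $UBUNTUV2_DATA_ROOT/train.jsonl
count = 0
with open(path) as f:
    for line in f:
        record = json.loads(line)
        if count == 0:
            print("fields:", sorted(record))  # whatever the prep script emits
        count += 1
print("records:", count)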
Pretraining DPR
Pretrained Checkpoints
Pretraining on PAQ dataset
DPR_ROOT=<path_of_your_repo's_root>
MODEL="bert-large-uncased"
NODES=8
BSZ=16
MAX_EPOCHS=20
LR=1e-5
TIMOUT_MINS=4320
EXP_DIR=<path_of_the_experiment_dir>
TRAIN_PATH=<path_of_the_training_data_file>
mkdir -p ${EXP_DIR}/logs
PYTHONPATH=$DPR_ROOT python ${DPR_ROOT}/dpr_scale/main.py -m \
--config-dir ${DPR_ROOT}/dpr_scale/conf \
--config-name nq.yaml \
hydra.launcher.timeout_min=$TIMOUT_MINS \
hydra.sweep.dir=${EXP_DIR} \
trainer.num_nodes=${NODES} \
task.optim.lr=${LR} \
task.model.model_path=${MODEL} \
trainer.max_epochs=${MAX_EPOCHS} \
datamodule.train_path=$TRAIN_PATH \
datamodule.batch_size=${BSZ} \
datamodule.num_negative=1 \
datamodule.num_val_negative=10 \
datamodule.num_test_negative=50 > ${EXP_DIR}/logs/log.out 2> ${EXP_DIR}/logs/log.err &
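Since DPR-style training scores questions against in-batch passages, the global batch size matters when you change NODES or BSZ. A rough sketch of that arithmetic (the GPUs-per-node value is an assumption; it comes from your trainer/SLURM config):
# Rough arithmetic for the effective global batch size under distributed training.
per_gpu_batch = 16   # BSZ above
gpus_per_node = 8    # assumption; set to whatever your trainer/SLURM config requests
nodes = 8            # NODES above

global_batch = per_gpu_batch * gpus_per_node * nodes
print("questions per optimizer step:", global_batch)  # 1024 with these values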
Pretraining on Reddit dataset
# Use a batch size of 16 for BERT and RoBERTa base models.
BSZ=4
NODES=8
MAX_EPOCHS=5
WARMUP_STEPS=10000
LR=1e-5
MODEL="roberta-large"
EXP_DIR=<path_of_the_experiment_dir>
PYTHONPATH=. python dpr_scale/main.py -m \
--config-dir ${DPR_ROOT}/dpr_scale/conf \
--config-name reddit.yaml \
hydra.launcher.nodes=${NODES} \
hydra.sweep.dir=${EXP_DIR} \
trainer.num_nodes=${NODES} \
task.optim.lr=${LR} \
task.model.model_path=${MODEL} \
trainer.max_epochs=${MAX_EPOCHS} \
task.warmup_steps=${WARMUP_STEPS} \
datamodule.batch_size=${BSZ} > ${EXP_DIR}/logs/log.out 2> ${EXP_DIR}/logs/log.err &
Fine-tuning DPR on downstream tasks/datasets
Fine-tune the pretrained PAQ checkpoint
# You can also try 2e-5 or 5e-5. Usually these 3 learning rates work best.
LR=1e-5
# Use a batch size of 32 for BERT and RoBERTa base models.
BSZ=12
MODEL="bert-large-uncased"
MAX_EPOCHS=40
WARMUP_STEPS=1000
NODES=1
PRETRAINED_CKPT_PATH=<path_of_checkpoint_pretrained_on_PAQ>
EXP_DIR=<path_of_the_experiment_dir>
NAME=<name_of_the_experiment>
PYTHONPATH=. python dpr_scale/main.py -m \
--config-dir ${DPR_ROOT}/dpr_scale/conf \
--config-name nq.yaml \
hydra.launcher.name=${NAME} \
hydra.sweep.dir=${EXP_DIR} \
trainer.num_nodes=${NODES} \
trainer.max_epochs=${MAX_EPOCHS} \
datamodule.num_negative=1 \
datamodule.num_val_negative=25 \
datamodule.num_test_negative=50 \
+trainer.val_check_interval=150 \
task.warmup_steps=${WARMUP_STEPS} \
task.optim.lr=${LR} \
task.pretrained_checkpoint_path=$PRETRAINED_CKPT_PATH \
task.model.model_path=${MODEL} \
datamodule.batch_size=${BSZ} > ${EXP_DIR}/logs/log.out 2> ${EXP_DIR}/logs/log.err &
Fine-tune the pretrained Reddit checkpoint
Batch sizes that worked on Volta 32GB GPUs for the respective models and datasets:

| Model | Dataset | Batch Size |
| --- | --- | --- |
| BERT/RoBERTa base | ConvAI2 | 64 |
| BERT/RoBERTa large | ConvAI2 | 16 |
| BERT/RoBERTa base | DSTC7 | 24 |
| BERT/RoBERTa large | DSTC7 | 8 |
| BERT/RoBERTa base | Ubuntu V2 | 64 |
| BERT/RoBERTa large | Ubuntu V2 | 16 |
# Change the config file name to convai2.yaml or dstc7.yaml for the respective datasets.
CONFIG_FILE_NAME=ubuntuv2.yaml
# You can also try 2e-5 or 5e-5. Usually these 3 learning rates work best.
LR=1e-5
BSZ=16
NODES=1
MAX_EPOCHS=5
WARMUP_STEPS=10000
MODEL="roberta-large"
PRETRAINED_CKPT_PATH=<path_of_checkpoint_pretrained_on_reddit>
EXP_DIR=<path_of_the_experiment_dir>
PYTHONPATH=${DPR_ROOT} python ${DPR_ROOT}/dpr_scale/main.py -m \
--config-dir=${DPR_ROOT}/dpr_scale/conf \
--config-name=$CONFIG_FILE_NAME \
hydra.launcher.nodes=${NODES} \
hydra.sweep.dir=${EXP_DIR} \
trainer.num_nodes=${NODES} \
trainer.max_epochs=${MAX_EPOCHS} \
+trainer.val_check_interval=150 \
task.pretrained_checkpoint_path=$PRETRAINED_CKPT_PATH \
task.warmup_steps=${WARMUP_STEPS} \
task.optim.lr=${LR} \
task.model.model_path=$MODEL \
datamodule.batch_size=${BSZ} > ${EXP_DIR}/logs/log.out 2> ${EXP_DIR}/logs/log.err &
License
dpr-scale is CC-BY-NC 4.0 licensed as of now.