Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds (CVPR 2022) [paper]
Authors: Chenhang He, Ruihuang Li, Shuai Li, Lei Zhang.
This project is built on OpenPCDet.
Introduction
Transformers have demonstrated promising performance on many 2D vision tasks. However, computing self-attention on large-scale point cloud data is cumbersome, because a point cloud is a long sequence of points unevenly distributed in 3D space. To address this, existing methods usually compute self-attention locally by grouping the points into clusters of the same size, or perform convolutional self-attention on a discretized representation. However, the former incurs stochastic point dropout, while the latter typically has a narrow attention field. In this paper, we propose a novel voxel-based architecture, the Voxel Set Transformer (VoxSeT), which detects 3D objects from point clouds by means of set-to-set translation. VoxSeT is built upon a voxel-based set attention (VSA) module, which reduces the self-attention in each voxel to two cross-attentions and models features in a hidden space induced by a group of latent codes. With the VSA module, VoxSeT can manage voxelized point clusters of arbitrary size over a wide range, and process them in parallel with linear complexity. VoxSeT integrates the high performance of transformers with the efficiency of voxel-based models, making it a good alternative to convolutional and point-based backbones.
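For intuition, the following is a minimal PyTorch sketch of the two-cross-attention idea behind VSA: a small set of latent codes first aggregates the points of each voxel, and each point then reads back from its voxel's latents, so the cost is linear in the number of points. All names and shapes here are illustrative assumptions, not the repository's actual implementation (which also uses positional encodings and further refinements).

import torch
import torch.nn as nn
from torch_scatter import scatter_add, scatter_softmax

class VSASketch(nn.Module):
    # Illustrative stand-in for the VSA module, not the repo's code.
    def __init__(self, dim, num_latents):
        super().__init__()
        self.score = nn.Linear(dim, num_latents)  # one attention logit per latent code
        self.value = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, feats, voxel_ids):
        # feats: (N, C) point features; voxel_ids: (N,) voxel index of each point
        logits = self.score(feats)                                   # (N, K)
        idx = voxel_ids.unsqueeze(-1).expand_as(logits)              # (N, K)
        # Cross-attention 1: latents attend to the points of their voxel.
        attn = scatter_softmax(logits, idx, dim=0)                   # softmax over each voxel's points
        v = self.value(feats)                                        # (N, C)
        hidden = scatter_add(attn.unsqueeze(-1) * v.unsqueeze(1),
                             voxel_ids, dim=0)                       # (V, K, C) per-voxel latents
        # Cross-attention 2: each point attends back over its voxel's K latents.
        back = torch.softmax(logits, dim=1)                          # (N, K)
        out = (back.unsqueeze(-1) * hidden[voxel_ids]).sum(dim=1)    # (N, C)
        return self.proj(out)

# Toy usage: 1000 points with 64-dim features scattered over 50 voxels.
feats = torch.randn(1000, 64)
voxel_ids = torch.randint(0, 50, (1000,))
out = VSASketch(64, num_latents=8)(feats, voxel_ids)  # (1000, 64)

Note that no padding or point dropout is needed: voxels may contain any number of points, and the scatter operations handle the ragged grouping in parallel.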
1. Recommended Environment
- Linux (tested on Ubuntu 16.04)
- Python 3.7
- PyTorch 1.4 or higher (tested on PyTorch 1.10.1)
- CUDA 9.0 or higher (tested on CUDA 10.2)
2. Set Up the Environment
pip install -r requirement.txt
python setup.py build_ext --inplace
The torch_scatter package is also required; install the build that matches your PyTorch and CUDA versions.
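torch_scatter is what makes it possible to reduce over voxels with varying point counts without padding. A toy example of the kind of per-voxel pooling it provides (names are illustrative):

import torch
from torch_scatter import scatter

points = torch.randn(6, 4)                     # 6 point features with 4 channels
voxel_ids = torch.tensor([0, 0, 1, 1, 1, 2])   # voxel assignment of each point
voxel_max = scatter(points, voxel_ids, dim=0, reduce='max')    # (3, 4) per-voxel max-pooling
voxel_mean = scatter(points, voxel_ids, dim=0, reduce='mean')  # (3, 4) per-voxel mean-pooling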
3. Data Preparation
- Prepare the KITTI dataset and road planes
# Download KITTI and organize it into the following form:
├── data
│   ├── kitti
│   │   ├── ImageSets
│   │   ├── training
│   │   │   ├── calib & velodyne & label_2 & image_2 & (optional: planes)
│   │   ├── testing
│   │   │   ├── calib & velodyne & image_2
# Generate data infos:
python -m pcdet.datasets.kitti.kitti_dataset create_kitti_infos tools/cfgs/dataset_configs/kitti_dataset.yaml
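If the command succeeds, info files such as kitti_infos_train.pkl should appear under data/kitti (assuming the standard OpenPCDet output layout). A quick, hypothetical sanity check:

import pickle

# Assumes the standard OpenPCDet output path for the generated infos.
with open('data/kitti/kitti_infos_train.pkl', 'rb') as f:
    infos = pickle.load(f)
print(len(infos), 'training samples indexed')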
4. Pretrained model
You can download the pretrained model here and the log file here.
The performance (using 11 recall positions) on the KITTI validation set is as follows:
Car AP@0.70, 0.70, 0.70:
bev AP:90.1572, 88.0972, 86.8397
3d AP:88.8694, 78.7660, 77.5758
Pedestrian AP@0.50, 0.50, 0.50:
bev AP:63.1125, 58.5591, 55.1318
3d AP:60.2515, 55.5535, 50.1888
Cyclist AP@0.50, 0.50, 0.50:
bev AP:85.6768, 71.9008, 67.1551
3d AP:85.4238, 70.2774, 64.9804
The runtime is about 33 ms per sample.
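For reference, per-sample latency figures like this are typically measured with CUDA events after a warm-up. A minimal sketch using a stand-in module (not the actual detector):

import torch

model = torch.nn.Linear(64, 64).cuda().eval()   # stand-in for the detector
x = torch.randn(10000, 64, device='cuda')
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
with torch.no_grad():
    for _ in range(10):      # warm-up iterations
        model(x)
    start.record()
    for _ in range(100):
        model(x)
    end.record()
torch.cuda.synchronize()
print(start.elapsed_time(end) / 100, 'ms per forward pass')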
5. Train
- Train with a single GPU
python train.py --cfg_file tools/cfgs/kitti_models/voxset.yaml
- Train with multiple GPUs
cd VoxSeT/tools
bash scripts/dist_train.sh ${NUM_GPUS} --cfg_file ./cfgs/kitti_models/voxset.yaml
6. Test with a pretrained model
cd VoxSeT/tools
python test.py --cfg_file ./cfgs/kitti_models/voxset.yaml --ckpt ${CKPT_FILE}
Citation
@inproceedings{he2022voxset,
  title={Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds},
  author={He, Chenhang and Li, Ruihuang and Li, Shuai and Zhang, Lei},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  year={2022}
}