Automatically calibrate RGBD cameras with PyTorch. The intrinsics and extrinsics of the camera pair are optimized using photometric consistency after projecting the ToF camera's pixels into the color camera. I tested on semi-rectified color and Kinect continuous wave ToF cameras from the NYU Depth V2 dataset. Related work using photometric consistency as a loss signal: LSD-SLAM, KinectFusion, CVPR 2017, ICCV 2019, ICCV 2021.
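To make the projection step concrete, here is a minimal sketch of mapping ToF pixels into the color image with a standard pinhole model. It is an illustration, not this repository's code: the function name `project_tof_to_color`, the argument layout, and the assumption that depth is measured along the ToF optical axis are all hypothetical.

```python
import torch

def project_tof_to_color(depth, K_tof, K_color, R, t):
    """Map every ToF pixel to pixel coordinates in the color image.

    depth:   (H, W) depth along the ToF optical axis (assumed convention)
    K_tof:   (3, 3) ToF intrinsics
    K_color: (3, 3) color intrinsics
    R, t:    (3, 3) rotation and (3,) translation, ToF frame -> color frame
    Returns an (H, W, 2) tensor of (u, v) coordinates in the color image.
    """
    H, W = depth.shape
    v, u = torch.meshgrid(
        torch.arange(H, dtype=depth.dtype),
        torch.arange(W, dtype=depth.dtype),
        indexing="ij",
    )
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1)   # homogeneous pixels, (H, W, 3)
    rays = pix @ torch.linalg.inv(K_tof).T                   # back-project to unit-depth rays
    pts_tof = rays * depth.unsqueeze(-1)                     # 3D points in the ToF frame
    pts_color = pts_tof @ R.T + t                            # move into the color frame
    proj = pts_color @ K_color.T                             # apply color intrinsics
    return proj[..., :2] / proj[..., 2:3].clamp(min=1e-6)    # perspective divide
```

Every operation here is differentiable with respect to `K_tof`, `K_color`, `R`, and `t`, which is what lets a photometric loss drive the calibration. One possible way to build that loss is to sample the color image at the returned coordinates, for example with `torch.nn.functional.grid_sample` after normalizing the coordinates to [-1, 1].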
Setting up color camera and ToF camera
I tested on semi-rectified color and ToF cameras in a stereo arrangement. This makes initialization much easier because we can assume identity rotation. Both the translation and rotation vectors are updated during optimization, but convergence is more likely if you tune the initial x component of the translation vector from the ToF camera to the color camera. I initialize the x component of the translation vector to 0.1.
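The initialization described above could look like the following sketch. It assumes the rotation vector is an axis-angle parameterization, so a zero vector corresponds to identity rotation; the helper names `init_extrinsics` and `rotation_from_vector` are hypothetical and not taken from this repository.

```python
import torch

def init_extrinsics(tx=0.1):
    """Learnable ToF -> color extrinsics: zero rotation vector, x-only translation."""
    rvec = torch.zeros(3, requires_grad=True)                 # identity rotation
    tvec = torch.tensor([tx, 0.0, 0.0], requires_grad=True)   # stereo baseline along x
    return rvec, tvec

def rotation_from_vector(rvec):
    """Axis-angle vector -> 3x3 rotation matrix via the matrix exponential."""
    zero = rvec.new_zeros(())
    skew = torch.stack([
        torch.stack([zero, -rvec[2], rvec[1]]),
        torch.stack([rvec[2], zero, -rvec[0]]),
        torch.stack([-rvec[1], rvec[0], zero]),
    ])
    return torch.matrix_exp(skew)

rvec, tvec = init_extrinsics()
R = rotation_from_vector(rvec)  # identity at initialization
```

The matrix exponential is used here instead of the closed-form Rodrigues formula because it stays differentiable at the zero vector, which is exactly where this initialization starts.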
Convergence is also more likely if you initialize the focal lengths in a sensible range. The focal length in pixels is fx = F * s, where F is the focal length of your lens in mm and s is the horizontal pixel density in pixels per mm (image width in pixels divided by sensor width in mm). Repeat for fy using the vertical values. F and the sensor dimensions can be found either in the metadata of your image or in the camera's technical docs online. I tested with randomly initializing focal lengths between 400 and 600. I initialize the other intrinsic matrix parameters to 0.5.
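As a worked example of the focal length estimate, the sketch below assumes s means pixels per mm, that is, image size in pixels divided by sensor size in mm. The lens and sensor numbers are made up for illustration, and the helper names are hypothetical.

```python
import random

def focal_length_px(focal_mm, image_size_px, sensor_size_mm):
    """Focal length in pixels: lens focal length (mm) times pixel density (px/mm)."""
    return focal_mm * image_size_px / sensor_size_mm

def random_focal_init(rng, low=400.0, high=600.0):
    """Draw a starting focal length uniformly from the range I tested (400 to 600)."""
    return rng.uniform(low, high)

# Hypothetical camera: 4 mm lens, 640 px wide image, 4.8 mm wide sensor.
fx_estimate = focal_length_px(4.0, 640, 4.8)

# Random start when no lens or sensor information is available.
fx_random = random_focal_init(random.Random(0))
```

If the estimate from the spec sheet falls well outside the range you initialize from, it is probably worth shifting the range to bracket the estimate.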
In summary, get the cameras decently rectified and initialize the camera matrices sensibly close to ground truth. This will ensure better calibration performance. If optimization is failing early, revisit these steps first.
- PyTorch 1.10.2 (the code uses torch.linalg.pinv(), which is only available in newer PyTorch releases)
I only tested on NYU Depth V2 and provide a short segment of it. I recommend calibrating on scenes with weak perspective and many valid ToF pixels, since in my experience they optimized better (middle and bottom video from above). When there is strong perspective and fewer valid ToF pixels, optimization struggled more (top video from above). Taking a varied video of a dynamic environment will potentially improve performance because it gives optimization a chance to get out of local minima. Optimization rarely diverged after reaching a good solution, even for long videos with varied scenes, so it seems that camera matrix initialization matters most for good final convergence and the quality of the scene content matters second.
This image should appear after optimization is complete: