SC-Depth: Unsupervised Scale-consistent Depth Estimation (IJCV 2021 & NeurIPS 2019)

Notice: this post presents the NeurIPS version. Please see our IJCV paper [pdf] for more details.

Abstract

Recent work has shown that CNN-based depth and ego-motion estimators can be learned using unlabelled monocular videos. However, the performance is limited by unidentified moving objects that violate the underlying static scene assumption in geometric image reconstruction. More significantly, due to lack of proper constraints, networks output scale-inconsistent results over different samples, i.e., the ego-motion network cannot provide full camera trajectories over a long video sequence because of the per-frame scale ambiguity. This paper tackles these challenges by proposing a geometry consistency loss for scale-consistent predictions and an induced self-discovered mask for handling moving objects and occlusions. Since we do not leverage multi-task learning like recent works, our framework is much simpler and more efficient. Comprehensive evaluation results demonstrate that our depth estimator achieves the state-of-the-art performance on the KITTI dataset. Moreover, we show that our ego-motion network is able to predict a globally scale-consistent camera trajectory for long video sequences, and the resulting visual odometry accuracy is competitive with the recent model that is trained using stereo videos. To the best of our knowledge, this is the first work to show that deep networks trained using unlabelled monocular videos can predict globally scale-consistent camera trajectories over a long video sequence.

Publication

Unsupervised Scale-consistent Depth Learning from Video, Jia-Wang Bian, Huangying Zhan, Naiyan Wang, Zhichao Li, Le Zhang, Chunhua Shen, Ming-Ming Cheng, Ian Reid, IJCV, 2021. [pdf | code] (Extended version of NeurIPS 2019)

@article{bian2021ijcv, 
  title={Unsupervised Scale-consistent Depth Learning from Video}, 
  author={Bian, Jia-Wang and Zhan, Huangying and Wang, Naiyan and Li, Zhichao and Zhang, Le and Shen, Chunhua and Cheng, Ming-Ming and Reid, Ian}, 
  journal= {International Journal of Computer Vision (IJCV)}, 
  year={2021} 
}

Contributions

  1. We propose a geometry consistency constraint to enforce the scale-consistency of depth and ego-motion networks, leading to globally scale-consistent results.
  2. We propose a self-discovered mask for detecting moving objects and occlusions, derived from the aforementioned geometry consistency constraint. Unlike other approaches, ours does not require additional optical flow or semantic segmentation networks, which makes the learning framework simpler and more efficient.
  3. The proposed depth estimator achieves state-of-the-art performance on the KITTI dataset, and the proposed ego-motion predictor shows competitive visual odometry results compared with the state-of-the-art model that is trained using stereo videos.

Proposed Framework

  1. LGC stands for the proposed geometry-consistency loss. It penalizes the inconsistency of depth predictions on consecutive frames, i.e., the difference between the predicted depth and the depth projected from the other frame. Enforcing this constraint yields scale-consistent predictions; it also regularizes the network and mitigates overfitting. See the paper for details.
  2. M stands for the proposed self-discovered mask, which is derived from LGC. Specifically, it is a confidence map (how consistent the predicted depth is with the projected depth), normalized to the range (0, 1). We apply it as a weight mask when computing the photometric loss; the low-weight regions are the detected moving objects and occlusions. A sketch of both terms is given after this list.
  3. Other parts are similar to SfMLearner (Zhou et al. [5]).
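
To make the interplay between LGC and M concrete, here is a minimal PyTorch-style sketch, assuming the projected depth (one frame's predicted depth warped into the other view) and the interpolated depth of the other frame have already been computed; the function and tensor names are illustrative, not the released code.

import torch

def geometry_consistency(projected_depth, computed_depth):
    # Per-pixel normalized depth inconsistency in (0, 1); 0 means the two
    # depth estimates of the same 3D point agree perfectly.
    diff = (projected_depth - computed_depth).abs() / (projected_depth + computed_depth)
    loss_gc = diff.mean()      # LGC: penalizes scale/structure inconsistency
    mask = 1.0 - diff          # M: high confidence where the depths agree
    return loss_gc, mask

def masked_photometric_loss(photo_error, mask):
    # Weight the per-pixel photometric error by the self-discovered mask,
    # down-weighting moving objects and occluded regions.
    return (mask * photo_error).mean()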

Visual Results of Depth and Mask

  1. Top to bottom: two consecutive images, estimated depth, proposed mask. White (black) stands for high (low) confidence.
  2. Note that only static regions co-viewed by both images provide reasonable supervision; other regions are noise in this geometry-based framework.
  3. The proposed mask, derived from geometry, can effectively detect good regions and remove noisy ones (i.e., moving objects and occlusions).

Depth Results on KITTI

  1. We use the Eigen split for training and testing (the standard protocol); see the evaluation sketch after this list.
  2. Methods trained on the KITTI raw dataset are denoted by K. Models pre-trained on Cityscapes are denoted by CS+K.
  3. (D) denotes depth supervision, (B) denotes binocular/stereo input pairs, (M) denotes monocular video clips, and (J) denotes joint learning of multiple tasks. The best performance is highlighted in bold.
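
Since monocularly trained networks predict depth only up to an unknown scale, evaluation on the Eigen split conventionally rescales each prediction by the ratio of ground-truth and predicted medians before computing the standard error metrics. A minimal NumPy sketch of such a protocol (illustrative, not the released evaluation script):

import numpy as np

def evaluate_depth(gt, pred, min_depth=1e-3, max_depth=80.0):
    # gt, pred: depth maps of the same shape; invalid ground-truth pixels are 0.
    valid = (gt > min_depth) & (gt < max_depth)
    gt, pred = gt[valid], pred[valid]

    # Median scaling: align the unknown scale of the monocular prediction.
    pred = pred * (np.median(gt) / np.median(pred))
    pred = np.clip(pred, min_depth, max_depth)

    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean((gt - pred) ** 2 / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))
    ratio = np.maximum(gt / pred, pred / gt)
    a1, a2, a3 = (np.mean(ratio < 1.25 ** k) for k in (1, 2, 3))
    return abs_rel, sq_rel, rmse, rmse_log, a1, a2, a3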

Visual Odometry Results

  1. All deep methods are trained on KITTI sequences 00-08. ORB-SLAM (without loop closing) is included as a strong baseline.
  2. Zhou et al. [5] use monocular videos for training. We align the scale of each frame to the ground truth, because the predicted scale is not consistent across frames.
  3. Zhan et al. [16] use stereo videos for training, so there is no scale ambiguity.
  4. Our method uses monocular videos for training, but only a single global scale is aligned to the ground truth. The results are comparable to, and even better than, [16]. See the alignment sketch after this list.
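
The difference between per-frame alignment (required for [5]) and our single global alignment can be expressed as a least-squares scale fit. A minimal NumPy sketch, assuming predicted and ground-truth camera translations are given as N x 3 arrays; the function names are illustrative:

import numpy as np

def align_global_scale(pred_xyz, gt_xyz):
    # One scale factor for the whole trajectory (our setting):
    # s = argmin_s || s * pred - gt ||^2 over all camera positions.
    s = np.sum(gt_xyz * pred_xyz) / np.sum(pred_xyz ** 2)
    return s * pred_xyz

def align_per_frame_scale(pred_rel, gt_rel, eps=1e-8):
    # One scale per relative translation, required when predictions are
    # scale-inconsistent across frames (e.g., [5]).
    scales = np.linalg.norm(gt_rel, axis=1) / (np.linalg.norm(pred_rel, axis=1) + eps)
    return pred_rel * scales[:, None]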

Efficiency of Training

  1. We compare with CC [9] on a single 16 GB Tesla V100 GPU. We report the time per training iteration (one forward and one backward pass) with a batch size of 4 and an image resolution of 832 × 256; a generic timing sketch is given after this list.
  2. CC [9] needs to train 3 network parts iteratively, while we only need to train 1 part once, for 200K iterations. CC takes about 7 days to train, while our method takes 32 hours.
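
As a rough reference, the per-iteration time could be measured as below. This is a generic PyTorch timing sketch with placeholder model, loss_fn, and batch objects and assumes a CUDA GPU; it is not our training code.

import time
import torch

def mean_iteration_time(model, loss_fn, batch, n_warmup=10, n_iters=100):
    # Average seconds per training iteration (forward + backward + update),
    # excluding warm-up iterations.
    optimizer = torch.optim.Adam(model.parameters())
    for i in range(n_warmup + n_iters):
        if i == n_warmup:
            torch.cuda.synchronize()
            start = time.time()
        optimizer.zero_grad()
        loss = loss_fn(model(batch))
        loss.backward()
        optimizer.step()
    torch.cuda.synchronize()
    return (time.time() - start) / n_iters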

Selected Reference

  • [5] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G Lowe. Unsupervised learning of depth and ego-motion from video. CVPR, 2017.
  • [9] Anurag Ranjan, Varun Jampani, Kihwan Kim, Deqing Sun, Jonas Wulff, and Michael J Black. Competitive Collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. CVPR, 2019.
  • [16] Huangying Zhan, Ravi Garg, Chamara Saroj Weerasekera, Kejie Li, Harsh Agarwal, and Ian Reid. Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. CVPR, 2018.

Reconstruction Demo

10 thoughts on “SC-Depth: Unsupervised Scale-consistent Depth Estimation (IJCV 2021 & NeurIPS 2019)”

  1. Dear author:

    I downloaded your pretrained depth model, but I got an error when uncompressing it. I then switched accounts and used a new PC, and it failed again. I suspect the uploaded models may have a problem; it would be great if you could check the link. Thank you very much.

    1. It does not need to be uncompressed. You just need to pass its location (e.g., “~/Research/SC-Models/cs+k_depth.tar”) to the evaluation code.

  2. Hi, thanks for sharing this great work.
    May I ask you to share the full Pseudo RGB-D SLAM system code?
    I would be very grateful.

    Many thanks!

    1. You may need to implement it yourself. You just need to save the depth predictions and then feed them to ORB-SLAM2, e.g., as sketched below.
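
    For example, here is a minimal sketch of exporting predictions for ORB-SLAM2's RGB-D mode, assuming TUM-style 16-bit PNG depth and a matching DepthMapFactor (here 5000) in the ORB-SLAM2 settings file; the function name and scale factor are assumptions, not part of the released code.

import cv2
import numpy as np

def save_depth_for_orbslam2(depth_metres, out_path, depth_map_factor=5000.0):
    # Store metric depth as a 16-bit PNG so ORB-SLAM2 (RGB-D mode) can read it
    # back with the same DepthMapFactor set in its YAML configuration.
    depth_png = np.clip(depth_metres * depth_map_factor, 0, 65535).astype(np.uint16)
    cv2.imwrite(out_path, depth_png)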
