Monocular depth estimation with affinity, vertical pooling, and label enhancement
Yukang Gan, Xiangyu Xu, Wenxiu Sun, Liang Lin
While significant progress has been made in monocular depth estimation with Convolutional Neural Networks (CNNs) that extract absolute features, such as edges and textures, the depth constraints between neighboring pixels, namely relative features, have been mostly ignored by recent methods. To overcome this limitation, we explicitly model the relationships between different image locations with an affinity layer and combine absolute and relative features in an end-to-end network. In addition, we exploit the prior knowledge that major depth changes in images lie in the vertical direction, which makes it beneficial to capture local vertical features for refined depth estimation. In the proposed algorithm, we introduce vertical pooling to aggregate image features vertically and thereby improve depth accuracy. Furthermore, since the LiDAR depth ground truth is quite sparse, we enhance the depth labels by generating high-quality dense depth maps with an off-the-shelf stereo matching method that takes left-right image pairs as input. We also integrate multi-scale structures into our network to obtain a global understanding of the image depth and exploit residual learning to aid depth refinement. We demonstrate that the proposed algorithm performs favorably against state-of-the-art methods both qualitatively and quantitatively on the KITTI driving dataset.
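To make the relative-feature idea concrete, the following is a minimal PyTorch sketch of a local affinity computation; the 3x3 neighborhood, the dot-product similarity, and the function name are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def local_affinity(feats, k=3):
    """Affinity between each pixel and its k x k neighbors (hypothetical sketch).

    Returns an (N, k*k, H, W) map of dot-product similarities, which could be
    concatenated with the absolute CNN features as "relative" features.
    """
    n, c, h, w = feats.shape
    # Gather k x k neighborhoods around every pixel: (N, C*k*k, H*W).
    neigh = F.unfold(feats, kernel_size=k, padding=k // 2)
    neigh = neigh.view(n, c, k * k, h, w)
    center = feats.unsqueeze(2)            # (N, C, 1, H, W)
    aff = (neigh * center).sum(dim=1)      # similarity per neighbor: (N, k*k, H, W)
    return aff

# Usage: affinities for a 64-channel feature map.
feats = torch.randn(2, 64, 32, 104)
aff = local_affinity(feats)                # (2, 9, 32, 104)
```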
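Similarly, vertical pooling can be sketched as average pooling with a tall, one-pixel-wide window, so information is aggregated across rows while horizontal resolution is preserved; the kernel size below is a placeholder, not the paper's setting.

```python
import torch
import torch.nn as nn

class VerticalPooling(nn.Module):
    """Average-pool features along the image height only (hypothetical sketch)."""

    def __init__(self, kernel_h=8):
        super().__init__()
        # A (kernel_h, 1) window mixes features vertically, where major
        # depth changes occur in driving scenes, and leaves columns intact.
        self.pool = nn.AvgPool2d(kernel_size=(kernel_h, 1),
                                 stride=(kernel_h, 1))

    def forward(self, x):          # x: (N, C, H, W)
        return self.pool(x)        # -> (N, C, H // kernel_h, W)

# Usage on a KITTI-sized feature map.
feats = torch.randn(2, 64, 128, 416)
pooled = VerticalPooling(kernel_h=8)(feats)   # (2, 64, 16, 416)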
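For the label-enhancement step, a dense depth map can be produced from a rectified left-right pair with any off-the-shelf stereo matcher; the sketch below uses OpenCV's semi-global block matching as one such stand-in (the paper does not name this specific matcher), and the focal length and baseline are KITTI-like placeholders.

```python
import cv2
import numpy as np

# Hypothetical sketch: densify sparse LiDAR labels via stereo matching.
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128,
                                blockSize=5)
# SGBM returns fixed-point disparities scaled by 16.
disp = matcher.compute(left, right).astype(np.float32) / 16.0

focal, baseline = 721.5, 0.54       # placeholder calibration (pixels, meters)
valid = disp > 0
dense_depth = np.zeros_like(disp)
dense_depth[valid] = focal * baseline / disp[valid]   # depth in meters
```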