Fine-Scaled 3D Geometry Recovery from Single RGB Images

Dissertation, Bonn University, Jan. 2018

Abstract

3D geometry recovery from single RGB images is a highly ill-posed and inherently ambiguous problem, which has been a challenging research topic in computer vision for several decades. When fine-scaled 3D geometry is required, the problem becomes even more difficult. The goal is to recover geometric information from a single photograph of an object or a scene with multiple objects; the geometry to be retrieved can take different representations, such as surface meshes, voxels, depth maps or 3D primitives. In this thesis, we investigate fine-scaled 3D geometry recovery from single RGB images for three categories: facial wrinkles, indoor scenes and man-made objects. Since each category has its own particular features, styles and representations, we propose a different strategy for each kind of 3D geometry estimate.

We present a lightweight non-parametric method to generate wrinkles from monocular Kinect RGB images. The method is lightweight in that it can generate plausible wrinkles using exemplars from a single high-quality, textured 3D face model. Local geometric patches from this source are copied to synthesize different wrinkles on the blendshapes of specific users in an offline stage. During online tracking, facial animations with high-quality wrinkle details are recovered in real time as a linear combination of these personalized wrinkled blendshapes.

We propose a fast-to-train, two-streamed, multi-scale CNN that predicts both a dense depth map and depth gradients for single indoor scene images. The depth and depth gradients are then fused into a more accurate and detailed depth map. We introduce a novel set loss over multiple related images: by regularizing the estimates across a common set of images, the network is less prone to overfitting and achieves better accuracy than competing methods. A fine-scaled 3D point cloud can then be produced by re-projecting the depth map to 3D using the known camera parameters.

To handle highly structured man-made objects, we introduce a novel neural network architecture for recovering 3D shape from a single image. A convolutional encoder maps a given image to a compact code, and an associated recursive decoder maps this code back to a full hierarchy, resulting in a set of bounding boxes that represent the estimated shape. Finally, we train a second network to predict the fine-scaled geometry in each bounding box at the voxel level. The per-box volumes are then embedded into a global volume, from which we reconstruct the final meshed model.

Experiments on a variety of datasets show that our approaches successfully estimate fine-scaled geometry from single RGB images for each category, and surpass the state of the art in recovering faithful 3D local details as high-resolution mesh surfaces or point clouds.
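The real-time facial animation stage amounts to a weighted sum of blendshape meshes. The following is a minimal sketch of that linear combination; the function name, array layout and use of NumPy are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

def combine_blendshapes(neutral, deltas, weights):
    """Linearly combine personalized wrinkled blendshapes.

    neutral : (V, 3) array of neutral-pose vertex positions.
    deltas  : (K, V, 3) array of per-blendshape vertex offsets
              (wrinkled blendshape minus neutral).
    weights : (K,) tracked expression coefficients, typically in [0, 1].
    """
    # Animated mesh = neutral face + weighted sum of blendshape offsets.
    return neutral + np.tensordot(weights, deltas, axes=1)

# Tiny example: one vertex, two blendshapes.
neutral = np.zeros((1, 3))
deltas = np.array([[[1.0, 0.0, 0.0]],
                   [[0.0, 2.0, 0.0]]])
weights = np.array([0.5, 0.25])
mesh = combine_blendshapes(neutral, deltas, weights)  # vertex at (0.5, 0.5, 0.0)
```

Because the wrinkle detail is baked into the personalized blendshapes offline, the online cost per frame is just this weighted sum, which is what makes real-time tracking feasible.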
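The set loss regularizes predictions across a common set of related images. One plausible instantiation, assumed here for illustration (the thesis's exact formulation may differ), combines a standard per-image data term with a penalty on the variance of the predictions across the set:

```python
import numpy as np

def set_loss(preds, targets, lam=0.1):
    """Sketch of a set loss over N related images.

    preds, targets : (N, H, W) predicted / ground-truth depth maps for
    N related images (e.g. crops or views of the same scene).
    lam : weight of the cross-set consistency term (assumed value).
    """
    # Standard per-pixel L2 data term, averaged over the whole set.
    data_term = np.mean((preds - targets) ** 2)
    # Consistency term: variance of the predictions across the set;
    # pushing it down discourages inconsistent, overfit estimates.
    consistency = np.mean(np.var(preds, axis=0))
    return data_term + lam * consistency
```

When all predictions in the set agree and match the targets, both terms vanish; disagreement within the set is penalized even where each individual prediction fits its own target.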
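The final re-projection step follows the standard pinhole camera model: each pixel (u, v) with depth Z maps to X = (u - cx) Z / fx, Y = (v - cy) Z / fy. A self-contained sketch, with intrinsics and array shapes assumed for illustration:

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map to 3D using known pinhole intrinsics.

    depth  : (H, W) array of metric depth values Z.
    fx, fy : focal lengths in pixels; cx, cy : principal point.
    Returns an (H*W, 3) array of camera-space points (X, Y, Z).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

# Example: a 2x2 depth map at a constant 2 m.
pts = depth_to_point_cloud(np.full((2, 2), 2.0), fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

Since the fused depth map preserves fine detail, the resulting point cloud inherits that detail directly; no learning is involved in this step, only the known camera geometry.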

Download: https://nbn-resolving.org/urn:nbn:de:hbz:5n-49349

BibTeX

@PHDTHESIS{li-2018-dissertation,
    author = {Li, Jun},
     title = {Fine-Scaled 3D Geometry Recovery from Single RGB Images},
      type = {Dissertation},
      year = {2018},
     month = jan,
    school = {Bonn University},
  abstract = {3D geometry recovery from single RGB images is a highly ill-posed and inherently ambiguous problem,
              which has been a challenging research topic in computer vision for several decades. When fine-scaled
              3D geometry is required, the problem becomes even more difficult. The goal is to recover geometric
              information from a single photograph of an object or a scene with multiple objects; the geometry to
              be retrieved can take different representations, such as surface meshes, voxels, depth maps or 3D
              primitives. In this thesis, we investigate fine-scaled 3D geometry recovery from single RGB images
              for three categories: facial wrinkles, indoor scenes and man-made objects. Since each category has
              its own particular features, styles and representations, we propose a different strategy for each
              kind of 3D geometry estimate.
              We present a lightweight non-parametric method to generate wrinkles from monocular Kinect RGB
              images. The method is lightweight in that it can generate plausible wrinkles using exemplars from
              a single high-quality, textured 3D face model. Local geometric patches from this source are copied
              to synthesize different wrinkles on the blendshapes of specific users in an offline stage. During
              online tracking, facial animations with high-quality wrinkle details are recovered in real time as
              a linear combination of these personalized wrinkled blendshapes.
              We propose a fast-to-train, two-streamed, multi-scale CNN that predicts both a dense depth map and
              depth gradients for single indoor scene images. The depth and depth gradients are then fused into
              a more accurate and detailed depth map. We introduce a novel set loss over multiple related images:
              by regularizing the estimates across a common set of images, the network is less prone to
              overfitting and achieves better accuracy than competing methods. A fine-scaled 3D point cloud can
              then be produced by re-projecting the depth map to 3D using the known camera parameters.
              To handle highly structured man-made objects, we introduce a novel neural network architecture for
              recovering 3D shape from a single image. A convolutional encoder maps a given image to a compact
              code, and an associated recursive decoder maps this code back to a full hierarchy, resulting in a
              set of bounding boxes that represent the estimated shape. Finally, we train a second network to
              predict the fine-scaled geometry in each bounding box at the voxel level. The per-box volumes are
              then embedded into a global volume, from which we reconstruct the final meshed model.
              Experiments on a variety of datasets show that our approaches successfully estimate fine-scaled
              geometry from single RGB images for each category, and surpass the state of the art in recovering
              faithful 3D local details as high-resolution mesh surfaces or point clouds.},
       url = {https://nbn-resolving.org/urn:nbn:de:hbz:5n-49349}
}