OPUS - Online Publications of University Stuttgart

Browsing by Author "Haala, Norbert (apl. Prof. Dr.-Ing.)"

  • Open Access
    Integrated georeferencing for precise depth map generation exploiting multi-camera image sequences from mobile mapping
    (2020) Cavegn, Stefan; Haala, Norbert (apl. Prof. Dr.-Ing.)
    Image-based mobile mapping systems featuring multi-camera configurations allow for efficient geospatial data acquisition in both outdoor and indoor environments. We aim at accurate geospatial 3D image spaces consisting of collections of georeferenced multi-view RGB-D imagery, which may serve as a basis for 3D street view services. In order to obtain high-quality depth maps, dense image matching exploiting multi-view image sequences captured with high redundancy needs to be performed. Since this process depends entirely on accurate image orientations, this thesis mainly focuses on pose estimation of multi-camera systems. Nonetheless, we also present methods and investigations to obtain accurate, reliable and complete 3D scene representations based on multi-stereo mobile mapping sequences. Conventional image orientation approaches such as direct georeferencing enable absolute accuracies at the centimeter level in open areas with good GNSS coverage. However, GNSS conditions of street-based mobile mapping in urban canyons are often deteriorated by multipath effects and by shading of the signals caused by vegetation and large multi-story buildings. Moreover, indoor spaces do not permit GNSS reception at all. Hence, we propose a powerful and versatile image orientation procedure that is able to cope with these issues encountered in challenging urban environments. Our integrated georeferencing approach extends the structure-from-motion pipeline COLMAP with georeferencing capabilities. It assumes initial camera poses with sub-meter accuracy, which allow for direct triangulation of the complete scene. Such a global approach is much more efficient than an incremental structure-from-motion procedure. Furthermore, an initial image orientation solution already enables georeferencing in a geodetic reference frame. Nevertheless, accuracies at the centimeter level can only be achieved by incorporating ground control points. In order to obtain sub-pixel accurate relative orientations, strong tie point connections for the highly redundant multi-view image sequences are required. However, barely overlapping fields of view, strongly varying viewing directions and weakly textured surfaces aggravate image feature matching. Hence, constraining relative orientation parameters among cameras is crucial for accurate, robust and efficient image orientation. Apart from supporting fixed multi-camera rigs, our bundle adjustment based integrated georeferencing approach allows for self-calibration of all relative orientation parameters or of single components only. We extensively evaluated our integrated georeferencing procedure on six challenging real-world datasets in order to demonstrate its accuracy, robustness, efficiency and versatility. Four datasets were captured outdoors, one by a rail-based and three by different street-based multi-stereo camera systems. A portable mobile mapping system featuring a multi-head panorama camera collected two datasets in an indoor environment. Employing relative orientation constraints and ground control points within these indoor spaces resulted in absolute 3D accuracies of ca. 2 cm, and precisions at the millimeter level for relative 3D measurements. Depending on the use case, absolute 3D accuracy values for outdoor environments are slightly larger and amount to a few centimeters. However, determining 3D reference coordinates is a costly task. Without any ground control points, horizontal accuracies of ca. 5 cm were achieved for a scenario featuring some loops, but degraded to a few decimeters for an extended junction area. Since the height component depends even more strongly on the prior camera poses from direct georeferencing, the corresponding 3D accuracies are considerably worse than these 2D values. However, incorporating just one ground control point facilitates the elimination of systematic effects, which results in 3D accuracies within the sub-decimeter range. Nevertheless, at least one additional check point is recommended in order to ensure a reliable solution. Once consistent and sub-pixel accurate relative poses of spatially adjacent images are available, in-sequence dense image matching can be performed. Aiming at precise and dense depth map generation, we evaluated several image matching configurations. Standard single stereo matching led to high accuracies, which could not be significantly improved by in-sequence matching. However, the image redundancy provided by additional epochs resulted in more complete and reliable depth maps.
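To make the rig constraint and ground control point terms mentioned above more concrete, the following minimal Python sketch (illustrative only, not the thesis implementation; all function and variable names are assumptions) shows how the pose of each camera head can be composed from a single rig pose and a fixed relative orientation, and how a ground control point contributes a reprojection residual to a bundle adjustment.

import numpy as np

def compose(R_outer, t_outer, R_inner, t_inner):
    # Chain two rigid-body transforms: apply (R_inner, t_inner) first,
    # then (R_outer, t_outer).
    return R_outer @ R_inner, R_outer @ t_inner + t_outer

def camera_pose_from_rig(R_rig, t_rig, R_rel, t_rel):
    # World-to-camera pose of one camera head, composed from the
    # world-to-body pose of the rig and the fixed body-to-camera
    # relative orientation (the rig constraint).
    return compose(R_rel, t_rel, R_rig, t_rig)

def gcp_reprojection_residual(K, R, t, X_world, x_observed):
    # Pixel residual of a ground control point observed in one image;
    # residuals of this kind tie the adjustment to the geodetic frame.
    X_cam = R @ X_world + t
    x_proj = K @ X_cam
    return x_proj[:2] / x_proj[2] - x_observed

Under such a parameterization, the adjustment estimates only the rig poses and object points while the relative orientations stay fixed; for self-calibration, as described above, they would instead be added as additional unknowns.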
  • Open Access
    On the information transfer between imagery, point clouds, and meshes for multi-modal semantics utilizing geospatial data
    (2022) Laupheimer, Dominik; Haala, Norbert (apl. Prof. Dr.-Ing.)
    The semantic segmentation of the huge amount of acquired 3D data has become an important task in recent years. Images and Point Clouds (PCs) are fundamental data representations, particularly in urban mapping applications. Textured meshes integrate both representations by wiring the PC and texturing the reconstructed surface elements with high-resolution imagery. Meshes are adaptive to the underlying mapped geometry due to their graph structure composed of non-uniform and non-regular entities. Hence, the mesh is a memory-efficient, realistic-looking 3D map of the real world. For these reasons, we primarily opt for semantic segmentation of meshes, which is still a widely overlooked topic in photogrammetry and remote sensing. In particular, we head for multi-modal semantics utilizing supervised learning. However, publicly available annotated geospatial mesh data was rare at the beginning of this thesis; therefore, mesh data had to be annotated first. To kill two birds with one stone, we aim for a multi-modal fusion that enables both multi-modal enhancement of entity descriptors and semi-automatic data annotation leveraging publicly available annotations of non-mesh data. We propose a novel holistic geometry-driven association mechanism that explicitly integrates entities of the modalities imagery, PC, and mesh. The established entity relationships between pixels, points, and faces enable the sharing of information across the modalities in a two-fold manner: (i) feature transfer (measured or engineered) and (ii) label transfer (predicted or annotated). The implementation follows a tile-wise strategy to facilitate scalability to large-scale data sets. At the same time, it enables parallel, distributed processing, reducing processing time. We demonstrate the effectiveness of the proposed method on the International Society for Photogrammetry and Remote Sensing (ISPRS) benchmark data sets Vaihingen 3D and Hessigheim 3D. Taken together, the proposed entity linking and subsequent information transfer inject great flexibility into the semantic segmentation of geospatial data. Imagery, PCs, and meshes can be semantically segmented with classifiers trained on any of these modalities, utilizing features derived from any of these modalities. In particular, we can semantically segment a modality by training a classifier on the same modality (direct approach) or by transferring predictions from other modalities (indirect approach). Hence, any established well-performing modality-specific classifier can be used for semantic segmentation of these modalities - regardless of whether it follows an end-to-end learning or feature-driven scheme. We perform an extensive ablation study on the impact of multi-modal handcrafted features for automatic 3D scene interpretation - for both the direct and the indirect approach. We discuss and analyze various Ground Truth (GT) generation methods. The semi-automatic labeling leveraging the entity linking achieves consistent annotation across modalities and reduces the manual labeling effort to a single representation. Please note that the multiple epochs of the Hessigheim data, consisting of manually annotated PCs and semi-automatically annotated meshes, are a result of this thesis and are provided to the community as part of the Hessigheim 3D benchmark. To further reduce the labeling effort to a few instances on a single modality, we combine the proposed information transfer with active learning. We recruit non-experts for the tedious labeling task and analyze their annotation quality. Subsequently, we compare the resulting classifier performances to conventional passive learning using expert annotation. In particular, we investigate the impact of visualizing the mesh instead of the PC on the annotation quality achieved by non-experts. In summary, we accentuate the mesh and its utility for multi-modal fusion, GT generation, multi-modal semantics, and visualization purposes.
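As a rough illustration of the label transfer enabled by such entity links, the sketch below (a strongly simplified assumption, not the holistic association mechanism of the thesis, which also links image pixels) assigns each annotated point to the mesh face with the nearest centroid and derives a face label by majority vote; all names are hypothetical.

import numpy as np
from collections import Counter
from scipy.spatial import cKDTree

def transfer_point_labels_to_faces(points, point_labels, face_centroids, unlabeled=-1):
    # Associate every point with the face whose centroid is closest ...
    tree = cKDTree(face_centroids)
    _, nearest_face = tree.query(points)
    # ... and label each face by a majority vote over its associated points.
    face_labels = np.full(len(face_centroids), unlabeled, dtype=int)
    for face in np.unique(nearest_face):
        votes = point_labels[nearest_face == face]
        face_labels[face] = Counter(votes.tolist()).most_common(1)[0][0]
    return face_labels

Run in the opposite direction, the same kind of association could propagate predicted labels from faces back to points, in the spirit of the indirect approach described above.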
  • Open Access
    On the reconstruction, interpretation and enhancement of virtual city models
    (2020) Tutzauer, Patrick; Haala, Norbert (apl. Prof. Dr.-Ing.)
    With constant advances in both hardware and software, the availability of urban data is more versatile than ever. Structure-from-Motion (SfM), dense image matching (DIM), and multi-view stereo (MVS) algorithms have revolutionized the software side and scale to large data sets. The outcomes of these pipelines are various products such as point clouds and textured meshes of complete cities. Correspondingly, the geometric reconstruction of large-scale urban scenes is largely solved. To keep detailed urban data understandable for humans, however, the highest level of detail (LOD) is not always the best representation for conveying the intended information. Accordingly, the semantic interpretation of the various levels of urban data representation is still in its early stages. There are many applications for digital urban scenes: gaming, urban planning, disaster management, taxation, navigation, and many more. Consequently, there is a great variety of geometric representations of urban scenes. Hence, this work does not focus on a single data representation such as imagery but instead incorporates various representation types to address several aspects of the reconstruction, enhancement, and, most importantly, interpretation of virtual building and city models. A semi-automatic building reconstruction approach with subsequent grammar-based synthesis of facades is presented. The goal of this framework is to generate a geometrically as well as semantically enriched CityGML LOD3 model from coarse input data. To investigate the human understanding of building models, user studies on building category classification are performed. From these, important category-specific building features can be extracted. This knowledge and the respective features can, in turn, be used to modify existing building models to make them easier to understand. To this end, two approaches are presented - a perception-based abstraction and a grammar-based enhancement of building models using category-specific rule sets. However, in order to generate or extract building models, urban data has to be semantically analyzed. Hence, this work presents an approach for semantic segmentation of urban textured meshes. Through a hybrid model that combines explicit feature calculation and convolutional feature learning, triangle meshes can be semantically enhanced. For each face within the mesh, a multi-scale feature vector is calculated and fed into a 1D convolutional neural network (CNN). The presented approach is compared with a random forest (RF) baseline. Once buildings can be extracted from urban data representations, further distinctions can be made at the instance level. Therefore, a deep learning-based approach for building use classification, i.e., the subdivision of buildings into different types of use based on image representations, is presented. In order to train a CNN for classification, large amounts of training data are necessary. The presented work addresses this issue by proposing a pipeline for large-scale automated training data generation, which comprises crawling Google Street View (GSV) data, filtering the imagery for suitable training samples, and linking each building sample to ground truth cadastral data in the form of building polygons. Classification results of the trained CNNs are reported. Additionally, class-activation maps (CAMs) are used to investigate critical features for the classifier. The transferability to different building representation types is investigated, and CAMs help to compare the important features to those extracted from the previous human user studies. By these means, several integral parts that contribute to a holistic pipeline for urban scene interpretation are proposed. Finally, several open issues and future directions related to maintaining and processing virtual city models are presented.
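As a sketch of the per-face classification idea described above, the following PyTorch snippet (an assumed, simplified architecture, not the network used in the thesis; the feature dimension and class count are placeholders) feeds the hand-crafted multi-scale feature vector of each mesh face into a small 1D CNN that outputs semantic class scores.

import torch
import torch.nn as nn

class FaceFeature1DCNN(nn.Module):
    # Maps one multi-scale feature vector per mesh face to class scores.
    def __init__(self, num_features, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):
        # x: (batch of faces, num_features) -> add a channel dimension
        x = x.unsqueeze(1)
        return self.classifier(self.features(x).squeeze(-1))

# Hypothetical example: 48-dimensional multi-scale descriptors, 8 classes.
model = FaceFeature1DCNN(num_features=48, num_classes=8)
logits = model(torch.randn(4, 48))

A random forest trained on the same per-face feature vectors would serve as the kind of baseline comparison mentioned above.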