Unsupervised model adaptation for vision-centric deep neural networks
Abstract
Despite significant advancements in deep learning and computer vision, the robustness of human perception is still unmatched. For example, while humans can reliably interpret a scene in unfamiliar locations or environments, Deep Neural Networks (DNNs) usually struggle in such scenarios, exhibiting substantial performance degradation. This decline in performance can be attributed to differences between the training and test distributions, as well as the limited ability of current DNNs to effectively generalize their learned knowledge.
The lack of generalization has several consequences for the practical application of DNNs. Each time the test conditions change, new data must be collected, manually annotated, and used to retrain the model. While gathering unlabeled data and retraining the model is often less problematic, manual labeling is highly time-consuming and should thus be minimized. For safety-critical applications like automated driving, the limited generalization requires covering virtually every situation with labeled data to ensure reliability, which borders on the impossible.
A promising approach to minimize manual labeling efforts or to increase a network's robustness during inference is Unsupervised Model Adaptation, which is the focus of this thesis. For example, by transferring knowledge via Unsupervised Domain Adaptation (UDA) from a labeled source dataset to an unlabeled target dataset collected under new test conditions, manual data annotation can be avoided. Alternatively, robustness can be improved by adapting the model directly during inference using online Test-time Adaptation (TTA). However, both research fields still face challenges, particularly regarding performance and stability, that must be resolved before they can reliably overcome the problems caused by insufficient generalization.
The foundations for the subsequently presented contributions to UDA and TTA are laid by first defining the settings and providing a comprehensive review of existing methods for UDA, online TTA, and related subfields relevant to this thesis, such as continual UDA. Then, a novel framework for UDA in semantic segmentation is introduced, which exploits contrastive learning to align the two domains at the category level. The label information required to create the pairs for contrastive learning is extracted via online-refined pseudo-labels, which further enable effective adaptation via self-training. To prevent the network's output head from developing harmful biases toward either the source or the target domain, an additional loss-weighting scheme is employed that promotes globally diverse predictions. The effectiveness of the framework is validated using two widely adopted benchmarks for UDA.
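To make the category-level alignment more concrete, the following is a minimal PyTorch sketch of a contrastive loss over class prototypes built from source labels and target pseudo-labels; the prototype construction, the temperature value, and the function names are illustrative assumptions rather than the exact formulation used in the thesis.

import torch
import torch.nn.functional as F

def class_prototypes(features, labels, num_classes):
    # Average all feature vectors of each class into one prototype of shape (num_classes, D).
    protos = torch.zeros(num_classes, features.size(1), device=features.device)
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            protos[c] = features[mask].mean(dim=0)
    return F.normalize(protos, dim=1)

def category_contrastive_loss(src_feats, src_labels, tgt_feats, tgt_pseudo_labels,
                              num_classes, tau=0.1):
    # Same-class source/target prototypes form positive pairs (the diagonal of the
    # similarity matrix), while prototypes of different classes act as negatives.
    p_src = class_prototypes(src_feats, src_labels, num_classes)
    p_tgt = class_prototypes(tgt_feats, tgt_pseudo_labels, num_classes)
    logits = p_src @ p_tgt.t() / tau
    targets = torch.arange(num_classes, device=src_feats.device)
    return F.cross_entropy(logits, targets)

In practice, classes absent from the current batch would be excluded and the pseudo-labels would be refined online before building the target prototypes, as described above.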
Although fusing the information provided by two complementary sensors, such as an RGB camera and a LiDAR, can increase a network's robustness, performance degradation in adverse weather conditions can still occur. To again avoid manual data annotation, the first framework for conducting UDA in multimodal 2D object detection using RGB and LiDAR data is presented. The approach uses adversarial learning and pretext tasks to align the domains in the feature space. Focusing on perception in autonomous driving, the framework is shown to be effective not only for adapting a model to a single adverse weather condition but also for adapting to multiple adverse weather scenarios simultaneously.
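The adversarial feature alignment can be illustrated with a gradient-reversal layer and a small domain discriminator operating on the fused RGB/LiDAR features; the discriminator architecture and feature dimensions below are assumptions chosen for illustration, not the design used in the thesis.

import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    # Identity in the forward pass, negated (scaled) gradient in the backward pass,
    # which pushes the feature extractor toward domain-invariant representations.
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class DomainDiscriminator(nn.Module):
    # Predicts whether a fused RGB/LiDAR feature vector stems from the source or the target domain.
    def __init__(self, feature_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feature_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, fused_features, lam=1.0):
        reversed_features = GradientReversal.apply(fused_features, lam)
        return self.net(reversed_features)

Training the discriminator with a binary source/target loss while the reversed gradients make the fused features harder to classify mirrors the adversarial alignment described above; the pretext tasks would provide additional self-supervised signals on the unlabeled target data.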
However, since the data from multiple target domains may become available sequentially, without access to previous target data, continual UDA is subsequently studied. By conditioning an AdaIN-based style transfer model on each class and exploiting style replay, the framework outperforms baseline methods on both a purely synthetic and a more challenging real-world domain sequence.
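A minimal sketch of class-conditional AdaIN is shown below: the feature statistics within each class region are replaced by stored per-class style statistics, which can also be kept in a small memory to enable style replay. The per-class statistics dictionary and the whole-map content normalization are simplifying assumptions.

import torch

def adain(content, style_mean, style_std, eps=1e-5):
    # Replace the channel-wise statistics of the content features with the style statistics.
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    return style_std * (content - c_mean) / c_std + style_mean

def class_conditional_adain(content, seg_mask, class_style_stats):
    # content: (1, C, H, W) features, seg_mask: (1, 1, H, W) integer class map,
    # class_style_stats: dict mapping class id -> (mean, std), each of shape (1, C, 1, 1).
    # Storing these per-class (mean, std) pairs is a lightweight way to replay earlier styles.
    out = content.clone()
    for cls, (s_mean, s_std) in class_style_stats.items():
        region = (seg_mask == cls).expand_as(content)
        out = torch.where(region, adain(content, s_mean, s_std), out)
    return out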
Turning to the aspect of enhancing a model's robustness during inference, a novel framework that improves the efficacy of self-training during online model adaptation is introduced. The basic idea is to convert the currently encountered domain shift into a more gradual one, where self-training has been shown to be particularly effective. This is achieved by introducing an artificial intermediate domain at each time step t, created either through mixup or lightweight style transfer. The approach proves highly effective not only for urban scene segmentation in non-stationary environments but also for classification tasks.
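For the mixup variant, the intermediate domain can be as simple as a convex combination of source-style images and the current target batch; the fixed mixing ratio and the image-level blending below are illustrative assumptions.

import torch

def make_intermediate_domain(source_imgs, target_imgs, lam=0.5):
    # Blend source-style images with the current target batch to create an
    # artificial intermediate domain; a moderate lam turns one large domain
    # shift into two smaller, more gradual ones on which self-training works well.
    return (1.0 - lam) * source_imgs + lam * target_imgs

Pseudo-labels predicted on the target images could then be used to self-train on the mixed images at every time step t, so that the adaptation always faces only a small residual shift.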
Since methods for online model adaptation must remain stable across diverse scenarios, a comprehensive overview of many practically relevant test settings is established. By thoroughly analyzing and empirically validating the challenges in these scenarios, a highly effective self-training-based adaptation framework is derived that includes diversity and certainty weighting, continuous weight ensembling, and prior correction. Extensive experiments across a wide range of datasets, settings, and models not only validate the framework's superiority but also reveal the limitations of existing methods for online TTA. The thesis concludes with a summary of the key contributions, a discussion of the various techniques to perform Unsupervised Model Adaptation, and an outlook on remaining challenges for both UDA and online TTA.
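As a brief illustration of the self-training components mentioned above, the following PyTorch sketch combines a certainty-weighted pseudo-label loss with continuous weight ensembling; the threshold, the momentum value, and the loss formulation are assumptions for illustration and omit the diversity weighting and prior correction.

import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(ensemble_model, model, momentum=0.999):
    # Continuously ensemble the weights of the adapting model into a stable copy
    # (the copy is typically initialized as a deep copy of the source model).
    for e_param, param in zip(ensemble_model.parameters(), model.parameters()):
        e_param.mul_(momentum).add_(param, alpha=1.0 - momentum)

def certainty_weighted_loss(logits, threshold=0.9):
    # Self-training loss on the model's own pseudo-labels, where only confident
    # predictions contribute and more confident ones receive a larger weight.
    probs = logits.softmax(dim=1)
    confidence, pseudo_labels = probs.max(dim=1)
    weights = confidence * (confidence >= threshold).float()
    per_sample = F.cross_entropy(logits, pseudo_labels, reduction="none")
    return (weights * per_sample).sum() / weights.sum().clamp(min=1.0)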