Institute for Visualization and Interactive Systems
University of Stuttgart
Universitätsstraße 38
D-70569 Stuttgart

Bachelorarbeit

Flow Prediction Meets Flow Learning: Combining Different Learning Strategies for Computing the Optical Flow

Haining Yang

Course of Study: Informatik
Examiner: Prof. Dr.-Ing. Andrés Bruhn
Supervisor: Daniel Maurer
Commenced: November 14, 2018
Completed: May 14, 2019

Abstract

Optical flow estimation is an important topic in computer vision. The goal is to compute the inter-frame displacement field between two consecutive frames of an image sequence. In practice, optical flow estimation plays a significant role in multiple application domains, including autonomous driving and medical imaging. Different categories of methods exist for solving the optical flow problem. The most common technique is based on a variational framework, where an energy functional is designed and minimized in order to calculate the optical flow. Recently, other approaches, such as pipeline-based and learning-based methods, have also attracted much attention. Despite the great advances achieved by these algorithms, it is still difficult to find an algorithm that performs well under all challenges, e.g. lighting changes, large displacements, and occlusions. Hence, it is worth combining different algorithms to create a new approach that unites their advantages. Inspired by this idea, in this thesis we select two top-performing algorithms, PWC-Net and ProFlow, as candidate approaches and combine them. While PWC-Net generally performs well in non-occluded areas, ProFlow can provide particularly accurate estimates in occluded areas. We therefore expect the combination of these two algorithms to yield an algorithm that performs well in both occluded and non-occluded areas. Since ProFlow is a pipeline approach, we first integrate PWC-Net into the ProFlow pipeline and then evaluate the newly created pipeline, PWC-ProFlow, on the MPI Sintel and KITTI 2015 benchmarks. Contrary to our expectations, the newly created algorithm does not exceed the candidate methods PWC-Net and ProFlow on either benchmark. Through an analysis of the evaluation results, we explore the problems hidden in the PWC-ProFlow pipeline that can lead to its underperformance, and we collect ideas for modifications. Based on these ideas, we propose six new pipelines with the purpose of improving the estimation accuracy of PWC-ProFlow. All newly generated pipelines are also evaluated on the Sintel and KITTI benchmarks. The experimental results demonstrate that all modifications achieve substantial improvements over PWC-ProFlow on both datasets. Further, all of them also outperform the ProFlow pipeline on both benchmarks. Compared to PWC-Net, one modification exceeds it on the KITTI dataset, while all our modifications achieve better performance on the Sintel dataset; in particular, one modification presents a significant improvement with a more than 10% lower average endpoint error on Sintel.

Contents

1 Introduction
  1.1 Motivation
  1.2 Previous Work
  1.3 Goal of the Thesis
  1.4 Outline
2 Background
  2.1 Optical Flow Problem
  2.2 Aperture Problem
  2.3 Occlusion Handling
  2.4 The Coarse-to-fine Approach
  2.5 Quality Measures
  2.6 Visualization Techniques
3 Related Algorithms
  3.1 PWC-Net
  3.2 ProFlow
4 Combining PWC-Net and ProFlow
  4.1 Pipeline Update
  4.2 Evaluation
5 Modification
  5.1 Modifying the Input of Inpainting
  5.2 Replacing Inpainting with Dense Filling
  5.3 Applying New Filtering Algorithms
6 Conclusion
Bibliography
Appendix A: Experiment Results per Scene of Sintel Benchmark

1 Introduction

1.1 Motivation

The extraction of motion information from image sequences is one of the fundamental problems in computer vision. Typically, one considers the motion estimation problem to be equivalent to the optical flow problem [VP89], which computes the inter-frame motion field between two consecutive images, as presented in Figure 1.1. The applications of optical flow estimation cover different domains including autonomous navigation, advanced driver assistance systems (ADAS), video analysis, and medical imaging. Since optical flow allows estimating the location of an object in the next frame, it can be applied to object tracking. Combined with navigation information, such as rotational velocities and terrain positions, optical flow prediction can assist in solving different tasks in robot and vehicle navigation, such as collision detection and obstacle avoidance [CGN14]. The same benefits apply to ADAS applications, where computing optical flow is highly useful for estimating the vanishing point, the horizon line, and the ego-motion [Onk13]. In combination with modern object detection frameworks employing deep neural networks, tasks in the fields of video analysis and video surveillance, such as action recognition, gesture recognition, and facial expression recognition, can be performed more efficiently [FBK15; HZZ17]. Similarly, in a biomedical context, the estimation of blood flow and the observation of radiation dose distribution are examples of medical applications that require accurate motion analysis [FBK15]. All of this indicates the important role of optical flow in real application domains.

In most scenarios, approaches to the optical flow problem must cope with many difficult situations, e.g. motion discontinuities, large displacements, lighting changes, and occlusions. Despite the great progress achieved by recent algorithms, it seems unlikely that any existing algorithm can handle all of these issues. While some approaches may perform well on some of the challenges, other methods may overcome other difficulties. Hence, it would be desirable to combine different computing strategies in a reasonable way to create a novel approach that can outperform these algorithms and unite their advantages.
Figure 1.1: Optical flow estimation between two consecutive frames. Left: Image frame at time t. Right: Image frame at time t+1. Middle: Optical flow: estimated displacement vectors on top of the flow color scheme. Figure from [WC11].

1.2 Previous Work

The primary research on the optical flow problem dates back decades. However, considering its significance in different applications, the optical flow problem remains one of the most popular topics in computer vision. A wide range of approaches exist for solving this problem, from traditional variational approaches to modern learning-based and pipeline-based approaches.

Variational approach. In 1981, Horn and Schunck first proposed a variational optical flow approach, which computes the motion field as the minimizer of an energy functional consisting of a data term and a smoothness term [HS81]. The data term penalizes deviations from the brightness constancy assumption, while the smoothness term enforces smoothness of the estimation result. Even with this new approach, difficulties such as illumination changes, large displacements, and a lack of robustness remained unresolved [Bru18]. To overcome these challenges, diverse kinds of data terms and smoothness terms have been explored over the years. These data terms consider adopting the gradient constancy assumption to better resist illumination changes, using RGB channels to fully exploit the information in color sequences, and applying non-quadratic penalty functions to reduce the influence of outliers [Bru18]. Different smoothness terms have also been proposed to cope with image discontinuities and to increase robustness, such as image-driven isotropic and anisotropic smoothness terms [Bru18]. Beyond that, researchers have proposed to combine variational approaches with other techniques. For example, embedding the variational approach in a coarse-to-fine scheme can help to estimate large displacements, and applying feature matching in the estimation process can further increase the detection accuracy of small and rapidly moving objects [FBK15]. Even though variational approaches have become more and more advanced, they still yield complex optimization problems and, hence, extensive computational costs.

Partially learning-based approach. In recent years, deep learning has come to be regarded as a great tool for tackling high-level computer vision problems, such as object detection and object segmentation [LDY18; ZZXW19]. Inspired by its good performance, researchers have also integrated this technique into traditional optical flow approaches, yielding what are known as partially learning-based approaches. In these methods, CNNs are normally used as components to solve intermediate tasks that are computationally expensive or difficult to solve with traditional techniques. For example, Xu et al. [XRK17] took advantage of CNNs to learn nonlinear features for constructing a full cost volume. Methods like [BVS17; GW16; SRLB08] also adopted CNNs as feature extractors. Further, in MRFlow, proposed by Wulff et al. [WSB17], convolutional networks were applied to classify rigid and non-rigid regions in a scene, so that different regions can be handled separately: moving objects are estimated using an unconstrained flow method, while for static regions the Plane-Parallax framework is adopted to estimate the scene structure and the camera motion. Bai et al.
[BLKU16] used deep learning to perform instance-level segmentation, aiming to produce an individual optical flow estimate for each object and the background. Sevilla-Lara et al. [SSJB16] employed CNN models for semantic scene segmentation to build different motion models according to semantic class labels. Since all these methods treat CNNs only as partial components, real-time estimation is still hard to accomplish.

Pure learning-based approach. Instead of using CNNs merely as parts of the methods, pure learning-based approaches aim to train an end-to-end network that allows estimating the flow field directly from the input images. With a pretrained network, this kind of approach can realize real-time estimation. Dosovitskiy et al. [DFI15] first implemented this paradigm by proposing two CNN models, FlowNetS and FlowNetC. Both models demonstrate the possibility of casting the optical flow problem as a supervised learning problem. However, their performance could not compete with the top-performing variational approaches. In order to improve the accuracy, Ilg et al. [IMS17] stacked multiple FlowNetS and FlowNetC instances together to create the more complex model FlowNet2, where small and large displacements are estimated by two different networks separately and then fused to produce the final estimate. Due to the deep network, FlowNet2 performs on par with the state of the art. To decrease the model size, Ranjan et al. [RB17] developed a network called SpyNet, which embeds traditional principles, such as the image pyramid and the warping strategy, in neural networks. Inspired by SpyNet, PWC-Net [SYLK18] was recently proposed, which is more accurate than all existing approaches based on an end-to-end network. This network will be explained in Chapter 3.

Pipeline approach. Another type of approach is the pipeline approach, which includes four steps: matching, filtering, inpainting, and variational refinement [MSB17]. Pipeline approaches aim to provide a good initial flow, computed in the first three steps, for the final variational refinement. Compared to other methods, pipeline approaches possess higher flexibility and adaptivity, i.e. the algorithms used in each step can be configured according to the application scenario.

The matching step requires solving a correspondence problem for two given images. To this end, we first need to compute features that are ideally invariant under illumination changes and image deformations, for example using the famous SIFT descriptor [Low04]. The descriptors can be applied to each pixel of an image or only to interest locations, such as corners. For each input image, a set of features is computed; afterwards, matching criteria based on similarities are needed to find the best matches between the two sets [Bru18]. Since the second step in a pipeline approach is filtering, it is desirable to generate rather dense matches in the first step. In this context, algorithms like [LHS18; MHG15; WRHS13] are examples of appropriate candidates.

The matches computed in the first step usually contain outliers, such as wrong matches caused by occlusions or cases where two or more pixels in the reference image correspond to the same pixel in the next frame. It is important to detect and filter these outliers in order to guarantee the quality of the inpainting step and the overall estimation.
Bidirectional consistency checking and uniqueness checking are two approaches that can implement this filtering [Bru18]. More details about these two methods will be given in Section 2.3.

In the filtering step, a certain number of matches are filtered out. As a result, a non-dense flow field is generated. However, the final variational refinement requires a fully dense flow field as input. In this context, an inpainting step is applied to achieve a sparse-to-dense interpolation. Normally, the missing matches are filled in based on neighboring information and the contour structure in order to retain sharp edges, as in EpicFlow [RWHS15] and RIC [HLS17].

The final step makes use of a variational approach to refine the inpainted flow field. Here, a standard energy minimization framework is employed to gain sub-pixel precision [Bru18]. In addition to the traditional first-order refinement, Maurer et al. [MSB17] proposed an advanced variational model for refinement, which combines an illumination-aware data term with an order-adaptive smoothness term. In evaluations on public benchmarks, this algorithm brings more improvements to the input flow field than traditional methods.

1.3 Goal of the Thesis

The goal of this thesis is to combine two top-performing algorithms, PWC-Net and ProFlow, to create a new approach that optimally combines the advantages of both. PWC-Net is a pure learning-based approach that allows real-time estimation of optical flow with high accuracy, especially in non-occluded areas. ProFlow is a pipeline approach combined with a shallow CNN, which serves to determine the underlying motion model and thus enables similarly accurate estimation in occluded areas. After combining them, the newly created approach is evaluated on the public benchmarks MPI Sintel [BWSB12] and KITTI 2015 [MG15], with an emphasis on accuracy, robustness, and runtime. In addition, the interaction between PWC-Net and the different steps of ProFlow is investigated. Based on the analysis of this interaction and the evaluation, the newly generated method is further modified and optimized.

1.4 Outline

The rest of the thesis is organized as follows: Chapter 2 introduces the background of the optical flow problem, including the definition of optical flow, the detection of occlusions, the common method used to cope with large displacements, the evaluation metrics, and the standard visualization techniques. In Chapter 3, the two algorithms PWC-Net and ProFlow are explained to lay a foundation for the subsequent combination. Chapter 4 first explains the combination of PWC-Net with the ProFlow pipeline, and then presents the evaluation results of the newly created pipeline PWC-ProFlow on the Sintel and KITTI benchmarks. After analyzing the evaluation results, six different modifications are proposed in Chapter 5 to optimize the PWC-ProFlow pipeline. All modifications are also evaluated on the two datasets. Further, the estimation performance of the different modifications is compared with each other and with the performance of PWC-Net, ProFlow, and PWC-ProFlow. Lastly, Chapter 6 summarizes the work completed in the thesis with an overview of future work.

2 Background

This chapter introduces the basic concepts regarding optical flow estimation and lays an important foundation for the following chapters. Section 2.1 introduces the formal definition of the optical flow problem and the brightness constancy assumption.
The aperture problem, an inherent problem of optical flow estimation, is described in Section 2.2. Since handling occlusions is one of the focuses of this thesis, two occlusion detection methods are explained in Section 2.3, followed by the introduction of the coarse-to-fine scheme in Section 2.4. In order to better evaluate and present estimation results, quality measures and commonly used visualization techniques are discussed in Section 2.5 and Section 2.6.

2.1 Optical Flow Problem

Given two consecutive frames of an image sequence taken by a camera at different times, the optical flow problem is to estimate the inter-frame displacement field between the two input frames. The displacement field usually represents the apparent motion of brightness patterns, which is caused by relative movement between the camera and the object scene, such as a moving camera or moving objects [FBK15]. The foundation of solving the optical flow problem is to find corresponding pixels between the two frames. For this purpose, we always assume that certain features of a pixel remain constant over time. This assumption allows obtaining pixel matches by tracking these features between frames. Based on the pixel matches, the corresponding displacement vector of each pixel can be computed. In the following, we use $I(x, y, t)$ to represent a continuous image sequence, where $(x, y)$ denotes a location within the image domain, $t$ records the time of a frame, and $I(x, y, t)$ yields the intensity at position $(x, y)$ in the frame at time $t$.

The Grey Value Constancy Assumption

The most widely used assumption is the grey value constancy assumption, which states that the intensity of a point in the image domain remains constant as it moves. Denoting the displacement vector of a point as $(u(x, y, t), v(x, y, t))^\top$, the grey value constancy assumption reads as follows [Bru18]:

$I(x + u, y + v, t + 1) - I(x, y, t) = 0.$    (2.1)

However, this equation often leads to a difficult optimization process, since the unknowns appear implicitly. A common way to deal with this is to linearize the equation via a first-order Taylor expansion around the point $(x, y, t)^\top$, thus obtaining the brightness constancy constraint equation (BCCE) [Bru18]:

$I_x u + I_y v + I_t = 0,$    (2.2)

where $I_x$ and $I_y$ represent the partial derivatives of $I$ w.r.t. $x$ and $y$, respectively, and $I_t$ is the derivative in the temporal direction. The linearized expansion is usually only appropriate for small displacements or very smooth images. In the case of large displacements, the approximation is less accurate. Since the linearization plays an important role in most differential methods, other techniques need to be combined with it to overcome this drawback. One of the standard strategies to cope with large displacements is to employ a coarse-to-fine scheme [BBPW04], which will be described in Section 2.4.

2.2 Aperture Problem

The BCCE gives one equation with two unknown variables, $u$ and $v$. Using this equation alone can only determine the flow component in the direction of the image gradient, which is referred to as the normal flow and computed as follows:

$0 = I_x u + I_y v + I_t = (u, v)^\top \nabla I + I_t$    (2.3)

$(u, v)^\top_{n} = \left[ (u, v)^\top \frac{\nabla I}{|\nabla I|} \right] \frac{\nabla I}{|\nabla I|} = \frac{-I_t}{|\nabla I|} \, \frac{\nabla I}{|\nabla I|}$    (2.4)

However, the flow component perpendicular to the image gradient cannot be determined by the BCCE. According to Equation 2.3, adding any arbitrary value in the perpendicular direction does not violate the BCCE; thus, an infinite number of vectors $(u(x, y, t), v(x, y, t))^\top$ qualify as solutions. This is known as the aperture problem, indicating that the motion of linear or untextured structures is inherently ambiguous without neighboring information [FBK15], as shown in Figure 2.1. This non-uniqueness makes the optical flow problem ill-posed in the sense of Hadamard. To make the two-dimensional optical flow vector uniquely solvable, additional information or assumptions need to be involved, such as neighboring information [LK81] or a smoothness assumption on the optical flow field [HS81].
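To make Equation 2.4 concrete, the following minimal NumPy sketch computes the normal flow from precomputed image derivatives. The derivative arrays and the function name are assumptions for illustration; the derivatives could, for example, be obtained via finite differences.

```python
import numpy as np

def normal_flow(I_x, I_y, I_t, eps=1e-8):
    """Compute the normal flow of Eq. 2.4 from image derivatives.

    I_x, I_y, I_t: arrays holding the spatial and temporal derivatives
    of the image sequence at every pixel. Only the flow component along
    the image gradient can be recovered from the BCCE alone.
    """
    grad_mag = np.sqrt(I_x**2 + I_y**2) + eps   # |grad I|, eps avoids /0
    magnitude = -I_t / grad_mag                 # -I_t / |grad I|
    u_n = magnitude * I_x / grad_mag            # project onto the unit
    v_n = magnitude * I_y / grad_mag            # gradient direction
    return u_n, v_n
```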
Figure 2.1: Aperture problem. Left: In the case $|\nabla I| \neq 0$, a single point corresponds to multiple possible points. Right: In the case $|\nabla I| = 0$, no information can be obtained. Figure from [Bru18].

2.3 Occlusion Handling

In the previous two sections, we explained the definition of the optical flow problem and introduced the basic ideas of how to solve it. In this section, we deal with one of the most challenging difficulties of optical flow estimation: occlusions. In most cases, occluded pixels are the main source of outliers; thereby, handling these pixels appropriately is essential for a reliable and accurate estimation. Occluded regions in a frame are defined as the set of pixels that become hidden by moving objects in the next frame or leave the camera's field of view while moving [FBK15]. As shown in Figure 2.2, the pixels marked in white belong to the occluded regions. Obviously, occluded pixels possess no matching correspondence in the next frame. Hence, wrong or inconsistent matches can appear in the occluded regions.

Figure 2.2: Demonstration of occluded regions in a frame (ambush_2 #4 and ambush_6 #15 of the Sintel dataset). From left to right: Reference frame, subsequent frame, and occlusion map [BWSB12].

A standard method of occlusion detection is bidirectional consistency checking, which makes use of both forward and backward matches to check flow consistency. Suppose we have two frames at times $t$ and $t+1$. As a first step, we compute the forward flow $(u_{fw}, v_{fw})$ from frame $t$ to frame $t+1$ and the backward flow $(u_{bw}, v_{bw})$ from frame $t+1$ to frame $t$. Then the consistency check is performed in terms of the following expression [Bru18]:

$|\Delta u(x, y, t)| = \big\{ \big( u_{fw}(x, y, t) + u_{bw}(x + u_{fw}(x, y, t),\, y + v_{fw}(x, y, t),\, t + 1) \big)^2 + \big( v_{fw}(x, y, t) + v_{bw}(x + u_{fw}(x, y, t),\, y + v_{fw}(x, y, t),\, t + 1) \big)^2 \big\}^{1/2},$    (2.5)

where $(x, y, t)$ and $(x + u_{fw}(x, y, t),\, y + v_{fw}(x, y, t),\, t + 1)$ denote a location in frame $t$ and its corresponding location in frame $t+1$, respectively.

For correctly estimated pixels, the sum terms vanish due to consistent flow in both directions, and the value of $|\Delta u(x, y, t)|$ is close to 0. Inconsistent matches produce a large $|\Delta u(x, y, t)|$ due to an inaccurate prediction in at least one direction, which can be seen as an indicator of an occluded pixel. In practice, we usually set a threshold $T_{occ}$ for $|\Delta u(x, y, t)|$ and treat all pixels with $|\Delta u(x, y, t)|$ larger than the threshold as occluded. As one can see, the bidirectional check requires the computation of flow fields in two directions, which can double the computational costs.
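A compact NumPy sketch of the bidirectional consistency check of Equation 2.5 might look as follows. For simplicity, the backward flow is looked up at the nearest pixel instead of being interpolated bilinearly; the function name and the default threshold are assumptions.

```python
import numpy as np

def bidirectional_check(u_fw, v_fw, u_bw, v_bw, t_occ=1.0):
    """Flag inconsistent (likely occluded) pixels via Eq. 2.5.

    u_fw, v_fw: forward flow (t -> t+1) as (H, W) arrays.
    u_bw, v_bw: backward flow (t+1 -> t) as (H, W) arrays.
    Returns a boolean mask, True where |delta u| exceeds t_occ.
    """
    h, w = u_fw.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # target positions in frame t+1, rounded to the nearest pixel
    xt = np.clip(np.round(xs + u_fw).astype(int), 0, w - 1)
    yt = np.clip(np.round(ys + v_fw).astype(int), 0, h - 1)
    # for consistent matches, forward and backward flow cancel out
    du = u_fw + u_bw[yt, xt]
    dv = v_fw + v_bw[yt, xt]
    return np.sqrt(du**2 + dv**2) > t_occ
```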
An alternative strategy for occlusion detection is uniqueness checking, which is based on the consideration of matching uniqueness: multiple pixels mapping to the same pixel in the next frame under forward warping are possibly occluded by each other. This idea is adopted in [XJM12]. The occluded regions are detected through the following equation:

$o(x, y, t) = \begin{cases} 0 & \text{if } m(x + u_{fw}(x, y, t),\, y + v_{fw}(x, y, t),\, t + 1) = 1 \\ 1 & \text{else} \end{cases}$    (2.6)

where $(x + u_{fw}(x, y, t),\, y + v_{fw}(x, y, t),\, t + 1)$ represents the target position in the next frame. The function $m(\mathbf{z})$ counts the total number of pixels landing at position $\mathbf{z}$. In the case that the target position leaves the image boundary, $m(\mathbf{z})$ is set to 0 by default. $o(x, y, t)$ is assigned a value of 0 (non-occluded) only when $(x, y, t)$ is the unique correspondence for its target position.

2.4 The Coarse-to-fine Approach

After introducing how to detect occluded regions, we now deal with another challenging problem: computing optical flow for large displacements. A widely used strategy is to embed the variational approach in a coarse-to-fine scheme, which yields the coarse-to-fine variational approach. This approach is based on an image pyramid consisting of down-sampled versions of the input image. For linearization-based variational approaches, large displacements can be accurately estimated at coarser levels, since they become small due to the effect of down-sampling. Further, the flow field estimated at one level is usually interpolated and used as an initialization for the next level, which helps to reduce the search scope and to avoid getting trapped in poor local minima during estimation [Bru18; FBK15].

Given an input image $f$, we first construct an image pyramid with $L$ levels. The zeroth level holds the original image $f$, and level $L-1$ is the coarsest level of the pyramid. Starting from the zeroth level, the image at an arbitrary level $l$ can be obtained through the following two steps: (1) apply Gaussian filtering to the image at level $l-1$ to remove noise and to avoid aliasing effects in the subsequent sampling; (2) down-sample the filtered image to reduce its resolution by a predefined factor $\eta$.

Figure 2.3: Image pyramid of 4 levels with the down-sampling factor $\eta = 0.5$. Image from the Teddy sequence of the Middlebury dataset [BSL11].

After generating the image pyramids for the two input images, the whole estimation process starts at the coarsest level. Two pyramid levels are connected through the so-called warping strategy. For example, in Figure 2.4, we first compute the flow field at the coarsest level $L-1$ using a variational approach. After that, the computed flow field is up-sampled to the resolution of level $L-2$ and then applied to warp the second image towards the first. The first image and the warped second image are then fed to the variational approach at level $L-2$. The flow field is refined iteratively from the coarsest level down to the zeroth level, where the final estimation result is produced.

Figure 2.4: Left: Image pyramid of the first input image. Right: Image pyramid of the second input image. Middle: Operations applied at each level. Figure is a modified version of an image from [SB12].
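The pyramid construction of Section 2.4 can be sketched in a few lines. The sketch below uses SciPy for smoothing and resampling; the filter width sigma and the helper name are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def build_pyramid(image, levels=4, eta=0.5, sigma=1.0):
    """Build an image pyramid as described in Section 2.4.

    Expects a 2D (grayscale) array. Level 0 holds the original image;
    every coarser level is obtained by Gaussian presmoothing (against
    noise and aliasing) followed by downsampling with the factor eta.
    """
    pyramid = [np.asarray(image, dtype=np.float64)]
    for _ in range(1, levels):
        smoothed = gaussian_filter(pyramid[-1], sigma=sigma)
        pyramid.append(zoom(smoothed, eta, order=1))  # bilinear resampling
    return pyramid  # pyramid[-1] is the coarsest level L-1
```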
2.5 Quality Measures

In the previous sections, we described common approaches to detect occlusions and to handle large displacements in the estimation process. In this section, we introduce two commonly used error metrics to evaluate the performance of different optical flow approaches. Normally, the error between the estimated flow and the ground truth flow is quantitatively measured with such evaluation metrics. In the following, we adopt the discrete version of the continuous image signal $I(x, y, t)$, where $u_{i,j}$ denotes the value at pixel $(i, j)$ of the image. The width and height of the image are $N$ and $M$, respectively.

The first error metric is the average endpoint error (AEE), proposed by Otto and Nagel in 1994 [ON94]. The AEE computes the average Euclidean distance between the estimated flow vectors $(u^e, v^e)$ and the ground truth flow vectors $(u^t, v^t)$:

$AEE(u^t, u^e) = \frac{1}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} \sqrt{ (u^t_{i,j} - u^e_{i,j})^2 + (v^t_{i,j} - v^e_{i,j})^2 }.$    (2.7)

The second metric, the bad pixel error (BP), is based on the computation of the endpoint error (EE). A pixel is defined as a bad pixel when its EE is greater than a predefined tolerance $T$. Equation 2.8 shows the computation of the BP, where $\mathbb{1}(\cdot)$ represents an indicator function. In the equation, the total number of bad pixels is first accumulated, and then the percentage of bad pixels relative to all pixels is calculated:

$BP(u^t, u^e) = \frac{100}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} \mathbb{1}\left( \sqrt{ (u^t_{i,j} - u^e_{i,j})^2 + (v^t_{i,j} - v^e_{i,j})^2 } > T \right).$    (2.8)
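Both metrics are easy to state in code. The following NumPy sketch mirrors Equations 2.7 and 2.8; the function names are chosen for illustration.

```python
import numpy as np

def aee(u_e, v_e, u_t, v_t):
    """Average endpoint error, Eq. 2.7."""
    return np.mean(np.sqrt((u_t - u_e)**2 + (v_t - v_e)**2))

def bad_pixels(u_e, v_e, u_t, v_t, tolerance=3.0):
    """Percentage of bad pixels, Eq. 2.8: pixels whose EE exceeds T."""
    ee = np.sqrt((u_t - u_e)**2 + (v_t - v_e)**2)
    return 100.0 * np.mean(ee > tolerance)
```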
2.6 Visualization Techniques

Another way to compare estimation results is to visualize them. The visualization of displacement fields can provide direct insight into how good the estimation performance is. There exist two main visualization methods, i.e. color visualization and arrow visualization [FBK15].

In the arrow visualization, see Figure 2.5 (c), the estimated motion vectors are presented directly, which provides an intuitive perception of the physical motion. If the motion arrows of all pixels in a frame were visualized, we might see an entirely black area. Hence, the flow arrows used for display are typically sampled from the original motion field, for the sake of a clean visualization. For color visualization, a color code is adopted, where the color hue and saturation illustrate the direction and magnitude of a flow vector. The flow vector of each pixel is associated with a color; thus, all flow vectors can be visualized via an image, as shown in Figure 2.5 (d). Further, since humans are more sensitive to colors, the color visualization provides a better view of small motion vectors.

Figure 2.5: Visualization techniques of optical flow. (a) Input image at time t. (b) Input image at time t+1. (c) Optical flow visualized with arrows. (d) Color code and optical flow visualized in colors. Figure from [FBK15].

3 Related Algorithms

After explaining the basic concepts regarding optical flow estimation in the previous chapter, we now analyze the two top-performing algorithms PWC-Net and ProFlow. Section 3.1 introduces PWC-Net, including its network structure and the definition of its loss function, while Section 3.2 presents the core idea of the ProFlow pipeline and how each step in the pipeline works.

3.1 PWC-Net

As described in Section 1.2, PWC-Net belongs to the pure learning-based approaches, which train an end-to-end network to solve the optical flow problem. A common difficulty of this kind of method is the trade-off between model size and accuracy [SYLK18]. Most methods, e.g. FlowNet2 [IMS17], improve the estimation precision by deepening the network, which in turn places a large burden on training. To reduce the network size while maintaining high accuracy, PWC-Net combines several basic and well-established principles with deep neural networks. This idea was already realized in SpyNet [RB17]; however, SpyNet embeds only two traditional principles, the image pyramid and the warping strategy, in convolutional networks. By making more use of domain knowledge, PWC-Net achieves higher accuracy and better runtime performance.

One of the principles applied in PWC-Net is the coarse-to-fine strategy, i.e. pyramid processing. At each level of the pyramid exists an end-to-end CNN, which generates the flow estimate for this level. The result from one level is passed on to the next level. The networks at different levels possess the same structure but different parameters. In the following, we first explain the generation of the feature pyramids for the two input images and then analyze the generic network structure at one pyramid level (level m), as shown in Figure 3.1. Further, the definition of the loss function used for training is discussed. The explanations are based on the paper [SYLK18]. In the following, we use the symbol $\mathbf{w}$ to represent the flow vector $(u, v)^\top$ to better formulate the equations.

Figure 3.1: Feature pyramids and network structure of PWC-Net at one pyramid level. Figure from [SYLK18].

3.1.1 Feature Pyramid Extractor

Since raw images are usually very sensitive to shadows and illumination changes, PWC-Net uses a neural network to extract features that remain invariant to such changes and to build learnable feature pyramids. The generation process of the feature pyramid is quite similar to that of a normal image pyramid. In the following, we first introduce the general building process of a feature pyramid. Assume we want to construct a feature pyramid with $L$ levels for an input image $I$, denoting the levels of the pyramid as $\{c^0, c^1, \ldots, c^{L-1}\}$. By default, the original image is placed at the bottom level, i.e. $c^0 = I$, and the feature map at level $m$ is obtained by a series of convolutions of the feature map at level $m-1$. Here, we first apply a convolutional layer with stride 2 to downscale the resolution of the feature map at level $m-1$ by a factor of 2. Afterwards, some convolutional layers are used to extract more complex features while maintaining the resolution. The feature pyramid is generated from the finest level to the coarsest by sequentially applying the convolutional layers between every two levels.

In PWC-Net, a feature pyramid with 7 levels is built for each input image. Since PWC-Net takes two consecutive images as inputs, two fully identical feature pyramid extractors are combined to build a Siamese network, which generates the two 7-level feature pyramids simultaneously. Figure 3.2 presents the network structure of the feature pyramid extractor used in PWC-Net. As we can see, the numbers of feature channels of the convolution filters from the first to the sixth level are 16, 32, 64, 96, 128, and 192, respectively.

Figure 3.2: The feature pyramid extractor network. The first image (t = 1) and the second image (t = 2) are encoded using the same network with identical parameters and weights. Each convolution is followed by a leaky ReLU unit. $c^l_t$ denotes the extracted features of image $t$ at level $l$. Figure from [SYLK18].
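To make the structure concrete, the following PyTorch sketch shows a feature pyramid extractor in the spirit of Figure 3.2. The channel counts follow the text; the exact number of stride-1 convolutions per level and the leaky ReLU slope are assumptions, so this is an illustration rather than the original implementation.

```python
import torch
import torch.nn as nn

class FeaturePyramidExtractor(nn.Module):
    """Sketch of a PWC-Net-style feature pyramid extractor.

    Each level halves the resolution with a stride-2 convolution and
    then applies a stride-1 convolution; every convolution is followed
    by a leaky ReLU. Channel counts per level follow the text (16..192).
    """
    def __init__(self, channels=(16, 32, 64, 96, 128, 192)):
        super().__init__()
        self.levels = nn.ModuleList()
        in_ch = 3  # RGB input image at level 0
        for out_ch in channels:
            self.levels.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                nn.LeakyReLU(0.1),
                nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1),
                nn.LeakyReLU(0.1),
            ))
            in_ch = out_ch

    def forward(self, image):
        features, x = [], image
        for level in self.levels:
            x = level(x)
            features.append(x)  # c^1 ... c^6 from fine to coarse
        return features
```

Applying the same module to both frames realizes the Siamese design, e.g. `feats1, feats2 = extractor(img1), extractor(img2)`.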
3.1.2 Warping Layer

The second principle applied in PWC-Net is the warping strategy, which is an important tool in traditional coarse-to-fine approaches for connecting different pyramid levels. PWC-Net realizes the warping strategy through a warping layer, which takes as inputs the level-$m$ features of the second image, $c^m_2$, and the up-sampled flow estimate from level $m+1$, $\mathrm{up}_2(\mathbf{w}^{m+1})$. As mentioned above, we use the symbol $\mathbf{w} = (u, v)^\top$ to represent the flow field. The warped feature map at level $m$ is generated using the following equation:

$c^m_w(\mathbf{x}) = c^m_2\big(\mathbf{x} + \mathrm{up}_2(\mathbf{w}^{m+1})(\mathbf{x})\big),$    (3.1)

where $\mathbf{x}$ denotes the location of a pixel, and $\mathrm{up}_2(\mathbf{w}^{m+1})(\mathbf{x})$ yields the estimated flow vector of pixel $\mathbf{x}$ extracted from the up-sampled flow estimate of level $m+1$. In most cases, the result of $\mathbf{x} + \mathrm{up}_2(\mathbf{w}^{m+1})(\mathbf{x})$ is not located at an exact pixel position; thus, bilinear interpolation is applied to compute the warped feature $c^m_w(\mathbf{x})$. At the coarsest level, $\mathrm{up}_2(\mathbf{w}^L)$ is initialized as zero by default. Since the motion between the first and the warped image is usually small, the search range for each pixel can be constrained to a small size, so that a small CNN suffices to estimate this motion.

3.1.3 Cost Volume Layer

The third principle used is the cost volume. The cost volume is a widely used component in stereo matching that stores the matching costs for associating a pixel with its corresponding pixels in another frame. Many optical flow algorithms also adopt a full cost volume, such as DCFlow [XRK17]. Different from stereo matching, where the search space can be restricted to a 1D space along the epipolar line, a more complex 2D search has to be performed for optical flow [LUT18]. Thus, building a full cost volume can be rather computationally expensive. Instead of generating a full cost volume, PWC-Net builds a partial cost volume at each pyramid level by setting a limited search window for each pixel, which helps to reduce both computational costs and memory usage.

The cost volume layer constructs this partial cost volume, where the matching cost is defined as the correlation between the features of the first image and the warped features of the second image:

$cv^m(\mathbf{x}_1, \mathbf{x}_2) = \frac{1}{N} \big(c^m_1(\mathbf{x}_1)\big)^\top c^m_w(\mathbf{x}_2),$    (3.2)

where $\mathbf{x}_1$ and $\mathbf{x}_2$ are two pixels, and $N$ denotes the length of the feature vectors $c^m_1(\mathbf{x}_1)$ and $c^m_w(\mathbf{x}_2)$. To construct a partial cost volume, the relative distance between $\mathbf{x}_1$ and $\mathbf{x}_2$ is restricted to a range of $d$ pixels, i.e. $|\mathbf{x}_1 - \mathbf{x}_2|_\infty \le d$. It is sufficient to set $d$ to a small value at each level. On the one hand, at coarser levels a one-pixel motion corresponds to a large motion range at finer levels. On the other hand, as described above, the warping strategy leaves only a small motion increment at each level, so a small search range suffices for the estimation at each level. Instead of building a 4D partial cost volume, the association costs for each pixel are stored as a vector to build a 3D partial cost volume, which is more suitable as input for the estimator network. For level $m$, the size of the partial cost volume is $(2d+1)^2 \times H^m \times W^m$, where $H^m$ and $W^m$ represent the height and width of the feature map at level $m$.
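Both layers are parameter-free and can be sketched directly. The PyTorch fragment below implements the warping of Equation 3.1 via bilinear sampling and the partial cost volume of Equation 3.2 by correlating features over a (2d+1)×(2d+1) window; the function names are illustrative, and the loop-based cost volume is written for clarity rather than speed.

```python
import torch
import torch.nn.functional as F

def warp(feat2, flow):
    """Backward-warp features of the second image with the flow (Eq. 3.1).

    feat2: (B, C, H, W) features c_2^m; flow: (B, 2, H, W) upsampled flow.
    Bilinear interpolation handles sub-pixel target positions.
    """
    b, _, h, w = feat2.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid_x = xs.to(feat2) + flow[:, 0]          # x + u
    grid_y = ys.to(feat2) + flow[:, 1]          # y + v
    # normalize to [-1, 1] as required by grid_sample
    grid = torch.stack((2 * grid_x / (w - 1) - 1,
                        2 * grid_y / (h - 1) - 1), dim=-1)
    return F.grid_sample(feat2, grid, align_corners=True)

def cost_volume(feat1, feat2_warped, d=4):
    """Partial cost volume (Eq. 3.2): correlations in a (2d+1)^2 window."""
    b, c, h, w = feat1.shape
    padded = F.pad(feat2_warped, (d, d, d, d))
    costs = []
    for dy in range(2 * d + 1):
        for dx in range(2 * d + 1):
            shifted = padded[:, :, dy:dy + h, dx:dx + w]
            costs.append((feat1 * shifted).mean(dim=1))  # (1/N) * dot product
    return torch.stack(costs, dim=1)  # 3D cost volume: (B, (2d+1)^2, H, W)
```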
3.1.4 Optical Flow Estimator

The optical flow estimator is a multilayer CNN that takes as inputs the level-$m$ features of the first image, the level-$m$ cost volume, and the up-sampled flow estimate from level $m+1$, and estimates the flow field at level $m$. Figure 3.3 (a) demonstrates the general network structure, with the inputs at pyramid level 2 taken as an example. Although the network structure is identical at all levels, each network is trained individually, which allows the trained network to be more targeted at the estimation at its pyramid level. From the perspective of variational approaches, the function of the optical flow estimator is similar to the energy minimization, which finds the optimal optical flow value for each pixel.

Figure 3.3: (a) The optical flow estimator network with inputs at pyramid level 2. Each convolutional layer is followed by a leaky ReLU unit except the last (light green) one, which outputs the optical flow [SYLK18]. (b) A 5-layer dense block. Each layer takes all preceding feature maps as input [HLVW17].

Inspired by the excellent performance of DenseNet [HLVW17] in image classification, the idea of DenseNet connections is also applied in the optical flow estimator network. As shown in Figure 3.3 (b), in a DenseNet block the output of each convolutional layer is passed as input to all subsequent convolutional layers. The direct connection between every two layers realizes feature propagation and accumulation. The reuse of already-learned features allows each layer to learn fewer feature maps, which reduces the number of trainable parameters while achieving excellent performance. The experimental results of PWC-Net indicate that PWC-Net with DenseNet connections is 5% more accurate than without these connections, but also 40% slower, since more computation is performed during estimation.
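A minimal PyTorch sketch of an estimator block with DenseNet connections might look as follows. The layer widths are assumptions for illustration; the defining property is only that each convolution sees the concatenation of the block input and all preceding layer outputs.

```python
import torch
import torch.nn as nn

class DenseFlowEstimator(nn.Module):
    """Sketch of an optical flow estimator with DenseNet connections.

    Every convolution receives the concatenation of the block input and
    all preceding outputs; the last layer predicts the 2-channel flow.
    """
    def __init__(self, in_ch, widths=(128, 128, 96, 64, 32)):
        super().__init__()
        self.convs = nn.ModuleList()
        ch = in_ch
        for w in widths:
            self.convs.append(nn.Sequential(
                nn.Conv2d(ch, w, 3, padding=1), nn.LeakyReLU(0.1)))
            ch += w  # dense connectivity: features accumulate
        self.predict_flow = nn.Conv2d(ch, 2, 3, padding=1)  # no activation

    def forward(self, x):
        # x: concatenation of first-image features, cost volume, and
        # the up-sampled flow from the next-coarser level
        for conv in self.convs:
            x = torch.cat([x, conv(x)], dim=1)
        return self.predict_flow(x)
```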
3.1.5 Context Network

After estimating the flow field, a post-processing step is applied in most optical flow approaches. In this step, the computed flow field is refined by incorporating more contextual information, for example using a median filter. In PWC-Net, a context network is designed for this purpose, which makes use of dilated convolutions. Instead of only refining the final estimate, the computed flow field at each level can be refined before being passed to the next level. The use of the context network is optional in PWC-Net; however, the experimental results indicate that applying this network as post-processing consistently helps [SYLK18].

Dilated convolutions are a special category of the normal convolution, where the convolution filter is filled with gaps. The size of the filter depends on a so-called dilation factor $k$. In the case that $k$ equals 1, the dilated convolution is identical to the normal convolution. As the factor $k$ increases, the empty space between the filter elements also increases. Systematically applying dilated convolutions with different dilation factors realizes the aggregation of multi-scale contextual information by exponentially expanding the receptive field of a pixel, see Figure 3.4 [YK16].

Figure 3.4: Systematic dilation supports exponential expansion of the receptive field without loss of resolution or coverage. Red points denote the filter elements of a dilated convolutional filter. The blue square around each pixel denotes the receptive field in the original image I0. (a) Original image I0 and a convolutional filter with dilation factor k = 1; each element in I0 has a receptive field of 1 × 1. (b) Image I1, produced from I0 by the 1-dilated convolution, and a convolutional filter with dilation factor k = 2; each element in I1 has a receptive field of 3 × 3. (c) Image I2, produced from I1 by the 2-dilated convolution, and a convolutional filter with dilation factor k = 4; each element in I2 has a receptive field of 7 × 7. Figure from [YK16].

Based on this characteristic, the context network possesses 7 convolutional layers, as shown in Figure 3.5. The dilation factors of the layers are 1, 2, 4, 8, 16, 1, and 1, respectively. The network takes as inputs the estimated flow field and the features of the second-to-last layer of the optical flow estimator, where the features are more complex and detailed, and generates a refined flow estimate.

Figure 3.5: The context network with inputs at pyramid level 2. Each convolutional layer is followed by a leaky ReLU unit except the last (light green) one, which outputs the optical flow. The last number in each convolutional layer denotes the dilation factor k [SYLK18].

3.1.6 Training Loss

After explaining the network structure of PWC-Net, we now introduce the definition of the training loss. The trainable parameters of PWC-Net belong to the feature pyramid extractor, the optical flow estimator networks at the different pyramid levels, and the context network at the different pyramid levels (depending on the option). The warping and cost volume layers contain no learnable parameters. The multiscale loss function used for training is based on the endpoint error between the ground truth and the estimated flow at each pyramid level, and is defined as follows:

$L(\theta) = \sum_{l=l_0}^{L} \alpha_l \sum_{\mathbf{x}} \big| \mathbf{w}^t_l(\mathbf{x}) - \mathbf{w}^{e,\theta}_l(\mathbf{x}) \big|_2 + \gamma |\theta|_2,$    (3.3)

where $\theta$ represents the set of trainable parameters and $\alpha_l$ is the individual weight for each pyramid level. $\mathbf{w}^{e,\theta}_l(\mathbf{x})$ and $\mathbf{w}^t_l(\mathbf{x})$ denote the estimated flow vector of pixel $\mathbf{x}$ at pyramid level $l$ and its ground truth flow vector, respectively. For a fixed pyramid level $l$, these vectors correspond to the vectors $(u^e_{i,j}, v^e_{i,j})^\top$ and $(u^t_{i,j}, v^t_{i,j})^\top$ in the AEE Equation 2.7, where $(i, j)$ also represents a pixel location. The first term of the loss function computes the weighted sum of all endpoint errors at each pyramid level, and the second term regularizes the parameters in $\theta$ in order to avoid overfitting.

During the training process, $l_0$ is set to 2, and a search range of $d = 4$ pixels is used for building the partial cost volume. The resulting flow field has a quarter of the original resolution and is further up-sampled to the original resolution using bilinear interpolation. For fine-tuning on different benchmarks, a more robust training loss is used:

$L(\theta) = \sum_{l=l_0}^{L} \alpha_l \sum_{\mathbf{x}} \Big( \big| \mathbf{w}^t_l(\mathbf{x}) - \mathbf{w}^{e,\theta}_l(\mathbf{x}) \big|_1 + \varepsilon \Big)^{q} + \gamma |\theta|_2,$    (3.4)

where $\varepsilon$ is a small constant. To reduce the penalty for outliers, the L2 norm of the previous training loss is replaced with the L1 norm, and $q$ is set to less than 1.
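Expressed in PyTorch, the two losses of Equations 3.3 and 3.4 might be sketched as follows. The concrete values of q, epsilon, and gamma are assumptions (the text only states that q < 1 and that epsilon is a small constant), and a squared L2 regularizer is used as a common stand-in for the regularization term.

```python
import torch

def multiscale_loss(preds, gts, alphas, params, gamma=4e-4):
    """Training loss of Eq. 3.3: weighted L2 endpoint errors per level.

    preds / gts: lists of (B, 2, H_l, W_l) flow tensors per pyramid level;
    alphas: per-level weights alpha_l; params: trainable parameters theta.
    """
    loss = 0.0
    for alpha, pred, gt in zip(alphas, preds, gts):
        # L2 norm of the flow difference per pixel, summed over all pixels
        loss = loss + alpha * torch.norm(gt - pred, p=2, dim=1).sum()
    reg = sum(p.pow(2).sum() for p in params)  # regularizer on theta
    return loss + gamma * reg

def robust_multiscale_loss(preds, gts, alphas, params,
                           q=0.4, eps=0.01, gamma=4e-4):
    """Fine-tuning loss of Eq. 3.4: L1 norm with exponent q < 1."""
    loss = 0.0
    for alpha, pred, gt in zip(alphas, preds, gts):
        l1 = (gt - pred).abs().sum(dim=1)      # |.|_1 over u and v
        loss = loss + alpha * ((l1 + eps) ** q).sum()
    reg = sum(p.pow(2).sum() for p in params)
    return loss + gamma * reg
```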
3.2 ProFlow

The second algorithm we want to analyze is ProFlow, proposed by Maurer et al. [MB18] in 2018. ProFlow takes as inputs three consecutive frames $t-1$, $t$, and $t+1$, and estimates the forward flow of frame $t$. ProFlow is based on the traditional optical flow pipeline, with a shallow CNN embedded, as shown in Figure 3.6. Different from normal learning-based approaches, where CNN models are pretrained on a large training dataset, ProFlow implements a self-supervised online learning process during estimation, where no ground truth or labeled data are required. The training process is performed individually for each frame of every sequence, and it also takes the location of each training sample into account. As a result, ProFlow possesses great estimation performance as well as high adaptability to content changes between frames and to independently moving objects within the same frame. In the following, we will explain the function of each step in the ProFlow pipeline. The explanations are based on the paper [MB18].

Figure 3.6: Overview of the ProFlow pipeline. Figure from [MB18].

3.2.1 Initial Flow Estimation/Baseline

In the first step, a baseline approach is applied to generate the initial backward and forward flow, i.e. the flow fields from frame $t$ to frame $t-1$ and from frame $t$ to frame $t+1$, respectively. Theoretically, any approach based on two frames can be employed in this step. To generate rather accurate estimation results, a pipeline approach that combines several advanced techniques is used as the baseline approach. As described in Section 1.2, a pipeline approach consists of four steps: matching, filtering, inpainting, and variational refinement. The algorithm Coarse-to-fine PatchMatch (CPM) of Hu et al. [HSL16] is used in the matching step, which combines the PatchMatch ideas, i.e. neighboring propagation and a random search strategy, with the traditional coarse-to-fine scheme, see Figure 3.7. This algorithm provides an accurate flow estimate even for tiny structures with large displacements and possesses an implicit regularization effect [HSL16]; thus, it is a great candidate for the first step. For outlier filtering, the standard method of bidirectional checking is employed, as introduced in Section 2.3. It can eliminate considerably large amounts of outliers caused by wrong matches. To perform the bidirectional checking, matches in the reverse directions, i.e. from frame $t-1$ to frame $t$ and from frame $t+1$ to frame $t$, also need to be computed in the matching step. The inpainting step makes use of the robust interpolation technique (RIC), proposed by Hu et al. [HLS17], to produce dense motion fields. In RIC, the input image is segmented into several non-overlapping superpixels, and for each superpixel an affine motion model is assumed and computed based on the neighboring blocks. The purpose of the last refinement step is to refine the estimated optical flow to gain sub-pixel precision. To this end, the order-adaptive illumination-aware refinement (OIR) of Maurer et al. [MSB17] is used, which brings great improvements to the results compared to normal refinement techniques.

3.2.2 Outlier Filtering

After applying the baseline approach, a filtering step is performed with the aim of removing outliers contained in the initial forward and backward flow estimates. Here, bidirectional checking is used again. As before, the initial flow fields in the reverse directions, i.e. from frame $t-1$ to frame $t$ and from frame $t+1$ to frame $t$, need to be computed with the baseline approach. Hence, when using bidirectional checking in the second filtering step, four flow fields have to be generated in the first step.
The bidirectional checking is then applied to each pair of forward/backward flow and its reverse flow to filter out the corresponding outliers. In this context, only the consistent flow vectors remain. After filtering, many outliers are excluded, especially in occluded and low-textured regions.

Figure 3.7: Overview of the CPM algorithm. The whole process runs from top to bottom. Left: Image pyramid of the first input image. Right: Image pyramid of the second input image. Middle: Operations at each pyramid level, including generating initial correspondences, performing neighboring propagation, and random search. Figure from [HSL16].

3.2.3 Learning a Motion Model

In this step, a CNN is trained to learn the varying mappings from the backward flow ($t \to t-1$) to the forward flow ($t \to t+1$). Using this network, it is possible to re-estimate those forward flow vectors that were filtered out in the filtering step, provided their corresponding backward flow vectors passed the filtering step. Since occluded pixels are the main source of outliers, the CNN model can especially bring improvements in the estimation of occluded regions. The CNN used in ProFlow is quite lightweight and trained in a self-supervised way. The training process is performed during the flow estimation and depends only on the flow fields computed in the first two steps, without using any ground truth flow. Moreover, a completely new CNN model is trained for each input frame triplet of the ProFlow pipeline; thus, each frame in a sequence can possess an individual motion model. In the following, we describe the generation of the training dataset and the structure of the CNN.

The training dataset is built based on the filtered forward and backward flow fields. To guarantee the correctness of the training samples, only locations where both the forward and the backward flow pass the filtering step are regarded as candidates for training samples. However, if all these candidates were added to the training set, then in the case of very dense filtered forward and backward flow the training set might become too large. Thus, a sampling operation is performed to bring the training set to a reasonable size. To this end, a grid spacing of 10 pixels is used to sample the training candidates equidistantly. In order to give context to each candidate, a 7×7 patch centered at the candidate pixel is used. For each pixel in the patch, the following data are stacked together: (i) the backward flow components $u_{bw}$ and $v_{bw}$; (ii) a validity flag {0, 1} indicating whether this pixel is a valid training candidate; (iii) the normalized pixel coordinates $x$ and $y$ in the range $[-1, 1] \times [-1, 1]$. Thereby, the dimension of a training sample is 7×7×5, as shown in Figure 3.8 (a). The training reference is given by the forward flow components $u_{fw}$ and $v_{fw}$ of the center pixel of a patch. The whole extraction process is performed automatically after the filtering of the forward and backward flow.

Figure 3.8: Left: Training sample extraction. Right: CNN network architecture. Both figures from [MB18].
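The extraction procedure can be summarized in a short NumPy sketch. The grid spacing and patch size follow the text; the array layout and the function name are illustrative assumptions.

```python
import numpy as np

def extract_training_samples(u_bw, v_bw, valid_fw, valid_bw,
                             u_fw, v_fw, grid=10, r=3):
    """Sketch of the training sample extraction (Figure 3.8a).

    A candidate is kept every `grid` pixels where both the filtered
    forward and backward flow survived filtering. Each sample is a
    (2r+1)x(2r+1)x5 patch (7x7x5 for r=3) holding u_bw, v_bw, the
    validity flag, and normalized x, y coordinates; the reference
    output is the forward flow at the patch center.
    """
    h, w = u_bw.shape
    xx, yy = np.meshgrid(np.linspace(-1, 1, w), np.linspace(-1, 1, h))
    stack = np.stack([u_bw, v_bw, valid_bw.astype(float), xx, yy], axis=-1)

    samples, targets = [], []
    for i in range(r, h - r, grid):
        for j in range(r, w - r, grid):
            if valid_fw[i, j] and valid_bw[i, j]:
                samples.append(stack[i - r:i + r + 1, j - r:j + r + 1])
                targets.append((u_fw[i, j], v_fw[i, j]))
    return np.asarray(samples), np.asarray(targets)
```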
After extracting the training data, a CNN is trained to learn the relation between inputs and outputs. The network contains three layers, i.e. 2 convolutional layers and 1 fully connected layer, as shown in Figure 3.8 (b). In both convolutional layers, a 3×3 filter with 16 feature channels is applied to generate complex non-linear features. The fully connected layer produces a 2-dimensional vector, which corresponds to the predicted forward flow components $u_{fw}$ and $v_{fw}$. The loss function considers the absolute difference between the predicted flow vector and the actual flow vector computed by the baseline approach. Since the architecture of the network is relatively simple, the training process can be finished within 20 seconds. Normally, it takes only 3000-4000 training steps, depending on the setting of the learning rate.

As introduced in the previous paragraphs, each training sample contains not only the backward flow and validity, but also the pixel location. This enables the network to learn a location-dependent model, i.e. different motion patterns can be learned at different locations in a frame. The location-dependent model can especially contribute to the motion estimation of individually moving objects in a frame, and improve the estimation accuracy when non-rigid deformations or perspective effects occur [MB18].
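A PyTorch sketch of this shallow network might look as follows. Padding, activation, and the exact layer interface are not specified in the text, so they are assumptions here.

```python
import torch
import torch.nn as nn

class ProFlowCNN(nn.Module):
    """Sketch of ProFlow's shallow motion model (Figure 3.8b)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(5, 16, 3), nn.ReLU(),   # 7x7x5 patch -> 5x5x16
            nn.Conv2d(16, 16, 3), nn.ReLU(),  # 5x5x16 -> 3x3x16
        )
        self.fc = nn.Linear(16 * 3 * 3, 2)    # -> (u_fw, v_fw)

    def forward(self, x):                     # x: (B, 5, 7, 7)
        return self.fc(self.features(x).flatten(1))
```

Training would then minimize the absolute difference to the baseline forward flow, e.g. `loss = (model(x) - target).abs().sum()`, matching the L1-style loss described above.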
3.2.4 Combination and Final Estimation

After training the neural network, the learned model is applied to all locations where the forward flow was filtered out but the backward flow was retained, in order to predict their forward flow. The predictions are combined with the filtered forward flow computed in the second step to produce a denser forward flow estimate. After the combination, there can still exist some locations where neither forward flow nor backward flow is available. Therefore, the inpainting algorithm RIC is employed again with the purpose of densifying the flow field. Afterwards, the dense flow is refined using the algorithm OIR. Both are the same as in the baseline approach.

4 Combining PWC-Net and ProFlow

After introducing the two top-performing approaches PWC-Net and ProFlow, in this chapter we combine these two algorithms. In Section 4.1, we first integrate PWC-Net into the ProFlow pipeline to generate a new pipeline, which is then evaluated in Section 4.2. The evaluation is based on the two most widely used benchmarks, MPI Sintel and KITTI 2015. By analyzing the evaluation results, we discuss the problems existing in the new pipeline and their causes. Afterwards, different modification ideas are proposed to lay a foundation for the next chapter.

4.1 Pipeline Update

Figure 4.1 (a) shows an overview of the individual steps of the ProFlow pipeline, where the baseline approach consists of four methods, i.e. CPM, bidirectional filtering, RIC, and OIR. Since the baseline approach is used to generate a dense and accurate initial flow from two consecutive frames as input, we replace it with the more advanced PWC-Net. As with the baseline approach, PWC-Net is also used to generate both the forward and backward flow and their reverse flow fields, which are necessary to perform the bidirectional filtering. The newly generated pipeline is referred to as PWC-ProFlow and depicted in Figure 4.1 (b).

Figure 4.1: (a) Overview of the original ProFlow pipeline. (b) Overview of the newly generated PWC-ProFlow pipeline.

4.2 Evaluation

After presenting the structure of the new PWC-ProFlow, we now evaluate the new pipeline through several experiments and analyze its performance.

4.2.1 Summary of Experiments

Experiments Introduction

As experiment candidates, we take the baseline approach used in the ProFlow pipeline, PWC-Net, ProFlow, and the newly created method PWC-ProFlow. Further, the two most widely used benchmarks, MPI Sintel [BWSB12] and KITTI 2015 [MG15], are used to perform our evaluation. The Sintel dataset contains 23 highly challenging scenes featuring large displacements, illumination changes, and significant occlusions. Each scene contains 50 consecutive frames. The KITTI dataset is taken from 200 different real road scenes, and most scenes contain cars of various sizes, shadows, lighting changes, and different backgrounds. Here, we only use the training datasets of both benchmarks, since the ground truth flow of each frame is available for evaluation. In particular, two types of ground truth are provided in the KITTI dataset: noc, which contains the ground truth flow of non-occluded pixels only, and occ, which contains the ground truth flow vectors of all pixels, including both occluded and non-occluded pixels. Hence, in the following analysis, we use the names noc and occ to refer to the evaluation of non-occluded pixels and of all pixels of the KITTI dataset, respectively.

Experiment Results Analysis

Table 4.1 presents the results of the experiment methods on both datasets. We first compare the two baseline approaches used in ProFlow and PWC-ProFlow, i.e. the PF-baseline and PWC-Net. As one can see, PWC-Net outperforms the PF-baseline approach on both datasets, with a 17% lower AEE on the Sintel dataset, and a 45% and 55% lower AEE in the noc and occ cases of the KITTI dataset, respectively. As PWC-Net can realize real-time estimation, it also brings higher runtime efficiency. Moreover, we can observe that ProFlow also outperforms its baseline approach on both datasets, especially in the occ case of the KITTI dataset, where ProFlow achieves a 24% lower AEE w.r.t. the PF-baseline. This comparison demonstrates that the four additional steps of the ProFlow pipeline can bring great improvements to the estimation of the baseline approach.

Table 4.1: Average endpoint error (AEE) and percentage of bad pixels (BP) of different methods on the training sets of KITTI 2015 and MPI Sintel (clean pass). PF-baseline refers to the baseline approach used in the ProFlow pipeline. A pixel with an endpoint error > 3px is defined as a bad pixel.

                 KITTI 2015 (noc)   KITTI 2015 (occ)   Sintel (clean)
  Method         AEE    BP (%)      AEE    BP (%)      AEE
  PF-baseline    3.49   12.22       7.30   20.33       2.36*
  ProFlow        3.16   11.40       5.56   17.83       1.98
  PWC-Net        1.89    8.52       3.25   13.68       1.95*
  PWC-ProFlow    2.89   11.52       5.18   19.03       2.34

  * In order to unify the number of test frames used for methods with two and with three images as input, the error of the flow estimate of the first frame is not included.

Based on the above observations, one would expect that PWC-ProFlow could achieve a better estimation performance than PWC-Net and ProFlow, since a more advanced baseline approach is employed in the pipeline. However, as shown in Table 4.1, PWC-ProFlow ranks third in the estimation on the Sintel dataset, with an 18% higher AEE than PWC-Net and ProFlow. Further, considering the AEE, PWC-ProFlow places second in both the noc and occ cases of the KITTI dataset, with 9% and 7% lower AEE compared to ProFlow, which shows certain improvements brought by applying a better baseline approach. Besides, PWC-ProFlow achieves better runtime efficiency than ProFlow due to the real-time estimation of PWC-Net. Summarizing the experiment results, the comparison between PWC-Net and PWC-ProFlow suggests that the additional steps of the PWC-ProFlow pipeline may deteriorate the initial estimation computed by PWC-Net. The underperformance of PWC-ProFlow can be observed on both datasets.
In order to investigate the problems hidden in the PWC-ProFlow pipeline, we analyze the experiment data of each scene in both datasets. In the following, we discuss the improvements and the causes of the problems based on our analysis.

4.2.2 Improvements

Checking the AEE per scene on the Sintel dataset, we find that PWC-ProFlow achieves a better accuracy than PWC-Net in 13 scenes and a better accuracy than ProFlow in 11 scenes. This shows that PWC-ProFlow is able to improve the estimation accuracy in many circumstances. Several examples confirm this observation by showing that PWC-ProFlow effectively improves the inaccurate estimations of PWC-Net and ProFlow, especially in occluded regions; see the upper-right corner in Figure 4.2 (a) and the left boundary area in Figure 4.2 (b). The same improvements can be seen on the KITTI dataset, e.g. the right boundary in Figure 4.3 (a). In addition to occluded regions, PWC-ProFlow also provides great improvements for the estimation under lighting changes, such as the head of the dragon in Figure 4.3 (b). In these positive examples, the inaccurate estimations of PWC-Net are filtered out in the filtering step, as shown in the value map of the filtered forward flow. After online learning, the trained CNN model helps to predict more accurate forward flow vectors (t → t + 1) at the filtered locations by virtue of the existing filtered backward flow (t → t − 1). After inpainting and refinement, the estimated flow field is further improved.

4.2.3 Problem Analysis

Even though many examples demonstrate the good performance of PWC-ProFlow, its high AEE on both datasets indicates that in some situations PWC-ProFlow performs very poorly. By analyzing the negative samples in the Sintel dataset, we discover two main problems: large gaps contained in the input flow of the inpainting step, and wrong predictions produced by the CNN model. Both problems can be caused by the same situation, namely that in a certain region of the filtered forward flow (t → t + 1), which usually corresponds to an individual local motion pattern, no or very few flow vectors exist. If similarly few flow vectors exist in this region in the filtered backward flow (t → t − 1), the CNN can make only very few predictions there to combine with the filtered forward flow. If the region takes up a large proportion of the whole flow field, this leads to a large gap in the input flow of the inpainting step, as shown in the bottom half of Figure 4.4 (a). Since most inpainting approaches interpolate the empty places based on neighboring information, large gaps can severely deteriorate the inpainting result and in turn the final estimation. If the filtered backward flow (t → t − 1) does contain sufficient flow vectors in this region, the problem of inaccurate predictions can still arise and corrupt the estimation result. The reason is that the training set of the CNN model includes only a small number of training samples from this region, since a training sample needs to possess both a valid forward (t → t + 1) and backward flow (t → t − 1). Thus, it is unlikely that the CNN correctly learns the motion pattern located in this region.
For the backward flow vectors (t → t − 1) that do exist in this region, the CNN model will likely apply neighboring patterns to predict the corresponding forward flow (t → t + 1). As shown in Figure 4.4 (b), the predictions in the leg area are made based on the pattern located in the arm. Due to the different motion directions of these two areas, the flow vectors in the leg area are predicted with significantly low accuracy.

These two problems also occur in the estimation on the KITTI dataset. However, by analyzing the negative samples in the KITTI set, we find another problem: the predictions at the contours of cars usually possess a very low accuracy, especially for small cars. Figure 4.5 presents one example; see the upper contour of the left car. As we can see, the CNN model treats these positions as parts of the background when performing the predictions. For individual small cars, it is difficult for the CNN to learn their patterns precisely due to the small number of training samples from these areas. Hence, the predictions at the contours may be affected by the adjacent background flow. Since we would like to keep the way the CNN model is created and trained, this problem seems unavoidable in our pipeline approach. Thus, the focus of the modifications will mainly be on the two problems above. Based on our analysis, we suggest the following modification ideas.

On the one hand, we can focus on improving the inpainting result, for example by finding a way to reduce the large gaps in its input flow, or by finding an appropriate alternative to the inpainting step that is also capable of densifying the flow field and preparing the input for the refinement step. On the other hand, we can attempt to filter out fewer flow vectors in the filtering step, especially in the forward filtering. Compared to the baseline approach used in the ProFlow pipeline, PWC-Net is specifically pretrained for the forward flow prediction. Due to the lack of backward training, there is no guarantee for the accuracy of the estimated backward flow (t + 1 → t). Since the bidirectional filtering makes use of flow vectors in both directions, inaccurate predictions in the backward direction (t + 1 → t) can cause the filtering of accurate forward flow vectors (t → t + 1). Based on this characteristic of PWC-Net, we can replace the bidirectional filtering with other filtering approaches that use the flow field in only one direction, to see if more flow vectors can be retained in the filtered forward flow.

Based on these ideas, different modifications of the PWC-ProFlow pipeline will be introduced in the next chapter.

Figure 4.2: Visualization of improvements in occluded regions (Sintel dataset ambush_7 #41 and ambush_2 #4 [BWSB12]). For both figures: First row: Previous, reference, and next frame. Second row: Value map of filtered backward flow and forward flow, ground truth flow. Third and fourth row: Flow estimation of the reference frame and bad pixel visualization. From left to right: PWC-Net, ProFlow, and PWC-ProFlow.

Figure 4.3: Visualization of improvements in occluded regions (KITTI dataset occ #145 [MG15]) and lighting changes (Sintel dataset market_5 #8 [BWSB12]). For both figures: First row: Previous, reference, and next frame.
Second row: Value map of filtered backward flow and forward flow, ground truth flow. Third and fourth row: Flow estimation of the reference frame and bad pixel visualization. From left to right: PWC-Net, ProFlow, and PWC-ProFlow.

Figure 4.4: Visualization of negative performance due to large gaps in the input flow of the inpainting step (KITTI dataset occ #188 [MG15]) and due to wrong predictions (Sintel dataset ambush_4 #7 [BWSB12]). For both figures: First row: Previous, reference, and next frame. Second row: Value map of filtered backward flow and forward flow, ground truth flow. Third and fourth row: Flow estimation of the reference frame and bad pixel visualization. From left to right: PWC-Net, ProFlow, and PWC-ProFlow.

Figure 4.5: Visualization of negative performance due to wrong predictions at the contour (KITTI dataset occ #114 [MG15]). First row: Previous, reference, and next frame. Second row: Value map of filtered backward flow and forward flow, ground truth flow. Third and fourth row: Flow estimation of the reference frame and bad pixel visualization. From left to right: PWC-Net, ProFlow, and PWC-ProFlow.

5 Modification

In the previous chapter, we analyzed the reasons that can lead to an underperformance of the PWC-ProFlow pipeline. Based on this analysis, six different modifications are created in this chapter with the purpose of increasing the estimation accuracy. In Sections 5.1 and 5.2, the first two modifications are designed based on the idea of improving the inpainting result. Section 5.3 introduces the other four modifications, where new filtering algorithms are adopted to replace the original bidirectional filtering. All the newly generated pipelines are evaluated on the Sintel and KITTI datasets. Based on the experiment results, we further analyze their estimation performance and compare their estimation results with PWC-ProFlow, PWC-Net, and ProFlow.

5.1 Modifying the Input of Inpainting

5.1.1 Model Generation

As analyzed in Section 4.2, large gaps contained in the input flow of the inpainting step can deteriorate the inpainting result due to the lack of neighboring information, and thus lead to a poor final estimation. The most intuitive way to deal with this problem is to sample flow vectors from the initial forward flow, computed by PWC-Net in the first step, into the current input flow. By inserting more flow vectors, we intend to provide more information for the inpainting process. Given the good performance of PWC-Net, we assume that inpainting based on some of the filtered-out flow vectors of PWC-Net can still be better than inpainting based only on flow vectors that lie very far away. The first modification is referred to as PWC-ProFlow-S and the newly generated pipeline is shown in Figure 5.1.

For the sampling, we employ a grid checking process: a grid of predefined size slides over the filtered forward flow while the number of flow vectors inside the grid is counted. If this number is less than a threshold, the initial forward flow is sampled equidistantly to fill the blank areas inside the grid; a minimal sketch of this grid check is given below.

Figure 5.1: Overview of the pipeline of PWC-ProFlow-S.
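The following sketch illustrates the grid checking described above, under the settings reported in Section 5.1.2. It assumes non-overlapping grid placement and NumPy arrays as in the earlier sketch; all names are illustrative.

    import numpy as np

    def grid_sample_fill(filtered_flow, valid, init_flow,
                         grid=(100, 200), ratio=0.5, step=3):
        # grid: cell size as (height, width); (100, 200) mirrors the
        # 200 x 100 Sintel setting, assuming width x height in the text.
        # Wherever less than `ratio` of a cell carries a flow vector,
        # copy vectors from the initial PWC-Net flow on an equidistant
        # raster (spacing `step`) into the empty positions.
        h, w = valid.shape
        gh, gw = grid
        out_flow, out_valid = filtered_flow.copy(), valid.copy()
        for y0 in range(0, h, gh):
            for x0 in range(0, w, gw):
                if valid[y0:y0 + gh, x0:x0 + gw].mean() < ratio:
                    for y in range(y0, min(y0 + gh, h), step):
                        for x in range(x0, min(x0 + gw, w), step):
                            if not out_valid[y, x]:
                                out_flow[y, x] = init_flow[y, x]
                                out_valid[y, x] = True
        return out_flow, out_valid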
If a sampled value and a value predicted by the CNN model exist simultaneously for a position, a weighted sum of both is computed as the result for this position. The weight of the predicted value depends on the total number of flow vectors in the current grid: fewer flow vectors correspond to a lower chance of a correct prediction, so less weight is given to the predicted value. The two weights sum to 1. We also tried to take the predicted value directly as the result instead of computing the weighted sum; however, we found that the weighted sum helps to reduce the negative impact of wrong predictions. A later example demonstrates this advantage.

5.1.2 Evaluation

In the experiments, we define the grid size as 200 × 100 for the Sintel dataset and 300 × 100 for the KITTI dataset in order to keep a ratio similar to that of their training images. The threshold is set to 0.5, meaning that the sampling process is triggered when flow vectors exist for less than half of the grid positions. The sampling distance in both the horizontal and vertical direction is 3. The evaluation results are presented in Table 5.1.

Table 5.1: Average endpoint error (AEE) and percentage of bad pixels (BP) of different methods on the training sets of KITTI 2015 and MPI Sintel (clean pass). A pixel with an endpoint error > 3 px is counted as a bad pixel.

    Method           KITTI 2015 (noc)    KITTI 2015 (occ)    Sintel (clean)
                     AEE     BP (%)      AEE     BP (%)      AEE
    ProFlow          3.16    11.40       5.56    17.83       1.98
    PWC-Net          1.89     8.52       3.25    13.68       1.95
    PWC-ProFlow      2.89    11.52       5.18    19.03       2.34
    PWC-ProFlow-S    2.16     9.10       3.65    15.92       1.82

In comparison with PWC-ProFlow. As shown in Table 5.1, the new pipeline PWC-ProFlow-S achieves a better estimation accuracy than PWC-ProFlow on both datasets, with a 22% lower AEE on the Sintel dataset and a 26% and 30% lower AEE in the noc and occ cases of the KITTI dataset, respectively. This outcome indicates that an appropriate handling of large gaps before inpainting can effectively improve the estimation results, especially in some scenes of the Sintel dataset; e.g., the AEE of ambush_4 decreases from 11.3 to 7.6. Figure 5.2 (a) and (b) present the results of PWC-ProFlow-S on the negative estimation examples of PWC-ProFlow mentioned in Chapter 4. Here, we can observe significant improvements in the estimation of the leg area in subfigure (a) and the bottom edge area in subfigure (b), where large gaps exist in the input flow of the inpainting step in PWC-ProFlow. Moreover, the wrong predictions in the leg area in subfigure (a) are also handled properly during sampling. The same progress can be observed in comparison with ProFlow: as the quality of the inpainting result improves, the advantages of using an advanced baseline approach become visible.

In comparison with PWC-Net. As shown in Table 5.1, the new pipeline PWC-ProFlow-S outperforms PWC-Net on the Sintel dataset with a 7% lower AEE. This shows that the additional steps in the PWC-ProFlow-S pipeline can provide a general improvement over the estimation of PWC-Net on the Sintel dataset. Figure 5.3 (a) and (b) present two examples of these improvements; see the lower-right area in subfigure (a) and the right edge area in subfigure (b). However, on the KITTI dataset, PWC-ProFlow-S performs worse than PWC-Net. By analyzing the negative samples, we find that one possible reason is, again, the inaccurate predictions at the contours of small cars, as explained in Chapter 4. Another reason stems from the predictions for very small cars.
During grid checking, the area of a small car may contain no flow vectors at all; however, because of its small size, the total number of missing flow vectors inside the grid can still be lower than half, so the sampling process is not initiated. Figure 5.4 shows one of these examples: no flow vectors are sampled onto the second car from the left, and thus the inpainting result and the final estimation in this area are inaccurate.

5.2 Replacing Inpainting with Dense Filling

5.2.1 Model Generation

The idea of the second modification is also based on improving the inpainting result. However, instead of remedying the input flow of the inpainting step, we replace the inpainting step with another approach that is also able to densify the flow field. Considering the good performance of PWC-Net, we apply a method called dense filling, which is implemented as follows: after combining the CNN predictions, the initial forward flow vectors are sampled into all blank positions where no forward flow exists; a minimal sketch of this step is given below. The problem of large gaps is thereby trivially resolved in this modification. The new pipeline with dense filling is referred to as PWC-ProFlow-D and shown in Figure 5.5. The second to fourth steps together can be regarded as a corrector of the potentially erroneous flow vectors contained in the initial forward flow. The experiment results are presented in Table 5.2.
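Dense filling itself amounts to copying the initial PWC-Net flow into every position that is still empty after the combination step. A minimal sketch, under the same array conventions as before:

    import numpy as np

    def dense_fill(combined_flow, valid, init_flow):
        # Copy the initial PWC-Net forward flow into every position that
        # is still empty after combining filtered flow and CNN predictions.
        out = combined_flow.copy()
        out[~valid] = init_flow[~valid]
        return out

Unlike the grid-based sampling of PWC-ProFlow-S, this step leaves no gaps at all, at the price of reintroducing potentially erroneous initial flow vectors everywhere they were filtered out.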
5.2.2 Evaluation

Table 5.2: Average endpoint error (AEE) and percentage of bad pixels (BP) of different methods on the training sets of KITTI 2015 and MPI Sintel (clean pass). A pixel with an endpoint error > 3 px is counted as a bad pixel.

    Method           KITTI 2015 (noc)    KITTI 2015 (occ)    Sintel (clean)
                     AEE     BP (%)      AEE     BP (%)      AEE
    ProFlow          3.16    11.40       5.56    17.83       1.98
    PWC-Net          1.89     8.52       3.25    13.68       1.95
    PWC-ProFlow      2.89    11.52       5.18    19.03       2.34
    PWC-ProFlow-D    1.83     8.51       3.23    14.38       1.82

In comparison with PWC-ProFlow. As shown in Table 5.2, the newly generated pipeline PWC-ProFlow-D outperforms PWC-ProFlow on both datasets, with a 22% lower AEE on the Sintel dataset and a 37% and 38% lower AEE in the noc and occ cases of the KITTI dataset, respectively. In fact, we find a general decrease of the AEE per scene on both datasets. The test outcome shows that dense filling is a strong alternative to the inpainting step and is capable of providing a better input for the final refinement step. Figure 5.6 (a) and (b) demonstrate the estimation results of PWC-ProFlow-D on the negative samples of PWC-ProFlow listed in Chapter 4. Again, the estimations in the leg area in subfigure (a) and the bottom edge area in subfigure (b) are effectively improved. However, as one can observe, the wrong predictions in the leg area in subfigure (a) still deteriorate parts of the estimation result in this modification. This indicates that, unlike in the first modification, there is no way to correct wrong predictions in PWC-ProFlow-D.

In comparison with PWC-Net. Except for the BP in the occ case of the KITTI dataset, PWC-ProFlow-D performs better than PWC-Net on both datasets, with a 7% lower AEE on the Sintel dataset and slight improvements on the KITTI dataset. The improvements brought by this modification confirm that the middle steps of the pipeline can effectively correct some of the inaccurate estimations of PWC-Net. Figure 5.7 presents some examples, where estimation improvements can be seen in the upper-left corner of subfigure (a), the upper-right edge area of subfigure (b), and the left edge area of subfigure (c). The problems that cause the underperformance of the first modification PWC-ProFlow-S on the KITTI dataset are automatically solved by the dense filling. Hence, PWC-ProFlow-D improves the estimation accuracy on the KITTI dataset compared to PWC-ProFlow-S.

Figure 5.2: Visualization of estimation improvements in comparison with PWC-ProFlow (Sintel ambush_4 #7 [BWSB12] and KITTI occ #188 [MG15]). For both figures: First row: Previous, reference, and next frame. Second row: Value map of the input flow of the inpainting step in PWC-ProFlow, flow estimation of PWC-ProFlow and PWC-ProFlow-S. Third row: Value map of the input flow of the inpainting step in PWC-ProFlow-S, bad pixel visualization of PWC-ProFlow and PWC-ProFlow-S.

Figure 5.3: Visualization of estimation improvements in comparison with PWC-Net (Sintel ambush_5 #30 and market_6 #10 [BWSB12]). For both figures: First row: Reference frame, estimation results of PWC-Net and PWC-ProFlow-S. Second row: The next frame and bad pixel visualization of PWC-Net and PWC-ProFlow-S.

Figure 5.4: Visualization of negative estimation samples in the KITTI dataset due to non-sampling of small cars (KITTI benchmark occ #93 [MG15]). First row: Reference frame, estimation results of PWC-Net and PWC-ProFlow-S. Second row: Value map of the input flow of the inpainting step in PWC-ProFlow-S, bad pixel visualization of PWC-Net and PWC-ProFlow-S.

Figure 5.5: Overview of the pipeline of PWC-ProFlow-D.

Figure 5.6: Visualization of estimation improvements in comparison with PWC-ProFlow (Sintel dataset ambush_4 #7 [BWSB12] and KITTI dataset occ #188 [MG15]). For both figures: First row: Previous, reference, and next frame. Second row: Value map of filtered backward flow, estimated flow of PWC-ProFlow and PWC-ProFlow-D. Third row: Value map of filtered forward flow, bad pixel visualization of PWC-ProFlow and PWC-ProFlow-D.

Figure 5.7: Visualization of estimation improvements in comparison with PWC-Net (Sintel dataset cave_2 #19 and temple_2 #25 [BWSB12], and KITTI dataset occ #183 [MG15]). For all three figures: First row: Reference frame and estimated flow of PWC-Net and PWC-ProFlow-D. Second row: The next frame and bad pixel visualization of PWC-Net and PWC-ProFlow-D.

5.3 Applying New Filtering Algorithms

5.3.1 Model Generation

In the first two modifications, we increased the estimation accuracy by improving the inpainting result. However, the root cause of inaccurate inpainting results and wrong CNN predictions is that the filtered forward flow contains few flow vectors in some local regions after the bidirectional filtering. As described in Chapter 4, the unstable performance of the backward flow estimation (t + 1 → t) of PWC-Net can be one possible reason for this problem. Hence, another starting point for a modification is to adopt a different filtering method that only takes flow predictions in one direction as input. In this section, we implement two different filtering algorithms.
Combined with the regular pipeline variant containing the inpainting step and the new pipeline variant containing the dense filling, the last four new pipelines are created. In the following, we first introduce the two filtering algorithms and then the structures of the new pipelines.

Uniqueness Filtering

Since occluded pixels are the main source of outliers, we adopt the uniqueness checking introduced in Section 2.3 to create the first filtering algorithm. The uniqueness filtering takes a flow field as input and generates the filtered flow field. The computation is based on the discrete version of Equation 2.6, where (i, j) denotes a pixel location:

o(i, j, t) = \begin{cases} 0 & \text{if } m(i + u^{fw}_{i,j},\ j + v^{fw}_{i,j},\ t + 1) = 1 \\ 1 & \text{else} \end{cases} \quad (5.1)

In the case of forward filtering, the input flow refers to the initial forward flow computed in the first step, and the initial forward flow vectors are substituted into the equation as the values of u^{fw} and v^{fw}. In the implementation, we first build a 2-dimensional array of the same size as the flow field. Each array unit accumulates the total number of pixels that land in this unit's location after their flow vectors are added. Bilinear interpolation is applied to deal with the issue that the target location, i.e. the location where a pixel lands after adding its flow vector, deviates from an exact pixel position: the four units neighboring the target location are incremented by the weights computed via bilinear interpolation. After generating the 2D array, we can check whether a pixel is the unique pixel that falls onto its target position. If other pixels land there as well, the flow vector at this pixel location is filtered out. We set the threshold for the uniqueness check to 1.5 to account for the bilinear interpolation. In the case of backward filtering, the input flow field refers to the initial backward flow.
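A minimal sketch of this uniqueness check is given below, assuming NumPy arrays; the exact handling of pixels that land outside the image is not specified in the text and is a simplification here.

    import numpy as np

    def uniqueness_filter(flow, thresh=1.5):
        # Splat each pixel onto its target location with bilinear weights
        # and accumulate the mass per unit; a source pixel whose target
        # position received more mass than `thresh` is not unique there,
        # so its flow vector is filtered out.
        h, w, _ = flow.shape
        acc = np.zeros((h, w))
        ys, xs = np.mgrid[0:h, 0:w]
        tx, ty = xs + flow[..., 0], ys + flow[..., 1]
        x0, y0 = np.floor(tx).astype(int), np.floor(ty).astype(int)
        fx, fy = tx - x0, ty - y0
        corners = ((0, 0, (1 - fx) * (1 - fy)), (0, 1, fx * (1 - fy)),
                   (1, 0, (1 - fx) * fy), (1, 1, fx * fy))
        for dy, dx, wgt in corners:
            yy, xx = y0 + dy, x0 + dx
            inside = (yy >= 0) & (yy < h) & (xx >= 0) & (xx < w)
            np.add.at(acc, (yy[inside], xx[inside]), wgt[inside])
        # read back the accumulated mass at each pixel's nearest target unit
        yr = np.clip(np.rint(ty).astype(int), 0, h - 1)
        xr = np.clip(np.rint(tx).astype(int), 0, w - 1)
        return acc[yr, xr] <= thresh    # True: keep, False: filter out

The returned boolean map plays the role of the value maps shown in the figures: positions marked False lose their flow vector.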
Uniqueness and RGB Filtering

Figure 5.8 shows the filtered results of different filtering algorithms. Since our main purpose in the filtering step is to filter out occluded pixels, the ground truth occlusion map serves as the reference result. As one can see, the results of the uniqueness filtering contain significantly more flow vectors than those of the bidirectional filtering. However, compared with the reference occlusion maps, some occluded pixels still remain in the results of the uniqueness filtering. To remove more occluded pixels, we create a second filtering algorithm by combining an additional RGB check with the uniqueness check. The uniqueness and RGB filtering takes as input the reference image, the target image, and a flow field. It is implemented as follows: first, the uniqueness check is applied to the input flow field. If a flow vector fails the uniqueness check, it is filtered out as in the uniqueness filtering. Otherwise, the Euclidean distance between the RGB vectors of the starting pixel in the reference image and of the nearest landing pixel in the target image is computed. If this distance is larger than a threshold, the flow vector is filtered out as well; a small sketch of this color test is given at the end of this subsection. For forward filtering, the target image is the subsequent frame; for backward filtering, it is the previous frame.

Figure 5.8: Forward filtering results using different filtering algorithms. From top to bottom: cave_4 #44, market_5 #18, temple_3 #37, and ambush_4 #7 of the Sintel dataset [BWSB12]. From left to right: Ground truth occlusion map, bidirectional filtering, uniqueness filtering, and uniqueness and RGB filtering.

Comparing the Different Filtering Algorithms

As one can see in Figure 5.8, more occluded pixels are filtered out by the uniqueness and RGB filtering; even so, its results still contain more flow vectors than those of the bidirectional filtering. In conclusion, compared to the reference occlusion maps, the uniqueness filtering removes slightly too few occluded pixels in some regions, while the uniqueness and RGB filtering removes slightly too many pixels in some regions. In most cases, however, both algorithms produce denser filtered results than the bidirectional filtering.

Pipeline Generation

After introducing the two filtering algorithms, we now generate four new pipelines based on them. One advantage of the new filtering algorithms is that it is no longer necessary to generate the reverse forward (t + 1 → t) and backward (t − 1 → t) flow fields in the first step, which saves computational costs. As shown in Figure 5.9, in the first two pipelines we replace the bidirectional filtering with the uniqueness filtering for both forward and backward filtering. While the first pipeline, PWC-ProFlow-U-I, applies inpainting for densification, the second pipeline, PWC-ProFlow-U-D, uses dense filling for the same purpose. As described above, the results of the uniqueness filtering usually still contain some occluded pixels. Hence, in the other two modifications we apply the uniqueness and RGB filtering to the backward filtering and continue to use the uniqueness filtering for the forward filtering. The idea is that the CNN model takes the existing flow vectors in the filtered backward flow (t → t − 1) as input and predicts the forward flow. Backward flow vectors in occluded regions can lead to wrong predictions and thus to an inaccurate final estimation; hence, we prefer to leave as few outliers as possible in the filtered backward flow. In contrast, we slightly increase the tolerance towards outliers in the filtered forward flow, since large gaps there would create great difficulties for the subsequent steps. The last two new pipelines are referred to as PWC-ProFlow-U-RGB-I and PWC-ProFlow-U-RGB-D.
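The additional color test of the uniqueness and RGB filtering, announced above, can be sketched as follows; the RGB distance threshold is not reported in this thesis, so the value below is purely illustrative.

    import numpy as np

    def rgb_check(flow, ref_img, target_img, valid, thresh=10.0):
        # For flow vectors that survived the uniqueness check (`valid`),
        # compare the RGB vector of the source pixel with that of the
        # nearest pixel it lands on in the target image; a large color
        # distance marks a likely occlusion.
        h, w, _ = flow.shape
        ys, xs = np.mgrid[0:h, 0:w]
        xr = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
        yr = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
        dist = np.linalg.norm(ref_img.astype(float) -
                              target_img[yr, xr].astype(float), axis=-1)
        return valid & (dist <= thresh)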
5.3.2 Evaluation

All four generated pipelines are evaluated on the Sintel and KITTI datasets. The experiment results are shown in Table 5.3.

Table 5.3: Average endpoint error (AEE) and percentage of bad pixels (BP) of different methods on the training sets of KITTI 2015 and MPI Sintel (clean pass). A pixel with an endpoint error > 3 px is counted as a bad pixel.

    Method                 KITTI 2015 (noc)    KITTI 2015 (occ)    Sintel (clean)
                           AEE     BP (%)      AEE     BP (%)      AEE
    ProFlow                3.16    11.40       5.56    17.83       1.98
    PWC-Net                1.89     8.52       3.25    13.68       1.95
    PWC-ProFlow            2.89    11.52       5.18    19.03       2.34
    PWC-ProFlow-S          2.16     9.10       3.65    15.92       1.82
    PWC-ProFlow-D          1.83     8.51       3.23    14.38       1.82
    PWC-ProFlow-U-I        2.60     9.56       4.98    17.37       1.95
    PWC-ProFlow-U-D        2.08     9.03       4.06    15.89       1.82
    PWC-ProFlow-U-RGB-I    2.61     9.61       4.62    16.84       1.88
    PWC-ProFlow-U-RGB-D    1.95     8.70       3.56    14.82       1.75

In comparison with PWC-ProFlow. As shown in Table 5.3, all four new pipelines perform better than PWC-ProFlow on both datasets. Even the inpainting-based pipelines PWC-ProFlow-U-I and PWC-ProFlow-U-RGB-I outperform PWC-ProFlow, with up to a 20% lower AEE on the Sintel dataset and up to an 11% lower AEE in the noc and occ cases of the KITTI dataset. This outcome demonstrates the general improvements brought by the new filtering algorithms and confirms our idea that retaining more flow vectors in the filtered flow can increase the estimation accuracy. Figure 5.10 (a) and (b) show the results of the four modifications on the negative estimation examples of PWC-ProFlow mentioned in Chapter 4. In subfigure (a), we can see great improvements in the leg area of the estimation results. Inspecting both the filtered backward flow and the filtered forward flow of the uniqueness filtering in Figure 5.8, we observe that some locations in the leg area possess both forward and backward flow after filtering, which is not the case in PWC-ProFlow. The information at these locations can be added to the training dataset; thus, the CNN model has a chance to learn parts of the motion pattern in the leg area, which helps it to predict the forward flow in this area more accurately.

Figure 5.9: From (a) to (d): Overview of the pipelines of PWC-ProFlow-U-I, PWC-ProFlow-U-D, PWC-ProFlow-U-RGB-I, and PWC-ProFlow-U-RGB-D.

In comparison with each other. Comparing the four modifications with each other, we find that the pipelines with dense filling generally outperform the pipelines with inpainting, as previously observed in Section 5.2. Between the filtering variants, the combination of uniqueness filtering and uniqueness and RGB filtering achieves a better accuracy on both datasets than employing the uniqueness filtering for both forward and backward filtering. This indicates that fewer outliers in the filtered backward flow help to reduce the number of wrong predictions. We also built two pipelines in which the uniqueness and RGB filtering is applied to both forward and backward filtering; however, their results do not show an improvement over the pipelines with the two filtering combinations above. In conclusion, the pipeline PWC-ProFlow-U-RGB-D performs best among the four new pipelines created in this section.

In comparison with PWC-Net. As shown in Table 5.3, all four variations outperform PWC-Net on the Sintel dataset, especially PWC-ProFlow-U-RGB-D, which achieves the best accuracy on the Sintel dataset among all experiment candidates. However, on the KITTI dataset, PWC-Net possesses a higher estimation accuracy in both the noc and occ cases. The same shortfall shows in the comparison of PWC-ProFlow-D with PWC-ProFlow-U-D and PWC-ProFlow-U-RGB-D, where the former performs better on KITTI. Since the general effectiveness of the new filtering algorithms was demonstrated in the comparison with PWC-ProFlow, this outcome indicates that in some frames of the KITTI dataset the new filtering algorithms perform significantly worse than the bidirectional filtering. After analyzing the experiment results, we discovered negative samples such as those shown in Figure 5.11. Subfigures (a) and (b) present two examples in the noc case. It is clearly visible that the filtered forward flow of the bidirectional filtering contains more flow vectors.
Due to the lack of training samples, the front regions of the cars in subfigures (a) and (b) are predicted with very low accuracy by the pipelines PWC-ProFlow-U-D and PWC-ProFlow-U-RGB-D. One possible reason is that inaccurate initial flow vectors computed by PWC-Net can cause multiple pixels of one object to land in the same area of the target image, for example the car area in subfigure (b). After bilinear interpolation, some valid positions in this area may accumulate a value higher than the threshold and are thus filtered out. To address this problem, we can consider slightly increasing the threshold of the uniqueness check for the KITTI dataset, so that more flow vectors remain in the filtered forward flow. Besides this problem in the noc case, we also find that in the occ case the predictions in the bottom edge area of some frames are slightly deteriorated by the neighboring flow, as shown in subfigure (c). These wrong predictions can also decrease the overall estimation accuracy.

Figure 5.10: Visualization of estimation improvements in comparison with PWC-ProFlow (Sintel dataset ambush_4 #7 [BWSB12] and KITTI dataset occ #188 [MG15]). For both figures: First row: Reference frame, subsequent frame, value map of filtered backward flow using uniqueness filtering and uniqueness and RGB filtering. Second and third row: Estimation results and bad pixel visualization of the four modifications. From left to right: PWC-ProFlow-U-I, PWC-ProFlow-U-D, PWC-ProFlow-U-RGB-I, and PWC-ProFlow-U-RGB-D.

Summarizing all the modifications created in this chapter, the experiment results in Table 5.3 demonstrate that all modifications perform better on both datasets than the original PWC-ProFlow created in Chapter 4. In particular, the pipeline PWC-ProFlow-U-RGB-D achieves the best estimation performance on the Sintel dataset, while PWC-ProFlow-D yields the best estimation performance in the noc and occ cases of the KITTI dataset.

Figure 5.11: Comparison of estimation results on the KITTI dataset (noc #43, noc #91, and occ #176 [MG15]). For all three figures: First row: Previous frame, value map of filtered backward flow using bidirectional filtering, uniqueness filtering, and uniqueness and RGB filtering. Second row: Reference frame, value map of filtered forward flow using bidirectional filtering and uniqueness filtering, bad pixel visualization of PWC-Net. Third row: Next frame, bad pixel visualization of PWC-ProFlow-D, PWC-ProFlow-U-D, and PWC-ProFlow-U-RGB-D.

6 Conclusion

In this thesis, we combined the two top-performing optical flow algorithms PWC-Net and ProFlow with the purpose of generating a new algorithm that outperforms both and combines their advantages. Through the evaluation on the MPI Sintel and KITTI 2015 benchmarks, we discovered that the straightforward combination of both algorithms did not achieve the expected performance on either dataset. To improve the estimation performance, we explored the causes of the underperformance of the new algorithm. We found that the sparse filtered forward flow produced by the bidirectional filtering can, on the one hand, lead to large gaps in the input flow of the inpainting step and, on the other hand, lead to wrong predictions of the CNN model. Both problems can severely reduce the estimation accuracy.
Based on these findings, we proposed six different modifications in this thesis. In the first modification, the large gaps contained in the input flow of the inpainting step are partially filled via sampling, so that more information is available for the inpainting process. In the second modification, the inpainting step is completely replaced by a so-called dense filling method, which densifies the flow field by directly sampling the initial forward estimation of PWC-Net. Thanks to the generally good estimation of PWC-Net, the dense filling achieves a more stable performance than the inpainting. In the last four modifications, new filtering algorithms are embedded in the pipeline. By applying the new filtering algorithms, we expected to retain more flow vectors in the filtered forward flow. Our experiments confirm that, with a proper setting of the threshold, the new filtering algorithms produce denser filtered flow than the bidirectional filtering. By analyzing the experiment results, we discussed the advantages and disadvantages of each modification based on positive and negative samples. In conclusion, compared to PWC-ProFlow and ProFlow, all created modifications demonstrate significant improvements in the estimation accuracy on both the MPI Sintel and KITTI 2015 datasets. Compared to PWC-Net, our modifications also present great improvements on the Sintel dataset. In particular, the pipelines PWC-ProFlow-D and PWC-ProFlow-U-RGB-D achieve the best estimation accuracy among all experiment candidates on the KITTI and Sintel datasets, respectively.

In the future, we can further modify the pipelines generated in this thesis. For example, the threshold in the uniqueness filtering can be fine-tuned for the KITTI dataset to see whether the pipeline PWC-ProFlow-U-RGB-D can outperform PWC-Net on the KITTI dataset. Since the RGB check may not be fully appropriate under lighting changes, we can further integrate other filtering algorithms into the pipeline. Beyond that, it is also worth exploring how important the last refinement step is by comparing the estimation performance with and without refinement. Moreover, we can fine-tune the parameters of the variational refinement to see whether further improvements can be brought to the estimation results.

Bibliography

[BBPW04] T. Brox, A. Bruhn, N. Papenberg, J. Weickert. “High accuracy optical flow estimation based on a theory for warping.” In: Proceedings European Conference on Computer Vision. 2004, pp. 25–36 (cit. on p. 13).

[BLKU16] M. Bai, W. Luo, K. Kundu, R. Urtasun. “Exploiting semantic information and deep matching for optical flow.” In: Proceedings European Conference on Computer Vision. 2016, pp. 154–170 (cit. on p. 9).

[Bru18] A. Bruhn. Lecture notes in Correspondence Problems in Computer Vision. 2018 (cit. on pp. 8, 10, 12–14, 16).

[BSL11] S. Baker, D. Scharstein, J. Lewis, S. Roth, M. J. Black, R. Szeliski. “A database and evaluation methodology for optical flow.” In: International Journal of Computer Vision 92.1 (2011), pp. 1–31 (cit. on p. 16).

[BVS17] C. Bailer, K. Varanasi, D. Stricker. “CNN-based patch matching for optical flow with thresholded hinge embedding loss.” In: Proceedings IEEE Conference on Computer Vision and Pattern Recognition. 2017 (cit. on p. 9).

[BWSB12] D. J. Butler, J. Wulff, G. B. Stanley, M. J. Black. “A naturalistic open source movie for optical flow evaluation.” In: Proceedings European Conference on Computer Vision. 2012, pp. 611–625 (cit. on pp.
10, 15, 33, 37–39, 44, 45, 47, 48, 50, 54, 62, 63).

[CGN14] H. Chao, Y. Gu, M. Napolitano. “A survey of optical flow techniques for robotics navigation applications.” In: Journal of Intelligent & Robotic Systems 73.1 (2014), pp. 361–372 (cit. on p. 7).

[DFI15] A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, T. Brox. “FlowNet: Learning optical flow with convolutional networks.” In: Proceedings IEEE International Conference on Computer Vision. 2015, pp. 2758–2766 (cit. on p. 9).

[FBK15] D. Fortun, P. Bouthemy, C. Kervrann. “Optical flow modeling and computation: a survey.” In: Computer Vision and Image Understanding 134.1 (2015), pp. 21–46 (cit. on pp. 7, 8, 12–14, 16, 18, 19).

[GW16] D. Gadot, L. Wolf. “PatchBatch: A batch augmented loss for optical flow.” In: Proceedings IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 4236–4245 (cit. on p. 9).

[HLS17] Y. Hu, Y. Li, R. Song. “Robust interpolation of correspondences for large displacement optical flow.” In: Proceedings IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 481–489 (cit. on pp. 10, 28).

[HLVW17] G. Huang, Z. Liu, L. van der Maaten, K. Q. Weinberger. “Densely connected convolutional networks.” In: Proceedings IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 4700–4708 (cit. on p. 24).

[HS81] B. K. P. Horn, B. G. Schunck. “Determining optical flow.” In: Artificial Intelligence 17 (1981), pp. 185–203 (cit. on pp. 8, 13).

[HSL16] Y. Hu, R. Song, Y. Li. “Efficient coarse-to-fine PatchMatch for large displacement optical flow.” In: Proceedings IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 5704–5712 (cit. on pp. 27–29).

[HZZ17] Y. Han, P. Zhang, T. Zhuo, W. Huang, Y. Zhang. “Video action recognition based on deeper convolution networks with pair-wise frame motion concatenation.” In: Proceedings IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2017, pp. 1226–1235 (cit. on p. 7).

[IMS17] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, T. Brox. “FlowNet 2.0: Evolution of optical flow estimation with deep networks.”