Institute of Parallel and Distributed Systems University of Stuttgart Universitätsstraße 38 D–70569 Stuttgart Masterarbeit Model-Based policy Search for Learning Multivariate PID Gain Scheduling Control Maksym Lefarov Course of Study: Computer Science Examiner: Dr. Daniel Hennes Supervisor: Dipl. Ing. Andreas Doerr, Dr. Christian Daniel, Dr. Duy Nguyen-Tuong. Commenced: October 18, 2017 Completed: April 18, 2018 Abstract Due to its simplicity and demonstrated performance, Proportional Integral and Derivative (PID) controller remains one of the most widely-used closed-loop control mechanisms in industrial applications. For the unknown model of a system, however, a PID design can become a significantly complex task especially for a Multiple Input Multiple Output (MIMO) case. For the efficient control of a nonlinear and non-stationary systems, a scheduled PID controller can be designed. The classical approach to gain scheduling is a system linearization and the design of controllers at different operating points with a subsequent application of interpolation. This thesis continues on the recent advances in application of Reinforcement Learning (RL) to a multivariate PID tuning. In this work we extend the multivariate PID tuning framework based on the Probabilistic Inference for Learning Control (PILCO) algorithm to tune a scheduled PID controllers. The developed method does not require the linear model of a system dynamics and is not restricted to the low-order or Single Input Single Output (SISO) systems. The algorithm is evaluated using the Noisy Cart-Pole and Non-stationary Mass-Damper systems. Additionally, the proposed method is applied to the tuning of a scheduled PID controllers of autonomous Remote Control (RC) car. 3 Contents 1 Introduction 13 2 Background 15 2.1 Control Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.2 PID controller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.3 Gaussian Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3 Related Work 21 3.1 Research in PID Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.2 PILCO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.3 PID Tuning using PILCO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.4 Gain Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4 Gain Scheduling using PILCO 33 4.1 Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.2 Dependent Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4.3 Independent Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.4 Cross Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.5 Numerical Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 5 Testing Systems 43 5.1 Noisy Cart-Pole . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 5.2 Non-stationary Mass-Damper . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 6 Application to Autonomous RC Car 51 6.1 System Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 6.2 Control Scheme Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 6.3 Dynamics Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 6.4 Control Redesign . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 6.5 Validation Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 6.6 New Errors Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 6.7 New Errors Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 7 Conclusion and Outlook 69 A Appendix 71 A.1 Linear transformation of Gaussian distribution . . . . . . . . . . . . . . . . . . . 71 A.2 Product of two independent Gaussian distributions . . . . . . . . . . . . . . . . 71 Bibliography 73 5 List of Figures 2.1 Quantities of a control system performance . . . . . . . . . . . . . . . . . . . . 16 2.2 The impact of PID gains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 4.1 Numerical test of the derived approximations . . . . . . . . . . . . . . . . . . . 41 5.1 Noisy Cart-Pole dynamics model . . . . . . . . . . . . . . . . . . . . . . . . . . 44 5.2 The PID optimization for the Noisy Cart-Pole . . . . . . . . . . . . . . . . . . . 45 5.3 The optimized PID for the Noisy Cart-Pole . . . . . . . . . . . . . . . . . . . . 45 5.4 Non-stationary Mass-Damper properties . . . . . . . . . . . . . . . . . . . . . . 47 5.5 Non-stationary Mass-Damper dynamics model . . . . . . . . . . . . . . . . . . 48 5.6 The optimized nonscheduled PID for the Non-stationary Mass-Damper . . . . . . 48 5.7 The optimized scheduled PID for the Non-Stationary Mass-Damper . . . . . . . 49 5.8 Concatenated optimization process for the scheduled PID . . . . . . . . . . . . . 50 6.1 Simplified control scheme of the autonomous RC car . . . . . . . . . . . . . . . 52 6.2 The definition of longitudinal and lateral errors used for PID control . . . . . . . 54 6.3 Autonomous RC car trajectories . . . . . . . . . . . . . . . . . . . . . . . . . . 55 6.4 Tests of trajectory specific model of the errors dynamics . . . . . . . . . . . . . 57 6.5 Tests of trajectory dependent model of the errors dynamics . . . . . . . . . . . . 58 6.6 The hyper-parameters of the trajectory dependent model . . . . . . . . . . . . . 59 6.7 Tests of trajectory dependent model with additional states . . . . . . . . . . . . . 59 6.8 The RC car experiment subdivision . . . . . . . . . . . . . . . . . . . . . . . . . 60 6.9 Tests of the local model of the errors dynamics . . . . . . . . . . . . . . . . . . 61 6.10 The hyper-parameters of the local model of the errors dynamics . . . . . . . . . 62 6.11 Validation experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 6.12 The new errors definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 6.13 Test of the redefined errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 7 List of Tables 4.1 The moments errors for the subset of numerical test experiments . . . . . . . . . 40 5.1 Noisy Cart-Pole parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 5.2 Non-Stationary Mass-Damper parameters . . . . . . . . . . . . . . . . . . . . . 46 6.1 The data transmitted within the control scheme of the autonomous RC car . . . . 53 9 List of Abbreviations PID Proportional Integral and Derivative. 3 MIMO Multiple Input Multiple Output. 3 RL Reinforcement Learning. 3 PILCO Probabilistic Inference for Learning Control. 3 SISO Single Input Single Output. 3 RC Remote Control. 3 GP Gaussian Processes. 13 LTI Linear Time-Invariant. 15 IAE Intergated Absolute Error. 16 ISE Intergated square Error. 16 ZN Ziegler-Nichols method. 21 AMIGO Approximate M-constraint Integral Gain Optimization. 21 IMC Internal Model Control. 23 SOFC Static Output Feedback Controller. 23 LMI Linera Matrix Inequalities. 23 NARX Non-linear Auto-regressive Exogenous. 29 QLPV Quasi-Linear Parameter Varying. 30 GMV General Minimum Variance. 31 LQR Linear Quadratic Regulator. 43 BFGS Broyden — Fletcher — Goldfarb — Shanno. 44 UKF Unscented Kalman Filter. 52 11 1 Introduction A simple structure and a demonstrated performance permitted PID to become one of the most popular closed-loop control mechanisms used among the industry [21]. MIMO systems can be controlled using the multivariate PID controllers [1]. Despite the clear structure of PID, it is usually designed via the application of a tedious heuristic tuning routines. The advanced PID design techniques such as Pole Placement [1], Loop shaping [4], D-Partitioning and Cancellation of Poles [11] require the linear model of a system. Moreover, some of the design methods, e.g. Tuning based on Gain and Phase margins [2] and λ-tuning [4], are applicable only to the low-order systems. For the multivariate PID tuning it is commonly assumed that a plant can be decoupled into SISO systems [39]. A design of controllers is then performed separately for every system via the standard techniques. The evolutionary based optimization methods for multivariate PID tuning do not require the decoupling and are not restricted to the low-order systems [20]. Nevertheless, the known or approximated model of a plant is still required for the evaluation of a fitness function. For the unknown systems, auto-tuning methods such as Relay Auto-tuner [34] or Feature-Based tuning techniques [4] are used. Both perform the model estimation with subsequent application of a model-based tuning techniques. Application of learning-based methods to the control tasks in general and PID tuning in particular is limited by their data inefficiency. A system rollouts of a real control scenarios can be rather costly. Addressing this limitation, Deisenroth et al. [14] developed a model-based policy search algorithm PILCO, which maintains the probabilistic model of a system dynamics learned from an agent-system interactions. Such setup permits efficient learning with relatively small amount of observed data. Doerr et al. [16] extended PILCO for multivariate PID tuning. The developed framework is applicable to a nonlinear MIMO systems and does not require prior knowledge about the system dynamics. Gain scheduling is an efficient way to control a nonlinear and non-stationary systems, whose dynamics changes with different operating conditions [4]. The design of a scheduled controller, usually performed via the linearization of a plant and subsequent application of linear design techniques [29]. The scheduled controller is then obtained using the nearest-neighbor interpolation, e.g. controller switching, or linear interpolation, e.g. controller blending. Additionally, Fuzzy logic can be exploited to introduce the human expertise into the controllers interpolation [38]. In contrast, gain scheduling can be implemented in a form of a parametrized nonlinear function, which computes the gains directly [9]. The parameters of this function are obtained via the optimization procedures. For both approaches, a known model of system dynamics is required. In this thesis we extend the multivariate PID tuning framework by Doerr et al. to learn the gain scheduling function for a nonlinear and non-stationary MIMO systems control. The proposed method does require neither prior knowledge of a dynamics model, nor its linearizion. The 13 1 Introduction scheduling function is defined as a set of Gaussian Processes (GP) parameterized by a training targets with a uniformly distributed training inputs. All equations, necessary for the integration of this definition into PILCO, were derived and implemented. Tests were conducted using the simulations of Noisy Cart-Pole and Non-stationary Mass-Damper systems, which were developed as a part of the thesis. In order to prove the applicability of the method to the real control task, it was exploited to tune the scheduled PID controllers of the autonomous RC car. Different approaches to the RC car dynamics modeling were proposed and evaluated. Additionally, modifications of the control mechanism, which can potentially improve the results of model learning were introduced and implemented. The thesis continues with a presentation of a theoretical background (Chapter 2) required for the understanding of a further work. This background includes the basics of Control Design, PID controller, gain scheduling and GP. Next, Chapter 3 surveys the related research in PID tuning and gain scheduling. Additionally, it contains the detailed explanation of PILCO and its application to PID tuning. Following chapter (Chapter 4) introduces the PILCO modification for a scheduled PID learning. The chapter presents both, definition of gain scheduling function and all derivations required for its integration into PILCO. Next two chapters are devoted to the testing of the implemented PILCO modification. Test experiments with Noisy Cart-Pole and Non-stationary Mass-Damper are presented in Chapter 5. Chapter 6 summarizes the work, made in algorithm application to the autonomous RC car. It starts with the system description and control mechanism analysis. Then the approaches to the dynamics modeling are presented. Chapter 6 ends with a discussion of the proposed control modifications aimed to improve the model learning results. Finally, Chapter 7 concludes this work. 14 2 Background In the following chapter the theoretical background, required for the further work is presented. This background includes basic information about Control Design and PID Controller as well as about Gaussian Processes. 2.1 Control Design Åström and Murray [4] define control as exploitation of one dynamical system, called controller to influence the behavior of another one, called plant. Plant has a sate x ∈ RM . y ∈ RM is a state observation (in some real scenarios, state x can be observed only partially). Closed-loop control system is a mechanism, where the output of a controller u ∈ RF depends on a process variable ŷ ∈ RD≤M , which is an observation of (the part of) a plant state x that have to be controlled. The output of a controller is calculated based on the feedback error or control error: e = r − ŷ ∈ RD , where r is reference value or set point. From now onward for notation clarity we will assume that process variable is a complete observation ŷ = y. In a control domain plants and controllers are commonly described in terms of a transfer functions, which define the relation between their inputs and outputs of a Linear Time-Invariant (LTI) system. A transfer function is defined as: Gp(s) = Y (s) U(s) = L{y(t)} L{u(t)} , with L{·} being a Laplace transformation. The transfer function can be rewritten in a pole-zero form: Gp(s) = K (s − z1)(s − z2)...(s − zm) (s − p1)(s − p2)...(s − pn) , where z1, ..., zm are system zeros and p1, ..., pm are system poles. Denominator of a pole-zero form of the transfer function is called characteristic polynomial. The number of roots of characteristic polynomial (system poles) defines the order of a system. Transfer function of MIMO plant has a form of a matrix: Gp(s) =  g11(s) . . . g1F (s) ... . . . ... gD1(s) . . . gDF (s)  ∈ R D×F . (2.1) One of the main advantage of a transfer function notation is that the complex system, which contains the number of different functional subsystems, can be represented as a product of their transfer functions. Thus, the transfer function of a closed-loop control system is given as: Gl(s) = C(s)Gp(s) 1 − C(s)Gp(s) , 15 2 Background 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 0 0.5 1 1.5 Steady-State Error Settling time Rise time Percent Overshoot Figure 2.1: Quantities of a control system performance where C(s) and Gp(s) are controller and plant transfer functions respectively. The control design has two common objectives, namely a set point following and a disturbance rejection [4]. The set point following objective is defined by the performance requirements. A performance of a SISO control system is often measured by applying a step function as r, and observing the response of y. Commonly, the response is quantified by waveform characteristics [23]. Rise time is the time, that system takes to reach the 80− 90% of final value after excitation with step r. Percent overshoot is the amount that y overshoots the r, expressed as a percentage of the final value of r . Settling time is the time required for the y to settle within a certain percentage (usually 5%) of the final value. Steady-state error is the final difference between the process variable y and set point r . Exact definition of these quantities can vary in industry and academia. For the MIMO systems these characteristics are measured for every output dimension. All described quantities are presented on Figure 2.1. The disturbance rejection objective is defined via the error, caused by the disturbance. Common quantities to characterize the disturbance attenuation are maximum error, time to maximum, decay ratio, Intergated Absolute Error (IAE) and Intergated square Error (ISE) [4]. 2.2 PID controller PID is a closed-loop feedback controller [2]. PID consists of three correcting terms, which are multiplications of gains with control error, its integral and derivative. Gains are the coefficients, which define the impact of a particular error type on PID output. Summation of terms gives the control output according to the equation: u(t) = Kpe(t) + Ki ∫ t 0 e(τ)dτ + Kd de(t) d(t) . (2.2) In the above equation: 16 2.2 PID controller 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 0 0.5 1 1.5 Step Response Kp 300 Kp 200 Kp 100 (a) Proportional gain 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 0 0.5 1 1.5 Step Response Ki 100 Ki 70 Ki 30 (b) Integral gain 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 0 0.5 1 1.5 Step Response Kd 20 Kd 10 Kd 5 (c) Derivative gain Figure 2.2: The impact of PID gains on a waveform of a step response. Kp is a Proportional gain Ki is an Integral gain Kd is a Derivative gain e(t) is a control error t is a time τ is an integration variable. The proportional term Kpe(t) accounts for the magnitude of current control error. The proportional gain determines the influence of the error on control output. As name suggests, high values of the gain result in the large change of control output for the given change in error. Excessively high values of the proportional gain could lead to the oscillation of the output and destabilization of the system, whereas low values could potentially make it insensitive to the error. As it demonstrated in Figure 2.2a, an increase in proportional gain reduced both the rise time and the steady-state error Additionally the overshoot is increased, and the settling time descries by a small amount. The integral term Ki ∫ t 0 e(τ)dτ accounts for both, magnitude of the error and its duration. It is a summation of all instantaneous errors over the time which corresponds to accumulated offset that should have been corrected previously. The integral gain defines the impact of this offset on the output of controller. An increase in integral gain tends to reduce the steady-state error since it is compensating for the collected errors from the past, however the overshooting (undershooting) is increased for the same reason (Figure 2.2b). The derivative term Kd de(t) d(t) is set to compensate for the approximated future error behavior using the derivative of error with respect to time. Thus, tuning of derivative term improve the settling time and stability of the system (Figure 2.2c). Most implementations of PID include additional low-pass filter for the derivative to limit the influence of a noise on a system. Transfer function of PID controller is defined as: C(s) = Kp s2TiTd + sTi + 1 sTi , Ti = Kp Ki , Td = Kd Kp . (2.3) 17 2 Background 2.2.1 Multivariate Case In a general multidimensional case, equation (2.2) can be rewritten in a matrix notation form as following: ut = K pet + K i ∫ t 0 et + Kd Ûet . (2.4) In the above equation K p, K i, and Kd are gain matrices of dimensionality RF×D , where F is number of controller outputs and D is number of error states. Similarly, et , ∫ t 0 et and Ûet are vectors of errors with dimensionality RD at time step t and ut ∈ R F is a control output at time t. Two types of Multivariate PID controllers can be defined depending on the structure of gain matrices. The controller, which has the gain matrices of diagonal form is called Decentralized (Decoupled). Every output dimension of such controller depends only on one control error state. On the other hand, Centralized PIDs has gain matrices of arbitrary forms, meaning that every output of controller arbitrary depends on control errors. Transfer function of multivariate PID controller can be obtained similarly to (2.1). A transfer matrix is of diagonal form for a decoupled controller and arbitrary for a centralized. Centralized controllers have a better performance on systems with high interconnection between the states, however tuning of such controllers is significantly more difficult than decentralized ones, which is itself a tedious task. A detailed explanation on tuning of PID controllers is provided in Chapter 3. 2.2.2 Gain Scheduling Gain scheduling is a changing of the controller’s parameters (gains) depending on the measurements of operating conditions [4]. That measurements are called scheduling variables and could be systems states, process value, output of the controller or even external signals. Many design notions can fall under such broad definition - controller switching and controller blending fits the notion of gain scheduling as well as introduction of a nonlinear scheduling function [9, 25]. The gain scheduling is an efficient way to control nonlinear and non-stationary systems, whose dynamics changes with different operating conditions [4]. The most typical approach to design of gain scheduled controller for nonlinear or time varying system is a linearizion scheduling, which includes three steps [29]. The first step is computation of a linear parameter-varying model of the system. The most common method for such modeling is Jacobian Linearizion of a system around the equilibrium points ρ. This method results in a parametrized family of linearized systems, which form a basis for linearizion scheduling. The parametrization corresponds to the fixed values of the scheduling variables observed at corresponding equilibrium point. Next step is to use the linear controller design methods for every linearized system in the family. This step results in a set of linear controllers at isolated values of scheduling variable. Final step is to obtain a linear parameter-varying controller. The simplest approach is controller switching, which is the nearest-neighbor interpolation of a controller parameters for the observed value of a scheduling variable. Even though such approach benefits form the simplicity, it produces discontinuous jumps in controller parameters which leads to performance decrease and probable destabilizations. Another 18 2.3 Gaussian Processes way is to exploit linear interpolation, e.g. controller blending. Independent from linearizion and interpolation method, the more linear controllers are designed, the better performance is achieved [29]. More advanced method of gain scheduling will be reviewed in Chapter 3. 2.3 Gaussian Processes A Gaussian process is a generalization of the Gaussian probability distribution, which describes the functions instead of scalars or vectors [26]. GP is a Bayesian nonparametric model which outputs the posterior distribution over the functions that describe the observed data p( f |D). Non parametric stands for a model Mw parametrized by w, which is selected from some family {Mw : w ∈ W}, whereW is infinite dimensional [31]. GP is completely specified by the prior distribution and observed data D: n input vectors, x ∈ RD concatenated into matrix X ∈ RD×n, and targets y ∈ R, concatenated into y ∈ Rn. Prior distribution represents the beliefs over the kind of functions, which one expects before the data is shown. The prior of GP is defined by mean function m(·) and kernel (positive semidefinite covariance function) k(·, ·). m(·) is commonly kept zero. Square Exponential is the usual choice for the kernel: κ(x, x ′) = σ2 f e ( − 1 2 (x−x ′)TΛ−1(x−x′) ) . (2.5) In the above equation, Λ is diag(l)2, where l ∈ RD is a vector of positive values called length-scales. The inverse of length-scale li determines the relevance of input dimension i. The hyper-parameters of GP, θ are defined then as (l, σ2 f ) T . For the realistic modeling scenario the noisy observations instead of a direct function values have to be assumed y = f (x) + ϵ [26]. For that purpose the additive independent identically distributed Gaussian noise ϵ with variance σ2 w is added to a kernel: κ(x, x ′) + δσ2 w , with δ being a Kronecker delta which returns one if x = x ′ and zero otherwise. σ2 w is included into GP hyper-parameters θ = (l, σ2 f , σ 2 w) T . Covariance matrix for noisy observations �y is defined as (� + σ2 w I ) with �i j = κ(xi, x j). The hyper-parameters are obtained using the maximization of marginal likelihood: p(y |X ; θ) = ∫ p(y | f , X ; θ)p( f |X ; θ)df (2.6) The first two moments of a predictive posterior p(y∗ |x∗, X, y; θ) of the GP for the new input x∗ are defined as following: E[y∗] = κ(X, x∗)T�−1 y y, (2.7) var[y∗] = κ(x∗, x∗) − κ(X, x∗)T�−1 y κ(X, x∗), (2.8) The noticeable drawback of GP is its computational inefficiency - inversion of �y is of O(n3) complexity. In order to reduce the complexity, sparse approximation of GP posterior can be used [12]. Snelson and Ghahramani [30] presents the Sparse Pseudo-Input Gaussian Processes. This model is parametrized by the pseudo (artificial) data set D̄, with the size m ≪ n. Pseudo inputs x̄ are placed via the maximization of marginal likelihood (2.6). Pseudo targets f̄ (without noise) are 19 2 Background integrated out using the prior N( f̄ |0,�m) with �mi j = κ(x̄i, x̄ j). The posterior of resulting sparse model is computed wilt complexity O(m2n). After additional precomputations complexity can be improved even further to O(m) for the mean of prediction and O(m2) for the prediction variance [30]. 20 3 Related Work This chapter gives the overview of the work related to the master thesis. First, the PID tuning research is surveyed. Then detailed explanation of the PILCO [14] and its application to optimization of PID gains [16] are presented. Chapter is concluded with the analysis of research in gain scheduling. 3.1 Research in PID Tuning In the following section the review of relevant PID tuning techniques will be presented. For convenience all techniques are grouped into two artificial categories: classical techniques and techniques based on optimization. 3.1.1 Classical Techniques Feature-based techniques are exploiting a features of system dynamics such as static gain K , velocity gain Kv , dominant time constant T and dominant dead time L [3], estimated via experiments, for the system approximation. Time-domain Ziegler-Nichols method (ZN) [4] is the most basic example of a feature-based tuning techniques, which is still used in industrial applications. This method approximates the system with a first-order plus dead-time model using the measurements obtained from the step response: Gp(s) = K 1 + sT e−sL . (3.1) The gains are given as functions of dominant dead time L and static gain Kp. Additionally, the Frequency-domain ZN [19] method defines gains as the functions of ultimate gain Ku (gain under which the system starts to oscillate) and ultimate period Tu (period of the oscillation). The ZN methods are easy to apply, however they result in a stable but not optimal in term of set point following and disturbances attenuation controller (Section 2.1). Thus many modifications were proposed, which aimed to retain the simplicity of ZN and improve the performance in the same time. Approximate M-constraint Integral Gain Optimization (AMIGO) [4] uses the model of a system dynamics based on velocity gain Kv: Gp(s) = Kv s e−sL . (3.2) Additionally AMIGO uses different tuning rules (gains functions). The design goal of these rules was to maximize the integral gain while fulfilling the specified disturbance attenuation constraints [4]. 21 3 Related Work Tuning Based on Gain gm and Phase φm margins [11] is aiming to find a PI controller: C(s) = Kp sTi + 1 sTi , (3.3) for the given approximation of a system model (3.1) so that closed-loop system has defined gain and phase margins. Kp and Ti are obtained via the solution of gm and φm equations [22]. Analytical methods use the system dynamics model (approximated) to directly calculate the gains of PID controller. A Pole Placement [1] is a subset of an analytical method applied to the systems with a low-order transfer function. A common approach is to specify the desired damping ratio ζ and the natural frequency ω0 [22] for the system and to position the poles such that fulfill the specified constraints and give the required closed-loop performance. For instance, pole placement can be applied to a first-order plant: Gp(s) = K 1 + sT , (3.4) controlled by PI controller (3.3). The transfer function of such closed-loop system will have the characteristic polynomial of a form: s2 + 2ζω0s + ω2 0, then controller parameters are derived from actual transfer function as: Kp = 2ζω0T − 1 K , Ti = 2ζω0T − 1 ω2 0T . (3.5) The pole placement can also be exploited to tune the PID (2.3) applied to the second-order system: Gp(s) = K (1 + sT1)(1 + sT2) . (3.6) The characteristic polynomial of a closed-loop system will have a form: (s + αω0)(s2 + 2ζω0s + ω2 0), and PID gains are obtained similarly to (3.5). The Dominant Pole [1] design is a simplified pole placement technique used for high-order systems. Commonly, dominant dynamics of a high-order system can be approximated by two dominant poles p1, p2 that are poles with real part closest to zero [11, 23]. Therefore dominant poles can be placed in a complex plane such that result in a system with desired ζ and ω0: p1 = −ζω0 + iω0 √ 1 − ζ2, p2 = −ζω0 − iω0 √ 1 − ζ2. 22 3.1 Research in PID Tuning The PID gains are determined such that p1 and p2 are the roots of characteristic polynomial of closed-loop system. Dominant pole placement can be applied to Multivariate PID tuning [21]. The λ-tuning method assumes first-order plus dead-time model (3.1) with a long dominant dead time L [4]. This method specified the desired closed-loop transfer function as: Gl(s) = e−sL 1 + sλT , where λ is an additional tuning parameter which influences the response time of the resulting closed-loop system. λ < 1 leads to a faster response and decrease of the integrated absolute error, but also increases sensitivity of a controller to variations in a system dynamics. The controller transfer function is then obtained from a transfer function of closed-loop system with feedback error: C(s) = 1 + sT K(1 + λsT − e−sL) , which, for L = 0, becomes a PI controller with: Kp = 1 λK , Ti = T . Internal Model Control (IMC) is a general control system design method, when the controller contains the model of a system [4] . The corresponding controller transfer function looks as: C(s) = G f G † m 1 − G f G † mGm , where Gm is a model transfer function, G†m its inverse and G f is a low-pass filter. The low-pass filter accounts for modeling errors. This method results in high-order controllers. However, for simple models such as (3.4) and (3.1), it is possible to obtain PI or PID controller. PID controller for (3.4) and a filter of form: G f (s) = 1 1 + sTf , obtained via IMC is defined as: C(s) = (1 + sL/2)(1 + sT) Ks(L + Tf ) , with Tf being design parameter. The main advantage of IMC method is that the robustness is considered explicitly during the controller design. Skogestad [27] proposes the internal model control method to tune the PI controller for the first-order plus time-delay systems (3.1) using Taylor-series approximation of exponential term. Dong and Brosilow [17] propose the method to derive the Multivariate PID from the Multivariate IMC using the MacLaurin series expansion. 23 3 Related Work Zheng et al. [39] propose another approach, similar to the IMC in a way that PID is obtained via the transformation to another type of controller. In presented work PID is transformed to a Static Output Feedback Controller (SOFC). The stabilization problem of static feedback controller is solved using the Linera Matrix Inequalities (LMI) method. PID gains are then recovered from the obtained controller. Loop-shaping techniques can also be applied to PID tuning problem [4]. Loop-shaping aims to obtain the desired transfer function of a closed-loop system choosing the right controller. First, the desired gain crossover frequency ωgc is chosen (for instance based on requirement of attenuation of load disturbance). Additionally, target phase margin φm is specified. Then gains of PID are calculated such that closed-loop system meets given specifications [4]. Most of the reviewed design methods require the known linear model of a system dynamics, e.g. Analytical techniques [1] and Loop-shaping [4]. ZN and AMIGO [4] approximates the unknown system with a low-order models using the measurements from a relay or step response experiments. Some, techniques are applicable only to a low-order plants, e.g. Gain and Phase margin based design [11], λ-tuning [4]. For the high-order systems, either the low-order approximations have to be computed, e.g. Dominant Pole design [21], or transformation to a different type of controller has to be performed [17, 39]. Among the surveyed methods, only a Dominant Pole [21] design can be relatively easy extended to a MIMO settings. 3.1.2 Optimization Based Techniques The optimization methods can also be exploited for PID tuning. Given a controller parametrized by its gains, one is optimizing the closed-loop performance defined as functions of controller parameters. Among the popular optimization criteria are IAE, ISE etc. The optimization methods are rather sensitive to the formulation of criteria, meaning that they can result in optimal, with respect to ISE, but unstable controller because of neglected constraints [4]. Local minimas of the optimization criteria can also deteriorate the performance of method [4]. Cancellation of Poles and D-partitioning [11] are optimization methods for PID tuning which require the linear model of a plant. Modulus Optimum and Symmetrical Optimum are limited to low-order systems [1]. Genetics algorithms and evolutionary based techniques are, similarly to optimization methods seeking for the controller parameters, which optimize some criteria. However these methods are not restricted to the low-order and SISO systems and demonstrated their ability to overcome local minima problems [20]. Chang [8] proposes to use a multi-crossover genetic algorithm to optimize the IAE of multivariate PID controller. All elements of PID gain matrices (2.4) are concatenated into one vector, used as a chromosome. Algorithm was tested on classical binary distillation column plant - first order MIMO system with strong interaction between input and output pair, and significant time delays. According to the results, genetic algorithm converged to the stable controller after 300 of generations. Gaing [18] applies modified Particle Swarm Optimization algorithm for tuning of multivariate PID controller. Each PID gain was presented by a particle. Tests were conducted on two automatic voltage regulator systems. After 10 seconds of optimization in average algorithm was able to output stable controller gains. 24 3.2 PILCO Even though the above algorithms managed to demonstrate satisfactory results for a MIMO testing systems, evolutionary based approaches cannot guarantee convergence [28]. Moreover, both require the model of a plant for the evaluation of a fitness function. 3.2 PILCO The data inefficiency of RL limits its application to control and robotics tasks, where the agent- environment interaction could become rather costly. Deisenroth et al. [14] address this problem via modeling the observed dynamics of the environment with a flexible nonparametric approach. The model-based RL methods are more efficient in extracting valuable information from the data [14]. However, the performance of such methods degrades significantly if learned model is biased. That is why, in order to compensate for the bias in model, Disenroth et al. are exploit probabilistic model with the explicit notion of uncertainty, which is also incorporated into planning and policy evaluation. Authors are presenting PILCO, a model based-policy search framework with nonparametric GP (Section 2.3) used as a model. Long term predictions and policy evaluation in PILCO based on deterministic approximation of GP inference. Policy updates use the analytic solution of policy gradients. Here onwards the detailed description of PILCO is presented with all equations, since they will be used in further chapters during the derivation of the gain scheduling. Nonetheless, policy gradients are omitted, because of auto-derivation software exploitation. PILCO considers the dynamic systems of form: xt+1 = f (xt, ut ) + ω, ω ∼ N(0,Σω), (3.7) where x and u is a continuous-valued system state RD and control RF respectively. f is unknown transition dynamics and ω is a Gaussian system noise. Policy search is optimizing π : x 7→ π(x, θ) with respect to expected long term cost: Jπ(θ) = T∑ t=0 E[c(xt )], x0 ∼ N(µ0,Σ0). (3.8) In the above equation c(xt ) is a cost of being in state x at time t. A policy π is a function parametrized by θ. In order to find π∗, which minimizes (3.8) PILCO builds a probabilistic GP model and uses the deterministic approximation of inference to compute the long term predictions and evaluate policy, then optimizes it exploiting analytical gradients (here however omitted, due to the usage of auto-derivation software). PILCO steps are summarized in Algorithm 3.1. As a probabilistic dynamics model, PILCO uses GP, with tuples (xt, ut ) ∈ R D+F as training inputs and differences ∆t = xt+1 − xt ∈ R D as training targets. As it was stated in Section 2.3, GP outputs the posterior distribution over the functions, thus it cannot have multidimensional output. That is why PILCO actually uses D conditionally independent GP, trained simultaneously for every target dimension. 25 3 Related Work Algorithm 3.1 PILCO 1: procedure Policy Search 2: Initialize: Sample policy parameters θ ∼ N(0, I ) Apply π(θ) and record data {xt, ut }t=1...T . 3: repeat 4: Learn probabilistic GP dynamics model, using all data. 5: repeat 6: Approximate inference for policy evaluation Jπ(θ). 7: Obtain gradients Jπ (θ dθ (auto-derivation software). 8: Update parameters θ (e.g., CG or L-BFGS). 9: until convergence; return θ∗ = argminJ(θ) 10: Set π∗ ← π(θ∗). 11: Apply π∗ to system and record new data {xt, ut }t=1...T . 12: until task learned; return π∗ 13: end procedure According to Section 2.3, GP is completely specified by prior mean function m(·) (which is usually taken as zero) and kernel k(·, ·). In a PILCO paper Disenroth et al. use square exponential kernel (2.5). Training points of GP are defined as X̃ = [x̃1, ..., x̃n] and y = [∆1, ...,∆n] T . Hyper-parameters of GP posterior are obtained via marginal likelihood maximization (2.6). GP posterior gives one-step prediction xt+1, which is Gaussian distributed: p(xt+1 |xt, ut ) = N(xt+1 |µt+1,Σt+1), (3.9) µt+1 = xt + E[∆t ], Σt+1 = var[∆t ], In the above equation E[∆t ] and var[∆t ] are calculated according to (2.7) and (2.8) respectively. In order to evaluate Jπ and optimize π (steps 6, 7, and 8 at Algorithm 3.1) PILCO constructs the t-step-ahead marginal distribution p(x1 |π), ..., p(xT |π) (further on conditioning on π is omitted for simplicity) with the initial state distribution p(x0) = N(x0 |µ0,Σ0), which is done by the recursive application of (3.9). For that, one needs to propagate the uncertain inputs through the GP. For one propagation step, i.e. to obtain xt+1 given p(xt ), joint distribution p(x̃) = p(xt, ut ) is required (3.7). This distribution is approximated by Gaussian p(x̃) = N(x̃t | µ̃t, Σ̃t ). Using the approximated joint input distribution, posterior for uncertain input is given by: p(∆t ) = ∫ ∫ p( f (x̃t )| x̃t )p(x̃t )df d x̃, (3.10) with f being the posterior of GP. Exact computation of (3.10) is analytically convoluted, that is why it is also approximated by Gaussian. Next state prediction p(x̃t+1) is obtained then as: µt+1 = µt + µ∆, Σt+1 = Σt + Σ∆ + cov[xt,∆t ] + cov[∆t, xt ]. (3.11) 26 3.2 PILCO Above approximation can be obtained using either Moment Matching or Linearizion of GP posterior mean function [14]. Details of latter method is omitted here since the Moment Matching is used in a further work. To construct t-step-ahead marginal distribution using Moment Matching, one needs to compute predictive mean µ∆, predictive covariance matrix Σ∆, covariances cov[x̃t,∆t ], cov[∆t, x̃t ] (3.11) and cross-covariance cov[x̃t, µ∆]. First two terms are needed to obtain the distribution p(∆t ) = N(µ∆,Σ∆) (3.10), covariances are used to get next step state distribution and cross-covariance is used to get next step joint state-action distribution. Exact derivations of mean and variance [14] are not presented here for thesis length reasons. Covariance and cross-covariance terms, in contrast, listed with detailed derivation, since they will be referenced during the formulation of gain scheduling in the following chapters. The cross-covariance is computed between the joint state-action distribution at time step t : x̃t ∈ N(x̃t | µ̃t, Σ̃t ) and corresponding predicted (approximated) state difference xt+1 − xt = ∆t ∼ N(µ∆,Σ∆) as following: cov[x̃t,∆t ] = E[x̃t∆Tt ] − µ̃tµ T ∆, (3.12) where the right term is defined, µ∆ is given [14] and E[x̃t ] defined previously. In order to compute the left term, for each state dimension a = 1, ...,D, low of iterated expectations is applied: E[x̃t∆at ] = E[x̃t E[∆at | x̃t ]] (3.13) = ∫ x̃tma f (x̃t )p(x̃t )d x̃t = ∫ x̃t ( n∑ i=1 βaika f (x̃t, x̃i) ) p(x̃t )d x̃t, with posterior mean function m f (x̃t ) represented as a finite kernel expansion. Then summation order is changed and βai is pulled out of integral: E[x̃t∆at ] = n∑ i=1 βai ∫ x̃t c1N(x̃t | x̃i,Λa)︸ ︷︷ ︸ κa f (x̃t, x̃i ) N(x̃t | µ̃t, Σ̃t )︸ ︷︷ ︸ p(x̃t ) d x̃t, (3.14) where c1 := σ2 f a(2π) D+F 2 |Λa | 1 2 is introduced to represent the GP kernel as an unnormalized Gaussian for training inputs x̃i, i = 1, ..., n. The product of two Gaussian (3.14) also results in unnormalized Gaussian c−1 2 N(x̃t |ϕi,Ψ) with moments defined as following (A.2): c−1 2 = σ 2 f a(2π) −D+F 2 |Λa + Σ̃t | − 1 2 e ( − 1 2 (x̃i−µ̃t ) T (Λa+Σ̃t ) −1(x̃i−µ̃t ) ) , (3.15) ϕi = (Λ −1 a x̃i + Σ̃ −1 t µ̃t ), Ψ = (Λ−1 a + Σ̃ −1 t ) −1. After pooling all variables, independent of x̃t , out of integral in (3.14), it results in expected value of product of two Gaussians ϕi: E[x̃t∆at ] = n∑ i=1 c1c−1 2 βaiϕi, a = 1, ...,D. (3.16) 27 3 Related Work After inserting it back into (3.12): cov[x̃t,∆at ] = n∑ i=1 c1c−1 2 βaiϕi − µ̃t µ a ∆. (3.17) Final simplifications leads to: cov[x̃t,∆at ] = n∑ i=1 c1c−1 2 βaiΣ̃t (Λa + Σ̃t ) −1(x̃i − µ̃t ), a = 1, ...,D. (3.18) Desired covariance terms cov[x̃t,∆t ] from (3.11) is a sub-matrix of size RD×D of matrix R(D+F)×D computed at above equation. Now all terms required for the uncertainty propagation are defined, thus t-step-ahead marginal distribution can be obtained. Last step to evaluation of Jπ (3.8) is computation of expected values: E[c(xt )] = ∫ c(xt )N(xt |µt,Σt )dxt, with c being a cost function. PILCO uses binary saturation cost function: c(x) = 1 − e ( − 1 2σ2 c d(x,xtar) 2 ) ∈ [0, 1], where d is euclidean distance and σ2 c is parameter, which controls the width of the cost function. For the further work this function will be kept, since it allows for a natural exploration [14]. PILCO was evaluated using Double-Pendulum swing-up and Cart-Pole problems. Main focus was set on learning speed and quality of approximated inference. PILCO was trained for 15 iterations (37.5 seconds of interaction) for Cart-Pole system and 30 iterations (75 seconds) for the Double-Pendulum swing-up. As a results, algorithm in 95% was able to learn the policy, which solves the Cart-Pole problem, after 15 - 20 seconds of experience. The same success rate for the Double-Pendulum system was achieved after 40 - 50 seconds of experience. Additionally, resulting t-step-ahead marginal distribution has a relatively small variance for the successfully learned policy. That indicates the ability of GPs to model the dynamics of both systems. Taking into account the difficulty of tasks and obtained results, one can conclude that PILCO fulfilled the data efficiency constraint. 3.3 PID Tuning using PILCO Considering the demonstrated performance of PILCO and its applicability to the real control scenarios, i.e. data efficiency, Doerr et al. [16] extend the described framework for multivariate PID tuning. To represent the arbitrary multivariate PID controller as a parametrized static policy, a state information has to be extended. In order to compute the controller output ut , all necessary error information from (2.4) has to be available at time step t. Additionally, a joint distribution for the 28 3.3 PID Tuning using PILCO extended state has to be calculated in order to propagate the uncertainty, compute the t-step-ahead marginal distribution and evaluate JPID. New system state zt was introduced. It contains the error at previous time step et−1 and errors, accumulated until the previous time step ∑t−1 τ=0 eτ : zt := [ et−1,∆T t−1∑ τ=0 eτ, x̃t−1 ] , with [·, ·] stating for concatenation of vectors and ∆T being a system sampling time. Then the GP posterior p(xt ), approximated via Moment Matching, is added to the state, resulting in z̃t = [zt, xt ]. After calculation of posterior expectation, variance and cross-covariance terms, described in Section 3.2, the joint distribution for the new state extension can be obtained. Assuming that the xt is independent from the error part of zt one gets: [ ẽt−1 x̃t−1 ] xt  ∼ N ©­­« [ µzt µxt ] ,  Σzt [ 0 Σxt, x̃t−1 ] [ 0 Σx̃t−1,xt ] Σxt  ª®®¬ , (3.19) where ẽt−1 is the errors part of extended state. Σxt, x̃t−1 stands for the cross-covariance term cov[xt, x̃t−1] and Σx̃t−1,xt for cov[x̃t−1, xt ]. Next, a desired trajectory information has to be added. The target trajectory xtr is given as Gaussian Distribution N(xtr |µtr,Σtr) [15]. A new state extension is z̃t = [zt, xt, xtr,t ]. Since the target trajectory is independent from z̃t the joint distribution of the new state extension can be obtained similarly to (3.19), where the state prediction part xt is independent from ẽt−1. A new mean of z̃t is just a concatenation of a previous mean and a target mean µtr,t . A covariance matrix is a concatenation of Σz̃ and Σtr,t along the diagonal. The current error is a difference between the predicted state xt and desired xtr. For the current error a derivative and integrated error approximations are calculated as following: Ûet ≈ et − et−1 ∆T ,∫ t ·∆T 0 e(τ)dτ ≈ ∆T t−1∑ τ=0 eτ + ∆T et . Defined above errors are linear transformations of z̃, thus the joint distribution of the new state extension z̃t = [zt, xt, xtar,t, et,∆T ∑t τ=0 eτ, et−et−1 ∆T ] remains Gaussian and can be obtained via the known formulas of linear Gaussian transformation (A.1). At this point, z̃t contains the information required for the calculation of PID output ut as a linear transformation of the augmented state. The final joint distribution p( z̃t, ut ) is calculated according to (A.1). All values required for one PILCO iteration are defined. Evaluation experiments were conducted using the inverted pendulum problem on the humanoid upper-body robot Apollo. State of the inverted pendulum system consists of four states: effector position and velocity, a pendulum angle and an angular velocity. In order to omit the modeling of 29 3 Related Work the entire system state and its dynamics during the experiments, the Non-linear Auto-regressive Exogenous (NARX) model [7] for the effector position and pendulum angle was used. For the dynamics modeling Doerr et al. [16], use a sparse GP with the hyper-parameters learned by a marginal likelihood maximization. According to the demonstrated results, the presented algorithm was able to learn accurate dynamics model and optimize the PID policy with 106 seconds of interaction with physical system in total. Hence, the algorithm preserves the data-efficiency property of PILCO and can be applied to real control scenarios. Therefore, this work was considered as a suitable starting point for the further research in data-driven controller tuning. For the scope of this thesis it was decided to extend the presented algorithm for learning of optimal gain scheduled PID, with the aim on improvement of performance for nonlinear systems and on application to non-stationary systems [4]. 3.4 Gain Scheduling Apart from Jacobian linearizion mentioned in Section 2.2.2, Quasi-Linear Parameter Varying (QLPV) [29] approach can be used to obtain a family of linear systems. The nonlinear system is rewritten in a way to replace the nonlinear terms with new time-varying parameters and use them later one as a scheduling variables. This method does not require the linearizion and equilibrium point determination. However, new definition of a system can introduce "the additional behavior beyond the original plant description" [29], which is complicating the design of a suitable controller. Not all systems can be redefined as QLPV which constraints the possible applications of this method. Moreover, the QLPV representation of a system is not unique and have to be additionally determined for a particular implementation of gain scheduling [29]. The extension of linearizion based scheduling (Jacobian and QLPV) to MIMO systems depends on the controller design method used for a linearized parameter varying system [6]. 3.4.1 Fuzzy Logic Based Gain Scheduling The Fuzzy Inference [37] is a common mechanism to incorporate the system knowledge into a scheduling step (interpolation of controllers) via the specification of a Fuzzy rules (membership functions) in order to increase the performance and robustness [38]. Visioli [33] presents the work where the gain scheduling was applied to build the two degrees of freedom controller with one part handling the attenuation of load disturbances and the second part being responsible for set point following. Two controllers were tuned separately using ZN. The resulting controller then was scheduled based on the current control error and its derivative using the Fuzzy Inference for interpolation. Woo et al. [36] applies the Fuzzy inference to schedule a parallel PI and PD controllers. They also introduced the Parameter Regulator which adaptively change the membership functions of the Fuzzy module, based on the step response. The requirement of a human expertise and absence of general tuning method makes the Fuzzy scheduling strongly application dependent [38]. Since the Fuzzy logic based scheduling modifies the interpolation step only, it does not influence the applicability to MIMO systems. 30 3.4 Gain Scheduling 3.4.2 Artificial Neurons Based Gain Scheduling Due to the capability of neural networks to accurately approximate the highly nonlinear function, they appear in gain scheduling research [24]. Chen and Huang [10] design the gain scheduled PID controller for a highly nonlinear systems. PID gains are scheduled using the separate outer loop, composed of a General Minimum Variance (GMV) controller [5] and a neural network. The neural network is trained off-line to model the dynamics of a system. The linear dynamics model is obtained at every control iteration via the linearizion of the neural net. The GMV computes then the optimal gains of PID for the given linear model of a system. The proposed method, however, accounts only for a SISO nonlinear systems. Chang et al. [9] propose the method to schedule a multivariate PID using the Auto-tuning Neurons. Auto-tuning neuron is a sigmoid function with an adaptive shape applied to the thresholded input. Sigmoid is parametrized by the saturation level and its slope. Each PID gain of MIMO controller is defined as an output of such sigmoid. The gains are scheduled then based on the control error. In order to tune such scheduled controller a monotonically responding system is assumed, i.e. the output is monotonically increasing or decreasing with the increase in a control signal. Under such assumption the authors were able to derive the optimization algorithm, which outputs the optimal sigmoid parameters with respect to the IAE. 31 4 Gain Scheduling using PILCO In this chapter the definition of gain scheduling function is presented as well as mathematical derivations, necessary for its implementation and integration into PILCO framework. 4.1 Formulation As it was discussed in Section 2.2.2 and Section 3.4, a gain scheduling can vary from simple interpolation of linear controllers, built at system’s equilibrium points, to the complex nonlinear function, which defines the gains given the scheduling variable. For the purpose of this master thesis it was decided to follow the latter approach and encapsulate the entire scheduling into one nonlinear parametrized function. Parameters θ of this function then will be optimized as a policy parameters π(θ) during the PILCO iteration (Algorithm 3.1). In order to stay in fully Bayesian setting and maintain the PILCO notion of uncertainty, the gain scheduling function was defined as a set of GPs parametrized by the training targets with uniformly distributed training inputs. Gains of the multivariate PID controller are given as: K ∼ N(µK,ΣK ) ∈ R 3FE, (4.1) where F is the dimensionality of a control signal and E is the dimensionality of a control error. Each element E[Ka] of gains mean vector µK and var[Ka] of covariance matrix ΣK for a = 1, ..., 3FE are defined according to (2.7) and (2.8) as: E[Ka] = κ(Γ, xs) T (� + σ2 w I ) −1ya, (4.2) var[Ka] = κ(xs, xs) − κ(Γ, xs) T (� + σ2 w I ) −1κ(Γ, xs). (4.3) In the above equation, xs ∈ R S is a vector of scheduling variables and Γ ∈ RN×S is a matrix of uniformly distributed training inputs with N being a number of inputs and S - dimensionality of scheduling vector. Elements of the matrix� are given by κ(γi, γ j), where κ(·, ·) is a prior covariance and γi, γ j ⊂ Γ. ya ∈ RN is a training target vector for GP outputting Ka. Vectors ya, concatenated into matrix Y ∈ RN×3FE , form the parameters θ of a policy π(θ), which will be optimized by PILCO. Hyper-parameters such as k(·, ·) and Γ are shared among all GPs. They are not a part of θ, i.e. are not optimized by the algorithm. The gradient of GP with respect to its prior covariance is more unstable than with respect to targets, thus the convergence of a gradient based method can be compromised. That is why hyper-parameters are determined independently for every application using the experiments. 33 4 Gain Scheduling using PILCO Joint distribution p( z̃, K ), where z̃ is the extended state from Section 3.3 is completely defined [14]. Nevertheless, the control signal u under such definition is a product of two correlated multidimensional random variables p(K |xs) and p(ẽ |zt, xt, xtar,t ), that is concatenation of all errors vectors. To integrate the defined above scheduling function into the PILCO framework, the control mean E[u] and variance var[u] have to be computed (approximated). Additionally, cross-covariance term cov[u, z̃] have to be derived for the approximation of joint distribution p( z̃, u), where z̃ is already concatenated with gains. During the work on this thesis two cases were considered: K being dependent on and independent of ẽ. It is obvious that for independent case neglects the cross-covariance information between K and ẽ, which can lead to deterioration of optimization results. However, for the fully dependent case rough approximations for var[u] are used, which can also turn into a problem. Rest of the chapter contains all necessary derivations for both cases. 4.2 Dependent Case The multivariate PID controller’s equation (2.4) from the Background section looks as following: ut = Kpet + Ki ∫ t 0 et + Kd Ûet . (4.4) Since the above equation is a linear combination of three terms, in order to avoid redundancy, derivations only for the proportional up part are provided. Two others can be derived via the same computations. Furthermore, for the clarity of derivations, an input is assumed to be one dimensional. Extension to multidimensional case is trivial (separate application to every dimension). 4.2.1 Expected Value Expected value of proportional term can be decomposed as: E[up] = E[KT p e] = E[ ∑ i K piei] = ∑ i E[K piei]. To the above equation total covariance low is applied: cov[X,Y ] = E[XY ] − E[X]E[Y ], (4.5) which results in: E[up] = ∑ i (cov[K pi, ei] + µkiµei) = µTk µe + ∑ i cov[K pi, ei]. Covariances and mean vectors, required for the computation of above equation, can be obtained from the known joint distribution of the extended state p( z̃). 34 4.2 Dependent Case 4.2.2 Variance In order to compute the variance, as previously, up is first rewritten as a sum of products: var[up] = var[ ∑ i K piei]. Consider system state first being only two dimensional (K, e ∈ R2). Here onwards K stands for K p to avoid indices redundancy: var[K0e0 + K1e1] = var[K0e0] + var[K1e1] + 2 cov[K0e0,K1e1], which can be generalized to: var[ ∑ i Kiei] = ∑ i var[Kiei] + n−1∑ i n∑ j=i+1 2cov[Kiei,Kjej]. (4.6) First, blue term of the (4.6) is derived. Application of variance definition results in: var[Kiei] = E[(Kiei)2] − E[Kiei]2 = E[K2 i e2 i ] − E[Kiei]2. (4.7) The second term of right side of above equation can be computed via the application of total covariance low (4.5) as following: E[Kiei]2 = (cov[Ki, ei] + E[Ki]E[ei])2 = (cov[Ki, ei] + µkiµei)2, where all required information is contained in the joint distribution of the extended state p( z̃). However, the E[K2 i e2 i ] term of (4.7) turned out to be more difficult to compute. Application of the low of iterated expectations to the E[K2 i e2 i ] results in: E[K2 i e2 i ] = E[E[K2 i e2 i |Ki]] = E[K2 i E[e2 i |Ki]], (4.8) where Ki and ei are jointly normally distributed random variables with correlation coefficient ρ. Then, the conditional distribution of ei on Ki is also normal with mean and variance defined as following: E[ei |Ki] = E[ei] + ρ √ var[ei] var[Ki] (Ki − E[Ki]), var[ei |Ki] = var[ei](1 − ρ2). (4.9) Now, the expected value of error squared conditioned on gain E[e2 i |Ki] is obtained via definition of conditioned variance: E[e2 i |Ki] = var[ei |Ki] − E[ei |Ki] 2. 35 4 Gain Scheduling using PILCO Substitution of the inner term of the right most expectation in (4.8): K2 i E[e2 i |Ki] = K2 i [ var[ei |Ki] + E[ei |Ki] 2 ] . Then the above equation is rewritten with the definitions from (4.9): K2 i E[e2 i |Ki] = K2 i [ var[ei](1 − ρ2) + ( E[ei] + ρ √ var[ei] var[Ki] ( Ki − E[Ki] ))2 ] . The above expectation is a quartic function of Ki, named g(Ki). Insertion of g(Ki) back into (4.8) results in: E[K2 i e2 i ] = E[g(Ki)]. At this point a first approximation is introduced. A performance of the algorithm will probably depend on a quality of this approximation, so, as a future work, higher order approximations for this term can be derived. For now E[g(Ki)] can be approximated using the Taylor Series, first two derivatives of g(Ki) and first two moments of Ki as following: E[g(Ki)] = E[g(µki) + Ûg(µki)(Ki − µki) + 1 2 Üg(µki)(Ki − µki) 2] = g(µki) + 1 2 Üg(µki)σ 2 x . Below first two derivatives of g(Ki) are computed. The ρ √ var[ei ] var[Ki ] term is substituted with A for readability reasons: Ûg(Ki) = 2Ki ( var[ei](1 − ρ2) + ( E[ei] + A ( Ki − µki ) )2 ) + K2 i ( 2 ( E[ei] + A ( Ki − µki ) ) A ) . Üg(Ki) = 2 ( var[ei](1 − ρ2) + ( E[ei] + A ( Ki − µki ) )2 ) + 4Ki ( 2 ( E[ei] + A ( Ki − µki ) ) A ) + K2 i A2. All information, reacquired for computation of above equations and E[K2 i e2 i ] can be obtained from joint distribution of extended state p( z̃). Now all components of var[Kiei] in (4.6) are defined. In order to compute cov[Kiei,Kjej] from (4.6) we are applying the total covariance low (4.5): cov[Kiei,Kjej] = E[KieiKjej] − E[Kiei]E[Kjej]. (4.10) Second part E[Kiei]E[Kjej] of the right side of the above equation is completely defined. Consec- utive application of (4.5) to it results in terms contained at p( z̃). However, E[KieiKjej] revealed to be more difficult. The above term was derived with loss of information, which is introducing 36 4.3 Independent Case second source of possible deterioration in results and is an aim for improvement in further work. For now, Ki is assumed to be independent of ej and Kj of ei , which permits to rearranging the inner multiplication terms of expectation and rewrite it as E[KiKj]E[eiej]. Then substitute it back to (4.10) and apply (4.5) to the right part. Rearrangement of terms results in: cov[Kiei,Kjej] = E[KiKj]E[eiej] − (cov[Ki, ei] + µkiµei)(cov[Kj, ej] + µk j µej) = (cov[Ki,Kj] + µkiµk j)(cov[ei, ej] + µeiµej) − (cov[Ki, ei] + µkiµei)(cov[Kj, ej] + µk j µej) = cov[Ki,Kj] cov[ei, ej] + cov[Ki,Kj]µeiµej + cov[ei, ej]µkiµk j − cov[Ki, ei] cov[Kj, ej] − cov[Ki, ei]µk j µej − cov[Kj, ej]µkiµei . The above equation concludes the derivation of all terms, required to compute the variance (4.6), even though currently we are neglecting cov[Ki, ej] cov[Kj, ei] etc. 4.3 Independent Case As it was mentioned above, assuming the independence of K and ẽ results in loss of valuable information. Nevertheless, for the independent case computations are significantly simpler than for the dependent one, which allows to avoid approximations 4.3.1 Expected Value Since cov[Kpi, ei] is equal to 0, the expected value becomes: E[up] = ∑ i (cov[Kpi, ei] + µkiµei) = µTk µe . 4.3.2 Variance Instead of approximating E[K2 i e2 i ] in (4.7), the covariance low (4.5) is applied repeatedly: E[K2 i e2 i ] = cov[K2 i , e 2 i ] − E[K2 i ]E[e 2 i ] = cov[K2 i , e 2 i ] − (var[Ki] + E[Ki] 2)(var[ei] + E[ei]2). Substitution of the above equations to (4.7) results in: var[Kiei] = cov[K2 i , e 2 i ] − (var[Ki] + E[Ki] 2)(var[ei] + E[ei]2) − (cov[Kiei] + µkiµei)2. 37 4 Gain Scheduling using PILCO Now cov[K2 i , e 2 i ] and cov[Ki, ei] can be replaced with 0. Further simplification leads to: var[Kiei] = var[Ki] var[ei] + var[Ki]µ 2 ei + var[ei]µ2 ki . And cov[Kiei,Kjej] from (4.6) simplifies to: cov[Kiei,Kjej] = cov[Ki,Kj] cov[ei, ej] + cov[Ki,Kj]µeiµej + cov[ei, ej]µkiµk j 4.4 Cross Covariance Cross covariance term resembles one given at [14]. However, the computation becomes significantly more complex for the covariance between the u and the entire extended state. z̃ stands for the augmented state, which includes all information: cov[ z̃, u] = E[ z̃u] − µz̃µ T u, (4.11) with E[ z̃u] being a vector of [E[ z̃u0],E[ z̃u1], ..,E[ z̃ua], ..]. Using the low of iterated expectations and given that all GPs of scheduling function share hyper-parameters similarly to (3.13): E[ z̃ua] = E[ z̃ E[ua | z̃]] = ∫ z̃ ( µkp a e + µki a ∫ t 0 e + µkd a Ûe ) p( z̃)d z̃ (4.12) = ∫ z̃ ∑ i κ(xs, γi) ( βkp ai e + βki ai ∫ t 0 e + βkd ai Ûe ) p( z̃)d z̃, where xs is a vector of scheduling variables, γi is a training input and κ(·, ·) is a prior covariance of the scheduling function described at Section 4.1. βkp a is obtained from (4.2) as: βkp a = (� + σ 2 w I ) −1ya, for GP outputting the a dimension of proportional gain vector K p etc. Now the linear transformation matrices are introduced: S returns the scheduling variable xs from the extended state z̃ P returns current error I returns sum of errors and D returns error derivative. 38 4.4 Cross Covariance The insertion of the above matrices into (4.12) results in: E[ z̃ua] = ∫ z̃ ∑ i k(S z̃, γi) ( β kp ai (Pa z̃) + β ki ai(Ia z̃) + β kd ai (Da z̃) ) p( z̃)d z̃ = ∫ z̃ ∑ i k(S z̃, γi)β kp ai (Pa z̃)p( z̃)d z̃ + ∫ z̃ ∑ i k(S z̃, γi)β ki ai(Ia z̃)p( z̃)d z̃ + ∫ z̃ ∑ i k(S z̃, γi)β kd ai (Da z̃)p( z̃)d z̃. Then only the first summation term is considered. After pulling out the independent variables and change the order of summation and integration: E[ z̃ua] = ∑ i β kp ai ∫ z̃k(S z̃, γi)(Pa z̃)p( z̃)d z̃. Next, k(S z̃, γi), in a blue term above, is transformed to a Normal distribution, using the multiplication with constant c1 as in (3.14). (Pa z̃) is also normal according to (A.1):∫ z̃k(S z̃, γi)(Pa z̃)p( z̃)d z̃ = ∫ z̃c1N(S z̃ |γi,Λ)N(Pa z̃ |µp,Σp)N( z̃ |µz̃,Σz̃)d z̃. Now, as in [14], Gaussians product rule (A.2) can be applied. However, for that N(S z̃ |γi,Λ) has to be transformed to N( z̃ |., .) and similarly N(Pa z̃ |µp,Σp). Using the (A.1): S z̃ ∼ N(γi,Λ), z̃ ∼ N(S†γi, S †ΛS†T ), where S† is a pseudo inverse of S. At this point it was decided to stop computation of cross- covariance between the action and entire extended state, since there was no confidence that it will lead to the right result and not just over-complication. Instead the cross-covariance term only for scheduling vector xs of extended state z̃ was calculated: cov[xs, u] = E[xsu] − µxsµ T u . All errors terms are assumed to be given as the constants and independent from xs. As previously, the low of iterated expectations is applied to every element E[xsua] of expectation of product in above equation: E[xsua] = ∫ xs ∑ i k(xs, γi) ( βkp ai e + βki ai ∫ t 0 e + βkd ai Ûe ) p(xs)dxs. 39 4 Gain Scheduling using PILCO Dependent Approximation Independent Approximation Experiment µ Error σ Error µ Error σ Error 1 0.214 1.727 0.211 0.531 2 0.061 0.413 0.069 1.342 3 0.005 0.021 0.005 9.506 Table 4.1: The moments errors for the subset of numerical test experiments. Next, by pulling out the independent terms and switching the order of summation and integration: E[xsua] = ∑ i ( βkp ai e + βki ai ∫ t 0 e + βkd ai Ûe ) ( ∫ xsk(xs, γi)p(xs)dxs ) = ∑ i βkp ai e ( ∫ xsk(xs, γi)p(xs)dxs ) + ...+ = e ∑ i βkp ai ( ∫ xsk(xs, γi)p(xs)dxs ) + ... + . In above equation the red term is E[xsK a p] defined in (3.16). Furthermore, for the implementation simplicity it can be obtained from the covariance cov[xs, K a p], which is contained at the joint distribution of extended state p( z̃) using (3.17). Now the final equation for cross-covariance computation look like: cov[xs, ua] = cov[xsK a p]e + µxsµ T Ka p + cov[xsK a i ] ∫ t 0 e + µxsµ T Ka i + cov[xsK a d]Ûe + µxsµ T Ka d + µxsµua . Thus, all equations, required for integration of our scheduling function, defined at Section 4.1, into the PILCO framework are derived. 4.5 Numerical Test In order to validate the derivations provided in Section 4.2 and Section 4.4, several numerical test experiments were conducted. Subset of test experiments is presented at Figure 4.1. Left side of the figure demonstrates initial correlated Gaussian distributions (green and orange) as well as real distribution of a product (gray dashed) according to [13]. Plots on the right present the approximations of product with Normal distributions using the equation for dependent (blue) and independent (red) cases derived above. Additionally, for every experiment, we sampled the exact product distribution [13] and calculated its mean and variance. Errors between the first two moments of approximated distributions and exact one for experiments at Figure 4.1 can be found in Table 4.1. As one can infer, mean error 40 4.5 Numerical Test 6 4 2 0 2 4 6 x 0.0 0.2 0.4 0.6 0.8 1.0 p( x) Initial Distributions Product Normal_1 Normal_2 6 4 2 0 2 4 6 x Approximated Product Product Approx_Normal_Indep Approx_Normal_Dep 6 4 2 0 2 4 6 x 0.0 0.2 0.4 0.6 0.8 1.0 p( x) Initial Distributions Product Normal_1 Normal_2 6 4 2 0 2 4 6 x Approximated Product Product Approx_Normal_Indep Approx_Normal_Dep 6 4 2 0 2 4 6 x 0.0 0.2 0.4 0.6 0.8 1.0 p( x) Initial Distributions Product Normal_1 Normal_2 6 4 2 0 2 4 6 x Approximated Product Product Approx_Normal_Indep Approx_Normal_Dep Figure 4.1: Subset of numerical test experiments 41 4 Gain Scheduling using PILCO does not heavily depend on way of approximation, at least for the considered experiments. Whereas the variance changes a lot for dependent and independent cases. For variance there is no clear superiority of one approximation comparing to another - for some experiments dependent variance is outperforming independent, for others - vice versa. Decision on which approximation to use should be made separately for every application based on the experiments which permit to measure the quality of t-step-ahead marginal state distribution. According to the demonstration, the provided derivations result in reasonable approximation. How- ever, considering that during the PILCO iteration input distributions will be already approximated using the Moment Matching, it is rather difficult to estimate the quality until the tests of entire algorithm on real (testing) system. 42 5 Testing Systems In this chapter the developed algorithm is applied to the testing systems. First, PILCO is used to tune the gains of a PID controller applied to the noisy Cart-Pole, which is the modification of OpenAI Gym Cart-Pole environment. Then, the modified PILCO will optimize the scheduled PID, which is controlling the non-stationary Mass-Damper system implemented for testing purposes. 5.1 Noisy Cart-Pole In order to test the algorithm developed by Doerr et al. [16] and create a baseline for further work, PILCO was exploited to find the optimal gains of PID controlling the noisy Cart-Pole system. For that noisy modification of OpenAI Cart-Pole environment was developed. The standard OpenAI Gym Cart-Pole environment neglects a friction coefficient of a cart and a mass moment of inertia of the pole. The state-space representation of a system was recovered from the source code and the additive noise for the observations was included: Û̃x =  Ûx Üx Ûφ Üφ  =  0 1 0 0 0 0 mlg 0 0 0 0 1 0 0 − g(m+M) − 4 3 l(m+M)+lm 0   x Ûx φ Ûφ  +  0 4 3 l(m+M) (m+M) 0 − 1 4 3 l(m+M)+lm  u, ỹ = [ x φ ] = [ 1 0 0 0 0 0 1 0 ]  x Ûx φ Ûφ  + [ 0 0 ] u + σy . In the above equation: x stands for a cart’s position θ is a pendulum angle from vertical u is a force applied to the cart. System parameters and their values used during the experiments are given in Table 5.1. Additionally, the PID controller was implemented to follow the desired trajectory of the cart position and the pole angle. The recovered state-space representation was discretized using Matlab 2016b. Then a Linear Quadratic Regulator (LQR) was computed as a benchmark for PILCO tuning. We put additional threshold to the pole angle in the modified environment. With the pole angle bounded at +/− π 6 the Cart-Pole dynamics is approximately linear. Target trajectory was set to zero for both, position and angle. 43 5 Testing Systems Parameter Value Unit m mass of pole 0.1 kg M mass of cart 1.0 kg l length to the pole center of mass 0.5 m dt system’s sampling time 0.02 s σy spring characteristic ∼ U(1e−2, 1e−3) m, rad Table 5.1: Noisy Cart-Pole parameters. 0.0 0.2 0.4 0.6 0.8 1.0 Time t, dt = 0.020000, H = 54 1.00 0.75 0.50 Ou tp ut Exp.: 1 of 2 Exp., Out 0 Pred., Out 0 0.0 0.2 0.4 0.6 0.8 1.0 Time t, dt = 0.020000, H = 54 0 1 Ou tp ut Exp., Out 1 Pred., Out 1 0.0 0.2 0.4 0.6 0.8 1.0 Time t, dt = 0.020000, H = 54 10 0 10 In pu t Exp., In 0 Pred., In 0 (a) Test experiment 1 0.0 0.1 0.2 0.3 0.4 0.5 Time t, dt = 0.020000, H = 28 0.40 0.45 Ou tp ut Exp.: 2 of 2 Exp., Out 0 Pred., Out 0 0.0 0.1 0.2 0.3 0.4 0.5 Time t, dt = 0.020000, H = 28 0.1 0.0 0.1 Ou tp ut Exp., Out 1 Pred., Out 1 0.0 0.1 0.2 0.3 0.4 0.5 Time t, dt = 0.020000, H = 28 10 0 10 In pu t Exp., In 0 Pred., In 0 (b) Test experiment 2 Figure 5.1: Tests of Noisy Cart-Pole dynamics model As a first step of PILCO, the initial dynamics model have to be learned (Algorithm 3.1). For that a sparse GP with 150 inducing points was trained on two experiments with the random policy (0.84 and 0.62 seconds) and one experiment with suboptimal, manually tuned, gains (2.0 seconds) in order to collect additional valuable information around equilibrium point. NARX state included 2 states of the history for position, angle and control in order to learn the derivatives. Three experiments with the random policy were used for testing. The simulated feed-forward rollouts with initial dynamics model for two of them are presented in Figure 5.1. Both plots include the blue line which stands for the data collected from the real rollout with the given policy, the orange line, which presents the mean of t-step-ahead marginal distribution obtained from the model, and the shaded region which demonstrates the variance of distribution (model uncertainty). This description holds for all plots which demonstrate the comparison of simulated and real rollouts from now onwards. Output 0 and 1 are cart position and pole angle respectively, and Input is a control signal. Since the results were obtained from the feed-forward simulation, Input does not have a variance. As one can see, an initial GP rather accurately models the dynamics of a system but has a relatively big posterior variance. With initial dynamics model, PILCO was applied to tune the gains of a PID controller. The concatenated optimization progress of four PILCO iterations is presented in Figure 5.2. The Broyden — Fletcher — Goldfarb — Shanno (BFGS) was used for optimization. BFGS iterations from 0 to 32 corresponds to the first PILCO iteration. Due to the good precision of the posterior mean of our initial model, the cost obtained from the simulated rollout (orange dashed line) reassembles the cost from the system rollout (blue dashed). The persistent margin between two lines corresponds 44 5.1 Noisy Cart-Pole 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 BFGS Iteration 20 40 60 80 100 C o st Optimization Progress LQR cost Cost from Model Cost from Simulation Figure 5.2: Concatenated optimization progress for the Noisy Cart-Pole PID. 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time t, dt = 0.020000, H = 100 0.00 0.05 0.10 Ou tp ut Exp.: 1 of 1 Exp., Out 0 Pred., Out 0 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time t, dt = 0.020000, H = 100 0.00 0.05 Ou tp ut Exp., Out 1 Pred., Out 1 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time t, dt = 0.020000, H = 100 0 10 In pu t Exp., In 0 Pred., In 0 (a) Simulated feed-back rollout for optimized PID 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time t, dt = 0.020000, H = 100 0.00 0.05 0.10 Ou tp ut Exp.: 1 of 1 Exp., Out 0 LQR., Out 0 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time t, dt = 0.020000, H = 100 0.00 0.05 Ou tp ut Exp., Out 1 LQR., Out 1 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Time t, dt = 0.020000, H = 100 0 10 In pu t Exp., In 0 LQR., In 0 (b) The optimized PID versus LQR Figure 5.3: The optimized PID for the Noisy Cart-Pole. to the variance pf the initial model. According to Algorithm 3.1, at the end of the first iteration the model is retrained using a new data recorder with the optimized policy (one experiment, 1.42 seconds). However, since the number of inducing points of GP was not increased, retraining results in a degradation of the model precision. This effect can be observed during the second iteration (BFGS steps 33 − 63). Despite this fact, at the end of second iteration, the optimized policy was able to outperform the LQR (green line). Further concatenation of data and retraining of the model improved the precision and decreased the variance (BFGS steps 64 − 120). The results of 4 iterations of PILCO are demonstrated in Figure 5.3. Figure 5.3a presents the simulated rollout for the optimized policy. As one can see, the resulting GP precisely models the dynamics and has relatively small variance. Figure 5.3b compares PID with the optimized gains (blue line) and LQR (green line). Since the cost function was set in a way to prioritize the pole angle error minimization, which leads to the faster system stabilization, LQR results in better tracking of the cart position (Output 0). PID optimized by PILCO, is clearly outperforming the LQR in second Output (pole angle), thus is more optimal with respect to the specified cost. The conducted experiments prove that PILCO modification [16] is able to optimally tune the gains of PID with being data-efficient (9.46 seconds of system interaction was used in total to obtain the results from Figure 5.3). In the following section experimental setup used for testing of the scheduled PID tuning with PILCO (Chapter 4) is described and obtained results are analyzed. 45 5 Testing Systems Parameter Value Unit m mass 1.0 kg D damping coefficient 1.5 Ns/m k spring characteristic ∼ Figure 5.4a - dt system’s sampling time 0.05 s Table 5.2: Non-Stationary Mass-Damper parameters. 5.2 Non-stationary Mass-Damper In order to test the derived integration of gain scheduling into PILCO the non-stationary Mass- Damper system was developed. The spring characteristic (k) is changing according to the sigmoid function with respect to the position of a center of mass (Figure 5.4a). Twenty continuous Mass-Damper systems with the following state-space representation were calculated: Û̃x = [ Ûx Üx ] = [ 0 1 − k m −D m ] [ x Ûx ] + [ 0 1 m ] u, ỹ = [ x Ûx ] = [ 1 0 0 1 ] [ x Ûx ] + [ 0 0 ] u, where: x stands for the position of a center of mass u is a force applied to the a center of mass. In the above differential equation, the parameters were set according to Table 5.2. Then, using Matlab 2016b these systems were converted to the discrete form with sampling time of 0.05 seconds. For the final non-stationary system − k m entry is linearly interpolated from discrete forms based on x. Next, the PID controller for tracking the desired x was implemented. In order to define the controller output boundaries and feasible desired trajectory, the excitation signal (Figure 5.4b) was applied. The system’s response (Figure 5.4c) demonstrates that with control bounded to +/−15, it is possible to follow the desired trajectory with the boundaries +/−0.5m. Additionally LQR was computed in Matlab 2016b for the system with k equal to 15 (dynamics for negative x) for the benchmarking. Figure 5.4d presents the trajectory obtained with LQR (blue) versus the desired trajectory (orange). For the LQR trajectory one can observe persistent steady-state error in both negative and positive positions. Nonetheless, the error is lower for the negative x. The difference in steady-state errors and overshooting in positive and zero x shows that controller was tuned for dynamics in negative positions. Then PILCO was applied to tune the PID gains. According to Algorithm 3.1, the initial probabilistic dynamics model have to be learned first. Sparse GPs with 100 inducing points and the NARX state with the history length of 4 for both, state and action were trained. The hyper-parameters were optimized using marginal likelihood. For the train experiments we use three rollouts with gains sampled uniformly from −10 to 10. The test experiments use similar rollouts. 46 5.2 Non-stationary Mass-Damper 1.00 0.75 0.50 0.25 0.00 0.25 0.50 0.75 x 15 20 25 30 35 40 45 K K-values Sigmoid (a) The change of k-value based on x 0 1 2 3 4 5 t [0.05] 0 2 4 6 8 10 12 14 u Input Vector: u_vec (b) The excitation input signal 0 1 2 3 4 5 t [0.05] 0.8 0.6 0.4 0.2 0.0 0.2 0.4 0.6 0.8 x System Response for u_vec (c) The excitation system’s response 0 1 2 3 4 5 t [0.05] 0.4 0.2 0.0 0.2 0.4 x LQR Experiment System response Target trajectory (d) LQR controller Figure 5.4: Non-stationary Mass-Damper properties. The results of test experiments are demonstrated in Figure 5.5. The Output stands for the x trajectory and the Input for a control command. Due to the simplicity of the system and enough NARX history GPs were able to learn the dynamics from three train experiments (5 seconds each). The PID controller obtained after the first PILCO iteration is demonstrated in Figure 5.6. Figure 5.6a presents the simulated feedback rollout for the optimized controller. The predicted trajectory of x has a low variance, but deviates from the real one in the areas where the set point has a steep change. The reason is that the dynamics model predicts the non-truncated control commands and has no information about the boundaries of a controller output. This issue can be resolved via the concatenation of a new training data which contains a non-truncated control at the end of PILCO iteration. Despite this deviation, optimized controller outperforms LQR with respect to the specified cost function already after the first PILCO iteration (Figure 5.6d). Therefore, it was decided not to continue with a further iterations but proceed with a tuning of scheduled PID. Figure 5.6b demonstrates the comparison of the rollouts with optimized PID and LQR. Figure 5.6c plots the optimized PID rollout and the reference trajectory. Next, the modified PILCO was applied to tuning of a gain scheduled PID which controls the Non-Stationary Mass-Damper system. As it was described in Section 4.1, the gain scheduling function is defined as a GPs with a uniformly distributed training inputs parametrized by the training 47 5 Testing Systems 0 1 2 3 4 5 Time t, dt = 0.050000, H = 100 1.25 1.00 0.75 0.50 0.25 0.00 Ou tp ut Exp.: 1 of 2 Exp., Out 0 Pred., Out 0 0 1 2 3 4 5 Time t, dt = 0.050000, H = 100 10 0 10 In pu t Exp., In 0 Pred., In 0 (a) Test experiment 1 0 1 2 3 4 5 Time t, dt = 0.050000, H = 100 3 2 1 0 1 2 Ou tp ut Exp.: 2 of 2 Exp., Out 0 Pred., Out 0 0 1 2 3 4 5 Time t, dt = 0.050000, H = 100 10 0 10 In pu t Exp., In 0 Pred., In 0 (b) Test experiment 2 Figure 5.5: Tests of Non-stationary Mass-Damper dynamics model. 0 1 2 3 4 5 Time t, dt = 0.050000, H = 100 0.5 0.0 0.5 Ou tp ut Exp.: 1 of 1 Exp., Out 0 Pred., Out 0 0 1 2 3 4 5 Time t, dt = 0.050000, H = 100 50 0 50 In pu t Exp., In 0 Pred., In 0 (a) The optimized PID after first iteration. 0 1 2 3 4 5 Time t, dt = 0.050000, H = 100 0.4 0.2 0.0 0.2 0.4 Ou tp ut Exp.: 1 of 1 Exp., Out 0 LQR., Out 0 0 1 2 3 4 5 Time t, dt = 0.050000, H = 100 10 0 10 In pu t Exp., In 0 LQR., In 0 (b) The optimized PID comparing to LQR. 0 1 2 3 4 5 t [0.05] 0.6 0.4 0.2 0.0 0.2 0.4 X Optimized Policy System response Target trajectory (c) The optimized PID comparing to target. 0 2 4 6 8 10 12 14 16 BFGS Iteration 30 35 40 45 50 55 Co st Optimization Progress LQR cost Cost from Model Cost from Simulation (d) The optimization progress. Figure 5.6: The optimized nonscheduled PID for the Non-stationary Mass-Damper. 48 5.2 Non-stationary Mass-Damper 0 1 2 3 4 5 Time t, dt = 0.050000, H = 100 0.50 0.25 0.00 0.25 0.50 Ou tp ut Exp.: 1 of 1 Exp., Out 0 Pred., Out 0 0 1 2 3 4 5 Time t, dt = 0.050000, H = 100 40 20 0 20 40 In pu t Exp., In 0 Pred., In 0 (a) Test experiment 1 0 1 2 3 4 5 Time t, dt = 0.050000, H = 100 0.6 0.4 0.2 0.0 0.2 0.4 Ou tp ut Exp.: 1 of 1 Exp., Out 0 Pred., Out 0 0 1 2 3 4 5 Time t, dt = 0.050000, H = 100 40 20 0 20 40 In pu t Exp., In 0 Pred., In 0 (b) Test experiment 2 Figure 5.7: The optimized scheduled PID for the Non-Stationary Mass-Damper. targets. Since the system dynamics depends on x, it was taken as a scheduling variable. The taring inputs were placed at −0.5, 0, 0.5. It was assumed that such inputs setup permits the optimized scheduling function to approximate the change of k and adjust the gains accordingly. The scheduled controller optimized by 3 iterations of the modified PILCO is demonstrated in Figure 5.7a. The resulting controller has a reasonable performance with respect to the model prediction, however, the real rollout results in an oscillating trajectory. The observed behavior is caused by a steep scheduling function, which results in an extreme change of gains for the small variations in x. The order of magnitude is equal to 1 for the training targets and to −1 for x. Additionally the optimization procedure tends to place the targets in a relatively big range, e.g. from 20 to 80. The combination of this two points leads to the scheduling function with an extreme slope. Therefore during the feed-back simulation for the predicted x with a relatively small deviation from the real trajectory the gains are calculated with a strong bias. In order to address this problem it was decided to put the additional cost on the training targets to prevent the learning of an extreme values. However, the adjustment of a cost coefficients of the binary saturation loss function (Section 3.2) revealed to be a rather complex problem. The small cost coefficients which punish the extreme values result in inability of auto-derivation software to compute the gradients. The coefficients were set to the smallest values that does not compromise a gradients calculation. The result of a scheduled PID learning with the additional cost for the predicted gains are presented in Figure 5.7b. The difference between the predicted and simulated x trajectories is not as significant as in Figure 5.7a. However, the second plot of the figure shows that the predicted control signal still deviates from the real one. Nevertheless, the introduction of the additional cost permits to obtain the controller which outperform both, LQR and a nonscheduled PID. The concatenated optimization progress of 4 modified PILCO iterations is demonstrated in Figure 5.8. As one can see, due to the problem of the steep change in a gains the cost obtained from a model simulation and the cost of a real rollout does not converge to the perfect match. Nonetheless, after the 110 BFGS iterations (4 PILCO iterations) the optimized scheduled controller outperforms the LQR and a plain PID. The costs of LQR and simple PID rollouts are different from Figure 5.6d due to the additional cost terms. 49 5 Testing Systems 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 BFGS Iteration 70 75 80 85 90 C o st LQR cost Nonscheduled PID cost Cost from Model Cost from Simulation Optimization Progress Figure 5.8: Concatenated optimization process for the scheduled PID. Considering the demonstrated results and assuming that described problem of an extreme gains change is application specific, it was decided to apply the developed method to the real control system for further testing and evaluation. 50 6 Application to Autonomous RC Car In order to prove the applicability of the designed algorithm to the real physical systems, it was exploited to tune the scheduled PID controllers of the autonomous RC car, which was developed as part of master thesis of Alexander Wischnewski [35]. This chapter presents the description of a system with all modifications which were made during the method integration, as well as the approaches to the modeling of system dynamics. 6.1 System Description 6.1.1 Physical Prototype The physical prototype of the autonomous system is based on a commercial RC car of scale 1:6 to a real vehicle. This car is controlled by the steering and throttle input commands. Four-wheel system, which transmits the traction forces, is powered by a Brushed DC Motor. The front-wheel steering is performed by two separate but mechanically linked servo-motors. The frame is mounted on a single-wheel suspension with a spring-damper combination. Several parameters such as camber, toe and spring-preload are available for tuning. The autonomous driving functionality on a physical prototype is implemented using the Pixhawk PX4 micro-controller, however this choice is not essential. The Pixhawk PX4 directly provides the measurements of 3D-accelerometer, 3D-magnetometer, 3D-gyroscope and GPS. Additionally, angular positions of the steering servo-motors are obtained from built-in rotary potentiometers and the Pixhawk’s AD-converter interface. The sensors have a different sample rates which are lower than the minimal limit set for runtime of control software. The oversampling approach was implemented. The Sensors output the raw physical quantities and have to be calibrated, however this part is not the scope of this thesis. Further details about the physical prototype [35] are irrelevant for this thesis. Taking the results from Wischnewski as a proof by demonstration, it was concluded that the system can be optimally controlled by the manually tuned PID scheduled on the vehicle velocity, thus is appropriate for application of our method. 6.1.2 Simulink Model The controller for autonomous driving was developed in Simulink, using the precise RC car model provided by Bosch. Additionally, a software package which configures MATLAB Coder to work with the Pixhawk PX4 and provides an actuator and SD-card interfaces were installed. The resulting Simulink model was provided as well as the physical prototype and includes the following functional modules: 51 6 Application to Autonomous RC Car C D T CR M ui t u uo Control Scheme RC Model TL eO (IM) y Sensors Simula�on xy - - - Figure 6.1: Simplified control scheme of the autonomous RC car. 1. High-level state-flow control 2. The Sensor and actuator simulators 3. The control mechanism for autonomous driving 4. Precise RC car model. 5. Data logging and post-processing routines 6. Controller parametrization via SD-card interface. The following sections are focused on the control mechanism for autonomous driving module, other modules [35] are not in the scope of this thesis. 6.2 Control Scheme Analysis As a part the master thesis Wischnewski [35] introduced the control mechanism for the RC car, which is able to follow the predefined discrete target trajectory. The control mechanism includes two parallel controllers, which can be switched during the simulation and execution. The controllers are nonlinear IMC controller, and multivariate scheduled PID. IMC was developed as a primary controller and PID was used for the benchmarking. However, since the nonlinear IMC is out of scope for this work, only the multivariate PID is considered. A simplified control scheme is presented in Figure 6.1. The block M stands for the precise RC car model, which has to be controlled. Inputs ūo of this model are throttle command ut in percents +/−100, and steering command us in radians +/−0.2 (Table 6.1). The model output state x with the information provided in Table 6.1, is transmitted through the outer loop and enters the Sensors Simulation. x is post-processed with Sensors Simulators (noise and delays are added) in order to account for the real hardware behavior. The resulting state observation y obtained from the sensors is transmitted to the State Observer(O). All sensor measurements 52 6.2 Control Scheme Analysis State Information Range Unit u ulat Lateral PID output - rad/s ulon Longitudinal PID output - m/s2 ūo us Steering angle +/−0.2 rad ut Throttle +/−100 perc ūi Ûθ∗ Requested yaw-rate - rad/s a∗ Requested acceleration - m/s2 γ Estimated side-slip angle - rad x x, y Vehicle position - m θ Vehicle orientation - rad v Vehicle velocity - m/s Ûθ Vehicle yaw rate - rad/s a Vehicle acceleration - m/s2 ȳ xk, yk Filtered position - m θk Filtered orientation +/−3.14 rad vk Filtered velocity - m/s Ûθk Filtered yaw-rate - rad/s γk Filtered side-slip angle - rad t s Path along the trajectory - m xtr, ytr Position in local coordinates - m k Curvature - rad/s vtr Target velocity - m/s θtr Target orientation - rad Ûθtr Target yaw-rate - rad/s e elat Lateral error - m elon Longitudinal error - m Table 6.1: The data transmitted within the control scheme of the autonomous RC car. have different delays and sampling times. Therefore, the State Observer (O in Figure 6.1) has a convoluted functionality of a sensors fusion which includes an Unscented Kalman Filter (UKF) and the internal process model. UKF outputs the state estimation based on the information measured by sensors. Nonetheless, the complete information of a system state required for UKF to compute an estimation is available with a delay of 3t where t is a system sampling time equal to 0.02. That is why the State Observer does also include the internal simplified model of RC car dynamics. The internal model is used to correct the chain of UKF observations made with an outdated sensor measurements. The internal model inputs ūi are Requested yaw-rate Ûθ∗, Requested acceleration a∗, and Estimated side-slip angle γ. They are transmitted via the inner loop (Figure 6.1) from the Control Router (CR) and delayed for three time steps using the delay stack (D). The internal model then predicts, using the new sensed information related to time-step (t-3) and all control outputs which were commanded since that 53 6 Application to Autonomous RC Car Filtered State Filtered State with all information (t-3) Closest Trajectory State (t-3) Target Trajectory State (t-3) Target Trajectory State Lateral Error Longitudinal Error Filtered Orientation (t-3) Filtered Yaw Rate (t-3) Estimated Side Slip Angle (t-3) Closets Trajectory State x , y n θ . θ v γ Figure 6.2: The definition of longitudinal and lateral error used for PID control. point (stored in delay stack), what will be the correct estimation at the current step. The Kalman state estimate ȳ contains Filtered vehicle position xk and yk, heading θk, velocity vk, yaw-rate Ûθk and side-slip angle γk. The details are provided in Table 6.1. Next, the estimated Kalman state is transmitted to the Trajectory Localization module (TL at Figure 6.1). Another input of TL is the trajectory t, provided as a set of discrete points with the information given in Table 6.1. The trajectory is stored as a parameter in a Matlab Workspace during the simulation, but loaded from the SD card during the run on the physical system. First, the coordinates of the trajectory points are transformed to the coordinate frame fixed at initial vehicle position with Y axes aligned to vehicle orientation θk. This coordinate frame holds for both, Kalman filtered and target positions, until the system finishes a lap. Afterwards, TL calculates the lateral and longitudinal errors are later used as an inputs to PID controllers. The Definition of these two errors is schematically presented in Figure 6.2. The transparent car represents the state at time-step t −3 when system had last sensor information update. The colored car stands for the filtered state at the current time t, corrected by the information from t − 3. In order to compute the errors, the closest point to the filtered vehicle state on a target trajectory is calculated (gray circle in Figure 6.2). The Closest point is obtained in three steps. First, the path coordinate of the anticipated closest point sc is integrated from sc,t−1 (previously closest 54 6.2 Control Scheme Analysis -15 -10 -5 0 5 10 15 20 X [m] -5 0 5 10 15 20 25 30 Y [ m ] 2 2.5 3 3.5 4 4.5 5 5.5 6 V [ m p s] (a) The target trajectory with velocity profile -10 -5 0 5 10 15 20 X [m] 0 5 10 15 20 25 Y [ m ] GPS Measurement GPS Position Estimated GPS Estimated CoG Target CoG (b) RC car trajectory Figure 6.3: The target and actual trajectories of the autonomous RC car simulation (taken from [35]). one) and filtered vehicle velocity vk. At the first iteration sc is set to 0. Then,the set of temporary points s� is sampled from a region around sc plus some safety margin. For every s� position is interpolated from the discrete trajectory points. The euclidean distance is calculated then between every sampled point and the filtered position of the vehicle xk, yk. According to the euclidean distance, the closest point on a trajectory is selected. Once the closest point is defined, a difference between the filtered position and the closest point is projected to the normal at sc. This projection is taken as the lateral error (red bar at Figure 6.1). The longitudinal error (blue bar) is calculated as a difference in s coordinate between the closest point and the target one, which is obtained via the integration of the target velocity. The calculated errors are transmitted to The Controller (C) module, which contains two parallel PID controllers with the custom derivate terms scheduled on the velocity. As a derivative term of the lateral PID a difference between the filtered vehicle orientation and orientation in closest trajectory point (interpolated) is taken. The derivative term of the longitudinal controller is calculated as a difference in target and filtered velocities. Last module in presented scheme is the Control Router (CR). CR takes a filtered Kalman state as an additional input. First, CR computes requested yaw-rate Ûθ∗, acceleration a∗ and estimated side-slip angle γ using the PID outputs u and the Kalman state ȳ. Then this information is transmitted to the Delay stack (D) via the internal loop for further use in the State Observer (O). Then the Router performs the post-processing of a PIDs outputs. CR does include the inverse dynamics model of steering actuator, which is used to transform the lateral PID output and speedup the steering response [35]. Additionally the output of the longitudinal PID is transformed using a special look-up table to compensate for different acceleration dynamics of the RC car at different velocities [35]. Considering that the original scheduling mechanism ought to be replaced with the parametrized GPs and is not required for the method application, it was removed from the control mechanism. The gains of the remained plain PID controllers were slightly adjusted to give the performance comparable to the scheduled PID. The resulting trajectory of the autonomous RC car with the manually tuned PID controllers is presented in Figure 6.3b. Figure 6.3a demonstrates the target trajectory with the velocity profile which is used during the simulations. The trajectory starts at (0, 0) and goes to the positive X direction. During the first lap the trajectory is followed more 55 6 Application to Autonomous RC Car precisely since the target velocity is reduced for the sensor calibration purposes [35]. Starting from the second lap, one can observe a persistent error in the bottom right corner of a trajectory. However, the manually tuned PIDs are still able to follow the defined target trajectory with relatively small accumulated longitudinal and lateral errors. That is why this system was considered to be an appropriate application base for the developed algorithm. The results of the PID tuned by Wischnewski are taken as a benchmark. All experiments conducted from now on use the target trajectory from Figure 6.3a. 6.3 Dynamics Modeling This section addresses learning of initial probabilistic dynamics model which corresponds to step 4 in Algorithm 3.1. The errors definition do not correspond to the linear transformation of a vehicle state and desired trajectory anymore. Thus, the integration of error calculation into the extended state z̃ requires additional approximations for the joint distribution p( z̃). Therefore, it was decided to model the error dynamics directly. Initial GP has to output et instead of xt (Section 3.3). Initially it was decided to encapsulate the entir