Institute of Parallel and Distributed Systems

University of Stuttgart
Universitätsstraße 38

D–70569 Stuttgart

Masterarbeit

Model-Based policy Search for
Learning Multivariate PID Gain

Scheduling Control

Maksym Lefarov

Course of Study: Computer Science

Examiner: Dr. Daniel Hennes

Supervisor: Dipl. Ing. Andreas Doerr,
Dr. Christian Daniel,
Dr. Duy Nguyen-Tuong.

Commenced: October 18, 2017

Completed: April 18, 2018


Abstract

Due to its simplicity and demonstrated performance, Proportional Integral and Derivative (PID)
controller remains one of the most widely-used closed-loop control mechanisms in industrial
applications. For the unknown model of a system, however, a PID design can become a significantly
complex task especially for a Multiple Input Multiple Output (MIMO) case.

For the efficient control of a nonlinear and non-stationary systems, a scheduled PID controller can
be designed. The classical approach to gain scheduling is a system linearization and the design of
controllers at different operating points with a subsequent application of interpolation.

This thesis continues on the recent advances in application of Reinforcement Learning (RL) to a
multivariate PID tuning. In this work we extend the multivariate PID tuning framework based on the
Probabilistic Inference for Learning Control (PILCO) algorithm to tune a scheduled PID controllers.
The developed method does not require the linear model of a system dynamics and is not restricted
to the low-order or Single Input Single Output (SISO) systems. The algorithm is evaluated using
the Noisy Cart-Pole and Non-stationary Mass-Damper systems. Additionally, the proposed method
is applied to the tuning of a scheduled PID controllers of autonomous Remote Control (RC) car.

3


Contents

1 Introduction 13

2 Background 15
2.1 Control Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 PID controller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Gaussian Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3 Related Work 21
3.1 Research in PID Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 PILCO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3 PID Tuning using PILCO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4 Gain Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

4 Gain Scheduling using PILCO 33
4.1 Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2 Dependent Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3 Independent Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.4 Cross Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.5 Numerical Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

5 Testing Systems 43
5.1 Noisy Cart-Pole . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.2 Non-stationary Mass-Damper . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

6 Application to Autonomous RC Car 51
6.1 System Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.2 Control Scheme Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.3 Dynamics Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.4 Control Redesign . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.5 Validation Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.6 New Errors Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.7 New Errors Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

7 Conclusion and Outlook 69

A Appendix 71
A.1 Linear transformation of Gaussian distribution . . . . . . . . . . . . . . . . . . . 71
A.2 Product of two independent Gaussian distributions . . . . . . . . . . . . . . . . 71

Bibliography 73

5


List of Figures

2.1 Quantities of a control system performance . . . . . . . . . . . . . . . . . . . . 16
2.2 The impact of PID gains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

4.1 Numerical test of the derived approximations . . . . . . . . . . . . . . . . . . . 41

5.1 Noisy Cart-Pole dynamics model . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.2 The PID optimization for the Noisy Cart-Pole . . . . . . . . . . . . . . . . . . . 45
5.3 The optimized PID for the Noisy Cart-Pole . . . . . . . . . . . . . . . . . . . . 45
5.4 Non-stationary Mass-Damper properties . . . . . . . . . . . . . . . . . . . . . . 47
5.5 Non-stationary Mass-Damper dynamics model . . . . . . . . . . . . . . . . . . 48
5.6 The optimized nonscheduled PID for the Non-stationary Mass-Damper . . . . . . 48
5.7 The optimized scheduled PID for the Non-Stationary Mass-Damper . . . . . . . 49
5.8 Concatenated optimization process for the scheduled PID . . . . . . . . . . . . . 50

6.1 Simplified control scheme of the autonomous RC car . . . . . . . . . . . . . . . 52
6.2 The definition of longitudinal and lateral errors used for PID control . . . . . . . 54
6.3 Autonomous RC car trajectories . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.4 Tests of trajectory specific model of the errors dynamics . . . . . . . . . . . . . 57
6.5 Tests of trajectory dependent model of the errors dynamics . . . . . . . . . . . . 58
6.6 The hyper-parameters of the trajectory dependent model . . . . . . . . . . . . . 59
6.7 Tests of trajectory dependent model with additional states . . . . . . . . . . . . . 59
6.8 The RC car experiment subdivision . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.9 Tests of the local model of the errors dynamics . . . . . . . . . . . . . . . . . . 61
6.10 The hyper-parameters of the local model of the errors dynamics . . . . . . . . . 62
6.11 Validation experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.12 The new errors definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.13 Test of the redefined errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

7


List of Tables

4.1 The moments errors for the subset of numerical test experiments . . . . . . . . . 40

5.1 Noisy Cart-Pole parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.2 Non-Stationary Mass-Damper parameters . . . . . . . . . . . . . . . . . . . . . 46

6.1 The data transmitted within the control scheme of the autonomous RC car . . . . 53

9


List of Abbreviations

PID Proportional Integral and Derivative. 3

MIMO Multiple Input Multiple Output. 3

RL Reinforcement Learning. 3

PILCO Probabilistic Inference for Learning Control. 3

SISO Single Input Single Output. 3

RC Remote Control. 3

GP Gaussian Processes. 13

LTI Linear Time-Invariant. 15

IAE Intergated Absolute Error. 16

ISE Intergated square Error. 16

ZN Ziegler-Nichols method. 21

AMIGO Approximate M-constraint Integral Gain Optimization. 21

IMC Internal Model Control. 23

SOFC Static Output Feedback Controller. 23

LMI Linera Matrix Inequalities. 23

NARX Non-linear Auto-regressive Exogenous. 29

QLPV Quasi-Linear Parameter Varying. 30

GMV General Minimum Variance. 31

LQR Linear Quadratic Regulator. 43

BFGS Broyden — Fletcher — Goldfarb — Shanno. 44

UKF Unscented Kalman Filter. 52

11


1 Introduction

A simple structure and a demonstrated performance permitted PID to become one of the most
popular closed-loop control mechanisms used among the industry [21]. MIMO systems can be
controlled using the multivariate PID controllers [1].

Despite the clear structure of PID, it is usually designed via the application of a tedious heuristic
tuning routines. The advanced PID design techniques such as Pole Placement [1], Loop shaping
[4], D-Partitioning and Cancellation of Poles [11] require the linear model of a system. Moreover,
some of the design methods, e.g. Tuning based on Gain and Phase margins [2] and λ-tuning [4], are
applicable only to the low-order systems. For the multivariate PID tuning it is commonly assumed
that a plant can be decoupled into SISO systems [39]. A design of controllers is then performed
separately for every system via the standard techniques. The evolutionary based optimization
methods for multivariate PID tuning do not require the decoupling and are not restricted to the
low-order systems [20]. Nevertheless, the known or approximated model of a plant is still required
for the evaluation of a fitness function.

For the unknown systems, auto-tuning methods such as Relay Auto-tuner [34] or Feature-Based
tuning techniques [4] are used. Both perform the model estimation with subsequent application of a
model-based tuning techniques.

Application of learning-based methods to the control tasks in general and PID tuning in particular
is limited by their data inefficiency. A system rollouts of a real control scenarios can be rather
costly. Addressing this limitation, Deisenroth et al. [14] developed a model-based policy search
algorithm PILCO, which maintains the probabilistic model of a system dynamics learned from
an agent-system interactions. Such setup permits efficient learning with relatively small amount
of observed data. Doerr et al. [16] extended PILCO for multivariate PID tuning. The developed
framework is applicable to a nonlinear MIMO systems and does not require prior knowledge about
the system dynamics.

Gain scheduling is an efficient way to control a nonlinear and non-stationary systems, whose
dynamics changes with different operating conditions [4]. The design of a scheduled controller,
usually performed via the linearization of a plant and subsequent application of linear design
techniques [29]. The scheduled controller is then obtained using the nearest-neighbor interpolation,
e.g. controller switching, or linear interpolation, e.g. controller blending. Additionally, Fuzzy
logic can be exploited to introduce the human expertise into the controllers interpolation [38]. In
contrast, gain scheduling can be implemented in a form of a parametrized nonlinear function, which
computes the gains directly [9]. The parameters of this function are obtained via the optimization
procedures. For both approaches, a known model of system dynamics is required.

In this thesis we extend the multivariate PID tuning framework by Doerr et al. to learn the gain
scheduling function for a nonlinear and non-stationary MIMO systems control. The proposed
method does require neither prior knowledge of a dynamics model, nor its linearizion. The

13


1 Introduction

scheduling function is defined as a set of Gaussian Processes (GP) parameterized by a training
targets with a uniformly distributed training inputs. All equations, necessary for the integration of
this definition into PILCO, were derived and implemented.

Tests were conducted using the simulations of Noisy Cart-Pole and Non-stationary Mass-Damper
systems, which were developed as a part of the thesis. In order to prove the applicability of the
method to the real control task, it was exploited to tune the scheduled PID controllers of the
autonomous RC car. Different approaches to the RC car dynamics modeling were proposed and
evaluated. Additionally, modifications of the control mechanism, which can potentially improve the
results of model learning were introduced and implemented.

The thesis continues with a presentation of a theoretical background (Chapter 2) required for
the understanding of a further work. This background includes the basics of Control Design,
PID controller, gain scheduling and GP. Next, Chapter 3 surveys the related research in PID
tuning and gain scheduling. Additionally, it contains the detailed explanation of PILCO and its
application to PID tuning. Following chapter (Chapter 4) introduces the PILCO modification for a
scheduled PID learning. The chapter presents both, definition of gain scheduling function and all
derivations required for its integration into PILCO. Next two chapters are devoted to the testing of
the implemented PILCO modification. Test experiments with Noisy Cart-Pole and Non-stationary
Mass-Damper are presented in Chapter 5. Chapter 6 summarizes the work, made in algorithm
application to the autonomous RC car. It starts with the system description and control mechanism
analysis. Then the approaches to the dynamics modeling are presented. Chapter 6 ends with a
discussion of the proposed control modifications aimed to improve the model learning results.
Finally, Chapter 7 concludes this work.

14


2 Background

In the following chapter the theoretical background, required for the further work is presented. This
background includes basic information about Control Design and PID Controller as well as about
Gaussian Processes.

2.1 Control Design

Åström and Murray [4] define control as exploitation of one dynamical system, called controller to
influence the behavior of another one, called plant. Plant has a sate x ∈ RM . y ∈ RM is a state
observation (in some real scenarios, state x can be observed only partially). Closed-loop control
system is a mechanism, where the output of a controller u ∈ RF depends on a process variable
ŷ ∈ RD≤M , which is an observation of (the part of) a plant state x that have to be controlled. The
output of a controller is calculated based on the feedback error or control error: e = r − ŷ ∈ RD ,
where r is reference value or set point. From now onward for notation clarity we will assume that
process variable is a complete observation ŷ = y.

In a control domain plants and controllers are commonly described in terms of a transfer functions,
which define the relation between their inputs and outputs of a Linear Time-Invariant (LTI) system.
A transfer function is defined as:

Gp(s) =
Y (s)
U(s)

=
L{y(t)}
L{u(t)}

,

with L{·} being a Laplace transformation. The transfer function can be rewritten in a pole-zero
form:

Gp(s) = K
(s − z1)(s − z2)...(s − zm)
(s − p1)(s − p2)...(s − pn)

,

where z1, ..., zm are system zeros and p1, ..., pm are system poles. Denominator of a pole-zero form
of the transfer function is called characteristic polynomial. The number of roots of characteristic
polynomial (system poles) defines the order of a system. Transfer function of MIMO plant has a
form of a matrix:

Gp(s) =


g11(s) . . . g1F (s)
...

. . .
...

gD1(s) . . . gDF (s)

 ∈ R
D×F . (2.1)

One of the main advantage of a transfer function notation is that the complex system, which contains
the number of different functional subsystems, can be represented as a product of their transfer
functions. Thus, the transfer function of a closed-loop control system is given as:

Gl(s) =
C(s)Gp(s)

1 − C(s)Gp(s)
,

15


2 Background

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2
0

0.5

1

1.5

Steady-State Error

Settling time

Rise time

Percent Overshoot

Figure 2.1: Quantities of a control system performance

where C(s) and Gp(s) are controller and plant transfer functions respectively.

The control design has two common objectives, namely a set point following and a disturbance
rejection [4]. The set point following objective is defined by the performance requirements. A
performance of a SISO control system is often measured by applying a step function as r, and
observing the response of y. Commonly, the response is quantified by waveform characteristics [23].
Rise time is the time, that system takes to reach the 80− 90% of final value after excitation with step
r. Percent overshoot is the amount that y overshoots the r, expressed as a percentage of the final
value of r . Settling time is the time required for the y to settle within a certain percentage (usually
5%) of the final value. Steady-state error is the final difference between the process variable y and
set point r . Exact definition of these quantities can vary in industry and academia. For the MIMO
systems these characteristics are measured for every output dimension. All described quantities are
presented on Figure 2.1. The disturbance rejection objective is defined via the error, caused by the
disturbance. Common quantities to characterize the disturbance attenuation are maximum error,
time to maximum, decay ratio, Intergated Absolute Error (IAE) and Intergated square Error (ISE)
[4].

2.2 PID controller

PID is a closed-loop feedback controller [2]. PID consists of three correcting terms, which are
multiplications of gains with control error, its integral and derivative. Gains are the coefficients,
which define the impact of a particular error type on PID output. Summation of terms gives the
control output according to the equation:

u(t) = Kpe(t) + Ki

∫ t

0
e(τ)dτ + Kd

de(t)
d(t)
. (2.2)

In the above equation:

16


2.2 PID controller

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2
0

0.5

1

1.5

Step Response
Kp 300
Kp 200
Kp 100

(a) Proportional gain

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2
0

0.5

1

1.5

Step Response
Ki 100
Ki 70
Ki 30

(b) Integral gain

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2
0

0.5

1

1.5

Step Response
Kd 20
Kd 10
Kd 5

(c) Derivative gain

Figure 2.2: The impact of PID gains on a waveform of a step response.

Kp is a Proportional gain
Ki is an Integral gain
Kd is a Derivative gain
e(t) is a control error
t is a time
τ is an integration variable.

The proportional term Kpe(t) accounts for the magnitude of current control error. The proportional
gain determines the influence of the error on control output. As name suggests, high values of the
gain result in the large change of control output for the given change in error. Excessively high
values of the proportional gain could lead to the oscillation of the output and destabilization of the
system, whereas low values could potentially make it insensitive to the error. As it demonstrated in
Figure 2.2a, an increase in proportional gain reduced both the rise time and the steady-state error
Additionally the overshoot is increased, and the settling time descries by a small amount.

The integral term Ki

∫ t

0 e(τ)dτ accounts for both, magnitude of the error and its duration. It is a
summation of all instantaneous errors over the time which corresponds to accumulated offset that
should have been corrected previously. The integral gain defines the impact of this offset on the
output of controller. An increase in integral gain tends to reduce the steady-state error since it is
compensating for the collected errors from the past, however the overshooting (undershooting) is
increased for the same reason (Figure 2.2b).

The derivative term Kd
de(t)
d(t) is set to compensate for the approximated future error behavior using

the derivative of error with respect to time. Thus, tuning of derivative term improve the settling
time and stability of the system (Figure 2.2c). Most implementations of PID include additional
low-pass filter for the derivative to limit the influence of a noise on a system.

Transfer function of PID controller is defined as:

C(s) = Kp
s2TiTd + sTi + 1

sTi
,

Ti =
Kp

Ki
,

Td =
Kd

Kp
.

(2.3)

17


2 Background

2.2.1 Multivariate Case

In a general multidimensional case, equation (2.2) can be rewritten in a matrix notation form as
following:

ut = K pet + K i

∫ t

0
et + Kd Ûet . (2.4)

In the above equation K p, K i, and Kd are gain matrices of dimensionality RF×D , where F is
number of controller outputs and D is number of error states. Similarly, et ,

∫ t

0 et and Ûet are vectors
of errors with dimensionality RD at time step t and ut ∈ R

F is a control output at time t.

Two types of Multivariate PID controllers can be defined depending on the structure of gain
matrices. The controller, which has the gain matrices of diagonal form is called Decentralized
(Decoupled). Every output dimension of such controller depends only on one control error state.
On the other hand, Centralized PIDs has gain matrices of arbitrary forms, meaning that every output
of controller arbitrary depends on control errors. Transfer function of multivariate PID controller
can be obtained similarly to (2.1). A transfer matrix is of diagonal form for a decoupled controller
and arbitrary for a centralized. Centralized controllers have a better performance on systems with
high interconnection between the states, however tuning of such controllers is significantly more
difficult than decentralized ones, which is itself a tedious task. A detailed explanation on tuning of
PID controllers is provided in Chapter 3.

2.2.2 Gain Scheduling

Gain scheduling is a changing of the controller’s parameters (gains) depending on the measurements
of operating conditions [4]. That measurements are called scheduling variables and could be
systems states, process value, output of the controller or even external signals. Many design notions
can fall under such broad definition - controller switching and controller blending fits the notion
of gain scheduling as well as introduction of a nonlinear scheduling function [9, 25]. The gain
scheduling is an efficient way to control nonlinear and non-stationary systems, whose dynamics
changes with different operating conditions [4].

The most typical approach to design of gain scheduled controller for nonlinear or time varying system
is a linearizion scheduling, which includes three steps [29]. The first step is computation of a linear
parameter-varying model of the system. The most common method for such modeling is Jacobian
Linearizion of a system around the equilibrium points ρ. This method results in a parametrized
family of linearized systems, which form a basis for linearizion scheduling. The parametrization
corresponds to the fixed values of the scheduling variables observed at corresponding equilibrium
point.

Next step is to use the linear controller design methods for every linearized system in the family.
This step results in a set of linear controllers at isolated values of scheduling variable. Final step is to
obtain a linear parameter-varying controller. The simplest approach is controller switching, which is
the nearest-neighbor interpolation of a controller parameters for the observed value of a scheduling
variable. Even though such approach benefits form the simplicity, it produces discontinuous jumps
in controller parameters which leads to performance decrease and probable destabilizations. Another

18


2.3 Gaussian Processes

way is to exploit linear interpolation, e.g. controller blending. Independent from linearizion and
interpolation method, the more linear controllers are designed, the better performance is achieved
[29]. More advanced method of gain scheduling will be reviewed in Chapter 3.

2.3 Gaussian Processes

A Gaussian process is a generalization of the Gaussian probability distribution, which describes the
functions instead of scalars or vectors [26]. GP is a Bayesian nonparametric model which outputs
the posterior distribution over the functions that describe the observed data p( f |D). Non parametric
stands for a model Mw parametrized by w, which is selected from some family {Mw : w ∈ W},
whereW is infinite dimensional [31].

GP is completely specified by the prior distribution and observed data D: n input vectors, x ∈ RD
concatenated into matrix X ∈ RD×n, and targets y ∈ R, concatenated into y ∈ Rn. Prior distribution
represents the beliefs over the kind of functions, which one expects before the data is shown. The
prior of GP is defined by mean function m(·) and kernel (positive semidefinite covariance function)
k(·, ·). m(·) is commonly kept zero. Square Exponential is the usual choice for the kernel:

κ(x, x ′) = σ2
f e

(
− 1

2 (x−x
′)TΛ−1(x−x′)

)
. (2.5)

In the above equation, Λ is diag(l)2, where l ∈ RD is a vector of positive values called length-scales.
The inverse of length-scale li determines the relevance of input dimension i. The hyper-parameters
of GP, θ are defined then as (l, σ2

f )
T . For the realistic modeling scenario the noisy observations

instead of a direct function values have to be assumed y = f (x) + ϵ [26]. For that purpose the
additive independent identically distributed Gaussian noise ϵ with variance σ2

w is added to a kernel:
κ(x, x ′) + δσ2

w , with δ being a Kronecker delta which returns one if x = x ′ and zero otherwise. σ2
w

is included into GP hyper-parameters θ = (l, σ2
f , σ

2
w)

T . Covariance matrix for noisy observations
�y is defined as (� + σ2

w I ) with �i j = κ(xi, x j).

The hyper-parameters are obtained using the maximization of marginal likelihood:

p(y |X ; θ) =
∫

p(y | f , X ; θ)p( f |X ; θ)df (2.6)

The first two moments of a predictive posterior p(y∗ |x∗, X, y; θ) of the GP for the new input x∗ are
defined as following:

E[y∗] = κ(X, x∗)T�−1
y y, (2.7)

var[y∗] = κ(x∗, x∗) − κ(X, x∗)T�−1
y κ(X, x∗), (2.8)

The noticeable drawback of GP is its computational inefficiency - inversion of �y is of O(n3)

complexity. In order to reduce the complexity, sparse approximation of GP posterior can be used
[12]. Snelson and Ghahramani [30] presents the Sparse Pseudo-Input Gaussian Processes. This
model is parametrized by the pseudo (artificial) data set D̄, with the size m ≪ n. Pseudo inputs x̄
are placed via the maximization of marginal likelihood (2.6). Pseudo targets f̄ (without noise) are

19


2 Background

integrated out using the prior N( f̄ |0,�m) with �mi j = κ(x̄i, x̄ j). The posterior of resulting sparse
model is computed wilt complexity O(m2n). After additional precomputations complexity can be
improved even further to O(m) for the mean of prediction and O(m2) for the prediction variance
[30].

20


3 Related Work

This chapter gives the overview of the work related to the master thesis. First, the PID tuning research
is surveyed. Then detailed explanation of the PILCO [14] and its application to optimization of PID
gains [16] are presented. Chapter is concluded with the analysis of research in gain scheduling.

3.1 Research in PID Tuning

In the following section the review of relevant PID tuning techniques will be presented. For
convenience all techniques are grouped into two artificial categories: classical techniques and
techniques based on optimization.

3.1.1 Classical Techniques

Feature-based techniques are exploiting a features of system dynamics such as static gain K , velocity
gain Kv , dominant time constant T and dominant dead time L [3], estimated via experiments, for the
system approximation. Time-domain Ziegler-Nichols method (ZN) [4] is the most basic example
of a feature-based tuning techniques, which is still used in industrial applications. This method
approximates the system with a first-order plus dead-time model using the measurements obtained
from the step response:

Gp(s) =
K

1 + sT
e−sL . (3.1)

The gains are given as functions of dominant dead time L and static gain Kp. Additionally, the
Frequency-domain ZN [19] method defines gains as the functions of ultimate gain Ku (gain under
which the system starts to oscillate) and ultimate period Tu (period of the oscillation).

The ZN methods are easy to apply, however they result in a stable but not optimal in term of set
point following and disturbances attenuation controller (Section 2.1). Thus many modifications
were proposed, which aimed to retain the simplicity of ZN and improve the performance in the
same time. Approximate M-constraint Integral Gain Optimization (AMIGO) [4] uses the model of
a system dynamics based on velocity gain Kv:

Gp(s) =
Kv

s
e−sL . (3.2)

Additionally AMIGO uses different tuning rules (gains functions). The design goal of these rules
was to maximize the integral gain while fulfilling the specified disturbance attenuation constraints
[4].

21


3 Related Work

Tuning Based on Gain gm and Phase φm margins [11] is aiming to find a PI controller:

C(s) = Kp
sTi + 1

sTi
, (3.3)

for the given approximation of a system model (3.1) so that closed-loop system has defined gain
and phase margins. Kp and Ti are obtained via the solution of gm and φm equations [22].

Analytical methods use the system dynamics model (approximated) to directly calculate the gains
of PID controller. A Pole Placement [1] is a subset of an analytical method applied to the systems
with a low-order transfer function. A common approach is to specify the desired damping ratio
ζ and the natural frequency ω0 [22] for the system and to position the poles such that fulfill the
specified constraints and give the required closed-loop performance. For instance, pole placement
can be applied to a first-order plant:

Gp(s) =
K

1 + sT
, (3.4)

controlled by PI controller (3.3). The transfer function of such closed-loop system will have the
characteristic polynomial of a form:

s2 + 2ζω0s + ω2
0,

then controller parameters are derived from actual transfer function as:

Kp =
2ζω0T − 1

K
,

Ti =
2ζω0T − 1
ω2

0T
.

(3.5)

The pole placement can also be exploited to tune the PID (2.3) applied to the second-order system:

Gp(s) =
K

(1 + sT1)(1 + sT2)
. (3.6)

The characteristic polynomial of a closed-loop system will have a form:

(s + αω0)(s2 + 2ζω0s + ω2
0),

and PID gains are obtained similarly to (3.5). The Dominant Pole [1] design is a simplified pole
placement technique used for high-order systems. Commonly, dominant dynamics of a high-order
system can be approximated by two dominant poles p1, p2 that are poles with real part closest to
zero [11, 23]. Therefore dominant poles can be placed in a complex plane such that result in a
system with desired ζ and ω0:

p1 = −ζω0 + iω0

√
1 − ζ2,

p2 = −ζω0 − iω0

√
1 − ζ2.

22


3.1 Research in PID Tuning

The PID gains are determined such that p1 and p2 are the roots of characteristic polynomial of
closed-loop system. Dominant pole placement can be applied to Multivariate PID tuning [21].

The λ-tuning method assumes first-order plus dead-time model (3.1) with a long dominant dead
time L [4]. This method specified the desired closed-loop transfer function as:

Gl(s) =
e−sL

1 + sλT
,

where λ is an additional tuning parameter which influences the response time of the resulting
closed-loop system. λ < 1 leads to a faster response and decrease of the integrated absolute error,
but also increases sensitivity of a controller to variations in a system dynamics. The controller
transfer function is then obtained from a transfer function of closed-loop system with feedback
error:

C(s) =
1 + sT

K(1 + λsT − e−sL)
,

which, for L = 0, becomes a PI controller with:

Kp =
1
λK
,

Ti = T .

Internal Model Control (IMC) is a general control system design method, when the controller
contains the model of a system [4] . The corresponding controller transfer function looks as:

C(s) =
G f G

†
m

1 − G f G
†
mGm

,

where Gm is a model transfer function, G†m its inverse and G f is a low-pass filter. The low-pass
filter accounts for modeling errors. This method results in high-order controllers. However, for
simple models such as (3.4) and (3.1), it is possible to obtain PI or PID controller. PID controller
for (3.4) and a filter of form:

G f (s) =
1

1 + sTf
,

obtained via IMC is defined as:

C(s) =
(1 + sL/2)(1 + sT)

Ks(L + Tf )
,

with Tf being design parameter. The main advantage of IMC method is that the robustness is
considered explicitly during the controller design. Skogestad [27] proposes the internal model
control method to tune the PI controller for the first-order plus time-delay systems (3.1) using
Taylor-series approximation of exponential term. Dong and Brosilow [17] propose the method to
derive the Multivariate PID from the Multivariate IMC using the MacLaurin series expansion.

23


3 Related Work

Zheng et al. [39] propose another approach, similar to the IMC in a way that PID is obtained via
the transformation to another type of controller. In presented work PID is transformed to a Static
Output Feedback Controller (SOFC). The stabilization problem of static feedback controller is
solved using the Linera Matrix Inequalities (LMI) method. PID gains are then recovered from the
obtained controller.

Loop-shaping techniques can also be applied to PID tuning problem [4]. Loop-shaping aims to
obtain the desired transfer function of a closed-loop system choosing the right controller. First, the
desired gain crossover frequency ωgc is chosen (for instance based on requirement of attenuation
of load disturbance). Additionally, target phase margin φm is specified. Then gains of PID are
calculated such that closed-loop system meets given specifications [4].

Most of the reviewed design methods require the known linear model of a system dynamics, e.g.
Analytical techniques [1] and Loop-shaping [4]. ZN and AMIGO [4] approximates the unknown
system with a low-order models using the measurements from a relay or step response experiments.
Some, techniques are applicable only to a low-order plants, e.g. Gain and Phase margin based
design [11], λ-tuning [4]. For the high-order systems, either the low-order approximations have to
be computed, e.g. Dominant Pole design [21], or transformation to a different type of controller has
to be performed [17, 39]. Among the surveyed methods, only a Dominant Pole [21] design can be
relatively easy extended to a MIMO settings.

3.1.2 Optimization Based Techniques

The optimization methods can also be exploited for PID tuning. Given a controller parametrized
by its gains, one is optimizing the closed-loop performance defined as functions of controller
parameters. Among the popular optimization criteria are IAE, ISE etc. The optimization methods
are rather sensitive to the formulation of criteria, meaning that they can result in optimal, with
respect to ISE, but unstable controller because of neglected constraints [4]. Local minimas of the
optimization criteria can also deteriorate the performance of method [4]. Cancellation of Poles and
D-partitioning [11] are optimization methods for PID tuning which require the linear model of a
plant. Modulus Optimum and Symmetrical Optimum are limited to low-order systems [1].

Genetics algorithms and evolutionary based techniques are, similarly to optimization methods
seeking for the controller parameters, which optimize some criteria. However these methods are
not restricted to the low-order and SISO systems and demonstrated their ability to overcome local
minima problems [20].

Chang [8] proposes to use a multi-crossover genetic algorithm to optimize the IAE of multivariate
PID controller. All elements of PID gain matrices (2.4) are concatenated into one vector, used as a
chromosome. Algorithm was tested on classical binary distillation column plant - first order MIMO
system with strong interaction between input and output pair, and significant time delays. According
to the results, genetic algorithm converged to the stable controller after 300 of generations.

Gaing [18] applies modified Particle Swarm Optimization algorithm for tuning of multivariate PID
controller. Each PID gain was presented by a particle. Tests were conducted on two automatic
voltage regulator systems. After 10 seconds of optimization in average algorithm was able to output
stable controller gains.

24


3.2 PILCO

Even though the above algorithms managed to demonstrate satisfactory results for a MIMO testing
systems, evolutionary based approaches cannot guarantee convergence [28]. Moreover, both require
the model of a plant for the evaluation of a fitness function.

3.2 PILCO

The data inefficiency of RL limits its application to control and robotics tasks, where the agent-
environment interaction could become rather costly.

Deisenroth et al. [14] address this problem via modeling the observed dynamics of the environment
with a flexible nonparametric approach. The model-based RL methods are more efficient in
extracting valuable information from the data [14]. However, the performance of such methods
degrades significantly if learned model is biased. That is why, in order to compensate for the bias
in model, Disenroth et al. are exploit probabilistic model with the explicit notion of uncertainty,
which is also incorporated into planning and policy evaluation.

Authors are presenting PILCO, a model based-policy search framework with nonparametric GP
(Section 2.3) used as a model. Long term predictions and policy evaluation in PILCO based on
deterministic approximation of GP inference. Policy updates use the analytic solution of policy
gradients.

Here onwards the detailed description of PILCO is presented with all equations, since they will be
used in further chapters during the derivation of the gain scheduling. Nonetheless, policy gradients
are omitted, because of auto-derivation software exploitation.

PILCO considers the dynamic systems of form:

xt+1 = f (xt, ut ) + ω, ω ∼ N(0,Σω), (3.7)

where x and u is a continuous-valued system state RD and control RF respectively. f is unknown
transition dynamics and ω is a Gaussian system noise. Policy search is optimizing π : x 7→ π(x, θ)
with respect to expected long term cost:

Jπ(θ) =
T∑
t=0

E[c(xt )], x0 ∼ N(µ0,Σ0). (3.8)

In the above equation c(xt ) is a cost of being in state x at time t. A policy π is a function
parametrized by θ.

In order to find π∗, which minimizes (3.8) PILCO builds a probabilistic GP model and uses
the deterministic approximation of inference to compute the long term predictions and evaluate
policy, then optimizes it exploiting analytical gradients (here however omitted, due to the usage of
auto-derivation software). PILCO steps are summarized in Algorithm 3.1.

As a probabilistic dynamics model, PILCO uses GP, with tuples (xt, ut ) ∈ R
D+F as training inputs

and differences ∆t = xt+1 − xt ∈ R
D as training targets. As it was stated in Section 2.3, GP outputs

the posterior distribution over the functions, thus it cannot have multidimensional output. That is
why PILCO actually uses D conditionally independent GP, trained simultaneously for every target
dimension.

25


3 Related Work

Algorithm 3.1 PILCO
1: procedure Policy Search
2: Initialize:

Sample policy parameters θ ∼ N(0, I )
Apply π(θ) and record data {xt, ut }t=1...T .

3: repeat
4: Learn probabilistic GP dynamics model, using all data.
5: repeat
6: Approximate inference for policy evaluation Jπ(θ).
7: Obtain gradients Jπ (θ

dθ (auto-derivation software).
8: Update parameters θ (e.g., CG or L-BFGS).
9: until convergence; return θ∗ = argminJ(θ)

10: Set π∗ ← π(θ∗).
11: Apply π∗ to system and record new data {xt, ut }t=1...T .
12: until task learned; return π∗
13: end procedure

According to Section 2.3, GP is completely specified by prior mean function m(·) (which is usually
taken as zero) and kernel k(·, ·). In a PILCO paper Disenroth et al. use square exponential kernel
(2.5). Training points of GP are defined as X̃ = [x̃1, ..., x̃n] and y = [∆1, ...,∆n]

T . Hyper-parameters
of GP posterior are obtained via marginal likelihood maximization (2.6).

GP posterior gives one-step prediction xt+1, which is Gaussian distributed:

p(xt+1 |xt, ut ) = N(xt+1 |µt+1,Σt+1), (3.9)
µt+1 = xt + E[∆t ],
Σt+1 = var[∆t ],

In the above equation E[∆t ] and var[∆t ] are calculated according to (2.7) and (2.8) respectively.

In order to evaluate Jπ and optimize π (steps 6, 7, and 8 at Algorithm 3.1) PILCO constructs the
t-step-ahead marginal distribution p(x1 |π), ..., p(xT |π) (further on conditioning on π is omitted for
simplicity) with the initial state distribution p(x0) = N(x0 |µ0,Σ0), which is done by the recursive
application of (3.9). For that, one needs to propagate the uncertain inputs through the GP.

For one propagation step, i.e. to obtain xt+1 given p(xt ), joint distribution p(x̃) = p(xt, ut ) is
required (3.7). This distribution is approximated by Gaussian p(x̃) = N(x̃t | µ̃t, Σ̃t ). Using the
approximated joint input distribution, posterior for uncertain input is given by:

p(∆t ) =
∫ ∫

p( f (x̃t )| x̃t )p(x̃t )df d x̃, (3.10)

with f being the posterior of GP. Exact computation of (3.10) is analytically convoluted, that is
why it is also approximated by Gaussian. Next state prediction p(x̃t+1) is obtained then as:

µt+1 = µt + µ∆,

Σt+1 = Σt + Σ∆ + cov[xt,∆t ] + cov[∆t, xt ]. (3.11)

26


3.2 PILCO

Above approximation can be obtained using either Moment Matching or Linearizion of GP posterior
mean function [14]. Details of latter method is omitted here since the Moment Matching is used in
a further work.

To construct t-step-ahead marginal distribution using Moment Matching, one needs to compute
predictive mean µ∆, predictive covariance matrix Σ∆, covariances cov[x̃t,∆t ], cov[∆t, x̃t ] (3.11)
and cross-covariance cov[x̃t, µ∆]. First two terms are needed to obtain the distribution p(∆t ) =
N(µ∆,Σ∆) (3.10), covariances are used to get next step state distribution and cross-covariance
is used to get next step joint state-action distribution. Exact derivations of mean and variance
[14] are not presented here for thesis length reasons. Covariance and cross-covariance terms, in
contrast, listed with detailed derivation, since they will be referenced during the formulation of gain
scheduling in the following chapters.

The cross-covariance is computed between the joint state-action distribution at time step t :
x̃t ∈ N(x̃t | µ̃t, Σ̃t ) and corresponding predicted (approximated) state difference xt+1 − xt = ∆t ∼
N(µ∆,Σ∆) as following:

cov[x̃t,∆t ] = E[x̃t∆Tt ] − µ̃tµ
T
∆, (3.12)

where the right term is defined, µ∆ is given [14] and E[x̃t ] defined previously. In order to compute
the left term, for each state dimension a = 1, ...,D, low of iterated expectations is applied:

E[x̃t∆at ] = E[x̃t E[∆at | x̃t ]] (3.13)

=

∫
x̃tma

f (x̃t )p(x̃t )d x̃t

=

∫
x̃t

( n∑
i=1
βaika

f (x̃t, x̃i)

)
p(x̃t )d x̃t,

with posterior mean function m f (x̃t ) represented as a finite kernel expansion. Then summation
order is changed and βai is pulled out of integral:

E[x̃t∆at ] =
n∑
i=1
βai

∫
x̃t c1N(x̃t | x̃i,Λa)︸             ︷︷             ︸

κa
f
(x̃t, x̃i )

N(x̃t | µ̃t, Σ̃t )︸          ︷︷          ︸
p(x̃t )

d x̃t, (3.14)

where c1 := σ2
f a(2π)

D+F
2 |Λa |

1
2 is introduced to represent the GP kernel as an unnormalized

Gaussian for training inputs x̃i, i = 1, ..., n. The product of two Gaussian (3.14) also results in
unnormalized Gaussian c−1

2 N(x̃t |ϕi,Ψ) with moments defined as following (A.2):

c−1
2 = σ

2
f a(2π)

−D+F
2 |Λa + Σ̃t |

− 1
2 e

(
− 1

2 (x̃i−µ̃t )
T (Λa+Σ̃t )

−1(x̃i−µ̃t )

)
, (3.15)

ϕi = (Λ
−1
a x̃i + Σ̃

−1
t µ̃t ),

Ψ = (Λ−1
a + Σ̃

−1
t )
−1.

After pooling all variables, independent of x̃t , out of integral in (3.14), it results in expected value
of product of two Gaussians ϕi:

E[x̃t∆at ] =
n∑
i=1

c1c−1
2 βaiϕi, a = 1, ...,D. (3.16)

27


3 Related Work

After inserting it back into (3.12):

cov[x̃t,∆at ] =
n∑
i=1

c1c−1
2 βaiϕi − µ̃t µ

a
∆. (3.17)

Final simplifications leads to:

cov[x̃t,∆at ] =
n∑
i=1

c1c−1
2 βaiΣ̃t (Λa + Σ̃t )

−1(x̃i − µ̃t ), a = 1, ...,D. (3.18)

Desired covariance terms cov[x̃t,∆t ] from (3.11) is a sub-matrix of size RD×D of matrix R(D+F)×D
computed at above equation.

Now all terms required for the uncertainty propagation are defined, thus t-step-ahead marginal
distribution can be obtained. Last step to evaluation of Jπ (3.8) is computation of expected values:

E[c(xt )] =
∫

c(xt )N(xt |µt,Σt )dxt,

with c being a cost function. PILCO uses binary saturation cost function:

c(x) = 1 − e

(
− 1

2σ2
c
d(x,xtar)

2
)
∈ [0, 1],

where d is euclidean distance and σ2
c is parameter, which controls the width of the cost function.

For the further work this function will be kept, since it allows for a natural exploration [14].

PILCO was evaluated using Double-Pendulum swing-up and Cart-Pole problems. Main focus
was set on learning speed and quality of approximated inference. PILCO was trained for 15
iterations (37.5 seconds of interaction) for Cart-Pole system and 30 iterations (75 seconds) for the
Double-Pendulum swing-up. As a results, algorithm in 95% was able to learn the policy, which
solves the Cart-Pole problem, after 15 - 20 seconds of experience. The same success rate for the
Double-Pendulum system was achieved after 40 - 50 seconds of experience. Additionally, resulting
t-step-ahead marginal distribution has a relatively small variance for the successfully learned policy.
That indicates the ability of GPs to model the dynamics of both systems. Taking into account the
difficulty of tasks and obtained results, one can conclude that PILCO fulfilled the data efficiency
constraint.

3.3 PID Tuning using PILCO

Considering the demonstrated performance of PILCO and its applicability to the real control
scenarios, i.e. data efficiency, Doerr et al. [16] extend the described framework for multivariate PID
tuning.

To represent the arbitrary multivariate PID controller as a parametrized static policy, a state
information has to be extended. In order to compute the controller output ut , all necessary error
information from (2.4) has to be available at time step t. Additionally, a joint distribution for the

28


3.3 PID Tuning using PILCO

extended state has to be calculated in order to propagate the uncertainty, compute the t-step-ahead
marginal distribution and evaluate JPID. New system state zt was introduced. It contains the error
at previous time step et−1 and errors, accumulated until the previous time step

∑t−1
τ=0 eτ :

zt :=
[
et−1,∆T

t−1∑
τ=0

eτ, x̃t−1

]
,

with [·, ·] stating for concatenation of vectors and ∆T being a system sampling time. Then
the GP posterior p(xt ), approximated via Moment Matching, is added to the state, resulting in
z̃t = [zt, xt ].

After calculation of posterior expectation, variance and cross-covariance terms, described in
Section 3.2, the joint distribution for the new state extension can be obtained. Assuming that the xt
is independent from the error part of zt one gets:

[
ẽt−1
x̃t−1

]
xt

 ∼ N
©­­«
[
µzt

µxt

]
,


Σzt

[
0

Σxt, x̃t−1

]
[
0 Σx̃t−1,xt

]
Σxt


ª®®¬ , (3.19)

where ẽt−1 is the errors part of extended state. Σxt, x̃t−1 stands for the cross-covariance term
cov[xt, x̃t−1] and Σx̃t−1,xt for cov[x̃t−1, xt ].

Next, a desired trajectory information has to be added. The target trajectory xtr is given as Gaussian
Distribution N(xtr |µtr,Σtr) [15]. A new state extension is z̃t = [zt, xt, xtr,t ]. Since the target
trajectory is independent from z̃t the joint distribution of the new state extension can be obtained
similarly to (3.19), where the state prediction part xt is independent from ẽt−1. A new mean of
z̃t is just a concatenation of a previous mean and a target mean µtr,t . A covariance matrix is a
concatenation of Σz̃ and Σtr,t along the diagonal.

The current error is a difference between the predicted state xt and desired xtr. For the current error
a derivative and integrated error approximations are calculated as following:

Ûet ≈
et − et−1

∆T
,∫ t ·∆T

0
e(τ)dτ ≈ ∆T

t−1∑
τ=0

eτ + ∆T et .

Defined above errors are linear transformations of z̃, thus the joint distribution of the new state
extension z̃t = [zt, xt, xtar,t, et,∆T

∑t
τ=0 eτ,

et−et−1
∆T ] remains Gaussian and can be obtained via the

known formulas of linear Gaussian transformation (A.1).

At this point, z̃t contains the information required for the calculation of PID output ut as a linear
transformation of the augmented state. The final joint distribution p( z̃t, ut ) is calculated according
to (A.1). All values required for one PILCO iteration are defined.

Evaluation experiments were conducted using the inverted pendulum problem on the humanoid
upper-body robot Apollo. State of the inverted pendulum system consists of four states: effector
position and velocity, a pendulum angle and an angular velocity. In order to omit the modeling of

29


3 Related Work

the entire system state and its dynamics during the experiments, the Non-linear Auto-regressive
Exogenous (NARX) model [7] for the effector position and pendulum angle was used. For the
dynamics modeling Doerr et al. [16], use a sparse GP with the hyper-parameters learned by a
marginal likelihood maximization.

According to the demonstrated results, the presented algorithm was able to learn accurate dynamics
model and optimize the PID policy with 106 seconds of interaction with physical system in total.
Hence, the algorithm preserves the data-efficiency property of PILCO and can be applied to real
control scenarios. Therefore, this work was considered as a suitable starting point for the further
research in data-driven controller tuning. For the scope of this thesis it was decided to extend the
presented algorithm for learning of optimal gain scheduled PID, with the aim on improvement of
performance for nonlinear systems and on application to non-stationary systems [4].

3.4 Gain Scheduling

Apart from Jacobian linearizion mentioned in Section 2.2.2, Quasi-Linear Parameter Varying
(QLPV) [29] approach can be used to obtain a family of linear systems. The nonlinear system is
rewritten in a way to replace the nonlinear terms with new time-varying parameters and use them
later one as a scheduling variables. This method does not require the linearizion and equilibrium
point determination. However, new definition of a system can introduce "the additional behavior
beyond the original plant description" [29], which is complicating the design of a suitable controller.
Not all systems can be redefined as QLPV which constraints the possible applications of this
method. Moreover, the QLPV representation of a system is not unique and have to be additionally
determined for a particular implementation of gain scheduling [29]. The extension of linearizion
based scheduling (Jacobian and QLPV) to MIMO systems depends on the controller design method
used for a linearized parameter varying system [6].

3.4.1 Fuzzy Logic Based Gain Scheduling

The Fuzzy Inference [37] is a common mechanism to incorporate the system knowledge into a
scheduling step (interpolation of controllers) via the specification of a Fuzzy rules (membership
functions) in order to increase the performance and robustness [38].

Visioli [33] presents the work where the gain scheduling was applied to build the two degrees of
freedom controller with one part handling the attenuation of load disturbances and the second part
being responsible for set point following. Two controllers were tuned separately using ZN. The
resulting controller then was scheduled based on the current control error and its derivative using
the Fuzzy Inference for interpolation.

Woo et al. [36] applies the Fuzzy inference to schedule a parallel PI and PD controllers. They
also introduced the Parameter Regulator which adaptively change the membership functions of the
Fuzzy module, based on the step response.

The requirement of a human expertise and absence of general tuning method makes the Fuzzy
scheduling strongly application dependent [38]. Since the Fuzzy logic based scheduling modifies
the interpolation step only, it does not influence the applicability to MIMO systems.

30


3.4 Gain Scheduling

3.4.2 Artificial Neurons Based Gain Scheduling

Due to the capability of neural networks to accurately approximate the highly nonlinear function,
they appear in gain scheduling research [24].

Chen and Huang [10] design the gain scheduled PID controller for a highly nonlinear systems.
PID gains are scheduled using the separate outer loop, composed of a General Minimum Variance
(GMV) controller [5] and a neural network. The neural network is trained off-line to model the
dynamics of a system. The linear dynamics model is obtained at every control iteration via the
linearizion of the neural net. The GMV computes then the optimal gains of PID for the given linear
model of a system. The proposed method, however, accounts only for a SISO nonlinear systems.

Chang et al. [9] propose the method to schedule a multivariate PID using the Auto-tuning Neurons.
Auto-tuning neuron is a sigmoid function with an adaptive shape applied to the thresholded input.
Sigmoid is parametrized by the saturation level and its slope. Each PID gain of MIMO controller is
defined as an output of such sigmoid. The gains are scheduled then based on the control error. In
order to tune such scheduled controller a monotonically responding system is assumed, i.e. the
output is monotonically increasing or decreasing with the increase in a control signal. Under such
assumption the authors were able to derive the optimization algorithm, which outputs the optimal
sigmoid parameters with respect to the IAE.

31


4 Gain Scheduling using PILCO

In this chapter the definition of gain scheduling function is presented as well as mathematical
derivations, necessary for its implementation and integration into PILCO framework.

4.1 Formulation

As it was discussed in Section 2.2.2 and Section 3.4, a gain scheduling can vary from simple
interpolation of linear controllers, built at system’s equilibrium points, to the complex nonlinear
function, which defines the gains given the scheduling variable. For the purpose of this master
thesis it was decided to follow the latter approach and encapsulate the entire scheduling into one
nonlinear parametrized function. Parameters θ of this function then will be optimized as a policy
parameters π(θ) during the PILCO iteration (Algorithm 3.1).

In order to stay in fully Bayesian setting and maintain the PILCO notion of uncertainty, the gain
scheduling function was defined as a set of GPs parametrized by the training targets with uniformly
distributed training inputs. Gains of the multivariate PID controller are given as:

K ∼ N(µK,ΣK ) ∈ R
3FE, (4.1)

where F is the dimensionality of a control signal and E is the dimensionality of a control error. Each
element E[Ka] of gains mean vector µK and var[Ka] of covariance matrix ΣK for a = 1, ..., 3FE
are defined according to (2.7) and (2.8) as:

E[Ka] = κ(Γ, xs)
T (� + σ2

w I )
−1ya, (4.2)

var[Ka] = κ(xs, xs) − κ(Γ, xs)
T (� + σ2

w I )
−1κ(Γ, xs). (4.3)

In the above equation, xs ∈ R
S is a vector of scheduling variables and Γ ∈ RN×S is a matrix of

uniformly distributed training inputs with N being a number of inputs and S - dimensionality of
scheduling vector. Elements of the matrix� are given by κ(γi, γ j), where κ(·, ·) is a prior covariance
and γi, γ j ⊂ Γ. ya ∈ RN is a training target vector for GP outputting Ka.

Vectors ya, concatenated into matrix Y ∈ RN×3FE , form the parameters θ of a policy π(θ), which
will be optimized by PILCO. Hyper-parameters such as k(·, ·) and Γ are shared among all GPs.
They are not a part of θ, i.e. are not optimized by the algorithm. The gradient of GP with respect to
its prior covariance is more unstable than with respect to targets, thus the convergence of a gradient
based method can be compromised. That is why hyper-parameters are determined independently
for every application using the experiments.

33


4 Gain Scheduling using PILCO

Joint distribution p( z̃, K ), where z̃ is the extended state from Section 3.3 is completely defined
[14]. Nevertheless, the control signal u under such definition is a product of two correlated
multidimensional random variables p(K |xs) and p(ẽ |zt, xt, xtar,t ), that is concatenation of all errors
vectors. To integrate the defined above scheduling function into the PILCO framework, the control
mean E[u] and variance var[u] have to be computed (approximated). Additionally, cross-covariance
term cov[u, z̃] have to be derived for the approximation of joint distribution p( z̃, u), where z̃ is
already concatenated with gains.

During the work on this thesis two cases were considered: K being dependent on and independent
of ẽ. It is obvious that for independent case neglects the cross-covariance information between K
and ẽ, which can lead to deterioration of optimization results. However, for the fully dependent
case rough approximations for var[u] are used, which can also turn into a problem. Rest of the
chapter contains all necessary derivations for both cases.

4.2 Dependent Case

The multivariate PID controller’s equation (2.4) from the Background section looks as following:

ut = Kpet + Ki

∫ t

0
et + Kd Ûet . (4.4)

Since the above equation is a linear combination of three terms, in order to avoid redundancy,
derivations only for the proportional up part are provided. Two others can be derived via the same
computations. Furthermore, for the clarity of derivations, an input is assumed to be one dimensional.
Extension to multidimensional case is trivial (separate application to every dimension).

4.2.1 Expected Value

Expected value of proportional term can be decomposed as:

E[up] = E[KT
p e] = E[

∑
i

K piei] =
∑
i

E[K piei].

To the above equation total covariance low is applied:

cov[X,Y ] = E[XY ] − E[X]E[Y ], (4.5)

which results in:

E[up] =
∑
i

(cov[K pi, ei] + µkiµei)

= µTk µe +
∑
i

cov[K pi, ei].

Covariances and mean vectors, required for the computation of above equation, can be obtained
from the known joint distribution of the extended state p( z̃).

34


4.2 Dependent Case

4.2.2 Variance

In order to compute the variance, as previously, up is first rewritten as a sum of products:

var[up] = var[
∑
i

K piei].

Consider system state first being only two dimensional (K, e ∈ R2). Here onwards K stands for K p

to avoid indices redundancy:

var[K0e0 + K1e1] = var[K0e0] + var[K1e1] + 2 cov[K0e0,K1e1],

which can be generalized to:

var[
∑
i

Kiei] =
∑
i

var[Kiei] +
n−1∑
i

n∑
j=i+1

2cov[Kiei,Kjej]. (4.6)

First, blue term of the (4.6) is derived. Application of variance definition results in:

var[Kiei] = E[(Kiei)2] − E[Kiei]2

= E[K2
i e2

i ] − E[Kiei]2. (4.7)

The second term of right side of above equation can be computed via the application of total
covariance low (4.5) as following:

E[Kiei]2 = (cov[Ki, ei] + E[Ki]E[ei])2

= (cov[Ki, ei] + µkiµei)2,

where all required information is contained in the joint distribution of the extended state p( z̃).
However, the E[K2

i e2
i ] term of (4.7) turned out to be more difficult to compute.

Application of the low of iterated expectations to the E[K2
i e2

i ] results in:

E[K2
i e2

i ] = E[E[K2
i e2

i |Ki]] = E[K2
i E[e2

i |Ki]], (4.8)

where Ki and ei are jointly normally distributed random variables with correlation coefficient ρ.
Then, the conditional distribution of ei on Ki is also normal with mean and variance defined as
following:

E[ei |Ki] = E[ei] + ρ

√
var[ei]
var[Ki]

(Ki − E[Ki]),

var[ei |Ki] = var[ei](1 − ρ2).

(4.9)

Now, the expected value of error squared conditioned on gain E[e2
i |Ki] is obtained via definition of

conditioned variance:
E[e2

i |Ki] = var[ei |Ki] − E[ei |Ki]
2.

35


4 Gain Scheduling using PILCO

Substitution of the inner term of the right most expectation in (4.8):

K2
i E[e2

i |Ki] = K2
i

[
var[ei |Ki] + E[ei |Ki]

2

]
.

Then the above equation is rewritten with the definitions from (4.9):

K2
i E[e2

i |Ki] = K2
i

[
var[ei](1 − ρ2) +

(
E[ei] + ρ

√
var[ei]
var[Ki]

(
Ki − E[Ki]

))2
]
.

The above expectation is a quartic function of Ki, named g(Ki). Insertion of g(Ki) back into (4.8)
results in:

E[K2
i e2

i ] = E[g(Ki)].

At this point a first approximation is introduced. A performance of the algorithm will probably
depend on a quality of this approximation, so, as a future work, higher order approximations for
this term can be derived. For now E[g(Ki)] can be approximated using the Taylor Series, first two
derivatives of g(Ki) and first two moments of Ki as following:

E[g(Ki)] = E[g(µki) + Ûg(µki)(Ki − µki) +
1
2
Üg(µki)(Ki − µki)

2] = g(µki) +
1
2
Üg(µki)σ

2
x .

Below first two derivatives of g(Ki) are computed. The ρ
√

var[ei ]
var[Ki ]

term is substituted with A for
readability reasons:

Ûg(Ki) = 2Ki

(
var[ei](1 − ρ2) +

(
E[ei] + A

(
Ki − µki

) )2
)

+ K2
i

(
2
(

E[ei] + A
(
Ki − µki

) )
A
)
.

Üg(Ki) = 2
(

var[ei](1 − ρ2) +
(

E[ei] + A
(
Ki − µki

) )2
)

+ 4Ki

(
2
(

E[ei] + A
(
Ki − µki

) )
A
)

+ K2
i A2.

All information, reacquired for computation of above equations and E[K2
i e2

i ] can be obtained from
joint distribution of extended state p( z̃). Now all components of var[Kiei] in (4.6) are defined.

In order to compute cov[Kiei,Kjej] from (4.6) we are applying the total covariance low (4.5):

cov[Kiei,Kjej] = E[KieiKjej] − E[Kiei]E[Kjej]. (4.10)

Second part E[Kiei]E[Kjej] of the right side of the above equation is completely defined. Consec-
utive application of (4.5) to it results in terms contained at p( z̃). However, E[KieiKjej] revealed
to be more difficult. The above term was derived with loss of information, which is introducing

36


4.3 Independent Case

second source of possible deterioration in results and is an aim for improvement in further work.
For now, Ki is assumed to be independent of ej and Kj of ei , which permits to rearranging the inner
multiplication terms of expectation and rewrite it as E[KiKj]E[eiej]. Then substitute it back to
(4.10) and apply (4.5) to the right part. Rearrangement of terms results in:

cov[Kiei,Kjej] = E[KiKj]E[eiej]
− (cov[Ki, ei] + µkiµei)(cov[Kj, ej] + µk j µej)

= (cov[Ki,Kj] + µkiµk j)(cov[ei, ej] + µeiµej)
− (cov[Ki, ei] + µkiµei)(cov[Kj, ej] + µk j µej)

= cov[Ki,Kj] cov[ei, ej] + cov[Ki,Kj]µeiµej + cov[ei, ej]µkiµk j
− cov[Ki, ei] cov[Kj, ej] − cov[Ki, ei]µk j µej − cov[Kj, ej]µkiµei .

The above equation concludes the derivation of all terms, required to compute the variance (4.6),
even though currently we are neglecting cov[Ki, ej] cov[Kj, ei] etc.

4.3 Independent Case

As it was mentioned above, assuming the independence of K and ẽ results in loss of valuable
information. Nevertheless, for the independent case computations are significantly simpler than for
the dependent one, which allows to avoid approximations

4.3.1 Expected Value

Since cov[Kpi, ei] is equal to 0, the expected value becomes:

E[up] =
∑
i

(cov[Kpi, ei] + µkiµei)

= µTk µe .

4.3.2 Variance

Instead of approximating E[K2
i e2

i ] in (4.7), the covariance low (4.5) is applied repeatedly:

E[K2
i e2

i ] = cov[K2
i , e

2
i ] − E[K2

i ]E[e
2
i ]

= cov[K2
i , e

2
i ] − (var[Ki] + E[Ki]

2)(var[ei] + E[ei]2).

Substitution of the above equations to (4.7) results in:

var[Kiei] = cov[K2
i , e

2
i ]

− (var[Ki] + E[Ki]
2)(var[ei] + E[ei]2)

− (cov[Kiei] + µkiµei)2.

37


4 Gain Scheduling using PILCO

Now cov[K2
i , e

2
i ] and cov[Ki, ei] can be replaced with 0. Further simplification leads to:

var[Kiei] = var[Ki] var[ei] + var[Ki]µ
2
ei + var[ei]µ2

ki .

And cov[Kiei,Kjej] from (4.6) simplifies to:

cov[Kiei,Kjej] = cov[Ki,Kj] cov[ei, ej] + cov[Ki,Kj]µeiµej + cov[ei, ej]µkiµk j

4.4 Cross Covariance

Cross covariance term resembles one given at [14]. However, the computation becomes significantly
more complex for the covariance between the u and the entire extended state. z̃ stands for the
augmented state, which includes all information:

cov[ z̃, u] = E[ z̃u] − µz̃µ
T
u, (4.11)

with E[ z̃u] being a vector of [E[ z̃u0],E[ z̃u1], ..,E[ z̃ua], ..]. Using the low of iterated expectations
and given that all GPs of scheduling function share hyper-parameters similarly to (3.13):

E[ z̃ua] = E[ z̃ E[ua | z̃]]

=

∫
z̃
(
µkp
a e + µki

a

∫ t

0
e + µkd

a Ûe
)
p( z̃)d z̃ (4.12)

=

∫
z̃
∑
i

κ(xs, γi)
(
βkp
ai e + βki

ai

∫ t

0
e + βkd

ai Ûe
)
p( z̃)d z̃,

where xs is a vector of scheduling variables, γi is a training input and κ(·, ·) is a prior covariance of
the scheduling function described at Section 4.1. βkp

a is obtained from (4.2) as:

βkp
a = (� + σ

2
w I )

−1ya,

for GP outputting the a dimension of proportional gain vector K p etc. Now the linear transformation
matrices are introduced:

S returns the scheduling variable xs from the extended state z̃
P returns current error
I returns sum of errors and
D returns error derivative.

38


4.4 Cross Covariance

The insertion of the above matrices into (4.12) results in:

E[ z̃ua] =

∫
z̃
∑
i

k(S z̃, γi)
(
β
kp
ai (Pa z̃) + β

ki
ai(Ia z̃) + β

kd
ai (Da z̃)

)
p( z̃)d z̃

=

∫
z̃
∑
i

k(S z̃, γi)β
kp
ai (Pa z̃)p( z̃)d z̃

+

∫
z̃
∑
i

k(S z̃, γi)β
ki
ai(Ia z̃)p( z̃)d z̃

+

∫
z̃
∑
i

k(S z̃, γi)β
kd
ai (Da z̃)p( z̃)d z̃.

Then only the first summation term is considered. After pulling out the independent variables and
change the order of summation and integration:

E[ z̃ua] =
∑
i

β
kp
ai

∫
z̃k(S z̃, γi)(Pa z̃)p( z̃)d z̃.

Next, k(S z̃, γi), in a blue term above, is transformed to a Normal distribution, using the multiplication
with constant c1 as in (3.14). (Pa z̃) is also normal according to (A.1):∫

z̃k(S z̃, γi)(Pa z̃)p( z̃)d z̃ =
∫

z̃c1N(S z̃ |γi,Λ)N(Pa z̃ |µp,Σp)N( z̃ |µz̃,Σz̃)d z̃.

Now, as in [14], Gaussians product rule (A.2) can be applied. However, for that N(S z̃ |γi,Λ) has to
be transformed to N( z̃ |., .) and similarly N(Pa z̃ |µp,Σp). Using the (A.1):

S z̃ ∼ N(γi,Λ),

z̃ ∼ N(S†γi, S
†ΛS†T ),

where S† is a pseudo inverse of S. At this point it was decided to stop computation of cross-
covariance between the action and entire extended state, since there was no confidence that it will
lead to the right result and not just over-complication. Instead the cross-covariance term only for
scheduling vector xs of extended state z̃ was calculated:

cov[xs, u] = E[xsu] − µxsµ
T
u .

All errors terms are assumed to be given as the constants and independent from xs. As previously,
the low of iterated expectations is applied to every element E[xsua] of expectation of product in
above equation:

E[xsua] =

∫
xs

∑
i

k(xs, γi)
(
βkp
ai e + βki

ai

∫ t

0
e + βkd

ai Ûe
)
p(xs)dxs.

39


4 Gain Scheduling using PILCO

Dependent Approximation Independent Approximation

Experiment µ Error σ Error µ Error σ Error

1 0.214 1.727 0.211 0.531
2 0.061 0.413 0.069 1.342
3 0.005 0.021 0.005 9.506

Table 4.1: The moments errors for the subset of numerical test experiments.

Next, by pulling out the independent terms and switching the order of summation and integration:

E[xsua] =
∑
i

(
βkp
ai e + βki

ai

∫ t

0
e + βkd

ai Ûe
) ( ∫

xsk(xs, γi)p(xs)dxs
)

=
∑
i

βkp
ai e

( ∫
xsk(xs, γi)p(xs)dxs

)
+ ...+

= e
∑
i

βkp
ai

( ∫
xsk(xs, γi)p(xs)dxs

)
+ ... + .

In above equation the red term is E[xsK
a
p] defined in (3.16). Furthermore, for the implementation

simplicity it can be obtained from the covariance cov[xs, K
a
p], which is contained at the joint

distribution of extended state p( z̃) using (3.17). Now the final equation for cross-covariance
computation look like:

cov[xs, ua] = cov[xsK
a
p]e + µxsµ

T
Ka

p

+ cov[xsK
a
i ]

∫ t

0
e + µxsµ

T
Ka

i

+ cov[xsK
a
d]Ûe + µxsµ

T
Ka

d

+ µxsµua .

Thus, all equations, required for integration of our scheduling function, defined at Section 4.1, into
the PILCO framework are derived.

4.5 Numerical Test

In order to validate the derivations provided in Section 4.2 and Section 4.4, several numerical
test experiments were conducted. Subset of test experiments is presented at Figure 4.1. Left side
of the figure demonstrates initial correlated Gaussian distributions (green and orange) as well
as real distribution of a product (gray dashed) according to [13]. Plots on the right present the
approximations of product with Normal distributions using the equation for dependent (blue) and
independent (red) cases derived above.

Additionally, for every experiment, we sampled the exact product distribution [13] and calculated
its mean and variance. Errors between the first two moments of approximated distributions and
exact one for experiments at Figure 4.1 can be found in Table 4.1. As one can infer, mean error

40


4.5 Numerical Test

6 4 2 0 2 4 6
x

0.0

0.2

0.4

0.6

0.8

1.0
p(

x)

Initial Distributions
Product
Normal_1
Normal_2

6 4 2 0 2 4 6
x

Approximated Product
Product
Approx_Normal_Indep
Approx_Normal_Dep

6 4 2 0 2 4 6
x

0.0

0.2

0.4

0.6

0.8

1.0

p(
x)

Initial Distributions
Product
Normal_1
Normal_2

6 4 2 0 2 4 6
x

Approximated Product
Product
Approx_Normal_Indep
Approx_Normal_Dep

6 4 2 0 2 4 6
x

0.0

0.2

0.4

0.6

0.8

1.0

p(
x)

Initial Distributions
Product
Normal_1
Normal_2

6 4 2 0 2 4 6
x

Approximated Product
Product
Approx_Normal_Indep
Approx_Normal_Dep

Figure 4.1: Subset of numerical test experiments

41


4 Gain Scheduling using PILCO

does not heavily depend on way of approximation, at least for the considered experiments. Whereas
the variance changes a lot for dependent and independent cases. For variance there is no clear
superiority of one approximation comparing to another - for some experiments dependent variance
is outperforming independent, for others - vice versa. Decision on which approximation to use
should be made separately for every application based on the experiments which permit to measure
the quality of t-step-ahead marginal state distribution.

According to the demonstration, the provided derivations result in reasonable approximation. How-
ever, considering that during the PILCO iteration input distributions will be already approximated
using the Moment Matching, it is rather difficult to estimate the quality until the tests of entire
algorithm on real (testing) system.

42


5 Testing Systems

In this chapter the developed algorithm is applied to the testing systems. First, PILCO is used
to tune the gains of a PID controller applied to the noisy Cart-Pole, which is the modification of
OpenAI Gym Cart-Pole environment. Then, the modified PILCO will optimize the scheduled PID,
which is controlling the non-stationary Mass-Damper system implemented for testing purposes.

5.1 Noisy Cart-Pole

In order to test the algorithm developed by Doerr et al. [16] and create a baseline for further work,
PILCO was exploited to find the optimal gains of PID controlling the noisy Cart-Pole system. For
that noisy modification of OpenAI Cart-Pole environment was developed. The standard OpenAI
Gym Cart-Pole environment neglects a friction coefficient of a cart and a mass moment of inertia of
the pole. The state-space representation of a system was recovered from the source code and the
additive noise for the observations was included:

Û̃x =


Ûx
Üx
Ûφ
Üφ


=


0 1 0 0
0 0 mlg 0
0 0 0 1
0 0 −

g(m+M)

− 4
3 l(m+M)+lm

0



x
Ûx
φ
Ûφ


+


0

4
3 l(m+M)

(m+M)

0
− 1

4
3 l(m+M)+lm


u,

ỹ =

[
x
φ

]
=

[
1 0 0 0
0 0 1 0

] 
x
Ûx
φ
Ûφ


+

[
0
0

]
u + σy .

In the above equation:

x stands for a cart’s position
θ is a pendulum angle from vertical
u is a force applied to the cart.

System parameters and their values used during the experiments are given in Table 5.1. Additionally,
the PID controller was implemented to follow the desired trajectory of the cart position and the pole
angle. The recovered state-space representation was discretized using Matlab 2016b. Then a Linear
Quadratic Regulator (LQR) was computed as a benchmark for PILCO tuning. We put additional
threshold to the pole angle in the modified environment. With the pole angle bounded at +/− π

6 the
Cart-Pole dynamics is approximately linear. Target trajectory was set to zero for both, position and
angle.

43


5 Testing Systems

Parameter Value Unit

m mass of pole 0.1 kg

M mass of cart 1.0 kg

l length to the pole center of mass 0.5 m

dt system’s sampling time 0.02 s

σy spring characteristic ∼ U(1e−2, 1e−3) m, rad

Table 5.1: Noisy Cart-Pole parameters.

0.0 0.2 0.4 0.6 0.8 1.0
Time t, dt = 0.020000, H = 54

1.00

0.75

0.50

Ou
tp

ut

Exp.: 1 of 2
Exp., Out 0
Pred., Out 0

0.0 0.2 0.4 0.6 0.8 1.0
Time t, dt = 0.020000, H = 54

0

1

Ou
tp

ut

Exp., Out 1
Pred., Out 1

0.0 0.2 0.4 0.6 0.8 1.0
Time t, dt = 0.020000, H = 54

10

0

10

In
pu

t Exp., In 0
Pred., In 0

(a) Test experiment 1

0.0 0.1 0.2 0.3 0.4 0.5
Time t, dt = 0.020000, H = 28

0.40

0.45

Ou
tp

ut

Exp.: 2 of 2
Exp., Out 0
Pred., Out 0

0.0 0.1 0.2 0.3 0.4 0.5
Time t, dt = 0.020000, H = 28

0.1

0.0

0.1

Ou
tp

ut

Exp., Out 1
Pred., Out 1

0.0 0.1 0.2 0.3 0.4 0.5
Time t, dt = 0.020000, H = 28

10

0

10

In
pu

t Exp., In 0
Pred., In 0

(b) Test experiment 2

Figure 5.1: Tests of Noisy Cart-Pole dynamics model

As a first step of PILCO, the initial dynamics model have to be learned (Algorithm 3.1). For that a
sparse GP with 150 inducing points was trained on two experiments with the random policy (0.84
and 0.62 seconds) and one experiment with suboptimal, manually tuned, gains (2.0 seconds) in
order to collect additional valuable information around equilibrium point. NARX state included 2
states of the history for position, angle and control in order to learn the derivatives.

Three experiments with the random policy were used for testing. The simulated feed-forward
rollouts with initial dynamics model for two of them are presented in Figure 5.1. Both plots include
the blue line which stands for the data collected from the real rollout with the given policy, the
orange line, which presents the mean of t-step-ahead marginal distribution obtained from the model,
and the shaded region which demonstrates the variance of distribution (model uncertainty). This
description holds for all plots which demonstrate the comparison of simulated and real rollouts
from now onwards. Output 0 and 1 are cart position and pole angle respectively, and Input is a
control signal. Since the results were obtained from the feed-forward simulation, Input does not
have a variance. As one can see, an initial GP rather accurately models the dynamics of a system
but has a relatively big posterior variance.

With initial dynamics model, PILCO was applied to tune the gains of a PID controller. The
concatenated optimization progress of four PILCO iterations is presented in Figure 5.2. The
Broyden — Fletcher — Goldfarb — Shanno (BFGS) was used for optimization. BFGS iterations
from 0 to 32 corresponds to the first PILCO iteration. Due to the good precision of the posterior mean
of our initial model, the cost obtained from the simulated rollout (orange dashed line) reassembles
the cost from the system rollout (blue dashed). The persistent margin between two lines corresponds

44


5.1 Noisy Cart-Pole

0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120
BFGS Iteration

20

40

60

80

100

C
o
st

Optimization Progress

LQR cost
Cost from Model
Cost from Simulation

Figure 5.2: Concatenated optimization progress for the Noisy Cart-Pole PID.

0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00
Time t, dt = 0.020000, H = 100

0.00

0.05

0.10

Ou
tp

ut

Exp.: 1 of 1

Exp., Out 0
Pred., Out 0

0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00
Time t, dt = 0.020000, H = 100

0.00

0.05

Ou
tp

ut

Exp., Out 1
Pred., Out 1

0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00
Time t, dt = 0.020000, H = 100

0

10

In
pu

t

Exp., In 0
Pred., In 0

(a) Simulated feed-back rollout for optimized PID

0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00
Time t, dt = 0.020000, H = 100

0.00

0.05

0.10

Ou
tp

ut

Exp.: 1 of 1

Exp., Out 0
LQR., Out 0

0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00
Time t, dt = 0.020000, H = 100

0.00

0.05

Ou
tp

ut

Exp., Out 1
LQR., Out 1

0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00
Time t, dt = 0.020000, H = 100

0

10

In
pu

t

Exp., In 0
LQR., In 0

(b) The optimized PID versus LQR

Figure 5.3: The optimized PID for the Noisy Cart-Pole.

to the variance pf the initial model. According to Algorithm 3.1, at the end of the first iteration
the model is retrained using a new data recorder with the optimized policy (one experiment, 1.42
seconds). However, since the number of inducing points of GP was not increased, retraining results
in a degradation of the model precision. This effect can be observed during the second iteration
(BFGS steps 33 − 63). Despite this fact, at the end of second iteration, the optimized policy was
able to outperform the LQR (green line). Further concatenation of data and retraining of the model
improved the precision and decreased the variance (BFGS steps 64 − 120).

The results of 4 iterations of PILCO are demonstrated in Figure 5.3. Figure 5.3a presents the
simulated rollout for the optimized policy. As one can see, the resulting GP precisely models the
dynamics and has relatively small variance. Figure 5.3b compares PID with the optimized gains
(blue line) and LQR (green line). Since the cost function was set in a way to prioritize the pole angle
error minimization, which leads to the faster system stabilization, LQR results in better tracking of
the cart position (Output 0). PID optimized by PILCO, is clearly outperforming the LQR in second
Output (pole angle), thus is more optimal with respect to the specified cost.

The conducted experiments prove that PILCO modification [16] is able to optimally tune the gains
of PID with being data-efficient (9.46 seconds of system interaction was used in total to obtain
the results from Figure 5.3). In the following section experimental setup used for testing of the
scheduled PID tuning with PILCO (Chapter 4) is described and obtained results are analyzed.

45


5 Testing Systems

Parameter Value Unit

m mass 1.0 kg

D damping coefficient 1.5 Ns/m

k spring characteristic ∼ Figure 5.4a -
dt system’s sampling time 0.05 s

Table 5.2: Non-Stationary Mass-Damper parameters.

5.2 Non-stationary Mass-Damper

In order to test the derived integration of gain scheduling into PILCO the non-stationary Mass-
Damper system was developed. The spring characteristic (k) is changing according to the sigmoid
function with respect to the position of a center of mass (Figure 5.4a). Twenty continuous
Mass-Damper systems with the following state-space representation were calculated:

Û̃x =

[
Ûx
Üx

]
=

[
0 1
− k

m −D
m

] [
x
Ûx

]
+

[
0
1
m

]
u,

ỹ =

[
x
Ûx

]
=

[
1 0
0 1

] [
x
Ûx

]
+

[
0
0

]
u,

where:

x stands for the position of a center of mass
u is a force applied to the a center of mass.

In the above differential equation, the parameters were set according to Table 5.2. Then, using
Matlab 2016b these systems were converted to the discrete form with sampling time of 0.05 seconds.
For the final non-stationary system − k

m entry is linearly interpolated from discrete forms based on
x.

Next, the PID controller for tracking the desired x was implemented. In order to define the controller
output boundaries and feasible desired trajectory, the excitation signal (Figure 5.4b) was applied.
The system’s response (Figure 5.4c) demonstrates that with control bounded to +/−15, it is possible
to follow the desired trajectory with the boundaries +/−0.5m.

Additionally LQR was computed in Matlab 2016b for the system with k equal to 15 (dynamics for
negative x) for the benchmarking. Figure 5.4d presents the trajectory obtained with LQR (blue)
versus the desired trajectory (orange). For the LQR trajectory one can observe persistent steady-state
error in both negative and positive positions. Nonetheless, the error is lower for the negative x. The
difference in steady-state errors and overshooting in positive and zero x shows that controller was
tuned for dynamics in negative positions.

Then PILCO was applied to tune the PID gains. According to Algorithm 3.1, the initial probabilistic
dynamics model have to be learned first. Sparse GPs with 100 inducing points and the NARX state
with the history length of 4 for both, state and action were trained. The hyper-parameters were
optimized using marginal likelihood. For the train experiments we use three rollouts with gains
sampled uniformly from −10 to 10. The test experiments use similar rollouts.

46


5.2 Non-stationary Mass-Damper

1.00 0.75 0.50 0.25 0.00 0.25 0.50 0.75
x

15

20

25

30

35

40

45

K
K-values Sigmoid

(a) The change of k-value based on x

0 1 2 3 4 5
t [0.05]

0

2

4

6

8

10

12

14

u

Input Vector: u_vec

(b) The excitation input signal

0 1 2 3 4 5
t [0.05]

0.8

0.6

0.4

0.2

0.0

0.2

0.4

0.6

0.8

x

System Response for u_vec

(c) The excitation system’s response

0 1 2 3 4 5
t [0.05]

0.4

0.2

0.0

0.2

0.4

x

LQR Experiment
System response
Target trajectory

(d) LQR controller

Figure 5.4: Non-stationary Mass-Damper properties.

The results of test experiments are demonstrated in Figure 5.5. The Output stands for the x trajectory
and the Input for a control command. Due to the simplicity of the system and enough NARX history
GPs were able to learn the dynamics from three train experiments (5 seconds each).

The PID controller obtained after the first PILCO iteration is demonstrated in Figure 5.6. Figure 5.6a
presents the simulated feedback rollout for the optimized controller. The predicted trajectory of
x has a low variance, but deviates from the real one in the areas where the set point has a steep
change. The reason is that the dynamics model predicts the non-truncated control commands
and has no information about the boundaries of a controller output. This issue can be resolved
via the concatenation of a new training data which contains a non-truncated control at the end
of PILCO iteration. Despite this deviation, optimized controller outperforms LQR with respect
to the specified cost function already after the first PILCO iteration (Figure 5.6d). Therefore, it
was decided not to continue with a further iterations but proceed with a tuning of scheduled PID.
Figure 5.6b demonstrates the comparison of the rollouts with optimized PID and LQR. Figure 5.6c
plots the optimized PID rollout and the reference trajectory.

Next, the modified PILCO was applied to tuning of a gain scheduled PID which controls the
Non-Stationary Mass-Damper system. As it was described in Section 4.1, the gain scheduling
function is defined as a GPs with a uniformly distributed training inputs parametrized by the training

47


5 Testing Systems

0 1 2 3 4 5
Time t, dt = 0.050000, H = 100

1.25

1.00

0.75

0.50

0.25

0.00

Ou
tp

ut

Exp.: 1 of 2
Exp., Out 0
Pred., Out 0

0 1 2 3 4 5
Time t, dt = 0.050000, H = 100

10

0

10

In
pu

t

Exp., In 0
Pred., In 0

(a) Test experiment 1

0 1 2 3 4 5
Time t, dt = 0.050000, H = 100

3

2

1

0

1

2

Ou
tp

ut

Exp.: 2 of 2
Exp., Out 0
Pred., Out 0

0 1 2 3 4 5
Time t, dt = 0.050000, H = 100

10

0

10

In
pu

t

Exp., In 0
Pred., In 0

(b) Test experiment 2

Figure 5.5: Tests of Non-stationary Mass-Damper dynamics model.

0 1 2 3 4 5
Time t, dt = 0.050000, H = 100

0.5

0.0

0.5

Ou
tp

ut

Exp.: 1 of 1
Exp., Out 0
Pred., Out 0

0 1 2 3 4 5
Time t, dt = 0.050000, H = 100

50

0

50

In
pu

t

Exp., In 0
Pred., In 0

(a) The optimized PID after first iteration.

0 1 2 3 4 5
Time t, dt = 0.050000, H = 100

0.4

0.2

0.0

0.2

0.4

Ou
tp

ut

Exp.: 1 of 1
Exp., Out 0
LQR., Out 0

0 1 2 3 4 5
Time t, dt = 0.050000, H = 100

10

0

10

In
pu

t

Exp., In 0
LQR., In 0

(b) The optimized PID comparing to LQR.

0 1 2 3 4 5
t [0.05]

0.6

0.4

0.2

0.0

0.2

0.4

X

Optimized Policy
System response
Target trajectory

(c) The optimized PID comparing to target.

0 2 4 6 8 10 12 14 16
BFGS Iteration

30

35

40

45

50

55

Co
st

Optimization Progress
LQR cost
Cost from Model
Cost from Simulation

(d) The optimization progress.

Figure 5.6: The optimized nonscheduled PID for the Non-stationary Mass-Damper.

48


5.2 Non-stationary Mass-Damper

0 1 2 3 4 5
Time t, dt = 0.050000, H = 100

0.50

0.25

0.00

0.25

0.50

Ou
tp

ut

Exp.: 1 of 1

Exp., Out 0
Pred., Out 0

0 1 2 3 4 5
Time t, dt = 0.050000, H = 100

40

20

0

20

40

In
pu

t

Exp., In 0
Pred., In 0

(a) Test experiment 1

0 1 2 3 4 5
Time t, dt = 0.050000, H = 100

0.6
0.4
0.2
0.0
0.2
0.4

Ou
tp

ut

Exp.: 1 of 1
Exp., Out 0
Pred., Out 0

0 1 2 3 4 5
Time t, dt = 0.050000, H = 100

40

20

0

20

40

In
pu

t

Exp., In 0
Pred., In 0

(b) Test experiment 2

Figure 5.7: The optimized scheduled PID for the Non-Stationary Mass-Damper.

targets. Since the system dynamics depends on x, it was taken as a scheduling variable. The taring
inputs were placed at −0.5, 0, 0.5. It was assumed that such inputs setup permits the optimized
scheduling function to approximate the change of k and adjust the gains accordingly.

The scheduled controller optimized by 3 iterations of the modified PILCO is demonstrated in
Figure 5.7a. The resulting controller has a reasonable performance with respect to the model
prediction, however, the real rollout results in an oscillating trajectory. The observed behavior is
caused by a steep scheduling function, which results in an extreme change of gains for the small
variations in x. The order of magnitude is equal to 1 for the training targets and to −1 for x.
Additionally the optimization procedure tends to place the targets in a relatively big range, e.g. from
20 to 80. The combination of this two points leads to the scheduling function with an extreme slope.
Therefore during the feed-back simulation for the predicted x with a relatively small deviation from
the real trajectory the gains are calculated with a strong bias.

In order to address this problem it was decided to put the additional cost on the training targets to
prevent the learning of an extreme values. However, the adjustment of a cost coefficients of the
binary saturation loss function (Section 3.2) revealed to be a rather complex problem. The small
cost coefficients which punish the extreme values result in inability of auto-derivation software to
compute the gradients. The coefficients were set to the smallest values that does not compromise a
gradients calculation.

The result of a scheduled PID learning with the additional cost for the predicted gains are presented
in Figure 5.7b. The difference between the predicted and simulated x trajectories is not as significant
as in Figure 5.7a. However, the second plot of the figure shows that the predicted control signal
still deviates from the real one. Nevertheless, the introduction of the additional cost permits to
obtain the controller which outperform both, LQR and a nonscheduled PID. The concatenated
optimization progress of 4 modified PILCO iterations is demonstrated in Figure 5.8. As one can
see, due to the problem of the steep change in a gains the cost obtained from a model simulation
and the cost of a real rollout does not converge to the perfect match. Nonetheless, after the 110
BFGS iterations (4 PILCO iterations) the optimized scheduled controller outperforms the LQR and
a plain PID. The costs of LQR and simple PID rollouts are different from Figure 5.6d due to the
additional cost terms.

49


5 Testing Systems

0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110
BFGS Iteration

70

75

80

85

90

C
o
st

LQR cost
Nonscheduled PID cost
Cost from Model
Cost from Simulation

Optimization Progress

Figure 5.8: Concatenated optimization process for the scheduled PID.

Considering the demonstrated results and assuming that described problem of an extreme gains
change is application specific, it was decided to apply the developed method to the real control
system for further testing and evaluation.

50


6 Application to Autonomous RC Car

In order to prove the applicability of the designed algorithm to the real physical systems, it was
exploited to tune the scheduled PID controllers of the autonomous RC car, which was developed
as part of master thesis of Alexander Wischnewski [35]. This chapter presents the description of
a system with all modifications which were made during the method integration, as well as the
approaches to the modeling of system dynamics.

6.1 System Description

6.1.1 Physical Prototype

The physical prototype of the autonomous system is based on a commercial RC car of scale 1:6 to a
real vehicle. This car is controlled by the steering and throttle input commands. Four-wheel system,
which transmits the traction forces, is powered by a Brushed DC Motor. The front-wheel steering
is performed by two separate but mechanically linked servo-motors. The frame is mounted on a
single-wheel suspension with a spring-damper combination. Several parameters such as camber,
toe and spring-preload are available for tuning.

The autonomous driving functionality on a physical prototype is implemented using the Pixhawk
PX4 micro-controller, however this choice is not essential. The Pixhawk PX4 directly provides
the measurements of 3D-accelerometer, 3D-magnetometer, 3D-gyroscope and GPS. Additionally,
angular positions of the steering servo-motors are obtained from built-in rotary potentiometers
and the Pixhawk’s AD-converter interface. The sensors have a different sample rates which are
lower than the minimal limit set for runtime of control software. The oversampling approach was
implemented. The Sensors output the raw physical quantities and have to be calibrated, however
this part is not the scope of this thesis.

Further details about the physical prototype [35] are irrelevant for this thesis. Taking the results
from Wischnewski as a proof by demonstration, it was concluded that the system can be optimally
controlled by the manually tuned PID scheduled on the vehicle velocity, thus is appropriate for
application of our method.

6.1.2 Simulink Model

The controller for autonomous driving was developed in Simulink, using the precise RC car model
provided by Bosch. Additionally, a software package which configures MATLAB Coder to work
with the Pixhawk PX4 and provides an actuator and SD-card interfaces were installed. The resulting
Simulink model was provided as well as the physical prototype and includes the following functional
modules:

51


6 Application to Autonomous RC Car

C

D T

CR M

ui

t

u uo

Control Scheme RC Model

TL eO
(IM)

y

Sensors Simula�on xy

-

-

-

Figure 6.1: Simplified control scheme of the autonomous RC car.

1. High-level state-flow control
2. The Sensor and actuator simulators
3. The control mechanism for autonomous driving
4. Precise RC car model.
5. Data logging and post-processing routines
6. Controller parametrization via SD-card interface.

The following sections are focused on the control mechanism for autonomous driving module, other
modules [35] are not in the scope of this thesis.

6.2 Control Scheme Analysis

As a part the master thesis Wischnewski [35] introduced the control mechanism for the RC car,
which is able to follow the predefined discrete target trajectory. The control mechanism includes
two parallel controllers, which can be switched during the simulation and execution. The controllers
are nonlinear IMC controller, and multivariate scheduled PID. IMC was developed as a primary
controller and PID was used for the benchmarking. However, since the nonlinear IMC is out of
scope for this work, only the multivariate PID is considered.

A simplified control scheme is presented in Figure 6.1. The block M stands for the precise RC car
model, which has to be controlled. Inputs ūo of this model are throttle command ut in percents
+/−100, and steering command us in radians +/−0.2 (Table 6.1).

The model output state x with the information provided in Table 6.1, is transmitted through the outer
loop and enters the Sensors Simulation. x is post-processed with Sensors Simulators (noise and
delays are added) in order to account for the real hardware behavior. The resulting state observation
y obtained from the sensors is transmitted to the State Observer(O). All sensor measurements

52


6.2 Control Scheme Analysis

State Information Range Unit

u
ulat Lateral PID output - rad/s

ulon Longitudinal PID output - m/s2

ūo
us Steering angle +/−0.2 rad

ut Throttle +/−100 perc

ūi

Ûθ∗ Requested yaw-rate - rad/s

a∗ Requested acceleration - m/s2

γ Estimated side-slip angle - rad

x

x, y Vehicle position - m

θ Vehicle orientation - rad

v Vehicle velocity - m/s

Ûθ Vehicle yaw rate - rad/s

a Vehicle acceleration - m/s2

ȳ

xk, yk Filtered position - m

θk Filtered orientation +/−3.14 rad

vk Filtered velocity - m/s

Ûθk Filtered yaw-rate - rad/s

γk Filtered side-slip angle - rad

t

s Path along the trajectory - m

xtr, ytr Position in local coordinates - m

k Curvature - rad/s

vtr Target velocity - m/s

θtr Target orientation - rad

Ûθtr Target yaw-rate - rad/s

e
elat Lateral error - m

elon Longitudinal error - m

Table 6.1: The data transmitted within the control scheme of the autonomous RC car.

have different delays and sampling times. Therefore, the State Observer (O in Figure 6.1) has a
convoluted functionality of a sensors fusion which includes an Unscented Kalman Filter (UKF) and
the internal process model.

UKF outputs the state estimation based on the information measured by sensors. Nonetheless, the
complete information of a system state required for UKF to compute an estimation is available with
a delay of 3t where t is a system sampling time equal to 0.02. That is why the State Observer does
also include the internal simplified model of RC car dynamics. The internal model is used to correct
the chain of UKF observations made with an outdated sensor measurements. The internal model
inputs ūi are Requested yaw-rate Ûθ∗, Requested acceleration a∗, and Estimated side-slip angle γ.
They are transmitted via the inner loop (Figure 6.1) from the Control Router (CR) and delayed for
three time steps using the delay stack (D). The internal model then predicts, using the new sensed
information related to time-step (t-3) and all control outputs which were commanded since that

53


6 Application to Autonomous RC Car

Filtered State

Filtered State with all information (t-3)

Closest Trajectory State (t-3)

Target Trajectory State (t-3)

Target Trajectory State

Lateral Error

Longitudinal Error

Filtered Orientation (t-3)

Filtered Yaw Rate (t-3)

Estimated Side Slip Angle (t-3)

Closets Trajectory State

x ,  y

n

θ
.

θ

v
γ

Figure 6.2: The definition of longitudinal and lateral error used for PID control.

point (stored in delay stack), what will be the correct estimation at the current step. The Kalman
state estimate ȳ contains Filtered vehicle position xk and yk, heading θk, velocity vk, yaw-rate Ûθk
and side-slip angle γk. The details are provided in Table 6.1.

Next, the estimated Kalman state is transmitted to the Trajectory Localization module (TL at
Figure 6.1). Another input of TL is the trajectory t, provided as a set of discrete points with the
information given in Table 6.1. The trajectory is stored as a parameter in a Matlab Workspace
during the simulation, but loaded from the SD card during the run on the physical system. First, the
coordinates of the trajectory points are transformed to the coordinate frame fixed at initial vehicle
position with Y axes aligned to vehicle orientation θk. This coordinate frame holds for both, Kalman
filtered and target positions, until the system finishes a lap.

Afterwards, TL calculates the lateral and longitudinal errors are later used as an inputs to PID
controllers. The Definition of these two errors is schematically presented in Figure 6.2. The
transparent car represents the state at time-step t −3 when system had last sensor information update.
The colored car stands for the filtered state at the current time t, corrected by the information
from t − 3. In order to compute the errors, the closest point to the filtered vehicle state on a target
trajectory is calculated (gray circle in Figure 6.2). The Closest point is obtained in three steps. First,
the path coordinate of the anticipated closest point sc is integrated from sc,t−1 (previously closest

54


6.2 Control Scheme Analysis

-15 -10 -5 0 5 10 15 20

X [m]

-5

0

5

10

15

20

25

30
Y
 [

m
]

2

2.5

3

3.5

4

4.5

5

5.5

6

V
 [

m
p
s]

(a) The target trajectory with velocity profile

-10 -5 0 5 10 15 20

X [m]

0

5

10

15

20

25

Y
 [

m
]

GPS Measurement
GPS Position
Estimated GPS
Estimated CoG
Target CoG

(b) RC car trajectory

Figure 6.3: The target and actual trajectories of the autonomous RC car simulation (taken from
[35]).

one) and filtered vehicle velocity vk. At the first iteration sc is set to 0. Then,the set of temporary
points s� is sampled from a region around sc plus some safety margin. For every s� position is
interpolated from the discrete trajectory points. The euclidean distance is calculated then between
every sampled point and the filtered position of the vehicle xk, yk. According to the euclidean
distance, the closest point on a trajectory is selected. Once the closest point is defined, a difference
between the filtered position and the closest point is projected to the normal at sc. This projection is
taken as the lateral error (red bar at Figure 6.1). The longitudinal error (blue bar) is calculated as a
difference in s coordinate between the closest point and the target one, which is obtained via the
integration of the target velocity.

The calculated errors are transmitted to The Controller (C) module, which contains two parallel PID
controllers with the custom derivate terms scheduled on the velocity. As a derivative term of the
lateral PID a difference between the filtered vehicle orientation and orientation in closest trajectory
point (interpolated) is taken. The derivative term of the longitudinal controller is calculated as a
difference in target and filtered velocities.

Last module in presented scheme is the Control Router (CR). CR takes a filtered Kalman state as an
additional input. First, CR computes requested yaw-rate Ûθ∗, acceleration a∗ and estimated side-slip
angle γ using the PID outputs u and the Kalman state ȳ. Then this information is transmitted to the
Delay stack (D) via the internal loop for further use in the State Observer (O). Then the Router
performs the post-processing of a PIDs outputs. CR does include the inverse dynamics model
of steering actuator, which is used to transform the lateral PID output and speedup the steering
response [35]. Additionally the output of the longitudinal PID is transformed using a special look-up
table to compensate for different acceleration dynamics of the RC car at different velocities [35].

Considering that the original scheduling mechanism ought to be replaced with the parametrized
GPs and is not required for the method application, it was removed from the control mechanism.
The gains of the remained plain PID controllers were slightly adjusted to give the performance
comparable to the scheduled PID. The resulting trajectory of the autonomous RC car with the
manually tuned PID controllers is presented in Figure 6.3b. Figure 6.3a demonstrates the target
trajectory with the velocity profile which is used during the simulations. The trajectory starts at
(0, 0) and goes to the positive X direction. During the first lap the trajectory is followed more

55


6 Application to Autonomous RC Car

precisely since the target velocity is reduced for the sensor calibration purposes [35]. Starting
from the second lap, one can observe a persistent error in the bottom right corner of a trajectory.
However, the manually tuned PIDs are still able to follow the defined target trajectory with relatively
small accumulated longitudinal and lateral errors. That is why this system was considered to
be an appropriate application base for the developed algorithm. The results of the PID tuned by
Wischnewski are taken as a benchmark. All experiments conducted from now on use the target
trajectory from Figure 6.3a.

6.3 Dynamics Modeling

This section addresses learning of initial probabilistic dynamics model which corresponds to step 4
in Algorithm 3.1. The errors definition do not correspond to the linear transformation of a vehicle
state and desired trajectory anymore. Thus, the integration of error calculation into the extended
state z̃ requires additional approximations for the joint distribution p( z̃). Therefore, it was decided
to model the error dynamics directly. Initial GP has to output et instead of xt (Section 3.3).

Initially it was decided to encapsulate the entir