Institute for Artificial Intelligence Systems - MLS
University of Stuttgart
Universitätsstraße 32
D–70569 Stuttgart

Master's Thesis

Model-Based Reinforcement Learning under Sparse Rewards

Ravi Akash

Study program: Information Technology
Examiner: Prof. Dr. Mathias Niepert
Supervisors: Carlos E. Luis, M.Sc., Dr. Ing. Felix Berkenkamp (Bosch Center for Artificial Intelligence)
Commenced: February 15, 2023
Completed: August 14, 2023

Abstract

Reinforcement Learning (RL) has seen significant advances over the last decade in simulated and controlled environments. RL has shown impressive results in difficult decision-making problems such as playing video games or controlling robot arms. However, most methods require many interactions with the system in order to achieve good performance, which can be costly and time-consuming, especially in industrial applications. Model-Based Reinforcement Learning (MBRL) promises to close this gap by leveraging learned environment models and using them for data generation and/or planning while, at the same time, remaining sample-efficient. However, learning with sparse rewards remains a significant challenge in the field of RL. In order to promote efficient learning, the sparsity of rewards must be addressed. This thesis studies individual components of MBRL algorithms under sparse reward settings and investigates different design choices to measure their impact on learning efficiency. Suitable Integral Probability Metrics (IPM) are introduced to understand the model's reward and observation-space distributions during training. These design combinations are evaluated on continuous control tasks with established benchmarks.

Kurzfassung

RL has seen significant advances over the last decade in simulated and controlled environments. RL has achieved impressive results in difficult decision-making problems, such as playing video games or controlling robot arms, especially in industrial applications in which most methods require many interactions with the system to achieve good performance. This can be costly and time-consuming. MBRL promises to close this gap by using learned environment models for data generation and/or planning while at the same time aiming for high sample efficiency. However, the challenge of learning with sparse rewards remains a significant problem in the field of RL. To promote efficient learning, the sparsity of rewards must be addressed. This thesis examines individual components of MBRL algorithms under sparse reward conditions and investigates different design decisions to measure their impact on learning efficiency. Suitable IPM are introduced to understand the model's reward and observation-space distributions during training. These design combinations are evaluated on continuous control tasks with established benchmarks.

Contents

1 Introduction
  1.1 Goal of the Thesis
  1.2 Thesis Overview
2 Related Work
  2.1 Curiosity-Driven Exploration
  2.2 Random Network Distillation
  2.3 Hindsight Experience Replay
  2.4 Policy Optimization
3 Technical Background
  3.1 Reinforcement Learning
  3.2 Model-Free Reinforcement Learning
  3.3 Model-Based Reinforcement Learning
  3.4 Integral Probability Metrics
4 Methodology
5 Experimental Setup
  5.1 Hardware Specifications
  5.2 Inverted Pendulum
  5.3 Mountain Car Continuous
  5.4 Cheetah Run
6 Evaluations and Analysis
  6.1 SAC vs MBPO Performance
  6.2 Hyperparameters Ablation Study
  6.3 Dense Reward Cheetah Run Environment
  6.4 Reduced MBPO Performance
7 Discussion and Conclusion
  7.1 Results
  7.2 Outlook
  7.3 Conclusion
Bibliography

List of Figures

3.1 Interaction between an RL agent and an environment
3.2 Illustration of the detailed architecture of an MBPO agent
3.3 Illustration of collection and storage of model-generated data inside the model replay buffer of an MBPO agent
3.4 Illustration of collection and storage of model-generated data inside the model replay buffer of an MBPO-C agent
3.5 Illustration depicting Wasserstein distance calculation
3.6 Illustration depicting Maximum Mean Discrepancy distance calculation
3.7 Comparison of Gaussian and Laplacian kernels applied to the MMD distance metric
5.1 Illustration of continuous control environments
6.1 Performance comparison of SAC and MBPO agents in both sparse pendulum and continuous mountain car environments
6.2 Ablation study over rollout length hyperparameter using MBPO agent
6.3 Ablation study over rollout length hyperparameter using MBPO-C agent
6.4 Visualization of reward distribution and pdf during training in sparse pendulum swing-up task using MBPO agent
6.5 Ablation study over the number of rollouts per step hyperparameter using MBPO agent
6.6 Ablation study over the number of rollouts per step hyperparameter using MBPO-C agent
6.7 Ablation study over the number of updates to retain buffer hyperparameter using MBPO agent
6.8 Ablation study over the number of updates to retain buffer hyperparameter using MBPO-C agent
6.9 Ablation study over ensemble size hyperparameter using MBPO agent
6.10 Ablation study over frequency retrain hyperparameter using MBPO agent
6.11 Ablation study over trainer patience hyperparameter using MBPO agent
6.12 Ablation study over rollout length hyperparameter using MBPO agent in dense reward Cheetah Run environment
6.13 Ablation study over the number of rollouts per step using MBPO agent in dense reward Cheetah Run environment
6.14 Ablation study over the number of updates to retain buffer hyperparameter using MBPO agent in dense reward Cheetah Run environment
6.15 Learning curves representing the reduced performance of an MBPO agent operating close to a model-free SAC agent
6.16 Comparative metric study between MBPO and MBPO close to SAC setting in Continuous Mountain Car and Sparse Pendulum environments
6.17 Ablation study over ensemble size and rollout length hyperparameters using MBPO agent operating close to a model-free SAC agent in Sparse Pendulum environment
6.18 Ablation study over ensemble size and rollout length hyperparameters using MBPO agent operating close to a model-free SAC agent in Mountain Car environment
6.19 Ablation study over the number of rollouts per step and frequency retrain hyperparameters using MBPO agent operating close to a model-free SAC agent in Sparse Pendulum environment
6.20 Ablation study over the number of rollouts per step and frequency retrain using MBPO agent operating close to a model-free SAC agent in Mountain Car environment
6.21 Ablation study over rollout length and the number of rollouts per step hyperparameters using MBPO agent operating close to a model-free SAC agent in Sparse Pendulum environment
6.22 Ablation study over rollout length and the number of rollouts per step hyperparameters using MBPO agent operating close to a model-free SAC agent in Mountain Car environment
6.23 Ablation study over longer rollout length using MBPO agent operating close to a model-free SAC agent in both sparse pendulum and continuous mountain car environments
6.24 Ablation study over updates to retain buffer using MBPO agent operating close to a model-free SAC agent in both sparse pendulum and continuous mountain car environments

List of Tables

3.1 Hyperparameters for MBPO-style learning
5.1 Sparse Inverted Pendulum State Space
5.2 Continuous Mountain Car State Space
5.3 Cheetah Run State Space
6.1 MBPO Hyperparameter settings for Continuous Control Experiments
6.2 MBPO Hyperparameter settings reduced close to SAC Performance
List of Algorithms

3.1 Model-Based Policy Optimization algorithm
3.2 Model-Based Policy Optimization-Consistent algorithm

Acronyms

A2C Advantage Actor-Critic
BLR Bayesian Linear Regression
CEM Cross Entropy Method
CPC Contrastive Predictive Coding
CPU Central Processing Unit
DDP Differential Dynamic Programming
DL Deep Learning
DQN Deep Q-Networks
DRL Deep Reinforcement Learning
EMD Earth Mover Distance
GAN Generative Adversarial Networks
GP Gaussian Process
GPU Graphics Processing Unit
HER Hindsight Experience Replay
HPC High Performance Cluster
ICM Intrinsic Curiosity Model
IPM Integral Probability Metrics
MBPO Model-Based Policy Optimization
MBPO-C Model-Based Policy Optimization-Consistent
MBRL Model-Based Reinforcement Learning
MCTS Monte Carlo Tree Search
MDP Markov Decision Process
MFRL Model-Free Reinforcement Learning
MMD Maximum Mean Discrepancy
MPC Model Predictive Control
MVE Model Value Expansion
NN Neural Networks
PPO Proximal Policy Optimization
PSRL Posterior Sampling for Reinforcement Learning
RKHS Reproducing Kernel Hilbert Space
RL Reinforcement Learning
RND Random Network Distillation
SAC Soft Actor-Critic
SLBO Stochastic Lower Bound Optimization
STEVE Stochastic Ensemble Value Expansion
TRPO Trust Region Policy Optimization
WD Wasserstein Distance

1 Introduction

RL has become a popular paradigm of machine learning over the last decade, in which intelligent agents learn to act in some environment with the objective of maximizing cumulative rewards. The agents learn optimal strategies directly by interacting with an a priori unknown dynamical system [SB20]. RL has garnered a lot of attention and popularity in the research community due to its success in mastering the game of Go better than any human player [SSS+17], winning most of the Atari games it has been deployed to play, and, most recently, DeepMind's StarCraft II [VEB+17]. RL algorithms are mainly categorized into two classes: MFRL and MBRL. The key difference between them is that MFRL learns a policy by directly interacting with the environment, while in MBRL the agent learns an approximate model of the environment's dynamics and uses this model for policy optimization or planning. The key advantage of MBRL over MFRL is that the agent can first learn the model of the environment and then make informed decisions when interacting with the actual environment. This helps MBRL achieve more sample-efficient learning and faster convergence to optimal policies. The learned model can also be used for planning and exploration, allowing the agent to explore various scenarios and hypothetical actions without risking real interactions with the environment.

A fundamental challenge in RL is dealing with sparse rewards. It is crucial to study sparse reward problems, since they often appear in real-world tasks and are easy to design without domain knowledge. A sparse reward can be specified as long as there is a defined state-based criterion for success (e.g., a goal location is reached), since there are no rewards anywhere except in the region of the state space that meets the success criterion. In most real-world cases it would be hard to specify dense rewards, since we only have limited knowledge about the system [DMH19].
Most traditional RL and MFRL algorithms may fail to solve sparse reward problems because of the absence of intermediate rewards that drive the agent's exploration. MBRL can be particularly helpful in solving sparse reward tasks, since it offers the promise of leveraging the uncertainty of the learned dynamics model to drive exploration toward interesting regions of the state space. One of the pioneers of the field of reinforcement learning shared his view on MBRL and how it certainly plays an essential role in shaping the research ahead:

"The next big step forward in AI will be systems that actually understand their worlds. The world is only accessed through the lens of experience, so to understand the world means to be able to predict and control your experience, and your sensory data, with some accuracy and flexibility. In other words, understanding means forming a predictive model of the world and using it to get what you want. This is model-based reinforcement learning." - Richard S. Sutton

The asymptotic performance of MBRL algorithms was historically lower than that of model-free methods, but the gap has been closing in recent years [JFZL21], as a lot of effort has been directed towards scaling RL to real-world environments and dealing with sparse rewards [WEH+22]. MBRL has shown great success in terms of sample efficiency [DR11], [BHT+19], improving planning [JFZL21], and being robust to distributional shifts [FVM+22].

The major motivation behind studying sparse reward problems is that designing a reward function for a given task or environment requires close to no domain knowledge. For example, in a continuous control task where a robotic arm is required to stack an object on top of other objects, we can assign a sparse reward of 1 if the robotic arm successfully stacks the object and 0 otherwise. Designing intermediate rewards in this scenario can be challenging and in most cases not practical. The sparse reward problem has been addressed in previous state-of-the-art literature in a plethora of ways, in which Gaussian Processes (GP) [DR11] and Neural Networks (NN) [CCML18] were used as typical models for representing the one-step dynamics of the RL environment. Ensembles of probabilistic NNs are a common paradigm used by many MBRL methods that leverage uncertainty estimates to improve performance [BHT+19], [ZZW+19]. The learned models are then used, for instance, for planning, as done by [CCML18]. [FVM+22] present one of the most recent studies in which classic RL benchmarks are modified for different reward sparsity levels. However, research in MBRL under sparse reward settings has not answered some key questions, such as how accurate and diverse the model should be, how aware the model should be of its uncertainty, and how long the rollout trajectory should be. While it is fairly common for authors to experiment with and provide various algorithms for solving sparse reward tasks, it remains an open question which key ingredients and design choices are essential for solving them. The study carried out in this thesis helps us understand why some hyperparameter design choices might work better than others through the lens of the dataset which we use to train the agent.
1.1 Goal of the Thesis

The primary focus of this study revolves around understanding the essential elements within Model-Based Reinforcement Learning (MBRL) algorithms that effectively tackle the challenge of sparse rewards. The objective of this thesis is to thoroughly examine a current state-of-the-art MBRL algorithm, namely Model-Based Policy Optimization (MBPO), and investigate the impact of various decisions in the algorithm's design when dealing with sparse reward scenarios. Modifications suggested in the literature are used as a starting point for benchmarking performance under sparse rewards in a customized sparse inverted pendulum [CBK20], mountain car continuous [BCP+16a], and a complex DeepMind continuous control environment, cheetah run [TMD+20], which are stochastic in terms of their initial state and are well suited for testing continuous control tasks.

1.2 Thesis Overview

The remainder of this work is organized as follows. Chapter 2 introduces prior research from the literature that is pertinent to our approach. Subsequently, Chapter 3 delves into the necessary technical background, encompassing topics such as MBRL, the Markov Decision Process (MDP), and more. In Chapter 4, we discuss the design of the conducted experiments and their particulars. Chapter 5 encompasses the details of the experimental setup and hardware specifications. Moving forward to Chapter 6, we present the assessments and in-depth analysis of the outcomes derived from the simulated experiments. Transitioning to Chapter 7, we provide a succinct summary of our findings while also offering glimpses into potential avenues for future research.

2 Related Work

This section summarizes prior state-of-the-art literature on solving sparse reward problems. An overview of the literature is provided, as well as a brief discussion of the methodologies employed. The sparse reward problem is a common and difficult challenge that many RL algorithms face in real-world scenarios. It is particularly challenging because there is little signal to learn from, and agents must also rely on accurate models of the environment to make effective decisions. Moreover, the sparse reward setting makes the learning process slower and less efficient. In order to overcome the difficult challenges posed by the sparse reward problem, a wide range of novel ideas in deep reinforcement learning research has emerged.

[VHS+18] proposed a model-free approach using demonstrations collected from a human demonstrator to tackle the sparse reward problem on a high-dimensional robotic control problem. The purpose of using demonstrations is to replace carefully engineered rewards and reduce the exploration burden in sparse reward settings. A hybrid approach using Deep Q-Networks (DQN) was suggested by [GL19], which combines model-free and model-based approaches to help explore rarely seen states and learn environments with sparse rewards more efficiently.

Reward shaping is an effective strategy to help RL algorithms learn more efficiently, in which carefully engineered rewards are introduced into the environment to guide the algorithm towards convergence. [LTA20] propose a reward shaping methodology based on Contrastive Predictive Coding (CPC) by learning predictive representations offline. CPC learns self-supervised representations by predicting the future in latent space with the help of auto-regressive models.
With their algorithm, long-horizon tasks can be addressed with shaped reward signals. Another parallel line of work, proposed by [WWWZ20], introduces a non-expert helper algorithm, where a prior control policy (the non-expert helper) plays an important role in guiding the agent in exploring the state space by dynamically reshaping the learning objectives over time. [CM20] propose PlanGAN, a model-based algorithm for addressing multi-goal tasks in the presence of sparse rewards. The experience collected by an RL algorithm is used to train an ensemble of Generative Adversarial Networks (GAN) to generate multiple plausible experiences, and these experiences are combined in a novel planning algorithm that helps achieve efficient learning. [KCM20] frame policy search as a multi-objective problem, where the objectives are optimized using Pareto-based multi-objective optimization. The proposed approach was able to solve sparse reward tasks within a few episodes. [LWZZ21] take advantage of the model error as an extra reward, which increases the reward density in sparse reward settings and thereby drives exploration. [NHM21] introduce a novel, computationally light model-based approach to tackle the sparse reward problem, which encourages the agent to explore uncertain states by considering a prior model of its goal behavior. To tackle sparse reward problems in continuous action settings, [WWW21] deployed a particle swarm optimization planner as the actor in an actor-critic architecture, which helped regulate the exploration rate. This aided the RL algorithm in identifying rewards in otherwise uninteresting regions of the state space. Motivated by the recent success of value-based methods for approximating state-action values, [RYH+21] utilize radial basis value functions for addressing continuous control robotic manipulation in multi-task sparse reward settings. An effective alternative to reward shaping was proposed by [DDHB22], where model predictive control is utilized as an experience source for training RL algorithms in sparse reward environments. Algorithm complexity increases significantly when demonstrations are involved, as more hyperparameters are introduced and need to be tuned. As an effective alternative to this approach, [WBD+22] introduce a parameter-free modification of standard actor-critic algorithms which computes a modified Q-value and a Monte Carlo estimate of the reward-to-go, as a result increasing learning efficiency in sparse reward tasks. [LWRG22] leverage long-term Q-values and provide richer feedback signals for the actions taken to improve learning efficiency. [SGK22] propose a novel approach of redistributing local and shared global rewards across multiple agents, thereby boosting exploration in multi-agent sparse reward settings. In the following sections, these approaches are broadly grouped into the standard methodologies used to address the sparse reward problem and examined in turn.

2.1 Curiosity-Driven Exploration

The main idea behind curiosity-driven methods is that the RL algorithm is encouraged to visit unseen states in the environment. The intuition behind this approach is that an agent that is curious about its environment will be more likely to discover novel and informative experiences, which can ultimately lead to better performance in the long run.
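As a concrete, hypothetical illustration of this idea, the following minimal sketch adds a prediction-error bonus to the extrinsic reward, in the spirit of the forward-model curiosity methods discussed next. The feature dimensions, network sizes, and scaling factor are placeholders and not taken from any specific published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ForwardModel(nn.Module):
    """Predicts the next-state feature vector from the current features and action."""
    def __init__(self, feat_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, feat, action):
        return self.net(torch.cat([feat, action], dim=-1))

def intrinsic_reward(forward_model, feat, action, next_feat, scale: float = 1.0):
    """Curiosity bonus: the forward model's prediction error on the observed transition."""
    with torch.no_grad():
        pred_next = forward_model(feat, action)
        error = F.mse_loss(pred_next, next_feat, reduction="none").mean(dim=-1)
    return scale * error  # added to the (possibly sparse) extrinsic reward
```

States the model cannot yet predict well receive a larger bonus, which nudges the policy towards unexplored parts of the state space.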
[HCD+16] propose an exploration strategy that maximizes information gain about the algorithm's belief of the environment dynamics, which performs well on continuous control tasks with sparse rewards. [PAED17] propose the Intrinsic Curiosity Model (ICM), where curiosity is formulated as the error in the algorithm's ability to predict the consequences of its own actions in a visual feature space learned by a self-supervised inverse dynamics model. The algorithm is composed of two subsystems: one outputs a curiosity-driven intrinsic reward signal, and a policy outputs a sequence of actions to maximize that reward signal. While the previous approach used a model-free agent to solve the environments, [SRD+20] showed that the idea of curiosity can be combined with MBRL to create an agent that efficiently explores and solves sparse reward tasks. However, blindly pursuing novel states renders the previous methods sample-inefficient, and they may also fail to converge to useful behavior in some cases. To overcome this issue, [LLL+20] formulate a goal-oriented curiosity-driven exploration method with a dynamic initial-state selection mechanism, which achieves a much higher success rate and leads to faster convergence.

2.2 Random Network Distillation

The main idea behind curiosity-driven learning is to build a reward function that is intrinsic to the algorithm, but this approach has a serious drawback known as the "noisy TV problem". This is a common issue in RL where the observations are noisy or incomplete due to environmental factors such as sensor noise, occlusions, or partial observability. In this scenario, the algorithm must learn to distinguish between the relevant and irrelevant parts of the observation in order to make accurate decisions. Imagine an RL algorithm that is rewarded for seeking novel experience: it can be distracted forever by stochastic elements in the environment, because at every timestep the curiosity reward for these unpredictable states remains high and keeps pushing the algorithm to pursue them. Random Network Distillation (RND) addresses this problem by computing a curiosity signal that is not attracted to the stochastic elements of an environment. RND consists of an untrained target network with fixed random weights and a predictor network that tries to predict the target network's output. The target network's feature representation for a given state is fixed, so revisiting the same state always produces the same target output. The predictor network predicts the target network's output for the next state. Therefore, the next state is propagated through both networks, and the predictor network is trained to minimize the mean squared error between the outputs of the two networks. This process distills a randomly initialized neural network (the target) into a trained (predictor) network. To improve exploration in several hard Atari games, [BESK18] combined an RND bonus with a flexible integration of intrinsic and extrinsic rewards, which helped achieve better-than-average human performance. An ensemble-free alternative for quantifying uncertainty was suggested by [NKTK23], which makes use of RND combined with Soft Actor-Critic (SAC) and delivers performance comparable to ensemble-based methods.
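The following minimal sketch illustrates the RND mechanism under simple assumptions (small fully connected networks and a hypothetical observation dimension). It is not the implementation used in [BESK18]; it only shows the core idea of a fixed random target network and a trained predictor whose error serves as the exploration bonus.

```python
import torch
import torch.nn as nn

def make_net(obs_dim: int, out_dim: int = 64) -> nn.Module:
    return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

obs_dim = 8                       # hypothetical observation size
target = make_net(obs_dim)        # fixed, randomly initialized target network
predictor = make_net(obs_dim)     # trained to imitate the target
for p in target.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def rnd_bonus_and_update(next_obs: torch.Tensor) -> torch.Tensor:
    """Train the predictor on a batch of next states and return a per-state novelty bonus."""
    with torch.no_grad():
        tgt = target(next_obs)
    pred = predictor(next_obs)
    per_state_error = ((pred - tgt) ** 2).mean(dim=-1)  # larger error -> more novel state
    loss = per_state_error.mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return per_state_error.detach()                     # used as the intrinsic reward
```

Because the target output for a given state never changes, frequently visited states become predictable and lose their bonus, while purely random observation noise does not systematically attract the agent in the same way a naive forward-model error can.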
2.3 Hindsight Experience Replay

Curiosity-driven methods try to maximize the statistical novelty of states in comparison with previously experienced states. There is a potential pitfall in this approach: variation among the explored states is only observed when there is a significant amount of diversity present among them. When the explored states lack diversity in a large state space, ICM will fail to explore the state space efficiently. In a sparse reward environment, the algorithm finds it difficult to appropriately explore the environment and learn the sequence of steps required to achieve the desired goal. Hindsight Experience Replay (HER) [AWR+17] consists of a buffer that stores a copy of each experienced trajectory and replaces the actual rewards with rewards calculated assuming the goals were the states achieved at the end of the trajectories. As a result, the algorithm is able to explore more effectively and learn intermediate goals that build up to the actual goal. [ZZLL20] leverage demonstrations to accelerate training and propose a new experience replay mechanism to address sparse reward problems in robotic tasks. Sequential object manipulation tasks are extremely difficult in sparse reward settings; [LWD+22] propose relay HER, where the long sequential task is decomposed into multiple sub-tasks. The decomposed sub-tasks are learned efficiently with the help of HER. The concept of HER has also been applied to model-based methods to solve multi-goal tasks in sparse reward settings. [MR21] propose Imaginary-HER, which incorporates imaginary data into policy updates to improve exploration using HER and is endowed with curiosity-based intrinsic rewards. This imaginary data is replaced each time the model is updated. Another parallel work, proposed by [YFH+21], introduces model-based HER, which exploits experiences by leveraging the environmental dynamics to generate imaginary virtual goals. These generated virtual goals allow the algorithm to reinterpret its actions using a different goal under the latest policy.

2.4 Policy Optimization

Policy optimization methods are centered around the policy, a function that maps the agent's state to its next action. These methods view reinforcement learning as a numerical optimization problem where the expected rewards are optimized with respect to the policy's parameters. [LK13] proposed a guided policy search algorithm that has the capability of learning complex policies by incorporating guiding samples into the policy search. A model-free initial policy search is guided by model-based Differential Dynamic Programming (DDP), which samples from a distribution of high-reward trajectories and helps optimize the policy without requiring new on-policy samples. In this line of work on policy optimization, [JFZL21] propose a simple procedure that makes only limited use of the model; in particular, they decouple the model usage from the original task horizon by querying the model only for short rollouts. [CBK20] introduce an optimistic exploration algorithm that augments the control space of the agent, which helps the agent control its epistemic uncertainty over the transition dynamics. They address the problem of greedy and insufficient exploration of MBRL algorithms with probabilistic dynamics models by leveraging the model uncertainty to optimize the policy. As model-based methods often struggle with errors in learned models, [ZLW20] derive an upper bound on the uncertainty in Q-values in order to achieve asymptotic performance comparable to model-free methods. Furthermore, [QPC20] propose an uncertainty-aware trust region policy optimization algorithm that optimizes the policy conservatively, thereby increasing performance and reducing overfitting to inaccurate models.
[CVLH19] formulate a model-free optimistic actor-critic algorithm which facilitates deep exploration by leveraging the predictive uncertainty of the policy performance during policy optimization. They achieve this by using a bootstrap method to estimate the epistemic uncertainty of the Q-function, so that they can adjust the upper confidence bound for the critics based on the principle of optimism in the face of uncertainty. In a similar line of work, [FM21] propose a model-based version by implementing Posterior Sampling for Reinforcement Learning (PSRL) with function approximation and making use of Bayesian Linear Regression (BLR) when fitting transition and reward models. In the end, the authors implement Model Predictive Control (MPC) to optimize the policy under the sampled models in each episode. [FLZB22] introduce an on-policy corrections methodology which uses on-policy transition data together with a learned model in order to make accurate long-term predictions for MBRL. The authors show that when trajectories are generated on-policy with the model, the true state distribution can be recovered, which they formalize by means of a policy improvement bound.

The idea of policy optimization is further extended to offline RL by [YTY+20]. In this paper, the authors present an offline MBRL algorithm that optimizes a policy in an uncertainty-penalized MDP: states with high model uncertainty are penalized, and the policy maximizes the return of this penalized MDP. This allows the algorithm to trade off the benefit of leaving the behavioral distribution against the risk of making mistakes due to model errors.

3 Technical Background

In this chapter, we briefly introduce the standard formalisms and techniques on which the thesis is based, as well as their significance in the machine learning and reinforcement learning communities. In Section 3.1, an introduction to reinforcement learning and the framework in which it operates is provided. This aids in framing the issue of sparse rewards in Section 3.1.5. In Section 3.2, a brief introduction is given to Model-Free Reinforcement Learning (MFRL), and in Section 3.3 to MBRL. Further, a brief discussion of a model-based algorithm, the Model-Based Policy Optimization (MBPO) method, is provided. This method serves as the baseline upon which the rest of the thesis investigation is carried out.

3.1 Reinforcement Learning

Reinforcement learning is one of the three fundamental machine learning paradigms, alongside supervised and unsupervised learning. RL is concerned with making sequences of decisions and considers an intelligent agent situated in an unknown environment. At each timestep, the agent takes an action and receives an observation and a reward. The primary goal of the RL algorithm is to maximize the notion of cumulative reward, given an unknown environment, through a trial-and-error learning process. Section 3.1.4 onwards provides a more detailed description of the mathematical formulation of RL. Figure 3.1 below depicts this very general RL problem involving a reward-maximizing agent. There have been numerous applications of RL algorithms to various fields, including robot control [KBP13]; economics, game theory, operations research and finance [CER20]; and the recent technological trend of large language models [Ope23], where RL is used to fine-tune the model's behavior with human feedback to produce responses that are better aligned with the user's intent.
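In code, one episode of this interaction loop looks roughly like the following sketch, written against the common Gymnasium-style step/reset API. The environment name and the random action choice are placeholders for illustration only; any policy could be substituted for the sampled action.

```python
import gymnasium as gym

env = gym.make("Pendulum-v1")            # any environment exposing the standard API
obs, info = env.reset(seed=0)
episode_return = 0.0
done = False
while not done:
    action = env.action_space.sample()   # a learned policy would map obs -> action here
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    done = terminated or truncated
print("return of one episode:", episode_return)
```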
Figure 3.1: Interaction loop between an RL agent and an environment. The reward and the state resulting from taking an action serve as the input for the next iteration [SB18].

3.1.1 Deep Reinforcement Learning

Most modern machine learning methods are primarily concerned with learning functions from data. Deep Learning (DL) is a sub-field of machine learning that uses artificial neural networks to model and solve complex problems. DL involves a loss function and a non-linear function approximator, and utilizes gradient descent to optimize the parameters between nodes in order to minimize the difference between the predicted output and the actual output. DL has been applied to various fields and has produced some groundbreaking results in computer vision [SPP18], autonomous driving [HC20] and natural language processing [TSK+21]. DL transforms a learning problem into an optimization problem in a straightforward fashion in supervised learning scenarios, but this reduction is less straightforward in RL. The two main caveats of RL with respect to supervised learning are that the dataset is non-stationary (changing over time) and that the data is not independently and identically distributed; instead, the data is strongly correlated. Moreover, the input data to the agent strongly depends on how it behaves in the unknown environment, which makes it difficult to develop straightforward algorithms to solve the task. In RL, we have various choices of what to approximate, such as policies, value functions, and dynamics models, as well as what kind of function approximators to utilize; these choices can also be combined in various ways when solving an RL problem. Deep Reinforcement Learning (DRL) is a subfield of machine learning which studies reinforcement learning using neural networks as non-linear function approximators. DRL incorporates DL, which helps the RL agent make decisions from unstructured input data without the need for manual engineering of the state space. DRL algorithms have the capacity to ingest large datasets to maximize the cumulative reward. DRL has been applied to diverse sets of applications: [SSS+17] mastered the game of Go, playing better than most human experts, utilizing a combination of supervised learning and several RL steps to train deep neural networks, combined with a Monte Carlo tree search algorithm. [MKS+13] present DQN, a classical example of DRL, which scales up conventional Q-learning to tasks with more complex observation spaces. Since then, there has been an explosion of state-of-the-art DRL algorithms applied in various domains. In the upcoming subsections, the individual components which contribute to DRL are discussed in detail.

3.1.2 Markov Decision Process

We consider an RL agent that interacts with an MDP. An MDP is described by the tuple M = ⟨S, A, P, R, 𝜇, 𝛾⟩, where P(𝑠′ | 𝑠, 𝑎) defines a Markovian transition probability density between the current state 𝑠 and the next state 𝑠′ under action 𝑎, 𝜇 : S → [0, 1] is the initial state distribution, 𝑟 : S × A → R is the reward function, and 𝛾 ∈ [0, 1) is the discount factor. Given an MDP, the goal of the RL agent is to learn a specific behavior in an unknown environment.
That is, we want to learn an action selection policy that maximizes the expected cumulative reward.

3.1.3 Policy

The policy 𝜋 determines the agent's behavior by selecting an action given its current state and is formally defined as a mapping from a state 𝑠 to an action 𝑎, 𝜋 : S → A. It is possible for the policy to be a deterministic function, which means it will always return the same action for a given state. Policies can also be stochastic, defined by a distribution 𝜋 : S × A → [0, 1], where a probability is assigned to each action 𝑎 given a state 𝑠. When an agent is in a state, it has a wide variety of actions to choose from, and we need a performance measure to evaluate a given policy. The performance of a policy is quantified by means of its return 𝑅. This return is defined as the sum of all rewards 𝑟_𝑡 received within an episode starting at 𝑡 = 0 and terminating at 𝑡 = 𝑇:

$$R = \sum_{t=0}^{T} r_t \tag{3.1}$$

In the finite horizon setting, the return is calculated over a finite number of timesteps and is therefore bounded. However, to deal with infinite horizon tasks, a discount factor 𝛾 ∈ [0, 1) is introduced to weigh down the contribution of distant future rewards. This discounted return is further used in the calculation of value functions:

$$R = \sum_{t=0}^{\infty} \gamma^t r_t \tag{3.2}$$

3.1.4 Value Function

The RL agent needs to find a policy 𝜋(𝑎 | 𝑠) to take an appropriate action 𝑎 when the agent is in state 𝑠. There are multiple actions the agent can choose from in a particular state, and their quality is captured by the so-called state value function and state-action value function. The state-action value function indicates how beneficial it is to be in a certain state 𝑠 and perform a certain action 𝑎. Formally, the state value function can be defined as the expected return when starting in state 𝑠 and selecting actions according to the policy 𝜋:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right]. \tag{3.3}$$

Correspondingly, the state-action value function is given by

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a\right]. \tag{3.4}$$

When we have access to the reward function and the transition function, the state-action value function can be calculated recursively:

$$\begin{aligned}
Q^{\pi}(s, a) &= \mathbb{E}_{\pi}\left[r_0 + \gamma r_1 + \gamma^2 r_2 + \ldots \mid s_0 = s, a_0 = a\right] \\
&= \mathbb{E}_{\pi}\left[r_0 \mid s_0 = s, a_0 = a\right] + \gamma \mathbb{E}_{\pi}\left[r_1 + \gamma r_2 + \ldots \mid s_0 = s, a_0 = a\right] \\
&= r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, \mathbb{E}\left[r_1 + \gamma r_2 + \ldots \mid s_1 = s'\right] \\
&= r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, Q^{\pi}(s', \pi(s'))
\end{aligned} \tag{3.5}$$

The above equation is called the Bellman equation for the state-action value function. When we consider two policies 𝜋 and 𝜋′, we say 𝜋 ≥ 𝜋′ if and only if 𝑉^𝜋(𝑠) ≥ 𝑉^𝜋′(𝑠) for all 𝑠 ∈ S. Analogously, we can derive a similar Bellman equation for the state value function, which is given by:

$$V^{\pi}(s) = r(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, V^{\pi}(s') \tag{3.6}$$

Therefore, we can assess policies given their value functions and derive an optimal policy 𝜋∗ based on value functions.
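To make the Bellman equation (3.6) concrete, the following minimal sketch performs iterative policy evaluation on a small finite MDP with known transition probabilities and rewards. The MDP, its numbers, and the deterministic policy are hypothetical and chosen only for illustration.

```python
import numpy as np

def policy_evaluation(P, r, pi, gamma=0.9, tol=1e-8):
    """Iterate V(s) = r(s, pi(s)) + gamma * sum_s' P(s' | s, pi(s)) * V(s') until convergence.

    P:  transition probabilities, shape (S, A, S)
    r:  rewards, shape (S, A)
    pi: deterministic policy, shape (S,), integer action per state
    """
    S = P.shape[0]
    V = np.zeros(S)
    while True:
        V_new = np.array([r[s, pi[s]] + gamma * P[s, pi[s]] @ V for s in range(S)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# toy 2-state, 2-action MDP (illustrative numbers only)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[0.0, 1.0],
              [0.0, 2.0]])
pi = np.array([1, 1])          # always take action 1
print(policy_evaluation(P, r, pi))
```

Because 𝛾 < 1, the repeated application of the Bellman operator is a contraction and the iteration converges to the unique value function of the evaluated policy.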
The value function that defines the optimal policy 𝜋∗ is called the optimal state-action value function 𝑄∗(𝑠, 𝑎), which is given by:

$$Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a), \quad \forall s \in \mathcal{S},\, a \in \mathcal{A} \tag{3.7}$$

The optimal state value function 𝑉∗(𝑠) is given by:

$$V^{*}(s) = \max_{\pi} V^{\pi}(s), \quad \forall s \in \mathcal{S} \tag{3.8}$$

And we can define the Bellman optimality equation for the state-action value function as:

$$Q^{*}(s, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, \max_{a'} Q^{*}(s', a') \tag{3.9}$$

Similarly, we can define the Bellman optimality equation for the state value function:

$$V^{*}(s) = \max_{a}\left[r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s')\right] \tag{3.10}$$

And finally, the optimal policy 𝜋∗ is given as:

$$\pi^{*}(s) = \underset{a}{\arg\max}\left[r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s')\right] \tag{3.11}$$

We can write the above equation in a simplified form:

$$\pi^{*}(s) = \underset{a}{\arg\max}\, Q^{*}(s, a) \tag{3.12}$$

3.1.5 Sparse Rewards

Reward functions play an important role in helping the agent learn a better policy. In sparse reward problems, the reward function returns a reward signal which is typically not informative for learning an optimal policy; instead, only a small region of the state-action space provides valuable feedback for learning. Learning becomes very difficult in these scenarios because the algorithm must internally realize which decisions during an episode were mistakes that led to negative rewards after termination. Similarly, after a successful episode, the agent needs to know which actions led to the success. Ideally, RL algorithms are applied to simulated environments where it is possible to define rewards at each timestep that help the agent reach a goal state. Unfortunately, for many problems in real-world settings this is not the case.

Let us consider the previously discussed robotic arm stacking task, represented by a tuple (𝑆, 𝐴, 𝑃, 𝑅, 𝛾), where 𝑆 is the set of possible states in the environment, 𝐴 is the set of possible actions that the agent can take, 𝑃(𝑠′ | 𝑠, 𝑎) represents the transition probability from state 𝑠 to state 𝑠′ when action 𝑎 is taken, 𝑅(𝑠, 𝑎, 𝑠′) is the reward received by the agent when transitioning from state 𝑠 to 𝑠′ by taking action 𝑎, and 𝛾 is the discount factor that determines the importance of future rewards. The reward 𝑅(𝑠, 𝑎, 𝑠′) can be defined to be sparse, that is, rewards are only obtained when the robotic arm is in a specific small region of the state space. This makes learning an optimal policy more difficult. The agent may receive a positive reward of +1 only when it reaches the goal state and zero reward for all other state-action transitions. In this case, the agent has to rely on these sparse positive rewards to learn an effective policy. Mathematically, this can be represented as

$$R(s, a, s') =
\begin{cases}
+1, & \text{if } s' \text{ is the goal state} \\
0, & \text{otherwise.}
\end{cases}$$
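A minimal sketch of such a sparse, goal-based reward function is shown below; the goal coordinates and tolerance are hypothetical values chosen only for illustration, not parameters of the environments used later in this thesis.

```python
import numpy as np

def sparse_reward(next_state: np.ndarray, goal: np.ndarray, tol: float = 0.05) -> float:
    """R(s, a, s') = +1 if s' lies within a small tolerance of the goal, else 0."""
    return 1.0 if np.linalg.norm(next_state - goal) <= tol else 0.0

# example: the gripper position must end up at the target stacking location
goal = np.array([0.3, 0.1, 0.25])                          # hypothetical goal coordinates
print(sparse_reward(np.array([0.31, 0.10, 0.25]), goal))   # 1.0 (success)
print(sparse_reward(np.array([0.00, 0.00, 0.00]), goal))   # 0.0 (no learning signal)
```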
3.2 Model-Free Reinforcement Learning

MFRL is a group of algorithms that learn optimal policies through trial-and-error interactions with an environment, without explicitly modeling the environment's dynamics; i.e., the agent tries to learn from experience rather than building a model of the environment's dynamics and using it to plan ahead. In the MFRL setting, the agent explores the environment and receives observations that represent the state of the environment at a given time. The agent chooses an action based on these observations and receives a reward as feedback on the quality of its decision. The goal of the agent is to learn a policy that maximizes the expected cumulative reward over time. It is possible to achieve this goal with value-based or policy-based algorithms, which are the two main classes of MFRL algorithms [Sil15].

RL algorithms can further be classified as on-policy or off-policy methods; in the context of this thesis, we focus on off-policy learning and the actor-critic framework. On-policy methods update the policy based on the data the agent collects using the current policy, i.e., the data used for learning comes from interactions with the environment under the current policy. As a result, the agent can only update the policy it is currently following. Off-policy methods, on the other hand, learn from data collected by following a different policy. The agent has the flexibility of choosing either a random policy or a previously learned policy for data collection. This flexibility allows the agent to learn more efficiently by reusing past experience. The above-mentioned data collection schemes can be applied to both value-based and policy-based methods.

Value-based methods typically estimate the value of each state or state-action pair and utilize this value to select actions and, in turn, maximize the expected cumulative reward. Policy-based methods, on the other hand, directly learn a policy, mapping each state to a probability distribution over actions. The policy is updated based on the observed rewards, which involves computing the gradient of the policy's performance objective. Policy-based methods can be further classified into gradient-free and gradient-based methods. Gradient-free methods do not rely on gradient information for policy optimization. Instead, they search the policy space directly for optimal policies, utilizing evolutionary algorithms or other search-based optimization techniques such as Particle Swarm Optimization and the cross-entropy method. These methods explore the policy space through trial and error, gradually improving policies over iterations. Gradient-based methods utilize gradients to optimize the policy directly. The goal of these methods is to find a policy that maximizes the expected cumulative reward by iteratively updating the policy in the direction of steepest ascent. Commonly used gradient-based methods include the state-of-the-art policy optimization algorithms Proximal Policy Optimization (PPO) [SWD+17] and Trust Region Policy Optimization (TRPO) [SLM+17]. There are also hybrid methods that combine the advantages of both policy-based and value-based methods, such as Advantage Actor-Critic (A2C), which employs an actor-critic framework. The critic network is used to estimate a value function, and the policy gradient is computed based on an advantage function that guides the policy updates. The actor is responsible for selecting actions based on the current state and adjusts its policy based on the feedback from the critic to improve future actions [SML+18].

Extending this line of work, [HZAL18] proposed the state-of-the-art SAC algorithm, which incorporates stochastic policies, allowing the agent to explore and generate a diverse set of actions. Exploration is encouraged by an entropy regularization term that discourages overly deterministic policies.
By maximizing entropy, SAC seeks a policy that strikes a balance between exploration and exploitation; it builds the basis for the rest of the thesis and is discussed in detail in Section 3.2.1.

3.2.1 Soft Actor-Critic

The Soft Actor-Critic algorithm is a model-free, off-policy, actor-critic reinforcement learning method. While the PPO, A2C and TRPO algorithms learn in an on-policy manner, i.e., they need new sets of samples for each policy update, SAC learns in an off-policy way, i.e., by using experience replay buffers to learn from past data. SAC is primarily applied to continuous action space environments. By incorporating a maximum entropy objective, the SAC algorithm addresses the exploration-exploitation trade-off. The entropy is used as a regularization term in policy optimization and measures the degree of uncertainty in the policy distribution. Additionally, the entropy term helps prevent convergence to sub-optimal policies. The entropy 𝐻 of a policy is defined as

$$H(\pi) = \mathbb{E}_{\pi}\left[-\log \pi(\cdot \mid s_t)\right]. \tag{3.13}$$

The maximum entropy term is incorporated into the standard RL objective, where the agent receives an additional bonus reward at each timestep proportional to the entropy of the policy at that timestep. As a result, the modified RL objective [HZAL18] is given as

$$\pi^{*} = \underset{\pi}{\arg\max}\; \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t \left(r(s_t, a_t) + \alpha H(\pi(\cdot \mid s_t))\right)\right]. \tag{3.14}$$

The temperature parameter 𝛼 controls the stochasticity of the optimal policy and balances the reward and entropy regularization terms. As a result of adding the entropy term to the objective, the state value and state-action value functions are defined as follows:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t \left(r(s_t, a_t) + \alpha H(\pi(\cdot \mid s_t))\right)\right], \tag{3.15}$$

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) + \alpha \sum_{t=1}^{\infty} \gamma^t H(\pi(\cdot \mid s_t))\right]. \tag{3.16}$$

Finally, the modified Bellman equation for the state-action value function is given as

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[r(s, a) + \gamma\left(Q^{\pi}(s', a') + \alpha H(\pi(\cdot \mid s'))\right)\right]. \tag{3.17}$$

SAC simultaneously learns a policy and two Q-functions to improve policy optimization. The two Q-functions are learned independently, and the minimum value between them is utilized both for the policy update and for accurate target value estimation. The two Q-functions provide multiple value estimates, which helps guide policy learning, while the entropy regularization encourages the policy to explore diverse sets of actions, thereby promoting exploration. SAC utilizes the clipped double-Q trick [FHM18] to reduce the overestimation bias of the two Q-functions. The minimum Q-value estimate between the two Q-functions is used by SAC when estimating the target value for updating the Q-functions. As a result, the overestimation bias is mitigated and more accurate value estimates are obtained. During the policy update, SAC updates the policy to choose actions with higher Q-value estimates, again using the minimum of the two Q-functions (cf. Equation (3.20)). SAC trains the Q-networks by minimizing the mean squared error between the Q-value estimate 𝑄_𝜃(𝑠, 𝑎) and a Bellman target, which is given as

$$J_Q(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\left[\left(Q_{\theta}(s, a) - \hat{Q}(s, a)\right)^2\right] \tag{3.18}$$

with the entropy-regularized Bellman update target given as

$$\hat{Q}(s, a) = r(s, a) + \gamma\left(\min_{i=1,2} Q_{\theta'_i}(s', a') - \alpha \log \pi(a' \mid s')\right) \tag{3.19}$$

where 𝑎′ ∼ 𝜋(· | 𝑠′), the term −log 𝜋(𝑎′ | 𝑠′) approximates the entropy of the policy at 𝑠′, and D is the replay buffer storing the samples collected by the agent during training.
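The following sketch shows how the critic loss (3.18) with the target (3.19) might be computed in a PyTorch-style implementation. The actor and critic modules, the actor's sample method, and the batch format are assumptions made for illustration; they are not the exact interfaces of any particular library. The (1 − done) factor, a standard implementation detail for terminal states, is not written out in Equation (3.19).

```python
import torch

def sac_critic_loss(batch, actor, q1, q2, q1_target, q2_target, alpha, gamma=0.99):
    """Clipped double-Q, entropy-regularized Bellman target (Eqs. 3.18-3.19)."""
    s, a, r, s_next, done = batch                     # tensors sampled from replay buffer D
    with torch.no_grad():
        a_next, logp_next = actor.sample(s_next)      # a' ~ pi(.|s'), log pi(a'|s')
        q_next = torch.min(q1_target(s_next, a_next),
                           q2_target(s_next, a_next)) # min over the two target critics
        target = r + gamma * (1.0 - done) * (q_next - alpha * logp_next)
    loss = ((q1(s, a) - target) ** 2).mean() + ((q2(s, a) - target) ** 2).mean()
    return loss
```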
The term min_{i=1,2} 𝑄_{𝜃′_i}(𝑠′, 𝑎′) takes the minimum of the estimates of the two target Q-value networks 𝑄_{𝜃′_1} and 𝑄_{𝜃′_2} for the next state-action pair. Given the stochastic nature of the policy in SAC, the policy objective is formulated so as to maximize the likelihood of actions 𝑎 ∼ 𝜋(· | 𝑠) that result in a high Q-value estimate 𝑄(𝑠, 𝑎). The policy's objective function can thus be defined as

$$\max J_{\pi}(\phi) = \mathbb{E}_{s \sim \mathcal{D}}\left[\min_{i=1,2} Q_{\theta_i}(s, a) - \alpha \log \pi_{\phi}(a \mid s)\right] \tag{3.20}$$

3.3 Model-Based Reinforcement Learning

Following the discussion of MFRL algorithms in the previous sections, this section discusses MBRL, upon which the rest of the thesis is built. In MBRL approaches, an agent explicitly learns a probabilistic model of the environment's dynamics. Various techniques can be used to build the model, such as stochastic models, regression models, and neural networks. This learned model can later be leveraged to solve various RL problems. [CCML18] propose a probabilistic ensemble with trajectory sampling, where the Cross Entropy Method (CEM) is utilized to sample actions from a distribution fitted to previous action samples that yielded high rewards. [SHM+16] use the Monte Carlo Tree Search (MCTS) methodology in AlphaGo: each node in the tree corresponds to a state, which is evaluated based on the returns obtained with the help of a random policy. An alternative strategy is to use the model to generate simulated data for policy learning or value approximation. In this line of work, [FWS+18] proposed Model Value Expansion (MVE), which utilizes n-step temporal difference targets and depends on a fixed horizon. [BHT+19] proposed Stochastic Ensemble Value Expansion (STEVE), which interpolates between different horizons based on the uncertainty calculated using the ensemble networks.

The model can also be used for data augmentation to improve the policy. [LXL+21] proposed Stochastic Lower Bound Optimization (SLBO), which uses a multi-step L2-norm loss to train the dynamics. SLBO uses the model to roll out full-length trajectories from the start state. However, long rollouts typically result in compounding prediction errors, which ultimately hinder the policy optimization process. [JFZL21] introduced MBPO, which makes use of shorter rollout lengths to avoid large compounding errors. MBPO begins its rollouts from states sampled from the real environment and runs n steps according to the policy and the learned model. MBPO uses the model-generated data as a replay buffer to train a standard SAC agent. As such, MBPO can be considered a natural extension of SAC that incorporates model rollouts. SAC is utilized as the agent to update the policy with mixed data from both the real environment and the learned model. MBPO is discussed in detail in Section 3.3.1, and its respective hyperparameters in Section 3.3.2.

3.3.1 Model-Based Policy Optimization

This model-based algorithm is a natural choice for further investigation because it is a direct extension of its model-free counterpart SAC. In this approach, the learned model is used to gather imaginary data to train a policy. To gather this imaginary data, a bootstrap ensemble of dynamics models is utilized, where each member of the ensemble is a probabilistic neural network. The probabilistic neural networks capture aleatoric uncertainty, i.e., noise in the outputs relative to the inputs, while bootstrapping ensures that epistemic uncertainty, i.e., uncertainty in the model parameters, is captured.
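A minimal sketch of one such probabilistic ensemble member is given below: each member predicts the mean and log-variance of a Gaussian over the next-state change and the reward, and epistemic uncertainty arises from disagreement across the bootstrap ensemble. The network sizes and the way the ensemble is held in a plain list are simplified placeholders, not the mbrl-lib implementation.

```python
import torch
import torch.nn as nn

class GaussianDynamicsMember(nn.Module):
    """One ensemble member: predicts a Gaussian over (next-state delta, reward) given (s, a)."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 200):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
        )
        self.mean = nn.Linear(hidden, obs_dim + 1)      # next-state delta and reward
        self.log_var = nn.Linear(hidden, obs_dim + 1)

    def forward(self, s, a):
        h = self.body(torch.cat([s, a], dim=-1))
        return self.mean(h), self.log_var(h)

def sample_transition(member, s, a):
    """Sample (s', r) from the member's predictive Gaussian (aleatoric uncertainty)."""
    mean, log_var = member(s, a)
    std = (0.5 * log_var).exp()
    out = mean + std * torch.randn_like(std)
    return s + out[..., :-1], out[..., -1]              # predicted next state and reward

# epistemic uncertainty comes from disagreement across a bootstrap ensemble of members
ensemble = [GaussianDynamicsMember(obs_dim=3, act_dim=1) for _ in range(5)]
```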
To generate a prediction from the ensemble, a member of the dynamics ensemble is sampled at random, so that different members generate the transitions along a single model rollout. Policy optimization is performed using SAC, adopted from the open-source library mbrl-lib [PAZ+21]. SAC alternates between a policy evaluation step, which estimates the state-action value function, and a policy improvement step. For all the experiments, we use the automatic entropy tuning flag that adaptively modifies the entropy gain based on the stochasticity of the policy. Performing extended model rollouts leads to compounding error problems. To mitigate this problem, the MBPO algorithm propagates short model rollouts starting from the state distribution of a different policy under the true environment dynamics. As a result, a few long trajectories with large errors can be traded for many short trajectories with smaller errors.

Figure 3.2 shows the architecture of an MBPO agent. It consists of two replay buffers for storing real experiences from the stochastic environment and generated experiences from the predictive model. Off-policy SAC is utilized as the policy optimization algorithm. SAC follows the actor-critic framework and is constructed using neural networks: the actor network consists of 2 hidden layers of 64 units each with Tanh activations, and the critic network consists of 2 hidden layers of 256 units each with Tanh activations. The predictive model is represented as an ensemble of probabilistic neural networks whose outputs parameterize a Gaussian distribution. Each individual network of the ensemble consists of 4 hidden layers of 200 units each with SiLU activations [EUD17]. Figure 3.3 shows the data collection process in the MBPO algorithm. For each rollout step, a model is sampled uniformly at random from the ensemble to generate a transition prediction. These transition predictions are stored in the model replay buffer and are further utilized as inputs for the SAC algorithm for policy optimization. Algorithm 3.1 shows the working of the Model-Based Policy Optimization algorithm.

mbrl-lib: https://github.com/facebookresearch/mbrl-lib

Figure 3.2: Detailed architecture depicting the individual components of an MBPO agent, which utilizes SAC as the policy optimization algorithm. The figure depicts the interaction of the SAC agent with the stochastic environment to collect real experiences, which are later utilized to train a predictive model that generates simulated experience. This simulated experience is later utilized as input to the SAC algorithm for policy optimization.

Figure 3.3: Experiences collected in the environment replay buffer are sampled as mini-batches.
Algorithm 3.1 Model-Based Policy Optimization
1: Initialize policy π_φ, predictive model p_θ, critic ensemble {Q_i}_{i=1}^{2}, environment buffer D_t, model buffer D_model
2: global_step ← 0
3: for episode t = 0, ..., T − 1 do
4:     for E steps do
5:         if global_step % F == 0 then
6:             Train model p_θ on D_t via maximum likelihood
7:             for M model rollouts do
8:                 Perform a k-step model rollout starting from s ∼ D_t; add to D_model
9:         Take action in the environment according to π_φ; add to D_t
10:        for G gradient updates do
11:            Update {Q_i}_{i=1}^{2} with mini-batches from D_model via SGD on the critic loss
12:            Update π_φ with mini-batches from D_model via stochastic gradient ascent on the policy objective of Equation (3.20)
13:        global_step ← global_step + 1

3.3.2 Hyperparameters

Table 3.1 provides the main hyperparameters used for the continuous control environments. This section further discusses the specific hyperparameters that define the capacity of the model-generated buffer and that are later used for the ablation studies.

Model rollout length k refers to the number of steps taken by the agent during the rollout phase. A rollout is a simulated sequence of actions, starting from an initial state and following a particular policy until a terminal state or a predefined time horizon is reached. The rollout length determines the length of the prediction horizon generated by the model to estimate the expected returns of various policies.

Model rollouts per step M refers to the number of rollouts generated by the model per training step. This hyperparameter directly controls the amount of data generated by the model and thus the ratio between real environment transitions and simulated transitions. Increasing the number of rollouts per step gathers more simulated experience, increasing the amount of data available for updating the agent's policy.

Frequency of model retrain F refers to how frequently, in environment steps, the model is retrained during the training process. Retraining the model more frequently allows it to adapt to the distribution of states observed in the true environment and improves the model's accuracy.

Updates to retain buffer R refers to the number of model-data-generation iterations that are retained in the model replay buffer. The replay buffer contains past experiences of the agent, typically stored as (state, action, reward, next state, done) tuples. The higher the value of this hyperparameter, the more off-policy (old) data can be stored and sampled for training.

Table 3.1 lists the important hyperparameters used for MBPO-style learning in the continuous control experiments.

Ensemble size N refers to the number of dynamics models used in the ensemble. Each model in the ensemble is a probabilistic neural network whose outputs parameterize a Gaussian distribution [JFZL21],

p_θ^i(s_{t+1}, r | s_t, a_t) = N( μ_θ^i(s_t, a_t), Σ_θ^i(s_t, a_t) ).

Ensemble models are responsible for capturing both aleatoric and epistemic uncertainty.
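As an illustration of how each ensemble member can parameterize such a Gaussian and be trained via maximum likelihood, the following is a minimal sketch with a diagonal covariance; the class name `GaussianMLP` and the loss below are illustrative assumptions, not the exact mbrl-lib implementation.

```python
import torch
import torch.nn as nn

class GaussianMLP(nn.Module):
    """One ensemble member: predicts the mean and log-variance of (s_{t+1}, r)."""

    def __init__(self, in_dim, out_dim, hidden=200, num_layers=4):
        super().__init__()
        layers, dim = [], in_dim
        for _ in range(num_layers):
            layers += [nn.Linear(dim, hidden), nn.SiLU()]
            dim = hidden
        self.body = nn.Sequential(*layers)
        self.mean_head = nn.Linear(hidden, out_dim)
        self.log_var_head = nn.Linear(hidden, out_dim)

    def forward(self, state, action):
        h = self.body(torch.cat([state, action], dim=-1))
        return self.mean_head(h), self.log_var_head(h)

def gaussian_nll(mean, log_var, target):
    """Per-member maximum-likelihood loss (Gaussian negative log-likelihood)."""
    inv_var = torch.exp(-log_var)
    return ((target - mean).pow(2) * inv_var + log_var).mean()
```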
Trainer patience p refers to the parameter used for early stopping during model training: if the validation performance does not improve for p consecutive epochs, training is halted to avoid overfitting the model.

Model-generated buffers are created by using the predictive model to generate additional training data for the reinforcement learning algorithm. They augment the environment dataset and improve the agent's learning performance: the augmented dataset contains imagined actions and outcomes, expanding the range of collected experiences. The capacity of the model-generated buffer D_model is computed as k × M × F × R, so that the buffer holds R iterations of data generation (for instance, with k = 10, M = 10, F = 400 and R = 1, the buffer holds 10 × 10 × 400 × 1 = 40,000 transitions).

Table 3.1: Hyperparameters for MBPO-style learning

    Hyperparameter                              Acronym
    # episodes                                  T
    # steps per episode                         E
    policy updates per step                     G
    # model rollouts per step                   M
    frequency of model retrain (# steps)        F
    # updates to retain buffer                  R
    ensemble size                               N
    rollout length                              k
    trainer patience                            p

3.3.3 MBPO-C Method

This section introduces Model-Based Policy Optimization-Consistent (MBPO-C), a variant of the MBPO approach. Following the suggestion presented in [CCML18], instead of sampling a model at random at every rollout step, MBPO-C uses a single fixed model for the entire rollout and randomly samples only the start states. The rollouts are therefore model-consistent, and we obtain a mean of the value distribution equal to the mean of the true value distribution. The key idea behind this approach is to leverage the diversity of multiple models to improve overall performance: by using a different fixed model for each rollout, we can benefit from the strengths and weaknesses of each model, and the ensemble as a whole helps reduce bias and increase the robustness of the estimated value. Figure 3.4 shows the modified training loop of MBPO with model-consistent rollouts, and Algorithm 3.2 summarizes the Model-Based Policy Optimization-Consistent algorithm.

Figure 3.4: Experiences collected in the environment replay buffer are sampled as mini-batches. In comparison to the MBPO-style rollout in Figure 3.3, a transition prediction is generated using a model that is sampled uniformly at random from the ensemble once per rollout, and the same model is used throughout the entire rollout to generate the predicted experiences. The predicted experiences are stored in the model replay buffer.
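A minimal sketch of the model-consistent rollout of Figure 3.4, mirroring the MBPO sketch above but fixing the ensemble member once per rollout; the object names are again hypothetical placeholders.

```python
import random

def mbpo_c_model_rollout(ensemble, policy, env_buffer, model_buffer, k):
    """Sketch of an MBPO-C k-step rollout: one fixed model for the whole rollout."""
    model = random.choice(ensemble)             # sampled once, reused for all k steps
    state = env_buffer.sample_state()
    for _ in range(k):
        action = policy(state)
        next_state, reward, done = model.predict(state, action)
        model_buffer.add(state, action, reward, next_state, done)
        if done:
            break
        state = next_state
```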
Algorithm 3.2 Model-Based Policy Optimization-Consistent
1: Initialize policy π_φ, predictive model p_θ, critic ensemble {Q_i}_{i=1}^{2}, environment buffer D_t, model buffer D_model
2: global_step ← 0
3: for episode t = 0, ..., T − 1 do
4:     for E steps do
5:         if global_step % F == 0 then
6:             Train model p_θ on D_t via maximum likelihood
7:         Using a fixed model sampled from the ensemble, perform k-step model rollouts starting from s ∼ D_t; add to D_model
8:         Take action in the environment according to π_φ; add to D_t
9:         for G gradient updates do
10:            Update {Q_i}_{i=1}^{2} with mini-batches from D_model via SGD on the critic loss
11:            Update π_φ with mini-batches from D_model via stochastic gradient ascent on the policy objective of Equation (3.20)
12:        global_step ← global_step + 1

3.4 Integral Probability Metrics

This section gives a detailed introduction to the class of statistical metrics called IPMs, which quantify the discrepancy between two probability distributions. In this study, IPMs are used to compare the distribution of environment rewards with the distribution of model-imagined rewards, as well as the empirical state distributions of the environment and the model. The main idea behind incorporating these metrics is to study the impact of the different model-based hyperparameters on the model-generated data by measuring the discrepancy between the model-generated and the environment distributions of observations and rewards. IPMs provide a unified framework for comparing distributions without making specific assumptions about their parametric form.

For example, consider two empirical distributions x and y over a common space and a distance function D that is an IPM. The distance function is a metric when non-negativity, positive definiteness, and symmetry are satisfied:

D(x, y) ≥ 0    (non-negativity),    (3.21)
D(x, y) = 0 if and only if x = y    ((3.21) and (3.22) together give positive definiteness),    (3.22)
D(x, y) = D(y, x)    (symmetry).    (3.23)

This study uses two IPMs [SFG+09]: the Wasserstein Distance (WD), which measures the discrepancy between the distributions of environment and model rewards, and the Maximum Mean Discrepancy (MMD), which measures the discrepancy between the empirical state distributions of the environment and the model.

3.4.1 Wasserstein Distance

The Wasserstein distance, also known as the Earth Mover Distance (EMD), is a distance metric that measures the dissimilarity between two probability distributions. It quantifies the minimum cost required to transform one distribution into the other, where the cost is the amount of mass that needs to be moved from each point of one distribution to its corresponding point in the other distribution. Mathematically, if P is the empirical probability distribution of rewards from an environment dataset X_1, ..., X_n and Q is the empirical probability distribution of rewards from a model dataset Y_1, ..., Y_n of the same size, then the Wasserstein distance between the two distributions is calculated as

W_p(P, Q) = ( Σ_{i=1}^{n} ||X_i − Y_i||^p )^{1/p}.    (3.24)

Figure 3.5: The plot depicts the Wasserstein distance calculation between the environment reward distribution P(X) and the model reward distribution Q(Y). W(P, Q) quantifies the minimum cost required to transform one distribution into the other.
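For the one-dimensional reward distributions considered in this study, the distance in Equation (3.24) with p = 1 can be computed directly with SciPy; the arrays below are made-up stand-ins for rewards drawn from the environment and model buffers.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Hypothetical one-dimensional reward samples from the environment and the model buffers.
env_rewards = np.random.default_rng(0).normal(loc=0.0, scale=1.0, size=1000)
model_rewards = np.random.default_rng(1).normal(loc=0.2, scale=1.1, size=1000)

# 1-Wasserstein (earth mover) distance between the two empirical distributions.
print(f"W1(P, Q) = {wasserstein_distance(env_rewards, model_rewards):.4f}")
```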
In this study, the Wasserstein distance metric is used to compare the reward distributions of the model and the environment. The reward distributions are one-dimensional, and the metric provides a way to capture the underlying structure and spatial relationship between the two reward distributions. The distance is computed with the Python open-source library SciPy [VGO+20], as sketched above. Figure 3.5 shows a visual representation of how the Wasserstein distance is calculated between two reward distributions.

The Wasserstein distance can be extended to higher-dimensional state spaces by considering the amount of mass that needs to be moved in a multidimensional space: the metric then measures the minimum cost of transforming one distribution into the other by moving mass in a way that preserves the total mass and minimizes the total transport distance. In practice, computing the exact Wasserstein distance for high-dimensional distributions is computationally intensive because of the optimization problem involved. As the dimensionality increases, the number of samples required to estimate the Wasserstein distance accurately also grows exponentially. High-dimensional spaces suffer from the curse of dimensionality, where the volume of the space increases exponentially with the number of dimensions; this makes it more challenging to sample from and represent distributions accurately, potentially leading to distorted distance measures. To address the problem of measuring distances in higher-dimensional spaces, Section 3.4.2 introduces the MMD metric.

3.4.2 Maximum Mean Discrepancy Distance

MMD is a kernel-based metric that quantifies the discrepancy between distributions as the distance between mean embeddings of features. Under certain conditions, MMD is zero if the two distributions are identical. The MMD metric does not assume any specific parametric form of the distributions being compared. It maps the samples from each distribution into a higher-dimensional feature space using a feature map and compares the mean embeddings of the distributions in that space. A smaller MMD value indicates a higher similarity between the distributions and vice versa.

Given two probability distributions, let P be the empirical probability distribution of the state space from an environment dataset X_1, ..., X_n and Q the empirical probability distribution of the observation space from a model dataset Y_1, ..., Y_n, and assume we have samples drawn from each distribution. The MMD between P and Q is computed as follows.

First, MMD is defined through a feature map for the environment observation space, ψ : X → H, where H is called a Reproducing Kernel Hilbert Space (RKHS) [Gre13]. MMD starts by mapping the samples from the original input space into this higher-dimensional feature space. The feature maps are elements of a space of functions that satisfy the reproducing property:

⟨f, ψ(x)⟩_H = f(x)    for any f ∈ H.

Second, a kernel function is defined to measure the similarity between pairs of samples in the observation space; the kernel compares the representations of the mapped samples. This combined kernel function for both environment and model observations can be represented as K(x, y) = ⟨ψ(x), ψ(y)⟩_H. Common kernel functions include the Gaussian kernel, the polynomial kernel, and the Laplacian kernel.
The Gaussian and Laplacian kernels are defined in Equations (3.25) and (3.26), respectively:

K_Gaussian(x, y) = exp( −||x − y||² / (2σ²) ),    σ > 0,    (3.25)
K_Laplacian(x, y) = exp( −||x − y|| / (2σ²) ),    σ > 0.    (3.26)

Figure 3.7 shows a comparative study using the Gaussian and the Laplacian kernel function for calculating the MMD distance. The study was performed between the model and environment observation datasets from the Sparse Inverted Pendulum environment. Both kernel types can be used, since both return an unbounded distance value. In general, the Gaussian kernel results in smaller distance values, because the squared Euclidean distance in the exponent leads to faster decay as the distance between data points increases; the Laplacian kernel, which uses the absolute value of the Euclidean distance, decays more slowly and therefore yields larger distance values. As a result, the Gaussian kernel is used in the MMD distance metric to compute the similarity between the mapped representations.

Finally, the MMD distance is computed from the differences and similarities between samples within each distribution and across the distributions. The first term of the MMD captures the similarity within the environment state space: it calculates the average pairwise kernel similarity between samples belonging to the environment state space, with the goal of capturing the inherent structure and patterns of that distribution. Similarly, the second term captures the similarity within the model state space. The first and second terms are, respectively,

⟨E_{X∼P} ψ(X), E_{X'∼P} ψ(X')⟩_H,    (3.27)
⟨E_{Y∼Q} ψ(Y), E_{Y'∼Q} ψ(Y')⟩_H.    (3.28)

The third term measures the dissimilarity between samples across the environment and model state-space distributions; it captures the discrepancy between the distributions and is given by

⟨E_{X∼P} ψ(X), E_{Y∼Q} ψ(Y)⟩_H.    (3.29)

Finally, the MMD distance is calculated by combining Equations (3.27), (3.28), and (3.29), which gives an overall measure of the difference between the distributions. The weights in front of each term ensure that the contributions are properly balanced, and the combined equation is

MMD²(P, Q) = ⟨E_{X∼P} ψ(X), E_{X'∼P} ψ(X')⟩_H + ⟨E_{Y∼Q} ψ(Y), E_{Y'∼Q} ψ(Y')⟩_H − 2 ⟨E_{X∼P} ψ(X), E_{Y∼Q} ψ(Y)⟩_H.    (3.30)

The above equation can be written compactly as

MMD²(P, Q) = || E_{X∼P} ψ(X) − E_{Y∼Q} ψ(Y) ||²_H.    (3.31)

Figure 3.6 shows a visual representation of how the MMD is calculated by mapping the observations into an RKHS.

Figure 3.6: The figure depicts the MMD distance calculation between the environment state space and the model state space. The MMD metric maps the samples from each distribution into a higher-dimensional Reproducing Kernel Hilbert Space (RKHS) using a feature map and compares the mean embeddings of the distributions in the RKHS.

Figure 3.7: Comparison of the Gaussian and Laplacian kernel functions applied to the MMD distance metric for comparing the disparity between model-observed states and environment states of the sparse pendulum environment. (a) Gaussian kernel. (b) Laplacian kernel.
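A biased empirical estimate of Equation (3.30) with the Gaussian kernel can be sketched as follows; the bandwidth value and the example arrays are illustrative assumptions, not the exact configuration used in the experiments.

```python
import numpy as np

def gaussian_kernel(x, y, sigma):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2)) for all pairs of rows."""
    sq_dists = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def mmd_squared(env_states, model_states, sigma=1.0):
    """Biased empirical estimate of MMD^2(P, Q) in Equation (3.30)."""
    k_xx = gaussian_kernel(env_states, env_states, sigma).mean()
    k_yy = gaussian_kernel(model_states, model_states, sigma).mean()
    k_xy = gaussian_kernel(env_states, model_states, sigma).mean()
    return k_xx + k_yy - 2.0 * k_xy

# Hypothetical 3-dimensional state samples from the environment and the model.
rng = np.random.default_rng(0)
env_states = rng.normal(size=(500, 3))
model_states = rng.normal(loc=0.1, size=(500, 3))
print(f"MMD^2 = {mmd_squared(env_states, model_states):.4f}")
```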
4 Methodology

In the preceding chapters, we described the elements of reinforcement learning, including their classification into model-free and model-based approaches, and provided an in-depth exploration of the various hyperparameters employed in the model-based setting. Furthermore, we gave a comprehensive introduction to IPMs, which are subsequently used in this study to quantify the disparity between the distributions of observations and rewards under the model and in the actual environment.

Following a thorough examination of the existing literature, the primary research objective of this thesis is to identify the core components of MBRL algorithms and design specific experiments to understand the impact of these components in isolation under sparse reward scenarios. A non-exhaustive list of MBRL design choices includes the diversity of model-generated data, training the model for one-step or multi-step prediction, the amount of model-generated data, the frequency of retraining the model, the length of rollouts, and the amount of off-policy data stored inside the model buffer. For this purpose, MBPO is chosen as the model-based algorithm for extensive investigation in this study. The primary benefits of opting for this model-based algorithm are its superior data efficiency, its flexibility in adjusting hyperparameters, and its ability to operate in a manner that closely resembles a model-free setting. This analysis of design decisions aims to establish guiding principles for addressing the challenges posed by sparse reward scenarios through the use of model-based approaches.

Integral probability metrics are employed to measure the impact of the design choices. This is achieved by evaluating the discrepancy between the distributions of observations and rewards perceived by the model and the state and reward distributions of the real environment. Specifically, the Wasserstein distance metric is employed to quantify the discrepancy between the rewards observed by the model and the actual reward distribution present in the environment, while the maximum mean discrepancy distance metric is used to assess the dissimilarity between the state distribution as observed by the model and the state distribution of the actual environment. These metrics help illuminate the model's interpretation of rewards and states under the different components of the MBRL algorithm. Moving forward, Chapter 5 outlines the experimental configuration, while Chapter 6 presents a comprehensive examination and analysis of the experiments carried out to explore the various design choices.

5 Experimental Setup

This section gives a brief introduction to the continuous control environments in which the effect of the hyperparameters is evaluated over continuous state-action spaces. In contrast to discrete control environments, where the agent chooses actions from a finite set of possibilities, continuous control environments allow the agent to choose from a much broader set of actions, represented as continuous variables such as real numbers or vectors. Examples of continuous control environments used to train RL agents include OpenAI Gym [BCP+16a] and the DeepMind Control Suite [TMD+20].

5.1 Hardware Specifications

The experiments are carried out on an in-house High Performance Cluster (HPC).
The hardware configuration is as follows. The Central Processing Unit (CPU) is an Intel Xeon Gold 6148 with 20 cores, running at 2.40 GHz nominally and at up to 3.70 GHz in turbo mode. The Graphics Processing Unit (GPU) is an NVIDIA Tesla V100 SXM with 32 GB of memory and a memory bandwidth of 900 GB/s; the GPU provides 5120 CUDA cores and 640 NVIDIA Tensor cores.

Figure 5.1: Illustration of the continuous control environments on which the different design choices of the MBPO algorithm are tested: (a) Inverted Pendulum, (b) Mountain Car Continuous, (c) Cheetah Run.

5.2 Inverted Pendulum

The Inverted Pendulum environment with sparse rewards was proposed by [CBK20]. In this task, the pendulum is always initialized in the downward position with zero velocity, and the goal of the RL agent is to apply appropriate torque to swing the pendulum up to the upright position. Figure 5.1a shows the inverted pendulum environment.

5.2.1 State Space

The state space of the sparse inverted pendulum environment is continuous and consists of the angular position of the pendulum, given by the sine and cosine of the angle, and its angular velocity. Table 5.1 lists the elements of the state space.

Table 5.1: State space specification of the Sparse Inverted Pendulum environment

    State                Min       Max
    x = cos(angle)       -18.19    18.19
    y = sin(angle)       -71.81    71.81
    Angular velocity     -0.5      0.5

5.2.2 Action Space

The action space of the sparse inverted pendulum environment is continuous. The action represents the torque applied at the base of the pendulum. The action values range from -1.0 to 1.0, where -1.0 corresponds to maximum torque in one direction, 1.0 corresponds to maximum torque in the opposite direction, and 0.0 corresponds to no torque.

5.2.3 Rewards

The inverted pendulum environment is considered sparse because the agent is only rewarded with a positive value, given as cosine_angle_tolerance × velocity_tolerance, while the pendulum remains within both the angle and the velocity threshold. To make the problem harder, an action cost of 0.1 is introduced: a cost proportional to the applied torque is added to the state reward. This action cost counteracts the effect of exploration signals; as a result, the agent receives a penalized reward if the pendulum falls or deviates from the upright position. The action cost, i.e., the cost associated with taking specific actions, is incorporated into the reward function and influences the agent's behavior by encouraging actions that minimize the overall cost. This sparse pendulum setup, together with the action cost, makes the task challenging, as the agent needs to find effective policies despite the negative signal coming from the action cost. An episode terminates after 400 steps.

5.3 Mountain Car Continuous

The Mountain Car Continuous environment is part of the classic control environments from OpenAI Gym [BCP+16b]. A car is placed stochastically at the bottom of a valley, and the goal of the agent is to make the car reach the flag at the top of the hill by applying appropriate acceleration. Figure 5.1b shows the Mountain Car Continuous environment.

5.3.1 State Space

The state space of the Mountain Car Continuous environment consists of two variables: the position of the car along the x-axis and its velocity along the x-axis. Table 5.2 lists the elements of the state space.
Table 5.2: State space specification of the Continuous Mountain Car environment

    State                                    Min      Max
    Position of the car along the x-axis     -1.2     0.6
    Velocity of the car                      -0.07    0.07

5.3.2 Action Space

The action space of the Mountain Car environment is continuous: the agent applies an acceleration with an action value ranging from -1.0 to 1.0, where -1.0 corresponds to full acceleration to the left, 1.0 corresponds to full acceleration to the right, and 0.0 corresponds to no acceleration. The action value is finally multiplied by a power coefficient of 0.0015.

5.3.3 Rewards

The Mountain Car Continuous task is considered solved when the car reaches a position greater than or equal to 0.45 at the top of the hill, for which it receives a sparse reward of 100. In addition, the agent receives a negative reward of −0.1 × action², encouraging it to reach the goal with the minimum amount of effort. Each episode lasts 1000 steps.

r_t = −0.1 × action²    if position < 0.45,
r_t = 100               if position ≥ 0.45.

5.4 Cheetah Run

The Cheetah Run environment is a continuous control environment in which the goal is to train the agent to make the cheetah reach its maximum forward velocity; the task is to optimize the control policy for high-speed running. This is a dense reward environment, and a parallel study is conducted on it to observe whether the same set of hyperparameters is responsible for the performance of the model-based agent in a dense reward setting.

5.4.1 State Space

The state space of Cheetah Run has 17 dimensions and is high-dimensional in comparison with the previously introduced environments. The body position represents the overall position of the cheetah, and the body velocity represents the rate at which the cheetah moves in the environment; both position and velocity are unbounded. Table 5.3 lists the elements of the state space.

Table 5.3: State space specification of the Cheetah Run environment

    State
    Body position of the cheetah
    Body velocity of the cheetah

5.4.2 Action Space

The action space is also continuous and 6-dimensional. The goal of the RL agent is to apply torques that actuate the cheetah's joints and make the cheetah run forward with the maximum possible velocity.

5.4.3 Rewards

The Cheetah Run task is completed when the agent makes the cheetah achieve a velocity greater than or equal to 10, for which it receives a reward of 1; otherwise, the agent receives a reward between 0 and 1:

r_t = 0.1 × velocity    if 0 ≤ velocity < 10,
r_t = 1                 if velocity ≥ 10.

6 Evaluations and Analysis

In this section, we conduct several ablation studies to analyze the behavior of MBPO and discuss the experimental results in detail. Each experiment was conducted with 20 random seeds in order to evaluate the results properly. The experimentation in this section uses SAC [HZAL18] as the baseline model-free algorithm and MBPO [JFZL21] as the baseline model-based algorithm. The evaluation return plots presented in this section are computed by collecting the evaluation returns across several runs, calculating the mean score, and finally smoothing with a moving-average filter of window size 5. The confidence interval regions are produced by calculating the standard error of the mean for the smoothed returns and plotting the mean with ±1 standard error.
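The following is a minimal sketch of how such curves can be produced from raw per-seed evaluation returns, under the (assumed) reading that the mean and standard error are smoothed with the same window; the function name and array shapes are illustrative.

```python
import numpy as np

def evaluation_curve(returns, window=5):
    """returns: array of shape (num_seeds, num_evaluations).

    Produces the smoothed mean return and a +/- 1 standard-error band,
    roughly as described for the evaluation plots in this chapter.
    """
    kernel = np.ones(window) / window
    mean = np.convolve(returns.mean(axis=0), kernel, mode="valid")
    sem = returns.std(axis=0, ddof=1) / np.sqrt(returns.shape[0])
    sem = np.convolve(sem, kernel, mode="valid")
    return mean, mean - sem, mean + sem
```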
By calculating the confidence interval, we can estimate the range of values within which the true population mean is likely to fall with a certain level of confidence. All the algorithms were implemented using MBRL-Lib [PAZ+21].

In Section 6.1, we conduct experiments to evaluate the performance of SAC and MBPO agents in the SparsePendulum-v0 and MountainCarContinuous-v0 sparse reward environments. This study compares how a model-free SAC agent and a model-based MBPO agent behave in sparse reward environments. In Section 6.2, hyperparameter ablations are conducted on the rollout length (k), model rollouts per step (M), updates to retain buffer (R), ensemble size (N), frequency of model retrain (F), and trainer patience. The goal is to examine why some hyperparameter settings work better than others through the lens of the dataset that is used for training the agent; the IPMs are utilized to better analyze the behavior of the agent during training. The hyperparameter ablation study is also conducted in parallel on the MBPO-C agent, to evaluate the performance of this MBPO variant in sparse reward scenarios. In Section 6.3, a hyperparameter ablation study is performed in the dense reward CheetahRun-v0 environment to compare the behavior of the MBPO agent in a higher-dimensional dense reward setting. Finally, we conduct an experiment in which the MBPO agent is deliberately configured to perform close to the SAC agent in sparse reward scenarios. Starting from this reduced performance, we conduct a targeted hyperparameter study to understand which specific hyperparameter, or combination of hyperparameters, is responsible for the performance boost in sparse reward scenarios.

6.1 SAC vs MBPO Performance

This section examines the behavior of the model-free SAC and model-based MBPO agents in the SparsePendulum-v0 and MountainCarContinuous-v0 sparse reward environments. In these experiments, the MBPO agent uses the hyperparameter settings presented in Table 6.1. The model-free SAC agent uses the same configuration of the policy network and Q-network as presented in Table 6.1, but without the model-based hyperparameters. We perform a comparative study to observe the effect of introducing the action cost and how both the model-free and the model-based agent manage to solve the sparse reward task in the environment.

The goal of introducing an action cost in the sparse pendulum environment is to make the sparse reward problem more difficult to solve. An action cost refers to a penalty or negative value associated with taking a specific action in a particular state of the environment. Action costs are often used in scenarios where taking certain actions has drawbacks that the agent should consider when making decisions; these costs can influence the agent's policy by encouraging it to choose actions that minimize the cumulative cost over time. No additional action cost was introduced in the continuous mountain car environment, since it already has an action cost built in. The results of the experiments are shown in Figure 6.1. In Figure 6.1a, where no action cost is applied, i.e., there is no penalty for exploring diverse actions that do not help in finding the reward, both the model-free SAC and the model-based MBPO agent were able to find the sparse reward and converge to an effective policy for the majority of the seeds.
Another key observation from the plot is that the model-based agent needs fewer environment steps to converge to an optimal policy, whereas the model-free agent is less sample efficient. In stark contrast, when the action cost is applied to the sparse pendulum problem, i.e., a penalty is given for explorations that do not help in finding the sparse reward, the picture changes. The SAC algorithm incorporates unstructured exploration to encourage the agent to explore new states and actions in the environment. Unstructured exploration in this context refers to exploring the state/action space in a random and undirected manner, rather than relying solely on the policy optimization process. Through the combination of entropy regularization and exploration noise, SAC strikes a balance between structured exploration (encouraging diverse action selection) and unstructured exploration (encouraging random exploration in the action space). The results show that the state-of-the-art model-free SAC algorithm struggles to explore and find optimal or near-optimal policies in sparse reward scenarios because of this unstructured exploration. In contrast, the model-based MBPO agent manages to find the sparse reward in nine of the twenty seeds launched: by utilizing the learned model, MBPO can explore via model rollouts, which helps with effective exploration and convergence to optimal policies.

A similar comparative study between the model-free and model-based agents was performed on the challenging continuous version of the mountain car problem. It is evident from this study that the state-of-the-art SAC algorithm has limitations, since it performs unstructured exploration by injecting noise into action selection, which is insufficient for tasks such as continuous mountain car and the sparse pendulum that require sustained exploration to find the solution [RKS21] [EHPM22]. MBPO, on the other hand, solves the problem for roughly half of the seeds.

Figure 6.1: (a) Learning curves of SAC and MBPO agents with and without action cost in the sparse pendulum environment. (b) Learning curves of SAC and MBPO agents in the sparse continuous mountain car environment. Results show average returns over 20 random seeds, smoothed by a moving-average filter; we report the mean (solid lines) and standard error (shaded regions). All runs use the hyperparameter settings in Table 6.1.

6.2 Hyperparameters Ablation Study

In this section, an MBPO hyperparameter ablation study is conducted for both the SparsePendulum-v0 and MountainCarContinuous-v0 sparse reward environments. All experiments were carried out with an action cost of 0.1, and the results show average returns over 20 random seeds, smoothed by a moving-average filter; we report the mean (solid lines) and standard error (shaded regions). The results of the ablation study, together with the metrics, are discussed in the following subsections. We use the hyperparameter settings from Table 6.1, which were also used for the initial experiments in the sparse pendulum (Figure 6.1a) and the mountain car (Figure 6.1b).
We use the average return of the sparse pendulum environment with action cost as the baseline for the remainder of the hyperparameter ablation study, and proceed analogously for the mountain car environment.

Table 6.1: MBPO hyperparameter settings for continuous control experiments (values listed only once are shared across the environments)

    Hyperparameter                              Sparse Pendulum              Mountain Car                 Cheetah Run
    # environment steps                         30e3                         50e3                         250e3
    T - # episodes                              75                           50                           250
    E - # steps per episode                     400                          1000
    G - policy updates per step                 20                           10
    M - # model rollouts per step               10
    F - frequency of model retrain (# steps)    400                          250
    R - # updates to retain buffer              1                            10
    N - ensemble size                           5
    k - rollout length                          10                           5
    trainer patience                            1                            5
    learning rate                               3e-4                         1e-3
    Policy network                              2 layers, 64 units, Tanh     2 layers, 128 units, Tanh
    Q network                                   2 layers, 256 units, Tanh
    Model network                               4 layers, 200 units, SiLU

The MMD distance metric plots presented in the hyperparameter ablation study indicate the distance between the distributions of the environment and the model state space. A higher MMD value indicates that the distributions are dissimilar, while a value closer to zero indicates that they are similar. Similarly, the Wasserstein distance metric plots show the distance between the distributions of environment and model rewards: a higher value indicates that the two distributions differ, and a value closer to zero indicates that they are similar. All distance metric plots presented in the hyperparameter ablation study are clipped to the region where the distinction between the curves is clearly observable; as training progresses, all values remain more or less constant.

6.2.1 Rollout Length Ablation

In this subsection, we present the results of an ablation study on the rollout length hyperparameter. The study was conducted using the MBPO and MBPO-C style methods in the sparse pendulum and mountain car environments. We performed experiments with k = {1, 3, 7, 10} and kept the rest of the hyperparameters fixed to the values given in Table 6.1. The rollout length determines the length of the prediction horizon generated by the model to estimate the expected returns of various policies [BTZ22]. A longer rollout length emphasizes increased exploration of unfamiliar states and may lead to better policies, but the accuracy of predictions typically decreases for longer rollouts due to distributional shift, i.e., the learned model is queried out of distribution. A shorter rollout length, on the other hand, emphasizes exploitation: the agent focuses on immediate rewards and makes decisions based on its current state.

In the results presented in Figure 6.2, the top row shows the rollout length ablation study performed in the sparse pendulum environment and the bottom row the ablation study performed in the mountain car environment. We observe from the learning curves that higher values of the rollout length resulted in higher returns in both sparse reward environments compared to smaller values. A rollout length of 1 means that the model takes a single action, observes the resulting state and reward, and then makes the next decision. This leads to a lack of exploration because, in continuous control environments, the model often has numerous potential actions available in any given state.
By focusing on just one action, the agent may overlook opportunities to explore alternative actions that could yield superior results over time. Furthermore, the agent can become trapped in local optima, making narrow-minded choices that enhance its immediate reward but hinder its progress towards more favorable states in the future. We confirmed this hypothesis in the mountain car environment, where the learning curve with k = 1 did not yield better returns than the higher rollout lengths. Interestingly, the sparse pendulum environment showed positive returns even with a rollout length of 1. The hypothesis is nevertheless still supported, as higher rollout lengths led to even greater returns; this suggests that other factors also contribute to the positive returns observed in the sparse pendulum environment.

In the W-distance metric plots (center column) for the sparse pendulum environment, we observe that the difference between the reward distributions generated by the model and the environment is most pronounced at the start of training. As training advances, the Wasserstein distance for the various rollout lengths gradually converges; the reported results are truncated because this convergence behavior persisted for the remaining environment steps. In the mountain car environment, a greater rollout length led to the largest disparity between the model-generated and environment reward distributions, and consequently to a higher overall Wasserstein distance. This trend mirrors the observation made in the sparse pendulum environment. It can be attributed to the fact that, at the beginning of training, the model visits unfamiliar states during longer rollouts and receives rewards based on the actions taken from those unfamiliar states; hence the discrepancy between the model-generated and environment distributions at the beginning of training and the eventual convergence towards the end.

The MMD distance metric plots (right column) indicate that higher rollout lengths produced the maximum discrepancy between the model-generated and environment observation state distributions in both sparse reward environments. As training progressed, the MMD distance decreased, which indicates the distribution of states genera