Institute for Artificial Intelligence Systems - MLS
University of Stuttgart
Universitätsstraße 32
D–70569 Stuttgart

Master's Thesis

Model-Based Reinforcement Learning under Sparse Rewards

Ravi Akash

Study program: Information Technology
Examiner: Prof. Dr. Mathias Niepert
Supervisors: Carlos E. Luis, M.Sc., Dr. Ing. Felix Berkenkamp (Bosch Center for Artificial Intelligence)
Commenced: February 15, 2023
Completed: August 14, 2023

Abstract

Reinforcement Learning (RL) has seen significant advances over the last decade in simulated and controlled environments. RL has shown impressive results in difficult decision-making problems such as playing video games or controlling robot arms. However, most methods require many interactions with the system in order to achieve good performance, which can be costly and time-consuming, especially in industrial applications. Model-Based Reinforcement Learning (MBRL) promises to close this gap by leveraging learned environment models and using them for data generation and/or planning while, at the same time, remaining sample-efficient. However, learning with sparse rewards remains a significant challenge in the field of RL. In order to promote efficient learning, the sparsity of rewards must be addressed. This thesis studies individual components of MBRL algorithms under sparse reward settings and investigates different design choices to measure their impact on learning efficiency. Suitable Integral Probability Metrics (IPM) are introduced to understand the model's reward and observation-space distributions during training. These design combinations are evaluated on continuous control tasks with established benchmarks.

Kurzfassung

RL has seen significant advances over the last decade in simulated and controlled environments. RL has achieved impressive results in difficult decision-making problems, such as playing video games or controlling robot arms, especially in industrial applications in which most methods require many interactions with the system to achieve good performance. This can be costly and time-consuming. MBRL promises to close this gap by using learned environment models for data generation and/or planning while at the same time aiming for high sample efficiency. However, the challenge of learning with sparse rewards remains a significant problem in the field of RL. To promote efficient learning, the sparsity of rewards must be addressed. This thesis examines individual components of MBRL algorithms under sparse reward conditions and investigates different design decisions to measure their impact on learning efficiency. Suitable IPM are introduced to understand the model's reward and observation-space distributions during training. These design combinations are evaluated on continuous control tasks with established benchmarks.

Contents

1 Introduction
  1.1 Goal of the Thesis
  1.2 Thesis Overview
2 Related Work
  2.1 Curiosity-Driven Exploration
  2.2 Random Network Distillation
  2.3 Hindsight Experience Replay
  2.4 Policy Optimization
3 Technical Background
  3.1 Reinforcement Learning
  3.2 Model-Free Reinforcement Learning
  3.3 Model-Based Reinforcement Learning
  3.4 Integral Probability Metrics
4 Methodology
5 Experimental Setup
  5.1 Hardware Specifications
  5.2 Inverted Pendulum
  5.3 Mountain Car Continuous
  5.4 Cheetah Run
6 Evaluations and Analysis
  6.1 SAC vs MBPO Performance
  6.2 Hyperparameters Ablation Study
  6.3 Dense Reward Cheetah Run Environment
  6.4 Reduced MBPO Performance
7 Discussion and Conclusion
  7.1 Results
  7.2 Outlook
  7.3 Conclusion
Bibliography

List of Figures

3.1 Interaction between an RL agent and an environment
3.2 Illustration of the detailed architecture of an MBPO agent
3.3 Illustration of collection and storage of model-generated data inside the model replay buffer of an MBPO agent
3.4 Illustration of collection and storage of model-generated data inside the model replay buffer of an MBPO-C agent
3.5 Illustration depicting Wasserstein distance calculation
3.6 Illustration depicting Maximum Mean Discrepancy distance calculation
3.7 Comparison of Gaussian and Laplacian kernels applied to the MMD distance metric
5.1 Illustration of continuous control environments
6.1 Performance comparison of SAC and MBPO agents in both sparse pendulum and continuous mountain car environments
6.2 Ablation study over rollout length hyperparameter using MBPO agent
6.3 Ablation study over rollout length hyperparameter using MBPO-C agent
6.4 Visualization of reward distribution and pdf during training in sparse pendulum swing-up task using MBPO agent
6.5 Ablation study over the number of rollouts per step hyperparameter using MBPO agent
6.6 Ablation study over the number of rollouts per step hyperparameter using MBPO-C agent
6.7 Ablation study over the number of updates to retain buffer hyperparameter using MBPO agent
6.8 Ablation study over the number of updates to retain buffer hyperparameter using MBPO-C agent
6.9 Ablation study over ensemble size hyperparameter using MBPO agent
6.10 Ablation study over frequency retrain hyperparameter using MBPO agent
6.11 Ablation study over trainer patience hyperparameter using MBPO agent
6.12 Ablation study over rollout length hyperparameter using MBPO agent in dense reward Cheetah Run environment
6.13 Ablation study over the number of rollouts per step using MBPO agent in dense reward Cheetah Run environment
6.14 Ablation study over the number of updates to retain buffer hyperparameter using MBPO agent in dense reward Cheetah Run environment
6.15 Learning curves representing the reduced performance of an MBPO agent operating close to a model-free SAC agent
6.16 Comparative metric study between MBPO and MBPO close to SAC setting in Continuous Mountain Car and Sparse Pendulum environments
6.17 Ablation study over ensemble size and rollout length hyperparameters using MBPO agent operating close to a model-free SAC agent in Sparse Pendulum environment
6.18 Ablation study over ensemble size and rollout length hyperparameters using MBPO agent operating close to a model-free SAC agent in Mountain Car environment
6.19 Ablation study over the number of rollouts per step and frequency retrain hyperparameters using MBPO agent operating close to a model-free SAC agent in Sparse Pendulum environment
6.20 Ablation study over the number of rollouts per step and frequency retrain using MBPO agent operating close to a model-free SAC agent in Mountain Car environment
6.21 Ablation study over rollout length and the number of rollouts per step hyperparameters using MBPO agent operating close to a model-free SAC agent in Sparse Pendulum environment
6.22 Ablation study over rollout length and the number of rollouts per step hyperparameters using MBPO agent operating close to a model-free SAC agent in Mountain Car environment
6.23 Ablation study over longer rollout length using MBPO agent operating close to a model-free SAC agent in both sparse pendulum and continuous mountain car environments
6.24 Ablation study over updates to retain buffer using MBPO agent operating close to a model-free SAC agent in both sparse pendulum and continuous mountain car environments

List of Tables

3.1 Hyperparameters for MBPO-style learning
5.1 Sparse Inverted Pendulum State Space
5.2 Continuous Mountain Car State Space
5.3 Cheetah Run State Space
6.1 MBPO Hyperparameter settings for Continuous Control Experiments
6.2 MBPO Hyperparameter settings reduced close to SAC Performance
List of Algorithms

3.1 Model-Based Policy Optimization algorithm
3.2 Model-Based Policy Optimization-Consistent algorithm

Acronyms

A2C Advantage Actor-Critic
BLR Bayesian Linear Regression
CEM Cross Entropy Method
CPC Contrastive Predictive Coding
CPU Central Processing Unit
DDP Differential Dynamic Programming
DL Deep Learning
DQN Deep Q-Networks
DRL Deep Reinforcement Learning
EMD Earth Mover Distance
GAN Generative Adversarial Networks
GP Gaussian Process
GPU Graphics Processing Unit
HER Hindsight Experience Replay
HPC High Performance Cluster
ICM Intrinsic Curiosity Model
IPM Integral Probability Metrics
MBPO Model-Based Policy Optimization
MBPO-C Model-Based Policy Optimization-Consistent
MBRL Model-Based Reinforcement Learning
MCTS Monte Carlo Tree Search
MDP Markov Decision Process
MFRL Model-Free Reinforcement Learning
MMD Maximum Mean Discrepancy
MPC Model Predictive Control
MVE Model Value Expansion
NN Neural Networks
PPO Proximal Policy Optimization
PSRL Posterior Sampling for Reinforcement Learning
RKHS Reproducing Kernel Hilbert Space
RL Reinforcement Learning
RND Random Network Distillation
SAC Soft Actor-Critic
SLBO Stochastic Lower Bound Optimization
STEVE Stochastic Ensemble Value Expansion
TRPO Trust Region Policy Optimization
WD Wasserstein Distance

1 Introduction

RL has become a popular paradigm of machine learning over the last decade, in which intelligent agents learn to act in some environment with the objective of maximizing cumulative rewards. The agents learn optimal strategies directly by interacting with an a priori unknown dynamical system [SB20]. RL has garnered a lot of attention and popularity in the research community due to its success in mastering the game of Go better than any human player [SSS+17], winning most of the Atari games it has been deployed to play, and, most recently, DeepMind's StarCraft II [VEB+17]. RL algorithms are mainly categorized into two classes: MFRL and MBRL. The key difference between them is that MFRL learns a policy by directly interacting with the environment, while in MBRL the agent learns an approximate model of the environment's dynamics and uses this model for policy optimization or planning. The key advantage of MBRL over MFRL is that the agent can first learn the model of the environment and then make informed decisions when interacting with the actual environment. This helps MBRL achieve more sample-efficient learning and faster convergence to optimal policies. The learned model can also be used for planning and exploration, allowing the agent to explore various scenarios and hypothetical actions without risking real interactions with the environment.

A fundamental challenge in RL is dealing with sparse rewards. It is crucial to study sparse reward problems, since they often appear in real-world tasks and are easy to design without domain knowledge. A sparse reward can be specified as long as there is a defined state-based criterion for success (e.g., a goal location is reached), since there are no rewards anywhere except in the region of the state space that meets the success criterion. In most real-world cases it would be hard to specify dense rewards, since we only have limited knowledge about the system [DMH19].
Most traditional RL and MFRL algorithms may fail to solve sparse reward problems because of the absence of intermediate rewards that drive the agent's exploration. MBRL can be particularly helpful in solving sparse reward tasks, since it offers the promise of leveraging the uncertainty of the learned dynamics model to drive exploration toward interesting regions of the state space. One of the pioneers of the field of reinforcement learning shared his view on MBRL and how it certainly plays an essential role in shaping the research ahead:

"The next big step forward in AI will be systems that actually understand their worlds. The world is only accessed through the lens of experience, so to understand the world means to be able to predict and control your experience, and your sensory data, with some accuracy and flexibility. In other words, understanding means forming a predictive model of the world and using it to get what you want. This is model-based reinforcement learning." - Richard S. Sutton

The asymptotic performance of MBRL algorithms was historically lower than that of model-free methods, but the gap has been closing in recent years [JFZL21], as a lot of effort has been directed towards scaling RL to real-world environments and dealing with sparse rewards [WEH+22]. MBRL has shown great success in terms of sample efficiency [DR11], [BHT+19], improving planning [JFZL21], and being robust to distributional shifts [FVM+22].

The major motivation behind studying sparse reward problems is that designing a reward function for a given task or environment requires close to no domain knowledge. For example, in a continuous control task where a robotic arm is required to stack an object on top of other objects, we can assign a sparse reward of 1 if the robotic arm successfully stacks the object and 0 otherwise. Designing intermediate rewards in this scenario can be challenging and in most cases not practical. The sparse reward problem has been addressed in previous state-of-the-art literature in a plethora of ways, in which Gaussian Processes (GP) [DR11] and Neural Networks (NN) [CCML18] were used as typical models for representing the one-step dynamics of the RL environment. Ensembles of probabilistic NNs are a common paradigm used by many MBRL methods that leverage uncertainty estimates to improve performance [BHT+19], [ZZW+19]. The learned models are then used, for instance, for planning, as done by [CCML18]. [FVM+22] present one of the most recent studies in which classic RL benchmarks are modified for different reward sparsity levels. However, research in MBRL under sparse reward settings has not answered some key questions, such as how accurate and diverse the model should be, how aware the model should be of its uncertainty, and how long the rollout trajectory should be. While it is fairly common for authors to experiment with and provide various algorithms for solving sparse reward tasks, it remains an open question which key ingredients and design choices are essential for solving them. The study carried out in this thesis helps us understand why some hyperparameter design choices might work better than others through the lens of the dataset which we use to train the agent.
1.1 Goal of the Thesis

The primary focus of this study revolves around understanding the essential elements within Model-Based Reinforcement Learning (MBRL) algorithms that effectively tackle the challenge of sparse rewards. The objective of this thesis is to thoroughly examine a current state-of-the-art MBRL algorithm, namely Model-Based Policy Optimization (MBPO), and investigate the impact of various decisions in the algorithm's design when dealing with sparse reward scenarios. Modifications suggested in the literature are used as a starting point for benchmarking performance under sparse rewards in a customized sparse inverted pendulum [CBK20], mountain car continuous [BCP+16a], and a complex DeepMind continuous control environment, cheetah run [TMD+20], which are stochastic in terms of their initial state and are well suited for testing continuous control tasks.

1.2 Thesis Overview

The remainder of this work is organized as follows. Chapter 2 introduces prior research from the literature that is pertinent to our approach. Subsequently, Chapter 3 delves into the necessary technical background, encompassing topics such as MBRL, the Markov Decision Process (MDP), and more. In Chapter 4, we discuss the design of the conducted experiments and their particulars. Chapter 5 encompasses the details of the experimental setup and hardware specifications. Moving forward to Chapter 6, we present the assessments and in-depth analysis of the outcomes derived from the simulated experiments. Transitioning to Chapter 7, we provide a succinct summary of our findings while also offering glimpses into potential avenues for future research.

2 Related Work

This section summarizes prior state-of-the-art literature on solving sparse reward problems. An overview of the literature is provided, as well as a brief discussion of the methodologies employed. The sparse reward problem is a common and difficult challenge that many RL algorithms face in real-world scenarios. It is particularly challenging because there is little signal to learn from, and agents must also rely on accurate models of the environment to make effective decisions. Moreover, the sparse reward setting makes the learning process slower and less efficient. In order to overcome the difficult challenges posed by the sparse reward problem, a wide range of novel ideas in deep reinforcement learning research has emerged.

[VHS+18] proposed a model-free approach using demonstrations collected from a human demonstrator to tackle the sparse reward problem on a high-dimensional robotic control problem. The purpose of using demonstrations is to replace carefully engineered rewards and reduce the exploration burden in sparse reward settings. A hybrid approach using Deep Q-Networks (DQN) was suggested by [GL19], which combines model-free and model-based approaches to help explore rarely seen states and learn environments with sparse rewards more efficiently.

Reward shaping is an effective strategy to help RL algorithms learn more efficiently, in which carefully engineered rewards are introduced into the environment to guide the algorithm towards convergence. [LTA20] propose a reward shaping methodology based on Contrastive Predictive Coding (CPC) by learning predictive representations offline. CPC learns self-supervised representations by predicting the future in latent space with the help of auto-regressive models.
With their algorithm, long-horizon tasks can be addressed with shaped reward signals. Another parallel line of work, proposed by [WWWZ20], introduces a non-expert helper algorithm, where a prior control policy (the non-expert helper) plays an important role in guiding the agent in exploring the state space by dynamically reshaping the learning objectives over time. [CM20] propose PlanGAN, a model-based algorithm for addressing multi-goal tasks in the presence of sparse rewards. The experience collected by an RL algorithm is used to train an ensemble of Generative Adversarial Networks (GAN) to generate multiple plausible experiences, and these experiences are combined in a novel planning algorithm that helps achieve efficient learning. [KCM20] frame policy search as a multi-objective problem, where the objectives are optimized using Pareto-based multi-objective optimization. The proposed approach was able to solve sparse reward tasks within a few episodes. [LWZZ21] take advantage of the model error as an extra reward, which increases the reward density in sparse reward settings and thereby drives exploration. [NHM21] introduce a novel, computationally light model-based approach to tackle the sparse reward problem, which encourages the agent to explore uncertain states by considering a prior model of its goal behavior. To tackle sparse reward problems in continuous action settings, [WWW21] deployed a particle swarm optimization planner as the actor in an actor-critic architecture, which helped regulate the exploration rate. This aided the RL algorithm in identifying rewards in otherwise uninteresting regions of the state space. Motivated by the recent success of value-based methods for approximating state-action values, [RYH+21] utilize radial basis value functions for addressing continuous control robotic manipulation in multi-task sparse reward settings. An effective alternative to reward shaping was proposed by [DDHB22], where model predictive control is utilized as an experience source for training RL algorithms in sparse reward environments. Algorithm complexity increases significantly when demonstrations are involved, as more hyperparameters are introduced and need to be tuned. As an effective alternative to this approach, [WBD+22] introduce a parameter-free modification of standard actor-critic algorithms which computes a modified Q-value and a Monte Carlo estimate of the reward-to-go, as a result increasing learning efficiency in sparse reward tasks. [LWRG22] leverage long-term Q-values and provide richer feedback signals for the actions taken to improve learning efficiency. [SGK22] propose a novel approach of redistributing local and shared global rewards across multiple agents, thereby boosting exploration in multi-agent sparse reward settings. In the following sections, these approaches are broadly grouped into the standard methodologies used to address the sparse reward problem and examined in turn.

2.1 Curiosity-Driven Exploration

The main idea behind curiosity-driven methods is that the RL algorithm is encouraged to visit unseen states in the environment. The intuition behind this approach is that an agent that is curious about its environment will be more likely to discover novel and informative experiences, which can ultimately lead to better performance in the long run.
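As a concrete, hypothetical illustration of this idea, the following minimal sketch adds a prediction-error bonus to the extrinsic reward, in the spirit of the forward-model curiosity methods discussed next. The feature dimensions, network sizes, and scaling factor are placeholders and not taken from any specific published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ForwardModel(nn.Module):
    """Predicts the next-state feature vector from the current features and action."""
    def __init__(self, feat_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, feat, action):
        return self.net(torch.cat([feat, action], dim=-1))

def intrinsic_reward(forward_model, feat, action, next_feat, scale: float = 1.0):
    """Curiosity bonus: the forward model's prediction error on the observed transition."""
    with torch.no_grad():
        pred_next = forward_model(feat, action)
        error = F.mse_loss(pred_next, next_feat, reduction="none").mean(dim=-1)
    return scale * error  # added to the (possibly sparse) extrinsic reward
```

States the model cannot yet predict well receive a larger bonus, which nudges the policy towards unexplored parts of the state space.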
[HCD+16] propose an exploration strategy that maximizes information gain about the algorithm's belief of the environment dynamics, which performs well on continuous control tasks with sparse rewards. [PAED17] propose the Intrinsic Curiosity Model (ICM), where curiosity is formulated as the error in the algorithm's ability to predict the consequences of its own actions in a visual feature space learned by a self-supervised inverse dynamics model. The algorithm is composed of two subsystems: one outputs a curiosity-driven intrinsic reward signal, and a policy outputs a sequence of actions to maximize that reward signal. While the previous approach used a model-free agent to solve the environments, [SRD+20] showed that the idea of curiosity can be combined with MBRL to create an agent that efficiently explores and solves sparse reward tasks. However, blindly pursuing novel states renders the previous methods sample-inefficient, and they may also fail to converge to useful behavior in some cases. To overcome this issue, [LLL+20] formulate a goal-oriented curiosity-driven exploration method with a dynamic initial-state selection mechanism, which achieves a much higher success rate and leads to faster convergence.

2.2 Random Network Distillation

The main idea behind curiosity-driven learning is to build a reward function that is intrinsic to the algorithm, but this approach has a serious drawback known as the "noisy TV problem". This is a common issue in RL where the observations are noisy or incomplete due to environmental factors such as sensor noise, occlusions, or partial observability. In this scenario, the algorithm must learn to distinguish between the relevant and irrelevant parts of the observation in order to make accurate decisions. Imagine an RL algorithm that is rewarded for seeking novel experience: it can be distracted forever by stochastic elements in the environment, because at every timestep the curiosity reward for these unpredictable states remains high and keeps pushing the algorithm to pursue them. Random Network Distillation (RND) addresses this problem by computing a curiosity signal that is not attracted to the stochastic elements of an environment. RND consists of an untrained target network with fixed random weights and a predictor network that tries to predict the target network's output. The target network's feature representation for a given state is fixed, so revisiting the same state always produces the same target output. The predictor network predicts the target network's output for the next state. Therefore, the next state is propagated through both networks, and the predictor network is trained to minimize the mean squared error between the outputs of the two networks. This process distills a randomly initialized neural network (the target) into a trained (predictor) network. To improve exploration in several hard Atari games, [BESK18] combined an RND bonus with a flexible integration of intrinsic and extrinsic rewards, which helped achieve better-than-average human performance. An ensemble-free alternative for quantifying uncertainty was suggested by [NKTK23], which makes use of RND combined with Soft Actor-Critic (SAC) and delivers performance comparable to ensemble-based methods.
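The following minimal sketch illustrates the RND mechanism under simple assumptions (small fully connected networks and a hypothetical observation dimension). It is not the implementation used in [BESK18]; it only shows the core idea of a fixed random target network and a trained predictor whose error serves as the exploration bonus.

```python
import torch
import torch.nn as nn

def make_net(obs_dim: int, out_dim: int = 64) -> nn.Module:
    return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

obs_dim = 8                       # hypothetical observation size
target = make_net(obs_dim)        # fixed, randomly initialized target network
predictor = make_net(obs_dim)     # trained to imitate the target
for p in target.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def rnd_bonus_and_update(next_obs: torch.Tensor) -> torch.Tensor:
    """Train the predictor on a batch of next states and return a per-state novelty bonus."""
    with torch.no_grad():
        tgt = target(next_obs)
    pred = predictor(next_obs)
    per_state_error = ((pred - tgt) ** 2).mean(dim=-1)  # larger error -> more novel state
    loss = per_state_error.mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return per_state_error.detach()                     # used as the intrinsic reward
```

Because the target output for a given state never changes, frequently visited states become predictable and lose their bonus, while purely random observation noise does not systematically attract the agent in the same way a naive forward-model error can.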
2.3 Hindsight Experience Replay

Curiosity-driven methods try to maximize the statistical novelty of states in comparison with previously experienced states. There is a potential pitfall in this approach: variation among the explored states is only observed when there is a significant amount of diversity present among them. When the explored states lack diversity in a large state space, ICM will fail to explore the state space efficiently. In a sparse reward environment, the algorithm finds it difficult to appropriately explore the environment and learn the sequence of steps required to achieve the desired goal. Hindsight Experience Replay (HER) [AWR+17] consists of a buffer that stores a copy of each experienced trajectory and replaces the actual rewards with rewards calculated assuming the goals were the states achieved at the end of the trajectories. As a result, the algorithm is able to explore more effectively and learn intermediate goals that build up to the actual goal. [ZZLL20] leverage demonstrations to accelerate training and propose a new experience replay mechanism to address sparse reward problems in robotic tasks. Sequential object manipulation tasks are extremely difficult in sparse reward settings; [LWD+22] propose relay HER, where the long sequential task is decomposed into multiple sub-tasks. The decomposed sub-tasks are learned efficiently with the help of HER. The concept of HER has also been applied to model-based methods to solve multi-goal tasks in sparse reward settings. [MR21] propose Imaginary-HER, which incorporates imaginary data into policy updates to improve exploration using HER and is endowed with curiosity-based intrinsic rewards. This imaginary data is replaced each time the model is updated. Another parallel work, proposed by [YFH+21], introduces model-based HER, which exploits experiences by leveraging the environmental dynamics to generate imaginary virtual goals. These generated virtual goals allow the algorithm to reinterpret its actions using a different goal under the latest policy.

2.4 Policy Optimization

Policy optimization methods are centered around the policy, a function that maps the agent's state to its next action. These methods view reinforcement learning as a numerical optimization problem where the expected rewards are optimized with respect to the policy's parameters. [LK13] proposed a guided policy search algorithm that has the capability of learning complex policies by incorporating guiding samples into the policy search. A model-free initial policy search is guided by model-based Differential Dynamic Programming (DDP), which samples from a distribution of high-reward trajectories and helps optimize the policy without requiring new on-policy samples. In this line of work on policy optimization, [JFZL21] propose a simple procedure that makes only limited use of the model; in particular, they decouple the model usage from the original task horizon by querying the model only for short rollouts. [CBK20] introduce an optimistic exploration algorithm that augments the control space of the agent, which helps the agent control its epistemic uncertainty over the transition dynamics. They address the problem of greedy and insufficient exploration of MBRL algorithms with probabilistic dynamics models by leveraging the model uncertainty to optimize the policy. As model-based methods often struggle with errors in learned models, [ZLW20] derive an upper bound on the uncertainty in Q-values in order to achieve asymptotic performance comparable to model-free methods. Furthermore, [QPC20] propose an uncertainty-aware trust region policy optimization algorithm that optimizes the policy conservatively, thereby increasing performance and reducing overfitting to inaccurate models.
[CVLH19] formulate a model-free optimistic actor-critic algorithm which facilitates deep exploration by leveraging the predictive uncertainty of the policy performance during policy optimization. They achieve this by using a bootstrap method to estimate the epistemic uncertainty of the Q-function, so that they can adjust the upper confidence bound for the critics based on the principle of optimism in the face of uncertainty. In a similar line of work, [FM21] propose a model-based version by implementing Posterior Sampling for Reinforcement Learning (PSRL) with function approximation and making use of Bayesian Linear Regression (BLR) when fitting transition and reward models. In the end, the authors implement Model Predictive Control (MPC) to optimize the policy under the sampled models in each episode. [FLZB22] introduce an on-policy corrections methodology which uses on-policy transition data together with a learned model in order to make accurate long-term predictions for MBRL. The authors show that when trajectories are generated on-policy with the model, the true state distribution can be recovered, which they formalize by means of a policy improvement bound.

The idea of policy optimization is further extended to offline RL by [YTY+20]. In this paper, the authors present an offline MBRL algorithm that optimizes a policy in an uncertainty-penalized MDP: states with high model uncertainty are penalized, and the policy maximizes the return of this penalized MDP. This allows the algorithm to trade off the benefit of leaving the behavioral distribution against the risk of making mistakes due to model errors.

3 Technical Background

In this chapter, we briefly introduce the standard formalisms and techniques on which the thesis is based, as well as their significance in the machine learning and reinforcement learning communities. In Section 3.1, an introduction to reinforcement learning and the framework in which it operates is provided. This aids in framing the issue of sparse rewards in Section 3.1.5. In Section 3.2, a brief introduction is given to Model-Free Reinforcement Learning (MFRL), and in Section 3.3 to MBRL. Further, a brief discussion of a model-based algorithm, the Model-Based Policy Optimization (MBPO) method, is provided. This method serves as the baseline upon which the rest of the thesis investigation is carried out.

3.1 Reinforcement Learning

Reinforcement learning is one of the three fundamental machine learning paradigms, alongside supervised and unsupervised learning. RL is concerned with making sequences of decisions and considers an intelligent agent situated in an unknown environment. At each timestep, the agent takes an action and receives an observation and a reward. The primary goal of the RL algorithm is to maximize the notion of cumulative reward, given an unknown environment, through a trial-and-error learning process. Section 3.1.4 onwards provides a more detailed description of the mathematical formulation of RL. Figure 3.1 below depicts this very general RL problem involving a reward-maximizing agent. There have been numerous applications of RL algorithms to various fields, including robot control [KBP13]; economics, game theory, operations research and finance [CER20]; and the recent technological trend of large language models [Ope23], where RL is used to fine-tune the model's behavior with human feedback to produce responses that are better aligned with the user's intent.
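In code, one episode of this interaction loop looks roughly like the following sketch, written against the common Gymnasium-style step/reset API. The environment name and the random action choice are placeholders for illustration only; any policy could be substituted for the sampled action.

```python
import gymnasium as gym

env = gym.make("Pendulum-v1")            # any environment exposing the standard API
obs, info = env.reset(seed=0)
episode_return = 0.0
done = False
while not done:
    action = env.action_space.sample()   # a learned policy would map obs -> action here
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    done = terminated or truncated
print("return of one episode:", episode_return)
```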
Figure 3.1: Interaction loop between an RL agent and an environment. The reward and the state resulting from taking an action serve as the input for the next iteration [SB18].

3.1.1 Deep Reinforcement Learning

Most modern machine learning methods are primarily concerned with learning functions from data. Deep Learning (DL) is a sub-field of machine learning that uses artificial neural networks to model and solve complex problems. DL involves a loss function and a non-linear function approximator, and utilizes gradient descent to optimize the parameters between nodes in order to minimize the difference between the predicted output and the actual output. DL has been applied to various fields and has produced some groundbreaking results in computer vision [SPP18], autonomous driving [HC20] and natural language processing [TSK+21]. DL transforms a learning problem into an optimization problem in a straightforward fashion in supervised learning scenarios, but this reduction is less straightforward in RL. The two main caveats of RL with respect to supervised learning are that the dataset is non-stationary (changing over time) and that the data is not independently and identically distributed; instead, the data is strongly correlated. Moreover, the input data to the agent strongly depends on how it behaves in the unknown environment, which makes it difficult to develop straightforward algorithms to solve the task. In RL, we have various choices of what to approximate, such as policies, value functions, and dynamics models, as well as what kind of function approximators to utilize; these choices can also be combined in various ways when solving an RL problem. Deep Reinforcement Learning (DRL) is a subfield of machine learning which studies reinforcement learning using neural networks as non-linear function approximators. DRL incorporates DL, which helps the RL agent make decisions from unstructured input data without the need for manual engineering of the state space. DRL algorithms have the capacity to ingest large datasets to maximize the cumulative reward. DRL has been applied to diverse sets of applications: [SSS+17] mastered the game of Go, playing better than most human experts, utilizing a combination of supervised learning and several RL steps to train deep neural networks, combined with a Monte Carlo tree search algorithm. [MKS+13] present DQN, a classical example of DRL, which scales up conventional Q-learning to tasks with more complex observation spaces. Since then, there has been an explosion of state-of-the-art DRL algorithms applied in various domains. In the upcoming subsections, the individual components which contribute to DRL are discussed in detail.

3.1.2 Markov Decision Process

We consider an RL agent that interacts with an MDP. An MDP is described by the tuple M = ⟨S, A, P, R, 𝜇, 𝛾⟩, where P(𝑠′ | 𝑠, 𝑎) defines a Markovian transition probability density between the current state 𝑠 and the next state 𝑠′ under action 𝑎, 𝜇 : S → [0, 1] is the initial state distribution, 𝑟 : S × A → R is the reward function, and 𝛾 ∈ [0, 1) is the discount factor. Given an MDP, the goal of the RL agent is to learn a specific behavior in an unknown environment.
That is, we want to learn an action selection policy that maximizes the expected cumulative reward.

3.1.3 Policy

The policy 𝜋 determines the agent's behavior by selecting an action given its current state and is formally defined as a mapping from a state 𝑠 to an action 𝑎, 𝜋 : S → A. It is possible for the policy to be a deterministic function, which means it will always return the same action for a given state. Policies can also be stochastic, defined by a distribution 𝜋 : S × A → [0, 1], where a probability is assigned to each action 𝑎 given a state 𝑠. When an agent is in a state, it has a wide variety of actions to choose from, and we need a performance measure to evaluate a given policy. The performance of a policy is quantified by means of its return 𝑅. This return is defined as the sum of all rewards 𝑟_𝑡 received within an episode starting at 𝑡 = 0 and terminating at 𝑡 = 𝑇:

$$R = \sum_{t=0}^{T} r_t \tag{3.1}$$

In the finite horizon setting, the return is calculated over a finite number of timesteps and is therefore bounded. However, to deal with infinite horizon tasks, a discount factor 𝛾 ∈ [0, 1) is introduced to weigh down the contribution of distant future rewards. This discounted return is further used in the calculation of value functions:

$$R = \sum_{t=0}^{\infty} \gamma^t r_t \tag{3.2}$$

3.1.4 Value Function

The RL agent needs to find a policy 𝜋(𝑎 | 𝑠) to take an appropriate action 𝑎 when the agent is in state 𝑠. There are multiple actions the agent can choose from in a particular state, and their quality is captured by the so-called state value function and state-action value function. The state-action value function indicates how beneficial it is to be in a certain state 𝑠 and perform a certain action 𝑎. Formally, the state value function can be defined as the expected return when starting in state 𝑠 and selecting actions according to the policy 𝜋:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right]. \tag{3.3}$$

Correspondingly, the state-action value function is given by

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a\right]. \tag{3.4}$$

When we have access to the reward function and the transition function, the state-action value function can be calculated recursively:

$$\begin{aligned}
Q^{\pi}(s, a) &= \mathbb{E}_{\pi}\left[r_0 + \gamma r_1 + \gamma^2 r_2 + \ldots \mid s_0 = s, a_0 = a\right] \\
&= \mathbb{E}_{\pi}\left[r_0 \mid s_0 = s, a_0 = a\right] + \gamma \mathbb{E}_{\pi}\left[r_1 + \gamma r_2 + \ldots \mid s_0 = s, a_0 = a\right] \\
&= r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, \mathbb{E}\left[r_1 + \gamma r_2 + \ldots \mid s_1 = s'\right] \\
&= r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, Q^{\pi}(s', \pi(s'))
\end{aligned} \tag{3.5}$$

The above equation is called the Bellman equation for the state-action value function. When we consider two policies 𝜋 and 𝜋′, we say 𝜋 ≥ 𝜋′ if and only if 𝑉^𝜋(𝑠) ≥ 𝑉^𝜋′(𝑠) for all 𝑠 ∈ S. Analogously, we can derive a similar Bellman equation for the state value function, which is given by:

$$V^{\pi}(s) = r(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, V^{\pi}(s') \tag{3.6}$$

Therefore, we can assess policies given their value functions and derive an optimal policy 𝜋∗ based on value functions.
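To make the Bellman equation (3.6) concrete, the following minimal sketch performs iterative policy evaluation on a small finite MDP with known transition probabilities and rewards. The MDP, its numbers, and the deterministic policy are hypothetical and chosen only for illustration.

```python
import numpy as np

def policy_evaluation(P, r, pi, gamma=0.9, tol=1e-8):
    """Iterate V(s) = r(s, pi(s)) + gamma * sum_s' P(s' | s, pi(s)) * V(s') until convergence.

    P:  transition probabilities, shape (S, A, S)
    r:  rewards, shape (S, A)
    pi: deterministic policy, shape (S,), integer action per state
    """
    S = P.shape[0]
    V = np.zeros(S)
    while True:
        V_new = np.array([r[s, pi[s]] + gamma * P[s, pi[s]] @ V for s in range(S)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# toy 2-state, 2-action MDP (illustrative numbers only)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[0.0, 1.0],
              [0.0, 2.0]])
pi = np.array([1, 1])          # always take action 1
print(policy_evaluation(P, r, pi))
```

Because 𝛾 < 1, the repeated application of the Bellman operator is a contraction and the iteration converges to the unique value function of the evaluated policy.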
The value function that defines the optimal policy 𝜋∗ is called the optimal state-action value function 𝑄∗(𝑠, 𝑎), which is given by:

$$Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a), \quad \forall s \in \mathcal{S},\, a \in \mathcal{A} \tag{3.7}$$

The optimal state value function 𝑉∗(𝑠) is given by:

$$V^{*}(s) = \max_{\pi} V^{\pi}(s), \quad \forall s \in \mathcal{S} \tag{3.8}$$

And we can define the Bellman optimality equation for the state-action value function as:

$$Q^{*}(s, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, \max_{a'} Q^{*}(s', a') \tag{3.9}$$

Similarly, we can define the Bellman optimality equation for the state value function:

$$V^{*}(s) = \max_{a}\left[r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s')\right] \tag{3.10}$$

And finally, the optimal policy 𝜋∗ is given as:

$$\pi^{*}(s) = \underset{a}{\arg\max}\left[r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s')\right] \tag{3.11}$$

We can write the above equation in a simplified form:

$$\pi^{*}(s) = \underset{a}{\arg\max}\, Q^{*}(s, a) \tag{3.12}$$

3.1.5 Sparse Rewards

Reward functions play an important role in helping the agent learn a better policy. In sparse reward problems, the reward function returns a reward signal which is typically not informative for learning an optimal policy; instead, only a small region of the state-action space provides valuable feedback for learning. Learning becomes very difficult in these scenarios because the algorithm must internally realize which decisions during an episode were mistakes that led to negative rewards after termination. Similarly, after a successful episode, the agent needs to know which actions led to the success. Ideally, RL algorithms are applied to simulated environments where it is possible to define rewards at each timestep that help the agent reach a goal state. Unfortunately, for many problems in real-world settings this is not the case.

Let us consider the previously discussed robotic arm stacking task, represented by a tuple (𝑆, 𝐴, 𝑃, 𝑅, 𝛾), where 𝑆 is the set of possible states in the environment, 𝐴 is the set of possible actions that the agent can take, 𝑃(𝑠′ | 𝑠, 𝑎) represents the transition probability from state 𝑠 to state 𝑠′ when action 𝑎 is taken, 𝑅(𝑠, 𝑎, 𝑠′) is the reward received by the agent when transitioning from state 𝑠 to 𝑠′ by taking action 𝑎, and 𝛾 is the discount factor that determines the importance of future rewards. The reward 𝑅(𝑠, 𝑎, 𝑠′) can be defined to be sparse, that is, rewards are only obtained when the robotic arm is in a specific small region of the state space. This makes learning an optimal policy more difficult. The agent may receive a positive reward of +1 only when it reaches the goal state and zero reward for all other state-action transitions. In this case, the agent has to rely on these sparse positive rewards to learn an effective policy. Mathematically, this can be represented as

$$R(s, a, s') =
\begin{cases}
+1, & \text{if } s' \text{ is the goal state} \\
0, & \text{otherwise.}
\end{cases}$$
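A minimal sketch of such a sparse, goal-based reward function is shown below; the goal coordinates and tolerance are hypothetical values chosen only for illustration, not parameters of the environments used later in this thesis.

```python
import numpy as np

def sparse_reward(next_state: np.ndarray, goal: np.ndarray, tol: float = 0.05) -> float:
    """R(s, a, s') = +1 if s' lies within a small tolerance of the goal, else 0."""
    return 1.0 if np.linalg.norm(next_state - goal) <= tol else 0.0

# example: the gripper position must end up at the target stacking location
goal = np.array([0.3, 0.1, 0.25])                          # hypothetical goal coordinates
print(sparse_reward(np.array([0.31, 0.10, 0.25]), goal))   # 1.0 (success)
print(sparse_reward(np.array([0.00, 0.00, 0.00]), goal))   # 0.0 (no learning signal)
```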
3.2 Model-Free Reinforcement Learning

MFRL is a group of algorithms that learn optimal policies through trial-and-error interactions with an environment, without explicitly modeling the environment's dynamics; i.e., the agent tries to learn from experience rather than building a model of the environment's dynamics and using it to plan ahead. In the MFRL setting, the agent explores the environment and receives observations that represent the state of the environment at a given time. The agent chooses an action based on these observations and receives a reward as feedback on the quality of its decision. The goal of the agent is to learn a policy that maximizes the expected cumulative reward over time. It is possible to achieve this goal with value-based or policy-based algorithms, which are the two main classes of MFRL algorithms [Sil15].

RL algorithms can further be classified as on-policy or off-policy methods; in the context of this thesis, we focus on off-policy learning and the actor-critic framework. On-policy methods update the policy based on the data the agent collects using the current policy, i.e., the data used for learning comes from interactions with the environment under the current policy. As a result, the agent can only update the policy it is currently following. Off-policy methods, on the other hand, learn from data collected by following a different policy. The agent has the flexibility of choosing either a random policy or a previously learned policy for data collection. This flexibility allows the agent to learn more efficiently by reusing past experience. The above-mentioned data collection schemes can be applied to both value-based and policy-based methods.

Value-based methods typically estimate the value of each state or state-action pair and utilize this value to select actions and, in turn, maximize the expected cumulative reward. Policy-based methods, on the other hand, directly learn a policy, mapping each state to a probability distribution over actions. The policy is updated based on the observed rewards, which involves computing the gradient of the policy's performance objective. Policy-based methods can be further classified into gradient-free and gradient-based methods. Gradient-free methods do not rely on gradient information for policy optimization. Instead, they search the policy space directly for optimal policies, utilizing evolutionary algorithms or other search-based optimization techniques such as Particle Swarm Optimization and the cross-entropy method. These methods explore the policy space through trial and error, gradually improving policies over iterations. Gradient-based methods utilize gradients to optimize the policy directly. The goal of these methods is to find a policy that maximizes the expected cumulative reward by iteratively updating the policy in the direction of steepest ascent. Commonly used gradient-based methods include the state-of-the-art policy optimization algorithms Proximal Policy Optimization (PPO) [SWD+17] and Trust Region Policy Optimization (TRPO) [SLM+17]. There are also hybrid methods that combine the advantages of both policy-based and value-based methods, such as Advantage Actor-Critic (A2C), which employs an actor-critic framework. The critic network is used to estimate a value function, and the policy gradient is computed based on an advantage function that guides the policy updates. The actor is responsible for selecting actions based on the current state and adjusts its policy based on the feedback from the critic to improve future actions [SML+18].

Extending this line of work, [HZAL18] proposed the state-of-the-art SAC algorithm, which incorporates stochastic policies, allowing the agent to explore and generate a diverse set of actions. Exploration is encouraged by an entropy regularization term that discourages overly deterministic policies.
By maximizing entropy, SAC seeks a policy that strikes a balance between exploration and exploitation; it builds the basis for the rest of the thesis and is discussed in detail in Section 3.2.1.

3.2.1 Soft Actor-Critic

The Soft Actor-Critic algorithm is a model-free, off-policy, actor-critic reinforcement learning method. While the PPO, A2C and TRPO algorithms learn in an on-policy manner, i.e., they need new sets of samples for each policy update, SAC learns in an off-policy way, i.e., by using experience replay buffers to learn from past data. SAC is primarily applied to continuous action space environments. By incorporating a maximum entropy objective, the SAC algorithm addresses the exploration-exploitation trade-off. The entropy is used as a regularization term in policy optimization and measures the degree of uncertainty in the policy distribution. Additionally, the entropy term helps prevent convergence to sub-optimal policies. The entropy 𝐻 of a policy is defined as

$$H(\pi) = \mathbb{E}_{\pi}\left[-\log \pi(\cdot \mid s_t)\right]. \tag{3.13}$$

The maximum entropy term is incorporated into the standard RL objective, where the agent receives an additional bonus reward at each timestep proportional to the entropy of the policy at that timestep. As a result, the modified RL objective [HZAL18] is given as

$$\pi^{*} = \underset{\pi}{\arg\max}\; \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t \left(r(s_t, a_t) + \alpha H(\pi(\cdot \mid s_t))\right)\right]. \tag{3.14}$$

The temperature parameter 𝛼 controls the stochasticity of the optimal policy and balances the reward and entropy regularization terms. As a result of adding the entropy term to the objective, the state value and state-action value functions are defined as follows:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t \left(r(s_t, a_t) + \alpha H(\pi(\cdot \mid s_t))\right)\right], \tag{3.15}$$

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) + \alpha \sum_{t=1}^{\infty} \gamma^t H(\pi(\cdot \mid s_t))\right]. \tag{3.16}$$

Finally, the modified Bellman equation for the state-action value function is given as

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[r(s, a) + \gamma\left(Q^{\pi}(s', a') + \alpha H(\pi(\cdot \mid s'))\right)\right]. \tag{3.17}$$

SAC simultaneously learns a policy and two Q-functions to improve policy optimization. The two Q-functions are learned independently, and the minimum value between them is utilized both for the policy update and for accurate target value estimation. The two Q-functions provide multiple value estimates, which helps guide policy learning, while the entropy regularization encourages the policy to explore diverse sets of actions, thereby promoting exploration. SAC utilizes the clipped double-Q trick [FHM18] to reduce the overestimation bias of the two Q-functions. The minimum Q-value estimate between the two Q-functions is used by SAC when estimating the target value for updating the Q-functions. As a result, the overestimation bias is mitigated and more accurate value estimates are obtained. During the policy update, SAC updates the policy to choose actions with higher Q-value estimates, again using the minimum of the two Q-functions (cf. Equation (3.20)). SAC trains the Q-networks by minimizing the mean squared error between the Q-value estimate 𝑄_𝜃(𝑠, 𝑎) and a Bellman target, which is given as

$$J_Q(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\left[\left(Q_{\theta}(s, a) - \hat{Q}(s, a)\right)^2\right] \tag{3.18}$$

with the entropy-regularized Bellman update target given as

$$\hat{Q}(s, a) = r(s, a) + \gamma\left(\min_{i=1,2} Q_{\theta'_i}(s', a') - \alpha \log \pi(a' \mid s')\right) \tag{3.19}$$

where 𝑎′ ∼ 𝜋(· | 𝑠′), the term −log 𝜋(𝑎′ | 𝑠′) approximates the entropy of the policy at 𝑠′, and D is the replay buffer storing the samples collected by the agent during training.
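The following sketch shows how the critic loss (3.18) with the target (3.19) might be computed in a PyTorch-style implementation. The actor and critic modules, the actor's sample method, and the batch format are assumptions made for illustration; they are not the exact interfaces of any particular library. The (1 − done) factor, a standard implementation detail for terminal states, is not written out in Equation (3.19).

```python
import torch

def sac_critic_loss(batch, actor, q1, q2, q1_target, q2_target, alpha, gamma=0.99):
    """Clipped double-Q, entropy-regularized Bellman target (Eqs. 3.18-3.19)."""
    s, a, r, s_next, done = batch                     # tensors sampled from replay buffer D
    with torch.no_grad():
        a_next, logp_next = actor.sample(s_next)      # a' ~ pi(.|s'), log pi(a'|s')
        q_next = torch.min(q1_target(s_next, a_next),
                           q2_target(s_next, a_next)) # min over the two target critics
        target = r + gamma * (1.0 - done) * (q_next - alpha * logp_next)
    loss = ((q1(s, a) - target) ** 2).mean() + ((q2(s, a) - target) ** 2).mean()
    return loss
```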
The term min_{i=1,2} 𝑄_{𝜃′_i}(𝑠′, 𝑎′) takes the minimum of the estimates of the two target Q-value networks 𝑄_{𝜃′_1} and 𝑄_{𝜃′_2} for the next state-action pair. Given the stochastic nature of the policy in SAC, the policy objective is formulated so as to maximize the likelihood of actions 𝑎 ∼ 𝜋(· | 𝑠) that result in a high Q-value estimate 𝑄(𝑠, 𝑎). The policy's objective function can thus be defined as

$$\max J_{\pi}(\phi) = \mathbb{E}_{s \sim \mathcal{D}}\left[\min_{i=1,2} Q_{\theta_i}(s, a) - \alpha \log \pi_{\phi}(a \mid s)\right] \tag{3.20}$$

3.3 Model-Based Reinforcement Learning

Following the discussion of MFRL algorithms in the previous sections, this section discusses MBRL, upon which the rest of the thesis is built. In MBRL approaches, an agent explicitly learns a probabilistic model of the environment's dynamics. Various techniques can be used to build the model, such as stochastic models, regression models, and neural networks. This learned model can later be leveraged to solve various RL problems. [CCML18] propose a probabilistic ensemble with trajectory sampling, where the Cross Entropy Method (CEM) is utilized to sample actions from a distribution fitted to previous action samples that yielded high rewards. [SHM+16] use the Monte Carlo Tree Search (MCTS) methodology in AlphaGo: each node in the tree corresponds to a state, which is evaluated based on the returns obtained with the help of a random policy. An alternative strategy is to use the model to generate simulated data for policy learning or value approximation. In this line of work, [FWS+18] proposed Model Value Expansion (MVE), which utilizes n-step temporal difference targets and depends on a fixed horizon. [BHT+19] proposed Stochastic Ensemble Value Expansion (STEVE), which interpolates between different horizons based on the uncertainty calculated using the ensemble networks.

The model can also be used for data augmentation to improve the policy. [LXL+21] proposed Stochastic Lower Bound Optimization (SLBO), which uses a multi-step L2-norm loss to train the dynamics. SLBO uses the model to roll out full-length trajectories from the start state. However, long rollouts typically result in compounding prediction errors, which ultimately hinder the policy optimization process. [JFZL21] introduced MBPO, which makes use of shorter rollout lengths to avoid large compounding errors. MBPO begins its rollouts from states sampled from the real environment and runs n steps according to the policy and the learned model. MBPO uses the model-generated data as a replay buffer to train a standard SAC agent. As such, MBPO can be considered a natural extension of SAC that incorporates model rollouts. SAC is utilized as the agent to update the policy with mixed data from both the real environment and the learned model. MBPO is discussed in detail in Section 3.3.1, and its respective hyperparameters in Section 3.3.2.

3.3.1 Model-Based Policy Optimization

This model-based algorithm is a natural choice for further investigation because it is a direct extension of its model-free counterpart SAC. In this approach, the learned model is used to gather imaginary data to train a policy. To gather this imaginary data, a bootstrap ensemble of dynamics models is utilized, where each member of the ensemble is a probabilistic neural network. The probabilistic neural networks capture aleatoric uncertainty, i.e., noise in the outputs relative to the inputs, while bootstrapping ensures that epistemic uncertainty, i.e., uncertainty in the model parameters, is captured.
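A minimal sketch of one such probabilistic ensemble member is given below: each member predicts the mean and log-variance of a Gaussian over the next-state change and the reward, and epistemic uncertainty arises from disagreement across the bootstrap ensemble. The network sizes and the way the ensemble is held in a plain list are simplified placeholders, not the mbrl-lib implementation.

```python
import torch
import torch.nn as nn

class GaussianDynamicsMember(nn.Module):
    """One ensemble member: predicts a Gaussian over (next-state delta, reward) given (s, a)."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 200):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
        )
        self.mean = nn.Linear(hidden, obs_dim + 1)      # next-state delta and reward
        self.log_var = nn.Linear(hidden, obs_dim + 1)

    def forward(self, s, a):
        h = self.body(torch.cat([s, a], dim=-1))
        return self.mean(h), self.log_var(h)

def sample_transition(member, s, a):
    """Sample (s', r) from the member's predictive Gaussian (aleatoric uncertainty)."""
    mean, log_var = member(s, a)
    std = (0.5 * log_var).exp()
    out = mean + std * torch.randn_like(std)
    return s + out[..., :-1], out[..., -1]              # predicted next state and reward

# epistemic uncertainty comes from disagreement across a bootstrap ensemble of members
ensemble = [GaussianDynamicsMember(obs_dim=3, act_dim=1) for _ in range(5)]
```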
To generate a prediction from the ensemble, a member of the dynamics ensemble is sampled at random, so that different members generate the transitions along a single model rollout. Policy optimization is performed using SAC, adopted from the open-source library mbrl-lib [PAZ+21]. SAC alternates between a policy evaluation step, which estimates the state-action value function, and a policy improvement step. For all the experiments, we use the automatic entropy tuning flag that adaptively modifies the entropy gain based on the stochasticity of the policy. Performing extended model rollouts leads to compounding error problems. To mitigate this problem, the MBPO algorithm propagates short model rollouts starting from the state distribution of a different policy under the true environment dynamics. As a result, a few long trajectories with large errors can be traded for many short trajectories with smaller errors.

Figure 3.2 shows the architecture of an MBPO agent. It consists of two replay buffers for storing real experiences from the stochastic environment and generated experiences from the predictive model. Off-policy SAC is utilized as the policy optimization algorithm. SAC follows the actor-critic framework and is constructed using neural networks: the actor network consists of 2 hidden layers of 64 units each with Tanh activations, and the critic network consists of 2 hidden layers of 256 units each with Tanh activations. The predictive model is represented as an ensemble of probabilistic neural networks whose outputs parameterize a Gaussian distribution. Each individual network of the ensemble consists of 4 hidden layers of 200 units each with SiLU activations [EUD17]. Figure 3.3 shows the data collection process in the MBPO algorithm. For each rollout step, a model is sampled uniformly at random from the ensemble to generate a transition prediction. These transition predictions are stored in the model replay buffer and are further utilized as inputs for the SAC algorithm for policy optimization. Algorithm 3.1 shows the working of the Model-Based Policy Optimization algorithm.

mbrl-lib: https://github.com/facebookresearch/mbrl-lib

Figure 3.2: Detailed architecture depicting the individual components of an MBPO agent, which utilizes SAC as the policy optimization algorithm. The figure depicts the interaction of the SAC agent with the stochastic environment to collect real experiences, which are later utilized to train a predictive model that generates simulated experience. This simulated experience is later utilized as input to the SAC algorithm for policy optimization.

Figure 3.3: Experiences collected in the environment replay buffer are sampled as mini-batches.
Algorithm 3.1 Model-Based Policy Optimization
1: Initialize policy π_φ, predictive model p_θ, critic ensemble {Q_i}_{i=1}^{2}, environment buffer D_t, model buffer D_model
2: global_step ← 0
3: for episode t = 0, ..., T − 1 do
4:     for E steps do
5:         if global_step % F == 0 then
6:             Train model p_θ on D_t via maximum likelihood
7:             for M model rollouts do
8:                 Perform a k-step model rollout starting from s ∼ D_t; add to D_model
9:         Take action in the environment according to π_φ; add to D_t
10:        for G gradient updates do
11:            Update {Q_i}_{i=1}^{2} with mini-batches from D_model via SGD on the critic loss
12:            Update π_φ with mini-batches from D_model via stochastic gradient ascent on the policy objective of Equation (3.20)
13:        global_step ← global_step + 1

3.3.2 Hyperparameters

Table 3.1 provides the main hyperparameters used for the continuous control environments. This section further discusses the specific hyperparameters that define the capacity of the model-generated buffer and that are later used for the ablation studies.

Model rollout length k refers to the number of steps taken by the agent during the rollout phase. A rollout is a simulated sequence of actions, starting from an initial state and following a particular policy until a terminal state or a predefined time horizon is reached. The rollout length determines the length of the prediction horizon generated by the model to estimate the expected returns of various policies.

Model rollouts per step M refers to the number of rollouts generated by the model per training step. This hyperparameter directly controls the amount of data generated by the model and thus the ratio between real environment transitions and simulated transitions. Increasing the number of rollouts per step gathers more simulated experience, increasing the amount of data available for updating the agent's policy.

Frequency of model retrain F refers to how frequently, in environment steps, the model is retrained during the training process. Retraining the model more frequently allows it to adapt to the distribution of states observed in the true environment and improves the model's accuracy.

Updates to retain buffer R refers to the number of model-data-generation iterations that are retained in the model replay buffer. The replay buffer contains past experiences of the agent, typically stored as (state, action, reward, next state, done) tuples. The higher the value of this hyperparameter, the more off-policy (old) data can be stored and sampled for training.

Table 3.1 lists the important hyperparameters used for MBPO-style learning in the continuous control experiments.

Ensemble size N refers to the number of dynamics models used in the ensemble. Each model in the ensemble is a probabilistic neural network whose outputs parameterize a Gaussian distribution [JFZL21],

p_θ^i(s_{t+1}, r | s_t, a_t) = N( μ_θ^i(s_t, a_t), Σ_θ^i(s_t, a_t) ).

Ensemble models are responsible for capturing both aleatoric and epistemic uncertainty.
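As an illustration of how each ensemble member can parameterize such a Gaussian and be trained via maximum likelihood, the following is a minimal sketch with a diagonal covariance; the class name `GaussianMLP` and the loss below are illustrative assumptions, not the exact mbrl-lib implementation.

```python
import torch
import torch.nn as nn

class GaussianMLP(nn.Module):
    """One ensemble member: predicts the mean and log-variance of (s_{t+1}, r)."""

    def __init__(self, in_dim, out_dim, hidden=200, num_layers=4):
        super().__init__()
        layers, dim = [], in_dim
        for _ in range(num_layers):
            layers += [nn.Linear(dim, hidden), nn.SiLU()]
            dim = hidden
        self.body = nn.Sequential(*layers)
        self.mean_head = nn.Linear(hidden, out_dim)
        self.log_var_head = nn.Linear(hidden, out_dim)

    def forward(self, state, action):
        h = self.body(torch.cat([state, action], dim=-1))
        return self.mean_head(h), self.log_var_head(h)

def gaussian_nll(mean, log_var, target):
    """Per-member maximum-likelihood loss (Gaussian negative log-likelihood)."""
    inv_var = torch.exp(-log_var)
    return ((target - mean).pow(2) * inv_var + log_var).mean()
```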
Trainer patience p refers to the parameter used for early stopping during model training: if the validation performance does not improve for p consecutive epochs, training is halted to avoid overfitting the model.

Model-generated buffers are created by using the predictive model to generate additional training data for the reinforcement learning algorithm. They augment the environment dataset and improve the agent's learning performance: the augmented dataset contains imagined actions and outcomes, expanding the range of collected experiences. The capacity of the model-generated buffer D_model is computed as k × M × F × R, so that the buffer holds R iterations of data generation (for instance, with k = 10, M = 10, F = 400 and R = 1, the buffer holds 10 × 10 × 400 × 1 = 40,000 transitions).

Table 3.1: Hyperparameters for MBPO-style learning

    Hyperparameter                              Acronym
    # episodes                                  T
    # steps per episode                         E
    policy updates per step                     G
    # model rollouts per step                   M
    frequency of model retrain (# steps)        F
    # updates to retain buffer                  R
    ensemble size                               N
    rollout length                              k
    trainer patience                            p

3.3.3 MBPO-C Method

This section introduces Model-Based Policy Optimization-Consistent (MBPO-C), a variant of the MBPO approach. Following the suggestion presented in [CCML18], instead of sampling a model at random at every rollout step, MBPO-C uses a single fixed model for the entire rollout and randomly samples only the start states. The rollouts are therefore model-consistent, and we obtain a mean of the value distribution equal to the mean of the true value distribution. The key idea behind this approach is to leverage the diversity of multiple models to improve overall performance: by using a different fixed model for each rollout, we can benefit from the strengths and weaknesses of each model, and the ensemble as a whole helps reduce bias and increase the robustness of the estimated value. Figure 3.4 shows the modified training loop of MBPO with model-consistent rollouts, and Algorithm 3.2 summarizes the Model-Based Policy Optimization-Consistent algorithm.

Figure 3.4: Experiences collected in the environment replay buffer are sampled as mini-batches. In comparison to the MBPO-style rollout in Figure 3.3, a transition prediction is generated using a model that is sampled uniformly at random from the ensemble once per rollout, and the same model is used throughout the entire rollout to generate the predicted experiences. The predicted experiences are stored in the model replay buffer.
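A minimal sketch of the model-consistent rollout of Figure 3.4, mirroring the MBPO sketch above but fixing the ensemble member once per rollout; the object names are again hypothetical placeholders.

```python
import random

def mbpo_c_model_rollout(ensemble, policy, env_buffer, model_buffer, k):
    """Sketch of an MBPO-C k-step rollout: one fixed model for the whole rollout."""
    model = random.choice(ensemble)             # sampled once, reused for all k steps
    state = env_buffer.sample_state()
    for _ in range(k):
        action = policy(state)
        next_state, reward, done = model.predict(state, action)
        model_buffer.add(state, action, reward, next_state, done)
        if done:
            break
        state = next_state
```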
Algorithm 3.2 Model-Based Policy Optimization-Consistent
1: Initialize policy π_φ, predictive model p_θ, critic ensemble {Q_i}_{i=1}^{2}, environment buffer D_t, model buffer D_model
2: global_step ← 0
3: for episode t = 0, ..., T − 1 do
4:     for E steps do
5:         if global_step % F == 0 then
6:             Train model p_θ on D_t via maximum likelihood
7:         Using a fixed model sampled from the ensemble, perform k-step model rollouts starting from s ∼ D_t; add to D_model
8:         Take action in the environment according to π_φ; add to D_t
9:         for G gradient updates do
10:            Update {Q_i}_{i=1}^{2} with mini-batches from D_model via SGD on the critic loss
11:            Update π_φ with mini-batches from D_model via stochastic gradient ascent on the policy objective of Equation (3.20)
12:        global_step ← global_step + 1

3.4 Integral Probability Metrics

This section gives a detailed introduction to the class of statistical metrics called IPMs, which quantify the discrepancy between two probability distributions. In this study, IPMs are used to compare the distribution of environment rewards with the distribution of model-imagined rewards, as well as the empirical state distributions of the environment and the model. The main idea behind incorporating these metrics is to study the impact of the different model-based hyperparameters on the model-generated data by measuring the discrepancy between the model-generated and the environment distributions of observations and rewards. IPMs provide a unified framework for comparing distributions without making specific assumptions about their parametric form.

For example, consider two empirical distributions x and y over a common space and a distance function D that is an IPM. The distance function is a metric when non-negativity, positive definiteness, and symmetry are satisfied:

D(x, y) ≥ 0    (non-negativity),    (3.21)
D(x, y) = 0 if and only if x = y    ((3.21) and (3.22) together give positive definiteness),    (3.22)
D(x, y) = D(y, x)    (symmetry).    (3.23)

This study uses two IPMs [SFG+09]: the Wasserstein Distance (WD), which measures the discrepancy between the distributions of environment and model rewards, and the Maximum Mean Discrepancy (MMD), which measures the discrepancy between the empirical state distributions of the environment and the model.

3.4.1 Wasserstein Distance

The Wasserstein distance, also known as the Earth Mover Distance (EMD), is a distance metric that measures the dissimilarity between two probability distributions. It quantifies the minimum cost required to transform one distribution into the other, where the cost is the amount of mass that needs to be moved from each point of one distribution to its corresponding point in the other distribution. Mathematically, if P is the empirical probability distribution of rewards from an environment dataset X_1, ..., X_n and Q is the empirical probability distribution of rewards from a model dataset Y_1, ..., Y_n of the same size, then the Wasserstein distance between the two distributions is calculated as

W_p(P, Q) = ( Σ_{i=1}^{n} ||X_i − Y_i||^p )^{1/p}.    (3.24)

Figure 3.5: The plot depicts the Wasserstein distance calculation between the environment reward distribution P(X) and the model reward distribution Q(Y). W(P, Q) quantifies the minimum cost required to transform one distribution into the other.
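For the one-dimensional reward distributions considered in this study, the distance in Equation (3.24) with p = 1 can be computed directly with SciPy; the arrays below are made-up stand-ins for rewards drawn from the environment and model buffers.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Hypothetical one-dimensional reward samples from the environment and the model buffers.
env_rewards = np.random.default_rng(0).normal(loc=0.0, scale=1.0, size=1000)
model_rewards = np.random.default_rng(1).normal(loc=0.2, scale=1.1, size=1000)

# 1-Wasserstein (earth mover) distance between the two empirical distributions.
print(f"W1(P, Q) = {wasserstein_distance(env_rewards, model_rewards):.4f}")
```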
In this study, the Wasserstein distance metric is used to compare the reward distributions of the model and the environment. The reward distributions are one-dimensional, and the metric provides a way to capture the underlying structure and spatial relationship between the two reward distributions. The distance is computed with the Python open-source library SciPy [VGO+20], as sketched above. Figure 3.5 shows a visual representation of how the Wasserstein distance is calculated between two reward distributions.

The Wasserstein distance can be extended to higher-dimensional state spaces by considering the amount of mass that needs to be moved in a multidimensional space: the metric then measures the minimum cost of transforming one distribution into the other by moving mass in a way that preserves the total mass and minimizes the total transport distance. In practice, computing the exact Wasserstein distance for high-dimensional distributions is computationally intensive because of the optimization problem involved. As the dimensionality increases, the number of samples required to estimate the Wasserstein distance accurately also grows exponentially. High-dimensional spaces suffer from the curse of dimensionality, where the volume of the space increases exponentially with the number of dimensions; this makes it more challenging to sample from and represent distributions accurately, potentially leading to distorted distance measures. To address the problem of measuring distances in higher-dimensional spaces, Section 3.4.2 introduces the MMD metric.

3.4.2 Maximum Mean Discrepancy Distance

MMD is a kernel-based metric that quantifies the discrepancy between distributions as the distance between mean embeddings of features. Under certain conditions, MMD is zero if the two distributions are identical. The MMD metric does not assume any specific parametric form of the distributions being compared. It maps the samples from each distribution into a higher-dimensional feature space using a feature map and compares the mean embeddings of the distributions in that space. A smaller MMD value indicates a higher similarity between the distributions and vice versa.

Given two probability distributions, let P be the empirical probability distribution of the state space from an environment dataset X_1, ..., X_n and Q the empirical probability distribution of the observation space from a model dataset Y_1, ..., Y_n, and assume we have samples drawn from each distribution. The MMD between P and Q is computed as follows.

First, MMD is defined through a feature map for the environment observation space, ψ : X → H, where H is called a Reproducing Kernel Hilbert Space (RKHS) [Gre13]. MMD starts by mapping the samples from the original input space into this higher-dimensional feature space. The feature maps are elements of a space of functions that satisfy the reproducing property:

⟨f, ψ(x)⟩_H = f(x)    for any f ∈ H.

Second, a kernel function is defined to measure the similarity between pairs of samples in the observation space; the kernel compares the representations of the mapped samples. This combined kernel function for both environment and model observations can be represented as K(x, y) = ⟨ψ(x), ψ(y)⟩_H. Common kernel functions include the Gaussian kernel, the polynomial kernel, and the Laplacian kernel.
The Gaussian and Laplacian kernels are defined in Equations (3.25) and (3.26), respectively:

K_Gaussian(x, y) = exp( −||x − y||² / (2σ²) ),    σ > 0,    (3.25)
K_Laplacian(x, y) = exp( −||x − y|| / (2σ²) ),    σ > 0.    (3.26)

Figure 3.7 shows a comparative study using the Gaussian and the Laplacian kernel function for calculating the MMD distance. The study was performed between the model and environment observation datasets from the Sparse Inverted Pendulum environment. Both kernel types can be used, since both return an unbounded distance value. In general, the Gaussian kernel results in smaller distance values, because the squared Euclidean distance in the exponent leads to faster decay as the distance between data points increases; the Laplacian kernel, which uses the absolute value of the Euclidean distance, decays more slowly and therefore yields larger distance values. As a result, the Gaussian kernel is used in the MMD distance metric to compute the similarity between the mapped representations.

Finally, the MMD distance is computed from the differences and similarities between samples within each distribution and across the distributions. The first term of the MMD captures the similarity within the environment state space: it calculates the average pairwise kernel similarity between samples belonging to the environment state space, with the goal of capturing the inherent structure and patterns of that distribution. Similarly, the second term captures the similarity within the model state space. The first and second terms are, respectively,

⟨E_{X∼P} ψ(X), E_{X'∼P} ψ(X')⟩_H,    (3.27)
⟨E_{Y∼Q} ψ(Y), E_{Y'∼Q} ψ(Y')⟩_H.    (3.28)

The third term measures the dissimilarity between samples across the environment and model state-space distributions; it captures the discrepancy between the distributions and is given by

⟨E_{X∼P} ψ(X), E_{Y∼Q} ψ(Y)⟩_H.    (3.29)

Finally, the MMD distance is calculated by combining Equations (3.27), (3.28), and (3.29), which gives an overall measure of the difference between the distributions. The weights in front of each term ensure that the contributions are properly balanced, and the combined equation is

MMD²(P, Q) = ⟨E_{X∼P} ψ(X), E_{X'∼P} ψ(X')⟩_H + ⟨E_{Y∼Q} ψ(Y), E_{Y'∼Q} ψ(Y')⟩_H − 2 ⟨E_{X∼P} ψ(X), E_{Y∼Q} ψ(Y)⟩_H.    (3.30)

The above equation can be written compactly as

MMD²(P, Q) = || E_{X∼P} ψ(X) − E_{Y∼Q} ψ(Y) ||²_H.    (3.31)

Figure 3.6 shows a visual representation of how the MMD is calculated by mapping the observations into an RKHS.

Figure 3.6: The figure depicts the MMD distance calculation between the environment state space and the model state space. The MMD metric maps the samples from each distribution into a higher-dimensional Reproducing Kernel Hilbert Space (RKHS) using a feature map and compares the mean embeddings of the distributions in the RKHS.

Figure 3.7: Comparison of the Gaussian and Laplacian kernel functions applied to the MMD distance metric for comparing the disparity between model-observed states and environment states of the sparse pendulum environment. (a) Gaussian kernel. (b) Laplacian kernel.
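A biased empirical estimate of Equation (3.30) with the Gaussian kernel can be sketched as follows; the bandwidth value and the example arrays are illustrative assumptions, not the exact configuration used in the experiments.

```python
import numpy as np

def gaussian_kernel(x, y, sigma):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2)) for all pairs of rows."""
    sq_dists = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def mmd_squared(env_states, model_states, sigma=1.0):
    """Biased empirical estimate of MMD^2(P, Q) in Equation (3.30)."""
    k_xx = gaussian_kernel(env_states, env_states, sigma).mean()
    k_yy = gaussian_kernel(model_states, model_states, sigma).mean()
    k_xy = gaussian_kernel(env_states, model_states, sigma).mean()
    return k_xx + k_yy - 2.0 * k_xy

# Hypothetical 3-dimensional state samples from the environment and the model.
rng = np.random.default_rng(0)
env_states = rng.normal(size=(500, 3))
model_states = rng.normal(loc=0.1, size=(500, 3))
print(f"MMD^2 = {mmd_squared(env_states, model_states):.4f}")
```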
4 Methodology

In the preceding chapters, we described the elements of reinforcement learning, including their classification into model-free and model-based approaches, and provided an in-depth exploration of the various hyperparameters employed in the model-based setting. Furthermore, we gave a comprehensive introduction to IPMs, which are subsequently used in this study to quantify the disparity between the distributions of observations and rewards under the model and in the actual environment.

Following a thorough examination of the existing literature, the primary research objective of this thesis is to identify the core components of MBRL algorithms and design specific experiments to understand the impact of these components in isolation under sparse reward scenarios. A non-exhaustive list of MBRL design choices includes the diversity of model-generated data, training the model for one-step or multi-step prediction, the amount of model-generated data, the frequency of retraining the model, the length of rollouts, and the amount of off-policy data stored inside the model buffer. For this purpose, MBPO is chosen as the model-based algorithm for extensive investigation in this study. The primary benefits of opting for this model-based algorithm are its superior data efficiency, its flexibility in adjusting hyperparameters, and its ability to operate in a manner that closely resembles a model-free setting. This analysis of design decisions aims to establish guiding principles for addressing the challenges posed by sparse reward scenarios through the use of model-based approaches.

Integral probability metrics are employed to measure the impact of the design choices. This is achieved by evaluating the discrepancy between the distributions of observations and rewards perceived by the model and the state and reward distributions of the real environment. Specifically, the Wasserstein distance metric is employed to quantify the discrepancy between the rewards observed by the model and the actual reward distribution present in the environment, while the maximum mean discrepancy distance metric is used to assess the dissimilarity between the state distribution as observed by the model and the state distribution of the actual environment. These metrics help illuminate the model's interpretation of rewards and states under the different components of the MBRL algorithm. Moving forward, Chapter 5 outlines the experimental configuration, while Chapter 6 presents a comprehensive examination and analysis of the experiments carried out to explore the various design choices.

5 Experimental Setup

This section gives a brief introduction to the continuous control environments in which the effect of the hyperparameters is evaluated over continuous state-action spaces. In contrast to discrete control environments, where the agent chooses actions from a finite set of possibilities, continuous control environments allow the agent to choose from a much broader set of actions, represented as continuous variables such as real numbers or vectors. Examples of continuous control environments used to train RL agents include OpenAI Gym [BCP+16a] and the DeepMind Control Suite [TMD+20].

5.1 Hardware Specifications

The experiments are carried out on an in-house High Performance Cluster (HPC).
The hardware configuration is as follows. The Central Processing Unit (CPU) is an Intel Xeon Gold 6148 with 20 cores, running at 2.40 GHz nominally and at up to 3.70 GHz in turbo mode. The Graphics Processing Unit (GPU) is an NVIDIA Tesla V100 SXM with 32 GB of memory and a memory bandwidth of 900 GB/s; the GPU provides 5120 CUDA cores and 640 NVIDIA Tensor cores.

Figure 5.1: Illustration of the continuous control environments on which the different design choices of the MBPO algorithm are tested: (a) Inverted Pendulum, (b) Mountain Car Continuous, (c) Cheetah Run.

5.2 Inverted Pendulum

The Inverted Pendulum environment with sparse rewards was proposed by [CBK20]. In this task, the pendulum is always initialized in the downward position with zero velocity, and the goal of the RL agent is to apply appropriate torque to swing the pendulum up to the upright position. Figure 5.1a shows the inverted pendulum environment.

5.2.1 State Space

The state space of the sparse inverted pendulum environment is continuous and consists of the angular position of the pendulum, given by the sine and cosine of the angle, and its angular velocity. Table 5.1 lists the elements of the state space.

Table 5.1: State space specification of the Sparse Inverted Pendulum environment

    State                Min       Max
    x = cos(angle)       -18.19    18.19
    y = sin(angle)       -71.81    71.81
    Angular velocity     -0.5      0.5

5.2.2 Action Space

The action space of the sparse inverted pendulum environment is continuous. The action represents the torque applied at the base of the pendulum. The action values range from -1.0 to 1.0, where -1.0 corresponds to maximum torque in one direction, 1.0 corresponds to maximum torque in the opposite direction, and 0.0 corresponds to no torque.

5.2.3 Rewards

The inverted pendulum environment is considered sparse because the agent is only rewarded with a positive value, given as cosine_angle_tolerance × velocity_tolerance, while the pendulum remains within both the angle and the velocity threshold. To make the problem harder, an action cost of 0.1 is introduced: a cost proportional to the applied torque is added to the state reward. This action cost counteracts the effect of exploration signals; as a result, the agent receives a penalized reward if the pendulum falls or deviates from the upright position. The action cost, i.e., the cost associated with taking specific actions, is incorporated into the reward function and influences the agent's behavior by encouraging actions that minimize the overall cost. This sparse pendulum setup, together with the action cost, makes the task challenging, as the agent needs to find effective policies despite the negative signal coming from the action cost. An episode terminates after 400 steps.

5.3 Mountain Car Continuous

The Mountain Car Continuous environment is part of the classic control environments from OpenAI Gym [BCP+16b]. A car is placed stochastically at the bottom of a valley, and the goal of the agent is to make the car reach the flag at the top of the hill by applying appropriate acceleration. Figure 5.1b shows the Mountain Car Continuous environment.

5.3.1 State Space

The state space of the Mountain Car Continuous environment consists of two variables: the position of the car along the x-axis and its velocity along the x-axis. Table 5.2 lists the elements of the state space.
Table 5.2: State space specification of the Continuous Mountain Car environment

    State                                    Min      Max
    Position of the car along the x-axis     -1.2     0.6
    Velocity of the car                      -0.07    0.07

5.3.2 Action Space

The action space of the Mountain Car environment is continuous: the agent applies an acceleration with an action value ranging from -1.0 to 1.0, where -1.0 corresponds to full acceleration to the left, 1.0 corresponds to full acceleration to the right, and 0.0 corresponds to no acceleration. The action value is finally multiplied by a power coefficient of 0.0015.

5.3.3 Rewards

The Mountain Car Continuous task is considered solved when the car reaches a position greater than or equal to 0.45 at the top of the hill, for which it receives a sparse reward of 100. In addition, the agent receives a negative reward of −0.1 × action², encouraging it to reach the goal with the minimum amount of effort. Each episode lasts 1000 steps.

r_t = −0.1 × action²    if position < 0.45,
r_t = 100               if position ≥ 0.45.

5.4 Cheetah Run

The Cheetah Run environment is a continuous control environment in which the goal is to train the agent to make the cheetah reach its maximum forward velocity; the task is to optimize the control policy for high-speed running. This is a dense reward environment, and a parallel study is conducted on it to observe whether the same set of hyperparameters is responsible for the performance of the model-based agent in a dense reward setting.

5.4.1 State Space

The state space of Cheetah Run has 17 dimensions and is high-dimensional in comparison with the previously introduced environments. The body position represents the overall position of the cheetah, and the body velocity represents the rate at which the cheetah moves in the environment; both position and velocity are unbounded. Table 5.3 lists the elements of the state space.

Table 5.3: State space specification of the Cheetah Run environment

    State
    Body position of the cheetah
    Body velocity of the cheetah

5.4.2 Action Space

The action space is also continuous and 6-dimensional. The goal of the RL agent is to apply torques that actuate the cheetah's joints and make the cheetah run forward with the maximum possible velocity.

5.4.3 Rewards

The Cheetah Run task is completed when the agent makes the cheetah achieve a velocity greater than or equal to 10, for which it receives a reward of 1; otherwise, the agent receives a reward between 0 and 1:

r_t = 0.1 × velocity    if 0 ≤ velocity < 10,
r_t = 1                 if velocity ≥ 10.

6 Evaluations and Analysis

In this section, we conduct several ablation studies to analyze the behavior of MBPO and discuss the experimental results in detail. Each experiment was conducted with 20 random seeds in order to evaluate the results properly. The experimentation in this section uses SAC [HZAL18] as the baseline model-free algorithm and MBPO [JFZL21] as the baseline model-based algorithm. The evaluation return plots presented in this section are computed by collecting the evaluation returns across several runs, calculating the mean score, and finally smoothing with a moving-average filter of window size 5. The confidence interval regions are produced by calculating the standard error of the mean for the smoothed returns and plotting the mean with ±1 standard error.
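The following is a minimal sketch of how such curves can be produced from raw per-seed evaluation returns, under the (assumed) reading that the mean and standard error are smoothed with the same window; the function name and array shapes are illustrative.

```python
import numpy as np

def evaluation_curve(returns, window=5):
    """returns: array of shape (num_seeds, num_evaluations).

    Produces the smoothed mean return and a +/- 1 standard-error band,
    roughly as described for the evaluation plots in this chapter.
    """
    kernel = np.ones(window) / window
    mean = np.convolve(returns.mean(axis=0), kernel, mode="valid")
    sem = returns.std(axis=0, ddof=1) / np.sqrt(returns.shape[0])
    sem = np.convolve(sem, kernel, mode="valid")
    return mean, mean - sem, mean + sem
```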
By calculating the confidence interval, we can estimate the range of values within which the true population mean is likely to fall with a certain level of confidence. All the algorithms were implemented using MBRL-Lib [PAZ+21].

In Section 6.1, we conduct experiments to evaluate the performance of SAC and MBPO agents in the SparsePendulum-v0 and MountainCarContinuous-v0 sparse reward environments. This study compares how a model-free SAC agent and a model-based MBPO agent behave in sparse reward environments. In Section 6.2, hyperparameter ablations are conducted on the rollout length (k), model rollouts per step (M), updates to retain buffer (R), ensemble size (N), frequency of model retrain (F), and trainer patience. The goal is to examine why some hyperparameter settings work better than others through the lens of the dataset that is used for training the agent; the IPMs are utilized to better analyze the behavior of the agent during training. The hyperparameter ablation study is also conducted in parallel on the MBPO-C agent, to evaluate the performance of this MBPO variant in sparse reward scenarios. In Section 6.3, a hyperparameter ablation study is performed in the dense reward CheetahRun-v0 environment to compare the behavior of the MBPO agent in a higher-dimensional dense reward setting. Finally, we conduct an experiment in which the MBPO agent is deliberately configured to perform close to the SAC agent in sparse reward scenarios. Starting from this reduced performance, we conduct a targeted hyperparameter study to understand which specific hyperparameter, or combination of hyperparameters, is responsible for the performance boost in sparse reward scenarios.

6.1 SAC vs MBPO Performance

This section examines the behavior of the model-free SAC and model-based MBPO agents in the SparsePendulum-v0 and MountainCarContinuous-v0 sparse reward environments. In these experiments, the MBPO agent uses the hyperparameter settings presented in Table 6.1. The model-free SAC agent uses the same configuration of the policy network and Q-network as presented in Table 6.1, but without the model-based hyperparameters. We perform a comparative study to observe the effect of introducing the action cost and how both the model-free and the model-based agent manage to solve the sparse reward task in the environment.

The goal of introducing an action cost in the sparse pendulum environment is to make the sparse reward problem more difficult to solve. An action cost refers to a penalty or negative value associated with taking a specific action in a particular state of the environment. Action costs are often used in scenarios where taking certain actions has drawbacks that the agent should consider when making decisions; these costs can influence the agent's policy by encouraging it to choose actions that minimize the cumulative cost over time. No additional action cost was introduced in the continuous mountain car environment, since it already has an action cost built in. The results of the experiments are shown in Figure 6.1. In Figure 6.1a, where no action cost is applied, i.e., there is no penalty for exploring diverse actions that do not help in finding the reward, both the model-free SAC and the model-based MBPO agent were able to find the sparse reward and converge to an effective policy for the majority of the seeds.
Another key observation from the plot is that the model-based agent needs fewer environment steps to converge to an optimal policy, whereas the model-free agent is less sample efficient. In stark contrast, when the action cost is applied to the sparse pendulum problem, i.e., a penalty is given for explorations that do not help in finding the sparse reward, the picture changes. The SAC algorithm incorporates unstructured exploration to encourage the agent to explore new states and actions in the environment. Unstructured exploration in this context refers to exploring the state/action space in a random and undirected manner, rather than relying solely on the policy optimization process. Through the combination of entropy regularization and exploration noise, SAC strikes a balance between structured exploration (encouraging diverse action selection) and unstructured exploration (encouraging random exploration in the action space). The results show that the state-of-the-art model-free SAC algorithm struggles to explore and find optimal or near-optimal policies in sparse reward scenarios because of this unstructured exploration. In contrast, the model-based MBPO agent manages to find the sparse reward in nine of the twenty seeds launched: by utilizing the learned model, MBPO can explore via model rollouts, which helps with effective exploration and convergence to optimal policies.

A similar comparative study between the model-free and model-based agents was performed on the challenging continuous version of the mountain car problem. It is evident from this study that the state-of-the-art SAC algorithm has limitations, since it performs unstructured exploration by injecting noise into action selection, which is insufficient for tasks such as continuous mountain car and the sparse pendulum that require sustained exploration to find the solution [RKS21] [EHPM22]. MBPO, on the other hand, solves the problem for roughly half of the seeds.

Figure 6.1: (a) Learning curves of SAC and MBPO agents with and without action cost in the sparse pendulum environment. (b) Learning curves of SAC and MBPO agents in the sparse continuous mountain car environment. Results show average returns over 20 random seeds, smoothed by a moving-average filter; we report the mean (solid lines) and standard error (shaded regions). All runs use the hyperparameter settings in Table 6.1.

6.2 Hyperparameters Ablation Study

In this section, an MBPO hyperparameter ablation study is conducted for both the SparsePendulum-v0 and MountainCarContinuous-v0 sparse reward environments. All experiments were carried out with an action cost of 0.1, and the results show average returns over 20 random seeds, smoothed by a moving-average filter; we report the mean (solid lines) and standard error (shaded regions). The results of the ablation study, together with the metrics, are discussed in the following subsections. We use the hyperparameter settings from Table 6.1, which were also used for the initial experiments in the sparse pendulum (Figure 6.1a) and the mountain car (Figure 6.1b).
We use the average return of the sparse pendulum environment with action cost as the baseline for the remainder of the hyperparameter ablation study, and proceed analogously for the mountain car environment.

Table 6.1: MBPO hyperparameter settings for continuous control experiments (values listed only once are shared across the environments)

    Hyperparameter                              Sparse Pendulum              Mountain Car                 Cheetah Run
    # environment steps                         30e3                         50e3                         250e3
    T - # episodes                              75                           50                           250
    E - # steps per episode                     400                          1000
    G - policy updates per step                 20                           10
    M - # model rollouts per step               10
    F - frequency of model retrain (# steps)    400                          250
    R - # updates to retain buffer              1                            10
    N - ensemble size                           5
    k - rollout length                          10                           5
    trainer patience                            1                            5
    learning rate                               3e-4                         1e-3
    Policy network                              2 layers, 64 units, Tanh     2 layers, 128 units, Tanh
    Q network                                   2 layers, 256 units, Tanh
    Model network                               4 layers, 200 units, SiLU

The MMD distance metric plots presented in the hyperparameter ablation study indicate the distance between the distributions of the environment and the model state space. A higher MMD value indicates that the distributions are dissimilar, while a value closer to zero indicates that they are similar. Similarly, the Wasserstein distance metric plots show the distance between the distributions of environment and model rewards: a higher value indicates that the two distributions differ, and a value closer to zero indicates that they are similar. All distance metric plots presented in the hyperparameter ablation study are clipped to the region where the distinction between the curves is clearly observable; as training progresses, all values remain more or less constant.

6.2.1 Rollout Length Ablation

In this subsection, we present the results of an ablation study on the rollout length hyperparameter. The study was conducted using the MBPO and MBPO-C style methods in the sparse pendulum and mountain car environments. We performed experiments with k = {1, 3, 7, 10} and kept the rest of the hyperparameters fixed to the values given in Table 6.1. The rollout length determines the length of the prediction horizon generated by the model to estimate the expected returns of various policies [BTZ22]. A longer rollout length emphasizes increased exploration of unfamiliar states and may lead to better policies, but the accuracy of predictions typically decreases for longer rollouts due to distributional shift, i.e., the learned model is queried out of distribution. A shorter rollout length, on the other hand, emphasizes exploitation: the agent focuses on immediate rewards and makes decisions based on its current state.

In the results presented in Figure 6.2, the top row shows the rollout length ablation study performed in the sparse pendulum environment and the bottom row the ablation study performed in the mountain car environment. We observe from the learning curves that higher values of the rollout length resulted in higher returns in both sparse reward environments compared to smaller values. A rollout length of 1 means that the model takes a single action, observes the resulting state and reward, and then makes the next decision. This leads to a lack of exploration because, in continuous control environments, the model often has numerous potential actions available in any given state.
By focusing on just one action, the agent may overlook opportunities to explore alternative actions that could yield superior results over time. Furthermore, the agent can become trapped in local optima, making narrow-minded choices that enhance its immediate reward but hinder its progress towards more favorable states in the future. We confirmed this hypothesis in the mountain car environment, where the learning curve with k = 1 did not yield better returns than the higher rollout lengths. Interestingly, the sparse pendulum environment showed positive returns even with a rollout length of 1. The hypothesis is nevertheless still supported, as higher rollout lengths led to even greater returns; this suggests that other factors also contribute to the positive returns observed in the sparse pendulum environment.

In the W-distance metric plots (center column) for the sparse pendulum environment, we observe that the difference between the reward distributions generated by the model and the environment is most pronounced at the start of training. As training advances, the Wasserstein distance for the various rollout lengths gradually converges; the reported results are truncated because this convergence behavior persisted for the remaining environment steps. In the mountain car environment, a greater rollout length led to the largest disparity between the model-generated and environment reward distributions, and consequently to a higher overall Wasserstein distance. This trend mirrors the observation made in the sparse pendulum environment. It can be attributed to the fact that, at the beginning of training, the model visits unfamiliar states during longer rollouts and receives rewards based on the actions taken from those unfamiliar states; hence the discrepancy between the model-generated and environment distributions at the beginning of training and the eventual convergence towards the end.

The MMD distance metric plots (right column) indicate that higher rollout lengths produced the maximum discrepancy between the model-generated and environment observation state distributions in both sparse reward environments. As training progressed, the MMD distance decreased, which indicates the distribution of states genera