Institute of Parallel and Distributed Systems University of Stuttgart Universitätsstraße 38 D–70569 Stuttgart Masterarbeit Hierarchical Inverse Reinforcement Learning from Motion Capture Data Rohit Prakash Course of Study: INFOTECH Examiner: Prof. Dr. rer. nat. Marc Toussaint Supervisor: Dr. Jim Mainprice, Philipp Kratzer, M.Sc. Commenced: April 29, 2019 Completed: October 29, 2019 Abstract A human motion generally consists of multiple low-level tasks which are performed in a defined order or in parallel to achieve a high level task. For example, making pizza dough consists of several low-level tasks such as measuring water, adding yeast, measuring flour, etc. And these activities must be performed in a definite order to make the dough. For a system to imitate these sequence of activities of a long horizon task with one global reward function is a lot of work. This process can be made easier if there is hierarchical state representation of the task and learning of local rewards for the hierarchies. In this thesis, we have learned to imitate a general day to day human activity of ’setting table for one person’. This work adopts a framework called Hierarchical Inverse Reinforcement Learning (HIRL), which is a model to learn sub-task structure from demonstrations. With this framework, the activity is decomposed into multiple lower level tasks which are performed in a sequence using learned policies. In this work, Maximum Entropy Inverse Reinforcement Learning (MaxEnt-IRL) is used to learn local rewards for the sub-tasks. Together with hierarchical state space representation and local reward functions, the model encodes the high level task objective based on human demonstrations of full body motion performing the high level task. The model achieves a success rate of 84% on average for middle levels and 83% for the top level in the cross validation tests. For visualization, the model is simulated in a 2D representation that takes current environment state as input and runs till the completion of the task. 3 Contents 1 Introduction 15 2 Background 17 2.1 Markov Decision Process and Reinforcement Learning . . . . . . . . . . . . . . 17 2.2 Hierarchical Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . 20 2.3 Inverse Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.4 Maximum Entropy Inverse Reinforcement Learning . . . . . . . . . . . . . . . 21 3 Related Work 23 3.1 Human Motion Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.2 Hierarchical Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.3 Hierarchical Inverse Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 4 Methods 25 4.1 Environment and Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . 25 4.2 Trajectory Formulation in 2D Map of Environment . . . . . . . . . . . . . . . . 27 4.3 Hierarchical Structure and State Representations . . . . . . . . . . . . . . . . . 30 4.4 Integration and Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 5 Evaluation 37 5.1 Cross Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 5.2 Comparison with Ground Truth . . . . . . . . . . . . . . . . . . . . . . . . . . 38 6 Discussion and Future Work 45 7 Conclusion 47 Bibliography 49 5 List of Figures 1.1 Montezuma’s Revenge and Taxi domain . . . . . . . . . . . . . . . . . . . . . . 15 2.1 Agent-environment interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 4.1 Demonstrator performing action . . . . . . . . . . . . . . . . . . . . . . . . . . 25 4.2 2D representation of environment . . . . . . . . . . . . . . . . . . . . . . . . . 26 4.3 Trajectory prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 4.4 Feature maps of the environment . . . . . . . . . . . . . . . . . . . . . . . . . . 28 4.5 Reward function of a scene . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.6 Different possibility of arrangement of table for one person . . . . . . . . . . . . 30 4.7 Zones of the environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.8 Hierarchical structure of the system . . . . . . . . . . . . . . . . . . . . . . . . 32 4.9 Simulation setup of the model . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 5.1 Cross validation result for different features . . . . . . . . . . . . . . . . . . . . 38 5.2 Example1 - Ground truth, Simulation path, Heat map . . . . . . . . . . . . . . . 40 5.3 Example2 - Ground truth, Simulation path, Heat map . . . . . . . . . . . . . . . 41 5.4 Example3 - Ground truth, Simulation path, Heat map . . . . . . . . . . . . . . . 43 7 List of Tables 4.1 Tasks performed in the dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 4.2 Macro-actions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.3 Example- sequence of states for Level H1 . . . . . . . . . . . . . . . . . . . . . 33 4.4 Example- sequence of states for Level H21 . . . . . . . . . . . . . . . . . . . . . 34 5.1 Example1 - Simulated sequence from the model . . . . . . . . . . . . . . . . . . 39 5.2 Example1 - Ground truth sequence from the demonstration . . . . . . . . . . . . 39 5.3 Example2 - Simulated sequence from the model . . . . . . . . . . . . . . . . . . 40 5.4 Example2 - Ground truth sequence from the demonstration . . . . . . . . . . . . 40 5.5 Example3 - Simulated sequence from the model . . . . . . . . . . . . . . . . . . 42 5.6 Example3 - Ground truth sequence from the demonstration . . . . . . . . . . . . 42 9 List of Algorithms 2.1 Abstract algorithm for IRL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 11 Acronyms DQN Deep Q Network. 24 HAM Hierarchical Abstract Machines. 20 HIRL Hierarchical Inverse Reinforcement Learning. 3 HRL Hierarchical Reinforcement Learning. 15 IOC Inverse Optimal Control. 16 IRL Inverse Reinforcement Learning. 16 MaxEnt-IRL Maximum Entropy Inverse Reinforcement Learning. 3 MDP Markov Decision Process. 17 SMDP Semi-Markov Decision Process. 20 13 1 Introduction Recently, there has been a number of advancements in the field of Reinforcement Learning. The coming of age Alpha Go AI by DeepMind proves how far we have come in development of an artificial intelligence. Alpha Go is arguably the best Go player in the world. It went on to defeat the Go world champion in global arenas[Dee19]. These achievements are the direct result of years of research and application in various scenarios using reinforcement learning. From learning and mastering of Atari games to performing complex real world tasks, reinforcement learning has showed its usefulness and capabilities on various situations. However, reinforcement learning has its limitations when it comes to complex environments. In a complex environment, state and action space is huge and widespread. Therefore, scaling of task from initial state to terminal state will take forever. The length of the path would be a very long combination of sequences of states and actions. As a result, the reward for an activity is delayed since it takes a long time for completion of an activity. Hierarchical Reinforcement Learning (HRL) tries to reform this problem in a much smaller state space with the use of hierarchies. A complex task can be broken down to hierarchies and provide state and temporal abstractions which could simplify the problem. The use of HRL to solve classic Atari games is an example of such application. The figure 1.1 shows two classic game domains solved using HRL. The first one is the Montezuma’s Revenge, where the agent learns the complex task of acquiring the key shown within the blue boundary and leaving out the door shown within the red boundary. For the agent to leave to next stage, he must first go down the ladder, jump to the right, down the ladder again, over the obstacle, up the ladder on the left and acquire the key. Then he must retrace his steps back to the top. And then finally, walk to the door. Figure 1.1: Montezuma’s Revenge and Taxi domain 15 1 Introduction The other example is the Taxi domain, where the agent picks up the passenger from one of the four locations(red, green, blue or yellow), goes to the destination location and drops the passenger. The hierarchies defined has to work in tandem between the levels for the agent to reach its goal. For instance, given the task of cutting a cucumber, human may perform lower level actions such as picking the cucumber, cleaning it and holding knife in stipulated fashion and then cutting it. This is quite analogous to HRL framework and what this framework is trying to achieve by simplifying a long horizon task by breaking down into hierarchies. In spite of this success, still the learning of the tasks takes a huge number of trials before an agent learns the optimal path or behaviour. Even though the framework could be a benchmark in learning complex scenarios, this is quite inefficient when compared to a human for the same activity. One drawback is that the reward of the states is not already known and agent learns this with various trials of exploration and exploitation like in case of Montezuma’s revenge. If the reward could be already known beforehand, our problem reduces to finding an optimal path in the given state space. Also, the reward which the agent receives by performing desired behaviour is defined by experts, such that agent learns from it. This is an additional overhead when defining an environment for a reinforcement learning task. In case of the Atari games or tasks with lower complexities, defining rewards in the environment is easy. But for a real world long horizon task, it is a lot of work. So how will an expert define rewards for a task of self driving in cars? Inverse Reinforcement Learning (IRL) is another sub field of reinforcement learning, where reward is learned from expert demonstrations. It is also known as Inverse Optimal Control (IOC). The Human expert is a benchmark we are trying to keep up with, and a demonstration of an activity by him can be considered to be viable path to a target state. So learning the reward from a demonstration from a human would make sense as human in most cases is expected to choose states and actions which is most or near efficient. The objective of this thesis is to learn reward from human demonstration for various activities in the environment defined and learn to plan the same activities in an efficient manner with hierarchical states using the learned rewards and features of the environment. This work attempts to combine HRL and IRL techniques to imitate a human demonstration. Chapter 2 gives insight into the background literature for the techniques used in this thesis. In chapter 3, existing works present in the domains of human motion prediction, hierarchical learning and hierarchical inverse learning are discussed. The following chapter 4 discusses the core imple- mentation of the model proposed. All the methods and steps involved in the development of the model is explained in detail. In chapter 5, the results of the thesis are discussed with respect to cross validation tests and comparison with ground truth. In the chapter 6, some of the pointers for future works and improvements are mentioned and finally chapter 7 presents the conclusion of the thesis. 16 2 Background 2.1 Markov Decision Process and Reinforcement Learning Markov Decision Process (MDP) [Bel57][SB98] is defined as a mathematical framework that models decision making in the environment of an agent. The MDP can be defined using four-tuple M = (S, A, Pa(s |s′),Ra(s, s′)). • S is the finite set of states in the environment. • A is the finite set of actions in the environment. • Pa(s |s′) = Pr(st+1 = s′ |st = s,at = a) is the transitional probability that an action a in state s at time t leads to a state s′ at time t+1, • Ra(s, s′) is the immediate reward (or expected immediate reward) received after transition from state s to state s′, due to action a. Policy π, can be defined as a mapping of states to actions which informs what action to perform in a given state s. In an MDP, there is a decision maker called the agent who interacts with the environment. Given a state s, from the set of states S of the environment, the agent selects an action a, from set of actions A. The environment then transitions into a new state and agent receives a reward r, as a result of the action. This process happens sequentially and agent jumps from one state to another creating a trajectory. If the states and actions involved are finite, the MDP is defined as a finite MDP. An important property of MDP is that agent only has to consider the current state it is in for selecting an action. This property is known as Markov property. Figure 2.1: Agent-environment interface 17 2 Background Based on the rewards in the environment, the objective of the agent is always to maximize cumulative rewards over time. We can define return G at time t as Gt = rt+1 + rt+2 + rt+3 + .... + rT (2.1) where T is the final step. If T is finite, then this definition works well. However if T is ∞, a discounting factor γ is introduced since return is no longer guaranteed to be finite. And we can rewrite the above equation as Gt = rt+1 + γrt+2 + γ2rt+3 + ... = ∞∑ i=0 γirt+i+1 (2.2) where discount rate γ is between 0 and 1. An important thing for an agent in the environment is to understand how good it is to be in a given state or to know how good it is to perform a given action in the given state. This notion is made clear with the value functions. There are two types of value functions used in reinforcement learning: the state value function, denoted as V(s), and the state action value function, denoted as Q(s,a). The state value function is the expected return when starting from state s acting according to our policy π: Vπ(s) = Eπ [rt+1 + γrt+2 + γ2rt+3 + ...|st = s] = Eπ [ ∞∑ i=0 γirt+i+1 |st = s] (2.3) This equation can be rewritten as Vπ(s) = Eπ [rt+1 + γ ∞∑ i=0 γirt+i+2 |st = s] = Eπ [rt+1 |st = s] + γEπ [ ∞∑ i=0 γirt+i+2 |st = s ′] = ∑ s′ P(s′ |s, π(s))[R(s′ |s, π(s)) + γVπ(s′)] (2.4) If the policy is a stochastic policy, Vπ(s) = ∑ a π(a|s) ∑ s′ P(s′ |s,a)[R(s′ |s,a) + γVπ(s′)] (2.5) Similarly, the state-action value function or Q function gives the expected return for performing action a in state s when following our policy π. It can be derived to the following equation, Qπ(s,a) = ∑ s′ P(s′ |s,a)[R(s′ |s,a) + γQπ(s′, π(s′)] (2.6) 18 2.1 Markov Decision Process and Reinforcement Learning If the policy is a stochastic policy, Qπ(s,a) = ∑ s′ P(s′ |s,a)[R(s′ |s,a) + γ ∑ a′ π(a′ |s′)Qπ(s′,a′)] (2.7) A policy π is optimal if the expected return for π is greater than or equal to the expected return of π′ for all states. In other words, π ≥ π′ if and only if Vπ(s) ≥ Vπ′(s) for all s ∈ S. The optimal state value function with their corresponding optimal policy can be written as below, V∗(s) = max a [ ∑ s′ P(s′ |s,a)(R(s′ |s,a) + γV∗(s′))] (2.8) π∗(s) = argmax a [ ∑ s′ P(s′ |s,a)(R(s′ |s,a) + γV∗(s′))] (2.9) The optimal state action value function and corresponding optimal policy can be written as below, Q∗(s,a) = ∑ s′ P(s′ |s,a)[R(s′ |s,a) + γmax a Q∗(s′,a′)] (2.10) π∗(s) = P(a|s) = argmax a (Q∗(s,a)) (2.11) Any basic reinforcement learning problem can be modeled as MDP. In classic reinforcement learning, the learning takes place based on a feedback system. Unlike supervised learning, where we have the solution in the dataset while training, reinforcement learning algorithm learns via interaction with the environment. There are two broad possibilities of the policy learning algorithms: Model-free algorithm and model-based algorithm. In model-free algorithm, the agent is not aware of transitional probabilities and learns optimal behaviour with trial-error method with application of different action. Q learning is an example of model-free method. Q learning is an off policy algorithm that evaluates which action to be taken in a state based on action-value function. Learning takes place with exploration of the environment and action-value for a state is updated as algorithm goes along. Q(st,at)← Q(st,at) + α(rt + γmax a Q(St+1,a) −Q(st,at) (2.12) At each step, Q table is updated with new approximated value for best action in a state. Function approximation can also be used for the same purpose. This is known as Deep Q network. This is widely used nowadays for a number of application. It can also be extended with hierarchies for HRL tasks. Value Iteration is a model based algorithm which uses transitional probabilities. The algorithm estimates the value function for all the states. Thus, policy for each state would be the action whichever gives the highest value for a state. Value iteration can be considered as an update rule which converges to optimal V∗. With reference to Bellman optimality equation, it can written as below: Vk+1 = max a ∑ s′,r p(s′,r |s,a)[r + γVk(s′)] (2.13) 19 2 Background 2.2 Hierarchical Reinforcement Learning The notion of Hierarchical Reinforcement Learning was developed to address some of the problems of the reinforcement learning in a complex environment. Solving a reinforcement learning problem in a high dimensional state space and action can be problematic due to very long sequences of actions to reach a defined goal. The idea of HRL is basically to subdivide a problem into hierarchy of tasks and solve them on different levels of temporal abstraction.There are broadly three approaches for HRL, namely Options Framework, Hierarchical Abstract Machines and MAXQ Value Function Decomposition. Options Framework [SPS99] is a well known HRL framework. In this framework, high level actions or macro-actions are options. An option consists of an initiation set, Io, which describes the low level state in which option can be executed; an option policy πo, which is the policy that gets executed and a termination condition, βo, which as the name suggests, provides the condition for termination of the particular option policy. Thus here, the classical definition of MDP needs to be altered by replacing the agents’ low level action with options. This is known as Semi-Markov Decision Process (SMDP) which can be described by the tuple: M = (S,O,P,R, γ) where S is the original state space, O is the set of options defined, P is the transitional probability for jumping to a state after execution of an option and R is the reward received after executing an option. MAXQ decomposition [Die00][Yan19] is a framework where hierarchy is created by decomposing the value function of an MDP into combinations of value functions of smaller MDPs. It can be represented as sum of two components Q(p, s,a) = V(a, s) + C(p, s,a) where V(a,s) is the total expected reward gained when executing an action a in state s and C(p,s,a) is the reward expected from the performance of the parent task after taking action a. Hierarchical Abstract Machines (HAM) [PR98][Yan19] is framework that constitutes of non- deterministic finite state machines whose transitions lead to lower level machines and back again. There are four machine states: action, call, choice and stop. In this approach, the states are complex to design and implement. Hence it is the least popular among the frameworks. 2.3 Inverse Reinforcement Learning The initial basic algorithm for IRL was introduced in Ng and Russell [NR00]. Many of the future works such as Ziebart et al. [ZMBD08] and Ramachandran and Amir [RA07] on IRL is based on this approach. The main issue with IRL is that a policy can be optimal for multiple reward functions,i.e, even if we have a actions from an expert there might exist multiple reward functions for the particular action. The IRL problem is modeled as an optimization problem. In the paper, it is stated that a policy π given by π(s) ≡ a1 is optimal if and only if all the actions and reward function satisfies the condition: (Pa1 − Pa)(1 − γPa1)−1R ≥ 0 (2.14) 20 2.4 Maximum Entropy Inverse Reinforcement Learning Algorithm 2.1 Abstract algorithm for IRL procedure Input : feature matrix, expert trajectories D = τi Ni=1 Initialize reward function, r repeat compute soft value function, V compute policy, π0 evaluate state visitation frequency µ for the current policy π0 evaluate objective function L and its derivative 5θL by comparing µ and state action distribution implied by D. update reward parameter θ until gradient 5θL converges return reward function,r for the optimum policy,π0 end procedure The problem is reduced to simple optimization using linear programming algorithms. However, there is a problem which is quite apparent. R = 0 is always a solution and there could be many options for R that could meet the condition. An additional constraint is introduced to possibly reduce the number of solutions to the above equation. The solution would be chosen that maximizes the following condition.∑ s∈S (Qπ(s,a1) − max a∈A\a1 Qπ(s,a)) (2.15) The intuition is that any deviation from optimal policy should eventually result in reduction of the total reward. This work also introduces the concept of having reward function as linear combination of feature vectors. This was mainly done for approximation in large state spaces. An abstract algorithm for IRL is described in algorithm 2.1[OPN18]. 2.4 Maximum Entropy Inverse Reinforcement Learning MaxEnt-IRL [ZMBD08] is a methodology to learn the reward function from observed demonstra- tions of an optimally behaving agent. In addition to Abbeel & Ng’s [AN04] approach, along with matching feature counts of the observed path and optimal path, principle of maximum entropy [Jay57] is used. With having only the matching of feature counts, the problem is ambiguous since there can be multiple policies for the observed feature counts. In this work, they adopt a probabilistic approach to resolve this ambiguity. Thus, in process handling uncertainty and noise as well. This yields soft maximum version of Markov decision process(MDP). The softened version of MDP equation is as follows[ZRG+09] Q≈(s,a) = R(s,a) + V≈(T(s,a)) (2.16) V≈(s) = softmaxaQ≈(s,a) (2.17) 21 2 Background The reward function is defined as a linear sum of weighted combinations of feature responses f (s) = [ f1(s)... fk(s)]T . θ is the set of reward weights for the corresponding features. The reward function is applicable to any configuration of environment. r(s; θ) = θT f (s) The policy for the MDP is probabilistic with distribution: π(a|s) = eQ ≈(s,a)−V≈(s). From all the observed paths from train set, the empirical feature count is calculated as average over total feature counts by number of demonstrations, f˜ = 1 M M∑ m f (s) where M is the total number of demonstrations. The gradient is calculated as difference of empirical feature counts f˜ and expected feature counts fˆ θ , which is average features accumulated by trajectories. The expected feature counts are calculated in two step process - backward pass and forward pass. And the parameters θ are learned with each iteration until the feature counts are matched. 22 3 Related Work 3.1 Human Motion Prediction There has been quite a lot of work on prediction of motion of an agent in an environment. Ziebart et al. [ZRG+09] is a follow-up of Maxent-IRL. This work presents an uncertainty based planning approach to assign cost function for different features of each state and action in the space. The method employs softened version of MDP and gradient based optimization to come up with cost function for the states of the environment. Activity forecasting [KZBH12] is an extension of Ziebart et al. [ZRG+09]. The objective of the paper is to forecast future action of people from a noisy input. The destination forecasting and understanding of user preferences with respect to physical environment is one of the major focus of the paper. The scene of the environment is semantically extracted to produce features such as cars, pavements, grass, sidewalk, curb, etc. The observed trajectory of the person in the scene is recorded, which along with the physical scene features gives the empirical feature count to be matched with, for IRL part of the problem. Using MaxEnt IRL, the reward weights for each semantic features are learned. And naturally those physical attributes in environment which human tend to prefer is calculated as higher reward weight than others. For instance, sidewalk has higher reward than a car in the scene. This understanding of reward for the semantics of environment helps to predict future probable motion of a person. Even if there are changes, like rearrangement of objects in physical environment, the learned reward weights are still valid since they are separately learned for each feature responses. Hirakawa et al. [HYY+17] also adopts activity forecasting model for predicting non goal directed trajectories of sea bird. There has also been work on the motion prediction of human in a group. Rudenko et al. [RPLA18] presents an approach by considering constraints from the environment and social forces. The work employs the group social force model [MPG+10] to come up with future paths taking in repulsive forces from other people in the surroundings and intra-group forces. There has also been work done to combine the social forces with IRL to get a path in a dynamic environment. Vasquez et al. [VOA14] adopts IRL techniques with features that change over time. Thus re-planning is done in an optimal fashion to factor in on the changing features. This approach uses density features, speed-orientation, velocity and social forces to plan a path for a robot in a dynamic environment. 3.2 Hierarchical Learning There has been quite a number of works in HRL for learning favourable behaviour in a complex environment. Many of these works are inspired from the basic HRL architectures explained in section 2.2. Kulkarni et al. [KNST16] adopts a hierarchical DQN framework to solve the Atari game called Montezuma’s Revenge. Here, there are only two levels of hierarchies controlled by meta-controller for top level and controller for bottom level. The goal of the agent is to acquire a key which is situated at a location in the environment. The agent is required to learn multiple sequences 23 3 Related Work of actions to acquire the key first and then go to an exit door. The meta-controller takes in the state of the environment and its Deep Q Network (DQN) learns to choose the best meta-action, which here is a destination in the environment. The bottom level controller is responsible for primitive actions of motion and grasp and has its own separate DQN which chooses the best primitive action for a combined input of state and subgoal. In this case, as one can perceive, both levels have to work in tandem to reach the terminal state. There is temporal abstraction for the subgoals and since the reward from the environment is delayed, the sub actions performed by lower level controller is provided with intrinsic motivation. In spite of the simplification of state space for the complex environment, the learning of ideal sequence of actions from exploration and exploitation is a tedious task. Konidaris et al. [KKL18] argues that it is difficult for an agent to come up with high level plan from a low level domain of the world. Abstract representations of an environment is created for planning in high dimensional space.The agent learns these representations on its own using its sensory inputs and models a high level plan for a simple task in a particular environment. 3.3 Hierarchical Inverse Learning Combining HRL with IRL is coming up as viable options in dealing with long horizon problems for practical applications. In [KGL+16], a long horizon task is decomposed into linear sub-goals and sequence of local rewards are learned with MaxEnt-IRL algorithm. The local rewards guide the agent to the given sub-goals which happen to be more efficient than the sparse true reward. The Q learning technique is used to learn the policies for the augmented state space for the problem. The paper proves that HIRL converges much better than other alternatives for the same experiment. This work is quite close to the work presented in this thesis as it deals with a long horizon task for setting up of a table. The task presented in this thesis is composed of multiple MDPs that transition among themselves and the switching is also modelled by another MDP at top level. Another alternative that can be modeled for the scenario of this thesis is the guided cost learning[FLA16]. Though the work explicitly does not mention the use of hierarchies, however it deals with high dimensional state space problem. The work presents a model that can learns a non-linear cost function using neural networks for a state space with unknown dynamics. This approach uses sample based approximation of MaxEnt-IRL and can be applied to a variety of real world challenges. As an experiment in the paper, the robot learns to carefully arrange plates in a dish rack from a human demonstration. 24 4 Methods In this chapter, the focus is on the methods involved in setup and development of the system. The first section discusses the environment and process of data collection for the tasks involved. The second section gives an overview of the implementation for obtaining trajectories. The third section gives a brief picture of the system developed with hierarchies for planning and the final section provides more information into the different features used for the development. 4.1 Environment and Data Collection The environment of the experiment includes 4m*4m area consisting of rigid objects such as table, a couple of identical chairs, a small shelf and a big shelf. An OptiTrack1 motion capture system with 12 cameras is used to capture the demonstrations and a motion capture suit to be worn by the demonstrator. The cameras and the skeleton of the demonstrator in the suit are calibrated before experiment. There were 16 different type of tasks performed in the dataset as defined in table 4.1. For these tasks, there were 4 coloured plates, 4 coloured cups, 1 bowl and 1 jug. The environment also consists of rigid objects as table, big shelf, small shelf and couple of chairs as depicted in figure 4.1. For instance, the task of setting table for three person, would mean the demonstrator should make sure he or she picks one object at a time from an initial spot (e.g. big Shelf) and place it to destination (e.g. table). In the end of this particular task, there must be 3 plates, 3 cups and either of jug or bowl must be on the table. Figure 4.1: Demonstrator performing action 1https://optitrack.com/ 25 4 Methods S.No Task description 1 setting table for one person 2 setting table for two persons 3 setting table for three persons 4 setting table for four persons 5 clear table 6 place jug and bowl on small shelf 7 place all cups on small shelf 8 place blue and pink objects on big shelf 9 place blue and red objects on big shelf 10 place blue and green objects on big shelf 11 place pink and red objects on big shelf 12 place pink and green objects on big shelf 13 place red and green objects on big shelf 14 place all cups on big shelf 15 place bowl and jug on big shelf 16 place all plates on big shelf Table 4.1: Tasks performed in the dataset For the data collection, the various tasks were performed in sequence without any particular order. Each of the objects in the scene including rigid objects have markers to track their position in space. Also the motion capture suit is accompanied with 50 markers to map the motion of the demonstrator in the space. For this work, the 2D positions and quaternion angles were mainly used. Figure 4.2: 2D representation of environment 26 4.2 Trajectory Formulation in 2D Map of Environment Using this information, precise 2D interface of the environment was developed in PyQT where the activity performed can be viewed in 2D in parallel with 3D visualization in RViz 2 as depicted in figure 4.2. 4.2 Trajectory Formulation in 2D Map of Environment A part of the thesis is to be able to navigate in the environment defined previously, by avoiding obstacles and making it to goal with efficiency. The Kitani et al. [KZBH12], uses Maxent IRL to learn the reward weights for the physical attributes in the environment. The linear sum of reward weights with feature responses of the physical attributes, provides the reward function for the states in the environment. With optimal control, a viable trajectory can be forecasted from an initial state to destination. Figure 4.3: Trajectory prediction The environment has obstacles in the form of chairs, tables and shelves. The way around these rigid objects would obviously be the user preferred path. In the Kitani et al. [KZBH12], top down visual images of a public location were used and features of physical attributes were generated after semantic labeling. A stride away from this approach, instead of the top down actual images, the images were extracted from the 2D representation of the environment, defined in section 4.1. This made semantic labeling easier as well and the trajectory was available from the motion capture setup. The figure 4.3 shows the prediction of trajectory between two points in the environment. The start being the lower left point to a point in the upper right, a path is predicted around the obstacles using corresponding reward map of the environment. 2https://github.com/ros-visualization/rviz 27 4 Methods Figure 4.4: Feature maps of the environment 28 4.2 Trajectory Formulation in 2D Map of Environment 4.2.1 Feature Map As defined in the Ziebart et al. [ZRG+09] and in Kitani et al. [KZBH12], four feature maps were generated for each physical attribute in the scene. The first feature is the probability map which indicates the presence of an object in the cell. The rest three are exponentiated distance function with variance, σ2 = 3,5,10. Each feature map is of the size 100 pixel*100 pixel. As total, there are 25 feature maps for the 6 different physical attributes and an additional constant function. Each of the feature map is normalized in the range between [-1, 0]. The figure 4.4 depicts the feature maps used to generate the reward function. The 2D interface of the environment also generates feature tensor for a scene of the environment. 4.2.2 Learning Reward for the 2D Map The trajectory logged from demonstrations, the image of the scenes and its corresponding feature maps, becomes the input for the model. The model uses the trajectory over the corresponding feature maps to calculate the empirical mean feature count of the demonstrations. The backward pass algorithm followed by forward pass algorithm learns state visitation frequency to calculate expected feature counts [ZMBD08][KZBH12]. This is repeated until the expected feature counts matches up to the empirical feature counts. The calculated reward weights is the information that would correspond to user preference in the environment. The figure 4.5 depicts the final reward function obtained for a scene of the environment. The yellow surface has higher reward and the dark blue has lower reward. The reward weights for each of the feature maps were calculated using MaxEnt-IRL. The reward weights were highest for the feature maps of path and its three blurs. The rest of the feature maps had negligible reward weights in comparison to the path. Thus, the final reward function looks more like a combination of feature maps of the path. Figure 4.5: Reward function of a scene 29 4 Methods 4.3 Hierarchical Structure and State Representations In this section, the hierarchies involved for the high level planning is explained for each level. The below section explains basic setup and prerequisites. The sections following correspond to each of the hierarchies in the system developed. Each of these subsection would describe deeper into the state and action spaces of the corresponding levels. For this thesis, the task of ’setting table for one person’ is considered. Therefore, all the levels defined here are not applicable for other tasks in the dataset. The middle levels and the navigation level can be re-purposed for other tasks. However, top level cannot be used for other tasks. 4.3.1 Basic Setup and Prerequisites The tasks for setting table for one person is quite complex compared to simple Atari games or taxi domain. There are three items- table, big shelf and small shelf as rigid objects which can accommodate other movable objects such as 4 plates, 4 cups and jug or bowl on them in different combinations at any point in time. There is also one additional possibility. The human demonstrator can also be handling one of the objects. Therefore, in any given instance, the movable objects can be in any of the four locations - table, big shelf, small shelf or human. In figure 4.6, couple of scenario from real demonstration is visualized in RViz. As it can be seen, table is set with red plate & cup with bowl in the first image and green plate & cup with jug in the second image. Figure 4.6: Different possibility of arrangement of table for one person Thus to define this complex environment into a simpler format, the zones are defined. The environ- ment is divided in to 4 zones namely: • near the table • near the big shelf • near the small shelf • away 30 4.3 Hierarchical Structure and State Representations Figure 4.7: Zones of the environment The movement of the demonstrator is tracked with respect to these zones. The zone values are enumerated starting from 0 to 3. The zones cover a circular patch of region around the rigid objects. The extent of circular region was decided based on the demonstrations and also making sure that zones do not overlap as depicted in figure 4.7. It is only possible to grasp or release within the first three zones. The fourth zone covers a larger region than others. The demonstrator continues to hold the object in this region. The location of movable objects are defined similarly, namely: • on table • on big shelf • on small shelf • with demonstrator The location values are enumerated starting from 0 to 3. Any object placed on these rigid objects will have one of the location values. Finally, putting the individual location of movable objects and location of demonstrator together, we can define the state of environment as vector of size 11(10 objects and demonstrator). state - [cup1, cup2, cup3, cup4, plate1, plate2, plate3, plate4, jug, bowl, demonstrator] As seen above, each value in the state correspond to the position of object in the scene based on locations defined and zone for the demonstrator. At any point in time, maximum of only one object is held by the human. For instance, state - [0,1,1,2,2,3,1,0,0,1,3], state[0] to state[3] gives location of all the cups; state[4] to state[7] gives location of all the plates ; state[8] to state[9] gives location of jug and bowl respectively; and finally state[11] gives location of human in terms of zone. As one can perceive, human is in zone 3 with a plate in his hand. 31 4 Methods Action for this setup is very basic like move, grasp and release. From the actual motion capture dataset, the state of the environment is extracted in the format defined with respect to time. This sequence of state forms the basis for upcoming sections for defining their corresponding states and actions. Figure 4.8: Hierarchical structure of the system 4.3.2 Level H1 This is the top level of the system. Here the state is defined with respect to number of cups, plates and jug or bowl on the table and shelves (both the shelves are considered as one entity). The state structure for this level is defined as below: state = [cupTable, cupShelves, plateTable, plateShelves, jugbowlTable, jugbowlShelves] Macro-actions for this level are presented in table 4.2 For instance, if the state = [1,3,1,3,0,2]. It means there is 1 cup on table, 1 plate on table. Rest of the cups, plates, jug and bowl are on either of the shelves. There is only one terminal state for this level - [1,3,1,3,1,1] which corresponds to 1 cup, 1 plate and either jug or bowl on table. There is a point to note that jug and bowl are considered in same category like cups or plates since they are interchangeable in setting up of table. Total number of possible states are 75. These states are extracted from the state and actions defined in section 4.3.1. An example of a sequence for this level is shown in table 4.3. 32 4.3 Hierarchical Structure and State Representations Macro-actions Sub- level Name Description H21 MoveCupTabletoShelves move the cup on the table to either of the shelves H22 MovePlateTabletoShelves move the plate on the table to either of the shelves H23 MoveJBTabletoShelves move the jug or bowl on the table to either of the shelves H24 MoveCupShelvestoTable move the cup to the table from either of the shelves H25 MovePlateShelvestoTable move the plate to the table from either of the shelves H26 MoveJBShelvestoTable move the jug or bowl to the table from either of the shelves Table 4.2: Macro-actions Level H1 S.No. States Meaning 1. [4, 0, 1, 3, 0, 2] initial state 2. [4, 0, 1, 3, 1, 1] jug moved to table from shelf 3. [3, 1, 1, 3, 1, 1] cup moved from table to shelf 4. [2, 2, 1, 3, 1, 1] cup moved from table to shelf 5. [1, 3, 1, 3, 1, 1] cup moved from table to shelf Table 4.3: Example- sequence of states for Level H1 These possible states encapsulates all possibilities of arrangement of 10 movable objects in the environment with focus to table alone. The shelves are considered as single entity which separates only in the lower levels. There were different types of tasks performed in sequence, for instance, setting table for three persons followed by setting table for one person. In such a scenario, the planning has to be done in a manner that objects have to be removed from the table to set for one person. Precisely in this case, two cups and two plates have to be removed. Such a structure of states also enables to learn to remove objects from the table to come to the defined terminal state rather than just add objects on to the table. 4.3.3 Level H2 This is the middle level of the system. In this level, the macro-actions from top level dictates the state structure and actions are more primitive. Since there are 6 macro-actions defined for top level, there are 6 branches in this level for each of them as depicted in the table 4.2 33 4 Methods The state structure and actions for each of these branches are mostly identical. So therefore, for purpose of explanation, only the branch H21 would be the point of interest. However, there are some exceptions present amongst branches, which will be explicitly addressed. Below is the state structure for the branches in this level. state = [cupOntable, cupOnBigShelf, cupOnSmallShelf, skeletonLocation, objectCapturedIndication] State defines the number of cups on each of the rigid objects, location of the person in the scene and an indication if cup is captured. Actions of this branch are: • MovetoBigShelf - move to a point near big shelf • MovetoSmallShelf - move to a point near small shelf • MovetoTable - move to a point near table • GraspCupTable - grasp cup from the table • ReleaseCupShelf - release cup on the shelf It can be perceived that the states have direct link with the macro-action from top level. Similarly, each of the branches in this level are specific to type of movable objects like cups, plates or jug bowl. It is to be noted that the location of person with respect to zones is accounted here for the first time in this level. There are a total of 100 states possible for this branch which are identical with branches H22, H24 and H25. Intuitively, it can be realised that, the branches H23 and H26 will have only 36 possible states (since only two objects - jug and bowl, comparing to four in other branches). Regarding terminal states, there are multiple possibilities for each of the branches. For instance, for branch H21, the cups can be taken from the table to either of the shelves and there could multiple combination of other cups present on any of the rigid objects. This creates a number of possible terminal states. This type of framework can be found analogous to Option Framework. The top level performs SMDP with options instead of actions and the middle levels are analogous to those options. Just like in case of options, the middle levels perform its policies until a termination condition. It has a sole purpose which it tries to fulfil and returns control to the top level. An example of a sequence for this branch H21 is shown in table 4.4. Level H21 S.No. States Meaning 1. [3, 0, 1, 2, 0] initial state of cups 2. [3, 0, 1, 0, 0] person moves near table from small shelf 3. [2, 0, 1, 0, 1] grasps cup 4. [2, 0, 1, 2, 1] person moves near small shelf from table 5. [2, 0, 2, 2, 0] release cup Table 4.4: Example- sequence of states for Level H21 34 4.4 Integration and Visualization Basically, a cup from the table is moved to the big shelf. Similarly, other branches also perform their specific policies. 4.3.4 Level H3 This is the lowest level of the system which is responsible for navigation from one point to another in the environment. As defined in the section 4.2, this level uses the reward weights learned and feature map to formulate trajectory around the obstacles in the environment. The actions from middle level dictates the motion in this level. There are 4 location points defined which are offset from each of the rigid objects. Thus any movement requested from above levels is planned between these points. For formulating paths, Dijkstra’s algorithm [Dij59] is used over the cost map. 4.3.5 Features There were two types of features used separately. The first one being euclidean distance to each of the rigid objects combined with indication of grasp or release. The second being the trajectory lengths to reach destination points near rigid objects combined with indication of grasp or release. For both the types of feature sets, the performance was compared for the policy generated for each of the branches in the middle level. From the position of demonstrator, euclidean distances or trajectory lengths would be calculated for the corresponding states in middle level. For grasp or release, the state in which it happens is set as 1. Trajectory lengths were calculated using Dijkstra’s algorithm based on the reward map generated for the navigation level. 4.4 Integration and Visualization In the previous section, the states and action for the levels were explained. However, the planning for a task is only successful when each of the levels work together in conjunction. The success of lower levels propagates to success of the planning by the top level. The policies learned for each of the level and branches are integrated such that there is transfer of control between the levels. For instance, top level policy decides to move a plate from big shelf to table. The middle level branch for the corresponding macro-action, H24 gets the control. The policy of this branch runs until a terminal state is reached. For each of the steps involving motion, there is a transfer of control to the navigation level H3, which holds control until its task is completed. 35 4 Methods Figure 4.9: Simulation setup of the model To realise these policies visually, the environment is simulated in 2D. As shown in figure 4.9,the planning can be viewed in parallel. The input for the simulation is simple initial state of the environment as defined in section 4.3.1. The simulation will plan as per the policies until the table is set for a person. The simulation was done for the actual cases from the dataset to see if planning done by the policies is similar to the demonstration. 36 5 Evaluation In this chapter, evaluation of the policies learned for each level is discussed. This is done by using cross validation and ground truth comparison. The first section gives a brief idea about the cross validation test performed for the policies. It must be noted that the cross validation tests were performed for top and middle levels. For the navigation level, the comparison of the simulated navigation routes and ground truth is done in the second section. The second section also covers the comparison with ground truth and the whole simulation system for some of the demonstration scenarios for the task of setting up table for one person. 5.1 Cross Validation To understand how good the policies are, leave-one-out cross validation was performed for each of the levels and branches defined in the system. Thus for each of the tests, one sample from the dataset was left out, IOC was performed to get reward values for the states and then using these new reward values, policy was generated. The left out sample was tested on this policy and its results were compared with actual trajectory. This was performed for all the trajectories from dataset for each of the policies used in the setup. Also for understanding how good the policies work for the states in any level which were not covered in demonstrations, another test was performed where all the states were tested against the final obtained policy. This was to make sure if any given state could converge to one of the terminal state. As there were cases, the policy gets blocked and jumps between states. For the validation of middle levels with respect to features used, a comparison is showed in the figure 5.1. In the figure, the blue bar is the result when no features were included, red bar for the features1 as euclidean distance and orange bar for the feature2 as trajectory lengths. As one can perceive, when features are used, the cross validation results are significantly better. Comparing euclidean distance and trajectory length as features, the latter performs in most cases, equal or better than the former. Also intuitively, we can understand that length of the path to a destination is a matter of concern for us humans as compared to direct euclidean distance, provided all other parameters are same for both cases. However, we cannot neglect the fact that there can be multiple parameters for making a decision between destinations. Euclidean distance and trajectory lengths are just two of many for the given scenario. For instance, given the task to pick a plate from table to one of the shelves, the human have to decide which shelf to go to. But the small shelf cannot accommodate a plate because of its small size. Therefore even if trajectory lengths to reach small shelf may be less, it is not a viable destination. Similarly there can be more parameters involved in decision making. 37 5 Evaluation Figure 5.1: Cross validation result for different features The top level of the system was not integrated with any features and it has 83% success rate with cross validation. Also the middle levels achieve a success rate of 84% on average. 5.2 Comparison with Ground Truth It is imperative to compare the motion dynamics of the model with the actual demonstration path. Additionally, the decision making sequence of the model has to be compared to the ground truth. There were a total of 21 demonstrations for the task of setting table for one person. All these demonstrations were quite varied in their sequences and lengths. Some of the cases just had couple of steps while some were much longer. All the scenarios were tested in the model and their motion dynamics and decision making sequences were compared. For purpose of distinction of trajectories, colors were used. The description for each of the step is also provided for better comparison. In the simulation model, trajectories tend to overlap since the paths are formulated between set points only. Under this section, only three examples are discussed. 5.2.1 Example 1 The initial state of the environment is [2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2] as define in section 4.3.1. All the movable objects are on the shelves and there is no entity on the table. The person starts from zone 2 which includes the small shelf. The simulated sequence from the model is shown in the table 5.1 and ground truth in table 5.2. 38 5.2 Comparison with Ground Truth S.No. States Meaning 1. [2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2] Initial state 2. [0, 2, 2, 2, 1, 1, 1, 1, 1, 1, 0] cup from small shelf to table 3. [0, 2, 2, 2, 0, 1, 1, 1, 1, 1, 0] plate from big shelf to table 4. [0, 2, 2, 2, 1, 1, 0, 1, 1, 0, 0] jug from big shelf to table Table 5.1: Example1 - Simulated sequence from the model S.No. States Meaning 1. [2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2] Initial state 2. [2, 2, 2, 2, 1, 1, 0, 1, 1, 1, 0] plate from big shelf to table 3. [0, 2, 2, 2, 1, 1, 0, 1, 1, 1, 0] cup from small shelf to table 4. [0, 2, 2, 2, 1, 1, 0, 1, 1, 0, 0] jug from big shelf to table Table 5.2: Example1 - Ground truth sequence from the demonstration Comparing the sequences, it can be seen that steps 2 and 3 differ in order of execution. Otherwise all the steps are identical in terms of location from which objects were grasped. Also it can be inferred that, though the human was near the small shelf, he decided to pick first plate from big shelf rather than cup from small shelf. Perhaps, orientation of the person could also be a reason for this disparity. Orientation as a feature could also be an option. Figure 5.2 depicts the ground truth path, the simulated path and heat map. The ground truth portrays the actual trajectory of the demonstrator between the table and shelves. Each coloured line shows start and end of a sub-task. In the simulated path, the paths of sub-tasks are superimposed and hence it is not visible distinctly. However, one distinction which is visible is that of the first sub-task which is of the line with color magenta. In the ground truth, the plate is first grasped from big shelf which is denoted by the magenta line moving towards left from a point near small shelf and acquiring it. While in the simulation, a cup is first grasped from small shelf, the magenta line extends from small shelf towards the table. The heat map shows the cost function of the environment with yellow having higher cost than dark blue. The path simulated follows the lower cost trajectory using Dijkstra’s algorithm. 39 5 Evaluation Figure 5.2: Example1 - Ground truth, Simulation path, Heat map 5.2.2 Example 2 The initial state of the environment is [2, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1] as defined in section 4.3.1. Most of the objects are on the shelves except for the jug on the table. This is easier scenario where only a cup and a plate has to be placed on the table. The simulated sequence from the model is shown in the table 5.3 and ground truth in table 5.4. S.No. States Meaning 1. [2, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1] Initial state 2. [2, 1, 1, 2, 0, 1, 1, 1, 0, 1, 0] plate from big shelf to table 3. [2, 0, 1, 2, 0, 1, 1, 1, 0, 1, 0] cup from big shelf to table Table 5.3: Example2 - Simulated sequence from the model S.No. States Meaning 1. [2, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1] Initial state 2. [2, 1, 0, 2, 1, 1, 1, 1, 0, 1, 0] cup from big shelf to table 3. [2, 1, 0, 2, 0, 1, 1, 1, 0, 1, 0] plate from big shelf to table Table 5.4: Example2 - Ground truth sequence from the demonstration 40 5.2 Comparison with Ground Truth In this scenario, again the order of picking cup and plate is different compared to the ground truth. Otherwise all the steps are identical in terms of location from which objects were grasped. Since the paths are planned between already defined points, they overlap going back and forth in the model. Nevertheless, the number of up and down motions are same between the two. The figure 5.3 shows the ground truth, simulated path and heat map. The ground truth depicts the motion of the demonstrator between the table and big shelf back and forth. In comparison to ground truth, the simulated path takes a different route since the model is working with only single fixed points near a rigid object. The heat map depicts the cost function and lower cost path taken simulated using Dijkstra’s algorithm. Figure 5.3: Example2 - Ground truth, Simulation path, Heat map 41 5 Evaluation 5.2.3 Example 3 The initial state of the environment is [0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1] as described in section 4.3.1. This scenario is quite different from previous cases since there are tasks to remove objects from the table. Initially, there are 4 cups on the table, plates on the big shelf, jug on the big shelf and bowl on the table. The simulated sequence from the model is shown in the table 5.5 and ground truth in table 5.6. S.No. States Meaning 1. [0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1] Initial state 2. [2, 0, 0, 0, 1, 1, 1, 1, 1, 0, 2] cup from table to small shelf 3. [2, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0] plate from big shelf to table 4. [2, 2, 0, 0, 0, 1, 1, 1, 1, 0, 2] cup from table to small shelf 5. [2, 2, 1, 0, 0, 1, 1, 1, 1, 0, 1] cup from table to big shelf Table 5.5: Example3 - Simulated sequence from the model S.No. States Meaning 1. [0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1] Initial state 2. [0, 2, 0, 0, 1, 1, 1, 1, 1, 0, 2] cup from table to small shelf 3. [0, 2, 2, 0, 1, 1, 1, 1, 1, 0, 2] cup from table to small shelf 4. [0, 2, 2, 2, 1, 1, 1, 1, 1, 0, 2] cup from table to small shelf 5. [0, 2, 2, 2, 1, 1, 1, 0, 1, 0, 0] plate from big shelf to table Table 5.6: Example3 - Ground truth sequence from the demonstration There are a few differences in this case. In step 3, the model decides to pick plate from big shelf to table which happens only in step 5 for the ground truth. In ground truth, all cups were placed in the small shelf while in simulation, all except one were placed in small shelf. The figure 5.4 shows the ground truth, simulated path and the heat map. The ground truth depicts the actual trajectory of the demonstration between the table and the shelves. The simulated path again formulates trajectory similar to the first example in section 5.2.1 as the cost map is almost identical. Also the heat map is identical with respect to the cost function. 42 5.2 Comparison with Ground Truth Figure 5.4: Example3 - Ground truth, Simulation path, Heat map As a conclusion, the model is successful in predicting the sequence of tasks similar if not identical to a human in the environment. The simulated path is also successful in avoiding obstacles of the environment and reaching to the destination requested. The model can be improved further with more offset points around the objects so as to travel more in the environment. Additional features as described in section 5.1 can be incorporated to improve the results. With incorporation of affordance in the model, the prediction of where to place an objects on the surface of a rigid object can be extended. 43 6 Discussion and Future Work This work opens up quite a few possibilities for expansion of the idea. The work can be extended for other similar tasks in the environment with few tweaks in the state structure and actions. The middle level can be shared among the tasks since most of the tasks performed can be encoded with such macro-actions. The navigation level too, can be re-purposed. The top level can be updated for other tasks, for example, the task for setting table for three person. The policy can be learned with a new terminal state that encodes the goal. It is possible to further increase the hierarchy levels for certain purposes. For instance, the different racks on the shelves can be encoded and a policy can be learned to decide which rack to choose to pick or place an object. Here, combining with affordances, the system can decide the viable regions in a plane to place an object. Currently, the system randomly places the movable objects inside the boundary confinements of the rigid objects. Affordances can be made use for grasp action as well. This would actually complete the task. Merely, it is not just to have set of objects on the location but also in an arrangement which human would do. As mentioned in earlier chapters, further features can be included in the model to enhance the results. Orientation of the person, physical attributes of the objects, occupancy metric, etc. can be used as features. Orientation of person can play a role in planning. Humans perceive things that are in front of them and in their field of view. Though a favourable action may be a possibility in a state, it may not be pursued if it is out of the human’s perception as explained in 5.2.1. The physical build of the objects can be a feature as well. For instance, with the height of the jug, it could not be placed on the second rack of the big shelf. Also, plates could not be placed on the small shelf due to its small surface. Such factors can be encoded as features since they act as a hindrance to an action. With the inclusion of these features, planning can be improved further. When a human makes a plan, a number of factors are considered before a decision is concluded. Quite analogous to this, any kind of planning in the environment will require features and its innate rewards. 45 7 Conclusion This thesis presents a model of HIRL which works to imitate a high level task from demonstrations. The long horizon task of setting table for one person is decomposed into sub-tasks. The latent hierarchical state representation for the task is structured in three levels of hierarchy and local rewards for states of each levels are learned. The challenge is in defining the hierarchical state structures for the activity and adopting the right features that encodes the high level task. The proposed framework addresses the problem of learning local sub-tasks and working in tandem with other levels for the common final goal. This work also adopts activity forecasting framework [KZBH12] to learn the rewards for semantics in the environment which is part of the navigation level of the model. The comparison with ground truth in section 5.2 proves that the model is successful in imitating human for the particular task. It is also to be noted that there is still a room for improvement for the model in terms of using of different features which could enhance the results. 47 Bibliography [AN04] P. Abbeel, A. Y. Ng. “Apprenticeship Learning via Inverse Reinforcement Learning”. In: Proceedings of the Twenty-first International Conference on Machine Learning. ICML ’04. Banff, Alberta, Canada: ACM, 2004, pp. 1–. isbn: 1-58113-838-5. doi: 10.1145/1015330.1015430. url: http://doi.acm.org/10.1145/1015330.1015430 (cit. on p. 21). [Bel57] R. Bellman. “A Markovian decision process”. In: Journal of Mathematics and Me- chanics 6.5 (1957), pp. 679–684. url: http://www.jstor.org/stable/24900506 (cit. on p. 17). [Dee19] DeepMind. AlphaGo. 2019. url: https://deepmind.com/research/case-studies/ alphago-the-story-so-far (cit. on p. 15). [Die00] T. G. Dietterich. “Hierarchical Reinforcement Learning with the MAXQ Value Func- tion Decomposition”. In: Journal of Artificial Intelligence Research 13 (Jan. 2000), pp. 227–303. doi: 10.1613/jair.639 (cit. on p. 20). [Dij59] E. W. Dijkstra. “A Note on Two Problems in Connexion with Graphs”. In: Numer. Math. 1.1 (Dec. 1959), pp. 269–271. issn: 0029-599X. doi: 10.1007/BF01386390. url: http://dx.doi.org/10.1007/BF01386390 (cit. on p. 35). [FLA16] C. Finn, S. Levine, P. Abbeel. “Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization”. In: Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48. ICML’16. New York, NY, USA: JMLR.org, 2016, pp. 49–58. url: http://dl.acm.org/citation.cfm?id= 3045390.3045397 (cit. on p. 24). [HYY+17] T. Hirakawa, T. Yamashita, K. Yoda, T. Tamaki, H. Fujiyoshi. “Travel Time-dependent Maximum Entropy Inverse Reinforcement Learning for Seabird Trajectory Prediction”. In: 2017 4th IAPR Asian Conference on Pattern Recognition (ACPR). IEEE. 2017, pp. 430–435 (cit. on p. 23). [Jay57] E. T. Jaynes. “Information Theory and Statistical Mechanics”. In: Phys. Rev. 106 (4 May 1957), pp. 620–630. doi: 10.1103/PhysRev.106.620. url: https://link.aps. org/doi/10.1103/PhysRev.106.620 (cit. on p. 21). [KGL+16] S. Krishnan, A. Garg, R. Liaw, L. Miller, F. T. Pokorny, K. Goldberg. “HIRL: Hi- erarchical Inverse Reinforcement Learning for Long-Horizon Tasks with Delayed Rewards”. In: CoRR abs/1604.06508 (2016). arXiv: 1604.06508. url: http://arxiv. org/abs/1604.06508 (cit. on p. 24). [KKL18] G. Konidaris, L. P. Kaelbling, T. Lozano-Perez. “From Skills to Symbols: Learning Symbolic Representations for Abstract High-Level Planning”. In: Journal of Artificial Intelligence Research 61 (2018), pp. 215–289. doi: 10.1613/jair.5575 (cit. on p. 24). 49 https://doi.org/10.1145/1015330.1015430 http://doi.acm.org/10.1145/1015330.1015430 http://www.jstor.org/stable/24900506 https://deepmind.com/research/case-studies/alphago-the-story-so-far https://deepmind.com/research/case-studies/alphago-the-story-so-far https://doi.org/10.1613/jair.639 https://doi.org/10.1007/BF01386390 http://dx.doi.org/10.1007/BF01386390 http://dl.acm.org/citation.cfm?id=3045390.3045397 http://dl.acm.org/citation.cfm?id=3045390.3045397 https://doi.org/10.1103/PhysRev.106.620 https://link.aps.org/doi/10.1103/PhysRev.106.620 https://link.aps.org/doi/10.1103/PhysRev.106.620 https://arxiv.org/abs/1604.06508 http://arxiv.org/abs/1604.06508 http://arxiv.org/abs/1604.06508 https://doi.org/10.1613/jair.5575 Bibliography [KNST16] T. D. Kulkarni, K. Narasimhan, A. Saeedi, J. B. Tenenbaum. “Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation”. In: CoRR abs/1604.06057 (2016). arXiv: 1604.06057. url: http://arxiv.org/abs/ 1604.06057 (cit. on p. 23). [KZBH12] K. M. Kitani, B. D. Ziebart, J. A. Bagnell, M. Hebert. “Activity Forecasting”. In: Proceedings of the 12th European Conference on Computer Vision - Volume Part IV. ECCV’12. Florence, Italy: Springer-Verlag, 2012, pp. 201–214. isbn: 978-3-642- 33764-2. doi: 10.1007/978-3-642-33765-9_15. url: http://dx.doi.org/10.1007/ 978-3-642-33765-9_15 (cit. on pp. 23, 27, 29, 47). [MPG+10] M. Moussaïd, N. Perozo, S. Garnier, D. Helbing, G. Theraulaz. “The Walking Be- haviour of Pedestrian Social Groups and Its Impact on Crowd Dynamics”. In: PLOS ONE 5.4 (Apr. 2010), pp. 1–7. doi: 10.1371/journal.pone.0010047. url: https: //doi.org/10.1371/journal.pone.0010047 (cit. on p. 23). [NR00] A. Y. Ng, S. J. Russell. “Algorithms for Inverse Reinforcement Learning”. In: Pro- ceedings of the Seventeenth International Conference on Machine Learning. ICML ’00. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2000, pp. 663–670. isbn: 1-55860-707-2. url: http://dl.acm.org/citation.cfm?id=645529.657801 (cit. on p. 20). [OPN18] T. Osa, J. Pajarinen, G. Neumann. An Algorithmic Perspective on Imitation Learning. Hanover, MA, USA: Now Publishers Inc., 2018. isbn: 168083410X, 9781680834109 (cit. on p. 21). [PR98] R. Parr, S. Russell. “Reinforcement Learning with Hierarchies of Machines”. In: Proceedings of the 1997 Conference on Advances in Neural Information Processing Systems 10. NIPS ’97. Denver, Colorado, USA: MIT Press, 1998, pp. 1043–1049. isbn: 0-262-10076-2. url: http://dl.acm.org/citation.cfm?id=302528.302894 (cit. on p. 20). [RA07] D. Ramachandran, E. Amir. “Bayesian Inverse Reinforcement Learning”. In: Proceed- ings of the 20th International Joint Conference on Artifical Intelligence. IJCAI’07. Hyderabad, India: Morgan Kaufmann Publishers Inc., 2007, pp. 2586–2591. url: http://dl.acm.org/citation.cfm?id=1625275.1625692 (cit. on p. 20). [RPLA18] A. Rudenko, L. Palmieri, A. J. Lilienthal, K. O. Arras. “Human Motion Prediction Under Social Grouping Constraints”. In: Proceedings of IEEE International Confer- ence on Intelligent Robots and Systems (IROS). 2018. url: https://doi.org/10. 1109/IROS.2018.8594258 (cit. on p. 23). [SB98] R. S. Sutton, A. G. Barto. “Reinforcement learning: an introduction MIT Press”. In: Cambridge, MA (1998) (cit. on p. 17). [SPS99] R. S. Sutton, D. Precup, S. Singh. “Between MDPs and semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning”. In: Artif. Intell. 112.1-2 (Aug. 1999), pp. 181–211. issn: 0004-3702. doi: 10.1016/S0004-3702(99)00052-1. url: http://dx.doi.org/10.1016/S0004-3702(99)00052-1 (cit. on p. 20). 50 https://arxiv.org/abs/1604.06057 http://arxiv.org/abs/1604.06057 http://arxiv.org/abs/1604.06057 https://doi.org/10.1007/978-3-642-33765-9_15 http://dx.doi.org/10.1007/978-3-642-33765-9_15 http://dx.doi.org/10.1007/978-3-642-33765-9_15 https://doi.org/10.1371/journal.pone.0010047 https://doi.org/10.1371/journal.pone.0010047 https://doi.org/10.1371/journal.pone.0010047 http://dl.acm.org/citation.cfm?id=645529.657801 http://dl.acm.org/citation.cfm?id=302528.302894 http://dl.acm.org/citation.cfm?id=1625275.1625692 https://doi.org/10.1109/IROS.2018.8594258 https://doi.org/10.1109/IROS.2018.8594258 https://doi.org/10.1016/S0004-3702(99)00052-1 http://dx.doi.org/10.1016/S0004-3702(99)00052-1 [VOA14] D. Vasquez, B. Okal, K. O. Arras. “Inverse Reinforcement Learning algorithms and features for robot navigation in crowds: An experimental comparison”. In: IEEE-RSJ Int. Conf. on Intelligent Robots and Systems. Chicago, United States, 2014, pp. 1341– 1346. doi: 10.1109/IROS.2014.6942731. url: https://hal.inria.fr/hal-01105265 (cit. on p. 23). [Yan19] Yannis Flet-Berliac. The Promise of Hierarchical Reinforcement Learning. 2019. url: https://thegradient.pub/the-promise-of-hierarchical-reinforcement-learning/ (cit. on p. 20). [ZMBD08] B. D. Ziebart, A. Maas, J. A. Bagnell, A. K. Dey. “Maximum Entropy Inverse Rein- forcement Learning”. In: Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 3. AAAI’08. Chicago, Illinois: AAAI Press, 2008, pp. 1433– 1438. isbn: 978-1-57735-368-3. url: http://dl.acm.org/citation.cfm?id=1620270. 1620297 (cit. on pp. 20, 21, 29). [ZRG+09] B. D. Ziebart, N. Ratliff, G. Gallagher, C. Mertz, K. Peterson, J. A. Bagnell, M. Hebert, A. K. Dey, S. Srinivasa. “Planning-based Prediction for Pedestrians”. In: Proceedings of the 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems. IROS’09. St. Louis, MO, USA: IEEE Press, 2009, pp. 3931–3936. isbn: 978-1-4244- 3803-7. url: http://dl.acm.org/citation.cfm?id=1732643.1732694 (cit. on pp. 21, 23, 29). https://doi.org/10.1109/IROS.2014.6942731 https://hal.inria.fr/hal-01105265 https://thegradient.pub/the-promise-of-hierarchical-reinforcement-learning/ http://dl.acm.org/citation.cfm?id=1620270.1620297 http://dl.acm.org/citation.cfm?id=1620270.1620297 http://dl.acm.org/citation.cfm?id=1732643.1732694 Declaration I hereby declare that the work presented in this thesis is entirely my own and that I did not use any other sources and references than the listed ones. I have marked all direct or indirect statements from other sources contained therein as quotations. Neither this work nor significant parts of it were part of another examination procedure. I have not published this work in whole or in part before. The electronic copy is consistent with all submitted copies. place, date, signature 1 Introduction 2 Background 2.1 Markov Decision Process and Reinforcement Learning 2.2 Hierarchical Reinforcement Learning 2.3 Inverse Reinforcement Learning 2.4 Maximum Entropy Inverse Reinforcement Learning 3 Related Work 3.1 Human Motion Prediction 3.2 Hierarchical Learning 3.3 Hierarchical Inverse Learning 4 Methods 4.1 Environment and Data Collection 4.2 Trajectory Formulation in 2D Map of Environment 4.3 Hierarchical Structure and State Representations 4.4 Integration and Visualization 5 Evaluation 5.1 Cross Validation 5.2 Comparison with Ground Truth 6 Discussion and Future Work 7 Conclusion Bibliography