Institute of Software Engineering
Institute for Control Engineering of Machine Tools and Manufacturing Units
University of Stuttgart
Universitätsstraße 38
D-70569 Stuttgart

Masterarbeit

Generation of Reinforcement Learning Environments from Machine-Tool Descriptions

Viktor Krimstein

Course of Study: Softwaretechnik
Examiner: Prof. Dr. rer. nat. Stefan Wagner
Supervisors: Dr. rer. nat. Justus Bogner, Dr.-Ing. Akos Csiszar
Commenced: December 1st, 2020
Completed: June 1st, 2021

Abstract

Due to the ever-increasing amount of available data, the technological advances for its processing, and in the context of Industry 4.0, research and industry are focusing on creating increasingly detailed digital twins. These aspire to transfer all the capabilities and attributes of their physical counterparts into the digital world. Digital Twins enable simulations of real production and manufacturing processes to be carried out, new approaches to be tested and, in turn, innovative conclusions to be drawn without having to take the risks that costly machines entail. In parallel, approaches from the fields of machine learning, artificial intelligence, and reinforcement learning are finding continuously more applications in the manufacturing and robotics domains. Especially in the latter, OpenAI researchers achieved a breakthrough, namely the construction of a neural network that was trained via reinforcement learning to solve a Rubik's cube with a robotic hand. For the implementation, appropriate simulation environments were used, in which the agent responsible for controlling the robotic arm could train and learn over an enormous number of iterations in simulation. However, the highly heterogeneous landscape of the production domain makes it difficult to integrate reinforcement learning methodologies and create the necessary simulations.
Researchers must spend a considerable amount of their time implementing interfaces for specific machine-tool-related components rather than working on the actual problem. It is exactly this issue that this thesis addresses. The goal of this master thesis is the empirical development of a methodology for the automatic generation of reinforcement learning simulation environments for machine-tools. Within the scope of the thesis, different requirements shall be collected by interviewing domain experts as potential end users, generalized, and transferred into a software concept. Furthermore, the possibility of deducing and abstracting state and action spaces for reinforcement learning environments and agents from a given machine-tool description is investigated within the scope of this work. In addition, the concept to be developed should be machine-tool- and platform-agnostic, as well as modular, so that subsequent research can build upon the presented concept.

Kurzfassung

Due to the ever-increasing amount of available data, the technological advances for processing it, and in the context of Industry 4.0, research and industry are focusing on the creation of ever more detailed digital twins. These strive to transfer all capabilities and attributes of their physical counterparts into the digital world. They make it possible to run simulations of real production and manufacturing processes, to test new approaches, and in turn to draw innovative conclusions without having to take the risks that costly machines entail. In parallel, approaches from the fields of machine learning, artificial intelligence, and reinforcement learning are finding increasing application in the manufacturing and robotics sectors.
Especially in the latter field, OpenAI researchers achieved a breakthrough: the construction of a neural network which was trained via reinforcement learning to solve a Rubik's cube with a robotic hand. For the implementation, corresponding simulation environments were employed, in which the agent responsible for controlling the robot could train and learn countless times in simulation. However, the highly heterogeneous production environment complicates the integration of reinforcement learning and the creation of the necessary simulations. Researchers must spend a large part of their time implementing interfaces for specific components instead of working on the actual problem. It is exactly this issue that this thesis addresses. The goal of this master thesis is the empirical development of a methodology for the automatic generation of reinforcement learning simulation environments for production machines. Within the scope of this work, different requirements are to be collected by interviewing domain experts as potential end users, generalized, and transferred into a software concept. Furthermore, this work investigates to what extent state and action spaces for reinforcement learning environments and agents can be deduced and abstracted from a given machine description. In addition, the concept is to be designed machine- and platform-agnostic as well as modular, so that subsequent work can build on the presented concepts.

Contents

1 Introduction 15
1.1 Motivation 15
1.2 Research Questions and Goals 15
1.3 Outline 16
2 Foundations 17
2.1 Digital Twin 17
2.2 Machine-Tool Description and Communication 18
2.3 Reinforcement Learning 22
2.4 Reinforcement Learning Environment Frameworks 25
3 Related Work 27
3.1 Digital Twin Generation 27
3.2 Reinforcement Learning Environment Generation 27
3.3 Summary 28
4 Research Approach 29
4.1 Research Methodology 29
4.2 Interview Design 31
4.3 Summary 32
5 Evaluation and Results 33
5.1 Quantitative Data 33
5.2 Qualitative Data 38
6 Concept 43
6.1 Architectural Design 43
6.2 Machine Tool Description Parsing 46
6.3 Reward Function Definition 47
6.4 Automatic Training 47
6.5 Discussion 48
7 Conclusion and Outlook 49
Bibliography 51

List of Figures

2.1 The relationship between a DT and a CPS (adapted from [BL15]). 17
2.2 UML component diagram of the two main components of a machine-tool. 18
2.3 The OPC UA server architecture (adapted from [OPC17a]). 20
2.4 The OPC UA client architecture (adapted from [OPC17a]). 20
2.5 Schematic UML component diagram showing the EtherCAT Master implementation. 21
2.6 Generic Reinforcement Learning Agent-Environment loop (adapted from [SB18]). 22
4.1 Schematic representation of an exploratory research strategy. 30
4.2 The research design of flexible type (adapted from [RTVG21]). 30
4.3 Schematic overview of a deductive research approach (adapted from [FP19]). 31
5.1 Pie chart of the distribution of participants' professional experience and their experience specifically in the field of RL. 33
5.2 Pie chart of the distribution of participants' employment types. 34
5.3 The distribution of the frequency of creation or enhancement of simulation environments in relation to the employment field. 35
5.4 Representation of the six most frequently used tools by the participants. 36
5.5 Satisfaction rating of the participants with the current development and research processes. 37
5.6 UML use-case diagram providing a brief overview of general functionalities. 41
5.7 UML use-case diagram providing a brief overview of the user-facing functionalities. 41
6.1 The UML component diagram describing access to the system's functionalities through a CLI application. 44
6.2 The UML component diagram of the services which realize the creation of a Python project for RL. 45
6.3 The UML component diagram of the services which realize the creation of a Python project for RL. 46
6.4 UML class diagram of the Strategy pattern implementation for the reward function. 47

List of Tables

4.1 Distribution of research phases. 29
4.2 Summary of the research approach. 32
5.1 Distribution of the participants' professional experience and their respective experience in the field of RL. 34
5.2 Mapping of the participants to their employment types in academia or the industry. 35
5.3 The distribution of the frequency of creation or enhancement of simulation environments in relation to the employment field. 36
5.4 Representation of the six most frequently used tools by the participants. 37
5.5 Satisfaction rating of the participants with the current development and research processes. 37

List of Listings

6.1 Example Project Configuration 45
6.2 Example OPC UA Machine Tool Connectivity Configuration 46

Acronyms

AI Artificial Intelligence. 15, 22, 25, 27
API Application Programming Interface. 19
CNC Computer Numerical Control. 16
CPS Cyber-Physical Systems. 7, 15, 17
DT Digital Twin. 7, 15, 17, 18, 25, 27, 35
ICT Information and Communication Technologies. 15
IoT Internet of Things. 15
MDP Markov Decision Process. 22
ML Machine Learning. 15
NC Numerical Control. 18
OPC Open Platform Communications. 18
OPC UA Open Platform Communications Unified Architecture. 18
PLC Programmable Logic Controller. 18
RL Reinforcement Learning.
15, 16, 17, 22, 23, 24, 25, 27, 29, 31, 33, 35, 38, 39, 47, 48
RQ Research Question. 16
UML Unified Modeling Language. 18

1 Introduction

The industrial world is facing a digital transformation that started in Germany in 2013 with the Industrie 4.0 initiative [NB18; PCB+19]. This transformation, known as the fourth industrial revolution, is based on the use of Cyber-Physical Systems (CPS) and Information and Communication Technologies (ICT) in manufacturing systems. Significant advances in Machine Learning (ML), Artificial Intelligence (AI), and the Internet of Things (IoT) are also contributing [MLC20]. Based on these circumstances, current research focuses on the creation and optimization of Digital Twins (DTs) [WSJ17]. Due to the increasing interconnection of CPS, continuously growing data and information stocks enable increasingly accurate modeling of physical production systems [Woh19]. DTs enable the simulation and calculation of results from production processes as well as experimentation with different optimization approaches without having to access cost-intensive real machines [Can16]. In order to use DTs efficiently, they are equipped with the same or very similar interfaces as real machines. The goal is to generate a simulation of real production and manufacturing processes that is as accurate as possible in order to analyze and verify new production methods in a time- and cost-optimized manner. In parallel, the use of ML and AI in the production environment is also increasing. The aspiration here is the optimization of existing processes and an increase in efficiency [CEV17]. Further, the generation of new insights and the verification of innovative methods pose an additional research goal [JCKV18]. Reinforcement Learning (RL) algorithms can be used to train agents to simulate specific production processes and develop appropriate manufacturing strategies [LZYW20].
1.1 Motivation

Currently, the incompatibility of AI methods with those of the production and manufacturing domains poses significant challenges for research [NNXR08]. The creation of simulation environments and DTs requires a high degree of domain- and problem-specific knowledge. Furthermore, the large number of process and machine description models, some of which are incompatible, makes the generalization and modularization of simulation techniques difficult [ESLR19]. For example, if an agent trained by RL is to simulate the control of a sorting machine in order to evaluate the selected process strategy, it requires a specific description of the action and state spaces of the machine-tool used, as well as a set of input and output signals annotated with the corresponding domain context. The description and implementation of such simulation environments demand an enormous effort from researchers. Depending on the type of machine-tool used, the communication and description format, as well as the target task, researchers have to describe and simulate these details in a highly heterogeneous domain.

1.2 Research Questions and Goals

Targeting the above-mentioned issues, the goal of this thesis is the empirical analysis and development of a methodology for the automatic generation of RL simulation environments for machine-tools. The goal is to simplify the implementation of simulation environments in a manner which is generalizable across various frameworks and communication protocol types. Following an empirical research approach, this work provides a concept for a software solution for the automatic generation of RL environments and the corresponding agents for a given machine-tool description. Use cases and functional details of the resulting software architecture are derived from semi-structured interviews conducted with domain experts from both research and industry.
The additional goal is to homogenize the methodologies with which simulation environments are created, while at the same time leaving enough space for researchers to resort to the tools that are most appropriate for their work. Based on the presented introduction and motivation, this thesis strives to answer the following Research Questions (RQs):

RQ1: What general inconveniences in the process of generating an RL environment and agent with the corresponding simulation have to be overcome to enhance researcher productivity and the overall process quality?

RQ2: Given a potential software solution for automatic RL environment and agent generation, what use cases and technical functionalities do potential users expect from such a software?

RQ3: Given a potential software solution for automatic RL environment and agent generation, what would be the application of such a solution in the daily research work of the users?

1.3 Outline

The remainder of this work is divided into the following chapters:

Chapter 2 - Foundations: Provides the necessary foundational knowledge on RL simulation environments and on machine-tool description and communication types.
Chapter 3 - Related Work: Highlights related research on RL environment generation and applications to CNCs and discusses the differences to the presented thesis.
Chapter 4 - Research Approach: Presents the empirical research approach and strategies used in this work.
Chapter 5 - Evaluation and Results: Presents and discusses the research results alongside the key aspects leading to the creation of the concept.
Chapter 6 - Concept: Based on the results of the empirical research, presents the concept for a methodology of RL environment generation.
Chapter 7 - Conclusion and Outlook: Provides a summary of the work and conclusive remarks, along with an outlook on future work.
2 Foundations

This chapter establishes the foundations necessary to follow the remainder of this work. The basic concept of Digital Twins is introduced. Then, a general understanding of the components of a machine-tool is provided alongside the communication- and capability-description methods. Finally, an introduction to RL is given together with the specific concepts required for the generation of simulation environments.

2.1 Digital Twin

A Digital Twin can be defined as a virtual representation of a physical asset, enabled through data and simulators, for real-time prediction, optimization, monitoring, controlling, and improved decision-making [RSK20]. DTs implement a broader concept that refers to a virtual representation of manufacturing elements such as personnel, products, assets, and process definitions: a living model that continuously updates and changes as the physical counterpart changes, representing status, working conditions, product geometries, and resource states in a synchronous manner [LLK+20]. Further, DTs provide a more sophisticated method for testing and experimenting with manufacturing processes in a virtual environment, excluding the risks of misconfiguring a physical machine-tool. Figure 2.1 puts DTs and their corresponding physical devices in context with the overarching term of CPS.

Figure 2.1: The relationship between a DT and a CPS (adapted from [BL15]).

The DT represents a high-fidelity representation of the physical device. It not only allows modeling and simulating processes possibly executed on the physical device, but mirrors the signals of its physical counterpart nearly in real-time. Specifically, this means that a CPS is characterized by both parts - a physical machine and its Digital Twin. Kritzinger et al. strive to differentiate recent applications of DTs based on the degree of data integration between digital and physical objects [KKT+18].
In their work, the authors separate three terms: the Digital Model, the Digital Shadow, and the Digital Twin [KKT+18]. A Digital Model implements bi-directional manual data flow, while the DT, on the other hand, enables bi-directional data flow automatically. A Digital Shadow only provides one-way automatic data flow from the physical object into the digital object, while data flows from the digital object to the physical object manually. Based on this categorical method, the majority of publications investigated by the authors were classified as Digital Shadow and Digital Model [KKT+18]. Studies where the Digital Twin classification could be applied correctly are rare [XSK+21]. This illustrates how inconclusively the term DT is specified. Nevertheless, when speaking about DTs, this thesis follows the definition of Kritzinger et al.

2.2 Machine-Tool Description and Communication

For the context of this work, the machine-tool representation is abstracted into the following parts: the physical machine-tool itself (e.g., a machine for milling metal), the Programmable Logic Controller (PLC), which controls the motions of the physical tool parts and the electrical signals of the machine-tool, and the Numerical Control (NC), which is responsible for the numerical calculation and preparation of the paths to be followed by the machine-tool. Figure 2.2 provides a highly abstracted Unified Modeling Language (UML) [Obj17] component diagram representation of the machine-tool.

Figure 2.2: UML component diagram of the two main components of a machine-tool.

The machine-tool component provides two external interfaces to access the PLC and NC functionalities. The PLC of the machine-tool is responsible for the physical and electrical control of the manufacturing and machine-tool processes. It can execute control commands such as gathering data on the machine-tool state, starting and stopping production, and transferring data to the NC.
The NC handles the processing of production-part data, the planning of, for example, milling routes, and the execution of part production. Production data is loaded from and to the NC via the PLC, and the current state is transferred to the PLC in real-time, ensuring correct monitoring of every production process. In the following, all physical interactions are encapsulated within the machine-tool component, since one is not interested in how exactly the control realizes a specific manufacturing process (e.g., setting the rotation speed of a drill), but rather in the fact that it is possible by sending an explicit signal to the control of the machine-tool.

2.2.1 Open Platform Communications Unified Architecture

The Open Platform Communications Unified Architecture (OPC UA) is the new version of the well-known Open Platform Communications (OPC) architecture [Had06], originally designed by the OPC Foundation to connect industrial devices to control and supervision applications [HS14]. The focus of OPC is on getting access to large amounts of real-time data while ensuring performance constraints without disrupting the normal operation of the devices [CJOC10]. The original OPC specifications, based on Microsoft's Component Object Model/Distributed Component Object Model (COM/DCOM), are becoming obsolete and are gradually being replaced by new interoperability standards, including web services, which led the OPC Foundation to publish a new architecture, called OPC UA [Had06].

Note: The following two figures, shown in Figure 2.3 and Figure 2.4, are reproduced with the kind permission of the OPC Foundation by referencing the original OPC UA Specification Part [OPC17a].

OPC UA Server Model: OPC UA specifies the exchange of real-time information of production plant data between control devices or Information Technology (IT) systems from different manufacturers [VOX+05].
The communication is established by an inverted client-server system, where the client triggers actions on the server for automation control, and the server executes the commands on, or retrieves data from, the underlying machine [IJ13]. Figure 2.3 shows the OPC UA Server Model according to its specification in [OPC17a]. OPC UA servers include an information model that allows users to organize data and their semantics in a structured manner. This semantic AddressSpace is constructed of standalone or interconnected Nodes mapped to real CPS object representatives, as shown in Figure 2.3. Furthermore, nodes can be divided into functionality and view nodes, each kind implementing different functionalities and manners of user interaction. Further, each node can be monitored and subscribed to by interested parties using the OPC UA Server Application Programming Interface (API), as presented at the bottom of Figure 2.3. The information model constitutes the AddressSpace of an OPC UA server. It is a full-mesh network of nodes with their properties and relations. In general, users create the information model for their OPC UA servers manually at implementation time or implement vendor-specific automatisms [HS14]. By providing specific monitorable elements and asynchronous communication, OPC UA can be used to interact only with specific functionalities of the machine-tool while the underlying concepts are hidden from the clients. It should be noted at this point that the specification does not enforce a specific semantic mapping between nodes and their functionalities. Vendors are free to implement the AddressSpace in their own fashion, leaving it up to the domain experts and users to map the nodes to their semantic equivalents.

OPC UA Client Model: The OPC UA Client architecture models the client endpoint of client/server interactions. Figure 2.4 illustrates the major elements of a typical client and how they relate to each other.
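Whether accessed through the server or the client API, individual nodes of the AddressSpace are addressed by NodeIds, whose string form (e.g. `ns=2;i=1234` for a numeric identifier in namespace 2, or `s=Motor.Speed` for a string identifier in the default namespace) is defined by the OPC UA specification. The following minimal parser is a hypothetical helper, not part of any OPC UA SDK, sketched here only to make the addressing scheme concrete:

```python
from dataclasses import dataclass

@dataclass
class NodeId:
    namespace: int   # defaults to namespace 0 when no "ns=" part is present
    id_type: str     # "i" numeric, "s" string, "g" GUID, "b" opaque
    identifier: str  # kept as a string, even for numeric identifiers

def parse_node_id(text: str) -> NodeId:
    """Parse an OPC UA NodeId string such as 'ns=2;i=1234' or 's=Motor.Speed'."""
    namespace, id_type, identifier = 0, None, None
    for part in text.split(";"):
        key, _, value = part.partition("=")
        if key == "ns":
            namespace = int(value)
        elif key in ("i", "s", "g", "b"):
            id_type, identifier = key, value
    if id_type is None:
        raise ValueError(f"invalid NodeId: {text!r}")
    return NodeId(namespace, id_type, identifier)

parse_node_id("ns=2;i=1234")  # → NodeId(namespace=2, id_type='i', identifier='1234')
```

Client libraries such as the Python `opcua` package expose comparable NodeId handling; a generator could use such identifiers to map nodes to RL state and action variables.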
As presented in Figure 2.4, the OPC UA Client is constructed of two layers - the Client Application and the OPC UA Client API. The Client Application encapsulates the producer-consumer service functionality by accessing the underlying OPC UA Client API, following the asynchronous system designs described in [TV07]. Further, the Client Application is the code that implements the function of the client. It uses the Client API to send and receive OPC UA Service requests and responses to the server. The Services defined for OPC UA are described in Clause 6.4 of [OPC17b]. Note that the Client API is an internal interface that isolates the client application code from an OPC UA Communication Stack. The OPC UA Communication Stack converts Client API calls into messages and sends them through the underlying communications entity to the server at the request of the client application. The OPC UA Communication Stack also receives responses and NotificationMessages from the underlying communications entity and delivers them to the client application through the Client API.

Figure 2.3: The OPC UA server architecture (adapted from [OPC17a]).

Figure 2.4: The OPC UA client architecture (adapted from [OPC17a]).

2.2.2 EtherCAT

EtherCAT is an ISO-standardized Industrial Ethernet technology specialized in real-time communication using field bus systems [RSD10]. The main difference between the standard Ethernet protocol and EtherCAT is that EtherCAT allows processing of data frames on the fly. This reduces the latency costs of usual Ethernet communication. The main target of EtherCAT is to ensure a standard communication profile in the industrial context. Emphasis is placed on the flexibility of the protocol and efficient communication with different top-layer protocols (like OPC UA) rather than on a semantic and modular description of the machine-tool capabilities [WZLL21].
Figure 2.5 shows a schematic overview of the EtherCAT Master implementation.

Figure 2.5: Schematic UML component diagram showing the EtherCAT Master implementation.

The scope of the available master implementation and the supported functions vary depending on the target application. Optional functionalities are supported or intentionally omitted to optimize hardware and software resources. In contrast to OPC UA, EtherCAT is a pure communication protocol: it does not prescribe how the transmitted data is modeled or represented.

2.3 Reinforcement Learning

The term Reinforcement Learning describes a class of methods where one or multiple agents are placed in an environment for interactive task solving. The research field is influenced by AI, especially robotics, and classical control theory. In the following, this thesis follows the formalism from the textbooks by Sutton et al. [SB18], Russell et al. [RN02], and Goodfellow et al. [GBCB16]. First, we introduce the terminology of Reinforcement Learning. Figure 2.6 describes the general idea of RL.

Formalism

A reinforcement learning agent's actions in a given environment can be formalized by a Markov Decision Process (MDP) [RN02]. An MDP is a 5-tuple ⟨S, A, R, P, ρ₀⟩, where

• S is the set of all valid states,
• A is the set of all valid actions,
• R : S × A × S → ℝ is the reward function, with r_t = R(s_t, a_t, s_{t+1}),
• P : S × A → P(S) is the transition probability function, with P(s′ | s, a) being the probability of transitioning into state s′ if the agent starts in state s and takes action a,
• ρ₀ is the starting state distribution.

The name MDP refers to the fact that the system obeys the Markov property: transitions only depend on the most recent state and action, and no prior historical data influences the decision.
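As a concrete toy illustration (not taken from any machine-tool specification), the 5-tuple can be written down directly in Python: states and actions are plain values, and R and P become dictionaries. The states, actions, and probabilities below are invented for the example:

```python
import random

# A toy MDP with states "idle"/"running" and actions "start"/"stop".
# P maps (state, action) to a distribution over next states;
# R maps (state, action, next_state) to a scalar reward (default 0).
P = {
    ("idle", "start"): {"running": 0.9, "idle": 0.1},  # starting may fail
    ("idle", "stop"): {"idle": 1.0},
    ("running", "stop"): {"idle": 1.0},
    ("running", "start"): {"running": 1.0},
}
R = {
    ("idle", "start", "running"): 1.0,   # successfully started production
    ("idle", "start", "idle"): -0.1,     # failed start attempt
}

def step(state, action):
    """Sample one transition. The next state depends only on the current
    (state, action) pair - which is exactly the Markov property."""
    dist = P[(state, action)]
    next_state = random.choices(list(dist), weights=dist.values())[0]
    reward = R.get((state, action, next_state), 0.0)
    return next_state, reward

state, reward = step("idle", "start")   # e.g. ("running", 1.0) with probability 0.9
```

Here ρ₀ would simply be a distribution over the initial states, e.g. always starting in "idle".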
Figure 2.6: Generic Reinforcement Learning Agent-Environment loop (adapted from [SB18]).

Environment and Agent

The environment describes the space that a learning agent is placed in to solve tasks. While the environment can be either simulated or real, the agent usually follows a strategy that computationally defines how to explore and act in the environment. The goal of an agent in an environment is to learn to solve the specified task in the environment. During the learning phase, it follows a search strategy to explore and learn a strategy to exploit the experienced knowledge in order to solve the task. In the context of simulating machine-tools, the agent performs actions by sending in- and output signals to the control unit of the machine-tool. By sending the control signals, the control unit lets the machine perform a certain task (e.g., moving a robotic arm forward). Hence, the agent performs an action in the environment, which is the machine-tool itself.

States and Observations

A state s ∈ S describes an observable part of the environment that the agent is in. That description might not necessarily be complete, as the environment might only be partially observable. An observation o is a partial description of a state, which may omit information. For example, a robot tasked with navigating in a room might only have access to a camera image representing the room and the robot's position. Terminal states denote a goal state, after which the environment usually experiences some kind of reset to an initial state. A terminal state also marks the end of an episode. The state space S describes the space of all possible states.

Actions

When in a state s ∈ S, an agent can perform an action a ∈ A from an action space A. This results in a new state s′ ∈ S and a reward signal. From then on, the agent is again tasked with performing an action, given the state s′.
This loop is also illustrated in Figure 2.6. An agent's interaction with the environment in an episode gives rise to a sequence

τ = ((s₀, a₀), (s₁, a₁), ...),

which is also referred to as a trajectory. For example, a robot tasked with navigating from point A to point B might get a small negative reward of -1 for every step of the trajectory it takes towards point B, but a large positive reward of 100 for actually stepping into B, and thus reaching a terminal state that finishes the trajectory. The rewards therefore encourage the robot to reach B with as few deliberate actions as possible, taking the direct route, so as to minimize the negative reward that accumulates from taking detours.

Policies

A policy π is a rule used by an agent to decide which actions to take. It can be either deterministic, in which case it is usually denoted by μ:

a_t = μ(s_t),

or stochastic, in which case it is usually denoted by π:

a_t ∼ π(· | s_t).

Especially in the robotics (and therefore the manufacturing) domain, RL deals with parametrized policies [AAC+19]. These policies are described by computable functions that depend on a set of parameters which are adjusted to change the behavior via some optimization algorithm. Such policies can be formalized as

a_t = μ_θ(s_t) or a_t ∼ π_θ(· | s_t),

where θ is a set which encapsulates all necessary parameters.

Reward and Return

The reward function R is critically important in RL. It depends on the current state of the world, the action just taken, and the next state of the world:

r_t = R(s_t, a_t, s_{t+1}).

Put simply, the reward is the gratification the agent receives when transitioning from its current state to the next state by taking a specific action. The agent literally takes a step. The goal of the agent is to maximize some notion of cumulative reward over a trajectory. One kind of return is the finite-horizon undiscounted return, which is just the sum of rewards obtained in a fixed window of T steps:

R(τ) = Σ_{t=0}^{T} r_t.
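For a trajectory recorded as a list of per-step rewards, this return is simply their sum. A minimal sketch, using the navigation example above (three -1 steps towards B, then +100 for the terminal step into B):

```python
def finite_horizon_return(rewards):
    """Finite-horizon undiscounted return: the plain sum of the
    rewards r_0, ..., r_T collected along one trajectory."""
    return sum(rewards)

rewards = [-1, -1, -1, 100]      # three steps towards B, then the terminal step
finite_horizon_return(rewards)   # → 97
```

A longer detour, e.g. six -1 steps before reaching B, would yield a lower return of 94, which is exactly what pushes the agent towards the direct route.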
Another kind of return is the infinite-horizon discounted return, which is the sum of all rewards ever obtained by the agent, but discounted by how far in the future they are obtained. This formulation of the return includes a discount factor γ ∈ (0, 1), which is usually set to ½ in practice:

R(τ) = Σ_{t=0}^{∞} γ^t r_t.

The discount factor is added to improve the chances that an infinite-horizon sum of rewards converges to a finite value. Convergence is the key factor in both return formulations.

Reinforcement Learning Problem Definition

The main goal of every RL algorithm is to find an optimal policy that maximizes the expected return if the agent acts according to it in the environment. The expected return, and therefore the utility U of the chosen policy, can be expressed as:

U(π) = ∫_τ P(τ | π) R(τ) = E_{τ∼π}[R(τ)].

The central optimization problem in RL can then be expressed by

π* = argmax_π U(π),

with π* being the optimal policy.

2.4 Reinforcement Learning Environment Frameworks

In the recent past, the application of RL algorithms has led to ground-breaking results, whether it is Google DeepMind's AlphaZero [SHS+18], which, as the first system ever, has beaten the world champion in the board game Go, or OpenAI's robotic hand, which was able to solve a Rubik's cube single-handedly [AAC+19]. In the manufacturing domain, research was conducted on the creation of an RL agent which can deduce the correct control policy of a manufacturing plant by acting in its DT (the Hardware-in-the-Loop simulation) [JCKV18]. All approaches have in common that researchers have to design and implement the simulation and visualization, the abstraction layer for the RL environment, and the RL agent. Depending on the posed optimization problem, this can be a cumbersome task [KCV18].
Although it is still required to write a certain amount of boilerplate source code, several open source projects provide a standard approach for implementing RL agents and environments, as well as for training and gathering the results of the algorithms and policies.

OpenAI Gym

The OpenAI Gym framework is a toolkit for RL research that includes common interfaces, environment and simulation implementations, along with benchmarks and reference implementations of basic algorithms [BCP+16]. OpenAI Gym includes several implementations of RL environments, covering the domains of Atari console games, classic control (like the CartPole problem), and MuJoCo [TET12]. The framework is implemented in Python and provides interfaces for the most common AI frameworks, including TensorFlow and PyTorch.

Google Dopamine

Google's Dopamine is an open source framework for researchers to prototype and experiment with different RL algorithms [CMG+18]. Although Dopamine is not as popular in the developer community1, it is under continuous development and provides, just as OpenAI's Gym, environments and integrations for the simulation and research of agents in environments like Atari console games and MuJoCo simulations.

1On the GitHub platform, around 9,400 developers starred the project, while over 20,000 starred the OpenAI Gym project.

3 Related Work

Chapter 2 was concerned with providing the necessary background information and a general overview required for this work. Since the research and goals in the manufacturing and AI domains diverge, it is challenging to find commonalities. Therefore, the following presents an overview of research topics at the intersection of both domains and how this work contributes to these research fields.

3.1 Digital Twin Generation

Digital Twins have proven to be an excellent tool to improve systems design and systems exploitation efficiency, as well as a maintenance support tool [CLQS19].
Research is highly driven to create ever more precise and configurable simulations, starting from the earliest stages of system development [YJ18]. However, highly specialized domain expertise is required at every step of the creation of simulations and their integration into a physical plant [Kre14]. Therefore, one research focus is the automatic generation of Digital Twin simulations from given data [MSKV18]. The main focus of the conducted research is the generation of physical twins and their corresponding capabilities in terms of control and functionality. In terms of the application of AI or RL, researchers have to implement and integrate the agents themselves and create simulation environments on their own. Tools like MATLAB1 provide mechanisms for the simulation and training of RL agents and the creation of machine-tool simulations, but the integration and methodology is neither standardized nor straight-forward. It takes a certain amount of boilerplate code to generate a digital twin from its physical properties, and even more to integrate all components together, including many manual tasks which slow down the actual research [MOW+21; XSK+21]. As highlighted by Xia et al. in their paper, the design of the RL environment was particularly focused on the integration with a previously created DT [XSK+21]. Results were shown as a proof of concept for a specific task rather than as a general approach of integrating the RL algorithm design process in a reusable manner.

3.2 Reinforcement Learning Environment Generation

Since the design of RL environments and agents is very task-specific, it is challenging to find papers which aim to generate such environments automatically. Although Ha et al. presented an approach to automate the reconfiguration of a robotic arm, their results and research questions are not comparable to those of this thesis [HKY18].
1See the official MathWorks website: https://de.mathworks.com/products/reinforcement-learning.html?s_tid=hp_brand_rl

3.3 Summary

This thesis tackles the challenge of providing a concept for the generation of reinforcement learning environments and corresponding agents from given machine-tool descriptions and domain expert input. It may seem similar to the research field of Digital Twin generation, but this thesis focuses on the reinforcement learning and generalization aspects rather than on specific physical and electrical properties of systems. Thus, this thesis makes a complementary contribution to the current research.

4 Research Approach

After Chapter 2 and Chapter 3 have prepared the basics and a connection to related literature has been established, the following chapter deals with the research strategy with which the three research questions mentioned in Chapter 1 are to be answered.

4.1 Research Methodology

The research methodology for this thesis is organized into the following phases: (i) the literature review and the decision on the programming language and RL framework to be used for the creation of the concept, (ii) the experience-based evaluation, where semi-structured interviews with domain experts were conducted to provide a qualitative analysis, and (iii) the conclusion and deduction of Use-Cases and requirements. Table 4.1 describes the reasons and included tasks for each phase.

Table 4.1: Distribution of research phases.
Research Phase | Task | Reason for Selection
Preparation | Literature review; analysis of existing open source RL frameworks | Gaining domain insights and foundational knowledge of existing approaches; minimal implementation to understand possible inconveniences of domain experts
Experience-based Evaluation | Semi-structured interviews with domain experts | Qualitative analysis; gaining insights from experts; external input from the end-users to design a concept which will fit their needs
Result Aggregation | Aggregation of interview results; identification of possible Use-Cases and requirements | Analysis of similarities between participants; combination of ideas to determine a clear understanding of the concept

The research of this thesis centers on the interviews with the domain experts. Through the analysis and definition of Use-Cases obtained directly from potential users, the goal is to design a software architecture concept which fits exactly the users' requirements. The results of the literature review in the first research phase have already been presented in Chapter 2 and Chapter 3. For the qualitative analysis of the frameworks and tools used for the concept creation, research was conducted on current implementations of RL algorithms and environments.

Research Strategy

Both the creation of RL environments and agents, as well as possible ways of integration with machine-tools, have been researched thoroughly. It should be noted that it proved challenging to find literature on general approaches and common best practices for simulation creation. Instead, the literature presents specific solutions and their implementations, leaving it up to the readers to draw parallels and generalize the methodologies to their own needs. Several research strategies exist and were considered for this analysis. An exploratory form of research strategy is relevant and hence exercised, as shown in Figure 4.1.
In the presented research process, several related topics were explored, which provided valuable insights into the domain and the issues motivating further exploration. From this, the motivation and a formal hypothesis for future research work were deduced.

Figure 4.1: Schematic representation of an exploratory research strategy.

Research Design

This research is designed in a flexible manner, since not all parameters and the required domain knowledge were available in advance. During the course of the research, parameters and dependencies could be clarified step by step and specified through the literature research. A fixed research design, on the other hand, would require a predefined set of elements before data collection can be initiated. Figure 4.2 shows the continuous, interdependent workflow.

Figure 4.2: The research design of flexible type (adapted from [RTVG21]).

Through continuous data and information aggregation, new insights could be unveiled, leading to a more specific research design. Using this iterative approach, knowledge gathered from literature and interviews could be integrated into the concept creation process.

Research Category

This case study follows a deductive approach, where we start with the collection of hypotheses from existing theories and then test these hypotheses against our observations.

Figure 4.3: Schematic overview of a deductive research approach (adapted from [FP19]).

4.2 Interview Design

The semi-structured interviews with the domain experts were held in a 45-minute time frame. The interviews were conducted as video and audio calls. First, the participants were asked to provide brief information about themselves, like their professional and demographic background, total years of professional employment, and the total number of years in the domain of manufacturing and RL research.
RQ1: Targeting the first research question of this thesis mentioned in Chapter 1, the participants were asked about the general tools, programming languages, and frameworks they use on a regular basis. The participants were asked openly formulated questions, so they had the freedom to speak in general terms about the technologies they use and potential inconveniences during the implementation of software required for their research. The information provided included not only the specific tools they use but also information about the machine-tools used at the application stage of their research and the methodologies they use. Quantitative data was collected by formulating questions about the amount of time it usually takes to create a simulation, the required steps, and their complexity. The participants were asked how they perceive the efficiency of the current methodology and how satisfied they are with it, on a scale of the interval [0, 5] in full integer steps.

RQ2: As a follow-up, the participants were confronted with the idea that, given a software tool to automatically generate the simulation environments, what kind of requirements and Use-Cases such a software should implement, and for which exact use. This developed into a productive dialogue where specific ideas and potential requirements for a reference concept could be gathered, and the answers were directly put in relation to current research and related work. Further, questions on possible extensions of such a software were asked to ensure that required interfaces and dependencies were not left out during the concept creation.

RQ3: Finally, the participants were asked about hypothetical uses of such a software in terms of applying it to real physical machine-tools. Again, a dialogue arose and led to concrete ideas for applications. Additionally, questions about the perception of and trust in such a software were asked, to include possibly required interactions in the generation process.
Quantitative data was gathered by asking the participants to rank the priorities of the requirements and Use-Cases drawn up during the interview.

4.3 Summary

During the course of the research for this thesis, quantitative results were obtained through the literature review and specific questions in the style of a questionnaire during the conducted semi-structured interviews. These interviews provided the qualitative results of the research, including insights into domain knowledge and common inconveniences and challenges in the regular work and research life of the participants. Following the flexible design and an explorative research strategy, new data and insights from different sources could be integrated into the concept resulting from the aggregated results. Table 4.2 summarizes the research approach of this thesis.

Table 4.2: Summary of the research approach.

Research Strategy | Explorative
Research Design | Flexible
Research Category | Inductive
Research Methods | Metric measures; expert interviews; survey
Type of Collected Data | Quantitative and qualitative
Question Types | Open and closed
Type of Expert Interview | Semi-structured
Arrangement Type of Expert Interview | Remote audio and video calls

5 Evaluation and Results

This chapter provides the aggregated quantitative and qualitative findings after the research was conducted as described in Chapter 4. Since it is a challenging task to find fitting participants, and even more so candidates who conduct research in the field of RL and manufacturing, participants were selected from academia and industrial corporations equally. It should also be noted here that each participant's research deals with a different, specific subfield, depending on the interests of the chair or the particular employer.

5.1 Quantitative Data

In the following, the quantitative data gathered by explicitly formulated questions during the interviews is presented.
Both the aggregated results as diagrams and the initial tabular data are provided, along with a partial discussion of the conclusions drawn from the data.

Professional Experience

Figure 5.1 and the corresponding Table 5.1 show that the participants have on average 4 years of professional experience in general and around 2 years of experience in the field of RL in the context of robotics and manufacturing. This is due to the difficulty of finding suitable study participants, which was mentioned at the beginning of this chapter. Nevertheless, the selected participants can be seen as valuable sources of domain expertise and knowledge.

Figure 5.1: Pie chart of the distribution of participants' professional experience and their experience specifically in the field of RL.

Table 5.1: Distribution of the participants' professional experience and their respective experience in the field of RL.

Participant ID | Years of Professional Experience | Years of Experience in RL Domain
P1 | 5 | 2
P2 | 5 | 1
P3 | 2 | 1
P4 | 2 | 2
P5 | 2 | 2
P6 | 3 | 1
P7 | 1 | 1
P8 | 9 | 4
P9 | 3 | 1
P10 | 6 | 2

Professional Background

Figure 5.2 and the corresponding Table 5.2 show that the majority of the interviewed participants, 60%, work in the industry, while 40% work as researchers in academia. This circumstance also reflects the current maturity of the research and its adoption in the industrial environment.

Figure 5.2: Pie chart of the distribution of participants' employment types.

Table 5.2: Mapping of the participants to their employment types in academia or the industry.

Participant ID | Employment Type
P1 | Academia
P2 | Academia
P3 | Industry
P4 | Academia
P5 | Industry
P6 | Industry
P7 | Academia
P8 | Industry
P9 | Industry
P10 | Industry

Implementation Frequency

During the interviews, participants were asked to estimate how often they have to implement simulation environments from scratch, or add to existing environments, in their daily work. Figure 5.3 and the corresponding Table 5.3 present the results.
Strongly Disagree here refers to the fact that the participant is almost never involved in the implementation of simulation environments, while Strongly Agree refers to a regular involvement in the development process.

Figure 5.3: The distribution of the frequency of creation or enhancement of simulation environments in relation to the employment field.

From the provided results we can initially deduce that at least 40% of the participants have no regular, or only rare, interaction with the implementation of simulation environments in general. A restriction to concrete RL simulation environments was explicitly omitted, since a participant may not be involved in the RL part of a research project but responsible for the machine-tool- and DT-related implementation and simulation tasks.

Table 5.3: The distribution of the frequency of creation or enhancement of simulation environments in relation to the employment field.

Participant ID | Employment Type | Regularity of Implementation
P1 | Research | Strongly Agree
P2 | Research | Agree
P3 | Industry | Normal
P4 | Research | Disagree
P5 | Industry | Disagree
P6 | Industry | Agree
P7 | Research | Normal
P8 | Industry | Agree
P9 | Industry | Strongly Disagree
P10 | Industry | Normal

Used Tools

To gather more information about the tools and concepts used by the participants in their regular work, they were asked to name 5 tools or programming languages they use. The following data is the aggregation of distinct tools which are publicly available to industry and research equally.

Figure 5.4: Representation of the six most frequently used tools by the participants.

Table 5.4: Representation of the six most frequently used tools by the participants.
Tools and Programming Languages | Participants Using Them
C/C++ | 10
Python | 10
MATLAB Simulink | 7
EtherCAT | 4
OPC UA | 5
OpenAI Gym | 3

Current Satisfaction Rate

As a final quantitative question, participants were asked to describe their satisfaction with the current development processes and the effort involved. It turned out that all but one participant are either satisfied with the current development methods or at least neutral towards them.

Figure 5.5: Satisfaction rating of the participants with the current development and research processes.

Table 5.5: Satisfaction rating of the participants with the current development and research processes.

Satisfaction | Count
Strongly Disagree | 0
Disagree | 1
Neutral | 5
Agree | 4
Strongly Agree | 0

5.2 Qualitative Data

Ten semi-structured interviews were conducted with experts to gain insight into their daily work and research routine, into how their currently used work methodologies and tools can be improved, and into how recurring tasks can be automated. For each interview, the time frame was set to around 45 minutes. The following section presents the results of the interviews in aggregated form. After starting the dialogue by collecting quantitative data such as the career field and the number of years in professional employment, participants were first asked in general terms about the tools they use to create and use simulations of machine-tools in their research. Participants were advised that they were free to list programming languages, communication technologies, and simulation-related and other software-related tools. Later, the participants were asked to focus on recurring processes in their work, especially in the implementation, and to list the work packages. After setting the framework for further dialogue, participants were asked to imagine having a software available that automatically handles recurring tasks such as setting up projects and generating boilerplate code.
They were asked what functionalities such software should have in order to significantly improve the satisfaction and efficiency of the participants' work. Potential use cases were further elaborated and subsequently prioritized. In particular, technical details were addressed in this dialogue, for example, at which points a participant attaches importance to being able to intervene in and adapt the generation processes. The interview was concluded by an open dialogue about potential application scenarios of such software. In particular, questions of feasibility, integration with real machine-tools, reusability, and portability were addressed.

5.2.1 Results

In the following, the results and the similarities between the answers of the participants are presented. All questions were posed openly, with the intent to stimulate a dialogue with the participant.

Would you be so kind as to describe the tools which you use for the creation of simulations in the manufacturing domain? The majority of participants separated their answer into the following aspects: (i) machine-tool control and communication profiles, namely EtherCAT and OPC UA as well as MQTT, and field bus systems like Profinet and Sercos, (ii) programming languages used for the implementation of the simulations, namely C/C++, MATLAB Simulink, and Python, and (iii) the different types of machine-tool PLCs they use. Besides basic software development tools like integrated development environments and source code management systems, all participants strongly emphasized that the choice of tools depends on the type of the machine-tool and its control and communication capabilities. OPC UA and EtherCAT could be identified as common denominators in terms of communication protocols. All participants mentioned that they rely on the modularization and abstraction of interfaces to integrate and test different approaches based on existing projects.
Further, all participants named the OpenAI Gym framework when asked which framework they use for RL purposes, and Python as the corresponding programming language. The majority of participants do not create simulation environments from scratch but rather enhance and complement existing simulations and models. One participant's research work cannot even be simulated, and the participant must perform development and testing tasks directly on the control of the machine-tool.

If you think about repetitive tasks in your day-to-day work, what do they look like, and which other time-consuming tasks can you think of? For this question, all participants agreed that the most time- and effort-consuming task is the implementation of wrapper and boilerplate code to integrate different concepts into their work. Depending on the complexity of the system, the participants' estimates of the duration of these tasks ranged from days to months of work. Further, the majority of participants from enterprises agreed on the lack of standardization of the interfaces and methodologies used. Two industrial participants mentioned that the design of, implementation of, and experimenting with new RL algorithms and neural networks takes only a third of the time they need to integrate all stakeholder systems and implement the required boilerplate code. All participants mentioned the recurring task of documenting the training results of the RL agents and of testing different algorithmic approaches against each other.

Given a potential software solution for automatic RL environment and agent generation, what Use-Cases and technical functionalities would you expect of the software? This dialogue took up the most time of the interviews.
The deductions made from this list will be discussed in Section 5.2.2. Each participant provided interesting ideas, but after being asked about the feasibility of their suggestions, the following list defined the common denominator:

• automatic generation of base projects and the required boilerplate code,
• the possibility to choose the communication protocol to use and to configure the connectivity information in advance,
• generation of standardized projects without hacky fixes,
• deployment of an agent directly to the machine-tool,
• the possibility to test multiple agent implementations on the same environment,
• the possibility to instantiate a pre-configured environment,
• logging of the agent's actions in nearly real-time,
• the ability to test different reward functions without changing the whole implementation.

5.2.2 Discussion on Findings

The following discusses the key deductions and interpretations from the interviews. In this process, the experts' opinions are analyzed thoroughly, and interpretations are made manually under software engineering aspects and in light of the preceding literature research. The findings are discussed one by one.

Automatic Base Project Generation

Following the experts' opinions, the boilerplate code for the simulation environments and agents should be generated in a standardized manner. Users of a potential software should be able to select the programming language and the required libraries and dependencies, so the software can generate both environment and agent implementations. Whether by setting CLI parameters or by providing a YAML file, the experts suggested the usage of persistable configurations, so they can be shared between research partners.

Abstraction of Environment and Client Implementation

Especially the academic employees emphasized the importance of the separation between the actual simulation environment and the implementation of a learning agent.
Following the expert’s opinion, this would improve the separation of work packages in a collaborative fashion as well as the maintainability and clearness of the implementations. Generic Interface for the Reward Function Implementation One important research topic is the design of suitable reward functions for the agents to learn in a given environment. Experts suggested the separation of the reward function from the underlying agents implementation and providing it through a standardized interface. At the phase of the interviews it was left unclear, how this can be achieved. 39 5 Evaluation and Results Configurable Machine-Tool Connectivity Since heterogeneity of communication and machine-tool description standards is a known issue in the manufacturing domain, the experts suggested designing a protocol-agnostic solution. Potential users of the software should be able to select from reference implementations (e.g. OPC UA) and only provide the necessary connectivity details (e.g. IPv4 addresses, usernames and passwords) and the software should set up the connectivity based on the provided information. Abstraction of the underlying Machine-Tool Capabilities Experts highlighted that the underlying semantic of a machine-tool description still requires a certain amount of domain expertise from an end-user. Therefore, the suggestion was to hide away the capabilities of the machine-tool and provide only possible parameters of the machine-tool to the user. Users could then select and adjust the required parameters themselves, but having at least a filtered view of available parameters would already be an enhancement. Modularity and Extensibility of the Software Research is fast, and innovation cycles are short. Experts suggested designing a potential software as modular and loosely coupled as possible so new ideas and technologies can be integrated and tested fast. 
Further, the heterogeneous environment of tools and methodologies used in all phases of the development requires a way to extend the existing implementations.

Manual Selection of In- and Output Parameters

In the course of the interviews, the experts' desire to be able to intervene in individual aspects of the generation process crystallized. In particular, this wish was clearly expressed regarding the selection of in- and output parameters for the simulation.

Instantiation of Environments and Agents

For further applications of a potential software, the experts asked for a way to instantiate multiple agents and environments on different machines to research new approaches in parallel. Further, testing different agent reward functions and different agents on the same environments could greatly reduce the time needed to aggregate training results and decide on further development and research directions.

5.2.3 Derived Use-Cases

Figure 5.6 and Figure 5.7 provide a brief schematic overview of the core Use-Cases derived from the interviews. The following chapter will describe the underlying concepts and functionalities in detail and provide a reference architecture.

Figure 5.6: UML Use-Case diagram providing a brief overview of general functionalities.

Figure 5.7: UML Use-Case diagram providing a brief overview of the user-facing functionalities.

6 Concept

After the research was conducted as described in Chapter 4 and the results were presented, a concept for the realization of a Reinforcement Learning Environment Generator is presented. The presented concept addresses the identified needs and suggestions of the surveyed domain experts and makes implementation proposals for a reference implementation. By incorporating practices from software engineering and common best practices, a modular, extensible, and loosely coupled software architecture is outlined. Section 6.1 describes general software design decisions made based on the conducted interviews.
Section 6.2 describes the methodology for parsing a given machine-tool capability description and the process of generalizing it to the point that it can be passed to the generation system for further processing. Further, the section outlines the developed approach for distilling states and actions from a given machine-tool description, so that observations of the environment can be made and communicated bi-directionally to the agent. Section 6.3 deals with the decoupling of the reward function from the rest of the implementation. The topic of automatic training of the generated agent and environment is covered by Section 6.4. Finally, Section 6.5 concludes this chapter with a discussion of the concept and by naming some of its limitations.

6.1 Architectural Design

Adhering to common best practices, the software should be designed to be as loosely coupled as possible, modular, and extensible. By using the tools the users are already used to, the concept tries to fit into existing workflows without any friction. Therefore, Python1 was chosen as the programming language for the project. Although a possible reference implementation in the form of a command line interface (CLI) tool could be implemented in another language, the concept sticks to the tools commonly used by the experts. Further, Python is a platform-independent language and provides several open source packages for the implementation of CLI tools. Additionally, all required dependencies have reference implementations or are implemented solely in Python. By making this decision, the concept strives to achieve modularity and extensibility. Since the majority of the experts had at least some experience with the OpenAI Gym framework2, the concept uses it to show the integration of an RL framework. It should be noted that the concept is designed in a framework-agnostic manner and can be extended with further RL frameworks like Google Dopamine.
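The decoupling of the reward function addressed by Section 6.3 can be sketched by injecting the reward as a plain callable with the signature R(s_t, a_t, s_{t+1}) from Chapter 2. The class and function names below, as well as the placeholder transition model, are illustrative assumptions, not the reference implementation:

```python
from typing import Callable

# A reward function maps (state, action, next_state) to a scalar reward.
RewardFunction = Callable[[float, float, float], float]

class GeneratedEnvironment:
    """Environment skeleton that stays agnostic of the concrete reward design."""

    def __init__(self, reward_function: RewardFunction):
        self.reward_function = reward_function
        self.state = 0.0

    def step(self, action: float):
        next_state = self.state + action        # placeholder transition model
        reward = self.reward_function(self.state, action, next_state)
        self.state = next_state
        return next_state, reward

# Two reward designs can be swapped without touching the environment code.
sparse_reward = lambda s, a, s_next: 100.0 if s_next >= 3.0 else 0.0
dense_reward = lambda s, a, s_next: -abs(3.0 - s_next)

env = GeneratedEnvironment(dense_reward)
next_state, reward = env.step(1.0)              # dense reward: -|3 - 1| = -2
```

Because the environment only depends on the callable's signature, different reward functions can be tested against the same generated environment, as requested by the interviewed experts.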
6.1.1 Components

First, the functionality provided to the user (in the following, the provided functionalities are accessed through a CLI) has to be separated into responsible services, namely: (i) a service responsible for the generation of the folder and file structure, the generation of standard project boilerplate code like setup.py files and package structures, as well as the naming of packages and basic dependency management, (ii) a connectivity service which handles the communication with the machine-tool simulation, (iii) a machine-tool description service providing the information needed to determine states and actions for the environment and the agent, and finally (iv) a service responsible for the interactions of the user concerning the agent parameters, states, and actions.

1See: https://docs.python.org/3/reference/
2See: https://github.com/openai/gym

By using a facade, the concept hides implementation and interface details of the underlying technologies. This allows the integration of different communication protocols and machine-tools. Further, the separation between the communication and the description of its capabilities allows the integration of external description models. For the agent, the machine-tool itself is the environment with which it interacts. As long as states and actions are provided, the agent can be trained and evaluated against the provided environment. A researcher evaluating new algorithms shouldn't be bothered by the implementation details of the underlying machine-tool description model or by how the communication between the environment and the machine-tool is established. Figure 6.1 provides a UML component overview of the separated services and interfaces.

Figure 6.1: The UML component diagram describing the access to the system's functionalities through a CLI application.
As shown in Figure 6.1, the sole entry point for the system's functionalities is a CLI. The CLI provides formalized input parameters and formats. If required, the contract between the CLI and the internal system could be migrated to a Web-capable API like REST or gRPC.

6.1.2 Project Configuration

The project configuration service handles the most time-consuming task: the setup of a Python project, the installation of all required (and additional) dependencies, verifying that they don't interfere with each other, the creation of the required folder structures, and setting the right permissions according to the client's operating system. For illustration purposes, the constrained verification for a framework parameter has been added to the component diagram. Depending on the requested RL framework, the service creates the required files, manages the required class and file names as well as the package management in the folder structure (which is mandatory in frameworks like the OpenAI Gym), and stores the current configuration to a YAML file. A possible configuration YAML is shown below.

Figure 6.2: The UML component diagram of the services which realize the creation of a Python project for RL.

Listing 6.1 Example Project Configuration
---
project_name: "foo"
base_project_path: "/path/to/project"
framework: "openai_gym"
dependencies:
  - name: "py_dep_a"
    version: "1.2.3"
  - name: "py_dep_b"
    version: "4.5.6"
---

6.1.3 Machine Tool Connectivity and Description

Both connectivity services shown in Figure 6.1 serve as examples. The underlying key concept is the separation of the connectivity-related tasks from the semantic description of the machine-tool and its capabilities. The concept separates the responsibilities of the services as follows: the connectivity service handles the connection to the machine-tool and provides permanent bi-directional access to it.
The description service retrieves the information about the semantic properties of the machine-tool and provides it to interested services for future usage. Figure 6.3 shows the components of the connectivity service and its relation to an exemplary OPC UA server attached to the machine-tool simulation. Analogous to the project configuration service shown in Section 6.1.2, a possible configuration in YAML is shown below. The shown configuration object is instantiated for OPC UA connections. Future implementations have to handle the configuration files in a contextual manner, meaning that the files have to be parsed in advance and checked for provided parameters. Depending on the provided parameters, the service generates the according source code for the specific connectivity protocol.

Figure 6.3: The UML component diagram of the connectivity services and their relation to an exemplary OPC UA server.

Listing 6.2 Example OPC UA Machine Tool Connectivity Configuration
---
type: "opc_ua"
opc_ua_configuration:
  connection_type: "ua_tcp"
  connection_address: machine.tool
  port: 53530
  server_name: "OPCUA/SimulationServer"
  security_modes: "None, Sign, Sign&Encrypt"
  security_policies: "all"
  bind_address: true
---

6.2 Machine Tool Description Parsing

For the generation of the environment and agent, it is necessary to provide the possible states and actions for the agent. If one takes the OPC UA server model as the machine-tool description format, the server provides access to the address space, which is basically an iterable tree. Clients can read and write data to OPC UA nodes, which would be actions in the environment. The observation and state of the environment, on the other hand, would be the current values of the subscribed objects.
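The traversal of the address-space tree to collect candidate states and actions can be sketched as follows. The Node class mocks the OPC UA address space for illustration; in a real implementation, a client library would supply these objects from the server's root node. All names are illustrative assumptions.

```python
# Sketch of walking an address-space tree: every node value is a
# potential observation (state), every writable node a candidate action.
# The Node class is a mock of the OPC UA address space.

class Node:
    def __init__(self, node_id, writable=False, children=()):
        self.node_id = node_id
        self.writable = writable
        self.children = list(children)


def collect_states_and_actions(node, states=None, actions=None):
    """Depth-first traversal over the iterable address-space tree."""
    if states is None:
        states, actions = [], []
    states.append(node.node_id)            # every node can be observed
    if node.writable:
        actions.append(node.node_id)       # writable nodes map to actions
    for child in node.children:
        collect_states_and_actions(child, states, actions)
    return states, actions
```

The resulting NodeID lists form the intermediate mapping that is later presented to the user for manual selection.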
Unfortunately, it is a challenging task to map a semantic value to the OPC UA nodes on the server, meaning that without domain expert knowledge, it is not possible to predict the semantics of a given node and therefore not possible to automatically assign it to a specific action of an agent. However, provided with a mapping between NodeIDs and the corresponding action identifiers of an agent, a generator could create method bodies and the according boilerplate code in the agent implementation. In this way, the user would have to implement the functionality of the method bodies but would at least be sure that all sets of required actions and states of the environment were generated correctly. If no configuration of states and actions is provided to the software, the concept suggests the following workflow. First, the description of the machine-tool is fetched and stored in an intermediate format. In this format, all writable nodes in a specific functionality space of the machine-tool are annotated as states. Accordingly, intermediate actions to write to these nodes are generated. The user is now provided with this intermediate mapping and has to manually select which states and actions should be kept. After the user of the software finishes the labeling process, all non-marked states and actions are deleted from the intermediate file, and the file itself is transformed into a valid configuration and provided to the environment and agent implementations.

6.3 Reward Function Definition

Chapter 2 described the relevance of reward functions for RL agents. Depending on the algorithm and policy used, the implementation of the reward function requires different approaches. To separate the implementation of reward functions from the agent's step function implementation, we use the strategy pattern as follows: the generated code contains the call to a parametrized client interface.
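This decoupling can be sketched as follows; the concrete strategy classes and the distance-based reward are illustrative assumptions, not part of the reference architecture.

```python
# Sketch of the strategy pattern decoupling the reward function from
# the environment's step logic. Class names are illustrative assumptions.

class RewardStrategy:
    """Interface every reward function implementation has to fulfill."""

    def compute(self, observation, action) -> float:
        raise NotImplementedError


class DistanceReward(RewardStrategy):
    """Penalizes the distance of the observation to a target value."""

    def __init__(self, target):
        self.target = target

    def compute(self, observation, action):
        return -abs(observation - self.target)


class Environment:
    def __init__(self, reward_strategy: RewardStrategy):
        # Injecting the strategy keeps step() free of reward details;
        # assigning a new strategy swaps the reward function.
        self.reward_strategy = reward_strategy

    def step_reward(self, observation, action):
        return self.reward_strategy.compute(observation, action)
```

Because the environment holds only a reference to the interface, replacing reward_strategy with another implementation changes the reward calculation without touching the generated step code.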
Depending on the types and number of variables (which can be functions as well), the interface calls another implementation of the reward function. The result of the function is then calculated and returned to the caller. A schematic representation of the used pattern is presented in Figure 6.4.

Figure 6.4: UML class diagram of the strategy pattern implementation for the reward function.

By using callbacks at a global scope in the training environment, it would even be possible to inject another reward function at runtime. This can be realized by adding a global listener which permanently monitors the current training settings and, if an asynchronous message for change arrives, replaces the implementation details of the reward function with the required ones. It should be noted that this is implementation- and programming-language-dependent and represents a future research topic.

6.4 Automatic Training

Using the generation system and the tools provided by open source frameworks, parallel training can be achieved using virtualization technologies like Docker3 and Kubeflow4. By relying on computation capacities available through cloud computing, multiple approaches can be trained by deploying different virtualization containers in which agents are trained in parallel. If an approach reaches a specific certainty threshold, researchers could, using the proposed concept, deploy the agent on a real machine by simply replacing the connectivity of the simulated machine-tool with the real one. However, special attention and extensive testing in advance are required before deploying onto a physical machine-tool and taking cost-sensitive risks.

3See: https://www.docker.com/
4See: https://www.kubeflow.org/

6.5 Discussion

The presented concept provides a reference architecture for a framework- and machine-tool-agnostic generation of RL environments and agents.
Generation was realized by splitting the responsibilities of the generation steps into several services with a unified interface. General projects can be generated with the required dependencies and structures depending on the selected framework. Users can create configuration files for their specific needs and, when necessary, enhance the implementation details to extend the software's functionalities. Machine-tool interaction is separated from the actual environment generation, since only a connection to the machine-tool and read/write access to the control units is required. By this separation, existing machine-tool simulations can be integrated, for example via OPC UA, and reused for different projects if needed. Further, agents can be deployed directly to the machine-tools, since the training was already performed on the digital twin of the physical machine.

Limitations: The presented approach is not fully autonomous and still requires the user's domain expertise to specify states and actions in the environment. Nevertheless, using intermediate description files for the state-action mapping, users can pre-configure the desired machine-tool functionalities and enhance the workflow of generating an RL environment and the according agent.

7 Conclusion and Outlook

Conclusion

This thesis proposed a software architectural approach for the generation of reinforcement learning environments and agents from machine-tool descriptions. The presented concept is based on empirical research conducted in advance through semi-structured interviews with domain experts. First, the foundations of digital twins, machine-tool description and communication methods, and reinforcement learning were presented. Selected reinforcement learning frameworks were highlighted.
By relying on the observations made in the literature review of the foundations and related work, an empirical study could be designed and its results aggregated into specific use cases which form the basis of the presented concept. Finally, an approach was formulated with which a generator can be implemented to facilitate the daily work of researchers in the domains of reinforcement learning and manufacturing. Limitations of the concept were discussed to stimulate future research.

Outlook

The presented concept serves as a starting point for further research and development. Future works could create a reference implementation, including testing with a physical machine. An arena-like experiment similar to the OpenAI hide-and-seek paper could be a possible research target, as well as research on the decision of when a trained agent is sufficiently reliable to be deployed onto a real machine-tool.

Bibliography

[AAC+19] I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, et al. “Solving Rubik's cube with a robot hand”. In: arXiv preprint arXiv:1910.07113 (2019) (cit. on pp. 23, 25).
[BCP+16] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, W. Zaremba. “OpenAI Gym”. In: arXiv preprint arXiv:1606.01540 (2016) (cit. on p. 25).
[BL15] B. Bagheri, J. Lee. “Big future for cyber-physical manufacturing systems”. In: Design World 23 (2015) (cit. on p. 17).
[Can16] A. Canedo. “Industrial IoT lifecycle via digital twins”. In: Proceedings of the Eleventh IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis. 2016, pp. 1–1 (cit. on p. 15).
[CEV17] A. Csiszar, J. Eilers, A. Verl. “On solving the inverse kinematics problem using neural networks”. In: 2017 24th International Conference on Mechatronics and Machine Vision in Practice (M2VIP). IEEE. 2017, pp. 1–6 (cit. on p. 15).
[CJOC10] G. Cândido, F. Jammes, J. B. de Oliveira, A. W. Colombo.
“SOA at device level in the industrial domain: Assessment of OPC UA and DPWS specifications”. In: 2010 8th IEEE International Conference on Industrial Informatics. IEEE. 2010, pp. 598–603 (cit. on p. 18).
[CLQS19] J. G. Campos, J. S. López, J. I. A. Quiroga, A. M. E. Seoane. “Automatic generation of digital twin industrial system from a high level specification”. In: Procedia Manufacturing 38 (2019), pp. 1095–1102 (cit. on p. 27).
[CMG+18] P. S. Castro, S. Moitra, C. Gelada, S. Kumar, M. G. Bellemare. “Dopamine: A Research Framework for Deep Reinforcement Learning”. In: (2018). url: http://arxiv.org/abs/1812.06110 (cit. on p. 25).
[ESLR19] C. Ellwein, A. Schmidt, A. Lechler, O. Riedel. “Distributed Manufacturing: A Vision about Shareconomy in the Manufacturing Industry”. In: Proceedings of the 2019 3rd International Conference on Automation, Control and Robots. 2019, pp. 90–95 (cit. on p. 15).
[FP19] D. M. Fernández, J.-H. Passoth. “Empirical software engineering: from discipline to interdiscipline”. In: Journal of Systems and Software 148 (2019), pp. 170–179 (cit. on p. 31).
[GBCB16] I. Goodfellow, Y. Bengio, A. Courville, Y. Bengio. Deep Learning. Vol. 1. 2. MIT Press Cambridge, 2016 (cit. on p. 22).
[Had06] T. Hadlich. “Providing device integration with OPC UA”. In: 2006 4th IEEE International Conference on Industrial Informatics. IEEE. 2006, pp. 263–268 (cit. on pp. 18, 19).
[HKY18] S. Ha, J. Kim, K. Yamane. “Automated deep reinforcement learning environment for hardware of a modular legged robot”. In: 2018 15th International Conference on Ubiquitous Robots (UR). IEEE. 2018, pp. 348–354 (cit. on p. 27).
[HS14] R. Henßen, M. Schleipen. “Online-Kommunikation mittels OPC-UA vs. Engineering-Daten (offline) in AutomationML”. In: Tagungsband Automation 2014 (2014), pp. 59–74 (cit. on pp. 18, 19).
[IJ13] J. Imtiaz, J. Jasperneite. “Scalability of OPC-UA down to the chip level enables “Internet of Things””.
In: 2013 11th IEEE International Conference on Industrial Informatics (INDIN). IEEE. 2013, pp. 500–505 (cit. on p. 19).
[JCKV18] F. Jaensch, A. Csiszar, A. Kienzlen, A. Verl. “Reinforcement learning of material flow control logic using hardware-in-the-loop simulation”. In: 2018 First International Conference on Artificial Intelligence for Industries (AI4I). IEEE. 2018, pp. 77–80 (cit. on pp. 15, 25).
[KCV18] B. Kaiser, A. Csiszar, A. Verl. “Generative models for direct generation of CNC toolpaths”. In: 2018 25th International Conference on Mechatronics and Machine Vision in Practice (M2VIP). IEEE. 2018, pp. 1–6 (cit. on p. 25).
[KKT+18] W. Kritzinger, M. Karner, G. Traar, J. Henjes, W. Sihn. “Digital Twin in manufacturing: A categorical literature review and classification”. In: IFAC-PapersOnLine 51.11 (2018), pp. 1016–1022 (cit. on pp. 17, 18).
[Kre14] D. Krenczyk. “Automatic generation method of simulation model for production planning and simulation systems integration”. In: Advanced Materials Research. Vol. 1036. Trans Tech Publ. 2014, pp. 825–829 (cit. on p. 27).
[LLK+20] Y. Lu, C. Liu, I. Kevin, K. Wang, H. Huang, X. Xu. “Digital Twin-driven smart manufacturing: Connotation, reference model, applications and research issues”. In: Robotics and Computer-Integrated Manufacturing 61 (2020), p. 101837 (cit. on p. 17).
[LZYW20] B. Li, H. Zhang, P. Ye, J. Wang. “Trajectory smoothing method using reinforcement learning for computer numerical control machine tools”. In: Robotics and Computer-Integrated Manufacturing 61 (2020), p. 101847 (cit. on p. 15).
[MLC20] R. Minerva, G. M. Lee, N. Crespi. “Digital twin in the IoT context: a survey on technical features, scenarios, and architectural models”. In: Proceedings of the IEEE 108.10 (2020), pp. 1785–1824 (cit. on p. 15).
[MOW+21] M. C. May, L. Overbeck, M. Wurster, A. Kuhnle, G. Lanza.
“Foresighted digital twin for situational agent selection in production control”. In: Procedia CIRP 99 (2021), pp. 27–32 (cit. on p. 27).
[MSKV18] G. S. Martínez, S. Sierla, T. Karhela, V. Vyatkin. “Automatic generation of a simulation-based digital twin of an industrial process plant”. In: IECON 2018 - 44th Annual Conference of the IEEE Industrial Electronics Society. IEEE. 2018, pp. 3084–3089 (cit. on p. 27).
[NB18] S. Niehoff, G. Beier. “Industrie 4.0 and a sustainable development: A short study on the perception and expectations of experts in Germany”. In: International Journal of Innovation and Sustainable Development 12.3 (2018), pp. 360–374 (cit. on p. 15).
[NNXR08] A. Nassehi, S. T. Newman, X. W. Xu, R. Rosso Jr. “Toward interoperable CNC manufacturing”. In: International Journal of Computer Integrated Manufacturing 21.2 (2008), pp. 222–230 (cit. on p. 15).
[Obj17] Object Management Group. The Unified Modeling Language Specification. Ed. by Object Management Group. Dec. 5, 2017. url: https://www.omg.org/spec/UML/2.5.1/PDF (cit. on p. 18).
[OPC17a] OPC Foundation. OPC 10000-1 - Part 1: Overview and Concepts. Ed. by OPC Foundation. Nov. 22, 2017. url: https://opcfoundation.org/developer-tools/specifications-unified-architecture/part-1-overview-and-concepts/ (cit. on pp. 19, 20).
[OPC17b] OPC Foundation. OPC 10000-4 - Part 4: Services. Ed. by OPC Foundation. Nov. 22, 2017. url: https://reference.opcfoundation.org/v104/Core/docs/Part4/ (cit. on p. 19).
[PCB+19] F. Pires, A. Cachada, J. Barbosa, A. P. Moreira, P. Leitão. “Digital Twin in Industry 4.0: Technologies, Applications and Challenges”. In: 2019 IEEE 17th International Conference on Industrial Informatics (INDIN). Vol. 1. 2019, pp. 721–726. doi: 10.1109/INDIN41052.2019.8972134 (cit. on p. 15).
[RN02] S. Russell, P. Norvig. “Artificial intelligence: a modern approach”. In: (2002) (cit. on p. 22).
[RSD10] M. Rostan, J. E. Stubbs, D. Dzilno. “EtherCAT enabled advanced control architecture”.
In: 2010 IEEE/SEMI Advanced Semiconductor Manufacturing Conference (ASMC). IEEE. 2010, pp. 39–44 (cit. on p. 21).
[RSK20] A. Rasheed, O. San, T. Kvamsdal. “Digital twin: Values, challenges and enablers from a modeling perspective”. In: IEEE Access 8 (2020), pp. 21980–22012 (cit. on p. 17).
[RTVG21] S. A. Rahman, L. Tuckerman, T. Vorley, C. Gherhes. “Resilient Research in the Field: Insights and Lessons From Adapting Qualitative Research Projects During the COVID-19 Pandemic”. In: International Journal of Qualitative Methods 20 (2021), p. 16094069211016106 (cit. on p. 30).
[SB18] R. S. Sutton, A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018 (cit. on p. 22).
[SHS+18] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al. “A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play”. In: Science 362.6419 (2018), pp. 1140–1144 (cit. on p. 25).
[TET12] E. Todorov, T. Erez, Y. Tassa. “MuJoCo: A physics engine for model-based control”. In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE. 2012, pp. 5026–5033 (cit. on p. 25).
[TV07] A. S. Tanenbaum, M. Van Steen. Distributed Systems: Principles and Paradigms. Prentice-Hall, 2007 (cit. on p. 19).
[VOX+05] S. Venkatesh, D. Odendahl, X. Xu, J. Michaloski, F. Proctor, T. Kramer. “Validating portability of STEP-NC tool center programming”. In: International Design Engineering Technical Conferences and Computers and Information in Engineering Conference. Vol. 47403.
2005, pp. 285–290 (cit. on p. 19).
[Woh19] D. Wohlfeld. “Digitaler Zwilling für die Produktion von Übermorgen”. In: Zeitschrift für wirtschaftlichen Fabrikbetrieb 114.1-2 (2019), pp. 65–67 (cit. on p. 15).
[WSJ17] M. Wollschlaeger, T. Sauter, J. Jasperneite. “The future of industrial communication: Automation networks in the era of the internet of things and industry 4.0”. In: IEEE Industrial Electronics Magazine 11.1 (2017), pp. 17–27 (cit. on p. 15).
[WZLL21] C. Wang, L. Zheng, B. Li, Z. Li. “Design and implementation of EtherCAT Master based on Loongson”. In: Procedia Computer Science 183 (2021), pp. 462–470 (cit. on p. 21).
[XSK+21] K. Xia, C. Sacco, M. Kirkpatrick, C. Saidy, L. Nguyen, A. Kircaliali, R. Harik. “A digital twin to train deep reinforcement learning agent for smart manufacturing plants: Environment, interfaces and intelligence”. In: Journal of Manufacturing Systems 58 (2021), pp. 210–230 (cit. on pp. 18, 27).
[YJ18] A. Yadav, S. Jayswal. “Modelling of flexible manufacturing system: a review”. In: International Journal of Production Research 56.7 (2018), pp. 2464–2487 (cit. on p. 27).
[YNGJ02] S. Yoo, G. Nicolescu, L. Gauthier, A. A. Jerraya. “Automatic generation of fast timed simulation models for operating systems in SoC design”. In: Proceedings 2002 Design, Automation and Test in Europe Conference and Exhibition. IEEE. 2002, pp. 620–627.

All links were last followed on June 1, 2021.

Declaration

I hereby declare that the work presented in this thesis is entirely my own and that I did not use any other sources and references than the listed ones. I have marked all direct or indirect statements from other sources contained therein as quotations. Neither this work nor significant parts of it were part of another examination procedure. I have not published this work in whole or in part before. The electronic copy is consistent with all submitted copies.
place, date, signature

1 Introduction
1.1 Motivation
1.2 Research Questions and Goals
1.3 Outline
2 Foundations
2.1 Digital Twin
2.2 Machine-Tool Description and Communication
2.3 Reinforcement Learning
2.4 Reinforcement Learning Environment Frameworks
3 Related Work
3.1 Digital Twin Generation
3.2 Reinforcement Learning Environment Generation
3.3 Summary
4 Research Approach
4.1 Research Methodology
4.2 Interview Design
4.3 Summary
5 Evaluation and Results
5.1 Quantitative Data
5.2 Qualitative Data
6 Concept
6.1 Architectural Design
6.2 Machine Tool Description Parsing
6.3 Reward Function Definition
6.4 Automatic Training
6.5 Discussion
7 Conclusion and Outlook
Bibliography