Physics-Driven Machine Learning: from Biomolecules to Crystals

Dissertation approved by the Faculty of Energy-, Process- and Bio-Engineering of the University of Stuttgart and the Stuttgart Center for Simulation Science (SC SimTech) for the degree of Doctor of Engineering Sciences (Dr.-Ing.)

Submitted by Ángel Díaz Carral from Ávila, Spain

Main examiner: Prof. Dr. rer. nat. Dr. h. c. Siegfried Schmauder
Co-examiner: Prof. Dr. rer. nat. Maria Fyta
Co-examiner: apl. Prof. Dr.-Ing. habil. Niels Hansen

Date of the oral examination: 27.03.2024

Institute for Materials Testing, Materials Science and Strength of Materials (IMWF), University of Stuttgart
September 2024

I dedicate this thesis to my loving wife Kany, parents Ángel and Marilo, brother Nacho, and friends. I also extend this dedication to my grandparents, whose wisdom echoes through generations, and to my precious baby Lorán, whose arrival has filled our hearts with boundless joy and love. May the lessons from both the past and the future inspire the discoveries within these pages...

Declaration

I hereby declare that, except where specific reference is made to the work of others, the contents of this dissertation are original and have not been submitted in whole or in part for consideration for any other degree or qualification in this, or any other university. This dissertation is my own work and contains nothing which is the outcome of work done in collaboration with others, except as specified in the text and acknowledgements. I also declare that there is no conflict of interest, and any included publications are my own work, except where indicated throughout the thesis and summarised and clearly identified on the declarations page of the thesis.

Ángel Díaz Carral

Acknowledgements

This thesis is a compilation of my research work done at SC SimTech and the University of Stuttgart during the years 2019-2023. I am profoundly grateful to Prof. Fyta for her steadfast guidance and the life-changing opportunity to start as a HiWi and progress to a PhD student. Prof. Schmauder's invaluable mentorship played a crucial role during my research journey. Special thanks to Christian Holm for exceptional leadership, unwavering support, and facilitating the contract extension for my continued involvement.

The collaborative environment at the Institute for Computational Physics (ICP) and the Institute for Materials Testing, Materials Science, and Strength of Materials (IMWF) was pivotal in developing this thesis. The infrastructure at ICP and IMWF played an instrumental role in its completion. I am grateful to the dedicated members of my research groups at both ICP and IMWF, including Chandra, Takeshi, Martin, Magnus, Simon, Azade, Kira, Ayberk and Louis from ICP, and Xiang, Frank and Stephen from IMWF. Their insights and collaboration significantly shaped the outcomes of this thesis.

I extend thanks to the University of Stuttgart for fostering an enriching academic environment and providing essential resources. My deep gratitude also goes to the University's EXC 2075 SimTech Cluster, supported by the German Research Foundation (DFG), for their financial assistance, allowing me to concentrate on research and pursue academic goals throughout my doctoral studies.

Finally, my deepest gratitude goes to my wife, family, and friends for their unwavering support, encouragement, and understanding throughout this challenging journey. Their love and encouragement have been a constant source of motivation.

This thesis is dedicated to all those who have played a significant role in my academic and personal growth. Thank you for being part of this incredible journey with me.

Ángel Díaz Carral

Publications

The following publications are the result of the work dedicated to this thesis.
• Á. Díaz Carral, M. Ostertag, and M. Fyta, "Deep learning for nanopore ionic current blockades," J. Chem. Phys., vol. 154, no. 4, p. 044111, 2021.
• Á. Díaz Carral, X. Xu, S. Gravelle, A. YazdanYar, S. Schmauder, and M. Fyta, "Stability of binary precipitates in Cu-Ni-Si-Cr alloys investigated through active learning," Mater. Chem. Phys., vol. 306, p. 128053, 2023.
• Á. Díaz Carral, M. Roitegui, and M. Fyta, "Interpretably learning the critical temperature of superconductors: electron concentration and feature dimensionality reduction," APL Mater., vol. 12, p. 041111, 2024.
• Á. Díaz Carral, M. Roitegui, A. Koc, M. Ostertag, and M. Fyta, "Concurrent analysis of electronic and ionic nanopore signals: blockade mean and height," Nano Ex., vol. 5, no. 2, p. 025020, 2024.
• Á. Díaz Carral, S. Gravelle, and M. Fyta, "In silico evidence of metastable quaternary phases in Cu-Ni-Si-Cr alloys," submitted to APL Mach. Learn., 2024.
• Á. Díaz Carral, M. Roitegui, and M. Fyta, "Structural and electronic features for the prediction of superconducting materials," in preparation, 2024.

Other publications:
• L. Oberer, Á. Díaz Carral, and M. Fyta, "Simple Classification of RNA Sequences of Respiratory-Related Coronaviruses," ACS Omega, vol. 6, no. 31, pp. 20158-20165, 2021.

Abstract

Physical systems and their interactions are inherently equivariant [1]. In machine learning (ML), quantities that arise from these equivariant interactions are predicted through two main approaches: applying equivariant operations to generate invariant scalar features as input for an invariant model, or using equivariant models directly. This thesis focuses on the former framework and explores feature extraction and data representation techniques in physics domains through physics-driven machine learning (PDML). This field of ML uses prior knowledge of physics to create descriptors that encode the underlying symmetries of the dataset, thereby reducing dimensionality, increasing interpretability, and enhancing generalization capabilities.
To highlight the significance of physics-driven descriptors in physically-inspired embedding spaces, the thesis focuses on several schemes relevant to its objectives:
1. Copper-based alloys
2. Nanopore detectors
3. High-Tc superconductivity
Through molecular simulations and PDML approaches, the aim is to investigate and provide insights into nanopore sequencing and materials discovery. In this thesis, the following questions are investigated:
• What are the limitations of physics-inspired descriptors in ML?
• Can we decrease the dimensionality of the data while maintaining the same level of prediction accuracy?
• Is it possible to achieve comparable performance using PDML with invariant descriptors compared to conventional ML methods?
• How does PDML scale in atomistic systems?

The investigation of copper-based alloys carried out within the framework of this project combines computer simulations and active learning (AL) to reveal stable precipitate phases of copper alloys and study their mechanical properties. Once the AL cycle that generates accurate ML interatomic potentials (MLIPs) had been implemented, the focus shifted to the stability analysis framework for binary copper-based systems. This part involves quantum mechanical (QM) simulations of various alloy configurations in copper alloys using density functional theory (DFT) as implemented in VASP. Static calculations are performed at zero temperature to generate data for the AL on-the-fly relaxation algorithm. The algorithm utilizes moment tensor potentials (MTPs), a type of descriptor based on invariant polynomials, to construct MLIPs for multi-component alloys. The goal is to conduct a comprehensive search for stable precipitate phases in copper alloys.

Turning to biomolecules, nanopore DNA translocations are studied through PDML.
DNA molecules can be electrophoretically driven through a nanoscale opening in a material, giving rise to rich and measurable ionic current blockades. In this work, ML models are trained on experimental ionic blockade data from DNA nucleotide translocation through 2D pores of different diameters. The aim of the resulting classification is to enhance the read-out efficiency of nucleotide identity, providing pathways toward error-free sequencing. A novel method is proposed that simultaneously reduces the current traces to a few physical descriptors and trains low-complexity models, thus reducing the dimensionality of the data. Each translocation event is described by four features, including the height of the ionic current blockade.

In the field of high critical temperature (high-Tc) superconductivity, an exceptionally effective PDML model is proposed that predicts the critical temperatures of superconductors from features carefully extracted from electronic and atomic properties. Despite a streamlined feature space, it retains its accuracy when compared to more intricate methodologies. The model is fine-tuned to forecast distinct superconductor properties, balancing precision and simplicity and enabling predictions for emerging structures.

By bridging the gap between ML and physics, this research contributes to the growing field of PDML by embedding the physics into the descriptors, advancing the ability to model, predict, and control complex physical systems with unprecedented accuracy and efficiency. Through this work, the aim is to pave the way for transformative applications, insights, and discoveries that have the potential to reshape scientific and technological advancements across multiple disciplines.

Zusammenfassung

Physical systems and their interactions are inherently equivariant [1]. The prediction of quantities in machine learning (ML) that are fundamentally generated from these equivariant interactions follows two main approaches: applying equivariant operations to generate invariant scalar features as input for an invariant model, or using equivariant models themselves. This thesis focuses on the first approach, in which we investigate the extraction of features and the representation of data in physical domains through physics-driven machine learning (PDML). This particular area of ML benefits from existing physical knowledge to create descriptors that encode the underlying symmetries of the dataset. This reduces dimensionality, increases interpretability, and improves the ability to generalize.

To underline the importance of physics-driven descriptors in physically inspired embedding spaces, the thesis concentrates on several schemes relevant to its objectives:
1. Copper-based alloys
2. Nanopore detectors
3. High-Tc superconductivity
Through molecular simulations and PDML approaches, the aim is to investigate and provide insights into nanopore sequencing and materials discovery. In this thesis, the following questions are investigated:
• What are the limitations of physics-based descriptors in machine learning?
• Can we reduce the dimensionality of the data while maintaining the same prediction accuracy?
• Is it possible to achieve performance with PDML and invariant descriptors comparable to that of conventional ML methods?
• How scalable is PDML in atomistic systems?

The investigation of copper-based alloys carried out within this project concentrates on the combination of computer simulations and active learning (AL) to uncover stable precipitate phases of copper alloys and to study their mechanical properties. Once the AL cycle had been successfully implemented and accurate ML interatomic potentials (MLIPs) had been generated, the focus was placed on the stability analysis framework for binary copper-based systems. This part comprises quantum mechanical (QM) simulations of various alloy configurations in copper alloys using density functional theory (DFT) as implemented in VASP. Static calculations are performed at zero temperature to generate data for the AL on-the-fly relaxation algorithm. The algorithm uses moment tensor potentials (MTPs), a type of descriptor based on invariant polynomials, to construct MLIPs for multi-component alloys. The goal is to conduct a comprehensive search for stable precipitate phases in copper alloys.

To deepen the analysis further, nanopore DNA translocations are studied by means of PDML. DNA molecules can be moved electrophoretically through a nanoscale opening in a material, giving rise to rich and measurable blockades of the ionic current. In this work, ML models are trained on experimental ionic blockade data from DNA nucleotide translocations through 2D pores of different diameters. The aim of this classification is to improve the efficiency of nucleotide identification and thus pave the way to error-free sequencing. A novel method is proposed that simultaneously reduces the current traces to a few physical descriptors and trains low-complexity models in order to reduce the dimensionality of the data. Each translocation event is described by four features, including the height of the ionic current blockade.

In the field of high-temperature (high-Tc) superconductivity, an exceptionally effective PDML model is proposed to predict the critical temperatures of superconductors by carefully extracting features from electronic and atomic properties. Despite a simplified feature space, it retains its accuracy in comparison with more complicated methods. The model is fine-tuned to predict various properties of superconductors, striking a balance between precision and simplicity and enabling predictions for emerging structures.

By bridging the gap between ML and physics, this research will contribute to the growing field of PDML by embedding the physics into the descriptors and improving the ability to model, predict, and control complex physical systems with an unprecedented degree of accuracy and efficiency. This work is intended to pave the way for transformative applications, insights, and discoveries that have the potential to reshape scientific and technological advances across various disciplines.

Table of contents

Publications
Abstract
Zusammenfassung
List of figures
List of tables
1 Introduction
    1.1 Goals
    1.2 Thesis Outline
2 Theoretical Background
    2.1 Copper-based Alloys
    2.2 Nanopore Detectors
    2.3 High-Tc Superconductivity
    2.4 Respiratory-Related Coronaviruses
3 Machine Learning
    3.1 Shallow Learning
        3.1.1 Clustering Analysis
        3.1.2 Tree-based Methods
    3.2 Deep Learning
        3.2.1 Multi-Layer Perceptron
    3.3 Feature Importance: SHAP Values
    3.4 Physics-Driven Machine Learning
        3.4.1 Symmetry and Equivariance in ML
        3.4.2 Invariant Descriptors: Learning Latent Representations
        3.4.3 Geometric Deep Learning
    3.5 Machine Learning Combined with Atomistic Simulations
4 Simulation Methods
    4.1 Density Functional Theory
        4.1.1 Kohn-Sham Equations
        4.1.2 Exchange-correlation Functionals
        4.1.3 Crystals: Periodicity, Basis Sets and Pseudopotentials
    4.2 Molecular Dynamics
        4.2.1 Integration Schemes
    4.3 Statistical Ensembles in MD
    4.4 Thermostats
5 Stability of n-ary Phases in Cu-Ni-Si-Cr Alloys
    5.1 Computational Details
        5.1.1 Crystal Prototype Sampling
        5.1.2 On-the-fly Active Learning Relaxation: AL-MTP
        5.1.3 Convex Hull Calculations
        5.1.4 Calculation of Phonon Dispersion
        5.1.5 Molecular Dynamics Simulations
    5.2 Results and Discussion
        5.2.1 Prediction of Novel Binary Structures
        5.2.2 Analysis of Metastable Quaternary Structures
        5.2.3 Properties Assessment of the Predicted Cu7Si
    5.3 Summary
    5.4 Acknowledgement
6 Enhancing Nanopore Translocation Read-out via PDML
    6.1 Data Collection
        6.1.1 Experimental Details
        6.1.2 Event Detection
    6.2 Feature Extraction
    6.3 Machine Learning Models
        6.3.1 Clustering Methods
        6.3.2 Classification Methods
        6.3.3 SHAP Analysis
    6.4 Results
        6.4.1 Feature Efficiency via Clustering Analysis
        6.4.2 Classification of Single Nucleotides
        6.4.3 Bimodal Feature Importance
    6.5 Summary
    6.6 Acknowledgement
7 Learning the Critical Temperature of Superconductors through PDML
    7.1 Datasets on Superconductors
    7.2 Feature Selection
    7.3 PDML Model
    7.4 Results
        7.4.1 Feature Efficiency via Dimensionality Reduction
        7.4.2 Interpreting EC Features: SHAP Analysis
        7.4.3 Prediction of Critical Temperature in HEA
    7.5 Summary
    7.6 Acknowledgement
8 Conclusions
Appendix A Simple Classification of RNA Sequences of Respiratory-Related Coronaviruses
    A.1 Data Collection and Preprocessing
    A.2 Feature Extraction
    A.3 Implementation and Optimization
    A.4 Clustering Analysis
    A.5 Conclusions
    A.6 Acknowledgement
References

List of figures

3.1 Instances of SL classification (left) and UL clustering (right). In SL, the model is trained on labeled data (blue and red), allowing it to learn patterns and relationships. In UL, by contrast, the model adapts to unlabeled data (grey), autonomously identifying structures and patterns without predefined categorization [2].
3.2 k-means vs. DBSCAN example. DBSCAN proves adept with irregular and diverse datasets, while k-means efficiently partitions data into k clusters based on mean distances to centroids [3].
3.3 This example employs two features (root and internal nodes) to classify data into three sub-groups (leaves); intermediate splits that do not immediately form a leaf are internal nodes. The lines connecting nodes and leaves are branches. Input variables (features) are utilized to classify data into sub-groups based on binary conditions. The training sample is used to compute the sample mean of the output y_t for each sub-group, which serves as a constant prediction for future observations classified into that sub-group [4].
3.4 Computational graph of a single perceptron with the input and output layers, as well as the nodes and bias vectors of the layers [5].
3.5 Data and physics scenarios [6].
3.6 An example illustrating the difference between symmetry group invariance and equivariance, presented in the context of identifying a handwritten letter in an image [7].
3.7 LeNet-5: One of the earliest convolutional neural networks [5].
3.8 Time-layered architecture of an RNN [5].
3.9 Performance comparison of several descriptor-based ML potentials on a range of crystal structures [8].
3.10 Dimensionality reduction of the atomic neighborhood n_i for the atom i, described by the moment tensors M_µ,ν [9].
4.1 The 'Jacob's ladder' of exchange-correlation functionals [10].
5.1 Convex hulls for the binary systems in Tab. 5.1 as calculated in this work, labelled 'AL-MTP'. The potentially new prototypes are denoted by the blue circles. The respective convex hulls from AFLOW are also provided for comparison [11].
5.2 The phonon dispersion for the predicted novel Cu7Si binary. The phonon spectrum calculated within the DFT framework is compared to that obtained using the AL-MTP potential in LAMMPS-MLIP [11].
5.3 Crystal structure of the novel predicted Cu7Si before (left) and after (middle) relaxation. The Cu and Si atoms are colored in orange and brown, respectively. The right panel depicts the Cu7Si supercell at a temperature T = 300 K [11].
5.4 The convex hulls for the Cr-Ni (left) and the Cu-Ni (right) binaries. The black and green lines correspond to the spin-polarized and spin non-polarized calculations, respectively. The blue circles denote the predicted structures [11].
5.5 The number of predicted quaternary structures with respect to their formation enthalpy ∆H (meV/atom).
5.6 The number of predicted quaternary structures with respect to their hull distance ∆Hd (meV/atom).
5.7 Crystal structure of the novel predicted Cu4NiSi2Cr. The Cu, Ni, Si and Cr atoms are coloured in orange, brown, yellow and dark grey, respectively.
5.8 The RDF for the Cu-Si pair calculated through MD simulations using the AL-MTP (left) and MEAM (right) potentials, respectively. The curves correspond to simulations at different temperatures T, as denoted by the legends [11].
5.9 Density of the Cu7Si structure as a function of the temperature T, for both the AL-MTP and MEAM potentials as indicated by the legend. The horizontal dashed line highlights the value ρ = 8.15 g/cm³ as obtained from DFT in the limit T → 0 K [11].
6.1 A set of concatenated events for the translocation of dAMP through the nanopore with a diameter of 2.8 nm (Exp. B). Each red block on the left represents an event of a nucleotide translocating the nanopore in a certain configuration. On the right, the four features for a single nucleotide translocation event are highlighted [12].
6.2 Detection of outlier events in the data from both experiments, Exp. A and Exp. B. The left panel depicts all data with respect to the two features of the ionic current blockade and the dwell time, while the right panel shows the same feature space after the outliers have been removed based on the cutoff for a dwell time below 10 ms [12].
6.3 The encoding process of mapping the raw ionic current signals, by means of the physical descriptors/features, into grey-level scale images. The encoding is followed by the training procedure and the prediction of the nucleotide identity at the end of the pipeline. The images include 4 pixels corresponding to the four features for each single translocation experiment of a certain nucleotide [12].
6.4 Two-dimensional graphs for the blockade mean and height features for the analyte 'ssDNA' from Tab. 6.1. The top panels represent the clusters for these features and the ionic (left) and electronic (right) measurement channels, as denoted by the legends. The lower panels evaluate both channels together. The black filled circles denote the center of each cluster.
6.5 Confusion matrices for the LSTM, XGBoost, DNN, and CNN models as denoted by the labels. All datasets from both experiments are represented. The 'True label' and 'Predicted label' refer to the true and predicted identity of the nucleotides [12].
6.6 The mean absolute SHAP values for the 80nt ssDNA dataset using the XGBoost classifier. On the left, a comparison of single-channel performance is shown, while on the right, the combination of both channels for molecule classification is presented.
7.1 t-SNE projection for the EC input space with 20 dimensions. The different groups iron-based, low-Tc and high-Tc cuprates, and 'others' are highlighted by the colors orange, green, blue and grey, respectively.
7.2 Comparison between measured and predicted Tc (K) values for the test set. The predictions were generated using MLP Regressor (iii).
7.3 Percentage of SHAP values contribution to the model for the different ECs.
7.4 Comparison between the convex hull obtained from AFLOW (represented by red dots and lines) and the one predicted by AL-MTP (depicted with black dots and lines). The algorithm identified two new stable structures (illustrated by blue dots).
A.1 A sketch depicting the ORF identification process within a sequence of nucleobases (see text for more explanation). The labels in green, red, blue denote the amino acids ('Met' is methionine, 'Cys' is cysteine, etc.) that are made up from the respective codons.
A.2 A sketch of the feature extraction scheme. Nucleobase triplets (the codons), shown on the top, are counted through the counter 'Σ' and normalized over the total length of the sequence to lead to each feature.
A.3 The feature space formed by two feature vectors ACC (Threonine) and GGC (Glycine) for the SARS/MERS virus family. The green, red, and blue symbols correspond to the SARS-CoV-2, SARS, and MERS viruses, respectively.
A.4 The feature space formed by two feature vectors ('0' and '1') from PCA for the corona virus family. The colors correspond to the different viruses as denoted by the legend.
A.5 The feature space formed by two feature vectors ('0' and '1') from PCA for the corona and the herpes virus families. The colors correspond to the different viruses as denoted by the legend.

List of tables

5.1 Number of prototypes for the binary and quaternary systems studied in this work, generated by ENUMLIB for the parent lattices fcc and bcc. 'Other' refers to the number of relevant structures taken from one or more of the ICSD, OQMD and Materials Project databases.
5.2 Number of unique stoichiometries for the quaternary systems studied in this work, generated by ENUMLIB for the parent lattice fcc.
5.3 Ground state total energies (E^total_i in eV/atom) calculated during the DFT volume relaxation for the unit cells of the alloying elements i = {Cu, Ni, Si, Cr} in the crystal symmetries ('symm') and magnetic states ('magn') stated. For the notation, see text and Eq. 5.1.
5.4 Fitting errors (MAE in meV/atom) during the AL-MTP process and the generation of the MTPs for the fcc and bcc sets, respectively. The number of configurations selected for the training set is given.
5.5 Fitting errors (MAE and RMSE in meV/atom) during the AL-MTP process and the generation of the MTPs for the fcc-derivative sets for different levmax: 16, 18 and 20, respectively. The number of configurations selected for the training set is given.
5.6 Formation enthalpy (meV/atom) and source of the new prototypes found through the AL-MTP procedure and post-relaxed with DFT.
5.7 Formation enthalpies (in meV/atom) of stable binary structures included in AFLOW as compared to the DFT post-relaxed and AL-MTP values.
5.8 Lattice vectors (Å), angles (°), volume (Å³), lattice system (LS) and space-group (SG) for the new structures.
61 5.9 Formation enthalpies (in meV/Atom) of metastable quaternary struc- tures. Comparison between the AL-MTP and DFT post-relaxed values. 65 xxvi List of tables 5.10 Lattice vectors (Å), angles (°), volume (Å3), lattice system (LS) and space-group (SG) for the novel Cu4NiSi2Cr. . . . . . . . . . . . . . . . 65 6.1 Overview of the most relevant experimental details and conditions related to the analyzed data. ’Analyte’, ’pore’, ’salt’, Vionic and Velrefer to the translocating molecule, the pore diameter, the salt solution, the voltage difference in the ionic and electronic channel, respectively, and the presence of a differential amplifier. . . . . . . . . . . . . . . . . . . . . . 72 6.2 Dataset sizes from the two experiments (Exp. A and Exp. B with nanopore diameter 3.3 nm and 2.8 nm, respectively). The left columns refer to the initial nucleotide data, while the right two columns (’Training set A’ and ’Training set B’) refer to the nucleotide data after the detection of outliers. The ionic current blockades are given in nA, the dwell times in ms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 6.3 Filter parameters for the CUSUM algorithm, applied for the event detection in both ionic and electronic channels. . . . . . . . . . . . . . 76 6.4 Classification performance of the different ML algorithms based sepa- rately on the data from nucleotide experiments A (top results) and B (intermediate results), as well as from the combination of data from both Exp. A+B (bottom results). The pore diameters are also indicated. It should be noted that error values up to the second digit after the comma have been presented. For Exp. B, and thus also for some of the Exp. A+B results, the error is not 0, but in the order of 0.001. In order to keep it consistent with the other values in the table, this was rounded to zero [12]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
7.1 Comparison of methods for Tc (K) prediction of novel binary phases with potential superconductivity.
A.1 Types of viruses, approximate length of a virus genome sequence, date the data were accessed, number of complete virus genome sequences, and database for all RNA and DNA data used in the analysis.
A.2 Clustering scores obtained with DBSCAN (top) and k-means (bottom) for the set of the corona virus family feature vectors. The bold number in the first column (’clusters’) indicates the expected number of resulting clusters. The bold numbers in the other columns emphasize the best scoring result. The eps value in the last column (top results) denotes the value at which the DBSCAN clustering was performed.

Chapter 1

Introduction

ML techniques and data-driven artificial intelligence (AI) frameworks have seen remarkable progress in recent decades across several application domains with complex, high-dimensional, unstructured data [13]. Despite this success, a significant limitation remains: most ML approaches struggle to extract interpretable information and knowledge from data [14]. Additionally, predictions derived solely from data-driven models may lack physical consistency or even seem implausible [15]. To address these challenges, there is currently extensive research focused on incorporating existing expert knowledge into ML design, so-called knowledge-driven ML (KDML) [16–18]. By carefully integrating prior domain knowledge about a process, such as physical, biological and/or chemical insights, into the learning procedure, not only can the quality of the learned representation be improved, but the learning process can also be expedited with fewer data samples [19, 20]. This approach aims to combine the strengths of data-driven techniques with the valuable insights provided by existing knowledge, ultimately enhancing the interpretability and reliability of ML models.

In this context, the main focus lies within the scientific domain of physics, and the emerging field of physics-driven machine learning (PDML) is explored. PDML represents the convergence of physics and ML; this thesis explores its applications and investigates how physics principles can be integrated into ML models. By incorporating physics-based biases, the goal is to enhance the learning process and improve the performance and interpretability of ML algorithms. These biases can be classified into three types based on where the physics knowledge is embedded: observational, inductive, and learning bias [6]. Observational biases use data that embody the underlying physics or data augmentation techniques to train the machine learning system. Inductive biases incorporate prior assumptions into the model architecture to ensure compliance with physical laws. Learning biases are introduced through the choice of loss functions, constraints, and inference algorithms during training to favor solutions adhering to physics. These biases can be combined to create hybrid approaches for building physics-driven learning machines.

Here, the main focus lies on addressing observational bias by utilizing data transformation and incorporating physical symmetries into invariant physics-inspired descriptors [1]. This approach offers conceptual simplicity in implementation and interpretation compared to equivariant ML architectures or physical loss functions. While it may have limitations when dealing with large volumes of data, invariant descriptors have shown effectiveness in capturing the underlying physical principles of the system, even in high-dimensional feature spaces [6].

In this thesis, the applications of PDML to diverse schemes are divided into two main categories: solid-state physics and biomolecules.
In the field of solid-state physics, the focus is on high-throughput screening in materials discovery, specifically high-entropy alloys (HEAs) such as copper-based alloys. PDML techniques are employed to address the challenges associated with screening a large number of potential candidates. Additionally, the superconducting critical temperature of HEAs, including iron-based, cuprate and perovskite superconductors, is investigated using electronic-derived features and low-dimensional surrogate ML models. In the field of biomolecules, novel physical descriptors are introduced for the classification of DNA translocations during experiments with nanopore detectors. This involves developing PDML approaches that utilize the unique characteristics of DNA molecules as they pass through the nanopore, enabling accurate classification and analysis. In the appendix, a highly efficient PDML classification method tailored for respiratory-related coronaviruses is introduced. This approach leverages open reading frames (ORFs) and the genetic code to craft biologically inspired features from the RNA, enhancing virus classification precision. It exemplifies the pragmatic application of PDML in a distinct virology context. Through these diverse applications, this thesis demonstrates the versatility and effectiveness of PDML in various domains, showcasing its potential to advance materials discovery and biomolecular analysis.

1.1 Goals

In this thesis, the aim is to explore the limits of physics-inspired descriptors in ML applied to several atomistic systems. The goals are:

• Improve high-throughput searching in materials discovery: Develop computational methods and techniques to enhance the efficiency and effectiveness of screening processes.
This includes analyzing large datasets to identify potential candidates with desired properties, accelerating the discovery and development of materials.
• Enhance read-out protocols in nanopore detectors: Refine and optimize the protocols used for reading DNA molecules in nanopore experiments. Improve accuracy, reliability, and speed to make nanopore detection more precise and applicable in scientific and medical fields.
• Expand the comprehension of high-Tc superconductivity: Integrate PDML techniques with ab initio descriptors to predict critical temperatures in novel superconductors. Rely exclusively on features derived from electronic and atomic structures, leading to reduced dimensionality, improved interpretability, and establishing a direct connection between electronic orbitals and heightened prediction accuracy.
• Establish a framework of PDML with invariant descriptors: Create a comprehensive framework that integrates physics principles into ML. Utilize invariant descriptors to capture fundamental symmetries and invariances in physical systems, effectively embedding physics-based knowledge and insights into ML algorithms.

1.2 Thesis Outline

An outline of the thesis is presented as follows:

• In Chapter 2, the state of the art in copper-based alloys, nanopore detectors, high-Tc superconductivity and respiratory-related coronaviruses is explored.
• In Chapter 3, an outline of the ML algorithms applied in physics within this thesis is provided.
• In Chapter 4, a detailed overview of the simulation methods used in this thesis is described.
• In Chapters 5, 6, and 7, the application of PDML to molecules and crystals is presented. Chapter 5 explores the reliability and effectiveness of MLIPs in the search for novel phases within copper-based alloys. In Chapter 6, data acquired from nanopore experiments is utilized to classify DNA molecules and
single nucleotides during nanopore translocations. Various ML models, including clustering and deep learning (DL) methods, are employed for this purpose. In Chapter 7, a highly efficient predictive model designed for estimating the critical temperature of superconductors is introduced.
• In Chapter 8, the contributions of the research and findings are summarized.
• Appendix A presents a supplementary study on fast and efficient approaches for classifying respiratory-related coronaviruses using PDML.

The thesis also includes a References section, where all cited works are listed.

Chapter 2

Theoretical Background

In this chapter, the main concepts and theoretical background of the four chosen fields to which PDML is applied are presented. These fields are copper-based alloys, nanopore detectors, high-Tc superconductors, and respiratory-related coronaviruses.

2.1 Copper-based Alloys

Copper-based alloys are of great interest for electric and electronic applications such as connectors and lead frames due to their excellent electrical conductivity and strength [21]. The next generation of integrated circuits requires high-performance copper alloys with high alloy density, multi-functionality, miniaturization, and low cost [22]. These alloys should possess both high electrical conductivity and strength. However, the presence of impurities in the matrix reduces electrical conductivity [23–25]. Thermal aging can lead to precipitation processes, reducing the number of dissolved impurities [26]. Particularly promising alloys include copper (Cu) - nickel (Ni) - silicon (Si) - chromium (Cr) (Cu-Ni-Si-Cr) complexes, which act as effective barriers to dislocation motion, thereby enhancing the alloy’s strength [27–29]. Binary systems like Cu-Si, Ni-Si, Cr-Si, Cu-Ni, and Cu-Cr also hinder dislocation movement due to the formation of clusters and intermetallic phases within the Cu matrix [30–34].
Understanding and improving the mechanical and electrical properties of Cu-Ni-Si-Cr alloys during aging rely on the observation and discovery of intermetallic phases. While several stable phases have been experimentally observed, expanding the range of stable structures is important for simulations. Important phases that have been experimentally observed in the Cu-Ni-Si-Cr system and influence its strengthening include Cu3Si, Cu33Si7, Ni3Si, Ni2Si, Ni3Si2, Cr-rich clusters, Cr3Si, Cr5Si3 and fcc (Ni, Cr, Si)-rich phase [35–38].

Among these, copper-silicon based alloys are extensively researched for their wide range of applications in high-temperature conditions and microelectronics [39]. They are also used in the synthesis of ultrapure silicon, utilizing the binary precipitate Cu3Si and its phases µ, µ’, and µ” [40, 41]. Cu-rich silicide phases play a significant role in various technical applications, including Li-ion batteries [42]. Studies on the thermodynamics and kinetics of phase formation in the Cu-rich region have utilized differential scanning calorimetry and in situ high-resolution transmission electron microscopy. These studies have identified binary precipitates such as Cu3Si, Cu15Si4, Cu33Si7, Cu7Si, and Cu5Si in copper silicides [43–45, 40, 46]. The phase Cu15Si4 and other compounds like Cu19Si6, Cu56Si11, and Cu33Si7 are considered stoichiometric due to their narrow ranges of homogeneity [47].

Nickel silicides are commonly used as contacts in electronic devices [48]. First principles calculations on nickel-silicon systems have identified stable phases such as Ni3Si, Ni31Si12, Ni2Si, Ni3Si2, NiSi, NiSi2, and NiSi3, which are also experimentally observed [49–51]. Despite nickel’s ferromagnetic nature, no magnetic order has been found in these intermetallic stable phases [52].

Chromium silicides, such as Cr3Si, Cr5Si3, CrSi, and CrSi2, are important transition metal silicides within the studied system.
These compounds are of particular interest for their potential use in high-temperature structures [53–57]. Cr-Ni alloys, which are nickel-based alloys, are being considered for potential applications in high-level nuclear waste containers [58–61]. The precipitation of CrNi2 structures requires an exceptionally slow cooling process.

Cupronickel alloys, specifically Cu-Ni binary complexes in copper matrices, exhibit promising properties for marine applications and in corrosive environments [62]. They can take the form of nanoparticles [63–65] or clusters [66]. While experimental results have not shown evidence of intermetallic phases in the Cu-Ni system based on the enthalpy of formation [67], cluster expansion methods have identified two prototypes, Cu7Ni and Cu8Ni, at low nickel concentrations [68].

The Cu-Cr system exhibits a simple eutectic phase diagram, with Cr-rich clusters present in the microstructure [69, 70]. These clusters are found in the Cu matrix of Cu-Ni-Si-Cr alloys and contribute to their strengthening. No intermetallic phases have been reported in this system based on both first principles calculations and experiments [71–73].

More complex compounds found in the ternary Cu-Ni-Si [74, 75] or Ni-Si-Cr [76, 77] complexes play an important role in the design of copper-based alloys for simulations. Finally, the quaternary Cu-Ni-Si-Cr system has been studied, where fcc-rich regions in a Cu matrix [38] have been found along with the well-known Cr-Si and Ni-Si binary precipitates [78].

The complexity of copper alloys, impurities, and their structures presents ample opportunities for further investigation and the discovery of new materials. In silico prediction of novel structures is now possible through advances in computational materials research.
Material databases such as AFLOWLIB [79], Materials Project [80], ICSD [81] and OQMD [82] employ density functional theory (DFT) and high-throughput techniques [83] like cluster expansion (CE) [84] and chemical similarity [85] to calculate material properties of intermetallic systems. While QM algorithms show promise, their computational demands hinder their application in high-throughput material searches. CE has been successful in predicting ground-state energies of metallic alloys [86], but its on-lattice nature limits its generalization. Empirical interatomic potentials rely on system-specific parameterization and often lack transferability. The combination of ML approaches, invariant atomistic descriptors, and data from QM simulations has led to the development of MLIPs [87]. They overcome the limitations of classical methods by utilizing complex functional forms that map the potential energy surface (PES) of a system. MLIPs are crucial for accelerating high-throughput searches for novel materials, as they can outperform conventional simulation approaches in terms of prediction accuracy and computational efficiency. This is particularly important due to the lack of suitable potentials with first-principles accuracy [88–90].

Towards the high-throughput search of materials, different descriptor-based ML methods along with MLIPs have been developed [91, 92]. These models replace ab initio calculations by mapping the crystal structure, atomic positions, forces, and stresses to the total energy of the system. Some of the novel descriptors, in combination with regression methods, are the many-body tensor representation [93] with kernel ridge regression, the smooth overlap of atomic positions with Gaussian process regression [94] and the moment tensor potentials (MTPs) [95]. The latter are developed based on a polynomial regression.
In the field of high-throughput screening, AL approaches [9] in combination with descriptors such as the MTPs have been implemented in order to accelerate the relaxation process of the typically very large pool of candidate structures.

In Chapter 5 of this thesis, the focus is on improving the understanding of binary complexes in copper alloys by examining novel stable phases. Additionally, the metastability of quaternary phases is investigated to gain insights into the fcc (Ni, Cr, Si)-rich phase identified by experimentalists. To achieve these goals, a comprehensive framework that integrates QM simulations, AL, and materials data libraries is deployed. The objective is twofold: first, to predict and expand the repertoire of stable phases relevant to Cu-Ni-Si-Cr alloys, and second, to generate MLIPs suitable for large-scale atomistic simulations. By employing this framework, the properties of these novel prototypes, including mechanical and electronic characteristics, can be further explored. The methodology involves conducting QM simulations for a diverse range of binary and quaternary structures. To identify the most energetically favorable candidates, an on-the-fly relaxation AL scheme based on MTPs is employed. This AL scheme iteratively relaxes potential candidates towards their lowest enthalpy of formation. To assess the stability of the resulting structures, their convex hull is analyzed, and the phonon dispersion is examined. Based on these analyses, potentially stable candidates for copper alloys are proposed. This approach enables the uncovering of promising structures and the advancement of the understanding of copper alloy systems.

2.2 Nanopore Detectors

Nanopores are nanometer-sized holes in materials through which biomolecules, such as DNA, RNA, or proteins, can be electrophoretically driven [96–98]. The passage of molecules through a pore results in ionic current and/or electronic signals.
These current drops, or ionic current blockades, can be used for the detection of the molecules based on reading out their identity and sequence or on discriminating among different homopolymers [98, 99]. The duration of each current blockade through a nanopore links to the translocation or dwell time of a biomolecule passing through the nanopore [100]. Typically, the current signals from nanopore experiments need to be post-processed in order to be analyzed and provide information on either the molecule type, length or identity. Post-processing the nanopore data typically involves the use of ML algorithms. These attempt to translate the experimental data into base calls [101] through a proper feature extraction and classification [102–105]. To this end, different ML schemes have been used, ranging from Hidden-Markov-based algorithms [106–110] to neural networks (NNs) [111–114]. Such ML techniques can play a vital role in processing and recovering the information in the nanopore with robust statistics by creating automated models. These have shown the potential to improve the detection accuracy of nanopores and automate the read-out of the DNA nucleobases towards ultra-fast DNA sequencing. To date, algorithms have been trained in order to process real-time long-read sequencing data from the MinION nanopore device by Oxford Nanopore Technologies [115, 116, 113, 117]. In this way, a nanopore device can efficiently identify, for example, the position and structure of a bacterial antibiotic resistance island [118]. In order to improve error rates in the nanopore data, DL methods such as recurrent neural networks (RNNs) have been implemented [113]. A very important aspect in the analysis of nanopore data is the classification based on knowledge-based descriptors [119]. Typically, the dwell time or the mean current blockade pointing to the most probable DNA translocation paths are chosen [120, 121].
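As a minimal sketch of how the two descriptors named above could be computed, the following Python snippet extracts the dwell time and the mean current blockade of each event from an ionic current trace. It assumes a known, constant open-pore baseline and a simple fixed threshold for event detection, which is a simplification and not the event-detection procedure used in this thesis; the function name `event_features` and all parameter names are hypothetical.

```python
import numpy as np

def event_features(current_nA, sampling_rate_hz, baseline_nA, threshold_nA):
    """Return a list of (dwell_time_ms, mean_blockade_nA), one per event.

    An event is a contiguous run of samples lying more than `threshold_nA`
    below the (assumed constant) open-pore baseline.
    """
    current = np.asarray(current_nA, dtype=float)
    below = current < (baseline_nA - threshold_nA)
    # Pad with False so that every event has a rising and a falling edge.
    padded = np.concatenate(([False], below, [False]))
    edges = np.diff(padded.astype(int))
    starts = np.flatnonzero(edges == 1)   # first sample of each event
    ends = np.flatnonzero(edges == -1)    # one past the last sample
    features = []
    for s, e in zip(starts, ends):
        dwell_ms = (e - s) / sampling_rate_hz * 1e3
        mean_blockade = baseline_nA - current[s:e].mean()
        features.append((dwell_ms, mean_blockade))
    return features

# Synthetic trace (illustrative values only): two blockades on a 5 nA baseline.
trace = np.array([5, 5, 5, 3, 3, 3, 3, 5, 5, 4, 4, 5], dtype=float)
features = event_features(trace, sampling_rate_hz=1000.0,
                          baseline_nA=5.0, threshold_nA=0.5)
for dwell_ms, blockade in features:
    print(f"dwell time = {dwell_ms:.1f} ms, mean blockade = {blockade:.1f} nA")
```

Each event is thereby reduced to a point in a two-dimensional feature space, which can then be passed to a clustering or classification algorithm.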
Nevertheless, it has recently been shown that the feature ’dwell time’ is quite inefficient in clustering the different types of molecular events through a nanopore [122]. On a time scale shorter than the dwell time, important information on molecular aspects cannot be accessed. Instead, the appropriateness and efficiency of another ionic blockade feature, the height, have been demonstrated [122]. Based on an unsupervised clustering of the nanopore data, this feature has been shown to clearly identify specific types of molecular events through the nanopore. Apart from the unsupervised clustering approach utilized in the study [122] and de novo clustering [123], most ML algorithms are primarily based on supervised learning. The latter is focused on either improving the algorithmic scaling when processing the data or guiding the learning process to optimize the feature space and reduce the error rates.

Chapter 6 introduces a read-out protocol based on PDML, connecting unsupervised and supervised learning techniques through physics-inspired features. This protocol is designed to address the challenges posed by novel nanopore sequencers, aiming to achieve error-free identification and detection of molecules passing through the nanopore. The feasibility of training ML algorithms using experimental nanopore data within a low-dimensional feature space is explored to enhance interpretability. Furthermore, concurrent ionic and transverse currents from nanopore experiments are jointly processed and analyzed in order to understand the importance of these measurements and how their inherent details map onto the choice of features. The primary objective is to identify the key factors that contribute to an improved detection process by minimizing training errors.
To accomplish this, an efficient encoding technique for the input data is proposed, emphasizing the significance of data transformation during the pre-processing stage. Implementing these strategies can significantly enhance the data pipeline, accelerate DNA classification, and reduce read-out errors. By employing the PDML read-out protocol, the aim is to enhance the interpretability of ML algorithms trained on experimental nanopore data. This approach holds great potential for advancing the field of nanopore sensing, enabling more accurate and reliable identification and sequencing of molecules translocating through the nanopore.

2.3 High-Tc Superconductivity

The phenomenon of superconductivity in many materials still lacks comprehensive theoretical support. Researchers have historically relied on empirical guidelines, such as Matthias Rules [124, 125], derived from experimental observations due to the absence of theory-based predictive models. These guidelines have aided the synthesis of superconducting materials. However, a major challenge in the field is the discovery of new candidates that exhibit superconductivity at different temperatures, including those with high critical temperatures (referred to as high-Tc superconductors or HTS) [126]. The fundamental mechanisms underlying superconductivity and the relationships between chemical/structural properties and Tc in these materials are not yet well understood [127]. This knowledge gap offers an opportunity for the exploration of novel theories and methods, including computational and data analysis techniques.

Recent advancements in ML have facilitated the investigation of the pairing mechanisms responsible for high-Tc superconductivity and the prediction of critical temperature values [128].
Moreover, the availability of extensive materials databases, encompassing both experimental and calculated properties, has enabled the development of advanced data-driven ML approaches to discover potential high-Tc candidates [129]. However, due to the complex nature of superconductivity, interpreting these models remains challenging [130]. Descriptor-based ML approaches, leveraging chemical and structural features, have emerged as a promising avenue [131]. These methods not only predict the critical temperature Tc and identify potential novel superconducting structures but also enable the study of the importance of different descriptors [131]. This insight into the physical concepts underlying superconductivity can inform the development of novel theories. Unsupervised learning techniques, including conventional and neural network-based clustering methods like k-means [132], DBSCAN [133], and Self-Organizing Map (SOM) [134], have been employed to group superconductors into meaningful classes and discover new materials categories [135]. Additionally, Convolutional Neural Networks (CNNs) have been used for feature learning to effectively distinguish between cuprate and iron-based superconductors [129].

Material features, such as elemental property statistics generated through the Materials Agnostic Platform for Informatics and Exploration (Magpie) [136, 137], have shown promise when combined with tree-based ML algorithms [136]. Learned predictors using these features offer potential insights into the mechanisms governing superconducting effects. Other approaches based on features derived from existing databases [138, 128, 139] and/or chemical formulas [140, 141] have revealed that certain predictors are not directly influential in superconductivity and do not significantly improve the models [130].
The integration of Magpie features and structural information through Smooth Overlap of Atomic Positions (SOAP) descriptors [142, 143] has shown improvements over other methods [144]. Despite notable progress in applying ML techniques to superconductivity, many models still lack sufficient prediction accuracy or rely on high-dimensional feature spaces, which increases learning complexity.

Chapter 7 of the thesis introduces a PDML model that leverages a minimal set of electron-specific features to accurately predict the critical temperature of superconductors. The goal of this model is to achieve high fidelity in predicting superconducting behavior. The first aim is to assess the feature importance of electron-specific descriptors by leveraging unsupervised learning techniques, particularly dimensionality reduction through projection. The objective is to gain insights into the data’s clustering potential within an embedding space. Additionally, SHAP (SHapley Additive exPlanations) analysis is employed to further elucidate the contribution of individual features to the clustering outcomes. To validate the effectiveness of the novel method, experiments are conducted using a list of potentially new HEA superconductors. The approach is compared against other relevant methods described in the literature, providing insights into the performance of the PDML model in predicting the critical temperature of superconducting materials and showcasing the potential of PDML for predicting superconductivity in diverse materials.

2.4 Respiratory-Related Coronaviruses

The coronavirus SARS-CoV-2 has been spreading globally, and efforts are being made to isolate and control its spread [145–148]. Over 22,000 genome sequences have been collected since its identification [149].
Research studies are focused on developing drugs or vaccines using these sequences [150, 151]. Accurate identification and categorization of the virus are crucial for reducing the disease’s spread [152]. Algorithmic approaches, such as the UMAP algorithm, have shown promising results in identifying SARS-CoV-2 viruses in genome datasets [153–155]. UMAP is widely used in bioinformatics and clustering visualization. Existing methods for virus identification, such as the one referenced in [156], are computationally complex and unsuitable for smaller computer architectures like microcontroller chips. To enable widespread and easier virus identification, as well as facilitate fast initial identification while reducing complexity, straightforward and efficient approaches are required. One potential solution is to employ a theory-based approach that leverages the biological information embedded within viruses.

Viral proteins are encoded by virus sequences using codons, which are translated into amino acids. These amino acids form proteins. The protein-coding segment is called an ORF [157, 158]. ORFs can reveal overlapping and hidden genes in viruses, including SARS-CoV-2 [159]. An ORF starts with a start codon, contains the protein sequence, and ends with a stop codon. Variations in ORF regions differentiate virus types within a family. The number of substrings, or k-mers, of length k in a sequence is similar among viruses within the same family [160]. Techniques focused on detecting RNA-genome substrings often employ larger k-mer sizes. In addition, some techniques utilize natural vectors to create a vast and detailed space, where each biological molecule is uniquely represented [161]. SARS-CoV-2 and other SARS-type viruses have a large ORF called ORF1ab, spanning about 13,000 nucleobases [162, 163]. ORF1ab contains essential structural proteins for virus replication [164].
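To make the ORF and k-mer notions above concrete, the following Python sketch scans the three forward reading frames of an RNA string for ORFs (from the standard start codon AUG to one of the standard stop codons UAA, UAG or UGA) and counts the codons (3-mers) inside an ORF. It is an illustrative sketch under simplifying assumptions (forward strand only, no nested or overlapping ORFs within a frame) and not the feature pipeline of Appendix A; the helper names `find_orfs` and `codon_counts` and the toy sequence are hypothetical.

```python
from collections import Counter

START = "AUG"
STOPS = {"UAA", "UAG", "UGA"}

def find_orfs(rna, min_codons=2):
    """Return ORFs (start codon through stop codon) in the 3 forward frames."""
    rna = rna.upper().replace("T", "U")  # accept DNA-style input as well
    orfs = []
    for frame in range(3):
        codons = [rna[i:i + 3] for i in range(frame, len(rna) - 2, 3)]
        start = None
        for idx, codon in enumerate(codons):
            if start is None and codon == START:
                start = idx                      # open a new ORF
            elif start is not None and codon in STOPS:
                if idx - start + 1 >= min_codons:
                    orfs.append("".join(codons[start:idx + 1]))
                start = None                     # close the ORF
    return orfs

def codon_counts(orf):
    """Count the non-overlapping 3-mers (codons) of an in-frame ORF."""
    return Counter(orf[i:i + 3] for i in range(0, len(orf) - 2, 3))

# Toy sequence (not a real genome fragment).
sequence = "CCAUGGCUGCUUAAGG"
orfs = find_orfs(sequence)
counts = codon_counts(orfs[0])
print(orfs)
print(dict(counts))
```

The resulting codon counts form a fixed-length vector (at most 64 entries), which is the kind of low-dimensional, biologically interpretable representation that a clustering method can operate on.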

In Appendix A, the application of PDML to the prediction of COVID-19 from its RNA is discussed. An efficient approach using genetic code rules and ORFs to encode the entire SARS virus sequence into biological features is proposed. The method offers greater interpretability of variations in RNA codon frequencies, also known as codon bias [165]. To achieve this, the genetic code rules (3-mers) are utilized to construct biology-based features, which is a natural choice. MERS-CoV, SARS-CoV, SARS-CoV-2, and other related viruses are analyzed to identify distinct clusters. By collecting coronavirus family data, extracting features from ORFs and codon counts, and visualizing low-dimensional latent spaces, the goal is to achieve accurate clustering. In order to demonstrate the effectiveness and validate the proposed approach, the complexity and diversity of the analyzed viral RNA data are enriched. Initially, the focus is on SARS-CoV-2, SARS-CoV, and MERS-CoV. Subsequently, the analysis is expanded to include additional members of the coronavirus family. Finally, members from other virus families, such as the herpes DNA virus family, are incorporated. This progressive inclusion of diverse viral data allows for the strengthening and evaluation of the robustness of the approach.

Chapter 3

Machine Learning

In this chapter, an introductory overview is provided for both shallow and PDML methodologies. The focus is on elucidating the key techniques pivotal to this thesis, which include clustering analysis, tree-based methods, DL, feature importance, PDML and MLIPs.

3.1 Shallow Learning

ML involves computers using algorithms to optimize specific performance measures, like character recognition, based on example data or past experiences. It has evolved as a distinct field within computer science since the 1980s, with applications in engineering, speech and image analysis, pattern recognition, and communications [166].
Learning algorithms enhance the efficiency of target algorithms, serving as alternatives to conventional data extraction from simulations. ML’s ultimate goal is to enable computers to solve problems without explicit programming, achieved through learning rules that save computational time and improve accuracy beyond human capability. Technologies such as image and voice recognition, personalized marketing, and data analytics work in conjunction with ML algorithms to acquire knowledge and glean valuable insights [167].

Shallow learning algorithms follow a pipeline: acquire, preprocess, and transform data, select relevant features to create a feature space, form a training set, train the algorithm to identify distinct zones, optimize by finding similarities, and create a decision rule. ML algorithms can be categorized into five main classes based on whether prior knowledge is required or the goal is to discover new patterns [168]:

• Supervised learning (SL): In SL [169], samples are assigned classes (categorical or numerical) based on known labels. These algorithms use labeled data to automatically classify new samples. They learn from input datasets to generate outputs, with the classification task aiming to predict labels for new inputs [170]. This is especially valuable in computational biology for predicting mechanisms with uncertain definitions. The general mapping for supervised learning is:

\[ Y = f(X, \theta) \tag{3.1} \]

where $Y$ is the output or target variable, $X$ denotes the input features, $f$ is the model function capturing the relationship between inputs and outputs, and $\theta$ represents the parameters (weights and biases) of the model. The objective is to find the values of $\theta$ that minimize a cost function. The least squares error (LSE) cost function is the most common in supervised learning. Fitting the model in Eq. 3.1 then requires solving:

\[ \theta = \arg\min_{\theta} \frac{1}{2m} \sum_{i=1}^{m} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right)^{2} \tag{3.2} \]

where $m$ is the number of training examples, $h_{\theta}$ is the predicted output, often computed using an activation function, and $y^{(i)}$ is the actual output. For a general activation function denoted as $\sigma$, the predicted output $h_{\theta}(X)$ is calculated as $h_{\theta}(X) = \sigma(z)$, where $z$ is the linear combination of the input features $X = \{x_1, x_2, \ldots, x_n\}$ and their corresponding weights $\theta$, in addition to a bias term $\theta_0$:

\[ z = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_n x_n \tag{3.3} \]

The choice of the activation function ($\sigma$) depends on the specific requirements of the model. For example, in the case of logistic regression, the sigmoid function is commonly used as the activation function:

\[ \sigma(z) = \frac{1}{1 + e^{-z}} \tag{3.4} \]

The interpretation of $\sigma(z)$ depends on the application, often representing the probability of an input $x$ belonging to a class in classification tasks. The decision boundary is determined by comparing this probability to a threshold. Different models use varied activation functions, impacting the expressiveness and characteristics of the model.
• Semi-Supervised learning (SSL): The goal of these algorithms is to predict unknown labels from a dataset created for classification purposes. A trained supervised algorithm is utilized to classify unlabeled data [171]. The most confident unlabeled samples and their predicted labels are incorporated into the training set.
• Unsupervised Learning (UL): Unlike SL algorithms, UL is employed when the labels are unknown. The training set consists of unlabelled samples, which means there are no predefined classes for dividing the feature space [167]. The purpose of UL is to observe the underlying mechanics of the system and uncover insights by identifying groups of samples with similar features. Some examples of UL algorithms include clustering methods, anomaly detection algorithms, or unsupervised versions of NNs [172].
As an example of UL, consider the k-means clustering algorithm: given a dataset $X$ of $n$ samples $\{x_1, x_2, \ldots, x_n\}$, each a feature vector in $\mathbb{R}^d$, and $k$ clusters, the goal is to find cluster centers $\mu_1, \mu_2, \ldots, \mu_k$. The optimization objective is to minimize the sum of squared distances:

\[ \arg\min_{C, \mu} \sum_{i=1}^{n} \lVert x_i - \mu_{c_i} \rVert^2 \tag{3.5} \]

Here, $C = \{c_1, c_2, \ldots, c_n\}$ represents cluster assignments, and $\mu = \{\mu_1, \mu_2, \ldots, \mu_k\}$ are cluster centroids. In Fig. 3.1, a comparison between the two main ML methodologies is depicted.

• Reinforcement Learning (RL): RL involves an agent learning an optimal policy through trial and error in interaction with its environment. It is used in various fields and aims to maximize future rewards by selecting actions based on current environmental states [167]. RL sits between supervised and unsupervised learning, focusing on actions that contribute to a cumulative increase in reinforcement signal values for long-term performance [173].

• Active learning (AL): This specialized area within ML is employed when obtaining labels for a supervised task, such as regression or classification, is costly or resource-intensive. It aims to optimize the training set by actively selecting the most informative training samples that contribute to highly accurate predictions, thereby minimizing the loss function of the model [174].

Fig. 3.1 Instances of SL classification (left) and UL clustering (right). In SL, the model is trained on labeled data (blue and red), allowing it to learn patterns and relationships. On the other hand, UL involves the model adapting to unlabeled data (grey), autonomously identifying structures and patterns without predefined categorization [2].

3.1.1 Clustering Analysis

Clustering analysis categorizes dataset samples into subsets based on similarities [175]. Each subset comprises similar yet distinct samples, positioned according to distances to cluster centroids.
Centroids represent central points, with larger distances indicating lower similarity.

Fig. 3.2 k-means vs. DBSCAN example. DBSCAN proves adept with irregular and diverse datasets, while k-means efficiently partitions data into k clusters based on mean distances to centroids [3].

Clustering methods can be split into four main categories [176]:

• Hierarchical methods: These approaches form a tree-like structure by recursively dividing the dataset. Two types exist: agglomerative, which starts with individual samples and merges clusters iteratively, and divisive, which does the opposite [177].

• Density-based methods: Unlike distance-based clustering methods, these techniques assign samples to clusters based on the concept of density rather than distance. One of the best-known density-based methods is DBSCAN [133]. By utilizing density, they can identify clusters with arbitrary shapes, avoiding the assumption of spherical clusters in the feature space. Density-based methods are also effective in detecting outliers or anomalies within the dataset [178].

• Grid-based methods: These algorithms are an improvement over density-based methods and are particularly suitable for datasets with higher dimensions. They quantize the feature space into a finite number of cells using a grid-like data structure. By operating within this grid structure, the clustering algorithm identifies clusters in an efficient manner [179].

• Partitioning methods: Commonly used in clustering, these methods divide a dataset into a specified number, k, of clusters using a distance metric. This results in spherical clusters, as seen in k-means clustering, a popular example. The k-means algorithm, a heuristic for an NP-hard optimization problem, partitions the dataset into k clusters, with the mean of the observations in each cluster as the centroid [180]. It requires the desired k value as input and iteratively produces a cluster representation as output.
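The iterative procedure behind k-means can be made concrete with a small sketch. The code below is a minimal, standard-library illustration of the usual two-step scheme (assign each sample to its nearest centroid, then recompute each centroid as the mean of its members) together with the objective of Eq. 3.5. The function names `kmeans`, `squared_distance` and `objective`, the random initialisation and the stopping rule are assumptions made for illustration; this is not the implementation used in this thesis.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, n_iter=100, seed=0):
    """Two-step k-means heuristic; returns (assignments, centroids)."""
    rng = random.Random(seed)
    # Assumption: initialise the centroids with k samples drawn at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(n_iter):
        # Assignment step: each sample goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing, so the centroids are stable
        assignments = new_assignments
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, c in zip(points, assignments) if c == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


def objective(points, assignments, centroids):
    """Sum of squared distances to the assigned centroids (Eq. 3.5)."""
    return sum(squared_distance(p, centroids[c]) for p, c in zip(points, assignments))
```

Calling `kmeans(points, k=2)` on a list of coordinate tuples returns one cluster index per sample and the final centroids. Because the result depends on the initial centroids, practical implementations typically restart from several initialisations and keep the solution with the lowest objective.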
Fig. 3.2 depicts a comparison between the two clustering methods used in this thesis. After clustering analysis, it is crucial to evaluate the results through clustering validation. This is necessary because clustering algorithms yield results even when the dataset does not naturally form distinct clusters, so evaluation is essential to measure the method's effectiveness. Cluster validation can be internal, assessing clustering solution stability, or external, comparing results with other datasets and methods [181]. Key steps in evaluating clustering performance include:

• Determining the clustering tendency of the dataset to identify non-random structure.

• Identifying the appropriate number of clusters.

• Assessing how well the cluster analysis results fit the data independently.

• Comparing results with externally known information, like class labels.

• Comparing two sets of clusters to determine better quality or agreement with known information [182].

3.1.2 Tree-based Methods

Tree-based methods, detailed by [183], are prominent nonparametric models that employ decision trees to iteratively divide a training dataset into smaller, more homogeneous subsets, effectively handling both classification and regression tasks [184]. Each node in the tree is associated with a decision rule, guiding the distribution of the data inherited from its parent among its children, and every leaf node, also referred to as a sub-group, is linked to at least one data point from the original training set. The most common criteria for node splitting are:

• Gini Impurity: Minimize Gini impurity to achieve maximum homogeneity in subsets.

\[ I_G(X) = 1 - \sum_{i=1}^{c} p_i^2 \tag{3.6} \]

• Entropy: Maximize information gain to decrease entropy and enhance homogeneity.

\[ I_E(X) = -\sum_{i=1}^{c} p_i \log_2(p_i) \tag{3.7} \]

where $X$ is the dataset or a specific node, $c$ is the number of classes, and $p_i$ is the proportion of instances of class $i$ in the dataset or node $X$.

In Fig. 3.3, an example of a decision tree is depicted.

Fig. 3.3 This example employs two features (root and internal nodes) to classify data into three sub-groups (leaves); intermediate splits that do not immediately form a leaf are internal nodes. The lines connecting nodes and leaves are branches. Input variables (features) are utilized to classify data into sub-groups based on binary conditions. The sample mean of the output $y_t$ is computed from the training sample for each sub-group, serving as a constant prediction for future observations classified into that sub-group [4].

The construction of the tree entails the iterative segmentation of variables, where branches undergo evaluation for accuracy, efficiency, and effectiveness. The subset of variables chosen to split an internal node relies on predetermined criteria formulated as an optimization problem. To enhance the efficiency of the tree, the strategy involves reordering variable splits, giving priority to essential variables at the top, and eliminating irrelevant features for a successful model.

Tree-based methods stand out as some of the most robust ML algorithms, adept at accommodating complex datasets and ranking among the most powerful algorithms in use today [185]. They demand minimal preparation time, dispensing with the need for feature scaling or centering. These methods not only yield excellent predictions but also allow the calculations behind these predictions to be scrutinized. However, articulating the reasons behind predictions in simple terms can be challenging. Despite their tendency to overfit, they often outperform DL on tabular data, as highlighted by [186].
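The two splitting criteria of Eqs. 3.6 and 3.7 can be evaluated directly from the class labels reaching a node. The following minimal sketch assumes labels are given as a plain Python list; the helper names `class_proportions`, `gini_impurity`, `entropy` and `information_gain` are illustrative and not taken from any particular library.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Proportion p_i of each class present among the labels of a node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def gini_impurity(labels):
    """Gini impurity, Eq. 3.6: one minus the sum of squared proportions."""
    return 1.0 - sum(p ** 2 for p in class_proportions(labels))


def entropy(labels):
    """Entropy in bits, Eq. 3.7; only classes that occur contribute."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


def information_gain(parent, left, right, impurity=entropy):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted
```

A candidate split is scored by calling `information_gain` with the labels before and after the split; passing `impurity=gini_impurity` scores the same split by the decrease in Gini impurity instead.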
Random Forest Algorithms

Random forest (RF) methods evolved from empirical successes rather than from a sound theory, with various parts of the algorithm remaining heuristic rather than theoretically motivated [187]. These models combine tree predictors, with each tree drawing values from a random vector independently sampled with the same distribution across all trees [188]. Each decision tree is trained independently of the others and on distinct subsets of the training data. The final decision is reached by taking the most frequently predicted outcome among the trees. As the number of trees in the forest grows, the generalization error converges to a limit. The generalization error of a forest of tree classifiers hinges on the strength of individual trees and the degree of correlation among them.

Extreme Gradient Tree Boosting: XGBoost

XGBoost (XGB), an ML technique detailed in [189], utilizes an optimized ensemble model of classification and regression trees. It employs gradient boosting to create a decision tree ensemble for making predictions. Gradient boosting, an ensemble learning method for ML classification and regression problems, combines multiple decision trees to construct a robust model for accurate predictions [190]. The algorithm builds trees sequentially, with each tree aiming to rectify the errors of its predecessor. Noteworthy for its outstanding performance on various standard classification benchmarks, XGB distinguishes itself by running significantly faster than many other popular approaches, as emphasized in [191].

3.2 Deep Learning

The DL paradigm is a sub-field of ML inspired by the structural and functional characteristics of neurons in the human brain [167, 192, 193]. Artificial Neural Networks (ANNs) are utilized to address non-linear regression and classification problems where linear activation functions are insufficient.
As problems become more complex, the number of features increases exponentially, necessitating the consideration of linear combinations among them. Traditional linear ML algorithms struggle with the computational cost of such problems. However, the development of more complex architectures has provided a solution. The Single Layer Perceptron, introduced by Rosenblatt, was the first non-linear model [194]. This architecture consists of a unidirectional network formed by one input layer and one output layer. With the sign function (or the sigmoid function) as activation function, it is the simplest NN.

Fig. 3.4 Computational graph of a single perceptron with the input and output layers, as well as the nodes and bias vectors of the layers [5].

3.2.1 Multi-Layer Perceptron

Feedforward NNs, commonly known as Multilayer Perceptrons (MLPs), are composed of multiple layers of single perceptrons. These networks consist of an input layer, one or more hidden layers, and an output layer. Each layer contains multiple computational units. Fig. 3.4 illustrates an example of the simplest deep neural network (DNN) or single perceptron, with only one hidden layer. The following terms can be identified:

• Input Nodes ($x_i$): Representing input features, each of the $n$ features corresponds to an input node.

• Weights ($W_{ij}$): Parameters learned during training, $W_{ij}$ denotes the weight connecting input node $i$ to hidden node $j$.

• Bias Neuron ($b_j$): An additional parameter for each hidden node, enabling the network to shift activation output.

• Activation Function ($\sigma$): Applied to the weighted sum of input nodes plus the bias term, common functions like sigmoid, tanh, and ReLU result in node activation $a_j$:

\[ a_j = \sigma\left( \sum_{i=1}^{n} W_{ij} \cdot x_i + b_j \right) \tag{3.8} \]

• Output Node ($y$): Producing the final result, the output node(s) apply an activation function to the weighted sum from the last (hidden) layer.
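The terms listed above combine into a forward pass in which Eq. 3.8 is applied layer by layer. The sketch below is a minimal standard-library illustration; the function names `dense_layer` and `mlp_forward`, and the convention of describing each layer as a tuple of weights, biases and activation, are assumptions made for this example only.

```python
import math


def sigmoid(z):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    """Rectified linear unit activation."""
    return max(0.0, z)


def dense_layer(x, W, b, activation):
    """Eq. 3.8 for one layer: a_j = activation(sum_i W[i][j] * x[i] + b[j]).

    W[i][j] is the weight connecting input node i to node j of this layer.
    """
    return [
        activation(sum(W[i][j] * x[i] for i in range(len(x))) + b[j])
        for j in range(len(b))
    ]


def mlp_forward(x, layers):
    """Pass x through a list of (W, b, activation) layers, hidden layers first."""
    a = x
    for W, b, activation in layers:
        a = dense_layer(a, W, b, activation)
    return a
```

For instance, a network with two inputs, two ReLU hidden units and one sigmoid output is described by two tuples and evaluated with `mlp_forward([1.0, 2.0], layers)`. Training, that is, learning `W` and `b`, is deliberately left out of this sketch.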
With an adequate number of units in a single hidden layer and the appropriate activation function, the network can approximate continuous functions arbitrarily closely. However, empirical evidence suggests that deep networks with multiple hidden layers exhibit improved performance and lower generalization error compared to shallow networks with a single hidden layer. With increased computational power, training larger networks in less time becomes feasible. This allows for efficient testing of different structures and hyperparameters. Additionally, larger datasets contribute to better generalization of the network's learned patterns.

3.3 Feature Importance: SHAP Values

SHAP (SHapley Additive exPlanations) is a method for establishing a hierarchy of feature importance. Initially introduced by Scott M. Lundberg and Su-In Lee [195], it has become a widely employed tool among data scientists to explain individual predictions and provide interpretability to model descriptors. The core concept in SHAP analysis involves approximating a given model in an additive way to establish a hierarchy of feature importance. To achieve this, a model is approximated by a function of the type:

\[ g(z') = \phi_0 + \sum_{i=1}^{M} \phi_i z'_i, \tag{3.9} \]

where $z' \in \{0, 1\}^M$ represents a simplified input, and $M$ is the number of simplified input features. The SHAP values, denoted as $\phi_i \in \mathbb{R}$, are utilized to assess the model output. A mapping function $h_x(x') = x$ relates the simplified inputs approximating the model to the inputs of the actual model. This mapping includes information on the actual input, enabling easy interpretability of the approximating model.
Thus, the approximating model results in a linear combination of $\phi$ values, which are determined through the following form:

\[ \phi_i(f, x) = \sum_{z' \subseteq x'} \frac{|z'|!\,(M - |z'| - 1)!}{M!} \left[ f_x(z') - f_x(z' \setminus i) \right], \tag{3.10} \]

where $f$ refers to the actual model, and $f_x(z') = f(h_x(z'))$ denotes the output of the actual model for the simplified input $z'$. For each data point $x'$, new artificial data points $z'$ are created by excluding features, and the model is evaluated. The differences in the model output between the original data points and the artificial data points yield the SHAP value for each feature $i$.

In the ML context, SHAP values account for varying magnitudes and signs, indicating how features contribute to the model's output or class prediction. A positive sign indicates a contribution to predicting a specific class, while a negative sign signifies a contribution to predicting the opposite class. The key advantages of SHAP values lie in their transparency, making them easily understandable. Additionally, this procedure is model-agnostic and can be applied to approximate any model.

3.4 Physics-Driven Machine Learning

Physics-driven ML (PDML), a subset of scientific ML (SciML), represents the synergy between physics and machine learning. The core principles of SciML involve addressing the challenges presented by scientific domain knowledge and developing interpretable and robust ML models and algorithms [196]. The integration of physics knowledge into ML models is expected to improve accuracy, physical interpretability, model size, complexity, sample efficiency and generalizability [7]. Physics domain knowledge is available in various forms, including essential physical principles (e.g., ab initio or first-principles physics), physical constraints (e.g., symmetries, invariances, conservation laws, asymptotic limits), and valuable insights gained from theoretical or computational studies [14].
In this thesis, the aim is to explore dimensionality reduction of collected observational data through PDML, transitioning from data-driven to physics-driven approaches. To better understand this transition, the interplay between data and physics scenarios in the ML field is illustrated in Fig. 3.5.

Fig. 3.5 Data and physics scenarios [6].

The incorporation of physically relevant prior knowledge into ML algorithms can be achieved through various high-level approaches: physics-inspired descriptors, ML architecture, loss function, and the utilization of hybrid methodologies. To incorporate this knowledge into our models, three primary approaches are employed according to the categorization of bias in the ML process [1, 6]:

The first approach incorporates observational bias by employing equivariant operations to generate invariant scalar features. These operations preserve the consistency and integrity of the resulting features under data transformations, thereby safeguarding crucial physical properties throughout the learning process. By training an ML algorithm on an invariant, scalar, lower-dimensional feature space, it gains the capability to learn functions, vector fields, and operators that faithfully reflect the underlying physical structure of the data.

The second approach relies on the concept of inductive bias, accomplished through the utilization of equivariant models. These models offer the advantage of more faithfully representing physical interactions, ensuring that essential quantities maintain their predictable behavior under various coordinate transformations.

The third approach constrains the learning optimization step by incorporating learning bias through the selection of appropriate loss functions, constraints, and inference algorithms during the training phase.
These strategic choices are deliberately tailored to steer convergence towards solutions that harmonize with the underlying physical principles. Through the fine-tuning of soft penalty constraints, the model can approximate adherence to the governing physical laws, providing a versatile avenue to introduce a diverse array of physics-based biases. As an illustrative example, we mention Physics-Informed Neural Networks (PINNs). While not directly applied in this thesis, they are pertinent to this section. PINNs represent a SciML technique used to tackle problems related to Partial Differential Equations (PDEs). These networks approximate PDE solutions by reframing the task of directly solving governing equations into an optimization problem centered around a loss function [197].

All these methods are designed to maintain the symmetries and equivariances present in the underlying physical systems, making them well-suited for effectively capturing and representing the relevant information. By leveraging these three approaches, ML methods can effectively embed physical domain knowledge into their frameworks, enhancing their capabilities and enabling the application of ML techniques in a wide range of physical and scientific disciplines.

3.4.1 Symmetry and Equivariance in ML

Symmetry, in the context of an object or system, refers to a transformation that preserves a specific property, rendering it unchanged or invariant [198]. These transformations can manifest as either smooth, continuous processes or discrete operations. Symmetries play a fundamental role in various ML tasks. Discrete symmetries naturally emerge in scenarios like particle systems, where particles lack a definitive order and can be rearranged arbitrarily. Similarly, they arise in various dynamical systems through concepts such as time-reversal symmetry, as seen in systems adhering to detailed balance principles or Newton's second law of motion.
Furthermore, permutation symmetries are of central importance in the analysis of data organized in graph structures. Mathematically, symmetries are typically described using groups [7]. The relationship between a function $f$ and a symmetry group $G$ can be characterized by examining its equivariance properties: $f$ is equivariant with respect to $G$ if transforming the input by any element $g \in G$ transforms the output in the corresponding way, $f(g \cdot x) = g \cdot f(x)$. Invariance is a special form of equivariance in which the output does not change at all, $f(g \cdot x) = f(x)$; it deals with quantities that remain unchanged irrespective of the choice of the coordinate system. Fig. 3.6 illustrates the distinction between invariance and equivariance.

Fig. 3.6 An example illustrating the differences between symmetry group invariance and equivariance is presented in the context of identifying a handwritten letter in an image [7].

In the field of ML, a way to categorize models into PDML or conventional ML is based on whether symmetry is employed or equivariant operations are used [1]. By utilizing an equivariant model, transforming the input results in an output representation that undergoes the same transformation [199]. This often includes incorporating geometric coordinates and relevant quantities crucial for describing the system's behavior, such as external fields or atom-wise properties like velocities. The strength of equivariant models lies in their capacity to uphold the system's symmetries and invariances throughout the learning process, ensuring a robust and accurate representation of the underlying physics. The main concern associated with equivariant ML models revolves around the substantial technical complexity they involve.

On the other hand, an invariant function produces the same output for both transformed and non-transformed inputs. Invariant scalar features are preferred over geometric tensors due to their ease of handling and computational efficiency [1]. The application of invariant models has demonstrated impressive performance across numerous existing benchmarks, making them a compelling choice for various scientific and engineering applications. By effectively capturing the essential invariances present in the data, these models enable accurate and robust predictions, advancing our understanding and problem-solving capabilities in the domain of physical systems.

However, when utilizing an invariant model, the challenge lies in devising a method to represent a naturally equivariant physical system using invariant scalar features. This requires careful consideration and creativity to encapsulate the crucial characteristics of the system in a way that remains consistent under transformations. It is crucial to note that, even when the ultimate goal is predicting a scalar quantity, not all physical interactions can be adequately represented using scalars alone. The richness and complexity of physical phenomena often necessitate a more comprehensive representation that considers higher-order interactions and geometrical aspects. This is precisely where equivariant models excel.

By utilizing invariant and equivariant models, researchers and practitioners can access a powerful toolset to effectively capture the intricate dynamics of physical systems and make accurate predictions across a wide range of scientific and engineering applications.

3.4.2 Invariant Descriptors: Learning Latent Representations

Observational data plays a fundamental role and serves as a critical foundation for the success and recent achievements of ML algorithms [6]. Nevertheless, it is essential to recognize that these data can also inadvertently introduce biases into the learning process. Despite this, ML methods have proven to be remarkably powerful, particularly when provided with sufficient data that cover the entire input domain of a learning task. This capability enables accurate interpolation even in high-dimensional scenarios.
It is crucial to ensure that these observational data capture the underlying physical principles governing their generation. By doing so, we can leverage these data as a means of weakly embedding these principles into an ML model during its training phase. Nonetheless, it is worth noting that for over-parameterized ML models, a substantial volume of data is typically required to reinforce these biases adequately. This reinforcement is crucial to generate predictions that respect essential symmetries and conservation laws within the physical systems. Unfortunately, obtaining such a large volume of data can be challenging and costly, especially in the context of physical and engineering sciences. In many cases, observational data may be generated through expensive experiments or large-scale computational models, making the cost of data acquisition potentially prohibitive. As such, researchers and practitioners must carefully consider the trade-offs between the data volume required and the resources available for data acquisition in these applications, as illustrated in Fig. 3.5.

In handling high-dimensional unstructured data, a first step is reducing dimensionality by extracting informative features [200]. These features form the foundation for solving downstream tasks like prediction or classification, boosting efficiency and accuracy in ML tasks. A highly efficient approach to tackle the data acquisition challenge is to create a lower-dimensional tuple of physics-inspired descriptors, generating an embedding or latent space that captures the underlying symmetries of the system. This resultant latent space serves as a robust basis for training an ML model, eliminating the necessity for data augmentation and greatly improving sample efficiency. By leveraging the intrinsic knowledge embedded within these physics-inspired representations, the ML model gains a deeper and more meaningful understanding of the data.
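As a toy illustration of what such an invariant descriptor can look like, the sketch below maps a set of three-dimensional coordinates to the sorted list of pairwise distances. This vector is unchanged under rotations, translations and permutations of the points, so a model trained on it needs no augmentation over those transformations. The choice of descriptor and the helper names are assumptions made for illustration only; they are not the descriptors developed in this thesis, and sorted distances alone do not distinguish every pair of distinct configurations.

```python
import math


def pairwise_distance_descriptor(coords):
    """Sorted list of all pairwise Euclidean distances between the points."""
    dists = []
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            dists.append(math.dist(coords[i], coords[j]))
    return sorted(dists)


def rotate_z(coords, angle):
    """Rotate every point by `angle` (radians) about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in coords]


def translate(coords, shift):
    """Shift every point by the same vector."""
    return [tuple(a + d for a, d in zip(p, shift)) for p in coords]
```

Applying `rotate_z`, `translate` or a reordering of the list to a configuration and recomputing `pairwise_distance_descriptor` gives the same vector up to floating-point rounding, whereas the raw coordinates change under each of these operations.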
Training on such physics-inspired representations leads to more accurate predictions and maximizes the use of available samples, culminating in highly effective and precise machine learning applications. This integration of physics-driven features not only streamlines the learning process but also enhances interpretability and opens doors for innovative advancements in diverse scientific and engineering domains.

3.4.3 Geometric Deep Learning

Introducing an inductive bias involves the development of specialized architectures that intrinsically incorporate prior knowledge and inherent biases relevant to a specific predictive task [6]. This concept aligns with the paradigm known as geometric deep learning (GDL), which spans the entire spectrum of deep learning, encompassing both Euclidean and non-Euclidean domains. GDL integrates insights about the structure and symmetry intrinsic to the system of interest. These domains encompass intricate structures, such as graphs, manifolds, meshes, and string representations [201].

Fundamentally, GDL employs techniques that establish a geometric bedrock, entailing the assimilation of knowledge concerning the inherent spatial relationships and symmetrical attributes present in input variables. By infusing this geometric foundation, the objective is to enhance the precision of information captured by the model [202]. Among these methods, CNNs stand out as a canonical example, fundamentally reshaping the landscape of computer vision by preserving invariances related to symmetrical groupings and the distributed patterns found in natural images. Furthermore, convolutional networks can be extended to accommodate additional symmetry groups, encompassing rotations, reflections, and more intricate gauge symmetry transformations. Other notable instances include graph neural networks (GNNs), equivariant networks, kernel methods such as Gaussian processes, RNNs and Transformers [198].
GDL provides a constructive procedure for incorporating prior physical knowledge into neural architectures.

Convolutional Neural Networks

CNNs draw inspiration from cognitive neuroscience, particularly the pioneering work of Hubel and Wiesel on the cat's visual cortex. Their research uncovered distinct neuron types: simple neurons responsive to small visual patterns and complex neurons tuned to larger motifs [203]. CNNs serve as the cornerstone of image classification, dominating the landscape of computer vision algorithms. Moreover, they have found promising applications in natural language processing. In CNNs, the core operation is convolution:

\[ f_i = \sigma\left( \sum_{j} (X \circledast W_{ij}) + b_i \right) \tag{3.11} \]

where the filters $W_{ij}$ systematically extract descriptive features by traversing the input data $X$, generating feature maps $f_i$. These filters act as functions applied to the data. CNNs accommodate multi-dimensional input arrays, such as two-dimensional images with three color channels or one-dimensional genomic sequences with a channel for each nucleotide [204]. The high dimensionality of images increases the complexity of hyperparameter tuning. Convolutional layers, often combined with pooling layers, empower the network to autonomously acquire abstract features. A typical CNN architecture is illustrated in Fig. 3.7. Key hyperparameters, including the number of convolutional layers, filter count, and filter size, need fine-tuning during the validation process.

Fig. 3.7 LeNet-5: One of the earliest convolutional neural networks [5].

CNNs excel at detecting local patterns by employing the convolution operation to glean insights from data. This process involves scanning through the data and generating feature maps that connect to subsequent CNN layers. The input to CNNs is an n-dimensional tensor, which can represent a variety of data types such as the images and genomic sequences mentioned above.

Recurrent Neural Networks

RNNs are another type of GDL architecture, tailored for processing specific data types such as time-series, text, and biological data containing sequential dependencies among attributes. Renowned for their proficiency in learning from string representations, RNNs excel in pattern recognition across various time steps, facilitated by parameter sharing across different model segments. In an RNN, there is a direct correspondence between the layers within the network and the specific timestamp or position in the sequence [5]. An RNN comprises a variable number of layers, with each layer having a single input corresponding to that particular timestamp. Specifically, a simple recurrent node in an RNN is described by the following equation:

\[ h_t = \sigma(w_{xh} \cdot x_t + w_{hh} \cdot h_{t-1} + b_h) \tag{3.12} \]

where:

• $h_t$: Hidden state at time $t$.

• $\sigma$: Activation function applied element-wise.

• $w_{xh}$: Weight matrix connecting the input $x_t$ to the hidden state.

• $w_{hh}$: Weight matrix connecting the previous hidden state $h_{t-1}$ to the current hidden state.

• $b_h$: Bias term for the hidden state.

RNNs can also be regarded as feed-forward networks with a specific structure rooted in the concept of time layering, enabling them to accept a sequence of inputs and generate a sequence of outputs. These models have proven particularly valuable for applications involving sequence-to-sequence learning, such as machine translation or predicting the next element in a sequence. In Fig. 3.8, the representation of an RNN is depicted.

Fig. 3.8 Time-layered architecture of an RNN [5].
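The recurrence of Eq. 3.12 can be unrolled over a sequence in a few lines. The sketch below is a minimal standard-library illustration that assumes tanh as the activation $\sigma$ and stores weight matrices as nested lists; the function names `rnn_step` and `rnn_forward` are hypothetical. Note that the same `W_xh`, `W_hh` and `b_h` are reused at every time step, which is the parameter sharing mentioned above.

```python
import math


def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One application of Eq. 3.12 with tanh as the (assumed) activation."""
    pre_activation = [
        a + b + c
        for a, b, c in zip(matvec(W_xh, x_t), matvec(W_hh, h_prev), b_h)
    ]
    return [math.tanh(p) for p in pre_activation]


def rnn_forward(xs, h0, W_xh, W_hh, b_h):
    """Unroll the recurrence over the input sequence xs; returns all hidden states."""
    h = h0
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        states.append(h)
    return states
```

Each entry of the returned list is the hidden state $h_t$ for one position of the input sequence; an output layer mapping $h_t$ to $y_t$ would be applied on top of these states and is omitted here.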
The RNN architecture in Fig. 3.8 allows for the distinction of:

• Input Sequence ($x_t$): Represents the input at time $t$, varying from the sequence's start to end.

• Hidden State ($h_t$): Denotes the hidden state at time $t$, capturing memory from previous steps.

• Weights ($w$): Learned parameters, including connections from input to hidden states and recurrent connections.

• Output ($y_t$): Represents the output at time $t$, predicting or representing the input sequence.

Long Short-Term Memory (LSTM) RNNs, in particular, excel at efficient parameter sharing through gated memory mechanisms [205]. Within each LSTM cell, recurrent units equipped with self-learned gating enable the preservation, modification, and selective forgetting of information within a short-term memory [206]. This effectively addresses challenges related to handling long-term dependencies, which can be problematic in other RNN variants.

3.5 Machine Learning Combined with Atomistic Simulations

Atomistic simulations are a key tool for exploring material mechanics. The fidelity of simulation results relies on the interatomic potential describing atom interactions. Classical potentials have two main limitations: transferability and version-control of the originally developed potentials [207]. The first is due to fixed forms and few fitting parameters. The second major issue is the risk of discrepancies between the implemented potential and the original version provided by developers. Maintaining accurate parameters over time is difficult due to file format changes, transfer errors, and file corruption. In contrast, MLIPs [87] offer flexibility by learning from first principles calculations rather than relying on fixed forms [88–90]. Suc