Physics-Driven Machine Learning:
from Biomolecules to Crystals

Dissertation approved by the Faculty of Energy, Process and Bio-Engineering of the
University of Stuttgart and the Stuttgart Center for Simulation Science (SC SimTech)

in fulfilment of the requirements for the degree of Doctor of Engineering
(Dr.-Ing.)

Submitted by
Ángel Díaz Carral

from Ávila, Spain

Main examiner: Prof. Dr. rer. nat. Dr. h. c. Siegfried Schmauder
Co-examiner: Prof. Dr. rer. nat. Maria Fyta
Co-examiner: apl. Prof. Dr.-Ing. habil. Niels Hansen

Date of oral examination: 27 March 2024

Institute for Materials Testing, Materials Science and Strength of Materials (IMWF),
University of Stuttgart

September 2024


I dedicate this thesis to my loving wife Kany, parents Ángel and Marilo, brother Nacho,
and friends. I also extend this dedication to my grandparents, whose wisdom echoes
through generations, and to my precious baby Lorán, whose arrival has filled our hearts
with boundless joy and love. May the lessons from both the past and the future inspire
the discoveries within these pages . . .


Declaration

I hereby declare that, except where specific reference is made to the work of
others, the contents of this dissertation are original and have not been submitted in
whole or in part for consideration for any other degree or qualification in this, or any
other university. This dissertation is my own work and contains nothing which is the
outcome of work done in collaboration with others, except as specified in the text
and acknowledgements. I also declare that there is no conflict of interest, and any
included publications are my own work, except where indicated throughout the thesis
and summarised and clearly identified on the declarations page of the thesis.

Ángel Díaz Carral


Acknowledgements

This thesis compiles the research I carried out at SC SimTech and the University
of Stuttgart between 2019 and 2023.

I am profoundly grateful to Prof. Fyta for her steadfast guidance and the life-
changing opportunity to start as a HiWi and progress to a PhD student. Prof.
Schmauder’s invaluable mentorship played a crucial role during my research journey.
Special thanks to Christian Holm for exceptional leadership, unwavering support, and
facilitating the contract extension for my continued involvement. The collaborative
environment at the Institute for Computational Physics (ICP) and the Institute for
Materials Testing, Materials Science, and Strength of Materials (IMWF) was pivotal in
developing this thesis. The infrastructure at ICP and IMWF played an instrumental
role in its completion.

I express my gratitude to the dedicated members of my research groups at both ICP
and IMWF, including Chandra, Takeshi, Martin, Magnus, Simon, Azade, Kira, Ayberk
and Louis from ICP, and Xiang, Frank and Stephen from IMWF. Their insights and
collaboration significantly shaped the outcomes of this thesis.

I extend thanks to the University of Stuttgart for fostering an enriching academic
environment and providing essential resources. My deep gratitude also goes to the
University’s EXC 2075 SimTech Cluster, supported by the German Research Foundation
(DFG), for their financial assistance, allowing me to concentrate on research and pursue
academic goals throughout my doctoral studies.

Finally, my deepest gratitude goes to my wife, family, and friends for their unwaver-
ing support, encouragement, and understanding throughout this challenging journey.
Their love and encouragement have been a constant source of motivation.

This thesis is dedicated to all those who have played a significant role in my
academic and personal growth. Thank you for being part of this incredible journey
with me.

Ángel Díaz Carral


Publications

The following publications are the result of the work dedicated to this thesis.

• Á. Díaz Carral, M. Ostertag, and M. Fyta, "Deep learning for nanopore ionic
current blockades," J. Chem. Phys., vol. 154, no. 4, p. 044111, 2021.

• Á. Díaz Carral, X. Xu, S. Gravelle, A. YazdanYar, S. Schmauder, and M. Fyta,
"Stability of binary precipitates in Cu-Ni-Si-Cr alloys investigated through active
learning," Mater. Chem. Phys., vol. 306, p. 128053, 2023.

• Á. Díaz Carral, M. Roitegui, and M. Fyta, "Interpretably learning the critical
temperature of superconductors: electron concentration and feature dimensional-
ity reduction," APL Mater., vol. 12, p. 041111, 2024.

• Á. Díaz Carral, M. Roitegui, A. Koc, M. Ostertag, and M. Fyta, "Concurrent
analysis of electronic and ionic nanopore signals: blockade mean and height,"
Nano Ex., vol. 5, no. 2, p. 025020, 2024.

• Á. Díaz Carral, S. Gravelle, and M. Fyta, "In silico evidence of metastable
quaternary phases in Cu-Ni-Si-Cr alloys," submitted to APL Mach. Learn., 2024.

• Á. Díaz Carral, M. Roitegui, and M. Fyta, "Structural and electronic features
for the prediction of superconducting materials," in preparation, 2024.

Other publications:

• L. Oberer, Á. Díaz Carral, and M. Fyta, "Simple Classification of RNA
Sequences of Respiratory-Related Coronaviruses," ACS Omega, vol. 6, no. 31,
pp. 20158-20165, 2021.


Abstract

Physical systems and their interactions are inherently equivariant [1]. The prediction
of quantities in machine learning (ML) that are fundamentally generated from these
equivariant interactions is accomplished through two main approaches: applying
equivariant operations to generate invariant scalar features as input for an invariant
model, or utilizing equivariant models themselves. In this thesis, the focus lies on
the former framework, where we explore feature extraction and data representation
techniques in physics domains through physics-driven machine learning (PDML). This
particular field of ML benefits from prior knowledge of physics to create descriptors
that encode the underlying symmetries of the dataset, thereby reducing dimensionality,
increasing interpretability, and enhancing generalization capabilities. To highlight the
significance of physics-driven descriptors in physically-inspired embedding spaces, the
thesis focuses on several schemes relevant to its objectives:

1. Copper-based alloys

2. Nanopore detectors

3. High-Tc superconductivity
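
The first of the two approaches named above, turning raw coordinates into invariant scalar features, can be illustrated with a minimal sketch. The descriptor used here (a sorted list of interatomic distances) is only an illustrative assumption and is not the descriptor employed in this thesis; the function names and the three-atom configuration are likewise hypothetical.

```python
import math
from itertools import combinations


def distance_descriptor(positions):
    """Sorted list of all pairwise distances between atoms.

    Distances are unchanged by rotations and translations, and sorting
    removes the dependence on the order in which atoms are listed.
    """
    return sorted(math.dist(a, b) for a, b in combinations(positions, 2))


def rotate_z_and_shift(positions, angle, shift):
    """Rotate every position about the z axis by `angle` (radians), then translate."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in positions
    ]


# Hypothetical three-atom configuration in arbitrary units (illustration only).
atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.5)]
moved = rotate_z_and_shift(atoms, angle=0.7, shift=(3.0, -1.0, 2.0))

original_features = distance_descriptor(atoms)
moved_features = distance_descriptor(moved)

# The coordinates differ, but the invariant features agree up to rounding error.
features_match = all(
    math.isclose(a, b, abs_tol=1e-9)
    for a, b in zip(original_features, moved_features)
)
print(features_match)
```

A model trained on such features cannot distinguish a configuration from its rotated or translated copy, which is the sense in which the symmetry is encoded in the descriptor rather than learned from data.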

Using molecular simulations and PDML approaches, the aim is to provide insights
into nanopore sequencing and materials discovery. The following questions are
investigated:

• What are the limitations of physics-inspired descriptors in ML?

• Can we decrease the dimensionality of the data while maintaining the same level
of prediction accuracy?

• Can PDML with invariant descriptors achieve performance comparable to that of
conventional ML methods?

• How does PDML scale in atomistic systems?


The investigation of copper-based alloys carried out within the framework of this
project combines computer simulations and active learning (AL) to reveal stable
precipitate phases of copper alloys and to study their mechanical properties.
After the AL cycle that generates accurate ML interatomic potentials (MLIPs) was
successfully implemented, the focus shifted to the stability analysis framework
for binary copper-based systems. This part involves quantum mechanical (QM)
simulations of various alloy configurations using density functional theory (DFT)
as implemented in VASP. Static calculations are performed at zero temperature to
generate data for the AL on-the-fly relaxation algorithm. The algorithm utilizes
moment tensor potentials (MTPs), a type of descriptor based on invariant polynomials,
to construct MLIPs for multi-component alloys. The goal is to conduct a comprehensive
search for stable precipitate phases in copper alloys.

As a second application, nanopore DNA translocations are studied through PDML.
DNA molecules can be electrophoretically driven through a nanoscale opening in a
material, giving rise to rich and measurable ionic current blockades. In this work,
ML models are trained on experimental ionic blockade data from DNA nucleotide
translocation through 2D pores of different diameters. The aim of the resulting
classification is to enhance the read-out efficiency of nucleotide identity, providing
pathways toward error-free sequencing. A novel method is proposed that reduces the
current traces to a few physical descriptors, thereby lowering the dimensionality of
the data, and trains low-complexity models on them. Each translocation event is
described by four features, including the height of the ionic current blockade.
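
The reduction of a current trace to a handful of physical descriptors can be sketched as follows. This is a minimal illustration under stated assumptions: only the blockade height is named in the text above, so the other three quantities (dwell time, mean blockade, blockade standard deviation), the function name `event_features`, and the sample values are hypothetical choices made for this example.

```python
from statistics import mean, pstdev


def event_features(current_nA, baseline_nA, dt_ms):
    """Reduce one translocation event to four scalar descriptors.

    current_nA  : ionic current samples recorded during the event (nA)
    baseline_nA : open-pore current used as reference (nA)
    dt_ms       : sampling interval (ms)

    The choice of descriptors is an assumption for illustration only.
    """
    # Blockade depth of each sample relative to the open-pore baseline.
    blockade = [baseline_nA - i for i in current_nA]
    return {
        "dwell_time_ms": len(current_nA) * dt_ms,
        "mean_blockade_nA": mean(blockade),
        "blockade_height_nA": max(blockade),
        "blockade_std_nA": pstdev(blockade),
    }


# Hypothetical event: four samples taken every 0.5 ms, baseline at 5 nA.
trace = [4.0, 3.5, 3.0, 3.5]
features = event_features(trace, baseline_nA=5.0, dt_ms=0.5)
print(features)
```

Each event, regardless of how many samples its raw trace contains, is thus mapped to a fixed-length vector that a low-complexity classifier can consume.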

In the field of high critical temperature (high-Tc) superconductivity, an
exceptionally effective PDML model is proposed that predicts the critical temperatures
of superconductors from features carefully extracted from electronic and atomic
properties. Despite its streamlined feature space, it retains its accuracy when compared
to more intricate methodologies. The model is fine-tuned to forecast distinct
superconductor properties, balancing precision and simplicity and enabling
predictions for emerging structures.

By bridging the gap between ML and physics, this research contributes to the
growing field of PDML by embedding the physics into the descriptors, advancing the
ability to model, predict, and control complex physical systems with unprecedented
accuracy and efficiency. Through this work, the aim is to pave the way for transformative
applications, insights, and discoveries that have the potential to reshape scientific and
technological advancements across multiple disciplines.


Summary (Zusammenfassung)

Physical systems and their interactions are inherently equivariant [1]. In machine
learning (ML), quantities that fundamentally arise from these equivariant interactions
are predicted through two main approaches: applying equivariant operations to generate
invariant scalar features as input for an invariant model, or using equivariant models
themselves. This thesis focuses on the first approach, in which we investigate feature
extraction and data representation in physical domains through physics-driven machine
learning (PDML). This particular field of ML draws on existing physical knowledge to
create descriptors that encode the underlying symmetries of the dataset. This reduces
dimensionality, increases interpretability, and improves the ability to generalize. To
underline the significance of physics-driven descriptors in physically inspired
embedding spaces, the thesis concentrates on several schemes relevant to its objectives:

1. Copper-based alloys

2. Nanopore detectors

3. High-Tc superconductivity

Through molecular simulations and PDML approaches, the aim is to investigate and
provide insights into nanopore sequencing and materials discovery. In this thesis,
the following questions are investigated:

• What are the limitations of physics-based descriptors in machine learning?

• Can we reduce the dimensionality of the data while maintaining the same
prediction accuracy?

• Is it possible to achieve performance with PDML and invariant descriptors that is
comparable to conventional ML methods?

• How scalable is PDML in atomistic systems?

The investigation of copper-based alloys carried out within this project focuses on
combining computer simulations and active learning (AL) to reveal stable precipitate
phases of copper alloys and to study their mechanical properties. Once the AL cycle
had been successfully implemented and accurate ML interatomic potentials (MLIPs) had
been generated, the focus was placed on the stability analysis framework for binary
copper-based systems. This part comprises quantum mechanical (QM) simulations of
various alloy configurations in copper alloys using density functional theory (DFT)
as implemented in VASP. Static calculations are performed at zero temperature to
generate data for the AL on-the-fly relaxation algorithm. The algorithm uses moment
tensor potentials (MTPs), a type of descriptor based on invariant polynomials, to
construct MLIPs for multi-component alloys. The goal is to conduct a comprehensive
search for stable precipitate phases in copper alloys.

To deepen the analysis further, nanopore DNA translocations are studied by means of
PDML. DNA molecules can be driven electrophoretically through a nanoscale opening in
a material, giving rise to rich and measurable blockades of the ionic current. In this
work, ML models are trained on experimental ionic blockade data from DNA nucleotide
translocations through 2D pores of different diameters. The aim of this classification
is to improve the efficiency of nucleotide identification and thus pave the way toward
error-free sequencing. A novel method is proposed that simultaneously reduces the
current traces to a few physical descriptors and trains low-complexity models in order
to reduce the dimensionality of the data. Each translocation event is described by four
features, including the height of the ionic current blockade.

In exploring the field of high-temperature (high-Tc) superconductivity, an
exceptionally effective PDML model is proposed that predicts the critical temperatures
of superconductors by carefully extracting features from electronic and atomic
properties. Despite a simplified feature space, it retains its accuracy in comparison
with more complicated methods. The model is fine-tuned to predict various properties
of superconductors, striking a balance between precision and simplicity and enabling
predictions for emerging structures.

By bridging the gap between ML and physics, this research will contribute to the
growing field of PDML by embedding the physics into the descriptors and by improving
the ability to model, predict, and control complex physical systems with an
unprecedented degree of accuracy and efficiency. Through this work, the aim is to pave
the way for transformative applications, insights, and discoveries that have the
potential to reshape scientific and technological advances across various disciplines.


Table of contents

Publications

Abstract

Summary (Zusammenfassung)

List of figures

List of tables

1 Introduction
1.1 Goals
1.2 Thesis Outline

2 Theoretical Background
2.1 Copper-based Alloys
2.2 Nanopore Detectors
2.3 High-Tc Superconductivity
2.4 Respiratory-Related Coronaviruses

3 Machine Learning
3.1 Shallow Learning
3.1.1 Clustering Analysis
3.1.2 Tree-based Methods
3.2 Deep Learning
3.2.1 Multi-Layer Perceptron
3.3 Feature Importance: SHAP Values
3.4 Physics-Driven Machine Learning
3.4.1 Symmetry and Equivariance in ML
3.4.2 Invariant Descriptors: Learning Latent Representations
3.4.3 Geometric Deep Learning
3.5 Machine Learning Combined with Atomistic Simulations

4 Simulation Methods
4.1 Density Functional Theory
4.1.1 Kohn-Sham Equations
4.1.2 Exchange-correlation Functionals
4.1.3 Crystals: Periodicity, Basis Sets and Pseudopotentials
4.2 Molecular Dynamics
4.2.1 Integration Schemes
4.3 Statistical Ensembles in MD
4.4 Thermostats

5 Stability of n-ary Phases in Cu-Ni-Si-Cr Alloys
5.1 Computational Details
5.1.1 Crystal Prototype Sampling
5.1.2 On-the-fly Active Learning Relaxation: AL-MTP
5.1.3 Convex Hull Calculations
5.1.4 Calculation of Phonon Dispersion
5.1.5 Molecular Dynamics Simulations
5.2 Results and Discussion
5.2.1 Prediction of Novel Binary Structures
5.2.2 Analysis of Metastable Quaternary Structures
5.2.3 Properties Assessment of the Predicted Cu7Si
5.3 Summary
5.4 Acknowledgement

6 Enhancing Nanopore Translocation Read-out via PDML
6.1 Data Collection
6.1.1 Experimental Details
6.1.2 Event Detection
6.2 Feature Extraction
6.3 Machine Learning Models
6.3.1 Clustering Methods
6.3.2 Classification Methods
6.3.3 SHAP Analysis
6.4 Results
6.4.1 Feature Efficiency via Clustering Analysis
6.4.2 Classification of Single Nucleotides
6.4.3 Bimodal Feature Importance
6.5 Summary
6.6 Acknowledgement

7 Learning the Critical Temperature of Superconductors through PDML
7.1 Datasets on Superconductors
7.2 Feature Selection
7.3 PDML Model
7.4 Results
7.4.1 Feature Efficiency via Dimensionality Reduction
7.4.2 Interpreting EC Features: SHAP Analysis
7.4.3 Prediction of Critical Temperature in HEA
7.5 Summary
7.6 Acknowledgement

8 Conclusions

Appendix A Simple Classification of RNA Sequences of Respiratory-Related Coronaviruses
A.1 Data Collection and Preprocessing
A.2 Feature Extraction
A.3 Implementation and Optimization
A.4 Clustering Analysis
A.5 Conclusions
A.6 Acknowledgement

References


List of figures

3.1 Instances of SL classification (left) and UL clustering (right). In SL,
the model is trained on labeled data (blue and red), allowing it to learn
patterns and relationships. On the other hand, UL involves the model
adapting to unlabeled data (grey), autonomously identifying structures
and patterns without predefined categorization [2].

3.2 k-means vs. DBSCAN example. DBSCAN proves adept with irregular
and diverse datasets, while k-means efficiently partitions data into k
clusters based on mean distances to centroids [3].

3.3 This example employs two features (root and internal nodes) to classify
data into three sub-groups (leaves). Intermediate splits that do not
immediately form a leaf are internal nodes. The lines connecting
nodes and leaves are branches. Input variables (features) are utilized to
classify data into sub-groups based on binary conditions. The sample mean
of the output $y_t$ is computed from the training sample for each sub-group
and serves as a constant prediction for future observations classified into
that sub-group [4].

3.4 Computational graph of a single perceptron with the input and output
layers, as well as the nodes and bias vectors of the layers [5].

3.5 Data and physics scenarios [6].

3.6 An example illustrating the differences between symmetry group
invariance and equivariance is presented in the context of identifying a
handwritten letter in an image [7].

3.7 LeNet-5: One of the earliest convolutional neural networks [5].

3.8 Time-layered architecture of an RNN [5].

3.9 Performance comparison of several descriptor-based ML potentials on a
range of crystal structures [8].


3.10 Dimensionality reduction of the atomic neighborhood $n_i$ for the atom $i$,
described by the moment tensors $M_{\mu,\nu}$ [9].

4.1 The ‘Jacob’s ladder’ of exchange-correlation functionals [10].

5.1 Convex hulls for the binary systems in Tab. 5.1. The hulls calculated in
this work are labelled ‘AL-MTP’. The potentially new prototypes are
denoted by blue circles. The respective convex hulls from
AFLOW are also provided for comparison [11].

5.2 The phonon dispersion for the predicted novel Cu7Si binary. The phonon
spectrum calculated within the DFT framework is compared to that
obtained using the AL-MTP potential in LAMMPS-MLIP [11].

5.3 Crystal structure of the novel predicted Cu7Si before (left) and after
(middle) relaxation. The Cu and Si atoms are colored in orange and
brown, respectively. The right panel depicts the Cu7Si supercell at a
temperature $T = 300$ K [11].

5.4 The convex hulls for the Cr-Ni (left) and the Cu-Ni (right) binaries. The
black and green lines correspond to the spin-polarized and spin
non-polarized calculations, respectively. The blue circles denote the
predicted structures [11].

5.5 The number of predicted quaternary structures with respect to their
formation enthalpy $\Delta H$ (meV/atom).

5.6 The number of predicted quaternary structures with respect to their
hull distance $\Delta H_d$ (meV/atom).

5.7 Crystal structure of the novel predicted Cu4NiSi2Cr. The Cu, Ni, Si and
Cr atoms are coloured in orange, brown, yellow and dark grey, respectively.

5.8 The RDF for the Cu-Si pair calculated through MD simulations using
the AL-MTP (left) and MEAM (right) potentials. The curves correspond
to simulations at different temperatures $T$, as denoted by the legends [11].

5.9 Density of the Cu7Si structure as a function of the temperature $T$, for
both the AL-MTP and MEAM potentials as indicated by the legend. The
horizontal dashed line highlights the value $\rho = 8.15$ g/cm³ as measured
from DFT in the limit $T \to 0$ K [11].


6.1 A set of concatenated events for the translocation of dAMP through
the nanopore with a diameter of 2.8 nm (Exp. B). Each red block on
the left represents an event of a nucleotide translocating the nanopore
in a certain configuration. On the right, the four features for a single
nucleotide translocation event are highlighted [12].

6.2 Detection of outlier events in the data from both experiments, Exp. A and
Exp. B. The left panel depicts all data with respect to the two features
of the ionic current blockade and the dwell time, while the right panel
shows the same feature space after the outliers have been removed based
on the cutoff for a dwell time below 10 ms [12].

6.3 The encoding process of mapping the raw ionic current signals, by means
of the physical descriptors/features, into grey-level scale images. The
encoding is followed by the training procedure and the prediction of
the nucleotide identity at the end of the pipeline. The images include 4
pixels corresponding to the four features for each single translocation
experiment of a certain nucleotide [12].

6.4 Two-dimensional graphs for the blockade mean and height features for
the analyte ‘ssDNA’ from Tab. 6.1. The top panels represent the clusters
for these features and the ionic (left) and electronic (right) measurement
channels, as denoted by the legends. The lower panels evaluate both
channels together. The black filled circles denote the center of each cluster.

6.5 Confusion matrices for the LSTM, XGBoost, DNN, and CNN models
as denoted by the labels. All datasets from both experiments are
represented. The ‘True label’ and ‘Predicted label’ refer to the true and
predicted identity of the nucleotides [12].

6.6 The mean absolute SHAP values for the 80nt ssDNA dataset using the
XGBoost classifier are depicted. On the left, a comparison of single-channel
performance is shown, while on the right, the combination of
both channels for molecule classification is presented.

7.1 t-SNE projection for the EC input space with 20 dimensions. The
different groups iron-based, low-Tc and high-Tc cuprates, and ‘others’
are highlighted by the colors orange, green, blue and grey, respectively.

7.2 Visual representation illustrating the comparison between measured and
predicted Tc (K) values for the test set. The predictions were generated
using MLP Regressor (iii).


7.3 Percentage of SHAP values contribution to the model for the different
ECs.

7.4 Comparison between the convex hull obtained from AFLOW (represented
by red dots and lines) and the one predicted by AL-MTP (depicted with
black dots and lines). The algorithm identified two new stable structures
(illustrated by blue dots).

A.1 A sketch depicting the ORF identification process within a sequence of
nucleobases (see text for more explanation). The labels in green, red,
blue denote the amino acids (‘Met’ is methionine, ‘Cys’ is cysteine, etc.)
that are made up from the respective codons.

A.2 A sketch of the feature extraction scheme. Nucleobase triplets (the
codons), shown on the top, are counted through the counter ‘Σ’ and
normalized over the total length of the sequence to lead to each feature.

A.3 The feature space formed by two feature vectors ACC (Threonine) and
GGC (Glycine) for the SARS/MERS virus family. The green, red, and
blue symbols correspond to the SARS-CoV-2, SARS, and MERS viruses,
respectively.

A.4 The feature space formed by two feature vectors (‘0’ and ‘1’) from PCA
for the corona virus family. The colors correspond to the different viruses
as denoted by the legend.

A.5 The feature space formed by two feature vectors (‘0’ and ‘1’) from PCA
for the corona and the herpes virus families. The colors correspond to
the different viruses as denoted by the legend.

List of tables

5.1 Number of prototypes for the binary and quaternary systems studied in
this work, generated by ENUMLIB for the parent lattices fcc and bcc.
‘Other’ refers to the number of relevant structures taken from one or
more of the ICSD, OQMD and Materials Project databases.

5.2 Number of unique stoichiometries for the quaternary systems studied in
this work, generated by ENUMLIB for the parent lattice fcc.

5.3 Ground state total energies ($E_i^{\mathrm{total}}$ in eV/atom) calculated during the
DFT volume relaxation for the unit cells of the alloying elements $i$ = {Cu,
Ni, Si, Cr} in the crystal symmetries (‘symm’) and magnetic state
(‘magn’) stated. For the notation, see text and Eq. 5.1.

5.4 Fitting errors (MAE in meV/atom) during the AL-MTP process and
the generation of the MTPs for the fcc and bcc sets, respectively. The
number of configurations selected for the training set is given.

5.5 Fitting errors (MAE and RMSE in meV/atom) during the AL-MTP
process and the generation of the MTPs for the fcc-derivative sets for
different $\mathrm{lev}_{\max}$: 16, 18 and 20, respectively. The number of configurations
selected for the training set is given.

5.6 Formation enthalpy (meV/atom) and source of the new prototypes
found through the AL-MTP procedure and post-relaxed with DFT.

5.7 Formation enthalpies (in meV/atom) of stable binary structures included
in AFLOW as compared to the DFT post-relaxed and AL-MTP
values.

5.8 Lattice vectors (Å), angles (°), volume (Å³), lattice system (LS) and
space-group (SG) for the new structures.

5.9 Formation enthalpies (in meV/atom) of metastable quaternary
structures. Comparison between the AL-MTP and DFT post-relaxed values.


5.10 Lattice vectors (Å), angles (°), volume (Å³), lattice system (LS) and
space-group (SG) for the novel Cu4NiSi2Cr.

6.1 Overview of the most relevant experimental details and conditions related
to the analyzed data. ‘Analyte’, ‘pore’, ‘salt’, $V_{\mathrm{ionic}}$ and $V_{\mathrm{el}}$ refer to the
translocating molecule, the pore diameter, the salt solution, the voltage
difference in the ionic and electronic channel, respectively, and the
presence of a differential amplifier.

6.2 Dataset sizes from the two experiments (Exp. A and Exp. B with
nanopore diameter 3.3 nm and 2.8 nm, respectively). The left columns
refer to the initial nucleotide data, while the right two columns (‘Training
set A’ and ‘Training set B’) refer to the nucleotide data after the detection
of outliers. The ionic current blockades are given in nA, the dwell times
in ms.

6.3 Filter parameters for the CUSUM algorithm, applied for the event
detection in both ionic and electronic channels.

6.4 Classification performance of the different ML algorithms based
separately on the data from nucleotide experiments A (top results) and B
(intermediate results), as well as from the combination of data from both
Exp. A+B (bottom results). The pore diameters are also indicated. Error
values are given to two decimal places. For Exp. B, and thus also for some
of the Exp. A+B results, the error is not 0 but of the order of 0.001; for
consistency with the other values in the table, it was rounded to zero [12].

7.1 Comparison of methods for Tc (K) prediction of novel binary phases
with potential superconductivity.

A.1 Types of viruses, approximate length of a virus genome sequence, date
the data were accessed, number of complete virus genome sequences,
and database for all RNA and DNA data used in the analysis.

A.2 Clustering scores obtained with DBSCAN (top) and k-means (bottom)
for the set of the corona virus family feature vectors. The bold number
in the first column (‘clusters’) indicates the expected number of resulting
clusters. The bold numbers in the other columns emphasize the best
scoring result. The eps value in the last column (top results) denotes
the value at which the DBSCAN clustering was performed.


Chapter 1

Introduction

ML techniques and data-driven artificial intelligence (AI) frameworks have seen
remarkable progress in recent decades across several application domains with
complex, high-dimensional, unstructured data [13]. Despite this success, a significant
limitation remains: most ML approaches struggle to extract interpretable information
and knowledge from data [14]. Additionally, predictions solely derived from data-driven
models may lack physical consistency or even seem implausible [15]. To address these
challenges, there is currently extensive research focused on incorporating existing
expert knowledge into ML design, so-called knowledge-driven ML (KDML) [16–18].
By carefully integrating into the learning procedure prior domain knowledge about
a process, such as physical, biological and/or chemical insights, not only can the
quality of the learned representation be improved, but the learning process can also
be expedited with fewer data samples [19, 20]. This approach aims to combine the
strengths of data-driven techniques with the valuable insights provided by existing
knowledge, ultimately enhancing the interpretability and reliability of ML models.

In this context, this thesis focuses on the scientific domain of physics and
explores the emerging field of physics-driven machine learning (PDML). PDML
represents the convergence of physics and ML; both its applications and the
integration of physics principles into ML models are investigated.

By incorporating physics-based biases, the goal is to enhance the learning process
and improve the performance and interpretability of ML algorithms. These biases can
be classified into three types based on where the physics knowledge is embedded: obser-
vational, inductive, and learning bias [6]. Observational biases use data that embody
the underlying physics or data augmentation techniques to train the machine learning
system. Inductive biases incorporate prior assumptions into the model architecture
to ensure compliance with physical laws. Learning biases are introduced through
the choice of loss functions, constraints, and inference algorithms during training to
favor solutions adhering to physics. These biases can be combined to create hybrid
approaches for building physics-driven learning machines. Here, the main focus lies
on addressing observational bias by utilizing data transformation and incorporating
physical symmetries into invariant physics-inspired descriptors [1]. This approach offers
conceptual simplicity in implementation and interpretation compared to equivariant
ML architectures or physical loss functions. While it may have limitations when dealing
with large volumes of data, invariant descriptors have shown effectiveness in capturing
the underlying physical principles of the system, even in high-dimensional feature
spaces [6].
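
As a minimal illustration of an invariant physics-inspired descriptor, the sketch below uses the sorted list of pairwise interatomic distances. This is not one of the descriptors used in this thesis; the function names and the three-atom configuration are hypothetical and chosen only to show that such a representation is unchanged under rotation, translation, and permutation of atom indices.

```python
import math
import random


def sorted_distance_descriptor(positions):
    """Return the sorted list of all pairwise distances between points."""
    n = len(positions)
    distances = []
    for i in range(n):
        for j in range(i + 1, n):
            distances.append(math.dist(positions[i], positions[j]))
    return sorted(distances)


def rotate_z(point, angle):
    """Rotate a 3D point about the z axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)


# Hypothetical three-atom configuration (arbitrary units).
atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.5)]

# Apply a rotation, a translation, and a permutation of the atom indices.
moved = [rotate_z(p, 0.7) for p in atoms]
moved = [(x + 3.0, y - 1.0, z + 2.0) for x, y, z in moved]
random.Random(0).shuffle(moved)

d_ref = sorted_distance_descriptor(atoms)
d_new = sorted_distance_descriptor(moved)

# The descriptor agrees up to floating-point round-off.
same = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(d_ref, d_new))
print(same)
```

Because the symmetries are built into the input representation, any downstream model trained on such a descriptor inherits them without changes to its architecture or loss function.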

In this thesis, the applications of PDML to diverse schemes are divided into two
main categories: solid-state physics and biomolecules. In the field of solid-state physics,
the focus is on high-throughput screening in materials discovery, specifically
high-entropy alloys (HEAs) such as copper-based alloys. PDML techniques are
employed to address the challenges associated with screening a large number of
potential candidates. Additionally, the superconducting critical temperature
of HEAs, including iron-based, cuprate and perovskite superconductors, is investigated
using electronic-derived features and low-dimensional surrogate ML models. In the
field of biomolecules, novel physical descriptors are introduced for the classification
of DNA translocations during experiments with nanopore detectors. This involves
developing PDML approaches that utilize the unique characteristics of DNA molecules
as they pass through the nanopore, enabling accurate classification and analysis. In the
appendix, a highly efficient PDML classification method tailored for respiratory-related
coronaviruses is introduced. This innovative approach leverages open reading frames
(ORFs) and the genetic code to craft biologically inspired features from the RNA,
enhancing virus classification precision. It exemplifies the application of
PDML in a distinct virology context and complements the main chapters. Through these
diverse applications, this thesis demonstrates the versatility and effectiveness of PDML
across domains and its potential to accelerate materials discovery and
biomolecular analysis.

1.1 Goals

In this thesis, the aim is to explore the limits of physics-inspired descriptors in ML
applied to several atomistic systems. The goals are:


• Improve high-throughput searching in materials discovery: Develop computational
methods and techniques to enhance the efficiency and effectiveness of screening
processes. This includes analyzing large datasets to identify potential candidates
with desired properties, accelerating the discovery and development of materials.

• Enhance read-out protocols in nanopore detectors: Refine and optimize the
protocols used for reading DNA molecules in nanopore experiments. Improve
accuracy, reliability, and speed to make nanopore detection more precise and
applicable in scientific and medical fields.

• Expand the comprehension of high-Tc superconductivity: Integrate PDML tech-
niques with ab initio descriptors to predict critical temperatures in novel su-
perconductors. Rely exclusively on features derived from electronic and atomic
structures, leading to reduced dimensionality, improved interpretability, and estab-
lishing a direct connection between electronic orbitals and heightened prediction
accuracy.

• Establish a framework of PDML with invariant descriptors: Create a comprehen-
sive framework that integrates physics principles into ML. Utilize invariant de-
scriptors to capture fundamental symmetries and invariances in physical systems,
effectively embedding physics-based knowledge and insights into ML algorithms.

1.2 Thesis Outline

An outline of the thesis is presented as follows:

• In Chapter 2, the state of the art in copper-based alloys, nanopore detectors,
high-Tc superconductivity and respiratory-related coronaviruses is explored.

• In Chapter 3, an outline of the ML algorithms applied in physics within this
thesis is provided.

• In Chapter 4, a detailed overview of the simulation methods used in this thesis is
described.

• In Chapters 5, 6, and 7, the application of PDML to molecules and crystals
is presented. Chapter 5 explores the reliability and effectiveness of MLIPs in
the search for novel phases within copper-based alloys. In Chapter 6, data
acquired from nanopore experiments is utilized to classify DNA molecules and


4 Introduction

single nucleotides during nanopore translocations. Various ML models, including
clustering and deep learning (DL) methods, are employed for this purpose. In
Chapter 7, a highly efficient predictive model designed for estimating the critical
temperature of superconductors is introduced.

• In Chapter 8, the contributions of the research and findings are summarized.

• Appendix A presents a supplementary study on fast and efficient approaches for
classifying respiratory-related coronaviruses using PDML.

The thesis also includes a References section, where all cited works are listed.


Chapter 2

Theoretical Background

In this chapter, the main concepts and theoretical background of the four chosen fields
to which PDML is applied are presented. These fields are copper-based alloys,
nanopore detectors, high-Tc superconductors, and respiratory-related coronaviruses.

2.1 Copper-based Alloys

Copper-based alloys are of great interest for electric and electronic applications such as
connectors and lead frames due to their excellent electrical conductivity and strength
[21]. The next generation of integrated circuits requires high-performance copper alloys
with high alloy density, multi-functionality, miniaturization, and low cost [22]. These
alloys should possess both high electrical conductivity and strength. However, the
presence of impurities in the matrix reduces electrical conductivity [23–25]. Thermal
aging can lead to precipitation processes, reducing the number of dissolved impurities
[26]. Particularly promising alloys include copper (Cu) - nickel (Ni) - silicon (Si) -
chromium (Cr) (Cu-Ni-Si-Cr) complexes, which act as effective barriers to dislocation
motion, thereby enhancing the alloy’s strength [27–29]. Binary systems like Cu-Si,
Ni-Si, Cr-Si, Cu-Ni, and Cu-Cr also hinder dislocation movement due to the formation
of clusters and intermetallic phases within the Cu matrix [30–34]. Understanding and
improving the mechanical and electrical properties of Cu-Ni-Si-Cr alloys during aging
rely on the observation and discovery of intermetallic phases. While several stable
phases have been experimentally observed, expanding the range of stable structures is
important for simulations. Important phases that have been experimentally observed
in the Cu-Ni-Si-Cr system and influence its strengthening include Cu3Si, Cu33Si7, Ni3Si,
Ni2Si, Ni3Si2, Cr-rich clusters, Cr3Si, Cr5Si3 and fcc (Ni, Cr, Si)-rich phase [35–38].


Among these, copper-silicon based alloys are extensively researched for their wide
range of applications in high-temperature conditions and microelectronics [39]. They are
also used in the synthesis of ultrapure silicon, utilizing the binary precipitate Cu3Si and
its phases µ, µ’, and µ” [40, 41]. Cu-rich silicide phases play a significant role in various
technical applications, including Li-ion batteries [42]. Studies on the thermodynamics
and kinetics of phase formation in the Cu-rich region have utilized differential scanning
calorimetry and in situ high-resolution transmission electron microscopy. These studies
have identified binary precipitates such as Cu3Si, Cu15Si4, Cu33Si7, Cu7Si, and Cu5Si
in copper silicides [43–45, 40, 46]. The phase Cu15Si4 and other compounds like
Cu19Si6, Cu56Si11, and Cu33Si7 are considered stoichiometric due to their narrow ranges
of homogeneity [47]. Nickel silicides are commonly used as contacts in electronic
devices [48]. First principles calculations on nickel-silicon systems have identified stable
phases such as Ni3Si, Ni31Si12, Ni2Si, Ni3Si2, NiSi, NiSi2, and NiSi3, which are also
experimentally observed [49–51]. Despite nickel’s ferromagnetic nature, no magnetic
order has been found in these intermetallic stable phases [52]. Chromium silicides,
such as Cr3Si, Cr5Si3, CrSi, and CrSi2, are important transition metal silicides within
the studied system. These compounds are of particular interest for their potential use
in high-temperature structures [53–57].

Cr-Ni alloys, which are nickel-based alloys, are being considered for potential
applications in high-level nuclear waste containers [58–61]. The precipitation of CrNi2
structures requires an exceptionally slow cooling process. Cupronickel alloys, specifically
Cu-Ni binary complexes in copper matrices, exhibit promising properties for marine
applications and in corrosive environments [62]. They can take the form of nanoparticles
[63–65] or clusters [66]. While experimental results have not shown evidence of
intermetallic phases in the Cu-Ni system based on the enthalpy of formation [67],
cluster expansion methods have identified two prototypes, Cu7Ni and Cu8Ni, at low
nickel concentrations [68]. The Cu-Cr system exhibits a simple eutectic phase diagram,
with Cr-rich clusters present in the microstructure [69, 70]. These clusters are found
in the Cu matrix of Cu-Ni-Si-Cr alloys and contribute to their strengthening. No
intermetallic phases have been reported in this system based on both first principles
calculations and experiments [71–73]. More complex compounds found in the ternary
Cu-Ni-Si [74, 75] or Ni-Si-Cr [76, 77] complexes play an important role in the design of
copper-based alloys for simulations. Finally, the quaternary Cu-Ni-Si-Cr system has been
studied, where fcc-rich regions in a Cu matrix [38] have been found along with the well
known Cr-Si and Ni-Si binary precipitates [78].


The complexity of copper alloys, impurities, and their structures presents ample
opportunities for further investigation and the discovery of new materials. In silico
prediction of novel structures is now possible through advances in computational
materials research. Material databases such as AFLOWLIB [79], Materials Project
[80], ICSD [81] and OQMD [82] employ density functional theory (DFT) and high-
throughput techniques [83] like cluster expansion [84] (CE) and chemical similarity [85]
to calculate material properties of intermetallic systems. While QM algorithms show
promise, their computational demands hinder their application in high-throughput
material searches. CE has been successful in predicting ground-state energies of metallic
alloys [86], but its on-lattice nature limits its generalization. Empirical interatomic
potentials rely on system-specific parameterization and often lack transferability. The
combination of ML approaches, invariant atomistic descriptors, and data from QM
simulations has led to the development of MLIPs [87]. They overcome the limitations of
classical methods by utilizing complex functional forms that map the potential energy
surface (PES) of a system. MLIPs are crucial for accelerating high-throughput searches
for novel materials, as they approach the accuracy of QM simulations at a far lower
computational cost. This is particularly important due to the lack
of suitable potentials with first-principles accuracy [88–90].

Towards the high-throughput search of materials, different descriptor-based ML
methods along with MLIPs have been developed [91, 92]. These models replace ab initio
calculations by mapping the crystal structure, atomic positions, forces, and stresses
to the total energy of the system. Some of the novel descriptors, in combination with
regression methods, are the many-body tensor representation [93] with kernel ridge
regression, the smooth overlap of atomic positions with Gaussian process regression
[94] and the moment tensor potentials (MTPs) [95]. The latter are developed based on
a polynomial regression. In the field of high-throughput screening, AL approaches [9]
in combination with descriptors such as the MTPs have been implemented in order to
accelerate the relaxation of the typically very large pool of candidate structures.

In Chapter 5 of this thesis, the focus is on improving the understanding of bi-
nary complexes in copper alloys by examining novel stable phases. Additionally, the
metastability of quaternary phases is investigated to gain insights into the fcc (Ni, Cr,
Si)-rich phase identified by experimentalists. To achieve these goals, a comprehensive
framework that integrates QM simulations, AL, and materials data libraries is deployed.
The objective is twofold: first, to predict and expand the repertoire of stable phases
relevant to Cu-Ni-Si-Cr alloys, and second, to generate MLIPs suitable for large-scale
atomistic simulations. By employing this framework, the properties of these novel


8 Theoretical Background

prototypes, including mechanical and electronic characteristics, can be further explored.
The methodology involves conducting QM simulations for a diverse range of binary
and quaternary structures. To identify the most energetically favorable candidates,
an on-the-fly relaxation AL scheme based on MTPs is employed. This AL scheme
iteratively relaxes potential candidates towards their lowest enthalpy of formation. To
assess the stability of the resulting structures, their convex hull is analyzed, and the
phonon dispersion is examined. Based on these analyses, potentially stable candidates
for copper alloys are proposed. This approach enables the uncovering of promising
structures and the advancement of the understanding of copper alloy systems.
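
The stability analysis described above can be sketched for a generic binary system A-B. The example below is a minimal illustration, not the workflow used in this thesis: the element labels, the reference energies, and the total energies are hypothetical numbers, and the helper names are assumptions. It computes the formation enthalpy per atom, builds the lower convex hull over composition, and reports the distance of each candidate above the hull.

```python
def formation_enthalpy(e_total, counts, e_ref):
    """Formation enthalpy per atom: (E_total - sum_i n_i * E_i) / N."""
    n_atoms = sum(counts.values())
    e_elements = sum(n * e_ref[el] for el, n in counts.items())
    return (e_total - e_elements) / n_atoms


def lower_hull(points):
    """Lower convex hull of (x, h) points (Andrew's monotone chain)."""
    hull = []
    for p in sorted(points):
        # Drop the last vertex while it lies on or above the chord hull[-2] -> p.
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            cross = (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1)
            if cross <= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull


def energy_above_hull(x, h, hull):
    """Vertical distance of the point (x, h) above the piecewise-linear hull."""
    for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
        if x1 <= x <= x2:
            return h - (y1 + (y2 - y1) * (x - x1) / (x2 - x1))
    raise ValueError("composition outside the hull range")


# Hypothetical reference energies (eV/atom) and total energies (eV).
e_ref = {"A": -3.0, "B": -5.0}
candidates = {
    "A3B": ({"A": 3, "B": 1}, -14.40),
    "AB": ({"A": 1, "B": 1}, -8.10),
    "AB3": ({"A": 1, "B": 3}, -18.32),
}

points = {}
for name, (counts, e_total) in candidates.items():
    x_b = counts["B"] / sum(counts.values())  # fraction of element B
    points[name] = (x_b, formation_enthalpy(e_total, counts, e_ref))

# The pure elements define the end points of the hull at zero enthalpy.
hull = lower_hull([(0.0, 0.0), (1.0, 0.0)] + list(points.values()))
e_hull = {name: energy_above_hull(x, h, hull) for name, (x, h) in points.items()}
for name, value in e_hull.items():
    print(name, round(value, 3))
```

Structures with zero distance to the hull are thermodynamically stable with respect to decomposition into neighboring phases in this simplified picture; dynamical stability still requires the phonon analysis mentioned above.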

2.2 Nanopore Detectors

Nanopores are nanometer-sized holes in materials through which biomolecules, such as
DNA, RNA, or proteins, can be electrophoretically driven [96–98]. The passage of molecules
through a pore results in ionic current and/or electronic signals. These current drops, or
ionic current blockades, can be used for the detection of the molecules based on reading-
out their identity and sequence or on discriminating among different homopolymers
[98, 99]. The duration of each current blockade through a nanopore links to the
translocation or dwell time of a biomolecule passing through the nanopore [100].
Typically, the current signals from nanopore experiments need to be post-processed in
order to be analyzed and provide information on either the molecule type, length or
identity.

Post-processing the nanopore data typically involves the use of ML algorithms.
These attempt to translate the experimental data into base calls [101] through a proper
feature extraction and classification [102–105]. To this end, different ML schemes
have been used, ranging from Hidden-Markov-based algorithms [106–110] to neural
networks (NNs) [111–114]. Such ML techniques can play a vital role in processing
and recovering the information in the nanopore with robust statistics by creating
automated models. These have shown the potential to improve the detection accuracy
of nanopores and automatize the read-out of the DNA nucleobases towards ultra-fast
DNA sequencing. To date, algorithms have been trained to process real-time,
long-read sequencing data from the MinION nanopore device by Oxford
Nanopore Technologies [115, 116, 113, 117]. In this way, a nanopore device can efficiently identify,
for example, the position and structure of a bacterial antibiotic resistance island [118].
In order to reduce error rates in the nanopore data, DL methods such as recurrent
neural networks (RNNs) have been implemented [113]. A very important aspect in the
analysis of nanopore data is the classification based on knowledge-based descriptors
[119]. Typically, the dwell time and the mean current blockade are chosen as features,
since they point to the most probable DNA translocation paths [120, 121].
Nevertheless, it has recently been shown that
the feature ’dwell time’ is quite inefficient in clustering the different types of molecular
events through a nanopore [122]. On a time scale lower than the dwell time, important
information on molecular aspects cannot be accessed. Instead, the appropriateness and
efficiency of another ionic blockade feature, the height, have been demonstrated [122].
Based on an unsupervised clustering of the nanopore data, this feature has been shown
to clearly identify specific types of molecular events through the nanopore. Apart from
the unsupervised clustering approach utilized in the study [122] and de novo clustering
[123], most ML algorithms are primarily based on supervised learning. The latter
is focused on either improving the algorithmic scaling when processing the data or
guiding the learning process to optimize the feature space and reduce the error rates.
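
To make the two blockade features concrete, the sketch below extracts the dwell time and the blockade height from a synthetic current trace. It is a simplified illustration and not the event detection used in this thesis: a fixed threshold below a known baseline is assumed, the height is taken as the maximum drop below the baseline, and all numerical values are hypothetical.

```python
def extract_events(current, baseline, threshold, dt):
    """Detect blockade events where the current drops below baseline - threshold.

    Returns a list of (dwell_time, height) tuples, where height is the
    maximum drop below the baseline within the event.
    """
    events = []
    start = None
    for i, value in enumerate(current):
        blocked = (baseline - value) > threshold
        if blocked and start is None:
            start = i  # event begins
        elif not blocked and start is not None:
            segment = current[start:i]
            events.append(((i - start) * dt, baseline - min(segment)))
            start = None  # event ends
    if start is not None:  # event still open at the end of the trace
        segment = current[start:]
        events.append((len(segment) * dt, baseline - min(segment)))
    return events


# Synthetic trace: baseline of 10.0 nA sampled every 0.01 ms (hypothetical).
trace = [10.0] * 5 + [7.0, 6.5, 7.0] + [10.0] * 4 + [8.0] * 2 + [10.0] * 3
events = extract_events(trace, baseline=10.0, threshold=1.0, dt=0.01)
for dwell, height in events:
    print(f"dwell = {dwell:.2f} ms, height = {height:.1f} nA")
```

Each event is thereby reduced to a point in a two-dimensional feature space, which can be passed to a clustering or classification algorithm.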

Chapter 6 introduces a read-out protocol based on PDML, connecting unsupervised
and supervised learning techniques through physics-inspired features. This protocol
is designed to address the challenges posed by novel nanopore sequencers, aiming
to achieve error-free identification and detection of molecules passing through the
nanopore. The feasibility of training ML algorithms using experimental nanopore
data within a low-dimensional feature space is explored to enhance interpretability.
Furthermore, concurrent ionic and transverse currents from nanopore experiments
are jointly processed and analyzed in order to understand the importance of these
measurements and their inherent details mapped on the choice of features. The primary
objective is to identify the key factors that contribute to an improved detection process
by minimizing training errors. To accomplish this, an efficient encoding technique for
the input data is proposed, emphasizing the significance of data transformation during
the pre-processing stage. Implementing these strategies can significantly enhance
the data pipeline, accelerate DNA classification, and reduce read-out errors. By
employing the PDML read-out protocol, the aim is to enhance the interpretability of
ML algorithms trained on experimental nanopore data. This approach holds great
potential for advancing the field of nanopore sensing, enabling more accurate and
reliable identification and sequencing of molecules translocating through the nanopore.


2.3 High-Tc Superconductivity

The phenomenon of superconductivity in many materials still lacks comprehensive
theoretical support. Researchers have historically relied on empirical guidelines, such
as Matthias Rules [124, 125], derived from experimental observations due to the
absence of theory-based predictive models. These guidelines have aided the synthesis of
superconducting materials. However, a major challenge in the field is the discovery of
new candidates that exhibit superconductivity at different temperatures, including those
with high critical temperatures (referred to as high-Tc superconductors or HTS) [126].
The fundamental mechanisms underlying superconductivity and the relationships
between chemical/structural properties and Tc in these materials are not yet well
understood [127]. This knowledge gap offers an opportunity for the exploration of
novel theories and methods, including computational and data analysis techniques.

Recent advancements in ML have facilitated the investigation of the pairing mech-
anisms responsible for high-Tc superconductivity and the prediction of critical tem-
perature values [128]. Moreover, the availability of extensive materials databases,
encompassing both experimental and calculated properties, has enabled the develop-
ment of advanced data-driven ML approaches to discover potential high-Tc candidates
[129]. However, due to the complex nature of superconductivity, interpreting these
models remains challenging [130]. Descriptor-based ML approaches, leveraging chemical
and structural features, have emerged as a promising avenue [131]. These methods not
only predict the critical temperature Tc and identify potential novel superconducting
structures but also enable the study of the importance of different descriptors [131].
This insight into the physical concepts underlying superconductivity can inform the de-
velopment of novel theories. Unsupervised learning techniques, including conventional
and neural network-based clustering methods like k-means [132], DBSCAN [133], and
Self-Organizing Map (SOM) [134], have been employed to group superconductors into
meaningful classes and discover new materials categories [135]. Additionally, Convo-
lutional Neural Networks (CNNs) have been used for feature learning to effectively
distinguish between cuprate and iron-based superconductors [129].

Material features, such as elemental property statistics generated through the
Materials Agnostic Platform for Informatics and Exploration (Magpie) [136, 137],
have shown promise when combined with tree-based ML algorithms [136]. Learned
predictors using these features offer potential insights into the mechanisms governing
superconducting effects. Other approaches based on features derived from existing
databases [138, 128, 139] and/or chemical formulas [140, 141] have revealed that certain
predictors are not directly influential in superconductivity and do not significantly


2.4 Respiratory-Related Coronaviruses 11

improve the models [130]. The integration of Magpie features and structural information
through Smooth Overlap of Atomic Positions (SOAP) descriptors [142, 143] has shown
improvements over other methods [144]. Despite notable progress in applying ML
techniques to superconductivity, many models still lack sufficient prediction accuracy
or rely on high-dimensional feature spaces, which increases learning complexity.

Chapter 7 of the thesis introduces a PDML model that leverages a minimal set of
electron-specific features to accurately predict the critical temperature of superconduc-
tors. The goal of this model is to achieve high fidelity in predicting superconducting
behavior. The first aim is to assess the feature importance of electron-specific de-
scriptors by leveraging unsupervised learning techniques, particularly dimensionality
reduction through projection. The objective is to gain insights into the data’s clustering
potential within an embedding space. Additionally, SHAP (SHapley Additive exPlana-
tions) analysis is employed to further elucidate the contribution of individual features to
the clustering outcomes. To validate the novel method, it is applied to a list of
potentially new HEA superconductors and compared against other relevant methods
described in the literature, providing insight into the performance of the PDML
model in predicting the critical temperature of diverse superconducting materials.

2.4 Respiratory-Related Coronaviruses

The coronavirus SARS-CoV-2 has been spreading globally, and efforts are being
made to isolate and control its spread [145–148]. Over 22,000 genome sequences
have been collected since its identification [149]. Research studies are focused on
developing drugs or vaccines using these sequences [150, 151]. Accurate identification
and categorization of the virus are crucial for reducing the disease’s spread [152].
Algorithmic approaches, such as the UMAP algorithm, have shown promising results in
identifying SARS-CoV-2 viruses in genome datasets [153–155]. UMAP is widely used
in bioinformatics and clustering visualization. Existing methods for virus identification,
such as the one referenced in [156], are computationally complex and unsuitable for
smaller computer architectures like microcontroller chips. To enable widespread and
easier virus identification, as well as facilitate fast initial identification while reducing
complexity, straightforward and efficient approaches are required. One potential


12 Theoretical Background

solution is to employ a theory-based approach that leverages the biological information
embedded within viruses. Viral proteins are encoded by virus sequences using codons,
which are translated into amino acids. These amino acids form proteins. The
protein-coding segment is called an ORF [157, 158]. Analyzing ORFs can uncover
overlapping and hidden genes in viruses, including SARS-CoV-2 [159]. An ORF starts with a start codon,
contains the protein sequence, and ends with a stop codon. Variations in ORF regions
differentiate virus types within a family. The number of substrings, or k-mers, of length
k in a sequence is similar among viruses within the same family [160]. Techniques
focused on detecting RNA-genome substrings often employ larger k-mer sizes. In
addition, some techniques utilize natural vectors to create a vast and detailed space,
where each biological molecule is uniquely represented [161]. SARS-CoV-2 and other
SARS-type viruses have a large ORF called ORF1ab, spanning about 13,000 nucleobases
[162, 163]. ORF1ab contains essential structural proteins for virus replication [164].
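
The definition of an ORF and the counting of 3-mers (codons) can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the feature extraction used in Appendix A: only the three forward reading frames are scanned, the standard start codon AUG and stop codons UAA, UAG, and UGA are assumed, and the short RNA string is a made-up example.

```python
from collections import Counter

START = "AUG"
STOPS = {"UAA", "UAG", "UGA"}


def find_orfs(rna):
    """Return all ORFs (start codon through stop codon) in the forward frames."""
    orfs = []
    for frame in range(3):
        # Split the sequence into complete codons for this reading frame.
        codons = [rna[i:i + 3] for i in range(frame, len(rna) - 2, 3)]
        start = None
        for k, codon in enumerate(codons):
            if start is None and codon == START:
                start = k
            elif start is not None and codon in STOPS:
                orfs.append("".join(codons[start:k + 1]))
                start = None
    return orfs


def codon_counts(orf):
    """Count the 3-mers (codons) of an ORF read in frame."""
    return Counter(orf[i:i + 3] for i in range(0, len(orf) - 2, 3))


# Made-up RNA fragment containing a single short ORF.
rna = "GGAUGGCUGCUUAAGG"
orfs = find_orfs(rna)
counts = codon_counts(orfs[0])
print(orfs)
print(dict(counts))
```

The resulting codon counts form a fixed-length, biologically interpretable feature vector with at most 64 entries, independent of the genome length.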

In Appendix A, the application of PDML to the identification of SARS-CoV-2 from its
RNA is discussed. An efficient approach using genetic code rules and ORFs to encode
the entire SARS virus sequence into biological features is proposed. The method
offers greater interpretability of variations in RNA codon frequencies, also known
as codon bias [165]. To achieve this, the genetic code rules (3-mers) are utilized to
construct biology-based features, which is a natural choice. MERS-CoV, SARS-CoV,
SARS-CoV-2, and other related viruses are analyzed to identify distinct clusters. By
collecting coronavirus family data, extracting features from ORFs and codon counts,
and visualizing low-dimensional latent spaces, the goal is to achieve accurate clustering.
In order to demonstrate the effectiveness and validate the proposed approach, the
complexity and diversity of the analyzed viral RNA data are enriched. Initially, the
focus is on SARS-CoV-2, SARS-CoV, and MERS-CoV. Subsequently, the analysis is
expanded to include additional members of the coronavirus family. Finally, members
from other virus families, such as the herpes DNA virus family, are incorporated. This
progressive inclusion of diverse viral data allows for the strengthening and evaluation
of the robustness of the approach.


Chapter 3

Machine Learning

In this chapter, an introductory overview is provided for both shallow and PDML
methodologies. The focus is on elucidating the key techniques pivotal to this thesis,
which include clustering analysis, tree-based methods, DL, feature importance, PDML
and MLIPs.

3.1 Shallow Learning

ML involves computers using algorithms to optimize a performance measure on a given
task, such as character recognition, based on example data or past experience. It has evolved
as a distinct field within computer science since the 1980s, with applications in
engineering, speech and image analysis, pattern recognition, and communications
[166]. Learning algorithms enhance the efficiency of target algorithms, serving as
alternatives to conventional data extraction from simulations. ML’s ultimate goal is to
enable computers to solve problems without explicit programming, achieved through
learning rules that save computational time and improve accuracy beyond human
capability. Technologies such as image and voice recognition, personalized marketing,
and data analytics work in conjunction with ML algorithms to acquire knowledge and
glean valuable insights [167]. Shallow learning algorithms follow a pipeline: acquire,
preprocess, and transform data, select relevant features to create a feature space,
form a training set, train the algorithm to identify distinct zones, optimize by finding
similarities, and create a decision rule. ML algorithms can be categorized into five
main classes based on whether prior knowledge is required or the goal is to discover
new patterns [168]:


• Supervised learning (SL): In SL [169], samples are assigned classes (categorical
or numerical) based on known labels. These algorithms use labeled data to
automatically classify new samples. They learn from input datasets to generate
outputs, with the classification task aiming to predict labels for new inputs [170].
This is especially valuable in computational biology for predicting mechanisms
with uncertain definitions. The general mapping for supervised learning is:

Y = f(X, θ) (3.1)

where:

Y is the output or target variable,

X is the input features,

f is the model function capturing the relationship between inputs and outputs,

θ represents the parameters (weights and biases) of the model.

The objective is to find the values of θ that minimize a cost function measuring
the mismatch between predicted and actual outputs. The least squares error (LSE)
cost function is one of the most common in supervised learning. Training the model
of Eq. 3.1 then requires solving:

\theta = \arg\min_{\theta} \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \qquad (3.2)

where:

m is the number of training examples,

hθ is the predicted output, often computed using an activation function,

y(i) is the actual output.

For a general activation function denoted as σ, the predicted output hθ(X) is
calculated as hθ(X) = σ(z), where z is the linear combination of the input
features X = {x1, x2, ..., xn} and their corresponding weights θ, in addition to a
bias term θ0:

z = θ0 + θ1x1 + θ2x2 + . . .+ θnxn (3.3)


The choice of the activation function (σ) depends on the specific requirements of
the model. For example, in the case of logistic regression, the sigmoid function is
commonly used as the activation function:

\sigma(z) = \frac{1}{1 + e^{-z}} \qquad (3.4)

The interpretation of σ(z) depends on the application, often representing the
probability of an input x belonging to a class in classification tasks. The decision
boundary is determined by comparing this probability to a threshold. Differ-
ent models use varied activation functions, impacting the expressiveness and
characteristics of the model.

• Semi-Supervised learning (SSL): The goal of these algorithms is to predict
unknown labels from a dataset created for classification purposes. A trained
supervised algorithm is utilized to classify unlabeled data [171]. The most
confident unlabeled samples and their predicted labels are incorporated into the
training set.

• Unsupervised Learning (UL): Unlike SL algorithms, UL is employed when the
labels are unknown. The training set consists of unlabelled samples, which means
there are no predefined classes for dividing the feature space [167]. The purpose
of UL is to observe the underlying mechanics of the system and uncover insights
by identifying groups of samples with similar features. Some examples of UL
algorithms include clustering methods, anomaly detection algorithms, or unsupervised
versions of NNs [172]. As an example of UL, consider the k-means clustering algorithm:
given a dataset X of n samples {X1, X2, ..., Xn} and k clusters, the goal is to find
the cluster centers µ1, µ2, ..., µk and the cluster assignments that minimize the sum
of squared distances between each sample and its assigned center:

\arg\min_{C, \mu} \sum_{i=1}^{n} \lVert X_i - \mu_{c_i} \rVert^2 \qquad (3.5)

Here, C = {c1, c2, ..., cn} represents cluster assignments, and µ = {µ1, µ2, ..., µk}
are cluster centroids. In Fig 3.1, a comparison between the two main ML
methodologies is depicted.

Fig. 3.1 Instances of SL classification (left) and UL clustering (right). In SL, the
model is trained on labeled data (blue and red), allowing it to learn patterns and
relationships. On the other hand, UL involves the model adapting to unlabeled
data (grey), autonomously identifying structures and patterns without predefined
categorization [2].

• Reinforcement Learning (RL) involves an agent learning an optimal policy through
trial and error in interaction with its environment. It is used in various fields
and aims to maximize future rewards by selecting actions based on current
environmental states [167]. RL sits between supervised and unsupervised learning,
focusing on actions that contribute to a cumulative increase in reinforcement
signal values for long-term performance [173].

• Active learning (AL): This specialized area within ML is employed when obtaining
labels for a supervised task, such as regression or classification, is costly or
resource-intensive. It aims to optimize the training set by actively selecting the
most informative training samples that contribute to highly accurate predictions,
thereby minimizing the loss function of our model [174].
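
The supervised formulation of Eqs. 3.1 to 3.4 can be illustrated with a minimal sketch in plain Python. It is not the implementation used in this thesis: the toy data, the learning rate `lr`, the number of `epochs`, and all function names are assumptions made for illustration only. The model is a sigmoid unit trained by batch gradient descent on the LSE cost of Eq. 3.2.

```python
import math

def sigmoid(z):
    """Logistic activation, Eq. 3.4."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    """h_theta(x) = sigma(theta_0 + theta_1 x_1 + ... + theta_n x_n), Eqs. 3.3-3.4."""
    z = theta[0] + sum(t * xj for t, xj in zip(theta[1:], x))
    return sigmoid(z)

def lse_cost(theta, X, y):
    """Least squares error of Eq. 3.2: (1 / 2m) * sum_i (h_theta(x_i) - y_i)^2."""
    m = len(X)
    return sum((predict(theta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / (2 * m)

def fit(X, y, lr=0.5, epochs=500):
    """Batch gradient descent on the LSE cost (lr and epochs are arbitrary choices)."""
    m, n = len(X), len(X[0])
    theta = [0.0] * (n + 1)
    for _ in range(epochs):
        grad = [0.0] * (n + 1)
        for xi, yi in zip(X, y):
            h = predict(theta, xi)
            delta = (h - yi) * h * (1.0 - h)  # chain rule through the sigmoid
            grad[0] += delta
            for j, xj in enumerate(xi):
                grad[j + 1] += delta * xj
        theta = [t - lr * g / m for t, g in zip(theta, grad)]
    return theta

# Made-up, linearly separable toy data with a single feature.
X = [[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]]
y = [0, 0, 0, 1, 1, 1]
theta = fit(X, y)
# Decision rule: compare the predicted probability with a 0.5 threshold.
labels = [1 if predict(theta, xi) >= 0.5 else 0 for xi in X]
print(theta, labels, lse_cost(theta, X, y))
```

The threshold of 0.5 plays the role of the decision boundary discussed for Eq. 3.4; other thresholds or cost functions (for example cross-entropy) would be equally valid choices.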

3.1.1 Clustering Analysis

Clustering analysis categorizes dataset samples into subsets based on similarities [175].
Each subset comprises similar yet distinct samples, positioned according to distances
to cluster centroids. Centroids represent central points, with larger distances indicating
lower similarity. Clustering methods can be split into four main categories [176]:

• Hierarchical methods: These approaches form a tree-like structure by recursively
dividing the dataset. Two types exist: agglomerative, which starts with individual
samples and merges clusters iteratively, and divisive, which does the opposite
[177].


Fig. 3.2 k-means vs. DBSCAN example. DBSCAN proves adept with irregular and
diverse datasets, while k-means efficiently partitions data into k clusters based on mean
distances to centroids [3].

• Density-based methods: Unlike distance-based clustering methods, these
techniques assign samples to clusters based on the concept of density rather than
distance. One of the best-known density-based methods is DBSCAN [133]. By
utilizing density, they can identify clusters with arbitrary shapes, avoiding the
assumption of spherical clusters in the feature space. Density-based methods are
also effective in detecting outliers or anomalies within the dataset [178].

• Grid-based methods: These algorithms are an improvement over density-based
methods and are particularly suitable for datasets with higher dimensions. They
quantize the feature space into a finite number of cells using a grid-like data
structure. By operating within this grid structure, the clustering algorithm
identifies clusters in an efficient manner [179].

• Partitioning methods: These methods divide a dataset into a specified number, k,
of clusters using a distance metric, which typically results in spherical clusters.
A popular example is k-means clustering. Finding the optimal k-means partition is
an NP-hard problem, so the algorithm is an iterative heuristic that partitions the
dataset into k clusters, with the mean observation in each cluster as the centroid
[180]. It requires the desired k value as input and iteratively refines the cluster
representation that it returns as output.
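
The iterative procedure behind k-means can be sketched as follows. This is a minimal illustration in plain Python, not the clustering code used in this thesis. The deterministic farthest-point initialisation, the function names, and the toy points are assumptions chosen to keep the example reproducible; practical implementations usually rely on randomised seeding and several restarts.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def farthest_point_init(points, k):
    """Deterministic seeding (an assumption of this sketch): start from the first
    point, then repeatedly add the point farthest from the current centroids."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids

def kmeans(points, k, max_iter=100):
    """Alternate assignment and update steps until the assignments stop changing."""
    centroids = farthest_point_init(points, k)
    assign = None
    for _ in range(max_iter):
        # Assignment step: each sample goes to its nearest centroid.
        new_assign = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_assign == assign:
            break
        assign = new_assign
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the previous centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centroids

def inertia(points, assign, centroids):
    """Objective of Eq. 3.5: sum of squared distances to the assigned centroid."""
    return sum(sq_dist(p, centroids[a]) for p, a in zip(points, assign))

# Two made-up, well-separated groups of points.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
assign, centroids = kmeans(points, k=2)
print(assign, centroids, inertia(points, assign, centroids))
```

Because the objective is non-convex, the result depends on the initial centroids; a poor initialisation can leave the iteration in a local minimum even for simple data such as these.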

Fig. 3.2 depicts a comparison between the two clustering methods used in this
thesis. After clustering analysis, it is crucial to evaluate the results through clustering
validation. This is necessary because clustering algorithms yield results even when the
dataset does not naturally form distinct clusters, so evaluation is needed to measure the
method’s effectiveness. Cluster validation can be internal, assessing clustering solution
stability, or external, comparing results with other datasets and methods [181]. Key
steps in evaluating clustering performance include:

• Determining the clustering tendency of the dataset to identify non-random
structure.

• Identifying the appropriate number of clusters.

• Assessing how well the cluster analysis results fit the data independently.

• Comparing results with externally known information, like class labels.

• Comparing two sets of clusters to determine better quality or agreement with
known information [182].

3.1.2 Tree-based Methods

Tree-based methods, detailed by [183], are prominent nonparametric models that employ
decision trees to iteratively divide a training dataset into smaller, more homogeneous
subsets, effectively handling both classification and regression tasks [184]. Each node in
the tree is associated with a decision rule, guiding the distribution of the data inherited
from its parent among its children, and every leaf node, also referred to as a sub-group,
is linked to at least one data point from the original training set. The most common
criteria for node splitting are:

• Gini Impurity: Minimize Gini impurity to achieve maximum homogeneity in
subsets.

I_G(X) = 1 - \sum_{i=1}^{c} p_i^2 \qquad (3.6)

• Entropy: Maximize information gain to decrease entropy and enhance homogene-
ity.

I_E(X) = -\sum_{i=1}^{c} p_i \log_2(p_i) \qquad (3.7)

where:

X is the dataset or a specific node,

c is the number of classes,

pi is the proportion of instances of class i in the dataset or node X.

Fig. 3.3 This example employs two features (at the root and internal nodes) to classify
data into three sub-groups (leaves). Intermediate splits that do not immediately form
a leaf are internal nodes, and the lines connecting nodes and leaves are branches. Input
variables (features) are utilized to classify data into sub-groups based on binary
conditions. The sample mean of the output yt is computed from the training sample
for each sub-group and serves as a constant prediction for future observations
classified into that sub-group [4].
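
Both splitting criteria of Eqs. 3.6 and 3.7 can be evaluated directly from the class labels that reach a node. The following sketch is illustrative only; the function names and the `weighted_impurity` helper, which scores a candidate binary split by the size-weighted impurity of its two children, are assumptions and not part of the original text.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Proportion p_i of each class present in a node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini_impurity(labels):
    """Eq. 3.6: I_G = 1 - sum_i p_i^2."""
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    """Eq. 3.7: I_E = -sum_i p_i log2(p_i); absent classes contribute nothing."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))

def weighted_impurity(left, right, impurity):
    """Size-weighted impurity of the two children produced by a binary split."""
    n = len(left) + len(right)
    return (len(left) * impurity(left) + len(right) * impurity(right)) / n

node = ["a", "a", "b", "b"]
print(gini_impurity(node), entropy(node))
print(weighted_impurity(["a", "a"], ["b", "b"], gini_impurity))
```

A split is preferred when it lowers the weighted impurity of the children relative to the parent node; a perfectly separating split yields zero for both criteria.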

In Fig. 3.3, a representation example of a decision tree is depicted. The construction
of the tree entails the iterative segmentation of variables, where branches undergo
evaluation for accuracy, efficiency, and effectiveness. The subset of variables chosen to
split an internal node relies on predetermined criteria formulated as an optimization
problem. To enhance the efficiency of the tree, the strategy involves reordering variable
splits, giving priority to essential variables at the top, and eliminating irrelevant features
for a successful model. Tree-based methods are robust ML algorithms that
accommodate complex datasets and rank among the most powerful algorithms in use
today [185]. They demand minimal preparation time, dispensing with the need for
feature scaling or centering. These methods not only yield excellent predictions but
also allow the calculations behind these predictions to be scrutinized. However,
articulating the reasons behind predictions in simple terms can be challenging.
Despite their tendency to overfit, they often outperform DL on tabular data, as
highlighted by [186].

Random Forest Algorithms

Random forest (RF) methods evolved from empirical successes rather than from a sound
theory, with various parts of the algorithm remaining heuristic rather than theoretically
motivated [187]. These models combine tree predictors, with each tree depending on the
values of a random vector sampled independently and with the same distribution for all
trees [188]. Each decision tree is trained independently of the others and on distinct
subsets of the training data. For classification, the final decision is the most
frequently predicted outcome across the trees (majority vote). As the number of trees in
the forest grows, the generalization error converges to a limit. The generalization error
of a forest of tree classifiers hinges on the strength of individual trees and the degree
of correlation among them.
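
The two ingredients described above, independent training on resampled data and a majority vote, can be sketched with a deliberately simplified forest. As an assumption of this example, each "tree" is reduced to a single-split decision stump on one randomly chosen feature, which is far weaker than the full trees of an actual RF; the data, names, and the choice of 25 trees are likewise illustrative.

```python
import random
from collections import Counter

def fit_stump(X, y, feature):
    """One-split tree on a single feature, chosen to minimise training errors."""
    best = None
    for t in sorted({row[feature] for row in X}):
        left = [yi for row, yi in zip(X, y) if row[feature] <= t]
        right = [yi for row, yi in zip(X, y) if row[feature] > t]
        l_lab = Counter(left).most_common(1)[0][0]
        r_lab = Counter(right).most_common(1)[0][0] if right else l_lab
        errors = sum(yi != l_lab for yi in left) + sum(yi != r_lab for yi in right)
        if best is None or errors < best[0]:
            best = (errors, feature, t, l_lab, r_lab)
    return best[1:]

def predict_stump(stump, row):
    feature, t, l_lab, r_lab = stump
    return l_lab if row[feature] <= t else r_lab

def fit_forest(X, y, n_trees=25, seed=0):
    """Train each stump on its own bootstrap sample and random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]   # sample with replacement
        feature = rng.randrange(len(X[0]))          # random feature choice
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest

def predict_forest(forest, row):
    """Majority vote over the individual predictions."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]

# Made-up two-class data with two informative features.
X = [[0, 0], [1, 1], [2, 1], [3, 2], [6, 7], [7, 8], [8, 8], [9, 9]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(X, y)
print(predict_forest(forest, [0, 0]), predict_forest(forest, [9, 9]))
```

Individual stumps trained on different bootstrap samples place their thresholds differently, so they are only partially correlated; the vote averages out their individual errors.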

Extreme Gradient Tree Boosting: XGBoost

XGBoost (XGB), an ML technique detailed in [189], utilizes an optimized ensemble
model of classification and regression trees. It employs gradient boosting to create a
decision tree ensemble for making predictions. Gradient boosting, an ensemble learning
method for ML classification and regression problems, combines multiple decision trees
to construct a robust model for accurate predictions [190]. The algorithm builds trees
sequentially, with each tree aiming to rectify the errors of its predecessor. Noteworthy
for its outstanding performance on various standard classification benchmarks, XGB
distinguishes itself by running significantly faster than many other popular approaches,
as emphasized in [191].
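
The sequential error-correcting idea of gradient boosting can be sketched for a one-dimensional regression problem with squared loss. This is plain gradient boosting with regression stumps, not XGBoost itself, which adds regularisation, second-order information, and many engineering optimisations. The data, the learning rate `lr`, and the number of rounds are assumptions for illustration.

```python
def fit_stump(x, residuals):
    """Regression stump: the threshold and two leaf means minimising squared error."""
    best = None
    for t in sorted(set(x))[:-1]:  # thresholds that leave both sides non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]

def stump_predict(stump, xi):
    t, lm, rm = stump
    return lm if xi <= t else rm

def boost(x, y, n_rounds=50, lr=0.3):
    """Each new stump is fitted to the residuals left by the current ensemble."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + lr * stump_predict(stump, xi) for pi, xi in zip(pred, x)]
    return base, lr, stumps

def predict(model, xi):
    base, lr, stumps = model
    return base + lr * sum(stump_predict(s, xi) for s in stumps)

def mse(model, x, y):
    return sum((predict(model, xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(y)

# Made-up data: y = x^2 on a small grid.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [xi ** 2 for xi in x]
model = boost(x, y)
baseline_mse = sum((yi - sum(y) / len(y)) ** 2 for yi in y) / len(y)
print(baseline_mse, mse(model, x, y))
```

The learning rate shrinks the contribution of each stump, trading a larger number of rounds for a smoother reduction of the training error.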

3.2 Deep Learning

The DL paradigm is a sub-field of ML inspired by the structural and functional
characteristics of neurons in the human brain [167, 192, 193]. Artificial Neural Networks
(ANNs) are utilized to address non-linear regression and classification problems where
linear models are insufficient. As problems become more complex, the number of
features increases exponentially, necessitating the consideration of combinations
among them. Traditional linear ML algorithms struggle with the computational cost of
such problems. However, the development of more complex architectures has provided a
solution. The Single Layer Perceptron, introduced by Rosenblatt, was one of the earliest
neural network models [194]. This architecture consists of a unidirectional network
formed by one input layer and one output layer. With the sign function (or the sigmoid
function) as activation function, it is the simplest NN.

Fig. 3.4 Computational graph of a single perceptron with the input and output layers,
as well as the nodes and bias vectors of the layers [5].

3.2.1 Multi-Layer Perceptron

Feedforward NNs, commonly known as Multilayer Perceptrons (MLPs), are composed
of multiple layers of single perceptrons. These networks consist of an input layer, one or
more hidden layers, and an output layer. Each layer contains multiple computational
units. Fig. 3.4 illustrates an example of the simplest deep neural network (DNN),
or single perceptron, with only one hidden layer. The following terms can be identified:

• Input Nodes (xi): Representing input features, each of the n features corre-
sponds to an input node.

• Weights (Wij): Parameters learned during training; Wij denotes the weight
connecting input node i to hidden node j.

• Bias Neuron (b): An additional parameter for each hidden node, enabling the
network to shift activation output.


• Activation Function (σ): Applied to the weighted sum of input nodes plus
the bias term, common functions like sigmoid, tanh, and ReLU result in node
activation aj:

a_j = \sigma\left( \sum_{i=1}^{n} W_{ij} \cdot x_i + b_j \right) \qquad (3.8)

• Output Node (y): Producing the final result, the output node(s) apply an
activation function to the weighted sum from the last (hidden) layer.

With an adequate number of units in a single hidden layer and the appropriate
activation function, the network can approximate continuous functions arbitrarily
closely. However, empirical evidence suggests that deep networks with multiple hidden
layers exhibit improved performance and lower generalization error compared to shallow
networks with a single hidden layer. With increased computational power, training
larger networks in less time becomes feasible. This allows for efficient testing of different
structures and hyperparameters. Additionally, larger datasets contribute to better
generalization of the network’s learned patterns.
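
A forward pass through a network with one hidden layer follows directly from Eq. 3.8. In the sketch below the weights are hand-picked rather than learned, which is an assumption made so that the result can be checked by hand: with a ReLU hidden layer and a linear output, the network reproduces the XOR function, a mapping that no single perceptron without a hidden layer can represent.

```python
def relu(z):
    return max(0.0, z)

def identity(z):
    return z

def dense(inputs, W, b, activation):
    """Eq. 3.8 for a whole layer: a_j = activation(sum_i W[i][j] * x_i + b_j)."""
    return [activation(sum(W[i][j] * inputs[i] for i in range(len(inputs))) + b[j])
            for j in range(len(b))]

def mlp_forward(x, W1, b1, W2, b2):
    """Input layer -> one hidden layer (ReLU) -> output layer (linear)."""
    hidden = dense(x, W1, b1, relu)
    return dense(hidden, W2, b2, identity)

# Hand-picked (not trained) parameters; W[i][j] connects node i to node j.
W1 = [[1.0, 1.0], [1.0, 1.0]]
b1 = [0.0, -1.0]
W2 = [[1.0], [-2.0]]
b2 = [0.0]

inputs = [[0, 0], [0, 1], [1, 0], [1, 1]]
outputs = [mlp_forward(x, W1, b1, W2, b2)[0] for x in inputs]
print(outputs)
```

In practice the parameters are obtained by minimising a cost function with backpropagation; the fixed values here only serve to make the role of the hidden layer visible.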

3.3 Feature Importance: SHAP Values

SHAP (SHapley Additive exPlanations) is a potent method for establishing a hierarchy
related to feature importance. Initially introduced by Scott M. Lundberg and Su-In Lee
[195], it has become a widely employed tool among data scientists to explain individual
predictions and provide interpretability to model descriptors. The core concept in
SHAP analysis involves approximating a given model in an additive way to establish
a hierarchy of feature importance. To achieve this, a model is approximated by a
function of the type:

g(z') = \phi_0 + \sum_{i=1}^{M} \phi_i z'_i, \qquad (3.9)

where z′ ∈ {0, 1}^M represents a simplified input, and M is the number of simplified
input features. The SHAP values, denoted as ϕi ∈ R, are utilized to assess the model
output. A mapping function hx(x′) = x relates the simplified inputs used for
approximating the model to the inputs of the actual model. This mapping includes
information on the actual input, enabling easy interpretability of the approximating
model. Thus, the approximating model results in a linear combination of ϕ values,
which are determined through the following form:

\phi_i(f, x) = \sum_{z' \subseteq x'} \frac{|z'|! \, (M - |z'| - 1)!}{M!} \left[ f_x(z') - f_x(z' \setminus i) \right], \qquad (3.10)

where f refers to the actual model, and fx(z′) = f(hx(z′)) denotes the output of the
actual model for the simplified input z′. For each data point x′, new artificial data
points z′ are created by excluding features, and the model is evaluated. The differences
in the model output between the original data points and the artificial data points
yield the SHAP value for each feature i.

In the ML context, SHAP values account for varying magnitudes and signs, indicat-
ing how features contribute to the model’s output or class prediction. A positive sign
indicates a contribution to predicting a specific class, while a negative sign signifies a
contribution to predicting the opposite class. The key advantages of SHAP values lie
in their transparency, making them easily understandable. Additionally, this procedure
is model-agnostic and can be applied to approximate any model.
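
For a model with only a few features, the attribution of Eq. 3.10 can be computed exactly by enumerating all feature subsets. The sketch below makes two assumptions that are not fixed by the text: excluded features are replaced by values from a reference `baseline` input (one possible choice for the mapping hx), and the weight is written in terms of the coalition S that excludes feature i, which is the usual Shapley convention. The cost grows as 2^M, so practical SHAP implementations rely on approximations instead.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley attribution of f(x) relative to f(baseline)."""
    M = len(x)

    def f_x(present):
        # Features in `present` take their value from x, the others from the baseline.
        return f([x[j] if j in present else baseline[j] for j in range(M)])

    phi = []
    for i in range(M):
        others = [j for j in range(M) if j != i]
        total = 0.0
        for size in range(M):
            # Weight |S|! (M - |S| - 1)! / M! for coalitions S that exclude feature i.
            weight = factorial(size) * factorial(M - size - 1) / factorial(M)
            for S in combinations(others, size):
                total += weight * (f_x(set(S) | {i}) - f_x(set(S)))
        phi.append(total)
    return phi

# Hypothetical models used only to check the behaviour.
linear = lambda v: 2 * v[0] + 3 * v[1] - v[2]
product = lambda v: v[0] * v[1]

phi_linear = shapley_values(linear, [1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
phi_product = shapley_values(product, [2.0, 3.0], [0.0, 0.0])
print(phi_linear, phi_product)
```

For the linear model each attribution equals the coefficient times the feature value, and in both cases the attributions sum to the difference between the model output at x and at the baseline, which is the additive property expressed by Eq. 3.9.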

3.4 Physics-Driven Machine Learning

Physics-driven ML (PDML), a subset of scientific ML (SciML), represents the synergy
between physics and machine learning. The core principles of SciML involve addressing
the challenges presented by scientific domain knowledge and developing interpretable
and robust ML models and algorithms [196]. The integration of physics knowledge
into ML models is expected to improve accuracy, physical interpretability, model
size, complexity, sample efficiency and generalizability [7]. Physics domain knowledge is
available in various forms, including essential physical principles (e.g., ab initio or first-
principles physics), physical constraints (e.g., symmetries, invariances, conservation
laws, asymptotic limits), and valuable insights gained from theoretical or computational
studies [14]. In this thesis, the mission is to explore dimensionality reduction of collected
observational data through PDML, transitioning from data-driven to physics-driven
approaches. To better understand this transition, the interplay between data and
physics scenarios in the ML field is illustrated in Fig. 3.5.

Fig. 3.5 Data and physics scenarios [6].

The incorporation of physically relevant prior knowledge into ML algorithms can
be achieved through various high-level approaches: physics-inspired descriptors, ML
architecture, loss function, and the utilization of hybrid methodologies. To incorporate
this knowledge into our models, three primary approaches are employed according to the
categorization of bias in the ML process [1, 6]:

• Observational bias: The first approach employs equivariant operations to generate
invariant scalar features. These operations preserve the consistency and integrity of
the resulting features under data transformations, thereby safeguarding crucial physical
properties throughout the learning process. By training an ML algorithm on an
invariant, scalar, lower-dimensional feature space, it gains the capability to learn
functions, vector fields, and operators that reflect the underlying physical structure
of the data.

• Inductive bias: The second approach relies on equivariant models. These models
represent physical interactions more faithfully, ensuring that essential quantities
behave predictably under coordinate transformations.

• Learning bias: The third approach constrains the learning optimization step through
the selection of appropriate loss functions, constraints, and inference algorithms
during the training phase. These choices are tailored to steer convergence towards
solutions that are consistent with the underlying physical principles. Through the
fine-tuning of soft penalty constraints, the model can approximately satisfy the
governing physical laws, providing a versatile avenue to introduce a diverse array of
physics-based biases. As an illustrative example, we mention Physics-Informed Neural
Networks (PINNs). While not directly applied in this thesis, they are pertinent to
this section. PINNs represent a SciML technique used to tackle problems related to
Partial Differential Equations (PDEs). These networks approximate PDE solutions by
reframing the task of directly solving governing equations into an optimization problem
centered around a loss function [197].

All these methods are designed to maintain the symmetries and equivariances present in
the underlying physical systems, making them well-suited for capturing and representing
the relevant information. By leveraging these three approaches, ML methods can embed
physical domain knowledge into their frameworks, enhancing their capabilities and
enabling the application of ML techniques in a wide range of physical and scientific
disciplines.
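
The effect of a soft physics penalty in the loss function can be illustrated with a toy problem that is much simpler than a PINN. All specifics below are assumptions for illustration: the governing law is the decay equation du/dt = -K u with a known rate K, the "model" is the one-parameter family u(t) = exp(-a t) instead of a neural network, the observations carry a made-up systematic offset, and the minimisation is a plain grid search.

```python
import math

K = 1.0  # assumed known decay rate of the governing equation du/dt = -K u

def model(a, t):
    """One-parameter surrogate u(t) = exp(-a t) standing in for a network."""
    return math.exp(-a * t)

def data_loss(a, t_obs, u_obs):
    """Mean squared mismatch with the observations."""
    return sum((model(a, t) - u) ** 2 for t, u in zip(t_obs, u_obs)) / len(t_obs)

def physics_loss(a, t_col):
    """Mean squared residual of du/dt + K u = 0 at collocation points.
    For u = exp(-a t) the residual is (K - a) * exp(-a t)."""
    return sum(((K - a) * model(a, t)) ** 2 for t in t_col) / len(t_col)

def total_loss(a, t_obs, u_obs, t_col, lam):
    """Composite loss: data term plus a soft penalty weighted by lam."""
    return data_loss(a, t_obs, u_obs) + lam * physics_loss(a, t_col)

def grid_minimise(loss, lo=0.0, hi=3.0, steps=3001):
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return min(grid, key=loss)

# Made-up observations: the true solution plus a constant measurement offset.
t_obs = [0.5, 1.0, 1.5]
u_obs = [math.exp(-K * t) + 0.05 for t in t_obs]
t_col = [0.0, 0.5, 1.0, 1.5, 2.0]

a_data = grid_minimise(lambda a: data_loss(a, t_obs, u_obs))
a_phys = grid_minimise(lambda a: total_loss(a, t_obs, u_obs, t_col, lam=10.0))
print(a_data, a_phys)
```

The data-only fit absorbs the measurement offset into a biased decay rate, whereas the penalised fit is pulled towards the value dictated by the governing equation. The weight `lam` controls how strongly the physical law is enforced and has to be tuned in practice.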

3.4.1 Symmetry and Equivariance in ML

Symmetry, in the context of an object or system, refers to a transformation that
preserves a specific property, rendering it unchanged or invariant [198]. These transfor-
mations can manifest as either smooth, continuous processes or discrete operations.
Symmetries play a fundamental role in various ML tasks. Discrete symmetries natu-
rally emerge in scenarios like particle systems, where particles lack a definitive order
and can be rearranged arbitrarily. Similarly, they arise in various dynamical systems
through concepts such as time-reversal symmetry, as seen in systems adhering to
detailed balance principles or Newton’s second law of motion. Furthermore, permuta-
tion symmetries are of central importance in the analysis of data organized in graph
structures. Mathematically, symmetries are typically described using groups [7]. The
relationship between a function f and a symmetry group G can be characterized by
its equivariance properties: f is equivariant with respect to G if transforming the
input by an element of G transforms the output in the corresponding way. Invariance
is a special form of equivariance, dealing with quantities that remain unchanged
irrespective of the choice of the coordinate system. Fig. 3.6 illustrates the
distinction between invariance and equivariance. In the field of ML, a way to categorize
models into PDML or conventional ML is based on whether symmetry is employed or
equivariant operations are used [1].
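
The distinction between the two properties can be checked numerically for planar rotations. In the sketch below, which uses made-up points and hypothetical function names, the centroid of a point set is an equivariant quantity (rotating the input rotates the output), while the squared radius of gyration is invariant (rotating the input leaves the output unchanged).

```python
import math

def rotate(points, angle):
    """Apply a rotation in the plane to every point."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

def centroid(points):
    """Vector-valued and equivariant: centroid(g.x) = g.centroid(x)."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def radius_of_gyration_sq(points):
    """Scalar and invariant: unchanged by any rotation of the input."""
    cx, cy = centroid(points)
    return sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / len(points)

points = [(1.0, 0.0), (2.0, 1.0), (0.0, 3.0)]  # made-up coordinates
angle = 0.8

rotated = rotate(points, angle)
centroid_then_rotate = rotate([centroid(points)], angle)[0]
rotate_then_centroid = centroid(rotated)
print(centroid_then_rotate, rotate_then_centroid)
print(radius_of_gyration_sq(points), radius_of_gyration_sq(rotated))
```

The same test, with the rotation replaced by a permutation of the points or a translation, can be used to probe the other symmetries mentioned above.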

By utilizing an equivariant model, transforming the input results in an output
representation that undergoes the same transformation [199]. This often includes
incorporating geometric coordinates and relevant quantities crucial for describing the
system’s behavior, such as external fields or atom-wise properties like velocities. The
strength of equivariant models lies in their capacity to uphold the system’s symmetries
and invariances throughout the learning process, ensuring a robust and accurate
representation of the underlying physics. The main concern associated with equivariant
ML models revolves around the substantial technical complexity they involve. On
the other hand, an invariant function produces the same output for both transformed
and non-transformed inputs. Invariant scalar features are preferred over geometric
tensors due to their ease of handling and computational efficiency [1]. The application
of invariant models has demonstrated impressive performance across numerous existing
benchmarks, making them a compelling choice for various scientific and engineering
applications. By effectively capturing the essential invariances present in the data,
these models enable accurate and robust predictions, advancing our understanding
and problem-solving capabilities in the domain of physical systems. However, when
utilizing an invariant model, the challenge lies in devising a method to represent a
naturally equivariant physical system using invariant scalar features. This requires
careful consideration and creativity to encapsulate the crucial characteristics of the
system in a way that remains consistent under transformations.

Fig. 3.6 An example illustrating the differences between symmetry group invariance
and equivariance is presented in the context of identifying a handwritten letter in an
image [7].

It is crucial to note that, even when the ultimate goal is predicting a scalar quantity,
not all physical interactions can be adequately represented using scalars alone. The
richness and complexity of physical phenomena often necessitate a more comprehensive
representation that considers higher-order interactions and geometrical aspects. This is
precisely where equivariant models excel. By utilizing invariant and equivariant models,
researchers and practitioners can access a powerful toolset to effectively capture the
intricate dynamics of physical systems and make accurate predictions across a wide
range of scientific and engineering applications.


3.4.2 Invariant Descriptors: Learning Latent Representations

Observational data plays a fundamental role and serves as a critical foundation for
the success and recent achievements of ML algorithms [6]. Nevertheless, it is essential
to recognize that these data can also inadvertently introduce biases into the learning
process. Despite this, ML methods have proven to be remarkably powerful, particularly
when provided with sufficient data that cover the entire input domain of a learning
task. This capability enables accurate interpolation even in high-dimensional scenarios.
It is crucial to ensure that these observational data capture the underlying physical
principles governing their generation. By doing so, we can leverage these data as a
means of weakly embedding these principles into an ML model during its training
phase. Nonetheless, it is worth noting that for over-parameterized ML models, a
substantial volume of data is typically required to reinforce these biases adequately.
This reinforcement is crucial to generate predictions that respect essential symmetries
and conservation laws within the physical systems. Unfortunately, obtaining such
a large volume of data can be challenging and costly, especially in the context of
physical and engineering sciences. In many cases, observational data may be generated
through expensive experiments or large-scale computational models, making the cost
of data acquisition potentially prohibitive. As such, researchers and practitioners must
carefully consider the trade-offs between the data volume required and the resources
available for data acquisition in these applications, as mentioned in Fig. 3.5.

In handling high-dimensional unstructured data, a first step is reducing dimen-
sionality by extracting informative features [200]. These features form the foundation
for solving downstream tasks like prediction or classification, boosting efficiency and
accuracy in ML tasks. An efficient approach to tackle the data acquisition
challenge is to create a lower-dimensional tuple of physics-inspired descriptors,
generating an embedding or latent space that captures the underlying symmetries of
the system. This resultant latent space serves as a robust basis for training an ML
model, reducing the need for data augmentation and improving sample
efficiency. By leveraging the intrinsic knowledge embedded within these physics-inspired
representations, the ML model gains a deeper and more meaningful understanding of
the data. This leads to more accurate predictions and maximizes the use of available
samples, culminating in highly effective and precise machine learning applications. This
integration of physics-driven features not only streamlines the learning process but
also enhances interpretability and opens doors for innovative advancements in diverse
scientific and engineering domains.
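
As a minimal illustration of such a descriptor, the sorted list of pairwise distances of a small particle configuration is unchanged under rotations, translations, and re-ordering of the particles. This is a generic textbook-style example and not one of the descriptors developed in this thesis; the coordinates and function names are made up.

```python
import math

def rotate_z(points, angle):
    """Rotate three-dimensional points about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]

def translate(points, shift):
    return [tuple(p + d for p, d in zip(point, shift)) for point in points]

def descriptor(points):
    """Sorted pairwise distances: invariant to rotation, translation,
    and the order in which the particles are listed."""
    return sorted(math.dist(p, q)
                  for i, p in enumerate(points) for q in points[i + 1:])

# Made-up coordinates of a three-particle configuration.
atoms = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]

# Rotate, translate, and reverse the particle order.
moved = translate(rotate_z(atoms, 0.7), (1.0, -2.0, 0.5))[::-1]

print(descriptor(atoms))
print(descriptor(moved))
```

A model trained on this descriptor never has to learn the symmetries from augmented copies of the data. The price is a loss of information: sorted distances do not distinguish mirror images and, for larger systems, may not identify a configuration uniquely.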


3.4.3 Geometric Deep Learning

Introducing an inductive bias involves the development of specialized architectures
that intrinsically incorporate prior knowledge and inherent biases relevant to a specific
predictive task [6]. This concept aligns with the paradigm known as geometric deep
learning (GDL), which spans the entire spectrum of deep learning, encompassing both
Euclidean and non-Euclidean domains. GDL seamlessly integrates insights about the
structure and symmetry intrinsic to the system of interest. These domains encompass
intricate structures, such as graphs, manifolds, meshes, and string representations
[201]. Fundamentally, GDL employs techniques that establish a geometric bedrock,
entailing the assimilation of knowledge concerning the inherent spatial relationships
and symmetrical attributes present in input variables. By infusing this geometric foun-
dation, the objective is to enhance the precision of information captured by the model
[202]. Among these methods, CNNs stand out as a canonical example, fundamentally
reshaping the landscape of computer vision by adeptly preserving invariances related to
symmetrical groupings and the distributed patterns found in natural images. Further-
more, convolutional networks can be extended to accommodate additional symmetry
groups, encompassing rotations, reflections, and more intricate gauge symmetry trans-
formations. Other notable instances include graph neural networks (GNNs), equivariant
networks, kernel methods such as Gaussian processes, RNNs and Transformers [198].
GDL provides a constructive procedure for incorporating prior physical knowledge into
neural architectures.

Convolutional Neural Networks

CNNs draw inspiration from cognitive neuroscience, particularly the pioneering work
of Hubel and Wiesel on the cat’s visual cortex. Their research uncovered distinct
neuron types: simple neurons responsive to small visual patterns and complex neurons
tuned to larger motifs [203]. CNNs serve as the cornerstone of Image Classification,
dominating the landscape of Computer Vision algorithms. Moreover, they have found
promising applications in Natural Language Processing. In CNNs, the core operation
is convolution:

f_i = \sigma\left( \sum_{j} (X \circledast W_{ij}) + b_i \right) \qquad (3.11)

where filters systematically extract descriptive features by traversing the input data,
generating feature maps. These filters act as functions applied to the data. CNNs
accommodate multi-dimensional input arrays, such as two-dimensional images with


3.4 Physics-Driven Machine Learning 29

Fig. 3.7 LeNet-5: One of the earliest convolutional neural networks [5].

three color channels or one-dimensional genomic sequences with a channel for each
nucleotide [204]; in general, the input is an n-dimensional tensor. The high
dimensionality of images increases the complexity of hyperparameter tuning.
Convolutional layers, typically interleaved with pooling layers, enable the network
to learn abstract features autonomously. A typical CNN architecture is illustrated in
Fig. 3.7. Key hyperparameters, including the number of convolutional layers, the
filter count, and the filter size, need to be tuned during the validation process.
CNNs excel at detecting local patterns: the convolution operation scans through the
data and generates feature maps that feed into subsequent CNN layers.
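
As a minimal sketch of Eq. 3.11, the following NumPy code computes the feature maps of
one convolutional layer and applies a pooling step. It assumes that the index j runs
over the input channels of X, that σ is a ReLU, and that no padding is used; the
function names (conv2d_valid, conv_layer, max_pool2d) and the array sizes are
illustrative choices, not taken from the thesis.

```python
import numpy as np

def relu(z):
    """Element-wise activation (assumed choice for sigma)."""
    return np.maximum(z, 0.0)

def conv2d_valid(x, w):
    """Slide the 2D kernel w over the 2D array x without padding.

    This is the cross-correlation convention used by most deep learning
    libraries (the kernel is not flipped).
    """
    kh, kw = w.shape
    out_h = x.shape[0] - kh + 1
    out_w = x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(x[r:r + kh, c:c + kw] * w)
    return out

def conv_layer(X, W, b, activation=relu):
    """Feature maps f_i = activation(sum_j X_j (*) W_ij + b_i).

    X: (C_in, H, W_in), W: (C_out, C_in, kh, kw), b: (C_out,)
    """
    feature_maps = []
    for i in range(W.shape[0]):
        z = sum(conv2d_valid(X[j], W[i, j]) for j in range(X.shape[0])) + b[i]
        feature_maps.append(activation(z))
    return np.stack(feature_maps)

def max_pool2d(f, size=2):
    """Non-overlapping max pooling applied to each feature map."""
    c, h, w = f.shape
    h2, w2 = h // size, w // size
    f = f[:, :h2 * size, :w2 * size]
    return f.reshape(c, h2, size, w2, size).max(axis=(2, 4))

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8, 8))      # e.g. an 8x8 image with three color channels
W = rng.normal(size=(4, 3, 3, 3))   # four filters of size 3x3 per input channel
b = np.zeros(4)

f = conv_layer(X, W, b)             # feature maps, shape (4, 6, 6)
p = max_pool2d(f)                   # pooled maps, shape (4, 3, 3)
```

The filter count (4) and filter size (3x3) in this sketch are exactly the kind of
hyperparameters that would be tuned during validation.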

Recurrent Neural Networks

RNNs are another GDL architecture, tailored to data with sequential dependencies
among attributes, such as time series, text, and biological data.
Renowned for their proficiency in learning from string representations, RNNs excel in
pattern recognition across various time steps, facilitated by parameter sharing across
different model segments. In an RNN, there is a direct correspondence between the



Fig. 3.8 Time-layered architecture of an RNN [5].

layers within the network and the specific timestamp or position in the sequence [5].
An RNN comprises a variable number of layers, with each layer having a single input
corresponding to that particular timestamp. Specifically, a simple recurrent
node in an RNN is described by the following equation:

\[ h_t = \sigma\left( w_{xh} \cdot x_t + w_{hh} \cdot h_{t-1} + b_h \right) \tag{3.12} \]

where:

• $h_t$: Hidden state at time $t$.

• $\sigma$: Activation function applied element-wise.

• $w_{xh}$: Weight matrix connecting the input $x_t$ to the hidden state.

• $w_{hh}$: Weight matrix connecting the previous hidden state $h_{t-1}$ to the current
hidden state.

• $b_h$: Bias term for the hidden state.
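
The recurrence in Eq. 3.12 can be sketched in a few lines of NumPy. The example below
assumes tanh as the activation function σ and a zero initial hidden state; the function
name rnn_forward and the array sizes are hypothetical and serve only as an
illustration.

```python
import numpy as np

def rnn_forward(xs, w_xh, w_hh, b_h, h0=None, activation=np.tanh):
    """Apply h_t = activation(w_xh @ x_t + w_hh @ h_{t-1} + b_h) over a sequence.

    xs: (T, n_in), w_xh: (n_hidden, n_in), w_hh: (n_hidden, n_hidden), b_h: (n_hidden,)
    Returns the hidden states stacked as an array of shape (T, n_hidden).
    """
    h = np.zeros(w_hh.shape[0]) if h0 is None else h0
    hs = []
    for x_t in xs:  # one time layer per position in the sequence
        # The same weights are reused at every step (parameter sharing).
        h = activation(w_xh @ x_t + w_hh @ h + b_h)
        hs.append(h)
    return np.stack(hs)

rng = np.random.default_rng(0)
T, n_in, n_hidden = 5, 3, 4
xs = rng.normal(size=(T, n_in))                    # input sequence x_1 ... x_T
w_xh = 0.1 * rng.normal(size=(n_hidden, n_in))     # input-to-hidden weights
w_hh = 0.1 * rng.normal(size=(n_hidden, n_hidden)) # hidden-to-hidden weights
b_h = np.zeros(n_hidden)

hs = rnn_forward(xs, w_xh, w_hh, b_h)              # shape (T, n_hidden)
```

Because the loop length follows the length of xs, the same three parameter arrays
handle sequences of any length, which is the sense in which the number of time layers
is variable.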

RNNs can also be regarded as feed-forward networks with a specific structure rooted
in the concept of time layering, enabling them to accept a sequence of inputs and
generate a sequence of outputs. These models have proven particularly valuable
for applications involving sequence-to-sequence learning, such as machine translation
or predicting the next element in a sequence. Fig. 3.8 depicts this time-layered
representation, which distinguishes the following components:

• Input Sequence ($x_t$): Represents the input at time $t$, with $t$ running from the
start to the end of the sequence.



• Hidden State ($h_t$): Denotes the hidden state at time $t$, capturing memory
from previous steps.

• Weights ($w$): Learned parameters, including connections from input to hidden
states and recurrent connections.

• Output ($y_t$): Represents the output at time $t$, such as a prediction derived
from the inputs seen so far.

Long Short-Term Memory (LSTM) RNNs, in particular, excel at efficient parameter
sharing through gated memory mechanisms [205]. Within each LSTM cell, recurrent
units equipped with self-learned gating enable the preservation, modification, and
selective forgetting of information within a short-term memory [206]. This effectively
addresses the difficulty of learning long-term dependencies, which is
problematic in other RNN variants.
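
The gating mechanism can be illustrated with a single LSTM step. The sketch below uses
the common textbook formulation with input, forget, and output gates; this formulation,
the function name lstm_step, and the stacked weight layout are assumptions made for
illustration and are not taken from the thesis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step in the common textbook formulation (assumed here).

    W: (4n, d), U: (4n, n), b: (4n,), with the four blocks stacked in the
    order input gate, forget gate, output gate, candidate update.
    """
    n = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:n])          # input gate: how much new information to write
    f = sigmoid(z[n:2 * n])      # forget gate: how much old memory to keep
    o = sigmoid(z[2 * n:3 * n])  # output gate: how much memory to expose
    g = np.tanh(z[3 * n:4 * n])  # candidate update of the cell state
    c_t = f * c_prev + i * g     # preserve, modify, or forget stored information
    h_t = o * np.tanh(c_t)       # hidden state passed to the next time step
    return h_t, c_t

rng = np.random.default_rng(0)
d, n = 3, 4                                  # input and hidden sizes (illustrative)
W = 0.1 * rng.normal(size=(4 * n, d))
U = 0.1 * rng.normal(size=(4 * n, n))
b = np.zeros(4 * n)

h, c = np.zeros(n), np.zeros(n)
for x_t in rng.normal(size=(6, d)):          # process a sequence of six inputs
    h, c = lstm_step(x_t, h, c, W, U, b)
```

When the forget gate is close to one and the input gate close to zero, the cell state
is carried forward almost unchanged, which is how information can survive across many
time steps.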

3.5 Machine Learning Combined with Atomistic Simulations

Atomistic simulations are a key tool for exploring material mechanics. The fidelity
of simulation results relies on the interatomic potential describing atom interactions.
Classical potentials have two main limitations: limited transferability and poor
version control of the originally developed potentials [207]. The first stems from
their fixed functional forms and small number of fitting parameters. The second is
the risk of discrepancies between the
implemented potential and the original version provided by developers. Maintaining
accurate parameters over time is difficult due to file format changes, transfer errors, and
file corruption. In contrast, MLIPs [87] offer flexibility by learning from first-principles
calculations rather than relying on fixed forms [88–90]. Suc