Institute for Artificial Intelligence
Intelligent Sensing and Perception
University of Stuttgart
Universitätsstraße 32
D-70569 Stuttgart

Bachelorarbeit

Machine learning-based metabolic rate estimation from wearable sensors

Marie Olschewski

Course of Study: Informatik B.Sc.
Examiner: Jun.-Prof. Dr. Alina Roitberg
Supervisor: Jun.-Prof. Dr. Alina Roitberg
Commenced: August 5, 2024
Completed: February 19, 2025

Abstract

Adaptive devices such as exoskeletons and prostheses can enhance human physical capabilities or replace the functionality of missing body parts. However, adjusting these devices to the specific needs of an individual remains a time-consuming and costly procedure. A key objective in optimizing these devices is minimizing the user's energy expenditure (EE), a metric closely related to metabolic cost. Traditional methods for estimating metabolic cost, such as indirect calorimetry, are performed in controlled environments, limiting real-world applicability. This study aims to bridge this gap by exploring the use of traditional machine learning (ML) methods to estimate metabolic cost in real-time environments, utilizing wearable sensors integrated into adaptive devices. Using the dataset from Ingraham et al. (2019), which includes data from ten healthy subjects performing various exercises, the study investigates how different sensor combinations impact prediction accuracy. This thesis evaluated multiple ML models, including Random Forest (RF), Support Vector Machines (SVM), Linear Regression (LR), Decision Trees (DT), Multilayer Perceptrons (MLP), and AdaBoost (ADA), under two cross-validation schemes: Leave-One-Subject-Out (LOSO) and Leave-One-Task-Out (LOTO). Key findings from this evaluation include: In the LOSO setting, RF outperformed the other models, achieving the lowest RMSE in several sensor regions, including Hexoskin, EMG Pants, and Best Combination, with the 'Best Combination' region showing the best results.
In contrast, MLP performed well in the LOTO setting, with its strongest performance observed in the 'Best Combination' region. SVM demonstrated robust performance when all sensor data was combined, emphasizing the potential of multimodal sensor fusion. Hyperparameter tuning and sensor feature selection were crucial factors in optimizing model performance, particularly for more complex models like RF and MLP. The results suggest that while traditional ML methods can estimate EE effectively, challenges remain in refining preprocessing techniques, tuning hyperparameters, and optimizing sensor combinations. This thesis outlines the importance of model selection, sensor fusion, and parameter optimization in developing more accurate and real-time energy expenditure prediction systems for wearable technologies.

Kurzfassung

Adaptive devices such as exoskeletons and prostheses can enhance human physical capabilities or replace the functionality of missing body parts. However, adapting these devices to the needs of an individual remains a time-consuming and costly procedure. A key objective in optimizing these devices is minimizing the user's energy expenditure, which is closely related to metabolic cost. Traditional methods for estimating metabolic cost, such as indirect calorimetry, are performed in controlled environments, which limits their real-world applicability. This thesis aims to close this gap by investigating the application of traditional machine learning methods for estimating metabolic cost in real-time environments, using wearable sensors integrated into adaptive devices. Using the dataset from Ingraham et al.
(2019), which contains data from ten healthy subjects performing various exercises, the thesis investigates how different sensor combinations affect prediction accuracy. Several ML models were evaluated, including Random Forests (RF), Support Vector Machines (SVM), Linear Regression (LR), Decision Trees (DT), Multi-layer Perceptrons (MLP), and AdaBoost (ADA), with two cross-validation schemes: Leave-One-Subject-Out (LOSO) and Leave-One-Task-Out (LOTO). The key findings of this thesis are: With LOSO cross-validation, Random Forests (RF) outperformed the other models, achieving the lowest root mean square error (RMSE) in several sensor regions, including Hexoskin, EMG Pants, and Best Combination, with the Best Combination region showing the best results. In contrast, Multi-layer Perceptrons (MLP) performed well under LOTO validation, with the strongest performance observed in the Best Combination region. SVM achieved good performance when all sensor data were combined, underscoring the potential of combining different sensor types. Hyperparameter tuning and sensor selection were decisive factors in optimizing model performance, especially for more complex models such as RF and MLP. The results indicate that while traditional ML methods can estimate energy expenditure (EE) effectively, challenges remain in refining preprocessing techniques, tuning hyperparameters, and optimizing sensor selection. Addressing these challenges is an important part of developing more accurate real-time energy expenditure estimation systems for wearable technologies.

Contents

1 Introduction 15
1.1 Motivation 16
1.2 Thesis Objectives 16
1.3 Outline of the Thesis 17
2 Related Work 19
2.1 State of the Art 19
2.2 Human-in-the-Loop Optimization 19
2.3 Personalization of Wearable Robots 20
2.4 Energy Expenditure Estimation and Optimization 20
2.5 Metabolic Efficiency in Prostheses and Exoskeletons 20
2.6 Perturbation-Based Methods for Metabolic Cost Estimation 20
2.7 Multi-Sensor Fusion for Energy Expenditure Prediction 20
2.8 Accuracy of the SenseWear Armband during Short Bouts of Exercise 21
2.9 Advancements in Wearable Robotics for Rehabilitation 21
2.10 Sparse Sensor Systems for Motion Analysis 21
2.11 Personalized EMG Biofeedback for Exoskeleton Optimization 21
2.12 Benchmark Data 21
3 Definition of Concepts 23
3.1 Signal Preprocessing 23
3.2 Definition, Metrics and Evaluation Methods 25
3.3 Supervised Machine Learning Algorithms for Regression 27
3.4 Model Training and Evaluation 30
4 Methodology 33
4.1 Approach 33
4.2 Evaluation Criteria and Guidelines 33
4.3 Dataset 36
4.4 Implementation Workflow 42
4.5 Problems/Challenges and solutions 48
4.6 Evaluation Process 51
5 Results 55
5.1 Comparison using Pearson correlation coefficient (PCC) 55
5.2 Additional Pipeline Preprocessing 57
5.3 Tuning Hyperparameters 57
5.4 Evaluation of Various Sensor Regions 61
6 Discussion 65
6.1 Comparison using Pearson correlation coefficient (PCC) 65
6.2 Additional Pipeline Preprocessing 67
6.3 Tuning Hyperparameters 68
6.4 Evaluation of Various Sensor Regions 68
6.5 Limitations of this work 70
7 Conclusion 73
7.1 Key Findings 73
7.2 Summary 73
7.3 Limitations 74
7.4 Future Work 74
7.5 Conclusion 74
Bibliography 75

List of Figures

4.1 Hierarchical structure of the data set 36
4.2 This figure shows each signal preprocessing step regarding EMG data 44
5.1 Results of Pearson Correlation Coefficient 56
5.2 Comparison of Preprocessors for LOSO and LOTO 58
5.3 Results of untuned models for LOSO and LOTO 59
5.4 Progress over tuning phases 60
5.5 Performance of ML models for sensor regions LOSO and LOTO 64

List of Tables

4.1 Subject Demographics 37
4.2 Activities for Each Subject (First Session) 37
4.3 Activities for Each Subject (Second Session) 38
4.4 Activity Codes and Descriptions 38
4.5 Comparison of Label Structures Across Different Sensor Groups 40
4.6 Dataframe MultiIndex Structure 45
4.7 Definition of Sensor Regions 53
5.1 Results of Pearson Correlation Coefficient 56
5.2 Tuning phases parameters and results 61
5.3 Performance of ML models for sensor regions LOSO 63
5.4 Performance of ML models for sensor regions LOTO 63

Listings

3.1 Butterworth Filter 23
3.2 Gaussian Filter 24
3.3 Interpolation and Resampling 24
3.4 Pearson Correlation 26
3.5 Linear Regression Model 27
3.6 Support Vector Regression Model 28
3.7 Decision Tree Regression Model 28
3.8 Random Forest Regression Model 29
3.9 AdaBoost Regression Model 29
3.10 Multi-layer Perceptron Regression Model 29
3.11 Standard Scaler Preprocessor 30
3.12 Normalizer Preprocessor 30
3.13 Preprocessing Pipeline 30
3.14 GridSearchCV 31
4.1 Custom LOO Generator 51

List of Abbreviations

Accel acceleration. 22
ADA AdaBoost. 4, 27
APDM_Accel APDM Accelerometer. 37
BF breath frequency. 52
CV cross-validation. 25, 27
DT Decision Trees. 4, 27
EDA electrodermal activity. 22
EE energy expenditure. 4
EMG Electromyography. 22
EMG (Group) Electromyography Sensors. 34
Empatica_Accel Empatica Accelerometer. 37
Empatica_Physio Empatica Physiological Sensors. 34
GT ground truth. 22
HITL human-in-the-loop. 19
HR heart rate. 22
IMUs inertial measurement units. 15
LLE lower-limb exoskeletons. 15
LOO Leave-One-Out. 25, 27
LOSO Leave-One-Subject-Out. 4, 33
LOTO Leave-One-Task-Out. 4, 33
LR Linear Regression. 4, 16
Metabolics_Systems Metabolics Measurement System. 34
ML machine learning. 23
MLP Multi-layer Perceptrons. 4
MSE mean square error. 26
MV Minute Ventilation. 52
NAN Not A Number. 41
NRMSE negative root mean square error. 26
PCC Pearson correlation coefficient. 5, 6, 25, 26
RF Random Forests. 4
RMSE root mean square error. 4
SpO2 oxygen saturation. 22
SVM Support Vector Machines. 4, 27
Temp skin temperature. 22
VCO2 carbon dioxide production. 26
VO2 oxygen consumption. 26

1 Introduction

Wearable robots, such as exoskeletons, are emerging as powerful tools for enhancing mobility [11, 18, 37]. These devices provide movement support, enhance rehabilitation, and reduce physical strain for individuals affected by illness [26], accidents [32], or aging [19].
Mobility challenges, such as difficulty walking or climbing stairs, can significantly impact the independence and quality of life of older adults [19]. Similarly, industrial workers performing physically demanding tasks, such as heavy lifting and prolonged walking, often experience early-onset physical limitations [26]. With an aging population and stroke remaining a leading cause of mobility impairment, lower-limb rehabilitation robots have gained attention for their potential to improve gait and reduce energy expenditure in stroke patients [37]. However, their effectiveness varies among users, emphasizing the need for personalized assistance strategies [37]. To address these challenges, adaptive devices such as prostheses and exoskeletons have been developed to enhance movement and provide essential support. Myoelectric-driven upper-limb prosthetics effectively restore functionality [32], while lower-limb exoskeletons (LLE) assist users in tasks ranging from basic walking to more complex movements [19, 26]. Integrating these technologies helps users regain mobility and improve overall physical performance. A critical aspect of wearable technology is accurately estimating metabolic EE during daily activities and exercise. Many activity monitors struggle with short-duration activities, leading to imprecise EE estimations [36]. To enhance accuracy, researchers are increasingly utilizing wearable sensors, such as inertial measurement units (IMUs), for more precise EE assessment in healthcare applications [20]. Energy efficiency is another key challenge in prosthetic and wearable robotics. Reducing the user's metabolic cost by providing optimal support during various activities is essential for usability [22]. Traditional respiratory measurements provide only an average metabolic cost per movement cycle, limiting insights into the contributions of specific walking phases [10].
Additionally, minimizing energy expenditure (EE) remains complex, particularly in tasks beyond simple walking, such as navigating slopes or uneven terrain [11]. Personalization plays a crucial role in optimizing exoskeleton performance, ensuring adaptive and responsive support tailored to individual needs [18]. Research efforts focus on refining LLE to enable intelligent human-machine interaction, ultimately improving rehabilitation outcomes and mobility support [23]. The continued advancement of exoskeletons and prostheses is driven by the need for improved adaptability, energy efficiency, and personalization. These technologies aim to enhance rehabilitation, support mobility, and improve the quality of life for individuals with physical limitations. By integrating wearable robotics with intelligent control mechanisms, future innovations will further refine and expand the capabilities of these assistive devices, making them more effective and accessible to those in need.

1.1 Motivation

Accurately estimating energy expenditure (EE) is a challenging task due to the variability across individuals, influenced by factors such as demographics, physiological differences, and movement patterns. These complexities make it difficult to capture reliable EE patterns using only a limited set of sensor signals. Additionally, existing datasets for EE estimation are often tailored to specific research objectives, leading to highly specialized and limited data that may not fully represent real-world conditions. Many recent approaches have been optimized on these constrained datasets, which restrict their applicability in dynamic, real-world environments. Traditionally, EE is measured using indirect calorimetry, a breath-by-breath measurement method typically conducted in a controlled laboratory setting. However, real-world applications demand more practical and wearable solutions.
Unfortunately, many recent studies do not fully explore alternative sensor setups that could improve EE estimation while remaining lightweight and wearable. This gap limits the scalability and practicality of current methods for everyday use, such as in adaptive devices like exoskeletons and prostheses. In this context, Ingraham et al. [15] introduced a comprehensive dataset incorporating various sensors and diverse tasks, which has become a valuable resource for advancing EE estimation research. This dataset presents an opportunity to explore optimal sensor configurations and identify the most relevant features for accurate EE estimation across a wide range of individuals and movement patterns. By leveraging this dataset, the aim is to move beyond traditional linear approaches and enhance model performance, improving the generalizability of EE estimation across different subject groups. This approach seeks to pave the way for more practical, real-time, and user-friendly solutions for estimating energy expenditure, making it suitable for a variety of adaptive technologies and everyday use.

1.2 Thesis Objectives

This work aims to extend the Linear Regression (LR) statistics presented by Ingraham et al. [15] by investigating the impact of various traditional machine learning methods for estimating energy expenditure (EE). An objective is to explore optimal sensor configurations and assess how different machine learning models can improve EE estimation accuracy. To ensure consistency, the dataset is preprocessed using the same approach as the original study, and the effectiveness of these preprocessing steps is evaluated using Pearson correlation coefficients. Several supervised machine learning methods are explored, with a focus on identifying the most effective configurations and optimizations. Model performance is evaluated using the RMSE metric, allowing for a comprehensive comparison of the models' predictive accuracy.
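As an illustration of this metric (the arrays below are made-up values, not results from the dataset), RMSE can be computed with scikit-learn and NumPy:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical ground-truth and predicted EE values; for illustration only.
y_true = np.array([3.2, 4.8, 6.1, 5.0])
y_pred = np.array([3.0, 5.1, 5.8, 5.4])

# RMSE is the square root of the mean squared error.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
```

Lower RMSE values indicate predictions that lie closer to the measured ground truth, which is why it is used here to rank models.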
Additionally, sensors are grouped into different regions to analyze the performance of various sensor sets across the selected machine learning models. This approach aims to simulate realistic sensor configurations, reflecting those commonly used in practical applications, and provides insights into how different sensor combinations can enhance EE estimation performance. By leveraging this approach, this thesis seeks to improve the generalizability of EE estimation across different subject groups and real-world conditions.

1.3 Outline of the Thesis

This thesis begins with a review of recent related work, discussing specific circumstances and environments relevant for estimating EE. Chapter 3 establishes key foundations, providing detailed explanations of the definitions and methods used; this ensures a clear understanding of the underlying approach on which this work is based. Chapter 4 outlines the methodology, covering the overall approach, requirements, and guiding principles of this study. It includes an in-depth analysis of the dataset, followed by a step-by-step description of the implementation process, the challenges encountered, and the evaluation procedures. The results are presented in Chapter 5 and analyzed in Chapter 6. Finally, the key outcomes are summarized in Chapter 7.

2 Related Work

This chapter gives an overview of related work and recent optimization approaches in the field.

2.1 State of the Art

2.1.1 Adaptive Control Using Heart Rate Feedback

The study by Manzoori et al. [24] introduces an adaptive control strategy for hip exoskeletons, where heart rate (HR) feedback is used to adjust assistance during walking and other locomotor tasks such as stairs and inclines. The approach successfully reduces the oxygen cost and effort intensity, particularly during longer tasks.
HR patterns were identified as a potential tool for improving the responsiveness of the system in real time, although delays in HR response affected shorter tasks. This study emphasizes HR's potential to enhance adaptive control strategies in wearable devices, allowing for more efficient locomotion by dynamically adjusting support based on the user's effort level.

2.1.2 Human-in-the-Loop (HITL) Optimization for Knee Exoskeletons

Wang et al. [35] present a HITL optimization approach for knee exoskeletons, leveraging EMG signals from the semitendinosus and vastus medialis muscles. The study focuses on using Bayesian optimization to rapidly identify optimal assistance strategies for knee exoskeletons. By analyzing muscle cooperation and dynamically adjusting the assistance provided by the device, the study demonstrates a reduction in the convergence time for optimization, which enhances the real-time adaptability of the system. This strategy aims to improve mobility and metabolic efficiency by tailoring the exoskeleton's support to the individual user's needs.

2.2 Human-in-the-Loop Optimization

Díaz et al. [11] focused on reducing energy expenditure in lower-limb exoskeletons using an EMG-based objective function. Their motivation was to integrate EMG signals with Bayesian optimization to enhance the real-time energy efficiency of wearable robots, thus paving the way for more energy-efficient devices [11].

2.3 Personalization of Wearable Robots

Koginov et al. [18] explored the customization of wearable robots for specific tasks like downhill walking using sEMG data. Their motivation was to personalize the assistance provided by exoskeletons to improve muscle activity reduction during specific tasks, making these devices more suitable for real-world applications [18].
2.4 Energy Expenditure Estimation and Optimization

Lee and Lee [20] introduced a hybrid CNN-LSTM model for accurate energy expenditure (EE) prediction during various walking conditions using IMU data. The motivation behind their work was to develop real-time energy monitoring for rehabilitation and assistive devices, offering a model that can estimate EE without needing heel-strike detection [20].

2.5 Metabolic Efficiency in Prostheses and Exoskeletons

Li et al. [22] developed a knee-guided evolutionary algorithm (KGEA) to optimize lower-limb prostheses, focusing on improving metabolic efficiency during activities like stair climbing. The motivation was to create real-time optimization techniques for prosthetics that enhance performance and reduce the metabolic cost of walking [22].

2.6 Perturbation-Based Methods for Metabolic Cost Estimation

Dzewaltowski et al. [10] explored perturbation-based methods to estimate metabolic cost during specific phases of the walking cycle. The motivation here was to better understand which phases of walking contribute to higher energy expenditure, offering insights for designing assistive devices that optimize walking efficiency [10].

2.7 Multi-Sensor Fusion for Energy Expenditure Prediction

Xu et al. [37] developed a spatial-temporal fusion network that integrates sEMG, inertial, metabolic, and heart rate data to predict EE with high accuracy. Their motivation was to improve real-time EE estimation by leveraging multi-sensor fusion and advanced hybrid attention mechanisms to enhance prediction performance in wearable devices [37].

2.8 Accuracy of the SenseWear Armband during Short Bouts of Exercise

Wedge et al. [36] investigated the SenseWear Armband (SWA)'s ability to estimate EE during short-duration activities like walking and running.
Their motivation was to explore mobile, real-time EE monitoring solutions, demonstrating the potential of SWA for brief exercise sessions, but also emphasizing the need for improvements in its accuracy for short-term activities [36].

2.9 Advancements in Wearable Robotics for Rehabilitation

Vinjamuri [34] discussed the integration of virtual reality (VR) and active-passive exoskeletons to enhance mobility and rehabilitation outcomes. The motivation was to push forward the potential of wearable robotics for rehabilitation practices, particularly in gait recovery and functional improvement for individuals with neurological and musculoskeletal disorders [34].

2.10 Sparse Sensor Systems for Motion Analysis

Ebers et al. [12] introduced the use of Shallow Recurrent Decoders (SHRED) for reconstructing missing sensor data in motion analysis. Their motivation was to develop a framework that works with sparse sensor configurations, making motion tracking more efficient and accessible, particularly for health and performance monitoring in contexts with limited sensor availability [12].

2.11 Personalized EMG Biofeedback for Exoskeleton Optimization

Li et al. [23] explored the use of EMG biofeedback to personalize exoskeleton assistance for users with lower-limb dysfunction. The motivation was to tailor exoskeleton control based on individual muscle activity, enhancing rehabilitation and mobility assistance, and improving the user experience through personalized HITL optimization [23].

2.12 Benchmark Data

The work by Ingraham et al. (2019) [15] serves as the foundation for this bachelor thesis. Their research aimed to evaluate the predictive capabilities of various physiological signals for estimating metabolic energy cost during different exercises. The study focused on using simple signal processing techniques and prediction algorithms to assess the baseline performance of each signal.
By providing insights and a dataset collected from 10 healthy individuals performing six different activities — walking, incline walking, backward walking, running, cycling, and stair climbing — at varying intensities, Ingraham et al. highlighted the potential to improve the speed, accuracy, and portability of body-in-loop optimization algorithms. The physiological signals measured included breath frequency and volume, limb accelerations, lower limb Electromyography (EMG), heart rate (HR), electrodermal activity (EDA), skin temperature (Temp), and oxygen saturation (SpO2), with indirect calorimetry used as the ground truth (GT) for EE. Their results showed that filtering the accelerations and EMG signals improved their predictive power. Notably, global signals like HR and EDA were more sensitive to unknown subjects, while local signals such as acceleration (Accel) were more sensitive to unknown tasks. The best predictive performance was achieved by combining a small number of signals (4-5) from multiple sensor types. Their study systematically identifies the most salient physiological signals for predicting metabolic energy cost and provides valuable insights for improving HITL optimization algorithms for wearable robots and prostheses. The dataset released with the study offers a promising foundation for developing novel algorithms to predict energy cost across various applications, from wearable technology to rehabilitation. Therefore, the insights from their study will support the design and development of algorithms aimed at more accurate energy cost estimation, further advancing the field.

3 Definition of Concepts

This section outlines the algorithms and concepts used in this work, covering signal preprocessing techniques, definitions of metrics and evaluation methods, supervised machine learning (ML) algorithms for regression, and the overall structure of the model training and evaluation approach.
The default values provided by the methods used are displayed in the listings. The SciPy library [29] provides built-in tools for signal processing to optimize the signal, and the scikit-learn library [39] delivers efficient ML algorithms for regression, as well as all the necessary tools for training and evaluation.

3.1 Signal Preprocessing

Signal preprocessing improves signal quality and reliability by reducing noise from the environment, sensors, and artifacts before analysis or ML. Filters like Butterworth or Gaussian refine the signal, while vector magnitude or composite sums make the data more compact.

3.1.1 Butterworth Filter

Frequency filters like highpass ('high'), lowpass ('low'), and bandpass ('band') filters are used to isolate specific frequency ranges, enhancing signal quality for better analysis and processing. As shown in Listing 3.1, the function enables the design of a Butterworth filter for effective frequency filtering.

from scipy import signal

b, a = signal.butter(N, Wn, btype='low', analog=False, output='ba', fs=None)
y = signal.filtfilt(b, a, x, axis=-1, padtype='odd', padlen=None, method='pad', irlen=None)

Listing 3.1: Butterworth Filter

While the function can create both digital and analog filters, the focus is on digital filters, which are used for discrete and sampled data. It enables the creation of an Nth-order digital Butterworth filter and returns the corresponding filter coefficients, which are subsequently used to apply the filter. The critical frequency or frequencies (Wn) must be normalized between 0 and 1 using the formulas in Equation (3.1) and Equation (3.2), which relate the cutoff frequency, the sampling rate, and the Nyquist frequency (f_nyquist). The default output type returns the numerator (b) and denominator (a) coefficients of the filter.
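As a short usage sketch (the sampling rate, cutoff frequency, and filter order below are assumed example values, not parameters taken from this work), the cutoff is normalized by the Nyquist frequency f_s/2 before designing the filter:

```python
import numpy as np
from scipy import signal

fs = 100.0                 # assumed sampling rate in Hz
f_cutoff = 6.0             # assumed cutoff frequency in Hz
wn = f_cutoff / (fs / 2)   # normalize the cutoff by the Nyquist frequency

# Design a 4th-order digital lowpass Butterworth filter.
b, a = signal.butter(4, wn, btype='low')

# Apply it forward and backward (zero-phase) to a noisy test signal:
# a slow 2 Hz component plus a 30 Hz component acting as noise.
t = np.linspace(0, 1, int(fs), endpoint=False)
x = np.sin(2 * np.pi * 2 * t) + 0.5 * np.sin(2 * np.pi * 30 * t)
y = signal.filtfilt(b, a, x)
```

Because the 30 Hz component lies above the cutoff, it is strongly attenuated, while the 2 Hz component passes almost unchanged.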
w_n = f_cutoff / f_nyquist (3.1)

f_nyquist = f_sampling_rate / 2 (3.2)

Once the filter is designed, it can be applied to an input (x) using the second function, as shown in Listing 3.1. This function processes the signal twice: first in the forward direction, then in reverse, effectively canceling phase distortion. [5, 29]

3.1.2 Gaussian Kernel

Gaussian filtering is an effective technique for smoothing signals by reducing high-frequency noise while preserving their shape. To achieve these characteristics, the function (Listing 3.2) applies a one-dimensional filter along a specified axis.

from scipy.ndimage import gaussian_filter1d

smoothed_data = gaussian_filter1d(input, sigma=sigma, axis=-1, order=0, output=None, mode='reflect', cval=0.0, truncate=4.0, radius=None)

Listing 3.2: Gaussian Filter

The function processes an input by applying a Gaussian filter, where the specified sigma controls the level of smoothing. The value of sigma is computed using Equation (3.3). The mode parameter defines how the filter handles signal boundaries. [29]

sigma = f_sampling_rate / (2 · π · f_cutoff) (3.3)

3.1.3 Interpolating and Resampling

Interpolation is the process of estimating values at intermediate points between known data points, while resampling involves adjusting the sampling rate of the data. When working with data sampled at different rates, aligning the data is useful to ensure it is of equal length and sampled at consistent time points. For 1-D interpolation, the function in Listing 3.3 creates an interpolator that estimates values for new time data points based on the existing x and y values.
from scipy.interpolate import interp1d
interpolator = interp1d(x, y, kind='linear', axis=-1, copy=True, bounds_error=None, fill_value=nan, assume_sorted=False)
y_interpolated = interpolator(x_aligned)

Listing 3.3: Interpolation and Resampling

It takes arrays of x and y values, which approximate a function f: y = f(x), where x represents time data and y represents value data. The interpolation kind can be specified (e.g., nearest, linear, or cubic), and the fill_value parameter allows setting a constant value for out-of-bounds points or enabling extrapolation. [29]

3.1.4 Full-wave Rectifying

Full-wave rectification converts an alternating signal into a non-negative signal by flipping negative values to positive. This process retains only the magnitude of the signal, making it useful for signal analysis where the strength, rather than the direction, is important. This can be achieved by the function in Equation (3.4), which takes the absolute value of the input. [15]

y(t) = |x(t)| (3.4)

3.1.5 Vector Magnitude

The vector magnitude, given by the Euclidean norm, represents the length of a vector, simplifying multidimensional data into a single value and making it easier to interpret overall variations or effects. Equation (3.5) computes the total magnitude by summing the squared components along each axis and taking the square root, resulting in a single scalar value that represents overall intensity. [7, 15]

||v|| = sqrt(x_1^2 + x_2^2 + ... + x_n^2) (3.5)

3.1.6 Composite Sum

Composite sums are used to combine multiple data points into a single value. Instead of summing components along different axes, as in vector magnitude, data from various sensors are combined. To calculate the composite sum, as shown in Equation (3.6), the sensor data are squared, summed, and then the square root is taken, yielding a single scalar value that represents the overall measurement.
[7, 15]

M = sqrt( Σ_{i=1}^{n} M_i^2 ) (3.6)

3.2 Definition, Metrics and Evaluation Methods

This section introduces key definitions, metrics, and evaluation methods used in the analysis. It covers the Brockway energy expenditure (EE) equation, the Pearson correlation coefficient (PCC) for measuring linear relationships, the root mean square error (RMSE) for assessing prediction accuracy, and cross-validation (CV) with Leave-One-Out (LOO) for evaluating model performance.

3.2.1 Brockway Energy Expenditure Equation

EE refers to the total amount of energy the body uses over a specific period, influenced by factors such as physical activity, metabolism, and respiratory gas exchange. Expressed by Equation (3.7), the Brockway equation estimates EE, measured in kcal per unit of time, based on oxygen consumption (VO2) and carbon dioxide production (VCO2), both measured in liters per minute (L/min). [4, 25]

ee_value = 16.58 · VO2 + 4.15 · VCO2 (3.7)

3.2.2 Pearson Correlation Coefficient

The Pearson correlation coefficient (PCC) measures the linear relationship between two variables, as shown in Equation (3.8).

r = Σ(x − m_x)(y − m_y) / sqrt( Σ(x − m_x)^2 · Σ(y − m_y)^2 ) (3.8)

It calculates the sum of the product of the deviations of each data point from their respective means, divided by the product of their standard deviations. The result, ranging from −1 to +1, indicates the strength and direction of the relationship: +1 for a perfect positive correlation, −1 for a perfect negative correlation, and 0 for no linear correlation. Correlations above 0.7 indicate a strong linear relationship. The function in Listing 3.4 takes two datasets (x, y) as input and returns a PearsonRResult containing the Pearson product-moment correlation coefficient.
[15, 29]

from scipy.stats import pearsonr
corr, _ = pearsonr(x, y, *, alternative='two-sided', method=None, axis=0)

Listing 3.4: Pearson Correlation

3.2.3 Root Mean Squared Error (RMSE) Metric

RMSE is a commonly used metric that quantifies the difference between predicted and actual values in regression models, with lower values indicating better performance. First, the mean squared error (MSE) is calculated by squaring the differences between predicted and actual values, summing them, and averaging over all samples (Equation (3.9)). The RMSE is then obtained by taking the square root of the MSE (Equation (3.10)). Unlike MSE, RMSE retains the same unit as the original data, making it more intuitive for model comparison and evaluation. Additionally, RMSE penalizes large errors more heavily, which is particularly useful when dealing with outliers. Since model optimization typically involves maximizing a performance score, the negative root mean squared error (NRMSE) is often used as the negative counterpart of RMSE (Equation (3.11)). [2, 7, 15, 39]

MSE(y, ŷ) = (1/n) · Σ_{i=0}^{n−1} (y_i − ŷ_i)^2 (3.9)

RMSE(y, ŷ) = sqrt(MSE) (3.10)

NRMSE(y, ŷ) = −RMSE (3.11)

3.2.4 Cross Validation

Leave-One-Out (LOO) is a cross-validation (CV) technique where a model is trained on all data points except one, which is used for testing. This process repeats for each data point, ensuring every sample serves as a test instance once. Conceptually, LOO is equivalent to K-Fold CV with the number of splits equal to the number of data points. While LOO provides an unbiased estimate of model performance, it can be computationally expensive for large datasets due to the high number of iterations, also referred to as folds.

3.3 Supervised Machine Learning Algorithms for Regression

Supervised regression algorithms are used to predict continuous target variables based on input features.
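To make these definitions concrete before the individual algorithms are introduced, the following sketch combines LOO CV (Section 3.2.4) with the NRMSE of Section 3.2.3 via scikit-learn's built-in 'neg_root_mean_squared_error' scorer. The data and the linear model are illustrative placeholders, not part of the benchmark setup.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic regression data: 20 samples, 3 features, small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=20)

# One fold per sample; each score is the negative RMSE on the left-out point.
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(),
                         scoring='neg_root_mean_squared_error')
mean_nrmse = scores.mean()   # closer to 0 is better
```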
These methods, such as LR, Support Vector Machines (SVM), Decision Trees (DT), RF, AdaBoost (ADA), and MLP, learn from labeled data to establish relationships between inputs and outputs. Each method has specific parameters that help balance model performance and complexity, with some methods focusing on reducing the risk of overfitting and others on improving accuracy.

3.3.1 Linear Regression Model (LR)

Linear Regression (LR) (Listing 3.5) is a simple yet powerful method for modeling relationships between variables by fitting a linear equation to the data. It assumes a linear relationship between input features and the target variable. Typically, an intercept (bias) is included to account for a nonzero baseline when the data is not centered. By default, the model allows coefficients to take both positive and negative values, but setting the positive constraint to True ensures all coefficients remain non-negative. [15, 39]

from sklearn.linear_model import LinearRegression
model = LinearRegression(*, fit_intercept=True, copy_X=True, n_jobs=None, positive=False)

Listing 3.5: Linear Regression Model

3.3.2 Support Vector Machine Regression (SVM/SVR) Model

Support Vector Machine (SVM) regression (Listing 3.6) is an algorithm that identifies the optimal hyperplane to effectively separate data points from different classes, maximizing the margin between them for better generalization. The C parameter controls the balance between model complexity and training error, while epsilon defines the margin of error. kernel determines the transformation of the input data (e.g., linear, polynomial, RBF). degree applies to the polynomial kernel, influencing the complexity of the decision boundary, while gamma influences the impact of individual training points, where a small value smooths the boundary and a large value risks overfitting.
[6, 39]

from sklearn.svm import SVR
model = SVR(*, kernel='rbf', degree=3, gamma='scale', coef0=0.0, tol=0.001, C=1.0, epsilon=0.1, shrinking=True, cache_size=200, verbose=False, max_iter=-1)

Listing 3.6: Support Vector Regression Model

3.3.3 Decision Tree (DT) Regression Model

The Decision Tree (DT) (Listing 3.7) is an algorithm that splits data based on feature values to make predictions. For regression, it predicts numerical values by recursively dividing the data into subsets, starting from a root question and branching out. max_depth limits the maximum depth the tree can grow, min_samples_split specifies the minimum number of samples required to split an internal node, and min_samples_leaf determines the minimum number of samples required in a leaf node. [9, 39]

from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor(*, criterion='squared_error', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, ccp_alpha=0.0, monotonic_cst=None)

Listing 3.7: Decision Tree Regression Model

3.3.4 Random Forest (RF) Regression Model

The Random Forest (RF) (Listing 3.8) is an ensemble learning method that combines multiple DTs (Section 3.3.3) to improve the performance and robustness of the model. It generates several DTs by training each one on a random subset of the data, and the final prediction is made by averaging the predictions from all the trees. In addition to the parameters already explained in Section 3.3.3, the number of trees in the forest is controlled by the n_estimators parameter, while max_features determines how many features are considered when looking for the best split. Additionally, the bootstrap parameter defines whether bootstrap sampling is used during tree construction. When set to True, it enables sampling with replacement, ensuring that each tree is trained on a different subset of the data.
[3, 39]

from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, *, criterion='squared_error', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=1.0, max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, ccp_alpha=0.0, max_samples=None, monotonic_cst=None)

Listing 3.8: Random Forest Regression Model

3.3.5 Adaptive Boosting (AdaBoost/ADA) Regression Model

AdaBoost (ADA) (Listing 3.9) is an ensemble learning method that combines the predictions of multiple weak learners (such as DTs, Section 3.3.3) into a strong learner. It improves accuracy by focusing on the mistakes made by previous learners. ADA assigns weights to each training instance, adjusting them after each iteration to give more importance to poorly predicted data points. Each subsequent weak learner places more emphasis on the instances that are harder to predict. n_estimators controls the number of weak learners in the ensemble. The learning_rate parameter determines the weight given to each individual learner. By default, the loss function is linear, where errors are penalized linearly. [8, 39]

from sklearn.ensemble import AdaBoostRegressor
model = AdaBoostRegressor(estimator=None, *, n_estimators=50, learning_rate=1.0, loss='linear', random_state=None)

Listing 3.9: AdaBoost Regression Model

3.3.6 Multi-layer Perceptron (MLP) Regression Model

The Multi-layer Perceptron (MLP) regression model (Listing 3.10) is a neural network designed to predict continuous values from input data. It consists of multiple layers of neurons that capture complex patterns. hidden_layer_sizes defines the number of neurons in each hidden layer. While more layers and neurons can capture greater complexity, they also increase the risk of overfitting.
The activation function determines how each neuron's output is transformed (e.g., ReLU, tanh, logistic). solver specifies the optimization algorithm used to adjust the model's weights (e.g., adam, sgd, lbfgs). learning_rate controls the step size in weight updates. Finally, early_stopping halts training when the model's performance on a validation set stops improving. [13, 14, 17, 39]

from sklearn.neural_network import MLPRegressor
model = MLPRegressor(hidden_layer_sizes=(100,), activation='relu', *, solver='adam', alpha=0.0001, batch_size='auto', learning_rate='constant', learning_rate_init=0.001, power_t=0.5, max_iter=200, shuffle=True, random_state=None, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True, early_stopping=False, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08, n_iter_no_change=10, max_fun=15000)

Listing 3.10: Multi-layer Perceptron Regression Model

3.4 Model Training and Evaluation

The model training and evaluation framework provides a structured approach to training ML models, evaluating their performance, and ensuring consistency throughout the process. It integrates different models with preprocessors in pipelines, optimizing the process of model tuning and evaluation using GridSearchCV.

3.4.1 Preprocessing for Machine Learning Models

Scaling data before applying ML models is crucial to ensure that all features contribute equally to the model's learning process. Some models rely on the distance between data points, and features with larger scales may dominate the learning process, leading to biased results. Scaling also helps models converge faster during training by preventing large numerical gradients. StandardScaler (Listing 3.11) standardizes features by removing the mean and scaling to unit variance. This ensures that each feature has a mean of 0 and a standard deviation of 1, making the data centered and comparable across features.
[39]

from sklearn.preprocessing import StandardScaler
transformer = StandardScaler(*, copy=True, with_mean=True, with_std=True)

Listing 3.11: Standard Scaler Preprocessor

Normalizer (Listing 3.12) normalizes samples individually to unit norm, which scales each sample to have a norm of 1. This is useful when the focus is on the relationship or pattern between the data points, rather than their specific values. [39]

from sklearn.preprocessing import Normalizer
transformer = Normalizer(norm='l2', *, copy=True)

Listing 3.12: Normalizer Preprocessor

3.4.2 Pipelines for Training

Machine learning pipelines simplify the workflow by applying consistent processing steps to both training and testing data. Preprocessors, as described in Section 3.4.1, can be integrated to transform features before training, ensuring automated preprocessing, preventing data leakage, and ensuring reliable evaluation. The code in Listing 3.13 demonstrates how to set up a pipeline with a LR model (Section 3.3.1) and a Normalizer as the preprocessing step.

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

models = {'lin_reg': LinearRegression()}
pipelines = {}
for model_name, model in models.items():
    pipeline = Pipeline(steps=[('preprocessor', Normalizer()), ('regressor', model)])
    pipelines[model_name] = pipeline

Listing 3.13: Preprocessing Pipeline

This approach is flexible and can be applied to multiple models. By adding conditional logic, different preprocessing steps can be assigned to specific models as needed. It ensures appropriate scaling across models and datasets while enhancing code efficiency and reproducibility. [39]

3.4.3 Model Training and Evaluation Framework

GridSearchCV is a technique for hyperparameter tuning and model evaluation through CV. It tests all possible combinations of a parameter grid, training the model with each set and evaluating it using a defined scoring metric, such as NRMSE.
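Building on the pipeline dictionary of Listing 3.13, the following hedged sketch tunes one pipeline with GridSearchCV. The data, the SVR estimator, and the parameter grid are illustrative assumptions, and three folds are used instead of LOO to keep the example cheap.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Illustrative data: 40 samples, 4 features, one informative feature.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 4))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=40)

pipeline = Pipeline(steps=[('preprocessor', StandardScaler()),
                           ('regressor', SVR(kernel='rbf'))])

# Pipeline step names become parameter prefixes ('regressor__C').
param_grid = {'regressor__C': [0.1, 1.0, 10.0]}

grid_search = GridSearchCV(pipeline, param_grid,
                           scoring='neg_root_mean_squared_error', cv=3)
grid_search.fit(X, y)
best_pipeline = grid_search.best_estimator_
```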
By automating the process of evaluating hyperparameters, GridSearchCV helps identify the best model configuration and ensures optimal performance before final evaluation. The GridSearchCV object, as in Listing 3.14, is initialized with the model (estimator), a dictionary of hyperparameters (param_grid), scoring, and the number of CV folds (e.g., LOO in Section 3.2.4). After fitting, the best-performing model can be accessed via best_estimator_, while the test scores, including the mean and standard deviation, provide insight into model performance across the CV folds. By extending this approach with model and pipeline dictionaries, different models can be evaluated in one process. [39]

from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(estimator, param_grid, *, scoring=None, n_jobs=None, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)
grid_search.fit(X, y=None, **params)
best_estimator = grid_search.best_estimator_
mean_test_score = grid_search.cv_results_['mean_test_score']
std_test_score = grid_search.cv_results_['std_test_score']
split_test_score = grid_search.cv_results_['splitX_test_score']

Listing 3.14: GridSearchCV

4 Methodology

This section outlines the methodology used in this work, detailing the approach to EE estimation, dataset characteristics and challenges, implementation pipeline, and evaluation strategies. The methodology is designed to provide a structured and reproducible workflow.

4.1 Approach

To extend EE estimation beyond Linear Regression (LR), as previously explored by Ingraham et al. [15], a structured approach is followed to ensure compatibility with benchmark data and meaningful model evaluation. First, preprocessing is evaluated using the Pearson correlation coefficient (PCC), comparing unfiltered and filtered data against benchmark results.
Next, various supervised ML models (LR, SVM, DT, RF, ADA, MLP) are trained on all 16 extracted features and assessed using NRMSE across two CV strategies (Leave-One-Subject-Out (LOSO) and Leave-One-Task-Out (LOTO)). To optimize model performance, hyperparameters are tuned through GridSearchCV using a manually designed parameter grid, refined in three iterative phases, with LOSO serving as the primary evaluation method for cross-subject generalization. Finally, recognizing the constraints of real-world applications, key sensor regions are analyzed to balance estimation accuracy and practical feasibility. This structured workflow ensures a comprehensive evaluation while maintaining comparability with benchmark studies.

4.2 Evaluation Criteria and Guidelines

This section outlines the evaluation process using the benchmark dataset from Ingraham et al. [15]. The data preprocessing follows their methodology, including signal filtering, resampling, and ground truth (GT) calculation. Evaluation techniques involve CV, PCC for validation, and RMSE for assessing model performance, ensuring consistent and comparable results across various machine learning methods and sensor configurations.

4.2.1 Benchmark Dataset

The dataset used in this work is provided by Ingraham et al. [15, 16]. It was collected from 10 healthy subjects equipped with a diverse range of sensors while performing six activities at varying speeds and intensities. A detailed description of the dataset structure is provided in Section 4.3.

Features and Target Variable. Each model is trained using 16 signals, including six vector magnitudes, composite sums from two EMG signals, two Empatica data points per wrist, and four global signals. For further details, see Section 4.3.3. The target variable is the EE value, calculated using the Brockway EE Equation (3.7), which incorporates VO2 and VCO2 data.
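The construction of this target can be sketched as follows; the function name and the sample values are illustrative placeholders, and only the coefficients come from Equation (3.7).

```python
import numpy as np

def brockway_ee(vo2, vco2):
    """EE target per Equation (3.7); vo2 and vco2 in L/min."""
    return 16.58 * np.asarray(vo2) + 4.15 * np.asarray(vco2)

# Illustrative gas-exchange values for three time points (L/min).
vo2 = np.array([0.3, 0.9, 1.5])
vco2 = np.array([0.25, 0.8, 1.4])
ee = brockway_ee(vo2, vco2)   # one EE value per time point
```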
Since VO2 is highly correlated with EE, it is also included as a feature in some models to serve as a benchmark for performance evaluation.

4.2.2 Data Preprocessing

Before training the models, proper data preprocessing is crucial to ensure accuracy and reliability. The dataset, collected using wearable sensors, can exhibit issues such as sensor shifts, disconnections, and noise due to high sensor sensitivity, which captures unintended movement patterns like gait cycles. These discrepancies were identified by Ingraham et al. [15] and require appropriate handling to ensure the integrity of the analysis. The preprocessing steps followed in this work are aligned with the methods used by Ingraham et al. [15] to maintain consistency and comparability. These steps include:

• The vector magnitude of the Accel data was calculated across the three axes using Equation (3.5).

• EMG preprocessing was done in several steps. The EMG signals were filtered using a band-pass filter between 30 and 350 Hz, followed by full-wave rectification (Section 3.1.4) and low-pass filtering with a 5 Hz cutoff (Section 3.1.1). The filtered EMG signals were then normalized to the peak activation level for each muscle within the same session. Composite sums for each side were calculated using the eight EMG muscle sensors.

• Although the raw data was time-synchronized during collection, each sensor group had a different sampling rate, resulting in misaligned sampling points. To address this, the data was resampled to 1 Hz (Section 3.1.3), ensuring uniform sampling across all sensor signals.

• Additionally, the data was prepared in two versions: once unfiltered, and once filtered by applying Gaussian kernel smoothing (Section 3.1.2) before resampling.
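The EMG branch of these steps can be sketched on synthetic data as follows. The Butterworth filter order is an assumption made for the sketch, and the random input merely stands in for raw EMG recordings.

```python
import numpy as np
from scipy import signal

fs = 1000.0                                  # EMG sampling rate (1 kHz)
rng = np.random.default_rng(2)
emg = rng.normal(size=(8, 5000))             # 8 muscles, 5 s of synthetic raw EMG

# Filter designs: 30-350 Hz band-pass, 5 Hz low-pass (order 4 assumed).
b_bp, a_bp = signal.butter(4, [30, 350], btype='band', fs=fs)
b_lp, a_lp = signal.butter(4, 5, btype='low', fs=fs)

filtered = signal.filtfilt(b_bp, a_bp, emg, axis=-1)         # band-pass
rectified = np.abs(filtered)                                 # full-wave rectify (Eq. 3.4)
envelope = signal.filtfilt(b_lp, a_lp, rectified, axis=-1)   # 5 Hz envelope

# Peak normalization per muscle, then composite sum over channels (Eq. 3.6).
normalized = envelope / envelope.max(axis=-1, keepdims=True)
composite = np.sqrt((normalized ** 2).sum(axis=0))
```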
The cutoff frequencies for resampling were set to 0.1 Hz for the Electromyography Sensors (EMG (Group)) and the vector magnitude Accel signals, and 0.01 Hz for the more robust Metabolics Measurement System (Metabolics_Systems) and Empatica Physiological Sensors (Empatica_Physio) signals (such as VO2, HR, and SpO2).

The GT for EE was derived using the Brockway equation (Equation (3.7)), which calculates metabolic cost from unfiltered VO2 and VCO2 data, normalized by the subject's weight. Ingraham et al. [15] determined that the energy cost remained constant across each 6-minute activity period. They calculated the mean metabolic cost during the last 3 minutes of each activity session and extended this value to the full 6 minutes. To account for the net EE, the metabolic cost of standing at the beginning of each activity session was subtracted. This process was used to determine the net energetic cost for each activity.

4.2.3 Evaluation Criteria

This section outlines the evaluation methods used to assess the performance of the models and their predictions for EE for comparability with the benchmark.

PCC and LR. Preprocessing evaluation was conducted using the Pearson correlation coefficient (PCC) to ensure comparability with the original benchmarks. In the work of Ingraham et al. [15], both filtered and unfiltered data were used to calculate the ground truth (GT) target variable. Additionally, LR was performed on the preprocessed data, serving as a benchmark for evaluating the performance of other machine learning methods.

Metrics. To assess the accuracy of the predictions, several metrics can be used. Following [15], RMSE (Equation (3.10)) is used as the metric to evaluate model performance. RMSE helps quantify how closely the model's predictions align with the GT EE values, providing a reliable measure for comparison with benchmark data.

CV. The dataset consists of 10 subjects and six task modes, making it well-suited for LOO CV.
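Subject-wise splits of this kind map directly onto scikit-learn's LeaveOneGroupOut, sketched below on placeholder data; task-wise splits work identically with task codes as the group labels.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Placeholder data: 30 samples, labeled with 10 subject IDs (3 samples each).
X = np.arange(30).reshape(30, 1).astype(float)
subjects = np.repeat(np.arange(10), 3)

logo = LeaveOneGroupOut()
n_folds = logo.get_n_splits(X, groups=subjects)   # one fold per subject

for train_idx, test_idx in logo.split(X, groups=subjects):
    held_out = np.unique(subjects[test_idx])
    assert held_out.size == 1   # each fold holds out exactly one subject
```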
Given the limited size of the dataset, CV is crucial for improving model reliability and ensuring generalizability across different data subsets. Following the methodology defined in detail by Ingraham et al. [15], two types of CV were employed: LOSO and LOTO. LOSO involves 10 folds, one for each subject, while LOTO splits the data into six folds based on the different tasks. These approaches ensure that each sensor configuration is tested across diverse data points, minimizing bias through averaging over the test sets.

Sensor regions. In addition to analyzing the performance of individual signals, Ingraham et al. [15] evaluated four groups of sensors: Local Signals, Global Signals, a Combination of Both, and Hexoskin, included with reference to its practical application. The Hexoskin region will be included in the evaluation, along with other regions introduced in Section 4.6.

Figure 4.1: Hierarchical structure of the data set

4.3 Dataset

This section outlines the structure of the dataset provided by Ingraham et al. [16]. It is organized into several subsections, including details about the subjects, the activities, sensor and signal descriptions, label structure, and missing data. Before proceeding with the data analysis, it is important to address some missing details in the dataset provided by Ingraham et al. [15]. These issues, such as the label structure and other challenges with the raw data, should be discussed and clarified to ensure proper handling and analysis.

4.3.1 Subjects

According to Ingraham et al. [15], ten healthy subjects participated in the experimental data collection, consisting of 8 males and 2 females, with an average age of 27.4 ± 4.5 yr, height of 1.76 ± 0.09 m, and weight of 69.1 ± 9.9 kg. Further details on the demographics of each subject can be found in Table 4.1.

4.3.2 Activities

As described by Ingraham et al.
[15], each subject took part in two experimental sessions, where they performed various physical activities at different speeds and intensities (Table 4.4). In the first session, the subjects engaged in sitting, standing, level walking (Walking), incline walking (Incline), and backward walking on a treadmill (Backwards) (Table 4.2). In the second session, they participated in sitting, standing, running on a treadmill (Running), cycling on a stationary bike (Cycling), and stair climbing on a stairmill (Stairs) (Table 4.3). For each activity, the subjects typically stood for 6 minutes, performed various intensities of the activity, each lasting 6 minutes, and then rested while sitting for an additional 6 minutes. The standing (22) data serves as a benchmark and is used to calculate the net ground truth energy cost. Subsequently, both the standing (22) and sitting (23) segments are removed from each activity, leaving only the varying intensities of the activities.

Subject   | Sex    | Height (m) | Weight (kg) | Age (yr)
1         | Male   | 1.85       | 63.49       | 30
2         | Male   | 1.77       | 63.49       | 25
3         | Female | 1.73       | 71.20       | 24
4         | Male   | 1.68       | 68.03       | 25
5         | Male   | 1.83       | 68.03       | 24
6         | Male   | 1.85       | 68.03       | 25
7         | Male   | 1.85       | 95.25       | 24
8         | Male   | 1.85       | 65.76       | 33
9         | Male   | 1.73       | 68.93       | 37
10        | Female | 1.63       | 58.05       | 27
Mean ± SD | 8 Male, 2 Female | 1.76 ± 0.09 | 69.1 ± 9.9 | 27.4 ± 4.5

Table 4.1: Subject Demographics
Subject | Walking             | Incline                | Backwards          | Backwards2
1       | 22, 1, 3, 2, 23     | 22, 5, 4, 22, 6, 7, 23 | 22, 8, 9, 10, 23   |
2       | 22, 3, 2, 22, 1, 23 | 22, 5, 4, 22, 7, 6, 23 | 22, 8              | 22, 9, 10, 23
3       | 22, 1, 3, 2, 23     | 22, 4, 5, 22, 7, 6, 23 | 22, 9, 8, 23       |
4       | 22, 2, 1, 3         | 22, 4, 5, 22, 7, 6, 23 | 22, 8, 9, 23       |
5       | 22, 2, 3, 1, 23     | 22, 5, 4, 22, 6, 7, 23 | 22, 8, 9, 10, 23   |
6       | 22, 1, 2, 3, 23     | 22, 5, 4, 22, 6, 7, 23 | 22, 8, 9, 23       |
7       | 22, 3, 2, 1, 23     | 22, 5, 4, 22, 6, 7, 23 | 22, 9, 8, 23       |
8       | 22, 2, 1, 3, 23     | 22, 5, 4, 22, 6, 7, 23 | 22, 9, 8, 23       |
9       | 22, 3, 1, 2, 23     | 22, 4, 5, 22, 7, 6, 23 | 22, 9, 8, 23       |
10      | 22, 2, 1, 3, 23     | 22, 4, 5, 22, 7, 6, 23 | 22, 8, 9, 23       |

Table 4.2: Activities for Each Subject (First Session)

4.3.3 Sensors and Signals

During the experiment, the subjects were equipped with a range of sensors attached to different parts of their bodies. These sensors were grouped into the following categories: APDM Accelerometer (APDM_Accel), Empatica Accelerometer (Empatica_Accel), Empatica_Physio, Metabolics_Systems, and EMG (Group).
Subject | Running               | Cycling                | Cycling2       | Stairs
1       | 22, 11, 13, 14, 12    | 23, 15, 16, 17, 18, 23 |                |
2       | 22, 11, 13, 14, 12, 23| 23, 18, 17, 15, 16, 23 |                | 22, 19, 20, 21, 22
3       | 22, 11, 13, 14, 12, 23| 23, 18, 17, 15, 16, 23 |                | 22, 19, 20, 21, 22
4       | 22, 11, 14, 12, 13, 23| 23, 16, 17, 15, 18, 23 |                | 22, 21, 19, 20, 22
5       | 22, 11, 13, 14, 12, 23| 23, 17, 18             | 23, 15, 16, 23 | 22, 21, 19, 20, 22
6       | 22, 11, 14, 12, 13, 23| 23, 16, 18, 17, 15, 23 |                | 22, 20, 21, 19, 22
7       | 22, 11, 13, 14, 12, 23| 23, 15, 16, 18, 17, 23 |                | 22, 19, 21, 20, 22
8       | 22, 11, 13, 12, 14, 23| 23, 18, 16, 15, 17, 23 |                | 22, 21, 19, 20, 22
9       | 22, 11, 14, 12, 13, 23| 23, 15, 17, 16, 18, 23 |                | 22, 20, 19, 21, 22
10      | 22, 11, 12            | 23, 15, 17, 18, 16, 23 |                | 22, 19, 21, 20, 22

Table 4.3: Activities for Each Subject (Second Session)

Activity Code | Activity Mode | Intensity    | Speed
1             | Walking       | -            | 0.6 (m/s)
2             | Walking       | -            | 0.9 (m/s)
3             | Walking       | -            | 1.2 (m/s)
4             | Incline       | 4°           | 0.6 (m/s)
5             | Incline       | 4°           | 1.2 (m/s)
6             | Incline       | 9°           | 0.6 (m/s)
7             | Incline       | 9°           | 1.2 (m/s)
8             | Backwards     | -            | 0.4 (m/s)
9             | Backwards     | -            | 0.7 (m/s)
10            | Backwards     | -            | 1 (m/s)
11            | Walking       | -            | 1.2 (m/s)
12            | Running       | -            | 1.8 (m/s)
13            | Running       | -            | 2.2 (m/s)
14            | Running       | -            | 2.7 (m/s)
15            | Cycling       | Resistance 1 | 70 (rpm)
16            | Cycling       | Resistance 3 | 70 (rpm)
17            | Cycling       | Resistance 5 | 70 (rpm)
18            | Cycling       | Resistance 1 | 100 (rpm)
19            | Stairs        | 60 (W)       | -
20            | Stairs        | 60 (W)       | -
21            | Stairs        | 60 (W)       | -
22            | Standing      | -            | -
23            | Sitting       | -            | -

Table 4.4: Activity Codes and Descriptions

A detailed description of each sensor group is provided below. Ingraham et al. [15] reported that all sensor signals were time-synchronized during data collection, and each sensor group included a time and activity code column to align the signals with the corresponding activities.

APDM Accelerometer (APDM_Accel)

Four commercial tri-axial accelerometers were attached to the designated positions on the subject using elastic Velcro straps.
• Positions: Waist, Chest, Left Ankle, Right Ankle, Left Foot, Right Foot
• Physical Quantities and Units: Acceleration (m/s^2), Angular Velocity (rad/s), Magnetic Field (µT)
• Axis: x, y, z
• Sampling Rate: 128 Hz

Empatica Accelerometer (Empatica_Accel)

Each subject wore two bilateral wristbands, each containing tri-axial accelerometers.

• Positions: Left Wrist, Right Wrist
• Physical Quantities and Units: Acceleration (m/s^2)
• Axis: x, y, z
• Sampling Rate: 32 Hz

Empatica Physiological Sensors (Empatica_Physio)

The same wristbands used with the Empatica accelerometer also recorded electrodermal activity and skin temperature.

• Positions: Left Wrist, Right Wrist
• Physical Quantities and Units: Electrodermal Activity (µS), Skin Temperature (°C)
• Sampling Rate: 4 Hz

Metabolics Measurement System (Metabolics_Systems)

A portable respirometer measures oxygen consumption (VO2), carbon dioxide production (VCO2), breath frequency, and minute ventilation at a breath-by-breath sampling rate. Heart rate was monitored using a wireless chest strap, while blood oxygen saturation (SpO2) was measured with a pulse oximeter attached to the subject's left earlobe.

• Physical Quantities and Units: VO2 (mL/s), VCO2 (mL/s), RER, Breath Frequency (1/min), Minute Ventilation (L/min), Oxygen Saturation (SpO2) (%), Heart Rate (1/min)
• Sampling Rate: Breath-by-Breath

Electromyography (EMG)

The muscle activity from eight bilateral muscles of the lower limbs was recorded using surface EMG electrodes.

• Positions: Left Leg, Right Leg
• Physical Quantities and Units: Raw EMG (V)
• Muscles: Gluteus Maximus, Rectus Femoris, Vastus Lateralis, Semitendinosus, Biceps Femoris, Medial Gastrocnemius, Soleus, Tibialis Anterior
• Sampling Rate: 1 kHz

4.3.4 Labels

The structure of the labels is not described in detail by the authors. Upon inspection, some inconsistencies in the structure are observed, as shown in Table 4.5.
In the first row, the first entry is always Time, followed by the Activity Code. Typically, the first row describes the Sensor Position, except for the metabolic system, which does not have a position. For these, the first row is labeled Unit, whereas for all other sensors, this appears in the second row. Accelerometers also have a third row detailing the axis, and EMG data describes the muscle groups in the same row. These discrepancies should be considered when converting the data, to ensure a consistent structure across all recordings.

Sensor Group: APDM_Accel
Time (s) | Activity Code | Waist | Waist | Waist | ... (Sensor Position)
         |               | Acceleration (m/s^2) | Acceleration (m/s^2) | Acceleration (m/s^2) | ... (Physical Quantity and Unit)
         |               | x | y | z | ... (Axis)

Sensor Group: Empatica_Accel
Time (s) | Activity Code | Left Wrist | Left Wrist | Left Wrist | ... (Sensor Position)
         |               | Acceleration (m/s^2) | Acceleration (m/s^2) | Acceleration (m/s^2) | ... (Physical Quantity and Unit)
         |               | x | y | z | ... (Axis)

Sensor Group: Empatica_Physio
Time (s) | Activity Code | Left Wrist | Left Wrist | Right Wrist | ... (Sensor Position)
         |               | Electrodermal Activity (µS) | Skin Temperature (°C) | Electrodermal Activity (µS) | ... (Physical Quantity and Unit)

Sensor Group: Metabolics_System
Time (s) | Activity Code | VO2 (mL/s) | VCO2 (mL/s) | RER | ... (Physical Quantity and Unit)

Sensor Group: EMG
Time (s) | Activity Code | Left Leg | Left Leg | Left Leg | ... (Sensor Position)
         |               | Raw EMG (V) | Raw EMG (V) | Raw EMG (V) | ... (Physical Quantity and Unit)
         |               | Gluteus maximus | Rectus femoris | Vastus lateralis | ... (Muscle Type (Axis))

Table 4.5: Comparison of Label Structures Across Different Sensor Groups

4.3.5 Missing Data, Occurring NAN Values, and Split Activity Sessions

Missing Data According to the Authors. Ingraham et al. [15] note that equipment malfunction prevented the collection of complete data sets from all subjects.
Due to a failure in the right wristband worn by Subject 6 during session 2, their right wrist accelerometry, electrodermal activity, and skin temperature data for the Running, Cycling, and Stairs activities were excluded from analysis. This is reflected in the raw data, where Not A Number (NAN) series appear for each missing sensor. The authors also mention that certain signals were excluded due to electrode disconnection: Subject 5’s left gluteus maximus (Walking), Subject 3’s right gluteus maximus (Cycling), and Subject 3’s left semitendinosus (Walking). However, the sides reported by the authors differ from those observed in their dataset: Subject 5’s right gluteus maximus (Walking) and Subject 3’s right semitendinosus (Walking) are entirely NAN, while Subject 3’s left gluteus maximus (Cycling) is only partially NAN but was still completely excluded from the analysis. Furthermore, Subject 1 was unable to complete the stair-climbing activity. As a result, there is no entry for this activity in the dataset, and no NAN values need to be handled for this task.

Trailing NAN Values. The authors did not mention that some sensor recordings within the same group have unequal lengths. The recordings likely ended a few seconds apart, leading to NAN values at the tail end of some signals.

Split Activity Sessions. Furthermore, two activity sessions were split into two separate recordings, a detail not mentioned by the authors. For Subject 2 (Backwards) and Subject 5 (Cycling), the two recordings should be combined into a single activity recording to recreate the structure of the other activity recordings and simplify handling. These recordings also appear as separate columns in Table 4.2 and Table 4.3.

4.4 Implementation Workflow

The implementation is structured into three main stages: the preprocessing pipeline, its evaluation using the Pearson correlation coefficient, and the final training pipeline.
The preprocessing pipeline consists of three scripts. The first script handles data conversion and the preprocessing steps described in Section 4.2. The second script restructures the dataframe and simplifies data access within it. In the third and final preprocessing step, the data is divided into two versions, a filtered and an unfiltered dataframe, followed by interpolation and resampling. Next, the script for calculating Pearson correlation coefficients between the filtered and unfiltered data is introduced. Finally, the training pipeline, developed to meet the requirements in Chapter 4, and the evaluation framework for gathering results are explained. During the implementation of the preprocessing steps, several challenges arose; they are noted in this section and discussed in detail in Section 4.5.

4.4.1 Preprocessing Pipeline - First Script - Converting and Preprocessing

In the first preprocessing step, the MATLAB data is converted into a Pandas DataFrame, processed for each subject as outlined in Section 4.3, and then stored as Pickle files. The mat73 package is used to load the data into a dictionary, enabling efficient access to its hierarchical structure during conversion.

Preserving Data Structure. To capture the desired data structure in the new dataframe, a MultiIndex for Pandas DataFrames is used to maintain the hierarchical organization and enable easier grouping by layer. As shown in Figure 4.1, the data is structured hierarchically, with the Sensor Data divided into Labels, Sampling Rate, and Data entries. The MultiIndex is built from the categorical indices Activity Mode and Sensor Group from the hierarchical layer, Position, Unit, and Axis from Labels, along with a new Preprocessed level to mark derived data entries such as vector magnitudes, composite sums, or the ground truth. Since the data is split by subject, the first layer contains only the current subject ID and is therefore excluded from the MultiIndex at this stage.
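A minimal pandas sketch of this layout (the label tuples, values, and sampling rates below are illustrative placeholders, not entries from the actual dataset):

```python
import numpy as np
import pandas as pd

# Illustrative label tuples following the six MultiIndex levels:
# (Activity Mode, Sensor Group, Position, Unit, Axis, Preprocessed).
labels = [
    ("Walking", "APDM_Accel", "Global", "Time (s)", None, None),
    ("Walking", "APDM_Accel", "Waist", "Acceleration (m/s2)", "x", None),
    ("Walking", "APDM_Accel", "Waist", "Acceleration (m/s2)", None, "Vector Magnitude"),
]
index = pd.MultiIndex.from_tuples(
    labels,
    names=["Activity Mode", "Sensor Group", "Position", "Unit", "Axis", "Preprocessed"],
)
# One row per sensor entry: the full data array plus its sampling rate.
df = pd.DataFrame(
    {
        "Data": [np.arange(5.0), np.ones(5), np.ones(5)],
        "Sampling Rate": [128, 128, 128],
    },
    index=index,
)
# Grouping by a level, e.g. selecting all rows of one sensor group:
apdm = df.xs("APDM_Accel", level="Sensor Group")
# df.to_pickle("subject_01.pkl") would store the per-subject result.
```

Storing whole signal arrays in an object-typed Data column keeps each sensor's recording together with its labels and sampling rate, which the later scripts rely on when restructuring.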
The Data and Sampling Rate entries from Sensor Data represent the columns of the dataframe. By iterating over the alphabetically ordered Activity Mode and Sensor Group, the corresponding Labels, Sampling Rate, and Data for each combination can be extracted.

Processing Data per Sensor Group. Since the Data was extracted from MATLAB row by row, it requires reorganization to group values by column. Transposing the data ensures that each column contains the complete session’s data for a given sensor, rather than storing values by second. Next, trailing NAN values are addressed as described in Section 4.5.3. The Metabolic System and Empatica Physio data require no additional preprocessing and are directly included in the new sensor dataset. For newly derived data, such as from the accelerometers or EMG, the original data is kept in the array and the processed data is added, labeled as "Preprocessed". The sensor data size is also updated to reflect the new sensors.

Metabolic Data. The ground truth (GT) EE is derived from the VO2 and VCO2 values obtained from the Metabolic System, following Equation (3.7), and normalized to the subject’s weight (Table 4.1). Given the complexity of this process, additional details are provided in Section 4.5.5. For the Metabolic System, the Breath-by-Breath sampling rate is replaced by the computed sampling rate, as detailed in Section 4.5.4, addressing the issue of inconsistent sampling intervals for numerical resampling.

Accelerometer Data. Vector magnitudes for the accelerometer data (from APDM_Accel and Empatica_Accel) are computed along the x, y, and z axes using Equation (3.5). NAN values for Subject 6 are handled by extending the NAN column from a single axis to all three axes by replicating its data array. The NAN values persist, requiring further handling as described in Section 4.5.6.

EMG Data.
As detailed in Section 4.2.2, the EMG preprocessing pipeline first applies a 4th-order bandpass Butterworth filter (Section 3.1.1), followed by full-wave rectification of the signal (using Equation (3.4)), and then a second, 4th-order lowpass Butterworth filter. The Butterworth filter order was chosen following the approach of Slade et al. [31]. The critical frequencies are calculated from the corresponding cutoff frequencies and the sampling rate, as described in Equation (3.1). Each step of the signal processing is illustrated in Figure 4.2. The signal is then normalized using the peak activation level for each muscle on both sides. Since obtaining these values was impractical at this point, they were retrieved through a separate script, as outlined in Section 4.5.1. Finally, composite sums for each leg are calculated using Equation (3.6).

Empatica Data. No additional preprocessing is applied to the Empatica sensor group.

Since derived data is added during the iteration through the data array, the resulting array increases in size. To ensure consistency, the new data structure must align with the original format, with all data properly labeled.

Adding corresponding Labels and Sampling Rate. As outlined at the beginning of this section, the MultiIndex is structured into six levels: Activity Mode, Sensor Group, Position, Unit, Axis, and Preprocessed. Since iteration occurs over each Activity Mode and Sensor Group combination, these values are directly accessible for labeling. Across all sensor groups, the first two entries in each data array correspond to Time and Activity Code. For these labels, Position is set to Global, while Unit is assigned Time (s) and Activity Mode, respectively. The Axis and Preprocessed levels are set to None. For most sensor groups, the first and second entries in Labels are mapped to Position and Unit. However, for the Metabolic System, this order is reversed, as shown in Table 4.5.
The third label entry applies only to Accelerometer and EMG data, where it maps to Axis; for simplification, the muscle-group labels of the EMG data are also stored at this level. In all other cases, it remains None. Derived signals, such as composite sums or vector magnitudes, retain the same labels as the associated sensor, except for the Axis label, which is set to None. The Preprocessed level is updated accordingly, with labels such as Composite Sum Left, Composite Sum Right, or Vector Magnitude, depending on the signal type and side. For Ground Truth labels, Position is set to Global and Unit is labeled as Ground Truth (kJ), while Axis and Preprocessed are set to None. Rather than storing the Sampling Rate at the sensor group level, it is now assigned to each sensor entry individually, ensuring flexible access to the sampling rate across the dataset.

Figure 4.2: The individual signal preprocessing steps applied to the EMG data

MultiIndex and DataFrame. A MultiIndex is generated using the label array as categorical labels. The DataFrame is then constructed by assigning the data array to the Data column and the sampling rate array to the Sampling Rate column, using the generated MultiIndex as its index. Finally, the DataFrame is saved as a Pickle file for each subject.

4.4.2 Preprocessing Pipeline - Second Script - Restructuring

The second script simplifies the dataframe structure, as shown in Table 4.6, merges the split activities (Section 4.5.2), and removes the standing and sitting periods within each activity session. Finally, it combines the separate Pickle files of all subjects into a single file, enabling easier access within the dataframe.

Adjusting MultiIndex. To achieve the desired dataframe structure, as shown in Table 4.6, several key transformations were applied in sequence. First, the columns Name and Sensor Name were inserted at the beginning of the dataframe.
Next, the values in the Name column were generated by combining the Position, Unit, and Axis values, while the values in the Sensor Name column were derived by extracting the sensor name from the Unit column. Only the relevant unit information was retained in the Unit level, and the Sensor Name column was set as the index for better organization. The Sensor Name was added to the MultiIndex, and the levels were rearranged into the following new order: Activity Mode, Sensor Group, Position, Unit, Axis, Preprocessed, and Sensor Name. Subsequently, the Unit, Axis, and Preprocessed levels were removed from the MultiIndex, as their information had become redundant, simplifying the structure. For compactness, the sensor name Oxygen Saturation, SpO2 was shortened to SpO2. Lastly, the Subject ID was introduced as a new top-level entry in the MultiIndex, further enhancing the dataset’s organization and ease of access.

Table 4.6: Dataframe MultiIndex Structure

Subject ID | Activity Mode | Sensor Group | Position | Sensor Name
1          | Backwards     | APDM_Accel   | Global   | Time
           |               |              |          | Activity Code
           |               |              | Waist    | Acceleration - Vector magnitude
           |               |              | Chest    | Acceleration - Vector magnitude
           |               |              | ...      | ...
           |               | EMG          | Global   | Time
           |               |              |          | Activity Code
           |               |              | Left Leg | Raw EMG - Composite Sum Left
           |               |              | ...      | ...
           | Cycling       | APDM_Accel   | Global   | Time
           |               |              |          | Activity Code
           |               |              | Waist    | Acceleration - Vector magnitude
           |               |              | Chest    | Acceleration - Vector magnitude
           |               |              | ...      | ...
2          | Backwards     | APDM_Accel   | Global   | Time
...        | ...           | ...          | ...      | ...

Removing Standing and Sitting Sections. The standing and sitting periods in each activity session are only used as benchmarks for calculating the ground truth in the first script and should not be included in the evaluation. As a result, they are removed from all activity sessions. First, the split activities must be handled as outlined in Section 4.5.2. Next, the standing and sitting sections are removed from all dataframes.
The data is grouped by activity mode and sensor group, and the corresponding time and activity code arrays are retrieved. The activity code is used to identify the boundaries of the activity sections. Before removing the standing and sitting sections, the time vector is adjusted to close any gaps between activity sections. Proper adjustment of the time vector is critical, as it is used in the next script for interpolation and resampling. Gaps may arise, for instance, during the Incline activity or for Subject 2 during Walking, as shown in Table 4.2. When removing data from the middle of the dataframe, the time vector is shifted by subtracting the difference between the upper boundary value and the last value of the previous activity section. Finally, the standing and sitting sections are removed, and the time vector is reset to begin at 0.

Filtering Relevant Sensors. Only the sensor data required for the ML process is selected, rather than the full dataset. This includes the sensors ’Ground Truth’, ’Acceleration - Vector Magnitude’, ’Raw EMG - Composite Sum Left’, ’Raw EMG - Composite Sum Right’, ’Electrodermal Activity’, ’Skin Temperature’, ’SpO2’, ’Heart Rate’, ’Breath Frequency’, ’Minute Ventilation’, ’VO2’, ’Time’, and ’Activity Code’. The selected positions are ’Global’, ’Waist’, ’Chest’, ’Left Ankle’, ’Right Ankle’, ’Left Leg’, ’Right Leg’, ’Left Wrist’, and ’Right Wrist’. A new dataframe containing the data of all ten subjects is created and saved to simplify the subsequent steps.

4.4.3 Preprocessing Pipeline - Third Script - Smoothing and Resampling

In the third and final script of the preprocessing pipeline, a Gaussian kernel can be applied to produce the filtered data, or the data can be left unfiltered. Resampling is performed in both cases.

Applying the Gaussian Kernel. When smoothing is applied, a Gaussian kernel is used for each activity mode and sensor group.
The cutoff frequency is determined as described in Section 4.2.2. The kernel is then applied as outlined in Section 3.1.2, using the cutoff frequency to calculate the sigma that is passed to the filter. ’Time’, ’Activity Code’, and ’Ground Truth’ are excluded from smoothing, as these values should remain unchanged.

Generating Uniform Time Vectors. To ensure consistency during resampling, a uniform time vector is generated for each subject and activity mode. This vector serves as the new set of sampling points. The process begins by grouping the data by activity mode and extracting the time arrays. The latest start time and earliest end time across all vectors are then identified. Based on these, a uniform time vector with 1-second intervals (1 Hz) is created.

Interpolation and Resampling. For interpolation and resampling, the corresponding uniform time vector and original time array are retrieved for each sensor group and activity mode. By default, linear interpolation is applied, while nearest-neighbor interpolation is used for Activity Code and Ground Truth to preserve their discrete values. An interpolator, as described in Section 3.1.3, is instantiated using the original time vector, the signal data, the interpolation type, and extrapolation as the fill value. Resampling is then carried out by passing the uniform time vector to the interpolator for each activity mode.

4.4.4 Preprocessing Evaluation - Pearson Correlation Coefficient

The impact of preprocessing is evaluated by computing the Pearson correlation coefficient for both filtered and unfiltered data, as defined in Equation (3.8). Correlation values are first calculated for each sensor and then averaged across all subjects. To ensure proper data handling, sensor values are stored separately. Filtering is first applied at the subject ID level, followed by the extraction of the GT values, with the original GT array preserved for potential restoration.
The data is further refined by filtering based on position and sensor name, while excluding non-sensor columns such as activity codes and timestamps, retaining only the relevant signal data. Before computing the correlation, NAN values are removed, as outlined in Section 4.3.5, to prevent interference with the calculations. The PCC is then computed by comparing the signal data with the ground truth values. This is applied to both raw and preprocessed data to maintain consistency in the analysis. This structured approach ensures a reliable evaluation of sensor data across multiple subjects and activities.

4.4.5 Training Pipeline - Framework and Evaluation

This section describes the customizable options of the training pipeline, which allows for flexible data collection modes, the inclusion of VO2 measurements, and the selection of CV strategies (LOSO or LOTO). The dataset is prepared based on the chosen mode, with features and target variables selected accordingly. Several ML models can be trained, with preprocessing tailored to each model. The GridSearchCV framework (Section 3.4.3) is used for model optimization, ensuring a systematic evaluation of the sensor data.

Customization Options. The training pipeline offers flexible configuration based on the specific needs of the experiment. The customization options include selecting the mode of data collection, which can be set to ’All’, ’Hexoskin’, ’Smart Watch’, ’EMG Pants’, ’Watch and Pants’, or ’Best Combination’, as detailed in Table 4.7. Additionally, the pipeline allows for the inclusion or exclusion of VO2 measurements in the analysis and offers a choice between two CV strategies: LOSO and LOTO. A set of models can be selected for evaluation, or alternatively, a single model can be chosen for both evaluation and optimization. Furthermore, different preprocessing pipelines, as described in Section 3.4.2, can be tested by substituting the appropriate preprocessors.

Setup before Evaluation.
Depending on the selected mode, the corresponding features and the target variable GT are selected for evaluation. The benchmark VO2 is optionally included in the features. To prepare the dataset for training, the data is structured according to the selected mode of data collection. The training set consists of the sensor features (X) corresponding to the chosen positions, while the target variable (y) is only the GT column. Before proceeding with model training, all dataframes are converted into NumPy arrays to enhance computational efficiency. The dataset is then reshaped based on the selected CV strategy. In the case of LOSO, the data is pivoted with ’Activity Mode’ as the index and the sensor readings as values. The resulting arrays are expanded into row-wise structures, and any NAN values, particularly those from right-wrist sensors, are appropriately handled. The index used for pivoting is subsequently dropped. A similar approach is applied for LOTO, except that ’Subject ID’ serves as the index before being removed.

Machine Learning Model. For model training, several ML algorithms are provided, including LR, SVM, DT, RF, ADA, and MLP, as introduced in Section 3.3. Each model is integrated into a dedicated pipeline, as in Section 3.4.2, allowing for tailored preprocessing steps specific to the requirements of each ML approach. With a parameter grid, the ML models can be evaluated across all combinations of the parameters, helping to identify the best settings.

Preprocessing Pipeline. Certain models require additional preprocessing to transform the data before training, as they assume the data to be scaled or normalized. By iterating over the models, each pipeline is finalized, and a new ML model is instantiated and trained for each one. With a custom LOO method, as detailed in Section 4.5.7, the desired CV behavior can be achieved, yielding ten folds for the LOSO and six for the LOTO CV.
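The per-model pipelines and fold counts described above can be sketched as follows (the model selection and preprocessing steps shown are illustrative assumptions, not the exact configuration of Section 3.4.2):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

def build_pipelines():
    # Scale-sensitive models (e.g. SVM or MLP) receive a scaler step,
    # while tree-based models such as RF can consume the raw features.
    return {
        "SVM": Pipeline([("scaler", StandardScaler()), ("model", SVR())]),
        "RF": Pipeline([("model", RandomForestRegressor(random_state=0))]),
    }

def fold_count(cv_strategy):
    # Ten held-out subjects for LOSO, six held-out activities for LOTO.
    return 10 if cv_strategy == "LOSO" else 6
```

Wrapping each model in its own Pipeline lets the preprocessor be swapped per model without changing the surrounding training loop.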
The parameter grid for hyperparameter tuning is extracted but remains empty in this case, as the models are trained with their default settings.

GridSearchCV framework. Finally, the GridSearchCV framework, as described in Section 3.4.3, is used to train and validate the models. The pipeline, parameter grid, and custom LOO strategy are incorporated into GridSearchCV, with scoring based on the negative root mean squared error (NRMSE) (Equation (3.11)), since GridSearchCV maximizes the score while the objective is to minimize the error. The model is then fit using the preprocessed training data (X_train and y_train), ensuring a structured and systematic evaluation of the sensor data across multiple subjects and activities.

Training results. To obtain the results of the ML models, the mean_test_score and std_test_score are extracted from grid_search.cv_results_, and the split_test_scores are additionally saved for each individual fold. Finally, the results are stored as a JSON file to allow easy access to the dictionary when needed.
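Putting these pieces together, a self-contained sketch of the GridSearchCV setup (the data, model, and grid values are illustrative; the splitter mirrors the custom LOO generator of Section 4.5.7):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

def loo_splits(group_sizes):
    # Yield one (train, test) split per contiguous group of samples,
    # i.e. per subject (LOSO) or per activity mode (LOTO).
    total, start = int(np.sum(group_sizes)), 0
    for size in group_sizes:
        test = np.arange(start, start + size)
        yield np.setdiff1d(np.arange(total), test), test
        start += size

rng = np.random.default_rng(0)
X, y = rng.random((30, 4)), rng.random(30)   # 3 groups of 10 samples each
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [2, 4]},        # illustrative grid
    cv=list(loo_splits([10, 10, 10])),
    scoring="neg_root_mean_squared_error",   # maximized, so RMSE is minimized
)
grid.fit(X, y)
mean_scores = grid.cv_results_["mean_test_score"]
```

The negative sign lets GridSearchCV's score maximization minimize the RMSE; the fitted search exposes mean_test_score, std_test_score, and the per-fold split scores via cv_results_, which can then be dumped to JSON.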
4.5 Problems/Challenges and Solutions

This section addresses implementation challenges encountered across all scripts, including handling trailing NAN values, accurately calculating the ground truth, handling split datasets, determining the metabolic system’s sampling rate, extracting peak activation levels from EMG data, and applying a custom LOO CV approach.

4.5.1 Get the Peak Activation Level for EMG Data

A key step in preprocessing EMG data is normalizing it to the peak activation level of each muscle group per session. The challenge is that during the first script, only the current activity and sensor group are accessible before saving the Pickle file, whereas calculating peak activation levels requires simultaneous access to all EMG data within the same session. Additionally, normalization is not the final processing step in this script; without it, the EMG composite sums cannot be calculated. Skipping normalization would lead to incomplete and unstructured data, requiring an extra preprocessing step later. A cleaner approach is to first extract all EMG data, compute the peak values for each muscle group, and store them separately for use in the first script. To achieve this, an additional script is run before the first script to collect the peak activation levels of each muscle signal per session for each subject; session 1 (Table 4.2) comprises the activities Walking, Incline, and Backwards, while session 2 (Table 4.3) comprises Running, Cycling, and Stairs. The muscle signals were collected separately for the left and right leg and include: ’Gluteus maximus’, ’Rectus femoris’, ’Vastus lateralis’, ’Semitendinosus’, ’Biceps femoris’, ’Medial gastrocnemius’, ’Soleus’, ’Tibialis anterior’. This script processes the raw MATLAB data, considering only the EMG signals, and converts and preprocesses the data in the same way as the main script (Section 4.4.1), up to the normalization step.
Peak activation levels are stored in a dictionary with keys for session ID, position, and sensor. Finally, they are written to a CSV file with columns for Subject ID, Activity, Position, Sensor Name, and Value to ensure proper data handling. While all column combinations are listed, some may not appear in the data, for example, Stairs for Subject 1.

4.5.2 Split Data: Backwards2 and Cycling2

Due to recording issues (Section 4.3.5), two activity sessions were split into separate recordings (Table 4.2, Table 4.3). In the first script, the standing ground truth value for the split activity data is taken from the first recording and applied to both, ensuring they share the same baseline for the net metabolic cost before merging. To ensure proper smoothing at the borders, the merging itself is performed after conversion and before resampling. Both recordings were retrieved, and the Activity Mode name was adjusted by removing the ’2’ suffix for proper data alignment. Their data arrays were then concatenated, and the second recording was subsequently dropped from the dataframe. Since the standing and sitting sections are removed for all activities, the overall structure remains consistent with the other recordings.

4.5.3 Handling Trailing NAN Values

As described in Section 4.3.5, some recordings contain missing data filled with NAN values at the end. To ensure clean data processing, these values must be removed at the beginning of the first script. Early handling of this problem is possible since all sensors within a group are sampled at the same rate. Trailing NANs are addressed by identifying them in the data array, determining where the NAN sequence begins, and removing that section from each sensor’s data within the group.
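A minimal numpy sketch of this trimming step (the column layout is illustrative, and it assumes each sensor contains at least one valid sample):

```python
import numpy as np

def trim_trailing_nans(sensor_group):
    # Find, for every sensor in the group, the index one past its last
    # valid (non-NAN) sample, and cut all sensors at the earliest one
    # so that every signal in the group ends at the same sample.
    cut = min(np.flatnonzero(~np.isnan(col))[-1] + 1 for col in sensor_group)
    return [col[:cut] for col in sensor_group]

a = np.array([1.0, 2.0, 3.0, np.nan, np.nan])  # ends two samples early
b = np.array([4.0, 5.0, 6.0, 7.0, np.nan])     # ends one sample early
trimmed = trim_trailing_nans([a, b])
```

Cutting every sensor of a group at the same index keeps the group aligned, which the shared per-group sampling rate and the later resampling step rely on.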
4.5.4 Determining the Sampling Rate of the Metabolic System for Breath-by-Breath Data

To apply a Gaussian kernel (Section 3.1.2) to the signals, the sampling rate is required. However, the sampling rate of the Metabolic System sensor group is defined as Breath-by-Breath rather than as a numerical value. Two approaches could be used to determine the sampling rate: one is to analyze the time values and calculate the number of samples per second; the chosen approach, however, was to compute the mean breath frequency. This was done using the ’Breath Frequency’ sensor within the Metabolic System group, providing an estimated mean sampling rate for each sensor within the group.

4.5.5 Ground Truth

To calculate the ground truth, the following inputs are required: the time values, activity code, VO2 values, VCO2 values, the subject’s weight, and the previous value, which is used only for split data. First, the index borders of the activity code are determined to identify where each activity section begins and ends. The GT value is then calculated separately for each activity section. The time values are extracted for each activity section using the lower and upper borders. Three minutes are subtracted from the upper border to determine the middle index. If the activity duration is shorter than 3 minutes, the 3-minute mean cannot be computed and is handled later; this applies only to split data (Section 4.5.2). Next, the metabolic cost is calculated using the EE equation (Section 3.2.1) with the VO2 and VCO2 values between the middle and upper indices. The mean over this 3-minute period is then calculated and extended over the entire 6-minute period. The values are normalized by the subject’s weight, listed in Table 4.1. Finally, the net metabolic cost is calculated by subtracting the baseline metabolic cost.
The baseline metabolic cost is the standing (or, for Cycling, sitting) metabolic cost taken from the first value of the array, as each training session begins with it (Table 4.2, Table 4.3). However, if the session is too short to calculate a valid value, the value from the previous session is used, as explained for the split data (Section 4.5.2).

4.5.6 Handling Missing Values for Subject 6

Missing values are handled prior to calculating the Pearson correlation coefficient and are excluded from the training data to ensure that NAN values do not affect the calculations.

PCC. To handle the missing values (Section 4.3.5) of Subject 6 at the Right Wrist position for the sensors ’Acceleration - Vector Magnitude’, ’Electrodermal Activity’, and ’Skin Temperature’, the following steps are performed: if the subject ID is 6, the position is ’Right Wrist’, and the sensor is one of the specified types, the signal array and GT array are zipped together; all tuples containing NAN values are then filtered out, and the arrays are unzipped back into their original form.

Training Problems. The approach for gathering clean training data depends on whether NAN values are present in the sensors selected for the chosen model training, as not all sensors are included at all times. If necessary, any rows containing NAN values are removed.

4.5.7 Custom LOO

KFold and LOO from scikit-learn do not produce the desired behavior for CV, as the data sizes vary across subjects. Therefore, the data must be split along the boundaries of subject IDs or activity modes, depending on the chosen CV method. To address this, an iterator (Listing 4.1) was designed to split the data and return the test and train indices, processing them one at a time.
import numpy as np

def loo_with_indices(indices_array):
    # indices_array holds the number of samples per group
    # (per subject for LOSO, per activity mode for LOTO).
    size = np.sum(indices_array)
    start = 0
    for end in indices_array:
        test_idx = list(range(start, start + end))
        train_idx = [i for i in range(size) if i not in test_idx]
        yield train_idx, test_idx
        start += end

Listing 4.1: Custom LOO Generator

4.6 Evaluation Process

The evaluation of the traditional models follows the key steps outlined in the proposed approach. First, the preprocessed data is evaluated using the PCC in comparison to the benchmark data. Next, the optimal preprocessors are identified for each model. The optimization process is then carried out in three phases to fine-tune the hyperparameters. Finally, different sensor regions are compared to assess their performance.

Evaluating Preprocessing. The PCC is used to assess the quality of the preprocessing for both the unfiltered and the filtered data. The results of Ingraham et al. [15] serve as the benchmark for this comparison.

Machine Learning Methods. The ML models (LR, SVM, DT, RF, ADA, MLP), as detailed in Section 3.3, were evaluated using all 16 features. GridSearchCV was employed for hyperparameter tuning, with NRMSE as the performance metric. Two types of CV were applied: LOSO and LOTO, as introduced in Section 3.2.4. The evaluation was conducted using both CV methods, each with and without VO2 as a benchmark feature, and the results were compared to the LR benchmark from Ingraham et al. [15].

Determine best Preprocessor. The effect of preprocessing was evaluated by testing all combinations of each model with StandardScaler, Normalizer, and no preprocessing. These configurations were analyzed using both CV types (LOSO and LOTO), with the VO2 benchmark included as a feature. This enabled a comparison of how different preprocessing methods impact model performance across both CV approaches.

Tune Hyperparameters with a Parameter Grid.
GridSearchCV allows the use of a parameter grid, which is cross-validated across all of the data, enabling the evaluation of parameter tuning. Starting with the default values, a custom parameter grid is designed manually, slightly modifying the defaults to optimize model performance. Due to limited computational resources, the number of combinations is adjusted based on the model type and tuning parameters. The tuning process is carried out in three phases, with the parameter grid being refined after each iteration. The optimization is evaluated exclusively using the LOSO CV, including the benchmark, as the primary goal of this work is to adapt the application across multiple subjects. While this approach may not yield the fully optimized model, it demonstrates that parameter tuning is crucial for improving prediction accuracy in regression tasks such as estimating EE.

Evaluate Various Sensor Regions. In real-world applications, measuring all variables across different body regions is often impractical and depends on the specific application goals. To address this, key sensor regions were selected for real-time adaptation, ensuring alignment with practical constraints and application requirements. The evaluation explores six sensor configurations based on insights from related studies, summarized in Table 4.7:

• All Sensors: all 16 sensors.
• Hexoskin: Waist Accel, HR, Breath Frequency (BF), Minute Ventilation (MV).
• Smart Watch: both wrist Accels, EDA and Temp, HR, SpO2.
• EMG Pants: both EMG composite sums, Waist and Ankle Accel, possibly HR and SpO2.
• Watch and Pants: combination of the Smart Watch and EMG Pants sensor regions.
• Best Combination: the top six sensors based on correlation, including Waist Accel, EMG (left and right), EDA, HR, BF, MV.

The All Sensors configuration includes all 16 sensors, as used in the previous evaluation steps.
The Hexoskin setup, which focuses on sensors in the chest region, has been referenced in prior studies [15] and builds on the earlier work of Beltrame et al. [1]. Smart watches provide a lightweight and widely adopted sensing method, making them a practical choice for real-world applications [28, 30, 36, 38]. Similarly, sensor pants provide a wearable solution for capturing EMG signals [30, 33]. While they do not directly measure HR or accelerometer data, it is assumed that these could easily be obtained from the same region using a HR monitor or additional accelerometers. The Watch and Pants configuration simply combines the Smart Watch and EMG Pants sensor regions. Finally, the Best Combination region was derived from previous studies [15], in which the top-performing linear regression models were analyzed with an increasing number of predictors. From this evaluation, the six most relevant sensors were selected to optimize performance. While these signals could potentially be measured across various body regions, this approach assumes that reducing the sensor set for training and calculation, by removing redundant data and prioritizing the most impactful signals, is beneficial. Furthermore, it is assumed that this could lower the computational complexity, thereby improving the model’s efficiency. All regions are evaluated across all models and both CV types, with and without the inclusion of the benchmark feature, ensuring a comprehensive assessment of their performance.

Table 4.7: Definition of Sensor Regions

Region           | Accel               | EMG | EDA | Temp | BF | MV | HR | SpO2
All              | *                   |  *  |  *  |  *   | *  | *  | *  |  *
Smart Watch      | Wrist               |     |  *  |  *   |    |    | *  |  *
Hexoskin         | Waist               |     |     |      | *  | *  | *  |
EMG Pants        | Waist, Ankle        |  *  |     |      |    |    | *  |
Best Combination | Waist               |  *  |  *  |      | *  | *  | *  |
Watch and Pants  | Waist, Ankle, Wrist |  *  |  *  |  *   |    |    | *  |  *
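For use in the training pipeline, the region definitions of Table 4.7 can be encoded as a simple lookup (a sketch with illustrative keys; the optional HR/SpO2 additions for the EMG Pants region follow the note above):

```python
# Sensor regions from Table 4.7 mapped to their signal types; the accel
# entry mirrors the table's Accel column (positions for accelerometers).
SENSOR_REGIONS = {
    "All": {"accel": "all positions",
            "signals": ["Accel", "EMG", "EDA", "Temp", "BF", "MV", "HR", "SpO2"]},
    "Smart Watch": {"accel": "Wrist",
                    "signals": ["Accel", "EDA", "Temp", "HR", "SpO2"]},
    "Hexoskin": {"accel": "Waist",
                 "signals": ["Accel", "BF", "MV", "HR"]},
    "EMG Pants": {"accel": "Waist, Ankle",
                  "signals": ["Accel", "EMG", "HR"]},  # SpO2 optionally added
    "Best Combination": {"accel": "Waist",
                         "signals": ["Accel", "EMG", "EDA", "BF", "MV", "HR"]},
    "Watch and Pants": {"accel": "Waist, Ankle, Wrist",
                        "signals": ["Accel", "EMG", "EDA", "Temp", "HR", "SpO2"]},
}

def signals_for(region):
    # Return the signal types kept as features for a given region mode.
    return SENSOR_REGIONS[region]["signals"]
```

Keeping the region definitions in one lookup makes the mode selection of Section 4.4.5 a single dictionary access when slicing the feature columns.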