Institute for Artificial Intelligence
Intelligent Sensing and Perception
University of Stuttgart
Universitätsstraße 32
D-70569 Stuttgart

Bachelorarbeit

Machine learning-based metabolic rate estimation from wearable sensors

Marie Olschewski

Course of Study: Informatik B.Sc.
Examiner: Jun.-Prof. Dr. Alina Roitberg
Supervisor: Jun.-Prof. Dr. Alina Roitberg
Commenced: August 5, 2024
Completed: February 19, 2025

Abstract

Adaptive devices such as exoskeletons and prostheses can enhance human physical capabilities or replace the functionality of missing body parts. However, adjusting these devices to the specific needs of an individual remains a time-consuming and costly procedure. A key objective in optimizing these devices is minimizing the user's energy expenditure (EE), a metric closely related to metabolic cost. Traditional methods for estimating metabolic cost, such as indirect calorimetry, are performed in controlled environments, limiting real-world applicability. This study aims to bridge this gap by exploring the use of traditional machine learning (ML) methods to estimate metabolic cost in real-time environments, utilizing wearable sensors integrated into adaptive devices. Using the dataset from Ingraham et al. (2019), which includes data from ten healthy subjects performing various exercises, the study investigates how different sensor combinations impact prediction accuracy. This thesis evaluated multiple ML models, including Random Forest (RF), Support Vector Machines (SVM), Linear Regression (LR), Decision Trees (DT), Multilayer Perceptrons (MLP), and AdaBoost (ADA), under two cross-validation schemes: Leave-One-Subject-Out (LOSO) and Leave-One-Task-Out (LOTO). Key findings from this evaluation include: In the LOSO setting, RF outperformed the other models, achieving the lowest RMSE in several sensor regions, including Hexoskin, EMG Pants, and Best Combination, with the 'Best Combination' region showing the best results.
In contrast, MLP performed well in the LOTO setting, with its strongest performance observed in the 'Best Combination' region. SVM demonstrated robust performance when all sensor data was combined, emphasizing the potential of multimodal sensor fusion. Hyperparameter tuning and sensor feature selection were crucial factors in optimizing model performance, particularly for more complex models like RF and MLP. The results suggest that while traditional ML methods can estimate EE effectively, challenges remain in refining preprocessing techniques, tuning hyperparameters, and optimizing sensor combinations. This thesis outlines the importance of model selection, sensor fusion, and parameter optimization in developing more accurate and real-time energy expenditure prediction systems for wearable technologies.

Kurzfassung

Adaptive devices such as exoskeletons and prostheses can enhance human physical capabilities or replace the functionality of missing body parts. However, adapting these devices to the needs of an individual remains a time-consuming and costly procedure. A key objective in optimizing these devices is minimizing the user's energy expenditure, which is closely related to metabolic cost. Traditional methods for estimating metabolic cost, such as indirect calorimetry, are performed in controlled environments, which limits their real-world applicability. This thesis aims to close this gap by investigating the application of traditional machine learning methods for estimating metabolic cost in real-time environments, using wearable sensors integrated into adaptive devices. Using the dataset from Ingraham et al.
(2019), which contains data from ten healthy subjects performing various exercises, the thesis investigates how different sensor combinations affect prediction accuracy. Several ML models were evaluated, including Random Forests (RF), Support Vector Machines (SVM), Linear Regression (LR), Decision Trees (DT), Multi-layer Perceptrons (MLP), and AdaBoost (ADA), with two cross-validation schemes: Leave-One-Subject-Out (LOSO) and Leave-One-Task-Out (LOTO). The key findings of this thesis are: With LOSO cross-validation, Random Forests (RF) outperformed the other models, achieving the lowest root mean square error (RMSE) in several sensor regions, including Hexoskin, EMG Pants, and Best Combination, with the Best Combination region showing the best results. In contrast, Multi-layer Perceptrons (MLP) performed well under LOTO validation, with the strongest performance observed in the Best Combination region. SVM achieved good performance when all sensor data were combined, underscoring the potential of combining different sensor types. Hyperparameter tuning and sensor selection were decisive factors in optimizing model performance, especially for more complex models such as RF and MLP. The results indicate that while traditional ML methods can estimate energy expenditure (EE) effectively, challenges remain in refining preprocessing techniques, tuning hyperparameters, and optimizing sensor selection. Addressing these challenges is an important part of developing more accurate real-time energy expenditure estimation systems for wearable technologies.

Contents

1 Introduction 15
1.1 Motivation 16
1.2 Thesis Objectives 16
1.3 Outline of the Thesis 17
2 Related Work 19
2.1 State of the Art 19
2.2 Human-in-the-Loop Optimization 19
2.3 Personalization of Wearable Robots 20
2.4 Energy Expenditure Estimation and Optimization 20
2.5 Metabolic Efficiency in Prostheses and Exoskeletons 20
2.6 Perturbation-Based Methods for Metabolic Cost Estimation 20
2.7 Multi-Sensor Fusion for Energy Expenditure Prediction 20
2.8 Accuracy of the SenseWear Armband during Short Bouts of Exercise 21
2.9 Advancements in Wearable Robotics for Rehabilitation 21
2.10 Sparse Sensor Systems for Motion Analysis 21
2.11 Personalized EMG Biofeedback for Exoskeleton Optimization 21
2.12 Benchmark Data 21
3 Definition of Concepts 23
3.1 Signal Preprocessing 23
3.2 Definition, Metrics and Evaluation Methods 25
3.3 Supervised Machine Learning Algorithms for Regression 27
3.4 Model Training and Evaluation 30
4 Methodology 33
4.1 Approach 33
4.2 Evaluation Criteria and Guidelines 33
4.3 Dataset 36
4.4 Implementation Workflow 42
4.5 Problems/Challenges and solutions 48
4.6 Evaluation Process 51
5 Results 55
5.1 Comparison using Pearson correlation coefficient (PCC) 55
5.2 Additional Pipeline Preprocessing 57
5.3 Tuning Hyperparameters 57
5.4 Evaluation of Various Sensor Regions 61
6 Discussion 65
6.1 Comparison using Pearson correlation coefficient (PCC) 65
6.2 Additional Pipeline Preprocessing 67
6.3 Tuning Hyperparameters 68
6.4 Evaluation of Various Sensor Regions 68
6.5 Limitations of this work 70
7 Conclusion 73
7.1 Key Findings 73
7.2 Summary 73
7.3 Limitations 74
7.4 Future Work 74
7.5 Conclusion 74
Bibliography 75

List of Figures

4.1 Hierarchical structure of the data set 36
4.2 This figure shows each signal preprocessing step regarding EMG data 44
5.1 Results of Pearson Correlation Coefficient 56
5.2 Comparison of Preprocessors for LOSO and LOTO 58
5.3 Results of untuned models for LOSO and LOTO 59
5.4 Progress over tuning phases 60
5.5 Performance of ML models for sensor regions LOSO and LOTO 64

List of Tables

4.1 Subject Demographics 37
4.2 Activities for Each Subject (First Session) 37
4.3 Activities for Each Subject (Second Session) 38
4.4 Activity Codes and Descriptions 38
4.5 Comparison of Label Structures Across Different Sensor Groups 40
4.6 Dataframe MultiIndex Structure 45
4.7 Definition of Sensor Regions 53
5.1 Results of Pearson Correlation Coefficient 56
5.2 Tuning phases parameters and results 61
5.3 Performance of ML models for sensor regions LOSO 63
5.4 Performance of ML models for sensor regions LOTO 63

Listings

3.1 Butterworth Filter 23
3.2 Gaussian Filter 24
3.3 Interpolation and Resampling 24
3.4 Pearson Correlation 26
3.5 Linear Regression Model 27
3.6 Support Vector Regression Model 28
3.7 Decision Tree Regression Model 28
3.8 Random Forest Regression Model 29
3.9 AdaBoost Regression Model 29
3.10 Multi-layer Perceptron Regression Model 29
3.11 Standard Scaler Preprocessor 30
3.12 Normalizer Preprocessor 30
3.13 Preprocessing Pipeline 30
3.14 GridSearchCV 31
4.1 Custom LOO Generator 51

List of Abbreviations

Accel acceleration. 22
ADA AdaBoost. 4, 27
APDM_Accel APDM Accelerometer. 37
BF breath frequency. 52
CV cross-validation. 25, 27
DT Decision Trees. 4, 27
EDA electrodermal activity. 22
EE energy expenditure. 4
EMG Electromyography. 22
EMG (Group) Electromyography Sensors. 34
Empatica_Accel Empatica Accelerometer. 37
Empatica_Physio Empatica Physiological Sensors. 34
GT ground truth. 22
HITL human-in-the-loop. 19
HR heart rate. 22
IMUs inertial measurement units. 15
LLE lower-limb exoskeletons. 15
LOO Leave-One-Out. 25, 27
LOSO Leave-One-Subject-Out. 4, 33
LOTO Leave-One-Task-Out. 4, 33
LR Linear Regression. 4, 16
Metabolics_Systems Metabolics Measurement System. 34
ML machine learning. 23
MLP Multi-layer Perceptrons. 4
MSE mean square error. 26
MV Minute Ventilation. 52
NAN Not A Number. 41
NRMSE negative root mean square error. 26
PCC Pearson correlation coefficient. 5, 6, 25, 26
RF Random Forests. 4
RMSE root mean square error. 4
SpO2 oxygen saturation. 22
SVM Support Vector Machines. 4, 27
Temp skin temperature. 22
VCO2 carbon dioxide production. 26
VO2 oxygen consumption. 26

1 Introduction

Wearable robots, such as exoskeletons, are emerging as powerful tools for enhancing mobility [11, 18, 37]. These devices provide movement support, enhance rehabilitation, and reduce physical strain for individuals affected by illness [26], accidents [32], or aging [19].
Mobility challenges, such as difficulty walking or climbing stairs, can significantly impact the independence and quality of life of older adults [19]. Similarly, industrial workers performing physically demanding tasks, such as heavy lifting and prolonged walking, often experience early-onset physical limitations [26]. With an aging population and stroke remaining a leading cause of mobility impairment, lower-limb rehabilitation robots have gained attention for their potential to improve gait and reduce energy expenditure in stroke patients [37]. However, their effectiveness varies among users, emphasizing the need for personalized assistance strategies [37]. To address these challenges, adaptive devices such as prostheses and exoskeletons have been developed to enhance movement and provide essential support. Myoelectric-driven upper-limb prosthetics effectively restore functionality [32], while lower-limb exoskeletons (LLE) assist users in tasks ranging from basic walking to more complex movements [19, 26]. Integrating these technologies helps users regain mobility and improve overall physical performance. A critical aspect of wearable technology is accurately estimating metabolic EE during daily activities and exercise. Many activity monitors struggle with short-duration activities, leading to imprecise EE estimations [36]. To enhance accuracy, researchers are increasingly utilizing wearable sensors, such as inertial measurement units (IMUs), for more precise EE assessment in healthcare applications [20]. Energy efficiency is another key challenge in prosthetic and wearable robotics. Reducing the user's metabolic cost by providing optimal support during various activities is essential for usability [22]. Traditional respiratory measurements provide only an average metabolic cost per movement cycle, limiting insights into the contributions of specific walking phases [10].
Additionally, minimizing energy expenditure (EE) remains complex, particularly in tasks beyond simple walking, such as navigating slopes or uneven terrain [11]. Personalization plays a crucial role in optimizing exoskeleton performance, ensuring adaptive and responsive support tailored to individual needs [18]. Research efforts focus on refining LLE to enable intelligent human-machine interaction, ultimately improving rehabilitation outcomes and mobility support [23]. The continued advancement of exoskeletons and prostheses is driven by the need for improved adaptability, energy efficiency, and personalization. These technologies aim to enhance rehabilitation, support mobility, and improve the quality of life for individuals with physical limitations. By integrating wearable robotics with intelligent control mechanisms, future innovations will further refine and expand the capabilities of these assistive devices, making them more effective and accessible to those in need.

1.1 Motivation

Accurately estimating energy expenditure (EE) is a challenging task due to the variability across individuals, influenced by factors such as demographics, physiological differences, and movement patterns. These complexities make it difficult to capture reliable EE patterns using only a limited set of sensor signals. Additionally, existing datasets for EE estimation are often tailored to specific research objectives, leading to highly specialized and limited data that may not fully represent real-world conditions. Many recent approaches have been optimized on these constrained datasets, which restrict their applicability in dynamic, real-world environments. Traditionally, EE is measured using indirect calorimetry, a breath-by-breath measurement method typically conducted in a controlled laboratory setting. However, real-world applications demand more practical and wearable solutions.
Unfortunately, many recent studies do not fully explore alternative sensor setups that could improve EE estimation while remaining lightweight and wearable. This gap limits the scalability and practicality of current methods for everyday use, such as in adaptive devices like exoskeletons and prostheses. In this context, Ingraham et al. [15] introduced a comprehensive dataset incorporating various sensors and diverse tasks, which has become a valuable resource for advancing EE estimation research. This dataset presents an opportunity to explore optimal sensor configurations and identify the most relevant features for accurate EE estimation across a wide range of individuals and movement patterns. By leveraging this dataset, the aim is to move beyond traditional linear approaches and enhance model performance, improving the generalizability of EE estimation across different subject groups. This approach seeks to pave the way for more practical, real-time, and user-friendly solutions for estimating energy expenditure, making it suitable for a variety of adaptive technologies and everyday use.

1.2 Thesis Objectives

This work aims to extend the Linear Regression (LR) statistics presented by Ingraham et al. [15] by investigating the impact of various traditional machine learning methods for estimating energy expenditure (EE). An objective is to explore optimal sensor configurations and assess how different machine learning models can improve EE estimation accuracy. To ensure consistency, the dataset is preprocessed using the same approach as the original study, and the effectiveness of these preprocessing steps is evaluated using Pearson correlation coefficients. Several supervised machine learning methods are explored, with a focus on identifying the most effective configurations and optimizations. Model performance is evaluated using the RMSE metric, allowing for a comprehensive comparison of the models' predictive accuracy.
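As an illustration of this metric (the arrays below are made-up values, not results from the dataset), RMSE can be computed with scikit-learn and NumPy:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical ground-truth and predicted EE values; for illustration only.
y_true = np.array([3.2, 4.8, 6.1, 5.0])
y_pred = np.array([3.0, 5.1, 5.8, 5.4])

# RMSE is the square root of the mean squared error.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
```

Lower RMSE values indicate predictions that lie closer to the measured ground truth, which is why it is used here to rank models.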
Additionally, sensors are grouped into different regions to analyze the performance of various sensor sets across the selected machine learning models. This approach aims to simulate realistic sensor configurations, reflecting those commonly used in practical applications, and provides insights into how different sensor combinations can enhance EE estimation performance. By leveraging this approach, this thesis seeks to improve the generalizability of EE estimation across different subject groups and real-world conditions.

1.3 Outline of the Thesis

This thesis begins with a review of recent related work, discussing specific circumstances and environments relevant for estimating EE. Chapter 3 establishes key foundations, providing detailed explanations of the definitions and methods used; this ensures a clear understanding of the underlying approach on which this work is based. Chapter 4 outlines the methodology, covering the overall approach, requirements, and guiding principles of this study. It includes an in-depth analysis of the dataset, followed by a step-by-step description of the implementation process, the challenges encountered, and the evaluation procedures. The results are presented in Chapter 5 and analyzed in Chapter 6. Finally, the key outcomes are summarized in Chapter 7.

2 Related Work

This chapter gives an overview of related work and recent optimization approaches in the field.

2.1 State of the Art

2.1.1 Adaptive Control Using Heart Rate Feedback

The study by Manzoori et al. [24] introduces an adaptive control strategy for hip exoskeletons, where heart rate (HR) feedback is used to adjust assistance during walking and other locomotor tasks such as stairs and inclines. The approach successfully reduces the oxygen cost and effort intensity, particularly during longer tasks.
HR patterns were identified as a potential tool for improving the responsiveness of the system in real time, although delays in HR response affected shorter tasks. This study emphasizes HR's potential to enhance adaptive control strategies in wearable devices, allowing for more efficient locomotion by dynamically adjusting support based on the user's effort level.

2.1.2 Human-in-the-Loop (HITL) Optimization for Knee Exoskeletons

Wang et al. [35] present a HITL optimization approach for knee exoskeletons, leveraging EMG signals from the semitendinosus and vastus medialis muscles. The study focuses on using Bayesian optimization to rapidly identify optimal assistance strategies for knee exoskeletons. By analyzing muscle cooperation and dynamically adjusting the assistance provided by the device, the study demonstrates a reduction in the convergence time for optimization, which enhances the real-time adaptability of the system. This strategy aims to improve mobility and metabolic efficiency by tailoring the exoskeleton's support to the individual user's needs.

2.2 Human-in-the-Loop Optimization

Díaz et al. [11] focused on reducing energy expenditure in lower-limb exoskeletons using an EMG-based objective function. Their motivation was to integrate EMG signals with Bayesian optimization to enhance the real-time energy efficiency of wearable robots, thus paving the way for more energy-efficient devices [11].

2.3 Personalization of Wearable Robots

Koginov et al. [18] explored the customization of wearable robots for specific tasks like downhill walking using sEMG data. Their motivation was to personalize the assistance provided by exoskeletons to improve muscle activity reduction during specific tasks, making these devices more suitable for real-world applications [18].
2.4 Energy Expenditure Estimation and Optimization

Lee and Lee [20] introduced a hybrid CNN-LSTM model for accurate energy expenditure (EE) prediction during various walking conditions using IMU data. The motivation behind their work was to develop real-time energy monitoring for rehabilitation and assistive devices, offering a model that can estimate EE without needing heel-strike detection [20].

2.5 Metabolic Efficiency in Prostheses and Exoskeletons

Li et al. [22] developed a knee-guided evolutionary algorithm (KGEA) to optimize lower-limb prostheses, focusing on improving metabolic efficiency during activities like stair climbing. The motivation was to create real-time optimization techniques for prosthetics that enhance performance and reduce the metabolic cost of walking [22].

2.6 Perturbation-Based Methods for Metabolic Cost Estimation

Dzewaltowski et al. [10] explored perturbation-based methods to estimate metabolic cost during specific phases of the walking cycle. The motivation here was to better understand which phases of walking contribute to higher energy expenditure, offering insights for designing assistive devices that optimize walking efficiency [10].

2.7 Multi-Sensor Fusion for Energy Expenditure Prediction

Xu et al. [37] developed a spatial-temporal fusion network that integrates sEMG, inertial, metabolic, and heart rate data to predict EE with high accuracy. Their motivation was to improve real-time EE estimation by leveraging multi-sensor fusion and advanced hybrid attention mechanisms to enhance prediction performance in wearable devices [37].

2.8 Accuracy of the SenseWear Armband during Short Bouts of Exercise

Wedge et al. [36] investigated the SenseWear Armband (SWA)'s ability to estimate EE during short-duration activities like walking and running.
Their motivation was to explore mobile, real-time EE monitoring solutions, demonstrating the potential of SWA for brief exercise sessions, but also emphasizing the need for improvements in its accuracy for short-term activities [36].

2.9 Advancements in Wearable Robotics for Rehabilitation

Vinjamuri [34] discussed the integration of virtual reality (VR) and active-passive exoskeletons to enhance mobility and rehabilitation outcomes. The motivation was to push forward the potential of wearable robotics for rehabilitation practices, particularly in gait recovery and functional improvement for individuals with neurological and musculoskeletal disorders [34].

2.10 Sparse Sensor Systems for Motion Analysis

Ebers et al. [12] introduced the use of Shallow Recurrent Decoders (SHRED) for reconstructing missing sensor data in motion analysis. Their motivation was to develop a framework that works with sparse sensor configurations, making motion tracking more efficient and accessible, particularly for health and performance monitoring in contexts with limited sensor availability [12].

2.11 Personalized EMG Biofeedback for Exoskeleton Optimization

Li et al. [23] explored the use of EMG biofeedback to personalize exoskeleton assistance for users with lower-limb dysfunction. The motivation was to tailor exoskeleton control based on individual muscle activity, enhancing rehabilitation and mobility assistance, and improving the user experience through personalized HITL optimization [23].

2.12 Benchmark Data

The work by Ingraham et al. (2019) [15] serves as the foundation for this bachelor thesis. Their research aimed to evaluate the predictive capabilities of various physiological signals for estimating metabolic energy cost during different exercises. The study focused on using simple signal processing techniques and prediction algorithms to assess the baseline performance of each signal.
By providing insights and a dataset collected from 10 healthy individuals performing six different activities — walking, incline walking, backward walking, running, cycling, and stair climbing — at varying intensities, Ingraham et al. highlighted the potential to improve the speed, accuracy, and portability of body-in-loop optimization algorithms. The physiological signals measured included breath frequency and volume, limb accelerations, lower limb Electromyography (EMG), heart rate (HR), electrodermal activity (EDA), skin temperature (Temp), and oxygen saturation (SpO2), with indirect calorimetry used as the ground truth (GT) for EE. Their results showed that filtering the accelerations and EMG signals improved their predictive power. Notably, global signals like HR and EDA were more sensitive to unknown subjects, while local signals such as acceleration (Accel) were more sensitive to unknown tasks. The best predictive performance was achieved by combining a small number of signals (4-5) from multiple sensor types. Their study systematically identifies the most salient physiological signals for predicting metabolic energy cost and provides valuable insights for improving HITL optimization algorithms for wearable robots and prostheses. The dataset released with the study offers a promising foundation for developing novel algorithms to predict energy cost across various applications, from wearable technology to rehabilitation. Therefore, the insights from their study will support the design and development of algorithms aimed at more accurate energy cost estimation, further advancing the field.

3 Definition of Concepts

This section outlines the algorithms and concepts used in this work, covering signal preprocessing techniques, definitions of metrics and evaluation methods, supervised machine learning (ML) algorithms for regression, and the overall structure of the model training and evaluation approach.
The default values provided by the methods used are displayed in the listings. The SciPy library [29] provides built-in tools for signal processing to optimize the signal, and the scikit-learn library [39] delivers efficient ML algorithms for regression, as well as all the necessary tools for training and evaluation.

3.1 Signal Preprocessing

Signal preprocessing improves signal quality and reliability by reducing noise from the environment, sensors, and artifacts before analysis or ML. Filters like Butterworth or Gaussian refine the signal, while vector magnitude or composite sums make the data more compact.

3.1.1 Butterworth Filter

Frequency filters like highpass ('high'), lowpass ('low'), and bandpass ('band') filters are used to isolate specific frequency ranges, enhancing signal quality for better analysis and processing. As shown in Listing 3.1, the function enables the design of a Butterworth filter for effective frequency filtering.

from scipy import signal

b, a = signal.butter(N, Wn, btype='low', analog=False, output='ba', fs=None)
y = signal.filtfilt(b, a, x, axis=-1, padtype='odd', padlen=None, method='pad', irlen=None)

Listing 3.1: Butterworth Filter

While the function can create both digital and analog filters, the focus is on digital filters, which are used for discrete and sampled data. It enables the creation of an Nth-order digital Butterworth filter and returns the corresponding filter coefficients, which are subsequently used to apply the filter. The critical frequency or frequencies (Wn) must be normalized between 0 and 1 using the formulas in Equation (3.1) and Equation (3.2), which relate the cutoff frequency, the sampling rate, and the Nyquist frequency (f_nyquist). The default output type returns the numerator (b) and denominator (a) coefficients of the filter.
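As a short usage sketch (the sampling rate, cutoff frequency, and filter order below are assumed example values, not parameters taken from this work), the cutoff is normalized by the Nyquist frequency f_s/2 before designing the filter:

```python
import numpy as np
from scipy import signal

fs = 100.0                 # assumed sampling rate in Hz
f_cutoff = 6.0             # assumed cutoff frequency in Hz
wn = f_cutoff / (fs / 2)   # normalize the cutoff by the Nyquist frequency

# Design a 4th-order digital lowpass Butterworth filter.
b, a = signal.butter(4, wn, btype='low')

# Apply it forward and backward (zero-phase) to a noisy test signal:
# a slow 2 Hz component plus a 30 Hz component acting as noise.
t = np.linspace(0, 1, int(fs), endpoint=False)
x = np.sin(2 * np.pi * 2 * t) + 0.5 * np.sin(2 * np.pi * 30 * t)
y = signal.filtfilt(b, a, x)
```

Because the 30 Hz component lies above the cutoff, it is strongly attenuated, while the 2 Hz component passes almost unchanged.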
w_n = f_cutoff / f_nyquist (3.1)

f_nyquist = f_sampling_rate / 2 (3.2)

Once the filter is designed, it can be applied to an input (x) using the second function, as shown in Listing 3.1. This function processes the signal twice: first in the forward direction, then in reverse, effectively canceling phase distortion. [5, 29]

3.1.2 Gaussian Kernel

Gaussian filtering is an effective technique for smoothing signals by reducing high-frequency noise while preserving their shape. To achieve these characteristics, the function (Listing 3.2) applies a one-dimensional filter along a specified axis.

from scipy.ndimage import gaussian_filter1d

smoothed_data = gaussian_filter1d(input, sigma=sigma, axis=-1, order=0, output=None, mode='reflect', cval=0.0, truncate=4.0, radius=None)

Listing 3.2: Gaussian Filter

The function processes an input by applying a Gaussian filter, where the specified sigma controls the level of smoothing. The value of sigma is computed using Equation (3.3). The mode parameter defines how the filter handles signal boundaries. [29]

sigma = f_sampling_rate / (2 · π · f_cutoff) (3.3)

3.1.3 Interpolating and Resampling

Interpolation is the process of estimating values at intermediate points between known data points, while resampling involves adjusting the sampling rate of the data. When working with data sampled at different rates, aligning the data is useful to ensure it is of equal length and sampled at consistent time points. For 1-D interpolation, the function in Listing 3.3 creates an interpolator that estimates values for new time data points based on the existing x and y values.
from scipy.interpolate import interp1d
interpolator = interp1d(x, y, kind='linear', axis=-1, copy=True, bounds_error=None, fill_value=nan, assume_sorted=False)
y_interpolated = interpolator(x_aligned)

Listing 3.3: Interpolation and Resampling

It takes arrays of x and y values, which approximate a function f: y = f(x), where x represents time data and y represents value data. The interpolation kind can be specified (e.g., nearest, linear, or cubic), and the fill_value parameter allows setting a constant value for out-of-bounds points or enabling extrapolation. [29]

3.1.4 Full-wave Rectifying

Full-wave rectification converts an alternating signal into a non-negative signal by flipping negative values to positive. This process retains only the magnitude of the signal, making it useful for signal analysis where the strength, rather than the direction, is important. This can be achieved by the function in Equation (3.4), which takes the absolute value of the input. [15]

y(t) = |x(t)| (3.4)

3.1.5 Vector Magnitude

The vector magnitude, given by the Euclidean norm, represents the length of a vector, simplifying multidimensional data into a single value and making it easier to interpret overall variations or effects. Equation (3.5) computes the total magnitude by summing the squared components along each axis and taking the square root, resulting in a single scalar value that represents overall intensity. [7, 15]

||v|| = sqrt(x_1^2 + x_2^2 + ... + x_n^2) (3.5)

3.1.6 Composite Sum

Composite sums are used to combine multiple data points into a single value. Instead of summing components along different axes, as in vector magnitude, data from various sensors are combined. To calculate the composite sum, as shown in Equation (3.6), the sensor data are squared, summed, and then the square root is taken, yielding a single scalar value that represents the overall measurement.
[7, 15]

M = sqrt( Σ_{i=1}^{n} M_i^2 ) (3.6)

3.2 Definition, Metrics and Evaluation Methods

This section introduces key definitions, metrics, and evaluation methods used in the analysis. It covers the Brockway energy expenditure (EE) equation, the Pearson correlation coefficient (PCC) for measuring linear relationships, the root mean square error (RMSE) for assessing prediction accuracy, and cross-validation (CV) with Leave-One-Out (LOO) for evaluating model performance.

3.2.1 Brockway Energy Expenditure Equation

EE refers to the total amount of energy the body uses over a specific period, influenced by factors such as physical activity, metabolism, and respiratory gas exchange. Expressed by Equation (3.7), the Brockway equation estimates EE, measured in kcal per unit of time, based on oxygen consumption (VO2) and carbon dioxide production (VCO2), both measured in liters per minute (L/min). [4, 25]

ee_value = 16.58 · VO2 + 4.15 · VCO2 (3.7)

3.2.2 Pearson Correlation Coefficient

The Pearson correlation coefficient (PCC) measures the linear relationship between two variables, as shown in Equation (3.8).

r = Σ(x − m_x)(y − m_y) / sqrt( Σ(x − m_x)^2 · Σ(y − m_y)^2 ) (3.8)

It calculates the sum of the product of the deviations of each data point from their respective means, divided by the product of their standard deviations. The result, ranging from −1 to +1, indicates the strength and direction of the relationship: +1 for a perfect positive correlation, −1 for a perfect negative correlation, and 0 for no linear correlation. Correlations above 0.7 indicate a strong linear relationship. The function in Listing 3.4 takes two datasets (x, y) as input and returns a PearsonRResult containing the Pearson product-moment correlation coefficient.
[15, 29]

from scipy.stats import pearsonr
corr, _ = pearsonr(x, y, *, alternative='two-sided', method=None, axis=0)

Listing 3.4: Pearson Correlation

3.2.3 Root Mean Squared Error (RMSE) Metric

RMSE is a commonly used metric that quantifies the difference between predicted and actual values in regression models, with lower values indicating better performance. First, the mean squared error (MSE) is calculated by squaring the differences between predicted and actual values, summing them, and averaging over all samples (Equation (3.9)). The RMSE is then obtained by taking the square root of the MSE (Equation (3.10)). Unlike MSE, RMSE retains the same unit as the original data, making it more intuitive for model comparison and evaluation. Additionally, RMSE penalizes large errors more heavily, which is particularly useful when dealing with outliers. Since model optimization typically involves maximizing a performance score, the negative root mean squared error (NRMSE) is often used as the negative counterpart of RMSE (Equation (3.11)). [2, 7, 15, 39]

MSE(y, ŷ) = (1/n) · Σ_{i=0}^{n−1} (y_i − ŷ_i)^2 (3.9)

RMSE(y, ŷ) = sqrt(MSE) (3.10)

NRMSE(y, ŷ) = −RMSE (3.11)

3.2.4 Cross Validation

Leave-One-Out (LOO) is a cross-validation (CV) technique where a model is trained on all data points except one, which is used for testing. This process repeats for each data point, ensuring every sample serves as a test instance once. Conceptually, LOO is equivalent to K-Fold CV with the number of splits equal to the number of data points. While LOO provides an unbiased estimate of model performance, it can be computationally expensive for large datasets due to the high number of iterations, also referred to as folds.

3.3 Supervised Machine Learning Algorithms for Regression

Supervised regression algorithms are used to predict continuous target variables based on input features.
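To make these definitions concrete before the individual algorithms are introduced, the following sketch combines LOO CV (Section 3.2.4) with the NRMSE of Section 3.2.3 via scikit-learn's built-in 'neg_root_mean_squared_error' scorer. The data and the linear model are illustrative placeholders, not part of the benchmark setup.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic regression data: 20 samples, 3 features, small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=20)

# One fold per sample; each score is the negative RMSE on the left-out point.
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(),
                         scoring='neg_root_mean_squared_error')
mean_nrmse = scores.mean()   # closer to 0 is better
```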
These methods, such as LR, Support Vector Machines (SVM), Decision Trees (DT), RF, AdaBoost (ADA), and MLP, learn from labeled data to establish relationships between inputs and outputs. Each method has specific parameters that help balance model performance and complexity, with some methods focusing on reducing the risk of overfitting and others on improving accuracy.

3.3.1 Linear Regression Model (LR)

Linear Regression (LR) (Listing 3.5) is a simple yet powerful method for modeling relationships between variables by fitting a linear equation to the data. It assumes a linear relationship between input features and the target variable. Typically, an intercept (bias) is included to account for a nonzero baseline when the data is not centered. By default, the model allows coefficients to take both positive and negative values, but setting the positive constraint to True ensures all coefficients remain non-negative. [15, 39]

from sklearn.linear_model import LinearRegression
model = LinearRegression(*, fit_intercept=True, copy_X=True, n_jobs=None, positive=False)

Listing 3.5: Linear Regression Model

3.3.2 Support Vector Machine Regression (SVM/SVR) Model

Support Vector Machine (SVM) regression (Listing 3.6) is an algorithm that identifies the optimal hyperplane to effectively separate data points from different classes, maximizing the margin between them for better generalization. The C parameter controls the balance between model complexity and training error, while epsilon defines the margin of error. kernel determines the transformation of the input data (e.g., linear, polynomial, RBF). degree applies to the polynomial kernel, influencing the complexity of the decision boundary, while gamma influences the impact of individual training points, where a small value smooths the boundary and a large value risks overfitting.
[6, 39]

from sklearn.svm import SVR
model = SVR(*, kernel='rbf', degree=3, gamma='scale', coef0=0.0, tol=0.001, C=1.0, epsilon=0.1, shrinking=True, cache_size=200, verbose=False, max_iter=-1)

Listing 3.6: Support Vector Regression Model

3.3.3 Decision Tree (DT) Regression Model

The Decision Tree (DT) (Listing 3.7) is an algorithm that splits data based on feature values to make predictions. For regression, it predicts numerical values by recursively dividing the data into subsets, starting from a root question and branching out. max_depth limits the maximum depth the tree can grow, min_samples_split specifies the minimum number of samples required to split an internal node, and min_samples_leaf determines the minimum number of samples required in a leaf node. [9, 39]

from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor(*, criterion='squared_error', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, ccp_alpha=0.0, monotonic_cst=None)

Listing 3.7: Decision Tree Regression Model

3.3.4 Random Forest (RF) Regression Model

The Random Forest (RF) (Listing 3.8) is an ensemble learning method that combines multiple DTs (Section 3.3.3) to improve the performance and robustness of the model. It generates several DTs by training each one on a random subset of the data, and the final prediction is made by averaging the predictions from all the trees. In addition to the parameters already explained in Section 3.3.3, the number of trees in the forest is controlled by the n_estimators parameter, while max_features determines how many features are considered when looking for the best split. Additionally, the bootstrap parameter defines whether bootstrap sampling is used during tree construction. When set to True, it enables sampling with replacement, ensuring that each tree is trained on a different subset of the data.
[3, 39]

from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, *, criterion='squared_error', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=1.0, max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, ccp_alpha=0.0, max_samples=None, monotonic_cst=None)

Listing 3.8: Random Forest Regression Model

3.3.5 Adaptive Boosting (AdaBoost/ADA) Regression Model

AdaBoost (ADA) (Listing 3.9) is an ensemble learning method that combines the predictions of multiple weak learners (such as DTs, Section 3.3.3) into a strong learner. It improves accuracy by focusing on the mistakes made by previous learners. ADA assigns weights to each training instance, adjusting them after each iteration to give more importance to poorly predicted data points. Each subsequent weak learner places more emphasis on the instances that are harder to predict. n_estimators controls the number of weak learners in the ensemble. The learning_rate parameter determines the weight given to each individual learner. By default, the loss function is linear, where errors are penalized linearly. [8, 39]

from sklearn.ensemble import AdaBoostRegressor
model = AdaBoostRegressor(estimator=None, *, n_estimators=50, learning_rate=1.0, loss='linear', random_state=None)

Listing 3.9: AdaBoost Regression Model

3.3.6 Multi-layer Perceptron (MLP) Regression Model

The Multi-layer Perceptron (MLP) regression model (Listing 3.10) is a neural network designed to predict continuous values from input data. It consists of multiple layers of neurons that capture complex patterns. hidden_layer_sizes defines the number of neurons in each hidden layer. While more layers and neurons can capture greater complexity, they also increase the risk of overfitting.
The activation function determines how each neuron's output is transformed (e.g., ReLU, tanh, logistic). solver specifies the optimization algorithm used to adjust the model's weights (e.g., adam, sgd, lbfgs). learning_rate controls the step size in weight updates. Finally, early_stopping halts training when the model's performance on a validation set stops improving. [13, 14, 17, 39]

from sklearn.neural_network import MLPRegressor
model = MLPRegressor(hidden_layer_sizes=(100,), activation='relu', *, solver='adam', alpha=0.0001, batch_size='auto', learning_rate='constant', learning_rate_init=0.001, power_t=0.5, max_iter=200, shuffle=True, random_state=None, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True, early_stopping=False, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08, n_iter_no_change=10, max_fun=15000)

Listing 3.10: Multi-layer Perceptron Regression Model

3.4 Model Training and Evaluation

The model training and evaluation framework provides a structured approach to training ML models, evaluating their performance, and ensuring consistency throughout the process. It integrates different models with preprocessors in pipelines, optimizing the process of model tuning and evaluation using GridSearchCV.

3.4.1 Preprocessing for Machine Learning Models

Scaling data before applying ML models is crucial to ensure that all features contribute equally to the model's learning process. Some models rely on the distance between data points, and features with larger scales may dominate the learning process, leading to biased results. Scaling also helps models converge faster during training by preventing large numerical gradients. StandardScaler (Listing 3.11) standardizes features by removing the mean and scaling to unit variance. This ensures that each feature has a mean of 0 and a standard deviation of 1, making the data centered and comparable across features.
[39]

from sklearn.preprocessing import StandardScaler
transformer = StandardScaler(*, copy=True, with_mean=True, with_std=True)

Listing 3.11: Standard Scaler Preprocessor

Normalizer (Listing 3.12) normalizes samples individually to unit norm, which scales each sample to have a norm of 1. This is useful when the focus is on the relationship or pattern between the data points, rather than their specific values. [39]

from sklearn.preprocessing import Normalizer
transformer = Normalizer(norm='l2', *, copy=True)

Listing 3.12: Normalizer Preprocessor

3.4.2 Pipelines for Training

Machine learning pipelines simplify the workflow by applying consistent processing steps to both training and testing data. Preprocessors, as described in Section 3.4.1, can be integrated to transform features before training, ensuring automated preprocessing, preventing data leakage, and ensuring reliable evaluation. The code in Listing 3.13 demonstrates how to set up a pipeline with a LR model (Section 3.3.1) and a Normalizer as the preprocessing step.

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

models = {'lin_reg': LinearRegression()}
pipelines = {}
for model_name, model in models.items():
    pipeline = Pipeline(steps=[('preprocessor', Normalizer()), ('regressor', model)])
    pipelines[model_name] = pipeline

Listing 3.13: Preprocessing Pipeline

This approach is flexible and can be applied to multiple models. By adding conditional logic, different preprocessing steps can be assigned to specific models as needed. It ensures appropriate scaling across models and datasets while enhancing code efficiency and reproducibility. [39]

3.4.3 Model Training and Evaluation Framework

GridSearchCV is a technique for hyperparameter tuning and model evaluation through CV. It tests all possible combinations of a parameter grid, training the model with each set and evaluating it using a defined scoring metric, such as NRMSE.
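Building on the pipeline dictionary of Listing 3.13, the following hedged sketch tunes one pipeline with GridSearchCV. The data, the SVR estimator, and the parameter grid are illustrative assumptions, and three folds are used instead of LOO to keep the example cheap.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Illustrative data: 40 samples, 4 features, one informative feature.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 4))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=40)

pipeline = Pipeline(steps=[('preprocessor', StandardScaler()),
                           ('regressor', SVR(kernel='rbf'))])

# Pipeline step names become parameter prefixes ('regressor__C').
param_grid = {'regressor__C': [0.1, 1.0, 10.0]}

grid_search = GridSearchCV(pipeline, param_grid,
                           scoring='neg_root_mean_squared_error', cv=3)
grid_search.fit(X, y)
best_pipeline = grid_search.best_estimator_
```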
By automating the process of evaluating hyperparameters, GridSearchCV helps identify the best model configuration and ensures optimal performance before final evaluation. The GridSearchCV object, as in Listing 3.14, is initialized with the model (estimator), a dictionary of hyperparameters (param_grid), scoring, and the number of CV folds (e.g., LOO in Section 3.2.4). After fitting, the best-performing model can be accessed via best_estimator_, while the test scores, including the mean and standard deviation, provide insight into model performance across the CV folds. By extending this approach with model and pipeline dictionaries, different models can be evaluated in one process. [39]

from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(estimator, param_grid, *, scoring=None, n_jobs=None, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)
grid_search.fit(X, y=None, **params)
best_estimator = grid_search.best_estimator_
mean_test_score = grid_search.cv_results_['mean_test_score']
std_test_score = grid_search.cv_results_['std_test_score']
split_test_score = grid_search.cv_results_['splitX_test_score']

Listing 3.14: GridSearchCV

4 Methodology

This section outlines the methodology used in this work, detailing the approach to EE estimation, dataset characteristics and challenges, implementation pipeline, and evaluation strategies. The methodology is designed to provide a structured and reproducible workflow.

4.1 Approach

To extend EE estimation beyond Linear Regression (LR), as previously explored by Ingraham et al. [15], a structured approach is followed to ensure compatibility with benchmark data and meaningful model evaluation. First, preprocessing is evaluated using the Pearson correlation coefficient (PCC), comparing unfiltered and filtered data against benchmark results.
Next, various supervised ML models (LR, SVM, DT, RF, ADA, MLP) are trained on all 16 extracted features and assessed using NRMSE across two CV strategies (Leave-One-Subject-Out (LOSO) and Leave-One-Task-Out (LOTO)). To optimize model performance, hyperparameters are tuned through GridSearchCV using a manually designed parameter grid, refined in three iterative phases, with LOSO serving as the primary evaluation method for cross-subject generalization. Finally, recognizing the constraints of real-world applications, key sensor regions are analyzed to balance estimation accuracy and practical feasibility. This structured workflow ensures a comprehensive evaluation while maintaining comparability with benchmark studies.

4.2 Evaluation Criteria and Guidelines

This section outlines the evaluation process using the benchmark dataset from Ingraham et al. [15]. The data preprocessing follows their methodology, including signal filtering, resampling, and ground truth (GT) calculation. Evaluation techniques involve CV, PCC for validation, and RMSE for assessing model performance, ensuring consistent and comparable results across various machine learning methods and sensor configurations.

4.2.1 Benchmark Dataset

The dataset used in this work is provided by Ingraham et al. [15, 16]. It was collected from 10 healthy subjects equipped with a diverse range of sensors while performing six activities at varying speeds and intensities. A detailed description of the dataset structure is provided in Section 4.3.

Features and Target Variable. Each model is trained using 16 signals, including six vector magnitudes, composite sums from two EMG signals, two Empatica data points per wrist, and four global signals. For further details, see Section 4.3.3. The target variable is the EE value, calculated using the Brockway EE Equation (3.7), which incorporates VO2 and VCO2 data.
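The construction of this target can be sketched as follows; the function name and the sample values are illustrative placeholders, and only the coefficients come from Equation (3.7).

```python
import numpy as np

def brockway_ee(vo2, vco2):
    """EE target per Equation (3.7); vo2 and vco2 in L/min."""
    return 16.58 * np.asarray(vo2) + 4.15 * np.asarray(vco2)

# Illustrative gas-exchange values for three time points (L/min).
vo2 = np.array([0.3, 0.9, 1.5])
vco2 = np.array([0.25, 0.8, 1.4])
ee = brockway_ee(vo2, vco2)   # one EE value per time point
```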
Since VO2 is highly correlated with EE, it is also included as a feature in some models to serve as a benchmark for performance evaluation.

4.2.2 Data Preprocessing

Before training the models, proper data preprocessing is crucial to ensure accuracy and reliability. The dataset, collected using wearable sensors, can exhibit issues such as sensor shifts, disconnections, and noise due to high sensor sensitivity, which captures unintended movement patterns like gait cycles. These discrepancies were identified by Ingraham et al. [15] and require appropriate handling to ensure the integrity of the analysis. The preprocessing steps followed in this work are aligned with the methods used by Ingraham et al. [15] to maintain consistency and comparability. These steps include:

• The vector magnitude of the Accel data was calculated across the three axes using Equation (3.5).

• EMG preprocessing was done in several steps. The EMG signals were filtered using a band-pass filter between 30 and 350 Hz, followed by full-wave rectification (Section 3.1.4) and low-pass filtering with a 5 Hz cutoff (Section 3.1.1). The filtered EMG signals were then normalized to the peak activation level for each muscle within the same session. Composite sums for each side were calculated using the eight EMG muscle sensors.

• Although the raw data was time-synchronized during collection, each sensor group had a different sampling rate, resulting in misaligned sampling points. To address this, the data was resampled to 1 Hz (Section 3.1.3), ensuring uniform sampling across all sensor signals.

• Additionally, the data was prepared in two versions: once unfiltered, and once filtered by applying Gaussian kernel smoothing (Section 3.1.2) before resampling.
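The EMG branch of these steps can be sketched on synthetic data as follows. The Butterworth filter order is an assumption made for the sketch, and the random input merely stands in for raw EMG recordings.

```python
import numpy as np
from scipy import signal

fs = 1000.0                                  # EMG sampling rate (1 kHz)
rng = np.random.default_rng(2)
emg = rng.normal(size=(8, 5000))             # 8 muscles, 5 s of synthetic raw EMG

# Filter designs: 30-350 Hz band-pass, 5 Hz low-pass (order 4 assumed).
b_bp, a_bp = signal.butter(4, [30, 350], btype='band', fs=fs)
b_lp, a_lp = signal.butter(4, 5, btype='low', fs=fs)

filtered = signal.filtfilt(b_bp, a_bp, emg, axis=-1)         # band-pass
rectified = np.abs(filtered)                                 # full-wave rectify (Eq. 3.4)
envelope = signal.filtfilt(b_lp, a_lp, rectified, axis=-1)   # 5 Hz envelope

# Peak normalization per muscle, then composite sum over channels (Eq. 3.6).
normalized = envelope / envelope.max(axis=-1, keepdims=True)
composite = np.sqrt((normalized ** 2).sum(axis=0))
```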
The cutoff frequencies for resampling were set to 0.1 Hz for the Electromyography Sensors (EMG (Group)) and the vector magnitude Accel signals, and 0.01 Hz for the more robust Metabolics Measurement System (Metabolics_Systems) and Empatica Physiological Sensors (Empatica_Physio) signals (such as VO2, HR, and SpO2).

The GT for EE was derived using the Brockway equation (Equation (3.7)), which calculates metabolic cost from unfiltered VO2 and VCO2 data, normalized by the subject's weight. Ingraham et al. [15] determined that the energy cost remained constant across each 6-minute activity period. They calculated the mean metabolic cost during the last 3 minutes of each activity session and extended this value to the full 6 minutes. To account for the net EE, the metabolic cost of standing at the beginning of each activity session was subtracted. This process was used to determine the net energetic cost for each activity.

4.2.3 Evaluation Criteria

This section outlines the evaluation methods used to assess the performance of the models and their predictions for EE for comparability with the benchmark.

PCC and LR. Preprocessing evaluation was conducted using the Pearson correlation coefficient (PCC) to ensure comparability with the original benchmarks. In the work of Ingraham et al. [15], both filtered and unfiltered data were used to calculate the ground truth (GT) target variable. Additionally, LR was performed on the preprocessed data, serving as a benchmark for evaluating the performance of other machine learning methods.

Metrics. To assess the accuracy of the predictions, several metrics can be used. Following [15], RMSE (Equation (3.10)) is used as the metric to evaluate model performance. RMSE helps quantify how closely the model's predictions align with the GT EE values, providing a reliable measure for comparison with benchmark data.

CV. The dataset consists of 10 subjects and six task modes, making it well-suited for LOO CV.
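Subject-wise splits of this kind map directly onto scikit-learn's LeaveOneGroupOut, sketched below on placeholder data; task-wise splits work identically with task codes as the group labels.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Placeholder data: 30 samples, labeled with 10 subject IDs (3 samples each).
X = np.arange(30).reshape(30, 1).astype(float)
subjects = np.repeat(np.arange(10), 3)

logo = LeaveOneGroupOut()
n_folds = logo.get_n_splits(X, groups=subjects)   # one fold per subject

for train_idx, test_idx in logo.split(X, groups=subjects):
    held_out = np.unique(subjects[test_idx])
    assert held_out.size == 1   # each fold holds out exactly one subject
```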
Given the limited size of the dataset, CV is crucial for improving model reliability and ensuring generalizability across different data subsets. Following the methodology defined in detail by Ingraham et al. [15], two types of CV were employed: LOSO and LOTO. LOSO involves 10 folds, one for each subject, while LOTO splits the data into six folds based on the different tasks. These approaches ensure that each sensor configuration is tested across diverse data points, minimizing bias through averaging over the test sets.

Sensor regions. In addition to analyzing the performance of individual signals, Ingraham et al. [15] evaluated four groups of sensors: Local Signals, Global Signals, a Combination of Both, and Hexoskin, included with reference to its practical application. The Hexoskin region will be included in the evaluation, along with other regions introduced in Section 4.6.

Figure 4.1: Hierarchical structure of the data set

4.3 Dataset

This section outlines the structure of the dataset provided by Ingraham et al. [16]. It is organized into several subsections, including details about the subjects, the activities, sensor and signal descriptions, label structure, and missing data. Before proceeding with the data analysis, it is important to address some missing details in the dataset provided by Ingraham et al. [15]. These issues, such as the label structure and other challenges with the raw data, should be discussed and clarified to ensure proper handling and analysis.

4.3.1 Subjects

According to Ingraham et al. [15], ten healthy subjects participated in the experimental data collection, consisting of 8 males and 2 females, with an average age of 27.4 ± 4.5 yr, height of 1.76 ± 0.09 m, and weight of 69.1 ± 9.9 kg. Further details on the demographics of each subject can be found in Table 4.1.

4.3.2 Activities

As described by Ingraham et al.
[15], each subject took part in two experimental sessions, where they performed various physical activities at different speeds and intensities (Table 4.4). In the first session, the subjects engaged in sitting, standing, level walking (Walking), incline walking (Incline), and backward walking on a treadmill (Backwards) (Table 4.2). In the second session, they participated in sitting, standing, running on a treadmill (Running), cycling on a stationary bike (Cycling), and stair climbing on a stairmill (Stairs) (Table 4.3). For each activity, the subjects typically stood for 6 minutes, performed various intensities of the activity, each lasting 6 minutes, and then rested while sitting for an additional 6 minutes. The standing (22) data serves as a benchmark and is used to calculate the net ground truth energy cost. Subsequently, both the standing (22) and sitting (23) segments are removed from each activity, leaving only the varying intensities of the activities.

Subject   | Sex    | Height (m) | Weight (kg) | Age (yr)
1         | Male   | 1.85       | 63.49       | 30
2         | Male   | 1.77       | 63.49       | 25
3         | Female | 1.73       | 71.20       | 24
4         | Male   | 1.68       | 68.03       | 25
5         | Male   | 1.83       | 68.03       | 24
6         | Male   | 1.85       | 68.03       | 25
7         | Male   | 1.85       | 95.25       | 24
8         | Male   | 1.85       | 65.76       | 33
9         | Male   | 1.73       | 68.93       | 37
10        | Female | 1.63       | 58.05       | 27
Mean ± SD | 8 Male, 2 Female | 1.76 ± 0.09 | 69.1 ± 9.9 | 27.4 ± 4.5

Table 4.1: Subject Demographics
Subject | Walking             | Incline                | Backwards          | Backwards2
1       | 22, 1, 3, 2, 23     | 22, 5, 4, 22, 6, 7, 23 | 22, 8, 9, 10, 23   |
2       | 22, 3, 2, 22, 1, 23 | 22, 5, 4, 22, 7, 6, 23 | 22, 8              | 22, 9, 10, 23
3       | 22, 1, 3, 2, 23     | 22, 4, 5, 22, 7, 6, 23 | 22, 9, 8, 23       |
4       | 22, 2, 1, 3         | 22, 4, 5, 22, 7, 6, 23 | 22, 8, 9, 23       |
5       | 22, 2, 3, 1, 23     | 22, 5, 4, 22, 6, 7, 23 | 22, 8, 9, 10, 23   |
6       | 22, 1, 2, 3, 23     | 22, 5, 4, 22, 6, 7, 23 | 22, 8, 9, 23       |
7       | 22, 3, 2, 1, 23     | 22, 5, 4, 22, 6, 7, 23 | 22, 9, 8, 23       |
8       | 22, 2, 1, 3, 23     | 22, 5, 4, 22, 6, 7, 23 | 22, 9, 8, 23       |
9       | 22, 3, 1, 2, 23     | 22, 4, 5, 22, 7, 6, 23 | 22, 9, 8, 23       |
10      | 22, 2, 1, 3, 23     | 22, 4, 5, 22, 7, 6, 23 | 22, 8, 9, 23       |

Table 4.2: Activities for Each Subject (First Session)

4.3.3 Sensors and Signals

During the experiment, the subjects were equipped with a range of sensors attached to different parts of their bodies. These sensors were grouped into the following categories: APDM Accelerometer (APDM_Accel), Empatica Accelerometer (Empatica_Accel), Empatica_Physio, Metabolics_Systems, and EMG (Group).
Subject | Running               | Cycling                | Cycling2       | Stairs
1       | 22, 11, 13, 14, 12    | 23, 15, 16, 17, 18, 23 |                |
2       | 22, 11, 13, 14, 12, 23| 23, 18, 17, 15, 16, 23 |                | 22, 19, 20, 21, 22
3       | 22, 11, 13, 14, 12, 23| 23, 18, 17, 15, 16, 23 |                | 22, 19, 20, 21, 22
4       | 22, 11, 14, 12, 13, 23| 23, 16, 17, 15, 18, 23 |                | 22, 21, 19, 20, 22
5       | 22, 11, 13, 14, 12, 23| 23, 17, 18             | 23, 15, 16, 23 | 22, 21, 19, 20, 22
6       | 22, 11, 14, 12, 13, 23| 23, 16, 18, 17, 15, 23 |                | 22, 20, 21, 19, 22
7       | 22, 11, 13, 14, 12, 23| 23, 15, 16, 18, 17, 23 |                | 22, 19, 21, 20, 22
8       | 22, 11, 13, 12, 14, 23| 23, 18, 16, 15, 17, 23 |                | 22, 21, 19, 20, 22
9       | 22, 11, 14, 12, 13, 23| 23, 15, 17, 16, 18, 23 |                | 22, 20, 19, 21, 22
10      | 22, 11, 12            | 23, 15, 17, 18, 16, 23 |                | 22, 19, 21, 20, 22

Table 4.3: Activities for Each Subject (Second Session)

Activity Code | Activity Mode | Intensity    | Speed
1             | Walking       | -            | 0.6 (m/s)
2             | Walking       | -            | 0.9 (m/s)
3             | Walking       | -            | 1.2 (m/s)
4             | Incline       | 4°           | 0.6 (m/s)
5             | Incline       | 4°           | 1.2 (m/s)
6             | Incline       | 9°           | 0.6 (m/s)
7             | Incline       | 9°           | 1.2 (m/s)
8             | Backwards     | -            | 0.4 (m/s)
9             | Backwards     | -            | 0.7 (m/s)
10            | Backwards     | -            | 1 (m/s)
11            | Walking       | -            | 1.2 (m/s)
12            | Running       | -            | 1.8 (m/s)
13            | Running       | -            | 2.2 (m/s)
14            | Running       | -            | 2.7 (m/s)
15            | Cycling       | Resistance 1 | 70 (rpm)
16            | Cycling       | Resistance 3 | 70 (rpm)
17            | Cycling       | Resistance 5 | 70 (rpm)
18            | Cycling       | Resistance 1 | 100 (rpm)
19            | Stairs        | 60 (W)       | -
20            | Stairs        | 60 (W)       | -
21            | Stairs        | 60 (W)       | -
22            | Standing      | -            | -
23            | Sitting       | -            | -

Table 4.4: Activity Codes and Descriptions

A detailed description of each sensor group is provided below. Ingraham et al. [15] reported that all sensor signals were time-synchronized during data collection, and each sensor group included a time and activity code column to align the signals with the corresponding activities.

APDM Accelerometer (APDM_Accel)

Four commercial tri-axial accelerometers were attached to the designated positions on the subject using elastic Velcro straps.
• Positions: Waist, Chest, Left Ankle, Right Ankle, Left Foot, Right Foot
• Physical Quantities and Units: Acceleration (m/s^2), Angular Velocity (rad/s), Magnetic Field (µT)
• Axis: x, y, z
• Sampling Rate: 128 Hz

Empatica Accelerometer (Empatica_Accel)

Each subject wore two bilateral wristbands, each containing tri-axial accelerometers.

• Positions: Left Wrist, Right Wrist
• Physical Quantities and Units: Acceleration (m/s^2)
• Axis: x, y, z
• Sampling Rate: 32 Hz

Empatica Physiological Sensors (Empatica_Physio)

The same wristbands used with the Empatica accelerometer also recorded electrodermal activity and skin temperature.

• Positions: Left Wrist, Right Wrist
• Physical Quantities and Units: Electrodermal Activity (µS), Skin Temperature (°C)
• Sampling Rate: 4 Hz

Metabolics Measurement System (Metabolics_Systems)

A portable respirometer measures oxygen consumption (VO2), carbon dioxide production (VCO2), breath frequency, and minute ventilation at a breath-by-breath sampling rate. Heart rate was monitored using a wireless chest strap, while blood oxygen saturation (SpO2) was measured with a pulse oximeter attached to the subject's left earlobe.

• Physical Quantities and Units: VO2 (mL/s), VCO2 (mL/s), RER, Breath Frequency (1/min), Minute Ventilation (L/min), Oxygen Saturation (SpO2) (%), Heart Rate (1/min)
• Sampling Rate: Breath-by-Breath

Electromyography (EMG)

The muscle activity from eight bilateral muscles of the lower limbs was recorded using surface EMG electrodes.

• Positions: Left Leg, Right Leg
• Physical Quantities and Units: Raw EMG (V)
• Muscles: Gluteus Maximus, Rectus Femoris, Vastus Lateralis, Semitendinosus, Biceps Femoris, Medial Gastrocnemius, Soleus, Tibialis Anterior
• Sampling Rate: 1 kHz

4.3.4 Labels

The structure of the labels is not described in detail by the authors. Upon inspection, some inconsistencies in the structure are observed, as shown in Table 4.5.
In the first row, the first entry is always Time, followed by the Activity Code. Typically, the first row describes the Sensor Position, except for the metabolic system, which does not have a position. For these, the first row is labeled Unit, whereas for all other sensors, this appears in the second row. Accelerometers also have a third row detailing the axis, and EMG data describes the muscle groups in the same row. These discrepancies should be considered when converting the data, to ensure a consistent structure across all recordings.

Sensor Group: APDM_Accel
Time (s) | Activity Code | Waist | Waist | Waist | ... (Sensor Position)
         |               | Acceleration (m/s^2) | Acceleration (m/s^2) | Acceleration (m/s^2) | ... (Physical Quantity and Unit)
         |               | x | y | z | ... (Axis)

Sensor Group: Empatica_Accel
Time (s) | Activity Code | Left Wrist | Left Wrist | Left Wrist | ... (Sensor Position)
         |               | Acceleration (m/s^2) | Acceleration (m/s^2) | Acceleration (m/s^2) | ... (Physical Quantity and Unit)
         |               | x | y | z | ... (Axis)

Sensor Group: Empatica_Physio
Time (s) | Activity Code | Left Wrist | Left Wrist | Right Wrist | ... (Sensor Position)
         |               | Electrodermal Activity (µS) | Skin Temperature (°C) | Electrodermal Activity (µS) | ... (Physical Quantity and Unit)

Sensor Group: Metabolics_System
Time (s) | Activity Code | VO2 (mL/s) | VCO2 (mL/s) | RER | ... (Physical Quantity and Unit)

Sensor Group: EMG
Time (s) | Activity Code | Left Leg | Left Leg | Left Leg | ... (Sensor Position)
         |               | Raw EMG (V) | Raw EMG (V) | Raw EMG (V) | ... (Physical Quantity and Unit)
         |               | Gluteus maximus | Rectus femoris | Vastus lateralis | ... (Muscle Type (Axis))

Table 4.5: Comparison of Label Structures Across Different Sensor Groups

4.3.5 Missing Data, Occurring NAN Values, and Split Activity Sessions

Missing Data According to the Authors. Ingraham et al. [15] note that equipment malfunction prevented the collection of complete data sets from all subjects.
Due to a failure in the right wristband worn by Subject 6 during session 2, their right wrist accelerometry, electrodermal activity, and skin temperature data for the Running, Cycling, and Stairs activities were excluded from analysis. This is reflected in the raw data, where Not A Number (NAN) series appear for each missing sensor. The authors also mention that certain signals were excluded due to electrode disconnection: Subject 5’s left gluteus maximus (Walking), Subject 3’s right gluteus maximus (Cycling), and Subject 3’s left semitendinosus (Walking). However, the sides reported by the authors differ from those observed in their dataset: Subject 5’s right gluteus maximus (Walking) and Subject 3’s right semitendinosus (Walking) are entirely NAN, while Subject 3’s left gluteus maximus (Cycling) is only partially NAN but was still completely excluded from the analysis. Furthermore, Subject 1 was unable to complete the stair-climbing activity. As a result, there is no entry for this activity in the dataset, and no NAN values need to be handled for this task.

Trailing NAN Values. The authors did not mention that some sensor recordings within the same group have unequal lengths. The recordings likely ended a few seconds apart, leading to NAN values at the tail end of some signals.

Split Activity Sessions. Furthermore, two activity sessions were split into two separate recordings, a detail not mentioned by the authors. For Subject 2 (Backwards) and Subject 5 (Cycling), the two recordings should be combined into a single activity recording to recreate the structure of the other activity recordings and simplify handling. These recordings also appear as separate columns in Table 4.2 and Table 4.3.

4.4 Implementation Workflow

The implementation is structured into three main stages: the preprocessing pipeline, its evaluation using the Pearson correlation coefficient, and the final training pipeline.
The preprocessing pipeline consists of three scripts. The first script handles data conversion and the preprocessing steps described in Section 4.2. The second script restructures the dataframe and simplifies data access within it. In the third and final preprocessing step, the data is divided into two versions, a filtered and an unfiltered dataframe, followed by interpolation and resampling. Next, the script for calculating Pearson correlation coefficients between the filtered and unfiltered data is introduced. Finally, the training pipeline, developed to meet the requirements in Chapter 4, and the evaluation framework for gathering results are explained. During the implementation of the preprocessing steps, several challenges arose; they are noted in this section and discussed in detail in Section 4.5.

4.4.1 Preprocessing Pipeline - First Script - Converting and Preprocessing

In the first preprocessing step, the MATLAB data is converted into a Pandas DataFrame, processed for each subject as outlined in Section 4.3, and then stored as Pickle files. The mat73 package is used to load the data into a dictionary, enabling efficient access to its hierarchical structure during conversion.

Preserving Data Structure. To capture the desired data structure in the new dataframe, a MultiIndex for Pandas DataFrames is used to maintain the hierarchical organization and enable easier grouping by layer. As shown in Figure 4.1, the data is structured hierarchically, with the Sensor Data divided into Labels, Sampling Rate, and Data entries. The MultiIndex is built from the categorical indices Activity Mode and Sensor Group from the hierarchical layer, Position, Unit, and Axis from Labels, along with a new Preprocessed level to mark derived data entries such as vector magnitudes, composite sums, or the ground truth. Since the data is split by subject, the first layer contains only the current subject ID and is therefore excluded from the MultiIndex at this stage.
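A minimal pandas sketch of this layout (the label tuples, values, and sampling rates below are illustrative placeholders, not entries from the actual dataset):

```python
import numpy as np
import pandas as pd

# Illustrative label tuples following the six MultiIndex levels:
# (Activity Mode, Sensor Group, Position, Unit, Axis, Preprocessed).
labels = [
    ("Walking", "APDM_Accel", "Global", "Time (s)", None, None),
    ("Walking", "APDM_Accel", "Waist", "Acceleration (m/s2)", "x", None),
    ("Walking", "APDM_Accel", "Waist", "Acceleration (m/s2)", None, "Vector Magnitude"),
]
index = pd.MultiIndex.from_tuples(
    labels,
    names=["Activity Mode", "Sensor Group", "Position", "Unit", "Axis", "Preprocessed"],
)
# One row per sensor entry: the full data array plus its sampling rate.
df = pd.DataFrame(
    {
        "Data": [np.arange(5.0), np.ones(5), np.ones(5)],
        "Sampling Rate": [128, 128, 128],
    },
    index=index,
)
# Grouping by a level, e.g. selecting all rows of one sensor group:
apdm = df.xs("APDM_Accel", level="Sensor Group")
# df.to_pickle("subject_01.pkl") would store the per-subject result.
```

Storing whole signal arrays in an object-typed Data column keeps each sensor's recording together with its labels and sampling rate, which the later scripts rely on when restructuring.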
The Data and Sampling Rate entries from Sensor Data represent the columns of the dataframe. By iterating over the alphabetically ordered Activity Mode and Sensor Group, the corresponding Labels, Sampling Rate, and Data for each combination can be extracted.

Processing Data per Sensor Group. Since the Data was extracted from MATLAB row by row, it requires reorganization to group values by column. Transposing the data ensures that each column contains the complete session’s data for a given sensor, rather than storing values by second. Next, trailing NAN values are addressed as described in Section 4.5.3. The Metabolic System and Empatica Physio data require no additional preprocessing and are directly included in the new sensor dataset. For newly derived data, such as from the accelerometers or EMG, the original data is kept in the array and the processed data is added, labeled as "Preprocessed". The sensor data size is also updated to reflect the new sensors.

Metabolic Data. The ground truth (GT) EE is derived from the VO2 and VCO2 values obtained from the Metabolic System, following Equation (3.7), and normalized to the subject’s weight (Table 4.1). Given the complexity of this process, additional details are provided in Section 4.5.5. For the Metabolic System, the Breath-by-Breath sampling rate is replaced by the computed sampling rate, as detailed in Section 4.5.4, addressing the issue of inconsistent sampling intervals for numerical resampling.

Accelerometer Data. Vector magnitudes for the accelerometer data (from APDM_Accel and Empatica_Accel) are computed along the x, y, and z axes using Equation (3.5). NAN values for Subject 6 are handled by extending the NAN column from a single axis to all three axes by replicating its data array. The NAN values persist, requiring further handling as described in Section 4.5.6.

EMG Data.
As detailed in Section 4.2.2, the EMG preprocessing pipeline first applies a 4th-order bandpass Butterworth filter (Section 3.1.1), followed by full-wave rectification of the signal (using Equation (3.4)), and then a second, 4th-order lowpass Butterworth filter. The Butterworth filter order was chosen following the approach of Slade et al. [31]. The critical frequencies are calculated from the corresponding cutoff frequencies and the sampling rate, as described in Equation (3.1). Each step of the signal processing is illustrated in Figure 4.2. The signal is then normalized using the peak activation level for each muscle on both sides. Since obtaining these values was impractical at this point, they were retrieved through a separate script, as outlined in Section 4.5.1. Finally, composite sums for each leg are calculated using Equation (3.6).

Empatica Data. No additional preprocessing is applied to the Empatica sensor group.

Since derived data is added during the iteration through the data array, the resulting array increases in size. To ensure consistency, the new data structure must align with the original format, with all data properly labeled.

Adding corresponding Labels and Sampling Rate. As outlined at the beginning of this section, the MultiIndex is structured into six levels: Activity Mode, Sensor Group, Position, Unit, Axis, and Preprocessed. Since iteration occurs over each Activity Mode and Sensor Group combination, these values are directly accessible for labeling. Across all sensor groups, the first two entries in each data array correspond to Time and Activity Code. For these labels, Position is set to Global, while Unit is assigned Time (s) and Activity Mode, respectively. The Axis and Preprocessed levels are set to None. For most sensor groups, the first and second entries in Labels are mapped to Position and Unit. However, for the Metabolic System, this order is reversed, as shown in Table 4.5.
The third label entry applies only to Accelerometer and EMG data, where it maps to Axis; for simplification, the muscle-group labels of the EMG data are also stored at this level. In all other cases, it remains None. Derived signals, such as composite sums or vector magnitudes, retain the same labels as the associated sensor, except for the Axis label, which is set to None. The Preprocessed level is updated accordingly, with labels such as Composite Sum Left, Composite Sum Right, or Vector Magnitude, depending on the signal type and side. For Ground Truth labels, Position is set to Global and Unit is labeled as Ground Truth (kJ), while Axis and Preprocessed are set to None. Rather than storing the Sampling Rate at the sensor group level, it is now assigned to each sensor entry individually, ensuring flexible access to the sampling rate across the dataset.

Figure 4.2: The individual signal preprocessing steps applied to the EMG data

MultiIndex and DataFrame. A MultiIndex is generated using the label array as categorical labels. The DataFrame is then constructed by assigning the data array to the Data column and the sampling rate array to the Sampling Rate column, using the generated MultiIndex as its index. Finally, the DataFrame is saved as a Pickle file for each subject.

4.4.2 Preprocessing Pipeline - Second Script - Restructuring

The second script simplifies the dataframe structure, as shown in Table 4.6, merges the split activities (Section 4.5.2), and removes the standing and sitting periods within each activity session. Finally, it combines the separate Pickle files of all subjects into a single file, enabling easier access within the dataframe.

Adjusting MultiIndex. To achieve the desired dataframe structure, as shown in Table 4.6, several key transformations were applied in sequence. First, the columns Name and Sensor Name were inserted at the beginning of the dataframe.
Next, the values in the Name column were generated by combining the Position, Unit, and Axis values, while the values in the Sensor Name column were derived by extracting the sensor name from the Unit column. Only the relevant unit information was retained in the Unit level, and the Sensor Name column was set as the index for better organization. The Sensor Name was added to the MultiIndex, and the levels were rearranged into the following new order: Activity Mode, Sensor Group, Position, Unit, Axis, Preprocessed, and Sensor Name. Subsequently, the Unit, Axis, and Preprocessed levels were removed from the MultiIndex, as their information had become redundant, simplifying the structure. For compactness, the sensor name Oxygen Saturation, SpO2 was shortened to SpO2. Lastly, the Subject ID was introduced as a new top-level entry in the MultiIndex, further enhancing the dataset’s organization and ease of access.

Table 4.6: Dataframe MultiIndex Structure

Subject ID | Activity Mode | Sensor Group | Position | Sensor Name
1          | Backwards     | APDM_Accel   | Global   | Time
           |               |              |          | Activity Code
           |               |              | Waist    | Acceleration - Vector magnitude
           |               |              | Chest    | Acceleration - Vector magnitude
           |               |              | ...      | ...
           |               | EMG          | Global   | Time
           |               |              |          | Activity Code
           |               |              | Left Leg | Raw EMG - Composite Sum Left
           |               |              | ...      | ...
           | Cycling       | APDM_Accel   | Global   | Time
           |               |              |          | Activity Code
           |               |              | Waist    | Acceleration - Vector magnitude
           |               |              | Chest    | Acceleration - Vector magnitude
           |               |              | ...      | ...
2          | Backwards     | APDM_Accel   | Global   | Time
...        | ...           | ...          | ...      | ...

Removing Standing and Sitting Sections. The standing and sitting periods in each activity session are only used as benchmarks for calculating the ground truth in the first script and should not be included in the evaluation. As a result, they are removed from all activity sessions. First, the split activities must be handled as outlined in Section 4.5.2. Next, the standing and sitting sections are removed from all dataframes.
The data is grouped by activity mode and sensor group, and the corresponding time and activity code arrays are retrieved. The activity code is used to identify the boundaries of the activity sections. Before removing the standing and sitting sections, the time vector is adjusted to close any gaps between activity sections. Proper adjustment of the time vector is critical, as it is used in the next script for interpolation and resampling. Gaps may arise, for instance, during the Incline activity or for Subject 2 during Walking, as shown in Table 4.2. When removing data from the middle of the dataframe, the time vector is shifted by subtracting the difference between the upper boundary value and the last value of the previous activity section. Finally, the standing and sitting sections are removed, and the time vector is reset to begin at 0.

Filtering Relevant Sensors. Only the sensor data required for the ML process is selected, rather than the full dataset. This includes the sensors ’Ground Truth’, ’Acceleration - Vector Magnitude’, ’Raw EMG - Composite Sum Left’, ’Raw EMG - Composite Sum Right’, ’Electrodermal Activity’, ’Skin Temperature’, ’SpO2’, ’Heart Rate’, ’Breath Frequency’, ’Minute Ventilation’, ’VO2’, ’Time’, and ’Activity Code’. The selected positions are ’Global’, ’Waist’, ’Chest’, ’Left Ankle’, ’Right Ankle’, ’Left Leg’, ’Right Leg’, ’Left Wrist’, and ’Right Wrist’. A new dataframe containing the data of all ten subjects is created and saved to simplify the subsequent steps.

4.4.3 Preprocessing Pipeline - Third Script - Smoothing and Resampling

In the third and final script of the preprocessing pipeline, a Gaussian kernel can be applied to produce the filtered data, or the data can be left unfiltered. Resampling is performed in both cases.

Applying the Gaussian Kernel. When smoothing is applied, a Gaussian kernel is used for each activity mode and sensor group.
The cutoff frequency is determined as described in Section 4.2.2. The kernel is then applied as outlined in Section 3.1.2, using the cutoff frequency to calculate the sigma that is passed to the filter. ’Time’, ’Activity Code’, and ’Ground Truth’ are excluded from smoothing, as these values should remain unchanged.

Generating Uniform Time Vectors. To ensure consistency during resampling, a uniform time vector is generated for each subject and activity mode. This vector serves as the new set of sampling points. The process begins by grouping the data by activity mode and extracting the time arrays. The latest start time and earliest end time across all vectors are then identified. Based on these, a uniform time vector with 1-second intervals (1 Hz) is created.

Interpolation and Resampling. For interpolation and resampling, the corresponding uniform time vector and original time array are retrieved for each sensor group and activity mode. By default, linear interpolation is applied, while nearest-neighbor interpolation is used for Activity Code and Ground Truth to preserve their discrete values. An interpolator, as described in Section 3.1.3, is instantiated using the original time vector, the signal data, the interpolation type, and extrapolation as the fill value. Resampling is then carried out by passing the uniform time vector to the interpolator for each activity mode.

4.4.4 Preprocessing Evaluation - Pearson Correlation Coefficient

The impact of preprocessing is evaluated by computing the Pearson correlation coefficient for both filtered and unfiltered data, as defined in Equation (3.8). Correlation values are first calculated for each sensor and then averaged across all subjects. To ensure proper data handling, sensor values are stored separately. Filtering is first applied at the subject ID level, followed by the extraction of the GT values, with the original GT array preserved for potential restoration.
The data is further refined by filtering based on position and sensor name, while excluding non-sensor columns such as activity codes and timestamps, retaining only the relevant signal data. Before computing the correlation, NAN values are removed, as outlined in Section 4.3.5, to prevent interference with the calculations. The PCC is then computed by comparing the signal data with the ground truth values. This is applied to both raw and preprocessed data to maintain consistency in the analysis. This structured approach ensures a reliable evaluation of sensor data across multiple subjects and activities.

4.4.5 Training Pipeline - Framework and Evaluation

This section describes the customizable options of the training pipeline, which allows for flexible data collection modes, the inclusion of VO2 measurements, and the selection of CV strategies (LOSO or LOTO). The dataset is prepared based on the chosen mode, with features and target variables selected accordingly. Several ML models can be trained, with preprocessing tailored to each model. The GridSearchCV framework (Section 3.4.3) is used for model optimization, ensuring a systematic evaluation of the sensor data.

Customization Options. The training pipeline offers flexible configuration based on the specific needs of the experiment. The customization options include selecting the mode of data collection, which can be set to ’All’, ’Hexoskin’, ’Smart Watch’, ’EMG Pants’, ’Watch and Pants’, or ’Best Combination’, as detailed in Table 4.7. Additionally, the pipeline allows for the inclusion or exclusion of VO2 measurements in the analysis and offers a choice between two CV strategies: LOSO and LOTO. A set of models can be selected for evaluation, or alternatively, a single model can be chosen for both evaluation and optimization. Furthermore, different preprocessing pipelines, as described in Section 3.4.2, can be tested by substituting the appropriate preprocessors.

Setup before Evaluation.
Depending on the selected mode, the corresponding features and the target variable GT are selected for evaluation. The benchmark VO2 is optionally included in the features. To prepare the dataset for training, the data is structured according to the selected mode of data collection. The training set consists of the sensor features (X) corresponding to the chosen positions, while the target variable (y) is only the GT column. Before proceeding with model training, all dataframes are converted into NumPy arrays to enhance computational efficiency. The dataset is then reshaped based on the selected CV strategy. In the case of LOSO, the data is pivoted with ’Activity Mode’ as the index and the sensor readings as values. The resulting arrays are expanded into row-wise structures, and any NAN values, particularly those from right-wrist sensors, are appropriately handled. The index used for pivoting is subsequently dropped. A similar approach is applied for LOTO, except that ’Subject ID’ serves as the index before being removed.

Machine Learning Model. For model training, several ML algorithms are provided, including LR, SVM, DT, RF, ADA, and MLP, as introduced in Section 3.3. Each model is integrated into a dedicated pipeline, as in Section 3.4.2, allowing for tailored preprocessing steps specific to the requirements of each ML approach. With a parameter grid, the ML models can be evaluated across all combinations of the parameters, helping to identify the best settings.

Preprocessing Pipeline. Certain models require additional preprocessing to transform the data before training, as they assume the data to be scaled or normalized. By iterating over the models, each pipeline is finalized, and a new ML model is instantiated and trained for each one. With a custom LOO method, as detailed in Section 4.5.7, the desired CV behavior can be achieved, yielding ten folds for the LOSO and six for the LOTO CV.
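The per-model pipelines and fold counts described above can be sketched as follows (the model selection and preprocessing steps shown are illustrative assumptions, not the exact configuration of Section 3.4.2):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

def build_pipelines():
    # Scale-sensitive models (e.g. SVM or MLP) receive a scaler step,
    # while tree-based models such as RF can consume the raw features.
    return {
        "SVM": Pipeline([("scaler", StandardScaler()), ("model", SVR())]),
        "RF": Pipeline([("model", RandomForestRegressor(random_state=0))]),
    }

def fold_count(cv_strategy):
    # Ten held-out subjects for LOSO, six held-out activities for LOTO.
    return 10 if cv_strategy == "LOSO" else 6
```

Wrapping each model in its own Pipeline lets the preprocessor be swapped per model without changing the surrounding training loop.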
The parameter grid for hyperparameter tuning is extracted but remains empty in this case, as the models are trained with their default settings.

GridSearchCV framework. Finally, the GridSearchCV framework, as described in Section 3.4.3, is used to train and validate the models. The pipeline, parameter grid, and custom LOO strategy are incorporated into GridSearchCV, with scoring based on the negative root mean squared error (NRMSE) (Equation (3.11)), since GridSearchCV maximizes the score while the objective is to minimize the error. The model is then fit using the preprocessed training data (X_train and y_train), ensuring a structured and systematic evaluation of the sensor data across multiple subjects and activities.

Training results. To obtain the results of the ML models, the mean_test_score and std_test_score are extracted from grid_search.cv_results_, and the split_test_scores are additionally saved for each individual fold. Finally, the results are stored as a JSON file to allow easy access to the dictionary when needed.
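Putting these pieces together, a self-contained sketch of the GridSearchCV setup (the data, model, and grid values are illustrative; the splitter mirrors the custom LOO generator of Section 4.5.7):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

def loo_splits(group_sizes):
    # Yield one (train, test) split per contiguous group of samples,
    # i.e. per subject (LOSO) or per activity mode (LOTO).
    total, start = int(np.sum(group_sizes)), 0
    for size in group_sizes:
        test = np.arange(start, start + size)
        yield np.setdiff1d(np.arange(total), test), test
        start += size

rng = np.random.default_rng(0)
X, y = rng.random((30, 4)), rng.random(30)   # 3 groups of 10 samples each
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [2, 4]},        # illustrative grid
    cv=list(loo_splits([10, 10, 10])),
    scoring="neg_root_mean_squared_error",   # maximized, so RMSE is minimized
)
grid.fit(X, y)
mean_scores = grid.cv_results_["mean_test_score"]
```

The negative sign lets GridSearchCV's score maximization minimize the RMSE; the fitted search exposes mean_test_score, std_test_score, and the per-fold split scores via cv_results_, which can then be dumped to JSON.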
4.5 Problems/Challenges and Solutions

This section addresses implementation challenges encountered across all scripts, including handling trailing NAN values, accurately calculating the ground truth, handling split datasets, determining the metabolic system’s sampling rate, extracting peak activation levels from EMG data, and applying a custom LOO CV approach.

4.5.1 Get the Peak Activation Level for EMG Data

A key step in preprocessing EMG data is normalizing it to the peak activation level of each muscle group per session. The challenge is that during the first script, only the current activity and sensor group are accessible before saving the Pickle file, whereas calculating peak activation levels requires simultaneous access to all EMG data within the same session. Additionally, normalization is not the final processing step in this script; without it, the EMG composite sums cannot be calculated. Skipping normalization would lead to incomplete and unstructured data, requiring an extra preprocessing step later. A cleaner approach is to first extract all EMG data, compute the peak values for each muscle group, and store them separately for use in the first script. To achieve this, an additional script is run before the first script to collect the peak activation levels of each muscle signal per session for each subject; session 1 (Table 4.2) comprises the activities Walking, Incline, and Backwards, while session 2 (Table 4.3) comprises Running, Cycling, and Stairs. The muscle signals were collected separately for the left and right leg and include: ’Gluteus maximus’, ’Rectus femoris’, ’Vastus lateralis’, ’Semitendinosus’, ’Biceps femoris’, ’Medial gastrocnemius’, ’Soleus’, ’Tibialis anterior’. This script processes the raw MATLAB data, considering only the EMG signals, and converts and preprocesses the data in the same way as the main script (Section 4.4.1), up to the normalization step.
Peak activation levels are stored in a dictionary with keys for session ID, position, and sensor. Finally, they are written to a CSV file with columns for Subject ID, Activity, Position, Sensor Name, and Value to ensure proper data handling. While all column combinations are listed, some may not appear in the data, for example, Stairs for Subject 1.

4.5.2 Split Data: Backwards2 and Cycling2

Due to recording issues (Section 4.3.5), two activity sessions were split into separate recordings (Table 4.2, Table 4.3). In the first script, the standing ground truth value for the split activity data is taken from the first recording and applied to both, ensuring they share the same baseline for the net metabolic cost before merging. To ensure proper smoothing at the borders, the merging itself is performed after conversion and before resampling. Both recordings were retrieved, and the Activity Mode name was adjusted by removing the ’2’ suffix for proper data alignment. Their data arrays were then concatenated, and the second recording was subsequently dropped from the dataframe. Since the standing and sitting sections are removed for all activities, the overall structure remains consistent with the other recordings.

4.5.3 Handling Trailing NAN Values

As described in Section 4.3.5, some recordings contain missing data filled with NAN values at the end. To ensure clean data processing, these values must be removed at the beginning of the first script. Early handling of this problem is possible since all sensors within a group are sampled at the same rate. Trailing NANs are addressed by identifying them in the data array, determining where the NAN sequence begins, and removing that section from each sensor’s data within the group.
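A minimal numpy sketch of this trimming step (the column layout is illustrative, and it assumes each sensor contains at least one valid sample):

```python
import numpy as np

def trim_trailing_nans(sensor_group):
    # Find, for every sensor in the group, the index one past its last
    # valid (non-NAN) sample, and cut all sensors at the earliest one
    # so that every signal in the group ends at the same sample.
    cut = min(np.flatnonzero(~np.isnan(col))[-1] + 1 for col in sensor_group)
    return [col[:cut] for col in sensor_group]

a = np.array([1.0, 2.0, 3.0, np.nan, np.nan])  # ends two samples early
b = np.array([4.0, 5.0, 6.0, 7.0, np.nan])     # ends one sample early
trimmed = trim_trailing_nans([a, b])
```

Cutting every sensor of a group at the same index keeps the group aligned, which the shared per-group sampling rate and the later resampling step rely on.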
4.5.4 Determining the Sampling Rate of the Metabolic System for Breath-by-Breath Data

To apply a Gaussian kernel (Section 3.1.2) to the signals, the sampling rate is required. However, the sampling rate of the Metabolic System sensor group is defined as Breath-by-Breath rather than as a numerical value. Two approaches could be used to determine the sampling rate: one is to analyze the time values and calculate the number of samples per second; the chosen approach, however, was to compute the mean breath frequency. This was done using the ’Breath Frequency’ sensor within the Metabolic System group, providing an estimated mean sampling rate for each sensor within the group.

4.5.5 Ground Truth

To calculate the ground truth, the following inputs are required: the time values, activity code, VO2 values, VCO2 values, the subject’s weight, and the previous value, which is used only for split data. First, the index borders of the activity code are determined to identify where each activity section begins and ends. The GT value is then calculated separately for each activity section. The time values are extracted for each activity section using the lower and upper borders. Three minutes are subtracted from the upper border to determine the middle index. If the activity duration is shorter than 3 minutes, the 3-minute mean cannot be computed and is handled later; this applies only to split data (Section 4.5.2). Next, the metabolic cost is calculated using the EE equation (Section 3.2.1) with the VO2 and VCO2 values between the middle and upper indices. The mean over this 3-minute period is then calculated and extended over the entire 6-minute period. The values are normalized by the subject’s weight, listed in Table 4.1. Finally, the net metabolic cost is calculated by subtracting the baseline metabolic cost.
The baseline metabolic cost is the standing (or, for Cycling, sitting) metabolic cost taken from the first value of the array, as each training session begins with it (Table 4.2, Table 4.3). However, if the session is too short to calculate a valid value, the value from the previous session is used, as explained for the split data (Section 4.5.2).

4.5.6 Handling Missing Values for Subject 6

Missing values are handled prior to calculating the Pearson correlation coefficient and are excluded from the training data to ensure that NAN values do not affect the calculations.

PCC. To handle the missing values (Section 4.3.5) of Subject 6 at the Right Wrist position for the sensors ’Acceleration - Vector Magnitude’, ’Electrodermal Activity’, and ’Skin Temperature’, the following steps are performed: if the subject ID is 6, the position is ’Right Wrist’, and the sensor is one of the specified types, the signal array and GT array are zipped together; all tuples containing NAN values are then filtered out, and the arrays are unzipped back into their original form.

Training Problems. The approach for gathering clean training data depends on whether NAN values are present in the sensors selected for the chosen model training, as not all sensors are included at all times. If necessary, any rows containing NAN values are removed.

4.5.7 Custom LOO

KFold and LOO from scikit-learn do not produce the desired behavior for CV, as the data sizes vary across subjects. Therefore, the data must be split along the boundaries of subject IDs or activity modes, depending on the chosen CV method. To address this, an iterator (Listing 4.1) was designed to split the data and return the test and train indices, processing them one at a time.
import numpy as np

def loo_with_indices(indices_array):
    # indices_array holds the number of samples per group
    # (per subject for LOSO, per activity mode for LOTO).
    size = np.sum(indices_array)
    start = 0
    for end in indices_array:
        test_idx = list(range(start, start + end))
        train_idx = [i for i in range(size) if i not in test_idx]
        yield train_idx, test_idx
        start += end

Listing 4.1: Custom LOO Generator

4.6 Evaluation Process

The evaluation of the traditional models follows the key steps outlined in the proposed approach. First, the preprocessed data is evaluated using the PCC in comparison to the benchmark data. Next, the optimal preprocessors are identified for each model. The optimization process is then carried out in three phases to fine-tune the hyperparameters. Finally, different sensor regions are compared to assess their performance.

Evaluating Preprocessing. The PCC is used to assess the quality of the preprocessing for both the unfiltered and the filtered data. The results of Ingraham et al. [15] serve as the benchmark for this comparison.

Machine Learning Methods. The ML models (LR, SVM, DT, RF, ADA, MLP), as detailed in Section 3.3, were evaluated using all 16 features. GridSearchCV was employed for hyperparameter tuning, with NRMSE as the performance metric. Two types of CV were applied: LOSO and LOTO, as introduced in Section 3.2.4. The evaluation was conducted using both CV methods, each with and without VO2 as a benchmark feature, and the results were compared to the LR benchmark from Ingraham et al. [15].

Determine best Preprocessor. The effect of preprocessing was evaluated by testing all combinations of each model with StandardScaler, Normalizer, and no preprocessing. These configurations were analyzed using both CV types (LOSO and LOTO), with the VO2 benchmark included as a feature. This enabled a comparison of how different preprocessing methods impact model performance across both CV approaches.

Tune Hyperparameters with a Parameter Grid.
GridSearchCV allows the use of a parameter grid, which is cross-validated across all of the data, enabling the evaluation of parameter tuning. Starting with the default values, a custom parameter grid is designed manually, slightly modifying the defaults to optimize model performance. Due to limited computational resources, the number of combinations is adjusted based on the model type and tuning parameters. The tuning process is carried out in three phases, with the parameter grid being refined after each iteration. The optimization is evaluated exclusively using the LOSO CV, including the benchmark, as the primary goal of this work is to adapt the application across multiple subjects. While this approach may not yield the fully optimized model, it demonstrates that parameter tuning is crucial for improving prediction accuracy in regression tasks such as estimating EE.

Evaluate Various Sensor Regions. In real-world applications, measuring all variables across different body regions is often impractical and depends on the specific application goals. To address this, key sensor regions were selected for real-time adaptation, ensuring alignment with practical constraints and application requirements. The evaluation explores six sensor configurations based on insights from related studies, summarized in Table 4.7:

• All Sensors: all 16 sensors.
• Hexoskin: Waist Accel, HR, Breath Frequency (BF), Minute Ventilation (MV).
• Smart Watch: both wrist Accels, EDA and Temp, HR, SpO2.
• EMG Pants: both EMG composite sums, Waist and Ankle Accel, possibly HR and SpO2.
• Watch and Pants: combination of the Smart Watch and EMG Pants sensor regions.
• Best Combination: the top six sensors based on correlation, including Waist Accel, EMG (left and right), EDA, HR, BF, MV.

The All Sensors configuration includes all 16 sensors, as used in the previous evaluation steps.
The Hexoskin setup, which focuses on sensors in the chest region, has been referenced in prior studies [15] and builds on the earlier work of Beltrame et al. [1]. Smart watches provide a lightweight and widely adopted sensing method, making them a practical choice for real-world applications [28, 30, 36, 38]. Similarly, sensor pants provide a wearable solution for capturing EMG signals [30, 33]. While they do not directly measure HR or accelerometer data, it is assumed that these could easily be obtained from the same region using a HR monitor or additional accelerometers. The Watch and Pants configuration simply combines the Smart Watch and EMG Pants sensor regions. Finally, the Best Combination region was derived from previous studies [15], in which the top-performing linear regression models were analyzed with an increasing number of predictors. From this evaluation, the six most relevant sensors were selected to optimize performance. While these signals could potentially be measured across various body regions, this approach assumes that reducing the sensor set for training and calculation, by removing redundant data and prioritizing the most impactful signals, is beneficial. Furthermore, it is assumed that this could lower the computational complexity, thereby improving the model’s efficiency. All regions are evaluated across all models and both CV types, with and without the inclusion of the benchmark feature, ensuring a comprehensive assessment of their performance.

Table 4.7: Definition of Sensor Regions

Region           | Accel               | EMG | EDA | Temp | BF | MV | HR | SpO2
All              | *                   |  *  |  *  |  *   | *  | *  | *  |  *
Smart Watch      | Wrist               |     |  *  |  *   |    |    | *  |  *
Hexoskin         | Waist               |     |     |      | *  | *  | *  |
EMG Pants        | Waist, Ankle        |  *  |     |      |    |    | *  |
Best Combination | Waist               |  *  |  *  |      | *  | *  | *  |
Watch and Pants  | Waist, Ankle, Wrist |  *  |  *  |  *   |    |    | *  |  *
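For use in the training pipeline, the region definitions of Table 4.7 can be encoded as a simple lookup (a sketch with illustrative keys; the optional HR/SpO2 additions for the EMG Pants region follow the note above):

```python
# Sensor regions from Table 4.7 mapped to their signal types; the accel
# entry mirrors the table's Accel column (positions for accelerometers).
SENSOR_REGIONS = {
    "All": {"accel": "all positions",
            "signals": ["Accel", "EMG", "EDA", "Temp", "BF", "MV", "HR", "SpO2"]},
    "Smart Watch": {"accel": "Wrist",
                    "signals": ["Accel", "EDA", "Temp", "HR", "SpO2"]},
    "Hexoskin": {"accel": "Waist",
                 "signals": ["Accel", "BF", "MV", "HR"]},
    "EMG Pants": {"accel": "Waist, Ankle",
                  "signals": ["Accel", "EMG", "HR"]},  # SpO2 optionally added
    "Best Combination": {"accel": "Waist",
                         "signals": ["Accel", "EMG", "EDA", "BF", "MV", "HR"]},
    "Watch and Pants": {"accel": "Waist, Ankle, Wrist",
                        "signals": ["Accel", "EMG", "EDA", "Temp", "HR", "SpO2"]},
}

def signals_for(region):
    # Return the signal types kept as features for a given region mode.
    return SENSOR_REGIONS[region]["signals"]
```

Keeping the region definitions in one lookup makes the mode selection of Section 4.4.5 a single dictionary access when slicing the feature columns.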