Institute of Architecture of Application Systems
University of Stuttgart
Universitätsstraße 38
D–70569 Stuttgart

Bachelorarbeit

Examining a Hybrid Pitting Detection Approach on Real Data

Tobias Schmid

Course of Study: Data Science
Examiner: Prof. Dr. Marco Aiello
Supervisor: Robin Pesl, M.Sc., Lisa Binanzer, M.Sc.
Commenced: August 8, 2022
Completed: February 7, 2023

Kurzfassung

To solve the problem of automatic pitting detection on a running gearbox, a hybrid model is presented and examined. Most work on automatic pitting detection falls under supervised machine learning, with previously labeled data. In addition, pitting damage is often simulated by drilled holes. For pitting detection on a running gearbox, however, a model must be able to operate on unlabeled data. The presented hybrid solution makes it possible to train a classification algorithm on unlabeled data. To test the method, data from gearboxes is used of which it is only known that they exhibited pitting at the end of their runtime. These data offer a good approximation of an actual operational scenario. The approach can be roughly divided into two parts. First, the underlying structure of the data is learned by an autoencoder. As soon as a significant change in this structure can be detected by means of clustering, a classifier in the form of a Long Short Term Memory model is trained on the data. The labeling of the data follows the labeling of the clusters that were identified as separable in time. The goal is to apply the classifier as part of an adaptive operating strategy and thus to increase the lifetime of the gear by varying the input torque.

Abstract

We propose a hybrid detection approach for pitting detection in gears.
Most research on pitting detection with machine learning is done on supervised data and often on simulated pitting. For pitting detection in practice, however, an unsupervised approach is required. The main idea behind the proposed solution is the ability to leverage elements of supervised machine learning models for pitting detection while operating on unlabeled data. The training data used for this algorithm is taken from gearboxes with actual pitting failure and without prior knowledge about pitting size during different stages of operation, providing a realistic case for an operational scenario. The approach consists of two parts. At first we try to detect changes in the underlying structure of the vibration data, using the fast Fourier transform to obtain frequency spectra that are fed to a sparse autoencoder. The encoded, reduced feature space is then clustered to look for separability by time. In case there are significant changes in the underlying structure, a Long Short Term Memory model is trained to see if the changes can be validated as actual pitting damage. The LSTM can then further be used to adapt quickly to the pitting damage, with the goal of prolonging the lifespan of the gear.

Contents
1 Introduction
2 Background Information
2.1 Gear Terminology
2.2 Pitting
2.3 Vibration data
2.4 Deep Neural Networks
2.5 Sparse Autoencoders
2.6 Long Short Term Memory Models
2.7 Gear damage theory and detection
2.8 Disentangled Tone Mining
2.9 Deep Learning Cepstral Classification
3 A hybrid model for pitting detection in operation
3.1 Constraints and Requirements
3.2 Structure of the Hybrid approach
3.3 Implementation of the Sparse Autoencoder
3.4 MFCC Computation
3.5 LSTM Architecture
3.6 Comparison between Autoencoder and LSTM
4 Realisation of Solution
4.1 Experimental Setup
4.2 Data selection and model control flow
4.3 Data Import
4.4 Vibration Data Transformation
4.5 Sparse Autoencoder Implementation
4.6 Implementation of the supervised phase
5 Evaluation of Solution
5.1 Global view of the underlying structure of the data
5.2 Accuracy of SAE for different Parameters
5.3 Results of the Autoencoder
5.4 Accuracy of the LSTM for different Parameters
5.5 Evaluation of the identified points
6 Conclusions
6.1 Limitations
6.2 Outlook
Bibliography

List of Figures
2.1 Line of action and pitch point [Gea]
2.2 Pitting damage on a gear [Ber; HMO11]
2.3 Sine wave summation
2.4 Sparse autoencoder [Mas]
2.5 Multilayer Autoencoder
2.6 Vibration signal composition
2.7 Simple damage signal
2.8 Damage in spectrum and cepstrum
2.9 Mel filter bank
3.1 The two phases of the hybrid detection approach
3.2 Encoded data with significant shift
3.3 Steps of MFCC computation
4.1 The Hybrid model step by step
4.2 Application of a Hanning window to a segment of vibration data
4.3 Vibration data of gear 6 before (left) and after (right) Fourier transform
4.4 Calculation of the mel filters
4.5 Code of the single layer autoencoder model
4.6 Code of the sparsity penalty as part of the loss function
4.7 TensorFlow Code of the LSTM model
5.1 Underlying structure across files
5.2 Frequency spectrum comparison between data with and without pitting damage
5.3 Activation of nodes on well separable data
5.4 Activation of nodes on poorly separable data
5.5 Percentage of points correctly identified as part of their file for different β
5.6 Percentage of points correctly identified as part of their file for different ρ
5.7 Average frequencies across 3000 spectra, before and after the recording break
5.8 Classification Accuracy of LSTM for different separation points over gear lifetime

List of Tables
5.1 Parameter Table
5.2 Accuracy of Clustering for different β on all reduced dimensions
5.3 Accuracy of Clustering for different ρ on all reduced dimensions
5.4 Accuracy of Clustering for different λ on all reduced dimensions
5.5 Accuracy of Clustering for different layer structures on all reduced dimensions
5.6 Accuracy of LSTM for different number of Training Samples
5.7 Accuracy of LSTM for different number of LSTM Nodes
5.8 Accuracy of LSTM for different number of LSTM Nodes
5.9 Confusion matrix for classes separated by the identified points. Actual classes = rows, Predicted classes = columns

1 Introduction

Wind energy is an important component in reducing carbon emissions to fight climate change. One of the major cost factors of a wind turbine is its maintenance, which can amount to up to 75% of the investment cost of the turbine over its lifetime [Vac02]. One of the potential damage sources, pitting damage in the wind turbine's gears, is responsible for long maintenance times of an average of 4 weeks [NB21]. During this time, the wind turbine is not able to produce energy, leading to an estimated average loss of €50,000 per maintenance period.
In order to combat pitting damage, the condition of the turbine is constantly monitored using condition monitoring systems. These systems are, among other things, used to reduce the maintenance time by scheduling maintenance in advance [NB21]. Gretzinger proposed another use case for condition monitoring: using the data of the pitting damage signal to increase the remaining lifespan of a gear affected by pitting [Gre22]. The basic idea is to decrease the load on the affected teeth and compensate for the loss of rotational force by increasing that force on the teeth that are not affected by pitting [Gre22]. In order to test and automate this proposed solution, automatic methods of pitting detection are needed. While there has been a lot of research in the field of automatic pitting detection [EA14; ERM+14; MCC+19; ÖSY08; QZH+19; ZD13], many of these approaches rely on simulated, drilled pitting data. Furthermore, most machine learning approaches are supervised models relying on labeled data for model training. Due to the nature of gear vibration data, it is typically not possible to use a pre-trained model on a different gear setup, making it hard to use supervised machine learning. In order to still leverage the power of supervised machine learning models, in particular the LSTM proposed by Medina et al. [MCC+19], a hybrid pitting detection approach is proposed. The algorithm first trains a sparse autoencoder like the one proposed by Qu et al. [QZH+19] to detect changes in the underlying structure of the data. Once such a change is spotted, the model trains an LSTM classifier on the data, with labels provided by a clustering of the underlying structure. The classifier is then validated on a validation data set. The data used to test the hybrid approach is not simulated pitting data, but rather data from a gearbox that at some point experienced pitting failure, allowing for more realistic conditions.
The goal of the hybrid model is to identify actual pitting damage on a gear without using supervised methods or simulated pitting. To better understand the problem of pitting in gears, Chapter 2 will provide information on gears, pitting damage and how the vibration signal of a gear is affected by pitting damage. An overview of the current state of the art of pitting detection approaches is also provided. The hybrid pitting detection approach itself will be introduced in theoretical terms in Chapter 3, while the implementation details can be found in Chapter 4. A discussion of results for different model parameters is located in Chapter 5. The paper concludes in Chapter 6 with a short discussion of the limitations of this method, as well as further improvements that might be necessary to use this technique in a live environment.

2 Background Information

Below is a short primer on what pitting in gears is and how pitting can be identified using vibration data. This section also includes a short introduction to the machine learning algorithms used in this paper.

2.1 Gear Terminology

Gears can be thought of as rotating wheels with teeth, which transmit torque and speed from one wheel to another when the teeth of the gears mesh with each other. The path on which the force is transmitted when two gears mesh is known as the line of action. The point where the line of action intersects the imaginary line that connects both axes (central points) of the gears is known as the pitch point. The circle around the axis which goes through all pitch points is known as the pitch circle. An example of this can be seen in Figure 2.1. The pitch point is both on the vertical line in the middle, which connects the axes of the gears, and on the line of action. Note that the pitch point coincides with the point of contact, and that the line of action is a straight line. Neither is always the case, but both hold for involute gears.
These gears have teeth with a special shape, like the ones in Figure 2.1. A more detailed explanation of gears can be found in engineering handbooks like [Dub13].

Figure 2.1: Line of action and pitch point [Gea]
Figure 2.2: Pitting damage on a gear [Ber; HMO11]

2.2 Pitting

There are two main theories on the formation of pitting [Kna88]. The first considers pitting a result of wear over time. The tooth surface will slightly degrade over time under continued cyclical load. This degradation shows up in the form of small cracks, both above and below the pitch circle. Oil will naturally fill these small cracks. When the tooth meshes with the other gear, the oil above the pitch circle is pressed out of the cracks, while the oil below the pitch circle is pressed further into the cracks. As the oil cannot compress, small particles of the material are broken out of the gear, in the form of smaller secondary cracks. This would also explain why pitting occurs beneath the pitch circle. The second theory assumes that pitting starts beneath the surface. Material defects close to the surface lead to local spikes in pressure. These spikes then result in small cracks, which combine into bigger cracks under load. The cracks are slightly angled toward the surface (20°) and run towards the base of the tooth [BLGM17]. Once a crack reaches the surface, material is broken out of the gear. The load that the surface is exposed to increases the damage, turning the cracks into deeper pits. Pitting is visible as shallow pits of varying size on the surface of the gear tooth. A severe case is shown in Figure 2.2. Pitting grows over time, and the rate of growth also increases over time. At some point the tooth will break away from the gear as a result, because it has become too weak [Gre22]. This point is reached once either one pit grows to a size of 4% of the surface of the tooth it is on, or once all pits combined grow to a size of 1% of the combined tooth surface.
These conditions are referred to as the 4% criterion and the 1% criterion, although these percentages might differ for other types of gears, like unhardened gears [NW03].

Figure 2.3: Sine wave summation

In the presence of pitting, the vibrations in the gear box increase significantly. Pitting reduces the contact area and affects the contact stiffness [Ngu02]. This results in a change in amplitude in the gear box vibrations. This change in amplitude is the basis for automated pitting detection in the works used in this paper.

2.3 Vibration data

Vibration data is typically recorded by an accelerometer, which measures acceleration. The basic structure of an accelerometer is a piece of test mass connected to a spring. The sensor measures acceleration by the spring's compression once it experiences acceleration towards it. The data is often recorded as multiple measurements in small fixed time intervals. Most standard accelerometers measure up to 30,000 data points per second [Len]. As the data measures the vibration of rotating gears, the data is cyclical in nature. A basic form of cyclical data would be a sine function:

f(x) = a · sin(b · (x + c)) + d    (2.1)

where a is the amplitude, b is the frequency, c is the displacement on the x axis (responsible for the phase of f(x)), and d is the displacement on the y axis. More complex cyclical data will often be a combination of simple signals. The thin lines in the upper diagram of Figure 2.3 are graphs of the functions f(x) = 4 · sin(x), g(x) = 2 · sin(2x) + 2 and h(x) = sin(4x) + 4. When added up, they combine to a more complex function F(x) = f(x) + g(x) + h(x) = 4 · sin(x) + 2 · sin(2x) + sin(4x) + 6. While the simpler functions have clearly defined frequencies, determining the actual frequency content of F(x) is harder.

2.3.1 Fourier Transform

In order to better assess the amplitudes of the vibrations, Fast Fourier Transform (FFT) is performed.
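The relationship between the summed sine waves and their FFT amplitudes can be checked numerically. The following NumPy sketch (the sampling parameters are my own choice, not taken from this thesis) samples F(x) from the example above over eight full periods of sin(x) and reads the amplitudes off the FFT bins:

```python
import numpy as np

# Sample F(x) = 4*sin(x) + 2*sin(2x) + sin(4x) + 6 over 8 full periods of sin(x)
n = 1024
x = np.linspace(0, 8 * 2 * np.pi, n, endpoint=False)
F = 4 * np.sin(x) + 2 * np.sin(2 * x) + np.sin(4 * x) + 6

# Real-input FFT; scale bin magnitudes so they equal the sine amplitudes
spectrum = np.fft.rfft(F)
amps = 2 * np.abs(spectrum) / n
amps[0] /= 2  # the DC (offset) bin is not doubled

# sin(x) completes 8 cycles in the record, so its amplitude lands in bin 8;
# sin(2x) in bin 16, sin(4x) in bin 32, and the offset d in bin 0.
print(amps[0], amps[8], amps[16], amps[32])  # ≈ 6, 4, 2, 1
```

Because the record covers an integer number of periods of every component, each sine falls exactly into one bin and the amplitude ratios 4 : 2 : 1 discussed above are recovered exactly.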
Fourier Transform takes a function of amplitude over time and outputs the amplitudes broken down by frequency [FD+22]. It is applied to break down cyclical data into its frequency components. For the purpose of vibration analysis for pitting detection, we would like to look at changes in the amplitudes of the frequency components. The result of the Fourier Transform is a function over frequencies ξ, obtained by the following equation:

f̂(ξ) = ∫_{−∞}^{∞} f(x) e^{−i2πξx} dx    (2.2)

where f(x), in our case, is the acceleration at time x. f̂(ξ) is a complex function, which means that values of f̂ are of the form f̂(ξ) = a + bi. The amplitude of frequency ξ corresponds to the absolute value of f̂ at frequency ξ:

|f̂(ξ)| = √(a² + b²)    (2.3)

In the example in Figure 2.3, if F̂(ξ) is the Fourier transform of F(x), we can obtain the amplitudes |F̂(1)| = 4, |F̂(2)| = 2 and |F̂(4)| = 1 of f(x), g(x) and h(x). The x-axis in the bottom diagram of Figure 2.3 displays higher numbers because the FFT sums over the frequencies. The relative values of the amplitudes are accurate, however: |F̂(1)| = 2 · |F̂(2)| = 4 · |F̂(4)|.

The recorded vibration data is in essence a more complex combination of many waves, resulting from a complex system with many moving parts in the gear box. As there is no practical way to calculate the function that describes the vibration data best, we instead perform the Discrete Fourier Transform (DFT), as the data is given in small discrete time intervals. FFT is the name of the actual algorithm used to perform the DFT [HJB85]. Given equidistant measurements in a time interval, FFT computes a frequency spectrum for that interval. FFT returns a list of complex numbers, one for each frequency of the spectrum [HJB85].

2.3.2 Mel Frequency Cepstral Coefficients

A cepstrum is defined as the spectrum of a logarithmic spectrum [Ran76]. Let F{f(x)} be the Fourier transform of f(x).
The cepstrum C_p [Bog63], now known as the power cepstrum, is calculated as follows:

C_p = |F{log(|F{f(x)}|²)}|²    (2.4)

Breaking down Equation (2.4), cepstrum calculation consists of the following steps:
1. Take the Fourier transform of the input signal x
2. Take the absolute values of the Fourier transform results to obtain the amplitudes of the frequency components
3. Take the logarithm of the amplitudes
4. Perform another Fourier transform on the logarithms
5. The absolute values of the second Fourier transform are the cepstral coefficients

In essence, we perform the Fourier transform to obtain the spectrum of the original data. We then perform another Fourier transform to obtain the spectrum of the spectrum, hence the name cepstrum, which is an anagram of spectrum. Some use the inverse Fourier transform F^{−1} in place of the Fourier transform at the last step. The resulting cepstrum is basically identical, only differing by a scaling factor.

The Mel Frequency Cepstrum (MFC) differs from the power cepstrum in two ways. Instead of using the Fourier transform on the logarithmic amplitudes of the spectrum, the discrete cosine transform (DCT) [ANR74] is used. DCT is related to the discrete Fourier transform, but it returns real numbers instead of complex numbers. For the purpose of cepstrum computation, what is important is that DCT returns a spectrum. The second important difference is that the frequency spectrum resulting from the first Fourier transform is scaled by the Mel-scale. 1000 Mel are exactly 1000 Hz, but the scale is logarithmic for frequencies above 1 kHz:

f_mel(x) = 2595 · log₁₀(1 + x/700)    (2.5)

Written as one equation, with D as the DCT:

C_mel = |D{log(f_mel(|F{f(x)}|²))}|²    (2.6)

Mel Frequency Cepstral Coefficients (MFCC) are commonly used in audio tasks such as speech recognition [ISY12], as the Mel-scale emulates human hearing more closely than a linear frequency mapping.
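The five cepstrum steps above, together with the Mel-scale conversion from Equation (2.5), can be sketched in NumPy. This is a minimal illustration, not the implementation used later in this thesis:

```python
import numpy as np

def power_cepstrum(signal):
    """Power cepstrum per Equation (2.4): the spectrum of the log power spectrum."""
    spectrum = np.fft.fft(signal)          # step 1: Fourier transform
    power = np.abs(spectrum) ** 2          # step 2: squared amplitudes
    log_power = np.log(power + 1e-12)      # step 3: logarithm (epsilon avoids log(0))
    second = np.fft.fft(log_power)         # step 4: second Fourier transform
    return np.abs(second) ** 2             # step 5: squared absolute values

def hz_to_mel(f_hz):
    """Mel-scale conversion per Equation (2.5)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

# 1000 Hz maps to (almost exactly) 1000 Mel, by construction of the scale
print(round(hz_to_mel(1000.0), 1))

# The power cepstrum of a harmonic signal is real-valued and as long as the input
t = np.arange(0, 1, 1 / 1024)
sig = sum(np.sin(2 * np.pi * k * 50 * t) for k in range(1, 6))
assert power_cepstrum(sig).shape == (1024,)
```

For the harmonic test signal, the regular 50 Hz spacing of the spectral peaks shows up as energy at the corresponding quefrency, which is exactly the sideband-spacing effect that makes cepstra useful for gearbox analysis.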
Points that are of equal distance on the Mel-scale are judged by humans to be of equal distance to one another [SVN37]. Humans are better at telling apart lower frequencies. For example in music, a difference of 55 Hz between low frequencies is a whole octave (A1 to A2), while for higher frequencies it is just a semitone (from A#5 to B5) in twelve-tone equal temperament. This makes MFCC suitable for speech recognition tasks, because models can more closely emulate the human perception of frequencies. MFCCs have been found useful in the area of bridge damage detection [BGND21], where damage is typically identifiable in lower frequency ranges. Cepstral analysis has been found useful in gear box fault diagnosis [Ran76] because it allows for better analysis of sidebands. Sidebands occur as the result of modulation [Ran76], including the kind of amplitude modulation that occurs as a result of pitting. One of the algorithms examined in this paper [MCC+19] uses MFCC as inputs. Why MFCC are preferred over other cepstral coefficients is not further explained in the paper. A possible explanation might be that the other inputs are acoustic data, which is often processed using Mel cepstra. Another explanation might be that the Long Short Term Memory model performs especially well on MFCC, because both are frequently used in Natural Language Processing.

2.4 Deep Neural Networks

Artificial neural networks are loosely based on animal brains. They try to emulate the behaviour of neurons that pass electrical signals between them. Neural networks typically consist of an input layer, an output layer and one or more hidden layers that are made up of nodes or neurons. If a neural network contains more than one hidden layer, it is called a deep neural network [Sch14].

2.5 Sparse Autoencoders

Sparse autoencoders are a type of unsupervised machine learning method that try to create a close approximation of their input.
They perform this task while only using a small percentage of their nodes.

2.5.1 Autoencoders

Autoencoders are a type of neural network. The goal of an autoencoder is to reconstruct an input x as well as possible. Rather than just copying the input, however, autoencoders typically operate under the constraint that they learn a set of features with fewer dimensions than the input dimension. The autoencoder then reconstructs the original input from this set of features. Conceptually, an autoencoder is made up of two functions:

Encoder: h = f(x)    (2.7)
Decoder: x̂ = g(h)    (2.8)

where x is the input, h is the hidden layer with reduced dimensionality and x̂ is the approximation of x. Written as one equation:

Autoencoder: x̂ = g(f(x))    (2.9)

2.5.2 Sparse Autoencoders

To reduce the dimensionality of the hidden representation even further, we would like the learned representation to be sparse. A sparse representation has as many entries as possible set to zero. Sparsity is often relevant in compression, but in this case it is used to further incentivize using an even smaller number of features to represent the input x. In the case of autoencoders specifically, sparsity refers to the activation of the units rather than their exact values. In this context, this means that in a sparse autoencoder only a small number of the hidden units are activated for each input. We impose sparsity in order to better capture the underlying structure of the data [Ng]. A visual example of a sparse autoencoder can be seen in Figure 2.4. The blue nodes on the left side of the encoder represent the features of the inputs. These could, for example, be pixels of an image, prices of a stock at certain times or, in our case, frequency components of vibration data. The blue nodes on the right side of the encoder represent the features of the recreated output. The number of input nodes is the same as the number of output nodes.
The orange nodes in the middle represent the hidden nodes, which contain the learned features. The dark orange nodes are the nodes that are active in recreating this particular input. In a sparse autoencoder, we would like to have a small number of nodes active at the same time [Ng].

Figure 2.4: Sparse autoencoder [Mas]

2.5.3 Multilayer Autoencoders

Equation (2.9) describes an autoencoder that contains a single hidden layer. We can chain multiple single-layer autoencoders to reduce the number of learned dimensions even further, by using the output x̂ of a previous autoencoder as the input for the next autoencoder. Given two autoencoders:

Autoencoder 1: x̂₁ = g₁(f₁(x))
Autoencoder 2: x̂ = g₂(f₂(x̂₁))

we can combine these autoencoders into one multilayer autoencoder:

Multilayer autoencoder: x̂ = g₂(f₂(g₁(f₁(x))))    (2.10)

As both the hidden layers obtained by f₁ and f₂ as well as the output of the first autoencoder obtained by g₁ are considered hidden layers, the autoencoder in Equation (2.10) has three hidden layers and is therefore considered a deep neural network. The single-layer autoencoders making up the multilayer autoencoder can have different hyper-parameters, allowing the multilayer autoencoder to learn representations for complex data. The number of input and output dimensions of each single-layer autoencoder is often the same, however, matching the dimensions of the input. An example of this can be seen in Figure 2.5. The middle blue layer is the output layer of the first autoencoder, consisting of the three leftmost layers. It is also the input layer of the second autoencoder, consisting of the three rightmost layers. The number of hidden (orange) nodes can differ from layer to layer, while the number of input / output nodes (blue) stays the same.
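One common way to impose the activation sparsity described above, following Ng's lecture notes [Ng], is to add a Kullback-Leibler penalty that pushes the mean activation ρ̂_j of each hidden unit towards a small target value ρ, weighted by a factor β. A minimal NumPy sketch (the symbols ρ and β match the hyper-parameters evaluated later; the function names and default values here are my own, not the thesis implementation):

```python
import numpy as np

def kl_divergence(rho, rho_hat):
    """KL divergence between Bernoulli distributions with means rho and rho_hat."""
    return (rho * np.log(rho / rho_hat)
            + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

def sparsity_penalty(hidden_activations, rho=0.05, beta=3.0):
    """beta-weighted sum of KL(rho || rho_hat_j) over all hidden units.

    hidden_activations: array of shape (n_samples, n_hidden) with values in
    (0, 1), e.g. sigmoid outputs of the encoder layer.
    """
    rho_hat = hidden_activations.mean(axis=0)      # mean activation per hidden unit
    rho_hat = np.clip(rho_hat, 1e-8, 1 - 1e-8)     # numerical safety for the logs
    return beta * np.sum(kl_divergence(rho, rho_hat))

# If every unit's mean activation already equals rho, the penalty is zero
sparse_acts = np.full((100, 16), 0.05)
print(sparsity_penalty(sparse_acts))  # → 0.0

# Denser activations are penalized more heavily
assert sparsity_penalty(np.full((100, 16), 0.5)) > sparsity_penalty(sparse_acts)
```

Adding this term to the reconstruction loss makes dense hidden activations expensive, so the optimizer is driven towards solutions where only a few hidden nodes fire per input.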
Figure 2.5: Multilayer Autoencoder

2.6 Long Short Term Memory Models

Long Short Term Memory models are designed to have the ability to use information that has been seen a long time ago. They fall under supervised learning techniques.

2.6.1 Recurrent Neural Networks

Unlike autoencoders, which only pass data forward from one layer to another, recurrent neural networks (RNN) can also pass data backwards. This allows neurons of RNNs to be part of loops, enabling neurons to influence themselves [WWFN16]. RNNs can keep information about previously encountered data points. As a result, RNNs are able to take context, in our case temporal context, into account [Mil12].

2.6.2 Long Short Term Memory Models

Early versions of RNNs often suffered from the "Vanishing Gradient Problem" [Hoc91]. The longer the path between two nodes n₁ and n₂, the smaller the influence of n₁ on n₂. Looking at nodes as part of a cycle that sends a signal through the cycle once per time step, one can unfold the cycle into an acyclical network by chaining copies of the network behind each other [MP88]. This means that the influence of a node with past information will drastically decrease with the number of time steps between two nodes [Hoc91]. A related problem occurs if the network uses activation functions that allow for numbers over 1. In this case, the influence of nodes with past information will drastically increase with the number of time steps between two nodes. This is known as the "Exploding Gradient Problem".

Figure 2.6: Vibration signal composition

Long Short Term Memory (LSTM) models were first proposed by Hochreiter and Schmidhuber to deal with the Vanishing Gradient Problem [HS97]. LSTM models contain cells that can store values over an arbitrarily long period of time. Each cell has an input gate, an output gate and a forget gate that regulate the information going into and out of the cell [HS96].
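The interplay of these gates can be written out explicitly. The equations below follow the common modern LSTM formulation with a forget gate (σ denotes the logistic sigmoid, ⊙ element-wise multiplication); the notation is not copied from [HS96] but is standard in the literature:

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state / output)}
\end{aligned}
```

Because the cell state c_t is carried forward additively rather than through repeated multiplication, gradients can flow over many time steps without vanishing, which is the key property for the long-range dependencies discussed above.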
This enables LSTM models to store past data for an arbitrary amount of time and allows them to use past data only when it is relevant for the current data point [Hoc91]. LSTM models are therefore very useful when dealing with time series data, like the vibration data for pitting detection. They are able to look at current frequency inputs while also taking historical frequency spectra into account. This could allow them to better identify deviations from previous frequency data [MCC+19].

2.7 Gear damage theory and detection

Early attempts at damage detection in gears, mostly for tooth cracks, focused on the concept of the mesh frequency [McF87]. The mesh frequency f_mesh is the rate at which two gears mesh over a fixed time interval, typically a second (measured in Hertz). f_mesh is calculated as follows:

f_mesh = f₁ · N₁ = f₂ · N₂    (2.11)

where fₙ is the number of rotations of wheel n per second, and Nₙ is the number of teeth of wheel n. As an example, we consider two gears: gear₁ with 15 teeth and gear₂ with 30 teeth. The input shaft is attached to gear₁, making it turn with a frequency f₁ of 25 Hz. As gear₂ has twice as many teeth, its rotation frequency will be half that of gear₁: f₂ = f₁ · 15/30 = 12.5 Hz. The mesh frequency of this system according to Equation (2.11) is f_mesh = f₁ · 15 = 375 Hz. The mesh waveform, which is the vibration signal of the meshing gears, has the mesh frequency. As the mesh of teeth is responsible for load transfer, the mesh waveform will have the highest vibration amplitude [Inc]. To reflect this in our example, w_mesh(x) = sin(2 · π · f_mesh · x) will have an amplitude of 1, while w₁(x) = 0.4 · sin(2 · π · f₁ · x) and w₂(x) = 0.5 · sin(2 · π · f₂ · x) have amplitudes of 0.4 and 0.5 respectively. The resulting waves are shown in Figure 2.6.

Figure 2.7: Simple damage signal

McFadden defines the vibration data as a combination of two signals.
The "regular signal", which is the combination of the fundamental mesh frequency and its harmonics (waves with frequencies 2 · f_mesh, 3 · f_mesh, ...), is responsible for the vibration of a single tooth [McF87]. The regular signal basically represents the vibrations that should result from the mesh of two teeth. Subtracting the regular signal from the original yields the "residual signal". This signal represents the deviations of the actual signal from the regular signal. As random noise is reduced through time domain averaging before calculating the regular signal and residual signal, the residual signal should not be random but rather have some cause. The idea is to analyse the residual signal for patterns that indicate gear damage, as gear damage should show up in the form of a deviation from the norm [McF87]. Damage signals should in theory show up in the data in the form of frequent pulses at roughly equal time intervals, because they would affect the overall vibration signal whenever the damaged tooth is struck [Whi84]. These signals may contain slight variations in length and frequency due to noise, and they will change over time in amplitude and rate of repetition as the damage increases [LM92]. It is also theorized that damage first registers in the low frequency ranges and then emerges in higher frequency ranges as the condition of the gear deteriorates [LM92]. In our example, a simplified damage signal is added to the vibration signal to simulate a broken tooth at the third tooth of gear₁. The signal and its effect on the overall signal can be seen in Figure 2.7. It has a rectangular shape, while in real systems the damage signal would look more like a diminishing sinusoid, as the impulse would be absorbed by the material [Whi84]. Spotting the differences in the vibration signal of a damaged gear that result from the added damage signal is the main idea behind most early fault detection research.
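McFadden's regular/residual decomposition can be sketched numerically. This is an illustrative toy reusing the 25 Hz shaft and 375 Hz mesh frequencies from the example; the sample rate, harmonic amplitudes and the rectangular damage impulse are my own choices:

```python
import numpy as np

fs = 10_000                          # assumed sample rate in Hz
f_mesh, f_shaft = 375, 25            # mesh and input shaft frequency from the example
t = np.arange(0, 0.1, 1 / fs)        # 0.1 s of signal (1000 samples)

# Regular signal: mesh fundamental plus two harmonics (2*f_mesh, 3*f_mesh)
regular = (np.sin(2 * np.pi * f_mesh * t)
           + 0.3 * np.sin(2 * np.pi * 2 * f_mesh * t)
           + 0.1 * np.sin(2 * np.pi * 3 * f_mesh * t))

# Measured signal: regular part plus a simplified once-per-revolution damage pulse
measured = regular.copy()
rev_samples = fs // f_shaft          # samples per revolution of the driving gear
measured[::rev_samples] += 2.0       # rectangular impulse when the damaged tooth meshes

# Residual signal: deviations of the measured signal from the regular signal
residual = measured - regular
print(np.count_nonzero(residual))    # → 3 pulses in 0.1 s (at 0.0 s, 0.04 s, 0.08 s)
```

Subtracting the healthy mesh pattern leaves exactly the once-per-revolution impulses, which is the kind of regularly spaced deviation that the residual-signal analysis looks for.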
This can be done by looking at the vibration signal over time [McF87; TWZ+14], by analyzing the frequency spectrum [ERM+14], or by using other forms of wavelet transforms of the original signal [ÖSY08; TDZ+16]. Others have applied cepstral analysis [EA14; ZD13]. How the damage patterns in our example show up in the frequency spectrum and the power cepstrum of the vibration data can be seen in Figure 2.8. In the upper diagram, the mesh frequency is clearly prevalent in both the healthy signal and the damaged signal. However, the sidebands (adjacent higher and lower frequencies) of the mesh frequency tend to fluctuate more heavily in the damaged signal. This type of fluctuation is captured well in the cepstra, where we have clear spikes at certain quefrencies.

Figure 2.8: Damage in spectrum and cepstrum

While a lot of early research has focused on highlighting and visualizing the differences in the signal between a healthy gear and a damaged gear, machine learning approaches have gained popularity in damage classification. Models used include support vector machines [KSH11; Sam04; SMR07], random forests [LSZ+16], and various types of neural networks [ÜOK16; XJYD20]. In this work, two of these deep learning approaches are examined. Qu et al. studied the use of sparse autoencoders on unlabeled vibration data to create "meaningful" disentangled features in order to perform fault diagnosis in operation [QZH+19]. Medina et al. implemented an LSTM model for pitting detection using MFCC [MCC+19]. While both of these approaches showed promising results, the data was collected from gears with artificial pitting damage, rather than real pitting damage; pitting was simulated by drilling into the gear. In this work, we will look at how both of these approaches behave on real pitting damage in operation.
This might be useful if we want to incorporate one or both of these methods into a system that can recognize pitting damage early on and take measures to slow down pitting deterioration in operation [Gre22]. Before we examine the performance of the models on real data, we will need a better understanding of how they work. Below is a more detailed overview of the core ideas and design of both models.

2.8 Disentangled Tone Mining

The goal of the Disentangled Tone Mining (DTM) approach is to extract the hidden structures behind the vibration data and to disentangle them into separate tones, hence the name of the approach [QZH+19]. It can be thought of as a form of dimensionality reduction, compressing the feature space from hundreds of single frequency components to a combination of a small number of ideally uncorrelated features that represent the different original signals of the vibration signal. The process of obtaining these features consists of three steps.

2.8.1 Fast Fourier Transform

The first step is obtaining the frequency spectrum of the vibration data by means of FFT. The vibration data is split into equal time segments, with each segment roughly capturing one revolution of the driven gear. The authors used a driving gear with 40 teeth and a driven gear with 72 teeth, so each time segment should contain at least one revolution of each gear. In case of pitting, the damage signal should therefore be captured in all segments. The frequency spectrum is calculated for each time segment and used as input for the sparse autoencoder. Each input consists of the spectrum for one time interval and each feature of the input is a frequency component of the spectrum. The frequency spectrum is used instead of the raw vibration data because it should better capture the different frequencies belonging to the different signals, like the damage signal. Further preprocessing is avoided for two reasons.
First, a desired goal of the approach is to learn features without requiring prior human knowledge, and second, it allows for faster model performance in operation.

2.8.2 Sparse Autoencoder

In order to reduce the frequency features to a smaller number of components that can distinguish between different signals, a Sparse Autoencoder (SAE) is used. The basic architecture of an autoencoder is given in Equation (2.9) and Equation (2.10). The SAE will try to recreate the frequency spectrum with only a small number of features. These features should capture the underlying structure of the data, including the damage signal. The number of learned features n should be significantly smaller than the number of input features N (n ≪ N). A sparsity constraint is used to further reduce the number of relevant features. This sparsity constraint takes the form of a hyper-parameter in the loss function that incentivizes a low number of hidden unit activations. The features are obtained by training the SAE on the frequency spectra and then using the encoder of the trained SAE on each frequency spectrum to obtain its low-dimensional representation, in the form of an m by n dimensional feature matrix M, where m is the number of encoded inputs and n is the number of learned features.

2.8.3 Further Feature Reduction

In order to further reduce the number of features, the authors perform the following feature pruning technique [QZH+19]:

• Perform QR decomposition on the learned feature matrix M
• Calculate the eigenvalues of the upper triangle matrix R, which are also the eigenvalues of M
• Determine the top k eigenvalues by an automatic threshold, like λ_k > 10 · λ_{k+1}
• Keep only the dimensions associated with λ_1, ..., λ_k

This process returns a list of the k most relevant features, where k is chosen automatically based on the feature matrix M rather than being a fixed number.
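The pruning steps above can be sketched in a few lines of numpy. The function name and the tie handling are our own choices; since R is upper triangular, its eigenvalues are simply its diagonal entries.

```python
# A literal sketch of the QR-based feature pruning described above.
import numpy as np

def prune_features(M, ratio=10.0):
    """Keep the columns of the m-by-n feature matrix M whose associated
    eigenvalues of R dominate, using the threshold lambda_k > ratio * lambda_{k+1}."""
    Q, R = np.linalg.qr(M)            # reduced QR: R is n x n upper triangular
    eig = np.abs(np.diag(R))          # eigenvalues of a triangular matrix
    order = np.argsort(eig)[::-1]     # indices sorted by decreasing magnitude
    s = eig[order]
    k = len(s)                        # default: keep all dimensions
    for i in range(len(s) - 1):
        if s[i] > ratio * s[i + 1]:   # automatic threshold from the text
            k = i + 1
            break
    keep = np.sort(order[:k])
    return M[:, keep], keep
```

For a matrix with one dominant feature dimension, `prune_features` would return only that column.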
The excluded features are strongly correlated with the kept features, leaving the final feature space more uncorrelated or "disentangled". The proposed advantage of this method is that it keeps the learned features intact, which preserves some of their properties. A different dimensionality reduction technique, like principal component analysis (PCA), will not return the original features, but rather the eigenvectors of the highest eigenvalues. The final set of features can then be used as input for further analysis, like clustering.

2.9 Deep Learning Cepstral Classification

While DTM is an unsupervised deep learning approach, the LSTM model proposed by Medina et al. falls into the category of supervised learning [MCC+19]. The main goal is the classification of different stages of pitting damage.

2.9.1 Data Preprocessing

The first step in this approach is also to obtain multiple frequency spectra for short time steps of 5.12 ms, using FFT. Each frequency spectrum is then processed using a Mel filter bank. This filter bank consists of M filters with triangular shape on the Mel scale. An example of such a filter bank can be seen in Figure 2.9.

Figure 2.9: Mel filter bank

By applying the filters to the frequency spectrum, low frequencies are resolved more finely than high frequencies. Afterwards, the discrete cosine transform is applied to the logarithms of the filtered frequencies to obtain the Mel Frequency Cepstral Coefficients (MFCC). As previously mentioned, cepstral analysis is a common practice in damage detection in gears, because it can show changes in frequency sidebands, as shown in Figure 2.8.

2.9.2 LSTM Model

The MFCC and their first and second derivatives with respect to time are used as inputs for the Long Short Term Memory (LSTM) model.
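A minimal numpy sketch of the described MFCC pipeline (Mel filterbank, logarithm, DCT) might look as follows. The filter count, frequency range, and explicit DCT-II construction are illustrative choices, not values taken from [MCC+19].

```python
# Sketch of the MFCC pipeline: Mel filterbank -> log energies -> DCT-II.
import numpy as np

def mel(f):     return 2595.0 * np.log10(1.0 + f / 700.0)
def inv_mel(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft_bins, fs):
    """Triangular filters, equally spaced on the Mel scale."""
    pts = inv_mel(np.linspace(mel(0.0), mel(fs / 2), n_filters + 2))
    bins = np.floor((n_fft_bins - 1) * pts / (fs / 2)).astype(int)
    fb = np.zeros((n_filters, n_fft_bins))
    for j in range(n_filters):
        l, c, r = bins[j], bins[j + 1], bins[j + 2]
        for i in range(l, c):
            fb[j, i] = (i - l) / max(c - l, 1)   # rising edge of the triangle
        for i in range(c, r):
            fb[j, i] = (r - i) / max(r - c, 1)   # falling edge
    return fb

def mfcc(power_spectrum, fs, n_filters=40, n_coeffs=40):
    fb = mel_filterbank(n_filters, len(power_spectrum), fs)
    energies = np.log(fb @ power_spectrum + 1e-10)  # log filterbank energies
    # explicit DCT-II basis applied to the log energies
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1) / (2 * n_filters))
    return basis @ energies
```

Libraries such as scipy or librosa provide ready-made versions of these steps; the explicit construction is only meant to mirror the description above.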
The LSTM model takes the cepstra as well as their context (information about cepstrum_1, ..., cepstrum_{i−1} for cepstrum_i) to classify the data as "pitting" or "no pitting". More granular classification is possible if data with different degrees of pitting is available. The original paper distinguished between 9 classes (healthy, and 8 levels of pitting). The model is trained on labeled data first and can then predict the condition of new data.

3 A hybrid model for pitting detection in operation

Both the sparse autoencoder based disentangled tone mining method and the LSTM classifier show high accuracies for identifying pitting in several stages of deterioration [MCC+19; QZH+19]. Both models were evaluated on data for which it was known beforehand what stage of pitting each data point was subjected to. Pitting damage in these cases was simulated by artificially adding pits to the gears, using a drill bit [QZH+19] or an electrical discharge machine [MCC+19]. The different stages of pitting were clearly separated and the state of the gear therefore stayed within each class of pitting damage. The models were trained to separate the data into 5 [QZH+19] and 9 [MCC+19] classes respectively. However, pitting damage detection in operation poses additional challenges due to its uncertain nature.

3.1 Constraints and Requirements

Pitting damage in data from operation will not jump from one stage of deterioration to another from data point to data point. Rather, pitting will happen gradually over time. Data from very slight pitting damage might not show a significant enough change to put it into its own cluster. On the other hand, we want to detect the pitting damage as early as possible, to slow its deterioration. The model has to be able to recognize pitting damage as early as possible by being able to identify slight changes in the data, while also being robust to random noise.
Each gearbox produces unique vibration data due to the unique material properties of each gear (no two gears are exactly the same) and slight differences in installation, so models that were pre-trained on a different setup will likely perform poorly on new data. Therefore, a model needs to be trained on the vibration data of the gear during operation. As a result, there are no labels for the data on which to verify the performance of the model during operation. The model therefore needs to be trained on unlabeled vibration data. The model also needs to be updated as new data comes in. The gearbox will produce large amounts of vibration data that need to be preprocessed to be used as model input. Keeping all the data for model training will lead to ever increasing training times. It is therefore important to consider what data to feed the model when looking for pitting, and how to update the model without missing a change in pitting deterioration. The model needs to be provided with the newest data and with data to compare the new data to, in order to see if the new data deviates from the state of normal operation.

Figure 3.1: The two phases of the hybrid detection approach

3.2 Structure of the Hybrid approach

In order to provide an automatic detection approach on unlabeled data which enables fast training and classification, we propose a combination of the unsupervised sparse autoencoder approach [QZH+19], which allows us to create labeled data, and the powerful LSTM model [MCC+19]. The model operates in two phases. It starts in an unsupervised phase until it suspects potential pitting damage; it will then transition into the supervised phase. The two phases can be seen in Figure 3.1. The first phase is the unsupervised phase. Ideally the model will spend the majority of its time in this state, until it detects pitting damage.
The model takes the past four minutes of vibration data and transforms the data into its frequency components. Four minutes of data are chosen, as this time frame is large enough to provide enough data for the model to learn useful information, while also being small enough to perform both preprocessing and model training in a short period of time. Fast Fourier Transform is used to obtain the frequency components. The frequency spectrum is compressed by a factor of ten before it is fed to the autoencoder. The compression is done to reduce the required training time of the autoencoder. The input for the autoencoder is a matrix with each row being a list of amplitudes at different frequencies. Using the frequency data x, we train an autoencoder x̂ = g(f(x)) that tries to replicate the original data x by using a smaller number of parameters h = f(x). The autoencoder is also trained to be sparse. This forces the autoencoder to have most values of h be 0 for any given input x. Sparsity is introduced to further reduce the number of variables, encoding more information on each active node of h. After training, we use the encoder function f of the autoencoder to encode the frequency data vector x into the much smaller vector h. This vector h represents the underlying structure of the data learned by the autoencoder. It contains more information about the overall structure of the frequency spectrum per parameter than any of the single frequency parameters. We then take the underlying structure vector h and see if we can split the data points into two clusters using a K-Means algorithm. The idea is to obtain two clusters that can be separated by a point in time. That point in time, called the separation point in this work, could indicate a shift in the underlying structure at that moment.
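Given the time-ordered K-Means labels of a window, the check for such a separation point can be sketched as follows. The purity tolerance and the function itself are illustrative assumptions; the text does not prescribe how strictly the two clusters must separate in time.

```python
# Look for an index that splits the time-ordered cluster labels into two
# nearly pure halves: one cluster before the index, the other after it.
import numpy as np

def find_separation_point(labels, min_purity=0.95):
    """Return the first index splitting the labels into two time-separable
    clusters, or None if no such separation point exists."""
    labels = np.asarray(labels)
    n = len(labels)
    for split in range(1, n):
        before, after = labels[:split], labels[split:]
        # either cluster id may come first in time
        for a, b in ((0, 1), (1, 0)):
            purity = (np.sum(before == a) + np.sum(after == b)) / n
            if purity >= min_purity:
                return split
    return None
```

A window whose labels alternate randomly yields no separation point, while a clean shift in the underlying structure yields the index of the shift.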
In theory, as pitting damage gets worse over time, the damage should lead to a change in the frequency spectrum from the point that pitting is formed, as can be seen in Figure 2.8. If the separation point is in the recorded window, the window will contain two types of frequency data: the data before the separation point should contain the healthy frequency spectra, while the data after the separation point will contain the unhealthy spectra. The difference between these two types of spectra should be captured in the underlying structure of the data that gets learned by the autoencoder. By clustering the underlying structure, we check whether such a separation point exists, i.e., whether the data can be split into two clusters based on time. If such a separation point is found, the cluster labels will be used as pseudo labels for training the LSTM classifier. The input data for the LSTM will be from the same window as the data for the autoencoder. We take the uncompressed frequency data of the given window and transform it into Mel frequency cepstral coefficients (MFCC). Each frequency spectrum is filtered via predefined Mel filters. The filtered data is then logarithmically scaled. Finally, the discrete cosine transform is used to obtain the MFCC. The MFCC can be thought of as components of a spectrum of a spectrum. The cepstral components, as well as their first two derivatives, are used to train the LSTM classifier. The labels for the training will be the cluster labels of the window. The LSTM model works with batches of MFCCs to incorporate the context of previous examples into the classification. The LSTM model works on a different type of input data, allowing us to see if the changes can be spotted both in the frequency spectrum as well as in the cepstrum. While the cepstrum is computed from the spectrum, what each input data point represents differs significantly. We can validate the LSTM by looking at the next window and at data sampled from past windows.
We assume the model is able to detect pitting if it classifies the data in the next window as pitting and the sampled data from past windows as no pitting. This kind of classification would indicate a change in the vibration data that happened at some point (the separation point) during operation. The result of this validation is the final LSTM classifier. The trained LSTM allows us to classify data significantly faster than the SAE, which could help decrease the response time of the system. This is crucial when dealing with an adaptive system that tries to slow down pitting growth, as proposed by Gretzinger [Gre22]. The idea is to adapt the input rotational force by decreasing the load on the teeth with pitting [Gre22]. In order to find the teeth that are affected by pitting, we decrease the load on each tooth in turn and feed the resulting vibration data (or rather its MFCC representation) to the classifier. If easing the load on the pitted teeth returns the vibration data closer to its original state than easing the load on the other teeth does, the classifier could identify that change and reclassify the new data as healthy. We then continue to run the gear with the adapted rotational force, in order to increase its lifespan [Gre22]. A fast classifier is essential here, as we would like to minimize the time spent on checking which teeth contain pitting. This is because we want to avoid spending a long time putting additional force on the damaged teeth while checking the other teeth.

3.3 Implementation of the Sparse Autoencoder

The autoencoder takes the compressed frequency spectra and tries to represent them using a smaller number of features than the given number of frequencies. This is done by taking the vector of frequencies x and using an encoder function f(x) to map the frequencies onto a smaller vector h. A decoder function g(h) then tries to recreate x given the smaller vector h.
The result of the decoder, called x̂, is compared to x to adjust the autoencoder. Expanding on the basic architecture of an autoencoder in Equation (2.9), we can formulate f(x) and g(h) as

h = σ_1(w_1 · x + b_1) (3.1)
x̂ = σ_2(w_2 · h + b_2) (3.2)

where w_1, w_2 are weight matrices and b_1, b_2 are bias vectors. The weight matrices are adjusted during the learning process in order to learn from the data that is provided. The encoding of the data h can be seen as a linear transformation of x with w_1 and b_1. As w_1 is a matrix and x is a vector, each frequency component of x can influence all nodes of the encoding h for a single data point. The strength of that influence is mainly determined by the weight matrix w_1. σ_1, σ_2 are activation functions. These functions determine what the output of a node is, based on its input and weights. The sigmoid function was used for both encoder and decoder:

σ(x) = 1 / (1 + e^(−x)) (3.3)

The sigmoid function returns a value between 0 and 1. If the result of the function is 0, the node is considered "off". The sparsity constraint will try to enforce that the average of the activations over all nodes of h equals a sparsity parameter ρ, forcing many nodes to be off.

3.3.1 Loss Function of the Sparse Autoencoder

Let n be the number of training data points. The loss function for the SAE is defined in the original paper as follows [QZH+19]:

L(w_1, w_2, b_1, b_2 | x) = (1/2n) Σ_n ||x̂ − x||²_2 + λ · ||w||² + β · Σ_{j=1}^{s_2} KL(ρ || ρ̂_j) (3.4)

The loss function is a sum of three terms. The first term

(1/2n) Σ_n ||x̂ − x||²_2 (3.5)

calculates (one half of) the mean squared distance between the original input x and the output of the model x̂. The differences between each frequency component of x and x̂ are squared and summed. We would like to minimize this value to obtain an output x̂ which is as close to the input x as possible. The second term

λ · ||w||² (3.6)

is used for weight regularization.
This term grows with the values of the weight matrix w, which incentivizes keeping the weights as small as possible. This is used to avoid overfitting. Overfitting refers to a phenomenon in which the autoencoder simply memorizes the training data as well as possible by using large weights. The autoencoder will then perform worse on new data. The third term

β · Σ_{j=1}^{s_2} KL(ρ || ρ̂_j) (3.7)

is responsible for the sparsity of the network, where KL is the Kullback-Leibler Divergence and ρ is the sparsity parameter, according to Qu et al. [QZH+19]. It is not mentioned what ρ̂ represents in this context; however, based on the naming of the parameters, it is reasonable to assume that the loss function is taken from Andrew Ng [Ng]. Ng defines ρ̂_j as the average activation of hidden unit j, given by

ρ̂_j = (1/n) Σ_{i=1}^{n} [a_j^(2)(x^(i))] (3.8)

where a_j^(2)(x^(i)) is the activation of hidden unit j given input x^(i). The third term can be written in more detail as

β · Σ_{j=1}^{s_2} KL(ρ || ρ̂_j) = β · Σ_{j=1}^{s_2} [ρ · log(ρ/ρ̂_j) + (1 − ρ) · log((1 − ρ)/(1 − ρ̂_j))] (3.9)

where s_2 is the number of nodes in the hidden layer. Typically, the Kullback-Leibler Divergence calculates the relative entropy of two probability distributions. In this case, we can think of ρ as a Bernoulli random variable with mean ρ and ρ̂_j as a Bernoulli random variable with mean ρ̂_j. The KL Divergence punishes models whose sparsity (average rate of activation across hidden units) deviates from the desired level of sparsity ρ. A high degree of sparsity refers to a low amount of node activation. For example, ρ = 0.01 would mean that on average only one of every 100 nodes is active at the same time; for this use case, ρ = 0.01 would be considered very high sparsity. The three terms are added up to form the loss function, which the model minimizes to obtain a low-dimensional sparse representation of x. Each term is assigned a weight within the loss function.
The mean squared error loss is assigned a weight of 0.5, while the degrees to which weight regularization and sparsity figure into the model are regulated by the parameters λ and β respectively. λ was set to 0.001, β to 3, and ρ to 0.15, in line with the original paper [QZH+19]. The high β leads to a high priority on sparsity.

3.3.2 Layer structure of the Sparse Autoencoder

The autoencoder used by Qu et al. consists of the following layers [QZH+19]:

• An input consisting of 1000 frequency points
• A hidden layer of 500 hidden units, followed by a 1000 node output layer
• A hidden layer of 200 hidden units, followed by a 1000 node output layer
• A hidden layer of 50 hidden units, followed by a 1000 node output layer for the final output

This type of structure of a deep autoencoder can be thought of as a funnel. The model first learns a representation of the data that represents 1000 frequency components in 500 hidden components and then rebuilds the original input from the smaller number of hidden features. Then, the model tries to reduce the number of features in the representation even further, by using only 200 features. Finally, it uses only 50 nodes to represent 1000 features. The autoencoders are trained layer by layer. First, a good 500 node representation is found. Then, a good 200 node representation is found using the resulting x̂ of the 500 node representation. Lastly, a good 50 node representation is found using the output of the 200 node representation. This process could be continued for an arbitrary number of layers. The first encoder will remove much of the information that cannot be encoded using 500 nodes and which is therefore more likely to be random noise. The same goes for each subsequent autoencoder. In essence, this structure filters the data down to its most important components, retaining as much information as possible.
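The full loss from Equations (3.4) to (3.9), with the parameter values given above (λ = 0.001, β = 3, ρ = 0.15), can be sketched in numpy as follows. The function names are our own; `x_hat`, `weights`, and `rho_hat` would come from the forward pass of the autoencoder.

```python
# Sketch of the sparse autoencoder loss: reconstruction error,
# weight regularization, and the KL sparsity penalty.
import numpy as np

def kl_divergence(rho, rho_hat):
    """Bernoulli KL divergence between desired and actual mean activations."""
    return (rho * np.log(rho / rho_hat)
            + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

def sae_loss(x, x_hat, weights, rho_hat, lam=0.001, beta=3.0, rho=0.15):
    n = x.shape[0]                                    # number of training points
    mse = np.sum((x_hat - x) ** 2) / (2 * n)          # reconstruction term (3.5)
    reg = lam * sum(np.sum(w ** 2) for w in weights)  # weight regularization (3.6)
    sparsity = beta * np.sum(kl_divergence(rho, rho_hat))  # sparsity term (3.9)
    return mse + reg + sparsity
```

The loss is zero only when the reconstruction is perfect, all weights are zero, and every hidden unit's average activation equals ρ; any deviation in the average activations increases the sparsity term.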
The sparsity constraint has an additional benefit here. The number of active nodes gets reduced at the same rate as the number of total nodes, given the same level of sparsity for each encoder. This ensures that each subsequent layer is not simply a copy of the most relevant components of the previous layer with the least useful components dropped; each layer also has to reduce the total activation of the nodes. The structure can be written in short as 1000-500-200-50. Other tested structures included in the results were 1000-500-200-100 and 1000-500-200-20 for varying output sizes, and 1000-50 for a different number of hidden layers. For the hybrid approach we use a similar structure of 1281-560-200-50. The number of nodes in the first two layers is slightly higher due to the higher number of frequencies taken into account for the hybrid model.

3.3.3 Clustering the reduced feature space

Before clustering the obtained feature space in h, we could consider removing less relevant parameters that exist due to the sparsity constraint. The idea is to keep only those features that contain the most information about h and therefore about the frequency data x. In the original paper, the authors propose a new way of reducing the encoding of the trained SAE even further, by using QR decomposition as described in Section 2.8.3. Qu et al. do not describe further how to get the features which correspond to the highest eigenvalues of the upper triangle matrix R [QZH+19].
For the purpose of the hybrid pitting detection approach, three clustering approaches were employed and compared:

• Cluster the entire encoded feature matrix M (all 50 dimensions)
• Apply PCA for a fixed number of dimensions, then cluster along the principal components
• Apply the QR decomposition approach:
  – Apply QR decomposition on M
  – Apply eigenvector decomposition on the upper triangle matrix R
  – Keep the highest k eigenvalues (λ_1, ..., λ_k), where λ_k > 10 · λ_{k+1}
  – Keep the columns of R at the same positions as these eigenvalues, and set the rest to 0
  – Multiply Q and R to obtain the original reduced feature space, with the columns whose eigenvalues were set to 0 zeroed out
  – Cluster the non-zero features

Figure 3.2: Encoded data with significant shift

PCA stands for principal component analysis and is a form of dimensionality reduction [Pea01]. After using PCA on the 50 component matrix h, we will be left with n principal components that retain as much information about the data as possible, while also accounting for differences between data points. PCA performs linear transformations on the matrix h, which alter the components [Pea01]. It can be hard to tell how a principal component is constructed. The other methods preserve the original features that were learned by the autoencoder. This can be seen as an advantage, as it keeps the properties of the latent space according to [QZH+19]. In short, clustering among all components or using QR decomposition to obtain the most relevant features leaves (some of) the original components of h as they are, while PCA transforms them. Using either PCA or the QR decomposition approach will reduce the number of features further. The clustering method employed was K-Means with k=2. Ideally, we are looking for a point in time that separates two clusters, where one cluster only contains data points from the start of the window up to that point and the other cluster contains data from after that point to the end of the window.
This would indicate a significant shift in the underlying structure of the data, where the cause of the shift occurs at the separation point. An example of data that can be separated well can be seen in Figure 3.2. Once such a clear separation in the data is found, the model will switch to the supervised phase. The result of the K-Means algorithm will be a vector of cluster labels, 0 or 1. We will use these labels as labels for training the LSTM classifier.

3.4 MFCC Computation

The LSTM model is trained on Mel Frequency Cepstral Coefficients. The steps to obtain these coefficients can be seen in Figure 3.3. We use the uncompressed frequency data of a given time frame. The time frame in question is the same time frame in which the unsupervised model spotted a potential separation point, as we have obtained the labels for this window through the unsupervised model. A Mel filterbank for the input frequencies is obtained. This is done by calculating the Mel values of the maximum and minimum frequencies (Equation (2.5)), placing a number of equidistant points on the Mel scale, and then converting these points back to the frequency scale. These points will be spread further apart at higher frequencies. A picture of these filterbanks can be seen on the right side of Figure 3.3, as well as in Figure 2.9. We then sum up the energy between each two points [Lyoa]. This results in the signal seen in the second frame from the top in Figure 3.3. We then apply the logarithm to the obtained energies. This is typically done to resemble human hearing [Lyoa]. In this case it can be seen as a form of normalization. The final step is to take the discrete cosine transform of the logarithm of the filterbank energies. As the DCT is similar to the Fourier transform, we obtain a spectrum of a spectrum, hence the anagram "cepstrum".
The absolute values of this resulting cepstrum are called Mel Frequency Cepstral Coefficients and form the basis of the LSTM inputs. These coefficients represent the rate of change (frequency) in the filterbank energies, and by extension the frequencies of the frequencies (known as quefrencies) [Lyoa]. Afterwards, we calculate the first and second derivative in the time domain for each MFCC, leaving us with 120 data points per time interval. These derivatives improve model performance in speech processing and represent the trajectory of the MFCC over time. In order to leverage temporal context with the LSTM, we need to split our input data into batches. Each batch contains multiple data points. The more data points there are per batch, the more context is provided, as the model considers all the data points in a batch as consecutive time steps. A batch that contains 10 data points allows the model to learn from information from up to 10 time steps ago. For the next section, we assume a batch size of 10 data points.

Figure 3.3: Steps of MFCC computation

3.5 LSTM Architecture

The LSTM classifier is divided into three layers:

• An input layer of shape 10 x 120
• A hidden layer of 180 LSTM units
• A fully connected classification layer consisting of 2 units to signal presence or absence of pitting

The first layer receives the data in the correct input shape. In this case we feed the model 10 data points at a time, with 40 Mel frequency cepstral coefficients, 40 delta coefficients and 40 delta-delta coefficients. The input layer passes the data to the LSTM layer. The middle layer contains LSTM units. These units are controlled by gates (see Section 2.6.2) that allow the LSTM to store information about previous inputs without being subjected to exploding or vanishing gradients [HS97]. The last layer displays the output of the LSTM as a binary classification choice. The model returns either 0 or 1.
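As a shape-level illustration of this architecture, the following numpy sketch runs 10 time steps of 120 features through 180 LSTM units and a 2-unit output layer. The weights are random placeholders rather than trained parameters, so the sketch only demonstrates the data flow through the gates, not classification quality.

```python
# Shape-level sketch of the classifier: (10, 120) input -> 180 LSTM units
# -> 2-unit output. Weights are untrained placeholders.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(x, n_hidden=180):
    """Run a single LSTM layer over x of shape (time_steps, n_features)."""
    rng = np.random.default_rng(0)
    t_steps, n_in = x.shape
    # one combined weight matrix for the input, forget, cell and output gates
    W = rng.normal(scale=0.1, size=(n_in + n_hidden, 4 * n_hidden))
    b = np.zeros(4 * n_hidden)
    h = np.zeros(n_hidden)   # hidden state
    c = np.zeros(n_hidden)   # cell state
    for t in range(t_steps):
        z = np.concatenate([x[t], h]) @ W + b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # gated cell update
        h = sigmoid(o) * np.tanh(c)
    return h

def classify(batch):
    """Map the final hidden state to the 2-unit output layer."""
    h = lstm_forward(batch)
    rng = np.random.default_rng(1)
    w_out = rng.normal(scale=0.1, size=(h.size, 2))
    return int(np.argmax(h @ w_out))   # 0 = "no pitting", 1 = "pitting"
```

In practice a deep learning framework would provide the trained LSTM layer and output layer; the loop above only mirrors the gate equations for one batch of 10 consecutive cepstra.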
In order to add more classes to the classification, we could add more nodes; [MCC+19] uses an output layer of 9 nodes. Other types of output layers could also be considered. For example, it might be desirable to return a single value as a damage indicator, similar to [KDK19; KDK20]. The activation function for the LSTM model is the sigmoid function from Equation (3.3). The loss function of the classifier is the cross entropy function for two classes:

L = −(y · log(p) + (1 − y) · log(1 − p)) (3.10)

where y is the true label (either 1 or 0) and p is the probability calculated by the model that the label of the observation is 1 [che]. Cross entropy is a common loss function used for classification.

3.6 Comparison between Autoencoder and LSTM

In order to understand why a hybrid approach is used instead of either one of these models alone, we need to understand how they compare to each other. Both models fall under the category of neural networks, as both try to mimic the structure of the brain by using a node structure. The sparse autoencoder typically has more layers than the LSTM, as it requires the depth to learn the latent features of the data. The biggest difference, of course, is that the SAE is an unsupervised model, while the LSTM is a supervised model. The LSTM therefore requires labeled training data, while the SAE only requires the input data. The reason for this stems from the purpose of these models. The SAE is trained to approximate the input data with a smaller number of features, while the LSTM model is supposed to classify between pitting and no pitting. In order to know if it is performing this classification well, the LSTM needs feedback on the returned classifications to learn from the data. The SAE simply needs the original input to verify its result (which should ideally match the input data). Both of these models could be adapted to perform the task of the other model.
The autoencoder could be connected to a classification layer instead of a final layer that has the same dimension as the input. This layer would essentially be identical to the classification layer of the LSTM. The loss function of the final autoencoder would have to be changed to contain, for example, cross entropy instead of the mean squared error. It could still keep the other components of the loss function, including the sparsity component. LSTM models can be turned into autoencoders by chaining an LSTM model that takes a large input and reduces it to a small number of features with an LSTM model that takes those features and tries to return the original input. These models are known as sequence to sequence models [SVL14] and are often used in natural language processing. The autoencoder is trained on frequency data, while the LSTM is trained on MFCCs. This is mostly a result of using the inputs used in the original papers [MCC+19; QZH+19]. The SAE could be trained on the MFCCs and vice versa. LSTM models are often trained on MFCCs in natural language processing, and the choice of input data makes sense under that aspect. We could also feed the learned feature space to the LSTM as extra parameters; however, this risks the LSTM simply prioritizing these features and copying the results of the SAE. Having two different transformations of the vibration data as inputs allows us to see that changes are visible in both the spectrum and the cepstrum. Another main difference lies in the hidden nodes of the models. While the sparse autoencoder uses a fully connected layer, meaning every input node can influence every hidden node, the LSTM model uses special LSTM nodes. These nodes are controlled by gates and allow the model to factor past data into its classification process, with the past data dating back an arbitrary number of data points.
This is responsible for the last main difference, which is that the LSTM is fed with batches of multiple data points at once, while the sparse autoencoder learns one data point at a time.

4 Realisation of Solution

The two-phase structure in Section 3.2 can be broken up further into steps that need to be performed in each phase. These steps can be seen in Figure 4.1. The model stays in the unsupervised loop until data is found that can be separated into two distinct clusters based on time.

4.1 Experimental Setup

The data for the analysis was produced by two gearboxes, with the input gear operating at 2150 rotations per minute (35.83 Hz) with a torque of 650 Nm. These gears were two of several that were tested on the same setup. The gears were akin to those used in car gearboxes, and the operation simulated a car driving in second gear non-stop until failure. The gears were labeled as gear 6 and gear 11. The data was collected every 24 hours for 10 minutes, from start to end of the gear life cycle. Gear 6 produced 14 files of data (files 0 to 13), where data was collected from 12:00 to 12:10 every day over a span of 13 days. The exception to this is the latest data file (file 13) of gear 6, which contains data from 22:50 to 23:30 from the same day as file 12. The operation was stopped soon after the recording of file 13 due to the gear being significantly damaged.

Figure 4.1: The Hybrid model step by step

For gear 11, 51 files of data were produced. Two of the data files (files 10 and 24) contain only three minutes of data and were not recorded at 12:00, unlike the other files. There is a significant lapse in recording between files 9 and 10: file 10 was recorded 24 days after file 9, after which the daily recording schedule continued. All vibration data was recorded at 51 200 Hz, resulting in 30 720 000 data points per measurement over each 10 minute time span.
All three spatial dimensions were recorded with accelerometers.

4.2 Data selection and model control flow

In order to simulate an environment in operation, only two data files are loaded at a time. We start with the first two data files (files 0 and 1). Of these 20 minutes of data, only a four minute time frame is considered at a time. We chose four minutes because this is enough data to train the models while not being so much data that training slows down excessively, which would make looping through the unsupervised phase in real time impossible. This four minute window corresponds to a fixed number of data points determined by the sampling rate and the number of data points used per frequency spectrum. The frequency data of this time window is copied and compressed to be used as input for the autoencoder. If no pitting is detected, the time frame is advanced by two minutes: if the window spans minute i to minute i+4, after the update it spans i+2 to i+6. The window is shifted by only half of its own length so as not to miss a potential shift happening at the edge of the frame. Once i is larger than the last minute of the earlier data file, the later data file becomes the earlier data file and the next file is loaded in. As an example, if file 2 and file 3 are loaded, file 3 is saved in the variable of file 2 to free up memory, and file 4 is loaded in place of file 3. This allows the model to move the window between data files. The index of the files grows as we load in new data files to allow for easy identification of separation points.

4.3 Data Import

The vibration data used for the training and testing of the hybrid model is stored in .tdms files. TDMS files are hierarchical data structures that allow for storage of file information, like the author names, in the top layer, while the actual data is stored in the lowest of the three hierarchical layers, known as channels.
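The sliding-window control flow above can be sketched as a generator; the function name is illustrative and the 10-minute file length is taken from the experimental setup:

```python
def window_schedule(n_files, file_minutes=10, window=4, shift=2):
    """Sketch of the window control flow: a `window`-minute frame slides
    over a stream of `n_files` files of `file_minutes` each, advancing by
    `shift` minutes (half the window length) per iteration.
    Yields (start, end) in global minutes."""
    start = 0
    while start + window <= n_files * file_minutes:
        yield (start, start + window)
        start += shift
```

With two loaded files the schedule yields (0, 4), (2, 6), ..., (16, 20); each window overlaps its predecessor by half its length, so a shift at a window edge is always fully contained in some window.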
In order to work with the data more easily, it is first transformed into a pandas DataFrame. The data is arranged in channels, with the headings for vibration data carrying the prefix Raw_VibX, where X is 1, 2 or 3 depending on the spatial dimension of the recorder. The resulting DataFrame has the dimensions 512 000 x 180, with each column containing the vibration data for 10 seconds. The entire DataFrame contains 10 minutes of vibration data for each vibration dimension.

4.4 Vibration Data Transformation

The DataFrame is reshaped to contain 3 columns with 10 minutes of vibration data, resulting in a 30 720 000 x 3 DataFrame where each column contains one dimension of vibration data. Each data point is a float number that represents the acceleration of the sensor at that particular point in time. To speed up model training, we only use data from the first of the three dimensions. Pitting patterns can be observed in all three dimensions, with some dimensions providing better cluster distinction than others. In early tests, we found dimension 1 to produce the best results.

4.4.1 Fast Fourier Transform

Before the transform itself, the data is segmented into smaller pieces, with the segment length given by a parameter n_vib. For the hybrid model, the number of data points used for each spectrum calculation was set to 10 000, capturing 7.7 rotations of the input gear per frequency spectrum. The model has been found to be relatively robust to the number of data points per frequency spectrum. However, it is recommended to capture at least one rotation of each gear in each frequency spectrum; for a smaller number of data points, some teeth would be left out of each spectrum. Because pitting damage is localized to a small number of adjacent teeth [Gre22], we want to capture the mesh of all gear teeth in the spectrum.
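The reshape from the 512 000 x 180 chunked layout to one column per spatial dimension can be sketched as follows; the assumption that all chunks of one dimension form a contiguous block of columns is illustrative, and the real Raw_VibX channel ordering must be checked against the TDMS file:

```python
import numpy as np

def stack_channels(raw, n_dims=3):
    """Reshape a (samples_per_chunk, n_chunks * n_dims) block into a
    (samples_per_chunk * n_chunks, n_dims) array, assuming the columns
    belonging to one dimension are contiguous (all Raw_Vib1 chunks first,
    then Raw_Vib2, and so on). Adapt the slicing to the real layout."""
    samples_per_chunk, total_cols = raw.shape
    per_dim = total_cols // n_dims
    # Concatenate each dimension's chunks in time order, then stack as columns.
    cols = [raw[:, d * per_dim:(d + 1) * per_dim].T.reshape(-1)
            for d in range(n_dims)]
    return np.stack(cols, axis=1)
```

Applied to the thesis data this turns the 512 000 x 180 block into the 30 720 000 x 3 array used for the transforms below.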
Cutting vibration data into smaller segments can be seen as applying a transformation function to the data which keeps the data in the segment the same while setting all other data points to 0. This transformation affects the frequency spectrum by adding frequencies that are not present in the original signal. This process of spectral noise generation is known as spectral leakage [Nut81]. In order to combat spectral leakage, we can apply a window function to the data. The results of windowing can be seen in Figure 4.2. Applying a windowing function is a different type of transformation, one that typically attenuates the data at the edges of the window while keeping the middle of the data the same. Depending on the window function, we can influence which frequencies are distinguished more easily. When no windowing is applied in this step (meaning that technically a rectangular window is applied, as described above), components with similar frequencies and amplitudes are distinguished more easily (high resolution), while components with more dissimilar frequencies and amplitudes are harder to distinguish (low dynamic range). Applying a window function here can adjust resolution and dynamic range, but there is always a trade-off: no window function allows for high resolution and high dynamic range at the same time [Nut81]. For the purpose of the hybrid model, a rectangular window (no specific window) was applied, as the noise resulting from spectral leakage should be discarded by the sparse autoencoder. In order to transform the vibration data from the time domain to the frequency domain, the SciPy implementation of the fast Fourier transform was used, more specifically the real fast Fourier transform (scipy.fft.rfft). The method takes a vector of vibration data v of length n_vib and the number of frequencies to be considered in the spectrum.
If the number of frequencies is larger than the number of data points in v, the vector is padded with zeroes. The method returns all positive frequency terms. Given an even transform length n, the number of returned frequencies is ⌊n/2⌋ + 1 [com]. Rfft returns only half the frequency components, as the other half is symmetrical to the first and would therefore not add any new information to the spectrum.

Figure 4.2: Application of a Hanning window to a segment of vibration data
Figure 4.3: Vibration data of gear 6 before (left) and after (right) Fourier transform

In our case, we chose to compute all frequency components that could be captured in the data. Due to the Nyquist sampling theorem, we can capture frequencies up to half of the sampling rate of 51 200 Hz, leaving us with 25 600 Hz. As mentioned above, we segment the data into pieces of 10 000 data points, turning the original 30 720 000 data points of vibration data into a list of 3 072 segments of 10 000 data points each. Rfft is applied to each of the segments with a maximum frequency of 25 600. This returns 12 801 frequency components for each segment, leaving us with a 3 072 x 12 801 matrix, where each column represents a frequency component and each row represents the frequency spectrum of 10 000 data points of vibration data. The result of the rfft can be seen in Figure 4.3. The matrix obtained by the fast Fourier transform contains complex values, which hold information about the phase and the amplitude of each frequency component. As we are interested in the amplitudes, we take the absolute values of the frequency data.

4.4.2 MFCC Computation

As there does not seem to be a standard MFCC implementation, the code for calculating the coefficients is taken from a kaggle notebook [ILM]. For the MFCCs, we want to calculate the energy in different frequency regions [Lyoa]. These regions are defined by filters.
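The segmentation and transform steps can be sketched as follows; np.fft.rfft is used here in place of scipy.fft.rfft (both share the same semantics for the n argument), and the function name is illustrative:

```python
import numpy as np

def magnitude_spectra(v, n_vib=10_000, n_freq=25_600):
    """Segment a 1-D vibration signal into n_vib-point pieces (rectangular
    window) and compute the magnitude spectrum of each. Passing n=n_freq
    zero-pads each segment, so every spectrum has n_freq // 2 + 1 bins."""
    n_seg = len(v) // n_vib
    segments = v[:n_seg * n_vib].reshape(n_seg, n_vib)
    return np.abs(np.fft.rfft(segments, n=n_freq, axis=1))
```

With the thesis parameters, 30 720 000 samples yield a 3 072 x 12 801 matrix of non-negative amplitudes, matching the description above.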
The calculation of these filters can be seen in Figure 4.4. We calculate the Mel frequency for the highest and lowest frequency we have obtained through the Fourier transform using Equation (2.5). In our case, our lowest frequency component is at 0 Hz, while our highest component is at 25 600 Hz. The respective Mel values of these extremes are 0 and 4086.75. We then obtain evenly spaced points on the Mel scale between the two extreme values and transform the values back to the standard frequency scale using the inverse of Equation (2.5).

Figure 4.4: Calculation of the mel filters

The values in these frequency ranges are then summed up to represent the energy in each region [Lyoa]. The choice of the number of filters influences how many frequencies are summed up per filter. We used 45 filters, turning our 12 801 frequency components into 45 slightly overlapping energy regions. The base 10 logarithm of each region is taken, resulting in lower values for higher frequency regions. In speech processing this is done to mimic the non-linearity of human hearing, which requires roughly 8 times as much energy to double the perceived volume of a sound [Lyoa]. Finally, the discrete cosine transform is applied to obtain the cepstrum. The number of cepstral coefficients is typically lower than the number of Mel filters, as the highest coefficients, which represent the fastest changes in energy in the Mel filterbank, actually decrease model performance (in speech processing). We set the number of discrete cosine transform components to 40 in order to discard the higher quefrency regions. This algorithm turns our 3 072 x 12 801 frequency matrix into a 3 072 x 40 Mel frequency cepstral coefficient matrix C.

4.5 Sparse Autoencoder Implementation

The single layer autoencoders were implemented using the TensorFlow python library. The single layer encoders were then chained together to create a deep sparse autoencoder.
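The MFCC steps above can be sketched as follows. The triangular filter shape is an assumption on our part (the thesis follows a kaggle notebook [ILM], whose exact filter shape may differ), and the function names are illustrative:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)   # Equation (2.5)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)  # inverse of Equation (2.5)

def mfcc_from_spectra(spec, sr=51_200, n_filters=45, n_coeff=40):
    """Sketch: evenly spaced points on the Mel scale define overlapping
    (here: triangular) filters; filterbank energies are log10-compressed
    and a DCT keeps the lowest n_coeff quefrency components.
    spec: (n_frames, n_bins) magnitude spectra covering 0 .. sr/2 Hz."""
    n_bins = spec.shape[1]
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_bins - 1) * mel_to_hz(mel_pts) / (sr / 2)).astype(int)
    fbank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):  # rising and falling flank of each filter
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    energies = np.maximum(spec @ fbank.T, 1e-12)  # avoid log10(0)
    return dct(np.log10(energies), type=2, axis=1, norm='ortho')[:, :n_coeff]
```

Note that hz_to_mel(25 600) reproduces the Mel value 4086.75 quoted above, and a 3 072 x 12 801 spectrum matrix is reduced to a 3 072 x 40 coefficient matrix.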
The autoencoder takes training and test data in the form of frequency data. It returns the encoding of the frequency components of the last encoder layer applied to the original data. The autoencoder can be adjusted by selecting a model structure and tuning the sparsity parameters 𝛽 and 𝜌 as well as the weight regularisation parameter 𝜆.

4.5.1 Preprocessing

To train the autoencoder, we use the frequency data of the current 4 minute time frame. To speed up the training process, we use only every tenth frequency component, meaning components 0, 10, 20, and so on. This leaves us with 1228 data points with 1281 frequency components each. The data is split into training and test data, with 80% used for training. The autoencoder is trained on the training data while the test data is withheld in order to ensure that the autoencoder does not overfit. If the SAE were to overfit by "learning the data by heart", it would perform badly on the test data. When splitting the data into training and test data, the input is randomly shuffled. While we want to evaluate the clustering later by separating the clusters by time, this randomization of the order of data points is desirable, as we do not want the model to cheat by learning structure from the order of the input data alone. Doing so could cause "data leakage" by factoring in information that is not given to the model in operation, artificially inflating its performance in training. In order to remember the order of the data points for the later clustering, we save the indices of the shuffled DataFrames after the train-test split in additional variables. These indices will be re-attached to the encoded data later to restore the original order of the data for the clustering. Next, the data is converted to a tensor to enable it to be used as input for a TensorFlow model. This conversion removes the indices of the data, which is why we saved them in the previous step.
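The shuffle-split-and-remember-indices step can be sketched in NumPy (standing in for the pandas and TensorFlow operations; the function name is illustrative):

```python
import numpy as np

def split_keep_indices(data, train_frac=0.8, seed=0):
    """Shuffle the rows, split into train/test, and return the shuffled
    row indices alongside the splits, so the encoded outputs can later be
    re-sorted into chronological order for the clustering."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    cut = int(train_frac * len(data))
    train_idx, test_idx = idx[:cut], idx[cut:]
    return data[train_idx], data[test_idx], train_idx, test_idx
```

For the 1228 x 1281 input this yields a 982-row training set and a 246-row test set, with the original positions preserved in the two index arrays.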
After preprocessing, we are left with a training tensor of shape 982 x 1281 and a test tensor of shape 246 x 1281.

4.5.2 Single Layer Autoencoder Structure

Each single layer autoencoder consists of an encoder and a decoder with fully connected layers (in TensorFlow, these layers are called Dense layers). The last layer of the decoder needs to have the size of the original input in order to replicate the input. It is possible to pass multiple fully connected layers to the encoder and the decoder respectively. In our case, we only pass one layer each to the encoder and the decoder. The depth of the model is achieved by using the output layer of a single layer encoder as the input layer for the next encoder. The activation function for both the encoder and the decoder is the sigmoid function (Equation (3.3)). The model takes parameters to adjust the number of nodes in the encoder layer, the decoder layer structure and the weight regularization parameter 𝜆. The number of nodes in the encoder layer shrinks progressively from encoder to encoder when combining the single layer encoders into a deep autoencoder, creating a funnel effect. Each single layer autoencoder, when used as a whole, takes an input x and returns an output x̂ of the same dimensions. In our case, the autoencoder takes our tensor of size 982 x 1281 and learns to construct a similar tensor of size 982 x 1281. However, we can also call only the encoder of the trained autoencoder. In this case, the encoder returns a tensor of dimension 982 x h, where h is the number of nodes in the hidden encoder layer and typically h ≪ 1281. This tensor is the mapping of the data points to the learned feature space of the autoencoder. It depicts the underlying structure of the frequency spectrum.
Figure 4.5: Code of the single layer autoencoder model

4.5.3 Loss Function

As seen in Equation (3.4), the loss function for the sparse autoencoder is a sum of three parts. TensorFlow does not support this loss function directly; it needs to be reconstructed from its individual parts. The first term of the loss function is implemented in the built-in TensorFlow MeanSquaredError loss function [Gooa]. This is the entire loss function as far as TensorFlow is concerned, as TensorFlow is primarily interested in optimizing the result of the autoencoder, which in theory is to replicate the input x as well as possible by reconstructing it from a lower dimensional representation. For the purpose of pitting detection, however, we are more interested in the properties of the encoding than in the replication of the frequency data. The second term concerns the size of the weights of the encoder and is commonly known as l2 regularization. This term is added to avoid overfitting and allow for good results on new data. In order to add this term, we set the kernel_regularizer, which influences the size of the weights, with the built-in TensorFlow L2 Regularizer [Gooc]. Regularizers in TensorFlow are a way to influence parts of a layer in a TensorFlow model. The kernel_regularizer applies a penalty to the kernel of the layer, the bias_regularizer applies a penalty to the bias of the layer, and the activity_regularizer applies a penalty to the layer output [Good]. Looking at the hidden encoder layer as the result of the function in Equation (3.1), the kernel_regularizer in TensorFlow penalizes the weight matrix 𝑤1, the bias_regularizer influences the bias vector 𝑏1 and the activity_regularizer influences the result of the sigmoid function h. As the second term of the loss function (Equation (3.6)) deals with the weight matrix w, we need to use a kernel_regularizer.
For the Kullback-Leibler divergence, we have to create a custom regularizer based on Equation (3.9). The code for the regularizer is adapted from [Lin] and can be seen in Figure 4.6. As we need to influence the average activation of the nodes in the hidden layer, we need to use an activity_regularizer. The activity_regularizer is given all the values of the nodes of the hidden layer in the form of a vector h.

Figure 4.6: Code of the sparsity penalty as part of the loss function

As we care about the average activation, we take the mean of h and compare it to the desired level of sparsity 𝜌. The activity_regularizer class only allows one input, which is the vector h. In order to adjust beta and rho for testing purposes, we use global variables. We use beta to increase the impact of the sparsity penalty, while rho dictates the desired sparsity.

4.5.4 Combining Single Layer Encoders

In order to turn the single layer autoencoders into a sparse multilayer autoencoder, we train one autoencoder after another and use the result of the previous single layer encoder to train the next one. We allow for multiple encoder and decoder layers per autoencoder. It would theoretically be possible to pass different values for 𝛽, 𝜆 and 𝜌 to each single layer encoder, or even to use two different encoder layers as part of the same encoder-decoder combination. In order not to get lost in too much testing, and to avoid overfitting, we kept the hyperparameters the same across all layers for each sparse autoencoder. This method seems to be sufficient so far, but could be interesting to investigate in the future. Training of a three layer sparse autoencoder is done in the following fashion. We copy our training data tensor X with shape 982 x 1281. We then train a single layer autoencoder (𝜆 = 0.001, 𝛽 = 3, 𝜌 = 0.15) with a hidden layer of 560 nodes on the training data. We then let the trained autoencoder replicate X, obtaining X'.
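The mathematics of the sparsity term can be sketched in NumPy (the TensorFlow activity_regularizer in Figure 4.6 computes the same quantity on tensors; the function name is illustrative):

```python
import numpy as np

def kl_sparsity_penalty(h, rho=0.15, beta=3.0):
    """Sparsity penalty beta * sum_j KL(rho || rho_hat_j), where rho_hat_j
    is the mean activation of hidden unit j over the batch. Activations
    are sigmoid outputs, so they lie in (0, 1)."""
    rho_hat = np.clip(h.mean(axis=0), 1e-7, 1 - 1e-7)  # avoid log(0)
    kl = (rho * np.log(rho / rho_hat)
          + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return beta * float(kl.sum())
```

The penalty is zero exactly when the mean activation of every hidden unit equals rho, and grows as the average activations drift away from the desired sparsity level.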
We use X' to train a single layer autoencoder with the same hyperparameters, except that we only give it 200 hidden nodes. The trained autoencoder obtains X'' from X'. Lastly, we train an autoencoder with 50 hidden nodes on X''. This time, however, we use only the encoder of the trained autoencoder on the original data X, obtaining a reduced feature space in the form of a tensor M_train with dimensions 982 x 50. We also use the encoder on the test data to obtain a tensor M_test with dimensions 246 x 50.

4.5.5 Clustering SAE results

After obtaining M_train and M_test, we turn them into DataFrames and re-attach the saved indices. We then concatenate and sort the data points by their index to obtain the underlying structural data M as a 1228 x 50 DataFrame in chronological order. Before clustering, we can try to reduce the feature space further, either with principal component analysis (PCA) or with the QR decomposition method proposed by Qu et al. [QZH+19]. If we use PCA, we take the first three principal components of the data, giving us a clustering on a 1228 x 3 matrix. The number of dimensions kept by the QR method varies based on the data and on the cutoff threshold that decides when to discard eigenvalues and their corresponding features; in any case, we cluster on a 1228 x n feature space with n < 50. Alternatively, we can cluster on the entire 1228 x 50 feature space. All three methods, as mentioned in Section 3.3.3, were tested. For the clustering of the reduced feature space, K-Means clustering was employed with k=2. The result of the K-Means clustering is a 1228 x 1 vector y which contains a cluster label in {0, 1} for each frequency spectrum. We investigate whether y contains two clear clusters separable by time.
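The check for time-separability can be sketched as a direct scan over candidate split points; the purity threshold of 0.99 matches the value used in this work, and the function name is illustrative:

```python
import numpy as np

def separable_by_time(y, threshold=0.99):
    """Return the first index i such that at least `threshold` of the
    labels before i belong to one cluster and at least `threshold` of the
    labels from i onward belong to the other cluster, or None if no such
    split point exists."""
    y = np.asarray(y)
    for i in range(1, len(y)):
        for a in (0, 1):  # try both assignments of clusters to halves
            early_purity = np.mean(y[:i] == a)
            late_purity = np.mean(y[i:] == (1 - a))
            if early_purity >= threshold and late_purity >= threshold:
                return i
    return None
```

A label sequence of 50 zeros followed by 50 ones is separable (the scan returns the split index), while an alternating sequence is not.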
We call clusters separable by time if, for the current data 𝑦 = (𝑦1, 𝑦2, ..., 𝑦𝑛), there exists a point in time i for which separability · 100 % of the data in (𝑦1, ..., 𝑦𝑖) can be put into one cluster and separability · 100 % of the data in (𝑦𝑖+1, ..., 𝑦𝑛) can be put into the other cluster. Here, separability is a threshold value between 0 and 1 that decides whether the cluster purity is high enough to consider the data separable by time, and therefore to assume that a shift in the underlying signal, which could indicate pitting, has happened. separability is usually set to 0.99. If the cluster purity is higher than separability, the model transitions to the supervised phase.

4.6 Implementation of the supervised phase

The goal of the supervised phase is to obtain a classifier that can distinguish between vibration data with pitting and vibration data without pitting. The problem with training a classifier on the vibration data has been the lack of labeled training data. Now, we can use the cluster labels y as the labels for the training of the Long Short Term Memory classifier. The LSTM takes Mel frequency cepstral coefficients as input instead of frequency spectra. The calculation of these coefficients is shown in Section 4.4.2. Additionally, the first and second derivative of the MFCCs are calculated. These derivative coefficients are calculated as follows [Lyoa]:

𝑑𝑡 = (∑_{𝑛=1}^{𝑁} 𝑛 (𝑐_{𝑡+𝑛} − 𝑐_{𝑡−𝑛})) / (2 ∑_{𝑛=1}^{𝑁} 𝑛²)   (4.1)

The values obtained by the first derivative are known as delta coefficients. The delta coefficient 𝑑𝑡 can be seen as the slope between the neighbouring coefficients of MFCC 𝑐𝑡 (𝑐𝑡−1 and 𝑐𝑡+1). The parameter N allows us to also include the neighbours of the neighbours. For the hybrid model, N is set to 2. The derivatives of the delta coefficients are called delta-delta coefficients. By padding the edges of the MFCC matrix C with zeroes, we obtain an additional 40 delta and 40 delta-delta coefficients per data point.
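Equation (4.1) with zero-padded edges can be implemented as follows (applying the function twice yields the delta-delta coefficients); the function name is illustrative:

```python
import numpy as np

def delta(C, N=2):
    """Delta coefficients per Equation (4.1): for each frame t,
    d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2),
    with the edges of C zero-padded. C: (n_frames, n_coefficients)."""
    T = len(C)
    padded = np.pad(C, ((N, N), (0, 0)))  # zero rows before and after
    denom = 2 * sum(n * n for n in range(1, N + 1))
    num = sum(n * (padded[N + n:N + n + T] - padded[N - n:N - n + T])
              for n in range(1, N + 1))
    return num / denom
```

On a linearly increasing coefficient sequence, the interior delta values come out as the slope of that sequence, as expected from the weighted-difference form of the formula.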
We add these to the MFCCs to obtain our input matrix X of shape 1228 x 120. The code for calculating these coefficients was taken from [Lyob]. The input X is also split into a training set X_train (∈ ℝ^(982 x 120)) and a test set X_test (∈ ℝ^(246 x 120)). Because the classifier is a supervised model, we also split y into y_train and y_test. In order to leverage the power of the LSTM model to take previous data points into account, we need to reshape the input data into batches whose size is given by the parameter batch_size. Finding a balance between batch_size and the number of inputs is important: making the batch size too small limits the number of historical data points the LSTM can leverage in its prediction, while a very small number of inputs will likely limit the ability of the model to generalize and classify new data. We experiment with different levels of batch_size, defaulting to batch_size = 10. By using the TensorFlow reshape method, we turn our input matrices X_train and X_test into tensors of shapes 98 x 10 x 120 and 24 x 10 x 120, dropping the few data points that do not fill a complete batch. The label vectors are reshaped to 98 x 10 x 1 and 24 x 10 x 1 respectively.

Figure 4.7: TensorFlow Code of the LSTM model

4.6.1 Model Architecture

The LSTM model itself consists of an input layer, an LSTM layer with n_lstm_nodes hidden units and a fully connected layer with two units representing the classification outputs (pitting and no pitting in theory, cluster 1 and cluster 2 in practice). The activation function used is again the sigmoid function, while the loss function is sparse categorical crossentropy [Goob], which is used for classification and produces a cluster number. The code for the LSTM model itself can be seen in Figure 4.7. In addition to varying batch_size, the model can be tuned by adjusting the number of hidden units in the LSTM layer.
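The batching reshape can be sketched as follows (the function name is illustrative; the thesis uses the TensorFlow reshape method, which behaves identically on the retained rows):

```python
import numpy as np

def to_batches(X, batch_size=10):
    """Group consecutive frames into batches of `batch_size` time steps,
    dropping the trailing remainder (982 frames -> 98 batches of 10)."""
    n = (len(X) // batch_size) * batch_size
    return X[:n].reshape(-1, batch_size, X.shape[1])
```

Applied to X_train this yields the 98 x 10 x 120 tensor described above; the same call on X_test yields 24 x 10 x 120.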
It is important to set the parameter "return_sequences" to true to make the model return the classification of all data points and not just the last data point in each batch. Once the model is trained on X_train and y_train, it produces a result vector with two features when given input data. For example, if we feed X_train to the classifier, it returns ŷ_train as a 98 x 10 x 2 tensor. We reshape the tensor into a 980 x 2 DataFrame. The two columns of ŷ_train represent a one-hot encoding of the cluster labels: for a given data point X_train[i], if ŷ_train[i][0] > ŷ_train[i][1], the data point is considered part of cluster 0, and vice versa.

4.6.2 LSTM Validation

The trained classifier is considered the result of the supervised phase and of the model as a whole. In order to validate whether the classifier can actually classify pitting data, we import data that was recorded before and after the window the LSTM was trained on. For example, if the model was trained on data in files 4 and 5, we use all 20 minutes of files 3 and 6 as validation data. Data in file 3 is considered to have no pitting, and data in file 6 is considered to have pitting. If the classifier can classify over 95% of the data correctly, it is considered to have found pitting damage.

5 Evaluation of Solution

The model is first and foremost evaluated on its ability to detect pitting damage. This was done using parameters as close to the original papers [MCC+19; QZH+19] as reasonable. We then evaluate potential changes to parameters, as well as model components. The parameters used are listed in Table 5.1. Both machine learning models were evaluated with regard to their accuracy. Accuracy was investigated for different hyperparameters in both models. It turns out that accuracy is not a good evaluation metric in operation for this specific task, as will be discussed in Section 5.5.2.
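The decoding of the model output and the validation rule can be sketched as follows (function names are illustrative):

```python
import numpy as np

def decode_predictions(y_hat):
    """Flatten a (batches, steps, 2) output tensor to a label vector by
    taking the argmax over the two output units per data point."""
    return y_hat.reshape(-1, y_hat.shape[-1]).argmax(axis=1)

def validates_pitting(y_pred, y_true, threshold=0.95):
    """Validation rule: the classifier is considered to have found pitting
    if it labels more than `threshold` of the validation data correctly."""
    return float(np.mean(y_pred == y_true)) > threshold
```

For example, a 98 x 10 x 2 output becomes a flat vector of 980 cluster labels, which is then compared against the assumed labels of the before/after validation files.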
Parameter                Description                                                      Value
frequency                maximum frequency calculated for FFT                             25600
n_vib                    number of data points per Fourier transform                      10000
vibration_dimension      the spatial dimension of the vibration                           1
compression_factor       factor by which SAE input is smaller than FFT result             10
n_coefficients           number of MFCC inputs for the LSTM                               40
n_mel_filters            number of Mel filterbanks                                        45
𝛽                        sparsity penalty weighting in SAE loss function                  3
𝜆                        weight regularization weighting in SAE loss function             0.001
𝜌                        desired level of sparsity in SAE loss function                   0.15
sae_encoder_layers       numbers of hidden nodes per SAE encoder layer                    560-200-50
sae_decoder_layers       numbers of hidden nodes per SAE decoder layer                    560-200-50
sae_epochs               number of epochs of SAE training                                 100
lstm_epochs              number of epochs of LSTM training                                100
batch_size               number of data points fed to the LSTM at the same time          10
n_lstm_nodes             number of hidden units in the LSTM layer                         180
accuracy_threshold       level of accuracy to consider trained classifier useful          0.95
separability_threshold   percentage of data points in their correct cluster               1.0
window_time              number of minutes for which data points are used for training    4
window_shift_factor      percentage of the window by which it gets moved per iteration    2

Table 5.1: List of parameters used in the algorithm

Figure 5.1: Underlying structure across files
Figure 5.2: Frequency spectrum comparison between data with and without pitting damage

5.1 Global view of the underlying structure of the data

If the autoencoder is given all of the vibration data for gear 6 (14 files), its first principal component shows the structure displayed in Figure 5.1. We can see that the last two data files contain a significantly different underlying structure. Looking at the absolute values of the mean frequency spectrum of the first and last file shows significant differences; especially in the high frequency areas, the later file shows big spikes.
In a realistic scenario, we would like to discover pitting damage at an earlier stage than it is found in the last two files of the data. This is why most papers tend to look at pitting damage at various stages of deterioration [MCC+19; QZH+19]. The problem with testing the hybrid detection approach on real data is that it is hard to determine when the pitting damage first occurred. In order to see if earlier stages of pitting damage can be spotted in the data, we looked for additional points that separate the data encoded by the sparse autoencoder in the time domain. When considering a separability threshold of 0.99, five such points were identified. These points were used to separate the data into five areas. The LSTM classifier was trained with different parameters to check if any of these points could indicate early stages of pitting.

5.2 Accuracy of SAE for different Parameters

The models were trained on the data files with clear pitting and no pitting, as well as on the data in the five clusters found by the hybrid model. The models were also evaluated on four combinations of consecutive files with no severe pitting damage, namely files (4,5), (5,6), (7,8) and (9,10).

5.2.1 Accuracy of the clustering methods

When evaluating the different methods of further dimensionality reduction, namely no reduction, taking the two highest principal components, and taking the features associated with the highest eigenvalues of the upper triangular matrix R in the QR decomposition, accuracy refers to the percentage of data points correctly identified as part of a cluster that contains the data points of one of the two files. An accuracy of 50% is the lowest possible accuracy, which often occurs when all data points are put into the same cluster.
This does not necessarily indicate bad model performance, as two files can be genuinely inseparable when the data in both files shares the same underlying structure. When clustering the reduced feature space of the data in the consecutive measurements mentioned above, clustering along all dimensions (performing no further feature reduction) produces an average accuracy of 62.5867%, using the first two principal components produces an average accuracy of 62.5768%, and using the QR decomposition method to select the most relevant features has an average accuracy of 58.2905%. The QR decomposition method has the lowest performance among the three methods in 79.84% of cases. The method only has a higher accuracy than the other two methods in 4.4% of cases, with marginal improvements. The reason for this might be a flawed implementation of the method, as it was not described in detail in [QZH+19]. In this case, the metho