Institute of Architecture of Application Systems
University of Stuttgart
Universitätsstraße 38
D–70569 Stuttgart

Bachelorarbeit

Examining a Hybrid Pitting Detection Approach on Real Data

Tobias Schmid

Course of Study: Data Science
Examiner: Prof. Dr. Marco Aiello
Supervisor: Robin Pesl, M.Sc., Lisa Binanzer, M.Sc.
Commenced: August 8, 2022
Completed: February 7, 2023

Kurzfassung

To solve the problem of automatic pitting detection on a running gearbox, a hybrid model is presented and examined. Most work on automatic pitting detection falls under supervised machine learning, with previously labeled data. In addition, pitting damage is often simulated by drilled holes. For pitting detection on a running gearbox, however, a model must be able to operate on unlabeled data. The presented hybrid solution makes it possible to train a classification algorithm on unlabeled data. To test the method, data from gearboxes is used of which it is only known that they exhibited pitting at the end of their runtime. These data offer a good approximation of an actual operational scenario. The approach can be roughly divided into two parts. First, the underlying structure of the data is learned by an autoencoder. As soon as a significant change in this structure can be detected by means of clustering, a classifier in the form of a Long Short Term Memory model is trained on the data. The labeling of the data follows the labeling of the clusters that were identified as separable in time. The goal is to apply the classifier as part of an adaptive operating strategy and thus to increase the lifetime of the gear by varying the input torque.

Abstract

We propose a hybrid detection approach for pitting detection in gears.
Most research on pitting detection with machine learning is done on supervised data and often on simulated pitting. For pitting detection in practice, however, an unsupervised approach is required. The main idea behind the proposed solution is the ability to leverage elements of supervised machine learning models for pitting detection while operating on unlabeled data. The training data used for this algorithm is taken from gearboxes with actual pitting failure and without prior knowledge about pitting size during different stages of operation, providing a realistic case for an operational scenario. The approach consists of two parts. At first we try to detect changes in the underlying structure of the vibration data, using the fast Fourier transform to obtain frequency spectra that are fed to a sparse autoencoder. The encoded, reduced feature space is then clustered to look for separability by time. In case there are significant changes in the underlying structure, a Long Short Term Memory model is trained to see if the changes can be validated as actual pitting damage. The LSTM can then further be used to adapt quickly to the pitting damage, with the goal of prolonging the lifespan of the gear.

Contents
1 Introduction
2 Background Information
2.1 Gear Terminology
2.2 Pitting
2.3 Vibration data
2.4 Deep Neural Networks
2.5 Sparse Autoencoders
2.6 Long Short Term Memory Models
2.7 Gear damage theory and detection
2.8 Disentangled Tone Mining
2.9 Deep Learning Cepstral Classification
3 A hybrid model for pitting detection in operation
3.1 Constraints and Requirements
3.2 Structure of the Hybrid approach
3.3 Implementation of the Sparse Autoencoder
3.4 MFCC Computation
3.5 LSTM Architecture
3.6 Comparison between Autoencoder and LSTM
4 Realisation of Solution
4.1 Experimental Setup
4.2 Data selection and model control flow
4.3 Data Import
4.4 Vibration Data Transformation
4.5 Sparse Autoencoder Implementation
4.6 Implementation of the supervised phase
5 Evaluation of Solution
5.1 Global view of the underlying structure of the data
5.2 Accuracy of SAE for different Parameters
5.3 Results of the Autoencoder
5.4 Accuracy of the LSTM for different Parameters
5.5 Evaluation of the identified points
6 Conclusions
6.1 Limitations
6.2 Outlook
Bibliography

List of Figures
2.1 Line of action and pitch point [Gea]
2.2 Pitting damage on a gear [Ber; HMO11]
2.3 Sine wave summation
2.4 Sparse autoencoder [Mas]
2.5 Multilayer Autoencoder
2.6 Vibration signal composition
2.7 Simple damage signal
2.8 Damage in spectrum and cepstrum
2.9 Mel filter bank
3.1 The two phases of the hybrid detection approach
3.2 Encoded data with significant shift
3.3 Steps of MFCC computation
4.1 The Hybrid model step by step
4.2 Application of a Hanning window to a segment of vibration data
4.3 Vibration data of gear 6 before (left) and after (right) Fourier transform
4.4 Calculation of the mel filters
4.5 Code of the single layer autoencoder model
4.6 Code of the sparsity penalty as part of the loss function
4.7 TensorFlow Code of the LSTM model
5.1 Underlying structure across files
5.2 Frequency spectrum comparison between data with and without pitting damage
5.3 Activation of nodes on well separable data
5.4 Activation of nodes on poorly separable data
5.5 Percentage of points correctly identified as part of their file for different β
5.6 Percentage of points correctly identified as part of their file for different ρ
5.7 Average frequencies across 3000 spectra, before and after the recording break
5.8 Classification Accuracy of LSTM for different separation points over gear lifetime

List of Tables
5.1 Parameter Table
5.2 Accuracy of Clustering for different β on all reduced dimensions
5.3 Accuracy of Clustering for different ρ on all reduced dimensions
5.4 Accuracy of Clustering for different λ on all reduced dimensions
5.5 Accuracy of Clustering for different layer structures on all reduced dimensions
5.6 Accuracy of LSTM for different number of Training Samples
5.7 Accuracy of LSTM for different number of LSTM Nodes
5.8 Accuracy of LSTM for different number of LSTM Nodes
5.9 Confusion matrix for classes separated by the identified points. Actual classes = rows, Predicted classes = columns

1 Introduction

Wind energy is an important component in reducing carbon emissions to fight climate change. One of the major cost factors of a wind turbine is its maintenance, which can amount to up to 75% of the investment cost of the turbine over its lifetime [Vac02]. One of the potential damage sources, pitting damage in the wind turbine's gears, is responsible for long maintenance times of an average of 4 weeks [NB21]. During this time, the wind turbine is not able to produce energy, leading to an estimated average loss of €50,000 per maintenance period.
In order to combat pitting damage, the condition of the turbine is constantly monitored using condition monitoring systems. These systems are, among other things, used to reduce the maintenance time by scheduling maintenance in advance [NB21]. Gretzinger proposed another use case for condition monitoring: using the data of the pitting damage signal to increase the remaining lifespan of a gear affected by pitting [Gre22]. The basic idea is to decrease the load on the affected teeth and compensate for the loss of rotational force by increasing that force on the teeth that are not affected by pitting [Gre22]. In order to test and automate this proposed solution, automatic methods of pitting detection are needed. While there has been a lot of research in the field of automatic pitting detection [EA14; ERM+14; MCC+19; ÖSY08; QZH+19; ZD13], many of these approaches rely on simulated, drilled pitting data. Furthermore, most machine learning approaches are supervised models relying on labeled data for model training. Due to the nature of gear vibration data, it is typically not possible to use a pre-trained model on a different gear setup, making it hard to use supervised machine learning. In order to still leverage the power of supervised machine learning models, in particular the LSTM proposed by Medina et al. [MCC+19], a hybrid pitting detection approach is proposed. The algorithm first trains a sparse autoencoder like the one proposed by Qu et al. [QZH+19] to detect changes in the underlying structure of the data. Once such a change is spotted, the model trains an LSTM classifier on the data, with labels provided by a clustering of the underlying structure. The classifier is then validated on a validation data set. The data used to test the hybrid approach is not simulated pitting data, but rather data from a gearbox that at some point experienced pitting failure, allowing for more realistic conditions.
The goal of the hybrid model is to identify actual pitting damage on a gear without using supervised methods or simulated pitting. To better understand the problem of pitting in gears, Chapter 2 will provide information on gears, pitting damage and how the vibration signal of a gear is affected by pitting damage. An overview of the current state of the art of pitting detection approaches is also provided. The hybrid pitting detection approach itself will be introduced in theoretical terms in Chapter 3, while the implementation details can be found in Chapter 4. A discussion of results for different model parameters is located in Chapter 5. The paper concludes in Chapter 6 with a short discussion of the limitations of this method, as well as further improvements that might be necessary to use this technique in a live environment.

2 Background Information

Below is a short primer on what pitting in gears is and how pitting can be identified using vibration data. This section also includes a short introduction to the machine learning algorithms used in this paper.

2.1 Gear Terminology

Gears can be thought of as rotating wheels with teeth, which transmit torque and speed from one wheel to another when the teeth of the gears mesh with each other. The path on which the force is transmitted when two gears mesh is known as the line of action. The point where the line of action intersects the imaginary line that connects both axes (central points) of the gears is known as the pitch point. The circle around the axis which goes through all pitch points is known as the pitch circle. An example of this can be seen in Figure 2.1. The pitch point is both on the vertical line in the middle, which connects the axes of the gears, and on the line of action. Note that the pitch point coincides with the point of contact, and that the line of action is a straight line. Neither is always the case, but both hold for involute gears.
These gears have teeth with a special shape, like the ones in Figure 2.1. A more detailed explanation of gears can be found in engineering handbooks like [Dub13].

Figure 2.1: Line of action and pitch point [Gea]
Figure 2.2: Pitting damage on a gear [Ber; HMO11]

2.2 Pitting

There are two main theories on the formation of pitting [Kna88]. The first considers pitting a result of wear over time. The tooth surface will slightly degrade over time under continued cyclical load. This degradation shows up in the form of small cracks, both above and below the pitch circle. Oil will naturally fill these small cracks. When the tooth meshes with the other gear, the oil above the pitch circle is pressed out of the cracks, while the oil below the pitch circle is pressed further into the cracks. As the oil cannot compress, small particles of the material are broken out of the gear, in the form of smaller secondary cracks. This would also explain why pitting occurs beneath the pitch circle. The second theory assumes that pitting starts beneath the surface. Material defects close to the surface lead to local spikes in pressure. These spikes then result in small cracks, which combine into bigger cracks under load. The cracks are slightly angled toward the surface (20°) and run towards the base of the tooth [BLGM17]. Once a crack reaches the surface, material is broken out of the gear. The load that the surface is exposed to increases the damage, turning the cracks into deeper pits. Pitting is visible as shallow pits of varying size on the surface of the gear tooth. A severe case is shown in Figure 2.2. Pitting grows over time, and the rate of growth also increases over time. At some point the tooth will break away from the gear as a result, because it has become too weak [Gre22]. This point is reached once either one pit grows to a size of 4% of the surface of the tooth it is on, or once all pits combined grow to a size of 1% of the combined tooth surface.
These conditions are referred to as the 4% criterion and the 1% criterion, although these percentages might differ for other types of gears, like unhardened gears [NW03].

Figure 2.3: Sine wave summation

In the presence of pitting, the vibrations in the gear box increase significantly. Pitting reduces the contact area and affects the contact stiffness [Ngu02]. This results in a change in amplitude in the gear box vibrations. This change in amplitude is the basis for automated pitting detection in the works used in this paper.

2.3 Vibration data

Vibration data is typically recorded by an accelerometer, which measures acceleration. The basic structure of an accelerometer is a piece of test mass connected to a spring. The sensor measures acceleration by the spring's compression once it experiences acceleration towards it. The data is often recorded as multiple measurements in small fixed time intervals. Most standard accelerometers measure up to 30,000 data points per second [Len]. As the data measures the vibration of rotating gears, the data is cyclical in nature. A basic form of cyclical data would be a sine function:

f(x) = a · sin(b · (x + c)) + d    (2.1)

where a is the amplitude, b is the frequency, c is the displacement on the x axis (responsible for the phase of f(x)), and d is the displacement on the y axis. More complex cyclical data will often be a combination of simple signals. The thin lines in the upper diagram of Figure 2.3 are graphs of the functions f(x) = 4 · sin(x), g(x) = 2 · sin(2x) + 2 and h(x) = sin(4x) + 4. When added up, they combine to a more complex function F(x) = f(x) + g(x) + h(x) = 4 · sin(x) + 2 · sin(2x) + sin(4x) + 6. While the simpler functions have clearly defined frequencies, determining the actual frequency content of F(x) is harder.

2.3.1 Fourier Transform

In order to better assess the amplitudes of the vibrations, Fast Fourier Transform (FFT) is performed.
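The relationship between the summed sine waves and their FFT amplitudes can be checked numerically. The following NumPy sketch (the sampling parameters are my own choice, not taken from this thesis) samples F(x) from the example above over eight full periods of sin(x) and reads the amplitudes off the FFT bins:

```python
import numpy as np

# Sample F(x) = 4*sin(x) + 2*sin(2x) + sin(4x) + 6 over 8 full periods of sin(x)
n = 1024
x = np.linspace(0, 8 * 2 * np.pi, n, endpoint=False)
F = 4 * np.sin(x) + 2 * np.sin(2 * x) + np.sin(4 * x) + 6

# Real-input FFT; scale bin magnitudes so they equal the sine amplitudes
spectrum = np.fft.rfft(F)
amps = 2 * np.abs(spectrum) / n
amps[0] /= 2  # the DC (offset) bin is not doubled

# sin(x) completes 8 cycles in the record, so its amplitude lands in bin 8;
# sin(2x) in bin 16, sin(4x) in bin 32, and the offset d in bin 0.
print(amps[0], amps[8], amps[16], amps[32])  # ≈ 6, 4, 2, 1
```

Because the record covers an integer number of periods of every component, each sine falls exactly into one bin and the amplitude ratios 4 : 2 : 1 discussed above are recovered exactly.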
Fourier Transform takes a function of amplitude over time and outputs the amplitudes broken down by frequency [FD+22]. It is applied to break down cyclical data into its frequency components. For the purpose of vibration analysis for pitting detection, we would like to look at changes in the amplitudes of the frequency components. The result of the Fourier Transform is a function over frequencies ξ, obtained by the following equation:

f̂(ξ) = ∫_{−∞}^{∞} f(x) e^{−i2πξx} dx    (2.2)

where f(x), in our case, is the acceleration at time x. f̂(ξ) is a complex function, which means that values of f̂ are of the form f̂(ξ) = a + bi. The amplitude of frequency ξ corresponds to the absolute value of f̂ at frequency ξ:

|f̂(ξ)| = √(a² + b²)    (2.3)

In the example in Figure 2.3, if F̂(ξ) is the Fourier transform of F(x), we can obtain the amplitudes |F̂(1)| = 4, |F̂(2)| = 2 and |F̂(4)| = 1 of f(x), g(x) and h(x). The x-axis in the bottom diagram of Figure 2.3 displays higher numbers because the FFT sums over the frequencies. The relative values of the amplitudes are accurate, however: |F̂(1)| = 2 · |F̂(2)| = 4 · |F̂(4)|.

The recorded vibration data is in essence a more complex combination of many waves, resulting from a complex system with many moving parts in the gear box. As there is no practical way to calculate the function that describes the vibration data best, we instead perform the Discrete Fourier Transform (DFT), as the data is given in small discrete time intervals. FFT is the name of the actual algorithm used to perform the DFT [HJB85]. Given equidistant measurements in a time interval, FFT computes a frequency spectrum for that interval. FFT returns a list of complex numbers, one for each frequency of the spectrum [HJB85].

2.3.2 Mel Frequency Cepstral Coefficients

A cepstrum is defined as the spectrum of a logarithmic spectrum [Ran76]. Let F{f(x)} be the Fourier transform of f(x).
The cepstrum C_p [Bog63], now known as the power cepstrum, is calculated as follows:

C_p = |F{log(|F{f(x)}|²)}|²    (2.4)

Breaking down Equation (2.4), cepstrum calculation consists of the following steps:
1. Take the Fourier transform of the input signal x
2. Take the absolute values of the Fourier transform results to obtain the amplitudes of the frequency components
3. Take the logarithm of the amplitudes
4. Perform another Fourier transform on the logarithms
5. The absolute values of the second Fourier transform are the cepstral coefficients

In essence, we perform the Fourier transform to obtain the spectrum of the original data. We then perform another Fourier transform to obtain the spectrum of the spectrum, hence the name cepstrum, which is an anagram of spectrum. Some use the inverse Fourier transform F^{−1} in place of the Fourier transform at the last step. The resulting cepstrum is basically identical, only differing by a scaling factor.

The Mel Frequency Cepstrum (MFC) differs from the power cepstrum in two ways. Instead of using the Fourier transform on the logarithmic amplitudes of the spectrum, the discrete cosine transform (DCT) [ANR74] is used. DCT is related to the discrete Fourier transform, but it returns real numbers instead of complex numbers. For the purpose of cepstrum computation, what is important is that DCT returns a spectrum. The second important difference is that the frequency spectrum resulting from the first Fourier transform is scaled by the Mel-scale. 1000 Mel are exactly 1000 Hz, but the scale is logarithmic for frequencies above 1 kHz:

f_mel(x) = 2595 · log₁₀(1 + x/700)    (2.5)

Written as one equation, with D as the DCT:

C_mel = |D{log(f_mel(|F{f(x)}|²))}|²    (2.6)

Mel Frequency Cepstral Coefficients (MFCC) are commonly used in audio tasks such as speech recognition [ISY12], as the Mel-scale emulates human hearing more closely than a linear frequency mapping.
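The five cepstrum steps above, together with the Mel-scale conversion from Equation (2.5), can be sketched in NumPy. This is a minimal illustration, not the implementation used later in this thesis:

```python
import numpy as np

def power_cepstrum(signal):
    """Power cepstrum per Equation (2.4): the spectrum of the log power spectrum."""
    spectrum = np.fft.fft(signal)          # step 1: Fourier transform
    power = np.abs(spectrum) ** 2          # step 2: squared amplitudes
    log_power = np.log(power + 1e-12)      # step 3: logarithm (epsilon avoids log(0))
    second = np.fft.fft(log_power)         # step 4: second Fourier transform
    return np.abs(second) ** 2             # step 5: squared absolute values

def hz_to_mel(f_hz):
    """Mel-scale conversion per Equation (2.5)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

# 1000 Hz maps to (almost exactly) 1000 Mel, by construction of the scale
print(round(hz_to_mel(1000.0), 1))

# The power cepstrum of a harmonic signal is real-valued and as long as the input
t = np.arange(0, 1, 1 / 1024)
sig = sum(np.sin(2 * np.pi * k * 50 * t) for k in range(1, 6))
assert power_cepstrum(sig).shape == (1024,)
```

For the harmonic test signal, the regular 50 Hz spacing of the spectral peaks shows up as energy at the corresponding quefrency, which is exactly the sideband-spacing effect that makes cepstra useful for gearbox analysis.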
Points that are of equal distance on the Mel-scale are judged by humans to be of equal distance to one another [SVN37]. Humans are better at telling apart lower frequencies. For example in music, a difference of 55 Hz between low frequencies is a whole octave (A1 to A2), while for higher frequencies it is just a semitone (from A#5 to B5) in twelve-tone equal temperament. This makes MFCC suitable for speech recognition tasks, because models can more closely emulate the human perception of frequencies. MFCCs have been found useful in the area of bridge damage detection [BGND21], where damage is typically identifiable in lower frequency ranges. Cepstral analysis has been found useful in gear box fault diagnosis [Ran76] because it allows for better analysis of sidebands. Sidebands occur as the result of modulation [Ran76], including the kind of amplitude modulation that occurs as a result of pitting. One of the algorithms examined in this paper [MCC+19] uses MFCC as inputs. Why MFCC are preferred over other cepstral coefficients is not further explained in the paper. A possible explanation might be that the other inputs are acoustic data, which is often processed using Mel cepstra. Another explanation might be that the Long Short Term Memory model performs especially well on MFCC, because both are frequently used in Natural Language Processing.

2.4 Deep Neural Networks

Artificial neural networks are loosely based on animal brains. They try to emulate the behaviour of neurons that pass electrical signals between them. Neural networks typically consist of an input layer, an output layer and one or more hidden layers that are made up of nodes or neurons. If a neural network contains more than one hidden layer, it is called a deep neural network [Sch14].

2.5 Sparse Autoencoders

Sparse autoencoders are a type of unsupervised machine learning method that try to create a close approximation of their input.
They perform this task while only using a small percentage of their nodes.

2.5.1 Autoencoders

Autoencoders are a type of neural network. The goal of an autoencoder is to reconstruct an input x as well as possible. Rather than just copying the input, however, autoencoders typically operate under the constraint that they learn a set of features with fewer dimensions than the input dimension. The autoencoder then reconstructs the original input from this set of features. Conceptually, an autoencoder is made up of two functions:

Encoder: h = f(x)    (2.7)
Decoder: x̂ = g(h)    (2.8)

where x is the input, h is the hidden layer with reduced dimensionality and x̂ is the approximation of x. Written as one equation:

Autoencoder: x̂ = g(f(x))    (2.9)

2.5.2 Sparse Autoencoders

To reduce the dimensionality of the hidden representation even further, we would like the learned representation to be sparse. A sparse representation has as many entries as possible set to zero. Sparsity is often relevant in compression, but in this case it is used to further incentivize using an even smaller number of features to represent the input x. In the case of autoencoders specifically, sparsity refers to the activation of the units rather than their exact values. In this context, this means that in a sparse autoencoder only a small number of the hidden units are activated for each input. We impose sparsity in order to better capture the underlying structure of the data [Ng]. A visual example of a sparse autoencoder can be seen in Figure 2.4. The blue nodes on the left side of the encoder represent the features of the inputs. These could, for example, be pixels of an image, prices of a stock at certain times or, in our case, frequency components of vibration data. The blue nodes on the right side of the encoder represent the features of the recreated output. The number of input nodes is the same as the number of output nodes.
The orange nodes in the middle represent the hidden nodes, which contain the learned features. The dark orange nodes are the nodes that are active in recreating this particular input. In a sparse autoencoder, we would like to have a small number of nodes active at the same time [Ng].

Figure 2.4: Sparse autoencoder [Mas]

2.5.3 Multilayer Autoencoders

Equation (2.9) describes an autoencoder that contains a single hidden layer. We can chain multiple single-layer autoencoders to reduce the number of learned dimensions even further, by using the output x̂ of a previous autoencoder as the input for the next autoencoder. Given two autoencoders:

Autoencoder 1: x̂₁ = g₁(f₁(x))
Autoencoder 2: x̂ = g₂(f₂(x̂₁))

we can combine these autoencoders into one multilayer autoencoder:

Multilayer autoencoder: x̂ = g₂(f₂(g₁(f₁(x))))    (2.10)

As both the hidden layers obtained by f₁ and f₂ as well as the output of the first autoencoder obtained by g₁ are considered hidden layers, the autoencoder in Equation (2.10) has three hidden layers and is therefore considered a deep neural network. The single-layer autoencoders making up the multilayer autoencoder can have different hyper-parameters, allowing the multilayer autoencoder to learn representations for complex data. The number of input and output dimensions of each single-layer autoencoder is often the same, however, matching the dimensions of the input. An example of this can be seen in Figure 2.5. The middle blue layer is the output layer of the first autoencoder, consisting of the three leftmost layers. It is also the input layer of the second autoencoder, consisting of the three rightmost layers. The number of hidden (orange) nodes can differ from layer to layer, while the number of input / output nodes (blue) stays the same.
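One common way to impose the activation sparsity described above, following Ng's lecture notes [Ng], is to add a Kullback-Leibler penalty that pushes the mean activation ρ̂_j of each hidden unit towards a small target value ρ, weighted by a factor β. A minimal NumPy sketch (the symbols ρ and β match the hyper-parameters evaluated later; the function names and default values here are my own, not the thesis implementation):

```python
import numpy as np

def kl_divergence(rho, rho_hat):
    """KL divergence between Bernoulli distributions with means rho and rho_hat."""
    return (rho * np.log(rho / rho_hat)
            + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

def sparsity_penalty(hidden_activations, rho=0.05, beta=3.0):
    """beta-weighted sum of KL(rho || rho_hat_j) over all hidden units.

    hidden_activations: array of shape (n_samples, n_hidden) with values in
    (0, 1), e.g. sigmoid outputs of the encoder layer.
    """
    rho_hat = hidden_activations.mean(axis=0)      # mean activation per hidden unit
    rho_hat = np.clip(rho_hat, 1e-8, 1 - 1e-8)     # numerical safety for the logs
    return beta * np.sum(kl_divergence(rho, rho_hat))

# If every unit's mean activation already equals rho, the penalty is zero
sparse_acts = np.full((100, 16), 0.05)
print(sparsity_penalty(sparse_acts))  # → 0.0

# Denser activations are penalized more heavily
assert sparsity_penalty(np.full((100, 16), 0.5)) > sparsity_penalty(sparse_acts)
```

Adding this term to the reconstruction loss makes dense hidden activations expensive, so the optimizer is driven towards solutions where only a few hidden nodes fire per input.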
Figure 2.5: Multilayer Autoencoder

2.6 Long Short Term Memory Models

Long Short Term Memory models are designed to have the ability to use information that has been seen a long time ago. They fall under supervised learning techniques.

2.6.1 Recurrent Neural Networks

Unlike autoencoders, which only pass data forward from one layer to another, recurrent neural networks (RNN) can also pass data backwards. This allows neurons of RNNs to be part of loops, enabling neurons to influence themselves [WWFN16]. RNNs can keep information about previously encountered data points. As a result, RNNs are able to take context, in our case temporal context, into account [Mil12].

2.6.2 Long Short Term Memory Models

Early versions of RNNs often suffered from the "Vanishing Gradient Problem" [Hoc91]. The longer the path between two nodes n₁ and n₂, the smaller the influence of n₁ on n₂. Looking at nodes as part of a cycle that sends a signal through the cycle once per time step, one can unfold the cycle into an acyclical network by chaining copies of the network behind each other [MP88]. This means that the influence of a node with past information will drastically decrease with the number of time steps between two nodes [Hoc91]. A related problem occurs if the network uses activation functions that allow for numbers over 1. In this case, the influence of nodes with past information will drastically increase with the number of time steps between two nodes. This is known as the "Exploding Gradient Problem".

Figure 2.6: Vibration signal composition

Long Short Term Memory (LSTM) models were first proposed by Hochreiter and Schmidhuber to deal with the Vanishing Gradient Problem [HS97]. LSTM models contain cells that can store values over an arbitrarily long period of time. Each cell has an input gate, an output gate and a forget gate that regulate the information going into and out of the cell [HS96].
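The interplay of these gates can be written out explicitly. The equations below follow the common modern LSTM formulation with a forget gate (σ denotes the logistic sigmoid, ⊙ element-wise multiplication); the notation is not copied from [HS96] but is standard in the literature:

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state / output)}
\end{aligned}
```

Because the cell state c_t is carried forward additively rather than through repeated multiplication, gradients can flow over many time steps without vanishing, which is the key property for the long-range dependencies discussed above.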
This enables LSTM models to store past data for an arbitrary amount of time and allows them to use past data only when it is relevant for the current data point [Hoc91]. LSTM models are therefore very useful when dealing with time series data, like the vibration data for pitting detection. They are able to look at current frequency inputs while also taking historical frequency spectra into account. This could allow them to better identify deviations from previous frequency data [MCC+19].

2.7 Gear damage theory and detection

Early attempts at damage detection in gears, mostly for tooth cracks, focused on the concept of the mesh frequency [McF87]. The mesh frequency f_mesh is the rate at which two gears mesh over a fixed time interval, typically a second (measured in Hertz). f_mesh is calculated as follows:

f_mesh = f₁ · N₁ = f₂ · N₂    (2.11)

where fₙ is the number of rotations of wheel n per second, and Nₙ is the number of teeth of wheel n. As an example, we consider two gears: gear₁ with 15 teeth and gear₂ with 30 teeth. The input shaft is attached to gear₁, making it turn with a frequency f₁ of 25 Hz. As gear₂ has twice as many teeth, its rotation frequency will be half that of gear₁: f₂ = f₁ · 15/30 = 12.5 Hz. The mesh frequency of this system according to Equation (2.11) is f_mesh = f₁ · 15 = 375 Hz. The mesh waveform, which is the vibration signal of the meshing gears, has the mesh frequency. As the mesh of teeth is responsible for load transfer, the mesh waveform will have the highest vibration amplitude [Inc]. To reflect this in our example, w_mesh(x) = sin(2 · π · f_mesh · x) will have an amplitude of 1, while w₁(x) = 0.4 · sin(2 · π · f₁ · x) and w₂(x) = 0.5 · sin(2 · π · f₂ · x) have amplitudes of 0.4 and 0.5 respectively. The resulting waves are shown in Figure 2.6.

Figure 2.7: Simple damage signal

McFadden defines the vibration data as a combination of two signals.
The "regular signal", which is the combination of the fundamental mesh frequency and its harmonics (waves with frequencies 2 · f_mesh, 3 · f_mesh, ...), is responsible for the vibration of a single tooth [McF87]. The regular signal basically represents the vibrations that should result from the mesh of two teeth. Subtracting the regular signal from the original yields the "residual signal". This signal represents the deviations of the actual signal from the regular signal. As random noise is reduced through time domain averaging before calculating the regular signal and residual signal, the residual signal should not be random but rather have some cause. The idea is to analyse the residual signal for patterns that indicate gear damage, as gear damage should show up in the form of a deviation from the norm [McF87]. Damage signals should in theory show up in the data in the form of frequent pulses at roughly equal time intervals, because they would affect the overall vibration signal whenever the damaged tooth is struck [Whi84]. These signals may contain slight variations in length and frequency due to noise, and they will change over time in amplitude and rate of repetition as the damage increases [LM92]. It is also theorized that damage first registers in the low frequency ranges and then emerges in higher frequency ranges as the condition of the gear deteriorates [LM92]. In our example, a simplified damage signal is added to the vibration signal to simulate a broken tooth at the third tooth of gear₁. The signal and its effect on the overall signal can be seen in Figure 2.7. It has a rectangular shape, while in real systems the damage signal would look more like a diminishing sinusoid, as the impulse would be absorbed by the material [Whi84]. Spotting the differences in the vibration signal of a damaged gear that result from the added damage signal is the main idea behind most early fault detection research.
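McFadden's regular/residual decomposition can be sketched numerically. This is an illustrative toy reusing the 25 Hz shaft and 375 Hz mesh frequencies from the example; the sample rate, harmonic amplitudes and the rectangular damage impulse are my own choices:

```python
import numpy as np

fs = 10_000                          # assumed sample rate in Hz
f_mesh, f_shaft = 375, 25            # mesh and input shaft frequency from the example
t = np.arange(0, 0.1, 1 / fs)        # 0.1 s of signal (1000 samples)

# Regular signal: mesh fundamental plus two harmonics (2*f_mesh, 3*f_mesh)
regular = (np.sin(2 * np.pi * f_mesh * t)
           + 0.3 * np.sin(2 * np.pi * 2 * f_mesh * t)
           + 0.1 * np.sin(2 * np.pi * 3 * f_mesh * t))

# Measured signal: regular part plus a simplified once-per-revolution damage pulse
measured = regular.copy()
rev_samples = fs // f_shaft          # samples per revolution of the driving gear
measured[::rev_samples] += 2.0       # rectangular impulse when the damaged tooth meshes

# Residual signal: deviations of the measured signal from the regular signal
residual = measured - regular
print(np.count_nonzero(residual))    # → 3 pulses in 0.1 s (at 0.0 s, 0.04 s, 0.08 s)
```

Subtracting the healthy mesh pattern leaves exactly the once-per-revolution impulses, which is the kind of regularly spaced deviation that the residual-signal analysis looks for.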
This can be done by looking at the vibration signal over time [McF87; TWZ+14], by analyzing the frequency spectrum [ERM+14], or by using other forms of wavelet transforms of the original signal [ÖSY08; TDZ+16]. Others have applied cepstral analysis [EA14; ZD13]. How the damage patterns in our example show up in the frequency spectrum and the power cepstrum of the vibration data can be seen in Figure 2.8. In the upper diagram, the mesh frequency is clearly prevalent in both the healthy signal and the damaged signal. However, the sidebands (adjacent higher and lower frequencies) of the mesh frequency tend to fluctuate more heavily in the damaged signal. This type of fluctuation is captured well in the cepstra, where we have clear spikes at certain quefrencies.

Figure 2.8: Damage in spectrum and cepstrum

While a lot of early research has focused on highlighting and visualizing the differences in the signal between a healthy gear and a damaged gear, machine learning approaches have gained popularity in damage classification. Models used include support vector machines [KSH11; Sam04; SMR07], random forests [LSZ+16], and various types of neural networks [ÜOK16; XJYD20]. In this work, two of these deep learning approaches are examined. Qu et al. studied the use of sparse autoencoders on unlabeled vibration data to create "meaningful" disentangled features in order to perform fault diagnosis in operation [QZH+19]. Medina et al. implemented an LSTM model for pitting detection using MFCC [MCC+19]. While both of these approaches showed promising results, the data was collected from gears with artificial pitting damage, rather than real pitting damage; pitting was simulated by drilling into the gear. In this work, we will look at how both of these approaches behave on real pitting damage in operation.
This might be useful if we want to incorporate one or both of these methods into a system that can recognize pitting damage early on and take measures to slow down pitting deterioration in operation [Gre22]. Before we examine the performance of the models on real data, we will need a better understanding of how they work. Below is a more detailed overview of the core ideas and design of both models.

2.8 Disentangled Tone Mining

The goal of the Disentangled Tone Mining (DTM) approach is to extract the hidden structures behind the vibration data and to disentangle them into separate tones, hence the name of the approach [QZH+19]. It can be thought of as a form of dimensionality reduction, compressing the feature space from hundreds of single frequency components to a combination of a small number of ideally uncorrelated features that represent the different original signals of the vibration signal. The process of obtaining these features consists of three steps.

2.8.1 Fast Fourier Transform

The first step is obtaining the frequency spectrum of the vibration data by means of FFT. The vibration data is split into equal time segments, with each segment roughly capturing one revolution of the driven gear. The authors used a driving gear with 40 teeth and a driven gear with 72 teeth, so each time segment should contain at least one revolution of each gear. In case of pitting, the damage signal should therefore be captured in all segments. The frequency spectrum is calculated for each time segment and used as input for the sparse autoencoder. Each input consists of the spectrum for one time interval and each feature of the input is a frequency component of the spectrum. The frequency spectrum is used instead of the raw vibration data because it should better capture the different frequencies belonging to the different signals, like the damage signal. Further preprocessing is avoided for two reasons.
First, a desired goal of the approach is to learn features without requiring prior human knowledge, and second, it allows for faster model performance in operation.

2.8.2 Sparse Autoencoder

In order to reduce the frequency features to a smaller number of components that can distinguish between different signals, a Sparse Autoencoder (SAE) is used. The basic architecture of an autoencoder is given in Equation (2.9) and Equation (2.10). The SAE will try to recreate the frequency spectrum with only a small number of features. These features should capture the underlying structure of the data, including the damage signal. The number of learned features n should be significantly smaller than the number of input features N (n ≪ N). A sparsity constraint is used to further reduce the number of relevant features. This sparsity constraint takes the form of a hyper-parameter in the loss function that incentivizes a low number of hidden unit activations. The features are obtained by training the SAE on the frequency spectra and then using the encoder of the trained SAE on each frequency spectrum to obtain its low-dimensional representation, in the form of an m by n dimensional feature matrix M, where m is the number of encoded inputs and n is the number of learned features.

2.8.3 Further Feature Reduction

In order to further reduce the number of features, the authors perform the following feature pruning technique [QZH+19]:

• Perform QR decomposition on the learned feature matrix M
• Calculate the eigenvalues of the upper triangle matrix R, which are also the eigenvalues of M
• Determine the top k eigenvalues by an automatic threshold, like λ_k > 10 · λ_{k+1}
• Keep only the dimensions associated with λ_1, ..., λ_k

This process returns a list of the k most relevant features, where k is chosen automatically based on the feature matrix M rather than being a fixed number.
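The pruning steps above can be sketched in a few lines of numpy. The function name and the tie handling are our own choices; since R is upper triangular, its eigenvalues are simply its diagonal entries.

```python
# A literal sketch of the QR-based feature pruning described above.
import numpy as np

def prune_features(M, ratio=10.0):
    """Keep the columns of the m-by-n feature matrix M whose associated
    eigenvalues of R dominate, using the threshold lambda_k > ratio * lambda_{k+1}."""
    Q, R = np.linalg.qr(M)            # reduced QR: R is n x n upper triangular
    eig = np.abs(np.diag(R))          # eigenvalues of a triangular matrix
    order = np.argsort(eig)[::-1]     # indices sorted by decreasing magnitude
    s = eig[order]
    k = len(s)                        # default: keep all dimensions
    for i in range(len(s) - 1):
        if s[i] > ratio * s[i + 1]:   # automatic threshold from the text
            k = i + 1
            break
    keep = np.sort(order[:k])
    return M[:, keep], keep
```

For a matrix with one dominant feature dimension, `prune_features` would return only that column.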
The excluded features are strongly correlated with the kept features, leaving the final feature space more uncorrelated or "disentangled". The proposed advantage of this method is that it keeps the learned features intact, which preserves some of their properties. A different dimensionality reduction technique, like principal component analysis (PCA), will not return the original features, but rather the eigenvectors of the highest eigenvalues. The final set of features can then be used as input for further analysis, like clustering.

2.9 Deep Learning Cepstral Classification

While DTM is an unsupervised deep learning approach, the LSTM model proposed by Medina et al. falls into the category of supervised learning [MCC+19]. The main goal is the classification of different stages of pitting damage.

2.9.1 Data Preprocessing

The first step in this approach is also to obtain multiple frequency spectra for short time steps of 5.12 ms, using FFT. Each frequency spectrum is then processed using a Mel filter bank. This filter bank consists of M filters with triangular shape on the Mel scale. An example of such a filter bank can be seen in Figure 2.9.

Figure 2.9: Mel filter bank

By applying the filters to the frequency spectrum, low frequencies are resolved more finely than high frequencies. Afterwards, the discrete cosine transform is applied to the logarithms of the filtered frequencies to obtain the Mel Frequency Cepstral Coefficients (MFCC). As previously mentioned, cepstral analysis is a common practice in damage detection in gears, because it can show changes in frequency sidebands, as shown in Figure 2.8.

2.9.2 LSTM Model

The MFCC and their first and second derivatives with respect to time are used as inputs for the Long Short Term Memory (LSTM) model.
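A minimal numpy sketch of the described MFCC pipeline (Mel filterbank, logarithm, DCT) might look as follows. The filter count, frequency range, and explicit DCT-II construction are illustrative choices, not values taken from [MCC+19].

```python
# Sketch of the MFCC pipeline: Mel filterbank -> log energies -> DCT-II.
import numpy as np

def mel(f):     return 2595.0 * np.log10(1.0 + f / 700.0)
def inv_mel(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft_bins, fs):
    """Triangular filters, equally spaced on the Mel scale."""
    pts = inv_mel(np.linspace(mel(0.0), mel(fs / 2), n_filters + 2))
    bins = np.floor((n_fft_bins - 1) * pts / (fs / 2)).astype(int)
    fb = np.zeros((n_filters, n_fft_bins))
    for j in range(n_filters):
        l, c, r = bins[j], bins[j + 1], bins[j + 2]
        for i in range(l, c):
            fb[j, i] = (i - l) / max(c - l, 1)   # rising edge of the triangle
        for i in range(c, r):
            fb[j, i] = (r - i) / max(r - c, 1)   # falling edge
    return fb

def mfcc(power_spectrum, fs, n_filters=40, n_coeffs=40):
    fb = mel_filterbank(n_filters, len(power_spectrum), fs)
    energies = np.log(fb @ power_spectrum + 1e-10)  # log filterbank energies
    # explicit DCT-II basis applied to the log energies
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1) / (2 * n_filters))
    return basis @ energies
```

Libraries such as scipy or librosa provide ready-made versions of these steps; the explicit construction is only meant to mirror the description above.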
The LSTM model takes the cepstra as well as their context (information about cepstrum_1, ..., cepstrum_{i−1} for cepstrum_i) to classify the data as "pitting" or "no pitting". More granular classification is possible if data with different degrees of pitting is available. The original paper distinguished between 9 classes (healthy, and 8 levels of pitting). The model is trained on labeled data first and can then predict the condition of new data.

3 A hybrid model for pitting detection in operation

Both the sparse autoencoder based disentangled tone mining method and the LSTM classifier show high accuracies for identifying pitting in several stages of deterioration [MCC+19; QZH+19]. Both models were evaluated on data for which it was known beforehand what stage of pitting each data point was subjected to. Pitting damage in these cases was simulated by artificially adding pits to the gears, using a drill bit [QZH+19] or an electrical discharge machine [MCC+19]. The different stages of pitting were clearly separated and the state of the gear therefore stayed within each class of pitting damage. The models were trained to separate the data into 5 [QZH+19] and 9 [MCC+19] classes respectively. However, pitting damage detection in operation poses additional challenges due to its uncertain nature.

3.1 Constraints and Requirements

Pitting damage in data from operation will not jump from one stage of deterioration to another from data point to data point. Rather, pitting will happen gradually over time. Data from very slight pitting damage might not show a significant enough change to put it into its own cluster. On the other hand, we want to detect the pitting damage as early as possible, to slow its deterioration. The model has to be able to recognize pitting damage as early as possible by being able to identify slight changes in the data, while also being robust to random noise.
Each gearbox produces unique vibration data due to the unique material properties of each gear (no two gears are exactly the same) and slight differences in installation, so models that were pre-trained on a different setup will likely perform poorly on new data. Therefore, a model needs to be trained on the vibration data of the gear during operation. As a result, there are no labels for the data on which to verify the performance of the model during operation. The model therefore needs to be trained on unlabeled vibration data. The model also needs to be updated as new data comes in. The gearbox will produce large amounts of vibration data that need to be preprocessed to be used as model input. Keeping all the data for model training will lead to ever increasing training times. It is therefore important to consider what data to feed the model when looking for pitting, and how to update the model without missing a change in pitting deterioration. The model needs to be provided with the newest data and with data to compare the new data to, in order to see if the new data deviates from the state of normal operation.

Figure 3.1: The two phases of the hybrid detection approach

3.2 Structure of the Hybrid approach

In order to provide an automatic detection approach on unlabeled data which enables fast training and classification, we propose a combination of the unsupervised sparse autoencoder approach [QZH+19], which allows us to create labeled data, and the powerful LSTM model [MCC+19]. The model operates in two phases. It starts in an unsupervised phase until it suspects potential pitting damage; it will then transition into the supervised phase. The two phases can be seen in Figure 3.1. The first phase is the unsupervised phase. Ideally the model will spend the majority of its time in this state, until it detects pitting damage.
The model takes the past four minutes of vibration data and transforms the data into its frequency components. Four minutes of data are chosen, as this time frame is large enough to provide enough data for the model to learn useful information, while also being small enough to perform both preprocessing and model training in a short period of time. Fast Fourier Transform is used to obtain the frequency components. The frequency spectrum is compressed by a factor of ten before it is fed to the autoencoder. The compression is done to reduce the required training time of the autoencoder. The input for the autoencoder is a matrix with each row being a list of amplitudes at different frequencies. Using the frequency data x, we train an autoencoder x̂ = g(f(x)) that tries to replicate the original data x by using a smaller number of parameters h = f(x). The autoencoder is also trained to be sparse. This forces the autoencoder to have most values of h be 0 for any given input x. Sparsity is introduced to further reduce the number of variables, encoding more information on each active node of h. After training, we use the encoder function f of the autoencoder to encode the frequency data vector x into the much smaller vector h. This vector h represents the underlying structure of the data learned by the autoencoder. It contains more information about the overall structure of the frequency spectrum per parameter than any of the single frequency parameters. We then take the underlying structure vector h and see if we can split the data points into two clusters using a K-Means algorithm. The idea is to obtain two clusters that can be separated by a point in time. That point in time, called the separation point in this work, could indicate a shift in the underlying structure at that moment.
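Given the time-ordered K-Means labels of a window, the check for such a separation point can be sketched as follows. The purity tolerance and the function itself are illustrative assumptions; the text does not prescribe how strictly the two clusters must separate in time.

```python
# Look for an index that splits the time-ordered cluster labels into two
# nearly pure halves: one cluster before the index, the other after it.
import numpy as np

def find_separation_point(labels, min_purity=0.95):
    """Return the first index splitting the labels into two time-separable
    clusters, or None if no such separation point exists."""
    labels = np.asarray(labels)
    n = len(labels)
    for split in range(1, n):
        before, after = labels[:split], labels[split:]
        # either cluster id may come first in time
        for a, b in ((0, 1), (1, 0)):
            purity = (np.sum(before == a) + np.sum(after == b)) / n
            if purity >= min_purity:
                return split
    return None
```

A window whose labels alternate randomly yields no separation point, while a clean shift in the underlying structure yields the index of the shift.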
In theory, as pitting damage gets worse over time, the damage should lead to a change in the frequency spectrum from the point that pitting is formed, as can be seen in Figure 2.8. If the separation point is in the recorded window, the window will contain two types of frequency data: the data before the separation point should contain the healthy frequency spectra, while the data after the separation point will contain the unhealthy spectra. The difference between these two types of spectra should be captured in the underlying structure of the data that gets learned by the autoencoder. By clustering the underlying structure, we check whether such a separation point exists, i.e., whether the data can be split into two clusters based on time. If such a separation point is found, the cluster labels will be used as pseudo labels for training the LSTM classifier. The input data for the LSTM will be from the same window as the data for the autoencoder. We take the uncompressed frequency data of the given window and transform it into Mel frequency cepstral coefficients (MFCC). Each frequency spectrum is filtered via predefined Mel filters. The filtered data is then logarithmically scaled. Finally, the discrete cosine transform is used to obtain the MFCC. The MFCC can be thought of as components of a spectrum of a spectrum. The cepstral components, as well as their first two derivatives, are used to train the LSTM classifier. The labels for the training will be the cluster labels of the window. The LSTM model works with batches of MFCCs to incorporate the context of previous examples into the classification. The LSTM model works on a different type of input data, allowing us to see if the changes can be spotted both in the frequency spectrum as well as in the cepstrum. While the cepstrum is computed from the spectrum, what each input data point represents differs significantly. We can validate the LSTM by looking at the next window and at data sampled from past windows.
We assume the model is able to detect pitting if it classifies the data in the next window as pitting and the sampled data from past windows as no pitting. This kind of classification would indicate a change in the vibration data that happened at some point (the separation point) during operation. The result of this validation is the final LSTM classifier. The trained LSTM allows us to classify data significantly faster than the SAE, which could help decrease the response time of the system. This is crucial when dealing with an adaptive system that tries to slow down pitting growth, as proposed by Gretzinger [Gre22]. The idea is to adapt the input rotational force by decreasing the load on the teeth with pitting [Gre22]. In order to find the teeth that are affected by pitting, we decrease the load on each tooth in turn and feed the resulting vibration data (or rather its MFCC representation) to the classifier. If easing the load on the pitted teeth returns the vibration data closer to its original state than easing the load on the other teeth does, the classifier could identify that change and reclassify the new data as healthy. We then continue to run the gear with the adapted rotational force, in order to increase its lifespan [Gre22]. A fast classifier is essential here, as we would like to minimize the time spent on checking which teeth contain pitting. This is because we want to avoid spending a long time putting additional force on the damaged teeth while checking the other teeth.

3.3 Implementation of the Sparse Autoencoder

The autoencoder takes the compressed frequency spectra and tries to represent them using a smaller number of features than the given number of frequencies. This is done by taking the vector of frequencies x and using an encoder function f(x) to map the frequencies onto a smaller vector h. A decoder function g(h) then tries to recreate x given the smaller vector h.
The result of the decoder, called x̂, is compared to x to adjust the autoencoder. Expanding on the basic architecture of an autoencoder in Equation (2.9), we can formulate f(x) and g(h) as

h = σ_1(w_1 · x + b_1) (3.1)
x̂ = σ_2(w_2 · h + b_2) (3.2)

where w_1, w_2 are weight matrices and b_1, b_2 are bias vectors. The weight matrices are adjusted during the learning process in order to learn from the data that is provided. The encoding of the data h can be seen as a linear transformation of x with w_1 and b_1. As w_1 is a matrix and x is a vector, each frequency component of x can influence all nodes of the encoding h for a single data point. The strength of that influence is mainly determined by the weight matrix w_1. σ_1, σ_2 are activation functions. These functions determine what the output of a node is, based on its input and weights. The sigmoid function was used for both encoder and decoder:

σ(x) = 1 / (1 + e^(−x)) (3.3)

The sigmoid function returns a value between 0 and 1. If the result of the function is 0, the node is considered "off". The sparsity constraint will try to enforce that the average of the activations over all nodes of h equals a sparsity parameter ρ, forcing many nodes to be off.

3.3.1 Loss Function of the Sparse Autoencoder

Let n be the number of training data points. The loss function for the SAE is defined in the original paper as follows [QZH+19]:

L(w_1, w_2, b_1, b_2 | x) = (1/2n) Σ_n ||x̂ − x||²_2 + λ · ||w||² + β · Σ_{j=1}^{s_2} KL(ρ || ρ̂_j) (3.4)

The loss function is a sum of three terms. The first term

(1/2n) Σ_n ||x̂ − x||²_2 (3.5)

calculates (one half of) the mean squared distance between the original input x and the output of the model x̂. The differences between each frequency component of x and x̂ are squared and summed. We would like to minimize this value to obtain an output x̂ which is as close to the input x as possible. The second term

λ · ||w||² (3.6)

is used for weight regularization.
This term grows with the values of the weight matrix w, which incentivizes keeping the weights as small as possible. This is used to avoid overfitting. Overfitting refers to a phenomenon in which the autoencoder simply memorizes the training data as well as possible by using large weights. The autoencoder will then perform worse on new data. The third term

β · Σ_{j=1}^{s_2} KL(ρ || ρ̂_j) (3.7)

is responsible for the sparsity of the network, where KL is the Kullback-Leibler Divergence and ρ is the sparsity parameter, according to Qu et al. [QZH+19]. It is not mentioned what ρ̂ represents in this context; however, based on the naming of the parameters, it is reasonable to assume that the loss function is taken from Andrew Ng [Ng]. Ng defines ρ̂_j as the average activation of hidden unit j, given by

ρ̂_j = (1/n) Σ_{i=1}^{n} [a_j^(2)(x^(i))] (3.8)

where a_j^(2)(x^(i)) is the activation of hidden unit j given input x^(i). The third term can be written in more detail as

β · Σ_{j=1}^{s_2} KL(ρ || ρ̂_j) = β · Σ_{j=1}^{s_2} [ρ · log(ρ/ρ̂_j) + (1 − ρ) · log((1 − ρ)/(1 − ρ̂_j))] (3.9)

where s_2 is the number of nodes in the hidden layer. Typically, the Kullback-Leibler Divergence calculates the relative entropy of two probability distributions. In this case, we can think of ρ as a Bernoulli random variable with mean ρ and ρ̂_j as a Bernoulli random variable with mean ρ̂_j. The KL Divergence punishes models whose sparsity (average rate of activation across hidden units) deviates from the desired level of sparsity ρ. A high degree of sparsity refers to a low amount of node activation. For example, ρ = 0.01 would mean that on average only one of every 100 nodes is active at the same time; for this use case, ρ = 0.01 would be considered very high sparsity. The three terms are added up to form the loss function, which the model minimizes to obtain a low-dimensional sparse representation of x. Each term is assigned a weight within the loss function.
The mean squared error loss is assigned a weight of 0.5, while the degrees to which weight regularization and sparsity figure into the model are regulated by the parameters λ and β respectively. λ was set to 0.001, β to 3, and ρ to 0.15, in line with the original paper [QZH+19]. The high β leads to a high priority on sparsity.

3.3.2 Layer structure of the Sparse Autoencoder

The autoencoder used by Qu et al. consists of the following layers [QZH+19]:

• An input consisting of 1000 frequency points
• A hidden layer of 500 hidden units, followed by a 1000 node output layer
• A hidden layer of 200 hidden units, followed by a 1000 node output layer
• A hidden layer of 50 hidden units, followed by a 1000 node output layer for the final output

This type of structure of a deep autoencoder can be thought of as a funnel. The model first learns a representation of the data that represents 1000 frequency components in 500 hidden components and then rebuilds the original input from the smaller number of hidden features. Then, the model tries to reduce the number of features in the representation even further, by using only 200 features. Finally, it uses only 50 nodes to represent 1000 features. The autoencoders are trained layer by layer. First, a good 500 node representation is found. Then, a good 200 node representation is found using the resulting x̂ of the 500 node representation. Lastly, a good 50 node representation is found using the output of the 200 node representation. This process could be continued for an arbitrary number of layers. The first encoder will remove much of the information that cannot be encoded using 500 nodes and which is therefore more likely to be random noise. The same goes for each subsequent autoencoder. In essence, this structure filters the data down to its most important components, retaining as much information as possible.
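The full loss from Equations (3.4) to (3.9), with the parameter values given above (λ = 0.001, β = 3, ρ = 0.15), can be sketched in numpy as follows. The function names are our own; `x_hat`, `weights`, and `rho_hat` would come from the forward pass of the autoencoder.

```python
# Sketch of the sparse autoencoder loss: reconstruction error,
# weight regularization, and the KL sparsity penalty.
import numpy as np

def kl_divergence(rho, rho_hat):
    """Bernoulli KL divergence between desired and actual mean activations."""
    return (rho * np.log(rho / rho_hat)
            + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

def sae_loss(x, x_hat, weights, rho_hat, lam=0.001, beta=3.0, rho=0.15):
    n = x.shape[0]                                    # number of training points
    mse = np.sum((x_hat - x) ** 2) / (2 * n)          # reconstruction term (3.5)
    reg = lam * sum(np.sum(w ** 2) for w in weights)  # weight regularization (3.6)
    sparsity = beta * np.sum(kl_divergence(rho, rho_hat))  # sparsity term (3.9)
    return mse + reg + sparsity
```

The loss is zero only when the reconstruction is perfect, all weights are zero, and every hidden unit's average activation equals ρ; any deviation in the average activations increases the sparsity term.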
The sparsity constraint has an additional benefit here. The number of active nodes gets reduced at the same rate as the number of total nodes, given the same level of sparsity for each encoder. This ensures that each subsequent layer is not simply a copy of the most relevant components of the previous layer with the least useful components dropped; each layer also has to reduce the total activation of the nodes. The structure can be written in short as 1000-500-200-50. Other tested structures included in the results were 1000-500-200-100 and 1000-500-200-20 for varying output sizes, and 1000-50 for a different number of hidden layers. For the hybrid approach we use a similar structure of 1281-560-200-50. The number of nodes in the first two layers is slightly higher due to the higher number of frequencies taken into account for the hybrid model.

3.3.3 Clustering the reduced feature space

Before clustering the obtained feature space in h, we could consider removing less relevant parameters that exist due to the sparsity constraint. The idea is to keep only those features that contain the most information about h and therefore about the frequency data x. In the original paper, the authors propose a new way of reducing the encoding of the trained SAE even further, by using QR decomposition as described in Section 2.8.3. Qu et al. do not describe further how to get the features which correspond to the highest eigenvalues of the upper triangle matrix R [QZH+19].
For the purpose of the hybrid pitting detection approach, three clustering approaches were employed and compared:

• Cluster the entire encoded feature matrix M (all 50 dimensions)
• Apply PCA for a fixed number of dimensions, then cluster along the principal components
• Apply the QR decomposition approach:
  – Apply QR decomposition on M
  – Apply eigenvector decomposition on the upper triangle matrix R
  – Keep the highest k eigenvalues (λ_1, ..., λ_k), where λ_k > 10 · λ_{k+1}
  – Keep the columns of R at the same positions as these eigenvalues, and set the rest to 0
  – Multiply Q and R to obtain the original reduced feature space, with the columns whose eigenvalues were set to 0 zeroed out
  – Cluster the non-zero features

Figure 3.2: Encoded data with significant shift

PCA stands for principal component analysis and is a form of dimensionality reduction [Pea01]. After using PCA on the 50 component matrix h, we will be left with n principal components that retain as much information about the data as possible, while also accounting for differences between data points. PCA performs linear transformations on the matrix h, which alter the components [Pea01]. It can be hard to tell how a principal component is constructed. The other methods preserve the original features that were learned by the autoencoder. This can be seen as an advantage, as it keeps the properties of the latent space according to [QZH+19]. In short, clustering among all components or using QR decomposition to obtain the most relevant features leaves (some of) the original components of h as they are, while PCA transforms them. Using either PCA or the QR decomposition approach will reduce the number of features further. The clustering method employed was K-Means with k=2. Ideally, we are looking for a point in time that separates two clusters, where one cluster only contains data points from the start of the window up to that point and the other cluster contains data from after that point to the end of the window.
This would indicate a significant shift in the underlying structure of the data, where the cause of the shift occurs at the separation point. An example of data that can be separated well can be seen in Figure 3.2. Once such a clear separation in the data is found, the model will switch to the supervised phase. The result of the K-Means algorithm will be a vector of cluster labels, 0 or 1. We will use these labels as labels for training the LSTM classifier.

3.4 MFCC Computation

The LSTM model is trained on Mel Frequency Cepstral Coefficients. The steps to obtain these coefficients can be seen in Figure 3.3. We use the uncompressed frequency data of a given time frame. The time frame in question is the same time frame in which the unsupervised model spotted a potential separation point, as we have obtained the labels for this window through the unsupervised model. A Mel filterbank for the input frequencies is obtained. This is done by calculating the Mel values of the maximum and minimum frequencies (Equation (2.5)), placing a number of equidistant points on the Mel scale, and then converting these points back to the frequency scale. These points will be spread further apart at higher frequencies. A picture of these filterbanks can be seen on the right side of Figure 3.3, as well as in Figure 2.9. We then sum up the energy between each two points [Lyoa]. This results in the signal seen in the second frame from the top in Figure 3.3. We then apply the logarithm to the obtained energies. This is typically done to resemble human hearing [Lyoa]. In this case it can be seen as a form of normalization. The final step is to take the discrete cosine transform of the logarithm of the filterbank energies. As the DCT is similar to the Fourier transform, we obtain a spectrum of a spectrum, hence the anagram "cepstrum".
The absolute values of this resulting cepstrum are called Mel Frequency Cepstral Coefficients and form the basis of the LSTM inputs. These coefficients represent the rate of change (frequency) in the filterbank energies, and by extension the frequencies of the frequencies (known as quefrencies) [Lyoa]. Afterwards, we calculate the first and second derivative in the time domain for each MFCC, leaving us with 120 data points per time interval. These derivatives improve model performance in speech processing and represent the trajectory of the MFCC over time. In order to leverage temporal context with the LSTM, we need to split our input data into batches. Each batch contains multiple data points. The more data points there are per batch, the more context is provided, as the model considers all the data points in a batch as consecutive time steps. A batch that contains 10 data points allows the model to learn from information from up to 10 time steps ago. For the next section, we assume a batch size of 10 data points.

Figure 3.3: Steps of MFCC computation

3.5 LSTM Architecture

The LSTM classifier is divided into three layers:

• An input layer of shape 10 x 120
• A hidden layer of 180 LSTM units
• A fully connected classification layer consisting of 2 units to signal presence or absence of pitting

The first layer receives the data in the correct input shape. In this case we feed the model 10 data points at a time, with 40 Mel frequency cepstral coefficients, 40 delta coefficients and 40 delta-delta coefficients. The input layer passes the data to the LSTM layer. The middle layer contains LSTM units. These units are controlled by gates (see Section 2.6.2) that allow the LSTM to store information about previous inputs without being subjected to exploding or vanishing gradients [HS97]. The last layer displays the output of the LSTM as a binary classification choice. The model returns either 0 or 1.
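As a shape-level illustration of this architecture, the following numpy sketch runs 10 time steps of 120 features through 180 LSTM units and a 2-unit output layer. The weights are random placeholders rather than trained parameters, so the sketch only demonstrates the data flow through the gates, not classification quality.

```python
# Shape-level sketch of the classifier: (10, 120) input -> 180 LSTM units
# -> 2-unit output. Weights are untrained placeholders.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(x, n_hidden=180):
    """Run a single LSTM layer over x of shape (time_steps, n_features)."""
    rng = np.random.default_rng(0)
    t_steps, n_in = x.shape
    # one combined weight matrix for the input, forget, cell and output gates
    W = rng.normal(scale=0.1, size=(n_in + n_hidden, 4 * n_hidden))
    b = np.zeros(4 * n_hidden)
    h = np.zeros(n_hidden)   # hidden state
    c = np.zeros(n_hidden)   # cell state
    for t in range(t_steps):
        z = np.concatenate([x[t], h]) @ W + b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # gated cell update
        h = sigmoid(o) * np.tanh(c)
    return h

def classify(batch):
    """Map the final hidden state to the 2-unit output layer."""
    h = lstm_forward(batch)
    rng = np.random.default_rng(1)
    w_out = rng.normal(scale=0.1, size=(h.size, 2))
    return int(np.argmax(h @ w_out))   # 0 = "no pitting", 1 = "pitting"
```

In practice a deep learning framework would provide the trained LSTM layer and output layer; the loop above only mirrors the gate equations for one batch of 10 consecutive cepstra.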
In order to add more classes to the classification, we could add more nodes; [MCC+19] uses an output layer of 9 nodes. Other types of output layers could also be considered. For example, it might be desirable to return a single value as a damage indicator, similar to [KDK19; KDK20]. The activation function for the LSTM model is the sigmoid function from Equation (3.3). The loss function of the classifier is the cross entropy function for two classes:

L = −(y · log(p) + (1 − y) · log(1 − p)) (3.10)

where y is the true label (either 1 or 0) and p is the probability calculated by the model that the label of the observation is 1 [che]. Cross entropy is a common loss function used for classification.

3.6 Comparison between Autoencoder and LSTM

In order to understand why a hybrid approach is used instead of either one of these models alone, we need to understand how they compare to each other. Both models fall under the category of neural networks, as both try to mimic the structure of the brain by using a node structure. The sparse autoencoder typically has more layers than the LSTM, as it requires the depth to learn the latent features of the data. The biggest difference, of course, is that the SAE is an unsupervised model, while the LSTM is a supervised model. The LSTM therefore requires labeled training data, while the SAE only requires the input data. The reason for this stems from the purpose of these models. The SAE is trained to approximate the input data with a smaller number of features, while the LSTM model is supposed to classify between pitting and no pitting. In order to know if it is performing this classification well, the LSTM needs feedback on the returned classifications to learn from the data. The SAE simply needs the original input to verify its result (which should ideally match the input data). Both of these models could be adapted to perform the task of the other model.
The autoencoder could be connected to a classification layer instead of a final layer that has the same dimension as the input. This layer would essentially be identical to the classification layer of the LSTM. The loss function of the final autoencoder would have to be changed to contain, for example, cross entropy instead of the mean squared error. It could still keep the other components of the loss function, including the sparsity component. LSTM models can be turned into autoencoders by chaining an LSTM model that takes a large input and reduces it to a small number of features with an LSTM model that takes those features and tries to return the original input. These models are known as sequence to sequence models [SVL14] and are often used in natural language processing. The autoencoder is trained on frequency data, while the LSTM is trained on MFCCs. This is mostly a result of using the inputs used in the original papers [MCC+19; QZH+19]. The SAE could be trained on the MFCCs and vice versa. LSTM models are often trained on MFCCs in natural language processing, and the choice of input data makes sense under that aspect. We could also feed the learned feature space to the LSTM as extra parameters; however, this risks the LSTM simply prioritizing these features and copying the results of the SAE. Having two different transformations of the vibration data as inputs allows us to see that changes are visible in both the spectrum and the cepstrum. Another main difference lies in the hidden nodes of the models. While the sparse autoencoder uses a fully connected layer, meaning every input node can influence every hidden node, the LSTM model uses special LSTM nodes. These nodes are controlled by gates and allow the model to factor past data into its classification process, with the past data dating back an arbitrary number of data points.
This is responsible for the last main difference, which is that the LSTM is fed with batches of multiple data points at once, while the sparse autoencoder learns one data point at a time.

4 Realisation of Solution

The two-phase structure in Section 3.2 can be broken up further into steps that need to be performed in each phase. These steps can be seen in Figure 4.1. The model stays in the unsupervised loop until data is found that can be separated into two distinct clusters based on time.

4.1 Experimental Setup

The data for the analysis was produced by two gearboxes, with the input gear operating at 2150 rotations per minute (35.83 Hz) with a torque of 650 Nm. These gears were two of several that were tested on the same setup. The gears were akin to those used in car gearboxes, and the operation simulated a car driving in second gear non-stop until failure. The gears were labeled as gear 6 and gear 11. The data was collected every 24 hours for 10 minutes, from start to end of the gear life cycle. Gear 6 produced 14 files of data (files 0 to 13), where data was collected from 12:00 to 12:10 every day over a span of 13 days. The exception to this is the latest data file (file 13) of gear 6, which contains data from 22:50 to 23:30 from the same day as file 12. The operation was stopped soon after the recording of file 13 due to the gear being significantly damaged.

Figure 4.1: The Hybrid model step by step

For gear 11, 51 files of data were produced. Two of the data files (files 10 and 24) contain only three minutes of data and were not recorded at 12:00, unlike the other files. There is a significant lapse in recording between files 9 and 10: file 10 was recorded 24 days after file 9, after which the daily recording schedule continued. All vibration data was recorded at 51 200 Hz, resulting in 30 720 000 data points per measurement over each 10 minute time span.
All three spatial dimensions were recorded with accelerometers.

4.2 Data selection and model control flow

In order to simulate an environment in operation, only two data files are loaded at a time. We start with the first two data files (files 0 and 1). Of these 20 minutes of data, only a four minute time frame is considered at a time. We chose four minutes because this is enough data to train the models while not being so much data that training slows down excessively, which would make looping through the unsupervised phase in real time impossible. This four minute window corresponds to a fixed number of data points determined by the sampling rate and the number of data points used per frequency spectrum. The frequency data of this time window is copied and compressed to be used as input for the autoencoder. If no pitting is detected, the time frame is advanced by two minutes: if the window spans minute i to minute i+4, after the update it spans i+2 to i+6. The window is shifted by only half of its own length so as not to miss a potential shift happening at the edge of the frame. Once i is larger than the last minute of the earlier data file, the later data file becomes the earlier data file and the next file is loaded in. As an example, if file 2 and file 3 are loaded, file 3 is saved in the variable of file 2 to free up memory, and file 4 is loaded in place of file 3. This allows the model to move the window between data files. The index of the files grows as we load in new data files to allow for easy identification of separation points.

4.3 Data Import

The vibration data used for the training and testing of the hybrid model is stored in .tdms files. TDMS files are hierarchical data structures that allow for storage of file information, like the author names, in the top layer, while the actual data is stored in the lowest of the three hierarchical layers, known as channels.
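The sliding-window control flow above can be sketched as a generator; the function name is illustrative and the 10-minute file length is taken from the experimental setup:

```python
def window_schedule(n_files, file_minutes=10, window=4, shift=2):
    """Sketch of the window control flow: a `window`-minute frame slides
    over a stream of `n_files` files of `file_minutes` each, advancing by
    `shift` minutes (half the window length) per iteration.
    Yields (start, end) in global minutes."""
    start = 0
    while start + window <= n_files * file_minutes:
        yield (start, start + window)
        start += shift
```

With two loaded files the schedule yields (0, 4), (2, 6), ..., (16, 20); each window overlaps its predecessor by half its length, so a shift at a window edge is always fully contained in some window.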
In order to work with the data more easily, it is first transformed into a pandas DataFrame. The data is arranged in channels, with the headings for vibration data carrying the prefix Raw_VibX, where X is 1, 2 or 3 depending on the spatial dimension of the recorder. The resulting DataFrame has the dimensions 512 000 x 180, with each column containing the vibration data for 10 seconds. The entire DataFrame contains 10 minutes of vibration data for each vibration dimension.

4.4 Vibration Data Transformation

The DataFrame is reshaped to contain 3 columns with 10 minutes of vibration data, resulting in a 30 720 000 x 3 DataFrame where each column contains one dimension of vibration data. Each data point is a float number that represents the acceleration of the sensor at that particular point in time. To speed up model training, we only use data from the first of the three dimensions. Pitting patterns can be observed in all three dimensions, with some dimensions providing better cluster distinction than others. In early tests, we found dimension 1 to produce the best results.

4.4.1 Fast Fourier Transform

Before the transform itself, the data is segmented into smaller pieces, with the segment length given by a parameter n_vib. For the hybrid model, the number of data points used for each spectrum calculation was set to 10 000, capturing 7.7 rotations of the input gear per frequency spectrum. The model has been found to be relatively robust to the number of data points per frequency spectrum. However, it is recommended to capture at least one rotation of each gear in each frequency spectrum; for a smaller number of data points, some teeth would be left out of each spectrum. Because pitting damage is localized to a small number of adjacent teeth [Gre22], we want to capture the mesh of all gear teeth in the spectrum.
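The reshape from the 512 000 x 180 chunked layout to one column per spatial dimension can be sketched as follows; the assumption that all chunks of one dimension form a contiguous block of columns is illustrative, and the real Raw_VibX channel ordering must be checked against the TDMS file:

```python
import numpy as np

def stack_channels(raw, n_dims=3):
    """Reshape a (samples_per_chunk, n_chunks * n_dims) block into a
    (samples_per_chunk * n_chunks, n_dims) array, assuming the columns
    belonging to one dimension are contiguous (all Raw_Vib1 chunks first,
    then Raw_Vib2, and so on). Adapt the slicing to the real layout."""
    samples_per_chunk, total_cols = raw.shape
    per_dim = total_cols // n_dims
    # Concatenate each dimension's chunks in time order, then stack as columns.
    cols = [raw[:, d * per_dim:(d + 1) * per_dim].T.reshape(-1)
            for d in range(n_dims)]
    return np.stack(cols, axis=1)
```

Applied to the thesis data this turns the 512 000 x 180 block into the 30 720 000 x 3 array used for the transforms below.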
Cutting vibration data into smaller segments can be seen as applying a transformation function to the data which keeps the data in the segment the same while setting all other data points to 0. This transformation affects the frequency spectrum by adding frequencies that are not present in the original signal. This process of spectral noise generation is known as spectral leakage [Nut81]. In order to combat spectral leakage, we can apply a window function to the data. The results of windowing can be seen in Figure 4.2. Applying a windowing function is a different type of transformation, one that typically attenuates the data at the edges of the window while keeping the middle of the data the same. Depending on the window function, we can influence which frequencies are distinguished more easily. When no windowing is applied in this step (meaning that technically a rectangular window is applied, as described above), components with similar frequencies and amplitudes are distinguished more easily (high resolution), while components with more dissimilar frequencies and amplitudes are harder to distinguish (low dynamic range). Applying a window function here can adjust resolution and dynamic range, but there is always a trade-off: no window function allows for high resolution and high dynamic range at the same time [Nut81]. For the purpose of the hybrid model, a rectangular window (no specific window) was applied, as the noise resulting from spectral leakage should be discarded by the sparse autoencoder. In order to transform the vibration data from the time domain to the frequency domain, the SciPy implementation of the fast Fourier transform was used, more specifically the real fast Fourier transform (scipy.fft.rfft). The method takes a vector of vibration data v of length n_vib and the number of frequencies to be considered in the spectrum.
If the number of frequencies is larger than the number of data points in v, the vector is padded with zeroes. The method returns all positive frequency terms. Given an even transform length n, the number of returned frequencies is ⌊n/2⌋ + 1 [com]. Rfft returns only half the frequency components, as the other half is symmetrical to the first and would therefore not add any new information to the spectrum.

Figure 4.2: Application of a Hanning window to a segment of vibration data
Figure 4.3: Vibration data of gear 6 before (left) and after (right) Fourier transform

In our case, we chose to compute all frequency components that could be captured in the data. Due to the Nyquist sampling theorem, we can capture frequencies up to half of the sampling rate of 51 200 Hz, leaving us with 25 600 Hz. As mentioned above, we segment the data into pieces of 10 000 data points, turning the original 30 720 000 data points of vibration data into a list of 3 072 segments of 10 000 data points each. Rfft is applied to each of the segments with a maximum frequency of 25 600. This returns 12 801 frequency components for each segment, leaving us with a 3 072 x 12 801 matrix, where each column represents a frequency component and each row represents the frequency spectrum of 10 000 data points of vibration data. The result of the rfft can be seen in Figure 4.3. The matrix obtained by the fast Fourier transform contains complex values, which hold information about the phase and the amplitude of each frequency component. As we are interested in the amplitudes, we take the absolute values of the frequency data.

4.4.2 MFCC Computation

As there does not seem to be a standard MFCC implementation, the code for calculating the coefficients is taken from a kaggle notebook [ILM]. For the MFCCs, we want to calculate the energy in different frequency regions [Lyoa]. These regions are defined by filters.
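The segmentation and transform steps can be sketched as follows; np.fft.rfft is used here in place of scipy.fft.rfft (both share the same semantics for the n argument), and the function name is illustrative:

```python
import numpy as np

def magnitude_spectra(v, n_vib=10_000, n_freq=25_600):
    """Segment a 1-D vibration signal into n_vib-point pieces (rectangular
    window) and compute the magnitude spectrum of each. Passing n=n_freq
    zero-pads each segment, so every spectrum has n_freq // 2 + 1 bins."""
    n_seg = len(v) // n_vib
    segments = v[:n_seg * n_vib].reshape(n_seg, n_vib)
    return np.abs(np.fft.rfft(segments, n=n_freq, axis=1))
```

With the thesis parameters, 30 720 000 samples yield a 3 072 x 12 801 matrix of non-negative amplitudes, matching the description above.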
The calculation of these filters can be seen in Figure 4.4. We calculate the Mel frequency for the highest and lowest frequency we have obtained through the Fourier transform using Equation (2.5). In our case, our lowest frequency component is at 0 Hz, while our highest component is at 25 600 Hz. The respective Mel values of these extremes are 0 and 4086.75. We then obtain evenly spaced points on the Mel scale between the two extreme values and transform the values back to the standard frequency scale using the inverse of Equation (2.5).

Figure 4.4: Calculation of the mel filters

The values in these frequency ranges are then summed up to represent the energy in each region [Lyoa]. The choice of the number of filters influences how many frequencies are summed up per filter. We used 45 filters, turning our 12 801 frequency components into 45 slightly overlapping energy regions. The base 10 logarithm of each region is taken, resulting in lower values for higher frequency regions. In speech processing this is done to mimic the non-linearity of human hearing, which requires roughly 8 times as much energy to double the perceived volume of a sound [Lyoa]. Finally, the discrete cosine transform is applied to obtain the cepstrum. The number of cepstral coefficients is typically lower than the number of Mel filters, as the highest coefficients, which represent the fastest changes in energy in the Mel filterbank, actually decrease model performance (in speech processing). We set the number of discrete cosine transform components to 40 in order to discard the higher quefrency regions. This algorithm turns our 3 072 x 12 801 frequency matrix into a 3 072 x 40 Mel frequency cepstral coefficient matrix C.

4.5 Sparse Autoencoder Implementation

The single layer autoencoders were implemented using the TensorFlow python library. The single layer encoders were then chained together to create a deep sparse autoencoder.
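The MFCC steps above can be sketched as follows. The triangular filter shape is an assumption on our part (the thesis follows a kaggle notebook [ILM], whose exact filter shape may differ), and the function names are illustrative:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)   # Equation (2.5)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)  # inverse of Equation (2.5)

def mfcc_from_spectra(spec, sr=51_200, n_filters=45, n_coeff=40):
    """Sketch: evenly spaced points on the Mel scale define overlapping
    (here: triangular) filters; filterbank energies are log10-compressed
    and a DCT keeps the lowest n_coeff quefrency components.
    spec: (n_frames, n_bins) magnitude spectra covering 0 .. sr/2 Hz."""
    n_bins = spec.shape[1]
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_bins - 1) * mel_to_hz(mel_pts) / (sr / 2)).astype(int)
    fbank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):  # rising and falling flank of each filter
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    energies = np.maximum(spec @ fbank.T, 1e-12)  # avoid log10(0)
    return dct(np.log10(energies), type=2, axis=1, norm='ortho')[:, :n_coeff]
```

Note that hz_to_mel(25 600) reproduces the Mel value 4086.75 quoted above, and a 3 072 x 12 801 spectrum matrix is reduced to a 3 072 x 40 coefficient matrix.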
The autoencoder takes training and test data in the form of frequency data. It returns the encoding of the frequency components of the last encoder layer applied to the original data. The autoencoder can be adjusted by selecting a model structure and tuning the sparsity parameters 𝛽 and 𝜌 as well as the weight regularisation parameter 𝜆.

4.5.1 Preprocessing

To train the autoencoder, we use the frequency data of the current 4 minute time frame. To speed up the training process, we use only every tenth frequency component, meaning components 0, 10, 20, and so on. This leaves us with 1228 data points with 1281 frequency components each. The data is split into training and test data, with 80% used for training. The autoencoder is trained on the training data while the test data is withheld in order to ensure that the autoencoder does not overfit. If the SAE were to overfit by "learning the data by heart", it would perform badly on the test data. When splitting the data into training and test data, the input is randomly shuffled. While we want to evaluate the clustering later by separating the clusters by time, this randomization of the order of data points is desirable, as we do not want the model to cheat by learning structure from the order of the input data alone. Doing so could cause "data leakage" by factoring in information that is not given to the model in operation, artificially inflating its performance in training. In order to remember the order of the data points for the later clustering, we save the indices of the shuffled DataFrames after the train-test split in additional variables. These indices will be re-attached to the encoded data later to restore the original order of the data for the clustering. Next, the data is converted to a tensor to enable it to be used as input for a TensorFlow model. This conversion removes the indices of the data, which is why we saved them in the previous step.
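The shuffle-split-and-remember-indices step can be sketched in NumPy (standing in for the pandas and TensorFlow operations; the function name is illustrative):

```python
import numpy as np

def split_keep_indices(data, train_frac=0.8, seed=0):
    """Shuffle the rows, split into train/test, and return the shuffled
    row indices alongside the splits, so the encoded outputs can later be
    re-sorted into chronological order for the clustering."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    cut = int(train_frac * len(data))
    train_idx, test_idx = idx[:cut], idx[cut:]
    return data[train_idx], data[test_idx], train_idx, test_idx
```

For the 1228 x 1281 input this yields a 982-row training set and a 246-row test set, with the original positions preserved in the two index arrays.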
After preprocessing, we are left with a training tensor of shape 982 x 1281 and a test tensor of shape 246 x 1281.

4.5.2 Single Layer Autoencoder Structure

Each single layer autoencoder consists of an encoder and a decoder with fully connected layers (in TensorFlow, these layers are called Dense layers). The last layer of the decoder needs to have the size of the original input in order to replicate the input. It is possible to pass multiple fully connected layers to the encoder and the decoder respectively. In our case, we only pass one layer each to the encoder and the decoder. The depth of the model is achieved by using the output layer of a single layer encoder as the input layer for the next encoder. The activation function for both the encoder and the decoder is the sigmoid function (Equation (3.3)). The model takes parameters to adjust the number of nodes in the encoder layer, the decoder layer structure and the weight regularization parameter 𝜆. The number of nodes in the encoder layer shrinks progressively from encoder to encoder when combining the single layer encoders into a deep autoencoder, creating a funnel effect. Each single layer autoencoder, when used as a whole, takes an input x and returns an output x̂ of the same dimensions. In our case, the autoencoder takes our tensor of size 982 x 1281 and learns to construct a similar tensor of size 982 x 1281. However, we can also call only the encoder of the trained autoencoder. In this case, the encoder returns a tensor of dimension 982 x h, where h is the number of nodes in the hidden encoder layer and typically h ≪ 1281. This tensor is the mapping of the data points to the learned feature space of the autoencoder. It depicts the underlying structure of the frequency spectrum.
Figure 4.5: Code of the single layer autoencoder model

4.5.3 Loss Function

As seen in Equation (3.4), the loss function for the sparse autoencoder is a sum of three parts. TensorFlow does not support this loss function directly; it needs to be reconstructed from its individual parts. The first term of the loss function is implemented in the built-in TensorFlow MeanSquaredError loss function [Gooa]. This is the entire loss function as far as TensorFlow is concerned, as TensorFlow is primarily interested in optimizing the result of the autoencoder, which in theory is to replicate the input x as well as possible by reconstructing it from a lower dimensional representation. For the purpose of pitting detection, however, we are more interested in the properties of the encoding than in the replication of the frequency data. The second term concerns the size of the weights of the encoder and is commonly known as l2 regularization. This term is added to avoid overfitting and allow for good results on new data. In order to add this term, we set the kernel_regularizer, which influences the size of the weights, with the built-in TensorFlow L2 Regularizer [Gooc]. Regularizers in TensorFlow are a way to influence parts of a layer in a TensorFlow model. The kernel_regularizer applies a penalty to the kernel of the layer, the bias_regularizer applies a penalty to the bias of the layer, and the activity_regularizer applies a penalty to the layer output [Good]. Looking at the hidden encoder layer as the result of the function in Equation (3.1), the kernel_regularizer in TensorFlow penalizes the weight matrix 𝑤1, the bias_regularizer influences the bias vector 𝑏1 and the activity_regularizer influences the result of the sigmoid function h. As the second term of the loss function (Equation (3.6)) deals with the weight matrix w, we need to use a kernel_regularizer.
For the Kullback-Leibler divergence, we have to create a custom regularizer based on Equation (3.9). The code for the regularizer is adapted from [Lin] and can be seen in Figure 4.6. As we need to influence the average activation of the nodes in the hidden layer, we need to use an activity_regularizer. The activity_regularizer is given all the values of the nodes of the hidden layer in the form of a vector h.

Figure 4.6: Code of the sparsity penalty as part of the loss function

As we care about the average activation, we take the mean of h and compare it to the desired level of sparsity 𝜌. The activity_regularizer class only allows one input, which is the vector h. In order to adjust beta and rho for testing purposes, we use global variables. We use beta to increase the impact of the sparsity penalty, while rho dictates the desired sparsity.

4.5.4 Combining Single Layer Encoders

In order to turn the single layer autoencoders into a sparse multilayer autoencoder, we train one autoencoder after another and use the result of the previous single layer encoder to train the next one. We allow for multiple encoder and decoder layers per autoencoder. It would theoretically be possible to pass different values for 𝛽, 𝜆 and 𝜌 to each single layer encoder, or even to use two different encoder layers as part of the same encoder-decoder combination. In order not to get lost in too much testing, and to avoid overfitting, we kept the hyperparameters the same across all layers for each sparse autoencoder. This method seems to be sufficient so far, but could be interesting to investigate in the future. Training of a three layer sparse autoencoder is done in the following fashion. We copy our training data tensor X with shape 982 x 1281. We then train a single layer autoencoder (𝜆 = 0.001, 𝛽 = 3, 𝜌 = 0.15) with a hidden layer of 560 nodes on the training data. We then let the trained autoencoder replicate X, obtaining X'.
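The mathematics of the sparsity term can be sketched in NumPy (the TensorFlow activity_regularizer in Figure 4.6 computes the same quantity on tensors; the function name is illustrative):

```python
import numpy as np

def kl_sparsity_penalty(h, rho=0.15, beta=3.0):
    """Sparsity penalty beta * sum_j KL(rho || rho_hat_j), where rho_hat_j
    is the mean activation of hidden unit j over the batch. Activations
    are sigmoid outputs, so they lie in (0, 1)."""
    rho_hat = np.clip(h.mean(axis=0), 1e-7, 1 - 1e-7)  # avoid log(0)
    kl = (rho * np.log(rho / rho_hat)
          + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return beta * float(kl.sum())
```

The penalty is zero exactly when the mean activation of every hidden unit equals rho, and grows as the average activations drift away from the desired sparsity level.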
We use X' to train a single layer autoencoder with the same hyperparameters, except that we only give it 200 hidden nodes. The trained autoencoder obtains X'' from X'. Lastly, we train an autoencoder with 50 hidden nodes on X''. This time, however, we use only the encoder of the trained autoencoder on the original data X, obtaining a reduced feature space in the form of a tensor M_train with dimensions 982 x 50. We also use the encoder on the test data to obtain a tensor M_test with dimensions 246 x 50.

4.5.5 Clustering SAE results

After obtaining M_train and M_test, we turn them into DataFrames and re-attach the saved indices. We then concatenate and sort the data points by their index to obtain the underlying structural data M as a 1228 x 50 DataFrame in chronological order. Before clustering, we can try to reduce the feature space further, either with principal component analysis (PCA) or with the QR decomposition method proposed by Qu et al. [QZH+19]. If we use PCA, we take the first three principal components of the data, giving us a clustering on a 1228 x 3 matrix. The number of dimensions kept by the QR method varies based on the data and on the cutoff threshold that decides when to discard eigenvalues and their corresponding features; in any case, we cluster on a 1228 x n feature space with n < 50. Alternatively, we can cluster on the entire 1228 x 50 feature space. All three methods, as mentioned in Section 3.3.3, were tested. For the clustering of the reduced feature space, K-Means clustering was employed with k=2. The result of the K-Means clustering is a 1228 x 1 vector y which contains a cluster label in {0, 1} for each frequency spectrum. We investigate whether y contains two clear clusters separable by time.
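The check for time-separability can be sketched as a direct scan over candidate split points; the purity threshold of 0.99 matches the value used in this work, and the function name is illustrative:

```python
import numpy as np

def separable_by_time(y, threshold=0.99):
    """Return the first index i such that at least `threshold` of the
    labels before i belong to one cluster and at least `threshold` of the
    labels from i onward belong to the other cluster, or None if no such
    split point exists."""
    y = np.asarray(y)
    for i in range(1, len(y)):
        for a in (0, 1):  # try both assignments of clusters to halves
            early_purity = np.mean(y[:i] == a)
            late_purity = np.mean(y[i:] == (1 - a))
            if early_purity >= threshold and late_purity >= threshold:
                return i
    return None
```

A label sequence of 50 zeros followed by 50 ones is separable (the scan returns the split index), while an alternating sequence is not.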
We call clusters separable by time if, for the current data 𝑦 = (𝑦1, 𝑦2, ..., 𝑦𝑛), there exists a point in time i for which separability · 100 % of the data in (𝑦1, ..., 𝑦𝑖) can be put into one cluster and separability · 100 % of the data in (𝑦𝑖+1, ..., 𝑦𝑛) can be put into the other cluster. Here, separability is a threshold value between 0 and 1 that decides whether the cluster purity is high enough to consider the data separable by time, and therefore to assume that a shift in the underlying signal, which could indicate pitting, has happened. separability is usually set to 0.99. If the cluster purity is higher than separability, the model transitions to the supervised phase.

4.6 Implementation of the supervised phase

The goal of the supervised phase is to obtain a classifier that can distinguish between vibration data with pitting and vibration data without pitting. The problem with training a classifier on the vibration data has been the lack of labeled training data. Now, we can use the cluster labels y as the labels for the training of the Long Short Term Memory classifier. The LSTM takes Mel frequency cepstral coefficients as input instead of frequency spectra. The calculation of these coefficients is shown in Section 4.4.2. Additionally, the first and second derivative of the MFCCs are calculated. These derivative coefficients are calculated as follows [Lyoa]:

𝑑𝑡 = (∑_{𝑛=1}^{𝑁} 𝑛 (𝑐_{𝑡+𝑛} − 𝑐_{𝑡−𝑛})) / (2 ∑_{𝑛=1}^{𝑁} 𝑛²)   (4.1)

The values obtained by the first derivative are known as delta coefficients. The delta coefficient 𝑑𝑡 can be seen as the slope between the neighbouring coefficients of MFCC 𝑐𝑡 (𝑐𝑡−1 and 𝑐𝑡+1). The parameter N allows us to also include the neighbours of the neighbours. For the hybrid model, N is set to 2. The derivatives of the delta coefficients are called delta-delta coefficients. By padding the edges of the MFCC matrix C with zeroes, we obtain an additional 40 delta and 40 delta-delta coefficients per data point.
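Equation (4.1) with zero-padded edges can be implemented as follows (applying the function twice yields the delta-delta coefficients); the function name is illustrative:

```python
import numpy as np

def delta(C, N=2):
    """Delta coefficients per Equation (4.1): for each frame t,
    d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2),
    with the edges of C zero-padded. C: (n_frames, n_coefficients)."""
    T = len(C)
    padded = np.pad(C, ((N, N), (0, 0)))  # zero rows before and after
    denom = 2 * sum(n * n for n in range(1, N + 1))
    num = sum(n * (padded[N + n:N + n + T] - padded[N - n:N - n + T])
              for n in range(1, N + 1))
    return num / denom
```

On a linearly increasing coefficient sequence, the interior delta values come out as the slope of that sequence, as expected from the weighted-difference form of the formula.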
We add these to the MFCCs to obtain our input matrix X of shape 1228 x 120. The code for calculating these coefficients was taken from [Lyob]. The input X is also split into a training set X_train (∈ ℝ^(982 x 120)) and a test set X_test (∈ ℝ^(246 x 120)). Because the classifier is a supervised model, we also split y into y_train and y_test. In order to leverage the power of the LSTM model to take previous data points into account, we need to reshape the input data into batches whose size is given by the parameter batch_size. Finding a balance between batch_size and the number of inputs is important: making the batch size too small limits the number of historical data points the LSTM can leverage in its prediction, while a very small number of inputs will likely limit the ability of the model to generalize and classify new data. We experiment with different levels of batch_size, defaulting to batch_size = 10. By using the TensorFlow reshape method, we turn our input matrices X_train and X_test into tensors of shapes 98 x 10 x 120 and 24 x 10 x 120, dropping the few data points that do not fill a complete batch. The label vectors are reshaped to 98 x 10 x 1 and 24 x 10 x 1 respectively.

Figure 4.7: TensorFlow Code of the LSTM model

4.6.1 Model Architecture

The LSTM model itself consists of an input layer, an LSTM layer with n_lstm_nodes hidden units and a fully connected layer with two units representing the classification outputs (pitting and no pitting in theory, cluster 1 and cluster 2 in practice). The activation function used is again the sigmoid function, while the loss function is sparse categorical crossentropy [Goob], which is used for classification and produces a cluster number. The code for the LSTM model itself can be seen in Figure 4.7. In addition to varying batch_size, the model can be tuned by adjusting the number of hidden units in the LSTM layer.
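The batching reshape can be sketched as follows (the function name is illustrative; the thesis uses the TensorFlow reshape method, which behaves identically on the retained rows):

```python
import numpy as np

def to_batches(X, batch_size=10):
    """Group consecutive frames into batches of `batch_size` time steps,
    dropping the trailing remainder (982 frames -> 98 batches of 10)."""
    n = (len(X) // batch_size) * batch_size
    return X[:n].reshape(-1, batch_size, X.shape[1])
```

Applied to X_train this yields the 98 x 10 x 120 tensor described above; the same call on X_test yields 24 x 10 x 120.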
It is important to set the parameter "return_sequences" to true to make the model return the classification of all data points and not just the last data point in each batch. Once the model is trained on X_train and y_train, it produces a result vector with two features when given input data. For example, if we feed X_train to the classifier, it returns ŷ_train as a 98 x 10 x 2 tensor. We reshape the tensor into a 980 x 2 DataFrame. The two columns of ŷ_train represent a one-hot encoding of the cluster labels: for a given data point X_train[i], if ŷ_train[i][0] > ŷ_train[i][1], the data point is considered part of cluster 0, and vice versa.

4.6.2 LSTM Validation

The trained classifier is considered the result of the supervised phase and of the model as a whole. In order to validate whether the classifier can actually classify pitting data, we import data that was recorded before and after the window the LSTM was trained on. For example, if the model was trained on data in files 4 and 5, we use all 20 minutes of files 3 and 6 as validation data. Data in file 3 is considered to have no pitting, and data in file 6 is considered to have pitting. If the classifier can classify over 95% of the data correctly, it is considered to have found pitting damage.

5 Evaluation of Solution

The model is first and foremost evaluated on its ability to detect pitting damage. This was done using parameters as close to the original papers [MCC+19; QZH+19] as reasonable. We then evaluate potential changes to parameters, as well as model components. The parameters used are listed in Table 5.1. Both machine learning models were evaluated with regard to their accuracy. Accuracy was investigated for different hyperparameters in both models. It turns out that accuracy is not a good evaluation metric in operation for this specific task, as will be discussed in Section 5.5.2.
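The decoding of the model output and the validation rule can be sketched as follows (function names are illustrative):

```python
import numpy as np

def decode_predictions(y_hat):
    """Flatten a (batches, steps, 2) output tensor to a label vector by
    taking the argmax over the two output units per data point."""
    return y_hat.reshape(-1, y_hat.shape[-1]).argmax(axis=1)

def validates_pitting(y_pred, y_true, threshold=0.95):
    """Validation rule: the classifier is considered to have found pitting
    if it labels more than `threshold` of the validation data correctly."""
    return float(np.mean(y_pred == y_true)) > threshold
```

For example, a 98 x 10 x 2 output becomes a flat vector of 980 cluster labels, which is then compared against the assumed labels of the before/after validation files.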
Parameter                Description                                                      Value
frequency                maximum frequency calculated for FFT                             25600
n_vib                    number of data points per Fourier transform                      10000
vibration_dimension      the spatial dimension of the vibration                           1
compression_factor       factor by which SAE input is smaller than FFT result             10
n_coefficients           number of MFCC inputs for the LSTM                               40
n_mel_filters            number of Mel filterbanks                                        45
𝛽                        sparsity penalty weighting in SAE loss function                  3
𝜆                        weight regularization weighting in SAE loss function             0.001
𝜌                        desired level of sparsity in SAE loss function                   0.15
sae_encoder_layers       numbers of hidden nodes per SAE encoder layer                    560-200-50
sae_decoder_layers       numbers of hidden nodes per SAE decoder layer                    560-200-50
sae_epochs               number of epochs of SAE training                                 100
lstm_epochs              number of epochs of LSTM training                                100
batch_size               number of data points fed to the LSTM at the same time          10
n_lstm_nodes             number of hidden units in the LSTM layer                         180
accuracy_threshold       level of accuracy to consider trained classifier useful          0.95
separability_threshold   percentage of data points in their correct cluster               1.0
window_time              number of minutes for which data points are used for training    4
window_shift_factor      percentage of the window by which it gets moved per iteration    2

Table 5.1: List of parameters used in the algorithm

Figure 5.1: Underlying structure across files
Figure 5.2: Frequency spectrum comparison between data with and without pitting damage

5.1 Global view of the underlying structure of the data

If the autoencoder is given all of the vibration data for gear 6 (14 files), its first principal component shows the structure displayed in Figure 5.1. We can see that the last two data files contain a significantly different underlying structure. Looking at the absolute values of the mean frequency spectrum of the first and last file shows significant differences; especially in the high frequency areas, the later file shows big spikes.
In a realistic scenario, we would like to discover pitting damage at an earlier stage than it is found in the last two files of the data. This is why most papers tend to look at pitting damage at various stages of deterioration [MCC+19; QZH+19]. The problem with testing the hybrid detection approach on real data is that it is hard to determine when the pitting damage first occurred. In order to see if earlier stages of pitting damage can be spotted in the data, we looked for additional points that separate the data encoded by the sparse autoencoder in the time domain. When considering a separability threshold of 0.99, five such points were identified. These points were used to separate the data into five areas. The LSTM classifier was trained with different parameters to check if any of these points could indicate early stages of pitting.

5.2 Accuracy of SAE for different Parameters

The models were trained on the data files with clear pitting and no pitting, as well as on the data in the five clusters found by the hybrid model. The models were also evaluated on four combinations of consecutive files with no severe pitting damage, namely files (4,5), (5,6), (7,8) and (9,10).

5.2.1 Accuracy of the clustering methods

When evaluating the different methods of further dimensionality reduction, namely no reduction, taking the two highest principal components, and taking the features associated with the highest eigenvalues of the upper triangular matrix R in the QR decomposition, accuracy refers to the percentage of data points correctly identified as part of a cluster that contains the data points of one of the two files. An accuracy of 50% is the lowest possible accuracy, which often occurs when all data points are put into the same cluster.
This does not necessarily indicate bad model performance, as two files can be genuinely inseparable when the data in both files shares the same underlying structure. When clustering the reduced feature space of the data in the consecutive measurements mentioned above, clustering along all dimensions (performing no further feature reduction) produces an average accuracy of 62.5867%, using the first two principal components produces an average accuracy of 62.5768%, and using the QR decomposition method to select the most relevant features has an average accuracy of 58.2905%. The QR decomposition method has the lowest performance among the three methods in 79.84% of cases. The method only has a higher accuracy than the other two methods in 4.4% of cases, with marginal improvements. The reason for this might be a flawed implementation of the method, as it was not described in detail in [QZH+19]. In this case, the metho