Institut für Maschinelle Sprachverarbeitung Universität Stuttgart Pfaffenwaldring 5B D-70569 Stuttgart Machine Translation with Transformers Truong Thinh Nguyen Master Thesis Prüfer: Prof. Dr. Ngoc Thang Vu Betreuer: Pavel Denisov Beginn der Arbeit: 01.02.2019 Ende der Arbeit: 01.09.2019 Abstract The Transformer translation model (Vaswani et al., 2017), which relies on self- attention mechanisms, has achieved state-of-the-art performance in recent neural machine translation (NMT) tasks. Although the Recurrent Neural Network (RNN) is one of the most powerful and useful architectures for transforming one sequence into another one, the Transformer model does not employ any RNN. This work aims to investigate the performance of the Transformer model compared to differ- ent kinds of RNN model in a variety of difficulty levels of NMT problems. Acknowledgements I would like to express my deep gratitude to Professor Ngoc Thang Vu and Pavel Denisov - my supervisors - for their patient guidance, encouragement and valuable advice they have provided throughout my enrollment as a master student. I am particularly grateful for the assistance given by Pavel Denisov, who helped me a lot with installing the speech processing toolkit. Special thanks to Hung Ngo, Vien Ngo, Khiem Nguyen, Kim Anh Nguyen for sharing your orientations and experiences in the lunchtimes we enjoyed together at Mensa. I would also like to extend my thanks to Maximilian Schmidt, Dirk Väth, Tuan Pham and Ha Nguyen - my dear friends - for making my study complete in joy and warmth. Last but not least, I wish to thank my family members for their unconditional love and support. Contents List of Figures 5 List of Tables 6 1 Introduction 7 2 Background 9 2.1 Neural Machine Translation . . . . . . . . . . . . . . . . . . . . . . 9 2.2 RNN, LSTM and BLSTM . . . . . . . . . . . . . . . . . . . . . . . 10 2.3 Transformer Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3.1 Self-Attention Mechanism . . . . . . . . . . . . . . . . . . . 13 2.3.2 Multi-Head Attention . . . . . . . . . . . . . . . . . . . . . . 15 2.3.3 Positional Encoding . . . . . . . . . . . . . . . . . . . . . . . 16 2.3.4 The Overall Model Architecture . . . . . . . . . . . . . . . . 18 2.4 Automatic Speech Recognition . . . . . . . . . . . . . . . . . . . . . 20 3 Multilingual Named Entity Transliteration 22 3.1 Experiment Setups . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.1.1 Training Data . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.1.2 LSTM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.1.3 Transformer . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.2 Evaluation Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3 4 Neural Machine Translation with Seq2Seq 32 4.1 Experiment Setups . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.1.1 Training Data . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.1.2 LSTM and BLSTM . . . . . . . . . . . . . . . . . . . . . . . 32 4.1.3 Transformer . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.2 Evaluation Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . 36 5 Speech Translation 37 5.1 Experiment Setups . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 5.1.1 Training Data . . . . . . . . . . . . . . . . . . . . . . . . . . 38 5.1.2 ASR . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . 39 5.1.3 BLSTM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 5.1.4 Transformer . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 5.2 Evaluation Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 5.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . 43 6 Analysis 46 6.1 Multilingual Named Entity Transliteration . . . . . . . . . . . . . . 46 6.2 Neural Machine Translation with Seq2Seq . . . . . . . . . . . . . . 47 6.3 Speech Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 7 Conclusion and Future works 49 Bibliography 50 4 List of Figures 1 Encoder - Decoder structure translating the English sentence “I love you” to the German sentence “ich liebe dich” . . . . . . . . . . . . 7 2 Recurrent Neural Networks with loop (Left) and its unfolded rep- resentation (Right) . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 3 The control flow of a standard RNN. . . . . . . . . . . . . . . . . . 11 4 The control flow of an LSTM. . . . . . . . . . . . . . . . . . . . . . 12 5 Idea of Self-attention. . . . . . . . . . . . . . . . . . . . . . . . . . 13 6 Self-attention layer. . . . . . . . . . . . . . . . . . . . . . . . . . . 14 7 A self-attention layer predicts at time step t . . . . . . . . . . . . . 15 8 Multi-Head Attention consists of h attention layers running in par- allel. (Vaswani et al., 2017) . . . . . . . . . . . . . . . . . . . . . . 16 9 Waveform representation of a sinusoid . . . . . . . . . . . . . . . . 17 10 Positional encoding with an embedding size of 4 . . . . . . . . . . 18 11 Transformer model architecture with a single layer of Encoder (left) and Decoder (right) (Vaswani et al., 2017) . . . . . . . . . . . . . . 19 12 Flow of standard recipes in ESPnet (Watanabe et al., 2018) . . . . 21 13 Transliteration examples in four language pairs. (Karimi et al., 2011a) 22 14 Normalized Edit Distance reported by Irvine et al. (2010) in Translit- erating from all languages paper. . . . . . . . . . . . . . . . . . . . 30 15 Validation accuracy of LSTM, BLSTM, Transformer models during training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 16 Traditional pipeline of Speech Translation . . . . . . . . . . . . . . 37 5 List of Tables 1 Languages of interest and the number of person names paired with English. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2 Batch size for each source language in Named Entity Transliteration task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3 Average Normalized Edit Distance of LSTM models in Named En- tity Transliteration task . . . . . . . . . . . . . . . . . . . . . . . . 28 4 Average Normalized Edit Distance of Transformer models in Named Entity Transliteration task . . . . . . . . . . . . . . . . . . . . . . . 29 5 Comparison between the results from TFAL (Irvine et al., 2010), the LSTM models and Transformer models in Named Entity Translit- eration task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 6 BLEU scores of LSTM, BLSTM, Transformer models on test set newstest2014 of MT task . . . . . . . . . . . . . . . . . . . . . . . 36 7 Example of our ASR system transcriptions and IMS-Speech tran- scriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 8 Word error rates (WER) of our ASR system and IMS-Speech on the TED-LIUM 2 dataset . . . . . . . . . . . . . . . . . . . . . . . 
44 9 BLEU scores of the NMT systems including the Transformer model and the BLSTM model . . . 44 10 Example of our NMT models on an ASR-like version of newstest2014 . . . 44 11 BLEU scores of Speech Translation systems . . . 45 12 Example of transliterating a Russian named entity to English . . . 46 13 Example of our NMT models on the newstest2014 test set . . . 47 14 References of an example English speech source and German text target sentence and the output of the different steps of our cascaded speech translation system . . . 48

1 Introduction

With the power of deep learning, Neural Machine Translation (NMT) has become the dominant approach to machine translation (MT), and the state-of-the-art techniques for NMT have changed significantly in recent years. Most earlier NMT models are Sequence-to-Sequence (Seq2Seq) systems that rely heavily on an architecture composed of two recurrent neural networks (RNNs), an Encoder and a Decoder. However, RNNs process sequences word by word, which prevents parallelization and makes it hard to learn long-term dependencies within the input and output sequences (Kolen and Kremer, 2001). The Transformer model therefore replaces RNNs with self-attention layers to avoid these problems.

Figure 1: Encoder-Decoder structure translating the English sentence “I love you” to the German sentence “ich liebe dich”

Vaswani et al. (2017) show that the Transformer model outperforms both Convolutional Neural Network (CNN) and RNN models on the WMT14 (https://www.statmt.org/wmt14/translation-task.html) English-to-German and English-to-French translation tasks. To determine whether the Transformer can handle other machine translation problems, this work performs several MT experiments of increasing complexity. The following three Machine Translation tasks are carried out in this work: 1. Named entity transliteration from 13 different languages to English. 2. Text translation from English to German. 3. Speech translation from English speech to German text. In each of these tasks, we compare the effectiveness of the attention-based Transformer with recurrent architectures such as the long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) and the bidirectional LSTM (BLSTM; Graves and Schmidhuber (2005)). We observe that the Transformer usually outperforms the recurrent LSTM-based variants on these MT tasks.

2 Background

2.1 Neural Machine Translation

The machine translation of text or speech from one language to another is one of the most popular and challenging goals for computers. Conventional machine translation systems often use rules, usually created by linguists at the semantic, syntactic or morphological level. The key weakness of rule-based translation systems is that they require large sets of rules, and handling rule interactions in the presence of ambiguity or idiomatic expressions is difficult. With the rise of deep learning, Neural Machine Translation, proposed by Kalchbrenner and Blunsom (2013), Cho et al. (2014) and Sutskever et al. (2014), has the potential to overcome many limitations of the classical rule-based machine translation approach. Neural machine translation uses artificial neural network (ANN) models to learn a statistical model for machine translation. The phrase-based translation systems (e.g.
Marcu and Wong (2002), Koehn et al. (2003), Setiawan et al. (2005)) require a pipeline of specialized components such as a language model, a translation model, and a reordering model. The structure of NMT models is simpler than that of phrase-based models. NMT aims at building and training a single, large ANN that can be tuned to perform language translation effectively (Bahdanau et al., 2014). NMT directly learns the mapping from an input source language to its associated output target language in an end-to-end fashion (Wu et al., 2016).

The architecture of NMT models often consists of an encoder and a decoder (Figure 1). First, each word of the input sentence is fed into the encoder, which encodes the source sentence into an internal fixed-length representation called the context vector. This context vector captures the meaning of the sentence. Second, the decoder decodes the fixed-length context vector and predicts the output sequence. While some Encoder-Decoder models used an LSTM-based approach (e.g. Sutskever et al. (2014), Luong et al. (2015b)), others (e.g. Luong et al. (2015a), Vaswani et al. (2017), Galassi et al. (2019)) explored the use of attention-based architectures for neural machine translation.

2.2 RNN, LSTM and BLSTM

In MT tasks, if we want to predict the next word in a sentence, it helps to know which words come before it. The Recurrent Neural Network (RNN) is designed to make use of such sequential information. The output yt at time step t depends not only on the present input xt but also on the entire history of inputs x0, x1, ..., xt−1 from previous steps. In order to remember the past, the RNN introduces hidden states that act as the memory of the network. The hidden state ht at time step t captures information from all previous time steps. It is computed from the input at the current time step xt and the previous hidden state ht−1, through a single tanh layer as the activation function.

Figure 2: Recurrent Neural Networks with loop (Left) and its unfolded representation (Right)

In practice, however, RNNs have difficulty modelling long-term dependencies: information slowly vanishes as the number of time steps between relevant inputs grows. One of the most popular solutions to this drawback is the Long Short-Term Memory (LSTM) network. The LSTM is a special kind of RNN that uses a gating mechanism to control what is added to or removed from the memory of the network.

Figure 3: The control flow of a standard RNN.

Figure 3 and Figure 4 show that the control flow of an LSTM network is a chain-like structure which is almost identical to that of a standard RNN. In addition to the hidden state, each LSTM unit has a cell state to store memory. A gate in the LSTM is a component that selectively passes information into the cell state. It consists of a neural network layer with the sigmoid activation function and an element-wise multiplication operation. A standard LSTM has three gates that learn what information should be added or deleted during training: the forget gate, the input gate, and the output gate. In short, the forget gate controls how much information from prior steps should be kept. The input gate determines how much new relevant information from the current step is added. Lastly, the output gate decides which part of the current cell state forms the next hidden state.

Although the LSTM is capable of learning long-term dependencies, it still has limitations since its output is based only on the current input and previous states.
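To make the gate descriptions above concrete, the following minimal sketch implements a single LSTM step in Python with NumPy. The weight and bias names (W_f, W_i, W_o, W_c, b_*) and the concatenated-input parameterization are illustrative assumptions for this example, not the exact layout of any particular toolkit.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: returns the new hidden state and cell state.

    params is a dict of weight matrices W_* (acting on [h_prev; x_t]) and
    bias vectors b_*; the names are illustrative, not tied to a toolkit.
    """
    z = np.concatenate([h_prev, x_t])                     # combined input to all gates
    f = sigmoid(params["W_f"] @ z + params["b_f"])        # forget gate: what to keep from c_prev
    i = sigmoid(params["W_i"] @ z + params["b_i"])        # input gate: how much new content to add
    o = sigmoid(params["W_o"] @ z + params["b_o"])        # output gate: what to expose as h_t
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])  # candidate cell content
    c_t = f * c_prev + i * c_tilde                        # new cell state
    h_t = o * np.tanh(c_t)                                # new hidden state
    return h_t, c_t
```

In practice h_prev and c_prev are initialized with zeros and the step is applied once per token of the input sequence.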
The bi-directional Long Short-Term Memory (BLSTM) (Graves and Schmidhuber, 2005) can process the input sequence in both the forward and the backward direction. Therefore, input information from the past and the future of a point in a given sequence can be combined to compute the output sequence.

Figure 4: The control flow of an LSTM. (Figure 3 image credit: Christopher Olah, http://colah.github.io/posts/2015-08-Understanding-LSTMs/; Figure 4 image credit: Purnasai Gudikandula, https://mc.ai/recurrent-neural-networks-and-lstm-explained/)

2.3 Transformer Model

Before Vaswani et al. (2017) introduced the Transformer, RNNs were the most popular and powerful architecture for the Encoder-Decoder structure in NMT. Modern, fast computing devices such as Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs) rely on parallel computing. However, the sequential nature of RNNs forces the computation to process sentences word by word, which makes RNNs poorly suited to parallelization. To overcome this shortcoming, the Transformer employs a self-attention mechanism that allows the encoder and decoder to take every word of the entire input sequence into account. The Transformer encodes each position, applies self-attention in both the encoder and the decoder, and extends the idea of self-attention by computing multi-head attention.

2.3.1 Self-Attention Mechanism

Self-attention (intra-attention) in NMT is an attention mechanism that represents relationships between words in a sentence.

Figure 5: Idea of Self-attention. (Image credit: Ashish Vaswani and Anna Huang, http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture14-transformers.pdf)

Figure 5 shows an example of the self-attention mechanism. The model looks at the other words in the input sequence - “I”, “kicked”, “ball” - to better understand the word “kicked”, answering the three questions “Who”, “Did what” and “To whom”, respectively. The attention mechanism used by Vaswani et al. (2017) is scaled dot-product attention, described by the following equation:

(1) \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V

where Q is a matrix that contains the set of queries packed together, and K and V are the matrices of keys and values. In encoder-decoder attention, the queries usually come from the hidden states of the decoder, while the keys and values come from the hidden states of the encoder. The softmax-normalized dot products between queries and keys yield attention weights that determine how much each value contributes to the output.

Figure 6: Self-attention layer.

On the encoder side, a better representation of the input xt at time step t is generated by self-attention over all inputs x1, x2, ..., xn, where n is the sequence length. Because this can be done for all input positions in parallel, the Transformer is much better suited to parallelization than RNNs. Moreover, a self-attention layer connects all positions with O(1) sequential operations, compared to the O(n) sequential operations of an RNN.

On the decoder side, at time step t the current input xt serves as the query, while all previous positions x1, x2, ..., xt−1 provide the keys and values of the self-attention layer.

Figure 7: A self-attention layer predicts at time step t

2.3.2 Multi-Head Attention

Instead of a single attention-weighted sum of the values, Multi-Head Attention computes multiple attention-weighted sums to capture various aspects of the input. By concatenating several independent attention heads, we can obtain information from different sub-spaces.
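As a bridge between the single-head formulation in Equation (1) and the multi-head variant described next, the sketch below computes scaled dot-product attention with NumPy. The shapes (n query/key/value vectors of dimension d_k or d_v) and the optional boolean mask argument are assumptions made for this illustration; this is not code from Vaswani et al. (2017) or OpenNMT-py.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Equation (1): softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v). The optional boolean mask of
    shape (n_q, n_k) hides positions, as the decoder's masked attention does.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # pairwise query-key similarities
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # masked positions get ~zero weight
    weights = softmax(scores, axis=-1)         # attention weights, each row sums to 1
    return weights @ V                         # attention-weighted sum of the values
```

Multi-head attention applies this operation h times to different learned projections of Q, K and V and concatenates the results, as described below.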
To learn diverse representations, each head applies its own linear transformations to the input representation to obtain queries, keys, and values. The scaled dot-product attention is then computed h times in parallel, which is what makes the attention multi-headed. The outputs are concatenated, and finally a single linear transformation is applied, as shown in Figure 8. Multi-Head Self-Attention can learn related information from different representation sub-spaces because each set of query, key and value projections is initialized independently.

Figure 8: Multi-Head Attention consists of h attention layers running in parallel. (Vaswani et al., 2017)

2.3.3 Positional Encoding

In the case of RNNs, the words are fed to the model sequentially, so the model is implicitly aware of token order. Multi-head attention, however, computes the output for each item of the sequence independently, with no notion of word order, and modelling a sequence without any information about order or position is insufficient. To account for the order of the words in the input sequence, the Transformer model adds a vector called the Positional Encoding to each input embedding. The positional encoding of the Transformer model is computed with sine and cosine functions of different frequencies:

(2) PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)

(3) PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)

where i is the dimension index, pos is the position, and d_{model} is the dimension of the input embeddings. Each dimension of the positional encoding forms a sinusoid, as illustrated in Figure 9, which allows the model to generalize to longer sequence lengths. PE_{pos+k} can be computed as a linear function of PE_{pos} for an offset k, so the relative position between different embeddings can be inferred at low cost.

Figure 9: Waveform representation of a sinusoid

Figure 10 shows an example of positional encoding for the input sentence “I love you”. Each input word is first turned into a vector of size 4 using an embedding algorithm. Positional encoding is then applied to obtain a corresponding output vector of the same size.

Figure 10: Positional encoding with an embedding size of 4

2.3.4 The Overall Model Architecture

The Transformer model with its encoder and decoder components is illustrated in Figure 11. Both the Encoder and the Decoder are composed of identical layers stacked on top of each other Nx times, and the encoder stack and the decoder stack use the same value of Nx.

Encoder: The encoder block is a stack of Nx identical layers. Each layer has a multi-head self-attention sub-layer followed by a position-wise fully connected feed-forward network sub-layer.

Decoder: The decoder block is also a stack of Nx identical layers. In addition to the two sub-layers of each encoder layer, the decoder has an extra Masked Multi-Head Attention sub-layer that prevents the attention from looking into the future.

Figure 11: Transformer model architecture with a single layer of Encoder (left) and Decoder (right) (Vaswani et al., 2017)

2.4 Automatic Speech Recognition

Automatic speech recognition (ASR) is a standalone, machine-based process of decoding speech into a textual transcription (Lai and Yankelovich, 2003). Automatic speech recognition has been a field of research since the 1950s. Recently, we have witnessed a progressive improvement of ASR technologies (Yu and Deng, 2015), (Ravanelli, 2017), especially with the adoption of deep learning (Hinton et al., 2012).
To build a recognition system, several potential choices of open-source tool- kits are available: HTK written in C by Young et al. (2002), Sphinx-4 written in Java by Walker et al. (2004), Julius written in C by Lee et al. (2001), RASR written in C++ by Rybach et al. (2011). The most popular ASR tool-kits are built on end-to-end deep learning such as Deep Speech 2 Amodei et al. (2015), ESPnet Watanabe et al. (2018), Kaldi Povey et al. (2011). This thesis uses ESPnet for the speech recognition task. ESPnet is developed based on Chainer (Tokui and Oono, 2015) and PyTorch (Paszke et al., 2017). For data processing, feature extraction/format, ESPnet also follows the style of Kaldi ASR toolkit, making it convenient to use existing Kaldi recipes. The standard recipe of ESPnet does not consist of complicated tasks such as lexicon preparation, finite state transducer compilation, alignment, Gaussian mixture modeling, and lattice generation. In total, there are 6 stages in an ESPnet standard recipe, as illustrated in Figure 12: • Data preparation: Using the Kaldi data preparation script. • Feature extraction: Using the Kaldi feature extraction. This stage extracts log Mel feature with 80 dimensions combined with the pitch feature. • Data preparation for ESPnet: Converting all the information about tran- scriptions, speaker and language IDs, and input and output lengths into one JSON file. • Language model training (optional): Training a character-based RNN Language Model. 20 • End-to-end ASR training: Training a hybrid CTC (Graves et al., 2006) and attention-based encoder-decoder model (Watanabe et al., 2017). • Recognition and scoring: Accomplishing speech recognition using previ- ously trained RNN Language Model and end-to-end ASR model. Figure 12: Flow of standard recipes in ESPnet (Watanabe et al., 2018) 21 3 Multilingual Named Entity Transliteration Named entity transliteration (NET) is an important sub-task in machine trans- lation (MT). It is the process of taking as input a named entity from one source language then generates a named entity in another language while maintaining their pronunciation (Knight and Graehl (1997), Karimi et al. (2011b)). For exam- ple, transliteration of the Greek name Mελβoυρνη to English is Melbourne. Figure 13: Transliteration examples in four language pairs. (Karimi et al., 2011a) Work in transliteration can be classified into two categories: generative translit- eration and transliteration discovery. Transliteration discovery aims at selecting the best candidate for a query name, by discovering already transliterated pairs of words in different languages (Al-Onaizan and Knight (2002), Klementiev and Roth (2006), Kuo et al. (2009)). This thesis focuses on the generative translitera- tion where the task is to directly map symbols of a source word to a target word ( Li et al. (2004), Jiampojamarn et al. (2009), Finch et al. (2015), Upadhyay et al. (2018)). 22 Machine transliteration has emerged for around two decades and most gen- erative transliteration systems are data-driven (Irvine et al. (2010),Ekbal et al. (2006), Sajjad et al. (2011), Shao and Nivre (2016)). Recently, neural learning sys- tems have become good alternatives to traditional data-driven approaches. 
Since the LSTM models and the Transformer model have achieved remarkable success in a wide range of natural language processing tasks, this thesis compares the neural encoder-decoder method with LSTM (Sutskever et al., 2014) to the Transformer model on the Named Entity Transliteration task. In this work, we use OpenNMT-py (Klein et al., 2017), an open-source NMT system, to train an LSTM model and several Transformer models.

3.1 Experiment Setups

3.1.1 Training Data

Name pairs based on Wikipedia have been widely used in transliteration work (e.g., Rosca and Breuel (2016), Pasternack and Roth (2009)). Most Wikipedia pages are linked to their counterparts in other languages, which makes it easy to collect labels that describe the same named entity in different languages. This thesis uses the multilingual dataset mined from Wikipedia by Irvine et al. (2010), which covers 100 languages whose name pages overlap with English. For all experiments, 13 languages of interest are chosen in the same way as in Irvine et al. (2010) and transliterated into English. For each language, the dataset is split into 80% for training, 10% for validation and 10% for testing.

Language                    Number of name pairs   Training   Validation   Test
Russian (ru)                229680                 183744     22968        22968
Farsi (fa)                  92295                  73837      9229         9229
Ukrainian (uk)              75054                  60044      7505         7505
Arabic (ar)                 59356                  47486      5935         5935
Korean (ko)                 56865                  45493      5686         5686
Hebrew (he)                 51887                  41511      5188         5188
Bulgarian (bg)              41767                  33415      4176         4176
Serbian (sr)                28198                  22560      2819         2819
Greek (el)                  24755                  19805      2475         2475
Belarusian (be)             19807                  15847      1980         1980
Georgian (ka)               16466                  13174      1646         1646
Macedonian (mk)             10227                  8183       1022         1022
Old-Belarusian (be-x-old)   9752                   7802       975          975

Table 1: Languages of interest and the number of person names paired with English.

3.1.2 LSTM

For every language, the default LSTM model from OpenNMT-py is trained for 100,000 steps. The most important hyper-parameters of this default model are:
• Number of layers in the encoder = 2
• Number of layers in the decoder = 2
• Number of hidden units in the encoder = 500
• Number of hidden units in the decoder = 500
• Dropout = 0.3
• Learning rate = 1.0
• Decay learning rate = 0.5
• Learning rate warm-up steps = 4000
• Maximum prediction length = 100
• Word embedding size = 500
• Batch size = 64

3.1.3 Transformer

OpenNMT-py provides a set of hyper-parameters for its default Transformer model.
Those hyper-parameters are also used in our baseline setting:
• Number of layers in the encoder/decoder Nx = 6
• Word embedding size dmodel = 512
• Size of hidden Transformer feed-forward dff = 2048
• Number of heads for Transformer self-attention h = 8
• Number of training steps = 100,000
• Maximum batches of words in a sequence to run the generator in parallel = 2
• Dropout = 0.1
• Batch size = 4096
• Adam optimizer: β1 = 0.9, β2 = 0.998
• Learning rate = 2.0, decay method = noam, learning rate warm-up steps = 8000
• Maximum prediction length = 100
• Label smoothing value ε = 0.1

Due to the difference in the number of name pairs between English and the other source languages, different batch sizes are used during training:

Language                    Batch size
Russian (ru)                4096
Farsi (fa)                  1024
Ukrainian (uk)              1024
Arabic (ar)                 1024
Korean (ko)                 1024
Hebrew (he)                 1024
Bulgarian (bg)              512
Serbian (sr)                512
Greek (el)                  512
Belarusian (be)             256
Georgian (ka)               256
Macedonian (mk)             128
Old-Belarusian (be-x-old)   128

Table 2: Batch size for each source language in the Named Entity Transliteration task

The Transformer model is highly sensitive to hyper-parameters, as described by Popel and Bojar (2018). Therefore, several Transformer models are evaluated, including the default OpenNMT-py Transformer model (baseline) and further models called A to E, as described in Table 4. Hyper-parameters such as the number of layers in the encoder/decoder, the word embedding size, the feed-forward hidden size and the number of attention heads are varied in the setting of each model.

3.2 Evaluation Metric

For evaluation, the Normalized Edit Distance metric, a normalized version of the Levenshtein Edit Distance, is used to compute the similarity between a pair of names. The Levenshtein Edit Distance is the minimum number of single-character edit operations required to change one word into another, where the edit operations are insertions, deletions, and substitutions. Given two strings a and b of length |a| and |b| respectively, the Levenshtein Edit Distance lev_{a,b} between a and b is defined by:

(4) \mathrm{lev}_{a,b}(i, j) =
\begin{cases}
\max(i, j) & \text{if } \min(i, j) = 0,\\
\min\!\left\{\; \mathrm{lev}_{a,b}(i-1, j) + 1,\;\; \mathrm{lev}_{a,b}(i, j-1) + 1,\;\; \mathrm{lev}_{a,b}(i-1, j-1) + \mathbf{1}_{(a_i \neq b_j)} \;\right\} & \text{otherwise,}
\end{cases}

where lev_{a,b}(i, j) is the distance between the first i characters of a and the first j characters of b.

Example: The Levenshtein Edit Distance between “Mannhaton” and “Manhattan” is 3, as we need three edit operations to transform the first string into the second, and there is no way to do it with fewer edits:
• “Mannhaton” → “Manhaton”: one deletion of the first “n”.
• “Manhaton” → “Manhatton”: one insertion of “t” after the substring “Manha”.
• “Manhatton” → “Manhattan”: one substitution of “a” for “o”.

Normalized edit distance: To compute the Normalized Edit Distance, we first divide the Levenshtein Edit Distance by the length of the reference string and then multiply the result by 100.
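A minimal sketch of this metric in Python is given below. The function names and the choice of the reference (target) string as the normalizer follow the description above; the dynamic-programming layout is a standard textbook implementation under those assumptions, not the exact script used in this thesis.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b (Equation 4)."""
    # dp[i][j] holds lev_{a,b}(i, j) for the first i chars of a and j chars of b
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i                      # delete all i characters of a
    for j in range(len(b) + 1):
        dp[0][j] = j                      # insert all j characters of b
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            substitution_cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,                      # deletion
                           dp[i][j - 1] + 1,                      # insertion
                           dp[i - 1][j - 1] + substitution_cost)  # substitution
    return dp[len(a)][len(b)]

def normalized_edit_distance(hypothesis: str, reference: str) -> float:
    """Levenshtein distance divided by the reference length, times 100."""
    return 100.0 * levenshtein(hypothesis, reference) / len(reference)

# Example from the text: three edits between "Mannhaton" and "Manhattan".
assert levenshtein("Mannhaton", "Manhattan") == 3
```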
3.3 Experimental Results

Language                    Average Edit Distance   Average Normalized Edit Distance
Russian (ru)                1.83                    9.93
Farsi (fa)                  0.78                    4.89
Ukrainian (uk)              1.08                    6.43
Arabic (ar)                 1.37                    8.62
Korean (ko)                 1.54                    10.24
Hebrew (he)                 0.75                    4.79
Bulgarian (bg)              1.1                     7.43
Serbian (sr)                0.92                    5.33
Greek (el)                  1.6                     8.86
Belarusian (be)             1.61                    9.95
Georgian (ka)               1.09                    7.5
Macedonian (mk)             2.14                    12.92
Old-Belarusian (be-x-old)   1.9                     11.92

Table 3: Average Normalized Edit Distance of LSTM models in the Named Entity Transliteration task

Models     Nx   dmodel   dff    h     ru      fa      uk      ar      ko      he      bg      sr      el      be      ka      mk      be-x-old
baseline   6    512      2048   8     4.92    7.33    6.42    8.76    12.32   4.65    10.36   8.15    13.01   50.34   16.02   87.71   163.24
(A)        6    256      2048   8     11.47   4.48    8.31    8.6     9.48    4.38    7.22    8.2     8.01    8.86    7.85    20.3    20.53
(B)        6    256      1024   8     10.44   4.53    8.05    8.26    11.67   4.49    6.81    7.9     7.13    8.92    16.36   8.62    9.91
(C)        6    256      1024   4     6.27    24.98   8.24    12.19   11.47   4.48    7.14    7.63    11.51   19.14   15.11   8.47    10.21
(D)        6    32       1024   8     28.02   42.16   31.89   21.75   21.98   45.67   30.45   25.12   16.37   39.27   41.48   31.45   32.47
(E)        4    128      512    4     22.28   46.58   24.54   42.71   21.47   41.44   20.77   15.52   22.39   26.1    20.51   18.18   19.85

Table 4: Average Normalized Edit Distance of Transformer models in the Named Entity Transliteration task. The columns Nx, dmodel, dff and h give the Transformer hyper-parameters; the remaining columns correspond to the source languages.

Figure 14: Normalized Edit Distance reported by Irvine et al. (2010) in the Transliterating from all languages paper.

Table 3 reports the average Normalized Edit Distances of the LSTM models for the 13 source languages transliterated into English. The average Normalized Edit Distances of the Transformer models are shown in Table 4. This work uses the same dataset as Irvine et al. (2010). In order to evaluate the quality of the LSTM models and the Transformer models in the Named Entity Transliteration task, the results from the Transliterating from all languages (TFAL) paper (Irvine et al., 2010) are also presented in Figure 14 for later comparison. In TFAL, the authors trained a statistical transliteration model based on the log-linear formulation of statistical machine translation. Finally, Table 5 compares the results from TFAL, the LSTM models and the Transformer models.

Language                    TFAL   LSTM    Transformer
Russian (ru)                13.8   9.93    4.92
Farsi (fa)                  24     4.89    4.48
Ukrainian (uk)              14.8   6.43    6.42
Arabic (ar)                 22     8.62    8.26
Korean (ko)                 21.5   10.24   9.48
Hebrew (he)                 24.7   4.79    4.38
Bulgarian (bg)              14.9   7.43    6.81
Serbian (sr)                16.2   5.33    7.63
Greek (el)                  15.8   8.86    7.13
Belarusian (be)             18.5   9.95    8.86
Georgian (ka)               14.2   7.5     7.85
Macedonian (mk)             14.9   12.92   8.47
Old-Belarusian (be-x-old)   19     11.92   9.91

Table 5: Comparison between the results from TFAL (Irvine et al., 2010), the LSTM models and the Transformer models in the Named Entity Transliteration task

4 Neural Machine Translation with Seq2Seq

4.1 Experiment Setups

4.1.1 Training Data

To evaluate the effectiveness of the LSTM model and the Transformer on translation tasks, we conduct experiments on the widely adopted WMT14 English → German benchmark. We use 4.5M parallel sentence pairs as the training set. newstest2014 (2737 sentences) is used as the test set, and validation is done on newstest2013 (3000 sentences) for each of the experimental setups. The tokenization tool from the OpenNMT Torch version (Klein et al., 2017) (http://opennmt.net/OpenNMT/tools/tokenization/) is used to convert raw sentences into sequences of tokens.
Unlike Vaswani et al. (2017), we do not encode the sentences with byte pair encoding (BPE) (Britz et al., 2017).

4.1.2 LSTM and BLSTM

The LSTM and BLSTM models share the same hyper-parameters for the English → German translation task:
• Number of layers in the encoder = 2
• Number of layers in the decoder = 2
• Number of hidden units in the encoder = 500
• Number of hidden units in the decoder = 500
• Dropout = 0.3
• Learning rate = 1.0
• Decay learning rate = 0.5
• Learning rate warm-up steps = 4000
• Maximum prediction length = 100
• Word embedding size = 500
• Batch size = 64
• Number of training steps = 200,000

4.1.3 Transformer

This thesis evaluates the Transformer model using the hyper-parameters suggested by OpenNMT, which Klein et al. (2017) confirmed are able to replicate the WMT14 results:
• Number of layers in the encoder/decoder = 6
• Word embedding size = 512
• Size of hidden transformer feed-forward = 2048
• Number of heads for transformer self-attention = 8
• Number of training steps = 200,000
• Maximum batches of words in a sequence to run the generator in parallel = 2
• Dropout = 0.1
• Batch size = 4096
• Adam optimizer: β1 = 0.9, β2 = 0.998
• Learning rate = 2.0, decay method = noam, learning rate warm-up steps = 8000
• Maximum prediction length = 100
• Label smoothing value ε = 0.1

This model is very similar to the base Transformer of Vaswani et al. (2017): all projection and multi-head attention layers consist of 512 units, followed by a feed-forward layer with 2048 units.

4.2 Evaluation Metric

We evaluate all Seq2Seq models using tokenized BLEU scores (Papineni et al., 2002). Experimental results have shown that BLEU is highly correlated with human judgments (Bojar et al., 2010). Basically, BLEU is an averaged percentage of n-gram matches (typically up to n-grams of length 4). The BLEU score between a reference (ref) sentence and a hypothesis (hyp) candidate is calculated from:

• Modified n-gram precision on the corpus:

(5) p_n = \frac{\sum_{C \in \{hyp\}} \sum_{n\text{-gram} \in C} \mathrm{Count}_{clip}(n\text{-gram})}{\sum_{C' \in \{hyp\}} \sum_{n\text{-gram}' \in C'} \mathrm{Count}(n\text{-gram}')}

where \mathrm{Count}_{clip} = \min(\mathrm{Count}, \mathrm{MaxRefCount}).

• Brevity penalty:

(6) BP = \begin{cases} 1 & \text{if } |hyp| > |ref|,\\ \exp\!\left(1 - \frac{|ref|}{|hyp|}\right) & \text{otherwise,} \end{cases}

where |ref| and |hyp| are the lengths of the reference and hypothesis sequences, respectively.

• BLEU score:

(7) \mathrm{BLEU} = BP \cdot \exp\!\left(\frac{1}{N} \sum_{n=1}^{N} \log p_n\right)

All reported BLEU scores are computed with the score.lua script from the OpenNMT Scorer tool (http://opennmt.net/OpenNMT/tools/scorer/) with N = 4.

4.3 Experimental Results

Figure 15: Validation accuracy of LSTM, BLSTM, Transformer models during training

Model         Test BLEU
LSTM          22.21
BLSTM         23.37
Transformer   25.17

Table 6: BLEU scores of the LSTM, BLSTM and Transformer models on the newstest2014 test set of the MT task

On the WMT 2014 English-to-German translation task, the Transformer model outperforms both the LSTM model and the BLSTM model. Table 6 summarizes the translation quality of the three architectures, Transformer, LSTM and BLSTM, on the newstest2014 test set.

5 Speech Translation

Speech Translation, or Speech-to-Text Translation, is a process that takes conversational speech in one language as input and outputs translated text in another language.
Traditionally, the cascaded approach uses a pipeline of two components connected in sequential order: an automatic speech recognition (ASR) system and a machine translation (MT) system (Post et al., 2013), (Kumar et al., 2014), (Kumar et al., 2015), (Sulubacak et al., 2018). The ASR system is responsible for converting spoken utterances of the source language into text transcriptions in the same language. The MT system then translates the source-language text into text in the target language.

Figure 16: Traditional pipeline of Speech Translation

Recently, end-to-end neural models, for example seq2seq models, have shown that they are powerful enough to solve the speech translation task (e.g. Berard et al. (2016), Weiss et al. (2017), Duong et al. (2016), Liu et al. (2019)). End-to-end models offer advantages in low-resource settings, where the recordings are in one language while their transcript is in another.

This part of the thesis addresses the translation of English audio into German text following the traditional pipeline. This approach not only benefits from the large text-only WMT14 training corpus but also makes it easy to evaluate the performance of the Transformer on the speech translation task. For the ASR module, ESPnet (Watanabe et al., 2018), an end-to-end speech processing toolkit, is used to perform speech recognition on the TED-LIUM speech recognition corpus version 2 (Rousseau et al., 2012). Next, several seq2seq text machine translation models are trained to complete the pipeline.

5.1 Experiment Setups

5.1.1 Training Data

For the speech recognition part, the TED-LIUM corpus version 2 is used for training. This data consists of more than 200 hours of TED talks (https://www.ted.com/talks) (Cettolo et al., 2013), which allows us to build an ASR system with high performance.

For the machine translation task, WMT14 is used once again, together with the text-only corpora provided by the IWSLT 2018 campaign (https://workshop2018.iwslt.org/). These datasets are in regular written language with a large amount of punctuation and special characters, so there is a mismatch between these data and the ASR output in the speech translation pipeline. To solve this problem, the standard text-based MT sentences are normalized using the NLTK toolkit (Loper and Bird, 2002) and transformed to all lowercase letters to reflect the ASR output.

For the speech translation module, the test sets from the 2013 and 2015 tasks (“tst2013” and “tst2015”) provided by the IWSLT organizers are used for testing the performance of the cascade model.

5.1.2 ASR

An automatic speech recognition system is trained with ESPnet using a VGG-BLSTMP encoder-decoder, combined with joint connectionist temporal classification (CTC) decoding and an RNN language model (LM). The following parameters are used during training:
1. Network architecture
• etype = vggblstmp # Encoder architecture type
• elayers = 6 # Number of encoder layers
• eunits = 320 # Number of encoder hidden units
• eprojs = 320 # Number of encoder projection units
• subsample = 1_2_2_1_1 # Encoder subsampling
• dlayers = 1 # Number of decoder layers
• dunits = 300 # Number of decoder hidden units
• atype = dot # Type of attention
• adim = 320 # Number of attention dimensions
• aconv_chans = 10 # Number of attention convolution channels
• aconv_filts = 100 # Number of attention convolution filters
• mtlalpha = 0.5 # Multitask learning coefficient
• batchsize = 30 # Batch size
• maxlen_in = 800 # Maximum input length for reducing batch size
• maxlen_out = 150 # Maximum output length for reducing batch size
• sortagrad = 0 # Feed samples type
• opt = adadelta # Optimizer
• epochs = 15 # Number of epochs
• patience = 3 # Patience for optimization

2. RNN language model
• lm_layers = 2 # Number of LM layers
• lm_units = 650 # Number of LM hidden units
• lm_opt = sgd # LM optimizer
• lm_sortagrad = 0 # LM feed samples type
• lm_batchsize = 1024 # LM batch size
• lm_epochs = 20 # Number of LM epochs
• lm_patience = 3 # LM patience for optimization
• lm_maxlen = 150 # LM maximum length for reducing lm_batchsize

3. Decoding parameters
• lm_weight = 1.0 # Language model weight
• beam_size = 20 # Beam size
• penalty = 0.0 # Penalty
• maxlenratio = 0.0 # Maximum length ratio
• minlenratio = 0.0 # Minimum length ratio
• ctc_weight = 0.3 # CTC weight
• recog_model = model.acc.best # Model used for decoding

5.1.3 BLSTM

Hyper-parameters for the BLSTM model in the English → German text translation task:
• Number of layers in the encoder = 2
• Number of layers in the decoder = 2
• Number of hidden units in the encoder = 500
• Number of hidden units in the decoder = 500
• Dropout = 0.3
• Learning rate = 1.0
• Decay learning rate = 0.5
• Learning rate warm-up steps = 4000
• Maximum prediction length = 100
• Word embedding size = 500
• Batch size = 64
• Number of training steps = 200,000

5.1.4 Transformer

Hyper-parameters for the Transformer model in the English → German text translation task:
• Number of layers in the encoder/decoder = 6
• Word embedding size = 512
• Size of hidden transformer feed-forward = 2048
• Number of heads for transformer self-attention = 8
• Number of training steps = 200,000
• Maximum batches of words in a sequence to run the generator in parallel = 2
• Dropout = 0.1
• Batch size = 4096
• Adam optimizer: β1 = 0.9, β2 = 0.998
• Learning rate = 2.0, decay method = noam, learning rate warm-up steps = 8000
• Maximum prediction length = 100
• Label smoothing value ε = 0.1

5.2 Evaluation Metric

To evaluate the performance of an automatic speech recognition system, the word error rate (WER) (Popović and Ney, 2007) is the most widely used metric. Basically, the WER is derived from the Levenshtein distance, working at the word level instead of the phoneme level. The WER between the hypothesis (hyp) and the reference (ref) sentences is calculated as:

(8) \mathrm{WER}_{hyp, ref} = \frac{S + I + D}{N}

where I is the number of insertions (inserting a word), D is the number of deletions, S is the number of substitutions, and N is the number of words in the reference. This thesis uses WER and the BLEU score to evaluate the effectiveness of the ASR system and the speech translation system, respectively.
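As an illustration of Equation (8), the sketch below computes WER from a word-level edit-distance alignment in Python. The whitespace tokenization and the function name are assumptions made for this example; it is not the scoring script actually used in the experiments.

```python
def word_error_rate(hypothesis: str, reference: str) -> float:
    """WER = (S + I + D) / N, computed from a word-level edit-distance alignment."""
    hyp, ref = hypothesis.split(), reference.split()
    # dp[i][j]: minimum number of edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                                # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                                # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)        # divide S + I + D by N

# Hypothetical example: one substitution out of four reference words -> WER 0.25
print(word_error_rate("i love you too", "i love you to"))
```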
5.3 Experimental Results

Utterance: AimeeMullins_2009P-0002881-0004026
Our ASR system: i had already finished editing the piece and i realized that i have never ones in my life looked up the words disabled to see what i’d find when we read the entry
IMS-Speech: i had already finished editing the piece and i realized that i had never once in my life looked up the word disabled to see what i’d find let me read you the entry
Ground truth: i’d already finished editing the piece and i realized that i had never once in my life looked up the word disabled to see what i’d find let me read you the entry

Table 7: Example of our ASR system transcriptions and IMS-Speech transcriptions

Firstly, the performance of the ASR systems is evaluated before conducting any machine translation. In Table 8 we report results on the TED-LIUM 2 development and test sets for our ASR system, which is built with ESPnet, and for IMS-Speech (Denisov and Vu, 2019), a web-based speech transcription tool for English and German. The speech recognition component of IMS-Speech is implemented with the ESPnet toolkit with the PyTorch backend and trained on multiple speech databases summing up to 2277 hours of transcribed English speech training data. Due to the large amount of training data, IMS-Speech can compete confidently with other ASR systems on various tasks and conditions. As expected, IMS-Speech achieves better results than our ASR system on the TED-LIUM 2 test set.

Set    Our ASR system   IMS-Speech
dev    19.4             -
test   18.2             7.3

Table 8: Word error rates (WER) of our ASR system and IMS-Speech on the TED-LIUM 2 dataset

Secondly, the efficiency of several NMT models on translating ASR-like English to German is shown in Table 9. Table 10 also gives an example of a translation from ASR-like English to German. Once again, the Transformer model outperforms the BLSTM model on this task.

Set                   BLSTM   Transformer
dev (newstest2013)    17.00   20.24
test (newstest2014)   16.66   20.40

Table 9: BLEU scores of the NMT systems including the Transformer model and the BLSTM model

Input sentence (English): orlando bloom and miranda kerr still love each other
Output sentence (German):
  BLSTM: orlando bloom und miranda kerr lieben immer noch
  Transformer: orlando bloom und miranda kerr lieben sich noch immer
  Reference: orlando bloom und miranda kerr lieben sich noch immer

Table 10: Example of our NMT models on an ASR-like version of newstest2014

The third and last evaluation concerns the performance of speech translation systems using different ASR and NMT components. The IWSLT organizers provide a baseline model (Zenkel et al., 2018) of the spoken language translation system (https://github.com/isl-mt/SLT.KIT). This baseline model includes two different speech recognition systems, a CTC-based system and an attention-based system. For machine translation, it uses an LSTM model trained with OpenNMT-py. To compare the performance of the speech translation systems, we use the test sets of the IWSLT evaluation campaigns: tst2013 and tst2015.

Testset   baseline (LSTM)   Our ASR system + BLSTM   Our ASR system + Transformer   IMS-Speech + BLSTM   IMS-Speech + Transformer
tst2013   13.73             12.52                    15.17                          16.02                19.19
tst2015   -                 12.80                    15.16                          15.91                19.42

Table 11: BLEU scores of Speech Translation systems

Clearly, Table 11 shows that the cascaded speech translation systems which use the Transformer model achieve higher performance than those which use the BLSTM model as their MT component.
6 Analysis

6.1 Multilingual Named Entity Transliteration

Since transliteration is a comparatively simple sub-task of Machine Translation, both the LSTM model and the Transformer model demonstrate the power of deep neural networks by easily outperforming the word-based n-gram model from Irvine et al. (2010).

Table 12: Example of transliterating a Russian named entity to English

In general, we see that the Transformer model achieves better (and sometimes merely comparable) results than the LSTM approach when transliterating named entities from a source language into English. Serbian (sr) and Georgian (ka) are the only cases in which the LSTM model (slightly) beats the Transformer approach. Since both Serbian and Georgian contain only small amounts of data, and Serbian belongs to the Slavic language group, we hypothesize that the Transformer models achieve high transliteration quality for Slavic languages into English when large datasets are available. For example, for large datasets such as Russian and Ukrainian, which are also Slavic languages, the Transformer models significantly surpass the performance of the LSTM model (an example is provided in Table 12) and of the baseline model. Another explanation is that the Transformer model is extremely sensitive to hyper-parameters, so the chosen hyper-parameters for Serbian and Georgian may simply not fit well. The Transformer models seem to outperform the LSTM models on the transliteration task if the hyper-parameters are carefully tuned.

6.2 Neural Machine Translation with Seq2Seq

Looking at the BLEU scores given in Table 6, it is evident that the Transformer performs better than both RNN-based model variants. The main factor that makes the Transformer model more effective than both the LSTM model and the BLSTM model is its multi-head self-attention mechanism. Although LSTMs were designed to handle the long-term dependency problem, they still do not perform well when sentences are very long. As the distance between the word at the current step and the words at previous steps increases, the probability of keeping the context decreases exponentially with that distance. Since the self-attention mechanism applies attention weights to all tokens in the input sequence, the Transformer model is able to easily capture long-term dependencies. Besides, the multi-head attention allows the model to capture diverse representations of the input.

Input sentence (English): Orlando Bloom and Miranda Kerr still love each other
Output sentence (German):
  LSTM: Orlando Bloom und Miranda Kerr immer noch gegenseitig.
  BLSTM: Orlando Bloom und Miranda Kerr lieben immer noch jede andere.
  Transformer: Orlando Bloom und Miranda Kerr lieben einander noch immer.
  Reference: Orlando Bloom und Miranda Kerr lieben sich noch immer

Table 13: Example of our NMT models on the newstest2014 test set

Although our Transformer model uses settings identical to the Transformer base model from Vaswani et al. (2017), its BLEU score is slightly lower than the one reported in Vaswani et al. (2017). This is because this thesis focuses on determining whether the Transformer model performs better on MT problems than other models. Hence, we did not carry out checkpoint averaging, Unicode normalization or tokenization using Moses (Koehn et al., 2007) and byte pair encoding.
47 6.3 Speech Translation On the grounds that the cascaded speech translation systems in this thesis are com- posed of two separate components: an ASR system and a seq2seq NMT model, the performances of the Transformer model and the BLSTM model can be compared obviously. The results in Table 11 demonstrate that the speech translation sys- tems using the Transformer model as their MT component outperform the speech translation systems using the BLSTM model. The reason for these results is the cascaded speech translation system in which the output of a speech recognizer is given and translated using the seq2seq Transformer model as its MT component naturally inherits the advantages of the seq2seq Transformer model as discussed in Chapter 5. Step Text Reference EN a mobile phone can change your life Our ASR system EN mobile phone can change your life Machine Translation DE BLSTM handy kann ihr leben ändern Transformer mobiltelefon kann ihr leben verändern Reference DE ein handy kann ein leben verändern Table 14: References of an example English speech source and German text target sentence and the output of the different steps of our cascaded speech translation system Despite the fact that our speech translation system using the Transformer model performs better than the others, its performance is still low. Since the speech translation architecture consists of an ASR component followed by an NMT component, its performance consequently may not be high when the ASR system gets poor quality. Hence, the performance of the speech translation system can be significantly improved by increasing the efficiency of the ASR component. Table 11 also provides systematic evidence in which the cascaded speech translation system using a better ASR system such as IMS-Speech outperforms our speech translation system. 48 7 Conclusion and Future works In this work, we demonstrate how the recently introduced Transformer architec- ture performs in several Machine Translation problems range from easiness to difficultness: transliteration, text translation and speech translation. Our analysis compared the Transformer model with other popular RNN-based models such as the LSTM and the bidirectional-LSTM. To conclude, the Transformer architecture can perform different tasks in Neural Machine Translation with great efficiency, compared to RNN-based models. A more in-depth analysis of these NMT tasks is going to be carried out in future work. In addition, the performance of these problems using the Transformer architecture need to be improved by applying novel powerful techniques. 49 Bibliography Yaser Al-Onaizan and Kevin Knight. Translating named entities using monolingual and bilingual resources. pages 400–408, 01 2002. Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse H. Engel, Linxi Fan, Christopher Fougner, Tony Han, Awni Y. Hannun, Billy Jun, Patrick LeGresley, Libby Lin, Sharan Narang, Andrew Y. Ng, Sherjil Ozair, Ryan Prenger, Jonathan Raiman, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Yi Wang, Zhiqian Wang, Chong Wang, Bo Xiao, Dani Yogatama, Jun Zhan, and Zhenyao Zhu. Deep speech 2: End-to-end speech recognition in english and mandarin. CoRR, abs/1512.02595, 2015. URL http: //arxiv.org/abs/1512.02595. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine trans- lation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014. 
Alexandre Berard, Olivier Pietquin, Christophe Servan, and Laurent Besacier. Lis- ten and translate: A proof of concept for end-to-end speech-to-text translation. CoRR, abs/1612.01744, 2016. URL http://arxiv.org/abs/1612.01744. Ondřej Bojar, Kamil Kos, and David Mareček. Tackling sparse data issue in ma- chine translation evaluation. In Proceedings of the ACL 2010 Conference Short Papers, pages 86–91, Uppsala, Sweden, July 2010. Association for Computa- tional Linguistics. URL https://www.aclweb.org/anthology/P10-2016. Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc V. Le. Massive ex- ploration of neural machine translation architectures. CoRR, abs/1703.03906, 2017. URL http://arxiv.org/abs/1703.03906. Mauro Cettolo, Jan Niehues, Sebastian Stker, Luisa Bentivogli, and Marcello Fed- erico. Report on the 10th iwslt evaluation campaign. pages 29–38, 01 2013. 50 KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, abs/1409.1259, 2014. URL http://arxiv.org/abs/1409.1259. Pavel Denisov and Ngoc Thang Vu. Ims-speech: A speech to text tool. In Pe- ter Birkholz and Simon Stone, editors, Studientexte zur Sprachkommunikation: Elektronische Sprachsignalverarbeitung 2019, pages 170–177. TUDpress, Dres- den, 2019. ISBN 978-3-959081-57-3. Long Duong, Antonios Anastasopoulos, David Chiang, Steven Bird, and Trevor Cohn. An attentional model for speech translation without transcription. In Pro- ceedings of the 2016 Conference of the North American Chapter of the Associa- tion for Computational Linguistics: Human Language Technologies, pages 949– 959, San Diego, California, June 2016. Association for Computational Linguis- tics. doi: 10.18653/v1/N16-1109. URL https://www.aclweb.org/anthology/ N16-1109. Asif Ekbal, Sudip Kumar Naskar, and Sivaji Bandyopadhyay. A modified joint source-channel model for transliteration. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 191–198, Sydney, Australia, July 2006. Association for Computational Linguistics. URL https://www.aclweb. org/anthology/P06-2025. Andrew Finch, Lemao Liu, Xiaolin Wang, and Eiichiro Sumita. Neural network transduction models in transliteration generation. In Proceedings of the Fifth Named Entity Workshop, pages 61–66, Beijing, China, July 2015. Association for Computational Linguistics. doi: 10.18653/v1/W15-3909. URL https:// www.aclweb.org/anthology/W15-3909. Andrea Galassi, Marco Lippi, and Paolo Torroni. Attention, please! a criti- cal review of neural attention models in natural language processing. ArXiv, abs/1902.02181, 2019. A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirec- tional lstm networks. In Proceedings. 2005 IEEE International Joint Conference 51 on Neural Networks, 2005., volume 4, pages 2047–2052 vol. 4, July 2005. doi: 10.1109/IJCNN.2005.1556215. Alex Graves and Jrgen Schmidhuber. Framewise phoneme classification with bidi- rectional lstm and other neural network architectures. Neural networks : the official journal of the International Neural Network Society, 18:602–10, 07 2005. doi: 10.1016/j.neunet.2005.06.042. Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Con- nectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Confer- ence on Machine Learning, ICML ’06, pages 369–376, New York, NY, USA, 2006. ACM. ISBN 1-59593-383-2. doi: 10.1145/1143844.1143891. 
URL http://doi.acm.org/10.1145/1143844.1143891.

G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, November 2012. ISSN 1053-5888. doi: 10.1109/MSP.2012.2205597.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Ann Irvine, Chris Callison-Burch, and Alexandre Klementiev. Transliterating from all languages. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA), 2010.

Sittichai Jiampojamarn, Aditya Bhargava, Qing Dou, Kenneth Dwyer, and Grzegorz Kondrak. DirecTL: a language independent approach to transliteration. In Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration (NEWS 2009), pages 28–31, Suntec, Singapore, August 2009. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/W09-3504.

Nal Kalchbrenner and Phil Blunsom. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, October 2013. Association for Computational Linguistics.

Sarvnaz Karimi, Falk Scholer, and Andrew Turpin. Machine transliteration survey. ACM Comput. Surv., 43:17, April 2011a. doi: 10.1145/1922649.1922654.

Sarvnaz Karimi, Falk Scholer, and Andrew Turpin. Machine transliteration survey. ACM Comput. Surv., 43(3):17:1–17:46, April 2011b. ISSN 0360-0300. doi: 10.1145/1922649.1922654. URL http://doi.acm.org/10.1145/1922649.1922654.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. OpenNMT: Open-source toolkit for neural machine translation. In Proc. ACL, 2017. doi: 10.18653/v1/P17-4012. URL https://doi.org/10.18653/v1/P17-4012.

Alexandre Klementiev and Dan Roth. Named entity transliteration and discovery from multilingual comparable corpora. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 82–88, New York City, USA, June 2006. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/N06-1011.

Kevin Knight and Jonathan Graehl. Machine transliteration. In Proceedings of the Eighth Conference on European Chapter of the Association for Computational Linguistics, EACL ’97, pages 128–135, Stroudsburg, PA, USA, 1997. Association for Computational Linguistics. doi: 10.3115/979617.979634. URL https://doi.org/10.3115/979617.979634.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL ’03, pages 48–54, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics. doi: 10.3115/1073445.1073462. URL https://doi.org/10.3115/1073445.1073462.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL ’07, pages 177–180, Stroudsburg, PA, USA, 2007. Association for Computational Linguistics. URL http://dl.acm.org/citation.cfm?id=1557769.1557821.

J. F. Kolen and S. C. Kremer. Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies. IEEE, 2001. ISBN 9780470544037. doi: 10.1109/9780470544037.ch14.
URL https://ieeexplore.ieee.org/document/5264952.

G. Kumar, M. Post, D. Povey, and S. Khudanpur. Some insights from translating conversational telephone speech. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3231–3235, May 2014. doi: 10.1109/ICASSP.2014.6854197.

Gaurav Kumar, Graeme Blackwood, Jan Trmal, Daniel Povey, and Sanjeev Khudanpur. A coarse-grained model for optimal coupling of ASR and SMT systems for speech translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1902–1907, Lisbon, Portugal, September 2015. Association for Computational Linguistics. doi: 10.18653/v1/D15-1218. URL https://www.aclweb.org/anthology/D15-1218.

Jin-Shea Kuo, Haizhou Li, and Chih-Lung Lin. Harvesting regional transliteration variants with guided search. In Wenjie Li and Diego Mollá-Aliod, editors, Computer Processing of Oriental Languages. Language Technology for the Knowledge-based Economy, pages 133–144, Berlin, Heidelberg, 2009. Springer Berlin Heidelberg. ISBN 978-3-642-00831-3.

Jennifer Lai and Nicole Yankelovich. Conversational speech interfaces. In The Human-Computer Interaction Handbook, pages 698–713. L. Erlbaum Associates Inc., Hillsdale, NJ, USA, 2003. ISBN 0-8058-3838-4. URL http://dl.acm.org/citation.cfm?id=772072.772116.

Akinobu Lee, Tatsuya Kawahara, and Kiyohiro Shikano. Julius – an open source real-time large vocabulary recognition engine. In Proceedings of EUROSPEECH 2001, volume 3, pages 1691–1694, 2001.

Haizhou Li, Min Zhang, and Jian Su. A joint source-channel model for machine transliteration. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pages 159–166, Barcelona, Spain, July 2004. doi: 10.3115/1218955.1218976. URL https://www.aclweb.org/anthology/P04-1021.

Yuchen Liu, Hao Xiong, Zhongjun He, Jiajun Zhang, Hua Wu, Haifeng Wang, and Chengqing Zong. End-to-end speech translation with knowledge distillation. CoRR, abs/1904.08075, 2019. URL http://arxiv.org/abs/1904.08075.

Edward Loper and Steven Bird. NLTK: The Natural Language Toolkit. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1, ETMTNLP ’02, pages 63–70, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics. doi: 10.3115/1118108.1118117. URL https://doi.org/10.3115/1118108.1118117.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. CoRR, abs/1508.04025, 2015a. URL http://arxiv.org/abs/1508.04025.

Thang Luong, Ilya Sutskever, Quoc Le, Oriol Vinyals, and Wojciech Zaremba. Addressing the rare word problem in neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 11–19, Beijing, China, July 2015b. Association for Computational Linguistics. doi: 10.3115/v1/P15-1002. URL https://www.aclweb.org/anthology/P15-1002.

Daniel Marcu and William Wong. A phrase-based, joint probability model for statistical machine translation. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, EMNLP ’02, pages 133–139, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics. doi: 10.3115/1118693.1118711.
URL https://doi.org/10.3115/1118693.1118711.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 311–318, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073135. URL https://doi.org/10.3115/1073083.1073135.

Jeff Pasternack and Dan Roth. Learning better transliterations. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM ’09), pages 177–186, 2009. doi: 10.1145/1645953.1645978.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch, 2017.

Martin Popel and Ondřej Bojar. Training tips for the transformer model. CoRR, abs/1804.00247, 2018. URL http://arxiv.org/abs/1804.00247.

Maja Popović and Hermann Ney. Word error rates: Decomposition over POS classes and applications for error analysis. In Proceedings of the Second Workshop on Statistical Machine Translation, StatMT ’07, pages 48–55, Stroudsburg, PA, USA, 2007. Association for Computational Linguistics. URL http://dl.acm.org/citation.cfm?id=1626355.1626362.

Matt Post, Gaurav Kumar, Adam Lopez, Damianos Karakos, Chris Callison-Burch, and Sanjeev Khudanpur. Improved speech-to-text translation with the Fisher and Callhome Spanish–English speech translation corpus. In International Workshop on Spoken Language Translation (IWSLT 2013), 2013.

Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, December 2011. ISBN 978-1-4673-0366-8. IEEE Catalog No.: CFP11SRW-USB.

Mirco Ravanelli. Deep learning for distant speech recognition. CoRR, abs/1712.06086, 2017. URL http://arxiv.org/abs/1712.06086.

Mihaela Rosca and Thomas Breuel. Sequence-to-sequence neural network models for transliteration. CoRR, abs/1610.09565, 2016. URL http://arxiv.org/abs/1610.09565.

Anthony Rousseau, Paul Deléglise, and Yannick Estève. TED-LIUM: an automatic speech recognition dedicated corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pages 125–129, May 2012.

David Rybach, Stefan Hahn, Patrick Lehnen, David Nolden, Martin Sundermeyer, Zoltán Tüske, Simon Wiesler, Ralf Schlüter, and Hermann Ney. RASR – the RWTH Aachen University open source speech recognition toolkit. December 2011.

Hassan Sajjad, Alexander Fraser, and Helmut Schmid. An algorithm for unsupervised transliteration mining with an application to word alignment. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 430–439, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/P11-1044.

Hendra Setiawan, Haizhou Li, Min Zhang, and Beng Chin Ooi. Phrase-based statistical machine translation: A level of detail approach. In Robert Dale, Kam-Fai Wong, Jian Su, and Oi Yee Kwong, editors, Natural Language Processing – IJCNLP 2005, pages 576–587, Berlin, Heidelberg, 2005. Springer Berlin Heidelberg. ISBN 978-3-540-31724-1.

Yan Shao and Joakim Nivre. Applying neural networks to English-Chinese named entity transliteration. In Proceedings of the Sixth Named Entity Workshop, pages 73–77, Berlin, Germany, August 2016. Association for Computational Linguistics.
doi: 10.18653/v1/W16-2710. URL https://www.aclweb.org/anthology/W16-2710.

Umut Sulubacak, Jörg Tiedemann, Aku Rouhe, Stig-Arne Grönroos, and Mikko Kurimo. The MeMAD submission to the IWSLT 2018 speech translation task. CoRR, abs/1810.10320, 2018. URL http://arxiv.org/abs/1810.10320.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. CoRR, abs/1409.3215, 2014. URL http://arxiv.org/abs/1409.3215.

Seiya Tokui and Kenta Oono. Chainer: a next-generation open source framework for deep learning, 2015.

Shyam Upadhyay, Jordan Kodner, and Dan Roth. Bootstrapping transliteration with constrained discovery for low-resource languages. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 501–511, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1046. URL https://www.aclweb.org/anthology/D18-1046.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NIPS 2017), 2017. URL https://arxiv.org/pdf/1706.03762.pdf.

Willie Walker, Paul Lamere, Philip Kwok, Bhiksha Raj, Rita Singh, Evandro Gouvea, Peter Wolf, and Joe Wölfel. Sphinx-4: A flexible open source framework for speech recognition. Technical report, Sun Microsystems, December 2004.

Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R. Hershey, and Tomoki Hayashi. Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal on Selected Topics in Signal Processing, 11(8):1240–1253, December 2017. ISSN 1932-4553. doi: 10.1109/JSTSP.2017.2763455.

Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai. ESPnet: End-to-end speech processing toolkit. In Interspeech, pages 2207–2211, 2018. doi: 10.21437/Interspeech.2018-1456. URL http://dx.doi.org/10.21437/Interspeech.2018-1456.

Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, and Zhifeng Chen. Sequence-to-sequence models can directly transcribe foreign speech. CoRR, abs/1703.08581, 2017. URL http://arxiv.org/abs/1703.08581.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144, 2016. URL http://arxiv.org/abs/1609.08144.

Steve Young, G. Evermann, M. J. F. Gales, Thomas Hain, D. Kershaw, Xunying Liu, G. Moore, James Odell, D. Ollason, Daniel Povey, V. Valtchev, and Philip Woodland. The HTK Book, 2002.

Dong Yu and Li Deng. Automatic Speech Recognition: A Deep Learning Approach. Signals and Communication Technology. Springer, London, 2015. ISBN 978-1-4471-5778-6. doi: 10.1007/978-1-4471-5779-3.

Thomas Zenkel, Matthias Sperber, Jan Niehues, Markus Müller, Quan Pham, Sebastian Stüker, and Alex Waibel. Open source toolkit for speech to text translation. The Prague Bulletin of Mathematical Linguistics, 111:125–135, October 2018. doi: 10.2478/pralin-2018-0011.