Computational Models of Word Order

Von der Fakultät Informatik, Elektrotechnik und Informationstechnik der Universität Stuttgart zur Erlangung der Würde eines Doktors der Naturwissenschaften (Dr. rer. nat.) genehmigte Abhandlung.

Vorgelegt von Xiang Yu aus Hunan, China

Hauptberichter: Prof. Dr. Jonas Kuhn
1. Mitberichter: Prof. Dr. Ngoc Thang Vu
2. Mitberichter: Prof. Dr. Leo Wanner

Tag der mündlichen Prüfung: 07.06.2021

Institut für Maschinelle Sprachverarbeitung der Universität Stuttgart, 2022

I hereby declare that I have created this work completely on my own and used no other sources or tools than the ones listed, and that I have marked any citations accordingly.

Hiermit versichere ich, dass ich die vorliegende Arbeit selbständig verfasst und keine anderen als die angegebenen Quellen und Hilfsmittel benutzt sowie Zitate kenntlich gemacht habe.

Berlin, May 20, 2022 (Place, Date)
Xiang Yu

Contents

Abstract
Überblick

1 Introduction
  1.1 Research Gaps and Contribution
  1.2 Outline
  1.3 Publications

2 Background
  2.1 Dependency Grammar
    2.1.1 Dependency Grammar vs. Constituency Grammar
    2.1.2 Projectivity
    2.1.3 Parsing and Linearization
  2.2 Dependency Parsing Methods
    2.2.1 Graph-based Parsing
    2.2.2 Transition-based Parsing
    2.2.3 Non-Projective Parsing
  2.3 Linearization Methods
    2.3.1 Syntax-free Linearization
    2.3.2 Syntax-based Linearization
  2.4 Machine Learning Models
    2.4.1 Perceptron
    2.4.2 Multi-Layer Perceptron
    2.4.3 Distributed Representation
    2.4.4 Recurrent Neural Networks
    2.4.5 Sequence-to-Sequence Model and Attention Mechanism
    2.4.6 Tree-LSTM
    2.4.7 Graph Neural Networks
  2.5 Dependency Length Minimization

3 Dependency Tree Linearization
  3.1 Encoding
    3.1.1 Token
    3.1.2 Sequence
    3.1.3 Tree-LSTM
    3.1.4 Graph Neural Networks
  3.2 Divide-and-Conquer Strategy
  3.3 Decoding
    3.3.1 Incremental Decoding
      Left-to-Right Generation with Beam Search
      Head-First Generation Order
      Training with Beam Search
      Ensemble
    3.3.2 Graph-based Decoding
      Scoring Model
      TSP Solver
      Training Objective
      Decoding with Constraints
    3.3.3 Transition-based Decoding
      Transition System
      Scoring Model
      Producing Non-Projective Sentences
  3.4 Summary

4 Linearization Experiments
  4.1 Data and Evaluation Metrics
  4.2 Encoders
  4.3 Decoders
  4.4 Model Errors vs. Search Errors
  4.5 Non-Projectivity
  4.6 Training Data Sizes
  4.7 Error Analysis
  4.8 Case Studies
    4.8.1 English
    4.8.2 German
    4.8.3 Chinese
  4.9 Summary

5 Dependency Length Minimization
  5.1 Experimental Settings
    5.1.1 Data
    5.1.2 Baselines
    5.1.3 Baseline Dependency Lengths
  5.2 Analyses
    5.2.1 Word Order Constraints
    5.2.2 Dependent Distribution: FREE vs. VAL
    5.2.3 Head Direction: VAL vs. SIDE
    5.2.4 Sibling Ordering: SIDE vs. OBS
  5.3 Experiments with Linearizer
  5.4 DLM and Non-Projectivity
  5.5 Summary

6 Surface Realization Shared Tasks
  6.1 Task Description
  6.2 Full Pipeline
    6.2.1 Linearization
    6.2.2 Function Word Completion
    6.2.3 Morphological Inflection
    6.2.4 Contraction
    6.2.5 Detokenization
  6.3 Data Augmentation
  6.4 Ensemble
  6.5 Shared Task Results
    6.5.1 Evaluation Metrics
    6.5.2 SR'19 Results
    6.5.3 SR'20 Results
  6.6 Summary

7 Conclusion
  7.1 Contributions
  7.2 Future Perspectives

A Detailed Experimental Results

Bibliography

List of Figures

2.1 Examples of a constituency tree and a dependency tree.
2.2 Dependency trees of the same meaning in different languages.
2.3 An example non-projective sentence.
3.1 A full pipeline for the surface realization shared task, including five modules: (1) linearization, (2) completion, (3) inflection, (4) contraction, and (5) detokenization.
3.2 Different ways to serialize a dependency tree as a sequence.
3.3 A schematic illustration of the bidirectional Tree-LSTM.
3.4 An illustration of the information flow in the encoder, where the red dotted arrows represent the bottom-up pass and the blue dashed arrows represent the top-down pass. The solid arrows illustrate the information flow from node 8 to node 4.
3.5 Multiple possible full tree orders from the same subtree orders, where (a) and (b) are ordered subtrees, and (c), (d), and (e) are possible full ordered trees, among which only (c) is projective.
3.6 The matrix and graph views of the linearization task as a TSP.
3.7 An illustration of the swap model. (1) shows a configuration [1 2 | 3 4 5 6], where the stack contains 1 and 2, the buffer contains 3, 4, 5, and 6, and the model predicts shift to move 3 from the buffer to the stack; (2) is the resulting configuration, and the model predicts swap, which moves 3 back to the buffer behind 4, as shown in (3). The solid black arrows represent the computation already done in the previous steps, and the dashed red arrows represent the new computation needed for the current step.
3.8 A pair of projective and non-projective sentences.
3.9 An overview of the optimal linearization system.
4.1 An example to demonstrate the different evaluation metrics, where (a) is the reference sentence and (b) is the predicted sentence.
4.2 The improvement of full tree match after reordering with respect to the percentage of non-projective arcs in each treebank.
4.3 BLEU score with respect to the sentence length and the non-projective arc rate.
4.4 BLEU scores of models trained on different data sizes (on a logarithmic scale), where (a) shows the average and (b) shows individual treebanks.
4.5 An example of two types of error cases, where the correct order should be [1, 2, 3, 4, 5, 6]; head errors are marked in red (d and e), sibling errors in blue (b and c).
4.6 Frequency of each dependency relation in each language, where the treebanks are grouped by language families and subfamilies, and the relations are grouped by the structural categories of the dependent.
4.7 Head and sibling error rates of each dependency relation in each language.
4.8 Difference of errors between the lexicalized model and the delexicalized model, which illustrates the benefit of lexicalization.
5.1 Example dependency trees and their dependency lengths, adapted from Futrell et al. (2015a). The second sentence is preferred since it has shorter DL and lower cognitive burden.
5.2 An example tree and the results of four baseline linearization methods, where the label represents the size of the subtree.
5.3 Average DL vs. sentence length across 53 treebanks.
5.4 All treebanks characterized by the two types of word order freedom.
5.5 Percentage of trees where the real order has shorter DL than flipping one dependent.
5.6 Percentage of trees where the real order has shorter DL than swapping two siblings.
5.7 Average DL vs. sentence length across 53 treebanks. The additional results of DELEX and STRUCT are from the trained linearization models, while the others are directly from manipulation of the original treebank.
5.8 The percentage of non-projective trees and the DL of the fully random linearization, compared to the projective random linearization and the real data.
5.9 A pair of projective and non-projective sentences in German, and their DLs. The non-projective sentence with the extraposed relative clause has shorter DL.
5.10 The comparison between the non-projective sentences and their projectivized versions in terms of DL.
6.1 A full pipeline for the surface realization shared task, including five modules: (1) linearization, (2) completion, (3) inflection, (4) contraction, and (5) detokenization.
6.2 An example of the completion process to the right side, where the right arrows illustrate the forward LSTM and the left arrows the backward LSTM.
6.3 An illustration of the contraction model.
6.4 Sentence length distribution in en_lines before and after filtering.

List of Tables

2.1 Transitions of the ARCSTANDARD system.
2.2 Transitions of the ARCSTANDARDSWAP system.
3.1 An example sentence in the CoNLL-U format.
3.2 The SWAP transition system.
3.3 The transitions to reorder the projective tree in Figure 3.8.
4.1 Information about the treebanks used in the experiments, including the language (sub-)family, data size, average sentence length, average non-leaf subtree size, and the percentage of non-projective sentences and non-projective arcs.
4.2 Different encoders combined with the subtree and full tree TSP decoders, evaluated on the BLEU score and the exact matching rates of subtrees and full trees.
4.3 Different versions of the sequence encoders combined with the subtree and full tree TSP decoders, evaluated on the BLEU score and the exact matching rates of subtrees and full trees.
4.4 Performance of different decoders on the subtree and full tree level.
4.5 Model and search error rates of the TSP-sub and TSP-full decoders.
4.6 TSP decoders on the full tree and subtree level, and additionally with reordering, evaluated on the BLEU score and the exact matching rates of subtrees and full trees.
4.7 Percentage of sentences that are changed for the better or for the worse by the reordering.
4.8 Distribution of deviation categories in the prediction.
6.1 Numbers of sentences in the datasets of SR'19 and SR'20, where the in-domain, out-of-domain, and automatically parsed treebanks are used in both shared tasks, and the automatically parsed Wikipedia treebanks are newly introduced in SR'20.
6.2 An example sentence for the shallow track (T1).
6.3 An example sentence for the deep track (T2).
6.4 Automatic evaluation results of the shallow track (T1) and the BLEU difference with the best system among other participants for each treebank.
6.5 Automatic evaluation results of the deep track (T2) and the BLEU difference with the best system among other participants for each treebank.
6.6 Comparison of the average BLEU scores of our SR'19 results and our SR'20 results in the closed (2020a) and open (2020b) subtracks.
6.7 Automatic evaluation results in the shallow track (T1) of our system in SR'19 and in the closed subtrack (2020a) and open subtrack (2020b) of SR'20.
6.8 Automatic evaluation results in the deep track (T2) of our system in SR'19 and in the closed subtrack (2020a) and open subtrack (2020b) of SR'20.
A.1 Detailed BLEU scores for the experiment on encoders in §4.2.
A.2 Detailed EMF scores for the experiment on encoders in §4.2.
A.3 Detailed EMS scores for the experiment on encoders in §4.2.
A.4 Detailed BLEU scores for the experiment on decoders in §4.3.
A.5 Detailed EMF scores for the experiment on decoders in §4.3.
A.6 Detailed EMS scores for the experiment on decoders in §4.3.
A.7 Detailed BLEU scores for the experiment on training data sizes in §4.6.
A.8 Detailed EMF scores for the experiment on training data sizes in §4.6.
A.9 Detailed EMS scores for the experiment on training data sizes in §4.6.

List of Abbreviations

AMR Abstract Meaning Representation
DL Dependency Length
DLM Dependency Length Minimization
GAT Graph Attention Network
GCN Graph Convolutional Network
GNN Graph Neural Network
LSTM Long Short-Term Memory
MLP Multi-layer Perceptron
MST Maximum Spanning Tree
NLP Natural Language Processing
RNN Recurrent Neural Network
Seq2Seq Sequence-to-Sequence
TSP Traveling Salesman Problem
UD Universal Dependencies
UPOS Universal Part-of-Speech
VRP Vehicle Routing Problem
XPOS Language-specific Part-of-Speech

Abstract

A sentence in our mind is not a simple sequence of words but a hierarchical structure. We put the sentence into linear order when we utter it for communication. Linearization is the task of mapping the hierarchical structure of a sentence into its linear order.

Our work is based on dependency grammar, which models the dependency relations between words; the resulting syntactic representation is a directed tree structure. The popularity of dependency grammar in Natural Language Processing (NLP) benefits from its separation of structural order and linear order and its emphasis on syntactic functions. These properties facilitate a universal annotation scheme covering the wide range of languages used in our experiments.

We focus on developing a robust and efficient computational model that finds the linear order of a dependency tree. We take advantage of the expressive power of deep learning models to robustly encode the syntactic structures of typologically diverse languages. We take a graph-based approach that combines a simple bigram scoring model and a greedy decoding algorithm to search for the optimal word order efficiently. We use the divide-and-conquer strategy to reduce the search space, which restricts the output to be projective, and then resolve this restriction with a transition-based post-processing model.

Apart from the computational models, we also study word order from a quantitative linguistic perspective. We examine the Dependency Length Minimization (DLM) hypothesis, which is believed to be a universal factor that affects the word order of every language. It states that human languages tend to order words so as to minimize the overall length of the dependency arcs, which reduces the cognitive burden of speaking and understanding. We demonstrate that DLM can explain every aspect of word order in a dependency tree, such as the direction of the head, the arrangement of sibling dependents, and the existence of crossing arcs (non-projectivity). Furthermore, we find that DLM not only shapes the general word order preferences but also motivates the occasional deviations from these preferences.

Finally, we apply our model to the task of surface realization, which aims to generate a sentence from a deep syntactic representation. We implement a pipeline with five steps, (1) linearization, (2) function word generation, (3) morphological inflection, (4) contraction, and (5) detokenization, which achieved state-of-the-art performance.

Überblick

Ein Satz ist in unserem Kopf keine einfache Folge von Wörtern, sondern eine hierarchische Struktur.
Wir bringen den Satz in die lineare Reihenfolge, wenn wir ihn zur Kommunikation aussprechen. Linearisierung ist die Aufgabe, die hierarchische Struktur eines Satzes in seine lineare Ordnung abzubilden.

Unsere Arbeit basiert auf der Dependenzgrammatik, die die Dependenz (Abhängigkeitsbeziehung) zwischen den Wörtern modelliert; die resultierende syntaktische Darstellung ist eine gerichtete Baumstruktur. Die Popularität der Dependenzgrammatik in der Maschinellen Sprachverarbeitung profitiert von ihrer Trennung von struktureller und linearer Ordnung und ihrer Betonung der syntaktischen Funktionen. Diese Eigenschaften ermöglichen ein universelles Annotationsschema, das eine Vielzahl von Sprachen abdeckt, die in unseren Experimenten verwendet werden.

In dieser Arbeit konzentrieren wir uns auf die Entwicklung eines robusten und effizienten Berechnungsmodells, das die lineare Ordnung eines Dependenzbaums findet. Wir nutzen die Ausdruckskraft von Deep-Learning-Modellen, um die syntaktischen Strukturen von typologisch unterschiedlichen Sprachen robust zu kodieren. Wir verwenden einen graphenbasierten Ansatz, der ein einfaches Bigram-Scoring-Modell und einen Greedy-Decodierungsalgorithmus kombiniert, um effizient nach der optimalen Wortreihenfolge zu suchen. Wir verwenden die Divide-and-Conquer-Strategie, um den Suchraum zu reduzieren, was die Ausgabe auf projektive Dependenzbäume beschränkt. Anschließend lösen wir die Einschränkung mit einem transitionsbasierten Nachbearbeitungsmodell auf.

Zusätzlich zu den Berechnungsmodellen untersuchen wir die Wortreihenfolge auch aus einer quantitativen linguistischen Perspektive. Wir untersuchen die Dependency-Length-Minimization-(DLM)-Hypothese, von der angenommen wird, dass sie ein universeller Faktor ist, der die Wortreihenfolge jeder Sprache beeinflusst. Sie besagt, dass menschliche Sprachen dazu neigen, die Wörter so anzuordnen, dass die Gesamtlänge der Dependenzkanten minimiert wird, was die kognitive Belastung beim Sprechen und Verstehen reduziert. Wir zeigen, dass DLM jeden Aspekt der Wortreihenfolge in einem Dependenzbaum erklären kann, wie z. B. die Richtung des Kopfes, die Anordnung von Geschwister-Dependenten und die Existenz von kreuzenden Dependenzkanten (Nicht-Projektivität). Außerdem finden wir, dass DLM nicht nur die allgemeinen Wortordnungspräferenzen formt, sondern auch die gelegentliche Abweichung von den Präferenzen motiviert.

Schließlich wenden wir unser Modell bei der Aufgabe der Oberflächenrealisierung an, die darauf abzielt, einen Satz aus einer tiefen syntaktischen Repräsentation zu erzeugen. Wir implementieren eine Pipeline mit fünf Schritten, (1) Linearisierung, (2) Funktionsworterzeugung, (3) morphologische Flexion, (4) Kontraktion und (5) Detokenisierung, die eine State-of-the-Art-Leistung erzielt.

Acknowledgements

My journey in computational linguistics started with randomly picking up the novel Story of Your Life by Ted Chiang in a library, which describes alien creatures who have no linear conception of time and no linear word order in their language. At that point, I had no idea that I would spend so many years studying language and end up with a dissertation on word order. In retrospect, it was just like the story in the novel.

Throughout my journey at the IMS, my deepest gratitude goes to my advisor Jonas Kuhn. He offered me a research assistant job when I was still an undergraduate student. Later I wrote my Bachelor's, Master's, and Doctoral Theses under his supervision.
He has given me great freedom to explore the topics that I found interesting, and helped me to see the connections between ideas from a higher perspective and to steer through uncertainty. Eventually, some seemingly random pieces of research work intertwined nicely and resulted in this dissertation.

My other advisor Ngoc Thang Vu has also given me a great amount of guidance, in a way complementary to Jonas. He focused on the technical side of machine learning and provided hands-on support. We had many discussions on the details of my models and spent time together debugging my code. He has always encouraged me with his great optimism when I was in doubt about my ideas.

Apart from my two advisors, I have also benefited tremendously from Wolfgang Seeker and Anders Björkelund, both co-supervisors of my student work and thesis as well as my office mates. From them, I learned not only a great amount about parsing and machine learning but also critical thinking in research. I am also very grateful to Agnieszka Faleńska for the cooperation in teaching and research, for plotting all the beautiful figures in our papers, and for being a cheerful and candid friend. Many thanks also to all the friends and colleagues at the IMS with whom I have had countless discussions about research and life, Özlem Çetinoğlu, Kyle Richardson, Simon Tannert, Sean Papay, Daniel Ortega, et al. With all your company, I have enjoyed every day of my journey (except for those few sleepless days before conference deadlines).

A great portion of my dissertation is motivated by the work of two former colleagues from Stuttgart, Bernd Bohnet and Leo Wanner. Special thanks to Leo for reviewing my thesis as a committee member.

Finally, I would like to thank my parents for their constant support of my decisions, my wife Yang for putting up with me, and my son Luke, whose due date was a week after I finished the draft, for forcing me to finally overcome my procrastination.

And now begins the story of your life.

Chapter 1
Introduction

Dependency grammar is a grammar formalism based on the dependency relations between words, pioneered by Tesnière (1959, 2015). Dependency relations such as subject and object link the words in a sentence and build up a tree structure representing the meaning of the sentence. A special trait of dependency grammar is that it is agnostic to the surface order of the sentence. The linear order of the words is not modeled by the dependency structure and plays only a secondary role in the representation of the sentence, as stated in Tesnière (2015, p. 13): "[o]ne must not lose sight of the fact that syntactically, the true sentence is the structural sentence, for which the linear sentence is only an image projected onto the spoken chain, with all the disadvantages associated with flatness."

In Tesnière's theory, when we utter a sentence, we first have a tree-structured representation in mind and transform it into the linear sequence of words by traversing the structure. This process is called linearization. The opposite process is called parsing: we receive a sentence in its linear form and analyze its syntactic structure by identifying the dependency relations among the words.

We have two main focuses in this thesis. The first is to develop a computational model that determines the word order of a dependency tree; the second is to explore, on a higher level, the factor that universally influences the word order in all languages.
1.1 Research Gaps and Contribution

Developing a stable and efficient linearization model. Several linearization systems have been developed over the years, and some have already achieved very strong performance, e.g., Bohnet et al. (2012). However, most of them are based on perceptrons with feature templates, which work well on English but struggle with morphologically rich or low-resource languages. Some linearizers are very slow due to beam search decoding and thus cannot satisfy the speed requirements of real-time generation. In this thesis, we aim to develop a linearization system with stable multilingual performance, thanks to the powerful representation ability of deep learning models, while remaining highly efficient, thanks to a simple decoding algorithm.

To achieve this goal, we explore several linearization algorithms from different paradigms. The first algorithm generates the linear order incrementally by selecting the words from the input tree one after another. We use a Recurrent Neural Network (RNN), in particular the Long Short-Term Memory (LSTM), to calculate the partial sequence score and use beam search to explore multiple hypotheses simultaneously. The second algorithm treats linearization as a Traveling Salesman Problem (TSP), namely finding a traversal path that visits each word exactly once. We use a biaffine model to calculate the scores of all possible bigrams and a greedy TSP solving algorithm to find the approximately optimal traversal path. The third algorithm is a transition system with a stack and a buffer, similar to a transition-based parsing system. It uses two operations: shift to move words from the buffer to the stack and swap to swap the order of two words. Unlike the previous two, where the input is a set of words, this algorithm takes a sequence as input and outputs another sequence.

Impact of dependency length minimization on word order. Apart from developing linearization models for each individual language, we are also interested in understanding what universally shapes the word order in all languages. Since Greenberg's universals, it has been observed that word orders with a shorter distance between the head and the dependent are more frequent (Greenberg, 1963). More recent work (Temperley, 2007; Liu, 2008; Futrell et al., 2015a) has focused on quantifying the dependency length and exploring how it affects word order, which is commonly referred to as the Dependency Length Minimization (DLM) hypothesis.

We follow this line of research to provide further evidence for the hypothesis. Furthermore, we break down the word order into more fine-grained factors, including the direction of the dependency and the pairwise arrangement of sibling dependents. With incremental baselines that start from a random word order, we gradually conform to the word order preferences found in the actual sentences of the treebank collections. In this way, we can explore the effect of DLM on each factor of word order. Moreover, we show that DLM not only shapes the general word order preferences in these factors, it also motivates deviations from such preferences when necessary.

Treatment of non-projective sentences. Projectivity is an important property that emerges from the combination of linear order and structural order in a sentence.
When the words in a sentence are drawn in their linear order and the dependency arcs do not cross each other, it is called a projective sentence or a projective tree; otherwise, it is non-projective. Most languages, such as English and Chinese, have very few non-projective sentences, although theoretically, among all possible arrangements of the words in a sentence, the vast majority are non-projective. Some languages with more flexible word order, such as Latin or Russian, tend to have a much higher percentage of non-projective sentences.

Many linearization and parsing algorithms assume that all the sentences they generate or parse are projective. The main reason is that the projectivity assumption vastly reduces the search space; therefore, projective algorithms typically have lower time complexity than non-projective algorithms. Also, since most sentences are projective, it is more likely to find the correct parse tree or word order in the restricted search space. This is an engineering trade-off, with coverage on the one hand and accuracy and complexity on the other. However, projective algorithms do not work well on the few languages with a large percentage of non-projective sentences. It is thus desirable to find a method that deals with non-projectivity without sacrificing performance on projective sentences.

The projectivity assumption is also commonly imposed in the analysis of DLM in many previous studies, i.e., non-projective sentences are discarded in the experiments. In fact, non-projectivity is an equally if not more important factor of word order, and it should not be overlooked in the analysis. Therefore, we conduct experiments specifically on non-projective sentences and show that non-projectivity as a factor of word order is also strongly influenced by DLM.

1.2 Outline

The remainder of the dissertation is organized as follows:

Chapter 2 introduces the main concepts relevant to the thesis. We start with the framework of dependency grammar, the dichotomy of structural order and surface order, and the tasks of transforming one into the other, namely parsing and linearization. We briefly introduce parsing methods, although the focus of this thesis is on linearization, since many concepts and methods were originally developed for parsing and later adapted for linearization. We then describe several successful models for linearization. As our model is based on neural networks, we also introduce relevant models such as the LSTM, the Tree-LSTM, and Graph Neural Networks. Finally, we describe the Dependency Length Minimization hypothesis, which is a probable explanation of how the word order preferences in natural languages are shaped.

Chapter 3 describes our framework of the computational model for linearization, which consists of two main components, the encoder and the decoder. We present different methods to represent the dependency tree and different decoding algorithms to produce the surface order of the encoded dependency tree. We discuss the pros and cons of our decoder's inductive bias that restricts the surface order to be projective and present a transition-based model to reorder the sentence.

Chapter 4 describes the main experiments in this thesis. We start with a comparison of encoders and decoders to find the optimal combination for the linearization model. With the optimal linearization model, we then analyze its prediction errors from different angles.
From the machine learning perspective, we quantify the extent of model errors and search errors to determine the performance bottleneck. From the linguistic perspective, we look for error patterns with respect to language families and dependency relations. We also conduct a manual inspection of the error cases in a few languages and distinguish word order variation from actual linearization errors.

Chapter 5 extends the linguistic analysis of word order in the previous chapter and explores a hypothesized universal factor that influences word order, Dependency Length Minimization (DLM). We demonstrate the effect of DLM by designing a series of baselines and show that the dependency length gradually decreases from completely random ordering to the actual ordering in the treebank data through several intermediate baselines. We explain each drop in dependency length with respect to a factor of word order and show the universality of DLM in almost all languages and all factors. Finally, we also explore the DLM hypothesis with respect to projectivity, an often overlooked factor in word order.

Chapter 6 turns the focus back to the computational model. We apply the linearization model as part of the pipeline for the shared tasks on surface realization. We describe the other components in the pipeline as well as the data augmentation and ensembling methods used for pushing the performance.

Chapter 7 concludes this thesis. We summarize the contributions of our work, which are mainly (1) developing an efficient and stable computational model for word order, (2) exploring the relationship between word order and dependency length, and (3) a special focus on non-projective trees. We also discuss our work from a broader perspective and envisage an interplay of linearization and parsing as syntax-based machine translation.

1.3 Publications

The main body of this dissertation is based on the following peer-reviewed publications:

• Yu, X., Falenska, A., and Kuhn, J. (2019b). Dependency length minimization vs. word order constraints: An empirical study on 55 treebanks. In Proceedings of the First Workshop on Quantitative Syntax (Quasy, SyntaxFest 2019), pages 89–97, Paris, France. Association for Computational Linguistics.
Chapter 5 is based on this publication. My co-author Agnieszka Falenska was responsible for visualization, Jonas Kuhn was responsible for writing the overview and providing advice.

• Yu, X., Falenska, A., Vu, N. T., and Kuhn, J. (2019c). Head-First Linearization with Tree-Structured Representation. In Proceedings of the 12th International Conference on Natural Language Generation, pages 279–289, Tokyo, Japan. Association for Computational Linguistics.
Chapter 3 is partly based on this publication. My co-author Agnieszka Falenska was responsible for visualization and calibrating baseline models, Ngoc Thang Vu and Jonas Kuhn had advisory roles.

• Yu, X., Falenska, A., Haid, M., Vu, N. T., and Kuhn, J. (2019a). IMSurReal: IMS at the Surface Realization Shared Task 2019. In Proceedings of the 2nd Workshop on Multilingual Surface Realisation (MSR 2019), pages 50–58, Hong Kong, China. Association for Computational Linguistics.
Chapter 6 is partly based on this publication. My co-author Agnieszka Falenska was responsible for visualization and calibrating baseline models, Marina Haid conducted preliminary experiments, Ngoc Thang Vu and Jonas Kuhn had advisory roles.

• Yu, X., Tannert, S., Vu, N. T., and Kuhn, J. (2020a).
Fast and accurate non-projective dependency tree linearization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1451–1462, Online. Association for Computational Linguistics.
Chapter 3 is partly based on this publication. My co-author Simon Tannert was responsible for visualization and conducting preliminary experiments, Ngoc Thang Vu and Jonas Kuhn had advisory roles.

• Yu, X., Tannert, S., Vu, N. T., and Kuhn, J. (2020b). IMSurReal too: IMS in the surface realization shared task 2020. In Proceedings of the Third Workshop on Multilingual Surface Realisation, pages 35–41, Barcelona, Spain (Online). Association for Computational Linguistics.
Chapter 6 is partly based on this publication. My co-author Simon Tannert was responsible for visualization and data augmentation, Ngoc Thang Vu and Jonas Kuhn had advisory roles.

The work in this dissertation has also been shaped by the experience from the following publications during my doctoral studies. Although they are not a core part of the dissertation, they have been part of my project work and have more or less influenced the final product.

• Yu, X. and Vu, N. T. (2017). Character composition model with convolutional neural networks for dependency parsing on morphologically rich languages. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 672–678, Vancouver, Canada. Association for Computational Linguistics.

• Björkelund, A., Falenska, A., Yu, X., and Kuhn, J. (2017). IMS at the CoNLL 2017 UD shared task: CRFs and perceptrons meet neural networks. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 40–51, Vancouver, Canada. Association for Computational Linguistics.

• Yu, X., Falenska, A., and Vu, N. T. (2017). A general-purpose tagger with convolutional neural networks. In Proceedings of the First Workshop on Subword and Character Level Models in NLP, pages 124–129, Copenhagen, Denmark. Association for Computational Linguistics.

• Yu, X., Vu, N. T., and Kuhn, J. (2018). Approximate dynamic oracle for dependency parsing with reinforcement learning. In Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), pages 183–191, Brussels, Belgium. Association for Computational Linguistics.

• Blohm, M., Jagfeld, G., Sood, E., Yu, X., and Vu, N. T. (2018). Comparing attention-based convolutional and recurrent neural networks: Success and limitations in machine reading comprehension. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 108–118, Brussels, Belgium. Association for Computational Linguistics.

• Yu, X., Vu, N. T., and Kuhn, J. (2019d). Learning the Dyck language with attention-based Seq2Seq models. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 138–146, Florence, Italy. Association for Computational Linguistics.

• Yu, X., Vu, N. T., and Kuhn, J. (2020c). Ensemble self-training for low-resource languages: Grapheme-to-phoneme conversion and morphological inflection. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 70–78, Online. Association for Computational Linguistics.

• Dönicke, T., Yu, X., and Kuhn, J. (2020b). Real-valued logics for typological universals: Framework and application.
In Proceedings of the 28th International Conference on Computational Linguistics, pages 3990–4003, Barcelona, Spain (Online). International Committee on Computational Linguistics.

• Dönicke, T., Yu, X., and Kuhn, J. (2020a). Identifying and handling cross-treebank inconsistencies in UD: A pilot study. In Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020), pages 67–75, Barcelona, Spain (Online). Association for Computational Linguistics.

Chapter 2
Background

2.1 Dependency Grammar

Dependency grammar is a grammar formalism that represents the structure of a sentence by modeling the dependency relations among the words. The foundation of modern dependency grammar is laid out by Tesnière (1959, 2015), followed by many theories with similar principles, such as Word Grammar (Hudson, 1984), Functional Generative Description (Sgall et al., 1986), Meaning-Text Theory (Mel'cuk et al., 1988), and Universal Dependencies (Nivre et al., 2016a).

Dependency grammar assumes that the units of the structure are words and that they are connected by binary, asymmetric dependency relations, written t_i →_l t_j, where t_i is the head, t_j is the dependent, and l is the dependency label between them, such as subject or object. If there is a directed path t_i → ... → t_j, then t_i is called an ancestor of t_j and t_j a descendant of t_i, and we say that t_i dominates t_j. In most theories, the dependency structure is a tree, i.e., there is one word as the root, which is the ancestor of all other words in the sentence, and every word except for the root has exactly one head.

2.1.1 Dependency Grammar vs. Constituency Grammar

Another influential school of grammar formalism, constituency grammar (also called phrase structure grammar), proposed by Chomsky (1957), represents the syntactic structure of a sentence with multiple levels of phrases. Figure 2.1 demonstrates the syntactic representation of a sentence in both formalisms. There are several fundamental differences between these two grammar formalisms.

Figure 2.1: Examples of a constituency tree (a) and a dependency tree (b) for the sentence "the dog chases the cat".

The most obvious contrast is that constituency grammar introduces many intermediate nodes that are not present in the sentence's surface form. In contrast, dependency grammar does not assume extra nodes apart from the words, and all nodes are directly linked to each other.

Another, subtler difference is that the word order is coupled with the phrases but decoupled from the dependencies. In the constituency example, the structure of the verb phrase already determines that the noun phrase the cat follows the verb chases. In dependency grammar, however, whether the verb chases precedes or follows its object cat is not encoded in the syntactic representation.

At first glance, the lack of surface order might seem to be an inadequacy of dependency grammar compared to constituency grammar. However, in the context of multilingualism, it turns out to be a crucial advantage. Universal Dependencies (UD) (Nivre et al., 2016b) is an initiative that maintains an extensive collection of treebanks aiming at a universal annotation guideline of dependency grammar across languages. It now contains more than 100 typologically diverse languages and is still rapidly expanding.
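To make the notation above concrete, a dependency tree can be stored as a head index and a relation label per token, mirroring the HEAD and DEPREL columns of the CoNLL-U format used in UD treebanks. The following is a minimal sketch (an illustration, not the implementation used later in this thesis); it encodes the English tree of Figure 2.1 and checks the single-head and rootedness properties stated above.

```python
# A dependency tree as parallel lists: token i (1-based) has head index
# heads[i-1] (0 = artificial root) and relation label deprels[i-1].
tokens  = ["the", "dog", "chases", "the", "cat"]
heads   = [2, 3, 0, 5, 3]                 # e.g. "dog" is a dependent of "chases"
deprels = ["det", "nsubj", "root", "det", "obj"]

def is_tree(heads):
    """Each token's single head is given by the list itself; additionally,
    following head links from any token must reach the root without a cycle,
    and there must be exactly one root."""
    for i in range(1, len(heads) + 1):
        seen, node = set(), i
        while node != 0:
            if node in seen:              # cycle: not a tree
                return False
            seen.add(node)
            node = heads[node - 1]
    return heads.count(0) == 1

assert is_tree(heads)
```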
Within the UD framework, the translation of a sentence into different languages can be represented as very similar dependency structures, although the word order might vary considerably. Figure 2.2 illustrates the dependency trees of the sentence the dog chases the cat in four different languages. To amplify the contrast, the German example uses a less frequent but valid word order. There are three distinct orders of the verb and its arguments in these examples: SVO (subject-verb-object) for English and Chinese, OVS for German, and SOV for Turkish. Also, there are no explicit determiners in Chinese and Turkish. However, all examples have a common core structure in which the root of the sentence is chases, and it has two arguments, dog as the subject and cat as the object. The cross-lingual consistency of the (universal) dependency grammar is not only a plausible model of the human language faculty but also practical for NLP applications. For example, it allows one to easily translate from one language to another by making the lexical substitution with proper morphological inflection and then arranging the word order of the target tree.

Figure 2.2: Dependency trees of the same meaning in different languages: (a) English "The dog chases the cat", (b) Chinese "狗 追 猫", (c) German "Die Katze jagt der Hund", (d) Turkish "Köpek kediyi kovalar".

2.1.2 Projectivity

Projectivity is an important property for classifying dependency trees. Intuitively, a tree is projective if, when its tokens are drawn in their linear order, no dependency arcs cross each other. For example, the trees in Figure 2.2 are all projective. If a tree has at least one non-projective arc, it is a non-projective tree. Figure 2.3 shows an example of a non-projective tree, where the arc from you to Mark (marked in red in the figure) is non-projective.

Figure 2.3: An example non-projective sentence: "Will you handle this or Mark?"

Formally, an arc is projective if the head of the arc dominates all intervening tokens between the head and the dependent; otherwise it is non-projective:

t_i → t_j is projective  ⟺  ∀ t_k with t_i ≺ t_k ≺ t_j : t_i →* t_k,  if t_i ≺ t_j;
                             ∀ t_k with t_j ≺ t_k ≺ t_i : t_i →* t_k,  otherwise.     (2.1)

where ≺ denotes linear precedence and →* denotes domination.

2.1.3 Parsing and Linearization

Tesnière distinguishes the dependency structure and the surface word order as two different orders of the sentence, the structural order and the linear order. He argues that all the syntactic information of a sentence is contained in the structural order, while the linear order is a compromise to the physical world: since time is one-dimensional, one has to utter the words linearly.

Since the linear order is independent of the structural order in dependency grammar, there is a need for a (computational) model that can transform one into the other. The task of transforming the linear order into the structural order is parsing, and the task of transforming the structural order into the linear order is linearization.

In natural languages (unlike programming languages), there is no perfect one-to-one mapping between the structural order and the linear order, which makes parsing and linearization both difficult and exciting. For parsing, syntactic ambiguity causes different interpretations of the same sentence. For example, I shot an elephant in my pajamas could mean that I was in the pajamas, or that the elephant was in the pajamas.
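Coming back briefly to projectivity: the condition in Eq. (2.1) can be checked directly on the head-array representation sketched earlier. Below is a minimal, unoptimized sketch (not the code used in this thesis); the head indices for the example are my reading of the relations given for Figure 2.3 and are included purely for illustration.

```python
def is_projective_arc(heads, d):
    """Check Eq. (2.1) for the arc heads[d-1] -> d: the head h must dominate
    every token strictly between h and d (heads are 1-based, 0 = root)."""
    h = heads[d - 1]
    lo, hi = min(h, d), max(h, d)
    for k in range(lo + 1, hi):
        node = k
        while node not in (0, h):         # climb towards the root
            node = heads[node - 1]
        if node != h:                     # k is not dominated by h
            return False
    return True

def non_projective_arcs(heads):
    return [d for d in range(1, len(heads) + 1)
            if heads[d - 1] != 0 and not is_projective_arc(heads, d)]

# "Will you handle this or Mark ?" (Figure 2.3); heads follow the relations
# aux, nsubj, root, obj, cc, conj, punct given in the text.
heads_fig23 = [3, 3, 0, 3, 6, 2, 3]
print(non_projective_arcs(heads_fig23))   # prints [6]: the arc you -> Mark
```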
To resolve this ambiguity, one must rely on semantics and world knowledge to understand that elephants are unlikely to wear pajamas. For linearization, word order variation results in the same syntactic structure being uttered in different ways. The previous example could be rearranged as In my pajamas, I shot an elephant, where the emphasis is on the outfit instead of the event. Word order variation is influenced by various factors at different linguistic levels, e.g., prosody (Zubizarreta, 1998), information structure (Hawkins, 1992; Lambrecht, 1996; Neeleman and Van de Koot, 2016), and, as we will discuss in Chapter 5, the cognitive burden. Note that the linearization model in this work does not explicitly encode these factors, since we focus on the technical framework of a data-driven linearization model. However, it is possible that the machine learning model implicitly picks up the relevant cues to determine the word order.

2.2 Dependency Parsing Methods

Although this thesis does not directly work on dependency parsing, parsing methods have profoundly influenced the development of linearization due to the similarity between the two tasks. We therefore first briefly introduce some parsing methods, focusing on those directly adopted for linearization. For a complete introduction to dependency parsing algorithms, we refer to Kübler et al. (2009). Generally, there are two major paradigms of dependency parsing algorithms, graph-based and transition-based. Graph-based algorithms directly search for a tree in the fully connected graph, while transition-based algorithms derive the tree step by step through transitions.

2.2.1 Graph-based Parsing

Graph-based parsing algorithms (Eisner, 1996; McDonald et al., 2005) first construct a fully connected directed graph where every word is linked to every other word; a scoring function computes the scores of all the arcs in the graph, and a decoding algorithm finds the tree that maximizes the total score of its arcs, which is called the Maximum Spanning Tree (MST).

To score each potential arc, features are extracted from the head and the dependent of the arc, including their word forms, POS tags, their relative direction and distance, as well as other words around and between them as context. These features are then used by a machine learning model, e.g., the perceptron (§2.4.1), to calculate a score.

Having the scores for all arcs in the graph, we can then use an MST algorithm to decode the tree. There are two commonly used decoding algorithms. Eisner's algorithm (Eisner, 1996) is based on dynamic programming; it recursively breaks down the problem of finding the tree into finding the left and right subtrees and uses a chart to store the scores of recurring subtrees in order to avoid repeated computation. Chu-Liu-Edmonds' algorithm (Chu and Liu, 1965; Edmonds, 1967), used by McDonald et al. (2005), is a greedy algorithm with recursive refinement. It first selects the highest-scoring incoming arc for each node, which might result in cycles (and thus not a tree). It then resolves the cycles recursively until none remain.

Both algorithms find exact solutions efficiently, with a time complexity of O(n³). A significant difference between the two is that Chu-Liu-Edmonds' algorithm can natively produce non-projective trees, while Eisner's algorithm is restricted to projective trees. However, Eisner's algorithm can integrate more flexible higher-order features.
Therefore, it is more popular for parsing mostly projective languages such as English.

2.2.2 Transition-based Parsing

Transition-based algorithms define a transition system that builds up the tree step by step through transitions that change one parsing configuration into another, until reaching the terminal configuration, which contains the parse tree. Among the many transition systems, we introduce the ARCSTANDARD system (Nivre, 2008) as an example; its concepts, such as configuration and transition, are also used in other transition systems.

A configuration is a triple 〈σ, β, A〉, where the stack σ contains partially processed tokens, the buffer β contains the unprocessed tokens of the sentence, and the set A contains the dependency arcs created so far. The initial configuration is 〈[ROOT], [t_1, t_2, ..., t_n], ∅〉, where the stack contains only the root token, the buffer contains all the tokens of the sentence in their order, and the arc set is empty. The terminal configuration is 〈[ROOT], [], A〉, where the stack again contains only the root token, the buffer is empty, and A contains all the arcs that constitute the parse tree.

There are three transitions in the ARCSTANDARD system, left-arc, right-arc, and shift; their effects are described in Table 2.1. The left-arc transition creates an arc from the front of the buffer t_j to the top of the stack t_i and removes the token t_i; right-arc creates an arc from the top of the stack t_i to the front of the buffer t_j, removes the token t_j, and moves t_i back to the front of the buffer; shift moves the front of the buffer to the top of the stack.

Table 2.1: Transitions of the ARCSTANDARD system.

Transition | Before | After | Precondition
left-arc  | 〈σ|t_i, t_j|β, A〉 | ⇒ 〈σ, t_j|β, A ∪ {〈t_j, t_i〉}〉 | t_i ≠ ROOT
right-arc | 〈σ|t_i, t_j|β, A〉 | ⇒ 〈σ, t_i|β, A ∪ {〈t_i, t_j〉}〉 |
shift     | 〈σ, t_i|β, A〉     | ⇒ 〈σ|t_i, β, A〉                 |

2.2.3 Non-Projective Parsing

Among the previously described parsing algorithms, only the graph-based parser with Chu-Liu-Edmonds' algorithm can natively produce non-projective trees. For the projective parsers, several additional methods have been developed as pre- or post-processing to remedy this restriction; we introduce two of them here.

Nivre and Nilsson (2005) proposed pseudo-projective parsing. The basic idea is to modify the target non-projective tree into a (pseudo-)projective one by "lifting" the non-projective arcs until the tree becomes projective. To lift an arc is to reattach it to the head of its head, leaving a trace in the dependency label. They can then use a
The transitions in the ARCSTANDARDSWAP are listed in Table 2.2, where the first three transitions are identical as in ARCSTANDARD. The two methods above modify the non-projective trees in different ways: pseudo- projective parsing changes the tree structure by lifting the arcs; the ARCSTANDARDSWAP system changes the linear order of the sentence by swapping the words. 2.3 Linearization Methods The research of computational models for linearization has its root in machine transla- tion, which serves to find the order of words in the translated sentence (Brown et al., 1990; Brew, 1992). Earlier work mostly focuses on grammar-based approaches using dif- ferent syntactic formalisms (Elhadad and Robin, 1992; Lavoie and Rainbow, 1997; Carroll et al., 1999). With the increasing availability of annotated treebanks, statistical methods gain popularity and become the dominant approach (Langkilde and Knight, 1998; Ban- galore and Rambow, 2000; Filippova and Strube, 2009). We now introduce a few statistical linearization models that are relevant to our work. We broadly divide the models into two categories based on the input format, syntax-free and syntax-based. Syntax-free linearization operates on a bag of words and does not assume any structure in them, while syntax-based linearization takes an unordered de- pendency tree (among other grammar formalisms) as input. Although the two categories differ a lot in the available information, many methods designed for one category can be adapted for the other category. 16 2 Background 2.3.1 Syntax-free Linearization Knight (1999) cast the task of linearization as a Traveling Salesman Problem (TSP). A TSP is a problem that takes a list of cities and the distance between each pair of cities and searches for a route that visits each city exactly once and returns to the origin. To formulate the word ordering problem as a TSP, each word is treated as a city, the log probability of a word pair in a bigram language model is treated as the distance, and the goal is to find a sequence of words with the maximum probability. Much work has followed this formulation. Among others, Zaslavskiy et al. (2009) formulated the word ordering problem in phrase-based machine translation as a TSP and show that it achieves better performance and speed than beam search decoding with the same bigram language model. Horvat and Byrne (2014) explored higher-order n-gram language models for TSP-based word ordering, which transforms into a much larger TSP graph. These studies operate on a bag of words without syntax, which is a TSP graph of non-trivial size with little information about the internal structure. Much effort has been put into incorporating more powerful decoding algorithms such as Integer Programming (Germann et al., 2001) and Dynamic Programming (Tillmann and Ney, 2003). Under the TSP formulation, the computational model behind linearization is a simple n-gram language model, which does not model long-distance relations. To address this problem, Schmaltz et al. (2016) proposed a linearization model based on a Long Short- Term Memory (LSTM) language model. They use a pre-trained language model, which can give the probability estimation for any suffix of a sentence, and use beam search to approximately search for the order that maximizes the overall probability. 2.3.2 Syntax-based Linearization The methods mentioned above aim at linearizing a bag of words without any syntactical information. Although they could be used for dependency trees, the performance would be suboptimal. 
Syntax-based linearization typically uses discriminative models (in contrast to generative models such as the language model) to decide the word order. Bohnet et al. (2010) designed an algorithm to order the words with beam search. They extract features including the word form, part-of-speech, and dependency relations from the previous words and use a perceptron to calculate the scores for the next word. This differs from the language-model-based approach, which calculates the probability of the next word given the previous words solely from co-occurrence statistics in a large corpus; the discriminative model can incorporate rich structural information from the input and make a more informed decision.

Another major property of their algorithm that benefits from the tree structure is that they do not order the full tree at once. Instead, they divide the tree into subtrees (or word order domains in their terminology), where each subtree contains a head and its dependents. They then use beam search to order each subtree separately and finally combine the ordered subtrees into a full sentence. This is commonly referred to as the divide-and-conquer strategy, since it divides a large problem (ordering the full tree) into several smaller subproblems (ordering each subtree); by doing so, the search space is drastically reduced, thus increasing the likelihood that an approximate search algorithm finds the optimal solution.

A potential problem of this linearization algorithm is that, by combining the ordered subtrees, it can only produce projective sentences. Since the algorithm was initially designed to work on English data, which is predominantly projective, the inability to produce non-projective sentences did not pose a severe limitation to the model's performance. However, in order to apply the model to multilingual data, this limitation must be addressed. Bohnet et al. (2012) introduced a preprocessing model that first lifts the non-projective dependency arcs so that the correct order of the resulting dependency tree becomes projective, although the tree is no longer the original one. This method is inspired by the pseudo-projective parsing algorithm (Nivre and Nilsson, 2005).

Another line of work (Liu et al., 2015; Puduppully et al., 2016) is based on a transition system very similar to the ARCSTANDARD system for dependency parsing. They use a stack and a set to represent a configuration (in contrast to a stack and a buffer in a parsing configuration). There are three transitions: shift moves one word from the set onto the stack; left-arc and right-arc create a dependency arc between the top two items on the stack and remove the dependent from the stack. If the dependency tree structure is not given (i.e., when ordering a bag of words), this model performs parsing and linearization simultaneously, where shift decides the order of the words and left-arc and right-arc determine the dependency relations. If the dependency tree (or a partial tree) is given, it is used as a syntactic constraint to limit the possible transitions in every configuration. Since the transition system is based on the arc-standard system, it only produces projective sentences. Similar to Bohnet et al. (2010), a perceptron is used to score the transitions based on features extracted from the configuration, and beam search is used to explore multiple transition sequences.
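The following Python sketch illustrates the flavor of such a transition system on a toy example. It is a simplified rendering, not the exact system of Liu et al. (2015) or Puduppully et al. (2016): the transition sequence is fixed by hand, whereas the cited systems score each candidate transition with a perceptron and explore alternatives with beam search.

    # A simplified transition system for joint linearization and parsing: a configuration
    # holds a stack, an unordered word set, the emitted word order, and the created arcs.

    def shift(stack, words, order, w):
        """Move a chosen word from the set onto the stack; shifts define the word order."""
        words.remove(w)
        stack.append(w)
        order.append(w)

    def left_arc(stack, arcs):
        """The stack top becomes the head of the second item, which is removed."""
        head, dep = stack[-1], stack[-2]
        arcs.append((head, dep))
        del stack[-2]

    def right_arc(stack, arcs):
        """The second item becomes the head of the stack top, which is removed."""
        head, dep = stack[-2], stack[-1]
        arcs.append((head, dep))
        del stack[-1]

    stack, words, order, arcs = [], {"the", "dog", "barks"}, [], []
    # A fixed transition sequence for illustration only.
    shift(stack, words, order, "the")
    shift(stack, words, order, "dog")
    left_arc(stack, arcs)                      # dog  -> the
    shift(stack, words, order, "barks")
    left_arc(stack, arcs)                      # barks -> dog
    print("word order:", order)                # ['the', 'dog', 'barks']
    print("arcs (head, dependent):", arcs)     # [('dog', 'the'), ('barks', 'dog')]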
2.4 Machine Learning Models

2.4.1 Perceptron

Traditional models for classification tasks in NLP are mostly based on linear models such as the perceptron (Rosenblatt, 1958; Collins, 2002), which make predictions based on features extracted from the instances. The features are extracted according to manually designed templates, such as "what is the lemma of the head" or "what is the POS tag of the dependent"; the extracted features are then "the lemma of the head is chases" and "the POS tag of the dependent is NOUN". All the features found in the training set are assigned an index and stored in a feature mapping. In this way, each instance can be mapped to an input vector x composed of 0s and 1s, where the values at the positions that correspond to the activated features are set to 1, and the rest are set to 0. The computation in a perceptron model is:

y = W · x + b    (2.2)

where the weight matrix W and the bias vector b are the model parameters, x is the input feature vector, and y is the output vector that holds the score for each class. The prediction of the model is the class with the highest score in y.

To increase the linear perceptron model's expressive power, feature combination is often needed, where several elemental features are merged into one. It is not trivial to design a proper set of feature templates. The model could suffer from underfitting if the features are not expressive enough, or from overfitting if the features are so specific that they only match certain training instances and cannot generalize. For this reason, much effort has been spent on feature engineering, i.e., exploring informative elemental features and feature combinations that strike a balance between underfitting and overfitting.

2.4.2 Multi-Layer Perceptron

The emergence of neural networks has brought profound influence to the NLP field. One of the most basic neural models is the Multi-Layer Perceptron (MLP), which is a series of perceptrons with non-linear functions (e.g., the sigmoid function) applied between the layers. This introduces non-linearity into the computation and thus removes the need for manual feature combination. The following is an example of an MLP with two hidden layers:

h_1 = σ(W_1 · x + b_1)    (2.3)
h_2 = σ(W_2 · h_1 + b_2)    (2.4)
y = W_3 · h_2 + b_3    (2.5)

where x and y are the input and output vectors (as in the perceptron), W_1, W_2, W_3 are weight matrices, b_1, b_2, b_3 are bias vectors, h_1 and h_2 are the intermediate results of each perceptron layer, and σ(·) is the non-linear function, for example the sigmoid function:

σ(x) = 1 / (1 + e^{−x})    (2.6)

For simplicity, we use the following abstract notation to refer to the MLP:

y = MLP(x)    (2.7)

2.4.3 Distributed Representation

Arguably the most important concept of deep learning models in NLP is the distributed representation, also known as word embeddings when applied to words (Bengio et al., 2003; Mikolov et al., 2013). In the traditional perceptron model, the input is a one-hot vector, i.e., a vector of 0s and 1s, where a value of 1 at the n-th position means that the n-th feature is activated. An expressive model typically has millions of features, which means the input vector and the model parameters have millions of dimensions. The distributed representation, in contrast, uses a much smaller, dense vector to represent the features (words), where the values of the vector are part of the model parameters and are trained together with the rest of the model.
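As a minimal numerical contrast between the two input representations (toy dimensions, random parameters, arbitrary feature indices), the following sketch computes the perceptron scores of Equation 2.2 over a sparse one-hot vector and then builds a dense input from an embedding table.

    import numpy as np

    rng = np.random.default_rng(0)
    num_features, num_classes, emb_dim = 10, 3, 4

    # Perceptron over a sparse one-hot feature vector (Equation 2.2): y = W·x + b
    x = np.zeros(num_features)
    x[[2, 7]] = 1.0                      # two activated features, e.g. a lemma and a POS tag
    W = rng.normal(size=(num_classes, num_features))
    b = np.zeros(num_classes)
    y = W @ x + b
    print("perceptron scores:", y, "prediction:", int(np.argmax(y)))

    # Distributed representation: each feature index maps to a small dense trainable vector;
    # multiplying a weight matrix by a one-hot vector amounts to selecting one row, so the
    # embedding table simply stores those rows directly.
    E = rng.normal(size=(num_features, emb_dim))   # embedding table (model parameters)
    dense = E[2] + E[7]                            # sum of the embeddings of the active features
    print("dense input:", dense)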
The distributed representation not only has the advantage of compactness; more importantly, it also captures the syntactic and semantic similarity between words, which is learned from their contexts during training. The distributed representation was initially used to represent words, since it originates from language models, where words are the only observable features. It was soon adopted to represent arbitrary categorical features, such as POS tags and dependency relations (Chen and Manning, 2014).

2.4.4 Recurrent Neural Networks

Another advantage of deep learning models is their ability to model long-distance contexts. The Recurrent Neural Network (RNN) was designed by Elman (1990) to model sequences using a memory cell that recurrently updates itself with a sequence of inputs. Suppose we have a sequence of vectors as input [x_0, x_1, ..., x_n], where each x_i represents one element in the sequence. We can apply the RNN to obtain a sequence of output vectors [y_0, y_1, ..., y_n], where each y_i again represents one element of the sequence, now taking its preceding context into account. The model maintains a memory cell that keeps the information of the sequence it has read so far; at each step, it takes the current input, updates the memory, and produces an output vector, as follows:

h_t = σ_h(W_h · x_t + U_h · h_{t−1} + b_h)    (2.8)
y_t = σ_y(W_y · h_t + b_y)    (2.9)

where x_t is the input vector at time step t, [W_h, U_h, b_h] are the parameters to calculate the hidden state h_t, [W_y, b_y] are the parameters to calculate the output y_t, and σ_h(·) and σ_y(·) are non-linear activation functions.

The vanilla RNN suffers from vanishing or exploding gradients due to the depth of the unrolled recurrence, which prevents stable and effective training. Hochreiter and Schmidhuber (1997) proposed a more stable variant, the Long Short-Term Memory (LSTM) architecture, which addresses this problem. At each time step with the input x_t, the computation of the LSTM is as follows:

f_t = σ(W_f · x_t + U_f · h_{t−1} + b_f)    (2.10)
i_t = σ(W_i · x_t + U_i · h_{t−1} + b_i)    (2.11)
o_t = σ(W_o · x_t + U_o · h_{t−1} + b_o)    (2.12)
u_t = tanh(W_u · x_t + U_u · h_{t−1} + b_u)    (2.13)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ u_t    (2.14)
h_t = o_t ⊙ tanh(c_t)    (2.15)

where f_t is the forget gate, i_t is the input gate, and o_t is the output gate; all three vectors are calculated from the current input x_t and the previous output h_{t−1} with the corresponding parameters W, U, and b. These gating vectors all have values between 0 and 1 as a result of the sigmoid activation function; they are used to control the transformed information of the current step u_t through element-wise multiplication (⊙) to form the current memory c_t and the current output h_t.

The LSTM only models the sequence in one direction, which means the output at a time step is not aware of information from future time steps. To better process the sequential information, Graves and Schmidhuber (2005) proposed the bidirectional LSTM, which uses two LSTMs that run in opposite directions; the representation of each element in the sequence is simply the concatenation of the outputs of both LSTMs. We use the following abstract notations for the forward and backward LSTMs:

y_{1:n} = →LSTM(x_{1:n})    (2.16)
y_{1:n} = ←LSTM(x_{1:n})    (2.17)

where x_{1:n} is a sequence of input vectors and y_{1:n} is a sequence of output vectors.
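The following numpy sketch implements Equations 2.10–2.15 for a single LSTM layer with randomly initialized parameters. The dimensions and the initialization are arbitrary; the code mirrors the equations rather than any particular library implementation.

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_hid = 4, 3

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # One parameter block (W, U, b) per gate and for the cell update, as in Eqs. 2.10-2.13.
    params = {g: (rng.normal(size=(d_hid, d_in)),
                  rng.normal(size=(d_hid, d_hid)),
                  np.zeros(d_hid)) for g in "fiou"}

    def lstm_step(x_t, h_prev, c_prev):
        gate = {g: W @ x_t + U @ h_prev + b for g, (W, U, b) in params.items()}
        f_t = sigmoid(gate["f"])                 # forget gate            (2.10)
        i_t = sigmoid(gate["i"])                 # input gate             (2.11)
        o_t = sigmoid(gate["o"])                 # output gate            (2.12)
        u_t = np.tanh(gate["u"])                 # candidate update       (2.13)
        c_t = f_t * c_prev + i_t * u_t           # new memory             (2.14)
        h_t = o_t * np.tanh(c_t)                 # new output             (2.15)
        return h_t, c_t

    # Run the LSTM over a toy sequence of random input vectors.
    h, c = np.zeros(d_hid), np.zeros(d_hid)
    for x_t in rng.normal(size=(5, d_in)):
        h, c = lstm_step(x_t, h, c)
    print("final hidden state:", h)

A bidirectional LSTM simply runs a second instance of this loop over the reversed sequence and concatenates the two outputs per position.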
2.4.5 Sequence-to-Sequence Model and Attention Mechanism

The Sequence-to-Sequence (Seq2Seq) architecture (Cho et al., 2014; Sutskever et al., 2014) is a widely successful model that takes a sequence as input and produces another sequence as output. It is typically based on the combination of two RNNs: one RNN serves as the encoder that reads the input sequence, the other as the decoder that incrementally generates the elements of the output sequence.

The encoder reads all elements in the input sequence; its final output summarizes all the information in the input and is used by the decoder to generate the output. This introduces a problem: the model has to compress all the information of a sequence into one vector, which causes information loss, especially for longer sequences, and thus hurts the quality of the generated output.

The attention mechanism (Graves et al., 2014; Bahdanau et al., 2014; Luong et al., 2015) is designed to address this problem. It is motivated by the fact that humans pay attention to different elements in the input sequence at different times; for example, in machine translation, a different input word becomes the focus when generating each word of the output. In a Seq2Seq model with attention, the current decoder state computes its attention over all encoder outputs at every time step. The attention is a vector with values between 0 and 1 that sum to 1. It is used to obtain a weighted average of all the encoder states, which is then used by the decoder to generate the output for the current time step.

Generally, the attention score on one encoder state is calculated as follows:

s_{t,i} = f(y_t, x_i)    (2.18)

where f(y_t, x_i) is a function ℝ^d × ℝ^d → ℝ that takes the decoder state y_t and the encoder state x_i. There are several different attention models, which only differ in the function f(y_t, x_i); we refer to Hu (2019) for a comprehensive survey. The function that we use is based on Vaswani et al. (2017):

f(y_t, x_i) = (x_i^⊤ · y_t) / √d    (2.19)

where d is the dimension of the encoder state. The attention score is then normalized by the softmax function:

α_{t,i} = exp(s_{t,i}) / Σ_{j=1}^{n} exp(s_{t,j})    (2.20)

Having the attention scores α_{t,i}, we can then calculate the weighted average of the encoder states, which focuses on different positions of the input at each decoding step:

c_t = Σ_{i=1}^{n} α_{t,i} · x_i    (2.21)

We use the following abstract notation for the attention mechanism:

α_{t,1:n} = attention(y_t, x_{1:n})    (2.22)

where x_{1:n} are the encoder states for the elements of the input and y_t is the decoder state at the t-th step.

In attention-based Seq2Seq models, the attention mechanism is used to calculate the weights for averaging the representations of the input vectors, and the output is generated from the vocabulary. Vinyals et al. (2015b) proposed the pointer network, which can be viewed as a simplified version of the attention-based Seq2Seq model. It also calculates the attention between the current decoder state and all encoder states, but instead of generating an output from the vocabulary, it simply picks the input item with the highest attention as the output. The pointer network is suitable for tasks where all elements of the output are contained in the input, for example the linearization task.
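A minimal numpy sketch of Equations 2.19–2.21 with toy dimensions and random states: the decoder state attends over the encoder states and forms their weighted average; a pointer network would instead simply pick the input position with the highest attention weight.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 6, 8                                # 6 encoder states of dimension 8
    X = rng.normal(size=(n, d))                # encoder states x_1 .. x_n
    y_t = rng.normal(size=d)                   # current decoder state

    s = X @ y_t / np.sqrt(d)                   # scores s_{t,i} = x_i^T y_t / sqrt(d)   (2.19)
    alpha = np.exp(s) / np.exp(s).sum()        # softmax normalization                  (2.20)
    c_t = alpha @ X                            # weighted average of encoder states     (2.21)

    print("attention weights:", np.round(alpha, 3))
    print("pointer choice (argmax):", int(np.argmax(alpha)))   # pointer-network style selection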
2.4.6 Tree-LSTM

Tai et al. (2015) proposed the Tree-LSTM to model tree structures as a generalization of the sequential LSTM. We introduce the Child-Sum variant here, as it is suitable for modeling an unordered dependency tree. Unlike the sequential LSTM, which propagates information from the preceding node to the following node in a sequence, the Tree-LSTM propagates information from the dependents to their head in a dependency tree. Suppose we compute the representation of node j, whose dependents are k ∈ C(j); the computation of the Tree-LSTM is:

h̃_j = Σ_{k∈C(j)} h_k    (2.23)
f_{jk} = σ(W_f · x_j + U_f · h_k + b_f)    (2.24)
i_j = σ(W_i · x_j + U_i · h̃_j + b_i)    (2.25)
o_j = σ(W_o · x_j + U_o · h̃_j + b_o)    (2.26)
u_j = tanh(W_u · x_j + U_u · h̃_j + b_u)    (2.27)
c_j = i_j ⊙ u_j + Σ_{k∈C(j)} f_{jk} ⊙ c_k    (2.28)
h_j = o_j ⊙ tanh(c_j)    (2.29)

We first sum up the outputs of all the dependents of node j to obtain h̃_j, which summarizes the information of the dependents, and then calculate the forget gate f_{jk} of each dependent as well as the input gate i_j and the output gate o_j for the current node j. The memory cell of each dependent is controlled by the corresponding forget gate, the transformed information u_j is controlled by the input gate, and their sum forms the memory cell of the current node, which finally produces the output h_j through the output gate.

The Tree-LSTM has been improved by adding the attention mechanism to the hidden states (Zhou et al., 2016), so that each dependent influences the head representation to a different degree. Miwa and Bansal (2016) proposed a bidirectional extension that traverses the tree in both the bottom-up and the top-down direction to access all the tokens in the tree. We use the following abstract notations for the bottom-up and top-down Tree-LSTMs:

y↑_{1:n} = Tree-LSTM↑(x_{1:n})    (2.30)
y↓_{1:n} = Tree-LSTM↓(x_{1:n})    (2.31)

where x_{1:n} are the input vectors of the nodes in the tree, and y↑_{1:n} and y↓_{1:n} are the output vectors of the bottom-up and top-down Tree-LSTMs, respectively.
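The following numpy sketch computes single nodes of the Child-Sum Tree-LSTM following Equations 2.23–2.29, with toy dimensions and random parameters; a full encoder would apply this computation bottom-up over the whole tree.

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_hid = 4, 3

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    params = {g: (rng.normal(size=(d_hid, d_in)),
                  rng.normal(size=(d_hid, d_hid)),
                  np.zeros(d_hid)) for g in "fiou"}

    def child_sum_node(x_j, child_h, child_c):
        """One Child-Sum Tree-LSTM node; child_h/child_c hold the dependents' states."""
        h_tilde = np.sum(child_h, axis=0) if child_h else np.zeros(d_hid)    # (2.23)
        Wf, Uf, bf = params["f"]
        f_jk = [sigmoid(Wf @ x_j + Uf @ h_k + bf) for h_k in child_h]        # (2.24)
        gate = {g: W @ x_j + U @ h_tilde + b for g, (W, U, b) in params.items()}
        i_j = sigmoid(gate["i"])                                             # (2.25)
        o_j = sigmoid(gate["o"])                                             # (2.26)
        u_j = np.tanh(gate["u"])                                             # (2.27)
        c_j = i_j * u_j + sum(f * c for f, c in zip(f_jk, child_c))          # (2.28)
        h_j = o_j * np.tanh(c_j)                                             # (2.29)
        return h_j, c_j

    # Two leaf dependents feed into their head node.
    leaf1 = child_sum_node(rng.normal(size=d_in), [], [])
    leaf2 = child_sum_node(rng.normal(size=d_in), [], [])
    head_h, head_c = child_sum_node(rng.normal(size=d_in),
                                    [leaf1[0], leaf2[0]], [leaf1[1], leaf2[1]])
    print("head representation:", head_h)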
2.4.7 Graph Neural Networks

A tree is a particular case of a graph, and the general graph structure has much broader applications. Therefore, there has been a lot of work on modeling graph structures with Graph Neural Networks (GNNs); we refer the reader to Wu et al. (2020) for a comprehensive survey. Here, we introduce the variant we use in our experiments, the Graph Attention Network (GAT) by Veličković et al. (2017), which improves upon the Graph Convolutional Network (GCN) by Kipf and Welling (2016).

Unlike the sequence or tree structure, a graph has no beginning or end; therefore, we cannot propagate the information in one go as in the sequential LSTM and the Tree-LSTM. Instead, every node in a connected graph has neighbors, and the GNN propagates the information of every node to its neighbors simultaneously and repeats this n times to propagate further. Each propagation uses a separate GNN layer, and the model can thus only send messages a limited number of hops away, which is a compromise for processing more complex structures compared to the sequential LSTM or Tree-LSTM. In the basic GCN model, the representation of a node i at the (l+1)-th layer is based on the representations of its neighbors at the l-th layer as follows:

z_j^(l) = W^(l) · h_j^(l)    (2.32)
α_{i,j} = 1 / √(|N(i)| · |N(j)|)    (2.33)
h_i^(l+1) = σ( Σ_{j∈N(i)} α_{i,j} · z_j^(l) )    (2.34)

where N(i) are the direct neighbors of i, z_j^(l) is the transformed representation of a neighbor node j at the previous layer, and α_{i,j} is the normalization term for calculating the average.

When computing the representation of a node, the GCN averages the information from its neighbors, while the GAT takes into account the importance of each neighbor by calculating the attention from the node to its neighbors, replacing Equation (2.33) with:

α_{i,j} = attention(h_i^(l), h_{1:n}^(l))_j    (2.35)

We use the following abstract notation for the GAT:

y_{1:n} = GAT(x_{1:n})    (2.36)

where x_{1:n} and y_{1:n} are the input and output vectors for all the nodes in the graph.

2.5 Dependency Length Minimization

The recent development of comparable treebanks for a wide range of languages across the typological spectrum, particularly the UD (Nivre et al., 2016b), has made it possible to address some long-standing hypotheses regarding a functional explanation of linguistic universals on word order. Several recent studies (Liu, 2008; Futrell et al., 2015a) have used evidence from treebanks across languages to address what is arguably the most prominent hypothesis of a functionally motivated universal constraint, the Dependency Length Minimization (DLM) hypothesis, which can be traced back to Behaghel (1932). The DLM hypothesis states that we tend to arrange words so as to minimize the overall length of the dependency arcs in the linear order, in order to reduce the cognitive load of understanding (parsing) the sentence. The effect of DLM shows in utterances where multiple alternative word orders are valid; it also influences the preferences of word order in the long-term development of a language.

Liu (2008) performed the first cross-treebank study to compare the actual length of dependencies, for 20 languages, against the length that results when the dependency structures are linearized in random ways. The results indicate that languages indeed tend to minimize the dependency distance. Futrell et al. (2015a) presented a more recent expansion of this type of treebank study to a set of 37 languages (which they argue to be the first comprehensive analysis covering a broad range of typologically diverse languages), presenting comparisons of the real dependency trees from the treebanks with random re-orderings of the same dependency structures. The analysis shows that the real dependency length is indeed significantly shorter than chance across all 37 analyzed languages. This result corroborates findings from a broad range of empirical studies that are typologically less comprehensive (Gildea and Temperley, 2010; Gulordava and Merlo, 2015; Gulordava et al., 2015). This type of cross-treebank study has prompted a fair number of expansions and discussions regarding the typological implications (e.g., Jiang and Liu (2015); Chen and Gerdes (2018); Temperley and Gildea (2018)).

The studies mentioned above (Liu, 2008; Futrell et al., 2015a; Gulordava, 2018) have only focused on projective sentences and ignored non-projective sentences, which constitute a substantial proportion in many languages with relatively free word order. A series of studies (Ferrer-i-Cancho, 2006; Ferrer-i-Cancho and Gómez-Rodríguez, 2016; Gómez-Rodríguez and Ferrer-i-Cancho, 2017) focused on the relation between crossing arcs and dependency length. They concluded that the scarcity of non-projective sentences in natural languages is a direct outcome of DLM, since crossing arcs generally increase the dependency length of an otherwise projective sentence.
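As a minimal sketch of the quantity under discussion, the following code computes the total dependency length of a linearized tree as the sum over all arcs of the distance between head and dependent positions, and compares it against unconstrained random permutations of a toy sentence. The cited studies use more carefully controlled random baselines; the head indices below are invented for illustration.

    import random

    def total_dependency_length(heads):
        """heads[i] is the 1-based head position of the word at position i+1 (0 = root)."""
        return sum(abs((i + 1) - h) for i, h in enumerate(heads) if h != 0)

    # Head indices of a toy seven-word sentence in its real order.
    real = [3, 3, 4, 0, 6, 4, 4]
    print("real order:", total_dependency_length(real))

    # Random reorderings keep the same dependency structure but permute the word positions.
    random.seed(0)
    positions = list(range(1, len(real) + 1))
    lengths = []
    for _ in range(1000):
        random.shuffle(positions)
        new_pos = {old: new for old, new in zip(range(1, len(real) + 1), positions)}
        shuffled = [0] * len(real)
        for old, h in enumerate(real, start=1):
            shuffled[new_pos[old] - 1] = 0 if h == 0 else new_pos[h]
        lengths.append(total_dependency_length(shuffled))
    print("mean over random reorderings:", sum(lengths) / len(lengths))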
Chapter 3
Dependency Tree Linearization

The task of linearization is to arrange the nodes of a graph structure into a sequence. In our context, the input is an unordered dependency tree, and the output is a sentence. Linearization on its own may seem somewhat artificial, since it is uncommon to have the dependency tree structure of a sentence while its order is unknown. However, it could be an essential step for many real-world tasks. For example, in the deep track of several editions of the surface realization shared task (Belz et al., 2011; Mille et al., 2018, 2019, 2020), the input is a deep-syntactic structure which contains only content words connected by predicate-argument relations (Meyers et al., 2004; Palmer et al., 2005).

Figure 3.1 illustrates our complete pipeline for the surface realization shared task for both the shallow and the deep track. It consists of five modules: (1) linearization aims to find the surface order of a tree; (2) completion is the process of transforming the deep tree into a shallow tree by generating the absent function words; (3) inflection is the task of generating the word form based on the lemma and the syntactic and morphological information; (4) contraction applies to a few languages where several tokens are contracted into one; (5) detokenization turns a list of words into text with proper spacing between the words and punctuation marks. The code for the pipeline with all the modules is available at https://github.com/EggplantElf/IMSurReal.

This chapter focuses on the first module in the pipeline, dependency tree linearization. This module aims to find the optimal order of an unordered dependency tree, be it shallow or deep; it takes a tree as input and produces a sequence of tokens from the tree. A typical linearization model includes two major components: encoding and decoding, where the tokens are encoded with the tree information and then decoded into a sequence that forms a natural language sentence.

[Figure 3.1: A full pipeline for the surface realization shared task, including five modules: (1) linearization, (2) completion, (3) inflection, (4) contraction, and (5) detokenization.]

3.1 Encoding

We explore three models to encode the input unordered dependency tree: sequential LSTM, Tree-LSTM, and GAT. These three models are designed for structures with increasing generality, i.e., sequence, tree, and graph.

3.1.1 Token

Before modeling the structure, we first obtain a representation for each token in the tree, which serves as the common ground for all the structural encoders. Table 3.1 shows an example sentence from the treebank in the CoNLL-U format. Each row contains the information about one token in the sentence, including the token index, word form, lemma, Universal Part-of-Speech (UPOS), Language-specific Part-of-Speech (XPOS), a list of morphological features, the index of the head, and the dependency relation.

ID  Word   Lemma  UPOS   XPOS  Morphological Features             Head  Label
1   From   from   ADP    IN    _                                  3     case
2   the    the    DET    DT    Definite=Def|PronType=Art          3     det
3   AP     AP     PROPN  NNP   Number=Sing                        4     obl
4   comes  come   VERB   VBZ   Mood=Ind|Number=Sing|Person=3|...  0     root
5   this   this   DET    DT    Number=Sing|PronType=Dem           6     det
6   story  story  NOUN   NN    Number=Sing                        4     nsubj
7   :      :      PUNCT  :     _                                  4     punct

Table 3.1: An example sentence in the CoNLL-U format.
Among these entries, the token index and the head index indicate the correct order of the words in the sentence; they are available for training but scrambled for testing. The word form is the target of morphological inflection, which is also not available for linearization. Therefore, we use the lemma, UPOS, morphological features, and dependency label as features to encode the token; XPOS is left out, since it contains different information in different treebanks, which would make a cross-lingual comparison more difficult.

We run a bidirectional LSTM over the embeddings of the characters in the lemma (denoted as c_{1:p}) and a bidirectional LSTM over the morphological tags (denoted as m_{1:q}):

v_i^(char) = →LSTM_char(c_{1:p})_p + ←LSTM_char(c_{1:p})_1    (3.1)
v_i^(morph) = →LSTM_morph(m_{1:q})_q + ←LSTM_morph(m_{1:q})_1    (3.2)

The LSTM output vectors for the characters and the morphological tags are then summed together with the embeddings of the other features, i.e., lemma, UPOS, and dependency relation, to form the token-level representation:

v_i^∘ = e_i^(lemma) + e_i^(upos) + e_i^(dep) + v_i^(char) + v_i^(morph)    (3.3)

Since the training data are generally rather small (thousands of sentences), we apply word dropout during training to improve the robustness against unknown words. Concretely, during training we set the lemma representation to a zero vector with a probability of 1%, so the model is forced to rely on other features to make the prediction.

The token-level encoding can also be viewed as the simplest way to encode the tree, ignoring the syntactic information entirely and modeling the tree as a bag of words. Such an encoding generally leads to poor performance, but it serves as a baseline in the ablation study to illustrate the importance of syntactic information.

3.1.2 Sequence

Since Sequence-to-Sequence (Seq2Seq) models gained popularity with their success in machine translation, they have also been widely adopted in many other tasks, even when the input is not a sequence in nature, including the linearization task (Konstas et al., 2017; Elder and Hokamp, 2018; Elder et al., 2020).

To use the Seq2Seq model to represent the tree structure, we first need to transform the tree into a sequence that still includes all the information about the tree. We denote this process as serialization. Note that the term sounds quite similar to linearization, but the two have very different meanings in our context. Serialization is a pre-processing step that flattens the input unordered tree into a sequence with a depth-first, pre-order traversal, where the head is visited first and the dependents are visited in arbitrary order. Linearization, in contrast, is the task the model is trying to solve, where the order of the output sequence is the target. We follow Konstas et al. (2017) in using scoping brackets to establish an unambiguous mapping between the tree and the sequence, so that the tree information is not lost in the serialization. To improve the model's robustness to the random serialization, we use a new random serialization of the same tree every time it is encountered during training.
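As a minimal sketch of this serialization step (using the example tree shown in Figure 3.2 below and a simplified bracketing convention in the spirit of Konstas et al. (2017)), the following code draws a different random serialization of the same unordered tree on every call:

    import random

    # An unordered dependency tree given as head -> list of dependents.
    tree = {"let": ["get", "us"], "get": ["this", "over"], "this": ["with"],
            "us": [], "over": [], "with": []}

    def serialize(node, rng):
        """Depth-first pre-order traversal; dependents in random order, scoped by brackets."""
        deps = tree[node][:]
        rng.shuffle(deps)
        if not deps:
            return [node]
        out = [node, "("]
        for d in deps:
            out += serialize(d, rng)
        out.append(")")
        return out

    rng = random.Random(1)
    # A new random serialization of the same tree can be drawn at every training epoch.
    print(" ".join(serialize("let", rng)))
    print(" ".join(serialize("let", rng)))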
Figure 3.2 illustrates different ways to represent an input dependency tree as a sequence: (1) and (2) are two different random serializations of the tree with scoping brackets; (3) is a random-order traversal of the tree without scoping brackets, which does not fully encode the tree information, since we cannot unambiguously reconstruct the tree from the sequence; (4) is a completely random sequence, i.e., not necessarily a traversal of the tree in any order, and it thus completely loses the tree information.

(1): let ( us get ( over this ( with ) ) )
(2): let ( get ( this ( with ) over ) us )
(3): let get this with over us
(4): let get with us over this

Figure 3.2: Different ways to serialize a dependency tree as a sequence.

Having transformed the tree into a sequence, we can use a bidirectional LSTM to encode the sequence from left to right and from right to left. This way, the tokens in the sequence are encoded with a proxy of the tree information. It is worth noting that, although the tree can be unambiguously transformed into a sequence, the LSTM model does not necessarily have the ability to understand the tree structure behind it. We have shown in Yu et al. (2019d) that the LSTM can recognize the tree structure to some extent, but it cannot truly generalize when the model is tested on larger trees than it was trained on. However, in this task the training and testing sentences are generally of similar sizes; therefore, the limited generalization does not pose a severe problem.

3.1.3 Tree-LSTM

To represent the tree nodes with better generalization ability, we adopt the Tree-LSTM model (Tai et al., 2015) to explicitly encode the dependency tree. The original Tree-LSTM propagates information only in the bottom-up direction, i.e., from the dependents to the heads, which does not encode all hierarchical information in each node, since there is no information flow from the ancestors to the descendants. We therefore apply a top-down pass to propagate information from the head to the dependents, as in Miwa and Bansal (2016). Since each dependent has only one head, unlike in the bottom-up pass, we can use a standard sequential LSTM to encode the paths from the root to each leaf node. For each node, we feed its bottom-up vector v↑ into the hidden state of its head to obtain the hidden state of the current node, and the output is the top-down vector v↓.

Miwa and Bansal (2016) perform the two passes independently, i.e., both LSTMs take v∘ as input and produce v↑ and v↓ as outputs, similar to the standard sequential bidirectional LSTM (Graves and Schmidhuber, 2005). However, two independent passes cannot pass the information of all tokens to all other tokens in the tree: each token only receives information from its ancestors and descendants and is thus unaware of its siblings, which is crucial information for linearization. Therefore, our model performs the bottom-up pass first and uses its output v↑ as the input for the top-down pass to obtain v↓, as illustrated in Figure 3.3. The final representation is the sum of the token-level vector and the two Tree-LSTM outputs:

v↑_i = Tree-LSTM↑(v∘_{0:n})_i    (3.4)
v↓_i = Tree-LSTM↓(v↑_{0:n})_i    (3.5)
x_i = v∘_i + v↑_i + v↓_i    (3.6)

[Figure 3.3: A schematic illustration of the bidirectional Tree-LSTM: (a) the bottom-up LSTM and (b) the top-down LSTM.]
In this way, all tokens in the tree can be accessed by all other tokens, since any two tokens have a common ancestor, and the information of one token can first be passed up to the common ancestor and then down to the other token. Figure 3.4 illustrates the information flow of our bidirectional model in a dependency tree, where the red dotted arrows indicate the bottom-up pass and the blue dashed arrows indicate the top-down pass. We highlight how node 8 influences node 4: its representation v∘_8 is first propagated up to the lowest common ancestor v↑_2, then goes down to v↓_4.

[Figure 3.4: An illustration of the information flow in the encoder, where the red dotted arrows represent the bottom-up pass and the blue dashed arrows represent the top-down pass. The solid arrows illustrate the information flow from node 8 to node 4.]

3.1.4 Graph Neural Networks

Graph neural networks are typically used to encode graph structures beyond trees. Nevertheless, we apply them to encode the input dependency tree. We implement the Graph Attention Network (GAT) model described in §2.4.7. Based on preliminary experiments, we set the number of GAT layers to 3, i.e., each node is aware of its neighbors up to three hops away. This is a compromise between the model's expressiveness and parsimony: too few layers would leave the nodes with insufficient context information, while too many layers would bloat the model parameters and cause overfitting.

Let us take the example in Figure 3.4 to compare the Tree-LSTM and the GAT. The Tree-LSTM has two sets of parameters, i.e., the bottom-up and the top-down submodule. When passing the information from node 8 to node 4, the bottom-up submodule is used twice for the information to reach node 2, and the top-down submodule is used once to pass it down to node 4. In contrast, the GAT in our setting has three sets of parameters, i.e., the three layers, which can pass information three steps away, just enough to get from node 8 to node 4, where each step involves one GAT layer. However, it is not possible to pass the information further on to node 7. The GAT therefore has a limited scope of structural context, and its node representations may not be aware of the full graph.

3.2 Divide-and-Conquer Strategy

After encoding the tokens, we could directly generate the sequence of the tree with a decoder. However, it is not trivial to decode long sentences, since the total search space is the factorial of the number of tokens. The divide-and-conquer strategy is commonly used to break down a large problem into several smaller problems and solve them separately in order to reduce the problem's complexity. Following Bohnet et al. (2010), we employ this strategy by recursively traversing the tree and solving the linearization task for each local subtree, where each subtree consists of a head and its direct dependents.
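As a minimal sketch of the divide step (toy data structures; the actual ordering of each subtree is left to the decoder of §3.3), the following code splits a tree given as a head map into one local ordering problem per head:

    # A toy unordered dependency tree: token id -> head id (0 is the artificial root).
    heads = {1: 2, 2: 0, 3: 4, 4: 2, 5: 4, 6: 2}

    def subtrees(heads):
        """Split the tree into one ordering subproblem per head: the head plus its dependents."""
        by_head = {}
        for tok, h in heads.items():
            by_head.setdefault(h, []).append(tok)
        # The artificial root (head 0) is skipped here for simplicity.
        return {h: [h] + deps for h, deps in by_head.items() if h != 0}

    for head, tokens in subtrees(heads).items():
        print("order these tokens locally:", tokens)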
After obtaining the subtrees' local orders (by the decoder in §3.3), there are multiple possible ways to combine them into a total order. Figure 3.5 gives an example, where (a) and (b) are the linearized subtrees, and (c), (d), and (e) are three different fully linearized trees that respect the subtrees' local orders. Out of all possibilities, there is only one projective tree, i.e., one without crossing arcs, which is (c) in the example. Therefore, we make a projectivity assumption on the output to complete the divide-and-conquer process. This assumption turns out to be a practical inductive bias, since the majority of sentences in most languages are projective or contain very few non-projective arcs; see the statistics in Table 4.1.

[Figure 3.5: Multiple possible full tree orders from the same subtree orders, where (a) and (b) are ordered subtrees and (c), (d), and (e) are possible full ordered trees, among which only (c) is projective.]

Whether to employ the divide-and-conquer strategy is a trade-off between decoding efficiency and the coverage of exactly correct solutions. On the one hand, the strategy massively reduces the search space; on the other hand, a small fraction of correct trees becomes impossible to produce, since they lie outside the reduced search space.

3.3 Decoding

We experiment with three types of decoding algorithms: incremental decoding, graph-based decoding, and transition-based decoding. In the first method, we add tokens incrementally to build partial sequences and use beam search to explore several hypotheses. In the second method, we treat the task as a Traveling Salesman Problem (TSP) and decode the sequence with a TSP solver. In the third method, we design a transition system that swaps the tokens of an initial sequence step by step until reaching the final word order.

3.3.1 Incremental Decoding

Left-to-Right Generation with Beam Search

We start with the incremental decoding method, largely following Bohnet et al. (2010). The decoder takes a set of tokens as input, and the sequence is initially empty. At each decoding step, we select one token from the set of candidate tokens and add it to the end of the partial sequence, until the candidate set is empty. To address the error propagation problem of greedy search, the decoder applies beam search to extend the sequence; at each step, we keep at most k partial sequences as the current hypotheses.

We use an LSTM to represent the partial sequences and adopt the pointer network (Vinyals et al., 2015b) to score the new expansions. At each step, we calculate the attention score between the LSTM output of the current partial sequence and the remaining tokens that are "pointed" to. The unnormalized attention score (without softmax) is the incremental score for adding that token.

Algorithm 1 describes the left-to-right decoding together with the LSTM scoring model. We first divide the input tree T into subtrees, each consisting of a head h and its dependents. For each subtree, we initialize the agenda with an empty sequence (line 3). A sequence is represented by an LSTM state (line 5), and we also keep track of the total score of the sequence (line 4). At each step, for each sequence in the agenda, we use a pointer network to calculate the unnormalized attention scores between the LSTM state and all the remaining tokens as the scores for attaching each token to the end of the sequence (ATTEND in line 12, where x_t is the vector representation of the token t; all scores are calculated in one go in the actual model, we distribute them over the loop only for readability). We then create a new sequence for each possible attachment (line 11, where ⊕ denotes concatenation), and the score of each new sequence is incremented by the unnormalized attention score (line 12). We also update the corresponding LSTM state of that sequence by adding the representation of the attached dependent as input (line 13).
The new sequences are then added to the beam for the next step (line 14). If the number of new sequences in the beam exceeds the beam size, we keep only the highest-scoring ones for further expansion (line 17). When all the sequences are finished, we take the highest-scoring full sequence as the linearization of the subtree (line 19). Finally, after linearizing all subtrees, we combine them into a full sentence as the output (line 21).

Algorithm 1 Left-to-Right linearization.
 1:  for all h ∈ T do                                ▷ divide the tree into subtrees
 2:      T_h = {h} ∪ dependents(h)                   ▷ all tokens in the subtree
 3:      seq = []                                    ▷ initial sequence
 4:      seq.score = 0
 5:      seq.state = InitialLSTMState()
 6:      agenda = [seq]                              ▷ initial agenda
 7:      while |seq| < |T_h| do
 8:          beam = []
 9:          for all seq ∈ agenda do
10:              for all t ∈ T_h \ seq do            ▷ a new sequence for each remaining token
11:                  new_seq = seq ⊕ t               ▷ attach to the right
12:                  new_seq.score += ATTEND(seq.state, x_t)
13:                  new_seq.state = seq.state.addInput(x_t)
14:                  beam.append(new_seq)
15:              end for
16:          end for
17:          agenda = N-BEST(beam, beam-size)        ▷ keep N-best sequences for the next step
18:      end while
19:      S_h = N-BEST(agenda, 1)                     ▷ best sequence for T_h
20:  end for
21:  S = COMBINESUBTREES(S_1, S_2, ..., S_n)         ▷ combine subtrees into the full tree
22:  return S

Note that the left-to-right d