Computational Models of Word Order

Von der Fakultät Informatik, Elektrotechnik und Informationstechnik der Universität Stuttgart zur Erlangung der Würde eines Doktors der Naturwissenschaften (Dr. rer. nat.) genehmigte Abhandlung.

Vorgelegt von Xiang Yu aus Hunan, China

Hauptberichter: Prof. Dr. Jonas Kuhn
1. Mitberichter: Prof. Dr. Ngoc Thang Vu
2. Mitberichter: Prof. Dr. Leo Wanner

Tag der mündlichen Prüfung: 07.06.2021

Institut für Maschinelle Sprachverarbeitung der Universität Stuttgart, 2022

I hereby declare that I have created this work completely on my own and used no other sources or tools than the ones listed, and that I have marked any citations accordingly.

Hiermit versichere ich, dass ich die vorliegende Arbeit selbständig verfasst und keine anderen als die angegebenen Quellen und Hilfsmittel benutzt sowie Zitate kenntlich gemacht habe.

Berlin, May 20, 2022 (Place, Date)
Xiang Yu

Contents

Abstract
Überblick

1 Introduction
  1.1 Research Gaps and Contribution
  1.2 Outline
  1.3 Publications

2 Background
  2.1 Dependency Grammar
    2.1.1 Dependency Grammar vs. Constituency Grammar
    2.1.2 Projectivity
    2.1.3 Parsing and Linearization
  2.2 Dependency Parsing Methods
    2.2.1 Graph-based Parsing
    2.2.2 Transition-based Parsing
    2.2.3 Non-Projective Parsing
  2.3 Linearization Methods
    2.3.1 Syntax-free Linearization
    2.3.2 Syntax-based Linearization
  2.4 Machine Learning Models
    2.4.1 Perceptron
    2.4.2 Multi-Layer Perceptron
    2.4.3 Distributed Representation
    2.4.4 Recurrent Neural Networks
    2.4.5 Sequence-to-Sequence Model and Attention Mechanism
    2.4.6 Tree-LSTM
    2.4.7 Graph Neural Networks
  2.5 Dependency Length Minimization

3 Dependency Tree Linearization
  3.1 Encoding
    3.1.1 Token
    3.1.2 Sequence
    3.1.3 Tree-LSTM
    3.1.4 Graph Neural Networks
  3.2 Divide-and-Conquer Strategy
  3.3 Decoding
    3.3.1 Incremental Decoding
      Left-to-Right Generation with Beam Search
      Head-First Generation Order
      Training with Beam Search
      Ensemble
    3.3.2 Graph-based Decoding
      Scoring Model
      TSP Solver
      Training Objective
      Decoding with Constraints
    3.3.3 Transition-based Decoding
      Transition System
      Scoring Model
      Producing Non-Projective Sentences
  3.4 Summary

4 Linearization Experiments
  4.1 Data and Evaluation Metrics
  4.2 Encoders
  4.3 Decoders
  4.4 Model Errors vs. Search Errors
  4.5 Non-Projectivity
  4.6 Training Data Sizes
  4.7 Error Analysis
  4.8 Case Studies
    4.8.1 English
    4.8.2 German
    4.8.3 Chinese
  4.9 Summary

5 Dependency Length Minimization
  5.1 Experimental Settings
    5.1.1 Data
    5.1.2 Baselines
    5.1.3 Baseline Dependency Lengths
  5.2 Analyses
    5.2.1 Word Order Constraints
    5.2.2 Dependent Distribution: FREE vs. VAL
    5.2.3 Head Direction: VAL vs. SIDE
    5.2.4 Sibling Ordering: SIDE vs. OBS
  5.3 Experiments with Linearizer
  5.4 DLM and Non-Projectivity
  5.5 Summary

6 Surface Realization Shared Tasks
  6.1 Task Description
  6.2 Full Pipeline
    6.2.1 Linearization
    6.2.2 Function Word Completion
    6.2.3 Morphological Inflection
    6.2.4 Contraction
    6.2.5 Detokenization
  6.3 Data Augmentation
  6.4 Ensemble
  6.5 Shared Task Results
    6.5.1 Evaluation Metrics
    6.5.2 SR'19 Results
    6.5.3 SR'20 Results
  6.6 Summary

7 Conclusion
  7.1 Contributions
  7.2 Future Perspectives

A Detailed Experimental Results

Bibliography

List of Figures

2.1 Examples of a constituency tree and a dependency tree.
2.2 Dependency trees of the same meaning in different languages.
2.3 An example non-projective sentence.
3.1 A full pipeline for the surface realization shared task, including five modules: (1) linearization, (2) completion, (3) inflection, (4) contraction, and (5) detokenization.
3.2 Different ways to serialize a dependency tree as a sequence.
3.3 A schematic illustration of the bidirectional Tree-LSTM.
3.4 An illustration of the information flow in the encoder, where the red dotted arrows represent the bottom-up pass and the blue dashed arrows represent the top-down pass. The solid arrows illustrate the information flow from node 8 to node 4.
3.5 Multiple possible full tree orders from the same subtree orders, where (a) and (b) are ordered subtrees, and (c), (d), and (e) are possible full ordered trees, among which only (c) is projective.
3.6 The matrix and graph views of the linearization task as a TSP.
3.7 An illustration of the swap model. (1) shows a configuration [1 2 | 3 4 5 6], where the stack contains 1 and 2, the buffer contains 3, 4, 5, and 6, and the model predicts shift to move 3 from the buffer to the stack; (2) is the resulting configuration, and the model predicts swap, which moves 3 back to the buffer behind 4, as shown in (3). The solid black arrows represent the computation already done in the previous steps, and the dashed red arrows represent the new computation needed for the current step.
3.8 A pair of projective and non-projective sentences.
3.9 An overview of the optimal linearization system.
4.1 An example to demonstrate the different evaluation metrics, where (a) is the reference sentence and (b) is the predicted sentence.
4.2 The improvement of full tree match after reordering with respect to the percentage of non-projective arcs in each treebank.
4.3 BLEU score with respect to the sentence length and the non-projective arc rate.
4.4 BLEU scores of models trained on different data sizes (on a logarithmic scale), where (a) shows the average and (b) shows individual treebanks.
4.5 An example of two types of error cases, where the correct order should be [1, 2, 3, 4, 5, 6]; head errors are marked in red (d and e), sibling errors in blue (b and c).
4.6 Frequency of each dependency relation in each language, where the treebanks are grouped by language families and subfamilies, and the relations are grouped by the structural categories of the dependent.
4.7 Head and sibling error rates of each dependency relation in each language.
4.8 Difference of errors between the lexicalized model and the delexicalized model, which illustrates the benefit of lexicalization.
5.1 Example dependency trees and their dependency lengths, adapted from Futrell et al. (2015a). The second sentence is preferred since it has shorter DL and lower cognitive burden.
5.2 An example tree and the results of four baseline linearization methods, where the label represents the size of the subtree.
5.3 Average DL vs. sentence length across 53 treebanks.
5.4 All treebanks characterized by the two types of word order freedom.
5.5 Percentage of trees where the real order has shorter DL than flipping one dependent.
5.6 Percentage of trees where the real order has shorter DL than swapping two siblings.
5.7 Average DL vs. sentence length across 53 treebanks. The additional results of DELEX and STRUCT are from the trained linearization models, while the others are directly from manipulation of the original treebank.
5.8 The percentage of non-projective trees and the DL of the fully random linearization, compared to the projective random linearization and the real data.
5.9 A pair of projective and non-projective sentences in German, and their DLs. The non-projective sentence with the extraposed relative clause has shorter DL.
5.10 The comparison between the non-projective sentences and their projectivized versions in terms of DL.
6.1 A full pipeline for the surface realization shared task, including five modules: (1) linearization, (2) completion, (3) inflection, (4) contraction, and (5) detokenization.
6.2 An example of the completion process to the right side, where the right arrows illustrate the forward LSTM and the left arrows the backward LSTM.
6.3 An illustration of the contraction model.
6.4 Sentence length distribution in en_lines before and after filtering.

List of Tables

2.1 Transitions of the ARCSTANDARD system.
2.2 Transitions of the ARCSTANDARDSWAP system.
3.1 An example sentence in the CoNLL-U format.
3.2 The SWAP transition system.
3.3 The transitions to reorder the projective tree in Figure 3.8.
4.1 Information about the treebanks used in the experiments, including the language (sub-)family, data size, average sentence length, average non-leaf subtree size, and the percentage of non-projective sentences and non-projective arcs.
4.2 Different encoders combined with the subtree and full tree TSP decoders, evaluated on the BLEU score and the exact matching rates of subtrees and full trees.
4.3 Different versions of the sequence encoders combined with the subtree and full tree TSP decoders, evaluated on the BLEU score and the exact matching rates of subtrees and full trees.
4.4 Performance of different decoders on the subtree and full tree level.
4.5 Model and search error rates of the TSP-sub and TSP-full decoders.
4.6 TSP decoders on the full tree and subtree level, and additionally with reordering, evaluated on the BLEU score and the exact matching rates of subtrees and full trees.
4.7 Percentage of sentences that are changed for the better or for the worse by the reordering.
4.8 Distribution of deviation categories in the prediction.
6.1 Numbers of sentences in the datasets of SR'19 and SR'20, where the in-domain, out-of-domain, and automatically parsed treebanks are used in both shared tasks, and the automatically parsed Wikipedia treebanks are newly introduced in SR'20.
6.2 An example sentence for the shallow track (T1).
6.3 An example sentence for the deep track (T2).
6.4 Automatic evaluation results of the shallow track (T1) and the BLEU difference with the best system among other participants for each treebank.
6.5 Automatic evaluation results of the deep track (T2) and the BLEU difference with the best system among other participants for each treebank.
6.6 Comparison of the average BLEU scores of our SR'19 results and our SR'20 results in the closed (2020a) and open (2020b) subtracks.
6.7 Automatic evaluation results in the shallow track (T1) of our system in SR'19 and in the closed subtrack (2020a) and open subtrack (2020b) of SR'20.
6.8 Automatic evaluation results in the deep track (T2) of our system in SR'19 and in the closed subtrack (2020a) and open subtrack (2020b) of SR'20.
A.1 Detailed BLEU scores for the experiment on encoders in §4.2.
A.2 Detailed EMF scores for the experiment on encoders in §4.2.
A.3 Detailed EMS scores for the experiment on encoders in §4.2.
A.4 Detailed BLEU scores for the experiment on decoders in §4.3.
A.5 Detailed EMF scores for the experiment on decoders in §4.3.
A.6 Detailed EMS scores for the experiment on decoders in §4.3.
A.7 Detailed BLEU scores for the experiment on training data sizes in §4.6.
A.8 Detailed EMF scores for the experiment on training data sizes in §4.6.
A.9 Detailed EMS scores for the experiment on training data sizes in §4.6.

List of Abbreviations

AMR Abstract Meaning Representation
DL Dependency Length
DLM Dependency Length Minimization
GAT Graph Attention Network
GCN Graph Convolutional Network
GNN Graph Neural Network
LSTM Long Short-Term Memory
MLP Multi-layer Perceptron
MST Maximum Spanning Tree
NLP Natural Language Processing
RNN Recurrent Neural Network
Seq2Seq Sequence-to-Sequence
TSP Traveling Salesman Problem
UD Universal Dependencies
UPOS Universal Part-of-Speech
VRP Vehicle Routing Problem
XPOS Language-specific Part-of-Speech

Abstract

A sentence in our mind is not a simple sequence of words but a hierarchical structure. We put the sentence into linear order when we utter it for communication. Linearization is the task of mapping the hierarchical structure of a sentence into its linear order.

Our work is based on dependency grammar, which models the dependency relations between words; the resulting syntactic representation is a directed tree structure. The popularity of dependency grammar in Natural Language Processing (NLP) benefits from its separation of structural order and linear order and its emphasis on syntactic functions. These properties facilitate a universal annotation scheme covering the wide range of languages used in our experiments.

We focus on developing a robust and efficient computational model that finds the linear order of a dependency tree. We take advantage of the expressive power of deep learning models to robustly encode the syntactic structures of typologically diverse languages. We take a graph-based approach that combines a simple bigram scoring model and a greedy decoding algorithm to search for the optimal word order efficiently. We use the divide-and-conquer strategy to reduce the search space, which restricts the output to be projective, and then resolve this restriction with a transition-based post-processing model.

Apart from the computational models, we also study word order from a quantitative linguistic perspective. We examine the Dependency Length Minimization (DLM) hypothesis, which is believed to be a universal factor that affects the word order of every language. It states that human languages tend to order words so as to minimize the overall length of the dependency arcs, which reduces the cognitive burden of speaking and understanding. We demonstrate that DLM can explain every aspect of word order in a dependency tree, such as the direction of the head, the arrangement of sibling dependents, and the existence of crossing arcs (non-projectivity). Furthermore, we find that DLM not only shapes the general word order preferences but also motivates the occasional deviations from these preferences.

Finally, we apply our model to the task of surface realization, which aims to generate a sentence from a deep syntactic representation. We implement a pipeline with five steps, (1) linearization, (2) function word generation, (3) morphological inflection, (4) contraction, and (5) detokenization, which achieved state-of-the-art performance.

Überblick

Ein Satz ist in unserem Kopf keine einfache Folge von Wörtern, sondern eine hierarchische Struktur.
Wir bringen den Satz in die lineare Reihenfolge, wenn wir ihn zur Kommunikation aussprechen. Linearisierung ist die Aufgabe, die hierarchische Struktur eines Satzes in seine lineare Ordnung abzubilden.

Unsere Arbeit basiert auf der Dependenzgrammatik, die die Dependenz (Abhängigkeitsbeziehung) zwischen den Wörtern modelliert; die resultierende syntaktische Darstellung ist eine gerichtete Baumstruktur. Die Popularität der Dependenzgrammatik in der Maschinellen Sprachverarbeitung profitiert von ihrer Trennung von struktureller und linearer Ordnung und ihrer Betonung der syntaktischen Funktionen. Diese Eigenschaften ermöglichen ein universelles Annotationsschema, das eine Vielzahl von Sprachen abdeckt, die in unseren Experimenten verwendet werden.

In dieser Arbeit konzentrieren wir uns auf die Entwicklung eines robusten und effizienten Berechnungsmodells, das die lineare Ordnung eines Dependenzbaums findet. Wir nutzen die Ausdruckskraft von Deep-Learning-Modellen, um die syntaktischen Strukturen von typologisch unterschiedlichen Sprachen robust zu kodieren. Wir verwenden einen graphenbasierten Ansatz, der ein einfaches Bigram-Scoring-Modell und einen Greedy-Decodierungsalgorithmus kombiniert, um effizient nach der optimalen Wortreihenfolge zu suchen. Wir verwenden die Divide-and-Conquer-Strategie, um den Suchraum zu reduzieren, was die Ausgabe auf projektive Dependenzbäume beschränkt. Anschließend lösen wir die Einschränkung mit einem transitionsbasierten Nachbearbeitungsmodell auf.

Zusätzlich zu den Berechnungsmodellen untersuchen wir die Wortreihenfolge auch aus einer quantitativen linguistischen Perspektive. Wir untersuchen die Dependency-Length-Minimization-(DLM)-Hypothese, von der angenommen wird, dass sie ein universeller Faktor ist, der die Wortreihenfolge jeder Sprache beeinflusst. Sie besagt, dass menschliche Sprachen dazu neigen, die Wörter so anzuordnen, dass die Gesamtlänge der Dependenzkanten minimiert wird, was die kognitive Belastung beim Sprechen und Verstehen reduziert. Wir zeigen, dass DLM jeden Aspekt der Wortreihenfolge in einem Dependenzbaum erklären kann, wie z. B. die Richtung des Kopfes, die Anordnung von Geschwister-Dependenten und die Existenz von kreuzenden Dependenzkanten (Nicht-Projektivität). Außerdem finden wir, dass DLM nicht nur die allgemeinen Wortordnungspräferenzen formt, sondern auch die gelegentliche Abweichung von den Präferenzen motiviert.

Schließlich wenden wir unser Modell bei der Aufgabe der Oberflächenrealisierung an, die darauf abzielt, einen Satz aus einer tiefen syntaktischen Repräsentation zu erzeugen. Wir implementieren eine Pipeline mit fünf Schritten, (1) Linearisierung, (2) Funktionsworterzeugung, (3) morphologische Flexion, (4) Kontraktion und (5) Detokenisierung, die eine State-of-the-Art-Leistung erzielt.

Acknowledgements

My journey in computational linguistics started with randomly picking up the novel Story of Your Life by Ted Chiang in a library, which describes alien creatures who have no linear conception of time and no linear word order in their language. At that point, I had no idea that I would spend so many years studying language and end up with a dissertation on word order. In retrospect, it was just like the story in the novel.

Throughout my journey at the IMS, my deepest gratitude goes to my advisor Jonas Kuhn. He offered me a research assistant job when I was still an undergraduate student. Later I wrote my Bachelor's, Master's, and Doctoral Theses under his supervision.
He has given me great freedom to explore the topics that I found interesting, and helped me to see the connections between ideas from a higher perspective and to steer through uncertainty. Eventually, some seemingly random pieces of research work intertwined nicely and resulted in this dissertation.

My other advisor Ngoc Thang Vu has also given me a great amount of guidance, in a way complementary to Jonas. He focused on the technical side of machine learning and provided hands-on support. We had many discussions on the details of my models and spent time together debugging my code. He has always encouraged me with his great optimism when I was in doubt about my ideas.

Apart from my two advisors, I have also benefited tremendously from Wolfgang Seeker and Anders Björkelund, both co-supervisors of my student work and thesis as well as my office mates. From them, I learned not only a great amount about parsing and machine learning but also critical thinking in research. I am also very grateful to Agnieszka Faleńska for the cooperation in teaching and research, for plotting all the beautiful figures in our papers, and for being a cheerful and candid friend. Many thanks also to all the friends and colleagues at the IMS with whom I have had countless discussions about research and life, Özlem Çetinoğlu, Kyle Richardson, Simon Tannert, Sean Papay, Daniel Ortega, et al. With all your company, I have enjoyed every day of my journey (except for those few sleepless days before conference deadlines).

A great portion of my dissertation is motivated by the work of two former colleagues from Stuttgart, Bernd Bohnet and Leo Wanner. Special thanks to Leo for reviewing my thesis as a committee member.

Finally, I would like to thank my parents for their constant support of my decisions, my wife Yang for putting up with me, and my son Luke, whose due date was a week after I finished the draft, for forcing me to finally overcome my procrastination.

And now begins the story of your life.

Chapter 1
Introduction

Dependency grammar is a grammar formalism based on the dependency relations between words, pioneered by Tesnière (1959, 2015). Dependency relations such as subject and object link the words in a sentence and build up a tree structure representing the meaning of the sentence. A special trait of dependency grammar is that it is agnostic to the surface order of the sentence. The linear order of the words is not modeled by the dependency structure and plays only a secondary role in the representation of the sentence, as stated in Tesnière (2015, p. 13): "[o]ne must not lose sight of the fact that syntactically, the true sentence is the structural sentence, for which the linear sentence is only an image projected onto the spoken chain, with all the disadvantages associated with flatness."

In Tesnière's theory, when we utter a sentence, we first have a tree-structured representation in mind and transform it into the linear sequence of words by traversing the structure. This process is called linearization. The opposite process is called parsing: we receive a sentence in its linear form and analyze its syntactic structure by identifying the dependency relations among the words.

We have two main focuses in this thesis. The first is to develop a computational model that determines the word order of a dependency tree; the second is to explore, on a higher level, the factor that universally influences the word order in all languages.
1.1 Research Gaps and Contribution

Developing a stable and efficient linearization model. Several linearization systems have been developed over the years, and some have already achieved very strong performance, e.g., Bohnet et al. (2012). However, most of them are based on perceptrons with feature templates, which work well on English but struggle with morphologically rich or low-resource languages. Some linearizers are very slow due to beam search decoding and thus cannot satisfy the speed requirements of real-time generation. In this thesis, we aim to develop a linearization system with stable multilingual performance, thanks to the powerful representation ability of deep learning models, while remaining highly efficient, thanks to a simple decoding algorithm.

To achieve this goal, we explore several linearization algorithms from different paradigms. The first algorithm generates the linear order incrementally by selecting the words from the input tree one after another. We use a Recurrent Neural Network (RNN), in particular the Long Short-Term Memory (LSTM), to calculate the partial sequence score and use beam search to explore multiple hypotheses simultaneously. The second algorithm treats linearization as a Traveling Salesman Problem (TSP), namely finding a traversal path that visits each word exactly once. We use a biaffine model to calculate the scores of all possible bigrams and a greedy TSP solving algorithm to find the approximately optimal traversal path. The third algorithm is a transition system with a stack and a buffer, similar to a transition-based parsing system. It uses two operations: shift to move words from the buffer to the stack and swap to swap the order of two words. Unlike the previous two, where the input is a set of words, this algorithm takes a sequence as input and outputs another sequence.

Impact of dependency length minimization on word order. Apart from developing linearization models for each individual language, we are also interested in understanding what universally shapes the word order in all languages. Since Greenberg's universals, it has been observed that word orders with a shorter distance between the head and the dependent are more frequent (Greenberg, 1963). More recent work (Temperley, 2007; Liu, 2008; Futrell et al., 2015a) has focused on quantifying the dependency length and exploring how it affects word order, which is commonly referred to as the Dependency Length Minimization (DLM) hypothesis.

We follow this line of research to provide further evidence for the hypothesis. Furthermore, we break down the word order into more fine-grained factors, including the direction of the dependency and the pairwise arrangement of sibling dependents. With incremental baselines that start from a random word order, we gradually conform to the word order preferences found in the actual sentences of the treebank collections. In this way, we can explore the effect of DLM on each factor of word order. Moreover, we show that DLM not only shapes the general word order preferences in these factors, it also motivates deviations from such preferences when necessary.

Treatment of non-projective sentences. Projectivity is an important property that emerges from the combination of linear order and structural order in a sentence.
When the words in a sentence are drawn in their linear order and the dependency arcs do not cross each other, it is called a projective sentence or a projective tree; otherwise, it is non-projective. Most languages, such as English and Chinese, have very few non-projective sentences, although theoretically, among all possible arrangements of the words in a sentence, the vast majority are non-projective. Some languages with more flexible word order, such as Latin or Russian, tend to have a much higher percentage of non-projective sentences.

Many linearization and parsing algorithms assume that all the sentences they generate or parse are projective. The main reason is that the projectivity assumption vastly reduces the search space; therefore, projective algorithms typically have lower time complexity than non-projective algorithms. Also, since most sentences are projective, it is more likely to find the correct parse tree or word order in the restricted search space. This is an engineering trade-off, with coverage on the one hand and accuracy and complexity on the other. However, projective algorithms do not work well on the few languages with a large percentage of non-projective sentences. It is thus desirable to find a method that deals with non-projectivity without sacrificing performance on projective sentences.

The projectivity assumption is also commonly imposed in the analysis of DLM in many previous studies, i.e., non-projective sentences are discarded in the experiments. In fact, non-projectivity is an equally if not more important factor of word order, and it should not be overlooked in the analysis. Therefore, we conduct experiments specifically on non-projective sentences and show that non-projectivity as a factor of word order is also strongly influenced by DLM.

1.2 Outline

The remainder of the dissertation is organized as follows:

Chapter 2 introduces the main concepts relevant to the thesis. We start with the framework of dependency grammar, the dichotomy of structural order and surface order, and the tasks of transforming one into the other, namely parsing and linearization. We briefly introduce parsing methods, although the focus of this thesis is on linearization, since many concepts and methods were originally developed for parsing and later adapted for linearization. We then describe several successful models for linearization. As our model is based on neural networks, we also introduce relevant models such as the LSTM, the Tree-LSTM, and Graph Neural Networks. Finally, we describe the Dependency Length Minimization hypothesis, which is a probable explanation of how the word order preferences in natural languages are shaped.

Chapter 3 describes our framework of the computational model for linearization, which consists of two main components, the encoder and the decoder. We present different methods to represent the dependency tree and different decoding algorithms to produce the surface order of the encoded dependency tree. We discuss the pros and cons of our decoder's inductive bias that restricts the surface order to be projective and present a transition-based model to reorder the sentence.

Chapter 4 describes the main experiments in this thesis. We start with a comparison of encoders and decoders to find the optimal combination for the linearization model. With the optimal linearization model, we then analyze its prediction errors from different angles.
From the machine learning perspective, we quantify the extent of model errors and search errors to determine the performance bottleneck. From the linguistic perspective, we look for error patterns with respect to language families and dependency relations. We also conduct a manual inspection of the error cases in a few languages and distinguish word order variation from actual linearization errors.

Chapter 5 extends the linguistic analysis of word order in the previous chapter and explores a hypothesized universal factor that influences word order, Dependency Length Minimization (DLM). We demonstrate the effect of DLM by designing a series of baselines and show that the dependency length gradually decreases from completely random ordering to the actual ordering in the treebank data through several intermediate baselines. We explain each drop in dependency length with respect to a factor of word order and show the universality of DLM in almost all languages and all factors. Finally, we also explore the DLM hypothesis with respect to projectivity, an often overlooked factor in word order.

Chapter 6 turns the focus back to the computational model. We apply the linearization model as part of the pipeline for the shared tasks on surface realization. We describe the other components in the pipeline as well as the data augmentation and ensembling methods used for pushing the performance.

Chapter 7 concludes this thesis. We summarize the contributions of our work, which are mainly (1) developing an efficient and stable computational model for word order, (2) exploring the relationship between word order and dependency length, and (3) a special focus on non-projective trees. We also discuss our work from a broader perspective and envisage an interplay of linearization and parsing as syntax-based machine translation.

1.3 Publications

The main body of this dissertation is based on the following peer-reviewed publications:

• Yu, X., Falenska, A., and Kuhn, J. (2019b). Dependency length minimization vs. word order constraints: An empirical study on 55 treebanks. In Proceedings of the First Workshop on Quantitative Syntax (Quasy, SyntaxFest 2019), pages 89–97, Paris, France. Association for Computational Linguistics.
Chapter 5 is based on this publication. My co-author Agnieszka Falenska was responsible for visualization, Jonas Kuhn was responsible for writing the overview and providing advice.

• Yu, X., Falenska, A., Vu, N. T., and Kuhn, J. (2019c). Head-First Linearization with Tree-Structured Representation. In Proceedings of the 12th International Conference on Natural Language Generation, pages 279–289, Tokyo, Japan. Association for Computational Linguistics.
Chapter 3 is partly based on this publication. My co-author Agnieszka Falenska was responsible for visualization and calibrating baseline models, Ngoc Thang Vu and Jonas Kuhn had advisory roles.

• Yu, X., Falenska, A., Haid, M., Vu, N. T., and Kuhn, J. (2019a). IMSurReal: IMS at the Surface Realization Shared Task 2019. In Proceedings of the 2nd Workshop on Multilingual Surface Realisation (MSR 2019), pages 50–58, Hong Kong, China. Association for Computational Linguistics.
Chapter 6 is partly based on this publication. My co-author Agnieszka Falenska was responsible for visualization and calibrating baseline models, Marina Haid conducted preliminary experiments, Ngoc Thang Vu and Jonas Kuhn had advisory roles.

• Yu, X., Tannert, S., Vu, N. T., and Kuhn, J. (2020a).
Fast and accurate non-projective dependency tree linearization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1451–1462, Online. Association for Computational Linguistics.
Chapter 3 is partly based on this publication. My co-author Simon Tannert was responsible for visualization and conducting preliminary experiments, Ngoc Thang Vu and Jonas Kuhn had advisory roles.

• Yu, X., Tannert, S., Vu, N. T., and Kuhn, J. (2020b). IMSurReal too: IMS in the surface realization shared task 2020. In Proceedings of the Third Workshop on Multilingual Surface Realisation, pages 35–41, Barcelona, Spain (Online). Association for Computational Linguistics.
Chapter 6 is partly based on this publication. My co-author Simon Tannert was responsible for visualization and data augmentation, Ngoc Thang Vu and Jonas Kuhn had advisory roles.

The work in this dissertation has also been shaped by the experience from the following publications during my doctoral studies. Although they are not a core part of the dissertation, they have been part of my project work and have more or less influenced the final product.

• Yu, X. and Vu, N. T. (2017). Character composition model with convolutional neural networks for dependency parsing on morphologically rich languages. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 672–678, Vancouver, Canada. Association for Computational Linguistics.

• Björkelund, A., Falenska, A., Yu, X., and Kuhn, J. (2017). IMS at the CoNLL 2017 UD shared task: CRFs and perceptrons meet neural networks. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 40–51, Vancouver, Canada. Association for Computational Linguistics.

• Yu, X., Falenska, A., and Vu, N. T. (2017). A general-purpose tagger with convolutional neural networks. In Proceedings of the First Workshop on Subword and Character Level Models in NLP, pages 124–129, Copenhagen, Denmark. Association for Computational Linguistics.

• Yu, X., Vu, N. T., and Kuhn, J. (2018). Approximate dynamic oracle for dependency parsing with reinforcement learning. In Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), pages 183–191, Brussels, Belgium. Association for Computational Linguistics.

• Blohm, M., Jagfeld, G., Sood, E., Yu, X., and Vu, N. T. (2018). Comparing attention-based convolutional and recurrent neural networks: Success and limitations in machine reading comprehension. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 108–118, Brussels, Belgium. Association for Computational Linguistics.

• Yu, X., Vu, N. T., and Kuhn, J. (2019d). Learning the Dyck language with attention-based Seq2Seq models. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 138–146, Florence, Italy. Association for Computational Linguistics.

• Yu, X., Vu, N. T., and Kuhn, J. (2020c). Ensemble self-training for low-resource languages: Grapheme-to-phoneme conversion and morphological inflection. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 70–78, Online. Association for Computational Linguistics.

• Dönicke, T., Yu, X., and Kuhn, J. (2020b). Real-valued logics for typological universals: Framework and application.
In Proceedings of the 28th International Conference on Computational Linguistics, pages 3990–4003, Barcelona, Spain (Online). International Committee on Computational Linguistics.

• Dönicke, T., Yu, X., and Kuhn, J. (2020a). Identifying and handling cross-treebank inconsistencies in UD: A pilot study. In Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020), pages 67–75, Barcelona, Spain (Online). Association for Computational Linguistics.

Chapter 2
Background

2.1 Dependency Grammar

Dependency grammar is a grammar formalism that represents the structure of a sentence by modeling the dependency relations among the words. The foundation of modern dependency grammar is laid out by Tesnière (1959, 2015), followed by many theories with similar principles, such as Word Grammar (Hudson, 1984), Functional Generative Description (Sgall et al., 1986), Meaning-Text Theory (Mel'cuk et al., 1988), and Universal Dependencies (Nivre et al., 2016a).

Dependency grammar assumes that the units of the structure are words and that they are connected by binary, asymmetric dependency relations, written t_i →_l t_j, where t_i is the head, t_j is the dependent, and l is the dependency label between them, such as subject or object. If there is a directed path t_i → ... → t_j, then t_i is called an ancestor of t_j and t_j a descendant of t_i, and we say that t_i dominates t_j. In most theories, the dependency structure is a tree, i.e., there is one word as the root, which is the ancestor of all other words in the sentence, and every word except for the root has exactly one head.

2.1.1 Dependency Grammar vs. Constituency Grammar

Another influential school of grammar formalism, constituency grammar (also called phrase structure grammar), proposed by Chomsky (1957), represents the syntactic structure of a sentence with multiple levels of phrases. Figure 2.1 demonstrates the syntactic representation of a sentence in both formalisms. There are several fundamental differences between these two grammar formalisms.

Figure 2.1: Examples of a constituency tree (a) and a dependency tree (b) for the sentence "the dog chases the cat".

The most obvious contrast is that constituency grammar introduces many intermediate nodes that are not present in the sentence's surface form. In contrast, dependency grammar does not assume extra nodes apart from the words, and all nodes are directly linked to each other.

Another, subtler difference is that the word order is coupled with the phrases but decoupled from the dependencies. In the constituency example, the structure of the verb phrase already determines that the noun phrase the cat follows the verb chases. In dependency grammar, however, whether the verb chases precedes or follows its object cat is not encoded in the syntactic representation.

At first glance, the lack of surface order might seem to be an inadequacy of dependency grammar compared to constituency grammar. However, in the context of multilingualism, it turns out to be a crucial advantage. Universal Dependencies (UD) (Nivre et al., 2016b) is an initiative that maintains an extensive collection of treebanks aiming at a universal annotation guideline of dependency grammar across languages. It now contains more than 100 typologically diverse languages and is still rapidly expanding.
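To make the notation above concrete, a dependency tree can be stored as a head index and a relation label per token, mirroring the HEAD and DEPREL columns of the CoNLL-U format used in UD treebanks. The following is a minimal sketch (an illustration, not the implementation used later in this thesis); it encodes the English tree of Figure 2.1 and checks the single-head and rootedness properties stated above.

```python
# A dependency tree as parallel lists: token i (1-based) has head index
# heads[i-1] (0 = artificial root) and relation label deprels[i-1].
tokens  = ["the", "dog", "chases", "the", "cat"]
heads   = [2, 3, 0, 5, 3]                 # e.g. "dog" is a dependent of "chases"
deprels = ["det", "nsubj", "root", "det", "obj"]

def is_tree(heads):
    """Each token's single head is given by the list itself; additionally,
    following head links from any token must reach the root without a cycle,
    and there must be exactly one root."""
    for i in range(1, len(heads) + 1):
        seen, node = set(), i
        while node != 0:
            if node in seen:              # cycle: not a tree
                return False
            seen.add(node)
            node = heads[node - 1]
    return heads.count(0) == 1

assert is_tree(heads)
```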
Within the UD framework, the translation of a sentence into different languages can be represented as very similar dependency structures, although the word order might vary considerably. Figure 2.2 illustrates the dependency trees of the sentence the dog chases the cat in four different languages. To amplify the contrast, the German example uses a less frequent but valid word order. There are three distinct orders of the verb and its arguments in these examples: SVO (subject-verb-object) for English and Chinese, OVS for German, and SOV for Turkish. Also, there are no explicit determiners in Chinese and Turkish. However, all examples have a common core structure in which the root of the sentence is chases, and it has two arguments, dog as the subject and cat as the object. The cross-lingual consistency of the (universal) dependency grammar is not only a plausible model of the human language faculty but also practical for NLP applications. For example, it allows one to easily translate from one language to another by making the lexical substitution with proper morphological inflection and then arranging the word order of the target tree.

Figure 2.2: Dependency trees of the same meaning in different languages: (a) English "The dog chases the cat", (b) Chinese "狗 追 猫", (c) German "Die Katze jagt der Hund", (d) Turkish "Köpek kediyi kovalar".

2.1.2 Projectivity

Projectivity is an important property for classifying dependency trees. Intuitively, a tree is projective if, when its tokens are drawn in their linear order, no dependency arcs cross each other. For example, the trees in Figure 2.2 are all projective. If a tree has at least one non-projective arc, it is a non-projective tree. Figure 2.3 shows an example of a non-projective tree, where the arc from you to Mark (marked in red in the figure) is non-projective.

Figure 2.3: An example non-projective sentence: "Will you handle this or Mark?"

Formally, an arc is projective if the head of the arc dominates all intervening tokens between the head and the dependent; otherwise it is non-projective:

t_i → t_j is projective  ⟺  ∀ t_k with t_i ≺ t_k ≺ t_j : t_i →* t_k,  if t_i ≺ t_j;
                             ∀ t_k with t_j ≺ t_k ≺ t_i : t_i →* t_k,  otherwise.     (2.1)

where ≺ denotes linear precedence and →* denotes domination.

2.1.3 Parsing and Linearization

Tesnière distinguishes the dependency structure and the surface word order as two different orders of the sentence, the structural order and the linear order. He argues that all the syntactic information of a sentence is contained in the structural order, while the linear order is a compromise to the physical world: since time is one-dimensional, one has to utter the words linearly.

Since the linear order is independent of the structural order in dependency grammar, there is a need for a (computational) model that can transform one into the other. The task of transforming the linear order into the structural order is parsing, and the task of transforming the structural order into the linear order is linearization.

In natural languages (unlike programming languages), there is no perfect one-to-one mapping between the structural order and the linear order, which makes parsing and linearization both difficult and exciting. For parsing, syntactic ambiguity causes different interpretations of the same sentence. For example, I shot an elephant in my pajamas could mean that I was in the pajamas, or that the elephant was in the pajamas.
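Coming back briefly to projectivity: the condition in Eq. (2.1) can be checked directly on the head-array representation sketched earlier. Below is a minimal, unoptimized sketch (not the code used in this thesis); the head indices for the example are my reading of the relations given for Figure 2.3 and are included purely for illustration.

```python
def is_projective_arc(heads, d):
    """Check Eq. (2.1) for the arc heads[d-1] -> d: the head h must dominate
    every token strictly between h and d (heads are 1-based, 0 = root)."""
    h = heads[d - 1]
    lo, hi = min(h, d), max(h, d)
    for k in range(lo + 1, hi):
        node = k
        while node not in (0, h):         # climb towards the root
            node = heads[node - 1]
        if node != h:                     # k is not dominated by h
            return False
    return True

def non_projective_arcs(heads):
    return [d for d in range(1, len(heads) + 1)
            if heads[d - 1] != 0 and not is_projective_arc(heads, d)]

# "Will you handle this or Mark ?" (Figure 2.3); heads follow the relations
# aux, nsubj, root, obj, cc, conj, punct given in the text.
heads_fig23 = [3, 3, 0, 3, 6, 2, 3]
print(non_projective_arcs(heads_fig23))   # prints [6]: the arc you -> Mark
```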
To resolve this ambiguity, one must rely on semantics and world knowledge to understand that elephants are unlikely to wear pajamas. For linearization, word order variation results in the same syntactic structure being uttered in different ways. The previous example could be rearranged as In my pajamas, I shot an elephant, where the emphasis is on the outfit instead of the event. Word order variation is influenced by various factors at different linguistic levels, e.g., prosody (Zubizarreta, 1998), information structure (Hawkins, 1992; Lambrecht, 1996; Neeleman and Van de Koot, 2016), and, as we will discuss in Chapter 5, the cognitive burden. Note that the linearization model in this work does not explicitly encode these factors, since we focus on the technical framework of a data-driven linearization model. However, it is possible that the machine learning model implicitly picks up the relevant cues to determine the word order.

2.2 Dependency Parsing Methods

Although this thesis does not directly work on dependency parsing, parsing methods have profoundly influenced the development of linearization due to the similarity between the two tasks. We therefore first briefly introduce some parsing methods, focusing on those directly adopted for linearization. For a complete introduction to dependency parsing algorithms, we refer to Kübler et al. (2009). Generally, there are two major paradigms of dependency parsing algorithms, graph-based and transition-based. Graph-based algorithms directly search for a tree in the fully connected graph, while transition-based algorithms derive the tree step by step through transitions.

2.2.1 Graph-based Parsing

Graph-based parsing algorithms (Eisner, 1996; McDonald et al., 2005) first construct a fully connected directed graph where every word is linked to every other word; a scoring function computes the scores of all the arcs in the graph, and a decoding algorithm finds the tree that maximizes the total score of its arcs, which is called the Maximum Spanning Tree (MST).

To score each potential arc, features are extracted from the head and the dependent of the arc, including their word forms, POS tags, their relative direction and distance, as well as other words around and between them as context. These features are then used by a machine learning model, e.g., the perceptron (§2.4.1), to calculate a score.

Having the scores for all arcs in the graph, we can then use an MST algorithm to decode the tree. There are two commonly used decoding algorithms. Eisner's algorithm (Eisner, 1996) is based on dynamic programming; it recursively breaks down the problem of finding the tree into finding the left and right subtrees and uses a chart to store the scores of recurring subtrees in order to avoid repeated computation. Chu-Liu-Edmonds' algorithm (Chu and Liu, 1965; Edmonds, 1967), used by McDonald et al. (2005), is a greedy algorithm with recursive refinement. It first selects the highest-scoring incoming arc for each node, which might result in cycles (and thus not a tree). It then resolves the cycles recursively until none remain.

Both algorithms find exact solutions efficiently, with a time complexity of O(n³). A significant difference between the two is that Chu-Liu-Edmonds' algorithm can natively produce non-projective trees, while Eisner's algorithm is restricted to projective trees. However, Eisner's algorithm can integrate more flexible higher-order features.
Therefore, it is more popular for parsing mostly projective languages such as English.

2.2.2 Transition-based Parsing

Transition-based algorithms define a transition system that builds up the tree step by step through transitions that change one parsing configuration into another, until reaching the terminal configuration, which contains the parse tree. Among the many transition systems, we introduce the ARCSTANDARD system (Nivre, 2008) as an example; its concepts, such as configuration and transition, are also used in other transition systems.

A configuration is a triple 〈σ, β, A〉, where the stack σ contains partially processed tokens, the buffer β contains the unprocessed tokens of the sentence, and the set A contains the dependency arcs created so far. The initial configuration is 〈[ROOT], [t_1, t_2, ..., t_n], ∅〉, where the stack contains only the root token, the buffer contains all the tokens of the sentence in their order, and the arc set is empty. The terminal configuration is 〈[ROOT], [], A〉, where the stack again contains only the root token, the buffer is empty, and A contains all the arcs that constitute the parse tree.

There are three transitions in the ARCSTANDARD system, left-arc, right-arc, and shift; their effects are described in Table 2.1. The left-arc transition creates an arc from the front of the buffer t_j to the top of the stack t_i and removes the token t_i; right-arc creates an arc from the top of the stack t_i to the front of the buffer t_j, removes the token t_j, and moves t_i back to the front of the buffer; shift moves the front of the buffer to the top of the stack.

Table 2.1: Transitions of the ARCSTANDARD system.

Transition | Before | After | Precondition
left-arc  | 〈σ|t_i, t_j|β, A〉 | ⇒ 〈σ, t_j|β, A ∪ {〈t_j, t_i〉}〉 | t_i ≠ ROOT
right-arc | 〈σ|t_i, t_j|β, A〉 | ⇒ 〈σ, t_i|β, A ∪ {〈t_i, t_j〉}〉 |
shift     | 〈σ, t_i|β, A〉     | ⇒ 〈σ|t_i, β, A〉                 |

2.2.3 Non-Projective Parsing

Among the previously described parsing algorithms, only the graph-based parser with Chu-Liu-Edmonds' algorithm can natively produce non-projective trees. For the projective parsers, several additional methods have been developed as pre- or post-processing to remedy this restriction; we introduce two of them here.

Nivre and Nilsson (2005) proposed pseudo-projective parsing. The basic idea is to modify the target non-projective tree into a (pseudo-)projective one by "lifting" the non-projective arcs until the tree becomes projective. To lift an arc is to reattach it to the head of its head, leaving a trace in the dependency label. They can then use a
The transitions in the ARCSTANDARDSWAP are listed in Table 2.2, where the first three transitions are identical as in ARCSTANDARD. The two methods above modify the non-projective trees in different ways: pseudo- projective parsing changes the tree structure by lifting the arcs; the ARCSTANDARDSWAP system changes the linear order of the sentence by swapping the words. 2.3 Linearization Methods The research of computational models for linearization has its root in machine transla- tion, which serves to find the order of words in the translated sentence (Brown et al., 1990; Brew, 1992). Earlier work mostly focuses on grammar-based approaches using dif- ferent syntactic formalisms (Elhadad and Robin, 1992; Lavoie and Rainbow, 1997; Carroll et al., 1999). With the increasing availability of annotated treebanks, statistical methods gain popularity and become the dominant approach (Langkilde and Knight, 1998; Ban- galore and Rambow, 2000; Filippova and Strube, 2009). We now introduce a few statistical linearization models that are relevant to our work. We broadly divide the models into two categories based on the input format, syntax-free and syntax-based. Syntax-free linearization operates on a bag of words and does not assume any structure in them, while syntax-based linearization takes an unordered de- pendency tree (among other grammar formalisms) as input. Although the two categories differ a lot in the available information, many methods designed for one category can be adapted for the other category. 16 2 Background 2.3.1 Syntax-free Linearization Knight (1999) cast the task of linearization as a Traveling Salesman Problem (TSP). A TSP is a problem that takes a list of cities and the distance between each pair of cities and searches for a route that visits each city exactly once and returns to the origin. To formulate the word ordering problem as a TSP, each word is treated as a city, the log probability of a word pair in a bigram language model is treated as the distance, and the goal is to find a sequence of words with the maximum probability. Much work has followed this formulation. Among others, Zaslavskiy et al. (2009) formulated the word ordering problem in phrase-based machine translation as a TSP and show that it achieves better performance and speed than beam search decoding with the same bigram language model. Horvat and Byrne (2014) explored higher-order n-gram language models for TSP-based word ordering, which transforms into a much larger TSP graph. These studies operate on a bag of words without syntax, which is a TSP graph of non-trivial size with little information about the internal structure. Much effort has been put into incorporating more powerful decoding algorithms such as Integer Programming (Germann et al., 2001) and Dynamic Programming (Tillmann and Ney, 2003). Under the TSP formulation, the computational model behind linearization is a simple n-gram language model, which does not model long-distance relations. To address this problem, Schmaltz et al. (2016) proposed a linearization model based on a Long Short- Term Memory (LSTM) language model. They use a pre-trained language model, which can give the probability estimation for any suffix of a sentence, and use beam search to approximately search for the order that maximizes the overall probability. 2.3.2 Syntax-based Linearization The methods mentioned above aim at linearizing a bag of words without any syntactical information. Although they could be used for dependency trees, the performance would be suboptimal. 
Syntax-based linearization typically uses discriminative models (in contrast to generative models such as the language model) to decide the word order. Bohnet et al. (2010) designed an algorithm to order the words with beam search. They extract features including the word form, part-of-speech, and dependency relations from the previous words and use a perceptron to calculate the scores for the next word. This differs from the language-model-based approach, which calculates the probability of the next word given the previous words solely from co-occurrence statistics in a large corpus; the discriminative model can incorporate rich structural information from the input and make a more informed decision.

Another major property of their algorithm that benefits from the tree structure is that they do not order the full tree at once. Instead, they divide the tree into subtrees (or word order domains in their terminology), where each subtree contains a head and its dependents. They then use beam search to order each subtree separately and finally combine the ordered subtrees into a full sentence. This is commonly referred to as the divide-and-conquer strategy, since it divides a large problem (ordering the full tree) into several smaller subproblems (ordering each subtree); by doing so, the search space is drastically reduced, thus increasing the likelihood that an approximate search algorithm finds the optimal solution.

A potential problem of this linearization algorithm is that, by combining the ordered subtrees, it can only produce projective sentences. Since the algorithm was initially designed to work on English data, which is predominantly projective, the inability to produce non-projective sentences did not pose a severe limitation to the model's performance. However, in order to apply the model to multilingual data, this limitation must be addressed. Bohnet et al. (2012) introduced a preprocessing model that first lifts the non-projective dependency arcs so that the correct order of the resulting dependency tree becomes projective, although the tree is no longer the original one. This method is inspired by the pseudo-projective parsing algorithm (Nivre and Nilsson, 2005).

Another line of work (Liu et al., 2015; Puduppully et al., 2016) is based on a transition system very similar to the ARCSTANDARD system for dependency parsing. They use a stack and a set to represent a configuration (in contrast to a stack and a buffer in a parsing configuration). There are three transitions: shift moves one word from the set onto the stack; left-arc and right-arc create a dependency arc between the top two items on the stack and remove the dependent from the stack. If the dependency tree structure is not given (i.e., when ordering a bag of words), this model performs parsing and linearization simultaneously, where shift decides the order of the words and left-arc and right-arc determine the dependency relations. If the dependency tree (or a partial tree) is given, it is used as a syntactic constraint to limit the possible transitions in every configuration. Since the transition system is based on the arc-standard system, it only produces projective sentences. Similar to Bohnet et al. (2010), a perceptron is used to score the transitions based on features extracted from the configuration, and beam search is used to explore multiple transition sequences.
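The following Python sketch illustrates the flavor of such a transition system on a toy example. It is a simplified rendering, not the exact system of Liu et al. (2015) or Puduppully et al. (2016): the transition sequence is fixed by hand, whereas the cited systems score each candidate transition with a perceptron and explore alternatives with beam search.

    # A simplified transition system for joint linearization and parsing: a configuration
    # holds a stack, an unordered word set, the emitted word order, and the created arcs.

    def shift(stack, words, order, w):
        """Move a chosen word from the set onto the stack; shifts define the word order."""
        words.remove(w)
        stack.append(w)
        order.append(w)

    def left_arc(stack, arcs):
        """The stack top becomes the head of the second item, which is removed."""
        head, dep = stack[-1], stack[-2]
        arcs.append((head, dep))
        del stack[-2]

    def right_arc(stack, arcs):
        """The second item becomes the head of the stack top, which is removed."""
        head, dep = stack[-2], stack[-1]
        arcs.append((head, dep))
        del stack[-1]

    stack, words, order, arcs = [], {"the", "dog", "barks"}, [], []
    # A fixed transition sequence for illustration only.
    shift(stack, words, order, "the")
    shift(stack, words, order, "dog")
    left_arc(stack, arcs)                      # dog  -> the
    shift(stack, words, order, "barks")
    left_arc(stack, arcs)                      # barks -> dog
    print("word order:", order)                # ['the', 'dog', 'barks']
    print("arcs (head, dependent):", arcs)     # [('dog', 'the'), ('barks', 'dog')]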
2.4 Machine Learning Models

2.4.1 Perceptron

Traditional models for classification tasks in NLP are mostly based on linear models such as the perceptron (Rosenblatt, 1958; Collins, 2002), which make predictions based on features extracted from the instances. The features are extracted according to manually designed templates, such as "what is the lemma of the head" or "what is the POS tag of the dependent"; the extracted features are then "the lemma of the head is chases" and "the POS tag of the dependent is NOUN". All the features found in the training set are assigned an index and stored in a feature mapping. In this way, each instance can be mapped to an input vector x composed of 0s and 1s, where the values at the positions that correspond to the activated features are set to 1, and the rest are set to 0. The computation in a perceptron model is:

y = W · x + b    (2.2)

where the weight matrix W and the bias vector b are the model parameters, x is the input feature vector, and y is the output vector that holds the score for each class. The prediction of the model is the class with the highest score in y.

To increase the linear perceptron model's expressive power, feature combination is often needed, where several elemental features are merged into one. It is not trivial to design a proper set of feature templates. The model could suffer from underfitting if the features are not expressive enough, or from overfitting if the features are so specific that they only match certain training instances and cannot generalize. For this reason, much effort has been spent on feature engineering, i.e., exploring informative elemental features and feature combinations that strike a balance between underfitting and overfitting.

2.4.2 Multi-Layer Perceptron

The emergence of neural networks has brought profound influence to the NLP field. One of the most basic neural models is the Multi-Layer Perceptron (MLP), which is a series of perceptrons with non-linear functions (e.g., the sigmoid function) applied between the layers. This introduces non-linearity into the computation and thus removes the need for manual feature combination. The following is an example of an MLP with two hidden layers:

h_1 = σ(W_1 · x + b_1)    (2.3)
h_2 = σ(W_2 · h_1 + b_2)    (2.4)
y = W_3 · h_2 + b_3    (2.5)

where x and y are the input and output vectors (as in the perceptron), W_1, W_2, W_3 are weight matrices, b_1, b_2, b_3 are bias vectors, h_1 and h_2 are the intermediate results of each perceptron layer, and σ(·) is the non-linear function, for example the sigmoid function:

σ(x) = 1 / (1 + e^{−x})    (2.6)

For simplicity, we use the following abstract notation to refer to the MLP:

y = MLP(x)    (2.7)

2.4.3 Distributed Representation

Arguably the most important concept of deep learning models in NLP is the distributed representation, also known as word embeddings when applied to words (Bengio et al., 2003; Mikolov et al., 2013). In the traditional perceptron model, the input is a one-hot vector, i.e., a vector of 0s and 1s, where a value of 1 at the n-th position means that the n-th feature is activated. An expressive model typically has millions of features, which means the input vector and the model parameters have millions of dimensions. The distributed representation, in contrast, uses a much smaller, dense vector to represent the features (words), where the values of the vector are part of the model parameters and are trained together with the rest of the model.
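As a minimal numerical contrast between the two input representations (toy dimensions, random parameters, arbitrary feature indices), the following sketch computes the perceptron scores of Equation 2.2 over a sparse one-hot vector and then builds a dense input from an embedding table.

    import numpy as np

    rng = np.random.default_rng(0)
    num_features, num_classes, emb_dim = 10, 3, 4

    # Perceptron over a sparse one-hot feature vector (Equation 2.2): y = W·x + b
    x = np.zeros(num_features)
    x[[2, 7]] = 1.0                      # two activated features, e.g. a lemma and a POS tag
    W = rng.normal(size=(num_classes, num_features))
    b = np.zeros(num_classes)
    y = W @ x + b
    print("perceptron scores:", y, "prediction:", int(np.argmax(y)))

    # Distributed representation: each feature index maps to a small dense trainable vector;
    # multiplying a weight matrix by a one-hot vector amounts to selecting one row, so the
    # embedding table simply stores those rows directly.
    E = rng.normal(size=(num_features, emb_dim))   # embedding table (model parameters)
    dense = E[2] + E[7]                            # sum of the embeddings of the active features
    print("dense input:", dense)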
The distributed representation not only has the advantage of compactness; more importantly, it also captures the syntactic and semantic similarity between words, which is learned from their contexts during training. The distributed representation was initially used to represent words, since it originates from language models, where words are the only observable features. It was soon adopted to represent arbitrary categorical features, such as POS tags and dependency relations (Chen and Manning, 2014).

2.4.4 Recurrent Neural Networks

Another advantage of deep learning models is their ability to model long-distance contexts. The Recurrent Neural Network (RNN) was designed by Elman (1990) to model sequences using a memory cell that recurrently updates itself with a sequence of inputs. Suppose we have a sequence of vectors as input [x_0, x_1, ..., x_n], where each x_i represents one element in the sequence. We can apply the RNN to obtain a sequence of output vectors [y_0, y_1, ..., y_n], where each y_i again represents one element of the sequence, now taking its preceding context into account. The model maintains a memory cell that keeps the information of the sequence it has read so far; at each step, it takes the current input, updates the memory, and produces an output vector, as follows:

h_t = σ_h(W_h · x_t + U_h · h_{t−1} + b_h)    (2.8)
y_t = σ_y(W_y · h_t + b_y)    (2.9)

where x_t is the input vector at time step t, [W_h, U_h, b_h] are the parameters to calculate the hidden state h_t, [W_y, b_y] are the parameters to calculate the output y_t, and σ_h(·) and σ_y(·) are non-linear activation functions.

The vanilla RNN suffers from vanishing or exploding gradients due to the depth of the unrolled recurrence, which prevents stable and effective training. Hochreiter and Schmidhuber (1997) proposed a more stable variant, the Long Short-Term Memory (LSTM) architecture, which addresses this problem. At each time step with the input x_t, the computation of the LSTM is as follows:

f_t = σ(W_f · x_t + U_f · h_{t−1} + b_f)    (2.10)
i_t = σ(W_i · x_t + U_i · h_{t−1} + b_i)    (2.11)
o_t = σ(W_o · x_t + U_o · h_{t−1} + b_o)    (2.12)
u_t = tanh(W_u · x_t + U_u · h_{t−1} + b_u)    (2.13)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ u_t    (2.14)
h_t = o_t ⊙ tanh(c_t)    (2.15)

where f_t is the forget gate, i_t is the input gate, and o_t is the output gate; all three vectors are calculated from the current input x_t and the previous output h_{t−1} with the corresponding parameters W, U, and b. These gating vectors all have values between 0 and 1 as a result of the sigmoid activation function; they are used to control the transformed information of the current step u_t through element-wise multiplication (⊙) to form the current memory c_t and the current output h_t.

The LSTM only models the sequence in one direction, which means the output at a time step is not aware of information from future time steps. To better process the sequential information, Graves and Schmidhuber (2005) proposed the bidirectional LSTM, which uses two LSTMs that run in opposite directions; the representation of each element in the sequence is simply the concatenation of the outputs of both LSTMs. We use the following abstract notations for the forward and backward LSTMs:

y_{1:n} = →LSTM(x_{1:n})    (2.16)
y_{1:n} = ←LSTM(x_{1:n})    (2.17)

where x_{1:n} is a sequence of input vectors and y_{1:n} is a sequence of output vectors.
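The following numpy sketch implements Equations 2.10–2.15 for a single LSTM layer with randomly initialized parameters. The dimensions and the initialization are arbitrary; the code mirrors the equations rather than any particular library implementation.

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_hid = 4, 3

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # One parameter block (W, U, b) per gate and for the cell update, as in Eqs. 2.10-2.13.
    params = {g: (rng.normal(size=(d_hid, d_in)),
                  rng.normal(size=(d_hid, d_hid)),
                  np.zeros(d_hid)) for g in "fiou"}

    def lstm_step(x_t, h_prev, c_prev):
        gate = {g: W @ x_t + U @ h_prev + b for g, (W, U, b) in params.items()}
        f_t = sigmoid(gate["f"])                 # forget gate            (2.10)
        i_t = sigmoid(gate["i"])                 # input gate             (2.11)
        o_t = sigmoid(gate["o"])                 # output gate            (2.12)
        u_t = np.tanh(gate["u"])                 # candidate update       (2.13)
        c_t = f_t * c_prev + i_t * u_t           # new memory             (2.14)
        h_t = o_t * np.tanh(c_t)                 # new output             (2.15)
        return h_t, c_t

    # Run the LSTM over a toy sequence of random input vectors.
    h, c = np.zeros(d_hid), np.zeros(d_hid)
    for x_t in rng.normal(size=(5, d_in)):
        h, c = lstm_step(x_t, h, c)
    print("final hidden state:", h)

A bidirectional LSTM simply runs a second instance of this loop over the reversed sequence and concatenates the two outputs per position.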
2.4.5 Sequence-to-Sequence Model and Attention Mechanism

The Sequence-to-Sequence (Seq2Seq) architecture (Cho et al., 2014; Sutskever et al., 2014) is a widely successful model that takes a sequence as input and produces another sequence as output. It is typically based on the combination of two RNNs: one RNN serves as the encoder that reads the input sequence, the other as the decoder that incrementally generates the elements of the output sequence.

The encoder reads all elements in the input sequence; its final output summarizes all the information in the input and is used by the decoder to generate the output. This introduces a problem: the model has to compress all the information of a sequence into one vector, which causes information loss, especially for longer sequences, and thus hurts the quality of the generated output.

The attention mechanism (Graves et al., 2014; Bahdanau et al., 2014; Luong et al., 2015) is designed to address this problem. It is motivated by the fact that humans pay attention to different elements in the input sequence at different times; for example, in machine translation, a different input word becomes the focus when generating each word of the output. In a Seq2Seq model with attention, the current decoder state computes its attention over all encoder outputs at every time step. The attention is a vector with values between 0 and 1 that sum to 1. It is used to obtain a weighted average of all the encoder states, which is then used by the decoder to generate the output for the current time step.

Generally, the attention score on one encoder state is calculated as follows:

s_{t,i} = f(y_t, x_i)    (2.18)

where f(y_t, x_i) is a function ℝ^d × ℝ^d → ℝ that takes the decoder state y_t and the encoder state x_i. There are several different attention models, which only differ in the function f(y_t, x_i); we refer to Hu (2019) for a comprehensive survey. The function that we use is based on Vaswani et al. (2017):

f(y_t, x_i) = (x_i^⊤ · y_t) / √d    (2.19)

where d is the dimension of the encoder state. The attention score is then normalized by the softmax function:

α_{t,i} = exp(s_{t,i}) / Σ_{j=1}^{n} exp(s_{t,j})    (2.20)

Having the attention scores α_{t,i}, we can then calculate the weighted average of the encoder states, which focuses on different positions of the input at each decoding step:

c_t = Σ_{i=1}^{n} α_{t,i} · x_i    (2.21)

We use the following abstract notation for the attention mechanism:

α_{t,1:n} = attention(y_t, x_{1:n})    (2.22)

where x_{1:n} are the encoder states for the elements of the input and y_t is the decoder state at the t-th step.

In attention-based Seq2Seq models, the attention mechanism is used to calculate the weights for averaging the representations of the input vectors, and the output is generated from the vocabulary. Vinyals et al. (2015b) proposed the pointer network, which can be viewed as a simplified version of the attention-based Seq2Seq model. It also calculates the attention between the current decoder state and all encoder states, but instead of generating an output from the vocabulary, it simply picks the input item with the highest attention as the output. The pointer network is suitable for tasks where all elements of the output are contained in the input, for example the linearization task.
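A minimal numpy sketch of Equations 2.19–2.21 with toy dimensions and random states: the decoder state attends over the encoder states and forms their weighted average; a pointer network would instead simply pick the input position with the highest attention weight.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 6, 8                                # 6 encoder states of dimension 8
    X = rng.normal(size=(n, d))                # encoder states x_1 .. x_n
    y_t = rng.normal(size=d)                   # current decoder state

    s = X @ y_t / np.sqrt(d)                   # scores s_{t,i} = x_i^T y_t / sqrt(d)   (2.19)
    alpha = np.exp(s) / np.exp(s).sum()        # softmax normalization                  (2.20)
    c_t = alpha @ X                            # weighted average of encoder states     (2.21)

    print("attention weights:", np.round(alpha, 3))
    print("pointer choice (argmax):", int(np.argmax(alpha)))   # pointer-network style selection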
2.4.6 Tree-LSTM

Tai et al. (2015) proposed the Tree-LSTM to model tree structures as a generalization of the sequential LSTM. We introduce the Child-Sum variant here, as it is suitable for modeling an unordered dependency tree. Unlike the sequential LSTM, which propagates information from the preceding node to the following node in a sequence, the Tree-LSTM propagates information from the dependents to their head in a dependency tree. Suppose we compute the representation of node j, whose dependents are k ∈ C(j); the computation of the Tree-LSTM is:

h̃_j = Σ_{k∈C(j)} h_k    (2.23)
f_{jk} = σ(W_f · x_j + U_f · h_k + b_f)    (2.24)
i_j = σ(W_i · x_j + U_i · h̃_j + b_i)    (2.25)
o_j = σ(W_o · x_j + U_o · h̃_j + b_o)    (2.26)
u_j = tanh(W_u · x_j + U_u · h̃_j + b_u)    (2.27)
c_j = i_j ⊙ u_j + Σ_{k∈C(j)} f_{jk} ⊙ c_k    (2.28)
h_j = o_j ⊙ tanh(c_j)    (2.29)

We first sum up the outputs of all the dependents of node j to obtain h̃_j, which summarizes the information of the dependents, and then calculate the forget gate f_{jk} of each dependent as well as the input gate i_j and the output gate o_j for the current node j. The memory cell of each dependent is controlled by the corresponding forget gate, the transformed information u_j is controlled by the input gate, and their sum forms the memory cell of the current node, which finally produces the output h_j through the output gate.

The Tree-LSTM has been improved by adding the attention mechanism to the hidden states (Zhou et al., 2016), so that each dependent influences the head representation to a different degree. Miwa and Bansal (2016) proposed a bidirectional extension that traverses the tree in both the bottom-up and the top-down direction to access all the tokens in the tree. We use the following abstract notations for the bottom-up and top-down Tree-LSTMs:

y↑_{1:n} = Tree-LSTM↑(x_{1:n})    (2.30)
y↓_{1:n} = Tree-LSTM↓(x_{1:n})    (2.31)

where x_{1:n} are the input vectors of the nodes in the tree, and y↑_{1:n} and y↓_{1:n} are the output vectors of the bottom-up and top-down Tree-LSTMs, respectively.
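The following numpy sketch computes single nodes of the Child-Sum Tree-LSTM following Equations 2.23–2.29, with toy dimensions and random parameters; a full encoder would apply this computation bottom-up over the whole tree.

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_hid = 4, 3

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    params = {g: (rng.normal(size=(d_hid, d_in)),
                  rng.normal(size=(d_hid, d_hid)),
                  np.zeros(d_hid)) for g in "fiou"}

    def child_sum_node(x_j, child_h, child_c):
        """One Child-Sum Tree-LSTM node; child_h/child_c hold the dependents' states."""
        h_tilde = np.sum(child_h, axis=0) if child_h else np.zeros(d_hid)    # (2.23)
        Wf, Uf, bf = params["f"]
        f_jk = [sigmoid(Wf @ x_j + Uf @ h_k + bf) for h_k in child_h]        # (2.24)
        gate = {g: W @ x_j + U @ h_tilde + b for g, (W, U, b) in params.items()}
        i_j = sigmoid(gate["i"])                                             # (2.25)
        o_j = sigmoid(gate["o"])                                             # (2.26)
        u_j = np.tanh(gate["u"])                                             # (2.27)
        c_j = i_j * u_j + sum(f * c for f, c in zip(f_jk, child_c))          # (2.28)
        h_j = o_j * np.tanh(c_j)                                             # (2.29)
        return h_j, c_j

    # Two leaf dependents feed into their head node.
    leaf1 = child_sum_node(rng.normal(size=d_in), [], [])
    leaf2 = child_sum_node(rng.normal(size=d_in), [], [])
    head_h, head_c = child_sum_node(rng.normal(size=d_in),
                                    [leaf1[0], leaf2[0]], [leaf1[1], leaf2[1]])
    print("head representation:", head_h)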
2.4.7 Graph Neural Networks

A tree is a particular case of a graph, and the general graph structure has much broader applications. Therefore, there has been a lot of work on modeling graph structures with Graph Neural Networks (GNNs); we refer the reader to Wu et al. (2020) for a comprehensive survey. Here, we introduce the variant we use in our experiments, the Graph Attention Network (GAT) by Veličković et al. (2017), which improves upon the Graph Convolutional Network (GCN) by Kipf and Welling (2016).

Unlike the sequence or tree structure, a graph has no beginning or end; therefore, we cannot propagate the information in one go as in the sequential LSTM and the Tree-LSTM. Instead, every node in a connected graph has neighbors, and the GNN propagates the information of every node to its neighbors simultaneously and repeats this n times to propagate further. Each propagation uses a separate GNN layer, and the model can thus only send messages a limited number of hops away, which is a compromise for processing more complex structures compared to the sequential LSTM or Tree-LSTM. In the basic GCN model, the representation of a node i at the (l+1)-th layer is based on the representations of its neighbors at the l-th layer as follows:

z_j^(l) = W^(l) · h_j^(l)    (2.32)
α_{i,j} = 1 / √(|N(i)| · |N(j)|)    (2.33)
h_i^(l+1) = σ( Σ_{j∈N(i)} α_{i,j} · z_j^(l) )    (2.34)

where N(i) are the direct neighbors of i, z_j^(l) is the transformed representation of a neighbor node j at the previous layer, and α_{i,j} is the normalization term for calculating the average.

When computing the representation of a node, the GCN averages the information from its neighbors, while the GAT takes into account the importance of each neighbor by calculating the attention from the node to its neighbors, replacing Equation (2.33) with:

α_{i,j} = attention(h_i^(l), h_{1:n}^(l))_j    (2.35)

We use the following abstract notation for the GAT:

y_{1:n} = GAT(x_{1:n})    (2.36)

where x_{1:n} and y_{1:n} are the input and output vectors for all the nodes in the graph.

2.5 Dependency Length Minimization

The recent development of comparable treebanks for a wide range of languages across the typological spectrum, particularly the UD (Nivre et al., 2016b), has made it possible to address some long-standing hypotheses regarding a functional explanation of linguistic universals on word order. Several recent studies (Liu, 2008; Futrell et al., 2015a) have used evidence from treebanks across languages to address what is arguably the most prominent hypothesis of a functionally motivated universal constraint, the Dependency Length Minimization (DLM) hypothesis, which can be traced back to Behaghel (1932). The DLM hypothesis states that we tend to arrange words so as to minimize the overall length of the dependency arcs in the linear order, in order to reduce the cognitive load of understanding (parsing) the sentence. The effect of DLM shows in utterances where multiple alternative word orders are valid; it also influences the preferences of word order in the long-term development of a language.

Liu (2008) performed the first cross-treebank study to compare the actual length of dependencies, for 20 languages, against the length that results when the dependency structures are linearized in random ways. The results indicate that languages indeed tend to minimize the dependency distance. Futrell et al. (2015a) presented a more recent expansion of this type of treebank study to a set of 37 languages (which they argue to be the first comprehensive analysis covering a broad range of typologically diverse languages), presenting comparisons of the real dependency trees from the treebanks with random re-orderings of the same dependency structures. The analysis shows that the real dependency length is indeed significantly shorter than chance across all 37 analyzed languages. This result corroborates findings from a broad range of empirical studies that are typologically less comprehensive (Gildea and Temperley, 2010; Gulordava and Merlo, 2015; Gulordava et al., 2015). This type of cross-treebank study has prompted a fair number of expansions and discussions regarding the typological implications (e.g., Jiang and Liu (2015); Chen and Gerdes (2018); Temperley and Gildea (2018)).

The studies mentioned above (Liu, 2008; Futrell et al., 2015a; Gulordava, 2018) have only focused on projective sentences and ignored non-projective sentences, which constitute a substantial proportion in many languages with relatively free word order. A series of studies (Ferrer-i-Cancho, 2006; Ferrer-i-Cancho and Gómez-Rodríguez, 2016; Gómez-Rodríguez and Ferrer-i-Cancho, 2017) focused on the relation between crossing arcs and dependency length. They concluded that the scarcity of non-projective sentences in natural languages is a direct outcome of DLM, since crossing arcs generally increase the dependency length of an otherwise projective sentence.
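As a minimal sketch of the quantity under discussion, the following code computes the total dependency length of a linearized tree as the sum over all arcs of the distance between head and dependent positions, and compares it against unconstrained random permutations of a toy sentence. The cited studies use more carefully controlled random baselines; the head indices below are invented for illustration.

    import random

    def total_dependency_length(heads):
        """heads[i] is the 1-based head position of the word at position i+1 (0 = root)."""
        return sum(abs((i + 1) - h) for i, h in enumerate(heads) if h != 0)

    # Head indices of a toy seven-word sentence in its real order.
    real = [3, 3, 4, 0, 6, 4, 4]
    print("real order:", total_dependency_length(real))

    # Random reorderings keep the same dependency structure but permute the word positions.
    random.seed(0)
    positions = list(range(1, len(real) + 1))
    lengths = []
    for _ in range(1000):
        random.shuffle(positions)
        new_pos = {old: new for old, new in zip(range(1, len(real) + 1), positions)}
        shuffled = [0] * len(real)
        for old, h in enumerate(real, start=1):
            shuffled[new_pos[old] - 1] = 0 if h == 0 else new_pos[h]
        lengths.append(total_dependency_length(shuffled))
    print("mean over random reorderings:", sum(lengths) / len(lengths))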
Chapter 3
Dependency Tree Linearization

The task of linearization is to arrange the nodes of a graph structure into a sequence. In our context, the input is an unordered dependency tree, and the output is a sentence. Linearization on its own may seem somewhat artificial, since it is uncommon to have the dependency tree structure of a sentence while its order is unknown. However, it could be an essential step for many real-world tasks. For example, in the deep track of several editions of the surface realization shared task (Belz et al., 2011; Mille et al., 2018, 2019, 2020), the input is a deep-syntactic structure which contains only content words connected by predicate-argument relations (Meyers et al., 2004; Palmer et al., 2005).

Figure 3.1 illustrates our complete pipeline for the surface realization shared task for both the shallow and the deep track. It consists of five modules: (1) linearization aims to find the surface order of a tree; (2) completion is the process of transforming the deep tree into a shallow tree by generating the absent function words; (3) inflection is the task of generating the word form based on the lemma and the syntactic and morphological information; (4) contraction applies to a few languages where several tokens are contracted into one; (5) detokenization turns a list of words into text with proper spacing between the words and punctuation marks. The code for the pipeline with all the modules is available at https://github.com/EggplantElf/IMSurReal.

This chapter focuses on the first module in the pipeline, dependency tree linearization. This module aims to find the optimal order of an unordered dependency tree, be it shallow or deep; it takes a tree as input and produces a sequence of tokens from the tree. A typical linearization model includes two major components: encoding and decoding, where the tokens are encoded with the tree information and then decoded into a sequence that forms a natural language sentence.

[Figure 3.1: A full pipeline for the surface realization shared task, including five modules: (1) linearization, (2) completion, (3) inflection, (4) contraction, and (5) detokenization.]

3.1 Encoding

We explore three models to encode the input unordered dependency tree: sequential LSTM, Tree-LSTM, and GAT. These three models are designed for structures with increasing generality, i.e., sequence, tree, and graph.

3.1.1 Token

Before modeling the structure, we first obtain a representation for each token in the tree, which serves as the common ground for all the structural encoders. Table 3.1 shows an example sentence from the treebank in the CoNLL-U format. Each row contains the information about one token in the sentence, including the token index, word form, lemma, Universal Part-of-Speech (UPOS), Language-specific Part-of-Speech (XPOS), a list of morphological features, the index of the head, and the dependency relation.

ID  Word   Lemma  UPOS   XPOS  Morphological Features             Head  Label
1   From   from   ADP    IN    _                                  3     case
2   the    the    DET    DT    Definite=Def|PronType=Art          3     det
3   AP     AP     PROPN  NNP   Number=Sing                        4     obl
4   comes  come   VERB   VBZ   Mood=Ind|Number=Sing|Person=3|...  0     root
5   this   this   DET    DT    Number=Sing|PronType=Dem           6     det
6   story  story  NOUN   NN    Number=Sing                        4     nsubj
7   :      :      PUNCT  :     _                                  4     punct

Table 3.1: An example sentence in the CoNLL-U format.
Among these entries, the token index and the head index indicate the correct order of the words in the sentence; they are available for training but scrambled for testing. The word form is the target of morphological inflection, which is also not available for linearization. Therefore, we use the lemma, UPOS, morphological features, and dependency label as features to encode the token; XPOS is left out, since it contains different information in different treebanks, which would make a cross-lingual comparison more difficult.

We run a bidirectional LSTM over the embeddings of the characters in the lemma (denoted as c_{1:p}) and a bidirectional LSTM over the morphological tags (denoted as m_{1:q}):

v_i^(char) = →LSTM_char(c_{1:p})_p + ←LSTM_char(c_{1:p})_1    (3.1)
v_i^(morph) = →LSTM_morph(m_{1:q})_q + ←LSTM_morph(m_{1:q})_1    (3.2)

The LSTM output vectors for the characters and the morphological tags are then summed together with the embeddings of the other features, i.e., lemma, UPOS, and dependency relation, to form the token-level representation:

v_i^∘ = e_i^(lemma) + e_i^(upos) + e_i^(dep) + v_i^(char) + v_i^(morph)    (3.3)

Since the training data are generally rather small (thousands of sentences), we apply word dropout during training to improve the robustness against unknown words. Concretely, during training we set the lemma representation to a zero vector with a probability of 1%, so the model is forced to rely on other features to make the prediction.

The token-level encoding can also be viewed as the simplest way to encode the tree, ignoring the syntactic information entirely and modeling the tree as a bag of words. Such an encoding generally leads to poor performance, but it serves as a baseline in the ablation study to illustrate the importance of syntactic information.

3.1.2 Sequence

Since Sequence-to-Sequence (Seq2Seq) models gained popularity with their success in machine translation, they have also been widely adopted in many other tasks, even when the input is not a sequence in nature, including the linearization task (Konstas et al., 2017; Elder and Hokamp, 2018; Elder et al., 2020).

To use the Seq2Seq model to represent the tree structure, we first need to transform the tree into a sequence that still includes all the information about the tree. We denote this process as serialization. Note that the term sounds quite similar to linearization, but the two have very different meanings in our context. Serialization is a pre-processing step that flattens the input unordered tree into a sequence with a depth-first, pre-order traversal, where the head is visited first and the dependents are visited in arbitrary order. Linearization, in contrast, is the task the model is trying to solve, where the order of the output sequence is the target. We follow Konstas et al. (2017) in using scoping brackets to establish an unambiguous mapping between the tree and the sequence, so that the tree information is not lost in the serialization. To improve the model's robustness to the random serialization, we use a new random serialization of the same tree every time it is encountered during training.
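As a minimal sketch of this serialization step (using the example tree shown in Figure 3.2 below and a simplified bracketing convention in the spirit of Konstas et al. (2017)), the following code draws a different random serialization of the same unordered tree on every call:

    import random

    # An unordered dependency tree given as head -> list of dependents.
    tree = {"let": ["get", "us"], "get": ["this", "over"], "this": ["with"],
            "us": [], "over": [], "with": []}

    def serialize(node, rng):
        """Depth-first pre-order traversal; dependents in random order, scoped by brackets."""
        deps = tree[node][:]
        rng.shuffle(deps)
        if not deps:
            return [node]
        out = [node, "("]
        for d in deps:
            out += serialize(d, rng)
        out.append(")")
        return out

    rng = random.Random(1)
    # A new random serialization of the same tree can be drawn at every training epoch.
    print(" ".join(serialize("let", rng)))
    print(" ".join(serialize("let", rng)))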
Figure 3.2 illustrates different ways to represent an input dependency tree as a sequence: (1) and (2) are two different random serializations of the tree with scoping brackets; (3) is a random-order traversal of the tree without scoping brackets, which does not fully encode the tree information, since we cannot unambiguously reconstruct the tree from the sequence; (4) is a completely random sequence, i.e., not necessarily a traversal of the tree in any order, and it thus completely loses the tree information.

(1): let ( us get ( over this ( with ) ) )
(2): let ( get ( this ( with ) over ) us )
(3): let get this with over us
(4): let get with us over this

Figure 3.2: Different ways to serialize a dependency tree as a sequence.

Having transformed the tree into a sequence, we can use a bidirectional LSTM to encode the sequence from left to right and from right to left. This way, the tokens in the sequence are encoded with a proxy of the tree information. It is worth noting that, although the tree can be unambiguously transformed into a sequence, the LSTM model does not necessarily have the ability to understand the tree structure behind it. We have shown in Yu et al. (2019d) that the LSTM can recognize the tree structure to some extent, but it cannot truly generalize when the model is tested on larger trees than it was trained on. However, in this task the training and testing sentences are generally of similar sizes; therefore, the limited generalization does not pose a severe problem.

3.1.3 Tree-LSTM

To represent the tree nodes with better generalization ability, we adopt the Tree-LSTM model (Tai et al., 2015) to explicitly encode the dependency tree. The original Tree-LSTM propagates information only in the bottom-up direction, i.e., from the dependents to the heads, which does not encode all hierarchical information in each node, since there is no information flow from the ancestors to the descendants. We therefore apply a top-down pass to propagate information from the head to the dependents, as in Miwa and Bansal (2016). Since each dependent has only one head, unlike in the bottom-up pass, we can use a standard sequential LSTM to encode the paths from the root to each leaf node. For each node, we feed its bottom-up vector v↑ into the hidden state of its head to obtain the hidden state of the current node, and the output is the top-down vector v↓.

Miwa and Bansal (2016) perform the two passes independently, i.e., both LSTMs take v∘ as input and produce v↑ and v↓ as outputs, similar to the standard sequential bidirectional LSTM (Graves and Schmidhuber, 2005). However, two independent passes cannot pass the information of all tokens to all other tokens in the tree: each token only receives information from its ancestors and descendants and is thus unaware of its siblings, which is crucial information for linearization. Therefore, our model performs the bottom-up pass first and uses its output v↑ as the input for the top-down pass to obtain v↓, as illustrated in Figure 3.3. The final representation is the sum of the token-level vector and the two Tree-LSTM outputs:

v↑_i = Tree-LSTM↑(v∘_{0:n})_i    (3.4)
v↓_i = Tree-LSTM↓(v↑_{0:n})_i    (3.5)
x_i = v∘_i + v↑_i + v↓_i    (3.6)

[Figure 3.3: A schematic illustration of the bidirectional Tree-LSTM: (a) the bottom-up LSTM and (b) the top-down LSTM.]
In this way, all tokens in the tree can be accessed by all other tokens, since any two tokens have a common ancestor, and the information of one token can first be passed up to the common ancestor and then down to the other token. Figure 3.4 illustrates the information flow of our bidirectional model in a dependency tree, where the red dotted arrows indicate the bottom-up pass and the blue dashed arrows indicate the top-down pass. We highlight how node 8 influences node 4: its representation v∘_8 is first propagated up to the lowest common ancestor v↑_2, then goes down to v↓_4.

[Figure 3.4: An illustration of the information flow in the encoder, where the red dotted arrows represent the bottom-up pass and the blue dashed arrows represent the top-down pass. The solid arrows illustrate the information flow from node 8 to node 4.]

3.1.4 Graph Neural Networks

Graph neural networks are typically used to encode graph structures beyond trees. Nevertheless, we apply them to encode the input dependency tree. We implement the Graph Attention Network (GAT) model described in §2.4.7. Based on preliminary experiments, we set the number of GAT layers to 3, i.e., each node is aware of its neighbors up to three hops away. This is a compromise between the model's expressiveness and parsimony: too few layers would leave the nodes with insufficient context information, while too many layers would bloat the model parameters and cause overfitting.

Let us take the example in Figure 3.4 to compare the Tree-LSTM and the GAT. The Tree-LSTM has two sets of parameters, i.e., the bottom-up and the top-down submodule. When passing the information from node 8 to node 4, the bottom-up submodule is used twice for the information to reach node 2, and the top-down submodule is used once to pass it down to node 4. In contrast, the GAT in our setting has three sets of parameters, i.e., the three layers, which can pass information three steps away, just enough to get from node 8 to node 4, where each step involves one GAT layer. However, it is not possible to pass the information further on to node 7. The GAT therefore has a limited scope of structural context, and its node representations may not be aware of the full graph.

3.2 Divide-and-Conquer Strategy

After encoding the tokens, we could directly generate the sequence of the tree with a decoder. However, it is not trivial to decode long sentences, since the total search space is the factorial of the number of tokens. The divide-and-conquer strategy is commonly used to break down a large problem into several smaller problems and solve them separately in order to reduce the problem's complexity. Following Bohnet et al. (2010), we employ this strategy by recursively traversing the tree and solving the linearization task for each local subtree, where each subtree consists of a head and its direct dependents.
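As a minimal sketch of the divide step (toy data structures; the actual ordering of each subtree is left to the decoder of §3.3), the following code splits a tree given as a head map into one local ordering problem per head:

    # A toy unordered dependency tree: token id -> head id (0 is the artificial root).
    heads = {1: 2, 2: 0, 3: 4, 4: 2, 5: 4, 6: 2}

    def subtrees(heads):
        """Split the tree into one ordering subproblem per head: the head plus its dependents."""
        by_head = {}
        for tok, h in heads.items():
            by_head.setdefault(h, []).append(tok)
        # The artificial root (head 0) is skipped here for simplicity.
        return {h: [h] + deps for h, deps in by_head.items() if h != 0}

    for head, tokens in subtrees(heads).items():
        print("order these tokens locally:", tokens)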
After obtaining the subtrees' local orders (by the decoder in §3.3), there are multiple possible ways to combine them into a total order. Figure 3.5 gives an example, where (a) and (b) are the linearized subtrees, and (c), (d), and (e) are three different fully linearized trees that respect the subtrees' local orders. Out of all possibilities, there is only one projective tree, i.e., one without crossing arcs, which is (c) in the example. Therefore, we make a projectivity assumption on the output to complete the divide-and-conquer process. This assumption turns out to be a practical inductive bias, since the majority of sentences in most languages are projective or contain very few non-projective arcs; see the statistics in Table 4.1.

[Figure 3.5: Multiple possible full tree orders from the same subtree orders, where (a) and (b) are ordered subtrees and (c), (d), and (e) are possible full ordered trees, among which only (c) is projective.]

Whether to employ the divide-and-conquer strategy is a trade-off between decoding efficiency and the coverage of exactly correct solutions. On the one hand, the strategy massively reduces the search space; on the other hand, a small fraction of correct trees becomes impossible to produce, since they lie outside the reduced search space.

3.3 Decoding

We experiment with three types of decoding algorithms: incremental decoding, graph-based decoding, and transition-based decoding. In the first method, we add tokens incrementally to build partial sequences and use beam search to explore several hypotheses. In the second method, we treat the task as a Traveling Salesman Problem (TSP) and decode the sequence with a TSP solver. In the third method, we design a transition system that swaps the tokens of an initial sequence step by step until reaching the final word order.

3.3.1 Incremental Decoding

Left-to-Right Generation with Beam Search

We start with the incremental decoding method, largely following Bohnet et al. (2010). The decoder takes a set of tokens as input, and the sequence is initially empty. At each decoding step, we select one token from the set of candidate tokens and add it to the end of the partial sequence, until the candidate set is empty. To address the error propagation problem of greedy search, the decoder applies beam search to extend the sequence; at each step, we keep at most k partial sequences as the current hypotheses.

We use an LSTM to represent the partial sequences and adopt the pointer network (Vinyals et al., 2015b) to score the new expansions. At each step, we calculate the attention score between the LSTM output of the current partial sequence and the remaining tokens that are "pointed" to. The unnormalized attention score (without softmax) is the incremental score for adding that token.

Algorithm 1 describes the left-to-right decoding together with the LSTM scoring model. We first divide the input tree T into subtrees, each consisting of a head h and its dependents. For each subtree, we initialize the agenda with an empty sequence (line 3). A sequence is represented by an LSTM state (line 5), and we also keep track of the total score of the sequence (line 4). At each step, for each sequence in the agenda, we use a pointer network to calculate the unnormalized attention scores between the LSTM state and all the remaining tokens as the scores for attaching each token to the end of the sequence (ATTEND in line 12, where x_t is the vector representation of the token t; all scores are calculated in one go in the actual model, we distribute them over the loop only for readability). We then create a new sequence for each possible attachment (line 11, where ⊕ denotes concatenation), and the score of each new sequence is incremented by the unnormalized attention score (line 12). We also update the corresponding LSTM state of that sequence by adding the representation of the attached dependent as input (line 13).
The new sequences are then added to the beam for the next step (line 14). If the number of new sequences in the beam exceeds the beam size, we keep only the highest-scoring ones for further expansion (line 17). When all the sequences are finished, we take the highest-scoring full sequence as the linearization of the subtree (line 19). Finally, after linearizing all subtrees, we combine them into a full sentence as the output (line 21).

Algorithm 1 Left-to-Right linearization.
 1:  for all h ∈ T do                                ▷ divide the tree into subtrees
 2:      T_h = {h} ∪ dependents(h)                   ▷ all tokens in the subtree
 3:      seq = []                                    ▷ initial sequence
 4:      seq.score = 0
 5:      seq.state = InitialLSTMState()
 6:      agenda = [seq]                              ▷ initial agenda
 7:      while |seq| < |T_h| do
 8:          beam = []
 9:          for all seq ∈ agenda do
10:              for all t ∈ T_h \ seq do            ▷ a new sequence for each remaining token
11:                  new_seq = seq ⊕ t               ▷ attach to the right
12:                  new_seq.score += ATTEND(seq.state, x_t)
13:                  new_seq.state = seq.state.addInput(x_t)
14:                  beam.append(new_seq)
15:              end for
16:          end for
17:          agenda = N-BEST(beam, beam-size)        ▷ keep N-best sequences for the next step
18:      end while
19:      S_h = N-BEST(agenda, 1)                     ▷ best sequence for T_h
20:  end for
21:  S = COMBINESUBTREES(S_1, S_2, ..., S_n)         ▷ combine subtrees into the full tree
22:  return S

Note that the left-to-right d