

Open Access
Multi-word tokenization for natural language processing
(2013) Michelbacher, Lukas; Schütze, Hinrich (Prof. Dr.)
Sophisticated natural language processing (NLP) applications are entering everyday life in the form of translation services, electronic personal assistants or open-domain question answering systems. As voice-operated applications like these become commonplace, users increasingly expect to communicate with them in unrestricted natural language, just as in a normal conversation. One obstacle that prevents computers from understanding unrestricted natural language is that of collocations: combinations of multiple words that have idiosyncratic properties, for example, red tape, kick the bucket or there's no use crying over spilled milk. Automatic processing of collocations is nontrivial because these properties cannot be predicted from the properties of the individual words. This thesis addresses multi-word units (MWUs), collocations that appear in the form of complex noun phrases. Complex noun phrases are important for NLP because they denote real-world entities and concepts and are often used for specialized vocabulary such as scientific or legal terms. Virtually every NLP system uses tokenization, the partitioning of textual input into meaningful units, or tokens, as part of preprocessing. Traditionally, tokenization does not deal with MWUs, which leads to early errors that propagate through subsequent NLP tasks and degrade the quality of NLP applications. The central idea of this thesis is multi-word tokenization (MWT): MWU-aware tokenization as a preprocessing step for NLP systems. The goal of this thesis is to drive research towards NLP applications that understand unrestricted natural language. Our main contributions cover two aspects of MWT. First, we conducted fundamental research into asymmetric association, the phenomenon that lexical association from one component of an MWU to another can be stronger in one direction than in the other.
This property has not been investigated deeply in the literature. We positioned asymmetric association in the broader context of different types of word association and collected human syntagmatic associations using a novel experimental setup. We measured asymmetric association in human syntagmatic production and showed that it is indicative of MWUs. Furthermore, we created corpus-based asymmetric association measures and showed that these measures predict asymmetry in word combinations automatically with high accuracy. Second, we present an implementation of MWT in which we cast MWU recognition as a classification problem. We built an MWU classifier whose features address properties of MWUs. In particular, we targeted semantic non-compositionality, a phenomenon of unpredictable meaning shifts that occurs in many MWUs. To detect meaning shifts, we used features of contextual similarity based on distributional semantics. We found that context features significantly improve MWU classification accuracy but that some aspects of how these features behave are unreliable. Additionally, we integrated MWT into an information retrieval system and showed that incorporating MWU information improves retrieval performance.
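The abstract does not specify the thesis's association measures or its classifier, so the following is only an illustrative sketch of the two ideas it names. It assumes that asymmetric association can be approximated by comparing the conditional probabilities P(w2 | w1) and P(w1 | w2) estimated from adjacent-pair counts, and it replaces the classifier with a simple greedy longest-match lookup in a hand-supplied MWU lexicon. The function names `directional_association` and `multiword_tokenize`, the `max_len` parameter and the underscore-joined token format are all hypothetical.

```python
from collections import Counter


def directional_association(tokens, w1, w2):
    """Return (P(w2 | w1), P(w1 | w2)) for the adjacent pair "w1 w2".

    Assumption: plain relative frequencies over adjacent pairs; this is
    not necessarily one of the measures developed in the thesis.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    pair = bigrams[(w1, w2)]
    forward = pair / unigrams[w1] if unigrams[w1] else 0.0
    backward = pair / unigrams[w2] if unigrams[w2] else 0.0
    return forward, backward


def multiword_tokenize(tokens, mwu_lexicon, max_len=4):
    """Merge known MWUs into single tokens by greedy longest match.

    Assumption: `mwu_lexicon` is a set of word tuples supplied by the
    caller; the thesis uses a trained classifier instead of a lookup.
    """
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, down to two words.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in mwu_lexicon:
                out.append("_".join(candidate))
                i += n
                break
        else:
            # No MWU starts here; keep the single word as its own token.
            out.append(tokens[i])
            i += 1
    return out


if __name__ == "__main__":
    text = "red tape slows us while red paint and red wine do not".split()
    print(directional_association(text, "red", "tape"))
    print(multiword_tokenize(text, {("red", "tape")}))
```

In this toy input, "tape" always follows "red" while "red" is followed by "tape" only one time in three, so the backward probability exceeds the forward one. That is the kind of directional imbalance the abstract describes, shown here on invented data rather than on any corpus from the thesis.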