OPUS - Online Publications of University Stuttgart
Browsing by Author "Schütze, Hinrich (Prof. Dr.)"

Now showing 1 - 8 of 8
  • Effective active learning for complex natural language processing tasks (Open Access)
    (2013) Laws, Florian; Schütze, Hinrich (Prof. Dr.)
    Supervised machine learning is a widely used approach to natural language processing tasks. However, supervised learning needs large amounts of labeled training data, which must be annotated in a time-consuming and expensive process. Active learning is a strategy to reduce this annotation effort by setting up an interactive process in which the machine learning system iteratively selects data for annotation. By selecting only data that the system considers informative, this strategy promises a significant reduction in the amount of data needed for training. In this thesis, we investigate the application of active learning to key natural language processing tasks, studying strategies for selecting "informative" training examples for two key NLP tasks: named entity recognition and coreference resolution. We show that active learning can deliver a large reduction in annotation effort for these tasks. However, in cases of unfortunate initialization, active learning can suffer from slow learning progress on infrequent classes: the missed cluster effect. We show that active learning can be made resilient against this phenomenon by co-selecting examples that occur together in a natural context (e.g., a sentence). We also apply this strategy to the selection of examples for coreference annotation and demonstrate, for the first time, a successful active learning approach to coreference resolution. We also monitor training progress during data annotation, investigating a method to estimate performance without additional labeled test data. While this method is not reliable for stopping at a fixed performance threshold, we can use it to define effective criteria for stopping when performance for a given system and dataset is close to optimal. Finally, we investigate crowdsourcing as a complementary cost reduction approach that aims to reduce the per-example cost by outsourcing annotation over the web. We propose strategies to mitigate the higher mistake rates of crowdsourcing annotators and present a successful combination of active learning with crowdsourcing.
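
As a rough illustration of the interactive selection loop this abstract describes, here is a minimal pool-based active learning sketch with least-confidence uncertainty sampling. The scikit-learn logistic regression and the `oracle` callback are stand-ins of ours, not the thesis's NER or coreference systems:

```python
# Minimal pool-based active learning with least-confidence sampling.
# Assumption: a generic scikit-learn classifier stands in for the
# thesis's NER/coreference models; `oracle` simulates the annotator.
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_uncertain(model, X_pool, batch_size):
    """Return indices of the pool examples the model is least sure about."""
    probs = model.predict_proba(X_pool)
    uncertainty = 1.0 - probs.max(axis=1)      # least-confident criterion
    return np.argsort(-uncertainty)[:batch_size]

def active_learning_loop(X, y, X_pool, oracle, rounds=10, batch_size=20):
    model = None
    for _ in range(rounds):
        model = LogisticRegression(max_iter=1000).fit(X, y)
        picked = select_uncertain(model, X_pool, batch_size)
        y_new = oracle(X_pool[picked])         # human labels the selected batch
        X = np.vstack([X, X_pool[picked]])
        y = np.concatenate([y, y_new])
        X_pool = np.delete(X_pool, picked, axis=0)
    return model
```

The thesis's remedy for the missed cluster effect would correspond here to expanding `picked` with the other examples from the same sentence before querying the oracle.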
  • Information extraction for the geospatial domain (Open Access)
    (2014) Blessing, André; Schütze, Hinrich (Prof. Dr.)
    Geospatial knowledge is increasingly becoming an essential part of software applications, primarily due to the importance of mobile devices and of location-based queries on the World Wide Web. Context models are one way to disseminate geospatial data in a digital, machine-readable representation. One key challenge is acquiring and updating such data, since physical sensors cannot be used to collect it on a large scale and the required manual work is very time-consuming and expensive. However, a great deal of geospatial data already exists in textual form and can be used instead. The question is how to extract such information from texts in order to integrate it into context models. In this thesis we tackle this issue and provide new approaches, which were implemented as prototypes and evaluated. The first challenge in analyzing geospatial data in texts is identifying geospatial entities, also called toponyms. Such an approach can be divided into several steps. The first step marks possible candidates in the text, a process called spotting. Gazetteers are the key component for this task, but they have to be augmented by linguistically motivated methods to enable the spotting of inflected names. A second step is needed, since the spotting process cannot resolve ambiguous entities. For instance, London can be a city or a surname; we call this a geo/non-geo ambiguity. There are also geo/geo ambiguities, e.g. Fulda (city) vs. Fulda (river). For our experiments, we prepared a new dataset that contains mentions of street names. Each mention was manually annotated; one part of the data was used to develop methods for toponym recognition, and the remaining part was used to evaluate performance. The results showed that machine-learning-based classifiers perform well at resolving the geo/non-geo ambiguity. To tackle the geo/geo ambiguity, we have to ground toponyms by finding the corresponding real-world objects. In this work we present such approaches in a formal description and in a (partial) prototypical implementation, e.g., the recognition of vernacular named regions (like old town or financial district). The lack of annotated data in the geospatial domain is a major obstacle for the development of supervised extraction approaches. The second part of this thesis thus focuses on approaches that enable the automatic annotation of textual data, which we call unstructured data, using machine-readable data from a knowledge base, which we call structured data. This approach is an instance of distant supervision (DS). It is well established for English. We apply it to German data, which is more challenging, since German has a richer morphology and more variable word order than English; our approach takes these requirements into account. We evaluated our approach in several scenarios involving the extraction of relations between geospatial entities (e.g., between cities and their suburbs, or between towns and their corresponding rivers). For our evaluation, we developed two different relation extraction systems: a DS-based system, which uses the automatically annotated training set, and a standard system, which uses the manually annotated training set. The comparison of the systems showed that both reach the same quality, which is evidence that DS can replace manual annotation.
    One drawback of current DS approaches is that structured and unstructured data must be represented in the same language. However, most knowledge bases are represented in English, which prevents the development of DS for other languages. We developed an approach called Crosslingual Distant Supervision (CDS) that eliminates this restriction. Our experiments showed that structured data from a German knowledge base can successfully be transferred by CDS into other languages (English, French, and Chinese).
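
A toy sketch of the distant-supervision labeling step the abstract describes: sentences mentioning both arguments of a known knowledge-base relation become automatically annotated training examples. The tiny KB, the relation name, and the naive substring matching are invented for illustration; the German inflection handling the thesis adds is omitted:

```python
# Toy distant-supervision labeling: a sentence mentioning both arguments
# of a known KB relation becomes a positive training example.
# Assumptions: invented mini-KB, hypothetical relation name, and plain
# substring matching instead of real entity recognition.
kb_flows_through = {("Stuttgart", "Neckar"), ("Heidelberg", "Neckar")}

def distant_label(sentences):
    labeled = []
    for sent in sentences:
        for town, river in kb_flows_through:
            if town in sent and river in sent:
                labeled.append((sent, town, river, "flows_through"))
    return labeled

corpus = [
    "Stuttgart liegt am Neckar.",             # becomes a training example
    "Der Zug nach Stuttgart ist verspätet.",  # no river mention: skipped
]
print(distant_label(corpus))
```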
  • A joint translation model with integrated reordering (Open Access)
    (2012) Durrani, Nadir; Schütze, Hinrich (Prof. Dr.)
    This dissertation aims to combine the benefits and remedy the flaws of the two popular frameworks in statistical machine translation, namely phrase-based MT and N-gram-based MT. Phrase-based MT advanced the state of the art by translating phrases rather than single words. By memorizing phrases, phrasal MT is able to learn local reorderings and to handle other local dependencies such as insertions and deletions. Inter-phrasal reorderings are handled through the lexicalized reordering model, which remains the state-of-the-art reordering model in phrase-based SMT to date. However, phrase-based MT has some drawbacks:
    • Dependencies across phrases are not directly represented in the translation model
    • Discontinuous phrases cannot be represented and used
    • The reordering model is not designed to handle long-range reorderings
    • Search and modeling problems require the use of a hard reordering limit
    • The presence of many different equivalent segmentations increases the search space
    • Source word deletion and target word insertion outside phrases are not allowed during decoding
    N-gram-based MT exists as an alternative to the more commonly used phrase-based MT. Unlike phrasal MT, N-gram-based MT uses minimal translation units called tuples. Using minimal translation units enables N-gram systems to avoid the spurious phrasal segmentation problem of phrase-based MT. However, it also gives up the ability to memorize dependencies such as short reorderings that are local to phrases. Reordering in N-gram MT is carried out by source linearization and POS-based rewrite rules; the search graph for decoding is constructed as a preprocessing step using these rules. N-gram-based MT has the following drawbacks:
    • Only the pre-calculated orderings are hypothesized during decoding
    • The N-gram model cannot use lexical triggers
    • Long-distance reorderings cannot be performed
    • Unaligned target words cannot be handled
    • Using tuples presents a more difficult search problem than that in phrase-based SMT
    In this dissertation, we present a novel machine translation model based on a joint probability model, which represents the translation process as a linear sequence of operations. Like the N-gram model, our model uses minimal translation units, but it has the ability to memorize phrases like the phrase-based model. Unlike the N-gram model, our operation sequence includes not only translation but also reordering operations. The strong coupling of reordering and translation into a single generative story provides a mechanism to better restrict the positions to which a word or phrase can be moved, and it is able to handle short- and long-distance reorderings effectively.
    This thesis remedies the problems of phrasal MT and N-gram-based MT by making the following contributions:
    • We propose a model that handles both local and long-distance dependencies uniformly and effectively
    • Our model is able to handle discontinuous source-side units
    • During decoding, we are able to remove the hard reordering constraint that is necessary in phrase-based systems
    • Like the phrase-based model and unlike the N-gram model, our model exhibits the ability to memorize phrases
    • In comparison to the N-gram-based model, our model performs search over all possible reorderings and has the ability to learn lexical triggers and apply them to unseen contexts
    A secondary goal of this thesis is to challenge the belief that conditional probability models work better than joint probability models in SMT and that source-side context is less helpful in the translation process.
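
To make the "linear sequence of operations" idea concrete, here is an illustrative toy that linearizes a one-to-one word alignment into generation and jump operations. The operation names and the reduced inventory (no gap insertion, no discontinuous units) are our simplification, not the thesis's actual operation set; an n-gram model over such sequences would then couple translation and reordering in one generative story:

```python
# Toy linearization of a word-aligned sentence pair into an operation
# sequence. Simplifying assumptions (ours, not the thesis's): a 1-to-1
# alignment given in target order as (source_position, source, target),
# and only GENERATE and JUMP operations.
def operation_sequence(alignment):
    ops, cursor = [], 0
    for src_pos, src, tgt in alignment:
        if src_pos != cursor:                    # a reordering is needed
            ops.append(f"JUMP {src_pos - cursor:+d}")
        ops.append(f"GENERATE({src} -> {tgt})")
        cursor = src_pos + 1                     # move past the source word
    return ops

# German -> English with verb movement:
# "er hat das Buch gelesen" -> "he has read the book"
aligned = [(0, "er", "he"), (1, "hat", "has"), (4, "gelesen", "read"),
           (2, "das", "the"), (3, "Buch", "book")]
print(operation_sequence(aligned))
```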
  • A modular framework for coreference resolution (Open Access)
    (2012) Kobdani, Hamidreza; Schütze, Hinrich (Prof. Dr.)
    Coreference resolution plays an increasingly important role in a wide range of disciplines such as theoretical, corpus, and computational linguistics. It has been shown to be beneficial in a number of natural language processing applications, including machine translation, automatic abstracting, information extraction, and question answering. As a result, it has enjoyed increased interest in recent years. First, this thesis introduces a modular supervised system for coreference resolution. It is composed of separate, interchangeable components with clear, well-defined logical boundaries between them, which improves maintainability. The system has been used successfully in two international shared tasks on coreference resolution, and its good performance demonstrates the general validity of our design. In addition, a new framework for feature engineering in natural language processing is presented that is based on a relational data model of text. It includes fast and flexible methods for implementing and extracting new features, thereby reducing the effort of creating an NLP system for a particular task. This thesis presents an instantiation and evaluation of the framework for the problem of coreference resolution in multiple languages; competitive results were obtained within a short implementation period, demonstrating the potential power of this framework for feature engineering. An unsupervised framework is also presented that bootstraps a complete coreference resolution system from word associations mined from a large unlabeled corpus. I show that word associations are useful for coreference resolution; e.g., the strong association between Obama and President is an indicator of likely coreference. Association information has so far not been used in coreference resolution because it is sparse and difficult to learn from small labeled corpora. Since unlabeled text is readily available, the unsupervised approach proposed here addresses the sparseness problem. In a self-training framework, I train a decision tree on a corpus that is automatically labeled using word associations. I show that this unsupervised system achieves better coreference resolution performance than other learning approaches that do not use manually labeled data.
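
A toy sketch of the kind of word-association mining the unsupervised framework bootstraps from: pointwise mutual information between words co-occurring in the same document, computed from unlabeled text. The measure and the miniature corpus are illustrative assumptions; the thesis's association statistics may be defined differently:

```python
# Mine word associations from an unlabeled corpus via document-level PMI.
# High PMI between a pair like {"Obama", "President"} is the kind of
# signal that suggests likely coreference.
import math
from collections import Counter
from itertools import combinations

def pmi_associations(documents, min_count=2):
    word_counts, pair_counts, total = Counter(), Counter(), 0
    for doc in documents:
        heads = set(doc)                     # word types per document
        word_counts.update(heads)
        pair_counts.update(frozenset(p) for p in combinations(sorted(heads), 2))
        total += 1
    pmi = {}
    for pair, c in pair_counts.items():
        if c >= min_count:
            w1, w2 = tuple(pair)
            joint = c / total
            pmi[pair] = math.log(joint / ((word_counts[w1] / total) * (word_counts[w2] / total)))
    return pmi

docs = [["Obama", "President", "spoke"],
        ["Obama", "President", "visited"],
        ["river", "spoke"]]
print(pmi_associations(docs))   # only {"Obama", "President"} passes min_count
```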
  • Multi-word tokenization for natural language processing (Open Access)
    (2013) Michelbacher, Lukas; Schütze, Hinrich (Prof. Dr.)
    Sophisticated natural language processing (NLP) applications are entering everyday life in the form of translation services, electronic personal assistants, and open-domain question answering systems. The more commonplace voice-operated applications like these become, the higher users' expectations rise to communicate with these services in unrestricted natural language, just as in a normal conversation. One obstacle that hinders computers from understanding unrestricted natural language is that of collocations: combinations of multiple words that have idiosyncratic properties, for example, red tape, kick the bucket, or there's no use crying over spilled milk. Automatic processing of collocations is nontrivial because these properties cannot be predicted from the properties of the individual words. This thesis addresses multi-word units (MWUs), collocations that appear in the form of complex noun phrases. Complex noun phrases are important for NLP because they denote real-world entities and concepts and are often used for specialized vocabulary such as scientific or legal terms. Virtually every NLP system uses tokenization, the partitioning of textual input into meaningful units, or tokens, as part of preprocessing. Traditionally, tokenization does not deal with MWUs, which leads to early errors and error propagation in subsequent NLP tasks, resulting in poorer quality of NLP applications. The central idea presented in this thesis is multi-word tokenization (MWT): MWU-aware tokenization as a preprocessing step for NLP systems. The goal of this thesis is to drive research towards NLP applications that understand unrestricted natural language. Our main contributions cover two aspects of MWT. First, we conducted fundamental research into asymmetric association, the phenomenon that lexical association from one component of an MWU to another can be stronger in one direction than in the other. This property has not been investigated deeply in the literature. We position asymmetric association in the broader context of different types of word association and collected human syntagmatic associations using a novel experimental setup. We measured asymmetric association in human syntagmatic production and showed that it is a phenomenon indicative of MWUs. Furthermore, we created corpus-based asymmetric association measures and showed that asymmetry in word combinations can be predicted automatically with high accuracy using these measures. Second, we present an implementation of MWT in which we cast MWU recognition as a classification problem. We built an MWU classifier whose features address properties of MWUs. In particular, we targeted semantic non-compositionality, a phenomenon of unpredictable meaning shifts that occurs in many MWUs. To detect meaning shifts, we used features of contextual similarity based on distributional semantics. We found that context features significantly improve MWU classification accuracy, although there are unreliable aspects in the workings of such features. Additionally, we integrated MWT into an information retrieval system and showed that incorporating MWU information improves retrieval performance.
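
As a sketch of a corpus-based directional association measure of the kind the thesis investigates, one can compare the two conditional probabilities of a bigram. The counts below are invented, and the thesis's actual asymmetric measures may be defined differently:

```python
# Directional association from bigram counts: compare P(w2 | w1) with
# P(w1 | w2). A large gap indicates asymmetric association, the property
# the thesis links to multi-word units. Counts are made up for illustration.
from collections import Counter

bigrams = Counter({("red", "tape"): 30, ("red", "car"): 120, ("sticky", "tape"): 45})
first, second = Counter(), Counter()
for (w1, w2), c in bigrams.items():
    first[w1] += c       # total occurrences of w1 in first position
    second[w2] += c      # total occurrences of w2 in second position

def directional(w1, w2):
    fwd = bigrams[(w1, w2)] / first[w1]    # P(w2 | w1)
    bwd = bigrams[(w1, w2)] / second[w2]   # P(w1 | w2)
    return fwd, bwd

fwd, bwd = directional("red", "tape")
print(f"P(tape|red)={fwd:.2f}  P(red|tape)={bwd:.2f}")  # 0.20 vs 0.40: asymmetric
```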
  • Natural language processing and information retrieval methods for intellectual property analysis (Open Access)
    (2014) Jochim, Charles; Schütze, Hinrich (Prof. Dr.)
    More intellectual property information is generated now than ever before. The accumulation of intellectual property data, further complicated by this continued increase in production, makes it imperative to develop better methods for archiving and, more importantly, for accessing this information. Information retrieval (IR) is a standard technique for efficiently accessing information in such large collections. The most prominent example comprising a vast amount of data is the World Wide Web, where current search engines already satisfy user queries by immediately providing an accurate list of relevant documents. However, IR for intellectual property is neither as fast nor as accurate as what we expect from an Internet search engine. In this thesis, we explore how to improve information access in intellectual property collections by combining IR techniques with advanced natural language processing (NLP) techniques. The information in intellectual property is encoded in text (i.e., language), and we expect that by adding better language processing to IR we can better understand and access the data. NLP is a varied field encompassing a number of solutions for improving the understanding of language input. We concentrate specifically on the NLP tasks of statistical machine translation, information extraction, named entity recognition (NER), sentiment analysis, relation extraction, and text classification. Searching for intellectual property, specifically patents, is a difficult retrieval task in which standard IR techniques have had only moderate success. The difficulty of this task only increases with multilingual collections, as is the case with patents. We present an approach for improving retrieval performance on a multilingual patent collection by using machine translation (an active research area in NLP) to translate patent queries and then concatenating these parallel translations into a single multilingual query. Even after retrieving an intellectual property document, however, we still face the problem of extracting the relevant information it contains. We would like to improve our understanding of complex intellectual property data by uncovering latent information in the text. We do this by identifying citations in a collection of scientific literature and classifying them by their citation function. This classification is successfully carried out by exploiting characteristics of the citation text, including features extracted via sentiment analysis, NER, and relation extraction. By assigning labels to citations, we can better understand the relationships between intellectual property documents, which can be valuable information for IR and other applications.
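
A minimal sketch of the multilingual query construction described above: translate the query into the collection's other languages and concatenate everything into one query. The tiny phrase table and `translate` helper are hypothetical stand-ins for the SMT system the thesis uses, not an API it names:

```python
# Build one multilingual query by concatenating the original query with
# its translations. TOY_PHRASE_TABLE is an invented stand-in for a real
# machine translation system.
TOY_PHRASE_TABLE = {
    ("wind turbine blade", "de"): "Windturbinenblatt",
    ("wind turbine blade", "fr"): "pale d'éolienne",
}

def translate(text: str, target_lang: str) -> str:
    # Placeholder lookup; a real system would call an SMT decoder here.
    return TOY_PHRASE_TABLE.get((text, target_lang), text)

def multilingual_query(query: str, languages=("de", "fr")) -> str:
    parts = [query] + [translate(query, lang) for lang in languages]
    return " ".join(parts)   # one concatenated query for the IR engine

print(multilingual_query("wind turbine blade"))
```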
  • Statistical models for unsupervised, semi-supervised and supervised transliteration mining (Open Access)
    (2012) Sajjad, Hassan; Schütze, Hinrich (Prof. Dr.)
    Transliteration is the process of converting a word written in one script into another script in such a way that the pronunciation remains almost the same. It is useful in major natural language processing applications such as machine translation and cross-language information retrieval. A transliteration system is generally built using two types of manually created resources: hand-crafted transliteration rules and a list of transliteration pairs. It either uses the transliteration rules with an edit-distance metric to produce transliterations or automatically learns character alignments from transliteration pairs to build a model. The system requires language-pair-dependent resources for training, which are not available for all language pairs. Using transliteration mining, one can automatically extract a list of transliteration pairs from a parallel corpus. However, the state-of-the-art transliteration mining techniques are all supervised or semi-supervised and require language-dependent information for training. Until the work described here was carried out, there was no fully unsupervised method in the literature. In this thesis, I address this issue by showing that transliteration mining can be done in an unsupervised fashion; the proposed method does not require any language-pair-dependent resources. I also incorporate transliteration into machine translation and word alignment and show that it improves the performance of these systems. The following summarizes the steps I took to accomplish this: In the first part of my work, I showed the applicability of transliteration to machine translation. I presented a novel machine translation model that incorporates transliteration. During disambiguation, transliteration and translation options compete with each other, and the decoder must decide on the fly whether to choose a translation or a transliteration. For closely related language pairs with significant vocabulary overlap, I showed that transliteration is effective for more than just translating out-of-vocabulary words. I proposed a heuristic-based transliteration mining system and showed that transliteration mining can be done in an unsupervised fashion; it achieves competitive results when compared with previous semi-supervised and supervised systems, although it has a few limitations. I then presented a novel model for unsupervised transliteration mining that consists of a transliteration sub-model and a non-transliteration sub-model. This unsupervised system performed better than most of the previous semi-supervised and supervised systems. I extended the unsupervised model to use the available resources and presented semi-supervised and supervised versions of it, showing that if some labeled data is available, it is better to build a semi-supervised system than a supervised or unsupervised one. Finally, I incorporated the unsupervised transliteration mining model into an unsupervised word aligner. The new alignment system is also fully unsupervised and showed a large improvement in both precision and recall compared with the baseline alignment. This demonstrates that the proposed unsupervised mining method can be effectively used to improve the performance of natural language processing applications.
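
An illustrative sketch of the two-sub-model idea: the probability of a candidate word pair is an interpolation of a transliteration sub-model and a non-transliteration sub-model, and the posterior of the transliteration component decides whether the pair is mined. The toy character-overlap sub-model, the uniform background, and the fixed interpolation weight are our placeholders; the thesis estimates such quantities from data rather than fixing them:

```python
# Mixture-model classification of a word pair as transliteration or not.
# p_translit and p_non_translit are pluggable sub-models; lam is the
# prior weight of the transliteration component (fixed here, learned in
# the thesis's unsupervised setting).
def is_transliteration(pair, p_translit, p_non_translit, lam=0.5, threshold=0.5):
    pt = lam * p_translit(pair)
    pn = (1.0 - lam) * p_non_translit(pair)
    posterior = pt / (pt + pn)        # P(transliteration | pair)
    return posterior > threshold

# Toy sub-models: character-overlap score vs. a uniform background.
def overlap(pair):
    a, b = set(pair[0]), set(pair[1])
    return len(a & b) / max(len(a | b), 1)

print(is_transliteration(("london", "landan"), overlap, lambda p: 0.1))  # True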
  • Supervised and semi-supervised statistical models for word-based sentiment analysis (Open Access)
    (2014) Scheible, Christian; Schütze, Hinrich (Prof. Dr.)
    Ever since its inception, sentiment analysis has relied heavily on methods that use words as their basic unit. Even today, such methods deliver top performance. This way of representing data for sentiment analysis is known as the clue model. It offers practical advantages over more sophisticated approaches: it is easy to implement, and statistical models can be trained efficiently even on large datasets. However, the clue model also has notable shortcomings. First, clues are highly redundant across examples, so training on annotated data is potentially inefficient. Second, clues are treated context-insensitively, i.e., the sentiment expressed by a clue is assumed to be the same regardless of context. In this thesis, we address these shortcomings. We propose two approaches to reduce redundancy. First, we use active learning, a method for automatic data selection guided by the statistical model to be trained. We show that active learning can significantly speed up the training process for document classification, reducing clue redundancy. Second, we present a graph-based approach that uses annotated clue types rather than annotated documents containing clue instances. We show that, using a random-walk model, we can train a highly accurate document classifier. We next investigate the context-dependency of clues. We first introduce sentiment relevance, a novel concept that aims at identifying content that contributes to the overall sentiment of a review. We show that even when no annotated sentiment relevance data is available, a high-accuracy sentiment relevance classifier can be trained using transfer learning and distant supervision. Second, we perform a linguistically motivated analysis and simplification of a compositional sentiment analysis model. We find that the model captures linguistic structures poorly and that it can be simplified without any loss of accuracy.
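
A toy sketch of training from annotated clue types rather than annotated documents: seed sentiment scores on a few clue words and spread them over a bipartite clue-document graph by iterated averaging, a simple stand-in for the thesis's random-walk model. The documents and seed clues are invented:

```python
# Propagate sentiment from a few annotated clue types to documents over
# a bipartite clue-document graph. Iterated averaging approximates a
# random walk; seeds stay clamped to their annotated scores.
docs = {"d1": ["great", "fun"], "d2": ["dull", "boring"], "d3": ["fun", "boring"]}
clue_score = {"great": 1.0, "dull": -1.0}           # annotated clue types (seeds)
seeds = set(clue_score)

for _ in range(20):                                 # iterate until scores settle
    doc_score = {d: sum(clue_score.get(w, 0.0) for w in ws) / len(ws)
                 for d, ws in docs.items()}
    for w in {w for ws in docs.values() for w in ws}:
        if w not in seeds:
            containing = [d for d, ws in docs.items() if w in ws]
            clue_score[w] = sum(doc_score[d] for d in containing) / len(containing)

print(doc_score)  # d1 positive, d2 negative, d3 near the boundary
```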