Universität Stuttgart

Permanent URI for this community: https://elib.uni-stuttgart.de/handle/11682/1

Search Results

Now showing 1 - 10 of 40
  • Computational modelling of coreference and bridging resolution
    (2019) Rösiger, Ina; Kuhn, Jonas (Prof. Dr.)
  • Modeling the interface between morphology and syntax in data-driven dependency parsing
    (2016) Seeker, Wolfgang; Kuhn, Jonas (Prof. Dr.)
    When people formulate sentences in a language, they follow a set of rules specific to that language that defines how words must be put together in order to express the intended meaning. These rules are called the grammar of the language. Languages have essentially two ways of encoding grammatical information: word order or word form. English primarily uses word order to encode different meanings, but many other languages change the form of the words themselves to express their grammatical function in the sentence. These languages are commonly subsumed under the term morphologically rich languages. Parsing is the automatic process of predicting the grammatical structure of a sentence. Since grammatical structure guides the way we understand sentences, parsing is a key component in computer programs that try to automatically understand what people say and write. This dissertation is about parsing, and specifically about parsing languages with a rich morphology, which encode grammatical information in the form of words. Today's models for automatic parsing were developed for English and achieve good results on this language. However, when applied to other languages, a significant drop in performance is usually observed. The standard model for parsing is a pipeline model that separates the parsing process into different steps; in particular, it separates the morphological analysis, i.e. the analysis of word forms, from the actual parsing step. This dissertation argues that this separation is one of the reasons for the performance drop of standard parsers when applied to languages other than English. An analysis is presented that exposes the connection between the morphological system of a language and the errors of a standard parsing model. In a second series of experiments, we show that knowledge about the syntactic structure of a sentence can support the prediction of morphological information. We then argue for an alternative approach that models morphological analysis and syntactic analysis jointly instead of separating them. We support this argumentation with empirical evidence by implementing two parsers that model the relationship between morphology and syntax in two different but complementary ways.
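As a toy illustration of the pipeline-versus-joint argument made in the abstract above: a pipeline commits to one morphological analysis before parsing, so a tagging error propagates, whereas a joint model can trade a slightly worse morphological score for a much better syntactic one. The candidate analyses, score functions, and numbers below are entirely made up for illustration and are not the thesis models.

```python
# Toy sketch (not the thesis implementation): pipeline vs. joint decisions.
from itertools import product

# Hypothetical candidate analyses for one ambiguous word form.
morph_candidates = ["Nominative", "Accusative"]
parse_candidates = ["subject-attachment", "object-attachment"]

def morph_score(m):
    # The tagger slightly prefers the wrong reading in isolation.
    return {"Nominative": 0.6, "Accusative": 0.4}[m]

def parse_score(m, p):
    # Syntax strongly prefers the object reading, which requires Accusative.
    compatible = {("Nominative", "subject-attachment"): 0.3,
                  ("Nominative", "object-attachment"): 0.1,
                  ("Accusative", "object-attachment"): 0.9,
                  ("Accusative", "subject-attachment"): 0.1}
    return compatible[(m, p)]

# Pipeline: commit to the best morphology, then parse given that choice.
m_pipe = max(morph_candidates, key=morph_score)
p_pipe = max(parse_candidates, key=lambda p: parse_score(m_pipe, p))

# Joint: search over morphology and syntax together.
m_joint, p_joint = max(product(morph_candidates, parse_candidates),
                       key=lambda mp: morph_score(mp[0]) * parse_score(*mp))

print("pipeline:", m_pipe, p_pipe)    # Nominative, subject-attachment
print("joint:   ", m_joint, p_joint)  # Accusative, object-attachment
```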
  • Natural language processing and information retrieval methods for intellectual property analysis
    (2014) Jochim, Charles; Schütze, Hinrich (Prof. Dr.)
    More intellectual property information is generated now than ever before. The accumulation of intellectual property data, further complicated by this continued increase in production, makes it imperative to develop better methods for archiving and, more importantly, for accessing this information. Information retrieval (IR) is a standard technique used for efficiently accessing information in such large collections. The most prominent example comprising a vast amount of data is the World Wide Web, where current search engines already satisfy user queries by immediately providing an accurate list of relevant documents. However, IR for intellectual property is neither as fast nor as accurate as what we expect from an Internet search engine. In this thesis, we explore how to improve information access in intellectual property collections by combining the previously mentioned IR techniques with advanced natural language processing (NLP) techniques. The information in intellectual property is encoded in text (i.e., language), and we expect that by adding better language processing to IR we can better understand and access the data. NLP is quite a varied field encompassing a number of solutions for improving the understanding of language input. We concentrate more specifically on the NLP tasks of statistical machine translation, information extraction, named entity recognition (NER), sentiment analysis, relation extraction, and text classification. Searching for intellectual property, specifically patents, is a difficult retrieval task where standard IR techniques have had only moderate success. The difficulty of this task only increases when presented with multilingual collections, as is the case with patents. We present an approach for improving retrieval performance on a multilingual patent collection by using machine translation (an active research area in NLP) to translate patent queries before concatenating these parallel translations into a multilingual query. Even after retrieving an intellectual property document, however, we still face the problem of extracting the relevant information needed. We would like to improve our understanding of the complex intellectual property data by uncovering latent information in the text. We do this by identifying citations in a collection of scientific literature and classifying them by their citation function. This classification is successfully carried out by exploiting some characteristics of the citation text, including features extracted via sentiment analysis, NER, and relation extraction. By assigning labels to citations we can better understand the relationships between intellectual property documents, which can be valuable information for IR or other applications.
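The query-translation idea described above (translate the query and concatenate the parallel translations into one multilingual query before retrieval) can be sketched as follows. The translate() callable, language codes, and example query are placeholders, not the MT or retrieval systems used in the thesis.

```python
# Minimal sketch of multilingual query construction, assuming a generic
# translate() service; this is illustrative, not the thesis implementation.
from typing import Callable, List

def build_multilingual_query(query: str,
                             source_lang: str,
                             target_langs: List[str],
                             translate: Callable[[str, str, str], str]) -> str:
    """Translate the query into each collection language and concatenate
    the parallel translations into a single multilingual query string."""
    parts = [query]
    for lang in target_langs:
        parts.append(translate(query, source_lang, lang))
    return " ".join(parts)

# Example with a dummy translator (a real system would call an SMT/NMT model).
dummy = lambda text, src, tgt: f"[{tgt}:{text}]"
print(build_multilingual_query("rotor blade coating", "en", ["de", "fr"], dummy))
```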
  • Structurally informed methods for improved sentiment analysis
    (2017) Kessler, Stefanie Wiltrud; Kuhn, Jonas (Prof. Dr.)
    Sentiment analysis deals with methods to automatically analyze opinions in natural language texts, e.g., product reviews. Such reviews contain a large number of fine-grained opinions, but to automatically extract detailed information it is necessary to handle a wide variety of verbalizations of opinions. The goal of this thesis is to develop robust, structurally informed models for sentiment analysis which address challenges that arise from structurally complex verbalizations of opinions. In this thesis, we look at two examples of such verbalizations that benefit from including structural information in the analysis: negation and comparisons. Negation directly influences the polarity of sentiment expressions, e.g., while "good" is positive, "not good" expresses a negative opinion. We propose a machine learning approach that uses information from dependency parse trees to determine whether a sentiment word is in the scope of a negation expression. Comparisons like "X is better than Y" are the main topic of this thesis. We present a machine learning system for the task of detecting the individual components of comparisons: the anchor or predicate of the comparison, the entities that are compared, which aspect they are compared in, and which entity is preferred. Again, we use structural context from a dependency parse tree to improve the performance of our system. We discuss two ways of addressing the issue of limited availability of training data for our system. First, we create a manually annotated corpus of comparisons in product reviews, the largest such resource available to date. Second, we use the semi-supervised method of structural alignment to expand a small seed set of labeled sentences with similar sentences from a large set of unlabeled sentences. Finally, we work on the task of producing a ranked list of products that complements the isolated prediction of ratings and supports the user in the process of decision making. We demonstrate how we can use the information from comparisons to rank products and evaluate the result against two conceptually different external gold standard rankings.
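A minimal sketch of the kind of dependency-based features mentioned above for negation scope: given a negation cue and a sentiment word, check whether the sentiment word lies in the subtree governed by the word the cue attaches to. spaCy is used here only as a stand-in parser, and the feature set is illustrative rather than the thesis system.

```python
# Illustrative dependency-tree features for negation scope (not the thesis system).
# Requires: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def in_scope_features(doc, neg_idx, sent_idx):
    neg, sent = doc[neg_idx], doc[sent_idx]
    head = neg.head                              # the word the negation cue attaches to
    subtree_ids = {t.i for t in head.subtree}    # everything governed by that word
    return {
        "surface_distance": abs(sent_idx - neg_idx),
        "sentiment_in_negated_subtree": sent.i in subtree_ids,
        "sentiment_dep_label": sent.dep_,
        "negated_head_pos": head.pos_,
    }

doc = nlp("The camera is not good at all.")
neg_idx = [t.i for t in doc if t.dep_ == "neg"][0]    # the cue "not"
sent_idx = [t.i for t in doc if t.text == "good"][0]  # the sentiment word
print(in_scope_features(doc, neg_idx, sent_idx))
```

In a classifier such features would be combined with lexical and polarity features; here they merely illustrate how tree structure, rather than surface distance alone, can signal whether a sentiment word is negated.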
  • Challenges of computational social science analysis with NLP methods
    (2022) Dayanik, Erenay; Padó, Sebastian (Prof. Dr.)
    Computational Social Science (CSS) is an emerging research area at the intersection of social science and computer science, where problems of societal relevance can be addressed by novel computational methods. With the recent advances in machine learning and natural language processing as well as the availability of textual data, CSS has opened up new possibilities, but also methodological challenges. In this thesis, we present a line of work on developing methods and addressing challenges in terms of data annotation and modeling for computational political science and social media analysis, two highly popular and active research areas within CSS. In the first part of the thesis, we focus on a use case from computational political science, namely Discourse Network Analysis (DNA), a framework that aims at analyzing the structures behind complex societal discussions. We investigate how this style of analysis, which is traditionally performed manually, can be automated. We start by providing a requirement analysis outlining a roadmap to decompose the complex DNA task into several conceptually simpler sub-tasks. Then, we introduce NLP models with various configurations to automate two of the sub-tasks given by the requirement analysis, namely claim detection and classification, based on different neural network architectures ranging from unidirectional LSTMs to Transformer-based architectures. In the second part of the thesis, we shift our focus to fairness, a central concern in CSS. Our goal in this part of the thesis is to analyze and improve the performance of NLP models used in CSS in terms of fairness and robustness while maintaining their overall performance. With that in mind, we first analyze the above-mentioned claim detection and classification models and propose techniques to improve model fairness and overall performance. After that, we broaden our focus to social media analysis, another highly active subdomain of CSS. Here, we study text classification in the presence of correlated attributes, which pose an important but often overlooked challenge to model fairness. Our last contribution is to discuss the limitations of current statistical methods for bias identification, to propose a multivariate regression-based approach, and to show, through experiments on social media data, that it can be used as a complementary method for bias identification and analysis tasks. Overall, our work takes a step towards increasing the understanding of the challenges of computational social science. We hope that both political scientists and NLP scholars can make use of the insights from this thesis in their research.
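A small sketch of the multivariate idea behind the bias-identification contribution described above: regress a per-example outcome on several attributes jointly, so that correlated attributes are not credited for each other's effect. The synthetic data, attribute names, and the use of ordinary linear regression are illustrative assumptions, not the thesis setup.

```python
# Illustrative contrast between univariate and multivariate bias analysis.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1000
gender = rng.integers(0, 2, n)                           # attribute A (hypothetical)
dialect = (gender + (rng.random(n) < 0.3)).clip(0, 1)    # attribute B, correlated with A
# Suppose only `dialect` truly drives the model's per-example error.
error = 0.10 + 0.15 * dialect + rng.normal(0, 0.05, n)

# Univariate view: gender appears to matter only because it correlates with dialect.
print("univariate gender coef:",
      LinearRegression().fit(gender.reshape(-1, 1), error).coef_[0])

# Multivariate view: with both attributes in the regression, the spurious
# gender effect shrinks while the dialect effect remains.
X = np.column_stack([gender, dialect])
print("joint coefs [gender, dialect]:",
      LinearRegression().fit(X, error).coef_)
```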
  • Decoding strategies for syntax-based statistical machine translation
    (2015) Braune, Fabienne; Maletti, Andreas (Dr.)
    Provided with a sentence in an input language, a human translator produces a sentence in the desired target language. The advances in artificial intelligence in the 1950s led to the idea of using machines instead of humans to generate translations. Based on this idea, the field of Machine Translation (MT) was created. The first MT systems aimed to map input text into the target translation through the application of hand-crafted rules. While this approach worked well for specific language pairs in restricted domains, it was hardly extendable to new languages and domains because of the huge amount of human effort necessary to create new translation rules. The increase of computational power enabled Statistical Machine Translation (SMT) in the late 1980s, which addressed this problem by learning translation units automatically from large text collections. Statistical machine translation can be divided into several paradigms. Early systems modeled translation between words, while later work extended these to sequences of words called phrases. A common point between word- and phrase-based SMT is that the translation process takes place sequentially, which is not well suited to translating between languages where words need to be reordered over (potentially) long distances. Such reorderings led to the implementation of SMT systems based on formalisms that allow translation to proceed recursively instead of sequentially. In these systems, called syntax-based systems, the translation units are modeled with formal grammar productions and translation is performed by assembling the productions of these grammars. This thesis contributes to the field of syntax-based SMT in two ways: (i) the applicability of a new grammar formalism is tested by building the first SMT system based on the local Multi Bottom-Up Tree Transducer (l-MBOT), and (ii) new ways to integrate linguistic annotations into the translation model (instead of the grammar rules) of syntax-based systems are developed.
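To make the contrast between sequential and recursive translation concrete, here is a toy grammar-based translator in the spirit of syntax-based SMT: rules pair source and target patterns, and a nonterminal slot is translated recursively, which moves material over arbitrary distances. This is a generic synchronous-grammar sketch with invented rules, not the l-MBOT formalism developed in the thesis.

```python
# Toy recursive, rule-based translation (German -> English); purely illustrative.
# Rules map a source tuple (with "X" as a recursion slot) to a target tuple.
RULES = [
    (("ich", "habe", "X", "gesehen"), ("i", "have", "seen", "X")),
    (("den", "roten", "Ball"), ("the", "red", "ball")),
]

def translate(tokens):
    tokens = tuple(tokens)
    for src, tgt in RULES:
        if "X" in src:
            i = src.index("X")
            prefix, suffix = src[:i], src[i + 1:]
            # Match the fixed material around the slot.
            if tokens[:len(prefix)] == prefix and (not suffix or tokens[-len(suffix):] == suffix):
                inner = tokens[len(prefix):len(tokens) - len(suffix)]
                sub = translate(inner)            # recursive translation of the slot
                j = tgt.index("X")
                return tgt[:j] + sub + tgt[j + 1:]
        elif tokens == src:
            return tgt
    return tokens                                  # fall back: copy unknown material

print(" ".join(translate("ich habe den roten Ball gesehen".split())))
# -> "i have seen the red ball": the verb is placed by the recursive rule,
#    not by stepping through the source sequentially.
```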
  • Effective active learning for complex natural language processing tasks
    (2013) Laws, Florian; Schütze, Hinrich (Prof. Dr.)
    Supervised machine learning is a widely used approach to natural language processing tasks. However, supervised learning needs large amounts of labeled training data, which must be annotated in a time-consuming and expensive process. Active learning is a strategy to reduce this annotation effort by setting up an interactive process in which the machine learning system iteratively selects data for annotation. By selecting only data that the system considers informative, this strategy promises a significant reduction of the data needed for training. In this thesis, we investigate the application of active learning to key natural language processing tasks. We investigate selection strategies for “informative” training examples for two key NLP tasks: named entity recognition and coreference resolution. We show that active learning can deliver a large reduction in annotation effort for these NLP tasks. However, in cases of unfortunate initialization, active learning can suffer from slow learning progress on infrequent classes: the missed cluster effect. We show that active learning can be made resilient against this phenomenon by co-selecting examples that occur together in a natural context (e.g. a sentence). We also apply this strategy to the selection of examples for coreference annotation and demonstrate for the first time a successful active learning approach to coreference resolution. We also monitor training progress during data annotation. We investigate a method to estimate performance without additional labeled test data. While this method is not reliable for stopping at a performance threshold, we can use it to define effective criteria to stop when performance for a given system and dataset is close to optimal. Finally, we investigate crowdsourcing as a complementary cost reduction approach that aims to reduce the per-example cost by outsourcing annotation over the web. We propose strategies to mitigate the higher mistake rates of crowdsourcing annotators and present a successful combination of active learning with crowdsourcing.
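A minimal pool-based active learning loop with uncertainty sampling, illustrating the selection strategy discussed above on synthetic data. The classifier, dataset, and number of rounds are arbitrary choices for the sketch; the sentence-level co-selection and crowdsourcing components of the thesis are not shown.

```python
# Pool-based active learning with uncertainty sampling (illustrative sketch).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
# Seed set with a few examples from each class; the rest forms the unlabeled pool.
labeled = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])
pool = [i for i in range(len(y)) if i not in labeled]

clf = LogisticRegression(max_iter=1000)
for _ in range(20):                               # 20 simulated annotation rounds
    clf.fit(X[labeled], y[labeled])
    proba = clf.predict_proba(X[pool])
    # Uncertainty sampling: pick the pool example the model is least sure about.
    margins = np.abs(proba[:, 0] - proba[:, 1])
    pick = pool[int(np.argmin(margins))]
    labeled.append(pick)                          # "annotate" it (reveal its gold label)
    pool.remove(pick)

print("labeled examples used:", len(labeled))
print("accuracy on remaining pool:", round(clf.score(X[pool], y[pool]), 3))
```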
  • Automatic term extraction for conventional and extended term definitions across domains
    (2020) Hätty, Anna; Schulte im Walde, Sabine (apl. Prof. Dr.)
    A terminology is the entirety of concepts which constitute the vocabulary of a domain or subject field. Automatically identifying various linguistic forms of terms in domain-specific corpora is an important basis for further natural language processing tasks, such as ontology creation or, in general, domain knowledge acquisition. As a brief illustration of terms and domains, expressions like 'hammer', 'jigsaw', 'cordless screwdriver' or 'to drill' can be considered terms in the domain of DIY ('do-it-yourself'); 'beaten egg whites' or 'electric blender' are terms in the domain of cooking. These examples cover different linguistic forms: simple terms like 'hammer' and complex terms like 'beaten egg whites', which consist of several simple words. However, although these words might seem to be obvious examples of terms, in many cases the decision to distinguish a term from a ‘non-term’ is not straightforward. There is no common, established way to define terms, but there are multiple terminology theories and diverse approaches to conducting human annotation studies. In addition, terms can be perceived to be more or less terminological, and the hard distinction between term and ‘non-term’ can be unsatisfying. Beyond term definition, the automatic extraction of terms poses further challenges, considering that complex terms as well as simple terms need to be identified automatically by an extraction system. The extraction of complex terms can profit from exploiting information about their constituents, because complex terms might be infrequent as a whole. Simple terms might be more frequent, but they are especially prone to ambiguity. If a system treats a presumed term occurrence in text as a term even though it actually carries a different meaning, this can lead to wrong extraction results. Thus, term complexity and ambiguity are major challenges for automatic term extraction. The present work describes novel theoretical and computational models for these aspects. It can be grouped into three broad categories: term definition studies, conventional automatic term extraction models, and extended automatic term extraction models that are based on fine-grained term frameworks. Term complexity and ambiguity are special foci here. In this thesis, we report on insights and improvements on these theoretical and computational models for terminology: We find that terms are concepts that can intuitively be understood by lay people. We test more fine-grained term characterization frameworks that go beyond the conventional term/‘non-term’ distinction. We are the first to describe and model term ambiguity as gradual meaning variation between general and domain-specific language, and we use the resulting representations to prevent the ambiguity-related errors typically made by term extraction systems. We develop computational models that exploit the influence of term constituents on the prediction of complex terms. We especially tackle closed compound terms, which are a frequent complex term type in German. Finally, we find that we can use similar strategies for modeling term complexity and ambiguity computationally for conventional and extended term extraction.
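As a small illustration of the frequency-based intuition behind term extraction discussed above, the following corpus-contrast score ranks candidates by how much more frequent they are in a domain corpus than in general language. This is a standard baseline idea with toy corpora, not the models developed in the thesis.

```python
# Toy corpus-contrast termhood score; corpora and candidates are illustrative.
from collections import Counter

domain_corpus = "drill the wall then drill the plank with the cordless screwdriver".split()
general_corpus = "the cat sat on the wall and looked at the garden".split()

def relative_freq(counter, total, word, alpha=1.0, vocab_size=10000):
    # Add-alpha smoothing so unseen words do not cause division by zero.
    return (counter[word] + alpha) / (total + alpha * vocab_size)

dom, gen = Counter(domain_corpus), Counter(general_corpus)
dom_total, gen_total = sum(dom.values()), sum(gen.values())

def termhood(word):
    # Higher score = relatively more typical of the domain corpus.
    return relative_freq(dom, dom_total, word) / relative_freq(gen, gen_total, word)

for w in ["drill", "screwdriver", "wall", "the"]:
    print(f"{w:12s} {termhood(w):.2f}")
```

Real systems combine such scores with constituent-level and contextual evidence, which is where the compound modelling and ambiguity representations described in the abstract come in.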
  • A computational stylistics of poetry : distant reading and modeling of German and English verse
    (2023) Haider, Thomas; Kuhn, Jonas (Prof. Dr.)
    This doctoral thesis is about the computational modeling of stylistic variation in poetry. As ‘a computational stylistics’ it examines the forms, social embedding, and the aesthetic potential of literary texts by means of computational and statistical methods, ranging from simple counting through information-theoretic measures to neural network models, including experiments with representation learning, transfer learning, and multi-task learning. We built small corpora to manually annotate a number of phenomena that are relevant for poetry, such as meter, rhythm, rhyme, and also emotions and aesthetic judgements that are elicited in the reader. A strict annotation workflow allows us to better understand these phenomena, from how to conceptualize them to which problems arise when trying to annotate them on a larger scale. Furthermore, we built large corpora to discover patterns across a wide historical, aesthetic and linguistic range, with a focus on German and English writing, encompassing public domain texts from the late 16th century up into the early 20th century. These corpora are published with metadata and reliable automatic annotation of part-of-speech tags, syllable boundaries, meter and verse measures. This thesis contains chapters on diachronic variation, aesthetic emotions, and modeling prosody, including experiments that also investigate the interaction between them. We look at how the diction of poets in different languages changed over time, and which topics and metaphors were or became popular, both in response to aesthetic considerations and to the political climate of the time. We investigate which emotions are elicited in readers when they read poetry, how that relates to aesthetic judgements, how we can annotate such emotions, and how to train models to learn them. Finally, we present experiments on how to annotate prosodic devices on a large scale, how well computational models can be trained to predict prosody from text, and how informative those devices are for each other.
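As a tiny example of the "simple counting to information-theoretic measures" end of the methodological range mentioned above, the sketch below computes a type-token ratio and unigram entropy for two public-domain verse lines; it is illustrative only and not an analysis from the thesis.

```python
# Word frequencies and unigram entropy for a short verse sample (illustrative).
import math
from collections import Counter

lines = [
    "tyger tyger burning bright",
    "in the forests of the night",
]
tokens = " ".join(lines).split()
counts = Counter(tokens)
total = len(tokens)

print("type-token ratio:", round(len(counts) / total, 3))
entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
print("unigram entropy (bits):", round(entropy, 3))
```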
  • A joint translation model with integrated reordering
    (2012) Durrani, Nadir; Schütze, Hinrich (Prof. Dr.)
    This dissertation aims to combine the benefits and remedy the flaws of two popular frameworks in statistical machine translation, namely phrase-based MT and N-gram-based MT. Phrase-based MT advanced the state of the art by translating phrases rather than words. By memorizing phrases, phrasal MT is able to learn local reorderings and to handle other local dependencies such as insertions and deletions. Inter-phrasal reorderings are handled through the lexicalized reordering model, which remains the state-of-the-art model for reordering in phrase-based SMT to date. However, phrase-based MT has some drawbacks:
    • Dependencies across phrases are not directly represented in the translation model
    • Discontinuous phrases cannot be represented and used
    • The reordering model is not designed to handle long-range reorderings
    • Search and modeling problems require the use of a hard reordering limit
    • The presence of many different equivalent segmentations increases the search space
    • Source word deletion and target word insertion outside phrases are not allowed during decoding
    N-gram-based MT exists as an alternative to the more commonly used phrase-based MT. Unlike phrasal MT, N-gram-based MT uses minimal translation units called tuples. Using minimal translation units enables N-gram systems to avoid the spurious phrasal segmentation problem of phrase-based MT. However, it also gives up the ability to memorize dependencies such as short reorderings that are local to phrases. Reordering in N-gram MT is carried out by source linearization and POS-based rewrite rules. The search graph for decoding is constructed as a preprocessing step using these rules. N-gram-based MT has the following drawbacks:
    • Only the pre-calculated orderings are hypothesized during decoding
    • The N-gram model cannot use lexical triggers
    • Long-distance reorderings cannot be performed
    • Unaligned target words cannot be handled
    • Using tuples presents a more difficult search problem than in phrase-based SMT
    In this dissertation, we present a novel machine translation model based on a joint probability model, which represents the translation process as a linear sequence of operations. Like the N-gram model, our model uses minimal translation units, but it has the ability to memorize like the phrase-based model. Unlike the N-gram model, our operation sequence includes not only translation but also reordering operations. The strong coupling of reordering and translation into a single generative story provides a mechanism to better restrict the positions to which a word or phrase can be moved, and is able to handle short- and long-distance reorderings effectively.
    This thesis remedies the problems in phrasal MT and N-gram-based MT by making the following contributions:
    • We propose a model that handles both local and long-distance dependencies uniformly and effectively
    • Our model is able to handle discontinuous source-side units
    • During decoding, we are able to remove the hard reordering constraint that is necessary in phrase-based systems
    • Like the phrase-based model and unlike the N-gram model, our model exhibits the ability to memorize phrases
    • In comparison to the N-gram-based model, our model performs search over all possible reorderings and has the ability to learn lexical triggers and apply them to unseen contexts
    A secondary goal of this thesis is to challenge the belief that conditional probability models work better than joint probability models in SMT and that source-side context is less helpful in the translation process.
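To make the operation-sequence idea concrete, the toy derivation below writes one German-English translation as a single linear sequence that interleaves translation and reordering operations; an n-gram model over such sequences would then score both jointly. The operation inventory and the unigram "model" are heavily simplified illustrations, not the exact formalization or training procedure of the thesis.

```python
# Toy operation-sequence derivation for "er hat das Buch gelesen" -> "he has read the book".
from collections import Counter

derivation = [
    ("GENERATE", "er", "he"),
    ("GENERATE", "hat", "has"),
    ("INSERT_GAP",),                    # skip "das Buch" for now
    ("GENERATE", "gelesen", "read"),    # translate the verb first
    ("JUMP_BACK",),                     # return to the open gap
    ("GENERATE", "das", "the"),
    ("GENERATE", "Buch", "book"),
]

# A heavily simplified unigram "model" over operation types, estimated from this
# single derivation; a real system would train an n-gram model over operation
# sequences extracted from word-aligned parallel text.
ops = [step[0] for step in derivation]
counts = Counter(ops)
total = sum(counts.values())
for op, c in counts.items():
    print(f"P({op}) = {c / total:.2f}")

target = " ".join(step[2] for step in derivation if step[0] == "GENERATE")
print("target:", target)
```

Because the reordering steps (INSERT_GAP, JUMP_BACK) sit in the same sequence as the translation steps, the model's history conditions translation choices on reordering decisions and vice versa, which is the coupling the abstract describes.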