05 Fakultät Informatik, Elektrotechnik und Informationstechnik

Permanent URI for this collection: https://elib.uni-stuttgart.de/handle/11682/6

  • Item (Open Access)
    Cross-lingual frame comparability : computational and linguistic perspectives
    (2023) Sikos, Jennifer; Padó, Sebastian (Prof. Dr.)
    Frames are descriptions of commonplace scenarios or events. Because they describe everyday scenes, such as buying or eating, it seems reasonable to assume that many frames in one language would carry over directly to other languages. However, how a scene is realized can be highly specific to a culture, and it is still an open research question how well (and how many) frames actually apply across languages. This thesis concerns cross-lingual frame comparability, that is, the degree to which a frame can be transferred from one language to another. It addresses several aspects of frame comparability: what frame comparability is; how a computational system can measure it; and how it affects cross-lingual models of frames.
  • Item (Open Access)
    A computational stylistics of poetry : distant reading and modeling of German and English verse
    (2023) Haider, Thomas; Kuhn, Jonas (Prof. Dr.)
    This doctoral thesis is about the computational modeling of stylistic variation in poetry. As ‘a computational stylistics’ it examines the forms, social embedding, and the aesthetic potential of literary texts by means of computational and statistical methods, ranging from simple counting through information-theoretic measures to neural network models, including experiments with representation learning, transfer learning, and multi-task learning. We built small corpora to manually annotate a number of phenomena that are relevant for poetry, such as meter, rhythm, rhyme, and also emotions and aesthetic judgements that are elicited in the reader. A strict annotation workflow allows us to better understand these phenomena, from how to conceptualize them to which problems arise when trying to annotate them on a larger scale. Furthermore, we built large corpora to discover patterns in a wide historical, aesthetic and linguistic range, with a focus on German and English writing, encompassing public domain texts from the late 16th century to the early 20th century. These corpora are published with metadata and reliable automatic annotation of part-of-speech tags, syllable boundaries, meter and verse measures. This thesis contains chapters on diachronic variation, aesthetic emotions, and modeling prosody, including experiments that also investigate the interaction between them. We look at how the diction of poets in different languages changed over time, and which topics and metaphors were or became popular, both as a reaction to aesthetic considerations and to the political climate of the time. We investigate which emotions are elicited in readers when they read poetry, how that relates to aesthetic judgements, and how we can annotate such emotions and then train models to learn them. Finally, we present experiments on how to annotate prosodic devices on a large scale, how well we can train computational models to predict prosody from text, and how informative those devices are for each other.
  • Item (Open Access)
    Automatic term extraction for conventional and extended term definitions across domains
    (2020) Hätty, Anna; Schulte im Walde, Sabine (apl. Prof. Dr.)
    A terminology is the entirety of concepts which constitute the vocabulary of a domain or subject field. Automatically identifying various linguistic forms of terms in domain-specific corpora is an important basis for further natural language processing tasks, such as ontology creation or, in general, domain knowledge acquisition. As a short overview for terms and domains, expressions like 'hammer', 'jigsaw', 'cordless screwdriver' or 'to drill' can be considered as terms in the domain of DIY (’do-it-yourself’); 'beaten egg whites' or 'electric blender' as terms in the domain of cooking. These examples cover different linguistic forms: simple terms like 'hammer' and complex terms like 'beaten egg whites', which consist of several simple words. However, although these words might seem to be obvious examples of terms, in many cases the decision to distinguish a term from a ‘non-term’ is not straightforward. There is no common, established way to define terms, but there are multiple terminology theories and diverse approaches to conduct human annotation studies. In addition, terms can be perceived to be more or less terminological, and the hard distinction between term and ‘non-term’ can be unsatisfying. Beyond term definition, when it comes to the automatic extraction of terms, there are further challenges, considering that complex terms as well as simple terms need to be automatically identified by an extraction system. The extraction of complex terms can profit from exploiting information about their constituents because complex terms might be infrequent as a whole. Simple terms might be more frequent, but they are especially prone to ambiguity. If a system considers an assumed term occurrence in text, which actually carries a different meaning, this can lead to wrong term extraction results. Thus, term complexity and ambiguity are major challenges for automatic term extraction. 
The present work describes novel theoretical and computational models for the considered aspects. It can be grouped into three broad categories: term definition studies, conventional automatic term extraction models, and extended automatic term extraction models that are based on fine-grained term frameworks. Term complexity and ambiguity are special foci here. In this thesis, we report on insights and improvements on these theoretical and computational models for terminology: We find that terms are concepts that can intuitively be understood by lay people. We test more fine-grained term characterization frameworks that go beyond the conventional term/‘non-term’ distinction. We are the first to describe and model term ambiguity as gradual meaning variation between general and domain-specific language, and we use the resulting representations to prevent ambiguity-related errors typically made by term extraction systems. We develop computational models that exploit the influence of term constituents on the prediction of complex terms. We especially tackle German closed compound terms, which are a frequent complex term type in German. Finally, we find that we can use similar strategies for modeling term complexity and ambiguity computationally for conventional and extended term extraction.
  • Item (Open Access)
    Prosodic event detection for speech understanding using neural networks
    (2020) Stehwien, Sabrina; Vu, Ngoc Thang (Prof. Dr.)
  • Item (Open Access)
    Linguistically-informed modeling of potentials for misunderstanding
    (2024) Anthonio, Talita; Roth, Michael (Dr.)
    Misunderstandings are prevalent in communication. While there is a large amount of work on misunderstandings in conversations, little attention has been given to misunderstandings that arise from text. This is because readers and writers typically do not interact with one another. However, texts that potentially evoke different interpretations can be identified by certain linguistic phenomena, especially those related to implicitness or underspecificity. In Computational Linguistics, there is a considerable amount of work on such linguistic phenomena and their computational modeling. However, most of these studies do not examine when these phenomena cause misunderstandings. This is a crucial aspect, because ambiguous language does not always cause misunderstanding. In this thesis, we provide the first steps to develop a computational model that can automatically identify whether an instructional text is likely to cause misunderstandings ("potentials for misunderstanding"). To achieve this goal, we build large corpora with potentials for misunderstanding in instructional texts. We follow previous work and define misunderstandings as the existence of multiple, plausible interpretations. As these interpretations may be similar in meaning to one another, we specifically define misunderstandings as the existence of multiple plausible, but conflicting interpretations. Therefore, we find texts that potentially cause misunderstanding by looking for passages that have several plausible interpretations that conflict with one another. We automatically identify such passages from revision histories of instructional texts, based on the finding that we can find potentials for misunderstanding by looking into older versions of a text and their clarifications in newer versions. We specifically look for unclarified sentences that contain implicit and underspecified language, and study their clarifications. 
Through several analyses and crowdsourcing studies, we demonstrate that our corpora provide valuable resources on potentials for misunderstanding, as we find that revised sentences are better than their previous versions. Furthermore, we show that the provided corpora can be used for several computational modeling purposes. The three resulting models can be combined to identify whether a text potentially causes misunderstanding or not. More specifically, we first develop a model that can detect improvements in a text, even when they are subtle and closely dependent on the context. In an analysis, we verify that the judgements from the model on what makes a better or equally good sentence overlap with the judgements by humans. Secondly, we build a transformer-based language model that automatically resolves potentials for misunderstanding caused by implicit references. We find that modeling discourse context improves the performance of this model. In an analysis, we find that the best model is not only capable of generating the gold resolution, but also capable of generating several plausible resolutions for implicit references in instructional text. We use this finding to build a large dataset with plausible and implausible resolutions of implicit and underspecified elements. We use the resulting dataset for a third computational task, in which we train a model to automatically distinguish between plausible and implausible resolutions for implicit and underspecified elements. We show that this model and the provided dataset can be used to find passages with several plausible clarifications. Since our definition of misunderstanding focuses on conflicting clarifications, we conduct a final study to conclude the thesis. In particular, we provide and validate a crowdsourcing set-up that allows us to find the cases with conflicting, plausible resolutions. 
The set-up and findings could be used in future research to directly train a model to identify passages with implicit elements that have conflicting resolutions.
  • Item (Open Access)
    Distributional analysis of entities
    (2022) Gupta, Abhijeet; Padó, Sebastian (Prof. Dr.)
    Arguably, one of the most important aspects of natural language processing is natural language understanding, which relies heavily on lexical knowledge. In computational linguistics, modelling lexical knowledge through distributional semantics has gained considerable popularity. However, the modelling is largely restricted to generic lexical categories (typically common nouns, adjectives, etc.), which are associated with coarse-grained information; e.g., the category country has a boundary, rivers and gold deposits. Comparatively less attention has been paid to modelling entities, which are associated with fine-grained real-world information; for instance, the entity Germany has precise properties such as (GDP - 3.6 trillion Euros), (GDP per capita - 44.5 thousand Euros) and (Continent - Europe). The lack of focus on entities and the inherent latency of information in distributional representations warrant greater efforts towards modelling entity-related phenomena and increasing the understanding of the information encoded within distributional representations. This work makes two contributions in that direction: (a) We introduce a semantic relation, Instantiation, which holds between entities and their categories, and model it distributionally to investigate the hypothesis that distributional distinctions exist between modelling entities and modelling categories within a semantic space. Our results show that in a semantic space: 1) entities and categories are quite distinct with respect to their distributional behaviour, geometry and linguistic properties; 2) the Instantiation relation is recoverable by distributional models; and 3) for lexical relational modelling purposes, categories are better represented by the centroids of their entities than by distributional representations constructed directly from corpora. 
(b) We also investigate the potential and limitations of distributional semantics for the purpose of Knowledge Base Completion, starting with the hypothesis that fine-grained knowledge is encoded in distributional representations of entities during their meaning construction. We show that: 1) fine-grained information of entities is encoded in distributional representations and can be extracted by simple data-driven supervised models as attribute-value pairs; 2) the models can predict the entire range of fine-grained attributes, as seen in a knowledge base, in one go; and, 3) a crucial factor in determining success in extracting this type of information is contextual support i.e., the extent of contextual information captured by a distributional model during meaning construction. Overall, this thesis takes a step towards increasing the understanding about entity meaning representations in a distributional setup, with respect to their modelling and the extent of knowledge inclusion during their meaning construction.
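The finding that categories are better represented by the centroids of their entities can be illustrated with a minimal Python sketch. The three-dimensional vectors, the entity names used as keys, and the helper names `centroid` and `cosine` are assumptions chosen purely for illustration; they are not data or code from the thesis.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy entity vectors (hypothetical values, not taken from any corpus).
entity_vectors = {
    "Germany": [0.9, 0.1, 0.2],
    "France": [0.8, 0.2, 0.1],
    "Japan": [0.7, 0.1, 0.3],
}

# Represent the category "country" by the centroid of its entities.
country_centroid = centroid(list(entity_vectors.values()))

# Each entity can then be compared against the category representation.
similarities = {name: cosine(vec, country_centroid)
                for name, vec in entity_vectors.items()}
```

In a real setting the vectors would come from a distributional model trained on corpora, and the centroid would be compared against a category vector learned directly from text.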
  • Item (Open Access)
    Computational models of word order
    (2022) Yu, Xiang; Kuhn, Jonas (Prof. Dr.)
    A sentence in our mind is not a simple sequence of words but a hierarchical structure. We put the sentence into a linear order when we utter it for communication. Linearization is the task of mapping the hierarchical structure of a sentence into its linear order. Our work is based on dependency grammar, which models the dependency relations between words, and the resulting syntactic representation is a directed tree structure. The popularity of dependency grammar in Natural Language Processing (NLP) benefits from its separation of structure order and linear order and its emphasis on syntactic functions. These properties facilitate a universal annotation scheme covering a wide range of languages used in our experiments. We focus on developing a robust and efficient computational model that finds the linear order of a dependency tree. We take advantage of deep learning models’ expressive power to encode the syntactic structures of typologically diverse languages robustly. We take a graph-based approach that combines a simple bigram scoring model and a greedy decoding algorithm to search for the optimal word order efficiently. We use a divide-and-conquer strategy to reduce the search space, which restricts the output to be projective. We then resolve the restriction with a transition-based post-processing model. Apart from the computational models, we also study word order from a quantitative linguistic perspective. We examine the Dependency Length Minimization (DLM) hypothesis, which is believed to be a universal factor that affects the word order of every language. It states that human languages tend to order the words to minimize the overall length of dependency arcs, which reduces the cognitive burden of speaking and understanding. We demonstrate that DLM can explain every aspect of word order in a dependency tree, such as the direction of the head, the arrangement of sibling dependents, and the existence of crossing arcs (non-projectivity). 
Furthermore, we find that DLM not only shapes the general word order preferences but also motivates the occasional deviation from the preferences. Finally, we apply our model in the task of surface realization, which aims to generate a sentence from a deep syntactic representation. We implement a pipeline with five steps, (1) linearization, (2) function word generation, (3) morphological inflection, (4) contraction, and (5) detokenization, which achieved state-of-the-art performance.
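As a rough illustration of the Dependency Length Minimization idea described in this abstract, the following Python sketch sums the lengths of dependency arcs for two orderings of the same tree. The example sentence, its dependency analysis, and the function name `total_dependency_length` are assumptions made for illustration and are not taken from the thesis.

```python
def total_dependency_length(order, head_of):
    """Sum of |position(word) - position(head)| over all non-root words."""
    position = {word: index for index, word in enumerate(order)}
    return sum(
        abs(position[word] - position[head])
        for word, head in head_of.items()
        if head is not None  # the root has no incoming arc
    )

# Hypothetical dependency tree: each word maps to its head; the root maps to None.
head_of = {
    "John": "threw",
    "threw": None,
    "out": "threw",
    "the": "trash",
    "trash": "threw",
}

# Two linearizations of the same tree.
order_a = ["John", "threw", "out", "the", "trash"]
order_b = ["John", "threw", "the", "trash", "out"]

length_a = total_dependency_length(order_a, head_of)
length_b = total_dependency_length(order_b, head_of)
```

Under this toy analysis, `order_a` yields a total arc length of 6 and `order_b` a total of 7, so the DLM hypothesis would predict a preference for the first ordering.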
  • Item (Open Access)
    When few-shot fails : low-resource, domain-specific text classification with transformers
    (2024) Wertz, Lukas; Klinger, Roman (Prof. Dr.)
    Text classification (TC) is a foundational technique in natural language processing (NLP). The ability to automatically classify texts into predetermined categories plays a critical role in numerous applications. With recent advances in NLP owing to powerful pre-trained language models, there is also a rapidly growing interest in using TC for difficult, real-world problems in both business and industrial sectors as well as the scientific community. Using pre-trained language models, modern TC systems achieve state-of-the-art accuracy on benchmark datasets using only a handful of training examples for effective fine-tuning. However, when faced with challenging, low-resource datasets from specialised language domains, relying on small labeled datasets to train classification models is often not effective. When few-shot classification fails, we need to employ more traditional NLP techniques which increase the amount of training data in order to train accurate models. We find that existing approaches for expanding the training data are often unsuitable for deep, transformer-based classification networks. In addition, these approaches are usually tested on standard benchmark datasets, which do not properly reflect the complexity of real-world classification tasks. Consequently, there is a need for effective data augmentation or selection techniques that allow TC systems to handle complex tasks relevant for modern industrial or business applications. Our primary goal in this work is to design effective data collection and augmentation systems for TC on low-resource datasets from technical or otherwise non-standard language domains that more closely resemble real-world applications. As a consequence, our experiments also demonstrate the limits of existing approaches and strongly motivate the need for more complex, domain-specific benchmark datasets. First, we investigate the use of generative language models for data augmentation. 
We find that simple language edits are smoothed out by the language model and fine-tuning on the small training data proves unstable. As such, we propose a simple generation scheme, which uses specific model prompts built from the data. Second, we employ a variety of existing selection strategies for active learning (AL). Since we find that no strategy consistently outperforms a random selection across datasets, we design an approach that combines the strategies via reinforcement learning. This allows learning which information source for data selection is most valuable and greatly improves the classification performance in early stages of the active learning process. Overall, we find that TC is still a challenge in NLP, in particular when systems have to be designed with modern application contexts in mind. Our experiments with various baselines show that existing augmentation techniques and AL strategies cannot easily be transferred to current architectures, increasingly complex tasks or domain-specific language. Consequently, the approaches presented in this work are an important step towards TC for sparse, complex datasets and real-world challenges.
  • Item (Open Access)
    Human and computational measurement of lexical semantic change
    (2023) Schlechtweg, Dominik; Schulte im Walde, Sabine (apl. Prof. Dr.)
    Human language changes over time. This change occurs on several linguistic levels such as grammar, sound or meaning. The study of meaning changes on the word level is often called 'Lexical Semantic Change' (LSC) and is traditionally either approached from an onomasiological perspective asking by which words a meaning can be expressed, or a semasiological perspective asking which meanings a word can express over time. In recent years, the task of automatic detection of semasiological LSC from textual data has been established as a proper field of computational linguistics under the name of 'Lexical Semantic Change Detection' (LSCD). Two main factors have contributed to this development: (i) The 'digital turn' in the humanities has made large amounts of historical texts available in digital form. (ii) New computational models have been introduced that efficiently learn semantic aspects of words solely from text. One of the main motivations behind the work on LSCD is its application in historical semantics and historical lexicography, where researchers are concerned with the classification of words into categories of semantic change. Automatic methods have the advantage of producing semantic change predictions for large amounts of data in small amounts of time and could thus considerably decrease human efforts in the mentioned fields while being able to scan more data and thus to uncover more semantic changes, which are at the same time less biased towards ad hoc sampling criteria used by researchers. On the other hand, automatic methods may also be harmful when their predictions are biased, i.e., they may miss numerous semantic changes or label words as changing when they are not. Results produced in this way may then lead researchers to make empirically inadequate generalizations on semantic change. 
Hence, automatic change detection methods should not be trusted until they have been evaluated thoroughly and their predictions have been shown to reach an acceptable level of correctness. Despite the rapid growth of LSCD as a field, a solid evaluation of the wealth of proposed models was still missing at the onset of this thesis. The reasons were multiple, but most importantly there was no annotated benchmark test set available. This thesis is thus concerned with the process of providing such an evaluation for LSCD, including: the definition of the basic concepts and tasks; the development and validation of data annotation schemes with humans; the annotation of a multilingual benchmark test set; the evaluation of computational models on the benchmark, their analysis and improvement; and an application of the developed methods to showcase their usefulness in the targeted fields (historical semantics and lexicography).
  • Item (Open Access)
    Task generality in relation extraction
    (2024) Papay, Sean; Padó, Sebastian (Prof. Dr.)
    Relation extraction involves the identification of relations between entities in text. Many distinct tasks in natural language processing, including semantic role labeling, quotation analysis, and event extraction, can be categorized as instances of relation extraction, and share similar structures. However, despite the similarities between these tasks, modeling approaches tend to show little overlap, and model architectures designed for one type of relation extraction task can rarely be applied to others. This situation stands in contrast to other task paradigms common in natural language processing, such as text classification and text generation, wherein existing architectures tend to be highly generalizable to many distinct tasks within their paradigms. This dissertation investigates task generality for relation extraction, that is, the ability or inability of relation extraction model architectures to be successfully applied to diverse relation extraction tasks. To this end, we make a number of concrete contributions: First, we present a formal description language for specifying the properties of different relation extraction tasks, and introduce a software framework for developing model architectures which can automatically account for these properties. By delineating task-specific frontends from task-general backends, this framework enables task-general architectures to be easily adapted to the specifics of particular tasks. Next, we investigate task generality for span extraction, an important subtask of relation extraction. We identify architecture design choices which facilitate task generality, and go on to statistically analyze how different types of architectures generalize to different types of tasks, gleaning insights into which task properties, model properties, and interactions therebetween are important for generalization. 
Finally, we present a method for enforcing regular-language constraints on the outputs of a class of sequence labeling models. We show how constraints can be constructed which capture the specific structures of relation extraction tasks, such that label sequences can be interpreted as relations. Overall, this dissertation works towards making relation extraction more task-general, and we hope our contributions can spur further work in this direction.
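To make the idea of regular-language constraints on label sequences more concrete, here is a minimal Python sketch of constrained decoding. It is not the method from the dissertation: it assumes independent per-token label scores, a hand-written finite-state automaton for well-formed BIO sequences, and a simple dynamic program over automaton states. The function name `constrained_decode`, the automaton, and the scores are all hypothetical.

```python
def constrained_decode(scores, labels, transitions, start, accepting):
    """Return the highest-scoring label sequence accepted by the automaton.

    scores: one dict per token mapping label -> score.
    transitions: dict mapping (state, label) -> next state; missing keys are forbidden.
    """
    # best[state] = (total score, label path) of the best prefix ending in that state
    best = {start: (0.0, [])}
    for token_scores in scores:
        next_best = {}
        for state, (total, path) in best.items():
            for label in labels:
                target = transitions.get((state, label))
                if target is None:
                    continue  # label not allowed in this state
                candidate = total + token_scores[label]
                if target not in next_best or candidate > next_best[target][0]:
                    next_best[target] = (candidate, path + [label])
        best = next_best
    finals = [entry for state, entry in best.items() if state in accepting]
    if not finals:
        return None
    return max(finals, key=lambda entry: entry[0])[1]

# Automaton for BIO tagging: "I" may only follow "B" or "I".
labels = ["O", "B", "I"]
transitions = {
    ("out", "O"): "out", ("out", "B"): "in",
    ("in", "O"): "out", ("in", "B"): "in", ("in", "I"): "in",
}

# Hypothetical per-token scores; the unconstrained argmax would be I, I, O.
scores = [
    {"O": 0.1, "B": 0.3, "I": 0.6},
    {"O": 0.2, "B": 0.1, "I": 0.7},
    {"O": 0.8, "B": 0.1, "I": 0.1},
]

decoded = constrained_decode(scores, labels, transitions, "out", {"out", "in"})
```

Because the automaton forbids a span-internal `I` without a preceding `B`, the decoder returns `B, I, O` rather than the ill-formed unconstrained choice, so the output can always be read as a set of spans.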