OPUS - Online Publications of University Stuttgart

Browsing by Author "Kuhn, Jonas (Prof. Dr.)"

Now showing 1 - 15 of 15
  • Computational modelling of coreference and bridging resolution (Open Access)
    (2019) Rösiger, Ina; Kuhn, Jonas (Prof. Dr.)
  • Computational models of word order (Open Access)
    (2022) Yu, Xiang; Kuhn, Jonas (Prof. Dr.)
    A sentence in our mind is not a simple sequence of words but a hierarchical structure. We put the sentence into linear order when we utter it for communication. Linearization is the task of mapping the hierarchical structure of a sentence into its linear order. Our work is based on dependency grammar, which models the dependency relations between words; the resulting syntactic representation is a directed tree structure. The popularity of dependency grammar in Natural Language Processing (NLP) benefits from its separation of structural order and linear order and its emphasis on syntactic functions. These properties facilitate a universal annotation scheme covering a wide range of languages used in our experiments. We focus on developing a robust and efficient computational model that finds the linear order of a dependency tree. We take advantage of deep learning models’ expressive power to encode the syntactic structures of typologically diverse languages robustly. We take a graph-based approach that combines a simple bigram scoring model and a greedy decoding algorithm to search for the optimal word order efficiently. We use a divide-and-conquer strategy to reduce the search space, which restricts the output to be projective. We then resolve this restriction with a transition-based post-processing model. Apart from the computational models, we also study word order from a quantitative linguistic perspective. We examine the Dependency Length Minimization (DLM) hypothesis, which is believed to be a universal factor that affects the word order of every language. It states that human languages tend to order words so as to minimize the overall length of dependency arcs, which reduces the cognitive burden of speaking and understanding. We demonstrate that DLM can explain every aspect of word order in a dependency tree, such as the direction of the head, the arrangement of sibling dependents, and the existence of crossing arcs (non-projectivity). Furthermore, we find that DLM not only shapes the general word order preferences but also motivates the occasional deviation from these preferences. Finally, we apply our model to the task of surface realization, which aims to generate a sentence from a deep syntactic representation. We implement a pipeline with five steps, (1) linearization, (2) function word generation, (3) morphological inflection, (4) contraction, and (5) detokenization, which achieves state-of-the-art performance.
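
A minimal sketch of the Dependency Length Minimization measure referred to above, assuming the usual definition as the summed distance between each dependent and its head in the linear order; the toy sentence, head assignments, and function names are illustrative and not taken from the thesis:

```python
# Illustrative sketch: total dependency length of a linearization.
def total_dependency_length(order, heads):
    """order: list of tokens in linear order.
    heads: dict mapping each dependent to its head (root omitted)."""
    position = {tok: i for i, tok in enumerate(order)}
    return sum(abs(position[dep] - position[head]) for dep, head in heads.items())

# Toy tree: "the" -> "dog", and "dog", "quickly", "yesterday" all depend on "ran".
heads = {"the": "dog", "dog": "ran", "quickly": "ran", "yesterday": "ran"}

short_order = ["the", "dog", "quickly", "ran", "yesterday"]
long_order = ["quickly", "the", "dog", "yesterday", "ran"]

print(total_dependency_length(short_order, heads))  # 5: shorter arcs overall
print(total_dependency_length(long_order, heads))   # 8: longer arcs overall
```

Under the DLM hypothesis, orderings like the first one, with the smaller total, are the ones languages tend to prefer.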
  • A computational stylistics of poetry : distant reading and modeling of German and English verse (Open Access)
    (2023) Haider, Thomas; Kuhn, Jonas (Prof. Dr.)
    This doctoral thesis is about the computational modeling of stylistic variation in poetry. As ‘a computational stylistics’ it examines the forms, social embedding, and the aesthetic potential of literary texts by means of computational and statistical methods, ranging from simple counting through information-theoretic measures to neural network models, including experiments with representation learning, transfer learning, and multi-task learning. We built small corpora to manually annotate a number of phenomena that are relevant for poetry, such as meter, rhythm, rhyme, and also emotions and aesthetic judgements that are elicited in the reader. A strict annotation workflow allows us to better understand these phenomena, from how to conceptualize them to which problems arise when trying to annotate them on a larger scale. Furthermore, we built large corpora to discover patterns across a wide historical, aesthetic and linguistic range, with a focus on German and English writing, encompassing public domain texts from the late 16th century up into the early 20th century. These corpora are published with metadata and reliable automatic annotation of part-of-speech tags, syllable boundaries, meter and verse measures. This thesis contains chapters on diachronic variation, aesthetic emotions, and modeling prosody, including experiments that also investigate the interaction between them. We look at how the diction of poets in different languages changed over time, and which topics and metaphors were and became popular, both as a reaction to aesthetic considerations and to the political climate of the time. We investigate which emotions are elicited in readers when they read poetry, how that relates to aesthetic judgements, how we can annotate such emotions, and how we can train models to learn them. Also, we present experiments on how to annotate prosodic devices on a large scale, how well we can train computational models to predict prosody from text, and how informative those devices are for each other.
  • Enhancing character type detection using coreference information : experiments on dramatic texts (Open Access)
    (2024) Pagel, Janis; Kuhn, Jonas (Prof. Dr.)
    This thesis describes experiments on enhancing machine-learning based detection of literary character types in German-language dramatic texts by using coreference information. The thesis makes four major contributions to the research discourse of character type detection and coreference resolution for German dramatic texts: (i) a corpus of annotations of coreference on dramatic texts, called GerDraCor-Coref, (ii) a rule-based system to automatically resolve coreferences on dramatic texts, called DramaCoref, as well as experiments and analyses of results by using DramaCoref on GerDraCor-Coref, (iii) experiments on the automatic detection of three selected character types (title characters, protagonists and schemers) using machine-learning approaches, and (iv) experiments on utilizing the coreference information of (i) and (ii) for improving the performance of character type detection of (iii).
  • Ensemble dependency parsing across languages : methodological perspectives (Open Access)
    (2021) Faleńska, Agnieszka; Kuhn, Jonas (Prof. Dr.)
    Human language is ambiguous. Such ambiguity occurs at the lexical as well as syntactic level. At the lexical level, the same word can represent different concepts and objects. At the syntactic level, one phrase or a sentence can have more than one interpretation. Language ambiguity is one of the biggest challenges of Natural Language Processing (NLP), i.e., the research field that sits at the intersection of machine learning and linguistics, and that deals with automatic processing of language data. This challenge arises when automatic NLP tools need to resolve ambiguities and select one possible interpretation of a text to approach understanding its meaning. This dissertation focuses on one of the essential Natural Language Processing tasks - dependency parsing. The task involves assigning a syntactic structure called a dependency tree to a given sentence. Parsing is usually one of the processing steps that helps downstream NLP tasks by resolving some of the syntactic ambiguities occurring in sentences. Since human language is highly ambiguous, deciding on the best syntactic structure for a given sentence is challenging. As a result, even state-of-the-art dependency parsers are far from being perfect. Ensemble methods allow for postponing the decision about the best interpretation until several single parsing models express their opinions. Such complementary views on the same problem show which parts of the sentence are the most ambiguous and require more attention. Ensemble parsers find a consensus among such single predictions, and as a result, provide robust and more trustworthy results. Ensemble parsing architectures are commonly regarded as solutions only for experts and overlooked in practical applications. Therefore, this dissertation aims to provide a deeper understanding of ensemble dependency parsers and answer practical questions that arise when designing such approaches. We investigate ensemble models from three core methodological perspectives: parsing time, availability of training resources, and the final accuracy of the system. We demonstrate that in applications where the complexity of the architecture is not a bottleneck, an integration of strong and diverse parsers is the most reliable approach. Such integration provides robust results regardless of the language and the domain of application. However, when the final accuracy of the system can be sacrificed, more efficient ensemble architectures become available. The decision on how to design them has to take into consideration the desired parsing time, the available training data, and the involved single predictors. The main goal of this thesis is to investigate ensemble parsers. However, to design an ensemble architecture for a particular application, it is crucial to understand the similarities and differences in the behavior of its components. Therefore, this dissertation makes contributions of two sorts: (1) we provide guidelines on practical applications of ensemble dependency parsers, but also (2) through the ensembles, we develop a deeper understanding of single parsing models. We primarily focus on differences between the traditional parsers and their recent successors, which use deep learning techniques.
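
As a rough illustration of the ensemble idea described above, the following sketch performs arc-level majority voting over the head predictions of several single parsers; the function names and toy predictions are hypothetical, and a real reparsing ensemble would additionally enforce a well-formed tree:

```python
from collections import Counter

# Illustrative sketch: each parser proposes a head for every token,
# and the ensemble keeps the head most parsers agree on.
def vote_heads(predictions):
    """predictions: list of dicts, each mapping token index -> predicted head index."""
    tokens = predictions[0].keys()
    return {tok: Counter(p[tok] for p in predictions).most_common(1)[0][0]
            for tok in tokens}

parser_a = {1: 2, 2: 0, 3: 2}
parser_b = {1: 2, 2: 0, 3: 1}
parser_c = {1: 3, 2: 0, 3: 2}

print(vote_heads([parser_a, parser_b, parser_c]))  # {1: 2, 2: 0, 3: 2}
```

Where the single parsers disagree (tokens 1 and 3 above), their complementary error distributions are exactly what the vote exploits.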
  • Modeling the interface between morphology and syntax in data-driven dependency parsing (Open Access)
    (2016) Seeker, Wolfgang; Kuhn, Jonas (Prof. Dr.)
    When people formulate sentences in a language, they follow a set of rules specific to that language that defines how words must be put together in order to express the intended meaning. These rules are called the grammar of the language. Languages have essentially two ways of encoding grammatical information: word order or word form. English uses primarily word order to encode different meanings, but many other languages change the form of the words themselves to express their grammatical function in the sentence. These languages are commonly subsumed under the term morphologically rich languages. Parsing is the automatic process of predicting the grammatical structure of a sentence. Since grammatical structure guides the way we understand sentences, parsing is a key component in computer programs that try to automatically understand what people say and write. This dissertation is about parsing, and specifically about parsing languages with a rich morphology, which encode grammatical information in the form of words. Today’s models for automatic parsing were developed for English and achieve good results on this language. However, when applied to other languages, a significant drop in performance is usually observed. The standard model for parsing is a pipeline model that separates the parsing process into different steps; in particular, it separates the morphological analysis, i.e. the analysis of word forms, from the actual parsing step. This dissertation argues that this separation is one of the reasons for the performance drop of standard parsers when applied to languages other than English. An analysis is presented that exposes the connection between the morphological system of a language and the errors of a standard parsing model. In a second series of experiments, we show that knowledge about the syntactic structure of a sentence can support the prediction of morphological information. We then argue for an alternative approach that models morphological analysis and syntactic analysis jointly instead of separating them. We support this argumentation with empirical evidence by implementing two parsers that model the relationship between morphology and syntax in two different but complementary ways.
  • Morphological processing of compounds for statistical machine translation (Open Access)
    (2014) Cap, Fabienne; Kuhn, Jonas (Prof. Dr.)
    Machine Translation denotes the translation of a text written in one language into another language performed by a computer program. In times of the internet and globalisation, there has been a constantly growing need for machine translation. For example, think of the European Union, with its 24 official languages into which each official document must be translated. The translation of official documents would be less manageable and much less affordable without computer-aided translation systems. Most state-of-the-art machine translation systems are based on statistical models. These are trained on a bilingual text collection to “learn” translational correspondences of words (and phrases) of the two languages. The underlying text collection must be parallel, i.e. the content of one line must exactly correspond to the translation of this line in the other language. After training the statistical models, they can be used to translate new texts. However, one of the drawbacks of Statistical Machine Translation (SMT) is that it can only translate words which have occurred in the training texts. This applies in particular to SMT systems which have been designed for translating from and to German. It is widely known that German allows for productive word formation processes. Speakers of German can put together existing words to form new words, called compounds. An example is the German “Apfel + Baum = Apfelbaum” (= “apple + tree = apple tree”). Theoretically there is no limit to the length of a German compound. Whereas “Apfelbaum” (= “apple tree”) is a rather common German compound, “Apfelbaumholzpalettenabtransport” (= “apple|tree|wood|pallet|removal”) is a spontaneous new creation, which (probably) has not occurred in any text collection yet. The productivity of German compounds leads to a large number of distinct compound types, many of which occur only with low frequency in a text collection, if they occur at all. This fact makes German compounds a challenge for SMT systems, as only words which have occurred in the parallel training data can later be translated by the systems. Splitting compounds into their component words can solve this problem. For example, splitting “Apfelbaumholzpalettenabtransport” into its component words, it becomes intuitively clear that “Apfel” (= “apple”), “Baum” (= “tree”), “Palette” (= “pallet”) and “Abtransport” (= “removal”) are all common German words, which should have occurred much more often in any text collection than the compound as a whole. Splitting compounds thus potentially makes them translatable part-by-part. This thesis deals with the question as to whether using morphologically aware compound splitting improves translation performance, when compared to previous approaches to compound splitting for SMT. To do so, we investigate both translation directions of the language pair German and English. In the past, there have been several approaches to compound splitting for SMT systems translating from German to English. However, the problem has mostly been ignored for the opposite translation direction, from English to German. Note that this translation direction is the more challenging one: prior to training and translation, compounds must be split, and after translation, they must be accurately reassembled. Moreover, German has a rich inflectional morphology. For example, it requires the agreement of all noun phrase components which are morphologically marked. In this thesis, we introduce a compound processing procedure for SMT which is able to put together new compounds that have not occurred in the parallel training data and inflects these compounds correctly, in accordance with their context. Our work is the first to take syntactic information derived from the source-language sentence (here: English) into consideration when deciding which simple words to merge into compounds. We evaluate the quality of our morphological compound splitting approach using manual evaluations. We measure the impact of our compound processing approach on the translation performance of a state-of-the-art, freely available SMT system. We investigate both translation directions of the language pair German and English. Whenever possible, we compare our results to previous approaches to compound processing, most of which work without morphological knowledge.
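
For context, a minimal sketch of the kind of corpus-frequency compound splitting used by earlier SMT approaches, against which the thesis compares its morphologically aware method; the counts, the minimum part length, and the omission of linking elements such as the German Fugen-s are simplifications for illustration:

```python
import math

# Illustrative sketch: prefer the segmentation whose parts have the highest
# geometric mean corpus frequency (the unsplit word is always a candidate).
corpus_counts = {"apfel": 500, "baum": 800, "apfelbaum": 40, "holz": 300}

def best_split(word, counts, min_part=3):
    candidates = [[word]]
    for i in range(min_part, len(word) - min_part + 1):
        left, right = word[:i], word[i:]
        if left in counts and right in counts:
            candidates.append([left, right])
    def score(parts):
        freqs = [counts.get(p, 0) for p in parts]
        return 0.0 if 0 in freqs else math.exp(sum(map(math.log, freqs)) / len(freqs))
    return max(candidates, key=score)

print(best_split("apfelbaum", corpus_counts))  # ['apfel', 'baum']
```

A purely frequency-based splitter of this kind knows nothing about morphology, which is precisely the gap the morphologically aware approach in the thesis addresses.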
  • New resources and ideas for semantic parser induction (Open Access)
    (2018) Richardson, Kyle; Kuhn, Jonas (Prof. Dr.)
    In this thesis, we investigate the general topic of computational natural language understanding (NLU), which has as its goal the development of algorithms and other computational methods that support reasoning about natural language by the computer. Under the classical approach, NLU models work similar to computer compilers (Aho et al., 1986), and include as a central component a semantic parser that translates natural language input (i.e., the compiler’s high-level language) to lower-level formal languages that facilitate program execution and exact reasoning. Given the difficulty of building natural language compilers by hand, recent work has centered around semantic parser induction, or on using machine learning to learn semantic parsers and semantic representations from parallel data consisting of example text-meaning pairs (Mooney, 2007a). One inherent difficulty in this data-driven approach is finding the parallel data needed to train the target semantic parsing models, given that such data does not occur naturally “in the wild” (Halevy et al., 2009). Even when data is available, the amount of domain- and language-specific data and the nature of the available annotations might be insufficient for robust machine learning and capturing the full range of NLU phenomena. Given these underlying resource issues, the semantic parsing field is in constant need of new resources and datasets, as well as novel learning techniques and task evaluations that make models more robust and adaptable to the many applications that require reliable semantic parsing. To address the main resource problem involving finding parallel data, we investigate the idea of using source code libraries, or collections of code and text documentation, as a parallel corpus for semantic parser development and introduce 45 new datasets in this domain and a new and challenging text-to-code translation task. As a way of addressing the lack of domain- and language-specific parallel data, we then use these and other benchmark datasets to investigate training semantic parsers on multiple datasets, which helps semantic parsers to generalize across different domains and languages and solve new tasks such as polyglot decoding and zero-shot translation (i.e., translating over and between multiple natural and formal languages and unobserved language pairs). Finally, to address the issue of insufficient annotations, we introduce a new learning framework called learning from entailment that uses entailment information (i.e., high-level inferences about whether the meaning of one sentence follows from another) as a weak learning signal to train semantic parsers to reason about the holes in their analysis and learn improved semantic representations. Taken together, this thesis contributes a wide range of new techniques and technical solutions to help build semantic parsing models with minimal amounts of training supervision and manual engineering effort, hence avoiding the resource issues described at the outset. We also introduce a diverse set of new NLU tasks for evaluating semantic parsing models, which we believe help to extend the scope and real-world applicability of semantic parsing and computational NLU.
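
As a loose illustration of the text-to-code setting built on source code documentation, the sketch below ranks candidate function signatures by word overlap between a natural-language query and their documentation strings; this is a retrieval baseline for illustration only, not the trained translation models developed in the thesis, and the signatures and docstrings are toy values:

```python
from collections import Counter

# Illustrative sketch: treat (docstring, signature) pairs as a parallel corpus
# and answer a query by returning the signature with the best word overlap.
api_docs = {
    "max(iterable)": "return the largest item in an iterable",
    "min(iterable)": "return the smallest item in an iterable",
    "sorted(iterable)": "return a new sorted list from the items in an iterable",
}

def rank_signatures(query, docs):
    q = Counter(query.lower().split())
    def overlap(doc):
        d = Counter(doc.lower().split())
        return sum(min(q[w], d[w]) for w in q)
    return sorted(docs, key=lambda sig: overlap(docs[sig]), reverse=True)

print(rank_signatures("find the largest item in an iterable", api_docs)[0])  # max(iterable)
```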
  • Online learning of latent linguistic structure with approximate search (Open Access)
    (2019) Björkelund, Anders; Kuhn, Jonas (Prof. Dr.)
    Automatic analysis of natural language data is a frequently occurring application of machine learning systems. These analyses often revolve around some linguistic structure, for instance a syntactic analysis of a sentence by means of a tree. Machine learning models that carry out structured prediction, as opposed to simpler machine learning tasks such as classification or regression, have therefore received considerable attention in the language processing literature. As an additional twist, the sought linguistic structures are sometimes not directly modeled themselves. Rather, prediction takes place in a different space where the same linguistic structure can be represented in more than one way. However, in a standard supervised learning setting, these prediction structures are not available in the training data, but only the linguistic structure. Since multiple prediction structures may correspond to the same linguistic structure, it is thus unclear which prediction structure to use for learning. One option is to treat the prediction structure as latent and let the machine learning algorithm guide this selection. In this dissertation we present an abstract framework for structured prediction. This framework supports latent structures and is agnostic of the particular language processing task. It defines a set of hyperparameters and task-specific functions which a user must implement in order to apply it to a new task. The advantage of this modularization is that it permits comparisons and reuse across tasks in a common framework. The framework we devise is based on the structured perceptron for learning. The perceptron is an online learning algorithm which considers one training instance at a time, makes a prediction, and carries out an update if the prediction was wrong. We couple the structured perceptron with beam search, which is a general purpose search algorithm. Beam search is, however, only approximate, meaning that there is no guarantee that it will find the optimal structure in a large search space. Therefore special attention is required to handle search errors during training. This has led to the development of special update methods such as early and max-violation updates. The contributions of this dissertation sit at the intersection of machine learning and natural language processing. With regard to language processing, we consider three tasks: Coreference resolution, dependency parsing, and joint sentence segmentation and dependency parsing. For coreference resolution, we start from an existing latent tree model and extend it to accommodate non-local features drawn from a greater structural context. This requires us to sacrifice exact for approximate search, but we show that, provided sufficiently advanced update methods are used for the structured perceptron, the richer scope of features yields a stronger coreference model. We take a transition-based approach to dependency parsing, where dependency trees are constructed incrementally by a transition system. Latent structures for transition-based parsing have previously not received enough attention, partly because the characterization of the prediction space is non-trivial. We provide a thorough analysis of this space with regard to the ArcStandard with Swap transition system. This characterization enables us to evaluate the role of latent structures in transition-based dependency parsing. Empirically we find that the utility of latent structures depends on the choice of approximate search -- for greedy search they improve performance, whereas with beam search they are on par with, or sometimes slightly ahead of, previous approaches. We then go on to extend this transition system to do joint sentence segmentation and dependency parsing. We develop a transition system capable of handling this task and evaluate it on noisy, non-edited texts. With a set of carefully selected baselines and data sets we employ this system to measure the effectiveness of syntactic information for sentence segmentation. We show that, in the absence of obvious orthographic clues such as punctuation and capitalization, syntactic information can be used to improve sentence segmentation. With regard to machine learning, our contributions of course include the framework itself. The task-specific evaluations, however, allow us to probe the learning machinery along certain boundary points and draw more general conclusions. A recurring observation is that some of the standard update methods for the structured perceptron with approximate search -- e.g., early and max-violation updates -- are inadequate when the predicted structure reaches a certain size. We show that the primary problem with these updates is that they may discard training data and that this effect increases as the structure size increases. This problem can be handled by using more advanced update methods that commit to using all the available training data. Here, we propose a new update method, DLaSO, which consistently outperforms all other update methods we compare to. Moreover, while this problem potentially could be handled by an increased beam size, we also show that this cannot fully compensate for the structure size and that the more advanced methods indeed are required.
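
A minimal sketch of the structured perceptron with beam search and the early-update strategy discussed above, on a toy sequence-labeling task; the feature templates, label set, and function names are illustrative stand-ins rather than the thesis framework:

```python
# Illustrative sketch: as soon as the gold partial structure falls out of the
# beam, decoding stops and the weights are updated against the best hypothesis.
def features(prev_label, token, label):
    return [("emit", token, label), ("trans", prev_label, label)]

def seq_features(tokens, labels):
    feats, prev = {}, "<s>"
    for tok, lab in zip(tokens, labels):
        for f in features(prev, tok, lab):
            feats[f] = feats.get(f, 0) + 1
        prev = lab
    return feats

def perceptron_update(weights, gold_feats, pred_feats, lr=1.0):
    for f in set(gold_feats) | set(pred_feats):
        weights[f] = weights.get(f, 0.0) + lr * (gold_feats.get(f, 0) - pred_feats.get(f, 0))

def train_one(weights, tokens, gold, label_set, beam_size=4):
    beam = [([], 0.0)]
    for i, tok in enumerate(tokens):
        expanded = []
        for prefix, sc in beam:
            prev = prefix[-1] if prefix else "<s>"
            for lab in label_set:
                gain = sum(weights.get(f, 0.0) for f in features(prev, tok, lab))
                expanded.append((prefix + [lab], sc + gain))
        beam = sorted(expanded, key=lambda x: -x[1])[:beam_size]
        if gold[:i + 1] not in [p for p, _ in beam]:   # gold fell out: early update
            perceptron_update(weights, seq_features(tokens[:i + 1], gold[:i + 1]),
                              seq_features(tokens[:i + 1], beam[0][0]))
            return
    if beam[0][0] != gold:                             # full pass: standard update
        perceptron_update(weights, seq_features(tokens, gold), seq_features(tokens, beam[0][0]))

w = {}
train_one(w, ["the", "dog", "barks"], ["DET", "NOUN", "VERB"], ["DET", "NOUN", "VERB"])
```

The early update only uses the prefix seen so far, which is exactly why such updates can discard training data when structures grow large, as the abstract notes.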
  • Structurally informed methods for improved sentiment analysis (Open Access)
    (2017) Kessler, Stefanie Wiltrud; Kuhn, Jonas (Prof. Dr.)
    Sentiment analysis deals with methods to automatically analyze opinions in natural language texts, e.g., product reviews. Such reviews contain a large number of fine-grained opinions, but to automatically extract detailed information it is necessary to handle a wide variety of verbalizations of opinions. The goal of this thesis is to develop robust structurally informed models for sentiment analysis which address challenges that arise from structurally complex verbalizations of opinions. In this thesis, we look at two examples for such verbalizations that benefit from including structural information into the analysis: negation and comparisons. Negation directly influences the polarity of sentiment expressions, e.g., while "good" is positive, "not good" expresses a negative opinion. We propose a machine learning approach that uses information from dependency parse trees to determine whether a sentiment word is in the scope of a negation expression. Comparisons like "X is better than Y" are the main topic of this thesis. We present a machine learning system for the task of detecting the individual components of comparisons: the anchor or predicate of the comparison, the entities that are compared, which aspect they are compared in, and which entity is preferred. Again, we use structural context from a dependency parse tree to improve the performance of our system. We discuss two ways of addressing the issue of limited availability of training data for our system. First, we create a manually annotated corpus of comparisons in product reviews, the largest such resource available to date. Second, we use the semi-supervised method of structural alignment to expand a small seed set of labeled sentences with similar sentences from a large set of unlabeled sentences. Finally, we work on the task of producing a ranked list of products that complements the isolated prediction of ratings and supports the user in a process of decision making. We demonstrate how we can use the information from comparisons to rank products and evaluate the result against two conceptually different external gold standard rankings.
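
As a simplified illustration of why dependency structure helps with negation, the sketch below flips the prior polarity of a sentiment word whenever a negation cue appears among its dependents or ancestors in the parse tree; the thesis instead trains a machine learning scope classifier, and the toy tree, cue list, and lexicon here are illustrative only:

```python
# Illustrative sketch: rule-based negation handling over a dependency tree.
NEGATION_CUES = {"not", "never", "no"}
PRIOR_POLARITY = {"good": 1, "bad": -1}

def ancestors(index, heads):
    """heads[i] is the index of token i's head, or -1 for the root."""
    chain = []
    while heads[index] != -1:
        index = heads[index]
        chain.append(index)
    return chain

def polarity(tokens, heads, target):
    pol = PRIOR_POLARITY.get(tokens[target], 0)
    dependents = [i for i, h in enumerate(heads) if h == target]
    if any(tokens[i] in NEGATION_CUES for i in dependents + ancestors(target, heads)):
        pol = -pol
    return pol

# "the movie is not good": "the" -> "movie"; "movie", "is", "not" -> "good" (root).
tokens = ["the", "movie", "is", "not", "good"]
heads = [1, 4, 4, 4, -1]
print(polarity(tokens, heads, 4))  # -1: "good" is in the scope of "not"
```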
  • Syntactic and referential choice in corpus-based generation : modeling source, context and interactions (Open Access)
    (2016) Zarrieß, Sina; Kuhn, Jonas (Prof. Dr.)
    Generating natural language sentences from an abstract representation of a communicative intent is a process subject to a certain degree of variability, meaning that several linguistic ways of expressing a non-linguistic fact are typically available. This variability is present at all levels of linguistic realization, for example in sentence structure, lexical choices, or word order, and many of these realization options interact. From the perspective of language use, phenomena such as word order variants serve a function: they adapt an utterance to its context. This doctoral thesis investigates statistical models that predict a ranking over the different possible realizations of a generation input with respect to their adequacy in the discourse context. To this end, we adopt certain assumptions and methods from the paradigm of corpus-based generation: the models use actually occurring corpus sentences as instances of linguistic realization variants and the preceding sentences as their context. We employ analysis tools such as grammars and parsers to determine an abstract representation of a sentence. This representation constitutes the starting point of the generation process. The generation system maps this input representation to a set of candidate realizations and weights them using features computed from the context. The output of the generation system is the highest-ranked sentence, which can be evaluated against the original corpus sentence.
  • Syntactic dependencies and beyond : robust neural architectures and quality-enhanced corpora for structured prediction in NLP (Open Access)
    (2025) Grünewald, Stefan; Kuhn, Jonas (Prof. Dr.)
    This thesis investigates explicit structure in Natural Language Processing (NLP). Such structure, represented by abstract linguistic objects like part-of-speech tags, syntax trees, or graph-based meaning representations, has traditionally played a central role in NLP. Historically linked to the idea of rule-based processing of human language (using tools such as formal grammars), it has also successfully been combined with statistical machine learning techniques. For practical applications, this has often taken the form of a pipeline in which the prediction of linguistic features serves as a first step towards addressing “higher-level” tasks such as text classification or information extraction. In addition, algorithmic extraction of linguistic structures is also being pursued for its own sake, i.e., as a means to deepen our understanding of human language. Most recently, the field of NLP has been dominated by techniques leveraging artificial neural networks. In this paradigm, language data is not processed using a pipeline approach as outlined above, which is ultimately grounded in simple and interpretable features. Rather, neural networks learn internal, vector-based language representations by means of large-scale mathematical optimization based on (usually very large amounts of) raw input data, allowing for “end-to-end” language processing that does not involve any kind of explicit structure as an intermediate representation. While the successes of this paradigm are undeniable in practical terms - i.e., achieving new state-of-the-art results on a wide range of applications ranging from information extraction to machine translation - it has also spurred controversial questions around the present and future role of explicit structure in NLP. At an overarching level, this means a general uncertainty about the role of explicit structure: Can NLP still benefit from modeling structure explicitly, or have such approaches become obsolete? Apart from this fundamental question, however, the interaction between neural networks and explicit structure in NLP also raises a number of practical challenges; and it is these challenges that form the core of this thesis and the basis for its contributions. The first challenge relates to the role of data in training structure-prediction systems. As a general rule, neural networks require large amounts of (labeled) training data for learning specific tasks, and thus the curation and annotation of suitable datasets is a common bottleneck in their development. In our contributions, we focus on the Universal Dependencies (UD) formalism for the annotation of syntactic dependencies. Evaluating the quality of existing treebanks and examining ways of improving and extending them, one of our core findings is that both rule-based and machine learning-based methods can be leveraged to reduce the need for manual annotation. The second challenge relates to the design and architecture of neural structure-predicting systems, of which there exists a wide variety; often, it is not fully clear which factors are truly important in achieving the best possible performance. We study dependency parser architectures for UD parsing, finding that when using modern neural network backbones, simpler is often better, with more sophisticated setups offering little in the way of performance improvements. The third challenge relates to structure prediction for downstream NLP tasks.
Here, we investigate the tasks of Negation Resolution and Relation Extraction by means of framing them as graph parsing problems and utilizing neural architectures similar to those studied for dependency parsing. We find that such an approach generally yields robust results, but is not clearly superior to “shallow” sequence labeling. In sum, we hope that our contributions serve to inform and inspire future research on the role of explicit structure in NLP, and more generally within the emerging paradigm of artificial intelligence (AI) that combines neural networks with rule-based algorithms and symbolic representations (“neuro-symbolic AI”).
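
For reference, a minimal sketch of the biaffine arc scorer commonly used in graph-based Universal Dependencies parsing of the kind discussed above; the dimensions and random weights are placeholders, greedy head selection ignores the tree constraint, and this is not necessarily the exact architecture evaluated in the thesis:

```python
import numpy as np

# Illustrative sketch: every token gets a "head" and a "dependent" representation;
# a bilinear form scores each candidate head for each dependent.
rng = np.random.default_rng(0)
n_tokens, enc_dim, arc_dim = 6, 16, 8

encoded = rng.normal(size=(n_tokens, enc_dim))   # stand-in for contextual encodings
W_head = rng.normal(size=(enc_dim, arc_dim))     # projection into the head role
W_dep = rng.normal(size=(enc_dim, arc_dim))      # projection into the dependent role
U = rng.normal(size=(arc_dim, arc_dim))          # biaffine interaction matrix
b = rng.normal(size=(arc_dim,))                  # bias on the head representation

H = encoded @ W_head                             # (n_tokens, arc_dim)
D = encoded @ W_dep                              # (n_tokens, arc_dim)
scores = D @ U @ H.T + (H @ b)[None, :]          # scores[i, j]: token j as head of token i

heads = scores.argmax(axis=1)                    # greedy head choice, no tree constraint
print(heads)
```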
  • The Taming of the Shrew - non-standard text processing in the Digital Humanities (Open Access)
    (2018) Schulz, Sarah; Kuhn, Jonas (Prof. Dr.)
    Natural language processing (NLP) has focused on the automatic processing of newspaper texts for many years. With the growing importance of text analysis in various areas such as spoken language understanding, social media processing and the interpretation of text material from the humanities, techniques and methodologies have to be reviewed and redefined, since so-called non-standard texts pose challenges on the lexical and syntactic level, especially for machine-learning-based approaches. Automatic processing tools developed on the basis of newspaper texts show a decreased performance for texts with divergent characteristics. Digital Humanities (DH), as a field that has risen to prominence in the last decades, holds a variety of examples of this kind of text. Thus, the computational analysis of the relationships of Shakespeare’s dramatic characters requires the adjustment of processing tools to English texts from the 16th century in dramatic form. Likewise, the investigation of narrative perspective in Goethe’s ballads calls for methods that can handle German verse from the 18th century. In this dissertation, we put forward a methodology for NLP in a DH environment. We investigate how an interdisciplinary context in combination with specific goals within projects influences the general NLP approach. We suggest thoughtful collaboration and increased attention to the easy applicability of resulting tools as a solution for differences in the store of knowledge between project partners. Projects in DH are not only constituted by the automatic processing of texts but are usually framed by the investigation of a research question from the humanities. As a consequence, time limitations complicate the successful implementation of analysis techniques, especially since the diversity of texts impairs the transferability and reusability of tools beyond a specific project. We respond to this with modular and thus easily adjustable project workflows and system architectures. Several instances serve as examples for our methodology on different levels. We discuss modular architectures that balance time-saving solutions and problem-specific implementations using the example of automatic post-correction of the output text from an optical character recognition system. We address the problem of data diversity and low-resource situations by investigating different approaches towards non-standard text processing. We examine two main techniques: text normalization and tool adjustment. Text normalization aims at the transformation of non-standard text in order to assimilate it to the standard, whereas tool adjustment concentrates on the opposite direction of enabling tools to successfully handle a specific kind of text. We focus on the task of part-of-speech tagging to illustrate various approaches toward the processing of historical texts as an instance of non-standard texts. We discuss how the level of deviation from a standard form influences the performance of different methods. Our approaches shed light on the importance of data quality and quantity and emphasize the indispensability of annotations for effective machine learning. In addition, we highlight the advantages of problem-driven approaches where the purpose of a tool is clearly formulated through the research question. Another significant finding to emerge from this work is a summary of the experiences and increased knowledge gained through collaborative projects between computer scientists and humanists. We reflect on various aspects of the elaboration and formalization of research questions in the DH and assess the limitations and possibilities of the computational modeling of humanistic research questions. An emphasis is placed on the interplay between expert knowledge about a subject of investigation and the implementation of tools for that purpose, and on the resulting advantages such as the targeted improvement of digital methods through purposeful manual correction and error analysis. We show obstacles and opportunities and give prospects and directions for future development in this realm of interdisciplinary research.
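
As a simplified illustration of the text normalization strategy contrasted with tool adjustment above, the sketch below maps historical spellings to modern ones via a small lexicon with character-level fallback rules; the entries and rules are illustrative examples, not resources from the thesis:

```python
# Illustrative sketch: lexicon lookup first, simple character rules as fallback.
NORMALIZATION_LEXICON = {"vnnd": "und", "vnd": "und", "thun": "tun", "seyn": "sein"}
CHARACTER_RULES = [("ſ", "s"), ("ey", "ei")]  # long s and ey/ei, applied only as fallback

def normalize_token(token):
    if token.lower() in NORMALIZATION_LEXICON:
        return NORMALIZATION_LEXICON[token.lower()]
    normalized = token
    for old, new in CHARACTER_RULES:
        normalized = normalized.replace(old, new)
    return normalized

print([normalize_token(t) for t in ["vnnd", "ſo", "bey"]])  # ['und', 'so', 'bei']
```

After normalization, a standard tagger trained on modern text can be applied; tool adjustment would instead retrain or adapt the tagger to the historical spellings directly.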
  • Task-based parser output combination : workflow and infrastructure (Open Access)
    (2018) Eckart, Kerstin; Kuhn, Jonas (Prof. Dr.)
    This dissertation introduces the method of task-based parser output combination as a device to enhance the reliability of automatically generated syntactic information for further processing tasks. Parsers, i.e. tools generating syntactic analyses, are usually based on reference data. Typically these are modern news texts. However, the data relevant for applications or tasks beyond parsing often differs from this standard domain, or only specific phenomena from the syntactic analysis are actually relevant for further processing. In these cases, the reliability of the parsing output might deviate substantially from the expected outcome on standard news text. Studies for several levels of analysis in natural language processing have shown that combining systems from the same analysis level outperforms the best single system involved. This is due to different error distributions of the involved systems which can be exploited, e.g. in a majority voting approach. In other words: for an effective combination, the involved systems have to be sufficiently different. In these combination studies, usually the complete analyses are combined and evaluated. However, to be able to combine the analyses completely, a full mapping of their structures and tagsets has to be found. The need for a full mapping either restricts the degree to which the participating systems are allowed to differ or it results in information loss. Moreover, the evaluation of the combined complete analyses does not reflect the reliability achieved in the analysis of the specific aspects needed to resolve a given task. This work presents an abstract workflow which can be instantiated based on the respective task and the available parsers. The approach focusses on the task-relevant aspects and aims at increasing the reliability of their analysis. Moreover, this focus allows a combination of more diverging systems, since no full mapping of the structures and tagsets from the single systems is needed. The usability of this method is also increased by focussing on the output of the parsers: It is not necessary for the users to reengineer the tools. Instead, off-the-shelf parsers and parsers for which no configuration options or sources are available to the users can be included. Based on this, the method is applicable to a broad range of applications. For instance, it can be applied to tasks from the growing field of Digital Humanities, where the focus is often on tasks different from syntactic analysis.
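
A minimal sketch of task-based output combination as described above: only the task-relevant aspect (here, hypothetically, the subject of a sentence) is extracted from each parser's output through a small parser-specific label mapping and then combined by majority vote; the label inventories and toy parses are illustrative:

```python
from collections import Counter

# Illustrative sketch: no full mapping of structures and tagsets is needed,
# only a partial mapping for the aspect the task actually requires.
SUBJECT_LABELS = {"parser_a": {"SB"}, "parser_b": {"nsubj"}, "parser_c": {"SUBJ"}}

def extract_subject(parse, parser_name):
    """parse: list of (token, label) pairs; returns the token labeled as subject, if any."""
    for token, label in parse:
        if label in SUBJECT_LABELS[parser_name]:
            return token
    return None

def combine(parses):
    """parses: dict parser_name -> parse; majority vote over the extracted subjects."""
    votes = Counter(extract_subject(p, name) for name, p in parses.items())
    return votes.most_common(1)[0][0]

parses = {
    "parser_a": [("the", "NK"), ("dog", "SB"), ("barks", "ROOT")],
    "parser_b": [("the", "det"), ("dog", "nsubj"), ("barks", "root")],
    "parser_c": [("the", "DET"), ("barks", "SUBJ"), ("dog", "ROOT")],  # dissenting analysis
}
print(combine(parses))  # "dog" wins 2:1
```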
  • Task-oriented specialization techniques for entity retrieval (Open Access)
    (2020) Glaser, Andrea; Kuhn, Jonas (Prof. Dr.)
    Finding information on the internet has become very important nowadays, and online encyclopedias or websites specialized in certain topics offer users a great amount of information. Search engines support users when trying to find information. However, the vast amount of information makes it difficult to separate relevant from irrelevant facts for a specific information need. In this thesis we explore two areas of natural language processing in the context of retrieving information about entities: named entity disambiguation and sentiment analysis. The goal of this thesis is to use methods from these areas to develop task-oriented specialization techniques for entity retrieval. Named entity disambiguation is concerned with linking referring expressions (e.g., proper names) in text to their corresponding real world or fictional entity. Identifying the correct entity is an important factor in finding information on the internet as many proper names are ambiguous and need to be disambiguated to find relevant information. To that end, we introduce the notion of r-context, a new type of structurally informed context. This r-context consists only of sentences that are relevant to the entity, in order to capture all important context clues and to avoid noise. We then show the usefulness of this r-context by performing a systematic study on a pseudo-ambiguity dataset. Identifying less-known named entities is a challenge in named entity disambiguation because usually there is not much data available from which a machine learning algorithm can learn. We propose an approach that uses an aggregate of textual data about other entities which share certain properties with the target entity, and learn information from it by using topic modelling, which is then used to disambiguate the less-known target entity. We use a dataset that is created automatically by exploiting the link structure in Wikipedia, and show that our approach is helpful for disambiguating entities without training material and with little surrounding context. Retrieving the relevant entities and information can produce many search results. Thus, it is important to effectively present the information to a user. We regard this step beyond the entity retrieval and employ sentiment analysis, which is used to analyze opinions expressed in text, in the context of effectively displaying information about product reviews to a user. We present a system that extracts a supporting sentence, a single sentence that captures both the sentiment of the author and a supporting fact. This supporting sentence can be used to provide users with an easy way to assess information in order to make informed choices quickly. We evaluate our approach by using the crowdsourcing service Amazon Mechanical Turk.
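
As a rough illustration of context-based named entity disambiguation, the sketch below compares a mention's surrounding words with short descriptions of the candidate entities and picks the most similar one; the candidate set and descriptions are toy values, and the thesis refines exactly this kind of context (the r-context) and backs off to topic models for sparsely documented entities:

```python
import math
from collections import Counter

# Illustrative sketch: cosine similarity between the mention context and
# each candidate entity's description; the most similar candidate wins.
def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def disambiguate(context_words, candidates):
    """candidates: dict mapping entity id -> list of description words."""
    ctx = Counter(context_words)
    return max(candidates, key=lambda e: cosine(ctx, Counter(candidates[e])))

candidates = {
    "Paris_(city)": ["capital", "france", "seine", "city"],
    "Paris_(mythology)": ["troy", "prince", "helen", "mythology"],
}
print(disambiguate(["the", "city", "on", "the", "seine"], candidates))  # Paris_(city)
```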