05 Fakultät Informatik, Elektrotechnik und Informationstechnik

Permanent URI for this collection: https://elib.uni-stuttgart.de/handle/11682/6

Search Results

  • Open Access
    Challenges of computational social science analysis with NLP methods
    (2022) Dayanik, Erenay; Padó, Sebastian (Prof. Dr.)
    Computational Social Science (CSS) is an emerging research area at the intersection of social science and computer science, where problems of societal relevance can be addressed by novel computational methods. With the recent advances in machine learning and natural language processing as well as the availability of textual data, CSS has opened up new possibilities, but also new methodological challenges. In this thesis, we present a line of work on developing methods and addressing challenges in terms of data annotation and modeling for computational political science and social media analysis, two highly popular and active research areas within CSS. In the first part of the thesis, we focus on a use case from computational political science, namely Discourse Network Analysis (DNA), a framework that aims at analyzing the structures behind complex societal discussions. We investigate how this style of analysis, which is traditionally performed manually, can be automated. We start by providing a requirement analysis outlining a roadmap to decompose the complex DNA task into several conceptually simpler sub-tasks. Then, we introduce NLP models with various configurations to automate two of the sub-tasks given by the requirement analysis, namely claim detection and classification, based on different neural network architectures ranging from unidirectional LSTMs to Transformer-based architectures. In the second part of the thesis, we shift our focus to fairness, a central concern in CSS. Our goal in this part of the thesis is to analyze and improve the fairness and robustness of NLP models used in CSS while maintaining their overall performance. With that in mind, we first analyze the above-mentioned claim detection and classification models and propose techniques to improve model fairness and overall performance. After that, we broaden our focus to social media analysis, another highly active subdomain of CSS. 
Here, we study text classification of correlated attributes, which pose an important but often overlooked challenge to model fairness. Our last contribution is to discuss the limitations of the current statistical methods applied for bias identification; to propose a multivariate regression-based approach; and to show, through experiments conducted on social media data, that it can be used as a complementary method for bias identification and analysis tasks. Overall, our work takes a step towards increasing the understanding of the challenges of computational social science. We hope that both political scientists and NLP scholars can make use of the insights from this thesis in their research.
  • Open Access
    Distributional analysis of entities
    (2022) Gupta, Abhijeet; Padó, Sebastian (Prof. Dr.)
    Arguably, one of the most important aspects of natural language processing is natural language understanding, which relies heavily on lexical knowledge. In computational linguistics, modelling lexical knowledge through distributional semantics has gained considerable popularity. However, the modelling is largely restricted to generic lexical categories (typically common nouns, adjectives, etc.), which are associated with coarse-grained information, e.g., the category country has a boundary, rivers and gold deposits. Comparatively less attention has been paid to modelling entities, which, on the other hand, are associated with fine-grained real-world information; for instance, the entity Germany has precise properties such as (GDP - 3.6 trillion Euros), (GDP per capita - 44.5 thousand Euros) and (Continent - Europe). The lack of focus on entities and the inherent latency of information in distributional representations warrant greater efforts towards modelling entity-related phenomena and increasing the understanding of the information encoded within distributional representations. This work makes two contributions in that direction: (a) We introduce a semantic relation – Instantiation, a relation between entities and their categories – and distributionally model it to investigate the hypothesis that distributional distinctions do exist in modelling entities versus modelling categories within a semantic space. Our results show that in a semantic space: 1) entities and categories are quite distinct with respect to their distributional behaviour, geometry and linguistic properties; 2) the Instantiation relation is recoverable by distributional models; and, 3) for lexical relational modelling purposes, categories are better represented by the centroids of their entities instead of their distributional representations constructed directly from corpora. 
(b) We also investigate the potential and limitations of distributional semantics for the purpose of Knowledge Base Completion, starting with the hypothesis that fine-grained knowledge is encoded in distributional representations of entities during their meaning construction. We show that: 1) fine-grained information of entities is encoded in distributional representations and can be extracted by simple data-driven supervised models as attribute-value pairs; 2) the models can predict the entire range of fine-grained attributes, as seen in a knowledge base, in one go; and, 3) a crucial factor in determining success in extracting this type of information is contextual support i.e., the extent of contextual information captured by a distributional model during meaning construction. Overall, this thesis takes a step towards increasing the understanding about entity meaning representations in a distributional setup, with respect to their modelling and the extent of knowledge inclusion during their meaning construction.
  • Open Access
    Computational models of word order
    (2022) Yu, Xiang; Kuhn, Jonas (Prof. Dr.)
    A sentence in our mind is not a simple sequence of words but a hierarchical structure. We put the sentence into a linear order when we utter it for communication. Linearization is the task of mapping the hierarchical structure of a sentence into its linear order. Our work is based on dependency grammar, which models the dependency relations between words, and the resulting syntactic representation is a directed tree structure. The popularity of dependency grammar in Natural Language Processing (NLP) benefits from its separation of structural order and linear order and its emphasis on syntactic functions. These properties facilitate a universal annotation scheme covering a wide range of languages used in our experiments. We focus on developing a robust and efficient computational model that finds the linear order of a dependency tree. We take advantage of deep learning models’ expressive power to encode the syntactic structures of typologically diverse languages robustly. We take a graph-based approach that combines a simple bigram scoring model and a greedy decoding algorithm to search for the optimal word order efficiently. We use a divide-and-conquer strategy to reduce the search space, which restricts the output to be projective. We then resolve the restriction with a transition-based post-processing model. Apart from the computational models, we also study word order from a quantitative linguistic perspective. We examine the Dependency Length Minimization (DLM) hypothesis, which is believed to be a universal factor that affects the word order of every language. It states that human languages tend to order the words to minimize the overall length of dependency arcs, which reduces the cognitive burden of speaking and understanding. We demonstrate that DLM can explain every aspect of word order in a dependency tree, such as the direction of the head, the arrangement of sibling dependents, and the existence of crossing arcs (non-projectivity). 
Furthermore, we find that DLM not only shapes the general word order preferences but also motivates the occasional deviation from the preferences. Finally, we apply our model to the task of surface realization, which aims to generate a sentence from a deep syntactic representation. We implement a pipeline with five steps: (1) linearization, (2) function word generation, (3) morphological inflection, (4) contraction, and (5) detokenization. This pipeline achieved state-of-the-art performance.
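
The quantity behind the Dependency Length Minimization hypothesis in the last abstract, the overall length of dependency arcs for a given word order, can be illustrated with a minimal sketch. The function name, the toy tree, and the two candidate orders are hypothetical and chosen only for illustration; they are not taken from the thesis and do not represent its model.

```python
def total_dependency_length(heads, order):
    """Sum of |position(head) - position(dependent)| over all arcs.

    heads: dict mapping each word to its head word (the root maps to None).
    order: list of the same words in a candidate linear order.
    """
    position = {word: i for i, word in enumerate(order)}
    return sum(
        abs(position[head] - position[dep])
        for dep, head in heads.items()
        if head is not None  # the root has no incoming arc
    )


# Hypothetical toy dependency tree (assumption, for illustration only).
heads = {
    "read": None,       # root
    "she": "read",
    "quickly": "read",
    "book": "read",
    "the": "book",
}

# Two candidate linearizations of the same tree.
order_a = ["she", "quickly", "read", "the", "book"]
order_b = ["the", "she", "quickly", "read", "book"]

length_a = total_dependency_length(heads, order_a)
length_b = total_dependency_length(heads, order_b)
print(length_a, length_b)
```

Under the hypothesis, the order with the smaller total would tend to be preferred; here the order that keeps "the" next to its head "book" yields the shorter total.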