05 Fakultät Informatik, Elektrotechnik und Informationstechnik
Permanent URI for this collection: https://elib.uni-stuttgart.de/handle/11682/6
26 results
Item Open Access: Analysis of political positioning from politician's tweets (2023) by Maurer, Maximilian Martin

Social media platforms such as Twitter have become important communication channels for politicians to interact with the electorate and communicate their stances on policy issues. In contrast to party manifestos, which lay out curated compromise positions, the full range of positions within the ideological bounds of a party can be found on social media. This raises the question of how well the ideological positions of parties on social media align with their respective manifestos. To assess this alignment, we correlate positions automatically retrieved from tweets with manifesto-based positions for the German federal elections of 2017 and 2021. Additionally, we assess whether the change in positions over time is aligned between social media and manifestos. We retrieve ideological positions by aggregating distances between parties, computed from sentence representations of their members' tweets in a corpus of more than 2M individual tweets by 421 German politicians. We leverage domain-specific information by training a sentence embedding model such that representations of tweets with co-occurring hashtags are closer to each other than those without co-occurring hashtags, following the assumption that hashtags approximate policy-related topics. Our experiments compare this political social media domain-specific model with other political-domain and general-domain sentence embedding models. We find high, significant correlations between the Twitter-retrieved positions and manifesto positions, especially for our domain-specific fine-tuned model. Moreover, for this model, we find overlaps in how the positions change over time.
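The position-retrieval step described in this abstract (aggregating pairwise distances between parties from averaged tweet representations) can be sketched roughly as follows. The party names, toy 3-dimensional "embeddings", and random data are invented for illustration and are not the thesis's actual model or data:

```python
import math
import random

def cosine_distance(a, b):
    """1 minus cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def party_distances(tweet_embeddings_by_party):
    """Average each party's tweet embeddings, then compute pairwise
    cosine distances between the party centroids."""
    cents = {p: centroid(vs) for p, vs in tweet_embeddings_by_party.items()}
    parties = sorted(cents)
    return {(p, q): cosine_distance(cents[p], cents[q])
            for i, p in enumerate(parties) for q in parties[i + 1:]}

# Toy data: three hypothetical parties with five random "tweet embeddings" each
random.seed(0)
tweets = {
    "PartyA": [[random.gauss(0, 1) for _ in range(3)] for _ in range(5)],
    "PartyB": [[random.gauss(0, 1) for _ in range(3)] for _ in range(5)],
    "PartyC": [[random.gauss(3, 1) for _ in range(3)] for _ in range(5)],
}
dists = party_distances(tweets)
```

With real sentence embeddings in place of the random vectors, the resulting distance matrix could then be compared, e.g. via correlation, with manifesto-based positions.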
These results indicate that the ideological positions of parties on Twitter correspond, to a large extent, to the ideological positions laid out in the manifestos.

Item Open Access: Evaluating methods of improving the distribution of data across users in a corpus of tweets (2023) by Milovanovic, Milan

Corpora created from social network data often serve as the data source for tasks in natural language processing. Compared to other, more standardized corpora, social media corpora have idiosyncratic properties because they consist of user-generated comments: for example, an unbalanced distribution of the respective comments, a generally lower linguistic quality, and an inherently unstructured and noisy nature. Using a Twitter-generated corpus, I investigate to what extent the unbalanced distribution of the data influences two downstream tasks that rely on word embeddings. Word embeddings are a ubiquitous and frequently used concept in the field of natural language processing. The most common models are often the means to obtain semantic information about words and their usage by representing the words in an abstract word vector space. The basic idea is that semantically similar words have similar vectors in the mapped vector space. These vectors then serve as input for standard downstream tasks such as word similarity and semantic change detection. One of the most commonly used models in current research is word2vec, and more specifically its Skip-gram architecture. The Skip-gram architecture attempts to predict the surrounding words based on the current word. The data on which this architecture is trained greatly influences the resulting word vectors.
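The Skip-gram training signal mentioned above can be illustrated by how (center, context) pairs are generated from a window around each token; the window size and toy sentence below are illustrative only, not the thesis's settings:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs as used by the
    Skip-gram architecture: each word is paired with its neighbours
    within a fixed window, and the model learns to predict the
    context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["word", "embeddings", "map", "words", "to", "vectors"],
                       window=2)
```

Because the pairs are drawn directly from the corpus, any imbalance in the underlying data (e.g. a few highly active users dominating a tweet corpus) is reflected one-to-one in the training signal.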
In the context of this work, however, no significant improvement over a fully preprocessed corpus could be found, neither for word similarity nor for semantic change detection, when filtering methods that are widely used in the literature, often without specific motivation, are applied to select a subset of the data according to defined criteria. However, comparable results could be achieved with some filters, even though the resulting models were trained on significantly fewer input tokens.

Item Open Access: Plug-and-play domain adaptation for neural machine translation (2023) by Kadiķis, Emīls

Neural machine translation has emerged as a powerful tool, yet its performance heavily relies on training data. In a fast-changing world, dealing with out-of-domain data remains a challenge, prompting the need for adaptable translation systems. While fine-tuning is a proven effective adaptation method, it is not always feasible due to data availability, memory, and computational constraints. This thesis introduces a dynamic plug-and-play method inspired by controllable text generation to enhance machine translation across various domains without fine-tuning. This method, called Plug-and-Play Neural Machine Translation (PPNMT), uses a monolingual domain-specific bag-of-words to push the hidden state of the decoder through backpropagation, making the output more in-domain. The method is tested on two types of domains: formality and gender (where the source language does not make a distinction between these aspects, but the target language does), and fine-grained technical domains (which are based on the topic inherent in the text on both the source and target sides). The method performs reasonably well for adapting the translation to different formality levels and, to a lesser extent, grammatical genders, even with a very simple bag-of-words.
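The hidden-state "push" described in this abstract can be sketched on a toy softmax output layer: one gradient step on the decoder state increases the probability mass assigned to bag-of-words tokens. The matrix, dimensions, and learning rate below are invented for illustration; the actual PPNMT method operates on a full NMT decoder:

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def push_hidden_state(h, W, bow, lr=0.1):
    """One backpropagation step that nudges a decoder hidden state h
    so that the output distribution softmax(W @ h) puts more mass on
    the bag-of-words token indices in `bow`."""
    logits = [sum(w * x for w, x in zip(row, h)) for row in W]
    p = softmax(logits)
    p_bow = sum(p[i] for i in bow)
    # Gradient of -log(p_bow) w.r.t. logit k: p_k - [k in bow] * p_k / p_bow
    dlogits = [p[k] - (p[k] / p_bow if k in bow else 0.0)
               for k in range(len(p))]
    # Chain rule back to the hidden state: dh = W^T @ dlogits
    dh = [sum(W[k][d] * dlogits[k] for k in range(len(W)))
          for d in range(len(h))]
    return [x - lr * g for x, g in zip(h, dh)]

# Toy setup: 4-word vocabulary, 3-dimensional hidden state (values invented)
W = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2], [-0.3, 0.1, 0.1], [0.5, -0.4, 0.0]]
h = [0.1, 0.2, -0.1]
bow = {2}  # index of the in-domain word
h_new = push_hidden_state(h, W, bow)
```

In the full method, this update is applied to the decoder's hidden state at generation time, so no model weights are changed and no fine-tuning is needed.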
However, it struggles with adapting the model to technical domains, and a fine-tuning baseline outperforms the proposed method in all but very low few-shot settings across all tested domains. Despite that, the method shows some interesting behaviour, adapting to formality on a level that goes beyond just using formal pronouns.

Item Open Access: Gender bias in dependency parsing (2023) by Go, Paul Stanley

Recent high-profile advances in natural language processing (NLP) have spurred interest in identifying and rectifying socially harmful problems common in NLP systems, such as gender bias. Unfortunately, many works that attempt to tackle the issue of gender bias suffer from methodological deficiencies, such as the assumption of a binary and immutable concept of gender. We scrutinize one such work, which found gender bias in dependency parsing, and evaluate whether its claims have merit. Our results were inconsistent with the gender bias findings of that paper, and further investigation through error analysis and treebank analysis revealed methodological flaws that artificially introduced differences between the female and male data sets. Mistakes made during preprocessing compromised the outcome; therefore, the results do not prove the existence of gender bias in dependency parsing. Based on our findings, we suggest a different methodology for identifying and alleviating syntactic bias that is more inclusive for everyone, no matter their gender.

Item Open Access: Evaluation of complex typological universals with language vectors and real-valued logics (2020) by Dönicke, Tillmann

Language representations and typological universals have received increasing attention in computational linguistics over the past few years. Most approaches make use of binary language vectors from typological databases and/or focus on the correlation of only two typological variables at a time.
This thesis shows that real-valued logics can be used to evaluate more complex formulae, as they are formulated by typologists, on continuous vectors derived from existing corpora. Syntactic language vectors are extracted from the Universal Dependencies treebanks (Nivre et al., 2016) and serve as the basis for the evaluation of word-order universals.

Item Open Access: Exploring the effects of enriched English language input on language model efficiency (2024) by Zeller, Tom

Recent years have seen the advent of large-scale language modeling, as exemplified by transformer-based models like GPT or variants of the BERT architecture. These models, which are trained on massive datasets using compute unattainable for actors that are not of the scale of the biggest tech companies, have shown impressive feats of syntactic and semantic understanding. Naturally, interest has risen in making these models more efficient, in terms of compute as well as data requirements. Research in this area is primarily motivated by two factors: reducing the barrier for smaller actors, such as research institutes or end consumers, to train and run state-of-the-art models, and reducing the carbon footprint of these models. To achieve this goal, model compression techniques like quantization, pruning, or distillation are utilized. This work explores a different, less model-centric and more data-centric approach: modifying the training and inference data by enriching it with syntactic and semantic information. To this end, a lexical resource is created that maps English words to a form where individual characters represent values of a range of semantic and syntactic features, providing lexical information that is accessible to all model types that operate on tokens at the sub-word or character level.
Different features and methods of representation are discussed, and their effect on model performance is evaluated by pretraining a small GPT-family model and fine-tuning it on downstream tasks of the SuperGLUE benchmark. Given a fixed amount of data and compute, the experiments show a performance advantage for a character-level model trained on the enriched data.

Item Open Access: Enhancing a German dialect corpus with neural methods (2023) by Tessadri, Wolfgang

With the advent of modern chat applications, an increasing number of German dialect speakers use their dialects for written communication. The DiDi Facebook corpus (Frey et al. 2016) captures this phenomenon for South Tyrolean dialects. While the authors included a dialect/standard variety tag at the posting level, a third of these tags were undefined. By training DeBERTa and XLM-RoBERTa for dialect/standard classification, we reduce these undefined instances by over 75%. We also use XLM-RoBERTa to add explicit variety labels to individual tokens. By performing a linear regression analysis of sociolinguistic variables and a label-derived dialectality metric, we show that the generated labels are highly meaningful. Finally, we describe how the implemented Transformer models can be applied to gather geo-referenced dialect samples on Twitter, and we discuss how this data can enrich future dialectometric research.

Item Open Access: Multimodal OCR post-correction on German historical documents (2023) by Wu, Nianheng

Optical Character Recognition (OCR) post-correction is essential for digitizing historical documents, increasing transcription accuracy, and reducing manual effort. Previous works often handle this as a text-to-text translation problem. However, the orthography of many languages, including German, has evolved across centuries, leading to many "irregular" spellings. Thus, a text-only system would face many uncertainties, and combining image features with text should therefore be beneficial.
The rise of large-scale pretrained models has brought new opportunities in this field. In this work, I will: 1) introduce a dataset of historical German documents from 1783 to 1903, based on the Deutsches Textarchiv, with aligned gold transcriptions, OCR-ed text lines, and the corresponding text-line images; 2) present a multimodal OCR post-correction system that combines the CLIP image encoder, a pretrained image feature model, with ByT5, a byte-based language model. According to my experiments, this model outperforms the state-of-the-art text-only model.

Item Open Access: Gender-neutral language detection in instructional texts (2023) by Suhr, Katharina

Gender-neutral language plays a large part in the linguistic inclusion of women as well as binary and non-binary trans individuals. Its implementation can depend on different factors, such as the language and the guidelines of specific organizations. In this work, the gender-neutral language of instructional texts is analyzed. Five different classifiers are compared on a gold dataset that consists of annotated revisions from wikiHow.com. The best-performing classifier uses a static list of gendered words to find gendered terms in a revision: once a gendered term is found, the revision is classified as gendered. Using this classifier, the remaining 11 million revisions are classified and analyzed. The analysis suggests that, even though three quarters of the articles are gender-neutral, there was no concerted effort by the editors to make the gendered ones gender-neutral.

Item Open Access: RAGAR, your falsehood RADAR : RAG-augmented reasoning for political fact-checking using multimodal large language models (2024) by Abdul Khaliq, Mohammed

The escalating challenge of misinformation, particularly in the context of political discourse, necessitates advanced solutions for fact-checking.
This thesis introduces innovative approaches to enhance the reliability and efficiency of multimodal fact-checking by integrating large language models (LLMs) with advanced reasoning techniques based on Retrieval-Augmented Generation (RAG). In the digital era, where misinformation spreads rapidly across various media, including text and images, there is a critical need for robust mechanisms capable of evaluating the veracity of political claims. This work proposes two novel methodologies, Chain of RAG (CoRAG) and Tree of RAG (ToRAG), along with hybrid implementations incorporating Chain of Thought and Chain of Verification. These approaches combine RAG techniques with multimodal LLMs and structured reasoning, and are designed to process and assess political claims by considering both textual and visual information, providing a comprehensive approach to fact-checking. This thesis explores the implementation of these approaches within a multimodal fact-checking pipeline, highlighting their effectiveness in improving the accuracy of veracity predictions and the generation of explanations. By employing multimodal LLMs adept at analyzing text and images, this research advances the capability of automated systems to identify and counter misinformation. The experimental evaluation demonstrates that the proposed RAG-augmented Reasoning (RAGAR) techniques outperform existing methods that rely on sub-question generation, offering a promising solution to the challenges of political fact-checking. This thesis contributes to the fields of computational linguistics and political science by providing an effective approach to combating fake news, thereby enhancing the integrity of political discourse in the digital age.
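The Chain of RAG idea can be sketched as an iterative question-retrieve loop over a claim. The stub functions below stand in for the multimodal LLM and retriever calls, and all names, the claim, and the stopping rule are invented for illustration; they are not the thesis's actual pipeline:

```python
from typing import Callable

def chain_of_rag(claim: str,
                 generate_question: Callable[[str, list], str],
                 retrieve: Callable[[str], str],
                 verdict: Callable[[str, list], str],
                 max_steps: int = 3) -> str:
    """Schematic Chain-of-RAG loop: repeatedly generate a follow-up
    question about the claim, retrieve evidence for it, and finally
    produce a veracity verdict from the collected evidence."""
    evidence = []
    for _ in range(max_steps):
        question = generate_question(claim, evidence)
        evidence.append((question, retrieve(question)))
    return verdict(claim, evidence)

# Stub components standing in for the multimodal LLM and retriever
claim = "Politician X said Y at event Z."
result = chain_of_rag(
    claim,
    generate_question=lambda c, ev: f"Follow-up {len(ev) + 1} about: {c}",
    retrieve=lambda q: f"retrieved snippet for '{q}'",
    verdict=lambda c, ev: "supported" if len(ev) >= 3 else "not enough info",
)
```

A tree-structured variant would branch into several candidate follow-up questions per step and keep only the best-supported branch, which is the intuition behind the ToRAG variant named above.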