05 Fakultät Informatik, Elektrotechnik und Informationstechnik
Permanent URI for this collection: https://elib.uni-stuttgart.de/handle/11682/6
37 results
Search Results
Item Open Access
Effects of paraphrasing and demographic metadata on NLI classification performance (2023)
Marx Larre, Miguel
Native language identification (NLI) refers to the task of automatically deducing the native language (L1) of a document's author when the document is written in a second language (L2). Documents stem from different sources, but recently more documents are altered before publication through paraphrasing methods. This alteration changes the content, grammar, and style of the document, which inherently obfuscates the L1 of the author. In addition, the demographic metadata of the author, such as age and gender, may influence how well an author's L1 can be detected. In this thesis, two corpora that provide the necessary demographic metadata, the International Corpus of Learner English (ICLE) and the Trustpilot corpus, are used to analyze the impact of paraphrasing and demographic factors in the context of NLI tasks. To analyze the effect of paraphrasing on a document, new versions of both corpora are created that contain paraphrased versions of the original documents. The effect is inspected using two state-of-the-art NLI systems to perform the task, and the results are analyzed using regression analysis in combination with dominance analysis (DA). Paraphrasing was found to have a substantial influence on the performance of NLI tasks, regardless of corpus, classifier, or paraphrasing method. The usual influence of demographic factors on NLI tasks could not be confirmed in this thesis. Regression analysis and DA allowed for a more profound analysis of the results, which yielded findings regarding the influence of specific L1s on the performance of NLI tasks.

Item Open Access
Emotion classification based on the emotion component model (2020)
Heindl, Amelie
The term emotion is, despite its frequent use, still mysterious to researchers. This poses difficulties for the task of automatic emotion detection in text. 
At the same time, applications for emotion classifiers are increasing steadily in today's digital society, where humans are constantly interacting with machines. Hence, there is a need to improve current state-of-the-art emotion classifiers. The Swiss psychologist Klaus Scherer published an emotion model according to which an emotion is composed of changes in five components: cognitive appraisal, physiological symptoms, action tendencies, motor expressions, and subjective feelings. This model, which he calls CPM, has gained recognition in psychology and philosophy but has so far not been used for NLP tasks. In this work, we investigate whether it is possible to automatically detect the CPM components in social media posts and whether information on those components can aid the detection of emotions. We create a text corpus of 2100 Twitter posts in which every instance is labeled with exactly one emotion and a binary label for each CPM component. With a Maximum Entropy classifier we manage to detect CPM components with an average F1-score of 0.56 and an average accuracy of 0.82 on this corpus. Furthermore, we compare baseline versions of one Maximum Entropy and one CNN emotion classifier to extensions of those classifiers that use the CPM annotations and predictions as additional features. We find slight performance increases of up to 0.03 in F1-score for emotion detection when CPM information is incorporated.

Item Open Access
Modeling paths in knowledge graphs for context-aware prediction and explanation of facts (2019)
Stadelmaier, Josua
Knowledge bases are an important resource for question answering systems and search engines but often suffer from incompleteness. This work considers the problem of knowledge base completion (KBC). In the context of natural language processing, knowledge bases comprise facts that can be formalized as triples of the form (entity 1, relation, entity 2). 
A common approach to the KBC problem is to learn representations for entities and relations that allow for generalizing existing connections in the knowledge base to predict the correctness of a triple that is not in the knowledge base. In this work, I propose the context path model, which is based on this approach. In contrast to existing KBC models, it also provides explanations for predictions. For this purpose, it uses paths that capture the context of a given triple. The context path model can be applied on top of several existing KBC models. In a manual evaluation, I observe that most of the paths the model uses as explanations are meaningful and provide evidence for assessing the correctness of triples. I also show in an experiment that the performance of the context path model on a standard KBC task is close to that of a state-of-the-art model.

Item Open Access
Modeling the evaluative nature of German personal name compounds (2023)
Deeg, Tana
German personal name compounds such as Villen-Spahn ('villa-Spahn'), Gold-Rosi ('gold-Rosi') and Folter-Bush ('torture-Bush') are a rather infrequent phenomenon in the German language. They have the structure of determinative compounds and serve as a nickname for a usually well-known person. According to Belosevic (2022), personal name compounds are mostly evaluative, i.e. they evaluate the person behind the name in a positive or negative way. Further research on an evaluation across different groups of compounds (politics, show business, sports) is proposed. This work investigates the evaluative nature of 413 German personal name compounds that mostly have the structure of a noun as modifier and a last name as head. The 131 corresponding full names are considered as well, e.g. Jens Spahn corresponds to Villen-Spahn. The context data of compounds and names was collected from Twitter and the Leipzig Corpora Collection. 
The valence values of these context words, based on a valence database of Köper and Schulte im Walde (2016), are used to investigate the evaluative nature of compounds in comparison to their names. Furthermore, the relation to and function of the modifier are examined. The valence values are then used to verify whether there are noticeable differences between the groups of compounds. Afterwards, a linear regression is implemented to predict a 'delta' value: the difference between name valence and compound valence. Several predictor variables such as name valence, compound valence, modifier valence, age, gender, political party and nationality are used. The results reveal that compounds are both positively and negatively evaluative in comparison to their full name while highlighting the reason why they were created. Compound valence and modifier valence are only partially correlated because modifiers are involved rather accidentally or are interpreted ironically. Lastly, noticeable differences between the groups can be observed, with politicians being the most negative group regarding their valence values. Conducting the linear regression with different combinations of predictor variables shows that compound valence is a highly significant predictor. Other variables such as modifier valence, age or political party also yield models that predict the delta value very well.

Item Open Access
Automatische Kategorisierung von Autoren in Bezug auf Arzneimittel in Twitter [Automatic categorization of authors with respect to pharmaceutical drugs on Twitter] (2016)
Xu, Min
With the rapidly growing popularity of Twitter, an ever wider range of topics is being discussed there, including the effects of pharmaceutical drugs. It is therefore of interest to find out which social groups tend to discuss particular drugs on Twitter and which drugs are discussed most. 
Text classification offers a way to categorize the large number of tweets. In this thesis, this is done mainly with a Maximum Entropy classifier, which is used to identify the authors of the tweets. Because the Maximum Entropy model can take a large number of relevant or irrelevant pieces of probabilistic evidence into account, the Maximum Entropy classifier achieves better results than the naive Bayes classifier on the multi-class classification in this thesis. The effect on the performance of the Maximum Entropy classifier of different feature selection methods, namely Information Gain & Mutual Information and an LDA topic model, and of different numbers of features is compared and analyzed. The results show that Information Gain & Mutual Information and the LDA topic model are good practical approaches for identifying the features of short texts. The Maximum Entropy classifier achieves an average test accuracy of 79.8%.

Item Open Access
Automatic classification of abstractness in English rigid nouns (2023)
Saponaro, Alberto
The main difference between (i) mass-count languages (such as English) and (ii) classifier languages (such as Chinese) is that the former encode information about nouns' countability in their grammar, while the latter employ a system of classifiers to distinguish between individuals and substances. While the mass-count distinction is a characteristic of mass-count languages, the substance-individual denotation seems to be a concept universally available to all humans. Another concept that appears to be universally accessible and linked to the countability status of English nouns is the notion of abstractness. Mass nouns usually refer to abstract objects, and this is confirmed by the distribution of abstractness in the dataset. 
The objective of this thesis is to provide a model for the classification of rigid nouns (count or mass only) that is capable of generalizing over the degree of abstractness. Additionally, it tests whether a model trained with the same set of features is capable of rating the abstractness of those nouns. To accomplish these tasks, several sets of features are identified based on syntactic and semantic properties of nouns that describe the mass-count distinction. The results indicate that the first model M1, a mass-count classifier that predicts the countability class of a rigid noun, provides reliable predictions and can generalize over the degree of abstractness of the targets. The second model M2, an abstractness rate predictor that assigns an abstractness rate from 1 to 5 to a rigid noun, is incapable of providing reliable ratings and cannot generalize over the countability status of the targets. A third model M3, an abstract-concrete (binary) classifier that predicts the abstractness class of a rigid noun, provides reliable predictions and can generalize over the countability status of the targets. Given that these results concern rigid nouns only, further research can be conducted by examining the abstractness of elastic nouns. However, this requires an annotation that rates the abstractness of noun senses.

Item Open Access
The Impact of intensifiers, diminishers and negations on emotion expressions (2017)
Strohm, Florian
There are several areas of application for emotion detection systems, for example social media analysis, for which it is important to reliably recognize expressed emotions. This thesis takes into account negations, intensifiers and diminishers that modify emotion expressions in tweets, in order to study whether doing so can improve an emotion detection system. It uses different emotion classifiers together with various modifier detection approaches to evaluate the impact of modifiers on emotion expressions. 
The results show that an emotion detection system can be slightly improved if negations are taken into account. The thesis also studies the correlation between modified emotion words and basic emotions to obtain a better understanding of modified emotions. The analysis of the results shows correlations between modified and basic emotions, which enables us to determine the expressed basic emotion of modified emotion words.

Item Open Access
Generierung von synthetischen Trainingsdaten für die Erkennung von Absenderdaten aus Brief-Korrespondenz [Generating synthetic training data for recognizing sender data in letter correspondence] (2020)
Burkhardt, Jannik
A problem that frequently arises in machine learning projects is the lack of suitable training data. This thesis investigates how much benefit synthetic data provides in situations where only very little real training data is available. Using sender data recognition in letter correspondence as an example, it describes which properties of synthetic documents need attention so that an artificial intelligence trained on them can also process real documents. It is shown that the results of an artificial intelligence trained on both a small amount of real data and a large corpus of synthetic data are many times more accurate than when synthetic data is omitted. It follows that in situations where real training data is not available, synthetic data is a usable alternative.

Item Open Access
Optimierung von Clustering von Wortverwendungsgraphen [Optimizing the clustering of word usage graphs] (2021)
Tunc, Benjamin
Algorithms for clustering Word Usage Graphs are not optimal in terms of efficiency and often do not find the optimal clustering loss on larger graphs. Our aim in this paper is to find efficient ways to approximate the global minimum of a clustering loss function on three Word Usage Graph data sets using correlation clustering and simulated annealing. 
To this end, we define 321 models with different initialization modifications, parameter combinations and stopping criteria and evaluate them in terms of loss, similarity to word sense description annotation, robustness and runtime. We evaluate different approaches and define efficient models with a dynamic stopping criterion to find the lowest loss, which yield robust cluster solutions. We find that lowering the loss leads to better clustering solutions.

Item Open Access
Kategorisierung der Zustandsveränderungen bei CoS-Verben auf Basis von Bild- und/oder Textdaten [Categorizing state changes of CoS verbs based on image and/or text data] (2023)
Godbersen, Jule
Both textual and visual information can be relevant for understanding an action. This bachelor's thesis considers actions that lead to a change in the state of the object involved. The goal is to answer the research question of what contribution each modality makes to predicting such state changes. The prediction uses categories such as color, size, and quantity. A central part of this thesis is the creation of a dataset containing example occurrences of actions with state changes. A further task is to have some of these data points annotated with categories of state changes. In addition, an ablation study is carried out starting from a visio-linguistic model. Using various classifiers, it makes it possible to test the influence of the different modalities on a model's performance in predicting state changes. Among other things, this thesis illustrates difficulties encountered during annotation. Performance in predicting categories, measured by accuracy, is about as high for the classifiers as for a baseline model. 
The different classifiers make predictions with similar accuracy, so the research question cannot be adequately answered with the results of this thesis. The hypothesis that the combination of textual and visual modalities provides complementary information, and that combining both modalities is therefore relevant, is not confirmed by the results. In addition, this thesis shows that the trained classifiers make it possible to generalize, to a certain extent, to unseen data points, unseen verbs, and unseen domains.