Evaluating methods of improving the distribution of data across users in a corpus of tweets

Milovanovic, Milan

Files

Master_Thesis_Milovanovic.pdf (926.76 KB)

Date

2023

Authors

Milovanovic, Milan

Abstract

Corpora built from social network data often serve as the data source for natural language processing tasks. Compared to more standardized corpora, social media corpora have idiosyncratic properties because they consist of user-generated comments: for example, an unbalanced distribution of comments across users, generally lower linguistic quality, and an inherently unstructured and noisy nature. Using a corpus built from Twitter data, I investigate to what extent this unbalanced distribution influences two downstream tasks that rely on word embeddings. Word embeddings are ubiquitous in natural language processing. The most common models capture semantic information about words and their usage by representing the words in an abstract vector space. The basic idea is that semantically similar words have similar vectors in that space. These vectors then serve as input for standard downstream tasks such as word similarity and semantic change detection. One of the most common models in current research is word2vec, and more specifically its Skip-gram architecture, which attempts to predict the surrounding words from the current word. The data on which this architecture is trained greatly influences the resulting word vectors. In this work, however, filtering methods that are widely used in the literature without specific motivation to select a subset of the data according to defined criteria yielded no significant improvement over a fully preprocessed corpus, neither for word similarity nor for semantic change detection. Nevertheless, some filters achieved comparable results even though the resulting models were trained on significantly fewer tokens.
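
As a rough illustration of the Skip-gram idea mentioned in the abstract, the following minimal sketch enumerates the (current word, surrounding word) pairs that such a model would be trained to predict. It is not taken from the thesis: the function name `skipgram_pairs`, the `window` parameter, and the toy token list are assumptions made purely for illustration, and a real word2vec implementation adds subsampling, negative sampling, and the actual vector training.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for every token in the sequence.

    `window` is the assumed maximum distance between the center word
    and a context word, counted in tokens on either side.
    """
    pairs: List[Tuple[str, str]] = []
    for i, center in enumerate(tokens):
        # Clamp the context window to the boundaries of the sequence.
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Toy example (hypothetical input): one very short tokenized tweet.
pairs = skipgram_pairs(["corpora", "of", "tweets"], window=1)
for center, context in pairs:
    print(center, "->", context)
```

Because every token contributes its own set of pairs, a user who writes many tweets contributes proportionally more training pairs, which is one way to picture why the distribution of data across users can matter for the resulting vectors.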

URI

http://nbn-resolving.de/urn:nbn:de:bsz:93-opus-ds-141529
http://elib.uni-stuttgart.de/handle/11682/14152
http://dx.doi.org/10.18419/opus-14133

Collections

05 Fakultät Informatik, Elektrotechnik und Informationstechnik
