OPUS - Online Publications of University Stuttgart
Browsing by Author "Milovanovic, Milan"

Now showing 1 - 2 of 2
  • Item (Open Access)
    Evaluating methods of improving the distribution of data across users in a corpus of tweets
    (2023) Milovanovic, Milan
    Corpora created from social network data often serve as the data source for tasks in natural language processing. Compared to other, more standardized corpora, social media corpora have idiosyncratic properties because they consist of user-generated comments: the distribution of comments across users is unbalanced, the linguistic quality is generally lower, and the data are inherently unstructured and noisy. Using a Twitter-generated corpus, I investigate to what extent the unbalanced distribution of the data influences two downstream tasks that rely on word embeddings. Word embeddings are a ubiquitous concept in natural language processing: the most common models obtain semantic information about words and their usage by representing them in an abstract vector space, the basic idea being that semantically similar words receive similar vectors. These vectors then serve as input for standard downstream tasks such as word similarity and semantic change detection. One of the most common models in current research is word2vec, more specifically its Skip-gram architecture, which attempts to predict the surrounding words from the current word; the data on which it is trained greatly influences the resulting word vectors (a minimal training sketch follows this listing). In this work, however, no significant improvement over a fully preprocessed corpus could be found, neither for word similarity nor for semantic change detection, when filtering methods that are widely used in the literature without specific motivation were applied to select a subset of the data according to defined criteria. Comparable results could nevertheless be achieved with some filters, even though the resulting models were trained on significantly fewer tokens.
  • Item (Open Access)
    Investigating different levels of joining entity and relation classification
    (2018) Milovanovic, Milan
    Named entities, such as persons or locations, are crucial bearers of information within unstructured text. Recognition and classification of these (named) entities is an essential part of information extraction. Relation classification, the process of categorizing semantic relations between two entities within a text, is another task closely linked to named entities. These two tasks, entity and relation classification, have commonly been treated as a pipeline of two separate models. While this separation simplifies the problem, it also disregards underlying dependencies and connections between the two subtasks. As a consequence, merging both subtasks into one joint model for entity and relation classification is the next logical step. The goal of this thesis is a thorough investigation and comparison of different levels of joining the two tasks, accomplished by defining these levels and by developing (implementing and evaluating) and analyzing machine learning models for each of them. The levels investigated are: (L1) a pipeline of independent models for entity classification and relation classification; (L2) using the entity class predictions as features for relation classification; (L3) global features for both entity and relation classification; and (L4) explicit use of a single joint model for entity and relation classification (a toy sketch contrasting L1 and L2 follows this listing). The best results are achieved with the model for level L3, with an F1 score of 0.830 for entity classification and an F1 score of 0.52 for relation classification.
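
The first abstract describes word2vec Skip-gram training on a filtered tweet corpus. The following is a minimal sketch of that setup, assuming the gensim library; the input file "tweets.txt", the whitespace tokenizer, and the minimum-length filter are hypothetical placeholders, not the thesis's actual preprocessing or filtering pipeline.

```python
# Minimal sketch (not the thesis's pipeline): train a word2vec Skip-gram
# model on a filtered subset of tweets and query word similarity.
# "tweets.txt", the tokenizer, and the length filter are placeholders.
from gensim.models import Word2Vec


def load_tweets(path="tweets.txt"):
    """Yield lower-cased, whitespace-tokenized tweets, one tweet per line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.lower().split()
            if tokens:
                yield tokens


# Example filter criterion: keep only tweets with at least five tokens,
# i.e. select a subset of the data before training.
sentences = [tweet for tweet in load_tweets() if len(tweet) >= 5]

model = Word2Vec(
    sentences,
    sg=1,             # 1 = Skip-gram: predict surrounding words from the current word
    vector_size=100,  # dimensionality of the word vectors
    window=5,         # context window size
    min_count=5,      # ignore words occurring fewer than 5 times
    workers=4,
    epochs=5,
)

# Word similarity corresponds to cosine similarity in the learned vector space.
if "happy" in model.wv:
    print(model.wv.most_similar("happy", topn=5))
```

Semantic change detection would additionally require training models on time-sliced subcorpora and comparing or aligning the resulting vector spaces, which this sketch does not cover.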
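For the second abstract, the sketch below contrasts level L1 (a pipeline of independent models) with level L2 (entity class predictions as extra features for relation classification) on toy, randomly generated data with scikit-learn; the feature dimensionality, label sets, classifiers, and the macro averaging of F1 are illustrative assumptions, not the models or evaluation actually developed in the thesis.

```python
# Toy illustration of levels L1 and L2 (not the thesis's actual models):
# L1 trains entity and relation classifiers independently; L2 feeds the
# entity classifier's predicted classes into the relation classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Hypothetical data: 200 instances with 10-dimensional text features.
X_text = rng.normal(size=(200, 10))
y_entity = rng.integers(0, 3, size=200)    # e.g. PER / LOC / ORG
y_relation = rng.integers(0, 2, size=200)  # e.g. related / unrelated

# L1: independent models; the relation classifier sees only text features.
entity_clf = LogisticRegression(max_iter=1000).fit(X_text, y_entity)
relation_l1 = LogisticRegression(max_iter=1000).fit(X_text, y_relation)

# L2: append the predicted entity classes as an additional feature
# for relation classification.
entity_pred = entity_clf.predict(X_text).reshape(-1, 1)
X_l2 = np.hstack([X_text, entity_pred])
relation_l2 = LogisticRegression(max_iter=1000).fit(X_l2, y_relation)

# Macro-averaged F1 scores on the toy data (averaging scheme is an assumption).
print("entity F1:  ", f1_score(y_entity, entity_clf.predict(X_text), average="macro"))
print("relation F1:", f1_score(y_relation, relation_l2.predict(X_l2), average="macro"))
```

Levels L3 (global features for both tasks) and L4 (a single joint model) would require shared representations or one model predicting both label types, which this toy sketch does not attempt.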