05 Fakultät Informatik, Elektrotechnik und Informationstechnik

Permanent URI for this collection: https://elib.uni-stuttgart.de/handle/11682/6

  • Open Access
    Cross-lingual citations in English papers : a large-scale analysis of prevalence, usage, and impact
    (2021) Saier, Tarek; Färber, Michael; Tsereteli, Tornike
    Citation information in scholarly data is an important source of insight into the reception of publications and the scholarly discourse. Outcomes of citation analyses and the applicability of citation-based machine learning approaches heavily depend on the completeness of such data. One particular shortcoming of scholarly data nowadays is that non-English publications are often not included in data sets, or that language metadata is not available. Because of this, citations between publications of differing languages (cross-lingual citations) have only been studied to a very limited degree. In this paper, we present an analysis of cross-lingual citations based on over one million English papers, spanning three scientific disciplines and a time span of three decades. Our investigation covers differences between cited languages and disciplines, trends over time, and the usage characteristics as well as impact of cross-lingual citations. Among our findings are an increasing rate of citations to publications written in Chinese, citations being primarily to local non-English languages, and consistency in citation intent between cross- and monolingual citations. To facilitate further research, we make our collected data and source code publicly available.
  • Open Access
    Distributional measures of semantic abstraction
    (2022) Schulte im Walde, Sabine; Frassinelli, Diego
    This article provides an in-depth study of distributional measures for distinguishing between degrees of semantic abstraction. Abstraction is considered a “central construct in cognitive science” (Barsalou, 2003) and a “process of information reduction that allows for efficient storage and retrieval of central knowledge” (Burgoon et al., 2013). Relying on the distributional hypothesis, computational studies have successfully exploited measures of contextual co-occurrence and neighbourhood density to distinguish between conceptual semantic categorisations. So far, these studies have modelled semantic abstraction across lexical-semantic tasks such as ambiguity, diachronic meaning changes, abstractness vs. concreteness, and hypernymy. Yet, the distributional approaches target different conceptual types of semantic relatedness, and to our knowledge not much attention has been paid to applying, comparing or analysing the computational abstraction measures across conceptual tasks. The current article suggests a novel perspective that exploits variants of distributional measures to investigate semantic abstraction in English in terms of the abstract-concrete dichotomy (e.g., glory-banana) and in terms of the generality-specificity distinction (e.g., animal-fish), in order to compare the strengths and weaknesses of the measures regarding categorisations of abstraction, and to determine and investigate conceptual differences. In a series of experiments we identify reliable distributional measures for both instantiations of lexical-semantic abstraction and reach a precision higher than 0.7, but the measures clearly differ for the abstract-concrete vs. abstract-specific distinctions and for nouns vs. verbs.
    Overall, we identify two groups of measures: (i) frequency and word entropy when distinguishing between more and less abstract words in terms of the generality-specificity distinction, and (ii) neighbourhood density variants (especially target-context diversity) when distinguishing between more and less abstract words in terms of the abstract-concrete dichotomy. We conclude that more general words are used more often and are less surprising than more specific words, and that abstract words establish themselves empirically in semantically more diverse contexts than concrete words. Finally, our experiments once more point out that distributional models of conceptual categorisations need to take word classes and ambiguity into account: results for nouns vs. verbs differ in many respects, and ambiguity hinders fine-tuning empirical observations.
  • Open Access
    Advances in clinical voice quality analysis with VOXplot
    (2023) Barsties von Latoszek, Ben; Mayer, Jörg; Watts, Christopher R.; Lehnert, Bernhard
    Background: Voice quality can be assessed perceptually in standard clinical practice, which may also include acoustic evaluation of digital voice recordings to validate and further interpret perceptual judgments. The goal of the present study was to determine the strongest acoustic voice quality parameters for perceived hoarseness and breathiness when analyzing the sustained vowel [a:] using a new clinical acoustic tool, the VOXplot software. Methods: A total of 218 voice samples of individuals with and without voice disorders were subjected to perceptual and acoustic analyses. Overall, 13 single acoustic parameters were included to determine validity aspects in relation to perceptions of hoarseness and breathiness. Results: Four single acoustic measures could be clearly associated with perceptions of hoarseness or breathiness. For hoarseness, the harmonics-to-noise ratio (HNR) and pitch perturbation quotient with a smoothing factor of five periods (PPQ5), and, for breathiness, the smoothed cepstral peak prominence (CPPS) and the glottal-to-noise excitation ratio (GNE) were shown to be highly valid, with a significant difference being demonstrated for each of the other perceptual voice quality aspects. Conclusions: Two acoustic measures, the HNR and the PPQ5, were both strongly associated with perceptions of hoarseness and were able to discriminate hoarseness from breathiness with good confidence. Two other acoustic measures, the CPPS and the GNE, were both strongly associated with perceptions of breathiness and were able to discriminate breathiness from hoarseness with good confidence.
  • Open Access
    Resources for Turkish natural language processing : a critical survey
    (2022) Çöltekin, Çağrı; Doğruöz, A. Seza; Çetinoğlu, Özlem
    This paper presents a comprehensive survey of corpora and lexical resources available for Turkish. We review a broad range of resources, focusing on the ones that are publicly available. In addition to providing information about the available linguistic resources, we present a set of recommendations, and identify gaps in the data available for conducting research and building applications in Turkish Linguistics and Natural Language Processing.
  • Open Access
    Using the relative entropy of linguistic complexity to assess L2 language proficiency development
    (2021) Sun, Kun; Wang, Rong
    This study applies relative entropy to a naturalistic large-scale corpus to calculate the differences among L2 (second language) learners at different levels. We chose lemma, token, POS trigram, and conjunction to represent lexicon and grammar, and used relative entropy to detect the patterns of language proficiency development among different L2 groups. The results show that information distribution discrimination regarding lexical and grammatical differences continues to increase from L2 learners at a lower level to those at a higher level. This result is consistent with the assumption that in the course of second language acquisition, L2 learners develop towards a more complex and diverse use of language. For comparison with the relative entropy results, this study also applies time series statistics to the data on L2 differences yielded by traditional frequency-based methods on the same L2 corpus. However, the results from the traditional methods rarely show regularity. Compared to the algorithms in traditional approaches, relative entropy performs much better in detecting L2 proficiency development. In this sense, we have developed an effective and practical algorithm for stably detecting and predicting the developments in L2 learners’ language proficiency.
  • Open Access
    Between welcome culture and border fence : a dataset on the European refugee crisis in German newspaper reports
    (2023) Blokker, Nico; Blessing, André; Dayanik, Erenay; Kuhn, Jonas; Padó, Sebastian; Lapesa, Gabriella
    Newspaper reports provide a rich source of information on the unfolding of public debates, which can serve as a basis for inquiry in political science. Such debates are often triggered by critical events, which attract public attention and incite the reactions of political actors: crisis sparks the debate. However, due to the challenges of reliable annotation and modeling, few large-scale datasets with high-quality annotation are available. This paper introduces DebateNet2.0, which traces the political discourse on the 2015 European refugee crisis in the German quality newspaper taz. The core units of our annotation are political claims (requests for specific actions to be taken) and the actors who advance them (politicians, parties, etc.). Our contribution is twofold. First, we document and release DebateNet2.0 along with its companion R package, mardyR. Second, we outline and apply a Discourse Network Analysis (DNA) to DebateNet2.0, comparing two crucial moments of the policy debate on the “refugee crisis”: the migration flux through the Mediterranean in April/May and the one along the Balkan route in September/October. We guide the reader through the methods involved in constructing a discourse network from a newspaper, demonstrating that there is not one single discourse network for the German migration debate, but multiple ones, depending on the research question through the associated choices regarding political actors, policy fields and time spans.
  • Open Access
    Bootstrap co-occurrence networks of consonants and the Basic Consonant Inventory
    (2023) Nikolaev, Dmitry
    Nikolaev and Grossman recently showed that it is possible to provide a fine-grained typological analysis of consonant inventories of the world’s languages by investigating co-occurrence classes of segments, i.e. groups of segments that tend to be found together in inventories. Nikolaev and Grossman argued that the structure of many such co-occurrence classes contradicts the Feature-Economy Principle. As a side product of this analysis, a new definition of the Basic Consonant Inventory (BCI), a cluster of segments forming the bedrock of consonantal inventories of the world’s languages, was provided. This paper replicates the co-occurrence study in an arguably more robust way. In addition to making a methodological contribution, it shows that some of the co-occurrence classes defined by Nikolaev and Grossman, including the BCI, are not statistically stable and may be an artefact of the imbalance in the language sample used for the analysis. The findings of the authors regarding the Feature-Economy Principle, however, were corroborated.
  • Open Access
    Determinants of grader agreement : an analysis of multiple short answer corpora
    (2021) Padó, Ulrike; Padó, Sebastian
    The ‘short answer’ question format is a widely used tool in educational assessment, in which students write one to three sentences in response to an open question. The answers are subsequently rated by expert graders. The agreement between these graders is crucial for reliable analysis, both in terms of educational strategies and in terms of developing automatic models for short answer grading (SAG), an active research topic in NLP. This makes it important to understand the properties that influence grader agreement (such as question difficulty, answer length, and answer correctness). However, the twin challenges towards such an understanding are the wide range of SAG corpora in use (which differ along a number of dimensions) and the hierarchical structure of potentially relevant properties (which can be located at the corpus, answer, or question levels). This article uses generalized mixed effects models to analyze the effect of various such properties on grader agreement in six major SAG corpora for two main assessment tasks (language and content assessment). Overall, we find broad agreement among corpora, with a number of properties behaving similarly across corpora (e.g., shorter answers and correct answers are easier to grade). Some properties show more corpus-specific behavior (e.g., the question difficulty level), and some corpora are more in line with general tendencies than others. In sum, we obtain a nuanced picture of how the major short answer grading corpora are similar and dissimilar, from which we derive suggestions for corpus development and analysis.
  • Open Access
    The links of causal chains
    (2022) Kamp, Hans
    This paper is about the Causal Theory of Names, as outlined by Kripke in Naming and Necessity. The paper argues that causal chains which connect users in command of a name N with those present at the baptismal event in which N was introduced are branches of networks of ‘N‐labelled’ entity representations in the minds of past and present users of N. These networks of N‐labelled entity representations are special cases of networks that result in general from the use of referring expressions. Such networks are an important part of the fabric that holds a speech community together and point towards a view of language as a social practice. The theory of networks and chains is developed within MSDRT (‘Mental State Discourse Representation Theory’), an extension of DRT designed for the description of utterance contents, propositional attitudes, mental states and the ways in which mental states change in the course of verbal communication. The last section of the paper explores the view of languages as social practices somewhat further in the light of the network theory developed in the sections leading up to it.