05 Fakultät Informatik, Elektrotechnik und Informationstechnik
Permanent URI for this collection: https://elib.uni-stuttgart.de/handle/11682/6
9 results
Search Results
Item Open Access
Cross-lingual citations in English papers : a large-scale analysis of prevalence, usage, and impact (2021)
Saier, Tarek; Färber, Michael; Tsereteli, Tornike
Citation information in scholarly data is an important source of insight into the reception of publications and the scholarly discourse. Outcomes of citation analyses and the applicability of citation-based machine learning approaches heavily depend on the completeness of such data. One particular shortcoming of scholarly data nowadays is that non-English publications are often not included in data sets, or that language metadata is not available. Because of this, citations between publications of differing languages (cross-lingual citations) have only been studied to a very limited degree. In this paper, we present an analysis of cross-lingual citations based on over one million English papers, spanning three scientific disciplines and a time span of three decades. Our investigation covers differences between cited languages and disciplines, trends over time, and the usage characteristics as well as impact of cross-lingual citations. Among our findings are an increasing rate of citations to publications written in Chinese, citations being primarily to local non-English languages, and consistency in citation intent between cross- and monolingual citations. To facilitate further research, we make our collected data and source code publicly available.

Item Open Access
Distributional measures of semantic abstraction (2022)
Schulte im Walde, Sabine; Frassinelli, Diego
This article provides an in-depth study of distributional measures for distinguishing between degrees of semantic abstraction. Abstraction is considered a “central construct in cognitive science” (Barsalou, 2003) and a “process of information reduction that allows for efficient storage and retrieval of central knowledge” (Burgoon et al., 2013).
Relying on the distributional hypothesis, computational studies have successfully exploited measures of contextual co-occurrence and neighbourhood density to distinguish between conceptual semantic categorisations. So far, these studies have modeled semantic abstraction across lexical-semantic tasks such as ambiguity; diachronic meaning changes; abstractness vs. concreteness; and hypernymy. Yet, the distributional approaches target different conceptual types of semantic relatedness, and to our knowledge not much attention has been paid to applying, comparing or analysing the computational abstraction measures across conceptual tasks. The current article suggests a novel perspective that exploits variants of distributional measures to investigate semantic abstraction in English in terms of the abstract-concrete dichotomy (e.g., glory-banana) and in terms of the generality-specificity distinction (e.g., animal-fish), in order to compare the strengths and weaknesses of the measures regarding categorisations of abstraction, and to determine and investigate conceptual differences. In a series of experiments we identify reliable distributional measures for both instantiations of lexical-semantic abstraction and reach a precision higher than 0.7, but the measures clearly differ for the abstract-concrete vs. abstract-specific distinctions and for nouns vs. verbs. Overall, we identify two groups of measures: (i) frequency and word entropy when distinguishing between more and less abstract words in terms of the generality-specificity distinction, and (ii) neighbourhood density variants (especially target-context diversity) when distinguishing between more and less abstract words in terms of the abstract-concrete dichotomy. We conclude that more general words are used more often and are less surprising than more specific words, and that abstract words establish themselves empirically in semantically more diverse contexts than concrete words.
Finally, our experiments once more point out that distributional models of conceptual categorisations need to take word classes and ambiguity into account: results for nouns vs. verbs differ in many respects, and ambiguity hinders fine-tuning empirical observations.

Item Open Access
Editorial - perspectives for natural language processing between AI, linguistics and cognitive science (2022)
Lenci, Alessandro; Padó, Sebastian

Item Open Access
Using the relative entropy of linguistic complexity to assess L2 language proficiency development (2021)
Sun, Kun; Wang, Rong
This study applies relative entropy to a naturalistic large-scale corpus to calculate the differences among L2 (second language) learners at different levels. We chose lemmas, tokens, POS trigrams, and conjunctions to represent lexicon and grammar and to detect the patterns of language proficiency development among different L2 groups using relative entropy. The results show that information distribution discrimination regarding lexical and grammatical differences continues to increase from L2 learners at a lower level to those at a higher level. This result is consistent with the assumption that in the course of second language acquisition, L2 learners develop towards a more complex and diverse use of language. In addition, this study applies time-series statistical methods to the data on L2 differences yielded by traditional frequency-based methods on the same L2 corpus, for comparison with the results of relative entropy. However, the results from the traditional methods rarely show regularity. Compared to the algorithms in traditional approaches, relative entropy performs much better in detecting L2 proficiency development.
In this sense, we have developed an effective and practical algorithm for stably detecting and predicting the developments in L2 learners’ language proficiency.

Item Open Access
Bootstrap co-occurrence networks of consonants and the Basic Consonant Inventory (2023)
Nikolaev, Dmitry
It has been recently shown by Nikolaev and Grossman that it is possible to provide a fine-grained typological analysis of consonant inventories of the world’s languages by investigating co-occurrence classes of segments, i.e. groups of segments that tend to be found together in inventories. Nikolaev and Grossman argued that the structure of many such co-occurrence classes contradicts the Feature-Economy Principle. As a side product of this analysis, a new definition of the Basic Consonant Inventory (BCI) - a cluster of segments forming the bedrock of consonantal inventories of the world’s languages - was provided. This paper replicates the co-occurrence study in an arguably more robust way. In addition to making a methodological contribution, it shows that some of the co-occurrence classes defined by Nikolaev and Grossman, including the BCI, are not statistically stable and may be an artefact of the imbalance in the language sample used for the analysis. The findings of the authors regarding the Feature-Economy Principle, however, were corroborated.

Item Open Access
Determinants of grader agreement : an analysis of multiple short answer corpora (2021)
Padó, Ulrike; Padó, Sebastian
The ‘short answer’ question format is a widely used tool in educational assessment, in which students write one to three sentences in response to an open question. The answers are subsequently rated by expert graders. The agreement between these graders is crucial for reliable analysis, both in terms of educational strategies and in terms of developing automatic models for short answer grading (SAG), an active research topic in NLP.
This makes it important to understand the properties that influence grader agreement (such as question difficulty, answer length, and answer correctness). However, the twin challenges towards such an understanding are the wide range of SAG corpora in use (which differ along a number of dimensions) and the hierarchical structure of potentially relevant properties (which can be located at the corpus, answer, or question levels). This article uses generalized mixed effects models to analyze the effect of various such properties on grader agreement in six major SAG corpora for two main assessment tasks (language and content assessment). Overall, we find broad agreement among corpora, with a number of properties behaving similarly across corpora (e.g., shorter answers and correct answers are easier to grade). Some properties show more corpus-specific behavior (e.g., the question difficulty level), and some corpora are more in line with general tendencies than others. In sum, we obtain a nuanced picture of how the major short answer grading corpora are similar and dissimilar, from which we derive suggestions for corpus development and analysis.

Item Open Access
Sprachwissenschaft: Eine Wissenschaft? (2004)
Kamp, Hans
Linguistics belongs to the "Sciences de l'Homme", the sciences concerned with our own species, Homo sapiens, and in particular with those aspects that distinguish humans, among all living beings, as the unique creatures we take ourselves to be. These aspects always have something to do with the human "mind", or, if one prefers such a formulation, with the cognitive endowment of humans and the use they make of it. This holds in particular for the subject matter of linguistics: human languages and our ability to acquire and use them. Linguistics is not the only science that has language as its object.
In many other "Sciences de l'Homme" it likewise plays an important or even central role. In some, such as literary studies, the concern is even, just as in linguistics, exclusively with language.

Item Open Access
Text- und Data-Mining : urheberrechtliche Grenzen der Nachnutzung wissenschaftlicher Korpora und ihre Bedeutung für die Digital Humanities (2021)
Kleinkopf, Felicitas; Jacke, Janina; Gärtner, Markus

Item Open Access
Two languages, one treebank : building a Turkish-German code-switching treebank and its challenges (2022)
Çetinoğlu, Özlem; Çöltekin, Çağrı
This paper presents the SAGT Turkish-German code-switching treebank, along with observations and annotation challenges we encountered during its development. The treebank consists of transcriptions of bilingual conversations annotated with several layers: language IDs, lemmas, POS tags, morphological features, and dependency relations. The annotations follow the Universal Dependencies annotation scheme and the conventions used in monolingual treebanks as much as possible. We present and discuss a number of issues that arise because of the need for consistent multilingual annotation within a single treebank, as well as the informal language, which is where code-switching is observed most. Besides proposing solutions to these issues, we present some observations about code-switching phenomena that can only be observed in a data set with rich linguistic annotation. The treebank was annotated with a focus on quality of annotations through an iterative process of detecting and correcting annotation errors. We also present quantitative measures that indicate annotation quality. The code-switching treebank created in this study is released to the public through Universal Dependencies repositories.