05 Fakultät Informatik, Elektrotechnik und Informationstechnik
Permanent URI for this collection: https://elib.uni-stuttgart.de/handle/11682/6
Item Open Access
Computational approaches to the comparison of regional variety corpora: prototyping a semi-automatic system for German (2013)
Anstein, Stefanie; Heid, Ulrich (Prof. Dr. phil. habil.)
Regional varieties of pluri-centric languages such as German are generally very similar with respect to their structure and the linguistic phenomena that occur. The extraction of differences is thus crucial, e.g., for variety documentation, lexicography, or didactics. In this thesis, computational approaches to the comparison of regional variety corpora are explored in order to support manual analyses by variety linguists. A feasibility study on semi-automatic corpus comparison has been conducted by developing a prototype system, in order to determine on which levels of linguistic description such automation is possible and to what extent. Further aims are to show which features of the input corpora produce the best results and to examine the ‘relevance ranking’ of the output. In addition, the potential of integrating available standard tools and the transferability of the system to other languages have been explored. Written corpora, which have been made increasingly available through initiatives such as Korpus Südtirol, are used as an empirical basis to extract differences semi-automatically, which is more efficient and more objective than a purely manual approach. The results yielded by the prototype system Vis-À-Vis assist variety linguists in their detailed qualitative analyses by reducing corpus comparison output to presumably relevant phenomena. In regional variety linguistics, numerous manual approaches have been applied and various single studies have been carried out, followed more recently by an increasing number of automated studies on the basis of corpora being developed for pluri-centric languages.
In computational linguistics, the analysis and comparison of corpora through automated systems, in order to find differences on various levels of linguistic description, has been conducted for a considerable time (e.g. for register studies), yielding promising results. Vis-À-Vis applies linguistic pattern extraction as well as statistical output comparison, combining existing standard tools with adapted or newly-developed tools. It is a modular system that offers a user-friendly graphical interface, available online and for downloading. The processing of the corpus input consists of data annotation and the extraction of phenomena on different levels of linguistic description (i.e. at the uni-gram level, the bi-gram level, and selected aspects of the syntactic level). Vis-À-Vis produces ranked ‘candidate’ lists of variety peculiarities – by filtering through corpus-external linguistic knowledge and by applying statistical association measures to identify significantly different phenomena in the two input corpora, in order to reduce the output to probably relevant phenomena. The quantitative evaluation showed that the system performs clearly better than a baseline approach and that it outperforms well-known commercial systems. Furthermore, first qualitative results produced by Vis-À-Vis led to suggestions for refining and enhancing variant dictionary entries. The overall conclusion of this work is that a semi-automatic approach to variety comparison is clearly promising for lower levels of linguistic description, and – with further refinements – for more complex levels as well. The comparability of the input corpora turned out to be crucial for usable results, and the association measures used for relevance ranking proved to be valuable. Standard corpus processing tools have been integrated, and the transferability to other pluricentric languages is ensured by the system’s modular architecture. 
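As a rough illustration of the ranking step described above, the following sketch scores uni-grams or bi-grams by how differently they are distributed across two corpora. It assumes the two-term log-likelihood statistic that is commonly used for corpus frequency comparison; the abstract does not say which association measures Vis-À-Vis actually applies. The function names and the toy sentences are invented for this example.

```python
import math
from collections import Counter


def log_likelihood(a, b, n1, n2):
    """Log-likelihood score for one item seen a times in corpus 1
    (n1 items in total) and b times in corpus 2 (n2 items in total).
    Assumed measure, not necessarily the one used by Vis-A-Vis."""
    e1 = n1 * (a + b) / (n1 + n2)  # expected count in corpus 1
    e2 = n2 * (a + b) / (n1 + n2)  # expected count in corpus 2
    score = 0.0
    if a > 0:
        score += a * math.log(a / e1)
    if b > 0:
        score += b * math.log(b / e2)
    return 2 * score


def rank_candidates(tokens_a, tokens_b, n=1):
    """Return (score, n-gram) pairs, highest score first."""
    def ngrams(tokens):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    counts_a, counts_b = Counter(ngrams(tokens_a)), Counter(ngrams(tokens_b))
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    scored = [
        (log_likelihood(counts_a[g], counts_b[g], total_a, total_b), g)
        for g in set(counts_a) | set(counts_b)
    ]
    return sorted(scored, reverse=True)


# Invented toy data: two tiny "corpora" that differ in one word.
corpus_a = "das Wetter im Jänner und im Jänner".split()
corpus_b = "das Wetter im Januar und im Januar".split()
for score, gram in rank_candidates(corpus_a, corpus_b)[:3]:
    print(round(score, 2), " ".join(gram))
```

Items that occur equally often in both corpora score zero and sink to the bottom of the list, which is the sense in which such a measure reduces the output to candidate differences. Passing n=2 ranks bi-grams in the same way.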
Complying with the research desiderata identified, comprehensive methods for systematic regional variety studies have been assessed and made available. This work has contributed to applied variety linguistic research, resulting in benefits for general variety description. Regional variety linguists as well as lexicographers, teachers and learners, and the interested public all benefit from the results of such a specifically tailored tool; they can use these results as a compact empirical basis, extracted from large amounts of authentic data, for their detailed qualitative analyses. Through an easily accessible user-friendly application of a comprehensive computational system, they are supported in efficiently extracting differences between varieties of pluri-centric languages. Bootstrapping processes will further enhance the input data and the methods to provide increasingly better results of variety corpus comparison. Such comprehensive tools can also serve in fields outside of regional variety linguistics, wherever corpora are being compared, contributing to further general linguistic research.
Item Open Access
Analysing names of organic chemical compounds: from morpho-semantics to SMILES strings and classes (2005)
Anstein, Stefanie; Kremer, Gerhard
The linguistic analysis of chemical terminology is a key to biochemical text processing and semi-automatic database curation. The system described analyses systematic and semi-systematic names of chemical compounds, class terms, and also otherwise underspecified names by means of a morpho-semantic grammar developed according to IUPAC nomenclature. It yields an intermediate semantic representation which describes the information encoded in a name. Our tool provides SMILES strings for the mapping of names to their molecule structure and also classifies the analysed terms. It was implemented in Prolog as a prototype and a basis for further development to support research in the life sciences.
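To make the idea of name analysis concrete, here is a deliberately tiny sketch that splits a name into a chain-length stem and a suffix, builds a small intermediate representation, and derives a SMILES string and a class from it. It is written in Python rather than Prolog and is not the authors' grammar: the stem and suffix tables, the function name, and the dictionary layout are all assumptions made for illustration, and only unbranched alkanes and simple alcohols are covered. An unnumbered "-anol" name is assumed to carry the hydroxy group at the end of the chain.

```python
# Hypothetical morpheme tables covering a handful of IUPAC stems and suffixes.
STEMS = {"meth": 1, "eth": 2, "prop": 3, "but": 4, "pent": 5, "hex": 6}
SUFFIXES = {
    "ane": ("alkane", ""),     # saturated chain, no functional group
    "anol": ("alcohol", "O"),  # terminal hydroxy group (assumed position 1)
}


def analyse(name):
    """Return a small intermediate representation for a simple compound name."""
    name = name.lower().strip()
    for stem, length in STEMS.items():
        if name.startswith(stem):
            suffix = name[len(stem):]
            if suffix in SUFFIXES:
                compound_class, group = SUFFIXES[suffix]
                return {
                    "chain_length": length,
                    "class": compound_class,
                    "smiles": "C" * length + group,
                }
    raise ValueError(f"name not covered by this sketch: {name!r}")


for example in ("propane", "ethanol"):
    print(example, analyse(example))
```

A grammar-based system would replace the two lookup tables with rules for locants, multipliers, substituents, and class terms; the sketch only shows the flow from name to representation to SMILES string and class.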