05 Fakultät Informatik, Elektrotechnik und Informationstechnik

Permanent URI for this collection: https://elib.uni-stuttgart.de/handle/11682/6

Search Results

Now showing 1 - 4 of 4
  • Item (Open Access)
    Analysing names of organic chemical compounds : from morpho-semantics to SMILES strings and classes
    (2005) Anstein, Stefanie; Kremer, Gerhard
    The linguistic analysis of chemical terminology is a key to biochemical text processing and semi-automatic database curation. The system described here analyses systematic and semi-systematic names of chemical compounds, class terms, and otherwise underspecified names by means of a morpho-semantic grammar developed according to IUPAC nomenclature. It yields an intermediate semantic representation which describes the information encoded in a name. Our tool provides SMILES strings that map names to their molecular structures, and it also classifies the analysed terms. It was implemented in Prolog as a prototype and as a basis for further development to support research in the life sciences.
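
As a rough illustration of mapping name morphemes to a structure and a class, the following Python sketch handles only unbranched hydrocarbons with one to four carbon atoms. It is not the authors' Prolog grammar. The stem and suffix tables, the function name `analyse_name`, and the rule that a multiple bond sits at position 1 when no locant is given are all hypothetical simplifications.

```python
# Toy morpho-semantic analysis of simple hydrocarbon names (illustrative only).
# Assumptions: unbranched chain, at most one multiple bond, placed at position 1.

STEMS = {"meth": 1, "eth": 2, "prop": 3, "but": 4}  # stem -> chain length
SUFFIXES = {                                        # suffix -> (bond symbol, class term)
    "ane": ("", "alkane"),
    "ene": ("=", "alkene"),
    "yne": ("#", "alkyne"),
}


def analyse_name(name):
    """Return (SMILES string, class term) for a supported name, else None."""
    name = name.lower().strip()
    for suffix, (bond, chem_class) in SUFFIXES.items():
        if not name.endswith(suffix):
            continue
        length = STEMS.get(name[: -len(suffix)])
        if length is None:
            return None  # unknown stem
        if bond and length < 2:
            return None  # a multiple bond needs at least two carbon atoms
        if bond:
            smiles = "C" + bond + "C" * (length - 1)
        else:
            smiles = "C" * length
        return smiles, chem_class
    return None  # unknown suffix


if __name__ == "__main__":
    for term in ["methane", "propane", "propene", "ethyne"]:
        print(term, analyse_name(term))
```

A real grammar would also need locants, branching, functional groups and trivial names, none of which this sketch attempts.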
  • Item (Open Access)
    Sprachwissenschaft: Eine Wissenschaft? [Linguistics: A Science?]
    (2004) Kamp, Hans
    Linguistics belongs to the "Sciences de l'Homme", the sciences concerned with our own species, Homo sapiens, and in particular with those aspects that mark humans out, among all living beings, as the unique creatures we take ourselves to be. These aspects always have something to do with the human "mind", or with the cognitive endowment of humans, if one prefers that formulation, and with the use humans make of this endowment. This holds in particular for the subject matter of linguistics: human languages and our ability to acquire and use them. Linguistics is not the only science that has language as its object. In many other "Sciences de l'Homme", language likewise plays an important or even central role. In some, such as literary studies, the concern is exclusively with language, just as it is in linguistics.
  • Item (Open Access)
    The statistics of word cooccurrences : word pairs and collocations
    (2005) Evert, Stefan; Rohrer, Christian (Prof. Dr.)
    "You shall know a word by the company it keeps!" With this slogan, J. R. Firth drew attention to a fact that language scholars had intuitively known for a long time: In natural language, words are not combined randomly into phrases and sentences, constrained only by the rules of syntax. They have a tendency to appear in certain recurrent combinations. As there are many possible reasons for words to go together, a broad range of linguistic and extra-linguistic phenomena can be found among the recurrent combinations, making them a goldmine of information for linguistics, natural language processing and related fields. There are compound nouns ("black box"), fixed and opaque idioms ("kick the bucket"), lexical selection ("a pride of lions", "heavy smoker") and formulaic expressions ("have a nice day"). They can often tell us something about the meaning of a word or even the concept behind the word (think of combinations like "dark night" and "bright day"), an idea that has inspired latent semantic analysis and similar vector space models of word meaning. With modern computers it is easy to extract evidence for recurrent word pairs from huge text corpora, often aided by linguistic pre-processing and annotation (so that specific combinations, e.g. noun+verb can be targeted). However, the raw data - in the form of frequency counts for word pairs – are not always meaningful as a measure for the amount of "glue" between two words. Provided that both words are sufficiently frequent, their cooccurrences might be pure coincidence. Therefore, a statistical interpretation of the frequency data is necessary, which determines the degree of statistical association between the words and whether there is enough evidence to rule out chance as a factor. For this purpose, association measures are applied, which assign a score to each word pair based on the observed frequency data. The higher this score is, the stronger and more certain the association between the two words. 
Even forty years ago, at the Symposium on Statistical Association Methods for Mechanized Documentation, there was a bewildering multitude of measures to choose from, but hardly any guidelines to help with the decision. This situation has not changed very much over the last forty years. We are still far away from a thorough understanding of association measures, and there is not even a standard reference where one could look up precise definitions and related information. My thesis aims to fill this gap. The first, encyclopedic part of the thesis begins with a description of the formal and statistical prerequisites. Intended primarily as a reference for students and researchers, it also addresses the limits of the statistical models. The following chapter presents a comprehensive repository of association measures, which are organised into thematic groups. An explicit equation is given for each measure, using a consistent notation in terms of observed and expected frequencies. The second, methodological part suggests new approaches to the study of association measures, with an emphasis on empirical results and intuitive understanding. A cornerstone of this approach is a geometric interpretation of cooccurrence data and association measures. Measures are visualised as surfaces in a three-dimensional "coordinate space". The properties of each measure are determined by the geometric shapes of the respective surfaces. Empirical results are obtained from evaluation studies, which test the performance of association measures in a collocation extraction task. In addition to its relevance for real-life applications, a carefully designed evaluation can reveal important properties of the association measures. Unfortunately, it is becoming clear that the evaluation results cannot easily be generalised. For this reason it is desirable to carry out more evaluation experiments under different conditions.
In order to reduce the necessary amount of manual work, evaluation can be performed on random samples from a set of candidates. Appropriate significance tests correct for the higher degree of uncertainty. Finally, there is a third, computational aspect to the thesis. It is accompanied by an open-source software toolkit, which was used to perform experiments and produce graphs for the thesis. The unique feature of this software toolkit is that the current release includes all the data, scripts and explanations needed to replicate (almost) all the results found in the book.
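
To make the idea of scoring a word pair from observed and expected frequencies concrete, here is a minimal Python sketch. It is not taken from the thesis or its toolkit. It assumes a 2x2 contingency table with cells `o11`, `o12`, `o21`, `o22` (pair seen together, first word only, second word only, neither), derives the expected counts under independence, and computes two widely used measures, pointwise mutual information and the log-likelihood ratio. The function names and the example counts are hypothetical.

```python
import math


def expected_frequencies(o11, o12, o21, o22):
    """Expected cell counts of a 2x2 contingency table under independence."""
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22  # row totals
    c1, c2 = o11 + o21, o12 + o22  # column totals
    return (r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n)


def pointwise_mi(o11, o12, o21, o22):
    """log2 of observed over expected cooccurrence frequency."""
    if o11 == 0:
        return float("-inf")  # the pair never cooccurs
    e11 = expected_frequencies(o11, o12, o21, o22)[0]
    return math.log2(o11 / e11)


def log_likelihood(o11, o12, o21, o22):
    """Log-likelihood ratio statistic: 2 * sum of O * ln(O / E) over all cells."""
    observed = (o11, o12, o21, o22)
    expected = expected_frequencies(*observed)
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


if __name__ == "__main__":
    table = (30, 10, 10, 950)  # invented counts for one word pair
    print("expected:", expected_frequencies(*table))
    print("MI:", pointwise_mi(*table))
    print("G2:", log_likelihood(*table))
```

For a table whose observed counts equal the expected ones, both scores are zero. The more a pair cooccurs beyond what independence predicts, the higher both scores become.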
  • Item (Open Access)
    Untersuchung der Sprecherindividualität höherer Formanten [An Investigation of the Speaker Individuality of Higher Formants]
    (2004) Kremer, Gerhard
    The claim that higher formants are speaker-specific seems reasonable. Nevertheless, I could not find any studies that substantiate it. In the present work I investigated the individuality of the formants F4-F7 of the 16 speakers in the German "Kiel-Korpus der spontanen gesprochenen Sprache" (Kiel Corpus of Spontaneous Speech). On the basis of an LPC analysis, I used software tools from Entropic to generate the formant frequency data for all speech signal files. With a Perl script I extracted the data for the vowels [a], [i], [u] and the nasals [m], [n] and [ŋ]. I analysed the data with the statistical procedures t-test and ANOVA, using the statistics program R. The distributions of the formant frequency data differed highly significantly for all seven formants, depending on the factors speaker, speech sound, and the speaker-sound combination.
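
The study itself used R. The following Python sketch only illustrates the kind of statistic a one-way ANOVA with speaker as the single factor computes. The function name `one_way_anova_f`, the speaker labels and the formant values are invented for illustration and are not taken from the Kiel Corpus.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples.

    Each inner list holds the measurements of one group (here: one speaker).
    Assumes at least two groups and some variation within the groups.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the measurements around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    # Hypothetical F4 frequencies in Hz for three speakers.
    f4_by_speaker = {
        "speaker_a": [3450.0, 3480.0, 3465.0, 3472.0],
        "speaker_b": [3610.0, 3595.0, 3630.0, 3602.0],
        "speaker_c": [3520.0, 3541.0, 3508.0, 3533.0],
    }
    f_stat, df_b, df_w = one_way_anova_f(list(f4_by_speaker.values()))
    print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F value means that the differences between speakers are large relative to the variation within each speaker. Turning it into a p-value requires the F distribution, which statistics packages such as R provide.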