Please use this identifier to cite or link to this item: http://dx.doi.org/10.18419/opus-11000
Title: Quality assessment in text analysis pipelines
Abstract: High-quality data and analysis results are a precondition for future concepts such as the data-driven factory of the future. The quality of business decisions is directly influenced by the quality of the underlying data and analysis results. Current data quality concepts and tools consider only the raw input data of data analysis pipelines; they fail to account for the specifics of the analysis tools and of the data at each step of the pipeline. To fill this research gap, this thesis presents QUALM, a concept for continuous and holistic data quality measurement and improvement within data analysis pipelines. In QUALM, data characteristics as well as specifics of the analysis tools, such as training data, features, and semantic resources, are considered at each step of the pipeline. Existing data quality metrics measure the quality of structured data, e.g., by counting null values, duplicates, or invalid values; equivalent approaches for textual data are missing. Additionally, most domain-specific text data sets are unlabeled, so that, in addition to data quality metrics, evaluation metrics cannot be computed for these data sets and the analysis results derived from them. This leads to high uncertainty for analysts with respect to the quality of data and analysis results. QUALM addresses this challenge with a set of concrete text data quality methods. QUALM data quality indicators quantify text characteristics and provide hints about the expected quality of analysis results. Just as existing metrics for structured data determine, e.g., the number of null values and invalid fields, the QUALM indicators characterize texts with respect to, e.g., the number of abbreviations, spelling mistakes, and ungrammatical sentences. Moreover, as demanded by the QUALM concept, these methods consider not only the raw data but also the specifics of the analysis tools.
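The kind of indicator described above can be illustrated with a minimal sketch. This is not the thesis's actual implementation; the abbreviation list, lexicon, and function names are invented for illustration, and real indicators would rely on proper tokenization and larger resources.

```python
# Hypothetical, simplified text quality indicators in the spirit of the
# QUALM indicators described in the abstract (illustrative only).

ABBREVIATIONS = {"approx.", "e.g.", "i.e.", "etc."}  # toy abbreviation lexicon
LEXICON = {"the", "machine", "stopped", "after", "approx.", "two", "hours"}  # toy vocabulary

def abbreviation_ratio(text: str) -> float:
    """Share of tokens that appear in the abbreviation lexicon."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in ABBREVIATIONS for t in tokens) / len(tokens)

def oov_ratio(text: str) -> float:
    """Share of out-of-vocabulary tokens -- a rough proxy for
    spelling mistakes and unknown jargon."""
    tokens = [t.strip(",;:!?").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    return sum(t not in LEXICON for t in tokens) / len(tokens)

text = "The machine stoped after approx. two hours"
print(round(abbreviation_ratio(text), 3))  # → 0.143
print(round(oov_ratio(text), 3))           # → 0.143 ("stoped" is a misspelling)
```

A high ratio would signal to the analyst that downstream tools, e.g., part-of-speech taggers trained on clean text, may perform poorly on this data.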
For example, QUALM includes indicators that measure the confidence of standard analysis tools or the fit of the semantic resources employed by these tools. Each indicator comes with a corresponding modifier: an indicator may, for instance, measure the amount of abbreviations or spelling mistakes, while the corresponding modifier improves the data by resolving abbreviations or correcting spelling mistakes. Moreover, the selection of appropriate training data is especially difficult for analysts such as domain experts with little IT or data science knowledge, yet this selection has a high impact on the quality of analysis results. QUALM therefore addresses this issue with a concrete method: the corresponding indicator measures data quality as the similarity between input and training data; for textual data, text similarity metrics such as Latent Semantic Analysis and cosine similarity are employed. The counterpart modifier automatically selects the best-fitting training data and thus prevents low-quality results in domain-specific analysis. Finally, QUALM includes a method that addresses data quality issues arising from information extraction approaches that consider either structured data or unstructured text in isolation. Such isolated approaches may reduce the amount of new information that can be presented to the analyst. This thesis addresses the issue with a hybrid approach that exploits both structured and unstructured information sources during information extraction. In particular, structured data enriched with unstructured free-text fields is considered, and the structured data is used to guide and improve the text analysis process.
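The training-data selection idea can be sketched as follows, assuming a simple term-frequency bag-of-words representation with cosine similarity; the thesis also mentions Latent Semantic Analysis, which is omitted here for brevity, and the corpus names and function names below are invented.

```python
from collections import Counter
from math import sqrt

# Illustrative sketch of similarity-based training data selection,
# i.e., the role of the QUALM training data modifier (assumptions only).

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity of two texts over term-frequency vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def select_training_data(input_text: str, candidates: dict) -> str:
    """Return the candidate training corpus most similar to the input data."""
    return max(candidates, key=lambda name: cosine_similarity(input_text, candidates[name]))

corpora = {
    "newswire": "the president announced new economic policy measures",
    "maintenance": "conveyor belt motor failure caused a line downtime",
}
print(select_training_data("motor overheated and stopped the conveyor", corpora))
# → maintenance
```

The same similarity score can double as an indicator: a low maximum similarity warns the analyst that no available training corpus fits the input data well.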
To this end, the structured data serves as a basis for an initial grouping of the free-text fields and for removing information from the free texts that is already present in the structured fields. The hybrid approach thus yields more new and relevant information. The QUALM concept and methods are evaluated on several industry-oriented application scenarios with corresponding concrete data sets. For example, the analysis of downtimes on a production line is considered, using a confidential industry data set comprising structured data enriched with free-text fields. Further application scenarios consider sample citizen data scientists, i.e., domain experts with little IT and data science knowledge, who want to build analysis pipelines from scratch, e.g., to learn customer opinions on a product. The evaluation results are very promising: the QUALM indicators correlate with analysis result quality, measured as accuracy, so the indicators are a valid means of indicating the expected analysis result quality to the analyst. Moreover, the investigated QUALM modifiers increase the accuracy of tools such as part-of-speech taggers and language identifiers. A qualitative discussion in the thesis shows the positive effect of QUALM on a whole chain of analysis tools, i.e., an analysis pipeline.
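The deduplication step of the hybrid approach, removing free-text information that the structured fields already provide, can be sketched as below. The record schema, field names, and matching strategy are invented for illustration; a realistic implementation would need normalization and fuzzy matching.

```python
# Hypothetical sketch of the hybrid structured/unstructured idea:
# tokens already captured in structured fields are stripped from the
# free text, so text analysis focuses on genuinely new information.

def strip_known_information(structured: dict, free_text: str) -> str:
    """Drop free-text tokens that merely repeat structured field values."""
    known = {v.lower() for v in structured.values()}
    kept = [t for t in free_text.split() if t.strip(",.").lower() not in known]
    return " ".join(kept)

record = {"machine": "press", "error_code": "E42"}
text = "press E42, bearing was worn out"
print(strip_known_information(record, text))
# → bearing was worn out
```

The remaining text ("bearing was worn out") is exactly the new information the structured fields did not contain, which the abstract argues should be surfaced to the analyst.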
Appears in Collections: 05 Fakultät Informatik, Elektrotechnik und Informationstechnik
Files in This Item:
Dissertation_Cornelia_Kiefer_QualityAssessmentText.pdf | 40,06 MB | Adobe PDF | View/Open
Items in OPUS are protected by copyright, with all rights reserved, unless otherwise indicated.