Browsing by Author "Schuff, Hendrik"
Now showing 1 - 4 of 4
Item (Open Access)
Explainable question answering beyond F1: metrics, models and human evaluation (2020)
Schuff, Hendrik

Explainable question answering systems not only predict an answer but also provide an explanation of why it was selected. Current work predominantly focuses on the evaluation and development of new models around established metrics such as F1. We argue that this constitutes a distorted incentive and limits the exploration of explainability methods, as the ultimate measure of performance should not be F1 but the value that the system adds for a human. In this thesis, we analyze two baseline models trained on the HotpotQA data set, which provides explanations in the form of a selection of supporting facts from Wikipedia articles. We identify two weaknesses: (i) the models predict facts to be irrelevant but still include them in their answer, and (ii) the models do not use facts for answering the question although they report them to be relevant. Based on these shortcomings, we propose two methods to quantify how strongly a system's answer is coupled to its explanation, based on (i) how robust the system's answer prediction is against the removal of facts it predicts to be (ir)relevant and (ii) the location of the answer span. To address the identified weaknesses, we present (i) a novel neural network architecture that guarantees that no facts which are predicted to be irrelevant are used in the answer prediction, (ii) a post-hoc heuristic that reduces the number of unused facts, and (iii) a regularization term that explicitly couples the prediction of answer and explanation. We show that our methods improve performance on our proposed metrics and assess them within an online study. Even though our methods yield only slight improvements on standard metrics, they all improve various human measures such as decision correctness and certainty, supporting our claim that F1 alone is not suited to evaluate explainability. The regularized model even surpasses the ground-truth condition regarding helpfulness and certainty. We analyze how strongly different metrics are linked to human measures and find that our metrics outperform all evaluated standard metrics, suggesting they provide a valuable addition to automated explainability evaluation.
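The fact-removal robustness idea described in the abstract above can be illustrated with a minimal sketch: check how stable the system's answer is when the facts it itself labels as (ir)relevant are withheld. The `qa_system` interface with `predict_answer` and `predict_supporting_facts` is a hypothetical placeholder chosen for illustration, not the thesis implementation.

```python
# Minimal sketch (not the thesis code) of an answer-explanation coupling
# check: does the answer survive the removal of facts the system itself
# labels as irrelevant, and does it change when the self-reported
# supporting facts are withheld? The qa_system interface is hypothetical.

def removal_robustness(qa_system, question, facts, drop_irrelevant=True):
    """Return 1.0 if the answer is unchanged after the removal, else 0.0."""
    answer_full = qa_system.predict_answer(question, facts)
    relevant = set(qa_system.predict_supporting_facts(question, facts))

    if drop_irrelevant:
        # keep only the facts the system claims to rely on
        kept = [f for f in facts if f in relevant]
    else:
        # drop exactly the facts the system claims to rely on
        kept = [f for f in facts if f not in relevant]

    answer_reduced = qa_system.predict_answer(question, kept)
    return float(answer_full == answer_reduced)
```

Averaged over a data set, a well-coupled system should score close to 1 when only its self-declared irrelevant facts are removed (drop_irrelevant=True) and close to 0 when its self-reported supporting facts are withheld (drop_irrelevant=False).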
Item (Open Access)
How to do human evaluation: a brief introduction to user studies in NLP (2023)
Schuff, Hendrik; Vanderlyn, Lindsey; Adel, Heike; Vu, Ngoc Thang

Item (Open Access)
Human-centered explainable artificial intelligence for natural language processing (2024)
Schuff, Hendrik; Vu, Ngoc Thang (Prof. Dr.)

With the ongoing advances in artificial intelligence (AI) systems, their influence on our private, professional, and public lives is expanding. While these systems' prediction performance increases, they often rely on opaque system architectures that hide the reasons for the systems' decisions. The field of explainable AI thus seeks to answer why a system returns its prediction. In this thesis, we explore explanatory methods for natural language processing (NLP) systems. Instead of focusing on the technical aspects of explainability in isolation, we take a human-centered approach and additionally explore users' perception of and their interaction with explainable NLP systems. Our contributions thus range on a spectrum from technology-centered machine learning contributions to human-centered studies of cognitive biases.

On the technical end of the spectrum, we first contribute novel approaches to integrate external knowledge into explainable natural language inference (NLI) systems and study the effect of different sources of external knowledge on fine-grained model reasoning capabilities. We compare automatic evaluation with user-perceived system quality and find an equally surprising and alarming disconnect between the two. Second, we present a novel self-correction paradigm inspired by Hegel's dialectics. We apply the resulting thought flow network method to question answering (QA) systems and demonstrate its ability to self-correct model predictions, which increases prediction performance; we additionally find that the corresponding decision-sequence explanations enable significant improvements in the users' interaction with the system and enhance user-perceived system quality.

Our architectural and algorithmic contributions are followed by an in-depth investigation of explanation quality quantification. We first focus on explainable QA systems and find that the currently used proxy scores fail to capture to what extent an explanation is relevant to the system's answer. We thus propose two novel model-agnostic scores, FaRM and LocA, which quantify a system's internal explanation-answer coupling following two complementary approaches. Second, we consider general explanation quality and discuss its characteristics and how they are violated by current evaluation practices, using a popular explainable QA leaderboard as an example. We provide guidelines for explanation quality evaluation and propose our novel "Pareto Front leaderboard" method to construct system rankings and overcome challenges in explanation quality evaluation.

In the last part of the thesis, we focus on human perception of explanations. We first investigate how users interpret the frequently used heatmap explanations over text. We find that the information communicated by the explanations differs from the information understood by the users. In a series of studies, we discover distorting effects of various types of biases and demonstrate that cognitive biases, learning effects, and linguistic properties can distort users' interpretation of explanations. We question the use of heatmap visualizations and propose alternative visualization methods. Second, we develop, validate, and apply a novel questionnaire to measure perceived system predictability. Concretely, we contribute the novel perceived system predictability (PSP) scale, demonstrate its desirable psychometric properties, and use it to uncover a dissociation of perceived and objective predictability in the context of explainable NLP systems. Overall, this thesis highlights that progress in explainable NLP cannot rely on technical advances in isolation, but needs to simultaneously involve the recipients of explanations, including their requirements, perception, and cognition.
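The "Pareto Front leaderboard" mentioned in the abstract above can be sketched generically: instead of collapsing several quality dimensions into one score, keep every system that no other system beats on all dimensions at once. The dimensions and scores below are made-up placeholders, not the leaderboard's actual criteria or results.

```python
# Generic sketch of a Pareto-front ranking over several quality
# dimensions (higher is better on every dimension). Dimensions and
# scores are illustrative placeholders, not thesis results.

def dominates(a, b):
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(systems):
    """Return the names of all systems that no other system dominates."""
    return [
        name for name, scores in systems.items()
        if not any(dominates(other, scores)
                   for other_name, other in systems.items()
                   if other_name != name)
    ]

# Hypothetical scores: (answer F1, explanation F1, human-rated helpfulness)
systems = {
    "system_a": (0.71, 0.58, 3.1),
    "system_b": (0.72, 0.66, 3.9),
    "system_c": (0.70, 0.69, 3.5),
}
print(pareto_front(systems))  # ['system_b', 'system_c']; system_a is dominated
```

Systems on the front are incomparable without committing to a weighting of the dimensions, which is one way to avoid forcing explanation quality into a single number; how the thesis presents or refines the front is beyond this sketch.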
Item (Open Access)
Leveraging electromyography to enhance musician-instrument interaction using domain-specific motions (2017)
Schuff, Hendrik

Manual interaction tasks, such as playing a musical instrument, require a certain amount of training until users are proficient. Electromyography (EMG) can bridge this gap and provide proficiency feedback without the need for supervision. EMG measures the electrical potential related to muscular activity and has been used in human-computer interaction (HCI) in a variety of applications.

This thesis explores the use of EMG together with domain-specific movements, such as playing guitar chords, in the context of musician-instrument interaction. This includes a review of related work, an evaluation of suitable features and machine learning methods, and the realization of an EMG guitar tutor system. The results of this thesis show that it is possible to classify guitar chords with an average F1-measure of 87%. We identified a trade-off between classifier accuracy and window size, which is an important finding for real-time interaction. Further, we evaluated the guitar tutor system in a study. The results suggest that electrodes and wires did not limit the participants in playing the guitar. An analysis of inter-person generalizability shows that dimensionality reduction methods can slightly increase classifier performance. We propose further solutions to enhance the guitar tutor system from a machine learning perspective as well as from a usability perspective. Ultimately, we discuss how our findings can be transferred to related domains.
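A toy sketch of the kind of sliding-window EMG classification pipeline implied by the abstract above: cut the multi-channel signal into windows, compute a simple amplitude feature (RMS) per channel, and train an off-the-shelf classifier. The sampling rate, window length, channel count, synthetic data, and the use of scikit-learn are assumptions for illustration, not the thesis setup.

```python
# Toy sliding-window EMG classification sketch (illustrative assumptions,
# not the thesis pipeline): RMS per channel per window, random-forest
# classifier, cross-validated macro F1.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

FS = 1000          # assumed sampling rate (Hz)
WINDOW = 200       # window length in samples (200 ms); the accuracy vs.
                   # latency trade-off lives in this parameter
N_CHANNELS = 8     # assumed number of EMG channels


def rms_features(window_signal):
    """window_signal: (samples, channels) -> one RMS value per channel."""
    return np.sqrt(np.mean(window_signal ** 2, axis=0))


def windows_to_features(signal, labels, window=WINDOW):
    """Slice a labelled recording into non-overlapping windows."""
    feats, ys = [], []
    for start in range(0, len(signal) - window + 1, window):
        feats.append(rms_features(signal[start:start + window]))
        ys.append(labels[start + window - 1])   # label at the window's end
    return np.array(feats), np.array(ys)


# Synthetic stand-in data; a real system would use recorded EMG and chord labels.
rng = np.random.default_rng(0)
signal = rng.normal(size=(60 * FS, N_CHANNELS))      # one minute of "EMG"
labels = rng.integers(0, 5, size=60 * FS)            # five hypothetical chords

X, y = windows_to_features(signal, labels)
scores = cross_val_score(RandomForestClassifier(), X, y, cv=5, scoring="f1_macro")
print(f"macro F1: {scores.mean():.2f}")              # near chance on random data
```

Shrinking WINDOW reduces recognition latency but gives each feature less signal to average over, which is one concrete way to see the accuracy/window-size trade-off reported in the abstract.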