Democratizing clustering analyses: AutoML, meta-learning, and ensemble clustering to support novice analysts

Date

2024

Abstract

Data analysis is a crucial discipline for gaining insights from data. A fundamental primitive is clustering analysis, which groups instances such that each group contains similar instances (homogeneity) and the groups are as dissimilar as possible (heterogeneity). Clustering is used in many different application domains, and across these manifold domains, analysts struggle to select an appropriate clustering configuration, i.e., a clustering algorithm together with its hyperparameters. To find a clustering configuration, analysts have to perform multiple exploration cycles, i.e., select, execute, and evaluate various configurations. This process is time-consuming, and its success depends on the domain expertise of the analyst. Novice analysts, in particular, lack the required domain knowledge and experience. They do not know how to explore configurations efficiently (Challenge C1), cannot benefit from experience, e.g., to select a cluster validity index for evaluating the results (Challenge C2), and face high runtimes when executing complex clustering algorithms with quadratic or higher runtime complexity, e.g., density-based clustering (Challenge C3). Thus, they have to execute many different clustering configurations with potentially high runtimes and can still end up with poor results. In contrast, experienced analysts have profound knowledge about the data characteristics in their domain and therefore have to execute only a few configurations to achieve accurate results. This thesis proposes the following approaches to support novice analysts in achieving results similar to those of more experienced analysts:

(1) AutoML4Clust: Recent works in the area of AutoML can explore large configuration spaces efficiently using various optimization techniques. However, these approaches have only been applied to supervised learning tasks, and transferring such concepts to clustering is challenging due to its unsupervised nature. In particular, it is not clear how to use such optimizers for clustering and which cluster validity index is suitable as the optimization objective. Therefore, to address Challenge C1, we propose AutoML4Clust, an efficient AutoML system for clustering. We apply optimizers from existing AutoML systems to the unsupervised task of clustering. AutoML4Clust is generic, as it supports four different optimizers and three different cluster validity indices. Our evaluation reveals promising combinations of optimizers and cluster validity indices. Furthermore, the choice of the cluster validity index is more crucial than the choice of the optimizer, and it depends on the data characteristics.
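To make the core idea concrete, the following is a minimal, purely illustrative sketch of such an AutoML-style loop for clustering, not the AutoML4Clust implementation: a simple random-search optimizer samples configurations (algorithm plus hyperparameters) from a small, assumed search space and uses the silhouette score as an example cluster validity index. The search space, the algorithms, and the exploration budget are assumptions made for this example.

```python
# Minimal sketch of an AutoML-style loop for clustering (illustrative only):
# an optimizer samples configurations (algorithm + hyperparameters) and keeps
# the one with the best cluster validity index (here: silhouette score).
import random

from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Assumed configuration space: clustering algorithm plus its hyperparameters.
SEARCH_SPACE = [
    ("kmeans", {"n_clusters": list(range(2, 11))}),
    ("agglomerative", {"n_clusters": list(range(2, 11)), "linkage": ["ward", "average"]}),
]

def sample_configuration(rng):
    """Randomly pick an algorithm and one value per hyperparameter."""
    name, grid = rng.choice(SEARCH_SPACE)
    return name, {key: rng.choice(values) for key, values in grid.items()}

def run_configuration(name, params, data):
    """Execute the sampled configuration and return its cluster labels."""
    if name == "kmeans":
        return KMeans(n_init=10, random_state=0, **params).fit_predict(data)
    return AgglomerativeClustering(**params).fit_predict(data)

rng = random.Random(0)
best = None
for _ in range(20):  # exploration budget of the optimizer
    name, params = sample_configuration(rng)
    labels = run_configuration(name, params, X)
    score = silhouette_score(X, labels)  # cluster validity index as objective
    if best is None or score > best[0]:
        best = (score, name, params)

print("best configuration found:", best)
```

In AutoML4Clust itself, both roles are configurable: the random sampler stands in for one of the supported optimizers, and the silhouette score for one of the supported cluster validity indices; the concrete choices above are only placeholders.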
(2) ML2DAC: While AutoML4Clust addresses efficient exploration, the analyst still has to specify the cluster validity index and the search space of clustering algorithms. In contrast to more experienced analysts, novice analysts have no experience in specifying the cluster validity index or narrowing down the search space. We therefore propose the meta-learning approach ML2DAC to address Challenge C2, i.e., enabling novice analysts to benefit from external experience. ML2DAC extends AutoML4Clust with a novel meta-learning methodology that learns from previously evaluated clustering analyses and automatically applies this experience to new datasets. To this end, ML2DAC requires a preceding learning phase to build a meta-knowledge repository. This learning phase is executed once; the resulting repository can then be applied to any new dataset. When applying our approach to a new dataset, we use the meta-knowledge repository to (i) narrow down the search space to a few algorithms, (ii) recommend well-performing configurations, and (iii) select a suitable cluster validity index depending on the data characteristics (a simplified sketch of this recommendation idea is given after the abstract). Based on these inputs, we apply an optimizer from AutoML4Clust to explore the configuration space efficiently (Challenge C1). Our comprehensive evaluation shows that we achieve more accurate results than state-of-the-art baselines. Further, we can build the meta-knowledge repository with synthetic data and still achieve accurate results on real-world data, which makes it easier to acquire training data for our learning phase. However, ML2DAC may still have to execute complex clustering algorithms with high runtime complexity, such as density-based algorithms, to detect complex data characteristics (e.g., Half-Moons).

(3) EffEns: To address the high runtime of complex clustering algorithms (Challenge C3), we propose an efficient ensemble clustering approach based on meta-learning. We rely on simple and efficient clustering algorithms, such as k-Means, to generate the ensemble and combine the resulting base clusterings using efficient consensus functions (this ensemble idea is also sketched after the abstract). We use meta-learning to decide which base clusterings should be included in the ensemble and which consensus function is suitable, based on the data characteristics. Thus, similar to ML2DAC, novice analysts can benefit from experience (Challenge C2). Further, we use an optimizer to tune the hyperparameters of the selected consensus function (Challenge C1). Our evaluation shows that EffEns is much faster and scales better to large datasets than existing approaches that use complex clustering algorithms, while it can still detect complex data characteristics such as Half-Moons.

We show that each approach (AutoML4Clust, ML2DAC, and EffEns) achieves more accurate results in less time than state-of-the-art baselines. In a comprehensive overall evaluation, we compare our novel approaches theoretically and on real-world benchmark data. The results reveal that EffEns is the only approach that addresses all three challenges, and it achieves the most accurate and fastest results compared to AutoML4Clust and ML2DAC.

Clustering is also often used in real-world use cases for classification tasks, i.e., the data is first grouped into clusters, and then separate, more specialized classification models are built on each cluster. However, it is not clear which clustering configuration should be used to achieve valuable results, as different algorithms and hyperparameters are used in different use cases. Therefore, we investigate whether our novel approaches are able to find valuable clustering results that improve classification accuracy for data with complex classification-specific characteristics, namely multi-class imbalance and heterogeneous groups. As data with such characteristics is typically not available, we propose a synthetic data generator that can produce data with different manifestations of these characteristics and that builds the foundation for our evaluation. Our evaluation shows that our novel approaches, in particular EffEns, can significantly improve classification accuracy, even for less frequent classes.

In conclusion, the novel approaches developed in this thesis support novice and even more experienced analysts in achieving accurate clustering results within a short time frame.
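The sketch below illustrates, in deliberately simplified form, the general meta-learning idea behind ML2DAC rather than its actual methodology: a few simple meta-features describe each dataset, a small meta-knowledge repository stores a configuration and cluster validity index for previously evaluated datasets, and a new dataset receives the recommendation of its most similar repository entry. The meta-features, repository entries, and helper names are assumptions made for this example.

```python
# Illustrative sketch of the general meta-learning idea (not the actual ML2DAC
# methodology): recommend a clustering configuration and a cluster validity
# index for a new dataset based on its similarity to known datasets.
import numpy as np

from sklearn.datasets import make_blobs, make_moons

def meta_features(X):
    """A few simple, illustrative meta-features describing a dataset."""
    return np.array([
        np.log10(len(X)),                   # dataset size
        float(X.shape[1]),                  # dimensionality
        float(np.mean(np.std(X, axis=0))),  # average per-feature spread
    ])

# Meta-knowledge repository built once in an offline learning phase.
# The entries below are made up purely for illustration.
repository = [
    {"meta": meta_features(make_blobs(n_samples=500, centers=4, random_state=0)[0]),
     "config": ("kmeans", {"n_clusters": 4}), "cvi": "calinski_harabasz"},
    {"meta": meta_features(make_moons(n_samples=500, noise=0.05, random_state=0)[0]),
     "config": ("spectral", {"n_clusters": 2}), "cvi": "davies_bouldin"},
]

def recommend(X_new):
    """Return the repository entry of the most similar known dataset."""
    mf = meta_features(X_new)
    distances = [np.linalg.norm(mf - entry["meta"]) for entry in repository]
    return repository[int(np.argmin(distances))]

X_new, _ = make_moons(n_samples=800, noise=0.1, random_state=1)
entry = recommend(X_new)
print("recommended configuration:", entry["config"], "with CVI:", entry["cvi"])
```

In ML2DAC, the recommended inputs would then be handed to an optimizer from AutoML4Clust to explore the narrowed configuration space.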

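Similarly, the following is a minimal, purely illustrative sketch of the general ensemble-clustering idea behind EffEns, not the EffEns implementation: several cheap k-Means runs form the ensemble, and a simple co-association consensus function combines them into a final clustering. The ensemble composition, the consensus function, and the example data are assumptions made for this sketch.

```python
# Illustrative sketch of ensemble clustering (not the actual EffEns approach):
# cheap k-Means base clusterings are combined by a co-association consensus
# function, which can still recover complex shapes such as Half-Moons.
import numpy as np

from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)
n = len(X)

# 1) Generate base clusterings with an efficient algorithm (k-Means, varying k).
base_labelings = [
    KMeans(n_clusters=k, n_init=10, random_state=k).fit_predict(X) for k in range(2, 8)
]

# 2) Consensus function: the co-association matrix counts how often two points
#    are placed in the same base cluster.
co_assoc = np.zeros((n, n))
for labels in base_labelings:
    co_assoc += (labels[:, None] == labels[None, :]).astype(float)
co_assoc /= len(base_labelings)

# 3) Derive the final clustering from the co-association matrix, using
#    average-linkage agglomerative clustering on 1 - co_assoc as a distance
#    (older scikit-learn versions name the "metric" parameter "affinity").
consensus = AgglomerativeClustering(
    n_clusters=2, metric="precomputed", linkage="average"
).fit_predict(1.0 - co_assoc)

print("consensus cluster sizes:", np.bincount(consensus))
```

In EffEns, meta-learning selects which base clusterings and which consensus function to use, and an optimizer tunes the consensus function's hyperparameters; here both are fixed by hand for illustration.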