Incremental and effective data summarization for dynamic hierarchical clustering
Mining informative patterns from very large, dynamically changing databases poses numerous interesting challenges. Data summarizations (e.g., data bubbles) have been proposed to compress very large static databases into representative points suitable for subsequent effective hierarchical cluster analysis. In many real world applications, however, the databases dynamically change due to frequent insertions and deletions, possibly changing the data distribution and clustering structure over time. Completely reapplying both the data summarization and the clustering algorithm to detect the changes in the clustering structure and update the uncovered data patterns following such deletions and insertions is prohibitively expensive for large fast changing databases. In this paper, we propose a new scheme to maintain data bubbles incrementally. By using incremental data bubbles, a high-quality hierarchical clustering is quickly available at any point in time. In our scheme, a quality measure for incremental data bubbles is used to identify data bubbles that do not compress well their underlying data points after certain insertions and deletions. Only these data bubbles are re-built using efficient split and merge operations. An extensive experimental evaluation shows that the incremental data bubbles provide significantly faster data summarization than completely re-building the data bubbles after a certain number of insertions and deletions, and are effective in preserving (and in some cases even improving) the quality of the data summarization.