LIMBO: Scalable Clustering of Categorical Data

Advances in Database Technology - EDBT 2004 (2004), pp. 531-532.


lillejul's tags for this article

clustering discovery entityguides schema

Notes for this article

lillejul has 0 private notes and 1 public note for this article.
  • Input: one relational table
  • Output: Column matchings
  • Technique:
  1. Uses information-theoretic measures to define a distance between tuples and between categorical values (categorical meaning that no inherent distance measure exists).
  2. A hierarchical clustering algorithm (Agglomerative Information Bottleneck) then uses this measure to cluster tuples and/or values.
  3. The whole LIMBO clustering algorithm can be bounded in space and/or time, which makes it suitable for clustering streaming data.
lillejul (public note) - 2009-02-10 10:07:04
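
The information-theoretic distance mentioned in the note can be illustrated with a small sketch. It assumes the usual Agglomerative Information Bottleneck formulation, in which the cost of merging two clusters is their combined probability mass times the weighted Jensen-Shannon divergence of their conditional distributions. The uniform 1/m encoding of a tuple over its attribute values, the function names, and the sample rows are illustrative assumptions, not taken from this page.

```python
import math

def tuple_distribution(row):
    """Represent a categorical tuple as a distribution over its
    (attribute index, value) pairs. Assumption: each of the m
    attributes contributes probability 1/m."""
    m = len(row)
    return {(i, v): 1.0 / m for i, v in enumerate(row)}

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q) in bits.
    q must be positive wherever p is positive."""
    return sum(pv * math.log2(pv / q[k]) for k, pv in p.items() if pv > 0)

def merge_cost(w1, p1, w2, p2):
    """Information loss from merging two clusters with masses w1, w2
    and conditional distributions p1, p2:
    (w1 + w2) * weighted Jensen-Shannon divergence."""
    total = w1 + w2
    a, b = w1 / total, w2 / total
    keys = set(p1) | set(p2)
    mix = {k: a * p1.get(k, 0.0) + b * p2.get(k, 0.0) for k in keys}
    js = a * kl(p1, mix) + b * kl(p2, mix)
    return total * js

# Hypothetical toy table with three categorical attributes.
rows = [
    ("red", "small", "round"),
    ("red", "small", "square"),
    ("blue", "large", "square"),
]
n = len(rows)
dists = [tuple_distribution(r) for r in rows]

# Each tuple starts as its own cluster with mass 1/n.
d01 = merge_cost(1 / n, dists[0], 1 / n, dists[1])  # two shared values
d02 = merge_cost(1 / n, dists[0], 1 / n, dists[2])  # no shared values
print(d01, d02)
```

Under these assumptions the first two rows, which agree on two of three attributes, are closer (about 0.222 bits) than rows with no value in common (about 0.667 bits), even though the values themselves carry no numeric distance.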

Abstract

Clustering is a problem of great practical importance in numerous applications. The problem of clustering becomes more challenging when the data is categorical, that is, when there is no inherent distance measure between data values. We introduce LIMBO, a scalable hierarchical categorical clustering algorithm that builds on the Information Bottleneck (IB) framework for quantifying the relevant information preserved when clustering. As a hierarchical algorithm, LIMBO has the advantage that it can produce clusterings of different sizes in a single execution. We use the IB framework to define a distance measure for categorical tuples and we also present a novel distance measure for categorical attribute values. We show how the LIMBO algorithm can be used to cluster both tuples and values. LIMBO handles large data sets by producing a memory bounded summary model for the data. We present an experimental evaluation of LIMBO, and we study how clustering quality compares to other categorical clustering algorithms. LIMBO supports a trade-off between efficiency (in terms of space and time) and quality. We quantify this trade-off and demonstrate that LIMBO allows for substantial improvements in efficiency with negligible decrease in quality.
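
The claim that a hierarchical algorithm yields clusterings of different sizes in a single execution can be sketched as follows. This is a plain greedy agglomerative loop over a tiny in-memory table, assuming the standard Information Bottleneck merge cost (combined mass times weighted Jensen-Shannon divergence) and a uniform 1/m encoding of each tuple. It does not reproduce LIMBO's memory-bounded summary model, and all names and data are hypothetical.

```python
import math

def js_merge_cost(w1, p1, w2, p2):
    """Information loss of merging two clusters (masses w1, w2;
    conditional distributions p1, p2), measured in bits."""
    total = w1 + w2
    a, b = w1 / total, w2 / total
    cost = 0.0
    for k in set(p1) | set(p2):
        x, y = p1.get(k, 0.0), p2.get(k, 0.0)
        m = a * x + b * y
        if x > 0:
            cost += a * x * math.log2(x / m)
        if y > 0:
            cost += b * y * math.log2(y / m)
    return total * cost

def agglomerate(rows):
    """Greedily merge the cheapest pair until one cluster remains.
    Returns one clustering per cluster count: n, n-1, ..., 1."""
    n = len(rows)
    # Each cluster is (member row indices, mass, conditional distribution).
    clusters = [
        ([i], 1.0 / n, {(j, v): 1.0 / len(r) for j, v in enumerate(r)})
        for i, r in enumerate(rows)
    ]
    history = [sorted(sorted(c[0]) for c in clusters)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = js_merge_cost(clusters[i][1], clusters[i][2],
                                     clusters[j][1], clusters[j][2])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        (m1, w1, p1), (m2, w2, p2) = clusters[i], clusters[j]
        w = w1 + w2
        # The merged distribution is the mass-weighted average of the two.
        merged = {k: (w1 * p1.get(k, 0.0) + w2 * p2.get(k, 0.0)) / w
                  for k in set(p1) | set(p2)}
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append((m1 + m2, w, merged))
        history.append(sorted(sorted(c[0]) for c in clusters))
    return history

# Hypothetical toy table with three categorical attributes.
rows = [
    ("red", "small", "round"),
    ("red", "small", "square"),
    ("blue", "large", "square"),
    ("blue", "large", "round"),
]
history = agglomerate(rows)
for level in history:
    print(len(level), level)
```

One run records a clustering at every size from four clusters down to one; with this toy data the two-cluster level groups the "red, small" rows together and the "blue, large" rows together. The exhaustive pairwise search is quadratic per merge, which is the kind of cost a bounded summary is meant to avoid on large inputs.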

