Document clustering using word clusters via the information bottleneck method

In SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval (2000), pp. 208-215.

Abstract

We present a novel implementation of the recently introduced information bottleneck method for unsupervised document clustering. Given a joint empirical distribution of words and documents, p(x, y), we first cluster the words, Y, so that the obtained word clusters, Ỹ, maximally preserve the information on the documents. The resulting joint distribution, p(X, Ỹ), contains most of the original information about the documents, I(X; Ỹ) ≈ I(X; Y), but it is much less sparse and noisy. Using the same procedure we then cluster the documents, X, so that the information about the word clusters is preserved. Thus, we first find word clusters that capture most of the mutual information about the set of documents, and then find document clusters that preserve the information about the word clusters. We tested this procedure over several document collections based on subsets taken from the standard 20 Newsgroups corpus. The results were assessed by calculating the correlation between the document clusters and the correct labels for these documents. Findings from our experiments show that this double clustering procedure, which uses the information bottleneck method, yields significantly superior performance compared to other common document distributional clustering algorithms. Moreover, the double clustering procedure improves all the distributional clustering methods examined here.
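
The quantity the abstract relies on, I(X; Ỹ) compared with I(X; Y), can be illustrated with a minimal sketch. This is not the paper's clustering algorithm: the toy joint distribution, the two word-cluster assignments, and the helper names `mutual_information` and `merge_columns` are all assumptions made for illustration. The sketch only measures how much mutual information survives when word columns are merged into given clusters.

```python
import math

def mutual_information(joint):
    """Return I(X;Y) in bits for a joint distribution (rows = x, columns = y)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:  # zero-probability cells contribute nothing
                mi += p * math.log2(p / (px[i] * py[j]))
    return mi

def merge_columns(joint, assignment, n_clusters):
    """Build p(x, y~) by summing the columns of p(x, y) assigned to each cluster."""
    merged = [[0.0] * n_clusters for _ in joint]
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            merged[i][assignment[j]] += p
    return merged

# Hypothetical toy data: 2 documents (rows) x 4 words (columns).
p_xy = [
    [0.20, 0.20, 0.05, 0.05],
    [0.05, 0.05, 0.20, 0.20],
]

# Two hypothetical word-cluster assignments (word index -> cluster index).
similar_words_together = [0, 0, 1, 1]
dissimilar_words_together = [0, 1, 0, 1]

i_full = mutual_information(p_xy)
i_good = mutual_information(merge_columns(p_xy, similar_words_together, 2))
i_bad = mutual_information(merge_columns(p_xy, dissimilar_words_together, 2))

print(f"I(X;Y)            = {i_full:.4f} bits")
print(f"I(X;Y~), good     = {i_good:.4f} bits")
print(f"I(X;Y~), bad      = {i_bad:.4f} bits")
```

In this toy case, merging words that have the same distribution over documents keeps all of the mutual information, while merging words with opposite distributions reduces it to zero. A word-clustering step in the spirit of the abstract would search for assignments of the first kind; the search procedure itself is not shown here.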

