
Brute Force and Indexed Approaches to Pairwise Document Similarity Comparisons with MapReduce

In ACM SIGIR (2009)

Abstract

This paper explores the problem of computing pairwise similarity on document collections, focusing on the application of "more like this" queries in the life sciences domain. Three MapReduce algorithms are introduced: one based on brute force, a second where the problem is treated as large-scale ad hoc retrieval, and a third based on the Cartesian product of postings lists. Each algorithm supports one or more approximations that trade effectiveness for efficiency, the characteristics of which are studied experimentally. Results show that the brute force algorithm is the most efficient of the three when exact similarity is desired. However, the other two algorithms support approximations that yield large efficiency gains without significant loss of effectiveness.
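
To make the contrast between the brute force and postings-based approaches concrete, here is a minimal in-memory sketch in Python. It is not the authors' implementation and does not run on a MapReduce cluster. The abstract does not state the similarity function, so the sketch assumes a plain dot product of raw term frequencies. All function names (`build_postings`, `map_postings`, `reduce_scores`, `brute_force`) and the sample documents are hypothetical. The map phase emits each unordered pair of documents that share a term, which is a deduplicated form of the Cartesian product of a postings list with itself.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_postings(docs):
    """Build an inverted index: term -> list of (doc_id, term_frequency)."""
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            postings[term].append((doc_id, tf))
    return postings


def map_postings(postings):
    """Map phase (simulated): for each term, emit a partial score for
    every unordered pair of documents in that term's postings list."""
    for plist in postings.values():
        for (d1, tf1), (d2, tf2) in combinations(sorted(plist), 2):
            yield (d1, d2), tf1 * tf2


def reduce_scores(pairs):
    """Reduce phase (simulated): sum the partial scores per document pair."""
    scores = defaultdict(int)
    for key, partial in pairs:
        scores[key] += partial
    return dict(scores)


def brute_force(docs):
    """Baseline: compare every document pair directly.
    Assumption: similarity is the dot product of raw term frequencies."""
    vectors = {d: Counter(t.lower().split()) for d, t in docs.items()}
    scores = {}
    for d1, d2 in combinations(sorted(vectors), 2):
        score = sum(tf * vectors[d2][term] for term, tf in vectors[d1].items())
        if score:
            scores[(d1, d2)] = score
    return scores


# Hypothetical toy collection, for illustration only.
docs = {
    "d1": "gene protein binding",
    "d2": "protein binding site protein",
    "d3": "cell membrane",
}
indexed = reduce_scores(map_postings(build_postings(docs)))
print(indexed)  # {('d1', 'd2'): 3}
```

Under these assumptions both routes produce the same exact scores. The postings-based route only touches pairs that share at least one term, whereas the brute force route visits every pair.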