
The Google Similarity Distance

(30 May 2007)

Abstract

Words and phrases acquire meaning from the way they are used in society, from their relative semantics to other words and phrases. For computers the equivalent of 'society' is 'database,' and the equivalent of 'use' is 'way to search the database.' We present a new theory of similarity between words and phrases based on information distance and Kolmogorov complexity. To fix thoughts we use the world-wide-web as database, and Google as search engine. The method is also applicable to other search engines and databases. This theory is then applied to construct a method to automatically extract similarity, the Google similarity distance, of words and phrases from the world-wide-web using Google page counts. The world-wide-web is the largest database on earth, and the context information entered by millions of independent users averages out to provide automatic semantics of useful quality. We give applications in hierarchical clustering, classification, and language translation. We give examples that distinguish between colors and numbers, cluster names of paintings by 17th-century Dutch masters and names of books by English novelists, and show the ability to understand emergencies and primes, and we demonstrate the ability to do a simple automatic English-Spanish translation. Finally, we use the WordNet database as an objective baseline against which to judge the performance of our method. We conduct a massive randomized trial in binary classification using support vector machines to learn categories based on our Google distance, resulting in a mean agreement of 87% with the expert-crafted WordNet categories.
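The abstract describes the distance only in words. As a rough illustration, the normalized form usually quoted from the full paper is NGD(x, y) = (max{log f(x), log f(y)} - log f(x, y)) / (log N - min{log f(x), log f(y)}), where f(x) and f(y) are the page counts for each term, f(x, y) is the count of pages containing both, and N is the number of pages indexed. The sketch below assumes the page counts have already been obtained by some means; the function name, the handling of zero counts, and the numbers in the usage example are hypothetical and are not taken from the abstract.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """Sketch of the normalized Google distance from raw page counts.

    fx, fy : page counts for terms x and y alone
    fxy    : page count for pages containing both x and y
    n      : total number of indexed pages (a normalizing constant)

    Returning infinity for zero counts is an assumption of this sketch.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if fx <= 0 or fy <= 0 or fxy <= 0:
        # The terms never occur (or never co-occur): treat as maximally distant.
        return math.inf
    if n <= min(fx, fy):
        raise ValueError("n must exceed the smaller single-term count")

    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    numerator = max(log_fx, log_fy) - log_fxy
    denominator = math.log(n) - min(log_fx, log_fy)
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the function.
    print(normalized_google_distance(fx=1000, fy=100, fxy=10, n=10**6))
```

Smaller values indicate terms that tend to appear on the same pages; two terms that always occur together get a distance of 0.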

View the full article here:

arXiv (abstract), arXiv (PDF)

This article has been bookmarked 31 times, initially on 2004-12-27.

2005-01-28 User parmentierf, 1 note

This is close to how ECTOR learns...

2005-01-28 15:28:05