
Automatic Meaning Discovery Using Google

(21 December 2004)

Abstract

We propose a new method to extract semantic knowledge from the world-wide-web for both supervised and unsupervised learning using the Google search engine in an unconventional manner. The approach is novel in its unrestricted problem domain, simplicity of implementation, and manifestly ontological underpinnings. We give evidence of elementary learning of the semantics of concepts, in contrast to most prior approaches. The method works as follows: The world-wide-web is the largest database on earth, and it induces a probability mass function, the Google distribution, via page counts for combinations of search queries. This distribution allows us to tap the latent semantic knowledge on the web. Shannon's coding theorem is used to establish a code-length associated with each search query. Viewing this mapping as a data compressor, we connect to earlier work on Normalized Compression Distance. We give applications in (i) unsupervised hierarchical clustering, demonstrating the ability to distinguish between colors and numbers, and to distinguish between 17th century Dutch painters; (ii) supervised concept-learning by example, using Support Vector Machines, demonstrating the ability to understand electrical terms, religious terms, emergency incidents, and by conducting a massive experiment in understanding WordNet categories; and (iii) matching of meaning, in an example of automatic English-Spanish translation.
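The abstract describes turning page counts into code lengths and then into a distance, by analogy with Normalized Compression Distance. As a minimal sketch of that idea, the Python below computes a Shannon-style code length from a page count and the normalized distance usually called the Normalized Google Distance. The formula is not stated in the abstract, so treat it as an assumption taken from the general form of the method. The function names (`code_length`, `ngd`), the page counts, and the index size `n` are hypothetical values chosen for illustration, not real search results.

```python
import math


def code_length(count: int, n: int) -> float:
    """Shannon code length, in bits, of an event with probability count / n.

    `n` is an assumed normalizing constant (roughly, the number of indexed pages).
    """
    if count <= 0:
        return math.inf
    return -math.log2(count / n)


def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    """Normalized distance between two search terms, computed from page counts.

    fx, fy -- page counts for each term alone
    fxy    -- page count for both terms together
    n      -- assumed normalizing constant, larger than any single count
    """
    if min(fx, fy, fxy) <= 0:
        # Terms that never co-occur are treated as infinitely far apart.
        return math.inf
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


if __name__ == "__main__":
    # Hypothetical page counts, for illustration only.
    n = 8_000_000_000
    print(code_length(1_000_000, n))
    print(ngd(fx=1_000_000, fy=2_000_000, fxy=500_000, n=n))
```

Terms that always appear together get a distance of 0, and the distance grows as co-occurrence becomes rarer relative to the individual counts. A matrix of such pairwise distances is the kind of input that hierarchical clustering or a Support Vector Machine could consume.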

View the full article here:

arXiv (abstract), arXiv (PDF)

This article has been bookmarked 31 times, initially on 2004-12-27.

2008-07-09 User kolenchery
2008-03-20 User ctl
2008-03-08 User sugarexpletive
2008-01-22 User anton-tayanovskyy
2007-12-13 User scorreia_pro
2007-12-11 User votis
2007-05-29 User AlisonBabeu
2007-02-02 User ngdelamo
Group NETS-UAM
2007-01-18 User simons
User stavros
2006-02-22 User RobotAdam
2006-02-06 User joelh
2006-01-11 User umbra
2006-01-03 User mcphee
2005-03-27 User ingo
User macartisan
2005-02-21 User samth
Group NU-PRL
2005-02-14 User mercutio
2005-02-12 User korakot
Group Philosophy_of_Information
Group Blog_and_Wiki_Research
User ranford
2005-01-28 User parmentierf, 1 note

This is close to how ECTOR learns...

2005-01-28 15:28:05
2004-12-28 User mortimer
Group UoY-CS-AIG
Group AI
User A_Olympia
User sourada
2004-12-27 User mafwood