A very good example of correlating queries with documents (query terms with document terms) using clickthrough data (logs), with a probabilistic formalization.
First, we will test the assumption that the
terms used in queries and in documents are truly very different.
This assumption has often been made, but never tested by a
quantitative measurement. Our test will show that there is indeed
a large difference between the query terms and document terms.
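
The excerpt does not say which measurement is used. As a rough illustration only, the sketch below assumes the gap is quantified as the cosine similarity between the term-count vector of a query and that of a document the user clicked for it; a low value means the two vocabularies overlap little. The function name `cosine_similarity` and the toy terms are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(terms_a, terms_b):
    """Cosine similarity between two bags of terms (raw term counts)."""
    counts_a, counts_b = Counter(terms_a), Counter(terms_b)
    # Only terms present in both bags contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy data (invented for illustration, not taken from any real log).
query_terms = ["cheap", "flights"]
clicked_doc_terms = ["airline", "ticket", "fares", "flights", "airline"]

similarity = cosine_similarity(query_terms, clicked_doc_terms)
print(f"query/document similarity: {similarity:.3f}")
```

Averaging this value over many (query, clicked document) pairs from a log would give one possible quantitative estimate of how different the two vocabularies are.
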
But for short
queries, phrases are of crucial importance because they are more
accurate representations of information and requirements. Without
phrases, separate words in the query may lead to poor results. For
example, given the query “search engine”, if it is represented as
“search” and “engine”, few of the retrieved documents will be
related to search engines, and most of them will pertain to mechanical
engines.
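
To make the "search engine" example concrete, here is a minimal sketch that compares matching on separate words with matching on the whole phrase. The helper names and the three toy documents are assumptions made for illustration; real systems would use an index and proper tokenization.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def contains_any_word(doc_tokens, query_tokens):
    """True if the document contains at least one of the query words."""
    return any(token in doc_tokens for token in query_tokens)


def contains_phrase(doc_tokens, query_tokens):
    """True if the query words appear adjacent and in order in the document."""
    n = len(query_tokens)
    return any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


# Toy documents (invented for illustration).
docs = [
    "how a web search engine ranks pages",
    "diesel engine repair manual",
    "search for a used car engine",
]
query = tokenize("search engine")

word_hits = [d for d in docs if contains_any_word(tokenize(d), query)]
phrase_hits = [d for d in docs if contains_phrase(tokenize(d), query)]

print("separate words:", word_hits)
print("phrase:", phrase_hits)
```

With separate words, all three toy documents match, including the two about mechanical engines; with the phrase, only the document about search engines remains.
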
The
central idea of our method is that if a set of documents is often
selected for the same queries, then the terms in these documents
are strongly related to the terms of the queries. Thus some
probabilistic correlations between query terms and document
terms can be established based on the query logs.
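
The idea above can be sketched in code. The excerpt does not give the formula, so this sketch assumes one plausible formalization: P(w_d | w_q) = Σ_D P(w_d | D) · P(D | w_q), where P(D | w_q) is the share of clicks on document D among sessions whose query contains w_q, and P(w_d | D) is the relative frequency of w_d in D. The function name, the session format, and the toy data are all hypothetical.

```python
from collections import Counter, defaultdict


def term_correlations(sessions, documents):
    """Estimate P(doc_term | query_term) from click sessions.

    sessions:  list of (query_terms, clicked_doc_ids) pairs
    documents: dict mapping doc_id -> list of document terms
    """
    # P(w_d | D): relative frequency of each term inside each document.
    doc_term_prob = {}
    for doc_id, terms in documents.items():
        counts = Counter(terms)
        total = sum(counts.values())
        doc_term_prob[doc_id] = {t: c / total for t, c in counts.items()}

    # Click counts per (query term, document) pair.
    clicks = defaultdict(Counter)
    for query_terms, clicked_doc_ids in sessions:
        for q in set(query_terms):
            for doc_id in clicked_doc_ids:
                clicks[q][doc_id] += 1

    # Combine: sum over clicked documents of P(w_d | D) * P(D | w_q).
    correlations = defaultdict(dict)
    for q, doc_counts in clicks.items():
        total_clicks = sum(doc_counts.values())
        for doc_id, c in doc_counts.items():
            p_doc_given_q = c / total_clicks
            for t, p_term_given_doc in doc_term_prob[doc_id].items():
                previous = correlations[q].get(t, 0.0)
                correlations[q][t] = previous + p_term_given_doc * p_doc_given_q
    return correlations


# Toy log (invented for illustration).
documents = {
    "d1": ["windows", "microsoft", "microsoft"],
    "d2": ["windows", "glass"],
}
sessions = [
    (["windows"], ["d1"]),
    (["windows"], ["d1"]),
    (["windows"], ["d2"]),
]

corr = term_correlations(sessions, documents)
for term, p in sorted(corr["windows"].items(), key=lambda kv: -kv[1]):
    print(f"P({term} | windows) = {p:.3f}")
```

In this toy log, "microsoft" ends up as the document term most strongly correlated with the query term "windows", because the document that mentions it most is also the one clicked most often for that query.
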