TF-IDF uncovered: a study of theories and probabilities
by: Thomas Roelleke, Jun Wang
In SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval (2008), pp. 435-442.
Abstract
Interpretations of TF-IDF are based on binary independence retrieval, Poisson, information theory, and language modelling. This paper contributes a review of existing interpretations, and then, TF-IDF is systematically related to the probabilities P(q|d) and P(d|q). Two approaches are explored: a space of independent, and a space of disjoint terms. For independent terms, an "extreme" query/non-query term assumption uncovers TF-IDF, and an analogy of P(d|q) and the probabilistic odds O(r|d, q) mirrors relevance feedback. For disjoint terms, a relationship between probability theory and TF-IDF is established through the integral $\int \frac{1}{x}\,dx = \log x$. This study uncovers components such as divergence from randomness and pivoted document length to be inherent parts of a document-query independence (DQI) measure, and interestingly, an integral of the DQI over the term occurrence probability leads to TF-IDF.
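
As a rough illustration of why an integral of 1/x can produce a logarithmic term weight, the following worked instance may help. It is a sketch under stated assumptions, not the paper's own derivation: it assumes the term occurrence probability is estimated as $P(t) = n_t/N$, where $n_t$ (number of documents containing term $t$) and $N$ (collection size) are symbols introduced here for illustration and do not appear in the abstract.

```latex
\int_{P(t)}^{1} \frac{1}{x}\,dx
  = \log 1 - \log P(t)
  = -\log \frac{n_t}{N}
  = \log \frac{N}{n_t}
```

Under that assumption, the result has the form of the usual inverse document frequency, $\mathrm{idf}(t) = \log(N/n_t)$.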