Probabilistic Models of Text and Link Structure for Hypertext Classification
Notes for this article
Task: classify web pages as student, course, faculty, or project pages, using hyperlinks, anchor text, and the hub property.
Idea of the task: train on manually classified pages from one university and apply the learned model to pages from other universities.
Method: PRM with existence uncertainty + belief propagation (partial knowledge about some instances influences conclusions about their unknown neighbours)
"Because instances are not independent, information about some instances can be used to reach conclusions about others."
"Note that during classification, existence of links and anchor words in the links are used as evidence to infer categories of the web pages."
determined classes: Page [.category, .hub, .word1, ... .wordn]
undetermined: Links [.fromPage, .toPage, .anchor, .exists]; Anchor [.word]
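As a reading aid, the classes listed in these notes could be written down as plain Python data structures. This is only a sketch of the schema, not the PRM itself: the names `Page`, `Link`, `Anchor`, and `candidate_links` and all field types are assumptions made for illustration. The one point it tries to show is existence uncertainty: pages are given, while every ordered pair of pages is a potential link whose `exists` attribute is itself a variable.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Page:
    """Determined entity: the page itself is known to exist."""
    url: str
    words: Dict[str, bool] = field(default_factory=dict)  # word1 ... wordn presence flags
    hub: Optional[bool] = None       # hub property, None if unknown
    category: Optional[str] = None   # None until labelled or inferred


@dataclass
class Anchor:
    """One anchor word attached to a link."""
    word: str


@dataclass
class Link:
    """Undetermined entity: whether the link exists is itself uncertain."""
    from_page: str
    to_page: str
    exists: Optional[bool] = None    # None means "not observed"
    anchor: List[Anchor] = field(default_factory=list)


def candidate_links(pages: List[Page]) -> List[Link]:
    """Enumerate every ordered pair of distinct pages as a potential link."""
    return [Link(a.url, b.url) for a in pages for b in pages if a.url != b.url]
```

With n pages this produces n(n-1) potential links, which hints at why exact inference over such a model becomes expensive.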
Abstract
Most text classification methods treat each document as an independent instance. However, in many text domains, documents are linked and the topics of linked documents are correlated. For example, web pages of related topics are often connected by hyperlinks and scientific papers from related fields are commonly linked by citations. We propose a unified probabilistic model for both the textual content and the link structure of a document collection. Our model is based on the recently introduced framework of Probabilistic Relational Models (PRMs), which allows us to capture correlations between linked documents. We show how to learn these models from data and use them efficiently for classification. Since exact methods for classification in these large models are intractable, we utilize belief propagation, an approximate inference algorithm. Belief propagation automatically induces a very natural behavior, where our knowledge about one document helps us classify related ones, which in turn help us classify others. We present preliminary empirical results on a dataset of university web pages.
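The propagation behaviour described in the abstract can be illustrated with a small sum-product belief propagation sketch. It is deliberately much simpler than the paper's model and should not be read as the authors' implementation: it assumes a pairwise model in which each page has a local evidence vector over categories (for example from a text classifier) and every link shares one compatibility matrix. The function name `loopy_bp`, its parameters, and the toy numbers are all assumptions for illustration.

```python
import numpy as np


def loopy_bp(unary, edges, pairwise, n_iters=20):
    """Sum-product loopy belief propagation on a pairwise model.

    unary:    (n_nodes, n_states) non-negative local evidence per page
    edges:    list of undirected (i, j) index pairs (the links)
    pairwise: (n_states, n_states) symmetric compatibility shared by all edges
    Returns (n_nodes, n_states) approximate marginals.
    """
    unary = np.asarray(unary, dtype=float)
    pairwise = np.asarray(pairwise, dtype=float)
    n_nodes, n_states = unary.shape

    # msgs[(i, j)] is the message sent from node i to node j.
    msgs = {}
    nbrs = {i: [] for i in range(n_nodes)}
    for i, j in edges:
        msgs[(i, j)] = np.full(n_states, 1.0 / n_states)
        msgs[(j, i)] = np.full(n_states, 1.0 / n_states)
        nbrs[i].append(j)
        nbrs[j].append(i)

    for _ in range(n_iters):
        new_msgs = {}
        for (i, j) in msgs:
            # Combine i's local evidence with messages from all neighbours except j.
            prod = unary[i].copy()
            for k in nbrs[i]:
                if k != j:
                    prod = prod * msgs[(k, i)]
            # Sum over i's state: m[x_j] = sum_{x_i} pairwise[x_i, x_j] * prod[x_i]
            m = pairwise.T @ prod
            new_msgs[(i, j)] = m / m.sum()
        msgs = new_msgs

    # Belief at each node: local evidence times all incoming messages.
    beliefs = unary.copy()
    for i in range(n_nodes):
        for k in nbrs[i]:
            beliefs[i] = beliefs[i] * msgs[(k, i)]
    return beliefs / beliefs.sum(axis=1, keepdims=True)


# Toy chain of three pages: page 0 is almost certainly category 0,
# pages 1 and 2 have no local evidence of their own.
unary = np.array([[0.99, 0.01], [0.5, 0.5], [0.5, 0.5]])
pairwise = np.array([[0.8, 0.2], [0.2, 0.8]])  # linked pages tend to share a category
beliefs = loopy_bp(unary, edges=[(0, 1), (1, 2)], pairwise=pairwise)
print(beliefs)
```

In this toy chain the evidence about page 0 pulls page 1 towards the same category, and page 1 in turn pulls page 2, with the effect weakening at each hop. On graphs with cycles the same updates give only approximate marginals.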