Probabilistic Models of Text and Link Structure for Hypertext Classification

In IJCAI Workshop on "Text Learning: Beyond Supervision" (August 2001)

ldietz's tags for this article

mustread relationalmodels socialnets

Notes for this article

Task: classify web pages as student, course, faculty, or project, using hyperlinks, anchor text, and the hub property.

Idea: train on manually classified pages from one university and apply the learned model to pages from other universities.

Method: PRM with existence uncertainty + belief propagation (partial knowledge about some pages influences the inferred labels of their unlabeled neighbours).

ldietz (public note) - 2005-11-27 10:55:41

"Because instances are not independent, information about some instances can be used to reach conclusions about others."

ldietz (public note) - 2005-11-27 10:56:24

"Note that during classification, existence of links and anchor words in the links are used as evidence to infer categories of the web pages."

ldietz (public note) - 2005-11-27 10:57:24

Determined classes (objects known to exist): Page [.category, .hub, .word1, ... .wordn]

Undetermined classes (existence is uncertain): Link [.fromPage, .toPage, .anchor, .exists]; Anchor [.word]

ldietz (public note) - 2005-11-27 11:17:39
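
The class layout in the note above can be pictured as plain data classes. This is only an illustrative sketch: the Python names, the field types, and the choice to hold anchor words as a list on the link are assumptions made here, not the paper's notation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Page:
    """Determined class: every page is known to exist."""
    category: Optional[str] = None   # None while the label is still unknown
    hub: Optional[bool] = None
    words: List[str] = field(default_factory=list)   # stands in for .word1 ... .wordn


@dataclass
class Anchor:
    """Undetermined class: a single anchor word."""
    word: str


@dataclass
class Link:
    """Undetermined class: a potential link whose existence is itself modelled."""
    from_page: Page
    to_page: Page
    anchor: List[Anchor] = field(default_factory=list)
    exists: bool = False


# Hypothetical example data: one labeled page linking to one unlabeled page.
student = Page(category="student", words=["homework", "advisor"])
unknown = Page(words=["lecture", "syllabus"])
link = Link(student, unknown, anchor=[Anchor("course")], exists=True)
```

Here `link.exists` and the anchor words are the kind of evidence the third note refers to when it says links and anchor text are used to infer page categories.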

Abstract

Most text classification methods treat each document as an independent instance. However, in many text domains, documents are linked and the topics of linked documents are correlated. For example, web pages of related topics are often connected by hyperlinks and scientific papers from related fields are commonly linked by citations. We propose a unified probabilistic model for both the textual content and the link structure of a document collection. Our model is based on the recently introduced framework of Probabilistic Relational Models (PRMs), which allows us to capture correlations between linked documents. We show how to learn these models from data and use them efficiently for classification. Since exact methods for classification in these large models are intractable, we utilize belief propagation, an approximate inference algorithm. Belief propagation automatically induces a very natural behavior, where our knowledge about one document helps us classify related ones, which in turn help us classify others. We present preliminary empirical results on a dataset of university web pages.
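
To make the propagation behaviour described in the abstract concrete, here is a minimal sketch of generic sum-product belief propagation on a small pairwise network of pages. It is not the paper's PRM inference procedure: the function name `loopy_bp`, the two-label setup, and all potential values are assumptions chosen for illustration.

```python
def loopy_bp(unary, edges, compat, iters=20):
    """Sum-product belief propagation on an undirected pairwise network.

    unary:  {node: [score per label]}, local evidence (e.g. from page text)
    edges:  list of (node, node) pairs, each undirected link listed once
    compat: compat[a][b], compatibility of labels a and b on linked nodes
    """
    n_labels = len(compat)
    nbrs = {i: [] for i in unary}
    for a, b in edges:
        nbrs[a].append(b)
        nbrs[b].append(a)

    # One message per directed edge, initialised to uniform.
    msgs = {(i, j): [1.0 / n_labels] * n_labels for i in nbrs for j in nbrs[i]}

    for _ in range(iters):
        new = {}
        for (i, j) in msgs:
            out = []
            for xj in range(n_labels):
                total = 0.0
                for xi in range(n_labels):
                    # Product of messages into i from all neighbours except j.
                    incoming = 1.0
                    for k in nbrs[i]:
                        if k != j:
                            incoming *= msgs[(k, i)][xi]
                    total += unary[i][xi] * compat[xi][xj] * incoming
                out.append(total)
            z = sum(out)
            new[(i, j)] = [v / z for v in out]
        msgs = new

    # Belief at each node: local evidence times all incoming messages.
    beliefs = {}
    for i in nbrs:
        b = list(unary[i])
        for k in nbrs[i]:
            b = [bv * mv for bv, mv in zip(b, msgs[(k, i)])]
        z = sum(b)
        beliefs[i] = [v / z for v in b]
    return beliefs


# Hypothetical chain a - b - c: only page "a" has strong evidence for label 0.
unary = {"a": [0.99, 0.01], "b": [0.5, 0.5], "c": [0.5, 0.5]}
edges = [("a", "b"), ("b", "c")]
compat = [[0.8, 0.2], [0.2, 0.8]]   # linked pages tend to share a label
beliefs = loopy_bp(unary, edges, compat)
```

In this toy chain the evidence on page `a` raises the belief in label 0 for `b`, and more weakly for `c`, which mirrors the abstract's remark that knowledge about one document helps classify related ones, which in turn help classify others.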
