CiteULike is a free online bibliography manager. Register and you can start organising your references online.

Incorporating site-level knowledge to extract structured data from web forums Export

In WWW '09: Proceedings of the 18th international conference on World wide web (2009), pp. 181-190.

Citation Format

[Posts]

View FullText article


X Reviews [Write a review of this article]

X Notes for this article

ChaTo has 0 private notes and 1 public note for this article.

[Talk]

Exploit sitemap; cluster pages by layout.

Find out where the author/title/date/etc. are in the DOM of a cluster of pages.

ChaTo (public note) - 2009-04-23 15:50:24

X Find related articles from these CiteULike users

X Find related articles with these CiteULike tags

X Posting History

X Abstract

Web forums have become an important data resource for many web applications, but extracting structured data from unstructured web forum pages is still a challenging task due to both complex page layout designs and unrestricted user created posts. In this paper, we study the problem of structured data extraction from various web forum sites. Our target is to find a solution as general as possible to extract structured data, such as post title, post author, post time, and post content from any forum site. In contrast to most existing information extraction methods, which only leverage the knowledge inside an individual page, we incorporate both page-level and site-level knowledge and employ Markov logic networks (MLNs) to effectively integrate all useful evidence by learning their importance automatically. Site-level knowledge includes (1) the linkages among different object pages, such as list pages and post pages, and (2) the interrelationships of pages belonging to the same object. The experimental results on 20 forums show a very encouraging information extraction performance, and demonstrate the ability of the proposed approach on various forums. We also show that the performance is limited if only page-level knowledge is used, while when incorporating the site-level knowledge both precision and recall can be significantly improved.


X BibTeX record

X RIS record


Privacy Statement | Terms & Conditions
CiteULike organises scholarly (or academic) papers or literature and provides bibliographic (which means it makes bibliographies) for universities and higher education establishments. It helps undergraduates and postgraduates. People studying for PhDs or in postdoctoral (postdoc) positions. The service is similar in scope to EndNote or RefWorks or any other reference manager like BibTeX, but it is a social bookmarking service for scientists and humanities researchers.