
Mining data records in Web pages

In KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining (2003), pp. 601-606.





Abstract

A large amount of information on the Web is contained in regularly structured objects, which we call data records. Such data records are important because they often present the essential information of their host pages, e.g., lists of products or services. It is useful to mine such data records in order to extract information from them to provide value-added services. Existing automatic techniques are not satisfactory because of their poor accuracies. In this paper, we propose a more effective technique to perform the task. The technique is based on two observations about data records on the Web and a string matching algorithm. The proposed technique is able to mine both contiguous and non-contiguous data records. Our experimental results show that the proposed technique outperforms existing techniques substantially.
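The abstract does not say which string matching algorithm the technique uses, nor what the two observations are. As a rough illustration only, the sketch below assumes that candidate records are represented as sequences of HTML tag names and compared with a normalized Levenshtein edit distance. The function names, the sample tag sequences, and the 0.3 threshold are all hypothetical and are not taken from the paper.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of tag names."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def similar(a, b, threshold=0.3):
    """True when the normalized distance is at or below the assumed threshold."""
    longest = max(len(a), len(b))
    if longest == 0:
        return True  # two empty sequences are trivially alike
    return edit_distance(a, b) / longest <= threshold


# Hypothetical tag sequences: two product rows and an unrelated navigation bar.
record_a = ["tr", "td", "img", "td", "a", "td"]
record_b = ["tr", "td", "img", "td", "a", "b", "td"]
nav_bar = ["div", "ul", "li", "li", "li"]
```

Under these assumptions, `similar(record_a, record_b)` holds because the two rows differ by a single inserted tag, while `similar(record_a, nav_bar)` does not, which is the kind of signal a record miner could use to group regularly structured regions of a page.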

