BlockWeb: An IR Model for Block Structured Web Pages
BlockWeb is a model that we have developed for indexing and querying Web pages according to both their content and their visual rendering. Pages are split into blocks, which has several advantages for indexing and querying: (i) the blocks of a page most similar to a query can be returned instead of the page as a whole; (ii) the importance of each block can be taken into account; and (iii) so can the permeability of a block to the content of its neighboring blocks. In this paper, we present the BlockWeb model and show its usefulness for indexing images of Web pages through an experiment performed on electronic versions of French daily newspapers. We also present the engine we have implemented for block extraction, indexing, and querying according to the BlockWeb model.
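
To make the three ideas above concrete, the following is a minimal illustrative sketch, not the BlockWeb engine itself. It assumes a simple additive scheme: a block's index is its own term weights plus the term weights of its neighbors scaled by a permeability coefficient, and the result is multiplied by the block's importance. The names Block, index_block, score and rank_blocks, as well as the numeric values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Block:
    """A page block (hypothetical structure, for illustration only)."""
    name: str
    terms: Dict[str, float]            # local term weights of the block
    importance: float = 1.0            # assumed scalar importance of the block
    # permeability[neighbor] = assumed fraction of the neighbor's content
    # that this block absorbs
    permeability: Dict[str, float] = field(default_factory=dict)


def index_block(block: Block, by_name: Dict[str, Block]) -> Dict[str, float]:
    """Combine local terms with neighbor terms, then scale by importance."""
    combined = dict(block.terms)
    for neighbor, perm in block.permeability.items():
        for term, weight in by_name[neighbor].terms.items():
            combined[term] = combined.get(term, 0.0) + perm * weight
    return {term: block.importance * w for term, w in combined.items()}


def score(query: List[str], index: Dict[str, float]) -> float:
    """Sum of the indexed weights of the query terms (a deliberately simple measure)."""
    return sum(index.get(term, 0.0) for term in query)


def rank_blocks(query: List[str], blocks: List[Block]) -> List[Tuple[float, str]]:
    """Return (score, block name) pairs, best score first, so blocks rather than pages are ranked."""
    by_name = {b.name: b for b in blocks}
    scored = [(score(query, index_block(b, by_name)), b.name) for b in blocks]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# An image block has no text of its own but is permeable to its caption.
blocks = [
    Block("caption", {"election": 1.0, "paris": 1.0}),
    Block("image", {}, permeability={"caption": 0.5}),
    Block("ad", {"sale": 1.0}, importance=0.2),
]
ranking = rank_blocks(["election"], blocks)
```

In this toy setting the image block becomes retrievable by the query term "election" only because it absorbs part of its caption's content, which is the intuition behind using permeability to index images.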