What is the Web? Most people would probably answer: "The Web is information." Indeed, despite many alternative uses, such as communicating with friends, playing online games, and selling goods, the primary use of the Web is to satisfy users' information needs. To use the information on the Web effectively, and to create tools that help users satisfy their information needs, it is necessary to understand how that information is organized.
In this project, we consider the Web as a set of logical information units, so-called compound documents (cDocs), which represent an individual's perception of information entities on the Web. Although a single Web page may be a cDoc in itself, a cDoc often consists of multiple Web pages. A typical example of such a cDoc is a Web news article consisting of several HTML pages. Another example is a set of Web pages describing a subject such as the biology of monarch butterflies, with different pages dedicated to anatomy, sensory systems, life cycle, etc.
We believe that the ability to identify cDocs would be useful in a number of important applications. It has the potential to improve both the recall and the precision of information retrieval on the Web; it can help link analysis algorithms, such as PageRank, identify links that serve a purely navigational purpose; it would allow better automatic metadata generation and association for Digital Libraries; and, finally, it can be used as a basis for improved user interfaces.
In this project, we develop methods for automatically identifying the boundaries of cDocs on the Web, utilizing both the content features of the constituent Web pages and the structural features of the underlying Web graph.
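
As a rough illustration only, and not the method developed in this project, the following sketch shows one simple way that a content feature and a structural feature could be combined to propose cDoc boundaries. It assumes that two pages belong to the same cDoc when a hyperlink connects them (structure) and their word sets overlap sufficiently (content). The function names `tokens`, `jaccard`, and `group_cdocs`, the Jaccard measure, the threshold of 0.3, and the sample pages are all hypothetical choices made for the example.

```python
def tokens(text):
    """Content feature (assumed): the set of lowercase words on a page."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard overlap of two word sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_cdocs(pages, links, threshold=0.3):
    """Group pages into candidate cDocs.

    pages: dict mapping URL -> page text
    links: iterable of (source URL, target URL) hyperlinks
    Two linked pages are merged when their word overlap reaches the
    threshold; the resulting connected components are the candidates.
    """
    parent = {url: url for url in pages}

    def find(url):
        # Union-find lookup with path halving.
        while parent[url] != url:
            parent[url] = parent[parent[url]]
            url = parent[url]
        return url

    words = {url: tokens(text) for url, text in pages.items()}
    for src, dst in links:
        if src in pages and dst in pages and src != dst:
            if jaccard(words[src], words[dst]) >= threshold:
                parent[find(src)] = find(dst)

    groups = {}
    for url in pages:
        groups.setdefault(find(url), set()).add(url)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical sample data: two butterfly pages and one unrelated page.
pages = {
    "a/1": "monarch butterfly anatomy wings thorax",
    "a/2": "monarch butterfly life cycle wings larva",
    "b/1": "stock market news today",
}
links = [("a/1", "a/2"), ("a/2", "b/1")]
result = group_cdocs(pages, links)
```

Here the two butterfly pages share three of eight distinct words (overlap 0.375) and are linked, so they form one candidate cDoc, while the link from `a/2` to `b/1` is treated as crossing a boundary because the pages share no words. A realistic system would need richer features on both sides; the sketch only shows how the two kinds of evidence can be combined.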