Extracting Content for News Web Pages based on DOM


Hua Geng, Qiang Gao, Jingui Pan


Vol. 7  No. 2  pp. 124-129


RSS has become a popular topic in Web applications, and many well-known Web sites now provide RSS feeds for their users. However, creating RSS files manually is tedious, and so far most sites have not provided such a service. In this paper, we describe the design, implementation, and evaluation of HTML2RSS, a system that extracts content from HTML Web pages based on their DOM structure and automatically generates RSS files from the extracted content. We introduce two algorithms for extracting information from semi-structured Web data. The goal of HTML2RSS is to provide users with RSS files as a substitute for the HTML pages.


Web information extraction, DOM, XML, time pattern, RSS
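
The abstract does not spell out the two extraction algorithms, so the following is only a minimal sketch of the general idea it describes: walk the DOM of a news page, pick out the main content, and emit it as an RSS item. The heuristic used here (choose the container element holding the most text) is an assumption made for illustration and is not the paper's method; the names `ContentExtractor`, `extract_content`, and `build_rss` are hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Assumption: these tags act as candidate content containers.
CONTAINERS = {"body", "div", "td", "article", "section"}
SKIP = {"script", "style"}  # text inside these is never page content


class ContentExtractor(HTMLParser):
    """Collects visible text per container element while walking the page."""

    def __init__(self):
        super().__init__()
        self.stack = []        # open containers as (tag, index into blocks)
        self.blocks = []       # one list of text fragments per container
        self.skip_depth = 0
        self.in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif tag == "title":
            self.in_title = True
        elif tag in CONTAINERS:
            self.blocks.append([])
            self.stack.append((tag, len(self.blocks) - 1))

    def handle_endtag(self, tag):
        if tag in SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "title":
            self.in_title = False
        elif tag in CONTAINERS:
            # Close the nearest matching container; tolerate unbalanced HTML.
            for i in range(len(self.stack) - 1, -1, -1):
                if self.stack[i][0] == tag:
                    del self.stack[i:]
                    break

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or self.skip_depth:
            return
        if self.in_title:
            self.title_parts.append(text)
        elif self.stack:
            # Attribute text to the innermost open container.
            self.blocks[self.stack[-1][1]].append(text)


def extract_content(html):
    """Return (page title, text of the container with the most text)."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    best = max(parser.blocks, key=lambda b: sum(len(t) for t in b), default=[])
    return " ".join(parser.title_parts), " ".join(best)


def build_rss(channel_title, channel_link, items):
    """Serialize (title, link, description) tuples as an RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = "Generated from HTML pages"
    for title, link, description in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = link
        ET.SubElement(item, "description").text = description
    return ET.tostring(rss, encoding="unicode")
```

Calling `extract_content(html)` on a fetched page and passing the result to `build_rss` yields a feed string. A real system would need a more robust way of separating content from navigation and advertisements than raw text length, which is presumably where the paper's DOM-based algorithms and the "time pattern" keyword come in.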