Heterogeneous Web Data Crawling

On May 24, 2011, in Personal Works, by markpeng

Figure 1. The proposed framework of heterogeneous web crawling/searching

To build a fine-grained search engine that transforms raw data into valuable information, I argue that we need three key steps:

Crawl --> Extract --> Search

As depicted in Figure 1, fault-tolerant crawlers first gather heterogeneous content through different channels (HTTP requests, social APIs, Telnet, etc.). Then, using the InfoViz tool, we extract detailed information from each raw Web page: for instance, the title, post time, author name and content of a news post, or the price, name and description of the products that appear on a 3C (computer, communication and consumer electronics) product page. Finally, a distributed indexing/searching framework provides a near-real-time API service to users. Together, these steps make it possible to search fine-grained data with detailed queries such as "How many distinct authors mentioned a particular event within a fixed date range?" or "Which 3C website offers the lowest price for the iPad 2?". That gives users more satisfying search results!

In the following paragraphs I briefly introduce the key concepts behind each of these steps.

Crawl: From Web to Page

In practice, the crawling schedule varies from one data source to another, depending on each source's blocking policy. For highly secured data sources, the crawler's behavior and schedule should resemble human access patterns as closely as possible to avoid being blocked. For instance, a target website may throttle requests from its clients (users) with a time slot of 2 seconds; it would then be smarter to crawl with a delay of more than 3 seconds per request. Since it is impossible to set the optimal crawling schedule (e.g., the time interval between requests) for every data source manually, the system should be able to evolve these parameters over time.
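
As a rough illustration of what "evolving" the request delay could look like, here is a minimal Python sketch. It is not the actual system: the class name `AdaptiveDelay`, the multiplicative back-off rule and all default values except the 3-second starting delay are my assumptions. The idea is simply to slow down sharply after a block and to speed up gently after each success.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveDelay:
    """Per-source request delay adapted from blocking feedback (hypothetical helper)."""
    delay: float = 3.0       # current delay in seconds; 3 s follows the example in the text
    min_delay: float = 1.0   # assumed lower bound, never crawl faster than this
    max_delay: float = 60.0  # assumed upper bound for the back-off
    backoff: float = 2.0     # assumed factor applied after a blocked request
    recovery: float = 0.95   # assumed factor applied after a successful request

    def record(self, blocked: bool) -> float:
        """Update the delay from one request outcome and return the new value."""
        if blocked:
            # Back off quickly so the source stops seeing us as abusive.
            self.delay = min(self.delay * self.backoff, self.max_delay)
        else:
            # Recover slowly toward the fastest delay the source tolerates.
            self.delay = max(self.delay * self.recovery, self.min_delay)
        return self.delay
```

A crawler would keep one `AdaptiveDelay` per data source, call `record(blocked=...)` after every response, and sleep for the returned number of seconds before the next request.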

Furthermore, because the crawlers send as few requests as possible to avoid blocking, keeping the collected data fresh is also very important. Some data sources publish new data every 15 minutes, while others update only once a day. The system should therefore also be able to evolve an optimal visiting schedule for each data source. During this evolution, a knowledge base with predefined scheduling options and initial parameters can help the system find good scheduling plans in short-term cycles. Feedback gathered from the crawlers (e.g., lessons learned by trial and error) can also enrich the knowledge base for solving scheduling problems in long-term cycles.
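
One simple way to adapt the visiting schedule, sketched below under my own assumptions, is to compare a fingerprint of the page between visits and then shorten or lengthen the revisit interval. The function names and the halve/grow-by-50% rule are hypothetical; only the 15-minute and once-a-day bounds come from the examples above.

```python
import hashlib


def content_fingerprint(page: str) -> str:
    """Hash the page body so two fetches can be compared cheaply."""
    return hashlib.sha256(page.encode("utf-8")).hexdigest()


def next_revisit_interval(current: float, changed: bool,
                          shortest: float = 15 * 60,
                          longest: float = 24 * 60 * 60) -> float:
    """Return the next revisit interval in seconds.

    Assumed rule: halve the interval when the page changed since the last
    visit, otherwise grow it by 50%, and clamp the result to the bounds.
    """
    proposed = current / 2 if changed else current * 1.5
    return max(shortest, min(proposed, longest))
```

For example, a source visited hourly that turns out to have changed would next be visited after 30 minutes, while an unchanged one would wait 90 minutes.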

Extract: From Page to Table

Once we have collected a large number of raw Web pages, we need to extract detailed information to give those pages more semantic meaning. InfoViz significantly reduces the labor cost of learning templates from those pages. However, learned templates do not keep working forever. We need an automatic or semi-automatic repair mechanism that detects whether a template is still valid for the extraction task and, if it is not, rebuilds it quickly through the InfoViz tool.
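
Detecting a broken template can start with something as simple as checking whether recent extraction results still contain the fields we expect. The Python sketch below is only an assumption of how such a check might look; it does not use InfoViz, and the field keys, function names and the 20% threshold are all hypothetical.

```python
from typing import Dict, Iterable, List, Optional

# Fields of a news post as listed earlier in this post; the key names are my own.
REQUIRED_FIELDS = ("title", "post_time", "author", "content")


def missing_fields(record: Dict[str, Optional[str]],
                   required: Iterable[str] = REQUIRED_FIELDS) -> List[str]:
    """Return the required fields that are absent or empty in one extracted record."""
    return [name for name in required if not (record.get(name) or "").strip()]


def template_looks_broken(records: List[Dict[str, Optional[str]]],
                          max_failure_ratio: float = 0.2) -> bool:
    """Flag the template when too many recent records are incomplete."""
    if not records:
        return True  # extracting nothing at all is treated as a failure
    failures = sum(1 for record in records if missing_fields(record))
    return failures / len(records) > max_failure_ratio
```

When `template_looks_broken` returns `True` for the latest batch of pages from a site, the template would be queued for relearning.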

The extraction algorithms of InfoViz rely on visual or HTML-coded features, such as CSS/tag attributes, visual alignment and layout, to learn the template of a web page and then extract the desired information from the identified informative blocks. Once a website's template changes, suitable features and optimal parameters have to be investigated and revised again. It is therefore important to find ways to adapt to template changes automatically, reusing feature and parameter settings that worked before.
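
To make the idea of an HTML-coded feature concrete, here is a small Python sketch that uses only the CSS class attribute as its feature. It is a stand-in for illustration, not InfoViz's algorithm: a "template" here is just a hand-written mapping from class names to field names, and `ClassBasedExtractor` and `extract` are names I made up. It relies on the standard library's `html.parser` and ignores void elements and visual features entirely.

```python
from html.parser import HTMLParser
from typing import Dict, Optional


class ClassBasedExtractor(HTMLParser):
    """Collect text from elements whose CSS class maps to a field name."""

    def __init__(self, class_to_field: Dict[str, str]) -> None:
        super().__init__()
        self.class_to_field = class_to_field
        self.fields: Dict[str, str] = {}
        self._current: Optional[str] = None  # field currently being captured
        self._tag: Optional[str] = None      # tag name that opened the capture
        self._depth = 0                      # nesting depth of same-named tags

    def handle_starttag(self, tag, attrs):
        if self._current is not None:
            if tag == self._tag:
                self._depth += 1  # nested tag of the same name, e.g. div in div
            return
        for css_class in (dict(attrs).get("class") or "").split():
            if css_class in self.class_to_field:
                self._current = self.class_to_field[css_class]
                self._tag, self._depth = tag, 1
                self.fields.setdefault(self._current, "")
                break

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:  # the element that started the capture closed
                self.fields[self._current] = self.fields[self._current].strip()
                self._current = self._tag = None

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current] += data


def extract(html: str, class_to_field: Dict[str, str]) -> Dict[str, str]:
    """Apply a class-to-field mapping (the toy template) to one HTML page."""
    parser = ClassBasedExtractor(class_to_field)
    parser.feed(html)
    parser.close()
    return parser.fields
```

When the site renames its CSS classes, the mapping stops matching and the extracted fields come back empty, which is exactly the kind of template change that needs to be detected and repaired.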


Search: From Table to Index

Lucene is a well-regarded Apache project for fast full-text indexing and searching; it is the library underlying Nutch, a distributed web crawler and search engine project. Compared with a traditional RDBMS (MySQL, MSSQL), Lucene is more powerful for full-text indexing and searching. For some queries, it can be 500-1000 times faster than the equivalent SQL commands. Solr and Zoie are other projects that build on Lucene for single-node or distributed indexing and searching. Of course, some NoSQL solutions (such as MongoDB and HBase) can do similar indexing and searching jobs.
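
The reason a full-text index answers such queries quickly is the inverted index: a map from each term to the documents that contain it, so a query never has to scan every row. The Python sketch below is a toy version of that concept, not Lucene's API; the class `TinyInvertedIndex`, the whitespace tokenizer and the record fields are my assumptions. It answers the "distinct authors within a date range" query from the beginning of this post.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, Set


class TinyInvertedIndex:
    """Toy inverted index: term -> ids of the documents containing it."""

    def __init__(self) -> None:
        self.postings: Dict[str, Set[int]] = defaultdict(set)
        self.docs: List[dict] = []

    def add(self, doc: dict) -> None:
        """Index one extracted record; only its 'content' field is tokenized."""
        doc_id = len(self.docs)
        self.docs.append(doc)
        for term in doc["content"].lower().split():  # naive whitespace tokenizer
            self.postings[term].add(doc_id)

    def search(self, query: str) -> List[dict]:
        """Return the documents that contain every term of the query."""
        terms = query.lower().split()
        if not terms:
            return []
        ids = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return [self.docs[i] for i in sorted(ids)]


def distinct_authors(index: TinyInvertedIndex, event: str,
                     start: date, end: date) -> Set[str]:
    """Authors who mentioned `event` between `start` and `end`, inclusive."""
    return {doc["author"] for doc in index.search(event)
            if start <= doc["post_time"] <= end}
```

Each record passed to `add` is expected to carry `author`, `post_time` (a `date`) and `content`, which are the same fields the extraction step produces for a news post.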


  • Java, C#, MySQL, Google Alerts Crawler, RSS, Bing API, Google News Crawler, Plurk API, Facebook Graph API, Twitter API, InfoViz, Lucene, Nutch
