Template Learning System: InfoViz

On May 5, 2011, in Personal Works, by markpeng

Figure 1. InfoViz system screenshot

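The main goal of this project is to build a system that automatically learns where specific information blocks are located within various web page structures, so that content can be extracted accurately (Information Extraction) from large numbers of structurally heterogeneous websites, in support of next-generation search engines and semantic analysis technology. Traditional search engines focus only on collecting web pages completely and quickly with a crawler, and then indexing the "entire page source" for user search. This indexing approach causes the following problems: (1) programs cannot automatically understand which important pieces of information a page stores (a shopping site page, for example, carries a product name, description, and price); (2) a page may contain a lot of text unrelated to its main content (such as banner ads and repeated descriptive text), and because indexing takes all page text into account, search results can include pages that have nothing to do with the user's keywords (simply because a related word appeared in an ad block somewhere on the page).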

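A technique for labeling and learning web page information blocks helps us define and identify which valuable information each page carries (achieving an effect similar to semantic tagging), and it is of great help when collecting web data at massive scale. Once we can build page templates for different types of pages (such as news, blogs, and forums) and automatically extract important information such as the title, time, and body text, we not only get more precise indexing of page content but can also determine the time each page represents. This avoids being denied service for visiting a particular website too often (that is, Incremental Crawling: visiting only recently added pages to cut unnecessary visits).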
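
The incremental crawling idea can be illustrated with a minimal Python sketch. It assumes the template has already extracted a publish time for each entry on a list page; the function name `pages_to_visit`, the example.com URLs, and the timestamp format are hypothetical and are not taken from InfoViz.

```python
from datetime import datetime

def pages_to_visit(entries, last_crawl):
    """Return URLs whose extracted publish time is later than the last crawl."""
    fresh = []
    for url, published in entries:
        # Assumed timestamp format; a real site would need its own parser.
        published_at = datetime.strptime(published, "%Y-%m-%d %H:%M")
        if published_at > last_crawl:
            fresh.append(url)
    return fresh

# Hypothetical (url, extracted time) pairs from a list page.
entries = [
    ("http://example.com/news/101", "2011-05-01 09:30"),
    ("http://example.com/news/102", "2011-05-04 18:00"),
    ("http://example.com/news/103", "2011-05-05 08:15"),
]
last_crawl = datetime(2011, 5, 3, 0, 0)
to_fetch = pages_to_visit(entries, last_crawl)
print(to_fetch)
```

Only the two entries published after the last crawl are scheduled, so the crawler skips the page it has already seen.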

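The technique uses attribute and structural features of web pages to identify where the information blocks manually labeled by the user are located in a page, and it can generalize from a single example (the user labels just one example, and the technique automatically identifies every other place in the same page that has the same features). Through a special script, the learned website template can be saved as a file and used as input by other web applications that parse page content, which makes the template reusable.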
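
As a rough illustration of labeling one example and finding its siblings, the Python sketch below treats an element's tag path from the root plus its class attribute as its feature signature, then saves the learned template as JSON. This is an assumed simplification, not the actual InfoViz feature set or file format; `Node`, `TreeBuilder`, `learn_template`, and `extract` are hypothetical names, and the parser assumes well-nested markup.

```python
import json
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

class Node:
    def __init__(self, tag, attrs, parent=None):
        self.tag, self.attrs, self.parent = tag, dict(attrs), parent
        self.text = ""

    def signature(self):
        # Structural feature: tag path from the root. Attribute feature: class.
        path, node = [], self
        while node is not None and node.tag != "#root":
            path.append(node.tag)
            node = node.parent
        return {"path": "/".join(reversed(path)),
                "class": self.attrs.get("class", "")}

class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.current = Node("#root", [])
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.nodes.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.current.parent is not None:
            self.current = self.current.parent  # climb back up

    def handle_data(self, data):
        self.current.text += data.strip()

def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.nodes

def learn_template(nodes, labeled_text, field):
    """Learn a signature from the single element the user labeled."""
    for node in nodes:
        if node.text == labeled_text:
            return {field: node.signature()}
    raise ValueError("labeled example not found")

def extract(nodes, template):
    """Return the text of every node whose signature matches the template."""
    return {field: [n.text for n in nodes if n.signature() == sig]
            for field, sig in template.items()}

PAGE = """
<html><body>
<div class="ad"><span class="title">Buy now</span></div>
<ul>
  <li><span class="title">First post</span><span class="date">2011-05-01</span></li>
  <li><span class="title">Second post</span><span class="date">2011-05-03</span></li>
</ul>
</body></html>
"""

nodes = parse(PAGE)
template = learn_template(nodes, "First post", "title")  # one labeled example
saved = json.dumps(template)                             # template as a reusable file payload
titles = extract(nodes, json.loads(saved))["title"]
print(titles)
```

The span inside the ad block shares the class `title` but sits on a different tag path, so it is not matched; combining both kinds of feature is what keeps the one-example generalization from over-matching.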

Web data extraction aims to identify and extract desired data items from diverse Web pages. For this task, it is intuitive to assume that most pages of a website follow a fixed set of templates that render Web data consistently in well-structured formats. Based on this assumption, previous studies have proposed wrapper induction methods that learn extraction rules or repeated patterns from an initial set of labeled pages [1, 2]. However, a major problem with this approach is that the selected set of labeled pages may not be fully representative of all the templates of a website. Some researchers have presented unsupervised techniques that learn template patterns automatically without using any labeled pages [1, 2]. However, manual post-processing is still needed to identify the desired data items.

In this work, I propose a novel data extraction method that combines HTML element attributes with DOM-tree-based features to extract desired data items from both list and detail pages. The method needs only a single labeled page instance to learn a template; further labeling is required only when the items of a new instance cannot be extracted correctly. The method also finds data items in each new instance efficiently by using a top-down tree path traversal mechanism.
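
A top-down tree path traversal can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the algorithm from the paper: the template is assumed to be a list of (tag, class) steps from the root, the page is assumed to be well-formed XHTML so that the standard library's `xml.etree.ElementTree` can parse it, and `find_by_path`, `matches`, and `TITLE_PATH` are hypothetical names.

```python
import xml.etree.ElementTree as ET

def matches(element, step):
    """A step is (tag, class); a class of None means any class is accepted."""
    tag, cls = step
    return element.tag == tag and (cls is None or element.get("class") == cls)

def find_by_path(root, steps):
    """Walk top-down, keeping only the nodes that match each step of the path."""
    frontier = [root] if matches(root, steps[0]) else []
    for step in steps[1:]:
        # Subtrees that fail a step are pruned and never visited again.
        frontier = [child for node in frontier for child in node
                    if matches(child, step)]
    return frontier

# Hypothetical template path learned from one labeled detail page.
TITLE_PATH = [("html", None), ("body", None), ("div", "content"), ("h1", "title")]

# A new page instance that follows the same template.
NEW_PAGE = """<html><body>
<div class="sidebar"><h1 class="title">Sponsored</h1></div>
<div class="content"><h1 class="title">Breaking news</h1><p class="time">2011-05-05</p></div>
</body></html>"""

root = ET.fromstring(NEW_PAGE)
found = [element.text for element in find_by_path(root, TITLE_PATH)]
needs_labeling = not found  # an empty result signals that the page needs a new label
print(found, needs_labeling)
```

Because each level filters the candidates before descending, the sidebar subtree is dropped at the third step and its `h1` is never examined. An empty result is the signal, in this sketch, that the new instance does not fit the template and would have to be labeled.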

A system called InfoViz was implemented to evaluate the performance of the proposed method. InfoViz is a visually aided template learning system that lets users define and label the desired data regions and items of a page directly in an embedded browser. Experimental results show that the proposed method performs well and is competitive with state-of-the-art methods. Notably, the system can extract data accurately from list and detail pages with diverse structures, including blogs, forums, news sites, and e-commerce sites.

Figure 2. Visual validation of matching results in web pages.

More details can be found in the following paper:
Ting-Chun Peng, “InfoViz: An Instance-based Template Learning System for Web Data Extraction in List and Detail Pages,” Institute for Information Industry, 2010. [pdf]

References:
[1]     Chang, C., Kayed, M., Girgis, M. R., and Shaalan, K. F., “A survey of web information extraction systems,” IEEE Transactions on Knowledge and Data Engineering, vol. 18, p. 1411, 2006.
[2]     Liu, B., Web Data Mining, Springer-Verlag, 2007.

Techniques/Skills:

  • C# WinForms, DOM Parser, CSS, IE WebBrowser control, Pattern Recognition