I gave a talk at the Spark Taiwan User Group today to share my experience and some general tips for participating in Kaggle competitions.
Here is the SlideShare link for your reference. I hope it helps anyone who wants to get good results on Kaggle!
Placed 7th out of 1,334 teams in the Kaggle CrowdFlower Search Results Relevance contest! This contest asks Kagglers to use machine learning models to predict the relevance scores of search results.
In this post I would like to share some of our findings and experience in NLP, feature extraction and feature selection, as well as my ensemble architecture. Feature engineering helped us a lot in improving the performance of our different models.
Due to the limitations of the Web editor, I cannot format the post well here, so please follow the link below to download and read the full post:
I hope it helps other Kagglers to some extent.
To build a fine-grained search engine that transforms raw data into valuable information, I argue that we need three key steps:
Crawl -> Extract -> Search
As depicted in Figure 1, fault-tolerant crawlers are first designed to gather heterogeneous content through different channels (HTTP requests, social APIs, Telnet, etc.). Then, using the InfoViz tool, we can extract detailed information from each raw Web page (for instance, the title, post time, author name and content of a news post, or the price, name and description of products that appear on a 3C product page). Finally, a distributed indexing/searching framework helps provide a near-real-time API service for users. Through these steps, we gain the ability to search fine-grained data using detailed queries like "How many distinct authors mentioned a particular event within a fixed date range?" or "Which 3C website has the cheapest price for the iPad2?". That gives users more satisfying search results!
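To make the first sample query concrete, here is a minimal sketch of what "fine-grained" search buys us once the Extract step has produced structured records. The record fields (`author`, `posted`, `content`), the sample posts and the `distinct_authors` helper are all assumptions made up for illustration. A real deployment would run this kind of query against the distributed index rather than an in-memory list.

```python
from datetime import date

# Hypothetical records, shaped the way an Extract step might emit them.
POSTS = [
    {"author": "alice", "posted": date(2011, 3, 1), "content": "The iPad2 launch event was crowded"},
    {"author": "bob", "posted": date(2011, 3, 2), "content": "Thoughts on the iPad2 launch"},
    {"author": "alice", "posted": date(2011, 3, 5), "content": "More on the iPad2 launch"},
    {"author": "carol", "posted": date(2011, 4, 1), "content": "iPad2 launch, one month later"},
    {"author": "dave", "posted": date(2011, 3, 3), "content": "Unrelated news"},
]


def distinct_authors(records, keyword, start, end):
    """Return the set of authors who mentioned `keyword` between `start` and `end` (inclusive)."""
    keyword = keyword.lower()
    return {
        r["author"]
        for r in records
        if start <= r["posted"] <= end and keyword in r["content"].lower()
    }


authors = distinct_authors(POSTS, "iPad2 launch", date(2011, 3, 1), date(2011, 3, 31))
print(len(authors), sorted(authors))
```

This query is only possible because author and post time were extracted as separate fields. An index built over the raw page source could not answer it.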
Why is opinion mining important?
With the rapid growth in the number of Internet users who rely on UGC services (such as Facebook, Twitter, Plurk and LinkedIn) in their daily lives, it is very common for users to share their opinions about everyday events. Those events may include buying a new product, having dinner at a restaurant, a political rumor, sharing a funny YouTube video, etc. If we could gather all these opinionated reviews and analyze them using opinion mining techniques, it would help companies and governments understand the sentiment orientation (positive or negative) of word-of-mouth opinions (with detailed summaries and statistical charts/graphs). Numerous studies have been conducted on opinion mining in English documents [1, 2]. However, only a few have addressed Chinese. Continue reading »
Named Entity Extraction/Recognition (NER) has become a major task in Natural Language Processing (NLP). Named Entities (NE) represent important parts of the meaning of human-written sentences, such as persons, affairs, times, places and objects. Most NER research in English and Chinese has focused on recognizing names of persons, organizations and locations, and numeric entities including time, date, and so on. For the task of Chinese Named Entity Extraction, we first focused on understanding the linguistic characteristics of the Chinese language and applied those ideas to a special entity: cuisine names.
Why did we choose this?
A 2008 market research study surveying 1,200 online consumers showed that over 80% of online consumers decide between two or three products based on consumer reviews. Another market research study, surveying 2,000 U.S. Internet users in 2007, revealed that restaurant reviews are the most influential type of review, leading 41% of review readers to subsequently make purchases. Many popular review-aggregating websites, for example Yelp, Google Local Search and Yahoo! Local, also emphasize collecting restaurant reviews. When mining the opinions expressed in those reviews, cuisine names (like other entities) are very important clues to the main topics and targets of a user review. Continue reading »
Figure 1. InfoViz system screenshot
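The main goal of this project is to build a system that automatically learns where specific information blocks are located within the structure of different kinds of Web pages, so that content can be extracted precisely (Information Extraction) from a large number of structurally heterogeneous websites, supporting the development of next-generation search engines and semantic analysis techniques. Traditional search engines focus only on collecting Web pages completely and quickly through crawlers, then indexing the "entire page source" for user search. This indexing approach causes the following problems: (1) programs cannot automatically understand what important information a page stores (a shopping site's pages hold important information such as product name, description and price); (2) a page may carry a lot of text unrelated to its main content (such as banner advertisements and repeated descriptive text), and because indexing takes all page text into account, search results can include pages completely unrelated to the user's keywords (only because a related term appeared in an advertisement block on the page).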
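With a technique for labeling and learning information blocks in Web pages, we can define and keep track of the valuable information each page carries (achieving an effect similar to semantic tagging), which is a great help for large-scale Web data collection. Once we can build page templates for different types of pages (such as news, blogs and forums) and automatically extract important information such as title, time and body text, we not only get more precise indexing of page content, but can also determine the time each page represents. This helps us avoid being denied service for visiting a particular website too often (that is, we perform Incremental Crawling by visiting only recently added pages, reducing unnecessary visits).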
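The incremental crawling idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `LISTING` stands in for (URL, post time) pairs that a learned template has already extracted from a site's index page, and `select_new_pages` is a hypothetical helper, not part of the actual system.

```python
from datetime import datetime

# Hypothetical (url, post time) pairs extracted from an index page via a learned template.
LISTING = [
    ("http://example.com/news/101", datetime(2011, 3, 1, 9, 0)),
    ("http://example.com/news/102", datetime(2011, 3, 2, 14, 30)),
    ("http://example.com/news/103", datetime(2011, 3, 3, 8, 15)),
]


def select_new_pages(listing, last_crawled):
    """Return only the URLs whose extracted post time is later than the previous crawl."""
    return [url for url, posted in listing if posted > last_crawled]


# Only pages posted after the previous crawl are queued for fetching.
to_fetch = select_new_pages(LISTING, last_crawled=datetime(2011, 3, 2, 0, 0))
print(to_fetch)
```

Because the post time is available before each page is requested, the crawler can skip pages it has already seen and keep its request rate on any single site low.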
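The technique uses the attribute and structural features of Web pages to identify where the information blocks that a user has manually labeled are located in the page, and it can generalize from a single example (the user labels just one example, and the technique automatically identifies the other places in the same page that share the same features). Through a special script, the learned site templates can be saved as files and used as input for other Web applications that parse page content, achieving reusability. Continue reading »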
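As a rough illustration of generalizing from one labeled example, the sketch below treats the chain of tag names and `class` attributes leading to a text node as its structural signature, then returns every text node on the page that shares the signature of the labeled one. This is an assumed simplification for illustration only, not the actual InfoViz algorithm. The names `SignatureCollector`, `generalize` and the sample `PAGE` are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class SignatureCollector(HTMLParser):
    """Records a (structural signature, text) pair for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open elements from the root down to the current node
        self.blocks = []  # collected (signature, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        css_class = dict(attrs).get("class") or ""
        self.stack.append(f"{tag}.{css_class}" if css_class else tag)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.blocks.append(("/".join(self.stack), text))


def generalize(html, labelled_text):
    """Given one labeled text block, return all blocks with the same signature."""
    parser = SignatureCollector()
    parser.feed(html)
    parser.close()
    signature = next((sig for sig, text in parser.blocks if text == labelled_text), None)
    if signature is None:
        raise ValueError("labelled text not found in page")
    return [text for sig, text in parser.blocks if sig == signature]


PAGE = """
<html><body>
<div class="banner">Buy now!</div>
<ul>
<li class="item"><span class="name">iPad2</span><span class="price">$499</span></li>
<li class="item"><span class="name">Kindle</span><span class="price">$139</span></li>
</ul>
</body></html>
"""

# The user labels a single price; the other price is found automatically.
print(generalize(PAGE, "$499"))
```

The learned signature is a plain string, so it could be saved to a file and reused by other programs, in the spirit of the reusable templates described above.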
Motivation and System Requirements:
The Pathterpreter Project was originally initiated by the Institute of Chemistry, Academia Sinica, Taiwan. Over the past decades of biological pathway research, a huge amount of pathway data has been generated and stored in pathway databases of differing formats built by different biological research institutions, for example KEGG, Panther, BioCarta and GenMapp. Although the websites of those databases provide search access to pathway data, pathway researchers still have to manually map the experimental data they generate onto pathway data stored in heterogeneous databases in order to find any proteins that may cause diseases. This costs a great deal of time and labor.
Motivated by those problems, researchers at the Institute of Chemistry asked for a system that integrates all the pathway databases in common use (KEGG, Panther and BioCarta) and automatically produces detailed information on the proteins in specific pathways that match the variant experimental proteins. The system could also show statistical information and rankings based on certain criteria. Continue reading »
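To give a feel for the matching and ranking requirement, here is a minimal sketch, assuming each database has already been converted into a common shape (pathway name mapped to a set of protein identifiers). The pathway memberships in `DATABASES` are toy data invented for illustration, not real database contents. The ranking criterion (number of matched proteins, then coverage) is also an assumption, since the actual criteria are not described here.

```python
# Toy data in an assumed common format: database -> pathway -> set of protein IDs.
DATABASES = {
    "KEGG": {
        "MAPK signaling": {"MAPK1", "RAF1", "HRAS"},
        "Apoptosis": {"CASP3", "BAX"},
    },
    "Panther": {
        "Ras pathway": {"HRAS", "RAF1"},
        "p53 pathway": {"TP53", "BAX"},
    },
    "BioCarta": {
        "Caspase cascade": {"CASP3", "CASP9"},
    },
}


def match_pathways(databases, experiment_proteins):
    """List pathways containing any experiment protein, best matches first."""
    hits = []
    for db_name, pathways in databases.items():
        for pathway, proteins in pathways.items():
            matched = proteins & experiment_proteins
            if matched:
                hits.append({
                    "database": db_name,
                    "pathway": pathway,
                    "matched": sorted(matched),
                    "coverage": len(matched) / len(proteins),
                })
    # Assumed ranking: more matched proteins first, then higher coverage, then name order.
    hits.sort(key=lambda h: (-len(h["matched"]), -h["coverage"], h["database"], h["pathway"]))
    return hits


for hit in match_pathways(DATABASES, {"RAF1", "HRAS", "BAX"}):
    print(hit["database"], hit["pathway"], hit["matched"], round(hit["coverage"], 2))
```

Once the databases share one format, the manual cross-database lookup reduces to a set intersection per pathway, which is what removes most of the time and labor cost.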