Chinese Named Entity Extraction

On May 8, 2011, in Personal Works, by markpeng

Figure 1. The idea of extracting cuisine names based on linguistic features.

Named Entity Extraction/Recognition (NER) has become a major task in Natural Language Processing (NLP). Named Entities (NEs) carry important parts of the meaning of human-written sentences, such as persons, affairs, times, places and objects [1]. Most NER research in English and Chinese has focused on recognizing the names of persons, organizations and locations, as well as numeric entities such as times and dates [3]. For the task of Chinese Named Entity Extraction, we first focused on understanding the linguistic characteristics of the Chinese language and then applied these ideas to a special kind of entity: the cuisine name.


Why did we choose this?

A market survey of 1,200 online consumers conducted in 2008 showed that over 80% of online consumers decide between two or three products based on consumer reviews [4]. Another market survey of 2,000 U.S. Internet users in 2007 revealed that restaurant reviews are the most influential type of review, prompting 41% of review readers to subsequently make a purchase [2]. Many popular review-aggregating websites, such as Yelp, Google Local Search, and Yahoo! Local, also emphasize collecting restaurant reviews. When mining the opinions expressed in those reviews, cuisine names (like other entities) are important clues to the main topics and targets of a user review.

In fact, we found that Chinese named entities often have explicit, domain-specific characteristics that can help us extract them correctly. For instance, cuisine (dish) names usually comprise meaningful elements that serve as feature cues:

  • Ingredients – the main ingredients of a dish.
  • Culinary manner – how a dish is cooked or seasoned.
  • Cooking equipment – the type of equipment used for cooking a dish.
  • Origin – where a dish originated from.
  • Appearance – description of the appearance of a dish.
  • Taste – description of the taste of a dish.
  • Transliteration – transliterated dish elements.

For example, 蕃茄牛肉炒飯 (Tomato and Beef Fried Rice) is a Chinese dish name that consists of ingredients (蕃茄 tomato, 牛肉 beef and 飯 rice) and a culinary manner (炒 fried); 法式起司火鍋 (French Cheese Fondue) is a French dish name comprising an ingredient (起司 cheese), cooking equipment (火鍋 fondue) and an origin (法式 French). That is, dish names are basically compound names made up of cuisine-related elements. The basic idea is to extract representative “dish features” that reflect dish elements, with the aid of statistical techniques and a dish corpus, and then to combine these features into a compound dish name.
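
To make the decomposition concrete, here is a minimal Python sketch that splits a dish name into typed elements. It is only an illustration, not our actual implementation: the tiny hand-written lexicon `LEXICON`, the function name `tag_dish_elements`, and the greedy longest-match strategy are all assumptions made for this example, whereas the real dish features are derived statistically from a dish corpus.

```python
# Hypothetical mini-lexicon mapping dish elements to feature types.
# The entries are illustrative only; they are not taken from the real corpus.
LEXICON = {
    "蕃茄": "ingredient",
    "牛肉": "ingredient",
    "飯": "ingredient",
    "起司": "ingredient",
    "炒": "culinary manner",
    "火鍋": "cooking equipment",
    "法式": "origin",
}


def tag_dish_elements(name, lexicon=LEXICON):
    """Split a dish name into (element, feature type) pairs by greedy longest match."""
    max_len = max(len(term) for term in lexicon)
    tagged = []
    i = 0
    while i < len(name):
        # Try the longest possible piece first, then shrink.
        for size in range(min(max_len, len(name) - i), 0, -1):
            piece = name[i:i + size]
            if piece in lexicon:
                tagged.append((piece, lexicon[piece]))
                i += size
                break
        else:
            # No lexicon entry starts here: keep the character as unknown.
            tagged.append((name[i], "unknown"))
            i += 1
    return tagged


for dish in ("蕃茄牛肉炒飯", "法式起司火鍋"):
    print(dish, tag_dish_elements(dish))
```

With this toy lexicon, the two example names above are split into the same elements described in the text.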

How does it work?

The process of our proposed extraction method is depicted below.


Figure 2. The process of cuisine name extraction.

The algorithm aggregates the feature scores inside a candidate name in two parts:
1. the sum of the precomputed weights of all domain feature terms that appear in the candidate.
2. the featured prefix/suffix terms that match, weighted by a mechanism similar to IDF (inverse document frequency).
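
The two parts can be sketched roughly as follows. This is a simplified Python illustration under stated assumptions, not the exact algorithm shown in the figure: the function names, the sample weights, and the specific formula `log(N / (1 + df))` are hypothetical choices that only mimic the general shape of an IDF-style weight.

```python
import math


def idf_like_weight(term, doc_freq, total_docs):
    """IDF-style weight (assumed form): terms rarer in the collection score higher."""
    return math.log(total_docs / (1 + doc_freq.get(term, 0)))


def score_candidate(candidate, feature_weights, affix_terms, doc_freq, total_docs):
    """Aggregate the two partial scores for one candidate name."""
    # Part 1: sum the precomputed weights of every domain feature term that appears.
    feature_score = sum(
        weight for term, weight in feature_weights.items() if term in candidate
    )
    # Part 2: add an IDF-like weight for each featured prefix or suffix that matches.
    affix_score = sum(
        idf_like_weight(term, doc_freq, total_docs)
        for term in affix_terms
        if candidate.startswith(term) or candidate.endswith(term)
    )
    return feature_score + affix_score


# Hypothetical sample data, for illustration only.
feature_weights = {"牛肉": 1.5, "炒": 1.0}
affix_terms = {"飯"}
doc_freq = {"飯": 9}
total_docs = 100

for candidate in ("蕃茄牛肉炒飯", "今天天氣"):
    print(candidate, score_candidate(candidate, feature_weights, affix_terms, doc_freq, total_docs))
```

In this sketch a candidate with no matching feature or affix terms scores zero, so a threshold on the total score could separate likely dish names from other strings; the threshold itself would have to be tuned on real data.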


Figure 3. The proposed algorithm.

To show its usefulness, we implemented a demonstration website, CuisineGuide. The website shows a “cuisine map” generated by mining more than 12,000 Chinese restaurant reviews, and it provides detailed information for categorized restaurants (e.g., address, telephone number and popular cuisines). Users can easily find popular restaurants that fit their tastes. The potential value of cuisine name extraction is twofold: it could help recognize the popular cuisines and dishes of most restaurants from abundant User Generated Content (UGC), and it could help summarize up-to-date opinions and reviews about restaurants using sentiment mining techniques. To the best of our knowledge, this is the first study on cuisine name extraction.

Can it do more than just extract cuisine names?

Actually, we have applied a similar algorithm to the task of extracting person names from Web content (e.g., news and social media), and the results were very satisfying. We can now draw people networks for hot topics based on the extracted person names. Such a network helps users quickly understand the main participants in a particular topic or event, and it enables a new way of searching in which users filter the results to keep only those that mention the people they care about.

Figure 4. People network constructed from person names extracted from news data.
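
As a rough illustration of how such a network and filter could be built, the Python sketch below links two people whenever they are mentioned in the same article. The co-occurrence rule, the function names, and the sample names are assumptions made for this example; they are not a description of the actual system.

```python
from collections import Counter
from itertools import combinations


def build_people_network(docs_people):
    """Count how often each pair of people is mentioned in the same document."""
    edges = Counter()
    for people in docs_people:
        # Deduplicate and sort so each pair is counted once per document
        # and always stored in the same order.
        for a, b in combinations(sorted(set(people)), 2):
            edges[(a, b)] += 1
    return edges


def filter_docs_by_person(docs_people, person):
    """Return the indices of documents that mention the given person."""
    return [i for i, people in enumerate(docs_people) if person in people]


# Hypothetical extraction output: one list of person names per news article.
docs_people = [
    ["Alice", "Bob"],
    ["Bob", "Alice", "Carol"],
    ["Carol"],
]

network = build_people_network(docs_people)
print(network.most_common())
print(filter_docs_by_person(docs_people, "Carol"))
```

The edge counts can serve as link weights when drawing the network, and the filter function corresponds to keeping only the results that mention a chosen person.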


More details can be found in the following papers:

Ting-Chun Peng, Chia-Chun Shih, “Mining Chinese Restaurant Reviews for Cuisine Name Extraction: An Application to Cuisine Guide Service,” International Conference on Information Engineering and Computer Science (ICIECS '09), 2009.
Chia-Chun Shih, Ting-Chun Peng and Wei-Shen Lai, “Mining the Blogosphere to Generate Local Cuisine Hotspots for Mobile Map Service,” Fourth International Conference on Digital Information Management (ICDIM '09), pp. 152-159, 2009.

(U.S. and Taiwan patent applications filed)

References:
[1] Chen, C. and Lee, H. J., “A Three-Phase System for Chinese Named Entity Recognition,” in Proceedings of ROCLING, 2004, pp. 39-48.
[2] comScore, “Online Consumer-Generated Reviews Have Significant Impact on Offline Purchase Behavior,” 2007.
[3] Nadeau, D. and Sekine, S., “A survey of named entity recognition and classification,” Linguisticae Investigationes, vol. 30, pp. 3-26, 2007.
[4] PowerReviews, “New Social Shopping Study from the e-tailing group and PowerReviews; Discovers New Breed of Shopper – The Social Researcher,” 2007.

Techniques/Skills:

  • Java, MySQL, Chinese Word Segmentation, Association Rule, Mutual Information, Feature-based Scoring