Chinese Opinion Mining for the Web

On May 14, 2011, in Personal Works, by markpeng

Figure 1. Bing results for the query term “很輕巧” (very lightweight).

Why is opinion mining important for everyone?

With the rapid growth in the number of Internet users who rely on user-generated content (UGC) services (such as Facebook, Twitter, Plurk and LinkedIn) in their daily lives, it is very common for users to share their opinions about everyday events. Such events may include buying a new product, having dinner at a restaurant, hearing a political rumor, sharing a funny YouTube video, and so on. If we could gather all these opinionated reviews and analyze them with opinion mining techniques, companies and governments could better understand the sentiment orientation (positive or negative) of word-of-mouth opinions, supported by detailed summaries and statistical charts. Numerous studies have addressed opinion mining in English documents [1, 2]. However, only a few have addressed Chinese.

Classical research on Sentiment Classification

Turney [1, 2] proposed an unsupervised Semantic Orientation (SO) method, which predicts the sentiment orientation of an extracted phrase by measuring the word association between the phrase and a set of positive/negative reference word pairs (RWPs). His PMI-IR algorithm measures word associations using the NEAR operator supported by the AltaVista search engine. In an SO approach, a phrase is positive when it is strongly associated with a small set of positive reference words (e.g., excellent, good), and negative when it is strongly associated with negative reference words. The strength of association is derived from the probability of nearby co-occurrence in a large corpus. RWPs are therefore central to SO approaches. However, choosing reference words is tricky, and their appropriateness significantly influences the performance of an SO approach. In practice, it is very hard to find a set of RWPs that works for all domains, and it is even harder for Chinese!
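As a rough illustration of the SO idea, the sketch below computes a semantic orientation score from search-engine hit counts, following the log-ratio form used in [1] with the single reference pair "excellent"/"poor". The function name, the hit counts and the smoothing constant are illustrative assumptions; no real search engine is queried.

```python
import math


def so_pmi(hits_near_pos, hits_near_neg, hits_pos, hits_neg, smoothing=0.01):
    """Semantic orientation of a phrase from hit counts (PMI-IR style).

    hits_near_pos: hits for the phrase NEAR the positive reference word
    hits_near_neg: hits for the phrase NEAR the negative reference word
    hits_pos, hits_neg: hits for each reference word on its own
    smoothing: small constant that avoids division by zero (assumed value)
    """
    numerator = (hits_near_pos + smoothing) * hits_neg
    denominator = (hits_near_neg + smoothing) * hits_pos
    return math.log2(numerator / denominator)


# Hypothetical hit counts, made up for this example.
score = so_pmi(hits_near_pos=120, hits_near_neg=15,
               hits_pos=1_000_000, hits_neg=800_000)
label = "positive" if score > 0 else "negative"
print(label, round(score, 2))
```

A score above zero means the phrase co-occurs more with the positive reference word than with the negative one, after correcting for how common each reference word is. The result depends entirely on which reference words are chosen, which is the weakness discussed above.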

Two challenging issues for Opinion Mining

I argue that current opinion mining research faces two issues:
1. For SO methods, reference word pairs (RWPs) are very hard to find, and the resulting performance is generally unsatisfying.
2. For all methods, a good sentiment lexicon is essential to the success of an opinion mining task; however, most sentiment lexicons are compiled manually, and the number of sentiment phrases they contain is limited. Some sentiment phrases are out of date, and many newly coined phrases cannot be found in current lexicons.

How can we resolve them?

Figure 2. Illustrations of the proposed concept.

To resolve the above issues, I proposed an unsupervised snippet-based sentiment classification method for unknown Chinese sentiment phrases, which in theory is also applicable to other languages. In contrast to [1] and [2], the proposed method does not require any RWPs. Instead, it uses an existing sentiment lexicon. Assuming that sentiment phrases tend to co-occur in relevant documents (e.g., snippets) with a clear sentiment orientation, we analyze the top-N relevant snippets returned by a search engine and predict the sentiment of an unknown phrase by summarizing the sentiments of the known sentiment words that appear within a fixed context window in each snippet. Nowadays, it is relatively easy to build a sentiment lexicon of common, basic sentiment words from public Internet resources such as the NTU Sentiment Dictionary (NTUSD) and HowNet for Chinese, and General Inquirer and SentiWordNet for English.
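The snippet-based idea can be sketched roughly as follows. This is a minimal illustration, not the implementation from the papers: the function name, the character-based window, the simple polarity sum, and the tiny lexicon and snippets are all assumptions made for the example, and the snippets are hard-coded rather than fetched from a search engine.

```python
def classify_phrase(phrase, snippets, lexicon, window=10):
    """Predict the orientation of an unknown phrase from search snippets.

    lexicon maps known sentiment words to +1 (positive) or -1 (negative).
    window is the number of characters of context kept on each side of
    every occurrence of the phrase (an assumed, character-based window).
    """
    total = 0
    for snippet in snippets:
        start = snippet.find(phrase)
        while start != -1:
            end = start + len(phrase)
            left = snippet[max(0, start - window):start]
            right = snippet[end:end + window]
            # Sum the polarity of every known word inside the window.
            for context in (left, right):
                for word, polarity in lexicon.items():
                    total += polarity * context.count(word)
            start = snippet.find(phrase, end)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "unknown"


# Toy lexicon and snippets, invented for this example.
lexicon = {"喜歡": 1, "方便": 1, "失望": -1, "糟糕": -1}
snippets = [
    "這台筆電很輕巧，攜帶方便，我很喜歡",
    "相機很輕巧，但是畫質令人失望",
    "很輕巧又方便的行動電源",
]
print(classify_phrase("很輕巧", snippets, lexicon))
```

Here three positive words and one negative word fall inside the windows, so the phrase is labeled positive. Shrinking the window to two characters excludes all of them, which hints at why window size matters for a method of this kind.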

Figure 3. Comparison of accuracy for different lexicon sizes
(with top-100, top-500 and top-800 snippets).

This research shows that a sentiment lexicon can be expanded automatically with the aid of a search engine and an initial lexicon. In theory, the proposed method can also be applied to other languages. The main contributions of this work can be summarized as follows: (1) it proposes a language-independent method that analyzes the sentiments of known words inside snippets to predict the sentiment orientation of an unknown phrase with relatively few queries; (2) it investigates how window size, top-N size and lexicon size influence the proposed method; (3) the proposed method can be combined with existing state-of-the-art methods to improve opinion extraction from UGC reviews.

More details can be found in the following papers:

Ting-Chun Peng, Chia-Chun Shih, “An Unsupervised Snippet-Based Sentiment Classification Method for Chinese Unknown Phrases without Using Reference Word Pairs,” IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT ’10), vol. 3, pp. 243-248, 2010.
Ting-Chun Peng, Chia-Chun Shih, “Using Chinese Part-of-Speech Patterns for Sentiment Phrase Identification and Opinion Extraction in User Generated Reviews,” Fifth International Conference on Digital Information Management (ICDIM ’10), 2010.

[1] Turney, P. D., “Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews,” Proceedings of ACL annual meeting, 2002, pp. 417-424.
[2] Turney, P. D. and Littman, M. L., “Measuring praise and criticism: Inference of semantic orientation from association,” ACM TOIS, vol. 21, pp. 315-346, 2003.


  • Java, MySQL, Part-of-Speech Tagging, Bing API, Mutual Information, Sentiment Lexicon Expansion
