School of Science and Technology 科技學院
Computing Programmes 電腦學系

Automatic Digest Generation for Mobile Phone Online Reviews

YEUNG Wing Hong

  
Programme: Bachelor of Science with Honours in Computing
Supervisor: Dr. Andrew Lui
Areas: Text Mining for Intelligent Applications
Year of Completion: 2011

 

Objectives

The aim of the project is to design a system that uses opinion mining techniques to help buyers reorganize chaotic review information into a neat form. The system focuses on phone reviews, and its main objective is to generate a phone summary from an available online phone review site. In our framework, a feature and its corresponding opinion are the information people are most interested in. Hence, the summary categorizes positive and negative opinions under different features. By reorganizing the information in this way, the system lets readers who are interested only in the positive or negative opinions on a specific feature save search time, and it helps them make sense of conflicting opinions.

Background and Methodology

With the rapid growth of the web, people have become more comfortable with the internet and increasingly write online reviews. Online reviews have become one of the most valuable and easily accessible resources for buyers making purchase decisions. A consumer who wants to buy a phone will probably go to an online review site, such as cNet or Amazon, and read the reviews before deciding. To avoid bias, people prefer to read as many reviews as possible. However, when digesting many reviews, a consumer may suffer from information overload, because the information is scattered across the reviews, and the differing perspectives of the reviewers can lead to misunderstanding. The consumer may be confused by this chaotic information.

Our system uses opinion mining to generate the summary from the reviews. Opinion mining is a main research direction in text sentiment analysis (Zhang 2008). It refers to deriving opinion information, such as sentiment, from a sentence. Besides opinion mining, our work is related to synonym grouping and sentiment classification.

The system first downloads the reviews from cNet. The reviews undergo pre-processing and are split into sentences. Then the feature classifier extracts and classifies the feature in each sentence. After that, the opinion classifier extracts the corresponding opinion and identifies the semantic orientation of the sentence. Finally, the summary is produced. The system overview is shown below:

[Figure: system overview]

The main task of the feature classifier is feature identification and classification. For feature identification, the system extracts the feature together with its sentence. For example, the sentence “There are two more keys placed under the display” is extracted and labeled with the feature “display”. For feature classification, the system groups the sentences into categories. In our application, features are classified into six categories: general, display and control, camera, sound, connectivity, and battery life. The general category covers attributes of the phone such as size, speed, and appearance. For example, the features “clock” and “processor” have similar meanings in this domain, so we classify them together.

The main task of the opinion classifier is to identify and classify opinions. For opinion identification, the system extracts the opinion words of each sentence according to the extracted feature. For example, “large” is the opinion word in the sentence “The handset has a comfortable keypad and a large display”. For opinion classification, the system groups the sentences into two categories, positive and negative. After the opinion words are extracted, the system determines their orientation. The semantic orientation of a sentence depends heavily on its opinion words: if most of the opinion words in the sentence are positive, the semantic orientation of the sentence is positive too.

The system methodology is shown in detail below:

[Figure: system methodology]

LSR pattern matching is responsible for automatic rule pattern generation. An example is shown below:

[Figure: LSR pattern matching example]

When LSR mining is complete, we set the support and confidence thresholds for feature word mining by experiment. In our system, a rule is considered feasible if its support is larger than 0.02% and its confidence is larger than 60%. When a new review is mined, the keyword list and the rule patterns are matched against each sentence, and matching sentences are extracted as candidate segments for further opinion grouping.
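The threshold test described above can be sketched as follows. This is only an illustration: the `RuleStats` record, its field names, and the use of the standard association-rule definitions of support and confidence are assumptions, not details taken from the project.

```python
from dataclasses import dataclass

# Thresholds quoted in the text: support > 0.02% and confidence > 60%.
MIN_SUPPORT = 0.0002
MIN_CONFIDENCE = 0.60


@dataclass
class RuleStats:
    """Counts for one mined pattern (hypothetical structure)."""
    pattern: str          # textual form of the pattern, for display only
    matches: int          # training sentences the pattern matches
    correct: int          # matched sentences that really carry the feature
    total_sentences: int  # size of the training corpus

    @property
    def support(self) -> float:
        # Share of all sentences in which the rule fires correctly.
        return self.correct / self.total_sentences

    @property
    def confidence(self) -> float:
        # Share of the pattern's matches that are correct.
        return self.correct / self.matches if self.matches else 0.0


def feasible_rules(rules):
    """Keep only rules that exceed both thresholds."""
    return [
        r for r in rules
        if r.support > MIN_SUPPORT and r.confidence > MIN_CONFIDENCE
    ]
```

A rule with 8 correct matches out of 10 in a 10,000-sentence corpus passes both tests (support 0.08%, confidence 80%), while one with 5 correct out of 10 fails on confidence.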

After finding the features within the sentences, we need to group the discrete features into categories to obtain a clear summary. The summary is more meaningful to buyers if highly related features are grouped together. Using WordNet to group synonyms is one of the commonly used methods. The WordNet ontology, which is built from the isA relationships among nouns and has 8 paths from “image” to “size”, is detailed below:

[Figure: WordNet ontology from “image” to “size”]
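To illustrate how isA links can relate two feature words, here is a minimal sketch that measures the shortest path between two nouns in a small hand-made hierarchy. The `IS_A` edges are hypothetical stand-ins for what WordNet would supply, and using path length as the grouping criterion is an assumption made for the example.

```python
from collections import deque

# Hypothetical isA edges (child -> parent); a real system would read
# these from WordNet rather than hard-code them.
IS_A = {
    "touchscreen": "screen",
    "screen": "display",
    "display": "device_part",
    "lens": "camera",
    "camera": "device_part",
}


def build_graph(is_a):
    """Turn child -> parent links into an undirected adjacency map."""
    graph = {}
    for child, parent in is_a.items():
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


def path_length(graph, start, goal):
    """Breadth-first search; returns the number of edges or None."""
    if start not in graph or goal not in graph:
        return None
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


GRAPH = build_graph(IS_A)
```

In this toy hierarchy "touchscreen" is two edges from "display" but five from "lens", so a small path-length cut-off would place the first pair in the same category and keep the second pair apart.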

For opinion identification, we assume that when an author mentions a feature, there must be an opinion word near it (Popescu and Etzioni, 2005). Under this assumption, we can use the dependency grammar of the Stanford NLP parser to extract the opinion words. An example is shown below:

[Figure: dependency-based opinion word extraction example]
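The idea of following dependency links from a feature to its opinion word can be sketched like this. The triples below are written by hand for the sentence quoted earlier and are not actual parser output; the function name and the restriction to the adjectival-modifier relation are assumptions made to keep the example short.

```python
# Hand-written dependency triples (relation, head, dependent) for
# "The handset has a comfortable keypad and a large display".
# A real system would obtain these from a dependency parser.
DEPENDENCIES = [
    ("nsubj", "has", "handset"),
    ("dobj", "has", "keypad"),
    ("amod", "keypad", "comfortable"),
    ("conj", "keypad", "display"),
    ("amod", "display", "large"),
]


def opinion_words(feature, dependencies):
    """Return adjectives that directly modify the given feature noun."""
    return [
        dependent
        for relation, head, dependent in dependencies
        if relation == "amod" and head == feature
    ]


print(opinion_words("display", DEPENDENCIES))  # ['large']
print(opinion_words("keypad", DEPENDENCIES))   # ['comfortable']
```

A fuller version would also need to handle predicative adjectives ("The display is large"), which are attached to the feature through a subject relation rather than a modifier relation.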

For opinion classification, an opinion lexicon is built. We iteratively search WordNet for the synonyms and antonyms of the words and put them into the same set or the opposite set, for 3 iterations. Because some words in WordNet are polysemous, we find that 3 iterations are suitable for producing the lexicon. An example is shown below:

[Figure: opinion lexicon expansion example]
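A minimal sketch of the iterative expansion is given below. The small `SYNONYMS` and `ANTONYMS` tables are hypothetical replacements for WordNet look-ups, and the rule for dropping words that land in both sets in the same round is an assumption, since the text does not say how such conflicts are resolved.

```python
# Hypothetical stand-ins for WordNet synonym and antonym look-ups.
SYNONYMS = {
    "good": {"nice", "fine"},
    "nice": {"pleasant"},
    "bad": {"poor"},
    "poor": {"weak"},
}
ANTONYMS = {
    "good": {"bad"},
    "bad": {"good"},
}


def expand_lexicon(pos_seeds, neg_seeds, iterations=3):
    """Grow positive and negative word sets for a fixed number of rounds."""
    positive, negative = set(pos_seeds), set(neg_seeds)
    for _ in range(iterations):
        new_pos, new_neg = set(), set()
        for word in positive:
            new_pos |= SYNONYMS.get(word, set())  # same orientation
            new_neg |= ANTONYMS.get(word, set())  # opposite orientation
        for word in negative:
            new_neg |= SYNONYMS.get(word, set())
            new_pos |= ANTONYMS.get(word, set())
        known = positive | negative
        new_pos -= known                 # keep earlier assignments
        new_neg -= known
        conflict = new_pos & new_neg     # ambiguous words are skipped
        positive |= new_pos - conflict
        negative |= new_neg - conflict
    return positive, negative
```

Starting from the single seed "good", the first round adds "nice" and "fine" to the positive set and "bad" to the negative set; later rounds reach "pleasant", "poor" and "weak". Capping the number of rounds limits how far a polysemous word can drag unrelated senses into the lexicon.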

After compiling the opinion lexicon, the system can identify the opinion orientation of each sentence. Orientation is determined at the sentence level: each sentence is classified as positive or negative. A positive word is assigned a score of +1 and a negative word a score of –1, and the scores are summed. If the sentence has a positive score, it is a positive sentence; otherwise, it is a negative sentence. The summary is shown below:

[Figure: generated summary]
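The scoring rule in the paragraph above can be sketched as follows. The two word sets are hypothetical lexicon entries chosen for the example; only the +1/-1 scoring and the "positive if the sum is above zero, otherwise negative" decision come from the text.

```python
# Hypothetical lexicon entries used only for this illustration.
POSITIVE = {"comfortable", "large", "good"}
NEGATIVE = {"poor", "slow"}


def sentence_score(opinion_words):
    """Sum +1 for each positive word and -1 for each negative word."""
    score = 0
    for word in opinion_words:
        if word in POSITIVE:
            score += 1
        elif word in NEGATIVE:
            score -= 1
    return score


def orientation(opinion_words):
    """Positive only when the score is above zero, as in the text."""
    return "positive" if sentence_score(opinion_words) > 0 else "negative"


print(orientation(["comfortable", "large"]))   # positive
print(orientation(["good", "poor", "slow"]))   # negative
```

Note that under this rule a sentence whose score is exactly zero, including one with no opinion words at all, is labeled negative.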

Evaluation

After 150 sentences are collected, we manually read and tag every sentence with one of the 6 feature categories and one of the 2 opinion categories. To increase reliability, the sentences are tagged by two people instead of one, and we evaluate the agreement between their annotations using Cohen's Kappa coefficient. Cohen's Kappa coefficient, the proportion of agreement corrected for chance between two judges assigning cases to a set of k categories, serves as a measure of reliability. It gives the reader a quantitative measure of the magnitude of agreement between observers. The calculation is based on the difference between how much agreement is actually present and how much agreement would be expected by chance alone. The following two tables show the inter-annotator agreement for the feature classifier and the opinion classifier respectively:

A/B      General  D&C  Camera  Sound  Network  Battery  Null
General       22    6       0      0        0        0     4
D&C            8   37       0      0        0        0     2
Camera         1    0      15      1        1        0     1
Sound          1    0       0     16        1        0     5
Network        0    1       0      0        9        0     0
Battery        1    0       0      0        0        8     0
Null           3    6       0      1        2        0    10

A/B   Pos  Neg
Pos    69    8
Neg     9   23

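As a worked illustration, Cohen's Kappa can be computed from an agreement table such as the two above with a few lines of code. The function and variable names are assumptions made for this sketch; the counts are copied from the tables, with rows for annotator A and columns for annotator B.

```python
def cohens_kappa(matrix):
    """Cohen's Kappa for a square annotator-agreement matrix."""
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of items on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    # Chance agreement: product of the two annotators' marginals.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


FEATURE_AGREEMENT = [
    [22, 6, 0, 0, 0, 0, 4],    # General
    [8, 37, 0, 0, 0, 0, 2],    # D&C
    [1, 0, 15, 1, 1, 0, 1],    # Camera
    [1, 0, 0, 16, 1, 0, 5],    # Sound
    [0, 1, 0, 0, 9, 0, 0],     # Network
    [1, 0, 0, 0, 0, 8, 0],     # Battery
    [3, 6, 0, 1, 2, 0, 10],    # Null
]
OPINION_AGREEMENT = [
    [69, 8],   # Pos
    [9, 23],   # Neg
]

print(round(cohens_kappa(FEATURE_AGREEMENT), 3))
print(round(cohens_kappa(OPINION_AGREEMENT), 3))
```

Applied to the two tables, this gives values of roughly 0.659 and 0.621.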
The Cohen's Kappa coefficient is 0.6587 for the feature classifier and 0.6205 for the opinion classifier. Both are above 0.6, which indicates substantial agreement, so the annotations can be used for evaluation. The following table gives the matching results as well as the precision and recall for the feature classifier:

Test/Standard  General    D&C  Camera  Sound  Network  Battery  Null  Precision
General             10      4       0      0        1        0     2      0.588
D&C                  0     17       0      0        0        0     1      0.944
Camera               0      0      13      0        0        0     1      0.928
Sound                0      0       1     18        0        0     0      0.947
Network              0      1       0      0        5        0     0      0.833
Battery              0      0       0      0        0        5     1      0.833
Null                16     11       1      4        1        0    13
Recall           0.385  0.515   0.867  0.818    0.714        1

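The per-category precision and recall in the table above follow directly from the counts. The sketch below shows one way to compute them; the function name and the choice to skip the Null category are assumptions, and the matrix is copied from the table with rows for the system output and columns for the manual standard.

```python
LABELS = ["General", "D&C", "Camera", "Sound", "Network", "Battery", "Null"]
FEATURE_RESULTS = [
    [10, 4, 0, 0, 1, 0, 2],
    [0, 17, 0, 0, 0, 0, 1],
    [0, 0, 13, 0, 0, 0, 1],
    [0, 0, 1, 18, 0, 0, 0],
    [0, 1, 0, 0, 5, 0, 0],
    [0, 0, 0, 0, 0, 5, 1],
    [16, 11, 1, 4, 1, 0, 13],
]


def precision_recall(matrix, labels, skip="Null"):
    """Return {label: (precision, recall)} for every label except `skip`."""
    result = {}
    for i, label in enumerate(labels):
        if label == skip:
            continue
        hits = matrix[i][i]
        predicted = sum(matrix[i])               # row total
        actual = sum(row[i] for row in matrix)   # column total
        precision = hits / predicted if predicted else 0.0
        recall = hits / actual if actual else 0.0
        result[label] = (precision, recall)
    return result


scores = precision_recall(FEATURE_RESULTS, LABELS)
avg_precision = sum(p for p, _ in scores.values()) / len(scores)
avg_recall = sum(r for _, r in scores.values()) / len(scores)
```

The large counts in the Null row are sentences the system left unlabeled, which is why they lower recall for General and D&C without affecting their precision.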
The average precision is 0.845 and the average recall is 0.716. Both are acceptable. However, we observed that some categories, such as category 1 (general), have relatively low recall. The low recall may be caused by implicit features. An implicit feature is one that does not appear explicitly in the sentence, and it is more difficult to identify than an explicit feature. An implicit feature is usually indicated by an adjective. For example, in “The phone is small”, the word “small” indicates the size of the phone. The features in category 1 are mostly attributes of the phone, so implicit features are more likely to occur there and lower the recall. The following table gives the matching results as well as the precision and recall for the opinion classifier:

Test/Standard    Pos    Neg  Precision
Pos               64     14      0.821
Neg                9     10      0.526
Recall         0.877  0.417

The average precision is 0.6735 and the average recall is 0.647. Although the result is not excellent, it is still valid for our application. However, our application has no algorithm for dealing with neutral opinions and objective facts. Thus, if the application had to classify opinion orientation into 3 categories, the result would drop significantly. In summary, the classifiers we built are valid and acceptable for our application.

Let us use the LG Optimus 2X as the example for the qualitative evaluation. We chose the LG Optimus 2X because it is the first phone with an embedded dual-core processor. There are three types of summary: radar chart, bar chart, and sentence summary. All three are different views of the same result. Some examples of each are shown below:

[Figure: radar chart, bar chart, and sentence summary examples]

Conclusion and Future Development

This paper presents an application that addresses two common problems people face when digesting reviews: information overload and information misunderstanding. Our project aims to reorganize chaotic information into neat information using data mining and natural language processing methods. The main objective is to produce a phone summary that supports buying decisions, with two sub-objectives: building a feature classifier and an opinion classifier to categorize the chaotic information. Our project combines existing methods to produce the feature classifier and the opinion classifier, and our tests show that they are valid for producing summaries based on reviews from cNet.

In future work, we will mainly focus on implicit features and pronoun resolution. A sentence does not always contain feature words; the feature may be expressed implicitly or through a pronoun. By identifying implicit features and resolving pronouns, the recall of the feature classifier can be greatly increased. Furthermore, we will improve our algorithm by using machine learning methods for text, such as SVM, kNN and naïve Bayes, and build a neutral opinion lexicon to extract neutral and objective opinions.

Copyright Yeung Wing Hong and Andrew Lui 2011
