Study of Attribute Word Extraction in Chinese Product Reviews

Size: px

Start display at page:

Download "Study of Attribute Word Extraction in Chinese Product Reviews"

Kelly Carr
5 years ago
Views:

1 The 5th Annual IEEE International Conference on Cyber Technology in Automation, Control and Intelligent Systems June 8-12, 2015, Shenyang, China Study of Extraction in Chinese Product Reviews Ma Rui University of Chinese Academy of Sciences Wuxi CAS Ubiquitous Technology R&D Center CO., LTD Beijing, China Pan Fucheng, Zhou Xiaofeng Shenyang Institute of Automation Chinese Academy of Sciences Shenyang, China Abstract As e-commerce develops quickly and online product reviews dramatically increases, extraction of attribute words in the product review is of significance. In this paper, authors presented an efficient extraction method, with categorization by synonyms, based on ontology library, of product reviews attribute words under the big data environments. The experiment validated the approach and it improves the extraction efficiency of attribute words. Keywords ontology; text mining; product reviews; attribute word extraction I. INTRODUCTION With the rise of the Internet industry, businesses involved in the e-commerce have increased every year, while consumers increasingly incline to online shopping. From 2011 to 2014, China's B2C e-commerce sales grew from $5.537 billion to $11.74 billion [1]. Only on November 11th, 2014, Alibaba China e-commerce platform achieved more than 6.8 million orders [2]. The spring up of shopping online generates massive product reviews that help consumers get quality and service information of some online goods, and help businesses and producers benefit from the feedbacks. But the amount of produce reviews is of such a mass that consumers and businesses is difficult to read one by one. There is a need to extract useful information from the large number of comments. Currently a great quantity of researches are on methods for English text information extraction. With Pointwise Mutual Information method Kamps J et al. identified evaluative words by calculating correlations between all the adjectives in Net and two categories of representative words including seed commendatory term good and seed derogatory term bad [3]. Pancholi G used semi-supervised learning techniques to extract attribute word-evaluative word pairs [4]. Many scholars studied text information extraction in Chinese. Yu Xinsheng, Li Jiyun et al. who used the association rules to extract high-frequency attribute words, used the Pruning Rules to improve accuracy and coverage rate, and then found low-frequency attribute words, which were finally added into attribute word list, by using the adjectives neighboring them [5-6]. Yang Qin made a visual prototype system of evaluation information, in which attribute words, which were separated into two group made of general ones and special ones, were obtained with TF-IDF mothed [7]. Fan Yuehua et al. extracted attribute words through unsupervised methods such as preprocessing, clustering, similarity calculation, noun phrases gather, trim, etc. [8]. However, relevant researches are still undergoing exploratory stage which have not yet progressed to standardized methods of research and experiment. On the account of comparative dispersion on project design, technology utilization, absence of integrated platform of experiment and resources especially in the area of ontology construction regarding product reviews [9], together with a deficiency of corresponding researches under the circumstance of very large data, under which many researches have high computational complexity, therefore, it is necessary to develop methods of extracting information in Chinese product reviews from the large amount of data. This paper proposes a method of classification and extraction of attribute words in product reviews from the Internet, based on a sort of ontology, under the background of the big data. words are the core of commodities commented, recognition and classification of attribute words are important parts of sentiment label extraction and sentiment analysis. Firstly, the method of constructing ontology is introduced; then elaborate how to extract attribute words base on the ontology and how to classify them by synonyms; next, based on this method experiments are carried out and finally conclusions are given. From another perspective, this paper first discusses the method in relation to a single type of commodity, which can be seen as a uniform method for the massive comments of different types of commodities. Reviews in different sorts of commodities are processed using a parallel processing way which are shown in Fig. 1. The dotted line in the Fig. 1 shows the whole procedure depends on the ontology. II. CONSTRUCT THE ATTRIBUTE WORD ONTOLOGY Construction of Chinese ontology research lacks of uniform methods and standards. Different research purposes and different applications have different construction methods. The ontology structure concerned in this paper is shown in Fig. 2. The features of comments between different kinds of online goods vary greatly, so the attribute word libraries are divided according to product categories. This work is supported by the National High Technology Research and Development Program of China (No. 2013AA ) /15/$ IEEE 1730

2 Reviews of Product Type 1 Reviews of Product Type n Ontology Library Parallel Process Fig. 1. Parallel process of attribute word extraction Set 1 Set n To facilitate subsequent work, the attribute word thesauruses are constructed following different product categories. Extract attribute words in the mass product comments from currently popular e-commerce websites with the help of Chinese word segmentation, POS tagging and association rules [6]. To build an attribute word thesaurus, in which new words are added with manual work, categorize all the reviews by product types, such as cellphone women's clothing. Due to the limited number of product attributes, attribute words can easily got through artificial way. Each word library has some characteristics. According to the concept of objects in reality world, each thesaurus should have a hierarchical structure. Node on the upper level contains concept of its underlying child nodes, such as the lower subnode of cellphone can be shell, screen, color and so on; lower child node of screen can be color resolution, etc., as shown in Fig. 3. Besides, attribute words also have synonyms. s in a certain synonym class, which have a center word each, are in the same level. A center word can represent a synonym class in that it has a high frequency in the commodity comments. An example of synonyms is shown in Fig. 4, in which appearance is the center word, and texture, shell, design, model are synonymous. Categorizing words based on similarity can rely on the same center word. Part of the word records in Fig. 3 are shown in TABLE I. Using different coding to distinguish the same word in different classes or levels. Material and material* are two different attribute words in the different levels. Material is a child node of cellphone shown in Fig. 3 whereas material* is a child node of screen. III. CATEGORIZE AND EXTRACT ATTRIBUTE WORDS A. Process on a Single Kind of Product After Chinese word segmentation, stop words removing and POS tagging, attribute word candidate sets are composed by the nouns in the sentences. words, which are more generally nouns, are recognized by matching nouns in the sentences with the words that are of the same to the nouns in the ontology. Ontology Library Library 1 Library 2 Fig. 2. Structure of ontology library Library n Cellphone Shell Screen Color Color Resolution Fig. 3. Hierarchical structure of attribute word library Texture Shell Appearance Fig. 4. Synonymous and their center word Design Model On the basis of the ontology, categorize attribute words in the product reviews regarding a single kind of commodity. Categorization process is divided into two operations: classification and merging. Classification operation put the word into a categories of its center word by traversing all the attribute word candidate sets of a single kind of commodity; merge operation: if two classes both contain fewer words than a certain threshold, then merge them. The purpose of the merging operation is to make sure each class reaches a certain size in order to identify high-frequency attribute words. The overall process of classification and merging is shown in Fig. 5. If a sentence contains multiple attribute words, e.g., there are three attribute words cellphone, screen and color" in The screen of the cellphone is colorful, then use the word color, which is on the lowest level of concept hierarchy in the ontology, instead of the other attribute words in the sentence, to provide a basis for the follow-up merging operation. Merging process is as follows: If the center words of two classes A and B are at the same level and have the same father node in the ontology, and the numbers of words contained in A and B are less than a certain threshold k, then merge them, and the new center word w of the new class is the word in the upper level to the center words of A and B; if the center word of A is the ancestor of the center word of B, and the size class B is less than a certain threshold k, then A and B are merged, the new center word w is the center word of A. Merge until each size of the classes meet the minimum threshold or conditions are not met for the next merging. To formalize the descriptions above, some operations are defined in TABLE II. According to the sub-operations, classification operation can be expressed as: U(C(w), w) Merging operation can be expressed in Fig. 6. Wherein, in line 4 w satisfies the following equation: GT(w, E(A)) AND GT(w, E(B)) = TRUE And in line 7 w satisfies the following formula: w = E(A) 1731

3 TABLE I. TABLE II. EXAMPLES OF ATTRIBUTE WORD RECORD Upper Center Center Cellphone NULL Cellphone Appearance Cellphone Appearance Material Cellphone Appearance Material* Screen Material* Operation C(w) U(A, w) M(A,B) Q(A) L(w1, w2) GT(w1, w2) E(A) S(w1, w2) SUB-OPERATIONS OF CLASSIFICATION AND MERGING Definition Get the center word of the word w Put word w into class A Merge class A and class B Get the size of class A If the words w1 and w2 are on the same level If word w1 is ancestor of word w2 Get the center word of class A If word w1 and w2 have the same father node B. Parallel Process The method described above is how to extract and categorize attribute words in comments in relation to a single type of product. In practice, to process different types of largescale product comments we can make use of the way of parallel computing. In 2004, Google first proposed MapReduce [10] technology, as a parallel computing model for large data analysis and processing, which has aroused widespread concern in industry and academia. MapReduce parallel programming model for the calculation process is divided into two main phases, namely Phase Map and Reduce phases. Map function handles the Key/Value pairs, generating a series of intermediate Key/Value pair; Reduce function is used to merge all intermediate values with the same Key of key-value pairs to calculate the final result. Following the aforementioned method, perform on comments associated with different sorts of commodity simultaneously taking advantage of MapReduce platform to make them parallelized. IV. EXPERIMENTS We show how we get the dataset for experiments and the experiment evaluation indicators we choose. Then the general procedure of experiment, in which we make comparisons based on the indicators, is proposed in the part of experimental result and finally the result is shown. Classify Merge Conditions A. Get the Source Data Use Breadth-first method to obtain a large number of web pages containing product comments from the main Chinese e- commerce sites. Breadth-first approach: Fig. 5. Classification and merging procedure PROCEDURE MERGING IF L( E(A), E(B) ) = TRUE IF S( E(A), E(B) ) = TRUE AND Q(A)<k AND Q(B)<k THEN w = M(A, B) END IF ELSE IF GT( E(A), E(B) ) = TURE AND Q(B)<k THEN w = M(A, B) END IF END PROCEDURE Fig. 6. Merging operation In Fig. 7 (a), screen, color, resolution are three word nodes shown in Fig. 3. Color is the center word of the synonym class of color. Merge color and resolution finally get a class screen ; Merge color and screen eventually get screen as well. In both cases the final result of the merging is shown in Fig. 7 (b). Starting from a vertex V 0, access this vertex; Starting from V 0, access all V 0 s never visited neighbors W 1,,, individually; Starting from W 1,,, successively, access their neighbors that have not been accessed; Repeat the step above until all vertices have been visited, or until the termination condition is satisfied. View a web page as a vertex, then get pages and comments from the initial resource URL, and then by breadth-first method acquire web page URLs to have access to other pages and gain comment resources. Before the experiments the data of reviews linked to different brands of goods, such as digital cameras, and skin care products, are obtained. B. Experimental Evaluation Indicators In the field of big data mining, due to a variety of intended purposes, data sets and evaluation methods used are not exactly the same, it is difficult to seek a totally precise evaluation technique. In this paper, Recall Rate, Precision Rate, their Macro averages and parallel Speedup are used to measure results. 1) Recall and Precision Rates Provided that a comment text with an attribute word that can be extracted actually belongs to a certain class in which the word is included. 1732

4 Hue Color (a) Screen Fig. 7. An example of merging operation Resolution Screen Recall Rate is the ratio of the number of the comment texts that are correctly extracted attribute words and the quantity of texts included in actual classes. It reflects the degree of agreement with the actual categories, regarding avoiding undetected related text. Precision refers to the ratio of the count of the comment texts that are properly disposed and the number of texts that are processed. It reflects the degree of accurate of the method, mainly in connection with avoiding error detection and noise. There is a text set D, in which there are the amount of text processed correctly recorded as TT, the number of text that is not processed properly but belongs to a certain class recorded as FT, the count of text that does not belong to any class but is classified to a class recorded as TF, and the number of text that does not belong to any class and not classified to any class recorded as FF. The formulas of the two evaluation measurements are as follows: TT Recall = TT + FT TT Precision = TT + TF The numerators of the two formulas above are the same, so in order to improve Recall and Precision, we must make FT, TF as small as possible. The ideal situation is to make Recall = 1 (FT = 0), Precision = 1 (TF = 0). Although the formulas have been proposed, it does not mean that those are the most accurate evaluation ways. These two indicators we have presented are only in relate to the results of comments processing regarding a single type of commodity. Because each product attribute word thesaurus in the ontology library is obtained from product reviews of one kind of commodity category rather than just one kind of commodity, we need to consider the Macro average. Macro average is considered when the results of processing about each type of commodity affect the overall results related to the whole category of commodity. Assume that a certain category containing n types of goods, then for each type of goods calculate parameter values, and make a sum. As to Recall, first calculate Recall i for each sort of product reviews, then compute the Macro average for the whole commodity category. The formula is as follows: i Recall Macro-Recall = i n For accuracy, calculate Precision i for each type of product reviews, then by commodity category we get the Macro average of Precision. The formula is as follows: (b) i Precision Macro-Precision = i n 2) Speedup Rate Speedup is the ratio of time consumed, running the same task, in a single processing system and a parallel processing system. It is used to measure performance of parallel systems and program parallelization effects. Formula as follows: S p = T 1 / T p S p is Speedup; T 1 is the running time under a single processing system; T p is the time consumed in a parallel system with p processing units. C. Experimental Results The results of the experiments are divided into two parts: First, based on small-scale review data, compare automatic processing results to manual tagging results then calculate the Macro average for the results regarding each category of goods; In the other part, based on large-scale review data, observe the results by statistical sampling, and then calculate the Speedup of parallel method. 1) Small-scale Datasets Experimental Results The overall amount of reviews collected manually is shown in TABLE III. After these three categories of product reviews are conducted classification and extraction of attribute words, results are shown in TABLE IV. As can be seen from the results, the Macro average Recall and Precision rates for small-scale datasets are all more than 80%, which has been able to basically meet the demands and purpose of the paper. 2) The Results of Large-scale Datasets Perform data acquisition and attribute words classification and extraction of product reviews on Hadoop MapReduce parallel processing platform. In order to carry out the largescale data analysis experiment, we get one billion product reviews, which is equivalent to the amount about 115GB. Liu Hongyu et al. studied extraction technique of objects evaluated in product reviews based on syntactic analysis [11]. In the large-scale experiment, on the basis of the same datasets, we make comparisons on both Macro averages and Speedup. TABLE III. THE AMOUNT OF REVIEWS COLLECTED MANUALLY Type of Product Review Count Skin care 1000 Women s clothing 1900 Digital camera 1500 TABLE IV. RESULT OF SMALL-SCALE DATASETS EXPERIMENT Category of Product Macro-Recall Macro-Precision Skin care Women s clothing Digital camera

After sampling part of the processing results (each specimen, which comes from a certain category of product, has a volume of 1000) of the product reviews, calculate the Macro average Precision and

From the sampling results, we can see the values of Macro-Recall and Macro-Precision are all above 0.7.

The reason may be under the condition of smallscale datasets, the selected goods are relatively hot commodity, of which the number of comments is more, with higher quality; what s more in the

In the large-scale data processing, the Speedup rates with the same amount of data running are shown in Fig. 10. Based on our method, the Speedup rate is close to 5.

The experiment shows that the method in this paper can be effectively fast in big data environments, and be capable of analyzing large data processing tasks.

5 After sampling part of the processing results (each specimen, which comes from a certain category of product, has a volume of 1000) of the product reviews, calculate the Macro average Precision and Recall rates with comparison of machine processing method to manual annotations. The result of the method this paper propose is shown in Fig. 8. From the sampling results, we can see the values of Macro-Recall and Macro-Precision are all above 0.7. However, the average value is slightly lower than the small-scale datasets, and the variance is greater. The reason may be under the condition of smallscale datasets, the selected goods are relatively hot commodity, of which the number of comments is more, with higher quality; what s more in the large-scale data condition, may be due to the presence of some of the not so popular categories of goods with lower average quality product reviews. In the large-scale data processing, the Speedup rates with the same amount of data running are shown in Fig. 10. Based on our method, the Speedup rate is close to 5.5 when the number of task nodes reaches eight. The experiment shows that the method in this paper can be effectively fast in big data environments, and be capable of analyzing large data processing tasks. To make a comparison, we make a test through the method, presented in [11], of which results are shown in Fig. 9 and Fig. 10. The method we proposed has a better performance of accuracy, mainly as a result of the ontology library, which is precision enough, specialized for extraction of attribute words in Chinese product reviews. It also has a greater Speedup rate under large-scale background, because the use of ontology as well, which makes algorithms more simple. V. CONCLUSION This paper presents a method on attribute words classification and extraction of Internet commodity comments under large data environment, in which the core approach is to build the ontology library. Take account of the simple and subject-related features of product reviews, it is a good way to construct an ontology to go a step further to decrease the computational complex. After verification of the experiments, in terms of reliability and efficiency the results are good enough. It is to be tested in practical applications. Fig. 8. Macro averages of the method used by the authors Fig. 9. Results of method compared Fig. 10. Speedup rates comparison under large-scale experiments REFERENCES [1] China e-commerce research center, [2] Official microblog of Alibaba, [3] Kamps J., Marx M., Mokken R.J., Using Net to measure semantic orientation of adjectives, In: Calzolari N, et al., eds. Proc. of the LREC, 2013, pp [4] Pancholi G., Jacob J., Malavalli K, Distributed computing environment using DCOM, Arlington, USA: University of Texas at Arlington, [5] Yu Xinsheng, Rong Zhiwei, Application of binary-tree under multicommunication network, computer engineering, vol. 32, pp , May [6] Li Jiyun, Dong Xiaoshe, Tong Duan, Research on distributed objects middleware, Computer Engineering and design, vol. 25, pp , February [7] Yang Qin, The three-layer client/server model based on COM, Computer Application and Research, 2011, pp [8] Fan Yuehua, Liu Bailin, Architecture of distribute software system with object oriented component, Journal of Xi an Institute of Technology, vol. 22, pp , March [9] Yu Shuhao, Su Shoubao, Guo Xuejun, Methodology of domainspecific ontology construction, Journal of West Anhui University, vol. 21, pp , April [10] J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, Communications of the ACM, 2008 vol. 51, pp [11] Liu Hongyu, Zhao Yanyan, Qin Bing, Liu Ting, Comment target extraction and sentiment classification, Journal of Chinese Information Processing, vol.24, pp , January

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISSN : 2456-3307 Some Issues in Application of NLP to Intelligent