The Analysis and Optimization of KNN Algorithm Space-Time Efficiency for Chinese Text Categorization


Ying Cai and Xiaofei Wang

Dept. of Computer Science and Technology, Beijing Information Science & Technology University, Beijing, P. R. China

Abstract. The performance of any text classification algorithm is reflected in the reliability of its classification results and in the efficiency of the algorithm. While ensuring the reliability of classification, we analyze the space-time efficiency of the different stages of the traditional KNN algorithm process for Chinese text classification. We then optimize the efficiency of the algorithm and its feasibility in practical applications in several respects, including feature extraction, feature weighting, and similarity computing.

Keywords: KNN Algorithm, Space-Time Efficiency, Text Categorization, Feature, Feature Vector, Similarity.

1 Introduction

With the growth of Web resources and the popularization of electronic text, people have begun to pursue efficient and reliable methods of information processing in response to the explosion of knowledge and the other issues brought about by the rapid development of the information technology industry. Many scholars are now concerned with text classification technology. Text classification assigns a text to one or more pre-defined classes based on its content [1]. Current text classification methods can be broadly divided into rule-based classification and statistics-based classification, the latter including Decision Tree, K-nearest neighbor (KNN), Support Vector Machine, Bayes, etc. [2]. However, the performance of any classifier is reflected in two aspects, namely the reliability and the efficiency of the classification algorithm. Chinese text classification demands more of both because of the inherent complexity of the language, the ambiguity of word meanings, its varied language forms, and other characteristics.
Text classification algorithms are often optimized for reliability alone, and such optimizations are very difficult to implement. At the same time, the time and space efficiency of classification algorithms is easily ignored. Alternatively, efficiency is optimized for only one stage of the algorithm, so that the advantage of the traditional classification algorithm as a whole is dispersed and weakened. Therefore, we analyze the time and space efficiency of the different stages of the traditional KNN algorithm process in order to obtain a reasonable balance between space and time. The result is an optimized algorithm that is efficient and feasible in practical applications while ensuring the reliability of classification.

S. Lin and X. Huang (Eds.): CSEE 2011, Part I, CCIS 214, pp , Springer-Verlag Berlin Heidelberg 2011

2 KNN Algorithm

2.1 KNN Algorithm Overview

The traditional KNN algorithm is a simple and effective non-parametric algorithm. It performs outstandingly in precision and recall, but its main problem is the high dimension of the feature space [3]. KNN is a lazy learning method: it must compute similarity against a large number of samples, its classification time is nonlinear, and training is fast but classification is slow. The efficiency of the KNN classifier is also strongly affected by the distribution of the training data, and its computational cost is unbearable in a general computer environment [4]. The traditional KNN algorithm is one of the most useful classifiers for Chinese text, so its performance deserves analysis and optimization.

2.2 Traditional KNN Algorithm Process

The basic idea of the traditional KNN algorithm can be expressed as follows. According to the traditional vector space model, text features are formalized as a weighted feature vector [5]. For a given text to be classified, calculate its similarity (distance) to each text in the training set. Then select the K training texts nearest to the text to be classified. Finally, determine the category of the new text [6] according to the categories of these K texts. The algorithm flow is shown in Figure 1.

Fig. 1. Traditional KNN Algorithm Flow
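The flow described in Section 2.2 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `cosine` and `knn_classify`, the dictionary representation of feature vectors, and the toy training data are all assumptions made for illustration. The K neighbors vote with their similarity values, which follows the category scoring the paper describes in Section 5.1.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {feature: weight} dicts."""
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_classify(test_vec, training, k):
    """training is a list of (vector, category) pairs; returns the predicted category."""
    # Similarity of the test text to every training text.
    scored = [(cosine(test_vec, vec), cat) for vec, cat in training]
    # Keep the K most similar training texts.
    neighbors = sorted(scored, key=lambda s: s[0], reverse=True)[:k]
    # Accumulate similarity per category and pick the best-scoring one.
    scores = defaultdict(float)
    for sim, cat in neighbors:
        scores[cat] += sim
    return max(scores, key=scores.get)


# Hypothetical toy data, for illustration only.
training = [
    ({"stock": 2.0, "bank": 1.0}, "finance"),
    ({"bank": 2.0, "loan": 1.0}, "finance"),
    ({"match": 2.0, "goal": 1.0}, "sports"),
]
print(knn_classify({"bank": 1.0, "stock": 1.0}, training, k=2))
```

Every call scans the whole training set, which is why classification, rather than training, dominates the running time of a lazy learner such as KNN.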

2.3 The Principles of Traditional KNN Algorithm

The design and implementation of the KNN algorithm optimization should follow a few principles to improve the space-time efficiency of the algorithm.

(1) Store intermediate results in a disk file.
(2) Minimize the number of disk file accesses.
(3) Use a hash table as the basic storage structure.

3 KNN Algorithm Space-Time Efficiency Analysis

We analyze the space-time efficiency of the traditional KNN algorithm by dividing it into three stages: feature extraction, feature vector computing, and similarity computing.

3.1 Feature Items Analysis

Feature item extraction in the traditional KNN algorithm has defects in both time and space. First, evaluation functions are currently in widespread use for extracting feature items. However, an evaluation function improves extraction accuracy only within a limited range, while its time and computation cost is the same as that of computing the similarity of plain texts; it is too time-consuming and adds to the burden of processing the training corpus. Second, feature extraction itself does not place strict demands on space resources. However, a large-scale set of feature items greatly increases the computational complexity of the subsequent algorithms, because in the feature vector calculation stage each feature item is treated as one dimension of the vector.

3.2 Feature Vectors Analysis

The basis of text classification is converting a document into a format the computer can process, using a simple and accurate method [7]. The classic formalization of a text is the feature vector (W_1, W_2, W_3, ..., W_n), where W_i is the weight of the i-th feature item. The text is formalized by assigning a weight to each feature item and forming the feature vector from these numerical weights. Feature items are weighted based on the following two main empirical observations [1].

(1) The more often a lexical item appears in a text, the more it is related to the subject of that text.
(2) The more texts of the collection a lexical item appears in, the worse it discriminates between texts.

The traditional KNN algorithm uses the tfidf (term frequency-inverse document frequency) weighting formula [7]; that is, the weight of feature item w in text t is:

$$\mathrm{tfidf}(w, t) = \#(w, t) \times \lg(N / \#w) \qquad (1)$$

Here, #(w, t) is the number of times feature item w appears in text t, N is the total number of texts, and #w is the number of texts in which feature item w appears.

The feature vectors are high-dimensional, generally more than 200,000 dimensions. A feature vector is commonly stored in one of two ways.

(1) Using fixed-length dimensions to store the text feature vector. This takes up a lot of storage space, but similarity is easy to calculate and seeking is fast.
(2) Using variable-length dimensions to store the text feature vector. Only the feature items that actually occur in the text are saved, so little space is needed, but similarity is not easy to calculate and seek time is long.

The tfidf weighting algorithm is as follows.

    for (text_i = first_text to N)
        for (i_tag_j = i_first_tag to text_i.length)
        begin
            #t_j++;
            for (text_k = first_text to N)
                if (search i_tag_j in text_k) #w_j++;
        end
    return #t * log(N / #w);

The time complexity of the above algorithm is O(n^3) and its space complexity is O(1). It spends so much time repeatedly opening the disk file in the inner loop that this becomes the bottleneck of the weighting algorithm. We therefore consider separating out the inner loop, or reducing the inner loop to a constant level.

3.3 Similarity Analysis

We calculate the similarity between the test text and the training corpus to reflect how similar they are and to provide data support for classification. We use the cosine formula to calculate similarity in Chinese text categorization [8]. The similarity between texts t_i and t_j is:

$$\mathrm{sim}(t_i, t_j) = \frac{\sum_{k=1}^{M} w_{ik} \, w_{jk}}{\sqrt{\sum_{k=1}^{M} w_{ik}^{2}} \, \sqrt{\sum_{k=1}^{M} w_{jk}^{2}}} \qquad (2)$$

where w_{ik} is the weight of the k-th feature item in text t_i and M is the total number of feature items. The traditional KNN algorithm simplifies the training process, but because it must calculate the similarity with every training text, classification costs a lot of time [9]; the classification efficiency of the traditional KNN algorithm is therefore low. In this analysis the testing corpus is a single text.
The basic design of the similarity computation is to trade space for time. The traditional similarity algorithm is as follows.

    for (weight_i = first_weight to M)
        put into hashtable ha;
    for (text_j = first_text to N)
        for (j_weight_k = j_first_weight to M)
            if (search j_weight_k in ha) sim_j();
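The pseudocode above can be read as the following Python sketch, in which a `dict` plays the role of the hash table. The function name `similarities` and its parameter names are illustrative and do not come from the paper, and each text is assumed to be available as a sparse `{feature: weight}` mapping.

```python
import math


def similarities(test_weights, training_texts):
    """Cosine similarity of one test text against every training text.

    test_weights: {feature: weight} for the test text.
    training_texts: list of {feature: weight} dicts, one per training text.
    """
    # Step 1: put the test text's weights into a hash table (a dict).
    table = dict(test_weights)
    test_norm = math.sqrt(sum(w * w for w in table.values()))

    results = []
    for weights in training_texts:
        dot = 0.0
        sq_sum = 0.0
        # Step 2: one pass over the training text; each lookup is O(1) on average.
        for feature, w in weights.items():
            sq_sum += w * w
            if feature in table:
                dot += w * table[feature]
        denom = test_norm * math.sqrt(sq_sum)
        results.append(dot / denom if denom else 0.0)
    return results
```

The extra memory is one table holding the test text's weights; in exchange, matching a training feature against the test text is a single hash lookup instead of a search through the test vector.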

The time complexity of the traditional similarity algorithm is O(n^2) and its space complexity is O(n). However, if a conventional algorithm is used to calculate the similarity of large text sets, the time complexity increases to O(n^3), so controlling the time consumption is particularly important.

4 Optimization Scheme of KNN Algorithm and Test

Based on the results of the space-time efficiency analysis of the KNN algorithm, we design and test a space-time efficiency optimization scheme for each stage: a feature item extraction optimization scheme, a feature item weighting optimization scheme, and a similarity calculation optimization scheme.

4.1 Features Extraction Optimization and Test

The feature item data are the main resource of the KNN classification algorithm. Even without an evaluation function, building a good-quality stop-word table can reduce the dimension of the feature vectors and ensure the time efficiency of the algorithm. At present there is no uniform standard for stop-word tables [10]. Different extraction methods yield different stop-word tables, which in turn affect the space-time performance of the classification algorithm. The feature extraction program is designed to support different feature item filters, as follows.

    if (word.length() > lower_limit && word.length() < upper_limit)
        if (word.trait == n or word.trait == v)
            tag_hash.put(word);

Different combinations of word frequency and part of speech form different feature item extraction schemes. The test results are shown in Table 1; the training corpus contains 500 texts.

Table 1. Testing Result of Features
Columns: No.; Word Frequency; Part of Speech; Feature Extraction Time (ms); Feature Storage Space (KB); Feature Vector Dimension.
Tested schemes (word frequency, part of speech): (>99, noun); (>99, noun and verb); (>499, noun); (>499, noun and verb); (>999, noun); (>999, noun and verb); (<1000, noun); (<1000, noun and verb); (<500, noun); (<500, noun and verb).

The different test results are compared in Figure 2.
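A minimal Python sketch of such a feature filter is given below. It assumes the text has already been segmented and part-of-speech tagged into `(word, pos)` pairs by some external tool; the function name `extract_features`, the tag strings `"n"` and `"v"`, and the default limits are illustrative choices, not values from the paper. Only lower-bound frequency schemes such as ">99" are covered.

```python
def extract_features(tagged_words, lower_limit=1, upper_limit=5,
                     allowed_pos=("n", "v"), freq_above=0):
    """Return {word: frequency} for words passing length, POS and frequency filters.

    tagged_words: iterable of (word, pos) pairs from a segmenter/tagger (assumed given).
    """
    counts = {}
    for word, pos in tagged_words:
        # Same tests as the pseudocode: strict length bounds, then part of speech.
        if lower_limit < len(word) < upper_limit and pos in allowed_pos:
            counts[word] = counts.get(word, 0) + 1
    # Keep only words whose frequency exceeds the threshold (e.g. 99 for ">99").
    return {w: c for w, c in counts.items() if c > freq_above}


# Hypothetical tagged input, for illustration only.
tagged = [("银行", "n"), ("银行", "n"), ("的", "u"),
          ("上涨", "v"), ("中华人民共和国", "n")]
print(extract_features(tagged))
```

Passing `allowed_pos=("n",)` corresponds to the noun-only schemes, and raising `freq_above` shrinks the feature set and therefore the dimension of the feature vectors.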

Fig. 2. Feature Analysis

4.2 Optimization of Feature Items with Weight and Test

We design the optimized tfidf algorithm as follows.

    for (text_i = first_text to N)
        for (i_tag_j = i_first_tag to text_i.length)
            #t_j++;
    for (tag_group_i = first_tag_group to tag.group)
        for (text_j = first_text to N)
            if (search tag_group_i in text_j) #w_group_i++;

The test results are shown in Table 2; the training corpus contains 500 texts.

Table 2. Testing Result of Feature Weighting
Columns: Size of feature set; Computing time (ms); Auxiliary storage space.

Fig. 3. Analysis of Feature Weighting
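The grouping idea can be sketched in Python as follows. The sketch counts document frequencies for a whole group of feature items during each scan of the corpus, so the number of scans falls as the group grows. The names `document_frequencies` and `tfidf`, the use of a callable to stand in for re-reading the corpus file, and the reading of lg as a base-10 logarithm are assumptions made for illustration.

```python
import math
from collections import Counter


def document_frequencies(texts, features, group_size):
    """Count, for each feature, the number of texts that contain it.

    texts: a zero-argument callable returning a fresh iterator over token lists,
           standing in for one sequential read of the corpus file.
    Returns (df, passes) where passes is the number of corpus scans performed.
    """
    df = Counter()
    passes = 0
    features = list(features)
    for start in range(0, len(features), group_size):
        group = set(features[start:start + group_size])  # auxiliary storage
        passes += 1
        for tokens in texts():  # one scan of the corpus per group
            for feature in group & set(tokens):
                df[feature] += 1
    return df, passes


def tfidf(term_count, n_texts, doc_freq):
    """Formula (1): #(w, t) * lg(N / #w), with lg taken as log base 10 (assumed)."""
    return term_count * math.log10(n_texts / doc_freq)


# Hypothetical toy corpus, for illustration only.
corpus = [["a", "b"], ["a", "c"], ["a"]]
df, passes = document_frequencies(lambda: iter(corpus), ["a", "b", "c"], group_size=3)
print(dict(df), passes)
```

With `group_size=1` the corpus is scanned once per feature item, as in the traditional algorithm; a larger group needs proportionally fewer scans at the cost of holding one group of feature items in memory.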

As the group size increases, the time complexity of the optimized tfidf algorithm tends to O(n^2) and its space complexity tends to O(n). When the block size is 1, the optimized algorithm is the same as the traditional one. When the block size is greater than 100, the time cost is reduced while the space used increases only slightly, which the system can easily bear. Therefore, the optimized algorithm is feasible and effective.

4.3 Optimization Scheme of Similarity Calculation

When the test corpus consists of multiple texts, we consider trading space for time: we load all the training feature vectors into memory and compute the similarities from there. The time complexity of the optimized algorithm is O(n^2) and its space complexity increases to O(n^2), which the system can still withstand.

    for (train_text_i = train_first_text to N)
        for (i_weight_k = i_first_weight to M)
            put into hashtable hash_train[];
    for (test_text_j = test_first_test to N)
        for (j_weight_k = j_first_weight to M)
            if (search j_weight_k in hash_train[]) sim_j();

5 Classifier Design and Test

After the optimization of the traditional KNN algorithm, its space-time efficiency is improved and more reasonable. However, this efficiency must rest on ensured reliability.

5.1 Classifier Design

Select the K Neighbors. Select the K training texts with the highest similarity as the K neighbors of the current test text. In practical problems it is difficult to determine the value of K, which is often only roughly estimated from experience. Such an estimate may reduce the accuracy of the KNN algorithm.

Scoring by Category. Accumulate the similarities of the K neighbors by category; the accumulated value is the score of the category [11].

Classification Proposed. According to the principle of the KNN algorithm, the test text is assigned to the class with the highest score.

5.2 Performance Indexes

The proposed classification is the direct basis for evaluating the performance of the classifier. The following performance indexes are generally used.
1) Accuracy rate (precision) = number of texts correctly assigned to a category / number of texts actually assigned to that category.
2) Recall rate = number of texts correctly assigned to a category / number of texts that should have been assigned to that category.
3) The standard measure is given by formula (3), where p is the accuracy rate, r is the recall rate, and β is the relative weight of precision and recall. The F1 measure is obtained when β is 1.

$$F_{\beta} = \frac{(1 + \beta^{2}) \, p \, r}{\beta^{2} p + r} \qquad (3)$$

5.3 Classifier Testing

The Sougou Corpus was selected. It covers nine categories: financial business, information technology, food hygiene, sports, tourism, education exams, employment workplace, culture arts, and military weapons. There are 1990 texts in each category. The dictionary contains 275,613 words, excluding stop words. We choose 900 test texts, which is 5% of the total corpus. The classifier test results are shown in Table 3.

Table 3. Testing Result of Classifier
Columns: K; Performance index; FB; IT; FH; SE; TV; EE; EW; CA; MW.
Rows: Precision, Recall, and F for each of five values of K.

The performance results above were compiled automatically by the classifier without manual checking. The results show that the value of K has little influence on the classifier. The implementation and testing environment was a computer with a 2 GHz CPU, 3 GB of main memory, the Windows XP operating system, and the Java language. Thus, the efficiency of the traditional KNN algorithm is acceptable in a common PC environment. The average feature vector weighting time is 63 ms per article, the average similarity computation time is 4043 ms per article, and the classification time of the traditional KNN classifier is 125 ms per article.
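The performance indexes of Section 5.2 can be computed per category as in the following Python sketch. The function name `precision_recall_f` and the sample labels are illustrative; the labels "IT" and "FB" merely reuse two of the category abbreviations from Table 3, and the sample predictions are not data from the paper.

```python
def precision_recall_f(predicted, actual, category, beta=1.0):
    """Precision p, recall r and F_beta (formula (3)) for one category.

    predicted, actual: equal-length sequences of category labels.
    """
    pairs = list(zip(predicted, actual))
    correct = sum(1 for p, a in pairs if p == category and a == category)
    assigned = sum(1 for p, _ in pairs if p == category)  # classifier's output
    expected = sum(1 for _, a in pairs if a == category)  # ground truth
    p = correct / assigned if assigned else 0.0
    r = correct / expected if expected else 0.0
    denom = beta * beta * p + r
    f = (1 + beta * beta) * p * r / denom if denom else 0.0
    return p, r, f


# Hypothetical predictions, for illustration only.
predicted = ["IT", "IT", "FB", "IT"]
actual = ["IT", "FB", "FB", "IT"]
print(precision_recall_f(predicted, actual, "IT"))
```

With `beta=1.0` the function returns the F1 measure; a larger `beta` gives recall more weight relative to precision.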

6 Conclusion

In this paper, we analyze the time and space efficiency of the conventional KNN algorithm process for Chinese text classification. While ensuring the reliability of classification, we present a detailed set of efficiency optimization schemes: a feature item extraction optimization scheme, a feature item weighting optimization scheme, and a similarity calculation optimization scheme. The test results met expectations.

Acknowledgment. The research is supported by the General Program of the Science and Technology Development Project of Beijing Municipal Education Commission under Grant No.KM , the Funding Project for Academic Human Resources Development in Institutions of Higher Learning under the Jurisdiction of Beijing Municipality under Grant No.PHR , and the Beijing Municipal Organization Department talent project under Grant No.2010D

References

1. A Survey on Automated Text Categorization,
2. Guo, G.D., Wang, H., Bell, D., Bi, Y.X., Greer, K.: An KNN Model-based Approach and Its Application in Text Categorization. J. Computer Science, (2003)
3. Yang, Y.M., Pedersen, J.O.: A Comparative Study on Feature Selection in Text Categorization. In: 14th Int'l Conf. on Machine Learning (ICML 1997), pp Morgan Kaufmann Publishers, San Francisco (1997)
4. Vries, A.D., Mamoulis, N., Nes, N.: Efficient KNN search on vertically decomposed data. In: 2002 ACM SIGMOD International Conference on Management of Data, pp ACM Press, Madison (2002)
5. Sun, R.Z.: An Improved KNN Algorithm for Text Classification. J. Computer Knowledge and Technology 6(1), (2010)
6. Ma, J.B., Li, J., Teng, G.F., Wang, F., Zhao, Y.: The Comparison Studies on the Algorithm of KNN and SVM for Chinese Text Classification. Journal of Agricultural University of HeBei 31(3), (2008)
7. Wang, X.Q.: Research of KNN Classification Method based on Parallel Genetic Algorithm. Journal of Southwest China Normal University 35(2), (2010)
8. Zhu, G.H., Cheng, C.P.: An Improved k-Nearest Neighbor Classification Method. Journal of HeNan Institute of Engineering, (2008)
9. Liu, B., Yang, L., Yuan, F.: Improved KNN Method and Its Application in Chinese Text Classification. Journal of Xihua University 27(2), (2008)
10. Zhou, Q.Q., Sun, B.D., Wang, Y.: Study on New Pretreatment Method for Chinese Text Classification System. J. Application Research of Computers (2), (2005)
11. He, F., Lin, Y.L.: Summary of Improving KNN text classification algorithm. J. FuJian Computer (3), (2005)
