Code Syntax-Comparison Algorithm based on Type-Redefinition-Preprocessing and Rehash Classification


320 JOURNAL OF MULTIMEDIA, VOL. 6, NO. 4, AUGUST 2011

Baojiang Cui (1, 2), Jun Guan (1), Tao Guo (2), Lifang Han (1), Jianxin Wang (3), Yupeng Ji (1)
1 Beijing University of Posts and Telecommunications, Beijing, China
2 China Information Technology Security Evaluation Center, Beijing, China
3 Beijing Forestry University, Beijing, China
cuibj@bupt.edu.cn, guanjun@yahoo.cn

Abstract: Code comparison technology plays an important role in software security protection and plagiarism detection. There are currently five main approaches to plagiarism detection: file-attribute-based, text-based, token-based, syntax-based and semantic-based. The first three have their own limitations, while the syntax-based technique falls short in detection ability and efficiency, so none of these approaches meets the requirements of large-scale software plagiarism detection. Building on our prior research, we propose an algorithm for detecting type-redefinition plagiarism, which handles simple type redefinition, repeated type redefinition, and redefinition of types with pointers. We also propose a code syntax-comparison algorithm based on rehash classification, which enhances the node storage structure of the syntax tree and greatly improves efficiency.

Index Terms: code clone, code plagiarism, syntax tree, rehash classification, type-redefinition

I. INTRODUCTION

With the rapid development of information technology, code bases keep growing and code plagiarism is harder to detect than ever. Existing detection technology can meet the requirements for either detection ability or detection efficiency, but not both.
To deal with widespread abuse of open source code and plagiarism of private source code, and to support software intellectual property detection and software safety protection, research is needed on code comparison technology that is both powerful and efficient. In the code comparison area, the five most popular approaches are [1]: file-attribute-based, text-based, token-based, syntax-based and semantic-based.

The file-attribute-based approach uses hash values of the whole text, of text blocks and of key structures in the text, or watermark information of the text, to find similar code between target code and sample code. The core of this technique is comparing hash values, so it has the advantage of high detection speed but the disadvantage of weak detection ability: it can only detect plagiarism of the whole code or of parts of the code, and cannot cope with cases such as changed characters. Among the most popular software based on the file-attribute technique are Protex by BlackDuck and the Protecode system by Protecode. These systems have played important roles in intellectual property detection and in software development and management.

The text-based approach converts source code into strings and detects cloned code by searching and matching strings. Johnson proposes dividing the code into a series of strings and matching similar code by the fingerprint information of the strings [2]. Ducasse goes further: the code is first filtered to remove noise such as comments, blanks and line breaks, then a hash value is calculated for each line, and code clones are detected by comparing the per-line hash values [3]. Kodhai [10] proposes a metric-based approach combined with textual comparison of the source code, which detects code clones by standardizing the source code.
Despite all of these attempts, the text-based technique can only detect plagiarism by whole or partial copying of code.

The token-based approach converts code into token sequences and matches them by token-sequence hash. CCFinder [4] by Kamiya marks the code with self-defined tokens and compares the token sequences with an LCS algorithm. Maeda proposes converting source code into a media XML sequence that includes lexical and syntactic information, using the tool PALEX, and then detecting clones with the lexical/syntactic information [11] or with parsing actions [12]. CodeCompare [5] also uses the token-based technique. It first preprocesses the source code by removing useless information and redefining the rest according to the type obtained by semantic analysis. It then performs lexical analysis to generate token sequences. Finally, it matches hash values of the token sequences using an improved LCS algorithm [5]. The comparison algorithm in the CodeCompare system can detect plagiarism by type redefinition and by substitution of variable and function names, but cannot cope with plagiarism such as reordered lines and altered IF statements.

The syntax-based approach detects code clones at the level of syntax structure. DECKARD [1] by Jiang records syntax structure information in Euclidean space and uses vector mixture to detect code clones. CloneDR [6] by Baxter traverses the abstract syntax tree and compares the hash value of each node to calculate the similarity of syntax structures. Corazza proposes detecting code clones based on syntactic information enriched with lexical elements, and introduces tree kernels to compute the similarity of (sub)trees [13]. The code plagiarism detection system [7] based on the abstract syntax tree analyzes code similarity based on syntax structure: it first deletes useless features from the code, and then obtains code similarity by traversing and comparing the parsed syntax nodes. The syntax-based technique can detect alterations of syntax structure that do not change semantics, such as reordered switch cases, reordered parameters and reordered variable definitions.

The semantic-based approach detects code plagiarism at the semantic level. Komondoor and Horwitz propose using program dependence graphs (PDG) and program slicing for code clone detection [8] [1]. Krinke puts forward a special PDG that shares features with both the AST and the traditional PDG, and applies it to code detection [9]. Higo has done much work to improve the quality of the PDG-based technique. For example, execution dependency and program slices are used to improve the ability to detect contiguous code clones [14]; PDG specializations and detection heuristics are proposed to reduce the computational cost of detection and, in the same way, to enhance the capability of every kind of code detection [15].
The first three detection technologies all work at the text level, with high detection efficiency but limited detection ability. Type-redefinition plagiarism detection based on token comparison [5] extends the earlier token comparison but cannot analyze code at the syntax level, which leaves gaps in detection. Almost all code plagiarism detection tools based on the syntax tree can detect a variety of plagiarism methods, but lack detection efficiency. Code plagiarism detection based on semantics is still under research and is not yet suitable for massive code plagiarism detection.

This paper extends the research on the code comparison algorithm [7] based on the abstract syntax tree. First, by improving the preprocessing step, we propose a way to detect type redefinition. Second, we propose a novel comparison algorithm based on rehash classification of syntax nodes, together with a storage structure for syntax tree nodes: because only syntax nodes of the same type are traversed and compared, the comparison complexity is greatly reduced and the efficiency greatly increased.

II. ORIGINAL CODE COMPARISON ALGORITHM

The original algorithm based on the abstract syntax tree transforms the source code into a syntax tree according to its grammar structures. A node is the smallest unit of the syntax tree; it represents one grammar structure in the source code and holds all the information of that structure. By traversing the syntax nodes of the target code and the sample code, the algorithm identifies their similarity at the syntax level. The steps of this algorithm are as follows.

1) Preprocessing of the source code. This step handles comments, macro definitions, conditional compilation statements, header file inclusions in the source file, etc. Comments and header file inclusions are simply deleted; the other parts are processed according to their semantics.

2) Lexical analysis of the source code. This step transforms the source code into a token series using Lex (Lexical Compiler). From the lexical rules, Lex generates a lexical analyzer that receives the preprocessed source code and returns the token value according to the rule matched by regular expression.

3) Abstract syntax tree generation. Syntax analysis builds on lexical analysis. From the syntax rules, this step first uses YACC (Yet Another Compiler Compiler) to generate a grammar parser. By matching the received token series against the relevant rules, the grammar parser generates the abstract syntax tree of the original source code. Every node in the syntax tree corresponds to one structure of the source code and stores all the information of that syntax structure.

4) Hash calculation for syntax tree nodes. The purpose of generating the syntax tree is code comparison. Because most syntax trees are structurally complicated and storage-consuming, comparing the original syntax trees directly is very difficult. We therefore reduce the complexity by comparing hash values instead. We calculate a hash value for every leaf node and every sub-tree node of the syntax tree, and every hash value represents its own sub-tree. The hash value of a leaf node is calculated directly, while the hash value of a sub-tree is calculated by adding up the hash values of its own sub-trees. Besides the hash value, we also record the number of child nodes of the current sub-tree and its line range, from start line to end line, in the source code.

5) Storage of the syntax tree. After calculating the hash values of the syntax tree, we store the information of every sub-tree. In memory, the information is stored as an array of lists. First, we create a dynamic array Array whose length is the maximum number of child nodes. Nodes with the same number of child nodes are saved into the corresponding position of Array. For instance, if a node a has n child nodes, we save a at position Array[n]. The storage structure of the syntax tree nodes is shown in Figure 2-1.
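Steps 4) and 5) can be illustrated with a minimal Python sketch. The Node class, the use of zlib.crc32 as the leaf hash, and the function names are assumptions made for illustration only; the paper does not specify the hash function. Only the rule that a sub-tree hash is the sum of its children's hashes, and the bucketing by child count, are taken from the text.

```python
from dataclasses import dataclass, field
import zlib


@dataclass
class Node:
    kind: str                    # grammar structure name (hypothetical labels)
    start: int                   # first source line covered by the node
    end: int                     # last source line covered by the node
    children: list = field(default_factory=list)
    hash_value: int = 0


def compute_hash(node):
    """Step 4: a leaf is hashed directly; a sub-tree is the sum of its children."""
    if not node.children:
        # Assumption: CRC32 of the node kind stands in for the leaf hash.
        node.hash_value = zlib.crc32(node.kind.encode())
    else:
        node.hash_value = sum(compute_hash(c) for c in node.children)
    return node.hash_value


def build_array(root):
    """Step 5: store every node in the list at index = its number of children."""
    nodes, stack = [], [root]
    while stack:
        n = stack.pop()
        nodes.append(n)
        stack.extend(n.children)
    max_children = max(len(n.children) for n in nodes)
    # One extra slot so that Array[max_children] is a valid index.
    array = [[] for _ in range(max_children + 1)]
    for n in nodes:
        array[len(n.children)].append(n)
    return array


def leaf(kind, line):
    return Node(kind, line, line)


tree = Node("block", 1, 3, [
    Node("assign", 1, 1, [leaf("ident", 1), leaf("number", 1)]),
    Node("assign", 2, 2, [leaf("ident", 2), leaf("number", 2)]),
    leaf("return", 3),
])
compute_hash(tree)
array = build_array(tree)
print([len(bucket) for bucket in array])
```

Because the sub-tree hash is a plain sum, it does not depend on the order of the children, which is consistent with the algorithm's stated ability to detect reordered statements.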

Type redefinition means that an original data type is redefined with the keyword typedef. An example of code before type redefinition follows:

Code a
    void main()
    {
        int a = 0;
        char b = 1;
        float c = 2;
        printf("%d%d%f",a,b,c);
    }

Figure 2-1. The storage form of the hash array list

6) Code comparison. The target file and the sample file are both processed by the steps above, each yielding a hash array list (TargetHashListArray and SampleHashListArray). We then compare TargetHashListArray[n] with SampleHashListArray[n], with n starting at the threshold value. The threshold value is the smallest number of sub-nodes a tree node must have to be compared. Traversing every chained list in both TargetHashListArray and SampleHashListArray, we compare every pair of nodes with the same number of sub-nodes and record the line range from start line to end line.

The algorithm above can detect plagiarism methods such as copying the entire code, copying part of the code, renaming variables or functions, and reordering statements. However, it cannot detect plagiarism based on type redefinition, and the complexity of the algorithm makes the whole process inefficient. To make up for these two deficiencies, we make the following improvements.

III. REHASH CLASSIFICATION BASED COMPARISON ALGORITHM

Our syntax comparison algorithm is very similar to the algorithm above. The main improvements are: (1) In the preprocessing step, we add semantic parsing of type redefinition, so that code plagiarism by type redefinition can be detected. (2) In the syntax tree storage step, we classify the syntax tree nodes by both the number of child nodes and the modulus of the node's hash value, so that the tree nodes are divided into many different categories.
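Improvement (2) can be sketched as follows, assuming each syntax node is reduced to a hypothetical (child_count, hash, line_range) tuple. The function names classify and compare, the sample data and the choice t = 4 are illustrative assumptions; the bucket key child_count * t + hash % t follows the LongArray[nt+h] layout described in Section III.B. The sketch also counts pairwise hash comparisons so that the two classification schemes can be set side by side.

```python
from collections import defaultdict


def classify(nodes, t=None, threshold=1):
    """Group nodes into buckets.

    t=None  -> key is the child count alone (original scheme).
    t=int   -> key is child_count * t + hash % t (rehash scheme).
    Nodes with fewer than `threshold` children are skipped.
    """
    buckets = defaultdict(list)
    for child_count, h, lines in nodes:
        if child_count < threshold:
            continue
        key = child_count if t is None else child_count * t + h % t
        buckets[key].append((h, lines))
    return buckets


def compare(target, sample):
    """Compare only buckets with the same key; count every hash comparison."""
    matches, comparisons = [], 0
    for key, t_nodes in target.items():
        for t_hash, t_lines in t_nodes:
            for s_hash, s_lines in sample.get(key, []):
                comparisons += 1
                if t_hash == s_hash:
                    matches.append((t_lines, s_lines))
    return matches, comparisons


# Hypothetical node records: (child_count, hash, (start_line, end_line)).
target_nodes = [(2, 10, (1, 1)), (2, 11, (2, 2)), (2, 12, (3, 3)), (3, 33, (1, 3))]
sample_nodes = [(2, 11, (5, 5)), (2, 13, (6, 6)), (3, 33, (5, 7)), (3, 40, (8, 9))]

orig_matches, orig_cmp = compare(classify(target_nodes), classify(sample_nodes))
new_matches, new_cmp = compare(classify(target_nodes, t=4), classify(sample_nodes, t=4))
print(orig_cmp, new_cmp)
```

Both schemes report the same matching line ranges, since nodes with equal hashes and equal child counts always share a bucket; only the number of comparisons differs.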
Because nodes with the same syntax structure are classified into the same category, we only need to compare code within the same category, which improves the comparison efficiency. The two improvements are described in detail below.

A. Type-Redefinition-Preprocessing Algorithm

Table 3-1. Code example about type redefinition plagiarism (code a)

Code b
    typedef int INTEGER;
    typedef char BYTE8;
    typedef float FLOAT;
    void main()
    {
        INTEGER a = 0;
        BYTE8 b = 1;
        FLOAT c = 2;
        printf("%d%d%f",a,b,c);
    }

Table 3-2. Code example about type redefinition plagiarism (code b)

Code a (Table 3-1) and code b (Table 3-2) are judged to be homologous according to their semantics. In view of this, we focus on three kinds of type redefinition: simple type redefinition, repeated type redefinition and type redefinition with pointer. The algorithm is described as follows.

Assume that all type redefinition statements in the original code constitute the set {TS}, with $T_1, T_2, \ldots, T_n \in \{TS\}$; {TSO} is composed of the types before redefinition, with elements $S_1, S_2, \ldots, S_n \in \{TSO\}$; {TSN} is composed of the types after redefinition, with elements $R_1, R_2, \ldots, R_n \in \{TSN\}$. We traverse all elements of {TS} and determine the kind of each element.

For any $T_i$, $1 \le i \le n$: if $T_i$ is a simple type redefinition statement such as typedef int INTEGER, then $S_i$ is int and $R_i$ is INTEGER; if $T_i$ is a statement such as typedef char *LPCH, then $S_i$ is char * and $R_i$ is LPCH. We traverse all elements of {TSN}, search the source code for every occurrence of $R_i$ ($R_i \in \{TSN\}$, $1 \le i \le n$) and replace it with $S_i$. Let $m_i$ ($1 \le i \le n$) be the number of times $R_i$ appears in the source code and let $M = \sum_{i=1}^{n} m_i$; the complexity of the algorithm is then $O(M)$.

If $\{TSO\} \cap \{TSN\} = \emptyset$, the search ends. If $\{TSO\} \cap \{TSN\} \ne \emptyset$, suppose $S_x \in \{TSO\}$, $R_y \in \{TSN\}$ and $S_x = R_y$ with $1 \le x, y \le n$; we find $R_x$ in the code and replace it with $S_y$. The source code is traversed until all elements of {TSN} have been replaced, at which point the algorithm is complete. The pseudo-code of the type redefinition algorithm is shown in Table 3-3.

Algorithm of Preprocessing Typedef
    function ST(SC: SourceCode): Replaced SourceCode
      while SourceCode do
        if SimpleTypedef
          for pointer p from Typedef to the end do
            StoreVar(NewType)
            Replace(SourceCode, NewType, RType)
          end for
        else if PointerTypedef
          for pointer p from Typedef to the end do
            StoreVar(NewType)
            Replace(SourceCode, NewType, RType)
          end for
        end if
      end while
      return SourceCode
    end function

Table 3-3. The pseudo-code of the type redefinition algorithm

After this preprocessing step, the algorithm can detect the three kinds of type-redefinition plagiarism. Suppose the total number of lines of the code is S and the number of lines modified by the algorithm is L; the improvement in similarity is then

$$\eta = \frac{L}{S} \times 100\%$$

B. Code Syntax-Comparison Algorithm Based on Rehash Classification

The core of our code comparison algorithm is to classify the syntax tree nodes into more precise categories, such that nodes in different categories are necessarily unrelated. After preprocessing, lexical analysis, syntax analysis and syntax tree generation, we classify the syntax tree nodes by both the number of child nodes and the modulus of the node's hash value, save them into different chained lists according to their sub-tree category, and keep them for the next step. The rehash classification algorithm proceeds as follows:

1) After the syntax tree is generated, we traverse it.

2) During the traversal, we calculate the hash value and the number of child nodes of each node in the syntax tree, and record the start and end line numbers of each node in the source file. We then store all the nodes in a chained list.
3) By searching the syntax tree, we find the maximum number N of child nodes of any node and create an array Array according to N. Traversing the chained list built before, we do the first classification: each node is saved at the nth position of the array according to its number n of child nodes, so every position of the array holds a chained list of nodes of the same kind. For instance, if a node a is the root of n child nodes, we save the information of a into the chained list at position Array[n].

4) Finally, we set up an array LongArray of length t*N, where t is an empirical value. Traversing the array Array obtained in the previous step, we take the hash value of each node in Array[n] modulo t. The nodes in the chained list are thereby classified into t categories, and according to the modulus result h (h from 0 to t-1) the node is saved at the (nt+h)th position of LongArray, LongArray[nt+h]. This completes the rehash classification.

In the rehash classification algorithm, all tree nodes are first classified by their number of sub-nodes. After that, we collect statistics on the number of nodes of each type of syntax structure, which gives a numerical basis for the empirical value t. Choosing t well is very important for the efficiency of our algorithm. Based on the idea that different types of syntax tree nodes are unrelated, our algorithm classifies all the syntax tree nodes as finely as possible. By comparing only tree nodes of the same category, we reduce the complexity of code comparison. The code comparison algorithm based on the syntax tree proceeds as follows:

1) Starting at the position in LongArray given by the threshold value, read the tree node hash values of the target file and the sample file for comparison.
2) Following the positions of LongArray, traverse the hash chained lists of the target file and the sample file.

3) Compare the nodes with the same classification. If two nodes share the same hash value, store their positions in the source files in the database; otherwise, go on to the next comparison.

4) Continue until all types of nodes have been traversed.

Let TargetLongArray and SampleLongArray be the syntax tree node arrays of the target file and the sample file. They have the same structure, so only nodes at the same position need to be compared. For example, the chained list in TargetLongArray[k] is compared only with the chained list in SampleLongArray[k].

C. Evaluation of Code Syntax-Comparison Algorithm Based on Rehash Classification

The purpose of the rehash classification algorithm is to improve the efficiency of large-scale code comparison. In this section, we calculate the complexity of both the algorithm based on the abstract syntax tree and our rehash classification algorithm, and compare their efficiency.

To evaluate the amount of calculation in code comparison, we first establish a mathematical model and make reasonable assumptions, chosen so that they do not have much impact on the real calculation. The model of the syntax trees is as follows. Assume that the target file generates the syntax tree TargetTree and the sample file generates SampleTree, with m and n syntax tree nodes respectively. In TargetTree, according to the number of child nodes, the nodes can be divided into x different types, and we assume that each type contains the same number of nodes, namely $m/x$. Likewise, in SampleTree the nodes can be divided into y different types, each containing $n/y$ nodes. To compare the code, we calculate the hashes of the two syntax trees and try to find identical nodes. We use $f(m, n)$ to represent the amount of calculation of the whole process.

Complexity of Original algorithm

Before comparison, the algorithm based on the abstract syntax tree must classify the syntax tree nodes. The classification generates chained lists according to the number k of child nodes of each node and saves the pointer to each linked list at the kth position of the array HashListArray, HashListArray[k]. Although similar nodes in a chained list are in random order, each syntax tree node needs to be processed only once during classification. Under the assumptions above, the amount of calculation of this step is $f_1(m) = m$ for TargetTree and $f_1(n) = n$ for SampleTree.

Classification is followed by comparison, which only needs to compare the hash values of syntax tree nodes with the same number of child nodes. Assume that there are $k_1$ nodes at TargetHashListArray[k] and $k_2$ nodes at SampleHashListArray[k]. Because the linked lists are in random order, comparing the nodes at this position takes $k_1 k_2$ operations. Under the assumptions on TargetTree and SampleTree, at most $\min(x, y)$ positions are compared (x different types of nodes in TargetTree and y in SampleTree).
Since each comparison takes $\frac{m}{x} \cdot \frac{n}{y}$ operations, the total amount of calculation for comparing TargetTree and SampleTree is

$$f_2(m, n) = \left(\frac{m}{x} \cdot \frac{n}{y}\right) \min(x, y) \qquad (1)$$

From the discussion above, the total amount of calculation of the algorithm based on the abstract syntax tree is

$$g_1(m, n) = f_1(m) + f_1(n) + f_2(m, n) = (m + n) + \left(\frac{m}{x} \cdot \frac{n}{y}\right) \min(x, y) \qquad (2)$$

Complexity of Rehash Classification based comparison algorithm

Before the comparison, the rehash classification algorithm needs two steps: classification of the syntax tree nodes and classification by the hash value of the syntax tree nodes. The first is the same as in the algorithm based on the abstract syntax tree, so its amount of calculation is $f_1(m) = m$ for TargetTree and $f_1(n) = n$ for SampleTree. After that, we do the rehash classification of the nodes in the array Array: read the hash value of each node, take it modulo t (t is an empirical value), and according to the result h save the tree node at the (kt+h)th position of LongArray, LongArray[kt+h] (k is the number of child nodes, t is an integer empirical value, and h is the result of the modulus). Each node needs to be processed only once, so the amount of calculation of the rehash classification is $f_3(m) = m$ for TargetTree and $f_3(n) = n$ for SampleTree.

The rehash classification reduces the number of operations in the comparison, because only nodes of the same type need to be compared. Assume that there are $k_1$ nodes in TargetLongArray[k] and $k_2$ nodes in SampleLongArray[k]. Because the linked lists are in random order, comparing the nodes at this position takes $k_1 k_2$ operations.
Here we need a further assumption: the nodes with the same number of child nodes are divided evenly into t types by the empirical value t, so that each type contains $\frac{m}{tx}$ nodes in TargetTree and $\frac{n}{ty}$ nodes in SampleTree. Under the model above, at most $\min(tx, ty)$ positions are compared, and each comparison takes $\frac{m}{tx} \cdot \frac{n}{ty}$ operations. Therefore, the total amount of calculation for comparing TargetLongArray and SampleLongArray is

$$f_4(m, n) = \left(\frac{m}{tx} \cdot \frac{n}{ty}\right) \min(tx, ty) \qquad (3)$$

The total amount of calculation of the rehash classification based comparison algorithm is therefore

$$g_2(m, n) = f_1(m) + f_1(n) + f_3(m) + f_3(n) + f_4(m, n) = (m + n) + (m + n) + \left(\frac{m}{tx} \cdot \frac{n}{ty}\right) \min(tx, ty) = 2(m + n) + \frac{mn}{txy} \min(x, y) \qquad (4)$$

When m and n are large enough, $g_2(m, n)$ is about $1/t$ of $g_1(m, n)$. In engineering practice, a suitable value of t should be chosen to reduce the complexity of the whole comparison process.

Source code example
    1 static int OversizeLength(ASN1_TYPE *asn1, void *user)
    2 {
    3     unsigned int *max_size;
    4     if(!asn1 || !user)
    5         return 0;
    6     max_size = (unsigned int *)user;
    7     if(*max_size && *max_size <= asn1->len.size)
    8         return 1;
    9     return 0;
    10 }

Revised code example
    1 typedef int INTEGER;
    2 typedef unsigned int UINT;
    3 static INTEGER OversizeLength(ASN1_TYPE *asn1, void *user)
    4 {
    5     UINT *max_size;
    6     if(!asn1 || !user)
    7         return 0;
    8     max_size = (UINT *)user;
    9     if(*max_size && *max_size <= asn1->len.size)
    10         return 1;
    11     return 0;
    12 }

Code 1. Code example used to detect simple type redefinition plagiarism
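The typedef preprocessing of Section III.A can be tried on a listing like the revised Code 1 with a minimal Python sketch. This is not the authors' implementation: the function name preprocess_typedefs and the regular expression are assumptions, and the sketch only recognizes single-line statements of the form typedef <type> [*]<name>; (no struct, array or function-pointer typedefs). It removes the typedef statements and replaces each new type name with its original type, repeating until nothing changes so that chained redefinitions are resolved.

```python
import re

# Assumption: one typedef per line, of the form `typedef <type> [*]<name>;`.
TYPEDEF = re.compile(r"^\s*typedef\s+(.+?)\s*(\*?)\s*(\w+)\s*;\s*$")


def preprocess_typedefs(source):
    """Drop typedef lines and substitute every redefined name by its original type."""
    aliases = {}   # new type name (R_i) -> type it was defined from (S_i)
    body = []
    for line in source.splitlines():
        m = TYPEDEF.match(line)
        if m:
            old, star, new = m.groups()
            aliases[new] = old + (" *" if star else "")
        else:
            body.append(line)
    text = "\n".join(body)

    # Repeat until no alias is left in the text, so that a chain such as
    # int -> INTEGER -> int16 resolves to int. Valid C has no cyclic
    # typedefs, so the loop terminates on well-formed input.
    changed = True
    while changed:
        changed = False
        for new, old in aliases.items():
            pattern = rf"\b{re.escape(new)}\b"
            text, count = re.subn(pattern, lambda _m, o=old: o, text)
            changed = changed or count > 0
    return text


revised = """typedef int INTEGER;
typedef unsigned int UINT;
static INTEGER OversizeLength(ASN1_TYPE *asn1, void *user)
{
    UINT *max_size;
    max_size = (UINT *)user;
}"""
print(preprocess_typedefs(revised))
```

After this step the revised listing contains int and unsigned int again, so the later lexical and syntax stages see the same types as in the source listing. Matching on word boundaries keeps a name such as INTEGER from being replaced inside a longer identifier.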

IV. EXPERIMENTS AND ANALYSIS

The advantages of the new algorithm are as follows: 1) We design an algorithm that analyzes type redefinition in order to detect plagiarism by simple type redefinition, repeated type redefinition and type redefinition with pointer. 2) We classify the syntax tree nodes by the modulus of the node's hash value, so that the tree nodes are classified more finely, which reduces the complexity of the comparison. CodeCompareV2.0 is developed with this algorithm. It inherits all the functions of CodeCompareV1.0, and in addition can detect type-redefinition plagiarism and has a higher detection rate.

A. Experiment to Analyze Type-Redefinition Plagiarism Detection Algorithm

Experiment Environment

To show that our algorithm improves the accuracy of code comparison, we ran many experiments on source code taken from open source software, focusing on the accuracy of detecting type-redefinition plagiarism. A comparison of CCFinder and CodeCompareV2.0 shows that CodeCompareV2.0 detects this kind of plagiarism more effectively. Here we show only two sample source codes for each case, in Code 1 to Code 3. The environment of our experiment is: Microsoft Windows 7 Ultimate on Intel(R) Dual-Core CPU 2.00 GB of memory.

Code 1 is from sp_asn1_detect.c of snort [20] and illustrates simple type redefinition. We change the code as follows: add redefinition statements for int and unsigned int at the beginning; change int into INTEGER in the first line and unsigned int into UINT in the third and sixth lines. Other statements and functions are not modified.

Plagiarism can also hinder homologous code detection by redefining the same type repeatedly.
To detect the homology of the source code more accurately, the algorithm finds and replaces redefined types repeatedly during preprocessing until no redefined type can be replaced. Code 2 is taken from capture.c of prelude. Due to space limitations, the experiment only gives an example of redefining the same type twice. We change the code as follows: redefine int as INTEGER at the beginning and then redefine INTEGER as int16; change the int in the first and second lines into int16. Other statements and functions are not modified.

The algorithm also handles type redefinition with pointer. For instance, given typedef char *LPCH, after preprocessing LPCH is replaced by char * in the source code. In other words, if a statement like LPCH a; appears in the source code, after preprocessing it becomes char * a;.

Source code example
    1 static int do_poll(struct pollfd *pfd, unsigned int nfds, int timeout)
    2 {
    3     int ret;
    4     ret = poll(pfd, nfds, timeout);
    5     if ( ret < 0 ) {
    6         if ( errno == EINTR )
    7             return 0;
    8         log(log_err, "poll returned an error.\n");
    9         return -1;
    10     }
    11     if ( ret == 0 )
    12         return 0;
    13     return 1;
    14 }

Revised code example
    1 typedef int INTEGER;
    2 typedef INTEGER int16;
    3 static int16 do_poll(struct pollfd *pfd, unsigned int nfds, int16 timeout)
    4 {
    5     int16 ret;
    6     ret = poll(pfd, nfds, timeout);
    7     if ( ret < 0 ) {
    8         if ( errno == EINTR )
    9             return 0;
    10         log(log_err, "poll returned an error.\n");
    11         return -1;
    12     }
    13     if ( ret == 0 )
    14         return 0;
    15     return 1;
    16 }

Code 2. Code example used to detect repeated type redefinition plagiarism

Source code example
    void main()
    {
        char *a;
        char *b;
        f(a,b);
    }

Revised code example
    typedef char *LPCH;
    void main()
    {
        LPCH a;
        LPCH b;
        f(a,b);
    }

Code 3. Code example used to detect type redefinition with pointer plagiarism

Code 3 is a simple function of generating function.
We change the code as follows: add the redefinition statement typedef char *LPCH; at the beginning of the code, and then replace every occurrence of char * with LPCH, without modifying any function.

Experiment Result

The detection results for Code 1 (simple type redefinition plagiarism) using CCFinder and CodeCompare are shown in Figure 4-1 and Figure 4-2 respectively.


Figure 4-1. The detection results of simple type redefinition plagiarism by CCFinder (the dark background shows the similar code; this note is not repeated in the following captions)

Figure 4-2. The detection results of simple type redefinition plagiarism by CodeCompare (red highlighting indicates the similar code; this note is not repeated in the following captions)

The detection results for Code 2 (repeated type redefinition plagiarism) using CCFinder and CodeCompare are shown in Figure 4-3 and Figure 4-4 respectively.

Figure 4-3. The detection results of repeated type redefinition plagiarism by CCFinder

Figure 4-4. The detection results of repeated type redefinition plagiarism by CodeCompare

The detection results for Code 3 (type redefinition with pointer plagiarism) using CCFinder and CodeCompare are shown in Figure 4-5 and Figure 4-6 respectively.

Figure 4-5. The detection results of type redefinition with pointer plagiarism by CCFinder

Figure 4-6. The detection results of type redefinition with pointer plagiarism by CodeCompare

From the three sets of comparative results, we can see that CCFinder detects only the identical statement blocks of the code, but not the type-redefinition plagiarism. CodeCompareV2.0, however, detects all three kinds of type-redefinition plagiarism and improves the accuracy of code plagiarism detection.

B. Experiment to Analyze Rehash Classification Based Syntax-Comparison Algorithm

Experiment Environment

To demonstrate the efficiency of our algorithm, we ran many experiments on source code from real-world open source software. Comparing CodeCompareV1.0, built on the original code comparison algorithm, with CodeCompareV2.0, built on the rehash classification based comparison algorithm, we found that CodeCompareV2.0 has a clear advantage in processing speed. To show the difference between the two versions, we use source code from large-scale software. The environment of our experiment is: Microsoft Windows 7 Ultimate on Intel(R) Dual-Core CPU 2.00 GB of memory.

Code 4 is cmpile.cpp of Yazoo [17]. Yazoo is a command-line scripting language and a wrapper for co-compiled C routines. cmpile.cpp: size 86KB, 2024 lines, 26 functions, 0 struct, 0 union, 0 class. Code 5 is extrnl.cpp of Yazoo: size 86KB, 1907 lines, 75 functions, 0 struct, 0 union, 0 class. Code 6 is hobbs.cpp of Yazoo: size 133KB, 3022 lines, 144 functions, 0 struct, 0 union, 0 class. Code 7 is intrpt.cpp of Yazoo: size 96KB, 2082 lines, 74 functions, 0 struct, 0 union, 0 class. Code 8 is gtwvw.c from project harbor [18]. Code 9 is gtkconv.c of funpindgin [19]. Code 10 is gtk-main-interface.c from project gwaei [20]. Code 11 is hvm.c of harbor-project. Code 12 is hbmain.c of harbor-project. Code 13 is harbor.c of harbor-project.
Table 4-1 shows the syntax node information of all the source code.

Table 4-1. Source code information (columns: Code, Name, Efficacious Node Number, Max Child Node Num; rows: 4 Cmplie.cpp, Extenl.cpp, Hobbsh.cpp, Intrpt.cpp, Gtwvw.c, Gtkconv.c, G-M_I.c, Hvm.c, Hbmain.c, Harbor.c)

Experiment Result

Code 4 to Code 13 are all large enough that the difference in comparison efficiency between the two versions of CodeCompare is easy to observe. There are 9 groups of experiments. Each group contains one source code file and one sample code file. Table 4-2 shows the results of the 9 groups.

According to the statistics in Table 4-2, the algorithm based on rehash classification reduces processing time. To describe the relationship between the modulus t of the rehash classification based algorithm and the proportion of time consumed, the value of t is set to 100 in all groups of experiments. Under ideal conditions, the theoretical proportion of time consumed should be 100:1. However, the child nodes may not be evenly distributed across the classes, and the comparison process also includes other operations. The measured proportions, 15:1 for the number of comparisons and 12:1 for the time consumed, are therefore consistent with each other and reasonable, which indeed shows that the rehash classification based algorithm has a higher efficiency. In Table 4-2, Comparison Num is the number of syntax node comparisons performed, and Time consumed is the duration of the comparison in milliseconds.

V. CONCLUSION

On the basis of source code comparison over the syntax tree structure, this paper introduces an algorithm for type redefinition detection. In addition, this paper proposes a rehash-based classification of syntax tree nodes, which we call the rehash syntax structure comparison algorithm. The proposed algorithm gives a more precise way to classify the syntax tree nodes, which narrows the scale of code comparison into several smaller ranges. Therefore, we reduce the complexity of the code comparison and improve its efficiency.
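
The relationship between the modulus t and the number of comparisons reported in the experiment can be illustrated with a minimal Python sketch. It is not the CodeCompare implementation. It assumes that every syntax node has already been reduced to an integer hash value, that the class of a node is its hash value modulo t, and that only nodes in the same class are compared. The function names `classify`, `count_all_pairs` and `count_classified_pairs`, as well as the sample data, are hypothetical.

```python
from collections import defaultdict


def count_all_pairs(source_hashes, sample_hashes):
    """Baseline: every source node is compared with every sample node."""
    return len(source_hashes) * len(sample_hashes)


def classify(hashes, t):
    """Group node hash values into classes keyed by (hash value mod t)."""
    classes = defaultdict(list)
    for h in hashes:
        classes[h % t].append(h)
    return classes


def count_classified_pairs(source_hashes, sample_hashes, t):
    """Comparisons needed when only nodes of the same class are compared."""
    source_classes = classify(source_hashes, t)
    sample_classes = classify(sample_hashes, t)
    return sum(
        len(nodes) * len(sample_classes.get(key, []))
        for key, nodes in source_classes.items()
    )


# Assumed data: hash values spread evenly over the 100 classes.
even_src = list(range(1000))
even_smp = list(range(500, 1500))
print(count_all_pairs(even_src, even_smp))              # 1000000
print(count_classified_pairs(even_src, even_smp, 100))  # 10000

# Assumed data: every hash value is a multiple of 100, so all nodes
# fall into class 0 and classification saves nothing.
skew = [100 * i for i in range(50)]
print(count_all_pairs(skew, skew))               # 2500
print(count_classified_pairs(skew, skew, 100))   # 2500
```

With evenly distributed hash values and t = 100, the count drops from n·m to roughly n·m/t, which is the ideal 100:1 proportion. With a skewed distribution the saving shrinks, which is in line with the measured proportion staying below the ideal.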

Table 4-2. Time consumed in comparison (columns: No., Source, Sample, Comparison Num of Old algorithm, Comparison Num of New algorithm, Time consumed by Old algorithm (ms), Time consumed by New algorithm (ms); source files of groups 1 to 9: Code4, Code5, Code6, Code8, Code10, Code8, Code9, Code6, Code11; summary rows: Total, Average, Proportion)

ACKNOWLEDGEMENTS

This work was supported by the National Natural Science Foundation of China (No ).

REFERENCES

[1] Lingxiao Jiang, Ghassan Misherghi, Zhendong Su, Stéphane Glondu. DECKARD: Scalable and Accurate Tree-based Detection of Code Clones. 29th International Conference on Software Engineering (ICSE'07), July
[2] Johnson J. Substring matching for clone detection and change tracking. In Proceedings International Conference on Software Maintenance (ICSM 1994), IEEE Computer Society: Los Alamitos CA, 1994; pp
[3] Stéphane Ducasse, Oscar Nierstrasz and Matthias Rieger. On the effectiveness of clone detection by string matching
[4] Toshihiro Kamiya, Shinji Kusumoto, and Katsuro Inoue. CCFinder: A multilinguistic token-based code clone detection system for large scale source code. IEEE Transactions on Software Engineering, 28(7): pp , July
[5] Han Lifang, Cui Baojiang. Type Redefinition Plagiarism Detection of Token-Based Comparison. Proceeding of IEEE The International Conference on Multimedia Information Networking and Security (MINES 2010).
[6] I.D. Baxter, A. Yahin, L. Moura, M. Sant'Anna, and L. Bier. Clone Detection Using Abstract Syntax Trees. Proc. IEEE Int'l Conf. Software Maintenance (ICSM '98), pp , Nov
[7] Cui Baojiang, Li Jiansong, Guo Tao. Code Comparison System Based on Abstract Syntax Tree. Proceeding of 2010 International Conference on Broadband Network and Multimedia Technology
[8] R. Komondoor and S. Horwitz. Using slicing to identify duplication in source code. In SAS, pp. 40-56,
[9] Jens Krinke. Identifying similar code with program dependence graphs /01 IEEE, pp , 2001
[10] Kodhai. E. Detection of Type-1 and Type-2 Code Clones Using Textual Analysis and Metrics. International Conference on Recent Trends in Information, Telecommunication and Computing,
[11] Kazuaki Maeda. An extended line-based approach to detect code clones using syntactic and lexical information. The seventh International Conference on Information Technology
[12] Kazuaki Maeda. Code Clone Detection Using Parsing Action. Communications and Information Technology, ISCIT th International Symposium: pp
[13] Anna Corazza. A Tree Kernel Based Approach for Clone Detection. Software Maintenance (ICSM), 2010 IEEE International Conference: pp
[14] Yoshiki Higo, Shinji Kusumoto. Enhancing Quality of Code Clone Detection with Program Dependency Graph. Reverse Engineering, WCRE '09. 16th Working Conference: pp
[15] Yoshiki Higo, Shinji Kusumoto. Code Clone Detection on Specialized PDGs with Heuristics. Software Maintenance and Reengineering (CSMR), th European Conference: pp
[16] sourceforge.net/projects/Yazoo/.
[17] sourceforge.net/projects/Harbor/.
[18] /
[19] sourceforge.net/projects/gwaei/.
[20]

Cui Baojiang is an Associate Professor in the School of Computer Science and Technology at Beijing University of Posts and Telecommunications, China. His main research areas include software and information security.

Guan Jun is a graduate student in the School of Computer Science and Technology at Beijing University of Posts and Telecommunications, China. His areas of research interest include software security and software code comparison.

Guo Tao is a researcher at the China Information Technology Security Evaluation Center. His area of research interest is software security.

Han Lifang is a graduate student in the School of Computer Science and Technology at Beijing University of Posts and Telecommunications, China. Her area of research interest is information security.

Wang Jianxin is an Associate Professor in the Computer Science Department of Beijing Forestry University, Beijing, China. His research interests include data mining, artificial intelligence, etc.

Ji Yupeng is a graduate student in the School of Computer Science and Technology at Beijing University of Posts and Telecommunications, China. His areas of research interest include software security and software code comparison.


On Refactoring for Open Source Java Program Yoshiki Higo 1,Toshihiro Kamiya 2, Shinji Kusumoto 1, Katsuro Inoue 1 and Yoshio Kataoka 3 1 Graduate School of Information Science and Technology, Osaka University