Code Syntax-Comparison Algorithm based on Type-Redefinition-Preprocessing and Rehash Classification


JOURNAL OF MULTIMEDIA, VOL. 6, NO. 4, AUGUST 2011

Baojiang Cui (1,2), Jun Guan (1), Tao Guo (2), Lifang Han (1), Jianxin Wang (3), Yupeng Ji (1)
(1) Beijing University of Posts and Telecommunications, Beijing, China
(2) China Information Technology Security Evaluation Center, Beijing, China
(3) Beijing Forestry University, Beijing, China
Email: cuibj@bupt.edu.cn, guanjun@yahoo.cn

Abstract

Code comparison technology plays an important role in software security protection and plagiarism detection. There are currently five main approaches to plagiarism detection: file-attribute-based, text-based, token-based, syntax-based and semantic-based. The first three approaches have their own limitations, while the syntax-based technique suffers from limited detection ability and low efficiency, so none of these approaches meets the requirements of large-scale software plagiarism detection. Based on our prior research, we propose an algorithm for detecting type redefinition plagiarism, which can detect simple type redefinition, repeated type redefinition, and redefinition of types with pointers. In addition, this paper proposes a code syntax-comparison algorithm based on rehash classification, which enhances the node storage structure of the syntax tree and greatly improves efficiency.

Index Terms: code clone, code plagiarism, syntax tree, rehash classification, type-redefinition

I. INTRODUCTION

With the rapid development of information technology, code bases keep growing and code plagiarism is harder to detect than ever. Existing detection technology can meet the requirement on either detection ability or detection efficiency, but not both.
To deal with large-scale abuse of open source code and plagiarism of private source code, and to support software intellectual property detection and software safety protection, research on code comparison technology that is both powerful and efficient is needed.

In the code comparison area, the five most popular approaches are [1]: file-attribute-based, text-based, token-based, syntax-based and semantic-based.

The file-attribute-based approach uses hash values of the whole text, of text blocks and of key structures in the text, or watermark information of the text, to find similar code between target code and sample code. The core of this technique is comparing hash values, so it has the advantage of high detection speed but the disadvantage of weak detection ability: it can only detect copying of the whole code or of parts of the code, and cannot handle cases such as changed characters. Among the most popular software based on the file-attribute technique are Protex by BlackDuck and the Proteccode system by Proteccode. These systems have played important roles in intellectual property detection and in software development and management.

The text-based approach converts source code into strings and detects cloned code by searching and matching strings. Johnson proposes dividing the code into a series of strings and matching similar code by the fingerprint information of the strings [2]. Ducasse further proposes first filtering the code to remove noise such as comments, blanks and line breaks, then calculating a hash value for each line and detecting code clones by comparing the per-line hash values [3]. Kodhai.E [10] proposes a metric-based approach combined with textual comparison of the source code, which detects code clones after standardizing the source code.
Despite all of these attempts, the text-based technique can only detect plagiarism by copying the whole code or part of it.

The token-based approach converts code into token sequences and matches the token sequences by token sequence hash. CCFinder [4] by Kamiya marks the code with self-defined tokens and compares the token sequences with the LCS algorithm. Maeda proposes converting source code with the tool PALEX into a media XML sequence that includes lexical and syntactic information, and then detecting clones with the lexical/syntactic information [11] or parsing actions [12]. CodeCompare [5] also uses the token-based technique. First, it preprocesses the source code by removing useless information and redefining the rest according to semantic type analysis. Then it performs lexical analysis to generate token sequences. Finally, it matches hash values of the token sequences using an improved LCS algorithm [5]. The comparison algorithm in the CodeCompare system can detect plagiarism by type redefinition and by substitution of variable names and function names, but it cannot handle plagiarism such as reordered lines or altered IF statements.

The syntax-based approach performs code clone detection at the level of syntax structure. DECKARD [1] by Jiang uses Euclidean space to record syntax structure information and uses vector mixture to detect code clones. CloneDr [6] by Baxter traverses the abstract syntax tree and compares the hash value of each node to calculate the similarity of syntax structures. Corazza proposes detecting code clones based on syntactic information enriched by lexical elements, and introduces the use of tree kernels to compute the similarity of (sub)trees [13]. The code plagiarism detection system [7] based on abstract syntax trees analyzes code similarity based on syntax structure. It first removes useless features from the code, and then obtains code similarity by traversing and comparing the parsed syntax nodes. The syntax-based technique can detect alterations of syntax structure that do not change semantics, such as the order of switch cases, the order of parameters, and variable definitions.

The semantic-based approach performs code plagiarism detection at the semantic level. Komondoor and Horwitz propose applying program dependence graphs (PDG) and program slicing to code clone detection [8] [1]. Krinke puts forward a special PDG that shares similarities with both the AST and the traditional PDG, and applies it to code detection [9]. Higo has done much work to improve the PDG-based technique. For example, execution dependency and program slices are used to improve the ability to detect contiguous code clones [14]; PDG specializations and detection heuristics are proposed to reduce the computational cost of detection and likewise to enhance the capacity for every kind of code detection [15].
The first three detection technologies all work at the text level, with high detection efficiency but limited detection ability. Type redefinition plagiarism detection based on token comparison [5] extends the earlier token comparison, but cannot analyze code at the syntax level, which leaves gaps in detection. Almost all code plagiarism detection tools based on syntax trees handle various plagiarism methods but lack detection efficiency. Code plagiarism detection based on semantics is still under research and is not yet suitable for large-scale code plagiarism detection.

This paper builds on the code comparison algorithm [7] based on abstract syntax trees. First, by improving the preprocessing step, we propose a way to detect type redefinition. Second, we propose a novel comparison algorithm that applies rehash classification to syntax nodes. It comes with a storage structure for syntax tree nodes in which only nodes of the same type are traversed and compared, which greatly reduces comparison complexity and improves efficiency.

II. ORIGINAL CODE COMPARISON ALGORITHM

The original algorithm based on the abstract syntax tree transforms the source code into a syntax tree according to its grammar structures. A node is the smallest unit in the syntax tree; it represents a grammar structure in the source code and holds all the information of that structure. By traversing the syntax nodes of the target code and the sample code, the algorithm identifies the similarity between them at the syntax level. The steps of this algorithm are as follows.

1) Preprocessing of the source code. The preprocessing step handles comments, macro definitions, conditional compilation statements, header file inclusions and so on in the source file. Comments and header file inclusions are simply deleted, and the other parts are processed according to their semantics.

2) Lexical analysis of the source code. This step transforms the source code into a token series using Lex (a lexical analyzer generator). From the lexical rules, Lex generates a lexical analyzer that receives the preprocessed source code and returns token values according to the matching regular-expression rules.

3) Abstract syntax tree generation. Syntax analysis builds on lexical analysis. According to the syntax rules, this step first uses YACC (Yet Another Compiler Compiler) to generate a grammar parser. By matching the received token series against the relevant rules, the parser generates the abstract syntax tree of the original source code. Every node in the syntax tree corresponds to a structural part of the source code and stores all the information of that syntax structure.

4) Hash calculation for syntax tree nodes. The purpose of generating the syntax tree is code comparison. Because most syntax trees are structurally complicated and storage-consuming, it is very difficult to compare code using the original syntax trees directly. We therefore reduce the complexity by comparing hash values instead. We calculate a hash value for every leaf node and every sub-tree node of the syntax tree, and each hash value serves as the identifier of its own sub-tree. The hash values are calculated as follows: every leaf node is hashed directly, while the hash value of every sub-tree is obtained by adding up the hash values of its own sub-trees. Besides the hash value, we also record the number of child nodes of the current sub-tree and its start and end line numbers in the source code.

5) Storage of the syntax tree. After calculating the hash values of the syntax tree, we store the information of every sub-tree. In memory, the information is stored as an array of lists. First, we create a dynamic array Array whose length is the maximum number of child nodes. We save the information of nodes with the same number of child nodes into the corresponding position of Array. For instance, if a node a has n child nodes, we save a into the position Array[n]. The storage structure of the syntax tree nodes is shown in Figure 2-1.
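Steps 4) and 5) can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the authors' implementation: the Node class, the use of zlib.crc32 as the leaf hash, the 32-bit wrap-around, and the inclusion of a node's own kind in the sub-tree sum are all hypothetical choices. The text only specifies that leaves are hashed directly, that a sub-tree hash is the sum over its sub-trees, and that a node with n children is stored at Array[n].

```python
import zlib
from dataclasses import dataclass, field

MASK = 0xFFFFFFFF  # assumption: hashes are kept as 32-bit values


@dataclass
class Node:
    kind: str                 # grammar structure, e.g. "assign", "call", "id"
    start: int = 0            # first source line covered by the node
    end: int = 0              # last source line covered by the node
    children: list = field(default_factory=list)
    hash_value: int = 0       # filled in by compute_hash


def compute_hash(node):
    """Step 4: hash leaves directly, add up child hashes for sub-trees."""
    total = zlib.crc32(node.kind.encode())  # assumption: CRC32 of the node kind
    for child in node.children:
        total += compute_hash(child)
    node.hash_value = total & MASK
    return node.hash_value


def build_hash_list_array(root):
    """Step 5: Array[n] holds every node that has exactly n children."""
    nodes, stack = [], [root]
    while stack:
        node = stack.pop()
        nodes.append(node)
        stack.extend(node.children)
    max_children = max(len(n.children) for n in nodes)
    array = [[] for _ in range(max_children + 1)]
    for n in nodes:
        array[len(n.children)].append(n)
    return array


def stmt(kind, line, *leaf_kinds):
    """Build a one-line statement node whose children are leaves."""
    return Node(kind, line, line, [Node(k, line, line) for k in leaf_kinds])


# Two bodies containing the same statements in a different order.
body_a = Node("body", 1, 2, [stmt("assign", 1, "id", "num"), stmt("call", 2, "id")])
body_b = Node("body", 1, 2, [stmt("call", 1, "id"), stmt("assign", 2, "id", "num")])
compute_hash(body_a)
compute_hash(body_b)
array_a = build_hash_list_array(body_a)
```

Because addition is commutative, body_a and body_b receive the same hash even though their statements appear in a different order, and because leaves are hashed by kind rather than by name, renaming an identifier does not change the hash in this sketch.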

Figure 2-1. The storage form of the hash array list

6) Code comparison. To compare code, the target file and the sample file both have to go through the steps above. Each file yields a hash array list (TargetHashListArray and SampleHashListArray). We then compare TargetHashListArray[n] with SampleHashListArray[n], with n starting at the threshold value. The threshold value is the smallest number of sub-nodes a tree node must have to take part in the comparison. Traversing every chained list in both TargetHashListArray and SampleHashListArray, we compare the nodes that have the same number of sub-nodes and record the line numbers from the start line to the end line of the matching nodes.

The algorithm above can detect plagiarism methods such as copying the entire code, copying part of the code, renaming variables or functions, and reordering statements. However, it cannot detect plagiarism based on type redefinition, and the complexity of the algorithm makes the whole process inefficient. To make up for these two deficiencies, we make the following improvements.

III. REHASH CLASSIFICATION BASED COMPARISON ALGORITHM

Our syntax comparison algorithm is very similar to the algorithm above. The main improvements are: (1) in the preprocessing step, we add a semantic parser for type redefinition, so that code plagiarism by type redefinition can be detected; (2) in the syntax tree storage step, we classify the syntax tree nodes by both the number of child nodes and the modulus of the hash value of the tree node, so that the tree nodes are divided into many different types. Because nodes with the same syntax structure fall into the same category, we only need to compare code within the same category, which improves the comparison efficiency. The two improvements are described in detail below.

A. Type-Redefinition-Preprocessing Algorithm

Type redefinition means that an original data type is redefined with the keyword typedef. An example of type redefinition is given below.

Code a
    void main()
    {
        int a = 0;
        char b = 1;
        float c = 2;
        printf("%d%d%f",a,b,c);
    }
Table 3-1 code example about type redefinition plagiarism

Code b
    typedef int INTEGER;
    typedef char BYTE8;
    typedef float FLOAT;
    void main()
    {
        INTEGER a = 0;
        BYTE8 b = 1;
        FLOAT c = 2;
        printf("%d%d%f",a,b,c);
    }
Table 3-2 code example about type redefinition plagiarism

Code a in Table 3-1 and code b in Table 3-2 are judged to be homologous according to their semantics. In view of this, we focus on three kinds of type redefinition: simple type redefinition, repeated type redefinition, and type redefinition with pointer. The algorithm is described as follows.

Assume that all type redefinition statements in the original code constitute the set {TS}, with $T_1, T_2, \ldots, T_n \in \{TS\}$; {TSO} is composed of the types before redefinition, with elements $S_1, S_2, \ldots, S_n \in \{TSO\}$; and {TSN} is composed of the types after redefinition, with elements $R_1, R_2, \ldots, R_n \in \{TSN\}$. Then traverse all elements of {TS} and determine the type of each element.

For any $T_i$ ($1 \le i \le n$): if $T_i$ is a simple type redefinition statement such as typedef int INTEGER, then $S_i$ is int and $R_i$ is INTEGER; if $T_i$ is a pointer type redefinition statement such as typedef char *LPCH, then $S_i$ is char * and $R_i$ is LPCH. Traverse all elements of {TSN}, search the source code for every occurrence of $R_i \in \{TSN\}$ ($1 \le i \le n$), and replace it with $S_i$. Record the number of times $R_i$ appears in the source code as $m_i$ ($1 \le i \le n$). With $M = \sum_{i=1}^{n} m_i$, the complexity of the algorithm is $O(M)$.

If $\{TSO\} \cap \{TSN\} = \emptyset$, the search ends. If $\{TSO\} \cap \{TSN\} \ne \emptyset$, suppose $S_x \in \{TSO\}$, $R_y \in \{TSN\}$ and $S_x = R_y$ ($1 \le x, y \le n$); we then find $R_y$ in the code and replace it with $S_y$. The source code is traversed until all the elements of {TSN} are completely replaced, at which point the algorithm is complete. The pseudo-code of the type redefinition algorithm is shown in Table 3-3.

Algorithm of Preprocessing Typedef
    function ST(SC: SourceCode): Replaced SourceCode
        while SourceCode do
            if SimpleTypedef
                for pointer p from Typedef to the end do
                    StoreVar(NewType)
                    Replace(SourceCode, NewType, RType)
                end for
            else if PointerTypedef
                for pointer p from Typedef to the end do
                    StoreVar(NewType)
                    Replace(SourceCode, NewType, RType)
                end for
            end if
        end while
        return SourceCode
    end function
Table 3-3 the pseudo-code of type redefinition algorithm

After this preprocessing step, the algorithm can detect the three kinds of type redefinition plagiarism. Suppose the total number of lines of the code is S and the number of lines modified by the algorithm is L; the resulting improvement in similarity is

$$\eta = \frac{L}{S} \times 100\%$$

B. Code Syntax-Comparison Algorithm Based on Rehash Classification

The core of our code comparison algorithm is to classify the syntax tree nodes into more precise categories, on the premise that nodes of different types are irrelevant to each other. After preprocessing, lexical analysis, syntax analysis and generation of the syntax tree, we classify the syntax tree nodes by both the number of child nodes of the tree node and the modulus of its hash value, save them into different chained lists according to the sub-tree types, and keep them ready for the next step. The rehash classification algorithm proceeds as follows:

1) After the syntax tree is generated, we traverse it.

2) During the traversal, we calculate the hash value and the number of child nodes of each node in the syntax tree, and record the start and end line numbers of each node in the source file. Then we store all the nodes in a chained list.
3) By searching the syntax tree, we find the node with the maximum number N of child nodes and create an array Array according to N. Traversing the chained list built in the previous step, we perform the first classification: each node is saved into the nth position of the array according to its number of child nodes n, so that every position of the array holds a chained list of nodes of the same kind. For instance, if a node a is the root of n nodes, we save the information of a into the chained list at Array[n].

4) Finally, we set up an array LongArray of length t*N, where t is an empirical value. Traversing Array from the previous step, we compute the modulus (mod t) of the hash value of each node in Array[n]. The nodes in that chained list are thereby classified into t categories, and according to the modulus result h (h from 0 to t-1) we save the node into the (nt+h)th position of LongArray, LongArray[nt+h]. This completes the rehash classification.

In the rehash classification algorithm, the first thing to do is to classify all tree nodes by their number of sub-nodes. After that, we gather statistics on the number of nodes of each type of syntax structure, which gives a numerical basis for choosing the empirical value t. The choice of t is very important for the efficiency of our algorithm.

Based on the idea that different types of syntax tree nodes are irrelevant to each other, our algorithm classifies the syntax tree nodes as finely as possible. By comparing only tree nodes of the same category, we reduce the complexity of code comparison. The code comparison algorithm based on the syntax tree proceeds as follows:

1) Read the tree node hash values of the target file and the sample file from LongArray, starting at the position given by the threshold value, to do the comparison.
2) Following the positions of LongArray, traverse the hash chained lists of the target file and the sample file.

3) Compare the nodes with the same classification. If they share the same hash value, store their positions in the source files into the database; if not, go on to the next comparison.

4) Continue until all types of nodes have been traversed.

Let the syntax tree node arrays of the target file and the sample file be TargetLongArray and SampleLongArray. They have the same structure, so during the comparison we only need to compare nodes at the same position. For example, the chained list in TargetLongArray[k] is compared only with the chained list in SampleLongArray[k].

C. Evaluation of Code Syntax-Comparison Algorithm Based on Rehash Classification

The purpose of the rehash classification algorithm is to improve efficiency in large-scale code comparison. In this section, we calculate the complexity of both the algorithm based on the abstract syntax tree and our rehash classification algorithm, and compare their efficiency.

To evaluate the amount of calculation in the code comparison, we first establish a mathematical model and make reasonable assumptions based on it. The assumptions are chosen so that they do not have much impact on the real calculation. The model of the syntax tree is as follows.

Assume that the target file generates the syntax tree TargetTree and the sample file generates SampleTree, with m and n syntax tree nodes respectively. In TargetTree, according to the number of child nodes, the nodes can be divided into x different types, and we assume that each type contains the same number of nodes, namely $m/x$. Likewise, in SampleTree the nodes can be divided into y different types, each containing $n/y$ nodes. To compare the code, we calculate the hash values of the two syntax trees and try to find identical nodes. We use $f(m, n)$ to represent the amount of calculation of the whole process.

Complexity of Original algorithm

Before comparison, the algorithm based on the abstract syntax tree needs to classify the syntax tree nodes. The classification generates a chained list for each number k of child nodes and saves the pointer to that list into the kth position of the array HashListArray, HashListArray[k]. Although the nodes within a chained list are in random order, each syntax tree node only needs to be handled once during classification. Under the assumptions above, the amount of calculation of this process is $f_1(m) = m$ for TargetTree and $f_1(n) = n$ for SampleTree.

After the classification comes the comparison. Only the hash values of syntax tree nodes with the same number of child nodes need to be compared. Assume that there are $k_1$ nodes at TargetHashListArray[k] and $k_2$ nodes at SampleHashListArray[k]. Because the linked lists are in random order, the amount of calculation for comparing the nodes at this position is $k_1 k_2$. Under the assumptions on TargetTree and SampleTree, there are at most min(x, y) such comparisons (x different types of nodes in TargetTree and y in SampleTree).
Since the amount of calculation of each comparison is $\frac{m}{x} \cdot \frac{n}{y}$, the total amount of calculation for comparing TargetTree and SampleTree is

$$f_2(m, n) = \left(\frac{m}{x} \cdot \frac{n}{y}\right) \min(x, y) \qquad (1)$$

From the discussion above, the total amount of calculation of the algorithm based on the abstract syntax tree is

$$g_1(m, n) = f_1(m) + f_1(n) + f_2(m, n) = (m + n) + \left(\frac{m}{x} \cdot \frac{n}{y}\right) \min(x, y) \qquad (2)$$

Complexity of Rehash Classification based comparison algorithm

Before the comparison, the rehash classification algorithm needs two steps: the classification of the syntax tree nodes and the classification by the hash value of the syntax tree nodes. The first is the same as in the algorithm based on the abstract syntax tree, so its amount of calculation is $f_1(m) = m$ for TargetTree and $f_1(n) = n$ for SampleTree.

After that, we apply the rehash classification to the nodes in the array Array. This process reads the hash value of each node, computes its modulus (mod t, where t is an empirical value), and according to the result h saves the tree node to the (kt+h)th position of LongArray, LongArray[kt+h] (k is the number of child nodes, t is an integer empirical value, and h is the result of the modulus). Each node needs to be handled only once, so the amount of calculation of the rehash classification is $f_3(m) = m$ for TargetTree and $f_3(n) = n$ for SampleTree.

The rehash classification reduces the number of operations in the comparison, since only nodes of the same type need to be compared. Assume that there are $k_1$ nodes in TargetLongArray[k] and $k_2$ nodes in SampleLongArray[k]. Because the linked lists are in random order, the amount of calculation for the nodes at this position is $k_1 k_2$.
Here we need a further assumption: the nodes with the same number of child nodes are divided into t types by the empirical value t, so that each type contains $\frac{m}{tx}$ nodes in TargetTree and $\frac{n}{ty}$ nodes in SampleTree. Under the earlier model, there are at most min(tx, ty) comparisons, and the amount of calculation of each is $\frac{m}{tx} \cdot \frac{n}{ty}$. Therefore, the total amount of calculation for comparing TargetLongArray and SampleLongArray is

$$f_4(m, n) = \left(\frac{m}{tx} \cdot \frac{n}{ty}\right) \min(tx, ty) \qquad (3)$$

From this discussion, the total amount of calculation of the rehash classification based comparison algorithm is

$$g_2(m, n) = f_1(m) + f_1(n) + f_3(m) + f_3(n) + f_4(m, n) = (m + n) + (m + n) + \left(\frac{m}{tx} \cdot \frac{n}{ty}\right) \min(tx, ty) = 2(m + n) + \left(\frac{mn}{txy}\right) \min(x, y) \qquad (4)$$

When m and n are large enough, $g_2(m, n)$ is about $\frac{1}{t}$ of $g_1(m, n)$. In engineering practice, a suitable value of t should be chosen to reduce the complexity of the whole comparison process.

Source code example
     1 static int OversizeLength(ASN1_TYPE *asn1, void *user)
     2 {
     3     unsigned int *max_size;
     4     if(!asn1 || !user)
     5         return 0;
     6     max_size = (unsigned int *)user;
     7     if(*max_size && *max_size <= asn1->len.size)
     8         return 1;
     9     return 0;
    10 }

Revised code example
     1 typedef int INTEGER;
     2 typedef unsigned int UINT;
     3 static INTEGER OversizeLength(ASN1_TYPE *asn1, void *user)
     4 {
     5     UINT *max_size;
     6     if(!asn1 || !user)
     7         return 0;
     8     max_size = (UINT *)user;
     9     if(*max_size && *max_size <= asn1->len.size)
    10         return 1;
    11     return 0;
    12 }
Code 1 code example used to detect simple type redefinition plagiarism
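The rehash classification of Section III.B and the comparison counts of Section III.C can be illustrated with a small Python simulation. It is a sketch under stated assumptions rather than the CodeCompare implementation: nodes are reduced to hypothetical (child_count, hash) pairs drawn at random, LongArray is given (N+1)*t slots so that the index n*t+h is always valid, and the function names are invented for illustration.

```python
import random


def classify(nodes, t, max_children):
    """Place each (child_count, hash) pair in slot child_count*t + hash % t.

    With t == 1 this degenerates to the original Array[n] storage.
    """
    long_array = [[] for _ in range((max_children + 1) * t)]
    for n, h in nodes:
        long_array[n * t + h % t].append((n, h))
    return long_array


def compare(target, sample, t, max_children, threshold=1):
    """Compare slot k of the target only with slot k of the sample.

    Returns (set of matching (child_count, hash) pairs, number of comparisons).
    """
    t_arr = classify(target, t, max_children)
    s_arr = classify(sample, t, max_children)
    matches, comparisons = set(), 0
    # Slots below threshold*t belong to nodes with too few children.
    for k in range(threshold * t, len(t_arr)):
        for _, ht in t_arr[k]:
            for _, hs in s_arr[k]:
                comparisons += 1
                if ht == hs:
                    matches.add((k // t, ht))
    return matches, comparisons


rng = random.Random(0)   # fixed seed so the run is repeatable
MAX_CHILDREN = 5         # hypothetical maximum number of child nodes


def random_nodes(count):
    """Synthetic stand-in for the nodes of a parsed syntax tree."""
    return [(rng.randint(0, MAX_CHILDREN), rng.randrange(500)) for _ in range(count)]


target_nodes = random_nodes(600)
sample_nodes = random_nodes(400)

orig_matches, orig_cost = compare(target_nodes, sample_nodes, 1, MAX_CHILDREN)
rehash_matches, rehash_cost = compare(target_nodes, sample_nodes, 10, MAX_CHILDREN)
print(orig_cost, rehash_cost)
```

Nodes with equal hash and equal child count always land in the same slot, so both runs report the same matches; only the number of comparisons changes. With roughly uniform hash values the comparison count shrinks by a factor close to t, which is the behaviour equation (4) predicts for large m and n.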

IV. EXPERIMENTS AND ANALYSIS

The advantages of the new algorithm are the following: 1) we design an algorithm that analyzes type redefinition, in order to detect plagiarism by simple type redefinition, repeated type redefinition and type redefinition with pointer; 2) we classify the syntax tree nodes by the modulus of the hash value of the tree node, so that the tree nodes are classified more precisely and the complexity of the comparison is reduced.

CodeCompareV2.0 is developed with this algorithm. It inherits all the functions of CodeCompareV1.0, and in addition can detect type redefinition plagiarism and achieves a higher detection rate.

A. Experiment to Analyze Type-Redefinition Plagiarism Detection Algorithm

Experiment Environment

To show that our algorithm improves the accuracy of code comparison, we ran many experiments using source code taken from open source software. We mainly focus on the accuracy of detecting type redefinition plagiarism. A comparison of CCFinder and CodeCompareV2.0 shows that CodeCompareV2.0 detects this kind of plagiarism more effectively. Here we show only two sample source codes (the original and the revised one) for each case, in Code 1 to Code 3. The environment of our experiment is: Microsoft Windows 7 Ultimate on an Intel(R) Dual-Core CPU with 2.00 GB of memory.

Code 1 is from sp_asn1_detect.c of snort [20] and illustrates simple type redefinition. We change the code as follows: add redefinition statements for int and unsigned int at the beginning; change int into INTEGER in the first line and unsigned int into UINT in the third and sixth lines. Other statements and functions are not modified.

A plagiarist can also hinder homologous-code detection by redefining the same type repeatedly.
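Undoing type redefinitions before comparison, including the repeated case, amounts to replacing each redefined name by its underlying type until nothing changes. The following Python sketch illustrates that idea. It is a simplification under stated assumptions, not the CodeCompare preprocessor: it uses regular expressions instead of a semantic parser, handles only single-line typedef statements, would also rewrite matching names inside string literals, and the function name resolve_typedefs and the sample input are hypothetical.

```python
import re

# Assumption: typedefs have the single-line form "typedef <old type> <new name>;",
# optionally with '*' before the new name (type redefinition with pointer).
TYPEDEF = re.compile(r"^\s*typedef\s+(.+?)\s*(\**)\s*([A-Za-z_]\w*)\s*;\s*$")


def resolve_typedefs(source):
    """Drop typedef lines and replace every new type name by its original type."""
    mapping, kept = {}, []
    for line in source.splitlines():
        m = TYPEDEF.match(line)
        if m:
            old, stars, new = m.groups()
            mapping[new] = old + (" " + stars if stars else "")
        else:
            kept.append(line)
    code = "\n".join(kept)
    # Repeat until a fixed point so that chains such as
    # int -> INTEGER -> int16 are fully undone.
    changed = True
    while changed:
        changed = False
        for new, old in mapping.items():
            pattern = r"\b" + re.escape(new) + r"\b"
            replaced = re.sub(pattern, lambda _m, o=old: o, code)
            if replaced != code:
                code, changed = replaced, True
    return code


# Hypothetical input mixing repeated and pointer redefinition.
revised = """typedef int INTEGER;
typedef INTEGER int16;
typedef char *LPCH;
static int16 do_poll(LPCH name, int16 timeout)
{
    int16 ret = timeout;
    return ret;
}"""

print(resolve_typedefs(revised))
```

The loop stops once a full pass leaves the text unchanged, so a chain of any length is resolved; it assumes the redefinitions are acyclic, which valid C guarantees.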
To detect the homology of the source code more accurately, the algorithm finds and replaces redefined types repeatedly during preprocessing until no redefined type remains to be replaced. Code 2 is taken from capture.c of prelude. Due to space limitations, the experiment only gives an example of redefining the same type twice. We change the code as follows: redefine int as INTEGER at the beginning and then redefine INTEGER as int16; change the int in the first and second lines into int16. Other statements and functions are not modified.

Among all cases of type redefinition, the algorithm takes special measures for type redefinition with pointer. For instance, given typedef char *LPCH, LPCH is replaced by char * in the source code after preprocessing. In other words, if a statement like LPCH a; appears in the source code, it is changed to char * a; after preprocessing.

Source code example
     1 static int do_poll(struct pollfd *pfd, unsigned int nfds, int timeout)
     2 {
     3     int ret;
     4     ret = poll(pfd, nfds, timeout);
     5     if ( ret < 0 ) {
     6         if ( errno == EINTR )
     7             return 0;
     8         log(log_err, "poll returned an error.\n");
     9         return -1;
    10     }
    11     if ( ret == 0 )
    12         return 0;
    13     return 1;
    14 }

Revised code example
     1 typedef int INTEGER;
     2 typedef INTEGER int16;
     3 static int16 do_poll(struct pollfd *pfd, unsigned int nfds, int16 timeout)
     4 {
     5     int16 ret;
     6     ret = poll(pfd, nfds, timeout);
     7     if ( ret < 0 ) {
     8         if ( errno == EINTR )
     9             return 0;
    10         log(log_err, "poll returned an error.\n");
    11         return -1;
    12     }
    13     if ( ret == 0 )
    14         return 0;
    15     return 1;
    16 }
Code 2 code example used to detect repeated type redefinition plagiarism

Code 3 is a simple function. We change the code as follows: add the redefinition statement typedef char *LPCH; at the beginning of the code, and replace every occurrence of char * by LPCH, without any functional modification.

Source code example
    void main()
    {
        char *a;
        char *b;
        f(a,b);
    }

Revised code example
    typedef char *LPCH;
    void main()
    {
        LPCH a;
        LPCH b;
        f(a,b);
    }
Code 3 code example used to detect type redefinition with pointer plagiarism

Experiment Result

The detection results for Code 1 (simple type redefinition plagiarism) using CCFinder and CodeCompare are shown in Figure 4-1 and Figure 4-2, respectively.

Figure 4-1. The detection results of simple type redefinition plagiarism by CCFinder (the dark background shows the similar code; the same applies to the following figures)

Figure 4-2. The detection results of simple type redefinition plagiarism by CodeCompare (the similar code is highlighted in red; the same applies to the following figures)

The detection results for Code 2 (repeated type redefinition plagiarism) using CCFinder and CodeCompare are shown in Figure 4-3 and Figure 4-4, respectively.

Figure 4-3. The detection results of repeated type redefinition plagiarism by CCFinder

Figure 4-4. The detection results of repeated type redefinition plagiarism by CodeCompare

The detection results for Code 3 (type redefinition with pointer plagiarism) using CCFinder and CodeCompare are shown in Figure 4-5 and Figure 4-6, respectively.

Figure 4-5. The detection results of type redefinition with pointer plagiarism by CCFinder

Figure 4-6. The detection results of type redefinition with pointer plagiarism by CodeCompare

From the three sets of comparative results, we can see that CCFinder detects only the identical statement blocks of the code and misses the type redefinition plagiarism. CodeCompareV2.0, however, detects all three kinds of type redefinition plagiarism and improves the accuracy of code plagiarism detection.

B. Experiment to Analyze Rehash Classification Based Syntax-Comparison Algorithm

Experiment Environment

To demonstrate the efficiency of our algorithm, we ran many experiments with source code from real-world open source software. Comparing CodeCompareV1.0, built on the original code comparison algorithm, with CodeCompareV2.0, built on the rehash classification based comparison algorithm, we found that CodeCompareV2.0 has an obvious advantage in processing speed. To show the difference between the two versions, we use source files from large-scale software. The environment of our experiment is: Microsoft Windows 7 Ultimate on an Intel(R) Dual-Core CPU with 2.00 GB of memory.

Code 4 is cmpile.cpp of Yazoo [17]; size 86KB, 2024 lines, 26 functions, 0 struct, 0 union, 0 class. Yazoo is a command-line scripting language and a wrapper for co-compiled C routines.
Code 5 is extrnl.cpp of Yazoo; size 86KB, 1907 lines, 75 functions, 0 struct, 0 union, 0 class.
Code 6 is hobbs.cpp of Yazoo; size 133KB, 3022 lines, 144 functions, 0 struct, 0 union, 0 class.
Code 7 is intrpt.cpp of Yazoo; size 96KB, 2082 lines, 74 functions, 0 struct, 0 union, 0 class.
Code 8 is gtwvw.c from project harbor [18].
Code 9 is gtkconv.c of funpindgin [19].
Code 10 is gtk-main-interface.c from project gwaei [20].
Code 11 is hvm.c of harbor-project.
Code 12 is hbmain.c of harbor-project.
Code 13 is harbor.c of harbor-project.
Table 4-1 shows the syntax node information of all the source code:

Table 4-1. Source code information. Columns: Code, Name, Efficacious Node Number, Max Child Node Num. Rows, in order from Code 4: Cmplie.cpp, Extenl.cpp, Hobbsh.cpp, Intrpt.cpp, Gtwvw.c, Gtkconv.c, G-M_I.c, Hvm.c, Hbmain.c, Harbor.c. (The numeric values are missing from this transcription.)

Experiment Result

Codes 4 to 13 are all large enough that the difference in comparison efficiency between the two versions of CodeCompare is easy to observe. There are 9 groups of experiments. Each group contains one source code file and one sample code file. Table 4-2 shows the results of the 9 groups of experiments.

According to the statistics in Table 4-2, the algorithm based on rehash classification has a clear advantage in reducing processing time. To describe the relationship between the modulus t in the rehash classification based algorithm and the proportion of time consumed, the value of t is set to 100 in all groups of experiments. Under ideal conditions, the theoretical proportion of time consumed should be 100:1. However, the child nodes are not necessarily distributed evenly over the classes, and the whole comparison process also includes other operations. The measured proportions, about 15:1 for the number of comparisons and a consistent 12:1 for the time consumed, are therefore reasonable, and they indeed show that the rehash classification based algorithm has a higher efficiency. In Table 4-2, Comparison Num means the number of comparisons between syntax nodes, and Time consumed stands for the duration of the comparison in milliseconds.

V. CONCLUSION

On the basis of source code comparison on the syntax tree structure, this paper introduces an algorithm for type redefinition detection. In addition, this paper proposes a rehash classification of syntax tree nodes, which we call the rehash syntax structure comparison algorithm. The proposed algorithm classifies the syntax tree nodes more precisely, which narrows the scale of code comparison into several smaller ranges.
Therefore, we reduce the complexity of the code comparison and improve its efficiency.
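The effect of the modulus t discussed in the experiment result can be illustrated with a small counting sketch. It is not the CodeCompare implementation: it assumes, hypothetically, that every syntax node has already been reduced to an integer hash value, and the function names and toy data are made up for illustration. Nodes are grouped into t classes by hash value modulo t, and only nodes that fall into the same class are compared.

```python
from collections import defaultdict


def classify(node_hashes, t):
    """Group node hash values into at most t classes by hash % t."""
    classes = defaultdict(list)
    for h in node_hashes:
        classes[h % t].append(h)
    return classes


def naive_comparison_count(source, sample):
    """Old scheme (assumed): every source node is compared with every sample node."""
    return len(source) * len(sample)


def rehash_comparison_count(source, sample, t):
    """New scheme (assumed): nodes are compared only within the same class."""
    src = classify(source, t)
    smp = classify(sample, t)
    return sum(len(nodes) * len(smp.get(c, ())) for c, nodes in src.items())


t = 100

# Ideal case: hash values spread evenly, 10 nodes per class on each side.
even = list(range(1000))
old = naive_comparison_count(even, even)
new = rehash_comparison_count(even, even, t)
print(old / new)  # 100.0, the theoretical t:1 proportion

# Skewed case: half of the nodes share one hash value, so one class dominates.
skewed = [0] * 500 + list(range(500))
old_skewed = naive_comparison_count(skewed, skewed)
new_skewed = rehash_comparison_count(skewed, skewed, t)
print(old_skewed / new_skewed)  # well below 100
```

With evenly filled classes the number of comparisons drops by exactly the factor t. Once the classes are unevenly filled the gain shrinks, which is in line with the measured proportion being lower than the theoretical 100:1.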

Table 4-2. Time consumed in comparison. Columns: No., Source, Sample, Comparison Num of Old algorithm, Comparison Num of New algorithm, Time consumed by Old algorithm (ms), Time consumed by New algorithm (ms). Source files of groups 1 to 9: Code4, Code5, Code6, Code8, Code10, Code8, Code9, Code6, Code11. Summary rows: Total, Average, Proportion. (The sample file numbers and the numeric values are missing from this transcription.)

ACKNOWLEDGEMENTS

This work was supported by National Natural Science Foundation of China (No ).

REFERENCES

[1] Lingxiao Jiang, Ghassan Misherghi, Zhendong Su, Stéphane Glondu. DECKARD: Scalable and Accurate Tree-based Detection of Code Clones. 29th International Conference on Software Engineering (ICSE'07), July.
[2] Johnson J. Substring matching for clone detection and change tracking. In Proceedings International Conference on Software Maintenance (ICSM 1994), IEEE Computer Society: Los Alamitos CA, 1994.
[3] Stéphane Ducasse, Oscar Nierstrasz and Matthias Rieger. On the effectiveness of clone detection by string matching.
[4] Toshihiro Kamiya, Shinji Kusumoto, and Katsuro Inoue. CCFinder: A multilinguistic token-based code clone detection system for large scale source code. IEEE Transactions on Software Engineering, 28(7), July.
[5] Han Lifang, Cui Baojiang. Type Redefinition Plagiarism Detection of Token-Based Comparison. Proceeding of IEEE The International Conference on Multimedia Information Networking and Security (MINES 2010).
[6] I.D. Baxter, A. Yahin, L. Moura, M. Sant'Anna, and L. Bier. Clone Detection Using Abstract Syntax Trees. Proc. IEEE Int'l Conf. Software Maintenance (ICSM '98), Nov.
[7] Cui Baojiang, Li Jiansong, Guo Tao. Code Comparison System Based on Abstract Syntax Tree. Proceeding of 2010 International Conference on Broadband Network and Multimedia Technology.
[8] R. Komondoor and S. Horwitz. Using slicing to identify duplication in source code. In SAS, pp. 40-56.
[9] Jens Krinke. Identifying similar code with program dependence graphs. IEEE, 2001.
[10] Kodhai. E. Detection of Type-1 and Type-2 Code Clones Using Textual Analysis and Metrics. International Conference on Recent Trends in Information, Telecommunication and Computing.
[11] Kazuaki Maeda. An extended line-based approach to detect code clones using syntactic and lexical information. The Seventh International Conference on Information Technology.
[12] Kazuaki Maeda. Code Clone Detection Using Parsing Action. Communications and Information Technology (ISCIT), International Symposium.
[13] Anna Corazza. A Tree Kernel Based Approach for Clone Detection. Software Maintenance (ICSM), 2010 IEEE International Conference.
[14] Yoshiki Higo, Shinji Kusumoto. Enhancing Quality of Code Clone Detection with Program Dependency Graph. Reverse Engineering, WCRE '09. 16th Working Conference.
[15] Yoshiki Higo, Shinji Kusumoto. Code Clone Detection on Specialized PDGs with Heuristics. Software Maintenance and Reengineering (CSMR), European Conference.
[16] sourceforge.net/projects/Yazoo/.
[17] sourceforge.net/projects/Harbor/.
[18] /
[19] sourceforge.net/projects/gwaei/.
[20]

Cui Baojiang is an Associate Professor in the School of Computer Science and Technology at Beijing University of Posts and Telecommunications, China. His main research areas include software and information security.

Guan Jun is a graduate student in the School of Computer Science and Technology at Beijing University of Posts and Telecommunications, China. His areas of research interest include software security and software code comparison.

Guo Tao is a researcher at the China Information Technology Security Evaluation Center. His area of research interest is software security.

Han Lifang is a graduate student in the School of Computer Science and Technology at Beijing University of Posts and Telecommunications, China. Her area of research interest is information security.

Wang Jianxin is an Associate Professor in the Computer Science Department of Beijing Forestry University, Beijing, China.
His research interests include data mining, artificial intelligence, etc.

Ji Yupeng is a graduate student in the School of Computer Science and Technology at Beijing University of Posts and Telecommunications, China. His areas of research interest include software security and software code comparison.


More information

SOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITY

SOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITY SOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITY Yoshihisa Udagawa Faculty of Engineering, Tokyo Polytechnic University, Atsugi City, Kanagawa, Japan udagawa@cs.t-kougei.ac.jp ABSTRACT Duplicate code

More information

Compiling clones: What happens?

Compiling clones: What happens? Compiling clones: What happens? Oleksii Kononenko, Cheng Zhang, and Michael W. Godfrey David R. Cheriton School of Computer Science University of Waterloo, Canada {okononen, c16zhang, migod}@uwaterloo.ca

More information

Toward a Taxonomy of Clones in Source Code: A Case Study

Toward a Taxonomy of Clones in Source Code: A Case Study Toward a Taxonomy of Clones in Source Code: A Case Study Cory Kapser and Michael W. Godfrey Software Architecture Group (SWAG) School of Computer Science, University of Waterloo fcjkapser, migodg@uwaterloo.ca

More information

Enhancing Program Dependency Graph Based Clone Detection Using Approximate Subgraph Matching

Enhancing Program Dependency Graph Based Clone Detection Using Approximate Subgraph Matching Enhancing Program Dependency Graph Based Clone Detection Using Approximate Subgraph Matching A DISSERTATION SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENT FOR THE AWARD OF THE DEGREE OF MASTER OF

More information

The Reverse Engineering in Oriented Aspect Detection of semantics clones

The Reverse Engineering in Oriented Aspect Detection of semantics clones International Journal of Scientific & Engineering Research Volume 3, Issue 5, May-2012 1 The Reverse Engineering in Oriented Aspect Detection of semantics clones Amel Belmabrouk, Belhadri Messabih Abstract-Attention

More information

Programming in C++ 4. The lexical basis of C++

Programming in C++ 4. The lexical basis of C++ Programming in C++ 4. The lexical basis of C++! Characters and tokens! Permissible characters! Comments & white spaces! Identifiers! Keywords! Constants! Operators! Summary 1 Characters and tokens A C++

More information

Extension of GCC with a fully manageable reverse engineering front end

Extension of GCC with a fully manageable reverse engineering front end Proceedings of the 7 th International Conference on Applied Informatics Eger, Hungary, January 28 31, 2007. Vol. 1. pp. 147 154. Extension of GCC with a fully manageable reverse engineering front end Csaba

More information

Thomas LaToza 5/5/2005 A Literature Review of Clone Detection Analysis

Thomas LaToza 5/5/2005 A Literature Review of Clone Detection Analysis Thomas LaToza 5/5/2005 A Literature Review of Clone Detection Analysis Introduction Code clones, pieces of code similar enough to be considered duplicates or clones of the same functionality, are a problem.

More information

Binghamton University. CS-211 Fall Syntax. What the Compiler needs to understand your program

Binghamton University. CS-211 Fall Syntax. What the Compiler needs to understand your program Syntax What the Compiler needs to understand your program 1 Pre-Processing Any line that starts with # is a pre-processor directive Pre-processor consumes that entire line Possibly replacing it with other

More information

Performance Evaluation and Comparative Analysis of Code- Clone-Detection Techniques and Tools

Performance Evaluation and Comparative Analysis of Code- Clone-Detection Techniques and Tools , pp. 31-50 http://dx.doi.org/10.14257/ijseia.2017.11.3.04 Performance Evaluation and Comparative Analysis of Code- Clone-Detection Techniques and Tools Harpreet Kaur 1 * (Assistant Professor) and Raman

More information

CS 415 Midterm Exam Spring 2002

CS 415 Midterm Exam Spring 2002 CS 415 Midterm Exam Spring 2002 Name KEY Email Address Student ID # Pledge: This exam is closed note, closed book. Good Luck! Score Fortran Algol 60 Compilation Names, Bindings, Scope Functional Programming

More information

DEPARTMENT OF MATHS, MJ COLLEGE

DEPARTMENT OF MATHS, MJ COLLEGE T. Y. B.Sc. Mathematics MTH- 356 (A) : Programming in C Unit 1 : Basic Concepts Syllabus : Introduction, Character set, C token, Keywords, Constants, Variables, Data types, Symbolic constants, Over flow,

More information

Introduction to Compiler Construction

Introduction to Compiler Construction Introduction to Compiler Construction ALSU Textbook Chapter 1.1 1.5 Tsan-sheng Hsu tshsu@iis.sinica.edu.tw http://www.iis.sinica.edu.tw/~tshsu 1 What is a compiler? Definitions: a recognizer ; a translator.

More information

Research of Automatic Scoring of Student Programs Based on Static Analysis

Research of Automatic Scoring of Student Programs Based on Static Analysis Journal of Electrical and Electronic Engineering 208; 6(2): 53-58 http://www.sciencepublishinggroup.com/j/jeee doi: 0.648/j.jeee.2080602.3 ISSN: 2329-63 (Print); ISSN: 2329-605 (Online) Research of Automatic

More information

Rochester Institute of Technology. Making personalized education scalable using Sequence Alignment Algorithm

Rochester Institute of Technology. Making personalized education scalable using Sequence Alignment Algorithm Rochester Institute of Technology Making personalized education scalable using Sequence Alignment Algorithm Submitted by: Lakhan Bhojwani Advisor: Dr. Carlos Rivero 1 1. Abstract There are many ways proposed

More information

SEMANTIC ANALYSIS TYPES AND DECLARATIONS

SEMANTIC ANALYSIS TYPES AND DECLARATIONS SEMANTIC ANALYSIS CS 403: Type Checking Stefan D. Bruda Winter 2015 Parsing only verifies that the program consists of tokens arranged in a syntactically valid combination now we move to check whether

More information

Detecting code re-use potential

Detecting code re-use potential Detecting code re-use potential Mario Konecki, Tihomir Orehovački, Alen Lovrenčić Faculty of Organization and Informatics University of Zagreb Pavlinska 2, 42000 Varaždin, Croatia {mario.konecki, tihomir.orehovacki,

More information

Tool Support for Refactoring Duplicated OO Code

Tool Support for Refactoring Duplicated OO Code Tool Support for Refactoring Duplicated OO Code Stéphane Ducasse and Matthias Rieger and Georges Golomingi Software Composition Group, Institut für Informatik (IAM) Universität Bern, Neubrückstrasse 10,

More information

Symbol Tables. ASU Textbook Chapter 7.6, 6.5 and 6.3. Tsan-sheng Hsu.

Symbol Tables. ASU Textbook Chapter 7.6, 6.5 and 6.3. Tsan-sheng Hsu. Symbol Tables ASU Textbook Chapter 7.6, 6.5 and 6.3 Tsan-sheng Hsu tshsu@iis.sinica.edu.tw http://www.iis.sinica.edu.tw/~tshsu 1 Definitions Symbol table: A data structure used by a compiler to keep track

More information

CA Compiler Construction

CA Compiler Construction CA4003 - Compiler Construction Semantic Analysis David Sinclair Semantic Actions A compiler has to do more than just recognise if a sequence of characters forms a valid sentence in the language. It must

More information

Towards the Code Clone Analysis in Heterogeneous Software Products

Towards the Code Clone Analysis in Heterogeneous Software Products Towards the Code Clone Analysis in Heterogeneous Software Products 11 TIJANA VISLAVSKI, ZORAN BUDIMAC AND GORDANA RAKIĆ, University of Novi Sad Code clones are parts of source code that were usually created

More information

Prof. Mohamed Hamada Software Engineering Lab. The University of Aizu Japan

Prof. Mohamed Hamada Software Engineering Lab. The University of Aizu Japan Language Processing Systems Prof. Mohamed Hamada Software Engineering Lab. The University of Aizu Japan Semantic Analysis Compiler Architecture Front End Back End Source language Scanner (lexical analysis)

More information

Compiler construction in4020 lecture 5

Compiler construction in4020 lecture 5 Compiler construction in4020 lecture 5 Semantic analysis Assignment #1 Chapter 6.1 Overview semantic analysis identification symbol tables type checking CS assignment yacc LLgen language grammar parser

More information