Fast Plagiarism Detection System


Fast Plagiarism Detection System

No Author Given
No Institute Given

Abstract. Plagiarism on programming courses has always been a difficult problem to counter: the large numbers of students following such courses and the limited vocabulary of a programming language combine to make it hard to detect plagiarism during marking. Various methods of plagiarism detection have been proposed, but all suffer from the complexity inherent in a many-to-many comparison of files. We have designed an efficient new algorithm to be implemented as a plagiarism detector, which offers the possibility of fast comparisons across a corpus of work. Our algorithm makes use of sparse suffix structures and allows rapid calculation of the similarity ratios between any file and a pre-existing collection.

1 Introduction

Programming is a vital part of a Computer Science qualification and also forms a key constituent of many other technical disciplines. For this reason, it is very important that the integrity of such qualifications should not be called into doubt through accusations of mass plagiarism among students. In addition, the issue of plagiarism and its apparently increasing prevalence at undergraduate level has been receiving a lot of coverage in the mainstream media [2, 11]. Universities should therefore be seen to be taking measures to lessen the impact of this problem. These facts highlight the need for a strong, competent strategy that will both prevent and detect plagiarism among students [1].

The large class sizes typical of an undergraduate programming course mean that it is near impossible for a human marker to accurately detect plagiarism, particularly if some attempt has been made to obfuscate the copying. There are many standard techniques that students will use to try to hide their plagiarism. While it would not be possible to list all of them here, they generally fall into one of the following two groups:

1. Lexical changes can be thought of as pre-processing techniques and do not generally require any in-depth knowledge of the programming language being used. A typical attempt to hide plagiarism using lexical changes may be to add or remove comments, adjust the coding style, or do a find-and-replace on identifiers.

2. Structural changes require significantly more language-specific knowledge, and their extra sophistication makes them harder to detect. Typical examples of structural changes are changing the type of loop construct from a for loop to a while loop, rewriting logical conditions to an equivalent form, or

replacing a switch statement with a series of if statements. More advanced forms of structural change might involve refactoring code into different methods.

While it would be desirable to detect all possible code transformations, there is a minimum level of acceptable performance for the application of detecting student plagiarism. Students who plagiarise another's work will normally feel they have to do so for one of two reasons: either they have not allocated sufficient time to complete the work, or they do not understand the subject well enough to complete it. It would be useful if the detector operated at a level such that fooling the algorithm would require the student to spend a large amount of time on the assignment and to have a good enough understanding to do the work without plagiarising.

2 Previous Work

The problem of detecting source-code plagiarism has concerned academics since the late 1970s, when Ottenstein used Halstead's basic metrics for measuring algorithms to try to find plagiarism in punch-card programming assignments [8, 4]. The metric-counting approach used by the initial plagiarism detectors relied on the assumption that two similar pieces of code would have similar software metrics. A simple software metric would be the number of commented lines in the original file, although more complicated metrics that are invariant to superficial changes were also used [14]. This approach had its advantages, in that it was simple and computationally inexpensive, but it was found to be quite ineffective in practice [13].

A second generation of plagiarism detectors has emerged that uses a tokenization technique to improve detection by looking at the structure of files, although this comes at the cost of computational time. These detectors work by pre-processing code to remove white-space and comments before converting the file into a tokenized string.
For example, a typical tokenization scheme might involve replacing all identifiers with the <IDT> token, all numbers by <VALUE>, and any loops by generic <BEGIN LOOP>...<END LOOP> tokens. More sophisticated techniques keep track of the usage of distinct identifiers, so that both the expressions a = a + b and x = x + y are mapped to the tokenized string <IDT1> = <IDT1> + <IDT2>. This enables the files to be compared at a less superficial level, but at the cost of a greater perceived similarity between files, as previously unrelated segments of code can produce the same token strings. The main advantage of such an approach is that it negates all lexical changes, and a good token set can also reduce the efficacy of many structural changes. Tokenization has the added benefit of being language independent: all that is needed to add detection support for a new language is a new parser to convert source code into token strings. Several tokenizing plagiarism detectors are now in use at various academic institutions; these include Sherlock at the University of Warwick [5], JPlag

at the Universität Karlsruhe [9] and MOSS at the University of California, Berkeley [10]. JPlag and MOSS are available as internet services for academics to use, while Sherlock is a stand-alone program. Our algorithm also makes use of tokenised versions of the input files, and we use suffix arrays as our indexing structure to enable efficient comparisons [7].

While all the above-mentioned systems use different algorithms, the core idea is the same: a many-to-many comparison of all files submitted for an assignment produces a list, sorted by some similarity score, that can then be used to determine which pairs are most likely to contain plagiarism. A naive implementation of this comparison, such as that used by Sherlock or JPlag, results in O(f(n) N^2) complexity, where N is the number of files in the collection and f(n) is the time to compare one pair of files of length n. For JPlag, f(n) = O(n^2) on average, or O(n^3) in the worst case, i.e. O(n^2 N^2) total average time (for Sherlock the complexity is even worse). Without loss of detection quality, our method achieves O(N(n + N)) average time by using indexing techniques based on suffix trees.

3 Algorithms and Complexity

Our proposed system is based on an index structure built over the entire file collection. Before the index is built, all the files in the collection are tokenized, as explained in Sec. 2. This is a simple parsing problem and can be solved in linear time; we used the well-known JavaCC system [3] to build our own tokenizer. For each of the N files in the collection, the output of the tokenizer for a file F_i is a string of n_i tokens, separated by #: token_1#token_2#...#token_{n_i}#. The total number of tokens is denoted by n = Σ n_i. The tokens are character strings, but for simplicity and without loss of generality we can assume that each token is an atomic symbol of the resulting token string.
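As an illustration of the identifier-tracking tokenization described above, the following sketch numbers identifiers in order of first appearance. It is written in Python for brevity (the paper's own tokenizer was built with JavaCC for Java source); the function name and the minimal token set are illustrative only.

```python
import re

def tokenize(code):
    """Map identifiers to numbered <IDTk> tokens by order of first use;
    number literals become <VALUE>.  A real tokenizer would also handle
    keywords, loop constructs and comments."""
    ids = {}
    tokens = []
    for m in re.finditer(r"[A-Za-z_]\w*|\d+|[^\s\w]", code):
        t = m.group()
        if t.isdigit():
            tokens.append("<VALUE>")
        elif re.match(r"[A-Za-z_]", t):
            if t not in ids:
                ids[t] = len(ids) + 1    # number identifiers by first appearance
            tokens.append(f"<IDT{ids[t]}>")
        else:
            tokens.append(t)
    return tokens

# Both statements map to the same token string, as described in Sec. 2:
assert tokenize("a = a + b") == tokenize("x = x + y")
```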
3.1 Index Structure

A suffix array is a lexicographically sorted array of all suffixes of a given string [7]. A string w is a suffix of a string s if s can be written as vw, where vw denotes the concatenation of the possibly empty strings v and w. The suffix array SA for a string T[1..n_i] of tokens contains all the suffixes of T in lexicographically sorted order, i.e. SA[i] = j if the suffix of T starting at position j is the lexicographically i-th suffix of T. Note that the suffixes themselves are not explicitly stored; an index into the original string suffices.

Our index structure is the suffix array of all n tokens in the source-code collection. Each entry in the array also contains the file identifier of the corresponding suffix. The suffix array for the whole document collection is therefore of size O(n), which we consider an acceptable memory requirement for modern hardware. The tokenization process also saves space in our index structure, since only the tokens are indexed, and it will also speed up the plagiarism detection phase. Fig. 1 shows an example suffix array.
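A minimal sketch of this index is given below. The naive sort-of-slices construction is for exposition only (it is far from the O(n log n) or O(n) construction discussed next); what matters is the shape of the structure: one (file id, offset) entry per token suffix, sorted by the suffix it denotes.

```python
def build_index(files):
    # files: list of token lists, one per file in the collection.
    # One entry (file_id, start) per token suffix, sorted by the suffix
    # itself.  A real implementation never materialises the suffixes.
    entries = [(f, i) for f, toks in enumerate(files) for i in range(len(toks))]
    entries.sort(key=lambda e: files[e[0]][e[1]:])
    return entries

files = [["<IDT>", "<MODIFIER>", "<IDT>", "{", "<MODIFIER>"],
         ["<MODIFIER>", "<IDT>"]]
sa = build_index(files)
assert len(sa) == 7    # one entry per token across the whole collection
```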

Algorithm 1 Compare a File Against an Existing Collection

1  p = 1                       // the first token of Q
2  WHILE p ≤ q − γ
3      find Q[p...p+γ−1] from the suffix array
4      IF Q[p...p+γ−1] was found
5          UpdateRepository
6          p = p + γ
7      ELSE
8          p = p + 1
9      END IF
10 END WHILE
11 FOR EVERY file F_i in the collection
12     Similarity(Q, F_i) = MatchedTokens(F_i) / q

The array can be constructed in O(n log n) time, assuming atomic comparison of two tokens. This can be improved to O(n) with a somewhat more complex algorithm [6]. A suffix array allows us to rapidly find a file (or files) containing any given substring. This is achieved with a binary search, and requires O(m + log n) time on average, where m is the length of the substring (it is also possible to make this the worst-case complexity, see [7]).

  {#<MODIFIER>#
  <IDT>#{#<MODIFIER>#
  <IDT>#<MODIFIER>#<IDT>#{#<MODIFIER>#
  <MODIFIER>#
  <MODIFIER>#<IDT>#{#<MODIFIER>#

Fig. 1. Suffix array contents for the tokenized string <IDT>#<MODIFIER>#<IDT>#{#<MODIFIER>#.

3.2 Comparing a File Against an Existing Collection

We now describe the algorithm for finding all files within the collection's index that are similar to a given query file. Pseudo-code is given in Algorithm 1. Algorithm 1 tries to find the substrings of the tokenised query file, Q, in the suffix array. Matching substrings are recorded, and each match contributes to the similarity score described later. But a closer look reveals a problem: which query-file substrings should be analyzed? The simplest solution is to try to match every token of the input file, but this leads to many false matches, since two files will be considered very similar when they are both composed of a small set of identical tokens, even if these tokens form totally different sequences. A greedy search approach would be to match the longest possible substring of the input file before continuing with the first mismatched token. This idea may

result in weak detection performance, since many interesting substrings would likely be skipped. Our idea is a trade-off between these extremes. We introduce a new parameter, γ, which stands for the length (in tokens) of the substrings to be matched. If γ = 1, the algorithm tries to match every token of the input file. By increasing γ, one can fine-tune the detection performance and also the time required for the detection process itself.

Algorithm 2 Update the Repository

1  Let S be the set of matches of Q[p...p+γ−1]
2  IF some of the strings in S are found in the same file, keep only the longest one
3  FOR every string M from the remaining list S
4      IF M doesn't intersect with any repository element
5          insert M into the repository
6      ELSE IF M is longer than any conflicting repository element
7          remove all conflicting repository elements
8          insert M into the repository
9      END IF
10 END FOR

The algorithm takes contiguous, non-overlapping token substrings of length γ from the query file and searches for all matching substrings in the index. If several of the matches found correspond to the same indexed file, these matches are extended to Γ tokens, Γ ≥ γ, such that only one of the original matches survives for each indexed file. Therefore, for each file in the index, the algorithm finds all matching substrings that are longer than the other matching substrings and whose lengths are at least γ tokens. These matches are recorded in a repository. This phase also includes a sanity check, as overlapping matches are not allowed.

The similarity between the file being tested and any file in the collection is simply the number of tokens matched in the collection file divided by the total number of tokens in the test file (so it is a value between 0 and 1). More precisely, let Q be the string of tokens for the query file, with q tokens in total. Algorithm 1 then proceeds to find the matches of the substrings Q[p...p+γ−1] in the index.
If no matches are found, the next substring is searched, i.e. p is increased by one. If at least one match is found, we update our match repository (Algorithm 2) and increase p by γ.

Collision Detection In Algorithm 3 we encounter two types of collisions. The first appears when more than one match is found in the same file. Obviously, we should not treat one occurrence of some string as five occurrences just because that string can be found five times in the file. Recall that the similarity between two files is a value between zero and one (or between 0% and 100%), computed as Similarity(Q, F_i) = MatchedTokens(F_i)/q, where q is the number of tokens in the query file. Therefore, the number of tokens stored in the repository for the file F_i should never be greater than the size of

the input file in tokens. Now consider a situation where some match is found several times in the same collection file, such that every token of the input file can be found twice in the collection file. The above formula would give us a 200% similarity, which is counter-intuitive. Collisions can also significantly distort the real situation: consider the same input query file compared to a different collection file, <type><type><type><type><type>. Here we obtain 100% similarity, which obviously does not reflect the real situation.

Algorithm 3 Updating the Repository (detailed)

1  Let S be the set of matches of Q[p...p+γ−1]
2  matchlist = { }
3  FOR EACH found occurrence M ∈ S
4      IF matchlist contains a match M' found in the same file as M
5          extend M  such that it matches Q[p...p+Γ_1−1], for maximal Γ_1 ≥ γ
6          extend M' such that it matches Q[p...p+Γ_2−1], for maximal Γ_2 ≥ γ
7          IF Γ_1 ≥ Γ_2 THEN M'' = M ELSE M'' = M'
8          remove M' from matchlist
9          insert M'' into matchlist
10     ELSE
11         insert M into matchlist
12     END IF
13 END FOR
14 FOR EACH match M ∈ matchlist
15     f = fileidentifier(M)
16     IF repository[f] = { }
17         repository[f] = {M}
18     ELSE
19         result = { }
20         FOR EACH match M' ∈ repository[f]
21             IF M' intersects with M THEN put M' into result
22         END FOR
23         len = the total length of all matches in result
24         IF len < length of the match M
25             remove all matches in result from repository[f]
26             add M to repository[f]
27         END IF
28     END IF
29 END FOR

Therefore, Algorithm 2 first takes the list of matches for Q[p...p+γ−1]. If the list has any two matches from the same file, these matches are extended with the tokens Q[p+γ],...,Q[p+Γ−1], until no match can be extended further (i.e. the token Q[p+Γ] would mismatch). If after this process the matches still cannot be differentiated (i.e. their lengths are still the same), we simply take the first one. At this point only one match is left for that file, and it matches the substring Q[p...p+Γ−1].
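The type-1 step of Algorithm 3 (lines 3-13) can be sketched as follows; the representation of matches as (file, offset) pairs and the explicit extension loop are illustrative choices, not the paper's actual data structures.

```python
def dedupe_per_file(matches, files, q, p):
    """Collision type 1: several matches of Q[p...p+γ-1] in one file.
    Extend each match as far as it keeps agreeing with Q; keep the
    longest per file (first wins on ties)."""
    def extend(f, i):
        # Length of the longest common prefix of Q[p:] and files[f][i:].
        L = 0
        toks = files[f]
        while p + L < len(q) and i + L < len(toks) and q[p + L] == toks[i + L]:
            L += 1
        return L
    best = {}
    for f, i in matches:
        L = extend(f, i)
        if f not in best or L > best[f][1]:
            best[f] = (i, L)
    return [(f, i, L) for f, (i, L) in best.items()]

files = [list("xabxaby")]
# Two matches of the 2-token pattern "ab" in file 0, at offsets 1 and 4.
result = dedupe_per_file([(0, 1), (0, 4)], files, q=list("aby"), p=0)
assert result == [(0, 4, 3)]    # the offset-4 match extends to 3 tokens
```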
Note that this process may end up with a different Γ value for each file in the index.
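The second phase of Algorithm 3 (lines 14-29) then inserts the surviving match into the per-file repository under a longest-wins rule. A sketch, representing each match as a hypothetical (start, length) interval in the indexed file:

```python
def update_repository(repo, match):
    """Longest-wins rule: the new match displaces the stored matches it
    overlaps only if it is longer than their combined length.
    repo: list of (start, length) intervals; match: (start, length)."""
    s, L = match
    clash = [(cs, cL) for (cs, cL) in repo if cs < s + L and s < cs + cL]
    if sum(cL for _, cL in clash) < L:
        repo = [m for m in repo if m not in clash] + [match]
    return repo

assert update_repository([(0, 3)], (1, 5)) == [(1, 5)]    # longer match wins
assert update_repository([(0, 3)], (2, 2)) == [(0, 3)]    # shorter match rejected
```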

The resulting pruned match list is then checked for collisions of type 2, and the non-conflicting matches are put into the match repository (see Algorithms 2 and 3).

The second collision type is the reverse of the previous problem: we should not allow two different places in the input file to correspond to the same place in some collection file. Collisions of type 2 cause the same problems: if different places in the input file may be matched against the same place in the collection file, then a query file <type><type><type><type> whose every token is matched against the same single place in a collection file would score 100% similarity; another counter-intuitive result. To resolve the difficulty we use a longest-wins heuristic: we sum the lengths of all the previous matches that intersect with the current one, and if the current match is longer, we use it to replace the intersecting previous matches (see Alg. 3).

Examples Here we provide some examples to illustrate our collision-resolving techniques. Suppose the file to be tested contains three variable definitions:

int i = 10; int j = 15; int k = 20;

These are tokenized to three copies of the string <type><idt>=<idt>;. Some file in the collection contains the block:

float d = 11.5; int k = 100; int h = k; class T {

This block is tokenized into the strings <type><idt>=<idt>; (three times) followed by <class><idt>{.

For the very beginning of the input file, with γ = 3, our search routine will find three occurrences of the string <type><idt>= in the collection file. After that, the system has to decide which occurrence to keep. The longest-wins heuristic

will show that the chunk float d = is the best match, since the symbols following it form the longest occurrence in the collection file:

Occurrence 1: <type><idt>=<idt>
Occurrence 2: <type><idt>=<idt>;
Occurrence 3: <type><idt>=<idt>;

After this initial pass, the search routine deals with the second variable definition (int j = 15;). Its tokenized γ-prefix <type><idt>= will again be found in the same three places in the collection file. However, the situation is now different: we should not only select the best match (collision type 1), but also take into account that the chunk float d = has already been matched earlier (collision type 2). The algorithm for resolving collisions of type 1 says that float d = is the best match again, leaving us two choices: either reject this match and do not record it, or remove the previous match from the repository and insert this one. The decision is again made according to the longest-wins heuristic, and the original match remains in the repository.

3.3 Complexity

The complexity of Algorithm 1 is highly dependent on the value of the γ parameter. Line 3 of Algorithm 1 takes O(γ + log n) average time, where n is the total number of tokens in the collection (assuming atomic token comparisons). In the worst case this becomes O(γ log n) if plain binary search is used, but the worst-case time can be improved to match the average time by storing some extra information in the suffix array [7]. Therefore, the total average time is at most O(q(γ + log n)), where q is the number of tokens in the query file, assuming that the substrings of Q are never found. On the other hand, whenever a substring of Q is found, we call Algorithm 3. This can happen at most O(q/γ) times, so line 5 takes at most O(q/γ) times the complexity of Algorithm 3. This complexity depends mainly on how many matches we have, on average, when searching for γ-length strings in the suffix structure.
If we make the simplifying assumption that two randomly picked tokens match each other (independently) with fixed probability p, then on average we obtain n·p^γ matches for substrings of length γ. This decreases exponentially in γ, and becomes O(1) for γ = Θ(log_{1/p} n). The total complexity of Algorithm 3 is then, on average, at most O((q/γ · n·p^γ)^2). To keep the total average complexity of Algorithm 1 at O(q(γ + log n)), it is enough that γ = Ω(log_{1/p} n). This results in O(q log n) total average time. Since we require γ = Ω(log n), and may adjust γ upwards to tune the quality of the detection results, we state the time bound as O(qγ).

Figure 2 shows the results of timed measurements of our suffix-array-based implementation on a set derived from students' work. These are CPU times from a 2.4 GHz Celeron with 256 MB RAM running Windows XP and Sun's Java 1.4.

Fig. 2. Time required to score a single file against a collection.

Finally, the scores for each file can be computed in O(N) time. To summarize, the total average complexity of Algorithm 1 can be made O(q(γ + log n) + N) = O(qγ + N). The O(γ + log n) factor can be reduced to O(1) (worst case) using suffix trees [12] with suffix links instead of suffix arrays, which would result in O(q + N) total time. Note that we have excluded the tokenization of Q, and that we have counted tokens rather than characters. However, tokenization is a simple linear-time process, and the number of tokens depends linearly on the file length.

3.4 All Against All Comparison

To compare every file against every other, we simply run Algorithm 1 for every file in our collection. Every file pair then gets two scores: one when file a is compared to file b, and one from the reverse comparison, since the comparison is not symmetric. We use the average of these two scores as the final score for the pair.
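This symmetrisation amounts to averaging the two directed scores. A sketch, using a toy directed score in place of the real one produced by Algorithm 1:

```python
def pairwise_scores(files, score):
    # score(a, b) is the directed similarity of file a against file b;
    # the final score of a pair is the mean of the two directions.
    N = len(files)
    return {(a, b): (score(files[a], files[b]) + score(files[b], files[a])) / 2
            for a in range(N) for b in range(a + 1, N)}

# Toy directed score: fraction of a's tokens that occur anywhere in b.
overlap = lambda a, b: sum(t in set(b) for t in a) / len(a)

files = [list("aab"), list("ab"), list("xy")]
scores = pairwise_scores(files, overlap)
assert scores[(0, 1)] == 1.0 and scores[(0, 2)] == 0.0
```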

Summing the cost of this procedure over all N files in the collection, we obtain a total complexity of O(nγ + N^2), including the time to build the suffix-array index structure (see also Figure 3). With suffix trees this can be made O(n + N^2).

Fig. 3. Time required to test all files in the collection against each other.

Note that any plagiarism detection routine based on pairwise file comparisons will have a complexity of at least O(f(n/N) N^2), where f(n) is the complexity of comparing two files of length n. In our terms each file is approximately n/N tokens, hence e.g. JPlag would take O(n^2) total average time. The N^2 term is negligible compared to n^2 since n ≥ N, unless the files are only one token each.

4 Evaluation of the System

It is not feasible in the near future to compare our system's results with a human expert's opinion on real-world datasets, as a human would not have the time to conduct a thorough comparison of every possible file pair; this would also be a very error-prone process. However, we can examine the reports produced by different plagiarism detection systems when used on the same dataset. The systems used for the analysis were MOSS [10], JPlag [9] and Sherlock [5]. Every system produced a report on the same real collection, consisting of 220 undergraduate students' Java programs. These reports were

summarized and displayed as a diagram, a reduced version of which is shown in Figure 4. The figure shows results for only 50 of the 220 files.

Fig. 4. The different systems' reports summarized in a single diagram.

This diagram shows the score for every suspicious file in the collection (file pairs are not displayed here). Since the systems can be fine-tuned to show more or fewer files, we tried to obtain outputs of equal size. The raw report of each system is a list of pairs with a calculated similarity ratio between their members. Although the opinions of the tested systems differ for many of the files, most files are either detected or rejected by the majority of systems. This simple approach (considering only detection or rejection) allows us to organize a voting experiment. Let S_i be the number of jury systems (MOSS, JPlag and Sherlock) that marked file i as suspicious. If S_i ≥ 2, we should expect our system to mark this file as well. If S_i < 2, the file should, in general, remain unmarked. For the test set consisting of 155 files marked by at least one program, our system agreed with the jury in 115 cases (and, correspondingly, disagreed in 40 cases). This result is more conformist than the results obtained when the same experiment was run on each of the other three tested systems; each system was tested while the other three acted as jury. All results are shown in Table 1.

Table 1. Agreement between plagiarism detectors

            MOSS   JPlag   Our System   Sherlock
Agreed        -      -        115          -
Disagreed     -      -         40          -
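The voting scheme itself can be stated in a few lines; the file names and per-system marks below are invented for illustration.

```python
def jury_expects(marks, threshold=2):
    # A file is expected to be flagged if at least `threshold` of the
    # jury systems (here MOSS, JPlag and Sherlock) marked it suspicious.
    return {f for f, systems in marks.items() if len(systems) >= threshold}

marks = {"A.java": {"MOSS", "JPlag"},
         "B.java": {"Sherlock"},
         "C.java": {"MOSS", "JPlag", "Sherlock"}}
assert jury_expects(marks) == {"A.java", "C.java"}
```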

So we can claim that our system is at least no worse than the other tested systems, and an argument could be made that it is better, since its results correlate best with those of all three other systems. However, the subtleties of the different algorithms used by each system could mean that the data we used was particularly suited to our system, and other software may produce better results on other collections.

5 Conclusions and Future Work

We have developed a new fast algorithm for plagiarism detection. Our method is based on indexing the code database with a suffix array, which allows rapid retrieval of blocks of code that are similar to the query file. This makes rapid pairwise file comparison possible. Evaluation shows that the algorithm's quality is no worse than that of existing widely used methods, while its speed is much higher.

For the all-against-all problem our method achieves O(γn) (with suffix arrays) or O(n) (with suffix trees) average time for the comparison phase. Traditional methods, such as JPlag, need at least O((n/N)^2 N^2) = O(n^2) average time for the same task. In addition, computing the similarity matrix takes O(N^2) additional time, and this cannot be improved, as it is also the size of the output. However, one is usually interested only in similarity scores above a certain threshold, or only in the h highest similarity scores (where h ≪ N^2 is a parameter). This would allow the O(N^2) factor to be reduced, should it become an issue.

The main motivation for this work was plagiarism detection; however, there are other applications for the method. For example, the algorithm can detect similar blocks of code in a large software system, revealing good places for refactoring.
In the future we would like to see a full implementation of the algorithm as part of a plagiarism detection system, which would then allow for a full visualisation of the matches found and enable the results to be used in a real-world context. We would also like to investigate the effects of varying the γ parameter on the quality of the results, and to undertake further comparisons with other source-code plagiarism detectors.

References

1. J. Carrol and J. Appleton. Plagiarism: A good practice guide. JISC.
2. P. Curtis. Quarter of students plagiarise essays. Guardian Unlimited.
3. O. Enseling. Build your own languages with JavaCC. JavaWorld.
4. M. H. Halstead. Elements of Software Science. Operating and Programming Systems Series. Elsevier North-Holland, New York.
5. M. S. Joy and M. Luck. Plagiarism in programming assignments. IEEE Transactions on Education, 42(2), May 1999.

6. J. Kärkkäinen and P. Sanders. Simple linear work suffix array construction. In ICALP: Annual International Colloquium on Automata, Languages and Programming.
7. U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. In SODA '90: Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics.
8. K. J. Ottenstein. An algorithmic approach to the detection and prevention of plagiarism. SIGCSE Bull., 8(4):30-41.
9. L. Prechelt, G. Malpohl, and M. Phlippsen. JPlag: Finding plagiarisms among a set of programs. Technical report, Fakultät für Informatik, Universität Karlsruhe.
10. S. Schleimer, D. S. Wilkerson, and A. Aiken. Winnowing: local algorithms for document fingerprinting. In SIGMOD '03: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data. ACM Press.
11. L. Thompson. Educators blame internet for rise in student cheating. The Seattle Times.
12. E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14:249-260, 1995.
13. K. L. Verco and M. J. Wise. Plagiarism à la mode: A comparison of automated systems for detecting suspected plagiarism. The Computer Journal, 39(9).
14. G. Whale. Software metrics and plagiarism detection. Journal of Systems and Software, 13(2), 1990.


More information

Plagiarism Detection: An Architectural and Semantic Approach. Matthew Salisbury. Computing. Session 2009

Plagiarism Detection: An Architectural and Semantic Approach. Matthew Salisbury. Computing. Session 2009 Plagiarism Detection: An Architectural and Semantic Approach Matthew Salisbury Computing Session 2009 The candidate confirms that the work submitted is their own and the appropriate credit has been given

More information

Application of the BWT Method to Solve the Exact String Matching Problem

Application of the BWT Method to Solve the Exact String Matching Problem Application of the BWT Method to Solve the Exact String Matching Problem T. W. Chen and R. C. T. Lee Department of Computer Science National Tsing Hua University, Hsinchu, Taiwan chen81052084@gmail.com

More information

Solution to Problem 1 of HW 2. Finding the L1 and L2 edges of the graph used in the UD problem, using a suffix array instead of a suffix tree.

Solution to Problem 1 of HW 2. Finding the L1 and L2 edges of the graph used in the UD problem, using a suffix array instead of a suffix tree. Solution to Problem 1 of HW 2. Finding the L1 and L2 edges of the graph used in the UD problem, using a suffix array instead of a suffix tree. The basic approach is the same as when using a suffix tree,

More information

Optimal Parallel Randomized Renaming

Optimal Parallel Randomized Renaming Optimal Parallel Randomized Renaming Martin Farach S. Muthukrishnan September 11, 1995 Abstract We consider the Renaming Problem, a basic processing step in string algorithms, for which we give a simultaneously

More information

Indexing: Part IV. Announcements (February 17) Keyword search. CPS 216 Advanced Database Systems

Indexing: Part IV. Announcements (February 17) Keyword search. CPS 216 Advanced Database Systems Indexing: Part IV CPS 216 Advanced Database Systems Announcements (February 17) 2 Homework #2 due in two weeks Reading assignments for this and next week The query processing survey by Graefe Due next

More information

Full-Text Search on Data with Access Control

Full-Text Search on Data with Access Control Full-Text Search on Data with Access Control Ahmad Zaky School of Electrical Engineering and Informatics Institut Teknologi Bandung Bandung, Indonesia 13512076@std.stei.itb.ac.id Rinaldi Munir, S.T., M.T.

More information

SOURCE CODE PLAGIARISM DETECTION FOR PHP LANGUAGE

SOURCE CODE PLAGIARISM DETECTION FOR PHP LANGUAGE SOURCE CODE PLAGIARISM DETECTION FOR PHP LANGUAGE Richard Všianský 1, Dita Dlabolová 1, Tomáš Foltýnek 1 1 Mendel University in Brno, Czech Republic Volume 3 Issue 2 ISSN 2336-6494 www.ejobsat.com ABSTRACT

More information

Frequent Itemsets Melange

Frequent Itemsets Melange Frequent Itemsets Melange Sebastien Siva Data Mining Motivation and objectives Finding all frequent itemsets in a dataset using the traditional Apriori approach is too computationally expensive for datasets

More information

PAPER Constructing the Suffix Tree of a Tree with a Large Alphabet

PAPER Constructing the Suffix Tree of a Tree with a Large Alphabet IEICE TRANS. FUNDAMENTALS, VOL.E8??, NO. JANUARY 999 PAPER Constructing the Suffix Tree of a Tree with a Large Alphabet Tetsuo SHIBUYA, SUMMARY The problem of constructing the suffix tree of a tree is

More information

Character Recognition

Character Recognition Character Recognition 5.1 INTRODUCTION Recognition is one of the important steps in image processing. There are different methods such as Histogram method, Hough transformation, Neural computing approaches

More information

6.001 Notes: Section 4.1

6.001 Notes: Section 4.1 6.001 Notes: Section 4.1 Slide 4.1.1 In this lecture, we are going to take a careful look at the kinds of procedures we can build. We will first go back to look very carefully at the substitution model,

More information

MergeSort, Recurrences, Asymptotic Analysis Scribe: Michael P. Kim Date: September 28, 2016 Edited by Ofir Geri

MergeSort, Recurrences, Asymptotic Analysis Scribe: Michael P. Kim Date: September 28, 2016 Edited by Ofir Geri CS161, Lecture 2 MergeSort, Recurrences, Asymptotic Analysis Scribe: Michael P. Kim Date: September 28, 2016 Edited by Ofir Geri 1 Introduction Today, we will introduce a fundamental algorithm design paradigm,

More information

An Information Retrieval Approach for Source Code Plagiarism Detection

An Information Retrieval Approach for Source Code Plagiarism Detection -2014: An Information Retrieval Approach for Source Code Plagiarism Detection Debasis Ganguly, Gareth J. F. Jones CNGL: Centre for Global Intelligent Content School of Computing, Dublin City University

More information

Implementing a Statically Adaptive Software RAID System

Implementing a Statically Adaptive Software RAID System Implementing a Statically Adaptive Software RAID System Matt McCormick mattmcc@cs.wisc.edu Master s Project Report Computer Sciences Department University of Wisconsin Madison Abstract Current RAID systems

More information

Analyzing Dshield Logs Using Fully Automatic Cross-Associations

Analyzing Dshield Logs Using Fully Automatic Cross-Associations Analyzing Dshield Logs Using Fully Automatic Cross-Associations Anh Le 1 1 Donald Bren School of Information and Computer Sciences University of California, Irvine Irvine, CA, 92697, USA anh.le@uci.edu

More information

Code generation for modern processors

Code generation for modern processors Code generation for modern processors Definitions (1 of 2) What are the dominant performance issues for a superscalar RISC processor? Refs: AS&U, Chapter 9 + Notes. Optional: Muchnick, 16.3 & 17.1 Instruction

More information

This is a repository copy of A Rule Chaining Architecture Using a Correlation Matrix Memory.

This is a repository copy of A Rule Chaining Architecture Using a Correlation Matrix Memory. This is a repository copy of A Rule Chaining Architecture Using a Correlation Matrix Memory. White Rose Research Online URL for this paper: http://eprints.whiterose.ac.uk/88231/ Version: Submitted Version

More information

Plagiarism and its Detection in Programming Languages

Plagiarism and its Detection in Programming Languages Plagiarism and its Detection in Programming Languages Sanjay Goel, Deepak Rao et. al. Abstract Program similarity checking is an important of programming education fields. The increase of material now

More information

Code generation for modern processors

Code generation for modern processors Code generation for modern processors What are the dominant performance issues for a superscalar RISC processor? Refs: AS&U, Chapter 9 + Notes. Optional: Muchnick, 16.3 & 17.1 Strategy il il il il asm

More information

Implementation of Customized FindBugs Detectors

Implementation of Customized FindBugs Detectors Implementation of Customized FindBugs Detectors Jerry Zhang Department of Computer Science University of British Columbia jezhang@cs.ubc.ca ABSTRACT There are a lot of static code analysis tools to automatically

More information

Space Efficient Linear Time Construction of

Space Efficient Linear Time Construction of Space Efficient Linear Time Construction of Suffix Arrays Pang Ko and Srinivas Aluru Dept. of Electrical and Computer Engineering 1 Laurence H. Baker Center for Bioinformatics and Biological Statistics

More information

A Rule Chaining Architecture Using a Correlation Matrix Memory. James Austin, Stephen Hobson, Nathan Burles, and Simon O Keefe

A Rule Chaining Architecture Using a Correlation Matrix Memory. James Austin, Stephen Hobson, Nathan Burles, and Simon O Keefe A Rule Chaining Architecture Using a Correlation Matrix Memory James Austin, Stephen Hobson, Nathan Burles, and Simon O Keefe Advanced Computer Architectures Group, Department of Computer Science, University

More information

Computing the Longest Common Substring with One Mismatch 1

Computing the Longest Common Substring with One Mismatch 1 ISSN 0032-9460, Problems of Information Transmission, 2011, Vol. 47, No. 1, pp. 1??. c Pleiades Publishing, Inc., 2011. Original Russian Text c M.A. Babenko, T.A. Starikovskaya, 2011, published in Problemy

More information

3 No-Wait Job Shops with Variable Processing Times

3 No-Wait Job Shops with Variable Processing Times 3 No-Wait Job Shops with Variable Processing Times In this chapter we assume that, on top of the classical no-wait job shop setting, we are given a set of processing times for each operation. We may select

More information

Speed and Accuracy using Four Boolean Query Systems

Speed and Accuracy using Four Boolean Query Systems From:MAICS-99 Proceedings. Copyright 1999, AAAI (www.aaai.org). All rights reserved. Speed and Accuracy using Four Boolean Query Systems Michael Chui Computer Science Department and Cognitive Science Program

More information

7. Decision or classification trees

7. Decision or classification trees 7. Decision or classification trees Next we are going to consider a rather different approach from those presented so far to machine learning that use one of the most common and important data structure,

More information

1 Definition of Reduction

1 Definition of Reduction 1 Definition of Reduction Problem A is reducible, or more technically Turing reducible, to problem B, denoted A B if there a main program M to solve problem A that lacks only a procedure to solve problem

More information

REDUCING GRAPH COLORING TO CLIQUE SEARCH

REDUCING GRAPH COLORING TO CLIQUE SEARCH Asia Pacific Journal of Mathematics, Vol. 3, No. 1 (2016), 64-85 ISSN 2357-2205 REDUCING GRAPH COLORING TO CLIQUE SEARCH SÁNDOR SZABÓ AND BOGDÁN ZAVÁLNIJ Institute of Mathematics and Informatics, University

More information

3 SOLVING PROBLEMS BY SEARCHING

3 SOLVING PROBLEMS BY SEARCHING 48 3 SOLVING PROBLEMS BY SEARCHING A goal-based agent aims at solving problems by performing actions that lead to desirable states Let us first consider the uninformed situation in which the agent is not

More information

MergeSort, Recurrences, Asymptotic Analysis Scribe: Michael P. Kim Date: April 1, 2015

MergeSort, Recurrences, Asymptotic Analysis Scribe: Michael P. Kim Date: April 1, 2015 CS161, Lecture 2 MergeSort, Recurrences, Asymptotic Analysis Scribe: Michael P. Kim Date: April 1, 2015 1 Introduction Today, we will introduce a fundamental algorithm design paradigm, Divide-And-Conquer,

More information

Indexing Variable Length Substrings for Exact and Approximate Matching

Indexing Variable Length Substrings for Exact and Approximate Matching Indexing Variable Length Substrings for Exact and Approximate Matching Gonzalo Navarro 1, and Leena Salmela 2 1 Department of Computer Science, University of Chile gnavarro@dcc.uchile.cl 2 Department of

More information

DESIGN AND ANALYSIS OF ALGORITHMS. Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES

DESIGN AND ANALYSIS OF ALGORITHMS. Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES DESIGN AND ANALYSIS OF ALGORITHMS Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES http://milanvachhani.blogspot.in USE OF LOOPS As we break down algorithm into sub-algorithms, sooner or later we shall

More information

Singular Value Decomposition, and Application to Recommender Systems

Singular Value Decomposition, and Application to Recommender Systems Singular Value Decomposition, and Application to Recommender Systems CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Recommendation

More information

COMPILER DESIGN. For COMPUTER SCIENCE

COMPILER DESIGN. For COMPUTER SCIENCE COMPILER DESIGN For COMPUTER SCIENCE . COMPILER DESIGN SYLLABUS Lexical analysis, parsing, syntax-directed translation. Runtime environments. Intermediate code generation. ANALYSIS OF GATE PAPERS Exam

More information

Welfare Navigation Using Genetic Algorithm

Welfare Navigation Using Genetic Algorithm Welfare Navigation Using Genetic Algorithm David Erukhimovich and Yoel Zeldes Hebrew University of Jerusalem AI course final project Abstract Using standard navigation algorithms and applications (such

More information

Leveraging Transitive Relations for Crowdsourced Joins*

Leveraging Transitive Relations for Crowdsourced Joins* Leveraging Transitive Relations for Crowdsourced Joins* Jiannan Wang #, Guoliang Li #, Tim Kraska, Michael J. Franklin, Jianhua Feng # # Department of Computer Science, Tsinghua University, Brown University,

More information

Theorem 2.9: nearest addition algorithm

Theorem 2.9: nearest addition algorithm There are severe limits on our ability to compute near-optimal tours It is NP-complete to decide whether a given undirected =(,)has a Hamiltonian cycle An approximation algorithm for the TSP can be used

More information

Adding Source Code Searching Capability to Yioop

Adding Source Code Searching Capability to Yioop Adding Source Code Searching Capability to Yioop Advisor - Dr Chris Pollett Committee Members Dr Sami Khuri and Dr Teng Moh Presented by Snigdha Rao Parvatneni AGENDA Introduction Preliminary work Git

More information

Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES

Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES DESIGN AND ANALYSIS OF ALGORITHMS Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES http://milanvachhani.blogspot.in USE OF LOOPS As we break down algorithm into sub-algorithms, sooner or later we shall

More information

Chapter 6 Evaluation Metrics and Evaluation

Chapter 6 Evaluation Metrics and Evaluation Chapter 6 Evaluation Metrics and Evaluation The area of evaluation of information retrieval and natural language processing systems is complex. It will only be touched on in this chapter. First the scientific

More information

Scalable Trigram Backoff Language Models

Scalable Trigram Backoff Language Models Scalable Trigram Backoff Language Models Kristie Seymore Ronald Rosenfeld May 1996 CMU-CS-96-139 School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 This material is based upon work

More information

Repeating Segment Detection in Songs using Audio Fingerprint Matching

Repeating Segment Detection in Songs using Audio Fingerprint Matching Repeating Segment Detection in Songs using Audio Fingerprint Matching Regunathan Radhakrishnan and Wenyu Jiang Dolby Laboratories Inc, San Francisco, USA E-mail: regu.r@dolby.com Institute for Infocomm

More information

Lectures 6+7: Zero-Leakage Solutions

Lectures 6+7: Zero-Leakage Solutions Lectures 6+7: Zero-Leakage Solutions Contents 1 Overview 1 2 Oblivious RAM 1 3 Oblivious RAM via FHE 2 4 Oblivious RAM via Symmetric Encryption 4 4.1 Setup........................................ 5 4.2

More information

Searching Algorithms/Time Analysis

Searching Algorithms/Time Analysis Searching Algorithms/Time Analysis CSE21 Fall 2017, Day 8 Oct 16, 2017 https://sites.google.com/a/eng.ucsd.edu/cse21-fall-2017-miles-jones/ (MinSort) loop invariant induction Loop invariant: After the

More information

Introduction to Programming in C Department of Computer Science and Engineering. Lecture No. #17. Loops: Break Statement

Introduction to Programming in C Department of Computer Science and Engineering. Lecture No. #17. Loops: Break Statement Introduction to Programming in C Department of Computer Science and Engineering Lecture No. #17 Loops: Break Statement (Refer Slide Time: 00:07) In this session we will see one more feature that is present

More information

Flexible Coloring. Xiaozhou Li a, Atri Rudra b, Ram Swaminathan a. Abstract

Flexible Coloring. Xiaozhou Li a, Atri Rudra b, Ram Swaminathan a. Abstract Flexible Coloring Xiaozhou Li a, Atri Rudra b, Ram Swaminathan a a firstname.lastname@hp.com, HP Labs, 1501 Page Mill Road, Palo Alto, CA 94304 b atri@buffalo.edu, Computer Sc. & Engg. dept., SUNY Buffalo,

More information

Fast and Simple Algorithms for Weighted Perfect Matching

Fast and Simple Algorithms for Weighted Perfect Matching Fast and Simple Algorithms for Weighted Perfect Matching Mirjam Wattenhofer, Roger Wattenhofer {mirjam.wattenhofer,wattenhofer}@inf.ethz.ch, Department of Computer Science, ETH Zurich, Switzerland Abstract

More information

Process Model Improvement for Source Code Plagiarism Detection in Student Programming Assignments

Process Model Improvement for Source Code Plagiarism Detection in Student Programming Assignments Informatics in Education, 2016, Vol. 15, No. 1, 103 126 2016 Vilnius University DOI: 10.15388/infedu.2016.06 103 Process Model Improvement for Source Code Plagiarism Detection in Student Programming Assignments

More information

ASCERTAINING THE RELEVANCE MODEL OF A WEB SEARCH-ENGINE BIPIN SURESH

ASCERTAINING THE RELEVANCE MODEL OF A WEB SEARCH-ENGINE BIPIN SURESH ASCERTAINING THE RELEVANCE MODEL OF A WEB SEARCH-ENGINE BIPIN SURESH Abstract We analyze the factors contributing to the relevance of a web-page as computed by popular industry web search-engines. We also

More information

Lecture 5: Suffix Trees

Lecture 5: Suffix Trees Longest Common Substring Problem Lecture 5: Suffix Trees Given a text T = GGAGCTTAGAACT and a string P = ATTCGCTTAGCCTA, how do we find the longest common substring between them? Here the longest common

More information

Lecture L16 April 19, 2012

Lecture L16 April 19, 2012 6.851: Advanced Data Structures Spring 2012 Prof. Erik Demaine Lecture L16 April 19, 2012 1 Overview In this lecture, we consider the string matching problem - finding some or all places in a text where

More information

CSCI B522 Lecture 11 Naming and Scope 8 Oct, 2009

CSCI B522 Lecture 11 Naming and Scope 8 Oct, 2009 CSCI B522 Lecture 11 Naming and Scope 8 Oct, 2009 Lecture notes for CS 6110 (Spring 09) taught by Andrew Myers at Cornell; edited by Amal Ahmed, Fall 09. 1 Static vs. dynamic scoping The scope of a variable

More information

Algorithms. Lecture Notes 5

Algorithms. Lecture Notes 5 Algorithms. Lecture Notes 5 Dynamic Programming for Sequence Comparison The linear structure of the Sequence Comparison problem immediately suggests a dynamic programming approach. Naturally, our sub-instances

More information

Week - 03 Lecture - 18 Recursion. For the last lecture of this week, we will look at recursive functions. (Refer Slide Time: 00:05)

Week - 03 Lecture - 18 Recursion. For the last lecture of this week, we will look at recursive functions. (Refer Slide Time: 00:05) Programming, Data Structures and Algorithms in Python Prof. Madhavan Mukund Department of Computer Science and Engineering Indian Institute of Technology, Madras Week - 03 Lecture - 18 Recursion For the

More information

A Two-Expert Approach to File Access Prediction

A Two-Expert Approach to File Access Prediction A Two-Expert Approach to File Access Prediction Wenjing Chen Christoph F. Eick Jehan-François Pâris 1 Department of Computer Science University of Houston Houston, TX 77204-3010 tigerchenwj@yahoo.com,

More information

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015 University of Virginia Department of Computer Science CS 4501: Information Retrieval Fall 2015 2:00pm-3:30pm, Tuesday, December 15th Name: ComputingID: This is a closed book and closed notes exam. No electronic

More information

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search Informal goal Clustering Given set of objects and measure of similarity between them, group similar objects together What mean by similar? What is good grouping? Computation time / quality tradeoff 1 2

More information

Today: Amortized Analysis (examples) Multithreaded Algs.

Today: Amortized Analysis (examples) Multithreaded Algs. Today: Amortized Analysis (examples) Multithreaded Algs. COSC 581, Algorithms March 11, 2014 Many of these slides are adapted from several online sources Reading Assignments Today s class: Chapter 17 (Amortized

More information

The Potential of Prototype Styles of Generalization. D. Randall Wilson Tony R. Martinez

The Potential of Prototype Styles of Generalization. D. Randall Wilson Tony R. Martinez Proceedings of the 6th Australian Joint Conference on Artificial Intelligence (AI 93), pp. 356-361, Nov. 1993. The Potential of Prototype Styles of Generalization D. Randall Wilson Tony R. Martinez Computer

More information

Clustering Algorithms for general similarity measures

Clustering Algorithms for general similarity measures Types of general clustering methods Clustering Algorithms for general similarity measures general similarity measure: specified by object X object similarity matrix 1 constructive algorithms agglomerative

More information

In examining performance Interested in several things Exact times if computable Bounded times if exact not computable Can be measured

In examining performance Interested in several things Exact times if computable Bounded times if exact not computable Can be measured System Performance Analysis Introduction Performance Means many things to many people Important in any design Critical in real time systems 1 ns can mean the difference between system Doing job expected

More information

Register Allocation in Just-in-Time Compilers: 15 Years of Linear Scan

Register Allocation in Just-in-Time Compilers: 15 Years of Linear Scan Register Allocation in Just-in-Time Compilers: 15 Years of Linear Scan Kevin Millikin Google 13 December 2013 Register Allocation Overview Register allocation Intermediate representation (IR): arbitrarily

More information

Speeding up Queries in a Leaf Image Database

Speeding up Queries in a Leaf Image Database 1 Speeding up Queries in a Leaf Image Database Daozheng Chen May 10, 2007 Abstract We have an Electronic Field Guide which contains an image database with thousands of leaf images. We have a system which

More information

New Trials on Test Data Generation: Analysis of Test Data Space and Design of Improved Algorithm

New Trials on Test Data Generation: Analysis of Test Data Space and Design of Improved Algorithm New Trials on Test Data Generation: Analysis of Test Data Space and Design of Improved Algorithm So-Yeong Jeon 1 and Yong-Hyuk Kim 2,* 1 Department of Computer Science, Korea Advanced Institute of Science

More information

Efficiently decodable insertion/deletion codes for high-noise and high-rate regimes

Efficiently decodable insertion/deletion codes for high-noise and high-rate regimes Efficiently decodable insertion/deletion codes for high-noise and high-rate regimes Venkatesan Guruswami Carnegie Mellon University Pittsburgh, PA 53 Email: guruswami@cmu.edu Ray Li Carnegie Mellon University

More information

6. Advanced Topics in Computability

6. Advanced Topics in Computability 227 6. Advanced Topics in Computability The Church-Turing thesis gives a universally acceptable definition of algorithm Another fundamental concept in computer science is information No equally comprehensive

More information

A Synchronization Algorithm for Distributed Systems

A Synchronization Algorithm for Distributed Systems A Synchronization Algorithm for Distributed Systems Tai-Kuo Woo Department of Computer Science Jacksonville University Jacksonville, FL 32211 Kenneth Block Department of Computer and Information Science

More information

I/O Efficieny of Highway Hierarchies

I/O Efficieny of Highway Hierarchies I/O Efficieny of Highway Hierarchies Riko Jacob Sushant Sachdeva Departement of Computer Science ETH Zurich, Technical Report 531, September 26 Abstract Recently, Sanders and Schultes presented a shortest

More information

Similarity search in multimedia databases

Similarity search in multimedia databases Similarity search in multimedia databases Performance evaluation for similarity calculations in multimedia databases JO TRYTI AND JOHAN CARLSSON Bachelor s Thesis at CSC Supervisor: Michael Minock Examiner:

More information

Data Structure and Algorithm Homework #6 Due: 5pm, Friday, June 14, 2013 TA === Homework submission instructions ===

Data Structure and Algorithm Homework #6 Due: 5pm, Friday, June 14, 2013 TA   === Homework submission instructions === Data Structure and Algorithm Homework #6 Due: 5pm, Friday, June 14, 2013 TA email: dsa1@csie.ntu.edu.tw === Homework submission instructions === For Problem 1, submit your source codes, a Makefile to compile

More information

IMPROVING THE RELEVANCY OF DOCUMENT SEARCH USING THE MULTI-TERM ADJACENCY KEYWORD-ORDER MODEL

IMPROVING THE RELEVANCY OF DOCUMENT SEARCH USING THE MULTI-TERM ADJACENCY KEYWORD-ORDER MODEL IMPROVING THE RELEVANCY OF DOCUMENT SEARCH USING THE MULTI-TERM ADJACENCY KEYWORD-ORDER MODEL Lim Bee Huang 1, Vimala Balakrishnan 2, Ram Gopal Raj 3 1,2 Department of Information System, 3 Department

More information

Practice Problems for the Final

Practice Problems for the Final ECE-250 Algorithms and Data Structures (Winter 2012) Practice Problems for the Final Disclaimer: Please do keep in mind that this problem set does not reflect the exact topics or the fractions of each

More information

for (i=1; i<=100000; i++) { x = sqrt (y); // square root function cout << x+i << endl; }

for (i=1; i<=100000; i++) { x = sqrt (y); // square root function cout << x+i << endl; } Ex: The difference between Compiler and Interpreter The interpreter actually carries out the computations specified in the source program. In other words, the output of a compiler is a program, whereas

More information

Applied Algorithm Design Lecture 3

Applied Algorithm Design Lecture 3 Applied Algorithm Design Lecture 3 Pietro Michiardi Eurecom Pietro Michiardi (Eurecom) Applied Algorithm Design Lecture 3 1 / 75 PART I : GREEDY ALGORITHMS Pietro Michiardi (Eurecom) Applied Algorithm

More information

Project Report: Needles in Gigastack

Project Report: Needles in Gigastack Project Report: Needles in Gigastack 1. Index 1.Index...1 2.Introduction...3 2.1 Abstract...3 2.2 Corpus...3 2.3 TF*IDF measure...4 3. Related Work...4 4. Architecture...6 4.1 Stages...6 4.1.1 Alpha...7

More information

Uncertain Data Models

Uncertain Data Models Uncertain Data Models Christoph Koch EPFL Dan Olteanu University of Oxford SYNOMYMS data models for incomplete information, probabilistic data models, representation systems DEFINITION An uncertain data

More information

Distributed and Paged Suffix Trees for Large Genetic Databases p.1/18

Distributed and Paged Suffix Trees for Large Genetic Databases p.1/18 istributed and Paged Suffix Trees for Large Genetic atabases Raphaël Clifford and Marek Sergot raphael@clifford.net, m.sergot@ic.ac.uk Imperial College London, UK istributed and Paged Suffix Trees for

More information

Indexing and Searching

Indexing and Searching Indexing and Searching Introduction How to retrieval information? A simple alternative is to search the whole text sequentially Another option is to build data structures over the text (called indices)

More information

A Lightweight Blockchain Consensus Protocol

A Lightweight Blockchain Consensus Protocol A Lightweight Blockchain Consensus Protocol Keir Finlow-Bates keir@chainfrog.com Abstract A lightweight yet deterministic and objective consensus protocol would allow blockchain systems to be maintained

More information

String Allocation in Icon

String Allocation in Icon String Allocation in Icon Ralph E. Griswold Department of Computer Science The University of Arizona Tucson, Arizona IPD277 May 12, 1996 http://www.cs.arizona.edu/icon/docs/ipd275.html Note: This report

More information

Improving Suffix Tree Clustering Algorithm for Web Documents

Improving Suffix Tree Clustering Algorithm for Web Documents International Conference on Logistics Engineering, Management and Computer Science (LEMCS 2015) Improving Suffix Tree Clustering Algorithm for Web Documents Yan Zhuang Computer Center East China Normal

More information

Unit 6 Chapter 15 EXAMPLES OF COMPLEXITY CALCULATION

Unit 6 Chapter 15 EXAMPLES OF COMPLEXITY CALCULATION DESIGN AND ANALYSIS OF ALGORITHMS Unit 6 Chapter 15 EXAMPLES OF COMPLEXITY CALCULATION http://milanvachhani.blogspot.in EXAMPLES FROM THE SORTING WORLD Sorting provides a good set of examples for analyzing

More information