Fast Plagiarism Detection System


Fast Plagiarism Detection System

No Author Given
No Institute Given

Abstract. Plagiarism on programming courses has always been a difficult problem to counter: the large numbers of students following such courses and the limited vocabulary of a programming language combine to make it hard to detect plagiarism during marking. Various methods of plagiarism detection have been proposed, but all suffer from the complexity inherent in a many-to-many comparison of files. We have designed an efficient new algorithm to be implemented as a plagiarism detector, which offers the possibility of fast comparisons across a corpus of work. Our algorithm makes use of sparse suffix structures and allows rapid calculation of the similarity ratios between any file and a pre-existing collection.

1 Introduction

Programming is a vital part of a Computer Science qualification and also forms a key constituent of many other technical disciplines. For this reason, it is very important that the integrity of such qualifications should not be called into doubt through accusations of mass plagiarism among students. In addition, the issue of plagiarism and its apparently increasing prevalence at undergraduate level has been receiving a lot of coverage in the mainstream media [2, 11]. Universities should therefore be seen to be taking measures to lessen the impact of this problem. These facts highlight the need for a strong, competent strategy that will both prevent and detect plagiarism among students [1].

The large class sizes typical of an undergraduate programming course mean that it is near impossible for a human marker to accurately detect plagiarism, particularly if some attempt has been made to obfuscate the copying. There are many standard techniques that students will use to try to hide their plagiarism. While it would not be possible to list all of them here, they generally fall into one of the following two groups:

1. Lexical changes can be thought of as pre-processing techniques and do not generally require any in-depth knowledge of the programming language being used. A typical attempt to hide plagiarism using lexical changes may be to add or remove comments, adjust the coding style, or do a find-and-replace on identifiers.

2. Structural changes require significantly more language-specific knowledge, and their extra sophistication makes them harder to detect. Typical examples of structural changes are changing the type of loop construct from a for loop to a while loop, rewriting logical conditions to an equivalent form, or

replacing a switch statement with a series of if statements. More advanced forms of structural change might involve refactoring code into different methods.

While it would be desirable to detect all possible code transformations, there is a minimum level of acceptable performance for the application of detecting student plagiarism. Students who plagiarise another's work will normally feel they have to do so for one of two reasons: either they have not allocated sufficient time to complete the work, or they do not understand the subject well enough to complete it. It would be useful if the detector operated at a level such that fooling the algorithm would require the student to spend a large amount of time on the assignment and to have a good enough understanding to do the work without plagiarising.

2 Previous Work

The problem of detecting source-code plagiarism has concerned academics since the late 1970s, when Ottenstein used Halstead's basic metrics for measuring algorithms to try to find plagiarism in punch-card programming assignments [8, 4]. The metric-counting approach used by the initial plagiarism detectors relied on the assumption that two similar pieces of code would have similar software metrics. A simple software metric would be the number of commented lines in the original file, although more complicated metrics that are invariant to superficial changes were also used [14]. This approach had its advantages, in that it was simple and computationally inexpensive, but it was found to be quite ineffective in practice [13].

A second generation of plagiarism detectors has emerged that uses a tokenization technique to improve detection by looking at the structure of files, although this comes at the cost of computational time. These detectors work by pre-processing code to remove white-space and comments before converting the file into a tokenized string.
For example, a typical tokenization scheme might involve replacing all identifiers with the <IDT> token, all numbers by <VALUE>, and any loops by generic <BEGIN LOOP>...<END LOOP> tokens. More sophisticated techniques keep track of the usage of distinct identifiers, so that both the expressions a = a + b and x = x + y are mapped to the tokenized string <IDT1> = <IDT1> + <IDT2>. This enables the files to be compared at a less superficial level, but at the cost of a greater perceived similarity between files, as previously unrelated segments of code can produce the same token strings. The main advantage of such an approach is that it negates all lexical changes, and a good token set can also reduce the efficacy of many structural changes. Tokenization has the added benefit of being language independent: all that is needed to add detection support for a new language is a new parser to convert source code into token strings. Several tokenizing plagiarism detectors are now in use at various academic institutions; these include Sherlock at the University of Warwick [5], JPlag

at the Universität Karlsruhe [9] and MOSS at the University of California, Berkeley [10]. JPlag and MOSS are available as internet services for academics to use, while Sherlock is a stand-alone program. Our algorithm also makes use of tokenised versions of the input files, and we use suffix arrays as our indexing structure to enable efficient comparisons [7].

While all the above-mentioned systems use different algorithms, the core idea is the same: a many-to-many comparison of all files submitted for an assignment produces a list, sorted by some similarity score, that can then be used to determine which pairs are most likely to contain plagiarism. A naive implementation of this comparison, such as that used by Sherlock or JPlag, results in O(f(n) N^2) complexity, where N is the number of files in the collection and f(n) is the time to compare one pair of files of length n. For JPlag, f(n) = O(n^2) on average, or O(n^3) in the worst case, i.e. O(n^2 N^2) total average time (for Sherlock the complexity is even worse). Without loss of detection quality, our method achieves O(N(n + N)) average time by using indexing techniques based on suffix trees.

3 Algorithms and Complexity

Our proposed system is based on an index structure built over the entire file collection. Before the index is built, all the files in the collection are tokenized, as explained in Sec. 2. This is a simple parsing problem and can be solved in linear time; we used the well-known JavaCC system [3] to build our own tokenizer. For each of the N files in the collection, the output of the tokenizer for a file F_i is a string of n_i tokens, separated by #: token_1#token_2#...#token_{n_i}#. The total number of tokens is denoted by n = Σ n_i. The tokens are character strings, but for simplicity and without loss of generality we can assume that each token is an atomic symbol of the resulting token string.
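As an illustration of the identifier-tracking tokenization described above, the following sketch numbers identifiers in order of first appearance. It is written in Python for brevity (the paper's own tokenizer was built with JavaCC for Java source); the function name and the minimal token set are illustrative only.

```python
import re

def tokenize(code):
    """Map identifiers to numbered <IDTk> tokens by order of first use;
    number literals become <VALUE>.  A real tokenizer would also handle
    keywords, loop constructs and comments."""
    ids = {}
    tokens = []
    for m in re.finditer(r"[A-Za-z_]\w*|\d+|[^\s\w]", code):
        t = m.group()
        if t.isdigit():
            tokens.append("<VALUE>")
        elif re.match(r"[A-Za-z_]", t):
            if t not in ids:
                ids[t] = len(ids) + 1    # number identifiers by first appearance
            tokens.append(f"<IDT{ids[t]}>")
        else:
            tokens.append(t)
    return tokens

# Both statements map to the same token string, as described in Sec. 2:
assert tokenize("a = a + b") == tokenize("x = x + y")
```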
3.1 Index Structure

A suffix array is a lexicographically sorted array of all suffixes of a given string [7]. A string w is a suffix of a string s if s can be written as vw, where vw denotes the concatenation of the possibly empty strings v and w. The suffix array SA for a string T[1..n_i] of tokens contains all the suffixes of T in lexicographically sorted order, i.e. SA[i] = j if the suffix of T starting at position j is the lexicographically i-th suffix of T. Note that the suffixes themselves are not explicitly stored; an index into the original string suffices.

Our index structure is the suffix array of all n tokens in the source-code collection. Each entry in the array also contains the file identifier of the corresponding suffix. The suffix array for the whole document collection is therefore of size O(n), which we consider an acceptable memory requirement for modern hardware. The tokenization process also saves space in our index structure, since only the tokens are indexed, and it will also speed up the plagiarism detection phase. Fig. 1 shows an example suffix array.
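A minimal sketch of this index is given below. The naive sort-of-slices construction is for exposition only (it is far from the O(n log n) or O(n) construction discussed next); what matters is the shape of the structure: one (file id, offset) entry per token suffix, sorted by the suffix it denotes.

```python
def build_index(files):
    # files: list of token lists, one per file in the collection.
    # One entry (file_id, start) per token suffix, sorted by the suffix
    # itself.  A real implementation never materialises the suffixes.
    entries = [(f, i) for f, toks in enumerate(files) for i in range(len(toks))]
    entries.sort(key=lambda e: files[e[0]][e[1]:])
    return entries

files = [["<IDT>", "<MODIFIER>", "<IDT>", "{", "<MODIFIER>"],
         ["<MODIFIER>", "<IDT>"]]
sa = build_index(files)
assert len(sa) == 7    # one entry per token across the whole collection
```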

Algorithm 1 Compare a File Against an Existing Collection

1  p = 1                       // the first token of Q
2  WHILE p ≤ q − γ
3      find Q[p...p+γ−1] from the suffix array
4      IF Q[p...p+γ−1] was found
5          UpdateRepository
6          p = p + γ
7      ELSE
8          p = p + 1
9      END IF
10 END WHILE
11 FOR EVERY file F_i in the collection
12     Similarity(Q, F_i) = MatchedTokens(F_i) / q

The array can be constructed in O(n log n) time, assuming atomic comparison of two tokens. This can be improved to O(n) with a somewhat more complex algorithm [6]. A suffix array allows us to rapidly find a file (or files) containing any given substring. This is achieved with a binary search, and requires O(m + log n) time on average, where m is the length of the substring (it is also possible to make this the worst-case complexity, see [7]).

  {#<MODIFIER>#
  <IDT>#{#<MODIFIER>#
  <IDT>#<MODIFIER>#<IDT>#{#<MODIFIER>#
  <MODIFIER>#
  <MODIFIER>#<IDT>#{#<MODIFIER>#

Fig. 1. Suffix array contents for the tokenized string <IDT>#<MODIFIER>#<IDT>#{#<MODIFIER>#.

3.2 Comparing a File Against an Existing Collection

We now describe the algorithm for finding all files within the collection's index that are similar to a given query file. Pseudo-code is given in Algorithm 1. Algorithm 1 tries to find the substrings of the tokenised query file, Q, in the suffix array. Matching substrings are recorded, and each match contributes to the similarity score described later. But a closer look reveals a problem: which query-file substrings should be analyzed? The simplest solution is to try to match every token of the input file, but this leads to many false matches, since two files will be considered very similar when they are both composed of a small set of identical tokens, even if these tokens form totally different sequences. A greedy search approach would be to match the longest possible substring of the input file before continuing with the first mismatched token. This idea may

result in weak detection performance, since many interesting substrings would likely be skipped. Our idea is a trade-off between these extremes. We introduce a new parameter, γ, which stands for the length (in tokens) of the substrings to be matched. If γ = 1, the algorithm tries to match every token of the input file. By increasing γ, one can fine-tune the detection performance and also the time required for the detection process itself.

Algorithm 2 Update the Repository

1  Let S be the set of matches of Q[p...p+γ−1]
2  IF some of the strings in S are found in the same file, keep only the longest one
3  FOR every string M from the remaining list S
4      IF M doesn't intersect with any repository element
5          insert M into the repository
6      ELSE IF M is longer than any conflicting repository element
7          remove all conflicting repository elements
8          insert M into the repository
9      END IF
10 END FOR

The algorithm takes contiguous, non-overlapping token substrings of length γ from the query file and searches for all matching substrings in the index. If several of the matches found correspond to the same indexed file, these matches are extended to Γ tokens, Γ ≥ γ, such that only one of the original matches survives for each indexed file. Therefore, for each file in the index, the algorithm finds all matching substrings that are longer than the other matching substrings and whose lengths are at least γ tokens. These matches are recorded in a repository. This phase also includes a sanity check, as overlapping matches are not allowed.

The similarity between the file being tested and any file in the collection is simply the number of tokens matched in the collection file divided by the total number of tokens in the test file (so it is a value between 0 and 1). More precisely, let Q be the string of tokens for the query file, with q tokens in total. Algorithm 1 then proceeds to find the matches of the substrings Q[p...p+γ−1] in the index.
If no matches are found, the next substring is searched, i.e. p is increased by one. If at least one match is found, we update our match repository (Algorithm 2) and increase p by γ.

Collision Detection In Algorithm 3 we encounter two types of collisions. The first appears when more than one match is found in the same file. Obviously, we should not treat one occurrence of some string as five occurrences just because that string can be found five times in the file. Recall that the similarity between two files is a value between zero and one (or between 0% and 100%), computed as Similarity(Q, F_i) = MatchedTokens(F_i)/q, where q is the number of tokens in the query file. Therefore, the number of tokens stored in the repository for the file F_i should never be greater than the size of

the input file in tokens. Now consider a situation where some match is found several times in the same collection file, such that every token of the input file can be found twice in the collection file. The above formula would give us a 200% similarity, which is counter-intuitive. Collisions can also significantly distort the real situation: consider the same input query file compared to a different collection file, <type><type><type><type><type>. Here we obtain 100% similarity, which obviously does not reflect the real situation.

Algorithm 3 Updating the Repository (detailed)

1  Let S be the set of matches of Q[p...p+γ−1]
2  matchlist = { }
3  FOR EACH found occurrence M ∈ S
4      IF matchlist contains a match M' found in the same file as M
5          extend M  such that it matches Q[p...p+Γ_1−1], for maximal Γ_1 ≥ γ
6          extend M' such that it matches Q[p...p+Γ_2−1], for maximal Γ_2 ≥ γ
7          IF Γ_1 ≥ Γ_2 THEN M'' = M ELSE M'' = M'
8          remove M' from matchlist
9          insert M'' into matchlist
10     ELSE
11         insert M into matchlist
12     END IF
13 END FOR
14 FOR EACH match M ∈ matchlist
15     f = fileidentifier(M)
16     IF repository[f] = { }
17         repository[f] = {M}
18     ELSE
19         result = { }
20         FOR EACH match M' ∈ repository[f]
21             IF M' intersects with M THEN put M' into result
22         END FOR
23         len = the total length of all matches in result
24         IF len < length of the match M
25             remove all matches in result from repository[f]
26             add M to repository[f]
27         END IF
28     END IF
29 END FOR

Therefore, Algorithm 2 first takes the list of matches for Q[p...p+γ−1]. If the list has any two matches from the same file, these matches are extended with the tokens Q[p+γ],...,Q[p+Γ−1], until no match can be extended further (i.e. the token Q[p+Γ] would mismatch). If after this process the matches still cannot be differentiated (i.e. their lengths are still the same), we simply take the first one. At this point only one match is left for that file, and it matches the substring Q[p...p+Γ−1].
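The type-1 step of Algorithm 3 (lines 3-13) can be sketched as follows; the representation of matches as (file, offset) pairs and the explicit extension loop are illustrative choices, not the paper's actual data structures.

```python
def dedupe_per_file(matches, files, q, p):
    """Collision type 1: several matches of Q[p...p+γ-1] in one file.
    Extend each match as far as it keeps agreeing with Q; keep the
    longest per file (first wins on ties)."""
    def extend(f, i):
        # Length of the longest common prefix of Q[p:] and files[f][i:].
        L = 0
        toks = files[f]
        while p + L < len(q) and i + L < len(toks) and q[p + L] == toks[i + L]:
            L += 1
        return L
    best = {}
    for f, i in matches:
        L = extend(f, i)
        if f not in best or L > best[f][1]:
            best[f] = (i, L)
    return [(f, i, L) for f, (i, L) in best.items()]

files = [list("xabxaby")]
# Two matches of the 2-token pattern "ab" in file 0, at offsets 1 and 4.
result = dedupe_per_file([(0, 1), (0, 4)], files, q=list("aby"), p=0)
assert result == [(0, 4, 3)]    # the offset-4 match extends to 3 tokens
```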
Note that this process may end up with a different Γ value for each file in the index.
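The second phase of Algorithm 3 (lines 14-29) then inserts the surviving match into the per-file repository under a longest-wins rule. A sketch, representing each match as a hypothetical (start, length) interval in the indexed file:

```python
def update_repository(repo, match):
    """Longest-wins rule: the new match displaces the stored matches it
    overlaps only if it is longer than their combined length.
    repo: list of (start, length) intervals; match: (start, length)."""
    s, L = match
    clash = [(cs, cL) for (cs, cL) in repo if cs < s + L and s < cs + cL]
    if sum(cL for _, cL in clash) < L:
        repo = [m for m in repo if m not in clash] + [match]
    return repo

assert update_repository([(0, 3)], (1, 5)) == [(1, 5)]    # longer match wins
assert update_repository([(0, 3)], (2, 2)) == [(0, 3)]    # shorter match rejected
```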

The resulting pruned match list is then checked for collisions of type 2, and the non-conflicting matches are put into the match repository (see Algorithms 2 and 3).

The second collision type is the reverse of the previous problem: we should not allow two different places in the input file to correspond to the same place in some collection file. Collisions of type 2 cause the same problems: if different places in the input file may be matched against the same place in the collection file, then a query file <type><type><type><type> whose every token is matched against the same single place in a collection file would score 100% similarity; another counter-intuitive result. To resolve the difficulty we use a longest-wins heuristic: we sum the lengths of all the previous matches that intersect with the current one, and if the current match is longer, we use it to replace the intersecting previous matches (see Alg. 3).

Examples Here we provide some examples to illustrate our collision-resolving techniques. Suppose the file to be tested contains three variable definitions:

int i = 10; int j = 15; int k = 20;

These are tokenized to three copies of the string <type><idt>=<idt>;. Some file in the collection contains the block:

float d = 11.5; int k = 100; int h = k; class T {

This block is tokenized into the strings <type><idt>=<idt>; (three times) followed by <class><idt>{.

For the very beginning of the input file, with γ = 3, our search routine will find three occurrences of the string <type><idt>= in the collection file. After that, the system has to decide which occurrence to keep. The longest-wins heuristic

will show that the chunk float d = is the best match, since the symbols following it form the longest occurrence in the collection file:

Occurrence 1: <type><idt>=<idt>
Occurrence 2: <type><idt>=<idt>;
Occurrence 3: <type><idt>=<idt>;

After this initial pass, the search routine deals with the second variable definition (int j = 15;). Its tokenized γ-prefix <type><idt>= will again be found in the same three places in the collection file. However, the situation is now different: we should not only select the best match (collision type 1), but also take into account that the chunk float d = has already been matched earlier (collision type 2). The algorithm for resolving collisions of type 1 says that float d = is the best match again, leaving us two choices: either reject this match and do not record it, or remove the previous match from the repository and insert this one. The decision is again made according to the longest-wins heuristic, and the original match remains in the repository.

3.3 Complexity

The complexity of Algorithm 1 is highly dependent on the value of the γ parameter. Line 3 of Algorithm 1 takes O(γ + log n) average time, where n is the total number of tokens in the collection (assuming atomic token comparisons). In the worst case this becomes O(γ log n) if plain binary search is used, but the worst-case time can be improved to match the average time by storing some extra information in the suffix array [7]. Therefore, the total average time is at most O(q(γ + log n)), where q is the number of tokens in the query file, assuming that the substrings of Q are never found. On the other hand, whenever a substring of Q is found, we call Algorithm 3. This can happen at most O(q/γ) times, so line 5 takes at most O(q/γ) times the complexity of Algorithm 3. This complexity depends mainly on how many matches we have, on average, when searching for γ-length strings in the suffix structure.
If we make the simplifying assumption that two randomly picked tokens match each other (independently) with fixed probability p, then on average we obtain n·p^γ matches for substrings of length γ. This decreases exponentially in γ, and becomes O(1) for γ = Θ(log_{1/p} n). The total complexity of Algorithm 3 is then, on average, at most O((q/γ · n·p^γ)^2). To keep the total average complexity of Algorithm 1 at O(q(γ + log n)), it is enough that γ = Ω(log_{1/p} n). This results in O(q log n) total average time. Since we require γ = Ω(log n), and may adjust γ upwards to tune the quality of the detection results, we state the time bound as O(qγ).

Figure 2 shows the results of timed measurements of our suffix-array-based implementation on a set derived from students' work. These are CPU times from a 2.4 GHz Celeron with 256 MB RAM running Windows XP and Sun's Java 1.4.

Fig. 2. Time required to score a single file against a collection.

Finally, the scores for each file can be computed in O(N) time. To summarize, the total average complexity of Algorithm 1 can be made O(q(γ + log n) + N) = O(qγ + N). The O(γ + log n) factor can be reduced to O(1) (worst case) using suffix trees [12] with suffix links instead of suffix arrays, which would result in O(q + N) total time. Note that we have excluded the tokenization of Q, and that we have counted tokens rather than characters. However, tokenization is a simple linear-time process, and the number of tokens depends linearly on the file length.

3.4 All Against All Comparison

To compare every file against every other, we simply run Algorithm 1 for every file in our collection. Every file pair then gets two scores: one when file a is compared to file b, and one from the reverse comparison, since the comparison is not symmetric. We use the average of these two scores as the final score for the pair.
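This symmetrisation amounts to averaging the two directed scores. A sketch, using a toy directed score in place of the real one produced by Algorithm 1:

```python
def pairwise_scores(files, score):
    # score(a, b) is the directed similarity of file a against file b;
    # the final score of a pair is the mean of the two directions.
    N = len(files)
    return {(a, b): (score(files[a], files[b]) + score(files[b], files[a])) / 2
            for a in range(N) for b in range(a + 1, N)}

# Toy directed score: fraction of a's tokens that occur anywhere in b.
overlap = lambda a, b: sum(t in set(b) for t in a) / len(a)

files = [list("aab"), list("ab"), list("xy")]
scores = pairwise_scores(files, overlap)
assert scores[(0, 1)] == 1.0 and scores[(0, 2)] == 0.0
```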

Summing the cost of this procedure over all N files in the collection, we obtain a total complexity of O(nγ + N^2), including the time to build the suffix-array index structure (see also Figure 3). With suffix trees this can be made O(n + N^2).

Fig. 3. Time required to test all files in the collection against each other.

Note that any plagiarism detection routine based on pairwise file comparisons will have a complexity of at least O(f(n/N) N^2), where f(n) is the complexity of comparing two files of length n. In our terms each file is approximately n/N tokens, hence e.g. JPlag would take O(n^2) total average time. The N^2 term is negligible compared to n^2 since n ≥ N, unless the files are only one token each.

4 Evaluation of the System

It is not feasible in the near future to compare our system's results with a human expert's opinion on real-world datasets, as a human would not have the time to conduct a thorough comparison of every possible file pair; this would also be a very error-prone process. However, we can examine the reports produced by different plagiarism detection systems when used on the same dataset. The systems used for the analysis were MOSS [10], JPlag [9] and Sherlock [5]. Every system produced a report on the same real collection, consisting of 220 undergraduate students' Java programs. These reports were

summarized and displayed as a diagram, a reduced version of which is shown in Figure 4. The figure shows results for only 50 of the 220 files.

Fig. 4. The different systems' reports summarized in a single diagram.

This diagram shows the score for every suspicious file in the collection (file pairs are not displayed here). Since the systems can be fine-tuned to show more or fewer files, we tried to obtain outputs of equal size. The raw report of each system is a list of pairs with a calculated similarity ratio between their members. Although the opinions of the tested systems differ for many of the files, most files are either detected or rejected by the majority of systems. This simple approach (considering only detection or rejection) allows us to organize a voting experiment. Let S_i be the number of jury systems (MOSS, JPlag and Sherlock) that marked file i as suspicious. If S_i ≥ 2, we should expect our system to mark this file as well. If S_i < 2, the file should, in general, remain unmarked. For the test set consisting of 155 files marked by at least one program, our system agreed with the jury in 115 cases (and, correspondingly, disagreed in 40 cases). This result is more conformist than the results obtained when the same experiment was run on each of the other three tested systems; each system was tested while the other three acted as jury. All results are shown in Table 1.

Table 1. Agreement between plagiarism detectors

            MOSS   JPlag   Our System   Sherlock
Agreed        -      -        115          -
Disagreed     -      -         40          -
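The voting scheme itself can be stated in a few lines; the file names and per-system marks below are invented for illustration.

```python
def jury_expects(marks, threshold=2):
    # A file is expected to be flagged if at least `threshold` of the
    # jury systems (here MOSS, JPlag and Sherlock) marked it suspicious.
    return {f for f, systems in marks.items() if len(systems) >= threshold}

marks = {"A.java": {"MOSS", "JPlag"},
         "B.java": {"Sherlock"},
         "C.java": {"MOSS", "JPlag", "Sherlock"}}
assert jury_expects(marks) == {"A.java", "C.java"}
```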

So we can claim that our system is at least no worse than the other tested systems, and an argument could be made that it is better, since its results correlate best with those of all three other systems. However, the subtleties of the different algorithms used by each system could mean that the data we used was particularly suited to our system, and other software may produce better results on other collections.

5 Conclusions and Future Work

We have developed a new fast algorithm for plagiarism detection. Our method is based on indexing the code database with a suffix array, which allows rapid retrieval of blocks of code that are similar to the query file. This makes rapid pairwise file comparison possible. Evaluation shows that the algorithm's quality is no worse than that of existing widely used methods, while its speed is much higher.

For the all-against-all problem our method achieves O(γn) (with suffix arrays) or O(n) (with suffix trees) average time for the comparison phase. Traditional methods, such as JPlag, need at least O((n/N)^2 N^2) = O(n^2) average time for the same task. In addition, computing the similarity matrix takes O(N^2) additional time, and this cannot be improved, as it is also the size of the output. However, one is usually interested only in similarity scores above a certain threshold, or only in the h highest similarity scores (where h ≪ N^2 is a parameter). This would allow the O(N^2) factor to be reduced, should it become an issue.

The main motivation for this work was plagiarism detection; however, there are other applications for the method. For example, the algorithm can detect similar blocks of code in a large software system, revealing good places for refactoring.
In the future we would like to see a full implementation of the algorithm as part of a plagiarism detection system, which would then allow for a full visualisation of the matches found and enable the results to be used in a real-world context. We would also like to investigate the effects of varying the γ parameter on the quality of the results, and to undertake further comparisons with other source-code plagiarism detectors.

References

1. J. Carrol and J. Appleton. Plagiarism: A good practice guide. JISC.
2. P. Curtis. Quarter of students plagiarise essays. Guardian Unlimited.
3. O. Enseling. Build your own languages with JavaCC. JavaWorld.
4. M. H. Halstead. Elements of Software Science. Operating and Programming Systems Series. Elsevier North-Holland, New York.
5. M. S. Joy and M. Luck. Plagiarism in programming assignments. IEEE Transactions on Education, 42(2), May 1999.

6. J. Kärkkäinen and P. Sanders. Simple linear work suffix array construction. In ICALP: Annual International Colloquium on Automata, Languages and Programming.
7. U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. In SODA '90: Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics.
8. K. J. Ottenstein. An algorithmic approach to the detection and prevention of plagiarism. SIGCSE Bull., 8(4):30-41.
9. L. Prechelt, G. Malpohl, and M. Phlippsen. JPlag: Finding plagiarisms among a set of programs. Technical report, Fakultät für Informatik, Universität Karlsruhe.
10. S. Schleimer, D. S. Wilkerson, and A. Aiken. Winnowing: local algorithms for document fingerprinting. In SIGMOD '03: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data. ACM Press.
11. L. Thompson. Educators blame internet for rise in student cheating. The Seattle Times.
12. E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14:249-260, 1995.
13. K. L. Verco and M. J. Wise. Plagiarism à la mode: A comparison of automated systems for detecting suspected plagiarism. The Computer Journal, 39(9).
14. G. Whale. Software metrics and plagiarism detection. Journal of Systems and Software, 13(2), 1990.


More information

Plagiarism Detection: An Architectural and Semantic Approach. Matthew Salisbury. Computing. Session 2009

Plagiarism Detection: An Architectural and Semantic Approach. Matthew Salisbury. Computing. Session 2009 Plagiarism Detection: An Architectural and Semantic Approach Matthew Salisbury Computing Session 2009 The candidate confirms that the work submitted is their own and the appropriate credit has been given

More information

Application of the BWT Method to Solve the Exact String Matching Problem

Application of the BWT Method to Solve the Exact String Matching Problem Application of the BWT Method to Solve the Exact String Matching Problem T. W. Chen and R. C. T. Lee Department of Computer Science National Tsing Hua University, Hsinchu, Taiwan chen81052084@gmail.com

More information

Solution to Problem 1 of HW 2. Finding the L1 and L2 edges of the graph used in the UD problem, using a suffix array instead of a suffix tree.

Solution to Problem 1 of HW 2. Finding the L1 and L2 edges of the graph used in the UD problem, using a suffix array instead of a suffix tree. Solution to Problem 1 of HW 2. Finding the L1 and L2 edges of the graph used in the UD problem, using a suffix array instead of a suffix tree. The basic approach is the same as when using a suffix tree,

More information

Optimal Parallel Randomized Renaming

Optimal Parallel Randomized Renaming Optimal Parallel Randomized Renaming Martin Farach S. Muthukrishnan September 11, 1995 Abstract We consider the Renaming Problem, a basic processing step in string algorithms, for which we give a simultaneously

More information

Indexing: Part IV. Announcements (February 17) Keyword search. CPS 216 Advanced Database Systems

Indexing: Part IV. Announcements (February 17) Keyword search. CPS 216 Advanced Database Systems Indexing: Part IV CPS 216 Advanced Database Systems Announcements (February 17) 2 Homework #2 due in two weeks Reading assignments for this and next week The query processing survey by Graefe Due next

More information

Full-Text Search on Data with Access Control

Full-Text Search on Data with Access Control Full-Text Search on Data with Access Control Ahmad Zaky School of Electrical Engineering and Informatics Institut Teknologi Bandung Bandung, Indonesia 13512076@std.stei.itb.ac.id Rinaldi Munir, S.T., M.T.

More information

SOURCE CODE PLAGIARISM DETECTION FOR PHP LANGUAGE

SOURCE CODE PLAGIARISM DETECTION FOR PHP LANGUAGE SOURCE CODE PLAGIARISM DETECTION FOR PHP LANGUAGE Richard Všianský 1, Dita Dlabolová 1, Tomáš Foltýnek 1 1 Mendel University in Brno, Czech Republic Volume 3 Issue 2 ISSN 2336-6494 www.ejobsat.com ABSTRACT

More information

Frequent Itemsets Melange

Frequent Itemsets Melange Frequent Itemsets Melange Sebastien Siva Data Mining Motivation and objectives Finding all frequent itemsets in a dataset using the traditional Apriori approach is too computationally expensive for datasets

More information

PAPER Constructing the Suffix Tree of a Tree with a Large Alphabet

PAPER Constructing the Suffix Tree of a Tree with a Large Alphabet IEICE TRANS. FUNDAMENTALS, VOL.E8??, NO. JANUARY 999 PAPER Constructing the Suffix Tree of a Tree with a Large Alphabet Tetsuo SHIBUYA, SUMMARY The problem of constructing the suffix tree of a tree is

More information

Character Recognition

Character Recognition Character Recognition 5.1 INTRODUCTION Recognition is one of the important steps in image processing. There are different methods such as Histogram method, Hough transformation, Neural computing approaches

More information

6.001 Notes: Section 4.1

6.001 Notes: Section 4.1 6.001 Notes: Section 4.1 Slide 4.1.1 In this lecture, we are going to take a careful look at the kinds of procedures we can build. We will first go back to look very carefully at the substitution model,

More information

MergeSort, Recurrences, Asymptotic Analysis Scribe: Michael P. Kim Date: September 28, 2016 Edited by Ofir Geri

MergeSort, Recurrences, Asymptotic Analysis Scribe: Michael P. Kim Date: September 28, 2016 Edited by Ofir Geri CS161, Lecture 2 MergeSort, Recurrences, Asymptotic Analysis Scribe: Michael P. Kim Date: September 28, 2016 Edited by Ofir Geri 1 Introduction Today, we will introduce a fundamental algorithm design paradigm,

More information

An Information Retrieval Approach for Source Code Plagiarism Detection

An Information Retrieval Approach for Source Code Plagiarism Detection -2014: An Information Retrieval Approach for Source Code Plagiarism Detection Debasis Ganguly, Gareth J. F. Jones CNGL: Centre for Global Intelligent Content School of Computing, Dublin City University

More information

Implementing a Statically Adaptive Software RAID System

Implementing a Statically Adaptive Software RAID System Implementing a Statically Adaptive Software RAID System Matt McCormick mattmcc@cs.wisc.edu Master s Project Report Computer Sciences Department University of Wisconsin Madison Abstract Current RAID systems

More information

Analyzing Dshield Logs Using Fully Automatic Cross-Associations

Analyzing Dshield Logs Using Fully Automatic Cross-Associations Analyzing Dshield Logs Using Fully Automatic Cross-Associations Anh Le 1 1 Donald Bren School of Information and Computer Sciences University of California, Irvine Irvine, CA, 92697, USA anh.le@uci.edu

More information

Code generation for modern processors

Code generation for modern processors Code generation for modern processors Definitions (1 of 2) What are the dominant performance issues for a superscalar RISC processor? Refs: AS&U, Chapter 9 + Notes. Optional: Muchnick, 16.3 & 17.1 Instruction

More information

This is a repository copy of A Rule Chaining Architecture Using a Correlation Matrix Memory.

This is a repository copy of A Rule Chaining Architecture Using a Correlation Matrix Memory. This is a repository copy of A Rule Chaining Architecture Using a Correlation Matrix Memory. White Rose Research Online URL for this paper: http://eprints.whiterose.ac.uk/88231/ Version: Submitted Version

More information

Plagiarism and its Detection in Programming Languages

Plagiarism and its Detection in Programming Languages Plagiarism and its Detection in Programming Languages Sanjay Goel, Deepak Rao et. al. Abstract Program similarity checking is an important of programming education fields. The increase of material now

More information

Code generation for modern processors

Code generation for modern processors Code generation for modern processors What are the dominant performance issues for a superscalar RISC processor? Refs: AS&U, Chapter 9 + Notes. Optional: Muchnick, 16.3 & 17.1 Strategy il il il il asm

More information

Implementation of Customized FindBugs Detectors

Implementation of Customized FindBugs Detectors Implementation of Customized FindBugs Detectors Jerry Zhang Department of Computer Science University of British Columbia jezhang@cs.ubc.ca ABSTRACT There are a lot of static code analysis tools to automatically

More information

Space Efficient Linear Time Construction of

Space Efficient Linear Time Construction of Space Efficient Linear Time Construction of Suffix Arrays Pang Ko and Srinivas Aluru Dept. of Electrical and Computer Engineering 1 Laurence H. Baker Center for Bioinformatics and Biological Statistics

More information

A Rule Chaining Architecture Using a Correlation Matrix Memory. James Austin, Stephen Hobson, Nathan Burles, and Simon O Keefe

A Rule Chaining Architecture Using a Correlation Matrix Memory. James Austin, Stephen Hobson, Nathan Burles, and Simon O Keefe A Rule Chaining Architecture Using a Correlation Matrix Memory James Austin, Stephen Hobson, Nathan Burles, and Simon O Keefe Advanced Computer Architectures Group, Department of Computer Science, University

More information

Computing the Longest Common Substring with One Mismatch 1

Computing the Longest Common Substring with One Mismatch 1 ISSN 0032-9460, Problems of Information Transmission, 2011, Vol. 47, No. 1, pp. 1??. c Pleiades Publishing, Inc., 2011. Original Russian Text c M.A. Babenko, T.A. Starikovskaya, 2011, published in Problemy

More information

3 No-Wait Job Shops with Variable Processing Times

3 No-Wait Job Shops with Variable Processing Times 3 No-Wait Job Shops with Variable Processing Times In this chapter we assume that, on top of the classical no-wait job shop setting, we are given a set of processing times for each operation. We may select

More information

Speed and Accuracy using Four Boolean Query Systems

Speed and Accuracy using Four Boolean Query Systems From:MAICS-99 Proceedings. Copyright 1999, AAAI (www.aaai.org). All rights reserved. Speed and Accuracy using Four Boolean Query Systems Michael Chui Computer Science Department and Cognitive Science Program

More information

7. Decision or classification trees

7. Decision or classification trees 7. Decision or classification trees Next we are going to consider a rather different approach from those presented so far to machine learning that use one of the most common and important data structure,

More information

1 Definition of Reduction

1 Definition of Reduction 1 Definition of Reduction Problem A is reducible, or more technically Turing reducible, to problem B, denoted A B if there a main program M to solve problem A that lacks only a procedure to solve problem

More information

REDUCING GRAPH COLORING TO CLIQUE SEARCH

REDUCING GRAPH COLORING TO CLIQUE SEARCH Asia Pacific Journal of Mathematics, Vol. 3, No. 1 (2016), 64-85 ISSN 2357-2205 REDUCING GRAPH COLORING TO CLIQUE SEARCH SÁNDOR SZABÓ AND BOGDÁN ZAVÁLNIJ Institute of Mathematics and Informatics, University

More information

3 SOLVING PROBLEMS BY SEARCHING

3 SOLVING PROBLEMS BY SEARCHING 48 3 SOLVING PROBLEMS BY SEARCHING A goal-based agent aims at solving problems by performing actions that lead to desirable states Let us first consider the uninformed situation in which the agent is not

More information

MergeSort, Recurrences, Asymptotic Analysis Scribe: Michael P. Kim Date: April 1, 2015

MergeSort, Recurrences, Asymptotic Analysis Scribe: Michael P. Kim Date: April 1, 2015 CS161, Lecture 2 MergeSort, Recurrences, Asymptotic Analysis Scribe: Michael P. Kim Date: April 1, 2015 1 Introduction Today, we will introduce a fundamental algorithm design paradigm, Divide-And-Conquer,

More information

Indexing Variable Length Substrings for Exact and Approximate Matching

Indexing Variable Length Substrings for Exact and Approximate Matching Indexing Variable Length Substrings for Exact and Approximate Matching Gonzalo Navarro 1, and Leena Salmela 2 1 Department of Computer Science, University of Chile gnavarro@dcc.uchile.cl 2 Department of

More information

DESIGN AND ANALYSIS OF ALGORITHMS. Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES

DESIGN AND ANALYSIS OF ALGORITHMS. Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES DESIGN AND ANALYSIS OF ALGORITHMS Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES http://milanvachhani.blogspot.in USE OF LOOPS As we break down algorithm into sub-algorithms, sooner or later we shall

More information

Singular Value Decomposition, and Application to Recommender Systems

Singular Value Decomposition, and Application to Recommender Systems Singular Value Decomposition, and Application to Recommender Systems CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Recommendation

More information

COMPILER DESIGN. For COMPUTER SCIENCE

COMPILER DESIGN. For COMPUTER SCIENCE COMPILER DESIGN For COMPUTER SCIENCE . COMPILER DESIGN SYLLABUS Lexical analysis, parsing, syntax-directed translation. Runtime environments. Intermediate code generation. ANALYSIS OF GATE PAPERS Exam

More information

Welfare Navigation Using Genetic Algorithm

Welfare Navigation Using Genetic Algorithm Welfare Navigation Using Genetic Algorithm David Erukhimovich and Yoel Zeldes Hebrew University of Jerusalem AI course final project Abstract Using standard navigation algorithms and applications (such

More information

Leveraging Transitive Relations for Crowdsourced Joins*

Leveraging Transitive Relations for Crowdsourced Joins* Leveraging Transitive Relations for Crowdsourced Joins* Jiannan Wang #, Guoliang Li #, Tim Kraska, Michael J. Franklin, Jianhua Feng # # Department of Computer Science, Tsinghua University, Brown University,

More information

Theorem 2.9: nearest addition algorithm

Theorem 2.9: nearest addition algorithm There are severe limits on our ability to compute near-optimal tours It is NP-complete to decide whether a given undirected =(,)has a Hamiltonian cycle An approximation algorithm for the TSP can be used

More information

Adding Source Code Searching Capability to Yioop

Adding Source Code Searching Capability to Yioop Adding Source Code Searching Capability to Yioop Advisor - Dr Chris Pollett Committee Members Dr Sami Khuri and Dr Teng Moh Presented by Snigdha Rao Parvatneni AGENDA Introduction Preliminary work Git

More information

Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES

Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES DESIGN AND ANALYSIS OF ALGORITHMS Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES http://milanvachhani.blogspot.in USE OF LOOPS As we break down algorithm into sub-algorithms, sooner or later we shall

More information

Chapter 6 Evaluation Metrics and Evaluation

Chapter 6 Evaluation Metrics and Evaluation Chapter 6 Evaluation Metrics and Evaluation The area of evaluation of information retrieval and natural language processing systems is complex. It will only be touched on in this chapter. First the scientific

More information

Scalable Trigram Backoff Language Models

Scalable Trigram Backoff Language Models Scalable Trigram Backoff Language Models Kristie Seymore Ronald Rosenfeld May 1996 CMU-CS-96-139 School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 This material is based upon work

More information

Repeating Segment Detection in Songs using Audio Fingerprint Matching

Repeating Segment Detection in Songs using Audio Fingerprint Matching Repeating Segment Detection in Songs using Audio Fingerprint Matching Regunathan Radhakrishnan and Wenyu Jiang Dolby Laboratories Inc, San Francisco, USA E-mail: regu.r@dolby.com Institute for Infocomm

More information

Lectures 6+7: Zero-Leakage Solutions

Lectures 6+7: Zero-Leakage Solutions Lectures 6+7: Zero-Leakage Solutions Contents 1 Overview 1 2 Oblivious RAM 1 3 Oblivious RAM via FHE 2 4 Oblivious RAM via Symmetric Encryption 4 4.1 Setup........................................ 5 4.2

More information

Searching Algorithms/Time Analysis

Searching Algorithms/Time Analysis Searching Algorithms/Time Analysis CSE21 Fall 2017, Day 8 Oct 16, 2017 https://sites.google.com/a/eng.ucsd.edu/cse21-fall-2017-miles-jones/ (MinSort) loop invariant induction Loop invariant: After the

More information

Introduction to Programming in C Department of Computer Science and Engineering. Lecture No. #17. Loops: Break Statement

Introduction to Programming in C Department of Computer Science and Engineering. Lecture No. #17. Loops: Break Statement Introduction to Programming in C Department of Computer Science and Engineering Lecture No. #17 Loops: Break Statement (Refer Slide Time: 00:07) In this session we will see one more feature that is present

More information

Flexible Coloring. Xiaozhou Li a, Atri Rudra b, Ram Swaminathan a. Abstract

Flexible Coloring. Xiaozhou Li a, Atri Rudra b, Ram Swaminathan a. Abstract Flexible Coloring Xiaozhou Li a, Atri Rudra b, Ram Swaminathan a a firstname.lastname@hp.com, HP Labs, 1501 Page Mill Road, Palo Alto, CA 94304 b atri@buffalo.edu, Computer Sc. & Engg. dept., SUNY Buffalo,

More information

Fast and Simple Algorithms for Weighted Perfect Matching

Fast and Simple Algorithms for Weighted Perfect Matching Fast and Simple Algorithms for Weighted Perfect Matching Mirjam Wattenhofer, Roger Wattenhofer {mirjam.wattenhofer,wattenhofer}@inf.ethz.ch, Department of Computer Science, ETH Zurich, Switzerland Abstract

More information

Process Model Improvement for Source Code Plagiarism Detection in Student Programming Assignments

Process Model Improvement for Source Code Plagiarism Detection in Student Programming Assignments Informatics in Education, 2016, Vol. 15, No. 1, 103 126 2016 Vilnius University DOI: 10.15388/infedu.2016.06 103 Process Model Improvement for Source Code Plagiarism Detection in Student Programming Assignments

More information

ASCERTAINING THE RELEVANCE MODEL OF A WEB SEARCH-ENGINE BIPIN SURESH

ASCERTAINING THE RELEVANCE MODEL OF A WEB SEARCH-ENGINE BIPIN SURESH ASCERTAINING THE RELEVANCE MODEL OF A WEB SEARCH-ENGINE BIPIN SURESH Abstract We analyze the factors contributing to the relevance of a web-page as computed by popular industry web search-engines. We also

More information

Lecture 5: Suffix Trees

Lecture 5: Suffix Trees Longest Common Substring Problem Lecture 5: Suffix Trees Given a text T = GGAGCTTAGAACT and a string P = ATTCGCTTAGCCTA, how do we find the longest common substring between them? Here the longest common

More information

Lecture L16 April 19, 2012

Lecture L16 April 19, 2012 6.851: Advanced Data Structures Spring 2012 Prof. Erik Demaine Lecture L16 April 19, 2012 1 Overview In this lecture, we consider the string matching problem - finding some or all places in a text where

More information

CSCI B522 Lecture 11 Naming and Scope 8 Oct, 2009

CSCI B522 Lecture 11 Naming and Scope 8 Oct, 2009 CSCI B522 Lecture 11 Naming and Scope 8 Oct, 2009 Lecture notes for CS 6110 (Spring 09) taught by Andrew Myers at Cornell; edited by Amal Ahmed, Fall 09. 1 Static vs. dynamic scoping The scope of a variable

More information

Algorithms. Lecture Notes 5

Algorithms. Lecture Notes 5 Algorithms. Lecture Notes 5 Dynamic Programming for Sequence Comparison The linear structure of the Sequence Comparison problem immediately suggests a dynamic programming approach. Naturally, our sub-instances

More information

Week - 03 Lecture - 18 Recursion. For the last lecture of this week, we will look at recursive functions. (Refer Slide Time: 00:05)

Week - 03 Lecture - 18 Recursion. For the last lecture of this week, we will look at recursive functions. (Refer Slide Time: 00:05) Programming, Data Structures and Algorithms in Python Prof. Madhavan Mukund Department of Computer Science and Engineering Indian Institute of Technology, Madras Week - 03 Lecture - 18 Recursion For the

More information

A Two-Expert Approach to File Access Prediction

A Two-Expert Approach to File Access Prediction A Two-Expert Approach to File Access Prediction Wenjing Chen Christoph F. Eick Jehan-François Pâris 1 Department of Computer Science University of Houston Houston, TX 77204-3010 tigerchenwj@yahoo.com,

More information

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015 University of Virginia Department of Computer Science CS 4501: Information Retrieval Fall 2015 2:00pm-3:30pm, Tuesday, December 15th Name: ComputingID: This is a closed book and closed notes exam. No electronic

More information

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search Informal goal Clustering Given set of objects and measure of similarity between them, group similar objects together What mean by similar? What is good grouping? Computation time / quality tradeoff 1 2

More information

Today: Amortized Analysis (examples) Multithreaded Algs.

Today: Amortized Analysis (examples) Multithreaded Algs. Today: Amortized Analysis (examples) Multithreaded Algs. COSC 581, Algorithms March 11, 2014 Many of these slides are adapted from several online sources Reading Assignments Today s class: Chapter 17 (Amortized

More information

The Potential of Prototype Styles of Generalization. D. Randall Wilson Tony R. Martinez

The Potential of Prototype Styles of Generalization. D. Randall Wilson Tony R. Martinez Proceedings of the 6th Australian Joint Conference on Artificial Intelligence (AI 93), pp. 356-361, Nov. 1993. The Potential of Prototype Styles of Generalization D. Randall Wilson Tony R. Martinez Computer

More information

Clustering Algorithms for general similarity measures

Clustering Algorithms for general similarity measures Types of general clustering methods Clustering Algorithms for general similarity measures general similarity measure: specified by object X object similarity matrix 1 constructive algorithms agglomerative

More information

In examining performance Interested in several things Exact times if computable Bounded times if exact not computable Can be measured

In examining performance Interested in several things Exact times if computable Bounded times if exact not computable Can be measured System Performance Analysis Introduction Performance Means many things to many people Important in any design Critical in real time systems 1 ns can mean the difference between system Doing job expected

More information

Register Allocation in Just-in-Time Compilers: 15 Years of Linear Scan

Register Allocation in Just-in-Time Compilers: 15 Years of Linear Scan Register Allocation in Just-in-Time Compilers: 15 Years of Linear Scan Kevin Millikin Google 13 December 2013 Register Allocation Overview Register allocation Intermediate representation (IR): arbitrarily

More information

Speeding up Queries in a Leaf Image Database

Speeding up Queries in a Leaf Image Database 1 Speeding up Queries in a Leaf Image Database Daozheng Chen May 10, 2007 Abstract We have an Electronic Field Guide which contains an image database with thousands of leaf images. We have a system which

More information

New Trials on Test Data Generation: Analysis of Test Data Space and Design of Improved Algorithm

New Trials on Test Data Generation: Analysis of Test Data Space and Design of Improved Algorithm New Trials on Test Data Generation: Analysis of Test Data Space and Design of Improved Algorithm So-Yeong Jeon 1 and Yong-Hyuk Kim 2,* 1 Department of Computer Science, Korea Advanced Institute of Science

More information

Efficiently decodable insertion/deletion codes for high-noise and high-rate regimes

Efficiently decodable insertion/deletion codes for high-noise and high-rate regimes Efficiently decodable insertion/deletion codes for high-noise and high-rate regimes Venkatesan Guruswami Carnegie Mellon University Pittsburgh, PA 53 Email: guruswami@cmu.edu Ray Li Carnegie Mellon University

More information

6. Advanced Topics in Computability

6. Advanced Topics in Computability 227 6. Advanced Topics in Computability The Church-Turing thesis gives a universally acceptable definition of algorithm Another fundamental concept in computer science is information No equally comprehensive

More information

A Synchronization Algorithm for Distributed Systems

A Synchronization Algorithm for Distributed Systems A Synchronization Algorithm for Distributed Systems Tai-Kuo Woo Department of Computer Science Jacksonville University Jacksonville, FL 32211 Kenneth Block Department of Computer and Information Science

More information

I/O Efficieny of Highway Hierarchies

I/O Efficieny of Highway Hierarchies I/O Efficieny of Highway Hierarchies Riko Jacob Sushant Sachdeva Departement of Computer Science ETH Zurich, Technical Report 531, September 26 Abstract Recently, Sanders and Schultes presented a shortest

More information

Similarity search in multimedia databases

Similarity search in multimedia databases Similarity search in multimedia databases Performance evaluation for similarity calculations in multimedia databases JO TRYTI AND JOHAN CARLSSON Bachelor s Thesis at CSC Supervisor: Michael Minock Examiner:

More information

Data Structure and Algorithm Homework #6 Due: 5pm, Friday, June 14, 2013 TA === Homework submission instructions ===

Data Structure and Algorithm Homework #6 Due: 5pm, Friday, June 14, 2013 TA   === Homework submission instructions === Data Structure and Algorithm Homework #6 Due: 5pm, Friday, June 14, 2013 TA email: dsa1@csie.ntu.edu.tw === Homework submission instructions === For Problem 1, submit your source codes, a Makefile to compile

More information

IMPROVING THE RELEVANCY OF DOCUMENT SEARCH USING THE MULTI-TERM ADJACENCY KEYWORD-ORDER MODEL

IMPROVING THE RELEVANCY OF DOCUMENT SEARCH USING THE MULTI-TERM ADJACENCY KEYWORD-ORDER MODEL IMPROVING THE RELEVANCY OF DOCUMENT SEARCH USING THE MULTI-TERM ADJACENCY KEYWORD-ORDER MODEL Lim Bee Huang 1, Vimala Balakrishnan 2, Ram Gopal Raj 3 1,2 Department of Information System, 3 Department

More information

Practice Problems for the Final

Practice Problems for the Final ECE-250 Algorithms and Data Structures (Winter 2012) Practice Problems for the Final Disclaimer: Please do keep in mind that this problem set does not reflect the exact topics or the fractions of each

More information

for (i=1; i<=100000; i++) { x = sqrt (y); // square root function cout << x+i << endl; }

for (i=1; i<=100000; i++) { x = sqrt (y); // square root function cout << x+i << endl; } Ex: The difference between Compiler and Interpreter The interpreter actually carries out the computations specified in the source program. In other words, the output of a compiler is a program, whereas

More information

Applied Algorithm Design Lecture 3

Applied Algorithm Design Lecture 3 Applied Algorithm Design Lecture 3 Pietro Michiardi Eurecom Pietro Michiardi (Eurecom) Applied Algorithm Design Lecture 3 1 / 75 PART I : GREEDY ALGORITHMS Pietro Michiardi (Eurecom) Applied Algorithm

More information

Project Report: Needles in Gigastack

Project Report: Needles in Gigastack Project Report: Needles in Gigastack 1. Index 1.Index...1 2.Introduction...3 2.1 Abstract...3 2.2 Corpus...3 2.3 TF*IDF measure...4 3. Related Work...4 4. Architecture...6 4.1 Stages...6 4.1.1 Alpha...7

More information

Uncertain Data Models

Uncertain Data Models Uncertain Data Models Christoph Koch EPFL Dan Olteanu University of Oxford SYNOMYMS data models for incomplete information, probabilistic data models, representation systems DEFINITION An uncertain data

More information

Distributed and Paged Suffix Trees for Large Genetic Databases p.1/18

Distributed and Paged Suffix Trees for Large Genetic Databases p.1/18 istributed and Paged Suffix Trees for Large Genetic atabases Raphaël Clifford and Marek Sergot raphael@clifford.net, m.sergot@ic.ac.uk Imperial College London, UK istributed and Paged Suffix Trees for

More information

Indexing and Searching

Indexing and Searching Indexing and Searching Introduction How to retrieval information? A simple alternative is to search the whole text sequentially Another option is to build data structures over the text (called indices)

More information

A Lightweight Blockchain Consensus Protocol

A Lightweight Blockchain Consensus Protocol A Lightweight Blockchain Consensus Protocol Keir Finlow-Bates keir@chainfrog.com Abstract A lightweight yet deterministic and objective consensus protocol would allow blockchain systems to be maintained

More information

String Allocation in Icon

String Allocation in Icon String Allocation in Icon Ralph E. Griswold Department of Computer Science The University of Arizona Tucson, Arizona IPD277 May 12, 1996 http://www.cs.arizona.edu/icon/docs/ipd275.html Note: This report

More information

Improving Suffix Tree Clustering Algorithm for Web Documents

Improving Suffix Tree Clustering Algorithm for Web Documents International Conference on Logistics Engineering, Management and Computer Science (LEMCS 2015) Improving Suffix Tree Clustering Algorithm for Web Documents Yan Zhuang Computer Center East China Normal

More information

Unit 6 Chapter 15 EXAMPLES OF COMPLEXITY CALCULATION

Unit 6 Chapter 15 EXAMPLES OF COMPLEXITY CALCULATION DESIGN AND ANALYSIS OF ALGORITHMS Unit 6 Chapter 15 EXAMPLES OF COMPLEXITY CALCULATION http://milanvachhani.blogspot.in EXAMPLES FROM THE SORTING WORLD Sorting provides a good set of examples for analyzing

More information