ICECCS 2017 Exploring Similar Code - From Code Clone Detection to Provenance Identification - Katsuro Inoue Osaka University 1 Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
My Background (Formal / Empirical), '84 to '17: Ph.D. for Implementation & Optimization of Functional Programs; Software Process Modeling & Formalization; Program Slicing; Mining Software Repositories, Code Clone, Code Search; Code Provenance, License Analysis
Welcome to ICECCS 2017 The 22nd International Conference on Engineering of Complex Computer Systems (ICECCS 2017) will be held on November 6-8, 2017, Fukuoka, Japan. Complex computer systems are common in many sectors, such as manufacturing, communications, defense, transportation, aerospace, hazardous environments, energy, and health care. These systems are frequently distributed over heterogeneous networks, and are driven by many diverse requirements on performance, real-time behavior, fault tolerance, security, adaptability, development time and cost, long life concerns, and other areas. Such requirements frequently conflict, and their satisfaction therefore requires managing the trade-off among them during system development and throughout the entire system life. The goal of this conference is to bring together industrial, academic, and government experts, from a variety of user domains and software disciplines, to determine how the disciplines' problems and solution...
Welcome to ICECCS 2017 (the same welcome text, overlaid with two figures) Average: 2.3%, ICECCS: 6.8%
#include <stdlib.h>
#include <stdio.h>
#include <ctype.h>

int main() {
    int tot_chars = 0; /* total characters */
    int tot_lines = 0; /* total lines */
    int tot_words = 0; /* total words */
    int boolean;
    /* EOF == end of file */
    int n;
    while ((n = getchar()) != EOF) {
        tot_chars++;
        /* note: the look-ahead getchar() consumes a character that is never counted */
        if (isspace(n) && !isspace(getchar())) {
            tot_words++;
        }
        if (n == '\n') {
            tot_lines++;
        }
        if (n == '-') {
            tot_words--;
        }
    }
    printf("lines, Words, Characters\n");
    printf(" %3d %3d %3d\n", tot_lines, tot_words, tot_chars);
    return 0;
}
Similarity of Code Snippets
[Diagram: Library C, Program A, Program B; System X, System Y, System Z]
Matters?
[Diagram: Library C, Program A, Program B; System X, System Y, System Z]
[Diagram: Program A, Program A, Program A]
Yes, code similarity matters!
Our Code Clone Research
Code Clone: A code fragment that has an identical or similar code fragment elsewhere. Created accidentally: small popular idioms, e.g., if ((fp = fopen("file", "r")) == NULL) { Created intentionally: copy & paste practice.
Issue at Company X: Maintained code clones by hand. M-LOC system. Check against a hand-made clone list. 20 years?
No tools available in '99
Clone Detection Policies: Determine detection granularity (char / token / line / block / function / ...). Cut off small clones (probably caused by non-intentional idioms). Find all possible clone sets at the same time. Use an efficient and scalable algorithm.
Classification of Clones
Type 1: Exact copy, differing only in white space and comments.
Type 2: Same as Type 1, but also allowing variable renaming.
Type 3: Same as Type 2, but also allowing a few statements to be deleted, added, or changed.
Type 4: Semantically identical, but not necessarily the same syntax.
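The four clone types can be made concrete with a small sketch. This is not CCFinder; it is a minimal illustration that assumes C-like input, a tiny hypothetical keyword list, and an arbitrary similarity threshold of 0.7 for Type 3. The names `tokens`, `normalize`, and `classify` are invented for this example. Type 4 clones cannot be seen by token comparison at all, so they fall under "none" here.

```python
import re
from difflib import SequenceMatcher

# Hypothetical, deliberately tiny keyword list; a real lexer knows the full language.
KEYWORDS = {"int", "for", "if", "else", "while", "return"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokens(code):
    """Drop C-style comments, then split into tokens (white space vanishes here).

    Naive on purpose: comment markers inside string literals are not handled.
    """
    code = re.sub(r"/\*.*?\*/", " ", code, flags=re.S)
    code = re.sub(r"//[^\n]*", " ", code)
    return TOKEN_RE.findall(code)


def normalize(toks):
    """Replace every identifier that is not a keyword by '$' (hides renaming)."""
    return ["$" if re.match(r"[A-Za-z_]", t) and t not in KEYWORDS else t
            for t in toks]


def classify(a, b, type3_threshold=0.7):
    """Return 'type1', 'type2', 'type3', or 'none' for two code fragments."""
    ta, tb = tokens(a), tokens(b)
    if ta == tb:                      # only white space / comments differ
        return "type1"
    na, nb = normalize(ta), normalize(tb)
    if na == nb:                      # additionally, identifiers were renamed
        return "type2"
    ratio = SequenceMatcher(None, na, nb, autojunk=False).ratio()
    if ratio >= type3_threshold:      # a few tokens added, deleted, or changed
        return "type3"
    return "none"


if __name__ == "__main__":
    original = "int sum = 0; /* total */\nfor (i = 0; i < n; i++) sum += a[i];"
    renamed = "int total = 0;\nfor (k = 0; k < n; k++) total += v[k];"
    print(classify(original, renamed))
```

The threshold decides where "a few statements" ends, which is a policy choice and not something the classification itself fixes.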
Example of Clones (type 2)
A Clone Detection Process
Source files:
static void foo() throws RESyntaxException {
    String a[] = new String [] { "123,400", "abc", "orange 100" };
    org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+");
    int sum = 0;
    for (int i = 0; i < a.length; ++i)
        if (pat.match(a[i]))
            sum += Sample.parseNumber(pat.getParen(0));
    System.out.println("sum = " + sum);
}
static void goo(String [] a) throws RESyntaxException {
    RE exp = new RE("[0-9,]+");
    int sum = 0;
    for (int i = 0; i < a.length; ++i)
        if (exp.match(a[i]))
            sum += parseNumber(exp.getParen(0));
    System.out.println("sum = " + sum);
}
[Figure: the token sequence of this source and the transformed token sequence, in which identifiers are replaced by $]
Process: Source files -> Lexical analysis -> Token sequence -> Transformation -> Transformed token sequence -> Match detection -> Clones on transformed sequence -> Formatting -> Clone pairs
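The four stages (lexical analysis, transformation, match detection, formatting) can be sketched end to end as follows. This is an illustration under stated assumptions, not the CCFinder implementation: the match detection step here uses a simple k-gram index instead of a suffix tree, the transformation replaces all identifiers and literals by `$`, and the keyword list, the sample `SOURCE`, and all function names are hypothetical.

```python
import re
from collections import defaultdict

TOKEN_RE = re.compile(r'"[^"\n]*"|[A-Za-z_]\w*|\d+|\S')
# Hypothetical subset of Java keywords, enough for the sample below.
KEYWORDS = {"static", "void", "int", "for", "if", "new", "throws", "return"}

SOURCE = '''static void foo() {
  int sum = 0;
  for (int i = 0; i < a.length; ++i)
    sum += a[i];
  System.out.println("sum = " + sum);
}
static void goo(String[] b) {
  int total = 0;
  for (int k = 0; k < b.length; ++k)
    total += b[k];
  System.out.println("total = " + total);
}'''


def lex(source):
    """Lexical analysis: (token, line number) pairs."""
    return [(tok, lineno)
            for lineno, line in enumerate(source.splitlines(), start=1)
            for tok in TOKEN_RE.findall(line)]


def transform(tokens):
    """Transformation: identifiers, numbers, and strings become '$'."""
    return [("$" if tok not in KEYWORDS and re.match(r'[A-Za-z_"0-9]', tok) else tok, lineno)
            for tok, lineno in tokens]


def detect(tokens, min_tokens):
    """Match detection: maximal, non-overlapping equal runs of >= min_tokens tokens."""
    seq = [tok for tok, _ in tokens]
    index = defaultdict(list)
    for i in range(len(seq) - min_tokens + 1):
        index[tuple(seq[i:i + min_tokens])].append(i)
    pairs = set()
    for starts in index.values():
        for x in range(len(starts)):
            for y in range(x + 1, len(starts)):
                i, j = starts[x], starts[y]
                if i + min_tokens > j:
                    continue  # the two fragments would overlap
                if i > 0 and seq[i - 1] == seq[j - 1]:
                    continue  # not the left end of a maximal match
                n = min_tokens
                while j + n < len(seq) and i + n < j and seq[i + n] == seq[j + n]:
                    n += 1
                pairs.add((i, j, n))
    return sorted(pairs)


def format_pairs(tokens, pairs):
    """Formatting: map token positions back to (first line, last line) ranges."""
    return [((tokens[i][1], tokens[i + n - 1][1]),
             (tokens[j][1], tokens[j + n - 1][1]))
            for i, j, n in pairs]


def find_clones(source, min_tokens=10):
    toks = transform(lex(source))
    return format_pairs(toks, detect(toks, min_tokens))


if __name__ == "__main__":
    for first, second in find_clones(SOURCE):
        print("lines %d-%d  <->  lines %d-%d" % (first + second))
```

The `min_tokens` cut-off plays the role of the "cut off small clones" policy: short idioms never reach the output.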
Match Detection Algorithms: Example token sequence (positions 1 to 7): x x y x y z %. Matching Table: a 7 x 7 table with * wherever the token of the row equals the token of the column. Suffix Tree: built over the same sequence; suffixes that share a path from the root share a common prefix, i.e., a match. [Figure: matching table and suffix tree for x x y x y z %]
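Both ideas can be tried on the slide's sequence x x y x y z %. The sketch below is an assumption-laden illustration: the first function scans the diagonals of the matching table, and the second uses sorted suffixes (a plain suffix array) as a simpler stand-in for the suffix tree. The second function only compares neighbouring suffixes, so on larger inputs it reports a subset of all matching pairs. Positions are 1-based, as on the slide.

```python
def matching_table(seq):
    """table[i][j] is True where token i equals token j (the '*' cells)."""
    n = len(seq)
    return [[seq[i] == seq[j] for j in range(n)] for i in range(n)]


def diagonal_matches(seq, min_len=2):
    """Collect maximal runs of '*' on each diagonal above the main one.

    Returns (start1, start2, length) with 1-based positions.
    """
    n = len(seq)
    table = matching_table(seq)
    found = []
    for offset in range(1, n):
        i = 0
        while i + offset < n:
            if table[i][i + offset]:
                start = i
                while i + offset < n and table[i][i + offset]:
                    i += 1
                if i - start >= min_len:
                    found.append((start + 1, start + offset + 1, i - start))
            else:
                i += 1
    return found


def suffix_matches(seq, min_len=2):
    """Sort all suffixes and compare neighbours by longest common prefix."""
    n = len(seq)
    order = sorted(range(n), key=lambda k: seq[k:])
    found = []
    for a, b in zip(order, order[1:]):
        lcp = 0
        while a + lcp < n and b + lcp < n and seq[a + lcp] == seq[b + lcp]:
            lcp += 1
        if lcp >= min_len:
            i, j = sorted((a, b))
            found.append((i + 1, j + 1, lcp))
    return sorted(found)


if __name__ == "__main__":
    print(diagonal_matches("xxyxyz%"))
    print(suffix_matches("xxyxyz%"))
```

The table costs quadratic time and space in the sequence length, which is the reason a suffix-based structure is attractive for large inputs.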
Initial Result: The initial prototype was implemented in a few weeks. Limited scalability. Refined & tuned. Bought a powerful workstation and expensive memory.
Clones between Unix Kernels [Figure: FreeBSD, Linux, and NetBSD compared against FreeBSD, Linux, and NetBSD]
Let's Submit to IEEE TSE! T. Kamiya, S. Kusumoto, K. Inoue, "CCFinder: a multilinguistic token-based code clone detection system for large scale source code," IEEE Transactions on Software Engineering, Vol.28, No.7, pp.654-670, 2002. Initial submission in July 2000; major revision request in Jan. 2001; minor revision request in Aug. 2001; accepted in Sept. 2001; published in July 2002. 1,413 citations (as of Nov. 3, 2017); 22nd most highly cited paper in SE.
Extending CCFinder: The prototype was extended, tuned, and refined into a practical tool. Target languages: initially C; adapted to C++, COBOL, Java, Lisp, and plain text. Added a GUI and a result visualizer.
Promoting Code Clone Technology: Collaboration with many companies (NTT, Fujitsu, Hitachi, Samsung, Microsoft Research, ...). International Workshop on Software Clones, IWSC ('02~). Code clone seminars (9 times, '02~'07).
> 5,000 downloads
Papers Related to Code Clone [Bar chart: number of papers (axis 0 to 1800) per period: ~'00, '01~'05, '06~'10, '11~'15, '16~]
Applying to New Fields
Compiler Construction Class [Figure: S1 to S5 compared against S1 to S5]
Lawsuit over Lace-making Machines [Figure: A and B]
Clones in near systems
System Evolution Analysis: BSD Unix. Clustering. Similarity Measure ← Cover Ratio
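The slide names cover ratio as the similarity measure without defining it. One plausible reading, assumed here and not taken from the slides, is the fraction of a file's lines that lie inside blocks shared with another file. The sketch uses `difflib.SequenceMatcher` from the Python standard library as the block matcher; the function name and the `min_block` cut-off are hypothetical.

```python
from difflib import SequenceMatcher


def cover_ratio(a_lines, b_lines, min_block=2):
    """Fraction of a_lines covered by blocks of >= min_block lines shared with b_lines.

    Assumed definition for illustration only; the measure is asymmetric.
    """
    if not a_lines:
        return 0.0
    matcher = SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    covered = sum(block.size
                  for block in matcher.get_matching_blocks()
                  if block.size >= min_block)
    return covered / len(a_lines)


if __name__ == "__main__":
    old = ["a", "b", "c", "d"]
    new = ["a", "b", "x", "c", "d"]
    print(cover_ratio(old, new), cover_ratio(new, old))
```

Pairwise values of such a measure can then feed any standard clustering method to group versions of a system.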
Chasing Scalability: CCFinder input limitation: a few M LOC, because the suffix tree algorithm runs on-core. Code clone detection is an embarrassingly parallel problem: divide and conquer. Distributed CCFinder with 80 lab machines.
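The divide-and-conquer idea can be sketched as: split the files into units, then compare every pair of units as an independent task. The sketch below is not D-CCFinder. It assumes a thread pool on one machine in place of 80 lab machines, and a trivial stand-in detector that counts identical lines shared by two files; all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations_with_replacement


def split_into_units(files, unit_size):
    """Divide: group the sorted file names into units of at most unit_size files."""
    names = sorted(files)
    return [names[i:i + unit_size] for i in range(0, len(names), unit_size)]


def compare_units(files, unit_a, unit_b):
    """One independent task. Stand-in detector: count identical lines per file pair."""
    hits = set()
    for fa in unit_a:
        for fb in unit_b:
            if fa < fb:  # each file pair is examined exactly once overall
                common = set(files[fa]) & set(files[fb])
                if common:
                    hits.add((fa, fb, len(common)))
    return hits


def detect_distributed(files, unit_size=2, workers=4):
    """Conquer: run all unit-pair tasks in parallel and merge the partial results."""
    units = split_into_units(files, unit_size)
    tasks = list(combinations_with_replacement(range(len(units)), 2))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(
            lambda t: compare_units(files, units[t[0]], units[t[1]]), tasks)
        return sorted(set().union(*parts))


if __name__ == "__main__":
    demo = {"a.c": ["x", "y"], "b.c": ["y", "z"], "c.c": ["q"], "d.c": ["x", "y", "z"]}
    print(detect_distributed(demo))
```

No task needs the result of another, which is what makes the problem embarrassingly parallel; the number of tasks grows quadratically with the number of units, so unit size trades memory per task against task count.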
FreeBSD Ports Collection / 136 Versions of Linux Kernel: FreeBSD Ports Collection (10.8GB, 400M LOC in C); 136 versions of Linux kernels (7.4GB, 260M LOC in C). Livieri, S., Higo, Y., Matsushita, M., Inoue, K., "Very-Large Scale Code Clone Analysis and Visualization of Open Source Programs Using Distributed CCFinder: D-CCFinder," ICSE 2007.
Searching for clone pairs
34th International Conference on Software Engineering, pp.331-341, Zurich, Switzerland, 2012-05.
Developer's Concerns: Is a code fragment from an existing project reusable in a new project? Origin: Who? When? License? Copyright? Evolution: Maintenance? Popularity? Newer version? To ease these concerns, a support system is needed.
Code History Tracking System [Figure: a code fragment in question placed on a time axis over OSS repositories, with its ancestors and descendants among Proj. A to Proj. H linked by Copy and Modify edges and annotated with License: X and Copyright: Y]
System Overview: Integrated Code History Tracker (Ichi Tracker). Input: query Q, consisting of a code fragment q_c and optional code attributes. Ichi Tracker sends search queries SQ to code search engines (SPARS/R, Google Code Search, Koders) over open source repositories on the Internet and receives search results SR. Output: results R, consisting of code clones with their attributes.
Evolution Pattern of Texture.java [Plot: Cover Ratio (0.4 to 1) against Last Modify Time (2006/10/10 to 2011/02/26); points are files 1 to 26, with jmonkeyengine revisions R3448, R3800, and R4099 marked and labels A, B, C] Legend: file in New BSD License; file in zlib/libpng License; cluster of same or similar files; evolution of jmonkeyengine project. Files: 1 Jmonkeyengine r3448 (K); 2 Simplexe (G); 3 Simplexe (K); 4 Jmonkeyengine r3800 (G); 5 The-project08 (G); 6 Tank Combat Game (K); 7 wrathofthetaboos (G); 8 wrathofthetaboos (K); 9 Lasthaven (G); 10 Jme-cotk (G); 11 Ardor3D (G); 12 Jmonkeyengine r4099 (G); 13 Fairytale-soulfire (G); 14 Jmonkeyengine r4490 (G); 15 Jmonkeyengine r4490 (S); 16 Xenogeddon (G); 17 Deathsquadrendezvous (G); 18 Partiendolapana (G); 19 Tholos (G); 20 Ardor3D (G); 21 Cosmic-engine (G); 22 Fregatclient3d (G); 23 Footballmanagerdesia (G); 24 Multiplicity (G); 25 Jmerefactoring (G); 26 Java3dfh (G)
Evolution Pattern of kern_malloc [Plot: Cover Ratio (0.4 to 1) against last modified time (1993/01/31 to 2012/04/01); points are files 1 to 67] Legend: file in New BSD License; file in Original BSD License. Results by G (Google Code Search) and K (Koders). Files: 1 Lites 1.0 (G); 2 Kernel Source Archive - CMU Mach 3.0 (K); 3 Lites 1.1.u3 (G); 4 Lites 1.1-950808 (G); 5 The Rio (RAM I/O) Project (K); 6 ftp in The University of Edinburgh (G); 7 Mip-summer98 (G); 8 freebsd/sparc (G); 9-12 ftp in Stockholm University (G); 13 freebsd-cam2.1.5r (G); 14-15 SonicOSX (K); 16 Labyrinth BSD (labyrinthos) (K); 17 Oskit (G); 18 Psumip (G); 19 Mach (G); 20-22 Savannah (G); 23 Unofficial OSKit source (K); 24-26 Unofficial OSKit source (oskit) (K); 27 ftp in Stockholm University (G); 28-33 Kame (G); 34-36 SimOS (K); 37-46 Kame (G); 47 Netnice (G); 48 Kame (G); 49-50 Psumip (G); 51 Netnice (G); 52 Reflexprotocol (G); 53 Netnice (G); 54 NetBSD v1.105 (K); 55 OpenBSD PV Xen (G); 56 OpenBSD v1.73 (K); 57 Pmon (G); 58-62 Proyecto A.T.L.D. GNU/hurd (extremelinux) (K); 63 OpenBSD v1.74 (K); 64 Pmon (G); 65 774 (G); 66 Chord-ns3 (G); 67 Openbsd-loongsonvc (G)
Current and Future
OSS Dependency [Figure: OSS Systems X, Y, Z and Libraries P, Q, related by Copy / Paste (Clone A, Clone B) and by Import / Export]
[Figure: graph of several hundred JavaScript package names, e.g., mocha, grunt, lodash, eslint, babel, gulp-*, karma-*, react-redux, ember-cli-*]
Raula Gaikovina Kula, Daniel M. German, Ali Ouni, Takashi Ishio, Katsuro Inoue, "Do developers update their library dependencies? An empirical study on the impact of security advisories on library migration," accepted at EMSE; FSE 2017 Journal First.
OSS Ecosystem [Figure: OSS repositories hold projects; projects depend on other projects; developers contribute to projects and users use them]
Summary
Finding similar code in software matters
Code clone analysis has become a very popular SE technology
Analyzing code similarity is key to understanding OSS provenance and evolution
Exploring the OSS universe with code similarity analysis
Thank you