MapReduce as a General Framework to Support Research in Mining Software Repositories (MSR)


Weiyi Shang, Zhen Ming Jiang, Bram Adams, Ahmed E. Hassan
Software Analysis and Intelligence Lab (SAIL), Queen's University, Kingston, Canada
{swy, zmjiang, bram, ahmed}@cs.queensu.ca

Abstract

Researchers continue to demonstrate the benefits of Mining Software Repositories (MSR) for supporting software development and research activities. However, as the mining process is time and resource intensive, they often create their own distributed platforms and use various optimizations to speed up and scale up their analysis. These platforms are project-specific, hard to reuse, and offer minimal debugging and deployment support. In this paper, we propose the use of MapReduce, a distributed computing platform, to support research in MSR. As a proof-of-concept, we migrate J-REX, an optimized evolutionary code extractor, to run on Hadoop, an open source implementation of MapReduce. Through a case study on the source control repositories of the Eclipse, BIRT and Datatools projects, we demonstrate that the migration effort to MapReduce is minimal and that the benefits are significant, as the running time of the migrated J-REX is only 30% to 50% of the original J-REX's. This paper documents our experience with the migration, and highlights the benefits and challenges of the MapReduce framework for the MSR community.

1 Introduction

The Mining Software Repositories (MSR) field analyzes and cross-links the rich data available in software repositories to uncover interesting and actionable information about software systems [15]. Examples of software repositories include source control repositories, bug repositories, archived communications, deployment logs, and code repositories. Research in the MSR field has received an increasing amount of interest. Most MSR techniques have been demonstrated on large-scale software systems. However, the size of the data available for mining continues to grow at a very high rate. For example, the size of the Linux kernel source code exhibits super-linear growth [13]. MSR researchers continue to explore deeper and more sophisticated analysis across a large number of long-lived systems. Robles et al. reported that Debian, a well-known Linux distribution, doubles in size approximately every two years [22]. Combined with the huge system size, this may pose problems for analyzing the evolution of the Debian system in the future.

The large and continuously growing software repositories and the need for deeper analysis impose challenges on the scalability of MSR techniques. Powerful computers and sophisticated software mining algorithms are needed to successfully analyze and cross-link the data in a timely fashion. Prior research focuses especially on building home-grown solutions for this. The authors of D-CCFinder [18], a distributed version of the popular CCFinder [17] clone detection tool, improved their processing time from 40 days on a single PC-based workstation (Intel Xeon 2.8 GHz, 2 GB RAM) to 2 days on a distributed system consisting of 80 PCs. Their home-grown solution is reported to contain about 20 kLOC of Java code, which must be maintained and enhanced by these MSR researchers and does not directly translate to other analyzers.

Tackling the problem of processing large software repositories in a timely fashion is of paramount importance to the future of the MSR field in general, as we aim to improve the adoption rate of MSR techniques by practitioners. We envision a future where sophisticated MSR techniques are integrated into IDEs that run on commodity workstations and that provide fast and accurate results to developers and managers. In short, one cannot require every MSR researcher to have large and expensive servers. Furthermore, home-grown solutions to optimize mining performance require huge development and maintenance efforts. Last but not least, the task of performance tuning turns our attention away from the real problem, which is to uncover interesting repository information.

In many cases, MSR researchers do not have the expertise required nor the interest to improve the performance of their data mining algorithms. Techniques are needed that hide the complexity of scaling yet provide researchers with the benefits of scale. Research shows that scaling out (distributed systems) is always better than scaling up (bigger and more powerful machines) [20]. Off-the-shelf distributed frameworks are promising technologies that can help our field. In this paper, we explore one of these technologies, called MapReduce [7]. As a proof-of-concept, we migrate J-REX, an optimized evolutionary code extractor, to run on the Hadoop platform. Hadoop is a popular open-source implementation of MapReduce which is increasingly gaining popularity and has proved to be scalable and of production quality. Companies like Yahoo have Hadoop installations with over 5,000 machines, and Hadoop is also used by the Amazon computing clouds. With many companies involved in its development and maintenance, Hadoop is rapidly maturing. Through a case study on the source control repositories of the Eclipse, BIRT and Datatools projects, we show that the migration effort to MapReduce is minimal and that the benefits are significant. The migrated J-REX solutions are 4 times faster than the original J-REX. This paper documents our experience with the migration and highlights the benefits and challenges of adopting off-the-shelf distributed frameworks in the MSR community.

The paper is organized as follows. Section 2 imposes the requirements for a general distributed framework to support MSR research. MapReduce is explained in Section 3, as well as one of its open source implementations: Hadoop. Section 4 discusses our case study in which we migrate J-REX, an evolutionary code extractor running on a single machine, to Hadoop. The repercussions of this case study and the limitations of our approach are discussed in Section 5. Section 6 presents related work. Finally, Section 7 concludes the paper and presents future work.

2 Requirements for a General Framework to Support MSR Research

We seek four common requirements for large distributed platforms to support MSR research. We detail them as follows:

1. Adaptability: The platform should take MSR researchers minimal effort to migrate from their prototype solutions, which are developed on a single machine.
2. Efficiency: The adoption of the platform should drastically speed up the mining process.
3. Scalability: The platform should scale with the size of the input data as well as with the available computing power.
4. Flexibility: The platform should be able to run on various types of machines, from expensive servers to commodity PCs or even virtual machines.

This paper presents and evaluates MapReduce as a possible distributed platform which satisfies these four requirements.

3 MapReduce

MapReduce [7] is a distributed framework for processing vast data sets. It was originally proposed and used by Google engineers to process the large amount of data they must analyze on a daily basis. The input data for MapReduce consists of a list of key/value pairs. Mappers accept the incoming pairs, and map them into intermediate key/value pairs. Each group of intermediate data with the same key is then passed to a specific set of reducers, each of which performs computations on the data and reduces it to one single key/values pair. The sorted output of the reducers is the final result of the MapReduce process. In this paper, we simplify the discussion of MapReduce by assuming that reducers accept values instead of key/value pairs. To illustrate MapReduce, we consider an example MapReduce process which counts the frequency of word lengths in a book. The example process is shown in Figure 1.

The mappers accept every single word from the book, and make keys for them. Because we want to count the frequency of all words with different lengths, a typical approach would be to use the length of the word as the key. So, for the word "hello", a mapper will generate a key/value pair of 5/hello. Afterwards, the key/value pairs with the same key are grouped and sent to the reducers. A reducer, which receives a list of values with the same key, can simply count the size of this list and keep the key in its output. If a reducer receives a list with key 5, for example, it will count the size of the list of all words that have length 5. If the size is n, it outputs an output pair 5/n, which means there are n words with length 5 in the book.

The power and challenge of the MapReduce model resides in its ability to support different mapping and reducing strategies. For example, an alternative implementation could map each input value (i.e., word) based on its first letter and its length. Then, reducers would process those words starting with one or a small number of different letters (keys), and perform the counting. This MapReduce strategy permits an increasing number of reducers that can work in parallel on the problem. However, the final output needs additional post-processing in comparison to the first strategy. In short, both strategies can solve the problem, but each strategy has different performance and implementation benefits and challenges.
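To make the first strategy concrete, the following is a minimal sketch of the word-length counter written against the current Hadoop Java API (org.apache.hadoop.mapreduce, Hadoop 2.x). The paper itself used the older Hadoop 0.19 interfaces, so the class names and job setup here are illustrative rather than taken from the original code; the mapper emits (word length, word) and the reducer counts the size of each group, exactly as described above.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordLengthCount {

      // Mapper: for every word in the book, emit (word length, word), e.g. (5, "hello").
      public static class LengthMapper extends Mapper<Object, Text, IntWritable, Text> {
        private final IntWritable length = new IntWritable();

        @Override
        protected void map(Object offset, Text line, Context context)
            throws IOException, InterruptedException {
          for (String word : line.toString().split("\\s+")) {
            if (word.isEmpty()) continue;
            length.set(word.length());
            context.write(length, new Text(word));
          }
        }
      }

      // Reducer: all words with the same length arrive as one list; output (length, list size).
      public static class CountReducer extends Reducer<IntWritable, Text, IntWritable, IntWritable> {
        @Override
        protected void reduce(IntWritable length, Iterable<Text> words, Context context)
            throws IOException, InterruptedException {
          int count = 0;
          for (Text ignored : words) count++;
          context.write(length, new IntWritable(count)); // (5, n): n words of length 5
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word length frequency");
        job.setJarByClass(WordLengthCount.class);
        job.setMapperClass(LengthMapper.class);
        job.setReducerClass(CountReducer.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // text of the book
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // word-length histogram
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The alternative strategy sketched above (keying by first letter and length) would only change the mapper's key construction; the framework takes care of grouping and of routing each key to a reducer.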

Figure 1. Example MapReduce process for counting the frequency of word lengths in a book.

Hadoop is an open-source implementation of MapReduce [3] which is supported by Yahoo and is used by Amazon, AOL, Baidu and a number of other companies for their distributed solutions. Hadoop can run on various operating systems such as Linux, Windows, FreeBSD, Mac OS X and OpenSolaris. It not only implements the MapReduce model, but also provides a distributed file system, called the Hadoop Distributed File System (HDFS). Hadoop supplies Java interfaces to simplify the MapReduce model and to control HDFS programmatically. Another advantage for users is that Hadoop by default comes with some basic and widely used mapping and reducing methods, for example to split files into lines, or to split a directory into files. With these methods, users occasionally do not have to write new code to use MapReduce.

We used Hadoop as our MapReduce implementation for the following four reasons:

1. Hadoop is easy to use. Researchers do not want to spend considerable time on modifying their mining program to make it distributed. The simple MapReduce Java interface simplifies the process of implementing mappers and reducers.
2. Hadoop runs on different operating systems. Academic research labs tend to have a heterogeneous network of machines with different hardware configurations and varying operating systems. Hadoop can run on most current operating systems and hence exploit as much of the available computing power as possible.
3. Hadoop runs on commodity machines. The largest computation resources in research labs and software development companies are desktop computers and laptops. This characteristic of Hadoop permits these computers to join and leave the computing cluster in a dynamic and transparent fashion without user intervention.
4. Hadoop is a mature and open source system. Hadoop has been successfully used in many commercial projects. It is actively developed, with new features and enhancements continuously being added. Since Hadoop is free to download and redistribute, it can be installed on multiple machines without worrying about per-seat or licensing costs.

Based on these points, we consider Hadoop as the most suitable MapReduce implementation for our research. The next section evaluates the ability of Hadoop, and hence MapReduce, to satisfy the four requirements of Section 2.

4 Case study

To validate the promise of MapReduce for MSR research, we discuss our experience migrating an evolutionary code extractor called J-REX to Hadoop. J-REX is a highly optimized evolutionary code extractor for Java systems, similar to C-REX [14]. As shown in Figure 2, the whole process of J-REX spans three phases. The first phase is the extraction phase, where J-REX extracts source code snapshots for each file from a CVS repository. In the second phase, i.e., the parsing phase, J-REX calls for each file snapshot the Eclipse JDT parser to parse the Java code into its abstract syntax tree [1], which is stored as an XML document. In the third phase, i.e., the analysis phase, J-REX compares the XML documents of consecutive file revisions to determine the changed code units, and generates evolutionary change data in an XML format [16].

Table 1. Disk performance of the desktop and server computers.
                Cached read speed    Cached write speed
  Server        8,531 MB/sec         211 MB/sec
  Desktop       3,302 MB/sec         107 MB/sec
                Random read speed    Random write speed
  Server        2,986 MB/sec         1,075 MB/sec
  Desktop       1,488 MB/sec         658 MB/sec

Table 2. Characteristics of Eclipse, BIRT and Datatools.
  Repository    Size     #Source Code Files   Length of History   #Revisions
  Datatools     394 MB   10,…                 … years             2,398
  BIRT          810 MB   13,…                 … years             19,583
  Eclipse       4.2 GB   56,…                 … years             82,682

Figure 2. Architecture of J-REX.

The evolutionary change data reports the evolution of a software system at the level of code entities such as methods and classes (for example, "class A was changed to add a new method B"). The architecture of J-REX is comparable to the architecture of other MSR tools. The J-REX runtime process requires a huge amount of I/O operations, which are performance bottlenecks, and a large amount of computing power when comparing XML trees. The I/O and computational characteristics of J-REX make it an ideal case study to study the performance benefits of the MapReduce computation model.

Through this case study, we seek to verify whether the Hadoop solution satisfies the four requirements listed in Section 2:

1. Adaptability: We explain the process to migrate the basic J-REX, a non-distributed MSR tool, to three different distributed Hadoop solutions (DJ-REX1, DJ-REX2 and DJ-REX3).
2. Efficiency: For all three Hadoop solutions, we compare the performance of the mining process among desktop and server machines.
3. Scalability: We examine the scalability of the Hadoop solutions on three data repositories with varying sizes. We also examine the scalability of the Hadoop solutions running on a varying number of machines.
4. Flexibility: Finally, we study the flexibility of the Hadoop platform by deploying Hadoop on virtual machines in a multicore environment.

In the rest of this section, we first explain our experimental environment and the details of our experiments. Then, we discuss whether or not using Hadoop for software mining satisfies the 4 requirements of Section 2.

4.1 Experimental environment

Our Hadoop installation is deployed on 4 computers in a local gigabit network. The 4 computers consist of 2 desktop computers, each having an Intel Quad Core 2.40 GHz CPU with 2 GB of RAM, and of 2 server computers, one having an Intel Core i7 CPU with 8 logical cores (Hyper-Threading) and 6 GB of RAM, and the other one having an Intel Quad Core 2.40 GHz CPU with 8 GB of RAM and a RAID5 disk. The 8-core server machine has Solid State Disks (SSD) instead of regular RAID disks. The difference in disk performance between the regular disk machines and the SSD disk server computer, as measured by hdparm and iozone (64 kB block size), is shown in Table 1. The server's I/O speed with the SSD drive is twice as fast as the machines with a regular disk for random I/O, and turns out to be two and a half times as fast for cached operations.

The source control repositories used in our experiments consist of the whole Eclipse repository and 2 sub-projects from Eclipse called BIRT and Datatools. Eclipse has a large repository with a long history, BIRT has a medium repository with a medium length history, and Datatools has a small repository with a short history. Using these 3 repositories with different size and length of history, we can better evaluate the performance of our approach across subject systems. The repository information of the 3 projects is shown in Table 2.

Table 3. Experimental results for DJ-REX in Hadoop (running times in h:mm:ss).
  Repository   Desktop    Server     Strategy    2 nodes   3 nodes   4 nodes
  Datatools    0:35:50    0:34:14    DJ-REX3     0:19:52   0:14:32   0:16:40
  BIRT         2:44:09    2:05:55    DJ-REX1     2:03:51   2:05:02   2:16:03
                                     DJ-REX2     1:40:22   1:40:32   1:47:26
                                     DJ-REX3     1:08:36   0:50:33   0:45:16
                                     DJ-REX3*              3:02:47
  Eclipse                 12:35:34   DJ-REX3                         3:49:…

4.2 Experiments

We conduct the following experiments:

1. Run J-REX without Hadoop on the BIRT, Datatools and Eclipse repositories.
2. Run DJ-REX1, DJ-REX2 and DJ-REX3 on BIRT with 2, 3 and 4 machines.
3. Run DJ-REX3 on Datatools with 2, 3 and 4 machines.
4. Run DJ-REX3 on Eclipse with 4 machines.
5. Run DJ-REX3 on BIRT with 3 virtual machines.

Only DJ-REX3 is applied in the last three experiments, because the experimental results for the smallest system, i.e. BIRT, already showed a significant speed improvement compared to the other two distributed strategies and the original, undistributed J-REX. The results of all five experiments are summarized in Table 3 and are discussed in the next section. The row with DJ-REX3* corresponds to the experiment that has DJ-REX3 running on 3 virtual machines.

4.3 Case study discussion

This section uses the experiment data and results of Table 3 to discuss whether or not the various DJ-REX solutions meet the 4 requirements outlined in Section 2.

Adaptability

Table 4 shows the implementation and deployment effort required for DJ-REX. We first discuss the effort devoted to porting J-REX to Hadoop. Then we present our experience with configuring Hadoop to add in more computing power. The implementation effort of the three DJ-REX solutions decreases as we got more acquainted with the technology.

Table 4. Effort to program and deploy DJ-REX.
  J-REX Logic                        No Change
  MapReduce strategy for DJ-REX1     400 LOC, 2 hours
  MapReduce strategy for DJ-REX2     400 LOC, 2 hours
  MapReduce strategy for DJ-REX3     300 LOC, 1 hour
  Deployment Configuration           1 hour
  Reconfiguration                    1 minute

Table 5. Overview of the distributed steps in DJ-REX1 to DJ-REX3.
              Extraction   Parsing   Analysis
  DJ-REX1     No           No        Yes
  DJ-REX2     No           Yes       Yes
  DJ-REX3     Yes          Yes       Yes

Easy to experiment with various distributed solutions

As is often the case, MSR researchers do not have the expertise required for, nor do they have the interest in, improving the performance of their mining algorithms. The need to rewrite an MSR tool from scratch to make it run on Hadoop is not an acceptable option. If the programming time for the Hadoop migration is long (maybe as long as re-implementing it), then the chances of adopting Hadoop become very low. In addition, if one has to modify a tool in such an invasive way, considerably more time will have to be spent to test it again once it runs distributed.

We found that applications are very easy to port to Hadoop. First of all, Hadoop provides a number of default mechanisms to split input data across mappers. For example, the MultiFileSplit class splits the files in a directory, whereas the DBInputSplit class splits the rows in a database table. Often, one can reuse these existing mapping strategies. Second, Hadoop has well-defined and simple APIs to implement a MapReduce process. One just needs to implement the corresponding interfaces to make a custom MapReduce process. Third, several code examples are available to show users how to write MapReduce code with Hadoop [3]. One of the code examples searches for a string in files, which is just like the grep command in Linux; another example counts word frequencies in files. After looking at the available code examples, we found that we could reuse the code for splitting the input data by files, because the input data of the examples and of J-REX are all stored in multiple files. Then, we spent a few hours to write around 400 lines of Java code for each of the three DJ-REX MapReduce strategies. The programming logic of J-REX itself barely changed.
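As a rough illustration of how little glue code such a port involves, the skeleton below uses the older org.apache.hadoop.mapred interfaces that Hadoop 0.19 shipped with. The J-REX-specific names (SnapshotMapper, RevisionReducer, the JRex.analyze stub) are hypothetical stand-ins for the unchanged single-machine logic, not the authors' actual code; input records are assumed to be tab-separated "fileName<TAB>snapshotPath" lines.

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.KeyValueTextInputFormat;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class DJRexJob {

      // Hypothetical stand-in for the existing, unchanged J-REX analysis logic.
      static class JRex {
        static String analyze(String file, Iterator<Text> snapshots) {
          int n = 0;
          while (snapshots.hasNext()) { snapshots.next(); n++; }
          return n + " revisions analyzed";
        }
      }

      // Mapper: each input line names one revision snapshot; key it by source file
      // so that all revisions of the same file end up at the same reducer.
      public static class SnapshotMapper extends MapReduceBase
          implements Mapper<Text, Text, Text, Text> {
        public void map(Text fileName, Text snapshotPath,
                        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
          out.collect(fileName, snapshotPath);            // e.g. (a.java, a_0.java)
        }
      }

      // Reducer: receives every snapshot of one file and calls the unchanged analysis logic.
      public static class RevisionReducer extends MapReduceBase
          implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text fileName, Iterator<Text> snapshots,
                           OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
          out.collect(fileName, new Text(JRex.analyze(fileName.toString(), snapshots)));
        }
      }

      public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(DJRexJob.class);
        conf.setJobName("dj-rex");
        conf.setInputFormat(KeyValueTextInputFormat.class);  // tab-separated key/value lines
        conf.setMapperClass(SnapshotMapper.class);
        conf.setReducerClass(RevisionReducer.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
      }
    }

The point of the sketch is the shape of the port: the domain logic stays as-is, and only the thin mapper/reducer wrappers and the job configuration are new.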

Figure 3. MapReduce strategy for DJ-REX.

Here we explain our MapReduce strategies for DJ-REX. Intuitively, we need to compare the difference between adjacent revisions of a Java source code file. We could define the key/value pair output of the map function as (D1, a_0.java and a_1.java), and the reducer function output as (revision number, evolutionary information). The key D1 represents the difference between the two versions, and a_0.java and a_1.java represent the names of the two files. Because of the way that we partition the data, each revision needs to be copied and transferred to more than one node, which generates extra overhead for the mining process and turned out to make the process much longer. The failure of this naive strategy shows the importance of designing a good MapReduce strategy.

Therefore, we tried another basic MapReduce strategy, as shown in Figure 3. This strategy performs much better than our naive strategy. The key/value pair output of the map function of this strategy is defined as (file name, revision snapshot), whereas the key/value pair output of the reduce function is (file name, evolutionary information for this file). For example, file a.java has 3 revisions. In the mapping phase, it generates 3 (key, value) pairs: (a.java, a_0.java), (a.java, a_1.java) and (a.java, a_2.java). Since they have the same key, these three pairs are sent to the same reducer. The final output for file a.java is the generated evolutionary information. Using files to partition the input data is an intuitive and often the most suitable option for many mining techniques which want to explore the use of Hadoop.

On top of this basic MapReduce strategy, we have implemented 3 flavors of DJ-REX (Table 5). Each flavor distributes a different combination of the 3 phases of the original J-REX implementation (Figure 2). The first flavor is called DJ-REX1. We use one computer to extract the source code offline and to parse it into AST form. After finishing the first two phases, the output files are put in HDFS and Hadoop is used to analyze the change information. In this case, only 1 phase of J-REX becomes distributed, and the files we copy into HDFS are XML files which represent the AST information of each version of all source code files in the repository. For DJ-REX2, one more phase, the parsing phase, becomes distributed. Therefore, only the extraction phase still runs as non-distributed. The extraction phase extracts all versions of the source code files from the repository, and these files are copied into HDFS and are used as input to the following Hadooped phases. Finally, DJ-REX3 is a fully-distributed implementation with all 3 phases running in a distributed fashion. The input for DJ-REX3 is the raw CVS data and Hadoop is used throughout all 3 phases.

Figure 4. Running time comparison of J-REX on a server machine and a desktop machine, compared to the fastest DJ-REX3 deployment for Datatools.

Easy to deploy and add more computing power

It took us only 1 hour to learn how to deploy Hadoop in the local network. To expand the experiment cluster (i.e., to add more machines), we only needed to add the machines' names in a configuration file and install Hadoop on those machines. Based on our experience, we feel that porting J-REX to Hadoop is easy and straightforward, and for sure easier and less error-prone than implementing our own distributed platform.
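For DJ-REX1 and DJ-REX2, the intermediate XML files have to be placed into HDFS before the distributed phases can start. A minimal sketch of doing this through Hadoop's Java FileSystem API is shown below; the local and HDFS paths are made-up illustrations, not paths from the paper.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CopyIntoHdfs {
      public static void main(String[] args) throws Exception {
        // Reads the cluster configuration (including the HDFS namenode address).
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);

        // Hypothetical paths: locally produced AST XML files and their HDFS destination.
        Path localAstDir = new Path("/tmp/jrex/ast-xml");
        Path hdfsInputDir = new Path("/user/jrex/input");

        // This copy is the "copy data time" overhead discussed around Figure 7 below.
        hdfs.copyFromLocalFile(localAstDir, hdfsInputDir);
        System.out.println("Copied " + localAstDir + " to " + hdfsInputDir);
      }
    }

The same effect can be obtained on the command line with the hadoop fs -put command; either way, the copy has to finish before the distributed phases can begin.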

Figure 5. Running time of J-REX on a server machine and a desktop machine compared to the fastest deployment of DJ-REX1, DJ-REX2 and DJ-REX3 for BIRT.

Figure 6. Running time of J-REX on a server machine compared to DJ-REX3 with 4 worker nodes for Eclipse.

Efficiency

We now use our experimental data to test how much time could be saved by using Hadoop for the mining process. Figure 4 (Datatools), Figure 5 (BIRT) and Figure 6 (Eclipse) present the results of Table 3 in a graphical way. From Figure 4 (Datatools) and Figure 5 (BIRT), we can draw the following two conclusions. On the one hand, faster and more powerful machinery can speed up the mining process. For example, running J-REX on a very fast server machine with SSD drives for the BIRT repository saves around 40 minutes compared with running it on the desktop machine. On the other hand, all DJ-REX solutions perform no worse or even better than the J-REX solutions regardless of the difference in hardware. As shown in Figure 5, the running time on the SSD server machine is almost the same as that of DJ-REX1, which only has the analysis phase distributed, since the analysis phase is the shortest of all three J-REX phases; therefore, the performance gain of DJ-REX1 is not significant. DJ-REX2 and DJ-REX3, however, outperform the server. The running time of DJ-REX3 on BIRT is almost one quarter of running it on a desktop machine and one third of the time of running it on a server machine. The running time of DJ-REX3 for Datatools has been reduced to around half of the time taken by the desktop and server solutions, and for Eclipse to around a quarter of the time of the server solution. It is clear that the more we distribute our process, the less time is needed.

Figure 7. Comparison of the running time of the 3 flavors of DJ-REX for BIRT.

Figure 7 shows detailed performance statistics of the three flavors of DJ-REX for the BIRT repository. The total running time can be broken down into three parts: the preprocess time (black) is the time needed for the non-distributed phases, the copy data time (light blue) is the time taken for copying the input data into the distributed file system, and the process data time (white) is the time needed by the distributed phases. In Figure 7, the running time of DJ-REX3 is always the shortest, whereas DJ-REX1 always takes the longest time. The reason for this is that the undistributed black parts dominate the process time for DJ-REX1 and DJ-REX2, whereas in DJ-REX3 everything is distributed. Hence, the fully distributed DJ-REX3 is the most efficient one.

In Figure 7, the process data time (white) decreases constantly. The MapReduce strategy of DJ-REX basically divides the job by files, which are processed independently from each other in different mappers. Hence, one could approximate the job's running time by dividing the total processing time by the number of Hadoop nodes, as sketched below. The more Hadoop nodes there are, the smaller the incremental benefit of extra nodes. In addition, a new node introduces more overhead, like network overhead or distributed file system data synchronization. Figure 7 clearly shows that the copy data time (light blue) increases when adding nodes, and hence that the performance with 4 nodes is not always the best one (e.g. for Datatools).

Our experiments show that using Hadoop to support MSR research is an efficient and viable approach that can drastically reduce the required processing time.
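The trade-off between extra nodes and extra copy overhead can be made explicit with a back-of-the-envelope cost model (our own reading of Figure 7, not a formula from the paper):

    T_{total}(n) \approx T_{pre} + T_{copy}(n) + T_{dist}/n

Here T_{pre} is the time of the non-distributed phases (black), T_{copy}(n) is the time to load the input into HDFS, which grows as more nodes are involved, and T_{dist} is the total work of the distributed phases, which is split across the n nodes. The marginal gain of an (n+1)-th node, T_{dist}/n - T_{dist}/(n+1) = T_{dist}/(n(n+1)), shrinks quickly while T_{copy}(n) keeps growing, so beyond some cluster size the copy overhead outweighs the parallelization gain, which matches the 4-node results for Datatools.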

Figure 8. Running time comparison for BIRT and Datatools with DJ-REX3.

Figure 9. Running time of the basic J-REX on a desktop and a server machine, and of DJ-REX3 on 3 virtual machines on the same server machine.

Scalability

Eclipse has a large repository, BIRT has a medium-sized repository and Datatools has a small repository. From Figure 4 (Datatools), Figure 5 (BIRT), Figure 6 (Eclipse) and Figure 7 (BIRT), it is clear that Hadoop reduces the running time for each of the three repositories. When mining the small Datatools repository, the running time is reduced to 50%. The bigger the repository, the more time can be saved by Hadoop. The running time can be reduced to 36% and 30% of the non-Hadoop version for the BIRT and Eclipse repositories, respectively.

Figure 8 shows that Hadoop scales well for different numbers of nodes (2 to 4) for BIRT and Datatools. We did not include the running time for Eclipse because of its large variance and the fact that we could not run Eclipse on the desktop machine (we could not fit the entire data into memory). However, from Figure 6 we know that the running time for Eclipse on the server machine is more than 12 hours and that it only takes a quarter of this time (around 3.5 hours) using DJ-REX3.

Unfortunately, we found that the performance of DJ-REX3 is not proportional to the amount of computing resources introduced. From Figure 8, we observe that adding a fourth node introduces additional overhead to our process, since copying the input data to another node outweighs parallelizing the tasks to more machines. The optimal number of nodes depends on the mining problem and the MapReduce strategies that are being used, as outlined in Section 3.

Flexibility

Hadoop runs on many different platforms (i.e., Windows, Mac and Unix). In our experiments, we used server machines with and without SSD drives, and relatively slow desktop machines. Because of the load balance control in Hadoop, each machine is given a fair amount of work. Because network latency could be one of the major causes of the data copying overhead, we did an experiment with 3 Hadoop nodes running in 3 virtual machines on the Intel Quad Core server machine. Running only 3 virtual machines increases the probability that each Hadoop process has its own processor core, whereas running Hadoop inside virtual machines should eliminate the majority of the network latency.

Figure 9 shows the running time of DJ-REX3 when deployed on 3 virtual machines on the same server machine. The performance of DJ-REX3 in virtual machines turns out to be worse than that of the undistributed J-REX. We suspect that this happens because the virtual machine setup results in slower disk accesses than a deployment on a physical machine. This could be improved by using a redundant storage array (RAID) or a networked storage array, but this is future work. The ability to run Hadoop in a virtual machine can be used to deploy a large Hadoop cluster in a very short time by rapidly replicating and starting up virtual machines. A well configured virtual machine could be deployed to run the mining process without any configuration, which is extremely suitable for non-experts.

5 Discussion and Limitations

MapReduce on other software repositories

Multiple types of repositories are used in the MSR field, but in principle MapReduce could be used as a standard platform to speed up and scale up different analyses. The main challenge is deriving optimal mapping strategies. For example, a MapReduce strategy could split mailing list data by time or by sender name when mining a mailing list repository. Similarly, when mapping a bug report repository, the creator and creation time of the bug report could be used as splitting criteria.
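As an illustration of such a splitting criterion (our own hypothetical example, not code from the paper), a mapper for a mailing-list analysis could simply re-key each message by its sender, so that one reducer sees all messages by the same author:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical mapper for a mailing-list repository: the input value is assumed to
    // hold one raw message; the sender's address becomes the key.
    public class SenderMapper extends Mapper<Object, Text, Text, Text> {
      @Override
      protected void map(Object key, Text rawMessage, Context context)
          throws IOException, InterruptedException {
        String sender = "unknown";
        for (String line : rawMessage.toString().split("\n")) {
          if (line.startsWith("From:")) {            // naive header parsing, for illustration only
            sender = line.substring("From:".length()).trim();
            break;
          }
        }
        context.write(new Text(sender), rawMessage);
      }
    }

Such a mapper would also need an input format that emits one whole message per record; Hadoop's default line-based input format would not do this out of the box.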
Incremental processing

Incremental processing is one possible way to deal with large repositories and extensive analysis. Instead of processing the data from a long history in one shot, one could incrementally process the data on a weekly or monthly basis. However, incremental processing requires more sophisticated designs of the mining algorithms, and sometimes is just not possible to achieve. Since researchers are mostly prototyping their ideas, a brute force approach might be more desirable, with optimizations (such as incremental processing) to follow later. The low cost of migrating an analysis technique to MapReduce is negligible compared to the complexity of migrating a technique to support incremental processing.

One-time processing

One-time processing involves processing a repository once, and then storing it in a compact format for subsequent querying and analysis. Clearly, the cost of one-time processing is not a major concern. However, we believe that MapReduce can help in two ways: 1) scaling the number of possible systems that can be analyzed, and 2) speeding up the prototyping phase. Using a MapReduce implementation, analyzing and querying a large system is simply faster than when doing one-time processing without MapReduce. Moreover, although one-time processing might require a single pass through the data, it is often the case that the developers of a technique explore a lot of ideas as they are prototyping their algorithms, and have to debug the technique. The repositories must be analyzed time and time again in these cases. We believe MapReduce can help speed up the prototyping phase and offer researchers more timely feedback on their ideas.

Robustness

MapReduce and its Hadoop implementation offer a robust computation model which can deal with different kinds of failures at run-time. If certain nodes fail, the tasks belonging to the failed nodes are automatically re-assigned to other nodes. All other nodes are notified to avoid trying to read data from the failed nodes. Dean et al. [7] reported MapReduce runs in which over 80 nodes became unreachable, yet the processing continued and finished successfully. This type of robustness permits the execution of Hadoop on laptops and non-dedicated machines, such that lab computers can join and leave a Hadoop cluster rapidly and easily based on the needs of their owners. For example, students can join a Hadoop cluster while they are away from their desk and leave it on until they are back.

Current Limitations

Most of the current limitations are imposed by the implementation of Hadoop. Locality is one of the most important issues for a distributed platform, as network bandwidth is a scarce resource when processing a large amount of data. To address this, Hadoop attempts to replicate the data across nodes so that the nearest replica of the data can always be located. In Hadoop, a typical configuration with hundreds of computers would by default have only 3 copies of the data. In this case, the chance of finding the required data stored on the local machine is very small. However, increasing the number of data copies requires more space and more time to put the large amount of data into the distributed file system. This in turn leads to more processing overhead.

Deploying data into the HDFS file system is another limitation of Hadoop. In the current Hadoop version (0.19.0), all input data needs to be copied into HDFS, which adds much overhead. As Figure 7 and Figure 8 show, the running time with 4 nodes may not be the shortest one. Finding out the optimal Hadoop configuration is future work.

6 Related Work

Automated evolutionary extractors, optimized mining solutions and distributed computing platforms are the three areas of research most related to our work.

Automated evolutionary extractors

Hassan developed an evolutionary code extractor for the C language called C-REX [14]. The Kenyon framework [6] combines various source code repositories into one to facilitate software evolution research. Draheim et al. created Bloof [8], by which users can define custom evolution metrics from CVS logs. Alonso et al. developed Minero [5], which uses database techniques to integrate and manage data from software repositories. Godfrey et al. [13] developed evolutionary extractors that use metrics at the system and subsystem level to monitor the evolution of each release of Linux. In addition, Qu et al. [23] developed evolutionary extractors that track structural dependency changes at the file level for each release of the GCC compiler. Gall et al. [10, 11] have developed evolutionary extractors that track the co-change of files for each changelist in CVS. Gall et al. [9] developed extractors which track source code changes. Zimmermann et al. [24] present an extractor which determines the changed functions for each changelist. All these tools can be easily ported to the MapReduce framework.

Optimizing Mining Solutions

To the authors' knowledge, there is only one related work which tries to optimize software mining solutions on large scale data, i.e. D-CCFinder [18, 19]. D-CCFinder is a distributed implementation of CCFinder to analyze source code of a large size and long history in a relatively short time. Unfortunately, this implementation is home-grown and specialized to CCFinder, and not open to other MSR techniques. More recently, the researchers behind D-CCFinder proposed to run CCFinder on a grid-based system [19].

Distributed Platforms

There are several distributed platforms that implement MapReduce. The prototypical one is from Google [7]. The Google platform makes use of Google's file system, called GFS [12]. Phoenix is another implementation of the MapReduce model [21]. Phoenix's main focus is on exploiting multi-core and multi-processor systems such as the PlayStation Cell architecture. GridGain [2] is an open source implementation of MapReduce, but its main disadvantage is that it can only process data which can be stored in a JVM heap space. For the large size of data that we usually process in MSR, this is not a good choice. We chose Hadoop due to its simple design and its wide user base.

7 Conclusions and Future Work

A scalable software mining solution should be adaptable, efficient, scalable and flexible. In this paper, we propose to use MapReduce as a general framework to support research in MSR. To validate our approach, we presented our experience of porting J-REX, an evolutionary code extractor for Java, to Hadoop, an open source implementation of MapReduce. Our experiments demonstrate that our new solution (DJ-REX) satisfies the four requirements of scalable software mining solutions. Our experiments show that running our optimized solution (DJ-REX3) on a small local area network with 4 nodes requires 75% less time than running it on a desktop machine and 66% less time than on a server machine.

One of our future goals is to dynamically control the computation resources in our lab. If someone wants to pull a machine out of the platform, we do not want to reconfigure the whole platform. If a machine becomes idle, one should be able to plug it back into the platform. In addition, we also plan to experiment with other technologies like HBase [4] to improve the current Hadoop deployment.

References

[1] Eclipse JDT.
[2] GridGain.
[3] Hadoop.
[4] HBase.
[5] O. Alonso, P. T. Devanbu, and M. Gertz. Database techniques for the analysis and exploration of software repositories. In Proc. of the 1st Workshop on Mining Software Repositories (MSR).
[6] J. Bevan, E. J. Whitehead, Jr., S. Kim, and M. Godfrey. Facilitating software evolution research with Kenyon. In Proc. of the 10th European Software Engineering Conference (ESEC/FSE).
[7] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1).
[8] D. Draheim and L. Pekacki. Process-centric analytical processing of version control data. In Proc. of the 6th Int. Workshop on Principles of Software Evolution (IWPSE).
[9] B. Fluri, H. C. Gall, and M. Pinzger. Fine-grained analysis of change couplings. In Proc. of the 4th Int. Workshop on Source Code Analysis and Manipulation (SCAM).
[10] H. Gall, K. Hajek, and M. Jazayeri. Detection of logical coupling based on product release history. In Proc. of the 14th Int. Conference on Software Maintenance (ICSM).
[11] H. Gall, M. Jazayeri, and J. Krajewski. CVS release history data for detecting logical couplings. In Proc. of the 6th Int. Workshop on Principles of Software Evolution (IWPSE).
[12] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In Proc. of the 19th ACM Symposium on Operating Systems Principles (SOSP).
[13] M. W. Godfrey and Q. Tu. Evolution in open source software: A case study. In Proc. of the 16th Int. Conference on Software Maintenance (ICSM).
[14] A. E. Hassan. Mining software repositories to assist developers and support managers. PhD thesis.
[15] A. E. Hassan. The road ahead for mining software repositories. In Frontiers of Software Maintenance (FoSM), pages 48-57, October.
[16] A. E. Hassan and R. C. Holt. Using development history sticky notes to understand software architecture. In Proc. of the 12th IEEE Int. Workshop on Program Comprehension (IWPC).
[17] T. Kamiya, S. Kusumoto, and K. Inoue. CCFinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Trans. Softw. Eng., 28(7).
[18] S. Livieri, Y. Higo, M. Matushita, and K. Inoue. Very-large scale code clone analysis and visualization of open source programs using distributed CCFinder: D-CCFinder. In Proc. of the 29th Int. Conference on Software Engineering (ICSE).
[19] Y. Manabe, Y. Higo, and K. Inoue. Toward efficient code clone detection on a grid environment. In Proc. of the 1st Workshop on Accountability and Traceability in Global Software Engineering (ATGSE), December.
[20] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia. Above the clouds: A Berkeley view of cloud computing. Technical report, University of California, Berkeley.
[21] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce for multi-core and multiprocessor systems. In Proc. of the 13th Int. Symposium on High Performance Computer Architecture (HPCA).
[22] G. Robles, J. M. Gonzalez-Barahona, M. Michlmayr, and J. J. Amor. Mining large software compilations over time: another perspective of software evolution. In Proc. of the 3rd Int. Workshop on Mining Software Repositories (MSR).
[23] Q. Tu and M. W. Godfrey. An integrated approach for studying architectural evolution. In Proc. of the 10th Int. Workshop on Program Comprehension (IWPC).
[24] T. Zimmermann, S. Diehl, and A. Zeller. How history justifies system architecture (or not). In Proc. of the 6th Int. Workshop on Principles of Software Evolution (IWPSE).


More information

The Google File System

The Google File System October 13, 2010 Based on: S. Ghemawat, H. Gobioff, and S.-T. Leung: The Google file system, in Proceedings ACM SOSP 2003, Lake George, NY, USA, October 2003. 1 Assumptions Interface Architecture Single

More information

Big Data Analysis Using Hadoop and MapReduce

Big Data Analysis Using Hadoop and MapReduce Big Data Analysis Using Hadoop and MapReduce Harrison Carranza, MSIS Marist College, Harrison.Carranza2@marist.edu Mentor: Aparicio Carranza, PhD New York City College of Technology - CUNY, USA, acarranza@citytech.cuny.edu

More information

EsgynDB Enterprise 2.0 Platform Reference Architecture

EsgynDB Enterprise 2.0 Platform Reference Architecture EsgynDB Enterprise 2.0 Platform Reference Architecture This document outlines a Platform Reference Architecture for EsgynDB Enterprise, built on Apache Trafodion (Incubating) implementation with licensed

More information

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context 1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes

More information

Index Construction. Dictionary, postings, scalable indexing, dynamic indexing. Web Search

Index Construction. Dictionary, postings, scalable indexing, dynamic indexing. Web Search Index Construction Dictionary, postings, scalable indexing, dynamic indexing Web Search 1 Overview Indexes Query Indexing Ranking Results Application Documents User Information analysis Query processing

More information

Introducing Oracle R Enterprise 1.4 -

Introducing Oracle R Enterprise 1.4 - Hello, and welcome to this online, self-paced lesson entitled Introducing Oracle R Enterprise. This session is part of an eight-lesson tutorial series on Oracle R Enterprise. My name is Brian Pottle. I

More information

A BigData Tour HDFS, Ceph and MapReduce

A BigData Tour HDFS, Ceph and MapReduce A BigData Tour HDFS, Ceph and MapReduce These slides are possible thanks to these sources Jonathan Drusi - SCInet Toronto Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing SICS; Yahoo!

More information

Key aspects of cloud computing. Towards fuller utilization. Two main sources of resource demand. Cluster Scheduling

Key aspects of cloud computing. Towards fuller utilization. Two main sources of resource demand. Cluster Scheduling Key aspects of cloud computing Cluster Scheduling 1. Illusion of infinite computing resources available on demand, eliminating need for up-front provisioning. The elimination of an up-front commitment

More information

Parallel data processing with MapReduce

Parallel data processing with MapReduce Parallel data processing with MapReduce Tomi Aarnio Helsinki University of Technology tomi.aarnio@hut.fi Abstract MapReduce is a parallel programming model and an associated implementation introduced by

More information

IN-MEMORY DATA FABRIC: Real-Time Streaming

IN-MEMORY DATA FABRIC: Real-Time Streaming WHITE PAPER IN-MEMORY DATA FABRIC: Real-Time Streaming COPYRIGHT AND TRADEMARK INFORMATION 2014 GridGain Systems. All rights reserved. This document is provided as is. Information and views expressed in

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected

More information

The Use of Cloud Computing Resources in an HPC Environment

The Use of Cloud Computing Resources in an HPC Environment The Use of Cloud Computing Resources in an HPC Environment Bill, Labate, UCLA Office of Information Technology Prakashan Korambath, UCLA Institute for Digital Research & Education Cloud computing becomes

More information

David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms.

David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms. Whitepaper Introduction A Library Based Approach to Threading for Performance David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms.

More information

Bigtable. A Distributed Storage System for Structured Data. Presenter: Yunming Zhang Conglong Li. Saturday, September 21, 13

Bigtable. A Distributed Storage System for Structured Data. Presenter: Yunming Zhang Conglong Li. Saturday, September 21, 13 Bigtable A Distributed Storage System for Structured Data Presenter: Yunming Zhang Conglong Li References SOCC 2010 Key Note Slides Jeff Dean Google Introduction to Distributed Computing, Winter 2008 University

More information

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES 1 THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES Vincent Garonne, Mario Lassnig, Martin Barisits, Thomas Beermann, Ralph Vigne, Cedric Serfon Vincent.Garonne@cern.ch ph-adp-ddm-lab@cern.ch XLDB

More information

CISC 7610 Lecture 2b The beginnings of NoSQL

CISC 7610 Lecture 2b The beginnings of NoSQL CISC 7610 Lecture 2b The beginnings of NoSQL Topics: Big Data Google s infrastructure Hadoop: open google infrastructure Scaling through sharding CAP theorem Amazon s Dynamo 5 V s of big data Everyone

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

DELL EMC DATA DOMAIN SISL SCALING ARCHITECTURE

DELL EMC DATA DOMAIN SISL SCALING ARCHITECTURE WHITEPAPER DELL EMC DATA DOMAIN SISL SCALING ARCHITECTURE A Detailed Review ABSTRACT While tape has been the dominant storage medium for data protection for decades because of its low cost, it is steadily

More information

Data Informatics. Seon Ho Kim, Ph.D.

Data Informatics. Seon Ho Kim, Ph.D. Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu HBase HBase is.. A distributed data store that can scale horizontally to 1,000s of commodity servers and petabytes of indexed storage. Designed to operate

More information

HDF Virtualization Review

HDF Virtualization Review Scott Wegner Beginning in July 2008, The HDF Group embarked on a new project to transition Windows support to a virtualized environment using VMWare Workstation. We utilized virtual machines in order to

More information

Huge Data Analysis and Processing Platform based on Hadoop Yuanbin LI1, a, Rong CHEN2

Huge Data Analysis and Processing Platform based on Hadoop Yuanbin LI1, a, Rong CHEN2 2nd International Conference on Materials Science, Machinery and Energy Engineering (MSMEE 2017) Huge Data Analysis and Processing Platform based on Hadoop Yuanbin LI1, a, Rong CHEN2 1 Information Engineering

More information

WHITE PAPER AGILOFT SCALABILITY AND REDUNDANCY

WHITE PAPER AGILOFT SCALABILITY AND REDUNDANCY WHITE PAPER AGILOFT SCALABILITY AND REDUNDANCY Table of Contents Introduction 3 Performance on Hosted Server 3 Figure 1: Real World Performance 3 Benchmarks 3 System configuration used for benchmarks 3

More information

Lab Validation Report

Lab Validation Report Lab Validation Report NetApp SnapManager for Oracle Simple, Automated, Oracle Protection By Ginny Roth and Tony Palmer September 2010 Lab Validation: NetApp SnapManager for Oracle 2 Contents Introduction...

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 2. MapReduce Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Framework A programming model

More information

Parallel Computing: MapReduce Jin, Hai

Parallel Computing: MapReduce Jin, Hai Parallel Computing: MapReduce Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology ! MapReduce is a distributed/parallel computing framework introduced by Google

More information

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09. Presented by: Daniel Isaacs

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09. Presented by: Daniel Isaacs Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09 Presented by: Daniel Isaacs It all starts with cluster computing. MapReduce Why

More information

Comparative Analysis of K means Clustering Sequentially And Parallely

Comparative Analysis of K means Clustering Sequentially And Parallely Comparative Analysis of K means Clustering Sequentially And Parallely Kavya D S 1, Chaitra D Desai 2 1 M.tech, Computer Science and Engineering, REVA ITM, Bangalore, India 2 REVA ITM, Bangalore, India

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

High Performance Computing on MapReduce Programming Framework

High Performance Computing on MapReduce Programming Framework International Journal of Private Cloud Computing Environment and Management Vol. 2, No. 1, (2015), pp. 27-32 http://dx.doi.org/10.21742/ijpccem.2015.2.1.04 High Performance Computing on MapReduce Programming

More information

WHITEPAPER. Improve Hadoop Performance with Memblaze PBlaze SSD

WHITEPAPER. Improve Hadoop Performance with Memblaze PBlaze SSD Improve Hadoop Performance with Memblaze PBlaze SSD Improve Hadoop Performance with Memblaze PBlaze SSD Exclusive Summary We live in the data age. It s not easy to measure the total volume of data stored

More information

FLAT DATACENTER STORAGE. Paper-3 Presenter-Pratik Bhatt fx6568

FLAT DATACENTER STORAGE. Paper-3 Presenter-Pratik Bhatt fx6568 FLAT DATACENTER STORAGE Paper-3 Presenter-Pratik Bhatt fx6568 FDS Main discussion points A cluster storage system Stores giant "blobs" - 128-bit ID, multi-megabyte content Clients and servers connected

More information

MapReduce. Cloud Computing COMP / ECPE 293A

MapReduce. Cloud Computing COMP / ECPE 293A Cloud Computing COMP / ECPE 293A MapReduce Jeffrey Dean and Sanjay Ghemawat, MapReduce: simplified data processing on large clusters, In Proceedings of the 6th conference on Symposium on Opera7ng Systems

More information

Lecture 1: January 23

Lecture 1: January 23 CMPSCI 677 Distributed and Operating Systems Spring 2019 Lecture 1: January 23 Lecturer: Prashant Shenoy Scribe: Jonathan Westin (2019), Bin Wang (2018) 1.1 Introduction to the course The lecture started

More information

CLIENT DATA NODE NAME NODE

CLIENT DATA NODE NAME NODE Volume 6, Issue 12, December 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Efficiency

More information

Lecture 1: January 22

Lecture 1: January 22 CMPSCI 677 Distributed and Operating Systems Spring 2018 Lecture 1: January 22 Lecturer: Prashant Shenoy Scribe: Bin Wang 1.1 Introduction to the course The lecture started by outlining the administrative

More information

Churrasco: Supporting Collaborative Software Evolution Analysis

Churrasco: Supporting Collaborative Software Evolution Analysis Churrasco: Supporting Collaborative Software Evolution Analysis Marco D Ambros a, Michele Lanza a a REVEAL @ Faculty of Informatics - University of Lugano, Switzerland Abstract Analyzing the evolution

More information

The amount of data increases every day Some numbers ( 2012):

The amount of data increases every day Some numbers ( 2012): 1 The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect

More information

10 Steps to Virtualization

10 Steps to Virtualization AN INTEL COMPANY 10 Steps to Virtualization WHEN IT MATTERS, IT RUNS ON WIND RIVER EXECUTIVE SUMMARY Virtualization the creation of multiple virtual machines (VMs) on a single piece of hardware, where

More information

2/26/2017. The amount of data increases every day Some numbers ( 2012):

2/26/2017. The amount of data increases every day Some numbers ( 2012): The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect to

More information

Clustering Lecture 8: MapReduce

Clustering Lecture 8: MapReduce Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data

More information

Dept. Of Computer Science, Colorado State University

Dept. Of Computer Science, Colorado State University CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [HADOOP/HDFS] Trying to have your cake and eat it too Each phase pines for tasks with locality and their numbers on a tether Alas within a phase, you get one,

More information

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department

More information

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics IBM Data Science Experience White paper R Transforming R into a tool for big data analytics 2 R Executive summary This white paper introduces R, a package for the R statistical programming language that

More information

Cloud Computing & Visualization

Cloud Computing & Visualization Cloud Computing & Visualization Workflows Distributed Computation with Spark Data Warehousing with Redshift Visualization with Tableau #FIUSCIS School of Computing & Information Sciences, Florida International

More information

Google File System. By Dinesh Amatya

Google File System. By Dinesh Amatya Google File System By Dinesh Amatya Google File System (GFS) Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung designed and implemented to meet rapidly growing demand of Google's data processing need a scalable

More information

Using Pig as a Data Preparation Language for Large-Scale Mining Software Repositories Studies: An Experience Report

Using Pig as a Data Preparation Language for Large-Scale Mining Software Repositories Studies: An Experience Report Using Pig as a Data Preparation Language for Large-Scale Mining Software Repositories Studies: An Experience Report Weiyi Shang, Bram Adams, Ahmed E. Hassan Software Analysis and Intelligence Lab (SAIL),

More information

Embedded Technosolutions

Embedded Technosolutions Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication

More information

Information Retrieval

Information Retrieval Information Retrieval Suan Lee - Information Retrieval - 04 Index Construction 1 04 Index Construction - Information Retrieval - 04 Index Construction 2 Plan Last lecture: Dictionary data structures Tolerant

More information

Virtual Machines. 2 Disco: Running Commodity Operating Systems on Scalable Multiprocessors([1])

Virtual Machines. 2 Disco: Running Commodity Operating Systems on Scalable Multiprocessors([1]) EE392C: Advanced Topics in Computer Architecture Lecture #10 Polymorphic Processors Stanford University Thursday, 8 May 2003 Virtual Machines Lecture #10: Thursday, 1 May 2003 Lecturer: Jayanth Gummaraju,

More information

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per

More information

Operating Systems: Internals and Design Principles. Chapter 2 Operating System Overview Seventh Edition By William Stallings

Operating Systems: Internals and Design Principles. Chapter 2 Operating System Overview Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Chapter 2 Operating System Overview Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Operating systems are those

More information

How are Developers Treating License Inconsistency Issues? A Case Study on License Inconsistency Evolution in FOSS Projects

How are Developers Treating License Inconsistency Issues? A Case Study on License Inconsistency Evolution in FOSS Projects How are Developers Treating License Inconsistency Issues? A Case Study on License Inconsistency Evolution in FOSS Projects Yuhao Wu 1(B), Yuki Manabe 2, Daniel M. German 3, and Katsuro Inoue 1 1 Graduate

More information

High Throughput WAN Data Transfer with Hadoop-based Storage

High Throughput WAN Data Transfer with Hadoop-based Storage High Throughput WAN Data Transfer with Hadoop-based Storage A Amin 2, B Bockelman 4, J Letts 1, T Levshina 3, T Martin 1, H Pi 1, I Sfiligoi 1, M Thomas 2, F Wuerthwein 1 1 University of California, San

More information

Oracle Developer Studio 12.6

Oracle Developer Studio 12.6 Oracle Developer Studio 12.6 Oracle Developer Studio is the #1 development environment for building C, C++, Fortran and Java applications for Oracle Solaris and Linux operating systems running on premises

More information