MapReduce as a General Framework to Support Research in Mining Software Repositories (MSR)


Weiyi Shang, Zhen Ming Jiang, Bram Adams, Ahmed E. Hassan
Software Analysis and Intelligence Lab (SAIL), Queen's University, Kingston, Canada
{swy, zmjiang, bram, ahmed}@cs.queensu.ca

Abstract

Researchers continue to demonstrate the benefits of Mining Software Repositories (MSR) for supporting software development and research activities. However, as the mining process is time and resource intensive, they often create their own distributed platforms and use various optimizations to speed up and scale up their analysis. These platforms are project-specific, hard to reuse, and offer minimal debugging and deployment support. In this paper, we propose the use of MapReduce, a distributed computing platform, to support research in MSR. As a proof-of-concept, we migrate J-REX, an optimized evolutionary code extractor, to run on Hadoop, an open source implementation of MapReduce. Through a case study on the source control repositories of the Eclipse, BIRT and Datatools projects, we demonstrate that the migration effort to MapReduce is minimal and that the benefits are significant, as the running time of the migrated J-REX is only 30% to 50% of the original J-REX's. This paper documents our experience with the migration, and highlights the benefits and challenges of the MapReduce framework for the MSR community.

1 Introduction

The Mining Software Repositories (MSR) field analyzes and cross-links the rich data available in software repositories to uncover interesting and actionable information about software systems [15]. Examples of software repositories include source control repositories, bug repositories, archived communications, deployment logs, and code repositories. Research in the MSR field has received an increasing amount of interest. Most MSR techniques have been demonstrated on large-scale software systems. However, the size of the data available for mining continues to grow at a very high rate. For example, the size of the Linux kernel source code exhibits super-linear growth [13]. MSR researchers continue to explore deeper and more sophisticated analysis across a large number of long-lived systems. Robles et al. reported that Debian, a well-known Linux distribution, doubles in size approximately every two years [22]. Combined with the huge system size, this may pose problems for analyzing the evolution of the Debian system in the future.

The large and continuously growing software repositories and the need for deeper analysis impose challenges on the scalability of MSR techniques. Powerful computers and sophisticated software mining algorithms are needed to successfully analyze and cross-link the data in a timely fashion. Prior research focuses especially on building home-grown solutions for this. The authors of D-CCFinder [18], a distributed version of the popular CCFinder [17] clone detection tool, improved their processing time from 40 days on a single PC-based workstation (Intel Xeon 2.8 GHz, 2 GB RAM) to 2 days on a distributed system consisting of 80 PCs. Their home-grown solution is reported to contain about 20 kLOC of Java code, which must be maintained and enhanced by these MSR researchers and does not directly translate to other analyzers.

Tackling the problem of processing large software repositories in a timely fashion is of paramount importance to the future of the MSR field in general, as we aim to improve the adoption rate of MSR techniques by practitioners. We envision a future where sophisticated MSR techniques are integrated into IDEs that run on commodity workstations and that provide fast and accurate results to developers and managers. In short, one cannot require every MSR researcher to have large and expensive servers. Furthermore, home-grown solutions to optimize mining performance require huge development and maintenance efforts. Last but not least, the task of performance tuning turns our attention away from the real problem, which is to uncover interesting repository information.

In many cases, MSR researchers do not have the expertise required nor the interest to improve the performance of their data mining algorithms. Techniques are needed that hide the complexity of scaling yet provide researchers with the benefits of scale. Research shows that scaling out (distributed systems) is always better than scaling up (bigger and more powerful machines) [20]. Off-the-shelf distributed frameworks are promising technologies that can help our field. In this paper, we explore one of these technologies, called MapReduce [7]. As a proof-of-concept, we migrate J-REX, an optimized evolutionary code extractor, to run on the Hadoop platform. Hadoop is a popular open-source implementation of MapReduce which is increasingly gaining popularity and has proved to be scalable and of production quality. Companies like Yahoo have Hadoop installations with over 5,000 machines, and Hadoop is also used by the Amazon computing clouds. With many companies involved in its development and maintenance, Hadoop is rapidly maturing. Through a case study on the source control repositories of the Eclipse, BIRT and Datatools projects, we show that the migration effort to MapReduce is minimal and that the benefits are significant. The migrated J-REX solutions are 4 times faster than the original J-REX. This paper documents our experience with the migration and highlights the benefits and challenges of adopting off-the-shelf distributed frameworks in the MSR community.

The paper is organized as follows. Section 2 imposes the requirements for a general distributed framework to support MSR research. MapReduce is explained in Section 3, as well as one of its open source implementations: Hadoop. Section 4 discusses our case study in which we migrate J-REX, an evolutionary code extractor running on a single machine, to Hadoop. The repercussions of this case study and the limitations of our approach are discussed in Section 5. Section 6 presents related work. Finally, Section 7 concludes the paper and presents future work.

2 Requirements for a General Framework to Support MSR Research

We seek four common requirements for large distributed platforms to support MSR research. We detail them as follows:

1. Adaptability: The platform should take MSR researchers minimal effort to migrate from their prototype solutions, which are developed on a single machine.
2. Efficiency: The adoption of the platform should drastically speed up the mining process.
3. Scalability: The platform should scale with the size of the input data as well as with the available computing power.
4. Flexibility: The platform should be able to run on various types of machines, from expensive servers to commodity PCs or even virtual machines.

This paper presents and evaluates MapReduce as a possible distributed platform which satisfies these four requirements.

3 MapReduce

MapReduce [7] is a distributed framework for processing vast data sets. It was originally proposed and used by Google engineers to process the large amount of data they must analyze on a daily basis. The input data for MapReduce consists of a list of key/value pairs. Mappers accept the incoming pairs, and map them into intermediate key/value pairs. Each group of intermediate data with the same key is then passed to a specific set of reducers, each of which performs computations on the data and reduces it to one single key/values pair. The sorted output of the reducers is the final result of the MapReduce process. In this paper, we simplify the discussion of MapReduce by assuming that reducers accept values instead of key/value pairs. To illustrate MapReduce, we consider an example MapReduce process which counts the frequency of word lengths in a book. The example process is shown in Figure 1.

The mappers accept every single word from the book, and make keys for them. Because we want to count the frequency of all words with different lengths, a typical approach would be to use the length of the word as the key. So, for the word "hello", a mapper will generate a key/value pair of 5/hello. Afterwards, the key/value pairs with the same key are grouped and sent to the reducers. A reducer, which receives a list of values with the same key, can simply count the size of this list and keep the key in its output. If a reducer receives a list with key 5, for example, it will count the size of the list of all words that have length 5. If the size is n, it outputs an output pair 5/n, which means there are n words with length 5 in the book.

The power and challenge of the MapReduce model resides in its ability to support different mapping and reducing strategies. For example, an alternative implementation could map each input value (i.e., word) based on its first letter and its length. Then, reducers would process those words starting with one or a small number of different letters (keys), and perform the counting. This MapReduce strategy permits an increasing number of reducers that can work in parallel on the problem. However, the final output needs additional post-processing in comparison to the first strategy. In short, both strategies can solve the problem, but each strategy has different performance and implementation benefits and challenges.
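To make the first strategy concrete, the following is a minimal sketch of the word-length counter written against the current Hadoop Java API (org.apache.hadoop.mapreduce, Hadoop 2.x). The paper itself used the older Hadoop 0.19 interfaces, so the class names and job setup here are illustrative rather than taken from the original code; the mapper emits (word length, word) and the reducer counts the size of each group, exactly as described above.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordLengthCount {

      // Mapper: for every word in the book, emit (word length, word), e.g. (5, "hello").
      public static class LengthMapper extends Mapper<Object, Text, IntWritable, Text> {
        private final IntWritable length = new IntWritable();

        @Override
        protected void map(Object offset, Text line, Context context)
            throws IOException, InterruptedException {
          for (String word : line.toString().split("\\s+")) {
            if (word.isEmpty()) continue;
            length.set(word.length());
            context.write(length, new Text(word));
          }
        }
      }

      // Reducer: all words with the same length arrive as one list; output (length, list size).
      public static class CountReducer extends Reducer<IntWritable, Text, IntWritable, IntWritable> {
        @Override
        protected void reduce(IntWritable length, Iterable<Text> words, Context context)
            throws IOException, InterruptedException {
          int count = 0;
          for (Text ignored : words) count++;
          context.write(length, new IntWritable(count)); // (5, n): n words of length 5
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word length frequency");
        job.setJarByClass(WordLengthCount.class);
        job.setMapperClass(LengthMapper.class);
        job.setReducerClass(CountReducer.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // text of the book
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // word-length histogram
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The alternative strategy sketched above (keying by first letter and length) would only change the mapper's key construction; the framework takes care of grouping and of routing each key to a reducer.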

Figure 1. Example MapReduce process for counting the frequency of word lengths in a book.

Hadoop is an open-source implementation of MapReduce [3] which is supported by Yahoo and is used by Amazon, AOL, Baidu and a number of other companies for their distributed solutions. Hadoop can run on various operating systems such as Linux, Windows, FreeBSD, Mac OS X and OpenSolaris. It not only implements the MapReduce model, but also provides a distributed file system, called the Hadoop Distributed File System (HDFS). Hadoop supplies Java interfaces to simplify the MapReduce model and to control HDFS programmatically. Another advantage for users is that Hadoop by default comes with some basic and widely used mapping and reducing methods, for example to split files into lines, or to split a directory into files. With these methods, users occasionally do not have to write new code to use MapReduce.

We used Hadoop as our MapReduce implementation for the following four reasons:

1. Hadoop is easy to use. Researchers do not want to spend considerable time on modifying their mining program to make it distributed. The simple MapReduce Java interface simplifies the process of implementing mappers and reducers.
2. Hadoop runs on different operating systems. Academic research labs tend to have a heterogeneous network of machines with different hardware configurations and varying operating systems. Hadoop can run on most current operating systems and hence exploit as much of the available computing power as possible.
3. Hadoop runs on commodity machines. The largest computation resources in research labs and software development companies are desktop computers and laptops. This characteristic of Hadoop permits these computers to join and leave the computing cluster in a dynamic and transparent fashion without user intervention.
4. Hadoop is a mature and open source system. Hadoop has been successfully used in many commercial projects. It is actively developed, with new features and enhancements continuously being added. Since Hadoop is free to download and redistribute, it can be installed on multiple machines without worrying about per-seat or licensing costs.

Based on these points, we consider Hadoop as the most suitable MapReduce implementation for our research. The next section evaluates the ability of Hadoop, and hence MapReduce, to satisfy the four requirements of Section 2.

4 Case study

To validate the promise of MapReduce for MSR research, we discuss our experience migrating an evolutionary code extractor called J-REX to Hadoop. J-REX is a highly optimized evolutionary code extractor for Java systems, similar to C-REX [14]. As shown in Figure 2, the whole process of J-REX spans three phases. The first phase is the extraction phase, where J-REX extracts source code snapshots for each file from a CVS repository. In the second phase, i.e., the parsing phase, J-REX calls for each file snapshot the Eclipse JDT parser to parse the Java code into its abstract syntax tree [1], which is stored as an XML document. In the third phase, i.e., the analysis phase, J-REX compares the XML documents of consecutive file revisions to determine the changed code units, and generates evolutionary change data in an XML format [16].

Table 1. Disk performance of the desktop and server computers.
                Cached read speed    Cached write speed
  Server        8,531 MB/sec         211 MB/sec
  Desktop       3,302 MB/sec         107 MB/sec
                Random read speed    Random write speed
  Server        2,986 MB/sec         1,075 MB/sec
  Desktop       1,488 MB/sec         658 MB/sec

Table 2. Characteristics of Eclipse, BIRT and Datatools.
  Repository    Size     #Source Code Files   Length of History   #Revisions
  Datatools     394 MB   10,…                 … years             2,398
  BIRT          810 MB   13,…                 … years             19,583
  Eclipse       4.2 GB   56,…                 … years             82,682

Figure 2. Architecture of J-REX.

The evolutionary change data reports the evolution of a software system at the level of code entities such as methods and classes (for example, "class A was changed to add a new method B"). The architecture of J-REX is comparable to the architecture of other MSR tools. The J-REX runtime process requires a huge amount of I/O operations, which are performance bottlenecks, and a large amount of computing power when comparing XML trees. The I/O and computational characteristics of J-REX make it an ideal case study to study the performance benefits of the MapReduce computation model.

Through this case study, we seek to verify whether the Hadoop solution satisfies the four requirements listed in Section 2:

1. Adaptability: We explain the process to migrate the basic J-REX, a non-distributed MSR tool, to three different distributed Hadoop solutions (DJ-REX1, DJ-REX2 and DJ-REX3).
2. Efficiency: For all three Hadoop solutions, we compare the performance of the mining process among desktop and server machines.
3. Scalability: We examine the scalability of the Hadoop solutions on three data repositories with varying sizes. We also examine the scalability of the Hadoop solutions running on a varying number of machines.
4. Flexibility: Finally, we study the flexibility of the Hadoop platform by deploying Hadoop on virtual machines in a multicore environment.

In the rest of this section, we first explain our experimental environment and the details of our experiments. Then, we discuss whether or not using Hadoop for software mining satisfies the 4 requirements of Section 2.

4.1 Experimental environment

Our Hadoop installation is deployed on 4 computers in a local gigabit network. The 4 computers consist of 2 desktop computers, each having an Intel Quad Core 2.40 GHz CPU with 2 GB of RAM, and of 2 server computers, one having an Intel Core i7 CPU with 8 logical cores (Hyper-Threading) and 6 GB of RAM, and the other one having an Intel Quad Core 2.40 GHz CPU with 8 GB of RAM and a RAID5 disk. The 8-core server machine has Solid State Disks (SSD) instead of regular RAID disks. The difference in disk performance between the regular disk machines and the SSD disk server computer, as measured by hdparm and iozone (64 kB block size), is shown in Table 1. The server's I/O speed with the SSD drive is twice as fast as the machines with a regular disk for random I/O, and turns out to be two and a half times as fast for cached operations.

The source control repositories used in our experiments consist of the whole Eclipse repository and 2 sub-projects from Eclipse called BIRT and Datatools. Eclipse has a large repository with a long history, BIRT has a medium repository with a medium length history, and Datatools has a small repository with a short history. Using these 3 repositories with different size and length of history, we can better evaluate the performance of our approach across subject systems. The repository information of the 3 projects is shown in Table 2.

Table 3. Experimental results for DJ-REX in Hadoop (running times in h:mm:ss).
  Repository   Desktop    Server     Strategy    2 nodes   3 nodes   4 nodes
  Datatools    0:35:50    0:34:14    DJ-REX3     0:19:52   0:14:32   0:16:40
  BIRT         2:44:09    2:05:55    DJ-REX1     2:03:51   2:05:02   2:16:03
                                     DJ-REX2     1:40:22   1:40:32   1:47:26
                                     DJ-REX3     1:08:36   0:50:33   0:45:16
                                     DJ-REX3*              3:02:47
  Eclipse                 12:35:34   DJ-REX3                         3:49:…

4.2 Experiments

We conduct the following experiments:

1. Run J-REX without Hadoop on the BIRT, Datatools and Eclipse repositories.
2. Run DJ-REX1, DJ-REX2 and DJ-REX3 on BIRT with 2, 3 and 4 machines.
3. Run DJ-REX3 on Datatools with 2, 3 and 4 machines.
4. Run DJ-REX3 on Eclipse with 4 machines.
5. Run DJ-REX3 on BIRT with 3 virtual machines.

Only DJ-REX3 is applied in the last three experiments, because the experimental results for the smallest system, i.e. BIRT, already showed a significant speed improvement compared to the other two distributed strategies and the original, undistributed J-REX. The results of all five experiments are summarized in Table 3 and are discussed in the next section. The row with DJ-REX3* corresponds to the experiment that has DJ-REX3 running on 3 virtual machines.

4.3 Case study discussion

This section uses the experiment data and results of Table 3 to discuss whether or not the various DJ-REX solutions meet the 4 requirements outlined in Section 2.

Adaptability

Table 4 shows the implementation and deployment effort required for DJ-REX. We first discuss the effort devoted to porting J-REX to Hadoop. Then we present our experience with configuring Hadoop to add in more computing power. The implementation effort of the three DJ-REX solutions decreases as we got more acquainted with the technology.

Table 4. Effort to program and deploy DJ-REX.
  J-REX Logic                        No Change
  MapReduce strategy for DJ-REX1     400 LOC, 2 hours
  MapReduce strategy for DJ-REX2     400 LOC, 2 hours
  MapReduce strategy for DJ-REX3     300 LOC, 1 hour
  Deployment Configuration           1 hour
  Reconfiguration                    1 minute

Table 5. Overview of the distributed steps in DJ-REX1 to DJ-REX3.
              Extraction   Parsing   Analysis
  DJ-REX1     No           No        Yes
  DJ-REX2     No           Yes       Yes
  DJ-REX3     Yes          Yes       Yes

Easy to experiment with various distributed solutions

As is often the case, MSR researchers do not have the expertise required for, nor do they have the interest in, improving the performance of their mining algorithms. The need to rewrite an MSR tool from scratch to make it run on Hadoop is not an acceptable option. If the programming time for the Hadoop migration is long (maybe as long as re-implementing it), then the chances of adopting Hadoop become very low. In addition, if one has to modify a tool in such an invasive way, considerably more time will have to be spent to test it again once it runs distributed.

We found that applications are very easy to port to Hadoop. First of all, Hadoop provides a number of default mechanisms to split input data across mappers. For example, the MultiFileSplit class splits the files in a directory, whereas the DBInputSplit class splits the rows in a database table. Often, one can reuse these existing mapping strategies. Second, Hadoop has well-defined and simple APIs to implement a MapReduce process. One just needs to implement the corresponding interfaces to make a custom MapReduce process. Third, several code examples are available to show users how to write MapReduce code with Hadoop [3]. One of the code examples searches for a string in files, which is just like the grep command in Linux; another example counts word frequencies in files. After looking at the available code examples, we found that we could reuse the code for splitting the input data by files, because the input data of the examples and of J-REX are all stored in multiple files. Then, we spent a few hours to write around 400 lines of Java code for each of the three DJ-REX MapReduce strategies. The programming logic of J-REX itself barely changed.
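As a rough illustration of how little glue code such a port involves, the skeleton below uses the older org.apache.hadoop.mapred interfaces that Hadoop 0.19 shipped with. The J-REX-specific names (SnapshotMapper, RevisionReducer, the JRex.analyze stub) are hypothetical stand-ins for the unchanged single-machine logic, not the authors' actual code; input records are assumed to be tab-separated "fileName<TAB>snapshotPath" lines.

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.KeyValueTextInputFormat;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class DJRexJob {

      // Hypothetical stand-in for the existing, unchanged J-REX analysis logic.
      static class JRex {
        static String analyze(String file, Iterator<Text> snapshots) {
          int n = 0;
          while (snapshots.hasNext()) { snapshots.next(); n++; }
          return n + " revisions analyzed";
        }
      }

      // Mapper: each input line names one revision snapshot; key it by source file
      // so that all revisions of the same file end up at the same reducer.
      public static class SnapshotMapper extends MapReduceBase
          implements Mapper<Text, Text, Text, Text> {
        public void map(Text fileName, Text snapshotPath,
                        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
          out.collect(fileName, snapshotPath);            // e.g. (a.java, a_0.java)
        }
      }

      // Reducer: receives every snapshot of one file and calls the unchanged analysis logic.
      public static class RevisionReducer extends MapReduceBase
          implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text fileName, Iterator<Text> snapshots,
                           OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
          out.collect(fileName, new Text(JRex.analyze(fileName.toString(), snapshots)));
        }
      }

      public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(DJRexJob.class);
        conf.setJobName("dj-rex");
        conf.setInputFormat(KeyValueTextInputFormat.class);  // tab-separated key/value lines
        conf.setMapperClass(SnapshotMapper.class);
        conf.setReducerClass(RevisionReducer.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
      }
    }

The point of the sketch is the shape of the port: the domain logic stays as-is, and only the thin mapper/reducer wrappers and the job configuration are new.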

Figure 3. MapReduce strategy for DJ-REX.

Here we explain our MapReduce strategies for DJ-REX. Intuitively, we need to compare the difference between adjacent revisions of a Java source code file. We could define the key/value pair output of the map function as (D1, a_0.java and a_1.java), and the reducer function output as (revision number, evolutionary information). The key D1 represents the difference between the two versions, and a_0.java and a_1.java represent the names of the two files. Because of the way that we partition the data, each revision needs to be copied and transferred to more than one node, which generates extra overhead for the mining process and turned out to make the process much longer. The failure of this naive strategy shows the importance of designing a good MapReduce strategy.

Therefore, we tried another basic MapReduce strategy, as shown in Figure 3. This strategy performs much better than our naive strategy. The key/value pair output of the map function of this strategy is defined as (file name, revision snapshot), whereas the key/value pair output of the reduce function is (file name, evolutionary information for this file). For example, file a.java has 3 revisions. In the mapping phase, it generates 3 (key, value) pairs: (a.java, a_0.java), (a.java, a_1.java) and (a.java, a_2.java). Since they have the same key, these three pairs are sent to the same reducer. The final output for file a.java is the generated evolutionary information. Using files to partition the input data is an intuitive and often the most suitable option for many mining techniques which want to explore the use of Hadoop.

On top of this basic MapReduce strategy, we have implemented 3 flavors of DJ-REX (Table 5). Each flavor distributes a different combination of the 3 phases of the original J-REX implementation (Figure 2). The first flavor is called DJ-REX1. We use one computer to extract the source code offline and to parse it into AST form. After finishing the first two phases, the output files are put in HDFS and Hadoop is used to analyze the change information. In this case, only 1 phase of J-REX becomes distributed, and the files we copy into HDFS are XML files which represent the AST information of each version of all source code files in the repository. For DJ-REX2, one more phase, the parsing phase, becomes distributed. Therefore, only the extraction phase still runs as non-distributed. The extraction phase extracts all versions of the source code files from the repository, and these files are copied into HDFS and are used as input to the following Hadooped phases. Finally, DJ-REX3 is a fully-distributed implementation with all 3 phases running in a distributed fashion. The input for DJ-REX3 is the raw CVS data and Hadoop is used throughout all 3 phases.

Figure 4. Running time comparison of J-REX on a server machine and a desktop machine, compared to the fastest DJ-REX3 deployment for Datatools.

Easy to deploy and add more computing power

It took us only 1 hour to learn how to deploy Hadoop in the local network. To expand the experiment cluster (i.e., to add more machines), we only needed to add the machines' names in a configuration file and install Hadoop on those machines. Based on our experience, we feel that porting J-REX to Hadoop is easy and straightforward, and for sure easier and less error-prone than implementing our own distributed platform.
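For DJ-REX1 and DJ-REX2, the intermediate XML files have to be placed into HDFS before the distributed phases can start. A minimal sketch of doing this through Hadoop's Java FileSystem API is shown below; the local and HDFS paths are made-up illustrations, not paths from the paper.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CopyIntoHdfs {
      public static void main(String[] args) throws Exception {
        // Reads the cluster configuration (including the HDFS namenode address).
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);

        // Hypothetical paths: locally produced AST XML files and their HDFS destination.
        Path localAstDir = new Path("/tmp/jrex/ast-xml");
        Path hdfsInputDir = new Path("/user/jrex/input");

        // This copy is the "copy data time" overhead discussed around Figure 7 below.
        hdfs.copyFromLocalFile(localAstDir, hdfsInputDir);
        System.out.println("Copied " + localAstDir + " to " + hdfsInputDir);
      }
    }

The same effect can be obtained on the command line with the hadoop fs -put command; either way, the copy has to finish before the distributed phases can begin.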

Figure 5. Running time of J-REX on a server machine and a desktop machine compared to the fastest deployment of DJ-REX1, DJ-REX2 and DJ-REX3 for BIRT.

Figure 6. Running time of J-REX on a server machine compared to DJ-REX3 with 4 worker nodes for Eclipse.

Efficiency

We now use our experimental data to test how much time could be saved by using Hadoop for the mining process. Figure 4 (Datatools), Figure 5 (BIRT) and Figure 6 (Eclipse) present the results of Table 3 in a graphical way. From Figure 4 (Datatools) and Figure 5 (BIRT), we can draw the following two conclusions. On the one hand, faster and more powerful machinery can speed up the mining process. For example, running J-REX on a very fast server machine with SSD drives for the BIRT repository saves around 40 minutes compared with running it on the desktop machine. On the other hand, all DJ-REX solutions perform no worse or even better than the J-REX solutions regardless of the difference in hardware. As shown in Figure 5, the running time on the SSD server machine is almost the same as that of DJ-REX1, which only has the analysis phase distributed, since the analysis phase is the shortest of all three J-REX phases; therefore, the performance gain of DJ-REX1 is not significant. DJ-REX2 and DJ-REX3, however, outperform the server. The running time of DJ-REX3 on BIRT is almost one quarter of running it on a desktop machine and one third of the time of running it on a server machine. The running time of DJ-REX3 for Datatools has been reduced to around half of the time taken by the desktop and server solutions, and for Eclipse to around a quarter of the time of the server solution. It is clear that the more we distribute our process, the less time is needed.

Figure 7. Comparison of the running time of the 3 flavors of DJ-REX for BIRT.

Figure 7 shows detailed performance statistics of the three flavors of DJ-REX for the BIRT repository. The total running time can be broken down into three parts: the preprocess time (black) is the time needed for the non-distributed phases, the copy data time (light blue) is the time taken for copying the input data into the distributed file system, and the process data time (white) is the time needed by the distributed phases. In Figure 7, the running time of DJ-REX3 is always the shortest, whereas DJ-REX1 always takes the longest time. The reason for this is that the undistributed black parts dominate the process time for DJ-REX1 and DJ-REX2, whereas in DJ-REX3 everything is distributed. Hence, the fully distributed DJ-REX3 is the most efficient one.

In Figure 7, the process data time (white) decreases constantly. The MapReduce strategy of DJ-REX basically divides the job by files, which are processed independently from each other in different mappers. Hence, one could approximate the job's running time by dividing the total processing time by the number of Hadoop nodes, as sketched below. The more Hadoop nodes there are, the smaller the incremental benefit of extra nodes. In addition, a new node introduces more overhead, like network overhead or distributed file system data synchronization. Figure 7 clearly shows that the copy data time (light blue) increases when adding nodes, and hence that the performance with 4 nodes is not always the best one (e.g. for Datatools).

Our experiments show that using Hadoop to support MSR research is an efficient and viable approach that can drastically reduce the required processing time.
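The trade-off between extra nodes and extra copy overhead can be made explicit with a back-of-the-envelope cost model (our own reading of Figure 7, not a formula from the paper):

    T_{total}(n) \approx T_{pre} + T_{copy}(n) + T_{dist}/n

Here T_{pre} is the time of the non-distributed phases (black), T_{copy}(n) is the time to load the input into HDFS, which grows as more nodes are involved, and T_{dist} is the total work of the distributed phases, which is split across the n nodes. The marginal gain of an (n+1)-th node, T_{dist}/n - T_{dist}/(n+1) = T_{dist}/(n(n+1)), shrinks quickly while T_{copy}(n) keeps growing, so beyond some cluster size the copy overhead outweighs the parallelization gain, which matches the 4-node results for Datatools.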

Figure 8. Running time comparison for BIRT and Datatools with DJ-REX3.

Figure 9. Running time of the basic J-REX on a desktop and a server machine, and of DJ-REX3 on 3 virtual machines on the same server machine.

Scalability

Eclipse has a large repository, BIRT has a medium-sized repository and Datatools has a small repository. From Figure 4 (Datatools), Figure 5 (BIRT), Figure 6 (Eclipse) and Figure 7 (BIRT), it is clear that Hadoop reduces the running time for each of the three repositories. When mining the small Datatools repository, the running time is reduced to 50%. The bigger the repository, the more time can be saved by Hadoop. The running time can be reduced to 36% and 30% of the non-Hadoop version for the BIRT and Eclipse repositories, respectively.

Figure 8 shows that Hadoop scales well for different numbers of nodes (2 to 4) for BIRT and Datatools. We did not include the running time for Eclipse because of its large variance and the fact that we could not run Eclipse on the desktop machine (we could not fit the entire data into memory). However, from Figure 6 we know that the running time for Eclipse on the server machine is more than 12 hours and that it only takes a quarter of this time (around 3.5 hours) using DJ-REX3.

Unfortunately, we found that the performance of DJ-REX3 is not proportional to the amount of computing resources introduced. From Figure 8, we observe that adding a fourth node introduces additional overhead to our process, since copying the input data to another node outweighs parallelizing the tasks to more machines. The optimal number of nodes depends on the mining problem and the MapReduce strategies that are being used, as outlined in Section 3.

Flexibility

Hadoop runs on many different platforms (i.e., Windows, Mac and Unix). In our experiments, we used server machines with and without SSD drives, and relatively slow desktop machines. Because of the load balance control in Hadoop, each machine is given a fair amount of work. Because network latency could be one of the major causes of the data copying overhead, we did an experiment with 3 Hadoop nodes running in 3 virtual machines on the Intel Quad Core server machine. Running only 3 virtual machines increases the probability that each Hadoop process has its own processor core, whereas running Hadoop inside virtual machines should eliminate the majority of the network latency.

Figure 9 shows the running time of DJ-REX3 when deployed on 3 virtual machines on the same server machine. The performance of DJ-REX3 in virtual machines turns out to be worse than that of the undistributed J-REX. We suspect that this happens because the virtual machine setup results in slower disk accesses than a deployment on a physical machine. This could be improved by using a redundant storage array (RAID) or a networked storage array, but this is future work. The ability to run Hadoop in a virtual machine can be used to deploy a large Hadoop cluster in a very short time by rapidly replicating and starting up virtual machines. A well configured virtual machine could be deployed to run the mining process without any configuration, which is extremely suitable for non-experts.

5 Discussion and Limitations

MapReduce on other software repositories

Multiple types of repositories are used in the MSR field, but in principle MapReduce could be used as a standard platform to speed up and scale up different analyses. The main challenge is deriving optimal mapping strategies. For example, a MapReduce strategy could split mailing list data by time or by sender name when mining a mailing list repository. Similarly, when mapping a bug report repository, the creator and creation time of the bug report could be used as splitting criteria.
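As an illustration of such a splitting criterion (our own hypothetical example, not code from the paper), a mapper for a mailing-list analysis could simply re-key each message by its sender, so that one reducer sees all messages by the same author:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical mapper for a mailing-list repository: the input value is assumed to
    // hold one raw message; the sender's address becomes the key.
    public class SenderMapper extends Mapper<Object, Text, Text, Text> {
      @Override
      protected void map(Object key, Text rawMessage, Context context)
          throws IOException, InterruptedException {
        String sender = "unknown";
        for (String line : rawMessage.toString().split("\n")) {
          if (line.startsWith("From:")) {            // naive header parsing, for illustration only
            sender = line.substring("From:".length()).trim();
            break;
          }
        }
        context.write(new Text(sender), rawMessage);
      }
    }

Such a mapper would also need an input format that emits one whole message per record; Hadoop's default line-based input format would not do this out of the box.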
Incremental processing

Incremental processing is one possible way to deal with large repositories and extensive analysis. Instead of processing the data from a long history in one shot, one could incrementally process the data on a weekly or monthly basis. However, incremental processing requires more sophisticated designs of the mining algorithms, and sometimes is just not possible to achieve. Since researchers are mostly prototyping their ideas, a brute force approach might be more desirable, with optimizations (such as incremental processing) to follow later. The low cost of migrating an analysis technique to MapReduce is negligible compared to the complexity of migrating a technique to support incremental processing.

One-time processing

One-time processing involves processing a repository once, and then storing it in a compact format for subsequent querying and analysis. Clearly, the cost of one-time processing is not a major concern. However, we believe that MapReduce can help in two ways: 1) scaling the number of possible systems that can be analyzed, and 2) speeding up the prototyping phase. Using a MapReduce implementation, analyzing and querying a large system is simply faster than when doing one-time processing without MapReduce. Moreover, although one-time processing might require a single pass through the data, it is often the case that the developers of a technique explore a lot of ideas as they are prototyping their algorithms, and have to debug the technique. The repositories must be analyzed time and time again in these cases. We believe MapReduce can help speed up the prototyping phase and offer researchers more timely feedback on their ideas.

Robustness

MapReduce and its Hadoop implementation offer a robust computation model which can deal with different kinds of failures at run-time. If certain nodes fail, the tasks belonging to the failed nodes are automatically re-assigned to other nodes. All other nodes are notified to avoid trying to read data from the failed nodes. Dean et al. [7] reported MapReduce runs in which over 80 nodes became unreachable, yet the processing continued and finished successfully. This type of robustness permits the execution of Hadoop on laptops and non-dedicated machines, such that lab computers can join and leave a Hadoop cluster rapidly and easily based on the needs of their owners. For example, students can join a Hadoop cluster while they are away from their desk and leave it on until they are back.

Current Limitations

Most of the current limitations are imposed by the implementation of Hadoop. Locality is one of the most important issues for a distributed platform, as network bandwidth is a scarce resource when processing a large amount of data. To address this, Hadoop attempts to replicate the data across nodes so that the nearest replica of the data can always be located. In Hadoop, a typical configuration with hundreds of computers would by default have only 3 copies of the data. In this case, the chance of finding the required data stored on the local machine is very small. However, increasing the number of data copies requires more space and more time to put the large amount of data into the distributed file system. This in turn leads to more processing overhead.

Deploying data into the HDFS file system is another limitation of Hadoop. In the current Hadoop version (0.19.0), all input data needs to be copied into HDFS, which adds much overhead. As Figure 7 and Figure 8 show, the running time with 4 nodes may not be the shortest one. Finding out the optimal Hadoop configuration is future work.

6 Related Work

Automated evolutionary extractors, optimized mining solutions and distributed computing platforms are the three areas of research most related to our work.

Automated evolutionary extractors

Hassan developed an evolutionary code extractor for the C language called C-REX [14]. The Kenyon framework [6] combines various source code repositories into one to facilitate software evolution research. Draheim et al. created Bloof [8], by which users can define custom evolution metrics from CVS logs. Alonso et al. developed Minero [5], which uses database techniques to integrate and manage data from software repositories. Godfrey et al. [13] developed evolutionary extractors that use metrics at the system and subsystem level to monitor the evolution of each release of Linux. In addition, Qu et al. [23] developed evolutionary extractors that track structural dependency changes at the file level for each release of the GCC compiler. Gall et al. [10, 11] have developed evolutionary extractors that track the co-change of files for each changelist in CVS. Gall et al. [9] developed extractors which track source code changes. Zimmermann et al. [24] present an extractor which determines the changed functions for each changelist. All these tools can be easily ported to the MapReduce framework.

Optimizing Mining Solutions

To the authors' knowledge, there is only one related work which tries to optimize software mining solutions on large scale data, i.e. D-CCFinder [18, 19]. D-CCFinder is a distributed implementation of CCFinder to analyze source code of a large size and long history in a relatively short time. Unfortunately, this implementation is home-grown and specialized to CCFinder, and not open to other MSR techniques. More recently, the researchers behind D-CCFinder proposed to run CCFinder on a grid-based system [19].

Distributed Platforms

There are several distributed platforms that implement MapReduce. The prototypical one is from Google [7]. The Google platform makes use of Google's file system, called GFS [12]. Phoenix is another implementation of the MapReduce model [21]. Phoenix's main focus is on exploiting multi-core and multi-processor systems such as the PlayStation Cell architecture. GridGain [2] is an open source implementation of MapReduce, but its main disadvantage is that it can only process data which can be stored in a JVM heap space. For the large size of data that we usually process in MSR, this is not a good choice. We chose Hadoop due to its simple design and its wide user base.

7 Conclusions and Future Work

A scalable software mining solution should be adaptable, efficient, scalable and flexible. In this paper, we propose to use MapReduce as a general framework to support research in MSR. To validate our approach, we presented our experience of porting J-REX, an evolutionary code extractor for Java, to Hadoop, an open source implementation of MapReduce. Our experiments demonstrate that our new solution (DJ-REX) satisfies the four requirements of scalable software mining solutions. Our experiments show that running our optimized solution (DJ-REX3) on a small local area network with 4 nodes requires 75% less time than running it on a desktop machine and 66% less time than on a server machine.

One of our future goals is to dynamically control the computation resources in our lab. If someone wants to pull a machine out of the platform, we do not want to reconfigure the whole platform. If a machine becomes idle, one should be able to plug it back into the platform. In addition, we also plan to experiment with other technologies like HBase [4] to improve the current Hadoop deployment.

References

[1] Eclipse JDT.
[2] GridGain.
[3] Hadoop.
[4] HBase.
[5] O. Alonso, P. T. Devanbu, and M. Gertz. Database techniques for the analysis and exploration of software repositories. In Proc. of the 1st Workshop on Mining Software Repositories (MSR).
[6] J. Bevan, E. J. Whitehead, Jr., S. Kim, and M. Godfrey. Facilitating software evolution research with Kenyon. In Proc. of the 10th European Software Engineering Conference (ESEC/FSE).
[7] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1).
[8] D. Draheim and L. Pekacki. Process-centric analytical processing of version control data. In Proc. of the 6th Int. Workshop on Principles of Software Evolution (IWPSE).
[9] B. Fluri, H. C. Gall, and M. Pinzger. Fine-grained analysis of change couplings. In Proc. of the 4th Int. Workshop on Source Code Analysis and Manipulation (SCAM).
[10] H. Gall, K. Hajek, and M. Jazayeri. Detection of logical coupling based on product release history. In Proc. of the 14th Int. Conference on Software Maintenance (ICSM).
[11] H. Gall, M. Jazayeri, and J. Krajewski. CVS release history data for detecting logical couplings. In Proc. of the 6th Int. Workshop on Principles of Software Evolution (IWPSE).
[12] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In Proc. of the 19th ACM Symposium on Operating Systems Principles (SOSP).
[13] M. W. Godfrey and Q. Tu. Evolution in open source software: A case study. In Proc. of the 16th Int. Conference on Software Maintenance (ICSM).
[14] A. E. Hassan. Mining software repositories to assist developers and support managers. PhD thesis.
[15] A. E. Hassan. The road ahead for mining software repositories. In Frontiers of Software Maintenance (FoSM), pages 48-57, October.
[16] A. E. Hassan and R. C. Holt. Using development history sticky notes to understand software architecture. In Proc. of the 12th IEEE Int. Workshop on Program Comprehension (IWPC).
[17] T. Kamiya, S. Kusumoto, and K. Inoue. CCFinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Trans. Softw. Eng., 28(7).
[18] S. Livieri, Y. Higo, M. Matushita, and K. Inoue. Very-large scale code clone analysis and visualization of open source programs using distributed CCFinder: D-CCFinder. In Proc. of the 29th Int. Conference on Software Engineering (ICSE).
[19] Y. Manabe, Y. Higo, and K. Inoue. Toward efficient code clone detection on a grid environment. In Proc. of the 1st Workshop on Accountability and Traceability in Global Software Engineering (ATGSE), December.
[20] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia. Above the clouds: A Berkeley view of cloud computing. Technical report, University of California, Berkeley.
[21] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce for multi-core and multiprocessor systems. In Proc. of the 13th Int. Symposium on High Performance Computer Architecture (HPCA).
[22] G. Robles, J. M. Gonzalez-Barahona, M. Michlmayr, and J. J. Amor. Mining large software compilations over time: another perspective of software evolution. In Proc. of the 3rd Int. Workshop on Mining Software Repositories (MSR).
[23] Q. Tu and M. W. Godfrey. An integrated approach for studying architectural evolution. In Proc. of the 10th Int. Workshop on Program Comprehension (IWPC).
[24] T. Zimmermann, S. Diehl, and A. Zeller. How history justifies system architecture (or not). In Proc. of the 6th Int. Workshop on Principles of Software Evolution (IWPSE).


More information

The Google File System

The Google File System October 13, 2010 Based on: S. Ghemawat, H. Gobioff, and S.-T. Leung: The Google file system, in Proceedings ACM SOSP 2003, Lake George, NY, USA, October 2003. 1 Assumptions Interface Architecture Single

More information

Big Data Analysis Using Hadoop and MapReduce

Big Data Analysis Using Hadoop and MapReduce Big Data Analysis Using Hadoop and MapReduce Harrison Carranza, MSIS Marist College, Harrison.Carranza2@marist.edu Mentor: Aparicio Carranza, PhD New York City College of Technology - CUNY, USA, acarranza@citytech.cuny.edu

More information

EsgynDB Enterprise 2.0 Platform Reference Architecture

EsgynDB Enterprise 2.0 Platform Reference Architecture EsgynDB Enterprise 2.0 Platform Reference Architecture This document outlines a Platform Reference Architecture for EsgynDB Enterprise, built on Apache Trafodion (Incubating) implementation with licensed

More information

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context 1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes

More information

Index Construction. Dictionary, postings, scalable indexing, dynamic indexing. Web Search

Index Construction. Dictionary, postings, scalable indexing, dynamic indexing. Web Search Index Construction Dictionary, postings, scalable indexing, dynamic indexing Web Search 1 Overview Indexes Query Indexing Ranking Results Application Documents User Information analysis Query processing

More information

Introducing Oracle R Enterprise 1.4 -

Introducing Oracle R Enterprise 1.4 - Hello, and welcome to this online, self-paced lesson entitled Introducing Oracle R Enterprise. This session is part of an eight-lesson tutorial series on Oracle R Enterprise. My name is Brian Pottle. I

More information

A BigData Tour HDFS, Ceph and MapReduce

A BigData Tour HDFS, Ceph and MapReduce A BigData Tour HDFS, Ceph and MapReduce These slides are possible thanks to these sources Jonathan Drusi - SCInet Toronto Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing SICS; Yahoo!

More information

Key aspects of cloud computing. Towards fuller utilization. Two main sources of resource demand. Cluster Scheduling

Key aspects of cloud computing. Towards fuller utilization. Two main sources of resource demand. Cluster Scheduling Key aspects of cloud computing Cluster Scheduling 1. Illusion of infinite computing resources available on demand, eliminating need for up-front provisioning. The elimination of an up-front commitment

More information

Parallel data processing with MapReduce

Parallel data processing with MapReduce Parallel data processing with MapReduce Tomi Aarnio Helsinki University of Technology tomi.aarnio@hut.fi Abstract MapReduce is a parallel programming model and an associated implementation introduced by

More information

IN-MEMORY DATA FABRIC: Real-Time Streaming

IN-MEMORY DATA FABRIC: Real-Time Streaming WHITE PAPER IN-MEMORY DATA FABRIC: Real-Time Streaming COPYRIGHT AND TRADEMARK INFORMATION 2014 GridGain Systems. All rights reserved. This document is provided as is. Information and views expressed in

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected

More information

The Use of Cloud Computing Resources in an HPC Environment

The Use of Cloud Computing Resources in an HPC Environment The Use of Cloud Computing Resources in an HPC Environment Bill, Labate, UCLA Office of Information Technology Prakashan Korambath, UCLA Institute for Digital Research & Education Cloud computing becomes

More information

David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms.

David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms. Whitepaper Introduction A Library Based Approach to Threading for Performance David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms.

More information

Bigtable. A Distributed Storage System for Structured Data. Presenter: Yunming Zhang Conglong Li. Saturday, September 21, 13

Bigtable. A Distributed Storage System for Structured Data. Presenter: Yunming Zhang Conglong Li. Saturday, September 21, 13 Bigtable A Distributed Storage System for Structured Data Presenter: Yunming Zhang Conglong Li References SOCC 2010 Key Note Slides Jeff Dean Google Introduction to Distributed Computing, Winter 2008 University

More information

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES 1 THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES Vincent Garonne, Mario Lassnig, Martin Barisits, Thomas Beermann, Ralph Vigne, Cedric Serfon Vincent.Garonne@cern.ch ph-adp-ddm-lab@cern.ch XLDB

More information

CISC 7610 Lecture 2b The beginnings of NoSQL

CISC 7610 Lecture 2b The beginnings of NoSQL CISC 7610 Lecture 2b The beginnings of NoSQL Topics: Big Data Google s infrastructure Hadoop: open google infrastructure Scaling through sharding CAP theorem Amazon s Dynamo 5 V s of big data Everyone

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

DELL EMC DATA DOMAIN SISL SCALING ARCHITECTURE

DELL EMC DATA DOMAIN SISL SCALING ARCHITECTURE WHITEPAPER DELL EMC DATA DOMAIN SISL SCALING ARCHITECTURE A Detailed Review ABSTRACT While tape has been the dominant storage medium for data protection for decades because of its low cost, it is steadily

More information

Data Informatics. Seon Ho Kim, Ph.D.

Data Informatics. Seon Ho Kim, Ph.D. Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu HBase HBase is.. A distributed data store that can scale horizontally to 1,000s of commodity servers and petabytes of indexed storage. Designed to operate

More information

HDF Virtualization Review

HDF Virtualization Review Scott Wegner Beginning in July 2008, The HDF Group embarked on a new project to transition Windows support to a virtualized environment using VMWare Workstation. We utilized virtual machines in order to

More information

Huge Data Analysis and Processing Platform based on Hadoop Yuanbin LI1, a, Rong CHEN2

Huge Data Analysis and Processing Platform based on Hadoop Yuanbin LI1, a, Rong CHEN2 2nd International Conference on Materials Science, Machinery and Energy Engineering (MSMEE 2017) Huge Data Analysis and Processing Platform based on Hadoop Yuanbin LI1, a, Rong CHEN2 1 Information Engineering

More information

WHITE PAPER AGILOFT SCALABILITY AND REDUNDANCY

WHITE PAPER AGILOFT SCALABILITY AND REDUNDANCY WHITE PAPER AGILOFT SCALABILITY AND REDUNDANCY Table of Contents Introduction 3 Performance on Hosted Server 3 Figure 1: Real World Performance 3 Benchmarks 3 System configuration used for benchmarks 3

More information

Lab Validation Report

Lab Validation Report Lab Validation Report NetApp SnapManager for Oracle Simple, Automated, Oracle Protection By Ginny Roth and Tony Palmer September 2010 Lab Validation: NetApp SnapManager for Oracle 2 Contents Introduction...

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 2. MapReduce Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Framework A programming model

More information

Parallel Computing: MapReduce Jin, Hai

Parallel Computing: MapReduce Jin, Hai Parallel Computing: MapReduce Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology ! MapReduce is a distributed/parallel computing framework introduced by Google

More information

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09. Presented by: Daniel Isaacs

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09. Presented by: Daniel Isaacs Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09 Presented by: Daniel Isaacs It all starts with cluster computing. MapReduce Why

More information

Comparative Analysis of K means Clustering Sequentially And Parallely

Comparative Analysis of K means Clustering Sequentially And Parallely Comparative Analysis of K means Clustering Sequentially And Parallely Kavya D S 1, Chaitra D Desai 2 1 M.tech, Computer Science and Engineering, REVA ITM, Bangalore, India 2 REVA ITM, Bangalore, India

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

High Performance Computing on MapReduce Programming Framework

High Performance Computing on MapReduce Programming Framework International Journal of Private Cloud Computing Environment and Management Vol. 2, No. 1, (2015), pp. 27-32 http://dx.doi.org/10.21742/ijpccem.2015.2.1.04 High Performance Computing on MapReduce Programming

More information

WHITEPAPER. Improve Hadoop Performance with Memblaze PBlaze SSD

WHITEPAPER. Improve Hadoop Performance with Memblaze PBlaze SSD Improve Hadoop Performance with Memblaze PBlaze SSD Improve Hadoop Performance with Memblaze PBlaze SSD Exclusive Summary We live in the data age. It s not easy to measure the total volume of data stored

More information

FLAT DATACENTER STORAGE. Paper-3 Presenter-Pratik Bhatt fx6568

FLAT DATACENTER STORAGE. Paper-3 Presenter-Pratik Bhatt fx6568 FLAT DATACENTER STORAGE Paper-3 Presenter-Pratik Bhatt fx6568 FDS Main discussion points A cluster storage system Stores giant "blobs" - 128-bit ID, multi-megabyte content Clients and servers connected

More information

MapReduce. Cloud Computing COMP / ECPE 293A

MapReduce. Cloud Computing COMP / ECPE 293A Cloud Computing COMP / ECPE 293A MapReduce Jeffrey Dean and Sanjay Ghemawat, MapReduce: simplified data processing on large clusters, In Proceedings of the 6th conference on Symposium on Opera7ng Systems

More information

Lecture 1: January 23

Lecture 1: January 23 CMPSCI 677 Distributed and Operating Systems Spring 2019 Lecture 1: January 23 Lecturer: Prashant Shenoy Scribe: Jonathan Westin (2019), Bin Wang (2018) 1.1 Introduction to the course The lecture started

More information

CLIENT DATA NODE NAME NODE

CLIENT DATA NODE NAME NODE Volume 6, Issue 12, December 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Efficiency

More information

Lecture 1: January 22

Lecture 1: January 22 CMPSCI 677 Distributed and Operating Systems Spring 2018 Lecture 1: January 22 Lecturer: Prashant Shenoy Scribe: Bin Wang 1.1 Introduction to the course The lecture started by outlining the administrative

More information

Churrasco: Supporting Collaborative Software Evolution Analysis

Churrasco: Supporting Collaborative Software Evolution Analysis Churrasco: Supporting Collaborative Software Evolution Analysis Marco D Ambros a, Michele Lanza a a REVEAL @ Faculty of Informatics - University of Lugano, Switzerland Abstract Analyzing the evolution

More information

The amount of data increases every day Some numbers ( 2012):

The amount of data increases every day Some numbers ( 2012): 1 The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect

More information

10 Steps to Virtualization

10 Steps to Virtualization AN INTEL COMPANY 10 Steps to Virtualization WHEN IT MATTERS, IT RUNS ON WIND RIVER EXECUTIVE SUMMARY Virtualization the creation of multiple virtual machines (VMs) on a single piece of hardware, where

More information

2/26/2017. The amount of data increases every day Some numbers ( 2012):

2/26/2017. The amount of data increases every day Some numbers ( 2012): The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect to

More information

Clustering Lecture 8: MapReduce

Clustering Lecture 8: MapReduce Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data

More information

Dept. Of Computer Science, Colorado State University

Dept. Of Computer Science, Colorado State University CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [HADOOP/HDFS] Trying to have your cake and eat it too Each phase pines for tasks with locality and their numbers on a tether Alas within a phase, you get one,

More information

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department

More information

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics IBM Data Science Experience White paper R Transforming R into a tool for big data analytics 2 R Executive summary This white paper introduces R, a package for the R statistical programming language that

More information

Cloud Computing & Visualization

Cloud Computing & Visualization Cloud Computing & Visualization Workflows Distributed Computation with Spark Data Warehousing with Redshift Visualization with Tableau #FIUSCIS School of Computing & Information Sciences, Florida International

More information

Google File System. By Dinesh Amatya

Google File System. By Dinesh Amatya Google File System By Dinesh Amatya Google File System (GFS) Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung designed and implemented to meet rapidly growing demand of Google's data processing need a scalable

More information

Using Pig as a Data Preparation Language for Large-Scale Mining Software Repositories Studies: An Experience Report

Using Pig as a Data Preparation Language for Large-Scale Mining Software Repositories Studies: An Experience Report Using Pig as a Data Preparation Language for Large-Scale Mining Software Repositories Studies: An Experience Report Weiyi Shang, Bram Adams, Ahmed E. Hassan Software Analysis and Intelligence Lab (SAIL),

More information

Embedded Technosolutions

Embedded Technosolutions Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication

More information

Information Retrieval

Information Retrieval Information Retrieval Suan Lee - Information Retrieval - 04 Index Construction 1 04 Index Construction - Information Retrieval - 04 Index Construction 2 Plan Last lecture: Dictionary data structures Tolerant

More information

Virtual Machines. 2 Disco: Running Commodity Operating Systems on Scalable Multiprocessors([1])

Virtual Machines. 2 Disco: Running Commodity Operating Systems on Scalable Multiprocessors([1]) EE392C: Advanced Topics in Computer Architecture Lecture #10 Polymorphic Processors Stanford University Thursday, 8 May 2003 Virtual Machines Lecture #10: Thursday, 1 May 2003 Lecturer: Jayanth Gummaraju,

More information

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per

More information

Operating Systems: Internals and Design Principles. Chapter 2 Operating System Overview Seventh Edition By William Stallings

Operating Systems: Internals and Design Principles. Chapter 2 Operating System Overview Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Chapter 2 Operating System Overview Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Operating systems are those

More information

How are Developers Treating License Inconsistency Issues? A Case Study on License Inconsistency Evolution in FOSS Projects

How are Developers Treating License Inconsistency Issues? A Case Study on License Inconsistency Evolution in FOSS Projects How are Developers Treating License Inconsistency Issues? A Case Study on License Inconsistency Evolution in FOSS Projects Yuhao Wu 1(B), Yuki Manabe 2, Daniel M. German 3, and Katsuro Inoue 1 1 Graduate

More information

High Throughput WAN Data Transfer with Hadoop-based Storage

High Throughput WAN Data Transfer with Hadoop-based Storage High Throughput WAN Data Transfer with Hadoop-based Storage A Amin 2, B Bockelman 4, J Letts 1, T Levshina 3, T Martin 1, H Pi 1, I Sfiligoi 1, M Thomas 2, F Wuerthwein 1 1 University of California, San

More information

Oracle Developer Studio 12.6

Oracle Developer Studio 12.6 Oracle Developer Studio 12.6 Oracle Developer Studio is the #1 development environment for building C, C++, Fortran and Java applications for Oracle Solaris and Linux operating systems running on premises

More information