MapReduce as a General Framework to Support Research in Mining Software Repositories (MSR)
Weiyi Shang, Zhen Ming Jiang, Bram Adams, Ahmed E. Hassan
Software Analysis and Intelligence Lab (SAIL)
Queen's University, Kingston, Canada
{swy, zmjiang, bram, ahmed}@cs.queensu.ca

Abstract

Researchers continue to demonstrate the benefits of Mining Software Repositories (MSR) for supporting software development and research activities. However, as the mining process is time and resource intensive, researchers often create their own distributed platforms and use various optimizations to speed up and scale up their analysis. These platforms are project-specific, hard to reuse, and offer minimal debugging and deployment support. In this paper, we propose the use of MapReduce, a distributed computing platform, to support research in MSR. As a proof-of-concept, we migrate J-REX, an optimized evolutionary code extractor, to run on Hadoop, an open source implementation of MapReduce. Through a case study on the source control repositories of the Eclipse, BIRT and Datatools projects, we demonstrate that the migration effort to MapReduce is minimal and that the benefits are significant, as the running time of the migrated J-REX is only 30% to 50% of the original J-REX's. This paper documents our experience with the migration and highlights the benefits and challenges of the MapReduce framework in the MSR community.

1 Introduction

The Mining Software Repositories (MSR) field analyzes and cross-links the rich data available in software repositories to uncover interesting and actionable information about software systems [15]. Examples of software repositories include source control repositories, bug repositories, archived communications, deployment logs, and code repositories. Research in the MSR field has received an increasing amount of interest. Most MSR techniques have been demonstrated on large-scale software systems. However, the size of the data available for mining continues to grow at a very high rate. For example, the size of the Linux kernel source code exhibits super-linear growth [13].
MSR researchers continue to explore deeper and more sophisticated analyses across a large number of long-lived systems. Robles et al. reported that Debian, a well-known Linux distribution, doubles in size approximately every two years [22]. Combined with the huge system size, this may pose problems for analyzing the evolution of the Debian system in the future. The large and continuously growing software repositories and the need for deeper analysis impose challenges on the scalability of MSR techniques. Powerful computers and sophisticated software mining algorithms are needed to successfully analyze and cross-link data in a timely fashion. Prior research focuses especially on building home-grown solutions for this. The authors of D-CCFinder [18], a distributed version of the popular CCFinder [17] clone detection tool, improved their processing time from 40 days on a single PC-based workstation (Intel Xeon 2.8GHz, 2 GB RAM) to 2 days on a distributed system consisting of 80 PCs. Their home-grown solution is reported to contain about 20 kloc of Java code, which must be maintained and enhanced by these MSR researchers and does not directly translate to other analyzers. Tackling the problem of processing large software repositories in a timely fashion is of paramount importance to the future of the MSR field in general, as we aim to improve the adoption rate of MSR techniques by practitioners. We envision a future where sophisticated MSR techniques are integrated into IDEs that run on commodity workstations and that provide fast and accurate results to developers and managers. In short, one cannot require every MSR researcher to have large and expensive servers. Furthermore, home-grown solutions to optimize mining performance require huge development and maintenance efforts. Last but not least, the task of performance tuning turns our attention away from the real problem, which is to uncover
interesting repository information. In many cases, MSR researchers have neither the expertise required nor the interest to improve the performance of their data mining algorithms. Techniques are needed that hide the complexity of scaling yet provide researchers with the benefits of scale. Research shows that scaling out (distributed systems) is always better than scaling up (bigger and more powerful machines) [20]. Off-the-shelf distributed frameworks are promising technologies that can help our field. In this paper, we explore one of these technologies, called MapReduce [7]. As a proof-of-concept, we migrate J-REX, an optimized evolutionary code extractor, to run on the Hadoop platform. Hadoop is a popular open-source implementation of MapReduce which is increasingly gaining popularity and has proved to be scalable and of production quality. Companies like Yahoo have Hadoop installations with over 5,000 machines, and Hadoop is also used by the Amazon computing clouds. With many companies involved in its development and maintenance, Hadoop is rapidly maturing. Through a case study on the source control repositories of the Eclipse, BIRT and Datatools projects, we show that the migration effort to MapReduce is minimal and that the benefits are significant. The migrated J-REX solutions are 4 times faster than the original J-REX. This paper documents our experience with the migration and highlights the benefits and challenges of adopting off-the-shelf distributed frameworks in the MSR community. The paper is organized as follows. Section 2 imposes requirements for a general distributed framework to support MSR research. MapReduce is explained in Section 3, as well as one of its open source implementations: Hadoop. Section 4 discusses our case study in which we migrate J-REX, an evolutionary code extractor running on a single machine, to Hadoop. The repercussions of this case study and the limitations of our approach are discussed in Section 5. Section 6 presents related work. Finally, Section 7 concludes the paper and presents future work.
2 Requirements for a General Framework to Support MSR Research

We identify four common requirements for large distributed platforms to support MSR research. We detail them as follows:

1. Adaptability: the platform should take MSR researchers minimal effort to migrate from their prototype solutions, which are developed on a single machine.
2. Efficiency: the adoption of the platform should drastically speed up the mining process.
3. Scalability: the platform should scale with the size of the input data as well as with the available computing power.
4. Flexibility: the platform should be able to run on various types of machines, from expensive servers to commodity PCs or even virtual machines.

This paper presents and evaluates MapReduce as a possible distributed platform which satisfies these four requirements.

3 MapReduce

MapReduce [7] is a distributed framework for processing vast data sets. It was originally proposed and used by Google engineers to process the large amount of data they must analyze on a daily basis. The input data for MapReduce consists of a list of key/value pairs. Mappers accept the incoming pairs and map them into intermediate key/value pairs. Each group of intermediate data with the same key is then passed to a specific set of reducers, each of which performs computations on the data and reduces it to one single key/values pair. The sorted output of the reducers is the final result of the MapReduce process. In this paper, we simplify the discussion of MapReduce by assuming that reducers accept values instead of key/value pairs. To illustrate MapReduce, we consider an example MapReduce process which counts the frequency of word lengths in a book. The example process is shown in Figure 1. Mappers accept every single word from the book and make keys for them. Because we want to count the frequency of all words with different lengths, a typical approach would be to use the length of the word as the key. So, for the word "hello", a mapper will generate a key/value pair of 5/hello. Afterwards, the key/value pairs with the same key are grouped and sent to the reducers. A reducer, which receives a list of values with the same key, can simply count the size of this list and keep the key in its output.
If a reducer receives a list with key 5, for example, it will count the size of the list of all words that have length 5. If the size is n, it outputs an output pair 5/n, which means there are n words with length 5 in the book. The power and challenge of the MapReduce model reside in its ability to support different mapping and reducing strategies. For example, an alternative implementation could map each input value (i.e., word) based on its first letter and its length. Then, reducers would process those words starting with one or a small number of different letters (keys) and perform the counting. This MapReduce strategy permits an increasing number of reducers that can work in parallel on the problem. However, the final output needs additional post-processing in comparison to the first strategy. In short, both strategies can solve the problem, but each strategy has different performance and implementation benefits and challenges.
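To make the two phases concrete, the word-length example above can be simulated in a few lines of plain Python. This is a single-process sketch of the map, shuffle and reduce steps, not Hadoop code:

```python
from collections import defaultdict

def map_phase(words):
    # Mapper: emit one (length, word) pair per word in the book.
    return [(len(word), word) for word in words]

def shuffle(pairs):
    # Group the intermediate pairs by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: collapse each key's list of words to a (length, count) pair.
    return {key: len(values) for key, values in groups.items()}

words = ["hello", "good", "night", "happy", "school", "dog", "cat", "fish"]
counts = reduce_phase(shuffle(map_phase(words)))
# counts[5] == 3, because "hello", "night" and "happy" all have length 5
```

The shuffle step is what the framework provides for free; the researcher only writes the map and reduce steps.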
Figure 1. Example MapReduce process for counting the frequency of word lengths in a book.

Hadoop is an open-source implementation of MapReduce [3] which is supported by Yahoo and is used by Amazon, AOL, Baidu and a number of other companies for their distributed solutions. Hadoop can run on various operating systems such as Linux, Windows, FreeBSD, Mac OSX and OpenSolaris. It not only implements the MapReduce model, but also provides a distributed file system, called the Hadoop Distributed File System (HDFS). Hadoop supplies Java interfaces to simplify the MapReduce model and to control code and HDFS programmatically. Another advantage for users is that Hadoop by default comes with some basic and widely used mapping and reducing methods, for example to split files into lines, or to split a directory into files. With these methods, users occasionally do not have to write new code to use MapReduce.
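A user of such a framework supplies only the map and reduce functions; everything else is generic. The following single-machine Python sketch (our own illustration, not Hadoop's actual Java API) shows how small that user-supplied part is:

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    # Single-process sketch of the MapReduce control flow:
    # map every record, shuffle by key, then reduce each group
    # and return the output sorted by key.
    intermediate = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            intermediate[key].append(value)
    return {key: reducer(key, values)
            for key, values in sorted(intermediate.items())}

# Reusing the word-length example: the user only writes these two
# lambdas, much as Hadoop users only implement the Mapper and
# Reducer interfaces.
result = run_mapreduce(
    ["dog", "cat", "hello", "good"],
    mapper=lambda word: [(len(word), word)],
    reducer=lambda key, values: len(values),
)
# result == {3: 2, 4: 1, 5: 1}
```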
We used Hadoop as our MapReduce implementation for the following four reasons:

1. Hadoop is easy to use. Researchers do not want to spend considerable time on modifying their mining program to make it distributed. The simple MapReduce Java interface simplifies the process of implementing mappers and reducers.
2. Hadoop runs on different operating systems. Academic research labs tend to have a heterogeneous network of machines with different hardware configurations and varying operating systems. Hadoop can run on most current operating systems and hence exploit as much of the available computing power as possible.
3. Hadoop runs on commodity machines. The largest computation resources in research labs and software development companies are desktop computers and laptops. This characteristic of Hadoop permits these computers to join and leave the computing cluster in a dynamic and transparent fashion without user intervention.
4. Hadoop is a mature and open source system. Hadoop has been successfully used in many commercial projects. It is actively developed, with new features and enhancements continuously being added. Since Hadoop is
free to download and redistribute, it can be installed on multiple machines without worrying about per-seat or licensing costs.

Based on these points, we consider Hadoop the most suitable MapReduce implementation for our research. The next section evaluates the ability of Hadoop, and hence MapReduce, to satisfy the four requirements of Section 2.

4 Case study

To validate the promise of MapReduce for MSR research, we discuss our experience migrating an evolutionary code extractor called J-REX to Hadoop. J-REX is a highly optimized evolutionary code extractor for Java systems, similar to C-REX [14]. As shown in Figure 2, the whole process of J-REX spans three phases. The first phase is the extraction phase, where J-REX extracts the source code snapshots for each file from a CVS repository. In the second phase, i.e. the parsing phase, J-REX calls the Eclipse JDT parser for each file snapshot to parse the Java code into its abstract syntax tree [1], which is stored as an XML document. In the third phase, i.e. the analysis phase, J-REX compares the XML documents of consecutive file revisions to determine the changed code units,
Table 1. Disk performance of the desktop and server computers.

              Cached read speed   Cached write speed
    Server    8,531 MB/sec        211 MB/sec
    Desktop   3,302 MB/sec        107 MB/sec

              Random read speed   Random write speed
    Server    2,986 MB/sec        1,075 MB/sec
    Desktop   1,488 MB/sec        658 MB/sec

Table 2. Characteristics of Eclipse, BIRT and Datatools.

    Repository   Size    #Source Code Files   Length of History   #Revisions
    Datatools    394MB   10,…                 … years             2,398
    BIRT         810MB   13,…                 … years             19,583
    Eclipse      4.2GB   56,…                 … years             82,682

Figure 2. Architecture of J-REX.

and generates evolutionary change data in an XML format [16]. The evolutionary change data reports the evolution of a software system at the level of code entities such as methods and classes (for example, "class A was changed to add a new method B"). The architecture of J-REX is comparable to the architecture of other MSR tools. The J-REX runtime process requires a huge amount of I/O operations, which are performance bottlenecks, and a large amount of computing power when comparing XML trees. The I/O and computational characteristics of J-REX make it an ideal case study for the performance benefits of the MapReduce computation model. Through this case study, we seek to verify whether the Hadoop solution satisfies the four requirements listed in Section 2:

1. Adaptability: We explain the process to migrate basic J-REX, a non-distributed MSR tool, to three different distributed Hadoop solutions (DJ-REX1, DJ-REX2 and DJ-REX3).
2. Efficiency: For all three Hadoop solutions, we compare the performance of the mining process among desktop and server machines.
3. Scalability: We examine the scalability of the Hadoop solutions on three data repositories with varying sizes. We also examine the scalability of the Hadoop solutions running on a varying number of machines.
4. Flexibility: Finally, we study the flexibility of the Hadoop platform by deploying Hadoop on virtual machines in a multicore environment.

In the rest of this section, we first explain our experimental environment and the details of our experiments.
Then, we discuss whether or not using Hadoop for software mining satisfies the 4 requirements of Section 2.

4.1 Experimental environment

Our Hadoop installation is deployed on 4 computers in a local gigabit network. The 4 computers consist of 2 desktop computers, each having an Intel Quad Core 2.40 GHz CPU with 2 GB of RAM, and of 2 server computers, one having an Intel Core i-series CPU with 8 logical cores (hyperthreading) and 6 GB of RAM, the other one having an Intel Quad Core 2.40 GHz CPU with 8 GB of RAM and a RAID5 disk. The 8-core server machine has Solid State Disks (SSD) instead of regular RAID disks. The difference in disk performance between the regular disk machines and the SSD disk server computer, as measured by hdparm and iozone (64 kB block size), is shown in Table 1. The server's I/O speed with the SSD drive is twice as fast as the machines with a regular disk for random I/O, and turns out to be two and a half times as fast for cached operations. The source control repositories used in our experiments consist of the whole Eclipse repository and 2 sub-projects from Eclipse called BIRT and Datatools. Eclipse has a large repository with a long history, BIRT has a medium repository with a medium-length history, and Datatools has a small repository with a short history. Using these 3 repositories with different sizes and lengths of history, we can better evaluate the performance of our approach across subject systems. The repository information of the 3 projects is shown in Table 2.
Table 3. Experimental results for DJ-REX in Hadoop.

    Repository   Desktop   Server     Strategy    2 nodes   3 nodes   4 nodes
    Datatools    0:35:50   0:34:14    DJ-REX3     0:19:52   0:14:32   0:16:40
    BIRT         2:44:09   2:05:55    DJ-REX1     2:03:51   2:05:02   2:16:03
                                      DJ-REX2     1:40:22   1:40:32   1:47:26
                                      DJ-REX3     1:08:36   0:50:33   0:45:16
                                      DJ-REX3*              3:02:47
    Eclipse                12:35:34   DJ-REX3                         3:49:…

4.2 Experiments

We conduct the following experiments:

1. Run J-REX without Hadoop on the BIRT, Datatools and Eclipse repositories.
2. Run DJ-REX1, DJ-REX2 and DJ-REX3 on BIRT with 2, 3 and 4 machines.
3. Run DJ-REX3 on Datatools with 2, 3 and 4 machines.
4. Run DJ-REX3 on Eclipse with 4 machines.
5. Run DJ-REX3 on BIRT with 3 virtual machines.

Only DJ-REX3 is applied in the last three experiments, because the experimental results for BIRT already showed that it gives a significant speed improvement compared to the other two distributed strategies and the original, undistributed J-REX. The results of all five experiments are summarized in Table 3 and are discussed in the next section. The row with DJ-REX3* corresponds to the experiment that has DJ-REX3 running on 3 virtual machines.

4.3 Case study discussion

This section uses the experimental results of Table 3 to discuss whether or not the various DJ-REX solutions meet the 4 requirements outlined in Section 2.

Adaptability

Table 4 shows the implementation and deployment effort required for DJ-REX. We first discuss the effort devoted to porting J-REX to Hadoop. Then we present our experience with configuring Hadoop to add in more computing power. The implementation effort of the three DJ-REX solutions decreases as we got more acquainted with the technology.

Easy to experiment with various distributed solutions. As is often the case, MSR researchers do not have the expertise required for, nor do they have interest in, improving the performance of their mining algorithms.
Table 4. Effort to program and deploy DJ-REX.

    J-REX Logic                        No Change
    MapReduce strategy for DJ-REX1     400 LOC, 2 hours
    MapReduce strategy for DJ-REX2     400 LOC, 2 hours
    MapReduce strategy for DJ-REX3     300 LOC, 1 hour
    Deployment and Configuration       1 hour
    Reconfiguration                    1 minute

Table 5. Overview of the distributed steps in DJ-REX1 to DJ-REX3.

               Extraction   Parsing   Analysis
    DJ-REX1    No           No        Yes
    DJ-REX2    No           Yes       Yes
    DJ-REX3    Yes          Yes       Yes

The need to rewrite an MSR tool from scratch to make it run on Hadoop is not an acceptable option. If the programming time for a Hadoop migration is long (maybe as long as re-implementing the tool), then the chances of adopting Hadoop become very low. In addition, if one has to modify a tool in such an invasive way, considerably more time will have to be spent to test it again once it runs distributed. We found that applications are very easy to port to Hadoop. First of all, Hadoop provides a number of default mechanisms to split input data across mappers. For example, the MultiFileSplit class splits the files in a directory, whereas the DBInputSplit class splits the rows in a database table. Often, one can reuse these existing mapping strategies. Second, Hadoop has well-defined and simple APIs to implement a MapReduce process. One just needs to implement the corresponding interfaces to make a custom MapReduce process. Third, several code examples are available to show users how to write MapReduce code with Hadoop [3]. One of the code examples is to search for a string in files, just like the grep command in Linux. Another example is to count word frequencies in files. After looking at the available code examples, we found that we could reuse the code for splitting the input data by files. Then, we spent a few hours to write around 400 lines of Java code for each of the three DJ-REX MapReduce strategies. The programming logic of J-REX itself barely changed.

The remainder of this section explains our three DJ-REX MapReduce strategies. Intuitively, we want to compare the difference between adjacent revisions of a Java source code file. We could define the key/value pair output of the mapper function as (D1, a_0.java a_1.java), and the key/value pair output of the reducer function as (revision number, evolutionary information). The key D1 represents the difference between two versions, and a_0.java and a_1.java represent the names of the two files. However, because of the way that this strategy partitions the data, each revision needs to be copied and transferred to more than one node, which generates extra overhead for the mining process and turned out to make the process much longer. The failure of this naive strategy shows the importance of designing a good MapReduce strategy.

Therefore, we tried another basic MapReduce strategy, as shown in Figure 3. This strategy performs much better than our naive strategy. The key/value pair output of the mapper function of this strategy is defined as (file name, revision snapshot), whereas the key/value pair output of the reducer function is (file name, evolutionary information for this file). For example, say the file a.java has 3 revisions. The mapping phase generates 3 (key, value) pairs: (a.java, a_0.java), (a.java, a_1.java) and (a.java, a_2.java). Since they have the same key, these three pairs are sent to the same reducer. The final output for the file a.java is the generated evolutionary information. Using files to partition the input data is an intuitive and often the most suitable option for many mining techniques which want to explore the use of Hadoop.

Figure 3. MapReduce strategy for DJ-REX.

On top of this basic MapReduce strategy, we have implemented 3 flavors of DJ-REX (Table 5). Each flavor distributes a different combination of the 3 phases of the original J-REX implementation (Figure 2). The first flavor is called DJ-REX1. One machine extracts the source code offline and parses it into AST form. Afterwards, the output XML files are stored in HDFS and Hadoop uses them to analyze the change information. In this case, only 1 phase of J-REX becomes distributed, and the files we copy into HDFS are XML files which represent the AST information of each version of all source code files in the repository. For DJ-REX2, one more phase, the parsing phase, becomes distributed, and therefore only the extraction phase still runs non-distributed. The extraction phase extracts all main versions of the source code files from the repository, and these files are copied into HDFS and used as input to the following Hadooped parsing phase. Finally, DJ-REX3 is a fully-distributed implementation with all 3 phases running in a distributed fashion inside each node. The input for DJ-REX3 is the raw CVS data, and Hadoop is used throughout all 3 phases.

Figure 4. Running time comparison of J-REX on a server machine and a desktop machine compared to the fastest DJ-REX3 for Datatools.

Easy to deploy and add more computing power. It took us only 1 hour to learn how to deploy Hadoop in the local network. To expand the experiment cluster (i.e., to add more machines), we only needed to add the machine names in a configuration file and install Hadoop on those machines. Based on our experience, we feel that porting J-REX to Hadoop is easy and straightforward, and for sure easier and less error-prone than implementing our own distributed platform.
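As an illustration of the file-based partitioning strategy described above, the following Python sketch mimics the DJ-REX mapper and reducer on toy revision names. The `mapper`, `reducer` and the pairwise "diff" output are our own simplifications for illustration, not J-REX code:

```python
from collections import defaultdict

def mapper(revision):
    # A revision snapshot like "a_1.java" is keyed by its base file
    # name "a.java", so all snapshots of one file meet at one reducer.
    base = revision.split("_")[0] + ".java"
    return (base, revision)

def reducer(base, snapshots):
    # Stand-in for the analysis phase: compare consecutive snapshots
    # and emit one evolutionary record per change (here, just the
    # pair of snapshot names).
    ordered = sorted(snapshots)
    return [(old, new) for old, new in zip(ordered, ordered[1:])]

revisions = ["a_0.java", "a_1.java", "a_2.java", "b_0.java", "b_1.java"]
groups = defaultdict(list)
for rev in revisions:
    key, value = mapper(rev)
    groups[key].append(value)
evolution = {base: reducer(base, snaps) for base, snaps in groups.items()}
# evolution["a.java"] == [("a_0.java", "a_1.java"), ("a_1.java", "a_2.java")]
```

Because every snapshot is keyed by exactly one file name, each revision is shipped to exactly one reducer, avoiding the duplicated copying that made the naive diff-keyed strategy slow.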
Figure 5. Running time of J-REX on a server machine and a desktop machine compared to the fastest deployment of DJ-REX1, DJ-REX2 and DJ-REX3 for BIRT.

Figure 6. Running time of J-REX on a server machine compared to DJ-REX3 with 4 worker nodes for Eclipse.

Efficiency

We now use our experimental data to test how much time can be saved by using Hadoop for the mining process. Figure 4 (Datatools), Figure 5 (BIRT) and Figure 6 (Eclipse) present the results of Table 3 in a graphical way. From Figure 4 (Datatools) and Figure 5 (BIRT), we can draw the following two conclusions. On the one hand, faster and more powerful machinery can speed up the mining process. For example, running J-REX on a very fast server machine with SSD drives for the BIRT repository saves around 40 minutes compared with running it on the desktop machine. On the other hand, all DJ-REX solutions perform no worse or even better than the J-REX solutions regardless of the difference in hardware. As shown in Figure 5, the running time on the SSD server machine is almost the same as that of DJ-REX1, which only has the analysis phase distributed, since the analysis phase is the shortest of all three J-REX phases; therefore, the performance gain of DJ-REX1 is not significant. DJ-REX2 and DJ-REX3, however, outperform the server. The running time of DJ-REX3 on BIRT is almost one quarter of that of running it on a desktop machine, and one third of the time of running it on a server machine. The running time of DJ-REX3 for Datatools has been reduced to around half of the time taken by the desktop and server solutions, and for Eclipse to around a quarter of the time of the server solution. It is clear that the more we distribute our process, the less time is needed.

Figure 7. Comparison of the running time of the 3 flavors of DJ-REX for BIRT.

Figure 7 shows detailed performance statistics of the three flavors of DJ-REX for the BIRT repository.
The total running time can be broken down into three parts: the preprocess time (black) is the time needed for the non-distributed phases, the copy data time (light blue) is the time taken for copying the input data into the distributed file system, and the process data time (white) is the time needed by the distributed phases. In Figure 7, the running time of DJ-REX3 is always the shortest, whereas DJ-REX1 always takes the longest time. The reason for this is that the undistributed black parts dominate the process time for DJ-REX1 and DJ-REX2, whereas in DJ-REX3 everything is distributed. Hence, the fully distributed DJ-REX3 is the most efficient one. In Figure 7, the process data time (white) decreases constantly. The MapReduce strategy of DJ-REX basically divides the job by files, which are processed independently from each other in different reducers. Hence, one could approximate the job's running time by dividing the total processing time by the number of Hadoop nodes. The more Hadoop nodes there are, the smaller the incremental benefit of extra nodes. In addition, a new node introduces more overhead, like network overhead and distributed file system data synchronization. Figure 7 clearly shows that the copy data time (light blue) increases when adding nodes, and hence that the performance with 4 nodes is not always the best one (e.g. for Datatools). Our experiments show that using Hadoop to support MSR research is an efficient and viable approach that can drastically reduce the required processing time.

Scalability

Eclipse has a large repository, BIRT has a medium-sized repository and Datatools has a small repository. From Figure 4 (Datatools), Figure 5 (BIRT), Figure 6 (Eclipse) and Figure 7 (BIRT), it is clear that Hadoop reduces the running time for each of the three repositories. When mining the small Datatools repository, the running time is reduced to 50%. The bigger the repository, the more time can be saved by Hadoop: the running time can be reduced to 36% and 30% of the non-Hadoop version for the BIRT and Eclipse repositories, respectively.

Figure 8. Running time comparison for BIRT and Datatools with DJ-REX3.

Figure 9. Running time of basic J-REX on a desktop and server machine, and of DJ-REX3 on 3 virtual machines on the same server machine.

Figure 8 shows that Hadoop scales well for different numbers of nodes (2 to 4) for BIRT and Datatools. We did not include the running time for Eclipse because of its large variance and the fact that we could not run Eclipse on the desktop machine (we could not fit the entire data into memory). However, from Figure 6 we know that the running time for Eclipse on the server machine is more than 12 hours, and that it only takes a quarter of this time (around 3.5 hours) using DJ-REX3. Unfortunately, we found that the performance of DJ-REX3 is not proportional to the amount of computing resources introduced. From Figure 8, we observe that adding a fourth node introduces additional overhead to our process, since copying the input data to another node outweighs the benefit of parallelizing the tasks across more machines. The optimal number of nodes depends on the mining problem and the MapReduce strategies that are being used, as outlined in Section 3.

Flexibility

Hadoop runs on many different platforms (i.e., Windows, Mac and Unix). In our experiments, we used server machines with and without SSD drives, and relatively slow desktop machines. Because of the load balance control in Hadoop, each machine is given a fair amount of work. Because network latency could be one of the major causes of the data copying overhead, we did an experiment with 3 Hadoop nodes running in 3 virtual machines on the Intel Quad Core server machine. Running only 3 virtual machines increases the probability that each Hadoop process has its own processor core, whereas running Hadoop inside virtual machines should eliminate the majority of the network latency. Figure 9 shows the running time of DJ-REX3 when deployed on 3 virtual machines on the same server machine.
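The trade-off between parallelism and copying overhead can be captured in a toy cost model; the formula and the numbers below are our own illustration of the shape of the curve, not measurements from the experiments:

```python
def estimated_running_time(nodes, preprocess, copy_per_node, parallel_work):
    # Non-distributed phases cost a constant amount, copying input into
    # the distributed file system grows with the number of nodes, and
    # only the distributed phases shrink as nodes are added.
    return preprocess + copy_per_node * nodes + parallel_work / nodes

# Hypothetical times in minutes, chosen only for illustration.
times = {n: estimated_running_time(n, preprocess=5, copy_per_node=8,
                                   parallel_work=72)
         for n in range(2, 6)}
best = min(times, key=times.get)
# best == 3: with these numbers, a fourth node costs more in copying
# than it saves in parallel processing.
```

Under such a model the optimal node count shifts with the ratio of copying overhead to parallelizable work, which is why it depends on the mining problem and the MapReduce strategy.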
The performance of DJ-REX3 in virtual machines turns out to be worse than that of the undistributed J-REX. We suspect that this happens because the virtual machine setup results in slower disk accesses than a deployment on a physical machine. This could be improved by using a redundant storage array (RAID) or a networked storage array, but that is future work. The ability to run Hadoop in a virtual machine can be used to deploy a large Hadoop cluster in a very short time by rapidly replicating and starting up virtual machines. A well-configured virtual machine could be deployed to run the mining process without any further configuration, which makes it very suitable for non-experts.

5 Discussion and Limitations

MapReduce on other software repositories. Multiple types of repositories are used in the MSR field, but in principle MapReduce could be used as a standard platform to speed up and scale up different analyses. The main challenge is deriving optimal mapping strategies. For example, a MapReduce strategy could split mailing list data by time or by sender name when mining a mailing list repository. Similarly, when mapping a bug report repository, the creator and the creation time of a bug report could be used as splitting criteria.

Incremental processing. Incremental processing is one possible way to deal with large repositories and extensive analyses. Instead of processing the data of a long history in one shot, one could incrementally process the data on a weekly or monthly basis. However, incremental processing requires more sophisticated designs of mining algorithms, and sometimes is simply not possible to achieve. Since researchers are mostly prototyping their ideas, a brute-force approach might be more desirable, with optimizations (such as incremental processing) to follow later. The low cost of migrating an analysis technique to MapReduce is negligible compared to the complexity of migrating a technique to support incremental processing.
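The splitting strategies suggested above boil down to choosing a map key. As a minimal sketch (the record fields and helper names below are hypothetical, not taken from J-REX or any real bug tracker), a map step emits the chosen attribute as the key, and the framework's shuffle phase then groups all records with the same key for one reducer:

```python
from collections import defaultdict

def map_bug_report(report):
    """Emit a (key, record) pair; here we split by the report's creator.
    Splitting by creation time instead would only change the key."""
    return (report["creator"], report)

def shuffle(pairs):
    """Group mapper output by key, as the MapReduce framework does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

reports = [
    {"id": 1, "creator": "alice", "created": "2009-01"},
    {"id": 2, "creator": "bob",   "created": "2009-02"},
    {"id": 3, "creator": "alice", "created": "2009-03"},
]
partitions = shuffle(map_bug_report(r) for r in reports)
# Each reducer can now analyze one creator's reports independently.
```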
One-time processing. One-time processing involves processing a repository once and then storing the result in a compact format for subsequent querying and analysis. Clearly, the cost of one-time processing is not a major concern. However, we believe that MapReduce can help in two ways: 1) scaling up the number of systems that can be analyzed, and 2) speeding up the prototyping phase. Using a MapReduce implementation, analyzing and querying a large system is simply faster than when doing one-time processing without MapReduce. Moreover, although one-time processing might require only a single pass through the data, it is often the case that the developers of a technique explore a lot of ideas as they are prototyping their algorithms, and have to debug the technique. The repositories must be analyzed time and time again in these cases. We believe MapReduce can help speed up the prototyping phase and offer researchers more timely feedback on their ideas.

Robustness. MapReduce and its Hadoop implementation offer a robust computation model which can deal with different kinds of failures at run-time. If certain nodes fail, the tasks belonging to the failed nodes are automatically re-assigned to other nodes, and all other nodes are notified to avoid trying to read data from the failed nodes. Dean et al. [7] reported cases in which over 80 nodes of a MapReduce cluster became unreachable, yet processing continued and finished successfully. This type of robustness permits the execution of Hadoop on laptops and non-dedicated machines, such that lab computers can join and leave a Hadoop cluster rapidly and easily based on the needs of their owners. For example, students can join a Hadoop cluster when they are away from their desk and leave it running until they are back.

Current limitations. Most of the current limitations are imposed by the implementation of Hadoop. Locality is one of the most important issues for a distributed platform, as network bandwidth is a scarce resource when processing a large amount of data. To address this problem, Hadoop replicates the data across the nodes and tries to always locate the nearest replica of the data.
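A back-of-the-envelope estimate (our own illustration, not a Hadoop measurement) shows why replication alone does not guarantee locality: with r replicas of a block spread over n nodes, the chance that a given node holds a local copy is only about r/n.

```python
def local_copy_probability(replicas, nodes):
    """Rough chance that a randomly placed block has a replica on a
    given node; ignores rack-awareness and locality-aware scheduling."""
    return replicas / nodes

# Hadoop's default of 3 replicas on a 100-node cluster: only a 3%
# chance that an arbitrary node holds the block it needs locally.
p = local_copy_probability(3, 100)  # 0.03
```

This is why Hadoop also tries to schedule each task near a replica of its input rather than relying on chance alone.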
In Hadoop, a typical configuration with hundreds of computers would by default store only 3 copies of the data. In this case, the chance of finding the required data stored on the local machine is very small. However, increasing the number of data copies requires more space and more time to put the large amount of data into the distributed file system, which in turn leads to more processing overhead. Deploying data into the HDFS file system is another limitation of Hadoop. In the current Hadoop version (0.19.0), all input data needs to be copied into HDFS, which introduces considerable overhead. As Figure 7 and Figure 8 show, the running time with 4 nodes may not be the shortest one. Finding the optimal Hadoop configuration is future work.

6 Related Work

Automated evolutionary extractors, optimized mining solutions and distributed computing platforms are the three areas of research most related to our work.

Automated evolutionary extractors. Hassan developed an evolutionary code extractor for the C language called C-REX [14]. The Kenyon framework [6] combines various source code repositories into one to facilitate software evolution research. Draheim et al. created Bloof [8], with which users can define custom evolution metrics from CVS logs. Alonso et al. developed Minero [5], which uses database techniques to integrate and manage data from software repositories. Godfrey et al. [13] developed evolutionary extractors that use metrics at the system and subsystem level to monitor the evolution of each release of Linux. In addition, Qu et al. [23] developed evolutionary extractors that track structural dependency changes at the file level for each release of the GCC compiler. Gall et al. [10, 11] have developed evolutionary extractors that track the co-change of files for each changelist in CVS. Fluri et al. [9] developed extractors which track source code changes. Zimmermann et al. [24] present an extractor which determines the changed functions for each changelist. All these tools could easily be ported to the MapReduce framework.
Optimizing mining solutions. To the authors' knowledge, there is only one related work that tries to optimize software mining solutions on large-scale data: D-CCFinder [18, 19]. D-CCFinder is a distributed implementation of CCFinder [17] that analyzes source code with a large size and a long history in a relatively short time. Unfortunately, this implementation is homegrown and specialized to CCFinder, and not open to other MSR techniques. More recently, the researchers behind D-CCFinder proposed to run CCFinder on a grid-based system [19].

Distributed platforms. There are several distributed platforms that implement MapReduce. The prototypical one is from Google [7]. The Google platform makes use of Google's file system, called GFS [12]. Phoenix is another implementation of the MapReduce model [21]; its main focus is on exploiting multi-core and multi-processor systems such as the Playstation Cell architecture. GridGain [2] is an open source implementation of MapReduce, but its main disadvantage is that it can only process data that fits in the JVM heap space. For the large amounts of data that we usually process in MSR, this is not a good choice. We chose Hadoop [3] due to its simple design and its wide user base.

7 Conclusions and Future Work

A scalable software mining solution should be adaptable, efficient, scalable and flexible. In this paper, we propose to use MapReduce as a general framework to support research in MSR. To validate our approach, we presented our experience of porting J-REX, an evolutionary code extractor for Java, to Hadoop, an open source implementation of MapReduce. Our experiments demonstrate that our new solution (DJ-REX) satisfies the four requirements of scalable software mining solutions. They also show that running our optimized solution (DJ-REX3) on a small local area network with 4 nodes requires 75% less time than running it on a desktop machine, and 66% less time than running it on a server machine. One of our future goals is to dynamically control the computation resources in our lab: if someone wants to pull a machine out of the platform, we do not want to reconfigure the whole platform, and if a machine becomes idle, one should be able to plug it back into the platform. In addition, we also plan to experiment with other technologies like HBase [4] to improve our current Hadoop deployment.

References

[1] Eclipse JDT.
[2] GridGain.
[3] Hadoop.
[4] HBase.
[5] O. Alonso, P. T. Devanbu, and M. Gertz. Database techniques for the analysis and exploration of software repositories. In Proc. of the 1st Workshop on Mining Software Repositories (MSR).
[6] J. Bevan, E. J. Whitehead, Jr., S. Kim, and M. Godfrey. Facilitating software evolution research with Kenyon. In Proc. of the 10th European Software Engineering Conference (ESEC/FSE).
[7] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1).
[8] D. Draheim and L. Pekacki. Process-centric analytical processing of version control data. In Proc. of the 6th Int. Workshop on Principles of Software Evolution (IWPSE).
[9] B. Fluri, H. C. Gall, and M. Pinzger. Fine-grained analysis of change couplings. In Proc. of the 4th Int. Workshop on Source Code Analysis and Manipulation (SCAM).
[10] H. Gall, K. Hajek, and M. Jazayeri. Detection of logical coupling based on product release history. In Proc. of the 14th Int. Conference on Software Maintenance (ICSM).
[11] H. Gall, M. Jazayeri, and J. Krajewski. CVS release history data for detecting logical couplings. In Proc. of the 6th Int. Workshop on Principles of Software Evolution (IWPSE).
[12] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In Proc. of the 19th ACM Symposium on Operating Systems Principles (SOSP).
[13] M. W. Godfrey and Q. Tu. Evolution in open source software: A case study. In Proc. of the 16th Int. Conference on Software Maintenance (ICSM).
[14] A. E. Hassan. Mining software repositories to assist developers and support managers. PhD thesis.
[15] A. E. Hassan. The road ahead for mining software repositories. In Frontiers of Software Maintenance (FoSM), pages 48-57.
[16] A. E. Hassan and R. C. Holt. Using development history sticky notes to understand software architecture. In Proc. of the 12th IEEE Int. Workshop on Program Comprehension (IWPC).
[17] T. Kamiya, S. Kusumoto, and K. Inoue. CCFinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Trans. Softw. Eng., 28(7).
[18] S. Livieri, Y. Higo, M. Matushita, and K. Inoue. Very-large scale code clone analysis and visualization of open source programs using distributed CCFinder: D-CCFinder. In Proc. of the 29th Int. Conference on Software Engineering (ICSE).
[19] Y. Manabe, Y. Higo, and K. Inoue. Toward efficient code clone detection on a grid environment. In Proc. of the 1st Workshop on Accountability and Traceability in Global Software Engineering (ATGSE).
[20] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia. Above the clouds: A Berkeley view of cloud computing. Technical report, University of California, Berkeley.
[21] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce for multi-core and multiprocessor systems. In Proc. of the 13th Int. Symposium on High Performance Computer Architecture (HPCA).
[22] G. Robles, J. M. Gonzalez-Barahona, M. Michlmayr, and J. J. Amor. Mining large software compilations over time: another perspective of software evolution. In Proc. of the 3rd Int. Workshop on Mining Software Repositories (MSR).
[23] Q. Tu and M. W. Godfrey. An integrated approach for studying architectural evolution. In Proc. of the 10th Int. Workshop on Program Comprehension (IWPC).
[24] T. Zimmermann, S. Diehl, and A. Zeller. How history justifies system architecture (or not). In Proc. of the 6th Int. Workshop on Principles of Software Evolution (IWPSE).
More informationLecture 1: January 22
CMPSCI 677 Distributed and Operating Systems Spring 2018 Lecture 1: January 22 Lecturer: Prashant Shenoy Scribe: Bin Wang 1.1 Introduction to the course The lecture started by outlining the administrative
More informationChurrasco: Supporting Collaborative Software Evolution Analysis
Churrasco: Supporting Collaborative Software Evolution Analysis Marco D Ambros a, Michele Lanza a a REVEAL @ Faculty of Informatics - University of Lugano, Switzerland Abstract Analyzing the evolution
More informationThe amount of data increases every day Some numbers ( 2012):
1 The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect
More information10 Steps to Virtualization
AN INTEL COMPANY 10 Steps to Virtualization WHEN IT MATTERS, IT RUNS ON WIND RIVER EXECUTIVE SUMMARY Virtualization the creation of multiple virtual machines (VMs) on a single piece of hardware, where
More information2/26/2017. The amount of data increases every day Some numbers ( 2012):
The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect to
More informationClustering Lecture 8: MapReduce
Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data
More informationDept. Of Computer Science, Colorado State University
CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [HADOOP/HDFS] Trying to have your cake and eat it too Each phase pines for tasks with locality and their numbers on a tether Alas within a phase, you get one,
More informationPSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets
2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department
More informationIBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics
IBM Data Science Experience White paper R Transforming R into a tool for big data analytics 2 R Executive summary This white paper introduces R, a package for the R statistical programming language that
More informationCloud Computing & Visualization
Cloud Computing & Visualization Workflows Distributed Computation with Spark Data Warehousing with Redshift Visualization with Tableau #FIUSCIS School of Computing & Information Sciences, Florida International
More informationGoogle File System. By Dinesh Amatya
Google File System By Dinesh Amatya Google File System (GFS) Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung designed and implemented to meet rapidly growing demand of Google's data processing need a scalable
More informationUsing Pig as a Data Preparation Language for Large-Scale Mining Software Repositories Studies: An Experience Report
Using Pig as a Data Preparation Language for Large-Scale Mining Software Repositories Studies: An Experience Report Weiyi Shang, Bram Adams, Ahmed E. Hassan Software Analysis and Intelligence Lab (SAIL),
More informationEmbedded Technosolutions
Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication
More informationInformation Retrieval
Information Retrieval Suan Lee - Information Retrieval - 04 Index Construction 1 04 Index Construction - Information Retrieval - 04 Index Construction 2 Plan Last lecture: Dictionary data structures Tolerant
More informationVirtual Machines. 2 Disco: Running Commodity Operating Systems on Scalable Multiprocessors([1])
EE392C: Advanced Topics in Computer Architecture Lecture #10 Polymorphic Processors Stanford University Thursday, 8 May 2003 Virtual Machines Lecture #10: Thursday, 1 May 2003 Lecturer: Jayanth Gummaraju,
More informationHDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung
HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per
More informationOperating Systems: Internals and Design Principles. Chapter 2 Operating System Overview Seventh Edition By William Stallings
Operating Systems: Internals and Design Principles Chapter 2 Operating System Overview Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Operating systems are those
More informationHow are Developers Treating License Inconsistency Issues? A Case Study on License Inconsistency Evolution in FOSS Projects
How are Developers Treating License Inconsistency Issues? A Case Study on License Inconsistency Evolution in FOSS Projects Yuhao Wu 1(B), Yuki Manabe 2, Daniel M. German 3, and Katsuro Inoue 1 1 Graduate
More informationHigh Throughput WAN Data Transfer with Hadoop-based Storage
High Throughput WAN Data Transfer with Hadoop-based Storage A Amin 2, B Bockelman 4, J Letts 1, T Levshina 3, T Martin 1, H Pi 1, I Sfiligoi 1, M Thomas 2, F Wuerthwein 1 1 University of California, San
More informationOracle Developer Studio 12.6
Oracle Developer Studio 12.6 Oracle Developer Studio is the #1 development environment for building C, C++, Fortran and Java applications for Oracle Solaris and Linux operating systems running on premises
More information