vhadoop: A Scalable Hadoop Virtual Cluster Platform for MapReduce-Based Parallel Machine Learning with Performance Consideration


2012 IEEE International Conference on Cluster Computing Workshops

Kejiang Ye, Xiaohong Jiang, Yanzhang He, Xiang Li, Haiming Yan, Peng Huang
College of Computer Science, Zhejiang University, Hangzhou, China
{yekejiang,jiangxh,heyanzhang,lixiang,yanhaiming,huangpeng}@zju.edu.cn

Abstract

Big data processing is becoming increasingly important because of the continuous growth of the amount of data generated in fields such as particle physics, human genomics, and earth observation. However, the efficiency of processing large-scale data on modern virtual infrastructure, especially on virtualized cloud computing infrastructure, is not clear. This paper focuses on the performance of the Hadoop virtual cluster and proposes vhadoop, a scalable Hadoop virtual cluster platform for large-scale MapReduce-based parallel data processing. We first describe the design and implementation of the vhadoop platform. We then perform a series of experiments to investigate both the static and dynamic performance of the platform, including the performance characterization of a cross-domain Hadoop virtual cluster and the live migration of a Hadoop virtual cluster. After that, we use the vhadoop platform to run six typical parallel clustering algorithms (Canopy, Dirichlet, Fuzzy k-means, k-means, MeanShift, and MinHash) on two typical datasets. Experimental results verify the efficiency of the vhadoop platform in processing MapReduce-based parallel machine learning applications.

Keywords: Hadoop; MapReduce; Virtual Cluster; Cloud Computing; Machine Learning; Big Data

I. INTRODUCTION

Big data [1] has recently received considerable attention because of the continuous growth of the amount of data generated in fields such as particle physics, human genomics, and earth observation.
How to compute, transfer, and store such huge volumes of data is a prominent challenge that will have a great impact on the traditional architectures and methods of computation, networking, and storage. MapReduce [2] is an efficient parallel programming model for data-intensive applications, with benefits such as simplicity, fault tolerance, and scalability. Hadoop [3] is the open-source implementation of MapReduce, which can process hundreds of terabytes of data on at least 10,000 cores. This efficient parallel programming model also benefits machine learning algorithms, such as clustering, classification, and recommendation, by improving their processing efficiency on big data sets.

Meanwhile, with the rapid development of virtualization technology [4] and cloud computing technology [5], the virtual machine (VM) will be the basic unit of computation in the cloud computing era. Virtualization provides an abstraction of hardware resources that enables multiple operating system instances to run simultaneously on a single physical machine (i.e., server consolidation) to improve resource utilization [6]. Another prominent advantage of virtualization is live migration, which refers to migrating a virtual machine from one physical machine to another while the virtual machine continues to execute. This is an effective means of improving dynamic manageability in the virtualized cloud computing data center [7, 8].

Although virtualization and MapReduce have each been widely studied for several years, there are relatively few studies on combining the two technologies, that is, on running MapReduce applications in a Hadoop virtual cluster environment. As cloud computing matures, big data processing on virtual infrastructure will become more and more common.
There are several reasons for this trend:

(i) Processing big data efficiently is a major challenge that requires parallel execution on distributed platforms. MapReduce is a popular parallel computing framework for big data processing.
(ii) In the cloud era, resource virtualization is a typical feature, and most tasks will be executed on virtual infrastructure. For example, users can simply rent a Hadoop virtual cluster from the Amazon EC2 cloud to run MapReduce tasks without purchasing expensive physical servers.
(iii) Virtualization brings many other benefits, such as rapid startup, dynamic configuration, and high scalability. The Hadoop virtual cluster can benefit from all of these advantages.
(iv) Moving data to computing resources is more expensive than moving computing resources (such as VMs) to data, because of the high overhead of transferring large amounts of data, whereas a virtual machine can be transferred (or migrated) from one physical machine to another with very low overhead.

In this paper, we propose vhadoop, a scalable Hadoop virtual cluster platform for large-scale MapReduce-based parallel data processing with performance consideration. We first describe the design and implementation of the vhadoop platform. Then we perform a series of experiments

to investigate both the static and dynamic performance of vhadoop, such as the performance characterization of a cross-domain virtual cluster and virtual cluster migration. After that, we use the vhadoop platform to run several typical parallel clustering tasks, including Canopy, Dirichlet, Fuzzy k-means, k-means, MeanShift, and MinHash, on two typical datasets. Experimental results verify the efficiency of the vhadoop platform in processing MapReduce-based parallel machine learning applications.

The rest of the paper is structured as follows. In Section II, we design and implement the vhadoop platform for parallel machine learning on a Hadoop virtual cluster. In Section III, we study both the static and dynamic performance of vhadoop. In Section IV, we use real parallel machine learning applications to verify the efficiency of the vhadoop platform. Section V presents the related work. Finally, we give our conclusion and future work in Section VI.

II. VHADOOP PLATFORM

In this section, we propose vhadoop, a platform for large-scale parallel machine learning on a Hadoop virtual cluster.

A. System Architecture & Flow

Figure 1 illustrates the vhadoop architecture for parallel machine learning. It consists of five main modules: the Virtualization Module, the Hadoop Module, the Machine Learning Algorithm Library, the nmon Monitor, and the MapReduce Tuner. The five modules cooperate with each other to provide a scalable Hadoop virtual cluster platform for parallel machine learning.

Figure 1. vhadoop Platform for the Parallel Machine Learning with Performance Consideration. (Diagram labels: nmon Monitor, Machine Learning Algorithm Library, MapReduce Tuner, Master, Physical Machine A, Physical Machine B, Input Data, Assign Maps, Map Phase, Assign Reduces, Reduce Phase, Output Data.)

The vhadoop execution flow is as follows:

1) The Machine Learning Algorithm Library triggers and sends a Hadoop virtual cluster request.
2) The Virtualization Module calls and starts a Hadoop virtual cluster.
3) The Hadoop Module configures the Hadoop parameters, such as the master and worker virtual machines.
4) The input data is prepared by uploading it to the Hadoop Distributed File System (HDFS).
5) The master virtual machine assigns maps and reduces to the worker virtual machines.
6) The mapping operation is performed.
7) The reducing operation is performed.
8) The output data is collected and analyzed. From the moment the whole process begins, both the master virtual machine and the worker virtual machines are monitored by the nmon Monitor.
9) The vhadoop performance can be adjusted by the MapReduce Tuner based on the monitoring data.

B. Platform Design & Implementation

Virtualization Module: the basic module that implements resource virtualization. With virtualization technology, one physical machine can be shared by several virtual machines. We currently use Xen [4] as the infrastructure virtualization layer. Xen supports live migration of virtual machines, which is often used for load balancing, energy saving, and online maintenance.

Hadoop Module: responsible for the initial configuration of the Hadoop virtual cluster. The parameters include the names of the master node and worker nodes, dfs.replication, dfs.block.size, map.tasks.maximum, reduce.tasks.maximum, etc. We currently configure Hadoop in the images of vhadoop.

Machine Learning Algorithm Library: the library of MapReduce-based parallel machine learning algorithms, covering clustering, classification, and recommendation. There

are various algorithms in each of these three categories. For example, Canopy, Dirichlet, Fuzzy k-means, k-means, MeanShift, and MinHash, which are used in this paper, are clustering algorithms. We build the algorithm library on the Mahout library, an open-source machine learning library for Hadoop.

nmon Monitor: responsible for monitoring the resource status of both the master virtual machine and the worker virtual machines. The utilization of CPU, memory, disk, and network is monitored, and performance bottlenecks can be found by analyzing the monitored data. nmon is an open-source performance monitor for traditional Linux systems that monitors the overall performance of the system. We extend it to our distributed vhadoop platform to monitor the performance of the nodes in parallel. nmon analyser is a companion tool that generates graphs from the nmon output files.

MapReduce Tuner: responsible for tuning the configuration parameters of the Hadoop virtual cluster. The adjustment can be made according to the results generated by the nmon Monitor. It can be implemented by reconfiguring the parameters of the vhadoop platform or by using live migration to adjust the vhadoop configuration dynamically.

III. PERFORMANCE ANALYSIS OF HADOOP VIRTUAL CLUSTER

In this section, we study both the static and dynamic performance of the Hadoop virtual cluster. In the static performance analysis, we mainly study the performance of a cross-domain Hadoop virtual cluster and the scalability of the Hadoop virtual cluster. In the dynamic performance analysis, we investigate the live migration performance of the Hadoop virtual cluster.

A. Experimental Configuration

1) Hadoop Virtual Cluster Configuration: All the experiments are performed on Dell T710 servers, each with two quad-core 64-bit Xeon E5620 processors at 2.40GHz and 32GB of DRAM. We use CentOS 5.6 with kernel version e15xen in Domain 0, and Xen as the virtualization hypervisor.
Each virtual machine is installed with Ubuntu 8.10 as the guest OS and configured with 1 VCPU and 1024MB of memory. The Hadoop version is , the Mahout version is 0.6. All the virtual machine images are stored on a separate NFS server.

2) MapReduce-based Benchmarks: We choose four typical MapReduce-based benchmarks to test the MapReduce and HDFS performance of the Hadoop virtual cluster. Table I describes the four benchmarks.

Table I. MapReduce-Based Parallel Benchmarks

Name        Category            Description
Wordcount   MapReduce           Reads text files and counts how often words occur
MRBench     MapReduce           Checks whether small job runs are responsive and running efficiently on the cluster
TeraSort    MapReduce & HDFS    Sorts the data as fast as possible, combining testing the HDFS and MapReduce layers
TestDFSIO   HDFS                Is a read and write test for HDFS

Figure 2. Performance Comparison of Wordcount Benchmark between Normal and Cross-Domain Hadoop Virtual Cluster.

The Wordcount benchmark reads text files and counts how often words occur. Each mapper takes a line as input and breaks it into words. It then emits a key/value pair of the word and 1. Each reducer sums the counts for each word and emits a single key/value pair with the word and the sum.

The MRBench benchmark [9] checks whether small jobs are responsive and run efficiently on the cluster. It focuses on the MapReduce layer, since its impact on the HDFS layer is very limited.

The TeraSort benchmark sorts 1TB of data (or any other amount of data) as fast as possible. It combines testing of the HDFS and MapReduce layers of a Hadoop cluster. A full TeraSort benchmark run consists of the following three steps: (i) generating the input data via TeraGen; (ii) running the actual TeraSort on the input data; (iii) validating the sorted output data via TeraValidate.

The TestDFSIO benchmark is a read and write test for HDFS. It is helpful for tasks such as stress testing HDFS and discovering performance bottlenecks in the network.
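The map and reduce steps of the Wordcount benchmark described above can be illustrated with a minimal single-process sketch in Python. This is not the Hadoop implementation used in the paper; the function names (map_line, shuffle, reduce_word, wordcount) are hypothetical, and the in-memory shuffle only stands in for the grouping that the MapReduce framework performs between the two phases.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_line(line: str) -> Iterator[Tuple[str, int]]:
    """Mapper: break one input line into words and emit (word, 1) pairs."""
    for word in line.split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group the emitted values by key (done by the framework in real MapReduce)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_word(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reducer: sum the counts for one word and emit (word, sum)."""
    return word, sum(counts)


def wordcount(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce sequentially over the input lines."""
    pairs = (pair for line in lines for pair in map_line(line))
    return dict(reduce_word(w, c) for w, c in shuffle(pairs).items())
```

For example, calling wordcount(["a b a", "b a"]) counts "a" three times and "b" twice. On a real cluster the mappers and reducers run on different worker virtual machines, so the shuffle step involves network transfer between nodes.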
3) Live Migration Benchmark: To measure the migration performance and overhead of the Hadoop virtual cluster, we extend our former Virt-LM benchmark [10] from single virtual machine migration to the migration of multiple virtual machines (a virtual cluster). The extended benchmark records the migration time and downtime of each virtual machine and of the whole virtual cluster.
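As a rough illustration of how per-VM records could be combined into cluster-level figures, the following Python sketch aggregates migration time and downtime. It is not the Virt-LM code; the VmMigration record and the summarize function are hypothetical, and the two cluster-level definitions used here (overall migration time as the span from the first start to the last end, overall downtime as the sum of per-VM downtimes) are assumptions, since the text does not define them.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class VmMigration:
    """Hypothetical record of one virtual machine's migration."""
    name: str
    start_s: float      # time at which the migration started, in seconds
    end_s: float        # time at which the migration finished, in seconds
    downtime_ms: float  # measured downtime, in milliseconds


def summarize(
    records: List[VmMigration],
) -> Tuple[Dict[str, Tuple[float, float]], float, float]:
    """Return per-VM (migration time, downtime) and the cluster-level totals."""
    if not records:
        raise ValueError("at least one migration record is required")
    per_vm = {r.name: (r.end_s - r.start_s, r.downtime_ms) for r in records}
    # Assumption: the cluster migration lasts from the first start to the last end.
    overall_time_s = max(r.end_s for r in records) - min(r.start_s for r in records)
    # Assumption: the cluster downtime is the sum of the per-VM downtimes.
    overall_downtime_ms = sum(r.downtime_ms for r in records)
    return per_vm, overall_time_s, overall_downtime_ms
```

Other definitions are equally plausible, for instance taking the maximum per-VM downtime when the virtual machines are migrated concurrently; only the two aggregation lines would change.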

Figure 3. Performance Comparison of MRBench Benchmark between Normal and Cross-Domain Hadoop Virtual Cluster. (a) Map Scales (b) Reduce Scales

Figure 4. Performance Comparison of TeraSort and DFSIO Benchmarks between Normal and Cross-Domain Hadoop Virtual Cluster. (a) TeraSort Test (b) DFSIO Test

4) Experimental Precision: To ensure the precision of the data, each experimental result shown was obtained by running the benchmark three times with the same configuration and averaging the three values.

B. Static Performance Analysis

Because of the large size of a virtual cluster and the limited resources of a physical machine, a virtual cluster may span multiple domains (physical machines). We create a 16-node Hadoop virtual cluster (1 namenode and 15 datanodes) to compare the performance of a cross-domain Hadoop virtual cluster with that of a normal Hadoop virtual cluster. In the cross-domain case, the 16 virtual machines are distributed equally across two physical machines, while in the normal case all 16 virtual machines are placed on a single physical machine.

Figure 2 shows the Wordcount performance when running on the normal and cross-domain Hadoop virtual clusters with 16 nodes. The input data is chosen from TOEFL (Test of English as a Foreign Language) reading materials. The figure shows that the running time increases as the size of the input data grows. Further, the cross-domain Hadoop virtual cluster performs worse than the normal case, which means that MapReduce performance can be noticeably affected by the cross-domain configuration because of the increased network I/O delay.

Figure 3 shows the MRBench performance. In Figure 3(a), we set the number of reduces to 1 and scale the number of maps from 1 to 6, while in Figure 3(b), we set the number of maps to 15 and scale the number of reduces from 1 to 6. From the figure, we find that as the number of maps and reduces grows, the running time increases quickly. It is because the concurrent running

will cause network congestion, which leads to a longer execution time. The performance of the cross-domain Hadoop virtual cluster is worse than the normal case, which is similar to the phenomenon in Figure 2.

Figure 4(a) shows both the data generation time and the sort time of the TeraSort benchmark. From the figure, we find that when the data size is small, both the data generation time and the sort time are relatively small. However, when the data size exceeds 400MB, the running time increases quickly. The performance of the cross-domain Hadoop virtual cluster is relatively worse.

Figure 4(b) shows the DFS performance with the DFSIO benchmark. From the figure, we find that the read throughput is better than the write throughput. The performance of the cross-domain Hadoop virtual cluster is worse than the normal case.

Discussion: From the above analysis, we find that when the data size and the number of concurrent tasks are small, the performance of the cross-domain and normal cases is very similar. The gap becomes increasingly evident as the data size or the number of concurrent tasks grows. The reason is that, as the data size and the number of concurrent tasks grow, the network communication overhead becomes the main bottleneck. Distributing virtual machines across multiple domains further degrades network communication performance, and thereby the performance of the MapReduce applications.

C. Dynamic Performance Analysis

Live migration is a key ingredient behind the management activities of a cloud computing system, used to achieve load balancing, energy saving, failure recovery, and system maintenance.

Figure 5. The Migration Overheads of Idle and Wordcount Hadoop Virtual Cluster with Different DRAM Configurations. (a) Migration Time. (b) Downtime.
Figure 5 shows the migration time and downtime of each node in the 16-node Hadoop virtual cluster as it migrates from one physical machine to the other. From the figure, we make the following observations:

(i) The larger the memory, the longer the migration time, while the downtime has no causal relationship with the size of memory.
(ii) The migration time of the Hadoop virtual cluster running the Wordcount benchmark is slightly longer than that of the idle Hadoop virtual cluster. However, its downtime is much longer than that of the idle Hadoop virtual cluster.
(iii) The downtime of each node in the Hadoop virtual cluster running the Wordcount benchmark varies widely because of the load imbalance among the nodes of the cluster.

Table II. Overall Migration Time and Downtime of 16-Node Hadoop Virtual Cluster
Columns: Overall Migration Time (s), Overall Downtime (ms)
Rows: idle.1024mb, idle.512mb, wordcount.1024mb, wordcount.512mb

Table II shows the overall migration time and downtime of the whole Hadoop virtual cluster. The migration time of the Hadoop virtual cluster running the Wordcount benchmark is about three times that of the idle Hadoop virtual cluster, while its downtime is about 13 times that of the idle Hadoop virtual cluster.

Discussion: From the above analysis, we find that live migration of a Hadoop virtual cluster incurs some overhead, especially in downtime. Fortunately, this is tolerable for the Hadoop virtual cluster because of the efficient fault tolerance mechanism in Hadoop itself. The unavailable service during

the period of downtime can be restored by re-sending the requests or by reading from other available copies of the data blocks. Despite a long downtime, the MapReduce workloads finish successfully.

Figure 6. Parallel Clustering on Synthetic Control Data Set with Different Hadoop Virtual Cluster Scales.

IV. PARALLEL MACHINE LEARNING ON HADOOP VIRTUAL CLUSTER

In this section, we run several typical parallel clustering algorithms on two data sets to illustrate the efficiency of running parallel machine learning on the vhadoop platform.

A. MapReduce-based Clustering Algorithms

Canopy Clustering is a very simple, fast, and accurate method for grouping objects into clusters. Each object is represented as a point in a multidimensional feature space. Canopy Clustering is often used as an initial step in more rigorous clustering techniques, such as k-means Clustering.

k-means Clustering is a rather simple but well-known algorithm for grouping objects. All objects need to be represented as a set of numerical features. In addition, the user has to specify the number of groups (referred to as k) to identify.

Fuzzy k-means Clustering is an extension of k-means, the popular simple clustering technique. While k-means discovers hard clusters (a point belongs to only one cluster), Fuzzy k-means is a more statistically formalized method that discovers soft clusters, in which a particular point can belong to more than one cluster with a certain probability.

Mean Shift Clustering produces arbitrarily shaped clusters depending on the topology of the data, without a priori knowledge of the number of clusters (as required in k-means).

Dirichlet Process Clustering performs Bayesian mixture modeling.

MinHash Clustering performs probabilistic dimension reduction of high-dimensional data. The essence of the technique is to hash each item using multiple independent hash functions such that the probability of collision of similar items is higher.
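The MinHash idea described above can be sketched in a few lines of Python. This is an illustrative sketch, not Mahout's implementation: the function names are hypothetical, each item is assumed to be a set of string features, and the linear hash family (a * x + b) mod p with CRC32 feature ids is one common choice made here only for illustration.

```python
import random
import zlib
from typing import Callable, Iterable, List, Sequence, Tuple

PRIME = 2_147_483_647  # 2**31 - 1, modulus for the assumed linear hash family


def make_hash_functions(count: int, seed: int = 0) -> List[Callable[[int], int]]:
    """Build `count` independent hash functions h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(count)]
    return [lambda x, a=a, b=b: (a * x + b) % PRIME for a, b in params]


def minhash_signature(
    features: Iterable[str], hash_functions: Sequence[Callable[[int], int]]
) -> Tuple[int, ...]:
    """For each hash function, keep the minimum hash value over all features."""
    ids = [zlib.crc32(f.encode("utf-8")) for f in features]  # stable integer ids
    if not ids:
        raise ValueError("cannot compute a signature for an empty feature set")
    return tuple(min(h(i) for i in ids) for h in hash_functions)


def estimated_similarity(sig_a: Sequence[int], sig_b: Sequence[int]) -> float:
    """Fraction of signature positions that collide for the two items."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Two items with identical feature sets always receive identical signatures, and items that share more features collide in more signature positions on average, which is the property that lets similar items fall into the same cluster or hash bucket.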
Multiple such hash tables can then be constructed to answer near-neighbor queries efficiently.

B. Clustering on Synthetic Control Chart Time Series Data Set

The Synthetic Control Chart Time Series Data Set contains 600 examples of control charts synthetically generated by the process described by Alcock and Manolopoulos. There are six different classes of control charts: (i) Normal, (ii) Cyclic, (iii) Increasing trend, (iv) Decreasing trend, (v) Upward shift, and (vi) Downward shift. We use this real data set to perform MapReduce-based parallel machine learning on the vhadoop platform.

Figure 6 shows the parallel clustering results on the synthetic control chart time series data set with different Hadoop virtual cluster sizes. From the figure, we find that the running time of all three clustering algorithms (Canopy, Dirichlet, and MeanShift) increases as the Hadoop virtual cluster scales from 2 nodes (1 namenode and 1 datanode) to 16 nodes (1 namenode and 15 datanodes). Because the size of the data set is fixed, a larger virtual cluster incurs more data communication among the nodes of the Hadoop virtual cluster.

C. Visualizing Sample Clustering

We use DisplayClustering to generate 1000 samples from three symmetric distributions. The data set can be used by the other clustering programs. It displays the points on

a screen and superimposes the model parameters that were used to generate the points.

Figure 7 shows the results of visualizing sample clustering on the vhadoop platform with different cluster sizes. Compared to Figure 6, the running time of visualizing sample clustering changes relatively smoothly as the size of the Hadoop virtual cluster scales from 2 to 16. This is because the workload of visualizing sample clustering is relatively light and can be finished quickly, and therefore does not put much pressure on the network.

Figure 8(a)-(f) shows screenshots of the sample points and the clustering results of the different clustering algorithms. They display the sample points and then superimpose all of the clusters from each iteration. The clustering results of the last iteration are in bold red, and those of the previous several iterations are colored (orange, yellow, green, blue, magenta) in order, after which all earlier clusters are in light grey. This helps to visualize how the clusters converge on a solution over multiple iterations.

V. RELATED WORK

Virtualization technology is becoming increasingly popular as a core technology for implementing the cloud computing paradigm. Many efforts have been made to study the performance characteristics of virtualization, including performance evaluation [11, 12], performance modeling [13-15], and performance optimization [16, 17]. Server consolidation [6] is one of the most important application scenarios of virtualization for improving resource utilization, while the live migration technique [18-21] is often used for load balancing, energy saving [22], online maintenance, etc., in cloud computing environments.

MapReduce is an efficient technique for processing huge amounts of data in parallel. Kambatla et al. [23] optimized Hadoop provisioning in the cloud to reduce cost and improve performance. Ibrahim et al.
compared the performance of a Hadoop cluster on virtual machines and on physical machines, and found that running MapReduce applications on virtual machines incurs additional performance degradation compared with running them on physical machines [24]. They also discussed the issues of implementing MapReduce on virtual machines by decoupling the storage unit from the computation unit to reduce the disk I/O overhead [25, 26]. Zaharia et al. pointed out that virtual machine interference, especially network I/O interference, is the main cause of the performance degradation of MapReduce systems [27]. However, these studies focus only on static performance analysis and do not address dynamic performance, i.e., the live migration performance of a Hadoop virtual cluster. Further, they do not address the problem of parallel machine learning on a Hadoop virtual cluster, which is becoming increasingly important in big data processing on virtualized cloud computing infrastructures.

VI. CONCLUSION

In this paper, we study the performance and efficiency of running MapReduce-based parallel machine learning applications on a Hadoop virtual cluster. We first propose vhadoop, a scalable Hadoop virtual cluster platform for parallel machine learning with performance consideration, built by binding the nmon performance monitor, the Mahout machine learning library, and the MapReduce tuner on the Xen virtualization platform. Then we perform a series of experiments to investigate both the static and dynamic performance of the Hadoop virtual cluster, such as the performance characterization of a cross-domain virtual cluster and virtual cluster migration, which is helpful for improving the performance of real Hadoop virtual clusters. After that, we verify the performance and efficiency of running MapReduce-based parallel machine learning applications, such as Canopy, Dirichlet, Fuzzy k-means, k-means, MeanShift, and MinHash, on our vhadoop platform.
Experimental results show that:

(i) Network I/O and NFS disk I/O are the two main bottlenecks of the vhadoop platform because of shared resource contention and interference. The poor I/O performance of the virtualization system and the heavy network communication in the Hadoop system make the network the main performance bottleneck.
(ii) Performance degrades when the data size or the cluster scale increases. The cross-domain distribution of the Hadoop virtual cluster also affects the communication performance of vhadoop.
(iii) vhadoop can perform the live migration of a Hadoop virtual cluster successfully. Although the service is unavailable during the period of downtime, the Hadoop fault tolerance mechanism will re-run the job or restore from other available backup data.
(iv) The vhadoop platform is efficient enough to run MapReduce-based parallel machine learning algorithms on real data sets.

Future work will include integrating the vhadoop platform into an open-source cloud computing system to provide a scalable on-demand computation service for processing data-intensive (or big-data) applications with parallel machine learning algorithms.

ACKNOWLEDGMENT

This work is supported by the National High Technology Research 863 Major Program of China (No. 2011AA01A207), the National Natural Science Foundation of China (No ), and the MOE-Intel Information Technology Foundation (No. MOE-INTEL-11-06).

REFERENCES

[1] C. Lynch, "Big data: How do your data grow?" Nature, vol. 455, no. 7209.
[2] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1.

Figure 7. Parallel Visualizing Sample Clustering with Different Hadoop Virtual Cluster Scales. (a) Canopy (b) Dirichlet (c) Fuzzy k-means (d) k-means (e) MeanShift (f) MinHash

Figure 8. The Screenshot of Clustering Results with Different Clustering Algorithms. (a) Sample Data (b) Canopy (c) Dirichlet (d) Fuzzy k-means (e) k-means (f) MeanShift

[3] T. White, Hadoop: The Definitive Guide. Yahoo Press.
[4] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, "Xen and the art of virtualization," in Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, 2003.
[5] M. Armbrust, A. Fox, R. Griffith, A. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, et al., "A view of cloud computing," Communications of the ACM, vol. 53, no. 4.
[6] P. Apparao, R. Iyer, X. Zhang, D. Newell, and T. Adelmeyer, "Characterization & analysis of a server consolidation benchmark," in Proceedings of the Fourth ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, 2008.
[7] C. Clark, K. Fraser, S. Hand, J. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield, "Live migration of virtual machines," in Proceedings of the 2nd Conference on Symposium on Networked Systems Design & Implementation, Volume 2, 2005.
[8] M. Nelson, B. Lim, and G. Hutchins, "Fast transparent migration for virtual machines," in Proceedings of the Annual Conference on USENIX Annual Technical Conference, 2005, p. 25.
[9] K. Kim, K. Jeon, H. Han, S. Kim, H. Jung, and H. Yeom, "MRBench: A benchmark for MapReduce framework," in Parallel and Distributed Systems (ICPADS), IEEE International Conference on. IEEE, 2008.
[10] D. Huang, D. Ye, Q. He, J. Chen, and K. Ye, "Virt-LM: a benchmark for live migration of virtual machine," in Proceedings of the Second ACM/SPEC International Conference on Performance Engineering (ICPE), 2011.
[11] K. Ye, J. Che, Q. He, D. Huang, and X. Jiang, "Performance combinative evaluation from single virtual machine to multiple virtual machine systems," International Journal of Numerical Analysis and Modeling, vol. 9, no. 2.
[12] L. Cherkasova and R. Gardner, "Measuring CPU overhead for I/O processing in the Xen virtual machine monitor," in Proceedings of the Annual Conference on USENIX Annual Technical Conference. USENIX Association, 2005.
[13] K. Ye, X. Jiang, S. Chen, D. Huang, and B. Wang, "Analyzing and modeling the performance in Xen-based virtual cluster environment," in IEEE International Conference on High Performance Computing and Communications (HPCC), 2010.
[14] O. Tickoo, R. Iyer, R. Illikkal, and D. Newell, "Modeling virtual machine performance: challenges and approaches," ACM SIGMETRICS Performance Evaluation Review, vol. 37, no. 3.
[15] S. Kundu, R. Rangaswami, K. Dutta, and M. Zhao, "Application performance modeling in a virtualized environment," in High Performance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on. IEEE, 2010.
[16] A. Menon, A. Cox, and W. Zwaenepoel, "Optimizing network virtualization in Xen," in Proc. USENIX Annual Technical Conference (USENIX 2006), 2006.
[17] D. Ongaro, A. Cox, and S. Rixner, "Scheduling I/O in virtual machine monitors," in Proceedings of the Fourth ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments. ACM, 2008.
[18] K. Ye, X. Jiang, D. Huang, J. Chen, and B. Wang, "Live migration of multiple virtual machines with resource reservation in cloud computing environments," in 2011 IEEE International Conference on Cloud Computing (CLOUD), 2011.
[19] U. Deshpande, X. Wang, and K. Gopalan, "Live gang migration of virtual machines," in Proceedings of the 20th International Symposium on High Performance Distributed Computing (HPDC), 2011.
[20] S. Al-Kiswany, D. Subhraveti, P. Sarkar, and M. Ripeanu, "Flock: virtual machine co-migration for the cloud," in Proceedings of the 20th International Symposium on High Performance Distributed Computing (HPDC), 2011.
[21] W. Voorsluys, J. Broberg, S. Venugopal, and R. Buyya, "Cost of virtual machine live migration in clouds: A performance evaluation," in 1st International Conference on Cloud Computing (CloudCom), 2009.
[22] K. Ye, D. Huang, X. Jiang, H. Chen, and S. Wu, "Virtual machine based energy-efficient data center architecture for cloud computing: a performance perspective," in Proceedings of the 2010 IEEE/ACM International Conference on Green Computing and Communications (GreenCom), 2010.
[23] K. Kambatla, A. Pathak, and H. Pucha, "Towards optimizing Hadoop provisioning in the cloud," in Proc. of the First Workshop on Hot Topics in Cloud Computing.
[24] S. Ibrahim, H. Jin, L. Lu, L. Qi, S. Wu, and X. Shi, "Evaluating MapReduce on virtual machines: The Hadoop case," Cloud Computing.
[25] S. Ibrahim, H. Jin, B. Cheng, H. Cao, S. Wu, and L. Qi, "Cloudlet: towards MapReduce implementation on virtual machines," in Proceedings of the 18th ACM International Symposium on High Performance Distributed Computing. ACM, 2009.
[26] S. Ibrahim, H. Jin, L. Lu, B. He, and S. Wu, "Adaptive disk I/O scheduling for MapReduce in virtualized environment," in Parallel Processing (ICPP), 2011 International Conference on. IEEE, 2011.
[27] M. Zaharia, A. Konwinski, A. Joseph, R. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments," in Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation. USENIX Association, 2008.


More information

Juxtaposition of Apache Tez and Hadoop MapReduce on Hadoop Cluster - Applying Compression Algorithms

Juxtaposition of Apache Tez and Hadoop MapReduce on Hadoop Cluster - Applying Compression Algorithms , pp.289-295 http://dx.doi.org/10.14257/astl.2017.147.40 Juxtaposition of Apache Tez and Hadoop MapReduce on Hadoop Cluster - Applying Compression Algorithms Dr. E. Laxmi Lydia 1 Associate Professor, Department

More information

EXTRACT DATA IN LARGE DATABASE WITH HADOOP

EXTRACT DATA IN LARGE DATABASE WITH HADOOP International Journal of Advances in Engineering & Scientific Research (IJAESR) ISSN: 2349 3607 (Online), ISSN: 2349 4824 (Print) Download Full paper from : http://www.arseam.com/content/volume-1-issue-7-nov-2014-0

More information

Review on Managing RDF Graph Using MapReduce

Review on Managing RDF Graph Using MapReduce Review on Managing RDF Graph Using MapReduce 1 Hetal K. Makavana, 2 Prof. Ashutosh A. Abhangi 1 M.E. Computer Engineering, 2 Assistant Professor Noble Group of Institutions Junagadh, India Abstract solution

More information

MI-PDB, MIE-PDB: Advanced Database Systems

MI-PDB, MIE-PDB: Advanced Database Systems MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:

More information

Aeromancer: A Workflow Manager for Large- Scale MapReduce-Based Scientific Workflows

Aeromancer: A Workflow Manager for Large- Scale MapReduce-Based Scientific Workflows Aeromancer: A Workflow Manager for Large- Scale MapReduce-Based Scientific Workflows Presented by Sarunya Pumma Supervisors: Dr. Wu-chun Feng, Dr. Mark Gardner, and Dr. Hao Wang synergy.cs.vt.edu Outline

More information

A Comparative study of Clustering Algorithms using MapReduce in Hadoop

A Comparative study of Clustering Algorithms using MapReduce in Hadoop A Comparative study of Clustering Algorithms using MapReduce in Hadoop Dweepna Garg 1, Khushboo Trivedi 2, B.B.Panchal 3 1 Department of Computer Science and Engineering, Parul Institute of Engineering

More information

The Load Balancing Research of SDN based on Ant Colony Algorithm with Job Classification Wucai Lin1,a, Lichen Zhang2,b

The Load Balancing Research of SDN based on Ant Colony Algorithm with Job Classification Wucai Lin1,a, Lichen Zhang2,b 2nd Workshop on Advanced Research and Technology in Industry Applications (WARTIA 2016) The Load Balancing Research of SDN based on Ant Colony Algorithm with Job Classification Wucai Lin1,a, Lichen Zhang2,b

More information