Exploiting Heterogeneity in the Public Cloud for Cost-Effective Data Analytics


Gunho Lee, Byung-Gon Chun, Randy H. Katz
University of California, Berkeley; Intel Labs Berkeley

Abstract

Data analytics are key applications running in the cloud. Clouds provide dynamic provisioning that enables users to scale their clusters in response to demand. These clouds comprise heterogeneous server platforms, but current data analytics applications do not take this heterogeneity into account. In this paper, we rethink resource allocation and scheduling for data analytics in the cloud to take advantage of the heterogeneity of the platforms.

1 Introduction

As cloud computing services become more popular, more services are run in a public cloud. For example, a number of social game developers use a cloud to serve their game applications, as cloud computing provides fast deployment and easy scaling [4]. Such applications generate a huge amount of data about in-game activities and events, and analyzing the data is important to monitor user behavior and improve game play through data-driven iterative development processes [3]. Hence, it is desirable to store data and perform data analytics in the same cloud where the service runs, rather than moving the data to a separate data analytics cluster, to minimize the response time and cost of queries.

When we have a data analytics cluster in the cloud, our primary interest is likely to be maintaining the cluster at minimum cost while meeting the performance requirements. To do so, there are two key questions we need to address. One is how to allocate resources to the data analytics cluster. As cloud computing provides fast provisioning, we can add more machines to the cluster in minutes when the demand surges. Similarly, we can remove and terminate machines when the cluster is idle to reduce the cost of maintaining the cluster.
As in other applications such as web services and storage systems [6], a data analytics cluster in the cloud should also take advantage of the dynamic nature of cloud computing to be maintained in a cost-effective manner. The cost associativity [5] of cloud computing (e.g., using 1,000 EC2 machines for 1 hour costs the same as using 1 machine for 1,000 hours) is the key to realizing a cost-effective data analytics cluster in the cloud.

While allocating resources, heterogeneity should be considered to exploit the full potential of cloud services. Typically, cloud services provide various machine types. For example, Amazon EC2 offers 11 instance types with different specs and prices. One of them even allows access to a GPU that can be used to accelerate a particular application significantly [9]. Hence, we should allocate the type of resource that fits the workload characteristics to achieve high performance per cost. This will lead to a heterogeneous cluster, which raises another interesting question.

Given the heterogeneous nature of the data analytics cluster in the cloud, the other question we need to address is how to schedule jobs and tasks in the cluster. Especially if there are multiple jobs to run concurrently, the data analytics system should be aware of the heterogeneity of resources and jobs, and then make scheduling decisions so that each job is assigned to the resources on which it performs best. It is also important to provide fairness among jobs so that each job receives a fair amount of resources and does not starve in the heterogeneous environment.

In this paper, we present an architecture and design considerations for building a cost-effective data analytics system in the cloud. We first present background on Hadoop, Amazon EC2, and the Data Analytic Cloud in Section 2. In Section 3, we present resource allocation and scheduling issues in heterogeneous environments and our system architecture that addresses these issues.
We illustrate the potential benefits of our approach with case studies in Section 4 and conclude in Section 5.

2 Background

This paper focuses on a data analytic cloud, which operates as a MapReduce cluster that relies on providers to host large data sets on a distributed storage system and performs data processing on an analytic engine in public clouds, as described in [7]. In particular, we consider the Hadoop Distributed File System (HDFS) and Hadoop MapReduce as the storage layer and the analytic engine in this paper. We use Amazon EC2 as an example of a public cloud service.

HDFS comprises multiple DataNodes, which store blocks, and a NameNode, which maintains metadata such as filenames and block locations. A file in HDFS is divided into blocks. Replicas of each block are stored on nodes that run DataNodes in the cluster. By default, the block size is 64 MB, and each block has three replicas.

On top of HDFS, Hadoop MapReduce is used to process data. Hadoop MapReduce has multiple TaskTrackers that perform the actual computations on participating nodes and a JobTracker that manages jobs and TaskTrackers. Users submit jobs that consist of a map function and a reduce function to the JobTracker, which then launches map and reduce tasks on TaskTrackers. Each map task reads and processes one block from HDFS to generate intermediate results, which are in turn fed into reduce tasks to produce the final result. Each TaskTracker holds a fixed number of slots to host map tasks and reduce tasks.

Amazon EC2 is a public cloud service that enables users to lease virtual machines. Amazon charges per hour and per machine for this service without requiring any long-term commitment. Users can choose the spec of their virtual machines from 11 instance types with different prices. Users can also launch virtual machines in multiple locations across the world. EC2 charges for data transfer in and out of machines, but all transfers between machines in the same location are free.
Hence, there is a strong economic incentive for users to keep network communication within the same location. There are many other cloud providers with different pricing schemes, but their basic structures are similar.

3 Architecture

Two kinds of queries enter a data analytics system. One is a production query, which is predefined and routinely processed. Usually, it runs against data appended since the previous run and may have a deadline. The other is an ad-hoc query, which is entered once to answer a certain question as the need arises. A data analytics system in the cloud should handle both types of queries in a cost-effective manner. To that end, we propose an architecture for building a data analytics system in the cloud, as seen in Figure 1.

[Figure 1: Data Analytic Cloud Architecture. Components shown: Cloud Driver, Job, Analytics Engine (MapReduce), Storage (HDFS), Accelerator.]

In this architecture, participating nodes are grouped into one of two pools: long-lived core nodes that host both data and computations, and accelerator nodes that are added to the cluster temporarily when additional computing power is needed. An analytic engine (e.g., Hadoop MapReduce) runs on nodes in both pools, while a storage system (e.g., HDFS) is deployed only on core nodes. The cloud driver manages the nodes allocated to the analytic cloud and decides when to add or remove nodes, of what type, to or from which pool, and how many.

Users submit a job to the cloud driver with a few hints about the job characteristics, including the memory requirement, the ability to use a GPU, and the deadline, if available. The cloud driver keeps the history of query executions to estimate the submission rate of production queries and their resource consumption rates. It also monitors the storage system to determine the incoming data rate.

3.1 Resource Allocation

The cloud driver is responsible for allocating resources to the cluster. The number of core nodes is determined primarily by the storage size requirement.
In addition, the cloud driver refers to the history to see whether more nodes should be added to the core pool to accommodate the production queries. When many production queries with tight deadlines are anticipated or a large ad-hoc query is submitted, the cloud driver adds nodes to the accelerator pool temporarily to handle them, rather than allocating too many core nodes that will be underutilized.
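As a rough illustration of storage-driven sizing, the sketch below estimates a lower bound on the size of the core pool from the data volume. It assumes the HDFS default of three replicas described in Section 2. The function name, the 850 GB per-node disk size, and the 1,000 GB data volume are hypothetical values chosen for illustration; the cloud driver may add further core nodes based on the query history.

```python
import math


def core_nodes_for_storage(data_gb, disk_gb_per_node, replication=3):
    """Lower bound on core nodes needed to hold all block replicas.

    Hypothetical helper: it considers only raw disk capacity and ignores
    headroom for intermediate data or future growth.
    """
    return math.ceil(data_gb * replication / disk_gb_per_node)


# Hypothetical inputs: 1,000 GB of data on nodes with 850 GB of disk each.
nodes = core_nodes_for_storage(data_gb=1000, disk_gb_per_node=850)
print(nodes)
```

Under these assumed inputs, 3,000 GB of replicated data does not fit on three nodes, so the estimate rounds up to four.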

Table 1: Example of EC2 Instances. CU represents Compute Unit.
Columns: Instance Type, $/Hour, Disk (GB), $/GB/mon, CU, $/CU, Mem (GB), GB/, I/O

Instance Type    I/O
m1.small         moderate
m1.large         high
m2.2xlarge       high
c1.medium        moderate
cc1.4xlarge      very high
cg1.4xlarge      very high

When adding nodes, the cloud driver also decides which instance type to use. If we consider only the cost of storage, using m1.large instances is the cheapest, as seen in Table 1. However, as they have the least computing power among the instance types, we might need more nodes to accommodate the production queries. In this case, using other instances such as c1.medium can be more efficient in terms of the total expense. Moreover, some jobs may run significantly faster on nodes of a particular instance type, such as cg1.4xlarge instances with GPUs. Hence, it is important to know the relationship between a job and an instance type, which we call job affinity, to find a good mix of different instances for a cost-effective cluster.

To figure out job affinity, a newly submitted job goes through a calibration phase in which at least one task of the job is assigned to a node of each available instance type. The cloud driver then knows the computing rate of the job on each instance type.

As a result, the cluster is likely to become heterogeneous. To utilize the full potential of a heterogeneous cluster, the analytics engine should be aware of node specifications and job affinity so that it can assign the tasks of a job to nodes appropriately. It is also responsible for making each job go through a calibration phase so that the cloud driver and the analytics engine can initialize and update the job affinity.

3.2 Scheduling

Once the cloud driver allocates a set of resource units such as virtual machines, the analytics engine uses the resources. When the engine runs multiple jobs at the same time, it needs to consider heterogeneity for both performance and fairness.
In this section, we describe how the analytics engine estimates the computing rate of a job on heterogeneous nodes and how it uses that rate to make smarter scheduling decisions for performance and fairness. To that end, we use Hadoop MapReduce as an example analytics engine.

3.2.1 Share and Fairness

When a slot becomes available and there are multiple jobs to run, the Hadoop scheduler picks a job to assign to the slot. The Hadoop fair scheduler [2] realizes fair scheduling by having each job occupy an equal number of slots. However, the number of slots does not reflect the true share of a job, as computation slots are not equal. Moreover, even on the same slot, the computation speed varies depending on the job. Hence, we need to redefine share in a heterogeneous cluster. Ghodsi et al. [8] introduced a fairness policy called Dominant Resource Fairness (DRF) for use in a heterogeneous cluster, based on the user's report on resource requirements. To account for the performance variance on different resources, we introduce Progress Share, which captures the computation speed of a job on given slots.

The share of a job is the portion of computing resources to which the job is assigned. In a heterogeneous cluster, the contribution of a particular resource to the share depends on the workload characteristics of the job and the specification of the resource. Suppose that there are two jobs, J1 and J2, each running on one of two different nodes, N1 and N2. If J1 runs x times faster on N1 than on N2, we can say that N1 gives x times more share, or computing power, than N2 to J1. On the other hand, if J2 runs as fast on N1 as on N2, both nodes give the same share to J2. Specifically, J1 has an x/(x + 1) share of the cluster and J2 has a 1/2 share when J1 runs on N1 and J2 runs on N2. If the jobs switch places, the share of J1 changes to 1/(x + 1), whereas the share of J2 stays the same.
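The two-job example can be checked numerically. The following is a minimal sketch that treats the share of a job as its computing rate on the slots it occupies divided by its computing rate summed over all slots; the function name and the speedup x = 3 are assumptions made for illustration, and each node is treated as a single slot.

```python
from fractions import Fraction


def progress_share(rates, assigned_slots):
    """Rate of the job on its assigned slots over its rate on every slot."""
    return sum(rates[s] for s in assigned_slots) / sum(rates.values())


x = Fraction(3)  # hypothetical speedup of J1 on N1 relative to N2
j1_rates = {"N1": x, "N2": Fraction(1)}
j2_rates = {"N1": Fraction(1), "N2": Fraction(1)}

# J1 runs on N1 and J2 runs on N2.
ps_j1 = progress_share(j1_rates, ["N1"])          # x / (x + 1)
ps_j2 = progress_share(j2_rates, ["N2"])          # 1 / 2

# The jobs switch places.
ps_j1_swapped = progress_share(j1_rates, ["N2"])  # 1 / (x + 1)
ps_j2_swapped = progress_share(j2_rates, ["N1"])  # 1 / 2

print(ps_j1, ps_j2, ps_j1_swapped, ps_j2_swapped)
```

With x = 3, the shares sum to 5/4 in the first placement and to 3/4 after the swap, so the first placement uses the two nodes more effectively.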
As seen in the example above, once we figure out the per-job slot computing rates, they can be used to calculate the actual share of each job, which we call Progress Share. In general, given that CR(s, j) is the computing rate of slot s for job j, which is determined during the calibration phase, the progress share PS(j) of job j is as follows:

$$PS(j) = \frac{\sum_{s \in S_j} CR(s, j)}{\sum_{s \in S} CR(s, j)} \tag{1}$$

where S_j is the set of slots running tasks of job j and S is the set of all slots. It is worth noting that the sum of the progress shares of all jobs indicates the effectiveness of the resource assignment. If it is below 1, there is an alternative assignment that makes the sum of the progress shares above, or at least equal to, 1.

3.2.2 Scheduling

The analytics engine scheduler should make sure that each job goes through the calibration phase to determine the per-slot computing rate, CR, which is needed to calculate the progress share of each job. It should use the progress share to make an appropriate scheduling decision so that the cluster is better utilized (i.e., the sum of the progress shares is at least 1) while each job receives a fair amount of computing resources.

To do so, the core nodes include at least one node per group of instance types with similar specs. When a job is submitted, it starts the calibration phase, in which at least one map task is launched on a node in each group. By measuring the completion time of these tasks, the scheduler can determine the per-job slot computing rate. This information is also fed into the cloud driver so that the driver can determine the job's affinity to a particular instance type.

During the normal phase, the scheduler works similarly to the Hadoop fair scheduler. When a slot becomes available, a job that has less than its minimum or fair share is chosen. It is also noteworthy that there is a mechanism called delay scheduling [10], which has a job yield to another job for a short period of time if it has no data-local task, in order to increase data locality. Similarly, if another job has a significantly higher computing rate on the slot, the slot can be yielded to that job.

[Figure 2: Job executions with various configurations. Horizontal axis: Time.]

[Figure 3: Cost/performance trade-off. Axes: Cost ($ per 4 hours), Speed (normalized).]

4 Case Study

In this section, we examine two case studies to illustrate the potential benefits of our system.
First, we consider the case in which a number of production queries are issued every four hours. We chose five jobs from the gridmix2 benchmark [1] as production queries and used the FIFO scheduler in Hadoop. If we use 5 m1.large instances as core nodes, their capacity is overloaded by the production queries and, thus, the response time of a query keeps increasing over time. So, we need to add more nodes to the core pool or to the accelerator pool.

Figure 2 shows the number of pending jobs in the queue during one round of the production queries on clusters with various configurations. 9 m1 means using 9 m1.large instances as core nodes, 5 m1 + 4 c1 represents a cluster with 5 m1.large core nodes and 4 c1.medium accelerator nodes, and so on. As we can see, the jobs finished earlier as more accelerator nodes were added. In other words, if the queries have tighter deadlines, it is possible to meet the deadlines at additional expense.

Figure 3 shows the cost-performance trade-off. Note that c1.medium instances provide higher performance per cost, as they have faster CPUs at lower prices than m1.large, as seen in Table 1. However, as they are equipped with less memory, they may be of no use for jobs that require a large amount of memory. The graph also shows that using more accelerators can cost less while allowing jobs to finish faster. Because the cluster with 6 c1.medium accelerator nodes can finish all jobs within 2 hours, we need to pay for only 2 hours of accelerator node usage. On the other hand, it takes more than 2 hours with 4 c1.medium accelerators, which leads to charges for 3 hours. In this case study, we can observe that there are a number of ways to allocate resources, each with a different cost-performance trade-off. Making the right decision keeps the cluster cost-effective.

The second case is when there is significant job affinity, such as a job that utilizes a GPU. We used a GPU-accelerable dictionary-based cryptographic attack job (Crypto) and a sort job (Sort) for this case. We prepared a small cluster consisting of two nodes and used the fair scheduler.

[Figure 4: Effect of accelerable jobs and the progress share. Axes: Completion Time (min) by Cluster Configuration; series: Crypto, Sort.]

The first bars in Figure 4 show the completion time of each job when both nodes are cc1.4xlarge instances. For the second bars, we replaced one node with a cg1.4xlarge instance, which is identical to cc1.4xlarge except for the GPUs with which it is equipped. Here, we can observe a significant performance gain from using GPUs. This suggests the benefit of having a heterogeneous cluster.

However, just having a heterogeneous cluster does not lead to optimal usage, as the scheduler will assign tasks randomly without being aware of the hardware and job affinity. The last bars show the results with a hardware-aware scheduler that adopts progress share. By giving the Crypto job priority on the nodes with GPUs, the job was completed much earlier. It is worth noting that even the Sort job finished earlier than in the other cases. This is because Sort receives more than half of the slots at any moment. Assigning a fraction of a cg1.4xlarge node to Crypto contributes a significant amount towards its progress share, which in turn makes the Sort job receive more resources to match its progress share to Crypto's. Even if Sort receives more slots than Crypto, we argue that this is fairer than giving the same number of slots to each job (as the current Hadoop does), because Crypto already receives significantly preferred resources, which make it progress faster.
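The way matching progress shares hands Sort more slots can be reproduced with a small greedy sketch. This is not the scheduler used in the case study; it is a simplified rule assumed here for illustration: each free slot goes to the job with the lowest progress share, which takes the slot type it runs fastest on and, among equally fast slots, the one the other job values least. The four slots per node and the tenfold GPU speedup for Crypto are hypothetical numbers.

```python
from fractions import Fraction

# Hypothetical per-slot computing rates CR(s, j); not measured values.
RATES = {
    "Crypto": {"cc1": Fraction(1), "cg1": Fraction(10)},
    "Sort": {"cc1": Fraction(1), "cg1": Fraction(1)},
}
# Hypothetical cluster: one cc1 node and one cg1 node, four slots each.
SLOTS = ["cc1"] * 4 + ["cg1"] * 4


def progress_share(job, assigned):
    """PS(j): rates on the job's slots over its rates on all slots."""
    total = sum(RATES[job][s] for s in SLOTS)
    return sum(RATES[job][s] for s in assigned) / total


def greedy_assign():
    """Hand out slots one at a time to the job with the lowest share."""
    free = list(SLOTS)
    assigned = {job: [] for job in RATES}
    while free:
        # Ties go to the job listed first in RATES.
        job = min(RATES, key=lambda j: progress_share(j, assigned[j]))
        others = [j for j in RATES if j != job]
        # Fastest slot type for this job; among equals, least useful to others.
        slot = max(
            free,
            key=lambda s: (RATES[job][s], -max(RATES[o][s] for o in others)),
        )
        free.remove(slot)
        assigned[job].append(slot)
    return assigned


assignment = greedy_assign()
shares = {job: progress_share(job, slots) for job, slots in assignment.items()}
for job in RATES:
    print(job, len(assignment[job]), "slots, progress share", shares[job])
```

Under these assumptions, Crypto ends up with three GPU slots and Sort with the remaining five slots, yet their progress shares are close (15/22 and 5/8) and sum to more than 1.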
In this case study, we can see that the scheduling algorithm of the analytics engine plays an important role in improving performance as well as in scheduling jobs fairly.

5 Conclusion

As the cloud environment provides dynamic provisioning of heterogeneous servers, it is important to exploit these features to make a data analytics cluster in the cloud cost-effective. In this paper, we presented a system architecture to build such a cluster and discussed resource allocation and scheduling algorithms that take advantage of the heterogeneity of the cloud.

References

[1] Gridmix. mapreduce/docs/current/gridmix.html.
[2] Hadoop fair scheduler. apache.org/common/docs/r0.20.1/fair_scheduler.html.
[3] Playmetrix launches beta of its game analytics service. press-releases/69065/playmetrix-Game-Analytics.
[4] Social gaming applications. rightscale.com/products/cloudcomputing-uses/social-gamingapplications.php.
[5] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. H. Katz, A. Konwinski, G. Lee, D. A. Patterson, A. Rabkin, I. Stoica, and M. Zaharia. Above the clouds: A Berkeley view of cloud computing. Technical Report UCB/EECS, EECS Department, University of California, Berkeley, Feb.
[6] M. Armbrust, A. Fox, D. Patterson, N. Lanham, H. Oh, B. Trushkowsky, and J. Trutna. SCADS: Scale-independent storage for social computing applications. In Conference on Innovative Data Systems Research (CIDR).
[7] N. Chohan, C. Castillo, M. Spreitzer, M. Steinder, A. Tantawi, and C. Krintz. See spot run: Using spot instances for MapReduce workflows. In HotCloud.
[8] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. Dominant resource fairness: Fair allocation of heterogeneous resources in datacenters. Technical Report UCB/EECS, EECS Department, University of California, Berkeley, May.
[9] J. Polo, D. Carrera, Y. Becerra, V. Beltran, J. T., and Eduard Ayguadé. Performance management of accelerated MapReduce workloads in heterogeneous clusters. In 39th International Conference on Parallel Processing (ICPP2010).
[10] M. Zaharia et al. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In EuroSys.


More information

3. Monitoring Scenarios

3. Monitoring Scenarios 3. Monitoring Scenarios This section describes the following: Navigation Alerts Interval Rules Navigation Ambari SCOM Use the Ambari SCOM main navigation tree to browse cluster, HDFS and MapReduce performance

More information

Elastic Efficient Execution of Varied Containers. Sharma Podila Nov 7th 2016, QCon San Francisco

Elastic Efficient Execution of Varied Containers. Sharma Podila Nov 7th 2016, QCon San Francisco Elastic Efficient Execution of Varied Containers Sharma Podila Nov 7th 2016, QCon San Francisco In other words... How do we efficiently run heterogeneous workloads on an elastic pool of heterogeneous resources,

More information

The Optimization and Improvement of MapReduce in Web Data Mining

The Optimization and Improvement of MapReduce in Web Data Mining Journal of Software Engineering and Applications, 2015, 8, 395-406 Published Online August 2015 in SciRes. http://www.scirp.org/journal/jsea http://dx.doi.org/10.4236/jsea.2015.88039 The Optimization and

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models RCFile: A Fast and Space-efficient Data

More information

Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments

Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments Nikos Zacheilas, Vana Kalogeraki Department of Informatics Athens University of Economics and Business 1 Big Data era has arrived!

More information

An Indian Journal FULL PAPER ABSTRACT KEYWORDS. Trade Science Inc. The study on magnanimous data-storage system based on cloud computing

An Indian Journal FULL PAPER ABSTRACT KEYWORDS. Trade Science Inc. The study on magnanimous data-storage system based on cloud computing [Type text] [Type text] [Type text] ISSN : 0974-7435 Volume 10 Issue 11 BioTechnology 2014 An Indian Journal FULL PAPER BTAIJ, 10(11), 2014 [5368-5376] The study on magnanimous data-storage system based

More information

Chapter 5. The MapReduce Programming Model and Implementation

Chapter 5. The MapReduce Programming Model and Implementation Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing

More information

Nexus: A Common Substrate for Cluster Computing

Nexus: A Common Substrate for Cluster Computing Nexus: A Common Substrate for Cluster Computing Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Scott Shenker, Ion Stoica University of California, Berkeley {benh,andyk,matei,alig,adj,shenker,stoica}@cs.berkeley.edu

More information

Sparrow. Distributed Low-Latency Spark Scheduling. Kay Ousterhout, Patrick Wendell, Matei Zaharia, Ion Stoica

Sparrow. Distributed Low-Latency Spark Scheduling. Kay Ousterhout, Patrick Wendell, Matei Zaharia, Ion Stoica Sparrow Distributed Low-Latency Spark Scheduling Kay Ousterhout, Patrick Wendell, Matei Zaharia, Ion Stoica Outline The Spark scheduling bottleneck Sparrow s fully distributed, fault-tolerant technique

More information

CAVA: Exploring Memory Locality for Big Data Analytics in Virtualized Clusters

CAVA: Exploring Memory Locality for Big Data Analytics in Virtualized Clusters 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing : Exploring Memory Locality for Big Data Analytics in Virtualized Clusters Eunji Hwang, Hyungoo Kim, Beomseok Nam and Young-ri

More information

Decentralized Grid Management Model Based on Broker Overlay

Decentralized Grid Management Model Based on Broker Overlay Decentralized Grid Management Model Based on Broker Overlay Abdulrahman Azab 1, Hein Meling 1 1 Dept. of Computer Science and Electrical Engineering, Faculty of Science and Technology, University of Stavanger,

More information

Dynamic processing slots scheduling for I/O intensive jobs of Hadoop MapReduce

Dynamic processing slots scheduling for I/O intensive jobs of Hadoop MapReduce Dynamic processing slots scheduling for I/O intensive jobs of Hadoop MapReduce Shiori KURAZUMI, Tomoaki TSUMURA, Shoichi SAITO and Hiroshi MATSUO Nagoya Institute of Technology Gokiso, Showa, Nagoya, Aichi,

More information

Pocket: Elastic Ephemeral Storage for Serverless Analytics

Pocket: Elastic Ephemeral Storage for Serverless Analytics Pocket: Elastic Ephemeral Storage for Serverless Analytics Ana Klimovic*, Yawen Wang*, Patrick Stuedi +, Animesh Trivedi +, Jonas Pfefferle +, Christos Kozyrakis* *Stanford University, + IBM Research 1

More information

Database Applications (15-415)

Database Applications (15-415) Database Applications (15-415) Hadoop Lecture 24, April 23, 2014 Mohammad Hammoud Today Last Session: NoSQL databases Today s Session: Hadoop = HDFS + MapReduce Announcements: Final Exam is on Sunday April

More information

Scientific Workflows and Cloud Computing. Gideon Juve USC Information Sciences Institute

Scientific Workflows and Cloud Computing. Gideon Juve USC Information Sciences Institute Scientific Workflows and Cloud Computing Gideon Juve USC Information Sciences Institute gideon@isi.edu Scientific Workflows Loosely-coupled parallel applications Expressed as directed acyclic graphs (DAGs)

More information

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop

More information

Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University

Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University Roadmap Call to action to improve automatic optimization techniques in MapReduce frameworks Challenges

More information

Novel Scheduling Algorithms for Efficient Deployment of MapReduce Applications in Heterogeneous Computing Environments

Novel Scheduling Algorithms for Efficient Deployment of MapReduce Applications in Heterogeneous Computing Environments Novel Scheduling Algorithms for Efficient Deployment of MapReduce Applications in Heterogeneous Computing Environments Sun-Yuan Hsieh 1,2,3, Chi-Ting Chen 1, Chi-Hao Chen 1, Tzu-Hsiang Yen 1, Hung-Chang

More information

EFFICIENT ALLOCATION OF DYNAMIC RESOURCES IN A CLOUD

EFFICIENT ALLOCATION OF DYNAMIC RESOURCES IN A CLOUD EFFICIENT ALLOCATION OF DYNAMIC RESOURCES IN A CLOUD S.THIRUNAVUKKARASU 1, DR.K.P.KALIYAMURTHIE 2 Assistant Professor, Dept of IT, Bharath University, Chennai-73 1 Professor& Head, Dept of IT, Bharath

More information

Dynamic Replication Management Scheme for Cloud Storage

Dynamic Replication Management Scheme for Cloud Storage Dynamic Replication Management Scheme for Cloud Storage May Phyo Thu, Khine Moe Nwe, Kyar Nyo Aye University of Computer Studies, Yangon mayphyothu.mpt1@gmail.com, khinemoenwe@ucsy.edu.mm, kyarnyoaye@gmail.com

More information

Programming Models MapReduce

Programming Models MapReduce Programming Models MapReduce Majd Sakr, Garth Gibson, Greg Ganger, Raja Sambasivan 15-719/18-847b Advanced Cloud Computing Fall 2013 Sep 23, 2013 1 MapReduce In a Nutshell MapReduce incorporates two phases

More information

Performance evaluation of job schedulers on Hadoop YARN

Performance evaluation of job schedulers on Hadoop YARN ارائه شده توسط: سايت ه فا مرجع جديد مقا ت ه شده از ن ت معت CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. 2016; 28:2711 2728 Published online 23 February 2016

More information

PROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP

PROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP ISSN: 0976-2876 (Print) ISSN: 2250-0138 (Online) PROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP T. S. NISHA a1 AND K. SATYANARAYAN REDDY b a Department of CSE, Cambridge

More information

Agents for Cloud Resource Allocation: an Amazon EC2 Case Study

Agents for Cloud Resource Allocation: an Amazon EC2 Case Study Agents for Cloud Resource Allocation: an Amazon EC2 Case Study J. Octavio Gutierrez-Garcia and Kwang Mong Sim, Gwangju Institute of Science and Technology, Gwangju 500-712, Republic of Korea joseogg@gmail.com

More information

Joint Center of Intelligent Computing George Mason University. Federal Geographic Data Committee (FGDC) ESIP 2011 Winter Meeting

Joint Center of Intelligent Computing George Mason University. Federal Geographic Data Committee (FGDC) ESIP 2011 Winter Meeting Qunying Huang 1, Chaowei Yang 1, Doug Nebert 2 Kai Liu 1, Huayi Wu 1 1 Joint Center of Intelligent Computing George Mason University 2 Federal Geographic Data Committee (FGDC) ESIP 2011 Winter Meeting

More information

vcloud Automation Center Reference Architecture vcloud Automation Center 5.2

vcloud Automation Center Reference Architecture vcloud Automation Center 5.2 vcloud Automation Center Reference Architecture vcloud Automation Center 5.2 This document supports the version of each product listed and supports all subsequent versions until the document is replaced

More information

Hybrid Auto-scaling of Multi-tier Web Applications: A Case of Using Amazon Public Cloud

Hybrid Auto-scaling of Multi-tier Web Applications: A Case of Using Amazon Public Cloud Hybrid Auto-scaling of Multi-tier Web Applications: A Case of Using Amazon Public Cloud Abid Nisar, Waheed Iqbal, Fawaz S. Bokhari, and Faisal Bukhari Punjab University College of Information and Technology,Lahore

More information

MixApart: Decoupled Analytics for Shared Storage Systems. Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp

MixApart: Decoupled Analytics for Shared Storage Systems. Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp MixApart: Decoupled Analytics for Shared Storage Systems Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp Hadoop Pig, Hive Hadoop + Enterprise storage?! Shared storage

More information

Mesos: A Pla+orm for Fine- Grained Resource Sharing in the Data Center

Mesos: A Pla+orm for Fine- Grained Resource Sharing in the Data Center Mesos: A Pla+orm for Fine- Grained Resource Sharing in the Data Center Ion Stoica, UC Berkeley Joint work with: A. Ghodsi, B. Hindman, A. Joseph, R. Katz, A. Konwinski, S. Shenker, and M. Zaharia Challenge

More information

CS Project Report

CS Project Report CS7960 - Project Report Kshitij Sudan kshitij@cs.utah.edu 1 Introduction With the growth in services provided over the Internet, the amount of data processing required has grown tremendously. To satisfy

More information

CHAPTER 6 STATISTICAL MODELING OF REAL WORLD CLOUD ENVIRONMENT FOR RELIABILITY AND ITS EFFECT ON ENERGY AND PERFORMANCE

CHAPTER 6 STATISTICAL MODELING OF REAL WORLD CLOUD ENVIRONMENT FOR RELIABILITY AND ITS EFFECT ON ENERGY AND PERFORMANCE 143 CHAPTER 6 STATISTICAL MODELING OF REAL WORLD CLOUD ENVIRONMENT FOR RELIABILITY AND ITS EFFECT ON ENERGY AND PERFORMANCE 6.1 INTRODUCTION This chapter mainly focuses on how to handle the inherent unreliability

More information

EECS 498 Introduction to Distributed Systems

EECS 498 Introduction to Distributed Systems EECS 498 Introduction to Distributed Systems Fall 2017 Harsha V. Madhyastha Today Bitcoin: A peer-to-peer digital currency Spark: In-memory big data processing December 4, 2017 EECS 498 Lecture 21 2 December

More information

Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center

Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center : A Platform for Fine-Grained Resource Sharing in the Data Center Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica University of California,

More information

DISTRIBUTED NETWORK RESOURCE ALLOCATION WITH INTEGER CONSTRAINTS. Yujiao Cheng, Houfeng Huang, Gang Wu, Qing Ling

DISTRIBUTED NETWORK RESOURCE ALLOCATION WITH INTEGER CONSTRAINTS. Yujiao Cheng, Houfeng Huang, Gang Wu, Qing Ling DISTRIBUTED NETWORK RESOURCE ALLOCATION WITH INTEGER CONSTRAINTS Yuao Cheng, Houfeng Huang, Gang Wu, Qing Ling Department of Automation, University of Science and Technology of China, Hefei, China ABSTRACT

More information

Intro to Software as a Service (SaaS) and Cloud Computing

Intro to Software as a Service (SaaS) and Cloud Computing UC Berkeley Intro to Software as a Service (SaaS) and Cloud Computing Armando Fox, UC Berkeley Reliable Adaptive Distributed Systems Lab 2009-2012 Image: John Curley http://www.flickr.com/photos/jay_que/1834540/

More information

QLIKVIEW SCALABILITY BENCHMARK WHITE PAPER

QLIKVIEW SCALABILITY BENCHMARK WHITE PAPER QLIKVIEW SCALABILITY BENCHMARK WHITE PAPER Hardware Sizing Using Amazon EC2 A QlikView Scalability Center Technical White Paper June 2013 qlikview.com Table of Contents Executive Summary 3 A Challenge

More information

Concepts Introduced in Chapter 6. Warehouse-Scale Computers. Programming Models for WSCs. Important Design Factors for WSCs

Concepts Introduced in Chapter 6. Warehouse-Scale Computers. Programming Models for WSCs. Important Design Factors for WSCs Concepts Introduced in Chapter 6 Warehouse-Scale Computers A cluster is a collection of desktop computers or servers connected together by a local area network to act as a single larger computer. introduction

More information

Dynamic Resource and Energy Aware Scheduling of MapReduce Jobs and Virtual Machines in Cloud Datacenters

Dynamic Resource and Energy Aware Scheduling of MapReduce Jobs and Virtual Machines in Cloud Datacenters Dynamic Resource and Energy Aware Scheduling of MapReduce Jobs and Virtual Machines in Cloud Datacenters Thesis submitted in partial fulfillment of the requirements for the degree of MS by Research in

More information

Configuring a MapReduce Framework for Dynamic and Efficient Energy Adaptation

Configuring a MapReduce Framework for Dynamic and Efficient Energy Adaptation Configuring a MapReduce Framework for Dynamic and Efficient Energy Adaptation Jessica Hartog, Zacharia Fadika, Elif Dede, Madhusudhan Govindaraju Department of Computer Science, State University of New

More information

Distributed Systems CS6421

Distributed Systems CS6421 Distributed Systems CS6421 Intro to Distributed Systems and the Cloud Prof. Tim Wood v I teach: Software Engineering, Operating Systems, Sr. Design I like: distributed systems, networks, building cool

More information

Co-operative Scheduled Energy Aware Load-Balancing technique for an Efficient Computational Cloud

Co-operative Scheduled Energy Aware Load-Balancing technique for an Efficient Computational Cloud 571 Co-operative Scheduled Energy Aware Load-Balancing technique for an Efficient Computational Cloud T.R.V. Anandharajan 1, Dr. M.A. Bhagyaveni 2 1 Research Scholar, Department of Electronics and Communication,

More information

Today s Papers. Composability is Essential. The Future is Parallel Software. EECS 262a Advanced Topics in Computer Systems Lecture 13

Today s Papers. Composability is Essential. The Future is Parallel Software. EECS 262a Advanced Topics in Computer Systems Lecture 13 EECS 262a Advanced Topics in Computer Systems Lecture 13 Resource allocation: Lithe/DRF October 16 th, 2012 Today s Papers Composing Parallel Software Efficiently with Lithe Heidi Pan, Benjamin Hindman,

More information

Performance Comparison for Resource Allocation Schemes using Cost Information in Cloud

Performance Comparison for Resource Allocation Schemes using Cost Information in Cloud Performance Comparison for Resource Allocation Schemes using Cost Information in Cloud Takahiro KOITA 1 and Kosuke OHARA 1 1 Doshisha University Graduate School of Science and Engineering, Tataramiyakotani

More information

MapR Enterprise Hadoop

MapR Enterprise Hadoop 2014 MapR Technologies 2014 MapR Technologies 1 MapR Enterprise Hadoop Top Ranked Cloud Leaders 500+ Customers 2014 MapR Technologies 2 Key MapR Advantage Partners Business Services APPLICATIONS & OS ANALYTICS

More information

Part 1: Indexes for Big Data

Part 1: Indexes for Big Data JethroData Making Interactive BI for Big Data a Reality Technical White Paper This white paper explains how JethroData can help you achieve a truly interactive interactive response time for BI on big data,

More information

Licensing Oracle on Amazon EC2, RDS and Microsoft Azure now twice as expensive!

Licensing Oracle on Amazon EC2, RDS and Microsoft Azure now twice as expensive! Licensing Oracle on Amazon EC2, RDS and Microsoft Azure now twice as expensive! Authors: Adrian Cristache and Andra Tarata This whitepaper provides an overview of the changes in Licensing Oracle Software

More information

Hierarchical Scheduling for Diverse Datacenter Workloads

Hierarchical Scheduling for Diverse Datacenter Workloads Hierarchical Scheduling for Diverse Datacenter Workloads Arka A. Bhattacharya 1, David Culler 1, Eric Friedman 2, Ali Ghodsi 1, Scott Shenker 1, and Ion Stoica 1 1 University of California, Berkeley 2

More information

FairRide: Near-Optimal Fair Cache Sharing

FairRide: Near-Optimal Fair Cache Sharing UC BERKELEY FairRide: Near-Optimal Fair Cache Sharing Qifan Pu, Haoyuan Li, Matei Zaharia, Ali Ghodsi, Ion Stoica 1 Caches are crucial 2 Caches are crucial 2 Caches are crucial 2 Caches are crucial 2 Cache

More information

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University CPSC 426/526 Cloud Computing Ennan Zhai Computer Science Department Yale University Recall: Lec-7 In the lec-7, I talked about: - P2P vs Enterprise control - Firewall - NATs - Software defined network

More information

White Paper. Platform9 ROI for Hybrid Clouds

White Paper. Platform9 ROI for Hybrid Clouds White Paper Platform9 ROI for Hybrid Clouds Quantifying the cost savings and benefits of moving from the Amazon Web Services (AWS) public cloud to the Platform9 hybrid cloud. Abstract Deciding whether

More information

A formal framework for the management of any digital resource in the cloud - Simulation

A formal framework for the management of any digital resource in the cloud - Simulation Mehdi Ahmed-Nacer, Samir Tata and Sami Bhiri (Telecom SudParis) August 15 2015 Updated: August 17, 2015 A formal framework for the management of any digital resource in the cloud - Simulation Abstract

More information

Optimal Algorithms for Cross-Rack Communication Optimization in MapReduce Framework

Optimal Algorithms for Cross-Rack Communication Optimization in MapReduce Framework Optimal Algorithms for Cross-Rack Communication Optimization in MapReduce Framework Li-Yung Ho Institute of Information Science Academia Sinica, Department of Computer Science and Information Engineering

More information

Using Alluxio to Improve the Performance and Consistency of HDFS Clusters

Using Alluxio to Improve the Performance and Consistency of HDFS Clusters ARTICLE Using Alluxio to Improve the Performance and Consistency of HDFS Clusters Calvin Jia Software Engineer at Alluxio Learn how Alluxio is used in clusters with co-located compute and storage to improve

More information

Introduction to MapReduce

Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

IaaS Vendor Comparison

IaaS Vendor Comparison IaaS Vendor Comparison Analysis of competitor products Tobias Deml Senior Systemberater BU Cloud & Core Technologies February 01, 2018 2 Tobias Deml Senior Systemberater BU Cloud & Core Technologies Topics

More information

Cloud Computing. What is cloud computing. CS 537 Fall 2017

Cloud Computing. What is cloud computing. CS 537 Fall 2017 Cloud Computing CS 537 Fall 2017 What is cloud computing Illusion of infinite computing resources available on demand Scale-up for most apps Elimination of up-front commitment Small initial investment,

More information

International Journal of Modern Trends in Engineering and Research e-issn No.: , Date: 2-4 July, 2015

International Journal of Modern Trends in Engineering and Research   e-issn No.: , Date: 2-4 July, 2015 International Journal of Modern Trends in Engineering and Research www.ijmter.com e-issn No.:2349-9745, Date: 2-4 July, 215 Provisioning Rapid Elasticity by Light-Weight Live Resource Migration S. Kirthica

More information

Power-Aware Scheduling of Virtual Machines in DVFS-enabled Clusters

Power-Aware Scheduling of Virtual Machines in DVFS-enabled Clusters Power-Aware Scheduling of Virtual Machines in DVFS-enabled Clusters Gregor von Laszewski, Lizhe Wang, Andrew J. Younge, Xi He Service Oriented Cyberinfrastructure Lab Rochester Institute of Technology,

More information

HP ProLiant BladeSystem Gen9 vs Gen8 and G7 Server Blades on Data Warehouse Workloads

HP ProLiant BladeSystem Gen9 vs Gen8 and G7 Server Blades on Data Warehouse Workloads HP ProLiant BladeSystem Gen9 vs Gen8 and G7 Server Blades on Data Warehouse Workloads Gen9 server blades give more performance per dollar for your investment. Executive Summary Information Technology (IT)

More information

GRID SIMULATION FOR DYNAMIC LOAD BALANCING

GRID SIMULATION FOR DYNAMIC LOAD BALANCING GRID SIMULATION FOR DYNAMIC LOAD BALANCING Kapil B. Morey 1, Prof. A. S. Kapse 2, Prof. Y. B. Jadhao 3 1 Research Scholar, Computer Engineering Dept., Padm. Dr. V. B. Kolte College of Engineering, Malkapur,

More information

Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin

Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin Department of Electrical Engineering, National Taiwan Normal University, Taipei, Taiwan Abstract String matching is the most

More information

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The

More information