Bohr: Similarity Aware Geo-distributed Data Analytics. Hangyu Li, Hong Xu, Sarana Nutanong City University of Hong Kong

Size: px
Start display at page:

Download "Bohr: Similarity Aware Geo-distributed Data Analytics. Hangyu Li, Hong Xu, Sarana Nutanong City University of Hong Kong"

Transcription

1 Bohr: Similarity Aware Geo-distributed Data Analytics Hangyu Li, Hong Xu, Sarana Nutanong City University of Hong Kong 1

2 Big Data Analytics Analysis Generate 2

3 Data are geo-distributed Frankfurt US Oregon Tokyo US California US Virginia Singapore Sydney San Paulo 3

4 Centralizing data is infeasible Moving data through wide area networks(wans) can be extremely slow. Data sovereignty laws. Frankfurt US Oregon US Virginia Tokyo US California 1. Vulimiri et al., NSDI Pu et al., SIGCOMM Viswanathan et al., OSDI Hsieh et al., NSDI 17 Singapore Sydney San Paulo 4

5 Processing queries in-place Bottleneck site exists. Frankfurt US Oregon Tokyo US California US Virginia Singapore Sydney San Paulo 5

6 Processing queries in-place System Workload Task / Data Placement Optimize aspect JetStream streaming no WAN bandwidth usage Geode batch data WAN bandwidth usage Iridium batch both query response time 6

7 Processing queries in-place Iridium (Pu et al. SIGCOMM 15) Recurring queries, abundant computation resource. Transfer data out from the bottleneck site. Re-schedule reducer tasks. 7

8 Iridium can be improved similarity oblivious data reduction ratio is constant Tokyo Input: Map2 (with combiner) Reduce2 Intermedia: UrlB,2 Intermed Site -iate record Tokyo 3 Input: Map1 (with combiner) Intermedia: UrlA,3 Input: Map3 (with combiner) Intermedia: UrlC,2 Oregon 1 Seoul 2 Reduce1 Reduce3 Total 6 Oregon Seoul 8

9 Iridium can be improved similarity oblivious data reduction ratio is constant Tokyo () Oregon Input: Map2 (with combiner) Reduce2 Intermedia: () Tokyo to Seoul () Tokyo to Oregon () Seoul Intermed Site -iate record Tokyo 2 Input: Map1 (with combiner) Reduce1 Intermedia: UrlA,3 Input: Map3 (with combiner) Reduce3 Intermedia: UrlC,2 Oregon 2 Seoul 3 Total 7 Oregon Seoul 9

10 Iridium can be improved similarity oblivious data reduction ratio is constant Tokyo () Oregon Input: Map2 (with combiner) Reduce2 Intermedia: () Tokyo to Seoul () Tokyo to Oregon () Seoul Intermed Site -iate record Tokyo 2 Input: Map1 (with combiner) Reduce1 Intermedia: UrlA,4 Input: Map3 (with combiner) Reduce3 Intermedia: UrlC,3 Oregon 1 Seoul 2 Total 5 Oregon Seoul 10

11 Bohr Design Key ideas: Recurring queries, abundant computation resource. Similarity aware data transfer. Measuring and predicting the data reduction ratio. 11

12 Bohr Design Site1 OLAP Cube Generation OLAP Cube1 Map1 Combine enabled probe similar dataset Site2 OLAP Cube Generation OLAP Cube2 Map2 Combine enabled Reduce1 Reduce2 final result query tasks Global Manager Job Queue 12

13 Why using OLAP Cube? Sorting dataset according to the attribute. Using record level similarity score. 13

14 Why using OLAP Cube? Advantages: Format the unformatted raw data. Multiple dimension cube for one dataset. 14

15 Similarity Search & Check Search: OLAP instruction(eg. dice, slice, roll up) to retrieve the needed attributes. site 1 site 2 Check: Cross site check: simple probes. The probe components. OLAP Cube1 Probe value(1) value(2) value(k) OLAP Cube2 15

16 Reduce tasks scheduling t : data transfer time. I : input data size. R: data reduction ratio. U: uplink bandwidth. D: downlink bandwidth. r : reduce tasks percentage. Goal : Minimize the data transfer time Time to upload data from site i. Time to download data from other sites. 16

17 System Implementation Based on Spark v2.1.0 and Iridium. Utilize Apache Kylin OLAP data cube to deal with all the OLAP operation. Use Gurobi Optimizer for optimize the LP function. 17

18 Evaluation Methodology Workloads: TPC-DS. Amplab big data benchmark. Hardware platform: A real EC2 deployment with 10 sites(singapore,tokyo ). 18

19 Evaluation Methodology Baseline: Iridium(Pu et al. Sigcomm 15). Performance metrics: Query completion time. Data reduction ratio. 19

20 Data Reduction Ratio Data reduction compared to in-place (%) Singapore Tokyo Oregon Virginia Ohio FrankFurt Seoul Sydney Iridium Bohr London Ireland Bohr can increase the data reductio ratio in all sites. 20

21 Query Completion Time Query completion time (secs) Iridium Bohr 0 Big data (scan) Big data (UDF) Big data (aggr) TPC-DS Bohr saves 30% QCT compare to Iridium. 21

22 Conclusion We propose Bohr: Similarity aware data transfer. Predicting and measuring data reduction ratio for each site. Bohr is 30% faster than Iridium. 22

23 Thank you! 23

24 Future work 1. Generalize the basic idea to more workloads. 2. Dealing with the overhead bring by operate OLAP cubes. 3. Break recurring batch workload condition 4. Multiple types of queries accessing one dataset. 24

Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds

Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds Kevin Hsieh Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, Phillip B. Gibbons, Onur Mutlu Machine Learning and Big

More information

MarkLogic Cloud Service Pricing & Billing Effective: October 1, 2018

MarkLogic Cloud Service Pricing & Billing Effective: October 1, 2018 MarkLogic Cloud Service Pricing & Billing Effective: October 1, 2018 MARKLOGIC DATA HUB SERVICE PRICING COMPUTE AND QUERY CAPACITY MarkLogic Data Hub Service capacity is measured in MarkLogic Capacity

More information

HotCloud 17. Lube: Mitigating Bottlenecks in Wide Area Data Analytics. Hao Wang* Baochun Li

HotCloud 17. Lube: Mitigating Bottlenecks in Wide Area Data Analytics. Hao Wang* Baochun Li HotCloud 17 Lube: Hao Wang* Baochun Li Mitigating Bottlenecks in Wide Area Data Analytics iqua Wide Area Data Analytics DC Master Namenode Workers Datanodes 2 Wide Area Data Analytics Why wide area data

More information

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context 1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes

More information

To Relay or Not to Relay for Inter-Cloud Transfers? Fan Lai, Mosharaf Chowdhury, Harsha Madhyastha

To Relay or Not to Relay for Inter-Cloud Transfers? Fan Lai, Mosharaf Chowdhury, Harsha Madhyastha To Relay or Not to Relay for Inter-Cloud Transfers? Fan Lai, Mosharaf Chowdhury, Harsha Madhyastha Background Over 40 Data Centers (DCs) on EC2, Azure, Google Cloud A geographically denser set of DCs across

More information

Sparrow. Distributed Low-Latency Spark Scheduling. Kay Ousterhout, Patrick Wendell, Matei Zaharia, Ion Stoica

Sparrow. Distributed Low-Latency Spark Scheduling. Kay Ousterhout, Patrick Wendell, Matei Zaharia, Ion Stoica Sparrow Distributed Low-Latency Spark Scheduling Kay Ousterhout, Patrick Wendell, Matei Zaharia, Ion Stoica Outline The Spark scheduling bottleneck Sparrow s fully distributed, fault-tolerant technique

More information

A Hierarchical Synchronous Parallel Model for Wide-Area Graph Analytics

A Hierarchical Synchronous Parallel Model for Wide-Area Graph Analytics A Hierarchical Synchronous Parallel Model for Wide-Area Graph Analytics Shuhao Liu*, Li Chen, Baochun Li, Aiden Carnegie University of Toronto April 17, 2018 Graph Analytics What is Graph Analytics? 2

More information

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab Apache is a fast and general engine for large-scale data processing aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes Low latency: sub-second

More information

Cloud and Storage. Transforming IT with AWS and Zadara. Doug Cliche, Storage Solutions Architect June 5, 2018

Cloud and Storage. Transforming IT with AWS and Zadara. Doug Cliche, Storage Solutions Architect June 5, 2018 Cloud and Storage Transforming IT with AWS and Zadara Doug Cliche, Storage Solutions Architect June 5, 2018 What sets AWS apart? Security Fine-grained control Service Breadth & Depth; pace of innovation

More information

Janus. Consolidating Concurrency Control and Consensus for Commits under Conflicts. Shuai Mu, Lamont Nelson, Wyatt Lloyd, Jinyang Li

Janus. Consolidating Concurrency Control and Consensus for Commits under Conflicts. Shuai Mu, Lamont Nelson, Wyatt Lloyd, Jinyang Li Janus Consolidating Concurrency Control and Consensus for Commits under Conflicts Shuai Mu, Lamont Nelson, Wyatt Lloyd, Jinyang Li New York University, University of Southern California State of the Art

More information

Amazon Web Services and Feb 28 outage. Overview presented by Divya

Amazon Web Services and Feb 28 outage. Overview presented by Divya Amazon Web Services and Feb 28 outage Overview presented by Divya Amazon S3 Amazon S3 : store and retrieve any amount of data, at any time, from anywhere on web. Amazon S3 service: Create Buckets Create

More information

Expeditus: Congestion-Aware Load Balancing in Clos Data Center Networks

Expeditus: Congestion-Aware Load Balancing in Clos Data Center Networks Expeditus: Congestion-Aware Load Balancing in Clos Data Center Networks Peng Wang, Hong Xu, Zhixiong Niu, Dongsu Han, Yongqiang Xiong ACM SoCC 2016, Oct 5-7, Santa Clara Motivation Datacenter networks

More information

Amazon Web Services. Foundational Services for Research Computing. April Mike Kuentz, WWPS Solutions Architect

Amazon Web Services. Foundational Services for Research Computing. April Mike Kuentz, WWPS Solutions Architect Amazon Web Services Foundational Services for Research Computing Mike Kuentz, WWPS Solutions Architect April 2017 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS Global Infrastructure

More information

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme LHC2384BU VMware Cloud on AWS A Technical Deep Dive Ray Budavari @rbudavari Frank Denneman - @frankdenneman #VMworld #LHC2384BU Disclaimer This presentation may contain product features that are currently

More information

Security & Compliance in the AWS Cloud. Amazon Web Services

Security & Compliance in the AWS Cloud. Amazon Web Services Security & Compliance in the AWS Cloud Amazon Web Services Our Culture Simple Security Controls Job Zero AWS Pace of Innovation AWS has been continually expanding its services to support virtually any

More information

Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds

Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds Kevin Hsieh Aaron Harlap Nandita Vijaykumar Dimitris Konomis Gregory R. Ganger Phillip B. Gibbons Onur Mutlu Carnegie Mellon University EH

More information

MPROXY. Bloomberg's Proprietary Multicast Server Solution. 20 February 2004 Version: 1.5

MPROXY. Bloomberg's Proprietary Multicast Server Solution. 20 February 2004 Version: 1.5 MPROXY Bloomberg's Proprietary Multicast Server Solution 20 February 2004 Version: 1.5 1 Introduction... 2 Terminology... 2 How does MProxy Work?...3 Requirements...4 Installation and Configuration...

More information

Security & Compliance in the AWS Cloud. Vijay Rangarajan Senior Cloud Architect, ASEAN Amazon Web

Security & Compliance in the AWS Cloud. Vijay Rangarajan Senior Cloud Architect, ASEAN Amazon Web Security & Compliance in the AWS Cloud Vijay Rangarajan Senior Cloud Architect, ASEAN Amazon Web Services @awscloud www.cloudsec.com #CLOUDSEC Security & Compliance in the AWS Cloud TECHNICAL & BUSINESS

More information

Security: Michael South Americas Regional Leader, Public Sector Security & Compliance Business Acceleration

Security: Michael South Americas Regional Leader, Public Sector Security & Compliance Business Acceleration Security: A Driving Force Behind Moving to the Cloud Michael South Americas Regional Leader, Public Sector Security & Compliance Business Acceleration 2017, Amazon Web Services, Inc. or its affiliates.

More information

DocAve Online 3. Release Notes

DocAve Online 3. Release Notes DocAve Online 3 Release Notes Service Pack 16, Cumulative Update 1 Issued May 2017 New Features and Improvements Added support for new storage regions in Amazon S3 type physical devices, including: US

More information

Micron and Hortonworks Power Advanced Big Data Solutions

Micron and Hortonworks Power Advanced Big Data Solutions Micron and Hortonworks Power Advanced Big Data Solutions Flash Energizes Your Analytics Overview Competitive businesses rely on the big data analytics provided by platforms like open-source Apache Hadoop

More information

Global Analytics in the Face of Bandwidth and Regulatory Constraints

Global Analytics in the Face of Bandwidth and Regulatory Constraints Global Analytics in the Face of Bandwidth and Regulatory Constraints Ashish Vulimiri, Carlo Curino, Brighten Godfrey, Thomas Jungblut, Jitu Padhye, George Varghese NSDI 15 Presenter: Sarthak Grover Motivation

More information

Lecture: Benchmarks, Pipelining Intro. Topics: Performance equations wrap-up, Intro to pipelining

Lecture: Benchmarks, Pipelining Intro. Topics: Performance equations wrap-up, Intro to pipelining Lecture: Benchmarks, Pipelining Intro Topics: Performance equations wrap-up, Intro to pipelining 1 Measuring Performance Two primary metrics: wall clock time (response time for a program) and throughput

More information

SDA: Software-Defined Accelerator for general-purpose big data analysis system

SDA: Software-Defined Accelerator for general-purpose big data analysis system SDA: Software-Defined Accelerator for general-purpose big data analysis system Jian Ouyang(ouyangjian@baidu.com), Wei Qi, Yong Wang, Yichen Tu, Jing Wang, Bowen Jia Baidu is beyond a search engine Search

More information

Yuval Bachar. Principal Engineer, Global Infrastructure & Strategy at President, Open19 Foundation

Yuval Bachar. Principal Engineer, Global Infrastructure & Strategy at President, Open19 Foundation Yuval Bachar Principal Engineer, Global Infrastructure & Strategy at President, Open19 Foundation Open19 Foundation Thank you! Open19 Foundation Our mission: A community based organization for open source

More information

Infiniswap. Efficient Memory Disaggregation. Mosharaf Chowdhury. with Juncheng Gu, Youngmoon Lee, Yiwen Zhang, and Kang G. Shin

Infiniswap. Efficient Memory Disaggregation. Mosharaf Chowdhury. with Juncheng Gu, Youngmoon Lee, Yiwen Zhang, and Kang G. Shin Infiniswap Efficient Memory Disaggregation Mosharaf Chowdhury with Juncheng Gu, Youngmoon Lee, Yiwen Zhang, and Kang G. Shin Rack-Scale Computing Datacenter-Scale Computing Geo-Distributed Computing Coflow

More information

QLogic/Lenovo 16Gb Gen 5 Fibre Channel for Database and Business Analytics

QLogic/Lenovo 16Gb Gen 5 Fibre Channel for Database and Business Analytics QLogic/ Gen 5 Fibre Channel for Database Assessment for Database and Business Analytics Using the information from databases and business analytics helps business-line managers to understand their customer

More information

UNIFY DATA AT MEMORY SPEED. Haoyuan (HY) Li, Alluxio Inc. VAULT Conference 2017

UNIFY DATA AT MEMORY SPEED. Haoyuan (HY) Li, Alluxio Inc. VAULT Conference 2017 UNIFY DATA AT MEMORY SPEED Haoyuan (HY) Li, CEO @ Alluxio Inc. VAULT Conference 2017 March 2017 HISTORY Started at UC Berkeley AMPLab In Summer 2012 Originally named as Tachyon Rebranded to Alluxio in

More information

Coflow. Recent Advances and What s Next? Mosharaf Chowdhury. University of Michigan

Coflow. Recent Advances and What s Next? Mosharaf Chowdhury. University of Michigan Coflow Recent Advances and What s Next? Mosharaf Chowdhury University of Michigan Rack-Scale Computing Datacenter-Scale Computing Geo-Distributed Computing Coflow Networking Open Source Apache Spark Open

More information

YOU SUN JEONG DATA ANALYTICS WITH DRUID

YOU SUN JEONG DATA ANALYTICS WITH DRUID YOU SUN JEONG DATA ANALYTICS WITH DRUID 2 WHO AM I? Senior Software Engineer of SK Telecom Commercial Products Big Data Discovery Solution (~ 16) Hadoop DW (~ 15) PaaS(CloudFoundry) (~ 13) Iaas (OpenStack)

More information

Aggregation and Degradation in JetStream: Streaming analytics in the wide area

Aggregation and Degradation in JetStream: Streaming analytics in the wide area NSDI 2014 Aggregation and Degradation in JetStream: Streaming analytics in the wide area Ariel Rabkin, Matvey Arye, Siddhartha Sen, Vivek S. Pai, and Michael J. Freedman Princeton University 报告人 : 申毅杰

More information

Using Alluxio to Improve the Performance and Consistency of HDFS Clusters

Using Alluxio to Improve the Performance and Consistency of HDFS Clusters ARTICLE Using Alluxio to Improve the Performance and Consistency of HDFS Clusters Calvin Jia Software Engineer at Alluxio Learn how Alluxio is used in clusters with co-located compute and storage to improve

More information

Database Acceleration Solution Using FPGAs and Integrated Flash Storage

Database Acceleration Solution Using FPGAs and Integrated Flash Storage Database Acceleration Solution Using FPGAs and Integrated Flash Storage HK Verma, Xilinx Inc. August 2017 1 FPGA Analytics in Flash Storage System In-memory or Flash storage based DB reduce disk access

More information

QLogic 16Gb Gen 5 Fibre Channel for Database and Business Analytics

QLogic 16Gb Gen 5 Fibre Channel for Database and Business Analytics QLogic 16Gb Gen 5 Fibre Channel for Database Assessment for Database and Business Analytics Using the information from databases and business analytics helps business-line managers to understand their

More information

Expected Learning Outcomes Introduction To AWS

Expected Learning Outcomes Introduction To AWS Introduction To AWS Expected Learning Outcomes Introduction To AWS Understand What Cloud Computing Is Discover Why Companies Are Adopting AWS Understand How AWS Can Help Your Explore AWS Services Apply

More information

Essentials of Database Management

Essentials of Database Management Essentials of Database Management Jeffrey A. Hoffer University of Dayton Heikki Topi Bentley University V. Ramesh Indiana University PEARSON Boston Columbus Indianapolis New York San Francisco Upper Saddle

More information

TPCX-BB (BigBench) Big Data Analytics Benchmark

TPCX-BB (BigBench) Big Data Analytics Benchmark TPCX-BB (BigBench) Big Data Analytics Benchmark Bhaskar D Gowda Senior Staff Engineer Analytics & AI Solutions Group Intel Corporation bhaskar.gowda@intel.com 1 Agenda Big Data Analytics & Benchmarks Industry

More information

Building modern media services with Windows Azure. Karl Ots, Symbio

Building modern media services with Windows Azure. Karl Ots, Symbio t Building modern media services with Windows Azure Karl Ots, Symbio About the the presenter Karl Ots, Technical Consultant at Symbio Windows Phone 8 and Windows 8 trainer Windows Azure Insider Co-founder

More information

Global analy)cs in the face of bandwidth and regulatory constraints

Global analy)cs in the face of bandwidth and regulatory constraints Global analy)cs in the face of bandwidth and regulatory constraints Ashish Vulimiri u Carlo Curino m Brighten Godfrey u Thomas Jungblut m Jitu Padhye m George Varghese m u UIUC m MicrosoA Massive data

More information

Optimizing Apache Spark with Memory1. July Page 1 of 14

Optimizing Apache Spark with Memory1. July Page 1 of 14 Optimizing Apache Spark with Memory1 July 2016 Page 1 of 14 Abstract The prevalence of Big Data is driving increasing demand for real -time analysis and insight. Big data processing platforms, like Apache

More information

The Value of Urban Data Petr Suska, Fraunhofer IAO, 30th OCT, 2017 New Delhi

The Value of Urban Data Petr Suska, Fraunhofer IAO, 30th OCT, 2017 New Delhi The Value of Urban Data Petr Suska, Fraunhofer IAO, 30th OCT, 2017 New Delhi Fraunhofer 1 Profile of the Fraunhofer Society Founded: 1949 about 24,000 staff 66 institutes and research units Fraunhofer

More information

Workload Characterization and Optimization of TPC-H Queries on Apache Spark

Workload Characterization and Optimization of TPC-H Queries on Apache Spark Workload Characterization and Optimization of TPC-H Queries on Apache Spark Tatsuhiro Chiba and Tamiya Onodera IBM Research - Tokyo April. 17-19, 216 IEEE ISPASS 216 @ Uppsala, Sweden Overview IBM Research

More information

Siphon: Expediting Inter-Datacenter Coflows in Wide-Area Data Analytics. Shuhao Liu, Li Chen, Baochun Li University of Toronto July 12, 2018

Siphon: Expediting Inter-Datacenter Coflows in Wide-Area Data Analytics. Shuhao Liu, Li Chen, Baochun Li University of Toronto July 12, 2018 Siphon: Expediting Inter-Datacenter Coflows in Wide-Area Data Analytics Shuhao Liu, Li Chen, Baochun Li University of Toronto July 12, 2018 What is a Coflow? One stage in a data analytic job Map 1 Reduce

More information

Data Modeling and Databases Ch 7: Schemas. Gustavo Alonso, Ce Zhang Systems Group Department of Computer Science ETH Zürich

Data Modeling and Databases Ch 7: Schemas. Gustavo Alonso, Ce Zhang Systems Group Department of Computer Science ETH Zürich Data Modeling and Databases Ch 7: Schemas Gustavo Alonso, Ce Zhang Systems Group Department of Computer Science ETH Zürich Database schema A Database Schema captures: The concepts represented Their attributes

More information

Global Analytics in the Face of Bandwidth and Regulatory Constraints

Global Analytics in the Face of Bandwidth and Regulatory Constraints Global Analytics in the Face of Bandwidth and Regulatory Constraints Ashish Vulimiri, University of Illinois at Urbana-Champaign; Carlo Curino, Microsoft; P. Brighten Godfrey, University of Illinois at

More information

HiTune. Dataflow-Based Performance Analysis for Big Data Cloud

HiTune. Dataflow-Based Performance Analysis for Big Data Cloud HiTune Dataflow-Based Performance Analysis for Big Data Cloud Jinquan (Jason) Dai, Jie Huang, Shengsheng Huang, Bo Huang, Yan Liu Intel Asia-Pacific Research and Development Ltd Shanghai, China, 200241

More information

Efficient Process Mapping in Geo-Distributed Cloud Data Centers

Efficient Process Mapping in Geo-Distributed Cloud Data Centers Efficient Process Mapping in Geo-Distributed Cloud Data Centers ABSTRACT Amelie Chi Zhou * Shenzhen University Bingsheng He National University of Singapore Recently, various applications including data

More information

Research in Middleware Systems For In-Situ Data Analytics and Instrument Data Analysis

Research in Middleware Systems For In-Situ Data Analytics and Instrument Data Analysis Research in Middleware Systems For In-Situ Data Analytics and Instrument Data Analysis Gagan Agrawal The Ohio State University (Joint work with Yi Wang, Yu Su, Tekin Bicer and others) Outline Middleware

More information

Optimizing Shuffle in Wide-Area Data Analytics

Optimizing Shuffle in Wide-Area Data Analytics Optimizing Shuffle in Wide-Area Data Analytics Shuhao Liu, Hao Wang, Baochun Li Department of Electrical and Computer Engineering University of Toronto Toronto, Canada {shuhao, haowang, bli}@ece.toronto.edu

More information

Apigee Edge Cloud. Supported browsers:

Apigee Edge Cloud. Supported browsers: Apigee Edge Cloud Description Apigee Edge Cloud is an API management platform to securely deliver and manage all APIs. Apigee Edge Cloud manages the API lifecycle with capabilities that include, but are

More information

BLOOMBERG VAULT FOR FILES. Administrator s Guide

BLOOMBERG VAULT FOR FILES. Administrator s Guide BLOOMBERG VAULT FOR FILES Administrator s Guide INTRODUCTION 01 Introduction 02 Package Installation 02 Pre-Installation Requirement 02 Installation Steps 06 Initial (One-Time) Configuration 06 Bloomberg

More information

PNDA.io: when BGP meets Big-Data

PNDA.io: when BGP meets Big-Data PNDA.io: when BGP meets Big-Data Let s go back in time 26 th April 2017 The Internet is very much alive Millions of BGP events occurring every day 15 Routers Monitored 410 active peers (both IPv4 and IPv6)

More information

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop

More information

ony Gaddis Haywood Community College STARTING OUT WITH PEARSON Amsterdam Cape Town Dubai London Madrid Milan Munich Paris Montreal Toronto

ony Gaddis Haywood Community College STARTING OUT WITH PEARSON Amsterdam Cape Town Dubai London Madrid Milan Munich Paris Montreal Toronto STARTING OUT WITH J^"* 1 Ti * ony Gaddis Haywood Community College PEARSON Boston Columbus Indianapolis New York San Francisco Upper Saddle River Amsterdam Cape Town Dubai London Madrid Milan Munich Paris

More information

IBM Education Assistance for z/os V2R2

IBM Education Assistance for z/os V2R2 IBM Education Assistance for z/os V2R2 Item: RSM Scalability Element/Component: Real Storage Manager Material current as of May 2015 IBM Presentation Template Full Version Agenda Trademarks Presentation

More information

Traffic Steering using RUM DNS. Abhijeet Rastogi Senior Linkedin (Edge Performance & Traffic)

Traffic Steering using RUM DNS. Abhijeet Rastogi Senior Linkedin (Edge Performance & Traffic) Traffic Steering using RUM DNS Abhijeet Rastogi Senior SRE @ Linkedin (Edge Performance & Traffic) My Team Public DNS CDN Operations Load balancer & reverse proxy at PoPs and DCs Largest Professional Network

More information

Data Transformation and Migration in Polystores

Data Transformation and Migration in Polystores Data Transformation and Migration in Polystores Adam Dziedzic, Aaron Elmore & Michael Stonebraker September 15th, 2016 Agenda Data Migration for Polystores: What & Why? How? Acceleration of physical data

More information

Engage with ESRI in the AWS Cloud. Teresa Carlson, VP of Global Public Sector

Engage with ESRI in the AWS Cloud. Teresa Carlson, VP of Global Public Sector Engage with ESRI in the AWS Cloud Teresa Carlson, VP of Global Public Sector On Premise Infrastructure is Costly & Complex Large Capital Expenditures Patching Software Scaling down as needed Contract negotiation

More information

MDCC MULTI DATA CENTER CONSISTENCY. amplab. Tim Kraska, Gene Pang, Michael Franklin, Samuel Madden, Alan Fekete

MDCC MULTI DATA CENTER CONSISTENCY. amplab. Tim Kraska, Gene Pang, Michael Franklin, Samuel Madden, Alan Fekete MDCC MULTI DATA CENTER CONSISTENCY Tim Kraska, Gene Pang, Michael Franklin, Samuel Madden, Alan Fekete gpang@cs.berkeley.edu amplab MOTIVATION 2 3 June 2, 200: Rackspace power outage of approximately 0

More information

A Network-aware Scheduler in Data-parallel Clusters for High Performance

A Network-aware Scheduler in Data-parallel Clusters for High Performance A Network-aware Scheduler in Data-parallel Clusters for High Performance Zhuozhao Li, Haiying Shen and Ankur Sarker Department of Computer Science University of Virginia May, 2018 1/61 Data-parallel clusters

More information

Apigee Edge Cloud - Bundles Spec Sheets

Apigee Edge Cloud - Bundles Spec Sheets Apigee Edge Cloud - Bundles Spec Sheets Description Apigee Edge Cloud is an API management platform to securely deliver and manage all APIs. Apigee Edge Cloud manages the API lifecycle with capabilities

More information

Getting started with AWS security

Getting started with AWS security Getting started with AWS security Take a prescriptive approach Stella Lee Manager, Enterprise Business Development $ 2 0 B + R E V E N U E R U N R A T E (Annualized from Q4 2017) 4 5 % Y / Y G R O W T

More information

DATABASES AND THE CLOUD. Gustavo Alonso Systems Group / ECC Dept. of Computer Science ETH Zürich, Switzerland

DATABASES AND THE CLOUD. Gustavo Alonso Systems Group / ECC Dept. of Computer Science ETH Zürich, Switzerland DATABASES AND THE CLOUD Gustavo Alonso Systems Group / ECC Dept. of Computer Science ETH Zürich, Switzerland AVALOQ Conference Zürich June 2011 Systems Group www.systems.ethz.ch Enterprise Computing Center

More information

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme LHC2673BU Clearing Cloud Confusion Nick King and Neal Elinski #VMworld #LHC2673BU Disclaimer This presentation may contain product features that are currently under development. This overview of new technology

More information

Built for Speed: Comparing Panoply and Amazon Redshift Rendering Performance Utilizing Tableau Visualizations

Built for Speed: Comparing Panoply and Amazon Redshift Rendering Performance Utilizing Tableau Visualizations Built for Speed: Comparing Panoply and Amazon Redshift Rendering Performance Utilizing Tableau Visualizations Table of contents Faster Visualizations from Data Warehouses 3 The Plan 4 The Criteria 4 Learning

More information

MAINVIEW Batch Optimizer. Data Accelerator Andy Andrews

MAINVIEW Batch Optimizer. Data Accelerator Andy Andrews MAINVIEW Batch Optimizer Data Accelerator Andy Andrews Can I push more workload through my existing hardware configuration? Batch window problems can often be reduced down to two basic problems:! Increasing

More information

Visual C# Tony Gaddis. Haywood Community College STARTING OUT WITH. Piyali Sengupta. Third Edition. Global Edition contributions by.

Visual C# Tony Gaddis. Haywood Community College STARTING OUT WITH. Piyali Sengupta. Third Edition. Global Edition contributions by. STARTING OUT WITH Visual C# 2012 Third Edition Global Edition Tony Gaddis Haywood Community College Global Edition contributions by Piyali Sengupta PEARSON Boston Columbus Indianapolis New York San Francisco

More information

The Red Belly Blockchain Experiments

The Red Belly Blockchain Experiments The Red Belly Blockchain Experiments Concurrent Systems Research Group, University of Sydney, Data-CSIRO Abstract In this paper, we present the largest experiment of a blockchain system to date. To achieve

More information

Wide-Area Spark Streaming: Automated Routing and Batch Sizing

Wide-Area Spark Streaming: Automated Routing and Batch Sizing Wide-Area Spark Streaming: Automated Routing and Batch Sizing Wenxin Li, Di Niu, Yinan Liu, Shuhao Liu, Baochun Li University of Toronto & Dalian University of Technology University of Alberta University

More information

CS15-319: Cloud Computing. Lecture 3 Course Project and Amazon AWS Majd Sakr and Mohammad Hammoud

CS15-319: Cloud Computing. Lecture 3 Course Project and Amazon AWS Majd Sakr and Mohammad Hammoud CS15-319: Cloud Computing Lecture 3 Course Project and Amazon AWS Majd Sakr and Mohammad Hammoud Lecture Outline Discussion On Course Project Amazon Web Services 2 Course Project Course Project Phase I-A

More information

VMware Cloud on AWS Adoption in the Enterprise

VMware Cloud on AWS Adoption in the Enterprise HYP3920BUS VMware Cloud on AWS Adoption in the Enterprise Kit Colbert, VMware, Inc. Sandy Carter, Amazon Web Services Chuck Hoppenrath, Discovery Communications #vmworld #HYP3920BUS Disclaimer This presentation

More information

Dynamic Resource Allocation for Distributed Dataflows. Lauritz Thamsen Technische Universität Berlin

Dynamic Resource Allocation for Distributed Dataflows. Lauritz Thamsen Technische Universität Berlin Dynamic Resource Allocation for Distributed Dataflows Lauritz Thamsen Technische Universität Berlin 04.05.2018 Distributed Dataflows E.g. MapReduce, SCOPE, Spark, and Flink Used for scalable processing

More information

Apache Kylin. OLAP on Hadoop

Apache Kylin. OLAP on Hadoop Apache Kylin OLAP on Hadoop Agenda What s Apache Kylin? Tech Highlights Performance Roadmap Q & A http://kylin.io What s Kylin kylin / ˈkiːˈlɪn / 麒麟 --n. (in Chinese art) a mythical animal of composite

More information

Big Data Using Hadoop

Big Data Using Hadoop IEEE 2016-17 PROJECT LIST(JAVA) Big Data Using Hadoop 17ANSP-BD-001 17ANSP-BD-002 Hadoop Performance Modeling for JobEstimation and Resource Provisioning MapReduce has become a major computing model for

More information

Exploiting Inter-Flow Relationship for Coflow Placement in Data Centers. Xin Sunny Huang, T. S. Eugene Ng Rice University

Exploiting Inter-Flow Relationship for Coflow Placement in Data Centers. Xin Sunny Huang, T. S. Eugene Ng Rice University Exploiting Inter-Flow Relationship for Coflow Placement in Data Centers Xin Sunny Huang, T S Eugene g Rice University This Work Optimizing Coflow performance has many benefits such as avoiding application

More information

Is Explicit Congestion Notification usable with UDP? Stephen McQuistin and Colin Perkins

Is Explicit Congestion Notification usable with UDP? Stephen McQuistin and Colin Perkins Is Explicit Congestion Notification usable with UDP? Stephen McQuistin and Colin Perkins Introduction Explicit congestion notification (ECN) ECN with TCP/IP long established: RFC 3168 Extends IP to mark

More information

Statistics Driven Workload Modeling for the Cloud

Statistics Driven Workload Modeling for the Cloud UC Berkeley Statistics Driven Workload Modeling for the Cloud Archana Ganapathi, Yanpei Chen Armando Fox, Randy Katz, David Patterson SMDB 2010 Data analytics are moving to the cloud Cloud computing economy

More information

Complete. The. Reference. Christopher Adamson. Mc Grauu. LlLIJBB. New York Chicago. San Francisco Lisbon London Madrid Mexico City

Complete. The. Reference. Christopher Adamson. Mc Grauu. LlLIJBB. New York Chicago. San Francisco Lisbon London Madrid Mexico City The Complete Reference Christopher Adamson Mc Grauu LlLIJBB New York Chicago San Francisco Lisbon London Madrid Mexico City Milan New Delhi San Juan Seoul Singapore Sydney Toronto Contents Acknowledgments

More information

Database Concepts. David M. Kroenke UNIVERSITATSBIBLIOTHEK HANNOVER

Database Concepts. David M. Kroenke UNIVERSITATSBIBLIOTHEK HANNOVER Database Concepts Fifth Edition David M. Kroenke David J. Auer ^111 I ii i.111 111 n.n jiiim^ TECHNISCHE INFORMATIOMSBiBLIOTHEK UNIVERSITATSBIBLIOTHEK HANNOVER j TIB/UB Hannover Prentice Hall Boston Columbus

More information

IT has now become commonly accepted that the volume of

IT has now become commonly accepted that the volume of 1 Time- and Cost- Efficient Task Scheduling Across Geo-Distributed Data Centers Zhiming Hu, Member, IEEE, Baochun Li, Fellow, IEEE, and Jun Luo, Member, IEEE Abstract Typically called big data processing,

More information

Popularity Prediction of Facebook Videos for Higher Quality Streaming

Popularity Prediction of Facebook Videos for Higher Quality Streaming Popularity Prediction of Facebook Videos for Higher Quality Streaming Linpeng Tang Qi Huang, Amit Puntambekar Ymir Vigfusson, Wyatt Lloyd, Kai Li 1 Videos are Central to Facebook 8 billion views per day

More information

Concurrent execution of an analytical workload on a POWER8 server with K40 GPUs A Technology Demonstration

Concurrent execution of an analytical workload on a POWER8 server with K40 GPUs A Technology Demonstration Concurrent execution of an analytical workload on a POWER8 server with K40 GPUs A Technology Demonstration Sina Meraji sinamera@ca.ibm.com Berni Schiefer schiefer@ca.ibm.com Tuesday March 17th at 12:00

More information

MULTI-THREADED QUERIES

MULTI-THREADED QUERIES 15-721 Project 3 Final Presentation MULTI-THREADED QUERIES Wendong Li (wendongl) Lu Zhang (lzhang3) Rui Wang (ruiw1) Project Objective Intra-operator parallelism Use multiple threads in a single executor

More information

TPC-E testing of Microsoft SQL Server 2016 on Dell EMC PowerEdge R830 Server and Dell EMC SC9000 Storage

TPC-E testing of Microsoft SQL Server 2016 on Dell EMC PowerEdge R830 Server and Dell EMC SC9000 Storage TPC-E testing of Microsoft SQL Server 2016 on Dell EMC PowerEdge R830 Server and Dell EMC SC9000 Storage Performance Study of Microsoft SQL Server 2016 Dell Engineering February 2017 Table of contents

More information

Scalable Tools - Part I Introduction to Scalable Tools

Scalable Tools - Part I Introduction to Scalable Tools Scalable Tools - Part I Introduction to Scalable Tools Adisak Sukul, Ph.D., Lecturer, Department of Computer Science, adisak@iastate.edu http://web.cs.iastate.edu/~adisak/mbds2018/ Scalable Tools session

More information

Pocket: Elastic Ephemeral Storage for Serverless Analytics

Pocket: Elastic Ephemeral Storage for Serverless Analytics Pocket: Elastic Ephemeral Storage for Serverless Analytics Ana Klimovic*, Yawen Wang*, Patrick Stuedi +, Animesh Trivedi +, Jonas Pfefferle +, Christos Kozyrakis* *Stanford University, + IBM Research 1

More information

Hyper scale Infrastructure is the enabler

Hyper scale Infrastructure is the enabler Hyper scale Infrastructure is the enabler 100+ Datacenters across 34 Regions Worldwide US DoD West TBD US Gov Iowa West US California Central US Iowa South Central US Texas North Central US Illinois Canada

More information

Accelerate Big Data Insights

Accelerate Big Data Insights Accelerate Big Data Insights Executive Summary An abundance of information isn t always helpful when time is of the essence. In the world of big data, the ability to accelerate time-to-insight can not

More information

V Conclusions. V.1 Related work

V Conclusions. V.1 Related work V Conclusions V.1 Related work Even though MapReduce appears to be constructed specifically for performing group-by aggregations, there are also many interesting research work being done on studying critical

More information

Unit 7: Basics in MS Power BI for Excel 2013 M7-5: OLAP

Unit 7: Basics in MS Power BI for Excel 2013 M7-5: OLAP Unit 7: Basics in MS Power BI for Excel M7-5: OLAP Outline: Introduction Learning Objectives Content Exercise What is an OLAP Table Operations: Drill Down Operations: Roll Up Operations: Slice Operations:

More information

Cloud Spanner. Rohit Gupta, Solutions

Cloud Spanner. Rohit Gupta, Solutions Cloud Spanner Rohit Gupta, Solutions Engineer @rohitforcloud Today s goals Provide a brief history of Spanner at Google Provide an explanation of Cloud Spanner Do a demo! Built on the same infrastructure

More information

Real-Time Systems and Programming Languages

Real-Time Systems and Programming Languages Real-Time Systems and Programming Languages Ada, Real-Time Java and C/Real-Time POSIX Fourth Edition Alan Burns and Andy Wellings University of York * ADDISON-WESLEY An imprint of Pearson Education Harlow,

More information

Building Interconnection 2017 Steps Taken & 2018 Plans

Building Interconnection 2017 Steps Taken & 2018 Plans Building Interconnection 2017 Steps Taken & 2018 Plans 2017 Equinix Inc. 2017 Key Highlights Expansion - new markets Launch - Flexible DataCentre Hyperscaler edge Rollout - IXEverywhere - SaaS, IoT & Ecosystems

More information

Data Collection, Preprocessing and Implementation

Data Collection, Preprocessing and Implementation Chapter 6 Data Collection, Preprocessing and Implementation 6.1 Introduction Data collection is the loosely controlled method of gathering the data. Such data are mostly out of range, impossible data combinations,

More information

Optimizing Grouped Aggregation in Geo- Distributed Streaming Analytics

Optimizing Grouped Aggregation in Geo- Distributed Streaming Analytics Optimizing Grouped Aggregation in Geo- Distributed Streaming Analytics Benjamin Heintz, Abhishek Chandra University of Minnesota Ramesh K. Sitaraman UMass Amherst & Akamai Technologies Wide- Area Streaming

More information

Improving the MapReduce Big Data Processing Framework

Improving the MapReduce Big Data Processing Framework Improving the MapReduce Big Data Processing Framework Gistau, Reza Akbarinia, Patrick Valduriez INRIA & LIRMM, Montpellier, France In collaboration with Divyakant Agrawal, UCSB Esther Pacitti, UM2, LIRMM

More information

PROBLEM SOLVING USING JAVA WITH DATA STRUCTURES. A Multimedia Approach. Mark Guzdial and Barbara Ericson PEARSON. College of Computing

PROBLEM SOLVING USING JAVA WITH DATA STRUCTURES. A Multimedia Approach. Mark Guzdial and Barbara Ericson PEARSON. College of Computing PROBLEM SOLVING WITH DATA STRUCTURES USING JAVA A Multimedia Approach Mark Guzdial and Barbara Ericson College of Computing Georgia Institute of Technology PEARSON Boston Columbus Indianapolis New York

More information

Introduction to Amazon Lumberyard and GameLift

Introduction to Amazon Lumberyard and GameLift Introduction to Amazon Lumberyard and GameLift Peter Chapman, Solutions Architect chappete@amazon.com 3/7/2017 A Free AAA Game Engine Deeply Integrated with AWS and Twitch Lumberyard Vision A free, AAA

More information

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development:: Title Duration : Apache Spark Development : 4 days Overview Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized

More information

Cloud Transformation and Significance of Security

Cloud Transformation and Significance of Security Cloud Transformation and Significance of Security Mohit Sharma, Chief Architect & Cloud Evangelist @onlinesince2009 www.cloudsec.com Datacenter Management Change Management Policy Physical Network Management

More information