Optimizing Grouped Aggregation in Geo- Distributed Streaming Analytics

Size: px
Start display at page:

Download "Optimizing Grouped Aggregation in Geo- Distributed Streaming Analytics"

Transcription

1 Optimizing Grouped Aggregation in Geo- Distributed Streaming Analytics Benjamin Heintz, Abhishek Chandra University of Minnesota Ramesh K. Sitaraman UMass Amherst & Akamai Technologies

2 Wide- Area Streaming Analytics Geo- distributed Streams Video streaming: client logs CDN: server logs Smart homes: sensor readings Real- time Analytics Audience behavior Web application usage Energy consumption Heintz, Chandra, Sitaraman HPDC 5

3 Wide- Area Streaming Analytics Geo- distributed Streams Video streaming: client logs CDN: server logs Smart homes: sensor readings Real- time Analytics Audience behavior Web application usage Energy consumption Hour- by- hour, how many unique users have streamed each video? How many bytes are served for each content provider every minute? Every 5 minutes, show me the average electrical demand by zip code. Heintz, Chandra, Sitaraman HPDC 5

4 Wide- Area Streaming Analytics Geo- distributed Streams Video streaming: client logs CDN: server logs Smart homes: sensor readings Real- time Analytics Audience behavior Web application usage Energy consumption WindowedGrouped Aggregation Hour- by- hour, how many unique users have streamed each video? How many bytes are served for each content provider every minute? Every 5 minutes, show me the average electrical demand by zip code. Heintz, Chandra, Sitaraman HPDC 5

5 Windowed Grouped Aggregation Example For each minute Window ß windowed For each content_provider_id ß grouped Sum bytes_transferred ß aggregation Widely used General aggregation sum, count average, standard deviation (approximate) sets, counts, Heintz, Chandra, Sitaraman HPDC 5 5

6 System Model Hub- and- Spoke Infrastructure Compute at edges and center Queries served from center Server Central Data Warehouse Server Server Server Heintz, Chandra, Sitaraman HPDC 5 6

7 Metrics WAN System- level measure of cost Major OPEX component Staleness User- level measure of quality Delay in getting final results Often an SLA component window begins t window ends t+w staleness results available t+w+s time Heintz, Chandra, Sitaraman HPDC 5 7

8 Our Goal Algorithms to jointly minimize staleness & traffic Algorithm Input: sequence of arrivals Output: sequence of updates : total number of updates Staleness: delay until last update reaches center Heintz, Chandra, Sitaraman HPDC 5 8

9 Naïve Algorithms Pure streaming immediately flushes upon arrival Excessive traffic Pure batching flushes only after end of window Excessive staleness Heintz, Chandra, Sitaraman HPDC 5 9

10 A Running Example Count words in the stream fast data is fast Network: aggregate / second Window length: 8 seconds Heintz, Chandra, Sitaraman HPDC 5

11 Pure Streaming : 6 8 Heintz, Chandra, Sitaraman HPDC 5

12 Pure Streaming : 6 8 Flush immediately Heintz, Chandra, Sitaraman HPDC 5

13 Pure Streaming : 6 8 fast Heintz, Chandra, Sitaraman HPDC 5

14 Pure Streaming : 6 8 fast Heintz, Chandra, Sitaraman HPDC 5

15 Pure Streaming : 6 8 fast data Heintz, Chandra, Sitaraman HPDC 5 5

16 Pure Streaming : 6 8 fast data Heintz, Chandra, Sitaraman HPDC 5 6

17 Pure Streaming : fast data is Heintz, Chandra, Sitaraman HPDC 5 7

18 Pure Streaming : fast data is Heintz, Chandra, Sitaraman HPDC 5 8

19 Pure Streaming : fast data is Heintz, Chandra, Sitaraman HPDC 5 9

20 Pure Streaming : Staleness: : fast data is Heintz, Chandra, Sitaraman HPDC 5

21 Pure Batching : 6 8 fast Heintz, Chandra, Sitaraman HPDC 5

22 Pure Batching : 6 8 fast data Heintz, Chandra, Sitaraman HPDC 5

23 Pure Batching : 6 8 fast data is Heintz, Chandra, Sitaraman HPDC 5

24 Pure Batching : fast data is Heintz, Chandra, Sitaraman HPDC 5

25 Pure Batching : fast data is Heintz, Chandra, Sitaraman HPDC 5 5

26 Pure Batching : fast data is ` fast Heintz, Chandra, Sitaraman HPDC 5 6

27 Pure Batching : data ` fast data is Heintz, Chandra, Sitaraman HPDC 5 7

28 Pure Batching : 6 8 ` fast data is is Heintz, Chandra, Sitaraman HPDC 5 8

29 Pure Batching : is 6 8 Staleness: : ` fast data is Heintz, Chandra, Sitaraman HPDC 5 9

30 Pure Batching : is 6 8 Staleness: fast Naïve Algorithms: `Staleness vs. data : is Heintz, Chandra, Sitaraman HPDC 5

31 Roadmap Naïve algorithms Optimal offline algorithms Practical online algorithms Experimental evaluation Heintz, Chandra, Sitaraman HPDC 5

32 Eager Offline Optimal Idea: flush immediately after last arrival Properties: Flushes exactly once per key (traffic optimal) Flushes at the earliest possible time (staleness optimal) Heintz, Chandra, Sitaraman HPDC 5

33 Eager Offline Optimal : 6 8 fast Heintz, Chandra, Sitaraman HPDC 5

34 Eager Offline Optimal : 6 8 fast Heintz, Chandra, Sitaraman HPDC 5

35 Eager Offline Optimal : fast 6 8 Flush immediately upon last arrival Heintz, Chandra, Sitaraman HPDC 5 5

36 Eager Offline Optimal : 6 8 fast data Heintz, Chandra, Sitaraman HPDC 5 6

37 Eager Offline Optimal : 6 8 fast data Heintz, Chandra, Sitaraman HPDC 5 7

38 Eager Offline Optimal : fast data is Heintz, Chandra, Sitaraman HPDC 5 8

39 Eager Offline Optimal : fast data is Heintz, Chandra, Sitaraman HPDC 5 9

40 Eager Offline Optimal : fast ` fast data is Heintz, Chandra, Sitaraman HPDC 5

41 Eager Offline Optimal : Staleness: : fast data is Heintz, Chandra, Sitaraman HPDC 5

42 Lazy Offline Optimal Idea: flush as late as possible after last arrival Properties Still one flush per key (traffic optimal) Staleness optimal by construction Heintz, Chandra, Sitaraman HPDC 5

43 Lazy Offline Optimal : 6 8 fast Heintz, Chandra, Sitaraman HPDC 5

44 Lazy Offline Optimal : 6 8 fast data Heintz, Chandra, Sitaraman HPDC 5

45 Lazy Offline Optimal : 6 8 fast data is Heintz, Chandra, Sitaraman HPDC 5 5

46 Lazy Offline Optimal : 5 fast data is 6 8 Flush as late as ` possible after last arrival Heintz, Chandra, Sitaraman HPDC 5 6

47 Lazy Offline Optimal : fast data ` data is Heintz, Chandra, Sitaraman HPDC 5 7

48 Lazy Offline Optimal : fast data is Heintz, Chandra, Sitaraman HPDC 5 8

49 Lazy Offline Optimal : fast ` data is is Heintz, Chandra, Sitaraman HPDC 5 9

50 Lazy Offline Optimal : fast fast data is Heintz, Chandra, Sitaraman HPDC 5 5

51 Lazy Offline Optimal : Staleness: : fast data is Heintz, Chandra, Sitaraman HPDC 5 5

52 Family of Optimal Algorithms Eager Lazy 6 8 Heintz, Chandra, Sitaraman HPDC 5 5

53 Roadmap Naïve algorithms Optimal offline algorithms Practical online algorithms Experimental evaluation Heintz, Chandra, Sitaraman HPDC 5 5

54 Caching Analogy Insight: Treat the edge as a cache For staleness When to flush? Dynamic sizing policy For traffic What to flush? Eviction policy fast data is Heintz, Chandra, Sitaraman HPDC 5 5

55 Practical Online Algorithms Dynamic sizing Emulate the offline optimal cache sizes Eager: predict how many keys will arrive again Lazy: predict how late we can defer flushing Heintz, Chandra, Sitaraman HPDC 5 55

56 Practical Online Algorithms Dynamic sizing Emulate the offline optimal cache sizes Eager: predict how many keys will arrive again Lazy: predict how late we can defer flushing Eviction Don t reinvent the wheel Leverage existing work (LRU, LFU, ) Heintz, Chandra, Sitaraman HPDC 5 56

57 Hybrid Online Algorithm Eager: better to overestimate size Many early evictions Risk: incorrect eviction decisions Lazy: better to underestimate size Risk: bandwidth misprediction Be pessimistic Hybrid: linear combination of eager, lazy cache sizes Heintz, Chandra, Sitaraman HPDC 5 57

58 Roadmap Naïve algorithms Optimal offline algorithms Practical online algorithms Experimental evaluation Heintz, Chandra, Sitaraman HPDC 5 58

59 Experiments: Data Set Month- long Akamai download analytics beacon log Name Query Size small (cpid, bw) bytes downloaded (sum) medium (cpid, bw, country_code) bytes downloaded (moments) large (cpid, bw, url) client ip (HyperLogLog) O( ) groups O( ) groups O( 6 ) groups Heintz, Chandra, Sitaraman HPDC 5 59

60 Experiments: Implementation Implemented algorithms in Apache Storm One Storm cluster at center, each edge Seven total PlanetLab sites (washington.edu) (wisc.edu) (csuohio.edu) (uwaterloo.ca) (yale.edu) Replayer Logs Summer Summer Summer Aggregator Reorderer SocketSender (princeton.edu) (ucla.edu) Summer SocketReceiver Reorderer Summer Summer StatsCollector Aggregator Heintz, Chandra, Sitaraman HPDC 5 6

61 Experimental Evaluation: Single- Normalized Mean Batching Hybrid(.5) Streaming Optimal Normalized 95th- %ile Staleness Staleness Batching Hybrid(.5) Streaming Optimal Heintz, Chandra, Sitaraman HPDC 5 6

62 Experimental Evaluation: Single- Normalized Mean Batching Hybrid(.5) Streaming Optimal Near- optimal traffic Normalized 95th- %ile Staleness Batching Hybrid(.5) Streaming Optimal Staleness 65%. Significant staleness reduction Heintz, Chandra, Sitaraman HPDC 5 6

63 Experimental Evaluation: Multi- ( edges) Staleness Batching Hybrid(.5) Streaming Optimal Batching Hybrid(.5) Streaming Optimal Normalized Mean Normalized 95th- %ile Staleness... Heintz, Chandra, Sitaraman HPDC 5 6

64 Experimental Evaluation: Multi- (6 edges) Batching Hybrid(.5) Streaming Optimal Staleness Batching Hybrid(.5) Streaming Optimal Normalized Mean Low High Normalized 95th- %ile Staleness Low High Heintz, Chandra, Sitaraman HPDC 5 6

65 Related Work Systems Twitter Heron (SIGMOD 5) Apache Spark Streaming (SOSP ) Google MillWheel (VLDB ) Muppet (VLDB ) Wide- area Computing WANalytics (CIDR 5) JetStream (NSDI ) Pietzuch et al. (ICDE 6) Optimization Trade- offs BlinkDB (VLDB ) LazyBase (EuroSys ) Heintz, Chandra, Sitaraman HPDC 5 65

66 Conclusion Geo- distributed windowed, grouped aggregation Naïve algorithms: low staleness or low traffic Optimal offline algorithms to jointly minimize both Practical algorithms via caching Dynamic sizing policies Reuse existing eviction policies Heintz, Chandra, Sitaraman HPDC 5 66

Complex Interactions in Content Distribution Ecosystem and QoE

Complex Interactions in Content Distribution Ecosystem and QoE Complex Interactions in Content Distribution Ecosystem and QoE Zhi-Li Zhang Qwest Chair Professor & Distinguished McKnight University Professor Dept. of Computer Science & Eng., University of Minnesota

More information

Op#mizing MapReduce for Highly- Distributed Environments

Op#mizing MapReduce for Highly- Distributed Environments Op#mizing MapReduce for Highly- Distributed Environments Abhishek Chandra Associate Professor Department of Computer Science and Engineering University of Minnesota hep://www.cs.umn.edu/~chandra 1 Big

More information

Aggregation and Degradation in JetStream: Streaming Analytics in the Wide Area

Aggregation and Degradation in JetStream: Streaming Analytics in the Wide Area Aggregation and Degradation in JetStream: Streaming Analytics in the Wide Area Ariel Rabkin Princeton University asrabkin@cs.princeton.edu Work done with Matvey Arye, Siddhartha Sen, Vivek S. Pai, and

More information

TripS: Automated Multi-tiered Data Placement in a Geo-distributed Cloud Environment

TripS: Automated Multi-tiered Data Placement in a Geo-distributed Cloud Environment TripS: Automated Multi-tiered Data Placement in a Geo-distributed Cloud Environment Kwangsung Oh, Abhishek Chandra, and Jon Weissman Department of Computer Science and Engineering University of Minnesota

More information

LazyBase: Trading freshness and performance in a scalable database

LazyBase: Trading freshness and performance in a scalable database LazyBase: Trading freshness and performance in a scalable database (EuroSys 2012) Jim Cipar, Greg Ganger, *Kimberly Keeton, *Craig A. N. Soules, *Brad Morrey, *Alistair Veitch PARALLEL DATA LABORATORY

More information

Aggregation and Degradation in JetStream: Streaming analytics in the wide area

Aggregation and Degradation in JetStream: Streaming analytics in the wide area NSDI 2014 Aggregation and Degradation in JetStream: Streaming analytics in the wide area Ariel Rabkin, Matvey Arye, Siddhartha Sen, Vivek S. Pai, and Michael J. Freedman Princeton University 报告人 : 申毅杰

More information

An Implementation of Fog Computing Attributes in an IoT Environment

An Implementation of Fog Computing Attributes in an IoT Environment An Implementation of Fog Computing Attributes in an IoT Environment Ranjit Deshpande CTO K2 Inc. Introduction Ranjit Deshpande CTO K2 Inc. K2 Inc. s end-to-end IoT platform Transforms Sensor Data into

More information

Coflow. Recent Advances and What s Next? Mosharaf Chowdhury. University of Michigan

Coflow. Recent Advances and What s Next? Mosharaf Chowdhury. University of Michigan Coflow Recent Advances and What s Next? Mosharaf Chowdhury University of Michigan Rack-Scale Computing Datacenter-Scale Computing Geo-Distributed Computing Coflow Networking Open Source Apache Spark Open

More information

Data Analytics with HPC. Data Streaming

Data Analytics with HPC. Data Streaming Data Analytics with HPC Data Streaming Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

On the Analysis of Caches with Pending Interest Tables

On the Analysis of Caches with Pending Interest Tables On the Analysis of Caches with Pending Interest Tables Mostafa Dehghan 1, Bo Jiang 1 Ali Dabirmoghaddam 2, Don Towsley 1 1 University of Massachusetts Amherst 2 University of California Santa Cruz ICN,

More information

Webinar Series TMIP VISION

Webinar Series TMIP VISION Webinar Series TMIP VISION TMIP provides technical support and promotes knowledge and information exchange in the transportation planning and modeling community. Today s Goals To Consider: Parallel Processing

More information

Efficient On-Demand Operations in Distributed Infrastructures

Efficient On-Demand Operations in Distributed Infrastructures Efficient On-Demand Operations in Distributed Infrastructures Steve Ko and Indranil Gupta Distributed Protocols Research Group University of Illinois at Urbana-Champaign 2 One-Line Summary We need to design

More information

An Efficient Execution Scheme for Designated Event-based Stream Processing

An Efficient Execution Scheme for Designated Event-based Stream Processing DEIM Forum 2014 D3-2 An Efficient Execution Scheme for Designated Event-based Stream Processing Yan Wang and Hiroyuki Kitagawa Graduate School of Systems and Information Engineering, University of Tsukuba

More information

Warehouse- Scale Computing and the BDAS Stack

Warehouse- Scale Computing and the BDAS Stack Warehouse- Scale Computing and the BDAS Stack Ion Stoica UC Berkeley UC BERKELEY Overview Workloads Hardware trends and implications in modern datacenters BDAS stack What is Big Data used For? Reports,

More information

Apache Flink Big Data Stream Processing

Apache Flink Big Data Stream Processing Apache Flink Big Data Stream Processing Tilmann Rabl Berlin Big Data Center www.dima.tu-berlin.de bbdc.berlin rabl@tu-berlin.de XLDB 11.10.2017 1 2013 Berlin Big Data Center All Rights Reserved DIMA 2017

More information

Tools for Social Networking Infrastructures

Tools for Social Networking Infrastructures Tools for Social Networking Infrastructures 1 Cassandra - a decentralised structured storage system Problem : Facebook Inbox Search hundreds of millions of users distributed infrastructure inbox changes

More information

Intelligent Services Serving Machine Learning

Intelligent Services Serving Machine Learning Intelligent Services Serving Machine Learning Joseph E. Gonzalez jegonzal@cs.berkeley.edu; Assistant Professor @ UC Berkeley joseph@dato.com; Co-Founder @ Dato Inc. Contemporary Learning Systems Big Data

More information

A Framework for Clustering Massive Text and Categorical Data Streams

A Framework for Clustering Massive Text and Categorical Data Streams A Framework for Clustering Massive Text and Categorical Data Streams Charu C. Aggarwal IBM T. J. Watson Research Center charu@us.ibm.com Philip S. Yu IBM T. J.Watson Research Center psyu@us.ibm.com Abstract

More information

Datacenter replication solution with quasardb

Datacenter replication solution with quasardb Datacenter replication solution with quasardb Technical positioning paper April 2017 Release v1.3 www.quasardb.net Contact: sales@quasardb.net Quasardb A datacenter survival guide quasardb INTRODUCTION

More information

Energy Consumption in Mobile Phones: A Measurement Study and Implications for Network Applications (IMC09)

Energy Consumption in Mobile Phones: A Measurement Study and Implications for Network Applications (IMC09) Energy Consumption in Mobile Phones: A Measurement Study and Implications for Network Applications (IMC09) Niranjan Balasubramanian Aruna Balasubramanian Arun Venkataramani University of Massachusetts

More information

Microsoft Exam

Microsoft Exam Volume: 42 Questions Case Study: 1 Relecloud General Overview Relecloud is a social media company that processes hundreds of millions of social media posts per day and sells advertisements to several hundred

More information

Architecting Microsoft Azure Solutions (proposed exam 535)

Architecting Microsoft Azure Solutions (proposed exam 535) Architecting Microsoft Azure Solutions (proposed exam 535) IMPORTANT: Significant changes are in progress for exam 534 and its content. As a result, we are retiring this exam on December 31, 2017, and

More information

Testing & Assuring Mobile End User Experience Before Production Neotys

Testing & Assuring Mobile End User Experience Before Production Neotys Testing & Assuring Mobile End User Experience Before Production Neotys Henrik Rexed Agenda Introduction The challenges Best practices NeoLoad mobile capabilities Mobile devices are used more and more At

More information

Flexible Wide Area Consistency Management Sai Susarla

Flexible Wide Area Consistency Management Sai Susarla Flexible Wide Area Consistency Management Sai Susarla The Problem Wide area services are not alike Different consistency & availability requirements But many common issues & mechanisms Concurrency control,

More information

Introduction. Application Performance in the QLinux Multimedia Operating System. Solution: QLinux. Introduction. Outline. QLinux Design Principles

Introduction. Application Performance in the QLinux Multimedia Operating System. Solution: QLinux. Introduction. Outline. QLinux Design Principles Application Performance in the QLinux Multimedia Operating System Sundaram, A. Chandra, P. Goyal, P. Shenoy, J. Sahni and H. Vin Umass Amherst, U of Texas Austin ACM Multimedia, 2000 Introduction General

More information

SCALABLE, LOW LATENCY MODEL SERVING AND MANAGEMENT WITH VELOX

SCALABLE, LOW LATENCY MODEL SERVING AND MANAGEMENT WITH VELOX THE MISSING PIECE IN COMPLEX ANALYTICS: SCALABLE, LOW LATENCY MODEL SERVING AND MANAGEMENT WITH VELOX Daniel Crankshaw, Peter Bailis, Joseph Gonzalez, Haoyuan Li, Zhao Zhang, Ali Ghodsi, Michael Franklin,

More information

Efficient Point to Multipoint Transfers Across Datacenters

Efficient Point to Multipoint Transfers Across Datacenters Efficient Point to Multipoint Transfers cross Datacenters Mohammad Noormohammadpour 1, Cauligi S. Raghavendra 1, Sriram Rao, Srikanth Kandula 1 University of Southern California, Microsoft Source: https://azure.microsoft.com/en-us/overview/datacenters/how-to-choose/

More information

Footprint Descriptors: Theory and Practice of Cache Provisioning in a Global CDN

Footprint Descriptors: Theory and Practice of Cache Provisioning in a Global CDN Footprint Descriptors: Theory and Practice of Cache Provisioning in a Global CDN ABSTRACT Aditya Sundarrajan University of Massachusetts Amherst Mangesh Kasbekar Akamai Technologies Modern CDNs cache and

More information

Website Acceleration with mod_pagespeed

Website Acceleration with mod_pagespeed Website Acceleration with mod_pagespeed Joshua Marantz Google June 15, 2011 @jmarantz www.modpagespeed.com 2011 Google, Inc. All rights reserved. Velocity 2011: Faster By Default 2 Velocity 2011: Faster

More information

Data Warehouses Chapter 12. Class 10: Data Warehouses 1

Data Warehouses Chapter 12. Class 10: Data Warehouses 1 Data Warehouses Chapter 12 Class 10: Data Warehouses 1 OLTP vs OLAP Operational Database: a database designed to support the day today transactions of an organization Data Warehouse: historical data is

More information

Microsoft Perform Data Engineering on Microsoft Azure HDInsight.

Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Microsoft 70-775 Perform Data Engineering on Microsoft Azure HDInsight http://killexams.com/pass4sure/exam-detail/70-775 QUESTION: 30 You are building a security tracking solution in Apache Kafka to parse

More information

ECE7995 Caching and Prefetching Techniques in Computer Systems. Lecture 8: Buffer Cache in Main Memory (I)

ECE7995 Caching and Prefetching Techniques in Computer Systems. Lecture 8: Buffer Cache in Main Memory (I) ECE7995 Caching and Prefetching Techniques in Computer Systems Lecture 8: Buffer Cache in Main Memory (I) 1 Review: The Memory Hierarchy Take advantage of the principle of locality to present the user

More information

Microsoft. Exam Questions Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo

Microsoft. Exam Questions Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo Microsoft Exam Questions 70-775 Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo NEW QUESTION 1 HOTSPOT You install the Microsoft Hive ODBC Driver on a computer that runs Windows

More information

Flash Storage Complementing a Data Lake for Real-Time Insight

Flash Storage Complementing a Data Lake for Real-Time Insight Flash Storage Complementing a Data Lake for Real-Time Insight Dr. Sanhita Sarkar Global Director, Analytics Software Development August 7, 2018 Agenda 1 2 3 4 5 Delivering insight along the entire spectrum

More information

exam. Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Version 1.0

exam.   Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Version 1.0 70-775.exam Number: 70-775 Passing Score: 800 Time Limit: 120 min File Version: 1.0 Microsoft 70-775 Perform Data Engineering on Microsoft Azure HDInsight Version 1.0 Exam A QUESTION 1 You use YARN to

More information

CS 398 ACC Streaming. Prof. Robert J. Brunner. Ben Congdon Tyler Kim

CS 398 ACC Streaming. Prof. Robert J. Brunner. Ben Congdon Tyler Kim CS 398 ACC Streaming Prof. Robert J. Brunner Ben Congdon Tyler Kim MP3 How s it going? Final Autograder run: - Tonight ~9pm - Tomorrow ~3pm Due tomorrow at 11:59 pm. Latest Commit to the repo at the time

More information

Developing Microsoft Azure Solutions (70-532) Syllabus

Developing Microsoft Azure Solutions (70-532) Syllabus Developing Microsoft Azure Solutions (70-532) Syllabus Cloud Computing Introduction What is Cloud Computing Cloud Characteristics Cloud Computing Service Models Deployment Models in Cloud Computing Advantages

More information

microsoft

microsoft 70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series

More information

Rack-scale Data Processing System

Rack-scale Data Processing System Rack-scale Data Processing System Jana Giceva, Darko Makreshanski, Claude Barthels, Alessandro Dovis, Gustavo Alonso Systems Group, Department of Computer Science, ETH Zurich Rack-scale Data Processing

More information

A Cost-Space Approach to Distributed Query Optimization in Stream Based Overlays

A Cost-Space Approach to Distributed Query Optimization in Stream Based Overlays A Cost-Space Approach to Distributed Query Optimization in Stream Based Overlays Jeffrey Shneidman, Peter Pietzuch, Matt Welsh, Margo Seltzer, Mema Roussopoulos Systems Research Group Harvard University

More information

Streaming OLAP Applications

Streaming OLAP Applications Streaming OLAP Applications From square one to multi-gigabit streams and beyond C. Scott Andreas HPTS 2013 @cscotta Roadmap Framing the problem Four phases of an architecture s evolution Code: A general-purpose

More information

Pulsar. Realtime Analytics At Scale. Wang Xinglang

Pulsar. Realtime Analytics At Scale. Wang Xinglang Pulsar Realtime Analytics At Scale Wang Xinglang Agenda Pulsar : Real Time Analytics At ebay Business Use Cases Product Requirements Pulsar : Technology Deep Dive 2 Pulsar Business Use Case: Behavioral

More information

Lambda Architecture for Batch and Stream Processing. October 2018

Lambda Architecture for Batch and Stream Processing. October 2018 Lambda Architecture for Batch and Stream Processing October 2018 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document is provided for informational purposes only.

More information

Nowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype?

Nowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype? Big data hype? Big Data: Hype or Hallelujah? Data Base and Data Mining Group of 2 Google Flu trends On the Internet February 2010 detected flu outbreak two weeks ahead of CDC data Nowcasting http://www.internetlivestats.com/

More information

DRIZZLE: FAST AND Adaptable STREAM PROCESSING AT SCALE

DRIZZLE: FAST AND Adaptable STREAM PROCESSING AT SCALE DRIZZLE: FAST AND Adaptable STREAM PROCESSING AT SCALE Shivaram Venkataraman, Aurojit Panda, Kay Ousterhout, Michael Armbrust, Ali Ghodsi, Michael Franklin, Benjamin Recht, Ion Stoica STREAMING WORKLOADS

More information

Bohr: Similarity Aware Geo-distributed Data Analytics. Hangyu Li, Hong Xu, Sarana Nutanong City University of Hong Kong

Bohr: Similarity Aware Geo-distributed Data Analytics. Hangyu Li, Hong Xu, Sarana Nutanong City University of Hong Kong Bohr: Similarity Aware Geo-distributed Data Analytics Hangyu Li, Hong Xu, Sarana Nutanong City University of Hong Kong 1 Big Data Analytics Analysis Generate 2 Data are geo-distributed Frankfurt US Oregon

More information

Self-driving Datacenter: Analytics

Self-driving Datacenter: Analytics Self-driving Datacenter: Analytics George Boulescu Consulting Systems Engineer 19/10/2016 Alvin Toffler is a former associate editor of Fortune magazine, known for his works discussing the digital revolution,

More information

Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds

Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds Kevin Hsieh Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, Phillip B. Gibbons, Onur Mutlu Machine Learning and Big

More information

Distributed systems for stream processing

Distributed systems for stream processing Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall Alena Hall Large-scale data processing Distributed Systems Functional Programming Data Science & Machine

More information

RIPE75 - Network monitoring at scale. Louis Poinsignon

RIPE75 - Network monitoring at scale. Louis Poinsignon RIPE75 - Network monitoring at scale Louis Poinsignon Why monitoring and what to monitor? Why do we monitor? Billing Reducing costs Traffic engineering Where should we peer? Where should we set-up a new

More information

UNIFY DATA AT MEMORY SPEED. Haoyuan (HY) Li, Alluxio Inc. VAULT Conference 2017

UNIFY DATA AT MEMORY SPEED. Haoyuan (HY) Li, Alluxio Inc. VAULT Conference 2017 UNIFY DATA AT MEMORY SPEED Haoyuan (HY) Li, CEO @ Alluxio Inc. VAULT Conference 2017 March 2017 HISTORY Started at UC Berkeley AMPLab In Summer 2012 Originally named as Tachyon Rebranded to Alluxio in

More information

Wide-Area Spark Streaming: Automated Routing and Batch Sizing

Wide-Area Spark Streaming: Automated Routing and Batch Sizing Wide-Area Spark Streaming: Automated Routing and Batch Sizing Wenxin Li, Di Niu, Yinan Liu, Shuhao Liu, Baochun Li University of Toronto & Dalian University of Technology University of Alberta University

More information

Building an Internet-Scale Publish/Subscribe System

Building an Internet-Scale Publish/Subscribe System Building an Internet-Scale Publish/Subscribe System Ian Rose Mema Roussopoulos Peter Pietzuch Rohan Murty Matt Welsh Jonathan Ledlie Imperial College London Peter R. Pietzuch prp@doc.ic.ac.uk Harvard University

More information

HYBRID TRANSACTION/ANALYTICAL PROCESSING COLIN MACNAUGHTON

HYBRID TRANSACTION/ANALYTICAL PROCESSING COLIN MACNAUGHTON HYBRID TRANSACTION/ANALYTICAL PROCESSING COLIN MACNAUGHTON WHO IS NEEVE RESEARCH? Headquartered in Silicon Valley Creators of the X Platform - Memory Oriented Application Platform Passionate about high

More information

Building Durable Real-time Data Pipeline

Building Durable Real-time Data Pipeline Building Durable Real-time Data Pipeline Apache BookKeeper at Twitter @sijieg Twitter Background Layered Architecture Agenda Design Details Performance Scale @Twitter Q & A Publish-Subscribe Online services

More information

Cisco Tetration Analytics Platform: A Dive into Blazing Fast Deep Storage

Cisco Tetration Analytics Platform: A Dive into Blazing Fast Deep Storage White Paper Cisco Tetration Analytics Platform: A Dive into Blazing Fast Deep Storage What You Will Learn A Cisco Tetration Analytics appliance bundles computing, networking, and storage resources in one

More information

Portable stateful big data processing in Apache Beam

Portable stateful big data processing in Apache Beam Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC Software Engineer @ Google klk@google.com / @KennKnowles https://s.apache.org/ffsf-2017-beam-state Flink Forward San

More information

How Eventual is Eventual Consistency?

How Eventual is Eventual Consistency? Probabilistically Bounded Staleness How Eventual is Eventual Consistency? Peter Bailis, Shivaram Venkataraman, Michael J. Franklin, Joseph M. Hellerstein, Ion Stoica (UC Berkeley) BashoChats 002, 28 February

More information

Processing 11 billions events a day with Spark. Alexander Krasheninnikov

Processing 11 billions events a day with Spark. Alexander Krasheninnikov Processing 11 billions events a day with Spark Alexander Krasheninnikov Badoo facts 46 languages 10M Photos added daily 320M registered users 190 countries 21M daily active users 3000+ servers 2 data-centers

More information

Real-time data processing with Apache Flink

Real-time data processing with Apache Flink Real-time data processing with Apache Flink Gyula Fóra gyfora@apache.org Flink committer Swedish ICT Stream processing Data stream: Infinite sequence of data arriving in a continuous fashion. Stream processing:

More information

Hortonworks and The Internet of Things

Hortonworks and The Internet of Things Hortonworks and The Internet of Things Dr. Bernhard Walter Solutions Engineer About Hortonworks Customer Momentum ~700 customers (as of November 4, 2015) 152 customers added in Q3 2015 Publicly traded

More information

@Pentaho #BigDataWebSeries

@Pentaho #BigDataWebSeries Enterprise Data Warehouse Optimization with Hadoop Big Data @Pentaho #BigDataWebSeries Your Hosts Today Dave Henry SVP Enterprise Solutions Davy Nys VP EMEA & APAC 2 Source/copyright: The Human Face of

More information

Comparative Analysis of Range Aggregate Queries In Big Data Environment

Comparative Analysis of Range Aggregate Queries In Big Data Environment Comparative Analysis of Range Aggregate Queries In Big Data Environment Ranjanee S PG Scholar, Dept. of Computer Science and Engineering, Institute of Road and Transport Technology, Erode, TamilNadu, India.

More information

Feedback: Twitter: #TechTalk #wpo #io2011. Make The Web Faster. Joshua Marantz Richard Rabbat Håkon Wium Lie.

Feedback: Twitter:  #TechTalk #wpo #io2011. Make The Web Faster. Joshua Marantz Richard Rabbat Håkon Wium Lie. Feedback: Twitter: http://goo.gl/vf47i #TechTalk #wpo #io2011 Make The Web Faster Joshua Marantz Richard Rabbat Håkon Wium Lie May 10, 2011 Agenda mod_pagespeed Joshua Marantz Feedback: Twitter: http://goo.gl/vf47i

More information

TSAR A TimeSeries AggregatoR. Anirudh Todi TSAR

TSAR A TimeSeries AggregatoR. Anirudh Todi TSAR TSAR A TimeSeries AggregatoR Anirudh Todi Twitter @anirudhtodi TSAR What is TSAR? What is TSAR? TSAR is a framework and service infrastructure for specifying, deploying and operating timeseries aggregation

More information

Based on Big Data: Hype or Hallelujah? by Elena Baralis

Based on Big Data: Hype or Hallelujah? by Elena Baralis Based on Big Data: Hype or Hallelujah? by Elena Baralis http://dbdmg.polito.it/wordpress/wp-content/uploads/2010/12/bigdata_2015_2x.pdf 1 3 February 2010 Google detected flu outbreak two weeks ahead of

More information

Big Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018

Big Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018 Big Data com Hadoop Impala, Hive e Spark VIII Sessão - SQL Bahia 03/03/2018 Diógenes Pires Connect with PASS Sign up for a free membership today at: pass.org #sqlpass Internet Live http://www.internetlivestats.com/

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Logical Diagram of a Set-associative Cache Accessing a Cache

Logical Diagram of a Set-associative Cache Accessing a Cache Introduction Memory Hierarchy Why memory subsystem design is important CPU speeds increase 25%-30% per year DRAM speeds increase 2%-11% per year Levels of memory with different sizes & speeds close to

More information

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop

More information

LRC: Dependency-Aware Cache Management for Data Analytics Clusters. Yinghao Yu, Wei Wang, Jun Zhang, and Khaled B. Letaief IEEE INFOCOM 2017

LRC: Dependency-Aware Cache Management for Data Analytics Clusters. Yinghao Yu, Wei Wang, Jun Zhang, and Khaled B. Letaief IEEE INFOCOM 2017 LRC: Dependency-Aware Cache Management for Data Analytics Clusters Yinghao Yu, Wei Wang, Jun Zhang, and Khaled B. Letaief IEEE INFOCOM 2017 Outline Cache Management for Data Analytics Clusters Inefficiency

More information

Introduction. Memory Hierarchy

Introduction. Memory Hierarchy Introduction Why memory subsystem design is important CPU speeds increase 25%-30% per year DRAM speeds increase 2%-11% per year 1 Memory Hierarchy Levels of memory with different sizes & speeds close to

More information

Developing Enterprise Cloud Solutions with Azure

Developing Enterprise Cloud Solutions with Azure Developing Enterprise Cloud Solutions with Azure Java Focused 5 Day Course AUDIENCE FORMAT Developers and Software Architects Instructor-led with hands-on labs LEVEL 300 COURSE DESCRIPTION This course

More information

Developing Microsoft Azure Solutions (70-532) Syllabus

Developing Microsoft Azure Solutions (70-532) Syllabus Developing Microsoft Azure Solutions (70-532) Syllabus Cloud Computing Introduction What is Cloud Computing Cloud Characteristics Cloud Computing Service Models Deployment Models in Cloud Computing Advantages

More information

Opportunities for Exploiting Social Awareness in Overlay Networks. Bruce Maggs Duke University Akamai Technologies

Opportunities for Exploiting Social Awareness in Overlay Networks. Bruce Maggs Duke University Akamai Technologies Opportunities for Exploiting Social Awareness in Overlay Networks Bruce Maggs Duke University Akamai Technologies The Akamai Intelligent Platform A Global Platform: 127,000+ Servers 1,100+ Networks 2,500+

More information

Designing Hybrid Data Processing Systems for Heterogeneous Servers

Designing Hybrid Data Processing Systems for Heterogeneous Servers Designing Hybrid Data Processing Systems for Heterogeneous Servers Peter Pietzuch Large-Scale Distributed Systems (LSDS) Group Imperial College London http://lsds.doc.ic.ac.uk University

More information

RocksDB Key-Value Store Optimized For Flash

RocksDB Key-Value Store Optimized For Flash RocksDB Key-Value Store Optimized For Flash Siying Dong Software Engineer, Database Engineering Team @ Facebook April 20, 2016 Agenda 1 What is RocksDB? 2 RocksDB Design 3 Other Features What is RocksDB?

More information

Content Deployment and Caching Techniques in Africa

Content Deployment and Caching Techniques in Africa Content Deployment and Caching Techniques in Africa Presentation to the AfPIF Peering Coordinators Day Dar Es Salaam 2011 Mike Blanche 1 Agenda Current Content Hosting Situation Content Hosting Content

More information

Optimizing and Modeling SAP Business Analytics for SAP HANA. Iver van de Zand, Business Analytics

Optimizing and Modeling SAP Business Analytics for SAP HANA. Iver van de Zand, Business Analytics Optimizing and Modeling SAP Business Analytics for SAP HANA Iver van de Zand, Business Analytics Early data warehouse projects LIMITATIONS ISSUES RAISED Data driven by acquisition, not architecture Too

More information

Staying FIT: Efficient Load Shedding Techniques for Distributed Stream Processing

Staying FIT: Efficient Load Shedding Techniques for Distributed Stream Processing Staying FIT: Efficient Load Shedding Techniques for Distributed Stream Processing Nesime Tatbul Uğur Çetintemel Stan Zdonik Talk Outline Problem Introduction Approach Overview Advance Planning with an

More information

volley: automated data placement for geo-distributed cloud services

volley: automated data placement for geo-distributed cloud services volley: automated data placement for geo-distributed cloud services sharad agarwal, john dunagan, navendu jain, stefan saroiu, alec wolman, harbinder bhogan very rapid pace of datacenter rollout April

More information

Big Data on AWS. Peter-Mark Verwoerd Solutions Architect

Big Data on AWS. Peter-Mark Verwoerd Solutions Architect Big Data on AWS Peter-Mark Verwoerd Solutions Architect What to get out of this talk Non-technical: Big Data processing stages: ingest, store, process, visualize Hot vs. Cold data Low latency processing

More information

Measuring Performance of Complex Event Processing Systems

Measuring Performance of Complex Event Processing Systems Measuring Performance of Complex Event Processing Systems Torsten Grabs, Ming Lu Microsoft StreamInsight Microsoft Corp., Redmond, WA {torsteng, milu}@microsoft.com Agenda Motivation CEP systems and performance

More information

Activator Library. Focus on maximizing the value of your data, gain business insights, increase your team s productivity, and achieve success.

Activator Library. Focus on maximizing the value of your data, gain business insights, increase your team s productivity, and achieve success. Focus on maximizing the value of your data, gain business insights, increase your team s productivity, and achieve success. ACTIVATORS Designed to give your team assistance when you need it most without

More information

End-user mapping: Next-Generation Request Routing for Content Delivery

End-user mapping: Next-Generation Request Routing for Content Delivery Introduction End-user mapping: Next-Generation Request Routing for Content Delivery Fangfei Chen, Ramesh K. Sitaraman, Marcelo Torres ACM SIGCOMM Computer Communication Review. Vol. 45. No. 4. ACM, 2015

More information

Democratizing Content Publication with Coral

Democratizing Content Publication with Coral Democratizing Content Publication with Mike Freedman Eric Freudenthal David Mazières New York University NSDI 2004 A problem Feb 3: Google linked banner to julia fractals Users clicking directed to Australian

More information

AMP-Based Flow Collection. Greg Virgin - RedJack

AMP-Based Flow Collection. Greg Virgin - RedJack AMP-Based Flow Collection Greg Virgin - RedJack AMP- Based Flow Collection AMP - Analytic Metadata Producer : Patented US Government flow / metadata producer AMP generates data including Flows Host metadata

More information

Spark, Shark and Spark Streaming Introduction

Spark, Shark and Spark Streaming Introduction Spark, Shark and Spark Streaming Introduction Tushar Kale tusharkale@in.ibm.com June 2015 This Talk Introduction to Shark, Spark and Spark Streaming Architecture Deployment Methodology Performance References

More information

Improve Web Application Performance with Zend Platform

Improve Web Application Performance with Zend Platform Improve Web Application Performance with Zend Platform Shahar Evron Zend Sr. PHP Specialist Copyright 2007, Zend Technologies Inc. Agenda Benchmark Setup Comprehensive Performance Multilayered Caching

More information

Azure Webinar. Resilient Solutions March Sander van den Hoven Principal Technical Evangelist Microsoft

Azure Webinar. Resilient Solutions March Sander van den Hoven Principal Technical Evangelist Microsoft Azure Webinar Resilient Solutions March 2017 Sander van den Hoven Principal Technical Evangelist Microsoft DX @svandenhoven 1 What is resilience? Client Client API FrontEnd Client Client Client Loadbalancer

More information

Self Regulating Stream Processing in Heron

Self Regulating Stream Processing in Heron Self Regulating Stream Processing in Heron Huijun Wu 2017.12 Huijun Wu Twitter, Inc. Infrastructure, Data Platform, Real-Time Compute Heron Overview Recent Improvements Self Regulating Challenges Dhalion

More information

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Big Data Technology Ecosystem Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Agenda End-to-End Data Delivery Platform Ecosystem of Data Technologies Mapping an End-to-End Solution Case

More information

Energy-efficient disk caching for content delivery

Energy-efficient disk caching for content delivery Energy-efficient disk caching for content delivery Aditya Sundarrajan University of Massachusetts Amherst asundar@cs.umass.edu Mangesh Kasbekar Akamai Technologies mkasbeka@akamai.com Ramesh K. Sitaraman

More information

Mining Data Streams. From Data-Streams Management System Queries to Knowledge Discovery from continuous and fast-evolving Data Records.

Mining Data Streams. From Data-Streams Management System Queries to Knowledge Discovery from continuous and fast-evolving Data Records. DATA STREAMS MINING Mining Data Streams From Data-Streams Management System Queries to Knowledge Discovery from continuous and fast-evolving Data Records. Hammad Haleem Xavier Plantaz APPLICATIONS Sensors

More information

MixApart: Decoupled Analytics for Shared Storage Systems. Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp

MixApart: Decoupled Analytics for Shared Storage Systems. Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp MixApart: Decoupled Analytics for Shared Storage Systems Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp Hadoop Pig, Hive Hadoop + Enterprise storage?! Shared storage

More information

HDInsight > Hadoop. October 12, 2017

HDInsight > Hadoop. October 12, 2017 HDInsight > Hadoop October 12, 2017 2 Introduction Mark Hudson >20 years mixing technology with data >10 years with CapTech Microsoft Certified IT Professional Business Intelligence Member of the Richmond

More information

Infiniswap. Efficient Memory Disaggregation. Mosharaf Chowdhury. with Juncheng Gu, Youngmoon Lee, Yiwen Zhang, and Kang G. Shin

Infiniswap. Efficient Memory Disaggregation. Mosharaf Chowdhury. with Juncheng Gu, Youngmoon Lee, Yiwen Zhang, and Kang G. Shin Infiniswap Efficient Memory Disaggregation Mosharaf Chowdhury with Juncheng Gu, Youngmoon Lee, Yiwen Zhang, and Kang G. Shin Rack-Scale Computing Datacenter-Scale Computing Geo-Distributed Computing Coflow

More information

Packet-Level Network Analytics without Compromises NANOG 73, June 26th 2018, Denver, CO. Oliver Michel

Packet-Level Network Analytics without Compromises NANOG 73, June 26th 2018, Denver, CO. Oliver Michel Packet-Level Network Analytics without Compromises NANOG 73, June 26th 2018, Denver, CO Oliver Michel Network monitoring is important Security issues Performance issues Equipment failure Analytics Platform

More information

Counting is Hard: Probabilistically Counting Views at Reddit. Krishnan Chandra, Data Engineer

Counting is Hard: Probabilistically Counting Views at Reddit. Krishnan Chandra, Data Engineer Counting is Hard: Probabilistically Counting Views at Reddit Krishnan Chandra, Data Engineer What is probabilistic counting? Overview How did probabilistic counting help us scale? What issues did we face

More information

High-Performance Event Processing Bridging the Gap between Low Latency and High Throughput Bernhard Seeger University of Marburg

High-Performance Event Processing Bridging the Gap between Low Latency and High Throughput Bernhard Seeger University of Marburg High-Performance Event Processing Bridging the Gap between Low Latency and High Throughput Bernhard Seeger University of Marburg common work with Nikolaus Glombiewski, Michael Körber, Marc Seidemann 1.

More information