Optimizing Grouped Aggregation in Geo- Distributed Streaming Analytics
|
|
- Ophelia Morris
- 5 years ago
- Views:
Transcription
1 Optimizing Grouped Aggregation in Geo- Distributed Streaming Analytics Benjamin Heintz, Abhishek Chandra University of Minnesota Ramesh K. Sitaraman UMass Amherst & Akamai Technologies
2 Wide- Area Streaming Analytics Geo- distributed Streams Video streaming: client logs CDN: server logs Smart homes: sensor readings Real- time Analytics Audience behavior Web application usage Energy consumption Heintz, Chandra, Sitaraman HPDC 5
3 Wide- Area Streaming Analytics Geo- distributed Streams Video streaming: client logs CDN: server logs Smart homes: sensor readings Real- time Analytics Audience behavior Web application usage Energy consumption Hour- by- hour, how many unique users have streamed each video? How many bytes are served for each content provider every minute? Every 5 minutes, show me the average electrical demand by zip code. Heintz, Chandra, Sitaraman HPDC 5
4 Wide- Area Streaming Analytics Geo- distributed Streams Video streaming: client logs CDN: server logs Smart homes: sensor readings Real- time Analytics Audience behavior Web application usage Energy consumption WindowedGrouped Aggregation Hour- by- hour, how many unique users have streamed each video? How many bytes are served for each content provider every minute? Every 5 minutes, show me the average electrical demand by zip code. Heintz, Chandra, Sitaraman HPDC 5
5 Windowed Grouped Aggregation Example For each minute Window ß windowed For each content_provider_id ß grouped Sum bytes_transferred ß aggregation Widely used General aggregation sum, count average, standard deviation (approximate) sets, counts, Heintz, Chandra, Sitaraman HPDC 5 5
6 System Model Hub- and- Spoke Infrastructure Compute at edges and center Queries served from center Server Central Data Warehouse Server Server Server Heintz, Chandra, Sitaraman HPDC 5 6
7 Metrics WAN System- level measure of cost Major OPEX component Staleness User- level measure of quality Delay in getting final results Often an SLA component window begins t window ends t+w staleness results available t+w+s time Heintz, Chandra, Sitaraman HPDC 5 7
8 Our Goal Algorithms to jointly minimize staleness & traffic Algorithm Input: sequence of arrivals Output: sequence of updates : total number of updates Staleness: delay until last update reaches center Heintz, Chandra, Sitaraman HPDC 5 8
9 Naïve Algorithms Pure streaming immediately flushes upon arrival Excessive traffic Pure batching flushes only after end of window Excessive staleness Heintz, Chandra, Sitaraman HPDC 5 9
10 A Running Example Count words in the stream fast data is fast Network: aggregate / second Window length: 8 seconds Heintz, Chandra, Sitaraman HPDC 5
11 Pure Streaming : 6 8 Heintz, Chandra, Sitaraman HPDC 5
12 Pure Streaming : 6 8 Flush immediately Heintz, Chandra, Sitaraman HPDC 5
13 Pure Streaming : 6 8 fast Heintz, Chandra, Sitaraman HPDC 5
14 Pure Streaming : 6 8 fast Heintz, Chandra, Sitaraman HPDC 5
15 Pure Streaming : 6 8 fast data Heintz, Chandra, Sitaraman HPDC 5 5
16 Pure Streaming : 6 8 fast data Heintz, Chandra, Sitaraman HPDC 5 6
17 Pure Streaming : fast data is Heintz, Chandra, Sitaraman HPDC 5 7
18 Pure Streaming : fast data is Heintz, Chandra, Sitaraman HPDC 5 8
19 Pure Streaming : fast data is Heintz, Chandra, Sitaraman HPDC 5 9
20 Pure Streaming : Staleness: : fast data is Heintz, Chandra, Sitaraman HPDC 5
21 Pure Batching : 6 8 fast Heintz, Chandra, Sitaraman HPDC 5
22 Pure Batching : 6 8 fast data Heintz, Chandra, Sitaraman HPDC 5
23 Pure Batching : 6 8 fast data is Heintz, Chandra, Sitaraman HPDC 5
24 Pure Batching : fast data is Heintz, Chandra, Sitaraman HPDC 5
25 Pure Batching : fast data is Heintz, Chandra, Sitaraman HPDC 5 5
26 Pure Batching : fast data is ` fast Heintz, Chandra, Sitaraman HPDC 5 6
27 Pure Batching : data ` fast data is Heintz, Chandra, Sitaraman HPDC 5 7
28 Pure Batching : 6 8 ` fast data is is Heintz, Chandra, Sitaraman HPDC 5 8
29 Pure Batching : is 6 8 Staleness: : ` fast data is Heintz, Chandra, Sitaraman HPDC 5 9
30 Pure Batching : is 6 8 Staleness: fast Naïve Algorithms: `Staleness vs. data : is Heintz, Chandra, Sitaraman HPDC 5
31 Roadmap Naïve algorithms Optimal offline algorithms Practical online algorithms Experimental evaluation Heintz, Chandra, Sitaraman HPDC 5
32 Eager Offline Optimal Idea: flush immediately after last arrival Properties: Flushes exactly once per key (traffic optimal) Flushes at the earliest possible time (staleness optimal) Heintz, Chandra, Sitaraman HPDC 5
33 Eager Offline Optimal : 6 8 fast Heintz, Chandra, Sitaraman HPDC 5
34 Eager Offline Optimal : 6 8 fast Heintz, Chandra, Sitaraman HPDC 5
35 Eager Offline Optimal : fast 6 8 Flush immediately upon last arrival Heintz, Chandra, Sitaraman HPDC 5 5
36 Eager Offline Optimal : 6 8 fast data Heintz, Chandra, Sitaraman HPDC 5 6
37 Eager Offline Optimal : 6 8 fast data Heintz, Chandra, Sitaraman HPDC 5 7
38 Eager Offline Optimal : fast data is Heintz, Chandra, Sitaraman HPDC 5 8
39 Eager Offline Optimal : fast data is Heintz, Chandra, Sitaraman HPDC 5 9
40 Eager Offline Optimal : fast ` fast data is Heintz, Chandra, Sitaraman HPDC 5
41 Eager Offline Optimal : Staleness: : fast data is Heintz, Chandra, Sitaraman HPDC 5
42 Lazy Offline Optimal Idea: flush as late as possible after last arrival Properties Still one flush per key (traffic optimal) Staleness optimal by construction Heintz, Chandra, Sitaraman HPDC 5
43 Lazy Offline Optimal : 6 8 fast Heintz, Chandra, Sitaraman HPDC 5
44 Lazy Offline Optimal : 6 8 fast data Heintz, Chandra, Sitaraman HPDC 5
45 Lazy Offline Optimal : 6 8 fast data is Heintz, Chandra, Sitaraman HPDC 5 5
46 Lazy Offline Optimal : 5 fast data is 6 8 Flush as late as ` possible after last arrival Heintz, Chandra, Sitaraman HPDC 5 6
47 Lazy Offline Optimal : fast data ` data is Heintz, Chandra, Sitaraman HPDC 5 7
48 Lazy Offline Optimal : fast data is Heintz, Chandra, Sitaraman HPDC 5 8
49 Lazy Offline Optimal : fast ` data is is Heintz, Chandra, Sitaraman HPDC 5 9
50 Lazy Offline Optimal : fast fast data is Heintz, Chandra, Sitaraman HPDC 5 5
51 Lazy Offline Optimal : Staleness: : fast data is Heintz, Chandra, Sitaraman HPDC 5 5
52 Family of Optimal Algorithms Eager Lazy 6 8 Heintz, Chandra, Sitaraman HPDC 5 5
53 Roadmap Naïve algorithms Optimal offline algorithms Practical online algorithms Experimental evaluation Heintz, Chandra, Sitaraman HPDC 5 5
54 Caching Analogy Insight: Treat the edge as a cache For staleness When to flush? Dynamic sizing policy For traffic What to flush? Eviction policy fast data is Heintz, Chandra, Sitaraman HPDC 5 5
55 Practical Online Algorithms Dynamic sizing Emulate the offline optimal cache sizes Eager: predict how many keys will arrive again Lazy: predict how late we can defer flushing Heintz, Chandra, Sitaraman HPDC 5 55
56 Practical Online Algorithms Dynamic sizing Emulate the offline optimal cache sizes Eager: predict how many keys will arrive again Lazy: predict how late we can defer flushing Eviction Don t reinvent the wheel Leverage existing work (LRU, LFU, ) Heintz, Chandra, Sitaraman HPDC 5 56
57 Hybrid Online Algorithm Eager: better to overestimate size Many early evictions Risk: incorrect eviction decisions Lazy: better to underestimate size Risk: bandwidth misprediction Be pessimistic Hybrid: linear combination of eager, lazy cache sizes Heintz, Chandra, Sitaraman HPDC 5 57
58 Roadmap Naïve algorithms Optimal offline algorithms Practical online algorithms Experimental evaluation Heintz, Chandra, Sitaraman HPDC 5 58
59 Experiments: Data Set Month- long Akamai download analytics beacon log Name Query Size small (cpid, bw) bytes downloaded (sum) medium (cpid, bw, country_code) bytes downloaded (moments) large (cpid, bw, url) client ip (HyperLogLog) O( ) groups O( ) groups O( 6 ) groups Heintz, Chandra, Sitaraman HPDC 5 59
60 Experiments: Implementation Implemented algorithms in Apache Storm One Storm cluster at center, each edge Seven total PlanetLab sites (washington.edu) (wisc.edu) (csuohio.edu) (uwaterloo.ca) (yale.edu) Replayer Logs Summer Summer Summer Aggregator Reorderer SocketSender (princeton.edu) (ucla.edu) Summer SocketReceiver Reorderer Summer Summer StatsCollector Aggregator Heintz, Chandra, Sitaraman HPDC 5 6
61 Experimental Evaluation: Single- Normalized Mean Batching Hybrid(.5) Streaming Optimal Normalized 95th- %ile Staleness Staleness Batching Hybrid(.5) Streaming Optimal Heintz, Chandra, Sitaraman HPDC 5 6
62 Experimental Evaluation: Single- Normalized Mean Batching Hybrid(.5) Streaming Optimal Near- optimal traffic Normalized 95th- %ile Staleness Batching Hybrid(.5) Streaming Optimal Staleness 65%. Significant staleness reduction Heintz, Chandra, Sitaraman HPDC 5 6
63 Experimental Evaluation: Multi- ( edges) Staleness Batching Hybrid(.5) Streaming Optimal Batching Hybrid(.5) Streaming Optimal Normalized Mean Normalized 95th- %ile Staleness... Heintz, Chandra, Sitaraman HPDC 5 6
64 Experimental Evaluation: Multi- (6 edges) Batching Hybrid(.5) Streaming Optimal Staleness Batching Hybrid(.5) Streaming Optimal Normalized Mean Low High Normalized 95th- %ile Staleness Low High Heintz, Chandra, Sitaraman HPDC 5 6
65 Related Work Systems Twitter Heron (SIGMOD 5) Apache Spark Streaming (SOSP ) Google MillWheel (VLDB ) Muppet (VLDB ) Wide- area Computing WANalytics (CIDR 5) JetStream (NSDI ) Pietzuch et al. (ICDE 6) Optimization Trade- offs BlinkDB (VLDB ) LazyBase (EuroSys ) Heintz, Chandra, Sitaraman HPDC 5 65
66 Conclusion Geo- distributed windowed, grouped aggregation Naïve algorithms: low staleness or low traffic Optimal offline algorithms to jointly minimize both Practical algorithms via caching Dynamic sizing policies Reuse existing eviction policies Heintz, Chandra, Sitaraman HPDC 5 66
Complex Interactions in Content Distribution Ecosystem and QoE
Complex Interactions in Content Distribution Ecosystem and QoE Zhi-Li Zhang Qwest Chair Professor & Distinguished McKnight University Professor Dept. of Computer Science & Eng., University of Minnesota
More informationOp#mizing MapReduce for Highly- Distributed Environments
Op#mizing MapReduce for Highly- Distributed Environments Abhishek Chandra Associate Professor Department of Computer Science and Engineering University of Minnesota hep://www.cs.umn.edu/~chandra 1 Big
More informationAggregation and Degradation in JetStream: Streaming Analytics in the Wide Area
Aggregation and Degradation in JetStream: Streaming Analytics in the Wide Area Ariel Rabkin Princeton University asrabkin@cs.princeton.edu Work done with Matvey Arye, Siddhartha Sen, Vivek S. Pai, and
More informationTripS: Automated Multi-tiered Data Placement in a Geo-distributed Cloud Environment
TripS: Automated Multi-tiered Data Placement in a Geo-distributed Cloud Environment Kwangsung Oh, Abhishek Chandra, and Jon Weissman Department of Computer Science and Engineering University of Minnesota
More informationLazyBase: Trading freshness and performance in a scalable database
LazyBase: Trading freshness and performance in a scalable database (EuroSys 2012) Jim Cipar, Greg Ganger, *Kimberly Keeton, *Craig A. N. Soules, *Brad Morrey, *Alistair Veitch PARALLEL DATA LABORATORY
More informationAggregation and Degradation in JetStream: Streaming analytics in the wide area
NSDI 2014 Aggregation and Degradation in JetStream: Streaming analytics in the wide area Ariel Rabkin, Matvey Arye, Siddhartha Sen, Vivek S. Pai, and Michael J. Freedman Princeton University 报告人 : 申毅杰
More informationAn Implementation of Fog Computing Attributes in an IoT Environment
An Implementation of Fog Computing Attributes in an IoT Environment Ranjit Deshpande CTO K2 Inc. Introduction Ranjit Deshpande CTO K2 Inc. K2 Inc. s end-to-end IoT platform Transforms Sensor Data into
More informationCoflow. Recent Advances and What s Next? Mosharaf Chowdhury. University of Michigan
Coflow Recent Advances and What s Next? Mosharaf Chowdhury University of Michigan Rack-Scale Computing Datacenter-Scale Computing Geo-Distributed Computing Coflow Networking Open Source Apache Spark Open
More informationData Analytics with HPC. Data Streaming
Data Analytics with HPC Data Streaming Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationOn the Analysis of Caches with Pending Interest Tables
On the Analysis of Caches with Pending Interest Tables Mostafa Dehghan 1, Bo Jiang 1 Ali Dabirmoghaddam 2, Don Towsley 1 1 University of Massachusetts Amherst 2 University of California Santa Cruz ICN,
More informationWebinar Series TMIP VISION
Webinar Series TMIP VISION TMIP provides technical support and promotes knowledge and information exchange in the transportation planning and modeling community. Today s Goals To Consider: Parallel Processing
More informationEfficient On-Demand Operations in Distributed Infrastructures
Efficient On-Demand Operations in Distributed Infrastructures Steve Ko and Indranil Gupta Distributed Protocols Research Group University of Illinois at Urbana-Champaign 2 One-Line Summary We need to design
More informationAn Efficient Execution Scheme for Designated Event-based Stream Processing
DEIM Forum 2014 D3-2 An Efficient Execution Scheme for Designated Event-based Stream Processing Yan Wang and Hiroyuki Kitagawa Graduate School of Systems and Information Engineering, University of Tsukuba
More informationWarehouse- Scale Computing and the BDAS Stack
Warehouse- Scale Computing and the BDAS Stack Ion Stoica UC Berkeley UC BERKELEY Overview Workloads Hardware trends and implications in modern datacenters BDAS stack What is Big Data used For? Reports,
More informationApache Flink Big Data Stream Processing
Apache Flink Big Data Stream Processing Tilmann Rabl Berlin Big Data Center www.dima.tu-berlin.de bbdc.berlin rabl@tu-berlin.de XLDB 11.10.2017 1 2013 Berlin Big Data Center All Rights Reserved DIMA 2017
More informationTools for Social Networking Infrastructures
Tools for Social Networking Infrastructures 1 Cassandra - a decentralised structured storage system Problem : Facebook Inbox Search hundreds of millions of users distributed infrastructure inbox changes
More informationIntelligent Services Serving Machine Learning
Intelligent Services Serving Machine Learning Joseph E. Gonzalez jegonzal@cs.berkeley.edu; Assistant Professor @ UC Berkeley joseph@dato.com; Co-Founder @ Dato Inc. Contemporary Learning Systems Big Data
More informationA Framework for Clustering Massive Text and Categorical Data Streams
A Framework for Clustering Massive Text and Categorical Data Streams Charu C. Aggarwal IBM T. J. Watson Research Center charu@us.ibm.com Philip S. Yu IBM T. J.Watson Research Center psyu@us.ibm.com Abstract
More informationDatacenter replication solution with quasardb
Datacenter replication solution with quasardb Technical positioning paper April 2017 Release v1.3 www.quasardb.net Contact: sales@quasardb.net Quasardb A datacenter survival guide quasardb INTRODUCTION
More informationEnergy Consumption in Mobile Phones: A Measurement Study and Implications for Network Applications (IMC09)
Energy Consumption in Mobile Phones: A Measurement Study and Implications for Network Applications (IMC09) Niranjan Balasubramanian Aruna Balasubramanian Arun Venkataramani University of Massachusetts
More informationMicrosoft Exam
Volume: 42 Questions Case Study: 1 Relecloud General Overview Relecloud is a social media company that processes hundreds of millions of social media posts per day and sells advertisements to several hundred
More informationArchitecting Microsoft Azure Solutions (proposed exam 535)
Architecting Microsoft Azure Solutions (proposed exam 535) IMPORTANT: Significant changes are in progress for exam 534 and its content. As a result, we are retiring this exam on December 31, 2017, and
More informationTesting & Assuring Mobile End User Experience Before Production Neotys
Testing & Assuring Mobile End User Experience Before Production Neotys Henrik Rexed Agenda Introduction The challenges Best practices NeoLoad mobile capabilities Mobile devices are used more and more At
More informationFlexible Wide Area Consistency Management Sai Susarla
Flexible Wide Area Consistency Management Sai Susarla The Problem Wide area services are not alike Different consistency & availability requirements But many common issues & mechanisms Concurrency control,
More informationIntroduction. Application Performance in the QLinux Multimedia Operating System. Solution: QLinux. Introduction. Outline. QLinux Design Principles
Application Performance in the QLinux Multimedia Operating System Sundaram, A. Chandra, P. Goyal, P. Shenoy, J. Sahni and H. Vin Umass Amherst, U of Texas Austin ACM Multimedia, 2000 Introduction General
More informationSCALABLE, LOW LATENCY MODEL SERVING AND MANAGEMENT WITH VELOX
THE MISSING PIECE IN COMPLEX ANALYTICS: SCALABLE, LOW LATENCY MODEL SERVING AND MANAGEMENT WITH VELOX Daniel Crankshaw, Peter Bailis, Joseph Gonzalez, Haoyuan Li, Zhao Zhang, Ali Ghodsi, Michael Franklin,
More informationEfficient Point to Multipoint Transfers Across Datacenters
Efficient Point to Multipoint Transfers cross Datacenters Mohammad Noormohammadpour 1, Cauligi S. Raghavendra 1, Sriram Rao, Srikanth Kandula 1 University of Southern California, Microsoft Source: https://azure.microsoft.com/en-us/overview/datacenters/how-to-choose/
More informationFootprint Descriptors: Theory and Practice of Cache Provisioning in a Global CDN
Footprint Descriptors: Theory and Practice of Cache Provisioning in a Global CDN ABSTRACT Aditya Sundarrajan University of Massachusetts Amherst Mangesh Kasbekar Akamai Technologies Modern CDNs cache and
More informationWebsite Acceleration with mod_pagespeed
Website Acceleration with mod_pagespeed Joshua Marantz Google June 15, 2011 @jmarantz www.modpagespeed.com 2011 Google, Inc. All rights reserved. Velocity 2011: Faster By Default 2 Velocity 2011: Faster
More informationData Warehouses Chapter 12. Class 10: Data Warehouses 1
Data Warehouses Chapter 12 Class 10: Data Warehouses 1 OLTP vs OLAP Operational Database: a database designed to support the day today transactions of an organization Data Warehouse: historical data is
More informationMicrosoft Perform Data Engineering on Microsoft Azure HDInsight.
Microsoft 70-775 Perform Data Engineering on Microsoft Azure HDInsight http://killexams.com/pass4sure/exam-detail/70-775 QUESTION: 30 You are building a security tracking solution in Apache Kafka to parse
More informationECE7995 Caching and Prefetching Techniques in Computer Systems. Lecture 8: Buffer Cache in Main Memory (I)
ECE7995 Caching and Prefetching Techniques in Computer Systems Lecture 8: Buffer Cache in Main Memory (I) 1 Review: The Memory Hierarchy Take advantage of the principle of locality to present the user
More informationMicrosoft. Exam Questions Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo
Microsoft Exam Questions 70-775 Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo NEW QUESTION 1 HOTSPOT You install the Microsoft Hive ODBC Driver on a computer that runs Windows
More informationFlash Storage Complementing a Data Lake for Real-Time Insight
Flash Storage Complementing a Data Lake for Real-Time Insight Dr. Sanhita Sarkar Global Director, Analytics Software Development August 7, 2018 Agenda 1 2 3 4 5 Delivering insight along the entire spectrum
More informationexam. Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Version 1.0
70-775.exam Number: 70-775 Passing Score: 800 Time Limit: 120 min File Version: 1.0 Microsoft 70-775 Perform Data Engineering on Microsoft Azure HDInsight Version 1.0 Exam A QUESTION 1 You use YARN to
More informationCS 398 ACC Streaming. Prof. Robert J. Brunner. Ben Congdon Tyler Kim
CS 398 ACC Streaming Prof. Robert J. Brunner Ben Congdon Tyler Kim MP3 How s it going? Final Autograder run: - Tonight ~9pm - Tomorrow ~3pm Due tomorrow at 11:59 pm. Latest Commit to the repo at the time
More informationDeveloping Microsoft Azure Solutions (70-532) Syllabus
Developing Microsoft Azure Solutions (70-532) Syllabus Cloud Computing Introduction What is Cloud Computing Cloud Characteristics Cloud Computing Service Models Deployment Models in Cloud Computing Advantages
More informationmicrosoft
70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series
More informationRack-scale Data Processing System
Rack-scale Data Processing System Jana Giceva, Darko Makreshanski, Claude Barthels, Alessandro Dovis, Gustavo Alonso Systems Group, Department of Computer Science, ETH Zurich Rack-scale Data Processing
More informationA Cost-Space Approach to Distributed Query Optimization in Stream Based Overlays
A Cost-Space Approach to Distributed Query Optimization in Stream Based Overlays Jeffrey Shneidman, Peter Pietzuch, Matt Welsh, Margo Seltzer, Mema Roussopoulos Systems Research Group Harvard University
More informationStreaming OLAP Applications
Streaming OLAP Applications From square one to multi-gigabit streams and beyond C. Scott Andreas HPTS 2013 @cscotta Roadmap Framing the problem Four phases of an architecture s evolution Code: A general-purpose
More informationPulsar. Realtime Analytics At Scale. Wang Xinglang
Pulsar Realtime Analytics At Scale Wang Xinglang Agenda Pulsar : Real Time Analytics At ebay Business Use Cases Product Requirements Pulsar : Technology Deep Dive 2 Pulsar Business Use Case: Behavioral
More informationLambda Architecture for Batch and Stream Processing. October 2018
Lambda Architecture for Batch and Stream Processing October 2018 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document is provided for informational purposes only.
More informationNowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype?
Big data hype? Big Data: Hype or Hallelujah? Data Base and Data Mining Group of 2 Google Flu trends On the Internet February 2010 detected flu outbreak two weeks ahead of CDC data Nowcasting http://www.internetlivestats.com/
More informationDRIZZLE: FAST AND Adaptable STREAM PROCESSING AT SCALE
DRIZZLE: FAST AND Adaptable STREAM PROCESSING AT SCALE Shivaram Venkataraman, Aurojit Panda, Kay Ousterhout, Michael Armbrust, Ali Ghodsi, Michael Franklin, Benjamin Recht, Ion Stoica STREAMING WORKLOADS
More informationBohr: Similarity Aware Geo-distributed Data Analytics. Hangyu Li, Hong Xu, Sarana Nutanong City University of Hong Kong
Bohr: Similarity Aware Geo-distributed Data Analytics Hangyu Li, Hong Xu, Sarana Nutanong City University of Hong Kong 1 Big Data Analytics Analysis Generate 2 Data are geo-distributed Frankfurt US Oregon
More informationSelf-driving Datacenter: Analytics
Self-driving Datacenter: Analytics George Boulescu Consulting Systems Engineer 19/10/2016 Alvin Toffler is a former associate editor of Fortune magazine, known for his works discussing the digital revolution,
More informationGaia: Geo-Distributed Machine Learning Approaching LAN Speeds
Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds Kevin Hsieh Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, Phillip B. Gibbons, Onur Mutlu Machine Learning and Big
More informationDistributed systems for stream processing
Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall Alena Hall Large-scale data processing Distributed Systems Functional Programming Data Science & Machine
More informationRIPE75 - Network monitoring at scale. Louis Poinsignon
RIPE75 - Network monitoring at scale Louis Poinsignon Why monitoring and what to monitor? Why do we monitor? Billing Reducing costs Traffic engineering Where should we peer? Where should we set-up a new
More informationUNIFY DATA AT MEMORY SPEED. Haoyuan (HY) Li, Alluxio Inc. VAULT Conference 2017
UNIFY DATA AT MEMORY SPEED Haoyuan (HY) Li, CEO @ Alluxio Inc. VAULT Conference 2017 March 2017 HISTORY Started at UC Berkeley AMPLab In Summer 2012 Originally named as Tachyon Rebranded to Alluxio in
More informationWide-Area Spark Streaming: Automated Routing and Batch Sizing
Wide-Area Spark Streaming: Automated Routing and Batch Sizing Wenxin Li, Di Niu, Yinan Liu, Shuhao Liu, Baochun Li University of Toronto & Dalian University of Technology University of Alberta University
More informationBuilding an Internet-Scale Publish/Subscribe System
Building an Internet-Scale Publish/Subscribe System Ian Rose Mema Roussopoulos Peter Pietzuch Rohan Murty Matt Welsh Jonathan Ledlie Imperial College London Peter R. Pietzuch prp@doc.ic.ac.uk Harvard University
More informationHYBRID TRANSACTION/ANALYTICAL PROCESSING COLIN MACNAUGHTON
HYBRID TRANSACTION/ANALYTICAL PROCESSING COLIN MACNAUGHTON WHO IS NEEVE RESEARCH? Headquartered in Silicon Valley Creators of the X Platform - Memory Oriented Application Platform Passionate about high
More informationBuilding Durable Real-time Data Pipeline
Building Durable Real-time Data Pipeline Apache BookKeeper at Twitter @sijieg Twitter Background Layered Architecture Agenda Design Details Performance Scale @Twitter Q & A Publish-Subscribe Online services
More informationCisco Tetration Analytics Platform: A Dive into Blazing Fast Deep Storage
White Paper Cisco Tetration Analytics Platform: A Dive into Blazing Fast Deep Storage What You Will Learn A Cisco Tetration Analytics appliance bundles computing, networking, and storage resources in one
More informationPortable stateful big data processing in Apache Beam
Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC Software Engineer @ Google klk@google.com / @KennKnowles https://s.apache.org/ffsf-2017-beam-state Flink Forward San
More informationHow Eventual is Eventual Consistency?
Probabilistically Bounded Staleness How Eventual is Eventual Consistency? Peter Bailis, Shivaram Venkataraman, Michael J. Franklin, Joseph M. Hellerstein, Ion Stoica (UC Berkeley) BashoChats 002, 28 February
More informationProcessing 11 billions events a day with Spark. Alexander Krasheninnikov
Processing 11 billions events a day with Spark Alexander Krasheninnikov Badoo facts 46 languages 10M Photos added daily 320M registered users 190 countries 21M daily active users 3000+ servers 2 data-centers
More informationReal-time data processing with Apache Flink
Real-time data processing with Apache Flink Gyula Fóra gyfora@apache.org Flink committer Swedish ICT Stream processing Data stream: Infinite sequence of data arriving in a continuous fashion. Stream processing:
More informationHortonworks and The Internet of Things
Hortonworks and The Internet of Things Dr. Bernhard Walter Solutions Engineer About Hortonworks Customer Momentum ~700 customers (as of November 4, 2015) 152 customers added in Q3 2015 Publicly traded
More information@Pentaho #BigDataWebSeries
Enterprise Data Warehouse Optimization with Hadoop Big Data @Pentaho #BigDataWebSeries Your Hosts Today Dave Henry SVP Enterprise Solutions Davy Nys VP EMEA & APAC 2 Source/copyright: The Human Face of
More informationComparative Analysis of Range Aggregate Queries In Big Data Environment
Comparative Analysis of Range Aggregate Queries In Big Data Environment Ranjanee S PG Scholar, Dept. of Computer Science and Engineering, Institute of Road and Transport Technology, Erode, TamilNadu, India.
More informationFeedback: Twitter: #TechTalk #wpo #io2011. Make The Web Faster. Joshua Marantz Richard Rabbat Håkon Wium Lie.
Feedback: Twitter: http://goo.gl/vf47i #TechTalk #wpo #io2011 Make The Web Faster Joshua Marantz Richard Rabbat Håkon Wium Lie May 10, 2011 Agenda mod_pagespeed Joshua Marantz Feedback: Twitter: http://goo.gl/vf47i
More informationTSAR A TimeSeries AggregatoR. Anirudh Todi TSAR
TSAR A TimeSeries AggregatoR Anirudh Todi Twitter @anirudhtodi TSAR What is TSAR? What is TSAR? TSAR is a framework and service infrastructure for specifying, deploying and operating timeseries aggregation
More informationBased on Big Data: Hype or Hallelujah? by Elena Baralis
Based on Big Data: Hype or Hallelujah? by Elena Baralis http://dbdmg.polito.it/wordpress/wp-content/uploads/2010/12/bigdata_2015_2x.pdf 1 3 February 2010 Google detected flu outbreak two weeks ahead of
More informationBig Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018
Big Data com Hadoop Impala, Hive e Spark VIII Sessão - SQL Bahia 03/03/2018 Diógenes Pires Connect with PASS Sign up for a free membership today at: pass.org #sqlpass Internet Live http://www.internetlivestats.com/
More informationChapter 2: Memory Hierarchy Design Part 2
Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental
More informationLogical Diagram of a Set-associative Cache Accessing a Cache
Introduction Memory Hierarchy Why memory subsystem design is important CPU speeds increase 25%-30% per year DRAM speeds increase 2%-11% per year Levels of memory with different sizes & speeds close to
More informationHadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved
Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop
More informationLRC: Dependency-Aware Cache Management for Data Analytics Clusters. Yinghao Yu, Wei Wang, Jun Zhang, and Khaled B. Letaief IEEE INFOCOM 2017
LRC: Dependency-Aware Cache Management for Data Analytics Clusters Yinghao Yu, Wei Wang, Jun Zhang, and Khaled B. Letaief IEEE INFOCOM 2017 Outline Cache Management for Data Analytics Clusters Inefficiency
More informationIntroduction. Memory Hierarchy
Introduction Why memory subsystem design is important CPU speeds increase 25%-30% per year DRAM speeds increase 2%-11% per year 1 Memory Hierarchy Levels of memory with different sizes & speeds close to
More informationDeveloping Enterprise Cloud Solutions with Azure
Developing Enterprise Cloud Solutions with Azure Java Focused 5 Day Course AUDIENCE FORMAT Developers and Software Architects Instructor-led with hands-on labs LEVEL 300 COURSE DESCRIPTION This course
More informationDeveloping Microsoft Azure Solutions (70-532) Syllabus
Developing Microsoft Azure Solutions (70-532) Syllabus Cloud Computing Introduction What is Cloud Computing Cloud Characteristics Cloud Computing Service Models Deployment Models in Cloud Computing Advantages
More informationOpportunities for Exploiting Social Awareness in Overlay Networks. Bruce Maggs Duke University Akamai Technologies
Opportunities for Exploiting Social Awareness in Overlay Networks Bruce Maggs Duke University Akamai Technologies The Akamai Intelligent Platform A Global Platform: 127,000+ Servers 1,100+ Networks 2,500+
More informationDesigning Hybrid Data Processing Systems for Heterogeneous Servers
Designing Hybrid Data Processing Systems for Heterogeneous Servers Peter Pietzuch Large-Scale Distributed Systems (LSDS) Group Imperial College London http://lsds.doc.ic.ac.uk University
More informationRocksDB Key-Value Store Optimized For Flash
RocksDB Key-Value Store Optimized For Flash Siying Dong Software Engineer, Database Engineering Team @ Facebook April 20, 2016 Agenda 1 What is RocksDB? 2 RocksDB Design 3 Other Features What is RocksDB?
More informationContent Deployment and Caching Techniques in Africa
Content Deployment and Caching Techniques in Africa Presentation to the AfPIF Peering Coordinators Day Dar Es Salaam 2011 Mike Blanche 1 Agenda Current Content Hosting Situation Content Hosting Content
More informationOptimizing and Modeling SAP Business Analytics for SAP HANA. Iver van de Zand, Business Analytics
Optimizing and Modeling SAP Business Analytics for SAP HANA Iver van de Zand, Business Analytics Early data warehouse projects LIMITATIONS ISSUES RAISED Data driven by acquisition, not architecture Too
More informationStaying FIT: Efficient Load Shedding Techniques for Distributed Stream Processing
Staying FIT: Efficient Load Shedding Techniques for Distributed Stream Processing Nesime Tatbul Uğur Çetintemel Stan Zdonik Talk Outline Problem Introduction Approach Overview Advance Planning with an
More informationvolley: automated data placement for geo-distributed cloud services
volley: automated data placement for geo-distributed cloud services sharad agarwal, john dunagan, navendu jain, stefan saroiu, alec wolman, harbinder bhogan very rapid pace of datacenter rollout April
More informationBig Data on AWS. Peter-Mark Verwoerd Solutions Architect
Big Data on AWS Peter-Mark Verwoerd Solutions Architect What to get out of this talk Non-technical: Big Data processing stages: ingest, store, process, visualize Hot vs. Cold data Low latency processing
More informationMeasuring Performance of Complex Event Processing Systems
Measuring Performance of Complex Event Processing Systems Torsten Grabs, Ming Lu Microsoft StreamInsight Microsoft Corp., Redmond, WA {torsteng, milu}@microsoft.com Agenda Motivation CEP systems and performance
More informationActivator Library. Focus on maximizing the value of your data, gain business insights, increase your team s productivity, and achieve success.
Focus on maximizing the value of your data, gain business insights, increase your team s productivity, and achieve success. ACTIVATORS Designed to give your team assistance when you need it most without
More informationEnd-user mapping: Next-Generation Request Routing for Content Delivery
Introduction End-user mapping: Next-Generation Request Routing for Content Delivery Fangfei Chen, Ramesh K. Sitaraman, Marcelo Torres ACM SIGCOMM Computer Communication Review. Vol. 45. No. 4. ACM, 2015
More informationDemocratizing Content Publication with Coral
Democratizing Content Publication with Mike Freedman Eric Freudenthal David Mazières New York University NSDI 2004 A problem Feb 3: Google linked banner to julia fractals Users clicking directed to Australian
More informationAMP-Based Flow Collection. Greg Virgin - RedJack
AMP-Based Flow Collection Greg Virgin - RedJack AMP- Based Flow Collection AMP - Analytic Metadata Producer : Patented US Government flow / metadata producer AMP generates data including Flows Host metadata
More informationSpark, Shark and Spark Streaming Introduction
Spark, Shark and Spark Streaming Introduction Tushar Kale tusharkale@in.ibm.com June 2015 This Talk Introduction to Shark, Spark and Spark Streaming Architecture Deployment Methodology Performance References
More informationImprove Web Application Performance with Zend Platform
Improve Web Application Performance with Zend Platform Shahar Evron Zend Sr. PHP Specialist Copyright 2007, Zend Technologies Inc. Agenda Benchmark Setup Comprehensive Performance Multilayered Caching
More informationAzure Webinar. Resilient Solutions March Sander van den Hoven Principal Technical Evangelist Microsoft
Azure Webinar Resilient Solutions March 2017 Sander van den Hoven Principal Technical Evangelist Microsoft DX @svandenhoven 1 What is resilience? Client Client API FrontEnd Client Client Client Loadbalancer
More informationSelf Regulating Stream Processing in Heron
Self Regulating Stream Processing in Heron Huijun Wu 2017.12 Huijun Wu Twitter, Inc. Infrastructure, Data Platform, Real-Time Compute Heron Overview Recent Improvements Self Regulating Challenges Dhalion
More informationBig Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara
Big Data Technology Ecosystem Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Agenda End-to-End Data Delivery Platform Ecosystem of Data Technologies Mapping an End-to-End Solution Case
More informationEnergy-efficient disk caching for content delivery
Energy-efficient disk caching for content delivery Aditya Sundarrajan University of Massachusetts Amherst asundar@cs.umass.edu Mangesh Kasbekar Akamai Technologies mkasbeka@akamai.com Ramesh K. Sitaraman
More informationMining Data Streams. From Data-Streams Management System Queries to Knowledge Discovery from continuous and fast-evolving Data Records.
DATA STREAMS MINING Mining Data Streams From Data-Streams Management System Queries to Knowledge Discovery from continuous and fast-evolving Data Records. Hammad Haleem Xavier Plantaz APPLICATIONS Sensors
More informationMixApart: Decoupled Analytics for Shared Storage Systems. Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp
MixApart: Decoupled Analytics for Shared Storage Systems Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp Hadoop Pig, Hive Hadoop + Enterprise storage?! Shared storage
More informationHDInsight > Hadoop. October 12, 2017
HDInsight > Hadoop October 12, 2017 2 Introduction Mark Hudson >20 years mixing technology with data >10 years with CapTech Microsoft Certified IT Professional Business Intelligence Member of the Richmond
More informationInfiniswap. Efficient Memory Disaggregation. Mosharaf Chowdhury. with Juncheng Gu, Youngmoon Lee, Yiwen Zhang, and Kang G. Shin
Infiniswap Efficient Memory Disaggregation Mosharaf Chowdhury with Juncheng Gu, Youngmoon Lee, Yiwen Zhang, and Kang G. Shin Rack-Scale Computing Datacenter-Scale Computing Geo-Distributed Computing Coflow
More informationPacket-Level Network Analytics without Compromises NANOG 73, June 26th 2018, Denver, CO. Oliver Michel
Packet-Level Network Analytics without Compromises NANOG 73, June 26th 2018, Denver, CO Oliver Michel Network monitoring is important Security issues Performance issues Equipment failure Analytics Platform
More informationCounting is Hard: Probabilistically Counting Views at Reddit. Krishnan Chandra, Data Engineer
Counting is Hard: Probabilistically Counting Views at Reddit Krishnan Chandra, Data Engineer What is probabilistic counting? Overview How did probabilistic counting help us scale? What issues did we face
More informationHigh-Performance Event Processing Bridging the Gap between Low Latency and High Throughput Bernhard Seeger University of Marburg
High-Performance Event Processing Bridging the Gap between Low Latency and High Throughput Bernhard Seeger University of Marburg common work with Nikolaus Glombiewski, Michael Körber, Marc Seidemann 1.
More information