Flying Faster with Heron
|
|
- Emma Hicks
- 5 years ago
- Views:
Transcription
1 Flying Faster with Heron KARTHIK #TwitterHeron
2 TALK OUTLINE BEGIN I! II ( III b OVERVIEW MOTIVATION HERON IV Z OPERATIONAL EXPERIENCES V K HERON PERFORMANCE END
3 [! OVERVIEW
4 TWITTER IS REAL TIME real time trends real time conversations real time recommendations real time search G Ü " s Emerging break out Real time sports Real time product Real time search of trends in Twitter (in the conversations related recommendations based tweets form #hashtags) with a topic (recent goal on your behavior & or touchdown) profile ANALYZING BILLIONS OF EVENTS IN REAL TIME IS A CHALLENGE!
5 TWITTER STORM Streaming platform for analyzing realtime data as they arrive, so you can react to data as it happens. b \ Ñ / GUARANTEED HORIZONTAL ROBUST CONCISE MESSAGE SCALABILITY FAULT CODE- FOCUS PROCESSING TOLERANCE ON LOGIC
6 STORM TERMINOLOGY TOPOLOGY, Directed acyclic graph Vertices=computation, and edges=streams of data tuples SPOUTS Sources of data tuples for the topology Examples - Event Bus/Kafka/Kestrel/MySQL/Postgres BOLTS % Process incoming tuples and emit outgoing tuples Examples - filtering/aggregation/join/arbitrary function
7 STORM TOPOLOGY BOLT 1 SPOUT 1 SPOUT 2 % BOLT 2 % % BOLT 4 % % BOLT 5 BOLT 3
8 WORD COUNT TOPOLOGY Live stream of Tweets % % TWEET SPOUT PARSE TWEET BOLT WORD COUNT BOLT LOGICAL PLAN
9 WORD COUNT TOPOLOGY % % TWEET SPOUT TASKS PARSE TWEET BOLT TASKS WORD COUNT BOLT TASKS When a parse tweet bolt task emits a tuple which word count bolt task should it send to?
10 STREAM GROUPINGS SHUFFLE GROUPING FIELDS GROUPING ALL GROUPING GLOBAL GROUPING /. -, Random distribution of tuples Group tuples by a field or multiple fields Replicates tuples to all tasks Sends the entire stream to one task
11 WORD COUNT TOPOLOGY SHUFFLE GROUPING FIELDS GROUPING % % TWEET SPOUT TASKS PARSE TWEET BOLT TASKS WORD COUNT BOLT TASKS
12 ( MOTIVATION
13 STORM ARCHITECTURE MASTER NODE TOPOLOGY SUBMISSION Nimbus ASSIGNMENT MAPS Multiple Functionality Scheduling/Monitoring Single point of failure ZK No resource reservation and isolation CLUSTER Storage Contention SUPERVISOR SUPERVISOR W1 W2 W3 W4 W1 W2 W3 W4 SLAVE NODE SLAVE NODE
14 STORM WORKER EXECUTOR1 EXECUTOR2 Complex hierarchy TASK1 JVM PROCESS TASK2 Hard to debug Difficult to tune TASK4 TASK5 TASK3
15 DATA FLOW IN STORM WORKERS In In In In In Queue User Logic Thread In In In Out In Queue Queue User Logic Send Thread Thread Queue Contention Global Receive Thread Outgoing Message Buffer TCP Receive Buffer Multiple Languages Global Send Thread TCP Send Buffer Kernel
16 OVERLOADED ZOOKEEPER Scaled up STORM W zk S1 W W zk S2 S3 Handled unto to 1200 workers per cluster
17 OVERLOADED ZOOKEEPER Analyzing zookeeper traffic KAFKA SPOUT 67% Offset/partition is written every 2 secs STORM RUNTIME 33% Workers write heart beats every 3 secs
18 OVERLOADED ZOOKEEPER Heart beat daemons STORM W H HH zk S1 W S2 zk W KV KVKV S workers per cluster
19 STORM - DEPLOYMENT shared pool storm cluster
20 STORM - DEPLOYMENT shared pool isolated pools joe s topology storm cluster
21 STORM - DEPLOYMENT shared pool isolated pools joe s topology storm cluster jane s topology
22 STORM - DEPLOYMENT shared pool isolated pools joe s topology storm cluster jane s topology dave s topology
23 STORM ISSUES g LACK OF BACK PRESSURE Drops tuples unpredictably EFFICIENCY G Serialization program consumes 75 cores at 30% CPU Topology consumes 600 cores at 20-30% CPU! NO BATCHING Tuple oriented system - implicit batching by 0MQ
24 EVOLUTION OR REVOLUTION? fix storm or develop a new system? FUNDAMENTAL ISSUES- REQUIRE EXTENSIVE REWRITING, Several queues for moving data Inflexible and requires longer development cycle USE EXISTING OPEN SOURCE SOLUTIONS Issues working at scale/lacks required performance Incompatible API and long migration process
25 HERONb
26 HERON DESIGN GOALS FULLY API COMPATIBLE WITH STORM " Directed acyclic graph Topologies, spouts and bolts # TASK ISOLATION Ease of debug ability/resource isolation/profiling d USE OF MAIN STREAM LANGUAGES C++/JAVA/Python
27 HERON ARCHITECTURE Scheduler Topology 1 TOPOLOGY SUBMISSION Topology 2 Topology 3 Topology N
28 TOPOLOGY ARCHITECTURE Topology Master Logical Plan, Physical Plan and Execution State ZK Sync Physical Plan CLUSTER Stream Manager Metrics Manager Stream Manager Metrics Manager I1 I2 I3 I4 I1 I2 I3 I4 CONTAINER CONTAINER
29 TOPOLOGY MASTER Solely responsible for the entire topology b \ Ñ ASSIGNS ROLE MONITORING METRICS
30 TOPOLOGY MASTER Topology Master Logical Plan, Physical Plan and Execution State ZK CLUSTER! PREVENT MULTIPLE TM BECOMING MASTERS! ALLOWS OTHER PROCESS TO DISCOVER TM
31 STREAM MANAGER Routing Engine /, Ñ ROUTES TUPLES BACK PRESSURE ACK MGMT
32 STREAM MANAGER S1 B2 B3 B4 % % %
33 STREAM MANAGER S1 B2 S1 B2 Stream Manager Stream Manager B3 B4 B3 B4 O(n 2 ) O(k 2 ) S1 B2 S1 B2 Stream Manager Stream Manager B3 B4 B3
34 STREAM MANAGER tcp back pressure S1 B2 S1 B2 Stream Manager Stream Manager B3 B4 B3 B4 S1 B2 S1 B2 Stream Manager Stream Manager B3 B4 B3 SLOWS UPSTREAM AND DOWNSTREAM INSTANCES
35 STREAM MANAGER spout back pressure S1 B2 S1 B2 Stream Manager Stream Manager B3 B4 B3 B4 S1 B2 S1 B2 Stream Manager Stream Manager B3 B4 B3
36 STREAM MANAGER stage by stage back pressure S1 B2 S1 B2 Stream Manager Stream Manager B3 B4 B3 B4 S1 B2 S1 B2 Stream Manager Stream Manager B3 B4 B3
37 STREAM MANAGER back pressure advantages PREDICTABILITY! Tuple failures are more deterministic SELF ADJUSTS! Topology goes as fast as the slowest component
38 HERON INSTANCE Does the real work! > > p RUNS ONE TASK EXPOSES API COLLECTS METRICS
39 HERON INSTANCE Stream Manager data-in queue Gateway Thread data-out queue Task Execution Thread Metrics Manager metrics-out queue
40 K OPERATIONAL EXPERIENCES $
41 HERON DEPLOYMENT Aurora Scheduler ZK CLUSTER Topology 1 Aurora Services Topology 2 Heron Web Topology 3 Heron Tracker Heron VIZ Topology N Observability
42 HERON SAMPLE TOPOLOGIES
43 SAMPLE TOPOLOGY DASHBOARD
44 STORM is decommissioned Large amount of data produced every day Large cluster Several topologies deployed Several billion messages every day 1 stage 10 stages 3x reduction in cores and memory
45 HERON x PERFORMANCE 9
46 HERON PERFORMANCE Settings COMPONENTS EXPT #1 EXPT #2 EXPT #3 EXPT #4 Spout Bolt # Heron containers # Storm workers
47 HERON PERFORMANCE Word count topology - Acknowledgements enabled Throughput Latency Storm Heron Storm Heron million tuples/min 700 latency (ms) Spout Parallelism Spout Parallelism 10-14x 5-15x
48 HERON PERFORMANCE Word count topology - CPU usage Storm Heron # cores used Spout Parallelism 2-3x
49 HERON PERFORMANCE Throughput and CPU usage with no acknowledgements - Word count topology Storm Heron Storm Heron million tuples/min 2500 # cores used Spout Parallelism Spout Parallelism
50 HERON EXPERIMENT RTAC topology SHUFFLE GROUPING FIELDS GROUPING % % FIELDS GROUPING % CLIENT EVENT DISTRIBUTOR USER COUNT AGGREGATOR SPOUT BOLT BOLT BOLT
51 HERON PERFORMANCE CPU usage - RTAC Topology Storm Heron Acknowledgements enabled Storm Heron No acknowledgements # cores used 200 # cores used
52 HERON PERFORMANCE Latency with acknowledgements enabled - RTAC Topology Storm Heron latency (ms)
53 CURIOUS TO LEARN MORE Twitter Heron: Stream Processing at Scale Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel *,1, Karthik Ramasamy, @staneja Twitter, Inc., *University of Wisconsin Madison Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel*, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, @challenger_nik, Twitter, Inc., *University of Wisconsin Madison
54 CONCLUSION SIMPLIFIED ARCHITECTURE " Easy to debug, profile and support HIGH PERFORMANCE % 7-10x increase in throughput 5-10x improvement in latency # EFFICIENCY 3-5x decrease in resource usage
55 #ThankYou FOR LISTENING &
56 R QUESTIONS and ANSWERS ' Go ahead. Ask away.
Scalable Streaming Analytics
Scalable Streaming Analytics KARTHIK RAMASAMY @karthikz TALK OUTLINE BEGIN I! II ( III b Overview Storm Overview Storm Internals IV Z V K Heron Operational Experiences END WHAT IS ANALYTICS? according
More informationReal Time Processing. Karthik Ramasamy
Real Time Processing Karthik Ramasamy Streamlio @karthikz 2 Information Age Real-time is key á K! 3 Real Time Connected World Ñ Internet of Things 30 B connected devices by 2020 Connected Vehicles Data
More informationPaper Presented by Harsha Yeddanapudy
Storm@Twitter Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel*, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal,
More informationTwitter Heron: Stream Processing at Scale
Twitter Heron: Stream Processing at Scale Saiyam Kohli December 8th, 2016 CIS 611 Research Paper Presentation -Sun Sunnie Chung TWITTER IS A REAL TIME ABSTRACT We process billions of events on Twitter
More information2/20/2019 Week 5-B Sangmi Lee Pallickara
2/20/2019 - Spring 2019 Week 5-B-1 CS535 BIG DATA FAQs PART A. BIG DATA TECHNOLOGY 4. REAL-TIME STREAMING COMPUTING MODELS: APACHE STORM AND TWITTER HERON Special GTA for PA1 Saptashwa Mitra Saptashwa.Mitra@colostate.edu
More information10/24/2017 Sangmi Lee Pallickara Week 10- A. CS535 Big Data Fall 2017 Colorado State University
CS535 Big Data - Fall 2017 Week 10-A-1 CS535 BIG DATA FAQs Term project proposal Feedback for the most of submissions are available PA2 has been posted (11/6) PART 2. SCALABLE FRAMEWORKS FOR REAL-TIME
More informationApache Storm. Hortonworks Inc Page 1
Apache Storm Page 1 What is Storm? Real time stream processing framework Scalable Up to 1 million tuples per second per node Fault Tolerant Tasks reassigned on failure Guaranteed Processing At least once
More informationBASIC INTER/INTRA IPC. Operating System (Linux, Windows) HARDWARE. Scheduling Framework (Mesos, YARN, etc) HERON S GENERAL-PURPOSE ARCHITECTURE
217 IEEE 33rd International Conference on Data Engineering Twitter Heron: Towards Extensible Streaming Engines Maosong Fu t, Ashvin Agrawal m, Avrilia Floratou m, Bill Graham t, Andrew Jorgensen t Mark
More informationBefore proceeding with this tutorial, you must have a good understanding of Core Java and any of the Linux flavors.
About the Tutorial Storm was originally created by Nathan Marz and team at BackType. BackType is a social analytics company. Later, Storm was acquired and open-sourced by Twitter. In a short time, Apache
More informationTutorial: Apache Storm
Indian Institute of Science Bangalore, India भ रत य वज ञ न स स थ न ब गल र, भ रत Department of Computational and Data Sciences DS256:Jan17 (3:1) Tutorial: Apache Storm Anshu Shukla 16 Feb, 2017 Yogesh Simmhan
More informationSelf Regulating Stream Processing in Heron
Self Regulating Stream Processing in Heron Huijun Wu 2017.12 Huijun Wu Twitter, Inc. Infrastructure, Data Platform, Real-Time Compute Heron Overview Recent Improvements Self Regulating Challenges Dhalion
More informationREAL-TIME ANALYTICS WITH APACHE STORM
REAL-TIME ANALYTICS WITH APACHE STORM Mevlut Demir PhD Student IN TODAY S TALK 1- Problem Formulation 2- A Real-Time Framework and Its Components with an existing applications 3- Proposed Framework 4-
More informationStreaming & Apache Storm
Streaming & Apache Storm Recommended Text: Storm Applied Sean T. Allen, Matthew Jankowski, Peter Pathirana Manning 2010 VMware Inc. All rights reserved Big Data! Volume! Velocity Data flowing into the
More informationStorm. Distributed and fault-tolerant realtime computation. Nathan Marz Twitter
Storm Distributed and fault-tolerant realtime computation Nathan Marz Twitter Storm at Twitter Twitter Web Analytics Before Storm Queues Workers Example (simplified) Example Workers schemify tweets and
More informationFROM LEGACY, TO BATCH, TO NEAR REAL-TIME. Marc Sturlese, Dani Solà
FROM LEGACY, TO BATCH, TO NEAR REAL-TIME Marc Sturlese, Dani Solà WHO ARE WE? Marc Sturlese - @sturlese Backend engineer, focused on R&D Interests: search, scalability Dani Solà - @dani_sola Backend engineer
More informationStorm. Distributed and fault-tolerant realtime computation. Nathan Marz Twitter
Storm Distributed and fault-tolerant realtime computation Nathan Marz Twitter Basic info Open sourced September 19th Implementation is 15,000 lines of code Used by over 25 companies >2700 watchers on Github
More informationData Analytics with HPC. Data Streaming
Data Analytics with HPC Data Streaming Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationTwitrends: A Real Time Trending Topics Detection System for Twitter Social Network
Twitrends: A Real Time Trending Topics Detection System for Twitter Social Network Cosmina Ivan Department of Computer Science Technical University of Cluj-Napoca Cluj County, Romania Andrei Moldovan Department
More informationSTORM AND LOW-LATENCY PROCESSING.
STORM AND LOW-LATENCY PROCESSING Low latency processing Similar to data stream processing, but with a twist Data is streaming into the system (from a database, or a netk stream, or an HDFS file, or ) We
More informationTyphoon: An SDN Enhanced Real-Time Big Data Streaming Framework
Typhoon: An SDN Enhanced Real-Time Big Data Streaming Framework Junguk Cho, Hyunseok Chang, Sarit Mukherjee, T.V. Lakshman, and Jacobus Van der Merwe 1 Big Data Era Big data analysis is increasingly common
More informationPostprint.
http://www.diva-portal.org Postprint This is the accepted version of a paper presented at The 4th International Workshop on Community Networks and Bottom-up-Broadband(CNBuB 2015), 24-26 Aug. 2015, Rome,
More informationPriority Based Resource Scheduling Techniques for a Multitenant Stream Processing Platform
Priority Based Resource Scheduling Techniques for a Multitenant Stream Processing Platform By Rudraneel Chakraborty A thesis submitted to the Faculty of Graduate and Postdoctoral Affairs in partial fulfillment
More informationPutting it together. Data-Parallel Computation. Ex: Word count using partial aggregation. Big Data Processing. COS 418: Distributed Systems Lecture 21
Big Processing -Parallel Computation COS 418: Distributed Systems Lecture 21 Michael Freedman 2 Ex: Word count using partial aggregation Putting it together 1. Compute word counts from individual files
More information2/18/2019 Week 5-A Sangmi Lee Pallickara CS535 BIG DATA. FAQs. 2/18/2019 CS535 Big Data - Spring 2019 Week 5-A-3. Week 5-A-4.
2/18/2019 - Spring 2019 1 CS535 BIG DATA FAQs PART A. BIG DATA TECHNOLOGY 4. REAL-TIME STREAMING COMPUTING MODELS: APACHE STORM AND TWITTER HERON HITS Algorithm How do we limit the base-set? Sangmi Lee
More informationCS 398 ACC Streaming. Prof. Robert J. Brunner. Ben Congdon Tyler Kim
CS 398 ACC Streaming Prof. Robert J. Brunner Ben Congdon Tyler Kim MP3 How s it going? Final Autograder run: - Tonight ~9pm - Tomorrow ~3pm Due tomorrow at 11:59 pm. Latest Commit to the repo at the time
More informationBig Data Infrastructures & Technologies
Big Data Infrastructures & Technologies Data streams and low latency processing DATA STREAM BASICS What is a data stream? Large data volume, likely structured, arriving at a very high rate Potentially
More informationOver the last few years, we have seen a disruption in the data management
JAYANT SHEKHAR AND AMANDEEP KHURANA Jayant is Principal Solutions Architect at Cloudera working with various large and small companies in various Verticals on their big data and data science use cases,
More informationUMP Alert Engine. Status. Requirements
UMP Alert Engine Status Requirements Goal Terms Proposed Design High Level Diagram Alert Engine Topology Stream Receiver Stream Router Policy Evaluator Alert Publisher Alert Topology Detail Diagram Alert
More informationInstalling and Configuring Apache Storm
3 Installing and Configuring Apache Storm Date of Publish: 2018-08-30 http://docs.hortonworks.com Contents Installing Apache Storm... 3...7 Configuring Storm for Supervision...8 Configuring Storm Resource
More informationDynamic Algorithm Selection for the Logic of Tasks in IoT Stream Processing Systems
Dynamic Selection for the Logic of s in IoT Stream Processing Systems Ehsan Poormohammady, Jens Helge Reelfs, Mirko Stoffers, Klaus Wehrle Communication and Distributed Systems, RWTH Aachen University,
More informationPerformance Assessment of Storm and Spark for Twitter Streaming
Performance Assessment of Storm and Spark for Twitter Streaming 1 B. Revathi Reddy, 2 T.Swathi 1 PG Student, Computer Science and Engineering Dept., GPREC, Kurnool(District), Andhra Pradesh-518004, INDIA,
More informationReal-time Calculating Over Self-Health Data Using Storm Jiangyong Cai1, a, Zhengping Jin2, b
4th International Conference on Mechatronics, Materials, Chemistry and Computer Engineering (ICMMCCE 2015) Real-time Calculating Over Self-Health Data Using Storm Jiangyong Cai1, a, Zhengping Jin2, b 1
More information10/26/2017 Sangmi Lee Pallickara Week 10- B. CS535 Big Data Fall 2017 Colorado State University
CS535 Big Data - Fall 2017 Week 10-A-1 CS535 BIG DATA FAQs Term project proposal Feedback for the most of submissions are available PA2 has been posted (11/6) PART 2. SCALABLE FRAMEWORKS FOR REAL-TIME
More informationNetwork-Aware Grouping in Distributed Stream Processing Systems
Network-Aware Grouping in Distributed Stream Processing Systems Fei Chen, Song Wu, Hai Jin Services Computing Technology and System Lab Cluster and Grid Computing Lab School of Computer Science and Technology
More informationApache Storm: Hands-on Session A.A. 2016/17
Università degli Studi di Roma Tor Vergata Dipartimento di Ingegneria Civile e Ingegneria Informatica Apache Storm: Hands-on Session A.A. 2016/17 Matteo Nardelli Laurea Magistrale in Ingegneria Informatica
More informationPaaS SAE Top3 SuperAPP
PaaS SAE Top3 SuperAPP PaaS SAE Top3 SuperAPP Pla$orm Services Group Sam Biwing Monika Rambone Skylee Kingho1d AWS S3 CDN ATS 1k 30+ 10+ Go FE Services Panel C++ Go C/C++ ACM FE Pla$orm Services Group
More informationA BIG DATA STREAMING RECIPE WHAT TO CONSIDER WHEN BUILDING A REAL TIME BIG DATA APPLICATION
A BIG DATA STREAMING RECIPE WHAT TO CONSIDER WHEN BUILDING A REAL TIME BIG DATA APPLICATION Konstantin Gregor / konstantin.gregor@tngtech.com ABOUT ME So ware developer for TNG in Munich Client in telecommunication
More informationFault Tolerance for Stream Processing Engines
Fault Tolerance for Stream Processing Engines Muhammad Anis Uddin Nasir KTH Royal Institute of Technology, Stockholm, Sweden anisu@kth.se arxiv:1605.00928v1 [cs.dc] 3 May 2016 Abstract Distributed Stream
More informationMillWheel:Fault Tolerant Stream Processing at Internet Scale. By FAN Junbo
MillWheel:Fault Tolerant Stream Processing at Internet Scale By FAN Junbo Introduction MillWheel is a low latency data processing framework designed by Google at Internet scale. Motived by Google Zeitgeist
More informationFluentd + MongoDB + Spark = Awesome Sauce
Fluentd + MongoDB + Spark = Awesome Sauce Nishant Sahay, Sr. Architect, Wipro Limited Bhavani Ananth, Tech Manager, Wipro Limited Your company logo here Wipro Open Source Practice: Vision & Mission Vision
More informationGrand challenge: Automatic anomaly detection over sliding windows
Provided by the author(s) and NUI Galway in accordance with publisher policies. Please cite the published version when available. Title Grand challenge: Automatic anomaly detection over sliding windows
More informationmicrosoft
70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series
More informationSpark Streaming. Guido Salvaneschi
Spark Streaming Guido Salvaneschi 1 Spark Streaming Framework for large scale stream processing Scales to 100s of nodes Can achieve second scale latencies Integrates with Spark s batch and interactive
More informationSpark, Shark and Spark Streaming Introduction
Spark, Shark and Spark Streaming Introduction Tushar Kale tusharkale@in.ibm.com June 2015 This Talk Introduction to Shark, Spark and Spark Streaming Architecture Deployment Methodology Performance References
More informationScale-Out Algorithm For Apache Storm In SaaS Environment
University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Computer Science and Engineering: Theses, Dissertations, and Student Research Computer Science and Engineering, Department
More informationYCSB++ benchmarking tool Performance debugging advanced features of scalable table stores
YCSB++ benchmarking tool Performance debugging advanced features of scalable table stores Swapnil Patil M. Polte, W. Tantisiriroj, K. Ren, L.Xiao, J. Lopez, G.Gibson, A. Fuchs *, B. Rinaldi * Carnegie
More informationIntra-cluster Replication for Apache Kafka. Jun Rao
Intra-cluster Replication for Apache Kafka Jun Rao About myself Engineer at LinkedIn since 2010 Worked on Apache Kafka and Cassandra Database researcher at IBM Outline Overview of Kafka Kafka architecture
More informationWhat s New in ActiveVOS 9.0
What s New in ActiveVOS 9.0 Dr. Michael Rowley, Chief Technology Officer Clive Bearman, Director of Product Marketing 1 Some GoToWebinar Tips Click the maximize button for the best resolution The panel
More informationSpark Streaming. Professor Sasu Tarkoma.
Spark Streaming 2015 Professor Sasu Tarkoma www.cs.helsinki.fi Spark Streaming Spark extension of accepting and processing of streaming high-throughput live data streams Data is accepted from various sources
More informationDhalion: Self-Regulating Stream Processing in Heron
Dhalion: Self-Regulating Stream Processing in Heron Avrilia Floratou Microsoft avflor@microsoft.com Sriram Rao Microsoft sriramra@microsoft.com Ashvin Agrawal Microsoft asagr@microsoft.com Karthik Ramasamy
More informationOverview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::
Title Duration : Apache Spark Development : 4 days Overview Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized
More informationApache Flink. Alessandro Margara
Apache Flink Alessandro Margara alessandro.margara@polimi.it http://home.deib.polimi.it/margara Recap: scenario Big Data Volume and velocity Process large volumes of data possibly produced at high rate
More informationAn Evaluation of Data Stream Processing Systems for Data Driven Applications
Procedia Computer Science Volume 80, 2016, Pages 439 449 ICCS 2016. The International Conference on Computational Science An Evaluation of Data Stream Processing Systems for Data Driven Applications Jonathan
More informationFunctional Comparison and Performance Evaluation. Huafeng Wang Tianlun Zhang Wei Mao 2016/11/14
Functional Comparison and Performance Evaluation Huafeng Wang Tianlun Zhang Wei Mao 2016/11/14 Overview Streaming Core MISC Performance Benchmark Choose your weapon! 2 Continuous Streaming Micro-Batch
More informationNew Techniques to Curtail the Tail Latency in Stream Processing Systems
New Techniques to Curtail the Tail Latency in Stream Processing Systems Guangxiang Du University of Illinois at Urbana-Champaign Urbana, IL, 61801, USA gdu3@illinois.edu Indranil Gupta University of Illinois
More informationYARN: A Resource Manager for Analytic Platform Tsuyoshi Ozawa
YARN: A Resource Manager for Analytic Platform Tsuyoshi Ozawa ozawa.tsuyoshi@lab.ntt.co.jp ozawa@apache.org About me Tsuyoshi Ozawa Research Engineer @ NTT Twitter: @oza_x86_64 Over 150 reviews in 2015
More informationc 2016 Anirudh Jayakumar
c 2016 Anirudh Jayakumar ADAPTIVE BATCHING OF STREAMS TO ENHANCE THROUGHPUT AND TO SUPPORT DYNAMIC LOAD BALANCING BY ANIRUDH JAYAKUMAR THESIS Submitted in partial fulfillment of the requirements for the
More information8/24/2017 Week 1-B Instructor: Sangmi Lee Pallickara
Week 1-B-0 Week 1-B-1 CS535 BIG DATA FAQs Slides are available on the course web Wait list Term project topics PART 0. INTRODUCTION 2. DATA PROCESSING PARADIGMS FOR BIG DATA Sangmi Lee Pallickara Computer
More informationG-Storm: GPU-enabled High-throughput Online Data Processing in Storm
215 IEEE International Conference on Big Data (Big Data) G-: GPU-enabled High-throughput Online Data Processing in Zhenhua Chen, Jielong Xu, Jian Tang, Kevin Kwiat and Charles Kamhoua Abstract The Single
More informationAdaptive Cluster Computing using JavaSpaces
Adaptive Cluster Computing using JavaSpaces Jyoti Batheja and Manish Parashar The Applied Software Systems Lab. ECE Department, Rutgers University Outline Background Introduction Related Work Summary of
More informationLarge-Scale Data Engineering. Data streams and low latency processing
Large-Scale Data Engineering Data streams and low latency processing DATA STREAM BASICS What is a data stream? Large data volume, likely structured, arriving at a very high rate Potentially high enough
More informationR-Storm: A Resource-Aware Scheduler for STORM. Mohammad Hosseini Boyang Peng Zhihao Hong Reza Farivar Roy Campbell
R-Storm: A Resource-Aware Scheduler for STORM Mohammad Hosseini Boyang Peng Zhihao Hong Reza Farivar Roy Campbell Introduction STORM is an open source distributed real-time data stream processing system
More informationEvaluation of Apache Kafka in Real-Time Big Data Pipeline Architecture
Evaluation of Apache Kafka in Real-Time Big Data Pipeline Architecture Thandar Aung, Hla Yin Min, Aung Htein Maw University of Information Technology Yangon, Myanmar thandaraung@uit.edu.mm, hlayinmin@uit.edu.mm,
More informationMeasuring HEC Performance For Fun and Profit
Measuring HEC Performance For Fun and Profit Itay Neeman Director, Engineering, Splunk Clif Gordon Principal Software Engineer, Splunk September 2017 Washington, DC Forward-Looking Statements During the
More informationTSAR A TimeSeries AggregatoR. Anirudh Todi TSAR
TSAR A TimeSeries AggregatoR Anirudh Todi Twitter @anirudhtodi TSAR What is TSAR? What is TSAR? TSAR is a framework and service infrastructure for specifying, deploying and operating timeseries aggregation
More informationHortonworks Cybersecurity Package
Tuning Guide () docs.hortonworks.com Hortonworks Cybersecurity : Tuning Guide Copyright 2012-2018 Hortonworks, Inc. Some rights reserved. Hortonworks Cybersecurity (HCP) is a modern data application based
More informationChallenges in Data Stream Processing
Università degli Studi di Roma Tor Vergata Dipartimento di Ingegneria Civile e Ingegneria Informatica Challenges in Data Stream Processing Corso di Sistemi e Architetture per Big Data A.A. 2016/17 Valeria
More informationTopics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples
Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?
More information@joerg_schad Nightmares of a Container Orchestration System
@joerg_schad Nightmares of a Container Orchestration System 2017 Mesosphere, Inc. All Rights Reserved. 1 Jörg Schad Distributed Systems Engineer @joerg_schad Jan Repnak Support Engineer/ Solution Architect
More informationAdaptive Online Scheduling in Storm
Adaptive Online Scheduling in Storm Leonardo Aniello aniello@dis.uniroma1.it Roberto Baldoni baldoni@dis.uniroma1.it Leonardo Querzoni querzoni@dis.uniroma1.it Research Center on Cyber Intelligence and
More informationREAL-TIME PEDESTRIAN DETECTION USING APACHE STORM IN A DISTRIBUTED ENVIRONMENT
REAL-TIME PEDESTRIAN DETECTION USING APACHE STORM IN A DISTRIBUTED ENVIRONMENT ABSTRACT Du-Hyun Hwang, Yoon-Ki Kim and Chang-Sung Jeong Department of Electrical Engineering, Korea University, Seoul, Republic
More informationBuilding Durable Real-time Data Pipeline
Building Durable Real-time Data Pipeline Apache BookKeeper at Twitter @sijieg Twitter Background Layered Architecture Agenda Design Details Performance Scale @Twitter Q & A Publish-Subscribe Online services
More informationBIG DATA. Using the Lambda Architecture on a Big Data Platform to Improve Mobile Campaign Management. Author: Sandesh Deshmane
BIG DATA Using the Lambda Architecture on a Big Data Platform to Improve Mobile Campaign Management Author: Sandesh Deshmane Executive Summary Growing data volumes and real time decision making requirements
More informationSource, Sink, and Processor Configuration Values
3 Source, Sink, and Processor Configuration Values Date of Publish: 2018-12-18 https://docs.hortonworks.com/ Contents... 3 Source Configuration Values...3 Processor Configuration Values... 5 Sink Configuration
More informationYCSB++ Benchmarking Tool Performance Debugging Advanced Features of Scalable Table Stores
YCSB++ Benchmarking Tool Performance Debugging Advanced Features of Scalable Table Stores Swapnil Patil Milo Polte, Wittawat Tantisiriroj, Kai Ren, Lin Xiao, Julio Lopez, Garth Gibson, Adam Fuchs *, Billie
More informationParallel Clustering of High-Dimensional Social Media Data Streams
Parallel Clustering of High-Dimensional Social Media Data Streams Xiaoming Gao School of Informatics and Computing Indiana University Bloomington, IN, USA gao4@umail.iu.edu Emilio Ferrara School of Informatics
More informationEsper EQC. Horizontal Scale-Out for Complex Event Processing
Esper EQC Horizontal Scale-Out for Complex Event Processing Esper EQC - Introduction Esper query container (EQC) is the horizontal scale-out architecture for Complex Event Processing with Esper and EsperHA
More informationAdaptive SMT Control for More Responsive Web Applications
Adaptive SMT Control for More Responsive Web Applications Hiroshi Inoue and Toshio Nakatani IBM Research Tokyo University of Tokyo Oct 27, 2014 IISWC @ Raleigh, NC, USA Response time matters! Peak throughput
More informationCharacterising Resource Management Performance in Kubernetes. Appendices.
Characterising Resource Management Performance in Kubernetes. Appendices. Víctor Medel a, Rafael Tolosana-Calasanz a, José Ángel Bañaresa, Unai Arronategui a, Omer Rana b a Aragon Institute of Engineering
More informationReliable Distributed Messaging with HornetQ
Reliable Distributed Messaging with HornetQ Lin Zhao Software Engineer, Groupon lin@groupon.com Agenda Introduction MessageBus Design Client API Monitoring Comparison with HornetQ Cluster Future Work Introduction
More informationWorking with Storm Topologies
3 Working with Storm Topologies Date of Publish: 2018-08-13 http://docs.hortonworks.com Contents Packaging Storm Topologies... 3 Deploying and Managing Apache Storm Topologies...4 Configuring the Storm
More information10 Things to Consider When Using Apache Ka7a: U"liza"on Points of Apache Ka4a Obtained From IoT Use Case
10 Things to Consider When Using Apache Ka7a: U"liza"on Points of Apache Ka4a Obtained From IoT Use Case May 16, 2017 NTT DATA CorporaAon Naoto Umemori, Yuji Hagiwara 2017 NTT DATA Corporation Contents
More informationParallel Patterns for Window-based Stateful Operators on Data Streams: an Algorithmic Skeleton Approach
Parallel Patterns for Window-based Stateful Operators on Data Streams: an Algorithmic Skeleton Approach Tiziano De Matteis, Gabriele Mencagli University of Pisa Italy INTRODUCTION The recent years have
More informationVTDL: A Notation for Data Stream Processing Applications
VTDL: A Notation for Data Stream Processing Applications Christoph Hochreiner, Stefan Schulte and Schahram Dustdar Distributed Systems Group, TU Wien, Austria {c.hochreiner, s.schulte, s.dustdar @dsg.tuwien.ac.at
More informationViper: Communication-Layer Determinism and Scaling in Low-Latency Stream Processing
Viper: Communication-Layer Determinism and Scaling in Low-Latency Stream Processing Ivan Walulya, Yiannis Nikolakopoulos, Vincenzo Gulisano Marina Papatriantafilou and Philippas Tsigas Auto-DaSP 2017 Chalmers
More informationData-Intensive Distributed Computing
Data-Intensive Distributed Computing CS 451/651 431/631 (Winter 2018) Part 9: Real-Time Data Analytics (1/2) March 27, 2018 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo
More informationImproving efficiency of Twitter Infrastructure using Chargeback
Improving efficiency of Twitter Infrastructure using Chargeback @vinucharanya @micheal AGENDA Brief History Problem Chargeback Engineering Challenges The product Impact Future Getty Images from http://www.fifa.com/worldcup/news/y=2010/m=7/news=pride-for-africa-spain-strike-gold-2247372.html
More informationData Acquisition. The reference Big Data stack
Università degli Studi di Roma Tor Vergata Dipartimento di Ingegneria Civile e Ingegneria Informatica Data Acquisition Corso di Sistemi e Architetture per Big Data A.A. 2016/17 Valeria Cardellini The reference
More informationRPT: Re-architecting Loss Protection for Content-Aware Networks
RPT: Re-architecting Loss Protection for Content-Aware Networks Dongsu Han, Ashok Anand ǂ, Aditya Akella ǂ, and Srinivasan Seshan Carnegie Mellon University ǂ University of Wisconsin-Madison Motivation:
More informationSAND: A Fault-Tolerant Streaming Architecture for Network Traffic Analytics
1 SAND: A Fault-Tolerant Streaming Architecture for Network Traffic Analytics Qin Liu, John C.S. Lui 1 Cheng He, Lujia Pan, Wei Fan, Yunlong Shi 2 1 The Chinese University of Hong Kong 2 Huawei Noah s
More informationTowards Flexible Event Processing in Distributed Data Streams
Towards Flexible Event Processing in Distributed Data Streams Sebastian Bothe Fraunhofer IAIS sebastian.bothe@ iais.fraunhofer.de Antonios Deligiannakis Technical University of Crete adeli@softnet.tuc.gr
More informationThe Emergence of the Datacenter Developer. Tobi Knaup, Co-Founder & CTO at
The Emergence of the Datacenter Developer Tobi Knaup, Co-Founder & CTO at Mesosphere @superguenter A Brief History of Operating Systems 2 1950 s Mainframes Punchcards No operating systems Time Sharing
More informationSizing Guidelines and Performance Tuning for Intelligent Streaming
Sizing Guidelines and Performance Tuning for Intelligent Streaming Copyright Informatica LLC 2017. Informatica and the Informatica logo are trademarks or registered trademarks of Informatica LLC in the
More informationThis is the author s final accepted version.
Pamukchiev, A., Jouet, S., and Pezaros, D. P. (2017) Distributed Network Anomaly Detection on an Event Processing Framework. In: IEEE Consumer Communications and Networking Conference 2017, Las Vegas,
More informationCSE 544: Principles of Database Systems
CSE 544: Principles of Database Systems Anatomy of a DBMS, Parallel Databases 1 Announcements Lecture on Thursday, May 2nd: Moved to 9am-10:30am, CSE 403 Paper reviews: Anatomy paper was due yesterday;
More informationBig Streaming Data Processing. How to Process Big Streaming Data 2016/10/11. Fraud detection in bank transactions. Anomalies in sensor data
Big Data Big Streaming Data Big Streaming Data Processing Fraud detection in bank transactions Anomalies in sensor data Cat videos in tweets How to Process Big Streaming Data Raw Data Streams Distributed
More informationSTELA: ON-DEMAND ELASTICITY IN DISTRIBUTED DATA STREAM PROCESSING SYSTEMS LE XU THESIS
2015 Le Xu STELA: ON-DEMAND ELASTICITY IN DISTRIBUTED DATA STREAM PROCESSING SYSTEMS BY LE XU THESIS Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer
More informationTransport Protocols for Data Center Communication. Evisa Tsolakou Supervisor: Prof. Jörg Ott Advisor: Lect. Pasi Sarolahti
Transport Protocols for Data Center Communication Evisa Tsolakou Supervisor: Prof. Jörg Ott Advisor: Lect. Pasi Sarolahti Contents Motivation and Objectives Methodology Data Centers and Data Center Networks
More informationPulsar. Realtime Analytics At Scale. Wang Xinglang
Pulsar Realtime Analytics At Scale Wang Xinglang Agenda Pulsar : Real Time Analytics At ebay Business Use Cases Product Requirements Pulsar : Technology Deep Dive 2 Pulsar Business Use Case: Behavioral
More informationMachine Learning & Science Data Processing
Machine Learning & Science Data Processing Rob Lyon robert.lyon@manchester.ac.uk SKA Group University of Manchester Machine Learning (1) Collective term for branch of A.I. Uses statistical tools to make
More information