Real Time Processing. Karthik Ramasamy

1 Real Time Processing Karthik Ramasamy

2 Information Age: Real-time is key

3 Real Time Connected World
- Internet of Things: 30 B connected devices by 2020
- Connected Vehicles: data transferred per vehicle per month, 4 MB -> 5 GB+
- Health Care: 153 Exabytes (2013) -> 2314 Exabytes (2020)
- Digital Assistants (Predictive Analytics): $2B (2012) -> $6.5B (2019) [1]. Siri/Cortana/Google Now
- Machine Data: 40% of digital universe by 2020
- Augmented/Virtual Reality: $150B by 2020 [2]. Oculus/HoloLens/Magic Leap
[1] http://
[2] http://techcrunch.com/2015/04/06/augmented-and-virtual-reality-to-hit-150-billion-by-2020/#.7q0heh:oabw

4 Why Real Time?
- real time trends: Emerging break out trends in Twitter (in the form #hashtags)
- real time conversations: Real time sports conversations related with a topic (recent goal or touchdown)
- real time recommendations: Real time product recommendations based on your behavior & profile
- real time search: Real time search of tweets

5 Value of Data: It's contextual
[Figure: Value of Data to Decision-Making versus Time (Information Half-Life in Decision-Making). Time axis: Seconds, Minutes, Hours, Days, Months. Labels: Time-critical Decisions, Real-Time, Preventive/Predictive, Actionable, Reactive, Historical, Traditional "Batch" Business Intelligence.]
[1] Courtesy Michael Franklin, BIRTE, 2015.

6 What is Real-Time? It's contextual
- BATCH (> 1 hour, high throughput): adhoc queries, monthly active users, relevance for ads
- Real Time (10 ms - 1 sec, approximate): ad impressions count, hash tag trends
- OLTP (< 500 ms, latency sensitive): deterministic workflows, fanout Tweets, search for Tweets
- REAL REAL TIME (< 1 ms, low latency): Financial Trading

7 Real Time Analytics
- INTERACTIVE: Store data and provide results instantly when a query is posed
- STREAMING: Analyze data as it is being produced

8 Real Time Use Cases
- Online Services (10s of ms): Transaction log, Queues, RPCs
- Real Time (s of ms): Change propagation, Streaming analytics
- Data for Batch Analytics (secs to mins): Log aggregation, Client events

9 Real Time Stack Components: Many moving parts
REAL TIME STACK: Data Collectors, Messaging, Storage, Compute

10 Scribe
- Open source log aggregation: Originally from Facebook. Twitter made significant enhancements for real time event aggregation
- High throughput and scale: Delivers 125M messages/min. Provides tight SLAs on data reliability
- Runs on every machine: Simple, very reliable and efficiently uses memory and CPU

11 Event Bus & Distributed Log: Next Generation Messaging

12 Twitter Messaging
[Diagram labels: Core Business Logic (tweets, fanouts); Scribe, Deferred RPC, Gizzard, Database, Search; Kestrel, Kestrel, Kestrel, MySQL, BookKeeper, Kafka, HDFS.]

13 Kestrel Queue: MESSAGE QUEUE, MULTIPLE PROTOCOLS, HIGHLY SCALABLE, LOOSELY ORDERED

14 Kestrel Queue
[Diagram with nodes labeled P, C and zk.]

15 Kestrel Limitations
- Durability is hard to achieve
- Adding subscribers is expensive
- Read-behind degrades performance: Too many random I/Os
- Scales poorly as #queues increase
- Cross DC replication

16 Kafka Queue: CLIENT STATE MANAGEMENT, CONSUMER SCALABILITY, HIGH THROUGHPUT, SIMPLE FOR OPERATIONS

17 Kafka Limitations
- Relies on file system page cache
- Performance degradation when subscribers fall behind: too much random I/O
- Scaling to many topics
- Loss of data
- No centralized operational stats

18 Rethinking Messaging
- Unified Stack: tradeoffs for various workloads
- Durable writes, intra cluster and geo-replication
- Multi tenancy
- Scale resources independently
- Cost efficiency
- Ease of Manageability

19 Event Bus - Pub-Sub
[Diagram labels: Publisher, Write Proxy, Event Bus + Distributed Log, Read Proxy, Subscriber, Distributed Log, Metadata.]

20 Distributed Log
[Diagram labels: Publisher, Write Proxy, Distributed Log (three BK nodes), Read Proxy, Subscriber, Metadata - ZK.]
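The pub-sub flow in the two diagrams above (publisher, write path, shared log, read path, subscriber) can be illustrated with a minimal in-memory sketch. This is not the Distributed Log or BookKeeper API; the class and method names (`DistributedLog`, `Subscriber`, `append`, `read_from`, `poll`) are hypothetical and exist only for illustration. The point it tries to show is that when subscribers track their own position in one append-only log, a slow or newly added subscriber does not require extra copies of the data.

```python
from dataclasses import dataclass, field


@dataclass
class DistributedLog:
    """Toy in-memory stand-in for a durable, append-only log (hypothetical)."""
    records: list = field(default_factory=list)

    def append(self, record):
        """Write path: append a record and return its offset."""
        self.records.append(record)
        return len(self.records) - 1

    def read_from(self, offset, max_records=10):
        """Read path: return up to max_records records starting at offset."""
        return self.records[offset:offset + max_records]


@dataclass
class Subscriber:
    """Each subscriber keeps its own read position in the shared log."""
    name: str
    position: int = 0

    def poll(self, log, max_records=10):
        batch = log.read_from(self.position, max_records)
        self.position += len(batch)
        return batch


log = DistributedLog()
for event in ["tweet-1", "tweet-2", "tweet-3"]:
    log.append(event)

fast = Subscriber("fast")
slow = Subscriber("slow")
fast_batch = fast.poll(log)                  # reads everything available
slow_batch = slow.poll(log, max_records=1)   # falls behind; others are unaffected
late = Subscriber("late")                    # a new subscriber only needs a position
late_batch = late.poll(log)
```

A real system would add durability, replication and proxies between clients and the log; none of that is modeled here.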

21 Distributed
- Manhattan Key Value Store
- Durable Deferred RPC
- Real Time Search Indexing
- Pub Sub System
- Globally Replicated Log

22 Distributed
- 400 TB/Day IN
- 20 PB/Day OUT
- 2 Trillion Events/Day PROCESSED
- 5-10 MS latency
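To put the daily figures on this slide into per-second terms, here is a small arithmetic sketch. It reads the slide as 400 TB/day in and 20 PB/day out, and assumes decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes) and traffic spread evenly over the day; real traffic is bursty, so peaks would be higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

events_per_day = 2e12       # 2 trillion events/day (from the slide)
bytes_in_per_day = 400e12   # 400 TB/day, assuming decimal terabytes
bytes_out_per_day = 20e15   # 20 PB/day, assuming decimal petabytes

# Average rates, assuming a perfectly even load over the day.
events_per_sec = events_per_day / SECONDS_PER_DAY
gb_in_per_sec = bytes_in_per_day / SECONDS_PER_DAY / 1e9
gb_out_per_sec = bytes_out_per_day / SECONDS_PER_DAY / 1e9

# Each byte written is read many times by different subscribers.
fanout_ratio = bytes_out_per_day / bytes_in_per_day

print(f"{events_per_sec / 1e6:.1f} million events/sec")
print(f"{gb_in_per_sec:.2f} GB/sec in, {gb_out_per_sec:.2f} GB/sec out")
print(f"read/write ratio: {fanout_ratio:.0f}x")
```

Under these assumptions the averages come to roughly 23 million events per second, about 4.6 GB/s written and about 231 GB/s read, a 50x read-to-write ratio.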

23 Twitter Heron: Next Generation Streaming Engine

24 Twitter Heron: Better Storm
- Container Based Architecture
- Separate Monitoring and Scheduling
- Simplified Execution Model
- Much Better Performance

25 Twitter Heron Design: Goals
- Fully API compatible with Storm: Directed acyclic graph; Topologies, Spouts and Bolts
- Task isolation: Ease of debug-ability/isolation/profiling
- Use of main stream languages: C++, Java and Python
- Support for back pressure: Topologies should be self adjusting
- Batching of tuples: Amortizing the cost of transferring tuples
- Efficiency: Reduce resource consumption
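The "batching of tuples" goal can be illustrated with a simple cost model. This is a sketch under stated assumptions, not Heron's implementation: it assumes every transfer pays a fixed overhead plus a per-tuple cost, and the function names and cost numbers are hypothetical. With a batch of b tuples per transfer, the fixed overhead is shared b ways.

```python
def batches(tuples, batch_size):
    """Group an iterable of tuples into lists of at most batch_size items."""
    batch = []
    for item in tuples:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch


def cost_per_tuple(fixed_cost, per_tuple_cost, batch_size):
    """Average cost per tuple when one transfer carries batch_size tuples."""
    return per_tuple_cost + fixed_cost / batch_size


grouped = list(batches(range(7), batch_size=3))

# Hypothetical costs in arbitrary units: 100 per transfer, 1 per tuple.
unbatched = cost_per_tuple(fixed_cost=100.0, per_tuple_cost=1.0, batch_size=1)
batched = cost_per_tuple(fixed_cost=100.0, per_tuple_cost=1.0, batch_size=50)
```

The trade-off is latency: a tuple may wait for its batch to fill, so a real system would typically also flush on a timer.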

26 Twitter Heron: Guaranteed Message Passing, Horizontal Scalability, Robust Fault Tolerance, Concise Code-Focus on Logic

27 Heron Terminology
- Topology: Directed acyclic graph; vertices = computation, and edges = streams of data tuples
- Spouts: Sources of data tuples for the topology. Examples: Kafka/Kestrel/MySQL/Postgres
- Bolts: Process incoming tuples, and emit outgoing tuples. Examples: filtering/aggregation/join/any function
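The terminology above can be made concrete with a tiny word-count pipeline. This sketch deliberately does not use the Heron or Storm API; it models a spout as a Python generator and bolts as functions over a stream of tuples. All function names are hypothetical, and a fixed list stands in for a real source such as Kafka.

```python
from collections import Counter


def sentence_spout():
    """Spout: a source of data tuples (a fixed list stands in for a real source)."""
    for sentence in ["real time is key", "real time search"]:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: consume sentence tuples and emit one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def filter_bolt(stream, stop_words=frozenset({"is"})):
    """Bolt (filtering): drop tuples whose word is a stop word."""
    for (word,) in stream:
        if word not in stop_words:
            yield (word,)


def count_bolt(stream):
    """Bolt (aggregation): count occurrences of each word."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
    return counts


# Wire the DAG: spout -> split -> filter -> count.
word_counts = count_bolt(filter_bolt(split_bolt(sentence_spout())))
```

In a real topology each vertex would run as many parallel tasks on different machines; here the whole graph runs in a single process for clarity.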

28 Heron Topology
[Diagram nodes: Spout 1, Spout 2, Bolt 1, Bolt 2, Bolt 3, Bolt 4, Bolt 5.]

29 Stream Groupings
- Shuffle Grouping: Random distribution of tuples
- Fields Grouping: Group tuples by a field or multiple fields
- All Grouping: Replicates tuples to all tasks
- Global Grouping: Send the entire stream to one task
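The four groupings can be sketched as routing functions that map a tuple to the list of downstream task indices that should receive it. This is an illustration only, not the Heron or Storm API: tuples are modeled as dictionaries, the function names are hypothetical, and the choice of CRC32 as the hash and of task 0 as the global target are assumptions.

```python
import random
import zlib


def shuffle_grouping(tup, num_tasks, rng=random):
    """Shuffle: pick one task at random."""
    return [rng.randrange(num_tasks)]


def fields_grouping(tup, num_tasks, fields):
    """Fields: tuples with equal values in `fields` go to the same task."""
    key = "|".join(str(tup[f]) for f in fields).encode("utf-8")
    # CRC32 is stable across runs, unlike Python's built-in hash() for strings.
    return [zlib.crc32(key) % num_tasks]


def all_grouping(tup, num_tasks):
    """All: replicate the tuple to every task."""
    return list(range(num_tasks))


def global_grouping(tup, num_tasks):
    """Global: send the entire stream to a single task (task 0 here)."""
    return [0]


t1 = {"user": "alice", "hashtag": "#heron"}
t2 = {"user": "alice", "hashtag": "#storm"}
same_task = fields_grouping(t1, 4, ["user"]) == fields_grouping(t2, 4, ["user"])
```

Fields grouping is what makes per-key aggregation possible: because both tuples for `alice` land on the same task, that task can keep the running count for her locally.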

30 Heron Architecture: High Level
[Diagram labels: Topology Submission, Scheduler, Topology 1, Topology 2, Topology N.]

31 Heron Architecture: Topology
[Diagram labels: Topology Master; ZK Cluster (Logical Plan, Physical Plan and Execution State); Sync Physical Plan; two CONTAINERs, each with a Stream Manager, a Metrics Manager and instances I1, I2, I3, I4.]

32 Heron Stream Manager: BackPressure
[Diagram nodes: S1, B2, B3, B4.]

33 Stream Manager: BackPressure
[Diagram: four Stream Managers, each with instances labeled S1, B2, B3, B4.]

34 [Diagram: Topology Master and four CONTAINERs, each with a Stream Manager and instances drawn from S1, S2, B1, B2, B3, B4, B5.]

35 Heron Stream Manager: Spout BackPressure
[Diagram: four Stream Managers, each with instances labeled S1, B2, B3, B4.]
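The back pressure diagrams above carry no text, so the following is a toy simulation of the general idea rather than a description of Heron's stream manager. It assumes one spout feeding one slower bolt through a buffer, with a high water mark that switches back pressure on and a low water mark that switches it off. The function name, the parameters and the tick-based model are all hypothetical.

```python
from collections import deque


def simulate(ticks, spout_rate, bolt_rate, high_water, low_water):
    """Simulate one spout feeding one bolt through a buffer.

    Returns (max_buffer_depth, ticks_spent_under_back_pressure).
    """
    buffer = deque()
    back_pressure = False
    max_depth = 0
    paused_ticks = 0
    for _tick in range(ticks):
        # The spout emits only while back pressure is off.
        if back_pressure:
            paused_ticks += 1
        else:
            buffer.extend(["tuple"] * spout_rate)
        max_depth = max(max_depth, len(buffer))
        # The bolt drains at most bolt_rate tuples per tick.
        for _ in range(min(bolt_rate, len(buffer))):
            buffer.popleft()
        # Hysteresis: switch on at the high water mark, off at the low one.
        if not back_pressure and len(buffer) >= high_water:
            back_pressure = True
        elif back_pressure and len(buffer) <= low_water:
            back_pressure = False
    return max_depth, paused_ticks


bounded_depth, paused = simulate(100, spout_rate=5, bolt_rate=2,
                                 high_water=20, low_water=5)
# An unreachable high water mark effectively disables back pressure.
unbounded_depth, _ = simulate(100, spout_rate=5, bolt_rate=2,
                              high_water=10**9, low_water=0)
```

With back pressure the buffer stays within a few tuples of the high water mark and the spout simply pauses; without it the buffer grows for as long as the spout outpaces the bolt. Using two thresholds instead of one keeps the spout from flapping on and off every tick.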

36 Heron Use Cases: REALTIME ETL, REAL TIME BI, SPAM DETECTION, REAL TIME TRENDS, REALTIME ML, REAL TIME OPS

37 Heron Sample Topologies

38 Heron has been in production for 3 years
- 1 stage
- 10 stages
- 3x reduction in cores and memory

39 Heron Performance: Settings
[Table: COMPONENTS (Spout, Bolt, # Heron containers) for EXPT #1, EXPT #2 and EXPT #3.]

40 Heron Performance: At Most Once
[Two plots comparing Heron (paper) with Heron (master): Throughput (million tuples/min) versus Spout Parallelism, and CPU usage (# cores used) versus Spout Parallelism.]

41 Heron Performance: CPU Usage
[Plot comparing Heron (paper) with Heron (master): million tuples/min versus Spout Parallelism.]

42 Heron in numbers
- > 400 Real Time Jobs
- 500 Billion Events/Day PROCESSED
- MS latency

43 Stateful Processing in Heron: Optimistic Approaches, Pessimistic Approaches

44 Tying Together

45 Lambda Architecture: Combining batch and real time
[Diagram labels: New Data, Client.]

46 Lambda Architecture - The Good
[Diagram labels: Scribe (Collection Pipeline), Event Bus, Heron (Compute Pipeline), Results.]

47 Lambda Architecture - The Bad
- Have to write everything twice!
- Have to fix everything (may be
- Subtle differences in semantics!
- How much Duct Tape required?
- What about Graphs, ML, SQL, etc?

48 Summingbird to the Rescue
[Diagram labels: Summingbird Program; Message broker, Heron Topology, Online key value result store; HDFS, Scalding/Map Reduce, Batch key value result store; Client.]
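The idea behind the last few slides, writing the aggregation once and running it on both the batch and the real-time path, can be sketched in a few lines. Summingbird itself is a Scala library and this is not its API; the function names, the sample events and the `batch_cutoff` boundary are hypothetical. The sketch relies on the aggregation being additive (counts can be summed), which is what lets a client merge the two result stores.

```python
from collections import Counter


def count_hashtags(events):
    """Shared logic: count hashtags in an iterable of (timestamp, hashtag)."""
    counts = Counter()
    for _timestamp, hashtag in events:
        counts[hashtag] += 1
    return counts


def serve(batch_view, realtime_view):
    """Client-side merge: counts are additive, so the two views are summed."""
    return batch_view + realtime_view


events = [(1, "#heron"), (2, "#storm"), (3, "#heron"), (4, "#heron")]
batch_cutoff = 2  # hypothetical: the batch job has covered timestamps <= 2

# The same function runs on both paths; only the input slice differs.
batch_view = count_hashtags(e for e in events if e[0] <= batch_cutoff)
realtime_view = count_hashtags(e for e in events if e[0] > batch_cutoff)
merged = serve(batch_view, realtime_view)
```

Because both views come from one function, the "write everything twice" and "subtle differences in semantics" problems from the previous slide do not arise in this toy version; the remaining difficulty in practice is agreeing on the cutoff between the two paths.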

49 Curious to Learn More?
- Twitter Heron: Stream Processing at Scale. Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel*,1, Karthik Ramasamy, @staneja. Twitter, Inc., *University of Wisconsin Madison
- Maosong Fu, Sailesh Mittal, Vikas Kedigehalli, Karthik Ramasamy, Michael Barry, Andrew Jorgensen, Christopher Kellogg, Neng Lu, Bill Graham, Jingwei Wu. Twitter, Inc.
- Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel*, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, @challenger_nik. Twitter, Inc., *University of Wisconsin Madison

50 Curious to Learn More?

51 Interested in Heron? HERON IS OPEN SOURCED. CONTRIBUTIONS ARE WELCOME! FOLLOW

52 Interested in Distributed Log? DISTRIBUTED LOG IS OPEN SOURCED. CONTRIBUTIONS ARE WELCOME! FOLLOW

53 We are SOFTWARE ENGINEERS, PRODUCT ENGINEERS, SYSTEM ENGINEERS, UI ENGINEERS/UX DESIGNERS. CAREERS [AT] STREAML.IO

54 Any Questions? WHAT, WHY, WHERE, WHEN, WHO, HOW

55 Get in

56 THANKS FOR ATTENDING!!!

Flying Faster with Heron

Flying Faster with Heron Flying Faster with Heron KARTHIK RAMASAMY @KARTHIKZ #TwitterHeron TALK OUTLINE BEGIN I! II ( III b OVERVIEW MOTIVATION HERON IV Z OPERATIONAL EXPERIENCES V K HERON PERFORMANCE END [! OVERVIEW TWITTER IS

More information

Scalable Streaming Analytics

Scalable Streaming Analytics Scalable Streaming Analytics KARTHIK RAMASAMY @karthikz TALK OUTLINE BEGIN I! II ( III b Overview Storm Overview Storm Internals IV Z V K Heron Operational Experiences END WHAT IS ANALYTICS? according

More information

Paper Presented by Harsha Yeddanapudy

Paper Presented by Harsha Yeddanapudy Storm@Twitter Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel*, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal,

More information

Building Durable Real-time Data Pipeline

Building Durable Real-time Data Pipeline Building Durable Real-time Data Pipeline Apache BookKeeper at Twitter @sijieg Twitter Background Layered Architecture Agenda Design Details Performance Scale @Twitter Q & A Publish-Subscribe Online services

More information

10/24/2017 Sangmi Lee Pallickara Week 10- A. CS535 Big Data Fall 2017 Colorado State University

10/24/2017 Sangmi Lee Pallickara Week 10- A. CS535 Big Data Fall 2017 Colorado State University CS535 Big Data - Fall 2017 Week 10-A-1 CS535 BIG DATA FAQs Term project proposal Feedback for the most of submissions are available PA2 has been posted (11/6) PART 2. SCALABLE FRAMEWORKS FOR REAL-TIME

More information

2/20/2019 Week 5-B Sangmi Lee Pallickara

2/20/2019 Week 5-B Sangmi Lee Pallickara 2/20/2019 - Spring 2019 Week 5-B-1 CS535 BIG DATA FAQs PART A. BIG DATA TECHNOLOGY 4. REAL-TIME STREAMING COMPUTING MODELS: APACHE STORM AND TWITTER HERON Special GTA for PA1 Saptashwa Mitra Saptashwa.Mitra@colostate.edu

More information

Storm. Distributed and fault-tolerant realtime computation. Nathan Marz Twitter

Storm. Distributed and fault-tolerant realtime computation. Nathan Marz Twitter Storm Distributed and fault-tolerant realtime computation Nathan Marz Twitter Storm at Twitter Twitter Web Analytics Before Storm Queues Workers Example (simplified) Example Workers schemify tweets and

More information

Twitter Heron: Stream Processing at Scale

Twitter Heron: Stream Processing at Scale Twitter Heron: Stream Processing at Scale Saiyam Kohli December 8th, 2016 CIS 611 Research Paper Presentation -Sun Sunnie Chung TWITTER IS A REAL TIME ABSTRACT We process billions of events on Twitter

More information

BASIC INTER/INTRA IPC. Operating System (Linux, Windows) HARDWARE. Scheduling Framework (Mesos, YARN, etc) HERON S GENERAL-PURPOSE ARCHITECTURE

BASIC INTER/INTRA IPC. Operating System (Linux, Windows) HARDWARE. Scheduling Framework (Mesos, YARN, etc) HERON S GENERAL-PURPOSE ARCHITECTURE 217 IEEE 33rd International Conference on Data Engineering Twitter Heron: Towards Extensible Streaming Engines Maosong Fu t, Ashvin Agrawal m, Avrilia Floratou m, Bill Graham t, Andrew Jorgensen t Mark

More information

Self Regulating Stream Processing in Heron

Self Regulating Stream Processing in Heron Self Regulating Stream Processing in Heron Huijun Wu 2017.12 Huijun Wu Twitter, Inc. Infrastructure, Data Platform, Real-Time Compute Heron Overview Recent Improvements Self Regulating Challenges Dhalion

More information

Apache Storm. Hortonworks Inc Page 1

Apache Storm. Hortonworks Inc Page 1 Apache Storm Page 1 What is Storm? Real time stream processing framework Scalable Up to 1 million tuples per second per node Fault Tolerant Tasks reassigned on failure Guaranteed Processing At least once

More information

Storm. Distributed and fault-tolerant realtime computation. Nathan Marz Twitter

Storm. Distributed and fault-tolerant realtime computation. Nathan Marz Twitter Storm Distributed and fault-tolerant realtime computation Nathan Marz Twitter Basic info Open sourced September 19th Implementation is 15,000 lines of code Used by over 25 companies >2700 watchers on Github

More information

TSAR A TimeSeries AggregatoR. Anirudh Todi TSAR

TSAR A TimeSeries AggregatoR. Anirudh Todi TSAR TSAR A TimeSeries AggregatoR Anirudh Todi Twitter @anirudhtodi TSAR What is TSAR? What is TSAR? TSAR is a framework and service infrastructure for specifying, deploying and operating timeseries aggregation

More information

Spark, Shark and Spark Streaming Introduction

Spark, Shark and Spark Streaming Introduction Spark, Shark and Spark Streaming Introduction Tushar Kale tusharkale@in.ibm.com June 2015 This Talk Introduction to Shark, Spark and Spark Streaming Architecture Deployment Methodology Performance References

More information

Data Analytics with HPC. Data Streaming

Data Analytics with HPC. Data Streaming Data Analytics with HPC Data Streaming Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

Before proceeding with this tutorial, you must have a good understanding of Core Java and any of the Linux flavors.

Before proceeding with this tutorial, you must have a good understanding of Core Java and any of the Linux flavors. About the Tutorial Storm was originally created by Nathan Marz and team at BackType. BackType is a social analytics company. Later, Storm was acquired and open-sourced by Twitter. In a short time, Apache

More information

Spark Streaming. Guido Salvaneschi

Spark Streaming. Guido Salvaneschi Spark Streaming Guido Salvaneschi 1 Spark Streaming Framework for large scale stream processing Scales to 100s of nodes Can achieve second scale latencies Integrates with Spark s batch and interactive

More information

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development:: Title Duration : Apache Spark Development : 4 days Overview Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized

More information

Webinar Series TMIP VISION

Webinar Series TMIP VISION Webinar Series TMIP VISION TMIP provides technical support and promotes knowledge and information exchange in the transportation planning and modeling community. Today s Goals To Consider: Parallel Processing

More information

Improving efficiency of Twitter Infrastructure using Chargeback

Improving efficiency of Twitter Infrastructure using Chargeback Improving efficiency of Twitter Infrastructure using Chargeback @vinucharanya @micheal AGENDA Brief History Problem Chargeback Engineering Challenges The product Impact Future Getty Images from http://www.fifa.com/worldcup/news/y=2010/m=7/news=pride-for-africa-spain-strike-gold-2247372.html

More information

Twitrends: A Real Time Trending Topics Detection System for Twitter Social Network

Twitrends: A Real Time Trending Topics Detection System for Twitter Social Network Twitrends: A Real Time Trending Topics Detection System for Twitter Social Network Cosmina Ivan Department of Computer Science Technical University of Cluj-Napoca Cluj County, Romania Andrei Moldovan Department

More information

Fluentd + MongoDB + Spark = Awesome Sauce

Fluentd + MongoDB + Spark = Awesome Sauce Fluentd + MongoDB + Spark = Awesome Sauce Nishant Sahay, Sr. Architect, Wipro Limited Bhavani Ananth, Tech Manager, Wipro Limited Your company logo here Wipro Open Source Practice: Vision & Mission Vision

More information

Data Acquisition. The reference Big Data stack

Data Acquisition. The reference Big Data stack Università degli Studi di Roma Tor Vergata Dipartimento di Ingegneria Civile e Ingegneria Informatica Data Acquisition Corso di Sistemi e Architetture per Big Data A.A. 2016/17 Valeria Cardellini The reference

More information

Big Streaming Data Processing. How to Process Big Streaming Data 2016/10/11. Fraud detection in bank transactions. Anomalies in sensor data

Big Streaming Data Processing. How to Process Big Streaming Data 2016/10/11. Fraud detection in bank transactions. Anomalies in sensor data Big Data Big Streaming Data Big Streaming Data Processing Fraud detection in bank transactions Anomalies in sensor data Cat videos in tweets How to Process Big Streaming Data Raw Data Streams Distributed

More information

Putting it together. Data-Parallel Computation. Ex: Word count using partial aggregation. Big Data Processing. COS 418: Distributed Systems Lecture 21

Putting it together. Data-Parallel Computation. Ex: Word count using partial aggregation. Big Data Processing. COS 418: Distributed Systems Lecture 21 Big Processing -Parallel Computation COS 418: Distributed Systems Lecture 21 Michael Freedman 2 Ex: Word count using partial aggregation Putting it together 1. Compute word counts from individual files

More information

Distributed systems for stream processing

Distributed systems for stream processing Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall Alena Hall Large-scale data processing Distributed Systems Functional Programming Data Science & Machine

More information

Streaming Log Analytics with Kafka

Streaming Log Analytics with Kafka Streaming Log Analytics with Kafka Kresten Krab Thorup, Humio CTO Log Everything, Answer Anything, In Real-Time. Why this talk? Humio is a Log Analytics system Designed to run on-prem High volume, real

More information

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?

More information

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Big Data Technology Ecosystem Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Agenda End-to-End Data Delivery Platform Ecosystem of Data Technologies Mapping an End-to-End Solution Case

More information

REAL-TIME ANALYTICS WITH APACHE STORM

REAL-TIME ANALYTICS WITH APACHE STORM REAL-TIME ANALYTICS WITH APACHE STORM Mevlut Demir PhD Student IN TODAY S TALK 1- Problem Formulation 2- A Real-Time Framework and Its Components with an existing applications 3- Proposed Framework 4-

More information

Apache BookKeeper. A High Performance and Low Latency Storage Service

Apache BookKeeper. A High Performance and Low Latency Storage Service Apache BookKeeper A High Performance and Low Latency Storage Service Hello! I am Sijie Guo - PMC Chair of Apache BookKeeper Co-creator of Apache DistributedLog Twitter Messaging/Pub-Sub Team Yahoo! R&D

More information

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS SUJEE MANIYAM FOUNDER / PRINCIPAL @ ELEPHANT SCALE www.elephantscale.com sujee@elephantscale.com HI, I M SUJEE MANIYAM Founder / Principal @ ElephantScale

More information

Data Acquisition. The reference Big Data stack

Data Acquisition. The reference Big Data stack Università degli Studi di Roma Tor Vergata Dipartimento di Ingegneria Civile e Ingegneria Informatica Data Acquisition Corso di Sistemi e Architetture per Big Data A.A. 2017/18 Valeria Cardellini The reference

More information

Data-Intensive Distributed Computing

Data-Intensive Distributed Computing Data-Intensive Distributed Computing CS 451/651 431/631 (Winter 2018) Part 9: Real-Time Data Analytics (1/2) March 27, 2018 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

More information

BIG DATA. Using the Lambda Architecture on a Big Data Platform to Improve Mobile Campaign Management. Author: Sandesh Deshmane

BIG DATA. Using the Lambda Architecture on a Big Data Platform to Improve Mobile Campaign Management. Author: Sandesh Deshmane BIG DATA Using the Lambda Architecture on a Big Data Platform to Improve Mobile Campaign Management Author: Sandesh Deshmane Executive Summary Growing data volumes and real time decision making requirements

More information

CS 398 ACC Streaming. Prof. Robert J. Brunner. Ben Congdon Tyler Kim

CS 398 ACC Streaming. Prof. Robert J. Brunner. Ben Congdon Tyler Kim CS 398 ACC Streaming Prof. Robert J. Brunner Ben Congdon Tyler Kim MP3 How s it going? Final Autograder run: - Tonight ~9pm - Tomorrow ~3pm Due tomorrow at 11:59 pm. Latest Commit to the repo at the time

More information

Auto Management for Apache Kafka and Distributed Stateful System in General

Auto Management for Apache Kafka and Distributed Stateful System in General Auto Management for Apache Kafka and Distributed Stateful System in General Jiangjie (Becket) Qin Data Infrastructure @LinkedIn GIAC 2017, 12/23/17@Shanghai Agenda Kafka introduction and terminologies

More information

Architectural challenges for building a low latency, scalable multi-tenant data warehouse

Architectural challenges for building a low latency, scalable multi-tenant data warehouse Architectural challenges for building a low latency, scalable multi-tenant data warehouse Mataprasad Agrawal Solutions Architect, Services CTO 2017 Persistent Systems Ltd. All rights reserved. Our analytics

More information

Embedded Technosolutions

Embedded Technosolutions Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication

More information

FROM LEGACY, TO BATCH, TO NEAR REAL-TIME. Marc Sturlese, Dani Solà

FROM LEGACY, TO BATCH, TO NEAR REAL-TIME. Marc Sturlese, Dani Solà FROM LEGACY, TO BATCH, TO NEAR REAL-TIME Marc Sturlese, Dani Solà WHO ARE WE? Marc Sturlese - @sturlese Backend engineer, focused on R&D Interests: search, scalability Dani Solà - @dani_sola Backend engineer

More information

Real-time Calculating Over Self-Health Data Using Storm Jiangyong Cai1, a, Zhengping Jin2, b

Real-time Calculating Over Self-Health Data Using Storm Jiangyong Cai1, a, Zhengping Jin2, b 4th International Conference on Mechatronics, Materials, Chemistry and Computer Engineering (ICMMCCE 2015) Real-time Calculating Over Self-Health Data Using Storm Jiangyong Cai1, a, Zhengping Jin2, b 1

More information

Flash Storage Complementing a Data Lake for Real-Time Insight

Flash Storage Complementing a Data Lake for Real-Time Insight Flash Storage Complementing a Data Lake for Real-Time Insight Dr. Sanhita Sarkar Global Director, Analytics Software Development August 7, 2018 Agenda 1 2 3 4 5 Delivering insight along the entire spectrum

More information

Personalizing Netflix with Streaming datasets

Personalizing Netflix with Streaming datasets Personalizing Netflix with Streaming datasets Shriya Arora Senior Data Engineer Personalization Analytics @shriyarora What is this talk about? Helping you decide if a streaming pipeline fits your ETL problem

More information

STORM AND LOW-LATENCY PROCESSING.

STORM AND LOW-LATENCY PROCESSING. STORM AND LOW-LATENCY PROCESSING Low latency processing Similar to data stream processing, but with a twist Data is streaming into the system (from a database, or a netk stream, or an HDFS file, or ) We

More information

Grand challenge: Automatic anomaly detection over sliding windows

Grand challenge: Automatic anomaly detection over sliding windows Provided by the author(s) and NUI Galway in accordance with publisher policies. Please cite the published version when available. Title Grand challenge: Automatic anomaly detection over sliding windows

More information

The Future of Real-Time in Spark

The Future of Real-Time in Spark The Future of Real-Time in Spark Reynold Xin @rxin Spark Summit, New York, Feb 18, 2016 Why Real-Time? Making decisions faster is valuable. Preventing credit card fraud Monitoring industrial machinery

More information

Scaling the Yelp s logging pipeline with Apache Kafka. Enrico

Scaling the Yelp s logging pipeline with Apache Kafka. Enrico Scaling the Yelp s logging pipeline with Apache Kafka Enrico Canzonieri enrico@yelp.com @EnricoC89 Yelp s Mission Connecting people with great local businesses. Yelp Stats As of Q1 2016 90M 102M 70% 32

More information

Data pipelines with PostgreSQL & Kafka

Data pipelines with PostgreSQL & Kafka Data pipelines with PostgreSQL & Kafka Oskari Saarenmaa PostgresConf US 2018 - Jersey City Agenda 1. Introduction 2. Data pipelines, old and new 3. Apache Kafka 4. Sample data pipeline with Kafka & PostgreSQL

More information

Big Data Infrastructures & Technologies

Big Data Infrastructures & Technologies Big Data Infrastructures & Technologies Data streams and low latency processing DATA STREAM BASICS What is a data stream? Large data volume, likely structured, arriving at a very high rate Potentially

More information

SparkStreaming. Large scale near- realtime stream processing. Tathagata Das (TD) UC Berkeley UC BERKELEY

SparkStreaming. Large scale near- realtime stream processing. Tathagata Das (TD) UC Berkeley UC BERKELEY SparkStreaming Large scale near- realtime stream processing Tathagata Das (TD) UC Berkeley UC BERKELEY Motivation Many important applications must process large data streams at second- scale latencies

More information

<Insert Picture Here> Introduction to Big Data Technology

<Insert Picture Here> Introduction to Big Data Technology Introduction to Big Data Technology The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into

More information

Hortonworks DataFlow Sam Lachterman Solutions Engineer

Hortonworks DataFlow Sam Lachterman Solutions Engineer Hortonworks DataFlow Sam Lachterman Solutions Engineer 1 Hortonworks Inc. 2011 2017. All Rights Reserved Disclaimer This document may contain product features and technology directions that are under development,

More information

Performance Assessment of Storm and Spark for Twitter Streaming

Performance Assessment of Storm and Spark for Twitter Streaming Performance Assessment of Storm and Spark for Twitter Streaming 1 B. Revathi Reddy, 2 T.Swathi 1 PG Student, Computer Science and Engineering Dept., GPREC, Kurnool(District), Andhra Pradesh-518004, INDIA,

More information

Functional Comparison and Performance Evaluation. Huafeng Wang Tianlun Zhang Wei Mao 2016/11/14

Functional Comparison and Performance Evaluation. Huafeng Wang Tianlun Zhang Wei Mao 2016/11/14 Functional Comparison and Performance Evaluation Huafeng Wang Tianlun Zhang Wei Mao 2016/11/14 Overview Streaming Core MISC Performance Benchmark Choose your weapon! 2 Continuous Streaming Micro-Batch

More information

Fault Tolerance for Stream Processing Engines

Fault Tolerance for Stream Processing Engines Fault Tolerance for Stream Processing Engines Muhammad Anis Uddin Nasir KTH Royal Institute of Technology, Stockholm, Sweden anisu@kth.se arxiv:1605.00928v1 [cs.dc] 3 May 2016 Abstract Distributed Stream

More information

Pulsar. Realtime Analytics At Scale. Wang Xinglang

Pulsar. Realtime Analytics At Scale. Wang Xinglang Pulsar Realtime Analytics At Scale Wang Xinglang Agenda Pulsar : Real Time Analytics At ebay Business Use Cases Product Requirements Pulsar : Technology Deep Dive 2 Pulsar Business Use Case: Behavioral

More information

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop

More information

Over the last few years, we have seen a disruption in the data management

Over the last few years, we have seen a disruption in the data management JAYANT SHEKHAR AND AMANDEEP KHURANA Jayant is Principal Solutions Architect at Cloudera working with various large and small companies in various Verticals on their big data and data science use cases,

More information

Streaming & Apache Storm

Streaming & Apache Storm Streaming & Apache Storm Recommended Text: Storm Applied Sean T. Allen, Matthew Jankowski, Peter Pathirana Manning 2010 VMware Inc. All rights reserved Big Data! Volume! Velocity Data flowing into the

More information

8/24/2017 Week 1-B Instructor: Sangmi Lee Pallickara

8/24/2017 Week 1-B Instructor: Sangmi Lee Pallickara Week 1-B-0 Week 1-B-1 CS535 BIG DATA FAQs Slides are available on the course web Wait list Term project topics PART 0. INTRODUCTION 2. DATA PROCESSING PARADIGMS FOR BIG DATA Sangmi Lee Pallickara Computer

More information

VOLTDB + HP VERTICA. page

VOLTDB + HP VERTICA. page VOLTDB + HP VERTICA ARCHITECTURE FOR FAST AND BIG DATA ARCHITECTURE FOR FAST + BIG DATA FAST DATA Fast Serve Analytics BIG DATA BI Reporting Fast Operational Database Streaming Analytics Columnar Analytics

More information



