Real Time Processing. Karthik Ramasamy
Transcription
1 Real Time Processing Karthik Ramasamy
2 Information Age: Real-time is key
3 Real Time Connected World. Internet of Things: 30 B connected devices by 2020. Connected Vehicles: data transferred per vehicle per month, 4 MB -> 5 GB +. Health Care: 153 exabytes (2013) -> 2314 exabytes (2020). Digital Assistants (Predictive Analytics): $2B (2012) -> $6.5B (2019) [1]; Siri/Cortana/Google Now. Machine Data: 40% of digital universe by 2020. Augmented/Virtual Reality: $150B by 2020 [2]; Oculus/HoloLens/Magic Leap. [1] http:// [2] http://techcrunch.com/2015/04/06/augmented-and-virtual-reality-to-hit-150-billion-by-2020/#.7q0heh:oabw
4 Why Real Time?
Real time trends: emerging break out trends in Twitter (in the form of #hashtags).
Real time conversations: real time sports conversations related with a topic (recent goal or touchdown).
Real time recommendations: real time product recommendations based on your behavior & profile.
Real time search: real time search of tweets.
5 Value of Data: It's contextual. [Figure: value of data to decision-making plotted against time. Time axis: real-time, seconds, minutes, hours, days, months. Labels: time-critical decisions; preventive/predictive; actionable; reactive; historical; traditional "batch" business intelligence; information half-life in decision-making.] [1] Courtesy Michael Franklin, BIRTE, 2015.
6 What is Real-Time? It's contextual.
BATCH (> 1 hour, high throughput): adhoc queries, monthly active users, relevance for ads.
REAL TIME (10 ms - 1 sec, approximate): ad impressions count, hash tag trends.
OLTP (< 500 ms, latency sensitive): deterministic workflows, fanout Tweets, search for Tweets.
REAL REAL TIME (< 1 ms, low latency): financial trading.
7 Real Time Analytics. INTERACTIVE: store data and provide results instantly when a query is posed. STREAMING: analyze data as it is being produced.
8 Real Time Use Cases. Online Services (10s of ms): transaction log, queues, RPCs. Real Time (s of ms): change propagation, streaming analytics. Data for Batch Analytics (secs to mins): log aggregation, client events.
9 Real Time Stack Components: many moving parts. REAL TIME STACK: Data Collectors, Messaging, Storage, Compute.
10 Scribe. Open source log aggregation: originally from Facebook; Twitter made significant enhancements for real time event aggregation. High throughput and scale: delivers 125M messages/min and provides tight SLAs on data reliability. Runs on every machine: simple, very reliable, and uses memory and CPU efficiently.
11 Event Bus & Distributed Log: Next Generation Messaging
12 Twitter Messaging. [Diagram labels: Core Business Logic (tweets, fanouts, ...); Scribe, Deferred RPC, Gizzard, Database, Search; Kestrel (three instances), BookKeeper, MySQL, Kafka, HDFS.]
13 Kestrel Queue: message queue, multiple protocols, highly scalable, loosely ordered.
14 Kestrel Queue. [Diagram labels: P, C, zk.]
15 Kestrel Limitations. Durability is hard to achieve. Adding subscribers is expensive. Read-behind degrades performance: too many random I/Os. Scales poorly as the number of queues increases. Cross DC replication.
16 Kafka Queue: client state management, consumer scalability, high throughput, simple for operations.
17 Kafka Limitations. Relies on file system page cache. Performance degradation when subscribers fall behind: too much random I/O. Scaling to many topics. Loss of data. No centralized operational stats.
18 Rethinking Messaging. Unified stack: tradeoffs for various workloads. Durable writes, intra-cluster and geo-replication. Multi-tenancy. Scale resources independently: cost efficiency. Ease of manageability.
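19 Event Bus - Pub-Sub. [Diagram: Publisher -> Write Proxy -> Distributed Log -> Read Proxy -> Subscriber, with Metadata alongside; the whole is labeled Event Bus + Distributed Log.]
20 Distributed Log. [Diagram: Publisher -> Write Proxy -> BookKeeper (BK) nodes -> Read Proxy -> Subscriber; Distributed Log metadata kept in ZooKeeper (ZK).]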
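As a rough illustration of the pub-sub flow on the two slides above, the sketch below models a log as an append-only list with one read offset per subscriber. The names `ToyLog`, `publish`, `subscribe`, and `poll` are hypothetical, and nothing here reflects the actual write proxy, read proxy, or BookKeeper interfaces; it is an in-memory toy under those assumptions.

```python
class ToyLog:
    """In-memory stand-in for a durable, append-only log (illustrative only)."""

    def __init__(self):
        self._records = []   # append-only sequence of records
        self._offsets = {}   # subscriber name -> next position to read

    def publish(self, record):
        """Append a record and return its position in the log."""
        self._records.append(record)
        return len(self._records) - 1

    def subscribe(self, name, from_start=True):
        """Register a subscriber; in this toy it only costs an offset entry."""
        self._offsets[name] = 0 if from_start else len(self._records)

    def poll(self, name, max_records=10):
        """Return the next records for a subscriber and advance its offset."""
        start = self._offsets[name]
        batch = self._records[start:start + max_records]
        self._offsets[name] = start + len(batch)
        return batch


log = ToyLog()
log.publish("tweet-1")
log.publish("tweet-2")
log.subscribe("search-indexer")                  # reads from the beginning
log.subscribe("late-joiner", from_start=False)   # reads only new records
log.publish("tweet-3")
first = log.poll("search-indexer")
late = log.poll("late-joiner")
```

Because each subscriber only tracks its own offset into the shared log, a slow reader in this toy model does not hold up the others.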
21 Distributed. Manhattan key value store; durable deferred RPC; real time search indexing; pub-sub system; globally replicated log.
22 Distributed. 400 TB/day in; 2 trillion events/day processed; 20 PB/day out; 5-10 ms latency.
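For a sense of scale, the daily totals on this slide can be converted to per-second rates. The snippet assumes decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes) and a load spread evenly over the day; both are simplifying assumptions, not figures from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60          # 86,400

bytes_in_per_day = 400 * 10**12         # 400 TB/day in (decimal TB assumed)
bytes_out_per_day = 20 * 10**15         # 20 PB/day out (decimal PB assumed)
events_per_day = 2 * 10**12             # 2 trillion events/day

events_per_sec = events_per_day / SECONDS_PER_DAY
gb_in_per_sec = bytes_in_per_day / SECONDS_PER_DAY / 10**9
out_to_in_ratio = bytes_out_per_day / bytes_in_per_day

print(f"{events_per_sec:,.0f} events/s")
print(f"{gb_in_per_sec:.2f} GB/s in")
print(f"{out_to_in_ratio:.0f}x bytes out per byte in")
```

Under these assumptions the averages come to roughly 23 million events per second, about 4.6 GB/s written, and 50 bytes read for every byte written.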
23 Twitter Heron: Next Generation Streaming Engine
24 Twitter Heron: Better Storm. Container based architecture; separate monitoring and scheduling; simplified execution model; much better performance.
25 Twitter Heron Design: Goals. Fully API compatible with Storm: directed acyclic graph; topologies, spouts and bolts. Task isolation: ease of debug-ability/isolation/profiling. Use of mainstream languages: C++, Java and Python. Support for back pressure: topologies should be self-adjusting. Batching of tuples: amortizing the cost of transferring tuples. Efficiency: reduce resource consumption.
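The batching goal can be pictured with a small buffer that forwards tuples in groups, so the per-transfer overhead is paid once per batch rather than once per tuple. This is a generic sketch, not Heron's implementation; `TupleBatcher`, `batch_size`, and the `send` callback are hypothetical names.

```python
class TupleBatcher:
    """Buffer tuples and hand them to `send` in fixed-size batches."""

    def __init__(self, send, batch_size=3):
        self._send = send              # callable that transfers one batch
        self._batch_size = batch_size
        self._buffer = []

    def emit(self, tup):
        """Queue one tuple; flush when the buffer is full."""
        self._buffer.append(tup)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self):
        """Send whatever is buffered (no-op when the buffer is empty)."""
        if self._buffer:
            self._send(list(self._buffer))
            self._buffer.clear()


transfers = []                         # each entry is one simulated transfer
batcher = TupleBatcher(transfers.append, batch_size=3)
for i in range(7):
    batcher.emit(("word", i))
batcher.flush()                        # push the final partial batch
```

Seven tuples cross the simulated link in three transfers instead of seven. A real system would also flush on a timer so that a partial batch does not wait indefinitely.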
26 Twitter Heron: guaranteed message passing, horizontal scalability, robust fault tolerance, concise code (focus on logic).
27 Heron Terminology. Topology: directed acyclic graph; vertices = computation, and edges = streams of data tuples. Spouts: sources of data tuples for the topology; examples: Kafka/Kestrel/MySQL/Postgres. Bolts: process incoming tuples and emit outgoing tuples; examples: filtering/aggregation/join/any function.
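The terminology above maps onto a very small plain-Python model: a generator plays the spout, and functions that consume one stream and produce another play the bolts. This deliberately avoids the Heron and Storm APIs; the function names are hypothetical and the word-count task is only an example.

```python
from collections import Counter


def sentence_spout():
    """Spout: a source of data tuples (here, a fixed list of sentences)."""
    for sentence in ["real time is key", "heron is real time"]:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: consume sentence tuples and emit one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: aggregate word tuples into counts."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
    return counts


# Edges of the DAG are expressed by passing one stage's output to the next.
word_counts = count_bolt(split_bolt(sentence_spout()))
```

In a real topology each vertex would run as many parallel tasks, which is where the choice of stream grouping comes in.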
28 Heron Topology. [Diagram: a topology with Spout 1, Spout 2, and Bolts 1 through 5.]
29 Stream Groupings. Shuffle grouping: random distribution of tuples. Fields grouping: group tuples by a field or multiple fields. All grouping: replicates tuples to all tasks. Global grouping: send the entire stream to one task.
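The four groupings can be read as four routing rules that map a tuple to the downstream task or tasks that receive it. The sketch below is an assumption-laden illustration: the function names are hypothetical, tuples are plain dictionaries, and the CRC32 hash stands in for whatever hash a real engine uses.

```python
import random
import zlib


def shuffle_grouping(tup, num_tasks, rng=random):
    """Pick one task at random."""
    return [rng.randrange(num_tasks)]


def fields_grouping(tup, num_tasks, fields):
    """Route by a stable hash of the selected fields, so equal keys share a task."""
    key = "|".join(str(tup[f]) for f in fields).encode("utf-8")
    return [zlib.crc32(key) % num_tasks]


def all_grouping(tup, num_tasks):
    """Replicate the tuple to every task."""
    return list(range(num_tasks))


def global_grouping(tup, num_tasks):
    """Send the entire stream to a single task (task 0 is an arbitrary choice)."""
    return [0]


tweet = {"user": "alice", "hashtag": "#heron"}
same_key = {"user": "bob", "hashtag": "#heron"}
```

Fields grouping is the one to reach for when a bolt keeps per-key state such as counts per hashtag, because both tuples above land on the same task when grouped on `hashtag`.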
30 Heron Architecture: High Level. [Diagram: topology submission goes to the Scheduler, which runs Topology 1, Topology 2, ..., Topology N.]
31 Heron Architecture: Topology. [Diagram: the Topology Master keeps the logical plan, physical plan and execution state in a ZK cluster and syncs the physical plan to the Stream Managers; each of two containers holds a Stream Manager, a Metrics Manager, and instances I1 to I4.]
32 Heron Stream Manager: BackPressure. [Diagram: a topology with spout S1 and bolts B2, B3, B4.]
33 Stream Manager: BackPressure. [Diagram: four Stream Managers, each serving instances drawn from S1, B2, B3, B4.]
34 [Diagram: a Topology Master and four containers, each with a Stream Manager and spout and bolt instances drawn from S1, S2, B1 to B5.]
35 Heron Stream Manager: Spout BackPressure. [Diagram: four Stream Managers, each serving instances drawn from S1, B2, B3, B4.]
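The slides show spout back pressure only as diagrams, so the following is a minimal sketch of the general idea rather than Heron's mechanism: a bounded buffer sits between a fast spout and a slow bolt, and the spout is paused when the buffer fills and resumed once it drains. The high and low thresholds, the class name `BackPressureBuffer`, and its methods are all assumptions made for illustration.

```python
from collections import deque


class BackPressureBuffer:
    """Bounded buffer between a spout and a slow bolt (illustrative only)."""

    def __init__(self, high_water=4, low_water=2):
        self.high_water = high_water   # start back pressure at this depth
        self.low_water = low_water     # release back pressure at this depth
        self.queue = deque()
        self.spout_paused = False

    def offer(self, tup):
        """Called by the spout; returns False while the spout must hold off."""
        if self.spout_paused:
            return False
        self.queue.append(tup)
        if len(self.queue) >= self.high_water:
            self.spout_paused = True
        return True

    def take(self):
        """Called by the bolt; draining to the low mark resumes the spout."""
        tup = self.queue.popleft()
        if self.spout_paused and len(self.queue) <= self.low_water:
            self.spout_paused = False
        return tup


buf = BackPressureBuffer(high_water=4, low_water=2)
accepted = [buf.offer(i) for i in range(6)]   # spout outpaces the bolt
buf.take()
still_paused = buf.spout_paused               # depth 3 is above the low mark
buf.take()
resumed = not buf.spout_paused                # depth 2 reaches the low mark
```

Using two thresholds instead of one keeps the spout from flapping between paused and running when the buffer hovers near its limit.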
36 Heron Use Cases: real time ETL, real time BI, spam detection, real time trends, real time ML, real time ops.
37 Heron Sample Topologies
38 Heron has been in production for 3 years. Topologies range from 1 stage to 10 stages. 3x reduction in cores and memory.
39 Heron Performance: Settings. [Table: rows Spout, Bolt, # Heron containers; columns EXPT #1, EXPT #2, EXPT #3.]
40 Heron Performance: At Most Once. [Two plots comparing Heron (paper) and Heron (master): throughput (million tuples/min) and CPU usage (# cores used), each against spout parallelism.]
41 Heron Performance: CPU Usage. [Plot comparing Heron (paper) and Heron (master): million tuples/min against spout parallelism.]
42 > 400 real time jobs; 500 billion events/day processed; ms latency.
43 Stateful Processing in Heron: optimistic approaches, pessimistic approaches.
44 Tying Together
45 Lambda Architecture: combining batch and real time. [Diagram labels: New Data, Client.]
46 Lambda Architecture - The Good. [Diagram: Scribe (collection pipeline) -> Event Bus -> Heron (compute pipeline) -> Results.]
47 Lambda Architecture - The Bad. Have to write everything twice! Have to fix everything (may be ...). Subtle differences in semantics! How much duct tape required? What about graphs, ML, SQL, etc.?
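48 Summingbird to the Rescue. [Diagram: one Summingbird program runs as a Heron topology fed by a message broker, writing to an online key value result store, and as a Scalding/MapReduce job over HDFS, writing to a batch key value result store; the Client reads from both stores.]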
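The "write it once" idea can be sketched in a few lines: the same counting function runs over the batch data and over the recent stream, and the client merges the two key value stores at read time. Summingbird itself is a Scala library, so this Python version only mimics the shape of the approach; `count_hashtags`, `merged_view`, and the sample events are hypothetical.

```python
from collections import Counter


def count_hashtags(events):
    """Shared logic: written once, run by both the batch and real-time paths."""
    counts = Counter()
    for event in events:
        for token in event.split():
            if token.startswith("#"):
                counts[token] += 1
    return counts


def merged_view(batch_store, online_store):
    """Client-side read: counts are additive, so the stores merge by summation."""
    return batch_store + online_store


# Batch path: older events already landed in bulk storage.
batch_store = count_hashtags(["go #heron", "#storm days", "more #heron"])
# Real-time path: events that arrived after the last batch run.
online_store = count_hashtags(["new #heron release"])

totals = merged_view(batch_store, online_store)
```

The merge is only correct if the two paths cover disjoint sets of events and the aggregation is associative, which is why addition-style aggregates fit this pattern well.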
49 Curious to Learn More?
Twitter Heron: Stream Processing at Scale. Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel*, Karthik Ramasamy, @staneja. Twitter, Inc., *University of Wisconsin Madison.
Maosong Fu, Sailesh Mittal, Vikas Kedigehalli, Karthik Ramasamy, Michael Barry, Andrew Jorgensen, Christopher Kellogg, Neng Lu, Bill Graham, Jingwei Wu. Twitter, Inc.
Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel*, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, @challenger_nik. Twitter, Inc., *University of Wisconsin Madison.
50 Curious to Learn More?
51 Interested in Heron? Heron is open sourced. Contributions are welcome!
52 Interested in Distributed Log? Distributed Log is open sourced. Contributions are welcome!
53 We are: software engineers, product engineers, system engineers, UI engineers/UX designers. careers [at] streaml.io
54 Any questions? What, why, where, when, who, how.
55 Get in
56 Thanks for attending!
More informationFunctional Comparison and Performance Evaluation 毛玮王华峰张天伦 2016/9/10
Functional Comparison and Performance Evaluation 毛玮王华峰张天伦 2016/9/10 Overview Streaming Core MISC Performance Benchmark Choose your weapon! 2 Continuous Streaming Ack per Record Storm* Twitter Heron* Storage
More informationPaaS SAE Top3 SuperAPP
PaaS SAE Top3 SuperAPP PaaS SAE Top3 SuperAPP Pla$orm Services Group Sam Biwing Monika Rambone Skylee Kingho1d AWS S3 CDN ATS 1k 30+ 10+ Go FE Services Panel C++ Go C/C++ ACM FE Pla$orm Services Group
More informationWHY AND HOW TO LEVERAGE THE POWER AND SIMPLICITY OF SQL ON APACHE FLINK - FABIAN HUESKE, SOFTWARE ENGINEER
WHY AND HOW TO LEVERAGE THE POWER AND SIMPLICITY OF SQL ON APACHE FLINK - FABIAN HUESKE, SOFTWARE ENGINEER ABOUT ME Apache Flink PMC member & ASF member Contributing since day 1 at TU Berlin Focusing on
More informationData Storage Infrastructure at Facebook
Data Storage Infrastructure at Facebook Spring 2018 Cleveland State University CIS 601 Presentation Yi Dong Instructor: Dr. Chung Outline Strategy of data storage, processing, and log collection Data flow
More informationTools for Social Networking Infrastructures
Tools for Social Networking Infrastructures 1 Cassandra - a decentralised structured storage system Problem : Facebook Inbox Search hundreds of millions of users distributed infrastructure inbox changes
More informationDown the event-driven road: Experiences of integrating streaming into analytic data platforms
Down the event-driven road: Experiences of integrating streaming into analytic data platforms Dr. Dominik Benz, Head of Machine Learning Engineering, inovex GmbH Confluent Meetup Munich, 8.10.2018 Integrate
More informationBig data streaming: Choices for high availability and disaster recovery on Microsoft Azure. By Arnab Ganguly DataCAT
: Choices for high availability and disaster recovery on Microsoft Azure By Arnab Ganguly DataCAT March 2019 Contents Overview... 3 The challenge of a single-region architecture... 3 Configuration considerations...
More informationEvolution of an Apache Spark Architecture for Processing Game Data
Evolution of an Apache Spark Architecture for Processing Game Data Nick Afshartous WB Analytics Platform May 17 th 2017 May 17 th, 2017 About Me nafshartous@wbgames.com WB Analytics Core Platform Lead
More informationShen PingCAP 2017
Shen Li @ PingCAP About me Shen Li ( 申砾 ) Tech Lead of TiDB, VP of Engineering Netease / 360 / PingCAP Infrastructure software engineer WHY DO WE NEED A NEW DATABASE? Brief History Standalone RDBMS NoSQL
More informationCloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018
Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning
More informationCONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED DATA PLATFORM
CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED PLATFORM Executive Summary Financial institutions have implemented and continue to implement many disparate applications
More information