Benchmarking Distributed Stream Processing Platforms for IoT Applications

Similar documents
Data Analytics with HPC. Data Streaming

Sub-millisecond Stateful Stream Querying over Fast-evolving Linked Data

Tutorial: Apache Storm

Scalable Streaming Analytics

REAL-TIME ANALYTICS WITH APACHE STORM

Apache Kafka Your Event Stream Processing Solution

SnailTrail Generalizing Critical Paths for Online Analysis of Distributed Dataflows

New Oracle NoSQL Database APIs that Speed Insertion and Retrieval

Parallel Patterns for Window-based Stateful Operators on Data Streams: an Algorithmic Skeleton Approach

Data Stream Processing in the Cloud

JVM Performance Study Comparing Oracle HotSpot and Azul Zing Using Apache Cassandra

arxiv: v1 [cs.dc] 11 Sep 2017

TPCX-BB (BigBench) Big Data Analytics Benchmark

HiTune. Dataflow-Based Performance Analysis for Big Data Cloud

Embedded Technosolutions

Turbocharge your MySQL analytics with ElasticSearch. Guillaume Lefranc Data & Infrastructure Architect, Productsup GmbH Percona Live Europe 2017

Research Faculty Summit Systems Fueling future disruptions

High-Performance Event Processing Bridging the Gap between Low Latency and High Throughput Bernhard Seeger University of Marburg

Next-Generation Cloud Platform

WHY AND HOW TO LEVERAGE THE POWER AND SIMPLICITY OF SQL ON APACHE FLINK - FABIAN HUESKE, SOFTWARE ENGINEER

Unifying Big Data Workloads in Apache Spark

VOLTDB + HP VERTICA. page

Overview of Data Services and Streaming Data Solution with Azure

Crescando: Predictable Performance for Unpredictable Workloads

R-Storm: A Resource-Aware Scheduler for STORM. Mohammad Hosseini Boyang Peng Zhihao Hong Reza Farivar Roy Campbell

Flash Storage Complementing a Data Lake for Real-Time Insight

Architectural challenges for building a low latency, scalable multi-tenant data warehouse

Practical Big Data Processing An Overview of Apache Flink

E-Store: Fine-Grained Elastic Partitioning for Distributed Transaction Processing Systems

7-8 June 2018 Frankfurt Germany. Improving Business Performance Through Big Data Benchmarking

Flying Faster with Heron

microsoft

Putting it together. Data-Parallel Computation. Ex: Word count using partial aggregation. Big Data Processing. COS 418: Distributed Systems Lecture 21

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Remote Health Monitoring for an Embedded System

Big data streaming: Choices for high availability and disaster recovery on Microsoft Azure. By Arnab Ganguly DataCAT

What's New in MATLAB for Engineering Data Analytics?

Creating a Recommender System. An Elasticsearch & Apache Spark approach

Experiment-Driven Evaluation of Cloud-based Distributed Systems

Modern Data Warehouse The New Approach to Azure BI

Streaming iphone sensor data to SAS Event Stream Processing

Data Acquisition. The reference Big Data stack

System Support for Internet of Things

Copyright Push Technology Ltd December Diffusion TM 4.4 Performance Benchmarks

(incubating) Introduction. Swapnil Bawaskar.

TPC-E testing of Microsoft SQL Server 2016 on Dell EMC PowerEdge R830 Server and Dell EMC SC9000 Storage

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

DS504/CS586: Big Data Analytics Data Management Prof. Yanhua Li

Apache Hadoop 3. Balazs Gaspar Sales Engineer CEE & CIS Cloudera, Inc. All rights reserved.

Cloud Analytics and Business Intelligence on AWS

Oracle NoSQL Database and Cisco- Collaboration that produces results. 1 Copyright 2011, Oracle and/or its affiliates. All rights reserved.

Maximum Sustainable Throughput Prediction for Large-Scale Data Streaming Systems

SparkBench: A Comprehensive Spark Benchmarking Suite Characterizing In-memory Data Analytics

Accelerating Big Data: Using SanDisk SSDs for Apache HBase Workloads

Sparrow. Distributed Low-Latency Spark Scheduling. Kay Ousterhout, Patrick Wendell, Matei Zaharia, Ion Stoica

Randy Pagels Sr. Developer Technology Specialist DX US Team AZURE PRIMED

Correlative Analytic Methods in Large Scale Network Infrastructure Hariharan Krishnaswamy Senior Principal Engineer Dell EMC

Predicting impact of changes in application on SLAs: ETL application performance model

Apache Flink. Alessandro Margara

Performance and Scalability with Griddable.io

Understanding the latent value in all content

Fit for Purpose Platform Positioning and Performance Architecture

TrafficDB: HERE s High Performance Shared-Memory Data Store Ricardo Fernandes, Piotr Zaczkowski, Bernd Göttler, Conor Ettinoffe, and Anis Moussa

8/24/2017 Week 1-B Instructor: Sangmi Lee Pallickara

Streaming Auto-Scaling in Google Cloud Dataflow

Benchmarking Spark ML using BigBench. Sweta Singh TPCTC 2016

Hortonworks and The Internet of Things

An Oracle White Paper September Oracle Utilities Meter Data Management Demonstrates Extreme Performance on Oracle Exadata/Exalogic

exam. Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Version 1.0

Dynamics 365. for Finance and Operations, Enterprise edition (onpremises) system requirements

Grand Challenge: Scalable Stateful Stream Processing for Smart Grids

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Using Apache Beam for Batch, Streaming, and Everything in Between. Dan Halperin Apache Beam PMC Senior Software Engineer, Google

Viper: Communication-Layer Determinism and Scaling in Low-Latency Stream Processing

Oracle Exadata: Strategy and Roadmap

Nowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype?

A Generic Microservice Architecture for Environmental Data Management

vsan 6.6 Performance Improvements First Published On: Last Updated On:

Challenges for Data Driven Systems

Making the Most of Hadoop with Optimized Data Compression (and Boost Performance) Mark Cusack. Chief Architect RainStor

Stanislav Harvan Internet of Things

Distributed systems for stream processing

MOHA: Many-Task Computing Framework on Hadoop

17/05/2017. What we ll cover. Who is Greg? Why PaaS and SaaS? What we re not discussing: IaaS

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

A Preliminary Approach for Modeling Energy Efficiency for K-Means Clustering Applications in Data Centers

10 Million Smart Meter Data with Apache HBase

Grand Challenge: Scalable Stateful Stream Processing for Smart Grids

Empowering People with Knowledge the Next Frontier for Web Search. Wei-Ying Ma Assistant Managing Director Microsoft Research Asia

BIG DATA TESTING: A UNIFIED VIEW

Colin Cunningham, Intel Kumaran Siva, Intel Sandeep Mahajan, Oracle 03-Oct :45 p.m. - 5:30 p.m. Moscone West - Room 3020

Enable IoT Solutions using Azure

arxiv: v1 [cs.dc] 2 Dec 2017

DRIZZLE: FAST AND Adaptable STREAM PROCESSING AT SCALE

Cloudline Autonomous Driving Solutions. Accelerating insights through a new generation of Data and Analytics October, 2018

Data Transformation and Migration in Polystores

ProphetStor DiskProphet Ensures SLA for VMware vsan

STYX: Stream Processing with Trustworthy Cloud-based Execution

Fast and Easy Stream Processing with Hazelcast Jet. Gokhan Oner Hazelcast

Energy Management with AWS

Transcription:

DISTRIBUTED RESEARCH ON EMERGING APPLICATIONS & MACHINES dream-lab.in Indian Institute of Science, Bangalore DREAM:Lab Benchmarking Distributed Stream Processing Platforms for IoT Applications Anshu Shukla and Yogesh Simmhan Department of Computational & Data Sciences Indian Institute of Science Eighth TPC Technology Conference on Performance Evaluation & Benchmarking (TPCTC 2016) DREAM:Lab, 2016 This work is licensed under a Creative Commons Attribution 4.0 International License 5-Sep-16

Motivation: Internet of Things (IoT) Billions of sensors monitoring physical infrastructure, human beings and virtual entities in realtime, distributed across network Data streams drive realtime processing, analytics & decisions making Streaming applications control feedback on physical structure, notifications to users http://1.bp.blogspot.com/-n5jfgir1dae/vbpn0iiwyxi/aaaaaaaajws/d1_deumxf8c/s1600/smart-cities-features.png

Data Platforms for IoT MapReduce/Hadoop Large volume, high latency on distributed hosts Training models over sensor data archives Complex Event Processing Declarative pattern matching over event tuples Detect critical events, changing trends Distributed Stream Processing Systems (DSPS) Fast data rates, Weaker message order Distributed execution, fault-tolerant Flexible composition & responsive coordination of Observe, Orient, Decide, Act (OODA) loop Realtime Processing Mission critical apps, Strong latency guarantees Embedded medical devices, power networks * *Big Data Analytics Platforms for Real-time Applications in IoT, Yogesh Simmhan & Srinath Perera, Big Data Analytics: Methods and Applications, Eds. Saumyadipta Pyne, B.L.S. Prakasa Rao, S.B. Rao, 2016 (to appear)

DAG Distributed Stream Processing System [DSPS] Scalable, Distributed and fault-tolerant computations over data streams Composition of streaming applications Directed Acyclic Graph (DAG) of tasks and message streams Unbounded sequence of opaque messages as streams User-define logic blocks as tasks IoT System Task (lat,long) --> Zipcode Actuation Source (lat.,long.,taxiid,status) No.cars in CITY Sink Stream Booking Cancel > 3 times [ 10 mins ]

Need for a DSPS Benchmark for IoT Uniformly compare performance of DSPS for IoT applications Helps select DSPS platform for IoT domain Understanding gaps in capabilities of DSPS Composition of meaningful IoT applications Benchmark design goals Categories & platform-independent implementations of» Common IoT tasks» Classes of non-trivial IoT applications Different real-world IoT data streams

Gaps in related work IoTAbench: Emphasis on scalable synthetic data generation and not on application logic CityBench: RDF stream processing systems Stream Bench: 7 micro-benchmarks on 4 different synthetic workload suites, does not consider aspects unique to IoT applications and streams Spark Bench: Specific to Apache Spark, goal is to identify spark tuning parameters, not platform-agnostic Yahoo Cloud Serving Benchmark (YCSB): Compares different key-value stores on the Cloud Bigbench: Simulates online retail enterprise data with queries covering velocity, volume and variety of data

Contributions 1. Classification of essential tasks & IoT applications, and characteristics of their data sources 2. Propose IoT Benchmark for DSPS a. Micro-benchmark tasks implemented from the above classification b. Reference IoT applications implemented for Pre-processing & statistical summarisation (STATS) Predictive Analytics (PRED) 3. Practical evaluation on Apache Storm DSPS

Classification & Metrics IoT characteristics, DSPS capabilities, Performance Metrics 5-Sep-16 Eighth TPC Technology Conference on Performance Evaluation & Benchmarking (TPCTC 2016)

Dataflow Semantics Dataflow patterns determine consumption & routing of messages Selectivity ratio: Number of output messages emitted by a task on consuming a logical unit input message(s) Decide input rate for downstream tasks Number of tasks, Edge Degree, Length/ Width of Dataflow affect performance, scalability Length= critical path (latency) Selectivity 1:3 Width 3-Transform Data Parallelism 1-Source 2-FlatMap 4-Filter 6- Sink No. of Tasks - 6 [estimated number of cores required ] Width - 3 [Max task parallelism in the streaming data-flow] Length - [length of the critical path in terms of latency duration] 5-Aggregate

IoT Data stream Characteristics Input Throughput Rate at which messages enter the source tasks of the application dataflow Determines the parallelism of tasks in DSPS, ability to scale and meet latency goals Throughput Distribution Variation of input throughput over time. E.g. High demand for cabs in morning/evenings Dynamic load balancing, elasticity CITY dataset [1]: Environmental sensors at 7 cities across 3 continents; TAXI dataset [2]: Smart transportation messages from 2M trips of New York city taxis Impact of Message size, temporal ordering can also be considered [1] http://map.datacanvas.org/#!/data [2] http://www.debs2015.org/call-grand-challenge.html

IoT Streaming Application Classes Extract-Transform-Load (ETL) and Archival Pre-processing like data format transformations, normalizing observation units, interpolate missing data items Archive a copy of the data offline Summarization and Visualization Statistical aggregation and analytics to orient the decision Generate visualizations to present it to end-users and decision makers Prediction and Pattern Detection Prediction to determine the future state and decide if any reaction is required Identify the patterns of interest Classification and notification Mapping decisions to specific actions Notifying the entities in IoT system

Categories of IoT Tasks Parse - parsing encoded messages from the stream sources [XML, CBOR, JSON, SenML] Filter - filtering on specific attribute values [bloom, range filter] Statistical Analytics - finding outliers, second order moments, distinct count Predictive Analytics - predictions using past and current messages, online model training [WEKA library, R Packages] Pattern Detection - identify patterns across events using CEP engines Visual Analytics - periodic charts for streaming application as part of the dataflow [D3.js, JFreeChart] IO Operations - accessing external storage or messaging services to access data [file/cloud storage, database ops, pub-sub]

Evaluation Metrics Metrics to meet performance & scalability needs for IoT applications from DSPS Latency Time between message consumed at source task(s) and their causally dependent messages generated at sink task(s),lesser is better Decides task parallelism for required input rate Peak Rate Aggregated peak output message rate generated from sink task(s),higher is better Estimated resource requirement for required input rate Jitter Variation in observed vs. expected output throughput based on input rate and task selectivity,close to zero is better Stability of task for given rate with sustainable queuing of messages CPU and Memory Utilisation Uniform distribution of load on VMs, hotspots causing bottlenecks Identify which VMs are the potential bottlenecks/under-used, scale-out/in

Benchmarks 5-Sep-16 Eighth TPC Technology Conference on Performance Evaluation & Benchmarking (TPCTC 2016)

Datasets used CITY: Urban environmental sensing data from 90 sensors at 7 cities, sampled per minute Scaled to 1000x to simulate 90,000 sensors timestamp, source, long, lat, temperature, humidity, light, dust, airquality TAXI: Smart transportation messages from 2M trips from 20K taxis in NYC, one per trip Bi-modal rate distribution [morning & evening commutes]; 1000x scaling for temporal freq. Medallion, Hack_license, Pickup_datetime, Dropoff_datetime, Trip_time, Trip_distance, Pickup_long, Pickup_lat, Dropoff_long, Dropoff_lat, Payment_type, Fare_amount, Surcharge, Mta_tax, Tip_amount, Tolls_amount, Total_amount [1] http://map.datacanvas.org/#!/data [2] http://www.debs2015.org/call-grand-challenge.html

Benchmarks available at http://github.com/dream-lab/bm-iot Micro-benchmark Tasks

Benchmarks available at http://github.com/dream-lab/bm-iot Application Benchmarks [STATS] Streaming statistical analysis and filtering of data Data filtering of outliers on individual observation types using a Bloom filter Statistical analytics on observations from individual sensor/taxi IDs Approximate count of distinct readings for every observation type Publishing the results using MQTT task Different types of patterns, selectivities, task parallelisms» Application uses Hash,filter,Aggregate as dataflow patterns

Benchmarks available at http://github.com/dream-lab/bm-iot Application Benchmarks [PRED] Streaming predictive analysis of data Timer Source and blob read - for reading updated model files using timestamp Decision tree uses trained model to classify messages into classes [good,average or poor air quality ] Multi-Variate Linear regression & Average used for finding the error estimate Group and Viz. - Batch and plot the summarised results for different observations Blob write - Result files are written to Cloud storage for hosting on a portal» Application uses duplicate,join,aggregate as dataflow patterns» Multiple source of Input streams

Experiments Apache Storm 1.0.0 5-Sep-16 Eighth TPC Technology Conference on Performance Evaluation & Benchmarking (TPCTC 2016)

Experimental setup Apache Storm 1.0.0 DSPS Executed on Microsoft Azure Cloud VMs OpenJDK 1.7 and CentOS Micro-Benchmarks Benchmark task runs on one exclusive D1 VM, 1 thread» 1-core Intel Xeon E5@2.2 GHz, 3.5 GiB RAM, 50 GiB SSD» Used to finding peak rate Master services, source & sink tasks run on a D8 VM» 8-core Intel Xeon E5@2.2 GHz, 28 GiB RAM, 400 GiB SSD» Ensures spout is not bottleneck at peak rate Application Benchmarks [STATS & PRED] D8 VMs as per input rate for all tasks of the dataflow Experiments run for 10 mins 1000 scaling 7 days of data for CITY and TAXI Application runtime = 7 days*24 hours*60 mins 1000 * scaling = 10.08 mins

Results: Micro-benchmark Peak rate is 3,000 msg/sec or higher except XML parsing & Azure ops being CPU (310 msg/sec) and SLA (5 msg/sec) bound Wide variability in latency box plots Non-uniform task execution times, slow executions causes input tuples to queue up Higher input rates are more sensitive to even small per-tuple framework overhead

Results: Micro-benchmark Jitter: Close to zero values indicates stability of Storm at peak rate without unsustainable queuing CPU and MEM utilization Single-core VM used at 70% or above except Azure tasks being I/O driven High memory utilization for tasks supporting high throughput indicates messages waiting in memory queue

Results: Application benchmark PRED: Tall latency box plot due to variability in WEKA task execution times [observed in micro-benchmarks for DTC and MLR] Jitter remains close to zero, indicating sustainable performance

Results: Application benchmark TAXI : CPU is wider due to bimodal distribution of input rate X-axis represents the number of VMs required to support the given input distribution

Summary & Future Work Classified essential tasks, IoT applications & data stream characteristics Metrics for performance & scalability needs for IoT applications from DSPS Evaluation of Micro and application benchmarks on Apache storm 1.0.0 for CITY and TAXI datasets in distributed environment Indicates the viability of Apache Storm for IoT Application Domains Introducing new tasks to the task categories [SenML, CBOR parsing,annotation] Benchmark and comparison of Apache Spark Streaming with Storm Using CEP engine like Siddhi as Tasks in Pattern recognition category

DISTRIBUTED RESEARCH ON EMERGING APPLICATIONS & MACHINES DISTRIBUTED RESEARCH ON EMERGING APPLICATIONS & MACHINES dream-lab.in Thank you! Indian Institute of Science Contact : shukla@grads.cds.iisc.ac.in DREAM:Lab, 2014 This work is licensed under a Creative Commons Attribution 4.0 International License