A scalability comparison study of data management approaches for smart metering systems

Houssem Chihoub, Christine Collet
Grenoble INP
houssem.chihoub@imag.fr
Journées Plateformes, Clermont-Ferrand, 6-7 October 2016
ICPP 2016

Smart grids & smart metering
- Huge investments in smart grids
- Technological advances in smart metering & IoT
- 35 million Linky smart meters in France by 2021
- Data, a lot of data!

Data in smart grids
Data in motion (out of scope)
- Events, alarms, signal alerts in the power grid
- Event streaming and processing
- In one of their white papers, HP show how their Vertica solution manages data from 40+ million meters, with a measurement every 10 minutes and a total of 22.5 trillion measurements
Data at rest
- Collected meter data, metrics (sensors), weather data, client data, etc.
- At the scale of smart grids -> millions of meter readings generated per hour (35 million meters by 2021 in France)
How to store, manage and process these data (e.g. for analytics)?
Large-scale data management solutions today: a large number of models

Our goals
- Identification of processing types on smart meter data
- Comparison of large-scale data processing and management approaches for each type of processing
- Study of the scalability of these approaches
We need datasets, illustrative queries, storage space, and a cluster of commodity hardware.

Plan
- Context
- Data processing in the smart grid
- Data management and processing systems
- Data generation
- Experimental setup
- Experimental evaluation

Data processing in the smart grid
- Smart meter and sensor data
- Temporal data
3 types of queries:
- Aggregation queries based on functions: count of measurements, sum of total consumption, etc.
- Selection and filtering queries: consumption filters, selection of data for a given time interval
- Bill computation queries: complex queries that consist of multiple subqueries

Data management and processing systems (open source)
Postgres-XL
- Parallel RDBMS
- Master/slave, MPP
- Versioning-based concurrency control
- ACID semantics
Hadoop
- MapReduce framework
- HDFS distributed file system
- Hive SQL query engine
Spark
- MapReduce-based model
- In-memory processing
- Acyclic graph execution engine
- Spark SQL query engine
Cassandra
- P2P architecture
- Consistent hashing
- Column family data model
- CQL query language
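Consistent hashing, listed above for the P2P store, can be sketched in a few lines. This is a toy ring with one token per node and MD5 as the hash function; both choices, as well as the node and meter names, are illustrative assumptions and do not reproduce Cassandra's actual partitioner.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a string to a position on the ring (MD5 is an illustrative choice)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring: each node owns the arc that ends at its token."""

    def __init__(self, nodes):
        self._ring = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def node_for(self, key: str) -> str:
        # First node token >= key token, wrapping around at the end of the ring.
        i = bisect.bisect_left(self._tokens, token(key)) % len(self._ring)
        return self._ring[i][1]


nodes = [f"node-{i}" for i in range(4)]        # hypothetical node names
meters = [f"meter-{i}" for i in range(1000)]   # hypothetical meter ids

before = HashRing(nodes)
after = HashRing(nodes + ["node-4"])           # scale out by one node

# Keys whose owner changed after the scale-out.
moved = [m for m in meters if before.node_for(m) != after.node_for(m)]
```

Adding a node only reassigns the keys that fall on the new node's arc; every other key keeps its owner, which is why this scheme suits scale-out without a full reshuffle.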

Benchmarking: data generation
Investigation of a meter data generation approach [1]
Approach
- Extract a temperature-independent profile from existing clients
- Generate measurement data for new clients from the profile data of existing clients and randomly selected weather data, while adding some noise
- One CSV file per client
Generated data
- 1.7 TB of data for 5 million meters over 1 year (2013)
- One measurement per hour
- A total of more than 43 billion measurements (experiments used data from only 4M meters)
[1] X. Liu et al., Benchmarking Smart Meter Data Analytics, EDBT 2015
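The generation steps above can be sketched as follows. The additive model (daily profile plus a heating term plus Gaussian noise), the parameters `heating_coeff`, `t_ref` and `noise_sd`, the CSV column names and all sample values are assumptions made for illustration; the slides do not give the exact model used in [1].

```python
import csv
import io
import random

HOURS_PER_YEAR = 365 * 24  # one measurement per hour for 2013


def generate_client(profile, temperatures, heating_coeff=0.05, t_ref=18.0,
                    noise_sd=0.02, seed=0):
    """Return hourly consumption values for one synthetic client.

    profile      -- 24 temperature-independent values, one per hour of day
    temperatures -- hourly outdoor temperatures for the whole period
    The remaining parameters are illustrative assumptions.
    """
    rng = random.Random(seed)
    readings = []
    for hour, temp in enumerate(temperatures):
        base = profile[hour % 24]                          # daily profile
        weather = heating_coeff * max(0.0, t_ref - temp)   # colder -> more load
        noise = rng.gauss(0.0, noise_sd)
        readings.append(max(0.0, base + weather + noise))  # never negative
    return readings


def to_csv(meter_id, readings):
    """Serialize one client's readings as CSV text (one file per client)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["meter_id", "hour", "consumption_kwh"])
    for hour, value in enumerate(readings):
        writer.writerow([meter_id, hour, f"{value:.3f}"])
    return buf.getvalue()


profile = [0.3] * 7 + [0.6] * 12 + [0.9] * 5   # made-up 24-hour profile
weather_rng = random.Random(1)
temps = [weather_rng.uniform(-5.0, 30.0) for _ in range(HOURS_PER_YEAR)]

readings = generate_client(profile, temps)
csv_text = to_csv(42, readings)
```

In a real run the temperature series would come from recorded weather data and the profile from an existing client, as the slide describes; only the random draws here are synthetic.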

Experimental Setup

- 70 to 140 nodes
- Storage5k available
- RAM: 16 GB/node
- CPU: Intel Xeon L5420 & Intel Xeon X3440 (2.5 GHz, 4 cores/CPU)
- 298 GB HDD
More than 8000 cores
3 sets of experiments
- Increasing data size: from 0.55M to 4M meters
- Scale-out: from 70 to 140 nodes
- Data in memory: from 5 to 30 nodes and 10K meters (the whole initial dataset fits in memory)
Evaluation: response time, network traffic size

Infrastructure & tools
- G5K reservation time is limited and not in working hours
- OAR + Kadeploy
- FIFO -> weeks to get a (big) reservation
- Nancy site: Graphene and Griffon clusters, 16 GB of RAM
- Large number of nodes + available storage5k space
- Image based on debian wheezy-prod + Kadeploy (bare metal)
  - Postgres-XL 9.2
  - Cloudera CDH 5.2 for Hadoop, including Hive 0.13 and Java 7
  - Spark 1.5 and Java 8
  - Apache Cassandra 2.2
  - Spark-Cassandra connector 1.5

Data Loading
- Scripts (bash) to load data concurrently for each storage solution, given its data model
- Number of clients is proportional to the number of datanodes
- Storage5K
[Figure: loading time (s) vs. number of meters (millions) for HDFS and Cassandra]
- Postgres-XL very slow to load data
- Big bottleneck when the number of clients increases
- With solutions such as Cassandra, data loading was faster than data fetching from storage5k (NFS)

Experimental Evaluation

Data processing in the smart grid: illustrative queries
Aggregation
- Q1: Sum of all measurements (consumption of all meters) for a 1-year period (2013)
- Q2: Sum of all measurements for a given range of meter ids (clients)
- Q3: Sum of all measurements for a 1-month period (March 2013)
Selection & filtering
- Q4: Selection of the first 20000 meter ids and their measurements over a 2-month time interval where the consumption exceeds a given threshold, sorted by consumption value
- Q5: Selection of meter ids and their measurements where the consumption exceeds a given threshold, sorted by consumption value
- Q6: Selection of measurements for a given list of meter ids over a 2-month period
Bill computation
- Q7: Compute the bill for a given client following the «tarif vert» billing rules of EDF
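To make the query types concrete, the sketch below expresses Q1, Q2, Q3 and Q5 against a tiny in-memory table. SQLite is only a stand-in for the systems compared in the study, and the table name `measurements`, its columns, the sample rows and the threshold are assumptions; the slides do not show the actual schema or query text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per meter reading.
conn.execute(
    "CREATE TABLE measurements (meter_id INTEGER, ts TEXT, consumption REAL)"
)
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?)",
    [
        (1, "2013-01-15 00:00", 1.0),
        (1, "2013-03-10 00:00", 2.0),
        (2, "2013-03-20 00:00", 4.0),
        (2, "2013-07-01 00:00", 8.0),
        (3, "2013-03-05 00:00", 16.0),
    ],
)

# Q1: sum of all measurements for 2013.
q1 = conn.execute(
    "SELECT SUM(consumption) FROM measurements "
    "WHERE ts >= '2013-01-01' AND ts < '2014-01-01'"
).fetchone()[0]

# Q2: sum of all measurements for a range of meter ids.
q2 = conn.execute(
    "SELECT SUM(consumption) FROM measurements WHERE meter_id BETWEEN ? AND ?",
    (1, 2),
).fetchone()[0]

# Q3: sum of all measurements for March 2013.
q3 = conn.execute(
    "SELECT SUM(consumption) FROM measurements "
    "WHERE ts >= '2013-03-01' AND ts < '2013-04-01'"
).fetchone()[0]

# Q5: readings above a threshold, sorted by consumption value.
q5 = conn.execute(
    "SELECT meter_id, consumption FROM measurements "
    "WHERE consumption > ? ORDER BY consumption",
    (3.0,),
).fetchall()
```

Q2 filters on the meter id and Q3 on the timestamp, which is the distinction the scalability results draw between key filtering and time filtering.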

Increasing Data Volume: Aggregation Queries
110 nodes
0.55M meters (4.82 B measurements), 1.5M meters (13.14 B measurements), 2.5M meters (21.9 B measurements), 4M meters (35.04 B measurements)
[Figure panels: Q1 (sum, all), Q2 (sum, meter id range), Q3 (sum, time interval); x-axis: number of meters (millions), at 0.5, 1.5, 2.5 and 4.0]
- Postgres-XL very efficient for aggregations
- Memory issues
- Intermediate phases in Spark -> data movement -> higher response time
- Spark/Cassandra performing better with filtering on time
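The measurement counts quoted for each dataset size follow directly from one reading per hour over the 365 days of 2013. A short check, with helper names chosen here for illustration:

```python
HOURS_PER_YEAR = 365 * 24  # 2013 is not a leap year; one reading per hour


def measurements(meters: int) -> int:
    """Number of hourly readings produced by `meters` meters in one year."""
    return meters * HOURS_PER_YEAR


dataset_sizes = (550_000, 1_500_000, 2_500_000, 4_000_000)
counts = {m: measurements(m) for m in dataset_sizes}

for m, n in counts.items():
    print(f"{m / 1e6:.2f}M meters -> {n / 1e9:.2f} B measurements")
```

The same formula gives 4.38 B measurements for 500K meters and 87.6 M for 10K meters.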

Increasing Data Volume: Other Queries
Selection & filtering queries; bill query
[Figure panels: Q4 (meter ids, measurements, time interval, measurement threshold, order by), Q6 (measurements, meter ids input, time interval), Q7 (bill); x-axis: number of meters (millions), at 0.5, 1.5, 2.5 and 4.0; visible series labels: Postgres, Hadoop]
- Spark/Cassandra very efficient
- ORDER BY slows it down
- Cassandra/Spark impressive: < 1 s response time

Horizontal Scalability: Aggregation Queries
Big experiment set: 500K meters (4.38 B measurements)
Small experiment set: 10K meters (87.6 M measurements); the data fits in the memory of the whole cluster
[Figure panels: Q2 (sum, meter id range) and Q3 (sum, time interval), each for the big set (70, 85, 105, 140 nodes) and the small set (5, 10, 20, 30 nodes); x-axis: number of nodes]
- Postgres very good with aggregations
- Spark: memory problem
- Cassandra inefficient with filtering on keys
- Spark better with available memory
- Cassandra better with time filtering

Horizontal Scalability: Bill Query
Big experiment set: 500K meters (4.38 B measurements)
Small experiment set: 10K meters (87.6 M measurements); the data fits in the memory of the whole cluster
[Figure panels: Q7 (bill) for the big set (70, 85, 105, 140 nodes) and the small set (5, 10, 20, 30 nodes); x-axis: number of nodes]
- Cassandra/Spark impressive: < 1 s response time
- Postgres-XL deployment on 140 nodes unsuccessful

Data transfer
- Total transferred data: sum of the data sent by all nodes
- vnStat for monitoring all the data transferred from a node
- Data loading for HDFS produces no data transfer from nodes
[Figure: data transfer (GB) vs. number of nodes (10, 20, 30); visible series label: cassandra-spark]
- Spark moves more data around (more intermediate phases)
- Postgres-XL moves less data -> fewer delays

Conclusions
- Experimental evaluation of 4 systems (Postgres-XL, Hadoop, Spark, Spark/Cassandra) for meter data processing
- No single approach is best for every type of processing
  - Postgres-XL is suited for aggregations, but data loading is very slow
  - Spark needs enough memory
  - Spark + Cassandra is better suited for selection and filtering queries and for bill queries
  - Data loading is very fast in Cassandra and HDFS
- Large-scale data processing models should aim to minimize data transfer
- Towards a federated polyglot architecture
- Limited reservation time is a big problem for conducting Big Data experiments on G5K