A scalability comparison study of data management approaches for smart metering systems


1 A scalability comparison study of data management approaches for smart metering systems
Houssem Chihoub, Christine Collet, Grenoble INP
Journées Plateformes, Clermont-Ferrand, 6-7 October 2016
ICPP

2 Smartgrids & smart metering
- Huge investments in smartgrids
- Technological advances in smart metering & IoT
- 35 million Linky smart meters in France by 2021
- Data, a lot of data!


3 Data in smartgrids
Data in motion (out of scope)
- Events, alarms, signal alerts in the power grid
- Event streaming and processing
In one of their white papers, HP show how they manage, with their solution Vertica, data from 40+ M meters with measurements every 10 min and a total of 22.5 trillion measurements
Data at rest
- Collected meter data, metrics (sensors), weather data, client data, etc.
- At the scale of smart grids -> millions of meter readings generated per hour (35 million meters by 2021 in France)
How to store, manage and process these data (e.g. for analytics)?
Large-scale data management solutions today: a large number of models


4 Our goals
- Identification of processing types on smart meter data
- Comparison of large-scale data processing and management approaches for each type of processing
- Study of the scalability of these approaches
We need datasets, illustrative queries, storage space, and a cluster of commodity hardware.

5 Plan
- Context
- Data processing in the smartgrid
- Data management and processing systems
- Data generation
- Experimental setup
- Experimental evaluation


6 Data processing in the Smartgrid
- Smart meters and sensors data
- Temporal data
3 types of queries:
- Aggregation queries based on functions: count of measurements, sum of total consumption, etc.
- Selection and filtering queries: consumption filters, selection of data for a given time interval
- Bill computation queries: complex queries that consist of multiple sub-queries

7 Data management and processing systems
Postgres-XL
- Parallel RDBMS
- Master/Slave, MPP
- Versioning-based concurrency control
- ACID semantics
Hadoop
- MapReduce framework
- HDFS distributed file system
- Hive SQL query engine
Open source
Spark
- MapReduce-based model
- In-memory processing
- Acyclic graph execution engine
- Spark SQL query engine
Cassandra
- P2P architecture
- Consistent hashing
- Column family data model
- CQL query language
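
The consistent hashing item listed for the P2P, column-family system can be illustrated with a toy ring. This is only a sketch under stated assumptions: the `HashRing` class, the MD5-based `token` function, the node names and the 64 virtual nodes per server are all hypothetical, and none of it reflects Cassandra's actual partitioner.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a string to a ring position (MD5 is chosen only for illustration)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self._tokens = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each physical node owns several positions on the ring.
        for i in range(self.vnodes):
            t = token(f"{node}#{i}")
            self._owner[t] = node
            bisect.insort(self._tokens, t)

    def node_for(self, key: str) -> str:
        # First ring position clockwise from the key's token, wrapping around.
        idx = bisect.bisect_right(self._tokens, token(key)) % len(self._tokens)
        return self._owner[self._tokens[idx]]


meter_ids = [f"meter-{i}" for i in range(10_000)]
ring = HashRing(["node-1", "node-2", "node-3"])
before = {m: ring.node_for(m) for m in meter_ids}

ring.add_node("node-4")
after = {m: ring.node_for(m) for m in meter_ids}
moved = [m for m in meter_ids if before[m] != after[m]]
print(f"{len(moved)} of {len(meter_ids)} keys moved after adding node-4")
```

Only keys that fall just before one of the new node's ring positions change owner, which is why adding a server relocates a fraction of the data rather than all of it.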


8 Benchmarking: data generation
Investigation of the meter data generation approach of [1]
Approach
o Extract a temperature-independent profile from existing clients
o Generate measurement data for new clients from the profile data of existing clients and randomly selected weather data, while adding some noise
o One CSV file per client
Generated data
- 1.7 TB of data for 5 million meters over 1 year (2013)
- One measurement every hour
- A total of more than 43 billion measurements (experiments used data from only 4M meters)
[1] Benchmarking Smart Meter Data Analytics, X. Liu et al., EDBT
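
The generation approach above can be sketched roughly as follows. This is a minimal illustration, not the generator of [1]: the function name `generate_meter_rows`, the linear heating term, the 18 °C threshold, and the `temp_coeff` and `noise_sd` parameters are all assumptions made for the example.

```python
import csv
import io
import random
from datetime import datetime, timedelta


def generate_meter_rows(meter_id, base_profile, temperatures, start,
                        temp_coeff=0.05, noise_sd=0.1, seed=0):
    """Yield (meter_id, timestamp, kWh) rows, one per hourly temperature reading.

    base_profile: 24 hourly, temperature-independent consumption values (kWh)
    temperatures: hourly outdoor temperatures (deg C), one per generated row
    temp_coeff, noise_sd: hypothetical parameters, not taken from the study
    """
    rng = random.Random(seed)
    for hour_index, temp in enumerate(temperatures):
        ts = start + timedelta(hours=hour_index)
        # Assumed heating effect: extra load when colder than 18 deg C.
        heating = temp_coeff * max(0.0, 18.0 - temp)
        noise = rng.gauss(0.0, noise_sd)
        value = max(0.0, base_profile[ts.hour] + heating + noise)
        yield meter_id, ts.isoformat(), round(value, 3)


# Toy inputs: a made-up 24-hour profile and two days of made-up weather.
profile = [0.3] * 6 + [0.8] * 3 + [0.5] * 9 + [1.2] * 4 + [0.4] * 2
weather = [5.0 + (h % 24) / 4 for h in range(48)]
rows = list(generate_meter_rows(42, profile, weather, datetime(2013, 1, 1)))

buffer = io.StringIO()  # stands in for the per-client CSV file
csv.writer(buffer).writerows(rows)
```

Fixing the seed per meter keeps the generated files reproducible, which matters when the same dataset has to be reloaded into several systems.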

9 Experimental Setup

10 Experimental Setup
- 70 to 140 nodes
- Storage5k available
- RAM: 16 GB/node
- CPU: Intel Xeon L5420 & Intel Xeon X3440 (2.5 GHz, 4 cores/CPU)
- 298 GB HDD
More than 8000 cores
3 sets of experiments
- Increasing data size: from 0.55 M to 4 M meters
- Scale-out: from 70 to 140 nodes
- Data in memory: from 5 to 30 nodes and 10k meters (the whole initial dataset fits in memory)
Evaluation: response time, network traffic size


11 Infrastructure & tools
- G5K reservation time is limited and not in working hours
- OAR + Kadeploy
- FIFO -> weeks to get a (big) reservation
- Nancy site: Graphene and Griffon clusters: 16 GB of RAM
- Large number of nodes + available storage5k space
- Image based on Debian wheezy-prod + Kadeploy (bare metal)
o Postgres-XL-9.2
o Cloudera CDH5.2 for Hadoop including Hive-0.13 and Java 7
o Spark 1.5 and Java 8
o Apache Cassandra-2.2
o Spark-Cassandra connector

12 Data Loading
- Scripts (bash) load data concurrently for each storage solution, given its data model
- Number of clients is proportional to the number of datanodes
- Storage5K: big bottleneck when the number of clients increases
- With solutions such as Cassandra, data loading was faster than data fetching from storage5k (NFS)
- Postgres-XL is very slow to load data
[Figure: loading time (s) vs. number of meters (million); series: hdfs, cassandra]

13 Experimental Evaluation

14 Data processing in the Smartgrid: illustrative queries
Aggregation
- Q1: Sum of all measurements (consumption of all meters) for a 1-year period (2013)
- Q2: Sum of all measurements for a given range of meter ids (clients)
- Q3: Sum of all measurements for a 1-month period (March 2013)
Selection & filtering
- Q4: Selection of the first meter ids and their measurements over a 2-month time interval where the consumption exceeds a given threshold, sorted by consumption value
- Q5: Selection of meter ids and their measurements where the consumption exceeds a given threshold, sorted by consumption value
- Q6: Selection of measurements for a given list of meter ids over a 2-month period
Bill computation
- Q7: Compute the bill for a given client following the «tarif vert» ("green tariff") billing rules of EDF
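
To make the query types concrete, here is a small sketch of Q1-, Q2-, Q3- and Q5-style queries run with Python's built-in `sqlite3` on a toy table. The table name `measurements`, its columns (`meter_id`, `ts`, `kwh`), the sample rows and the threshold are assumptions for illustration; the slides do not give the schemas or query texts used in the study.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per hourly reading, ISO-8601 timestamps as text.
conn.execute("CREATE TABLE measurements (meter_id INTEGER, ts TEXT, kwh REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?)",
    [
        (1, "2013-02-28T23:00:00", 1.0),
        (1, "2013-03-01T00:00:00", 2.0),
        (2, "2013-03-15T12:00:00", 3.0),
        (3, "2013-03-31T23:00:00", 4.0),
        (3, "2013-04-01T00:00:00", 5.0),
    ],
)

# Q1-style: sum of all measurements over the year 2013.
q1 = conn.execute(
    "SELECT SUM(kwh) FROM measurements WHERE ts >= ? AND ts < ?",
    ("2013-01-01", "2014-01-01"),
).fetchone()[0]

# Q2-style: sum of all measurements for a range of meter ids.
q2 = conn.execute(
    "SELECT SUM(kwh) FROM measurements WHERE meter_id BETWEEN ? AND ?",
    (1, 2),
).fetchone()[0]

# Q3-style: sum of all measurements for March 2013.
q3 = conn.execute(
    "SELECT SUM(kwh) FROM measurements WHERE ts >= ? AND ts < ?",
    ("2013-03-01", "2013-04-01"),
).fetchone()[0]

# Q5-style: readings above a threshold, sorted by consumption value.
q5 = conn.execute(
    "SELECT meter_id, ts, kwh FROM measurements WHERE kwh > ? ORDER BY kwh DESC",
    (2.5,),
).fetchall()
```

Q1 and Q3 differ only in the width of the time filter, while Q2 filters on the meter id, so the three probe how each system handles key-based versus time-based predicates.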


15 Increasing Data Volume: Agg Queries
110 nodes
Datasets: 0.55M meters (4.82 B measurements), 1.5M meters (13.14 B measurements), 2.5M meters (21.9 B measurements), 4M meters (35.04 B measurements)
- Postgres-XL very efficient for aggregations
- Memory issues
- Intermediate phases in Spark -> data movement -> higher response time
- Spark/Cassandra performing better with filtering on time
[Figures: x-axis number of meters (million, up to 4.0); panels: Q1 (sum, all), Q2 (sum, meter id range), Q3 (sum, time interval)]
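
The measurement counts quoted for each dataset follow from one reading per meter per hour over 2013. A short check, assuming exactly 365 x 24 readings per meter (the function name `measurements` is just for this example):

```python
HOURS_PER_YEAR = 365 * 24  # 2013 is not a leap year: 8760 hourly readings


def measurements(meters: int) -> int:
    """Readings in one year, at one reading per meter per hour."""
    return meters * HOURS_PER_YEAR


for meters in (550_000, 1_500_000, 2_500_000, 4_000_000):
    print(f"{meters / 1e6:.2f} M meters -> "
          f"{measurements(meters) / 1e9:.2f} B measurements")
```

This reproduces the 4.82, 13.14, 21.9 and 35.04 billion figures, and the same formula gives 43.8 billion for the full 5 million generated meters.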

16 Increasing Data Volume: Other Queries
Selection & filtering queries; bill query
Legend: Postgres, Hadoop, Spark/Cassandra
Annotations:
- Very efficient
- Order by slowing it
- Cassandra/Spark impressive: < 1 s response time
[Figures: x-axis number of meters (million); panels: Q4 (meter ids, measurements, time interval, measurement threshold, order by), Q6 (measurements, meter ids input, time interval), Q7 (bill)]


17 Horizontal Scalability: Agg Queries
- Big experiment set: 500K meters (4.38 B measurements)
- Small experiment set: 10K meters (87.6 M measurements); data fits in memory (of the whole cluster)
- Postgres very good with aggregations
- Spark: memory problem
- Cassandra inefficient with filtering on keys
- Spark better with available memory
- Cassandra better with time filtering
[Figures: x-axis number of nodes; panels: Q2 (sum, meter id range) and Q3 (sum, time interval), each for the Big (up to 140 nodes) and Small experiment sets]

18 Horizontal Scalability: Bill Query
- Big experiment set: 500K meters (4.38 B measurements)
- Small experiment set: 10K meters (87.6 M measurements); data fits in memory (of the whole cluster)
- Cassandra/Spark impressive: < 1 s response time
- Postgres-XL deployment on 140 nodes unsuccessful
[Figures: Q7 (bill); x-axis number of nodes; panels: Big (up to 140 nodes), Small (up to 30 nodes)]

19 Data transfer
o Total of transferred data: sum of data sent by all nodes
o vnStat for monitoring all data transferred from a node
o Data loading for HDFS produces no data transfer from nodes
- Spark moves more data around (more intermediate phases)
- Postgres-XL moves less data -> fewer delays
[Figure: data transfer (GB) vs. number of nodes; series: cassandra-spark]
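
The transfer metric defined above is a plain sum over nodes. A minimal sketch, assuming the per-node sent-byte counters have already been collected into a dictionary (the node names and byte values below are invented, and how the counters are read from the monitoring tool is not shown):

```python
def total_transferred_gb(sent_bytes_by_node):
    """Total transferred data: sum of the bytes sent by every node, in GB."""
    return sum(sent_bytes_by_node.values()) / 1e9


# Invented per-node counters, for illustration only.
sent = {
    "node-1": 12_000_000_000,
    "node-2": 8_500_000_000,
    "node-3": 9_500_000_000,
}
total = total_transferred_gb(sent)
print(f"total transferred: {total:.1f} GB")
```

Counting only sent bytes avoids double counting, since every byte received by one node was sent by another.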

20 Conclusions
- Experimental evaluation of 4 systems (Postgres-XL, Hadoop, Spark, Spark/Cassandra) for meter data processing
- No best approach for every type of processing
o Postgres-XL is suited for aggregations, but its data loading is very slow
o Spark should have enough memory
o Spark + Cassandra is better suited for selection and filtering and bill queries
o Data loading is very fast in Cassandra and HDFS
- Large-scale data processing models should target the minimization of data transfer
- Towards a federated polyglot architecture
- Limited reservation time is a big problem for conducting Big Data experiments on G5K
