A scalability comparison study of data management approaches for smart metering systems
Houssem Chihoub, Christine Collet
Grenoble INP
houssem.chihoub@imag.fr
Journées Plateformes, Clermont-Ferrand, 6-7 October 2016 / ICPP 2016
Smartgrids & smart metering
Ø Huge investments in smartgrids
Ø Technological advances in smart metering & IoT: 35 million Linky smart meters in France by 2021
v Data, a lot of data!
Data in smartgrids
Data in motion (out of scope)
o Events, alarms, signal alerts in the power grid
o Event streaming and processing
o In one of their white papers, HP shows how their Vertica solution manages data from 40+ million meters, with measurements every 10 minutes and a total of 22.5 trillion measurements
Data at rest
o Collected meter data, metrics (sensors), weather data, client data, etc.
o At the scale of smart grids -> millions of meter readings generated per hour (35 million meters in France by 2021)
How to store, manage and process these data (e.g., for analytics)?
Large-scale data management solutions today: a large number of models
Our goals
o Identification of processing types on smart meter data
o Comparison of large-scale data processing and management approaches for each type of processing
o Study of the scalability of these approaches
We need datasets, illustrative queries, storage space, and a cluster of commodity hardware.
Plan
q Context
q Data processing in the smartgrid
q Data management and processing systems
q Data generation
q Experimental setup
q Experimental evaluation
Data processing in the Smartgrid
Ø Smart meters and sensors data
Ø Temporal data
3 types of queries:
² Aggregation queries based on functions: count of measurements, sum of total consumption, etc.
² Selection and filtering queries: consumption filters, selection of data for a given time interval
² Bill computation queries, which are complex queries consisting of multiple sub-queries
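The first two query types above can be sketched with plain SQL over a toy measurements table; this is a minimal illustration using SQLite, with an assumed schema (meter_id, ts, consumption), not the deployed data models of the evaluated systems.

```python
import sqlite3

# Toy in-memory schema standing in for the meter-data model:
# one row per (meter, timestamp) reading. Names are illustrative.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE measurements (meter_id INT, ts TEXT, consumption REAL)")
con.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?)",
    [(1, "2013-03-01T00:00", 1.2), (1, "2013-03-01T01:00", 0.8),
     (2, "2013-03-01T00:00", 2.5), (2, "2013-04-01T00:00", 3.0)],
)

# Aggregation query: sum of all consumption over a 1-month period
total = con.execute(
    "SELECT SUM(consumption) FROM measurements WHERE ts LIKE '2013-03%'"
).fetchone()[0]

# Selection & filtering query: readings above a threshold, sorted by value
peaks = con.execute(
    "SELECT meter_id, consumption FROM measurements "
    "WHERE consumption > 1.0 ORDER BY consumption"
).fetchall()
```

Bill computation queries combine several such sub-queries (per-period aggregations plus tariff rules), which is what makes them the most complex of the three types.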
Data management and processing systems
Postgres-XL
- Parallel RDBMS
- Master/slave, MPP
- Versioning-based concurrency control
- ACID semantics
Hadoop
- MapReduce framework
- HDFS distributed file system
- Hive SQL query engine
- Open source
Spark
- MapReduce-based model
- In-memory processing
- Acyclic graph execution engine
- Spark SQL query engine
Cassandra
- P2P architecture
- Consistent hashing
- Column family data model
- CQL query language
Benchmarking: data generation
Investigation of a meter data generation approach [1]
Approach
o Extract a temperature-independent profile from existing clients
o Generate measurement data for new clients from the profile data of existing clients and randomly selected weather data, while adding some noise
o One CSV file per client
Generated data
² 1.7 TB of data for 5 million meters over 1 year (2013)
² One measurement every hour
² A total of more than 43 billion measurements (only 4M meters' data were experimented on)
[1] Benchmarking Smart Meter Data Analytics, X. Liu et al., EDBT 2015
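The generation step above can be sketched as follows. The profile shape, temperature sensitivity, and noise level are illustrative assumptions, not the parameters of the approach in [1].

```python
import csv
import random

def generate_client(meter_id, profile, weather, noise_std=0.05,
                    temp_coeff=0.02, out_path=None):
    """Generate one year of hourly readings for a synthetic client.

    profile: 24 temperature-independent hourly base loads (kWh) taken
    from an existing client; weather: hourly temperatures for the year.
    noise_std and temp_coeff are assumed values for illustration.
    """
    rows = []
    for hour, temp in enumerate(weather):
        base = profile[hour % 24]
        # temperature-dependent heating load plus random noise
        kwh = max(0.0, base + temp_coeff * max(0.0, 18.0 - temp)
                  + random.gauss(0.0, noise_std))
        rows.append((meter_id, hour, round(kwh, 3)))
    if out_path:  # one CSV file per client, as in the approach above
        with open(out_path, "w", newline="") as f:
            csv.writer(f).writerows(rows)
    return rows

# quick check: a flat profile and constant mild weather
sample = generate_client(42, [1.0] * 24, [15.0] * 8760)
```

One year at one measurement per hour yields 8760 rows per meter, which at 5 million meters gives the 43+ billion measurements quoted above.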
Experimental Setup
Experimental Setup
² 70 to 140 nodes
² Storage5k available
² RAM: 16 GB/node
² CPU: Intel Xeon L5420 & Intel Xeon X3440 (2.5 GHz, 4 cores/CPU)
² 298 GB HDD
More than 8000 cores
3 sets of experiments
- Increasing data size: from 0.55M to 4M meters
- Scale-out: from 70 to 140 nodes
- Data in memory: from 5 to 30 nodes and 10k meters (the whole initial dataset fits in memory)
Evaluation: response time, network traffic size
Infrastructure & tools
G5K reservation time is limited and outside working hours
OAR + Kadeploy; FIFO -> weeks to get a (big) reservation
Nancy site: Graphene and Griffon clusters (16 GB of RAM); large number of nodes + available Storage5k space
Image based on Debian wheezy-prod + Kadeploy (bare metal)
o Postgres-XL 9.2
o Cloudera CDH 5.2 for Hadoop, including Hive 0.13 and Java 7
o Spark 1.5 and Java 8
o Apache Cassandra 2.2
o Spark-Cassandra connector 1.5
Data Loading
q Bash scripts to load data concurrently for each storage solution, given its data model
q Number of clients is proportional to the number of datanodes
q Data fetched from Storage5k
[Figure: loading time (s) vs. number of meters (0.5M to 4M) for Postgres-XL, HDFS, and Cassandra]
- Postgres-XL is very slow to load data: a big bottleneck when the number of clients increases
- With solutions such as Cassandra, data loading was faster than fetching the data from Storage5k (NFS)
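The concurrent loading scripts can be sketched as below, assuming one CSV file per client and a per-system load command; the command template and worker count are placeholders, not the study's actual bash scripts.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import subprocess
import tempfile

def load_all(csv_dir, load_cmd, workers=8):
    """Run one load command per client CSV file, `workers` at a time.

    load_cmd is a command prefix for the target store (e.g. a psql \\copy
    or cqlsh COPY invocation); the exact command is system-specific and
    assumed here. Returns the exit code of each load.
    """
    files = sorted(Path(csv_dir).glob("*.csv"))

    def load_one(f):
        return subprocess.run(load_cmd + [str(f)], check=True).returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(load_one, files))

# quick check with "true" as a no-op stand-in for a real loader command
demo_dir = Path(tempfile.mkdtemp())
for i in range(3):
    (demo_dir / f"client_{i}.csv").write_text("42,0,1.0\n")
codes = load_all(demo_dir, ["true"], workers=2)
```

In the actual experiments the bottleneck shifts with the system: the load command dominates for Postgres-XL, while for Cassandra and HDFS the NFS fetch from Storage5k dominates.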
Experimental Evaluation
Data processing in the Smartgrid: illustrative queries
Aggregation
Q1: Sum of all measurements (consumption of all meters) for a 1-year period (2013)
Q2: Sum of all measurements for a given range of meter ids (clients)
Q3: Sum of all measurements for a 1-month period (March 2013)
Selection & filtering
Q4: Selection of the first 20000 meter ids and their measurements over a 2-month time interval where consumption exceeds a given threshold, sorted by consumption value
Q5: Selection of meter ids and their measurements where consumption exceeds a given threshold, sorted by consumption value
Q6: Selection of measurements for a given list of meter ids over a 2-month period
Bill computation
Q7: Compute the bill for a given client following the "tarif vert" billing rules of EDF
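A Q7-style bill computation can be sketched as a per-period tariff applied over a client's readings. The peak/off-peak split and prices below are illustrative stand-ins; EDF's actual "tarif vert" rules are more involved and are not reproduced here.

```python
def compute_bill(readings, rates, subscription=0.0):
    """Toy bill computation in the spirit of Q7.

    readings: iterable of (hour_of_day, kwh); rates: mapping from a
    period name to (membership_test, price_per_kwh). Each reading is
    billed at the first period whose test matches its hour.
    """
    total = subscription
    for hour, kwh in readings:
        for _, (in_period, price) in rates.items():
            if in_period(hour):
                total += kwh * price
                break
    return round(total, 2)

# assumed two-period tariff (prices in EUR/kWh, purely illustrative)
rates = {
    "peak":     (lambda h: 7 <= h < 22, 0.20),
    "off_peak": (lambda h: not (7 <= h < 22), 0.12),
}
bill = compute_bill([(3, 2.0), (12, 1.5)], rates, subscription=10.0)
```

Expressed in SQL, this becomes several sub-queries (one per tariff period) joined and summed, which is why Q7 is the most complex query in the workload.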
Increasing Data Volume: Aggregation Queries
110 nodes; 0.55M meters (4.82B measurements), 1.5M meters (13.14B measurements), 2.5M meters (21.9B measurements), 4M meters (35.04B measurements)
[Figure: response time vs. number of meters (0.5M to 4M) for Q1 (sum, all), Q2 (sum, meter id range), and Q3 (sum, time interval)]
- Postgres-XL is very efficient for aggregations
- Spark has memory issues; intermediate phases in Spark -> data movement -> higher response time
- Spark/Cassandra performs better with filtering on time
Increasing Data Volume: Other Queries
[Figure: response time vs. number of meters (0.5M to 4M) for Q4 (meter ids, measurements, time interval, measurement threshold, order by), Q6 (measurements, meter ids input, time interval), and Q7 (Bill), comparing Postgres-XL, Hadoop, and Spark/Cassandra]
- Selection & filtering queries: Spark/Cassandra is very efficient; the ORDER BY slows it down
- Bill query: Cassandra/Spark is impressive, with sub-second (< 1 s) response time
Horizontal Scalability: Aggregation Queries
Big experiment set: 500K meters (4.38B measurements)
Small experiment set: 10K meters (87.6M measurements); the data fits in memory (of the whole cluster)
[Figure: response time vs. number of nodes (70 to 140 for the big set, 5 to 30 for the small set) for Q2 (sum, meter id range) and Q3 (sum, time interval)]
- Postgres-XL is very good with aggregations
- Spark has memory problems on the big set; Spark does better with available memory
- Cassandra is inefficient with filtering on keys, but better with time filtering
Horizontal Scalability: Bill Query
Big experiment set: 500K meters (4.38B measurements)
Small experiment set: 10K meters (87.6M measurements); the data fits in memory (of the whole cluster)
[Figure: response time vs. number of nodes (70 to 140 for the big set, 5 to 30 for the small set) for Q7 (Bill)]
- Cassandra/Spark is impressive, with sub-second (< 1 s) response time
- Postgres-XL deployment on 140 nodes was unsuccessful
Data transfer
o Total transferred data: sum of the data sent by all nodes
o vnstat used to monitor all data transferred from each node
o Data loading into HDFS produces no data transfer from the nodes
[Figure: transferred data (GB) vs. number of nodes (10 to 30) for Cassandra/Spark]
- Spark moves more data around (more intermediate phases)
- Postgres-XL moves less data -> fewer delays
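The metric above reduces to a simple aggregation over per-node counters; a minimal sketch, assuming the per-node sent-byte counters are read on each node (e.g. via vnstat) and collected centrally. Node names and values are illustrative.

```python
def cluster_tx_gb(sent_bytes_per_node):
    """Total transferred data: sum of bytes sent by all nodes, in GB.

    sent_bytes_per_node maps a node name to the cumulative tx byte
    counter read on that node over the experiment window.
    """
    return sum(sent_bytes_per_node.values()) / 1024 ** 3

total = cluster_tx_gb({"node-1": 2 * 1024 ** 3, "node-2": 1 * 1024 ** 3})
```

Counting only sent bytes avoids double-counting each transfer, since every byte received by one node was sent by another.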
Conclusions
Ø Experimental evaluation of 4 systems (Postgres-XL, Hadoop, Spark, Spark/Cassandra) for meter data processing
Ø No single best approach for every type of processing
o Postgres-XL is well suited for aggregations, but its data loading is very slow
o Spark should have enough memory
o Spark + Cassandra is better suited for selection, filtering, and bill queries
o Data loading is very fast in Cassandra and HDFS
Ø Large-scale data processing models should target the minimization of data transfer
Ø Towards a federated polyglot architecture
Ø Limited reservation time is a big problem for conducting Big Data experiments on G5K