A scalability comparison study of data management approaches for smart metering systems

Size: px
Start display at page:

Download "A scalability comparison study of data management approaches for smart metering systems"

Transcription

1 A scalability comparison study of data management approaches for smart metering systems Houssem Chihoub, Chris.ne Collet Grenoble INP Journées Plateformes Clermont Ferrand 6-7 octobre 2016 ICPP

2 Smartgrids & smart metering Ø Huge investments in smartgrids Ø Technological advances in smart metering & IoT 35 millions smart meters Linky in France by 2021 v Data, a lot of data!! 2

3 Data in smartgrids Data in moaon (out of the scope) Events, alarms, signal alerts in the power grid Event streaming and processing In one of their white papers, HP show how they manage with their solu7on Ver7ca 40+ M of meter data with measurements every 10 min and a total of 22,5 trillions measurements Data at rest Collected meters data, metrics (sensors), weather data, client data, etc At the scale of smart grids -> millions of generated meter data per hour (35 millions meters by 2021 in France) How to store, manage and process these data (ex. for analy.cs)? Large scale data management solu.ons today: large number of models 3

4 Our goals Iden.fica.on of processing types on smart meter data Comparison of large-scale data processing and management approaches for each type of processing Study of the scalability of these approaches We need datasets, illustra.ve queries, storage space, and cluster of commodity hardware infrastructure. 4

5 Plan q Context q Data processing in the smartgrid q Data management and processing systems q Data genera.on q Experimental setup q Experimental evalua.on 5

6 Data processing in the Smartgrid Ø Smart meters and sensors data Ø Temporal data 3 types of queries: ² AggregaAon queries based on func.ons: count of measurements, sum of total consump.ons etc. ² SelecAon and filtering queries: consump.on filters, selec.on of data for given.me interval ² Bill computaaon queries which are complex queries that consist of mul.ple sub queries. 6

7 Data management and processing systems - Parallel RDBMS - Master/Slave, MPP - Versioning-based concurrency control - ACID seman.cs - MapReduce framework - HDFS distributed file system - Hive SQL query engine Open Source - MapReduce based model - In-memory processing - Acyclic graph execu.on engine - Spark SQL query engine - P2P architecture - Consistent hashing - Column family data model - CQL query language 7

8 Benchmarking: data genera.on Inves.ga.on of meter data genera.on approach [1] Approach o Extract temperature-independent profile from exis.ng clients o Genera.on of measurements data for new clients from profile data of exis.ng clients and randomly selected weather data while adding some noise o A CSV file per client data Generated data ² 1.7 TB of 5 Millions meters data for 1 year (2013) ² A measurement every one hour ² A total of more than 43 billions measurements (only 4M meters data were experimented on) [1] Benchmarking Smart Meter Data Analy.cs, X Liu et al., EDBT

9 Experimental Setup 9

10 Experimental Setup ² 70 to 140 nodes ² Storage5k available ² RAM 16GB/node ² CPU: Intel Xeon L5420 & Intel Xeon X3440 (2.5GHz 4 cores/cpu) ² 298GB HDD More than 8000 cores 3 sets of experiments - Increasing data size: from 0.55 M to 4 M meters - Scale-out: from 70 to 140 nodes - Data in memory: from 5 to 30 nodes and 10k meters (all ini.al dataset can be fit in memory) EvaluaAon: Response.me, network traffic size 10

11 Infrastructure & tools G5K reserva.on.me is limited and not in working hours OAR + Kadeploy FIFO -> weeks to get a (big) reserva.on Nancy site: Graphene and Griffon clusters : 16GB of RAM Large number of nodes + available storage5k space Image based on debian wheezy-prod + Kadeploy (bare metal) o Postgres-XL-9.2 o Cloudera CDH5.2 for Hadoop including Hive-0.13 and Java 7 o Spark1.5 and java 8 o Apache Cassandra-2.2 o Spark-Cassandra connector

12 Data Loading q Scripts (bash) to load data concurrently for each storage solu.on and given the data model q Number of clients is propor.onal to datanodes q Storage5K loading time (s) Postgres-Xl very slow to load data hdfs cassandra Big bopleneck when number of clients increases with solu.ons such as Cassandra data loading was faster than data fetching from storage5k (NFS) meters number (million)

13 Experimental Evalua.on 13

14 Data processing in the Smartgrid Illustra.ve queries AggregaAon Q1: Sum of all measurements (consump.on of all meters) for 1-year period (2013) Q2: Sum of all measurements for a given range of meter ids (clients) Q3: Sum of all measurements for a 1-month period (march 2013) SelecAon & Filtering Q4: Selec.on of the first meter ids and their measurements over a 2-month.me interval and where the consump.on exceeds a given threshold, then sort the result by their consump.ons value. Q5: Selec.on of meter ids and their measurements where the consump.on exceeds a given threshold, then sort the result by their consump.ons value. Q6: Selec.on of measurements given the list of meter ids over a 2-months period of.me Bill computaaon Q7: Compute the bill for a given client following the «tarif vert» billing rules of EDF 14

15 Increasing Data Volume: Agg Queries 110 nodes 0.55M meters (4.82 B measurements), 1.5M meters (13.14 B measurements) 2.5M meters (21,9 B measurements), 4M meters (35.04 B measurements) Postgres-XL very efficient for aggrega.ons meters number (million) 4.0 Q1 (sum, all) Memory issues Intermediate phases in spark -> data movement -> higher response.me meters number (million) 4.0 Q2 (sum, meter id range) Q3 (sum, 7me interval) meters number (million) 4.0 Spark/cassandra performing beper with filtra.on on.me 15

16 Increasing Data Volume: Other Queries Selec.on & filtering queries Bill query Postgres Hadoop Spark/cass very efficient Order by slowing it meters number (million) Q4 (meter ids, measurements, 7me intervall, measurement threshold, order by) meters number (million) Q6 (measurements, meter ids input, 7me interval) Cassandra/Spark impressive < 1S response.me meters number (million) Q7 (Bill) 16

17 Horizontal Scalability: Agg Queries Big experiment set: 500K meters (4,38B measurements) Small experiment set: 10K meters (87.6M measurements) data can be fit in memory (of the whole cluster) Postgres very good with agg Spark Memory problem number of nodes Big 140 Cassandra inefficient with filtering on keys Spark beper with available memory Q2 (sum, meter id range) number of nodes Small Cassandra beper with.me filtering number of nodes number of nodes Big Q3 (sum, 7me intervall) Small 17

18 Horizontal Scalability: Bill Query Big experiment set: 500K meters (4,38B measurements) Small experiment set: 10K meters (87.6M measurements) data can be fit in memory (of the whole cluster) number of nodes Big 140 Cassandra/Spark impressive < 1S response.me number of nodes Small 30 Postgres-Xl deployment on 140 nodes unsuccessful Q7 (Bill) 18

19 Data transfer o o o Total of transferred data: sum of sent data by all nodes Vnstat for monitoring all the transferred data from a node Data loading for HDFS produces no data transfer from nodes data transfer (GB) cassandra-spark Spark moves more data around (more intermediate phases) Postgres-XL moves less data -> less delays number of nodes 19

20 Conclusions Ø Experimental evalua.on of 4 systems (Postgres-XL, Hadoop, Spark, Spark/Cassandra) for meter data processing Ø No best approach for every type of processing Postgres-XL is suited for aggrega.ons, data loading is very slow Spark should have enough memory Spark + Cassandra beper suited for selec.on and filtering and bill queries Data loading is very fast in Cassandra and HDFS Ø Large-scale data processing models should target the minimiza.on of data transfer Ø Towards federated polyglot architecture Ø Limited reserva.on.me is a big problem for conduc.ng Big Data experiments on G5K 20

A Scalability Comparison Study of Data Management Approaches for Smart Metering Systems

A Scalability Comparison Study of Data Management Approaches for Smart Metering Systems A Scalability Comparison Study of Data Management Approaches for Smart Metering Systems Houssem Chihoub, Christine Collet To cite this version: Houssem Chihoub, Christine Collet. A Scalability Comparison

More information

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?

More information

Backtesting with Spark

Backtesting with Spark Backtesting with Spark Patrick Angeles, Cloudera Sandy Ryza, Cloudera Rick Carlin, Intel Sheetal Parade, Intel 1 Traditional Grid Shared storage Storage and compute scale independently Bottleneck on I/O

More information

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop

More information

Big Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018

Big Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018 Big Data com Hadoop Impala, Hive e Spark VIII Sessão - SQL Bahia 03/03/2018 Diógenes Pires Connect with PASS Sign up for a free membership today at: pass.org #sqlpass Internet Live http://www.internetlivestats.com/

More information

10 Million Smart Meter Data with Apache HBase

10 Million Smart Meter Data with Apache HBase 10 Million Smart Meter Data with Apache HBase 5/31/2017 OSS Solution Center Hitachi, Ltd. Masahiro Ito OSS Summit Japan 2017 Who am I? Masahiro Ito ( 伊藤雅博 ) Software Engineer at Hitachi, Ltd. Focus on

More information

EsgynDB Enterprise 2.0 Platform Reference Architecture

EsgynDB Enterprise 2.0 Platform Reference Architecture EsgynDB Enterprise 2.0 Platform Reference Architecture This document outlines a Platform Reference Architecture for EsgynDB Enterprise, built on Apache Trafodion (Incubating) implementation with licensed

More information

Kubernetes for Stateful Workloads Benchmarks

Kubernetes for Stateful Workloads Benchmarks Kubernetes for Stateful Workloads Benchmarks Baremetal Like Performance for For Big Data, Databases And AI/ML Executive Summary Customers are actively evaluating stateful workloads for containerization

More information

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team Introduction to Hadoop Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since

More information

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Big Data Technology Ecosystem Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Agenda End-to-End Data Delivery Platform Ecosystem of Data Technologies Mapping an End-to-End Solution Case

More information

Big Data Architect.

Big Data Architect. Big Data Architect www.austech.edu.au WHAT IS BIG DATA ARCHITECT? A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional

More information

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed?

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed? Simple to start What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed? What is the maximum download speed you get? Simple computation

More information

Accelerate Big Data Insights

Accelerate Big Data Insights Accelerate Big Data Insights Executive Summary An abundance of information isn t always helpful when time is of the essence. In the world of big data, the ability to accelerate time-to-insight can not

More information

Flash Storage Complementing a Data Lake for Real-Time Insight

Flash Storage Complementing a Data Lake for Real-Time Insight Flash Storage Complementing a Data Lake for Real-Time Insight Dr. Sanhita Sarkar Global Director, Analytics Software Development August 7, 2018 Agenda 1 2 3 4 5 Delivering insight along the entire spectrum

More information

Nowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype?

Nowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype? Big data hype? Big Data: Hype or Hallelujah? Data Base and Data Mining Group of 2 Google Flu trends On the Internet February 2010 detected flu outbreak two weeks ahead of CDC data Nowcasting http://www.internetlivestats.com/

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

Next-Generation Cloud Platform

Next-Generation Cloud Platform Next-Generation Cloud Platform Jangwoo Kim Jun 24, 2013 E-mail: jangwoo@postech.ac.kr High Performance Computing Lab Department of Computer Science & Engineering Pohang University of Science and Technology

More information

Big Data Hadoop Stack

Big Data Hadoop Stack Big Data Hadoop Stack Lecture #1 Hadoop Beginnings What is Hadoop? Apache Hadoop is an open source software framework for storage and large scale processing of data-sets on clusters of commodity hardware

More information

Application of machine learning and big data technologies in OpenAIRE system

Application of machine learning and big data technologies in OpenAIRE system Application of machine learning and big data technologies in OpenAIRE system Warsztaty Orange z cyklu Centrum Badawczo Rozwojowe zaprasza Mateusz Kobos, ICM, Univeristy of Warsaw Warszawa, 2017-05-10 OpenAIRE

More information

Processing of big data with Apache Spark

Processing of big data with Apache Spark Processing of big data with Apache Spark JavaSkop 18 Aleksandar Donevski AGENDA What is Apache Spark? Spark vs Hadoop MapReduce Application Requirements Example Architecture Application Challenges 2 WHAT

More information

Big Data with Hadoop Ecosystem

Big Data with Hadoop Ecosystem Diógenes Pires Big Data with Hadoop Ecosystem Hands-on (HBase, MySql and Hive + Power BI) Internet Live http://www.internetlivestats.com/ Introduction Business Intelligence Business Intelligence Process

More information

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per

More information

OLTP on Hadoop: Reviewing the first Hadoop- based TPC- C benchmarks

OLTP on Hadoop: Reviewing the first Hadoop- based TPC- C benchmarks OLTP on Hadoop: Reviewing the first Hadoop- based TPC- C benchmarks Monte Zweben Co- Founder and Chief Execu6ve Officer John Leach Co- Founder and Chief Technology Officer September 30, 2015 The Tradi6onal

More information

Typical size of data you deal with on a daily basis

Typical size of data you deal with on a daily basis Typical size of data you deal with on a daily basis Processes More than 161 Petabytes of raw data a day https://aci.info/2014/07/12/the-dataexplosion-in-2014-minute-by-minuteinfographic/ On average, 1MB-2MB

More information

Oracle Big Data. A NA LYT ICS A ND MA NAG E MENT.

Oracle Big Data. A NA LYT ICS A ND MA NAG E MENT. Oracle Big Data. A NALYTICS A ND MANAG E MENT. Oracle Big Data: Redundância. Compatível com ecossistema Hadoop, HIVE, HBASE, SPARK. Integração com Cloudera Manager. Possibilidade de Utilização da Linguagem

More information

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info START DATE : TIMINGS : DURATION : TYPE OF BATCH : FEE : FACULTY NAME : LAB TIMINGS : PH NO: 9963799240, 040-40025423

More information

Data Analytics and Storage System (DASS) Mixing POSIX and Hadoop Architectures. 13 November 2016

Data Analytics and Storage System (DASS) Mixing POSIX and Hadoop Architectures. 13 November 2016 National Aeronautics and Space Administration Data Analytics and Storage System (DASS) Mixing POSIX and Hadoop Architectures 13 November 2016 Carrie Spear (carrie.e.spear@nasa.gov) HPC Architect/Contractor

More information

UNIFY DATA AT MEMORY SPEED. Haoyuan (HY) Li, Alluxio Inc. VAULT Conference 2017

UNIFY DATA AT MEMORY SPEED. Haoyuan (HY) Li, Alluxio Inc. VAULT Conference 2017 UNIFY DATA AT MEMORY SPEED Haoyuan (HY) Li, CEO @ Alluxio Inc. VAULT Conference 2017 March 2017 HISTORY Started at UC Berkeley AMPLab In Summer 2012 Originally named as Tachyon Rebranded to Alluxio in

More information

Splout SQL When Big Data Output is also Big Data

Splout SQL When Big Data Output is also Big Data Iván de Prado Alonso CEO of Datasalt www.datasalt.es @ivanprado @datasalt Splout SQL When Big Data Output is also Big Data Big Data consulting & training Full SQL * * Within each par??on Full SQL * Unlike

More information

MapReduce, Apache Hadoop

MapReduce, Apache Hadoop NDBI040: Big Data Management and NoSQL Databases hp://www.ksi.mff.cuni.cz/ svoboda/courses/2016-1-ndbi040/ Lecture 2 MapReduce, Apache Hadoop Marn Svoboda svoboda@ksi.mff.cuni.cz 11. 10. 2016 Charles University

More information

NoSQL Databases MongoDB vs Cassandra. Kenny Huynh, Andre Chik, Kevin Vu

NoSQL Databases MongoDB vs Cassandra. Kenny Huynh, Andre Chik, Kevin Vu NoSQL Databases MongoDB vs Cassandra Kenny Huynh, Andre Chik, Kevin Vu Introduction - Relational database model - Concept developed in 1970 - Inefficient - NoSQL - Concept introduced in 1980 - Related

More information

INDEX-BASED JOIN IN MAPREDUCE USING HADOOP MAPFILES

INDEX-BASED JOIN IN MAPREDUCE USING HADOOP MAPFILES Al-Badarneh et al. Special Issue Volume 2 Issue 1, pp. 200-213 Date of Publication: 19 th December, 2016 DOI-https://dx.doi.org/10.20319/mijst.2016.s21.200213 INDEX-BASED JOIN IN MAPREDUCE USING HADOOP

More information

Embedded Technosolutions

Embedded Technosolutions Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication

More information

Cloud Computing 3. CSCI 4850/5850 High-Performance Computing Spring 2018

Cloud Computing 3. CSCI 4850/5850 High-Performance Computing Spring 2018 Cloud Computing 3 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning

More information

Big Data on AWS. Peter-Mark Verwoerd Solutions Architect

Big Data on AWS. Peter-Mark Verwoerd Solutions Architect Big Data on AWS Peter-Mark Verwoerd Solutions Architect What to get out of this talk Non-technical: Big Data processing stages: ingest, store, process, visualize Hot vs. Cold data Low latency processing

More information

RIGHTNOW A C E

RIGHTNOW A C E RIGHTNOW A C E 2 0 1 4 2014 Aras 1 A C E 2 0 1 4 Scalability Test Projects Understanding the results 2014 Aras Overview Original Use Case Scalability vs Performance Scale to? Scaling the Database Server

More information

BIG DATA AND HADOOP ON THE ZFS STORAGE APPLIANCE

BIG DATA AND HADOOP ON THE ZFS STORAGE APPLIANCE BIG DATA AND HADOOP ON THE ZFS STORAGE APPLIANCE BRETT WENINGER, MANAGING DIRECTOR 10/21/2014 ADURANT APPROACH TO BIG DATA Align to Un/Semi-structured Data Instead of Big Scale out will become Big Greatest

More information

Chapter 5. The MapReduce Programming Model and Implementation

Chapter 5. The MapReduce Programming Model and Implementation Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing

More information

Informa)on Retrieval and Map- Reduce Implementa)ons. Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies

Informa)on Retrieval and Map- Reduce Implementa)ons. Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies Informa)on Retrieval and Map- Reduce Implementa)ons Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies mas4108@louisiana.edu Map-Reduce: Why? Need to process 100TB datasets On 1 node:

More information

Cisco and Cloudera Deliver WorldClass Solutions for Powering the Enterprise Data Hub alerts, etc. Organizations need the right technology and infrastr

Cisco and Cloudera Deliver WorldClass Solutions for Powering the Enterprise Data Hub alerts, etc. Organizations need the right technology and infrastr Solution Overview Cisco UCS Integrated Infrastructure for Big Data and Analytics with Cloudera Enterprise Bring faster performance and scalability for big data analytics. Highlights Proven platform for

More information

Presented by Nanditha Thinderu

Presented by Nanditha Thinderu Presented by Nanditha Thinderu Enterprise systems are highly distributed and heterogeneous which makes administration a complex task Application Performance Management tools developed to retrieve information

More information

Performance Evaluation of a MongoDB and Hadoop Platform for Scientific Data Analysis

Performance Evaluation of a MongoDB and Hadoop Platform for Scientific Data Analysis Performance Evaluation of a MongoDB and Hadoop Platform for Scientific Data Analysis Elif Dede, Madhusudhan Govindaraju Lavanya Ramakrishnan, Dan Gunter, Shane Canon Department of Computer Science, Binghamton

More information

Databases 2 (VU) ( / )

Databases 2 (VU) ( / ) Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:

More information

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018 Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning

More information

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( ) Guide: CIS 601 Graduate Seminar Presented By: Dr. Sunnie S. Chung Dhruv Patel (2652790) Kalpesh Sharma (2660576) Introduction Background Parallel Data Warehouse (PDW) Hive MongoDB Client-side Shared SQL

More information

Processing big data with modern applications: Hadoop as DWH backend at Pro7. Dr. Kathrin Spreyer Big data engineer

Processing big data with modern applications: Hadoop as DWH backend at Pro7. Dr. Kathrin Spreyer Big data engineer Processing big data with modern applications: Hadoop as DWH backend at Pro7 Dr. Kathrin Spreyer Big data engineer GridKa School Karlsruhe, 02.09.2014 Outline 1. Relational DWH 2. Data integration with

More information

Practice and Applications of Data Management CMPSCI 345. Lecture 18: Big Data, Hadoop, and MapReduce

Practice and Applications of Data Management CMPSCI 345. Lecture 18: Big Data, Hadoop, and MapReduce Practice and Applications of Data Management CMPSCI 345 Lecture 18: Big Data, Hadoop, and MapReduce Why Big Data, Hadoop, M-R? } What is the connec,on with the things we learned? } What about SQL? } What

More information

SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism

SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism Big Data and Hadoop with Azure HDInsight Andrew Brust Senior Director, Technical Product Marketing and Evangelism Datameer Level: Intermediate Meet Andrew Senior Director, Technical Product Marketing and

More information

INITIAL EVALUATION BIGSQL FOR HORTONWORKS (Homerun or merely a major bluff?)

INITIAL EVALUATION BIGSQL FOR HORTONWORKS (Homerun or merely a major bluff?) PER STRICKER, THOMAS KALB 07.02.2017, HEART OF TEXAS DB2 USER GROUP, AUSTIN 08.02.2017, DB2 FORUM USER GROUP, DALLAS INITIAL EVALUATION BIGSQL FOR HORTONWORKS (Homerun or merely a major bluff?) Copyright

More information

Cloudera Impala Headline Goes Here

Cloudera Impala Headline Goes Here Cloudera Impala Headline Goes Here JusAn Erickson Senior Product Manager Speaker Name or Subhead Goes Here February 2013 DO NOT USE PUBLICLY PRIOR TO 10/23/12 Agenda Intro to Impala Architectural Overview

More information

BIG DATA TESTING: A UNIFIED VIEW

BIG DATA TESTING: A UNIFIED VIEW http://core.ecu.edu/strg BIG DATA TESTING: A UNIFIED VIEW BY NAM THAI ECU, Computer Science Department, March 16, 2016 2/30 PRESENTATION CONTENT 1. Overview of Big Data A. 5 V s of Big Data B. Data generation

More information

Oracle Big Data Connectors

Oracle Big Data Connectors Oracle Big Data Connectors Oracle Big Data Connectors is a software suite that integrates processing in Apache Hadoop distributions with operations in Oracle Database. It enables the use of Hadoop to process

More information

A BigData Tour HDFS, Ceph and MapReduce

A BigData Tour HDFS, Ceph and MapReduce A BigData Tour HDFS, Ceph and MapReduce These slides are possible thanks to these sources Jonathan Drusi - SCInet Toronto Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing SICS; Yahoo!

More information

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Big Data Hadoop Developer Course Content Who is the target audience? Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Complete beginners who want to learn Big Data Hadoop Professionals

More information

August 23, 2017 Revision 0.3. Building IoT Applications with GridDB

August 23, 2017 Revision 0.3. Building IoT Applications with GridDB August 23, 2017 Revision 0.3 Building IoT Applications with GridDB Table of Contents Executive Summary... 2 Introduction... 2 Components of an IoT Application... 2 IoT Models... 3 Edge Computing... 4 Gateway

More information

microsoft

microsoft 70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series

More information

Performance and Scalability Overview

Performance and Scalability Overview Performance and Scalability Overview This guide provides an overview of some of the performance and scalability capabilities of the Pentaho Business Anlytics platform PENTAHO PERFORMANCE ENGINEERING TEAM

More information

Sempala. Interactive SPARQL Query Processing on Hadoop

Sempala. Interactive SPARQL Query Processing on Hadoop Sempala Interactive SPARQL Query Processing on Hadoop Alexander Schätzle, Martin Przyjaciel-Zablocki, Antony Neu, Georg Lausen University of Freiburg, Germany ISWC 2014 - Riva del Garda, Italy Motivation

More information

Big Data Infrastructure at Spotify

Big Data Infrastructure at Spotify Big Data Infrastructure at Spotify Wouter de Bie Team Lead Data Infrastructure September 26, 2013 2 Who am I? According to ZDNet: "The work they have done to improve the Apache Hive data warehouse system

More information

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context 1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes

More information

HDFS Federation. Sanjay Radia Founder and Hortonworks. Page 1

HDFS Federation. Sanjay Radia Founder and Hortonworks. Page 1 HDFS Federation Sanjay Radia Founder and Architect @ Hortonworks Page 1 About Me Apache Hadoop Committer and Member of Hadoop PMC Architect of core-hadoop @ Yahoo - Focusing on HDFS, MapReduce scheduler,

More information

Smart Meter Data Analytics using Hadoop

Smart Meter Data Analytics using Hadoop Smart Meter Data Analytics using Hadoop Ashu Bhardwaj 1, Dr. Williamjeet Singh 2 1 Research Scholar, Department of Computer Engineering, Punjabi University Patiala, (India) 2 Assistant Professor, Department

More information

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou The Hadoop Ecosystem EECS 4415 Big Data Systems Tilemachos Pechlivanoglou tipech@eecs.yorku.ca A lot of tools designed to work with Hadoop 2 HDFS, MapReduce Hadoop Distributed File System Core Hadoop component

More information

VOLTDB + HP VERTICA. page

VOLTDB + HP VERTICA. page VOLTDB + HP VERTICA ARCHITECTURE FOR FAST AND BIG DATA ARCHITECTURE FOR FAST + BIG DATA FAST DATA Fast Serve Analytics BIG DATA BI Reporting Fast Operational Database Streaming Analytics Columnar Analytics

More information

Parallelizing Multiple Group by Query in Shared-nothing Environment: A MapReduce Study Case

Parallelizing Multiple Group by Query in Shared-nothing Environment: A MapReduce Study Case 1 / 39 Parallelizing Multiple Group by Query in Shared-nothing Environment: A MapReduce Study Case PAN Jie 1 Yann LE BIANNIC 2 Frédéric MAGOULES 1 1 Ecole Centrale Paris-Applied Mathematics and Systems

More information

Blended Learning Outline: Cloudera Data Analyst Training (171219a)

Blended Learning Outline: Cloudera Data Analyst Training (171219a) Blended Learning Outline: Cloudera Data Analyst Training (171219a) Cloudera Univeristy s data analyst training course will teach you to apply traditional data analytics and business intelligence skills

More information

WHITEPAPER. Improve Hadoop Performance with Memblaze PBlaze SSD

WHITEPAPER. Improve Hadoop Performance with Memblaze PBlaze SSD Improve Hadoop Performance with Memblaze PBlaze SSD Improve Hadoop Performance with Memblaze PBlaze SSD Exclusive Summary We live in the data age. It s not easy to measure the total volume of data stored

More information

Resource and Performance Distribution Prediction for Large Scale Analytics Queries

Resource and Performance Distribution Prediction for Large Scale Analytics Queries Resource and Performance Distribution Prediction for Large Scale Analytics Queries Prof. Rajiv Ranjan, SMIEEE School of Computing Science, Newcastle University, UK Visiting Scientist, Data61, CSIRO, Australia

More information

MapReduce, Apache Hadoop

MapReduce, Apache Hadoop Czech Technical University in Prague, Faculty of Informaon Technology MIE-PDB: Advanced Database Systems hp://www.ksi.mff.cuni.cz/~svoboda/courses/2016-2-mie-pdb/ Lecture 12 MapReduce, Apache Hadoop Marn

More information

Overcoming the Barriers of Graphs on GPUs: Delivering Graph Analy;cs 100X Faster and 40X Cheaper

Overcoming the Barriers of Graphs on GPUs: Delivering Graph Analy;cs 100X Faster and 40X Cheaper Overcoming the Barriers of Graphs on GPUs: Delivering Graph Analy;cs 100X Faster and 40X Cheaper November 18, 2015 Super Compu3ng 2015 The Amount of Graph Data is Exploding! Billion+ Edges! 2 Graph Applications

More information

Expert Lecture plan proposal Hadoop& itsapplication

Expert Lecture plan proposal Hadoop& itsapplication Expert Lecture plan proposal Hadoop& itsapplication STARTING UP WITH BIG Introduction to BIG Data Use cases of Big Data The Big data core components Knowing the requirements, knowledge on Analyst job profile

More information

Improving the MapReduce Big Data Processing Framework

Improving the MapReduce Big Data Processing Framework Improving the MapReduce Big Data Processing Framework Gistau, Reza Akbarinia, Patrick Valduriez INRIA & LIRMM, Montpellier, France In collaboration with Divyakant Agrawal, UCSB Esther Pacitti, UM2, LIRMM

More information

Accelerating Digital Transformation with InterSystems IRIS and vsan

Accelerating Digital Transformation with InterSystems IRIS and vsan HCI2501BU Accelerating Digital Transformation with InterSystems IRIS and vsan Murray Oldfield, InterSystems Andreas Dieckow, InterSystems Christian Rauber, VMware #vmworld #HCI2501BU Disclaimer This presentation

More information

New Oracle NoSQL Database APIs that Speed Insertion and Retrieval

New Oracle NoSQL Database APIs that Speed Insertion and Retrieval New Oracle NoSQL Database APIs that Speed Insertion and Retrieval O R A C L E W H I T E P A P E R F E B R U A R Y 2 0 1 6 1 NEW ORACLE NoSQL DATABASE APIs that SPEED INSERTION AND RETRIEVAL Introduction

More information

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Cloudera s Developer Training for Apache Spark and Hadoop delivers the key concepts and expertise need to develop high-performance

More information

Processing 11 billions events a day with Spark. Alexander Krasheninnikov

Processing 11 billions events a day with Spark. Alexander Krasheninnikov Processing 11 billions events a day with Spark Alexander Krasheninnikov Badoo facts 46 languages 10M Photos added daily 320M registered users 190 countries 21M daily active users 3000+ servers 2 data-centers

More information

Submitted to: Dr. Sunnie Chung. Presented by: Sonal Deshmukh Jay Upadhyay

Submitted to: Dr. Sunnie Chung. Presented by: Sonal Deshmukh Jay Upadhyay Submitted to: Dr. Sunnie Chung Presented by: Sonal Deshmukh Jay Upadhyay Submitted to: Dr. Sunny Chung Presented by: Sonal Deshmukh Jay Upadhyay What is Apache Survey shows huge popularity spike for Apache

More information

Webinar Series TMIP VISION

Webinar Series TMIP VISION Webinar Series TMIP VISION TMIP provides technical support and promotes knowledge and information exchange in the transportation planning and modeling community. Today s Goals To Consider: Parallel Processing

More information

Big Data Syllabus. Understanding big data and Hadoop. Limitations and Solutions of existing Data Analytics Architecture

Big Data Syllabus. Understanding big data and Hadoop. Limitations and Solutions of existing Data Analytics Architecture Big Data Syllabus Hadoop YARN Setup Programming in YARN framework j Understanding big data and Hadoop Big Data Limitations and Solutions of existing Data Analytics Architecture Hadoop Features Hadoop Ecosystem

More information

Approaching the Petabyte Analytic Database: What I learned

Approaching the Petabyte Analytic Database: What I learned Disclaimer This document is for informational purposes only and is subject to change at any time without notice. The information in this document is proprietary to Actian and no part of this document may

More information

Cloud Computing at Yahoo! Thomas Kwan Director, Research Operations Yahoo! Labs

Cloud Computing at Yahoo! Thomas Kwan Director, Research Operations Yahoo! Labs Cloud Computing at Yahoo! Thomas Kwan Director, Research Operations Yahoo! Labs Overview Cloud Strategy Cloud Services Cloud Research Partnerships - 2 - Yahoo! Cloud Strategy 1. Optimizing for Yahoo-scale

More information

A Distributed Data- Parallel Execu3on Framework in the Kepler Scien3fic Workflow System

A Distributed Data- Parallel Execu3on Framework in the Kepler Scien3fic Workflow System A Distributed Data- Parallel Execu3on Framework in the Kepler Scien3fic Workflow System Ilkay Al(ntas and Daniel Crawl San Diego Supercomputer Center UC San Diego Jianwu Wang UMBC WorDS.sdsc.edu Computa3onal

More information

Spatial Analytics Built for Big Data Platforms

Spatial Analytics Built for Big Data Platforms Spatial Analytics Built for Big Platforms Roberto Infante Software Development Manager, Spatial and Graph 1 Copyright 2011, Oracle and/or its affiliates. All rights Global Digital Growth The Internet of

More information

SAP VORA 1.4 on AWS - MARKETPLACE EDITION FREQUENTLY ASKED QUESTIONS

SAP VORA 1.4 on AWS - MARKETPLACE EDITION FREQUENTLY ASKED QUESTIONS SAP VORA 1.4 on AWS - MARKETPLACE EDITION FREQUENTLY ASKED QUESTIONS 1. What is SAP Vora? SAP Vora is an in-memory, distributed computing solution that helps organizations uncover actionable business insights

More information

Oracle Big Data Fundamentals Ed 2

Oracle Big Data Fundamentals Ed 2 Oracle University Contact Us: 1.800.529.0165 Oracle Big Data Fundamentals Ed 2 Duration: 5 Days What you will learn In the Oracle Big Data Fundamentals course, you learn about big data, the technologies

More information

Stay Informed During and AEer OpenWorld

Stay Informed During and AEer OpenWorld Stay Informed During and AEer OpenWorld TwiIer: @OracleBigData, @OracleExadata, @Infrastructure Follow #CloudReady LinkedIn: Oracle IT Infrastructure Oracle Showcase Page Oracle Big Data Oracle Showcase

More information

Introduc)on to Apache Ka1a. Jun Rao Co- founder of Confluent

Introduc)on to Apache Ka1a. Jun Rao Co- founder of Confluent Introduc)on to Apache Ka1a Jun Rao Co- founder of Confluent Agenda Why people use Ka1a Technical overview of Ka1a What s coming What s Apache Ka1a Distributed, high throughput pub/sub system Ka1a Usage

More information

Hadoop. copyright 2011 Trainologic LTD

Hadoop. copyright 2011 Trainologic LTD Hadoop Hadoop is a framework for processing large amounts of data in a distributed manner. It can scale up to thousands of machines. It provides high-availability. Provides map-reduce functionality. Hides

More information

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training::

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training:: Module Title Duration : Cloudera Data Analyst Training : 4 days Overview Take your knowledge to the next level Cloudera University s four-day data analyst training course will teach you to apply traditional

More information

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES 1 THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES Vincent Garonne, Mario Lassnig, Martin Barisits, Thomas Beermann, Ralph Vigne, Cedric Serfon Vincent.Garonne@cern.ch ph-adp-ddm-lab@cern.ch XLDB

More information

Spark Over RDMA: Accelerate Big Data SC Asia 2018 Ido Shamay Mellanox Technologies

Spark Over RDMA: Accelerate Big Data SC Asia 2018 Ido Shamay Mellanox Technologies Spark Over RDMA: Accelerate Big Data SC Asia 2018 Ido Shamay 1 Apache Spark - Intro Spark within the Big Data ecosystem Data Sources Data Acquisition / ETL Data Storage Data Analysis / ML Serving 3 Apache

More information

Big Data Analytics using Apache Hadoop and Spark with Scala

Big Data Analytics using Apache Hadoop and Spark with Scala Big Data Analytics using Apache Hadoop and Spark with Scala Training Highlights : 80% of the training is with Practical Demo (On Custom Cloudera and Ubuntu Machines) 20% Theory Portion will be important

More information

Accelerating Big Data: Using SanDisk SSDs for Apache HBase Workloads

Accelerating Big Data: Using SanDisk SSDs for Apache HBase Workloads WHITE PAPER Accelerating Big Data: Using SanDisk SSDs for Apache HBase Workloads December 2014 Western Digital Technologies, Inc. 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents

More information

Importing and Exporting Data Between Hadoop and MySQL

Importing and Exporting Data Between Hadoop and MySQL Importing and Exporting Data Between Hadoop and MySQL + 1 About me Sarah Sproehnle Former MySQL instructor Joined Cloudera in March 2010 sarah@cloudera.com 2 What is Hadoop? An open-source framework for

More information

TPCX-BB (BigBench) Big Data Analytics Benchmark

TPCX-BB (BigBench) Big Data Analytics Benchmark TPCX-BB (BigBench) Big Data Analytics Benchmark Bhaskar D Gowda Senior Staff Engineer Analytics & AI Solutions Group Intel Corporation bhaskar.gowda@intel.com 1 Agenda Big Data Analytics & Benchmarks Industry

More information

Apache Hive for Oracle DBAs. Luís Marques

Apache Hive for Oracle DBAs. Luís Marques Apache Hive for Oracle DBAs Luís Marques About me Oracle ACE Alumnus Long time open source supporter Founder of Redglue (www.redglue.eu) works for @redgluept as Lead Data Architect @drune After this talk,

More information

High Performance Data Analytics for Numerical Simulations. Bruno Raffin DataMove

High Performance Data Analytics for Numerical Simulations. Bruno Raffin DataMove High Performance Data Analytics for Numerical Simulations Bruno Raffin DataMove bruno.raffin@inria.fr April 2016 About this Talk HPC for analyzing the results of large scale parallel numerical simulations

More information

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development:: Title Duration : Apache Spark Development : 4 days Overview Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized

More information

Enabling the Smart Grid through Big Data

Enabling the Smart Grid through Big Data Enabling the Smart Grid through Big Data Paul A. Navrá;l, Ph.D. Manager Scalable Visualiza;on Technologies Texas Advanced Compu;ng Center TACC Booth @ SC12 November 14, 2012 The Age of Big Data Records

More information

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem Zohar Elkayam www.realdbamagic.com Twitter: @realmgic Who am I? Zohar Elkayam, CTO at Brillix Programmer, DBA, team leader, database trainer,

More information