Sensor Data Collection and Processing

Applying Web Scale to Sensor Data

Today's Speaker
Josh Patterson
josh@cloudera.com / Twitter: @jpatanooga
Master's thesis: self-organizing mesh networks
  Published in IAAI-09: TinyTermite: A Secure Routing Algorithm
Conceived, built, and led Hadoop integration for the openPDC project at TVA (Smartgrid stuff)
Led a small team which designed classification techniques for time series and MapReduce
Open source work at http://openpdc.codeplex.com
Now: Solutions Architect at Cloudera

NERC Sensor Data Collection
openPDC PMU data collection, circa 2009
120 sensors
30 samples/second
4.3B samples/day
Housed in Hadoop
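
As a quick sanity check on the figures above, the sketch below works out the daily volume from the stated sensor count and sample rate. The slide does not say how many measured values each PMU reports per sample, so the ratio computed at the end is only an inference from the slide's 4.3B figure, not a number from the source.

```python
# Figures taken from the slide.
SENSORS = 120
SAMPLES_PER_SECOND = 30
STATED_SAMPLES_PER_DAY = 4.3e9

SECONDS_PER_DAY = 24 * 60 * 60

# One "frame" here means one report from one sensor at one instant.
frames_per_day = SENSORS * SAMPLES_PER_SECOND * SECONDS_PER_DAY

# Assumption: the 4.3B figure counts individual measured values, so this
# ratio is the implied number of values carried in each frame.
implied_values_per_frame = STATED_SAMPLES_PER_DAY / frames_per_day

print(f"frames per day: {frames_per_day:,}")
print(f"implied values per frame: {implied_values_per_frame:.1f}")
```

Under that reading, 120 sensors at 30 samples/second produce about 311 million frames per day, and the stated 4.3B samples/day would correspond to roughly 14 values per frame.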

Major Themes From openPDC
How much is coming in?
  Too much to make SAN storage cost effective!
  Planned for ½ petabyte of data storage
OK. So, then, where can this data live?
  Not at Amazon! Regulations, etc.
  Also: for fun, price ½ petabyte of storage at Amazon
Enter Hadoop
  Linear scaling storage in both space and cost
  Also had that handy MapReduce thing included

Apache Hadoop
Open source distributed storage and processing engine
Core components: MapReduce, Hadoop Distributed File System (HDFS)
Consolidates mixed data
  Move complex and relational data into a single repository
Stores inexpensively
  Keep raw data always available
  Use industry standard hardware
Processes at the source
  Eliminate ETL bottlenecks
  Mine data first, govern later

What Hadoop Does
Networks industry standard hardware nodes together to combine compute and storage into a scalable distributed system
Scales to petabytes without modification
Manages fault tolerance and data replication automatically
Processes semi-structured and unstructured data easily
Supports MapReduce natively to analyze data in parallel
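
To make the MapReduce idea concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases applied to sensor samples. It is plain Python, not the Hadoop API, and the record layout (`sensor_id,timestamp,value`) and sensor names are hypothetical, chosen only for illustration. On a real cluster the map and reduce functions would run in parallel on the nodes that hold the data.

```python
from collections import defaultdict


def map_sample(record):
    """Map: parse one 'sensor_id,timestamp,value' line into a (key, value) pair."""
    sensor_id, _timestamp, value = record.split(",")
    return sensor_id, float(value)


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_mean(sensor_id, values):
    """Reduce: collapse one sensor's values into their mean."""
    return sensor_id, sum(values) / len(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence over an iterable of records."""
    mapped = (map_sample(record) for record in records)
    grouped = shuffle(mapped)
    return dict(reduce_mean(key, values) for key, values in grouped.items())


# Hypothetical sample records, for illustration only.
records = [
    "pmu-1,0,59.98",
    "pmu-2,0,60.02",
    "pmu-1,1,60.02",
    "pmu-2,1,60.04",
]
averages = run_job(records)
for sensor_id, mean in sorted(averages.items()):
    print(f"{sensor_id}: {mean:.2f}")
```

Because each map call looks at one record and each reduce call looks at one key, both phases can be split across machines without changing the functions themselves.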

It's About More Than Just Collection
Scenario
  1 million sensors, each collecting one sample every 5 minutes
  5-year retention policy
  Storage needs of 15 TB
  Reliability and availability?
Processing
  Single machine: 15 TB takes 2.2 days to scan
  We'd like to do a lot more than simple scans!
  Hadoop @ 20 nodes: same task takes 11 minutes
  Also can use the parallel programming model, MapReduce
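
The scenario numbers can be cross-checked with a little arithmetic. The sketch below derives the sample count, the implied bytes per sample, and the scan throughput behind the two timings. It assumes decimal terabytes (10^12 bytes) and 365-day years, neither of which the slide specifies.

```python
TB = 10**12  # assumption: decimal terabytes

# Scenario figures from the slide.
sensors = 1_000_000
samples_per_day = 24 * 60 // 5          # one sample every 5 minutes
retention_days = 5 * 365                # assumption: 365-day years
storage_bytes = 15 * TB

total_samples = sensors * samples_per_day * retention_days
bytes_per_sample = storage_bytes / total_samples

# Scan timings from the slide.
single_scan_seconds = 2.2 * 24 * 60 * 60
cluster_scan_seconds = 11 * 60
cluster_nodes = 20

single_throughput_mb_s = storage_bytes / single_scan_seconds / 1e6
cluster_throughput_gb_s = storage_bytes / cluster_scan_seconds / 1e9
speedup = single_scan_seconds / cluster_scan_seconds

print(f"total samples:       {total_samples:,}")
print(f"bytes per sample:    {bytes_per_sample:.1f}")
print(f"single machine:      {single_throughput_mb_s:.0f} MB/s")
print(f"20-node cluster:     {cluster_throughput_gb_s:.1f} GB/s aggregate")
print(f"speedup:             {speedup:.0f}x ({speedup / cluster_nodes:.1f}x per node)")
```

The single-machine figure works out to roughly the sequential read rate of one disk, while the cluster speedup is larger than the node count. One plausible reading, not stated on the slide, is that each cluster node scans several disks in parallel.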

Unstructured Data Explosion
[Chart labels: (You); Complex, Unstructured; Relational]
2,500 exabytes of new information in 2012, with the Internet as primary driver
Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 zettabytes this year
Source: IDC White Paper, sponsored by EMC. As the Economy Contracts, the Digital Universe Expands. May 2009.

The Cloud
The Legend: Everything just works in the Cloud
The Myth: Cloud computing is a new technology
The Reality: Cloud computing is just more advanced network-based applications
Not all cloud services are equal; caveat emptor

Scientific American on Cloud Computing
Much of what makes cloud computing tick (internet, mobile computers, networked data storage, ...) has been available since the beginning of the dot-com era, more than a decade ago.
What is new, or at least more recent, is the greater variety of content that can be delivered online to a wider variety of gadgets.

As It Turns Out
The Cloud is just some place in Northern Virginia
Business Insider: Lessons Learned From AMZN Failure
  Amazon is not infallible, and the cloud is not magic.
  Amazon is not the only IaaS provider, and your application should be able to run on more than one.
  Cloud deployments must be automated and should take cloud server reliability characteristics into account.
Read more: http://www.businessinsider.com/learning-the-right-lessons-from-the-amazon-outage-2011-4#ixzz1l4gczcsu

Things to Think About
Can I really afford to be locked into a proprietary cloud technology long term?
  Open source is coming of age in the enterprise
The market for data analysis is exploding
  Can I use my technology to process this data at scale, and process said data fast?
Reliable storage as a serious cost consideration
  What's a terabyte cost on this platform?
  What's a petabyte cost on this platform?
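
One way to approach the cost-per-terabyte question is a small calculator that separates usable capacity from raw capacity. The sketch below assumes replicated storage with a factor of 3, which is the HDFS default; the price used in the example is a hypothetical placeholder, not a quote for any platform, and the function names are illustrative only.

```python
def raw_capacity_tb(usable_tb, replication=3):
    """Raw disk needed to hold `usable_tb` of data at the given replication factor."""
    return usable_tb * replication


def storage_cost(usable_tb, price_per_raw_tb, replication=3):
    """Total cost of storing `usable_tb` of data, priced per raw terabyte."""
    return raw_capacity_tb(usable_tb, replication) * price_per_raw_tb


# Hypothetical placeholder price; substitute a real figure for the platform
# being evaluated.
PLACEHOLDER_PRICE_PER_RAW_TB = 100.0

for label, usable_tb in [("1 TB", 1), ("1/2 PB", 500), ("1 PB", 1000)]:
    cost = storage_cost(usable_tb, PLACEHOLDER_PRICE_PER_RAW_TB)
    print(f"{label:>6}: {raw_capacity_tb(usable_tb):>5} TB raw, cost {cost:,.0f}")
```

Keeping replication as an explicit parameter makes it easy to compare a replicated cluster against a platform that prices usable capacity directly (set `replication=1`).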

Hadoop Adoption

Takeaways
Not all data can go into the cloud
  Smartgrid data is sensitive, needs private cloud
Caveat emptor
  You can't just move everything to the cloud
  Not all cloud tech is of the same reliability
Consider speed at scale as the killer app
  Cost at scale, cost of lock-in

Questions?
Cloudera's Distribution including Apache Hadoop (CDH): http://www.cloudera.com
Resources
  http://www.slideshare.net/cloudera/hadoop-as-the-platform-for-thesmartgrid-at-tva
  http://www.tva.gov/news/releases/octdec09/data_collection_software.htm
  http://gigaom.com/cleantech/the-google-android-of-the-smart-grid-openpdc/
  http://news.cnet.com/8301-13846_3-10393259-62.html
  http://gigaom.com/cleantech/how-to-use-open-source-hadoop-for-the-smartgrid/
  http://openpdc.codeplex.com/
Timeseries blog article
  http://www.cloudera.com/blog/2011/03/simple-moving-average-secondarysort-and-mapreduce-part-1/