Nowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype?

Similar documents
Based on Big Data: Hype or Hallelujah? by Elena Baralis

BIG DATA TESTING: A UNIFIED VIEW

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Hadoop, Yarn and Beyond

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Big Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018

CISC 7610 Lecture 2b The beginnings of NoSQL

Lecture 11 Hadoop & Spark

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Embedded Technosolutions

Big Data Architect.

BIG DATA TECHNOLOGIES: WHAT EVERY MANAGER NEEDS TO KNOW ANALYTICS AND FINANCIAL INNOVATION CONFERENCE JUNE 26-29,

RAMCloud. Scalable High-Performance Storage Entirely in DRAM. by John Ousterhout et al. Stanford University. presented by Slavik Derevyanko

Big Data with Hadoop Ecosystem

Big Data on AWS. Peter-Mark Verwoerd Solutions Architect

Introduction to MapReduce (cont.)

Introduction to Data Mining and Data Analytics

Online Bill Processing System for Public Sectors in Big Data

Big Data Analytics: What is Big Data? Stony Brook University CSE545, Fall 2016 the inaugural edition

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Webinar Series TMIP VISION

Challenges for Data Driven Systems

Big Data and Object Storage

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

Cisco Tetration Analytics Platform: A Dive into Blazing Fast Deep Storage

EXTRACT DATA IN LARGE DATABASE WITH HADOOP

BIG DATA AND HADOOP ON THE ZFS STORAGE APPLIANCE

Flash Storage Complementing a Data Lake for Real-Time Insight

Decentralized Distributed Storage System for Big Data

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou

<Insert Picture Here> Introduction to Big Data Technology

Big Data - Some Words BIG DATA 8/31/2017. Introduction

Hadoop/MapReduce Computing Paradigm

Warehouse- Scale Computing and the BDAS Stack

Big Data Hadoop Stack

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework

Hadoop and HDFS Overview. Madhu Ankam

Wearable Technology Orientation Using Big Data Analytics for Improving Quality of Human Life

Introduction to Big-Data

SpagoBI and Talend jointly support Big Data scenarios

Big Data The end of Data Warehousing?

BlueDBM: An Appliance for Big Data Analytics*

Big Data and IoT. Baris Aksanli 02/10/2016

Next-Generation Cloud Platform

Dealing with Data Especially Big Data

TCO REPORT. NAS File Tiering. Economic advantages of enterprise file management

Data Platforms and Pattern Mining

Hadoop An Overview. - Socrates CCDH

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Next-generation IT Platforms Delivering New Value through Accumulation and Utilization of Big Data

"Big Data... and Related Topics" John S. Erickson, Ph.D The Rensselaer IDEA Rensselaer Polytechnic Institute

Chapter 3. Foundations of Business Intelligence: Databases and Information Management

Big Data Analytics using Apache Hadoop and Spark with Scala

Processing of big data with Apache Spark

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.

The amount of data increases every day Some numbers ( 2012):

2/26/2017. The amount of data increases every day Some numbers ( 2012):

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed?

MixApart: Decoupled Analytics for Shared Storage Systems. Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp

microsoft

Introduction to Hadoop and MapReduce

DATA SCIENCE USING SPARK: AN INTRODUCTION

Big Trend in Business Intelligence: Data Mining over Big Data Web Transaction Data. Fall 2012

Big Data Infrastructures & Technologies

Page 1. Goals for Today" Background of Cloud Computing" Sources Driving Big Data" CS162 Operating Systems and Systems Programming Lecture 24

A Review Approach for Big Data and Hadoop Technology

Top 25 Big Data Interview Questions And Answers

Stages of Data Processing

MAPR DATA GOVERNANCE WITHOUT COMPROMISE

Real Time for Big Data: The Next Age of Data Management. Talksum, Inc. Talksum, Inc. 582 Market Street, Suite 1902, San Francisco, CA 94104

Chapter 6 VIDEO CASES

TOOLS FOR INTEGRATING BIG DATA IN CLOUD COMPUTING: A STATE OF ART SURVEY

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED DATA PLATFORM

Chase Wu New Jersey Institute of Technology

Data Analytics at Logitech Snowflake + Tableau = #Winning

High Performance Computing on MapReduce Programming Framework

10 Million Smart Meter Data with Apache HBase

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Kinetic Open Storage Platform: Enabling Break-through Economics in Scale-out Object Storage PRESENTATION TITLE GOES HERE Ali Fenn & James Hughes

New Challenges in Big Data: Technical Perspectives. Hwanjo Yu POSTECH

A Review Paper on Big data & Hadoop

Databases 2 (VU) ( / )

Cloud Computing Techniques for Big Data and Hadoop Implementation

Introduction to Big Data

Analytics in the cloud

BIG DATA & HADOOP: A Survey

Microsoft Exam

Lightweight Streaming-based Runtime for Cloud Computing. Shrideep Pallickara. Community Grids Lab, Indiana University

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems

Examples of Big Data analytics in ENEA: data sources and information extraction strategies

CompSci 516: Database Systems

TESTING BIG DATA WORLD RIGA. by Konstantin Pletenev OCTOBER, 2017, TAPOST GROW CONFIDENTLY

Using Alluxio to Improve the Performance and Consistency of HDFS Clusters

D B M G Data Base and Data Mining Group of Politecnico di Torino

A SURVEY ON SCHEDULING IN HADOOP FOR BIGDATA PROCESSING

Oracle GoldenGate for Big Data

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )

8/24/2017 Week 1-B Instructor: Sangmi Lee Pallickara

Transcription:

Big data hype? Big Data: Hype or Hallelujah? Data Base and Data Mining Group of 2 Google Flu trends On the Internet February 2010 detected flu outbreak two weeks ahead of CDC data Nowcasting http://www.internetlivestats.com/ 3 4 What is big data? What is big data? Many different definitions Data whose scale, diversity and complexity require new architectures, techniques, algorithms and analytics to manage it and extract value and hidden knowledge from it Many different definitions Data whose scale, diversity and complexity require new architectures, techniques, algorithms and analytics to manage it and extract value and hidden knowledge from it 5 6 1

What is big data? The 3Vs of big data Volume: scale of data Variety: different forms of data Velocity: analysis of streaming data Many different definitions Data whose scale, diversity and complexity require new architectures, techniques, algorithms and analytics to manage it and extract value and hidden knowledge from it but also Veracity: uncertainty of data Value: exploit information provided by data 7 8 Volume Data volume increases exponentially over time 44x increase from 2009 to 2020 Variety Various formats, types and structures Numerical data, image data, audio, video, text, time series Digital data 35 ZB in 2020 A single application may generate many different formats 9 10 Velocity Veracity Fast data generation rate Data quality Streaming data Very fast data processing to ensure timeliness 11 12 2

Who generates big data? Value Translate data into business advantage User Generated Content (Web & Mobile) E.g., Facebook, Instagram, Yelp, TripAdvisor, Twitter, YouTube Health and scientific computing 13 14 Who generates big data? Big data at work Log files Web server log files, machine syslog files Map data Crowdsourcing Computing Internet Of Things Sensor networks, RFID, smart meters Sensing 15 Real time traffic info 16 Big data challenges Technology & infrastructure New architectures, programing paradigms and techniques are needed Data management & analysis New emphasys on data Data science Generation Passive recording Typically structured data Bank trading transactions, shopping records, government sector archives Active generation Semistructured or unstructured data User-generated content, e.g., social networks Automatic production Location-aware, context-dependent, highly mobile data 17 Sensor-based Internet-enabled devices 18 3

Acquisition Storage Collection Storage infrastructure Pull-based, e.g., web crawler Storage technology, e.g., HDD, SSD Push-based, e.g., video surveillance, click stream Networking architecture, e.g., DAS, NAS, SAN Transmission Data management Transfer to data center over high capacity links Preprocessing Integration, cleaning, redundancy elimination File systems (HDFS), key-value stores (Memcached), column-oriented databases (Cassandra), document databases (MongoDB) Programming models Map reduce, stream processing, graph processing 19 20 Large scale data processing Analysis Objectives Descriptive analytics, predictive analytics, prescriptive analytics Methods Statistical analysis, data mining, text mining, network and graph data mining Clustering, classification and regression, association analysis Diverse domains call for customized techniques Traditional approach Database and data warehousing systems Well-defined structure Small enough data Big data Datasets not suitable for databases E.g. google crawls May need near real-time (streaming) analysis Different from data warehousing Different programming paradigm 21 22 Large scale data processing Large scale data processing Traditional computation is processor bound Traditional data storage Small dataset On large SANs Complex processing How to increase performance? New and faster processor More RAM Data transferred to processing nodes on demand at computing time Traditional distributed computing Multiple machines, single job Complex systems E.g., MPI Programmers need to manage data transfer synchronization, system failure, dependencies 23 24 4

The bottleneck The bottleneck Processors process data Hard drives store data We need to transfer data from the disk to the processor Hard drives evolution Storage capacity increased fast in recent decades E.g., from 1GB to 1TB The transfer rate increased less E.g., from 5MB/s to 100MB/s Transfer of disk content in memory Few years ago: 3.33 min. Now: 2.7 hours (if you have enough RAM) Problem: data transfer from disk to processors 25 26 The solution Transfer the processing power to the data Multiple distributed disks Each one holding a portion of a large dataset Process in parallel different file portions from different disks Issues Need to manage Hardware failures Network data transfer Data loss Consistency of results Joining data from different disks Scalability Managed by Hadoop 27 28 Apache Hadoop Hadoop scalable approach Open source project by the Apache Foundation Based on 2 Google papers Google File System (GFS), published in 2003 Map Reduce, published in 2004 Reliable storage and processing system based on YARN (Yet Another Resource Negotiator) Storage provided by HDFS Different processing models E.g., Map Reduce, Spark, Spark streaming, Hive, Giraph Data distibuted across nodes automatically When loaded into the system Processing executed on local data Whenever possible No need of data transfer to start the computation Data automatically replicated For availability and reliability Developers only focus on the logic of the problem to solve 29 30 5

Conclusions Certainly not just hype but not a panacea! 31 6