"Big Data... and Related Topics" John S. Erickson, Ph.D The Rensselaer IDEA Rensselaer Polytechnic Institute

"Big Data... and Related Topics" John S. Erickson, Ph.D The Rensselaer IDEA Rensselaer Polytechnic Institute erickj4@rpi.edu @olyerickson

Director of Operations, The Rensselaer IDEA
Deputy Director, Rensselaer Web Science Research Center at the Tetherless World Constellation, RPI

Bridgewater, NH (12 Sep 2016)

Today...
1. What is "Big Data?"
2. Why is Big Data such a Big Deal?
3. How do we meet the challenges of Big Data?
4. What tools do we use to work with Big Data?
5. What is different about (really) Big Data Analytics?
6. What can you do to get into Big Data "today?"
7. How to get a job in Big Data?
8. To learn more...

What is Big Data?
- Typically you hear about size...and the data really IS big!
- ...but some say it's more about a way of thinking about the data
- Usually we talk about the "Four V's":
  - Volume: Handling the scale of the data
  - Velocity: Analyzing streaming data
  - Variety: Managing data in wildly different forms...and formats
  - Veracity: Uncertainty of the data (some people leave this out!)
- The ability of society to harness information in novel ways to produce useful insights or goods and services of significant value and things one can do at a large scale that cannot be done at a smaller one, to extract new insights or create new forms of value. [1]
[1] Viktor Mayer-Schönberger and Kenneth Cukier, Big Data: A Revolution That Will Transform How We Live, Work, and Think (2013)

Why is Big Data such a Big Deal?
It breaks everything...
- Volume: new storage architectures (physical and virtual)
- Velocity: new computational models (parallel, highly distributed)
- Variety: new approaches to extracting information, meaning, knowledge from any "data"
- Veracity: modelling to handle validity, errors, missingness at a Very Large Scale
[Images: NSA (UT), Google (GA), Apple (NC)]

How do we meet the challenges of Big Data?
- Massively parallel hardware (>100K CPUs)
- Highly distributed computational models (esp. MapReduce -> Hadoop)
- Highly distributed file systems (esp. Hadoop Distributed File System)
- New database models (e.g. NoSQL -> "Not only SQL")
[Images: Google (1996), Google (1998), Google (2016)]

How it all began...
- 2003, 2004: Google publishes key papers
- 2006: Hadoop emerges as open source project under Apache
- 2007: Yahoo runs 1000-machine cluster with Hadoop
- 2011: Yahoo reaches 42K Hadoop nodes, 100K CPUs, and hundreds of petabytes of storage
http://hadoop.apache.org/

What tools do we use to work with Big Data?
- Infrastructure...
- Analytics...
- Applications...
- Commercial vs. open source

What is different about (really) Big Data Analytics?
- Usually*, trying to do conventional analytics
- Small problems writ huge
- E.g. indexing terabytes of data very quickly
- Distributed computational model: MapReduce/Hadoop
- (Highly) distributed file system: Hadoop Distributed File System (HDFS)

Quick-and-dirty: MapReduce
- Split a problem into many sub-problems
- Assign sub-problems to many agents ("Map")
- Collect results ("Reduce")
Jonas Widriksson, "Raspberry PI Hadoop Cluster." (Oct 2014)
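The three steps above can be sketched in a few lines of plain Python. This is only an illustration of the idea on a single machine, not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample text are made up for this example, and a real cluster would run the map and reduce steps on many machines in parallel.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key (the step a framework does between map and reduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values collected for one key."""
    return key, sum(values)


def word_count(chunks):
    """Run map on every chunk, group by word, then reduce each group."""
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Each string stands in for a sub-problem handed to a separate agent.
chunks = ["big data is big", "data breaks everything"]
counts = word_count(chunks)
print(counts)
```

Word counting is the customary first example because each chunk can be mapped independently, which is what makes the problem easy to spread across many agents.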

Quick-and-dirty: Hadoop Distributed File System (HDFS)
- Designed to run on low-cost hardware; highly fault tolerant
- Files are split up into blocks that are replicated to DataNodes
- By default blocks are 64 MB (the Hadoop 1.x default; Hadoop 2.x and later default to 128 MB) and are replicated to 3 nodes in the cluster
Jonas Widriksson, "Raspberry PI Hadoop Cluster." (Oct 2014)
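To get a feel for the block and replication numbers, here is a rough back-of-the-envelope sketch in Python. It is not HDFS code: the function names, the node names (dn1 to dn4), and the simple round-robin placement are assumptions made for illustration, and real HDFS chooses replica locations with a rack-aware policy.

```python
import math

BLOCK_SIZE_MB = 64   # block size quoted on the slide
REPLICATION = 3      # replication factor quoted on the slide


def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks a file is split into (the last block may be partly full)."""
    return math.ceil(file_size_mb / block_size_mb)


def raw_storage_mb(file_size_mb, replication=REPLICATION):
    """Total cluster storage used once every block has been replicated."""
    return file_size_mb * replication


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical placement for illustration only; not the HDFS policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


nodes = ["dn1", "dn2", "dn3", "dn4"]   # made-up DataNode names
blocks = block_count(1000)             # a 1000 MB file
placement = place_replicas(blocks, nodes)

print(blocks, "blocks,", raw_storage_mb(1000), "MB of raw storage")
print("block 0 lives on", placement[0])
```

Because every block lives on three different nodes, losing any single node still leaves two copies of each block it held, which is where the fault tolerance comes from.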

What can you do to get into Big Data "today?"
- Everything you really need to manage and analyze Big Data is open source
- R and Python are fantastic entry points to data analytics
- Real Big Data requires hands-on experience with Hadoop and HDFS...
- R and Python (with NumPy and SciPy) are top data science languages
- "Big Data Analytics" is more than "R on Steroids"
- ...so find some machines and start playing!
- RStudio
- RHadoop: Integrating R and Hadoop!

Kaggle Competitions: Challenge yourself

Jonas Widriksson, "Raspberry PI Hadoop Cluster." (Oct 2014)

How to get a job in Big Data?
Data Analysis:
Data Warehousing:
- Familiarity with large data stores and new database models
- Relational DBs: MySQL, SQL Server, et al., ad infinitum...
- NoSQL: HDFS, HBase, CouchDB, MongoDB,...
Data Transformation:
- Machine learning, statistical analysis (R, MATLAB, Python)
- MapReduce -> Hadoop
- Data visualization
- ETL[1], scripting (Linux shells, Python, etc.)
Data Collection:
- Extracting data from existing databases via Web APIs, et al.
- Crawling, scraping the Web
[1] Extract, Transform, Load

To learn more...
- Follow the links in these slides...
- Download, install, and learn R and RStudio
- Play with the Hadoop framework
- Read Cringely's series, Thinking about Big Data (Parts 1-3)
- Listen to this week's TED Radio Hour, Big Data Revolution