Big Data with R and Hadoop

Jamie F. Olson. June 11, 2015

R and Hadoop
Review of various tools for leveraging Hadoop from R: MapReduce, Spark, Hive/Impala, and Revolution R.

Scaling R to Big Data
R has scalability issues: performance and memory.

Scaling R to Big Data
R has scalability issues: performance? memory?

R Performance Limits
R performance bottlenecks are largely gone:
- Memory model tweaks
- Just-in-time compiler
- Highly performant data manipulation tools (e.g., dplyr, data.table)
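To make the data-manipulation point concrete, a minimal data.table sketch using the built-in mtcars data (the grouping and summary columns are just illustrative):

    library(data.table)

    # Convert a data.frame to a data.table (reference semantics, fast indexing)
    dt <- as.data.table(mtcars)

    # Grouped aggregation in one vectorized expression
    dt[, .(mean_mpg = mean(mpg), n = .N), by = cyl]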

R Memory Limits
Two choices for dealing with memory issues:
- Native R solutions: ff, bigmemory
- Leverage external tools: e.g., Hadoop, an RDBMS
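For the native-R route, a minimal bigmemory sketch, assuming a file-backed matrix is acceptable (the dimensions and file names are illustrative):

    library(bigmemory)

    # File-backed matrix: the data lives on disk, outside R's heap
    x <- big.matrix(nrow = 1e7, ncol = 3, type = "double",
                    backingfile = "x.bin", descriptorfile = "x.desc")

    # Indexed access works much like an ordinary matrix;
    # only the touched elements are pulled into memory
    x[1:5, 1] <- rnorm(5)
    mean(x[, 1])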

Value of Leaving R
- Purpose-built
- Highly engineered
- Better scalability

Cost of Context-Switching
External tools rarely share R's core concepts and features:
- Vectorization
- Functional programming

Choosing External Tools
Does the value of the tool justify the increased development/conceptual cost?

Outline
1. MapReduce
2. Spark
3. Hadoop Databases
4. Revolution R ScaleR
5. Concluding

The Original Hadoop: MapReduce

Map
[Figure: Apply the same computation to all data]

Reduce
[Figure: Group and reduce data]

Why MapReduce?
- Data locality
- Simple to understand, but extremely flexible
- Extreme scalability

Why not MapReduce?
- Large overhead
- Limited support for complex workflows

Really, Why MapReduce?
rmr2

Seamlessly Integrated with R
- First-class support for R types: atomic vectors (including factor and NA); does what you want with data.frame, matrix, and array; works with any R values
- Recreates your local session in Hadoop: local and global variables, packages

Hello rmr

    x <- to.dfs(1:100)
    y <- values(from.dfs(mapreduce(x, map = function(., x) keyval(x, x * x))))
    head(y)
    ## [1]  1  4  9 16 25 36

Fancy rmr

    mpg_model <- lm(mpg ~ wt, data = mtcars)
    new_weights <- to.dfs(seq(1, 5, by = 0.1))
    new_mpg <- values(from.dfs(mapreduce(new_weights,
      map = function(., wt) keyval(wt,
        predict(mpg_model, newdata = data.frame(wt = wt))))))
    head(new_mpg)
    ##         1         2         3         4         5         6
    ## 31.940655 26.596183 21.251711 15.907240 10.562768  5.218297

R-Friendly Data Import
- Parse text with read.table
- Read JSON with RJSONIO
- Load Avro record data into a data.frame
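In rmr2, such readers are wired in through input formats. A hedged sketch, assuming a comma-separated file already on HDFS (the path and column names are hypothetical):

    library(rmr2)

    # Build a CSV input format; extra arguments pass through to read.table
    csv_format <- make.input.format("csv", sep = ",",
                                    col.names = c("id", "value"))

    # With the csv format, the map function receives a data.frame
    result <- mapreduce("/user/me/input.csv",
                        input.format = csv_format,
                        map = function(k, v) keyval(v$id, v$value))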

Directly Control Job Configuration
mapreduce(..., backend.parameters = list(...))
- Reduce tasks
- Memory/CPU resources
- JVM parameters
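For example, a sketch of passing a Hadoop property through backend.parameters; the input object and the exact property name (which varies across Hadoop versions) are assumptions:

    # Assumes `input` is an HDFS path or the result of to.dfs()
    mapreduce(input,
              map = function(k, v) keyval(k, v),
              backend.parameters = list(
                hadoop = list(
                  # number of reduce tasks; property name varies by version
                  D = "mapred.reduce.tasks=8")))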

Write Results to HDFS
MapReduce/rmr reads from HDFS and writes to HDFS, making it easy to integrate scripts with the rest of your Hadoop workflows.
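A short sketch of targeting an explicit HDFS output path via mapreduce's output argument (both paths are illustrative):

    # Writing to a fixed HDFS path lets Hive, Pig, or other Hadoop jobs
    # consume the results instead of a temporary location
    mapreduce(input = "/user/me/input",
              output = "/user/me/r_output",
              map = function(k, v) keyval(k, v))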

Misc Awesomeness
- Great documentation on the wiki, with a tutorial and topics on performance and data formats
- Highly optimized typedbytes serialization written in C
- Installation only requires defining environment variables

Caveats
- Everything is batch
- Data issues can be difficult to track down
- The API is great, but MapReduce can be limiting

Try It Out install_github("revolutionanalytics/rmr2", subdir = "pkg") rmroptions(backend = "local") 25 / 52

Outline
1. MapReduce
2. Spark
3. Hadoop Databases
4. Revolution R ScaleR
5. Concluding

Hadoop 2.0
- Standalone compute engine ported to YARN
- Hybrid memory model keeps more data in RAM
- Lazy evaluation allows efficient and complex workflows
- API with more than just Map and Reduce

Why Spark?
- It runs faster (on the same workflow)
- You can develop faster
- Iterative algorithms are feasible

Why not SparkR?
Version: 0.1

Writing for Spark, not for R
Spark uses key-value tuples, and so does SparkR:

    tuple <- list(list("key1", "value1"), list("key2", "value2"))

This is an awkward value in R.

Wordcount: rmr

    mapreduce(input_txt,
      map = function(., txt) {
        words <- unlist(strsplit(txt, " "))
        keyval(words, 1)
      },
      reduce = function(word, ones) {
        keyval(word, sum(ones))
      })

Wordcount: SparkR

    words <- flatMap(lines, function(line) {
      strsplit(line, " ")[[1]]
    })
    wordcount <- map(words, function(word) list(word, 1L))
    counts <- reduceByKey(wordcount, "+", 2L)

SparkR Tuples
This is awkward and slow to do in R:

    wordcount <- map(words, function(word) list(word, 1L))

Compared with a vectorized version implemented through an API like keyval:

    ## NOT VALID SPARKR
    wordcount <- map(words, function(words_vec) keyval(words_vec, 1L))

Limited API for Data Formats
- All data starts as a text file
- You receive it as a character vector

Installation Troubles
Requires compilation specific to your:
- Hadoop version
- Spark version
- YARN vs. no-YARN setup
Additional build tools are required: Scala, Maven

Return Results to R
- Enables exploratory data analysis and ad hoc analytics
- Cannot write output to HDFS for integration with other tools

Why SparkR, Again?
- It's the future
- The API just needs some work

Try It Out install_github("amplab-extras/sparkr-pkg", subdir = "pkg") sc <- sparkrinit(master = "local") 38 / 52

Outline
1. MapReduce
2. Spark
3. Hadoop Databases
4. Revolution R ScaleR
5. Concluding

Can you (S/H)QL?
- Integrate with the existing data lake
- Leverage existing SQL skills

Connecting from R
Connection options:
- (R)ODBC*
- (R)JDBC
*Drivers available from Hadoop distributors
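As an illustration, a hedged RJDBC connection sketch for HiveServer2; the driver jar path, host name, and credentials are placeholders to adapt to your cluster:

    library(RJDBC)

    # Hive JDBC driver; the jar path is illustrative
    drv <- JDBC("org.apache.hive.jdbc.HiveDriver",
                classPath = "/opt/hive/lib/hive-jdbc-standalone.jar")

    conn <- dbConnect(drv,
                      "jdbc:hive2://namenode.example.com:10000/default",
                      "username", "password")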

Caveats
- Apache Sentry
- Kerberos can be tricky
- rJava's Java version must match Hadoop's
- Many driver changes in the past few years

Thoughts
Danger zone:
- Dynamic SQL in R is not pretty
- HiveQL has a strong flavor
Recommend:
- Data reshaping for inputs
- ETL to recombine R outputs

RJDBC Examples

    input_df <- dbGetQuery(conn,
      paste0("SELECT * FROM ", command_line_arg))

    # Do something with input_df, producing output_df ...

    # Write the output to HDFS (e.g., via rhdfs)
    hdfs.put(output_df, output_hdfs_path)
    dbSendQuery(conn, paste0(
      "CREATE EXTERNAL TABLE r_output",
      "()",
      "LOCATION '", output_hdfs_path, "'"))

Outline
1. MapReduce
2. Spark
3. Hadoop Databases
4. Revolution R ScaleR
5. Concluding

Write Once, Deploy Anywhere
- Efficient linear-scaling algorithms for big data
- Cross-platform support for distributed computing: Hadoop, in-database, Platform LSF

Modeling, not Distributed Computing
Focus on modeling:
- Generalized linear models
- Tree-based models
- Clustering
- And data transformation
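As a flavor of the API, a hedged RevoScaleR sketch of fitting a linear model in-cluster; the compute context, the .xdf path, and the cars_xdf data source are illustrative and assume a Revolution R Enterprise install:

    library(RevoScaleR)

    # Run subsequent rx* calls inside the Hadoop cluster
    rxSetComputeContext(RxHadoopMR())

    # An .xdf data source on HDFS (the path is illustrative)
    cars_xdf <- RxXdfData("/user/me/cars.xdf",
                          fileSystem = RxHdfsFileSystem())

    # Distributed linear model, analogous to lm(mpg ~ wt)
    model <- rxLinMod(mpg ~ wt, data = cars_xdf)
    summary(model)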

ScaleR on Hadoop
- "Inside" architecture, with R in the cluster: maximum scalability; currently uses MapReduce
- "Beside" architecture, with R on the edge node: the efficient binary format optimizes I/O; medium-large data (< 1 TB) can be faster than in-Hadoop; connect to Hive/Impala via ODBC (with unixODBC)

Outline
1. MapReduce
2. Spark
3. Hadoop Databases
4. Revolution R ScaleR
5. Concluding

It Depends
- rmr2 is mature and integrated
- Spark is better, but SparkR is immature
- There's always a role for SQL

Questions?

Thank you
Revolution Analytics is the leading commercial provider of software and support for the popular open-source R statistics language.
www.revolutionanalytics.com | 1-855-GET-REVO | Twitter: @RevolutionR