Driving New Value from Big Data Investments

1 An Introduction to Using R with Hadoop
Jeffrey Breen, Principal, Think Big Academy
jeffrey.breen@thinkbiganalytics.com
February 2013
Driving New Value from Big Data Investments

2 Leading Provider of Innovative Big Analytics Services
Building Modern Analytics Solutions to Monetize Big Data Investments
IMAGINE: Strategy and Roadmap
ILLUMINATE: Training and Education
IMPLEMENT: Hands-On Data Science and Data Engineering

3 THINK BIG Analytics Methodology
Experiment-Driven Short Projects with Nimble Test Solution Cycles
IMAGINE, ILLUMINATE, IMPLEMENT: Innovation and Value
- Breaking Down Business and IT Barriers
- We Accelerate Your Time to Value
- Discrete Projects with Beginning and End
- Early Releases to Validate ROI and Ensure Long Term Success

4 ILLUMINATE: Training and Education THINK BIG Analytics Enable Your IT Staff with New Skills Data Architect Data Architect Big Data Monitoring Database Administrator Big Data Administrator Business Analyst Data Science Math Modeler Developers Expert Training/Courses e.g. Hadoop Developer, HBase, Pig and Hive for Modelers Joint Application Development Side-by-Side Mentoring Big Data Engineering Build Capabilities to Manage Rapid Innovation Needed with Big Data Invest in and Scale Skills to Create Data-Driven Organization 4

5 Agenda Why R? What is Hadoop? Counting words with MapReduce Writing MapReduce jobs with RHadoop Data Warehousing with Hive Big Data Hadoop Want to learn more? Q&A 5

Revolution Confidential http://thebalancedguy.blogspot.

9 Number of R Packages Available
How many R Packages are there now? At the command line enter:
    > dim(available.packages())
Slide courtesy of John Versotek, organizer of the Boston Predictive Analytics Meetup

10 Agenda Why R? What is Hadoop? Counting words with MapReduce Writing MapReduce jobs with RHadoop Data Warehousing with Hive Big Data Hadoop Want to learn more? Q&A 10

13 Google File System is the Storage

14 MapReduce is the framework

15 Enter Hadoop
About this time, Doug Cutting, the creator of Lucene, was working on Nutch.

16 Nutch Timeline Year Topics 2003 Google's GFS paper Nutch Distributed File System (NDFS) Google's MapReduce paper Nutch MapReduce Implementation.

17 Hadoop Timeline Year Topics 2006 NDFS and Nutch MapReduce extracted to separate Hadoop Apache project Hadoop is a top-level Apache project. Yahoo! announces 10K core cluster. 17

18 Hadoop Design Goals
Optimize disk I/O performance.
- Minimize disk head seeks!
Redundant data storage and processing to eliminate many kinds of data loss.
Horizontal scalability.
Run on commodity, server-class hardware.

19 from Jeff Dean, based on Peter Norvig's

20 What is Hadoop?
An open source project designed to support large scale data processing
Inspired by Google's MapReduce-based computational infrastructure
Comprised of several components
- Hadoop Distributed File System (HDFS)
- MapReduce processing framework, job scheduler, etc.
- Ingest/outgest services (Sqoop, Flume, etc.)
- Higher level languages and libraries (Hive, Pig, Cascading, Mahout)
Written in Java, first opened up to alternatives through its Streaming API
If your language of choice can handle stdin and stdout, you can use it to write MapReduce jobs
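The stdin/stdout point can be sketched in R itself. The block below is only an illustration, not code from the deck: the function name stream_map is made up, and it assumes the usual Streaming convention of one tab-separated key/value pair per output line. It reads from any connection so it can be tried without a cluster.

```r
# Sketch of a Hadoop Streaming-style mapper in plain R (no Hadoop packages).
# Assumed convention: one "key<TAB>value" line of output per word.
stream_map <- function(con) {
  lines <- readLines(con, warn = FALSE)
  words <- unlist(strsplit(tolower(lines), "[^a-z0-9]+"))
  words <- words[nzchar(words)]                  # drop empty tokens
  if (length(words) == 0) return(character(0))   # nothing to emit
  paste(words, 1, sep = "\t")                    # emit (word, 1) pairs as text
}

# Demo on an in-memory connection; a real streaming script would instead call
#   writeLines(stream_map(file("stdin")))
demo <- textConnection(c("Hadoop uses MapReduce", "There is a Map phase"))
out <- stream_map(demo)
close(demo)
writeLines(out)
```

Hadoop would run a script like this once per input split and collect whatever it writes to stdout.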

21 Hadoop cluster components
[Diagram, from Think Big Academy's Hadoop Developer Course]
- SQL Store, Logs, Ingest Service, Outgest Service
- Primary Master Server: Job Tracker, Name Node
- Secondary Master Server: Secondary Name Node
- Client Servers: Hive, Pig, ...; cron+bash, Azkaban, Sqoop, Scribe; Monitoring, Management
- Slaves: each Slave Server runs a Task Tracker and a Data Node on top of its local disks
- Key: italics: process; MR jobs

22 Hadoop's distributed file system
[Diagram, from Think Big Academy's Hadoop Developer Course: same cluster layout as slide 21]
- Services: Name Node (on the Primary Master Server), Data Nodes (one per Slave Server)
- 64MB blocks
- 3x replication
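A quick way to see what 64MB blocks and 3x replication mean for storage is to do the arithmetic. This is a rough sketch: the helper name hdfs_footprint and the 1 GB file are invented for illustration, and it ignores metadata overhead.

```r
# Back-of-the-envelope HDFS sizing using the figures on the slide:
# 64 MB blocks and 3x replication. The 1 GB file size is a made-up example.
hdfs_footprint <- function(file_mb, block_mb = 64, replication = 3) {
  blocks <- ceiling(file_mb / block_mb)        # last block may be partly filled
  list(blocks         = blocks,
       block_replicas = blocks * replication,  # block copies spread over Data Nodes
       raw_mb         = file_mb * replication) # disk consumed across the cluster
}

fp <- hdfs_footprint(1024)
print(fp)
```

Under these assumptions a 1 GB file becomes 16 blocks, stored as 48 block replicas that occupy about 3 GB of raw disk.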

23 Agenda Why R? What is Hadoop? Counting words with MapReduce Writing MapReduce jobs with RHadoop Data Warehousing with Hive Big Data Hadoop Want to learn more? Q&A 23

24 True confession: I was wrong about MapReduce
When the Google paper was published in 2004, I was running a typical enterprise IT department
- Big hardware (Sun, EMC) + big applications (Siebel, Peoplesoft) + big databases (Oracle, SQL Server) = big licensing & support costs
Loved the scalability, COTS components, and price, but missed the fact that keys (and values) could be compound & complex... and examples like Wordcount didn't help!
Source: Hadoop: The Definitive Guide, Second Edition, p
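To make "compound keys" concrete, here is a small base-R sketch. The flight records, column names and the "|" separator are all hypothetical; the point is only that the key can carry more than one field.

```r
# Hypothetical flight records; column names and values are invented.
flights <- data.frame(
  carrier = c("AA", "AA", "UA", "UA", "AA"),
  month   = c(1, 1, 1, 2, 2),
  delay   = c(10, 20, 5, 15, 30)
)

# Map step: emit a compound key (carrier, month) with the delay as the value.
keys   <- paste(flights$carrier, flights$month, sep = "|")
values <- flights$delay

# Shuffle + reduce: group values by compound key, then average each group.
mean_delay <- sapply(split(values, keys), mean)
print(mean_delay)
```

Each distinct (carrier, month) pair ends up in its own group, exactly as a single word does in Wordcount.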

25 Input, Mappers, Sort/Shuffle, Reducers, Output
We need to convert the Input into the Output.
Input:
    Hadoop uses MapReduce
    There is a Map phase
    There is a Reduce phase
Mappers:
    (hadoop, 1) (mapreduce, 1) (uses, 1)
    (is, 1) (a, 1) (map, 1) (phase, 1) (there, 1)
    (is, 1) (a, 1) (phase, 1) (there, 1) (reduce, 1)
Output:
    a 2
    hadoop 1
    is 2
    map 1
    mapreduce 1
    phase 2
    reduce 1
    there 2
    uses 1
from Think Big Academy's Hadoop Developer Course. Copyright Think Big Analytics, All Rights Reserved

26 Input, Mappers
    Input                      Mappers
    Hadoop uses MapReduce      (N, " ")
    There is a Map phase       (N, " ")
                               (N, "")
    There is a Reduce phase    (N, " ")
from Think Big Academy's Hadoop Developer Course

27 Input, Mappers
    Input                      Mappers
    Hadoop uses MapReduce      (N, " ")   (hadoop, 1) (uses, 1) (mapreduce, 1)
    There is a Map phase       (N, " ")   (there, 1) (is, 1) (a, 1) (map, 1) (phase, 1)
                               (N, "")
    There is a Reduce phase    (N, " ")   (there, 1) (is, 1) (a, 1) (reduce, 1) (phase, 1)
from Think Big Academy's Hadoop Developer Course

Revolution Confidential http://blog.stackoverflow.

29 Input, Mappers, Sort/Shuffle, Reducers
    Input                      Mappers
    Hadoop uses MapReduce      (N, " ")   (hadoop, 1) (mapreduce, 1) (uses, 1)
    There is a Map phase       (N, " ")   (is, 1) (a, 1) (map, 1) (phase, 1) (there, 1)
                               (N, "")
    There is a Reduce phase    (N, " ")   (is, 1) (a, 1) (phase, 1) (there, 1) (reduce, 1)
Sort, Shuffle: keys are routed to three Reducers by range
    0-9, a-l
    m-q
    r-z
from Think Big Academy's Hadoop Developer Course
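The routing step on this slide can be sketched in a few lines of base R. The function name partition is made up, and the key ranges are copied from the slide; Hadoop's default partitioner hashes keys rather than splitting by letter range, so treat this as a picture of the idea only.

```r
# Route each key to one of three reducers by its first character,
# mirroring the ranges drawn on the slide: 0-9 and a-l, m-q, r-z.
partition <- function(key) {
  first <- substr(tolower(key), 1, 1)
  if (first %in% c(0:9, letters[1:12])) {
    "0-9,a-l"
  } else if (first %in% letters[13:17]) {
    "m-q"
  } else {
    "r-z"
  }
}

keys <- c("hadoop", "uses", "mapreduce", "there", "is", "a", "map", "phase", "reduce")
routed <- split(keys, sapply(keys, partition))
print(routed)
```

Every occurrence of a given key lands on the same reducer, which is what makes the later per-key sum possible.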

30 Input, Mappers, Sort/Shuffle, Reducers
    Input                      Mappers
    Hadoop uses MapReduce      (N, " ")   (hadoop, 1) (mapreduce, 1) (uses, 1)
    There is a Map phase       (N, " ")   (is, 1) (a, 1) (map, 1) (phase, 1) (there, 1)
                               (N, "")
    There is a Reduce phase    (N, " ")   (is, 1) (a, 1) (phase, 1) (there, 1) (reduce, 1)
Sort, Shuffle: each Reducer receives its keys with their values grouped
    0-9, a-l    (a, [1,1]), (hadoop, [1]), (is, [1,1])
    m-q         (map, [1]), (mapreduce, [1]), (phase, [1,1])
    r-z         (reduce, [1]), (there, [1,1]), (uses, [1])
from Think Big Academy's Hadoop Developer Course

31 Input, Mappers, Sort/Shuffle, Reducers, Output
    Input                      Mappers
    Hadoop uses MapReduce      (N, " ")   (hadoop, 1) (mapreduce, 1) (uses, 1)
    There is a Map phase       (N, " ")   (is, 1) (a, 1) (map, 1) (phase, 1) (there, 1)
                               (N, "")
    There is a Reduce phase    (N, " ")   (is, 1) (a, 1) (phase, 1) (there, 1) (reduce, 1)
    Reducer     Input after Sort, Shuffle                       Output
    0-9, a-l    (a, [1,1]), (hadoop, [1]), (is, [1,1])          a 2, hadoop 1, is 2
    m-q         (map, [1]), (mapreduce, [1]), (phase, [1,1])    map 1, mapreduce 1, phase 2
    r-z         (reduce, [1]), (there, [1,1]), (uses, [1])      reduce 1, there 2, uses 1
from Think Big Academy's Hadoop Developer Course

32 Input, Mappers, Sort/Shuffle, Reducers, Output
    Reducer     Input after Sort, Shuffle                       Output
    0-9, a-l    (a, [1,1]), (hadoop, [1]), (is, [1,1])          a 2, hadoop 1, is 2
    m-q         (map, [1]), (mapreduce, [1]), (phase, [1,1])    map 1, mapreduce 1, phase 2
    r-z         (reduce, [1]), (there, [1,1]), (uses, [1])      reduce 1, there 2, uses 1
Map: Transform one input to 0-N outputs.
Reduce: Collect multiple inputs into one output.
from Think Big Academy's Hadoop Developer Course
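The whole walk-through can be replayed on a laptop. The sketch below uses base R only (no Hadoop, no rmr2) and made-up helper names; it runs the slide's input through an explicit map, sort/shuffle and reduce, without splitting the keys across several reducers.

```r
# Word count as three explicit steps, in base R only.
input <- c("Hadoop uses MapReduce",
           "There is a Map phase",
           "",
           "There is a Reduce phase")

# Map: one input line -> zero or more (word, 1) pairs.
map_line <- function(line) {
  words <- unlist(strsplit(tolower(line), "\\s+"))
  words <- words[nzchar(words)]
  lapply(words, function(w) list(key = w, value = 1))
}
pairs <- unlist(lapply(input, map_line), recursive = FALSE)

# Sort/shuffle: gather all values that share a key.
keys    <- vapply(pairs, function(p) p$key, character(1))
values  <- vapply(pairs, function(p) p$value, numeric(1))
grouped <- split(values, keys)

# Reduce: many values for one key -> one output.
counts <- vapply(grouped, sum, numeric(1))
print(counts)
```

The empty input line produces no pairs at all, which is the "0" in "0-N outputs".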

33 Agenda Why R? What is Hadoop? Counting words with MapReduce Writing MapReduce jobs with RHadoop Data Warehousing with Hive Big Data Hadoop Want to learn more? Q&A 33

34 Enter RHadoop
RHadoop is an open source project sponsored by Revolution Analytics
Package Overview
- rmr2 - all MapReduce-related functions
- rhdfs - interaction with Hadoop's HDFS file system
- rhbase - access to the NoSQL HBase database
rmr2 uses Hadoop's Streaming API to allow R users to write MapReduce jobs in R
- handles all of the I/O and job submission for you (no while(<stdin>)-like loops!)

35 RHadoop Advantages
Modular
- Packages group similar functions
- Only load (and learn!) what you need
- Minimizes prerequisites and dependencies
Open Source
- Cost: Low (no) barrier to start using
- Transparency: Development, issue tracker, Wiki, etc. hosted on GitHub
Supported
- Sponsored by Revolution Analytics
- Training & professional services available
- Support available with Revolution R Enterprise subscriptions

36 wordcount: code

    library(rmr2)

    # mapper: split each line on whitespace and emit a (word, 1) pair per word
    map = function(k, lines) {
      words.list = strsplit(lines, '\\s')
      words = unlist(words.list)
      return( keyval(words, 1) )
    }

    # reducer: sum the counts collected for one word
    reduce = function(word, counts) {
      keyval(word, sum(counts))
    }

    wordcount = function (input, output = NULL) {
      mapreduce(input = input, output = output, input.format = "text",
                map = map, reduce = reduce)
    }

from Revolution Analytics Getting Started with RHadoop course

37 wordcount: submit job and fetch results
Submit job

    > hdfs.root = 'wordcount'
    > hdfs.data = file.path(hdfs.root, 'data')
    > hdfs.out = file.path(hdfs.root, 'out')
    > out = wordcount(hdfs.data, hdfs.out)

Fetch results from HDFS

    > results = from.dfs( out )
    > results.df = as.data.frame(results, stringsAsFactors=F )
    > colnames(results.df) = c('word', 'count')
    > head(results.df)
           word count
    1 greatness     2
    2    damned     3
    3       tis     5
    4      jade     1
    5  magician     1

from Revolution Analytics Getting Started with RHadoop course

38 Code notes
Scalable
- Hadoop and MapReduce abstract away system details
- Code runs on 1 node or 1,000 nodes without modification
Portable
- You write normal R code, interacting with normal R objects
- RHadoop's rmr2 library abstracts away Hadoop details
- All the functionality you expect is there including Enterprise R's
Flexible
- Only the mapper deals with the data directly
- All components communicate via key-value pairs
- Key-value schema chosen for each analysis rather than as a prerequisite to loading data into the system

39 rmr2 Function Overview
Convenience
- keyval() - creates a key-value pair from any two R objects. Used to generate output from input formatters, mappers, reducers, etc.
Input/output
- from.dfs(), to.dfs() - read/write data from/to the HDFS
- make.input.format() - provides common file parsing (text, CSV) or will wrap a user-supplied function
Job execution
- mapreduce() - submit job and return an HDFS path to the results if successful

40 rhdfs function overview
File & directory manipulation
- hdfs.ls(), hdfs.list.files()
- hdfs.delete(), hdfs.del(), hdfs.rm()
- hdfs.dircreate(), hdfs.mkdir()
- hdfs.chmod(), hdfs.chown(), hdfs.file.info()
- hdfs.exists()
Copying, moving & renaming files to/from/within HDFS
- hdfs.copy(), hdfs.move(), hdfs.rename()
- hdfs.put(), hdfs.get()
Reading files directly from HDFS
- hdfs.file(), hdfs.read(), hdfs.write(), hdfs.flush()
- hdfs.seek(), hdfs.tell(con), hdfs.close()
- hdfs.line.reader(), hdfs.read.text.file()
Misc.
- hdfs.init(), hdfs.defaults()

41 rhbase function overview
Initialization
- hb.init()
Create and manage tables
- hb.list.tables(), hb.describe.table()
- hb.new.table(), hb.delete.table()
Read and write data
- hb.insert(), hb.insert.data.frame()
- hb.get(), hb.get.data.frame(), hb.scan()
- hb.delete()
Administrative, etc.
- hb.defaults(), hb.set.table.mode()
- hb.regions.table(), hb.compact.table()

42 Big Data Warehousing with Hive
Hive supplies a SQL-like query language
- very familiar for those with relational database experience
But Hive compiles, optimizes, and executes these queries as MapReduce jobs on the Hadoop cluster
Can be used in conjunction with other Hadoop jobs, such as those written with rmr2

43 Hive architecture & access
[Diagram]
- Access: Terminal (CLI), browser (HWI); RODBC, RJDBC, etc. (JDBC, ODBC via Thrift Server)
- Hive: Driver (compiles, optimizes, executes), Metastore
- Hadoop: Master (Job Tracker, Name Node), DFS

44 Accessing Hive via ODBC/JDBC

    library(RJDBC)

    # set the classpath to include the JDBC driver location, plus commons-logging
    [...]
    class.path = c(hive.class.path, commons.class.path)
    drv = JDBC("org.apache.hadoop.hive.jdbc.HiveDriver", classPath=class.path, "`")

    # make a connection to the running Hive Server:
    conn = dbConnect(drv, "jdbc:hive://localhost:10000/default")

    # setting the database name in the URL doesn't help,
    # so issue 'use databasename' command:
    res = dbSendQuery(conn, 'use mydatabase')

    # submit the query and fetch the results as a data.frame:
    df = dbGetQuery(conn, 'SELECT name, sub FROM employees LATERAL VIEW explode(subordinates) subview AS sub')
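For readers new to LATERAL VIEW explode(), the query above returns one (name, sub) row per element of each employee's subordinates array. The base-R sketch below mimics that shape on a tiny, entirely hypothetical employees table; it does not talk to Hive.

```r
# Hypothetical employees table: each row has a name and an array of subordinates.
employees <- list(
  list(name = "Alice", subordinates = c("Bob", "Carol")),
  list(name = "Dave",  subordinates = c("Erin")),
  list(name = "Frank", subordinates = character(0))
)

# explode() emits one output row per array element; LATERAL VIEW joins each
# emitted element back to the row it came from. Rows with an empty array
# produce no output (Hive needs LATERAL VIEW OUTER to keep them).
n    <- vapply(employees, function(e) length(e$subordinates), integer(1))
name <- rep(vapply(employees, function(e) e$name, character(1)), times = n)
sub  <- unlist(lapply(employees, function(e) e$subordinates))
exploded <- data.frame(name = name, sub = sub, stringsAsFactors = FALSE)
print(exploded)
```

The resulting data frame has the same two columns that dbGetQuery() would hand back for the query on the slide.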

45 Other ways to use R and Hadoop
HDFS
- Revolution Enterprise R can read and write files directly on the distributed file system
- Files can include ScaleR's XDF-formatted data sets
MapReduce
- Many other R packages have been written to use R and Hadoop together, including RHIPE, segue, Oracle's R Connector for Hadoop, etc.
Hive
- Hadoop Streaming is also available for Hive to leverage functionality external to Hadoop and Java
- RHive leverages Rserve to connect the two

46 Big Data Hadoop
NoSQL databases offer low-latency, random access to key-values
- HBase
- Cassandra
- CouchDB
- MongoDB
- Accumulo
Next week, Think Big's Douglas Moore will be presenting at the Boston Storm Meetup:
- Predictive Analytics with Storm, Hadoop, R and AWS

47 Want to learn more?
Upcoming public Getting Started with RHadoop 1-day classes
- Hands-on examples and exercises covering rhdfs, rhbase, and rmr2
- Algorithms and data include wordcount, analysis of airline flight data, and collaborative filtering using structured and unstructured data from text, CSV files and Twitter
- February 25, Palo Alto, CA
- March 13, Boston, MA
- 25% off with user discount
Revolution Analytics Quick Start Program for Hadoop
- Private Getting Started with RHadoop training
- Onsite consulting assistance for initial use case
- Revolution R for Hadoop licenses and support
- More
