Joe Hummel, PhD. Visiting Researcher: U. of California, Irvine Adjunct Professor: U. of Illinois, Chicago & Loyola U., Chicago



2 Hadoop on Azure
- A little history
- Why Hadoop?
- How it works
- Demos
- Summary

3 Map-Reduce is from functional programming

    // function returns 1 if i is prime, 0 if not:
    let isprime(i) = ...

    // sums 2 numbers:
    let sum(x, y) = return x + y

    // count the number of primes in 1..N:
    let countprimes(N) =
        let L = [ 1 .. N ]          // [ 1, 2, 3, 4, 5, 6, ... ]
        let T = map isprime L       // [ 0, 1, 1, 0, 1, 0, ... ]
        let count = reduce sum T    // 42
        return count
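The same idea can be sketched in JavaScript, the language used for the Hadoop examples later in this deck. This is a minimal illustration rather than code from the slides: the names isPrime, sum, and countPrimes mirror the pseudocode, and the trial-division primality test is an assumption, since the slide elides the body of isprime.

```javascript
// Returns 1 if i is prime, 0 if not (trial division; assumed implementation).
function isPrime(i) {
  if (i < 2) return 0;
  for (let d = 2; d * d <= i; d++) {
    if (i % d === 0) return 0;
  }
  return 1;
}

// Sums 2 numbers.
function sum(x, y) {
  return x + y;
}

// Counts the number of primes in 1..N using map followed by reduce.
function countPrimes(N) {
  const L = Array.from({ length: N }, (_, k) => k + 1); // [1, 2, 3, ..., N]
  const T = L.map((i) => isPrime(i));                    // [0, 1, 1, 0, 1, 0, ...]
  return T.reduce(sum, 0);
}
```

Because each call to isPrime is independent of the others, the map step is the part that can be spread across many machines; only the reduce step has to combine results.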

4 Created by Google to drive internet search
- BIG data: scalable to TBs and beyond
- Parallelism: to get the performance
- Data partitioning: to drive the parallelism
- Fault tolerance: at this scale, machines are going to crash, a lot
[Figure: BIG Data -> page hits]

5
- Search engines: Google, Yahoo, Bing
- Facebook
- Twitter
- Financials
- Health industry
- Insurance
- Credit card companies
- Just about any company collecting user data

6
- Freely-available framework for big data
- Based on concept of Map-Reduce: map function, then reduce intermediate results
[Figure: BIG data -> Map, Map, Map, Map, ... -> Reduce -> R]

7 [Figure: six boxes labeled Reducer]

8 Data -> Map (x3) -> Sort (x3) -> Merge -> Reduce -> R

    Map:    [ <key1,value>, <key4,value>, <key2,value>, ... ]
    Sort:   [ <key1,value>, <key1,value>, ... ]
    Merge:  [ <key1, [value,value, ...]>, <key2, [value,value, ...]>, ... ]
    Reduce: [ <key1, value>, <key2, value> ]
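The four stages in this diagram can be imitated in memory with a few lines of JavaScript. This is only a sketch under stated assumptions: runMapReduce is a hypothetical helper, it runs on a single machine, and it ignores the partitioning and fault tolerance that Hadoop provides. The word-count job at the end is an assumed example, not one from the slides.

```javascript
// Runs a map-reduce job in memory over an array of input records.
// mapFn(record) returns an array of [key, value] pairs;
// reduceFn(key, values) returns one value per key.
function runMapReduce(records, mapFn, reduceFn) {
  // Map: every record yields zero or more <key, value> pairs.
  const pairs = records.flatMap((record) => mapFn(record));

  // Sort: bring pairs with the same key next to each other.
  pairs.sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));

  // Merge: collapse adjacent pairs into <key, [value, value, ...]>.
  const merged = [];
  for (const [key, value] of pairs) {
    const last = merged[merged.length - 1];
    if (last && last[0] === key) {
      last[1].push(value);
    } else {
      merged.push([key, [value]]);
    }
  }

  // Reduce: one <key, value> result per key.
  return merged.map(([key, values]) => [key, reduceFn(key, values)]);
}

// Usage: count how often each word appears.
const wordCounts = runMapReduce(
  ["a b a", "b c"],
  (line) => line.split(" ").map((word) => [word, 1]),
  (key, values) => values.reduce((total, v) => total + v, 0)
);
// wordCounts: [["a", 2], ["b", 2], ["c", 1]]
```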

9 Netflix data-mining: average rating

Netflix Movie Reviews (.txt) -> Netflix Data Mining App

    movieid,userid,rating,date
    1,...,3,...
    ...,...,5,...
    ...,...,3,...
    ...,...,5,...

10 Data -> Map (x3) -> Sort (x3) -> Merge -> Reduce -> R

    Map:    [ <1,3>, <217,5>, <42,3>, <1,5>, <134,2>, <42,1>, ... ]
    Sort:   [ <1,3>, <1,5>, <42,3>, <42,1>, <134,2>, <217,5>, ... ]
    Merge:  [ <1, [3,5]>, <42, [3,1]>, <134, [2, ...]>, <217, [5, ...]>, ... ]
    Reduce: [ <1, 4>, <42, 2>, <134, ?>, ... ]


11 To compute average rating for every movie:

    // JavaScript version:
    var map = function (key, value, context) {
      var values = value.split(",");
      // field 0 contains movieid, field 2 the rating:
      context.write(values[0], values[2]);
    };

    var reduce = function (key, values, context) {
      var sum = 0;
      var count = 0;
      while (values.hasNext()) {
        count++;
        sum += parseInt(values.next(), 10);
      }
      context.write(key, sum / count);
    };
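To try these two functions without a cluster, one option is a small local harness that plays the role of the Hadoop runtime. The sketch below is an assumption-laden illustration: runLocally, ratingMap, and averageReduce are hypothetical names, the harness only imitates the context.write and hasNext()/next() calls used on the slide, and the review rows are made up (only the movieid and rating values follow the earlier worked example).

```javascript
// Map and reduce functions adapted from the slide.
const ratingMap = function (key, value, context) {
  const values = value.split(",");
  // field 0 contains movieid, field 2 the rating:
  context.write(values[0], values[2]);
};

const averageReduce = function (key, values, context) {
  let sum = 0;
  let count = 0;
  while (values.hasNext()) {
    count++;
    sum += parseInt(values.next(), 10);
  }
  context.write(key, sum / count);
};

// Hypothetical local harness (not the Hadoop runtime): feeds each line to
// mapFn, groups the emitted values by key, then calls reduceFn once per key.
function runLocally(lines, mapFn, reduceFn) {
  const groups = new Map();
  const mapContext = {
    write(k, v) {
      if (!groups.has(k)) groups.set(k, []);
      groups.get(k).push(v);
    },
  };
  lines.forEach((line, offset) => mapFn(offset, line, mapContext));

  const output = [];
  const reduceContext = { write: (k, v) => output.push([k, v]) };
  for (const [k, vals] of groups) {
    let i = 0;
    // Minimal iterator exposing the hasNext()/next() calls used above.
    const iterator = { hasNext: () => i < vals.length, next: () => vals[i++] };
    reduceFn(k, iterator, reduceContext);
  }
  return output;
}

// Made-up rows in movieid,userid,rating,date format.
const reviews = [
  "1,1001,3,2005-01-01",
  "217,1002,5,2005-01-02",
  "42,1003,3,2005-01-03",
  "1,1004,5,2005-01-04",
  "42,1005,1,2005-01-05",
];
const averages = runLocally(reviews, ratingMap, averageReduce);
// averages: [["1", 4], ["217", 5], ["42", 2]]
```

The keys come back in first-seen order here because the harness groups with a Map instead of sorting; a real job would deliver them sorted by key.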

12
1. Upload data to HDFS (Hadoop file system)
2. Write map / reduce functions
   - default is to use Java
   - most languages supported: C, C++, C#, JavaScript, Python, ...
3. Compile and upload code
   - For Java, you upload .jar file
   - For others, .exe or script
4. Submit MapReduce job
5. Wait for job to complete

13
- Queries against big datasets
- Embarrassingly-parallel problems
- Solution must fit into map-reduce framework
- Non-real-time demands

Hadoop is not for:
- Small datasets (< 1GB?)
- Sub-second / real-time needs (though clearly Google makes it work)

14 We'll be working with Chicago crime data
- ... GB
- 5M rows

15 Compute top-10 crimes
[Table: IUCR | Count]
- IUCR = Illinois Uniform Crime Reporting codes
- Uniform-Crime-R/c7ck-438e

16 Hadoop on Azure
Supports traditional Hadoop usage:
- Upload data
- Write MapReduce program
- Submit job
Additional features:
- Allows access to persistent data from Azure Storage Vault
- Provides interactive JavaScript console
- Built-in higher-level query languages (PIG, HIVE)

17
    // JavaScript version:
    var map = function (key, value, context) {
      var values = value.split(",");
      // field 4 contains the IUCR code; emit a count of 1 per row:
      context.write(values[4], 1);
    };

    var reduce = function (key, values, context) {
      var sum = 0;
      while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
      }
      context.write(key, sum);
    };


18
    // interactive PIG with explicit Map-Reduce functions:
    pig.from("asv://datafiles/cc-from-2001.txt").
        mapReduce("scripts/iucr-count.js", "IUCR, Count:long").
        orderBy("Count DESC").
        take(10).
        to("output-from-2001")

    // visualize the results:
    file = fs.read("output-from-2001/part-r-00000")
    data = parse(file.data, "IUCR, Count:long")
    graph.bar(data)
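For intuition about what this pipeline computes, here is a plain JavaScript sketch that runs outside Hadoop. It is an illustration only: topCrimes is a hypothetical helper, the sample rows are made up, and the one detail taken from the slides is that field 4 of each row holds the IUCR code.

```javascript
// Counts rows per IUCR code (field 4), orders by count descending, and
// keeps the first n entries: an in-memory analog of the map-reduce,
// order-by, and take steps of the pipeline.
function topCrimes(lines, n) {
  const counts = new Map();
  for (const line of lines) {
    const iucr = line.split(",")[4];
    counts.set(iucr, (counts.get(iucr) || 0) + 1);
  }
  return Array.from(counts, ([IUCR, Count]) => ({ IUCR, Count }))
    .sort((a, b) => b.Count - a.Count)
    .slice(0, n);
}

// Made-up rows; only the position of the IUCR code (field 4) matters here.
const sampleRows = [
  "id1,case1,date1,block1,0486,x",
  "id2,case2,date2,block2,0820,x",
  "id3,case3,date3,block3,0486,x",
  "id4,case4,date4,block4,1320,x",
  "id5,case5,date5,block5,0486,x",
  "id6,case6,date6,block6,0820,x",
];
const top2 = topCrimes(sampleRows, 2);
// top2: [{ IUCR: "0486", Count: 3 }, { IUCR: "0820", Count: 2 }]
```

On the real dataset the counting step is the expensive part, which is why it is expressed as a MapReduce job; the ordering and the take(10) then operate on one small row per IUCR code.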

19
- Microsoft is offering free access to Hadoop
  - Request ...
- Hadoop connector for Excel
  - Process data using Hadoop, analyze/visualize using Excel



21
- Hadoop is all about big data processing
  - Scalable, parallel, fault-tolerant
- Easy to understand programming model
  - Map-Reduce
  - But then solution must fit into this framework
- Rich ecosystem developing around Hadoop
  - Technologies: PIG, HIVE, HBase, ...
  - Companies: Cloudera, Hortonworks, MapR, ...

22
- Presenter: Joe Hummel
- Materials:
- For more info:
- Overview, including how to access via .NET API:


Materials: http://www.joehummel.net/downloads.html
Email: joe@joehummel.net