BigData And the Zoo. Mansour Raad Federal GIS Conference 2014

1 Federal GIS Conference 2014, February 10-11, 2014, Washington DC. BigData And the Zoo. Mansour Raad

2 What is BigData?

3 Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it - Dan Ariely

4 No but seriously!

5 Academics Volume Velocity Variety

6 But then I've seen: Volume (data at rest); Velocity (data in motion); Variety (many types, forms and structures, or no structures); Veracity (data in doubt); Validity (data that is correct); Visualization (data in patterns); Vulnerability (data at risk); Value (data that is meaningful)

7 I'm sticking with Volume, Velocity, Variety

8 When the traditional means are failing you -Anonymous

9 What are the new means?

11 What Is Hadoop? Library / Framework; Multi-Node Distributed Processing; Very, Very Large Datasets; Resilient To Hardware Failure

12 Hadoop Basic Stack: MapReduce; Yet Another Resource Negotiator (YARN); Hadoop Distributed File System (HDFS); Commodity Servers

13 Other Hadoop Projects: Avro - Serialization / RPC System; HBase - Distributed Columnar Database; Hive - Ad Hoc SQL Interface; Pig - Data Flow Parallel Execution (AML); ZooKeeper - Coordination Service; and more.

14 HDFS: Distributed File System; Lots and Lots of Commodity Drives; Fault Tolerant; Loves Big Files; POSIX-Like Interface

15 HDFS HDFS Client NameNode DataNode DataNode DataNode

16 HDFS Resilience! HDFS DataNode DataNode DataNode

17 Program BigData

18 Program BigData

19 MapReduce

20 What Is MapReduce? Parallel, Fault-Tolerant Framework; Splits Large Input; Invokes User-Defined Map Function; Shuffle and Sort; Invokes User-Defined Reduce Function

21 MapReduce

22 MapReduce & HDFS Client.jar Job Tracker Name Node Task Tracker Data Node Task Tracker Data Node Task Tracker Data Node

23 Thinking In MR: Map: (K1, V1) → list(K2, V2) (filter & transform); Shuffle/Sort: groups values by K2; Reduce: (K2, list(V2)) → list(K3, V3) (group & aggregate)
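The key/value flow on this slide can be tried out in a single process. The following is a minimal sketch in plain Python, not Hadoop code: `run_mapreduce`, `word_mapper` and `sum_reducer` are hypothetical names introduced only for illustration, and the word-count input is made up.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle/sort and reduce in one process (illustration only)."""
    groups = defaultdict(list)
    # Map phase: each (k1, v1) record may emit zero or more (k2, v2) pairs.
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)  # shuffle: collect values under their key
    output = []
    # Sort phase: keys are handed to the reducer in sorted order.
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output

def word_mapper(lineno, text):
    # Filter & transform: one (word, 1) pair per word.
    for word in text.split():
        yield word.lower(), 1

def sum_reducer(word, counts):
    # Group & aggregate: sum the ones for each word.
    yield word, sum(counts)

lines = [(0, "big data big files"), (1, "Big clusters")]
result = run_mapreduce(lines, word_mapper, sum_reducer)
print(result)
```

On a real cluster the map and reduce calls run on different nodes and the shuffle moves data over the network; the sketch only mirrors the shape of the data at each step.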

24 Hello MapReduce!

25 Input records (ID1,X1,Y1; ID2,X2,Y2; ID3,X3,Y3; ID4,X4,Y4) → DensityMap

26 DensityMap

function map(lineno, text) {
  tokens = text.split(",")
  cell = tocell(tokens[1], tokens[2])
  emit(cell, 1)
}

function tocell(x, y) {
  // some math!!
  return cell
}

function reduce(cell, iterator) {
  sum = 0
  for (one : iterator)
    sum += one
  emit(cell, sum)
}
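The DensityMap pseudocode can be made runnable as a small Python sketch. The "some math" inside the cell function is not given on the slide, so the square grid below, its `CELL_SIZE`, and the sample coordinates are assumptions made for illustration.

```python
import math
from collections import defaultdict

CELL_SIZE = 1.0  # assumed grid resolution; the slide does not specify one

def to_cell(x, y, cell_size=CELL_SIZE):
    # Assumed "math": snap a coordinate to the (col, row) of a square grid.
    return (math.floor(x / cell_size), math.floor(y / cell_size))

def map_fn(lineno, text):
    # Input line layout from the slides: ID,X,Y
    tokens = text.split(",")
    yield to_cell(float(tokens[1]), float(tokens[2])), 1

def reduce_fn(cell, ones):
    yield cell, sum(ones)

def density_map(lines):
    # Local stand-in for the framework: map, group by cell, then reduce.
    groups = defaultdict(list)
    for lineno, text in enumerate(lines):
        for cell, one in map_fn(lineno, text):
            groups[cell].append(one)
    return dict(
        pair for cell in sorted(groups) for pair in reduce_fn(cell, groups[cell])
    )

rows = ["ID1,0.2,0.7", "ID2,0.9,0.1", "ID3,1.5,0.5", "ID4,-0.5,2.3"]
counts = density_map(rows)
print(counts)
```

Using `math.floor` rather than `int` keeps negative coordinates in the correct cell, which matters for longitudes west of the prime meridian.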

27 Writing MR Is Hard

29 Think of Data as Water In Pipes

30 Workflow Pipeline: X,Y Collection (Source) → Filter → To Cell (M) → GroupBy → Count (R) → Cell Count (Sink)
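The "water in pipes" idea can be sketched with Python generators, where each stage consumes the stream produced by the previous one. This is only an analogy for the workflow on the slide, not Cascading itself; the stage names and the sample rows (including the deliberately malformed one) are assumptions.

```python
from collections import Counter

def source(lines):
    # Source: split each CSV line into fields.
    for line in lines:
        yield line.split(",")

def keep_xy(rows):
    # Filter: keep rows whose X and Y parse as numbers, emit (x, y).
    for row in rows:
        try:
            yield float(row[1]), float(row[2])
        except (IndexError, ValueError):
            continue

def to_cell(points, cell_size=1.0):
    # Map side: convert each point to a grid cell (assumed square grid).
    for x, y in points:
        yield (int(x // cell_size), int(y // cell_size))

def count_by_cell(cells):
    # Reduce side: group by cell and count.
    return Counter(cells)

def sink(counts):
    # Sink: materialize the cell counts in a stable order.
    return sorted(counts.items())

lines = ["ID1,0.2,0.7", "ID2,bad,0.1", "ID3,0.9,0.4", "ID4,2.5,3.5"]
result = sink(count_by_cell(to_cell(keep_xy(source(lines)))))
print(result)
```

Nothing is computed until the sink pulls data through the chain, which is roughly how a pipeline description differs from hand-written map and reduce functions.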

31 Cascading: pipeline → MapReduce Job

32 Cascading Pipe

// Pipe tap x,y input fields into spatial function
Pipe pipe = new Each("start", new Fields("X", "Y"), new SpatialDensity());

// Group by emitted cell value
pipe = new GroupBy(pipe, cell);

// Count by group and name count POPULATION
pipe = new Every(pipe, Fields.GROUP, new Count(new Fields("POPULATION")));

33 How About... No Programming???

34 Apache HIVE

35 Apache HIVE: SQL → MapReduce Job

36 HQL

drop table if exists logs;

create external table if not exists logs(
  ip string,
  method string,
  uri string,
  status string,
  bytes int,
  time_taken int,
  referrer string,
  user_agent string
)
partitioned by (year int, month int, day int, hour int)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile
location 'hdfs://hadoop:8020/logs/';

37 HQL $ hive hive> select hour,count(hour) from logs where year=2014 and month=01 and uri = group by hour order by hour;

38 Other AdHoc Engines: Cloudera Impala; Facebook Presto; AMPLab Shark. Bypass MR generation / Direct HDFS Access

39 What About Spatial?

41 GIS Tools For Hadoop: Open Source / GitHub; Apache 2.0 License; Geometry API; Spatial Framework (Hive); GeoProcessing Tools

42 Geometry API Shapes: Points; Polylines; Polygons; Envelopes

43 Geometry API Geometry Operations: Contains / Intersects / Union / Difference / Buffer / ConvexHull; Spatial Index

44 Geometry API I/O Operations: WKT; OGC; GeoJSON; Shape (bin)

45 API Usage in BigData: Map-only jobs - GeoEnrichment. Given a set of locations and a demographic area, augment each location with the demographic attributes.
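A GeoEnrichment mapper boils down to a point-in-polygon test followed by an attribute merge. The sketch below uses a plain ray-casting test in Python instead of the Geometry API, so it is a stand-in for the technique rather than the library's implementation; the record layout (`lon`, `lat`, `ring`, `attributes`) and the sample values are hypothetical.

```python
def point_in_polygon(x, y, ring):
    """Ray casting: count how many polygon edges a ray going right crosses."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can cross.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def enrich(location, areas):
    # Map-only step: copy the attributes of the first containing area.
    x, y = location["lon"], location["lat"]
    for area in areas:
        if point_in_polygon(x, y, area["ring"]):
            return {**location, **area["attributes"]}
    return dict(location)  # no match: pass the record through unchanged

areas = [{
    "ring": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "attributes": {"median_age": 37},  # made-up demographic attribute
}]
inside_point = enrich({"id": "A", "lon": 5, "lat": 5}, areas)
outside_point = enrich({"id": "B", "lon": 15, "lat": 5}, areas)
print(inside_point, outside_point)
```

Because each location is handled independently, no shuffle or reduce step is needed, which is why this kind of job can be map-only.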

46 BigData Binning

47 BigData Binning

48 BigData Binning

49 BigData: Spatial Join; kNN; Range Queries; Custom Input Format
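For readers new to the terms, kNN and range queries can be illustrated with a brute-force Python sketch. It assumes planar coordinates and scans every point, whereas a distributed job would normally partition the data and use a spatial index; the function names and sample points are hypothetical.

```python
import heapq
import math

def knn(query, points, k):
    # k nearest neighbours by straight-line (planar) distance, nearest first.
    qx, qy = query
    return heapq.nsmallest(
        k, points, key=lambda p: math.hypot(p[0] - qx, p[1] - qy)
    )

def range_query(envelope, points):
    # All points inside an axis-aligned envelope (xmin, ymin, xmax, ymax).
    xmin, ymin, xmax, ymax = envelope
    return [p for p in points if xmin <= p[0] <= xmax and ymin <= p[1] <= ymax]

points = [(0, 0), (1, 1), (5, 5), (2, 2)]
nearest = knn((0, 0), points, 2)
in_range = range_query((0.5, 0.5, 2.5, 2.5), points)
print(nearest, in_range)
```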

50 SQL Is Still King!

51 Spatial Hive UDF (User Defined Functions)

52 Spatial Hive UDF: Uses Geometry API; Constructors: ST_Point / ST_GeomFromGeoJSON; Relations: ST_Contains / ST_Buffer; Accessors: ST_Distance, ST_Area

53 Spatial Hive

SELECT counties.name, count(*) cnt
FROM counties
JOIN earthquakes
WHERE ST_Contains(counties.boundaryshape, ST_Point(earthquakes.longitude, earthquakes.latitude))
GROUP BY counties.name
ORDER BY cnt desc;

54 AIS Demo

55 AIS Data: 14.8 million GPS records; 1 month; Zone 18 (North East / NY Area); Fields: MMSI, ZuluTime, Lon/Lat, VoyageId, Draught

56 Demo Steps: GP Toolbox; Track Assembly From Targets; Hex Generation; Density Analysis
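The slides do not say how tracks are assembled from targets. One plausible reading, sketched below in Python, is to group position reports by vessel and voyage and order them by time. The tuple layout follows the AIS fields listed earlier (MMSI, VoyageId, ZuluTime, Lon, Lat), but the grouping key, the ISO 8601 time strings and the sample values are assumptions.

```python
from collections import defaultdict
from operator import itemgetter

def assemble_tracks(targets):
    """Group targets by (mmsi, voyage_id) and order each group by time."""
    grouped = defaultdict(list)
    for mmsi, voyage_id, zulu_time, lon, lat in targets:
        grouped[(mmsi, voyage_id)].append((zulu_time, lon, lat))
    tracks = {}
    for key, fixes in grouped.items():
        # ISO 8601 UTC strings sort chronologically as plain text.
        fixes.sort(key=itemgetter(0))
        tracks[key] = [(lon, lat) for _, lon, lat in fixes]
    return tracks

targets = [
    (123, "v1", "2014-01-01T00:10:00Z", -73.9, 40.6),
    (123, "v1", "2014-01-01T00:00:00Z", -74.0, 40.5),
    (456, "v9", "2014-01-01T00:05:00Z", -72.0, 41.0),
]
tracks = assemble_tracks(targets)
print(tracks)
```

In MapReduce terms the grouping key would be the map output key and the time ordering would happen in the reducer.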

57 AIS CSV Import: CSV → Partitioner → /ais/yyyy/mm/dd/hh/uuid.csv (HDFS) → MapReduce
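The partition layout /ais/yyyy/mm/dd/hh/uuid.csv can be derived from each record's timestamp. The following Python sketch only builds the path string and does not write to HDFS; `partition_path`, its parameters and the sample timestamp are hypothetical.

```python
import uuid
from datetime import datetime, timezone

def partition_path(zulu_time, root="/ais", file_id=None):
    # One directory level per year, month, day and hour; file named by UUID.
    file_id = file_id or uuid.uuid4().hex
    return f"{root}/{zulu_time:%Y/%m/%d/%H}/{file_id}.csv"

stamp = datetime(2014, 2, 10, 5, 30, tzinfo=timezone.utc)
path = partition_path(stamp, file_id="abc")
print(path)
```

With this layout a job that only needs one hour or one day of data can read a single directory instead of scanning the whole month.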

62 How to get started?

63 Cloudera QuickStart VM

64 Books That I Recommend: Hadoop: The Definitive Guide; Hadoop in Action; HBase in Action; Hadoop Real-World Solutions Cookbook; MapReduce Design Patterns

65 Q&A
