Bring Location Intelligence To Big Data Applications on Hadoop, Spark, and NoSQL

Similar documents
Oracle Big Data Spatial and Graph: Spatial Features

Big Data Technologies and Geospatial Data Processing:

Oracle Big Data Spatial and Graph: Spatial Features Roberto Infante 11/11/2015 Latin America Geospatial Forum

Verarbeitung von Vektor- und Rasterdaten auf der Hadoop Plattform DOAG Spatial and Geodata Day 2016

Spatial Analytics Built for Big Data Platforms

What s New with SpaDal and Graph?

Open And Linked Data Oracle proposition Subtitle

BigData And the Zoo. Mansour Raad Federal GIS Conference 2014

Using Linked Data Concepts to Blend and Analyze Geospatial and Statistical Data Creating a Semantic Data Platform

Oracle Big Data Spatial and Graph User s Guide and Reference. Release 2.3

Publishing Statistical Data and Geospatial Data as Linked Data Creating a Semantic Data Platform

Oracle Big Data Fundamentals Ed 2

Deploying Spatial Applications in Oracle Public Cloud

Oracle Big Data SQL. Release 3.2. Rich SQL Processing on All Data

Oracle Big Data. A NA LYT ICS A ND MA NAG E MENT.

Oracle Big Data Fundamentals Ed 1

Deep Learning und Graphenanalyse im Einsatz gegen Hacker

Quick Deployment Step-by-step instructions to deploy Oracle Big Data Lite Virtual Machine

Quick Deployment Step- by- step instructions to deploy Oracle Big Data Lite Virtual Machine

Oracle Big Data Spatial and Graph

Serving Imagery with ArcGIS Server 10.1

Oracle Big Data SQL High Performance Data Virtualization Explained

Geospatial Analytics at Scale

May 22, 2013 Ronald Reagan Building and International Trade Center Washington, DC USA

Introduction to BigData, Hadoop:-

May 2012 Oracle Spatial User Conference

Managing Imagery and Raster Data Using Mosaic Datasets

Hive and Shark. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic)

Big Data Analytics using Apache Hadoop and Spark with Scala

May 21, 2014 Walter E. Washington Convention Center Washington, DC USA. Copyright 2014, Oracle and/or its affiliates. All rights reserved.

Oracle Big Data Spatial and Graph

Copyright 2012, Oracle and/or its affiliates. All rights reserved.

May 22, 2013 Ronald Reagan Building and International Trade Center Washington, DC USA

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)

Oracle Big Data Spatial and Graph User s Guide and Reference. Release 2.4

Copyright 2012, Oracle and/or its affiliates. All rights reserved.

Oracle Big Data Spatial and Graph User s Guide and Reference. Release 2.5

Blended Learning Outline: Cloudera Data Analyst Training (171219a)

Techno Expert Solutions An institute for specialized studies!

Certified Big Data and Hadoop Course Curriculum

Best Practices, Tips and Tricks With Oracle Spatial and Graph Daniel Geringer, Spatial Solutions Specialist Oracle Corporation

Copyright 2012, Oracle and/or its affiliates. All rights reserved.

Spatial Data Management in Apache Spark: The GeoSpark Perspective and Beyond

A Second Look at DEM s

The Era of Big Spatial Data: A Survey

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

ENGRG 59910: Introduction to GIS

Processing of big data with Apache Spark

Data Analytics Job Guarantee Program

Copyright 2013, Oracle and/or its affiliates. All rights reserved.

Oracle Big Data Cloud Service, Oracle Storage Cloud Service, Oracle Database Cloud Service

Graph Databases nur ein Hype oder das Ende der relationalen Welt? DOAG 2016

Imagery and Raster Data in ArcGIS. Abhilash and Abhijit

Oracle Big Data Connectors

Oracle Spatial Technologies: An Update. Xavier Lopez Director, Spatial Technologies Oracle Corporation

Welcome to NR402 GIS Applications in Natural Resources. This course consists of 9 lessons, including Power point presentations, demonstrations,

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training::

Oracle Spatial Summit 2015 Best Practices for Developing Geospatial Apps for the Cloud

Hive SQL over Hadoop

Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture

An Overview of FMW MapViewer

Spatial Databases - a look into the future

Geodatabases. Dr. Zhang SPRING 2016 GISC /03/2016

Turning Relational Database Tables into Spark Data Sources

Oracle Spatial 10g: Advanced

SPATIAL DATA MODELS Introduction to GIS Winter 2015

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

Eine für Alle - Oracle DB für Big Data, In-memory und Exadata Dr.-Ing. Holger Friedrich

State of JTS. Presented by: James, Jody, Rob, (Martin)

Importing and Exporting Data Between Hadoop and MySQL

Hadoop Development Introduction

SparkSQL 11/14/2018 1

Accessing and Administering your Enterprise Geodatabase through SQL and Python

Integrating Imagery into ArcGIS Runtime Application. Jie Zhang, Zhiguang Han San Jacinto, 5:30 pm 6:30 pm

Oracle Application Express Plug-in for Ajax Maps: Map Integration the Easy Way

1 Copyright 2011, Oracle and/or its affiliates. All rights reserved.

Caching Imagery Using ArcGIS

RASTER ANALYSIS S H A W N L. P E N M A N E A R T H D A T A A N A LY S I S C E N T E R U N I V E R S I T Y O F N E W M E X I C O

This is a brief tutorial that explains how to make use of Sqoop in Hadoop ecosystem.

CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin. Presented by: Suhua Wei Yong Yu

Topic 5: Raster and Vector Data Models

Creating Mosaic Datasets and Publishing Image Services using Python

Oracle Big Data Science

Georeferencing Topo Sheets and Scanned Maps


IOUG BIWA and Spa.al SIG TechCast: Introducing Oracle Big Data Spa.al and Graph

Map Visualization in Analytic Applications LJ Qian, Director of Software Development David Lapp, Product Manager Oracle

Microsoft Big Data and Hadoop

Hadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours)

Oracle Big Data Science IOUG Collaborate 16

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Oracle Big Data Spatial and Graph

Certified Big Data Hadoop and Spark Scala Course Curriculum

Big Data Architect.

BEST PRACTICES FOR DEVELOPING GEOSPATIAL BIG DATA APPS FOR THE CLOUD. Nick Salem - Distinguished Engineer David Tatar - Principal Software Engineer

Oracle 10g GeoSpatial Technologies. Eve Kleiman Asia/Pacific Spatial Product Manager Oracle Corporation

Multidimensional Data and Modelling - DBMS

1 Big Data Hadoop. 1. Introduction About this Course About Big Data Course Logistics Introductions

RASTER ANALYSIS GIS Analysis Winter 2016

CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI)

Transcription:

Bring Location Intelligence To Big Data Applications on Hadoop, Spark, and NoSQL Siva Ravada Senior Director of Development Copyright 2015, Oracle and/or its affiliates. All rights reserved.

Program Agenda 1 2 3 4 Overview Vector Data Processing Raster Data Processing Discussion 2

What problems can Big Data Spatial analysis address? Data Harmonization using any location attribute (address, postal code, lat/long, placename, etc). Categorization and filtering based on location and proximity Preparation, validation and cleansing of Spatial and Raster data Visualizing and displaying results on a map

Data Harmonization: Linking information by location Are these data points related? Tweet: sailing by #goldengate Instagram image subtitle: 골든게이트교 * Text message: Driving on 101 North, just reached border between Marin County and San Francisco County GPS Sensor: N 37 49 11 W 122 28 44 Now find all data points around Golden Gate Bridge... * Golden Gate Bridge (in Korean)

What features does Big Data Spatial have? Data enrichment service API using GeoNames and geometry hierarchy data Spatial processing of data stored in HDFS. Raster processing operations: Mosaic and sub-set operations. Geodetic and Cartesian data MapReduce routines and for distance calculations, PointInPolygon, buffer creation, Categorization, KMeansClustering, binning, etc.. Hive support (new) HTML5 Map Visualization API

Program Agenda 1 2 3 4 Overview Vector Data Processing Raster Data Processing Disccussion 6

Vector Data Type support All of the data types we support with the SDO_GEOMETRY are supported on the Hadoop and Spark platform Points, Lines, Polygons, Collections Including Arcs, compound line strings, NURBs, compound polygons, etc. Supports both 2D and 3D data types Supports both Cartesian and Geodetic data models Topological and distance operations Anyinteract, inside, distance, length, simplify, buffer, PointInPolygon

Vector spatial data storage in HDFS Customers load their data into HDFS using a loader of their choice We do not require the data to be in a format that we specify This makes it easy for customers to use any data format their applications prefer And the data can have other business data and not just spatial data We require the customer to provide the InputFormat/RecordReader class The RecordReader implementation reads the customer data record and produces an instance of JGeometry With this model we can support any data format customers use for their data

Store any business data with spatial information in HDFS Oracle JGeometry USER- PROVIDED InputFormat/ RecordReader Class* LOAD ANY FORMAT ANY BUSINESS DATA * Classes for Shapefile and GeoJSON formats provided HDFS

Spatial Index for Spatial Queries Spatial Index stored as a MapFile MapReduce Job with Index Copy the index to distributed cache Mapper reads the index data for the corresponding HDFS block Data Blocks Data Blocks HDFS Data Blocks Data Blocks Process only those records that return hits from the index search 11

Vector Data Processing API Our API is broadly divided into three categories Functions that operate on a single geometry at a time Buffer, simplify, length, area, etc. Functions that operate on a pair of geometries at a time Range-Queries: these are the typical PointInPolygon, AnyInteract type of operations Analogous to our operators and relate functions in the DB Join-Queries: This is the typical Spatial-Join operation where two data sets are joined to find all the interacting pairs of geometries (next release) Functions that categorize and enrich data Associating a data set with a known geometry or named hierarchy For example, process all tweets for a time period and count how many tweets are associated with each city, county, state, etc.

Vector Data Processing API Functions Single Geometry Length Area Buffer Simplify Geometry Pairs Range Queries Point in Polygon Touch, Overlap, Intersect, Contains, Any Interaction Join Queries Interactions on sets of data E.g.: Find all the dropped cell calls in all coverage areas Categorization and Enrichment Associate a data set with a known geometry or named hierarchy Process all Tweets for a period of time and count how many are associated with each city, county, state, etc.

Data Categorization Services Any hierarchical geometry data set for reference Customers choose a set of layers For example, they can select (continents, countries, cities) or (countries, states, counties) as the hierarchy Big Data Spatial map-reduce job processes the customer data and produces a result file

Vector Data Processing: PointInPolygon operation Consider a large customer data set that is loaded into HDFS We want to read each record, extract geometry information and check if the record is inside a given polygon geometry This can be run as a map-reduce job using our Java API from the Hadoop command line Customer creates a map-reduce job and specifies the JGeometry.Inside() method for record processing Customers write their own map-reduce jobs using our APIs

Spatial Binning 16

Spatial Join Process The spatial join process consists of the following phases: 1. Partitioning 2. Join 3. Duplicate removal p1 p2 join1 join1 join2 join2 DS1 DS1 DS2 join2 DS2 p3 p4 join3 join3 17

HIVE Support for Vector Data Provides support for standards based spatial data types ST_GEOMETRY type with constructors GeoJSON and WKT are natively supported Provides support for standard based spatial operators ST_RELATE, ST_CONTAINS, ST_DISTANCE, ST_BUFFER, etc. ST_GEOMETRY has member methods to output geometry as text AsJSON, AsWKT Provides spatial indexing capability Create a Spatial Index with the java API and use it in HIVE queries

Analysis

Spatial Hive integration Table creation CREATE EXTERNAL TABLE IF NOT EXISTS sample_tweets_index (id STRING, geometry STRING, followers_count STRING, friends_count STRING, location STRING) ROW FORMAT SERDE 'oracle.spatial.hadoop.vector.hive.json.geojsonserde' STORED AS INPUTFORMAT 'oracle.spatial.hadoop.vector.mapred.input.spatialindextextinputformat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.hiveignorekeytextoutputformat' LOCATION '/user/oracle/twitter/data'

Spatial Hive integration HiveQL Query SELECT id, followers_count, friends_count, location FROM sample_tweets WHERE ST_Contains( ST_Polygon('{"type": "Polygon","coordinates": [[[-106, 25], [-106, 30], [-104, 30], [-104, 25], [-106, 25]]]}', 8307), ST_Point(geometry, 8307), 0.5) and followers_count > 50;

Accessing BDSG data from the Database Oracle SQL Connectors for HDFS (OSCH) Files in HDFS are delimited text files (fields must be delimited using single-character markers, such as commas or tabs) Spatial data is stored as GeoJSON or WKT format An external table is defined to access the data from HDFS Spatial operations can be performed from the database using SQL JOIN between a DB table: CINEMA and HDFS table: TWEETS_EXT_TAB_FILE select sdo_geom.sdo_distance(ci.geometry, SDO_GEOMETRY(tw.geometry, 4326), 0.05, 'UNIT=MILE'), ci.name, tw.user_id from CINEMA ci, TWEETS_EXT_TAB_FILE tw where SDO_WITHIN_DISTANCE(ci.geometry, SDO_GEOMETRY(tw.geometry, 4326), 'DISTANCE=0.25 UNIT=MILE') = 'TRUE'

Accessing BDSG data from the Database Big Data SQL to access Spatial data in HDFS Files in HDFS are delimited text files (fields must be delimited using single-character markers, such as commas or tabs) or custom formatted files If Spatial data is stored as GeoJSON or WKT format, we provide the SerDe Spatial data stored in custom format, customers provide the SerDe An external table is defined to access the data from HDFS Spatial operations can be performed from the database using SQL JOIN between a DB table: CINEMA and HDFS table: TWEETS_EXT_TAB_FILE select sdo_geom.sdo_distance(ci.geometry, SDO_GEOMETRY(tw.geometry, 4326), 0.05, 'UNIT=MILE'), ci.name, tw.user_id from CINEMA ci, TWEETS_EXT_TAB_FILE tw where SDO_WITHIN_DISTANCE(ci.geometry, SDO_GEOMETRY(tw.geometry, 4326), 'DISTANCE=0.25 UNIT=MILE') = 'TRUE'

NoSQL Support In Vector API A NoSQL data source can be used as the input data for a Vector API Job A NoSQL input can be specified to a Vector Job in one the following ways Using the Oracle NoSQL Key-Value API Hadoop classes Using the Oracle NoSQL Table API Hadoop classes Using the special NoSQL classes provided by the Vector API

Spatial in Spark

Spatial Resilient Distributed Datasets (RDD) A Spatial RDD provides the regular RDD functionality plus spatial transformations and actions Current Spatial RDD implementations are SpatialJavaRDD and SpatialJavaPairRDD which are subclasses of JavaRDD and JavaPairRDD respectively Spatial transformations take a spatial predicate which can be used to spatially filter the RDD s data The following spatial operations can be used to perform a spatial transformations and actions IsInside Contains AnyInteract WithinDistance MBR Nearest Neighbors 26

SpatialJavaRDD Transformation Example 27

Distributed Spatial Index A Distributed Spatial Index is used to perfom faster spatial searches over a Spatial A spatial index may redistribute and repartition an existing spatial RDD in order to speed up queries A local index can be build for each RDD s partition in order to further accelerate spatial searches This feature is optional when creating a distributed spatial index. An existing Distributed Spatial RDD can be saved and loaded from a local or distributed file system using the save and load DistributedSpatialRDD s method The current Distributed Spatial Index implementation is based on QuadTree 28

HTML 5 API for MapVisualization

Big Data Spatial and Graph Spatial Vector Processing Framework Customer Application Sample Application MapReduce job MapReduce Framework, templates Spatial Operators, Functions S&G Java and HIVE API Spatial Enrichment, Categorization API Customer data Generated data JGeometry format RecordInfoProvider class GeoNames and Hierarchy data Oracle Provided Customer code Geospatial Vector data (any format) HDFS Enrichments, Categorizations results

Program Agenda 1 2 3 4 Overview Vector Data Processing Raster Data Processing Discussion 31

Image Server HDFS storage for the image or raster files We can support dozens of file formats (GDAL supported formats) Images are geo-referenced Images can be in different coordinate systems and resolutions Three main capabilities Loader to load raster data from NFS to HDFS Mosaic and subset operations based on a virtual mosaic Image processing framework for raster analysis

Image Server Loader Customers usually have large volumes of raster data in traditional file systems We provide a GDAL based loader to load the data into HDFS such that the resulting HDFS blocks are organized for map-reduce jobs Many formats supported by GDAL are supported In V1, only support 3 band and single band images with float and byte data types In V2 added support for multi-band images

Raster Data Stored on HDFS When data is moved from traditional file system to HDFS, data should be organized such that a map-reduce job can process it with minimum amount of data transfer between data nodes We explain this concept with a raster data analysis example Raster Analysis with Hadoop Given a DEM, find the shaded relief model from it Storage Models considered Pixel data is partitioned over different HDFS blocks What are the best options for storing the DEM data on HDFS to enable this type of raster processing?

Shaded Relief calculation Input: NXM pixels where each pixel is a floating point number denoting elevation Find the shaded relief from the DEM Algorithm Look at the values of 8 neighbors and the current pixel value and generate a new pixel Needs the neighboring pixel values to calculate the new pixel value corresponding to the current pixel

Raster Data Analysis Framework Traditional algorithm will work very well in Map-Reduce framework Customers specify the pixel overlap required while loading the data into HDFS Once the data is loaded according to this overlap, rest of the processing can be done using standard algorithms Very effective for raster data processing as many map-reduce nodes can work together to produce the result in a short amount of time All the operations are performed on a catalogue of images Customers can logically combine certain number of images into a catalogue or into a virtual mosaic

Image Server Features Subset operation Find the set of images from a given catalogue covering a user specified region and generate a new image from the source files The new images will have user specified resolution and coordinate system These can be different for different images in the source catalogue and the resulting image can have a different value Mosaic the input images to deal with gaps and overlaps Create a new file with the specified file format Image Processing Functions

Raster Processing Local Map algebra operations localnot localif localadd localsubstract localmultiply localdivide localpow localsqrt localround locallog locallog10 localfloor localceil localnegate localabs localsin localcos localtan localsinh localcosh localtanh localasin localacos localatan localdefined localundefined

Shaded Relief Generation from DEM Hypothetical illumination of a surface by determining illumination values for each pixel Custom shaded reliefe algorithm can be plugged into our framework Users need to implement a few classes ImageProcessorInterface write results back in the ImageBandWritable data type 39

Image Processing on Hadoop This framework allows for complex image processing on raster data In this example we describe how a DEM can be processed to generate a shaded relief in addition to doing some raster algebra operations In this demo, we start with a DEM for the NAPA valley area 40

Image Processing on Hadoop The DEM can be processed using custom algorithms to generate a hillshade The whole process will execute as a map reduce program Users can take a non map-reduce custom algorithm and plug it into the framework 41

Raster Algebra on Hadoop We also support pixel level raster algebra operations on this raster data We can take the DEM and generate different versions of hillshaded rasters by selectively removing pixels with certain value Elevation>1000m Elevation>1800m Elevation>2800m 42

Big Data Spatial and Graph Spatial Raster Processing Framework Raster Analysis Application Sample Application Raster MapReduce framework S&G Java API Analysis Algorithms Customer data Subset Mosaic Raster Analysis API Generated data Oracle Provided Customer code Raster Data Files on NFS HDFS Raster catalog Derived Rasters GDAL Loader (GDAL formats) Rasters

Program Agenda 1 2 3 4 Overview Vector Data Processing Raster Data Processing Discussion 44

Resources Oracle Big Data Spatial and Graph OTN product page: www.oracle.com/technetwork/database/database-technologies/bigdata-spatialandgraph White papers, software downloads, documentation and videos Oracle Big Data Lite Virtual Machine - a free sandbox to get started: www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html Hands On Lab for Big Data Spatial: tinyurl.com/bdsg-hol Blog examples, articles & tips: blogs.oracle.com/bigdataspatialgraph Oracle By Example tutorials: www.oracle.com/goto/oll (search Big Data Spatial and Graph ) @OracleBigData, @SpatialHannes, @JeanIhm Oracle Spatial and Graph Group 45