Jaql: Running Pipes in the Clouds
Kevin Beyer, Vuk Ercegovac, Eugene Shekita, Jun Rao, Ning Li, Sandeep Tata
IBM Almaden Research Center
http://code.google.com/p/jaql/

Motivating Scenarios
Driving forces:
- Data characteristics: unstructured/semi-structured (e.g., call-center records, pathology reports, ...), dynamic and continuously evolving (e.g., click streams, application logs, system logs), massive scale (several terabytes to petabytes)
- Nature of analytics: compute intensive and evolving
This demands a flexible platform that can dynamically accommodate new analytic flows.
[Diagrams: three example scenarios built around an analytics platform - advertisement models and market segmentation at facebook.com combining web logs with relational data; traffic analysis for web marketers (e.g., Nielsen on the web) over web server traffic logs and event feeds; and a focused crawl of local events information powering a "find events in a neighborhood" search index and portal with subscribed feeds.]

ES2: Intranet Data Semantic Search
[Diagram: an Import -> Analyze -> Transform -> Export pipeline. IBM intranet sources (crawl, Notes databases, Dogear bookmarks) are imported; information extraction and mining generate indexing artifacts; indexing produces Lucene indexes for the ES2 semantic search application. The pipeline runs on cloud infrastructure: Hadoop map-reduce with Jaql and System T over HDFS.]

ES2's Analysis Stage
[Diagram: within the Import -> Analyze -> Transform -> Export pipeline, the Analyze step expands into Local Analysis (LA), Global Analysis (GA), and Mining, followed by variant generation and translation to Lucene documents for indexing.]
- Local Analysis: System T extracts structured information, e.g., candidate home pages with a person's name
- Global Analysis: Jaql finds the best pages within groups of pages, e.g., the personal homepage
- Mining: map-reduce jobs for acronym extraction, geo-classification, and learning regular expressions

Goals for Jaql
- Semi-structured data manipulation: JSON is the data model; data is converted to/from a JSON view, and most data has a natural JSON representation (relational tables, CSV, XML, ...)
- Easily extended, e.g., with Java, Python, JavaScript, ...
- Exploit massive parallelism using Hadoop: queries are compiled to map-reduce jobs
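To make the "JSON view" idea concrete, here is a small illustration that is not from the deck: the column layout of the CSV line is assumed, but the record shape matches the user data on the next slide. A CSV line such as

    12,Joe Smith,1971-03-07,94114

would appear through Jaql's JSON view as the record

    { id: 12, name: "Joe Smith", bday: date("1971-03-07"), zip: 94114 }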

JSON Examples
([] == array, {} == record or object, xxx: == field name)

Message Log:
[ { from: 17, to: 19, msg: "...", phones: ["+1-415-555-1212", "+1-408-555-1234"], names: ["Jill", "Jenny"] },
  { from: 17, to: 12, msg: "...", names: ["Jack", "Jill"] },
  { from: 17, to: 12, msg: "...", phones: ["+1-415-555-2345"] } ]

User Info:
[ { id: 12, name: "Joe Smith", bday: date("1971-03-07"), zip: 94114 },
  { id: 17, name: "Ann Jones", bday: date("1973-02-04"), zip: 94110 },
  { id: 19, name: "Alicia Fox", bday: date("1975-04-20"), zip: 94114 } ]

City Info:
[ { city: "San Francisco", zips: [94110, 94112, 94114], areacodes: [415] },
  { city: "San Jose", zips: [95120, 95123], areacodes: [408] } ]

Writing a pipeline in Jaql
[Diagram: source -> operator -> operator -> sink]
Task: find users in zip 94114.

Input:
[ { id: 12, name: "Joe Smith", bday: date("1971-03-07"), zip: 94114 },
  { id: 17, name: "Ann Jones", bday: date("1973-02-04"), zip: 94110 },
  { id: 19, name: "Alicia Fox", bday: date("1975-04-20"), zip: 94114 } ]

Output:
[ { id: 12, name: "Joe Smith" },
  { id: 19, name: "Alicia Fox" } ]

Writing a pipeline in Jaql: read

Query:
read("hdfs:users");

Data:
[ { id: 12, name: "Joe Smith", bday: date("1971-03-07"), zip: 94114 },
  { id: 17, name: "Ann Jones", bday: date("1973-02-04"), zip: 94110 },
  { id: 19, name: "Alicia Fox", bday: date("1975-04-20"), zip: 94114 } ]

Writing a pipeline in Jaql: filter

Query:
read("hdfs:users")
-> filter $.zip == 94114;

Data:
[ { id: 12, name: "Joe Smith", bday: date("1971-03-07"), zip: 94114 },
  { id: 19, name: "Alicia Fox", bday: date("1975-04-20"), zip: 94114 } ]

Writing a pipeline in Jaql: transform

Query:
read("hdfs:users")
-> filter $.zip == 94114
-> transform { $.id, $.name };

Data:
[ { id: 12, name: "Joe Smith" },
  { id: 19, name: "Alicia Fox" } ]

Writing a pipeline in Jaql: write

Query:
read("hdfs:users")
-> filter $.zip == 94114
-> transform { $.id, $.name }
-> write("hdfs:inzip");

Data:
[ { id: 12, name: "Joe Smith" },
  { id: 19, name: "Alicia Fox" } ]

Jaql I/O
[Diagram: the Jaql interpreter reads "users" and writes "inzip" through an I/O layer that converts between stored data and JSON.]
- The I/O layer abstracts the details of data location and format
- Example data stores: HDFS, HBase, Amazon's S3, local FS, HTTP requests, JDBC calls
- Example data formats: JSON text, CSV, XML (the default is binary JSON)
- JSON is used as the data model
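Following the URI style in this deck's examples, reading from different stores would look like the sketch below. Only the "hdfs:" and "hbase:" forms appear in the deck; the "file:" scheme on the last line is an assumption about how a local-FS read might be spelled.

read("hdfs:users");        // HDFS, binary JSON by default
read("hbase:cities");      // HBase table (used in the CoGroup example later)
read("file:users.json");   // local file system; the "file:" scheme is assumed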

Storing Files in HDFS
- A file is divided into blocks; the default block size is 64 MB
- Blocks are replicated across HDFS nodes; the default replication factor is three
- The name node tracks file names and block locations
[Diagram: a file /dir/file mapped by the name node to its blocks on the data nodes.]
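As a concrete illustration (the file size here is chosen for the example, not taken from the deck): a 200 MB file is split into four blocks, three of 64 MB plus one 8 MB remainder; with three-way replication the cluster stores 4 x 3 = 12 block replicas, while the name node records only the file-to-block mapping and each replica's location.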

Map Jobs in Hadoop
- JobConf: specifies the job configuration
- InputFormat: abstract data source; creates Splits and a RecordReader
- Split: abstract notion of a data chunk; each split is handled by one map task and contains location hints
- RecordReader: produces a stream of values from a Split
- Map task: processes the values in one Split, calls the user's map code, and is re-executed on error
- OutputFormat: abstract data sink
[Diagram: [V] -> Map -> [V], with several map tasks (M) running in parallel.]
The infrastructure handles parallelism, load balancing, and fault tolerance.

Jaql using Map

read("hdfs:users")
-> filter $.zip == 94114
-> transform { $.id, $.name }
-> write("hdfs:inzip");

[Diagram: the rewrite engine turns the pipeline into parallel map tasks, each evaluating $ -> filter $.zip == 94114 -> transform { $.id, $.name } on its split of "users" and writing to "inzip".]

Equivalent map-reduce job in Jaql:
mapreduce({
  input : { uri: "hdfs:users" },
  map   : fn($) ( $ -> filter $.zip == 94114
                    -> transform { $.id, $.name } ),
  output: { uri: "hdfs:inzip" }
})

User-defined Functions
Identify phone numbers in each message:

read("hdfs:rawmsgs")
-> transform { $.*, phones: findphones($.msg) };

Express user code with Java, inline scripting, ...

Input data:
[ { from: 17, to: 19, msg: "..." },
  { from: 17, to: 12, msg: "..." },
  { from: 17, to: 19, msg: "..." } ]

Output data:
[ { from: 17, to: 19, msg: "...", phones: ["+1-415-555-1212", "+1-408-555-1234"] },
  { from: 17, to: 12, msg: "..." },
  { from: 17, to: 19, msg: "...", phones: ["+1-415-555-2345"] } ]

Inline Scripting

script python <<END
import re
import phoneutil

phonere = re.compile('[+x]?[0-9\-()]+')

def findphones(text):
    for p in phonere.findall(text):
        p = phoneutil.standardize(p)
        if p != None:
            yield p
END;

read("hdfs:rawmsgs")
-> transform { $.*, phones: findphones($.msg) };

Easily integrates with other programming languages: Python first, then JavaScript, Ruby, Perl, Unix shell, etc.

Grouping
Find the number of times each sender mentioned an area code:

read("hdfs:rawmsgs")
-> expand for( $p in findphones($.msg) ) [{ $.from, area: area($p) }]
-> group by $g = ({ $.from, $.area })
   into { $g.*, n: count($) };

Input data:
[ { from: 17, to: 19, msg: "..." },
  { from: 17, to: 12, msg: "..." },
  { from: 17, to: 19, msg: "..." } ]

After expand:
[ { from: 17, area: 415 },
  { from: 17, area: 408 },
  { from: 17, area: 415 } ]

Output data:
[ { from: 17, area: 415, n: 2 },
  { from: 17, area: 408, n: 1 } ]

Map-Reduce Jobs in Hadoop
[Diagram: [V] -> Map -> [K,V] -> Shuffle (partition, sort) -> [K,[V]] -> Reduce -> [V], with parallel map tasks (M) and reduce tasks (R).]

Map expression:
$ -> expand for( $p in findphones($.msg) ) [{ $.from, area: area($p) }]
  -> transform [{ $.from, $.area }, $]

Reduce expression:
$ -> transform { $g.*, n: count($) }

mapreduce({
  input : { uri: "hdfs:rawmsgs" },
  map   : fn($) ( $ -> expand for( $p in findphones($.msg) ) [{ $.from, area: area($p) }]
                    -> transform [{ $.from, $.area }, $] ),
  reduce: fn($g,$) ( $ -> transform { $g.*, n: count($) } ),
  output: { uri: "hdfs:out" }
})

Exploiting Combiners with Algebraic Aggregates
[Diagram: the same map/shuffle/reduce flow, where map tasks perform initial aggregation, combiners (C) perform partial aggregation before and after the shuffle, and reduce tasks perform final aggregation.]
Because count is algebraic, each map task's partial counts can be summed at the reducers instead of shipping every value through the shuffle.

mraggregate({
  input : { uri: "hdfs:rawmsgs" },
  map   : fn($) ( $ -> expand for( $p in findphones($.msg) ) [{ $.from, area: area($p) }]
                    -> transform [{ $.from, $.area }, $] ),
  aggregate: fn($g,$) ( [ count($) ] ),
  final : fn($g,$) ( $ -> transform { $g.*, n: $[0] } ),
  output: { uri: "hdfs:out" }
})

Composite Operators
[Diagram: a composite operator combining multiple input pipes into one output pipe.]
Examples (a join sketch follows below):
- Join: join two or more inputs on a key; inner/outer/full; multi-predicate, multi-way
- Merge: concatenate all inputs in any order
- Union, Intersect, Difference
- User-defined functions
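The deck lists Join without showing its syntax. Based on the pipeline style above and the running example data, a two-way inner join might look like the sketch below; the join/where/into clause forms are assumptions drawn from later Jaql documentation, not from this deck.

$users  = read("hdfs:users");
$cities = read("hbase:cities") -> expand unroll $.zips;  // one record per zip

join $u in $users, $c in $cities
where $u.zip == $c.zips
into { $u.name, city: $c.city };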

CoGroup
Find the users in each zip code and the name of the city:

$users = read("hdfs:users");
$cities = read("hbase:cities") -> expand unroll $.zips;

group $users by $zip = ($users.zip),
      $cities by $zip = ($cities.zips)
into { $zip,
       n: count($users),
       users: $users[*].name,
       city: singleton($cities[*].city) };

Input data:
users:
[ { id: 12, name: "Joe Smith", zip: 94114 },
  { id: 17, name: "Ann Jones", zip: 95120 },
  { id: 19, name: "Alicia Fox", zip: 94114 } ]
cities:
[ { city: "San Francisco", zips: [94112, 94114], areas: [415] },
  { city: "San Jose", zips: [95120], areas: [408] } ]

After group:
[ { zip: 94114, n: 2, users: ["Joe Smith", "Alicia Fox"], city: "San Francisco" },
  { zip: 95120, n: 1, users: ["Ann Jones"], city: "San Jose" },
  { zip: 94112, n: 0, users: [], city: "San Francisco" } ]

Multiple Outputs: tee
Send each input item to an additional output pipe:

read("hdfs:users")
-> tee( filter $.zip == 94114
        -> write("hdfs:zip94114") )
-> filter $.bday < date("1979-01-01")
-> write("hdfs:over30");
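Each branch of the tee materializes an ordinary output, so downstream scripts can consume either result independently; a minimal sketch using only constructs shown above:

read("hdfs:zip94114") -> transform $.name;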

Rough Unix Analogs of Jaql

Unix                     Jaql
< filename, cat          read, merge
grep                     filter
join                     join
cut, paste, sed, tr      transform
sort                     sort
head                     top
uniq                     distinct
> filename               write
tee                      tee

Unix pipes carry a stream of bytes/lines; Jaql pipes carry a stream of JSON items, with more structure and types.
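To make the analogy concrete, the zip-code pipeline from earlier lines up with a Unix pipeline stage by stage; the Unix command in the first comment is illustrative, not from the deck:

// Unix analog: cat users | grep 94114 | cut -f1,2 > inzip
read("hdfs:users")                  // < filename / cat
-> filter $.zip == 94114            // grep
-> transform { $.id, $.name }       // cut
-> write("hdfs:inzip");             // > filename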

Other Map/Reduce Languages
- Java: the most basic way to write a map/reduce job in Hadoop; the user implements the Mapper and Reducer interfaces and creates a JobConf
- Hadoop Streaming: map and reduce tasks are implemented by external processes in any language, reading and writing data on stdin/stdout
- Pig: nested relational model, with an added Map type for dynamic columns; from Yahoo; a Hadoop subproject
- Hive: nested relational model with a Map type for dynamic columns and an Array type for lists; SQL-like syntax; from Facebook; a Hadoop subproject
- Cascading: a dataflow language, mostly relational; expressions are UDFs; Groovy extension
- Proprietary and non-Hadoop: Google Sawzall, Microsoft DryadLINQ, Microsoft Scope

Current Roadmap
- New syntax (presented here): about to be released to open source
- Robustness: exception handling, ...
- Features: programming/scripting language integration, schema, tooling, ...
- Performance: query optimization, compact storage, runtime tuning, ...