Data Platforms and Pattern Mining

Size: px
Start display at page:

Download "Data Platforms and Pattern Mining"

Transcription

1 Morteza Zihayat Data Platforms and Pattern Mining IBM Corporation

2 About Myself IBM Software Group Big Data Scientist 4Platform Computing, IBM (2014 Now) PhD Candidate (2011 Now) 4Lassonde School of Engineering, York University Main research interests 4Big Data Mining and Engineering Finding Meaningful Patterns from Structured and Unstructured Big Data Parallel and Distributed Data Mining 4Mining Dynamic and Stream Data 4Social Network Analysis 2

3 Agenda IBM Software Group How Big is this Big Data? Big Data and Apache Hadoop Apache Spark: Lightning-fast cluster computing IBM Platform Conductor for Spark A Case Study 4Mining High Utility Sequential Patterns 4User behavioral analysis Globe and Mail Newspaper 3

4 IBM Software Group How Big is this Big Data? 2,9 MILLION 375 MEGABYTES 20 HOURS 24 PETABYTES NUMBER OF MAILS SENT EVERY SECOND DATA CONSUMED BY HOUSEHOLDS EACH DAY VIDEO UPLOADED TO YOUTUBE EVERY MINUTE DATA PER DAY PROCESSED BY GOOGLE 50 MILLION 700 BILLION 1,3 EXABYTES 398 ITEMS TWEETS PER DAY TOTAL MINUTES SPENT ON FACEBOOK EACH MONTH DATA SENT AND RECEIVED BY MOBILE INTERNET USERS PRODUCTS ORDERED ON AMAZON PER SECOND 4

5 What is Big Data? IBM Software Group Big Data. 5

6 Challenges IBM Software Group Read/Write to disk is slow 4Use multiple disks for parallel read Hardware failure 4Single machine/disk failure 4Keep multiple copies of data How do we merge data from different reads 4Distributed processing or Hadoop MapReduce 6

7 Apache Hadoop IBM Software Group Hadoop is an open-source software framework written in Java 4Distributed storage 4Distributed processing of very large data sets 4Computer clusters Two main components 4HDFS (Hadoop Distributed File System) Provides Distributed Storage 4MapReduce (Distributed Data Processing Model) Provides Distributed Processing 7

8 HDFS IBM Software Group Based on Google s GFS (Google File System) Data is distributed across all nodes Build around the idea of write-once, read-many-times (WORM) File are stored as Blocks 464MB 4Different blocks of the same file are stored on different nodes 4Same block is replicated across several nodes for redundancy 8

9 IBM Software Group HDFS Architecture NameNode 4It stores metadata (No of blocks, On which rack which DataNode the data is stored and other details) DataNode 4It stores the actual data. There is only one NameNode in a cluster and many DataNodes 9

10 MapReduce IBM Software Group MapReduce is a method for distributing a task across multiple nodes Consists of two phases 4Map 4Reduce The data is process in the form of <key,value> Each map task processes a discrete portion of the overall data After all Maps are complete, the system distributes the intermediate data to nodes which perform Reduce phase (aggregation) 10

11 IBM Software Group Example: Word counting Consider the problem of counting the number of occurrences of each word in a large collection of documents Input: documents Output: <word,frequency> Divide collection of documents among the class. Sum up the counts from all the documents to give final answer Each students gives count of individual word in a document. Repeats for assigned quota of documents 11

12 IBM Software Group Word Count Execution Input Output the quick brown fox the fox ate the cow Student 1 Student 2 the, 1 brown, 1 fox, 1 quick, 1 the, 1 fox, 1 the, 1 ate,1 cow,1 brown, 1 Student 1 brown, 1 brown, 2 fox, 2 how, 1 now, 1 the, 3 how now brown cow Student 3 how, 1 now, 1 brown, 1 cow,1 Student 2 ate, 1 cow, 2 quick, 1 12

13 IBM Software Group Word Count Execution Input Map tasks Reduce tasks Output the quick brown fox the fox ate the cow Map Map the, 1 brown, 1 fox, 1 quick, 1 the, 1 fox, 1 the, 1 ate,1 cow,1 brown, 1 brown, 1 Reduce brown, 2 fox, 2 how, 1 now, 1 the, 3 how now brown cow Map how, 1 now, 1 brown, 1 cow,1 Reduce ate, 1 cow, 2 quick, 1 13

14 IBM Software Group MapReduce: Execution overview Master Server distributes M map tasks to machines and monitors their progress. Map task reads the allocated data, saves the map results in local buffer. Shuffle phase assigns reducers to these buffers, which are remotely read and processed by reducers. Reducers output the result on stable storage. 14

15 IBM Software Group MapReduce on a Cluster with Hadoop HDFS 15

16 Problem #1 IBM Software Group MapReduce I/O sandbags runtime for advanced analytics. 4Must persist results after each pass through data 4Advanced analytics often requires multiple passes through data Hadoop Storage Compute Store Hadoop Storage 16

17 IBM Software Group Example: Logistic Regression Goal: find best line separating two sets of points random initial line target Running Time (s) Hadoop Number of Iterations 17

18 Problem #2 IBM Software Group Many point solutions for advanced analytics in Hadoop 4Queries Apache drill,. 4Machine Learning Mahout, H2o, 4Graph Analytics Graphlab, 4Streaming Analytics S4, 18

19 Apache Spark IBM Software Group It was developed in response to limitations in the MapReduce Originally developed in 2009 in UC Berkeley s AMP Lab Fully open sourced in 2010 now a Top Level Project at the Apache Software Foundation 19

20 IBM Software Group Easy and Fast Big Data Easy to Develop 4Rich APIs in Java, Scala, Python 4Interactive shell Fast to Run 4In-memory storage 2-5 less code Up to 10 faster on disk, 100 in memory

21 IBM Software Group Resilient Distributed Datasets (RDD) Spark revolves around RDDs Fault-tolerant collection of elements that can be operated on in parallel 4Parallelized Collection: Scala collection which is run in parallel 4Hadoop Dataset: records of files supported by Hadoop 4The data flow from one iteration to another happens through memory Doesn't touch the disk. 4When the memory is not sufficient enough for the data to fit it It can be either spilled to the drive or is just left to be recreated upon request for the same

22 RDD Operations Transformations IBM Software Group 4Creation of a new dataset from an existing map, filter, distinct, union, sample, groupbykey, join, etc Actions 4Return a value after running a computation collect, count, first, takesample, foreach, etc

23 IBM Software Group RDD Persistence / Caching Variety of storage levels 4memory_only (default), memory_and_disk, etc API Calls 4persist(StorageLevel) 4cache() shorthand for persist(storagelevel.memory_only) Considerations 4Read from disk vs. recompute (memory_and_disk) 4Total memory storage size (memory_only_ser) 4Replicate to second node for faster fault recovery (memory_only_2) Think about this option if supporting a web application

24 IBM Software Group Cache Scaling Matters Execution time (s) Cache disabled 25% 50% 75% Fully cached % of working set in cache

25 IBM Software Group Logistic Regression Performance Running Time (s) Number of Iterations Hadoop Spark 25

26 IBM Software Group A single integrated platform for advanced analytics 26

27 IBM Software Group Spark Performance Machine Learning 4100x faster than MapReduce Queries (Shark) 4100x faster than Hive Streaming 42X throughput of Storm Graph (GraphX) 410X faster than MapReduce 27

28 IBM Software Group IBM Platform Conductor for Spark Addresses the following challenges 4Integrating Spark into existing environments 4Managing Spark lifecycles in the face of frequent updates to open-source Spark distributions 4Managing numerous Spark deployments within the organization Benefits 4Cuts costs by using a service orchestration framework to maximize resource utilization 4Improves performance and efficiency 4Simplifies application management 28

29 29 A Case Study High Utility Sequential Pattern Mining in Big Data IBM Corporation

30 IBM Software Group Frequent Pattern Mining Frequent Pattern Mining (FPM) 4FPM is a fundamental research topic in data mining 4Example application Discover sets of items (i.e., itemsets) that are frequently purchased together by customers Minimum support threshold: 60% TID T 1 T 2 T 3 T 4 T 5 T 6 Transaction {Bread, Milk} {Bread, Milk} {Bread, Milk, Diaper, Beer} {Bread, Milk, Diaper, Beer} {Diamond, Necklace} {Diamond, Necklace} Sup({Bread, Milk}) = 4/6 = 66.6% {Bread, Milk} is a frequent itemset 30

31 IBM Software Group Domain Driven Actionable Knowledge Discovery The identified patterns are handed over to business people They cannot interpret the patterns for business use 4There are many patterns/not informative 4Not interested to business needs. 4How to interpret the patterns to business actions 31

32 IBM Software Group Insufficiency of Frequent Pattern Mining In Market Analysis 4Business objective: Increase Revenue Which patterns can bring high profits and make high revenue? 4May lose infrequent but valuable patterns 4May present too many frequent but unprofitable patterns 4Cannot find patterns having high profits 32

33 IBM Software Group A Motivation Example TID T 1 T 2 T 3 T 4 T 5 T 6 Transaction {Bread(1), Milk(1)} {Bread(1), Milk(1)} {Bread(1), Milk(1), Diaper(3), Beer(6)} {Bread(1), Milk(1), Diaper(3), Beer(6)} {Diamond(1), Necklace(1)} {Diamond(1), Necklace(1)} Item Unit Profit Bread 20 Milk 30 Diamond 1,000 Necklace 300 Diaper 300 Beer 70 {Bread, Milk}: $200 {Diamond, Necklace}: $2600 {Diaper, Beer}:$

34 IBM Software Group High Utility Sequential Pattern Mining Given a set of sequences: find all sequences whose utility is > a userspecified minimum threshold 4Each item has quantity in a transaction 4Each item has a value (e.g., price) Items Profit Milk $3 Egg $2 Birthday Cake $20 Birthday Card $10 Bread $1 CID TID C1 T1 {(Bread,2), (Milk,6)} C1 T2 {(Birthday Card,2)} C1 T3 { (Birthday Cake,2), (egg,3)} C2 T3 {(Bread,2), (Milk,4), (Yoghurt,3), (Tuna,5)} C2 T4 {(egg,5), (Pizza,4), (Juice,2)} C3 T5 { (Bread,2),( Yoghurt,4), (Milk,3)} C3 T6 {(Milk,1), (cheese,2)} 34

35 What is utility? IBM Software Group Utility of item in a transaction = internal utility (quantity of items in the transaction) x external utility (profit of the item). 4U(Milk,T1 ) = 3 6 = 18 Utility of itemset in a sequence = sum of utilities of its items: 4U({Bread, Milk}, C1)= = 20 Utility of sub-sequence in a sequence = sum of its itemsets utilities 4U(<{Milk}{egg}>,C1) = = 24 4 If more than one occurrence, then maximum value among occurrences Items Profit Milk $3 Egg $2 Birthday Cake $20 Birthday Card $10 Bread $1 CID TID C1 T1 {(Bread,2), (Milk,6)} C1 T2 {(Birthday Card,2)} C1 T3 { (Birthday Cake,2), (egg,3)} C2 T3 {(Bread,2), (Milk,4), (Yoghurt,3), (Tuna,5)} C2 T4 {(egg,5), (Pizza,4), (Juice,2)} C3 T5 { (Bread,2),( Yoghurt,4), (Milk,3)} C3 T6 {(Milk,1), (cheese,2)} 35

36 36 A Rule-Based News Recommendation System IBM Corporation

37 IBM Software Group Globe and Mail Dataset Globe and Mail Inc. 4Corpus of news (142,163,909 articles) 4Web clickstream 2 billion records Each record A sequence of visited news 245 attributes 4 TB 37

38 IBM Software Group A Rule-Based Recommendation System User log Phase 1 Discover Frequent Patterns from user logs <News1,New2> Offline Phase 2 Extract rules from the patterns News1 News2 Phase 3 Online 38

39 Goal IBM Software Group A rule based news recommendation engine 4News1 News2 It is quite probable that not all the pages visited by the user are of interest to him/her The value or importance of a news changes over time 39

40 IBM Software Group A Utility Based News Recommendation System cursor movement Time Spent Editorial Board freshness Rule-Based Recommendation System Navigational rules 40

41 IBM Software Group The Proposed Framework User log Phase 1 Discover High Utility Patterns from user logs <News1,New2> Offline Phase 2 Extract Utility-Based rules from the patterns News1 News2 Online Phase 3 Recommend articles based on current user activities 41

42 IBM Software Group Preliminary Experiment Only records 4Over 32 GB memory usage in average 4Around 55 mins run time With the exponential growth of data 4It is impossible to execute algorithms on a single machine 42

43 BigHUSP IBM Software Group 43

44 BigHUSP IBM Software Group 44

45 BigHUSP IBM Software Group Partition data In each mapper Scan input sequences Construct UMatrixs, Calculate required parameters: threshold, Remove all blocks for sequences from memory Store Umatrixs in Memory-Disk to re-use 45

46 BigHUSP IBM Software Group Call Uspan to mine local HUSPs using the local UMatrixs and threshold Output Local HUSPs and their utilities 46

47 BigHUSP IBM Software Group Aggregate local HUSPs If the pattern does not exist Estimate its utility Output Global candidates and their approximate utilities 47

48 BigHUSP IBM Software Group 48

49 IBM Software Group Our recommendations keep user longer 2,537 minutes spent by 107 users 380 minutes spent by 254 users 49

50 IBM Software Group Experimental Results 50

51 Summary IBM Software Group What Big Data is What Hadoop is and why it is important Hadoop main components What Spark is Why Spark is popular High Utility Pattern Mining Rule-Based News Recommendation System Big Data Analysis Using IBM Conductor for Spark 51

52 IBM Software Group THANKS! 52

53 Q1 IBM Software Group According to Spark advocates, how much faster can Apache Spark potentially run batch-processing programs when processed in memory than MapReduce can? 10 times faster 20 times faster 100 times faster 200 times faster 53

54 Q2 IBM Software Group What is the name of the programming framework originally developed by Google that supports the development of applications for processing large data sets in a distributed computing environment? MapReduce Hive Spark 54

55 Q3 IBM Software Group True or false? A big data analytics strategy is often defined by the three V's volume, variety and velocity -- which is helpful but ignores other commonly cited characteristics, such as complexity and variability. True False 55

56 Q4 IBM Software Group Just collecting and storing information isn't enough to produce real business value. Big data analytics technologies are necessary to: 4Formulate eye-catching charts and graphs 4Extract valuable insights from the data 4Integrate data from internal and external sources 56

57 Q5 IBM Software Group Which of the following application types can Spark run in addition to batchprocessing jobs? 4Stream processing 4Machine learning 4Graph processing 4All of the above 57

58 Q6 IBM Software Group Which of the following is NOT a characteristic shared by Hadoop and Spark? 4Both are data processing platforms 4Both are cluster computing environments 4Both have their own file system 4Both use open source APIs to link between different tools 58

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context 1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

Distributed Computation Models

Distributed Computation Models Distributed Computation Models SWE 622, Spring 2017 Distributed Software Engineering Some slides ack: Jeff Dean HW4 Recap https://b.socrative.com/ Class: SWE622 2 Review Replicating state machines Case

More information

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab Apache is a fast and general engine for large-scale data processing aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes Low latency: sub-second

More information

CSE 444: Database Internals. Lecture 23 Spark

CSE 444: Database Internals. Lecture 23 Spark CSE 444: Database Internals Lecture 23 Spark References Spark is an open source system from Berkeley Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei

More information

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018 Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning

More information

Hadoop/MapReduce Computing Paradigm

Hadoop/MapReduce Computing Paradigm Hadoop/Reduce Computing Paradigm 1 Large-Scale Data Analytics Reduce computing paradigm (E.g., Hadoop) vs. Traditional database systems vs. Database Many enterprises are turning to Hadoop Especially applications

More information

Spark Overview. Professor Sasu Tarkoma.

Spark Overview. Professor Sasu Tarkoma. Spark Overview 2015 Professor Sasu Tarkoma www.cs.helsinki.fi Apache Spark Spark is a general-purpose computing framework for iterative tasks API is provided for Java, Scala and Python The model is based

More information

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing Big Data Analytics Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 Big Data "The world is crazy. But at least it s getting regular analysis." Izabela

More information

Spark, Shark and Spark Streaming Introduction

Spark, Shark and Spark Streaming Introduction Spark, Shark and Spark Streaming Introduction Tushar Kale tusharkale@in.ibm.com June 2015 This Talk Introduction to Shark, Spark and Spark Streaming Architecture Deployment Methodology Performance References

More information

Clustering Lecture 8: MapReduce

Clustering Lecture 8: MapReduce Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data

More information

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia MapReduce Spark Some slides are adapted from those of Jeff Dean and Matei Zaharia What have we learnt so far? Distributed storage systems consistency semantics protocols for fault tolerance Paxos, Raft,

More information

An Introduction to Big Data Analysis using Spark

An Introduction to Big Data Analysis using Spark An Introduction to Big Data Analysis using Spark Mohamad Jaber American University of Beirut - Faculty of Arts & Sciences - Department of Computer Science May 17, 2017 Mohamad Jaber (AUB) Spark May 17,

More information

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018 Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster

More information

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?

More information

Chapter 4: Apache Spark

Chapter 4: Apache Spark Chapter 4: Apache Spark Lecture Notes Winter semester 2016 / 2017 Ludwig-Maximilians-University Munich PD Dr. Matthias Renz 2015, Based on lectures by Donald Kossmann (ETH Zürich), as well as Jure Leskovec,

More information

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop

More information

Big Data Infrastructures & Technologies

Big Data Infrastructures & Technologies Big Data Infrastructures & Technologies Spark and MLLIB OVERVIEW OF SPARK What is Spark? Fast and expressive cluster computing system interoperable with Apache Hadoop Improves efficiency through: In-memory

More information

Massive Online Analysis - Storm,Spark

Massive Online Analysis - Storm,Spark Massive Online Analysis - Storm,Spark presentation by R. Kishore Kumar Research Scholar Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur Kharagpur-721302, India (R

More information

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

Processing of big data with Apache Spark

Processing of big data with Apache Spark Processing of big data with Apache Spark JavaSkop 18 Aleksandar Donevski AGENDA What is Apache Spark? Spark vs Hadoop MapReduce Application Requirements Example Architecture Application Challenges 2 WHAT

More information

MapReduce, Hadoop and Spark. Bompotas Agorakis

MapReduce, Hadoop and Spark. Bompotas Agorakis MapReduce, Hadoop and Spark Bompotas Agorakis Big Data Processing Most of the computations are conceptually straightforward on a single machine but the volume of data is HUGE Need to use many (1.000s)

More information

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Cloudera s Developer Training for Apache Spark and Hadoop delivers the key concepts and expertise need to develop high-performance

More information

DATA SCIENCE USING SPARK: AN INTRODUCTION

DATA SCIENCE USING SPARK: AN INTRODUCTION DATA SCIENCE USING SPARK: AN INTRODUCTION TOPICS COVERED Introduction to Spark Getting Started with Spark Programming in Spark Data Science with Spark What next? 2 DATA SCIENCE PROCESS Exploratory Data

More information

Cloud Computing & Visualization

Cloud Computing & Visualization Cloud Computing & Visualization Workflows Distributed Computation with Spark Data Warehousing with Redshift Visualization with Tableau #FIUSCIS School of Computing & Information Sciences, Florida International

More information

Applied Spark. From Concepts to Bitcoin Analytics. Andrew F.

Applied Spark. From Concepts to Bitcoin Analytics. Andrew F. Applied Spark From Concepts to Bitcoin Analytics Andrew F. Hart ahart@apache.org @andrewfhart My Day Job CTO, Pogoseat Upgrade technology for live events 3/28/16 QCON-SP Andrew Hart 2 Additionally Member,

More information

Resilient Distributed Datasets

Resilient Distributed Datasets Resilient Distributed Datasets A Fault- Tolerant Abstraction for In- Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin,

More information

Programming Systems for Big Data

Programming Systems for Big Data Programming Systems for Big Data CS315B Lecture 17 Including material from Kunle Olukotun Prof. Aiken CS 315B Lecture 17 1 Big Data We ve focused on parallel programming for computational science There

More information

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team Introduction to Hadoop Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since

More information

Big Data Architect.

Big Data Architect. Big Data Architect www.austech.edu.au WHAT IS BIG DATA ARCHITECT? A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional

More information

An Introduction to Apache Spark

An Introduction to Apache Spark An Introduction to Apache Spark 1 History Developed in 2009 at UC Berkeley AMPLab. Open sourced in 2010. Spark becomes one of the largest big-data projects with more 400 contributors in 50+ organizations

More information

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop

More information

BIG DATA TESTING: A UNIFIED VIEW

BIG DATA TESTING: A UNIFIED VIEW http://core.ecu.edu/strg BIG DATA TESTING: A UNIFIED VIEW BY NAM THAI ECU, Computer Science Department, March 16, 2016 2/30 PRESENTATION CONTENT 1. Overview of Big Data A. 5 V s of Big Data B. Data generation

More information

Dell In-Memory Appliance for Cloudera Enterprise

Dell In-Memory Appliance for Cloudera Enterprise Dell In-Memory Appliance for Cloudera Enterprise Spark Technology Overview and Streaming Workload Use Cases Author: Armando Acosta Hadoop Product Manager/Subject Matter Expert Armando_Acosta@Dell.com/

More information

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ. Big Data Programming: an Introduction Spring 2015, X. Zhang Fordham Univ. Outline What the course is about? scope Introduction to big data programming Opportunity and challenge of big data Origin of Hadoop

More information

Certified Big Data Hadoop and Spark Scala Course Curriculum

Certified Big Data Hadoop and Spark Scala Course Curriculum Certified Big Data Hadoop and Spark Scala Course Curriculum The Certified Big Data Hadoop and Spark Scala course by DataFlair is a perfect blend of indepth theoretical knowledge and strong practical skills

More information

Introduction to Data Mining and Data Analytics

Introduction to Data Mining and Data Analytics 1/28/2016 MIST.7060 Data Analytics 1 Introduction to Data Mining and Data Analytics What Are Data Mining and Data Analytics? Data mining is the process of discovering hidden patterns in data, where Patterns

More information

A BigData Tour HDFS, Ceph and MapReduce

A BigData Tour HDFS, Ceph and MapReduce A BigData Tour HDFS, Ceph and MapReduce These slides are possible thanks to these sources Jonathan Drusi - SCInet Toronto Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing SICS; Yahoo!

More information

Databases 2 (VU) ( / )

Databases 2 (VU) ( / ) Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:

More information

Shark: SQL and Rich Analytics at Scale. Michael Xueyuan Han Ronny Hajoon Ko

Shark: SQL and Rich Analytics at Scale. Michael Xueyuan Han Ronny Hajoon Ko Shark: SQL and Rich Analytics at Scale Michael Xueyuan Han Ronny Hajoon Ko What Are The Problems? Data volumes are expanding dramatically Why Is It Hard? Needs to scale out Managing hundreds of machines

More information

Spark 2. Alexey Zinovyev, Java/BigData Trainer in EPAM

Spark 2. Alexey Zinovyev, Java/BigData Trainer in EPAM Spark 2 Alexey Zinovyev, Java/BigData Trainer in EPAM With IT since 2007 With Java since 2009 With Hadoop since 2012 With EPAM since 2015 About Secret Word from EPAM itsubbotnik Big Data Training 3 Contacts

More information

Memory Management for Spark. Ken Salem Cheriton School of Computer Science University of Waterloo

Memory Management for Spark. Ken Salem Cheriton School of Computer Science University of Waterloo Memory Management for Spark Ken Salem Cheriton School of Computer Science University of aterloo here I m From hat e re Doing Flexible Transactional Persistence DBMS-Managed Energy Efficiency Non-Relational

More information

Big Data Hadoop Course Content

Big Data Hadoop Course Content Big Data Hadoop Course Content Topics covered in the training Introduction to Linux and Big Data Virtual Machine ( VM) Introduction/ Installation of VirtualBox and the Big Data VM Introduction to Linux

More information

Map Reduce. Yerevan.

Map Reduce. Yerevan. Map Reduce Erasmus+ @ Yerevan dacosta@irit.fr Divide and conquer at PaaS 100 % // Typical problem Iterate over a large number of records Extract something of interest from each Shuffle and sort intermediate

More information

Fast, Interactive, Language-Integrated Cluster Computing

Fast, Interactive, Language-Integrated Cluster Computing Spark Fast, Interactive, Language-Integrated Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica www.spark-project.org

More information

An Overview of Apache Spark

An Overview of Apache Spark An Overview of Apache Spark CIS 612 Sunnie Chung 2014 MapR Technologies 1 MapReduce Processing Model MapReduce, the parallel data processing paradigm, greatly simplified the analysis of big data using

More information

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development:: Title Duration : Apache Spark Development : 4 days Overview Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized

More information

An Introduction to Apache Spark

An Introduction to Apache Spark An Introduction to Apache Spark Anastasios Skarlatidis @anskarl Software Engineer/Researcher IIT, NCSR "Demokritos" Outline Part I: Getting to know Spark Part II: Basic programming Part III: Spark under

More information

A brief history on Hadoop

A brief history on Hadoop Hadoop Basics A brief history on Hadoop 2003 - Google launches project Nutch to handle billions of searches and indexing millions of web pages. Oct 2003 - Google releases papers with GFS (Google File System)

More information

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect Big Data Big Data Analyst INTRODUCTION TO BIG DATA ANALYTICS ANALYTICS PROCESSING TECHNIQUES DATA TRANSFORMATION & BATCH PROCESSING REAL TIME (STREAM) DATA PROCESSING Big Data Engineer BIG DATA FOUNDATION

More information

Cloud Computing 3. CSCI 4850/5850 High-Performance Computing Spring 2018

Cloud Computing 3. CSCI 4850/5850 High-Performance Computing Spring 2018 Cloud Computing 3 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning

More information

Certified Big Data and Hadoop Course Curriculum

Certified Big Data and Hadoop Course Curriculum Certified Big Data and Hadoop Course Curriculum The Certified Big Data and Hadoop course by DataFlair is a perfect blend of in-depth theoretical knowledge and strong practical skills via implementation

More information

HADOOP FRAMEWORK FOR BIG DATA

HADOOP FRAMEWORK FOR BIG DATA HADOOP FRAMEWORK FOR BIG DATA Mr K. Srinivas Babu 1,Dr K. Rameshwaraiah 2 1 Research Scholar S V University, Tirupathi 2 Professor and Head NNRESGI, Hyderabad Abstract - Data has to be stored for further

More information

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems Announcements CompSci 516 Database Systems Lecture 12 - and Spark Practice midterm posted on sakai First prepare and then attempt! Midterm next Wednesday 10/11 in class Closed book/notes, no electronic

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Map Reduce & Hadoop Recommended Text: Hadoop: The Definitive Guide Tom White O Reilly 2010 VMware Inc. All rights reserved Big Data! Large datasets are becoming more common The New York Stock Exchange

More information

Twitter data Analytics using Distributed Computing

Twitter data Analytics using Distributed Computing Twitter data Analytics using Distributed Computing Uma Narayanan Athrira Unnikrishnan Dr. Varghese Paul Dr. Shelbi Joseph Research Scholar M.tech Student Professor Assistant Professor Dept. of IT, SOE

More information

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The

More information

A Parallel R Framework

A Parallel R Framework A Parallel R Framework for Processing Large Dataset on Distributed Systems Nov. 17, 2013 This work is initiated and supported by Huawei Technologies Rise of Data-Intensive Analytics Data Sources Personal

More information

Embedded Technosolutions

Embedded Technosolutions Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication

More information

Principal Software Engineer Red Hat Emerging Technology June 24, 2015

Principal Software Engineer Red Hat Emerging Technology June 24, 2015 USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton Principal Software Engineer Red Hat Emerging Technology June 24, 2015 ABOUT ME Distributed systems and data science in Red Hat's Emerging

More information

Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce Introduction to Hadoop and MapReduce Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large

More information

Scalable Tools - Part I Introduction to Scalable Tools

Scalable Tools - Part I Introduction to Scalable Tools Scalable Tools - Part I Introduction to Scalable Tools Adisak Sukul, Ph.D., Lecturer, Department of Computer Science, adisak@iastate.edu http://web.cs.iastate.edu/~adisak/mbds2018/ Scalable Tools session

More information

2/4/2019 Week 3- A Sangmi Lee Pallickara

2/4/2019 Week 3- A Sangmi Lee Pallickara Week 3-A-0 2/4/2019 Colorado State University, Spring 2019 Week 3-A-1 CS535 BIG DATA FAQs PART A. BIG DATA TECHNOLOGY 3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING SECTION 1: MAPREDUCE PA1

More information

Nowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype?

Nowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype? Big data hype? Big Data: Hype or Hallelujah? Data Base and Data Mining Group of 2 Google Flu trends On the Internet February 2010 detected flu outbreak two weeks ahead of CDC data Nowcasting http://www.internetlivestats.com/

More information

Databases and Big Data Today. CS634 Class 22

Databases and Big Data Today. CS634 Class 22 Databases and Big Data Today CS634 Class 22 Current types of Databases SQL using relational tables: still very important! NoSQL, i.e., not using relational tables: term NoSQL popular since about 2007.

More information

Page 1. Goals for Today" Background of Cloud Computing" Sources Driving Big Data" CS162 Operating Systems and Systems Programming Lecture 24

Page 1. Goals for Today Background of Cloud Computing Sources Driving Big Data CS162 Operating Systems and Systems Programming Lecture 24 Goals for Today" CS162 Operating Systems and Systems Programming Lecture 24 Capstone: Cloud Computing" Distributed systems Cloud Computing programming paradigms Cloud Computing OS December 2, 2013 Anthony

More information

Big Data Specialized Studies

Big Data Specialized Studies Information Technologies Programs Big Data Specialized Studies Accelerate Your Career extension.uci.edu/bigdata Offered in partnership with University of California, Irvine Extension s professional certificate

More information

Introduction to MapReduce (cont.)

Introduction to MapReduce (cont.) Introduction to MapReduce (cont.) Rafael Ferreira da Silva rafsilva@isi.edu http://rafaelsilva.com USC INF 553 Foundations and Applications of Data Mining (Fall 2018) 2 MapReduce: Summary USC INF 553 Foundations

More information

/ Cloud Computing. Recitation 13 April 14 th 2015

/ Cloud Computing. Recitation 13 April 14 th 2015 15-319 / 15-619 Cloud Computing Recitation 13 April 14 th 2015 Overview Last week s reflection Project 4.1 Budget issues Tagging, 15619Project This week s schedule Unit 5 - Modules 18 Project 4.2 Demo

More information

15.1 Data flow vs. traditional network programming

15.1 Data flow vs. traditional network programming CME 323: Distributed Algorithms and Optimization, Spring 2017 http://stanford.edu/~rezab/dao. Instructor: Reza Zadeh, Matroid and Stanford. Lecture 15, 5/22/2017. Scribed by D. Penner, A. Shoemaker, and

More information

CompSci 516: Database Systems

CompSci 516: Database Systems CompSci 516 Database Systems Lecture 12 Map-Reduce and Spark Instructor: Sudeepa Roy Duke CS, Fall 2017 CompSci 516: Database Systems 1 Announcements Practice midterm posted on sakai First prepare and

More information

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Shark Hive on Spark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Agenda Intro to Spark Apache Hive Shark Shark s Improvements over Hive Demo Alpha

More information

Spark: A Brief History. https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf

Spark: A Brief History. https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf Spark: A Brief History https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf A Brief History: 2004 MapReduce paper 2010 Spark paper 2002 2004 2006 2008 2010 2012 2014 2002 MapReduce @ Google

More information

Hadoop An Overview. - Socrates CCDH

Hadoop An Overview. - Socrates CCDH Hadoop An Overview - Socrates CCDH What is Big Data? Volume Not Gigabyte. Terabyte, Petabyte, Exabyte, Zettabyte - Due to handheld gadgets,and HD format images and videos - In total data, 90% of them collected

More information

STATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns

STATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns STATS 700-002 Data Analysis using Python Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns Unit 3: parallel processing and big data The next few lectures will focus on big

More information

Lambda Architecture with Apache Spark

Lambda Architecture with Apache Spark Lambda Architecture with Apache Spark Michael Hausenblas, Chief Data Engineer MapR First Galway Data Meetup, 2015-02-03 2015 MapR Technologies 2015 MapR Technologies 1 Polyglot Processing 2015 2014 MapR

More information

Introduction to Big-Data

Introduction to Big-Data Introduction to Big-Data Ms.N.D.Sonwane 1, Mr.S.P.Taley 2 1 Assistant Professor, Computer Science & Engineering, DBACER, Maharashtra, India 2 Assistant Professor, Information Technology, DBACER, Maharashtra,

More information

Hadoop. copyright 2011 Trainologic LTD

Hadoop. copyright 2011 Trainologic LTD Hadoop Hadoop is a framework for processing large amounts of data in a distributed manner. It can scale up to thousands of machines. It provides high-availability. Provides map-reduce functionality. Hides

More information

Clustering Documents. Case Study 2: Document Retrieval

Clustering Documents. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 21 th, 2015 Sham Kakade 2016 1 Document Retrieval Goal: Retrieve

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

Webinar Series TMIP VISION

Webinar Series TMIP VISION Webinar Series TMIP VISION TMIP provides technical support and promotes knowledge and information exchange in the transportation planning and modeling community. Today s Goals To Consider: Parallel Processing

More information

Khadija Souissi. Auf z Systems November IBM z Systems Mainframe Event 2016

Khadija Souissi. Auf z Systems November IBM z Systems Mainframe Event 2016 Khadija Souissi Auf z Systems 07. 08. November 2016 @ IBM z Systems Mainframe Event 2016 Acknowledgements Apache Spark, Spark, Apache, and the Spark logo are trademarks of The Apache Software Foundation.

More information

Impala. A Modern, Open Source SQL Engine for Hadoop. Yogesh Chockalingam

Impala. A Modern, Open Source SQL Engine for Hadoop. Yogesh Chockalingam Impala A Modern, Open Source SQL Engine for Hadoop Yogesh Chockalingam Agenda Introduction Architecture Front End Back End Evaluation Comparison with Spark SQL Introduction Why not use Hive or HBase?

More information

Putting it together. Data-Parallel Computation. Ex: Word count using partial aggregation. Big Data Processing. COS 418: Distributed Systems Lecture 21

Putting it together. Data-Parallel Computation. Ex: Word count using partial aggregation. Big Data Processing. COS 418: Distributed Systems Lecture 21 Big Processing -Parallel Computation COS 418: Distributed Systems Lecture 21 Michael Freedman 2 Ex: Word count using partial aggregation Putting it together 1. Compute word counts from individual files

More information

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA) Introduction to MapReduce Adapted from Jimmy Lin (U. Maryland, USA) Motivation Overview Need for handling big data New programming paradigm Review of functional programming mapreduce uses this abstraction

More information

Data Analysis Using MapReduce in Hadoop Environment

Data Analysis Using MapReduce in Hadoop Environment Data Analysis Using MapReduce in Hadoop Environment Muhammad Khairul Rijal Muhammad*, Saiful Adli Ismail, Mohd Nazri Kama, Othman Mohd Yusop, Azri Azmi Advanced Informatics School (UTM AIS), Universiti

More information

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per

More information

Outline. CS-562 Introduction to data analysis using Apache Spark

Outline. CS-562 Introduction to data analysis using Apache Spark Outline Data flow vs. traditional network programming What is Apache Spark? Core things of Apache Spark RDD CS-562 Introduction to data analysis using Apache Spark Instructor: Vassilis Christophides T.A.:

More information

A very short introduction

A very short introduction A very short introduction General purpose compute engine for clusters batch / interactive / streaming used by and many others History developed in 2009 @ UC Berkeley joined the Apache foundation in 2013

More information

Data Analytics with HPC. Data Streaming

Data Analytics with HPC. Data Streaming Data Analytics with HPC Data Streaming Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

Distributed Systems 16. Distributed File Systems II

Distributed Systems 16. Distributed File Systems II Distributed Systems 16. Distributed File Systems II Paul Krzyzanowski pxk@cs.rutgers.edu 1 Review NFS RPC-based access AFS Long-term caching CODA Read/write replication & disconnected operation DFS AFS

More information

Hadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017

Hadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017 Hadoop File System 1 S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y Moving Computation is Cheaper than Moving Data Motivation: Big Data! What is BigData? - Google

More information

Distributed Systems. CS422/522 Lecture17 17 November 2014

Distributed Systems. CS422/522 Lecture17 17 November 2014 Distributed Systems CS422/522 Lecture17 17 November 2014 Lecture Outline Introduction Hadoop Chord What s a distributed system? What s a distributed system? A distributed system is a collection of loosely

More information

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344 Where We Are Introduction to Data Management CSE 344 Lecture 22: MapReduce We are talking about parallel query processing There exist two main types of engines: Parallel DBMSs (last lecture + quick review)

More information

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics IBM Data Science Experience White paper R Transforming R into a tool for big data analytics 2 R Executive summary This white paper introduces R, a package for the R statistical programming language that

More information

About Codefrux While the current trends around the world are based on the internet, mobile and its applications, we try to make the most out of it. As for us, we are a well established IT professionals

More information

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Big Data Hadoop Developer Course Content Who is the target audience? Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Complete beginners who want to learn Big Data Hadoop Professionals

More information

Chapter 5. The MapReduce Programming Model and Implementation

Chapter 5. The MapReduce Programming Model and Implementation Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing

More information

Copyright 2012, Oracle and/or its affiliates. All rights reserved.

Copyright 2012, Oracle and/or its affiliates. All rights reserved. 1 Big Data Connectors: High Performance Integration for Hadoop and Oracle Database Melli Annamalai Sue Mavris Rob Abbott 2 Program Agenda Big Data Connectors: Brief Overview Connecting Hadoop with Oracle

More information