Importing and Exporting Data Between Hadoop and MySQL


About me
- Sarah Sproehnle
- Former MySQL instructor
- Joined Cloudera in March 2010
- sarah@cloudera.com

What is Hadoop?
- An open-source framework for storing and processing data on a cluster of computers
- Based on Google's whitepapers on the Google File System (GFS) and MapReduce
- Scales linearly (proven to scale to thousands of nodes)
- Built-in high availability
- Designed for batch processing
- Optimized for streaming reads

Why Hadoop?
- Lots of data (TB+)
- Need to scan, process or transform all of the data
- Complex and/or unstructured data

Use cases for Hadoop
- Recommendation engines: Netflix recommends movies, last.fm recommends music
- Ad targeting, log processing, search optimization: ContextWeb, eBay and Orbitz
- Machine learning and classification: Yahoo! Mail's spam detection; financial companies identify fraud and credit risk
- Graph analysis: Facebook, LinkedIn and eHarmony suggest connections

Hadoop vs. MySQL

                  MySQL                         Hadoop
Data capacity     TB+ (may require sharding)    PB+
Data per query    GB?                           PB+
Read/write        Random read/write             Sequential scans, append-only
Query language    SQL                           Java MapReduce, scripting languages, HiveQL
Transactions      Yes                           No
Indexes           Yes                           No
Latency           Sub-second (hopefully)        Minutes to hours
Data structure    Structured                    Structured or unstructured

How does Hadoop work?
- Spreads your data across tens, hundreds or thousands of machines using the Hadoop Distributed File System (HDFS)
- Built-in redundancy (replication) for fault tolerance
  - Machines will fail! With an HDD MTBF of ~1,000 days, a cluster with 1,000 disks sees roughly one failure every day
- Read and process the data with MapReduce
  - Processing is sent to the data
  - Many "map" tasks each work on a slice of the data
  - Failed tasks are automatically restarted on another node
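As a minimal sketch of the HDFS side (the file name and path here are hypothetical), loading a file and then inspecting how it was split into blocks and replicated might look like this:

$ hadoop fs -put weblogs.txt /data/weblogs.txt
$ hadoop fsck /data/weblogs.txt -files -blocks -locations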

Why MapReduce?
- By constraining computation to map and reduce phases, the work can be split into tasks and run in parallel
- Able to process huge amounts of data across thousands of machines
- Scales linearly
- The programmer is isolated from individual task failures
  - Failed tasks are restarted on another node

The problem with MapReduce
- The developer has to worry about a lot of things besides the analysis/processing logic (job setup, InputFormats, custom key/value classes)
- The data is schema-less
- Even simple things may require several MapReduce passes
- It would be more convenient to use constructs such as "filter", "join" and "aggregate"
- Solution: Hive, a SQL-like language on top of MapReduce (see the sketch below)
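For contrast with the word count program on the next slide: the same job collapses to a single HiveQL statement. A minimal sketch, assuming a hypothetical table docs with one string column line (explode and split are built-in Hive functions):

$ hive -e "SELECT word, COUNT(*) AS c
           FROM (SELECT explode(split(line, ' ')) AS word FROM docs) w
           GROUP BY word"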

Example - word count

map(key, value):
    foreach word in value:
        output(word, 1)

- Key and value represent a row of data: the key is the byte offset, the value is a line of text
- Intermediate output:
    the, 1
    cat, 1
    in, 1
    the, 1
    hat, 1

Reduce

reduce(key, list):
    sum the list
    output(key, sum)

- Hadoop groups the intermediate output by key and calls reduce once for each unique key:
    the, (1, 1, 1, 1, 1, 1, ..., 1)
    cat, (1, 1, 1)
    in, (1, 1, 1, 1, 1, 1)
    ...
- Final result:
    the, 45823
    cat, 1204
    in, 2693
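The same pattern can be run without writing Java by using Hadoop Streaming, which lets any executable act as the mapper and reducer. A rough sketch (the streaming jar path and HDFS directories are assumptions, and this counts only whitespace-separated tokens):

$ hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-*.jar \
    -input /data/books \
    -output /data/wordcount \
    -mapper 'tr -s " " "\n"' \
    -reducer 'uniq -c'

The mapper emits one word per line; because Hadoop sorts by key before the reduce phase, uniq -c sees identical words grouped together and emits one count per word.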

So where does this fit in?

Example data pipeline
1. Use MySQL for real-time read/write data access (e.g., websites)
2. A cron job periodically "Sqoops" data into Hadoop (see the sketch below)
3. Flume aggregates web logs and loads them into Hadoop
4. Use MapReduce to transform data, run batch analysis, join data sets, etc.
5. Export the transformed results to an OLAP or OLTP environment
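A minimal sketch of step 2 (the schedule, date column and target directory are hypothetical; foo.com/db reuses the example from the import slide below):

# crontab entry: import yesterday's orders into HDFS every night at 2am
0 2 * * * sqoop import --connect jdbc:mysql://foo.com/db --table orders \
    --where "order_date = DATE_SUB(CURDATE(), INTERVAL 1 DAY)" \
    --target-dir /data/orders/$(date +\%Y-\%m-\%d)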

Sqoop: SQL-to-Hadoop
- Open source software for parallel import/export between Hadoop and various RDBMSes
- Default implementation is JDBC-based
- Included in Cloudera's Distribution including Apache Hadoop (CDH)
  - Optimized for MySQL (built-in)
  - Optimized connectors for Oracle, Netezza and Teradata (others coming)
- Freely available at cloudera.com

How Sqoop works

"Sqooping" your tables into Hadoop $ sqoop import --connect jdbc:mysql://foo.com/db --table orders --fields-terminated-by '\t' --lines-terminated-by '\n' This command will submit a Hadoop job that queries your MySQL server at foo.com and reads rows from db.orders Resulting TSV files are stored in Hadoop's Distributed File System 16

Other features
- Choose which tables, rows (--where) or columns to import
- Configurable parallelization: specify the number of concurrent connections (--num-mappers) and the column to split on (--split-by)
- MySQL optimization (uses parallel mysqldump commands)
- LOBs can be stored inline or in separate files
- Incremental loads, keyed on a TIMESTAMP or AUTO_INCREMENT column (see the sketch below)
- Integration with Hive and HBase
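A minimal sketch of an incremental load using Sqoop's incremental-import flags (the check column and last-imported value are hypothetical):

$ sqoop import --connect jdbc:mysql://foo.com/db --table orders \
    --incremental append \
    --check-column id \
    --last-value 12345

Only rows with id greater than 12345 are imported; Sqoop reports the new high-water mark to use as --last-value on the next run.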

Exporting data

$ sqoop export --connect jdbc:mysql://foo.com/db \
    --table bar \
    --export-dir /hdfs_path/bar_data

- The target table must already exist
- Comma-separated fields are assumed; use --fields-terminated-by and --lines-terminated-by to override
- Can use a staging table (--staging-table); without one, failed jobs can leave unpredictable results (see the sketch below)
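A minimal sketch of a staged export (the staging table bar_stage is hypothetical and must have the same schema as bar):

$ sqoop export --connect jdbc:mysql://foo.com/db \
    --table bar \
    --staging-table bar_stage \
    --clear-staging-table \
    --export-dir /hdfs_path/bar_data

Rows are first inserted into bar_stage and only moved to bar once every export task has succeeded, so a failed job cannot leave a partially updated target table.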

Sqoop + Hive

$ sqoop import --connect jdbc:mysql://foo.com/db \
    --table orders \
    --hive-import

- By including --hive-import, Sqoop also creates a Hive table definition for the data
- Hive is a component that sits on top of Hadoop and provides:
  - A command-line interface for submitting HiveQL
  - A metastore for keeping table definitions for Hadoop data

Hive

$ hive
hive> SHOW TABLES;
...
hive> SELECT locationid, SUM(cost) AS sales
    FROM orders
    GROUP BY locationid
    ORDER BY sales
    LIMIT 100;

Demo

$ sqoop import --connect jdbc:mysql://localhost/world --username root \
    --table City --hive-import
$ sqoop import --connect jdbc:mysql://localhost/world --username root \
    --table Country --hive-import
$ hadoop fs -ls /user/hive/warehouse
$ hadoop fs -cat /user/hive/warehouse/city/* | more
$ hive
hive> SHOW TABLES;
hive> SELECT * FROM city LIMIT 10;
hive> SELECT countrycode, SUM(population) AS people
    FROM city
    WHERE population > 100000
    GROUP BY countrycode
    ORDER BY people DESC
    LIMIT 10;
hive> SELECT code, SUM(ci.population) AS people
    FROM city ci JOIN country co ON ci.countrycode = co.code
    WHERE continent = 'North America'
    GROUP BY code
    ORDER BY people DESC
    LIMIT 10;

Thanks!
sarah@cloudera.com