Complex stories about Sqooping PostgreSQL data

Presentation slides for the Sqoop User Meetup (Strata + Hadoop World NYC 2013)
10/28/2013, NTT DATA Corporation, Masatake Iwasaki

Introduction

About Me
Masatake Iwasaki: Software Engineer @ NTT DATA
  NTT (Nippon Telegraph and Telephone Corporation): telecommunications
  NTT DATA: systems integrator
Developed:
  Ludia: full-text search index for PostgreSQL using Senna
Authored:
  A Complete Primer for Hadoop (no official English title)
  Patches for Sqoop:
    SQOOP-390: PostgreSQL connector for direct export with pg_bulkload
    SQOOP-999: Support bulk load from HDFS to PostgreSQL using COPY ... FROM
    SQOOP-1155: Sqoop 2 documentation for connector development

Why PostgreSQL?
Enterprise-grade from earlier versions compared with MySQL
Active community in Japan
NTT DATA commits itself to its development

Sqooping PostgreSQL data

Working around PostgreSQL
Direct connector for the PostgreSQL loader:
  SQOOP-390: PostgreSQL connector for direct export with pg_bulkload
Yet another direct connector, for PostgreSQL JDBC:
  SQOOP-999: Support bulk load from HDFS to PostgreSQL using COPY ... FROM
Supporting complex data types:
  SQOOP-1149: Support Custom Postgres Types

Direct connector for the PostgreSQL loader
SQOOP-390: PostgreSQL connector for direct export with pg_bulkload
pg_bulkload: a data loader for PostgreSQL
  Server-side plug-in library plus client-side command
  Provides filtering and transformation of data
  http://pgbulkload.projects.pgfoundry.org/

SQOOP-390: PostgreSQL connector for direct export with pg_bulkload
Each mapper reads one HDFS file split and runs pg_bulkload as an external process to load it into its own staging table (tmp1, tmp2, tmp3), created with CREATE TABLE tmpN (LIKE dest INCLUDING CONSTRAINTS). A staging table per mapper is a must due to table-level locks. A single reducer then merges the staging tables into the destination table within one transaction:
  BEGIN
  INSERT INTO dest ( SELECT * FROM tmp1 )
  DROP TABLE tmp1
  INSERT INTO dest ( SELECT * FROM tmp2 )
  DROP TABLE tmp2
  INSERT INTO dest ( SELECT * FROM tmp3 )
  DROP TABLE tmp3
  COMMIT
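Not from the slides, but to make the reducer's merge step concrete, here is a minimal JDBC sketch, assuming staging tables tmp1..tmp3 and a destination table dest as in the figure above; the connection settings are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class MergeStagingTables {
    public static void main(String[] args) throws Exception {
        // Placeholder connection settings, not taken from the slides.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password")) {
            conn.setAutoCommit(false);  // one transaction for the whole merge
            try (Statement st = conn.createStatement()) {
                for (String tmp : new String[] {"tmp1", "tmp2", "tmp3"}) {
                    st.executeUpdate("INSERT INTO dest ( SELECT * FROM " + tmp + " )");
                    st.executeUpdate("DROP TABLE " + tmp);
                }
            }
            conn.commit();  // staging tables become visible in dest atomically
        }
    }
}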

Direct connector for the PostgreSQL loader
Pros:
  Fast, by short-circuiting server functionality
  Flexible filtering of error records
Cons:
  Not actually that fast: the bottleneck is on the DB side, not the client side, and the built-in COPY functionality is fast enough
  Not general: pg_bulkload supports only export
  Requires setup on all slave nodes and the client node
  May require recovery on failure

Yet another direct connector, for PostgreSQL JDBC
PostgreSQL provides a custom SQL command for data import/export:

  COPY table_name [ ( column_name [, ...] ) ]
      FROM { 'filename' | STDIN }
      [ [ WITH ] ( option [, ...] ) ]

  COPY { table_name [ ( column_name [, ...] ) ] | ( query ) }
      TO { 'filename' | STDOUT }
      [ [ WITH ] ( option [, ...] ) ]

  where option can be one of:
      FORMAT format_name
      OIDS [ boolean ]
      DELIMITER 'delimiter_character'
      NULL 'null_string'
      HEADER [ boolean ]
      QUOTE 'quote_character'
      ESCAPE 'escape_character'
      FORCE_QUOTE { ( column_name [, ...] ) | * }
      FORCE_NOT_NULL ( column_name [, ...] )
      ENCODING 'encoding_name'

and a JDBC API for it: org.postgresql.copy.*
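As a reference point (not part of the slides), a minimal sketch of driving COPY ... FROM STDIN through the driver's org.postgresql.copy API; the table dest, its column layout, and the connection settings are placeholders.

import java.io.StringReader;
import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class CopyFromStdinExample {
    public static void main(String[] args) throws Exception {
        // Placeholder connection settings.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password")) {
            // The copy API is exposed by the PostgreSQL-specific connection interface.
            CopyManager cm = ((PGConnection) conn).getCopyAPI();
            // Feed two CSV rows to the server through COPY ... FROM STDIN.
            String rows = "1,alice\n2,bob\n";
            long copied = cm.copyIn(
                    "COPY dest (id, name) FROM STDIN WITH (FORMAT csv)",
                    new StringReader(rows));
            System.out.println("rows copied: " + copied);
        }
    }
}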

SQOOP-999: Support bulk load from HDFS to PostgreSQL using COPY ... FROM
Each mapper reads one HDFS file split and uses the PostgreSQL JDBC driver to issue COPY ... FROM STDIN WITH ..., a custom SQL command available only in PostgreSQL, loading its data into a staging table before it reaches the destination table.

SQOOP-999: Support bulk load from HDFS to PostgreSQL using COPY ... FROM

  // Requires the PostgreSQL-specific interface.
  import org.postgresql.copy.CopyManager;
  import org.postgresql.copy.CopyIn;
  ...
  protected void setup(Context context) ... {
    dbConf = new DBConfiguration(conf);
    CopyManager cm = null;
    ...

  public void map(LongWritable key, Writable value, Context context) ... {
    ...
    if (value instanceof Text) {
      line.append(System.getProperty("line.separator"));
    }
    try {
      // Just feeding lines of text.
      byte[] data = line.toString().getBytes("UTF-8");
      copyIn.writeToCopy(data, 0, data.length);
      ...

Yet another direct connector, for PostgreSQL JDBC
Pros:
  Fast enough
  Easy to use: the JDBC driver jar is distributed automatically by the MR framework
Cons:
  Dependency on a non-generic JDBC API
    possible licensing issue (PostgreSQL is OK, it's a BSD license)
    build-time requirement (the PostgreSQL JDBC driver is available in the Maven repo):
      <dependency org="org.postgresql" name="postgresql" rev="${postgresql.version}" conf="common->default" />
  An error record causes rollback of the whole transaction
  Still difficult to implement a custom connector for IMPORT because of the code generation part

Supporting complex data types
PostgreSQL supports a lot of complex data types:
  Geometric types: points, line segments, boxes, paths, polygons, circles
  Network address types: inet, cidr, macaddr
  XML type
  JSON type
Supporting complex data types: SQOOP-1149: Support Custom Postgres Types (not me)

Constraints on JDBC data types in the Sqoop framework

  protected Map<String, Integer> getColumnTypesForRawQuery(String stmt) {
    ...
    results = execute(stmt);
    ...
    ResultSetMetaData metadata = results.getMetaData();
    for (int i = 1; i < cols + 1; i++) {
      int typeId = metadata.getColumnType(i);
      ...

  public String toJavaType(int sqlType) {
    // Mappings taken from:
    // http://java.sun.com/j2se/1.3/docs/guide/jdbc/getstart/mapping.html
    if (sqlType == Types.INTEGER) {
      return "Integer";
    } else if (sqlType == Types.VARCHAR) {
      return "String";
    ...
    } else {                  // <- complex PostgreSQL types reach here
      // TODO(aaron): Support DISTINCT, ARRAY, STRUCT, REF, JAVA_OBJECT.
      // Return null indicating database-specific manager should return a
      // java data type if it can find one for any nonstandard type.
      return null;
    }

getColumnType() returns java.sql.Types.OTHER for types not mappable to basic Java data types => losing type information
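As an illustration only (an assumption, not what SQOOP-1149 actually does), one way around this is to inspect the database-specific type name whenever the generic mapping yields Types.OTHER and fall back to String; the table addresses below, assumed to contain an inet column, is hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;
import java.sql.Types;

public class InspectColumnTypes {
    public static void main(String[] args) throws Exception {
        // Placeholder connection settings; "addresses" is a hypothetical table.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost:5432/mydb", "user", "password");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT * FROM addresses WHERE 1 = 0")) {
            ResultSetMetaData md = rs.getMetaData();
            for (int i = 1; i <= md.getColumnCount(); i++) {
                int typeId = md.getColumnType(i);
                // inet, json, geometric types, etc. surface as java.sql.Types.OTHER,
                // but getColumnTypeName() still reports "inet", "json", and so on.
                String javaType = (typeId == Types.OTHER)
                        ? "String /* fallback for " + md.getColumnTypeName(i) + " */"
                        : "mapped by the generic JDBC rules";
                System.out.println(md.getColumnName(i) + " -> " + javaType);
            }
        }
    }
}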

Sqoop 1 Summary
Pros:
  Simple: a standalone MapReduce driver, easy to understand for MR application developers (except for the ORM (SqoopRecord) code generation part)
  Variety of connectors
  Lots of information available
Cons:
  Complex command line and inconsistent options: the meaning of options differs according to the connector
  Not modular enough
  Dependency on the JDBC data model
  Security

Sqooping PostgreSQL Data 2

Sqoop 2
Everything is rewritten
Works on the server side
More modular
Not compatible with Sqoop 1 at all
(Almost) only the generic connector
A black box compared to Sqoop 1; needs more documentation
  SQOOP-1155: Sqoop 2 documentation for connector development
  Excerpt from that documentation: "Internal of Sqoop2 MapReduce Job ... OutputFormat invokes Loader's load method (via SqoopOutputFormatLoadExecutor) ... todo: sequence diagram like figure"

Sqoop2: Initialization phase of an IMPORT job
(Sequence diagram in the original slide.) SqoopInputFormat.getSplits() asks the Partitioner for partitions via getPartitions(); each returned Partition is wrapped in a SqoopSplit. The Partitioner is the part a connector developer implements.

Sqoop2: Map phase of an IMPORT job
(Sequence diagram in the original slide.) SqoopMapper.run() creates a MapDataWriter and calls Extractor.extract(); the Extractor reads from the DB and write*()s records to the MapDataWriter, which converts them to Sqoop's internal data format (Data) and passes them to context.write(). The Extractor is the part a connector developer implements.

Sqoop2: Reduce phase of an EXPORT job
(Sequence diagram in the original slide.) The Reducer writes through SqoopNullOutputFormat, whose SqoopOutputFormatLoadExecutor provides the SqoopRecordWriter returned by getRecordWriter() and starts a ConsumerThread; records are exchanged as Data via setContent()/getContent(), and the ConsumerThread runs Loader.load(), which read*()s the records and writes them into the DB. The Loader is the part a connector developer implements.
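To make the load step concrete, here is a self-contained sketch; SimpleDataReader and SimpleLoader are hypothetical stand-ins for Sqoop 2's DataReader/Loader contract, not the real API, and only mirror the read-loop-then-write-to-DB flow of the diagram.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Hypothetical stand-in for the data reader: hands records to the loader one by one.
interface SimpleDataReader {
    String readCsvRecord() throws Exception;  // null when the input is exhausted
}

// Hypothetical stand-in for a loader: consumes records and writes them into the DB.
class SimpleLoader {
    public void load(SimpleDataReader reader) throws Exception {
        // Placeholder connection settings and table layout.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost:5432/mydb", "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                 "INSERT INTO dest (id, name) VALUES (?, ?)")) {
            String record;
            while ((record = reader.readCsvRecord()) != null) {
                String[] fields = record.split(",", 2);
                ps.setInt(1, Integer.parseInt(fields[0]));
                ps.setString(2, fields[1]);
                ps.executeUpdate();  // write into DB, as at the end of the sequence above
            }
        }
    }
}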

Summary

My interests in popularizing Sqoop 2:
  Complex data type support in Sqoop 2
  A bridge to use Sqoop 1 connectors on Sqoop 2
  A bridge to use Sqoop 2 connectors from the Sqoop 1 CLI

Copyright 2011 NTT DATA Corporation