Sqoop In Action. Lecturer:Alex Wang QQ: QQ Communication Group:

1 Sqoop In Action Lecturer:Alex Wang QQ: QQ Communication Group:

2 Agenda
  Set up the Sqoop environment
  Import data
  Incremental import
  Free-form query import
  Export data
  Sqoop and Hive

3 Apache sqoop link page

4 Introduction
Sqoop is a tool designed to transfer data between Hadoop and relational databases or mainframes. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle, or from a mainframe, into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS. Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.

5 Apache Sqoop-1 Architecture

6 Apache Sqoop-2 Architecture

7 Prerequisites
The following prerequisite knowledge is required for this product:
  Basic computer technology and terminology
  Familiarity with command-line interfaces such as bash
  Relational database management systems
  Basic familiarity with the purpose and operation of Hadoop

8 Set up the Sqoop environment
Download the Sqoop tar file and uncompress it.
Configure the environment variables:
export SQOOP_HOME=/usr/local/sqoop bin hadoop-0.20
export PATH=$SQOOP_HOME/bin:$PATH
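The environment setup can be sketched as a short shell snippet. This is a minimal sketch that assumes the tarball was unpacked to /usr/local/sqoop (a hypothetical path; adjust it to the directory created on your machine). It sets the two variables from the slide and then checks whether the shell can find the sqoop launcher.

```shell
# Assumption: the Sqoop tarball was unpacked to /usr/local/sqoop.
export SQOOP_HOME=/usr/local/sqoop
export PATH="$SQOOP_HOME/bin:$PATH"

# Confirm the shell can find the launcher before going further.
if command -v sqoop >/dev/null 2>&1; then
  sqoop version
else
  echo "sqoop not found on PATH; check SQOOP_HOME" >&2
fi
```

Putting the two export lines into a shell profile such as ~/.bashrc keeps the setting across sessions.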

9 Download the database connectors

10 Introduce the sqoop command

11 Prepare MySQL
  Install mysql-server
  Create a database (sqoop) for testing
  Create two tables
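The database preparation could look roughly like the following. This is only a sketch: the slide does not give the table definitions, so the cities columns (id, country, city) are inferred from the options used in later slides, and the column types and the file name sqoop_setup.sql are assumptions. The second table is not named on the slide, so it is left out here.

```shell
# Write a hypothetical setup script; column types are assumptions.
cat > sqoop_setup.sql <<'EOF'
CREATE DATABASE IF NOT EXISTS sqoop;
USE sqoop;
CREATE TABLE IF NOT EXISTS cities (
  id      INT PRIMARY KEY,
  country VARCHAR(50),
  city    VARCHAR(150)
);
EOF

# Load it with the mysql client when one is installed:
#   mysql -u root -p < sqoop_setup.sql
echo "wrote sqoop_setup.sql"
```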

12 Import--Transferring an Entire Table
sqoop import \
  --connect jdbc:mysql://master:3306/sqoop \
  --username username \
  --password password \
  --table cities

13 Import--Specifying a Target Directory
sqoop import \
  --password sqoop \
  --table cities \
  --target-dir /etl/input/cities

14 Import--Using --warehouse-dir
sqoop import \
  --password sqoop \
  --table cities \
  --warehouse-dir /etl/input/

15 Import--Importing Only a Subset of Data
sqoop import \
  --password sqoop \
  --table cities \
  --target-dir /alex/input/subset/cities \
  --where "country = 'USA'"

16 Protecting Your Password
sqoop import \
  --table cities \
  -P

17 Protecting Your Password
sqoop import \
  --table cities \
  --password-file my-sqoop-password
echo -n "my-secret-password" > sqoop.password
hadoop dfs -put sqoop.password /user/$USER/sqoop.password
hadoop dfs -chmod 400 /user/$USER/sqoop.password
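Sqoop reads the whole password file, including any trailing newline, so the file should contain only the password characters. The sketch below writes the file without a newline and checks its size before it is uploaded. The file name and the password are placeholders, and the HDFS steps are shown as comments because they need a running cluster.

```shell
# Hypothetical local file name and placeholder password.
pw_file=sqoop.password
rm -f "$pw_file"
printf '%s' 'my-secret-password' > "$pw_file"   # printf '%s' adds no newline
chmod 400 "$pw_file"                            # owner read-only

# The byte count should equal the password length (18 here).
bytes=$(wc -c < "$pw_file" | tr -d ' ')
echo "password file holds $bytes bytes"

# Then copy it to HDFS and restrict access there as well:
#   hadoop fs -put sqoop.password /user/$USER/sqoop.password
#   hadoop fs -chmod 400 /user/$USER/sqoop.password
```

A byte count one larger than the password length usually means a newline slipped in.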

18 Import--Using a File Format Other Than CSV
sqoop import \
  --password sqoop \
  --table cities \
  --as-sequencefile
sqoop import \
  --password sqoop \
  --table cities \
  --as-avrodatafile

19 Import--Compressing Imported Data
sqoop import \
  --table cities \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.BZip2Codec

20 Import--Speeding Up Transfers
sqoop import \
  --table cities \
  --direct

21 Import--Overriding Type Mapping
sqoop import \
  --table cities \
  --map-column-java id=Long

22 Import--Controlling Parallelism
sqoop import \
  --password sqoop \
  --table cities \
  --num-mappers 10

23 Import--Encoding NULL Values
sqoop import \
  --password sqoop \
  --table cities \
  --null-string '\\N' \
  --null-non-string '\\N'
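The doubled backslash in '\\N' is easy to get wrong, so it may help to see what the shell actually hands to Sqoop. The snippet below is a small illustration, with null_arg as a hypothetical variable name. As far as the Sqoop documentation describes it, the value is placed into generated Java code, where the two backslashes are read as one, so the imported files end up containing \N.

```shell
# Single quotes stop the shell from interpreting backslashes,
# so Sqoop receives the three characters \\N exactly as typed.
null_arg='\\N'
printf '%s\n' "$null_arg"
echo "length: ${#null_arg}"
```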

24 Import--Importing All Your Tables
sqoop import-all-tables \
  --password sqoop
sqoop import-all-tables \
  --password sqoop \
  --exclude-tables cities,countries

25 Incremental Import
So far we've covered use cases where you had to transfer an entire table's contents from the database into Hadoop as a one-time operation. What if you need to keep the imported data on Hadoop in sync with the source table on the relational database side? While you could obtain a fresh copy every day by reimporting all data, that would not be optimal. The amount of time needed to import the data would increase in proportion to the amount of additional data appended to the table daily. This would put an unnecessary performance burden on your database. Why reimport data that has already been imported? For transferring deltas of data, Sqoop offers the ability to do incremental imports.

26 Importing Only New Data
Incremental import in append mode allows you to transfer only the newly created rows. This saves a considerable amount of resources compared with doing a full import every time you need the data to be in sync. One downside is the need to know the value of the last imported row so that next time Sqoop can start off where it ended. When running in incremental mode, Sqoop always prints out the value of the last imported row. This allows you to easily pick up where you left off.
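Keeping track of the last imported value by hand can be sketched as follows. This is a minimal sketch, not a Sqoop feature: the state file cities.last_value is a hypothetical name, and the connection details reuse the example values from the slides. The script only prints the command it would run, so it is safe to try without a cluster.

```shell
# Hypothetical state file that remembers the last imported id.
state_file=cities.last_value
[ -f "$state_file" ] || echo 0 > "$state_file"
last_value=$(cat "$state_file")

# Assemble the command; connection details reuse the slides' example values.
cmd="sqoop import --connect jdbc:mysql://master:3306/sqoop --username root -P \
--table cities --target-dir /alex/input/append \
--incremental append --check-column id --last-value $last_value"

# Dry run: print the command. To run it for real, use: eval "$cmd"
echo "$cmd"

# After a real run, record the new value that Sqoop printed, for example:
#   echo 42 > "$state_file"
```

Saved Sqoop jobs do this bookkeeping automatically, which is usually the more convenient option.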

27 Importing Only New Data
sqoop import \
  --connect jdbc:mysql://master:3306/sqoop \
  --username root \
  --password root \
  --table cities \
  --target-dir /alex/input/append \
  --incremental append \
  --check-column id \
  --last-value 1

28 Incrementally Importing Mutable Data
sqoop import \
  --password sqoop \
  --table visits \
  --incremental lastmodified \
  --check-column last_update_date \
  --last-value " :01:01"

29 Preserving the Last Imported Value
sqoop job \
  --create visits \
  -- \
  import \
  --password sqoop \
  --table visits \
  --incremental append \
  --check-column id \
  --last-value 0
sqoop job --exec visits

30 Sqoop Job
The Sqoop metastore is a powerful part of Sqoop that allows you to retain your job definitions and to easily run them anytime. Each saved job has a logical name that is used for referencing. You can list all retained jobs using the --list parameter:
sqoop job --list
You can remove old job definitions that are no longer needed with the --delete parameter, for example:
sqoop job --delete visits
Finally, you can view the content of a saved job definition using the --show parameter, for example:
sqoop job --show visits
The output of the --show command is in the form of properties. Unfortunately, Sqoop currently can't rebuild the command line that you used to create the saved job.

31 Storing Passwords in the Metastore
<configuration>
  ...
  <property>
    <name>sqoop.metastore.client.record.password</name>
    <value>true</value>
  </property>
</configuration>

32 Overriding the Arguments to a Saved Job
sqoop job --exec visits -- --verbose

33 Sharing the Metastore Between Sqoop Clients
sqoop job \
  --create cities \
  --meta-connect jdbc:hsqldb:hsql://master:16000/sqoop \
  -- \
  import \
  --connect jdbc:mysql://master:3306/sqoop \
  --username root \
  --password root \
  --table cities \
  --target-dir /alex/input/append \
  --incremental append \
  --check-column id \
  --last-value 1

34 sqoop-site.xml
<configuration>
  ...
  <property>
    <name>sqoop.metastore.client.autoconnect.url</name>
    <value>jdbc:hsqldb:hsql://your-metastore:16000/sqoop</value>
  </property>
</configuration>

35 Free-Form Query Import
The previous chapters covered the use cases where you had an input table on the source database system and you needed to transfer the table as a whole or one part at a time into the Hadoop ecosystem. This chapter, on the other hand, will focus on more advanced use cases where you need to import data from more than one table or where you need to customize the transferred data by calling various database functions.

36 Importing Data from Two Tables
sqoop import \
  --password sqoop \
  --query 'SELECT normcities.id,
                  countries.country,
                  normcities.city
           FROM normcities
           JOIN countries USING(country_id)
           WHERE $CONDITIONS' \
  --split-by id \
  --target-dir cities

37 Using Custom Boundary Queries
sqoop import \
  --password sqoop \
  --query 'SELECT normcities.id,
                  countries.country,
                  normcities.city
           FROM normcities
           JOIN countries USING(country_id)
           WHERE $CONDITIONS' \
  --split-by id \
  --target-dir cities \
  --boundary-query "select min(id), max(id) from normcities"
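The boundary query returns the minimum and maximum of the split column, and Sqoop divides that interval among the mappers, substituting each mapper's range for $CONDITIONS. The arithmetic below is a simplified illustration of that idea, not Sqoop's actual splitting code; the values of min, max, and mappers are hypothetical.

```shell
# Hypothetical boundary values and mapper count.
min=1; max=100; mappers=4

# Width of each range, rounded up so the ranges cover min..max.
size=$(( (max - min + mappers) / mappers ))

lo=$min
ranges=""
while [ "$lo" -le "$max" ]; do
  hi=$(( lo + size - 1 ))
  if [ "$hi" -gt "$max" ]; then hi=$max; fi
  ranges="$ranges$lo-$hi "
  echo "mapper range: id >= $lo AND id <= $hi"
  lo=$(( hi + 1 ))
done
```

With these values the loop prints four ranges of 25 ids each. A cheap boundary query matters because it runs before any mapper starts.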

38 Renaming Sqoop Job Instances
sqoop import \
  --password sqoop \
  --query 'SELECT normcities.id,
                  countries.country,
                  normcities.city
           FROM normcities
           JOIN countries USING(country_id)
           WHERE $CONDITIONS' \
  --split-by id \
  --target-dir cities \
  --mapreduce-job-name normcities

39 Importing Queries with Duplicated Columns
--query "SELECT \
           cities.city AS first_city, \
           normcities.city AS second_city \
         FROM cities \
         LEFT JOIN normcities USING(id)"

40 Export data to database
The previous three chapters had one thing in common: they described various use cases of transferring data from a database server to the Hadoop ecosystem. What if you have the opposite scenario and need to transfer generated, processed, or backed-up data from Hadoop to your database? Sqoop also provides facilities for this use case, and the following recipes in this chapter will help you understand how to take advantage of this feature.

41 Transferring Data from Hadoop
sqoop export \
  --password sqoop \
  --table cities \
  --export-dir cities

42 Inserting Data in Batches
sqoop export \
  --password sqoop \
  --table cities \
  --export-dir cities \
  --batch

43 Inserting Data in Batches
sqoop export \
  -Dsqoop.export.records.per.statement=10 \
  --password sqoop \
  --table cities \
  --export-dir cities
sqoop export \
  -Dsqoop.export.statements.per.transaction=10 \
  --password sqoop \
  --table cities \
  --export-dir cities

44 Exporting with All-or-Nothing Semantics
sqoop export \
  --password sqoop \
  --table cities \
  --staging-table staging_cities

45 Updating an Existing Data Set
sqoop export \
  --password sqoop \
  --table cities \
  --update-key id

46 Updating or Inserting at the Same Time
sqoop export \
  --password sqoop \
  --table cities \
  --update-key id \
  --update-mode allowinsert

47 Using Stored Procedures
sqoop export \
  --password sqoop \
  --call populate_cities

48 Exporting into a Subset of Columns
sqoop export \
  --password sqoop \
  --table cities \
  --columns country,city

49 Encoding the NULL Value Differently
sqoop export \
  --password sqoop \
  --table cities \
  --input-null-string '\\N' \
  --input-null-non-string '\\N'

50 Using Sqoop to Import Data into Hive
Sqoop can import your data directly into Hive.

51 Importing Data Directly into Hive
sqoop import \
  --password sqoop \
  --table cities \
  --hive-import

52 Using Partitioned Hive Tables
sqoop import \
  --password sqoop \
  --table cities \
  --hive-import \
  --hive-partition-key day \
  --hive-partition-value " "

53 Replacing Special Delimiters During Hive Import
sqoop import \
  --password sqoop \
  --table cities \
  --hive-import \
  --hive-drop-import-delims
sqoop import \
  --password sqoop \
  --table cities \
  --hive-import \
  --hive-delims-replacement "SPECIAL"

54 Using the Correct NULL String in Hive
sqoop import \
  --password sqoop \
  --table cities \
  --hive-import \
  --null-string '\\N' \
  --null-non-string '\\N'

55 Sqoop summary
  Sqoop depends on JDBC.
  Sqoop affects the performance of the source database.
