More Access to Data Managed by Hadoop


How to Train Your Elephant 2
More Access to Data Managed by Hadoop
Clif Kranish

Topics
- Introduction
- What is Hadoop
- Apache Hive
- Arrays, Maps and Structs
- Apache Drill
- Apache HBase
- Apache Phoenix
- Adapter summary
- Kerberos


Introduction: What is Hadoop?

What is Hadoop?
- Software framework for storing and processing big data in a distributed fashion
- Based on research at Google
- Originally developed at Yahoo
- Open source software distributed by ASF
- Now a collection of many projects
- Runs primarily on Linux
- Also available on Windows Server

What is Hadoop? Relational databases or Hadoop

             | Traditional                | Hadoop
Data sources | Transactional applications | Social media, clickstreams, logs, sensor data
Schemas      | Stable, well defined       | Evolving, flexible
Management   | Centrally managed by DBA   | Managed within applications
Structures   | Flat                       | Semi-structured or nested

HDFS and MapReduce

Hadoop Distributed File System
- Inexpensive: runs on large clusters of commodity hardware; expands by adding nodes
- Redundant: data is replicated on three nodes by default
- Resilient: fault tolerant, keeps running even if one node goes down
- Flexible: store any kind of data, in its original format
- Unified: storage, metadata, security; appears as a single file store

MapReduce
- Programming paradigm for distributed processing
- Requires programming, usually Java
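To make the MapReduce paradigm concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps using the classic word-count task. It is an illustration only: it does not use any Hadoop API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical. On a real cluster the framework runs the map and reduce functions in parallel on many nodes and performs the shuffle itself.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework on a cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big clusters"]))
```

The point of the split is that each map call needs only its own line and each reduce call needs only the values for its own key, which is what lets the work be spread across nodes.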

Apache Hadoop Architecture
[Diagram: Hive, Pig and other tools; Zookeeper (coordination)]

Apache Hadoop 2 Architecture
[Diagram: MR, Hive, Pig and other tools; Zookeeper (coordination); YARN (cluster manager); distributed processing; Hadoop Distributed File System]


Hadoop Distributions: Big 3 Vendors and the Rest
- Cloudera: founded 2008; commercial license; Cloudera Management Suite / Impala
- MapR: founded 2009; improved reliability, high availability; free version lacks some proprietary features; MapR-FS / MapR-DB / MapR Control System / Drill
- Hortonworks: founded [...]; [...]% Apache Open Source; Ambari / Yarn / Tez
- The rest: cloud services, database vendors, others

HDFS is a File System
File Formats: Faster Throughput With Compression and Columnar

File Format  | Compression                         | Columnar | Access | Description
Text         | None, LZO, gzip, bzip2, Snappy      | N        | Any    | Simple flat files; can be structured as CSV
RCFile       | None, Snappy, gzip, deflate, bzip2  | Y        | Any    | Record Columnar File; by Facebook for use with Hive
SequenceFile | None, Snappy, gzip, deflate, bzip2  | N        | Any    | Binary key/value pairs
ORC          | ZLIB, Snappy                        | Y        | Hive   | Optimized Row Columnar; by Hortonworks for use with Hive
Avro         | None, Snappy, gzip, deflate, bzip2  | N        | Any    | Schema-oriented binary data
[Parquet]    | Snappy, gzip                        | Y        | Impala | Columnar file format; by Cloudera for use with Impala


SQL on Hadoop: Apache JDBC Drivers for Hadoop

Technology          | Usage                                             | Support                                            | IBI Adapter
Hive                | Batch oriented / ETL; high latency/throughput     | Hortonworks; all distributions; originated at Facebook | Hive
Impala (incubating) | BI and analytics; runs in memory; low latency/throughput | Cloudera; MapR, Amazon, Oracle             | Hive
Drill               | BI and analytics; runs in memory; multiple sources | MapR; can be installed anywhere                   | Drill
[HAWQ] (incubating) | Analytics; machine learning                       | Pivotal Greenplum                                  |

Apache Hive
Provides data summarization, query, and analysis for Hadoop

What is Hive?
- Data Warehouse layer on top of Hadoop
- Adds structure to unstructured data in HDFS
- Allows use of Hive Query Language to access
- Provides a JDBC Driver
- Hive tables are data files stored in HDFS
- Schema (metadata) stored in an RDBMS

Comma Delimited File: Create, Load and Select

    ID,Age,Education,Marital,Gender,Occupation,Income
    51397,21,College,Single,Male,Student,
    ,24,Professional,Single,Male,Student,
    ,45,College,Single,Male,Student,
    ,28,Bachelor,Single,Male,Student,

    CREATE TABLE newcusts (
      id STRING,
      age INT,
      education STRING,
      marital STRING,
      gender STRING,
      occupation STRING,
      income DECIMAL(8,2)
    )
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      LINES TERMINATED BY '\n'
    STORED AS TEXTFILE
    TBLPROPERTIES("skip.header.line.count"="1");

    LOAD DATA LOCAL INPATH 'newcusts.csv' OVERWRITE INTO TABLE newcusts;

    SELECT * FROM newcusts;
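The table definition above amounts to a reading recipe: split each line on commas, skip the first line, and cast each field to its declared type. The following Python sketch mimics that recipe outside Hive. It is an illustration under assumptions: the ID and income values in `SAMPLE` are invented, and `read_newcusts` is a hypothetical helper, not part of Hive.

```python
import csv
import io
from decimal import Decimal

# Hypothetical sample rows: the ID and income values are invented for illustration.
SAMPLE = """ID,Age,Education,Marital,Gender,Occupation,Income
10001,21,College,Single,Male,Student,1500.00
10002,24,Professional,Single,Male,Student,2750.50
"""

# Column names and Python casts mirroring the CREATE TABLE column list.
COLUMNS = [
    ("id", str), ("age", int), ("education", str), ("marital", str),
    ("gender", str), ("occupation", str), ("income", Decimal),
]


def read_newcusts(text, skip_header_lines=1):
    """Parse comma-delimited text, skipping header lines like skip.header.line.count."""
    reader = csv.reader(io.StringIO(text))
    for _ in range(skip_header_lines):
        next(reader, None)
    rows = []
    for fields in reader:
        rows.append({name: cast(value) for (name, cast), value in zip(COLUMNS, fields)})
    return rows


if __name__ == "__main__":
    for row in read_newcusts(SAMPLE):
        print(row)
```

Unlike this sketch, Hive applies the schema when the file is read by a query, so the data file itself is loaded unchanged.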

HUE (Hadoop User Experience): Create table, load and query

HUE: Query Results

Hive: Configure Adapter, Create Synonym

Synonym and Sample Data: newcusts

Hive complex data types: Arrays, Structs and Maps

Text File with Complex Types: Array, Map and Struct

    John Doe# #Mary Smith,Todd Jones#Federal Taxes=.2,State Taxes=.05,Insurance=.1#1 Michigan Ave.,Chicago,IL,60600
    Mary Smith# #Bill King#Federal Taxes=.2,State Taxes=.05,Insurance=.1#100 Ontario St.,Chicago,IL,60601
    Todd Jones# ##Federal Taxes=.15,State Taxes=.03,Insurance=.1#200 Chicago Ave.,Oak Park,IL,60700
    Bill King# ##Federal Taxes=.15,State Taxes=.03,Insurance=.1#300 Obscure Dr.,Obscuria,IL,60100

    CREATE TABLE employees (
      name STRING,
      salary FLOAT,
      subordinates ARRAY<STRING>,
      deductions MAP<STRING, FLOAT>,
      address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
    )
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '#'
      COLLECTION ITEMS TERMINATED BY ','
      MAP KEYS TERMINATED BY '='
      LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;

    LOAD DATA LOCAL INPATH 'employees.dat' OVERWRITE INTO TABLE employees;
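The three delimiters in the table definition work at different levels: `#` separates fields, `,` separates the items of an array, map or struct, and `=` separates a map key from its value. A minimal Python sketch of that splitting is shown here. `parse_employee` is a hypothetical helper, and the salary in `LINE` is an invented value, since the salaries did not survive in the sample data.

```python
def parse_employee(line):
    """Split one employees.dat line into name, salary, array, map and struct."""
    name, salary, subordinates, deductions, address = line.split("#")
    street, city, state, zip_code = address.split(",")
    return {
        "name": name,
        "salary": float(salary),
        # ARRAY<STRING>: items separated by ','
        "subordinates": subordinates.split(",") if subordinates else [],
        # MAP<STRING, FLOAT>: entries separated by ',', key and value by '='
        "deductions": {
            key: float(value)
            for key, value in (item.split("=") for item in deductions.split(","))
        } if deductions else {},
        # STRUCT: positional items separated by ','; field names come from the metadata
        "address": {"street": street, "city": city, "state": state, "zip": int(zip_code)},
    }


# The salary (100000.0) is a hypothetical value used only for this example.
LINE = ("John Doe#100000.0#Mary Smith,Todd Jones#"
        "Federal Taxes=.2,State Taxes=.05,Insurance=.1#"
        "1 Michigan Ave.,Chicago,IL,60600")

if __name__ == "__main__":
    print(parse_employee(LINE))
```

Note that the map carries its key names in the data, while the struct's field names exist only in the table definition.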


HUE Query Editor: Create, Load and Select

Query Results Showing Hive Complex Datatypes: Array, Map and Struct
- Maps and Structs look the same in the results of a Hive query
- Maps get their names from the data
- Structs get their names from the metadata

Create Synonym

Sample Data

Synonym Editor

Synonym Editor - Array
- Transpose: multiple values to columns
- Hive Array

Synonym Editor - Array
- New segment added

Synonym with Array
- New segment with new field name

Synonym with Map or Struct
- Transpose: multiple values to columns
- Hive Structure

Synonym with Hive Map: Deductions

Synonym with Hive Struct: Address

Apache Drill
Schema Free SQL For Hadoop, NoSQL and Cloud


Apache Drill
- Originally developed at MapR
- Now open source
- Storage Plugins for Files, HDFS, Hive, HBase, MongoDB
- Distributed, columnar execution engine
- Runs in memory
- Supplies JDBC Driver
- Command line (sqlline) and Web client

Drill Web UI with JSON (self describing) source
- Automatically generates schema

File Plug-in: Comma Delimited File

    select * from `dfs`.`tmp`.`newcusts.csv`

File Plug-in: Comma Delimited File

    select columns[0] as id, columns[1]
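When Drill reads a delimited file that has no schema, each row is exposed as a single array named `columns`, and a query picks fields by position and names them with aliases. The sketch below imitates that behavior in Python. It is an illustration, not Drill code: `drill_style_rows` and `project` are hypothetical helpers, the sample values are invented, and the alias `age` for `columns[1]` is an assumption because the original query is cut off after `columns[1]`.

```python
import csv
import io

# Hypothetical sample data; values are invented for illustration.
SAMPLE = "10001,21,College\n10002,24,Professional\n"


def drill_style_rows(text):
    """Return each delimited line as {'columns': [...]}, keeping every value a string."""
    return [{"columns": fields} for fields in csv.reader(io.StringIO(text))]


def project(rows, aliases):
    """Mimic `select columns[i] as alias`: aliases maps an alias to an array index."""
    return [{alias: row["columns"][index] for alias, index in aliases.items()} for row in rows]


if __name__ == "__main__":
    rows = drill_style_rows(SAMPLE)
    print(rows[0])                               # the whole row as one array
    print(project(rows, {"id": 0, "age": 1}))    # positional fields with aliases
```

All values stay strings in this model, and a header line in the file, if present, would come back as an ordinary first row.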

Hive Plugin: Uses Hive metadata

    select * from `hive`.`default`.`newcusts` offset 1

Create Synonym for Drill: Using Hive Metadata

Synonym newcusts: Drill using Hive Plugin

HBase
When you need random, real-time read/write access

HBase: NoSQL Database for Hadoop
- Based on Google Bigtable storage architecture
- Column oriented
- Timestamped
- For very large tables with billions of rows, millions of columns
- For sparse data: each row may use only a few of the columns
- Provides random read/write access by row key only
- No joins
- Essentially schemaless (row key and column families)
- Schemas maintained by JDBC driver for SQL access

SQL to NoSQL: JDBC Drivers and Adapters

Technology | Metadata creation
Hive       | create external table
Drill      | create view
Phoenix    | create table

HBase Shell: Create Table and Column Families

    $ hbase shell
    HBase Shell; enter 'help<return>' for list of supported commands.
    Type "exit<return>" to leave the HBase Shell
    Version , r58355eb3c88bded74f382d81cdd36174d68ad0fd, Wed Sep :56:38 UTC 2015

    hbase(main):001:0> create 'customers', 'address', 'loyalty', 'personal'
    0 row(s) in seconds
    => Hbase::Table - customers
    hbase(main):002:0> exit

In the create command, 'customers' is the table name and 'address', 'loyalty' and 'personal' are the column families.

HBase Shell: Put data a cell at a time

    hbase(main):002:0> put 'customers', '10001', 'address:state', 'va'
    0 row(s) in seconds
    hbase(main):003:0> put 'customers', '10001', 'loyalty:agg_rev', '197'
    0 row(s) in seconds
    hbase(main):004:0> put 'customers', '10001', 'loyalty:membership', 'silver'
    0 row(s) in seconds
    hbase(main):005:0> put 'customers', '10001', 'personal:age', '15-20'
    0 row(s) in seconds
    hbase(main):006:0> put 'customers', '10001', 'personal:gender', 'FEMALE'
    0 row(s) in seconds
    hbase(main):007:0> put 'customers', '10001', 'personal:name', 'Corrine Mecham'
    0 row(s) in seconds
    hbase(main):008:0> put 'customers', '10005', 'address:state', 'in'
    0 row(s) in seconds
    hbase(main):009:0> put 'customers', '10005', 'loyalty:agg_rev', '230'
    0 row(s) in seconds

The arguments to put are, in order: table name, row key, column family:column qualifier, value.
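The data model behind these commands can be pictured as a map from row key to a sparse set of cells, where each cell is addressed by `family:qualifier` and carries a timestamp. The Python sketch below models only that idea. `TinyTable` is a hypothetical in-memory class, not an HBase client, and it keeps a single version per cell, whereas HBase can keep several timestamped versions.

```python
import time
from collections import defaultdict


class TinyTable:
    """Toy model of an HBase table: row key -> {'family:qualifier': (timestamp, value)}."""

    def __init__(self, name, families):
        self.name = name
        self.families = set(families)   # column families are fixed when the table is created
        self.rows = defaultdict(dict)   # qualifiers are free-form and may differ per row

    def put(self, row_key, column, value, timestamp=None):
        """Store one cell; the column family must exist, the qualifier need not."""
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        ts = timestamp if timestamp is not None else int(time.time() * 1000)
        self.rows[row_key][column] = (ts, value)

    def get(self, row_key):
        """Random access by row key: return the cells that exist for this row."""
        cells = self.rows.get(row_key, {})
        return {column: value for column, (ts, value) in sorted(cells.items())}


if __name__ == "__main__":
    table = TinyTable("customers", ["address", "loyalty", "personal"])
    table.put("10001", "address:state", "va")
    table.put("10001", "loyalty:agg_rev", "197")
    table.put("10005", "address:state", "in")
    print(table.get("10001"))
    print(table.get("10005"))   # sparse: this row has only one cell
```

Because rows store only the cells that were put, a table with millions of possible columns costs nothing for the columns a row does not use.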

HBase Shell: Scan

    hbase(main):020:0> scan 'customers', {LIMIT=>3}
    ROW COLUMN+CELL
     column=address:state, timestamp= , value=va
     column=loyalty:agg_rev, timestamp= , value=
     column=loyalty:membership, timestamp= , value=silver
     column=personal:age, timestamp= , value=
     column=personal:gender, timestamp= , value=female
     column=personal:name, timestamp= , value=corrine Mecham
     column=address:state, timestamp= , value=in
     column=loyalty:agg_rev, timestamp= , value=
     column=loyalty:membership, timestamp= , value=silver
     column=personal:age, timestamp= , value=
     column=personal:gender, timestamp= , value=male
     column=personal:name, timestamp= , value=brittany Park
     column=address:state, timestamp= , value=ca
     column=loyalty:agg_rev, timestamp= , value=
     column=loyalty:membership, timestamp= , value=silver
     column=personal:age, timestamp= , value=
     column=personal:gender, timestamp= , value=male
     column=personal:name, timestamp= , value=rose Lokey
    3 row(s) in seconds
    hbase(main):021:0>

HBase to Hive: create external table, map column names

    create external table customers (
      key varchar(5),
      address_state varchar(4),
      loyalty_agg_rev varchar(8),
      loyalty_membership varchar(8),
      personal_age varchar(8),
      personal_gender varchar(8),
      personal_name varchar(32)
    )
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ('hbase.columns.mapping' =
      ':key,address:state,loyalty:agg_rev,loyalty:membership,personal:age,personal:gender,personal:name')
    TBLPROPERTIES ('hbase.table.name' = 'customers');
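The `hbase.columns.mapping` string pairs Hive columns with HBase cells by position: the first entry, `:key`, maps to the row key, and each later entry names a `family:qualifier`. As a rough sketch of that pairing, the hypothetical helper `hive_mapping` below derives both the Hive column names used on the slide and the mapping string from a list of HBase columns. The naming rule (replace `:` with `_`) is simply the convention the slide follows, not a Hive requirement.

```python
def hive_mapping(hbase_columns):
    """Return (hive_column_names, mapping_string) for a list of 'family:qualifier' names."""
    hive_names = ["key"] + [column.replace(":", "_") for column in hbase_columns]
    # Entries are joined with bare commas; the string is built without whitespace.
    mapping = ",".join([":key"] + list(hbase_columns))
    return hive_names, mapping


HBASE_COLUMNS = [
    "address:state", "loyalty:agg_rev", "loyalty:membership",
    "personal:age", "personal:gender", "personal:name",
]

if __name__ == "__main__":
    names, mapping = hive_mapping(HBASE_COLUMNS)
    for hive_name, entry in zip(names, mapping.split(",")):
        print(f"{hive_name:20} <- {entry}")
```

Printing the two lists side by side is a quick way to check that the column list and the mapping have the same length and order before creating the table.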

HUE Hive Query

    select * from customers

HBase / Hive Table Customers: Synonym

HBase / Hive Table Customers: Sample Data

HBase / Drill

    select * from `hbase`.`customers`


HBase / Drill

    select * from `hbase`.`customers`

Drill into HBase: Must convert each value

    create view `dfs`.views.customers as
    select
      convert_from(`customers`.`row_key`,'utf8') as id,
      convert_from(`customers`.`address`.`state`,'utf8') as state,
      convert_from(`customers`.`personal`.`age`,'utf8') as age,
      convert_from(`customers`.`personal`.`name`,'utf8') as name,
      convert_from(`customers`.`personal`.`gender`,'utf8') as gender,
      convert_from(`customers`.`loyalty`.`agg_rev`,'utf8') as agg_rev,
      convert_from(`customers`.`loyalty`.`membership`,'utf8') as membership
    from `hbase`.`customers`;
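The conversion is needed because HBase stores row keys and cell values as uninterpreted bytes, so a query returns raw byte arrays until each one is decoded. The Python sketch below shows the same step on a hand-built row. `convert_row` is a hypothetical helper standing in for the per-column `convert_from(..., 'utf8')` calls, and the sample bytes are taken from the put commands shown earlier.

```python
def convert_row(raw_row, encoding="utf-8"):
    """Decode every byte value in a row, like convert_from(value, 'utf8') per column."""
    return {alias: value.decode(encoding) for alias, value in raw_row.items()}


# A row as it might come back undecoded: every value is bytes.
RAW_ROW = {"id": b"10001", "state": b"va", "agg_rev": b"197", "membership": b"silver"}

if __name__ == "__main__":
    row = convert_row(RAW_ROW)
    print(row)
    # Decoding yields text; numeric use still needs an explicit cast.
    print(int(row["agg_rev"]) + 1)
```

Wrapping the conversions in a view, as the slide does, means the decoding is written once and every later query sees readable columns.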

Drill into HBase using a View

    select * from dfs.views.customers

Drill into HBase using View: Synonym

Drill into HBase: Sample Data

Apache Phoenix for HBase
Puts the SQL back in NoSQL

Apache Phoenix: SQL for HBase
- Relational database layer on top of HBase
- Real-time engine
- Low latency queries, high performance
- SQL query compiled into HBase scans
- Metadata stored in HBase
- Use CREATE TABLE to create a new table or map to an existing HBase table

Phoenix: Create Table

    create table customers (
      "id" varchar(5) primary key,
      "address"."state" varchar(2),
      "personal"."age" varchar(6),
      "personal"."name" varchar(24),
      "personal"."gender" varchar(6),
      "loyalty"."agg_rev" integer,
      "loyalty"."membership" varchar(6)
    );

Phoenix sqlline command line tool

    select * from customers limit

Synonym for Phoenix Table
- Column Family
- Column Name

Synonym for Phoenix Table: Map to Column Name

Access to Phoenix Tables: Sample Data

Adapters Summary

Access to Data Managed By Hadoop
[Diagram: WebFOCUS Reporting Server / DataMigrator Server]
- Adapters: File Adapters, Hive Adapter, Drill Adapter, Phoenix
- Access paths: NFS Gateway, JDBC
- SQL engines: Hive, Impala, Drill, Phoenix
- Metadata: Hive Metadata / HCatalog, Drill Views, Phoenix Metadata
- File formats: RCFile, SequenceFile


DataMigrator Extended Bulk Load
- Loads in parallel
- Automatically generates data files
  - Creates data file(s)
  - Uses [S]FTP to copy to remote server if required
  - Copies to HDFS
- Automatically generates metadata
  - Apache Hive/Impala metadata
  - Synonym immediately usable by WebFOCUS

Kerberos
Securing Hadoop

Kerberos Security Deployment

Server Wide
- Get Kerberos ticket
- Start Server
- Add connection for everyone
- All users use same ticket

Per User
- Start Server with Security on
- Create profile for each user
- Add connection for users
- Each user uses their own ticket

Kerberos: Server wide security

    /home/gdc$ kinit gc00001
    Password for
    /home/gdc$ klist
    Ticket cache: FILE:/tmp/krb5cc_278
    Default principal:
    Valid starting Expires Service principal
    02/26/16 16:05:53 02/27/16 02:05:58
    renew until 03/04/16 16:05:53
    /home/gdc$ beeline
    Beeline> !connect
    Connecting to
    Enter username for...:
    Enter password for...:
    Connected to: Apache Hive
    jdbc:hive2://sandbox:10000/default> select id, current_user() from newcusts limit 1;
    id _c
    gc
    row selected (0.158 seconds)
    0: jdbc:hive2://sandbox:10000/default> !quit
    /home/gdc$ edastart start
    02/26/ :51: Starting Workspace Manager in /ibi/srv77/dm
    02/26/ :51: Logging startup progress and errors in /ibi/srv77/dm/edaprint.log

Kerberos: Server wide security settings
- Add principal to URL
- JVM option: -Djavax.security.auth.useSubjectCredsOnly=false

Kerberos: Per User Security Settings
- Security on (PTH, LDAP, DBMS, Custom or OPSYS)
- Enable Kerberos for Hive Adapter:

    ENGINE SQLHIV SET ENABLE_KERBEROS ON


Kerberos: Connect to Hive with credentials from user profiles
- Register each user to create a profile
- Add a connection to their profile:

    jdbc:hive2://sandbox:10000/default;principal=hive/sandbox@ibi.com;auth=kerberos;kerberosAuthType=fromSubject
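The two deployment styles differ only in the parameters appended to the Hive JDBC URL: the server-wide setup adds the service principal, and the per-user setup also asks the driver to take credentials from the already authenticated subject. A small sketch of assembling such a URL follows. `hive_kerberos_url` is a hypothetical helper, and the parameter spelling `kerberosAuthType=fromSubject` is an assumption based on the Hive JDBC driver's documented option names; check it against the driver version in use.

```python
def hive_kerberos_url(host, port, database, service_principal, from_subject=False):
    """Build a HiveServer2 JDBC URL with Kerberos parameters separated by ';'."""
    parts = [
        f"jdbc:hive2://{host}:{port}/{database}",
        f"principal={service_principal}",       # the Hive service principal, not the user
    ]
    if from_subject:
        # Per-user case: reuse the credentials of the logged-in subject.
        parts += ["auth=kerberos", "kerberosAuthType=fromSubject"]
    return ";".join(parts)


if __name__ == "__main__":
    # Server-wide: every user shares the ticket obtained before the server started.
    print(hive_kerberos_url("sandbox", 10000, "default", "hive/sandbox@ibi.com"))
    # Per user: each profile connects with its own ticket.
    print(hive_kerberos_url("sandbox", 10000, "default", "hive/sandbox@ibi.com", from_subject=True))
```

Keeping the principal as a parameter makes it clear that it names the Hive service, while the user's identity comes from the Kerberos ticket rather than from the URL.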
