
Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem Zohar Elkayam www.realdbamagic.com Twitter: @realmgic

Who am I? Zohar Elkayam, CTO at Brillix. Programmer, DBA, team leader, database trainer, public speaker, and a senior consultant for over 18 years. Oracle ACE Associate. Part of ILOUG, the Israel Oracle User Group. Involved with Big Data projects since 2011. Blogger at www.realdbamagic.com and www.ildba.co.il

About Brillix We offer complete, integrated end-to-end solutions based on best-of-breed innovations in database, security, and big data technologies. We provide complete end-to-end 24x7 expert remote database services. We offer professional customized on-site training, delivered by our top-notch, world-recognized instructors.

Some of Our Customers

Agenda What is the Big Data challenge? A Big Data solution: Apache Hadoop HDFS MapReduce and YARN Hadoop Ecosystem: HBase, Sqoop, Hive, Pig and other tools Another Big Data solution: Apache Spark Where does the DBA fit in?

The Challenge

The Big Data Challenge

Volume Big data comes in one size: big. Size is measured in terabytes (10^12), petabytes (10^15), exabytes (10^18), and zettabytes (10^21). Storing and handling the data becomes an issue. Producing value out of the data in a reasonable time is an issue.

Variety Big Data extends beyond structured data, including semi-structured and unstructured information: logs, text, audio, and video. A wide variety of rapidly evolving data types requires highly flexible stores and handling. Unstructured: objects, flexible structure, unknown schema, textual and binary. Structured: tables, columns and rows, predefined structure, mostly textual.

Velocity The speed at which data is being generated and collected. Streaming data and large-volume data movement. High velocity of data capture requires rapid ingestion, and might cause a backlog problem.

Value Big data is not about the size of the data; it's about the value within the data.

So, We Define a Big Data Problem When the data is too big or moves too fast to handle in a sensible amount of time. When the data doesn't fit any conventional database structure. When we think that we can still produce value from that data and want to handle it. When the technical solution to the business need becomes part of the problem.

How to Do Big Data


Big Data in Practice Big data is big: technological frameworks and infrastructure solutions are needed. Big data is complicated: we need developers to manage the handling of the data, DevOps to manage the clusters, and data analysts and data scientists to produce value.

Possible Solutions: Scale Up The older solution: use a giant server with a lot of resources (scale up: more cores, faster processors, more memory) to handle the data. Process everything on a single server with hundreds of CPU cores. Use lots of memory (1+ TB). Keep a huge data store on high-end storage solutions. Data needs to be copied to the processes in real time, so it's no good for large amounts of data (terabytes to petabytes).

Another Solution: Distributed Systems A scale-out solution: use distributed systems, multiple machines for a single job/application. More machines mean more resources: CPU, memory, storage. But the solution is still complicated: infrastructure and frameworks are needed.

Distributed Infrastructure Challenges We need infrastructure that is built for: Large scale and linear scale-out ability. Data-intensive jobs that spread the problem across clusters of server nodes. Storage: efficient and cost-effective enough to capture and store terabytes, if not petabytes, of data. Network infrastructure that can quickly import large data sets and then replicate them to various nodes for processing. High-end hardware is too expensive; we need a solution that uses cheaper hardware.

Distributed System/Framework Challenges How do we distribute our workload across the system? Programming complexity: keeping the data in sync. What to do about faults and redundancy? How do we handle the security demands of protecting a highly distributed infrastructure and data?

A Big Data Solution: Apache Hadoop

Apache Hadoop An open source project run by the Apache Software Foundation (2006). Hadoop brings the ability to cheaply process large amounts of data, regardless of its structure. It has been the driving force behind the growth of the big data industry. Get the public release from: http://hadoop.apache.org/core/

Original Hadoop Components HDFS (Hadoop Distributed File System): a distributed file system that runs in clustered environments. MapReduce: a programming paradigm for running processes over clustered environments. Hadoop's main idea: distribute the data to many servers, then bring the program to the data.

Hadoop Benefits Designed for scale-out. A reliable solution based on unreliable hardware. Load data first, structure later. Designed for storing large files. Designed to maximize the throughput of large scans. Designed to leverage parallelism. A solution ecosystem.

What Hadoop Is Not Hadoop is not a database: it is not a replacement for a DW or for other relational databases. Hadoop is not for OLTP/real-time systems. Very good for large amounts of data, not so much for smaller sets. Designed for clusters: there is no Hadoop monster server (single server).

Hadoop Limitations Hadoop is scalable, but it's not fast. Some assembly may be required. Batteries are not included (DIY mindset): some features need to be developed if they're not available. Open source license limitations apply. The technology is changing very rapidly.

Hadoop under the Hood

Original Hadoop 1.0 Components HDFS (Hadoop Distributed File System): a distributed file system that runs in a clustered environment. MapReduce: a programming technique for running processes over a clustered environment.

Hadoop 2.0 Hadoop 2.0 changed the Hadoop concept and introduced better resource management: Hadoop Common, HDFS, YARN, and multiple data processing frameworks, including MapReduce, Spark, and others.

HDFS is... A distributed file system. Designed to reliably store data using commodity hardware. Designed to expect hardware failures and still stay resilient. Intended for larger files. Designed for batch inserts and appending data (no updates).

Files and Blocks Files are split into 128MB blocks (the single unit of storage). Blocks are managed by the NameNode and stored on DataNodes, transparently to users. Blocks are replicated across machines at load time: the same block is stored on multiple machines, which is good for fault tolerance and access. The default replication factor is 3.
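One way to see blocks and replication in practice is the fsck tool (a minimal sketch; the file path is a placeholder):
$ hdfs fsck /user/sample/hamlet.txt -files -blocks -locations
$ hdfs dfs -setrep -w 2 /user/sample/hamlet.txt
The first command lists each block of the file and the DataNodes holding its replicas; the second changes the replication factor for a single file and waits for it to take effect.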

HDFS is Good for... Storing large files: terabytes, petabytes, etc. Millions rather than billions of files; 128MB or more per file. Streaming data: write-once, read-many access patterns, optimized for streaming reads rather than random reads.

HDFS is Not So Good For... Low-latency reads / real-time applications: HDFS favors high throughput over low latency for small chunks of data (HBase addresses this issue). Large amounts of small files: better for millions of large files than for billions of small files. Multiple writers: there is a single writer per file, writes go to the end of the file, and there is no support for arbitrary offsets.

Using HDFS in Command Line
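A few basic commands give the flavor (a minimal sketch; paths and file names are placeholders):
$ hdfs dfs -mkdir -p /user/sample
$ hdfs dfs -put hamlet.txt /user/sample/
$ hdfs dfs -ls /user/sample
$ hdfs dfs -cat /user/sample/hamlet.txt
$ hdfs dfs -get /user/sample/hamlet.txt ./hamlet.copy.txt
The hdfs dfs commands mirror familiar POSIX tools (mkdir, ls, cat), with put/get moving files between the local file system and HDFS.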

What HDFS Looks Like (GUI)

Interfacing with HDFS

MapReduce is... A programming model for expressing distributed computations at a massive scale. An execution framework for organizing and performing such computations. MapReduce code can be written in Java, Scala, C, Python, Ruby, and others. The concept: bring the code to the data, not the data to the code.

The MapReduce Paradigm Imposes key-value input/output. We implement two main functions: MAP takes a large problem, divides it into sub-problems, and performs the same function on all sub-problems: Map(k1, v1) -> list(k2, v2). REDUCE combines the output from all sub-problems (each key goes to the same reducer): Reduce(k2, list(v2)) -> list(v3). The framework handles (almost) everything else.
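To make those signatures concrete, here is a hedged word-count sketch using Hadoop Streaming, which lets any executable act as mapper and reducer (the jar path and input file are placeholders; the framework sorts mapper output by key before it reaches the reducer):
$ yarn jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/sample/hamlet.txt \
    -output /user/sample/streaming-wc \
    -mapper 'tr -s "[:space:]" "\n"' \
    -reducer 'uniq -c'
The mapper emits one word per line (each word is a key k2); the shuffle-and-sort barrier groups identical words together; uniq -c then counts each run of identical keys, acting as the reducer.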

Divide and Conquer

YARN Takes care of distributed processing and coordination. Scheduling: jobs are broken down into smaller chunks called tasks, and these tasks are scheduled to run on data nodes. Task localization with data: the framework strives to place tasks on the nodes that host the segment of data to be processed by that specific task; code is moved to where the data is.

YARN Error Handling Failures are expected behavior, so tasks are automatically retried on other machines. Data synchronization: the shuffle-and-sort barrier re-arranges and moves data between machines; input and output are coordinated by the framework.
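Assuming a running cluster, application status and task logs can also be inspected from the command line (the application ID below is a placeholder):
$ yarn application -list
$ yarn application -status application_1520000000000_0001
$ yarn logs -applicationId application_1520000000000_0001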

Submitting a Job The yarn script, given a jar and a class argument, launches a JVM and executes the provided job:
$ yarn jar HadoopSamples.jar mr.wordcount.StartsWithCountJob \
    /user/sample/hamlet.txt \
    /user/sample/wordcount/

Resource Manager UI

Application View

Hadoop Main Problems The Hadoop MapReduce framework (not the MapReduce paradigm) had some major problems: Developing MapReduce was complicated; there was more than just business logic to develop. Transferring data between stages required the intermediate data to be written to disk (and then read by the next step). Multi-step jobs needed orchestration and abstraction solutions. Initial resource management was very painful: the MapReduce framework was based on resource slots.

Extending Hadoop: The Hadoop Ecosystem

Improving Hadoop: Distributions Core Hadoop is complicated, so tools and solution frameworks were added to make things easier. There are over 80 different Apache projects for big data solutions that use Hadoop (and growing!). Hadoop distributions collect some of these tools and release them as a complete, integrated package: Cloudera, Hortonworks, MapR, Amazon EMR.

Common Hadoop 2.0 Technology Ecosystem

Improving Programmability MapReduce code in Java is sometimes tedious, so different solutions came to the rescue: Pig: a programming language that simplifies Hadoop actions: loading, transforming, and sorting data. Hive: enables Hadoop to operate as a data warehouse using SQL-like syntax. Spark and other frameworks.

Pig Pig is an abstraction on top of Hadoop. It provides a high-level programming language designed for data processing. Scripts are converted into MapReduce code and executed on the Hadoop cluster. It makes ETL/ELT processing and other simple MapReduce jobs easier, without writing MapReduce code. Pig was widely accepted and used by Yahoo!, Twitter, Netflix, and others. It is often replaced by more up-to-date tools like Apache Spark. A word count in Pig Latin takes only a few lines, as the sketch below shows.
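A minimal sketch (paths are placeholders):
$ cat > wordcount.pig <<'EOF'
lines  = LOAD '/user/sample/hamlet.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO '/user/sample/pig_wordcount';
EOF
$ pig wordcount.pig
Each line builds a named relation; Pig compiles the whole script into one or more MapReduce (or Tez) jobs behind the scenes.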

Hive A data warehousing solution built on top of Hadoop. Provides an SQL-like query language named HiveQL. A minimal learning curve for people with SQL expertise: data analysts are the target audience. Early Hive development work started at Facebook in 2007. Hive is an Apache top-level project under Hadoop: http://hive.apache.org

Hive Provides The ability to bring structure to various data formats. A simple interface for ad hoc querying, analyzing, and summarizing large amounts of data. Access to files on various data stores such as HDFS and HBase. Also see: Apache Impala (mainly in Cloudera). For example, HiveQL can impose a table on files already sitting in HDFS, as sketched below.
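A hedged sketch (table name, columns, and path are placeholders):
$ hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
  ip STRING, ts STRING, url STRING, status INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/web_logs';
SELECT status, COUNT(*) FROM web_logs GROUP BY status;"
The EXTERNAL keyword means Hive only overlays a schema on the existing files; dropping the table leaves the data in HDFS untouched.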

Databases and DB Connectivity HBase: an online NoSQL key/value, wide-column datastore that is native to HDFS. Sqoop: a tool designed to import data from relational databases into Hadoop (HDFS, HBase, or Hive) and export it back. Sqoop2: a centralized Sqoop service (GUI, WebUI, REST).

HBase HBase is the closest thing we had to a database in the early Hadoop days. A distributed key/value, wide-column NoSQL database built on top of HDFS, providing Bigtable-like capabilities. It does not have a query language: only get, put, and scan commands, as the shell sketch below shows. Often compared with Cassandra (a non-Hadoop-native Apache project).
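A minimal sketch in the HBase shell (table and column family names are placeholders):
$ hbase shell <<'EOF'
create 'users', 'cf'
put 'users', 'row1', 'cf:name', 'Alice'
get 'users', 'row1'
scan 'users', {LIMIT => 10}
EOF
put writes a single cell (row key, column, value), get fetches one row by key, and scan walks rows in key order.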

When Do We Use HBase? Huge volumes of randomly accessed data. HBase is at its best when it's accessed in a distributed fashion by many clients (high consistency). Consider HBase when we are loading data by key, searching data by key (or range), serving data by key, querying data by key, or when storing data by row that doesn't conform well to a schema.

When NOT To Use HBase HBase doesn't use SQL, doesn't have an optimizer, and doesn't support transactions or joins. HBase doesn't have data types. See the Apache Phoenix project for a better data structure and query language when using HBase.

Sqoop and Sqoop2 Sqoop is a command-line tool for moving data from an RDBMS to Hadoop; Sqoop2 is a centralized tool for running Sqoop. It uses MapReduce to load the data from the relational database into HDFS, and can also export data from HDFS back to an RDBMS. It comes with connectors for MySQL, PostgreSQL, Oracle, SQL Server, and DB2.
$ bin/sqoop import --connect 'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' \
    --table lineitem --hive-import
$ bin/sqoop export --connect 'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' \
    --table lineitem --export-dir /data/lineitemdata

Improving Hadoop: More Useful Tools For improving coordination: ZooKeeper. For improving scheduling/orchestration: Oozie. For fast, interactive SQL on Hadoop data: Apache Impala. For improving log collection: Flume. For text search and data discovery: Solr. For improving UI and dashboards: Hue and Ambari.

Improving Hadoop: More Useful Tools (2) Data serialization: Avro and Parquet. Data governance: Atlas. Security: Knox and Ranger. Data replication: Falcon. Machine learning: Mahout. Performance improvement: Tez. And there are more...


Is Hadoop the Only Big Data Solution? No, there are other solutions: Apache Spark and Apache Mesos frameworks. NoSQL systems (Apache Cassandra, Couchbase, MongoDB, and many others). Stream analysis (Apache Kafka, Apache Storm, Apache Flink). Machine learning (Apache Mahout, Spark MLlib). Some can be integrated with Hadoop, but some are independent.

Another Big Data Solution: Apache Spark Apache Spark is a fast, general engine for large-scale data processing on a cluster. Originally developed by UC Berkeley in 2009 as a research project, it is now an open source Apache top-level project. The main idea: use the memory resources of the cluster for better performance. It is one of the fastest-growing projects today.
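As a hedged sketch of the in-memory idea, here is the same word count in PySpark; cached data stays in cluster memory between steps (paths and the app name are placeholders):
$ cat > wordcount.py <<'EOF'
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")
lines = sc.textFile("/user/sample/hamlet.txt").cache()  # keep the RDD in cluster memory
counts = (lines.flatMap(lambda line: line.split())      # split lines into words
               .map(lambda word: (word, 1))             # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))        # sum counts per word
counts.saveAsTextFile("/user/sample/spark_wordcount")
sc.stop()
EOF
$ spark-submit wordcount.py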

The Spark Stack

Okay, So Where Does the DBA Fit In? Big Data solutions are not databases. Databases are probably not going to disappear, but we feel the change even today: DBAs must be ready for it. DBAs are the perfect candidates to transition into Big Data experts: they have system (OS, disk, memory, hardware) experience, they can understand data easily, and they are used to working with developers and other data users.

What Do DBAs Need Now? DBAs will need to know more programming: Java, Scala, Python, R, or any other popular language in the Big Data world will do. DBAs need to understand the position shifts and the introduction of DevOps, Data Scientists, CDOs, etc. Big Data is changing daily: we need to learn, read, and be involved before we are left behind.

Q&A

Summary Big Data is here: it's complicated, and the RDBMS alone does not fit anymore. Big Data solutions are evolving: Hadoop is one example of such a solution, and Spark is a very popular one. DBAs need to be ready for the change: Big Data solutions are not databases, and we should make ourselves ready.

Thank You Zohar Elkayam twitter: @realmgic Zohar@Brillix.co.il www.realdbamagic.com