Getting Started with Hadoop and BigInsights

1 Getting Started with Hadoop and BigInsights
Alan Fischer e Silva, Hadoop Sales Engineer, Nov 2015

2 Agenda
- Intro
- Q&A
- Break
- Hands-on Lab

3 Hadoop Timeline

4 In a Big Data World
The technology now exists for us to:
- Store everything, for as long as we want
- Efficiently analyze everything, without sub-setting
- Connect tiny nuggets of valuable information buried in piles of worthless bytes

6 Apache Hadoop Modules
- Hadoop Common: common utilities that support all other modules.
- Hadoop Distributed File System (HDFS):
  - File system that spans all the nodes in a Hadoop cluster for data storage.
  - Links together the file systems on many local nodes to make them into one big file system.
- Hadoop MapReduce:
  - Software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
- YARN (Yet Another Resource Negotiator):
  - Scheduling and resource management.
- Large Hadoop ecosystem of open-source Apache projects: Hive, Pig, HBase, Oozie, Sqoop, Flume, ZooKeeper, Mahout, Spark, Avro

7 HDFS Architecture
- Master / Slave architecture
- Master: NameNode
  - Manages the file system namespace and metadata (FsImage, EditLog)
  - Regulates access to files by clients
- Slave: DataNode
  - Many DataNodes per cluster
  - Manages storage attached to the nodes
  - Periodically reports status to the NameNode
  - Data is stored across multiple nodes
  - Nodes and components will fail, so for reliability data is replicated across multiple nodes
[Diagram: the NameNode tracks File1 as blocks a, b, c, d; each block is stored on several DataNodes.]

8 HDFS Blocks
- HDFS is designed to support very large files
- Each file is split into blocks
  - Hadoop default: 64 MB
  - BigInsights default: 128 MB
- Behind the scenes, one HDFS block is supported by multiple operating system blocks
- If a file, or a chunk of the file, is smaller than the block size, only the needed space is used. E.g., a 210 MB file with 64 MB blocks is split as follows: 64 MB, 64 MB, 64 MB, 18 MB
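The block-splitting arithmetic on this slide can be sketched in a few lines of plain Java. This is only an illustration of the arithmetic, not HDFS code: the class name `BlockSplitSketch` and the method `split` are made up for this example, and sizes are whole megabytes for simplicity.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: reproduces the slide's block arithmetic, not the HDFS implementation.
class BlockSplitSketch {
    // Split a file of fileSizeMb into blocks of at most blockSizeMb each.
    // The last block holds only the remaining data, so no space is padded.
    static List<Long> split(long fileSizeMb, long blockSizeMb) {
        if (fileSizeMb < 0 || blockSizeMb <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        List<Long> blocks = new ArrayList<>();
        long remaining = fileSizeMb;
        while (remaining > 0) {
            long block = Math.min(remaining, blockSizeMb);
            blocks.add(block);
            remaining -= block;
        }
        return blocks;
    }

    public static void main(String[] args) {
        System.out.println(split(210, 64));   // prints [64, 64, 64, 18]
        System.out.println(split(210, 128));  // prints [128, 82]
    }
}
```

With the 128 MB BigInsights default, the same 210 MB file would occupy two blocks instead of four.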

9 Replication of Data and Rack Awareness
- Blocks of data are replicated to multiple nodes
- Behavior is controlled by the replication factor, configurable per file
  - Default is 3 replicas
- Replication is rack-aware to reduce inter-rack network hops/latency:
  - 1 copy in the first rack
  - 2nd and 3rd copies together in a separate rack
[Diagram: Data Nodes 1-4 in Rack 1, 5-8 in Rack 2, 9-12 in Rack 3, with blocks A, B, C of file.txt spread across them.]
Name Node metadata shown in the diagram:
- Rack aware: R1: 1, 2, 3, 4; R2: 5, 6, 7, 8; R3: 9, 10, 11
- file.txt = A: 1, 5, 6; B: 5, 9, 10; C: 9, 1, 2
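The placement rule on this slide (one copy in the first rack, the second and third copies together in a separate rack) can be sketched as follows. This is a simplified illustration, not the HDFS placement code: the class `RackPlacementSketch` and its methods are hypothetical, the remote rack is passed in by the caller, and the sketch always takes the first two nodes of that rack, whereas a real cluster would choose among candidate nodes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a deterministic stand-in for rack-aware replica placement.
class RackPlacementSketch {
    // Rack map copied from the slide's Name Node metadata.
    static final Map<String, List<Integer>> RACKS = new LinkedHashMap<>();
    static {
        RACKS.put("R1", List.of(1, 2, 3, 4));
        RACKS.put("R2", List.of(5, 6, 7, 8));
        RACKS.put("R3", List.of(9, 10, 11));
    }

    // Look up which rack a data node belongs to.
    static String rackOf(int node) {
        for (Map.Entry<String, List<Integer>> e : RACKS.entrySet()) {
            if (e.getValue().contains(node)) {
                return e.getKey();
            }
        }
        throw new IllegalArgumentException("unknown node " + node);
    }

    // Replica 1 goes on firstNode; replicas 2 and 3 go together on the
    // first two nodes of remoteRack (assumption: fixed choice, for repeatability).
    static List<Integer> place(int firstNode, String remoteRack) {
        List<Integer> remoteNodes = RACKS.get(remoteRack);
        if (remoteNodes == null || remoteNodes.size() < 2) {
            throw new IllegalArgumentException("remote rack needs at least two nodes");
        }
        if (rackOf(firstNode).equals(remoteRack)) {
            throw new IllegalArgumentException("remote rack must differ from the first rack");
        }
        List<Integer> replicas = new ArrayList<>();
        replicas.add(firstNode);
        replicas.add(remoteNodes.get(0));
        replicas.add(remoteNodes.get(1));
        return replicas;
    }

    public static void main(String[] args) {
        System.out.println("A: " + place(1, "R2")); // prints A: [1, 5, 6]
        System.out.println("B: " + place(5, "R3")); // prints B: [5, 9, 10]
        System.out.println("C: " + place(9, "R1")); // prints C: [9, 1, 2]
    }
}
```

The three calls reproduce the block locations listed for file.txt in the slide's metadata.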

10 MapReduce Processing Summary
- MapReduce computation model
  - Data stored in a distributed file system spanning many inexpensive computers
  - Bring function to the data
  - Distribute application to the compute resources where the data is stored
- Scalable to thousands of nodes and petabytes of data
- Processing phases:
  1. Map Phase (break job into small parts)
  2. Shuffle (transfer interim output for final processing)
  3. Reduce Phase (boil all output down to a single result set)
- The MapReduce application distributes map tasks to the Hadoop Data Nodes in the cluster and returns a single result set
MapReduce application code shown on the slide (word count, clipped at the slide edge):
    public static class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      // Emit (word, 1) for every token in the input value.
      public void map(Object key, Text val, Context context)
          throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(val.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, one);
        }
      }
    }

    public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      private IntWritable result = new IntWritable();

      // Sum the counts that the map tasks emitted for one word.
      public void reduce(Text key, Iterable<IntWritable> val, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : val) {
          sum += v.get();
        }
        ...
      }
    }
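To make the three phases concrete without a cluster, here is a minimal in-memory sketch in plain Java. It does not use the Hadoop API: the class `WordCountSketch` and its `map`, `shuffle`, and `reduce` methods are names invented for this example, and each input string stands in for one input split.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: simulates map, shuffle, and reduce in a single JVM.
class WordCountSketch {
    // Map phase: emit a (word, 1) pair for every token in one input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : split.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle: gather the values emitted by all map tasks, grouped by key.
    static Map<String, List<Integer>> shuffle(List<List<Map.Entry<String, Integer>>> mapOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> output : mapOutputs) {
            for (Map.Entry<String, Integer> pair : output) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        return grouped;
    }

    // Reduce phase: sum the grouped values for each key into one result set.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            result.put(e.getKey(), sum);
        }
        return result;
    }

    // Run all three phases over a list of splits.
    static Map<String, Integer> wordCount(List<String> splits) {
        List<List<Map.Entry<String, Integer>>> mapOutputs = new ArrayList<>();
        for (String split : splits) {
            mapOutputs.add(map(split));
        }
        return reduce(shuffle(mapOutputs));
    }

    public static void main(String[] args) {
        // prints {Hadoop=1, Hello=4, world=1}
        System.out.println(wordCount(List.of("Hello world", "Hello Hadoop", "Hello Hello")));
    }
}
```

In a real job the map calls run in parallel on the nodes holding each block, and the shuffle moves data over the network; here everything is sequential so the data flow is easy to follow.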

11 Map Task
[Diagram: the client asks "How many times does Hello appear in Wordcount.txt?". Components shown: Client, MR Application Master, Name Node. One map task runs per block, each counting Hello in its own block: block A on Data Node 1 (Count=8), block B on Data Node 5 (Count=3), block C on Data Node 9 (Count=10).]

12 Reduce Task
[Diagram: a reduce task on Data Node 3 sums the Hello counts from the three map tasks (8 + 3 + 10) and writes Count=21 to Results.txt in HDFS.]

13 Hadoop Open Source Ecosystem
[Diagram layers: Workflow Management; Data Access; Processing Framework (MR v2); Resource Management (YARN); Coordination (ZooKeeper); Data Storage]

14 What is Apache Spark?
- Spark is a fast and general engine for large-scale data processing
- Speed
  - In memory, Spark can run programs up to 100X faster than MapReduce
  - On disk, Spark can run programs up to 10X faster than MapReduce
- Application Development
  - Java
  - Scala
  - Python

15 What is Apache Spark?
- High Level Capabilities
  - SQL
  - Streaming
  - Complex Analytics (ML, R, etc.)
- Runs on multiple platforms
  - Hadoop, Mesos, standalone, Cloud
- Connectivity
  - HDFS, Cassandra, HBase, S3, JDBC, ODBC

16 Yet Another Resource Negotiator (YARN)
- Generic scheduling and resource management
- Supports more than just MapReduce
- Supports more workloads (Hadoop 1.x MapReduce was primarily batch processing)

17 What is HBase?
- An open source Apache Top Level Project
  - An industry-leading implementation of Google's BigTable design
  - Considered as the Hadoop database
  - HBase powers some of the leading sites on the Web
- A NoSQL data store
  - NoSQL stands for Not Only SQL
  - Flexible data model to accommodate semi-structured data
  - Cost effective for handling petabytes of data
  - Traditional RDBMS sharding (partitioning) lacks flexibility
- Why HBase?
  - Key-value store, column oriented
  - Highly scalable: automatic partitioning, scales linearly and automatically with new nodes
  - Low latency: supports random read/write and small range scans
  - Highly available
  - Strong consistency
  - Flexible data model, very good for sparse data (no fixed columns)
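The data model described above (sorted row keys, no fixed columns, random reads and writes, small range scans) can be pictured as a sorted map of sorted maps. The sketch below is an in-memory analogy only and does not use the HBase client API; the class `SparseTableSketch`, its methods, and the sample row keys and column names are all invented for illustration.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative only: an in-memory analogy for a sparse, column-oriented key-value table.
class SparseTableSketch {
    // row key -> (column name -> value). Rows stay sorted by key,
    // and a column exists for a row only if a value was written.
    private final NavigableMap<String, NavigableMap<String, String>> rows = new TreeMap<>();

    // Random write: set one cell, creating the row on first use.
    void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    // Random read: returns null when the row or the column is absent.
    String get(String rowKey, String column) {
        Map<String, String> row = rows.get(rowKey);
        return row == null ? null : row.get(column);
    }

    // Small range scan: all rows with startKey <= key < stopKey.
    NavigableMap<String, NavigableMap<String, String>> scan(String startKey, String stopKey) {
        return rows.subMap(startKey, true, stopKey, false);
    }

    public static void main(String[] args) {
        SparseTableSketch table = new SparseTableSketch();
        table.put("user#001", "info:name", "Ana");
        table.put("user#001", "info:city", "Lisbon");
        table.put("user#002", "info:name", "Bo");      // no city stored for this row
        table.put("user#010", "stats:logins", "7");    // a column no other row has

        System.out.println(table.get("user#002", "info:city"));          // prints null
        System.out.println(table.scan("user#001", "user#003").keySet()); // prints [user#001, user#002]
    }
}
```

Because absent cells are simply not stored, rows with very different column sets cost nothing extra, which is the point of the "sparse data" bullet.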

18 How to Analyze Large Data Sets in Hadoop
- Although the Hadoop framework is implemented in Java, MapReduce applications do not need to be written in Java
- To abstract the complexities of the Hadoop programming model, a few application development languages have emerged that build on top of Hadoop:
  - Pig
  - Hive

19 What is Hive?
[Diagram: SQL -> MapReduce -> Output]
- Developed by Facebook in 2007
- Provides HiveQL (HQL), a SQL-like interface:
  - DDL and DML are similar to SQL
  - HQL queries are translated into MapReduce jobs
  - Schema-on-read capability: projects a table structure onto existing data
- Not a true RDBMS:
  - Suited for batch-mode processing, not real-time, low-latency queries
  - No transaction support, no single-row INSERT, no UPDATE or DELETE
  - Limited SQL support
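Schema-on-read, as described above, means the table structure is applied when data is read rather than when it is stored. The sketch below illustrates the idea in plain Java on delimited text; it is not Hive code. The class `SchemaOnReadSketch`, the `read` method, and the sample rows are hypothetical, and filling missing trailing fields with null is a choice made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Illustrative only: projects column names onto raw lines at read time.
class SchemaOnReadSketch {
    // The raw lines are never modified; the "table" exists only in the returned view.
    static List<Map<String, String>> read(List<String> rawLines, List<String> columns, String delimiter) {
        List<Map<String, String>> table = new ArrayList<>();
        for (String line : rawLines) {
            // Quote the delimiter so characters like "|" are not treated as regex syntax.
            String[] fields = line.split(Pattern.quote(delimiter), -1);
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < columns.size(); i++) {
                // Assumption of this sketch: a missing field reads as null.
                row.put(columns.get(i), i < fields.length ? fields[i] : null);
            }
            table.add(row);
        }
        return table;
    }

    public static void main(String[] args) {
        List<String> raw = List.of("1,widget,9.99", "2,gadget");
        // prints [{id=1, name=widget, price=9.99}, {id=2, name=gadget, price=null}]
        System.out.println(read(raw, List.of("id", "name", "price"), ","));
    }
}
```

The same raw file could be read again with a different column list, which is what distinguishes this approach from an RDBMS that validates the schema on write.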

20 Pig: Data Transformation
- Pig vs MapReduce

21 Data Ingestion - Structured
- Sqoop: efficiently transfers bulk data between Hadoop clusters and RDBMS
    $ sqoop import --connect jdbc:db2://db2.my.com:50000/sample --username db2user --password db2pwd --table ORDERS --split-by tbl_primarykey --target-dir sqoopimports
    $ sqoop export --connect jdbc:db2://db2.my.com:50000/sample --username db2user --password db2pwd --table ORDERS --export-dir /sqoop/datafile.csv

22 Data Ingestion - Unstructured
- Flume
  - Distributed data collection service
  - Aggregates data from one or many sources to a centralized place
  - Great for logs, Twitter feeds, and unstructured data in general

23 Other Hadoop Related Projects
- Data serialization: Avro
  - Uses JSON schemas for defining data types; serializes data in a compact format
- Machine learning: Mahout
  - Library of scalable machine-learning algorithms
- Distributed coordination: ZooKeeper
  - Distributed configuration, synchronization, naming registry
- Jobs management: Oozie
  - Simplifies workflow and coordination of MapReduce jobs

24 Introduction to BigInsights & Open Data Platform Initiative (ODPi)

25 Goal of the Apache Software Foundation: Let 1000 Flowers Bloom!
- 249 Top Level Projects, 40 Incubating
- 2 Million+ Code Commits
- IBM co-founded the ASF in 1999 and is a Gold Sponsor
- The Apache Way is about fostering open innovation
- Not a standards organization

26 Harmonize on Open Data Platform to Accelerate Big Data Solutions
- Over 30 members from the leading companies in the world
- Provides a solid foundation of standard core Apache Hadoop components
- Goal: Achieve standardization and interoperability of software from ODP members

27 Goal of the ODP: Enable Innovation to Flourish on a Common Platform
- Complements the Apache Software Foundation's governance model
- ODP efforts focus on integration, testing, and certifying a standard core of Apache Hadoop ecosystem projects
- Fixes for issues found in ODP testing will be contributed to the ASF projects in line with ASF processes
- The ODP will not override or replace any aspect of ASF governance

28 IBM Open Data Platform as of V4.0
Open Data Platform (ODP) benefits and IBM open source project currency commitment
Columns: Component | IBM Open Platform V4.0 (ODPi) | Hortonworks HDP 2.2 (ODPi) | Cloudera CDH 5.3
Components: Ambari (N/A), Flume, Hadoop/YARN, HBase, Hive, Knox (N/A), Oozie, Pig, Slider (N/A), Solr, Spark, Sqoop, ZooKeeper

29 IBM Open Data Platform as of V4.1
Open Data Platform (ODP) benefits and IBM open source project currency commitment
Columns: Component | IBM Open Platform V4.1 (ODPi) | Hortonworks HDP 2.3 (ODPi) | Cloudera CDH
Components: Ambari (N/A), Flume, Hadoop/YARN, HBase, Hive, Kafka, Knox (N/A), Oozie, Pig, Slider (N/A), Solr, Spark, Sqoop, ZooKeeper

30 Enabling Personas with Capabilities
- Persona: Business Analyst
  - Need: Discover data for analysis; visualize data for action; reduce learning curve by leveraging existing skills (SQL, spreadsheets)
  - IBM Value: Complete and Fast. Big SQL runs 100% of Hadoop-DS queries and 3.6x faster query time over Impala (Audited Hadoop-DS benchmark)
- Persona: Data Scientist
  - Need: Identify patterns, trends, insights with machine learning algorithms; apply statistical models to large scale data
  - IBM Value: Customer Insight. Large financial services company analyzed 4 billion tweets and identified 110 million client profiles that matched with at least 90 percent precision
- Persona: Administrator
  - Need: Manage workloads and schedule jobs to ensure performance; secure environment to reduce risk
  - IBM Value: Performance. 4x improvement in running MapReduce jobs over (STAC report)

31 Packaging Structure
IBM BigInsights for Apache Hadoop
- IBM BigInsights Analyst: Big SQL, BigSheets
- IBM BigInsights Data Scientist: Text Analytics, Machine Learning on Big R, Big R, Big SQL, BigSheets
- IBM BigInsights Enterprise Management: POSIX Distributed Filesystem; Multi-workload, Multi-tenant scheduling
- IBM Open Platform with Apache Hadoop

32 Text Analytics

33 BigInsights and Text Analytics
- Distills structured info from unstructured text
  - Sentiment analysis
  - Consumer behavior
  - Illegal or suspicious activities
  - Parses text and detects meaning with annotators
- Understands the context in which the text is analyzed
- Features pre-built extractors for names, addresses, phone numbers, etc.
- Classification and Insight
- Multiple Languages
Example of unstructured text (document, etc.): "Football World Cup 2010, one team distinguished themselves well, losing to the eventual champions 1-0 in the Final. Early in the second half, Netherlands striker, Arjen Robben, had a breakaway, but the keeper for Spain, Iker Casillas made the save. Winger Andres Iniesta scored for Spain for the win."

34 Text Analytics Tooling
- Web-based tool to define rules to extract data and derive information from unstructured text
- Graphical interface to describe structure of various textual formats, from log file data to natural language

35 BigSheets

36 BigSheets
Browser-based analytics tool for Big Data
- Explore, visualize, transform unstructured and structured data
- Visual data cleansing and analysis
- Filter and enrich content
- Visualize data
- Export data into common formats
No programming knowledge needed!

37 Geospatial Analytics
- New for Version 4
- Latitude/Longitude inputs in WKT format
- Over 35 geospatial functions such as:
  - Area
  - Distance
  - Contains
  - Crosses
  - Difference
  - Union
  - many more!

38 Big R and Scalable Machine Learning

39 Challenges with Running Large-Scale Analytics
- TRADITIONAL APPROACH: analyze small subsets of information (the analyzed information is a fraction of all available information)
- BIG DATA APPROACH: analyze all information (all available information is analyzed)

40 User Experience for Big R
[Annotated code screenshot; steps shown:]
1. Connect to BI cluster
2. Data frame proxy to large data file
3. Data transformation step
4. Run scalable linear regression on cluster

41 NEW: Underneath Big R's ML Algorithms
- High-level declarative language with R-like syntax
- Cost-based optimizer
- Automatic parallelization
- Execution plan based on data characteristics and Hadoop configuration
- 5+ years of development in IBM Research
Key Benefits:
- Automatic performance tuning
- Protects data science investment as platform progresses

42 Big R Machine Learning: Scalability and Performance
[Chart: bigr.lm compared with R. Performance (data fits in memory): 28X speedup. Scalability (data larger than aggregate memory): R runs out of memory, while bigr.lm scales beyond cluster memory.]

43 3 Key Capabilities of Big R
1. Use of familiar R language on Hadoop
   - Running native R functions
   - Existing R assets (code & CRAN)
2. NEW: Run scalable machine learning algorithms beyond R in Hadoop
   - Wide class of algorithms and growing
   - R-like syntax for new algorithms and for customizing existing algorithms
3. NEW: Leverage scale of Hadoop for faster insights
   - Only IBM can use the entire cluster memory
   - Only IBM can spill to disk
   - Only IBM can run thousands of models in parallel

44 Big SQL

45 SQL is the most prevalent analytical skill available in most data-driven organizations
SQL ON HADOOP = BROAD ACCESS TO ANALYTICS ON HADOOP

46 SQL on Hadoop Matters for Big Data Analytics
For BI tools like Cognos
[Image: visualizations from Cognos]

47 Hive is Really 3 Things
Storage Model, Metastore, and Execution Engine
[Diagram, top to bottom:]
- Applications
- SQL Execution Engine: Hive (open source), running on MapReduce
- Hive Metastore (open source)
- Hive Storage Model (open source): CSV, Tab Delim., Parquet, RC, Others

48 Big SQL preserves open source foundation
Co-exists with Hive by using metastore and storage formats. No lock-in.
[Diagram, top to bottom:]
- Applications
- SQL Execution Engines: IBM Big SQL (IBM), a C/C++ MPP engine; Hive (open source), running on MapReduce
- Hive Metastore (open source)
- Hive Storage Model (open source): CSV, Tab Delim., Parquet, RC, Others

49 Ok. But... WHY WOULD I WANT TO DO THAT?

50 IBM First to Produce Audited Benchmark
Hadoop-DS (based on TPC-DS)
- Letters of attestation are available for both Hadoop-DS benchmarks at 10TB and 30TB scale
- InfoSizing, Transaction Processing Performance Council Certified Auditors, verified both IBM results as well as results on Cloudera Impala and Hortonworks Hive
- These results are for a non-TPC benchmark. A subset of the TPC-DS Benchmark standard requirements was implemented

51 IBM Big SQL Runs 100% of the queries
Other environments require significant effort at scale
Key points
- With Impala and Hive, many queries needed to be rewritten, some significantly
- Owing to various restrictions, some queries could not be rewritten or failed at run-time
- Rewriting queries in a benchmark scenario where results are known is one thing; doing this against real databases in production is another
Results for 10TB scale shown here

52 Hadoop-DS benchmark: Single user, 10TB
Big SQL is 3.6x faster than Impala and 5.4x faster than Hive 0.13 for a single query stream using 46 common queries.
Based on IBM internal tests comparing BigInsights Big SQL, Cloudera Impala and Hortonworks Hive (current versions available as of 9/01/2014) running on identical hardware. The test workload was based on the latest revision of the TPC-DS benchmark specification at 10TB data size. Successful executions measure the ability to execute queries a) directly from the specification without modification, b) after simple modifications, c) after extensive query rewrites. All minor modifications are either permitted by the TPC-DS benchmark specification or are of a similar nature. All queries were reviewed and attested by a TPC certified auditor. Development effort measured the time required by a skilled SQL developer familiar with each system to modify queries so they will execute correctly. Performance test measured scaled query throughput per hour of 4 concurrent users executing a common subset of 46 queries across all 3 systems at 10TB data size. Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment. Cloudera, the Cloudera logo, Cloudera Impala are trademarks of Cloudera. Hortonworks, the Hortonworks logo and other Hortonworks trademarks are trademarks of Hortonworks Inc. in the United States and other countries.

53 But, benchmarking against Hive ISN'T ALL THAT INTERESTING ANYMORE

54 Hadoop-DS Performance Test Update Spark SQL vs. Big SQL November 14, 2015

55 Performance Test
Hadoop-DS (based on TPC-DS), 20 (Physical Node) Cluster
- TPC-DS stands for Transaction Processing Performance Council Decision Support (workload), which is an industry-standard benchmark for SQL
- Spark: IBM Open Platform V Nodes
- Big SQL V4.1: IBM Open Platform V Nodes
IBM Open Platform V4.1 shipped with Spark but upgraded to
*Not an official TPC-DS Benchmark.

56 But first... is raw performance EVERYTHING?

57 Big SQL Security: Best In Class
- Role Based Access Control
- Row Level Security (diagram labels: BRANCH_A, BRANCH_B, FINANCE)
- Column Level Security
- Separation of Duties / Audit

58 Big SQL runs more SQL out-of-box
Big SQL is the only engine that can execute all 99 queries with minimal porting effort
Porting effort: Big SQL 4.1: 1 hour; Spark SQL: 3-4 weeks

59 Big SQL 4.1 vs. Spark 1.5.0, Single 1TB

60 Big SQL 4.1 vs. Spark 1.5.0, Single 1TB

61 Big SQL 4.1 vs. Spark 1.5.0, Single 1TB

62 Big SQL 4.1 vs. Spark 1.5.0, Single 1TB

63 Conclusions: Big SQL vs. Spark 1TB TPC-DS
- Single Stream Results:
  - Big SQL was faster than Spark SQL on 76 / 99 queries
  - When Big SQL was slower, it was only slower by 1.6X on average
  - Query vs. query, Big SQL was on average 5.5X faster
  - Removing Top 5 / Bottom 5, Big SQL was 2.5X faster
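The gap between the 5.5X average and the 2.5X figure after removing the top and bottom five queries is what one would expect when a few queries have very large speedups. The sketch below shows that effect with a trimmed mean. It is an assumption that the slide's averages are arithmetic means of per-query ratios; the class `SpeedupSketch`, its methods, and the timing values are hypothetical and are not taken from the benchmark.

```java
import java.util.Arrays;

// Illustrative only: how a few outliers pull up an average of per-query speedups.
class SpeedupSketch {
    // Per-query speedup: baseline elapsed time divided by candidate elapsed time.
    static double[] ratios(double[] baselineSecs, double[] candidateSecs) {
        if (baselineSecs.length != candidateSecs.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        double[] result = new double[baselineSecs.length];
        for (int i = 0; i < result.length; i++) {
            result[i] = baselineSecs[i] / candidateSecs[i];
        }
        return result;
    }

    // Plain arithmetic mean.
    static double mean(double[] values) {
        return Arrays.stream(values).average().orElse(Double.NaN);
    }

    // Mean after dropping the k smallest and k largest values.
    static double trimmedMean(double[] values, int k) {
        if (k < 0 || 2 * k >= values.length) {
            throw new IllegalArgumentException("k is too large for this array");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        return mean(Arrays.copyOfRange(sorted, k, sorted.length - k));
    }

    public static void main(String[] args) {
        // Hypothetical elapsed times in seconds for four queries.
        double[] baseline = {10, 20, 30, 400};
        double[] candidate = {10, 10, 10, 4};
        double[] speedups = ratios(baseline, candidate); // {1, 2, 3, 100}

        System.out.println(mean(speedups));           // prints 26.5
        System.out.println(trimmedMean(speedups, 1)); // prints 2.5
    }
}
```

Reporting both numbers, as the slide does, lets a reader see how much of the headline average comes from a handful of extreme queries.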

64 But that was just a single stream/user. What happens when you SCALE IT?

65 But, what happens when you scale it?
Scale: More Users (Single Stream, 4 Concurrent Streams) by More Data (1 TB, 10 TB)
- 1 TB, Single Stream: Big SQL was faster on 76 / 99 queries. Big SQL averaged 5.5X faster. Removing Top / Bottom 5, Big SQL averaged 2.5X faster.
- 10 TB, Single Stream: Big SQL was faster on 87 / 99 queries. Spark SQL FAILED on 7 queries. Big SQL averaged 6.2X faster*. Removing Top / Bottom 5, Big SQL averaged 4.6X faster.
- 1 TB, 4 Concurrent Streams: Spark SQL FAILED on 3 queries. Big SQL was 4.4X faster*. Big SQL elapsed time for workload was better than linear.
- 10 TB, 4 Concurrent Streams: Spark SQL could not complete the workload (numerous issues). Partial results possible with only 2 concurrent streams.
*Compares only queries that both Big SQL and Spark SQL could complete (benefits Spark SQL)

66 Recommendation: Right Tool for the Right Job
- Big SQL: migrating existing workloads to Hadoop; security; many concurrent users; best-in-class performance. Ideal tool for BI data analysts and production workloads.
- Spark SQL: machine learning; simpler SQL; good performance. Ideal tool for data scientists and discovery.
Not mutually exclusive: Big SQL and Spark SQL can co-exist in the cluster.

67 QUESTIONS


More information

Big Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018

Big Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018 Big Data com Hadoop Impala, Hive e Spark VIII Sessão - SQL Bahia 03/03/2018 Diógenes Pires Connect with PASS Sign up for a free membership today at: pass.org #sqlpass Internet Live http://www.internetlivestats.com/

More information

exam. Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Version 1.0

exam.   Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Version 1.0 70-775.exam Number: 70-775 Passing Score: 800 Time Limit: 120 min File Version: 1.0 Microsoft 70-775 Perform Data Engineering on Microsoft Azure HDInsight Version 1.0 Exam A QUESTION 1 You use YARN to

More information

CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED DATA PLATFORM

CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED DATA PLATFORM CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED PLATFORM Executive Summary Financial institutions have implemented and continue to implement many disparate applications

More information

Blended Learning Outline: Cloudera Data Analyst Training (171219a)

Blended Learning Outline: Cloudera Data Analyst Training (171219a) Blended Learning Outline: Cloudera Data Analyst Training (171219a) Cloudera Univeristy s data analyst training course will teach you to apply traditional data analytics and business intelligence skills

More information

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?

More information

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development:: Title Duration : Apache Spark Development : 4 days Overview Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized

More information

Modern Data Warehouse The New Approach to Azure BI

Modern Data Warehouse The New Approach to Azure BI Modern Data Warehouse The New Approach to Azure BI History On-Premise SQL Server Big Data Solutions Technical Barriers Modern Analytics Platform On-Premise SQL Server Big Data Solutions Modern Analytics

More information

Distributed Systems 16. Distributed File Systems II

Distributed Systems 16. Distributed File Systems II Distributed Systems 16. Distributed File Systems II Paul Krzyzanowski pxk@cs.rutgers.edu 1 Review NFS RPC-based access AFS Long-term caching CODA Read/write replication & disconnected operation DFS AFS

More information

1Z Oracle Big Data 2017 Implementation Essentials Exam Summary Syllabus Questions

1Z Oracle Big Data 2017 Implementation Essentials Exam Summary Syllabus Questions 1Z0-449 Oracle Big Data 2017 Implementation Essentials Exam Summary Syllabus Questions Table of Contents Introduction to 1Z0-449 Exam on Oracle Big Data 2017 Implementation Essentials... 2 Oracle 1Z0-449

More information

Stages of Data Processing

Stages of Data Processing Data processing can be understood as the conversion of raw data into a meaningful and desired form. Basically, producing information that can be understood by the end user. So then, the question arises,

More information

Hortonworks and The Internet of Things

Hortonworks and The Internet of Things Hortonworks and The Internet of Things Dr. Bernhard Walter Solutions Engineer About Hortonworks Customer Momentum ~700 customers (as of November 4, 2015) 152 customers added in Q3 2015 Publicly traded

More information

Hadoop Online Training

Hadoop Online Training Hadoop Online Training IQ training facility offers Hadoop Online Training. Our Hadoop trainers come with vast work experience and teaching skills. Our Hadoop training online is regarded as the one of the

More information

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect Big Data Big Data Analyst INTRODUCTION TO BIG DATA ANALYTICS ANALYTICS PROCESSING TECHNIQUES DATA TRANSFORMATION & BATCH PROCESSING REAL TIME (STREAM) DATA PROCESSING Big Data Engineer BIG DATA FOUNDATION

More information

DATA SCIENCE USING SPARK: AN INTRODUCTION

DATA SCIENCE USING SPARK: AN INTRODUCTION DATA SCIENCE USING SPARK: AN INTRODUCTION TOPICS COVERED Introduction to Spark Getting Started with Spark Programming in Spark Data Science with Spark What next? 2 DATA SCIENCE PROCESS Exploratory Data

More information

Introduction to BigData, Hadoop:-

Introduction to BigData, Hadoop:- Introduction to BigData, Hadoop:- Big Data Introduction: Hadoop Introduction What is Hadoop? Why Hadoop? Hadoop History. Different types of Components in Hadoop? HDFS, MapReduce, PIG, Hive, SQOOP, HBASE,

More information

Hadoop Overview. Lars George Director EMEA Services

Hadoop Overview. Lars George Director EMEA Services Hadoop Overview Lars George Director EMEA Services 1 About Me Director EMEA Services @ Cloudera Consulting on Hadoop projects (everywhere) Apache Committer HBase and Whirr O Reilly Author HBase The Definitive

More information

IBM BigInsights V IBM Corporation

IBM BigInsights V IBM Corporation IBM BigInsights V4 Solution for Big Data Rest Data: Data to analyze are already stored (structured and unstructured) Examples: logs, facebook, twitter, etc. Solution: Hadoop (open source) / IBM BigInsights

More information

Hadoop course content

Hadoop course content course content COURSE DETAILS 1. In-detail explanation on the concepts of HDFS & MapReduce frameworks 2. What is 2.X Architecture & How to set up Cluster 3. How to write complex MapReduce Programs 4. In-detail

More information

Hadoop, Yarn and Beyond

Hadoop, Yarn and Beyond Hadoop, Yarn and Beyond 1 B. R A M A M U R T H Y Overview We learned about Hadoop1.x or the core. Just like Java evolved, Java core, Java 1.X, Java 2.. So on, software and systems evolve, naturally.. Lets

More information

Exam Questions

Exam Questions Exam Questions 70-775 Perform Data Engineering on Microsoft Azure HDInsight (beta) https://www.2passeasy.com/dumps/70-775/ NEW QUESTION 1 You are implementing a batch processing solution by using Azure

More information

A BigData Tour HDFS, Ceph and MapReduce

A BigData Tour HDFS, Ceph and MapReduce A BigData Tour HDFS, Ceph and MapReduce These slides are possible thanks to these sources Jonathan Drusi - SCInet Toronto Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing SICS; Yahoo!

More information

Processing Unstructured Data. Dinesh Priyankara Founder/Principal Architect dinesql Pvt Ltd.

Processing Unstructured Data. Dinesh Priyankara Founder/Principal Architect dinesql Pvt Ltd. Processing Unstructured Data Dinesh Priyankara Founder/Principal Architect dinesql Pvt Ltd. http://dinesql.com / Dinesh Priyankara @dinesh_priya Founder/Principal Architect dinesql Pvt Ltd. Microsoft Most

More information

Oracle Big Data SQL. Release 3.2. Rich SQL Processing on All Data

Oracle Big Data SQL. Release 3.2. Rich SQL Processing on All Data Oracle Big Data SQL Release 3.2 The unprecedented explosion in data that can be made useful to enterprises from the Internet of Things, to the social streams of global customer bases has created a tremendous

More information

Hadoop. copyright 2011 Trainologic LTD

Hadoop. copyright 2011 Trainologic LTD Hadoop Hadoop is a framework for processing large amounts of data in a distributed manner. It can scale up to thousands of machines. It provides high-availability. Provides map-reduce functionality. Hides

More information

HDP Security Overview

HDP Security Overview 3 HDP Security Overview Date of Publish: 2018-07-15 http://docs.hortonworks.com Contents HDP Security Overview...3 Understanding Data Lake Security... 3 What's New in This Release: Knox... 5 What's New

More information

HDP Security Overview

HDP Security Overview 3 HDP Security Overview Date of Publish: 2018-07-15 http://docs.hortonworks.com Contents HDP Security Overview...3 Understanding Data Lake Security... 3 What's New in This Release: Knox... 5 What's New

More information

The Technology of the Business Data Lake. Appendix

The Technology of the Business Data Lake. Appendix The Technology of the Business Data Lake Appendix Pivotal data products Term Greenplum Database GemFire Pivotal HD Spring XD Pivotal Data Dispatch Pivotal Analytics Description A massively parallel platform

More information

BigInsights and Cognos Stefan Hubertus, Principal Solution Specialist Cognos Wilfried Hoge, IT Architect Big Data IBM Corporation

BigInsights and Cognos Stefan Hubertus, Principal Solution Specialist Cognos Wilfried Hoge, IT Architect Big Data IBM Corporation BigInsights and Cognos Stefan Hubertus, Principal Solution Specialist Cognos Wilfried Hoge, IT Architect Big Data 2013 IBM Corporation A Big Data architecture evolves from a traditional BI architecture

More information

HADOOP COURSE CONTENT (HADOOP-1.X, 2.X & 3.X) (Development, Administration & REAL TIME Projects Implementation)

HADOOP COURSE CONTENT (HADOOP-1.X, 2.X & 3.X) (Development, Administration & REAL TIME Projects Implementation) HADOOP COURSE CONTENT (HADOOP-1.X, 2.X & 3.X) (Development, Administration & REAL TIME Projects Implementation) Introduction to BIGDATA and HADOOP What is Big Data? What is Hadoop? Relation between Big

More information

Introduction into Big Data analytics Lecture 3 Hadoop ecosystem. Janusz Szwabiński

Introduction into Big Data analytics Lecture 3 Hadoop ecosystem. Janusz Szwabiński Introduction into Big Data analytics Lecture 3 Hadoop ecosystem Janusz Szwabiński Outlook of today s talk Apache Hadoop Project Common use cases Getting started with Hadoop Single node cluster Further

More information

HBase... And Lewis Carroll! Twi:er,

HBase... And Lewis Carroll! Twi:er, HBase... And Lewis Carroll! jw4ean@cloudera.com Twi:er, LinkedIn: @jw4ean 1 Introduc@on 2010: Cloudera Solu@ons Architect 2011: Cloudera TAM/DSE 2012-2013: Cloudera Training focusing on Partners and Newbies

More information

Embedded Technosolutions

Embedded Technosolutions Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication

More information

From Single Purpose to Multi Purpose Data Lakes. Thomas Niewel Technical Sales Director DACH Denodo Technologies March, 2019

From Single Purpose to Multi Purpose Data Lakes. Thomas Niewel Technical Sales Director DACH Denodo Technologies March, 2019 From Single Purpose to Multi Purpose Data Lakes Thomas Niewel Technical Sales Director DACH Denodo Technologies March, 2019 Agenda Data Lakes Multiple Purpose Data Lakes Customer Example Demo Takeaways

More information

Oracle Big Data Fundamentals Ed 2

Oracle Big Data Fundamentals Ed 2 Oracle University Contact Us: 1.800.529.0165 Oracle Big Data Fundamentals Ed 2 Duration: 5 Days What you will learn In the Oracle Big Data Fundamentals course, you learn about big data, the technologies

More information

DHANALAKSHMI COLLEGE OF ENGINEERING, CHENNAI

DHANALAKSHMI COLLEGE OF ENGINEERING, CHENNAI DHANALAKSHMI COLLEGE OF ENGINEERING, CHENNAI Department of Information Technology IT6701 - INFORMATION MANAGEMENT Anna University 2 & 16 Mark Questions & Answers Year / Semester: IV / VII Regulation: 2013

More information

We are ready to serve Latest Testing Trends, Are you ready to learn.?? New Batches Info

We are ready to serve Latest Testing Trends, Are you ready to learn.?? New Batches Info We are ready to serve Latest Testing Trends, Are you ready to learn.?? New Batches Info START DATE : TIMINGS : DURATION : TYPE OF BATCH : FEE : FACULTY NAME : LAB TIMINGS : About Quality Thought We are

More information

Gain Insights From Unstructured Data Using Pivotal HD. Copyright 2013 EMC Corporation. All rights reserved.

Gain Insights From Unstructured Data Using Pivotal HD. Copyright 2013 EMC Corporation. All rights reserved. Gain Insights From Unstructured Data Using Pivotal HD 1 Traditional Enterprise Analytics Process 2 The Fundamental Paradigm Shift Internet age and exploding data growth Enterprises leverage new data sources

More information

Hadoop. Introduction to BIGDATA and HADOOP

Hadoop. Introduction to BIGDATA and HADOOP Hadoop Introduction to BIGDATA and HADOOP What is Big Data? What is Hadoop? Relation between Big Data and Hadoop What is the need of going ahead with Hadoop? Scenarios to apt Hadoop Technology in REAL

More information

Top 25 Big Data Interview Questions And Answers

Top 25 Big Data Interview Questions And Answers Top 25 Big Data Interview Questions And Answers By: Neeru Jain - Big Data The era of big data has just begun. With more companies inclined towards big data to run their operations, the demand for talent

More information

10 Million Smart Meter Data with Apache HBase

10 Million Smart Meter Data with Apache HBase 10 Million Smart Meter Data with Apache HBase 5/31/2017 OSS Solution Center Hitachi, Ltd. Masahiro Ito OSS Summit Japan 2017 Who am I? Masahiro Ito ( 伊藤雅博 ) Software Engineer at Hitachi, Ltd. Focus on

More information

Apache Hadoop.Next What it takes and what it means

Apache Hadoop.Next What it takes and what it means Apache Hadoop.Next What it takes and what it means Arun C. Murthy Founder & Architect, Hortonworks @acmurthy (@hortonworks) Page 1 Hello! I m Arun Founder/Architect at Hortonworks Inc. Lead, Map-Reduce

More information

Hortonworks Data Platform

Hortonworks Data Platform Hortonworks Data Platform Workflow Management (August 31, 2017) docs.hortonworks.com Hortonworks Data Platform: Workflow Management Copyright 2012-2017 Hortonworks, Inc. Some rights reserved. The Hortonworks

More information

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The

More information

Oracle GoldenGate for Big Data

Oracle GoldenGate for Big Data Oracle GoldenGate for Big Data The Oracle GoldenGate for Big Data 12c product streams transactional data into big data systems in real time, without impacting the performance of source systems. It streamlines

More information

CISC 7610 Lecture 2b The beginnings of NoSQL

CISC 7610 Lecture 2b The beginnings of NoSQL CISC 7610 Lecture 2b The beginnings of NoSQL Topics: Big Data Google s infrastructure Hadoop: open google infrastructure Scaling through sharding CAP theorem Amazon s Dynamo 5 V s of big data Everyone

More information

Big Data and Hadoop. Course Curriculum: Your 10 Module Learning Plan. About Edureka

Big Data and Hadoop. Course Curriculum: Your 10 Module Learning Plan. About Edureka Course Curriculum: Your 10 Module Learning Plan Big Data and Hadoop About Edureka Edureka is a leading e-learning platform providing live instructor-led interactive online training. We cater to professionals

More information

Big Data landscape Lecture #2

Big Data landscape Lecture #2 Big Data landscape Lecture #2 Contents 1 1 CORE Technologies 2 3 MapReduce YARN 4 SparK 5 Cassandra Contents 2 16 HBase 72 83 Accumulo memcached 94 Blur 10 5 Sqoop/Flume Contents 3 111 MongoDB 12 2 13

More information

Hadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017

Hadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017 Hadoop File System 1 S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y Moving Computation is Cheaper than Moving Data Motivation: Big Data! What is BigData? - Google

More information

Top 25 Hadoop Admin Interview Questions and Answers

Top 25 Hadoop Admin Interview Questions and Answers Top 25 Hadoop Admin Interview Questions and Answers 1) What daemons are needed to run a Hadoop cluster? DataNode, NameNode, TaskTracker, and JobTracker are required to run Hadoop cluster. 2) Which OS are

More information

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context 1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes

More information

Introduction to Big Data. Hadoop. Instituto Politécnico de Tomar. Ricardo Campos

Introduction to Big Data. Hadoop. Instituto Politécnico de Tomar. Ricardo Campos Instituto Politécnico de Tomar Introduction to Big Data Hadoop Ricardo Campos Mestrado EI-IC Análise e Processamento de Grandes Volumes de Dados Tomar, Portugal, 2016 Part of the slides used in this presentation

More information

Flash Storage Complementing a Data Lake for Real-Time Insight

Flash Storage Complementing a Data Lake for Real-Time Insight Flash Storage Complementing a Data Lake for Real-Time Insight Dr. Sanhita Sarkar Global Director, Analytics Software Development August 7, 2018 Agenda 1 2 3 4 5 Delivering insight along the entire spectrum

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

Syncsort DMX-h. Simplifying Big Data Integration. Goals of the Modern Data Architecture SOLUTION SHEET

Syncsort DMX-h. Simplifying Big Data Integration. Goals of the Modern Data Architecture SOLUTION SHEET SOLUTION SHEET Syncsort DMX-h Simplifying Big Data Integration Goals of the Modern Data Architecture Data warehouses and mainframes are mainstays of traditional data architectures and still play a vital

More information

HADOOP FRAMEWORK FOR BIG DATA

HADOOP FRAMEWORK FOR BIG DATA HADOOP FRAMEWORK FOR BIG DATA Mr K. Srinivas Babu 1,Dr K. Rameshwaraiah 2 1 Research Scholar S V University, Tirupathi 2 Professor and Head NNRESGI, Hyderabad Abstract - Data has to be stored for further

More information

Importing and Exporting Data Between Hadoop and MySQL

Importing and Exporting Data Between Hadoop and MySQL Importing and Exporting Data Between Hadoop and MySQL + 1 About me Sarah Sproehnle Former MySQL instructor Joined Cloudera in March 2010 sarah@cloudera.com 2 What is Hadoop? An open-source framework for

More information

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet Ema Iancuta iorhian@gmail.com Radu Chilom radu.chilom@gmail.com Big data analytics / machine learning 6+ years

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information