Big Data Analysis using Hadoop. Lecture 4. Hadoop EcoSystem



Overview
  Hive
  HBase
  Sqoop
  Pig
  Mahout / Spark / Flink / Storm

Hive
Data warehousing solution built on top of Hadoop
Provides a SQL-like query language named HiveQL
  Minimal learning curve for people with SQL expertise
  Data analysts are the target audience
Ability to bring structure to various data formats
Simple interface for ad hoc querying, analyzing and summarizing large amounts of data
Access to files on various data stores such as HDFS, etc.
Website : Download : Documentation :

Hive
Hive does NOT provide low-latency or real-time queries
Even querying small amounts of data may take minutes
Designed for scalability and ease of use rather than low-latency responses
Translates HiveQL statements into a set of MapReduce jobs, which are then executed on a Hadoop cluster

Hive Concepts
Re-used from relational databases
  Database: set of tables, used for name conflict resolution
  Table: set of rows that have the same schema (same columns)
  Row: a single record; a set of columns
  Column: provides value and type for a single value
Tables can be divided up based on
  Partitions
  Buckets

Hive
Let's work through a simple example
  1. Create a Table
  2. Load Data into a Table
  3. Query Data
  4. Drop a Table

Hive - 1. Create a Table

    hive> !cat data/user-posts.txt;
    user1,funny Story,
    user2,cool Deal,
    user4,interesting Post,
    user5,yet Another Blog,
    hive>

Values are separated by ',' and each row represents a record; the first value is the user name, the second is the post content and the third is the timestamp.

    hive> CREATE TABLE posts (user STRING, post STRING, time BIGINT)
        > ROW FORMAT DELIMITED
        > FIELDS TERMINATED BY ','
        > STORED AS TEXTFILE;
    OK
    Time taken: seconds

1st line: creates a table with 3 columns
2nd and 3rd lines: how the underlying file should be parsed
4th line: how to store the data

    hive> show tables;
    OK
    posts
    Time taken: seconds
    hive> describe posts;
    OK
    user string
    post string
    time bigint
    Time taken: seconds

Hive - 2. Load Data into a Table

    hive> LOAD DATA LOCAL INPATH 'data/user-posts.txt'
        > OVERWRITE INTO TABLE posts;
    Copying data from file:/home/hadoop/training/play_area/data/user-posts.txt
    Copying file: file:/home/hadoop/training/play_area/data/user-posts.txt
    Loading data to table default.posts
    Deleted /user/hive/warehouse/posts
    OK
    Time taken: seconds
    hive>

Existing records in the table posts are deleted; the data in user-posts.txt is loaded into Hive's posts table.

    $ hdfs dfs -cat /user/hive/warehouse/posts/user-posts.txt
    user1,funny Story,
    user2,cool Deal,
    user4,interesting Post,
    user5,yet Another Blog,

Hive - 3. Query Data

    hive> select count (1) from posts;
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    ...
    Starting Job = job_ _0004, Tracking URL =
    Kill Command = hadoop job -Dmapred.job.tracker=localhost: kill job_ _0004
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers:
    :37:24,962 Stage-1 map = 0%, reduce = 0%
    :37:30,497 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.87 sec
    :37:31,577 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.87 sec
    :37:32,664 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.64 sec
    MapReduce Total cumulative CPU time: 2 seconds 640 msec
    Ended Job = job_ _0004
    MapReduce Jobs Launched:
    Job 0: Map: 1 Reduce: 1 Accumulative CPU: 2.64 sec HDFS Read: 0 HDFS Write: 0 SUCESS
    Total MapReduce CPU Time Spent: 2 seconds 640 msec
    OK
    4
    Time taken: seconds

Counts the number of records in the posts table
Hive transformed the HiveQL into 1 MapReduce job
Result is 4 records
How long did it take to run?

Hive - 3. Query Data

    hive> select * from posts where user="user2";
    OK
    user2 Cool Deal
    Time taken: seconds

Selects the records for "user2"

    hive> select * from posts where time<= limit 2;
    OK
    user1 Funny Story
    user2 Cool Deal
    Time taken: seconds
    hive>

Selects the records whose timestamp is less than or equal to the provided value
There are usually too many results to display, so the limit clause can be used to bound the output

Hive - 4. Drop a Table

    hive> DROP TABLE posts;
    OK
    Time taken: seconds
    hive> exit;
    $ hdfs dfs -ls /user/hive/warehouse/

If Hive was managing the underlying file, then it will be removed.

Partitions and Buckets
Partitions divide data by grouping similar data together based on a column or partition key. Each table can have one or more partition keys to identify a particular partition. This allows faster queries on slices of the data.
Buckets divide each partition, or the unpartitioned table, into N buckets based on a hash function of one or more columns in the table.

Partitions and Buckets

    CREATE TABLE page_view(
        viewtime INT,
        userid BIGINT,
        page_url STRING,
        referrer_url STRING,
        ip STRING COMMENT 'IP Address of the User')
    COMMENT 'This is the page view table'
    PARTITIONED BY(dt STRING, country STRING)
    STORED AS SEQUENCEFILE;
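The placement rule described above (partition by key values, then bucket by a hash of a column) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample rows are made up, `NUM_BUCKETS`, `partition_key`, `bucket_id` and `place` are hypothetical names, and CRC32 stands in for whatever hash function Hive actually uses.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 32  # hypothetical N; any positive bucket count works


def partition_key(row):
    """Rows with the same (dt, country) values land in the same partition."""
    return (row["dt"], row["country"])


def bucket_id(userid, num_buckets=NUM_BUCKETS):
    """Illustrative hash: CRC32 of the userid modulo the bucket count.
    Hive's own hash function is different; this only shows the idea."""
    return zlib.crc32(str(userid).encode("utf-8")) % num_buckets


def place(rows):
    """Group rows by partition, then by bucket within each partition."""
    layout = defaultdict(lambda: defaultdict(list))
    for row in rows:
        layout[partition_key(row)][bucket_id(row["userid"])].append(row)
    return layout


# Made-up sample rows using the page_view column names.
rows = [
    {"userid": 1, "dt": "2008-06-08", "country": "US", "page_url": "/a"},
    {"userid": 2, "dt": "2008-06-08", "country": "US", "page_url": "/b"},
    {"userid": 1, "dt": "2008-06-08", "country": "IE", "page_url": "/c"},
    {"userid": 1, "dt": "2008-06-08", "country": "US", "page_url": "/d"},
]
layout = place(rows)
```

A query that filters on dt and country only needs to read one entry of `layout`, and all rows for a given userid inside that partition sit in the same bucket.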

Partitions and Buckets

    CREATE TABLE page_view(
        viewtime INT,
        userid BIGINT,
        page_url STRING,
        referrer_url STRING,
        ip STRING COMMENT 'IP Address of the User')
    COMMENT 'This is the page view table'
    PARTITIONED BY(dt STRING, country STRING)
    CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
    STORED AS SEQUENCEFILE;

The table is clustered by a hash function of userid into 32 buckets. Within each bucket the data is sorted in increasing order of viewtime.

Old vs New Hive
Old Hive = MapReduce
New Hive = MapReduce, Spark, Tez
Tez = an application framework which allows for a complex directed-acyclic-graph of tasks for processing data

Hive Word Count

    CREATE TABLE docs (line STRING);
    LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;
    CREATE TABLE word_counts AS
    SELECT word, count(1) AS count
    FROM (SELECT explode(split(line, '\s')) AS word FROM docs) w
    GROUP BY word
    ORDER BY word;
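To see what the word count query above computes, here is a rough Python equivalent of its three steps: split each line into words (the explode/split subquery), count per word (GROUP BY), and sort by word (ORDER BY). It is a sketch only; the function name `word_counts` and the sample lines are made up, and unlike the query, `str.split()` drops the empty tokens produced by repeated whitespace.

```python
from collections import Counter


def word_counts(lines):
    """Return (word, count) pairs sorted by word."""
    counts = Counter()
    for line in lines:
        # One output item per word, like explode(split(line, ...)).
        for word in line.split():
            counts[word] += 1  # GROUP BY word with count(1)
    return sorted(counts.items())  # ORDER BY word


# Made-up input standing in for the rows of the docs table.
docs = ["to be or not to be", "to do is to be"]
result = word_counts(docs)
```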

HBase
HBase is a distributed column-oriented database built on top of the Hadoop file system.
Its data model is similar to Google's Bigtable and is designed to provide quick random access to huge amounts of unstructured data.
Column-oriented database
The components of the HBase data model are tables, rows, column families, columns, cells and versions.
Tables are logical collections of rows stored in separate partitions.
Data in a row is grouped together into column families.
Each column family has one or more columns, and the columns in a family are stored together.
Website : Download : Documentation :
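The data model components listed above (table, row, column family, column, cell, version) can be pictured as nested maps. The following Python toy is a sketch under that assumption; `ToyTable` and its methods are hypothetical names and do not correspond to the HBase client API.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in: row key -> family -> column -> versions."""

    def __init__(self, families):
        self.families = set(families)  # column families are fixed up front
        self.rows = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

    def put(self, row, family, column, value, ts):
        """Write one cell version; versions are kept newest first."""
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        versions = self.rows[row][family][column]
        versions.append((ts, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)

    def get(self, row, family, column):
        """Return the newest value of a cell, or None if never written."""
        versions = self.rows.get(row, {}).get(family, {}).get(column, [])
        return versions[0][1] if versions else None

    def scan(self, start, stop):
        """Range scan: sorted row keys in the half-open range [start, stop)."""
        return sorted(k for k in self.rows if start <= k < stop)


table = ToyTable(["info", "stats"])
table.put("user1", "info", "name", "Ann", ts=1)
table.put("user1", "info", "name", "Anna", ts=2)  # newer version of the same cell
table.put("user2", "info", "name", "Bob", ts=1)
```

Cells that were never written simply do not exist in the maps, which mirrors the point that nulls take up no space.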

HBase
Non-ACID-compliant database
HBase shell with a limited range of commands
Libraries for Java and many other languages
MapReduce supported

HBase?

HBase
Good at
  Single random selects and range scans
  Querying one or a small subset of columns
  Data compaction, as nulls are ignored
Not so good at
  Transactions
  Joins, GROUP BYs, WHERE clauses
  Only one index per table

Pig
Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
Pig is an abstraction on top of Hadoop
  Provides a high-level programming language designed for data processing
  Converted into MapReduce and executed on Hadoop clusters
MapReduce requires programmers
  Must think in terms of map and reduce functions
  More than likely will require Java programmers
Pig provides a high-level language that can be used by analysts, data scientists, statisticians, etc.
  A different type of user compared to those who write MapReduce functions
Website : Download : Documentation :

Pig
Pig's infrastructure layer consists of a compiler that produces sequences of MapReduce programs. Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:
  Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
  Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
  Extensibility. Users can create their own functions to do special-purpose processing.

Pig Latin
Command-based language
Data flow language rather than procedural or declarative
Designed specifically for data transformation and flow expression
Pig compiler converts Pig Latin to MapReduce
  Compiler strives to optimize execution
  You automatically get optimization improvements with Pig updates
Provides common operations like join, group, filter, sort

Pig Examples - Aggregation
Let's count the number of times each user appears in the excite data set.

    log = LOAD 'excite-small.log' AS (user, timestamp, query);
    grpd = GROUP log BY user;
    cntd = FOREACH grpd GENERATE group, COUNT(log) AS cnt;
    STORE cntd INTO 'output';

Results: 002BB5A52580A8ED BD9CD3AC6BB

Pig Examples - Filtering
Let's apply a filter to the groups so that we only get the high-frequency users.

    log = LOAD 'excite-small.log' AS (user, timestamp, query);
    grpd = GROUP log BY user;
    cntd = FOREACH grpd GENERATE group, COUNT(log) AS cnt;
    fltrd = FILTER cntd BY cnt > 50;
    STORE fltrd INTO 'output';

Results: 0B294E3062F036C CE647F6 78 7D286B5592D83BBE 59..

Pig Examples - Ordering
Let's sort the high-frequency users by frequency.

    log = LOAD 'excite-small.log' AS (user, timestamp, query);
    grpd = GROUP log BY user;
    cntd = FOREACH grpd GENERATE group, COUNT(log) AS cnt;
    fltrd = FILTER cntd BY cnt > 50;
    srtd = ORDER fltrd BY cnt;
    STORE srtd INTO 'output';

Results: 7D286B5592D83BBE 59 0B294E3062F036C CE647F
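For readers more used to general-purpose code, the group / count / filter / order data flow of the Pig examples above maps onto a few lines of Python. This is a sketch, not how Pig executes the script: the function name `high_frequency_users` and the sample records are made up, and the threshold is a parameter instead of the fixed 50.

```python
from collections import Counter


def high_frequency_users(records, min_count):
    """records are (user, timestamp, query) tuples, as in the LOAD schema."""
    # GROUP log BY user + COUNT(log): one count per user.
    counts = Counter(user for user, _timestamp, _query in records)
    # FILTER cntd BY cnt > min_count.
    filtered = [(user, cnt) for user, cnt in counts.items() if cnt > min_count]
    # ORDER fltrd BY cnt (ascending, Pig's default).
    return sorted(filtered, key=lambda pair: pair[1])


# Made-up log records; the real excite data set is not reproduced here.
log = (
    [("A", 1, "q")] * 3
    + [("B", 2, "q")] * 1
    + [("C", 3, "q")] * 2
)
srtd = high_frequency_users(log, min_count=1)
```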

Pig Examples - Joining Data
Join the 2 data sets
Join the data sets based on words that are common in both

    bard = LOAD 'shakespeare_freq' AS (freq, word);
    kjv = LOAD 'bible_freq' AS (freq, word);
    inboth = JOIN bard BY word, kjv BY word;
    STORE inboth INTO 'output';

Results: 2 Abide 1 Abide 2 Abraham 111 Abraham 3...

Pig - Summary of Commands
  DESCRIBE: returns the schema of a relation.
  DUMP: dumps or displays results to the screen.
  EXPLAIN: displays execution plans.
  ILLUSTRATE: displays a step-by-step execution of a sequence of statements.

Pig Word Count
We know how much coding is needed to perform WordCount using MapReduce in Java
In Pig it is a bit simpler

    lines = load 'data.txt' as (line);
    words = foreach lines generate flatten(TOKENIZE(line)) as word;
    grpd = group words by word;
    cntd = foreach grpd generate group, COUNT(words);
    dump cntd;

That was much easier, or was it?

Pig - Summary of Pig Commands
  Data Types
  Relational Operators
  User Defined Functions (UDF)
  Mathematical Functions
  Evaluation Functions
  String Functions
  Date & Time Functions
  Bag & Tuple Functions
  Load & Store Functions
  SQL -> Pig mapping
Good examples in the Cheat Sheet doc
Pig Cheat Sheet :

Sqoop
Apache Sqoop allows the transfer of data between Hadoop and an RDBMS
Can import one or any number of tables, or just portions of tables
Uses MapReduce to do the import or export
Works with any JDBC-compatible database
Website : Download : Documentation :

Sqoop
The data set being transferred is sliced up into different partitions
A map-only job is launched, with individual mappers responsible for transferring a slice of the data set
Sqoop uses the database metadata to infer data types

Sqoop
Basic set of commands
Most common are
  Import
  Export
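The slicing idea described above (one slice of the data set per mapper) can be illustrated by cutting a numeric key range into contiguous pieces and building one query per piece. This is a sketch under stated assumptions, not Sqoop's actual splitting code: `split_ranges` and `slice_query` are hypothetical helpers, the key range 1..10 is made up, and the table and column names are borrowed from the import example only for illustration.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Cut the inclusive key range [min_id, max_id] into contiguous slices."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    lo = min_id
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        if size == 0:
            continue  # more mappers than keys: skip empty slices
        hi = lo + size - 1
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges


def slice_query(table, column, lo, hi):
    """Build the bounded query a single mapper would run for its slice."""
    return f"SELECT * FROM {table} WHERE {column} >= {lo} AND {column} <= {hi}"


slices = split_ranges(1, 10, 4)
queries = [slice_query("HR.DEPARTMENTS", "DEPARTMENT_ID", lo, hi) for lo, hi in slices]
```

Because the slices do not overlap and together cover the whole range, the mappers can run independently with no reduce step.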

Sqoop - Import Data
You need the correct JDBC driver for the database you are using
  E.g. the Oracle JDBC driver ojdbc6.jar
  Copy this to your ./lib directory
Import from Oracle: only certain attributes from a table
Data saved to HDFS

    $ bin/sqoop import --connect "jdbc:oracle:thin:@localhost:1521:xe" \
        --password "calgary10" --username "SYSTEM" --table "HR.DEPARTMENTS" \
        --columns "DEPARTMENT_ID,DEPARTMENT_NAME,MANAGER_ID,LOCATION_ID"

Sqoop - Import Data
Import into Hive
Sqoop takes care of populating the Hive metastore with the appropriate metadata for the table
It invokes the necessary commands to load the table or partition, as the case may be.
All of this is done by simply specifying the option --hive-import with the import command.

    $ sqoop import --connect "jdbc:oracle:thin:@localhost:1521:xe" \
        --table ORDERS --password "calgary10" --username "SYSTEM" \
        --hive-import

Sqoop converts the data from the native data types of the external datastore into the corresponding types within Hive.
Sqoop automatically chooses the native delimiter set used by Hive

Sqoop - Export Data
Exports the data in the form of SQL INSERT statements
The table in the destination DB needs to exist before you export the data
Destination table must be empty
Data types must correspond
Can export from HDFS or Hive

    $ sqoop export --connect "jdbc:oracle:thin:@localhost:1521:xe" \
        --table ORDERS --password "calgary10" --username "SYSTEM" \
        --export-dir /user/hive/warehouse/test_db.db/my_objects

--export-dir <directory path>: the directory from which data will be exported.

Export to a DB table with existing data
--update-key <primary key>

    $ sqoop export --connect "jdbc:oracle:thin:@localhost:1521:xe" \
        --table ORDERS --password "calgary10" --username "SYSTEM" \
        --export-dir /user/hive/warehouse/test_db.db/my_objects \
        --update-key id

Mahout / Flink / Storm / Spark

Mahout
Mahout started back in 2008 as a subproject of Apache Lucene, which provides an open source search engine.
Lucene provides advanced implementations of search, text mining and information retrieval
These topics are related to machine learning, and committers migrated onto Mahout
Mahout supports four main data science use cases:
  Collaborative filtering: mines user behavior and makes product recommendations (e.g. Amazon recommendations)
  Clustering: takes items in a particular class (such as web pages or newspaper articles) and organizes them into naturally occurring groups, such that items belonging to the same group are similar to each other
  Classification: learns from existing categorizations and then assigns unclassified items to the best category
  Frequent itemset mining: analyzes items in a group (e.g. items in a shopping cart or terms in a query session) and then identifies which items typically appear together
Website : Download : Documentation :

Mahout
Originally developed to use MapReduce
Java-based set of libraries
Mahout can now use Spark and Flink for the backend
Many are still using the MapReduce version
Original ML library for Hadoop
  But limited functionality
  Newer projects have a lot more functionality

Similarity Measure example
Recommender using Nearest Neighbor

Storm
Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing
Use cases: real-time analytics, online machine learning, continuous computation, distributed RPC, ETL
Website : Download : Documentation :

1/11/18
Storm
Storm - WordCount

Flink
Designed for data that is continuously produced (streaming data): activity logs, web logs, machines, sensors, and database transactions
Flink can process streaming and batch data
  Batch is a special case of stream processing
Storage-agnostic
Good integration with the Hadoop ecosystem
Flink integrates with a wide variety of open source systems for data input and output (e.g., HDFS, Kafka, Elasticsearch, HBase, and others), deployment (e.g., YARN), as well as acting as an execution engine for other frameworks (e.g., Cascading, Apache Beam incubating aka Google Cloud Dataflow).
The Flink project itself comes bundled with a Hadoop MapReduce compatibility layer, a Storm compatibility layer, as well as libraries for machine learning and graph processing
Website : Download : Documentation :

Flink
Flink is used in conjunction with data storage or brokering systems.
A typical architectural pattern is to use Flink in conjunction with Apache Kafka to:
  Ingest data into other systems such as HDFS, databases, or search indices and create continuous ETL pipelines.
  Perform analytics directly on the moving data to create alerts, dashboards, or power operational applications, obviating the need for ingestion and ETL.
  Perform machine learning on streams by continuously building models of the events as they arrive and using the model to serve online recommendations.
Since Flink is a full-fledged system for batch processing as well, it can also be used for applications on top of static data.

Flink
Examples of typical functions in Flink:
  Map, FlatMap, Filter, KeyBy, Reduce, Fold, Aggregations
  Window, WindowAll, Window Apply, Window Reduce, Window Fold, Aggregations on windows
  Union, Window Join, Window CoGroup, Connect, CoMap, CoFlatMap
  Split, Select, Iterate, Extract Timestamps, Project (for data streams of Tuples)
Several APIs in Java/Scala/Python
  DataSet API: batch processing
  DataStream API: real-time streaming analytics
  Table API: relational queries
Domain-specific libraries
  FlinkML: machine learning library for Flink
  Gelly: graph library for Flink
Some benchmarks show Flink is faster at processing streaming data than Spark, but Spark is faster at processing batch data.
At the moment Spark seems to be the better all-round solution.

Spark
MapReduce problems:
  Many problems aren't easily described as map-reduce
  Persistence to disk is typically slower than in-memory work
Apache Spark is a general-purpose processing engine that can be used instead of MapReduce
Supports Java, Scala and Python
Main idea: use the memory resources of the cluster for better performance
Supported by many commercial companies: Cloudera, IBM, Oracle, Microsoft
Website : Download : Documentation :

Spark
Many libraries
  Spark SQL
  Spark Streaming: stream processing of live data streams
  MLlib: machine learning
  GraphX: graph manipulation; extends the Spark RDD with a Graph abstraction: a directed multigraph with properties attached to each vertex and edge.

Spark
Key construct: Resilient Distributed Dataset (RDD)
  Resilient: if the data is lost, it can be recreated from previous steps
  Distributed: appears as a single collection in code, but is actually distributed across nodes
  Dataset: initial data can come from a file or be created programmatically
An RDD is a read-only, partitioned collection of records. RDDs can be created through deterministic operations on either data on stable storage or other RDDs.
Supports in-memory computation
RDDs represent data or transformations on data
  Transformations define a new RDD based on the current one(s)
  Actions can be applied to RDDs; actions force calculations and return values
Lazy evaluation: nothing is computed until an action requires it
  Transformations set things up
  Actions cause calculations to actually be performed
RDDs are best suited for applications that apply the same operation to all elements of a dataset
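The split between transformations and actions described above can be made concrete with a small toy in plain Python. This is not Spark's API or implementation: `ToyRDD` is a hypothetical class that only records its lineage of steps and runs them when an action is called, and the `square` helper logs its calls so the laziness is visible.

```python
class ToyRDD:
    """Read-only dataset that records transformations and runs them lazily."""

    def __init__(self, source, steps=()):
        self._source = source        # stable input data
        self._steps = tuple(steps)   # lineage: ordered transformations

    def map(self, fn):
        """Transformation: returns a new ToyRDD, computes nothing."""
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        """Transformation: returns a new ToyRDD, computes nothing."""
        return ToyRDD(self._source, self._steps + (("filter", pred),))

    def _compute(self):
        # Replay the lineage over the source; this is also how lost
        # results could be recreated from previous steps.
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    def collect(self):
        """Action: forces the computation and returns all values."""
        return list(self._compute())

    def count(self):
        """Action: forces the computation and returns the number of values."""
        return sum(1 for _ in self._compute())


calls = []


def square(x):
    calls.append(x)  # log every invocation to show when work happens
    return x * x


evens = ToyRDD(range(1, 6)).map(square).filter(lambda x: x % 2 == 0)
calls_before_action = list(calls)  # still empty: only the plan exists
result = evens.collect()           # the action triggers the work
```

Each transformation returns a new object and leaves the original untouched, which matches the read-only nature of an RDD.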

Spark
Next topic for this module
Check out Databricks
  Most of the developers of Spark work there
  Lots of resources
Developer resources : Videos : Certification :

Hadoop & Data Architectures
From Small Data to Big Data: working together

Working together to get the job done

SQL: The one language to rule all your data
[Figure labels: Store, Access, Analyze, Protect; External View, Conceptual Schema, Physical Schema; Spatial & Graph, Oracle NoSQL]

Current trends for Hadoop

Typical Application Story: Before Gluent
A complex business application running on an RDBMS
[Figure labels: Customers, Products, Preferences, Promotions, Prices, SALES]
Years of application development & improvement
Upstream & downstream dependencies
Terabytes of historical data (usually years of history)
Big queries run for too long or never complete (or are never tried)
Does not scale with modern demand
Way too expensive: RDBMS + SAN
Application rewrite very costly & risky or virtually impossible

Gluent Data Virtualization (90/10)
[Figure labels: 10% RDBMS + SAN; 90% Hadoop; Customers, Products, Preferences, Promotions, Prices; SALES (10%); Virtual (90%); Gluent; SALES (90%)]
Columnar compression: 6-20x data size reduction
Application still sees all data: app code & architecture unchanged!
Reduce cost, offload data, increase performance
Automatic data flow, no ETL development!



Hadoop Ecosystem
The ever-changing world
What will it look like 12 months from now?

Technology keeps on changing
Hybrid
It's all about the DATA


More information

Big Data Hadoop Stack

Big Data Hadoop Stack Big Data Hadoop Stack Lecture #1 Hadoop Beginnings What is Hadoop? Apache Hadoop is an open source software framework for storage and large scale processing of data-sets on clusters of commodity hardware

More information

Hive and Shark. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic)

Hive and Shark. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic) Hive and Shark Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 1 / 45 Motivation MapReduce is hard to

More information

Prototyping Data Intensive Apps: TrendingTopics.org

Prototyping Data Intensive Apps: TrendingTopics.org Prototyping Data Intensive Apps: TrendingTopics.org Pete Skomoroch Research Scientist at LinkedIn Consultant at Data Wrangling @peteskomoroch 09/29/09 1 Talk Outline TrendingTopics Overview Wikipedia Page

More information

Big Data Infrastructures & Technologies

Big Data Infrastructures & Technologies Big Data Infrastructures & Technologies Spark and MLLIB OVERVIEW OF SPARK What is Spark? Fast and expressive cluster computing system interoperable with Apache Hadoop Improves efficiency through: In-memory

More information
