Hadoop ecosystem. Nikos Parlavantzas


1 1 Hadoop ecosystem Nikos Parlavantzas

2 Lecture overview 2 Objective Provide an overview of a selection of technologies in the Hadoop ecosystem

3 Hadoop ecosystem 3

4 Hadoop ecosystem 4

5 Outline 5 HBase, Hive, Pig, YARN, Storm

6 HBase 6

7 HBase 7 Open-source, distributed, scalable, NoSQL database Created in 2007; written in Java Column-family data model Billions of rows, millions of columns Runs on top of HDFS Can serve as the input and output for MapReduce jobs Supports versioning and compression

8 Data model 8 Each row has a row key A table consists of a set of Column Families Each column family consists of one or more Columns

9 Data model 9 Values may exist in multiple versions. [Figure: example table with the labels "Column family", "Column", and "Value at t3"]

10 Data model 10 Rows are sorted by their row key. Different versions are stored in decreasing timestamp order. All columns of a column family are stored together in the same files. A table has four dimensions: (RowKey, ColumnFamily, Column, Timestamp) → Value
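The four-dimensional mapping can be pictured as nested maps. The following is a minimal Python sketch of that idea, not the HBase API: the `Table` class, its `put`, `get`, and `row_keys` methods, and the sample keys and values are all hypothetical names chosen for illustration.

```python
class Table:
    """Toy model of an HBase table: (row key, family, column, timestamp) -> value."""

    def __init__(self, families):
        self.families = set(families)  # column families are declared up front
        self.rows = {}                 # row key -> {(family, column): [(timestamp, value), ...]}

    def put(self, row_key, family, column, timestamp, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        cells = self.rows.setdefault(row_key, {})
        versions = cells.setdefault((family, column), [])
        versions.append((timestamp, value))
        # keep versions in decreasing timestamp order, newest first
        versions.sort(key=lambda tv: tv[0], reverse=True)

    def get(self, row_key, family, column, max_versions=1):
        cells = self.rows.get(row_key, {})
        return cells.get((family, column), [])[:max_versions]

    def row_keys(self):
        # rows are exposed in row-key order
        return sorted(self.rows)


table = Table(families=["info"])
table.put("row2", "info", "email", 1, "a@example.org")
table.put("row1", "info", "email", 1, "b@example.org")
table.put("row1", "info", "email", 3, "c@example.org")
newest = table.get("row1", "info", "email")
history = table.get("row1", "info", "email", max_versions=2)
```

Here `newest` holds only the version written at timestamp 3, while `history` also returns the older version behind it.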

11 Operations 11 All operations are atomic on a per-row basis Get, put, delete data of a single or multiple rows Scan rows between two row keys Apply custom filters to get / scan operations

12 Distribution model 12 Relies on HDFS for storing data reliably Automatic replication, fault-tolerance, scalability May also support cluster-to-cluster replication One master cluster replicates to any number of slave clusters Automatic sharding Regions are contiguous ranges of rows Regions are dynamically split when they become too large
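Automatic sharding can be illustrated with a small Python sketch. It is an assumption-laden toy: the `RegionedTable` class is hypothetical, and it splits a region when it holds too many rows, whereas HBase decides based on the size of the stored data.

```python
import bisect


class RegionedTable:
    """Toy sharding: each region holds a contiguous, sorted range of row keys."""

    def __init__(self, max_rows_per_region):
        self.max_rows = max_rows_per_region
        self.regions = [[]]  # regions are ordered by key range

    def _find_region(self, row_key):
        # start keys of every region except the first, which is open-ended
        starts = [region[0] for region in self.regions[1:]]
        return bisect.bisect_right(starts, row_key)

    def put(self, row_key):
        idx = self._find_region(row_key)
        region = self.regions[idx]
        pos = bisect.bisect_left(region, row_key)
        if pos < len(region) and region[pos] == row_key:
            return  # row already present
        region.insert(pos, row_key)
        if len(region) > self.max_rows:
            # split the oversized region into two contiguous halves
            mid = len(region) // 2
            self.regions[idx:idx + 1] = [region[:mid], region[mid:]]


sharded = RegionedTable(max_rows_per_region=2)
for key in ["d", "a", "c", "b"]:
    sharded.put(key)
```

After the four inserts the table consists of two regions, one covering the keys "a" to "b" and one covering "c" to "d".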

13 Consistency 13 Provides strong consistency guarantees: if a client writes a value, other clients receive the updated value on the next request. NB: with cluster-to-cluster replication, consistency is relaxed to eventual consistency.

14 Architecture 14

15 HBase components 15 Master Assigns regions to region servers Performs load balancing of regions Redirects clients to region servers RegionServers Serve read/write requests for 1 or more regions Can be added/removed dynamically to accommodate workload Clients Can use native Java API, REST, Thrift, or Avro

16 Hive 16

17 Hive 17 Data warehouse software that facilitates querying and managing large datasets Initially developed at Facebook Supports SQL-like query language (Hive SQL) Translates Hive SQL to MapReduce or Apache Spark Allows plugging in custom mappers and reducers

18 Hive vs. traditional DBMSs 18 Schema on Write (traditional DBMSs) Efficient storage and fast querying Slow data loading Schema on Read (Hive, other NoSQL systems) Fast data loading; the load operation is just a file copy or move Multiple schemas for same underlying data Slow querying
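Schema on read can be sketched in a few lines of Python. This is only an illustration of the idea, not how Hive is implemented: the raw lines, the two schemas, and the `read_with_schema` helper are hypothetical.

```python
# "Loading" is just keeping the raw file content; nothing is parsed or validated.
raw_lines = ["2014\t35\tfr", "2015\t41\tgr"]


def read_with_schema(lines, schema):
    """Apply (name, type) pairs to tab-separated lines at query time."""
    for line in lines:
        fields = line.split("\t")
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


# Two different schemas over the same underlying data.
schema_a = [("year", int), ("temperature", int), ("country", str)]
schema_b = [("year", str), ("reading", float)]

rows_a = list(read_with_schema(raw_lines, schema_a))
rows_b = list(read_with_schema(raw_lines, schema_b))
```

Parsing happens on every read, which is why loading is fast and querying is slower than in a schema-on-write system.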

19 Word count in Hive 19
    CREATE TABLE docs (line STRING);
    LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;
    CREATE TABLE word_counts AS
    SELECT word, count(*) AS count
    FROM (SELECT explode(split(line, '\s')) AS word FROM docs) w
    GROUP BY word
    ORDER BY word;

20 Data model 20 Tables Data is stored into an HDFS directory Metadata is stored in a relational database (Metastore) Partitions Table data may be split horizontally into multiple partitions; each partition corresponds to particular values of partition columns Partitions are stored in nested subdirectories The motivation is to optimise queries

21 Type System 21 Primitive types Integers: TINYINT, SMALLINT, INT, BIGINT Booleans: BOOLEAN Floating point numbers: FLOAT, DOUBLE Strings: STRING, CHAR, VARCHAR Date/Time: TIMESTAMP, DATE Complex types Arrays: ARRAY<data_type> Maps: MAP<primitive_type, data_type> Structs: STRUCT<col_name : data_type,...>

22 Hive SQL 22 Does not support the full SQL syntax, but becomes richer over time. Allows select, insert, join, update, delete. Limited support for transactions, subqueries, indexing. Adds extensions such as multitable inserts and MapReduce scripts.

23 Example Table creation 23
    CREATE TABLE page_views(viewtime INT, userid BIGINT,
        page_url STRING, referrer_url STRING,
        ip STRING COMMENT 'User IP address')
    PARTITIONED BY(date STRING, country STRING)
    STORED AS SEQUENCEFILE;
Partitioning breaks the table into separate files for each (date, country) pair, e.g., /hive/page_view/date= ,country=fr /hive/page_view/date= ,country=gr
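The nested-subdirectory layout, and why it speeds up queries, can be sketched in Python. The helper names `partition_path` and `prune`, the table directory, and the date values are hypothetical; the sketch assumes one directory level per partition column, as described on the Data model slide.

```python
def partition_path(table_dir, partition):
    """Build the nested directory of one partition, one level per partition column."""
    levels = [f"{column}={value}" for column, value in partition]
    return "/".join([table_dir] + levels)


def prune(partitions, **wanted):
    """Keep only the partitions whose column values match the query predicate."""
    return [
        p for p in partitions
        if all(dict(p).get(column) == value for column, value in wanted.items())
    ]


# Hypothetical partitions of a table partitioned by (date, country).
partitions = [
    (("date", "2024-01-01"), ("country", "fr")),
    (("date", "2024-01-01"), ("country", "gr")),
    (("date", "2024-01-02"), ("country", "fr")),
]

paths = [partition_path("/hive/page_views", p) for p in partitions]
french = prune(partitions, country="fr")
```

A query that filters on `country = 'fr'` only needs to read the directories of the partitions in `french`; the remaining directory is never scanned.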

24 Example Multitable insert 24
    FROM records2
    INSERT OVERWRITE TABLE stations_by_year
        SELECT year, COUNT(DISTINCT station)
        GROUP BY year
    INSERT OVERWRITE TABLE records_by_year
        SELECT year, COUNT(*)
        GROUP BY year
    INSERT OVERWRITE TABLE good_records_by_year
        SELECT year, COUNT(*)
        WHERE temperature != 9999
          AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9)
        GROUP BY year;

25 Example MapReduce 25
    FROM (
        FROM records2
        MAP year, temperature, quality
        USING 'is_good_quality.py'
        AS year, temperature) map_output
    REDUCE year, temperature
    USING 'max_temperature_reduce.py'
    AS year, temperature;
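The slide does not show the scripts themselves. As a rough sketch of what a map script such as `is_good_quality.py` could look like, the Python below assumes that Hive streams each row to the script as one tab-separated line and that the script applies the same quality rule as the multitable insert example. The function name `good_records` and the sample rows are hypothetical.

```python
import re


def good_records(lines):
    """Yield 'year<TAB>temperature' for rows with a valid reading and quality code."""
    for line in lines:
        year, temperature, quality = line.strip().split("\t")
        # 9999 marks a missing reading; only quality codes 0, 1, 4, 5, 9 are accepted
        if temperature != "9999" and re.fullmatch("[01459]", quality):
            yield f"{year}\t{temperature}"


sample = ["1950\t22\t1\n", "1950\t9999\t1\n", "1949\t111\t2\n"]
kept = list(good_records(sample))
```

In an actual streaming script the function would be fed `sys.stdin` and each yielded record would be printed, so that Hive can read the output rows back.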

26 Internals 26

27 Pig 27

28 Pig 28 A high-level dataflow language and an engine for executing data flows on Hadoop. Developed at Yahoo! Research in 2006. Generates and compiles MapReduce programs on the fly. Supports primitive relational operators (join, group, order, etc.) and user-defined functions.

29 An Example Problem 29 Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited pages by users aged 18 to 25. Dataflow: Load Users and Load Pages, Filter by age, Join on name, Group on url, Count clicks, Order by clicks, Take top 5.

30 In MapReduce 30

31 In Pig Latin 31
    Users = load 'users' as (name, age);
    Filtered = filter Users by age >= 18 and age <= 25;
    Pages = load 'pages' as (user, url);
    Joined = join Filtered by name, Pages by user;
    Grouped = group Joined by url;
    Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
    Sorted = order Summed by clicks desc;
    Top5 = limit Sorted 5;
    store Top5 into 'top5sites';
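For comparison, the same dataflow can be written as a single-machine Python sketch. It only mirrors the logical steps; it says nothing about how Pig executes them. The function name `top_pages` and the sample data are hypothetical, and the join assumes that user names are unique.

```python
from collections import Counter

users = [("amy", 22), ("bob", 40), ("cat", 19)]
pages = [("amy", "a.com"), ("cat", "a.com"), ("amy", "b.com"), ("bob", "c.com")]


def top_pages(users, pages, n=5, min_age=18, max_age=25):
    # filter: keep users in the age range (names assumed unique)
    filtered = {name for name, age in users if min_age <= age <= max_age}
    # join: keep page visits made by a filtered user
    joined = [(user, url) for user, url in pages if user in filtered]
    # group + count: clicks per url
    clicks = Counter(url for _, url in joined)
    # order + limit: the n most visited urls, most clicks first
    return clicks.most_common(n)


top5 = top_pages(users, pages)
```

With the sample data, bob is filtered out by age, so `c.com` does not appear in the result.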

32 Ease of Translation 32 Each step of the dataflow maps to one Pig Latin statement: Load Users → Users = load; Filter by age → Filtered = filter; Load Pages → Pages = load; Join on name → Joined = join; Group on url → Grouped = group; Count clicks → Summed = count(); Order by clicks → Sorted = order; Take top 5 → Top5 = limit

33 Ease of Translation 33 The statements are compiled into three MapReduce jobs. Job 1: Load Users, Load Pages, Filter by age, Join on name (Users = load, Fltrd = filter, Pages = load, Joined = join). Job 2: Group on url, Count clicks (Grouped = group, Summed = count()). Job 3: Order by clicks, Take top 5 (Sorted = order, Top5 = limit).

34 Comparison 34 Compared with hand-written Hadoop MapReduce, Pig needs 1/20 the lines of code and 1/16 the development time. Performance: 1.5x Hadoop

35 Pig vs. MapReduce 35 Faster development time Data flow versus programming logic Many standard data operations (e.g. join) included No need to manage the details of connecting MapReduce jobs Worse performance, but the gap is narrowing with each Pig release

36 Pig Latin 36 A dataflow language through which users write programs in Pig Each statement represents a processing step that results in a new data set (relation)

37 Data types 37 Scalar types int, long, float, double, chararray, bytearray Complex types tuple: an ordered set of fields e.g., ('bob', 55) map: a set of key-value pairs e.g., ['name'#'bob','age'#55] bag: a collection of tuples e.g., {('bob', 55), ('sally', 52), ('john', 25)}

38 Schemas 38 A relation may have an associated schema, which gives names and types to the fields Schemas are defined at runtime and are optional Types default to bytearray Schemas are recommended for improving clarity and efficiency

39 Schemas 39
Schema 1
    A = LOAD 'input/a' AS (name:chararray, age:int);
    B = FILTER A BY age != 20;
Schema 2
    A = LOAD 'input/a' AS (name:chararray, age:chararray);
    B = FILTER A BY age != '20';
No Schema
    A = LOAD 'input/a';
    B = FILTER A BY $1 != '20';

40 Operators 40
load: Read data from file system.
store: Write data to file system.
foreach: Apply expression to each record and output one or more records.
filter: Apply predicate and remove records that do not return true.
group/cogroup: Collect records with the same key from one or more inputs.
join: Join two or more inputs based on a key. Various join algorithms available.
order: Sort records based on a key.
distinct: Remove duplicate records.
union: Merge two data sets.
split: Split data into 2 or more sets, based on filter conditions.
stream: Send all records through a user-provided executable.
sample: Read a random sample of the data.
limit: Limit the number of records.

41 Functions 41 Eval Functions Transform values e.g., MAX, COUNT, TOKENIZE Filter Functions Test whether a value satisfies a predicate e.g., IsEmpty

42 Functions 42 Load/Store Functions: specify how to load/store data from/into external storage, e.g., text files, HBase tables. User-defined functions (UDFs): written in Java, Python, Jython, Ruby, or JavaScript; stored in Piggy Bank, a repository for UDFs.

43 Word count in Pig Latin 43
    input_lines = LOAD '/tmp/wc.txt' AS (line:chararray);
    words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    word_groups = GROUP words BY word;
    word_count = FOREACH word_groups GENERATE group, COUNT(words);
    STORE word_count INTO '/tmp/output';

44 Pig Execution Modes 44 Local mode Launches single JVM Accesses local file system No MapReduce jobs running Hadoop mode Translates program into a sequence of MapReduce jobs and executes them on Hadoop cluster Pig interacts with Hadoop master node

45 Hadoop mode 45

46 YARN 46

47 Hadoop 1 47 No support for programming models beyond MapReduce e.g., iterative applications implemented using MapReduce are 10x slower Limited Scalability Maximum cluster size is around 5000 nodes JobTracker is a bottleneck

48 Hadoop 1 vs Hadoop 2 48 HADOOP 1.0: MapReduce (cluster resource management & data processing) on top of HDFS (redundant, reliable storage). HADOOP 2.0: MapReduce (data processing) and other frameworks on top of YARN (cluster resource management), on top of HDFS2 (redundant, highly-available & reliable storage).

49 YARN 49 YARN (Yet Another Resource Negotiator): Hadoop's cluster resource management system. Decouples resource management from the programming framework. MapReduce becomes just one of the applications.

50 Architecture 50

51 Architecture 51 Resource Manager: partitions resources ("containers") among applications. Node Manager: one per machine; manages the life-cycle of containers. Application Master: one per application; coordinates application execution, e.g., MapReduce, Spark, or Storm application masters.
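The division of labour between the Resource Manager and the Node Managers can be illustrated with a toy Python model. It is not YARN's API or scheduling policy: the classes, the first-fit allocation by memory, and the application names are hypothetical simplifications.

```python
class NodeManager:
    """One per machine; tracks the memory still available for containers."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb


class ResourceManager:
    """Hands out containers to applications, regardless of their framework."""

    def __init__(self, nodes):
        self.nodes = {node.name: node for node in nodes}

    def allocate(self, app_id, memory_mb):
        # first-fit: grant a container on the first node with enough free memory
        for node in self.nodes.values():
            if node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                return {"app": app_id, "node": node.name, "memory_mb": memory_mb}
        return None  # the request has to wait

    def release(self, container):
        self.nodes[container["node"]].free_mb += container["memory_mb"]


rm = ResourceManager([NodeManager("n1", 2048), NodeManager("n2", 2048)])
c1 = rm.allocate("mapreduce-job", 1536)
c2 = rm.allocate("storm-topology", 1024)
c3 = rm.allocate("mapreduce-job", 1024)
c4 = rm.allocate("spark-job", 1024)  # cluster is full at this point
```

The Resource Manager never looks at what runs inside a container, which is what lets different frameworks share the same cluster; deciding what to run in each granted container is the job of the per-application master.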

52 Benefits 52 Support for multiple applications and frameworks sharing the same cluster e.g., MapReduce, MPI, graph computation frameworks, online computation frameworks e.g., multiple versions of MapReduce can coexist Scalability Resource management is separated from application management Application management is distributed

53 Benefits 53 Fault tolerance Resource manager restores its state from persistent store on failure Application Masters are restarted on failure Failure handling of containers is left to frameworks Improved resource utilisation No static split between map and reduce slots Sharing cluster among multiple applications

54 YARN Ecosystem 54

55 Storm 55

56 Storm 56 Open-source, distributed, real-time computation framework for processing streams of data Developed by BackType, acquired by Twitter; initial release in 2011 Scalable, fault-tolerant, reliable Written mainly in Clojure; supports many programming languages

57 Use cases 57 Stream processing: process a stream of new data and update databases in real time, e.g., analyse log messages from multiple servers. Continuous computation: run a continuous query and stream the results to clients in real time, e.g., streaming trending topics on Twitter. Distributed RPC: parallelize CPU-intensive operations, e.g., finding the number of people exposed to a URL on Twitter.

58 Concepts 58 Streams, Spouts, Bolts, Topologies

59 Streams 59

60 Spouts 60

61 Bolts 61 Process input streams and produce new streams Can implement functions such as filters, aggregation, join, etc.

62 Topology 62 Network of spouts and bolts

63 Topology 63 Spouts and bolts execute as many tasks across the cluster

64 Stream Grouping 64 When a tuple is emitted, which task does it go to?

65 Stream Grouping 65
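Two common answers to that question are shuffle grouping (spread tuples evenly at random) and fields grouping (tuples with equal values in the chosen fields always reach the same task). The Python sketch below illustrates both under stated assumptions: the function names are hypothetical, tuples are modelled as dictionaries, and CRC32 stands in for whatever hash function Storm actually uses.

```python
import random
import zlib


def shuffle_grouping(num_tasks, rng=random):
    """Pick a target task at random, so load is spread evenly over time."""
    return rng.randrange(num_tasks)


def fields_grouping(tuple_, fields, num_tasks):
    """Pick the target task from a hash of the grouping fields."""
    key = "\x1f".join(str(tuple_[field]) for field in fields)
    # CRC32 is a stand-in hash; equal field values always give the same task
    return zlib.crc32(key.encode("utf-8")) % num_tasks


first = fields_grouping({"word": "storm"}, ["word"], 12)
second = fields_grouping({"word": "storm"}, ["word"], 12)
anywhere = shuffle_grouping(8)
```

Fields grouping is what makes a counting bolt correct: every occurrence of a given word reaches the same task, so each task can keep its counts in local memory.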

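66 Example: WordCount 66
    TopologyBuilder builder = new TopologyBuilder();
    // spout 1: reads sentences from a Kestrel queue, 5 parallel tasks
    builder.setSpout(1, new KestrelSpout("kestrel.twitter.com", 22133,
            "sentence_queue", new StringScheme()), 5);
    // bolt 2: splits sentences into words; tuples are distributed randomly
    builder.setBolt(2, new SplitSentence(), 8).shuffleGrouping(1);
    // bolt 3: counts words; the same word always goes to the same task
    builder.setBolt(3, new WordCount(), 12).fieldsGrouping(2, new Fields("word"));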

67 Example: WordCount 67
    public static class SplitSentence extends ShellBolt implements IRichBolt {
        // includes code to split a sentence into words
        // declares output field "word"
    }
    public static class WordCount implements IBasicBolt {
        // includes code to count words
    }
    Config conf = new Config();  // topology configuration
    StormSubmitter.submitTopology("word-count", conf, builder.createTopology());

68 Guaranteed Processing 68 Storm guarantees that every spout tuple will be fully processed, i.e., every message in the tree of tuples it triggers has been processed. If the tuple tree is not completed within a specified timeout, the spout tuple is replayed. This reliability tracking can be switched off when it is not needed.
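Storm tracks tuple trees compactly: each tuple gets a random 64-bit id, and the ids of all emitted and acknowledged tuples of a tree are XORed into a single value that returns to zero once every tuple has been acknowledged. The Python sketch below illustrates that idea only. The `Acker` class and its method names are hypothetical, small fixed ids replace the random ones, and the timeout and replay logic is left out.

```python
class Acker:
    """Toy tuple-tree tracker: one XOR value per spout tuple."""

    def __init__(self):
        self.pending = {}  # spout tuple id -> XOR of ids emitted but not yet acked

    def emit(self, root_id, tuple_id):
        # a new tuple joins the tree rooted at the spout tuple
        self.pending[root_id] = self.pending.get(root_id, 0) ^ tuple_id

    def ack(self, root_id, tuple_id):
        # XORing the same id a second time cancels it out
        self.pending[root_id] ^= tuple_id
        if self.pending[root_id] == 0:
            del self.pending[root_id]
            return True  # the whole tree has been processed
        return False


acker = Acker()
acker.emit("sentence-1", 1)            # spout emits the sentence tuple
acker.emit("sentence-1", 2)            # bolt emits one word tuple
acker.emit("sentence-1", 4)            # bolt emits another word tuple
done_1 = acker.ack("sentence-1", 1)    # sentence tuple acked, words outstanding
done_2 = acker.ack("sentence-1", 2)
done_3 = acker.ack("sentence-1", 4)    # last word acked, tree complete
```

Only the final acknowledgement reports completion; a spout tuple that is still in `pending` when the timeout expires is the one that would be replayed.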

69 Internals 69 Nimbus Distributes code Assigns tasks to supervisor nodes Monitors for failures Supervisor Executes assigned tasks Zookeeper Keeps cluster state

70 Summary 70 HBase is a distributed, fault-tolerant storage system that follows the column-family model and runs on top of HDFS. Hive and Pig provide high-level languages for processing large datasets. YARN allows multiple types of applications to share the same cluster. Storm is a distributed, reliable, fault-tolerant system for processing streams of data.

71 References 71
Tom White, Hadoop: The Definitive Guide, 4th Edition, O'Reilly Media.
Jonathan Leibiusky, Gabriel Eisbruch, Dario Simonassi, Getting Started with Storm: Continuous Streaming Computation with Twitter's Cluster Technology, O'Reilly Media, 2012.
