Big Data Technology Overview, 2016/8/20 ~ 9/3. 윤형기


1 Big Data Technology Overview, 2016/8/20 ~ 9/3, 윤형기 (hky@openwith.net)

2 D4

3 Hive

4 What is Hive?
Concept: a data warehouse infrastructure tool for processing structured data in Hadoop. It sits on top of Hadoop to summarize Big Data, and it makes querying and analysis easy.
Hive is not:
- a relational database
- a design for OLTP
- a language for real-time queries and row-level updates
Hive features:
- It stores the schema in a database and the processed data in HDFS.
- It is designed for OLAP.
- It provides an SQL-like query language called HiveQL or HQL.
- It is familiar, fast, scalable, and extensible.

5 Hive = SQL interface to Hadoop
- Data transformation is described in SQL-like syntax.
- Data is stored in HDFS.
General points:
- Developed at Facebook and now an Apache project
- Provides JDBC connectivity
- HCatalog
Who uses Hive?
- Hadoop: IBM, EMC, Teradata, Oracle, Netapp
- Cloud: Amazon, Google, Rackspace

6 Hive Architecture

7 Hive components (unit name: operation)
- User Interface: Hive is data warehouse infrastructure software and provides the following UIs: Hive Web UI, Hive CLI.
- Meta Store: stores the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.
- HiveQL Process Engine: runs HQL in place of a Java MR program.
- Execution Engine: processes the query and generates the same results as MapReduce would.
- HDFS or HBASE: the data storage techniques used to store data in the file system.

9 Hive query workflow (step: operation)
1 Execute Query: a Hive interface (CLI, UI) submits the query (via JDBC, ODBC, etc.).
2 Get Plan: the query compiler parses the query, checks its syntax, and builds a query plan.
3 Get Metadata: the compiler requests metadata from the Metastore.
4 Send Metadata: the Metastore sends the metadata to the compiler.
5 Send Plan: the compiler checks the requirement and resends the plan to the driver.
6 Execute Plan: the driver sends the execute plan to the execution engine.
7 Execute Job: internally, the job runs as an MR job.
7.1 Metadata Ops
8 Fetch Result: the execution engine receives the results from the data nodes.
9 Send Results: execution engine -> driver.
10 Send Results: driver -> Hive interfaces.

10 Hive + Hadoop

11 Simple data types

12 Complex data types

13 Hive Language Manual

14 Complex Schema
Relational DBs use separate tables plus joins instead of complex structures. That is slower than a straight disk scan and requires multi-table and multi-row transactions, which Hive doesn't provide.

15 Terminators (Delimiters)

16 Actual File Format

17 Partitioning
- Separate directories for each partition column.
- Stored under Hive's warehouse directory on the cluster.
- Improves query speed.

18 Hive and MR
If you're showing all columns and filtering only on partitions, Hive skips MR altogether! If the WHERE clause includes non-partition clauses, then MR is required. (For tables without partitions, select * from tbl_name; will also work without MR.)
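The partition layout and the MR shortcut from slides 17 and 18 can be sketched in plain Python. This is only an illustration under assumptions: the table name `logs`, the partition columns `year` and `month`, and the `/user/hive/warehouse` root are hypothetical choices, and the functions mimic the idea rather than Hive's actual planner.

```python
from pathlib import PurePosixPath

WAREHOUSE = PurePosixPath("/user/hive/warehouse")  # assumed warehouse root

def partition_dir(table, **partition_values):
    """Build the directory of one partition: one key=value level per partition column."""
    path = WAREHOUSE / table
    for column, value in partition_values.items():
        path = path / f"{column}={value}"
    return path

def prune(partitions, **wanted):
    """Keep only the partitions whose column values match the filter; no file is opened."""
    return [p for p in partitions
            if all(p[col] == val for col, val in wanted.items())]

def needs_mapreduce(where_columns, partition_columns):
    """Rule of thumb for SELECT *: a filter on any non-partition column requires MR."""
    return any(col not in partition_columns for col in where_columns)

# Hypothetical partitions of a hypothetical table.
partitions = [
    {"year": 2016, "month": 7},
    {"year": 2016, "month": 8},
    {"year": 2015, "month": 8},
]
selected = prune(partitions, year=2016, month=8)
dirs = [str(partition_dir("logs", **p)) for p in selected]
```

Because the filter touches only partition columns, the matching directories can simply be read back, which is why no MR job is needed in that case.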

19 External Table vs. Managed Table
External Tables: use them when you manage the data yourself:
- The data is used by other tools.
- You have a custom ETL process.
It's common to customize the file format, too...

20 Creating an External Table
The locations can be local, in HDFS, or in S3. Joins can join table data from any such source!
Dropping an External Table
The table data are not deleted when you drop the table. The table metadata are deleted from the metastore.

21 Hive Joins
Four types of join:
- Inner joins
- Outer joins
- Left semi joins (not discussed here)
- Map-side joins (an optimization of the others)

22 Inner Join
A row must exist in both tables (inner join). Example: a self join on the same table.

23 Left Outer Join
Keep all the left-hand side records; if there isn't a corresponding record on the right-hand side, use null for those fields.

25 Right outer-join

26 Full outer-join
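The inner and outer join semantics from the preceding slides can be illustrated with a small pure-Python sketch. It is not how Hive executes joins; the `join` helper and the `stocks` and `dividends` rows are hypothetical, and `None` stands in for SQL null.

```python
def join(left, right, key, how="inner"):
    """Join two lists of dict rows on `key`; `how` is inner, left, right, or full."""
    left_cols = {c for row in left for c in row}
    right_cols = {c for row in right for c in row}
    result = []
    matched_right = set()
    for lrow in left:
        matches = [(i, rrow) for i, rrow in enumerate(right) if rrow[key] == lrow[key]]
        for i, rrow in matches:
            matched_right.add(i)
            result.append({**lrow, **rrow})
        if not matches and how in ("left", "full"):
            # No partner on the right: right-hand columns become None (null).
            result.append({**{c: None for c in right_cols}, **lrow})
    if how in ("right", "full"):
        for i, rrow in enumerate(right):
            if i not in matched_right:
                # No partner on the left: left-hand columns become None (null).
                result.append({**{c: None for c in left_cols}, **rrow})
    return result

# Hypothetical sample rows.
stocks = [{"symbol": "AAPL", "price": 100}, {"symbol": "IBM", "price": 150}]
dividends = [{"symbol": "AAPL", "dividend": 0.5}, {"symbol": "GE", "dividend": 0.2}]

inner = join(stocks, dividends, "symbol")
left_outer = join(stocks, dividends, "symbol", how="left")
right_outer = join(stocks, dividends, "symbol", how="right")
full_outer = join(stocks, dividends, "symbol", how="full")
```

Only `AAPL` appears in both inputs, so the inner join returns one row, each one-sided outer join returns two, and the full outer join returns three.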

27 User-Defined Functions
- UDF: works on a single row.
- UDAF (User-Defined Aggregate Function): aggregates a collection of rows/values into a new row/value.
- UDTF (User-Defined Table-Generating Function)
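The difference between the three kinds of function is mostly about how many rows go in and come out. The sketch below shows that shape in plain Python; real Hive functions are written against Hive's Java APIs, the function names here are hypothetical, and the one-row-in, many-rows-out behaviour of a UDTF is general Hive knowledge rather than something stated on the slide.

```python
def udf_upper(value):
    """UDF analogue: one value from one row in, one value out."""
    return value.upper()

def udaf_sum(values):
    """UDAF analogue: many rows in, a single aggregated value out."""
    total = 0
    for v in values:
        total += v
    return total

def udtf_explode(row_id, items):
    """UDTF analogue: one input row in, zero or more output rows out."""
    for item in items:
        yield (row_id, item)

upper_names = [udf_upper(n) for n in ["hive", "spark"]]  # applied row by row
total = udaf_sum([1, 2, 3, 4])                           # applied to a group of rows
exploded = list(udtf_explode(7, ["a", "b"]))             # one row becomes two
```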

28 File and Record Formats All the InputFormat does is split the file into records. It knows nothing about the format of those records. The SerDe (serializer/deserializer) parses each record into fields/columns.

29 INPUTFORMAT and OUTPUTFORMAT: how records are stored in files and how query results are written.
SERDE (serializer-deserializer): how columns are stored in records.
- INPUTFORMATs are responsible for splitting an input stream into records.
- OUTPUTFORMATs are responsible for writing records to an output stream (i.e., query results). Two separate classes are used.
- SERDEs are responsible for tokenizing a record into columns/fields and also for encoding columns/fields into records. Unlike the *PUTFORMATs, there is one class for both tasks.
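The division of labour between the formats and the SerDe can be sketched with three toy Python classes. They are analogues only, not Hive or Hadoop classes: the names `LineInputFormat`, `LineOutputFormat`, and `DelimitedSerDe` are hypothetical, and the Ctrl-A (`\x01`) field terminator is an assumption made for this example.

```python
class LineInputFormat:
    """Splits a raw character stream into records; knows nothing about fields."""
    def records(self, stream):
        return [line for line in stream.split("\n") if line]

class LineOutputFormat:
    """Writes records to an output stream, one record per line."""
    def write(self, records):
        return "\n".join(records) + "\n"

class DelimitedSerDe:
    """One class for both directions: record -> fields and fields -> record."""
    def __init__(self, delimiter="\x01"):  # Ctrl-A, assumed field terminator
        self.delimiter = delimiter

    def deserialize(self, record):
        return record.split(self.delimiter)

    def serialize(self, fields):
        return self.delimiter.join(str(f) for f in fields)

raw = "1\x01alice\n2\x01bob\n"
serde = DelimitedSerDe()
records = LineInputFormat().records(raw)          # stream -> records
rows = [serde.deserialize(r) for r in records]    # record -> fields
round_trip = LineOutputFormat().write([serde.serialize(r) for r in rows])
```

Reading and writing use two separate format classes, while the single SerDe class handles both parsing and encoding, which mirrors the split described above.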

30 Calling External Scripts

31 [Lab] HQL

32 Spark

33 What is Spark?
Spark is an open source big data processing framework developed in 2009 at UC Berkeley's AMPLab and open sourced in 2010; it later became an Apache project.
Spark has several advantages compared to other big data and MapReduce technologies like Hadoop and Storm.
- First of all, Spark gives us a comprehensive, unified framework to manage big data processing requirements with data sets that are diverse in nature (text data, graph data, etc.) as well as in source (batch vs. real-time streaming data).
- Spark enables applications in Hadoop clusters to run up to 100 times faster in memory and 10 times faster even when running on disk.
- Spark lets you quickly write applications in Java, Scala, or Python. It comes with a built-in set of over 80 high-level operators, and you can use it interactively to query data within the shell.
- In addition to Map and Reduce operations, it supports SQL queries, streaming data, machine learning, and graph data processing. Developers can use these capabilities stand-alone or combine them in a single data pipeline use case.
In this first installment of the Apache Spark article series, we'll look at what Spark is, how it compares with a typical MapReduce solution, and how it provides a complete suite of tools for big data processing.

34 History

35 Hadoop and Spark
MR (problems):
- The job output data between each step has to be stored in the distributed file system before the next step can begin. Hence, this approach tends to be slow due to replication and disk storage.
- Hadoop solutions typically include clusters that are hard to set up and manage.
- It requires the integration of several tools for different big data use cases (like Mahout for machine learning and Storm for streaming data processing).
- If you wanted to do something complicated, you would have to string together a series of MapReduce jobs and execute them in sequence. Each of those jobs was high-latency, and none could start until the previous job had finished completely.
Spark:
- lets programmers develop complex, multi-step data pipelines using a directed acyclic graph (DAG) pattern.
- supports in-memory data sharing across DAGs, so that different jobs can work with the same data.
- runs on top of the HDFS infrastructure to provide enhanced and additional functionality.
- can be deployed in an existing Hadoop v1 cluster (with SIMR, Spark-Inside-MapReduce), in a Hadoop v2 YARN cluster, or even on Apache Mesos.
In short: Spark is an alternative to Hadoop MR rather than a replacement for Hadoop. It is a comprehensive and unified solution for managing different big data use cases and requirements.

36 Spark Features
- Improves on MR, with less expensive shuffles. In-memory data storage and near real-time processing improve performance.
- Lazy evaluation of big data queries allows optimization of the steps in data processing workflows.
- A higher-level API improves developer productivity.
- An execution engine that works both in memory and on disk:
  - Intermediate results are kept in memory, which helps especially when the same data is processed repeatedly.
  - Data that exceeds memory capacity is stored and processed on disk.
- Other: optimizes arbitrary operator graphs.
- Spark is written in Scala and runs on the JVM.
- Supported languages: Scala, Java (no interactive shell yet), Python, Clojure, R

37 Spark Ecosystem
Spark Core API plus additional libraries that provide extra capabilities:
- Spark Streaming: processes real-time streaming data, based on micro-batches. It uses the DStream, which is basically a series of RDDs, to process the real-time data.
- Spark SQL: exposes Spark datasets over the JDBC API and allows running SQL-like queries on Spark data. It also supports ETL from different formats (JSON, Parquet, a database).
- Spark MLlib: classification, regression, clustering, collaborative filtering, dimension reduction, optimization.
- Spark GraphX: graph-parallel computation. It extends the Spark RDD by introducing the Resilient Distributed Property Graph.
- Others: BlinkDB (an approximate query engine), Tachyon (a memory-centric distributed file system)
- Integration adapters with Cassandra (Spark Cassandra Connector) and R (SparkR).

38 Figure 1. Spark Framework Libraries

39 Spark Architecture
Three main components:
- Data Storage: Spark uses the HDFS file system for data storage. It works with any Hadoop-compatible data source, including HDFS, HBase, Cassandra, etc.
- API: a standard API interface for Scala, Java, and Python.
- Management Framework (resource management): Spark can be deployed as a stand-alone server or on a distributed computing framework like Mesos or YARN.

40 Figure 2. Spark Architecture

41 RDD
Resilient Distributed Dataset, based on Matei's research paper; similar to a table in a database. Spark stores the data of an RDD on different partitions.
Characteristics:
- RDDs help with rearranging the computations and optimizing the data processing.
- They are fault tolerant, because an RDD knows how to recreate and recompute the datasets.
- RDDs are immutable.
Two types of operations are supported:
- Transformation: does not return a single value; it returns a new RDD. Nothing gets evaluated when you call a transformation function; it just takes an RDD and returns a new RDD. Examples: map, filter, flatMap, groupByKey, reduceByKey, aggregateByKey, pipe, and coalesce.
- Action: evaluates and returns a new value. Examples: reduce, collect, count, first, take, countByKey, and foreach.
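The split between lazy transformations and evaluating actions can be imitated in a few lines of plain Python. The `MiniRDD` class below is a hypothetical, single-process toy and not Spark's API; it only shows that transformations record a lineage and return a new object, while actions replay that lineage to produce a value.

```python
class MiniRDD:
    """Toy stand-in for an RDD: an immutable source plus a recorded lineage."""
    def __init__(self, source, lineage=()):
        self._source = tuple(source)
        self._lineage = lineage  # tuple of (kind, function) steps

    # Transformations: return a new MiniRDD and compute nothing.
    def map(self, fn):
        return MiniRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        return MiniRDD(self._source, self._lineage + (("filter", fn),))

    def _compute(self):
        """Replay the lineage from the source; lost results could be rebuilt the same way."""
        data = iter(self._source)
        for kind, fn in self._lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # Actions: trigger evaluation and return a plain value.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

calls = []

def square(x):
    calls.append(x)  # records when the function actually runs
    return x * x

numbers = MiniRDD([1, 2, 3, 4])
evens = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # still 0: nothing has been evaluated yet
result = evens.collect()          # the action runs the whole chain
```

The original `numbers` object is untouched by the chain, which matches the immutability point above.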

42 Installation and Execution
- Installation
- Execution

43 Spark Web Console

44 Shared Variables
Two types of shared variables for the cluster environment:
Broadcast Variables: allow a read-only variable to be kept cached on each machine instead of sending a copy of it with tasks. They can be used to give the nodes in the cluster copies of large input datasets more efficiently.
    // Broadcast Variables
    val broadcastVar = sc.broadcast(Array(1, 2, 3))
    broadcastVar.value
Accumulators: are only added to using an associative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MR) or sums. Tasks running on the cluster can add to an accumulator variable using the add method. However, they cannot read its value. Only the driver program can read the accumulator's value.
    // Accumulators
    val accum = sc.accumulator(0, "My Accumulator")
    sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
    accum.value
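Why an associative operation makes accumulators cheap to run in parallel can be shown without Spark. In the hypothetical sketch below, each simulated task sums only its own partition and the driver merges the partial values; the partition contents and function names are illustrative assumptions, not Spark's implementation.

```python
from functools import reduce
from itertools import permutations
import operator

def run_tasks(partitions):
    """Each simulated task adds up only its own partition and reports one partial value."""
    return [sum(part) for part in partitions]

def merge_on_driver(partials, zero=0):
    """The driver folds the partial values together with the same addition."""
    return reduce(operator.add, partials, zero)

partitions = [[1, 2], [3], [4]]  # hypothetical split of Array(1, 2, 3, 4)
partials = run_tasks(partitions)
total = merge_on_driver(partials)

# Addition is associative (and commutative), so the order in which
# partial values reach the driver does not change the total.
orders_agree = all(merge_on_driver(list(p)) == total for p in permutations(partials))
```

The tasks never read the running total; only the merge step on the driver sees the final value, as described for accumulators above.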

45 Spark SQL Components
Two main components when using Spark SQL:
DataFrame: a distributed collection of data organized into named columns (called SchemaRDD in earlier versions). DataFrames can be converted to RDDs by calling the rdd method, which returns the content of the DataFrame as an RDD of Rows. DataFrames can be created from different data sources such as:
- Existing RDDs
- Structured data files
- JSON datasets
- Hive tables
- External databases
SQLContext: encapsulates all relational functionality in Spark. You create the SQLContext from the existing SparkContext that we have seen in the previous examples. The following code snippet shows how to create a SQLContext object.
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
In addition: HiveContext

46 Spark Streaming
Real-time streaming overview
Streaming data processing frameworks:
- Apache Samza
- Storm
- Spark Streaming
Figure 1. Spark Ecosystem with Spark Streaming Library

47 How Spark Streaming Operates
- It divides the live stream of data into micro-batches of a predefined interval (N seconds).
- Each batch of data is processed as an RDD, using operations like map, reduce, reduceByKey, join, and window.
- The results are returned in batches as well.
Key decision: choosing the time interval.
Comparison:
- Other stream processing frameworks process each event individually (instead of in micro-batches).
- We can use other Spark libraries (like Core, Machine Learning, etc.) with the Spark Streaming API in the same application.
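The micro-batch idea can be sketched in plain Python: events are grouped by a fixed interval, and each group is then processed as one small batch job. This is a simplified illustration rather than the Spark Streaming API; the event data, the 2-second interval, and the helper names are assumptions, and unlike a real stream this sketch emits nothing for intervals without events.

```python
from collections import Counter, defaultdict

def micro_batches(events, interval):
    """Group (timestamp_seconds, payload) events into consecutive windows of `interval` seconds."""
    batches = defaultdict(list)
    for timestamp, payload in events:
        batches[int(timestamp // interval)].append(payload)
    return [batches[k] for k in sorted(batches)]

def word_count(batch):
    """Per-batch job: split lines into words, then count per key (a reduceByKey analogue)."""
    return Counter(word for line in batch for word in line.split())

# Hypothetical live stream: (arrival time in seconds, text line).
events = [(0.5, "spark hive"), (1.2, "spark"), (2.1, "hive hive"), (3.9, "spark")]
batches = micro_batches(events, interval=2)
counts = [dict(word_count(b)) for b in batches]
```

A shorter interval lowers latency but produces more, smaller batches, which is why choosing the interval is called out as the key decision.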

48 Streaming Data Sources
- Kafka
- Flume
- Twitter
- ZeroMQ
- Amazon's Kinesis
- TCP sockets
Spark Streaming architecture.

49 How Spark Streaming Works

50 Spark ML
ML model: examples
- Supervised learning: fraud detection
- Unsupervised learning: social network applications, language prediction
- Semi-supervised learning: image categorization, voice recognition
- Reinforcement learning: Artificial Intelligence (AI) applications

51 [Lab] Scala
Installing Scala (download, then):
    wget
    tar xvf scala tgz
    sudo mv scala /usr/lib
    sudo ln -f -s /usr/lib/scala /usr/lib/scala
    export PATH=$PATH:/usr/lib/scala/bin
(Verify)
    $ scala -version
Scala programming

52 [Lab] Spark
