FusionInsight HD Technical White Paper


Issue 01
Date: 2017-07-30
HUAWEI TECHNOLOGIES CO., LTD.

Copyright © Huawei Technologies Co., Ltd. 2017. All rights reserved. No part of this document may be reproduced or transmitted in any form or by any means without prior written consent of Huawei Technologies Co., Ltd.

Trademarks and Permissions: HUAWEI and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd. All other trademarks and trade names mentioned in this document are the property of their respective holders.

Notice: The purchased products, services and features are stipulated by the contract made between Huawei and the customer. All or part of the products, services and features described in this document may not be within the purchase scope or the usage scope. Unless otherwise specified in the contract, all statements, information, and recommendations in this document are provided "AS IS" without warranties, guarantees or representations of any kind, either express or implied. The information in this document is subject to change without notice. Every effort has been made in the preparation of this document to ensure accuracy of the contents, but all statements, information, and recommendations in this document do not constitute a warranty of any kind, express or implied.

Huawei Technologies Co., Ltd.
Address: Huawei Industrial Base, Bantian, Longgang, Shenzhen 518129, People's Republic of China
Website: http://e.huawei.com

Contents

1 Introduction
1.1 Overview of FusionInsight HD
1.2 Basic Components of FusionInsight HD
2.1 Manager: Cluster Management
2.2 HDFS: Distributed File System
2.3 YARN: Unified Resource Management and Scheduling Framework
2.4 MapReduce: Distributed Batch Processing Engine
2.5 HBase: Distributed Database
2.6 Hive: Data Warehouse Component
2.7 Spark: Distributed Real-time Computing Framework
2.8 Solr: Full-text Search Component
2.9 Loader (Sqoop): Integration of Data in Batches
2.10 Flume: Real-Time Data Collection
2.11 Storm: Stream Processing
2.11.1 Storm
2.11.2 StreamCQL
2.11.3 Flink
2.12 Redis: Distributed High-Speed Cache
2.13 Kafka: Distributed Message Queue
2.14 Oozie: Job Orchestration and Scheduling
2.15 Hue: Hadoop Integrated Development Tool

1 Introduction
1.1 Overview of FusionInsight HD
1.2 Basic Components of FusionInsight HD

1.1 Overview of FusionInsight HD

Huawei FusionInsight is a unified enterprise-class big data storage, query, and analysis platform. It allows enterprises to quickly build a system for processing massive data. With FusionInsight, enterprises can spot and capture new business opportunities by analyzing and mining massive data in a real-time or non-real-time manner.

FusionInsight consists of five sub-products: FusionInsight HD, FusionInsight LibrA, FusionInsight Miner, FusionInsight Farmer, and one operation and maintenance system (OMS), FusionInsight Manager.

FusionInsight HD: an enterprise-class big data processing platform. It is a distributed data processing system that provides large-capacity data storage, query, analysis, and real-time stream data processing capabilities. FusionInsight HD includes ZooKeeper, Hadoop, HBase, Loader, Hive, Hue, Oozie, Phoenix, Solr, Redis, Spark, Streaming, Kafka, Elk, Flink, and other components.

FusionInsight Miner: an enterprise-class data analysis platform. Built on the distributed storage and parallel computing technologies of FusionInsight HD, it mines valuable information from massive data.

FusionInsight Farmer: an enterprise-class big data application container that provides a unified development, running, and management platform for enterprise services.

FusionInsight Manager: an enterprise-class big data O&M platform. It provides highly reliable, secure, fault-tolerant, and easy-to-use cluster management for FusionInsight HD, covering installation and deployment, monitoring, alarm management, user management, rights management, auditing, service management, health check, fault location, upgrades, and patching of large-scale clusters.

FusionInsight LibrA: an enterprise-class massively parallel processing (MPP) relational database. Based on column storage and the MPP architecture, it is designed for the analysis of structured data and can efficiently process PB-level data volumes. FusionInsight LibrA differs greatly from conventional databases in its core technologies. It addresses data processing performance issues for users in many industries, provides a cost-effective general-purpose computing platform for large-scale data management, and supports various data warehouse systems, business intelligence (BI) systems, and decision support systems, supporting the decision analysis of upper-layer applications in a unified manner.

1.2 Basic Components of FusionInsight HD

Huawei FusionInsight HD encapsulates and enhances the open-source Hadoop components to provide stable large-capacity data storage, query, and analysis capabilities. Huawei FusionInsight HD has the following components:

Manager: an O&M system. It provides highly reliable, secure, fault-tolerant, and easy-to-use cluster management for FusionInsight HD, and supports installation, upgrade, and patching of large-scale clusters as well as the management of configurations, monitoring, alarms, users, and tenants.
HDFS: a distributed file system that supports high-throughput data access and is suitable for processing large data sets.
HBase: a distributed, column-oriented storage system built on HDFS that stores massive data.
Oozie: orchestrates and executes jobs for open-source Hadoop components. It runs in a Java servlet container (for example, Tomcat) as a Java web application and uses a database to store workflow definitions and running workflow instances (including their status and variables).
ZooKeeper: provides highly available distributed service coordination. It prevents single points of failure (SPOFs) and helps build reliable applications.
Redis: a high-performance, memory-based distributed key-value cache system.
YARN: the resource management system of Hadoop 2.0. It is a general-purpose resource module that manages and schedules resources for applications.
MapReduce: a distributed data processing model and execution environment that processes massive data quickly and in parallel.
Spark: a distributed, memory-based computing framework.
Hive: an open-source data warehouse built on top of Hadoop. It provides SQL-like access to structured data, known as HQL, and supports basic data analysis services.
Loader: a data import and export tool enhanced from the open-source Apache Sqoop component. Loader exchanges data between FusionInsight HD and relational databases or FTP/SFTP file servers, and provides a Java API and a shell task-scheduling interface for third-party scheduling platforms.

Hue: provides the web UI for open-source components. Through a browser, users can operate HDFS directories and files, invoke Oozie to create, monitor, and orchestrate workflows, operate the Sqoop component, and view ZooKeeper cluster status.
Flume: a distributed, reliable, highly available massive log aggregation system. Flume supports customized data transmitters for collecting data, performs simple processing on the data, and writes it to customizable data receivers.
Solr: a high-performance, Lucene-based full-text retrieval server. Solr provides a richer query language than Lucene; it is configurable, scalable, and optimized for query performance, and offers a GUI with comprehensive management functions. It is an excellent full-text retrieval engine.
Kafka: a distributed, real-time message publishing and subscription system with partitions and replicas. It provides scalable, high-throughput, low-latency, and reliable message dispatching services.
Storm: a distributed, reliable, fault-tolerant real-time stream data processing system. It provides an SQL-like query language (StreamCQL).
Flink: a distributed, highly available data processing engine that supports both batch processing and stream processing, with exactly-once semantics.
Spark SQL: a high-performance SQL engine based on the Spark engine. It can share metadata with Hive.
Mahout: a MapReduce-based data mining algorithm library.
MLLib: a Spark-based data mining algorithm library.
GraphX: a Spark-based graph processing algorithm library.

2.1 Manager: Cluster Management
2.2 HDFS: Distributed File System
2.3 YARN: Unified Resource Management and Scheduling Framework
2.4 MapReduce: Distributed Batch Processing Engine
2.5 HBase: Distributed Database
2.6 Hive: Data Warehouse Component
2.7 Spark: Distributed Real-time Computing Framework
2.8 Solr: Full-text Search Component
2.9 Loader (Sqoop): Integration of Data in Batches
2.10 Flume: Real-Time Data Collection
2.11 Storm: Stream Processing
2.12 Redis: Distributed High-Speed Cache
2.13 Kafka: Distributed Message Queue
2.14 Oozie: Job Orchestration and Scheduling
2.15 Hue: Hadoop Integrated Development Tool

2.1 Manager: Cluster Management

As the O&M management system of FusionInsight HD, Manager provides a unified cluster management capability for the services deployed in the cluster. Manager provides functions such as installation and deployment, performance monitoring, alarms, user management, permission management, auditing, service management, health check, log collection, upgrade, and patching.

Figure 2-1 Logical architecture of FusionInsight Manager

FusionInsight Manager consists of OMS and NodeAgent:
OMS: the management node of the O&M system. Two OMS nodes are deployed in active/standby mode.
NodeAgent: the managed node of the O&M system. Each node in the cluster runs a NodeAgent.

Table 2-1 Service module description
WebService: a web service deployed under Tomcat. It provides the HTTPS interface of Manager and is used to access Manager through a web browser. It also provides northbound access based on the Syslog and SNMP protocols.
Controller: the control center of Manager. It converges information from all nodes in the cluster, presents it to administrators, receives operation instructions from administrators, and synchronizes information to each node in the cluster according to the scope of the instructions.

NodeAgent: exists on each cluster node and serves as the agent through which Controller performs all operations on the components deployed on that node. It represents all components deployed on the node when exchanging information with Controller, achieving multipoint-to-point convergence across the cluster.
IAM: records audit logs. Each non-query operation on the Manager UI has a related audit log.
PMS: performance monitoring module. It collects the performance monitoring data of each OMA and provides query functions.
CEP: convergence function module. For example, the CEP integrates the used disk space of each OMA into one performance indicator.
FMS: alarm module. It collects the alarms of each OMA and provides query functions.
OMMAgent: agent that collects the performance monitoring data and alarms of its node.
CAS: unified authentication center. Login authentication through CAS is required when logging in to the WebService; the browser is automatically redirected to CAS via the URL.
AOS: permission management module. It manages the rights of users and user groups.
OMS Kerberos: supports SSO and authentication between Controller and NodeAgent.
OMS LDAP: provides data storage for user authentication before the cluster is installed and backs up the cluster LDAP after the cluster is installed.
Database: the database of Manager. It stores information such as configuration, monitoring, and alarm data.
NTP: implements clock synchronization between cluster nodes and OMS nodes, as well as between OMS nodes and external clock sources.

2.2 HDFS: Distributed File System

As the distributed file system of Hadoop, HDFS provides reliable, distributed read/write access to massive data. HDFS is suited to "write once, read many" scenarios: writes are performed in sequence, meaning data is either written when a file is created or appended to the end of an existing file. HDFS ensures that only one caller can write to a file at a time, while multiple callers can read the same file concurrently.

Figure 2-2 Distributed file system (HDFS)
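The following is a minimal sketch of this access pattern using the third-party Python hdfs (WebHDFS) client; the NameNode HTTP address, user name, and file path are placeholders for illustration only.

```python
from hdfs import InsecureClient  # pip install hdfs (WebHDFS client)

# Placeholder NameNode HTTP address and user; adjust to the actual cluster.
client = InsecureClient('http://namenode-host:9870', user='hdfs')

# Write once: create the file and write its initial content.
with client.write('/tmp/hdfs_demo.txt', encoding='utf-8') as writer:
    writer.write('first batch of records\n')

# Later writes may only append to the end of the existing file.
client.write('/tmp/hdfs_demo.txt', data='appended records\n',
             encoding='utf-8', append=True)

# Many callers can read the same file concurrently.
with client.read('/tmp/hdfs_demo.txt', encoding='utf-8') as reader:
    print(reader.read())
```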

2.3 YARN: Unified Resource Management and Scheduling Framework

To implement sharing, scalability, and reliability of a Hadoop cluster and to eliminate the early performance bottleneck of JobTracker in the MapReduce framework, the open-source community introduced the unified resource management framework YARN.

At the core of YARN's layered structure is ResourceManager. This entity controls the entire cluster and manages the allocation of basic computing resources to applications. ResourceManager carefully allocates resources (computing, memory, and bandwidth) to the NodeManagers (YARN's per-node agents), and also allocates resources together with ApplicationMaster. In addition, ResourceManager and NodeManager jointly start and monitor their basic applications. In this architecture, ApplicationMaster takes over the role previously played by TaskTracker, and ResourceManager takes over the role of JobTracker.

ApplicationMaster manages all instances of an application running in YARN. ApplicationMaster negotiates resources from ResourceManager and, through NodeManager, monitors the execution and resource usage (CPU and memory allocation) of containers. Current resources (CPU cores and memory) are traditional; new resource types based on the task at hand, such as graphics processing units or dedicated processing devices, may be introduced in the future. From the perspective of YARN, ApplicationMaster is user code and therefore poses potential security issues. YARN assumes that ApplicationMaster may be faulty or even malicious and treats it as unprivileged code.

NodeManager manages the nodes in a YARN cluster. It provides services on every node of the cluster, such as supervising the lifecycle of containers, monitoring resources, and tracking node health. Whereas MRv1 managed the execution of Map and Reduce tasks through slots, NodeManager manages abstract containers, which represent per-node resources available to a specific application.

Figure 2-3 Unified resource management and scheduling framework (YARN)
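As a small illustration of ResourceManager's cluster-wide view, the sketch below queries the ResourceManager REST API for cluster metrics and running applications; the host name and port (8088 by default in open-source Hadoop) are assumptions, not FusionInsight-specific values.

```python
import requests

# Assumed ResourceManager web address; adjust to the actual cluster.
RM = 'http://resourcemanager-host:8088'

# Cluster-level metrics: total and allocated memory/vcores, node counts, and so on.
metrics = requests.get(f'{RM}/ws/v1/cluster/metrics').json()['clusterMetrics']
print(metrics['totalMB'], metrics['allocatedMB'], metrics['activeNodes'])

# Applications currently running on the cluster, with their resource usage.
apps = requests.get(f'{RM}/ws/v1/cluster/apps', params={'states': 'RUNNING'}).json()
for app in ((apps.get('apps') or {}).get('app') or []):
    print(app['id'], app['name'], app['allocatedMB'], app['allocatedVCores'])
```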

2.4 MapReduce: Distributed Batch Processing Engine

MapReduce is the core of Hadoop. As a software architecture proposed by Google, MapReduce is used for the parallel computing of massive data sets (larger than 1 TB). The concepts "Map" and "Reduce" and their main ideas are borrowed from functional programming languages, with some features borrowed from vector programming languages.

In the current implementation, you specify a Map function that maps a group of key-value pairs into a new group of key-value pairs, and a parallel Reduce function that ensures all values sharing the same key in the mapped key-value pairs are grouped together.

Figure 2-4 Distributed batch processing engine

MapReduce is a software framework for processing massive data sets in parallel. Its roots are the Map and Reduce functions of functional programming. The Map function accepts a group of data and transforms it into a list of key-value pairs, with each element in the input domain corresponding to one key-value pair. The Reduce function accepts the lists generated by the Map function and shrinks the key-value pair lists according to their keys.

MapReduce divides a large task into multiple parts and allocates them to different devices for processing. In this way, a task that could originally only be completed on a single powerful server can now be completed in a distributed environment.
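As an illustration of the Map and Reduce functions, the following is a minimal word-count sketch in the Hadoop Streaming style: the mapper and reducer read key-value pairs from standard input and write them to standard output, and the framework sorts the mapper output by key between the two phases.

```python
#!/usr/bin/env python3
"""Word count in the Hadoop Streaming style: run as `wordcount.py map` or `wordcount.py reduce`."""
import sys

def mapper():
    # Map: emit one (word, 1) pair per word on each input line.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Reduce: input arrives sorted by key, so pairs for one word are adjacent.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

Such a script would typically be submitted through the Hadoop Streaming jar, with the mapper and reducer commands and the input and output paths supplied as placeholders for the actual job.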

2.5 HBase: Distributed Database

Data storage is undertaken by HBase. HBase is an open-source, column-oriented distributed storage system suited to storing massive unstructured or semi-structured data. It features high reliability, high performance, and flexible scalability, and supports real-time data read/write.

Figure 2-5 Distributed database (HBase)

The typical features of a table stored in HBase are as follows:
Big table (BigTable): one table can contain hundreds of millions of rows and millions of columns.
Column-oriented: column-oriented storage, retrieval, and permission control.
Sparse: null columns in a table do not occupy any storage space.
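The following is a minimal sketch of real-time read/write against an HBase table, using the third-party happybase client over an HBase Thrift gateway; the host, table name, and column family are placeholders, and the table is assumed to already exist.

```python
import happybase  # pip install happybase; requires an HBase Thrift server

# Placeholder Thrift gateway host; adjust to the actual cluster.
connection = happybase.Connection('hbase-thrift-host')
table = connection.table('demo_table')  # assumed to exist with column family 'cf'

# Columns are addressed as family:qualifier; unset columns use no storage (sparse).
table.put(b'row-001', {b'cf:name': b'alice', b'cf:city': b'shenzhen'})

# Real-time point read of a single row.
print(table.row(b'row-001'))

# Scan a range of rows of the (sorted) big table.
for key, data in table.scan(row_start=b'row-000', row_stop=b'row-010'):
    print(key, data)
```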

2.6 Hive: Data Warehouse Component

Hive is a data warehouse infrastructure built on top of Hadoop. It provides a series of tools for extracting, transforming, and loading (ETL) data, and is a mechanism for storing, querying, and analyzing massive data stored on Hadoop. Hive defines a simple SQL-like query language known as HQL, which allows users familiar with SQL to query data. In addition, it allows developers familiar with MapReduce to plug in customized mappers and reducers to handle complex analysis tasks that the built-in mapper and reducer cannot complete.

Hive system structure:
User interface: three user interfaces are available: the command-line interface (CLI), the client, and the web user interface (WebUI). The CLI is the most frequently used interface; a Hive instance is started when the CLI is started. The client refers to a Hive client with which a user connects to the Hive Server; in client mode, specify the node where the Hive Server resides and start the Hive Server on that node. The WebUI is used to access Hive through a browser.
Metadata storage: Hive stores metadata in databases such as MySQL and Derby. Hive metadata includes the table name, table columns and partitions and their properties, table properties (for example, whether the table is an external table), and the directory where the table data is stored.
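The following is a minimal HQL sketch using the third-party PyHive client against HiveServer2; the host, port (10000 is the common HiveServer2 port in open-source Hadoop), user, and table are placeholders.

```python
from pyhive import hive  # pip install 'pyhive[hive]'

# Placeholder HiveServer2 endpoint; adjust to the actual cluster.
conn = hive.Connection(host='hiveserver2-host', port=10000, username='hive')
cursor = conn.cursor()

# Define a table over data stored on Hadoop, then analyze it with HQL.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS web_logs (
        ts STRING, url STRING, bytes INT
    ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
""")
cursor.execute("""
    SELECT url, COUNT(*) AS hits, SUM(bytes) AS total_bytes
    FROM web_logs
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10
""")
for row in cursor.fetchall():
    print(row)
```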

2.7 Spark: Distributed Real-time Computing Framework

Apache Spark is an open-source, general-purpose distributed cluster computing engine. Figure 2-6 shows the development history of Spark.

Figure 2-6 Development history of Spark

FusionInsight Spark is an open-source parallel data processing framework. It helps users easily develop fast, unified big data applications and perform cooperative processing, stream processing, and interactive analysis on data.

Spark has the following features:
Fast: Spark processes data 10 to 100 times faster than MapReduce.
Easy to use: parallel applications for processing massive data can be written quickly and simply in Java, Scala, or Python. Spark provides more than 80 high-level operators for building parallel applications.
Universal: Spark provides many high-level tools, such as Spark SQL, MLLib, GraphX, and Spark Streaming, which can be combined flexibly within one application.
Integration with Hadoop: Spark can run directly in a Hadoop 2.0 cluster and read existing Hadoop data. In particular, Spark is closely integrated with FusionInsight and can be deployed using FusionInsight Manager.

Spark provides a framework for fast computation, writes, and interactive queries, and has clear performance advantages over Hadoop MapReduce. Spark uses in-memory computing to avoid the I/O bottleneck that arises when multiple tasks in a MapReduce workflow compute over the same data set. Spark is implemented in Scala, which allows distributed data sets to be processed in the same way as local data. In addition to interactive data analysis, Spark supports interactive data mining; because Spark computes in memory, iterative computation is convenient, and iterative computation over the same data happens to be a common pattern in data mining. Spark can also run in a YARN cluster where Hadoop 2.0 is installed.

The reason Spark can retain MapReduce features such as fault tolerance, data locality, and scalability while still delivering high performance and avoiding heavy disk I/O is its memory abstraction, the Resilient Distributed Dataset (RDD). Earlier distributed memory abstractions, such as key-value stores and databases, supported fine-grained updates of mutable state; this requires the cluster to back up data or log updates to ensure fault tolerance, which introduces a large amount of I/O for data-intensive workflows. An RDD, by contrast, exposes a restricted set of interfaces that support only coarse-grained transformations such as map and join. As a result, Spark only needs to record the lineage of transformations used to build a data set to ensure fault tolerance, rather than recording the complete data set; this transformation lineage is also the means of tracing how a data set was derived. Because parallel applications generally apply the same computation to a large data set, the restriction to coarse-grained updates is not very limiting. In fact, as described in the Spark paper, RDDs can express many different computing frameworks, for example the MapReduce and Pregel programming models. In addition, Spark provides operations that let users persist intermediate transformation results to disk. Data locality is achieved by letting users control data partitioning based on the key of each record (an obvious advantage of this mode is that two data sets to be joined are hashed in the same way). If memory usage exceeds the physical limit, Spark writes larger partitions to disk, thereby ensuring scalability.
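The following is a minimal PySpark sketch of coarse-grained RDD transformations whose lineage provides fault tolerance; the input path is a placeholder.

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd_wordcount_sketch")

# Each transformation below is coarse-grained and recorded in the RDD lineage,
# so lost partitions can be recomputed from the lineage instead of from backups.
lines = sc.textFile("hdfs:///tmp/input")          # placeholder input path
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.persist()        # keep the result in memory for iterative reuse
print(counts.take(10))  # actions trigger the computation

sc.stop()
```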

2.8 Solr: Full-text Search Component

Solr is a standalone, Apache Lucene-based enterprise-class search application server. It provides REST-like HTTP/XML and JSON APIs. Solr's main functions include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document handling (for example, Word and PDF documents), and geographical information search. As an excellent enterprise-class search server, Solr has the following features:
Advanced full-text search
Optimized for high-volume network traffic
Standards-based open interfaces: XML, JSON, and HTTP
Comprehensive HTML administration UI
Java Management Extensions (JMX) for server monitoring
Linear scalability, automatic index replication, and automatic failover and recovery
Near real-time indexing
XML-based configuration for flexibility and adaptability
Extensible plug-in architecture

Figure 2-7 Logical composition of the Solr cluster solution

Table 2-2 Solr composition
Client: communicates with SolrServer in the Solr cluster (SolrCloud) over HTTP and performs distributed indexing and distributed search operations.
SolrServer: provides services such as index creation and full-text retrieval, and is the data computing and processing unit of the Solr cluster. SolrServer is generally co-located with DataNode in the HDFS cluster to provide high-performance indexing and search services.
ZooKeeper cluster: provides distributed coordination services for the processes in the Solr cluster. Each SolrServer registers its information (collection configuration and SolrServer health information) with ZooKeeper; based on this information, Client detects the health status of each SolrServer and decides how to distribute indexing and search requests.
HDFS cluster: provides Solr with a highly reliable file storage service. All index files are stored in HDFS.
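The following is a minimal indexing-and-query sketch using the third-party pysolr client; the SolrServer URL and collection name are placeholders.

```python
import pysolr  # pip install pysolr

# Placeholder SolrCloud endpoint and collection; adjust to the actual cluster.
solr = pysolr.Solr('http://solrserver-host:8983/solr/demo_collection',
                   always_commit=True)

# Index a few documents; near real-time indexing makes them searchable quickly.
solr.add([
    {'id': 'doc-1', 'title': 'FusionInsight HD overview', 'body': 'big data platform'},
    {'id': 'doc-2', 'title': 'Solr full-text search', 'body': 'lucene based search server'},
])

# Full-text query with hit highlighting on the body field.
results = solr.search('body:search', **{'hl': 'true', 'hl.fl': 'body'})
for doc in results:
    print(doc['id'], doc.get('title'))
```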

2.9 Loader (Sqoop): Integration of Data in Batches

Loader is developed based on the open-source Sqoop component. It is used to exchange data and files between FusionInsight and relational databases or file systems. Loader can import data from relational databases or file servers into the HDFS/HBase of FusionInsight, or export data from the HDFS/HBase to relational databases or file servers.

The Loader model consists of a Loader client and a Loader server, as shown in Figure 2-8.

Figure 2-8 Loader model

Table 2-3 describes the function of each part of the Loader model.

Table 2-3 Components of the Loader model
Loader Client: provides a web user interface (WebUI) and a command-line interface (CLI).
Loader Server: processes operation requests sent from the client, manages connectors and metadata, submits MapReduce jobs, and monitors MapReduce job status.
REST API: provides a RESTful (HTTP + JSON) interface to process operation requests sent from the client.
Job Scheduler: periodically executes Loader jobs.
Transform Engine: a data transformation engine that supports field combination, string cutting, and string reversal.
Execution Engine: executes Loader jobs as MapReduce jobs.
Submission Engine: submits Loader jobs to MapReduce.
Job Manager: manages Loader jobs, including creating, querying, updating, deleting, activating, deactivating, starting, and stopping jobs.
Metadata Repository: stores and manages data about Loader connectors, transformation procedures, and jobs.
HA Manager: manages the active/standby status of the Loader Server. Two Loader servers are deployed in active/standby mode.

Implementing parallel job execution and fault tolerance using MapReduce
Loader implements parallel import and export jobs using MapReduce. Some import or export jobs involve only Map operations, while others involve both Map and Reduce operations. Loader also implements fault tolerance through MapReduce: jobs can be rescheduled when job execution fails.

Importing data to HBase
During the Map phase of a MapReduce job, Loader obtains data from the external data source. During the Reduce phase, Loader starts the same number of Reduce tasks as there are Regions; each Reduce task receives data from the Map tasks, generates HFiles by Region, and stores the HFiles in a temporary directory on HDFS.

When the MapReduce job is committed, Loader migrates the HFiles from the temporary directory to the HBase directory.

Importing data to HDFS
During the Map phase of the MapReduce job, Loader obtains data from the external data source and writes it to a temporary directory (the export directory suffixed with "ldtmp"). When the MapReduce job is committed, Loader migrates the data from the temporary directory to the export directory.

Exporting data to a relational database
During the Map phase of the MapReduce job, Loader obtains data from HDFS or HBase and inserts it into a temporary table (the staging table) over a JDBC connection. When the MapReduce job is committed, Loader migrates the data from the temporary table to the formal table.

Exporting data to a file system
During the Map phase of the MapReduce job, Loader obtains data from HDFS or HBase and writes it to a temporary directory on the file server. When the MapReduce job is committed, Loader migrates the data from the temporary directory to the formal directory.

2.10 Flume: Real-Time Data Collection

Flume is a highly available, reliable, distributed system for collecting, aggregating, and transmitting massive logs. Flume supports customized data transmitters for collecting data from logging systems, performs simple processing on the data, and writes it to customizable data receivers. As the next-generation branch of Flume, Flume-NG is designed to be simple, compact, and easy to deploy. The following figure shows the overall architecture of this solution.

A Flume-NG deployment consists of Agents. Each Agent consists of three modules: Source, Channel, and Sink. Source receives data, Channel transmits (buffers) data, and Sink sends data to the next hop.
Source collects log data, packages it into transactions and events, and puts them into the Channel.
Channel provides a queue that briefly caches the data delivered by the Source.

Sink obtains the data from the Channel and imports it into a storage file system or database, or forwards it to a remote server.
The reliability of Flume is based on transactional handover between Agents: when an Agent breaks down, the Channel stores the data persistently and keeps transmitting it until the Agent recovers. The availability of Flume is based on the built-in load-balancing and failover mechanisms; both Channels and Agents can be configured with multiple instances, among which a load-balancing policy can be applied.
Each Agent is a JVM process, and a server can host multiple Agents. Collection nodes (for example, agent1, agent2, and agent3) process logs, and the convergence node (agent4) writes to HDFS. Each collection-node agent can select multiple convergence nodes, so that load balancing can be implemented.

2.11 Storm: Stream Processing

2.11.1 Storm

Apache Storm is a distributed, reliable, and fault-tolerant real-time stream data processing system. In Storm, a graph-shaped data structure called a topology is designed first for real-time computing. The topology is submitted to the cluster, where the master node distributes the code and assigns tasks to the worker nodes. A topology contains two kinds of roles: spouts and bolts. A spout emits messages as data streams of tuples; a bolt transforms the data streams and performs computation and filtering operations. A bolt can send data to any other bolt. The tuples emitted by a spout are immutable arrays that map to fixed key-value pairs.

Figure 2-9 System architecture of Storm

Service processing logic is encapsulated in Storm topologies. A topology is a set of Spout (data source) and Bolt (logical processing) components connected by stream groupings into a directed acyclic graph (DAG). All components (Spouts and Bolts) in a topology work in parallel. In a topology, you can specify the degree of parallelism for each node; Storm then allocates tasks across the cluster for computation to improve system processing capability.

Figure 2-10 Topology

2.11.2 StreamCQL

StreamCQL is an SQL-like language used for real-time stream processing. Compared with SQL, StreamCQL introduces the concept of a (time-ordered) window, which allows data to be stored and processed in memory. The output of StreamCQL is the computing results of the data streams at specific points in time.

The use of StreamCQL accelerates service development, makes it easy to submit tasks to the Storm platform for real-time processing, facilitates the output of results, and allows tasks to be terminated at the appropriate time. StreamCQL has the following highlights:
Easy to use: the StreamCQL syntax is similar to SQL. Users with basic knowledge of SQL can easily understand CQL and use StreamCQL to develop services.
Rich functions: in addition to the basic expressions provided by SQL, StreamCQL provides stream-processing functions such as windows, filtering, and concurrency settings, meeting real-time service processing requirements.
Easy to scale: StreamCQL provides an extension interface to support increasingly complex service scenarios. Users can customize the input, output, serialization, and deserialization to meet specific service requirements.
Easy to debug: StreamCQL provides detailed explanations of error codes, helping users rectify faults.

Figure 2-11 Comparison between the native Storm API and StreamCQL

2.11.3 Flink

Flink is a unified computing framework that supports both batch processing and stream processing. It provides a stream data processing engine that supports data distribution and parallel computing. Flink specializes in stream processing and is one of the top open-source stream processing engines in the industry. It is applicable to the following scenario:
Low-latency data processing: Flink provides high-concurrency pipelined data processing, millisecond-level latency, and high reliability.
The following figure shows the Flink technology stack.

Figure 2-12 Flink technology stack

The entire Flink system consists of three parts:
Client: used to submit jobs (streaming jobs) to Flink.
TaskManager: a service execution node of Flink that executes specific tasks. A Flink system can have multiple TaskManagers, which are all equivalent to each other.
JobManager: a management node of Flink. It manages all TaskManagers and schedules the tasks submitted by users onto TaskManagers. In high-availability (HA) mode, multiple JobManagers are deployed; one is elected as the active JobManager and the others remain standby.

Flink provides the following features:
Low latency: millisecond-level processing capability.
Exactly once: an asynchronous snapshot mechanism ensures that all data is processed exactly once.
HA: active/standby JobManagers prevent single points of failure (SPOFs).
Scale-out: manual scale-out is supported through TaskManagers.
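The following is a minimal PyFlink DataStream sketch of a parallel stream transformation, assuming the PyFlink API of Flink 1.12 or later; the small bounded in-memory source stands in for a real stream such as Kafka.

```python
from pyflink.datastream import StreamExecutionEnvironment

# The environment is the client-side entry point; the job is submitted to the
# JobManager, which schedules its tasks onto TaskManagers.
env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(2)  # degree of parallelism across TaskManager slots

# A small bounded source standing in for a real stream (for example, Kafka).
events = env.from_collection(['login', 'click', 'click', 'logout'])

# A simple per-event transformation; records flow through the pipeline with low latency.
events.map(lambda e: 'processed:' + e).print()

env.execute('flink_stream_sketch')
```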

2.12 Redis: Distributed High-Speed Cache

Remote Dictionary Service (Redis) is a high-performance in-memory key-value database written in C. It supports multiple data types, including strings, lists, sets, sorted sets (zset), and hashes. Compared with standalone Redis, a Redis cluster offers more advantages and is suitable for production environments; however, Redis cluster management is complex and some functions are incomplete. FusionInsight HD therefore provides graphical Redis management:

Wizard-based creation of Redis clusters: FusionInsight supports the creation of Redis clusters in master/slave mode. The system automatically calculates the number of Redis instances to install on each node and determines the master/slave relationships.
Cluster capacity expansion and reduction: when required, a user can add one or more master/slave instance pairs with a few clicks. The system automatically completes data migration and rebalancing for capacity expansion.
Balance: data in a Redis cluster may become unevenly distributed if an expansion fails or some instances go offline. FusionInsight Manager provides a Balance function that automatically rebalances cluster data, ensuring stable operation of clusters.
Performance monitoring and alarming: the system provides performance monitoring of Redis clusters with intuitive curves to help users understand cluster status and the TPS of each instance.
Cluster reliability guarantee: the cluster creation tool provided by the community is incomplete; it allocates master/slave instances to nodes in sequence, so a master/slave pair may reside on the same node, and if that node fails, both instances become unavailable. FusionInsight HD automatically places the master instance of a pair on one node and the slave instance on another node when creating a Redis cluster, and also when expanding or reducing capacity. If any node in a cluster becomes faulty, a master/slave switchover is performed, ensuring service continuity.
Cluster performance optimization: OS-layer and application-layer performance tuning is built into the system, delivering better performance than the community Redis cluster out of the box.
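The following is a minimal cache-access sketch using the redis-py client; the host and key names are placeholders.

```python
import redis  # pip install redis

# Placeholder Redis endpoint; adjust to an instance of the Redis cluster.
r = redis.Redis(host='redis-host', port=6379, decode_responses=True)

# String: a typical cache entry with an expiry time.
r.set('session:1001', 'alice', ex=3600)
print(r.get('session:1001'))

# Hash: one structured record under a single key.
r.hset('user:1001', mapping={'name': 'alice', 'city': 'shenzhen'})
print(r.hgetall('user:1001'))

# Sorted set (zset): a leaderboard ordered by score.
r.zadd('leaderboard', {'alice': 120, 'bob': 95})
print(r.zrange('leaderboard', 0, -1, withscores=True))
```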

2.13 Kafka: Distributed Message Queue

Kafka is a distributed, partitioned, replicated message publishing and subscription system. It provides features similar to the Java Message Service (JMS), but its design is completely different. Kafka provides features such as message persistence, high throughput, distribution, multi-client support, and real-time processing, and applies to both online and offline message consumption. It is ideal for Internet service data collection scenarios such as conventional data collection, website activity tracking, operational data from aggregation statistics systems (data monitoring), and log collection.

Figure 2-13 Kafka architecture

Broker: a server in a Kafka cluster.
Topic: a category or feed name to which messages are published. A topic can be divided into multiple partitions, which act as units of parallelism.
Partition: an ordered, immutable sequence of messages that is continually appended, like a commit log. Each message in a partition is assigned a sequential ID called the offset, which uniquely identifies the message within the partition.
Producer: publishes messages to a Kafka topic.
Consumer: subscribes to topics and receives the messages published to them.
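The following is a minimal publish/subscribe sketch using the kafka-python client; the broker address, topic name, and consumer group are placeholders.

```python
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

BROKERS = ['kafka-broker:9092']  # placeholder broker list

# Producer: publish a few messages to a topic.
producer = KafkaProducer(bootstrap_servers=BROKERS)
for i in range(3):
    producer.send('demo_topic', key=b'sensor-1', value=f'reading-{i}'.encode())
producer.flush()

# Consumer: subscribe to the topic and read messages; each message's offset
# uniquely identifies it within its partition.
consumer = KafkaConsumer('demo_topic',
                         bootstrap_servers=BROKERS,
                         group_id='demo_group',
                         auto_offset_reset='earliest',
                         consumer_timeout_ms=5000)
for msg in consumer:
    print(msg.partition, msg.offset, msg.key, msg.value)
```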

2.14 Oozie: Job Orchestration and Scheduling

Oozie is an open-source workflow engine used to schedule and coordinate Hadoop jobs.

Figure 2-14 Oozie architecture

Console: allows users to view and monitor Oozie workflows.
Client: controls workflows through an interface, including submitting, starting, running, suspending, and resuming workflows.
SDK: used to develop application software for specific software packages, software frameworks, hardware platforms, and operating systems (OSs).
Database: the PostgreSQL (pg) database.
Oozie server: runs in an internal or external Tomcat container and stores information such as logs in the database.
Tomcat: a free, open-source web application server.
Hadoop components: underlying components, such as MapReduce and Hive, that execute the workflows orchestrated by Oozie.

Principles
Oozie is a workflow engine server that runs FusionInsight HD MapReduce workflows. It is also a Java web application running in a Tomcat container. Oozie workflows are constructed using the Hadoop Process Definition Language (HPDL), an XML-based language similar to the JBoss jBPM Process Definition Language (JPDL). A workflow consists of Control Nodes (nodes that control the workflow) and Action Nodes. Control Nodes control workflow orchestration, for example start, end, error, decision, fork, and join.

An Oozie workflow contains multiple Action Nodes, such as MapReduce and Java nodes. All Action Nodes are deployed and run as a directed acyclic graph (DAG), so Action Nodes run in order: the next Action Node can run only after the previous Action Node has finished. When an Action Node finishes, the remote server calls back the Oozie interface, and Oozie then executes the next Action Node of the workflow in the same manner until all Action Nodes have been executed (execution failures are also counted). Oozie workflows provide various types of Action Nodes, such as MapReduce, Hadoop distributed file system (HDFS), Secure Shell (SSH), Java, and Oozie sub-workflows, to support different business requirements.

2.15 Hue: Hadoop Integrated Development Tool

Hue is a group of web applications used to interact with the FusionInsight platform. Hue helps users browse HDFS, run Hive and Impala queries, and start MapReduce tasks or Oozie workflows. Hue runs in a browser; in FusionInsight, Hue is integrated into FusionInsight Manager. Figure 2-15 shows the overall architecture and working mechanism of Hue. Hue Server is a web application container integrated into FusionInsight Manager; it hosts the applications that interact with all FusionInsight HD components.

Figure 2-15 Integrated development tool

Hue involves the following components and functions:
File browser: allows users to browse and operate on different HDFS directories directly through the UI. It provides the following functions:
Creating a file or directory; uploading, downloading, renaming, moving, and deleting a file or directory
Modifying the owner and permissions of a file or directory
Searching for a file, a directory, a file owner, or the user group to which a user belongs

Viewing and editing a file
Query editor: allows a user to write simple SQL and query the data stored on Hadoop, for example in HDFS, HBase, Hive, and Impala. With the query editor, a user can conveniently create, manage, and execute SQL and download the execution results as an Excel file. It provides the following functions:
Editing and executing SQL, saving SQL templates, copying and editing templates, and SQL explanation, querying, and history recording
Database and data table presentation
Support for different types of Hadoop storage
Workflow control: involves the following functions:
Task browser: provides the task list, finds relevant information (status, start time, and end time) of the child tasks of a specific task, and allows viewing task logs
Task customizer: helps users create and submit tasks with ease, and allows entering variable and parameter values for a specific task
Oozie editor: enables a user to define Oozie workflows and coordinators. A workflow is a combination of a group of tasks that controls their execution sequence; a workflow can automatically control various operations on the node tasks belonging to it, such as execution, stopping, and cloning. The coordinator application enables a user to define and execute periodic and interdependent workflow tasks and to configure the execution criteria of a workflow.
User management: similar to a conventional web application, Hue provides a user management function with which user information can be added or deleted.