Technical White Paper


Issue 01
Date
HUAWEI TECHNOLOGIES CO., LTD.

Copyright © Huawei Technologies Co., Ltd. 2017. All rights reserved.

No part of this document may be reproduced or transmitted in any form or by any means without prior written consent of Huawei Technologies Co., Ltd.

Trademarks and Permissions

Huawei and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd. All other trademarks and trade names mentioned in this document are the property of their respective holders.

Notice

The purchased products, services and features are stipulated by the contract made between Huawei and the customer. All or part of the products, services and features described in this document may not be within the purchase scope or the usage scope. Unless otherwise specified in the contract, all statements, information, and recommendations in this document are provided "AS IS" without warranties, guarantees or representations of any kind, either express or implied.

The information in this document is subject to change without notice. Every effort has been made in the preparation of this document to ensure accuracy of the contents, but all statements, information, and recommendations in this document do not constitute a warranty of any kind, express or implied.

Huawei Technologies Co., Ltd.
Address: Huawei Industrial Base, Bantian, Longgang, Shenzhen, People's Republic of China
Website:

Contents

1 Introduction
1.1 Overview of FusionInsight HD
1.2 Basic Components of FusionInsight HD
2.1 Manager: Cluster Management
2.2 HDFS: Distributed File System
2.3 YARN: Unified Resource Management and Scheduling Framework
2.4 MapReduce: Distributed Batch Processing Engine
2.5 HBase: Distributed Database
2.6 Hive: Data Warehouse Component
2.7 Spark: Distributed Real-time Computing Framework
2.8 Solr: Full-text Search Component
2.9 Loader (Sqoop): Integration of Data in Batches
2.10 Flume: Real-Time Data Collection
2.11 Storm: Stream Processing (Storm, StreamCQL, Flink)
2.12 Redis: Distributed High-Speed Cache
2.13 Kafka: Distributed Message Queue
2.14 Oozie: Job Orchestration and Scheduling
2.15 Hue: Hadoop Integrated Development Tool

1 Introduction

1.1 Overview of FusionInsight HD

Huawei FusionInsight is a unified enterprise-class big data storage, query, and analysis platform. It allows enterprises to quickly build a system for processing massive data, so that they can spot and capture new business opportunities by analyzing and mining massive data in a real-time or non-real-time manner.

FusionInsight consists of five sub-products: FusionInsight HD, FusionInsight LibrA, FusionInsight Miner, FusionInsight Farmer, and the O&M system FusionInsight Manager.

FusionInsight HD: an enterprise-class big data processing platform. It is a distributed data processing system that provides large-capacity data storage, query, analysis, and real-time stream data processing capabilities. FusionInsight HD includes ZooKeeper, Hadoop, HBase, Loader, Hive, Hue, Oozie, Phoenix, Solr, Redis, Spark, Streaming, Kafka, Elk, Flink, and other components.

FusionInsight Miner: an enterprise-class data analysis platform. Built on the distributed storage and parallel computing technology of FusionInsight HD, it mines valuable information from massive data.

FusionInsight Farmer: an enterprise-class big data application container that provides a unified development, running, and management platform for enterprise services.

FusionInsight Manager: an enterprise-class big data O&M platform. It provides highly reliable, secure, fault-tolerant, and easy-to-use cluster management for FusionInsight HD. It supports installation and deployment, monitoring, alarm management, user management, rights management, auditing, service management, health check, fault location, upgrades, and patching of large-scale clusters.

FusionInsight LibrA: an enterprise-class massively parallel processing (MPP) relational database. Based on column storage and the MPP architecture, it is designed and developed for the analysis of structured data and can effectively process data volumes at the PB level. FusionInsight LibrA differs greatly from conventional databases in its core technologies. It solves data processing performance issues for users in many industries, provides a general, cost-effective computing platform for large-scale data management, and supports various data warehouse systems, business intelligence (BI) systems, and decision-support systems, allowing upper-layer applications to perform decision analysis in a unified manner.

1.2 Basic Components of FusionInsight HD

Huawei FusionInsight HD encapsulates and enhances the open-source Hadoop components to provide stable, large-capacity data storage, query, and analysis capabilities. Huawei FusionInsight HD has the following components:

Manager: an O&M system. It provides highly reliable, secure, fault-tolerant, and easy-to-use cluster management for FusionInsight HD. It supports installation, upgrade, and patching of large-scale clusters as well as the management of configurations, monitoring, alarms, users, and tenants.

HDFS: supports data access with high throughput and applies to the processing of large data sets.

HBase: a distributed, column-oriented storage system built on HDFS that stores massive data.

Oozie: orchestrates and executes jobs for open-source Hadoop components. It runs in a Java servlet container (for example, Tomcat) as a Java web application and uses a database to store workflow definitions and running workflow instances (including the status and variables of the instances).

ZooKeeper: provides highly available distributed service coordination. It prevents single points of failure (SPOFs) and helps create reliable applications.

Redis: provides a high-performance, memory-based distributed key-value cache system.

YARN: serves as the resource management system of Hadoop 2.0. It is a general resource module that manages and schedules resources for applications.

MapReduce: provides the capability of processing massive data quickly and in parallel. It is a distributed data processing model and execution environment.

Spark: a distributed, memory-based computing framework.

Hive: an open-source data warehouse built on top of Hadoop. It supports SQL-like access to structured data, known as HQL, and allows for basic data analysis services.

Loader: a data import and export tool enhanced from the open-source Apache Sqoop component. Loader exchanges data between FusionInsight HD and relational databases or FTP/SFTP file servers. It provides a Java API and a shell task scheduling interface for third-party scheduling platforms.

Hue: provides the web UI of the open-source components. Through a browser, users can operate HDFS directories and files, invoke Oozie to create, monitor, and orchestrate workflows, operate the Sqoop component, and view ZooKeeper cluster status.

Flume: a distributed, reliable, highly available massive log aggregation system. Flume supports customized data transmitters for collecting data, performs light processing on the data, and writes it to customizable data receivers.

Solr: a high-performance, Lucene-based full-text retrieval server. Solr provides a query language even richer than Lucene's, is configurable and scalable, and has optimized query performance. It provides a GUI with comprehensive management functions and is an excellent full-text retrieval engine.

Kafka: a distributed, real-time message publishing and subscription system with partitions and replicas. It provides scalable, high-throughput, low-latency, and reliable message dispatching services.

Storm: a distributed, reliable, fault-tolerant, real-time stream data processing system. It provides an SQL-like query language (StreamCQL).

Flink: a distributed, highly available data processing engine that supports both batch processing and stream processing, with exactly-once semantics.

Spark SQL: a high-performance SQL engine based on the Spark engine. It can share metadata with Hive.

Mahout: provides a MapReduce-based data mining algorithm library.

MLlib: provides a Spark-based data mining algorithm library.

GraphX: provides a Spark-based graph processing algorithm library.

2.1 Manager: Cluster Management
2.2 HDFS: Distributed File System
2.3 YARN: Unified Resource Management and Scheduling Framework
2.4 MapReduce: Distributed Batch Processing Engine
2.5 HBase: Distributed Database
2.6 Hive: Data Warehouse Component
2.7 Spark: Distributed Real-time Computing Framework
2.8 Solr: Full-text Search Component
2.9 Loader (Sqoop): Integration of Data in Batches
2.10 Flume: Real-Time Data Collection
2.11 Storm: Stream Processing
2.12 Redis: Distributed High-Speed Cache
2.13 Kafka: Distributed Message Queue
2.14 Oozie: Job Orchestration and Scheduling
2.15 Hue: Hadoop Integrated Development Tool

2.1 Manager: Cluster Management

Manager, the O&M management system of FusionInsight HD, provides a unified cluster management capability for the services deployed in the cluster. Manager provides functions such as installation and deployment, performance monitoring, alarms, user management, permission management, auditing, service management, health check, log collection, upgrade, and patching.

Figure 2-1 Logical architecture of FusionInsight Manager

FusionInsight Manager consists of OMS and NodeAgent:

OMS: the management nodes of the O&M system. Two OMS nodes are deployed in active/standby mode.

NodeAgent: the managed nodes of the O&M system. Each node is equipped with a NodeAgent.

Table 2-1 Service module description

WebService: a web service deployed under Tomcat. It provides the HTTPS interface of Manager, which is used to access Manager through a web browser. In addition, it provides northbound access based on the Syslog and SNMP protocols.

Controller: the control center of Manager. It converges information from all nodes in the cluster and displays it to administrators, receives operation instructions from administrators, and synchronizes information to each node in the cluster according to the scope of the instructions.

NodeAgent: exists on each cluster node and serves as the agent through which Controller performs all operations on the components deployed on that node. It represents all the components deployed on the node when exchanging information with Controller, achieving multipoint-to-point convergence across the cluster.

IAM: records audit logs. Each non-query operation on the Manager UI has a related audit log.

PMS: the performance monitoring module. It collects performance monitoring data from each OMMAgent and provides the query function.

CEP: the convergence function module. For example, the CEP integrates the used disk space of each OMMAgent into one performance indicator.

FMS: the alarm module. It collects alarms from each OMMAgent and provides the query function.

OMMAgent: the agent that collects performance monitoring data and alarms on its node.

CAS: the unified authentication center. Login authentication through the CAS is required when logging in to WebService; the browser automatically redirects to the CAS using the URL.

AOS: the permission management module. It manages the rights of users and user groups.

OMS Kerberos: supports SSO and authentication between Controller and NodeAgent.

OMS LDAP: provides data storage for user authentication before the cluster is installed and backs up the cluster LDAP after the cluster is installed.

Database: the database of Manager. It stores information such as configuration, monitoring, and alarm data.

NTP: implements clock synchronization between cluster nodes and OMS nodes, as well as between OMS nodes and external clock sources.

2.2 HDFS: Distributed File System

As the distributed file system of Hadoop, HDFS implements reliable, distributed read/write of massive data. HDFS is applicable to "write once, read many" scenarios: writes are performed in sequence, either during file creation or as an append behind the existing file. HDFS ensures that only one caller can write to a file at a time, while multiple callers can read the same file concurrently.

Figure 2-2 Distributed file system (HDFS)
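This single-writer, many-reader contract is visible directly in the HDFS Java API. The sketch below is illustrative only (the NameNode address and file path are placeholder assumptions, not values from this document): a file is created once, extended only by appending, and then read back.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder NameNode address
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/events.log");          // placeholder path

        // Write once: HDFS grants a single write lease per file.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("first record");
        }
        // The only way to modify an existing file is to append behind it.
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeUTF("appended record");
        }
        // Read many: any number of clients may open the file concurrently.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF()); // prints "first record"
        }
        fs.close();
    }
}
```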

2.3 YARN: Unified Resource Management and Scheduling Framework

To implement sharing, scalability, and reliability of a Hadoop cluster and to eliminate the early performance bottleneck of JobTracker in the MapReduce framework, the open-source community introduced the unified resource management framework YARN.

At the core of YARN's layered structure is ResourceManager. This entity controls the overall cluster and manages the allocation of basic computing resources to applications. ResourceManager carefully allocates the various resources (computing, memory, and bandwidth) to the basic NodeManagers (YARN's per-node agents). ResourceManager also allocates resources together with ApplicationMaster. In addition, ResourceManager and NodeManager jointly start and monitor their basic applications. In this architecture, ApplicationMaster takes over the role formerly played by TaskTracker, and ResourceManager takes over the role of JobTracker.

ApplicationMaster manages all the instances of an application running in YARN. ApplicationMaster negotiates resources from ResourceManager and, through NodeManager, monitors the execution and resource usage (CPU and memory allocation) of containers. The current resource types (CPU cores and memory) are the traditional ones; new resource types based on the task at hand, for example graphics processing units or dedicated processing devices, may be introduced in the future. From the perspective of YARN, ApplicationMaster is user code; consequently, potential security issues exist. YARN assumes that ApplicationMaster may be faulty or even malicious and therefore treats it as unprivileged code.

NodeManager manages each node in a YARN cluster. It provides services for all nodes of a cluster, for example, managing the lifecycle of containers, monitoring resources, and tracking node health. Whereas MRv1 managed the execution of Map and Reduce tasks through slots, NodeManager manages abstract containers. These containers represent per-node resources that are available to a specific application.
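The division of labor described above can be observed through the standard YARN client API. The following sketch is an assumption-level example (configuration is read from the classpath, and no FusionInsight-specific calls are made): it asks ResourceManager for the resources each NodeManager reports and for the applications whose ApplicationMasters are currently running.

```java
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnInspect {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration()); // reads yarn-site.xml from the classpath
        yarn.start();

        // Resources (memory, vCores) that each NodeManager reports to ResourceManager.
        for (NodeReport node : yarn.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId() + " capability: " + node.getCapability());
        }
        // One ApplicationMaster instance drives each of these applications.
        for (ApplicationReport app : yarn.getApplications()) {
            System.out.println(app.getApplicationId() + " " + app.getName()
                    + " " + app.getYarnApplicationState());
        }
        yarn.stop();
    }
}
```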

Figure 2-3 Unified resource management and scheduling framework (YARN)

2.4 MapReduce: Distributed Batch Processing Engine

MapReduce is the core of Hadoop. As a software architecture proposed by Google, MapReduce is used for parallel computing of massive data sets (larger than 1 TB). The concepts "Map" and "Reduce" and their main ideas are borrowed from functional programming languages, and some features are borrowed from vector programming languages. In the current implementation, users specify a Map function that maps a group of key-value pairs into a new group of key-value pairs, and a parallel Reduce function that ensures all mapped key-value pairs sharing the same key are grouped together.

Figure 2-4 Distributed batch processing engine

MapReduce is a software framework for processing massive data sets in parallel. The roots of MapReduce are the Map and Reduce functions of functional programming. The Map function accepts a group of data and transforms it into a list of key-value pairs, with each element of the input domain corresponding to one key-value pair. The Reduce function accepts the lists generated by the Map function and then shrinks each key-value-pair list based on its key. MapReduce divides a large task into multiple parts and allocates them to different devices for processing. In this way, a task that could originally be finished only on a single powerful server can now be finished in a distributed environment.
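The canonical illustration of this model is word counting. The sketch below follows the standard Apache Hadoop WordCount example (it is not FusionInsight-specific): the Map function emits a (word, 1) pair per token, and the Reduce function shrinks each per-key list into a sum.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get(); // shrink the per-key list
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```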

2.5 HBase: Distributed Database

Data storage is undertaken by HBase. HBase is an open-source, column-oriented distributed storage system suited to storing massive unstructured or semi-structured data. It features high reliability, high performance, and flexible extensibility, and supports real-time data read/write.

Figure 2-5 Distributed database (HBase)

The typical features of a table stored in HBase are as follows:

Big table (BigTable): one table can contain hundreds of millions of rows and millions of columns.

Column-oriented: column-oriented storage, retrieval, and permission control.

Sparse: null columns in the table do not occupy any storage space.
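A minimal client interaction with such a table might look as follows. This is an illustrative sketch using the standard HBase 1.x client API; the table name, column family, row key, and value are invented for the example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGet {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user_profile"))) {
            // Write one cell: row key -> column family "info", qualifier "city".
            Put put = new Put(Bytes.toBytes("row-001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"),
                    Bytes.toBytes("Shenzhen"));
            table.put(put);

            // Read it back in real time; absent (sparse) columns cost nothing.
            Result result = table.get(new Get(Bytes.toBytes("row-001")));
            byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
            System.out.println(Bytes.toString(city));
        }
    }
}
```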

2.6 Hive: Data Warehouse Component

Hive is a data warehouse infrastructure built on top of Hadoop. It provides a series of tools that can be used to extract, transform, and load (ETL) data, and it is a mechanism for storing, querying, and analyzing the massive data stored on Hadoop. Hive defines a simple SQL-like query language known as HQL, which allows users familiar with SQL to query data. The language also allows developers familiar with MapReduce to plug in customized mappers and reducers for complex analysis tasks that the built-in mapper and reducer cannot complete.

Hive system structure:

User interface: three user interfaces are available, that is, the command-line interface (CLI), the client, and the web user interface (WebUI). The CLI is the most frequently used; starting the CLI starts a Hive instance. The client refers to a Hive client, whose user connects to the Hive Server; when entering client mode, specify the node where the Hive Server resides and start the Hive Server on that node. The WebUI is used to access Hive through a browser.

Metadata storage: Hive stores metadata in databases such as MySQL and Derby. The metadata in Hive includes the table name, table columns and partitions and their properties, table properties (for example, whether the table is an external table), and the directory where the data of the table is stored.
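Besides the CLI and WebUI, the Hive Server can be reached through the standard Hive JDBC driver. The sketch below is a hedged example (server address, credentials, table name, and schema are placeholders): it creates a table and runs an HQL aggregation of the kind Hive compiles into MapReduce jobs.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveHqlDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "hive", ""); // placeholders
             Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE TABLE IF NOT EXISTS logs (ts STRING, level STRING, msg STRING) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");
            // HQL: compiled by Hive into MapReduce work over data stored on Hadoop.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT level, COUNT(*) FROM logs GROUP BY level")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + ": " + rs.getLong(2));
                }
            }
        }
    }
}
```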

2.7 Spark: Distributed Real-time Computing Framework

Apache Spark is an open-source, general-purpose distributed cluster computing engine. Figure 2-6 shows the development history of Spark.

Figure 2-6 Development history of Spark

FusionInsight Spark is an open-source parallel data processing framework. It helps users develop unified big data applications quickly and simply, and perform cooperative processing, stream processing, and interactive analysis on data. Spark has the following features:

Fast: the data processing speed of Spark is 10 to 100 times that of MapReduce.

Easy to use: Java, Scala, and Python can be used to quickly write parallel applications for processing massive data. Spark provides over 80 high-level operators to help users build parallel applications.

Universal: Spark provides many high-level tools, for example Spark SQL, MLlib, GraphX, and Spark Streaming, which can be flexibly combined within one application.

Integration with Hadoop: Spark can run directly in a Hadoop 2.0 cluster and read existing Hadoop data. In particular, Spark is closely integrated with FusionInsight and can be deployed using FusionInsight Manager.

Spark provides a framework for fast computing, writing, and interactive query. Spark has an obvious performance advantage over Hadoop: it uses in-memory computing to avoid the I/O bottleneck that arises when multiple tasks in a MapReduce workflow compute over the same data set. Spark is implemented in Scala, which makes distributed data sets as easy to process as local data. In addition to interactive data analysis, Spark supports interactive data mining: its memory-based computing makes iterative computing convenient, and, as it happens, iterative computing over the same data is a common pattern in data mining. Spark can also run in a YARN cluster where Hadoop 2.0 is installed.

The reason why Spark can retain features such as MapReduce-style fault tolerance, data localization, and scalability while also ensuring high performance and avoiding heavy disk I/O is a memory abstraction called the Resilient Distributed Dataset (RDD). Earlier distributed memory abstractions, for example key-value stores and databases, support fine-grained updates of mutable state. This requires the cluster to replicate data or log the updates for fault tolerance, which imposes a large amount of I/O on data-intensive workflows. An RDD, by contrast, has a restricted interface that supports only coarse-grained updates, for example map and join. In this way, Spark only needs to record the log of transformation operations used to build a data set in order to ensure fault tolerance, without recording the complete data set. This chain of transformation records is the lineage from which a data set can be traced and recomputed. Generally, parallel applications apply the same computing process to a large data set, so the restriction to coarse-grained updates is not a severe limit. In fact, as described in the Spark paper, RDDs can express many different computing frameworks, for example the programming models of MapReduce and Pregel. In addition, Spark provides operations that let users persist the transformed data to disk, and it implements data localization by allowing users to control data partitioning based on the key of each record (an obvious advantage of this mode is that two data sets to be joined are hashed in the same way). If memory usage exceeds the physical limit, Spark writes the relatively large partitions to disk, thereby ensuring scalability.
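The RDD ideas above map directly onto code. The sketch below (illustrative only, using the Spark 2.x Java API; the input path is a placeholder) builds an RDD lineage out of coarse-grained transformations and marks the result for persistence, spilling large partitions to disk exactly as described.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;
import scala.Tuple2;

public class RddLineage {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("RddLineage"));
        JavaRDD<String> lines = sc.textFile("hdfs:///data/events"); // placeholder input

        // Each step is a coarse-grained transformation; Spark records only
        // this lineage, not the data itself, to recover lost partitions.
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        // Keep the result in memory; partitions that do not fit go to disk.
        counts.persist(StorageLevel.MEMORY_AND_DISK());
        counts.take(10).forEach(System.out::println);
        sc.stop();
    }
}
```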

2.8 Solr: Full-text Search Component

Solr is a standalone, enterprise-class application search server based on Apache Lucene. It provides REST-like HTTP/XML and JSON APIs. The main functions of Solr include powerful full-text search, highlighted display, faceted search, near-real-time indexing, dynamic clustering, database integration, rich document processing (for example, Word and PDF documents), and geographical information search. As an excellent enterprise-class search server, Solr has the following features:

Advanced full-text search
Optimized for high-volume network traffic
Standards-based open interfaces: XML, JSON, and HTTP
Comprehensive HTML management UI
Java Management Extensions (JMX) for monitoring servers
Linear scalability, automatic index replication, and automatic failover and recovery
Near-real-time indexing
XML-based configuration for flexibility and adaptability
Extensible plug-in architecture

Figure 2-7 Logical composition of the Solr cluster solution

Table 2-2 Solr composition

Client: communicates with the SolrServers in the Solr cluster (SolrCloud) over HTTP and performs distributed indexing and distributed search operations.

SolrServer: provides services such as index creation and full-text retrieval. It is the data computing and processing unit of the Solr cluster. Generally, SolrServer is co-located with the DataNode of the HDFS cluster to provide high-performance indexing and search services.

ZooKeeper cluster: provides distributed coordination services for the processes in the Solr cluster. Each SolrServer registers its information (collection configuration and SolrServer health information) with ZooKeeper. Based on this information, Client detects the health status of each SolrServer and determines how to distribute indexing and search requests.

HDFS cluster: provides Solr with a highly reliable file storage service. All index files are stored in the HDFS.
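Because SolrServer exposes its services over HTTP, a search requires nothing more than a REST call. The sketch below is an assumption-level example (host, port, collection name, and query field are invented): it queries Solr's standard /select handler and prints the JSON response.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SolrSelectDemo {
    public static void main(String[] args) throws Exception {
        String query = URLEncoder.encode("title:hadoop", StandardCharsets.UTF_8.name());
        // Placeholder host and collection; /select is Solr's standard search handler.
        URL url = new URL("http://solrserver:8983/solr/mycollection/select?q="
                + query + "&wt=json&rows=10");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // raw JSON search results
            }
        }
    }
}
```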

2.9 Loader (Sqoop): Integration of Data in Batches

Loader is developed based on the open-source Sqoop component. It is used to exchange data and files between FusionInsight and relational databases or distributed file systems. Loader can import data from relational databases or file servers into the HDFS/HBase of FusionInsight, or export data from the HDFS/HBase to relational databases or file servers.

The Loader model consists of a Loader client and a Loader server, as shown in Figure 2-8.

Figure 2-8 Loader model

Table 2-3 describes the function of each part of the model.

Table 2-3 Components of the Loader model

Loader Client: provides a web user interface (WebUI) and a command-line interface (CLI).

Loader Server: processes operation requests sent from the client, manages connectors and metadata, submits MapReduce jobs, and monitors MapReduce job status.

REST API: provides a RESTful (HTTP + JSON) interface for processing the operation requests sent from the client.

Job Scheduler: periodically executes Loader jobs.

Transform Engine: a data transformation engine that supports field combination, string cutting, and string reversal.

Execution Engine: executes Loader jobs in MapReduce mode.

Submission Engine: submits Loader jobs to MapReduce.

Job Manager: manages Loader jobs, including creating, querying, updating, deleting, activating, deactivating, starting, and stopping jobs.

Metadata Repository: stores and manages data about Loader connectors, transformation procedures, and jobs.

HA Manager: manages the active/standby status of the Loader Server. Two Loader Servers are deployed in active/standby mode.

Implementing parallel job execution and fault tolerance using MapReduce

Loader implements parallel import and export jobs using MapReduce. Some import or export jobs may involve only Map operations, while others may involve both Map and Reduce operations. Loader also implements fault tolerance using MapReduce: jobs can be rescheduled when job execution fails.

Importing data to HBase

During the Map operation of a MapReduce job, Loader obtains data from the external data source. During the Reduce operation, Loader starts the same number of Reduce tasks as there are Regions. The Reduce tasks receive data from the Map tasks, generate HFiles by Region, and store the HFiles in a temporary directory of the HDFS.

When the MapReduce job is submitted, Loader migrates the HFiles from the temporary directory to the HBase directory.

Importing data to the HDFS

During the Map operation of a MapReduce job, Loader obtains data from the external data source and writes it to a temporary directory (the export directory name suffixed with "ldtmp"). When the MapReduce job is submitted, Loader migrates the data from the temporary directory to the export directory.

Exporting data to a relational database

During the Map operation of a MapReduce job, Loader obtains data from the HDFS or HBase and inserts it into a temporary table (staging table) over JDBC. When the MapReduce job is submitted, Loader migrates the data from the temporary table to the formal table.

Exporting data to a file system

During the Map operation of a MapReduce job, Loader obtains data from the HDFS or HBase and writes it to a temporary directory on the file server. When the MapReduce job is submitted, Loader migrates the data from the temporary directory to the formal directory.

2.10 Flume: Real-Time Data Collection

Flume is a highly available, reliable, and distributed system for collecting, aggregating, and transmitting massive logs. Flume supports customized data transmitters for collecting data from the log system. Flume also performs light processing on the data and writes it to customizable data receivers. Flume-NG, a branch of Flume, is designed to be simple, compact, and easy to deploy. The following figure shows the overall architecture of this solution.

A Flume-NG pipeline consists of Agents. Each Agent consists of three modules: Source, Channel, and Sink. Source receives data, Channel transmits data, and Sink sends data to the next hop.

Source collects log data, packages the data into transactions and events, and imports them into Channel.

Channel provides a queue that briefly caches the data provided by Source.

Sink obtains the data in Channel and imports it into a storage file system or database, or submits it to a remote server.

The reliability of Flume is based on transaction handovers between Agents. When an Agent breaks down, Channel stores the data persistently and transmits it once the Agent recovers. The availability of Flume is based on the built-in load balancing and failover mechanisms. Both Channel and Agent can be configured with multiple entities, between which the load balancing policy can be applied. Each Agent is a JVM process, and one server can run multiple Agents. Collection nodes (for example, Agents 1, 2, and 3) process logs, and convergence nodes (Agent 4) write to the HDFS. Each Agent on a collection node can choose among multiple convergence nodes, so load balancing can be implemented.

2.11 Storm: Stream Processing

Apache Storm is a distributed, reliable, fault-tolerant real-time stream data processing system. In Storm, a graph-shaped data structure, called a topology, is designed first for real-time computing. The topology is submitted to the cluster, and the master node of the cluster distributes the code and assigns tasks to the worker nodes. A topology contains two roles: spout and bolt. A spout emits messages, sending data streams as tuples; a bolt transforms the data streams and performs computing and filtering operations, and one bolt can randomly send data to other bolts. Tuples emitted by a spout are immutable arrays that map to fixed key-value pairs.

Figure 2-9 System architecture of Storm
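In the Java API, this spout-and-bolt wiring looks as follows. The sketch is illustrative only (the spout, bolt, and topology names are invented, and a LocalCluster stands in for a production cluster); the parallelism arguments correspond to the per-node parallelism degree discussed below.

```java
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class SentenceTopology {
    // Spout: emits tuples whose fields are fixed by declareOutputFields.
    public static class SentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }
        public void nextTuple() {
            Utils.sleep(1000);
            collector.emit(new Values("hello storm"));
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }

    // Bolt: transforms or filters the stream; here it just prints.
    public static class PrintBolt extends BaseBasicBolt {
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            System.out.println(tuple.getStringByField("sentence"));
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new SentenceSpout(), 2);             // parallelism degree 2
        builder.setBolt("print", new PrintBolt(), 4).shuffleGrouping("spout");
        new LocalCluster().submitTopology("demo", new Config(), builder.createTopology());
    }
}
```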

Service processing logic is encapsulated in Storm topologies. A topology is a set of spout (data source) and bolt (logical processing) components connected by stream groupings into a directed acyclic graph (DAG). All components (spouts and bolts) in a topology work in parallel. In a topology, you can specify a parallelism degree for each node; Storm then allocates tasks across the cluster accordingly to improve processing capability.

Figure 2-10 Topology

StreamCQL

StreamCQL is an SQL-like language used for real-time stream processing. Compared with SQL, StreamCQL introduces the concept of the (time-sequenced) window, which allows data to be stored and processed in memory. The output of StreamCQL is the computing results of the data streams at specific points in time.

The use of StreamCQL accelerates service development, allows tasks to be easily submitted to the Storm platform for real-time processing, facilitates the output of results, and allows tasks to be terminated at the appropriate time. StreamCQL has the following highlights:

Easy to use: the StreamCQL syntax is similar to the SQL syntax. Users with basic knowledge of SQL can easily understand StreamCQL and use it to develop services.

Rich functions: in addition to the basic expressions provided by SQL, StreamCQL provides stream-processing functions such as windows, filtering, and concurrency settings, meeting real-time service processing requirements.

Easy to extend: StreamCQL provides an extension interface to support increasingly complex service scenarios. Users can customize the input, output, serialization, and deserialization to meet specific service requirements.

Easy to debug: StreamCQL provides detailed explanations of error codes, helping users rectify faults.

Figure 2-11 Comparison between the native Storm API and StreamCQL

Flink

Flink is a unified computing framework that supports both batch processing and stream processing. It provides a stream data processing engine that supports data distribution and parallel computing. Flink's strength is stream processing: it is one of the top open-source stream processing engines in the industry. It is applicable to low-latency data processing, providing high-concurrency pipelined data processing, millisecond-level latency, and high reliability. The following figure shows the Flink technology stack.

Figure 2-12 Flink technology stack

The entire Flink system consists of three parts:

Client: submits jobs (streaming jobs) to Flink.

TaskManager: the service execution node of Flink, which executes specific tasks. A Flink system can have multiple TaskManagers, all of which are equivalent to one another.

JobManager: the management node of Flink, which manages all TaskManagers and schedules the tasks submitted by users onto them. In high-availability (HA) mode, multiple JobManagers are deployed, one of which is elected as the active JobManager while the others remain standby.

Flink provides the following features:

Low latency: millisecond-level processing capability.

Exactly once: an asynchronous snapshot mechanism ensures that all data is processed exactly once.

HA: active/standby JobManagers prevent a single point of failure (SPOF).

Scale-out: manual scale-out is supported by adding TaskManagers.
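A minimal DataStream job shows how these parts interact: the client builds the dataflow below and submits it to the JobManager, which schedules the operator tasks onto TaskManagers. This is a hedged sketch against the Flink 1.x Java API; the socket source address is a placeholder.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class StreamingWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(5000); // asynchronous snapshots back the exactly-once guarantee

        DataStream<String> text = env.socketTextStream("localhost", 9999); // placeholder source
        text.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                    for (String word : line.split(" ")) {
                        out.collect(new Tuple2<>(word, 1));
                    }
                }
            })
            .keyBy(0) // partition the stream by word
            .sum(1)   // running count per word, with millisecond latencies
            .print();

        env.execute("streaming word count");
    }
}
```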

2.12 Redis: Distributed High-Speed Cache

Remote Dictionary Server (Redis) is a high-performance key-value in-memory database written in C. It supports multiple data types, including string, list, set, zset, and hash. The Redis cluster delivers more advantages than standalone Redis and suits production environments. However, community Redis cluster management is complex, and some functions are incomplete. FusionInsight HD provides graphical Redis management:

Wizard-based creation of Redis clusters: FusionInsight supports creating Redis clusters in master/slave mode. The system automatically calculates the number of Redis instances to be installed on nodes and determines the master/slave relationships.

Cluster capacity expansion and reduction: when required, a user can add one or more master/slave instance pairs with a few clicks. The system automatically completes the data migration and balancing for the expansion.

Balance: data in a Redis cluster may become unevenly distributed if an expansion fails or some instances go offline. FusionInsight Manager provides a Balance function that automatically rebalances cluster data, ensuring stable operation of the cluster.

Performance monitoring and alarms: the system monitors Redis cluster performance and presents intuitive curves, helping users track Redis cluster status and per-instance TPS.

Cluster reliability guarantee: the cluster creation tool provided by the community is incomplete. It allocates master/slave instances to nodes in sequence, so a master and its slave may reside on the same node, and if that node fails, both instances become unavailable. FusionInsight HD automatically places the master of each pair on one node and the slave on another when creating a Redis cluster, and likewise when expanding or reducing capacity. If any node in a cluster becomes faulty, a master/slave switchover is performed, ensuring service continuity.

Cluster performance optimization: OS-level and application-level performance tuning is built into the system, delivering better out-of-the-box performance than the community Redis cluster.
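From an application's perspective, such a cluster behaves like any Redis cluster. The following sketch is illustrative (the seed node address and keys are invented, and the open-source Jedis client is assumed rather than any FusionInsight-specific library); it exercises a few of the supported data types.

```java
import redis.clients.jedis.HostAndPort;
import redis.clients.jedis.JedisCluster;

public class RedisClusterDemo {
    public static void main(String[] args) throws Exception {
        // One seed node is enough; the client discovers the rest of the cluster.
        try (JedisCluster cluster = new JedisCluster(new HostAndPort("redis-node1", 6379))) {
            cluster.set("session:42", "active");           // string
            cluster.lpush("recent:users", "alice", "bob"); // list
            cluster.hset("user:42", "city", "Shenzhen");   // hash
            System.out.println(cluster.get("session:42"));
        }
    }
}
```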

2.13 Kafka: Distributed Message Queue

Kafka is a distributed, partitioned, replicated message publishing and subscription system. It provides features similar to the Java Message Service (JMS), but with a completely different design. Kafka provides features such as message persistence, high throughput, distribution, multi-client support, and real-time processing, and applies to both online and offline message consumption. It is ideal for Internet service data collection scenarios, such as conventional data collection, website activity tracking, aggregated operational statistics (data monitoring), and log collection.

Figure 2-13 Kafka architecture

Broker: a server in a Kafka cluster.

Topic: a category or feed name to which messages are published. A topic can be divided into multiple partitions, which can act as parallel units.

Partition: an ordered, immutable sequence of messages that is continually appended to, like a commit log. Each message in a partition is assigned a sequential ID called the offset, which uniquely identifies the message within the partition.

Producer: publishes messages to a Kafka topic.

Consumer: subscribes to topics and receives the messages published to them.
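The producer side of this model is compact. The sketch below is an assumption-level example using the standard Kafka Java producer (the broker address and topic name are placeholders); records with the same key land in the same partition, where the broker assigns them sequential offsets.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaProduceDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // placeholder broker list
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // The message is appended to one partition of the "page-views" topic.
            producer.send(new ProducerRecord<>("page-views", "user-42", "/index.html"));
        }
    }
}
```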

2.14 Oozie: Job Orchestration and Scheduling

Oozie is an open-source workflow engine that is used to schedule and coordinate Hadoop jobs.

Figure 2-14 Oozie architecture

Console: allows users to view and monitor Oozie workflows.

Client: controls workflows through an interface, including submitting, starting, running, suspending, and resuming workflows.

SDK: used to develop application software for specific software packages, software frameworks, hardware platforms, and operating systems (OSs).

Database: a PostgreSQL database.

Oozie server: runs in an internal or external Tomcat container and stores information such as logs in the database.

Tomcat: a free, open-source web application server.

Hadoop components: underlying components, such as MapReduce and Hive, that execute the workflows orchestrated by Oozie.

Principles

Oozie is a workflow engine server that runs Hadoop MapReduce workflows. It is also a Java web application running in a Tomcat container. Oozie workflows are constructed using Hadoop Process Definition Language (HPDL). HPDL is an XML-defined language, similar to JBoss jBPM Process Definition Language (JPDL). A workflow consists of Control Nodes and Action Nodes: Control Nodes control workflow orchestration, such as start, end, error, decision, fork, and join.

An Oozie workflow contains multiple Action Nodes, such as MapReduce and Java nodes. All Action Nodes are deployed and run in directed acyclic graph (DAG) mode, so Action Nodes run in order: the next Action Node can run only after the running of the previous one ends. When one Action Node ends, the remote server calls back the Oozie interface, and Oozie then executes the next Action Node of the workflow in the same manner until all Action Nodes are executed (execution failures are counted). Oozie workflows provide various types of Action Nodes, such as MapReduce, Hadoop distributed file system (HDFS), Secure Shell (SSH), Java, and Oozie sub-workflow nodes, to support different business requirements.
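Besides the console and client, workflows can be driven through the Oozie Java client. The following sketch is illustrative (the server URL and application path are placeholders, and the HDFS directory is assumed to contain a workflow.xml written in HPDL): it submits a workflow and polls its status.

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitDemo {
    public static void main(String[] args) throws Exception {
        OozieClient oozie = new OozieClient("http://oozie-server:11000/oozie"); // placeholder URL
        Properties conf = oozie.createConfiguration();
        // Directory assumed to contain the HPDL workflow definition (workflow.xml).
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/demo/workflow");
        conf.setProperty("queueName", "default");

        String jobId = oozie.run(conf);            // submit and start the workflow
        Thread.sleep(10_000);
        WorkflowJob job = oozie.getJobInfo(jobId); // poll status over the REST interface
        System.out.println(jobId + " -> " + job.getStatus());
    }
}
```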

2.15 Hue: Hadoop Integrated Development Tool

Hue is a group of web applications used to interact with the FusionInsight platform. Hue helps users browse the HDFS, perform Hive and Impala queries, and start MapReduce tasks or Oozie workflows. Hue runs in a browser; in FusionInsight, Hue is integrated into FusionInsight Manager. Figure 2-15 shows the overall architecture of Hue and describes its working mechanism. Hue Server is a web application container integrated into FusionInsight Manager; it carries the applications that interact with all FusionInsight HD components.

Figure 2-15 Integrated development tool

Hue involves the following components and functions:

File browser: allows users to browse and operate different HDFS directories directly through the UI. It provides the following functions:

Creating files or directories, uploading and downloading files, and renaming, moving, and deleting files or directories
Modifying the owner and permissions of files or directories
Searching for files, directories, file owners, or the user groups that users belong to
Viewing and editing files

Query editor: allows users to write simple SQL and query the data stored on Hadoop, for example in the HDFS, HBase, Hive, and Impala. With the help of the query editor, users can conveniently create, manage, and execute SQL and download the execution results as an Excel file. It provides the following functions:

Editing and executing SQL; storing, copying, and editing SQL templates; and SQL explanation, query, and history recording
Database presentation and data table presentation
Support for different types of Hadoop storage

Workflow control: involves the following functions:

Task browser: provides the task list, finds relevant information (status, start time, and end time) of the child tasks of a given task, and allows viewing task logs.

Task customizer: helps users create and submit tasks with ease, and allows inputting variables and parameter values for specific tasks.

Oozie editor: enables users to define Oozie workflows and coordinators. A workflow is a combination of a group of tasks that controls their execution sequence; it can automatically control various operations on the node tasks belonging to it, such as execution, stop, and cloning. The coordinator application enables users to define and execute periodic and interdependent workflow tasks and to configure the execution criteria of workflows.

User management: similar to a conventional web application, Hue provides a user management function with which user information can be added or deleted.


More information

Index. Raul Estrada and Isaac Ruiz 2016 R. Estrada and I. Ruiz, Big Data SMACK, DOI /

Index. Raul Estrada and Isaac Ruiz 2016 R. Estrada and I. Ruiz, Big Data SMACK, DOI / Index A ACID, 251 Actor model Akka installation, 44 Akka logos, 41 OOP vs. actors, 42 43 thread-based concurrency, 42 Agents server, 140, 251 Aggregation techniques materialized views, 216 probabilistic

More information

Certified Big Data and Hadoop Course Curriculum

Certified Big Data and Hadoop Course Curriculum Certified Big Data and Hadoop Course Curriculum The Certified Big Data and Hadoop course by DataFlair is a perfect blend of in-depth theoretical knowledge and strong practical skills via implementation

More information

vcenter Server Installation and Setup Update 1 Modified on 30 OCT 2018 VMware vsphere 6.7 vcenter Server 6.7

vcenter Server Installation and Setup Update 1 Modified on 30 OCT 2018 VMware vsphere 6.7 vcenter Server 6.7 vcenter Server Installation and Setup Update 1 Modified on 30 OCT 2018 VMware vsphere 6.7 vcenter Server 6.7 You can find the most up-to-date technical documentation on the VMware website at: https://docs.vmware.com/

More information

Oracle NoSQL Database Enterprise Edition, Version 18.1

Oracle NoSQL Database Enterprise Edition, Version 18.1 Oracle NoSQL Database Enterprise Edition, Version 18.1 Oracle NoSQL Database is a scalable, distributed NoSQL database, designed to provide highly reliable, flexible and available data management across

More information

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ. Big Data Programming: an Introduction Spring 2015, X. Zhang Fordham Univ. Outline What the course is about? scope Introduction to big data programming Opportunity and challenge of big data Origin of Hadoop

More information

Chapter 5. The MapReduce Programming Model and Implementation

Chapter 5. The MapReduce Programming Model and Implementation Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing

More information

Esper EQC. Horizontal Scale-Out for Complex Event Processing

Esper EQC. Horizontal Scale-Out for Complex Event Processing Esper EQC Horizontal Scale-Out for Complex Event Processing Esper EQC - Introduction Esper query container (EQC) is the horizontal scale-out architecture for Complex Event Processing with Esper and EsperHA

More information

Introduction to Hadoop. High Availability Scaling Advantages and Challenges. Introduction to Big Data

Introduction to Hadoop. High Availability Scaling Advantages and Challenges. Introduction to Big Data Introduction to Hadoop High Availability Scaling Advantages and Challenges Introduction to Big Data What is Big data Big Data opportunities Big Data Challenges Characteristics of Big data Introduction

More information

Vendor: Hortonworks. Exam Code: HDPCD. Exam Name: Hortonworks Data Platform Certified Developer. Version: Demo

Vendor: Hortonworks. Exam Code: HDPCD. Exam Name: Hortonworks Data Platform Certified Developer. Version: Demo Vendor: Hortonworks Exam Code: HDPCD Exam Name: Hortonworks Data Platform Certified Developer Version: Demo QUESTION 1 Workflows expressed in Oozie can contain: A. Sequences of MapReduce and Pig. These

More information

MI-PDB, MIE-PDB: Advanced Database Systems

MI-PDB, MIE-PDB: Advanced Database Systems MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:

More information

Data Acquisition. The reference Big Data stack

Data Acquisition. The reference Big Data stack Università degli Studi di Roma Tor Vergata Dipartimento di Ingegneria Civile e Ingegneria Informatica Data Acquisition Corso di Sistemi e Architetture per Big Data A.A. 2017/18 Valeria Cardellini The reference

More information

Vulnerability Scan Service. User Guide. Issue 20 Date HUAWEI TECHNOLOGIES CO., LTD.

Vulnerability Scan Service. User Guide. Issue 20 Date HUAWEI TECHNOLOGIES CO., LTD. Issue 20 Date 2018-08-30 HUAWEI TECHNOLOGIES CO., LTD. Copyright Huawei Technologies Co., Ltd. 2018. All rights reserved. No part of this document may be reproduced or transmitted in any form or by any

More information

ExamTorrent. Best exam torrent, excellent test torrent, valid exam dumps are here waiting for you

ExamTorrent.   Best exam torrent, excellent test torrent, valid exam dumps are here waiting for you ExamTorrent http://www.examtorrent.com Best exam torrent, excellent test torrent, valid exam dumps are here waiting for you Exam : Apache-Hadoop-Developer Title : Hadoop 2.0 Certification exam for Pig

More information

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018 Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning

More information

Data Acquisition. The reference Big Data stack

Data Acquisition. The reference Big Data stack Università degli Studi di Roma Tor Vergata Dipartimento di Ingegneria Civile e Ingegneria Informatica Data Acquisition Corso di Sistemi e Architetture per Big Data A.A. 2016/17 Valeria Cardellini The reference

More information

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem Zohar Elkayam www.realdbamagic.com Twitter: @realmgic Who am I? Zohar Elkayam, CTO at Brillix Programmer, DBA, team leader, database trainer,

More information

Big Data with Hadoop Ecosystem

Big Data with Hadoop Ecosystem Diógenes Pires Big Data with Hadoop Ecosystem Hands-on (HBase, MySql and Hive + Power BI) Internet Live http://www.internetlivestats.com/ Introduction Business Intelligence Business Intelligence Process

More information

Introduction to BigData, Hadoop:-

Introduction to BigData, Hadoop:- Introduction to BigData, Hadoop:- Big Data Introduction: Hadoop Introduction What is Hadoop? Why Hadoop? Hadoop History. Different types of Components in Hadoop? HDFS, MapReduce, PIG, Hive, SQOOP, HBASE,

More information

The Fastest Scale-Out NAS

The Fastest Scale-Out NAS The Fastest Scale-Out NAS The features a symmetric distributed architecture that delivers superior performance, extensive scale-out capabilities, and a super-large single file system providing shared storage

More information

Flash Storage Complementing a Data Lake for Real-Time Insight

Flash Storage Complementing a Data Lake for Real-Time Insight Flash Storage Complementing a Data Lake for Real-Time Insight Dr. Sanhita Sarkar Global Director, Analytics Software Development August 7, 2018 Agenda 1 2 3 4 5 Delivering insight along the entire spectrum

More information

Gain Insights From Unstructured Data Using Pivotal HD. Copyright 2013 EMC Corporation. All rights reserved.

Gain Insights From Unstructured Data Using Pivotal HD. Copyright 2013 EMC Corporation. All rights reserved. Gain Insights From Unstructured Data Using Pivotal HD 1 Traditional Enterprise Analytics Process 2 The Fundamental Paradigm Shift Internet age and exploding data growth Enterprises leverage new data sources

More information

FAQs. Business (CIP 2.2) AWS Market Place Troubleshooting and FAQ Guide

FAQs. Business (CIP 2.2) AWS Market Place Troubleshooting and FAQ Guide FAQs 1. What is the browser compatibility for logging into the TCS Connected Intelligence Data Lake for Business Portal? Please check whether you are using Mozilla Firefox 18 or above and Google Chrome

More information

HDP Security Overview

HDP Security Overview 3 HDP Security Overview Date of Publish: 2018-07-15 http://docs.hortonworks.com Contents HDP Security Overview...3 Understanding Data Lake Security... 3 What's New in This Release: Knox... 5 What's New

More information

HDP Security Overview

HDP Security Overview 3 HDP Security Overview Date of Publish: 2018-07-15 http://docs.hortonworks.com Contents HDP Security Overview...3 Understanding Data Lake Security... 3 What's New in This Release: Knox... 5 What's New

More information

Microsoft. Exam Questions Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo

Microsoft. Exam Questions Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo Microsoft Exam Questions 70-775 Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo NEW QUESTION 1 HOTSPOT You install the Microsoft Hive ODBC Driver on a computer that runs Windows

More information

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop

More information

HUAWEI Secospace USG Series User Management and Control White Paper

HUAWEI Secospace USG Series User Management and Control White Paper Doc. code HUAWEI Secospace USG Series User Management and Control White Paper Issue 1.0 Date 2014-03-27 HUAWEI TECHNOLOGIES CO., LTD. Copyright Huawei Technologies Co., Ltd. 2012. All rights reserved.

More information

Introduction to the Hadoop Ecosystem - 1

Introduction to the Hadoop Ecosystem - 1 Hello and welcome to this online, self-paced course titled Administering and Managing the Oracle Big Data Appliance (BDA). This course contains several lessons. This lesson is titled Introduction to the

More information

April Copyright 2013 Cloudera Inc. All rights reserved.

April Copyright 2013 Cloudera Inc. All rights reserved. Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and the Virtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here April 2014 Analytic Workloads on

More information

CDN. Product Description. Issue 03 Date HUAWEI TECHNOLOGIES CO., LTD.

CDN. Product Description. Issue 03 Date HUAWEI TECHNOLOGIES CO., LTD. Issue 03 Date 2018-08-30 HUAWEI TECHNOLOGIES CO., LTD. Copyright Huawei Technologies Co., Ltd. 2018. All rights reserved. No part of this document may be reproduced or transmitted in any form or by any

More information

SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism

SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism Big Data and Hadoop with Azure HDInsight Andrew Brust Senior Director, Technical Product Marketing and Evangelism Datameer Level: Intermediate Meet Andrew Senior Director, Technical Product Marketing and

More information

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab Apache is a fast and general engine for large-scale data processing aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes Low latency: sub-second

More information

BIG DATA COURSE CONTENT

BIG DATA COURSE CONTENT BIG DATA COURSE CONTENT [I] Get Started with Big Data Microsoft Professional Orientation: Big Data Duration: 12 hrs Course Content: Introduction Course Introduction Data Fundamentals Introduction to Data

More information

WHITEPAPER. MemSQL Enterprise Feature List

WHITEPAPER. MemSQL Enterprise Feature List WHITEPAPER MemSQL Enterprise Feature List 2017 MemSQL Enterprise Feature List DEPLOYMENT Provision and deploy MemSQL anywhere according to your desired cluster configuration. On-Premises: Maximize infrastructure

More information

Security and Performance advances with Oracle Big Data SQL

Security and Performance advances with Oracle Big Data SQL Security and Performance advances with Oracle Big Data SQL Jean-Pierre Dijcks Oracle Redwood Shores, CA, USA Key Words SQL, Oracle, Database, Analytics, Object Store, Files, Big Data, Big Data SQL, Hadoop,

More information

Cloudera Introduction

Cloudera Introduction Cloudera Introduction Important Notice 2010-2018 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, and any other product or service names or slogans contained in this document are trademarks

More information

Introduction to Big-Data

Introduction to Big-Data Introduction to Big-Data Ms.N.D.Sonwane 1, Mr.S.P.Taley 2 1 Assistant Professor, Computer Science & Engineering, DBACER, Maharashtra, India 2 Assistant Professor, Information Technology, DBACER, Maharashtra,

More information

Huawei FusionCloud Desktop Solution 5.3. Branch Technical White Paper. Issue 01. Date HUAWEI TECHNOLOGIES CO., LTD.

Huawei FusionCloud Desktop Solution 5.3. Branch Technical White Paper. Issue 01. Date HUAWEI TECHNOLOGIES CO., LTD. Issue 01 Date 2015-06-30 HUAWEI TECHNOLOGIES CO., LTD. 2015. All rights reserved. No part of this document may be reproduced or transmitted in any form or by any means without prior written consent of

More information

Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce Introduction to Hadoop and MapReduce Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large

More information

Bring Context To Your Machine Data With Hadoop, RDBMS & Splunk

Bring Context To Your Machine Data With Hadoop, RDBMS & Splunk Bring Context To Your Machine Data With Hadoop, RDBMS & Splunk Raanan Dagan and Rohit Pujari September 25, 2017 Washington, DC Forward-Looking Statements During the course of this presentation, we may

More information

Hive and Shark. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic)

Hive and Shark. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic) Hive and Shark Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 1 / 45 Motivation MapReduce is hard to

More information

Exam Questions CCA-505

Exam Questions CCA-505 Exam Questions CCA-505 Cloudera Certified Administrator for Apache Hadoop (CCAH) CDH5 Upgrade Exam https://www.2passeasy.com/dumps/cca-505/ 1.You want to understand more about how users browse you public

More information

The Technology of the Business Data Lake. Appendix

The Technology of the Business Data Lake. Appendix The Technology of the Business Data Lake Appendix Pivotal data products Term Greenplum Database GemFire Pivotal HD Spring XD Pivotal Data Dispatch Pivotal Analytics Description A massively parallel platform

More information

FROM LEGACY, TO BATCH, TO NEAR REAL-TIME. Marc Sturlese, Dani Solà

FROM LEGACY, TO BATCH, TO NEAR REAL-TIME. Marc Sturlese, Dani Solà FROM LEGACY, TO BATCH, TO NEAR REAL-TIME Marc Sturlese, Dani Solà WHO ARE WE? Marc Sturlese - @sturlese Backend engineer, focused on R&D Interests: search, scalability Dani Solà - @dani_sola Backend engineer

More information

Big Data Infrastructures & Technologies

Big Data Infrastructures & Technologies Big Data Infrastructures & Technologies Spark and MLLIB OVERVIEW OF SPARK What is Spark? Fast and expressive cluster computing system interoperable with Apache Hadoop Improves efficiency through: In-memory

More information

iway iway Big Data Integrator New Features Bulletin and Release Notes Version DN

iway iway Big Data Integrator New Features Bulletin and Release Notes Version DN iway iway Big Data Integrator New Features Bulletin and Release Notes Version 1.5.0 DN3502232.1216 Active Technologies, EDA, EDA/SQL, FIDEL, FOCUS, Information Builders, the Information Builders logo,

More information