EsgynDB Enterprise 2.0 Platform Reference Architecture
Doreen Dawson
This document outlines a Platform Reference Architecture for EsgynDB Enterprise, built on the Apache Trafodion (Incubating) implementation with licensed support and extensions. It outlines server configurations for various vendors and describes considerations for sizing an EsgynDB application.

Contents

1 Introduction
2 Architecture
3 Capacity Planning
  3.1 Processing Usage
  3.2 Memory Usage
  3.3 Disk Usage
  3.4 Network Usage
4 Reference Architecture Guidance for Production Bare Metal Cluster
  4.1 Medium/Large Deployment
  4.2 Small Deployment
5 Cloud Deployment
6 Conclusion

1 Introduction

The Apache Trafodion (Incubating) project provides a full transactional SQL database integrated into the Apache Hadoop ecosystem to support operational workloads. EsgynDB Enterprise 2.0, built on Apache Trafodion, offers a fully supported, enterprise-ready version with extensions for additional features, including cross datacenter support in EsgynDB Enterprise Advanced 2.0.

This reference architecture describes a purpose-built EsgynDB Enterprise installation. Specifically, it describes the architecture and provisioning for a cluster whose purpose is running one or more EsgynDB application workloads. It does not describe a configuration where EsgynDB Enterprise is part of a wider Hadoop cluster running other ecosystem applications such as MapReduce. Clusters running mixed workloads can start from the sizing/provisioning information here, but the final sizing/provisioning must also incorporate requirements from the other workloads in the cluster; that is beyond the scope of this document.

2 Architecture

Apache Trafodion provides an enterprise-class, web-scale database engine in the Hadoop ecosystem. In addition, Trafodion enables the SQL query language and transactional semantics for native Apache HBase and Apache Hive tables. Trafodion provides transactional support for data stored in HBase.
It supports fully distributed ACID transactions across multiple statements, tables, and rows, which enables EsgynDB Enterprise to support operational workloads that are generally beyond most Hadoop ecosystem components. EsgynDB Enterprise Release 2.0 extends Apache Trafodion by providing additional features such as cross datacenter support, using the architecture depicted below.

The architecture involves one or more clients concurrently using SQL queries to access the data managed by EsgynDB via a driver (ODBC/JDBC/ADO.NET). The driver library provides the connection and session between the application query (the application might or might not execute on the same cluster) and the SQL engine layer. In the SQL engine layer, a master query execution server process for each query prepares and executes the query. Depending on the specifics of the workload, it might involve a distributed transaction manager, or one or more groups of executor server processes (ESPs) that execute portions of the query plan in parallel. These groups of ESPs (for a given query, there might be zero or more groups) reflect the degree of parallelism for the query. The query can reference native HBase or Hive tables as well. Ultimately, EsgynDB uses HDFS as the storage layer foundation, with an appropriate replication factor (usually 3, but in some cloud configurations 2 is appropriate) to provide availability if a node fails.
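As an illustration of the driver layer described above, the sketch below assembles a JDBC-style connection URL for the Trafodion T4 driver. The URL format and default port are recollections of the Trafodion client conventions, not values from this document, and the host name is a hypothetical placeholder.

```python
# Sketch: build a JDBC-style connection URL for the Trafodion/EsgynDB
# T4 driver.  The jdbc:t4jdbc URL format and the default DCS master
# port (23400) are assumptions for illustration, not from this document.

def t4_connection_url(host: str, port: int = 23400) -> str:
    """Return a jdbc:t4jdbc URL pointing at the DCS master."""
    return f"jdbc:t4jdbc://{host}:{port}/:"

if __name__ == "__main__":
    # Hypothetical DCS master host, e.g. one behind a floating IP.
    print(t4_connection_url("dcs-master.example.com"))
```

An application would hand this URL to its JDBC driver manager; the DCS master then locates a session-hosting mxosrvr, as described in the process table below.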
Significant processes used for query processing include:

Process: DCS Master
  Description: Initial connection point for locating a session-hosting mxosrvr.
  Distribution: On one single node.
  Count: One active per cluster, often configured with a floating IP for high availability.

Process: DCS Server
  Description: Process that manages status and connection usage for mxosrvr processes.
  Distribution: On each node.
  Count: One for each node where mxosrvrs run.

Process: Master executor (mxosrvr)
  Description: Master executor process that hosts the SQL session and does query compilation and execution of the root operator.
  Distribution: Multiple on all data nodes in the instance.
  Count: Count defines the maximum number of concurrent sessions.

Process: Executor Server Process (ESP)
  Description: Executes parallel fragments of SQL plans.
  Distribution: Multiple run on all data nodes in the cluster, in variable-size groups.
  Count: Workload dependent; determined by concurrent parallel queries, query plan, and degree of parallelism.

Process: DTM
  Description: Maintains transactional state and log outcome information for transactions.
  Distribution: Runs on all data nodes in the instance.
  Count: One per data node.

For the EsgynDB Enterprise 2.0 version, cross datacenter support is implemented via DTM, which communicates with Transaction Manager processes on the peer datacenter clusters to replicate the transactions on both clusters.

The EsgynDB Manager architecture was simplified in the architecture picture above to show its relationship to the query processing engine. The EsgynDB Manager subsystem expands to multiple processes as depicted in the following picture.

EsgynDB Manager processes include:

Process: DB Manager
  Description: Web application server that the browser connects to.
  Distribution: On one single node.
  Count: One per cluster, on the first data node.

Process: OpenTSDB
  Description: Lightweight service processes for collecting time-series metrics.
  Distribution: On each node.
  Count: One per node.

Process: TCollectors
  Description: Collection scripts that collect time-based metrics at intervals.
  Distribution: Multiple on all data nodes in the instance; processes per node vary.
  Count: System and HBase metrics are collected on each node; EsgynDB metrics are collected cluster-wide from a process on the first data node.
Process: REST Server
  Description: Process that handles REST requests from on- and off-cluster clients.
  Distribution: One per cluster.
  Count: One per cluster, on the first data node.

In addition to the listed processes used for query processing and manageability, there are other processes that are part of the EsgynDB stack supporting its runtime execution environment. These processes generally use fewer resources and have little material impact on platform sizing and provisioning.

EsgynDB Enterprise is integrated into the Hadoop ecosystem as depicted in the following picture.

The EsgynDB database engine uses HBase for storage services. As such, it relies on HBase configuration and tuning to achieve optimal performance, and EsgynDB cluster provisioning must incorporate HBase configuration considerations. HBase processes can be divided into two classes: control processes and data processes. Control processes are one-offs that are involved in managing the HBase system and its metadata. Data processes are involved in serving the data itself, including reading, updating, and writing (HBase scan, get, and put operations).

HBase control processes include:

Process: HMaster
  Description: Handles metadata and table creation/deletion.

Process: ZooKeeper
  Description: Not an HBase process, but used for information management and coordination across nodes.

HBase data processes include:

Process: RegionServer
  Description: Controls data serving, including servicing get/put, and separation of data into individual regions.

HBase in turn uses HDFS services for scalability, availability, and recovery (replication) within the cluster. As such, EsgynDB cluster provisioning must also incorporate HDFS configuration considerations, including replication. Control processes are singleton processes that manage the HDFS file system; in HDFS, they control the location for individual data blocks. Data processes are involved in reading and accessing that data.
HDFS control processes include:

Process: NameNode
  Description: Manages the metadata files that are used to map blocks to individual files, and selects locations for replication.

Process: Secondary NameNode
  Description: Gets a checkpoint of all metadata from the NameNode once per interval (one hour is the default). This data can be used to recreate the block-to-file mappings if the NameNode is lost. However, it is not simply a hot backup for the NameNode.

HDFS data processes include:

Process: DataNode
  Description: Serves reads and writes for individual files, and sends periodic I'm-alive messages, including the files/blocks it is managing, to the NameNode.

In addition to the HBase and HDFS control processes listed above, other control node processes include:

Process: Management Server
  Description: Ambari, Cloudera Manager, etc. web page node. Some management servers perform detailed database and analytic functions.

In smaller clusters, control processes and data processes might reside on the same node. For larger clusters, management processes have significantly different provisioning requirements and so are often isolated on different nodes. The reference architecture assumes separate control and data nodes.

3 Capacity Planning

This section discusses issues and sizing recommendations to take into consideration when sizing an EsgynDB Enterprise database.

3.1 Processing Usage

When sizing the processing power for an EsgynDB Enterprise cluster, consider the following: In a typical high-performance configuration, nodes for management are configured separately from data nodes. The two types are typically provisioned differently for storage (size, configuration) as well as network and memory. In a very small or test configuration, the distinction between data nodes and control nodes is blurred, and most management processes are collocated with data processes for both Hadoop/HBase and EsgynDB.
So long as this configuration meets performance and availability objectives, it is valid, especially for basic development and test clusters.

Consider the following factors when assessing the required number of nodes:

- More nodes with fewer cores is preferable to an equivalent number of cores spread over fewer nodes, so long as the number of cores per node is reasonably modern (e.g., 8 or more) for typical production workloads. Scaling out (increasing the number of nodes to achieve the desired number of cores) is preferable to scaling up (increasing the cores per node to achieve the desired number of cores) because:
  - More nodes with fewer cores is typically cheaper than fewer nodes with more cores.
  - The domain of failure is smaller when losing a node or disk on a cluster with more nodes.
  - The available I/O bandwidth and parallelism is higher with more nodes.
- Clusters smaller than 3 nodes are not advised, given HDFS replication requirements for availability and recoverability.
- The number of simultaneous users (concurrency) drives the number of external corporate-network-connected nodes, as does the ingest rate for data arrival/refresh. This number determines the total number of mxosrvr processes. The actual connections are distributed around the cluster based on mxosrvr process distribution. Multiple mxosrvr processes can run on the same node.
- The types of workloads are the other key consideration for the number of nodes. The number of nodes and cores reflects the amount of parallelism available for concurrent users of the applications running on the cluster. If typical workloads are high-concurrency short queries, then thinner nodes might be acceptable. If typical workloads involve large scans, then more processing power is needed. Understand the types, frequency, plans, and typical concurrency for the application, ideally by prototyping the workloads and queries whenever possible.

3.2 Memory Usage

When sizing an EsgynDB Enterprise cluster for memory usage, keep in mind the following considerations:

- Many Hadoop ecosystem processes are Java processes. Due to memory-efficiency optimizations in the JVM, there is a significant restriction just below 32GB: crossing this threshold actually results in less usable memory, because the internal representation of pointers changes in a way that consumes significantly more space.
- Large memory consumers on data nodes include:
  - HDFS DataNode processes
  - HBase RegionServers
- Among control processes, the large memory consumers are HDFS NameNode processes.

Plan for these processes to use a heap size of 16-32GB each for optimal performance on a large cluster.
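The heap guidance above can be captured in a small helper. This is a minimal sketch: the 16GB floor comes from the text, while the 31GB ceiling (staying safely under the JVM's ~32GB pointer-representation threshold) is a conservative assumption of ours.

```python
# Sketch: clamp a requested Java heap size into the band the text
# recommends for DataNode/RegionServer/NameNode processes.  The 31GB
# ceiling is an assumed safety margin below the ~32GB threshold at
# which the JVM's pointer representation becomes less space-efficient.

def recommended_heap_gb(requested_gb: float) -> float:
    """Return a heap size (GB) within the recommended 16-31GB band."""
    floor_gb, ceiling_gb = 16.0, 31.0
    return max(floor_gb, min(requested_gb, ceiling_gb))

if __name__ == "__main__":
    for req in (8, 24, 48):
        print(f"requested {req}GB -> use {recommended_heap_gb(req)}GB")
```

Note that a request of 48GB is clamped down, not up: past the threshold, the larger heap would hold fewer objects than one just below it.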
Reducing the memory for these components affects performance significantly, so do careful tuning and analysis before choosing a smaller value.

The primary users of memory in the EsgynDB database engine are the mxosrvrs. For each concurrent connection on a node, plan for 512MB (0.5GB) per connection per node.

3.3 Disk Usage

When sizing an EsgynDB Enterprise cluster for disk usage, keep in mind the following considerations:

- For data nodes, SSD is only beneficial for high-concurrency writes; in general, HDD is sufficient.
- For control nodes, SSD is similarly not cost effective; the goal is to have most control information cached in memory.
- For data nodes, configure HDD data disks as direct-attached storage in a JBOD (Just a Bunch of Disks) configuration. RAID striping slows down HDFS and actually reduces concurrency and recoverability.
- For control nodes, data disks can be configured as either JBOD, RAID1, or RAID10.
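The Disk Usage rules of thumb can be combined into a rough capacity estimate. A minimal sketch, assuming HDFS replication factor 3, the 30-40% compression saving (35% taken here as a midpoint), and the recommendation to leave about 33% of disk free for workspace:

```python
# Sketch: estimate raw disk to provision for a given amount of user
# data, using the Disk Usage rules of thumb: replication factor 3,
# ~35% compression saving (assumed midpoint of 30-40%), and ~33% of
# provisioned disk kept free for overhead workspace.

def raw_disk_needed_gb(user_data_gb: float,
                       replication: int = 3,
                       compression_saving: float = 0.35,
                       free_fraction: float = 0.33) -> float:
    """Return raw disk capacity (GB) to provision across the cluster."""
    compressed = user_data_gb * (1.0 - compression_saving)
    replicated = compressed * replication
    # Size up so that `free_fraction` of the provisioned disk stays free.
    return replicated / (1.0 - free_fraction)

if __name__ == "__main__":
    # With compression and headroom disabled, this reproduces the
    # replication example from the text: a 10GB file occupies 30GB.
    print(raw_disk_needed_gb(10, compression_saving=0.0, free_fraction=0.0))
    # 1TB of user data with all rules of thumb applied:
    print(round(raw_disk_needed_gb(1000), 1))
```

Actual compression ratios vary widely with data and workload patterns, so treat the output as a starting point, not a provisioning commitment.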
As with processing power, disks are a unit of parallelism. For a given total-disk-per-node value, if workloads include many large scans, it is often more effective to have more, smaller disks than fewer, larger disks per node on data nodes. The reference architecture assumes that most workloads include large scans.

HBase SNAPPY or GZ compression is strongly suggested. SNAPPY has less CPU overhead, but GZ compresses better. The degree of compression varies widely depending on the data and workload patterns, but generally accepted calculations suggest around a 30-40% reduction, depending on the data. Compression adds to the path length for reading and writing, which can have an effect on data growth and ingest. Compression happens at the HBase file block level, limiting the amount of decompression required at read time.

When calculating overall disk space and data disk space per node, be sure to account for working space and anticipated ingest/outflow per node. Also remember that the blocks of an HDFS file come with a replication factor (typically set to 3, so 3 copies of the data); that means each 10GB file actually occupies 30GB on disk. Esgyn recommends leaving approximately 33% of disk space free for overhead workspace.

3.4 Network Usage

When sizing an EsgynDB Enterprise cluster for network usage, keep in mind the following considerations:

- In general, 10GigE is the standard for data-traffic networking within an EsgynDB cluster. Using a slower network for data flow can significantly impact performance. Two bonded 10GigE networks provide more throughput for I/O-intensive applications.
- In some cases, a second, slower network is configured for cluster (not Hadoop/HBase) maintenance in order to keep that traffic separate from the operational data workflow.
- Consider failure scenarios when connecting nodes from different racks together.
The HDFS block placement algorithm is biased to select nodes on at least 2 different racks for a block's locations if the replication factor is 3 or greater.

If using the cross datacenter feature in EsgynDB Enterprise Advanced 2.0, there must be a high-speed connection between the two datacenters, and both clusters must be configured so that the application can actively connect to either peer cluster via EsgynDB drivers when both are running and accessible. This capability ensures that the application can access either cluster exclusively in case of loss of communications with one of the two.

4 Reference Architecture Guidance for Production Bare Metal Cluster

This section contains recommendations for hardware configurations and software provisioning for a bare metal EsgynDB cluster. The recommendations are hardware-independent; check with your hardware vendor for specific part numbers and availability/timeliness. The configuration described is for a medium or large EsgynDB installation, with separate control and data nodes. Smaller configurations with all processes on the same nodes are covered in a separate section.

For data nodes, the basic hardware recommendation for each node is:
Resource: CPU
  Recommendation: Intel XEON or AMD 64-bit processors; 8 to 16 cores per node.

Resource: Memory
  Recommendation: 64GB for the overall Hadoop ecosystem and query processing, plus usual overhead, plus 0.5GB for each mxosrvr on the node. To calculate the number of mxosrvr processes: max number of concurrent connections / number of nodes. Memory size of 64GB to 128GB; the most common value is 96GB.

Resource: Network
  Recommendation: 10GigE, 1GigE, or 2x10GigE bonded.

Resource: Storage
  Recommendation: SATA, SAS, or SSD, typically TB disks configured in a JBOD configuration.

For control nodes, the basic hardware recommendation for each node is:

Resource: CPU
  Recommendation: Intel XEON or AMD 64-bit processors; 8 to 16 cores per node.

Resource: Memory
  Recommendation: 64GB for the overall Hadoop ecosystem and queries, plus overhead for swapping and process maintenance as possible/desired. Memory size of 64GB to 128GB; the most common value is 96GB.

Resource: Network
  Recommendation: 10GigE, 1GigE, or 2x10GigE bonded, plus appropriate switches for off-platform to on-platform traffic.

Resource: Storage
  Recommendation: SATA, SAS, or SSD, typically TB disks configured in a RAID1 or RAID10 configuration.

4.1 Medium/Large Deployment

A medium or large deployment uses the specifications above, including both control and data nodes. Processes are placed on these nodes as depicted in the following figure.
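The two sizing formulas in the data-node recommendation (mxosrvrs per node, and per-node memory) can be combined into a small calculator. A sketch with hypothetical input values; the 64GB base and 0.5GB-per-mxosrvr figures come from the recommendation itself:

```python
# Sketch: data-node sizing from the formulas above.
#   mxosrvrs per node = max concurrent connections / number of nodes
#   memory per node   = 64GB base + 0.5GB per mxosrvr on the node
import math

def mxosrvrs_per_node(max_concurrent_connections: int, nodes: int) -> int:
    # Round up so every connection has a hosting mxosrvr.
    return math.ceil(max_concurrent_connections / nodes)

def memory_per_node_gb(max_concurrent_connections: int, nodes: int,
                       base_gb: float = 64.0,
                       per_mxosrvr_gb: float = 0.5) -> float:
    n = mxosrvrs_per_node(max_concurrent_connections, nodes)
    return base_gb + per_mxosrvr_gb * n

if __name__ == "__main__":
    # Hypothetical workload: 200 concurrent connections on 10 nodes.
    print(mxosrvrs_per_node(200, 10))   # 20 mxosrvrs per node
    print(memory_per_node_gb(200, 10))  # 64 + 0.5*20 = 74.0 GB
```

A result of 74GB per node would land in the recommended 64-128GB band; workloads needing many more connections per node push toward the common 96GB configuration.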
In the figure above, the control nodes flank the data nodes and are only used for the DCS Master process. There is no specific constraint on node naming conventions, including no assumption that nodes are consecutively numbered. The vertical bars represent individual nodes, and the ovals represent processes within the node.

4.2 Small Deployment

For a small (2-3 node, typically less than one rack) deployment, the control nodes are collapsed into the regular node infrastructure as follows.

In this configuration, the control nodes have been removed and control processes run on the same nodes as the functional processes.
5 Cloud Deployment

When deploying EsgynDB in a cloud environment such as Amazon's AWS, use the guidelines above to provision resources. For configuration, use HDFS replication factor 3 if you choose the instance-local store for the file system; use HDFS replication factor 2 if you use EBS volumes.

6 Conclusion

This EsgynDB Platform Reference Architecture document serves as a starting point for defining the platform on which to build a cluster where EsgynDB is the primary purpose. It is also intended to assist application developers and users in planning the deployment strategy for EsgynDB applications. Esgyn recommends consulting with an Esgyn technical resource for additional information, training, and guidance.

06Dec Esgyn Corporation
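The replication-factor guidance for cloud deployments maps to a single HDFS setting. A minimal hdfs-site.xml fragment; the property name `dfs.replication` is the standard HDFS configuration key, and the value 2 assumes EBS-backed volumes as described above:

```xml
<!-- hdfs-site.xml fragment: replication factor for an EBS-backed
     cloud deployment.  Use 3 instead when data sits on instance-local
     store, per the Cloud Deployment guidance. -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```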
1. MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model. MapReduce is a framework for processing big data which processes data in two phases, a Map
More informationAn Oracle Technical White Paper October Sizing Guide for Single Click Configurations of Oracle s MySQL on Sun Fire x86 Servers
An Oracle Technical White Paper October 2011 Sizing Guide for Single Click Configurations of Oracle s MySQL on Sun Fire x86 Servers Introduction... 1 Foundation for an Enterprise Infrastructure... 2 Sun
More informationMixApart: Decoupled Analytics for Shared Storage Systems. Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp
MixApart: Decoupled Analytics for Shared Storage Systems Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp Hadoop Pig, Hive Hadoop + Enterprise storage?! Shared storage
More informationVeeam Backup & Replication on IBM Cloud Solution Architecture
Veeam Backup & Replication on IBM Cloud Solution Architecture Date: 2018 07 20 Copyright IBM Corporation 2018 Page 1 of 12 Table of Contents 1 Introduction... 4 1.1 About Veeam Backup & Replication...
More informationWHITE PAPER: BEST PRACTICES. Sizing and Scalability Recommendations for Symantec Endpoint Protection. Symantec Enterprise Security Solutions Group
WHITE PAPER: BEST PRACTICES Sizing and Scalability Recommendations for Symantec Rev 2.2 Symantec Enterprise Security Solutions Group White Paper: Symantec Best Practices Contents Introduction... 4 The
More informationPLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad
More informationWHITEPAPER. Improve Hadoop Performance with Memblaze PBlaze SSD
Improve Hadoop Performance with Memblaze PBlaze SSD Improve Hadoop Performance with Memblaze PBlaze SSD Exclusive Summary We live in the data age. It s not easy to measure the total volume of data stored
More informationSGI Overview. HPC User Forum Dearborn, Michigan September 17 th, 2012
SGI Overview HPC User Forum Dearborn, Michigan September 17 th, 2012 SGI Market Strategy HPC Commercial Scientific Modeling & Simulation Big Data Hadoop In-memory Analytics Archive Cloud Public Private
More informationIBM Terms of Use SaaS Specific Offering Terms. IBM DB2 on Cloud. 1. IBM SaaS. 2. Charge Metrics
IBM Terms of Use SaaS Specific Offering Terms IBM DB2 on Cloud The Terms of Use ( ToU ) is composed of this IBM Terms of Use - SaaS Specific Offering Terms ( SaaS Specific Offering Terms ) and a document
More informationIntroduction to BigData, Hadoop:-
Introduction to BigData, Hadoop:- Big Data Introduction: Hadoop Introduction What is Hadoop? Why Hadoop? Hadoop History. Different types of Components in Hadoop? HDFS, MapReduce, PIG, Hive, SQOOP, HBASE,
More informationService Description. IBM DB2 on Cloud. 1. Cloud Service. 1.1 IBM DB2 on Cloud Standard Small. 1.2 IBM DB2 on Cloud Standard Medium
Service Description IBM DB2 on Cloud This Service Description describes the Cloud Service IBM provides to Client. Client means the company and its authorized users and recipients of the Cloud Service.
More informationSQL Server 2014 Upgrade
SQL Server 2014 Upgrade Case study featuring In-Memory OLTP and Hybrid-Cloud Scenarios Evgeny Ternovsky, Program Manager II, Data Platform Group Bill Kan, Service Engineer II, Data Platform Group Background
More informationAssessing performance in HP LeftHand SANs
Assessing performance in HP LeftHand SANs HP LeftHand Starter, Virtualization, and Multi-Site SANs deliver reliable, scalable, and predictable performance White paper Introduction... 2 The advantages of
More informationSystem Requirements EDT 6.0. discoveredt.com
System Requirements EDT 6.0 discoveredt.com Contents Introduction... 3 1 Components, Modules & Data Repositories... 3 2 Infrastructure Options... 5 2.1 Scenario 1 - EDT Portable or Server... 5 2.2 Scenario
More informationCrossing the Chasm: Sneaking a parallel file system into Hadoop
Crossing the Chasm: Sneaking a parallel file system into Hadoop Wittawat Tantisiriroj Swapnil Patil, Garth Gibson PARALLEL DATA LABORATORY Carnegie Mellon University In this work Compare and contrast large
More informationCloudera Enterprise 5 Reference Architecture
Cloudera Enterprise 5 Reference Architecture A PSSC Labs Reference Architecture Guide December 2016 Introduction PSSC Labs continues to bring innovative compute server and cluster platforms to market.
More informationConfiguring Short RPO with Actifio StreamSnap and Dedup-Async Replication
CDS and Sky Tech Brief Configuring Short RPO with Actifio StreamSnap and Dedup-Async Replication Actifio recommends using Dedup-Async Replication (DAR) for RPO of 4 hours or more and using StreamSnap for
More informationCorrelation based File Prefetching Approach for Hadoop
IEEE 2nd International Conference on Cloud Computing Technology and Science Correlation based File Prefetching Approach for Hadoop Bo Dong 1, Xiao Zhong 2, Qinghua Zheng 1, Lirong Jian 2, Jian Liu 1, Jie
More informationEMC XTREMCACHE ACCELERATES VIRTUALIZED ORACLE
White Paper EMC XTREMCACHE ACCELERATES VIRTUALIZED ORACLE EMC XtremSF, EMC XtremCache, EMC Symmetrix VMAX and Symmetrix VMAX 10K, XtremSF and XtremCache dramatically improve Oracle performance Symmetrix
More informationDell EMC CIFS-ECS Tool
Dell EMC CIFS-ECS Tool Architecture Overview, Performance and Best Practices March 2018 A Dell EMC Technical Whitepaper Revisions Date May 2016 September 2016 Description Initial release Renaming of tool
More informationHPE Verified Reference Architecture for Vertica SQL on Hadoop Using Hortonworks HDP 2.3 on HPE Apollo 4200 Gen9 with RHEL
HPE Verified Reference Architecture for Vertica SQL on Hadoop Using Hortonworks HDP 2.3 on HPE Apollo 4200 Gen9 with RHEL HPE Reference Architectures April, 2016 Legal Notices The only warranties for Hewlett
More informationMicrosoft Exchange Server 2010 workload optimization on the new IBM PureFlex System
Microsoft Exchange Server 2010 workload optimization on the new IBM PureFlex System Best practices Roland Mueller IBM Systems and Technology Group ISV Enablement April 2012 Copyright IBM Corporation, 2012
More informationEMC Backup and Recovery for Microsoft SQL Server
EMC Backup and Recovery for Microsoft SQL Server Enabled by Microsoft SQL Native Backup Reference Copyright 2010 EMC Corporation. All rights reserved. Published February, 2010 EMC believes the information
More information4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015)
4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015) Benchmark Testing for Transwarp Inceptor A big data analysis system based on in-memory computing Mingang Chen1,2,a,
More informationA BigData Tour HDFS, Ceph and MapReduce
A BigData Tour HDFS, Ceph and MapReduce These slides are possible thanks to these sources Jonathan Drusi - SCInet Toronto Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing SICS; Yahoo!
More informationEvaluation Report: Improving SQL Server Database Performance with Dot Hill AssuredSAN 4824 Flash Upgrades
Evaluation Report: Improving SQL Server Database Performance with Dot Hill AssuredSAN 4824 Flash Upgrades Evaluation report prepared under contract with Dot Hill August 2015 Executive Summary Solid state
More informationVERITAS Storage Foundation 4.0 TM for Databases
VERITAS Storage Foundation 4.0 TM for Databases Powerful Manageability, High Availability and Superior Performance for Oracle, DB2 and Sybase Databases Enterprises today are experiencing tremendous growth
More informationExchange 2010 Tested Solutions: 500 Mailboxes in a Single Site Running Hyper-V on Dell Servers
Exchange 2010 Tested Solutions: 500 Mailboxes in a Single Site Running Hyper-V on Dell Servers Rob Simpson, Program Manager, Microsoft Exchange Server; Akshai Parthasarathy, Systems Engineer, Dell; Casey
More informationEMC Business Continuity for Microsoft Applications
EMC Business Continuity for Microsoft Applications Enabled by EMC Celerra, EMC MirrorView/A, EMC Celerra Replicator, VMware Site Recovery Manager, and VMware vsphere 4 Copyright 2009 EMC Corporation. All
More informationThe Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou
The Hadoop Ecosystem EECS 4415 Big Data Systems Tilemachos Pechlivanoglou tipech@eecs.yorku.ca A lot of tools designed to work with Hadoop 2 HDFS, MapReduce Hadoop Distributed File System Core Hadoop component
More informationWhite Paper. Major Performance Tuning Considerations for Weblogic Server
White Paper Major Performance Tuning Considerations for Weblogic Server Table of Contents Introduction and Background Information... 2 Understanding the Performance Objectives... 3 Measuring your Performance
More informationWarehouse- Scale Computing and the BDAS Stack
Warehouse- Scale Computing and the BDAS Stack Ion Stoica UC Berkeley UC BERKELEY Overview Workloads Hardware trends and implications in modern datacenters BDAS stack What is Big Data used For? Reports,
More informationAdvanced Architectures for Oracle Database on Amazon EC2
Advanced Architectures for Oracle Database on Amazon EC2 Abdul Sathar Sait Jinyoung Jung Amazon Web Services November 2014 Last update: April 2016 Contents Abstract 2 Introduction 3 Oracle Database Editions
More informationHewlett Packard Enterprise HPE GEN10 PERSISTENT MEMORY PERFORMANCE THROUGH PERSISTENCE
Hewlett Packard Enterprise HPE GEN10 PERSISTENT MEMORY PERFORMANCE THROUGH PERSISTENCE Digital transformation is taking place in businesses of all sizes Big Data and Analytics Mobility Internet of Things
More informationDesign a Remote-Office or Branch-Office Data Center with Cisco UCS Mini
White Paper Design a Remote-Office or Branch-Office Data Center with Cisco UCS Mini February 2015 2015 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public. Page 1 of 9 Contents
More informationChapter 5. The MapReduce Programming Model and Implementation
Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing
More informationData Protection for Cisco HyperFlex with Veeam Availability Suite. Solution Overview Cisco Public
Data Protection for Cisco HyperFlex with Veeam Availability Suite 1 2017 2017 Cisco Cisco and/or and/or its affiliates. its affiliates. All rights All rights reserved. reserved. Highlights Is Cisco compatible
More informationCisco and Cloudera Deliver WorldClass Solutions for Powering the Enterprise Data Hub alerts, etc. Organizations need the right technology and infrastr
Solution Overview Cisco UCS Integrated Infrastructure for Big Data and Analytics with Cloudera Enterprise Bring faster performance and scalability for big data analytics. Highlights Proven platform for
More informationGain Insights From Unstructured Data Using Pivotal HD. Copyright 2013 EMC Corporation. All rights reserved.
Gain Insights From Unstructured Data Using Pivotal HD 1 Traditional Enterprise Analytics Process 2 The Fundamental Paradigm Shift Internet age and exploding data growth Enterprises leverage new data sources
More informationNEC Express5800 A2040b 22TB Data Warehouse Fast Track. Reference Architecture with SW mirrored HGST FlashMAX III
NEC Express5800 A2040b 22TB Data Warehouse Fast Track Reference Architecture with SW mirrored HGST FlashMAX III Based on Microsoft SQL Server 2014 Data Warehouse Fast Track (DWFT) Reference Architecture
More informationFalling Out of the Clouds: When Your Big Data Needs a New Home
Falling Out of the Clouds: When Your Big Data Needs a New Home Executive Summary Today s public cloud computing infrastructures are not architected to support truly large Big Data applications. While it
More informationApache Hadoop 3. Balazs Gaspar Sales Engineer CEE & CIS Cloudera, Inc. All rights reserved.
Apache Hadoop 3 Balazs Gaspar Sales Engineer CEE & CIS balazs@cloudera.com 1 We believe data can make what is impossible today, possible tomorrow 2 We empower people to transform complex data into clear
More informationHadoop & Big Data Analytics Complete Practical & Real-time Training
An ISO Certified Training Institute A Unit of Sequelgate Innovative Technologies Pvt. Ltd. www.sqlschool.com Hadoop & Big Data Analytics Complete Practical & Real-time Training Mode : Instructor Led LIVE
More informationDeploy a High-Performance Database Solution: Cisco UCS B420 M4 Blade Server with Fusion iomemory PX600 Using Oracle Database 12c
White Paper Deploy a High-Performance Database Solution: Cisco UCS B420 M4 Blade Server with Fusion iomemory PX600 Using Oracle Database 12c What You Will Learn This document demonstrates the benefits
More informationDistributed Systems 16. Distributed File Systems II
Distributed Systems 16. Distributed File Systems II Paul Krzyzanowski pxk@cs.rutgers.edu 1 Review NFS RPC-based access AFS Long-term caching CODA Read/write replication & disconnected operation DFS AFS
More informationStorage for HPC, HPDA and Machine Learning (ML)
for HPC, HPDA and Machine Learning (ML) Frank Kraemer, IBM Systems Architect mailto:kraemerf@de.ibm.com IBM Data Management for Autonomous Driving (AD) significantly increase development efficiency by
More informationDell Reference Configuration for Hortonworks Data Platform 2.4
Dell Reference Configuration for Hortonworks Data Platform 2.4 A Quick Reference Configuration Guide Kris Applegate Solution Architect Dell Solution Centers Executive Summary This document details the
More information