WHITEPAPER. Improve Hadoop Performance with Memblaze PBlaze SSD

Similar documents
Hammerdb Test for OracleRAC with Memblaze PBlaze SSD

WHITEPAPER. Improve PostgreSQL Performance with Memblaze PBlaze SSD

Distributed Filesystem

Highly accurate simulations of big-data clusters for system planning and optimization

BIG DATA AND HADOOP ON THE ZFS STORAGE APPLIANCE

Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects?

Consolidating OLTP Workloads on Dell PowerEdge R th generation Servers

Accelerating Hadoop Applications with the MapR Distribution Using Flash Storage and High-Speed Ethernet

IBM Emulex 16Gb Fibre Channel HBA Evaluation

Accelerating Big Data: Using SanDisk SSDs for Apache HBase Workloads

Implementing SQL Server 2016 with Microsoft Storage Spaces Direct on Dell EMC PowerEdge R730xd

SMCCSE: PaaS Platform for processing large amounts of social media

Spark Over RDMA: Accelerate Big Data SC Asia 2018 Ido Shamay Mellanox Technologies

EsgynDB Enterprise 2.0 Platform Reference Architecture

Accelerating storage performance in the PowerEdge FX2 converged architecture modular chassis

Cluster Setup and Distributed File System

Distributed File Systems II

Distributed Systems 16. Distributed File Systems II

Dell Reference Configuration for Large Oracle Database Deployments on Dell EqualLogic Storage

Performance and Energy Efficiency of the 14 th Generation Dell PowerEdge Servers

A Plugin-based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS

Microsoft SQL Server in a VMware Environment on Dell PowerEdge R810 Servers and Dell EqualLogic Storage

The Hadoop Distributed File System Konstantin Shvachko Hairong Kuang Sanjay Radia Robert Chansler

Sun Fire X4170 M2 Server Frequently Asked Questions

Oracle TimesTen Scaleout: Revolutionizing In-Memory Transaction Processing

Dell EMC Ready Bundle for HPC Digital Manufacturing Dassault Systѐmes Simulia Abaqus Performance

An Oracle Technical White Paper October Sizing Guide for Single Click Configurations of Oracle s MySQL on Sun Fire x86 Servers

Crossing the Chasm: Sneaking a parallel file system into Hadoop

Hyper-converged storage for Oracle RAC based on NVMe SSDs and standard x86 servers

Crossing the Chasm: Sneaking a parallel file system into Hadoop

Emulex LPe16000B 16Gb Fibre Channel HBA Evaluation

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.

Four-Socket Server Consolidation Using SQL Server 2008

Dell EMC Ready Bundle for HPC Digital Manufacturing ANSYS Performance

EMC XTREMCACHE ACCELERATES MICROSOFT SQL SERVER

Hadoop and HDFS Overview. Madhu Ankam

Evaluation Report: HP StoreFabric SN1000E 16Gb Fibre Channel HBA

8Gb Fibre Channel Adapter of Choice in Microsoft Hyper-V Environments

CA485 Ray Walshe Google File System

Performance Comparisons of Dell PowerEdge Servers with SQL Server 2000 Service Pack 4 Enterprise Product Group (EPG)

A Comparative Study of Microsoft Exchange 2010 on Dell PowerEdge R720xd with Exchange 2007 on Dell PowerEdge R510

August Li Qiang, Huang Qiulan, Sun Gongxing IHEP-CC. Supported by the National Natural Science Fund

Accelerating Microsoft SQL Server 2016 Performance With Dell EMC PowerEdge R740

Selecting Server Hardware for FlashGrid Deployment

RIGHTNOW A C E

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

NEC Express5800 A2040b 22TB Data Warehouse Fast Track. Reference Architecture with SW mirrored HGST FlashMAX III

Intel Solid State Drive Data Center Family for PCIe* in Baidu s Data Center Environment

10 Million Smart Meter Data with Apache HBase

Impact of Dell FlexMem Bridge on Microsoft SQL Server Database Performance

Open storage architecture for private Oracle database clouds

Microsoft SQL Server 2012 Fast Track Reference Configuration Using PowerEdge R720 and EqualLogic PS6110XV Arrays

DELL Reference Configuration Microsoft SQL Server 2008 Fast Track Data Warehouse

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic

Deploy a High-Performance Database Solution: Cisco UCS B420 M4 Blade Server with Fusion iomemory PX600 Using Oracle Database 12c

An Oracle White Paper June Exadata Hybrid Columnar Compression (EHCC)

Virtualization of the MS Exchange Server Environment

Deploying and Managing Dell Big Data Clusters with StackIQ Cluster Manager

Accelerating Microsoft SQL Server Performance With NVDIMM-N on Dell EMC PowerEdge R740

VDI PERFORMANCE COMPARISON: DELL POWEREDGE FX2 AND FC430 SERVERS WITH VMWARE VIRTUAL SAN (ABRIDGED)

Big Data for Engineers Spring Resource Management

High Throughput WAN Data Transfer with Hadoop-based Storage

Correlation based File Prefetching Approach for Hadoop

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

Best Practices for Deploying a Mixed 1Gb/10Gb Ethernet SAN using Dell Storage PS Series Arrays

HP ProLiant DL380 Gen8 and HP PCle LE Workload Accelerator 28TB/45TB Data Warehouse Fast Track Reference Architecture

TPC-E testing of Microsoft SQL Server 2016 on Dell EMC PowerEdge R830 Server and Dell EMC SC9000 Storage

Google File System (GFS) and Hadoop Distributed File System (HDFS)

Upgrade to Microsoft SQL Server 2016 with Dell EMC Infrastructure

Best Practices for Deploying a Mixed 1Gb/10Gb Ethernet SAN using Dell EqualLogic Storage Arrays

QLE10000 Series Adapter Provides Application Benefits Through I/O Caching

EMC XTREMCACHE ACCELERATES VIRTUALIZED ORACLE

Goverlan Reach Server Hardware & Operating System Guidelines

QLIKVIEW SCALABILITY BENCHMARK WHITE PAPER

Dell PowerEdge R720xd with PERC H710P: A Balanced Configuration for Microsoft Exchange 2010 Solutions

Hadoop/MapReduce Computing Paradigm

Virtualized SQL Server Performance and Scaling on Dell EMC XC Series Web-Scale Hyper-converged Appliances Powered by Nutanix Software

Oracle Big Data Connectors

Database Solutions Engineering. Best Practices for Deploying SSDs in an Oracle OLTP Environment using Dell TM EqualLogic TM PS Series

<Insert Picture Here> MySQL Web Reference Architectures Building Massively Scalable Web Infrastructure

AMD EPYC Processors Showcase High Performance for Network Function Virtualization (NFV)

InfoBrief. Dell 2-Node Cluster Achieves Unprecedented Result with Three-tier SAP SD Parallel Standard Application Benchmark on Linux

EMC XTREMCACHE ACCELERATES ORACLE

High-Performance Lustre with Maximum Data Assurance

Dell PowerEdge R920 System Powers High Performing SQL Server Databases and Consolidates Databases

Dell Reference Configuration for Hortonworks Data Platform 2.4

Introduction...2. Executive summary...2. Test results...3 IOPs...3 Service demand...3 Throughput...4 Scalability...5

Effective resource utilization by In-Memory Parallel Execution in Oracle Real Application Clusters 11g Release 2

DriveScale-DellEMC Reference Architecture

HUAWEI Tecal X8000 High-Density Rack Server

Service Oriented Performance Analysis

Datasheet FUJITSU Software Cloud Monitoring Manager V2.0

Hadoop Virtualization Extensions on VMware vsphere 5 T E C H N I C A L W H I T E P A P E R

Next-Generation Cloud Platform

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

HADOOP 3.0 is here! Dr. Sandeep Deshmukh Sadepach Labs Pvt. Ltd. - Let us grow together!

MI-PDB, MIE-PDB: Advanced Database Systems

SPLUNK ENTERPRISE AND ECS TECHNICAL SOLUTION GUIDE

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance

2013 AWS Worldwide Public Sector Summit Washington, D.C.

Transcription:

Improve Hadoop Performance with Memblaze PBlaze SSD

Improve Hadoop Performance with Memblaze PBlaze SSD Exclusive Summary We live in the data age. It s not easy to measure the total volume of data stored electronically, but an IDC estimate put the size of the digital universe at 4.4 zettabytes in 2013 and is forecasting a tenfold growth by 2020 to 44 zettabytes. So how to organize those sheer volume of data being generated every year effectively? Processing massive amounts of data requires a parallel compute and storage infrastructure, thus the parallel processing, scalable and reliable ability which Hadoop provides. This whitepaper demonstrates the performance of Hadoop is improved with Memblaze PBlaze SSD by up to 4.5x over HDD. Introduction Hadoop utilizes the powerful Hadoop Distributed File System (HDFS) which is a highly scalable, parallel file system optimized for very large sequential data sets running on clusters of commodity hardware. CPU and memory can process hundred GB data per second, but disk IO can only support 100MB to several GB read/write per second with HDD. Disk IO is the bottleneck of the system. Large data generated every day, as amount of data grows, processing the data needs longer and longer time. How to save time? Solve disk IO bottleneck issue, improve HDFS performance. Add more disks to each node? May be the node don t have more space for disk. Even there are several hard drive slot, the performance is not good enough. Add more nodes, may be data center don t have more space. Add more nodes speed too much. Don t worry. Memblaze PCIe SSD help you solve it. It offers preferable performance with less nodes. Our test show 3 data nodes, each with one SSD performance is better than 7 data nodes, each with 6 HDDs. Test Cluster Configuration The Hadoop distributed file system (HDFS) is a distributed, scalable, and portable file-system written in Java for the Hadoop framework. A Hadoop cluster has nominally a single name node plus a cluster of data nodes. Each data node serves up blocks of data over the network using a block protocol specific to HDFS. The file system uses TCP/IP sockets for communication. Clients use remote procedure call (RPC) to communicate between each other. The test cluster consists of 1 master node running NameNode, ResourceManager and client function, and 3 data nodes configured as following Figure 1 illustrated: 10Gb Ethernet Name node Data node1 Data node2 Data node3 Figure 1 Test Cluster Topology 2

Improve Hadoop Performance with Memblaze PBlaze SSD CPU: Memory: SSD: Network: Dell PowerEdge R730x 2 socket Intel XeonE5-2630(8 cores) v3 128GB 1 x Memblaze 3.2T PBlaze4 Intel 82599ES 10-Gigabit Linux: CentOS 7.0 File system: Java: Hadoop: Benchmark Tool: xt4 1.7.0_75 hadoop-2.6.3 hadoop-mapreduce-client-jobclient-2.6.3-tests.jar TestDFSIO Following is the list of configuration parameters used in benchmark: General parameters Map tasks: 12 Reduce tasks: 12 HDFS Block size: 128 MB Replication factor: 3 Mapreduce Java child options: -Xmx 2048m Benchmark procedure Io sort buffer size: 512MB The test process aims at evaluate HDFS performance with PCIe SSD. TestDFSIO is used to measure performance of HDFS and stress both network and IO subsystems. The command read and write files in HDFS which is useful in measuring system-wide performance and exposing network bottlenecks on the NameNode and DataNodes. A majority of MapReduce workloads are IO bound more than compute and hence TestDFSIO can provide an accurate initial picture of such scenarios. The benchmark can be run for writing, using the write switch, and using read for the read test. The command line accepts a number of files and sizes of each file in HDFS. As an example, the command for a read test may look like: hadoop jar./share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.6.3-tests.jar TestDFSIO -write -nrfiles 1 -size 240GB TestDFSIO generates 1 map task per file and splits are defined such that each map gets only one file name. After every run, the command generates a log file indicating performance in terms of 4 metrics: Throughput in MBytes/s, Average IO rate in MBytes/s, IO rate standard deviation and execution time. The most notable metrics are throughput and average IO, both of which are based on file size read or written by the individual map task and the elapsed time in performing the task. The throughput for N map tasks are defined as: 3

MB/s Improve Hadoop Performance with Memblaze PBlaze SSD And average IO rate is calculated as: If the cluster has 50 map slots and TestDFSIO creates 1000 files, the throughput can be calculated as: The IO rate can be calculated in similar fashion. While measuring cluster performance using TestDFSIO may be considered sufficient, the HDFS replication factor (value of dfs.replication in hdfssite.xml) also plays important role. A lower replication factor leads to higher throughput performance due to reduced background traffic [1]. Test Results Our test function run on the Namenode. Illustrated in Figure 2, total write 240GB data, replicate 3, file number N with 1, 3, 6, 12, 48, file size M, each file with one map. 1 map can get best Throughput 144MB/s, 12 and 24 maps can get best concurrent throughput 576 MB/s. Because replicate 3, HDFS write 576 * 3 = 1728MB/s to 3 data node, that is 576 MB/s each data node. 700 600 500 400 300 200 100 TestDFSIO Write Throughput 0 1 240GB 3 80GB 6 40GB 12 20GB 24 10GB 48 5GB Total File Size 240GB: N x M Throughput Concurrent throughput Figure 2 TestDFSIO Write Throughput 4

Throughout MB/s MB/s Improve Hadoop Performance with Memblaze PBlaze SSD Illustrated in Figure 3, total read 240GB data, replicate 3, file number N with 1, 3, 6, 12, 48, file size M, each file with one map. 1 map can get best throughput 357MB/s, 48 maps can get best concurrent throughput 8208MB/s. TestDFSIO read throughput 9000 8000 7000 6000 5000 4000 3000 2000 1000 0 1 240GB 3 80GB 6 40GB 12 20GB 24 10GB 48 5GB Total File Size 240GB: N x M Throughput Concurrent throughput Figure 3 TestDFSIO Read Throughput Emulex tested Hadoop with 7 Data node each 6 HDD, the performance test result from test report Emulex, Performance measurement of a Hadoop Cluster. Our benchmark result shows 3 data nodes performance much better than 7 data nodes HDD cluster as shown in Figure 4. Write and Read Thought Comparison between Hadoop Cluster with PCIE SSD and HDD 400 350 300 250 200 150 100 50 0 7 DataNode with 6 HDD per Node 3 DataNode with 1 PCIe SSD per Node Write Throught (MB/s) Read Throught (MB/s) Figure 4 Write and Read Thought Comparison between Hadoop Cluster with PCIE SSD and HDD 5

Improve Hadoop Performance with Memblaze PBlaze SSD Conclusions As the amount of data grows, the most efficiency solution to improve Hadoop IO performance is utilizes PCIe SSD instead of HDD. From the white paper it shows clearly that Hadoop cluster greatly benefits from innovative Memblaze PCIe SSD and offers optimum performance, in the meantime, save more nodes and with TCO comparable to HDDs. Reference [1] Emulex, Performance measurement of a Hadoop Cluster DISCLAIMER Information in this document is provided in connection with Memblaze products. Memblaze provides this document as is, without warranty of any kind, neither expressed nor implied, including, but not limited to, the particular purpose. Memblaze may make improvements and/or changes in this document or in the product described in this document at any time without notice. The products described in this document may contain design defects or errors known as anomalies or errata which may cause the products functions to deviate from published specifications. COPYRIGHT 2016 Memblaze Corp. All rights reserved. No part of this document may be reproduced, transmitted, transcribed, stored in a retrieval system, or translated into any language in any form or by any means without the written permission of Memblaze Corp. TRADEMARKS Memblaze is a trademark of Memblaze Corporation. Other names mentioned in this document are trademarks/registered trademarks of their respective owners. USING THIS DOCUMENT Though Memblaze has reviewed this document and very effort has been made to ensure that this document is current and accurate, more information may have become available subsequent to the production of this guide. In that event, please contact your local Memblaze sales office or your distributor for latest specifications before placing your product order. 6