BIG DATA AND HADOOP ON THE ZFS STORAGE APPLIANCE
Brett Weninger, Managing Director
10/21/2014
ADURANT APPROACH TO BIG DATA
- Align to un/semi-structured data instead of "big"; scale-out will become big
- Greatest benefit: data development velocity
- Reciprocal impact: faster application development
WHAT WE'RE NOT LOOKING AT TODAY
- Streaming technologies
- In-memory technologies
HADOOP 1.0 SYSTEMS ARCHITECTURE ASSUMPTIONS
HADOOP 1.0 SYSTEMS ARCHITECTURE ASSUMPTIONS
- Map/Reduce: abstracts storage, concurrency, and execution
- HDFS: distributed, fault-tolerant filesystem
  - Primarily designed for cost/scale; not POSIX compliant
  - Works on commodity hardware
  - Files are large (GBs to TBs) and append-only
  - Access is large and sequential
- Hardware failure is common, so fault tolerance is baked in:
  - Replicate data 3x
  - Incrementally re-execute computation
  - Avoid single points of failure
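A minimal sketch of what the 3x replication assumption means for raw capacity planning; the 5% overhead pad is an illustrative assumption, not a figure from this deck.

```python
# Minimal sketch: raw HDFS capacity implied by the default dfs.replication = 3.
# The 5% pad for non-HDFS/temp space is an illustrative assumption.
def raw_capacity_tb(logical_tb, replication=3, overhead=0.05):
    return logical_tb * replication * (1 + overhead)

print(raw_capacity_tb(100))  # ~315 TB of raw disk to hold 100 TB of data
```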
THE HADOOP SYSTEMS ARCHITECTURE PROBLEM
THE HADOOP PROBLEM - SYSTEMS ARCHITECTURE VIEW
Technical View:
- Hadoop is a giant I/O platform
- I/O access has fallen behind CPU/memory density
- Strategy to address the I/O vs. processing divergence: read/write to as many drives in parallel as possible
- Related variable: an increase in spindle count drives additional network traffic (between nodes)
- Bounded by latency of reads/writes to disk (in addition to bandwidth)
THE HADOOP PROBLEM - SYSTEMS ARCHITECTURE VIEW
Technical View (cont.):
- An increased number of disk reads/writes has a reciprocal impact on network bandwidth
- Teragen is a method for synthetic testing of network capacity; it generates 3-9x the network load of normal operations
- There is a direct relationship between the number of drives per node and the number of MapReduce slots for that node
Business View:
- The greater the spindle count, the lower the cost per TB
- Generally, more average nodes are better than fewer super nodes
- Data protection is an additional consideration
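A hedged illustration of the spindle-count economics above; the per-node prices and the one-map-slot-per-spindle rule of thumb are assumptions for the sketch, not figures from this deck.

```python
# Illustrative only: hypothetical per-node prices showing why higher spindle
# counts lower raw $/TB. A common Hadoop 1.x rule of thumb sizes roughly one
# map slot per spindle, which is why drive count also scales task slots.
def cost_per_tb(node_cost_usd, drives, drive_tb=4):
    return node_cost_usd / (drives * drive_tb)

for drives in (6, 12, 24):
    node_cost = 8000 + drives * 300          # made-up chassis + drive pricing
    print(drives, "drives ->", round(cost_per_tb(node_cost, drives), 1), "$/TB raw")
```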
THE HADOOP PROBLEM - CPU
CPU Performance:
- Typically, CPU clock speed does not impact processing times
- Typically, CPU is not a performance bottleneck (there are exceptions)
Heuristics on CPU:
- No negative impact from running more or higher-quality CPUs
- Price and power consumption become the primary boundary values for optimal ROI
- A single task typically uses one thread at a time
- Investing in more cores typically does not yield a linear return
- Investing in more performant CPUs typically does not yield a linear return
- Typically, threads spend a large amount of time idle while waiting for I/O responses
THE HADOOP PROBLEM - MEMORY
Memory Performance:
- Typically, memory capacity does not have a significant impact on processing times
Heuristics on Memory:
- No negative impact from running more or higher-quality memory
- Price becomes the primary boundary value for optimal ROI
- Additional memory will support MapReduce in the sorting process
THE HADOOP PROBLEM - DRIVES
Drive Density:
- Popular drive sizes: 1, 2, 3, and 4 TB
Heuristics on Drives:
- The larger the drive, the cheaper the $/TB = optimal ROI
- Larger drives create an opportunity for replication storms: a disk rebuild can take longer and has the potential to saturate the network, impacting cluster performance (see the sketch below)
- Typically, drive size and latency have little impact on cluster performance (there are exceptions)
- Typically, a less optimal ROI is achieved by using faster drives: MapReduce is designed for long sequential reads and writes, so there is less value in addressing disk latency
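A back-of-the-envelope sketch of the replication-storm concern; the 2 Gb/s of aggregate re-replication bandwidth is an assumed number, not from this deck.

```python
# Back-of-the-envelope sketch (assumed numbers): how long HDFS re-replication
# of a lost 4 TB drive might take, and why it can hog a cluster's network.
def rebuild_hours(lost_tb, usable_gbps):
    bits_to_copy = lost_tb * 1e12 * 8
    return bits_to_copy / (usable_gbps * 1e9) / 3600

# Assume re-replication gets ~2 Gb/s of aggregate cluster bandwidth.
print(round(rebuild_hours(4, 2.0), 1), "hours to re-protect one 4 TB drive")
```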
THE HADOOP PROBLEM - NETWORK
Network Performance:
- Typically, 1 GbE is not enough bandwidth for production Hadoop clusters
Network Heuristics:
- Networking is a critical area for Hadoop clusters
- Production clusters have 10 GbE, sometimes 2 GbE
- Compression can drastically improve network performance (see the sketch below)
- Bandwidth beyond 10 GbE is rarely a necessity
Note on Networking:
- There are differences between bandwidth and latency: higher bandwidth can lead to higher volume at a given latency, while lower-latency fabrics can lead to higher volume and faster response (improved environment performance)
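A hedged illustration of why compression helps a network-bound transfer; the 3x ratio is an assumption, roughly in line with the ~3.6-3.8x ZFS compression reported later in this deck.

```python
# Hedged illustration: time to move 100 GB over 10 GbE with and without
# compression. The 3x ratio is an assumption for the sketch.
def transfer_seconds(data_gb, link_gbps, compression_ratio=1.0):
    bits_on_wire = (data_gb / compression_ratio) * 8e9
    return bits_on_wire / (link_gbps * 1e9)

print(round(transfer_seconds(100, 10), 1), "s uncompressed over 10 GbE")
print(round(transfer_seconds(100, 10, 3.0), 1), "s at ~3x compression")
```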
THE HADOOP PROBLEM - POWER
Power Considerations:
- Availability versus cost is the primary consideration
- The value tapers with the size of the cluster: for instance, it matters for a 10-node production cluster at a smaller organization, but beyond roughly 20 nodes the value tapers off
- If using a single power supply, consider MTBF at the node level and the network impact of a rebuild
- Exception: dual power supplies are recommended for master nodes
HADOOP COST CONSIDERATIONS
- Price per node
- Performance per node
- Capacity per node
- Space, power, cooling
- Supportability (FTE)
- Resiliency: availability, fragmentation, failure impact (risk)
THE HADOOP SYSTEMS ARCHITECTURE PROBLEM
Architecture:
- 3x full-copy replication
- No compression
- No data de-duplication
- Near-linear scalability (95%)
Performance Profile:
- Primary bottleneck: I/O
- Secondary bottleneck: inter-node traffic (100s of nodes)
- CPU/memory under-utilized per chassis
Configuration:
- Backup solution: a production-sized cluster
- Fixed disk sizing at the chassis level
WHY ZFS?
- Performance
- Compression
- Block size
- Analytics
- Backup/recovery
- Cost
ZFS HYPOTHESIS
ZFS advantages for Hadoop:
- DRAM: faster processing
- Larger block size (128K-1MB): faster processing
- Compression: reduced footprint
- Encryption (slipped to Fall 2014)
Expected Outcome:
- Equivalent/near-equivalent processing
- Economical backup solution
- Reduced disk footprint
- Right-sized disk allocation to server
WE GET A DISRUPTIVE WIN IF
- We drive Hadoop from being I/O bound to being CPU/memory bound
- We significantly reduce the disk footprint
- Huge implications if we drive all load to CPU
ZFS TESTING SYS ARCH - LOCAL CLUSTER
Hadoop:
- Cloudera 5.1.3
- 1 name node, 5 data nodes
Servers:
- (6) X4-2Ls, OL 6.3 (upgraded to OL 6.5)
- (2) Intel Xeon E5-2690 v2 10-core 3.0 GHz processors
- 128 GB memory (DDR3-1600)
- (12) 4 TB 7,200 RPM 3.5-inch SAS-2 HDDs (local disk)
Storage:
- 240 TB total local disk
ZFS TESTING SYS ARCH - ARRAY CLUSTER
Hadoop:
- Cloudera 5.1.3
- 1 name node, 5 data nodes
Servers:
- (6) X4-2Ls, OL 6.3 (upgraded to OL 6.5)
- (2) Intel Xeon E5-2690 v2 10-core 3.0 GHz processors
- 128 GB memory (DDR3-1600)
- (12) 4 TB 7,200 RPM 3.5-inch SAS-2 HDDs (local disk)
Storage:
- ZS3-4 (clustered)
- 2 TB DRAM
- 6 shelves of 900 GB 10K RPM HDDs
- 108 TB
ZFS STORAGE REFERENCE ARCHITECTURE
BENCHMARK APPROACH
- Cluster types: local cluster, array cluster
- Terasort: 10 GB, 100 GB, 1 TB
- TestDFSIO: 100 GB, 1 TB, 10 TB
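A hedged sketch of how Terasort runs like these can be scripted; the examples-jar path shown is a typical CDH parcel location and the HDFS working directories are assumptions, not details from this deck.

```python
# Hedged sketch of scripting the Terasort runs above. The examples-jar path is
# a typical CDH parcel location (an assumption here); adjust for your install.
import subprocess

EXAMPLES_JAR = "/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar"

def terasort(rows, workdir="/benchmarks/terasort"):
    """teragen writes `rows` 100-byte records, then terasort sorts them."""
    subprocess.check_call(["hadoop", "jar", EXAMPLES_JAR, "teragen",
                           str(rows), workdir + "/input"])
    subprocess.check_call(["hadoop", "jar", EXAMPLES_JAR, "terasort",
                           workdir + "/input", workdir + "/output"])

# 10 GB = 100 million 100-byte rows; scale up for the 100 GB and 1 TB runs.
terasort(100000000)
```

TestDFSIO is launched the same way from the hadoop-mapreduce-client-jobclient tests jar; its flags vary slightly between versions, so check the jar's usage output.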
DATA TESTING APPROACH
- Cluster types: local cluster, array cluster
- Types of jobs: 3 types written in Hive
  - Simple (4x)
  - Medium complexity (4x)
  - High complexity/inefficient process (4x)
- Job sizes: 400 GB, 800 GB, 1.6 TB
DATA TESTING FINDINGS
Local Cluster (1.6 TB):
- Simple: 7,978.1 s
- Medium: 8,970.6 s
- Complex: 14,121.2 s
Array Cluster (1.6 TB):
- Simple: 2,510.9 s
- Medium: 2,994.2 s
- Complex: 5,854.8 s
*128K block
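The speedups implied by these timings, computed below, average out to roughly the "280% faster" figure summarized on the next slide.

```python
# Speedups implied by the reported 1.6 TB timings (array vs. local cluster).
local = {"simple": 7978.1, "medium": 8970.6, "complex": 14121.2}
array = {"simple": 2510.9, "medium": 2994.2, "complex": 5854.8}

for job in local:
    print("%s: %.2fx faster on the array cluster" % (job, local[job] / array[job]))
# ~3.2x, ~3.0x, ~2.4x -- consistent with the ~280% summary on the next slide.
```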
HADOOP AND ZFS TEST RESULTS SUMMARY
Hadoop Operations:
- Jobs complete approximately 280% faster
- Larger jobs trend in a near 1:1 linear fashion
Compression:
- Compression of 3.6-3.8x achieved on the lowest setting
BENEFITS OF RUNNING HADOOP ON ZFS
- Reduced cluster overhead with a replication factor of 2x
- Reduced storage with the replication factor lowered to 2x (see the footprint sketch below)
- Increased protection: number of copies of data up to 4x
- Added compression of > 3x (for compressible data)
- Added caching, decreasing I/O response times
- Added data protection (RAID 1) with no overhead
- Added fault tolerance via clustered heads
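A hedged footprint sketch combining the 2x replication and >3x compression claims above; array-side mirroring overhead is intentionally left out to keep the comparison simple.

```python
# Footprint sketch: raw bytes needed per logical TB under stock HDFS
# (3x copies, no compression) versus HDFS on the ZFS array (2x copies,
# ~3x compression). Array-side mirroring overhead is left out for simplicity.
def raw_per_logical_tb(replication, compression=1.0):
    return replication / compression

print(raw_per_logical_tb(3), "TB raw per logical TB (stock HDFS)")
print(round(raw_per_logical_tb(2, 3.0), 2), "TB raw per logical TB (ZFS, 2x repl, ~3x compression)")
```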
PROCESSING IMPLICATIONS
Type   | Storage Capacity | Processing (Servers) | 24 Hours (PB) | Annual (PB)
Server | 164              | 164                  | 0.5           | 184.4
Array  | 1                | 55                   | 0.5           | 185.5
IMPACT OF YARN AND SPARK
- Reduced Map/Reduce ratio
- Management for mixed workloads
- Greater flexibility in coding choices
- Lower latency from request to completion = faster; QoS opportunities by job/process
- Greater flexibility in archiving/storing data
- Possibility of using higher levels of compression for data segments
- Increased complexity of process/library management
EXABYTE PLATFORM CONSIDERATIONS
- Compression
- Access
- Tiered data
- Encryption
- Capacity
- Network speed
- Workload segmentation
- Data fragmentation
- Block rebuild/disk rebuild process
MINE IS BIG - HOW BIG IS YOURS?
Global Data Census:
- 2014: 2.7 zettabytes
- 2020: 50+ zettabytes (est.)
Data Scale (decimal prefixes; checked in the sketch below):
- KB: 1,000 B
- MB: 1,000,000 B
- GB: 1,000,000,000 B
- TB: 1,000,000,000,000 B
- PB: 1,000,000,000,000,000 B
- EB: 1,000,000,000,000,000,000 B
- ZB: 1,000,000,000,000,000,000,000 B
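A quick check of the decimal prefixes used in the scale list above.

```python
# Quick check of the decimal (SI) scale above: each prefix step is x1,000.
units = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB"]
for i, unit in enumerate(units, start=1):
    print("%s: %s B" % (unit, format(10 ** (3 * i), ",")))
```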
MINE IS BIG - HOW BIG IS YOURS?
GraySort Benchmark:
- 2009: 0.578 TB/min, Yahoo, 3,452 nodes (2x, 8 GB, 4 SATA)
- 2011: 0.725 TB/min, UC San Diego, 52 nodes (2 CPU, 24 GB, 16x 500 GB)
- 2013: 1.42 TB/min, Yahoo, 2,100 nodes (2 CPU, 64 GB, 12x 3 TB)
Cluster Sizes:
- Yahoo, 2012: 42,000 nodes, 200 PB, 20 production clusters (largest is 4,000 nodes)
- Facebook, 2010: 2,000 nodes, 21 PB
- Spotify, 2014: 694 heterogeneous nodes, 14.25 PB (12k jobs/day)
HADOOP ON ZFS TECHNICAL WHITEPAPER
- Technical whitepaper published
- Follow ADURANT @aduranttech for notification of the link
Contact Information:
Brett Weninger, Managing Director
brett.weninger@adurant.com
720-375-1600
@aduranttech