Big Data and Object Storage
or: where to store cold and small data?

Sven Bauernfeind
Computacenter AG & Co. oHG, Consultancy Germany
28.02.2018, Munich
Volume, Variety & Velocity + Analytics

[Figure: data volumes visualized as spheres - a gigabyte with 28 cm diameter, a terabyte with 290 m diameter, a petabyte with 300 km diameter]
Volume, Variety & Velocity (cont.)

Global data volume: 2005: 0.1 ZB; 2010: 1.2 ZB; 2012: 2.8 ZB; 2015: 8.5 ZB; 2020*: 40 ZB
By 2020, business transactions will grow to 450 billion a day, according to IDC.
Source: *IDC (https://www.emc.com/leadership/digital-universe/2014iview/index.htm)
Hitachi Vantara Forum 2018: Hadoop Basics
How to analyse this data?

- Hadoop: open-source framework for scalable, distributed computing
- Hadoop Core and Ecosystem: Spark, Hive, Kafka and many more!
Hadoop Solutions (fewer features to more features)

- Apache Hadoop: YARN, MapReduce, HDFS
- Hadoop Distributions: + ecosystem packaging, deployment tooling, support
- Big Data Suite: + tooling/modeling, business analytics, scheduling/integration
Hadoop Core

- YARN: resource planning and load balancing; infrastructure management and enterprise services
- HDFS: distributed overlay file system; redundant storage of large amounts of data
- MapReduce: programming model for parallel computation; fault-tolerant algorithm
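To make the MapReduce model concrete, here is a minimal in-process sketch of the map/shuffle/reduce phases using word count, the canonical example. This is an illustration of the programming model only, not the Hadoop API; the function names are our own.

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (word, 1) pair for every word in every input record
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts collected for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data", "big cluster"]
print(reduce_phase(shuffle(map_phase(lines))))
```

In a real cluster the map and reduce functions run in parallel on many nodes, and the shuffle moves data over the network; the logic per phase is the same.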
HDFS Basics

[Figure sequence: a logical file is split into blocks 1-4; each block is written to a data node and then replicated (default replication factor 3) across Rack 1, Rack 2 and Rack 3, so that no single rack holds all replicas of a block]
A Storage Problem? Cold Data & Small Files

[Figure: two distribution charts (Source: HPE 2017) - the age of data (buckets: 0-1 days, < 7 days, < 30 days, < 90 days, >= 90 days) and file sizes (buckets: 0 KB-4 KB, 4 KB-512 KB, 512 KB-16 MB, 16 MB-64 MB, 64 MB-128 MB, 128 MB-512 MB, 512 MB-1 GB), showing that a large share of data is cold and a large share of files is small]
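Why are small files a problem for HDFS at all? Every file and every block is an object in the NameNode's memory, roughly 150 bytes each as a rule of thumb (from the Cloudera small-files article cited later in this deck). A quick back-of-the-envelope estimate, with the 150-byte figure as the stated assumption:

```python
def namenode_heap_estimate(num_files, blocks_per_file=1, bytes_per_object=150):
    # Rule of thumb: each file and each block occupies ~150 bytes of
    # NameNode heap. bytes_per_object is an assumption, not measured.
    objects = num_files * (1 + blocks_per_file)  # one file object + its blocks
    return objects * bytes_per_object

# 100 million small files, one block each: ~30 GB of NameNode heap
print(namenode_heap_estimate(100_000_000) / 1e9)
```

So a cluster full of small files exhausts NameNode memory long before it exhausts disk capacity, which is what motivates archives and object storage below.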
Hadoop Architecture
Hadoop Architecture: Today

- Processing and storage co-located on x86 servers: Hadoop analytics running on top of the Hadoop file system
Hadoop Architecture: Tomorrow

- Processing: low-cost nodes, GPU nodes, FPGA nodes, memory nodes
- Storage: Hitachi All-Flash-Array, spinning-disk nodes, object storage
Hadoop Architecture: Tomorrow (cont.)

- Processing: low-cost nodes, GPU nodes, FPGA nodes, memory nodes
- Storage: SSD nodes, spinning-disk nodes, Hitachi HCP
Hadoop Architecture: Tomorrow (cont.)

- Processing: low-cost nodes, GPU nodes, FPGA nodes, memory nodes
- Storage: Hitachi All-Flash-Array, Hitachi HCP
HDFS Storage Tiering and Archives
HDFS Storage Types

- DISK: default storage type; hard disk drives
- SSD: solid state drives; fast, but expensive
- ARCHIVE: archival drives; slow, but cheap; object storage, cloud storage
- RAM_DISK: memory drives; very fast, with limited capacity
HDFS Storage Policies

Assignment of the individual storage types to storage policies:

- HOT: current data, read and write; all replicas on DISK
- WARM: current and old data, mostly read; replicas on DISK and ARCHIVE
- COLD: only for old data, read-only; all replicas on ARCHIVE
- All_SSD: all replicas on SSD
- One_SSD: one replica on SSD, n-1 on DISK
- Lazy_Persist: one replica on RAM_DISK
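The policy-to-storage-type assignment above can be sketched as a simple mapping from policy name to the storage type of each of the n replicas. This is an illustration of the rules listed on this slide, not HDFS source code:

```python
# Illustrative mapping of HDFS storage policies to per-replica
# storage types; n is the replication factor.
def replica_storage(policy, n=3):
    if policy == "HOT":
        return ["DISK"] * n
    if policy == "WARM":
        return ["DISK"] + ["ARCHIVE"] * (n - 1)
    if policy == "COLD":
        return ["ARCHIVE"] * n
    if policy == "All_SSD":
        return ["SSD"] * n
    if policy == "One_SSD":
        return ["SSD"] + ["DISK"] * (n - 1)
    if policy == "Lazy_Persist":
        return ["RAM_DISK"] + ["DISK"] * (n - 1)
    raise ValueError(f"unknown policy: {policy}")

print(replica_storage("WARM"))  # first replica on DISK, the rest on ARCHIVE
```

In practice a policy is attached to a directory with `hdfs storagepolicies -setStoragePolicy`, and existing blocks are migrated by the HDFS mover.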
HDFS Storage Tiering

- Without storage tiering: all replicas land on DISK, regardless of policy (HOT, WARM, COLD)
- With storage tiering: HOT keeps n replicas on DISK; WARM keeps 1 replica on DISK and n-1 replicas on ARCHIVE; COLD moves all n replicas to ARCHIVE
Hadoop Archives: HAR Files

- Layered filesystem on top of HDFS (har:// instead of hdfs://)
- Uses MapReduce to create the archive
- Reduces memory consumption on the NameNode
- Each HAR file access reads two index files (master index and index) and one data file
- Nothing changes for a client using the HAR filesystem
Source: http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
Hadoop Archives: HAR Files (cont.)

- Without HAR: the HDFS NameNode holds metadata for every block stored in HDFS
- With HAR: the per-file metadata is stored in the HAR's index files, so the NameNode only tracks the few large blocks of the archive itself
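The index-based lookup is the key idea: the archive's index maps each contained file to an (offset, length) region in one large data file, so reading an archived file is an index lookup plus a single seek, and the NameNode no longer needs one object per small file. A toy model of that lookup (the names and layout are illustrative, not the on-disk HAR format):

```python
# Toy model of a HAR read path: many small files are packed into one
# large data file, and an index maps each path to (offset, length).
data = b"hello world!"
index = {"/a.txt": (0, 5), "/b.txt": (6, 6)}

def har_read(path):
    offset, length = index[path]          # lookup in the index file
    return data[offset:offset + length]   # one seek into the data file

print(har_read("/a.txt"))
```

The trade-off: the extra index reads make each access slightly slower than plain HDFS, but the NameNode memory footprint shrinks from millions of file objects to a handful of archive blocks.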
Thank You

Sven Bauernfeind
Computacenter AG & Co. oHG, Consultancy Germany
sven.bauernfeind@computacenter.com
+49 173 9158966