Big Data and Object Storage
or: where to store cold and small data?

Sven Bauernfeind
Computacenter AG & Co. oHG, Consultancy Germany
28.02.2018, Munich
Volume, Variety & Velocity + Analytics

[Figure: data volumes visualized as spheres - a gigabyte with 28 cm diameter, a terabyte with 290 m diameter, a petabyte with 300 km diameter]
Volume, Variety & Velocity (cont.)

Global data volume: 2005: 0.1 ZB; 2010: 1.2 ZB; 2012: 2.8 ZB; 2015: 8.5 ZB; 2020*: 40 ZB
By 2020, business transactions will grow to 450 billion a day, according to IDC.
Source: *IDC (https://www.emc.com/leadership/digital-universe/2014iview/index.htm)
Hitachi Vantara Forum 2018: Hadoop Basics
How to analyse this data?

- Hadoop: open-source framework for scalable, distributed computing
- Hadoop Core and Ecosystem: Spark, Hive, Kafka and many more!
Hadoop Solutions (fewer features to more features)

- Apache Hadoop: YARN, MapReduce, HDFS
- Hadoop Distributions: + ecosystem packaging, deployment tooling, support
- Big Data Suite: + tooling/modeling, business analytics, scheduling/integration
Hadoop Core

- YARN: resource planning and load balancing; infrastructure management and enterprise services
- HDFS: distributed overlay file system; redundant storage of large amounts of data
- MapReduce: programming model for parallel computation; fault-tolerant algorithm
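To make the MapReduce model concrete, here is a minimal in-process sketch of the map/shuffle/reduce phases using word count, the canonical example. This is an illustration of the programming model only, not the Hadoop API; the function names are our own.

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (word, 1) pair for every word in every input record
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts collected for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data", "big cluster"]
print(reduce_phase(shuffle(map_phase(lines))))
```

In a real cluster the map and reduce functions run in parallel on many nodes, and the shuffle moves data over the network; the logic per phase is the same.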
HDFS Basics

[Figure sequence: a logical file is split into blocks 1-4; each block is written to a data node and then replicated (default replication factor 3) across Rack 1, Rack 2 and Rack 3, so that no single rack holds all replicas of a block]
A Storage Problem? Cold Data & Small Files

[Figure: two distribution charts (Source: HPE 2017) - the age of data (buckets: 0-1 days, < 7 days, < 30 days, < 90 days, >= 90 days) and file sizes (buckets: 0 KB-4 KB, 4 KB-512 KB, 512 KB-16 MB, 16 MB-64 MB, 64 MB-128 MB, 128 MB-512 MB, 512 MB-1 GB), showing that a large share of data is cold and a large share of files is small]
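Why are small files a problem for HDFS at all? Every file and every block is an object in the NameNode's memory, roughly 150 bytes each as a rule of thumb (from the Cloudera small-files article cited later in this deck). A quick back-of-the-envelope estimate, with the 150-byte figure as the stated assumption:

```python
def namenode_heap_estimate(num_files, blocks_per_file=1, bytes_per_object=150):
    # Rule of thumb: each file and each block occupies ~150 bytes of
    # NameNode heap. bytes_per_object is an assumption, not measured.
    objects = num_files * (1 + blocks_per_file)  # one file object + its blocks
    return objects * bytes_per_object

# 100 million small files, one block each: ~30 GB of NameNode heap
print(namenode_heap_estimate(100_000_000) / 1e9)
```

So a cluster full of small files exhausts NameNode memory long before it exhausts disk capacity, which is what motivates archives and object storage below.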
Hadoop Architecture
Hadoop Architecture: Today

- Processing and storage co-located on x86 servers: Hadoop analytics running on top of the Hadoop file system
Hadoop Architecture: Tomorrow

- Processing: low-cost nodes, GPU nodes, FPGA nodes, memory nodes
- Storage: Hitachi All-Flash-Array, spinning-disk nodes, object storage
Hadoop Architecture: Tomorrow (cont.)

- Processing: low-cost nodes, GPU nodes, FPGA nodes, memory nodes
- Storage: SSD nodes, spinning-disk nodes, Hitachi HCP
Hadoop Architecture: Tomorrow (cont.)

- Processing: low-cost nodes, GPU nodes, FPGA nodes, memory nodes
- Storage: Hitachi All-Flash-Array, Hitachi HCP
HDFS Storage Tiering and Archives
HDFS Storage Types

- DISK: default storage type; hard disk drives
- SSD: solid state drives; fast, but expensive
- ARCHIVE: archival drives; slow, but cheap; object storage, cloud storage
- RAM_DISK: memory drives; very fast, with limited capacity
HDFS Storage Policies

Assignment of the individual storage types to storage policies:

- HOT: current data, read and write; all replicas on DISK
- WARM: current and old data, mostly read; replicas on DISK and ARCHIVE
- COLD: only for old data, read-only; all replicas on ARCHIVE
- All_SSD: all replicas on SSD
- One_SSD: one replica on SSD, n-1 on DISK
- Lazy_Persist: one replica on RAM_DISK
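The policy-to-storage-type assignment above can be sketched as a simple mapping from policy name to the storage type of each of the n replicas. This is an illustration of the rules listed on this slide, not HDFS source code:

```python
# Illustrative mapping of HDFS storage policies to per-replica
# storage types; n is the replication factor.
def replica_storage(policy, n=3):
    if policy == "HOT":
        return ["DISK"] * n
    if policy == "WARM":
        return ["DISK"] + ["ARCHIVE"] * (n - 1)
    if policy == "COLD":
        return ["ARCHIVE"] * n
    if policy == "All_SSD":
        return ["SSD"] * n
    if policy == "One_SSD":
        return ["SSD"] + ["DISK"] * (n - 1)
    if policy == "Lazy_Persist":
        return ["RAM_DISK"] + ["DISK"] * (n - 1)
    raise ValueError(f"unknown policy: {policy}")

print(replica_storage("WARM"))  # first replica on DISK, the rest on ARCHIVE
```

In practice a policy is attached to a directory with `hdfs storagepolicies -setStoragePolicy`, and existing blocks are migrated by the HDFS mover.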
HDFS Storage Tiering

- Without storage tiering: all replicas land on DISK, regardless of policy (HOT, WARM, COLD)
- With storage tiering: HOT keeps n replicas on DISK; WARM keeps 1 replica on DISK and n-1 replicas on ARCHIVE; COLD moves all n replicas to ARCHIVE
Hadoop Archives: HAR Files

- Layered filesystem on top of HDFS (har:// instead of hdfs://)
- Uses MapReduce to create the archive
- Reduces memory consumption on the NameNode
- Each HAR file access reads two index files (master index and index) and one data file
- Nothing changes for a client using the HAR filesystem
Source: http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
Hadoop Archives: HAR Files (cont.)

- Without HAR: the HDFS NameNode holds metadata for every block stored in HDFS
- With HAR: the per-file metadata is stored in the HAR's index files, so the NameNode only tracks the few large blocks of the archive itself
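The index-based lookup is the key idea: the archive's index maps each contained file to an (offset, length) region in one large data file, so reading an archived file is an index lookup plus a single seek, and the NameNode no longer needs one object per small file. A toy model of that lookup (the names and layout are illustrative, not the on-disk HAR format):

```python
# Toy model of a HAR read path: many small files are packed into one
# large data file, and an index maps each path to (offset, length).
data = b"hello world!"
index = {"/a.txt": (0, 5), "/b.txt": (6, 6)}

def har_read(path):
    offset, length = index[path]          # lookup in the index file
    return data[offset:offset + length]   # one seek into the data file

print(har_read("/a.txt"))
```

The trade-off: the extra index reads make each access slightly slower than plain HDFS, but the NameNode memory footprint shrinks from millions of file objects to a handful of archive blocks.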
Thank You

Sven Bauernfeind
Computacenter AG & Co. oHG, Consultancy Germany
sven.bauernfeind@computacenter.com
+49 173 9158966