Big Data and Object Storage

1 Big Data and Object Storage, or: where to store the cold and small data? Sven Bauernfeind, Computacenter AG & Co. oHG, Consultancy Germany, Munich

2 Volume, Variety & Velocity + Analytics. Figure: the three Vs (Volume, Variety, Velocity) plus Analytics, with a scale comparison: GIGABYTE 28 cm diameter, TERABYTE 290 m diameter, PETABYTE 300 km diameter.

3 Volume, Variety & Velocity (cont.). Panels: Volume, Variety, Velocity. Volume: 0.1 ZB, 1.2 ZB, 2.8 ZB, 8.5 ZB; 2020*: 40 ZB. Velocity: by 2020, business transactions will grow to up to 450 billion a day, according to IDC. Source: *IDC (

4 Hitachi Vantara Forum 2018: Hadoop Basics

5 How to analyse this data? Hadoop: an open-source framework for scalable, distributed computing. It consists of the Hadoop Core and an ecosystem: Spark, Hive, Kafka and many more.

6 Hadoop Solutions, ordered from fewer to more features. Apache Hadoop: YARN, MapReduce, HDFS, ecosystem. Hadoop Distributions: + packaging, deployment tooling, support. Big Data Suite: + tooling/modeling, business analytics, scheduling/integration.

7 Hadoop Core. YARN: resource planning and load balancing; infrastructure management and enterprise services. HDFS: distributed overlay file system; redundant storage of large amounts of data. MapReduce: programming model for parallel calculations; fault-tolerant algorithm.
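The MapReduce programming model named on slide 7 can be illustrated with a small word count. This is a single-process sketch in plain Python, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real job would run the map and reduce calls in parallel across the cluster.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit one (word, 1) pair per word in the input line."""
    for word in line.split():
        yield word.lower(), 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle and reduce sequentially (illustration only)."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

if __name__ == "__main__":
    print(word_count(["big data", "cold data", "small files"]))
```

Because each map call sees only one line and each reduce call sees only one key, the framework is free to distribute and retry them independently, which is where the fault tolerance mentioned on the slide comes from.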

8-10 HDFS Basics (three build-up slides). Figure: a logical file stored on a cluster that spans Rack 1, Rack 2 and Rack 3.
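The HDFS Basics slides show a logical file spread over three racks. The sketch below simulates where the replicas of one block could land under the HDFS default rack-aware placement for a replication factor of 3: one replica on a first node, a second on a node in a different rack, and a third on another node in that same remote rack. The cluster layout, node names and the `place_replicas` function are hypothetical; this is an illustration, not HDFS code.

```python
import random

# Hypothetical cluster layout: rack name -> DataNode names.
CLUSTER = {
    "rack1": ["dn11", "dn12", "dn13"],
    "rack2": ["dn21", "dn22", "dn23"],
    "rack3": ["dn31", "dn32", "dn33"],
}

def place_replicas(cluster, rng):
    """Pick three (rack, node) targets for one block, replication factor 3."""
    # First replica: any node (in HDFS, the writer's node if it is a DataNode).
    first_rack = rng.choice(sorted(cluster))
    first = (first_rack, rng.choice(cluster[first_rack]))
    # Second and third replica: two different nodes in one other rack.
    remote_rack = rng.choice(sorted(r for r in cluster if r != first_rack))
    second_node, third_node = rng.sample(cluster[remote_rack], 2)
    return [first, (remote_rack, second_node), (remote_rack, third_node)]

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs pick the same nodes
    for block_id in range(4):
        print(f"block {block_id}:", place_replicas(CLUSTER, rng))
```

With this layout a block survives the loss of a whole rack, while only one of the three copies has to cross a rack boundary during the write.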

11 A Storage Problem? Figure: two charts (source: HPE 2017). Cold Data, by age: 0-1 days, < 7 days, < 30 days, < 90 days, 90 days. Small Files, by size: 0 KB - 4 KB, 4 KB - 512 KB, 512 KB - 16 MB, 16 MB - 64 MB, 64 MB - 128 MB, 128 MB - 512 MB, 512 MB - 1 GB. Chart values as shown: 21%, 8%, 6%, 6%, 2%, 27%, 7%, 5%, 56%, 19%, 11%, 32%.

12 Hadoop Architecture

13 Hadoop Architecture: Today. Processing: Hadoop Analytics. Storage: Hadoop File System. Both layers run on x86 servers.

14 Hadoop Architecture: Tomorrow. Processing: low-cost nodes, GPU nodes, FPGA nodes, memory nodes. Storage: Hitachi All-Flash-Array, spinning-disk nodes, object storage.

15 Hadoop Architecture: Tomorrow (cont.). Processing: low-cost nodes, GPU nodes, FPGA nodes, memory nodes. Storage: SSD nodes, spinning-disk nodes, Hitachi HCP.

16 Hadoop Architecture: Tomorrow (cont.). Processing: low-cost nodes, GPU nodes, FPGA nodes, memory nodes. Storage: Hitachi All-Flash-Array, Hitachi HCP, Hitachi HCP.

17 HDFS Storage Tiering and Archives

18 HDFS Storage Types. DISK: default storage type; hard disk drives. SSD: solid state drives; fast, but expensive. ARCHIVE: archival drives; slow, but cheap; object storage, cloud storage. RAM_DISK: memory drives; very fast with limited capacity.
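In HDFS, a DataNode directory gets its storage type from a tag such as `[SSD]` or `[ARCHIVE]` prefixed to its entry in the `dfs.datanode.data.dir` property; an entry without a tag counts as DISK. The sketch below parses such a value to show the mapping. The `parse_data_dirs` helper and the example paths are hypothetical, and the parser is simplified (exact, upper-case tags only).

```python
import re

VALID_TYPES = {"DISK", "SSD", "ARCHIVE", "RAM_DISK"}
_TAGGED = re.compile(r"^\[(\w+)\](.*)$")

def parse_data_dirs(value):
    """Split a dfs.datanode.data.dir value into (storage_type, path) pairs."""
    result = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        match = _TAGGED.match(entry)
        if match:
            storage_type, path = match.group(1), match.group(2)
        else:
            # Untagged directories use the default storage type.
            storage_type, path = "DISK", entry
        if storage_type not in VALID_TYPES:
            raise ValueError(f"unknown storage type: {storage_type}")
        result.append((storage_type, path))
    return result

if __name__ == "__main__":
    # Hypothetical directory layout of one DataNode.
    value = ("[SSD]file:///grid/dn/ssd0,"
             "file:///grid/dn/disk0,"
             "[ARCHIVE]file:///grid/dn/archive0")
    for storage_type, path in parse_data_dirs(value):
        print(f"{storage_type:8} {path}")
```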

19 HDFS Storage Policies: assignment of the individual storage types to storage policies. HOT: current data, read and write; all replicas on DISK. WARM: current and old data, mostly read; replicas on DISK and ARCHIVE. COLD: only for old data, read-only; all replicas on ARCHIVE. All_SSD: all replicas on SSD. One_SSD: one replica on SSD, n-1 on DISK. Lazy_Persist: one replica on RAM_DISK. Figure labels: HOT, HOT, WARM, COLD.
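The policy table on slide 19 can be written down as a small lookup that returns, for a replication factor n, how many replicas go to each storage type. This is a sketch, not the HDFS implementation: `replicas_per_storage_type` is a hypothetical helper, the WARM split (1 on DISK, n-1 on ARCHIVE) is taken from slide 21, and the placement of the remaining Lazy_Persist replicas on DISK follows the HDFS documentation rather than the slide.

```python
def replicas_per_storage_type(policy, n):
    """Return {storage_type: replica_count} for replication factor n."""
    if n < 1:
        raise ValueError("replication factor must be at least 1")
    layouts = {
        "HOT": {"DISK": n},
        "WARM": {"DISK": 1, "ARCHIVE": n - 1},
        "COLD": {"ARCHIVE": n},
        "ALL_SSD": {"SSD": n},
        "ONE_SSD": {"SSD": 1, "DISK": n - 1},
        # Assumption from the HDFS docs: the other replicas go to DISK.
        "LAZY_PERSIST": {"RAM_DISK": 1, "DISK": n - 1},
    }
    layout = layouts[policy.upper()]
    # Drop storage types that receive no replica (for example when n == 1).
    return {stype: count for stype, count in layout.items() if count > 0}

if __name__ == "__main__":
    for name in ("HOT", "WARM", "COLD", "All_SSD", "One_SSD", "Lazy_Persist"):
        print(f"{name:13}", replicas_per_storage_type(name, 3))
```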

20 Without HDFS Storage Tiering. HDFS policies HOT, WARM and COLD: all replicas of every HDFS block on DISK. Storage: DISK only.

21 With HDFS Storage Tiering. HOT: n replicas on DISK. WARM: 1 replica on DISK, n-1 replicas on ARCHIVE. COLD: n replicas on ARCHIVE. Storage: DISK and ARCHIVE.
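Moving a directory to a colder tier is typically a two-step operation with the stock HDFS command line: `hdfs storagepolicies -setStoragePolicy` assigns the policy, and `hdfs mover` migrates already written blocks so that they satisfy it. The sketch below only builds and prints those commands; it does not run them. The `tiering_commands` helper and the path `/data/logs/2017` are hypothetical.

```python
import shlex

def tiering_commands(path, policy):
    """Build the HDFS CLI calls that apply a storage policy to a path."""
    return [
        # 1. Assign the storage policy to the file or directory.
        ["hdfs", "storagepolicies", "-setStoragePolicy",
         "-path", path, "-policy", policy],
        # 2. Migrate existing block replicas to match the new policy.
        ["hdfs", "mover", "-p", path],
        # 3. Read the policy back as a check.
        ["hdfs", "storagepolicies", "-getStoragePolicy", "-path", path],
    ]

if __name__ == "__main__":
    for argv in tiering_commands("/data/logs/2017", "COLD"):
        print(shlex.join(argv))
```

Setting the policy alone affects only newly written blocks, which is why the mover step is listed separately.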

22 Hadoop Archives: HAR Files. A layered filesystem on top of HDFS. The archive is created with a MapReduce job. Archiving reduces memory consumption on the Name Node. Each HAR file access reads two index files and one data file. Nothing changes for a client using the HAR filesystem apart from the URI scheme (hdfs:// becomes har://). Figure: Master Index, Index, Data, containing the archived files. Source:
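As a sketch of how this looks in practice, the helpers below build the `hadoop archive` command that creates a HAR file and the `har://` URI under which a client would then address one of the archived files. The helper names and all paths are hypothetical; the commands are only constructed, not executed.

```python
import posixpath

def har_create_command(archive_name, parent, sources, dest):
    """Build: hadoop archive -archiveName <name>.har -p <parent> <src>* <dest>."""
    if not archive_name.endswith(".har"):
        raise ValueError("archive name must end in .har")
    return ["hadoop", "archive", "-archiveName", archive_name,
            "-p", parent, *sources, dest]

def har_uri(dest, archive_name, relative_file):
    """URI of a file inside the archive, on the default filesystem."""
    if not dest.startswith("/"):
        raise ValueError("dest must be an absolute HDFS path")
    return "har://" + posixpath.join(dest, archive_name, relative_file)

if __name__ == "__main__":
    # Hypothetical example: archive two directories below /user/hadoop.
    cmd = har_create_command("logs.har", "/user/hadoop",
                             ["dir1", "dir2"], "/user/archive")
    print(" ".join(cmd))
    print(har_uri("/user/archive", "logs.har", "dir1/part-00000"))
```

The only client-visible change is the scheme and the archive name in the path; listing or reading works through the same filesystem API.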

23 Hadoop Archives: HAR Files (cont.). HDFS Name Node: holds the metadata of each block stored in HDFS. Hadoop Archive File (HAR): holds the metadata of each block stored in the HAR file. Storage: DISK, ARCHIVE.
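A back-of-the-envelope calculation shows why moving per-file metadata out of the Name Node matters for small files. The sketch assumes a commonly cited rule of thumb of roughly 150 bytes of Name Node heap per file or block object and a 128 MB block size; both numbers are assumptions for illustration, not measurements, and the function names are hypothetical.

```python
import math

BYTES_PER_OBJECT = 150           # assumption: rule-of-thumb heap cost per object
BLOCK_SIZE = 128 * 1024 ** 2     # assumption: 128 MB HDFS block size

def namenode_objects(file_sizes, block_size=BLOCK_SIZE):
    """Count the file and block objects the Name Node has to track."""
    objects = 0
    for size in file_sizes:
        blocks = math.ceil(size / block_size)  # an empty file has no blocks
        objects += 1 + blocks                  # one file object plus its blocks
    return objects

def estimated_heap_bytes(file_sizes, block_size=BLOCK_SIZE):
    """Rough Name Node heap estimate for the given files."""
    return namenode_objects(file_sizes, block_size) * BYTES_PER_OBJECT

if __name__ == "__main__":
    small = [4 * 1024] * 1_000_000   # one million 4 KB files
    packed = [sum(small)]            # the same bytes in one large data file
    print("small files:", estimated_heap_bytes(small), "bytes")
    print("packed file:", estimated_heap_bytes(packed), "bytes")
```

A real HAR file adds its index files on top of the packed data file, so the saving is slightly smaller than this comparison suggests, but the order of magnitude is the point.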

24 Thank You. Sven Bauernfeind, Computacenter AG & Co. oHG, Consultancy Germany
