Correlation based File Prefetching Approach for Hadoop


IEEE 2nd International Conference on Cloud Computing Technology and Science Correlation based File Prefetching Approach for Hadoop Bo Dong 1, Xiao Zhong 2, Qinghua Zheng 1, Lirong Jian 2, Jian Liu 1, Jie Qiu 2, Ying Li 2 1 Xi'an Jiaotong University 2 IBM Research-China

Agenda 1. Introduction 2. Challenges to HDFS Prefetching 3. Correlation based file prefetching for HDFS 4. Experimental evaluation 5. Conclusion

Motivation (1/2) Hadoop is prevalent! Internet service file systems are extensively developed for data management in large-scale Internet services and cloud computing platforms. The Hadoop Distributed File System (HDFS), a representative Internet service file system, has been widely adopted to support diverse Internet applications. The latency of accessing data from file systems constitutes the major overhead of processing user requests. Due to the access mechanism of HDFS, the latency of reading files from HDFS is significant, especially when accessing massive numbers of small files.

Motivation (2/2) Prefetching Prefetching is considered an effective way to alleviate access latency. It hides visible I/O cost and reduces response time by exploiting access locality and fetching data into cache before they are requested. HDFS currently does not provide a prefetching function. It is noteworthy that users usually browse through files following certain access locality when visiting Internet applications. Consequently, correlation based file prefetching can be exploited to reduce access latency for Hadoop-based Internet applications. Due to certain characteristics of HDFS, HDFS prefetching is more sophisticated than prefetching in traditional file systems.

Contributions A two-level correlation based file prefetching approach is presented to reduce the access latency of HDFS for Hadoop-based Internet applications. Major contributions: A two-level correlation based file prefetching mechanism, including metadata prefetching and block prefetching, is designed for HDFS. Four placement patterns for where to store prefetched metadata and blocks are studied, with policies to achieve a trade-off between the performance and efficiency of HDFS prefetching. A dynamic replica selection algorithm for HDFS prefetching is investigated to improve the efficiency of HDFS prefetching and to keep the workload of the HDFS cluster balanced.

Agenda 1. Introduction 2. Challenges to HDFS Prefetching 3. Correlation based file prefetching for HDFS 4. Experimental evaluation 5. Conclusion

Challenges to HDFS Prefetching (1/2) Because of the master-slave architecture, HDFS prefetching is two-level: metadata is prefetched from the NameNode, and blocks are prefetched from DataNodes. The consistency between these two kinds of operations must be guaranteed. Due to the direct client access mechanism and the curse of singleton, the NameNode is a crucial bottleneck for HDFS performance. The proposed prefetching scheme should therefore be lightweight and help mitigate the overload of the NameNode effectively.

Challenges to HDFS Prefetching (2/2) Invalidated cached data: when failures of blocks or DataNodes are detected, HDFS automatically creates alternate replicas on other DataNodes. In addition, for the sake of load balance, HDFS dynamically forwards blocks to other specified DataNodes. These operations generate invalidated cached data, which needs to be handled appropriately. The determination of what and how much data to prefetch is non-trivial because of the data replication mechanism of HDFS: besides the closest distance to the client, the replication factor and the workload balance of the HDFS cluster should also be considered.
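The invalidation problem above can be sketched in a few lines. This is a minimal illustration, not actual HDFS code: the cache layout and names (`metadata_cache`, `blocks`, `nodes`) are hypothetical, assuming each cached metadata record lists a file's blocks and the DataNodes holding them.

```python
# Hypothetical sketch: when HDFS re-replicates or rebalances a block,
# any cached metadata still pointing at the old DataNode is stale and
# must be dropped so the next access re-fetches fresh metadata.

def invalidate_moved_block(metadata_cache, block_id, old_node):
    """Remove cached file records whose placement for block_id
    still references old_node; return the invalidated file names."""
    stale = [fname for fname, meta in metadata_cache.items()
             if any(blk["id"] == block_id and old_node in blk["nodes"]
                    for blk in meta["blocks"])]
    for fname in stale:
        del metadata_cache[fname]
    return stale
```

A lazier alternative would be to keep the stale entry and validate block locations on use; eager invalidation keeps the cache small at the cost of extra NameNode round trips.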

Agenda 1. Introduction 2. Challenges to HDFS Prefetching 3. Correlation based file prefetching for HDFS 4. Experimental evaluation 5. Conclusion

Overview of HDFS prefetching (1/3) In HDFS prefetching, four fundamental issues need to be addressed: what to prefetch, how much data to prefetch, when to trigger prefetching, and where to store prefetched data. What to prefetch Metadata and blocks of correlated files will be prefetched from the NameNode and DataNodes respectively. Metadata prefetching reduces access latency on the metadata server and can even overcome the per-file metadata server interaction problem. Block prefetching is used to reduce visible I/O cost and even mitigate the network transfer latency.

Overview of HDFS prefetching (2/3) What to prefetch What to prefetch is predicted at the file level rather than the block level. Based on file correlation mining and analysis, correlated files are predicted. When a prefetching activity is triggered, the metadata and blocks of correlated files will be prefetched. How much data to prefetch The amount of to-be-prefetched data is decided at three levels: file, block and replica. At the file level, the number of files is decided by the specific file correlation prediction algorithm. At the block level, all blocks of a file will be prefetched. At the replica level, how much replica metadata and how many replicas to prefetch need to be determined.

Overview of HDFS prefetching (3/3) When to prefetch When to prefetch controls how early the next prefetching activity is triggered. Here, the Prefetch-on-Miss strategy is adopted: when a read request misses in the metadata cache, a prefetching activity covering both metadata prefetching and block prefetching is activated.
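The Prefetch-on-Miss trigger can be sketched as a small client-side metadata cache. This is an assumed illustration, not the paper's implementation: `predict_correlated` stands in for the file correlation prediction algorithm and `fetch_metadata` for the NameNode metadata RPC.

```python
class MetadataCache:
    """Sketch of a client metadata cache with Prefetch-on-Miss:
    a cache miss serves the request, then prefetches metadata of
    correlated files before they are asked for."""

    def __init__(self, predict_correlated, fetch_metadata):
        self._cache = {}                    # filename -> metadata
        self._predict = predict_correlated  # filename -> correlated files
        self._fetch = fetch_metadata        # filename -> metadata (RPC)

    def get(self, filename):
        if filename in self._cache:         # cache hit: no prefetch
            return self._cache[filename]
        # Cache miss: fetch the requested file, then prefetch.
        meta = self._cache[filename] = self._fetch(filename)
        for f in self._predict(filename):   # block prefetching would be
            if f not in self._cache:        # triggered at the same point
                self._cache[f] = self._fetch(f)
        return meta
```

Subsequent requests for the correlated files then hit the cache, which is exactly how the approach hides per-file NameNode interactions.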

Placement patterns about where to store prefetched data (1/2) Four combinations exist: prefetched metadata can be stored on the NameNode or on HDFS clients, and prefetched blocks can be stored on DataNodes or on HDFS clients.

Features                                                 | NC-pattern   | CC-pattern   | ND-pattern | CD-pattern
Where to store prefetched metadata                       | NameNode     | HDFS clients | NameNode   | HDFS clients
Where to store prefetched blocks                         | HDFS clients | HDFS clients | DataNodes  | DataNodes
Reduce the access latency on metadata server             | Y            | Y            | Y          | Y
Address the per-file metadata server interaction problem | N            | Y            | N          | Y
Mitigate the overhead of NameNode                        | Maybe        | Y            | Maybe      | Y
Reduce the visible I/O cost                              | Y            | Y            | Y          | Y
Reduce the network transfer latency                      | Y            | Y            | N          | N
Aggravate the network transfer overhead                  | Y            | Y            | N          | N

Placement patterns about where to store prefetched data (2/2) A strategy to determine where to store prefetched metadata and blocks is investigated. According to the features of the above four patterns, the following factors are taken into account when deciding which pattern to adopt: the accuracy of file correlation prediction, resource constraints, and the real-time workload of the HDFS cluster and clients. Four policies (the last one will be introduced later): When the main memory on an HDFS client is limited, ND-pattern and CD-pattern are more appealing, since in the other two patterns cached blocks would occupy a large amount of client memory. When the accuracy of correlation prediction results is high, NC-pattern and CC-pattern are better choices, because cached blocks on a client can be accessed directly by subsequent requests. In the case of high network transfer workload, ND-pattern and CD-pattern are adopted to avoid additional traffic.

Architecture CD-pattern prefetching is taken as the representative to present the architecture and process of HDFS prefetching. HDFS prefetching is composed of four core modules. The prefetching module is situated on the NameNode; it deals with prefetching details and is the actual performer of prefetching. The file correlation detection module is also situated on the NameNode; it is responsible for predicting what to prefetch based on file correlations, and consists of two components: a file correlation mining engine and a file correlation base. The metadata cache lies on each HDFS client and stores prefetched metadata; each record is the metadata of a file (including the metadata of its blocks and all replicas). The block cache is located on each DataNode and stores blocks prefetched from disks. [Architecture diagram: command and data streams among the clients, the NameNode (prefetching module, file correlation detection module) and the DataNodes (block caches).]

Dynamic replica selection algorithm (1/2) A dynamic replica selection algorithm is introduced for HDFS prefetching to prevent hot DataNodes which frequently serve prefetching requests. Currently, HDFS decides the replica sequence by the physical distances (or rather, rack-aware distances) between the requesting client and the DataNodes where the replicas of a block are located:

    S = arg min_{d ∈ D} { D_d }    (1)

where D_d is the distance between a DataNode d and the specified HDFS client. This is a static method, because it considers only physical distances.
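Equation (1) amounts to a one-line selection. A minimal sketch, with `distance` standing in for the rack-aware distance table D_d (the names are illustrative, not HDFS API):

```python
# Static replica selection per equation (1): pick the replica whose
# DataNode has the smallest rack-aware distance to the client.

def select_replica_static(replicas, distance):
    """Return the DataNode d in replicas minimizing D_d."""
    return min(replicas, key=lambda d: distance[d])
```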

Dynamic replica selection algorithm (2/2) Besides distances, the workload balance of the HDFS cluster should also be taken into account. Specifically, to determine the closest replica, more factors are considered, including the capabilities, real-time workloads and throughputs of DataNodes:

    S = arg min_{d ∈ D} { f(D_d, C_d, W_d, T_d) }    (2)

where
    D_d : the distance between a DataNode d and the specified HDFS client
    C_d : the capability of DataNode d, including its CPU, memory and so on
    W_d : the workload of DataNode d, including its CPU utilization rate, memory utilization rate and so on
    T_d : the throughput of DataNode d
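The slides leave the combining function f unspecified. A sketch under stated assumptions: f is taken to be a linear combination where distance, current workload and current throughput (as load pressure) raise a node's score and capability lowers it; the weights are arbitrary placeholders, and the inputs are assumed to be pre-normalized to comparable scales.

```python
# Hypothetical instantiation of equation (2). The weights and the
# linear form of f are assumptions, not from the paper.

def select_replica_dynamic(replicas, D, C, W, T,
                           weights=(0.4, 0.2, 0.2, 0.2)):
    """Return the DataNode d minimizing f(D_d, C_d, W_d, T_d):
    prefer close, capable, lightly loaded nodes."""
    wd, wc, ww, wt = weights
    def score(d):
        # capability is subtracted: a stronger node is more attractive
        return wd * D[d] - wc * C[d] + ww * W[d] + wt * T[d]
    return min(replicas, key=score)
```

Because W_d and T_d change over time, re-evaluating this score per request is what spreads prefetching load across replicas instead of repeatedly hitting the physically closest DataNode.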

Agenda 1. Introduction 2. Challenges to HDFS Prefetching 3. Correlation based file prefetching for HDFS 4. Experimental evaluation 5. Conclusion

Experimental environment and workload The test platform is built on a cluster of 9 PC servers. One node, an IBM X3650 server, acts as the NameNode; it has 8 Intel Xeon 2.00 GHz CPU cores, 16 GB memory and a 3 TB disk. The other 8 nodes, IBM X3610 servers, act as DataNodes; each has 8 Intel Xeon 2.00 GHz CPU cores, 8 GB memory and a 3 TB disk. File set A file set containing ,000 files is stored on HDFS. The file set, whose total size is 305.38 GB, is part of the real files in BlueSky. The files range from 32 KB to 16,529 KB in size, and files between 256 KB and 8,192 KB account for 78% of the total. Accuracy rates of file correlation predictions In the experiments, accuracy rates are set to , , and respectively.

Experiment results of original HDFS Aggregate read throughput does not always grow as the number of HDFS clients per server increases: the growth is not obvious when the client number increases from 4 to 8, and throughput even decreases when it increases from 8 to 16. Throughput reaches its maximum at 8 clients.

Number of HDFS clients in a server | Total download time of files (sec) | Aggregate read throughput (MB/sec) | CPU utilization rate of NameNode | Network traffic (MB/sec)
 1 |  30.276 | 10.33 | 1.10% |  1.18
 2 |  36.312 | 17.23 | 1.27% |  2.30
 4 |  50.512 | 24.88 | 1.30% |  4.34
 8 |  91.108 | 27.60 | 1.37% |  7.54
16 | 236.242 | 21.21 | 1.30% | 10.53

The results of the original HDFS experiments at different client numbers are treated as base values. The performance of each HDFS prefetching pattern is compared against the base values measured at the same client number.

Experiment results of NC-pattern prefetching [Charts: Total Download Time, Average CPU Utilization of NameNode, Aggregate Read Throughput, Network Traffic] NC-pattern prefetching is aggressive: it prefetches blocks to clients so as to alleviate the network transfer latency. The number of HDFS clients and the accuracy rate of file correlation predictions greatly influence performance. When client numbers are large and accuracy rates are low, aggregate throughput is reduced by 60.07%. On the other hand, when client numbers are small and accuracy rates are high, NC-pattern prefetching generates an average 1709.27% throughput improvement while using about 5.13% less CPU.

Experiment results of CC-pattern prefetching [Charts: Total Download Time, Average CPU Utilization of NameNode, Aggregate Read Throughput, Network Traffic] CC-pattern prefetching is aggressive: it prefetches blocks to clients so as to alleviate the network transfer latency. The number of HDFS clients and the accuracy rate of file correlation predictions greatly influence performance. When client numbers are large and accuracy rates are low, aggregate throughput is reduced by 48.04%. On the other hand, when client numbers are small and accuracy rates are high, CC-pattern prefetching generates a 114.06% throughput improvement while using about 6.43% less CPU.

Experiment results of ND-pattern prefetching [Charts: Total Download Time, Average CPU Utilization of NameNode, Aggregate Read Throughput, Network Traffic] ND-pattern prefetching is conservative. Compared with NC-pattern and CC-pattern, it generates relatively small but stable performance enhancements. In situations where aggregate read throughput is improved, it generates an average 2.99% throughput improvement while using about 6.87% less CPU. Even in situations where aggregate read throughput is reduced, the average reduction is just 1.90%. Moreover, the number of HDFS clients and the accuracy rate of file correlation predictions have a trivial influence on performance.

Experiment results of CD-pattern prefetching [Charts: Total Download Time, Average CPU Utilization of NameNode, Aggregate Read Throughput, Network Traffic] CD-pattern prefetching is conservative. Compared with NC-pattern and CC-pattern, it generates relatively small but stable performance enhancements. In situations where aggregate read throughput is improved, it generates a 3.72% throughput improvement while using about 7. less CPU. Even in situations where aggregate read throughput is reduced, the average reduction is just 1.96%. Moreover, the number of HDFS clients and the accuracy rate of file correlation predictions have a trivial influence on performance.

Placement patterns about where to store prefetched data Based on the above experiment results, the fourth policy for choosing an HDFS prefetching pattern is obtained: when correlation prediction results are unstable, the conservative prefetching patterns (ND-pattern and CD-pattern) are more appropriate, because they avoid the performance deterioration caused by low accuracy rates.

Agenda 1. Introduction 2. Challenges to HDFS Prefetching 3. Correlation based file prefetching for HDFS 4. Experimental evaluation 5. Conclusion

Conclusion A prefetching mechanism is studied for HDFS to alleviate access latency. First, the challenges to HDFS prefetching are identified. Then a two-level correlation based file prefetching approach is presented for HDFS. Furthermore, four placement patterns for storing prefetched metadata and blocks are presented, and four policies on how to choose among those patterns are investigated to achieve a trade-off between the performance and efficiency of HDFS prefetching. In particular, a dynamic replica selection algorithm is investigated to improve the efficiency of HDFS prefetching. Experimental results show that HDFS prefetching can significantly reduce the access latency of HDFS and the overload of the NameNode, thereby improving the performance of Hadoop-based Internet applications.