Analytics in the cloud

1 Analytics in the cloud: Do we really need to reinvent the storage stack?
R. Ananthanarayanan, Karan Gupta, Prashant Pandey, Himabindu Pucha, Prasenjit Sarkar, Mansi Shah, Renu Tewari
Image courtesy NASA / ESA
2002 IBM Corporation

2 Data-Intensive Internet Scale Applications
Typical Applications
- Web-scale search, indexing, mining
- Genomic sequencing
- Brain-scale network simulations

3 Data-Intensive Internet Scale Applications
Key Requirements
- Scale to very large data sets
- Platform needs to scale to 1000s of nodes
- Built of commodity hardware for cost efficiency
- Tolerate failures during every job execution
- Support data shipping to reduce network requirements

4 MapReduce for analytics
- MapReduce is emerging as a model for large-scale analytics applications
- Important design goals are extreme scalability and fault tolerance
- Storage layer is separated and has well-defined requirements

5 MapReduce Data-store requirements
- Provide a hierarchical namespace with directories and files
- Allow applications to read/write data to files
- Protect data availability and reliability in the face of node and disk failures
- Provide high-bandwidth access to reasonably sized chunks of data to all compute nodes (not necessarily all-to-all)
- Provide chunk access-affinity information to allow proper scheduling of tasks
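The last requirement, chunk access-affinity information, is what lets a scheduler run a task on a node that already holds the chunk. The following is a minimal sketch of that idea, not the scheduler of Hadoop or of any particular system. The function name `schedule_tasks` and both input dictionaries are hypothetical stand-ins for whatever location and capacity information the data store and the platform expose.

```python
def schedule_tasks(chunk_locations, free_slots):
    """Greedily assign one task per chunk, preferring nodes that hold the chunk.

    chunk_locations: dict chunk id -> list of nodes holding a replica
                     (hypothetical stand-in for the store's affinity info)
    free_slots:      dict node -> number of free task slots (hypothetical)
    Returns a dict chunk id -> (node, is_local).
    """
    slots = dict(free_slots)  # copy so the caller's dict is not mutated
    assignment = {}
    for chunk_id in sorted(chunk_locations):
        holders = chunk_locations[chunk_id]
        # Prefer a replica holder that still has a free slot (data-local task).
        local = [n for n in holders if slots.get(n, 0) > 0]
        if local:
            node, is_local = max(local, key=lambda n: slots[n]), True
        else:
            # No holder has capacity: run elsewhere and read over the network.
            candidates = [n for n, s in slots.items() if s > 0]
            if not candidates:
                raise RuntimeError("no free task slots")
            node, is_local = max(candidates, key=lambda n: slots[n]), False
        slots[node] -= 1
        assignment[chunk_id] = (node, is_local)
    return assignment
```

Without location information every assignment would fall into the second branch by chance, which is why the requirement is listed separately from raw bandwidth.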

6 Data store options: Cluster FS vs. Specialized FS
Columns: Specialized FS, Cluster FS
- Scaling
- Commodity hardware compliant

7 Data store options: Cluster FS vs. Specialized FS
Columns: Specialized FS, Cluster FS
- Scaling
- Commodity hardware compliant
- Traditional application support: No
- Mature management tools: No

8 Data store options: Cluster FS vs. Specialized FS
Columns: Specialized FS, Cluster FS
- Scaling
- Commodity hardware compliant
- Traditional application support: No
- Mature management tools: No
- Tuned for Hadoop: No

9 Modifying a Cluster Filesystem for MapReduce
GPFS
- Mature filesystem: many large production installations
- High performance, highly scalable
- Reliability features focused on SAN environments
- Supports rack-aware 2-way replication
- POSIX interface
- Supports shared-disk (SAN) and shared-nothing setups
Not optimized for MapReduce workloads
- Does not expose data location information
- Largest block size = 16 MB
Changes for Hadoop
- Make blocks bigger
- Let the platform know where the big blocks are
- Optimize replication and placement to reduce network usage
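To illustrate what rack-aware 2-way replication with network-friendly placement can look like, here is a minimal sketch. It assumes one copy stays on the writing node and the second goes to another rack when one exists. That policy, the function name `place_replicas`, and the `node_racks` map are assumptions for illustration only; the slides do not describe the actual GPFS placement algorithm.

```python
def place_replicas(writer, node_racks):
    """Pick two nodes for a block: the writer, plus one node on another rack.

    node_racks: dict node -> rack name (hypothetical topology map)
    Assumed policy: local first copy avoids a network transfer on write;
    an off-rack second copy survives the loss of a whole rack.
    """
    if writer not in node_racks:
        raise ValueError(f"unknown node: {writer}")
    writer_rack = node_racks[writer]
    others = sorted(n for n in node_racks if n != writer)
    if not others:
        raise ValueError("need at least two nodes for 2-way replication")
    # Prefer a different rack; fall back to the same rack if there is only one.
    off_rack = [n for n in others if node_racks[n] != writer_rack]
    second = off_rack[0] if off_rack else others[0]
    return [writer, second]
```

With this policy only one of the two copies crosses the network at write time, which is one way to read "reduce network usage".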

10 Key change: Metablocks
- Works for many workloads
  - Small FS blocks (e.g. 512K)
  - Large application blocks (e.g. 64M)
- New allocation scheme
  - Metablock size granularity for wide striping
- New allocation policy
  - Block map operates on large metablock size
  - All FS operations operate on small regular block size
- Additional changes to provide block location information and write affinity
Figure labels: FS block, application metablock
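The relationship between the two block sizes is plain arithmetic, sketched below with the example sizes from the slide (512K FS blocks, 64M metablocks). This shows only the offset calculation, not the GPFS allocator or block map; the names `FS_BLOCK`, `METABLOCK`, and `locate` are hypothetical.

```python
FS_BLOCK = 512 * 1024          # small filesystem block (slide example: 512K)
METABLOCK = 64 * 1024 * 1024   # large application block (slide example: 64M)
BLOCKS_PER_METABLOCK = METABLOCK // FS_BLOCK  # FS blocks grouped per metablock

def locate(offset):
    """Map a byte offset in a file to
    (metablock index, FS block index inside that metablock, byte inside that FS block)."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    metablock, within = divmod(offset, METABLOCK)
    fs_block, byte = divmod(within, FS_BLOCK)
    return metablock, fs_block, byte
```

With these sizes each metablock groups 128 FS blocks, so placement can be decided once per 64M region while reads and writes still operate on 512K blocks.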

11 MapReduce performance
Figure: Time (sec) for Grep, TeraGen, and TeraSort; series: GPFS rack=1, HDFS rack=1, GPFS rack=2, HDFS rack=2
Test bed
- iDataPlex: 42 nodes
- 8 cores, 8 GB RAM
- 4+1 disks per node
- Hadoop: version
- GPFS: version pre
- nodes, 160 GB data (replication factor = 2)

12 Impact on traditional workloads
Figure: normalized random performance and normalized sequential read performance; configurations: K; 512K, metablock 16M
Test bed
- iDataPlex: 42 nodes
- 8 cores, 8 GB RAM
- 4+1 disks per node
- GPFS: version pre3.3
- Bonnie filesystem benchmark

13 Things that didn't work
- Large filesystem block size
- Turning off prefetching
- Creating alignment of records to block boundaries
Figure: Time (sec) by file system: HDFS, GPFS16M, GPFS16ML, GPFS16MLA, GPFS16MLANP
Figure: normalized random performance and normalized sequential read performance; configurations: K; 16M; 16M, no-prefetch

14 Advantages of traditional filesystems
- Traditional filesystems have solved many hard problems like access control, quotas, and snapshots
- Allow traditional and MapReduce applications to share the same input data
- Exploit filesystem tools and scripts based on regular filesystems
- Re-use backup/archive solutions built around particular filesystems
- Mixed analytics pipelines
  - Using a MapReduce-specific filesystem (e.g. HDFS): Crawl (crawler writes to a traditional filesystem) -> Load (into MapReduce filesystem) -> Analyze -> Output (back to traditional filesystem) -> Serve
  - Using a general-purpose filesystem (e.g. GPFS): Crawl -> Analyze -> Serve

15 Conclusion
- MapReduce platforms can use traditional filesystems without loss of performance.
- There are important reasons why traditional filesystems are attractive to users of MapReduce.
