Storage on the Lunatic Fringe

Storage on the Lunatic Fringe Thomas M. Ruwart Chief [Mad] Scientist tmruwart@atrato.com SNIA Developers Conference San Jose, CA September 25, 2008

Why you are here
To learn that there are organizations with bigger data storage requirements than yours, what some of their issues and problems are, and how they are addressing those issues and problems. A glimpse into the future of data storage hardware and software technologies and possible solutions.

Orientation A bit of history Who are the lunatics in 2008-2009? What are their requirements? Why is this interesting to the Storage Industry? What is anyone doing about this? Conclusions

A bit of History
1988: Supercomputer centers operating with HUGE disk farms of 50-100 GB. 1 GB disk drives cost $20,000 each: 8-inch form factor, 60 lbs, 15 ms average seek time, 3600 RPM. [Photo: technician working on an IBM 3380 disk drive, 1986]
1995: 3.5-inch half-height disk drives are the standard form factor at 4 GB/disk (3600 RPM, 12 ms average seek time, 2 lbs, $2,000 per disk drive, $500/GB). Built a 1+TB array using 296 4 GB 3.5-inch disks: 37 RAID5 7+1 disk arrays mounted in 5 racks, more than $1M in disk arrays. Created a single SGI XFS file system across all the drives, and a single 1 TB file.
2002: ASCI Q, 700 TB online, high performance, pushing the limits of traditional [legacy] block-based file systems.
2004: ASCI Red Storm, 240 TB online, high bandwidth, massively parallel.

A bit more history
2002: ASCI Q, 700 TB online, high performance, pushing the limits of traditional [legacy] block-based file systems.
2004: ASCI Red Storm, 240 TB online, high bandwidth, massively parallel.
2006: ASC Purple at LLNL. 269 racks for the entire machine; 12,208 processors in 131 racks; 48 racks just for switches (17,000 cables); 2 PBytes of storage: >11,000 disks in 90 racks.
2008: Roadrunner at LANL. First PetaFlop machine (10^15 FLOPS); 6,912 AMD dual-core Opterons plus 12,960 IBM Cell eDP processors; 80 TB main memory (aggregate); 216 GB/sec sustained I/O to storage (432 x 10GigE).
See the Top500 list for complete details: www.top500.org

[Chart: Processor Count - number of processors by rank, ranks 1-500, log scale from 1 to 1,000,000 processors]

Looking Ahead
2009: ZIA, a joint development between Sandia National Laboratories and LANL. 2 PFLOPS, 256K processor cores, 2 PB disk storage, 1-2 TB/sec sustained bandwidth.
2011-2012: 10 PFLOPS and beyond. Blue Waters at NCSA, a University of Illinois and State of Illinois joint project for open peta-scale computing: >> 200,000 processors, >> 800 TB main memory, >> 10 PB disk storage.
Looking way ahead, 2020: Tom retires.

Who said that?
"I think there is a world market for maybe five computers." Thomas Watson (1874-1956), Chairman of IBM, 1943
"There is no reason anyone would want a computer in their home." Ken Olson, president, chairman and founder of Digital Equipment Corp., 1977
"640K ought to be enough for anybody." Bill Gates (1955-), in 1981
"Who the hell wants to hear actors talk?" H. M. Warner (1881-1958), founder of Warner Brothers, in 1927
"Everything that can be invented has been invented." Charles H. Duell, Commissioner, U.S. Office of Patents, 1899

Who would say this?
Who on earth needs an ExaByte (10^18 bytes) of storage space? Who needs a TeraByte-per-second data transfer rate from storage to the application? Who needs millions, billions, trillions of data transactions per second? Who would ever need to manage a trillion files? You did not hear these questions from me.

Who are the Lunatics?
High-End Computing (HEC) community: BIG data or LOTS of data, locally and widely distributed, high-bandwidth access or high transaction rate, relatively few users, secure, short-term and long-term retention.
High Energy Physics (HEP): Fermilab, CERN, DESY. BIG data, locally distributed, widely available, moderate number of users, sparse access, long-term retention.
DARPA HPCS sets the requirements.

HEP: the LHC at CERN
The LHC (http://lhc.web.cern.ch/lhc/) is a $750M experiment built at CERN in Switzerland, activating this year (2008). Holy black holes, Batman.
The easy part: collecting the data. The data rate from the detectors is ~1 PB/sec; the data rate after filtering is a few GB/sec.
The hard part: storing and accessing it. The dataset for a single experiment is ~1 PB, and several experiments are run per year. The data must be made available to 5,000 scientists all over the planet (Earth, primarily) for the next 10-25 years. Dense dataset, sparse data access by any one scientist; access patterns are not deterministic.

LHC Data Grid Hierarchy (CMS as example; Atlas is similar)
[Diagram, courtesy Harvey Newman, Caltech and CERN: the CMS detector (15m x 15m x 22m, 12,500 tons, $700M) feeds the Online System at ~PByte/sec, which feeds Tier 0+1 at CERN (event reconstruction and simulation) at ~100 MBytes/sec. Tier 0+1 feeds Tier 1 regional centers (France, Germany, Italy, FermiLab USA) at ~2.5 Gbits/sec; Tier 2 centers connect at ~0.6-2.5 Gbps; Tier 3 institutes (~0.25 TIPS each) hold physics data caches for analysis; Tier 4 workstations connect at 100-1000 Mbits/sec.]
CERN/CMS data goes to 6-8 Tier 1 regional centers, and from each of these to 6-10 Tier 2 centers. Physicists work on analysis channels at 135 institutes; each institute has ~10 physicists working on one or more channels. 2,000 physicists in 31 countries are involved in this 20-year experiment, in which DOE is a major player.

What are the DARPA requirements? (HEC community)
The High Productivity Computing Systems (HPCS) program from DARPA:
10^15 computations per second (peta-scale computing)
1-10 trillion files in a single file system
100s of thousands of processors
Millions of process threads, all needing and generating data
1-100 TBytes/sec aggregate bandwidth to disk
30,000+ file creations per second
Focus on ease of use, efficiency, and RAS

Why is the Number of Processors Important?
It is an indicator of the number of independent program threads that need access to storage. When the number of processors is greater than the number of disks, I/O will be random. We are past the age of purely sequential bandwidth and now in the age of purely random data access patterns, strictly as a result of the computer architecture.

What are we getting ourselves into? What is 1 TB/sec bandwidth to disk?
20,000 disk drives @ 50 MB/sec/disk average (assumes no seeks)
2 million IOPS @ 10 ms average access time
20 PB raw capacity @ 1 TB/disk
500 KWatts @ 25 watts/disk (including cooling power)
24,000-40,000 disk drives in a real design, to include redundancy; space and power/cooling increase up to 2x, to ~1 MWatt.
And that is just the beginning: 10 TB/sec would be up to 400,000 disk drives.
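To make the arithmetic above easy to check, here is a minimal Python sketch using only the per-drive figures quoted on the slide (50 MB/sec, 10 ms access, 1 TB, 25 W per drive); the numbers and the doubling for redundancy are the slide's estimates, not independent measurements.

```python
# Back-of-the-envelope sizing for a 1 TB/sec disk system,
# using the per-drive figures quoted on the slide.
TARGET_BW = 1e12          # bytes/sec (1 TB/sec)
DRIVE_BW = 50e6           # bytes/sec per drive, streaming, no seeks
ACCESS_TIME = 0.010       # seconds (10 ms average access)
DRIVE_CAPACITY = 1e12     # bytes (1 TB per drive)
DRIVE_POWER = 25.0        # watts per drive, including cooling

drives = TARGET_BW / DRIVE_BW                       # 20,000 drives
iops = drives * (1.0 / ACCESS_TIME)                 # ~2 million IOPS
raw_capacity_pb = drives * DRIVE_CAPACITY / 1e15    # 20 PB
power_kw = drives * DRIVE_POWER / 1e3               # 500 kW

print(f"{drives:,.0f} drives, {iops:,.0f} IOPS, "
      f"{raw_capacity_pb:.0f} PB raw, {power_kw:.0f} kW")
# With redundancy and real packaging the slide roughly doubles these
# figures: 24,000-40,000 drives and about 1 MW.
```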

The Storage Event Horizon
1 GByte/sec: 20 disk drives
10 GBytes/sec: 200 disk drives
100 GBytes/sec: 2,000 disk drives
1 TByte/sec: 20,000 disk drives
~~~~~~~~~~ Storage Event Horizon ~~~~~~~~~~
10 TBytes/sec: 200,000 disk drives
100 TBytes/sec: 2,000,000 disk drives
1 PByte/sec: 20,000,000 disk drives

What does 1 TB/sec really mean? To what? 1,000 processes @ 1 GB/sec each? 100,000 processes @ 10 MB/sec each? This assumes a process/processor can absorb/generate data at that rate. The current rule of thumb, based on ZIA requirements (all machines are different), is roughly 1 byte/sec of I/O per 1,000 FLOPS. Therefore, 1 PFLOPS implies a 1 TB/sec I/O transfer rate, and 1 EFLOPS implies an I/O transfer rate of 1 PB/sec, or 20 million disk drives. Ooops!
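A minimal sketch of that rule of thumb: roughly 1 byte/sec of storage bandwidth per 1,000 floating-point operations per second (the slide's ZIA-derived ratio, not a universal constant), combined with the 50 MB/sec-per-drive figure from the previous slide.

```python
# Rule of thumb from the slide: ~1 byte/sec of storage bandwidth
# per 1,000 FLOPS (ZIA-derived; every machine is different).
FLOPS_PER_IO_BYTE = 1000.0
DRIVE_BW = 50e6  # bytes/sec per drive, as on the previous slide

def storage_for(flops):
    io_bytes_per_sec = flops / FLOPS_PER_IO_BYTE
    drives = io_bytes_per_sec / DRIVE_BW
    return io_bytes_per_sec, drives

for label, flops in [("1 PFLOPS", 1e15), ("1 EFLOPS", 1e18)]:
    bw, drives = storage_for(flops)
    print(f"{label}: {bw/1e12:,.0f} TB/sec of I/O, ~{drives:,.0f} drives")
# 1 PFLOPS -> 1 TB/sec (~20,000 drives); 1 EFLOPS -> 1,000 TB/sec
# (~20 million drives), the "Ooops" on the slide.
```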

Digging ourselves in deeper? 1 trillion files.
30,000 file creations per second for 1 year is roughly 1 trillion files, and ~1 PB of metadata to describe those 1 trillion files. Finding any one file among 1 trillion files, or finding anything inside the 1 trillion files, is a major transactional problem, not a bandwidth problem. Traditional file systems and their associated [POSIX] semantics break down at these scales; we need new/relaxed semantics. Is the concept of a file still valid in this context?
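The trillion-file and metadata figures follow from simple arithmetic; in the sketch below, the ~1 KB-of-metadata-per-file figure is inferred from the slide's 1 PB total rather than stated explicitly.

```python
# File-count and metadata arithmetic from the slide.
CREATES_PER_SEC = 30_000
SECONDS_PER_YEAR = 365 * 24 * 3600           # ~3.15e7

files_per_year = CREATES_PER_SEC * SECONDS_PER_YEAR
print(f"{files_per_year:.2e} files after one year")   # ~9.5e11, ~1 trillion

TOTAL_METADATA = 1e15                         # 1 PB, per the slide
metadata_per_file = TOTAL_METADATA / 1e12     # implies ~1 KB of metadata/file
print(f"~{metadata_per_file:.0f} bytes of metadata per file")
```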

The Growing Disk Drive Bottleneck

Subsystem (GBps except IOPS rows)               1993      2007E     Increase
Network I/O (Ethernet)                          0.001     2         2,000x
Intel CPU                                       0.48      100       200x
Storage Channel I/O (SCSI and Fibre Channel)    0.05      4         80x
PCI (PCI versus 16x PCIe)                       0.13      16        123x
Intel Front Side Processor Bus                  0.53      13        24x
Random Disk IOPS                                90        150       1.7x
Random Disk IOPS per GByte                      43        4.2       -10x
Sequential Disk I/O                             0.005     0.1       20x
Sequential Disk BW per GByte                    0.005     0.0001    -50x

Notes: sequential disk figures are the internal data rate of IBM 3.5-inch drives; random disk figures are based on IBM 3.5-inch drive seek plus rotational latency; IOPS-per-GByte figure per Horison/Fred Moore. Source: www.archivebuilders.com, "Evolution of Intel Microprocessors: 1971 to 2001".

Need more disks, not higher-capacity ones. Disk drive capacity improves faster than data transfer rate, seek time, and rotational latency.

Access Density

Serious Questions How do you package it? How do you maintain it? How do you connect it all together? How do you access/use a storage system with 250,000 disk drives?

How do you package this? Conservatively, 200 x 3.5-inch disks per rack with controllers means 200 racks of disk drives and controllers, occupying about 4,000 square feet. 10 TB/sec is 10 times this, or about the size of one football field (~40,000 sq ft).

How do you maintain it?
Assume a 40,000-disk configuration: 2,000,000 hours MTBF per enterprise-class disk, 500,000 hours MTBF per consumer-class disk. That is ~4 disk failures per week for enterprise-class disks and ~15 failures per week for consumer-class disks, with continual rebuilds in progress. 10 TB/sec is 10 times this.
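A rough sketch of the failure arithmetic above, assuming independent drive failures so that expected failures per week are approximately drive count times hours per week divided by MTBF; the slide's own figures may round differently.

```python
# Expected drive failures per week for a 40,000-drive system,
# assuming independent failures (failures/week ~= N * hours / MTBF).
DRIVES = 40_000
HOURS_PER_WEEK = 7 * 24

for label, mtbf_hours in [("Enterprise-class", 2_000_000),
                          ("Consumer-class", 500_000)]:
    failures_per_week = DRIVES * HOURS_PER_WEEK / mtbf_hours
    print(f"{label}: ~{failures_per_week:.1f} failures/week")
# ~3-4/week enterprise, ~13-15/week consumer: a rebuild is
# essentially always in progress, and 10 TB/sec is 10x this.
```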

How do you connect it all together?
At 10 Gbit/sec per channel, 1 TB/sec takes 1,000 channels @ 100% efficiency, which implies a 2,000-channel non-blocking switch fabric. What about transceiver failure rates? When it breaks, how do you find the broken transceiver? 10 TB/sec: who on earth would want to do that? (don't ask)
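The channel count is simple division; the sketch below assumes, as the slide appears to, that each 10 Gbit/sec channel carries roughly 1 GB/sec of payload and that a non-blocking fabric needs a port on each side of every channel.

```python
# Channel-count arithmetic from the slide: 10 Gbit/sec links carrying
# roughly 1 GB/sec of payload each (the slide's 100%-efficiency figure).
TARGET_BW = 1e12                 # bytes/sec (1 TB/sec)
PAYLOAD_PER_CHANNEL = 1e9        # bytes/sec per 10 Gbit/sec channel

channels = TARGET_BW / PAYLOAD_PER_CHANNEL        # 1,000 channels
switch_ports = 2 * channels                       # host side + storage side
print(f"{channels:,.0f} channels, {switch_ports:,.0f}-port non-blocking fabric")
```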

How do you use this?
Current file system technology is based on 30+ year-old designs and does not scale. The disk I/O software stack is 30+ years old and does not scale. We need lots of innovation in many areas:
Common shared file system interfaces
Data Life Cycle Management and seamless integration into existing HEC environments
Changes to standards that offer greater scalability without sacrificing data integrity
Streaming I/O from zillions of single nodes
Data alignment, small-block, large-block, and RAID issues
[Diagram: I/O stack layers (Application, Operating System, File System Metadata, Storage and Transport), shown for two nodes side by side]

Commodity Reliability And Practices
Processors, networks, and graphics engines have for the most part gone commodity; disk drives are still largely enterprise-class, but there is significant pressure to move toward more use of commodity disk drives. This requires a fundamental change in how we think about RAS for storage, i.e. Fail-In-Place: assume something is always in the process of breaking, and re-orient engineering to think about how to build reliable systems using unreliable components. AKA: how to build reliable systems using CRAP.

History has shown
The problems that the Lunatic Fringe is working on today are the problems that will become mainstream in 3-5 years. Legacy data access hardware and software mechanisms are breaking down at these scales. We need to continue to innovate, individually at all levels and globally across levels, and re-orient our thinking on many levels.

What's happening now?
Areal density is at about 250 Gigabits per square inch. The 3.5-inch form factor is currently the standard; the 2.5-inch form factor is emerging in the enterprise. SAS and SATA are getting significant traction. OSD has been demonstrated and is in active development. Consumer-grade storage is cheap cheap cheap. Commodity interface speeds are up to 20-40 Gigabits/sec. Storage and network processing engines are available. New applications for storage are rapidly evolving. Relaxed POSIX standards: NFS v4 and Parallel NFS.

Common thread
Their data storage capacity, access, and retention requirements are continually increasing. Some of the technologies and concepts the Lunatic Fringe is looking at include:
Object-based Storage Devices
Intelligent Storage Systems
Data Grids
High-density disk drive packaging
Commodity Reliability And Practices: building reliable systems with inherently unreliable components, or building reliable systems using CRAP
New and/or improved software standards
Error Detection Techniques and Methods

Conclusions
Lunatic Fringe users will continue to push the limits of existing hardware and software technologies. The Lunatic Fringe is a moving target; there will always be a Lunatic Fringe well beyond where you are. The Storage Industry at large should pay attention to what they are doing, why they are doing it, and what they learn.

Some Interesting Sites
www.llnl.gov - Lawrence Livermore National Laboratory
www.lanl.gov - Los Alamos National Laboratory
www.sandia.gov - Sandia National Laboratories
www.top500.org - The Top500 List
www.ncsa.uiuc.edu - NCSA
www.psc.edu - Pittsburgh Supercomputing Center
www.tacc.utexas.edu - Texas Advanced Computing Center
www.ornl.gov - Oak Ridge National Laboratory
http://lhc.web.cern.ch/lhc - CERN and the LHC

Government Research
DoE ASCI TriLabs (LANL, LLNL, Sandia):
Lustre (www.lustre.org)
Parallel NFS (www.ietf.org/proceedings/04mar/slides/nfsv4-1.pdf)
NFS Version 4 (nfsv4.org)
DICE - Data Intensive Computing Environments (http://www.avetec.org/dice/)
NASA and the IEEE Mass Storage Systems Technical Committee (MSSTC): annual symposium on Mass Storage Systems and Technologies (www.storageconference.org)

Academic Storage Research
University of Minnesota Digital Technology Center Intelligent Storage Consortium (DISC) - www.dtc.umn.edu/programs/disc.html
University of California Santa Cruz Storage Systems Research Center (SSRC) - http://ssrc.soe.ucsc.edu
CMU Parallel Data Lab (PDL) - www.pdl.cmu.edu

Thank you Thomas M. Ruwart Chief [Mad] Scientist tmruwart@atrato.com