Storage on the Lunatic Fringe Thomas M. Ruwart Chief [Mad] Scientist tmruwart@atrato.com SNIA Developers Conference San Jose, CA September 25, 2008
Why you are here
To learn that there are organizations with bigger data storage requirements than yours
What some of their issues and problems are
How they are addressing those issues and problems
A glimpse into the future of data storage hardware and software technologies and possible solutions
Orientation
A bit of history
Who are the lunatics in 2008-2009?
What are their requirements?
Why is this interesting to the Storage Industry?
What is anyone doing about this?
Conclusions
A bit of History
1988: Supercomputer Centers operating with HUGE disk farms of 50-100 GB!
1GB disk drives cost $20,000 each: 8-inch form factor, 60 lbs, average seek time of 15ms, 3600 RPM
(Photo: technician working on an IBM 3380 Disk Drive, 1986)
1995: 3.5-inch half-height disk drives are the standard form factor at 4GB/disk
3600 RPM, average seek time 12ms, 2 lbs, $2,000 per disk drive ($500/GB)
Built a 1+TB array of disks using 296 4GB 3.5-inch disks
37 RAID5 7+1 disk arrays mounted in 5 racks, more than $1M in disk arrays
Created a single SGI XFS file system across all the drives, and a single 1TB file on it
A bit more history
2002: ASCI Q: 700TB online, high performance, pushing the limits of traditional [legacy] block-based file systems
2004: ASCI Red Storm: 240TB online, high bandwidth, massively parallel
2006: ASC Purple at LLNL: 269 racks for the entire machine; 12,208 processors in 131 racks; 48 racks just for switches (17,000 cables); 2 PBytes of storage: >11,000 disks in 90 racks
2008: Roadrunner at LANL: the first PetaFlop machine (10^15 FLOPs); 6,912 AMD dual-core Opterons plus 12,960 IBM Cell eDP processors; 80TB main memory (aggregate); 216 GB/sec sustained I/O to storage (432 x 10GigE)
See the Top500 List for complete details: www.top500.org
Processor Count (chart: Number of Processors by Rank for the Top500 list, log scale from 1 to 1,000,000 processors across ranks 1-500)
Looking Ahead
2009: ZIA, a joint development between Sandia National Lab and LANL: 2 PFlops, 256K processor cores, 2PB disk storage, 1-2TB/sec sustained bandwidth
2011-2012: 10 PFlops and beyond: Blue Waters at NCSA, a University of Illinois and State of Illinois joint project for open peta-scale computing: >200,000 processors, >800TB main memory, >10PB disk storage
Looking way ahead, 2020: Tom retires
Who said that?
"I think there is a world market for maybe five computers." Thomas Watson (1874-1956), Chairman of IBM, 1943
"There is no reason anyone would want a computer in their home." Ken Olson, president, chairman and founder of Digital Equipment Corp., 1977
"640K ought to be enough for anybody." Bill Gates (1955-), in 1981
"Who the hell wants to hear actors talk?" H. M. Warner (1881-1958), founder of Warner Brothers, in 1927
"Everything that can be invented has been invented." Charles H. Duell, Commissioner, U.S. Office of Patents, 1899
Who would say this?
Who on earth needs an ExaByte (10^18 bytes) of storage space?
Who needs a TeraByte per second data transfer rate from storage to the application?
Who needs millions, billions, trillions of data transactions per second?
Who would ever need to manage a trillion files?
You did not hear these questions from me
Who are the Lunatics?
High-End Computing (HEC) Community: BIG data or LOTS of data, locally and widely distributed, high bandwidth access or high transaction rate, relatively few users, secure, short-term and long-term retention
High Energy Physics (HEP) at Fermilab, CERN, DESY: BIG data, locally distributed, widely available, moderate number of users, sparse access, long-term retention
DARPA HPCS sets the requirements
HEP: the LHC at CERN (http://lhc.web.cern.ch/lhc/)
A $750M experiment built at CERN in Switzerland, activating this year (2008). Holy black holes, Batman.
The Easy Part: collecting the data
Data rate from the detectors is ~1 PB/sec; data rate after filtering is a few GB/sec
The Hard Part: storing and accessing it
The dataset for a single experiment is ~1PB, and several experiments are run per year
Must be made available to 5,000 scientists all over the planet (Earth, primarily) for the next 10-25 years
Dense dataset, sparse data access by any one scientist; access patterns are not deterministic
LHC Data Grid Hierarchy (CMS as example; Atlas is similar)
(Diagram of the tiered distribution of CMS data: the CMS detector, 15m x 15m x 22m, 12,500 tons, $700M, feeds the Online System at ~1 PByte/sec; Tier 0+1 at CERN handle event reconstruction and simulation; Tier 1 regional centers in France, Germany, Italy, and FermiLab USA connect at ~2.5 Gbits/sec; Tier 2 centers at ~0.6-2.5 Gbps; Tier 3 institute physics data caches at 100-1000 Mbits/sec; Tier 4 workstations. Courtesy Harvey Newman, CalTech and CERN.)
CERN/CMS data goes to 6-8 Tier 1 regional centers, and from each of these to 6-10 Tier 2 centers. Physicists work on analysis channels at 135 institutes; each institute has ~10 physicists working on one or more channels. 2,000 physicists in 31 countries are involved in this 20-year experiment, in which DOE is a major player.
What are the DARPA requirements?
The High Productivity Computing Systems (HPCS) program from DARPA sets the requirements for the HEC community:
10^15 computations per second: peta-scale computing
1-10 trillion files in a single file system
100s of thousands of processors
Millions of process threads, all needing and generating data
1-100 TBytes/sec aggregate bandwidth to disk
30,000+ file creations per second
Focus on ease of use, efficiency, and RAS
Why is the Number of Processors Important?
It is an indicator of the number of independent program threads that need access to storage
When the number of processors is greater than the number of disks, I/O will be random
We are past the age of purely sequential bandwidth and in the age of purely random data access patterns
This is strictly a result of the computer architecture
What are we getting ourselves into? What is 1TB/sec bandwidth to disk?
20,000 disk drives @ 50MB/sec/disk average (assumes no seeks)
2 million IOPS @ 10ms average access time
20PB raw capacity @ 1TB/disk
500 KWatts @ 25 watts/disk (including cooling power)
24,000-40,000 disk drives in a real design to include redundancy; space and power/cooling increase up to 2x, to 1MWatt
And that is just the beginning: 10TB/sec would be up to 400,000 disk drives. The arithmetic is sketched below.
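A minimal sketch of that sizing arithmetic in Python. The per-drive figures (50 MB/sec sustained, 10 ms access time, 1 TB capacity, 25 W including cooling) are the assumptions stated on this slide, not vendor specifications:

```python
# Back-of-the-envelope sizing for a 1 TB/sec storage system,
# using the per-drive assumptions from the slide above.

TARGET_BW = 1e12          # 1 TB/sec aggregate bandwidth to disk
DRIVE_BW = 50e6           # 50 MB/sec per drive, assuming no seeks
ACCESS_TIME = 0.010       # 10 ms average access time
DRIVE_CAP_TB = 1          # 1 TB per drive
DRIVE_POWER_W = 25        # watts per drive, including cooling

drives = TARGET_BW / DRIVE_BW                 # 20,000 drives
iops = drives / ACCESS_TIME                   # ~2,000,000 IOPS
capacity_pb = drives * DRIVE_CAP_TB / 1000    # 20 PB raw
power_kw = drives * DRIVE_POWER_W / 1000      # 500 kW

print(f"{drives:,.0f} drives, {iops:,.0f} IOPS, "
      f"{capacity_pb:.0f} PB raw, {power_kw:.0f} kW")
```

Scaling the target bandwidth by 10x scales every line of the output by 10x, which is where the 200,000-400,000 drive figures for 10TB/sec come from.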
The Storage Event Horizon
1 GByte/sec: 20 disk drives
10 GBytes/sec: 200 disk drives
100 GBytes/sec: 2,000 disk drives
1 TByte/sec: 20,000 disk drives
~~~~~~~~~~ Storage Event Horizon ~~~~~~~~~~
10 TBytes/sec: 200,000 disk drives
100 TBytes/sec: 2,000,000 disk drives
1 PByte/sec: 20,000,000 disk drives
What does 1TB/sec really mean? To what?
1,000 processes @ 1GB/sec each? 100,000 processes @ 10MB/sec each?
Assumes a process/processor can absorb/generate data at that rate
The current ratio of instruction execution rate to I/O transfer rate is about 1000:1, based on the ZIA requirements (all machines are different)
Therefore, 1 PFlop implies a 1TB/sec I/O transfer rate
1 EFlop implies an I/O transfer rate of 1PB/sec, or 20 million disk drives. Ooops! (See the sketch below.)
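A rough sketch of how the I/O requirement scales with machine speed, assuming the ~1000:1 ratio of instruction rate to I/O byte rate cited above (derived from the ZIA targets; every machine is different) and the same 50 MB/sec per-drive figure:

```python
# Scale the storage requirement with peak compute rate under the ~1000:1
# assumption: roughly one byte of storage I/O per 1,000 operations.

BYTES_PER_FLOP = 1 / 1000     # assumed I/O bytes per floating-point operation
DRIVE_BW = 50e6               # bytes/sec per drive, assuming no seeks

for label, flops in (("1 PFlop", 1e15), ("1 EFlop", 1e18)):
    io_rate = flops * BYTES_PER_FLOP          # required bytes/sec to storage
    drives = io_rate / DRIVE_BW               # drives needed at that bandwidth
    print(f"{label}: {io_rate:.0e} bytes/sec -> {drives:,.0f} drives")

# 1 PFlop -> 1e12 bytes/sec (1 TB/sec) -> 20,000 drives
# 1 EFlop -> 1e15 bytes/sec (1 PB/sec) -> 20,000,000 drives
```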
Digging ourselves in deeper? 1 Trillion Files
30,000 file creations per second for 1 year = ~1 trillion files
~1PB of metadata to describe 1 trillion files (the arithmetic is sketched below)
Finding any one file within 1 trillion files; finding anything inside the 1 trillion files
This is a major transactional problem, not a bandwidth problem
Traditional file systems and their associated [POSIX] semantics break down at these scales; we need new/relaxed semantics
Is the concept of a file still valid in this context?
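The arithmetic behind the trillion-file problem, as a short sketch. The ~1 KB of metadata per file is an assumption implied by "1PB of metadata for 1 trillion files", not a measured figure:

```python
# Sustained file creation for a year, and the resulting metadata footprint
# if each file carries roughly 1 KB of metadata (assumed).

CREATES_PER_SEC = 30_000
SECONDS_PER_YEAR = 365 * 24 * 3600
METADATA_PER_FILE = 1024            # bytes per file, assumed

files_per_year = CREATES_PER_SEC * SECONDS_PER_YEAR   # ~9.5e11, about 1 trillion
metadata_pb = files_per_year * METADATA_PER_FILE / 1e15
print(f"{files_per_year:.2e} files in a year, ~{metadata_pb:.1f} PB of metadata")
```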
The Growing Disk Drive Bottleneck

Subsystem                        | 1993 (1) | 2007E (1) | Increase
Network I/O (2)                  | 0.001    | 2         | 2000x
Intel CPU                        | 0.48     | 100       | 200x
Storage Channel I/O (3)          | 0.05     | 4         | 80x
PCI (7)                          | 0.13     | 16        | 123x
Intel Front Side Processor Bus   | 0.53     | 13        | 24x
Random Disk IOPS (5)             | 90       | 150       | 1.7x
Random Disk IOPS per GByte (5,6) | 43       | 4.2       | -10x
Sequential Disk I/O (4)          | 0.005    | 0.1       | 20x
Sequential Disk BW per GByte     | 0.005    | 0.0001    | -50x

Notes: (1) Speed of subsystem in GBps (2) Ethernet (3) SCSI and Fibre Channel (4) IBM 3.5-inch drives, internal data rate (5) IBM 3.5-inch drives, seek + rotational latency (6) Horison/Fred Moore (7) PCI versus 16x PCIe. Negative values indicate a decrease.
Source: www.archivebuilders.com, "Evolution of Intel Microprocessors: 1971 to 2001"
Need more disks, not higher capacity ones
Disk drive capacity improves faster than: data transfer rate, seek time, and rotational latency
Access Density
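Access density is random IOPS per GByte of capacity, which is what this slide's chart tracks. A hedged illustration with hypothetical, round-number drive generations (not measurements of specific products) shows why the metric collapses as capacity outruns IOPS:

```python
# Access density = random IOPS / capacity in GBytes.
# The drive specs below are hypothetical examples for illustration only.

drives = {
    "circa-1993 drive": {"iops": 90,  "capacity_gb": 2},
    "circa-2007 drive": {"iops": 150, "capacity_gb": 500},
}

for name, d in drives.items():
    density = d["iops"] / d["capacity_gb"]    # IOPS per GByte
    print(f"{name}: {density:.2f} IOPS/GB")

# ~45 IOPS/GB vs ~0.3 IOPS/GB: a >100x drop in access density, which is why
# random workloads need more spindles rather than bigger ones.
```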
Serious Questions
How do you package it?
How do you maintain it?
How do you connect it all together?
How do you access/use a storage system with 250,000 disk drives?
How do you package this?
Conservatively, 200 x 3½-inch disks per rack with controllers
200 racks of disk drives and controllers: about 4,000 square feet (see the sketch below)
10TB/sec is 10 times this, or about the size of one football field (~40,000 sq ft)
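A small sketch of the floor-space estimate. The ~20 sq ft per rack, including service clearance, is an assumed figure implied by 200 racks occupying roughly 4,000 square feet:

```python
# Floor-space estimate under the packaging assumptions on this slide.

DRIVES = 40_000            # ~1 TB/sec configuration including redundancy
DRIVES_PER_RACK = 200
SQFT_PER_RACK = 20         # assumed, including aisle/service space

racks = DRIVES / DRIVES_PER_RACK        # 200 racks
floor_sqft = racks * SQFT_PER_RACK      # ~4,000 sq ft; 10x that for 10 TB/sec
print(f"{racks:.0f} racks, ~{floor_sqft:,.0f} sq ft")
```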
How do you maintain it?
Assume a 40,000-disk configuration
2,000,000 hours MTBF per Enterprise-class disk; 500,000 hours MTBF per Consumer-class disk
~4 disk failures per week for Enterprise-class disks; ~15 failures per week for Consumer-class disks (see the sketch below)
Continual rebuilds in progress
10TB/sec is 10 times this
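The expected failure rate follows from a simple population / MTBF approximation, sketched here:

```python
# Expected drive failures per week for a large population, approximated as
# population / MTBF, using the MTBF figures from the slide above.

DRIVES = 40_000
HOURS_PER_WEEK = 7 * 24

for kind, mtbf_hours in (("Enterprise", 2_000_000), ("Consumer", 500_000)):
    failures_per_week = DRIVES / mtbf_hours * HOURS_PER_WEEK
    print(f"{kind}-class: ~{failures_per_week:.1f} failures/week")

# Enterprise: ~3.4/week, Consumer: ~13.4/week -- roughly the ~4 and ~15 cited
# above, and enough to keep RAID rebuilds running continuously.
```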
How do you connect it all together?
At 10Gbit/sec/channel, 1TB/sec needs roughly 1,000 channels @ 100% efficiency
That implies a ~2,000-channel non-blocking switch fabric
What about transceiver failure rates? When it breaks, how do you find the broken transceiver?
10TB/sec: who on earth would want to do that? (don't ask)
A sketch of the channel arithmetic follows.
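A minimal sketch of the channel count. The 10 bits on the wire per delivered byte is an assumption (8b/10b-style encoding overhead); with a plain 8 bits per byte the answer is 800 channels rather than 1,000:

```python
# Channel-count arithmetic for 1 TB/sec over 10 Gbit/sec links.

TARGET_BYTES_PER_SEC = 1e12
LINK_BITS_PER_SEC = 10e9
WIRE_BITS_PER_BYTE = 10          # assumed encoding overhead on the wire

channels = TARGET_BYTES_PER_SEC * WIRE_BITS_PER_BYTE / LINK_BITS_PER_SEC
print(f"~{channels:,.0f} channels at 100% efficiency")

# A non-blocking fabric needs ports on both the host and storage sides,
# hence the ~2,000-channel switch; 10 TB/sec multiplies everything by 10.
```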
How do you use this?
Current file system technology is based on 30+ year-old designs and does not scale; the disk I/O software stack is also 30+ years old and does not scale
Need lots of innovation in many areas:
Common shared file system interfaces
Data Life Cycle Management and seamless integration into existing HEC environments
Changes to standards that offer greater scalability without sacrificing data integrity
Streaming I/O from zillions of single nodes
Data alignment, small-block, large-block, and RAID issues
File system metadata
(Diagram: the Application / Operating System / Storage and Transport stack, repeated across many nodes)
Commodity Reliability And Practices
Processors, networks, and graphics engines have, for the most part, gone commodity; disk drives are still largely enterprise-class
There is significant pressure to move toward more use of commodity disk drives
This requires a fundamental change in how we think about RAS for storage, i.e. Fail-In-Place: assume something is always in the process of breaking
Must re-orient engineering to think about how to build reliable systems using unreliable components
AKA: how to build reliable systems using CRAP
History has shown
The problems that the Lunatic Fringe is working on today are the problems that will become mainstream in 3-5 years
Legacy data access hardware and software mechanisms are breaking down at these scales
We need to continue to innovate: individually at all levels, and globally across levels
We need to re-orient our thinking on many levels
What's happening now?
Areal density is at about 250 Gigabits per square inch
The 3.5-inch form factor is currently the standard; the 2.5-inch form factor is emerging in the enterprise
SAS and SATA are getting significant traction
OSD has been demonstrated and is in active development
Consumer-grade storage is cheap cheap cheap
Commodity interface speeds are up to 20-40 Gigabits/sec
Storage and network processing engines are available
New applications for storage are rapidly evolving
Relaxed POSIX standards: NFS v4 and Parallel NFS
Common thread
Their data storage capacity, access, and retention requirements are continually increasing
Some of the technologies and concepts the Lunatic Fringe is looking at include:
Object-based Storage Devices
Intelligent Storage Systems
Data Grids
High-density disk drive packaging
Commodity Reliability And Practices: building reliable systems with inherently unreliable components, or building reliable systems using CRAP
New and/or improved software standards
Error detection techniques and methods
Conclusions
Lunatic Fringe users will continue to push the limits of existing hardware and software technologies
The Lunatic Fringe is a moving target: there will always be a Lunatic Fringe well beyond where you are
The Storage Industry at large should pay attention to what they are doing, why they are doing it, and what they learn
Some Interesting Sites
www.llnl.gov - Livermore National Labs
www.lanl.gov - Los Alamos National Labs
www.sandia.gov - Sandia National Labs
www.top500.org - The Top500 List
www.ncsa.uiuc.edu - NCSA
www.psc.edu - Pittsburgh Supercomputer Center
www.tacc.utexas.edu - Texas Advanced Computing Center
www.ornl.gov - Oak Ridge National Lab
http://lhc.web.cern.ch/lhc - CERN and the LHC
Government Research
DoE ASCI TriLabs: LANL, LLNL, Sandia
Lustre (www.lustre.org)
Parallel NFS (www.ietf.org/proceedings/04mar/slides/nfsv4-1.pdf)
NFS Version 4 (nfsv4.org)
DICE - Data Intensive Computing Environments (http://www.avetec.org/dice/)
NASA and the IEEE Mass Storage Technical Committee: annual symposium on Mass Storage Systems and Technologies (MSSTC) (www.storageconference.org)
Academic Storage Research
University of Minnesota Digital Technology Center, Intelligent Storage Consortium (DISC): www.dtc.umn.edu/programs/disc.html
University of California Santa Cruz, Storage Systems Research Center (SSRC): http://ssrc.soe.ucsc.edu
CMU Parallel Data Lab (PDL): www.pdl.cmu.edu
Thank you Thomas M. Ruwart Chief [Mad] Scientist tmruwart@atrato.com