Data Movement & Storage Using the Data Capacitor Filesystem Justin Miller jupmille@indiana.edu http://pti.iu.edu/dc Big Data for Science Workshop July 2010
Challenges for DISC Keynote by Alex Szalay identified the challenges that researchers face Scientific data doubles every year Amount of data is a barrier to extracting knowledge Problem of today: data access How can we minimize data movement?
Workflow Example Single Compute [Diagram: Data Source, Compute Resource, Researcher's Computer]
Workflow Example Multiple Compute [Diagram: Data Source, Compute Resource #1, Compute Resource #2, Researcher's Computer]
Workflow Example Visualization [Diagram: Data Source, Compute Resource #1, Compute Resource #2, Visualization Resource, Researcher's Computer]
Workflow Example Archive [Diagram: Data Source, Compute Resource #1, Compute Resource #2, Visualization Resource, Tape Archive, Researcher's Computer]
Data Movement & Storage This is an unsustainable workflow Works for GBs, maybe a single TB, but not more Every resource is another series of transfers Data movement is in the way of doing work Good reasons to add resources to workflow And we haven't addressed other drawbacks
IU Central Filesystem Workflow [Diagram: Data Capacitor at the center of Data Source, Compute Resource #1, Compute Resource #2, Visualization Resource, Tape Archive, and Researcher's Computer]
IU's Data Capacitor Filesystem National Science Foundation funded in 2005 Funds purchased 535TB of Lustre storage 339TB available as production service The Data Capacitor name comes from electronics: a capacitor provides transient storage of electrons, absorbs and evens out peaks in flow, and provides consistent output
Idea of Data Capacitor Centralized short-term storage for IU resources Store your data to compute against, and use for scratch space during your run Possibility exists for mid-term storage
Data Capacitor Centralized Storage Compute using IU's supercomputer Big Red Compute using IU's Quarry cluster Archive to IU's massive HPSS tape archive (hierarchical storage): archive your data to tape
Central to IU Cyberinfrastructure
Physics Research Dr. Chuck Horowitz, IU physicist Interested in the behavior of neutron stars Studying the behavior of nuclear matter near saturation density can form an interesting phase, "nuclear pasta" Using MDGRAPE-2 hardware for increased performance
Physics Research Particle interactions are simulated via molecular dynamics using specialized MDGRAPE-2 hardware; configurations are saved Post-processing creates VTK frames Visualization system ingests frame data and displays it as a movie
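The talk does not show the post-processing code itself; as a rough sketch only, assuming the saved configurations can be loaded as arrays of particle positions, a frame could be written in the legacy ASCII VTK format like this (file names and array contents are hypothetical):

import numpy as np

def write_vtk_frame(positions, path, title="nuclear pasta frame"):
    """Write an (N, 3) array of particle positions as a legacy ASCII VTK point cloud."""
    with open(path, "w") as f:
        f.write("# vtk DataFile Version 3.0\n")
        f.write(title + "\n")
        f.write("ASCII\n")
        f.write("DATASET POLYDATA\n")
        f.write("POINTS %d float\n" % len(positions))
        for x, y, z in positions:
            f.write("%g %g %g\n" % (x, y, z))

# Hypothetical usage: one VTK file per saved MD configuration.
frame0 = np.random.rand(1000, 3) * 50.0   # stand-in for a real saved configuration
write_vtk_frame(frame0, "frame_0000.vtk")

The visualization system then reads the sequence of frame files and plays them back as a movie.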
Physics Research [Diagram: Compute Resource, Data Capacitor, Visualization Resource, Tape Archive]
Earth Science Research Linked Environments for Atmospheric Discovery (LEAD) WxChallenge The WxChallenge is a meteorological forecast competition. Compete to forecast maximum and minimum temperatures, precipitation, and maximum wind speeds for select U.S. cities over a ten-week period each semester
LEAD Workflow [Diagram: Weather Data, Compute Resource, Data Capacitor, Computer Science Cluster, Transfer Resources]
Extend the Centralized FS Model The natural progression is to be central to more resources Make data available to more resources IU did this by extending the filesystem across the wide-area network (WAN) Data Capacitor WAN (DC-WAN) New FS separate from the original DC
Data Capacitor WAN
Data Capacitor WAN Tradeoffs The benefit of a centralized WAN filesystem is the illusion of locality Your data is transferred behind the scenes across the network At worst your data will be transferred slower than you like At best it is as fast as, or faster than, local storage; typically comparable across research networks
DC-WAN Namespace Mapping A WAN filesystem challenge is heterogeneous user identification across sites The numeric user identification (UID) for a particular user is not the same across sites You don't have to worry about this because DC-WAN does the conversion For example: Indiana: jupmille (uid=648424); TACC: tg803934 (uid=803934); PSC: jupmille (uid=43415); NCSA: jupmille (uid=40436); SDSC: jupmille (uid=502639)
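To illustrate the idea only (this is not the actual DC-WAN mapping code), a minimal sketch of translating each site's local UID to one canonical user, using the UIDs listed above:

# Conceptual sketch of WAN filesystem UID mapping; not the DC-WAN implementation.
SITE_UID_TABLE = {
    ("Indiana", 648424): "jupmille",
    ("TACC",    803934): "jupmille",   # local username tg803934
    ("PSC",      43415): "jupmille",
    ("NCSA",     40436): "jupmille",
    ("SDSC",    502639): "jupmille",
}

def canonical_user(site, local_uid):
    """Return the canonical username for a site-local numeric UID."""
    try:
        return SITE_UID_TABLE[(site, local_uid)]
    except KeyError:
        raise ValueError("no mapping for uid %d at %s" % (local_uid, site))

print(canonical_user("TACC", 803934))   # -> jupmille

Because the filesystem performs this translation behind the scenes, files you own at one site show up as yours at every other site.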
Physics Research with DC-WAN [Diagram: Simulation on Compute Resource in Austin, TX; Visualization Resource; Data Capacitor WAN; Analysis on Compute Resource; Tape Archive]
Astronomy with DC-WAN [Diagram: WIYN Telescope One Degree Imager (ODI) in Tucson, AZ; Data Capacitor WAN; Analysis on Compute Resource; Tape Archive] Image NOAO/AURA/NSF
Center for the Remote Sensing of Ice Sheets (CReSIS) Workflow [Diagram: field data from Greenland and Antarctica; Compute Resource in Lawrence, KS; Data Capacitor; Tape Archive]
Gas Giant Planet Research [Diagram: PSC (Pittsburgh, PA) Visualization Resource; NCSA (Urbana, IL); Data Capacitor WAN; Tape Archive; MSU (Starkville, MS)]
Demo Small sample of Gas Giant Planet Research Data is on DC-WAN, which is mounted on two different resources Compute on PSC's Pople (SGI Altix 4700) Post-process and visualize results on an IU machine that has proprietary software (IDL v7.0); view over network
IU's Data Capacitor WAN Filesystem Funded by Indiana University in 2008 339TB of storage available as production service Centralized short-term storage for nationwide resources, including TeraGrid Use your data on the best resource for your needs Short-term storage like DC, possibility exists for mid-term storage
Based on Lustre Filesystem Lustre is a parallel distributed file system Available under the GNU GPL Used by U.S. government, movie studios, financial institutions, oil and gas industry 7 of the top 10 HPC systems on the June 2009 "Top 500" list ran Lustre; 52 of the top 100 run Lustre in 2010
Based on Lustre Filesystem Lustre filesystems can support up to tens of thousands of client systems, petabytes (PBs) of storage and hundreds of gigabytes per second (GB/s) of I/O throughput. Scalable filesystem uses separate servers to aggregate for performance storage backend is hidden from the client
Lustre Filesystem Architecture Lustre presents all clients with standard POSIX filesystem interface Filesystem mount My scratch directory for example: IU: /N/dcwan/scratch/jupmille/ PSC: /N/dcwan/scratch/jupmille/ TACC: /N/dcwan/scratch/jupmille/ NCSA: /N/dcwan/scratch/jupmille/ Standard commands ls, cp, cat, etc. from the command line
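Because the mount presents a standard POSIX interface, ordinary file operations work with no special API. A minimal sketch, assuming the scratch path above is mounted and writable (the run directory and input file names are hypothetical):

import os
import shutil

scratch = "/N/dcwan/scratch/jupmille"            # same path on every site that mounts DC-WAN

run_dir = os.path.join(scratch, "run001")        # hypothetical run directory
os.makedirs(run_dir, exist_ok=True)              # plain POSIX directory creation
shutil.copy("input.dat", run_dir)                # stage a (hypothetical) input file

for name in sorted(os.listdir(run_dir)):         # same result as 'ls' on the command line
    print(name)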
Lustre Filesystem Architecture Metadata Server (MDS) stores the filesystem metadata such as filenames, directories, and permissions. file operations such as open/close Object Storage Server (OSS) bulk I/O servers Object Storage Targets (OST) back-end storage devices
Lustre Filesystem Architecture [Diagram: clients connect to the MDS and multiple OSSes; each OSS serves several OSTs]
Data Capacitor Hardware 8 pairs Dell PowerEdge 2950 2 x 3.0 GHz Dual Core Xeon Myrinet 10G Ethernet Dual port Qlogic 2432 HBA (4 x FC) 2.6 Kernel (RHEL 5), Lustre 1.8 4 DDN S2A9550 Controllers Over 2.4 GB/sec measured throughput each 339TB of spinning SATA disk
Data Capacitor WAN Hardware 2 pairs Dell PowerEdge 2950 2 x 3.0 GHz Dual Core Xeon Myrinet 10G Ethernet Dual port Qlogic 2432 HBA (4 x FC) 2.6 Kernel (RHEL 5), Lustre 1.8 1 DDN S2A9550 Controller Over 2.4 GB/sec measured throughput 339TB of spinning SATA disk
Getting the Most out of Lustre Lustre is optimized for large files (where large is >1MB), not so good for small files Lustre has aggressive client-side caching if you plan to read the same files more than once, big win Lustre allows you to control how your data is striped across the OSTs, so optimization based on your I/O patterns can reap benefits in throughput
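As a rough sketch of controlling striping from a script, assuming access to the standard lfs utility (the directory, stripe count, and stripe size are illustrative choices, not recommendations, and the exact flag spelling varies between Lustre versions):

import subprocess

target = "/N/dcwan/scratch/jupmille/output"      # hypothetical output directory

# Ask Lustre to stripe new files in this directory across 8 OSTs with a 4 MB stripe size.
# (Lustre 1.8 uses -s for stripe size; newer releases use -S.)
subprocess.run(["lfs", "setstripe", "-c", "8", "-s", "4M", target], check=True)

# Inspect the resulting layout.
subprocess.run(["lfs", "getstripe", target], check=True)

# Favor large sequential writes over many tiny ones.
chunk = b"\0" * (4 * 1024 * 1024)                # 4 MB per write
with open(target + "/large_file.bin", "wb") as f:
    for _ in range(256):                         # about 1 GB total
        f.write(chunk)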
Lustre WAN Future DC-WAN will be mounted on the India and Sierra FutureGrid clusters In the testing phase right now IU's Lustre UID mapping code will be used in a new TeraGrid Lustre-WAN project in development now
Thank you for listening. Questions are welcome. Please use moderators for Q&A Justin Miller jupmille@indiana.edu Data Capacitor Team dc-team-l@indiana.edu http://pti.iu.edu/dc