ParaMEDIC: Parallel Metadata Environment for Distributed I/O and Computing


1 ParaMEDIC: Parallel Metadata Environment for Distributed I/O and Computing
Prof. Wu Feng, Department of Computer Science, Virginia Tech
Work smarter, not harder

2 Overview
Grand Challenge: A large-scale biological problem requiring compute resources from around the world and generating a petabyte of data.
Issues:
  Data Transfer: How to aggregate a petabyte of data over a shared trans-Pacific GigE link?
  Data Integrity: Silent error every ~500 GB based on TCP/IP checksums.
Solution: ParaMEDIC: Parallel Metadata Environment for Distributed I/O and Computing
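To put the two issues in numbers, here is a back-of-the-envelope sketch. It assumes the full nominal 1 Gb/s of the link is available (optimistic, since the slide describes it as shared) and takes the slide's figure of one silent error per ~500 GB at face value. The constant names are illustrative.

```python
# Rough scale of the data-transfer and data-integrity issues.
# Assumptions: decimal units, the whole nominal GigE rate is usable,
# and one silent error per ~500 GB as quoted on the slide.
PETABYTE_BYTES = 10**15
LINK_BITS_PER_SEC = 10**9          # nominal Gigabit Ethernet
ERROR_INTERVAL_BYTES = 500 * 10**9 # ~500 GB between silent errors

transfer_seconds = PETABYTE_BYTES * 8 / LINK_BITS_PER_SEC
transfer_days = transfer_seconds / 86400
expected_silent_errors = PETABYTE_BYTES / ERROR_INTERVAL_BYTES

print(f"Transfer time at full line rate: {transfer_days:.1f} days")
print(f"Expected silent errors in 1 PB: {expected_silent_errors:.0f}")
```

Even under these favorable assumptions the transfer takes roughly three months and would be expected to carry on the order of 2,000 silent errors, which is why moving the raw output directly is not attractive.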

3 Outline: Motivation, Problem Statement, Approach, Case Studies (mpiBLAST and MPE), Results, Conclusion

4 Motivation
Problem of Biological Significance: Discover missing genes in genomes (via sequence-search computations, e.g., mpiBLAST).
Computational Challenge (Missing Genes): All-to-all sequence search of all microbial genomes completed to date: O(10^15) sequence searches!
Storage Challenge: Where to store the output?
  Project storage requirement: one petabyte (10^15 bytes), or 50,000 times the contents of the Library of Congress.
  Impossible to accurately predict. Lack of correlation.

5 Importance of Sequence Search
Motivation: Why sequence search is so important

6 Challenges in Sequence Search
Observations:
  Overall size of genomic databases doubles every 12 months.
  Processing horsepower doubles only every 18-24 months.
Consequence: The rate at which genomic databases are growing is outstripping our ability to compute (i.e., sequence search) on them.
Exponentially Growing
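The two doubling rates can be turned into a rough gap estimate. The sketch below assumes clean exponential growth at exactly the quoted doubling periods; the function names are illustrative.

```python
# How fast does the database-to-compute gap widen?
# Assumption: both quantities grow as pure exponentials with the
# doubling periods quoted on the slide.

def growth(years: float, doubling_months: float) -> float:
    """Growth factor after `years` for a quantity doubling every `doubling_months`."""
    return 2 ** (12 * years / doubling_months)

def gap(years: float, db_doubling: float = 12, cpu_doubling: float = 18) -> float:
    """Ratio of database growth to compute growth after `years`."""
    return growth(years, db_doubling) / growth(years, cpu_doubling)

for cpu_doubling in (18, 24):
    print(f"compute doubling every {cpu_doubling} months: "
          f"gap after 6 years = {gap(6, 12, cpu_doubling):.1f}x")
```

Under these assumptions the databases outgrow the available compute by a factor of 4 to 8 over six years, and the gap itself keeps growing exponentially.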

7 Problem Statement: The Case of the Missing Genes
Problem: Most current genes have been detected by a gene-finder program, which can miss real genes.
Approach: Every possible location along a genome should be checked for the presence of genes.
Solution: All-to-all sequence search of all 567 microbial genomes that have been completed to date (as of January 2008), but this requires more resources than can traditionally be found at a single supercomputer center: O(10^15) sequence searches!

8 Outline: Motivation, Problem Statement, Approach, Case Studies (mpiBLAST and MPE), Results, Conclusion

9 Approach: ParaMEDIC (Parallel Metadata Environment for Distributed I/O and Computing)
A New Way of Programming Distributed I/O
Overview:
  Application generates output data.
  ParaMEDIC takes over:
    Transforms output to (orders-of-magnitude smaller) application-specific metadata at the compute site.
    Transports metadata over the WAN to the storage site.
    Transforms metadata back to the original data at the storage site (host site for the global filesystem).
Why is this different from compression? ParaMEDIC deals with data as abstract objects, not as a byte-stream.
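The three steps above can be sketched with a toy application. This is not the ParaMEDIC API: every function name here is hypothetical, the "application" is a divisor scan chosen only because its output can be recomputed from a short list of keys, and the WAN is simulated by a JSON round trip.

```python
import json

def divisors(n):
    """Toy stand-in for the application's per-item computation."""
    return [d for d in range(1, n + 1) if n % d == 0]

def run_application(limit):
    """Compute site: scan every candidate, keep full records for the hits."""
    output = {}
    for n in range(1, limit + 1):
        ds = divisors(n)
        if len(ds) >= 8:          # arbitrary "interesting result" filter
            output[n] = ds
    return output

def to_metadata(output):
    """Compute site: reduce the output to application-specific metadata (the keys)."""
    return sorted(output)

def wan_transfer(metadata):
    """Simulated WAN hop: only the metadata is serialized and shipped."""
    return json.loads(json.dumps(metadata))

def from_metadata(metadata):
    """Storage site: regenerate the original output from the metadata."""
    return {n: divisors(n) for n in metadata}

output = run_application(200)
metadata = to_metadata(output)
regenerated = from_metadata(wan_transfer(metadata))

print("regenerated matches original:", regenerated == output)
print("bytes shipped:", len(json.dumps(metadata)),
      "vs raw output:", len(json.dumps(output)))
```

The point of the sketch is the contrast with compression: the metadata is meaningful only because the storage site knows how the application produced its output, so it can redo just the part of the computation that matters.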

10 Approach: ParaMEDIC Software Stack
[Figure: software stack. Recoverable labels: ParaMEDIC API (PMAPI); ParaMEDIC Data Tools (Data Encryption, Data Integrity); ParaMEDIC: Parallel Metadata Environment for Distributed I/O and Computing.]

11 Tradeoffs in the ParaMEDIC Framework
Trading Computation and I/O:
  Increase in computational workload: Converting output to metadata and back requires extra work.
  Decrease in I/O workload: Only metadata is transferred over the WAN, so less WAN bandwidth is used.
  But, computation is free; I/O is not!
Trading Portability and Performance:
  Utility functions help develop application plugins, but these will always need non-zero effort.
  Data is dealt with as high-level objects: better chance of improved performance.

12 Outline: Motivation, Problem Statement, Approach, Case Studies (mpiBLAST and MPE), Results, Conclusion

13 Sequence Search with mpiBLAST
[Figure: two panels, "Sequential Search of Queries" and "Parallel Search of Queries". Each panel shows Query Sequences searched against Database Sequences to produce Output.]

14 mpiBLAST Metadata
Alignment of two sequences is independent of the remaining sequences.
[Figure: Query Sequences are searched against Database Sequences to produce Output; the metadata (IDs of matched sequences) is communicated over the WAN; on the other side, Query Sequences are searched against Temporary Database Sequences to produce alignment information for a bunch of sequences.]
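The idea on this slide can be sketched as follows. The scoring function is a toy shared-k-mer count, not BLAST, and all names and sequences are made up for illustration. What the sketch relies on is the slide's observation that the alignment of two sequences does not depend on the rest of the database, so rerunning the search against a temporary database of matched sequences reproduces the original hits.

```python
def kmers(seq, k=3):
    """Set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def score(a, b, k=3):
    """Toy pairwise score: number of shared k-mers (stand-in for an alignment)."""
    return len(kmers(a, k) & kmers(b, k))

def search(queries, database, threshold=2):
    """Search every query against every database sequence; return the hits."""
    hits = []
    for qid in sorted(queries):
        for did in sorted(database):
            s = score(queries[qid], database[did])
            if s >= threshold:
                hits.append((qid, did, s))
    return hits

def to_metadata(hits):
    """Compute site: keep only the IDs of matched sequences per query."""
    meta = {}
    for qid, did, _ in hits:
        meta.setdefault(qid, []).append(did)
    return meta

def regenerate(meta, queries, database, threshold=2):
    """Storage site: build a temporary database per query and rerun the search."""
    hits = []
    for qid in sorted(meta):
        temp_db = {did: database[did] for did in meta[qid]}
        hits.extend(search({qid: queries[qid]}, temp_db, threshold))
    return hits

queries = {"q1": "ACGTACGT", "q2": "TTTTGGGG"}
database = {"d1": "ACGTAC", "d2": "GGGGCCCC", "d3": "TTTTGG", "d4": "CACACA"}

original = search(queries, database)             # full search at the compute site
metadata = to_metadata(original)                 # this is all that crosses the WAN
rebuilt = regenerate(metadata, queries, database)  # cheap re-search at the storage site
print(rebuilt == original)
```

The re-search touches only the matched sequences, so the storage site does a small fraction of the original work while recovering the full output. Note that the sketch assumes the storage site already holds the queries and the database.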

15 ParaMEDIC-Powered mpiBLAST
[Figure: the ParaMEDIC framework. Recoverable labels: Compute Sites; Compute Master; Compute Workers; mpiBLAST Master; mpiBLAST Worker; WAN; Query; Raw Metadata; Storage Site; I/O Master; I/O Workers; I/O servers hosting file system; Generate Temp Database; Read Temp Database; Write Results.]
HPDC '08

16 ParaMEDICized mpiBLAST for Missing Genes
Worldwide Supercomputer: Six U.S. supercomputing institutions (~12,000 processors) and one Japanese storage institution (0.5 petabytes), ~10,000 kilometers away.

17 MPE: A Profiling Library for MPI
MPE: MPI Profiling Environment
  Suite of performance analysis tools and libraries.
  Shipped as a part of the MPICH2 implementation of MPI.
  Relies on the MPI Profiling Interface.
  Application is run regularly; MPE automagically logs communication calls and time taken.
Generates lots of data:
  A large-scale application such as FLASH can generate about 2.5 MB of data per second per process.
  A 16K-process run for an hour generates about 150 TB of data.
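The 150 TB figure follows from the per-process rate. A quick check, assuming "16K" means 16 x 1024 processes and decimal megabytes and terabytes:

```python
# Sanity check of the MPE log volume quoted on the slide.
# Assumptions: 16K = 16 * 1024 processes; 1 MB = 10**6 bytes; 1 TB = 10**12 bytes.
RATE_BYTES_PER_SEC_PER_PROCESS = 2.5e6
PROCESSES = 16 * 1024
RUN_SECONDS = 3600

total_bytes = RATE_BYTES_PER_SEC_PER_PROCESS * PROCESSES * RUN_SECONDS
total_tb = total_bytes / 1e12
print(f"Log volume for a one-hour run: {total_tb:.1f} TB")
```

This gives roughly 147 TB, consistent with the slide's rounded figure.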

18 Example MPE Profiling Log (GROMACS)
Identify periodicity using Fourier transforms and only store the diffs in each period.
Can give about 3-5X improvement.
HPDC '08 W. Feng, April
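A minimal sketch of the period-plus-diffs idea, under several assumptions: the log is reduced to a numeric series, the period is taken from the strongest non-DC peak of a naive discrete Fourier transform (which can pick a harmonic instead of the fundamental on less regular data), and the first period serves as the reference. This illustrates the technique named on the slide; it is not the actual MPE plugin, and all names are hypothetical.

```python
import cmath

def dominant_period(signal):
    """Estimate the period from the strongest non-DC DFT coefficient (naive O(n^2) DFT)."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    best_k, best_mag = 1, -1.0
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(centered))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return round(n / best_k)

def encode(signal, period):
    """Store the first period in full, then only the entries that differ from it."""
    base = signal[:period]
    diffs = []
    for start in range(period, len(signal), period):
        chunk = signal[start:start + period]
        diffs.append({i: v - base[i] for i, v in enumerate(chunk) if v != base[i]})
    return base, diffs, len(signal)

def decode(base, diffs, length):
    """Rebuild the full series from the reference period and the per-period diffs."""
    out = list(base)
    for d in diffs:
        out.extend(base[i] + d.get(i, 0) for i in range(len(base)))
    return out[:length]

# Synthetic "log": a period-4 pattern repeated 8 times with two small deviations.
signal = [5, 9, 5, 1] * 8
signal[9] += 1
signal[22] -= 2

period = dominant_period(signal)
base, diffs, length = encode(signal, period)
print("period:", period)
print("non-empty diffs:", sum(1 for d in diffs if d))
print("lossless:", decode(base, diffs, length) == signal)
```

On highly regular traces almost every per-period diff is empty, which is where the savings come from; how much is saved on real logs depends on how regular the application's communication pattern is.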

19 Preliminary Results: ANL-VT Supercomputer

20 Preliminary Results: Teragrid Supercomputer

21 SC '07 Storage Challenge: Compute Resources
  2200-processor System X cluster (Virginia Tech)
  2048-processor BG/L supercomputer (Argonne)
  5832-processor SiCortex supercomputer (Argonne)
  700-processor Intel Jazz cluster (Argonne)
  processors on TeraGrid (U. Chicago & SDSC)
  512-processor Oliver cluster (CCT at LSU)
  A few hundred processors on Open Science Grid (RENCI)
  128 processors on the Breadboard cluster (Argonne)
Total: ~12,000 processors

22 SC '07 Storage Challenge: Storage Resources
Clients: 10 quad-core SunFire X4200 systems and two 16-core SunFire X4500 systems
Object Storage Servers (OSS): 20 SunFire X4500
Object Storage Targets (OST): 140 on the SunFire X4500s (each OSS has 7 OSTs)
RAID configuration for OST: RAID5 with 6 drives
Network: Gigabit Ethernet
Kernel: 2.6
Lustre Version: 1.6.2

23 Storage Utilization with Lustre

24 Storage Utilization Breakdown with Lustre

25 Storage Utilization (Local Disks)

26 Storage Utilization Breakdown (Local Disks)

27 Conclusion: Biology
Biological Problems Addressed:
  Discovering missing genes via sequence-similarity computations: 2.63 x 10^14 sequence searches!
  Generating a complete genome sequence-similarity tree to speed up future sequence searches.
Status:
  Missing Genes: Now possible! Ongoing with biologists.
  Complete Similarity Tree: Large % of chromosomes do not match any other chromosomes.

28 Conclusion: Computer Science
Contributions:
  Worldwide supercomputer consisting of ~12,000 processors and 0.5-petabyte storage.
  Output: 1 PB of raw data, 0.3 PB of metadata.
  ParaMEDIC: Parallel Metadata Environment for Distributed I/O and Computing. Decouples computation and I/O and drastically reduces I/O overhead.

29 Acknowledgments
The Team: P. Balaji (Argonne Nat'l Lab), J. Archuleta and H. Lin (Virginia Tech)
The Biology: J. Setubal, A. Warren (Virginia Bioinformatics Institute)
Computational Resources:
  K. Shinpaugh, L. Scharf, G. Zelenka (Virginia Tech)
  I. Foster, M. Papka (U. Chicago)
  R. Stevens, E. Lusk, S. Coghlan (Argonne National Laboratory)
  M. Rynge, J. McGee, D. Reed (RENCI)
  S. Jha and H. Liu (CCT at LSU)
Storage Resources:
  S. Matsuoka (Tokyo Inst. of Technology)
  T. Kujiraoka, S. Ihara (Sun Microsystems, Japan)
  S. Vail, S. Cochrane (Sun Microsystems, USA)
