MAHA - Supercomputing System for Bioinformatics - 2013.01.29
Outline
1. MAHA HW
2. MAHA SW
3. MAHA Storage System
ETRI HPC R&D Area - Overview

Computing HW (MAHA System HW)
- Rpeak: 0.3 PetaFLOPS by 2015; GPGPU/MIC based (GPGPU/MIC/memory nodes)
- High-speed, low-latency network
  - Management: 1 Gbps Ethernet
  - Computational: 40 Gbps InfiniBand
  - Computational: 128/256 Gbps PCIe link

I/O (SSD+HDD File System SW)
- Max. 700 Gbps / 1M IOPS; 40 SSD storage servers (equivalent to 600 HDD servers)
- Power saving: dynamic power control of unused HDDs/servers (spin down / sleep / power off according to access rate)

Bio Application (Parallelized genome analysis SW)
- Parallelized genome analysis pipeline: optimized genome indexing, parallelized sequence mapping, parallelized SNP extraction and analysis, visualization
- Protein folding analysis SW: 3-dimensional protein mapping, protein docking analysis and DB

System SW
- Bio workflow mgmt.: HPC environment supporting bio workflows; ease of use
- Heterogeneous resource mgmt.: resource mgmt. for bio applications; performance improvement
- Integrated cluster mgmt.: single point of mgmt. for the MAHA system; simplified deployment & mgmt.

<MAHA System Architecture> <MAHA System Layout - Plan>
MAHA Supercomputing System (Jan. 2013)

Heterogeneous supercomputer
- 104 TeraFLOPS with CPUs and accelerators (GPGPU, MIC)
  - 53.2 TeraFLOPS from compute nodes based on GPGPU (NVIDIA M2090)
  - 51.3 TeraFLOPS from compute nodes based on MIC (Intel Xeon Phi)
- Number of cores: over 36,000
- 54 compute nodes, 3 management nodes, 19 storage nodes
- 194 TeraBytes of storage (SSD: 34 TB, HDD: 160 TB)

Building blocks
- CPU: Intel Xeon E5, 8 cores @ 2.6 GHz, 166.4 GFLOPS/CPU
- GPGPU: NVIDIA Fermi (M2090), 512 cores @ 1.3 GHz, 665 GFLOPS/GPGPU
- MIC: Intel Xeon Phi, > 50 cores, 1 TFLOPS/MIC
- GPGPU node: dual CPU, dual GPGPU, 32 GB memory, 1.67 TFLOPS
- MIC node: dual CPU, dual Xeon Phi, 32 GB memory, 2.3 TFLOPS
- GPGPU subrack: 5 nodes, 160 GB memory, 8.3 TFLOPS
- MIC subrack: 5 nodes, 160 GB memory, 11.5 TFLOPS
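The per-node peak figures above follow directly from the component specs. A minimal sketch of the arithmetic; the 8 FLOPs/cycle factor for the Xeon E5 core is our assumption (the slide only gives the 166.4 GFLOPS result):

```python
# Theoretical peak (Rpeak) arithmetic behind the node figures above.
# Assumed: 8 double-precision FLOPs/cycle per Xeon E5 core (AVX);
# accelerator figures are taken directly from the slide.

def cpu_peak_gflops(cores=8, ghz=2.6, flops_per_cycle=8):
    """Peak double-precision GFLOPS of one Xeon E5 socket."""
    return cores * ghz * flops_per_cycle

GPGPU_GFLOPS = 665    # NVIDIA M2090 (from the slide)
MIC_GFLOPS = 1000     # Intel Xeon Phi, ~1 TFLOPS (from the slide)

cpu = cpu_peak_gflops()                     # 166.4 GFLOPS
gpgpu_node = 2 * cpu + 2 * GPGPU_GFLOPS     # dual CPU + dual GPGPU
mic_node = 2 * cpu + 2 * MIC_GFLOPS         # dual CPU + dual Xeon Phi

print(f"CPU:        {cpu:.1f} GFLOPS")
print(f"GPGPU node: {gpgpu_node / 1000:.2f} TFLOPS")  # ~1.66 (slide: 1.67)
print(f"MIC node:   {mic_node / 1000:.2f} TFLOPS")    # ~2.33 (slide: 2.3)
```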
MAHA Supercomputing System (Jan. 2013)

Network
- Management network: 1/10 Gbps Ethernet for system management
- Computational network: 40 Gbps QDR InfiniBand

Nodes (MAHA Supercomputing System, 100 TeraFLOPS)
- 1 user login node, 2 management nodes: dual CPU (Intel Xeon E5), 332 GigaFLOPS
- Accelerated compute node: dual CPU (Intel Xeon E5, 332 GigaFLOPS), 32 GB memory, with dual GPGPU (NVIDIA M2090, 1,330 GigaFLOPS) or dual MIC (Intel Xeon Phi, > 2,000 GigaFLOPS)

Storage servers
- SSD storage server: motherboard with RAID controller (4 SATA2 ports, PCI-E x4); SSDs on the backplane via 3 Gbps SATA connections
- MAID storage server: motherboard with RAID controller (16 SATA2 ports, PCI-E x4); HDDs on the backplane via 3 Gbps SATA connections
MAHA Supercomputing System Performance (Jan. 2012)

Hybrid HPL (High Performance Linpack)
- Average Rmax = 29.9 TeraFLOPS (avg. 29.901 TFLOPS, max. 30.310 TFLOPS)
- System efficiency* = 56.2%
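The quoted efficiency is consistent with Rmax divided by the 53.2 TFLOPS GPGPU-partition Rpeak from the hardware slide. A quick check; treating 53.2 TFLOPS as the denominator is our interpretation:

```python
# System efficiency = Rmax / Rpeak. We assume the 53.2 TFLOPS
# GPGPU-partition Rpeak was the denominator used on the slide.
rmax_tflops = 29.901
rpeak_tflops = 53.2

efficiency = rmax_tflops / rpeak_tflops
print(f"Efficiency: {efficiency:.1%}")  # 56.2%
```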
MAHA Supercomputing Facility
- Server room: 42 m²
- Hybrid cooling: cold outside air or internal air conditioning
<MAHA system photos>
MAHA Supercomputing Roadmap
- MAHA Supercomputing System: 200 TeraFLOPS in 2013
- By 2015, MAHA will reach 300 TeraFLOPS (Rpeak)
- Roadmap: 2011: 50 TFLOPS / 2012: 100 TFLOPS / 2013: 200 TFLOPS / 2014: 250 TFLOPS / 2015: 300 TFLOPS
Outline
1. MAHA HW
2. MAHA SW
3. MAHA Storage System
MAHA System Workplace: Objective

HPC software solution specially designed for bioinformatics applications
- For end users (especially in the field of bioinformatics)
  - User-friendly HPC environment supporting bio workflows
  - End users can easily define workflows of bio applications and then efficiently execute them on HPC systems
- For system administrators
  - Integrated cluster management tool

* MIC: product based on the Intel Many Integrated Core architecture
MAHA System Workplace: Features & Benefits

Features
- User-friendly HPC environment supporting bio workflows
  - Easy execution configuration with the aid of workflow analysis
  - Workflow transformation for efficient execution in an HPC environment
- Performance improvement through support for the execution of bio applications
- Single point of management for the MAHA system & services

Benefits
- For end users: easy to use even for non-experts; improved performance
- For system administrators: simplified deployment & management
MAHA System Workplace: Function (1/3)

Bio Workflow Management for the HPC Environment
- Bio workflow definition & execution management
  - XML-based workflow model
  - Web UI-based workflow lifecycle management
- Bio workflow execution engine
  - Transforms a user-defined workflow into multiple HPC jobs
  - Cooperates with the resource management software
- Bio workflow analysis tool
  - Helps find out the characteristics and resource requirements of a workflow
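The execution-engine idea above (an XML workflow flattened into dependent HPC jobs) can be sketched minimally as follows. The XML schema, element names, and commands here are hypothetical; the slides only say the model is XML-based:

```python
# Hedged sketch: parse a hypothetical XML workflow definition and turn
# each <step> into an HPC job record with an explicit dependency.
import xml.etree.ElementTree as ET

WORKFLOW_XML = """
<workflow name="ngs-pipeline">
  <step id="align"   cmd="bwa mem ref.fa sample.fq" nodes="4"/>
  <step id="sort"    cmd="samtools sort out.bam"    nodes="1" after="align"/>
  <step id="mpileup" cmd="samtools mpileup in.bam"  nodes="1" after="sort"/>
</workflow>
"""

def workflow_to_jobs(xml_text):
    """Turn each <step> into a job dict; 'after' becomes a dependency."""
    root = ET.fromstring(xml_text)
    jobs = []
    for step in root.iter("step"):
        jobs.append({
            "id": step.get("id"),
            "cmd": step.get("cmd"),
            "nodes": int(step.get("nodes", "1")),
            "depends_on": step.get("after"),  # None for the first step
        })
    return jobs

for job in workflow_to_jobs(WORKFLOW_XML):
    print(job["id"], "depends on", job["depends_on"])
```

A real engine would hand each job (with its dependency edge) to the resource manager rather than print it.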
MAHA System Workplace: Function (2/3)

Resource Management for Bio Applications
- Job scheduling & resource allocation
  - End user's view: process a workflow as fast as possible
  - System's view: process as many workflows as possible in a given time
- Support for execution of bio applications
  - Solves performance problems by analyzing the characteristics of bio applications
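The tension between the two views above can be shown with a toy single-node, non-preemptive model (entirely our own illustration; runtimes are hypothetical): ordering for system throughput delays an individual user's long workflow.

```python
# Toy illustration of user view vs. system view in scheduling.
# Three workflows share one node; runtimes are made-up numbers.

def completion_times(jobs):
    """Run jobs back to back; return each job's finish time."""
    t, finish = 0, {}
    for name, runtime in jobs:
        t += runtime
        finish[name] = t
    return finish

jobs = [("wfA", 10), ("wfB", 2), ("wfC", 3)]

fcfs = completion_times(jobs)                              # arrival order
sjf = completion_times(sorted(jobs, key=lambda j: j[1]))   # shortest first

def avg(finish):
    return sum(finish.values()) / len(finish)

# Shortest-job-first improves average completion time (system view)
# but pushes the long workflow wfA from t=10 to t=15 (that user's view).
print("FCFS avg:", avg(fcfs), "wfA at", fcfs["wfA"])   # 12.33..., 10
print("SJF  avg:", avg(sjf), "wfA at", sjf["wfA"])     # 7.33..., 15
```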
MAHA System Workplace: Function (3/3)

Integrated Cluster Management for the MAHA System & Services
- Provisioning management
- Cluster operation management
- Monitoring management
- Service management (including the MAHA System Workplace)
- Web UI for the MAHA System Workplace
Outline
1. MAHA HW
2. MAHA SW
3. MAHA Storage System
Objective of MAHA-FS

Distributed file system for HPC applications, especially for genome analysis
- Upgrades the performance of GLORY-FS (developed by ETRI)
- Performance competitive with Lustre (700 Gbps, 1 million IOPS)
- Compatible with various existing genome analysis applications
Features and Benefits of MAHA-FS

Features
- Hybrid storage
  - High performance per cost with SSD (700 Gbps, 1 million IOPS)
  - High capacity per cost with HDD (more than petabytes)
- Low power consumption
  - Reduces storage power consumption by cutting power to unaccessed HDDs (up to 50%)

Benefits
- For genome analysts
  - Reduced TCO for large-scale storage
  - No need to modify their existing genome analysis applications
- For administrators
  - Easy management of peta-scale storage
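The access-rate-driven power control described on the earlier overview slide (spin down / sleep / power off according to access rate) can be sketched as a simple state policy. The thresholds and state names here are our assumptions; the slides give no concrete numbers:

```python
# Hedged sketch of an access-rate-based HDD power policy.
# Thresholds below are illustrative, not from the slides.
from enum import Enum

class PowerState(Enum):
    ACTIVE = "active"        # normal operation
    SPUN_DOWN = "spun_down"  # platters stopped, electronics on
    SLEEP = "sleep"          # low-power standby
    OFF = "off"              # power cut entirely

def next_state(accesses_per_hour):
    """Map an HDD's recent access rate to a target power state."""
    if accesses_per_hour >= 10:
        return PowerState.ACTIVE
    if accesses_per_hour >= 1:
        return PowerState.SPUN_DOWN
    if accesses_per_hour > 0:
        return PowerState.SLEEP
    return PowerState.OFF

for rate in (100, 5, 0.2, 0):
    print(rate, "->", next_state(rate).value)
```

In practice the file system must also mask spin-up latency, e.g. by replicating hot data onto drives that stay active.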
Performance & Capacity Considerations for the NGS Workload

Peak 933 MB/s is required for one human genome analysis (I/O speed equivalent to 10 SATA HDDs); total runtime on the compute node: 3d 14h 5m 32s

I/O bandwidth per pipeline stage (against the shared storage, MAHA-FS):
- Align: peak 681.63 MB/s, avg 41.82 MB/s
- Sample: peak 933.50 MB/s, avg 332.83 MB/s
- Sort: peak 68 MB/s, avg 19.6 MB/s
- Merge: peak 52.25 MB/s, avg 11.63 MB/s
- mpileup: peak 76.38 MB/s, avg 2.93 MB/s

Data sets
- Reference genome: 11 GB
- Source data: 218 GB
- Temporary data: 96 GB
- Intermediate result data: ?? GB
- Final result data: 819 GB

Total 1.2 TB of capacity is required for one human genome analysis.

NGS: Next Generation Sequencing
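The capacity and bandwidth figures above are easy to sanity-check. A quick sketch; the ~95 MB/s sustained rate per SATA HDD is our assumption, and the unknown "??" intermediate data is simply excluded:

```python
# Sanity check of the sizing figures above.
datasets_gb = {
    "reference": 11,
    "source": 218,
    "temporary": 96,
    "final": 819,
    # intermediate result data unknown ("?? GB"), excluded
}
total_tb = sum(datasets_gb.values()) / 1000
print(f"{total_tb:.2f} TB")  # ~1.14 TB; with intermediate data -> ~1.2 TB

# The "10 SATA HDD" equivalence for the 933 MB/s peak, assuming
# ~95 MB/s sustained per drive (our assumption):
print(round(933.5 / 95), "drives")
```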
Storage Architecture Considerations for the NGS Workload

MAHA-FS architecture
- MAHA-FS clients (compute nodes) reach the metadata and storage servers over a 1G/10G Ethernet or InfiniBand fabric
- Metadata server and storage servers built on commodity parts: server, chassis, RAID controller, 1G/10G/40G NIC, SATA SSDs and/or HDDs
- x86-based storage servers and SATA HDDs for lower TCO
- Resiliency supported by replication/migration built into the MAHA-FS software

Existing HPC architecture (Lustre; source: NetApp)
- Lustre clients (compute nodes) over a 1G/10G Ethernet or InfiniBand fabric
- Lustre metadata server with a metadata storage array (NetApp E2624)
- Lustre storage servers (active/active) with data storage arrays (NetApp E5460/DE6600, E5424/DE5600) of SAS HDDs attached over a Fibre Channel SAN
- External data storage arrays and the Fibre Channel fabric are the main cause of the high cost
- Resiliency supported by redundant server and storage hardware (no resiliency support within Lustre itself)
MAHA Storage H/W Test-bed Status (2012)

Commodity SSD storage server
- Motherboard with 3 RAID controllers (8 SATA2 ports each, PCI-E x4); SSDs on the backplane via 3 Gbps SATA connections
- 10G Ethernet / 40G InfiniBand (VPI adapter)
- Total capacity built: 34 TB (192 SATA SSDs, 10 servers)
- Total capacity planned: 85 TB (2015)

Commodity HDD storage server
- Motherboard with 1 RAID controller (16 SATA2 ports, PCI-E x4); HDDs on the backplane via 3 Gbps SATA connections
- 10G Ethernet / 40G InfiniBand (VPI adapter)
- Total capacity built: 160 TB (160 SATA HDDs, 9 servers)
- Total capacity planned: 400 TB (2015)

Build timeline: 2011, 2012
MAHA File System SW Architecture

MAHA-FS client (compute node)
- User-level file system based on FUSE: no kernel patch or dependency (only the FUSE kernel module on the Linux client)
- FS client (with LMD interface and hybrid I/O) sits above VFS & cache via /dev/fuse

MAHA-FS metadata server (MDS)
- Metadata server core with the light-weight metadata engine (NMD): a Berkeley DB-like engine optimized for file system metadata (10 times faster)
- Light-weight metadata access protocol (NFS-like protocol) toward clients
- Management protocol with heartbeat; MySQL and NMD on each metadata server

MAHA-FS data server
- Data server core running over ext4 on the Linux kernel
- Hybrid I/O protocol: dynamic selection between two I/O protocols based on workload characteristics
  - Sequential-I/O-optimized protocol
  - Random-I/O-optimized protocol
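The hybrid I/O selection above can be sketched as a simple classifier over recent request offsets. The heuristic, the 50% threshold, and the protocol names are our assumptions; the slides do not describe the actual mechanism:

```python
# Hedged sketch: classify a request stream as sequential or random
# from its offsets, then pick the matching I/O protocol.
def choose_protocol(offsets, block_size=1 << 20):
    """Pick an I/O protocol from recent request offsets (bytes)."""
    if len(offsets) < 2:
        return "sequential"
    # Count requests that directly follow their predecessor.
    contiguous = sum(
        1 for a, b in zip(offsets, offsets[1:]) if b - a == block_size
    )
    ratio = contiguous / (len(offsets) - 1)
    return "sequential" if ratio >= 0.5 else "random"

streaming = [i * (1 << 20) for i in range(8)]   # back-to-back 1 MB reads
scattered = [0, 40 << 20, 3 << 20, 77 << 20]    # seek-heavy pattern

print(choose_protocol(streaming))  # sequential
print(choose_protocol(scattered))  # random
```

This matches the NGS profile shown earlier: alignment and sorting phases stream large files, while metadata-heavy phases look random.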
Overall Performance of MAHA-FS v1.0

Overall performance results (Jan. 2013), measured against the 2012 target metrics:
- Aggregate sequential I/O: > 70 Gbps
- Aggregate random I/O: > 20 million IOPS
- Metadata performance: > 100,000 open/sec, > 50,000 create/sec
Micro-benchmark Results of MAHA-FS v1.0

Metadata performance (1st result, Sep. 2012)
- About 4 times faster than Lustre, but needs more careful examination
- MAHA-FS: file creation 52,437 ops/sec, file open 116,005 ops/sec
- vs. Lustre: 3.4 times better (creation), 4.6 times better (open)
- Lustre reference figures: 15,000 ops/sec (measured), 25,000 ops/sec (announced), 9,000 ops/sec (measured)
- Looks faster, but needs a closer look
Micro-benchmark Results of MAHA-FS v1.0

Data I/O performance (Sep. 2012, Jan. 2013)
- Still struggling to achieve better performance; additional tuning and testing is ongoing
NGS Pipeline Benchmark Results of MAHA-FS v1.0

NGS-pipeline benchmark results (1st result, Dec. 2012)
- Slightly faster than Lustre, but slower than NFS
- Benchmark storage environments:
  - Six NGS pipeline analysis applications (NGS1-NGS6) against two isolated NFS servers (NFS1, NFS2)
  - Six NGS pipeline analysis applications (NGS1-NGS6) against a parallel/distributed file server (Lustre/MAHA-FS) with two data servers (DS1, DS2)
- Comparison of file systems for the NGS-pipeline workload (runtime in hours)
- Just the 1st result; no more, no less.
Thank You