Challenges in Developing Highly Reliable HPC systems

1 Dec. 1, 2012. JST International Symposium on DVLSI Systems 2012. Challenges in Developing Highly Reliable HPC Systems. Koichiro Takayama, Fujitsu Limited.

2 K computer
Developed jointly by RIKEN and Fujitsu.
First computer to achieve 10 PFlops.
88,128 SPARC64™ VIIIfx CPUs.
6-dimensional mesh/torus interconnect (Tofu).
Ran for 29.5 hours continuously for the LINPACK benchmark, equivalent to MTBF > 296 node-years.
[Figure: packaging hierarchy of CPU, system board, rack, and K computer, with interconnect and IO nodes.]
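The MTBF figure on this slide can be reproduced with simple arithmetic: a failure-free run across all nodes accumulates node-hours, which gives a lower bound on per-node MTBF. The sketch below assumes one CPU per node and a 365-day year; the function name is illustrative only.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: 365-day year


def node_years(nodes: int, hours: float) -> float:
    """Accumulated failure-free operating time, expressed in node-years."""
    return nodes * hours / HOURS_PER_YEAR


# Figures from the slide: 88,128 CPUs running LINPACK for 29.5 hours.
run = node_years(nodes=88_128, hours=29.5)
print(f"{run:.1f} node-years without a failure")
```

Because no failure occurred during the run, the observed value is only a lower bound, which is why the slide writes "MTBF >".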

3 Outline of the talk
Dependability of SPARC64 VIIIfx (CPU), Interconnect, Board/Rack, System.
Summary.
Toward exa-scale computer.
Attributes of dependability: Availability, Reliability, Maintainability, Safety, Integrity.
Configuration: CPU, Interconnect, Board/Rack, System.
Topics in this area will be presented.

4 SPARC64 VIIIfx
8 cores, 6 MB shared L2$, clock 2 GHz.
Availability: instruction-set extension (HPC-ACE), virtual single processor (VISIMPACT).
Reliability: soft/hard-error resiliency.
Maintainability: error reporting.
[Die photo labels: Core0 to Core7, L2$ Data, L2$ Control, MAC, DDR3 interface, HSIO.]

5 HPC-ACE and VISIMPACT
HPC-ACE: extended instruction set to accelerate scientific calculation.
VISIMPACT: efficient multi-threading on a multi-core CPU.
[Figure labels: cores with arithmetic unit, SIMD unit, and registers; floating-point register extension and sector cache (HPC-ACE); hardware barrier synchronization and shared L2 cache (VISIMPACT); memory.]

6 HPC-ACE: Sector Cache
Software-controlled virtual local memory.
An application program can direct data to be stored in sector 1 by using compiler directives.
Reused data: use sector 1. Others: use sector 0.
Data in sector 1 cannot be replaced by other data.
[Figure: arrays A and B use sector 0; the frequently reused array C uses sector 1. Before: the L2 cache holds A1, B1, A2, B2 in sector 0 and C1, C2 in sector 1. After a cache miss on a B3 access: B3 replaces a sector 0 line (A2), while C1 and C2 in sector 1 are not replaced.]
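The replacement behaviour on this slide can be illustrated with a toy model. This is only a sketch: the fully associative layout, the per-sector LRU policy, and the class name `SectorCache` are assumptions made for illustration, not details taken from the slide.

```python
from collections import OrderedDict


class SectorCache:
    """Toy fully associative cache split into two sectors, each with its own LRU order."""

    def __init__(self, sector0_lines: int, sector1_lines: int):
        self.capacity = {0: sector0_lines, 1: sector1_lines}
        self.lines = {0: OrderedDict(), 1: OrderedDict()}

    def access(self, tag: str, sector: int = 0):
        """Touch a line in the given sector; return the evicted tag, or None."""
        lines = self.lines[sector]
        if tag in lines:
            lines.move_to_end(tag)  # hit: mark as most recently used
            return None
        evicted = None
        if len(lines) >= self.capacity[sector]:
            # Miss in a full sector: the victim comes from the same sector only.
            evicted, _ = lines.popitem(last=False)
        lines[tag] = True
        return evicted


cache = SectorCache(sector0_lines=4, sector1_lines=2)
for tag in ("C1", "C2"):               # frequently reused array C goes to sector 1
    cache.access(tag, sector=1)
for tag in ("A2", "A1", "B1", "B2"):   # order chosen so that A2 is least recently used
    cache.access(tag, sector=0)

evicted = cache.access("B3", sector=0)  # cache miss on B3
print("evicted:", evicted, "| sector 1 still holds:", list(cache.lines[1]))
```

The point of the model is that a miss in sector 0 can only displace sector 0 lines, so the reused data in sector 1 survives streaming accesses to other arrays.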

7 VISIMPACT
Virtual Single Processor by Integrated Multi-core Parallel Architecture.
A mechanism to reduce the number of processes by handling a multicore CPU as one processor for highly efficient parallel execution.
Hardware barrier: 10 times faster than a software barrier.
Shared L2 cache: prevents false sharing of data between cores.
Compiler technique: from the flat MPI model to the hybrid model.
[Figure labels: CPU, core, process, inter-core multi-threaded process, L2$, memory, VISIMPACT.]

8 VISIMPACT: Hybrid parallel model
Inter-CPU parallelization: process parallel execution.
Intra-CPU parallelization: thread parallel execution.
[Figure: Flat-MPI model versus hybrid model, each showing CPUs attached to the interconnect. P: process, T: thread, C: core.]
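To make the reduction in process count concrete, the sketch below compares the two models using the CPU and core counts given earlier in the talk (88,128 CPUs, 8 cores each). The function name and the one-process-per-CPU mapping in the hybrid case are illustrative assumptions.

```python
CPUS = 88_128        # from the K computer slide
CORES_PER_CPU = 8    # from the SPARC64 VIIIfx slide


def mpi_layout(cpus: int, cores_per_cpu: int, hybrid: bool):
    """Return (number of MPI processes, threads per process)."""
    if hybrid:
        # Hybrid model: one process per CPU, one thread per core.
        return cpus, cores_per_cpu
    # Flat MPI model: one single-threaded process per core.
    return cpus * cores_per_cpu, 1


flat = mpi_layout(CPUS, CORES_PER_CPU, hybrid=False)
hyb = mpi_layout(CPUS, CORES_PER_CPU, hybrid=True)
print(f"flat MPI: {flat[0]:,} processes x {flat[1]} thread")
print(f"hybrid:   {hyb[0]:,} processes x {hyb[1]} threads")
```

Both layouts use every core; the hybrid model simply cuts the number of communicating processes by the core count.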

9 Soft/Hard-Error Resiliency
[Figure: failure rate versus time, showing early failure, random failure, and wear-out failure phases.]
Early failure: three levels of production test (chip test, board test, and rack test).
Random failure: process shrinking and low voltage increase soft errors. Protection methods: ECC, redundancy, hardware retry, etc.
Water cooling reduces hard errors and delays wear-out failure.

10 Highly Reliable CPU
Data change due to collision with cosmic radiation (e.g. 0 → 1) leads to a wrong result.
SPARC64™ VIIIfx can detect errors at the hardware level and self-recover from them.
Error protection methods (arithmetic unit, register, cache): parity and residue, instruction pipeline retry, parity, ECC, parity + redundancy.
[Figure legend: 1-bit error correctable; 1-bit error detectable (stop the system); 1-bit error harmless.]
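The slide names parity, residue checking, and instruction retry without giving details. The following is a minimal software sketch of those three ideas, not a description of the actual hardware: the modulus 3, the fault-injection parameter, and all function names are assumptions for illustration.

```python
MODULUS = 3  # assumption: a mod-3 residue code; the slide does not give the modulus


def parity(word: int) -> int:
    """Parity bit of a word: 1 if it has an odd number of set bits."""
    return bin(word).count("1") & 1


def flip_bit(word: int, bit: int) -> int:
    """Model a single-bit upset by inverting one bit."""
    return word ^ (1 << bit)


def checked_add(a: int, b: int, fault_bit=None):
    """Add two operands, optionally corrupt the result, and run the residue check."""
    result = a + b
    if fault_bit is not None:
        result = flip_bit(result, fault_bit)  # injected transient fault
    expected = (a % MODULUS + b % MODULUS) % MODULUS
    return result, result % MODULUS == expected


def add_with_retry(a: int, b: int, faults):
    """Re-execute the add until the residue check passes.

    `faults` lists the fault injected on each attempt (None = fault-free).
    """
    for attempt, fault in enumerate(faults, start=1):
        result, ok = checked_add(a, b, fault)
        if ok:
            return result, attempt
    raise RuntimeError("persistent error: stop and report")


stored = 0b1011_0010
print("parity changes after a bit flip:", parity(stored) != parity(flip_bit(stored, 5)))
print("result, attempts:", add_with_retry(1234, 5678, faults=[7, None]))
```

A single flipped bit changes the result by a power of two, which is never a multiple of 3, so the mod-3 residue check catches every single-bit error in this model; the retry then succeeds once the transient fault is gone.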

11 Production Test
Chip test. Characteristics: short-time, self-loopback. Detected: consistent errors, defective HSIO. Undetected: random failure, marginal HSIO.
Board test. Characteristics: AC JTAG (IEEE ). Detected: bad solder joints. Undetected: ground open failure, imperfect shorts.
Rack test. Characteristics: inter-chip communication test. Detected: random failure, marginal HSIO, ground open failure, imperfect shorts.

12 Tofu Interconnect
6-dimensional mesh/torus network.
Availability: flexible resource mapping. Reliability: fault-tolerant routing. Maintainability.
1 Tofu unit = 12 nodes.
[Figure labels: SPARC64 VIIIfx, Interconnect Controller (ICC), ports X+, X-, Y+, Y-, Z+, Z-, B+, B-, axes XYZ and ABC.]

13 Failure Bypass by Extended Dimension Order
Default dimension order routing: X, Y, Z, A, B, C.
Extended dimension order routing: B, C, A, X, Y, Z, A, B, C.
By adding the leading B-C-A order, we have 12 routes between two nodes.
[Figure: a route bypassing a failure.]
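One way to see where the 12 routes come from is to enumerate them. The sketch below assumes the Tofu unit's A, B, C extents are 2, 3, and 2 (which multiply to the 12 nodes per unit stated in the talk), treats every axis as a plain mesh with no wraparound, and uses made-up coordinates; it is an illustration, not the routing hardware's algorithm.

```python
from itertools import product

ABC_SIZE = (2, 3, 2)         # assumed A, B, C extents of one Tofu unit (2 * 3 * 2 = 12)
X, Y, Z, A, B, C = range(6)  # index of each axis in a coordinate tuple


def walk(path, axis, target):
    """Append one-hop steps along a single axis until it reaches `target`."""
    node = list(path[-1])
    while node[axis] != target:
        node[axis] += 1 if target > node[axis] else -1
        path.append(tuple(node))


def route(src, dst, via_abc):
    """Extended dimension order: B, C, A to an intermediate position, then X, Y, Z, A, B, C."""
    path = [tuple(src)]
    a, b, c = via_abc
    for axis, target in ((B, b), (C, c), (A, a)):  # leading B-C-A stage
        walk(path, axis, target)
    for axis in (X, Y, Z, A, B, C):                # default dimension order
        walk(path, axis, dst[axis])
    return path


def usable_routes(src, dst, failed=frozenset()):
    """Map each intermediate (a, b, c) to its path, skipping paths through failed nodes."""
    routes = {}
    for via in product(*(range(n) for n in ABC_SIZE)):
        path = route(src, dst, via)
        if not set(failed).intersection(path):
            routes[via] = path
    return routes


src, dst = (0, 0, 0, 0, 0, 0), (3, 0, 0, 0, 0, 0)
print(len(usable_routes(src, dst)), "routes with no failure")
print(len(usable_routes(src, dst, failed={(1, 0, 0, 0, 0, 0)})), "routes avoiding the failed node")
```

Each choice of intermediate (a, b, c) position shifts the X-Y-Z leg onto a different node of the Tofu unit, so a failure on the default path leaves the other choices available.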

14 Flexible Resource Allocation
Interconnect-structure-aware resource allocation suppresses mutual interference between jobs.
The allocation unit is the Tofu unit (12 nodes).
The allocation area is a cuboid in the network.
The job allocation area is rotated to fit into the empty space of the system.
[Figure: Jobs A, B, C, and D placed as cuboids.]
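The rotate-and-fit idea can be sketched as a brute-force search. This is a simplified illustration under stated assumptions: the system is modelled as a small 3D grid of Tofu units without torus wraparound, and `find_placement` with its first-fit search order is hypothetical, not the scheduler's actual algorithm.

```python
from itertools import permutations, product


def find_placement(job_shape, system_shape, occupied):
    """Return (origin, rotated_shape) of the first free cuboid found, or None.

    Shapes are (x, y, z) extents in Tofu units; `occupied` is a set of used cells.
    """
    for rotated in sorted(set(permutations(job_shape))):  # every axis-aligned rotation
        ranges = [range(system_shape[i] - rotated[i] + 1) for i in range(3)]
        for origin in product(*ranges):  # empty if the rotation does not fit at all
            cells = {
                tuple(o + d for o, d in zip(origin, offset))
                for offset in product(*(range(n) for n in rotated))
            }
            if not cells & occupied:
                return origin, rotated
    return None


system = (4, 2, 1)
running = {(0, 0, 0), (0, 1, 0)}  # an existing job occupying a 1 x 2 x 1 cuboid
print(find_placement((1, 3, 1), system, running))
```

In this example the job is requested as 1 x 3 x 1, which does not fit along Y, so the search rotates it to 3 x 1 x 1 and places it next to the running job.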

15 Availability. Reliability: water cooling. Maintainability.
[Photo labels: system board, CPU/ICC, power IC, cooling plate for CPU, cooling plate for ICC, liquid coupler.]

16 Water Cooling for CPU, ICC, and Power IC
Lowering the semiconductor junction temperature extends the life-time of components.
Arrhenius law: lowering the junction temperature from 85 °C to 30 °C gives 60 to 100 times longer life-time.
The CPU, ICC, and power (POL) IC are water cooled to secure system reliability. Water cooling also reduces the leakage current of LSIs.
Arrhenius law: $L = A \exp\left(\frac{E_a}{k_B T}\right)$, where $L$ is the life-time, $A$ a constant, $E_a$ the activation energy, $k_B$ the Boltzmann constant, and $T$ the absolute temperature.
[Plot: relative life-time (log scale) versus junction temperature (°C) for Ea = 0.7~0.8, with a ×61 annotation.]
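The life-time gain follows from taking the ratio of the Arrhenius expression at two temperatures, where the constant A cancels: L(T_cool) / L(T_hot) = exp((Ea / k_B) (1/T_cool - 1/T_hot)). The sketch below evaluates that ratio. It assumes Ea is given in electron-volts and temperatures in degrees Celsius; the function name is illustrative.

```python
import math

K_B_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def lifetime_ratio(ea_ev: float, t_hot_c: float, t_cool_c: float) -> float:
    """Life-time at t_cool_c relative to t_hot_c under the Arrhenius law."""
    t_hot = t_hot_c + 273.15    # convert to absolute temperature
    t_cool = t_cool_c + 273.15
    return math.exp(ea_ev / K_B_EV * (1.0 / t_cool - 1.0 / t_hot))


for ea in (0.7, 0.8):
    print(f"Ea = {ea} eV: x{lifetime_ratio(ea, 85, 30):.0f}")
```

With Ea = 0.7 eV the ratio comes out at about 61, which agrees with the ×61 annotation on the plot; Ea = 0.8 eV gives a ratio of roughly 110.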

17 Availability: hot swapping. Reliability: system-level redundancy for I/O network and system control. Maintainability.

18 Redundant Structure and Hot Swap
Redundant structure: with the n+1 redundant structure, operation can continue without any functional degradation even if a component fails.
Hot swap: defective components can be replaced without turning off the rack. The impact of maintenance is localized and thus system availability is preserved.

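19 Redundant Structure and Hot Swap
Rack power source. Redundant structure: yes. Hot swap: yes.
Cooling fan. Redundant structure: yes. Hot swap: yes.
Service processor (SP). Redundant structure: yes (duplicated). Hot swap: yes.
System board (SB/IOSB). Redundant structure: no (bypass routing on B-axis port). Hot swap: yes.
CPU/ICC on SB/IOSB. Redundant structure: no (water cooled, error resiliency). Hot swap: by SB hot swap.
POL on SB/IOSB. Redundant structure: no (water cooled). Hot swap: by SB hot swap.
Other power sources on SB/IOSB. Redundant structure: yes. Hot swap: by SB hot swap.
DIMM on SB/IOSB. Redundant structure: no (ECC). Hot swap: by SB hot swap.
System volume. Redundant structure: RAID controller and power source are redundant; HDD is in RAID. Hot swap: yes (module).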

20 Redundant I/O Path
Redundant I/O path from IOSB to RAID for system availability.
[Figure labels: Rack-0, Rack-1.]

21 System-Level Redundancy
[Figure labels: frontend server, TOFU IO network, IO node, local disk, IB SW, server, global disk, redundant management network, job management node, control node, system integration node, redundant control network, maintenance servers.]

22 System-Level Redundancy
[Figure labels, software: Linux-based OS; high performance file system management; compiler, parallel language, tools/library, virtual tools; system operation and job management; system management; Tofu partitioning management; hard maintenance.]
[Figure labels, hardware: frontend server, TOFU IO network, IO node, local disk, IB SW, server, global disk, redundant management network, job management node, control node, system integration node, redundant control network, maintenance servers.]

23 System-Level Redundancy
Redundant parts are shown in red. All operation-related components are redundant!
[Figure labels: frontend server, TOFU IO network, IO node, local disk, IB SW, server, global disk, redundant management network, job management node, control node, system integration node, redundant control network, maintenance servers.]

24 CPU: a high performance, low power, multicore, scalar processor which can realize a large scale system while preserving reliability.
Interconnect: a high speed, high reliability, low latency, low power interconnect (inter-CPU network).
Board/Rack: water cooling enabling both longer life-time and low power consumption.
System: a compiler able to bring out the full potential of the processor by parallelizing hundreds of thousands of processes, and management middleware for high availability over a system operating hundreds of thousands of nodes.

25 Toward Exa-Scale Computer
Power gap: in 2011~2012 the power efficiency trend changed. An advanced low-power technique will be required.
K computer (2011): 10 PF, 12.7 MW.
Exa-scale supercomputer (2018~20): 1 EF, 20-30 MW.
Improve power efficiency by a factor of 60.
GF: GigaFlops, PF: PetaFlops, EF: ExaFlops.
[Plot: power efficiency of successive supercomputers (1 GF/W and 10 GF/W marked) versus performance (1.E+14 to 1.E+18), with the change in the trend marked.]
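The size of the power gap can be checked from the figures on this slide. The sketch below is plain arithmetic on the quoted performance and power numbers; the function name is illustrative.

```python
def gflops_per_watt(flops: float, watts: float) -> float:
    """Power efficiency in GigaFlops per watt."""
    return flops / watts / 1e9


k_computer = gflops_per_watt(10e15, 12.7e6)  # 10 PF at 12.7 MW
exa_20mw = gflops_per_watt(1e18, 20e6)       # 1 EF at 20 MW
exa_30mw = gflops_per_watt(1e18, 30e6)       # 1 EF at 30 MW

print(f"K computer:         {k_computer:.2f} GF/W")
print(f"Exa-scale at 20 MW: {exa_20mw:.1f} GF/W ({exa_20mw / k_computer:.1f}x)")
print(f"Exa-scale at 30 MW: {exa_30mw:.1f} GF/W ({exa_30mw / k_computer:.1f}x)")
```

At the 20 MW end of the target range the required improvement is about 63 times, consistent with the slide's figure of roughly 60; at 30 MW it is about 42 times.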

26 Toward Exa-Scale Computer
In order to achieve the requirements of power, performance, and reliability, system-wide co-design will be needed.
Hardware architecture: CPU, network, cooling, memory, storage, servers.
Software architecture: application, OS, driver, compiler.
System-wide co-design for dependability.

Fujitsu's Approach to Application Centric Petascale Computing. 2nd Nov. 2010. Motoi Okuda, Fujitsu Ltd. Agenda: Japanese Next-Generation Supercomputer, K Computer; Project Overview; Design Targets; System Overview