System Software Stack for the Next Generation High-Performance Computers


Yutaka Ishikawa 1,2   Atsushi Hori 2   Gerofi Balazs 1   Masamichi Takagi 3   Akio Shimada 2   Masaaki Shimizu 4   Yuji Saeki 4   Tomoki Shirasawa 5   Gou Nakamura 6   Shinji Sumimoto 7   Tomohito Otawa 7

Abstract: A system software stack, consisting of an OS kernel, a low-level communication library, an MPI library, and a file I/O library with a hierarchical storage system, has been designed and implemented for two types of clusters: a PC cluster whose compute nodes are PC servers with manycore CPUs, and a manycore-based cluster. In this paper, the machine environment being considered and the challenges in providing a highly efficient and scalable system are described first, and then the current research achievements are reported.

1. Introduction

Manycore processors, such as the Tilera TILE-Gx and the Intel Xeon Phi coprocessor, the latter delivering roughly 1 TFlops of peak performance per device, are expected to be the building blocks of next-generation high-performance computers [1]. Such systems will have to deliver performance at the hundreds-of-PFlops scale within a power budget on the order of 20 to 30 MW, while continuing to provide the Linux and MPI environment on which existing applications depend.

Table 1  Specifications of the Assumed Hardware Environment and the K Computer

                                 Assumed environment                  K computer
  Node peak performance          3-5 TF                               128 GF
  Total memory capacity          4-10 PB                              1.26 PB
  Total peak performance         ... PFlops                           10.62 PF
  Interconnect link bandwidth    10-20 GB/sec                         5 GB/sec
  Interconnect topology          6-D Mesh/Torus, Dragonfly, or ...    6-D Mesh/Torus
  File system capacity           1 EB                                 11 PB (local) + 30 PB (global)

for (day = 1; day < 365*10; day++) {
    /* compute one simulated day */
    /* write the day's results, on the order of 1 TB, to the file system */
}

Fig. 1  File Access Pattern in COCO

The application driving the file I/O design is the ocean model COCO [2]. A ten-year simulation on a grid of roughly 40,000 x 40,000 horizontal points writes output on the order of a terabyte at a time, so the total output of a single run reaches the petabyte scale; Fig. 1 sketches this access pattern.

On the compute-node side, the memory hierarchy is the dominant concern. An Intel Xeon Phi (Knights Corner) core has only 32 KB of L1 and 512 KB of L2 cache and no L3, whereas a Xeon E5 (Sandy Bridge) processor provides 256 KB of per-core L2 and a 20 MB shared L3. Limited TLB reach (even with 2 MB large pages), cache associativity, and NUMA effects must therefore be taken into account by the OS when it maps application memory and places threads.
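To make the access pattern of Fig. 1 concrete, the following self-contained sketch writes one result file per simulated day. The output directory, file naming, and record sizes are hypothetical values chosen only to illustrate the pattern; this is not COCO's actual I/O code.

#include <stdio.h>
#include <stdlib.h>

/* Illustrative only: one result file per simulated day, mimicking the  */
/* terabyte-per-day pattern of Fig. 1. DAYS, NRECS, REC_SIZE, and the   */
/* "out/" directory are hypothetical parameters, not taken from COCO.   */
#define DAYS     (365 * 10)        /* ten-year simulation                  */
#define REC_SIZE (1024 * 1024)     /* 1 MiB per write call                 */
#define NRECS    16                /* kept small here; ~10^6 in a real run */

int main(void)
{
    static char rec[REC_SIZE];     /* stands in for one output record */

    for (int day = 1; day < DAYS; day++) {
        /* ... compute one simulated day here ... */

        char path[256];
        snprintf(path, sizeof(path), "out/coco-day-%05d.dat", day);
        FILE *f = fopen(path, "wb");   /* assumes the "out/" directory exists */
        if (!f) { perror("fopen"); return EXIT_FAILURE; }
        for (int r = 0; r < NRECS; r++)
            fwrite(rec, 1, REC_SIZE, f);   /* repeated large sequential writes */
        fclose(f);
    }
    return 0;
}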

Two further challenges shape the OS design. The first is OS noise: kernel mechanisms and services of a full Linux kernel, such as RCU processing and background daemons, steal cycles from the application, and in SPMD (Single Program Multiple Data) programs even small per-node delays accumulate into large slowdowns at scale [3]. The second is the variety of programming models: HPC applications are written with MPI, OpenMP, hybrid MPI+OpenMP, and increasingly PGAS languages such as UPC [4], and the system software has to support all of them efficiently.

Fig. 2  System Software Stack
Fig. 3  Examples of Kernel Organizations

Fig. 2 shows the system software stack under development, and Fig. 3 shows possible kernel organizations for a manycore node. A key requirement is Linux API compatibility: whatever kernel runs the application must provide the Linux API so that existing applications, libraries, and tools continue to work, while functionality that is not performance critical can be left to a full Linux kernel.

The target PC cluster consists of compute nodes with Sandy Bridge (Xeon E5) CPUs and Intel Xeon Phi coprocessors, connected by InfiniBand. In the standard Intel software stack, the Xeon Phi runs its own Linux and is reached from the host over PCI Express using TCP/IP, ssh, and NFS. Of the kernel organizations in Fig. 3, running a full Linux kernel on the manycore chip and running a new kernel that re-implements the Linux API were both considered; the design adopted here pairs a lightweight kernel, McKernel, with a Linux kernel that handles whatever the lightweight kernel leaves out. Related hybrid and lightweight kernel designs include HIDOS [5], SHIMOS [6], and MEE [7].

Fig. 4  Implementation on Xeon Phi (1)
Fig. 5  Implementation on Xeon Phi (2)
Fig. 6  McKernel

Two implementations on the Xeon Phi are being developed (Figs. 4 and 5): the Attached McKernel, in which Linux runs on the host and McKernel runs on the Xeon Phi, and the Builtin McKernel, in which Linux and McKernel run side by side on the same manycore processor, each on its own set of cores. McKernel is built on IHK (Interface for Heterogeneous Kernel), shown in Fig. 6, which consists of the IHK-Linux driver on the Linux side, managing the Xeon Phi and its DMA engine, the IHK-cokernel on the Xeon Phi side, on which McKernel runs, and IHK-IKC, the inter-kernel communication layer.
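IHK-IKC itself is not detailed in this report. Purely as an illustration of the message-queue style of inter-kernel communication that such a layer provides, the sketch below implements a minimal single-producer/single-consumer queue in ordinary user-space C; the struct and function names are hypothetical, and this is not the IHK API, in which the queue would live in memory visible to both Linux and McKernel.

#include <stdatomic.h>
#include <stdio.h>
#include <string.h>

/* Minimal single-producer/single-consumer message queue. Purely       */
/* illustrative of the inter-kernel communication idea; the names are  */
/* hypothetical and this is not IHK-IKC's actual interface.            */
#define QDEPTH 8
#define MSGLEN 64

struct ikc_queue {                 /* hypothetical name                 */
    _Atomic unsigned head;         /* next slot the sender writes       */
    _Atomic unsigned tail;         /* next slot the receiver reads      */
    char msg[QDEPTH][MSGLEN];
};

static int ikc_send(struct ikc_queue *q, const char *m)
{
    unsigned h = atomic_load(&q->head);
    if (h - atomic_load(&q->tail) == QDEPTH)
        return -1;                 /* queue full                        */
    strncpy(q->msg[h % QDEPTH], m, MSGLEN - 1);
    q->msg[h % QDEPTH][MSGLEN - 1] = '\0';
    atomic_store(&q->head, h + 1); /* publish the message               */
    return 0;
}

static int ikc_recv(struct ikc_queue *q, char *out)
{
    unsigned t = atomic_load(&q->tail);
    if (t == atomic_load(&q->head))
        return -1;                 /* queue empty                       */
    memcpy(out, q->msg[t % QDEPTH], MSGLEN);
    atomic_store(&q->tail, t + 1); /* free the slot                     */
    return 0;
}

int main(void)
{
    static struct ikc_queue q;     /* zero-initialized                  */
    char buf[MSGLEN];
    ikc_send(&q, "delegated system call request");   /* e.g. McKernel -> Linux */
    if (ikc_recv(&q, buf) == 0)
        printf("received: %s\n", buf);
    return 0;
}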

System calls and services that McKernel does not implement are delegated over IKC (Inter Kernel Communication) to mcctrl, a kernel module on the Linux side, and the results are returned to the application running on McKernel. Because the Linux API is preserved in this way, the GNU libc, the OpenMP runtime, and the Intel tool chain run on McKernel without modification [8]. The Attached McKernel has been evaluated on the Xeon Phi [9], and an OS-assisted hierarchical memory management scheme built on it has shown promising preliminary results for stencil computation, where TLB behaviour is critical [9]. For PGAS-style intra-node communication, PVAS (Partitioned Virtual Address Space) [10], developed at AICS, places cooperating processes in a single virtual address space, reducing address translation (TLB) overhead for intra-node data exchange. Thread and I/O affinity on NUMA nodes is a further issue: OpenMP thread placement and the placement of I/O handling relative to memory and devices have a large impact on performance [11].

for (i = 0; i < N; i++)
    MPI_Recv_init(rbuf[i], ..., &req[i]);
for (i = 0; i < N; i++)
    MPI_Send_init(sbuf[i], ..., &req[i+N]);
do {
    /* Computation */
    MPI_Startall(N*2, req);
    /* Computation */
    MPI_Waitall(N*2, req, stat);
    /* ... */
} while ( ... );

Fig. 7  An Example of Persistent Communication

On the communication side, MPI persistent communication (Fig. 7) is being revisited as the basis of the low-level communication design, targeting the MPI 3.0 time frame and modern interconnects such as Mellanox ConnectX HCAs.
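The listing in Fig. 7 is abbreviated. The following is a complete, runnable version of the same pattern for any number of ranks arranged in a ring, written against the standard MPI persistent-communication calls; the message length, tags, and iteration count are arbitrary example values, not taken from the paper.

/* Complete version of the Fig. 7 pattern: each rank sets up N persistent */
/* send and N persistent receive requests once, then restarts them every  */
/* iteration with MPI_Startall and completes them with MPI_Waitall.       */
#include <mpi.h>
#include <stdio.h>

#define N     4
#define LEN   1024
#define ITERS 10

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int right = (rank + 1) % size;          /* send to the right neighbour  */
    int left  = (rank + size - 1) % size;   /* receive from the left one    */

    static double sbuf[N][LEN], rbuf[N][LEN];
    MPI_Request req[2 * N];
    MPI_Status  stat[2 * N];

    for (int i = 0; i < N; i++)
        MPI_Recv_init(rbuf[i], LEN, MPI_DOUBLE, left,  i, MPI_COMM_WORLD, &req[i]);
    for (int i = 0; i < N; i++)
        MPI_Send_init(sbuf[i], LEN, MPI_DOUBLE, right, i, MPI_COMM_WORLD, &req[i + N]);

    for (int iter = 0; iter < ITERS; iter++) {
        /* ... computation filling sbuf ... */
        MPI_Startall(2 * N, req);           /* restart all persistent requests */
        /* ... computation overlapping with communication ... */
        MPI_Waitall(2 * N, req, stat);
    }

    for (int i = 0; i < 2 * N; i++)
        MPI_Request_free(&req[i]);
    if (rank == 0) printf("done\n");
    MPI_Finalize();
    return 0;
}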

4.1.3 Persistent Communication

In the persistent communication pattern of Fig. 7, MPI_Send_init and MPI_Recv_init create persistent requests (the MPI_Request array req) once, and each iteration merely restarts them with MPI_Startall and completes them with MPI_Waitall. Compared with MPI_Isend/MPI_Irecv this has two advantages: i) the buffers and communication parameters are registered once, so the per-iteration setup cost disappears, and ii) at MPI_Startall the library sees all of the requests (four in the example) together, so it can schedule and program their DMA transfers as a group. Existing low-level communication layers such as PAMI [12], Portals4 [13], ARMCI [14], and GASNet [15] are also being studied as candidates to sit below MPI.

For the Attached McKernel configuration, DCFA [16] provides direct InfiniBand communication from the Xeon Phi. Normally the InfiniBand HCA is controlled by the host (Sandy Bridge) CPU, so traffic to and from the Xeon Phi has to involve the host; DCFA instead lets the Xeon Phi drive the HCA across PCI Express directly, so that Xeon Phi cards communicate over InfiniBand without the host CPU on the critical path [17]. DCFA-MPI [17], an MPI library derived from YAMPII [18], runs on top of DCFA on the Xeon Phi. Persistent Remote DMA Communication (PRDMA), which maps MPI persistent communication directly onto remote DMA, has been implemented on the FX10 [19]. These efforts are being consolidated into a Low Level Communication Library (LLC), intended to sit below MPI implementations such as MPICH; the LLC API provides RDMA, one-sided and two-sided communication, and a persistent remote DMA communication interface.

Turning to file I/O, a COCO run produces petabytes of output over each simulated year of 365 days, and sustaining this pattern on the assumed machine requires aggregate file I/O bandwidth on the order of 1 to 10 TB/sec.
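Such bandwidths motivate the hierarchical storage system mentioned in the abstract: output is first written to a fast node-local tier and later drained to the global file system. The sketch below shows only this two-step pattern; the directory paths and file contents are hypothetical, and a real implementation would overlap the drain with the next computation phase rather than running it inline.

#include <stdio.h>
#include <stdlib.h>

/* Illustrative two-tier output: write a snapshot to node-local storage, */
/* then drain it to the global file system. LOCAL_DIR and GLOBAL_DIR are */
/* hypothetical paths and must exist before running this sketch.         */
#define LOCAL_DIR  "/tmp/local_tier"
#define GLOBAL_DIR "/tmp/global_tier"
#define CHUNK      (1 << 20)            /* 1 MiB copy buffer */

static int drain_file(const char *src, const char *dst)
{
    FILE *in = fopen(src, "rb");
    FILE *out = fopen(dst, "wb");
    static char buf[CHUNK];
    size_t n;

    if (!in || !out) {
        if (in) fclose(in);
        if (out) fclose(out);
        return -1;
    }
    while ((n = fread(buf, 1, CHUNK, in)) > 0)
        fwrite(buf, 1, n, out);
    fclose(in);
    fclose(out);
    return 0;
}

int main(void)
{
    char local[256], global[256];
    snprintf(local,  sizeof(local),  "%s/snapshot-0001.dat", LOCAL_DIR);
    snprintf(global, sizeof(global), "%s/snapshot-0001.dat", GLOBAL_DIR);

    /* Step 1: the application writes its snapshot to the local tier.    */
    FILE *f = fopen(local, "wb");
    if (!f) { perror("local tier"); return EXIT_FAILURE; }
    fputs("snapshot data\n", f);        /* placeholder for ~1 TB of output */
    fclose(f);

    /* Step 2: the snapshot is drained to the global tier.               */
    if (drain_file(local, global) != 0) { perror("drain"); return EXIT_FAILURE; }
    return 0;
}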

The resulting hierarchical storage design combines a node-local tier in the 5 to 10 PB class with a global file system that scales toward 1 EB, with the local tier absorbing the COCO output stream before it is migrated to the global file system.

6. Concluding Remarks

A system software stack, consisting of an OS kernel, a low-level communication library, an MPI library, and a file I/O library, is being designed and implemented for PC clusters whose nodes combine multicore CPUs with manycore coprocessors and for manycore-based clusters. The Attached McKernel configuration has been used to run SCALE, a climate code developed at AICS and written in Fortran with OpenMP and MPI, and was demonstrated at SC in 2012. Work continues on the OS, the communication layers, and the file I/O subsystem.

Acknowledgments: Part of this work is supported by HPCI programs and by CREST.

References

[1] Moore, C.: Data Processing in Exascale-class Computer Systems, The Salishan Conference on High Speed Computing (2011).
[2] Hasumi, H.: Ocean Component Model (COCO) Version 2.1, Technical report, Division of Climate System Research, Atmosphere and Ocean Research Institute, the University of Tokyo (2000).
[3] Petrini, F., Kerbyson, D. J. and Pakin, S.: The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q, SC '03: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing (2003).
[4] Carlson, W., Draper, J., Culler, D., Yelick, K., Brooks, E. and Warren, K.: Introduction to UPC and Language Specification, Technical Report CCS-TR, IDA Center for Computing Sciences (1999).
[5] Shimosawa, T.: Operating System Organization for Manycore Systems, Doctoral thesis, Graduate School of the University of Tokyo (2012).
[6] Shimosawa, T., Matsuba, H. and Ishikawa, Y.: Logical Partitioning without Architectural Supports, The 32nd IEEE International Computer Software and Applications Conference (COMPSAC 2008) (2008).
[7] SWOPP (2011), in Japanese.
[8] Gerofi, B. et al.: IPSJ SIG Technical Report, OS-124 (2013), in Japanese.

[9] Gerofi, B., Shimada, A., Hori, A. and Ishikawa, Y.: Operating System Assisted Hierarchical Memory Management for Heterogeneous Architectures: Preliminary Results on Stencil Computation, The 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2013) (2013, to appear).
[10] Shimada, A., Gerofi, B., Hori, A. and Ishikawa, Y.: PGAS Intra-node Communication towards Many-Core Architecture, The 6th Conference on Partitioned Global Address Space Programming Models (2012).
[11] NUMA I/O, IPSJ SIG Technical Report, 124 (2013), in Japanese.
[12] Kumar, S., Mamidala, A. R., Faraj, D., Smith, B. E., Blocksome, M., Cernohous, B., Miller, D., Parker, J., Ratterman, J., Heidelberger, P., Chen, D. and Steinmacher-Burrow, B. D.: PAMI: A Parallel Active Message Interface for the Blue Gene/Q Supercomputer, IEEE 26th International Parallel and Distributed Processing Symposium (2012).
[13] Barrett, B. W., Brightwell, R., Hemmert, K. S., Pedretti, K., Wheeler, K. and Underwood, K. D.: Implementing OpenSHMEM and its Implications for Portals 4, 19th Annual Symposium on High-Performance Interconnects (HotI) (2011).
[14] Nieplocha, J., Tipparaju, V., Krishnan, M. and Panda, D.: High Performance Remote Memory Access Communications: The ARMCI Approach, International Journal of High Performance Computing Applications, No. 2 (2006).
[15] Bonachea, D.: GASNet Specification, v1.8, Technical Report UCB/CSD, U.C. Berkeley (2008).
[16] Si, M. and Ishikawa, Y.: Design of Direct Communication Facility for Manycore-based Accelerators, CASS 2012, held in conjunction with IPDPS 2012 (2012).
[17] Si, M., Ishikawa, Y. and Takagi, M.: Direct MPI Library for Intel Xeon Phi Co-processors, CASS 2013, held in conjunction with IPDPS 2013 (2013).
[18] YAMPII MPI (2004), in Japanese.
[19] Ishikawa, Y., Nakajima, K. and Hori, A.: Revisiting Persistent Communication in MPI, EuroMPI 2012: Recent Advances in the Message Passing Interface, Springer (2012, poster).

© 2013 Information Processing Society of Japan
