Leibniz Supercomputer Centre. Movie on YouTube

Joseph Lindsey
6 years ago

1 Leibniz Supercomputer Centre Movie on YouTube

2 Peak Performance
Peak performance: 3 Peta Flops = 3*10^15 Flops
Flops: Floating Point Operations per Second
  Mega   10^6   million
  Giga   10^9   billion
  Tera   10^12  trillion
  Peta   10^15  quadrillion
  Exa    10^18  quintillion
  Zetta  10^21  sextillion

3 Distributed Memory Architecture
18 partitions called islands with 512 nodes
Node is a shared memory system with 2 processors
  Sandy Bridge-EP Intel Xeon E5-2680 8C
  2.7 GHz (Turbo 3.5 GHz)
  32 GByte memory
  Infiniband network interface
Processor has 8 cores
  2-way hyperthreading
  Peak performance stated per core (at the core clock) and per processor
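The 3 Peta Flops quoted for the whole machine can be roughly reproduced from the node figures. The sketch below is only an illustration: it assumes 8 double-precision floating point operations per cycle per core (one 4-wide AVX add plus one 4-wide AVX multiply), which is not stated on the slides, and all variable names are chosen here for the example.

```python
# Figures taken from the slides
ISLANDS = 18
NODES_PER_ISLAND = 512
PROCESSORS_PER_NODE = 2
CORES_PER_PROCESSOR = 8
CLOCK_GHZ = 2.7  # nominal clock, not the 3.5 GHz turbo clock

# Assumption (not on the slides): 8 double-precision flops per cycle per core
FLOPS_PER_CYCLE = 8


def peak_gflops(cores, clock_ghz=CLOCK_GHZ, flops_per_cycle=FLOPS_PER_CYCLE):
    """Theoretical peak in GFlops for a given number of cores."""
    return cores * clock_ghz * flops_per_cycle


nodes = ISLANDS * NODES_PER_ISLAND
cores = nodes * PROCESSORS_PER_NODE * CORES_PER_PROCESSOR

per_core = peak_gflops(1)
per_processor = peak_gflops(CORES_PER_PROCESSOR)
system_pflops = peak_gflops(cores) / 1e6  # 1 PFlops = 10^6 GFlops

print(f"nodes: {nodes}, cores: {cores}")
print(f"per core: {per_core:.1f} GFlops, per processor: {per_processor:.1f} GFlops")
print(f"system: {system_pflops:.2f} PFlops")
```

Under these assumptions the system peak comes out at roughly 3.2 PFlops, in line with the 3 Peta Flops on the previous slide.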

4 Sandy Bridge Processor
8 multithreaded cores
            L1             L2             L3
Size        32 KB          256 KB         2.5 MB per core, shared
Latency     4 cycles       12 cycles      31 cycles
Bandwidth   2*16 / cycle   32 / cycle     32 / cycle
Network
  Frequency equal to core frequency
  Attached: Memory, QPI, PCIe
L3 cache
  Partitioned, with cache coherence based on core valid bits
  Physical addresses distributed by a hash function

5 NUMA Node
2 processors (Sandy Bridge) with 32 GB of memory (8 x 4 GB)
2 QPI links between the processors, each 2 GT/s
8xPCIe3.0 (8 GB/s) to Infiniband
Aggregate memory bandwidth per node in GB/s
Latency
  local ~50 ns (~135 cycles)
  remote ~90 ns (~240 cycles)
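The cycle counts for the memory latencies follow from the core clock. A minimal sketch of the conversion, assuming the nominal 2.7 GHz clock from the node slide (the function name is chosen for this example):

```python
CLOCK_GHZ = 2.7  # nominal core clock in GHz, i.e. cycles per nanosecond


def ns_to_cycles(latency_ns, clock_ghz=CLOCK_GHZ):
    """Convert a latency in nanoseconds into core clock cycles."""
    return latency_ns * clock_ghz


local_cycles = ns_to_cycles(50)   # local memory access
remote_cycles = ns_to_cycles(90)  # access to the other processor's memory

print(f"local:  ~{local_cycles:.0f} cycles")
print(f"remote: ~{remote_cycles:.0f} cycles")
```

The local value is 135 cycles; the remote value is 243 cycles, which the slide rounds to about 240.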

6 Interconnection Network
Infiniband FDR-10
  FDR means fourteen data rate
  FDR-10 has an effective data rate given in Gb/s
  Latency: 100 nsec per switch, 1 usec MPI
  Vendor: Mellanox
Intra-Island Topology: non-blocking tree
  256 communication pairs can talk in parallel.
Inter-Island Topology: Pruned Tree 4:1
  128 links per island to next level
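The two topology numbers can be derived from the island size. The sketch below assumes that "non-blocking" means all 512 nodes of an island can be paired off at once, and that "pruned 4:1" means one uplink for every four nodes; the helper name is hypothetical.

```python
NODES_PER_ISLAND = 512
PRUNING_RATIO = 4  # 4:1 pruned tree between islands


def island_links(nodes=NODES_PER_ISLAND, pruning_ratio=PRUNING_RATIO):
    """Return (parallel pairs inside an island, uplinks to the next level)."""
    pairs = nodes // 2                # every node talks to exactly one partner
    uplinks = nodes // pruning_ratio  # one uplink shared by `pruning_ratio` nodes
    return pairs, uplinks


pairs, uplinks = island_links()
print(f"{pairs} communication pairs, {uplinks} uplinks per island")
```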

7 Peak Performance
Network diagram:
  126 spine switches (36 port switches), 19 links each: 18 islands + IO island; rest for fat node and IO
  Each island: 648 port switch with 516 links to 516 nodes and 126 links to the spine switches

8 9288 Compute Nodes
Cold corridor; Infiniband (red) and Ethernet (green) cabling
Matthias Brehm, Herbert Huber, LRZ High Performance Systems Division
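The switch and node counts on the last two slides can be cross-checked with a little arithmetic. This is only a consistency sketch: it assumes that the 9288 compute nodes are the 516 nodes per island switch times the 18 compute islands, which the slides do not state explicitly.

```python
ISLAND_SWITCH_PORTS = 648  # ports on the island switch
NODE_LINKS = 516           # links from the island switch to nodes
SPINE_LINKS = 126          # links from the island switch to the spine switches
COMPUTE_ISLANDS = 18       # islands, not counting the IO island

used_ports = NODE_LINKS + SPINE_LINKS
free_ports = ISLAND_SWITCH_PORTS - used_ports
total_nodes = COMPUTE_ISLANDS * NODE_LINKS

print(f"ports used per island switch: {used_ports} of {ISLAND_SWITCH_PORTS}")
print(f"ports left over: {free_ports}")
print(f"nodes across all compute islands: {total_nodes}")
```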

9 Infiniband Interconnect
19 Orcas
126 Spine Switches
11900 Infiniband Cables

10 IO System
Diagram components: Spine Infiniband Switches, GPFS for $WORK and $SCRATCH, Login nodes, $HOME, Archive
Link labels: GB/s, 5, 80 Gb/s, 30, 10GbE

11 Parallel File System GPFS
10 PByte, 200 GByte/s I/O bandwidth
9 DDN SFA 12k controllers
5040 3 TByte SATA disks

12 SuperMIC
Intel Xeon Phi Cluster
32 Nodes
  2 Xeon Ivy-Bridge processors E5-2650, 8 cores each, 2.6 GHz clock frequency
  2 Intel Xeon Phi coprocessors 5110P
Memory
  64 GB host memory
  2x8 GB Xeon Phi

13 Intel Xeon Phi
  Number of cores                       60
  Frequency of cores                    1.1 GHz
  GDDR5 memory size                     8 GB
  Number of hardware threads per core   4
  SIMD vector registers                 32 (512-bit wide) per thread context
  Flops/cycle                           16 (DP), 32 (SP)
  Theoretical peak performance          1 TFlop/s (DP), 2 TFlop/s (SP)
  L2 cache per core                     512 KB
  Connection to host                    6.2 GB/s
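The theoretical peak of the coprocessor is the product of the core count, the clock frequency and the flops per cycle listed above. A short sketch of that arithmetic, using only the slide's figures (the function name is chosen for this example):

```python
CORES = 60
CLOCK_GHZ = 1.1
FLOPS_PER_CYCLE_DP = 16  # double precision
FLOPS_PER_CYCLE_SP = 32  # single precision


def phi_peak_gflops(flops_per_cycle, cores=CORES, clock_ghz=CLOCK_GHZ):
    """Theoretical peak of one coprocessor in GFlop/s."""
    return cores * clock_ghz * flops_per_cycle


peak_dp = phi_peak_gflops(FLOPS_PER_CYCLE_DP)
peak_sp = phi_peak_gflops(FLOPS_PER_CYCLE_SP)

print(f"DP peak: {peak_dp:.0f} GFlop/s")
print(f"SP peak: {peak_sp:.0f} GFlop/s")
```

This gives 1056 GFlop/s (DP) and 2112 GFlop/s (SP), which the slide rounds to 1 and 2 TFlop/s.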

14 Nodes with Coprocessors

15 Access to SuperMIC
Login to SuperMUC
Login to SuperMIC
  ssh supermic.smuc.lrz.de
LoadLeveler script with class phi
Interactive access to nodes and coprocessors
  Submit batch script with sleep command.
  Login to compute nodes
    ssh i01r13???
  Login to MIC coprocessors
    ssh i01r13???-mic0
    ssh i01r13???-mic1
  PPK required

16 The Compute Cube of LRZ
Labels in the building diagram:
  Re-cooling units
  Supercomputer (column-free)
  Access bridge
  Servers/network
  Archive/backup
  Air conditioning
  Electrical

17 Run jobs in batch
Advantages
  Reproducible performance
  Run larger jobs
  No need to interactively poll for resources
Test queue
  Max 1 island, 32 nodes, 2 h, 1 job in queue
General queue
  Max 1 island, 512 nodes, 48 h
Large
  Max 4 islands, 2048 nodes, 48 h
Special
  Max 18 islands
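The queue limits above can be written down as a small lookup that picks the smallest queue a job fits into. This is a hypothetical helper for illustration, not part of the batch system. The class name "test" appears in the job script; the other names are assumed from the queue labels. Node and time limits for the special queue are not given on the slide and are therefore left open.

```python
from typing import NamedTuple, Optional


class Queue(NamedTuple):
    name: str
    max_islands: int
    max_nodes: Optional[int]  # None: limit not stated on the slide
    max_hours: Optional[int]  # None: limit not stated on the slide


# Ordered from smallest to largest
QUEUES = [
    Queue("test", 1, 32, 2),
    Queue("general", 1, 512, 48),
    Queue("large", 4, 2048, 48),
    Queue("special", 18, None, None),
]


def smallest_queue(nodes, hours):
    """Return the name of the first queue whose stated limits admit the job."""
    for queue in QUEUES:
        nodes_ok = queue.max_nodes is None or nodes <= queue.max_nodes
        hours_ok = queue.max_hours is None or hours <= queue.max_hours
        if nodes_ok and hours_ok:
            return queue.name
    return None


print(smallest_queue(2, 0.1))    # small, short job
print(smallest_queue(100, 10))   # fits on one island
print(smallest_queue(1000, 48))  # needs more than one island
```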

18 Job Script
  #!/bin/bash
  #@ wall_clock_limit = 00:04:00
  #@ job_name = add
  #@ job_type = parallel
  #@ class = test
  #@ network.mpi = sn_all,not_shared,us
  #@ output = job$(jobid).out
  #@ error = job$(jobid).out
  #@ node = 2
  #@ total_tasks=4
  #@ node_usage = not_shared
  #@ queue
  . /etc/profile
  cd ~/apptest/application
  poe appl
Commands
  llsubmit job.scp    Submission to batch system
  llq -u $USER        Check status of own jobs
  llcancel <jobid>    Kill job if no longer needed

19 Limited CPU Hours available
Please specify job execution as tightly as possible.
Do not request more nodes than required. We have to pay for all allocated cores, not only the used ones.
SHORT (<1 sec) sequential runs can be done on the login node.
Even SHORT OMP runs can be done on the login node.
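To see why tight requests matter, the cost of a job can be estimated in core-hours. The sketch below assumes that a job is charged for every core of every allocated node (16 cores per node, from 2 processors with 8 cores each) for the requested wall-clock time; the exact accounting rules are not given on the slides, and the function names are chosen for this example.

```python
CORES_PER_NODE = 16  # 2 processors x 8 cores


def allocated_core_hours(nodes, wall_clock_hours, cores_per_node=CORES_PER_NODE):
    """Core-hours charged when all cores of all allocated nodes are billed."""
    return nodes * cores_per_node * wall_clock_hours


def used_fraction(nodes, total_tasks, cores_per_node=CORES_PER_NODE):
    """Share of the allocated cores that actually run a task."""
    return total_tasks / (nodes * cores_per_node)


# Values from the job script slide: 2 nodes, 4 tasks, 4 minutes wall clock
cost = allocated_core_hours(nodes=2, wall_clock_hours=4 / 60)
share = used_fraction(nodes=2, total_tasks=4)

print(f"charged: {cost:.2f} core-hours")
print(f"cores in use: {share:.1%}")
```

For the example job script, only one eighth of the charged cores run a task, so the same four tasks on a single node would halve the cost under these assumptions.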

20 Login to SuperMUC, Documentation
First change the standard password
Login via lxhalle due to restriction on connecting machines
  ssh
No outgoing connections allowed
Documentation
  Intel compiler: mposerxe/en-us/2011update/cpp/lin/index.htm

21 Batch Script Parameters
  #@ energy_policy_tag = NONE
    Switch off automatic adaptation of core frequency for performance measurements
  #@ node = 2
  #@ total_tasks = 4
  #@ task_geometry = {(0,2) (1,3)}
  #@ tasks_per_node = 2
Limitations on combination documented at LRZ web page
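The task_geometry value lists one parenthesised group of task IDs per node, so {(0,2) (1,3)} places tasks 0 and 2 on the first node and tasks 1 and 3 on the second. The following sketch parses such a string into a task-to-node mapping; the parser is a hypothetical helper written for this example and is not part of the batch system.

```python
import re


def parse_task_geometry(spec):
    """Split a task_geometry string into one list of task IDs per node."""
    groups = re.findall(r"\(([^()]*)\)", spec)
    return [[int(t) for t in group.split(",") if t.strip()] for group in groups]


def task_to_node(spec):
    """Map each task ID to the index of the node group it belongs to."""
    return {
        task: node
        for node, tasks in enumerate(parse_task_geometry(spec))
        for task in tasks
    }


geometry = parse_task_geometry("{(0,2) (1,3)}")
placement = task_to_node("{(0,2) (1,3)}")

print(geometry)
print(placement)
```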

22 Compiler
Intel C++: icc, Version 12.1
Editors: vi, emacs, xedit
