Blue Gene/P Application User Workshop
1 Blue Gene/P Application User Workshop Nicolas Tallet, Pascal Vezolle PSSC Montpellier Blue Gene Team 11/4/2008
2 Agenda
Blue Gene/P Architecture: Hardware Components, Networks, Partitioning
Blue Gene/P Software Stack: I/O Node, Compute Node, HPC Software Stack, Software Locations
Programming Models & Execution Modes: Programming Models, Execution Modes, Partition Types
Blue Gene/P Compilation: Compilers, Mathematical Libraries
Blue Gene/P Execution: MPIRUN / MPIEXEC Commands, Environment Variables, LoadLeveler, HTC
Blue Gene/P Parallel Libraries: Shared Memory, Message Passing
Blue Gene/P Debugging: Blue Gene/P Integrated Tools, Commercial Software
Blue Gene/P Performance Analysis: Profiling, HPC Toolkit, Major Open-Source Tools
Blue Gene/P Advanced Topics: Blue Gene/P Memory, Advanced Compilation with IBM XL Compilers, SIMD Programming, Communications Framework, Checkpoint/Restart, I/O Optimization
3 IBM's Blue Gene/P Supercomputer: Applying Innovation that Matters to High Performance Computing
- Ultrascale performance
- High reliability
- Easy manageability
- Ease of programming, familiar open/standard operating environment
- Simple porting of parallel codes
- Space-saving design
- Low power requirements
- Limited performance and memory per core
4 What is Blue Gene? Massively Parallel Architecture: a computer system with many independent arithmetic units or entire microprocessors that are connected together to form one very large computer, as opposed to clusters / shared-memory processing. Two solutions are available today, with one more on the roadmap:
- Blue Gene/L: PPC 700 MHz, scalable to 360+ TF
- Blue Gene/P: PPC 850 MHz, scalable to 3+ PF
- Blue Gene/Q: Power Multi-Core, scalable to 10+ PF
5 Information Sources
- IBM Redbooks for Blue Gene: Application Development Guide, System Administration Guide, Performance Tools
- Open Source Communities (Argonne Web site, ...)
- Doxygen Documentation (DCMF, SPI, ...)
6 Blue Gene/P Application User Workshop Blue Gene/P Architecture: Hardware Components, Networks, Partitioning
7 Blue Gene/P Packaging
Chip: 4 processors, 13.6 GF/s, 8 MB EDRAM
Compute Card: 1 chip, 20 DRAMs, 13.6 GF/s, 2.0 (or 4.0) GB DDR
Node Card: 32 chips (4x4x2), 32 compute cards, 0-2 I/O cards, 435 GF/s, 64 GB
Rack: 32 Node Cards, cabled 8x8x16, 14 TF/s, 2 TB
System: 72 racks, 1 PF/s, 144 TB
8 Blue Gene/P Compute Card System-on-a-Chip (SoC) PowerPC 450 CPU 850 MHz frequency Quad core 2 GB RAM Network connections
9 Blue Gene/P ASIC [Figure: ASIC block diagram]
10 Blue Gene/P Node Card 32 Compute Nodes; optional I/O card (one of 2 possible) with 10 Gb optical link; local DC-DC regulators (6 required, 8 with redundancy)
11 Blue Gene/P Rack Content 32 Compute Nodes populate a Node Card Node Cards may be hot-plugged for service 0-2 I/O Nodes populated on a Node Card Flexible ratio of compute to I/O Nodes I/O Nodes are identical to Compute Nodes other than placement in the Node Card, which defines network connections 16 Node Cards form a Midplane 2 Midplanes form a Rack 1024 Compute Nodes per Rack 8 to 64 I/O Nodes per Rack: 80 Gb/s to 640 Gb/s Ethernet bandwidth per Rack
12 Torus X-Y-Z [Figure: 3D torus with axes X, Y, Z]
13 Node, Link, Service Card Names [Figure: midplane layout showing node cards N00-N15, link cards L0-L3, and the service card S]
14 Node Card [Figure: node card layout with compute card positions J00-J35 and Ethernet ports EN0, EN1]
15 Link Card [Figure: link card layout with modules U00-U05 and connectors J00-J15]
16 Service Card [Figure: service card showing rack row indicator (0-F), rack column indicator (0-F), control network port, and clock input]
17 Clock Card [Figure: clock card with outputs 0-9, slave/master selector, and clock input]
18 Naming Conventions / 1
Racks: Rxx - Rack Row (0-F), Rack Column (0-F); see picture below for details
Midplanes: Rxx-Mx - Midplane (0-1): 0=Bottom, 1=Top
Bulk Power Supply: Rxx-B
Power Cable: Rxx-B-C
Clock Cards: Rxx-K
Fan Assemblies: Rxx-Mx-Ax - Fan Assembly (0-9): 0=Bottom Front, 4=Top Front, 5=Bottom Rear, 9=Top Rear
Power Modules: Rxx-B-Px - Power Module (0-7): 0-3 left to right facing front, 4-7 left to right facing rear
Fans: Rxx-Mx-Ax-Fx - Fan (0-2): 0=Tailstock, 2=Midplane
19 Naming Conventions / 2
Service Cards: Rxx-Mx-S (the master service card for a rack is always Rxx-M0-S)
Link Cards: Rxx-Mx-Lx - Link Card (0-3): 0=Bottom Front, 1=Top Front, 2=Bottom Rear, 3=Top Rear
Node Cards: Rxx-Mx-Nxx - Node Card (00-15): 00=Bottom Front, 07=Top Front, 08=Bottom Rear, 15=Top Rear
Compute Cards: Rxx-Mx-Nxx-Jxx - Compute Card (04 through 35)
I/O Cards: Rxx-Mx-Nxx-Jxx - I/O Card (00-01); see picture below for details
20 Blue Gene/P System Components Blue Gene/P Rack(s) 1024 Quad-Core Compute Nodes per Rack > 13.7 TF/sec peak 2 GB main memory per Node > 2 TB main memory per Rack Up to 64 I/O Nodes per Rack > 80 GB/sec peak Scalable to 256 racks > 3.5 PF/sec Host System Service and Front End Nodes: System p Software Stack: SLES10, DB2, XLF/C Compilers, HPC Cluster SW Ethernet Switches: Force10, Myrinet Storage System: IBM, DDN 20 PSSC Montpellier Blue Gene Team
21 Blue Gene Blocks Hierarchical Organization Compute Nodes dedicated to running user application, and almost nothing else - simple compute node kernel (CNK) I/O Nodes run Linux and provide a more complete range of OS services files, sockets, process launch, signaling, debugging, and termination Service Node performs system management services (e.g., partitioning, heart beating, monitoring errors) - transparent to application software Front-end Nodes, Filesystem 10 Gb Ethernet 1 Gb Ethernet 21 PSSC Montpellier Blue Gene Team
22 Blue Gene/P Full Configuration [Figure: BG/P compute and I/O nodes (8 to 64 per rack), 10GigE per I/O Node into a federated 10 Gigabit Ethernet switch, 1GigE control network (4 per rack), plus front-end nodes, service node, file system, WAN, visualization, and archive]
23 Blue Gene/P Interconnection Networks 3-Dimensional Torus Interconnects all compute nodes Virtual cut-through hardware routing 3.4 Gb/s on all 12 node links (5.1 GB/s per node) 0.5 µs latency between nearest neighbors, 5 µs to the farthest MPI: 3 µs latency for one hop, 10 µs to the farthest Communications backbone for computations 1.7/3.9 TB/s bisection bandwidth, 188TB/s total bandwidth Collective Network One-to-all broadcast functionality Reduction operations functionality 6.8 Gb/s of bandwidth per link Latency of one way tree traversal 1.3 µs, MPI 5 µs ~62TB/s total binary tree bandwidth (72k machine) Interconnects all compute and I/O nodes (1152) Low Latency Global Barrier and Interrupt Latency of one way to reach all 72K nodes 0.65 µs (MPI 1.6 µs) Other networks 10Gb Functional Ethernet I/O nodes only 1Gb Private Control Ethernet Provides JTAG access to hardware. Accessible only from Service Node system 23 PSSC Montpellier Blue Gene Team
24 Blue Gene Networks Torus Compute nodes only Direct access by app DMA Collective Compute and I/O node attached 16 routes allow multiple network configurations to be formed Contains an ALU for collective operation offload Direct access by app Barrier Compute and I/O nodes Low latency barrier across system (< 1usec for 72 rack) Used to synchronize timebases Direct access by app 10Gb Functional Ethernet I/O nodes only 1Gb Private Control Ethernet Provides JTAG, i2c, etc, access to hardware. Accessible only from Service Node system Clock network Single clock source for all racks 24 PSSC Montpellier Blue Gene Team
25 Three-Dimensional Torus Network: Point-to-Point The torus network is used for general-purpose, point-to-point message passing and multicast operations to a selected class of nodes The topology is a three-dimensional torus constructed with point-to-point serial links between routers that are embedded within the Blue Gene/P ASICs Torus direct memory access (DMA) provides memory access for reading, writing, or both, independently of the processing unit Each Compute Node (ASIC) has six nearest-neighbor connections, some of which can traverse relatively long cables Not all partitions have a torus shape: only multi-midplane partitions (multiples of 512 nodes) do; the others have a mesh shape, i.e. the opposite sides of the 3D cube are not linked The target hardware bandwidth for each torus link is 425 MBps in each direction of the link for a total of 5.1 GBps bidirectional bandwidth per node Interconnection of all Compute Nodes (73,728 for a 72-rack system) Virtual cut-through hardware routing 3.4 Gbps on all 12 node links (5.1 GBps per node) 1.7/3.8 TBps bisection bandwidth, 67 TBps total bandwidth
26 Torus vs. Mesh The basic block is the midplane, of shape 8x8x8 = 512 Compute Nodes (2048 cores) Only multi-midplane partitions are tori, i.e. each Compute Node has 6 nearest neighbours All other partitions are meshes LoadLeveler can request a torus or a mesh with the field bg_connection = torus/mesh The default is a mesh
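As a sketch of how the bg_connection field mentioned above is used, here is a minimal LoadLeveler job command file. The bg_connection keyword is the one the slide names; the other keywords come from LoadLeveler's Blue Gene support, and the partition size, executable, and arguments are hypothetical values for illustration:

```
# Illustrative LoadLeveler job command file (size, paths, arguments are made up)
# @ job_type      = bluegene
# @ bg_size       = 512            # one midplane
# @ bg_connection = torus          # default would be mesh
# @ executable    = /bgsys/drivers/ppcfloor/bin/mpirun
# @ arguments     = -exe /home/user/a.out -mode VN
# @ queue
```

Requesting torus here only succeeds for partitions built from whole midplanes, per the slide above.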
27 Collective Network: Global Operations The global collective network is a high-bandwidth, one-to-all network that is used for collective communication operations, such as broadcast and reductions, and to move process and application data from the I/O Nodes to the Compute Nodes Each Compute and I/O Node has three links to the global collective network at 850 MBps per direction for a total of 5.1 GBps bidirectional bandwidth per node Latency on the global collective network is less than 2 µs from the bottom to top of the Collective, with an additional 2 µs latency to broadcast to all The global collective network supports the following features: One-to-all broadcast functionality Reduction operations functionality 6.8 Gbps of bandwidth per link; latency of network traversal 2 µs 62 TBps total binary network bandwidth Interconnects all compute and I/O Nodes (1088) 27 PSSC Montpellier Blue Gene Team
28 Other Networks Global Interrupt: Low latency barriers and interrupts The global interrupt network is a separate set of wires based on asynchronous logic, which forms another network that enables fast signaling of global interrupts and barriers (global AND or OR). Round-trip latency to perform a global barrier over this network for a 72K-node partition is approximately 1.3 microseconds. From MPI: 2.3 usec (1 rack, Argonne), 2.5 usec (8 racks, Argonne), 2.7 usec (40 racks), 3.0 usec (72 racks). 10 Gigabit Ethernet: File I/O and host interface This network consists of all I/O Nodes and discrete nodes that are connected to the external 10 Gigabit Ethernet switch. The Compute Nodes are not directly connected to this network. All traffic is passed from the Compute Node over the global collective network to the I/O Node and then onto the 10 Gigabit Ethernet network. Control Network: Boot, monitoring, and diagnostics This network consists of a JTAG interface behind a 1 Gb Ethernet interface with direct access to shared SRAM in every Compute and I/O Node. It is used for system boot, debug, and monitoring. It allows the Service Node to provide runtime non-invasive RAS support as well as non-invasive access to performance counters.
29 Partitioning Partition = Subdivision of a single Blue Gene system Partitions are software defined Torus, Collective and Barrier networks are completely isolated from traffic from other partitions A single job runs on a partition Jobs never share resources or interfere with each other Custom kernels may be booted in a partition 29 PSSC Montpellier Blue Gene Team
30 Sub-Midplane Partitions Sub-Midplane Partition Partition that is smaller than a midplane < 512 Nodes = 2048 Cores Smallest partition is ½ node card (16 nodes) Multiple node cards may form a partition Limitations Mesh (not a true Torus) Each partition must have at least one I/O node Node cards combined into a partition must be adjacent Sub-midplane partition sizes are powers of 2 16 node, 32 node, 64 node, 128 node and 256 node 30 PSSC Montpellier Blue Gene Team
31 Multi-Midplane Partitions One or more midplanes can be grouped into a large partition Limitations Midplanes must be organized to form a torus Constant number of midplanes in X, Y and Z Torus connections may pass through unused midplanes Traffic switched around those midplanes no noise introduced Unused midplanes can often be combined into partitions Inter-rack cables are a limited resource Inter-rack cables carry full network bandwidth A single midplane may always be used as a torus No cables required 31 PSSC Montpellier Blue Gene Team
32 Spec/feature comparison

Property                                   BG/L                          BG/P
Node Processors                            2 x 440 PowerPC               4 x 450 PowerPC
Processor Frequency                        0.7 GHz                       0.85 GHz
Coherency                                  Software managed              SMP
L1 Cache (private)                         32 KB/processor               32 KB/processor
L2 Cache (private)                         14 stream prefetching         14 stream prefetching
L3 Cache size (shared)                     4 MB                          8 MB
Main Store/Node                            512 MB and 1 GB versions      2 GB (studying 4 GB) versions
Main Store Bandwidth                       5.6 GB/s (16B wide)           13.6 GB/s (2x16B wide)
Peak Performance                           5.6 GF/node                   13.6 GF/node
Torus Bandwidth                            6*2*175 MB/s = 2.1 GB/s       6*2*425 MB/s = 5.1 GB/s
Torus HW Latency (Nearest Neighbor)        200 ns (32B packet),          64 ns (32B packet),
                                           1.6 us (256B packet)          512 ns (256B packet)
Torus HW Latency (Worst Case)              6.4 us (64 hops)              3.0 us (64 hops)
Collective Bandwidth                       2*350 MB/s = 700 MB/s         2*0.85 GB/s = 1.7 GB/s
Collective HW Latency (Round Trip, Worst)  5.0 us                        2.5 us
System Peak Performance                    360 TF (64K nodes)            1 PF (72K nodes)
Total Power                                1.5 MW                        > 2.0 MW
33 Blue Gene/P Application User Workshop Blue Gene/P Software Stack: I/O Node, Compute Node, HPC Software Stack, Programming Models, Software Products Locations
34 Blue Gene/P Software Stack [Figure: per-component stacks, color-coded as open source / IBM & vendor products / BG-specific and custom]
Service Node: SuSE SLES10, BG Nav, mpirun, Bridge API, SLURM / COBALT, PAPI, DB2, LoadLeveler, GPFS, CSM, HPC Toolkit, MMCS, Proxy, Firmware
Front-end Node: SuSE SLES10, performance tools, gcc (g77, g++), glibc, binutils, gdb, XL C/Fortran compilers, run-time system, GPFS
I/O Node: ciod, Linux, ramdisk, gdb, GPFS, Bootloader (attached via Ethernet and the tree network)
Compute Node: MPICH2, BG messaging, GNU runtime, XL runtime, MASS/V, ESSL, CNK, Bootloader (attached via the torus and tree networks)
35 I/O Node Kernel SMP Linux No persistent store Network filesystems only No swap 10Gb Ethernet interface Several CNK system calls are function-shipped here Linux compatibility is achieved by executing these syscalls on Linux Function shipping occurs over the Collective Network The ciod daemon manages a fixed set of compute nodes in a processing set (pset) Linux provides the portable filesystem and network layer interfaces
36 Processing Sets (Psets) An I/O Node is dedicated to a fixed group of compute nodes The compute-to-I/O ratio is fixed within a partition: 128:1, 64:1, 32:1, 16:1 [Figure: a pset, i.e. one I/O Node (ION) and its Compute Nodes (CN), attached to the 10Gb switch]
37 Syscall Function Shipping I/O system calls are forwarded by the Compute Node to the I/O Node. This removes the need for a filesystem in CNK and keeps the number of filesystem clients manageable: 1152 I/O Nodes in a 72-rack system. Example path for an fscanf() call:
1. The application performs an fscanf() library call; glibc implements it with a read() syscall.
2. CNK forwards read() as a message over the collective network.
3. ciod on the I/O Node receives the read request and performs the read syscall under Linux, which reaches the filesystem (GPFS, PVFS2) via 10Gb Ethernet.
4. ciod's reply to CNK contains the read results, which are returned to the library and then to the application.
[Figure: applications on BG/P ASICs shipping read() over the collective network to ciod on the I/O Node]
38 Compute I/O Daemon (ciod) ciod on the I/O Node serves these roles Interface for job launch and kill I/O syscall requests from CNs within the pset Debug requests from a parallel debugger Syscall compatibility ciod forks an ioproxy process for each CNK in the pset (actually each virtual node) This process context represents that compute node open files, locks, working dir, I/O blocking state, etc Service connection may be admin enabled telnet to port 7201 and type help 38 PSSC Montpellier Blue Gene Team
39 Compute Node Software Stack Compute Node Kernel (CNK) controls all access to hardware, and enables bypass for application use User-space libraries and applications can directly access torus and tree Application code can use both processors in a compute node Application code can directly touch hardware only through SPI 39 PSSC Montpellier Blue Gene Team
40 Compute Node Kernel (CNK) CNK is a lightweight kernel and is NOT Linux Goals: Be as Linux-compatible as possible Provide the entire node's resources to the application and get out of the way! OS noise is minimal by design TLBs are statically mapped by default no page faults Only the user application is running no system daemons No source of normal interrupts except for: Timer interrupts as requested by the application Torus DMA completion interrupts
41 CNK Modes of Execution CNK provides a fixed process model SMP: single process with four threads and all memory DUAL: two processes with two threads each and half the memory VN: virtual node mode with four processes and a quarter of the memory each Use e.g. mpirun -mode SMP to select the execution mode Shared memory can be used between these processes Threading interfaces are fixed Each thread is fixed to a CPU More threads than CPUs are not allowed Special I/O threads do exist (more later) Memory usage is fairly rigid due to the focus on pinned memory
42 CNK System Calls Direct Implementation exit, time, getpid, getuid, alarm, kill, times, brk, getgid, geteuid, getegid, getppid, sigaction, setrlimit, getrlimit, getrusage, gettimeofday, setitimer, getitimer, sigreturn, uname, sigprocmask, sched_yield, nanosleep, set_tid_address, exit_group Implementation through forward to I/O Node open, close, read, write, link, unlink, chdir, chmod, lchown, lseek, utime, access, rename, mkdir, rmdir, dup, fcntl, umask, dup2, symlink, readlink, truncate, ftruncate, fchmod, fchown, statfs, fstatfs, socketcall, stat, lstat, fstat, fsync, llseek, getdents, readv, writev, sysctl, chown, getcwd, truncate64, ftruncate64, stat64, lstat64, fstat64, getdents64, fcntl64 Restricted Implementation mmap, munmap, clone, futex 42 PSSC Montpellier Blue Gene Team
43 CNK mmap and clone Restrictions These syscalls are partially implemented to enable the GLIBC pthread library and dynamic linker to function mmap is limited to: anonymous memory maps (malloc uses these) maps of /dev/shm for shm_open() file maps using MAP_COPY mmap may choose different placement under CNK than Linux mmap will not protect mapped files as read-only mmap does NOT provide demand paging files are completely read during the map clone() may only create a total of 4 threads: In VN mode no additional clones are allowed per process In DUAL mode one clone is allowed per process In SMP mode three additional clones are allowed Fork is not supported
44 GLIBC pthread Support New Posix Thread Library (NPTL) unchanged from Linux Attempt to create more threads than allowed results in EAGAIN error Linux rarely fails with this error (but can) Stack overflow is caught by debug address exception Unlike Linux which uses guard pages May be unavailable when under debugger XLSMPOPTS runtime variable controls the stack size 44 PSSC Montpellier Blue Gene Team
45 CNK Dynamic Linking & Python (2.5) Dynamic linking provided by GLIBC ld.so Uses limited mmap implemented by CNK Placement of objects less efficient than static linking As in Linux, static linked apps can still use dlopen() to dynamically load additional code at runtime Goal is to enable use of Python to drive apps pynamic has been demonstrated 45 PSSC Montpellier Blue Gene Team
46 CNK Full Socket Support CNK provides socket support via the standard Linux socketcall() system call. The socketcall() is a kernel entry point for the socket system calls. It determines which socket function to call and points to a block that contains the actual parameters, which are passed through to the appropriate call. CNK function-ships the socketcall() parameters to the CIOD, which then performs the requested operation. The CIOD is a user-level process that controls and services applications in the compute node and interacts with the Midplane Management Control System (MMCS) This socket support allows the creation of both outbound and inbound socket connections using standard Linux APIs. For example, an outbound socket can be created by calling socket(), followed by connect(). An inbound socket can be created by calling socket() followed by bind(), listen(), and accept().
47 Blue Gene/P Software Tools
IBM Software Stack:
- XL (FORTRAN, C, and C++) Compilers: externals preserved; optimized for specific BG functions; OpenMP support
- LoadLeveler Job Scheduler: same externals for job submission and system query functions; backfill scheduling to achieve maximum system utilization
- GPFS Parallel Filesystem: provides high-performance file access, as in current pSeries and xSeries clusters; runs on I/O Nodes and disk servers
- MASS/MASSV/ESSL Scientific Libraries: optimization library and intrinsics for better application performance; serial static library supporting 32-bit applications; callable from FORTRAN, C, and C++
- MPI Library: message passing interface library, based on MPICH2, tuned for the Blue Gene architecture
Other Software Support:
- Parallel File Systems: Lustre at LLNL, PVFS2 at ANL
- Job Schedulers: SLURM at LLNL, Cobalt at ANL, Altair PBS Pro, Platform LSF (BG/L only), Condor HTC (porting for BG/P)
- Parallel Debuggers: Etnus TotalView Debugger, Allinea DDT and OPT (porting for BG/P)
- Libraries: FFT library (tuned functions by TU Vienna), VNI (porting for BG/P)
- Performance Tools: HPC Toolkit (MP_Profiler, Xprofiler, HPM, PeekPerf, PAPI), Tau, Paraver, Kojak, SCALASCA
48 Blue Gene/P Software Ecosystem / 1 IBM Software Stack XL Compilers Externals preserved New options to optimize for specific Blue Gene functions TWS LoadLeveler Same externals for job submission and system query functions Backfill scheduling to achieve maximum system utilization Engineering Scientific Subroutine Library (ESSL)/MASSV Optimization library and intrinsics for better application performance Serial static library supporting 32-bit applications Callable from FORTRAN, C, and C++ GPFS (General Parallel File System) Provides high-performance file access as in current System p and System x clusters Runs on I/O Nodes and disk servers
49 Blue Gene/P Software Ecosystem / 2 IBM Software Stack MPI Message passing library tuned for the Blue Gene architecture Performance Tools HPC Toolkit Hardware DataDirect Networks (DDN), IBM System DCS9550: storage Myricom, Force10: switches 3rd-party ISV suppliers Visual Numerics (VNI) developers of IMSL TotalView Technologies Inc. developers of TotalView Debugger Allinea developers of allinea ddt and allinea opt Tsunami Development LLC developers of the Tsunami Imaging Suite Condor developers of software tools Gene Network Sciences (GNS) developers of biosimulation models
50 What's new with BG/P Torus DMA and numerous communication library optimizations pthreads and OpenMP support CNK application compatibility with Linux Dynamic linking Use of mmap for shared memory Protected read-only data and application code Protection for stack overflow Full socket support (client and server) Better Linux compatibility in ciod on the I/O node MPMD mpiexec supports multiple executables Some restrictions: executable specified per pset; no DMA (next driver) Numerous control system enhancements
51 BGP Software Locations (1)
BGP directory /bgsys/drivers/ppcfloor contains: arch, baremetal, bgcns, bin, boot, bootloader, build-ramdisk, cnk, comm, control, diags, doc, efix_history.log, gnu-linux, include, lib64, libraries, linux, mcp, navigator, ramdisk, ras, runtime, schema, toolchain, tools
BGP include files: /bgsys/drivers/ppcfloor/arch/include (bgcore, cnk, common, spi)
Compilation wrappers and BGP libs: /bgsys/drivers/ppcfloor/comm (fast, gcc_fast)
GNU tools (gcc, gdb, ...): /bgsys/drivers/ppcfloor/gnu-linux (bin, include, info, lib, libexec, man, powerpc-bgp-linux, powerpc-linux-gnu, share)
52 Software Locations / Reference Compilers XL C /opt/ibmcmp/vac/bg XL C++ /opt/ibmcmp/vacpp/bg XL Fortran /opt/ibmcmp/xlf/bg/ GCC /bgsys/drivers/ppcfloor/gnu-linux/bin Libraries System Libraries /bgsys/drivers/ppcfloor/comm/lib MASS /opt/ibmcmp/xlmass/bg/4.4/bglib/ ESSL /opt/ibmmath/essl/4.3/lib Others Tools /bgsys/drivers/ppcfloor/tools 52 PSSC Montpellier Blue Gene Team
53 Blue Gene/P Application User Workshop Programming Models & Execution Modes: Programming Models; Execution Modes (SMP, DUAL, VN); Partition Types (HPC vs. HTC)
54 Programming Models & Development Environment Familiar Aspects SPMD model - Fortran, C, C++ with MPI (MPI1 + subset of MPI2) Full language support Automatic SIMD FPU exploitation Linux development environment User interacts with system through front-end nodes running Linux compilation, job submission, debugging Compute Node Kernel provides look and feel of a Linux environment POSIX system calls (with some restrictions) BG/P adds pthread support, additional socket support, Tools support for debuggers, MPI tracer, profiler, hardware performance monitors, visualizer (HPC Toolkit), PAPI Dynamic libraries Python 2.5 Aggregate Remote Memory Copy (ARMCI), Global Arrays (GA), UPC, Restrictions (lead to significant scalability benefits) Space sharing - one parallel job (user) per partition of machine, one process per processor of compute node Virtual memory constrained to physical memory size Implies no demand paging, but on-demand linking MPMD model limitations 54 PSSC Montpellier Blue Gene Team
55 Execution Modes Possibilities: Single Node / Multi Node 1, 2 or 4 Processes per Node 1, 2 or 4 Threads per Process Notation: Virtual Node Mode VN 4 MPI Processes per Node Dual Mode DUAL 2 MPI Processes + 2 Threads per Process Shared Memory SMP 1 MPI Process + 4 Threads per Process Limitation: one user process or thread per core
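As an illustration of selecting these modes at job launch, the sketch below uses the BG/P mpirun command with its -mode flag; the partition name, process counts, and executable path are hypothetical values for illustration:

```shell
# Illustrative mpirun invocations on one midplane (R00-M0 and ./a.out are made up)
mpirun -partition R00-M0 -mode SMP  -np 512  -exe ./a.out   # 1 process/node, up to 4 threads each
mpirun -partition R00-M0 -mode DUAL -np 1024 -exe ./a.out   # 2 processes/node, 2 threads each
mpirun -partition R00-M0 -mode VN   -np 2048 -exe ./a.out   # 4 single-threaded processes/node
```

Note how the process count scales with the mode while the hardware (512 nodes, 2048 cores) stays fixed.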
56 Blue Gene/P Execution Modes
Quad Mode (previously called Virtual Node Mode): all four cores run one MPI process each; no threading; memory / MPI process = 1/4 node memory; MPI programming model
Dual Mode: two cores run one MPI process each; each process may spawn one thread on a core not used by the other process; memory / MPI process = 1/2 node memory; hybrid MPI/OpenMP programming model
SMP Mode: one core runs one MPI process; the process may spawn threads on each of the other cores; memory / MPI process = full node memory; hybrid MPI/OpenMP programming model
[Figure: processes (P) and threads (T) mapped onto cores 0-3 and each mode's memory address space]
57 Programming Models MPI Only Virtual Node Mode with enhancements Separate MPI process for each processor in the compute node DMA support for each MPI process Ensure network does not block when processor is computing Drive network harder Sharing of read-only or write-once data on each node Need programming language extension to identify read-only data Allow applications to overcome memory limits of virtual node mode MPI + OpenMP (or pthread) OpenMP within each node relies on cache coherence support Only master thread on each node initiates communication Get benefits of message aggregation Exploit multiple processors to service MPI call 57 PSSC Montpellier Blue Gene Team
58 Symmetric Multi-Processing Mode (SMP) SMP 1 Process/Node 4 Threads/Process 2 GB/Process pthreads and OpenMP are supported 4 MB default stack for all new threads
59 Dual Mode DUAL 2 Processes/Node 2 Threads/Process 1 GB/Process pthreads and OpenMP are supported 4 MB default stack for all new threads
60 Virtual Node Mode VN 4 Processes/Node 1 Thread/Process 512 MB/Process Shared memory support available between the processes
61 High Throughput Computing (HTC) Mode Many applications that run on Blue Gene today are embarrassingly (pleasantly) parallel They do not fully exploit the torus for MPI communication, since that is not needed for their problem They just want a very large number of small tasks, with a coordinator of results High Throughput Computing Mode on Blue Gene Enables a new class of workloads that use many single-node jobs Leverages the low-cost, low-energy, small footprint of a rack of 1,024 compute nodes Capacity machine ("cluster buster"): run 4,096 jobs on a single rack in virtual node mode (VN)
62 HTC Value for «Pleasantly Parallel» Codes Application resiliency A single node failure ends the entire application in the MPI model For HTC, only the job running on the failed node is ended while other single node jobs continue to run. For long-running jobs that require many tasks, this can mean the difference between having to start from scratch and just being able to proceed ahead on the remaining nodes. The front-end node has more memory, better performance, and more functionality than a single compute node Code that runs on the compute nodes is much cleaner It only contains the work to be performed, and leaves the coordination to a script or scheduler This also eliminates the need to sacrifice one node as being the master node. The coordinator functionality can be anything that runs on Linux Perl script, Python, compiled program The coordinator can interact directly with a database To either get the inputs for the application, or to store the results This can eliminate the need to create a flatfile input for the application, or to generate the results in an output file. 62 PSSC Montpellier Blue Gene Team
63 High Performance vs. High Throughput Modes High Performance Computing (HPC) Mode Best for Capability Computing Parallel, tightly coupled applications Single Instruction, Multiple Data (SIMD) architecture Programming model: typically MPI Apps need a tremendous amount of computational power over a short time period High Throughput Computing (HTC) Mode Best for Capacity Computing Large number of independent tasks Multiple Instruction, Multiple Data (MIMD) architecture Programming model: non-MPI Applications need a large amount of computational power over a long time period Traditionally run on large clusters HTC and HPC modes co-exist on Blue Gene Determined when the resource pool (partition) is allocated
64 Blue Gene/P Application User Workshop Blue Gene/P Compilation Compilers Mathematical Libraries 64 PSSC Montpellier Blue Gene Team
65 Compilers and Application Libraries Front End Nodes IBM System p 64-bit servers Used for development of application code XL Compilers, GNU Compilers, Blue Gene Libraries Cross Compilation Filesystem A parallel filesystem is generally shared between the Blue Gene (via the I/O nodes) and the Front End Nodes 65 PSSC Montpellier Blue Gene Team
66 GNU Compilers GCC Standard System Locations /bgsys/drivers/ppcfloor/gnu-linux/ powerpc-bgp-linux-gcc Specificities gfortran replaces the older g77 -std=legacy emulates the previous behavior Long doubles are now 128-bit by default -mlong-double-64 uses the more efficient 64-bit format Note that MPI is compiled without this option (i.e. 128-bit) No support for OpenMP in this version 66 PSSC Montpellier Blue Gene Team
67 GNU Tools and Libraries GLIBC 2.4 Thread support enabled Link Option: -lpthread Standard System Location /bgsys/drivers/ppcfloor/gnu-linux/bin powerpc-bgp-linux-addr2line Decodes more BG/P syscalls gdb / gdbserver python PSSC Montpellier Blue Gene Team
68 IBM XL Compilers for Blue Gene XL Fortran 11.1 XL C/C++ 9.0 Standard System Locations /opt/ibmcmp/xlf/bg/11.1/ /opt/ibmcmp/vacpp/bg/9.0/ Specificities Fortran 2003 standard supported xlf2003 Blue Gene/P Optimization Options -qarch=450[d] -qtune=450 PSSC Montpellier Blue Gene Team
69 IBM XL Compilers for Blue Gene / Scripts Runtime libraries: see the Application Development redbook 69 PSSC Montpellier Blue Gene Team
70 IBM XL Compilers for Blue Gene / MPI Wrappers Included in the BG/P driver Standard System Location /bgsys/drivers/ppcfloor/comm/bin mpicc / mpicxx / mpixlc / mpixlcxx / mpixlc_r / mpixlcxx_r mpixlf2003 / mpixlf77 / mpixlf90 / mpixlf95 / mpif77 / mpixlf2003_r / mpixlf77_r / mpixlf90_r / mpixlf95_r /bgsys/drivers/ppcfloor/comm/bin/fast Fast versions The 'fast' scripts use libraries that are built with assertions turned off and MPICH debug turned off Recommendations Build and test with original wrappers (/comm/bin/mpi*) Make sure you have successful runs of application before switching Using these shaves roughly a microsecond off of most communications calls (which can be 25% improvement) 70 PSSC Montpellier Blue Gene Team
71 XL OpenMP and SMP Support Each compiler has a thread-safe version bgxlf_r / bgxlc_r / bgcc_r mpixlf_r / mpixlc_r The thread-safe compiler version should be used for any multithreaded / OpenMP / SMP application Adds the required options for threaded applications Links in the thread-safe libraries Thread-safe libraries ensure that data accesses and updates are synchronized between threads -qsmp must be used on SMP / OpenMP applications -qsmp alone causes automatic parallelization to occur -qsmp=omp parallelizes based on the OMP directives in the code Refer to language reference for more detail on qsmp suboptions 71 PSSC Montpellier Blue Gene Team
72 IBM Compilers Key Options -qarch=440, 450 Generates instructions for the single floating-point pipe only (minimal option; 440 applies to the BG/L blrts_ compilers) -qarch=440d, 450d Generates instructions for the 2 floating-point pipes -qtune=450/440 -O3 (-qstrict) Minimal level for SIMDization -O3 -qhot (=simd) -O4 (-qnoipa) -O5 -qdebug=diagnostic Provides details about SIMDization, only with -qhot -qreport -qlist -qsource Provides pseudo-assembler code in the generated .lst file 72 PSSC Montpellier Blue Gene Team
73 Makefile using BGP libraries

# Compiler
BGP_FLOOR = /bgsys/drivers/ppcfloor
BGP_IDIRS = -I$(BGP_FLOOR)/arch/include -I$(BGP_FLOOR)/comm/include
BGP_LIBS = -L$(BGP_FLOOR)/comm/lib -lmpich.cnk -L$(BGP_FLOOR)/comm/lib -ldcmfcoll.cnk -ldcmf.cnk -lpthread -lrt -L$(BGP_FLOOR)/runtime/SPI -lSPI.cna

XL = /opt/ibmcmp/xlf/bg/11.1/bin/bgxlf

EXE = fhello
OBJ = hello.o
SRC = hello.f
FLAGS = -O3 -qarch=450d -qtune=450 -I$(BGP_FLOOR)/comm/include
FLD = -O3 -qarch=450d -qtune=450

$(EXE): $(OBJ)
# ${XL} $(FLAGS) $(BGP_LDIRS) -o $(EXE) $(OBJ) $(BGP_LIBS)
	${XL} $(FLAGS) -o $(EXE) $(OBJ) $(BGP_LIBS)

$(OBJ): $(SRC)
	${XL} $(FLAGS) $(BGP_IDIRS) -c $(SRC)

clean:
	rm hello.o fhello

73 PSSC Montpellier Blue Gene Team
74 Makefile using wrappers

# Compiler
XL = /bgsys/drivers/ppcfloor/comm/bin/mpixlf77

EXE = fhello
OBJ = hello.o
SRC = hello.f
FLAGS = -O3 -qarch=450d -qtune=450
FLD = -O3 -qarch=450d -qtune=450

$(EXE): $(OBJ)
# ${XL} $(FLAGS) $(BGP_LDIRS) -o $(EXE) $(OBJ) $(BGP_LIBS)
	${XL} $(FLAGS) -o $(EXE) $(OBJ) $(BGP_LIBS)

$(OBJ): $(SRC)
	${XL} $(FLAGS) $(BGP_IDIRS) -c $(SRC)

clean:
	rm hello.o fhello

74 PSSC Montpellier Blue Gene Team
75 MASS Library MASS = Mathematical Acceleration Subsystem Collection of tuned mathematical intrinsic functions Components MASS Scalar functions MASSV Vector functions (Single & Double precision) Standard System Location /opt/ibmcmp/xlmass/bg/4.4/bglib libmass.a libmassv.a /opt/ibmcmp/xlmass/bg/4.4/include Link Syntax -L/opt/ibmcmp/xlmass/bg/4.4/bglib -lmass -lmassv 75 PSSC Montpellier Blue Gene Team
76 ESSL Library ESSL = Engineering and Scientific Subroutine Library Optimized library and intrinsics for better application performance Serial Static Library supporting 32-bit applications Callable from C/C++ and Fortran PowerPC 450 optimized Components ESSL ESSL SMP (SMP Support) Parallel ESSL Standard System Location /opt/ibmmath/essl/4.3 ./lib/libesslbg.a ./lib/libesslsmpbg.a Link Syntax Fortran Linker -L/opt/ibmmath/essl/4.3/lib -lesslbg -lesslsmpbg C/C++ Linker -L/opt/ibmmath/essl/4.3/lib -lesslbg -lesslsmpbg -L/opt/ibmcmp/xlf/bg/11.1/lib -lxlf90_r -lxlopt -lxl -lxlfmath 76 PSSC Montpellier Blue Gene Team
77 MASS / ESSL Compilation Example MASS = -L/opt/ibmcmp/xlmass/bg/4.4/bglib -lmass -lmassv ESSL = -L/opt/ibmmath/essl/4.3/lib -lesslbg -lesslsmpbg ESSL = -L/opt/ibmmath/essl/4.3/lib -lesslbg -lesslsmpbg -L/opt/ibmcmp/xlf/bg/11.1/lib -lxlf90_r -lxlopt -lxl -lxlfmath LIBS = -Wl,-allow-multiple-definition $(MASS) $(ESSL) 77 PSSC Montpellier Blue Gene Team
78 Blue Gene/P Application User Workshop Blue Gene/P Execution MPIRUN Command Environment Variables LoadLeveler SUBMIT Command IBM Simple Scheduler 78 PSSC Montpellier Blue Gene Team
79 Control System Service Node (SN) An IBM System p 64-bit system Control System and database are on this system Access to this system is generally privileged Communication with Blue Gene via a private 1 Gb control Ethernet Database A commercial database tracks the state of the system Hardware inventory Partition configuration RAS data Environmental data Operational data including partition state, jobs, and job history Service action support for hot-plug hardware Administration and System status Administration either via a console or web Navigator interfaces 79 PSSC Montpellier Blue Gene Team
80 Service Node Database Structure DB2 Configuration Database Operational Database Environmental Database RAS Database Useful log files: /bgsys/logs/bgp Configuration database is the representation of all the hardware on the system Operational database contains information and status for things that do not correspond directly to a single piece of hardware, such as jobs, partitions, and history Environmental database keeps current values for all of the hardware components on the system, such as fan speeds, temperatures, voltages RAS database collects hard errors, soft errors, machine checks, and software problems detected from the compute complex 80 PSSC Montpellier Blue Gene Team
81 Job Launching Mechanism mpirun Command Standard mpirun options supported May be used to launch any job, not just MPI-based applications Has options to allocate partitions when a scheduler is not in use Scheduler APIs enable various schedulers LoadLeveler SLURM Platform LSF Altair PBS Pro Cobalt. Note: All the schedulers are built on mpirun/mpiexec 81 PSSC Montpellier Blue Gene Team
82 MPIRUN Implementation Identical functionalities to the BG/L implementation + new implementation + new options No more rsh/ssh mechanism, for security reasons; replaced by a daemon running on the Service Node freepartition command integrated as an option (-free) Standard input (STDIN) is supported on BGP 82 PSSC Montpellier Blue Gene Team
83 MPIRUN Command Parameters / 1 -args "program args" Pass "program args" to the BlueGene job on the compute nodes -cwd <Working Directory> Specifies the full path to use as the current working directory on the compute nodes. The path is specified as seen by the I/O and compute nodes -exe <Executable> Specifies the full path to the executable to run on the compute nodes. The path is specified as seen by the I/O and compute nodes -mode { SMP DUAL VN } Specifies what mode the job will run in: SMP, DUAL, or Virtual Node (VN) mode -np <Nb MPI Tasks> Create exactly n MPI ranks for the job. Aliases are -nodes and -n 83 PSSC Montpellier Blue Gene Team
84 MPIRUN Command Parameters / 2 -enable_tty_reporting By default MPIRUN will tell the control system and the C runtime on the compute nodes that STDIN, STDOUT and STDERR are tied to TTY type devices. Enables STDOUT bufferization (GPFS blocksize) -env "<Variable Name>=<Variable Value>" Set an environment variable in the environment of the job on the compute nodes -expenv <Variable Name> Export an environment variable in mpirun's current environment to the job on the compute nodes -label Use this option to have mpirun label the source of each line of output. -partition <Block ID> Specify a predefined block to use -mapfile <mapfile> Specify an alternative MPI topology. The mapfile path must be fully qualified as seen by the I/O and compute nodes -verbose <Level> Set the 'verbosity' level. The default is 0 which means that mpirun will not output any status or diagnostic messages unless a severe error occurs. If you are curious as to what is happening try levels 1 or 2. All mpirun generated status and error messages appear on STDERR. 84 PSSC Montpellier Blue Gene Team
85 MPIRUN Command Reference (Documentation) 85 PSSC Montpellier Blue Gene Team
86 MPIRUN Example Simple example: SMP mode, 4 OpenMP threads, 64 MB thread stack mpirun -partition XXX -np 128 -mode SMP -exe /path/exe -cwd working_directory -env "OMP_NUM_THREADS=4 XLSMPOPTS=spins=0:yields=0:stack= " mpirun application program interfaces available: get_parameters, mpirun_done 86 PSSC Montpellier Blue Gene Team
87 MPIRUN Command Example mpirun -cwd ~/hello_world -env "OMP_NUM_THREADS=4 XLSMPOPTS=spins=0:yields=0:stack= " -exe ~/hello_world/hello_world -mode SMP -np 128 -partition R00-M0 Execution Settings 128 MPI Tasks SMP Mode 4 OpenMP Threads 64 MB Thread Stack 87 PSSC Montpellier Blue Gene Team
88 MPIRUN Environment Variables Most command line options for mpirun can be specified using an environment variable -partition MPIRUN_PARTITION -nodes MPIRUN_NODES -mode MPIRUN_MODE -exe MPIRUN_EXE -cwd MPIRUN_CWD -host MMCS_SERVER_IP -env MPIRUN_ENV -expenv MPIRUN_EXP_ENV -mapfile MPIRUN_MAPFILE -args MPIRUN_ARGS -label MPIRUN_LABEL -enable_tty_reporting MPIRUN_ENABLE_TTY_REPORTING 88 PSSC Montpellier Blue Gene Team
89 STDIN / STDOUT / STDERR Support STDIN, STDOUT, and STDERR work as expected. You can pipe or redirect files into mpirun and pipe or redirect output from mpirun, just as you would on any command line utility. STDIN may also come from the keyboard interactively. Any compute node may send STDOUT or STDERR data. Only MPI rank 0 may read STDIN data. Mpirun always tells the control system and the C runtime on the compute nodes that it is writing to TTY devices. This is because logically MPIRUN looks like a pipe; it cannot seek on STDIN, STDOUT, and STDERR even if they are coming from files. As always, STDIN, STDOUT and STDERR are the slowest ways to get input and output from a supercomputer. Use them sparingly. STDOUT is not buffered and can generate a huge overhead for some applications. For such applications it is advisable to buffer STDOUT with -enable_tty_reporting 89 PSSC Montpellier Blue Gene Team
90 MPIRUN Features Default current working directory is $PWD only if the -cwd option is not specified Can be overridden by changing a parameter in the configuration file bridge.config Conflicting parameters collected from environment variables are ignored Example: with export MPIRUN_CWD=/bglhome/samjmill set, giving -cwd to mpirun -free wait would normally be an error; however, MPIRUN_CWD will be ignored in this case Multiple Program Multiple Data (MPMD) support Allows a different executable to be run on each pset we support a MPMD config file (see MPI-2 specification) Command line arguments also supported Both provided by a new mpiexec utility 90 PSSC Montpellier Blue Gene Team
91 MPIEXEC Command What is mpiexec? mpiexec is the method for launching and interacting with parallel Multiple Program Multiple Data (MPMD) jobs on BlueGene/P. It is very similar to mpirun with the only exception being the arguments supported by mpiexec are slightly different. Command Limitations (As of V1R1M1 Driver) A pset is the smallest granularity for each executable, though one executable can span multiple psets You must use every compute node of each pset, specifically different -np values are not supported The job's mode (SMP, DUAL, VNM) must be uniform across all psets MPI using DMA does not work MPI one-sided operations do not work (see above) Global Arrays and ARMCI do not work 91 PSSC Montpellier Blue Gene Team
92 MPIEXEC Command Parameters The only parameter / environmental supported by mpiexec that is not supported by mpirun is -configfile / MPIRUN_MPMD_CONFIGFILE The following parameters / environmentals are not supported by mpiexec since their use is ambiguous for MPMD jobs -args / MPIRUN_ARGS -cwd / MPIRUN_CWD -env / MPIRUN_ENV -env_all / MPIRUN_EXP_ENV_ALL -exe / MPIRUN_EXE -exp_env / MPIRUN_EXP_ENV -partition / MPIRUN_PARTITION -mapfile / MPIRUN_MAPFILE 92 PSSC Montpellier Blue Gene Team
93 MPIEXEC Configuration File Syntax -n <Nb Nodes> -wdir <Working Directory> <Binary> Example Configuration File Content -n 32 -wdir /home/bgpuser /bin/hostname -n 32 -wdir /home/bgpuser/hello_world /home/bgpuser/hello_world/hello_world Runs /bin/hostname on one 32-node pset and hello_world on another 32-node pset 93 PSSC Montpellier Blue Gene Team
94 XL Runtime Environment Variables Number of threads per MPI task OMP_NUM_THREADS OMP and SMP runtime options XLSMPOPTS=option1=XXX:option2=YYY:... schedule= static[=n]:dynamic[=n]:guided[=n]:affinity[=n] parthds= number of threads (with -qsmp in the compilation; should be set for esslsmp) stack= amount of stack space in bytes for each thread (default 4 MB) For performance spins= number of spins before a yield yields= number of yields before a sleep On BGP spins=0:yields=0 512 MPI tasks, 4 OpenMP threads with 64 MBytes stack per thread mpirun -mode SMP -np . -env "OMP_NUM_THREADS=4 XLSMPOPTS=spins=0:yields=0:stack= " Many other XL runtime variables exist: cf. the XL documentation 94 PSSC Montpellier Blue Gene Team
95 IBM Tivoli Workload Scheduler LoadLeveler Link to External Presentation IBM Tivoli Workload Scheduler LoadLeveler 95 PSSC Montpellier Blue Gene Team
96 SUBMIT Command submit = mpirun Command for HTC Command used to run a HTC job and act as a lightweight shadow for the real job running on a Blue Gene node Simplifies user interaction with the system by providing a simple common interface for launching, monitoring, and controlling HTC jobs Run from a Frontend Node Contacts the control system to run the HTC user job Allows the user to interact with the running job via the job's standard input, standard output, and standard error Standard System Location /bgsys/drivers/ppcfloor/bin/submit 96 PSSC Montpellier Blue Gene Team
97 HTC Technical Architecture 97 PSSC Montpellier Blue Gene Team
98 SUBMIT Command Syntax /bgsys/drivers/ppcfloor/bin/submit [options] or /bgsys/drivers/ppcfloor/bin/submit [options] binary [arg1 arg2... argn] Options -exe <exe> Executable to run -args "arg1 arg2... argn" Arguments, must be enclosed in double quotes -env <env=value> Define an environmental for the job -exp_env <env> Export an environmental to the job's environment -env_all Add all current environmentals to the job's environment -cwd <cwd> The job's current working directory -timeout <seconds> Number of seconds before the job is killed -mode <SMP DUAL VNM> Job mode -location <Rxx-Mx-Nxx-Jxx-Cxx> Compute core location, regular expression supported -pool <id> Compute Node pool ID 98 PSSC Montpellier Blue Gene Team
99 IBM Scheduler for HTC IBM Scheduler for HTC = HTC Jobs Scheduler Handles scheduling of HTC jobs HTC Job Submission External work requests are routed to HTC scheduler Single or multiple work requests from each source IBM Scheduler for HTC finds available HTC client and forwards the work request HTC client runs executable on compute node A launcher program on each compute node handles work request sent to it by the scheduler. When work request completes, the launcher program is reloaded and client is ready to handle another work request. 99 PSSC Montpellier Blue Gene Team
100 IBM Scheduler for HTC Components IBM Scheduler for HTC provides features not available with submit interface Provides queuing of jobs until compute resources are available Tracks failed compute nodes submit interface is intended for usage by job schedulers, not end users directly simple_sched Daemon Runs on Service Node or Frontend Node Accepts connections from startd and client programs startd Daemons Run on Frontend Node Connects to simple_sched, gets jobs and executes submit Client programs qsub = Submits job to run qdel = Deletes job submitted by qsub qstat = Gets status of submitted job qcmd = Admin commands 100 PSSC Montpellier Blue Gene Team
101 HTC Executables htcpartition Utility program shipped with Blue Gene Responsible for booting / freeing HTC partitions from a Frontend Node run_simple_sched_jobs Provides instance of IBM Scheduler for HTC and startd Executes commands either specified in command files or read from stdin Creates a cfg file that can be used to submit jobs externally to the cmd files or stdin Exits when the commands have all finished (or can specify keep running ) 101 PSSC Montpellier Blue Gene Team
102 IBM Scheduler for HTC Integration to LoadLeveler LoadLeveler still handles partition creation But doesn't handle partition booting End user can use a static partition or have LL dynamically create a partition Partition booting is done inside the Job Command File (JCF) Through the htcpartition executable Flags partition booting mode as HTC LoadLeveler Job Command File (JCF) for HTC jobs requires the following: htcpartition executable: Utility program shipped in Blue Gene Responsible for booting and freeing HTC partition from a FEN run_simple_sched_jobs executable Provides personal instance of SIMPLE job scheduler and startd Executes commands either specified in command files or read from stdin Creates a cfg file that can be used to submit jobs externally to the cmd files or stdin Exits when the commands have all finished (or can specify keep running) 102 PSSC Montpellier Blue Gene Team
103 IBM Scheduler for HTC Glide-In to LoadLeveler 103 PSSC Montpellier Blue Gene Team
104 LoadLeveler Job Command File Example

#!/bin/bash
# @ class = BGP64_1H
# @ comment = "Personality / HTC"
# @ environment =
# @ error = $(job_name).$(jobid).err
# @ group = ibm
# @ input = /dev/null
# @ job_name = Personality-HTC
# @ job_type = bluegene
# @ notification = never
# @ output = $(job_name).$(jobid).out
# @ requirements = (Machine == "fenp")
# @ queue

partition_boot() {
    /bgsys/drivers/ppcfloor/bin/htcpartition --boot --configfile $SIMPLE_SCHED_CONFIG_FILE --mode SMP
}

partition_free() {
    /bgsys/drivers/ppcfloor/bin/htcpartition --free
}

run_jobs() {
    /bgsys/opt/simple_sched/bin/run_simple_sched_jobs -config $SIMPLE_SCHED_CONFIG_FILE $COMMANDS_RUN_FILE
}

# Local Simple Scheduler Configuration File
SIMPLE_SCHED_CONFIG_FILE=$PWD/my_simple_sched.cfg
# Command File
COMMANDS_RUN_FILE=$PWD/cmds.txt

partition_boot
trap partition_free EXIT
run_jobs

104 PSSC Montpellier Blue Gene Team
105 Blue Gene/P Application User Workshop Blue Gene/P Parallel Libraries Shared Memory Message Passing 105 PSSC Montpellier Blue Gene Team
106 Parallel Architectures [Diagram: SMP, many CPUs sharing one memory over an interconnection bus (Share Everything), versus MPP, CPU/memory pairs connected by a network (Share Nothing)] 106 PSSC Montpellier Blue Gene Team
107 Shared Memory Parallelism Parallelism at «SMP level» within a single node Limited to the number of cores within a node (4 on BG/P) Pretty good scalability on BG/P Invocation Options Automatically by using compiler options (-qsmp=auto) In general not very efficient By adding OpenMP compiler directives in the source code By explicitly using POSIX Threads [Diagram: Serial Program + Parallel directives, fed through the SMP Compiler, yields a Parallel threaded program (Directed Decomposition Model)] 107 PSSC Montpellier Blue Gene Team
108 Distributed Memory Parallelism Parallelism across N SMPs (often Compute Nodes, CN) Each CN can only address its own (local) memory This implies modifications/re-writing of portions of the code and the use of «message passing» libraries such as MPI (or DCMF, SPI) «Unlimited» Scalability Can scale up to thousands of processors 108 PSSC Montpellier Blue Gene Team
109 Mixed Mode Programming Shared & Distributed Memory at the same time MPI programming style for inter-node communications OpenMP programming style within the nodes This has proven to be very efficient when each node solves a «local problem» that can be parallelized by using a mathematical library (ESSLSMP) In the coming years, processor performance improvement will happen through an increase in the number of cores [Diagram: Tasks 0-3 on one node and Tasks 4-7 on another; shared-memory intra-node messages within each node, inter-node messages across the switch] 109 PSSC Montpellier Blue Gene Team
110 Terminology Scalability What is the performance improvement when increasing the number of processors? Bandwidth How many MBytes can be exchanged per second between the nodes Latency Minimum time needed to send a zero-length message to a remote processor 110 PSSC Montpellier Blue Gene Team
111 Blue Gene/P Messaging Framework [Stack diagram: applications (MPICH via a GLUE layer, Converse/Charm++, Global Arrays/ARMCI primitives, UPC messaging) sit on the DCMF API (C); beneath it are the pt2pt protocols (eager, rendezvous), Multisend protocols, and the CCMI Collective Layer (barrier, broadcast, allreduce); these run on the Message Layer Core (C++) with its DMA, Tree, and IPC devices, over the SPI and the DMA/network hardware. Use of DCMF and the layers above it is supported/encouraged; direct SPI use is supported but not recommended] Multiple programming paradigms supported MPI, Charm++, ARMCI, GA, UPC (as a research initiative) SPI: Low-level systems programming interface DCMF: Portable active-message API 111 PSSC Montpellier Blue Gene Team
112 Messaging: multiple levels of abstraction Message layer core Tools and interfaces to the BG/P network hardware Close to hardware, flexible, requires high degree of expertise Protocols Point-to-point protocols Low-level communication API for message passing and one-sided communication Easy to use, but less flexible Multisend allows multiple destinations to be posted in one send call Allows overhead of software stack to be amortized across many messages CCMI collective layer (Component Collective Messaging Interface) Extensible collective runtime with schedules and executors to implement different optimized collective operations APIs DCMF (Deep Computing Message Framework) Unified across IBM HPC platforms Active message API, non-blocking optimized collectives SPI : (System Programming Interface) Direct DMA access Runtimes MPI, Charm++ (built directly on DCMF), ARMCI 4.0.5, *UPC research prototype for BG/L 112 PSSC Montpellier Blue Gene Team
113 Common Programming Models MPI only quad mode with enhancements Separate MPI process for each processor in the compute node DMA support for each MPI process Ensure network does not block when processor is computing Drive network harder Sharing of read-only or write-once data on each node Need programming language extension to identify read-only data Allow applications to overcome memory limits of quad mode MPI + OpenMP OpenMP within each node relies on cache coherence support Only master thread on each node initiates communication Get benefits of message aggregation Exploit multiple processors to service MPI call 113 PSSC Montpellier Blue Gene Team
114 Message Passing Interface (MPI) Link to External Presentation Message Passing Interface (MPI) 114 PSSC Montpellier Blue Gene Team
115 Communication Libraries MPI Based on MPICH2 Optimized collectives built on DCMF See the Application Development redbook DCMF (Deep Computing Message Framework) SPI (System Programming Interfaces) 115 PSSC Montpellier Blue Gene Team
116 MPI Point-to-Point Routing The route from a sender to a receiver on a torus network has two possible paths: Deterministic routing Adaptive routing Selecting deterministic or adaptive routing depends on the protocol that is used for the communication. The Blue Gene/P MPI implementation supports three different protocols: MPI short protocol The MPI short protocol is used for short messages (less than 224 bytes), which consist of a single packet. These messages are always deterministically routed. The latency for eager messages is around 3.3 µs. MPI eager protocol The MPI eager protocol is used for medium-sized messages. It sends a message to the receiver without negotiating with the receiving side that the other end is ready to receive the message. This protocol also uses deterministic routes for its packets. MPI rendezvous protocol Large (greater than 1200 bytes) messages are sent using the MPI rendezvous protocol. In this case, an initial connection between the two partners is established. Only after that will the receiver use direct memory access (DMA) to obtain the data from the sender. This protocol uses adaptive routing and is optimized for maximum bandwidth. Naturally, the initial rendezvous handshake increases the latency. The eager/rendezvous crossover is controlled by the DCMF_EAGER variable 116 PSSC Montpellier Blue Gene Team
117 MPI on BGP vs BGL MPI 2 Standard compliance SMP with thread mode multiple Thread mode multiple is the default Simpler thread modes need to be requested with MPI_Init_thread Dual and quad mode also supported DMA engine to optimize communication Improved progress semantics over Blue Gene/L The DMA sends and receives packets while the processor is busy computing Drive network harder: can keep ~5 links busy for near-neighbor traffic in both directions Allows overlap of computation and communication Comm thread which is enabled on packet arrival to make BGP MPI fully progress compliant Allows tag matching of Rzv messages in the comm thread Enabled through environment variable DCMF_INTERRUPTS=1 SMP mode has 1 comm thread, dual has two and quad (VN) mode has four comm threads Comm threads are scheduled by interrupts Built on top of the DCMF messaging API 3+ us latency and 370 MB/s bandwidth per link 117 PSSC Montpellier Blue Gene Team
118 Collectives Use the hardware support in the collective network and global interrupt networks Supported operations Barrier Broadcast Allreduce Alltoall Allgather 118 PSSC Montpellier Blue Gene Team
119 Barrier Global barriers GI Tree Sub-communicators DMA barrier with point-to-point messages GI barrier latency is about 1.2 us Sub-communicators use a logarithmic barrier which is about 2500 cycles/hop 119 PSSC Montpellier Blue Gene Team
120 Broadcast Uses Collective network for global operations Peak throughput of 0.92 b/cycle Latency 5+ us DMA on sub communicators Bandwidth of 0.9 b/cycle on rectangular sub-communicators Latency between 10-20us Synchronous broadcast Can be easily extended to one per communicator Currently available in MPI Asynchronous broadcast Low latency DMA Broadcast No barrier needed Unlimited broadcast messages flying on the torus Planned for release 2 Available in the DCMF API in release PSSC Montpellier Blue Gene Team
121 All Reduce Tree network for global operations Throughput of 0.9 b/cycle Latency 4+us on a mid-plane in SMP mode DMA for sub communicators Optimizations for rectangular communicators 121 PSSC Montpellier Blue Gene Team
122 Alltoall DMA-based alltoall Uses adaptive routing on the network Optimized for latency and bandwidth (latency 0.9 us/destination) 96% of peak throughput on a midplane Alltoall performance for large messages is optimized by the all-to-all mode in the DMA device DCMF_FIFOMODE=ALLTOALL (20% more) Alltoall is disabled for sub-communicators in MPI thread mode multiple in release PSSC Montpellier Blue Gene Team
123 Optimized Collectives (all are non-blocking at the DCMF API level)

Barrier
- Torus via DMA: Binomial algorithm
- Collective Network: N/A
- Global Interrupts: Uses Global Interrupt wires to determine when nodes have entered the barrier.

Sync Broadcast (BG/L style broadcast where all nodes need to reach the broadcast call before data is transmitted)
- Torus via DMA: Rectangular algorithm; Binomial algorithm
- Collective Network: Uses a Collective Broadcast via spanning class route. To prevent unexpected packets, broadcast is executed via global BOR.
- Global Interrupts: N/A

All-to-All(v)
- Torus via DMA: Each node sends messages in randomized permutations to keep the bisection busy.
- Collective Network: N/A
- Global Interrupts: N/A

Reduce
- Torus via DMA: Rectangular algorithm; Binomial algorithm
- Collective Network: Same as Collective All-reduce, but with no store on non-root nodes.
- Global Interrupts: N/A

All-reduce
- Torus via DMA: Rectangular algorithm; Binomial algorithm
- Collective Network: Uses a Collective Broadcast via spanning class route. Native tree operations, single and double pass double precision floating point operations.
- Global Interrupts: N/A

All-gather(v)
- Torus via DMA: Broadcast, reduce, and all-to-all based algorithms. Algorithm used depends on geometry, partition size, and message size.
- Collective Network: N/A

123 PSSC Montpellier Blue Gene Team
124 MPI Environment Variables / General BG_MAPPING Allows user to specify mapping as an environment variable Options are: TXYZ, TXZY, TYXZ, TYZX, TZXY, TZYX, XYZT, XZYT, YXZT, YZXT, ZXYT, ZYXT or a path to a mapping file XYZT is the default. Equivalent to BGLMPI_MAPPING on BGL. BG_COREDUMPONEXIT Forces all jobs to dump core on exit Useful for debug 124 PSSC Montpellier Blue Gene Team
125 MPI Environment Variables / MPICH2 DCMF_EAGER This value, passed through atoi(), is the smallest message that will be sent using the Rendezvous protocol. This is also one greater than the largest message sent using the Eager protocol. (Synonyms: DCMF_RVZ, DCMF_RZV) DCMF_COLLECTIVES When set to "0", this will disable the optimized collectives. When set to "1", this will enable the optimized collectives. Otherwise, this is left at the default. (Synonyms: DCMF_COLLECTIVE) DCMF_TOPOLOGY When set to "0", this will disable the optimized topology routines. When set to "1", this will enable the optimized topology routines. Otherwise, this is left at the default. DCMF_ALLREDUCE Possible options: MPICH, BINOMIAL, RECTANGLE, TREE DCMF_INTERRUPTS When set to "0", interrupts are disabled. Message progress occurs via polling. When set to "1", interrupts are enabled. Message progress occurs in the background via interrupts, and/or by polling. Default is "0". (Synonyms: DCMF_INTERRUPT) 125 PSSC Montpellier Blue Gene Team
126 MPI Environment Variables / ARMCI DCMF_INTERRUPTS When set to "0", interrupts are disabled. Message progress occurs via polling. When set to "1", interrupts are enabled. Message progress occurs in the background via interrupts, and/or by polling. Default is "1". (Synonyms: DCMF_INTERRUPT) DCMF Messaging DCMF_RECFIFO The size, in bytes, of each DMA reception FIFO. Incoming torus packets are stored in this FIFO until DCMF Messaging can process them. Making this larger can reduce torus network congestion. Making this smaller leaves more memory available to the application. The default size is 8 megabytes, and DCMF Messaging uses one reception FIFO. etc 126 PSSC Montpellier Blue Gene Team
127 OpenMP Link to External Presentation OpenMP Blue Gene/P supports OpenMP PSSC Montpellier Blue Gene Team
128 OpenMP OpenMP (Open Multi-Processing) is an application programming interface (API) that supports multi-platform shared-memory multiprocessing programming in C/C++ and Fortran on many architectures It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior Jointly defined by a group of major computer hardware and software vendors, OpenMP is a portable, scalable model that gives programmers a simple and flexible interface for developing parallel applications For Blue Gene/P, a hybrid model for parallel programming can be used: OpenMP for parallelism on the node MPI (Message Passing Interface) for communication between nodes 128 PSSC Montpellier Blue Gene Team
129 OpenMP Blue Gene/P Application User Workshop OpenMP is an implementation of multithreading, a method of parallelization whereby the master thread "forks" a specified number of slave threads and a task is divided among them The threads then run concurrently, with the runtime environment allocating threads to different processors The section of code that is meant to run in parallel is marked accordingly, with a preprocessor directive that will cause the threads to form before the section is executed For some programs, the original code may remain intact, and OpenMP directives are added around it For most programs, some code restructuring will also be necessary for optimal execution 129 PSSC Montpellier Blue Gene Team
130 OpenMP (continued) Each thread has an "id" attached to it which can be obtained using a function (called omp_get_thread_num() in C/C++) The thread id is an integer, and the master thread has an id of "0". After the execution of the parallelized code, the threads "join" back into the master thread, which continues onward to the end of the program By default, each thread executes independently The runtime environment allocates threads to processors depending on usage, machine load and other factors The number of threads can be assigned by the runtime environment based on environment variables or in code using functions 130 PSSC Montpellier Blue Gene Team
131 OpenMP Loop Example: Global Weighted Sum
#define N 1000000 /* size of a; the slide's original value was lost in transcription */
long w;
double a[N], sum;
int i;
sum = 0.0;
/* fork off the threads and start the work-sharing construct */
#pragma omp parallel for private(w) reduction(+:sum) schedule(static,1)
for (i = 0; i < N; i++) {
    w = i*i;
    sum = sum + w*a[i];
}
printf("\n %lf", sum);
131 PSSC Montpellier Blue Gene Team
132 Blue Gene/P Application User Workshop Blue Gene/P Debugging Integrated Tools Reference Software 132 PSSC Montpellier Blue Gene Team
133 Available Tools Summary
Integrated Tools GDB Core Files + addr2line Coreprocessor (Requires Service Node access) Compiler Options: -g, -C, -qsigtrap, -qflttrap, Traceback functions, memory size, kernel/signal/exit traps,
Supported Commercial Software TotalView DDT (Allinea) 133 PSSC Montpellier Blue Gene Team
134 Debugger Interfaces ptrace-like interfaces available via ciod non-parallel: gdbserver for direct use with gdb parallel: Totalview, DDT, or other tools lightweight core files each node writes a small file with regs, traceback, etc superset of parallel tools consortium format Coreprocessor (next slides) 134 PSSC Montpellier Blue Gene Team
135 GNU Project Debugger (GDB) Simple GNU GPL debug server Debugging Session How-To Binary must be compiled and linked with options: -g -qfullpath GDB server must be started before the application mpirun command has a special option: -start_gdbserver Limitations Interactive mode One GDB instance per Compute Node Multiple GDB clients required to debug multiple CNs at the same time Limited subset of primitives (however enough to be useful) Standard Linux GDB client Not aware of the PowerPC 450 Double FPU architecture Standard System Location /bgsys/drivers/ppcfloor/gnu-linux/bin 135 PSSC Montpellier Blue Gene Team
136 Using GDB: Example GDB Server mpirun -exe /home/bgpuser/hello_world/hello_world -cwd /home/bgpuser/hello_world -mode SMP -np 64 -start_gdbserver /bgsys/drivers/ppcfloor/ramdisk/sbin/gdbserver -verbose 1 > dump_proctable (to get all the node ipaddr:port per MPI rank) > or type directly the rank you want to attach to GDB > press enter after entering the target command in the second window GDB Client (one client instance per Compute Node) cd /home/bgpuser/hello_world /bgsys/drivers/ppcfloor/gnu-linux/bin/gdb hello_world (gdb) > target remote ipaddr:port (gdb) > step 136 PSSC Montpellier Blue Gene Team
137 Core Files / addr2line Utility Core Files On BG/P, core files are text Hexadecimal addresses in section STACK describe the function call chain up to the program exception Section delimited by tags: +++STACK / ---STACK addr2line Utility addr2line retrieves the source code location from a hexadecimal address Standard Linux command Usage addr2line -e <Binary> <Hexadecimal Address> Standard System Location /bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc-bgp-linux-addr2line 137 PSSC Montpellier Blue Gene Team
138 Coreprocessor Supported utility originally written to aggregate and display lightweight corefiles May also use JTAG on the SN to gather state for kernel hangs Has shown scaling to 64 Racks Views to show commonality among state of nodes 138 PSSC Montpellier Blue Gene Team
139 Coreprocessor GUI 139 PSSC Montpellier Blue Gene Team
140 Personality Definition Double Definition Static data given to every Compute Node and I/O Node at boot time by the control system Personality data contains information that is specific to the node Set of C language structures and functions that allows querying personality data from the node Useful to determine, at run time, where the tasks of the application are running Might be used to tune certain aspects of the application at run time, such as determining which set of tasks share the same I/O Node and then optimizing the network traffic from the Compute Nodes to that I/O Node 140 PSSC Montpellier Blue Gene Team
141 Personality Usage Elements Two Include Files #include <common/bgp_personality.h> #include <common/bgp_personality_inlines.h> Structure _BGP_Personality_t personality; Query Function Kernel_GetPersonality(&personality, sizeof(personality)); 141 PSSC Montpellier Blue Gene Team
142 Personality Information
personality.Network_Config.[X|Y|Z]nodes Number of X / Y / Z nodes in torus
personality.Network_Config.[X|Y|Z]coord X / Y / Z node coordinates in torus
Kernel_PhysicalProcessorID() Core ID on Compute Node (0, 1, 2, 3)
BGP_Personality_getLocationString(&personality, location) Location string Rxx-Mx-Nxx-Jxx
142 PSSC Montpellier Blue Gene Team
143 BGP personality: get information mode, pset, partition size, placement /bgsys/drivers/ppcfloor/arch/include/common bgp_personality.h bgp_personality_inlines.h
#include <mpi.h>
#include <spi/kernel_interface.h>
#include <common/bgp_personality.h>
#include <common/bgp_personality_inlines.h>
int main(int argc, char * argv[]) {
    int taskid, ntasks;
    int xcoord, ycoord, zcoord, xsize, ysize, zsize;
    int pset_num, pset_size, pset_rank, procid, node_config;
    int memory_size_mbytes;
    char location[128];
    _BGP_Personality_t personality;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
    Kernel_GetPersonality(&personality, sizeof(personality));
    xcoord = personality.Network_Config.Xcoord;
    ycoord = personality.Network_Config.Ycoord;
    zcoord = personality.Network_Config.Zcoord;
    xsize = personality.Network_Config.Xnodes;
    ysize = personality.Network_Config.Ynodes;
    zsize = personality.Network_Config.Znodes;
    pset_num = personality.Network_Config.PSetNum;
    pset_size = personality.Network_Config.PSetSize;
    pset_rank = personality.Network_Config.RankInPSet;
    memory_size_mbytes = personality.DDR_Config.DDRSizeMB;
    BGP_Personality_getLocationString(&personality, location);
    printf("memory size = %d MBytes\n", memory_size_mbytes);
    procid = Kernel_PhysicalProcessorID();
    node_config = personality.Kernel_Config.ProcessConfig;
    if (node_config == _BGP_PERS_PROCESSCONFIG_SMP) printf("SMP mode\n");
    else if (node_config == _BGP_PERS_PROCESSCONFIG_VNM) printf("Virtual-node mode\n");
    else if (node_config == _BGP_PERS_PROCESSCONFIG_2x2) printf("Dual mode\n");
    else printf("Unknown mode\n");
    MPI_Finalize();
    return 0;
}
143 PSSC Montpellier Blue Gene Team
144 Blue Gene/P Application User Workshop Blue Gene/P Performance Analysis Profiling HPC Toolkit Major Open-Source Tools 144 PSSC Montpellier Blue Gene Team
145 Performance Tools Low level SPI provided to configure, reset and read hardware perf counters HPC Toolkit Considering addition of per-job performance metrics recorded via job history (DB2, Navigator) Unix gprof command (compile with -g -pg) PAPI interface to the perf counters 145 PSSC Montpellier Blue Gene Team
146 Profiling Blue Gene/P Application User Workshop CPU profiling tool similar to gprof Can be used to profile both serial and parallel applications Use procedure-profiling information to construct a graphical display of the functions within an application Provide quick access to the profiled data and helps users identify functions that are the most CPU-intensive Based on sampling (support from both compiler and kernel) Charge execution time to source lines and show disassembly code /bgl/home1/ihchung/hpct_bgl/xprofiler on BG/L front end node 146 PSSC Montpellier Blue Gene Team
147 Profiling How-To
Compile the program with options: -g -pg Will create symbols required for debugging / profiling
Execute the program Standard way Execution generates profiling files in the execution directory gmon.out.<task ID> Binary files, not human readable
Number of files depends on an environment variable: either one profiling file per process, or three profiling files only (one file each for the slowest / fastest / median process)
Two options for output files interpretation GNU Profiler (Command-line utility) Xprofiler (Graphical utility / Part of HPC Toolkit) 147 PSSC Montpellier Blue Gene Team
148 GNU Profiler (gprof) Allows profiling report generation from profiling output files Standard Usage gprof <Binary> gmon.out.<task ID> > gprof.out.<task ID> Profiling report limited compared to standard Unix/Linux: shows the subroutines and their relative importance and the number of calls, but not the call tree 148 PSSC Montpellier Blue Gene Team
149 IBM HPC Toolkit Toolkit Content CPU Performance Analysis Xprofiler HPM Message-Passing Performance Analysis MP_Profiler Visualization and Analysis PeekPerf Support via Advanced Computing Technology Center in Research (ACTC) On-site training support Links PSSC Montpellier Blue Gene Team
150 Xprofiler GUI / Overview Window Width of a bar: time including called routines Height of a bar: time excluding called routines Call arrows labeled with number of calls Overview window for easy navigation (View Overview) 150 PSSC Montpellier Blue Gene Team
151 Xprofiler GUI / Source Code Window Source code window displays source code with time profile (in ticks=.01 sec) Access Select function in main display Context Menu Select function in flat profile Code Display Show Source Code 151 PSSC Montpellier Blue Gene Team
152 Xprofiler GUI / Disassembler Code 152 PSSC Montpellier Blue Gene Team
153 Message-Passing Performance: MP_Profiler Library Captures summary data for MPI calls Source code traceback User MUST call MPI_Finalize() in order to get output files No changes to source code MUST compile with -g to obtain source line number information MP_Tracer Library Captures timestamped data for MPI calls Source traceback Available at /bgl/home1/ihchung/hpct_bgl/mp_profiler on BG/L front end node 153 PSSC Montpellier Blue Gene Team
154 MP_Profiler Output with Peekperf 154 PSSC Montpellier Blue Gene Team
155 MP_Profiler - Traces 155 PSSC Montpellier Blue Gene Team
156 MP_Profiler Message Size Distribution
[Table; values garbled in transcription. For each MPI function (MPI_Comm_size, MPI_Comm_rank, MPI_Isend, MPI_Irecv, MPI_Waitall, MPI_Barrier) the report lists #Calls, Message Size bin (0...256, 256...1K, 1K...4K, 4K...16K, 16K...64K, 64K...256K, 256K...1M, 1M...4M bytes), #Bytes and Walltime.] 156 PSSC Montpellier Blue Gene Team
157 Output Blue Gene/P Application User Workshop Summary report for each task perfhpm<taskid>.<pid> libhpm (V 2.6.0) summary Total execution time of instrumented code (wall time): seconds Instrumented section: 3 - Label: job 1 - process: 1 file: sanity.c, lines: 33 <--> 70 Count: 1 Wall Clock Time: seconds BGL_FPU_ARITH_MULT_DIV (Multiplication and divisions, fmul, fmuls, fdiv, fdivs (Book E mul, div)) : 0 BGL_FPU_LDST_DBL_ST ( ) : 23 BGL_UPC_L3_WRBUF_LINE_ALLOC (Write buffer line was allocated) :1702 Peekperf performance file hpm<taskid>_<progname>_<pid>.viz Table performance file tb_hpm<taskid>.<pid> 157 PSSC Montpellier Blue Gene Team
158 Environment Flags
HPM_EVENT_SET Select the event set to be recorded. Integer (0..15)
HPM_NUM_INST_PTS Overwrite the default of 100 instrumentation sections in the app. Integer value > 0
HPM_WITH_MEASUREMENT_ERROR Deactivate the procedure that removes measurement errors. True or False (0 or 1)
HPM_OUTPUT_NAME Define an output file name different from the default. String
HPM_VIZ_OUTPUT Indicate if a .viz file (for input to PeekPerf) should be generated or not. True or False (0 or 1)
HPM_TABLE_OUTPUT Indicate if a table text file should be generated or not. True or False (0 or 1)
158 PSSC Montpellier Blue Gene Team
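A minimal sketch of an HPM environment for a run (all values illustrative, and the output name is a placeholder of ours):

```shell
# Record event set 1 and generate both the PeekPerf .viz file
# and the text table; values are illustrative only.
export HPM_EVENT_SET=1
export HPM_VIZ_OUTPUT=1
export HPM_TABLE_OUTPUT=1
export HPM_OUTPUT_NAME=my_hpm_run   # hypothetical output file name
```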
159 Peekperf Blue Gene/P Application User Workshop Visualization and analysis tool Offline analysis and viewing capability Supported platforms AIX Linux (Power/Intel) Windows (Intel) BlueGene *The toolkit will be available soon on AMD platform 159 PSSC Montpellier Blue Gene Team
160 MP_Profiler Visualization Using PeekPerf 160 PSSC Montpellier Blue Gene Team
161 MP_Tracer Visualization Using PeekPerf 161 PSSC Montpellier Blue Gene Team
162 HPM Visualization Using PeekPerf 162 PSSC Montpellier Blue Gene Team
163 MPITRACE library (IBM internal tool)
IBM tool provided by B. Walkup (ACTC US)
Collects all the MPI communications of an application Measures the time spent in the MPI routines Provides the call graph for communication subroutines
Link with mpi_trace.o or libmpiprof.a, or link with: -lmpitrace
Set the variables: TRACE_ALL_PROCS=yes export TRACEBACK_LEVEL=n
Produces a report file per task Output files: mpi.profile.#mpi_task 163 PSSC Montpellier Blue Gene Team
164 MPI_TRACE: output file
MPI Routine #calls avg. bytes time(sec) for: MPI_Comm_size MPI_Comm_rank MPI_Isend MPI_Irecv MPI_Waitall MPI_Reduce
total communication time = seconds. total elapsed time = seconds. user cpu time = seconds. system time = seconds. maximum memory size = KBytes
Message size distributions (#calls, avg. bytes, time(sec)) for: MPI_Isend MPI_Irecv MPI_Reduce [table values lost in transcription]
164 PSSC Montpellier Blue Gene Team
165 TAU TAU = Tuning and Analysis Utility Program and performance analysis tool framework being developed for the DOE Office of Science, ASC initiatives at LLNL, the ZeptoOS project at ANL, and the Los Alamos National Laboratory Provides a suite of static and dynamic tools that provide graphical user interaction and interoperation to form an integrated analysis environment for parallel Fortran, C++, C, Java, and Python applications. Link PSSC Montpellier Blue Gene Team
166 SCALASCA SCALASCA = SCalable performance Analysis of LArge SCale Applications Performance tool measurement and analysis developed by the Innovative Computing Laboratory (ICL) and the Jülich Supercomputing Centre (JSC). Scalable trace analysis tool based on the KOJAK approach that exploits both distributed memory and parallel processing capabilities SCALASCA analyzes separate local trace files in parallel by replaying the original communication on as many CPUs as have been used to execute the target application itself Link PSSC Montpellier Blue Gene Team
167 Blue Gene/P Application User Workshop Blue Gene/P Advanced Topics Blue Gene/P Memory Advanced Compilation with IBM XL Compilers SIMD Programming Communications Frameworks Checkpoint/Restart I/O Optimization 167 PSSC Montpellier Blue Gene Team
168 BPC Architecture
[Chip block diagram; recoverable details:] Four PPC450 cores, each with 32 KB L1 I-cache, 32 KB L1 D-cache and a Double FPU; per-core snoop filter and L2; multiplexing switch to two shared 4 MB eDRAM L3 caches (or on-chip memory) with directory and ECC (512b data + 72b ECC buses); DMA engine, shared SRAM, torus, collective and global barrier interfaces with arbiter; 10 Gbit Ethernet; two DDR2 controllers with ECC; JTAG access; PMU. Link rates: torus 6 x 3.4 Gb/s bidirectional, collective 3 x 6.8 Gb/s bidirectional, 4 global barriers or interrupts, 10 Gb/s Ethernet, 13.6 GB/s DDR-2 DRAM bus. 168 PSSC Montpellier Blue Gene Team
169 Blue Gene/P ASIC 169 PSSC Montpellier Blue Gene Team
170 Memory Cache Levels 170 PSSC Montpellier Blue Gene Team
171 L1 Cache Blue Gene/P Application User Workshop
Architecture 32 KB I-cache + 32 KB D-cache 32B line, 64-way set-associative, 16 sets Round-robin replacement Write-through 2 buses towards the L2, 128 bits wide 3-line fetch buffer
Performance L1 load hit 8B/cycle, 4 cycle latency (floating point) L1 load miss, L2 hit 4.6B/cycle, 12 cycle latency Store Write-through, limited by external logic to about one request every 2.9 cycles (about 5.5B/cycle peak)
Important Avoid instruction prefetching in L1. Keep the number of prefetch streams below 3 The programmer can use XL compiler directives or the assembler instruction (dcbt) Without intensive reuse of L1 and registers, the memory subsystem cannot feed the double FPU
171 PSSC Montpellier Blue Gene Team
172 L2 Cache
3 independent PPC ports Instruction read Data read Data write
Switch function 1 read and 1 write switched to each L3 every 425 MHz cycle
L2 boosts the overall performance and does not require any special software provisions
[Diagram: PPC450 L1 instruction/data caches feed the L2 IC RD / L2 DC RD / L2 DC WR ports through the L2 prefetch unit and L3 arbiter to L3 0 and L3 1; 16B@850MHz and 32B@425MHz links] 172 PSSC Montpellier Blue Gene Team
173 L2 Data Prefetch
Prefetch engine 128B line prefetch 15-entry fully associative prefetch buffer 1- or 2-line deep sequential prefetching Supports up to 7 streams Supports strides less than 128B 4.6B/cycle, 12 cycle latency
[Diagram: PPC450 data/address lines into the L2 DATA READ / PREFETCH unit with line buffers, 15 stream engines, stream detector and 8-entry miss history; data from other units; L3 data/address] 173 PSSC Montpellier Blue Gene Team
174 L3 Cache Blue Gene/P Application User Workshop
Request queue per port 8 read requests 4 write requests
4 eDRAM banks per chip, each containing an independent directory
15-entry 128B-wide write combining buffer
Hit-under-miss resolution Limit defined by request queues and write buffer Up to 8 read misses per port Up to 15 write misses per write combining buffer
Limitation: banking conflict (possibility to configure dedicated L3/core - needs IBM Lab support)
[Diagram: Core0/1, Core2/3 and DMA feed the request queues of L3 cache 0 and L3 cache 1 over 32B@425MHz links to the eDRAM banks] 174 PSSC Montpellier Blue Gene Team
175 DRAM Controller
20 8b-wide DDR2 DRAM modules per controller
4 module-internal banks (4 x 512 MB)
Command reordering based on bank availability
Theoretical bandwidth: 128 bytes / 16 cycles
[Diagram: L3 request queues and miss handler to eDRAM and the DDR controller over 32B@425MHz and 16B@425MHz links] 175 PSSC Montpellier Blue Gene Team
176 Performance Latencies L1: 4 cycles L2: 12 cycles L3: 50 cycles DDR: 104 cycles Bandwidth/peak Reg <-> L1 cache: 8B/cycle L1 <-> L2: 4.6B/cycle L2 <-> L3: 16B/cycle per core L3 <-> DDR: less than 16B/cycle per chip 176 PSSC Montpellier Blue Gene Team
177 Bottlenecks L2 L3 switch Not a full core to L3 bank crossbar Request rate and bandwidth limited if two cores of one dual processor group access the same L3 cache Banking 4 banks on 512Mb DDR modules Burst-8 transfer (128B): 16 cycles Page open, access, precharge: 64 cycles Peak bandwidth only achievable if accessing 3 other banks before accessing the same bank again 177 PSSC Montpellier Blue Gene Team
178 Main Memory Banking Optimization Example
For sequential access, two arrays used in a single operation must not be aligned on the same bank
parameter (n= ) ! original value lost in transcription
real(8) x(n), y(n), w(n)
..
do j=1,n
  x(j) = x(j) + y(j)*w(j)
enddo
BG memory tuning
parameter (n= , offset=16)
real(8) x(n+offset), y(n+2*offset), w(n)
..
do j=1,n
  x(j) = x(j) + y(j)*w(j)
enddo
178 PSSC Montpellier Blue Gene Team
179 CNK Memory Usage
Static TLB mappings Only 64 mappings per core Kernel mapped protected App R/O text & data protected Alignment restrictions
Static TLB benefits No penalty for random data access patterns Greatly simplifies DMA
An implementation of mmap that used dynamically assigned physical memory would require handling page translations in order to map the current virtual addresses to physical addresses; at this time there are no plans to implement such a TLB miss handler. 179 PSSC Montpellier Blue Gene Team
180 CNK Memory Design (Memory Leaks)
The design point of CNK: a statically partitioned physical memory
Reasons
No memory translation exceptions If a virtual mapping is not known by the processor, the hardware issues an exception. The operating system is then required to install the mapping in the hardware (if it exists), or fail with a segfault (if it does not exist). Processing this exception can take a 0.25-1 usec penalty per translation, depending on how sophisticated the miss handler needs to be. For an application that performs a lot of pointer chasing or pseudo-random memory accesses, this effectively results in a 0.25-1 usec latency for every memory access.
Deterministic performance from run to run No locking contention between processes
Deterministic out-of-memory behavior e.g., process 0 will be able to use the same memory footprint from run to run because process 0 and processes 1-3 are not competing for the same memory allocations.
No OS noise e.g., if nodes enter a barrier and node 0 takes a translation fault, its participation in the barrier is delayed by 1 usec; the application just lost msec of "useful work".
180 PSSC Montpellier Blue Gene Team
181 CNK Memory Usage
When an application stores data in memory, it can be classified as follows:
data = Initialized static and common variables
bss = Uninitialized static and common variables
heap = Controlled allocatable arrays
stack = Controlled automatic arrays and variables
You can use the Linux size command to gain an idea of the memory size of the program. However, the size command does not provide any information about the runtime memory (heap and stack).
The stack section starts from the top, at address 7fffd31c in SMP Node Mode (approximately 2 GB), at address 403fd31c in Dual Node Mode (approximately 1 GB), and at 209fd31c in Virtual Node Mode (approximately 512 MB) 181 PSSC Montpellier Blue Gene Team
IBM Scheduler for High Throughput Computing on IBM Blue Gene /P Table of Contents Introduction...3 Architecture...4 simple_sched daemon...4 startd daemon...4 End-user commands...4 Personal HTC Scheduler...6
More informationScalable Debugging with TotalView on Blue Gene. John DelSignore, CTO TotalView Technologies
Scalable Debugging with TotalView on Blue Gene John DelSignore, CTO TotalView Technologies Agenda TotalView on Blue Gene A little history Current status Recent TotalView improvements ReplayEngine (reverse
More informationRunning applications on the Cray XC30
Running applications on the Cray XC30 Running on compute nodes By default, users do not access compute nodes directly. Instead they launch jobs on compute nodes using one of three available modes: 1. Extreme
More informationBlueGene/L. Computer Science, University of Warwick. Source: IBM
BlueGene/L Source: IBM 1 BlueGene/L networking BlueGene system employs various network types. Central is the torus interconnection network: 3D torus with wrap-around. Each node connects to six neighbours
More informationCarlo Cavazzoni, HPC department, CINECA
Introduction to Shared memory architectures Carlo Cavazzoni, HPC department, CINECA Modern Parallel Architectures Two basic architectural scheme: Distributed Memory Shared Memory Now most computers have
More informationIBM PSSC Montpellier Customer Center. Content. Programming Models Shared Memory Message Passing IBM Corporation