Content. Execution Modes SMP, DUAL, VN Partition Types HPC VS HTC HTC Compute Node Linux (CNL) IBM PSSC Montpellier Customer Center


Tracy Henderson
6 years ago

1 Content
IBM PSSC Montpellier Customer Center
Execution Modes: SMP, DUAL, VN
Partition Types: HPC vs HTC
HTC Compute Node Linux (CNL)

2 Execution Modes
Possibilities
  Single Node / Multi Node
  1, 2 or 4 Processes per Node
  1, 2 or 4 Threads per Process
Notation (default: 1 thread per core)
  Virtual Mode (VN): 4 MPI Processes per Node
  Dual Mode (DUAL): 2 MPI Processes + 2 Threads per Process
  Shared Memory (SMP): 1 MPI Process + 4 Threads per Process
Limitation
  One user process or thread per core

3 Blue Gene/P Execution Modes
Quad Mode
  Previously called Virtual Node Mode
  All four cores run one MPI process each
  No threading
  Memory / MPI process = ¼ node memory
  MPI programming model
Dual Mode
  Two cores run one MPI process each
  Each process may spawn one thread on the core not used by the other process
  Memory / MPI process = ½ node memory
  Hybrid MPI/OpenMP programming model
SMP Mode
  One core runs one MPI process
  The process may spawn threads on each of the other cores
  Memory / MPI process = full node memory
  Hybrid MPI/OpenMP programming model
[Figure: for each of the three modes, the application's processes and threads (T) placed on Core 0 to Core 3 inside the node's memory address space]

4 Symmetrical Multi-Processing Mode (SMP)
SMP: 1 Process/node, 4 Threads/Process, 2 GB/Process
pthreads and OpenMP are supported
4 MB default stack for all new threads
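
The SMP layout above (one process, four threads, 4 MB stack per new thread) can be sketched with plain pthreads. This is a minimal illustration that runs on any POSIX system, not Blue Gene-specific code: the names `smp_sum`, `sum_slice` and `struct slice` are made up for the example, and the stack size is requested explicitly rather than relying on the platform default.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4                     /* SMP mode: 4 threads per process */
#define STACK_BYTES (4u * 1024u * 1024u)  /* 4 MB, the default quoted on the slide */

struct slice {
    const long *data;  /* first element of this thread's slice */
    size_t count;      /* number of elements in the slice */
    long sum;          /* partial result written by the thread */
};

/* Thread body: sum one contiguous slice of the input. */
static void *sum_slice(void *arg)
{
    struct slice *s = arg;
    s->sum = 0;
    for (size_t i = 0; i < s->count; i++)
        s->sum += s->data[i];
    return NULL;
}

/* Sum n values with NUM_THREADS threads in total: the calling thread
 * plus NUM_THREADS - 1 workers. Returns 0 on success, -1 on failure. */
int smp_sum(const long *data, size_t n, long *total)
{
    pthread_t workers[NUM_THREADS - 1];
    struct slice slices[NUM_THREADS];
    pthread_attr_t attr;
    size_t chunk = n / NUM_THREADS;
    int created = 0, rc = 0;

    assert(data != NULL && total != NULL);

    if (pthread_attr_init(&attr) != 0)
        return -1;
    if (pthread_attr_setstacksize(&attr, STACK_BYTES) != 0) {
        pthread_attr_destroy(&attr);
        return -1;
    }

    for (int t = 0; t < NUM_THREADS; t++) {
        slices[t].data = data + (size_t)t * chunk;
        /* The last slice also takes the remainder. */
        slices[t].count = (t == NUM_THREADS - 1) ? n - (size_t)t * chunk : chunk;
        slices[t].sum = 0;
    }

    /* Slices 1..3 go to worker threads with the explicit 4 MB stack. */
    for (int t = 1; t < NUM_THREADS; t++) {
        if (pthread_create(&workers[t - 1], &attr, sum_slice, &slices[t]) != 0) {
            rc = -1;
            break;
        }
        created++;
    }
    pthread_attr_destroy(&attr);

    /* Slice 0 is processed by the calling thread itself. */
    sum_slice(&slices[0]);

    for (int t = 0; t < created; t++)
        pthread_join(workers[t], NULL);

    if (rc != 0)
        return -1;

    *total = 0;
    for (int t = 0; t < NUM_THREADS; t++)
        *total += slices[t].sum;
    return 0;
}
```

Calling `smp_sum(values, count, &result)` from the single process on a node keeps one thread on each of the four cores, which matches the "one user process or thread per core" limitation.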

5 Dual Mode (DUAL)
DUAL: 2 Processes/node, 2 Threads/Process, 1 GB/Process
pthreads and OpenMP are supported
4 MB default stack for all new threads

6 Virtual Node Mode (VN)
VN: 4 Processes/node, 1 Thread/Process, 512 MB/process
512 MB/process versus Shared Memory support
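
The per-mode figures on slides 4 to 6 follow from splitting one node's four cores and its memory evenly among the processes. A minimal C sketch of that arithmetic; the struct, the function names and the node size supplied by the caller are illustrative assumptions, not part of any Blue Gene API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Per-node layout of one execution mode, as listed on the slides. */
struct bgp_mode {
    const char *name;      /* mode name: SMP, DUAL or VN */
    int procs_per_node;    /* MPI processes per node */
    int threads_per_proc;  /* threads per process */
};

static const struct bgp_mode BGP_MODES[] = {
    { "SMP",  1, 4 },
    { "DUAL", 2, 2 },
    { "VN",   4, 1 },
};

/* Look up a mode by name; returns NULL for an unknown name. */
const struct bgp_mode *bgp_find_mode(const char *name)
{
    for (size_t i = 0; i < sizeof BGP_MODES / sizeof BGP_MODES[0]; i++)
        if (strcmp(BGP_MODES[i].name, name) == 0)
            return &BGP_MODES[i];
    return NULL;
}

/* Memory available to each process in MB: node memory split evenly. */
int bgp_mem_per_proc_mb(const struct bgp_mode *m, int node_mem_mb)
{
    assert(m != NULL && node_mem_mb > 0);
    return node_mem_mb / m->procs_per_node;
}

/* Cores occupied when every process runs all of its threads. */
int bgp_cores_used(const struct bgp_mode *m)
{
    assert(m != NULL);
    return m->procs_per_node * m->threads_per_proc;
}
```

With a 2048 MB node, `bgp_mem_per_proc_mb` gives 2048, 1024 and 512 MB for SMP, DUAL and VN, and `bgp_cores_used` gives 4 for every mode.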

7 Multiple Threads Per Core
Possibility to have 1-3 threads per core
  New in V1R4M0
  Not exactly the same behavior as on Linux
Mainly useful for switching between programming models in phases (OpenMP / pthreads)
  The application puts one set of threads to sleep and wakes the other set of threads
A core does not automatically switch between threads on a timed basis
  Switches occur through a sched_yield() system call, signal delivery, or futex wakeup
Managed through the environment variable BG_APPTHREADDEPTH
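
The "one set of threads sleeps, the other set wakes" pattern can be sketched with a pthread condition variable, which on Linux is implemented on top of futex wait and wakeup. The sketch below uses one thread per phase and runs on an ordinary Linux system; whether the Compute Node Kernel behaves identically is an assumption, and the names `run_two_phases`, `phase_a` and `phase_b` are invented for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

struct phases {
    pthread_mutex_t lock;
    pthread_cond_t wake;
    int current;   /* 0: phase A may run, 1: phase B may run */
    char log[8];   /* order in which the phases executed */
    size_t len;
};

/* Append one tag to the log; the caller holds the lock. */
static void record(struct phases *p, char tag)
{
    if (p->len + 1 < sizeof p->log) {
        p->log[p->len++] = tag;
        p->log[p->len] = '\0';
    }
}

/* Phase A does its work, then wakes the sleeping phase-B thread. */
static void *phase_a(void *arg)
{
    struct phases *p = arg;
    pthread_mutex_lock(&p->lock);
    record(p, 'A');
    p->current = 1;
    pthread_cond_broadcast(&p->wake);
    pthread_mutex_unlock(&p->lock);
    return NULL;
}

/* Phase B sleeps until phase A hands over, then does its work. */
static void *phase_b(void *arg)
{
    struct phases *p = arg;
    pthread_mutex_lock(&p->lock);
    while (p->current != 1)
        pthread_cond_wait(&p->wake, &p->lock);
    record(p, 'B');
    pthread_mutex_unlock(&p->lock);
    return NULL;
}

/* Run both phases and copy the execution order into out.
 * Returns 0 on success, -1 if a thread could not be created. */
int run_two_phases(char *out, size_t out_size)
{
    struct phases p;
    pthread_t ta, tb;

    assert(out != NULL && out_size >= 3);
    pthread_mutex_init(&p.lock, NULL);
    pthread_cond_init(&p.wake, NULL);
    p.current = 0;
    p.len = 0;
    p.log[0] = '\0';

    /* Start B first so that it really has to wait for A. */
    if (pthread_create(&tb, NULL, phase_b, &p) != 0)
        return -1;
    if (pthread_create(&ta, NULL, phase_a, &p) != 0) {
        /* Release B so it can finish before we give up. */
        pthread_mutex_lock(&p.lock);
        p.current = 1;
        pthread_cond_broadcast(&p.wake);
        pthread_mutex_unlock(&p.lock);
        pthread_join(tb, NULL);
        return -1;
    }
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);

    memcpy(out, p.log, strlen(p.log) + 1);
    pthread_cond_destroy(&p.wake);
    pthread_mutex_destroy(&p.lock);
    return 0;
}
```

Because the hand-over is explicit, the order is always A then B regardless of which thread the scheduler starts first, which is the property the slide relies on when no timed switching takes place.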

8 Shared Memory Support
Shared memory is supported in Dual and Virtual Node Modes
The BG_SHAREDMEMPOOLSIZE environment variable specifies, in MB, the amount of memory to be allocated; it can be set through mpirun
Shared memory is allocated using the standard Linux methods shm_open() / mmap()
Allocation:
    fd = shm_open(SHM_FILE, O_RDWR, 0600);
    ftruncate(fd, MAX_SHARED_SIZE);
    shmptr1 = mmap(NULL, MAX_SHARED_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
Deallocation:
    munmap(shmptr1, MAX_SHARED_SIZE);
    close(fd);
    shm_unlink(SHM_FILE);
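
A complete version of the allocation and deallocation sequence, written as a minimal sketch for an ordinary Linux system. Two points are assumptions that differ from the slide: `O_CREAT` is added so that the object is created if it does not exist, and the second mapping stands in for a second process on the node. The function name `shm_roundtrip` and the object name chosen by the caller are illustrative; BG_SHAREDMEMPOOLSIZE plays no role outside Blue Gene.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a shared memory object, map it twice, write through one mapping
 * and read through the other. Returns 0 if the data is visible through
 * both mappings, -1 on any failure. */
int shm_roundtrip(const char *name, size_t size)
{
    const char msg[] = "hello";
    int ok = -1;

    assert(name != NULL);
    if (size < sizeof msg)
        return -1;

    /* Allocation: open (and create) the object, then set its size. */
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return -1;

    if (ftruncate(fd, (off_t)size) == 0) {
        char *writer = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        char *reader = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);

        if (writer != MAP_FAILED && reader != MAP_FAILED) {
            memcpy(writer, msg, sizeof msg);
            /* MAP_SHARED makes the write visible through the other mapping. */
            ok = (memcmp(reader, msg, sizeof msg) == 0) ? 0 : -1;
        }

        /* Deallocation: unmap whatever was mapped. */
        if (writer != MAP_FAILED)
            munmap(writer, size);
        if (reader != MAP_FAILED)
            munmap(reader, size);
    }

    close(fd);
    shm_unlink(name);
    return ok;
}
```

On older glibc versions the program may need to be linked with `-lrt` for `shm_open()` and `shm_unlink()`.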

9 High Throughput Computing (HTC) Mode
Many applications that run on Blue Gene today are embarrassingly (pleasantly) parallel
  They do not fully exploit the torus for MPI communication, since that is not needed for their problem
  They just want a very large number of small tasks, with a coordinator of results
High Throughput Computing Mode on Blue Gene
  Enables a new class of workloads that use many single-node jobs
  Leverages the low cost, low energy, and small footprint of a rack of 1,024 compute nodes
  Capacity machine ("cluster buster"): run 4,096 jobs on a single rack in virtual node mode (VN)
New HTC CNL mode with a full Linux kernel on each Compute Node (from BG/P driver V1R3)

10 HTC Value for "Pleasantly Parallel" Codes
Application resiliency
  A single node failure ends the entire application in the MPI model
  For HTC, only the job running on the failed node is ended, while other single-node jobs continue to run
  For long-running jobs that require many tasks, this can mean the difference between starting from scratch and simply proceeding on the remaining nodes
The front-end node has more memory, better performance, and more functionality than a single compute node
Code that runs on the compute nodes is much cleaner
  It only contains the work to be performed, and leaves the coordination to a script or scheduler
  This also eliminates the need to sacrifice one node as the master node
The coordinator functionality can be anything that runs on Linux
  Perl script, Python, compiled program
The coordinator can interact directly with a database
  Either to get the inputs for the application, or to store the results
  This can eliminate the need to create a flat-file input for the application, or to generate the results in an output file
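
The resiliency argument can be illustrated with a small coordinator that starts independent tasks and keeps collecting results when one of them fails. This is an analogy on an ordinary Linux host using `fork()` and `waitpid()`, with one child process standing in for one single-node job; it does not use the Blue Gene HTC job submission interface. `run_tasks`, `demo_task` and `MAX_TASKS` are hypothetical names introduced for the sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_TASKS 64

/* Hypothetical task: task 2 fails (standing in for a lost node),
 * all other tasks succeed. */
int demo_task(int id)
{
    return id == 2 ? 1 : 0;
}

/* Run ntasks independent tasks, one child process each.
 * status[i] receives the exit code of task i, or -1 if it could not be
 * started or did not exit normally. Returns the number of tasks that
 * succeeded, or -1 if ntasks is out of range. */
int run_tasks(int (*task)(int), int ntasks, int *status)
{
    pid_t pids[MAX_TASKS];
    int succeeded = 0;

    assert(task != NULL && status != NULL);
    if (ntasks < 0 || ntasks > MAX_TASKS)
        return -1;

    /* Start every task; a failure in one child cannot affect the others. */
    for (int i = 0; i < ntasks; i++) {
        pids[i] = fork();
        if (pids[i] == 0)
            _exit(task(i));  /* child: do the work and report via exit code */
    }

    /* Coordinator: collect each result independently. */
    for (int i = 0; i < ntasks; i++) {
        int st = 0;
        status[i] = -1;
        if (pids[i] > 0 && waitpid(pids[i], &st, 0) == pids[i] && WIFEXITED(st))
            status[i] = WEXITSTATUS(st);
        if (status[i] == 0)
            succeeded++;
    }
    return succeeded;
}
```

With four tasks and `demo_task`, three tasks complete and only task 2 needs to be rerun, whereas a single MPI job would have lost all four.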

11 High Performance VS High Throughput Modes
High Performance Computing (HPC) Mode
  Best for Capability Computing
  Parallel, tightly coupled applications
  Single Instruction, Multiple Data (SIMD) architecture
  Programming model: typically MPI
  Apps need a tremendous amount of computational power over a short time period
High Throughput Computing (HTC) Mode
  Best for Capacity Computing
  Large number of independent tasks
  Multiple Instruction, Multiple Data (MIMD) architecture
  Programming model: non-MPI
  Applications need a large amount of computational power over a long time period
  Traditionally run on large clusters
HTC and HPC modes co-exist on Blue Gene
  Determined when the resource pool (partition) is allocated

12 HTC Compute Node Linux (CNL)
Feature Description
  Brings full Linux Kernel functionality onto the Compute Node
  Substitutes for the minimal Compute Node Kernel
  Allows any serial Linux workload to be executed on Blue Gene/P
  In particular: brings support for scripted workloads (Shell, Perl)
Characteristics
  New feature introduced by Blue Gene driver V1R3
  Light Linux Kernel but with full compatibility
  Still limited memory footprint
  One single CNL / Compute Node
  Compute Node is seen as a regular Linux SMP system
  Number of Processes and/or Threads is under user control
  SSH session on the Compute Node becomes possible

IBM PSSC Montpellier Customer Center. Blue Gene/P ASIC IBM Corporation

Blue Gene/P ASIC Memory Overview/Considerations
No virtual paging: only the physical memory (2-4 GBytes/node)
In C, C++, and Fortran, the malloc routine returns a NULL pointer when users request more memory