Altix UV HW/SW
1 Altix UV HW/SW
SGI Altix UV uses an array of advanced hardware and software features to offload thread synchronization, data sharing, and message-passing overhead from the CPUs.
The system has a rich set of hardware features that enable scalable programming models to be implemented with high efficiency and performance.

SGI MPI
The SGI MPI software stack includes a number of software components.
2 SGI MPI Software Stack
- MPI
- XPMEM (cross-process memory mapping)
- GRU development kit
- NUMA tools
- PerfBoost
- PerfCatcher
- MPInside
3 UV Hub
The UV hub is a custom ASIC developed by SGI. It implements the NUMAlink 5 protocol, memory operations, and associated atomic operations. It provides the following capabilities:
- Cache-coherent global shared memory.
- Offloading of time-sensitive and data-intensive operations from the processors, to increase processing efficiency and scaling.
- A scalable, reliable, fair interconnect to other blades via NUMAlink 5.
4 Altix UV Blade and Hub
Source: SGI Altix UV 1000 System User's Guide
5 UV Hub in Detail
SI (socket interface): provides the bridge between the hub's LH and RH chip sets and the Intel sockets.
To communicate with the Intel sockets, the SI implements an Intel proprietary interconnect called CSI (Common System Interface).
Source: SGI Altix UV admin manual
6 UV Hub in Detail
LH (local home):
- Manages directory operations associated with remote memory requests. The LH has a single external memory channel.
RH (remote home):
- Processes coherent and non-coherent CSI transactions that are initiated by a local socket to a remote system address.
- Processes NUMAlink intervention and invalidate requests when remote memory is locally cached by a socket.
LB (local block):
- Provides system software the ability to select, configure, and control various functions of the UV hub chip.
- Provides facilities to monitor, diagnose, and debug hardware state and operations on live systems.
7 UV Hub Units
- The NUMAlink interconnect
- The global reference unit (GRU)
- The processor interconnect
8 NUMAlink Interconnect
- Shared-memory, globally addressable system interconnect.
- All physically distributed system memory is mapped into one global address space.
- Peak aggregate bidirectional bandwidth of 15 GB/s.
- 2-3x MPI latency improvement.
- Special support for block transfers and global operations.
- NUMAlink connects into the memory infrastructure of the system, rather than indirectly through an I/O subsystem chip.
9 Fetch-Op in the Hub
- Fetch-op variables on the hub provide fast synchronization.
- The fetch-op AMO helped reduce MPI send/recv latency from 12 to 8 microseconds.
- Used by MPI_Barrier, MPI_Win_fence, and shmem_barrier_all.
(Diagram: CPUs connected through the hub, which holds the fetch-op variable, to the router.)
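A fetch-op variable is essentially a fetch-and-add counter that arriving threads atomically increment and then spin on. The following is a toy software sketch of the same idea in Python; the class name is invented and a lock stands in for the hardware AMO, so this illustrates the concept rather than SGI's implementation:

```python
import threading

class FetchOpBarrier:
    """Toy sense-reversing barrier built on a fetch-and-add counter.
    On Altix UV the counter is a hardware fetch-op variable in the hub;
    here a condition variable's lock stands in for the atomic operation."""

    def __init__(self, nthreads):
        self.nthreads = nthreads
        self.count = 0
        self.sense = False
        self.cv = threading.Condition()

    def wait(self):
        with self.cv:
            my_sense = not self.sense
            self.count += 1                     # the "fetch-and-increment"
            if self.count == self.nthreads:     # last arrival releases everyone
                self.count = 0
                self.sense = my_sense
                self.cv.notify_all()
            else:
                self.cv.wait_for(lambda: self.sense == my_sense)
```

The hardware version wins because the increment and the spin happen at the hub, not in each CPU's cache hierarchy.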
10 GRU
- Hardware in the hub for memory-to-memory block transfers and CPU synchronization events.
- Used by MPI, SHMEM, and UPC.
- External TLB with large-page support.
- Page initialization.
- Scatter/gather operations.
- Update cache for AMOs.
11 GRU API Components
- GRU resource allocators
- GRU memory access functions
- XPMEM address mapping functions
- MPT address mapping functions
12 MOE
MOE (MPI Offload Engine) is a set of functionality that offloads MPI communication work from the CPUs to the Altix UV hub ASIC, accelerating common MPI tasks such as barriers and reductions across GSM (global shared memory).
- Similar in concept to a TCP/IP offload engine (TOE), which offloads TCP/IP protocol processing from the system CPUs.
- Frees the CPUs from MPI activity.
- Faster reduction operations.
- Faster barriers and random access.
13 Accessing the MOE
MPI and MOE:
- MOE implements atomic memory operations in conjunction with a hardware multicast facility, which helps accelerate MPI_Barrier, MPI_Bcast, and MPI_Allreduce.
- Accelerates MPI point-to-point and collective communication.
14 MOE Advantages
MOE provides:
- MPI message queues
- Synchronization primitives
- Advanced RDMA capabilities such as strided and indexed global memory updates
- Hardware multicast
15 Determining System Configuration
16 topology
topology displays general information about an SGI Altix system, with a focus on node information.
It includes node counts for blades, node IDs, NASIDs, memory per node, UV hubs, and partition numbers.
17 cpumap
cpumap displays the logical CPUs and shows the relationships between them.
Aspects displayed include hyperthreading, last-level cache sharing, and topology placement.
It gets its information from /proc/cpuinfo, /sys/devices/system, and /proc/sgi_uv/topology.
18 cpumap
19 nodeinfo
20 nodeinfo
- Hit: the page was allocated on the preferred node.
- Miss: the preferred node was full; the allocation occurred on this node for a process that preferred another node.
- Foreign: the preferred node (this node) was full; the allocation had to go somewhere else.
- Interleave: the allocation used the interleave policy (numactl -i).
- Local: the page was allocated on this node by a process running on this node.
- Remote: the page was allocated on this node by a process running on another node.
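The kernel exports per-node counters with these meanings in /sys/devices/system/node/node*/numastat, one "counter value" pair per line. A small parser for that layout (the sample values below are invented for illustration):

```python
def parse_numastat(text):
    """Parse one node's numastat contents ("counter value" per line,
    e.g. "numa_hit 1057126") into a dict of integer counters."""
    stats = {}
    for line in text.strip().splitlines():
        name, value = line.split()
        stats[name] = int(value)
    return stats

# Invented sample in the kernel's numastat format.
sample = """\
numa_hit 1057126
numa_miss 4812
numa_foreign 391
interleave_hit 2048
local_node 1056900
other_node 5038
"""
counters = parse_numastat(sample)
```

On a live system one would read the real file, e.g. `open("/sys/devices/system/node/node0/numastat").read()`, and compare numa_miss across nodes to spot placement problems.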
21 x86info
22 pmchart
(figure placeholder)
23 pmchart
(figure placeholder)
24 HW Summary
/proc/cpuinfo
/proc/meminfo
/sys/devices/system/node
/dev/cpuset/torque/job#
25 Data Placement Tool
26 CPU Scheduling
- On a single processor, only one process can run at a time.
- CPU scheduling controls how the OS switches access to the CPU between processes.
- The kernel provides a mechanism called time slicing.
- A time slice is the maximum length of time that a process owns its CPU resource and executes at its current policy.
- Each CPU has its own run queue.
27 Cache Affinity
Affinity scheduling is a special scheduling discipline used in multiprocessor systems.
As a process executes, it causes more and more data and instruction text to be loaded into the processor cache. This creates an affinity between the process and the CPU.
28 Data Placement Tool
- NUMA machines have a shared address space: a single shared memory space and a single operating system instance.
- There is a performance penalty for accessing remote memory versus local memory.
- Access times to memory vary over physical address ranges and between processing elements. NUMAlink is used to access memory between blades/nodes.
- Memory latency is lowest when a processor accesses its local memory.
- The NUMA tools also help run multiple instances of a serial program in a single job script with better process placement.
29 NUMA API
The API is called from libcpuset.
- cpuset: create, modify, and destroy cpusets.
- taskset: run a process on specific physical CPUs.
- numactl: control the NUMA policy for processes or shared memory.
- dplace: bind processes to specific logical CPUs.
- omplace: control the placement of MPI processes and OpenMP threads.
- Batch systems: LSF, PBS Pro, Torque, SGE.
- Also: dlook, dlook-summary, pidstat, cpuset-q.
30 cpuset
- cpuset works together with sched_setaffinity for CPU binding and with the kernel's memory binding policies.
- Each task has a link to a cpuset structure that specifies the CPUs and memory nodes available for its use.
- All tasks sharing the same placement constraints reference the same cpuset.
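The CPU-binding side of this is the same sched_getaffinity/sched_setaffinity kernel interface that a program can call directly; on Linux, Python exposes it as os.sched_getaffinity and os.sched_setaffinity. A minimal Linux-only sketch:

```python
import os

# Ask which logical CPUs the calling process (pid 0 = self) may run on,
# restrict it to one of them, verify, then restore the original mask.
# This is the same mechanism cpuset and taskset use underneath.
allowed = os.sched_getaffinity(0)        # set of logical CPU numbers
one_cpu = {min(allowed)}
os.sched_setaffinity(0, one_cpu)         # now restricted to a single CPU
assert os.sched_getaffinity(0) == one_cpu
os.sched_setaffinity(0, allowed)         # undo the restriction
```

A cpuset does the same thing system-wide and hierarchically, instead of per call site.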
31 Why Use a cpuset?
- Restrict consumption of designated CPU resources to specified processes/threads.
- Limit run-time variability.
- Memory affinity.
- Isolate I/O.
32 How Are cpusets Used?
Static cpusets (batch calls shared by a queue):
- Cpusets are defined by the administrator after system startup.
- Users attach processes to the existing cpusets.
- Cpusets continue to exist after jobs finish executing.
Dynamic cpusets:
- The workload management system (WMS) creates a cpuset when it is required by a job.
- The WMS attaches the job to the newly created cpuset.
- The WMS destroys the cpuset at the end of the job.
33 cpuset Command Line Options
cpuset
  -c cpuset_name       create cpuset
  -m cpuset_name       modify cpuset
  -x cpuset_name       destroy cpuset
  -d cpuset_name       dump cpuset attributes
  -i csname -I script  run command
  -p cpuset_name       list all processes in the cpuset
  -a cpuset_name       attach PIDs to the cpuset
  -w pid               list the cpuset the PID is attached to
  -f filename          input config file
34 Advantages of cpusets
- Improve cache locality and memory access times.
- Facilitate providing equal resources to each thread in a job.
- Result in both optimal and repeatable performance.
35 taskset
- taskset restricts execution to the listed set of CPUs; processes are still free to move among the listed CPUs.
- It is used to set or retrieve the CPU affinity of a running process given its PID, or to launch a new command with a given CPU affinity.
- The CPU affinity is represented as a hexadecimal bitmask, with the lowest-order bit corresponding to processor number 0 (so mask 0x1 is processor 0).
36 taskset
taskset does not pin a task to a specific CPU; it only restricts the task so that it does not run on any CPU that is not in the cpulist. If you are running an MPI application, do not use the taskset command; use dplace instead:

mpirun -np 8 dplace -s1 -c10,11,16-21 ./a.out

or

export MPI_DSM_CPULIST=10,11,16-21
mpirun -np 8 ./a.out
37 taskset examples
taskset 0x1 ./a.out      # execute a.out on physical CPU 0
taskset 0x00131 ./a.out  # execute a.out on physical CPUs 0, 4, 5, and 8
taskset -p 0xa8 <pid>    # set the affinity of the given PID to physical CPUs 3, 5, and 7
taskset -c 5 ./a.out     # execute a.out on physical CPU 5 (CPU-list form)
taskset -p 14386         # return the affinity mask of PID 14386
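The masks follow one rule (bit n set means CPU n is allowed), so converting between taskset's hexadecimal mask form and its -c list form is mechanical. Two small helper functions (the names are invented for illustration):

```python
def cpus_to_mask(cpus):
    """CPU list -> hex affinity mask; bit 0 (lowest order) is CPU 0,
    exactly as taskset interprets it."""
    mask = 0
    for c in cpus:
        mask |= 1 << c
    return hex(mask)

def mask_to_cpus(mask_hex):
    """Hex affinity mask -> sorted list of allowed CPU numbers."""
    mask = int(mask_hex, 16)
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]
```

For example, `cpus_to_mask([3, 5, 7])` yields '0xa8', and `mask_to_cpus('0x131')` yields [0, 4, 5, 8].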
38 numactl
Runs processes with a specific NUMA scheduling or memory placement policy.
Memory placement controls:
- Interleave across nodes (round robin)
- Membind (allocate only from a specified node pool)
- Preferred node
- Local allocation (first touch)
39 numactl Command Line Options
numactl
  --interleave   set a memory interleave policy
  --membind      only allocate memory from the given nodes
  --cpunodebind  only execute the command on the CPUs of the given nodes
  --physcpubind  only execute the process on the given CPUs
40 numactl examples
numactl --physcpubind=+0-4,8-12 myapplic arguments
# run myapplic on CPUs 0-4 and 8-12 of the current cpuset
numactl --interleave=all bigdatabase arguments
# run the big database with its memory interleaved over all nodes
numactl --cpubind=0 --membind=0,1 process
# run process on node 0 with memory allocated on nodes 0 and 1
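The placement policies differ in where page number i ends up. A toy model of interleave versus preferred allocation, purely to illustrate the policies rather than numactl's internals (function names are invented):

```python
def interleave_pages(npages, nodes):
    """--interleave: page i is taken round-robin from nodes[i % len(nodes)],
    spreading bandwidth (and latency) evenly across the node set."""
    return [nodes[i % len(nodes)] for i in range(npages)]

def preferred_pages(npages, preferred, free_on_preferred, fallback):
    """--preferred: allocate from the preferred node while it has free
    pages, then fall back to another node (a strict --membind would
    fail the allocation instead of falling back)."""
    return [preferred if i < free_on_preferred else fallback
            for i in range(npages)]
```

So a 5-page allocation interleaved over nodes [0, 1] lands as [0, 1, 0, 1, 0], whereas a preferred-node allocation stays on one node until it fills.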
41 numactl --hardware
42 dplace
- dplace ensures the Linux kernel pins a thread (or series of threads) to a specific CPU core within a container. Once pinned, they do not migrate.
- By default, it binds processes sequentially, in a round-robin fashion, to the logical CPUs in the current cpuset.
- Integrates with MPT (via omplace and environment variables).
- It understands fork, exec, threads, etc.
- Helps ensure optimal performance and minimize runtime variability.
43 dplace Features
- The default memory allocation policy is node-local (first touch).
- dplace allows processes to be bound to specific logical CPUs (within the cpuset).
- Prevents migration (thread hopping).
- May require knowledge of the application.
- Global load balancing.
44 dplace Command Line Options
dplace
  -c cpu_list  CPUs to place on
  -e           exact placement
  -s n         skip n CPUs before starting placement
  -n name      only place processes with this name
  -x mask      skip mask
  -p file      placement file
  -r           replicate shared text to each node
  -q           list the global count
45 dplace examples
dplace -c 0-3 ./a.out
# place threads on the first four CPUs, beginning with core 0
dplace -c 0-7 -x2 ./a.out
# place threads on the first 8 CPUs, but use the skip mask (-x2) to skip the
# second thread (which, in the case of Intel OpenMP, is the lightweight monitor thread)
mpirun -np 8 dplace -s1 -c 0-7 ./a.out
# skip the first process, as it is essentially the MPI shepherd; dplace
# handles the placement of the other 7 MPI ranks
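The -s and -x options act as a filter over which processes get placed; the remaining processes are bound sequentially to the CPU list. A sketch of the resulting placement as a simplified model (not dplace's actual implementation):

```python
def dplace_model(cpus, nprocs, skip=0, skip_mask=0):
    """Return {process index: cpu or None}.  The first `skip` processes
    (modeling -s) and any process whose bit is set in `skip_mask`
    (modeling -x) are left unplaced; the rest are bound sequentially,
    round-robin, to `cpus`."""
    placement, nxt = {}, 0
    for i in range(nprocs):
        if i < skip or (skip_mask >> i) & 1:
            placement[i] = None          # left to the kernel scheduler
        else:
            placement[i] = cpus[nxt % len(cpus)]
            nxt += 1
    return placement
```

Under this model, `dplace_model(list(range(8)), 8, skip=1)` leaves process 0 (the shepherd) unplaced and binds ranks 1-7 to CPUs 0-6, matching the mpirun example's intent.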
46 numactl and dplace
Consider a code that runs with 4 threads. What is the difference between

numactl --physcpubind=0-3 a.out
dplace -c 0-3 a.out

With dplace, each thread is bound to a particular CPU. With numactl, the threads are bound to the range of CPUs 0-3 and are free to migrate within that range. numactl, on the other hand, does have memory binding options.
47 omplace
Tool for controlling the placement of MPI processes and OpenMP threads.
  -c cpulist   specifies the effective CPU list
  -nt threads  specifies the number of threads per MPI process
  -s skip      the number of processes to skip before placement starts
  -vv          verbose: the automatically generated placement file is displayed in its entirety
48 omplace examples
mpirun -np 2 omplace -nt 4 -vv ./a.out
# run 2 MPI processes with 4 threads per process, and display the
# generated placement file
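The effective hybrid layout gives each MPI rank a contiguous block of nt CPUs, after any skipped CPUs. A sketch of how such a placement could be computed (illustrative model, not omplace's actual algorithm):

```python
def omplace_model(cpulist, nranks, nthreads, skip=0):
    """Return {rank: CPUs for its threads}: after skipping `skip` CPUs
    (modeling -s), rank r's `nthreads` threads (modeling -nt) occupy
    the next `nthreads` consecutive entries of `cpulist`."""
    return {r: cpulist[skip + r * nthreads: skip + (r + 1) * nthreads]
            for r in range(nranks)}
```

For the example above, 2 ranks with 4 threads each on CPUs 0-7 gives rank 0 CPUs 0-3 and rank 1 CPUs 4-7, keeping each rank's threads on neighboring cores.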
49 dlook
- Tool for showing process memory maps and CPU usage.
- View address space and page placement.
- Two forms:
  dlook [options] pid
  dlook [options] <command> [command-args]
- Run an MPI job using mpirun and print the memory map for each thread:
  mpirun -np 8 dlook a.out
50 Summary
- Use cpumap to determine partitioning and placement.
- Use taskset to lock a process or process group onto a CPU or group of CPUs.
- Use dplace to place a process group within the system topology.
- Run an MPI/OpenMP hybrid and use omplace for pinning.
- Use numactl to control memory placement.
51 Tips
- Use dplace, numactl, or cpuset to lock down processes, preventing thread hopping/migration.
- Strong cache affinity reduces cache misses and instruction pipeline flushes.
- Keep processes close to their node-local memory.
- Be aware of data placement.
52 Heisenberg Principle
- Looking at the system will impact the system: you cannot measure a system without affecting it. top, for example, shows up in its own display.
- Tracing tools have the highest impact: strace, gprof, ...
- PCP and sar have the lowest impact; PCP uses less than 1% of a CPU.
53 sar
- sar indicates normal/abnormal behavior of the system, and can point to performance problems and bottlenecks.
- Many people look at sar as a set of performance metrics, but it is not: it is an indicator of what a system is doing.
- PCP and sar simply tell you what to look for.
54 sar
- sar -vq to check kernel table sizes.
- sar -W to check swapping activity.
- sar -rsw to see what memory and swap are left.
- sar -u to report the amount of time spent executing kernel code.
55 top, ps, pstree
- top provides a dynamic real-time view of a running system; top with H provides thread information.
- ps reports a snapshot of the processes currently running on the system. Use it with grep <username> to get user-specific information.
- pstree displays a tree of processes.
56 vmstat, mpstat
- vmstat reports information about processes, memory, paging, block I/O, traps, and CPU activity.
- mpstat writes to standard output the activities of each available processor, processor 0 being the first one. Global average activity across all processors is also reported.
57 mpvis
mpvis displays a three-dimensional bar chart of CPU utilization. The display is updated with new values retrieved from the target host or archive every interval seconds.
58 pidstat
pidstat is used for monitoring individual tasks currently being managed by the Linux kernel.
  -r  report page faults and memory utilization
  -d  report I/O statistics
  -u  report CPU utilization
  -p  select tasks for which statistics are to be reported
  -t  display statistics for threads associated with the selected tasks

pidstat -t -p 14374
59 cpuset-q
Gives information on the allocated CPUs, node, IPD, WCHAN, command name, etc.
60 dlook
- Tool for showing process memory maps and CPU usage.
- Two forms:
  dlook [options] pid
  dlook [options] <command> [command-args]
- Run an MPI job using mpirun and print the memory map for each thread:
  mpirun -np 8 dlook a.out
61 References
- UV System Analysis Manual
- UV System Administration Manual
- Technical Advances in the SGI Altix UV Architecture (white paper)
- A Hardware-Accelerated MPI Implementation on SGI Altix UV Systems (white paper)
- Linux Application Tuning Guide for SGI X86_64 Based Systems
- SGI Message Passing Toolkit (MPT) User's Guide
- SGI NUMAlink white paper
More informationScheduling II. Today. Next Time. ! Proportional-share scheduling! Multilevel-feedback queue! Multiprocessor scheduling. !
Scheduling II Today! Proportional-share scheduling! Multilevel-feedback queue! Multiprocessor scheduling Next Time! Memory management Scheduling with multiple goals! What if you want both good turnaround
More informationBarbara Chapman, Gabriele Jost, Ruud van der Pas
Using OpenMP Portable Shared Memory Parallel Programming Barbara Chapman, Gabriele Jost, Ruud van der Pas The MIT Press Cambridge, Massachusetts London, England c 2008 Massachusetts Institute of Technology
More informationCS 333 Introduction to Operating Systems. Class 3 Threads & Concurrency. Jonathan Walpole Computer Science Portland State University
CS 333 Introduction to Operating Systems Class 3 Threads & Concurrency Jonathan Walpole Computer Science Portland State University 1 The Process Concept 2 The Process Concept Process a program in execution
More informationComputing architectures Part 2 TMA4280 Introduction to Supercomputing
Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:
More informationA NUMA API for LINUX*
A NUMA API for LINUX* Technical Linux Whitepaper w w w. n o v e l l. c o m April 2005 2 Disclaimer Trademarks Copyright Novell, Inc.makes no representations or warranties with respect to the contents or
More informationSHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008
SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008 What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem
More informationLecture Topics. Announcements. Today: Advanced Scheduling (Stallings, chapter ) Next: Deadlock (Stallings, chapter
Lecture Topics Today: Advanced Scheduling (Stallings, chapter 10.1-10.4) Next: Deadlock (Stallings, chapter 6.1-6.6) 1 Announcements Exam #2 returned today Self-Study Exercise #10 Project #8 (due 11/16)
More informationParallel Numerical Algorithms
Parallel Numerical Algorithms http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna16/ [ 9 ] Shared Memory Performance Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1. Architecture
More informationIBM Cell Processor. Gilbert Hendry Mark Kretschmann
IBM Cell Processor Gilbert Hendry Mark Kretschmann Architectural components Architectural security Programming Models Compiler Applications Performance Power and Cost Conclusion Outline Cell Architecture:
More informationCluster Network Products
Cluster Network Products Cluster interconnects include, among others: Gigabit Ethernet Myrinet Quadrics InfiniBand 1 Interconnects in Top500 list 11/2009 2 Interconnects in Top500 list 11/2008 3 Cluster
More informationDmitry Durnov 15 February 2017
Cовременные тенденции разработки высокопроизводительных приложений Dmitry Durnov 15 February 2017 Agenda Modern cluster architecture Node level Cluster level Programming models Tools 2/20/2017 2 Modern
More informationLinux Network Tuning Guide for AMD EPYC Processor Based Servers
Linux Network Tuning Guide for AMD EPYC Processor Application Note Publication # 56224 Revision: 1.10 Issue Date: May 2018 Advanced Micro Devices 2018 Advanced Micro Devices, Inc. All rights reserved.
More informationTowards NUMA Support with Distance Information
Towards NUMA Support with Distance Information Dirk Schmidl, Christian Terboven, Dieter an Mey {schmidl terboven anmey}@rz.rwth-aachen.de Rechen- und Kommunikationszentrum (RZ) Agenda Topology of modern
More informationKoronis Performance Tuning 2. By Brent Swartz December 1, 2011
Koronis Performance Tuning 2 By Brent Swartz December 1, 2011 Application Tuning Methodology http://docs.sgi.com/library/tpl/cgi-bin/browse.cgi? coll=linux&db=bks&cmd=toc&pth=/sgi_develop er/lx_86_apptune
More informationMunara Tolubaeva Technical Consulting Engineer. 3D XPoint is a trademark of Intel Corporation in the U.S. and/or other countries.
Munara Tolubaeva Technical Consulting Engineer 3D XPoint is a trademark of Intel Corporation in the U.S. and/or other countries. notices and disclaimers Intel technologies features and benefits depend
More informationChapter 5. Multiprocessors and Thread-Level Parallelism
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model
More informationIntroduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano
Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed
More informationPerformance Analysis of Parallel Scientific Applications In Eclipse
Performance Analysis of Parallel Scientific Applications In Eclipse EclipseCon 2015 Wyatt Spear, University of Oregon wspear@cs.uoregon.edu Supercomputing Big systems solving big problems Performance gains
More informationAdvanced Memory Management
Advanced Memory Management Main Points Applications of memory management What can we do with ability to trap on memory references to individual pages? File systems and persistent storage Goals Abstractions
More informationMULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationBei Wang, Dmitry Prohorov and Carlos Rosales
Bei Wang, Dmitry Prohorov and Carlos Rosales Aspects of Application Performance What are the Aspects of Performance Intel Hardware Features Omni-Path Architecture MCDRAM 3D XPoint Many-core Xeon Phi AVX-512
More informationMULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationReduces latency and buffer overhead. Messaging occurs at a speed close to the processors being directly connected. Less error detection
Switching Operational modes: Store-and-forward: Each switch receives an entire packet before it forwards it onto the next switch - useful in a general purpose network (I.e. a LAN). usually, there is a
More informationChapter 4: Multithreaded
Chapter 4: Multithreaded Programming Chapter 4: Multithreaded Programming Overview Multithreading Models Thread Libraries Threading Issues Operating-System Examples 2009/10/19 2 4.1 Overview A thread is
More informationChapter 8: Virtual Memory. Operating System Concepts
Chapter 8: Virtual Memory Silberschatz, Galvin and Gagne 2009 Chapter 8: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating
More informationXen and the Art of Virtualization. CSE-291 (Cloud Computing) Fall 2016
Xen and the Art of Virtualization CSE-291 (Cloud Computing) Fall 2016 Why Virtualization? Share resources among many uses Allow heterogeneity in environments Allow differences in host and guest Provide
More informationDesigning Next-Generation Data- Centers with Advanced Communication Protocols and Systems Services. Presented by: Jitong Chen
Designing Next-Generation Data- Centers with Advanced Communication Protocols and Systems Services Presented by: Jitong Chen Outline Architecture of Web-based Data Center Three-Stage framework to benefit
More informationMultiprocessor Support
CSC 256/456: Operating Systems Multiprocessor Support John Criswell University of Rochester 1 Outline Multiprocessor hardware Types of multi-processor workloads Operating system issues Where to run the
More informationEARLY EVALUATION OF THE CRAY XC40 SYSTEM THETA
EARLY EVALUATION OF THE CRAY XC40 SYSTEM THETA SUDHEER CHUNDURI, SCOTT PARKER, KEVIN HARMS, VITALI MOROZOV, CHRIS KNIGHT, KALYAN KUMARAN Performance Engineering Group Argonne Leadership Computing Facility
More informationHow to Boost the Performance of Your MPI and PGAS Applications with MVAPICH2 Libraries
How to Boost the Performance of Your MPI and PGAS s with MVAPICH2 Libraries A Tutorial at the MVAPICH User Group (MUG) Meeting 18 by The MVAPICH Team The Ohio State University E-mail: panda@cse.ohio-state.edu
More informationOPERATING SYSTEM. Chapter 9: Virtual Memory
OPERATING SYSTEM Chapter 9: Virtual Memory Chapter 9: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating Kernel Memory
More informationWhy you should care about hardware locality and how.
Why you should care about hardware locality and how. Brice Goglin TADaaM team Inria Bordeaux Sud-Ouest Agenda Quick example as an introduction Bind your processes What's the actual problem? Convenient
More informationM7: Next Generation SPARC. Hotchips 26 August 12, Stephen Phillips Senior Director, SPARC Architecture Oracle
M7: Next Generation SPARC Hotchips 26 August 12, 2014 Stephen Phillips Senior Director, SPARC Architecture Oracle Safe Harbor Statement The following is intended to outline our general product direction.
More informationCHAPTER 2: PROCESS MANAGEMENT
1 CHAPTER 2: PROCESS MANAGEMENT Slides by: Ms. Shree Jaswal TOPICS TO BE COVERED Process description: Process, Process States, Process Control Block (PCB), Threads, Thread management. Process Scheduling:
More informationHigh Performance Computing: Tools and Applications
High Performance Computing: Tools and Applications Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology Lecture 9 SIMD vectorization using #pragma omp simd force
More informationNative Computing and Optimization. Hang Liu December 4 th, 2013
Native Computing and Optimization Hang Liu December 4 th, 2013 Overview Why run native? What is a native application? Building a native application Running a native application Setting affinity and pinning
More information