Altix UV HW/SW
1 Altix UV HW/SW
SGI Altix UV uses an array of advanced hardware and software features to offload thread synchronization, data sharing, and message-passing overhead from the CPUs.
The system has a rich set of hardware features that enable scalable programming models to be implemented with high efficiency and performance.

SGI MPI
The SGI MPI software stack includes a number of software components.
2 SGI MPI Software Stack
- MPI
- XPMEM (cross-process memory mapping)
- GRU development kit
- NUMA tools
- PerfBoost
- PerfCatcher
- MPInside
3 UV Hub
The UV hub is a custom ASIC developed by SGI. It implements the NUMAlink 5 protocol, memory operations, and associated atomic operations. It provides the following capabilities:
- Cache-coherent global shared memory.
- Offloading of time-sensitive and data-intensive operations from the processors, to increase processing efficiency and scaling.
- A scalable, reliable, fair interconnect to other blades via NUMAlink 5.
4 Altix UV Blade and Hub
Source: SGI Altix UV 1000 System User's Guide
5 UV Hub in Detail
SI (socket interface): provides the bridge between the hub's LH and RH chip sets and the Intel sockets.
To communicate with the Intel sockets, the SI implements an Intel proprietary interconnect called CSI (Common System Interface).
Source: SGI Altix UV admin manual
6 UV Hub in Detail
LH (local home):
- Manages directory operations associated with remote memory requests. The LH has a single external memory channel.
RH (remote home):
- Processes coherent and non-coherent CSI transactions that are initiated by a local socket to a remote system address.
- Processes NUMAlink intervention and invalidate requests when remote memory is locally cached by a socket.
LB (local block):
- Provides system software the ability to select, configure, and control various functions of the UV hub chip.
- Provides facilities to monitor, diagnose, and debug hardware state and operations on live systems.
7 UV Hub Units
- The NUMAlink interconnect
- The global reference unit (GRU)
- The processor interconnect
8 NUMAlink Interconnect
- Shared-memory, globally addressable system interconnect.
- All physically distributed system memory is mapped into one global address space.
- Peak aggregate bidirectional bandwidth of 15 GB/s.
- 2-3x MPI latency improvement.
- Special support for block transfers and global operations.
- NUMAlink connects into the memory infrastructure of the system, rather than indirectly through an I/O subsystem chip.
9 Fetch-Op in the Hub
- Fetch-op variables on the hub provide fast synchronization.
- The fetch-op AMO helped reduce MPI send/recv latency from 12 to 8 microseconds.
- Used by MPI_Barrier, MPI_Win_fence, and shmem_barrier_all.
(Diagram: CPUs connected through the hub, which holds the fetch-op variable, to the router.)
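A fetch-op variable is essentially a fetch-and-add counter that arriving threads atomically increment and then spin on. The following is a toy software sketch of the same idea in Python; the class name is invented and a lock stands in for the hardware AMO, so this illustrates the concept rather than SGI's implementation:

```python
import threading

class FetchOpBarrier:
    """Toy sense-reversing barrier built on a fetch-and-add counter.
    On Altix UV the counter is a hardware fetch-op variable in the hub;
    here a condition variable's lock stands in for the atomic operation."""

    def __init__(self, nthreads):
        self.nthreads = nthreads
        self.count = 0
        self.sense = False
        self.cv = threading.Condition()

    def wait(self):
        with self.cv:
            my_sense = not self.sense
            self.count += 1                     # the "fetch-and-increment"
            if self.count == self.nthreads:     # last arrival releases everyone
                self.count = 0
                self.sense = my_sense
                self.cv.notify_all()
            else:
                self.cv.wait_for(lambda: self.sense == my_sense)
```

The hardware version wins because the increment and the spin happen at the hub, not in each CPU's cache hierarchy.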
10 GRU
- Hardware in the hub for memory-to-memory block transfers and CPU synchronization events.
- Used by MPI, SHMEM, and UPC.
- External TLB with large-page support.
- Page initialization.
- Scatter/gather operations.
- Update cache for AMOs.
11 GRU API Components
- GRU resource allocators
- GRU memory access functions
- XPMEM address mapping functions
- MPT address mapping functions
12 MOE
MOE (MPI Offload Engine) is a set of functionality that offloads MPI communication work from the CPUs to the Altix UV hub ASIC, accelerating common MPI tasks such as barriers and reductions across GSM (global shared memory).
- Similar in concept to a TCP/IP offload engine (TOE), which offloads TCP/IP protocol processing from the system CPUs.
- Frees the CPUs from MPI activity.
- Faster reduction operations.
- Faster barriers and random access.
13 Accessing the MOE
MPI and MOE:
- MOE implements atomic memory operations in conjunction with a hardware multicast facility, which helps accelerate MPI_Barrier, MPI_Bcast, and MPI_Allreduce.
- Accelerates MPI point-to-point and collective communication.
14 MOE Advantages
MOE provides:
- MPI message queues
- Synchronization primitives
- Advanced RDMA capabilities such as strided and indexed global memory updates
- Hardware multicast
15 Determining System Configuration
16 topology
topology displays general information about an SGI Altix system, with a focus on node information.
It includes node counts for blades, node IDs, NASIDs, memory per node, UV hubs, and partition numbers.
17 cpumap
cpumap displays the logical CPUs and shows the relationships between them.
Aspects displayed include hyperthreading, last-level cache sharing, and topology placement.
It gets its information from /proc/cpuinfo, /sys/devices/system, and /proc/sgi_uv/topology.
18 cpumap
19 nodeinfo
20 nodeinfo
- Hit: the page was allocated on the preferred node.
- Miss: the preferred node was full; the allocation occurred on this node for a process that preferred another node.
- Foreign: the preferred node (this node) was full; the allocation had to go somewhere else.
- Interleave: the allocation used the interleave policy (numactl -i).
- Local: the page was allocated on this node by a process running on this node.
- Remote: the page was allocated on this node by a process running on another node.
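The kernel exports per-node counters with these meanings in /sys/devices/system/node/node*/numastat, one "counter value" pair per line. A small parser for that layout (the sample values below are invented for illustration):

```python
def parse_numastat(text):
    """Parse one node's numastat contents ("counter value" per line,
    e.g. "numa_hit 1057126") into a dict of integer counters."""
    stats = {}
    for line in text.strip().splitlines():
        name, value = line.split()
        stats[name] = int(value)
    return stats

# Invented sample in the kernel's numastat format.
sample = """\
numa_hit 1057126
numa_miss 4812
numa_foreign 391
interleave_hit 2048
local_node 1056900
other_node 5038
"""
counters = parse_numastat(sample)
```

On a live system one would read the real file, e.g. `open("/sys/devices/system/node/node0/numastat").read()`, and compare numa_miss across nodes to spot placement problems.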
21 x86info
22 pmchart
(figure placeholder)
23 pmchart
(figure placeholder)
24 HW Summary
/proc/cpuinfo
/proc/meminfo
/sys/devices/system/node
/dev/cpuset/torque/job#
25 Data Placement Tool
26 CPU Scheduling
- On a single processor, only one process can run at a time.
- CPU scheduling controls how the OS switches access to the CPU between processes.
- The kernel provides a mechanism called time slicing.
- A time slice is the maximum length of time that a process owns its CPU resource and executes at its current policy.
- Each CPU has its own run queue.
27 Cache Affinity
Affinity scheduling is a special scheduling discipline used in multiprocessor systems.
As a process executes, it causes more and more data and instruction text to be loaded into the processor cache. This creates an affinity between the process and the CPU.
28 Data Placement Tool
- NUMA machines have a shared address space: a single shared memory space and a single operating system instance.
- There is a performance penalty for accessing remote memory versus local memory.
- Access times to memory vary over physical address ranges and between processing elements. NUMAlink is used to access memory between blades/nodes.
- Memory latency is lowest when a processor accesses its local memory.
- The NUMA tools also help run multiple instances of a serial program in a single job script with better process placement.
29 NUMA API
The API is called from libcpuset.
- cpuset: create, modify, and destroy cpusets.
- taskset: run a process on specific physical CPUs.
- numactl: control the NUMA policy for processes or shared memory.
- dplace: bind processes to specific logical CPUs.
- omplace: control the placement of MPI processes and OpenMP threads.
- Batch systems: LSF, PBS Pro, Torque, SGE.
- Also: dlook, dlook-summary, pidstat, cpuset-q.
30 cpuset
- cpuset works together with sched_setaffinity for CPU binding and with the kernel's memory binding policies.
- Each task has a link to a cpuset structure that specifies the CPUs and memory nodes available for its use.
- All tasks sharing the same placement constraints reference the same cpuset.
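The CPU-binding side of this is the same sched_getaffinity/sched_setaffinity kernel interface that a program can call directly; on Linux, Python exposes it as os.sched_getaffinity and os.sched_setaffinity. A minimal Linux-only sketch:

```python
import os

# Ask which logical CPUs the calling process (pid 0 = self) may run on,
# restrict it to one of them, verify, then restore the original mask.
# This is the same mechanism cpuset and taskset use underneath.
allowed = os.sched_getaffinity(0)        # set of logical CPU numbers
one_cpu = {min(allowed)}
os.sched_setaffinity(0, one_cpu)         # now restricted to a single CPU
assert os.sched_getaffinity(0) == one_cpu
os.sched_setaffinity(0, allowed)         # undo the restriction
```

A cpuset does the same thing system-wide and hierarchically, instead of per call site.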
31 Why Use a cpuset?
- Restrict consumption of designated CPU resources to specified processes/threads.
- Limit run-time variability.
- Memory affinity.
- Isolate I/O.
32 How Are cpusets Used?
Static cpusets (batch calls shared by a queue):
- Cpusets are defined by the administrator after system startup.
- Users attach processes to the existing cpusets.
- Cpusets continue to exist after jobs finish executing.
Dynamic cpusets:
- The workload management system (WMS) creates a cpuset when it is required by a job.
- The WMS attaches the job to the newly created cpuset.
- The WMS destroys the cpuset at the end of the job.
33 cpuset Command Line Options
cpuset
  -c cpuset_name       create cpuset
  -m cpuset_name       modify cpuset
  -x cpuset_name       destroy cpuset
  -d cpuset_name       dump cpuset attributes
  -i csname -I script  run command
  -p cpuset_name       list all processes in the cpuset
  -a cpuset_name       attach PIDs to the cpuset
  -w pid               list the cpuset the PID is attached to
  -f filename          input config file
34 Advantages of cpusets
- Improve cache locality and memory access times.
- Facilitate providing equal resources to each thread in a job.
- Result in both optimal and repeatable performance.
35 taskset
- taskset restricts execution to the listed set of CPUs; processes are still free to move among the listed CPUs.
- It is used to set or retrieve the CPU affinity of a running process given its PID, or to launch a new command with a given CPU affinity.
- The CPU affinity is represented as a hexadecimal bitmask, with the lowest-order bit corresponding to processor number 0 (so mask 0x1 is processor 0).
36 taskset
taskset does not pin a task to a specific CPU; it only restricts the task so that it does not run on any CPU that is not in the cpulist. If you are running an MPI application, do not use the taskset command; use dplace instead:

mpirun -np 8 dplace -s1 -c10,11,16-21 ./a.out

or

export MPI_DSM_CPULIST=10,11,16-21
mpirun -np 8 ./a.out
37 taskset examples
taskset 0x1 ./a.out      # execute a.out on physical CPU 0
taskset 0x00131 ./a.out  # execute a.out on physical CPUs 0, 4, 5, and 8
taskset -p 0xa8 <pid>    # set the affinity of the given PID to physical CPUs 3, 5, and 7
taskset -c 5 ./a.out     # execute a.out on physical CPU 5 (CPU-list form)
taskset -p 14386         # return the affinity mask of PID 14386
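The masks follow one rule (bit n set means CPU n is allowed), so converting between taskset's hexadecimal mask form and its -c list form is mechanical. Two small helper functions (the names are invented for illustration):

```python
def cpus_to_mask(cpus):
    """CPU list -> hex affinity mask; bit 0 (lowest order) is CPU 0,
    exactly as taskset interprets it."""
    mask = 0
    for c in cpus:
        mask |= 1 << c
    return hex(mask)

def mask_to_cpus(mask_hex):
    """Hex affinity mask -> sorted list of allowed CPU numbers."""
    mask = int(mask_hex, 16)
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]
```

For example, `cpus_to_mask([3, 5, 7])` yields '0xa8', and `mask_to_cpus('0x131')` yields [0, 4, 5, 8].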
38 numactl
Runs processes with a specific NUMA scheduling or memory placement policy.
Memory placement controls:
- Interleave across nodes (round robin)
- Membind (allocate only from a specified node pool)
- Preferred node
- Local allocation (first touch)
39 numactl Command Line Options
numactl
  --interleave   set a memory interleave policy
  --membind      only allocate memory from the given nodes
  --cpunodebind  only execute the command on the CPUs of the given nodes
  --physcpubind  only execute the process on the given CPUs
40 numactl examples
numactl --physcpubind=+0-4,8-12 myapplic arguments
# run myapplic on CPUs 0-4 and 8-12 of the current cpuset
numactl --interleave=all bigdatabase arguments
# run the big database with its memory interleaved over all nodes
numactl --cpubind=0 --membind=0,1 process
# run process on node 0 with memory allocated on nodes 0 and 1
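The placement policies differ in where page number i ends up. A toy model of interleave versus preferred allocation, purely to illustrate the policies rather than numactl's internals (function names are invented):

```python
def interleave_pages(npages, nodes):
    """--interleave: page i is taken round-robin from nodes[i % len(nodes)],
    spreading bandwidth (and latency) evenly across the node set."""
    return [nodes[i % len(nodes)] for i in range(npages)]

def preferred_pages(npages, preferred, free_on_preferred, fallback):
    """--preferred: allocate from the preferred node while it has free
    pages, then fall back to another node (a strict --membind would
    fail the allocation instead of falling back)."""
    return [preferred if i < free_on_preferred else fallback
            for i in range(npages)]
```

So a 5-page allocation interleaved over nodes [0, 1] lands as [0, 1, 0, 1, 0], whereas a preferred-node allocation stays on one node until it fills.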
41 numactl --hardware
42 dplace
- dplace ensures the Linux kernel pins a thread (or series of threads) to a specific CPU core within a container. Once pinned, they do not migrate.
- By default, it binds processes sequentially, in a round-robin fashion, to the logical CPUs in the current cpuset.
- Integrates with MPT (via omplace and environment variables).
- It understands fork, exec, threads, etc.
- Helps ensure optimal performance and minimize runtime variability.
43 dplace Features
- The default memory allocation policy is node-local (first touch).
- dplace allows processes to be bound to specific logical CPUs (within the cpuset).
- Prevents migration (thread hopping).
- May require knowledge of the application.
- Global load balancing.
44 dplace Command Line Options
dplace
  -c cpu_list  CPUs to place on
  -e           exact placement
  -s n         skip n CPUs before starting placement
  -n name      only place processes with this name
  -x mask      skip mask
  -p file      placement file
  -r           replicate shared text to each node
  -q           list the global count
45 dplace examples
dplace -c 0-3 ./a.out
# place threads on the first four CPUs, beginning with core 0
dplace -c 0-7 -x2 ./a.out
# place threads on the first 8 CPUs, but use the skip mask (-x2) to skip the
# second thread (which, in the case of Intel OpenMP, is the lightweight monitor thread)
mpirun -np 8 dplace -s1 -c 0-7 ./a.out
# skip the first process, as it is essentially the MPI shepherd; dplace
# handles the placement of the other 7 MPI ranks
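The -s and -x options act as a filter over which processes get placed; the remaining processes are bound sequentially to the CPU list. A sketch of the resulting placement as a simplified model (not dplace's actual implementation):

```python
def dplace_model(cpus, nprocs, skip=0, skip_mask=0):
    """Return {process index: cpu or None}.  The first `skip` processes
    (modeling -s) and any process whose bit is set in `skip_mask`
    (modeling -x) are left unplaced; the rest are bound sequentially,
    round-robin, to `cpus`."""
    placement, nxt = {}, 0
    for i in range(nprocs):
        if i < skip or (skip_mask >> i) & 1:
            placement[i] = None          # left to the kernel scheduler
        else:
            placement[i] = cpus[nxt % len(cpus)]
            nxt += 1
    return placement
```

Under this model, `dplace_model(list(range(8)), 8, skip=1)` leaves process 0 (the shepherd) unplaced and binds ranks 1-7 to CPUs 0-6, matching the mpirun example's intent.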
46 numactl and dplace
Consider a code that runs with 4 threads. What is the difference between

numactl --physcpubind=0-3 a.out
dplace -c 0-3 a.out

With dplace, each thread is bound to a particular CPU. With numactl, the threads are bound to the range of CPUs 0-3 and are free to migrate within that range. numactl, on the other hand, does have memory binding options.
47 omplace
Tool for controlling the placement of MPI processes and OpenMP threads.
  -c cpulist   specifies the effective CPU list
  -nt threads  specifies the number of threads per MPI process
  -s skip      the number of processes to skip before placement starts
  -vv          verbose: the automatically generated placement file is displayed in its entirety
48 omplace examples
mpirun -np 2 omplace -nt 4 -vv ./a.out
# run 2 MPI processes with 4 threads per process, and display the
# generated placement file
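The effective hybrid layout gives each MPI rank a contiguous block of nt CPUs, after any skipped CPUs. A sketch of how such a placement could be computed (illustrative model, not omplace's actual algorithm):

```python
def omplace_model(cpulist, nranks, nthreads, skip=0):
    """Return {rank: CPUs for its threads}: after skipping `skip` CPUs
    (modeling -s), rank r's `nthreads` threads (modeling -nt) occupy
    the next `nthreads` consecutive entries of `cpulist`."""
    return {r: cpulist[skip + r * nthreads: skip + (r + 1) * nthreads]
            for r in range(nranks)}
```

For the example above, 2 ranks with 4 threads each on CPUs 0-7 gives rank 0 CPUs 0-3 and rank 1 CPUs 4-7, keeping each rank's threads on neighboring cores.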
49 dlook
- Tool for showing process memory maps and CPU usage.
- View address space and page placement.
- Two forms:
  dlook [options] pid
  dlook [options] <command> [command-args]
- Run an MPI job using mpirun and print the memory map for each thread:
  mpirun -np 8 dlook a.out
50 Summary
- Use cpumap to determine partitioning and placement.
- Use taskset to lock a process or process group onto a CPU or group of CPUs.
- Use dplace to place a process group within the system topology.
- Run an MPI/OpenMP hybrid and use omplace for pinning.
- Use numactl to control memory placement.
51 Tips
- Use dplace, numactl, or cpuset to lock down processes, preventing thread hopping/migration.
- Strong cache affinity reduces cache misses and instruction pipeline flushes.
- Keep processes close to their node-local memory.
- Be aware of data placement.
52 Heisenberg Principle
- Looking at the system will impact the system: you cannot measure a system without affecting it. top, for example, shows up in its own display.
- Tracing tools have the highest impact: strace, gprof, ...
- PCP and sar have the lowest impact; PCP uses less than 1% of a CPU.
53 sar
- sar indicates normal/abnormal behavior of the system, and can point to performance problems and bottlenecks.
- Many people look at sar as a set of performance metrics, but it is not: it is an indicator of what a system is doing.
- PCP and sar simply tell you what to look for.
54 sar
- sar -vq to check kernel table sizes.
- sar -W to check swapping activity.
- sar -rsw to see what memory and swap are left.
- sar -u to report the amount of time spent executing kernel code.
55 top, ps, pstree
- top provides a dynamic real-time view of a running system; top with H provides thread information.
- ps reports a snapshot of the processes currently running on the system. Use it with grep <username> to get user-specific information.
- pstree displays a tree of processes.
56 vmstat, mpstat
- vmstat reports information about processes, memory, paging, block I/O, traps, and CPU activity.
- mpstat writes to standard output the activities of each available processor, processor 0 being the first one. Global average activity across all processors is also reported.
57 mpvis
mpvis displays a three-dimensional bar chart of CPU utilization. The display is updated with new values retrieved from the target host or archive every interval seconds.
58 pidstat
pidstat is used for monitoring individual tasks currently being managed by the Linux kernel.
  -r  report page faults and memory utilization
  -d  report I/O statistics
  -u  report CPU utilization
  -p  select tasks for which statistics are to be reported
  -t  display statistics for threads associated with the selected tasks

pidstat -t -p 14374
59 cpuset-q
Gives information on the allocated CPUs, node, IPD, WCHAN, command name, etc.
60 dlook
- Tool for showing process memory maps and CPU usage.
- Two forms:
  dlook [options] pid
  dlook [options] <command> [command-args]
- Run an MPI job using mpirun and print the memory map for each thread:
  mpirun -np 8 dlook a.out
61 References
- UV System Analysis Manual
- UV System Administration Manual
- Technical Advances in the SGI Altix UV Architecture (white paper)
- A Hardware-Accelerated MPI Implementation on SGI Altix UV Systems (white paper)
- Linux Application Tuning Guide for SGI X86_64 Based Systems
- SGI Message Passing Toolkit (MPT) User's Guide
- SGI NUMAlink white paper
More informationScheduling II. Today. Next Time. ! Proportional-share scheduling! Multilevel-feedback queue! Multiprocessor scheduling. !
Scheduling II Today! Proportional-share scheduling! Multilevel-feedback queue! Multiprocessor scheduling Next Time! Memory management Scheduling with multiple goals! What if you want both good turnaround
More informationBarbara Chapman, Gabriele Jost, Ruud van der Pas
Using OpenMP Portable Shared Memory Parallel Programming Barbara Chapman, Gabriele Jost, Ruud van der Pas The MIT Press Cambridge, Massachusetts London, England c 2008 Massachusetts Institute of Technology
More informationCS 333 Introduction to Operating Systems. Class 3 Threads & Concurrency. Jonathan Walpole Computer Science Portland State University
CS 333 Introduction to Operating Systems Class 3 Threads & Concurrency Jonathan Walpole Computer Science Portland State University 1 The Process Concept 2 The Process Concept Process a program in execution
More informationComputing architectures Part 2 TMA4280 Introduction to Supercomputing
Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:
More informationA NUMA API for LINUX*
A NUMA API for LINUX* Technical Linux Whitepaper w w w. n o v e l l. c o m April 2005 2 Disclaimer Trademarks Copyright Novell, Inc.makes no representations or warranties with respect to the contents or
More informationSHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008
SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008 What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem
More informationLecture Topics. Announcements. Today: Advanced Scheduling (Stallings, chapter ) Next: Deadlock (Stallings, chapter
Lecture Topics Today: Advanced Scheduling (Stallings, chapter 10.1-10.4) Next: Deadlock (Stallings, chapter 6.1-6.6) 1 Announcements Exam #2 returned today Self-Study Exercise #10 Project #8 (due 11/16)
More informationParallel Numerical Algorithms
Parallel Numerical Algorithms http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna16/ [ 9 ] Shared Memory Performance Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1. Architecture
More informationIBM Cell Processor. Gilbert Hendry Mark Kretschmann
IBM Cell Processor Gilbert Hendry Mark Kretschmann Architectural components Architectural security Programming Models Compiler Applications Performance Power and Cost Conclusion Outline Cell Architecture:
More informationCluster Network Products
Cluster Network Products Cluster interconnects include, among others: Gigabit Ethernet Myrinet Quadrics InfiniBand 1 Interconnects in Top500 list 11/2009 2 Interconnects in Top500 list 11/2008 3 Cluster
More informationDmitry Durnov 15 February 2017
Cовременные тенденции разработки высокопроизводительных приложений Dmitry Durnov 15 February 2017 Agenda Modern cluster architecture Node level Cluster level Programming models Tools 2/20/2017 2 Modern
More informationLinux Network Tuning Guide for AMD EPYC Processor Based Servers
Linux Network Tuning Guide for AMD EPYC Processor Application Note Publication # 56224 Revision: 1.10 Issue Date: May 2018 Advanced Micro Devices 2018 Advanced Micro Devices, Inc. All rights reserved.
More informationTowards NUMA Support with Distance Information
Towards NUMA Support with Distance Information Dirk Schmidl, Christian Terboven, Dieter an Mey {schmidl terboven anmey}@rz.rwth-aachen.de Rechen- und Kommunikationszentrum (RZ) Agenda Topology of modern
More informationKoronis Performance Tuning 2. By Brent Swartz December 1, 2011
Koronis Performance Tuning 2 By Brent Swartz December 1, 2011 Application Tuning Methodology http://docs.sgi.com/library/tpl/cgi-bin/browse.cgi? coll=linux&db=bks&cmd=toc&pth=/sgi_develop er/lx_86_apptune
More informationMunara Tolubaeva Technical Consulting Engineer. 3D XPoint is a trademark of Intel Corporation in the U.S. and/or other countries.
Munara Tolubaeva Technical Consulting Engineer 3D XPoint is a trademark of Intel Corporation in the U.S. and/or other countries. notices and disclaimers Intel technologies features and benefits depend
More informationChapter 5. Multiprocessors and Thread-Level Parallelism
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model
More informationIntroduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano
Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed
More informationPerformance Analysis of Parallel Scientific Applications In Eclipse
Performance Analysis of Parallel Scientific Applications In Eclipse EclipseCon 2015 Wyatt Spear, University of Oregon wspear@cs.uoregon.edu Supercomputing Big systems solving big problems Performance gains
More informationAdvanced Memory Management
Advanced Memory Management Main Points Applications of memory management What can we do with ability to trap on memory references to individual pages? File systems and persistent storage Goals Abstractions
More informationMULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationBei Wang, Dmitry Prohorov and Carlos Rosales
Bei Wang, Dmitry Prohorov and Carlos Rosales Aspects of Application Performance What are the Aspects of Performance Intel Hardware Features Omni-Path Architecture MCDRAM 3D XPoint Many-core Xeon Phi AVX-512
More informationMULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationReduces latency and buffer overhead. Messaging occurs at a speed close to the processors being directly connected. Less error detection
Switching Operational modes: Store-and-forward: Each switch receives an entire packet before it forwards it onto the next switch - useful in a general purpose network (I.e. a LAN). usually, there is a
More informationChapter 4: Multithreaded
Chapter 4: Multithreaded Programming Chapter 4: Multithreaded Programming Overview Multithreading Models Thread Libraries Threading Issues Operating-System Examples 2009/10/19 2 4.1 Overview A thread is
More informationChapter 8: Virtual Memory. Operating System Concepts
Chapter 8: Virtual Memory Silberschatz, Galvin and Gagne 2009 Chapter 8: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating
More informationXen and the Art of Virtualization. CSE-291 (Cloud Computing) Fall 2016
Xen and the Art of Virtualization CSE-291 (Cloud Computing) Fall 2016 Why Virtualization? Share resources among many uses Allow heterogeneity in environments Allow differences in host and guest Provide
More informationDesigning Next-Generation Data- Centers with Advanced Communication Protocols and Systems Services. Presented by: Jitong Chen
Designing Next-Generation Data- Centers with Advanced Communication Protocols and Systems Services Presented by: Jitong Chen Outline Architecture of Web-based Data Center Three-Stage framework to benefit
More informationMultiprocessor Support
CSC 256/456: Operating Systems Multiprocessor Support John Criswell University of Rochester 1 Outline Multiprocessor hardware Types of multi-processor workloads Operating system issues Where to run the
More informationEARLY EVALUATION OF THE CRAY XC40 SYSTEM THETA
EARLY EVALUATION OF THE CRAY XC40 SYSTEM THETA SUDHEER CHUNDURI, SCOTT PARKER, KEVIN HARMS, VITALI MOROZOV, CHRIS KNIGHT, KALYAN KUMARAN Performance Engineering Group Argonne Leadership Computing Facility
More informationHow to Boost the Performance of Your MPI and PGAS Applications with MVAPICH2 Libraries
How to Boost the Performance of Your MPI and PGAS s with MVAPICH2 Libraries A Tutorial at the MVAPICH User Group (MUG) Meeting 18 by The MVAPICH Team The Ohio State University E-mail: panda@cse.ohio-state.edu
More informationOPERATING SYSTEM. Chapter 9: Virtual Memory
OPERATING SYSTEM Chapter 9: Virtual Memory Chapter 9: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating Kernel Memory
More informationWhy you should care about hardware locality and how.
Why you should care about hardware locality and how. Brice Goglin TADaaM team Inria Bordeaux Sud-Ouest Agenda Quick example as an introduction Bind your processes What's the actual problem? Convenient
More informationM7: Next Generation SPARC. Hotchips 26 August 12, Stephen Phillips Senior Director, SPARC Architecture Oracle
M7: Next Generation SPARC Hotchips 26 August 12, 2014 Stephen Phillips Senior Director, SPARC Architecture Oracle Safe Harbor Statement The following is intended to outline our general product direction.
More informationCHAPTER 2: PROCESS MANAGEMENT
1 CHAPTER 2: PROCESS MANAGEMENT Slides by: Ms. Shree Jaswal TOPICS TO BE COVERED Process description: Process, Process States, Process Control Block (PCB), Threads, Thread management. Process Scheduling:
More informationHigh Performance Computing: Tools and Applications
High Performance Computing: Tools and Applications Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology Lecture 9 SIMD vectorization using #pragma omp simd force
More informationNative Computing and Optimization. Hang Liu December 4 th, 2013
Native Computing and Optimization Hang Liu December 4 th, 2013 Overview Why run native? What is a native application? Building a native application Running a native application Setting affinity and pinning
More information