Altix UV HW/SW

1 Altix UV HW/SW
SGI Altix UV utilizes an array of advanced hardware and software features to offload thread synchronization, data sharing, and message-passing overhead from the CPUs. The system has a rich set of hardware features that enable scalable programming models to be implemented with high efficiency and performance.

SGI MPI
The SGI MPI software stack includes a number of software components.

2 SGI MPI Software Stack
- MPI
- XPMEM (cross-process memory mapping)
- GRU development kit
- NUMA tools
- PerfBoost
- PerfCatcher
- MPInside

3 UV HUB
The UV_HUB is a custom ASIC developed by SGI. It implements the NUMAlink 5 protocol, memory operations, and associated atomic operations. It provides the following capabilities:
- Cache-coherent global shared memory.
- Offloading of time-sensitive and data-intensive operations from the processors to increase processing efficiency and scaling.
- Scalable, reliable, fair interconnect with other blades via NUMAlink 5.

4 Altix UV Blade and HUB
Source: SGI Altix UV 1000 System User's Guide

5 UV HUB in detail
SI (socket interface): provides a bridge between the Hub's LH and RH chip sets and the Intel sockets. To communicate with the Intel sockets, the SI implements an Intel proprietary interconnect called CSI (Common System Interface).
Source: SGI Altix UV admin manual

6 UV HUB in detail
LH (local home):
- manages directory operations associated with remote memory requests. The LH has a single external memory channel.
RH (remote home):
- processes coherent and non-coherent CSI transactions that are initiated by a local socket to a remote system address.
- processes NUMAlink intervention and invalidate requests when remote memory is cached locally by a socket.
LB (local block):
- provides system software the ability to select, configure, and control various functions of the UV hub chip.
- provides facilities to monitor, diagnose, and debug hardware states and operations on live systems.

7 UV HUB Units
- The NUMAlink interconnect
- The Global Reference Unit (GRU)
- The processor interconnect

8 NUMAlink Interconnect
- Shared-memory, globally addressable system interconnect.
- All physically distributed system memory is mapped into one global address space.
- Peak aggregate bidirectional bandwidth of 15 GB/s.
- 2-3x MPI latency improvement.
- Special support for block transfers and global operations.
- NUMAlink is connected directly into the memory infrastructure of the system, rather than indirectly through an I/O subsystem chip.

9 Fetch-Op in HUB
- Fetch-Op variables on the Hub provide fast synchronization.
- The Fetch-Op AMO helped reduce MPI send/recv latency from 12 to 8 microseconds.
- Used by MPI_Barrier, MPI_Win_fence, and shmem_barrier_all.
(Figure: CPUs attached to a HUB holding the fetch-op variable, connected to a router.)

10 GRU
- Hardware in the Hub for memory-to-memory block transfers and CPU synchronization events.
- Used by MPI, SHMEM, and UPC.
- External TLB with large-page support.
- Page initialization.
- Scatter/gather operations.
- Update cache for AMOs.

11 GRU API Components
- GRU resource allocators
- GRU memory access functions
- XPMEM address mapping functions
- MPT address mapping functions

12 MOE
- MOE is a set of functionality that offloads MPI communication workload from the CPUs to the Altix UV_HUB ASIC, accelerating common MPI tasks such as barriers and reductions across GSM (global shared memory).
- Similar in concept to a TCP/IP offload engine (TOE), which offloads TCP/IP protocol processing from the system CPUs.
- Frees CPUs from MPI activity.
- Faster reduction operations.
- Faster barriers and random access.

13 Accessing the MOE
MPI and MOE:
- MOE implements atomic memory operations in conjunction with a hardware multicast facility that helps accelerate MPI_Barrier, MPI_Bcast, and MPI_Allreduce.
- Accelerates MPI point-to-point and collective communication.

14 MOE Advantages
MOE provides:
- MPI message queues
- synchronization primitives
- advanced RDMA capabilities such as strided and indexed global memory updates
- hardware multicast

15 Determining System Configuration

16 topology
topology displays general information about an SGI Altix system, with a focus on node information. It includes node counts for blades, node IDs, NASIDs, memory per node, UV hub, and partition number.

17 cpumap
cpumap displays logical CPUs and shows the relationships between them. Aspects displayed include hyperthreading, last-level cache sharing, and topology placement. It gets its information from /proc/cpuinfo, /sys/devices/system, and /proc/sgi_uv/topology.
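
The same raw sources cpumap parses can be inspected by hand; a minimal sketch using standard Linux paths (the /proc/sgi_uv tree exists only on UV hardware):
grep "physical id" /proc/cpuinfo | sort -u                       # socket IDs seen by the kernel
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list   # hyperthread siblings of CPU 0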

18 cpumap

19 nodeinfo

20 nodeinfo
- Hit: page was allocated on the preferred node.
- Miss: preferred node was full; allocation occurred on this node on behalf of a process running on another node.
- Foreign: preferred node was full; the allocation had to go somewhere else.
- Interleave: allocation used the interleave policy (numactl -i).
- Local: page allocated on this node by a process running on this node.
- Remote: page allocated on this node by a process running on another node.
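
These counters correspond to the per-node statistics the kernel exports; on any NUMA Linux system they can be viewed with:
numastat                                     # numa_hit, numa_miss, numa_foreign, interleave_hit,
                                             # local_node, other_node for every node
cat /sys/devices/system/node/node0/numastat  # raw counters for node 0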

21 x86info

22 pmchart (figure)

23 pmchart (figure)

24 HW Summary
/proc/cpuinfo
/proc/meminfo
/sys/devices/system/node
/dev/cpuset/torque/job#
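
A few one-liners pull the key facts out of these files (standard Linux paths; the /dev/cpuset entry above assumes a Torque batch system with the cpuset filesystem mounted):
grep "model name" /proc/cpuinfo | sort -u   # CPU model
grep MemTotal /proc/meminfo                 # total physical memory
ls /sys/devices/system/node                 # NUMA nodes present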

25 Data Placement Tool

26 CPU Scheduling
- In a single-processor system, only one process can run at a time.
- CPU scheduling controls how the OS switches access to the CPU between processes.
- The kernel provides a mechanism called time slicing.
- A time slice is the maximum length of time that a process owns its CPU resource and executes at its current policy.
- Each CPU has its own run queue.
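
The per-process scheduling state is easy to inspect on Linux; for example (column names may vary slightly between procps versions):
ps -eo pid,psr,policy,comm | head   # PSR = the CPU whose run queue the process last ran on
chrt -p $$                          # scheduling policy and priority of the current shell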

27 Cache Affinity
- Affinity scheduling is a special scheduling discipline used in multiprocessor systems.
- As a process executes, it causes more and more data and instruction text to be loaded into the processor cache. This creates an affinity between the process and the CPU.

28 Data Placement Tool
- NUMA machines have a shared address space: there is a single shared memory space and a single operating system instance.
- There is a performance penalty for accessing remote memory versus local memory.
- Access times to memory vary over physical address ranges and between processing elements. NUMAlink is used to access memory between blades/nodes.
- Memory latency is lowest when a processor accesses local memory.
- NUMA tools also help run multiple instances of a serial program in a single job script with better process placement.

29 NUMA API
The API is called from libcpuset.
- cpuset: create, modify, destroy cpusets.
- taskset: run a process on specific physical CPUs.
- numactl: control NUMA policy for processes or shared memory.
- dplace: binds processes to specific logical CPUs.
- omplace: controls the placement of MPI processes and OpenMP threads.
- Batch systems: LSF, PBS Pro, Torque, SGE.
- Inspection tools: dlook, dlook-summary, pidstat, cpuset-q.

30 cpuset
- cpuset incorporates sched_setaffinity for CPU binding, plus memory binding.
- Each task has a link to a cpuset structure that specifies the CPUs and memory nodes available for its use.
- All tasks sharing the same placement constraints reference the same cpuset.
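
Each task's current cpuset and allowed resources can be read straight from /proc on kernels with cpusets enabled:
cat /proc/self/cpuset                                            # cpuset the shell belongs to
grep -E "Cpus_allowed_list|Mems_allowed_list" /proc/self/status  # CPUs and memory nodes available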

31 Why Use a cpuset?
- Restrict consumption of designated CPU resources to specified processes/threads.
- Limit run-time variability.
- Memory affinity.
- Isolate I/O.

32 How Are cpusets Used?
Static cpusets (batch calls shared by queue):
- cpusets are defined by the administrator after system startup.
- Users attach processes to the existing cpusets.
- cpusets continue to exist after jobs finish executing.
Dynamic cpusets:
- The workload management system (WMS) creates a cpuset when it is required by a job.
- The WMS attaches the job to the newly created cpuset.
- The WMS destroys the cpuset at the end of the job.

33 cpuset Command Line Options
cpuset
-c cpuset_name       Create cpuset
-m cpuset_name       Modify cpuset
-x cpuset_name       Destroy cpuset
-d cpuset_name       Dump cpuset attributes
-i csname -I script  Run command
-p cpuset_name       List all processes in the cpuset
-a cpuset_name       Attach PIDs to the cpuset
-w pid               List the cpuset the PID is attached to
-f filename          Input config file
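
A hypothetical session using only the flags listed above; the exact SGI cpuset(1) syntax and cpuset naming may differ, so treat this as a sketch:
cpuset -c /batch1 -f batch1.conf   # create a cpuset from a config file
cpuset -p /batch1                  # list all processes in the cpuset
cpuset -x /batch1                  # destroy the cpuset when done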

34 Advantages of a cpuset
- Improves cache locality and memory access times.
- Facilitates providing equal resources to each thread in a job.
- Results in both optimal and repeatable performance.

35 taskset
- taskset restricts execution to the listed set of CPUs; however, processes are still free to move among the listed CPUs.
- It is used to set or retrieve the CPU affinity of a running process given its PID, or to launch a new command with a given CPU affinity.
- The CPU affinity is represented as a hexadecimal bitmask, with the lowest-order bit corresponding to processor number 0 (so mask 0x00000001 is processor 0).
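
As a worked example of the mask arithmetic (hypothetical command line): 0x5 is binary 101, so bits 0 and 2 are set and the task may run on CPUs 0 and 2:
taskset 0x5 ./a.out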

36 taskset
- taskset does not pin a task to a specific CPU; it only restricts a task so that it does not run on any CPU that is not in the cpulist.
- If you are running an MPI application, do not use the taskset command; use dplace instead:
mpirun -np 8 dplace -s1 -c 10,11,16-21 ./a.out
or:
export MPI_DSM_CPULIST=10,11,16-21
mpirun -np 8 ./a.out

37 taskset examples
taskset 0x1 ./a.out      # executes on CPU 0 (lowest-order bit)
taskset 0x131 ./a.out    # executes on CPUs 0, 4, 5 and 8
taskset -p 0xa8 <pid>    # moves the given PID onto CPUs 3, 5 and 7
taskset -c 5 ./a.out     # executes a.out on CPU 5
taskset -p 14386         # returns the affinity mask of PID 14386

38 numactl
- Runs processes with a specific NUMA scheduling or memory placement policy.
- Controls memory placement:
  - interleave (round-robin across nodes)
  - membind (allocate from a specified node pool)
  - preferred node
  - local allocation (first touch)

39 numactl Command Line Options
numactl
--interleave    Set a memory interleave policy.
--membind       Only allocate memory from the given nodes.
--cpunodebind   Only execute the command on the CPUs of the given nodes.
--physcpubind   Only execute the process on the given CPUs.

40 numactl examples
numactl --physcpubind=+0-4,8-12 myapplic arguments
# Run myapplic on CPUs 0-4 and 8-12 of the current cpuset.
numactl --interleave=all bigdatabase arguments
# Run a big database with its memory interleaved across all nodes.
numactl --cpubind=0 --membind=0,1 process
# Run process on node 0 with memory allocated on nodes 0 and 1.

41 numactl --hardware

42 dplace
- dplace asks the Linux kernel to pin a thread (or series of threads) to a specific CPU core within a container; once pinned, they do not migrate.
- By default, it binds processes sequentially, round-robin, against the logical CPUs in the current cpuset.
- Integrates with MPT (via omplace and environment variables).
- It understands fork, exec, threads, etc.
- Helps ensure optimal performance and minimize runtime variability.

43 dplace Features
- The default memory allocation policy is node-local (first touch).
- dplace allows processes to be bound to specific logical CPUs (within the cpuset).
- Prevents migration (thread hopping).
- May require knowledge of the application.
- Global load balancing.

44 dplace Command Line Options
dplace
-c CPU list   CPUs to place on
-e            exact placement
-s            skip n CPUs before starting placement
-n            only processes with the given name
-x            skip mask
-p            placement file
-r            replicate shared text to each node
-q            list global count

45 dplace examples
dplace -c 0-3 ./a.out
# Places threads on the first four CPUs, beginning with core 0.
dplace -c 0-7 -x2 ./a.out
# Places threads on the first 8 CPUs, but uses the skip mask (-x2) to skip the
# second thread (which, in the case of Intel OpenMP, is the lightweight monitor thread).
mpirun -np 8 dplace -s1 -c 0-7 ./a.out
# Skips the first process, as it is essentially the MPI shepherd; dplace handles
# the placement of the other 7 MPI ranks.

46 numactl and dplace
Consider a code that runs with 4 threads. What is the difference between
numactl --physcpubind=0-3 a.out
dplace -c 0-3 a.out
With dplace, each thread is bound to a particular CPU. With numactl, the threads are bound to the range of CPUs 0-3 and are free to migrate within that range. numactl does, however, have memory binding options.

47 omplace
Tool for controlling the placement of MPI processes and OpenMP threads.
-c cpulist    Specifies the effective CPU list.
-nt threads   Specifies the number of threads per MPI process.
-s skip       The number of processes to skip before placement starts.
-vv           Verbose: the automatically generated placement file is displayed in its entirety.

48 omplace examples
mpirun -np 2 omplace -nt 4 -vv ./a.out
# Runs 2 MPI processes with 4 threads per process and displays the generated placement file.

49 dlook
- Tool for showing process memory maps and CPU usage.
- View address space and page placement.
- Two forms:
dlook [options] pid
dlook [options] <command> [command-args]
- Run an MPI job using mpirun and print the memory map for each thread:
mpirun -np 8 dlook a.out

50 Summary
- Use cpumap to determine partitioning and placement.
- Use taskset to lock a process or process group onto a CPU or group of CPUs.
- Use dplace to place a process group within the system topology.
- Run an MPI/OpenMP hybrid and use omplace for pinning.
- Use numactl to control memory placement.

51 Tips
- Use dplace, numactl, or cpuset to lock down processes, preventing thread hopping/migration.
- Strong cache affinity reduces cache misses and instruction pipeline flushes.
- Keep processes close to their node-local memory.
- Be aware of data placement.

52 Heisenberg Principle
- Looking at the system will impact the system.
- Tracing events has the highest impact: strace, gprof, etc.
- PCP and sar have the lowest impact.
- You cannot measure a system without affecting it: top will show up in the top display.
- PCP uses less than 1% of a CPU.

53 sar
- sar indicates normal/abnormal behavior of the system; sar can imply performance problems and bottlenecks.
- Many people look at sar as a set of performance metrics, but it is not; it is an indicator of what a system is doing.
- PCP and sar simply tell you what to look for.

54 sar
- sar -v -q to check kernel table sizes and run-queue length.
- sar -W to check swapping activity.
- sar -r to see what memory and swap are left.
- sar -u to report the amount of time spent executing kernel code.
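
sar also takes an interval and a count, so a typical spot check looks like:
sar -u 5 3   # three 5-second samples of CPU utilization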

55 top, ps, pstree
- top provides a dynamic, real-time view of a running system.
- top with H provides thread information.
- ps reports a snapshot of the currently running processes on the system; use it with grep <username> to get user-specific information.
- pstree displays a tree of processes.
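
Typical invocations of the three tools (<username> is a placeholder):
top -H                    # per-thread view
ps -ef | grep <username>  # one user's processes
pstree -p                 # process tree with PIDs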

56 vmstat, mpstat
- vmstat reports information about processes, memory, paging, block I/O, traps, and CPU activity.
- mpstat writes to standard output the activities of each available processor, processor 0 being the first one; global average activity across all processors is also reported.
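
Both tools accept an interval (and optional count) for continuous sampling, e.g.:
vmstat 5           # system-wide statistics every 5 seconds
mpstat -P ALL 5 1  # one 5-second sample, broken out per processor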

57 mpvis
mpvis displays a three-dimensional bar chart of CPU utilization. The display is updated with new values retrieved from the target host or archive every interval seconds.

58 pidstat
pidstat is used for monitoring individual tasks currently being managed by the Linux kernel.
-r   Report page faults and memory utilization.
-d   Report I/O statistics.
-u   Report CPU utilization.
-p   Select tasks for which statistics are to be reported.
-t   Display statistics for threads associated with selected tasks.
Example: pidstat -t -p 14374

59 cpuset-q
It gives information about allocated CPUs, node, PID, WCHAN, command name, etc.

60 dlook
- Tool for showing process memory maps and CPU usage.
- Two forms:
dlook [options] pid
dlook [options] <command> [command-args]
- Run an MPI job using mpirun and print the memory map for each thread:
mpirun -np 8 dlook a.out

61 References
- UV System Analysis Manual
- UV System Administration Manual
- Technical Advances in the SGI Altix UV Architecture (white paper)
- A Hardware-Accelerated MPI Implementation on SGI Altix UV Systems (white paper)
- Linux Application Tuning Guide for SGI X86_64 Based Systems
- SGI Message Passing Toolkit (MPT) User's Guide
- SGI NUMAlink white paper
