Intel Compiler
Advanced Technical Skills (ATS) North America
IBM High Performance Computing, February 2010
Y. Joanna Wong, Ph.D. (yjw@us.ibm.com)


Nehalem-EP CPU Summary

Performance/features:
- 4 cores, 8 MB on-chip shared cache
- Simultaneous Multi-Threading (SMT) capability
- Intel QuickPath Interconnect, up to 6.4 GT/s in each direction per link
- Integrated memory controller (3x DDR3 channels)
- New instructions
Power: 95 W, 80 W, and 60 W parts
Socket: new LGA 1366 socket
Process technology: 45 nm
CPU platform compatibility: Tylersburg (TBG) with ICH9/10

[Slide diagram: Nehalem-EP die with memory controller (3x DDR3 channels), 8 MB shared cache, and link controller (2x Intel QuickPath interconnects)]

Driving performance through multi-core technology and platform enhancements.

Source: Intel Corporation

Intel Xeon 5500 Series (Nehalem-EP) Overview

Key technologies:
- New 45 nm Intel microarchitecture
- New Intel QuickPath Interconnect (up to 25.6 GB/s bandwidth per link)
- Integrated memory controller
- Next-generation memory (DDR3)
- PCI Express* Gen 2
IT benefits:
- More application performance
- Improved energy efficiency
- End-to-end HW assist (virtualization technology improvements)
- Stable IT image, software compatible
- Live migration compatible with today's dual- and quad-core Intel microarchitecture products using enabled virtualization software

[Slide diagram: two Nehalem sockets linked by QPI to each other and to the I/O hub, which provides PCI Express* Gen 1/2 and connects to the ICH via DMI]

Source: Intel Corporation

Energy Efficiency Enhancements: Intel Intelligent Power Technologies

Integrated power gates:
- Enable idle cores to go to near-zero power independently (the C6 state requires OS support)
Automated low-power states:
- More and lower CPU power states, with reduced latency during transitions
- Power management now covers memory and I/O as well
- Adjusts system power consumption based on real-time load
- Automatic operation, or manual core disable (requires a BIOS setting change and system reboot)

[Slide diagram: separate voltage planes for the cores (0-3) and for the rest of the processor: memory system, cache, and I/O]

Source: Intel Corporation

More Efficient Chipset and Memory

Memory power management:
- DIMMs are automatically placed into a lower power state when not utilized (using DIMM CKE, Clock Enable)
- DIMMs are automatically idled when all CPU cores in the system are idle (using DIMM self-refresh)
Chipset power management:
- QPI links and PCIe lanes are placed in power-reduction states when not active (using the L0s and L1 states)
- Capable of placing PCIe* cards in the lowest power state possible (for cards enabled with ASPM, Active State Power Management)

End-to-end platform power management.

[Slide diagram: two Intel Xeon 5500 sockets connected by QPI to each other and to the Intel 5500 chipset with its PCIe* lanes]

Source: Intel Corporation

Performance Enhancements: Intel Xeon 5500 Series Processor (Nehalem-EP)

Intel Turbo Boost Technology:
- Increases performance by raising processor frequency, enabling faster speeds when conditions allow
- Normal: all cores operate at rated frequency; 4C turbo: all cores operate at a higher frequency; <4C turbo: fewer active cores may operate at even higher frequencies
- Up to 30% higher; higher performance on demand

Intel Hyper-Threading Technology:
- Increases performance for threaded applications, delivering greater throughput and responsiveness
- Higher performance for threaded workloads

Source: Intel internal measurements, January 2009. For notes and disclaimers, see the performance and legal information slides at the end of this presentation.
Source: Intel Corporation

Nehalem-EP Turbo Mode Frequencies

[Slide table: base frequency and turbo bins (+133 MHz steps) per SKU. Basic 80 W parts E5502 (1.86 GHz, dual-core), E5504 (2.00 GHz), and E5506 (2.13 GHz) have no turbo. Standard 80 W parts E5520 (2.26 GHz), E5530 (2.40 GHz), and E5540 (2.53 GHz) add +1 bin in 4C/3C operation and +2 bins in 1C/2C turbo. Performance 95 W parts X5550 (2.66 GHz), X5560 (2.80 GHz), and X5570 (2.93 GHz) add +2 bins in 4C/3C operation and +3 bins in 1C/2C turbo, up to 3.33 GHz; the 130 W workstation part reaches 3.46 GHz in turbo.]

Source: Intel Corporation

Optimizing for Memory Performance: General Guidelines and Potential Configurations

- Use identical DIMM types throughout the platform: same size, speed, and number of ranks
- Use a balanced platform configuration: populate each channel and each socket identically
- Maximize the number of channels populated for highest bandwidth

IT requirements (assumes two 4-core Nehalem-EP CPUs, i.e., 8 cores):

GB/core:            0.5  1   1.5  2   3   4   4.5  6   8   9   12  18
Platform capacity:  4    8   12   16  24  32  36   48  64  72  96  144

Potential configurations (DIMM count x DIMM size in GB):
- 2 GB DIMMs: 2x2 (4 GB), 4x2 (8 GB), 6x2 (12 GB), 8x2 (16 GB), 12x2 (24 GB), 18x2 (36 GB)
- 4 GB DIMMs: 2x4 (8 GB), 4x4 (16 GB), 6x4 (24 GB), 8x4 (32 GB), 12x4 (48 GB), 18x4 (72 GB)
- 8 GB DIMMs: 2x8 (16 GB), 4x8 (32 GB), 6x8 (48 GB), 8x8 (64 GB), 12x8 (96 GB), 18x8 (144 GB)

Memory speed depends on population: DDR3 1333/1066/800 supported at one DIMM per channel, DDR3 1066/800 at two, and DDR3 800 only at three. The memory sweet spots are the configurations that equally populate all 6 memory channels.

Source: Intel Corporation


Intel Smart Cache: New 3-Level Cache Hierarchy

1st-level caches (per core):
- 32 kB instruction cache
- 32 kB, 8-way data cache
- Supports more L1 misses in parallel than the previous Intel Core microarchitecture
2nd-level cache (per core):
- New cache introduced in the Intel microarchitecture (Nehalem)
- Unified (holds code and data), 256 kB, 8-way
- Performance: very low latency, 10-cycle load-to-use
- Scalability: as core count increases, reduces pressure on the shared cache

Source: Intel Corporation

Intel Smart Cache: 3rd-Level Cache

- Shared across all cores; size depends on the number of cores (quad-core: up to 8 MB, 16-way)
- Scalability: built to vary size with varied core counts, and to easily increase L3 size in future parts
- Perceived latency depends on the frequency ratio between core and uncore; ~40 clocks
- Inclusive cache policy for best performance: an address residing in L1/L2 must also be present in the 3rd-level cache

[Slide diagram: per-core L1 and L2 caches in front of the shared L3 cache]

Source: Intel Corporation

Why Inclusive?

An inclusive cache provides the benefit of an on-die snoop filter, via core valid bits: 1 bit per core per cache line.
- If the line may be in a core, that core's valid bit is set
- A snoop is needed only if the line is in L3 and a core valid bit is set
- The line is guaranteed not to be modified if multiple bits are set
Scalability: adding cores/sockets does not increase the snoop traffic seen by the cores.
Latency:
- Minimizes effective cache latency by eliminating cross-core snoops in the common case
- Minimizes snoop response time for cross-socket cases

Source: Intel Corporation
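To make the filtering rule concrete, here is a toy model of the core valid bits (an illustration of the description above only, not Intel's implementation; all names are invented):

```c
/* snoop.c: toy model of the L3 core valid bits (illustrative only;
 * real hardware is far more involved). */
#include <stdio.h>

#define NCORES 4

struct l3_line {
    int present;              /* is the line in L3 at all? */
    unsigned core_valid;      /* 1 bit per core: line may be in that core */
};

/* A core is snooped only if the line is in L3 and its valid bit is set. */
static int must_snoop(const struct l3_line *l, int core)
{
    return l->present && (l->core_valid & (1u << core));
}

int main(void)
{
    struct l3_line line = { 1, 0x5 };   /* cached by cores 0 and 2 */
    int c;

    for (c = 0; c < NCORES; c++)
        printf("core %d: snoop %s\n", c, must_snoop(&line, c) ? "yes" : "no");
    return 0;
}
```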

Hardware Prefetching (HWP)

HW prefetching is critical to hiding memory latency. The structure of the HWPs is similar to the previous Intel Core microarchitecture, with algorithmic improvements in the Intel microarchitecture (Nehalem) for higher performance.
L1 prefetchers:
- Based on instruction history and/or load address patterns
L2 prefetchers:
- Prefetch loads/RFOs/code fetches based on address patterns
Intel microarchitecture (Nehalem) changes:
- More efficient prefetch mechanism; removes the need to disable HWP on Intel Xeon processors
- Increased prefetcher aggressiveness: locks on to address streams quicker, adapts to changes faster, and issues prefetches more aggressively (when appropriate)

Source: Intel Corporation

Extending Performance and Energy Efficiency: the Intel SSE4.2 Instruction Set Architecture (ISA)

SSE4 roadmap (45 nm CPUs): SSE4.1 (Penryn); SSE4.2 (Nehalem).

STTNI (String and Text New Instructions):
- Accelerated string and text processing: faster XML parsing, faster search and pattern matching
- Novel parallel data matching and comparison operations
- Projected 3.8x kernel speedup on XML parsing and 2.7x savings in instruction cycles

ATA (Application Targeted Accelerators):
- POPCNT: accelerated searching and pattern recognition in large data sets; fast Hamming distance / population count; improved performance for genome mining and handwriting recognition
- CRC32: a hardware-based CRC instruction; accelerates network-attached storage (e.g., iSCSI); improved power efficiency for software iSCSI, RDMA, and SCTP

Source: Intel Corporation
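As a hedged illustration of the ATA instructions above (not from the original deck; the file name and compile lines are assumptions), both accelerators are reachable from C through the SSE4.2 intrinsics header:

```c
/* ata.c: sketch of the SSE4.2 ATA intrinsics (POPCNT, CRC32).
 * Assumed compile line: icc -xSSE4.2 ata.c   (or: gcc -msse4.2 ata.c) */
#include <stdio.h>
#include <nmmintrin.h>            /* SSE4.2 intrinsics */

int main(void)
{
    unsigned int word = 0xF00Fu;
    /* POPCNT: population count, the core of a Hamming-distance kernel */
    int bits = _mm_popcnt_u32(word);

    /* CRC32: accumulate a CRC-32C checksum (the polynomial used by iSCSI) */
    unsigned int crc = 0;
    const char *msg = "nehalem";
    const char *p;
    for (p = msg; *p; ++p)
        crc = _mm_crc32_u8(crc, (unsigned char)*p);

    printf("popcount(0x%X) = %d, crc32c(\"%s\") = 0x%08X\n",
           word, bits, msg, crc);
    return 0;
}
```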

Tools Support for the Intel Microarchitecture (Nehalem)

Intel Compiler 10.x supports the new instructions:
- Nehalem-specific compiler optimizations
- SSE4.2 supported via vectorization and intrinsics
- Inline assembly supported on both IA-32 and Intel 64 architecture targets
- The required header files must be included in order to access the intrinsics
Intel XML Software Suite:
- High-performance C++ and Java runtime libraries
- Version 1.0 (C++) and version 1.01 (Java) available now; version 1.1 with SSE4.2 optimizations planned for September 2008
Microsoft Visual Studio* 2008 VC++:
- SSE4.2 supported via intrinsics; inline assembly supported on IA-32 only
- The required header files must be included in order to access the intrinsics
- The VC++ 2008 tools masm and msdis and the debuggers recognize the new instructions
GCC* 4.3.1:
- Supports the Intel microarchitecture (Merom), the 45 nm next-generation Intel microarchitecture (Penryn), and the Intel microarchitecture (Nehalem) via -mtune=generic
- Supports SSE4.1 and SSE4.2 through the vectorizer and intrinsics

Optimization Guidelines with the Intel Compiler for the Intel Core i7 Processor

Source: Intel Developer Forum, "Tuning your Software for the Next Generation Intel Microarchitecture (Nehalem)"


Unaligned Loads / Stores

- Unaligned loads are as fast as aligned loads
- Accesses that span two cache lines are optimized
- Generating misaligned references is less of a concern
- One instruction can replace sequences of up to 7: fewer instructions, less register pressure
- Increased opportunities for several optimizations: vectorization, memcpy/memset
- Dynamic stack alignment is less necessary for 32-bit stacks
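A hedged sketch of what this enables in practice (not from the deck; the file name and compile line are assumptions): a single unaligned SSE load can be used unconditionally, instead of the old peel-until-aligned sequences.

```c
/* sum.c: summing a float array with unaligned SSE loads.
 * On Nehalem, _mm_loadu_ps costs the same as _mm_load_ps when the address
 * happens to be aligned, so one unaligned load can replace the older
 * "peel until aligned, then use aligned loads" instruction sequences.
 * Assumed compile line: icc -xSSE4.2 sum.c */
#include <stdio.h>
#include <xmmintrin.h>

float sum(const float *x, int n)
{
    __m128 acc = _mm_setzero_ps();
    float tmp[4], s;
    int i;

    for (i = 0; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(x + i));   /* unaligned load */

    _mm_storeu_ps(tmp, acc);
    s = tmp[0] + tmp[1] + tmp[2] + tmp[3];
    for (; i < n; ++i)                                /* scalar remainder */
        s += x[i];
    return s;
}

int main(void)
{
    float x[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    printf("sum = %f\n", sum(x, 10));                 /* prints 55.0 */
    return 0;
}
```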


Common Intel Compiler Features

- General optimization settings
- Cache-management features
- Interprocedural optimization (IPO) methods
- Profile-guided optimization (PGO) methods
- Multithreading support
- Floating-point arithmetic precision and consistency
- Compiler optimization and vectorization reports

Source: Intel white paper, "Optimizing Applications with the Intel C++ and Fortran Compilers for Windows, Linux and Mac OS X"


Common Intel Compiler Features

Target systems with processor-specific options:
- -xSSE4.2: generate optimized code specialized for the Intel Core i7 processor family that will execute the program
- -xHost: optimize for, and use the most advanced instruction set of, the processor on which you compile
- -axSSE4.2: generate multiple processor-specific auto-dispatch code paths for Intel processors where there is a performance benefit; the executable will still run on Intel processor architectures other than Core i7

A good start:
- -O2 -xSSE4.2 or -O2 -xHost
- -O3 -xSSE4.2 or -O3 -xHost
- -fast (equivalent to -O3 -ipo -no-prec-div -static)
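To make these options concrete, here is a sketch of a trivially vectorizable kernel with the corresponding compile lines as comments (the file name, the loop, and the report flag are illustrative assumptions, not from the deck):

```c
/* triad.c: a loop the Intel compiler can auto-vectorize for SSE4.2.
 * Assumed compile lines (illustrative):
 *   icc -O2 -xSSE4.2 -vec-report2 triad.c   (specialize for Core i7)
 *   icc -O2 -xHost triad.c                  (use the build host's best ISA)
 *   icc -O3 -axSSE4.2 triad.c               (add an auto-dispatch SSE4.2 path)
 */
#include <stdio.h>
#define N 4096

int main(void)
{
    static float a[N], b[N], c[N];
    int i;

    for (i = 0; i < N; i++) { b[i] = (float)i; c[i] = 2.0f; }

    /* independent iterations, unit stride: an ideal vectorization candidate */
    for (i = 0; i < N; i++)
        a[i] = b[i] * c[i] + 1.0f;

    printf("a[100] = %f\n", a[100]);
    return 0;
}
```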

Parallel Programming with MPI (2/22/2010)

Parallel Scalability

Amdahl's Law: the law of diminishing returns. Normalize the single-processor run time to T_1 = S + P = 1, where S is the serial fraction and P the parallelizable fraction. Assuming the parallel content scales inversely with the number of parallel tasks n (ignoring computation and communication load imbalance), the run time on n processors is

    T_n = S + P/n

The maximum parallel speedup, as n grows without bound, is therefore

    T_1 / T_infinity = 1/S = P/S + 1

For example, a code that is 5% serial (S = 0.05) can never run more than 20x faster, no matter how many processors are used.

Source: Wikipedia
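A small sketch (not from the slides) that tabulates these formulas for an example serial fraction:

```c
/* amdahl.c: tabulate T_n = S + P/n and the resulting speedup
 * for an example serial fraction S = 0.05. */
#include <stdio.h>

int main(void)
{
    const double S = 0.05;            /* serial fraction (example value) */
    const double P = 1.0 - S;         /* parallelizable fraction */
    int n;

    for (n = 1; n <= 1024; n *= 4) {
        double Tn = S + P / n;        /* run time on n processors (T_1 = 1) */
        printf("n = %4d   speedup = %6.2f\n", n, 1.0 / Tn);
    }
    /* as n grows without bound, speedup -> 1/S = P/S + 1 = 20 for S = 0.05 */
    return 0;
}
```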

Strong vs. Weak Scaling

Strong scaling: how performance scales with the number of processors for a fixed total problem size.
Weak scaling: how performance scales with the number of processors for a fixed problem size per processor.

Thread: an independent flow of control that may operate within a process alongside other threads. A schedulable entity that:
- Has its own stack, thread-specific data, and its own registers
- Has its own set of pending and blocked signals
Process:
- Cannot share memory directly with other processes
- Cannot share file descriptors with other processes
- A process can own multiple threads
An OpenMP job is a single process: it creates and owns one or more SMP threads, and all the SMP threads share the same PID.
An MPI job is a set of concurrent processes (or tasks); each process has its own PID and communicates with the other processes via MPI calls.
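A quick way to see the "one process, shared PID" point for OpenMP (a sketch, not from the slides; the compile lines are assumptions):

```c
/* pid.c: every OpenMP thread reports the same process ID, because an
 * OpenMP job is a single process that owns its SMP threads.
 * Assumed compile line: icc -openmp pid.c   (or: gcc -fopenmp pid.c) */
#include <stdio.h>
#include <unistd.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        /* all threads print the same PID */
        printf("thread %d of %d runs in process %d\n",
               omp_get_thread_num(), omp_get_num_threads(), (int)getpid());
    }
    return 0;
}
```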

Shared Memory Characteristics

- Single address space
- Single operating system
Limitations:
- Memory contention
- Bandwidth
- Cache coherency: snoop traffic gets very expensive as more L2 caches are added to the node
Benefits:
- Memory size
- Programming models

Distributed Memory Characteristics

- Multiple address spaces
- Multiple operating systems
Limitations:
- Switch contention
- Bandwidth
- Local memory size
Benefits:
- Economical scaling to large processor counts
- Cache coherency is not needed between nodes

[Slide diagram: shared-memory nodes connected by a network]

Comparison: Shared-Memory Programming vs. Distributed-Memory Programming

Shared memory:
- A single process ID for all threads (as we saw in the SMP chapter)
- List the threads with: ps -elf
Distributed memory:
- Each task has its own process ID
- List the tasks with: ps

[Slide diagram: shared memory shows one process A containing threads 0, 1, 2; distributed memory shows separate processes A, B, and C running tasks 0, 1, 2]

Parallel Programming on an HPC Cluster

- Message passing (MPI) between nodes over the high-performance interconnect
- OpenMP / multi-threading within each SMP node

[Slide diagram: a cluster of shared-memory nodes, each with its own memory, connected by the interconnect]

Parallel Programming Choices

MPI:
- Good for tightly coupled computations
- Exploits all networks and all operating systems
- Significant programming effort; debugging can be difficult
- The master/slave paradigm is supported
OpenMP:
- Easier to get parallel speedup
- Limited to SMP (a single node)
- Typically applied at loop level, which limits scalability
Automatic parallelization by the compiler:
- Needs clean programming to get an advantage
pthreads (POSIX threads):
- Good for loosely coupled computations
- User-controlled instantiation and locks (see the sketch below)
fork/execl:
- The standard Unix/Linux technique
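A minimal pthreads sketch (illustrative, not from the slides): the programmer instantiates and joins the threads explicitly, in contrast to OpenMP's compiler-managed fork/join.

```c
/* pt.c: explicit thread instantiation with POSIX threads.
 * Assumed compile line: gcc -pthread pt.c */
#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4

static void *worker(void *arg)
{
    long id = (long)arg;                 /* loosely coupled, independent work */
    printf("worker %ld running\n", id);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    long i;

    for (i = 0; i < NTHREADS; i++)       /* user-controlled instantiation */
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);      /* wait for every worker to finish */
    return 0;
}
```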

Schematic Flow of an SMP Code

Program ParallelWork
   start the program                 <- master thread (thd0) only
   read input
   set up comp. parameters
   initialize variables
   DO i=1,imax ... End do            <- fork: slave threads (thd1-thd3) gang-scheduled with the master;
                                        join (sync.), then spin/yield
   serial work                       <- master only
   DO i=1,imax ... End do            <- fork, then join (sync.), spin/yield
   serial work
   DO i=1,imax ... End do            <- fork, then join (sync.), spin/yield
   Output
End program ParallelWork
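The same fork/join flow rendered in C with OpenMP (a sketch under the assumption that each DO loop above is a parallel loop; file name and work are illustrative):

```c
/* smpflow.c: the fork/join flow above in C with OpenMP.
 * Assumed compile line: icc -openmp smpflow.c  (or: gcc -fopenmp smpflow.c) */
#include <stdio.h>
#define IMAX 1000

int main(void)
{
    double v[IMAX];
    int i;

    /* serial section: master thread (thd0) only */
    for (i = 0; i < IMAX; i++)
        v[i] = 0.0;

    /* fork: the slave threads join the master for the loop ... */
    #pragma omp parallel for
    for (i = 0; i < IMAX; i++)
        v[i] += 1.0;
    /* ... implicit join (synchronization) at the end of the loop */

    /* more serial work and parallel loops would follow, then output */
    printf("v[0] = %f\n", v[0]);
    return 0;
}
```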

Schematic Flow of an MPI Code

Program ParallelWork
   start the program                 <- all processes (proc0-proc3) start
   Call MPI_Init(ierr)
   read input                        <- msg passing: input parameters; barrier if needed
   set up comp. parameters
   DO i=1,imax ... End do            <- message passing
   serial work
   DO i=1,imax ... End do            <- message passing
   serial work
   DO i=1,imax ... End do            <- message passing; barrier if needed
   Output                            <- wait for msg, synchronization
   Call MPI_Finalize(ierr)
End program ParallelWork
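The same flow as a runnable C skeleton (a sketch, not from the slides: proc0 reads the input and broadcasts the parameters, every rank computes its share, and a barrier and reduction stand in for the message-passing points; the compile/run line is an assumption):

```c
/* mpiflow.c: C skeleton of the MPI flow sketched above.
 * Assumed compile/run: mpicc mpiflow.c && mpirun -np 4 ./a.out */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs, imax = 0, i, local = 0, total = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0)
        imax = 1000;                      /* proc0 reads the input parameters */
    MPI_Bcast(&imax, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* msg passing: input */

    for (i = rank; i < imax; i += nprocs) /* each rank takes a share of the loop */
        local += 1;

    MPI_Barrier(MPI_COMM_WORLD);          /* barrier if needed */

    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("total iterations = %d\n", total);      /* output */

    MPI_Finalize();
    return 0;
}
```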

MPI Basics

MPI = Message Passing Interface
- A message-passing library standard based on the consensus of the MPI Forum, with participants from vendors (hardware and software), academia, and software developers
- Initially developed to support distributed-memory programming architectures
- Not an IEEE or ISO standard, but now the de facto industry standard
- NOT a library: a specification of what the library should provide, not of the implementation
- MPI-1.0 released in 1994; MPI-2.2 (the latest) released in 2009; the MPI-3.0 standard is ongoing
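A minimal point-to-point example of the specification in use (a sketch, not from the slides; works with any MPI implementation, and must be run with at least two ranks):

```c
/* hello_msg.c: rank 1 sends a message, rank 0 receives and prints it. */
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    char msg[32];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        strcpy(msg, "hello from rank 1");
        MPI_Send(msg, (int)strlen(msg) + 1, MPI_CHAR, 0, 99, MPI_COMM_WORLD);
    } else if (rank == 0) {
        MPI_Recv(msg, (int)sizeof msg, MPI_CHAR, 1, 99, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 0 received: %s\n", msg);
    }

    MPI_Finalize();
    return 0;
}
```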

MPI tutorials: https://computing.llnl.gov/tutorials/mpi