Masterpraktikum Scientific Computing

Masterpraktikum Scientific Computing High-Performance Computing Michael Bader Alexander Heinecke Technische Universität München, Germany

Outline
- Logins
- Levels of Parallelism
- Single Processor Systems
  - Von-Neumann-Principle
  - Accelerating von Neumann: Pipelining
  - Loop Optimizations
  - Pipeline Conflicts
  - Dependency Analysis
  - Caches
- Vector Computers
  - Architecture of a Vector Computer
  - Example: SSE Extension of the Intel Core 2
  - Dependency Analysis and Vectorization
Introduction, Caches, Vectorization, October 16, 2012

Logins
LRZ logins: one login per participant.
Change your password at https://idportal.lrz.de/r/entry.pl: choose "Passwort ändern" ("change password") on the left side, type the old password, repeat the new one twice, and hit "Passwort ändern".
ssh xyz@lxlogin1.lrz.de
then: ssh lxa1 (MPP Cluster, AMD MC)
or: ssh ice1-login (ICE Cluster, Intel NHL-EP)
Documentation: http://www.lrz.de/services/compute/linux-cluster/intro/


Levels of Parallelism
Parallel applications can be classified by their granularity of parallelism; different kinds of hardware support these different levels.
- intra-instruction level: vector computers (e.g. NEC SX, Cray SV), SSE, AVX, Intel MIC, Cell, GPUs, ...
- instruction level: superscalar architectures (e.g. workstation/PC), VLIW architectures (very long instruction word, e.g. Itanium, AMD GPUs), pipelining, ... (ILP: Instruction Level Parallelism)

- thread level: shared-memory machines (e.g. SUN UltraSparc, SGI Altix), ...
- process level: distributed-memory machines (Cray T3E, IBM SP), ...
- application level: workstation clusters (SuperMUC), grid computers, ...

Classification according to Flynn
Classification by instruction flow and/vs. data flow: handling several instructions simultaneously, handling several data elements simultaneously.
- SISD: scalar machine (486)
- SIMD: vector machine (since Pentium MMX!), GPGPU (VLIW)
- MISD
- MIMD: GPGPU (SIMT)


Von-Neumann-Principle
Figure: Von-Neumann machine
Figure: Stages of an instruction cycle

Pipelining
Instruction pipelining: overlapping independent instructions increases instruction throughput. After a certain preload time, a pipeline with k stages retires one instruction per cycle, starting with cycle k!
Figure: execution with and without pipelining
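The one-instruction-per-cycle claim can be made quantitative. A sketch of the standard timing argument, assuming n independent instructions, k stages, cycle time τ, and a non-pipelined unit that needs k cycles per instruction:

```latex
T_{\mathrm{seq}} = n \, k \, \tau, \qquad
T_{\mathrm{pipe}} = (k + n - 1)\,\tau, \qquad
S = \frac{T_{\mathrm{seq}}}{T_{\mathrm{pipe}}}
  = \frac{n k}{k + n - 1} \;\xrightarrow{\,n \to \infty\,}\; k
```

The k − 1 extra cycles in T_pipe are the preload (fill) time mentioned above; for long instruction streams the speedup approaches the number of stages.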

Arithmetic Pipelining
Example: floating-point addition in four steps:
- load operands
- compare exponents, shift mantissa
- add
- normalize and write back to registers
Figure: vector operation pipeline

Pipelining Optimizations
Loop unrolling:

for(i=0;i<3;i++)
  for(k=0;k<m;k++)
    c[i][k] = a[i][k] * b[i][k];

becomes

for(k=0;k<m;k++) {
  c[0][k] = a[0][k] * b[0][k];
  c[1][k] = a[1][k] * b[1][k];
  c[2][k] = a[2][k] * b[2][k];
}

Loop fusion:

for(i=1;i<n;i++)
  d[i] = d[i-1] * e[i];
for(i=1;i<n;i++)
  f[i] = f[i] + c[i-1];

becomes

for(i=1;i<n;i++) {
  d[i] = d[i-1] * e[i];
  f[i] = f[i] + c[i-1];
}

For further details, see the Intel C++ Compiler User Guide.

Pipeline Conflicts
- structural conflict: parallel execution is prohibited due to hardware limitations (execution units, memory ports)
- branch conflict: jumps; countered by branch prediction and speculative execution
- data conflict: conflicting data accesses (c = a + b; d = c + e)
Compilers try to avoid conflicts as best they can; the hardware plays several tricks to avoid conflicts at runtime!

Dependency Analysis
S1: A = C - A
S2: A = B - C
S3: B = A + C
True dependence (RAW) (e.g. S3 on S2 w.r.t. A): S3 has a true dependence on S2, S3 δ S2, if OUT(S2) ∩ IN(S3) ≠ ∅; S3 uses the result of S2.
Anti dependence (WAR) (e.g. S3 on S2 w.r.t. B): S3 has an anti dependence on S2, S3 δ⁻¹ S2, if IN(S2) ∩ OUT(S3) ≠ ∅; exchanging S3 and S2 would make S2 use the result of S3.

Dependency Analysis (cont.)
S1: A = C - A
S2: A = B - C
S3: B = A + C
Output dependence (WAW) (e.g. S2 on S1 w.r.t. A): S2 has an output dependence on S1, S2 δ° S1, if OUT(S1) ∩ OUT(S2) ≠ ∅; exchanging S2 and S1 results in wrong values.
Dependency analysis checks whether pipelining or vectorization is possible!


Caches - Motivation
Processor speed vs. memory speed

Memory Hierarchy

Caches
Caches are transparent buffer memories exploiting locality:
- temporal locality: if we access address x, we will probably access address x again in the near future.
- spatial locality: if we access address x, we will probably access addresses x + δ or x − δ, too.

Caches - Structure
Example: 512 cache lines of 64 bytes each (= 32 KB), 32-bit addresses (6-bit block offset, 9-bit index, 17-bit tag)

Caches - Examples
Intel Core i7:
- 32 KB L1 data cache, 8-way associative
- 32 KB L1 instruction cache
- 256 KB L2 cache, 8-way associative
- 8 MB shared L3 cache, 16-way associative
- 64-byte cache lines
AMD Phenom II X4:
- 64 KB L1 data cache, 2-way associative
- 64 KB L1 instruction cache
- 512 KB L2 cache, 16-way associative
- 6 MB shared L3 cache, 48-way associative
- 64-byte cache lines

Caches - How to?
- mapping strategy (associativity): fully associative cache, direct mapped cache, set associative cache
- replacement strategy: First In First Out (FIFO), Least Recently Used (LRU)
- update strategy: write through, write back
- cache coherency (organizes data validity across non-shared caches)

Cache Misses
- cold misses: first access of data at address x (not avoidable!)
- capacity misses: eviction of data at address x due to the limited size of the cache (level)
- conflict misses: eviction of data at address x due to the associativity of the cache (level)

Cache Blocking
Example: matrix multiplication

for(i = 0; i < n; i++)
  for(j = 0; j < n; j++)
    for(k = 0; k < n; k++)
      c[i][j] += a[i][k] * b[k][j];

Cache Blocking (cont.)
For the matrix multiplication example:
- Register blocking: the bandwidth to the L1 cache is too small to move 3 operands per cycle (e.g. Intel Sandy Bridge delivers 48 bytes per cycle, but we would need 96 bytes!)
- Cache blocking: the bandwidth of main memory is orders of magnitude lower than that of the caches (200 cycles vs. 10 cycles)
- TLB blocking: change the data structure to avoid DTLB misses; use huge pages


Architecture of a Vector Computer
In the 80s and early 90s: a machine with the pipelined execution units discussed above, operating on arrays of floating-point numbers.
Today: execution of entire vector instructions on vector registers.

Example: SSE Extension of the Intel Core 2

SSE Registers
Figure: Streaming SIMD Extensions data types
Streaming SIMD Extensions (SSE):
- 16 registers (xmm0 to xmm15, each 128 bits wide), holding
- 2 double-precision floating-point values (64 bits each), or
- 4 single-precision floating-point values (32 bits each), or
- from 2 64-bit integer values up to 16 8-bit integer values

AVX Registers
Advanced Vector Extensions (AVX):
- available since Intel Sandy Bridge and AMD Bulldozer
- 16 registers (ymm0 to ymm15, each 256 bits wide), holding
- 4 double-precision floating-point values (64 bits each), or
- 8 single-precision floating-point values (32 bits each)
- currently no integer support; will be added with AVX2 (Intel Haswell)
- full masking support

Auto-Vectorization

#pragma vector always
#pragma ivdep
for(i=1;i<n;i++) {
  c[i] = s * c[i];
}

Intel compiler pragmas for Intel64 vectorization:
- #pragma vector always: vectorize in any case (even if the compiler detects inefficiencies)
- #pragma ivdep: disable some dependency checks during compilation
- #pragma simd: aggressive vectorization, might result in wrong code!
Further information (SSE & AVX): Intel Architecture Optimization Reference Manual, Intel C++ Compiler User's Guide

Dependency Analysis and Vectorization
Loop I:

for(i=0;i<n;i++) {
  a[i] = b[i] + c[i];
}

No dependency across iterations: can be optimally vectorized and pipelined.
Loop II:

for(i=1;i<n-1;i++) {
  c[i] = c[i] * c[i-1];
}

Recursion! c[i-1] is still being processed when c[i] is calculated: no vectorization is possible.

SSE/AVX Intrinsic Functions

__m128i; __m128d;  // register variables, SSE
__m256i; __m256d;  // register variables, AVX

__m128d _mm_instr_pd(__m128d a, __m128d b)
__m256d _mm256_instr_pd(__m256d a, __m256d b)

Suffixes:
- ps, pd: packed single- and double-precision floating-point functions
- ss, sd: scalar single- and double-precision floating-point functions
- ...
Documentation of all functions in the chapter "Intel C++ Intrinsics Reference" of the Intel C++ Compiler User's Guide.

Example: SSE Intrinsic Functions

#include <immintrin.h>

int sse_add(int length, double* a, double* b, double* c) {
    int i;
    __m128d t0, t1;

    for (i = 0; i < length; i += 2) {
        // loading 2 64-bit double values
        t0 = _mm_load_pd(&a[i]);
        t1 = _mm_load_pd(&b[i]);
        // adding 2 64-bit double values
        t0 = _mm_add_pd(t0, t1);
        // storing 2 64-bit double values
        _mm_store_pd(&c[i], t0);
    }
    return 0;
}

Example: AVX Intrinsic Functions

#include <immintrin.h>

int avx_add(int length, double* a, double* b, double* c) {
    int i;
    __m256d t0, t1;

    for (i = 0; i < length; i += 4) {
        // loading 4 64-bit double values
        t0 = _mm256_load_pd(&a[i]);
        t1 = _mm256_load_pd(&b[i]);
        // adding 4 64-bit double values
        t0 = _mm256_add_pd(t0, t1);
        // storing 4 64-bit double values
        _mm256_store_pd(&c[i], t0);
    }
    return 0;
}