(ZIH) Center for Information Services and High Performance Computing
Overview over the x86 Processor Architecture
Daniel Molka (Daniel.Molka@tu-dresden.de), Ulf Markwardt (ulf.markwardt@tu-dresden.de)

Outline
- Instruction set: instruction format, ISA extensions
- Pipelining
- Out-of-order execution
- Memory access: address translation, TLB
- Measurements on current multicore processors: complex cache hierarchy, remote memory access

Multicore Processors
- Multilevel caches
  - L1/L2 per core; accesses to other cores' caches can have high overhead
  - Shared L3: competitively shared; interference due to replacements caused by other cores and due to bandwidth limitations of concurrent accesses
- Shared memory controller and system interface
  - Cores compete for bandwidth
  - One core cannot always utilize the available memory bandwidth

Instruction set
- x86 integer instructions: originally 16 bit, later extended to 32 bit (IA-32) and 64 bit (AMD64)
- FPU: floating point instructions on 32, 64, and 80 bit operands
- SIMD extensions: single instruction, multiple data
  - Reduces code size if applicable
  - MMX: 64 bit registers; 8x 8 bit / 4x 16 bit / 2x 32 bit integer ops per cycle
  - SSE: 128 bit registers; 4x 32 bit / 2x 64 bit floating point ops per cycle; 16x 8 bit / 8x 16 bit / 4x 32 bit / 2x 64 bit integer ops per cycle
  - Coming soon: AVX with 256 bit registers

Instruction format
- CISC (Complex Instruction Set Computing)
  - Many instruction formats, variable instruction length, multiple addressing modes
- RISC-like (Reduced Instruction Set Computing) execution
  - Complex x86 instructions are decoded into RISC-like internal instructions, so-called micro-ops
  - Load/store architecture: special instructions for data movement between registers and memory; execution units (ALU, FPU) only work on registers

Architectural Registers
- The original x86 had very few registers, which required frequent memory accesses
- Extensions added more and wider registers

Streaming SIMD Extensions on Intel and AMD processors
- Intel
  - SSE (1999, Pentium III): single precision floating point instructions on XMM registers; additional integer instructions on MMX registers
  - SSE2: double precision floating point instructions on XMM registers; integer instructions on XMM registers
  - SSE3: additional floating point instructions
  - SSSE3 (Supplemental SSE3): additional integer instructions
  - SSE4.1: scalar product, blending, min, max
  - SSE4.2: string operations, CRC32
- AMD
  - 3DNow! (1998, AMD K6-2)
  - SSE, SSE2, SSE3: compatible with the Intel definition
  - SSE4A: insert and extract instructions, not compatible with Intel's SSE4

Using SIMD instructions
- Aligned and unaligned load/store instructions transfer 128 bit of data between registers and memory
- Aligned instructions are faster, but require double quadwords to be 128-bit aligned
- The standard function malloc() allocates unaligned memory
- Compilers fall back to unaligned instructions if the alignment is unknown

Using SIMD instructions
- Available via intrinsics, assembly, or auto-generated by the compiler
- icc
  - -msse2 (default), -msse3, -mssse3, -msse4.1
  - -msse2, -msse3: generated code is compatible with AMD processors
  - -x and -ax: optimized code for Intel processors; -x requires the specified extension, -ax includes a fallback implementation according to -m/-x
    - E.g. -msse2 -axsse4.2,sse4.1,ssse3,sse3 runs on all processors with SSE2; -xsse4.2 only runs on Intel Core i7 and Nehalem-based Xeons
- gcc
  - -msse, -msse2, -msse3, -mssse3
  - -msse, -msse2 enabled by default on 64-bit systems

Basic pipelining
- Sequential processing: each instruction (e.g. A = B+C; E = D*D; G = F*F) passes through all stages before the next one starts
- Pipelined processing: the stages of consecutive instructions overlap in time
- Pipeline stages (IF ID DF IE MA WB):
  - Instruction fetch: retrieves the instruction from the cache
  - Instruction decode: decodes the instruction and looks for operands (register or immediate values)
  - Data fetch: reads the operands
  - Execute: performs the instruction (ADD, SUB, ...)
  - Memory access: accesses the memory, and writes data or retrieves data from it
  - Write back (retire): records the calculated value in a register

Pipelining
- Each pipeline stage can work on a different instruction
- Superscalar pipelines (e.g. a 2-way superscalar pipeline) can process multiple instructions in each stage
- Ideally, n instructions finish each cycle in an n-way superscalar architecture

Longer Pipelines
- Phases can be split further
  - Decode: pre-decode (e.g. determine the instruction length if it is not constant; x86: 1 to 18 bytes), then analyse the opcode
  - Execution, e.g. floating point addition: align exponents, add significands, normalize result; floating point multiplication: add exponents, multiply significands, normalize result
- Hyper-pipelining / super-pipelining on modern microprocessors with approx. 20 stages

Out-of-order execution
- Reordering of instructions
  - Micro-ops do not have to be executed in program order
  - Slow operations (e.g. division) do not stall execution: the scheduler can send subsequent instructions to the execution units if they have no unsatisfied dependencies
- In-order completion
  - Execution units work on internal registers
  - The reorder buffer commits (write back / retirement) results to the architectural registers in program order

AMD dual-core Opteron (K8 microarchitecture)
- Decodes up to 3 instructions per cycle; 128 bit SSE instructions are split into two 64 bit ops
- 72 macro-ops in flight
- 3 integer units, 2 FP pipelines (plus store)
- Two 64 bit load/store operations per cycle
- L1/L2 per core; system interface shared between both cores

Intel dual-core Xeon (Core microarchitecture)
- Decodes up to 4 instructions per cycle (5 with macro-op fusion)
- 128 bit wide SIMD units
- 96 micro-ops in flight
- 3 integer units, 2 FP pipelines
- One 128 bit load and one 128 bit store per cycle
- L1 per core, shared L2; system interface shared between both cores

Virtual Memory
- Each application has its own address space
  - Isolation between processes
  - No restrictions on which address regions can be used
  - Shared memory possible if required
- Multiple page sizes: 4 KiB, 2 MiB, 1 GiB (AMD64)

Fast Translation: Translation Lookaside Buffer (TLB)
- Stores translations from virtual page addresses (frame numbers) to physical page addresses
- Very low latency required to provide physical addresses for accessing the caches
- Multilevel TLBs are common: fast L1 TLB, larger L2 TLB
- Limited number of entries
  - Usually not enough 4 KiB entries to cover today's cache sizes
    - E.g. Core i7: 8 MiB L3 cache, but 512 entries of 4 KiB cover only 2 MiB
    - Using 2 MiB pages (hugetlbfs) can reduce TLB misses significantly (e.g. Core i7: 32 entries cover 64 MiB)
- Transparent for applications
  - Not programmable; entries are generated on demand
  - The operating system can invalidate TLBs (e.g. on context switches or after freeing memory)
- A TLB miss results in a complex and slow address translation

Slow Translation: Page Table Walk (after TLB miss)
- Multiple stages of translation tables
  - An individual base address per process (from the CR3 register) points to the first table
  - Parts of the virtual address are used as indices into the tables
  - The first 3 tables contain pointers to further tables; the 4th table stores the required physical address
  - Only the required tables are allocated, which needs less space than one big table
- Page tables are written by the operating system; the translation is done in hardware
- 4 additional memory accesses are needed to resolve a physical address
- After translation, the result is stored in the TLB for future use
- Larger pages are supported as well (2 MiB pages: 21 bit offset, 3 translation stages)

Example: Copy Bandwidth
- Opteron 2384 (Shanghai)
  - Dual-ported L1 cache: 2x 128 bit loads/cycle or 2x 64 bit stores/cycle
  - 64 KiB L1-I, 64 KiB L1-D, 512 KiB L2, 6 MiB L3
- Xeon X5570 (Nehalem)
  - Dual-ported L1 cache: 1x 128 bit load/cycle and 1x 128 bit store/cycle
  - 32 KiB L1-I, 32 KiB L1-D, 256 KiB L2, 8 MiB L3

Example: Opteron 2384 multithreaded memory bandwidth
- All cache levels scale well for reading and writing
- Cores do not compete for cache bandwidth (but for L3 size)
- 2 threads are required to fully utilize the memory bandwidth

Example: Xeon X5570 multithreaded memory bandwidth
- L1/L2 scale well
- L3 read bandwidth scales well up to 3 threads; L3 write bandwidth does not scale well
- Cores compete for L3 size and bandwidth
- 3 threads are required to fully utilize the memory bandwidth

Current 2-socket systems
- Comparable system architecture on both platforms
  - NUMA architecture: each processor has its own memory controller
  - Point-to-point connections between the processors
  - Low latency to local memory, higher latency to remote RAM
  - Limited interconnect bandwidth
- AMD Opteron 2384: 2x DDR2-667, 10.6 GB/s per socket; HyperTransport 1.1: 8 GB/s, HT 3.0: 17.6 GB/s
- Intel Xeon X5570: 3x DDR3-1333, 32 GB/s per socket; QPI: 25.6 GB/s

Example: Opteron 2384 memory latency
- Faster access to modified cache lines in other cores than to exclusive cache lines
- The L3 cache is much faster than accesses to other cores' local caches
- Remote caches are slower than local memory
- Remote memory access adds about 40 ns

Example: Xeon X5570 memory latency
- The inclusive L3 cache handles requests for unmodified data in other cores
- Fast access to modified data in other cores on the same die
- Accesses to modified cache lines of the second processor cause a write back to main memory
- Remote memory access adds about 40 ns

Example: Opteron 2384 memory bandwidth
- Bandwidth to other cores on the same die is as low as the main memory bandwidth
- Accesses to the other processor are limited by HyperTransport

Example: Xeon X5570 memory bandwidth
- High bandwidth for on-die accesses to unmodified data
- Low bandwidth for cache-to-cache transfers of modified data
- Accesses to the other processor are limited by the QuickPath Interconnect

Optimizations: SIMD
- icc automatically vectorizes loops if -O3 is specified
  - Replaces 4 (single precision) or 2 (double precision) loop iterations with one iteration using SIMD instructions
  - Significantly improves performance
  - Use simple loop constructs (e.g. for(i=0;i<n;i++){...}), no for(;;){...}
  - Check the compiler output to see whether performance-critical loops were vectorized
- #pragma vector aligned: hint for the compiler to use aligned load and store instructions
- #pragma vector nontemporal: hint for the compiler that written data is not accessed again; writes data directly into main memory and avoids cache pollution

Optimizations: Hugetlbfs
- Needs kernel support (enabled by default in most current distributions)
- /proc/meminfo shows the available number of hugepages
- Has to be mounted as a virtual file system:
  echo <#pages> > /proc/sys/vm/nr_hugepages
  mount -t hugetlbfs nodev /mnt/huge
  chown root:users /mnt/huge
  chmod 777 /mnt/huge
- Create a file in the hugetlbfs and use mmap() to allocate the memory:
  fd = open(filename, O_CREAT|O_RDWR, 0666);
  buf = mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
- Can significantly reduce TLB misses

Further Information
- Intel 64 and IA-32 Architectures Optimization Reference Manual: http://www.intel.com/products/processor/manuals/
- Software Optimization Guide for AMD Family 10h Processors: http://support.amd.com/de/pages/techdocs.aspx
- AMD Software Optimization Videos: http://developer.amd.com/documentation/videos/pages/softwareoptimizationvideoseries.aspx