Beware Of Your Cacheline

Processor Specific Optimization Techniques

Hagen Paul Pfeifer
hagen@jauu.net
http://protocol-laboratories.net
Jan 18, 2007

Introduction

Why?
- memory bandwidth is high (more or less), but memory latency is increasingly the problem compared to CPU speed
- BTW: excuse my Intel/AMD affinity

Cache sizes of some current processors:
- Core 2 Extreme QX6700: L1D 32 KByte 8-way; L1I 32 KByte; L2 2x 4 MByte 16-way
- Core 2: L1D 32 KByte 8-way; L1I 32 KByte; L2 4/2 MByte 16-way
- Athlon 64 X2 / FX-62: L1D 64 KByte 2-way; L1I 64 KByte; L2 2x 512 KByte / 2x 1 MByte (FX-62) 16-way
- Pentium 4: L1D 16 KByte 8-way; L1I 12 Kuops (equivalent to 8-16 KByte); L2 1 MByte 8-way
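
The cacheline size itself can be queried at run time. A minimal sketch, assuming Linux with glibc, which exposes the L1D cache geometry through sysconf():

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            /* glibc extension; may return 0 or -1 if the size is unknown */
            long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);

            printf("L1D cacheline size: %ld bytes\n", line);
            return 0;
    }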

Prefetch

Use prefetch when:
- data sits at contiguous memory locations and is accessed in ascending order
- the loop processes large amounts of data
- if the data handled per iteration is less than a cacheline: unroll the loop (by hand, to keep full control - avoid -funroll-loops)

AMD PREFETCH and PREFETCHW
- K6-2: small performance improvement (shared FSB, low priority) - don't add overhead through useless prefetching
- K6-III: separate backside bus - up to 20% improvement

- 64 bytes are fetched per prefetch (one cacheline)
- AMD tip: prefetch 192 (3*64) bytes ahead, i.e. three cachelines (the prefetch distance)
- BTW: FSB and AMD is a separate topic (keywords: HyperTransport, integrated memory controller, Direct Connect Architecture)

Usage (AMD) - prefetch/prefetchw are assumed here to be thin wrappers around GCC's __builtin_prefetch so that the example compiles as-is:

    #define prefetch(x)  __builtin_prefetch(&(x), 0)   /* read       */
    #define prefetchw(x) __builtin_prefetch(&(x), 1)   /* read/write */

    double a[a_really_large_number];
    double b[a_really_large_number];
    double c[a_really_large_number];

    for (i = 0; i < a_really_large_number / 4; i++) {
            prefetchw(a[i * 4 + 64]);        /* we will be modifying a */
            prefetch(b[i * 4 + 64]);
            prefetch(c[i * 4 + 64]);
            a[i * 4]     = b[i * 4]     * c[i * 4];
            a[i * 4 + 1] = b[i * 4 + 1] * c[i * 4 + 1];
            a[i * 4 + 2] = b[i * 4 + 2] * c[i * 4 + 2];
            a[i * 4 + 3] = b[i * 4 + 3] * c[i * 4 + 3];
    }

memcpy

movdqa/movdqu (move double quadword aligned/unaligned)
- SIMD extension: moves a double quadword between an xmm register and another xmm register or a memory location
- movdqa requires 16-byte alignment, otherwise it raises a general-protection exception
- advantage: faster than an ordinary memcpy for bigger sizes

Prefetch instructions
- P4: prefetchnta, prefetcht0, prefetcht1, prefetcht2
- prefetch granularity: P3 32 byte, P4 128 byte
- like madvise(2) it is a hint, not a command
- prefetchnta: prefetches data into the L2 cache without polluting the L1 cache (non-temporal)
- SMP: reduces CPU cache traffic
- BTW: icc replaces memcpy() from string.h with a processor-specific version (handles alignment, TLB priming and prefetching)
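
A minimal sketch of such a SIMD copy loop, using the SSE2 intrinsics that map to movdqa and prefetchnta. This only illustrates the idea, it is not the tuned memcpy the compilers ship; the function name and the size/alignment assumptions are made up for this example:

    #include <stddef.h>
    #include <emmintrin.h>  /* SSE2: _mm_load_si128/_mm_store_si128 (movdqa) */

    /* Copy n bytes; n must be a multiple of 64 and both buffers 16-byte
     * aligned, otherwise movdqa faults (use the movdqu intrinsics
     * _mm_loadu_si128/_mm_storeu_si128 for the unaligned case). */
    static void copy_sse2(void *dst, const void *src, size_t n)
    {
            __m128i *d = dst;
            const __m128i *s = src;

            for (size_t i = 0; i < n / 16; i += 4) {
                    /* pull the data in ahead of time without polluting L1;
                     * prefetch is only a hint, so running past the end of
                     * the buffer is harmless */
                    _mm_prefetch((const char *)(s + i) + 256, _MM_HINT_NTA);
                    _mm_store_si128(d + i,     _mm_load_si128(s + i));
                    _mm_store_si128(d + i + 1, _mm_load_si128(s + i + 1));
                    _mm_store_si128(d + i + 2, _mm_load_si128(s + i + 2));
                    _mm_store_si128(d + i + 3, _mm_load_si128(s + i + 3));
            }
    }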

MMX, SSE, SSE2, SSE4

SIMD: single instruction, multiple data
- one instruction operates on more than one data element concurrently

MMX:
- shares state with the FP unit: mm[0-7] are aliases for the x87 FPU registers, so the FPU state must be restored (emms) afterwards
- no over- or underflow exceptions (no carry, overflow or adjust flags), as with all SIMD instructions
- wrap-around or saturation mode! (see the sketch below)

SSE2:
- mainly graphics and codec processing
- added 144 instructions
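
The wrap-around vs. saturation distinction is easiest to see in code. A small sketch using the SSE2 integer intrinsics, adding the same packed unsigned bytes once with wrapping and once with saturating arithmetic:

    #include <stdio.h>
    #include <emmintrin.h>  /* SSE2 integer intrinsics */

    int main(void)
    {
            __m128i a = _mm_set1_epi8((char)200);
            __m128i b = _mm_set1_epi8((char)100);

            __m128i wrap = _mm_add_epi8(a, b);   /* 200 + 100 wraps to 44 */
            __m128i sat  = _mm_adds_epu8(a, b);  /* saturates to 255      */

            unsigned char w[16], s[16];
            _mm_storeu_si128((__m128i *)w, wrap);
            _mm_storeu_si128((__m128i *)s, sat);
            printf("wrap-around: %d, saturated: %d\n", w[0], s[0]);
            return 0;
    }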

SSE3:
- introduced with the Pentium 4 (Prescott)
- 13 additional SIMD instructions
- thread synchronisation (MONITOR, MWAIT)

SSSE3:
- introduced with the Xeon 5100 and the Core 2 Duo
- 32 new instructions (16 operating on the 64-bit MMX registers, 16 on the 128-bit XMM registers)

SSE4 - Intel's Core architecture extension (Nehalem New Instructions):
- about 50 new opcodes
- major improvements: string and text processing, vectorisation, hardware CRC32 calculation (crc32, see the sketch below)
- new advanced string instructions
- release: 2008
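
The hardware CRC32 is reachable from C through the SSE4.2 intrinsics. A minimal sketch, assuming a CPU with SSE4.2 and compilation with -msse4.2; note that the crc32 instruction computes CRC32-C (Castagnoli polynomial), not the zlib/Ethernet CRC32:

    #include <stdio.h>
    #include <nmmintrin.h>  /* SSE4.2: _mm_crc32_* */

    int main(void)
    {
            const unsigned char data[] = "beware of your cacheline";
            unsigned int crc = 0xffffffff;

            for (size_t i = 0; i < sizeof(data) - 1; i++)
                    crc = _mm_crc32_u8(crc, data[i]);  /* one crc32 per byte */

            printf("CRC32-C: 0x%08x\n", crc ^ 0xffffffff);
            return 0;
    }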

GCC Attributes

aligned - specifies a minimum alignment (e.g. to avoid unaligned-access exceptions)

    int x __attribute__ ((aligned (16))) = 0;   /* align on a 16-byte boundary */
    short array[3] __attribute__ ((aligned));   /* no value: use the largest useful alignment */

packed - smallest possible alignment

    struct foo {
            char a;
            int x[2] __attribute__ ((packed));
    };

    struct __attribute__ ((packed)) bar { ... };

Beware of your architecture: packed members are e.g. a performance killer on PowerPC

    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

Incidentally: this code isn't portable anymore! ;-)

#pragma
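
A small sketch of what these attributes do to the layout; the struct names are made up and the sizes in the comments assume a typical x86 ABI:

    #include <stdio.h>

    /* natural layout: 3 padding bytes so the int is 4-byte aligned */
    struct plain { char a; int b; };                                       /* size 8  */

    /* packed: no padding, but b may land on an unaligned address */
    struct __attribute__ ((packed)) packed_foo { char a; int b; };         /* size 5  */

    /* force the whole object onto a 16-byte boundary, e.g. for SSE */
    struct __attribute__ ((aligned (16))) aligned_foo { char a; int b; };  /* size 16 */

    int main(void)
    {
            printf("plain %zu, packed %zu, aligned %zu\n",
                   sizeof(struct plain), sizeof(struct packed_foo),
                   sizeof(struct aligned_foo));
            return 0;
    }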

Tools

- gcc -S: play with gcc flags and look at the generated assembly!
- rdtscll (read the time stamp counter)
- cachegrind, a cache profiler:
    valgrind --tool=cachegrind command
    cg_annotate
- pahole (Arnaldo Carvalho de Melo): utilises the DWARF2 debug information

    [acme@newtoy net-2.6]$ pahole kernel/sched.o task_struct
    /* include2/asm/system.h:11 */
    struct task_struct {
            volatile long int    state;        /*   0   4 */
            struct thread_info * thread_info;  /*   4   4 */
            atomic_t             usage;        /*   8   4 */
            long unsigned int    flags;        /*  12   4 */
            <SNIP>
            short unsigned int   ioprio;       /*  52   2 */
            /* XXX 2 bytes hole, try to pack */
            long unsigned int    sleep_avg;    /*  56   4 */
            unsigned char        fpu_counter;  /* 388   1 */
            /* XXX 3 bytes hole, try to pack */
            int                  oomkilladj;   /* 392   4 */
            <SNIP>
    }; /* size: 1312, sum members: 1287, holes: 3, sum holes: 13, padding: 12 */

- pfunct: print function details
- git://git.kernel.org/pub/scm/linux/kernel/git/acme/pahole.git
- vtune (Intel, Single Developer: $699)
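
Acting on pahole's "try to pack" hints usually means reordering members so that the small fields fill what would otherwise be padding. A made-up example (not task_struct):

    #include <stdio.h>

    /* 2-byte hole after c, 3-byte hole after e */
    struct before { short c; int d; char e; int f; };  /* size 16 on x86 */

    /* same members, reordered: the small fields share one word */
    struct after  { int d; int f; short c; char e; };  /* size 12 on x86 */

    int main(void)
    {
            printf("before: %zu bytes, after: %zu bytes\n",
                   sizeof(struct before), sizeof(struct after));
            return 0;
    }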

Additional Information

Some more tips:
- avoid branching: use branchless instructions
- sched_setaffinity(2): pin tasks to CPUs and avoid SMP cache thrashing (see the sketch below)
- CPUID opcode (CPU IDentification)
- gcc x86 built-in functions, e.g. v2df __builtin_ia32_addsubpd (v2df, v2df)
- -march=, -msse or -msse2, -mfpmath= (387, sse)
- use strlen(), avoid while (*p++) ++i;
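
A minimal sched_setaffinity(2) sketch, pinning the calling process to CPU 0 (Linux-specific; CPU 0 is an arbitrary choice for the example):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
            cpu_set_t set;

            CPU_ZERO(&set);
            CPU_SET(0, &set);  /* run only on CPU 0 */

            /* pid 0 means the calling process */
            if (sched_setaffinity(0, sizeof(set), &set) != 0) {
                    perror("sched_setaffinity");
                    return 1;
            }
            printf("pinned to CPU 0\n");
            return 0;
    }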

- PowerPC 4xx supports the dlmzb instruction (determine left-most zero byte)
- strcpy() is another candidate: dlmzb combined with lswx and stswx
- Intel Smart Memory Access
- Intel Advanced Smart Cache

Fin

Questions?

Additional Information

Links:
- A Solaris memcpy implementation for Pentium CPUs
- SSE4 Introduction: Extending the World's Most Popular Processor Architecture
- x86 Instruction Reference
- 3DNow! Instruction Porting Guide
- How Cachegrind works
- X86 Built-in Functions (GCC manual)
- Intel 386 and AMD x86-64 Options (GCC manual)
- Data Cache Optimizations for Java Programs

Contact

Hagen Paul Pfeifer
E-Mail: hagen@jauu.net
Key-ID: 0x98350C22
Fingerprint: 490F 557B 6C48 6D7E 5706 2EA2 4A22 8D45 9835 0C22