How to write powerful parallel Applications

Warren Phelps

1 How to write powerful parallel Applications
Agenda
08:…         Welcome and Coffee
…            Introduction to the Intel Microarchitecture and Software Implications
…:15         Introduction to Software Design Cycle - From Serial to Parallel Applications
10:15-10:30  Break
10:30-11:30  How to Optimize Applications and Identify Areas for Parallelization
11:30-12:30  Introduction to Parallel Programming Methods
12:30-13:30  Lunch
13:30-14:30  Expressing Parallelism: Using Intel C++ and Fortran Compilers, Professional Editions 10.1, for Performance, Multi-threading
14:30-15:15  Expressing Parallelism: Introducing Threading Through Libraries
15:15-15:30  Break
15:30-16:00  Pinpoint Program Inefficiencies and Threading Bugs - Races and Deadlocks
16:00-16:45  Performance Tuning Threaded Software using Intel VTune Performance Analyzer and Thread Profiler
16:45-17:15  Parallel Programming Techniques and Program Testing in Cluster Environments

2 Intel Core Microarchitecture Edmund Preiss EMEA Software Solutions Group

3 Core Architecture
- Moore's Law and Processor Evolution
- Introduction on Core architecture
- New features added in 2007: Intro to 45nm Technology -> Shrink
- New Core Advanced Features
- Selected Software Implications

4 Implications of Moore's Law
As the number of transistors goes UP, cost per transistor goes DOWN.
Scaling + Wafer Size + Volume = Lower Costs
(chart: logarithmic axes for transistor count and cost per transistor)
Source: WSTS/quest/Intel

5 New Microarchitecture History
Architectures: EPIC* (Itanium), x86, IXA* (XScale)
x86 microarchitectures and examples:
  P5              Pentium
  P6              Pentium Pro, Pentium II/III
  Intel NetBurst  Pentium 4, Pentium D, Xeon
  Banias          Pentium M, Core Duo
  Intel Core      Conroe, Woodcrest, Merom
* IXA: Intel Internet Exchange Architecture; EPIC: Explicitly Parallel Instruction Computing

6 Intel Processor Family Design Cycles
65nm (2 YEARS)
  Shrink/Derivative: Presler, Yonah, Dempsey
  New Microarchitecture: Intel Core Microarchitecture
45nm (2 YEARS)
  Shrink/Derivative: Penryn Family
  New Microarchitecture: Nehalem
32nm (2 YEARS)
  Shrink/Derivative
  New Microarchitecture
- Increase performance per given clock cycle
- Increase processor frequencies
- Extend energy efficiency
- Deliver lead product for 45nm High-k + metal gate process technology
- Deliver optimized processors across each product segment and power envelope

7 Details of the Intel Core Architecture

8 Intel Core Innovations
- Intel Wide Dynamic Execution
- Intel Intelligent Power Capability
- Intel Advanced Digital Media Boost
- Intel Smart Memory Access
- Intel Advanced Smart Cache
(diagram: Core 1 and Core 2 with a shared L2 Cache and Bus; keywords Wider, Deeper, Faster, Smarter)

9 Core(TM) vs. NetBurst(TM) µ-arch: Overview
Processor component           Intel NetBurst(TM)    Intel Core(TM)
Pipeline Stages               31                    14
Threads per core              2                     1
L1 Cache Org.                 (12K uop I/16K D)     (32K I/32K D)
L2 Cache Org.                 2 x 2MB               1 x 4MB (shared)
Instr. Decoders               1                     4
Integer Units                 2 (2x core freq)      3 (1x core freq)
SIMD Units                    2 x 64-bits           3 x 128-bits
SIMD Inst. Issued per Clock   1                     3
FP Units                      3 (Add/Mul/Div)       3 (Add/Mul/Div)
FP Inst. Issued per clock     1                     Up to 2 (Add + Mul or Div)
Power means per core          135W                  80W

10 45nm Technology
Penryn: code name for an enhanced Intel Core(TM) microarchitecture at 45 nm
- Industry's first 45 nm High-K processor technology
- ~2x transistor density
- >20% gain in transistor switching speed
- ~30% decrease in transistor switching power
- Dual core, quad core
- Shared L2 cache
- Intel 64 architecture
- 128-bit SSE
Penryn / Wolfdale / Wolfdale DP dual core package: two cores, each with 32K I-Cache and 32K D-Cache, sharing a 6M L2 Cache and the bus.
2 Threads, 1 Package (similar to Intel Core 2 Duo processor)

11 Core Microarchitecture
(block diagram of two cores)
Per core: Instruction Fetch and Pre Decode; Instruction Queue; ucode ROM; Decode; Rename/Alloc; Retirement Unit (Reorder Buffer); Schedulers; execution ports ALU/Branch/FPMove, ALU/FAdd/FPMove, ALU/FMul/FPMove, LOAD, STORE; L1 D-Cache and D-TLB
Shared: 2MB/4MB Shared L2 Cache; Up to 10.4 GB/s FSB

12 Intel Core Microarchitecture: Primary interfaces
- Front end
- Execution
- Memory
(block diagram: ucode ROM; Instruction Fetch and PreDecode; Instruction Queue; Decode; Rename/Alloc; Retirement Unit; Schedulers; ALU/Branch/FPmove, ALU/FAdd/FPmove, ALU/FMul/FPmove; Load; Store; Memory Order Buffer; L1 D-Cache and D-TLB; 2M/4M Shared L2 Cache; Up to 10.6 GB/s FSB)

13 Intel Core Microarchitecture: Front End
- Up to 6 instructions per cycle can be sent to the IQ
- Typical programs average slightly less than 4 bytes per instruction
- 4 decoders: 1 large and 3 small
  - All decoders handle simple 1-uop instructions
  - The larger decoder handles instructions up to 4 uops
- Detects short loops and locks them in the instruction queue (IQ)
  - Reduced front end power consumption: total saving of up to 14%
(block diagram as on slide 12, front end highlighted)

14 Without Macro-Fusion
Instruction Queue (in program order):
  load eax, [mem1]
  cmp eax, [mem2]
  jne targ
  store [mem3], ebx
  inc esp
- Read five instructions from Instruction Queue
- Each instruction gets decoded separately
- With four decoders (dec0, dec1, dec2, dec3), decoding takes two cycles (Cycle 1, Cycle 2)

15 With Intel's New Macro-Fusion
Instruction Queue (in program order):
  load eax, [mem1]
  cmp eax, [mem2]
  jne targ
  store [mem3], ebx
  inc esp
- Read five instructions from Instruction Queue
- Send fusable pair to single decoder
- All in one cycle on dec0, dec1, dec2, dec3:
  load eax, [mem1]
  cmpjne eax, [mem2], targ
  store [mem3], ebx
  inc esp

16 Reviewer comment on slide 15 (ct3; ctaggard, 03/03/2006): 66% improvement due to macro fusion and +1 decoder. Visually make NGMA bigger/better.

17 Intel Core Microarchitecture: Execution
Out-of-Order
- 4 uops renamed / retired per clock
- Uops written to RS and ROB
  - RS waits for sources to arrive, allowing OOO execution
  - ROB waits for results to show up for retirement
- 6 dispatch ports from RS
  - 3 execution ports (integer / fp / simd)
  - load
  - store (address)
  - store (data)
- 128-bit SSE implementation
  - Port 0 has packed multiply (4 cycles SP, 5 DP, pipelined)
  - Port 1 has packed add (3 cycles, all precisions)
- FP data has one additional cycle bypass latency
  - Do not mix SSE FP and SSE integer ops on same register
(block diagram as on slide 12, execution units highlighted)

18 Intel Advanced Digital Media Boost
Single Cycle SSE Execution In Each Core
SSE Operation (SSE/SSE2/SSE3) on 128-bit registers (bits 127..0):
  SOURCE  X4 X3 X2 X1
  DEST    Y4 Y3 Y2 Y1
Intel Core Microarchitecture (DECODE, EXECUTE):
  CLOCK CYCLE 1: X4opY4 X3opY3 X2opY2 X1opY1
Others (DECODE, EXECUTE):
  CLOCK CYCLE 1: X2opY2 X1opY1
  CLOCK CYCLE 2: X4opY4 X3opY3
ADVANTAGE
- Increased Performance: 128 bit Single Cycle In Each Core
- Improved Energy Efficiency
*Graphics not representative of actual die photo or relative size
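The X1opY1 ... X4opY4 picture on slide 18 corresponds to one packed SSE operation on four single-precision lanes. The following is a minimal sketch, not taken from the slides: `add_packed` is a hypothetical helper name, the intrinsics come from `<xmmintrin.h>` and are used only when the compiler defines `__SSE__`, and a plain scalar loop stands in on other targets. Whether the packed add really executes in a single cycle depends on the processor, as the slide describes.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE__)
#include <xmmintrin.h>  // SSE intrinsics (x86 only)
#endif

// Adds x[i] + y[i] for n floats, four lanes (128 bits) per step.
// add_packed is a hypothetical helper, not part of the original slides.
void add_packed(const float* x, const float* y, float* out, std::size_t n) {
    assert(n % 4 == 0);  // sketch only: no scalar tail loop
    for (std::size_t i = 0; i < n; i += 4) {
#if defined(__SSE__)
        __m128 vx = _mm_loadu_ps(x + i);             // X4 X3 X2 X1
        __m128 vy = _mm_loadu_ps(y + i);             // Y4 Y3 Y2 Y1
        _mm_storeu_ps(out + i, _mm_add_ps(vx, vy));  // one packed add
#else
        // Portable fallback: same result, one lane at a time.
        for (std::size_t k = 0; k < 4; ++k) out[i + k] = x[i + k] + y[i + k];
#endif
    }
}
```

Calling `add_packed(x, y, out, 8)` on two 8-element arrays performs two packed adds. The unaligned load/store variants are used so the sketch does not depend on 16-byte aligned buffers.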

19 Intel Core Microarchitecture: Memory sub-system
Loads & Stores
- 128-bit load and 128-bit store per cycle
- Prefetching
- Memory Disambiguation
- Shared Caches
L1D cache prefetching
- Data Cache Unit Prefetcher (aka streaming prefetcher)
  - Recognizes ascending access patterns in recently loaded data
  - Prefetches the next line into the processor's cache
- Instruction Based Stride Prefetcher
  - Prefetches based upon a load having a regular stride
  - Can prefetch forward or backward 2 Kbytes (1/2 default page size)
L2 cache prefetching: Data Prefetch Logic (DPL)
- Prefetches data to the 2nd level cache before the DCU requests the data
- Maintains 2 tables for tracking loads
  - Upstream: 16 entries
  - Downstream: 4 entries
(block diagram as on slide 12, memory sub-system highlighted)
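To relate the prefetchers on slide 19 to source code, the sketch below sums the same matrix in two orders. It is an illustration under assumptions, not a benchmark from the slides: the function names are hypothetical, and the matrix is assumed to be stored row by row in one contiguous buffer. The first loop touches ascending addresses, the pattern the streaming prefetcher is said to recognize. The second loop has a regular stride of `cols * sizeof(double)` bytes, which a stride prefetcher could follow only while that stride stays within the 2 Kbytes range quoted on the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums a rows x cols matrix stored row by row in one contiguous buffer.
// Addresses ascend by one element per step.
double sum_ascending(const std::vector<double>& m,
                     std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    double total = 0.0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];
    return total;
}

// Same sum, walking down each column: every step jumps
// cols * sizeof(double) bytes, a regular but possibly large stride.
double sum_strided(const std::vector<double>& m,
                   std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    double total = 0.0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];
    return total;
}
```

Both functions return the same value for integer-valued data; only the order of memory accesses differs, which is what a profiler run would compare.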

20-25 Intel Smart Memory Access: Prefetchers
(animated diagram: loads Load1 (oldest) to Load4 (youngest) issued against L1 Cache, Shared L2 Cache, and memory)
- Memory is too far away
- Caches are closer when they have the data
- Prefetchers detect applications' data reference patterns
- And bring the data closer to the data consumer
- Solving the Problem of Where

26 Some Implications of Core 2 Architecture for Developers who want to thread their apps

27 Advanced Smart Cache Benefits
- Two threads which communicate frequently should be scheduled to the two cores that share an L2 cache
- Use the thread/processor affinity feature in your applications
(diagram: Quad Core Processor; Core 1 and Core 2 share one L2 Cache, Core 3 and Core 4 share the other; both L2 Caches connect to the FSB)
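As one possible way to use the thread/processor affinity feature mentioned on slide 27, the sketch below pins the calling thread to a single CPU. It assumes Linux with glibc, where `sched_getaffinity` and `sched_setaffinity` and the `CPU_*` macros are available from `<sched.h>` (g++ defines `_GNU_SOURCE` by default); other operating systems offer different APIs. The helper names are hypothetical, and which CPU indices actually share an L2 cache is platform-specific and not determined here.

```cpp
#include <cassert>
#if defined(__linux__)
#include <sched.h>  // sched_getaffinity, sched_setaffinity, CPU_* macros
#endif

// Returns the lowest CPU index the calling thread may run on, or -1.
// Hypothetical helper; Linux-only in this sketch.
int first_allowed_cpu() {
#if defined(__linux__)
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &set)) return cpu;
#endif
    return -1;
}

// Restricts the calling thread to one CPU; returns true on success.
bool pin_calling_thread(int cpu) {
#if defined(__linux__)
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;  // pid 0 = caller
#else
    (void)cpu;
    return false;  // no portable equivalent in this sketch
#endif
}
```

Two threads that communicate frequently would each call `pin_calling_thread` with the indices of two cores that share an L2 cache on the target machine.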

28 Memory Related: Avoid False Sharing
What is false sharing?
- Multiple threads repeatedly write to the same cache line shared by processors
  - Usually different data
- Cache lines get invalidated
  - Forces additional reads from memory
- Severe performance impact
  - In tight loops
  - In general, when threads read/write to the same cache line very rapidly
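A common remedy for the false sharing described on slide 28 is to give each thread's data its own cache line. The sketch below is a minimal illustration under stated assumptions: a 64-byte cache line (`kAssumedLineSize` is an assumption, not a value from the slides), C++17 so that `std::vector` honors the over-alignment, and hypothetical names throughout.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines; check the target processor.
constexpr std::size_t kAssumedLineSize = 64;

// alignas pads each counter to a full line, so two threads never
// write to the same cache line.
struct alignas(kAssumedLineSize) PaddedCounter {
    long long value = 0;
};

// Each thread increments only its own padded slot; the slots are
// summed after all threads have joined.
long long count_with_padded_slots(int n_threads, long long iters) {
    assert(n_threads > 0 && iters >= 0);
    std::vector<PaddedCounter> slots(static_cast<std::size_t>(n_threads));
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < slots.size(); ++t) {
        workers.emplace_back([&slots, t, iters] {
            for (long long i = 0; i < iters; ++i) slots[t].value += 1;
        });
    }
    for (std::thread& w : workers) w.join();
    long long total = 0;
    for (const PaddedCounter& s : slots) total += s.value;
    return total;
}
```

Without the `alignas`, several `long long` counters would sit in one line and every increment could invalidate that line in the other cores' caches. The padding trades memory for independence between threads.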

29 Some Words on Pipelines (1)
Modern CPUs may be understood by considering their basic design paradigm, the so-called pipeline. The pipeline is designed to break up the processing of a single instruction into independent parts that ideally are executed in identical time windows. These independent parts of the processing are called pipeline stages. Since identical processing time in each stage can't be guaranteed, most pipeline stages control a buffer or queue that supplies instructions if the previous stage is still busy, or in which instructions can be stored if the next stage is still busy. Underflow or overflow of a queue will cause the respective stage to run idle and will cause a pipeline stall.
(diagram: pipeline stages Fetch, Decode, Allocate, Execute, Retire, each marked Busy or Idle, with Full/Empty buffers between them and a Stall indicator)

30 Some Words on Pipelines (2)
In order to achieve the best performance:
- Pipeline stalls must be avoided.
- Since Core 2 makes use of speculative execution, a mispredicted branch might lead to a pipeline flush to keep the instructions consistent. Pipeline flushes must be avoided.
- Understanding the Core 2 pipeline and being able to detect pipeline problems can greatly improve the performance of your software.
- Knowledge of the pipeline and its registers increases the understanding and efficient usage of the VTune Performance Analyzer, e.g. look for Cache Misses and Branch Mispredictions.
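One way to reduce the branch mispredictions mentioned on slide 30 is to replace a data-dependent branch with arithmetic. The sketch below shows both forms for a simple counting loop. The function names are hypothetical, and this is an illustration rather than a guaranteed speedup: an optimizing compiler may already turn the first form into branch-free code, so a profiler should confirm whether mispredictions are a problem at all.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts elements greater than threshold using a conditional branch.
// On unpredictable data this branch may be mispredicted often.
std::size_t count_above_branchy(const std::vector<int>& v, int threshold) {
    std::size_t n = 0;
    for (int x : v) {
        if (x > threshold) ++n;
    }
    return n;
}

// Same count without a data-dependent branch in the source: the
// comparison result (false or true) is added as 0 or 1.
std::size_t count_above_branchless(const std::vector<int>& v, int threshold) {
    std::size_t n = 0;
    for (int x : v) n += static_cast<std::size_t>(x > threshold);
    return n;
}
```

Both functions return the same count for any input, so the second can be swapped in where profiling shows a hot, poorly predicted branch.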

31 Uop Flow: Refer to VTune Event Counters
Fetch / Decode: Bus Unit (to L2 Cache); 32 KB Instruction Cache; Next IP; Branch Target Buffer; Instruction Decode (4 issue); Microcode Sequencer; Register Allocation Table (RAT)
Execute: Reservation Stations (RS), 32 entry; Scheduler / Dispatch Ports (six ports: Load, Store Addr, Store Data, and three Integer Arithmetic ports that also carry SIMD, FP Add, Integer Shift/Rotate, and FP Div/Mul units); Memory Order Buffer (MOB); 32 KB Data Cache
Retire: Re-Order Buffer (ROB), 96 entry; IA Register Set
Event counters:
- RESOURCE_STALLS measures here: transfer from Decode
- RS_UOPS_DISPATCHED measures at Execution
- UOPS_RETIRED measures at Retirement
Detailed description in Processor Manuals

32 Backup

Pentium 4 Processor Block Diagram
(diagram labels: NetBurst(TM) Micro-architecture Pipeline; FP move, FP store, FMul, FAdd, MMX, SSE; Integer units; Schedulers; Load, Store; L1 D-Cache and D-TLB; I-TLB; ucode; 3.2 GB/s)