Internals of a modern processor. Telecom Paristech
1 Internals of a modern processor Telecom Paristech Jean-Marie.Cottin@arm.com 1
2 ARM Overview Founded in November 1990 in Cambridge, UK. First processor designed for the Newton PDA (Apple). Now over 1700 employees (worldwide). In France: Sophia-Antipolis (multicores, secure cores, L2 caches), Grenoble (physical IP, SOI), Paris (sales) 2
3 What does ARM do? Intellectual Property (IP): R&D outsourcing for semiconductor companies ARM technology is low power Royalty based business model Over 15 billion ARM technology based chips shipped to date Average of 2 ARM processors per mobile phone 3
4 About me Telecom Paris (2000). University of Tokyo (東京大学, 2002): formal methods. Worked on compilers for 4 years (Japan). Joined ARM (Sophia Antipolis) in 2006: modelling micro-architectures (cycle-accurate model of the Cortex-A9 in C++/SystemC), benchmarking (performance metrics of CPUs), validation of integer cores (Cortex-A9 and next-generation Cortex-R) and of the L2 cache controller (PL310) 4
5 Plan 1. About pipelines 2. Fetch (speculation & prediction) 3. Decode (instruction set) 4. Dynamic scheduling 5. MMU 6. Caches 7. Multicore CPU (thread level parallelism) 8. Conclusion 5
6 About pipelines Why pipelines? [Figure: a single pipe stage: input registers feed a large block of combinational logic whose outputs are registered for the next stage] 6
7 Pipeline (instruction level parallelism) Early times: the Z80 was not pipelined: several cycles per instruction. Pipeline benefits: split the work into small tasks (frequency goes up!) and parallelize them (ILP: Instruction Level Parallelism), ideally reaching 1 cycle per instruction. 3 kinds of stall: data dependencies, contention for hardware resources, control (e.g. branch target) 7
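As a back-of-the-envelope sketch (my own model, not from the slides): once full, a k-stage pipeline retires one instruction per cycle, so n instructions take about k + (n - 1) cycles, versus n * k cycles if each instruction occupies the whole datapath. The three kinds of stall above are what push real machines away from this ideal.

```c
/* Idealized pipeline timing model: no stalls, one instruction issued per
   cycle once the pipe is full. Illustrative only; real pipelines stall on
   data dependencies, resource contention and control. */
static unsigned long pipelined_cycles(unsigned k, unsigned long n) {
    return n == 0 ? 0 : k + (n - 1);   /* fill time + 1 cycle per instr */
}

static unsigned long unpipelined_cycles(unsigned k, unsigned long n) {
    return (unsigned long)k * n;       /* each instr traverses all k stages */
}
```

For k = 5 and n = 1000 this gives 1004 vs. 5000 cycles, i.e. the speedup approaches the stage count k as n grows.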
8 Pipeline examples ARM9 (arch. V5) 5 Stage Pipeline ARM11 (arch. V6) 8 Stage Pipeline Cortex-A8 (arch. V7) 13 Stage Pipeline 8
9 Pipeline of the Cortex-A9 (arch. V7) [Figure: block diagram. Fetch: instruction prefetch stage with a fast-loop instruction cache, branch prediction (Global History Buffer, branch target address cache, return stack) and a branch monitor. Decode: prediction and instruction queues with dispatch, feeding a dual-instruction decode stage. Scheduling: register rename stage (virtual-to-physical register pool) and issue stage. Execution: ALU/MUL, ALU, FPU/NEON, and an address generation unit (AGU) feeding a load-store unit with store buffer and auto-prefetcher; quad-slot out-of-order write-back stage with forwarding. Memory system: instruction cache, data cache and MMU, with two 64-bit AMBA 3 AXI interfaces] 9
10 Plan 1. About pipelines 2. Fetch (speculation & prediction) 3. Decode (instruction set) 4. Dynamic scheduling 5. MMU 6. Caches 7. Multicore CPU (thread level parallelism) 8. Conclusion 10
11 Fetch: speculation & prediction Speculation: execute instructions in advance (filling the pipe behind a branch) and: commit the results to registers once the control flow is confirmed; whenever the speculation turns out to be wrong, flush the pipe and restart fetching at the new address. A probabilistic optimization: it gains on average. Trade-off between performance and power (and complexity). Static prediction: pre-decode immediate branches. Dynamic prediction (Cortex-A9): Branch Target Address Cache (BTAC); probabilistic model: predictors & Global History Buffer (GHB) 11
12 Branch Target Address Cache (BTAC) [Figure: the current PC is compared against tags Tag0..Tagn; a hit returns the target & attributes] The BTAC associates a PC with: a predicate ("is a branch"), a target address and state (ARM, Thumb), and other information (is a function call, is a function return, etc.). It works like a cache (presented later) 12
13 Predictors For each branch, an FSM estimates the probability of being taken. Simplest model: 1 bit per branch; a loop then costs 1 miss at exit + 1 miss at (re)entry. Can do better with only 2 bits: (Strongly | Weakly) x (Taken | Not-Taken); transitions taken <-> not-taken occur only after 2 misses. [Figure: 4-state FSM SN <-> WN <-> WT <-> ST, moving toward Strongly Taken on each Taken outcome and toward Strongly Not-Taken on each Not-Taken outcome] 13
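The 2-bit scheme above can be sketched in a few lines of C (the encoding and function names are mine): states 0..3 run from strongly not-taken to strongly taken, and the prediction only flips after two consecutive mispredictions.

```c
/* 2-bit saturating counter: 0 = strongly not-taken ... 3 = strongly taken.
   The prediction flips only after two consecutive mispredictions, so a loop
   branch stays predicted taken across the single not-taken exit. */
typedef unsigned char pred2_t;          /* holds 0..3 */

static int pred2_predict(pred2_t s) {   /* 1 = predict taken */
    return s >= 2;
}

static pred2_t pred2_update(pred2_t s, int taken) {
    if (taken) return s < 3 ? s + 1 : 3;
    else       return s > 0 ? s - 1 : 0;
}
```

Starting from strongly taken, one not-taken outcome (a loop exit) still leaves the branch predicted taken on re-entry, which is exactly the improvement over the 1-bit model.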
14 Global History Buffer (GHB) Improve prediction further! A heuristic trade-off. Idea: the same branch gets associated with different predictors according to the history of control flow, catching regular patterns in branch history (e.g. nested loops). [Figure: the PC and a shift register of recent Taken/Not-Taken outcomes together select a predictor entry; the Branch Monitor feeds outcomes back] 14
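The history idea can be sketched as a gshare-style predictor in C: the branch PC is hashed with a global history register, so the same branch maps to different 2-bit counters along different control paths. The table size and the XOR hash are illustrative assumptions of mine, not the actual Cortex-A9 GHB organization.

```c
#include <stdint.h>

/* gshare-style sketch: index a table of 2-bit counters by PC XOR global
   history, so one static branch uses different counters per history. */
#define GHB_BITS 10
#define GHB_SIZE (1u << GHB_BITS)

static unsigned char table[GHB_SIZE];   /* 2-bit counters, 0..3 */
static uint32_t history;                /* shift register of outcomes */

static unsigned ghb_index(uint32_t pc) {
    return (pc ^ history) & (GHB_SIZE - 1);
}

static int ghb_predict(uint32_t pc) {   /* 1 = predict taken */
    return table[ghb_index(pc)] >= 2;
}

static void ghb_update(uint32_t pc, int taken) {
    unsigned i = ghb_index(pc);
    if (taken) { if (table[i] < 3) table[i]++; }
    else       { if (table[i] > 0) table[i]--; }
    history = ((history << 1) | (taken & 1)) & (GHB_SIZE - 1);
}
```

After a branch has been trained along a stable history pattern, the counter reached through that pattern saturates and the prediction follows it.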
15 Dynamic prediction [Figure: the prefetch pipe sends the next PC to memory (I-cache) and, in parallel, to the BTAC and GHB; prefetched PCs carry prediction info, and instruction words form the instruction stream] Challenge: the prefetch stage itself is pipelined (3 stages in the Cortex-A9) 15
16 Plan 1. About pipelines 2. Fetch (speculation & prediction) 3. Decode (instruction set) 4. Dynamic scheduling 5. MMU 6. Caches 7. Multicore CPU (thread level parallelism) 8. Conclusion 16
17 Decode: instruction sets Stage itself is a tree of comparators (no conceptual issue) Relevant issues are in the definition of instruction set: 1. Performance 2. Code density 3. Compiler support 4. Legacy code and market 17
18 Performance The RISC vs. CISC controversy (historical): CISC (Complex Instruction Set Computer): microcoded instructions on registers and memory (many addressing modes). RISC (Reduced Instruction Set Computer): LD/ST (memory) + simple operations on registers, large register bank. Conditional instructions eliminate jumps. Multiplications (mixed signed/unsigned, 8/16/32/64), MAC and variations, saturated arithmetic. SIMD (Single Instruction Multiple Data): SIMD & Advanced SIMD Extension (ARM's ASE, aka NEON; Intel's MMX & SSE). Applications: video, graphics (3D rendering), signal processing 18
19 Code density The size of code memory is an issue in embedded systems: Performance: does the program fit in the instruction cache? Area on chip: sub-optimal compared to behavioral synthesis. Thumb (ARM7TDMI): 16-bit, but not a fully-fledged instruction set; C code up to 35% smaller. «Code compression» through variable-length instruction sets. Thumb-2: 16/32-bit, full instruction set; almost the same performance as ARM but up to 25% smaller 19
20 Compiler support for SIMD SIMD: multiple operations on a pair of vectors (series of adjacent values). Gain over scalar execution is potentially huge! Compiler support is limited to loop vectorisation: identify variables with a vector access pattern, and ensure there are no data dependencies between different iterations.
float *a, *b, x; int i, j, n; ...
for (i = 0; i < n; i++) {
    *(a+j) = x + b[i];   // b is accessed with a stride of 1
    j += 2;              // a is accessed with a stride of 2
}
Also used through libraries (OpenMax, OpenGL) 20
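For contrast with the stride-2 loop above, here is a sketch of a loop that compilers vectorize readily: unit stride on both arrays, and C99 `restrict` to promise there is no aliasing between iterations (the function name and signature are mine, not from the slides).

```c
/* A vectorization-friendly loop: the compiler can only use SIMD when it can
   prove a and b do not alias; 'restrict' states that promise explicitly,
   and unit-stride access keeps the loads contiguous for vector registers. */
void add_scalar(int n, float x, const float *restrict b, float *restrict a) {
    for (int i = 0; i < n; i++)
        a[i] = x + b[i];        /* independent iterations, stride 1 */
}
```

Without `restrict`, a conservative compiler must assume `a` might overlap `b` and fall back to scalar code or emit a runtime overlap check.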
21 Legacy code and market Legacy code imposes backward compatibility on new architectures; the transition can be eased with a (low-performance) emulation mode and virtualization. Architectures tend to get irregular and complicated over time: because of the compatibility constraint in the first place, and because disparate customer requests get included. An instruction set delimits a market: patents on instructions (+ hardware) deter competition 21
22 22 Any question so far?
23 Plan 1. About pipelines 2. Fetch (speculation & prediction) 3. Decode (instruction set) 4. Dynamic scheduling 5. MMU 6. Caches 7. Multicore CPU (thread level parallelism) 8. Conclusion 23
24 Pipeline of the Cortex-A9 (arch. V7) [Figure: the Cortex-A9 pipeline block diagram again (see slide 9), with the scheduling stages highlighted: register rename stage (virtual-to-physical register pool), issue stage, out-of-order execution units and write-back] 24
25 Scheduling Static (compiler) or dynamic (processor). Chooses between equivalent sequences of instructions, all respecting data dependencies (the semantics of the program). E.g. for *ptr + a*b:
MUL R1, Ra, Rb; LDR R2, [Rptr]; ADD R3, R1, R2
or
LDR R2, [Rptr]; MUL R1, Ra, Rb; ADD R3, R1, R2
Optimize to overlap the latencies of instructions. Latencies come from: operator implementations (MUL, DIV, floating point): 1 to 10s of cycles; (lack of) forwarding path: 1 cycle; contention for resources like access ports to the register bank: 1 cycle; memory accesses: 2 to 100s of cycles! 25
26 Data dependencies RAW (Read-After-Write), aka «true» data dependency (data transmission):
LDR R2, #Addr
MUL R1, R2, R3
ADD R4, R1, R4
BIC R5, #0xF
ADD R4, R5, R4
WAR: anti-dependency (register reuse):
LDR R1, #Addr
STR R1, [R9]
ADD R1, R2, R4   ; writes R1 after the STR reads it
WAW: output dependency:
ADD R4, R1, #3
ADD R4, R1, R4
Loop: LDR R2, #Addr
MULS R0, R2, R3
MOV R0, #0       ; overwrites the MULS result
ADD R1, R1, R4
WAR & WAW appear because the number of registers is limited: aka name dependencies, removed by the 'renaming' stage in the processor 26
27 Static scheduling: 2 shortcomings The compiler has only a (very) simplistic model of latencies, e.g. 3 cycles between LDR Rx, [<addr>] and a dependent ADD .., Rx, .. Optimization of hardware resources is a difficult problem: pipeline scheduling + register allocation can be stated as an integer linear programming (ILP) problem (Goodwin & Wilken, 1995): NP-complete! Industry compilers treat scheduling and register allocation as two distinct phases, hence sub-optimal! 27
28 Scheduling vs. register allocation Calling conventions (ABI: Application Binary Interface): scratch registers (r0, r1, r2, r3) can be left modified by functions; saved registers must be restored at function return. Better scheduling tends to increase register use, incurring «register pressure»: more live intermediate values than scratch registers costs a PUSH/POP of saved registers; more live intermediate values than available registers causes register spill: save and restore intermediate results on the stack. Very expensive! Register reuse incurs «name dependencies» (WAR & WAW) which artificially limit IPC (Instructions Per Cycle) 28
29 Pipeline of the Cortex-A9 (arch. V7) [Figure: the Cortex-A9 pipeline block diagram again (see slide 9), with the rename and issue stages highlighted ahead of the dynamic-scheduling discussion] 29
30 Dynamic scheduling Definition: the ability to execute instructions in a different order than the one chosen by the compiler. Aka out-of-order, non-blocking. Especially relevant with caches (non-deterministic delays). In practice, the pipeline is forked: multiple issue of instructions. The issue stage: keeps track of RAW, WAR & WAW dependencies; dispatches an instruction as soon as its operands and resources are available; is critical to processor performance! It benefits from register renaming: redo register allocation onto a wider set of registers than available in assembler; reduce register pressure: remove WAR & WAW dependencies 30
31 Register renaming: effect [Figure: architectural registers r0..r14 vs. renamed registers v0..v55. Without renaming, the sequence LDR r3, [r6]; ADD r0, r3, r5; STR r0, [r2]; LDR r0, [r1]; ADD r7, r0, #1 stalls behind a cache miss because of the WAR dependency on r0. With renaming, the two writes to r0 get distinct virtual registers (v0 and v1), so the later LDR hits in the cache and ADD r7, v1, #1 executes while the earlier miss is still pending] 31
32 Register renaming: implementation Keep a table of correspondence architectural <-> virtual; # virtual registers > # architectural registers. Architectural register reads are remapped to their current virtual counterpart; architectural register writes allocate a new virtual register (renaming). Example:
LDR r3, [r6]    -> LDR v3, [v6]
ADD r0, r3, r5  -> ADD vx, v3, v5   (1: rename R0 -> Vx)
STR r0, [r2]    -> STR vx, [v2]
LDR r0, [r1]    -> LDR vy, [v1]     (2: rename R0 -> Vy)
ADD r7, r0, #1  -> ADD v7, vy, #1
32
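The correspondence table above can be sketched in C. The pool sizes are illustrative, and the free-list handling is simplified to a bump allocator: real hardware recycles virtual registers when instructions retire rather than running out.

```c
#include <assert.h>

/* Rename-table sketch: reads map an architectural register to its current
   virtual register; each write allocates a fresh virtual register. */
#define NARCH 16
#define NVIRT 56

static int map[NARCH];          /* arch -> current virtual */
static int next_virt;

static void rename_init(void) {
    for (int r = 0; r < NARCH; r++) map[r] = r;   /* v0..v15 initially */
    next_virt = NARCH;
}

static int rename_read(int arch) { return map[arch]; }

static int rename_write(int arch) {
    assert(next_virt < NVIRT);  /* simplification: real HW recycles/stalls */
    map[arch] = next_virt++;
    return map[arch];
}
```

Running the slide's sequence through this table, the two writes to r0 get two distinct virtual registers, which is precisely what removes the WAR dependency.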
33 33 Any question on scheduling?
34 Plan 1. About pipelines 2. Fetch (speculation & prediction) 3. Decode (instruction set) 4. Dynamic scheduling 5. MMU 6. Caches 7. Multicore CPU (thread level parallelism) 8. Conclusion 34
35 Pipeline of the Cortex-A9 (arch. V7) [Figure: the Cortex-A9 pipeline block diagram again (see slide 9), with the memory system highlighted: MMU, instruction and data caches, and the two 64-bit AMBA 3 AXI interfaces] 35
36 Memory Management Unit [Figure: two 4GB (2^32) virtual address spaces, one per process (p0, p1), both mapped through the MMU onto disjoint regions of physical memory, alongside memory of other processes] Managed by the OS. The OS needs to provide every process with the same virtual address space: for each process, the same address maps to different physical addresses. The OS just changes the virtual-to-physical mapping before running a process (page table address register in the MMU) 36
37 Memory Management Unit The address seen by a process is a virtual address: bits [31:n] are the virtual page index, bits [n-1:0] the page offset. The page table maps the virtual page index to a physical page index; the page offset is carried over unchanged. The page table lookup is done by the MMU; the page table itself is located in main memory. Pages have attributes: access permission, cacheability. It is common to have 2 or 3 levels of translation (Linux: 4K pages) 37
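A sketch of a 2-level lookup for 4K pages on a 32-bit virtual address: bits [31:22] index the first-level table, bits [21:12] the second level, bits [11:0] are the page offset. The table entry format is a simplification of mine (plain page numbers); real ARM descriptors also encode permissions and cacheability.

```c
#include <stdint.h>

/* Simplified 2-level page table walk: l1[i] holds the index of a 2nd-level
   table, whose entries hold physical page numbers (no attribute bits). */
#define L1_SIZE 1024
#define L2_SIZE 1024

uint32_t translate(const uint32_t l1[], uint32_t l2_tables[][L2_SIZE],
                   uint32_t va) {
    uint32_t l1_idx = va >> 22;             /* bits 31:22 */
    uint32_t l2_idx = (va >> 12) & 0x3FF;   /* bits 21:12 */
    uint32_t ppn    = l2_tables[l1[l1_idx]][l2_idx];
    return (ppn << 12) | (va & 0xFFF);      /* bits 11:0 = page offset */
}
```

In hardware this walk is performed by the MMU (and cached in a TLB, which the slides do not cover); the OS only rewrites the tables and the page table address register.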
38 Plan 1. About pipelines 2. Fetch (speculation & prediction) 3. Decode (instruction set) 4. Dynamic scheduling 5. MMU 6. Caches 7. Multicore CPU (thread level parallelism) 8. Conclusion 38
39 Introduction A cache is a copy of a limited amount of main memory in a smaller but faster local memory (a proxy). Purpose: speed up accesses! Hardware support: copies data from main to cache memory on demand; ensures the master copy is updated whenever the local copy is modified; makes it mostly transparent to software! The optimization relies on 2 assumptions of locality: In space: accesses are often to the same or contiguous addresses, e.g. data types like structures and arrays (counter-examples: linked lists, binary trees). In time: recently accessed global data is likely to be accessed again. This notion of locality depends on the size of the cache 39
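The spatial-locality assumption can be made measurable with a toy direct-mapped cache model that just counts misses for a stream of addresses. The sizes here are illustrative choices of mine, not those of any real ARM cache.

```c
#include <stdint.h>
#include <string.h>

/* Toy direct-mapped cache: 64 lines of 32 bytes. Counts misses for a
   sequence of byte addresses, to compare access patterns. */
#define LINES 64
#define LINE_BYTES 32

unsigned count_misses(const uint32_t *addrs, unsigned n) {
    uint32_t tags[LINES];
    uint8_t  valid[LINES];
    memset(valid, 0, sizeof valid);
    unsigned misses = 0;
    for (unsigned i = 0; i < n; i++) {
        uint32_t line = addrs[i] / LINE_BYTES;   /* which memory line */
        uint32_t set  = line % LINES;            /* direct-mapped index */
        if (!valid[set] || tags[set] != line) {
            misses++;                            /* fetch the whole line */
            valid[set] = 1;
            tags[set]  = line;
        }
    }
    return misses;
}
```

Walking 64 contiguous 4-byte words touches only 8 lines (8 misses, 56 hits), while 64 accesses strided by the line size miss every single time: the same work, an 8x difference in traffic.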
40 Memory hierarchy Caches of level 1, 2 or 3; integrated or external. D$ and I$ (Harvard) vs. unified (von Neumann). Latency can vary from 1 to 100s of cycles: non-deterministic! [Figure: core -> L1 cache (32K) -> L2 cache (256K) on chip; main memory (512M) and mass storage (10G, managed by the OS) off chip; speed decreases down the hierarchy] 40
41 Implementation The data address (physical or virtual) is split as: tag (bits [31:L+S]), index (bits [L+S-1:L]), offset in line (bits [L-1:0]). 2^S sets: a set is selected by the index bits of the address. p-associative: p tag comparisons in parallel (p = 4..16); a hit selects cache line[offset] among data0..data2^L. Typical line size: 4 to 16 words (128 to 512 bits). Each cache line has a valid bit; a line is dirty when its data has been modified 41
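The tag/index/offset split described above, as a small sketch in C with the field widths as parameters: for example 32-byte lines give L = 5, and 256 sets give S = 8.

```c
#include <stdint.h>

/* Split a 32-bit address for a cache with 2^s sets and 2^l-byte lines:
   offset = low l bits, index = next s bits, tag = remaining high bits. */
typedef struct { uint32_t tag, index, offset; } cache_addr_t;

cache_addr_t split_addr(uint32_t addr, unsigned l, unsigned s) {
    cache_addr_t a;
    a.offset = addr & ((1u << l) - 1);          /* byte within the line */
    a.index  = (addr >> l) & ((1u << s) - 1);   /* selects the set */
    a.tag    = addr >> (l + s);                 /* compared in parallel */
    return a;
}
```

A 32KB 4-way cache with 32-byte lines has 32768 / (4 * 32) = 256 sets, so L = 5 and S = 8, and only the tag bits need storing alongside each line.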
42 Maintenance Fine, but incoherencies occur when: code modifies itself (& Harvard): the new instruction is still in the data cache while the instruction cache (and memory) have the old one! The OS switches context (for virtually addressed caches): virtual addresses are remapped to a different memory range. The cacheability of memory regions is changed (MPU or MMU). Several cores access the same global variable. Software must then initiate cache maintenance operations, either: invalidate (valid := 0) or clean-invalidate (write back to main memory & valid := 0); on the whole cache, by virtual address, or by physical address. Takes time! 42
43 43 Any question on caches?
44 Plan 1. About pipelines 2. Fetch (speculation & prediction) 3. Decode (instruction set) 4. Dynamic scheduling 5. MMU 6. Caches 7. Multicore CPU (thread level parallelism) 8. Conclusion 44
45 Multicore CPU (thread level parallelism) Single-threaded execution is reaching physical limits: frequency -> power -> heat dissipation limits (PCs). Studies show multicore is more energy-efficient than single-core. Thread-level parallelism (TLP) as opposed to ILP. Industry is now moving to multi-CPU systems. Software must manage coherency among caches whenever a global variable is shared among several cores. Problems: software complexity (debug is a pain!) and performance loss (some 30% of time spent in cache maintenance). Multicore CPUs: hardware support for cache coherency; see the MOESI/MESI protocols 45
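A minimal sketch of the MESI states for a single cache line, seen from one core: local reads and writes, and snooped requests from other cores, drive the state. Bus transactions, write-backs and data movement are omitted, and the function names are mine.

```c
/* MESI state sketch for one line in one core's cache. Transitions only;
   the actual bus signalling and data transfer are not modelled. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

mesi_t local_read(mesi_t s, int others_have_copy) {
    if (s == INVALID)                    /* read miss: fetch the line */
        return others_have_copy ? SHARED : EXCLUSIVE;
    return s;                            /* read hit: state unchanged */
}

mesi_t local_write(mesi_t s) {
    (void)s;                             /* from INVALID this also implies */
    return MODIFIED;                     /* a fetch + invalidate (omitted) */
}

mesi_t snooped_read(mesi_t s) {          /* another core reads the line; */
    return s == INVALID ? INVALID : SHARED;  /* MODIFIED also writes back */
}

mesi_t snooped_write(mesi_t s) {         /* another core writes the line */
    (void)s;
    return INVALID;
}
```

In the Cortex-A9 MPCore, the Snoop Control Unit performs this snooping in hardware, including direct cache-to-cache transfers, which is what removes the 30%-style software maintenance cost mentioned above.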
46 Cache coherency (Cortex-A9 MPCore) Configurable: between 1 and 4 CPUs, each with an optional SIMD engine or floating-point unit (FPU/NEON) and trace. [Figure: up to four Cortex-A9 CPUs, each with its own instruction and data caches, connected to a Snoop Control Unit (SCU) providing cache-to-cache transfers and snoop filtering; also: Generalized Interrupt Control and Distribution (GIC), timers, Accelerator Coherence Port (ACP), advanced bus interface unit, and the L2 cache controller (PL310)] 46
47 Plan 1. About pipelines 2. Fetch (speculation & prediction) 3. Decode (instruction set) 4. Dynamic scheduling 5. MMU 6. Caches 7. Multicore CPU (thread level parallelism) 8. Conclusion 47
48 Summary & Conclusion At the pipeline level, instruction throughput is improved by: good predictive models for branch speculation; dynamic scheduling (and renaming improves scheduling). Caches reduce latency but incur non-determinism. Multicore extensions support cache coherency for faster inter-process communication. Upcoming challenges in processor design: even more optimizations to hide latencies (e.g. speculative data prefetch); the memory system plays a critical role in overall performance (not just the CPU alone!); SoC design complexity is paramount (10s of IPs to integrate); power consumption is the key consideration! 48
49 Questions? Nintendo DSi (ARM7 & ARM9) Microsoft Zune HD (ARM11 MPCore) Lenovo Skylight (Qualcomm Snapdragon) Nokia N900 (Cortex-A8) 49
CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer
More informationLecture 8: Branch Prediction, Dynamic ILP. Topics: static speculation and branch prediction (Sections )
Lecture 8: Branch Prediction, Dynamic ILP Topics: static speculation and branch prediction (Sections 2.3-2.6) 1 Correlating Predictors Basic branch prediction: maintain a 2-bit saturating counter for each
More informationCPE300: Digital System Architecture and Design
CPE300: Digital System Architecture and Design Fall 2011 MW 17:30-18:45 CBC C316 Pipelining 11142011 http://www.egr.unlv.edu/~b1morris/cpe300/ 2 Outline Review I/O Chapter 5 Overview Pipelining Pipelining
More informationARM Processors for Embedded Applications
ARM Processors for Embedded Applications Roadmap for ARM Processors ARM Architecture Basics ARM Families AMBA Architecture 1 Current ARM Core Families ARM7: Hard cores and Soft cores Cache with MPU or
More informationEN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design
EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown
More informationIntel released new technology call P6P
P6 and IA-64 8086 released on 1978 Pentium release on 1993 8086 has upgrade by Pipeline, Super scalar, Clock frequency, Cache and so on But 8086 has limit, Hard to improve efficiency Intel released new
More informationSuperscalar Processors Ch 14
Superscalar Processors Ch 14 Limitations, Hazards Instruction Issue Policy Register Renaming Branch Prediction PowerPC, Pentium 4 1 Superscalar Processing (5) Basic idea: more than one instruction completion
More informationSuperscalar Processing (5) Superscalar Processors Ch 14. New dependency for superscalar case? (8) Output Dependency?
Superscalar Processors Ch 14 Limitations, Hazards Instruction Issue Policy Register Renaming Branch Prediction PowerPC, Pentium 4 1 Superscalar Processing (5) Basic idea: more than one instruction completion
More informationLecture 5: Instruction Pipelining. Pipeline hazards. Sequential execution of an N-stage task: N Task 2
Lecture 5: Instruction Pipelining Basic concepts Pipeline hazards Branch handling and prediction Zebo Peng, IDA, LiTH Sequential execution of an N-stage task: 3 N Task 3 N Task Production time: N time
More informationCommunications and Computer Engineering II: Lecturer : Tsuyoshi Isshiki