CS 310 Embedded Computer Systems CPUS. Seungryoul Maeng

1 EMBEDDED SYSTEM HW: CPUS. Seungryoul Maeng

2 CPUs: Types of Processors; CPU Performance; Instruction Sets; Processors used in ES

3 Processors: Single Purpose (Hardware); General Purpose (Software); Application Specific (Software)

4 Custom single-purpose processors: Hardware. Read chapters 2 and 3 of Embedded System Design: A Unified Hardware/Software Introduction, by Frank Vahid and Tony Givargis.

5 Introduction. Processor: a digital circuit that performs computation tasks; it consists of a controller and a datapath. General-purpose: handles a variety of computation tasks. Standard single-purpose: performs one particular common task; off-the-shelf, a.k.a. peripherals. Custom single-purpose: performs a non-standard task. A custom single-purpose processor may be fast, small, and low power, but it has high NRE cost, longer time-to-market, and less flexibility. [Figure: digital camera chip with lens, CCD, A2D, CCD preprocessor, pixel coprocessor, D2A, JPEG codec, microcontroller, multiplier/accumulator, DMA controller, display controller, memory controller, ISA bus interface, UART, LCD controller.]

6 Custom single-purpose processor basic model. [Figure: controller and datapath. Controller signals: external control inputs, external control outputs, datapath control inputs, datapath control outputs. Datapath signals: external data inputs, external data outputs. A view inside the controller and datapath: the controller holds the next-state and control logic and the state register; the datapath holds registers and functional units.]

7 Custom single-purpose processors: Can be built to execute algorithms. Typically start with an FSMD (finite-state machine with datapath). CAD tools can be of great assistance. Custom vs. standard.
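The FSMD starting point can be illustrated in software. The sketch below is only an illustration under stated assumptions: the slides do not name an algorithm, so greatest common divisor by repeated subtraction is chosen here as a small example, and the state names and the `gcd_fsmd` function are hypothetical. The `state` variable plays the role of the controller's state register, while `x` and `y` stand in for datapath registers.

```python
def gcd_fsmd(x_in, y_in):
    """Simulate a tiny FSMD computing gcd(x_in, y_in) for positive integers.

    Returns (result, cycles), where cycles counts state transitions.
    """
    if x_in <= 0 or y_in <= 0:
        raise ValueError("inputs must be positive")
    state = "LOAD"      # controller: state register
    x = y = 0           # datapath: registers
    cycles = 0
    while state != "DONE":
        cycles += 1
        if state == "LOAD":
            x, y = x_in, y_in          # latch external data inputs
            state = "TEST"
        elif state == "TEST":
            # Next-state logic driven by datapath status (comparator outputs).
            if x == y:
                state = "DONE"
            elif x < y:
                state = "SUB_Y"
            else:
                state = "SUB_X"
        elif state == "SUB_Y":
            y = y - x                  # functional unit: subtractor
            state = "TEST"
        elif state == "SUB_X":
            x = x - y
            state = "TEST"
    return x, cycles


result, cycles = gcd_fsmd(12, 8)
print(result, cycles)
```

Each branch of the `if` chain corresponds to one FSMD state, which is roughly how such a description maps onto a controller plus datapath.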

8 General-Purpose Processors: Software

9 Introduction: General-Purpose Processor. A processor designed for a variety of computation tasks, a.k.a. microprocessor; the prefix "micro" came into use when processors were implemented on one or a few chips rather than filling entire rooms. Low unit cost, in part because the manufacturer spreads NRE cost over large numbers of units: Motorola sold half a billion 68HC05 microcontrollers in 1996 alone; ARM processors: 1.5 billion processors. Carefully designed, since higher NRE is acceptable; can yield good performance, size, and power.

10 Basic Architecture: Control unit and datapath; note the similarity to a single-purpose processor. Key differences: the datapath is general, and the control unit doesn't store the algorithm; the algorithm is programmed into the memory. [Figure: processor containing a control unit (controller, control/status, PC, IR) and a datapath (ALU, registers), connected to I/O and memory.]

11 Two Memory Architectures. Princeton: fewer memory wires. Harvard: simultaneous program and data memory access. Advantages, disadvantages? [Figure: Harvard processor with separate program memory and data memory; Princeton processor with a single memory (program and data).]

12 Princeton vs. Harvard: Harvard can't use self-modifying code. Harvard allows two simultaneous memory fetches. Most DSPs use Harvard architecture for streaming data: greater memory bandwidth. Most high-performance processors use Harvard architecture at the cache memory level.

13 Cache Memory: Memory access may be slow. A cache is a small but fast memory close to the processor that holds a copy of part of memory; accesses result in hits and misses. Cache: fast/expensive technology, usually on the same chip as the processor. Memory: slower/cheaper technology, usually on a different chip.

14 Why use microprocessors? Alternatives: field-programmable gate arrays (FPGAs), custom logic, etc. (custom single-purpose processor or HW logic). Microprocessors are often very efficient: low NRE cost and short time-to-market (the user just writes software; no processor design); high flexibility (the same logic can perform many different functions). Microprocessors simplify the design of families of products.

15 The performance paradox: Microprocessors use much more logic to implement a function than custom logic does, but microprocessors are often very fast. Performance increases over time (performance doubles every 18 months), and you can easily take advantage of this.

16 The performance paradox: Carefully designed, since higher NRE is acceptable: heavily pipelined; large design teams; aggressive VLSI technology. Performance doubles every 18 months: clock frequency; deeper pipelines; IPC (instructions per cycle).

17 Pipelining: Increasing Instruction Throughput. [Figure: non-pipelined vs. pipelined dish cleaning (wash, dry) over time; pipelined instruction execution over time, with stages fetch instruction, decode, fetch operands, execute, store result.]

18 Power: Custom logic is a clear winner for low-power devices. Modern microprocessors offer features to help control power consumption. Software design techniques can help reduce power consumption.

19 Application-Specific Processors (ASPs)

20 Microprocessor varieties: Desktop vs. embedded processors. Embedded processors include CPU core(s), memory, peripherals, I/O devices, networks, etc. SoC processors. Example: Netsilicon NET+ARM Embedded Processor.

21 Embedded processor varieties: General-purpose vs. application-specific processors. Digital signal processor (DSP): microprocessor optimized for digital signal processing. Application-Specific Instruction-set Processors (ASIPs). Microcontrollers and microprocessors: a microcontroller includes I/O devices and on-board memory and is usually used in control applications. Typical embedded word sizes: 8-bit, 16-bit, 32-bit.

22 Many Types of Programmable Processors. Past: microprocessor, microcontroller, DSP, graphics processor. Now/Future: network processor, sensor processor, cryptoprocessor, game processor, wearable processor, mobile processor.

23 Application-Specific Processors (ASPs): General-purpose processors are sometimes too general to be effective in demanding applications; e.g., video processing requires huge video buffers and operations on large arrays of data, which is inefficient on a GPP. But single-purpose processors have high NRE and are not programmable. ASPs are targeted to a particular domain: they contain architectural features specific to that domain (e.g., embedded control, digital signal processing, video processing, network processing, telecommunications) and are still programmable.

24 A Common ASP: Microcontroller. For embedded control applications: reading sensors, setting actuators; mostly dealing with events (bits), where data is present but not in huge amounts; e.g., VCR, disk drive, digital camera (assuming an SPP for image compression), washing machine, microwave oven. Microcontroller features: on-chip peripherals (timers, analog-digital converters, serial communication, etc.), tightly integrated for the programmer and typically part of the register space; on-chip program and data memory; direct programmer access to many of the chip's pins; specialized instructions for bit manipulation and other low-level operations.

25 Another Common ASP: Digital Signal Processors (DSP). For signal processing applications: large amounts of digitized data, often streaming; data transformations must be applied fast; e.g., cell-phone voice filter, digital TV, music synthesizer. DSP features: several instruction execution units; single-cycle multiply-accumulate instruction, among other instructions; efficient vector operations (e.g., adding two arrays) supported by vector ALUs, loop buffers, etc.

26 Trend: Even More Customized ASPs. In the past, microprocessors were acquired as chips. Today, we increasingly acquire a processor as Intellectual Property (IP), e.g., a synthesizable VHDL model. Customizable processors: opportunity to add custom datapath hardware and a few custom instructions, or to delete a few instructions (ASIPs); this can have significant performance, power, and size impacts. Problem: a compiler/debugger is needed for the customized ASIP (remember, most development uses structured languages). One solution: automatic compiler/debugger generation, e.g.,

27 Selecting a Microprocessor. Issues: technical (speed, power, size, cost) and other (development environment, prior expertise, licensing, etc.). Speed: how do we evaluate a processor's speed? Clock speed: but instructions per cycle may differ. Instructions per second: but work per instruction may differ. Dhrystone: synthetic benchmark; results in Dhrystones/sec. MIPS: 1 MIPS = 1757 Dhrystones per second (based on Digital's VAX 11/780), a.k.a. Dhrystone MIPS; commonly used today. So, 750 MIPS = 750*1757 = 1,317,750 Dhrystones per second. SPEC: set of more realistic benchmarks, but oriented to desktops. EEMBC (EDN Embedded Benchmark Consortium): suites of benchmarks for automotive, consumer electronics, networking, office automation, telecommunications.
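The Dhrystone MIPS conversion on this slide is a single multiplication, sketched below. The constant 1757 comes from the slide; the function and constant names are hypothetical.

```python
# 1 Dhrystone MIPS = 1757 Dhrystones per second (VAX 11/780 baseline, per the slide).
DHRYSTONES_PER_SEC_PER_MIPS = 1757


def mips_to_dhrystones_per_sec(dhrystone_mips):
    """Convert a Dhrystone MIPS rating to Dhrystones per second."""
    return dhrystone_mips * DHRYSTONES_PER_SEC_PER_MIPS


def dhrystones_per_sec_to_mips(dhrystones_per_sec):
    """Convert a measured Dhrystones/sec score to Dhrystone MIPS."""
    return dhrystones_per_sec / DHRYSTONES_PER_SEC_PER_MIPS


print(mips_to_dhrystones_per_sec(750))
```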

28 Processor comparison

Processor | Clock speed | Periph. | Bus Width | MIPS | Power | Trans. | Price
General Purpose Processors
Intel PIII | 1GHz | 2x16 K L1, 256K L2, MMX | 32 | ~900 | 97W | ~7M | $900
IBM PowerPC 750X | 550 MHz | 2x32 K L1, 256K L2 | 32/64 | ~1300 | 5W | ~7M | $900
MIPS R… | 250 MHz | 2x32 K, … way set assoc. | 32/64 | NA | NA | 3.6M | NA
StrongARM SA-110 | 233 MHz | None | … | … | … W | 2.1M | NA
Microcontroller
Intel … | 12 MHz | 4K ROM, 128 RAM, … I/O, Timer, UART | 8 | ~1 | ~0.2W | ~10K | $…
Motorola 68HC… | 3MHz | 4K ROM, 192 RAM, … I/O, Timer, WDT, SPI | 8 | ~.5 | ~0.1W | ~10K | $5
Digital Signal Processors
TI C… | … MHz | 128K SRAM, 3 T1 Ports, DMA, 13 ADC, 9 DAC | 16/32 | ~600 | NA | NA | $34
Lucent DSP32C | 80 MHz | 16K Inst., 2K Data, Serial Ports, DMA | … | … | NA | NA | $75
Sources: Intel, Motorola, MIPS, ARM, TI, and IBM Website/Datasheet; Embedded Systems Programming, Nov. 1998

29 Summary. General-purpose processors: good performance, low NRE, flexible; controller, datapath, and memory. Structured languages prevail, but some assembly-level programming is still necessary. Many tools are available, including instruction-set simulators and in-circuit emulators. ASPs: microcontrollers, DSPs, network processors, more customized ASIPs. Choosing processors is an important step.

30 CPU Performance

31 Elements of CPU performance: Cycle time (process technologies: transistor size). CPU pipeline. Instruction-level parallelism (number of transistors per die; types: superscalar, VLIW, multi-threading). Memory system.

32 Pipelining: Several instructions are executed simultaneously at different stages of completion. Performance measures: latency, throughput. Various conditions can cause pipeline bubbles that reduce utilization: branches, memory system delays, etc.

33 Pipeline structures: ARM7 has a 3-stage pipeline: fetch instruction from memory; decode opcode and operands; execute. ARM9 has a 5-stage pipeline: instruction fetch; decode; execute; data memory access; register write.

34 ARM7 pipeline execution. [Figure, over time: add r0,r1,#5: fetch, decode, execute; sub r2,r3,r6: fetch, decode, execute; cmp r2,#3: fetch, decode, execute; each instruction starts one cycle after the previous one.]
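The staggered fetch/decode/execute pattern on this slide can be reproduced with a short script. This is a sketch of an ideal pipeline only, assuming one instruction enters per cycle and nothing stalls; the function names are hypothetical.

```python
STAGES = ["fetch", "decode", "execute"]  # ARM7-style 3-stage pipeline


def pipeline_schedule(instructions, stages=STAGES):
    """Map each instruction to {cycle: stage} for an ideal, stall-free pipeline.

    Instruction number k (counting from 0) enters the first stage in cycle k.
    """
    return [
        (instr, {start + offset: stage for offset, stage in enumerate(stages)})
        for start, instr in enumerate(instructions)
    ]


def format_schedule(schedule):
    """Render the schedule as one text row per instruction, one column per cycle."""
    n_cycles = max(max(row) for _, row in schedule) + 1
    lines = []
    for instr, row in schedule:
        cells = [row.get(cycle, "").ljust(8) for cycle in range(n_cycles)]
        lines.append(instr.ljust(16) + "".join(cells).rstrip())
    return "\n".join(lines)


program = ["add r0,r1,#5", "sub r2,r3,r6", "cmp r2,#3"]
schedule = pipeline_schedule(program)
print(format_schedule(schedule))
```

Passing a five-entry stage list instead of `STAGES` gives the corresponding picture for a 5-stage pipeline such as the ARM9's.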

35 ARM9 core instruction pipeline

36 Performance measures: Latency: the time it takes for an instruction to get through the pipeline. Throughput: the number of instructions executed per time period. Pipelining increases throughput without reducing latency.
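A small calculation makes the latency/throughput distinction concrete. The sketch below assumes an ideal pipeline in which every stage takes the same time and no stalls occur; the function name and the example numbers are illustrative assumptions, not figures from the slides.

```python
def pipeline_metrics(n_instructions, n_stages, stage_time_ns):
    """Compare unpipelined and ideally pipelined execution of n instructions."""
    latency_ns = n_stages * stage_time_ns                 # same in both cases
    unpipelined_ns = n_instructions * latency_ns          # one instruction at a time
    # The first instruction needs n_stages cycles; each later one finishes a cycle after.
    pipelined_ns = (n_stages + n_instructions - 1) * stage_time_ns
    return {
        "latency_ns": latency_ns,
        "unpipelined_total_ns": unpipelined_ns,
        "pipelined_total_ns": pipelined_ns,
        "unpipelined_throughput": n_instructions / unpipelined_ns,  # instr per ns
        "pipelined_throughput": n_instructions / pipelined_ns,      # instr per ns
    }


m = pipeline_metrics(n_instructions=100, n_stages=5, stage_time_ns=10)
print(m)
```

Per-instruction latency is identical in both columns, while the pipelined throughput approaches one instruction per stage time as the instruction count grows.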

37 Pipeline stalls: If a step cannot be completed in the same amount of time as the others, the pipeline stalls. Bubbles introduced by a stall increase latency and reduce throughput.

38 ARM multi-cycle LDMIA instruction. [Figure, over time: ldmia r0,{r2,r3}: fetch, decode, ex ld r2, ex ld r3; sub r2,r3,r6: fetch, decode, ex sub; cmp r2,#3: fetch, decode, ex cmp.]

39 Control stalls: Branches often introduce stalls (branch penalty). Stall time may depend on whether the branch is taken. The CPU may have to squash instructions that have already started executing, because it doesn't know what to fetch until the condition is evaluated.

40 ARM pipelined branch. [Figure, over time: bne foo: fetch, decode, ex bne, ex bne, ex bne; sub r2,r3,r6: fetch, decode; foo add r0,r1,r2: fetch, decode, ex add.]

41 Example: ARM7 execution time. Determine the execution time of an FIR filter:

for (i=0; i<n; i++) f = f + c[i]*x[i];

;loop initiation code (7 cycles)
        MOV r0,#0       ;use r0 for i, set to 0
        MOV r8,#0       ;use separate index for arrays
        ADR r2,n        ;get address for N
        LDR r1,[r2]     ;get value of N
        MOV r2,#0       ;use r2 for f, set to 0
        ADR r3,c        ;load r3 with the address of the base of the c array
        ADR r5,x        ;load r5 with the address of the base of the x array
;loop body (4 cycles)
loop    LDR r4,[r3,r8]  ;get value of c[i]
        LDR r6,[r5,r8]  ;get value of x[i]
        MUL r4,r4,r6    ;compute c[i]*x[i]
        ADD r2,r2,r4    ;add into running sum f
;update loop counter and array index (2 cycles)
        ADD r8,r8,#4    ;add one word offset to array index
        ADD r0,r0,#1    ;add 1 to i
;test for exit (2 or 4 cycles)
        CMP r0,r1
        BLT loop        ;if i<n, continue loop
loopend ...

42 ARM7 execution time (2): Only the branch in the loop test may take more than one cycle. BLT loop takes 1 cycle in the best case, 3 in the worst case. $t_{loop} = t_{init} + N(t_{body} + t_{update}) + (N-1)\,t_{test,worst} + t_{test,best}$. Branch penalty: delayed branch; branch prediction; branch folding.
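The loop-time formula can be evaluated directly. The sketch below plugs in the cycle annotations from the FIR example (7 for initiation, 4 for the body, 2 for the update, and 2 or 4 for the test), under the slide's simplifying assumption that only the branch takes more than one cycle; the function name and default parameters are hypothetical.

```python
def fir_loop_cycles(n, t_init=7, t_body=4, t_update=2,
                    t_test_worst=4, t_test_best=2):
    """Cycle count for the FIR loop with n >= 1 iterations.

    The branch is taken (worst case) on the first n-1 iterations and
    falls through (best case) on the last one.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return (t_init
            + n * (t_body + t_update)
            + (n - 1) * t_test_worst
            + t_test_best)


for n in (1, 10, 100):
    print(n, fir_loop_cycles(n))
```

For large N the count is dominated by the 10 cycles per iteration (4 + 2 + 4), so the taken-branch penalty accounts for a noticeable share of the loop time.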

43 Delayed branch: To increase pipeline efficiency, the delayed branch mechanism requires that the n instructions after a branch are always executed, whether the branch is taken or not.

Delay slots filled with NOPs:

;loop initiation code
        ...
;loop body
loop    LDR r4,[r3,r8]  ;get value of c[i]
        LDR r6,[r5,r8]  ;get value of x[i]
        MUL r4,r4,r6
        ADD r2,r2,r4    ;add into running sum f
;update loop counter and array index
        ADD r8,r8,#4    ;add one word offset to array index
        ADD r0,r0,#1    ;add 1 to i
;test for exit
        CMP r0,r1
        BLT loop        ;if i<n, continue loop
        NOP
        NOP
loopend ...

Delay slots filled with useful instructions:

;loop initiation code
        ...
;loop body
loop    LDR r4,[r3,r8]  ;get value of c[i]
        LDR r6,[r5,r8]  ;get value of x[i]
        MUL r4,r4,r6
;update loop counter and array index
        ADD r0,r0,#1    ;add 1 to i
;test for exit
        CMP r0,r1
        BLT loop        ;if i<n, continue loop
        ADD r2,r2,r4    ;add into running sum f
        ADD r8,r8,#4    ;add one word offset to array index
loopend ...

44 ARM10 processor execution time: It is impossible to describe briefly the exact behavior of all instructions in all circumstances. Contributing factors: branch prediction; prefetch buffer; branch folding; the independent load/store unit; data alignment; how many accesses hit in the cache and TLB.

45 ARM10 integer core [figure; label: 3 instrs]

46 Branch Folding

47 Branch Folding (2)

48 Integer core. Prefetch unit: fetches instructions from the I-cache or external memory; predicts the outcome of branches whenever it can. Integer unit: decode; barrel shifter, ALU, multiplier; main instruction sequencer. Load/store unit: loads or stores two registers (64 bits) per cycle; decouples from the integer unit after the first access of an LDM or STM instruction; supports Hit-Under-Miss (HUM) operation.

49 Pipeline stages. Fetch: I-cache access, branch prediction. Issue: initial instruction decode. Decode: final instruction decode, register read for ALU op, forwarding, and initial interlock resolution. Execute: data address calculation, shift, flag setting, CC check, branch mispredict detection, and store data register read. Memory: data cache access. Write: register writes, instruction retirement.

50 Typical operations

51 Load or store operation

52 LDR operation that misses

53 Interlocks: The integer core uses forwarding to resolve data dependencies between instructions. Pipeline interlocks: Data dependency interlocks: instructions that have a source register that is loaded from memory by the previous instruction. Hardware dependency interlocks: a new load waiting for the LSU to finish an existing LDM or STM; a load that misses when the HUM slot is already occupied; a new multiply waiting for a previous multiply to free up the first stage of the multiplier.

54 Pipeline forwarding paths

55 Example of interlocking and forwarding. Execute-to-execute: mov r0, #1; add r1, r0, #1. Memory-to-execute: ldr r0, [r5]; sub r1, r2, #2; add r2, r0, #1.

56 Example of interlocking and forwarding, cont'd. Single-cycle interlock: ldr r0, [r1, r2]; str r3, [r0, r4]. [Figure: ldr passes through fetch, issue, decode, execute, memory, write (labels: r1+r2, r0 read); str passes through fetch, issue, decode, execute, memory, write (labels: r0+r4, r3 write).]
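The load-use case behind the single-cycle interlock can be detected with a simple scan over adjacent instructions. This is a deliberately simplified model, not a description of the ARM10 hardware: it assumes the only interlock is an instruction reading a register that the immediately preceding `ldr` loads, and the tuple encoding of instructions is a hypothetical format chosen for the example.

```python
def count_load_use_interlocks(program):
    """Count instructions that read a register loaded by the previous ldr.

    Each instruction is (opcode, destination register or None, source registers).
    """
    interlocks = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "ldr" and prev_dest in cur_sources:
            interlocks += 1
    return interlocks


# ldr r0,[r1,r2] followed by str r3,[r0,r4]: r0 is needed right away.
single_cycle = [
    ("ldr", "r0", ("r1", "r2")),
    ("str", None, ("r3", "r0", "r4")),
]

# ldr r0,[r5]; sub r1,r2,#2; add r2,r0,#1: an independent sub separates load and use.
memory_to_execute = [
    ("ldr", "r0", ("r5",)),
    ("sub", "r1", ("r2",)),
    ("add", "r2", ("r0",)),
]

print(count_load_use_interlocks(single_cycle))
print(count_load_use_interlocks(memory_to_execute))
```

The second sequence shows why placing an independent instruction between a load and its first use avoids the stall in this model: forwarding can then supply the loaded value in time.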

57 Instruction Level Parallelism: Instructions may be performed in parallel, subject to data dependencies, control dependencies, and resource dependencies. Dependency analysis: at compile time or at run time.

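58 Data and Control dependencies: Execution time depends on operands, not just opcode. Speculative execution: assume a branch direction and execute; unwind if wrong. Data dependency example: add r2,r0,r1; add r3,r2,r5 (the second add reads r2, which the first add writes). Control dependency example: a1: cmp r0,r1; a2: blt b1; a3: add r1,r2,r3; b1: sub r1,r2,r3. [Figure: data dependency graph over r0, r1, r2, r5, r3; control flow graph over a1, a2, a3, b1.]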

59 Parallelism extraction. Static: use the compiler to analyze programs; simpler CPU control; can make use of high-level language constructs; can't depend on data values; VLIW. Dynamic: use hardware to identify opportunities; more complex CPU; can make use of data values; superscalar.

60 Superscalar and VLIW Architectures. Performance can be improved by: a faster clock (but there's a limit); pipelining: slice up instructions into stages and overlap the stages; multiple ALUs to support more than one instruction stream. Superscalar (scalar: non-vector operations): fetches instructions in batches and executes as many as possible; may require extensive hardware to detect independent instructions. VLIW: each word in memory has multiple independent instructions; relies on the compiler to detect and schedule instructions; currently growing in popularity.

61 Superscalar execution: A superscalar processor can execute several instructions per cycle, using multiple pipelined data paths. Programs execute faster, but it is harder to determine how much faster. A superscalar CPU checks data dependencies dynamically.
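Dynamic dependency checking can be sketched as an in-order issue loop. The model below is an assumption-laden illustration rather than any real processor's issue logic: it assumes a dual-issue machine, single-cycle instructions, and a rule that an instruction cannot issue in the same cycle as an earlier instruction whose destination it reads or overwrites. The `issue_cycles` name and the `(dest, sources)` encoding are hypothetical.

```python
def issue_cycles(program, width=2):
    """Count cycles needed to issue `program` in order, up to `width` per cycle.

    Each instruction is (dest, sources). A group ends early when the next
    instruction reads or writes a register written earlier in the same group.
    """
    cycles = 0
    i = 0
    while i < len(program):
        written = set()       # destinations of instructions issued this cycle
        issued = 0
        while i < len(program) and issued < width:
            dest, sources = program[i]
            if written & (set(sources) | {dest}):
                break         # dependency on this cycle's group: wait a cycle
            written.add(dest)
            issued += 1
            i += 1
        cycles += 1
    return cycles


dependent = [("r2", ("r0", "r1")), ("r3", ("r2", "r5"))]    # add r2,r0,r1; add r3,r2,r5
independent = [("r2", ("r0", "r1")), ("r3", ("r4", "r5"))]  # no shared registers

print(issue_cycles(dependent), issue_cycles(independent))
```

The same two-instruction program takes a different number of cycles depending on its register usage, which is one reason execution time on a superscalar machine is harder to predict.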

62 VLIW processors. Parallelism extraction: at compile time. Parallel operations are encoded in one long word (instruction bundle). [Figure: instruction bundle with instruction 1, instruction 2, instruction 3, instruction 4 dispatched to FP unit, integer unit, integer unit, memory unit.] Slot utilization: static scheduling, trace scheduling, multi-threading.

63 Memory system performance: Caches introduce indeterminacy in execution time, which depends on the order of execution. Cache miss penalty: added time due to a cache miss. Several reasons for a miss: compulsory, conflict, capacity.
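The dependence of cache behavior on execution order can be seen with a toy simulation. This sketch assumes a direct-mapped cache with 4 lines of 16 bytes; those sizes, the function name, and the address traces are illustrative assumptions only.

```python
def simulate_direct_mapped(addresses, n_lines=4, line_size=16):
    """Return (hits, misses) for a sequence of byte addresses."""
    tags = [None] * n_lines            # tag currently held by each cache line
    hits = misses = 0
    for addr in addresses:
        block = addr // line_size      # memory block containing the address
        index = block % n_lines        # the single line this block may occupy
        tag = block // n_lines
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1                # compulsory or conflict miss
            tags[index] = tag
    return hits, misses


# Addresses 0 and 64 map to the same line, so alternating between them conflicts.
alternating = [0, 64, 0, 64]
grouped = [0, 0, 64, 64]               # same accesses, different order

print(simulate_direct_mapped(alternating))
print(simulate_direct_mapped(grouped))
```

Both traces touch the same two addresses the same number of times, yet the miss counts differ, so the total miss penalty depends on the order of execution.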

64 Instruction Sets

65 RISC vs. CISC: Complex instruction set computer (CISC): many addressing modes; many operations. Reduced instruction set computer (RISC): load/store; pipelinable instructions.

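66 CISC processors: types and history of Intel-family microprocessors

Year | Processor name | Transistor count | Features
… | … | …,250 | Intel's first microprocessor; used in the Busicom calculator
… | … | …,500 | Used in the Mark-8, the first home computer
… | … | …,000 | Used in the Altair
… | …/… | …,000 | Used in the IBM-PC XT; Intel grew into a major corporation
… | … | …,000 | Used in the IBM-PC AT; 15 million units sold in 6 years
… | … | … | …-bit multitasking support
… | … | …,180,000 | Built-in math coprocessor
1993 | Pentium | 3,100,… | Enhanced voice and image processing
1995 | Pentium Pro | 5,500,000 | Adopted the Dynamic Execution architecture
1997 | Pentium 2 | 7,500,000 | MMX technology support
1999 | Pentium 3 | 24,000,000 | SIMD support, 12-stage pipeline
2001 | Itanium | 25,000,000 | 64-bit, Explicitly Parallel Instruction Computing (EPIC)
2002 | Pentium 4 | 55,000,… | …-stage hyper-pipeline, Hyper-Threading
2003 | Itanium 2 | 410,000,000 | Machine Check Architecture, EPIC, 6MB L3 cache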

67 CISC - History: evolution of packaging technology

68 CISC - History

69 Instruction set characteristics: Fixed vs. variable length. Addressing modes. Number of operands. Types of operands.

70 ARM data processing Instruction Formats (RISC). Data processing immediate shift: cond | 000 | opcode | S | Rn | Rd | shift amount | shift | 0 | Rm. Data processing register shift: cond | 000 | opcode | S | Rn | Rd | Rs | 0 | shift | 1 | Rm. Data processing 32-bit immediate: cond | 001 | opcode | S | Rn | Rd | rotate | immediate-8.
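The 32-bit immediate format can be unpacked with shifts and masks. The sketch below assumes the classic 32-bit ARM encoding, with cond in bits 31-28, the 001 pattern in bits 27-25, opcode in bits 24-21, S in bit 20, Rn in bits 19-16, Rd in bits 15-12, rotate in bits 11-8, and the 8-bit immediate in bits 7-0, where the operand value is the immediate rotated right by twice the rotate field. The function names are hypothetical, and special cases that share the 001 pattern are ignored.

```python
def rotate_right_32(value, amount):
    """Rotate a 32-bit value right by `amount` bits (0..31)."""
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & 0xFFFFFFFF


def decode_dp_immediate(word):
    """Split an ARM data-processing (32-bit immediate) instruction into fields."""
    if ((word >> 25) & 0b111) != 0b001:
        raise ValueError("not a data-processing immediate encoding")
    rotate = (word >> 8) & 0xF
    imm8 = word & 0xFF
    return {
        "cond": (word >> 28) & 0xF,
        "opcode": (word >> 21) & 0xF,
        "S": (word >> 20) & 0x1,
        "Rn": (word >> 16) & 0xF,
        "Rd": (word >> 12) & 0xF,
        "rotate": rotate,
        "imm8": imm8,
        "value": rotate_right_32(imm8, 2 * rotate),  # effective operand
    }


# 0xE2810005 encodes add r0,r1,#5 (condition "always", opcode 0100 = ADD).
fields = decode_dp_immediate(0xE2810005)
print(fields)
```

The rotate field is what lets an 8-bit immediate represent values such as 0xFF000000, at the cost of not being able to encode every 32-bit constant.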

71 Nios II processor Instruction Formats (RISC): I-type, R-type, J-type

72 Intel IA-32 Instruction Format (CISC)

73 Programming model: The programming model is the set of registers visible to the programmer. Some registers are not visible (IR).

74 Multiple implementations: Successful architectures have several implementations: varying clock speeds; different bus widths; different cache sizes; etc.

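Elements of CPU performance: Cycle time. CPU pipeline. Superscalar design. Memory system. $T_{exec} = \left(\frac{\text{instructions}}{\text{program}}\right)\left(\frac{\text{cycles}}{\text{instruction}}\right)\left(\frac{\text{seconds}}{\text{cycle}}\right)$ [Figures: ARM7TDM CPU core; ARM Cortex A-9 microarchitecture.]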