Lecture 12: EIT090 Computer Architecture
Anders Ardö
EIT Electrical and Information Technology, Lund University
December 1, 2009

Taxonomy (Flynn, 1966)
  SISD (Single Instruction stream, Single Data stream): traditional uniprocessor
  SIMD (Single Instruction stream, Multiple Data stream): vector processors
  MISD (Multiple Instruction stream, Single Data stream): no commercial examples
  MIMD (Multiple Instruction stream, Multiple Data stream): multiprocessor

Small-scale MIMD designs
  Symmetric shared-memory MultiProcessors (SMP) with Uniform Memory Access time (UMA) and bus interconnect
  Often limited to 20-30 processors
Distributed memory machines
  Use an interconnection network to connect processor-memory nodes => NUMA
  Scalable to a large number of nodes
  Can have either a shared or a private address space

Shared memory vs. message passing
  Message passing:
    The programmer must explicitly distribute data
    No execution overhead between explicit communication
  Shared memory:
    The same data structures as in the sequential program can be used
    Shared access can lead to high communication overhead

The cache coherence problem
  A read operation from address X must see the latest value produced by a write to address X
  With several copies of X, this may be a problem
  Techniques:
    Hardware-based protocols: transparent to the software system, but increase the complexity of the machine
    Software-based protocols: require the user/compiler to detect when it is safe to cache, but do not require sophisticated hardware. Hard to do => limited use
  Policies:
    Write-invalidate: remove (invalidate) other processors' copies of a data item when it is written
    Write-update: update other processors' copies of a data item when it is written

Cache coherence protocols
  Snooping
    Status for a block is stored in every cache that has a copy of the block
    Caches monitor (snoop) the shared bus to update status and take actions
    Popular with a single shared memory
  Directory based
    Status for a block is stored in one location (the directory)
    Messages are used to update status
    Popular with distributed shared memory
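The two coherence policies can be contrasted with a toy snooping model. This is only a sketch under simplifying assumptions that are not on the slides: caches are write-through, a block is a single word, and the class names `Bus` and `Cache` are hypothetical.

```python
class Bus:
    """Hypothetical shared bus; every write is visible to all attached caches."""

    def __init__(self, policy):
        if policy not in ("invalidate", "update"):
            raise ValueError("policy must be 'invalidate' or 'update'")
        self.policy = policy
        self.caches = []
        self.memory = {}  # address -> value


class Cache:
    """Toy private cache that snoops writes made by other caches."""

    def __init__(self, bus):
        self.bus = bus
        self.lines = {}  # address -> value, valid copies only
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:  # miss: fetch the word from memory
            self.lines[addr] = self.bus.memory.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.bus.memory[addr] = value  # write-through (simplifying assumption)
        for other in self.bus.caches:
            if other is not self:
                other.snoop(addr, value)

    def snoop(self, addr, value):
        if addr not in self.lines:
            return  # no copy here, nothing to do
        if self.bus.policy == "invalidate":
            del self.lines[addr]      # write-invalidate: drop the stale copy
        else:
            self.lines[addr] = value  # write-update: refresh the copy in place
```

With write-invalidate, the next read in the other cache misses and refetches the new value. With write-update, the other cache keeps a valid copy at the price of sending the data on every write.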
Synchronization
  Why synchronize?
    We need to know when it is safe for different processes to use shared data
  Issues for synchronization:
    How do we implement the LOCK operation?
    Uninterruptible instruction to fetch and update (atomic operation)
    User-level synchronization operation using this primitive
    For large-scale multiprocessors, synchronization can be a bottleneck; techniques to reduce contention and latency of synchronization are needed
  Atomic exchange, Test-and-set, Fetch-and-add

Consistency models
  Sequential consistency
    Serializing
    Write operations must stall until performed!
  Relaxed consistency
    A relaxed consistency model allows operations to be observed out of order between synchronization operations
    Possible to obtain significant performance advantages

TLP (Thread Level Parallelism)
  Allow multiple threads to share functional units of a processor
  Coarse multithreading: thread switch on costly stalls
  Fine multithreading: thread switch each instruction issue slot
  Simultaneous multithreading (SMT): several threads can issue instructions simultaneously (combines ILP and TLP)

Clusters
  Loosely coupled desktop machines
  No shared memory
  High bandwidth, switch-based LAN
  Standard off-the-shelf components => cheap
  Easy to scale
  High availability
  High administration cost
  Major problem is power (servers and cooling)
  Supercomputers
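Returning to the LOCK operation on the Synchronization slide: a user-level lock can be built on a test-and-set primitive roughly as follows. This is a minimal sketch, not a real hardware interface. Python has no test-and-set instruction, so the hypothetical `AtomicFlag` class emulates the uninterruptible fetch-and-update with an internal `threading.Lock`; `SpinLock` and `run` are also illustrative names.

```python
import threading
import time


class AtomicFlag:
    """Stand-in for one memory word with a hardware test-and-set."""

    def __init__(self):
        self._value = False
        self._hw = threading.Lock()  # emulates the atomicity of the instruction

    def test_and_set(self):
        # Atomically return the old value and set the flag.
        with self._hw:
            old = self._value
            self._value = True
            return old

    def clear(self):
        with self._hw:
            self._value = False


class SpinLock:
    """User-level lock built on the test-and-set primitive."""

    def __init__(self):
        self._flag = AtomicFlag()

    def lock(self):
        # Spin until test-and-set reports that the flag was previously clear.
        while self._flag.test_and_set():
            time.sleep(0)  # yield so the current holder can make progress

    def unlock(self):
        self._flag.clear()


def run(n_threads=4, n_iters=1000):
    """Increment a shared counter from several threads under the spin lock."""
    lock = SpinLock()
    counter = [0]

    def worker():
        for _ in range(n_iters):
            lock.lock()
            counter[0] += 1  # critical section
            lock.unlock()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

Every waiting thread repeatedly hits the same flag, which is the contention problem the slide mentions for large-scale multiprocessors.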
Lecture 12 agenda
  Appendix D in "Computer Architecture"

Embedded processors
  A device that includes a (programmable) computer
  But is not itself a general-purpose computer
  Fastest growing segment
    washing machines, cars, cell phones, TVs, ...
  Wide range: low-end 8-bit to full-size 32-bit
  Price is a key factor
  Performance, power, real-time applications
  Types: ASIC, SoC, DSP

Embedded systems overview
  Embedded computing systems
    Computing systems embedded within electronic devices
    Hard to define. Nearly any computing system other than a desktop computer
    Billions of units produced yearly, versus millions of desktop units
    Perhaps 50 per household and per automobile
  [Figure: Computers are in here... and here... and even here... Lots more of these, though they cost a lot less each.]
A short list of embedded systems
  [Figure: TA-150 Computer Controlled Stereo Receiver]
  Anti-lock brakes, Auto-focus cameras, Automatic teller machines, Automatic toll systems, Automatic transmission, Avionic systems, Battery chargers, Camcorders, Cell phones, Cell-phone base stations,
  Cordless phones, Cruise control, Curbside check-in systems, Digital cameras, Disk drives, Electronic card readers, Electronic instruments, Electronic toys/games, Factory control, Fax machines,
  Fingerprint identifiers, Home security systems, Life-support systems, Medical testing systems, Modems, MPEG decoders, Network cards, Network switches/routers, On-board navigation, Pagers,
  Photocopiers, Point-of-sale systems, Portable video games, Printers, Satellite phones, Scanners, Smart ovens/dishwashers, Speech recognizers, Stereo systems, Teleconferencing systems,
  Televisions, Temperature controllers, Theft tracking systems, TV set-top boxes, VCRs, DVD players, Video game consoles, Video phones, Washers and dryers
  And the list goes on and on

Some common characteristics of embedded systems
  Single-functioned
    Executes a single program, repeatedly
  Tightly-constrained
    Low cost, low power, small, fast, etc.
  Reactive and real-time
    Continually reacts to changes in the system's environment
    Must compute certain results in real-time without delay
Embedded system
  [Figure: Embedded Real Time System connected to the Environment; Input via Sensors, Output via Actuators, Control]

An embedded system example -- a digital camera
  [Figure: digital camera block diagram. A lens and CCD feed the digital camera chip, which contains: A2D, CCD preprocessor, Pixel coprocessor, D2A, JPEG codec, Microcontroller, Multiplier/Accum, DMA controller, Display ctrl, Memory controller, ISA bus interface, UART, LCD ctrl]
  Single-functioned -- always a digital camera
  Tightly-constrained -- Low cost, low power, small, fast
  Reactive and real-time -- only to a small extent

Case study: Axis Etrax
  From Computer Architecture in Industry by Kenny Ranerup
Computer Architecture in Industry - Kenny Ranerup '03

ETRAX And Other Processors At Axis
  ASICs and processors have been developed at Axis Communications for many years.
  The first generation, CGA, was a special processor designed for parsing the IBM mainframe communication protocol.
  The second generation was a complete System on Chip ASIC for the same IBM mainframe market. The processor was a 6809 compatible design developed internally.
  ETRAX was the 3rd generation of ASICs developed at Axis Communications. This SoC was targeted for the Print Server market and contained a new CPU architecture called CRIS.
  The fourth generation, ETRAX100, broadened the ETRAX platform to other applications and increased performance both on network interface and processor.
  Other special purpose processors have been developed, e.g. for controlling a camera ASIC and a programmable I/O processor.

The CRIS CPU Architecture
  32-bit data and addresses.
  16-bit instruction width with some variable size instructions.
  RISC inspired instruction set but with complex addressing modes.
  16 general purpose 32-bit registers.
  Condition code register for compare and branch instructions.

Data Organization in Memory
  CRIS is a little endian CPU.
  Data has no alignment restrictions, but there is a performance penalty for unaligned data accesses.
  Instructions must be word aligned.

Instruction Format
  Basic instruction format is 16 bits and must be word aligned.
  Two register operands.
  Byte, word, dword operand size.
  Addressing mode.

    Bits    15-12      11-10  9-6     5-4   3-0
    Field   operand 2  mode   opcode  size  operand 1

ETRAX 100 Block Diagram
  [Figure: ETRAX 100 block diagram]
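The field layout on the Instruction Format slide can be turned into a small decoder. This is a sketch that uses only the bit positions shown on the slide. It does not interpret what any particular mode, opcode or size value means, and the names `Fields`, `decode`, `encode` and `fetch_le` are hypothetical.

```python
from typing import NamedTuple


class Fields(NamedTuple):
    operand2: int  # bits 15-12
    mode: int      # bits 11-10
    opcode: int    # bits 9-6
    size: int      # bits 5-4
    operand1: int  # bits 3-0


def decode(word: int) -> Fields:
    """Split a 16-bit instruction word into the five fields from the slide."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("instruction word must fit in 16 bits")
    return Fields(
        operand2=(word >> 12) & 0xF,
        mode=(word >> 10) & 0x3,
        opcode=(word >> 6) & 0xF,
        size=(word >> 4) & 0x3,
        operand1=word & 0xF,
    )


def encode(f: Fields) -> int:
    """Inverse of decode: pack the fields back into one 16-bit word."""
    return (
        (f.operand2 & 0xF) << 12
        | (f.mode & 0x3) << 10
        | (f.opcode & 0xF) << 6
        | (f.size & 0x3) << 4
        | (f.operand1 & 0xF)
    )


def fetch_le(memory: bytes, addr: int) -> int:
    """Read one 16-bit instruction word stored little endian at addr."""
    return int.from_bytes(memory[addr:addr + 2], "little")
```

Because CRIS is little endian, the low byte of the instruction word sits at the lower address, which is why `fetch_le` reads with `"little"` byte order.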
Axis Etrax FS
  [Figure: Axis Etrax FS]

Architectural Experiments
  Measurement of instruction and address traces on running product.
  Trace driven cache simulator to determine cache configuration and algorithms.
  Effects of expanding datapath from 16 to 32 bits.
  Analysis of instruction traces and static code to find possible instruction set improvements.
  Code analysis to find the effects of C++ on instruction mix.
  Sketches of changes to CPU pipelining.
  Gate-level remapping of CPU to new technology to estimate cycle time and pipelining.
  Sketches of a zero-copy DMA architecture for network and peripherals.

Axis Network Camera
  [Figure: Axis Network Camera]
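A trace driven cache simulator of the kind listed under Architectural Experiments can be very small. The sketch below is not the simulator used at Axis. It assumes a direct-mapped cache and a trace given as a plain list of byte addresses, and the function name and parameters are hypothetical.

```python
def simulate(trace, num_lines=4, block_size=16):
    """Replay an address trace through a direct-mapped cache.

    Returns (hits, misses). Only tags are tracked, since hit/miss
    behaviour does not depend on the data stored in the cache.
    """
    tags = [None] * num_lines  # one tag per cache line, None = invalid
    hits = misses = 0
    for addr in trace:
        block = addr // block_size   # which memory block the address is in
        index = block % num_lines    # the single line this block may occupy
        tag = block // num_lines     # identifies the block within that line
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag        # replace whatever was in the line
    return hits, misses
```

Running the same trace with different `num_lines` and `block_size` values is how such a simulator helps to choose a cache configuration: for example, a conflict between two blocks that map to the same line disappears when the cache has more lines.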
Design challenge -- optimizing design metrics
  Common metrics
    Unit cost: the monetary cost of manufacturing each copy of the system, excluding NRE cost
    NRE cost (Non-Recurring Engineering cost): the one-time monetary cost of designing the system
    Size: the physical space required by the system
    Performance: the execution time or throughput of the system
    Power: the amount of power consumed by the system
    Flexibility: the ability to change the functionality of the system without incurring heavy NRE cost

Design challenge -- optimizing design metrics
  Common metrics (continued)
    Time-to-prototype: the time needed to build a working version of the system
    Time-to-market: the time required to develop a system to the point that it can be released and sold to customers
    Maintainability: the ability to modify the system after its initial release
    Correctness, safety, many more

Design metric competition -- improving one may worsen others
  [Figure: the metrics Power, Performance, Size and NRE cost pulling against each other, next to the digital camera block diagram (lens, CCD, digital camera chip with A2D, CCD preprocessor, Pixel coprocessor, D2A, JPEG codec, Microcontroller, Multiplier/Accum, DMA controller, Display ctrl, Memory controller, ISA bus interface, UART, LCD ctrl), with Hardware and Software regions]
  Expertise with both software and hardware is needed to optimize design metrics
    Not just a hardware or software expert, as is common
    A designer must be comfortable with various technologies in order to choose the best for a given application and constraints

Design methodologies
  Heterogeneous systems: hardware (digital, analog), software
  Heterogeneous components: SoC, CPU, DSP, ASIC, bus, ...
  Heterogeneous requirements: performance, cost, power, ...
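The definitions of unit cost and NRE cost imply a simple trade-off: NRE is paid once, unit cost is paid per copy. The sketch below spells out that arithmetic. The function names and all numbers in the usage note are made up for illustration and do not come from the slides.

```python
def total_cost(nre, unit_cost, units):
    """One-time NRE cost plus the manufacturing cost of every copy."""
    return nre + unit_cost * units


def per_product_cost(nre, unit_cost, units):
    """Total cost spread over the number of units produced."""
    return total_cost(nre, unit_cost, units) / units


def break_even_units(nre_a, unit_a, nre_b, unit_b):
    """Volume at which technologies A and B have the same total cost.

    Assumes A has the lower NRE cost and the higher unit cost.
    """
    if not (nre_a < nre_b and unit_a > unit_b):
        raise ValueError("expected A to have lower NRE and higher unit cost")
    return (nre_b - nre_a) / (unit_a - unit_b)
```

As a hypothetical example, a technology with NRE 2000 and unit cost 100 is cheaper than one with NRE 100000 and unit cost 30 below 1400 units, and more expensive above that volume. This is one way in which improving one metric (NRE cost) can worsen another (unit cost).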
Hardware vs software
  Hardware
    performance
    power
    cost
  Software
    flexibility
    reconfigurability
    cost

Real time

Real time performance
  React to external environment
    Permanent interaction
    Endless execution
    External timing requirements
  Special application areas
    video
    process control
    medical applications
    airplane control - JAS
  Hard vs soft real time requirements
  Analyses
    WCET - Worst Case Execution Time
Processor technology
  The architecture of the computation engine used to implement a system's desired functionality
  Processor does not have to be programmable
    "Processor" not equal to general-purpose processor
  [Figure: three implementations of "total = 0; for i = 1 to ..."
    General-purpose ("software"): controller (control logic and state register, IR, PC), datapath (register file, general ALU), program memory holding the assembly code, data memory
    Application-specific: controller (control logic and state register, IR, PC), datapath (registers, custom ALU), program memory holding the assembly code, data memory
    Single-purpose ("hardware"): controller (control logic, state register), datapath (index, total, +), data memory]

Processor technology
  Processors vary in their customization for the problem at hand
  Desired functionality:
      total = 0
      for i = 1 to N loop
          total += M[i]
      end loop
  [Figure: the desired functionality mapped to a general-purpose processor, an application-specific processor and a single-purpose processor]

General-purpose processors
  Programmable device used in a variety of applications
    Also known as "microprocessor"
  Features
    Program memory
    General datapath with large register file and general ALU
  User benefits
    Low time-to-market and NRE costs
    High flexibility
  "Pentium" the most well-known, but there are hundreds of others
  [Figure: controller (control logic and state register, IR, PC), datapath (register file, general ALU), program memory holding the assembly code for "total = 0; for i = 1 to ...", data memory]
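The summation used as the running example on these slides can be written out in two styles to make the contrast concrete. This is only an illustration. Python lists are 0-based whereas the slide indexes M from 1 to N, and the second function is a hypothetical cycle-by-cycle model of a single-purpose datapath with `index` and `total` registers and one adder. Modelling one loop iteration per clock cycle is an assumption, not a figure from the slides.

```python
def sum_software(M):
    """The program as it would run on a general-purpose processor."""
    total = 0
    for i in range(len(M)):  # slide: for i = 1 to N
        total += M[i]
    return total


def sum_single_purpose(M):
    """Toy model of a single-purpose processor for the same task.

    The controller has two states; the datapath holds only the
    registers 'index' and 'total' and an adder. Returns the result
    and the number of modelled clock cycles.
    """
    index, total = 0, 0
    state = "LOOP"
    cycles = 0
    while state != "DONE":
        if index < len(M):
            total = total + M[index]  # datapath: add the next element
            index = index + 1         # datapath: advance the index
        else:
            state = "DONE"            # controller: leave the loop
        cycles += 1
    return total, cycles
```

The software version relies on a general instruction set, register file and ALU. The single-purpose model contains nothing beyond what this one computation needs, which is the source of the speed, power and size benefits claimed for single-purpose processors.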
Single-purpose processors
  Digital circuit designed to execute exactly one program
    a.k.a. coprocessor, accelerator or peripheral
  Features
    Contains only the components needed to execute a single program
    No program memory
  Benefits
    Fast
    Low power
    Small size
  [Figure: controller (control logic, state register), datapath (index, total, +), data memory]

Application-specific processors
  Programmable processor optimized for a particular class of applications having common characteristics
    Compromise between general-purpose and single-purpose processors
  Features
    Program memory
    Optimized datapath
    Special functional units
  Benefits
    Some flexibility, good performance, size and power
  [Figure: controller (control logic and state register, IR, PC), datapath (registers, custom ALU), program memory holding the assembly code for "total = 0; for i = 1 to ...", data memory]

Summary
  Important, found everywhere, high volume
  Hardware + software design
  Cover several areas
    microelectronics
    real time
    software + hardware
    SoC
  General purpose, application specific, single purpose