Twos Complement Signed Numbers
IT 3123 Hardware and Software Concepts
Modern Computer Implementations
April 26
Notice: This session is being recorded.
Copyright 2009 by Bob Brown
http://xkcd.com/571/

Reminder: Moore's Law
- The number of transistors per unit area doubles every 18-24 months. (So far.)
- Consequences: more power at the same price, or the same power at a lower price.
- Implication for designers: hardware features that were formerly expensive in chip real estate are becoming practical.

The Need for Speed
- "I think there is a world market for maybe five computers." Attributed to Thomas J. Watson, 1943.
- Today, available computing power is never enough: astronomy, pharmaceuticals, aircraft and automobile design, entertainment, seismography and mineral exploration, and many others.

Limitations
- The speed of light. (How long is a nanosecond?)
- Heat dissipation.
- Quantum mechanical effects when transistors or conductors become very small.

Parallelism
- Instead of one computer with a 0.001 ns cycle time, consider 1,000 computers with a 1 ns cycle time.
- The total computing capacity is theoretically the same in each case. (But using the capacity is harder in the second case.)
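The two back-of-the-envelope calculations above (the length of a nanosecond, and one fast CPU versus 1,000 slower ones) can be checked directly. This is a small illustrative sketch, not from the slides:

```python
import math

SPEED_OF_LIGHT = 299_792_458  # meters per second

# "How long is a nanosecond?" -- light travels about 30 cm in 1 ns,
# which bounds how far a signal can move per clock cycle.
distance_per_ns = SPEED_OF_LIGHT * 1e-9
print(f"Light travels {distance_per_ns * 100:.1f} cm in 1 ns")

# One CPU with a 0.001 ns cycle time vs. 1,000 CPUs with 1 ns cycle times:
single_cpu = 1 / 0.001e-9   # cycles per second
cluster = 1000 / 1e-9       # combined cycles per second
print(math.isclose(single_cpu, cluster))  # True: capacity is theoretically equal
```

Both configurations work out to 10^12 cycles per second; the catch, as the slide says, is that actually using 1,000 machines' worth of capacity is the hard part.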
Coupling
- Parallel computing systems can be characterized by their degree of coupling.
- Tightly coupled: high bandwidth and low delay between CPUs.
- Loosely coupled: lower bandwidth and higher delay.
- It's a continuum.

Degrees of Coupling

On-Chip Parallelism
- Instruction-level parallelism
- Multithreading
- Multiple CPUs per chip

More than One CPU on a Chip
- We can build chips with multiple CPUs ("cores").
- These cores share the same memory hierarchy.
- Cache coherence is less of a problem.
- Only one copy of code is needed.
- Fast inter-processor communication is possible, maybe even easy.

Multithreading
- Reminder: a process is a program in execution. Changing processes ("context switching") means saving the complete machine state.
- A thread is a lightweight process, requiring less than a full context switch.
- A single application cannot benefit from multiple cores (or multiple CPUs) without multithreading.

Hardware Support for Multithreading
- Suppose a core had several sets of registers and a hardware pointer to the current set.
- One could run a thread for each register set.
- Context-switching time would be effectively zero.
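The multithreading slides above note that threads are lighter than processes and that an application must be multithreaded to exploit multiple cores. A minimal Python sketch of the key property, threads in one process sharing a single address space (the work-splitting scheme here is illustrative, not from the slides):

```python
import threading

# Threads within one process share the same address space, so each
# worker can write its partial sum into one shared list -- no copying
# and no inter-process communication needed.
results = [0] * 4

def partial_sum(i, lo, hi):
    results[i] = sum(range(lo, hi))

threads = [threading.Thread(target=partial_sum, args=(i, i * 250, (i + 1) * 250))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sum(results))  # 499500, the same as sum(range(1000))
```

Whether the four threads actually run on four cores simultaneously depends on the runtime and the hardware; the point here is only the shared-memory communication model.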
Superscalar Architectures
- More than one functional unit is available, so several instructions can execute in the same cycle, provided the instructions are compatible.

Fine v. Coarse-Grained Multithreading
- Fine-grained multithreading: a thread switch occurs on each instruction.
- Coarse-grained multithreading: thread switches occur only when the current thread encounters a costly stall. Thread switching can be more expensive in time, because the pipeline must be re-filled.

Fine-Grained Multithreading
- Stalls are masked by running threads round-robin.
- There must be a thread for each stage of the pipeline.
- The number of threads is limited by the number of register sets.

Coarse-Grained Multithreading
- There may not be as many threads available as there are pipeline stages.
- Another approach is to switch only when there is a stall (or upon an instruction that might cause a stall).

Simultaneous Multithreading
- Remember superscalar processors? More than one functional unit (e.g., integer, floating-point, and memory) can allow more than one instruction to be completed per clock cycle.
- More than one thread can run at the same time, provided they use different functional units.

Superscalar CPUs and Multithreading
- Multithreading with a dual-issue superscalar CPU: (a) fine-grained multithreading, (b) coarse-grained multithreading, (c) simultaneous multithreading.
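How fine-grained multithreading masks stalls can be shown with a toy scheduling model. Everything here (the 3-cycle stall, 4 threads, 5 instructions each) is a made-up illustration, not taken from the slides:

```python
# Toy model: every instruction would stall the pipeline for 3 cycles
# if its thread ran back-to-back. With 4 threads rotated round-robin,
# by the time a thread issues again its stall has already elapsed.
STALL_CYCLES = 3
NUM_THREADS = 4
INSTRUCTIONS_PER_THREAD = 5

def fine_grained_cycles():
    """Issue one instruction per cycle, rotating through the threads."""
    cycles = 0
    remaining = [INSTRUCTIONS_PER_THREAD] * NUM_THREADS
    while any(remaining):
        for t in range(NUM_THREADS):
            if remaining[t]:
                remaining[t] -= 1
                cycles += 1   # the issue slot is never wasted on a stall
    return cycles

def sequential_cycles():
    """Run the threads one after another; each instruction pays its stall."""
    return NUM_THREADS * INSTRUCTIONS_PER_THREAD * (1 + STALL_CYCLES)

print(fine_grained_cycles())   # 20 cycles for 20 instructions
print(sequential_cycles())     # 80 cycles for the same work
```

The masking works here because the number of threads (4) exceeds the stall length (3 cycles), which is the same condition the slide states as "a thread for each stage of the pipeline."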
Hyperthreading on the Pentium 4
- Resource sharing between threads in the Pentium 4 NetBurst microarchitecture.

Multiple Cores without Multithreading
- Can function as a standard symmetric multiprocessing computer.
- The run queue has one entry per core. The operating system can dispatch a process to a core just as it would to a separate CPU.
- Problem: a single L1 cache, or one per core?

Very Long Instruction Word Computers
- Instructions for multiple functional units are packaged explicitly.
- The burden is on the compiler, not on the hardware. (Good.)

The Philips TriMedia VLIW CPU
- Designed expressly as an embedded processor for multimedia devices.
- Can issue five instructions per cycle.
- Byte-oriented memory; alignment required for half words and full words.
- 8-way set-associative split cache.
- 128 general registers. R0 = 0 and R1 = 1; storing to R0 or R1 is not allowed.
- Saturated arithmetic.
- No runtime checking. (The compiler has to be right!)

Example TriMedia Instruction

TriMedia Functional Units
- Not every instruction type can appear in every slot. (Next slide.)
- Empty slots are compacted, so instructions are of variable length.
- Each operation is predicated: IF R2 IADD R4,R5 -> R8
- The X's indicate which slots are valid for each instruction type.
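The TriMedia feature list mentions saturated arithmetic. A sketch of what saturation means, using signed 8-bit values for illustration (the width and the helper names are chosen for this example, not TriMedia-specific):

```python
def saturating_add(a, b, lo=-128, hi=127):
    """Add two signed 8-bit values, clamping at the limits instead of
    wrapping. Saturation suits media code: an overflowing pixel or audio
    sample should peg at full scale, not wrap around to the opposite
    extreme."""
    return max(lo, min(hi, a + b))

def wrapping_add(a, b):
    """Ordinary two's-complement 8-bit addition, for comparison."""
    return (a + b + 128) % 256 - 128

print(saturating_add(100, 100))  # 127  (clamped at the maximum)
print(wrapping_add(100, 100))    # -56  (wrapped around)
```

For a brightness or volume computation, 127 is a far more useful answer than -56, which is why media-oriented CPUs offer saturation in hardware.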
Flynn's Taxonomy of Parallel Computers

Pitfalls
- You have to measure execution time. Observed performance is likely to be much less than the combined performance of n processors.
- In other words, 1,000 one-nanosecond processors do not equal one 0.001 ns processor, sadly.

Case: The Google Cluster
- Google: leader in Web searches ("to Google" is now a verb!)
- Free, advertising-supported service.
- An average of 1,000 queries per second at the time the Patterson and Hennessy text was written. (Now many more!)
- Goal: 1/2 second response time, including network latency.
- Data center design is a competitive advantage.

Google's Clusters
- Four data centers at the time the text was written. Fifteen in the US as of fall 2008, including two in the Atlanta area. At least five in Europe.
- At least $600 million apiece!
- New locations value cheap electricity, available water, low taxes, and access to good Internet connections.

What's in a Cluster
- Thousands of 1RU PCs, each with two disks.
- Patented power supplies that include a battery.
- The Google File System (GFS): a replicating file system. Data is replicated within data centers and across data centers.
- Proprietary Web server software.
- OC48 (2.4 Gb/s) links, with backup OC12 links.

A Google Cluster
Tanenbaum, Structured Computer Organization, Fifth Edition, (c) 2006 Pearson Education, Inc. All rights reserved. 0-13-148521-0
Processing a Google Query

Reliability
- Software is the biggest source of failures: 20 PC reboots per day (textbook).
- 2-3% of PCs per year have hardware failures, mainly non-ECC DRAM and disk failures.

Grid Computing
- The grid layers.

Questions
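The 2-3% annual hardware failure rate above implies a steady trickle of broken machines in a large cluster, which is why the design assumes replication rather than reliable nodes. A rough calculation, using a hypothetical 15,000-machine cluster (the cluster size is illustrative, not from the slides):

```python
# Expected hardware failures implied by a 2-3% per-machine annual rate.
# The 15,000-machine figure is an assumption for illustration only.
machines = 15_000

for annual_rate in (0.02, 0.03):
    failures_per_year = machines * annual_rate
    print(f"{annual_rate:.0%}: {failures_per_year:.0f} failures/year, "
          f"about {failures_per_year / 365:.1f} per day")
```

At roughly one hardware failure per day, on top of the software reboots, repair has to be a routine operation; the replicated GFS design described earlier lets the cluster keep serving while individual machines are down.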