CS5222 Advanced Computer Architecture Lecture 1 Introduction
Overview
- Teaching Staff
- Introduction to Computer Architecture: History, Future / Trends, Significance
- The course: Content, Workload
- Administrative Matters 2
Who am I? Dr. Soo Yuen Jien Contact Information: Room: COM2 #02-61 Consultation Hour: Friday 3pm-5pm Wednesday after lecture Email me for other timing Email: sooyj@comp.nus.edu.sg Comments / Suggestions welcome 3
WHAT IS COMPUTER ARCHITECTURE? 4
Computer Architecture: Definition Architecture (in Computing): The organization of the components and functionalities of a system Computer Architecture: The study of computer (processor) architecture To maximize performance within constraints Typically classified into 3 categories: Instruction Set MicroArchitecture System Design 5
The 3 Categories Instruction Set The hardware/software interface Expose the functionalities to programmer MicroArchitecture Organization of components Techniques / Mechanisms for performance System Design Interconnection, data path Memory hierarchy 6
Computer Architecture VS Hardware Engineering Computer Architecture: Describes the behavior of the processor Describes the high-level mechanisms / techniques for better performance Hardware Engineering: Concerned with the actual implementation of the architecture Logic / circuit implementation, packaging, cooling, transistor process technology, etc. 7
Computer System: The brief history Let's review the progress of computer system in the past: 1. Follow the thread of "Personal" Computer 2. Another thread on High-end supercomputer Observe the progress in terms of: Speed ( Operations / Second ) Size Availability and Cost 8
The Brief History: 1946 - ENIAC ENIAC: World's first programmable electronic digital computer 1900 additions per second 18,000 vacuum tubes 30 tons, 80 by 8.5 feet 9
The Brief History: 1951 - UNIVAC UNIVAC: first commercial computer in the US Uses the von Neumann design 2000 additions per second for $1 million Sold 48 copies 10
The Brief History: 1964 IBM 360 IBM System/360: Six implementations with varying price, performance An example: 2MHz, 128KB-256KB memory, 500K operations/sec for $1M All binary compatible, redefines industry! 11
The Brief History: 1965 PDP-8 DEC PDP-8: first minicomputer 4k of 12-bit words 4 registers 330K operations per second for $16,000 sold 50,000 copies! 12
The Brief History: 1971 Intel 4004 Intel 4004: First microprocessor (single chip CPU) 4-bit processor for calculator 1KB data + 4KB program memory Only 2300 transistors 16-pin package 740KHz 100K operations per second 13
The Brief History: 1977 Apple II Apple II: first personal computer 1 Mhz clock, 4kB of RAM, $1300 ~200k operations per second 14
The Brief History: 1981 IBM PC IBM PC The system that shaped the IT industry as we know it Intel 8088 Processor 4.77 MHz, 16-256kB RAM 240K operations per second for $3000! 15
The Brief History: 2003 Pentium 4 Intel Pentium 4 processor Clock speed 3.0GHz for around $300 169 million transistors 6000M operations/sec 16
The Brief History: 2011 Intel i7 Intel Core i7 processor Clock speed 3.2GHz for around $500 ~120GFlops 17
The Brief History: Supercomputer Linpack Performance (teraflops):
- Nov 2008: Road Runner (US) - 1,105.0
- Nov 2009: Jaguar (US) - 1,759.0
- Nov 2010: TianHe (China) - 2,566.0
- Nov 2011: K Computer (Japan) - 10,510.0
- Nov 2012: Titan (US) - 17,590.0
- Nov 2013: TianHe 2 (China) - 33,826.7
18
Summary: From a few to many The transistor has been the building block of the CPU since the 1960s. Transistors per chip:
- 1970-1980: 2K - 100K
- 1980-1990: 100K - 1M
- 1990-2000: 1M - 100M
- 2000-2011: 100M - 2.2B
Current world population = 7 billion, about the number of transistors in 3 CPU chips! 19
Summary: From BIG to small Process size = minimum length of a transistor:
- 80286 (1982): 1.5 µm
- Pentium (1993): 0.80 µm - 0.25 µm
- Pentium 4 (2000): 0.180 µm - 0.065 µm
- Core i7 (2010): 0.045 µm - 0.032 µm
Wavelength of visible light = 350nm (violet) to 780nm (red). Process size is now smaller than the wavelength of violet light! 20
Summary: From S-L-O-W to fast FLOPS = FLoating-point Operations Per Second:
- 80286 (1982): 1.8 MIPS*
- Pentium (1993): 200 MFLOPS#
- Pentium 4 (2000): 4 GFLOPS#
- Core i7 (2011): 120 GFLOPS#
21
Summary: The Brief History Unprecedented progress since the late 1940s Performance doubling every ~2 years (1971-2005): a total of 36,000X improvement! If the transportation industry had matched this improvement, we could travel from Singapore to Shanghai in about a second for roughly a few cents! An incredible amount of innovation has revolutionized the computing industry again and again 22
GREAT!! (BUT IS THERE ANYTHING LEFT TO DO?) 23
Moore's Law Intel co-founder Gordon Moore "predicted" in 1965 that transistor density will double every 18 months 24
Growth in Processor Performance 25
Growth in Processor Performance
- Prior to mid-80s: largely technology driven; average 25% performance gain per year
- Mid-80s to 2002: technology, instruction set (RISC), and organization; average 52% performance gain per year (factor of seven gain from organization)
- 2002 onwards: average 20% performance gain per year
26
The Three Walls Three major reasons why growth in uniprocessor performance is no longer sustainable: 1. The Memory Wall: increasing gap between CPU and main memory speed 2. The ILP Wall: decreasing amount of "work" (instruction-level parallelism) available to the processor 3. The Power Wall: increasing power consumption of the processor 27
The Memory Wall Memory access speed increases at about 10% / yr Processor speed increases at about 50% / yr Memory is now more than an order of magnitude slower than the processor E.g. the Intel Core i7 has a 0.3ns cycle, while DDR3 SDRAM latency is ~10ns An increasing amount of chip area is dedicated to on-chip cache 28
The ILP Wall Instruction-Level Parallelism (ILP) is the number of instructions that can be executed in parallel The main source of performance for superscalar processors Implicit ILP, discovered on-the-fly by the processor, is very limited: ~3 instructions on average (it depends!!) Hence the move to explicit ILP: parallel programming and execution 29
The Power Wall We can now cram more transistors into a chip than we have the power to turn on! 30
Power Consumption: A comparison [figure: ~500 watts vs a fridge at ~600 watts; ~10 megawatts vs 1 HDB block at ~50 kilowatts] 31
The Power Wall: Challenges
- Mobile/Portable (cell phone, laptop, PDA): battery life is critical
- Desktop: 400 million computers in the world dissipate ~0.16 TW (TeraWatt = 10^12 Watt), equivalent to 26 nuclear power plants
- Data centers: a single server rack draws between 5 and 20 kW, and there are 100s of those racks in a single room
32
SO, HOW DO WE FIGHT THE WAR (WALL)? 33
Meeting the challenge Hyper-Threading Technology (HTT) in Xeon and Pentium 4 Allows one physical processor to appear and behave as two virtual processors to the operating system Two independent threads give more ILP! Intel dual-core (Pentium D) Multiple microprocessor cores on a single chip Copyright 2005 Intel 34
Parallelism saves Power Dynamic Power = C x V^2 x f (C = capacitance, V = voltage, f = clock frequency) Performance is proportional to clock frequency Exploit explicit parallelism to reduce power using additional cores Increased density (more transistors = more capacitance) Can increase cores (2x) and performance (2x) Or increase cores (2x) but decrease frequency (f/2) 35
Multicore Revolution Chip density is continuing to increase ~2x every 2 years Clock speed is not Number of processor cores may double instead 36
Multicore Revolution: Industry "We are dedicating all of our future product development to multicore designs. This is a sea change in computing" - Paul Otellini, President, Intel (2005) All microprocessor companies switched to MP (2X CPUs / 2 yrs) Procrastination results in 2X sequential perf. / 5 yrs Current state: Intel i7 has 6 cores The STI Cell processor (PS3) has 8 cores NVIDIA Tesla GPU has up to 512 cores Intel MIC has > 50 cores 37
Multicore/Manycore Roadmap Multicore: 2X / 2 yrs -> 64 cores in 8 years Manycore: 8X to 16X multicore [chart, 2003-2015: multicore line doubling 1, 2, 4, ..., 64; manycore line reaching 128, 256, 512] 38
Architecture Outlook Expect modestly pipelined processors Small cores not much slower than large cores Parallelism is energy efficient path to performance Lower threshold and supply voltages lowers energy per operation Small, regular processing elements easier to verify Heterogeneous processors Special function units to accelerate popular functions 39
Multicore: Impacts All major processor vendors are producing multicore chips Every machine will soon be a parallel machine All programmers will be parallel programmers??? Complexity may eventually be hidden in libraries, compilers, and high level languages But a lot of work is needed to get there Big open questions: What will be the killer apps for multicore machines? How should the chips be designed, and how will they be programmed? Many others.. 40
Parallel Revolution May Fail "...when we start talking about parallelism and ease of use of truly parallel computers, we're talking about a problem that's as hard as any that computer science has faced. I would be panicked if I were in industry." - John Hennessy, President, Stanford University, 1/07 100% failure rate of parallel computer companies: Convex, Encore, MasPar, NCUBE, Kendall Square Research, Sequent, (Silicon Graphics), Transputer, Thinking Machines, ... What if IT goes from a growth industry to a replacement industry? If SW can't effectively use multiple cores per chip, SW is no faster on a new computer, so people only buy when the computer wears out 41
Parallel Computing: A view from Berkeley Applications 1. What are the applications? 2. What are common kernels of the applications? Architecture and Hardware 3. What are the HW building blocks? 4. How to connect them? Programming Model and Systems Software 5. How to describe applications and kernels? 6. How to program the hardware? Evaluation 7. How to measure success? 42
Compiler Challenges Heterogeneous processors Increase in the design space for code optimization Auto-tuners: optimizing code at runtime Software controlled memory management Example: Cell processor 43
Parallel Programming Challenges Finding enough parallelism (Amdahl s Law) Granularity Locality Load balance Coordination and synchronization Debugging Performance modeling 44
BACK TO THE COURSE 45
What will we learn in CS5222? Instruction-Level Parallelism (ILP) Pipelining Dynamic Scheduling (Superscalar out-of-order) Static scheduling (VLIW processors) Branch Prediction Multi-threaded processors Multiprocessors Symmetric shared-memory architectures Synchronization Memory consistency Memory Hierarchy Design 46
Where can CS5222 take you? Advanced Compiler System Software Operating System High Performance Computing Parallel Computing 47
We expect you to know Computer Organization (CS2100) Multi-Core Architecture (CS4223): significant overlap in topics, but CS5222 goes more in-depth Instruction set concepts: RISC instruction set design philosophy, registers, instructions, etc. Simple pipelining Basic caches, main memory Low-level programming experience: C is very likely to be needed 48
Reference Computer Architecture: A Quantitative Approach, 4th Edition, Hennessy & Patterson, published by Morgan Kaufmann 49
Resources Primary and only information source is IVLE Workbin: Lecture notes Assignment submissions Forum: Ask course-related technical questions in the forum. Email is only for your personal concerns. 50
Assessment Final Exam: 50% Assignments: 30% 2-3 assignments Midterm: 20% Tentatively in week 7 (after term break). During normal lecture hours. 51