Two's Complement Signed Numbers. IT 3123 Hardware and Software Concepts. Reminder: Moore's Law. The Need for Speed. Parallelism.

Two's Complement Signed Numbers
IT 3123 Hardware and Software Concepts
Modern Computer Implementations
April 26
Notice: This session is being recorded. Copyright 2009 by Bob Brown. http://xkcd.com/571/

Reminder: Moore's Law
The number of transistors per unit area doubles every 18-24 months. (So far.)
Consequences: more power at the same price, or the same power at a lower price.
Implication for designers: hardware features that were formerly expensive in chip real estate are becoming practical.

The Need for Speed
"I think there is a world market for maybe five computers." Attributed to Thomas J. Watson, 1943.
Today, available computing power is never enough:
- Astronomy
- Pharmaceuticals
- Aircraft and automobile design
- Entertainment
- Seismography and mineral exploration
- Many others

Limitations
- The speed of light. (How long is a nanosecond?)
- Heat dissipation.
- Quantum mechanical effects when transistors or conductors become very small.

Parallelism
Instead of one computer with a 0.001 ns cycle time, consider 1,000 computers with a 1 ns cycle time. The total computing capacity is theoretically the same in each case. (But using the capacity is harder in the second case.)
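
To make the capacity claim concrete, here is the back-of-the-envelope arithmetic (my sketch, not from the slides): aggregate capacity is the number of machines divided by the cycle time, and both designs come out to the same figure.

$$\text{capacity} = \frac{n}{t_{\text{cycle}}}, \qquad \frac{1000}{1\ \text{ns}} = \frac{1}{0.001\ \text{ns}} = 10^{12}\ \text{cycles/second}.$$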

Coupling
Parallel computing systems can be characterized by degree of coupling.
Tightly coupled: high bandwidth and low delay between CPUs.
Loosely coupled: lower bandwidth, higher delay.
It's a continuum.

Degrees of Coupling

On-Chip Parallelism
- Instruction-level parallelism
- Multithreading
- Multiple CPUs per chip

More than One CPU on a Chip
We can build chips with multiple CPUs ("cores"). These cores share the same memory hierarchy, so:
- Cache coherence is less of a problem.
- Only one copy of the code is needed.
- Fast inter-processor communication is possible, maybe even easy.

Multithreading
Reminder: a process is a program in execution. Changing processes ("context switching") means saving the complete machine state. A thread is a lightweight process, requiring less than a full context switch. A single application cannot benefit from multiple cores (or multiple CPUs) without multithreading. (A minimal threading sketch follows below.)

Hardware Support for Multithreading
Suppose a core had several sets of registers and a hardware pointer to the current set. One could run a thread for each register set, and context-switching time would be effectively zero.
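
Since the slides stop at the concept, here is a minimal POSIX-threads sketch of a single application spreading work across threads, and hence across cores. The thread count, worker function, and work split are illustrative assumptions, not anything from the course.

```c
/* Minimal POSIX-threads sketch: one process, several threads,
 * so a multi-core chip can run parts of the same application at once.
 * Compile with: cc -pthread demo.c */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4  /* illustrative; often matched to the core count */

static long partial[NTHREADS];

static void *worker(void *arg) {
    long id = (long)arg;
    long sum = 0;
    /* Each thread sums its own interleaved slice of 0..999999. */
    for (long i = id; i < 1000000; i += NTHREADS)
        sum += i;
    partial[id] = sum;
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    long total = 0;
    for (long i = 0; i < NTHREADS; i++) {
        pthread_join(tid[i], NULL);  /* wait for the thread to finish */
        total += partial[i];
    }
    printf("sum = %ld\n", total);
    return 0;
}
```

Each thread shares the process's address space and carries far less state than a full process, which is why switching among threads is cheaper than a full context switch.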

Superscalar Architectures
More than one functional unit is available, so several instructions can execute in the same cycle, provided the instructions are compatible.

Fine v. Coarse Grained Multithreading
Fine-grained multithreading: a thread switch occurs on each instruction.
Coarse-grained multithreading: thread switches occur only when the current thread encounters a costly stall. Thread switching can be more expensive in time, because the pipeline must be re-filled. (A toy model contrasting the two policies appears after this group of slides.)

Fine-Grained Multithreading
Stalls are masked by running threads round-robin. There must be a thread for each stage of the pipeline. The number of threads is limited by the number of register sets.

Coarse-Grained Multithreading
There may not be as many threads available as there are pipeline stages. Another approach is to switch only when there is a stall (or upon an instruction that might cause a stall).

Simultaneous Multithreading
Remember superscalar processors? More than one functional unit (e.g. integer, floating-point, and memory) can allow more than one instruction to be completed per clock cycle. More than one thread can run at the same time, provided they use different functional units.

Superscalar CPUs and Multithreading
Multithreading with a dual-issue superscalar CPU: (a) fine-grained multithreading, (b) coarse-grained multithreading, (c) simultaneous multithreading.
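
As promised above, a toy model of the two thread-selection policies. This is my illustration, not course material: real hardware makes this choice in the issue logic every cycle, and the "stall every fourth cycle" pattern is invented for the demo.

```c
/* Toy model of thread-selection policies (illustrative only). */
#include <stdio.h>

#define NTHREADS 3
#define CYCLES   9

int main(void) {
    /* Fine-grained: switch to the next thread on every cycle. */
    printf("fine-grained:   ");
    for (int cycle = 0; cycle < CYCLES; cycle++)
        printf("T%d ", cycle % NTHREADS);

    /* Coarse-grained: stay on one thread until it stalls
     * (here we pretend a costly stall hits every 4th cycle). */
    printf("\ncoarse-grained: ");
    int t = 0;
    for (int cycle = 0; cycle < CYCLES; cycle++) {
        printf("T%d ", t);
        if (cycle % 4 == 3)          /* costly stall, e.g. cache miss */
            t = (t + 1) % NTHREADS;  /* switch only now */
    }
    printf("\n");
    return 0;
}
```

The first line of output rotates T0 T1 T2 ... every cycle; the second stays on one thread until the simulated stall forces a switch.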

Hyperthreading on the Pentium 4
Resource sharing between threads in the Pentium 4 NetBurst microarchitecture.

Multiple Cores without Multithreading
Can function as a standard symmetric multiprocessing computer. The run queue has one entry per core, and the operating system can dispatch a process to a core just as it would to a separate CPU. Problem: a single L1 cache, or one per core?

Very Long Instruction Word Computers
Instructions for multiple functional units are packaged explicitly. The burden is on the compiler, not on the hardware. (Good.)

The Philips TriMedia VLIW CPU
- Designed expressly as an embedded processor for multimedia devices.
- Can issue five instructions per cycle.
- Byte-oriented memory; alignment required for half words and full words.
- 8-way set-associative split cache.
- 128 general registers. R0=0, R1=1; storing to R0 or R1 is not allowed.
- Saturated arithmetic (sketched below).
- No runtime checking. (The compiler has to be right!)

Example TriMedia Instruction / TriMedia Functional Units
Not every instruction type can appear in every slot. Empty slots are compacted, so instructions are of variable length. Each operation is predicated, e.g. IF R2 IADD R4,R5 -> R8. The X's indicate which slots are valid for each instruction type.
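
Saturated arithmetic clamps a result at the representable limit instead of wrapping around, which is what you want for pixel and audio samples. A minimal C sketch of the idea (my illustration, not TriMedia code):

```c
#include <stdint.h>
#include <stdio.h>

/* Saturating 8-bit unsigned add: clamps at 255 instead of wrapping,
 * the behavior TriMedia-style media instructions provide in hardware. */
static uint8_t sat_add_u8(uint8_t a, uint8_t b) {
    unsigned sum = (unsigned)a + b;   /* widen so the carry survives */
    return (uint8_t)(sum > 255 ? 255 : sum);
}

int main(void) {
    printf("%u\n", sat_add_u8(200, 100));  /* prints 255, not 44 */
    return 0;
}
```

With ordinary wrapping arithmetic, brightening a near-white pixel would flip it to near-black; saturation keeps it white.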

Flynn's Taxonomy of Parallel Computers

Pitfalls
You have to measure execution time. Observed performance is likely to be much less than the combined performance of n processors. In other words, 1,000 one-ns processors do not equal one 0.001 ns processor, sadly. (Amdahl's law, sketched below, quantifies why.)

Case: The Google Cluster
Google: leader in Web searches ("to Google" as a verb!). A free, advertising-supported service. An average of 1,000 queries per second at the time the Patterson and Hennessy text was written. (Now many more!) Goal: 1/2 second response time, including network latency. Data center design is a competitive advantage.

Google's Clusters
Four data centers at the time the text was written; fifteen in the US as of fall 2008, including two in the Atlanta area, and at least five in Europe. At least $600 million apiece! New locations value cheap electricity, available water, low taxes, and access to good connections to the Internet.

What's in a Cluster
- Thousands of 1RU PCs, each with two disks
- Patented power supplies that include a battery
- The Google File System (GFS): a replicating file system, with data replicated within data centers and across data centers
- Proprietary Web server software
- OC48 (2.4 Gb/s) links, with backup OC12 links

A Google Cluster
Tanenbaum, Structured Computer Organization, Fifth Edition, (c) 2006 Pearson Education, Inc. All rights reserved. 0-13-148521-0
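
Returning to the Pitfalls slide: the shortfall it describes is conventionally quantified by Amdahl's law (an addition here, not on the original slides). If a fraction $p$ of a program's work can be spread over $n$ processors while the rest stays serial, the best possible speedup is

$$\text{speedup}(n) = \frac{1}{(1-p) + p/n}.$$

Even with $p = 0.95$ and $n = 1000$, the speedup is $1/(0.05 + 0.00095) \approx 19.6$, nowhere near 1,000.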

Processing a Google Query

Reliability
Software is the biggest source of failures. 20 PC reboots/day (textbook). 2-3% of PCs per year have hardware failures, mainly non-ECC DRAM and disk failures.
Tanenbaum, Structured Computer Organization, Fifth Edition, (c) 2006 Pearson Education, Inc. All rights reserved. 0-13-148521-0

Grid Computing
The grid layers.

Questions