A Multiprocessor Memory Processor for Efficient Sharing And Access Coordination


Slide 1: A Multiprocessor Memory Processor for Efficient Sharing And Access Coordination
David M. Koppelman
Department of Electrical & Computer Engineering, Louisiana State University, Baton Rouge
koppel@ee.lsu.edu, http://www.ee.lsu.edu/koppel
ISCA-97-IRAM. Formatted 15:22, 20 December 1997 from iram.

Slide 2: Outline
- Multiprocessor Problems
- Smart Memory Solution
- Nested Loops Example
- Simulation Evaluation
- Related Work
- Feasibility
- Future Work
Presentation note: slides with two-level numbering will be covered if interest is expressed.

Slide 3: Multiprocessor Problems
Miss Latency
- Penalty can be 100s of cycles.
- Substantial impact on some programs.
Synchronization Overhead
- Barrier overhead possibly 1000s of cycles.
- Locks, rendezvous, etc. also time-consuming.
As a result...
- ...some useful programs are slowed...
- ...and many algorithms are unimplementably inefficient.

Slide 4: Exo-Processor System
Solution Derived From a Typical Multiprocessor
- Processing elements mostly normal.
- Memory modules (of course) can process.
- A sequentially consistent shared address space is assumed...
- ...but not required.

Slide 4-1: Processing Element
Processing Element Features
- Conventional processor (CPU).
- Very-low-latency message dispatch capability.

Slide 5: Memory Module
Memory Module Features
- Uses a single network interface (NWI).
- Contains multiple memory units.
Memory Unit Features
- DRAM, implements the address space.
- Exo-processor (XP) controls the unit.
- Packet buffer used by the XP.

Slide 6: Exo-Processor
Exo-Processor Capabilities
- Executes context units called exo-packets.
- Exo-packets are stored in the packet buffer.
- Performs arithmetic and logical operations.
- Loads and stores only to its own memory bank.
- Execution is slow compared to the CPU...
- ...but it can react quickly to changes in memory.

Slide 7: Exo-Packet
Exo-Packet Contents
- Procedure identifier (PID).
- Control state (CS) (~ program counter).
- Literal operands.
- Memory-address operands.
- One memory address is used for routing.
Example: Two-Operand Exo-Packet (Base System). [Packet-format diagram not transcribed.]

Slide 8: Execution Overview
- CPU issues an exo-packet...
- ...without blocking.
- The exo-packet is sent to the memory unit holding the address.
- The exo-processor typically:
  - reads an operand from memory...
  - ...performs the operation...
  - ...and sends the exo-packet on to another exo-processor.
Operations that might be performed in a step:
- Read, operate, write to the exo-packet.
- Read, operate, write to the same address.
- Read, wait, operate, write tag, etc.

Slide 8-1: Exo-Op *a = *b + c Example
1: Packet dispatched (ADDML, step 1), carrying literal c and addresses a and b.
2: Packet dispatched (ADDML, step 2), carrying *b + c and address a.
3: Packet dispatched (ADDML, step 3).
4: Acknowledge: in-progress counter decremented.
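The dispatch sequence above can be sketched in software, collapsing it into two memory-side steps. The struct and function names here are hypothetical, not from the paper:

```c
#include <assert.h>

/* A software sketch of the ADDML exo-op (*a = *b + c). Each call to
   exo_step models one execution step at the memory unit owning the
   packet's routed address. */
typedef struct {
    int step;   /* control state (CS), playing the role of a program counter */
    int lit;    /* literal operand; after step 1 it carries *b + c */
    int *a, *b; /* memory-address operands; one is used for routing */
} exo_packet;

void exo_step(exo_packet *p)
{
    if (p->step == 1) {
        p->lit += *p->b;  /* at b's unit: read *b, fold it into the literal */
        p->step = 2;      /* forward the packet to the unit holding a */
    } else {
        *p->a = p->lit;   /* at a's unit: write the result; ack would follow */
    }
}
```

The key property shown is that the CPU only constructs the initial packet; all operand access happens at the memory units, with the partial result riding in the packet.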

Slide 8-2: Tagged Operations
Tag Overview
- Memory locations are associated with tags...
- ...possibly not part of addressed storage.
- Memory operations test and set tags.
- Similar to presence bits, full/empty bits, etc....
- ...except that tags may be integers (requiring storage).
Typical Operation
1: Check the tag against the needed value.
2: If not equal, wait.
3: When equal, complete the operation...
- ...possibly changing the tag.
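The typical operation above can be sketched as follows (names and the two-tag encoding are illustrative assumptions; in hardware a mismatch would leave the exo-packet waiting in the packet buffer, whereas here the caller simply retries):

```c
#include <assert.h>

/* A software sketch of a tagged memory operation. Each location
   carries an integer tag; an operation completes only when the tag
   matches the needed value, and may rewrite the tag. */
#define TAG_EMPTY 0
#define TAG_FULL  1

typedef struct {
    int tag;    /* integer tag; may need storage beyond the word itself */
    int value;  /* the addressed word */
} tagged_word;

/* Tagged read: succeeds only when the tag equals `need`, then rewrites
   the tag to `set_to`. Returns 1 on success, 0 when it must wait. */
int tagged_read(tagged_word *w, int need, int set_to, int *out)
{
    if (w->tag != need)
        return 0;        /* tag not equal: wait */
    *out = w->value;     /* tag equal: complete the operation... */
    w->tag = set_to;     /* ...possibly changing the tag */
    return 1;
}

/* A write that fills the location and marks it FULL. */
void tagged_write(tagged_word *w, int v)
{
    w->value = v;
    w->tag = TAG_FULL;
}
```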

Slide 8-3: Exo-Op Tagged *a = *b Example
1: Packet dispatched (CPFE, step 1), carrying addresses a and b.
2: Tag assumed empty.
3: Another exo-op sets the tag full.
4: Packet dispatched (CPFE, step 2), carrying a and *b.
5: No acknowledgment for this case.

Slide 9: Example, Nested Loops
Properties
- Two-level nested loop.
- Sums depend upon the previous outer iteration.
- Sums within an outer iteration are independent.
- One operand is not statically determinable.

for( outer = 0; outer < outer_end; outer++ ){
  int *perm = perms[outer];
  /* On odd iterations... */
  if( outer & 0x1 )
    for( inner = 0; inner < size; inner++ )
      A[inner] = B[inner] + B[ perm[inner] ];
  /* On even iterations... */
  else
    for( inner = 0; inner < size; inner++ )
      B[inner] = A[inner] + A[ perm[inner] ];
}

Slide 9-1: Conventional Parallel Implementation
Changes
- Inner loop iterations partitioned evenly.
- Elements A[i] and B[i] placed in a nearby module...
- ...for start <= i < stop.
- Thus, access is fast for 2 of the 3 accesses.

for( outer = 0; outer < outer_end; outer++ ){
  int *perm = perms[outer];  /* Return a permutation. */
  barrier();
  /* On odd iterations... */
  if( outer & 0x1 )
    for( inner = start; inner < stop; inner++ )
      A[inner] = B[inner] + B[ perm[inner] ];
  /* On even iterations... */
  else
    for( inner = start; inner < stop; inner++ )
      B[inner] = A[inner] + A[ perm[inner] ];
}

Slide 9-2: Conventional Parallel Implementation
Execution Delays
- Possible miss for each non-local access.
- Barrier overhead.
- Conservative control dependencies due to the barrier.
(Code as on slide 9-1.)

Slide 9-3: Execution of Conventional Loop Code
[Chart: "Conventional Execution of Nested Loops." Processor activity (processors 0-16) vs. time, 90,000-140,000 cycles. Legend: idle; misc. busy; in barrier, idle; in barrier, busy; loop iterations by type (exo/conventional, odd/even).]
Gaps indicate barrier delays.

Slide 9-4: (Over) Simplified Exo Implementation
Changes
- Separate storage for adjacent iterations.
- A write sets the tag FULL (not shown).
- A read waits for a FULL tag (marked below as comments; the slide uses a tag annotation on each read).

for( outer = 0; outer < outer_end; outer++ ){
  int *perm = perms[outer];  /* Return a permutation. */
  for( inner = start; inner < stop; inner++ )
    A[outer+1][inner] = A[outer][inner]            /* wait for FULL */
                      + A[outer][ perm[inner] ];   /* wait for FULL */
}

Slide 9-5: Compilation
- Add procedure compiled for the exo-processor.
Execution at the CPU
- An exo-packet is dispatched for each inner iteration.
- No blocking: the CPU does not wait after dispatch.
(Code as on slide 9-4.)
Execution Pacing
- No dependencies will stall the loop...
- ...but it may stall to avoid congestion.

Slide 9-6: Actual (Simulated) Exo Implementation
- If outer is large, substantial storage is needed.
Solution 1: Write only after the two reads finish.
Solution 1 disadvantages:
- Elements must be accessed twice.
- A lot of packet buffer storage is used.
Solution 2 (used here): Provide storage for x outer iterations.
- Assume one PE is rarely more than x ahead of the others.
- Storage is reused; the tag identifies the outer iteration.
- Reads check the tag...
  - ...if too low, they wait...
  - ...if too high, a fault routine is called.
Solution 2 disadvantages:
- More storage than solution 1.
- Restarting after a fault is expensive or impossible.
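The tag check in Solution 2 can be sketched as a three-way decision (names hypothetical): storage for x outer iterations is reused in rotation, and each slot's tag records the outer iteration that last wrote it.

```c
#include <assert.h>

/* Decide what a read of a reused slot must do, given the outer
   iteration whose value it wants. */
enum tag_action { PROCEED, WAIT, FAULT };

enum tag_action check_iter_tag(int slot_tag, int wanted_iter)
{
    if (slot_tag == wanted_iter)
        return PROCEED;  /* slot holds the value from the right iteration */
    if (slot_tag < wanted_iter)
        return WAIT;     /* the writer hasn't produced this iteration yet */
    return FAULT;        /* slot already reused: a PE got more than x ahead */
}
```

Integer tags (rather than a single full/empty bit) are what make this possible, at the cost of tag storage.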

Slide 10: Execution Illustration
[Chart: "Conventional Execution of Nested Loops." Processor activity vs. time, 90,000-140,000 cycles.]
[Chart: "Exo and Conventional (Partial) Execution of Nested Loops." Processor activity vs. time, 50,000-110,000 cycles.]
- Note the overlap in exo execution.
- Note the barrier gaps in conventional execution.

Slide 11: Execution Illustration
[Chart: "Stalls Due to Exceeded Threshold." Processor activity vs. time, 60,000-110,000 cycles; legend adds an exo-stall category.]
[Chart: "Traffic Generated." Rate (flits per 128 cycles) vs. time, 60,000-110,000 cycles.]

Slide 12: Experimental Evaluation
Goal: determine sensitivity to:
- Network latency and bandwidth.
- Memory speed.
- Exo-op execution time.
Goal: determine speedup of:
- Program fragments with selected properties.
- A radix sort program.
Methodology
- Execution-driven simulation using Proteus L3.10.
- Run illustrative code fragments...
- ...and a modified SPLASH-2 radix kernel.
- Vary the performance of key elements (e.g., link width).
- Determine bottlenecks.

Slide 13: Programs
Histogram
- Small operations on widely shared data.
- Suffers from contention and false sharing.
- Very well suited to exo-ops.
Nested Loops
- Three-operand operations with dependencies.
- No spatial locality on one source operand.
- Well suited to exo-ops.
Vector Sum (a[i] = b[i] + c)
- Spatial locality on all operands.
- For the conventional version, data not initially cached.
- Not appropriate for exo-ops (as described above).
In-Place and Copy Permutation
- No spatial locality on the destination operand.
- Conventional implementation uses temporary storage.
- Favorable layout for the conventional version.

Slide 14: Programs
SPLASH-2 Radix
Exo-ops used for:
- Prefix sum.
- Permutation.
Conventional implementation tuned.

Slide 15: Simulation Parameters (Base System)
System and Network
- 16 processors, 4x4 mesh.
- Network interface shared by the CPU and memory modules.
- 3-cycle wire delay.
- 3-byte link width.
Exo-Ops
- 21-cycle exo-op step latency.
- 12-cycle exo-op step resource usage (~ initiation interval).
- 25 in-progress exo-op stall threshold.
Memory System
- 42-cycle memory latency.
- Four banks per memory module.
- 8-way set-associative cache, 16-byte lines, 2^11 sets.
- 3-cycle cache hit latency.

Slide 15-1
Time Units
- Time given in scaled cycles.
- 3 scaled cycles = 1 real cycle.
Memory and Cache
- 32-bit address space.
- Sequential consistency model (w/o exo-ops).
- Full-map directory cache coherence.
- Four banks per memory module.
- Memory banks buffered (last block accessed).
- 42-cycle memory miss latency.
- 12-cycle memory hit latency.
- 8-way set-associative cache, 16-byte lines, 2^11 sets.
- 3-cycle cache hit latency.

Slide 15-2
Exo-Packet Size
- 8 bytes...
- ...plus data...
- ...plus four bytes per value tag.
Exo-Op Timing
- Issue (CPU instruction slots): 2 cycles + 1 cycle per operand.
- Construction time (issue to network injection): 6 cycles.
Execution of an Exo-Op Step
- 21-cycle latency.
- 12-cycle resource usage (~ initiation interval).
- Plus time for the operand memory access.
Flow Control
- 25 in-progress exo-op stall threshold.

Slide 16: Simulation Results for the Base System

Program          Conv.    Exo   Speedup
Radix             13.5    9.5     1.4
Histogram        156.3   17.1     9.1
Loops            215.6   84.8     2.5
Vector            75.8   38.8     2.0
In-Place Perm.   331.8   78.2     4.2
Copy Perm.       107.0   35.7     3.0

Timing given in cycles divided by input size.
Conclusions
- Whole-program speedup is usable.
- Very significant speedup on portions.
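The speedup column above is simply the ratio of the two timing columns, both in cycles per input element; a one-line check:

```c
#include <assert.h>
#include <math.h>

/* Speedup of the exo version over the conventional version, given
   per-element timings (cycles / input size) from the results table. */
double speedup(double conv_cycles, double exo_cycles)
{
    return conv_cycles / exo_cycles;
}
```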

Slide 17: Base System Bottlenecks
Can the rest of the system keep up with exo-op issue? Of course not!
Consider a two-step, two-tag exo-op.
- Issue time is 7 cycles.
- If the CPU repeatedly issued these, to keep up...
  - ...memory would have to be 6x faster...
  - ...the network interface would have to be 2.86x faster...
  - ...and the exo-processor would have to be 1.71x faster.
Not as unbalanced as the numbers suggest.
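The ratios above follow from dividing each resource's per-exo-op occupancy by the 7-cycle issue time. The 42-cycle memory latency and 12-cycle exo-op step occupancy come from the parameter slides; the 20-cycle network-interface occupancy used below is not stated and is inferred from the quoted 2.86x figure.

```c
#include <assert.h>
#include <math.h>

/* If the CPU issues one exo-op every `issue_cycles`, a resource that
   is occupied for `occupancy_cycles` per exo-op must be sped up by
   this factor to keep pace. */
double required_speedup(double occupancy_cycles, double issue_cycles)
{
    return occupancy_cycles / issue_cycles;
}
```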

Slide 18: Exo-Processor Speed Experiments
[Chart: normalized execution time vs. exo-processor latency (3/1 through 90/69) for the six exo programs.]
Exo-Processor Normally Faster Than Memory & Network
Sensitive programs:
- Nested loops, because of tags.
- Vector: locality avoids the memory bottleneck...
- ...exposing the exo-processor bottleneck.
Others work well with a slower exo-processor.

Slide 18-1: Memory Latency Experiments
[Chart: normalized execution time vs. memory access latency (9-99 cycles) for conventional (C) and exo (E) versions of each program.]
- Conventional programs are more affected.
- Exo programs hide the longer memory latency.
- Loops less sensitive because of heavy exo-processor use.

Slide 18-2: Network Latency Experiments
[Chart: normalized execution time vs. link latency/bandwidth (3/3 through 42/3) for conventional and exo versions.]
x/y notation: x-cycle link delay (pipelined), y-byte-per-cycle link bandwidth.
High Latency
- Conventional programs suffer.
- Exo programs almost unaffected.

Slide 18-3: Network Bandwidth Experiments
[Chart: normalized execution time vs. link latency/bandwidth (3/9 through 3/0.5) for conventional and exo versions.]
x/y notation: x-cycle link delay (pipelined), y-byte-per-cycle link bandwidth.
Low Bandwidth
- Both exo and conventional programs are slowed.
- Speedup is nevertheless higher.

Slide 18-4: Cache Size Experiments
[Chart: normalized execution time vs. cache size (2^5 through 2^13 sets) for conventional and exo versions.]
Cache Size
- Fragments use a small amount of memory.
- A small cache reduces speedup on radix.

Slide 18-5: Line Size Experiments
[Chart: normalized execution time vs. cache line size (8 bytes through 512 bytes) for conventional and exo versions.]
- The tag-wait mechanism works poorly with long lines.
- The effect varies with access characteristics.

Slide 18-6: System Size (Processors)
[Chart: normalized execution time vs. system size (4-64 processors) for conventional and exo versions.]
- Input scaled (constant input per processor).
- Because of latency hiding, exo does better in large systems.

Slide 19: Results Summary
[Charts: the sensitivity plots from slides 18 through 18-6, repeated in miniature.]
Exo-Processor Speed
- A bottleneck in some programs.
- Slower speed tolerable for others.
Memory Latency
- Effectively hidden; slower memories are usable.
Network Latency
- Effectively hidden.
Network Bandwidth
- Slows both conventional and exo programs.
- But speedup is higher with lower bandwidth.

Slide 20: Related Work, Remote Operations
Nonlocally Executed Atomic Operations
- E.g., fetch-and-add, compare-and-swap.
- Implemented in the Cray T3E and Cedar.
Similarities
- CPU quickly issues the operation.
- Operations are small.
Difference
- Because of their size, limited applicability to computation.

Slide 21: Related Work, Remote Operations
Protocol Processors
- Dedicated processor for the coherence protocol.
- Used in Stanford FLASH.
Similarity
- Dedicated processor for memory.
Differences
- No specific mechanism for tag waiting.
- Not used for computation.

Slide 22: Related Work, Remote Operations
Active Message Machines
- Mechanism provided for fast sending, receiving, and return.
- Examples include *T, EM-X, and the M-Machine.
Similarities
- Low-latency message dispatch and receipt.
- Messages used for memory access.
- Tag bits.
Differences
- Message handlers may run on the CPU, slowing other tasks.
- Employ multithreaded or dataflow paradigms...
- ...which may favor different problems than exo systems.

Slide 23: Related Work, Latency Hiding
Low-Switch-Latency Multithreaded Parallel Machines
- Some switch between threads in a few cycles.
- Some can issue from any active resident thread.
- Used in many current research projects.
Similarity
- An operation in one thread can start while another waits.
Difference
- Overhead is consumed for each thread (e.g., index bounds).

Slide 24: Feasibility Assumptions
- Programs exhibit misses, contention, and synchronization delays.
- No well-behaved alternate implementations.
- Delays significantly impact execution time.
- There exist many such programs...
- ...or users willing to pay for maximum performance.
- Multithreading sometimes adds too much overhead.
- Communication bandwidth not too expensive.
- Latency can be high.
- Cost of an exo-processor much less than a CPU.
- An exo-processor doesn't need...
  - ...pipelined execution units...
  - ...cache, address translation, etc....
  - ...or an additional network port.

Slide 25: Further Work
Algorithm-Driven Development
- Implement algorithms too fine-grained for current machines.
Vector Exo-Ops
- Exploit spatial locality using large word sizes in DRAM.
Comparison with Multithreaded Machines
- Compare exo-scalar and multiple-thread overheads.
- Compare ease of discovering multiple threads vs. exo-ops.
Comparison with Cache-Visible Architectures
- Programs can bypass the cache.
- Subblock writes at memories.
- Etc.