Memory Architectures. Week 2, Lecture 1. Copyright 2009 by W. Feng. Based on material from Matthew Sottile.
2 Directory-Based Coherence. Idea: maintain pointers, not just simple state bits, with each cache block. Ingredients: data owners (home nodes) keep track of the nodes holding cached copies (sharers). Coherence decisions are implemented via distributed linked-list update algorithms: a doubly linked list is formed between caches, with sharers linking to other sharers and the home node acting as the list head. Issues?
3 Directory-Based Coherence. Read miss: transaction from requestor to home node. The home node replies with the data or the head of the sharer list; the head pointer is changed to the requestor, and the requestor sends a transaction to the first sharer to insert itself into the list. Write miss: the requestor obtains head information from the home node and is inserted as the new head (if already on the list, it is deleted and re-inserted as head). The list is then traversed and shared copies are invalidated.
4 Recall: Cache Coherence & Performance. Unlike pipelining details that only concern compiler writers, you, the programmer, really need to be aware that this is going on under the covers. The coherence protocol can impact your performance.
5 Cache Coherence: Performance Demo
6 Abusing Cache Coherence Why the difference in performance? For the same overall computation, change the interleaving of memory accesses between the processing cores.
7 Traversal Patterns

void worker_unfriendly(void *arg) {
    int rank = (int)arg;
    int start = rank;
    int end = N;
    int stride = P;
    int i, iter;
    int numiters = 256;

    for (iter = 0; iter < numiters; iter++) {
        for (i = start; i < end; i += stride) {
            if (i > 0 && i < N-1)
                data[i] = (data[i-1] + data[i] + data[i+1]) / 3.0;
        }
    }
}
8 Explanation of Abusing Cache Coherence. Each block is shared across multiple caches. Each iteration, all threads read from the same block (Shared) and try to write to it (Modified). One thread wins and forces all the other processors to invalidate their copies (Invalid). When one of the others then executes a read, the writer is forced back into the Shared state and must write back. In the unfriendly case, many transitions to the Invalid state mean blocks must be re-read over and over by some cores. Result: effectively serialized code, so we observe significantly worse than sequential performance.
9 Philosophical Perspective. Cache coherence protocols are rarely addressed in the popular literature on multicore. What is the focus? Libraries and languages. Dangerous? Why or why not? Confucius say: understand the hardware consequences of code structure to achieve good performance. :-) Just as you need to understand caches to write good sequential code.
10 Software-Based Cache Coherency (CC)? Question: can one build a shared-memory software layer above a distributed-memory system (such as a cluster) that takes concepts from cache coherence protocols to implement a large-scale shared-memory view of a big cluster? Answer: possible, but slow. Distributed Shared Memory (DSM): distributed at the hardware level, shared at the programming abstraction. Why slow? The bus transactions needed to make a CC protocol work happen in hardware at the bit level, on the time scale of a single clock cycle; a software layer must emulate them with much coarser, far higher-latency operations.
11 To Date: Architectural Enhancements. Coarse-grained parallelism: core replication. Instruction-level parallelism (ILP): deep pipelines, superscalar, and VLIW. Sophisticated memory hierarchy. Lesson? Do sophisticated activities in hardware so that the abstraction layer seen by programmers is as similar as possible to writing plain single-threaded code. Assuming that enough parallelism exists, the key is memory performance.
12 Key: Memory Performance. Hide it or control it. ILP and caches: hide it. Coherence: control memory consistency with respect to multiple viewpoints; hide how this is achieved in hardware.
13 Memory Access Timing. Increase the number of processors and scaling of a bus interconnect becomes an issue. More complexity is needed to avoid excessive bus transfers. Example: avoid flushing or broadcasting to caches that are not affected by a cache coherence conflict. Longer-term solution?
14 Memory Access Timing. Sequential computer: every location in memory takes the same amount of time to access. Small shared-memory systems: ditto; also known as Uniform Memory Access (UMA). Larger shared-memory systems: a more complex memory subsystem makes constant-time access prohibitively difficult, so memory access takes different amounts of time depending on where the physical memory sits in the machine relative to the accessor. Known as Non-Uniform Memory Access (NUMA); cache-coherent NUMA is ccNUMA.
15 ccNUMA Diagram
16 NUMA. A model often found on parallel systems a decade ago. Example: the SGI Origin 2000, composed of 128 CPUs with a hierarchy of bricks from which the entire machine is built. Multicore: Intel is UMA; AMD is NUMA. The world is moving to NUMA. Why?
17 UMA vs. NUMA. UMA: do not care where data actually lies relative to the computations that work on it. NUMA: do care where data is; better performance if data is near computation. Problem: hard to enforce without OS and run-time assistance. Why? A combination of resource scheduling, load balancing, and coupling memory management with process management. (Outside the scope of this course; just be aware that, more and more, not all memory is treated equally from a timing perspective.)
18 NUMA Another Perspective Think of NUMA as simply adding more layers to the cache hierarchy. Memory near computation is faster, much like a local cache. Memory far away from a computation is slower, much like main memory (or a cache deeper in the hierarchy). But why continue to add all these layers? Remember: Memory and bus logic consume an increasingly large amount of real estate on a die, e.g., see Intel Nehalem die (previous lecture). Memory performance becomes critical to code performance.
19 Time for a New Approach? Part I. Let's start over. KISS: Keep It Simple, Stupid. Simplify the hardware: get rid of speculative fetching logic; get rid of cache coherence hardware; get rid of all cache logic. Hmm... let's get rid of caches altogether (at least for most processor cores). Create a fast processor in terms of functional execution, with no fancy memory hierarchy. Use the spare logic real estate to add high-performance functional units and a reasonable, but small, amount of RAM-like memory right next to the processor: much smaller than a cache, but much larger than a register set, and addressable in a RAM fashion.
20 Welcome The Cell
21 Initial Reactions. Blasphemy! The cache and memory subsystem help me write code productively. Why does Cell want to take this away? How can I survive without a memory hierarchy that just works? I want to focus on the functional logic of my program, not memory traversal. Hmmm... maybe not.
22 Re-Think Programming? Current view: create sequences of instructions with memory as a secondary concern. Challenging conventional wisdom: how does my world as a programmer change if data (memory) movement is viewed on an equal footing with computation? Can I shift my way of thinking to exploit this new memory model and exceptional computational capability? Answer: probably, but with effort (which runs counter to code productivity, maybe).
23 How We Currently Program. A perspective from the designer(s) of the Cell: a plumbing project at home. If we repaired plumbing the way we programmed, we'd... and so on. (A sequence of picture slides.)
34 How We Currently Program. Example from a code perspective:

    x = a(1)
    y = b(2)
    z = c(14)
    d(1) = x + y + z

A good compiler optimizes the above away. But it is not difficult to build algorithms complex enough that the compiler cannot optimize it all away, resulting in going back and forth to main memory over and over. Possible causes?
35 How We Currently Program. Wasteful and inefficient? (See previous slides.) We ask for memory when we need it, and it happens magically, but with a potential performance impact. Current solution: performance tuning, e.g., fetch multiple items to the CPU at once by packing into cache-friendly blocks. Not always feasible: mesh-based data structures in finite element method (FEM) codes, where one struggles with structs of arrays versus arrays of structs; arbitrary data layouts and structures; the plumbing example. New solution: instead of a back-and-forth, on-demand memory model, why not plan ahead and hand the system a shopping list?
36 Shopping List Model. More efficient performance (obvious), but it requires programmers to plan ahead. Oh, the pain of planning ahead... The free ride of cache coherency does not exist in this model. Programmers must translate algorithms previously developed assuming lots of hardware on the CPU to support an on-demand memory model. (Remember the pictures of dies?) Not so bad if the algorithm already computes the memory traversal pattern, e.g., the butterfly pattern in the Fast Fourier Transform. (Are we sacrificing programmer productivity for performance?)
38 Shopping List Model. Plumbing project at home, using the shopping list model... (A sequence of picture slides.)
42 Support for the Shopping List Model. Sophisticated memory access hardware and a high-bandwidth, low-latency internal bus on the chip. Difficulty in programming to this model? Not due to a shortcoming of the chip, but due to a lack of programming tools to support it. Historical perspective: back in 2007, you programmed the Cell via assembly; IBM then released an SDK with libraries to help out, and now compiler vendors are helping too (Gedae, RapidMind, and so on). Here to stay?!
43 The Cell A collaboration between Sony, IBM, and Toshiba. Formal name: Cell Broadband Engine Architecture.
44 The Cell. PPE: dual-threaded (SMT) PowerPC processing element. SPE: single-threaded, high-performance synergistic processing element. EIB: circular data bus connecting the PPE and SPEs (and I/O + memory).
45 PPE: PowerPC Processing Element. Multi-purpose: a general-purpose CPU that assists the SPEs by performing computations that help keep the SPEs running with minimal idle time. Anatomy: registers: 64-bit general-purpose and floating-point registers, plus 128-bit registers for AltiVec SIMD operations. Caches: L1: 32-KB instruction, 32-KB data; L2: 512 KB.
46 SPE: Synergistic Processing Element. A high-performance, specialized processor element that can be replicated within the Cell architecture. Structure: a high-performance execution core, the Synergistic Processing Unit (SPU), which interacts with the outside world via the Memory Flow Controller (MFC). MFC: MMU, DMA, and bus interface. SPE: 128-bit registers; 256-KB local-store SRAM that is visible from the PPE (NOT a cache!); instruction and data memory (32-bit addressing).
47 EIB: Element Interconnection Bus. A bi-directional circular bus: four 16-byte channels, two in each direction. 12 devices on the bus: 8 SPEs, 1 PPE, 1 interface to system memory, and 2 I/O units. Each device has a single 16-B read port and a single 16-B write port onto the bus. Maximum (realizable) EIB bandwidth: ~200 GB/s. Ideal EIB bandwidth: ~300 GB/s.
48 Time for an Old New Approach? Part II. The Cell was not the first architecture to re-think how memory is managed. Tera (now Cray Inc.) created the MTA, the MultiThreaded Architecture, in the 1990s: throw out the concept of a cache. Basic idea: exploit massive levels of threading to hide the high latency to main memory.
49 MTA: Example with 6 Threads. (A sequence of picture slides stepping through the schedule.)
57 MTA Overview. A great deal of processor real estate supports the contexts for many threads. That is, remove the cache(s) and hide latency by servicing many threads: dedicate the hardware otherwise used for caching to servicing threads. Remember this when we reach GPGPUs. Good performance requires a high degree of threading, such that memory latency can be hidden by servicing many threads. (The plumbing project only had three threads.) Disclaimer: in reality, the MTA is capable of more than just a round-robin schedule; this just gets the basic idea across.
58 Time for an Old New Approach? Part III. Combining ideas from Part I and Part II: a plug-in architecture, the GPGPU. Explicit data movement between the CPU and the GPGPU card out on PCIe (a la FPGA acceleration): the shopping list model of the Cell, outside the chip. Massive parallelism via threading: the memory model of the MTA, within the chip.
59 Welcome The GPGPU
60 Next Time? Interconnection Networks. Shallow coverage of interconnection networks: important to parallel computing, but fallen out of fashion over the past decade or so. The return of its importance: Tilera 64-core, Intel 80-core, and so on. Thursday should be the last architecturally-driven lecture.
More informationParallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More informationParallel Computing Platforms. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Elements of a Parallel Computer Hardware Multiple processors Multiple
More informationCellSs Making it easier to program the Cell Broadband Engine processor
Perez, Bellens, Badia, and Labarta CellSs Making it easier to program the Cell Broadband Engine processor Presented by: Mujahed Eleyat Outline Motivation Architecture of the cell processor Challenges of
More informationHigher Level Programming Abstractions for FPGAs using OpenCL
Higher Level Programming Abstractions for FPGAs using OpenCL Desh Singh Supervising Principal Engineer Altera Corporation Toronto Technology Center ! Technology scaling favors programmability CPUs."#/0$*12'$-*
More information10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems
1 License: http://creativecommons.org/licenses/by-nc-nd/3.0/ 10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems To enhance system performance and, in some cases, to increase
More informationObjective. We will study software systems that permit applications programs to exploit the power of modern high-performance computers.
CS 612 Software Design for High-performance Architectures 1 computers. CS 412 is desirable but not high-performance essential. Course Organization Lecturer:Paul Stodghill, stodghil@cs.cornell.edu, Rhodes
More informationComp. Org II, Spring
Lecture 11 Parallel Processor Architectures Flynn s taxonomy from 1972 Parallel Processing & computers 8th edition: Ch 17 & 18 Earlier editions contain only Parallel Processing (Sta09 Fig 17.1) 2 Parallel
More informationWilliam Stallings Computer Organization and Architecture 8 th Edition. Chapter 18 Multicore Computers
William Stallings Computer Organization and Architecture 8 th Edition Chapter 18 Multicore Computers Hardware Performance Issues Microprocessors have seen an exponential increase in performance Improved
More informationCSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore
CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors
More informationParallel Processing & Multicore computers
Lecture 11 Parallel Processing & Multicore computers 8th edition: Ch 17 & 18 Earlier editions contain only Parallel Processing Parallel Processor Architectures Flynn s taxonomy from 1972 (Sta09 Fig 17.1)
More informationComp. Org II, Spring
Lecture 11 Parallel Processing & computers 8th edition: Ch 17 & 18 Earlier editions contain only Parallel Processing Parallel Processor Architectures Flynn s taxonomy from 1972 (Sta09 Fig 17.1) Computer
More informationDEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK
DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK SUBJECT : CS6303 / COMPUTER ARCHITECTURE SEM / YEAR : VI / III year B.E. Unit I OVERVIEW AND INSTRUCTIONS Part A Q.No Questions BT Level
More informationECE/CS 250 Computer Architecture. Summer 2016
ECE/CS 250 Computer Architecture Summer 2016 Multicore Dan Sorin and Tyler Bletsch Duke University Multicore and Multithreaded Processors Why multicore? Thread-level parallelism Multithreaded cores Multiprocessors
More informationParallel Processing SIMD, Vector and GPU s cont.
Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP
More informationMultithreading Processors and Static Optimization Review. Adapted from Bhuyan, Patterson, Eggers, probably others
Multithreading Processors and Static Optimization Review Adapted from Bhuyan, Patterson, Eggers, probably others Schedule of things to do By Wednesday the 9 th at 9pm Please send a milestone report (as
More informationA Multiprocessor system generally means that more than one instruction stream is being executed in parallel.
Multiprocessor Systems A Multiprocessor system generally means that more than one instruction stream is being executed in parallel. However, Flynn s SIMD machine classification, also called an array processor,
More informationMulticore Hardware and Parallelism
Multicore Hardware and Parallelism Minsoo Ryu Department of Computer Science and Engineering 2 1 Advent of Multicore Hardware 2 Multicore Processors 3 Amdahl s Law 4 Parallelism in Hardware 5 Q & A 2 3
More informationLecture 14: Multithreading
CS 152 Computer Architecture and Engineering Lecture 14: Multithreading John Wawrzynek Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~johnw
More informationChap. 4 Multiprocessors and Thread-Level Parallelism
Chap. 4 Multiprocessors and Thread-Level Parallelism Uniprocessor performance Performance (vs. VAX-11/780) 10000 1000 100 10 From Hennessy and Patterson, Computer Architecture: A Quantitative Approach,
More informationComputer Architecture. A Quantitative Approach, Fifth Edition. Chapter 5. Multiprocessors and Thread-Level Parallelism
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model
More informationModule 9: Addendum to Module 6: Shared Memory Multiprocessors Lecture 17: Multiprocessor Organizations and Cache Coherence. The Lecture Contains:
The Lecture Contains: Shared Memory Multiprocessors Shared Cache Private Cache/Dancehall Distributed Shared Memory Shared vs. Private in CMPs Cache Coherence Cache Coherence: Example What Went Wrong? Implementations
More informationCOSC4201. Multiprocessors and Thread Level Parallelism. Prof. Mokhtar Aboelaze York University
COSC4201 Multiprocessors and Thread Level Parallelism Prof. Mokhtar Aboelaze York University COSC 4201 1 Introduction Why multiprocessor The turning away from the conventional organization came in the
More informationLecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015
Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Marble House The Knife (Silent Shout) Before starting The Knife, we were working
More informationMo Money, No Problems: Caches #2...
Mo Money, No Problems: Caches #2... 1 Reminder: Cache Terms... Cache: A small and fast memory used to increase the performance of accessing a big and slow memory Uses temporal locality: The tendency to
More informationFlynn s Classification
Flynn s Classification SISD (Single Instruction Single Data) Uniprocessors MISD (Multiple Instruction Single Data) No machine is built yet for this type SIMD (Single Instruction Multiple Data) Examples:
More informationIntroduction to CELL B.E. and GPU Programming. Agenda
Introduction to CELL B.E. and GPU Programming Department of Electrical & Computer Engineering Rutgers University Agenda Background CELL B.E. Architecture Overview CELL B.E. Programming Environment GPU
More informationCSE Opera+ng System Principles
CSE 30341 Opera+ng System Principles Lecture 2 Introduc5on Con5nued Recap Last Lecture What is an opera+ng system & kernel? What is an interrupt? CSE 30341 Opera+ng System Principles 2 1 OS - Kernel CSE
More informationLecture 5: Directory Protocols. Topics: directory-based cache coherence implementations
Lecture 5: Directory Protocols Topics: directory-based cache coherence implementations 1 Flat Memory-Based Directories Block size = 128 B Memory in each node = 1 GB Cache in each node = 1 MB For 64 nodes
More informationLecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming Cache design review Let s say your code executes int x = 1; (Assume for simplicity x corresponds to the address 0x12345604
More informationUnit 11: Putting it All Together: Anatomy of the XBox 360 Game Console
Computer Architecture Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Milo Martin & Amir Roth at University of Pennsylvania! Computer Architecture
More informationWilliam Stallings Computer Organization and Architecture 8th Edition. Cache Memory
William Stallings Computer Organization and Architecture 8th Edition Chapter 4 Cache Memory Characteristics Location Capacity Unit of transfer Access method Performance Physical type Physical characteristics
More informationComputer Architecture Crash course
Computer Architecture Crash course Frédéric Haziza Department of Computer Systems Uppsala University Summer 2008 Conclusions The multicore era is already here cost of parallelism is dropping
More informationMemory Systems IRAM. Principle of IRAM
Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several
More informationDonn Morrison Department of Computer Science. TDT4255 Memory hierarchies
TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,
More informationUC Berkeley CS61C : Machine Structures
inst.eecs.berkeley.edu/~cs61c UC Berkeley CS61C : Machine Structures Lecture 40 Hardware Parallel Computing 2006-12-06 Thanks to John Lazarro for his CS152 slides inst.eecs.berkeley.edu/~cs152/ Head TA
More informationMulti-core processors are here, but how do you resolve data bottlenecks in native code?
Multi-core processors are here, but how do you resolve data bottlenecks in native code? hint: it s all about locality Michael Wall October, 2008 part I of II: System memory 2 PDC 2008 October 2008 Session
More information