Multiprocessors: Basics, Cache coherence, Synchronization, and Memory consistency

1 Multiprocessors: Basics, Cache coherence, Synchronization, and Memory consistency Z. Jerry Shi Assistant Professor of Computer Science and Engineering University of Connecticut * Slides adapted from Blumrich&Gschwind/ELE475 03, Peh/ELE475 *

2 Parallel Computers Definition: A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast. Almasi and Gottlieb, Highly Parallel Computing,1989 Questions about parallel computers: How large a collection? How powerful are processing elements? How do they cooperate and communicate? How are data transmitted? What type of interconnection? What are HW and SW primitives for programmer? Does it translate into performance?

3 Parallel Processors Religion The dream of computer architects since the 1950s: replicate processors to add performance vs. design a faster processor Led to innovative organizations tied to particular programming models, since uniprocessors can't keep going (e.g., predictions that uniprocessors must stop getting faster due to the limit of the speed of light: made in 1972, 1989, and now) Borders on religious fervor: you must believe! Fervor damped somewhat when 1990s companies went out of business: Thinking Machines, Kendall Square,... Arguments: the pull of the opportunity of scalable performance; the push of the uniprocessor performance plateau?

4 What Level Parallelism? Bit-level parallelism: 1970 to ~1985: 4-bit, 8-bit, 16-bit, 32-bit microprocessors Instruction-level parallelism (ILP): ~1985 through today Pipelining Superscalar VLIW Out-of-Order execution Limits to benefits of ILP? Process-level or thread-level parallelism; mainstream for general-purpose computing? Servers are parallel Many PCs too!

5 Why Multiprocessors? Microprocessors are the fastest CPUs leverage them Collecting several much easier than redesigning one Complexity of current microprocessors Do we have enough ideas to sustain 1.5X/yr? Can we deliver such complexity on schedule? Slow (but steady) improvement in parallel software (scientific apps, databases, OS) Emergence of embedded and server markets driving microprocessors in addition to desktops Embedded functional parallelism, producer/consumer model Server figure of merit is tasks per hour vs. latency Long term goal: scale number of processors to size of budget and desired performance

6 Parallel Architecture Parallel architecture extends traditional computer architecture with a communication architecture Abstractions (HW/SW interface) Organizational structure to realize the abstraction efficiently

7 Popular Flynn Categories (1966) SISD (Single Instruction Stream, Single Data Stream) Uniprocessors SIMD (Single Instruction Stream, Multiple Data Stream) Single instruction memory and control processor (e.g., Illiac-IV, CM-2, vector machines) Simple programming model Low overhead Requires lots of data parallelism All custom integrated circuits MISD (Multiple Instruction Stream, Single Data Stream) No commercial multiprocessor of this type has been built MIMD (Multiple Instruction Stream, Multiple Data Stream) Examples: IBM SP, Cray T3D, SGI Origin, your PC? Flexible Use off-the-shelf microprocessors MIMD is the current winner: concentrate on the major design emphasis, <= 128-processor MIMD machines

8 MIMD: Centralized Shared Memory UMA: Uniform Memory Access (latency) Symmetric (shared-memory) Multi-Processor (SMP)

9 MIMD: Distributed Memory Advantages: More memory bandwidth (low cost), lower memory latency (local) Drawback: Longer communication latency, More complex software model

10 Distributed Memory Versions Distributed Shared Memory (DSM) Single physical address space with load/store access Address space is shared among nodes Also called Non-Uniform Memory Access (NUMA) Access time depends on the location of the data word in memory Message-Passing Multicomputer Separate address space per processor The same address on different nodes refers to different words Could be completely separated computers connected on LAN Also called clusters Tightly-coupled cluster

11 Layers of Parallel Framework Programming Model: Multiprogramming: lots of jobs, no communication Shared memory: communicate via memory Message passing: send and receive messages Data parallel: several agents operate on several data sets simultaneously and then exchange information globally and simultaneously (shared or message passing) Communication Abstraction: Shared address space, e.g., load, store, atomic swap Message passing, e.g., send, receive library calls Debate over this topic (ease of programming, scaling) => many hardware designs tied 1:1 to a programming model

12 Data Parallel Model Operations can be performed in parallel on each element of a large regular data structure, such as an array One Control Processor broadcast to many PEs When computers were large, could amortize the control portion of many replicated PEs Condition flag per PE so that PE can be skipped Data distributed in each memory Early 1980s VLSI => SIMD rebirth: 32 1-bit procs + memory on a chip was the PE Data parallel programming languages lay out data to processors

13 Communication Mechanisms Shared memory: Memory load and store Oldest, but most popular model Message passing: Send and receive messages Sending messages that request action or deliver data Remote procedure call (RPC) Request the data and/or operations performed on data Synchronous: Request, wait, and continue Three performance metrics Communication bandwidth Communication latency Communication latency hiding Overlap communication with computation or with other communication

14 Performance Metrics: Latency and Bandwidth Bandwidth Need high bandwidth in communication Match limits in network, memory, and processor Challenge is link speed of network interface vs. bisection bandwidth of network Latency Affects performance, since processor may have to wait Affects ease of programming, since it requires more thought to overlap communication and computation Overhead to communicate is a problem in many machines Latency Hiding How can a mechanism help hide latency? Increases programming system burden Examples: overlap message send with computation, prefetch data, switch to other tasks

15 Shared Memory Model Communicate via load and store Oldest, but most popular model Based on timesharing: processes on multiple processors vs. sharing a single processor Process: a virtual address space and ~1 thread of control Multiple processes can overlap (share), but ALL threads share a process address space Threads may be used in a casual way to refer to multiple loci of execution, even when they do not share an address space Writes to the shared address space by one thread are visible to reads of other threads Usual model: share code, private stack, some shared heap, some private heap

16 Message Passing Model Whole computers (CPU, memory, I/O devices) communicate as explicit I/O operations Send specifies local buffer + receiving process on remote computer Receive specifies sending process on remote computer + local buffer to place data Usually send includes a process tag and receive has a rule on the tag (i.e., match 1, match any) Synchronize: When does send complete? When is the buffer free? When is the request accepted? Does receive wait for send? Send+receive => memory-memory copy, where each supplies a local address, AND does pairwise synchronization!
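
As a concrete illustration, here is a minimal send/receive pair with explicit tags, sketched with MPI (an assumption for this example; the slide itself does not name a particular message-passing library):

    /* Minimal MPI sketch (assumes an MPI installation; compile with mpicc). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value = 42;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            /* Send: local buffer + receiving process (rank 1) + tag 7. */
            MPI_Send(&value, 1, MPI_INT, 1, 7, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Receive: sending process (rank 0) + local buffer; the tag must match. */
            MPI_Recv(&value, 1, MPI_INT, 0, 7, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }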

17 Advantages of Shared-Memory Communication Compatibility Oldest, but the most popular model Ease of programming Communication patterns among processors are complex and dynamic Simplify compilers Use familiar shared-memory model to develop applications Attention only on performance critical accesses Lower overhead for communications and better use of bandwidth when communicating small items Can use hardware-controlled caching to reduce the frequency of remote communication

18 Advantages of Message-Passing Communication Hardware is simpler Communication is explicit It is easier to understand (when, cost, etc.) In shared memory it can be hard to know when communicating and when not, and how costly it is Force programmer to focus on communication, the costly aspects of parallel computing Synchronization is naturally associated with sending messages Reducing the possibility for errors introduced by incorrect synchronization Easier to use sender-initiated communication May have some advantages in performance

19 Communication Options Supporting message passing on top of shared memory Sending a message == copying data Data may be misaligned, though Supporting shared memory on top of hardware for message passing The cost of communication is too high: loads/stores usually move small amounts of data Shared virtual memory may be a possible direction What's the best choice for massively parallel processors (MPP)? Message passing is clearly simpler Shared memory has been widely supported since 1995

20 Amdahl's Law and Parallel Computers Amdahl's Law: Speedup = 1 / (FracX / SpeedupX + (1 - FracX)) The sequential portion limits parallel speedup: Speedup <= 1 / (1 - FracX) The large latency of remote access is another major challenge Example: What fraction sequential to get 80x speedup from 100 processors? Assume either 1 processor or 100 fully used. 80 = 1 / (FracX/100 + (1 - FracX)) => 0.8 FracX + 80 (1 - FracX) = 1 => 79.2 FracX = 79 => FracX = 79 / 79.2 = 0.9975 => Only 0.25% sequential!
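
A quick numeric check of that arithmetic (a sketch of my own, not from the slides):

    #include <stdio.h>

    /* Amdahl's Law: speedup = 1 / (fracX/speedupX + (1 - fracX)). */
    double amdahl(double fracX, double speedupX) {
        return 1.0 / (fracX / speedupX + (1.0 - fracX));
    }

    int main(void) {
        double fracX = 79.0 / 79.2;                            /* parallel fraction = 0.9975 */
        printf("speedup = %.1f\n", amdahl(fracX, 100.0));      /* prints ~80.0 */
        printf("sequential fraction = %.4f\n", 1.0 - fracX);   /* 0.0025, i.e. 0.25%% */
        return 0;
    }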

21 But In The Real World... Wood and Hill [1995] compared the cost-effectiveness of multiprocessor systems Challenge DM (1 proc, 6GB memory): cost_DM(1, MB) = $38,400 + $100 x MB, so cost_DM(1, 1GB) = $140,800 Challenge XL (up to 32 proc): cost_XL(p, MB) = $81,600 + $20,000 x p + $100 x MB, so cost_XL(8, 1GB) = $344,000 = 2.5x cost_DM(1, 1GB) So, an XL speedup of 2.5x or more makes the XL more cost-effective!

22 Memory System Coherence Cache coherence challenge for hardware A memory system is coherent if any read of a data item returns the most recently written value of that data item Coherence determines what value can be returned by a read Defines the behavior of reads and writes with respect to the same memory location Ensures the memory system is correct Consistency determines when written values will be returned by reads (all locations) Defines the behavior of accesses with respect to accesses to other memory locations

23 Multiprocessor Coherence Rules A memory system is coherent if: (1) Processor P writes X ... no other writes to X ... processor P reads X: the value must be the same (2) Processor Q writes X ... sufficient time, no other writes to X ... processor P reads X: the value must be the same (3) Processor P writes X and processor Q writes X: all processors may see it as P, Q or Q, P, but the order must be the same for all processors; writes to the same location are serialized

24 Potential HW Coherency Solutions Snooping Solution (Snoopy Bus): Send all requests for data to all processors Processors snoop to see if they have a copy and respond accordingly Requires broadcast, since caching information is at the processors Works well with a bus (natural broadcast medium) Dominates for small-scale machines (most of the market) Directory-Based Schemes (discussed later) Keep track of what is being shared in a centralized place (logically) Distributed memory => distributed directory for scalability (avoids bottlenecks) Send point-to-point requests to processors via the network Scales better than snooping Actually existed BEFORE snooping-based schemes

25 Bus Snooping Topology Memory: centralized with uniform access time ( UMA ) and bus interconnect Symmetric Multiprocessor (SMP)

26 Basic Snooping Protocols Write Invalidate Protocol: Multiple readers, single writer Write to shared data: an invalidate is sent to all caches which snoop and invalidate any copies Handling read miss: Write-through: memory is always up-to-date Write-back: snoop in caches to find most recent copy Write Broadcast/Update Protocol (typically write through): Write to shared data: broadcast on bus, processors snoop, and update any copies Handling read miss: memory is always up-to-date Write serialization: bus serializes requests! Bus is single point of arbitration

27 Basic Snooping Protocols Write Invalidate versus Broadcast: Invalidate requires one transaction per write-run Invalidate uses spatial locality: a block is invalidated once, so there is no need to invalidate again when writing other words in the same block Broadcast has lower latency between write and read Writes update all copies

28 An Example Snooping Protocol Invalidation protocol, write-back cache Each block of memory is in one state: Clean in all caches and up-to-date in memory (Shared) OR Dirty in exactly one cache (Exclusive) OR Not in any caches (Invalid) Each cache block is in one state (track these): Shared : block can be read OR Exclusive : cache has the only copy, it is writeable, and dirty OR Invalid : block contains no data Read misses: cause all caches to snoop bus Writes to a clean line are treated as misses
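
To make the state tracking concrete, here is a minimal sketch (my own illustration, not code from the slides) of the per-block state and two transitions of this write-invalidate, write-back protocol; the bus-operation names and place_on_bus helper are assumptions:

    /* Per-cache-block state in the write-invalidate, write-back protocol. */
    typedef enum { INVALID, SHARED, EXCLUSIVE } BlockState;

    /* Bus transactions the cache can place (names assumed for illustration). */
    typedef enum { BUS_READ_MISS, BUS_WRITE_MISS, BUS_WRITE_BACK } BusOp;

    void place_on_bus(BusOp op);            /* provided elsewhere (assumption) */

    /* CPU-side transition on a write to a block in the given state. */
    BlockState cpu_write(BlockState s) {
        switch (s) {
        case INVALID:                       /* write miss */
        case SHARED:                        /* write to a clean line is treated as a miss */
            place_on_bus(BUS_WRITE_MISS);   /* invalidates all other copies */
            return EXCLUSIVE;
        case EXCLUSIVE:                     /* write hit: no bus traffic needed */
            return EXCLUSIVE;
        }
        return s;
    }

    /* Snoop-side transition when another processor's write miss is seen on the bus. */
    BlockState snoop_write_miss(BlockState s) {
        if (s == EXCLUSIVE)
            place_on_bus(BUS_WRITE_BACK);   /* supply the dirty copy first */
        return INVALID;                     /* our copy is no longer valid */
    }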

29 Snoopy-Cache State Machine-I State machine for CPU requests, for each cache block:
Invalid -> Shared (read/only): CPU read; place read miss on bus
Invalid -> Exclusive (read/write): CPU write; place write miss on bus
Shared -> Shared: CPU read hit; on a CPU read miss, place read miss on bus
Shared -> Exclusive: CPU write; place write miss on bus
Exclusive -> Exclusive: CPU read hit, CPU write hit
Exclusive -> Shared: CPU read miss; write back block, place read miss on bus
Exclusive -> Exclusive: CPU write miss; write back cache block, place write miss on bus

30 Snoopy-Cache State Machine-II State machine for bus requests, for each cache block:
Shared (read/only) -> Invalid: write miss for this block
Shared -> Shared: read miss for this block (no action)
Exclusive (read/write) -> Invalid: write miss for this block; write back block (abort memory access)
Exclusive -> Shared: read miss for this block; write back block (abort memory access)

31 Example (snooping protocol) Table columns: step | P1 state/addr/value | P2 state/addr/value | bus action/proc./addr/value | memory addr/value
Steps to be traced: P1: Write 10 to A1; P1: Read A1; P2: Read A1; P2: Write 20 to A1; P2: Write 40 to A2
Assumes the initial cache state is invalid and A1 and A2 map to the same cache block, but A1 != A2
(The snooping state diagram of slides 29-30 is repeated on this and the following slides for reference.)

32 Example: Step 1
P1: Write 10 to A1 -> P1: Excl. A1 10; bus: WrMs P1 A1
Assumes the initial cache state is invalid and A1 and A2 map to the same cache block, but A1 != A2. Active arrow = the transition just taken.
(Snooping state diagram repeated for reference.)

33 Example: Step 2
P1: Write 10 to A1 -> P1: Excl. A1 10; bus: WrMs P1 A1
P1: Read A1 -> P1: Excl. A1 10 (read hit)
Assumes the initial cache state is invalid and A1 and A2 map to the same cache block, but A1 != A2.
(Snooping state diagram repeated for reference.)

34 Example: Step 3
P1: Write 10 to A1 -> P1: Excl. A1 10; bus: WrMs P1 A1
P1: Read A1 -> P1: Excl. A1 10 (read hit)
P2: Read A1 -> P2: Shar. A1; bus: RdMs P2 A1
            -> P1: Shar. A1 10; bus: WrBk P1 A1 10; memory: A1 10
            -> P2: Shar. A1 10; bus: RdDa P2 A1 10; memory: A1 10
Assumes the initial cache state is invalid and A1 and A2 map to the same cache block, but A1 != A2.
(Snooping state diagram repeated for reference.)

35 Example: Step 4
P1: Write 10 to A1 -> P1: Excl. A1 10; bus: WrMs P1 A1
P1: Read A1 -> P1: Excl. A1 10 (read hit)
P2: Read A1 -> P2: Shar. A1; bus: RdMs P2 A1
            -> P1: Shar. A1 10; bus: WrBk P1 A1 10; memory: A1 10
            -> P2: Shar. A1 10; bus: RdDa P2 A1 10; memory: A1 10
P2: Write 20 to A1 -> P1: Inv.; P2: Excl. A1 20; bus: WrMs P2 A1; memory: A1 10
Assumes the initial cache state is invalid and A1 and A2 map to the same cache block, but A1 != A2.
(Snooping state diagram repeated for reference.)

36 Example: Step 5
P1: Write 10 to A1 -> P1: Excl. A1 10; bus: WrMs P1 A1
P1: Read A1 -> P1: Excl. A1 10 (read hit)
P2: Read A1 -> P2: Shar. A1; bus: RdMs P2 A1
            -> P1: Shar. A1 10; bus: WrBk P1 A1 10; memory: A1 10
            -> P2: Shar. A1 10; bus: RdDa P2 A1 10; memory: A1 10
P2: Write 20 to A1 -> P1: Inv.; P2: Excl. A1 20; bus: WrMs P2 A1; memory: A1 10
P2: Write 40 to A2 -> bus: WrMs P2 A2; memory: A1 10
                   -> P2: Excl. A2 40; bus: WrBk P2 A1 20; memory: A1 20
Assumes the initial cache state is invalid and A1 and A2 map to the same cache block, but A1 != A2.
(Snooping state diagram repeated for reference.)

37 False Sharing
P1: for (i = 0; ; i++) { ... *a = i; ... }
P2: for (j = 0; ; j++) { ... *(a + 1) = j; ... }
(Figure: the words *a and *(a+1) lie in the same cache block, so the block's state ping-pongs between the two caches even though the processors never touch the same word.)
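
One common remedy, sketched here under the assumption of a 64-byte cache block (my own illustration, not from the slides), is to pad or align the two variables so they no longer share a block:

    #include <stdalign.h>

    /* Assume a 64-byte cache block. Each counter gets its own block, so P1's
       writes to a.value no longer invalidate the block holding b.value. */
    struct padded_counter {
        alignas(64) long value;
    };

    struct padded_counter a, b;   /* updated by P1 and P2 respectively */

    void p1_work(void) { for (long i = 0; ; i++) { a.value = i; } }
    void p2_work(void) { for (long j = 0; ; j++) { b.value = j; } }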

38 False Sharing Quiz x1 and x2 share a cache block, which starts in the shared state. What happens next? Time P1 P2 True or False? 1 Write x1? 2 Read x2 3 Write x1 4 Write x2 5 Read x2

39 False Sharing Quiz Time P1 P2 True or False? 1 Write x1 True 2 Read x2? 3 Write x1 4 Write x2 5 Read x2

40 False Sharing Quiz Time P1 P2 True or False? 1 Write x1 True 2 Read x2 False 3 Write x1? 4 Write x2 5 Read x2

41 False Sharing Quiz Time P1 P2 True or False? 1 Write x1 True 2 Read x2 False 3 Write x1 False 4 Write x2? 5 Read x2

42 False Sharing Quiz Time P1 P2 True or False? 1 Write x1 True 2 Read x2 False 3 Write x1 False 4 Write x2 False 5 Read x2?

43 False Sharing Quiz x1 and x2 share a cache block, which starts in the shared state. What happens next? Time P1 P2 True or False? 1 Write x1 True 2 Read x2 False 3 Write x1 False 4 Write x2 False 5 Read x2 True

44 Larger MPs Separate memory per processor Local or remote access via a memory controller One cache coherency solution: non-cached pages Alternative: a directory per cache that tracks the state of every block in every cache Which caches have copies of the block, dirty vs. clean,... Info per memory block vs. per cache block? PLUS: In memory => simpler protocol (centralized/one location) MINUS: In memory => directory is ƒ(memory size) vs. ƒ(cache size) Prevent the directory from becoming a bottleneck? Distribute directory entries with memory, each keeping track of which processors have copies of their blocks

45 Distributed Directory MPs

46 Directory Protocol States identical to the snooping protocol: Shared: >= 1 processors have data; memory is up-to-date Invalid: no processor has it; not valid in any cache Exclusive: 1 processor (owner) has data; memory is out-of-date In addition to cache state, must track which processors have data when in the shared state (usually a bit vector; bit i is 1 if processor i has a copy)
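
A minimal sketch of a directory entry with such a presence bit vector (my own illustration with assumed names, sized here for up to 64 processors):

    #include <stdint.h>

    typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } DirState;

    /* One directory entry per memory block (layout assumed for illustration). */
    typedef struct {
        DirState state;
        uint64_t sharers;     /* bit i set => processor i has a copy */
    } DirEntry;

    static inline void add_sharer(DirEntry *e, int proc)    { e->sharers |=  (1ULL << proc); }
    static inline void clear_sharers(DirEntry *e)           { e->sharers = 0; }
    static inline int  is_sharer(const DirEntry *e, int p)  { return (int)((e->sharers >> p) & 1); }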

47 Directory-based cache coherence States identical to snooping case; transactions very similar. Three processors typically involved: Local node - where a request originates Home node - where the memory location of an address resides Remote node - has a copy of a cache block, whether exclusive or shared Generates read miss & write miss message to home directory. Write misses that were broadcast on the bus for snooping => explicit invalidate & data fetch requests.

48 CPU-Cache State Machine State machine for CPU requests, for each memory block (Invalid state if the block is not in the cache):
Invalid -> Shared (read/only): CPU read; send read miss message to home directory
Invalid -> Exclusive (read/write): CPU write; send write miss message to home directory
Shared -> Shared: CPU read hit; on a CPU read miss, send read miss message to home directory
Shared -> Exclusive: CPU write; send write miss message to home directory
Shared -> Invalid: Invalidate message from home directory
Exclusive -> Exclusive: CPU read hit, CPU write hit
Exclusive -> Invalid: Fetch/Invalidate from home directory; send data write back message to home directory
Exclusive -> Shared: Fetch from home directory; send data write back message to home directory
Exclusive -> Shared: CPU read miss; send data write back message and read miss to home directory
Exclusive -> Exclusive: CPU write miss; send data write back message and write miss to home directory

49 Directory State Machine State machine for directory requests, for each memory block (Invalid = uncached):
Invalid -> Shared (read only): read miss; Sharers = {P}; send data value reply
Invalid -> Exclusive (read/write): write miss; Sharers = {P}; send data value reply
Shared -> Shared: read miss; Sharers += {P}; send data value reply
Shared -> Exclusive: write miss; send invalidate to sharers; then Sharers = {P}; send data value reply
Exclusive -> Shared: read miss; Sharers += {P}; send fetch message to remote cache; send data value reply (write back block)
Exclusive -> Invalid: data write back; Sharers = {} (write back block)
Exclusive -> Exclusive: write miss; send fetch/invalidate message to remote cache; Sharers = {P}; send data value reply

50 Example (directory protocol) Table columns: step | P1 state/addr/value | P2 state/addr/value | interconnect action/proc./addr/value | directory addr/state/{procs}/value | memory value
Steps to be traced: P1: Write 10 to A1; P1: Read A1; P2: Read A1; P2: Write 20 to A1; P2: Write 40 to A2
A1 and A2 map to the same cache block

51 Example (directory protocol), step 1
P1: Write 10 to A1 -> bus: WrMs P1 A1; directory: A1 Ex {P1}
                   -> P1: Excl. A1 10; bus: DaRp P1 A1 0
A1 and A2 map to the same cache block

52 Example (directory protocol), step 2
P1: Write 10 to A1 -> bus: WrMs P1 A1; directory: A1 Ex {P1}
                   -> P1: Excl. A1 10; bus: DaRp P1 A1 0
P1: Read A1 -> P1: Excl. A1 10 (read hit)
A1 and A2 map to the same cache block

53 Example (directory protocol), step 3
P1: Write 10 to A1 -> bus: WrMs P1 A1; directory: A1 Ex {P1}
                   -> P1: Excl. A1 10; bus: DaRp P1 A1 0
P1: Read A1 -> P1: Excl. A1 10 (read hit)
P2: Read A1 -> P2: Shar. A1; bus: RdMs P2 A1
            -> P1: Shar. A1 10; bus: Ftch P1 A1 10 (write back); memory: A1 10
            -> P2: Shar. A1 10; bus: DaRp P2 A1 10; directory: A1 Shar. {P1,P2} 10
A1 and A2 map to the same cache block

54 Example (directory protocol), step 4
P1: Write 10 to A1 -> bus: WrMs P1 A1; directory: A1 Ex {P1}
                   -> P1: Excl. A1 10; bus: DaRp P1 A1 0
P1: Read A1 -> P1: Excl. A1 10 (read hit)
P2: Read A1 -> P2: Shar. A1; bus: RdMs P2 A1
            -> P1: Shar. A1 10; bus: Ftch P1 A1 10 (write back); memory: A1 10
            -> P2: Shar. A1 10; bus: DaRp P2 A1 10; directory: A1 Shar. {P1,P2} 10
P2: Write 20 to A1 -> P2: Excl. A1 20; bus: WrMs P2 A1
                   -> P1: Inv.; bus: Inval. P1 A1; directory: A1 Excl. {P2} 10
A1 and A2 map to the same cache block

55 Example (directory protocol), step 5
P1: Write 10 to A1 -> bus: WrMs P1 A1; directory: A1 Ex {P1}
                   -> P1: Excl. A1 10; bus: DaRp P1 A1 0
P1: Read A1 -> P1: Excl. A1 10 (read hit)
P2: Read A1 -> P2: Shar. A1; bus: RdMs P2 A1
            -> P1: Shar. A1 10; bus: Ftch P1 A1 10 (write back); memory: A1 10
            -> P2: Shar. A1 10; bus: DaRp P2 A1 10; directory: A1 Shar. {P1,P2} 10
P2: Write 20 to A1 -> P2: Excl. A1 20; bus: WrMs P2 A1
                   -> P1: Inv.; bus: Inval. P1 A1; directory: A1 Excl. {P2} 10
P2: Write 40 to A2 -> bus: WrMs P2 A2; directory: A2 Excl. {P2} 0
                   -> bus: WrBk P2 A1 20; directory: A1 Unca. {} 20
                   -> P2: Excl. A2 40; bus: DaRp P2 A2 0; directory: A2 Excl. {P2} 0
A1 and A2 map to the same cache block

56 Implementation issues with a snoopy protocol Bus request and bus acquisition occur in two steps Four possible events can occur in between: Write miss on the bus by another processor Read miss on the bus by another processor CPU read miss CPU write miss Interventions: cache overriding memory

57 Implementation issues with a directory-based protocol Assumptions about networks In-order delivery between a pair of nodes Messages can always be injected Network-level deadlock freedom To ensure system-level deadlock freedom: Separate networks for requests and replies Allocate space for replies upon sending out requests Requests can be NACKed, but not replies Retry NACKed requests

58 Memory Consistency Adve and Gharachorloo, "Shared Memory Consistency Models: A Tutorial," WRL Research Report 95/7. Gharachorloo, K., D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy, "Memory consistency and event ordering in scalable shared-memory multiprocessors," in Proceedings of the 17th International Symposium on Computer Architecture, 1990.

59 Memory Consistency Models Cache coherence ensures multiple processors see a consistent view of memory, but how consistent? When must a processor see a value written by another processor? In what order must a processor observe the writes of another processor? Example (A and B start at 0): P1: A = 0; ... A = 1; L1: if (B == 0)... P2: B = 0; ... B = 1; L2: if (A == 0)... Impossible for both if statements L1 & L2 to be true? What if write invalidate is delayed & the processor continues? Should this behavior be allowed? Under what conditions? Memory consistency models: what are the rules for such cases?
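
In C11 terms (my own sketch; the slide uses plain pseudocode), the surprising outcome is allowed for ordinary variables, while sequentially consistent atomics rule it out:

    #include <stdatomic.h>

    atomic_int A = 0, B = 0;

    /* With memory_order_seq_cst (the default for these macros), at least one of
       the two loads must observe the other thread's store, so r1 and r2 cannot
       both end up 0. */
    void p1(int *r1) {
        atomic_store(&A, 1);
        *r1 = atomic_load(&B);
    }

    void p2(int *r2) {
        atomic_store(&B, 1);
        *r2 = atomic_load(&A);
    }
    /* If A and B were ordinary ints, the compiler or a write buffer could let
       both loads complete before the stores become visible, giving r1 == r2 == 0:
       exactly the "both if statements true" case asked about above. */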

60 Strict Consistency In the strict model, any read to a memory location X returns the value stored by the most recent write operation to X. Precise serialization of all memory accesses No cache, talking to memory through bus

61 Examples (timelines, time flowing to the right; the horizontal position of each operation determines whether the history is legal under strict consistency):
(a) P1: W(x)1 | P2: R(x)1, R(x)1
(b) P1: W(x)1 | P2: R(x)0, R(x)1
(c) P1: W(x)0 | P2: R(x)0, R(x)1

62 Sequential Consistency The result of any execution is the same as if The operations of all the processors were executed in some sequential order, and The operations of each individual processor appear in this sequence in the order specified by its program Intuitive: maps well to the uniprocessor model Implementations Delay memory access completion until all invalidates are done Or, delay the next memory access until the previous one is done Severe performance restrictions

63 Examples:
(a) P1: W(x)1 | P2: R(x)1, R(x)2 | P3: R(x)1, R(x)2 | P4: W(x)2
(b) P1: W(x)1 | P2: R(x)1, R(x)2 | P3: R(x)2, R(x)1 | P4: W(x)2
In (a) all processors see the two writes in the same order; in (b) P2 and P3 see them in different orders, so (b) is not sequentially consistent.

64 Coherency vs consistency
P1: W(x)1, W(y)2 | P2: R(x)0, R(x)2, R(x)1, R(y)0, R(y)1 | P3: R(y)0, R(y)1, R(x)0, R(x)1 | P4: W(x)2, W(y)1

65 How uniprocessor hardware optimizations of memory system break sequential consistency + solution {No caches} (RAW) Reads bypassing writes: write buffers Code example: see the animation on the following slides Solution: No buffering of writes; add delays

66-69 Animation of reads bypassing writes (each processor's write is still sitting in its write buffer when the reads issue):
1. P1: Flag1=1 (buffered in P1's write buffer)
2. P2: Flag2=1 (buffered in P2's write buffer)
3. P1: if (Flag2==0) -- memory still holds Flag2:0, so the test is true!
4. P2: if (Flag1==0) -- memory still holds Flag1:0, so the test is true!
Both tests succeed, an outcome no sequentially consistent interleaving allows: the reads have bypassed the buffered writes.
(Figure frames: P1 and P2 with write buffers on a bus; memory holds Flag1:0 and Flag2:0.)

70 How uniprocessor optimizations of memory system break sequential consistency + solution {No caches} (WAW) Write ordering: overlapping writes in the network; write merging into a cache line Code example: see the animation on the following slides Solution: Wait till a write operation has reached its memory module before allowing the next write (acknowledgment for writes) => more delay

71-74 Animation of writes bypassing writes (overlapping writes in the network; Head lives in memory module M3, Data in module M4):
1. P1: Data=2000 -- the write is delayed on its way to M4
2. P1: Head=1 -- this write takes a shorter path and reaches M3 first
3. P2: while (Head==0); -- P2 sees Head=1, so the loop exits
4. P2: ... = Data -- the read reaches M4 before the Data write does and returns Data=0!
The two writes from P1 reach their memory modules out of program order, so P2 observes the new Head together with the old Data.
(Figure frames: P1, P2, and memory modules M3: Head and M4: Data connected by a network.)

75 How uniprocessor optimizations of memory system break sequential consistency + solution {No caches} (RAR, WAR) Non-blocking reads Speculative execution Code example: see the animation on the following slides Solution: Disable non-blocking reads => more delay

76-78 Animation of reads bypassing reads (non-blocking / speculative reads; Head in M3, Data in M4):
1. P1: Data=2000
2. P1: Head=1
3. P2: while (Head==0); -- the load of Head misses and is still outstanding
4. P2: ... = Data -- issued early, past the outstanding load, and returns the old value Data=0
When the Head load finally returns 1 and the loop exits, P2 has already consumed Data=0: the later read bypassed the earlier one.
(Figure frames: P1, P2, and memory modules M3: Head and M4: Data.)

79 Now, let's add caches... More problems (WAW) P1's write to Data reaches memory, but has not yet been propagated to P2 (the cache coherence protocol's message (invalidate or update) has not yet arrived) Code example: see the animation on the following slides Solution: Wait for acknowledgment of the update/invalidate before any write/read => even longer delay than just waiting for the ack from the memory module.

80-83 Animation of problems with a cache (P2 starts with Data=0 in its cache):
1. P1: Data=2000 -- reaches memory (M4: Data=2000), but the coherence message (invalidate or update) has not yet reached P2's cached copy, which still reads 0
2. P1: Head=1 -- reaches memory (M3: Head=1)
3. P2: while (Head==0); -- reads Head=1 from memory, so the loop exits
4. P2: ... = Data -- hits on the stale cached copy and gets Data=0!
(Figure frames: P1, P2 with a cached copy of Data, and memory modules M3: Head and M4: Data.)

84 Caches + network... headaches Non-atomic writes: all processors have to see writes to the same location in the same order Code example: writes to A may appear at P3/P4 in a different order due to different network paths (see the animation on the following slides) Solution: Wait till all acks for a location are complete; or the network preserves ordering between a node pair

85-87 Animation of problems with networks (writes to the same location travel different paths):
A lives in memory module M1; B and C live in M2. Two writes, A=1 and A=2 (messages 1 and 2), propagate through the network along different routes.
P3 ends up with Register1 = A = 1, while P4 ends up with Register2 = A = 2: the two processors observe the writes to A in different orders because the write is not atomic.
(Figure frames: P1-P4 and memory modules M1 and M2 connected by a multi-path network.)

88 Summary of sequential consistency Slow, because In each processor, the previous memory operation in program order must be completed before proceeding with the next memory operation In cache-coherent systems, writes to the same location must be visible in the same order to all processors; and no subsequent reads till the write is visible to all processors Mitigating solutions Prefetch cache ownership while waiting in the write buffer Speculate and roll back Compiler memory analysis: reorder at HW/SW only if safe

89 Relaxed (Weak) Consistency Models Several relaxed models for memory consistency characterized by their attitude towards RAR, WAR, RAW, and WAW to different addresses Not really an issue for most programs -- they are synchronized A program is synchronized if all access to shared data are ordered by synchronization operations acquire (s) {lock}... normal access... release (s) {unlock} Other processes do not need to know intermediate values Only those programs willing to be nondeterministic are not synchronized, leading to data races

90 Relaxed (Weak) Consistency Accesses to synchronization variables are sequentially consistent No access to a synchronization variable is allowed to be performed until all previous writes have completed everywhere No data access (read or write) is allowed to be performed until all previous accesses to synchronization variables have been performed Example (S = a synchronization operation): P1: W(x)1, W(x)2, S | P2: R(x)0, R(x)2, S, R(x)2 | P3: R(x)1, S, R(x)2

91 Processor consistency Also called PRAM (an acronym for Pipelined Random Access Memory) consistency Writes done by a single processor are received by all other processors in the order in which they were issued; however, writes from different processors may be seen in a different order by different processors The basic idea of processor consistency is to better reflect the reality of networks in which the latency between different nodes can be different Is the following scenario valid under processor consistency? P1: W(x)1, W(x)2 | P2: R(x)2, R(x)1

92 Release consistency Release consistency considers locks on areas of memory, and propagates only the locked memory as needed Before an ordinary access to a shared variable is performed, all previous acquires done by the process must have completed successfully Before a release is allowed to be performed, all previous reads and writes done by the process must have completed The acquire and release accesses must be sequentially consistent
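
A sketch of the same acquire/release ordering in C11 atomics (an analogy I am adding, reusing the Data/Head flavor of the earlier animations; the slide describes the model abstractly):

    #include <stdatomic.h>

    int data;                         /* ordinary shared data */
    atomic_int ready = 0;             /* synchronization variable */

    void producer(void) {
        data = 2000;                                              /* ordinary write */
        atomic_store_explicit(&ready, 1, memory_order_release);  /* "release" */
    }

    void consumer(void) {
        while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
            ;                                                     /* "acquire" spin */
        /* data is guaranteed to be 2000 here: the acquire observes everything
           that happened before the matching release. */
        int local = data;
        (void)local;
    }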

93 Synchronization Hardware needs to know how programmer intends different processes to use shared data Issues for synchronization: Hardware needs to support uninterruptable instructions to fetch and update memory (atomic operation); User level synchronization operation using this primitive; For large scale MPs, synchronization can be a bottleneck; techniques to reduce contention and latency of synchronization

94 Uninterruptable Instruction to Fetch and Update Memory Atomic exchange: interchange a value in a register for a value in memory 0 => synchronization variable is free 1 => synchronization variable is locked and unavailable Set register to 1 and swap New value in register determines success in getting lock 0 if you succeeded in setting the lock (you were first) 1 if another processor had already claimed access Key is that the exchange operation is indivisible Test-and-set: tests a value and sets it if the value passes the test Fetch-and-increment: returns the value of a memory location and atomically increments it
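
These primitives map directly onto C11 atomics; a minimal sketch (my own, not from the slides, with assumed variable names):

    #include <stdatomic.h>

    atomic_int lock_var = 0;     /* 0 => free, 1 => locked */
    atomic_int counter  = 0;

    /* Atomic exchange: returns 0 if we got the lock, 1 if it was already held. */
    int try_acquire(void) {
        return atomic_exchange(&lock_var, 1);
    }

    /* Fetch-and-increment: returns the old value and bumps the counter atomically. */
    int next_ticket(void) {
        return atomic_fetch_add(&counter, 1);
    }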

95 Load Linked, Store Conditional Hard to have read and write in one instruction; use two instead: load linked (or load locked) + store conditional Load linked returns the initial value Store conditional returns 1 if it succeeds (no other store to the same memory location since the preceding load) and 0 otherwise
Example doing atomic swap with LL & SC:
try:  mov  R3,R4     ; move exchange value
      ll   R2,0(R1)  ; load linked
      sc   R3,0(R1)  ; store conditional
      beqz R3,try    ; branch if store fails (R3 = 0)
      mov  R4,R2     ; put load value in R4
Example doing fetch & increment with LL & SC:
try:  ll   R2,0(R1)  ; load linked
      addi R2,R2,#1  ; increment (OK if reg-reg operation)
      sc   R2,0(R1)  ; store conditional
      beqz R2,try    ; branch if store fails (R2 = 0)

96 User-Level Synchronization Spin locks: the processor continuously tries to acquire, spinning around a loop trying to get the lock
      li   R2,#1
lockit: exch R2,0(R1) ; atomic exchange
      bnez R2,lockit  ; already locked?
What about an MP with cache coherency? Want to spin on the cached copy to avoid full memory latency; likely to get cache hits for such variables Problem: exchange includes a write, which invalidates all other copies; this generates considerable bus traffic Solution: start by simply repeatedly reading the variable; when it is free, then try the exchange ("test and test&set"):
try:  li   R2,#1
lockit: lw   R3,0(R1) ; load var
      bnez R3,lockit  ; not free => spin
      exch R2,0(R1)   ; atomic exchange
      bnez R2,try     ; already locked?
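
The same "test and test&set" idea in C11 atomics (a sketch under the assumption of a single atomic_int lock word; the MIPS version above is the slides' own):

    #include <stdatomic.h>

    atomic_int lock_var;   /* 0 = free, 1 = held */

    void spin_lock(void) {
        for (;;) {
            /* Spin on the (cached) copy with plain loads: no bus traffic while held. */
            while (atomic_load(&lock_var) != 0)
                ;
            /* Lock looks free: now try the write-generating atomic exchange. */
            if (atomic_exchange(&lock_var, 1) == 0)
                return;            /* we got it */
        }
    }

    void spin_unlock(void) {
        atomic_store(&lock_var, 0);
    }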

97 Scalable Synchronization Exponential backoff: if you fail to get the lock, don't try again for an exponentially increasing amount of time Queuing lock: wait for the lock by getting on a queue; the lock is passed down the queue; can be implemented in hardware or software Combining tree: implement barriers as a tree to avoid contention Powerful atomic primitives such as fetch-and-increment
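
A sketch of exponential backoff layered on the atomic exchange above (the delay loop, initial value, and bound are assumptions for illustration, not from the slides):

    #include <stdatomic.h>

    extern atomic_int lock_var;             /* the spin-lock word from the sketch above */

    static void delay_cycles(unsigned n) {  /* crude busy-wait delay (assumption) */
        for (volatile unsigned i = 0; i < n; i++)
            ;
    }

    void backoff_lock(void) {
        unsigned backoff = 64;                        /* initial delay, arbitrary */
        while (atomic_exchange(&lock_var, 1) != 0) {  /* failed to get the lock */
            delay_cycles(backoff);
            if (backoff < 65536)
                backoff <<= 1;                        /* exponentially increase the wait */
        }
    }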

98 Summary Caches contain all information on state of cached memory blocks Snooping and Directory Protocols similar; bus makes snooping easier because of broadcast (snooping => uniform memory access) Directory has extra data structure to keep track of state of all memory blocks Distributing directory => scalable shared memory multiprocessor => Cache coherent, Non uniform memory access (cc-numa) Practical implementations of shared memory multiprocessors are not sequentially consistent due to performance reasons programmers use synchronization to ensure correctness

99 Case Study: IBM Power4: Cache States L1 cache: Writethrough L2 cache: Maintains coherence (invalidation-based), Writeback, Inclusive of L1 L3 cache: Writeback, non-inclusive, maintains coherence (invalidation-based) Snooping protocol Snoops on all buses.

100 IBM Power4

101 Case Study: IBM Power4 Caches

102 Interconnection network of buses

103 IBM Power4's L2 coherence protocol I (Invalid state) SL (Shared state, in one core within the module) S (Shared state, in multiple cores within a module) M (Modified state, and exclusive) Me (Data is not modified, but exclusive; no write back necessary) Mu [used for load-linked lwarx / store-conditional stwcx. sequences] T (tagged: data modified, but not exclusively owned) Optimization: Do not write back modified exclusive data upon a read hit

104 IBM Power4 s L2 coherence protocol

105 IBM Power4 s L3 coherence protocol States: Invalid Shared: Can only give data to its own L2s that it s caching data for Tagged: Modified and shared (local memory) Tagged remote: Modified and shared (remote memory) O (prefetch): Data in L3 = Memory, and same node, not modified.

106 IBM Blue Gene/L memory architecture

107 Node overview

108 Single chip


More information

EC 513 Computer Architecture

EC 513 Computer Architecture EC 513 Computer Architecture Cache Coherence - Directory Cache Coherence Prof. Michel A. Kinsy Shared Memory Multiprocessor Processor Cores Local Memories Memory Bus P 1 Snoopy Cache Physical Memory P

More information

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to

More information

Computer Science 146. Computer Architecture

Computer Science 146. Computer Architecture Computer Architecture Spring 24 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Lecture 2: More Multiprocessors Computation Taxonomy SISD SIMD MISD MIMD ILP Vectors, MM-ISAs Shared Memory

More information

3/13/2008 Csci 211 Lecture %/year. Manufacturer/Year. Processors/chip Threads/Processor. Threads/chip 3/13/2008 Csci 211 Lecture 8 4

3/13/2008 Csci 211 Lecture %/year. Manufacturer/Year. Processors/chip Threads/Processor. Threads/chip 3/13/2008 Csci 211 Lecture 8 4 Outline CSCI Computer System Architecture Lec 8 Multiprocessor Introduction Xiuzhen Cheng Department of Computer Sciences The George Washington University MP Motivation SISD v. SIMD v. MIMD Centralized

More information

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache. Multiprocessors why would you want a multiprocessor? Multiprocessors and Multithreading more is better? Cache Cache Cache Classifying Multiprocessors Flynn Taxonomy Flynn Taxonomy Interconnection Network

More information

Computer Architecture

Computer Architecture Jens Teubner Computer Architecture Summer 2016 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2016 Jens Teubner Computer Architecture Summer 2016 83 Part III Multi-Core

More information

Lecture 24: Virtual Memory, Multiprocessors

Lecture 24: Virtual Memory, Multiprocessors Lecture 24: Virtual Memory, Multiprocessors Today s topics: Virtual memory Multiprocessors, cache coherence 1 Virtual Memory Processes deal with virtual memory they have the illusion that a very large

More information

10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems

10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems 1 License: http://creativecommons.org/licenses/by-nc-nd/3.0/ 10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems To enhance system performance and, in some cases, to increase

More information

EEC 581 Computer Architecture. Lec 11 Synchronization and Memory Consistency Models (4.5 & 4.6)

EEC 581 Computer Architecture. Lec 11 Synchronization and Memory Consistency Models (4.5 & 4.6) EEC 581 Computer rchitecture Lec 11 Synchronization and Memory Consistency Models (4.5 & 4.6) Chansu Yu Electrical and Computer Engineering Cleveland State University cknowledgement Part of class notes

More information

Chapter 6. Parallel Processors from Client to Cloud Part 2 COMPUTER ORGANIZATION AND DESIGN. Homogeneous & Heterogeneous Multicore Architectures

Chapter 6. Parallel Processors from Client to Cloud Part 2 COMPUTER ORGANIZATION AND DESIGN. Homogeneous & Heterogeneous Multicore Architectures COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Part 2 Homogeneous & Heterogeneous Multicore Architectures Intel XEON 22nm

More information

Computer parallelism Flynn s categories

Computer parallelism Flynn s categories 04 Multi-processors 04.01-04.02 Taxonomy and communication Parallelism Taxonomy Communication alessandro bogliolo isti information science and technology institute 1/9 Computer parallelism Flynn s categories

More information

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

COSC 6385 Computer Architecture - Thread Level Parallelism (I) COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month

More information

M4 Parallelism. Implementation of Locks Cache Coherence

M4 Parallelism. Implementation of Locks Cache Coherence M4 Parallelism Implementation of Locks Cache Coherence Outline Parallelism Flynn s classification Vector Processing Subword Parallelism Symmetric Multiprocessors, Distributed Memory Machines Shared Memory

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 23 Mahadevan Gomathisankaran April 27, 2010 04/27/2010 Lecture 23 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

Multiprocessors - Flynn s Taxonomy (1966)

Multiprocessors - Flynn s Taxonomy (1966) Multiprocessors - Flynn s Taxonomy (1966) Single Instruction stream, Single Data stream (SISD) Conventional uniprocessor Although ILP is exploited Single Program Counter -> Single Instruction stream The

More information

EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University

EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University Material from: Parallel Computer Organization and Design by Debois,

More information

Computer Architecture

Computer Architecture 18-447 Computer Architecture CSCI-564 Advanced Computer Architecture Lecture 29: Consistency & Coherence Lecture 20: Consistency and Coherence Bo Wu Prof. Onur Mutlu Colorado Carnegie School Mellon University

More information

Page 1. Cache Coherence

Page 1. Cache Coherence Page 1 Cache Coherence 1 Page 2 Memory Consistency in SMPs CPU-1 CPU-2 A 100 cache-1 A 100 cache-2 CPU-Memory bus A 100 memory Suppose CPU-1 updates A to 200. write-back: memory and cache-2 have stale

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

Fall 2012 EE 6633: Architecture of Parallel Computers Lecture 4: Shared Address Multiprocessors Acknowledgement: Dave Patterson, UC Berkeley

Fall 2012 EE 6633: Architecture of Parallel Computers Lecture 4: Shared Address Multiprocessors Acknowledgement: Dave Patterson, UC Berkeley Fall 2012 EE 6633: Architecture of Parallel Computers Lecture 4: Shared Address Multiprocessors Acknowledgement: Dave Patterson, UC Berkeley Avinash Kodi Department of Electrical Engineering & Computer

More information

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 15: Memory Consistency and Synchronization Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 5 (multi-core) " Basic requirements: out later today

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Shared-Memory Multi-Processors Shared-Memory Multiprocessors Multiple threads use shared memory (address space) SysV Shared Memory or Threads in software Communication implicit

More information

Limitations of parallel processing

Limitations of parallel processing Your professor du jour: Steve Gribble gribble@cs.washington.edu 323B Sieg Hall all material in this lecture in Henessey and Patterson, Chapter 8 635-640 645, 646 654-665 11/8/00 CSE 471 Multiprocessors

More information

Multiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism

Multiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism Computing Systems & Performance Beyond Instruction-Level Parallelism MSc Informatics Eng. 2012/13 A.J.Proença From ILP to Multithreading and Shared Cache (most slides are borrowed) When exploiting ILP,

More information

1. Memory technology & Hierarchy

1. Memory technology & Hierarchy 1. Memory technology & Hierarchy Back to caching... Advances in Computer Architecture Andy D. Pimentel Caches in a multi-processor context Dealing with concurrent updates Multiprocessor architecture In

More information

Parallel Computer Architecture Spring Shared Memory Multiprocessors Memory Coherence

Parallel Computer Architecture Spring Shared Memory Multiprocessors Memory Coherence Parallel Computer Architecture Spring 2018 Shared Memory Multiprocessors Memory Coherence Nikos Bellas Computer and Communications Engineering Department University of Thessaly Parallel Computer Architecture

More information

Flynn's Classification

Flynn's Classification Multiprocessors Oracles's SPARC M7-32 core, 64MB L3 cache (8 x 8 MB), 1.6TB/s. 256 KB of 4-way SA L2 ICache, 0.5 TB/s per cluster. 2 cores share 256 KB, 8-way SA L2 DCache, 0.5 TB/s. Flynn's Classification

More information

EITF20: Computer Architecture Part 5.2.1: IO and MultiProcessor

EITF20: Computer Architecture Part 5.2.1: IO and MultiProcessor EITF20: Computer Architecture Part 5.2.1: IO and MultiProcessor Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration I/O MultiProcessor Summary 2 Virtual memory benifits Using physical memory efficiently

More information

Shared Symmetric Memory Systems

Shared Symmetric Memory Systems Shared Symmetric Memory Systems Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department University

More information

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 11

More information

Computer Architecture

Computer Architecture Computer Architecture Chapter 7 Parallel Processing 1 Parallelism Instruction-level parallelism (Ch.6) pipeline superscalar latency issues hazards Processor-level parallelism (Ch.7) array/vector of processors

More information

Multiprocessors and Thread Level Parallelism CS 282 KAUST Spring 2010 Muhamed Mudawar

Multiprocessors and Thread Level Parallelism CS 282 KAUST Spring 2010 Muhamed Mudawar Multiprocessors and Thread Level Parallelism CS 282 KAUST Spring 2010 Muhamed Mudawar Original slides created by: David Patterson Uniprocessor Performance (SPECint) 80) Performance (vs. VAX-11/78 10000

More information

EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy

EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University Material from: Parallel Computer Organization and Design by Debois,

More information

ECSE 425 Lecture 30: Directory Coherence

ECSE 425 Lecture 30: Directory Coherence ECSE 425 Lecture 30: Directory Coherence H&P Chapter 4 Last Time Snoopy Coherence Symmetric SMP Performance 2 Today Directory- based Coherence 3 A Scalable Approach: Directories One directory entry for

More information

Lecture 18: Coherence Protocols. Topics: coherence protocols for symmetric and distributed shared-memory multiprocessors (Sections

Lecture 18: Coherence Protocols. Topics: coherence protocols for symmetric and distributed shared-memory multiprocessors (Sections Lecture 18: Coherence Protocols Topics: coherence protocols for symmetric and distributed shared-memory multiprocessors (Sections 4.2-4.4) 1 SMP/UMA/Centralized Memory Multiprocessor Main Memory I/O System

More information

Multiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems

Multiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems Multiprocessors II: CC-NUMA DSM DSM cache coherence the hardware stuff Today s topics: what happens when we lose snooping new issues: global vs. local cache line state enter the directory issues of increasing

More information

Multiprocessor Cache Coherency. What is Cache Coherence?

Multiprocessor Cache Coherency. What is Cache Coherence? Multiprocessor Cache Coherency CS448 1 What is Cache Coherence? Two processors can have two different values for the same memory location 2 1 Terminology Coherence Defines what values can be returned by

More information

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering

More information