Supporting Distributed Shared Memory. Axel Jantsch, Xiaowen Chen, Zhonghai Lu. Royal Institute of Technology, Sweden. September 16, 2009



2 Memory content in today's SoCs

3 Three Elements in SoC. Processing: well understood; standard solutions; commodity IPs. Communication: well understood by the research community; no standard solutions yet. Storage: not well understood due to limited perspectives and changing technology parameters. KTH/ICT/ES

4 Memory Access Bottleneck

5 Memory Access Bandwidth vs Processing Capacity


6 Memory Bandwidth in 3D ICs. 8x8 MPSoC. Resource size: 2x2 mm². Switch size: 100x100 µm². TSV pitch: 5 µm. TSVs per switch: 128 at 2 GHz. Memory bandwidth per core: 256 Gb/s. Aggregate memory bandwidth: 16 Tb/s (200 Gb/s for TilePro64). Read latency: < 5 ns. 80x bandwidth; 1/10 latency; 1/10 power; no area overhead.
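The bandwidth figures on this slide follow from simple arithmetic, sketched below. The assumption that each TSV carries one bit per clock cycle is ours, not stated on the slide, and the function names are illustrative only.

```python
# Figures copied from the slide; nothing here is a measurement.
TSVS_PER_SWITCH = 128      # TSVs per switch
TSV_CLOCK_GHZ = 2          # TSV clock frequency in GHz
MESH_ROWS, MESH_COLS = 8, 8
TILEPRO64_GBPS = 200       # comparison figure quoted on the slide


def per_core_bandwidth_gbps(tsvs=TSVS_PER_SWITCH, clock_ghz=TSV_CLOCK_GHZ):
    """Per-core bandwidth in Gb/s, assuming one bit per TSV per cycle."""
    return tsvs * clock_ghz


def aggregate_bandwidth_gbps(rows=MESH_ROWS, cols=MESH_COLS):
    """Aggregate bandwidth in Gb/s over all cores of the mesh."""
    return rows * cols * per_core_bandwidth_gbps()


per_core = per_core_bandwidth_gbps()       # 256 Gb/s
aggregate = aggregate_bandwidth_gbps()     # 16384 Gb/s, about 16 Tb/s
ratio = aggregate / TILEPRO64_GBPS         # roughly 80x

print(per_core, aggregate, round(ratio, 1))
```

Under that assumption, 128 TSVs at 2 GHz give the 256 Gb/s per core quoted on the slide, and 64 cores give about 16 Tb/s, roughly 80 times the 200 Gb/s figure cited for TilePro64.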

7 Challenges of Distributed Memory. New memory architectures must be developed. Cache coherency: need for efficient distributed cache coherency schemes. Memory consistency: platform for memory consistency required. Programming models for parallel computing and distributed memory.

8 Target Architecture: DSM Based Multi-core NoCs. Private memory: physical addressing. Shared memory: logical addressing. The philosophy of this design is to speed up frequent private access as well as to maintain a single virtual space.

9 Data Management Engine: Architecture. Dual interfaces and dual processors; cooperation of the interface units and two mini-processors; dual-port shared Control Store and Local Memory; hardware support for mutex synchronization; dynamic uploading of microcode into the Control Store.

10 Mini-processor: 4 function units; 5-stage pipeline.

11 Microinstruction: The microinstructions are organized horizontally.

12 Operation Mechanism: command-triggered microcode execution.

13 Hardware Cost

                        Optimized for area       Optimized for speed
Frequency               ~444 MHz (2.25 ns)       ~455 MHz (2.2 ns)
Area (logic)            ~44k NAND gates          ~51k NAND gates
Area (Control Store)    ~300k NAND gates (four 1024 x 32-bit dual-port SRAMs)

Tools: Synopsys Design Compiler to synthesize the DMC design; Chartered 0.13 µm technology; Artisan Memory Compiler to generate the dual-port Control Store.

14 DME Assembler: This tool assembles a microprogram in the DME language into machine-readable binary code.

15 DME Programs: virtual-to-physical memory translation; burst memory access; synchronization and locking; cache coherence protocol; memory consistency support functions. Under development: dynamic memory allocator.

16 Experiments and Results: Experimental Platform. Network: dimension-order XY routing; round-robin arbitration; flow control using FIFOs. Experiments: synthetic workloads (uniform traffic, hotspot traffic) and application workloads (matrix multiplication, 2D radix-2 DIT FFT).
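For readers unfamiliar with dimension-order XY routing, the sketch below shows the general technique on a 2D mesh: a packet first travels along the X dimension until its column matches the destination, then along Y. The coordinate convention and the function name `xy_route` are assumptions made for illustration; the slides do not describe the authors' router implementation.

```python
def xy_route(src, dst):
    """Return the hops from src to dst (src excluded) under XY routing.

    Nodes are (x, y) tuples on a 2D mesh. The packet is routed fully
    in X first, then in Y, which is what makes the route deterministic.
    """
    x, y = src
    dst_x, dst_y = dst
    path = []
    # Phase 1: move along X until the destination column is reached.
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append((x, y))
    # Phase 2: move along Y until the destination row is reached.
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


print(xy_route((0, 0), (2, 1)))  # [(1, 0), (2, 0), (2, 1)]
```

The hop count equals the Manhattan distance between the two nodes, which is why average communication latency grows with network size in mesh experiments of this kind.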

17 Experiments and Results: Synthetic Workloads. Shared memory access, effect of transaction size.

18 Experiments and Results: Synthetic Workloads. Shared memory access, effect of network size.

19 Experiments and Results: Synthetic Workloads. Synchronization, effect of network size.

20 Experiments and Results: Application Workloads. Matrix multiplication, 2D radix-2 DIT FFT.

21 Summary. Distributed Shared Memory architecture integrated in the communication infrastructure. Data Management Engine supports all memory and data management: virtual address space, cache coherence, memory consistency, shared memory communication, message passing, dynamic memory allocation, abstract data types, etc.

22 Microprogramming Flow

23 Virtual-to-Physical Transaction

24 Memory Access

25 Synchronization: Local polling avoids incurring additional network traffic and won't block other commands for a long time.

26 Performance Analysis: Memory Access. For a remote read transaction (T_rss and T_rsb, α=1), the delay consists of seven parts:
(1) V2P translation latency: T_v2p
(2) latency of distinguishing whether the read is local or remote: T_d
(3) latency of launching a request message to the remote destination node: T_m
(4) communication latency: T_com = T_csd (from source to destination) + T_cds (from destination to source)
(5) latency of filling the pipeline at the beginning of microcode execution: T_f
(6) latency of branching to where the memory read microcode is: T_b
(7) latency of executing the memory read microcode: 3 cycles for a single read and 1 + 2*(n_b + 1) + 1 cycles for a burst read of n_b words
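The seven-part delay above can be written as a small cost model. The sketch assumes all latencies are expressed in cycles and simply sums the parts; the function and parameter names are ours, and any values passed in are placeholders rather than measurements from the slides.

```python
def read_microcode_cycles(n_b=None):
    """Part (7): cycles to execute the memory read microcode.

    n_b=None models a single read (3 cycles); otherwise a burst read
    of n_b words takes 1 + 2*(n_b + 1) + 1 cycles, as on the slide.
    """
    if n_b is None:
        return 3
    return 1 + 2 * (n_b + 1) + 1


def remote_read_delay(t_v2p, t_d, t_m, t_csd, t_cds, t_f, t_b, n_b=None):
    """Total remote read delay as the sum of parts (1) through (7)."""
    t_com = t_csd + t_cds  # part (4): round-trip communication latency
    return t_v2p + t_d + t_m + t_com + t_f + t_b + read_microcode_cycles(n_b)


# Placeholder latencies (hypothetical, in cycles) just to exercise the model.
single = remote_read_delay(1, 1, 1, 10, 10, 1, 1)
burst = remote_read_delay(1, 1, 1, 10, 10, 1, 1, n_b=4)
print(single, burst)
```

The burst term simplifies to 2*n_b + 4 cycles, so the fixed parts (1) through (6) are amortized over more words as the burst length grows.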


27 Performance Analysis: Synchronization. Synchronization is categorized into two types: (1) local shared; (2) remote shared. For acquiring a remote lock, the delay (T_sync_r) consists of seven parts (similar to shared memory access): (1) T_v2p, (2) T_d, (3) T_m, (4) T_com = T_csd + T_cds, (5) T_f, (6) T_b, (7) latency of executing test-and-set(): 8 cycles. Parts (5), (6) and (7) are multiplied by the number of acquire attempts, n_b.
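The remote lock delay can be modeled the same way. The sketch below assumes the "8" for test-and-set() is a cycle count and that parts (1) through (4) are paid once while parts (5) through (7) repeat on every acquire attempt, which is our reading of the slide. Names and sample values are illustrative.

```python
TEST_AND_SET_CYCLES = 8  # part (7); unit assumed to be cycles


def remote_lock_delay(t_v2p, t_d, t_m, t_csd, t_cds, t_f, t_b, attempts=1):
    """T_sync_r: delay of acquiring a remote lock after `attempts` tries.

    Parts (1)-(4) are counted once; parts (5)-(7) are multiplied by the
    number of acquire attempts (n_b on the slide).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    one_time = t_v2p + t_d + t_m + (t_csd + t_cds)
    per_attempt = t_f + t_b + TEST_AND_SET_CYCLES
    return one_time + attempts * per_attempt


# Placeholder latencies (hypothetical, in cycles).
print(remote_lock_delay(1, 1, 1, 10, 10, 1, 1))              # first try succeeds
print(remote_lock_delay(1, 1, 1, 10, 10, 1, 1, attempts=3))  # two failed tries
```

In this model a contended lock adds only the per-attempt term for each retry, which matches the earlier slide's point that polling happens locally and does not generate additional network traffic.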
