SysAlloc: A Hardware Manager for Dynamic Memory Allocation in Heterogeneous Systems

1 SysAlloc: A Hardware Manager for Dynamic Memory Allocation in Heterogeneous Systems Zeping Xue, David Thomas zeping.xue10@imperial.ac.uk FPL 2015, London 4 Sept

2 Single Processor System Allocator Memory Memory Mapped Bus 2

3 Allocator Size Allocator Address int *ptr; ptr = malloc(size); 3

4 Allocator: takes a Size and returns an Address. 1. Search for a suitable memory block. 2. Record the memory allocation. int *ptr; ptr = malloc(size);

5 Multi-Processors (Homogeneous) System Allocator Allocator Allocator Memory Memory Mapped Bus 5

6 What about Heterogeneous Systems? Allocator Memory Memory Mapped Bus FPGA Accelerator 1 FPGA Accelerator 2 6

7 What about Heterogeneous Systems? FPL 2015 Fleming et al. PushPush Allocator Memory Memory Mapped Bus FPGA Accelerator 1 FPGA Accelerator 2 7

8 Proposed Solution: Hardware Allocator Memory Memory Mapped Bus FPGA Accelerator 1 FPGA Accelerator 2 SysAlloc 8

9 Having memory management in heterogeneous systems can: improve memory efficiency, reduce design effort, and make HLS more capable.

10 to 11 Related Work
1. Managing Shared Distributed BRAMs in NoCs [1][2][3]
2. Crossbar connected MPSoC with shared BRAMs [4]
3. HLS DMMs [5][6]

Comparison of SysAlloc with [1] to [4]:
Connection: SysAlloc: Bus; [1], [2], [3]: NoC; [4]: Crossbar
Memory: SysAlloc: On/Off-Chip Shared; related work: Distributed On-Chip BRAMs
Scalability: SysAlloc: Unbounded; related work: Platform-Dependent, Limited by resource
Clock Rate: SysAlloc: 150MHz; related work: Platform-Dependent, MHz

[1] Anagnostopoulos et al
[2] Göhringer et al
[3] Monchiero et al
[4] Dessouky et al
[5] Winterstein et al
[6] Diamantopoulos et al. 2015

12 Our Proposed Hardware Allocator (SysAlloc)
Scalable: functionality-wise, supports an arbitrary number of clients.
Flexible: any range of memory can be managed.
Efficient: clock rate and resource utilisation.

13 Route Map
SysAlloc: System Integration
SysAlloc: Allocator Algorithm
Evaluation Results
Conclusion

14 SysAlloc System Integration 14

15 SysAlloc System Integration 15

16 to 29 Allocator Access Examples: Software (ARM core) Access

    int* SysMalloc(int size){
        int token, status, *pointer;
        token = 0; status = 0;
        while(token == 0){
            token = AXI_READ(SysA_TOKEN);
        }
        AXI_WRITE(SysA_CMD, size);
        while(status == 0){
            status = AXI_READ(SysA_STAT);
        }
        pointer = (int *)AXI_READ(SysA_RESULT);
        return pointer;
    }

Memory Map 0x x (Storage), register values as the call proceeds:

    Slides    SysA_TOKEN  SysA_CMD  SysA_STAT  SysA_RESULT
    16 to 17  1           0         0          0
    18 to 19  0           0         0          0
    20 to 23  0           size      0          0
    24 to 28  0           size      1          0x0000A100
    29        1           size      0          0x0000A100

30 Allocator Access Examples

Software (ARM core) Access:

    int* SysMalloc(int size){
        int token, status, *pointer;
        token = 0; status = 0;
        while(token == 0){
            token = AXI_READ(SysA_TOKEN);
        }
        AXI_WRITE(SysA_CMD, size);
        while(status == 0){
            status = AXI_READ(SysA_STAT);
        }
        pointer = (int *)AXI_READ(SysA_RESULT);
        return pointer;
    }

HLS Access (Vivado HLS):

    int SysMalloc(int size, int *PortM){
        int token, status, result;
        token = 0; status = 0;
        while(token == 0){
            token = PortM[SysA_TOKEN];
        }
        PortM[SysA_CMD] = size;
        while(status == 0){
            status = PortM[SysA_STAT];
        }
        result = PortM[SysA_RESULT];
        return result + PortM;
    }
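The token, command, status and result handshake from the access examples can be exercised on a desktop with a small mock of the register file. The sketch below is only an illustration: the register indices, the side effects on read and write (inferred from the register values shown on the slides), and the bump-pointer stand-in for the allocator are all assumptions, not SysAlloc's actual interface or algorithm.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register indices; the real address map is not given here. */
enum { SysA_TOKEN, SysA_CMD, SysA_STAT, SysA_RESULT, SysA_NUM_REGS };

static uint32_t regs[SysA_NUM_REGS] = { 1, 0, 0, 0 }; /* token starts available */
static uint32_t next_free = 0x0000A100u; /* toy bump pointer, NOT the SysAlloc algorithm */

/* Mock bus read. Side effects are assumptions inferred from the slides:
   reading an available token takes it; reading a ready result clears the
   status flag and hands the token back. */
uint32_t AXI_READ(int reg) {
    assert(reg >= 0 && reg < SysA_NUM_REGS);
    uint32_t value = regs[reg];
    if (reg == SysA_TOKEN && value == 1) {
        regs[SysA_TOKEN] = 0;
    } else if (reg == SysA_RESULT && regs[SysA_STAT] == 1) {
        regs[SysA_STAT] = 0;
        regs[SysA_TOKEN] = 1;
    }
    return value;
}

/* Mock bus write. Writing a size to the command register "runs" the
   allocator and publishes a result immediately. */
void AXI_WRITE(int reg, uint32_t value) {
    assert(reg >= 0 && reg < SysA_NUM_REGS);
    regs[reg] = value;
    if (reg == SysA_CMD) {
        regs[SysA_RESULT] = next_free;
        next_free += value;
        regs[SysA_STAT] = 1;
    }
}

/* Same control flow as the software access example, returning the raw
   address value instead of a pointer so it runs without real hardware. */
uint32_t SysMalloc(uint32_t size) {
    uint32_t token = 0, status = 0;
    while (token == 0) {
        token = AXI_READ(SysA_TOKEN);
    }
    AXI_WRITE(SysA_CMD, size);
    while (status == 0) {
        status = AXI_READ(SysA_STAT);
    }
    return AXI_READ(SysA_RESULT);
}
```

Under this mock, calling SysMalloc(64) and then SysMalloc(16) returns 0x0000A100 and 0x0000A140, and the token register is back at 1 after each call, matching the last register state shown on the slides.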

31 SysAlloc Allocator Algorithm 31

32 Background of SysAlloc Allocator Algorithm [1] 32

33 Background of SysAlloc Allocator Algorithm Allocated Memory Bit-Map [1] 33

34 Background of SysAlloc Allocator Algorithm [1] [1] 34

35 Background of SysAlloc Allocator Algorithm [1] [1] 35

36 Limitation: for a large bit-map, resource utilisation is high and the clock rate is low because of the long sequence of logic gates. (Bit-Map)
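A software analogue may help show why searching the bit-map directly scales poorly. The sketch below is a plain first-fit scan and is not the algorithm from [1]; the names and the 32-block size are hypothetical. In hardware the loop would become a chain of logic whose length grows with the bit-map, which is the cost the slide points at.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_BLOCKS 32               /* illustrative bit-map size */

/* 1 = block allocated, 0 = free. One byte per bit keeps the sketch simple. */
static uint8_t bitmap[NUM_BLOCKS];

/* First-fit: find the first run of `count` free blocks, mark it allocated,
   and return its start index. Returns -1 if no such run exists.
   The scan visits up to NUM_BLOCKS entries. */
int bitmap_alloc(int count) {
    int run = 0;
    if (count <= 0 || count > NUM_BLOCKS) return -1;
    for (int i = 0; i < NUM_BLOCKS; i++) {
        run = bitmap[i] ? 0 : run + 1;
        if (run == count) {
            int start = i - count + 1;
            for (int j = start; j <= i; j++) bitmap[j] = 1;
            return start;
        }
    }
    return -1;
}

/* Release `count` blocks starting at `start`. */
void bitmap_free(int start, int count) {
    assert(start >= 0 && count >= 0 && start + count <= NUM_BLOCKS);
    for (int j = start; j < start + count; j++) bitmap[j] = 0;
}
```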

37 SysAlloc Allocator Algorithm Our Proposed Solution 37

38 Allocation Tree: Allocation Search Binary Tree over the Bit-Map

    Node Value  Node Status
    00          Empty
    10          Partially Full
    11          Full

39 Allocation Tree Storage Allocation Search Binary Tree Stored in memory as groups of nodes Bit-Map 39

40 Allocation Tree Storage Bit-Map

41 Allocation Tree Storage: Node Value Representation

    Node Value  Node Status
    00          Empty
    10          Partially Full
    11          Full

42 to 51 Allocation Tree Storage, Request size = 3 blocks. Tree levels from root to leaves: 32 blocks, 16 blocks, 8 blocks, 4 blocks, 2 blocks, Bit-Map.
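The allocation search over a tree of Empty, Partially Full and Full nodes can be sketched in software as follows. Several points are assumptions made for illustration and are not confirmed by the slides: requests are rounded up to a power of two (so 3 blocks takes a 4-block node), the search is depth-first with backtracking, and the tree sits in one flat array rather than in groups of nodes in memory. All function names are hypothetical.

```c
#include <assert.h>

#define NUM_BLOCKS 32   /* leaves of the tree = bits of the bit-map */

enum { EMPTY = 0x0, PARTIAL = 0x2, FULL = 0x3 };   /* 00, 10, 11 */

/* Heap-style layout: node 1 is the root, children of n are 2n and 2n+1.
   Static storage starts zeroed, so every node starts EMPTY. */
static unsigned char tree[2 * NUM_BLOCKS];

int round_up_pow2(int n) {
    int p = 1;
    while (p < n) p *= 2;
    return p;
}

/* Depth-first search for an EMPTY node that covers exactly `want` blocks.
   FULL subtrees are skipped without being visited. */
int find_node(int node, int node_size, int want) {
    if (tree[node] == FULL) return -1;
    if (node_size == want) return tree[node] == EMPTY ? node : -1;
    int hit = find_node(2 * node, node_size / 2, want);
    return hit >= 0 ? hit : find_node(2 * node + 1, node_size / 2, want);
}

/* Recompute every ancestor of `node` from its two children. */
void update_parents(int node) {
    for (node /= 2; node >= 1; node /= 2) {
        unsigned char l = tree[2 * node], r = tree[2 * node + 1];
        tree[node] = (l == FULL && r == FULL)   ? FULL
                   : (l == EMPTY && r == EMPTY) ? EMPTY
                                                : PARTIAL;
    }
}

/* Returns the first block index of the allocation, or -1 if nothing fits. */
int tree_alloc(int blocks) {
    if (blocks <= 0 || blocks > NUM_BLOCKS) return -1;
    int want = round_up_pow2(blocks);
    int node = find_node(1, NUM_BLOCKS, want);
    if (node < 0) return -1;
    tree[node] = FULL;
    update_parents(node);
    return (node - NUM_BLOCKS / want) * want;
}

/* Frees an allocation made by tree_alloc(blocks) that returned first_block. */
void tree_free(int first_block, int blocks) {
    assert(blocks > 0 && blocks <= NUM_BLOCKS);
    int want = round_up_pow2(blocks);
    int node = NUM_BLOCKS / want + first_block / want;
    tree[node] = EMPTY;
    update_parents(node);
}
```

With an empty 32-block tree, tree_alloc(3) walks the 32, 16, 8 and 4 block levels and returns block 0. The number of levels visited on that path is bounded by the tree depth rather than by the bit-map length.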

52 Benefits Constant FPGA Logic Utilisation 52

53 Benefits: For large bit-maps, the maximum tree depth is log2(Bit-Map Size), which is far smaller than the Bit-Map Size, so searching the tree is more efficient than searching the bit-map directly.

54 Evaluation Results 54

55 Evaluation Set-up
Environment: Zynq-7000 SoC XC7Z MHz
SysAlloc Configuration: Manage 128MB DDR; Minimum Block Size = 16B
Synthetic Clients: Request rate and size are configurable

56 Scalability Pathological Case 56

57 Scalability Size & Wait Period with Exponential Growth 1.36 million malloc/s 57

58 Configuration Cases
Case A: Managing 512 MB DDR with a minimum block size of 16B.
Case B: A linked list using 256KB of on-chip memory with a minimum block size of 8B.
Case C: Signal processing applications that need 256 buffers of 1MB each to be managed.
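The bit-map size implied by each case, and the log2 bound on tree depth quoted under Benefits, follow from dividing the managed memory by the minimum block size. The helper below is a small sketch of that arithmetic. It assumes MB and KB mean 2^20 and 2^10 bytes, and for Case C that each 1MB buffer is one block; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of minimum-size blocks, i.e. the number of bits in the bit-map. */
uint64_t bitmap_bits(uint64_t managed_bytes, uint64_t min_block_bytes) {
    assert(min_block_bytes > 0);
    return managed_bytes / min_block_bytes;
}

/* floor(log2(n)) for n >= 1; exact when n is a power of two. */
unsigned log2_floor(uint64_t n) {
    unsigned depth = 0;
    assert(n >= 1);
    while (n > 1) {
        n >>= 1;
        depth++;
    }
    return depth;
}
```

Under these assumptions, Case A gives 33,554,432 bits with log2 = 25, Case B gives 32,768 bits with log2 = 15, and Case C gives 256 bits with log2 = 8. The evaluation set-up (128MB, 16B blocks) gives 8,388,608 bits with log2 = 23.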

59 Configuration Cases Bit-Map Size 59

60 Configuration Cases FPGA LEs Utilisation 60

61 Configuration Cases Memory Bits Required 61

62 Configuration Cases Latency 62

63 Conclusion 63

64 Conclusion: SysAlloc is Scalable, Efficient and Flexible. Peak system allocation rate = 1.36 million malloc/s when managing 128MB DDR for 4 clients.

65 Future Work Faster Allocator 65

66 Future Work More Efficient Communication Protocol 66

67 Future Work Garbage Collection 67

68 Future Work HLS Access 68

69 Thank You & Questions? Source code can be found at 69
