Optimizing DMA Data Transfers for Embedded Multi-Cores
Optimizing DMA Data Transfers for Embedded Multi-Cores. Selma Saïdi. Jury members: Oded Maler (thesis advisor), Ahmed Bouajjani (jury president), Luca Benini (reviewer), Albert Cohen (reviewer), Eric Flamand (examiner), Bruno Jego (examiner). Selma Saidi, Optimizing DMA Transfers, 24 October
Context of the Thesis. A CIFRE Ph.D. with STMicroelectronics, supervised by Oded Maler (Verimag) and Bruno Jego, in collaboration with Thierry Lepley, STMicroelectronics. Part of the Minalogic project ATHOLE, a low-power multi-core platform for embedded systems; partners: ST, CEA, Thales, CWS, Verimag.
Outline
1 Context and Motivation
2 Contribution: Problem Definition; Optimal Granularity for a Single Processor (1D Data, 2D Data); Multiple Processors; Shared Data
3 Experiments on the Cell BE
4 The Move towards Platform 2012
Conclusions and Perspectives
Embedded Systems. There is an increasing demand for performance under low-power constraints: embedded devices must integrate ever more functionalities, while applications become more computationally intensive and power hungry.
The Emergence of Multicore Architectures. Running two processors on the same chip at half the speed consumes less energy while delivering comparable performance.
Embedded Multicore Architectures: Platform 2012, a manycore computation fabric of N clusters, where each cluster contains processing elements PE0..PE16, a DMA engine, and an L1 memory. Its main characteristic is explicitly managed memories. Features: 1 scratchpad memories (no caches); 2 a DMA engine, i.e. hardware for managing data transfers, driven by commands of the form (source address, destination address, block size); 3 data transfers are explicitly managed by the software. DMA advantages: (a) it transfers large data blocks efficiently, and (b) it can work in parallel with the processor.
The P2012 fabric acts as a general-purpose programmable accelerator, giving rise to heterogeneous multicore architectures. This is the class of architectures in which we are interested.
Heterogeneous Multi-core Architectures: a powerful host processor and a multi-core fabric (PE0..PEn with local memories) used to accelerate computationally heavy kernels, connected through an interconnect to an off-chip memory; the host runs tasks (T0, T1, T2, ...) and offloads kernels to the fabric.
Offloadable kernels work on large data sets, initially stored in a distant off-chip memory, e.g. a data-parallel loop: for i = 0 to n−1: Y[i] = f(X[i]). Off-chip memory latency is high, so reading and writing off-chip data is very costly. Data is therefore transferred to a closer but smaller on-chip memory in blocks, using DMAs (Direct Memory Access).
DMA Data Transfers: Single Buffering. Let s be the number of array elements in one block; the array X[0..n−1] is split into m = n/s blocks block_0 .. block_{m−1}.

i = 0
while (i < n/s):
    Fetch(block_i)          // dma_get(local_buffer, block_i, s)
    Compute(block_i)
    Write_back(block_i)     // dma_put(block_i, local_buffer, s)
    i = i + 1

Computations and data transfers execute sequentially.
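The single-buffering loop can be emulated in plain Python (a sketch: the slice copies stand in for the real dma_get/dma_put primitives, and f is a placeholder kernel):

```python
def single_buffering(X, f, s):
    """Process X block by block through a local buffer of s elements."""
    n = len(X)
    assert n % s == 0, "assume the block size divides the array length"
    Y = [None] * n
    local = [None] * s                  # local (scratchpad) buffer
    for i in range(n // s):
        local[:] = X[i*s:(i+1)*s]       # Fetch(block_i): dma_get emulation
        local = [f(x) for x in local]   # Compute(block_i)
        Y[i*s:(i+1)*s] = local          # Write_back(block_i): dma_put emulation
    return Y

# Example: square each element with blocks of 4
print(single_buffering(list(range(8)), lambda x: x * x, 4))
# -> [0, 1, 4, 9, 16, 25, 36, 49]
```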
DMA Data Transfers: Double Buffering, using asynchronous DMA calls and two local buffers:

Fetch(block_0)              // dma_get(local_buffer[1], block_0, s)
i = 0
while (i < (n/s) − 1):
    Fetch(block_{i+1})      // dma_get(local_buffer[2], block_{i+1}, s), asynchronous
    Compute(block_i)
    Write_back(block_i)
    i = i + 1
Compute(block_{(n/s)−1})
Write_back(block_{(n/s)−1})

Computations and data transfers overlap.
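The buffer alternation can be sketched in Python as follows (a functional emulation only: here the prefetch of block i+1 is an ordinary copy, whereas on the real hardware it is an asynchronous DMA that overlaps the computation of block i):

```python
def double_buffering(X, f, s):
    """Double-buffered blockwise processing: prefetch block i+1 while block i computes."""
    n = len(X)
    assert n % s == 0, "assume the block size divides the array length"
    m = n // s
    Y = [None] * n
    buf = [None, None]                  # the two alternating local buffers
    buf[0] = X[0:s]                     # Fetch(block_0) into buffer 0
    for i in range(m):
        if i + 1 < m:                   # asynchronous Fetch(block_{i+1}) on real HW
            buf[(i + 1) % 2] = X[(i+1)*s:(i+2)*s]
        out = [f(x) for x in buf[i % 2]]  # Compute(block_i) on the current buffer
        Y[i*s:(i+1)*s] = out            # Write_back(block_i)
    return Y

print(double_buffering(list(range(8)), lambda x: x + 1, 2))
# -> [1, 2, 3, 4, 5, 6, 7, 8]
```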
Double Buffering: Pipelined Execution. The computation of the current block overlaps with the transfer of the next block. [Figure: pipelined schedule of input transfers, computations, and output transfers for blocks b0..b4, showing the prologue (first input transfer), processor idle time when transfers are not fully hidden, and the epilogue (last output transfer).] Performance can be further improved by an appropriate choice of data granularity.
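The benefit of the overlap is easy to quantify with a simplified model that counts one transfer per block: with m blocks, per-block transfer time T and computation time C, single buffering takes about m·(T + C), while double buffering in the computation regime takes about T + m·C (the prologue transfer plus the pipelined computations). A sketch under those assumptions:

```python
def single_time(m, T, C):
    """Single buffering: every block waits for its transfer, then computes."""
    return m * (T + C)

def double_time(m, T, C):
    """Double buffering: transfers hidden behind computation (computation regime,
    C >= T), except for the prologue transfer of the first block."""
    assert C >= T, "this simple model only holds in the computation regime"
    return T + m * C

m, T, C = 100, 3, 5          # arbitrary illustrative values, C > T
print(single_time(m, T, C))  # -> 800
print(double_time(m, T, C))  # -> 503
```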
Granularity of Transfers. For 1D data the granularity is the block size s: the array X[0..n−1] is split into m contiguous blocks of s elements. For 2D data it is the block shape (s1, s2): the n1 × n2 array X(i1, i2) is tiled into m1 × m2 blocks block(j1, j2) of s1 lines and s2 columns. Contribution: we derive the optimal granularity for 1D and 2D DMA transfers.
40 Context and Motivation Our Contribution: We derive optimal granularity for 1D and 2D DMA transfers, 1Dim data work was published in Hipeac 2012, S.Saidi, P.tendulkar, T.Lepley, O.Maler, Optimizing explicit data transfers for data parallel applicationson the Cell architecture 2D data work was published in DSD 2012, S.Saidi, P.tendulkar, T.Lepley, O.Maler, Optimal 2D Data Partitioning for DMA Transfers on MPSoCs extended version of the paper submitted to: Embedded Hardware Design: Microprocessors and Microsystems Journal. Selma Saidi Optimizing DMA Transfers 24 October
Problem Definition: Software Pipelining. We want to optimize the execution of the pipeline of reads R_i, computations C_i, and write-backs W_i over blocks 1..m, including its prologue and epilogue. Optimal granularity: which granularity choice optimizes performance?
Computation Regime and Transfer Regime. Let T and C be the transfer and computation time of one block. In the transfer regime (T > C), the processor sits idle between blocks waiting for transfers to complete. In the computation regime (C > T), transfers are hidden behind computation, and after the prologue the processor never waits.
In the computation regime, compare a granularity s with a coarser granularity s' > s: both hide the transfers, but the coarser blocks lengthen the prologue and epilogue, so within the computation regime the finer granularity performs better.
Optimal Granularity: Problem Formulation.
1D data (block size): find s minimizing T(s) subject to T(s) ≤ C(s), s ∈ [1..n], and s ≤ M.
2D data (block shape): find (s1, s2) minimizing T(s1, s2) subject to T(s1, s2) ≤ C(s1, s2), (s1, s2) ∈ [1..n1] × [1..n2], and s1 · s2 ≤ M.
Here M is the local memory capacity.
Optimal Granularity: 1D Data. Characterization of computation and transfer time, where s is the number of array elements clustered in one (contiguous) block:
Computation time: C(s) = ω · s, where ω is the time to compute one element.
DMA transfer time: T(s) = I + α · b · s, where I is the fixed DMA initialization cost, α the transfer cost per byte, and b the size in bytes of one array element.
Minimize T(s) subject to T(s) ≤ C(s), s ∈ [1..n], s ≤ M (the memory limitation). Plotting both costs against the block size s, C(s) is a line through the origin with slope ω, and T(s) a line with intercept I and slope α · b; block sizes below their crossing point lie in the transfer domain (T > C), and sizes above it in the computation domain (T < C). The optimal granularity s* is the crossing point, C(s*) = T(s*), provided it fits in local memory (s* ≤ M).
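With C(s) = ω·s and T(s) = I + α·b·s, the crossing point has the closed form s* = I / (ω − α·b), valid when ω > α·b (otherwise the computation regime is never reached). A sketch: I and α below are the values profiled later on the Cell, while ω, b, and M are illustrative assumptions.

```python
import math

def optimal_block_size(I, alpha, b, omega, M):
    """Smallest block size s with T(s) <= C(s), capped by the memory limit M (elements)."""
    slack = omega - alpha * b    # per-element compute time minus per-element transfer time
    if slack <= 0:
        return M                 # transfer-bound at any granularity: take the largest block
    s_star = I / slack           # solve I + alpha*b*s = omega*s
    return min(math.ceil(s_star), M)

# Illustrative values: I = 400 cycles, alpha = 0.22 cycles/byte (from the Cell
# measurements); b = 4 bytes/element, omega = 2.0 cycles/element, M = 4096 are assumed.
s = optimal_block_size(I=400, alpha=0.22, b=4, omega=2.0, M=4096)
print(s)  # -> 358
```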
Optimal Granularity: 2D Data. Characterization of computation and transfer time, where s1 · s2 is the number of array elements clustered in one block of the n1 × n2 array X(i1, i2). Computation time: C(s1, s2) = ω · s1 · s2, where ω is the time to compute one element.
A 2D block is fetched with a strided DMA transfer (a source address plus a stride between consecutive lines). Transfer time: T(s1, s2) = I + I1 · s1 + α · b · s1 · s2, where I1 is the transfer cost overhead per line. Strided DMA transfers are therefore costlier than contiguous transfers of the same size.
Influence of the Block Shape on the DMA Transfer Cost. Different block shapes with the same area have different DMA transfer times: for an area of 4 elements, (s1, s2) = (1, 4) is transferred as one line, (2, 2) as two lines, and (4, 1) as four lines, each line paying the overhead I1.
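Since T(s1, s2) = I + I1·s1 + α·b·s1·s2, equal-area blocks differ only in the per-line term I1·s1, so the flattest shape (fewest lines) is cheapest. A sketch with illustrative costs (I1 = 100 is an assumed value; I and α echo the Cell measurements, b is assumed):

```python
def transfer_time(s1, s2, I=400, I1=100, alpha=0.22, b=4):
    """Strided 2D DMA transfer cost: fixed init + per-line overhead + per-byte cost.
    The default parameter values are illustrative."""
    return I + I1 * s1 + alpha * b * s1 * s2

# Three shapes of the same area (4 elements): the per-line term grows with s1.
for shape in [(1, 4), (2, 2), (4, 1)]:
    print(shape, transfer_time(*shape))
# (1, 4) is the cheapest: a single line of per-line overhead.
```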
In the (s2, s1) plane of block shapes, the curve T = C separates the transfer domain (T > C, small blocks) from the computation domain (T < C, large blocks); its intercept on the s2 axis is at I1/ψ, where ψ denotes ω − α · b.
Problem formulation for 2D data: minimize T(s1, s2) subject to T(s1, s2) ≤ C(s1, s2), (s1, s2) ∈ [1..n1] × [1..n2], s1 · s2 ≤ M, where s1 is the block height and s2 the block width. For a fixed area, flattening the block helps: if s1' < s1 and s2' > s2 with s1' · s2' = s1 · s2, then T(s1', s2') < T(s1, s2).
The optimal granularity is the contiguous (single-line) block that just reaches the computation regime: block height s1* = 1 and block width s2* = (1/ψ)(I1 + I0), where I0 is the fixed DMA initialization cost I and ψ = ω − α · b, i.e. the point where T(1, s2) = C(1, s2).
Multiple Processors. Partitioning: the array of n elements is split into p contiguous chunks of n/p elements, one per processor P1..Pp. Processors are identical: same local store capacity, same double-buffering granularity, etc.
Pipelined execution with several processors: the processors issue their DMA requests concurrently, so the per-processor pipelines of input transfers, computations, and output transfers contend for the interconnect and the off-chip memory. The per-block transfer time becomes T(s, p) = I + α(p) · b · s, where α(p) is the transfer cost per byte given the contention of p concurrent transfer requests.
Multiple Processors: Optimal Granularity. Given p processors, the transfer line steepens from T1(s) to Tp(s), so its crossing with C(s) moves to the right, from s*(1) to s*(p); the same shift appears for two-dimensional block shapes, where the boundary T1 = C moves to Tp = C in the (s2, s1) plane. The optimal granularity therefore increases with the number of processors.
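Combining T(s, p) = I + α(p)·b·s with the linear contention model α(p) = p·α(1) measured later on the Cell, the crossing point becomes s*(p) = I / (ω − p·α(1)·b), which grows with p. A sketch (I and α(1) follow the Cell measurements; ω and b are illustrative assumptions):

```python
def optimal_granularity(p, I=400, alpha1=0.22, b=4, omega=8.0):
    """Optimal 1D block size when p processors issue DMA requests concurrently."""
    slack = omega - p * alpha1 * b   # contention raises the per-byte transfer cost
    assert slack > 0, "beyond this p the platform is transfer-bound at any granularity"
    return I / slack                 # solve I + p*alpha1*b*s = omega*s

sizes = [optimal_granularity(p) for p in (1, 2, 4, 8)]
print([round(s, 1) for s in sizes])  # -> [56.2, 64.1, 89.3, 416.7]
assert sizes == sorted(sizes)        # optimal granularity increases with p
```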
Summary: we derived the optimal granularity. The main idea is to balance the computation and transfer time of a block. For 2D data, the block shape influences the transfer time (through the per-line overhead I1); with multiple processors, the processor count influences the transfer time (through α(p)).
Local Memory Constraint. What if the optimal granularity does not fit in local memory? Either 1 take the largest block that fits in the available memory space (s = M, or a shape on the curve s1 · s2 = M in 2D), or 2 reduce the number of processors to p' < p, which lowers α(p) and hence the optimal block size to an s*(p') that fits.
Applications with Shared Data: 1D. A parallel loop with shared input data:
for i := 0 to n−1 do Y[i] := f(X[i], V[i]) od, with V[i] = {X[i−1], X[i−2], ..., X[i−k]}.
Neighboring blocks therefore share data: each block of s elements also needs the k elements adjacent to it in the neighboring block.
Shared Data: 2D. A parallel loop with shared input data:
for i1 := 0 to n1−1 do for i2 := 0 to n2−1 do Y[i1, i2] := f(X[i1, i2], V[i1, i2]) od od,
with V[i1, i2] = {X[i1−1, i2], X[i1, i2−1], ..., X[i1−k, i2]}: a symmetric window extending k/2 elements on each side.
Optimal Granularity: 1D Shared Data. We compare three strategies for transferring shared data: 1 Replication: the shared data is fetched redundantly via DMA transfers from the off-chip memory into each local memory. 2 Inter-processor communication: processors exchange the shared data via the network-on-chip between the cores. 3 Local buffering: the processors keep local copies of the shared elements. Based on a parametric study, we derive the optimal strategy and granularity for transferring shared data.
Optimal Granularity: 2D Shared Data. We consider replication for transferring shared data; let R be the size of the replicated data surrounding a block. The block shape influences R: with a window that widens each block dimension by k = 2, a flat block (s1, s2) = (1, 4) replicates R = 14 elements, while a square block (2, 2) replicates only R = 12. Comparing the transfer cost of a flat and a square block of equal area: the flat block pays more replicated-data overhead, while the square block pays more transfer-line overhead.
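These replication counts follow from a simple halo formula: a window that widens each block dimension by k makes the fetched region (s1 + k) × (s2 + k), so the replicated data is R = (s1 + k)(s2 + k) − s1·s2. A sketch (k = 2 reproduces the counts above):

```python
def replicated_data(s1, s2, k):
    """Elements fetched beyond the block itself when its halo is replicated."""
    return (s1 + k) * (s2 + k) - s1 * s2

print(replicated_data(1, 4, k=2))  # flat block   -> 14
print(replicated_data(2, 2, k=2))  # square block -> 12
```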
97 Contribution Shared Data Optimal Granularity: 2Dim Data
Problem Formulation: find (s1, s2) such that
  min T(s1 + k, s2 + k)
  s.t. T(s1 + k, s2 + k) ≥ C(s1, s2)
       (s1, s2) ∈ [1..n1] × [1..n2]
       (s1 + k) · (s2 + k) ≤ M
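Since the search space [1..n1] × [1..n2] is finite and small, the formulation can be solved by exhaustive search, as sketched below. The cost functions T and C are passed in as parameters; the constraint direction (T ≥ C, i.e. transfers dominating computation) is an assumption consistent with the double-buffering analysis, not a detail given on this slide.

```python
def optimal_block(n1, n2, k, M, T, C):
    """Minimize T(s1, s2) over [1..n1] x [1..n2], subject to
    T(s1, s2) >= C(s1, s2) and (s1 + k)(s2 + k) <= M."""
    best = None
    for s1 in range(1, n1 + 1):
        for s2 in range(1, n2 + 1):
            if (s1 + k) * (s2 + k) > M:
                continue  # replicated block must fit in local memory
            t = T(s1, s2)
            if t < C(s1, s2):
                continue  # transfer time must cover the computation
            if best is None or t < best[0]:
                best = (t, s1, s2)
    return best  # (cost, s1, s2), or None if no feasible shape exists
```

The closed-form analysis on the next slides characterizes where this optimum lies without enumerating the space.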
98 Contribution Shared Data Optimal Granularity: 2Dim Data
[Figure: the (s1, s2) search space, with axes s1 ∈ [1..n1] and s2 ∈ [1..n2], the curves T = C and T_k = C, and the diagonal s1 = s2.]
99 Contribution Shared Data Optimal Granularity: 2Dim Data
Optimal shape: between a square and a flat shape.
[Figure: the (s1, s2) search space with the curves T = C and T_k = C, the diagonal s1 = s2, and the optimal point s between them.]
[Equations: closed-form expressions for the optimal s1 and s2 in terms of I, ψ, c1, and D.]
100 Outline: Experiments on the Cell.BE
1 Context and Motivation
2 Contribution: Problem Definition; Optimal Granularity for a Single Processor (1Dim Data, 2Dim Data); Multiple Processors; Shared Data
3 Experiments on the Cell.BE
4 The move towards Platform 2012
Conclusions and Perspectives
101 Experiments on the Cell.BE: Overview of the Cell B.E. Architecture
[Figure: eight SPUs, each with an MFC and MMU, connected with the PPU, the L2 cache, the I/O interface, the XDR DRAM interface, and the coherent interface through the EIB.]
Platform characteristics:
- 9-core heterogeneous multi-core architecture, with a Power Processor Element (PPE) and 8 Synergistic Processing Elements (SPEs),
- Limited local store capacity per SPE: 256 Kbytes,
- Explicitly managed memory system, using DMAs.
102 Experiments on the Cell.BE: Measured DMA Latency
[Figure: measured transfer time (clock cycles) against super-block size s·b, for 1, 2, 4, and 8 SPUs.]
Profiled hardware parameters:
- DMA issue time: I ≈ 400 clock cycles,
- Off-chip memory transfer cost per byte, 1 processor: α(1) ≈ 0.22 clock cycles,
- Off-chip memory transfer cost per byte, p processors: α(p) ≈ p · α(1),
- Inter-processor communication transfer cost per byte: β ≈ 0.13 clock cycles.
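These profiled numbers plug directly into the DMA performance model used throughout the thesis. The values below are the Cell figures from this slide; the linear contention model α(p) = p · α(1) is the approximation stated above.

```python
I = 400          # DMA issue time in clock cycles (profiled on the Cell)
ALPHA_1 = 0.22   # off-chip transfer cost per byte, single processor
BETA = 0.13      # inter-processor (NoC) transfer cost per byte

def alpha(p):
    # Contention approximation: p processors issuing DMAs concurrently
    # see roughly p times the single-processor per-byte cost.
    return p * ALPHA_1

def dma_time(s, p):
    # Predicted latency of one DMA transfer of s bytes with p processors.
    return I + alpha(p) * s
```

For a 4 KB super-block on one SPU the model predicts roughly 1300 cycles; once s reaches a few kilobytes the fixed issue time I is amortized and the per-byte term dominates.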
103 Experiments on the Cell.BE: Optimal Granularity, 1Dim Data, No Sharing
[Figure: predicted and measured execution time (clock cycles) against super-block size s·b, for 2 and 8 SPUs.]
The predicted optimal granularities give good performance.
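The predicted curves come from combining the DMA latency model with a linear computation cost under double buffering. The sketch below is a simplified total-time model (an exposed first transfer, then one max(T, C) per block); the computation parameters o and w are assumed placeholders, not values taken from the slide.

```python
def total_time(n, s, I, a, o, w):
    # Process n bytes in super-blocks of s bytes with double buffering:
    # the first transfer is exposed, then each block costs max(T, C).
    blocks = -(-n // s)          # ceil(n / s)
    T = I + a * s                # per-block DMA transfer time
    C = o + w * s                # per-block computation time
    return T + blocks * max(T, C)

def best_granularity(n, I, a, o, w, M):
    # Exhaustive search for the super-block size minimizing total time,
    # under the local-memory bound s <= M.
    return min(range(1, M + 1), key=lambda s: total_time(n, s, I, a, o, w))
```

Very small blocks pay the issue time I once per block, which is why the measured curves fall steeply before flattening out near the predicted optimum.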
104 Experiments on the Cell.BE: Optimal Granularity, 1Dim Data, Sharing
Neighboring blocks b0..b5 share data; we compare the replication, inter-processor communication, and local buffering strategies.
[Figure: execution time (clock cycles) against super-block size s·b for the three strategies, on 2 and 8 SPUs.]
105 Experiments on the Cell.BE: Optimal Granularity, 2Dim Data, Sharing
We implement double buffering on a mean filtering algorithm; the predicted optimal granularities give good performance.
106 Outline: The move towards Platform 2012
1 Context and Motivation
2 Contribution: Problem Definition; Optimal Granularity for a Single Processor (1Dim Data, 2Dim Data); Multiple Processors; Shared Data
3 Experiments on the Cell.BE
4 The move towards Platform 2012
Conclusions and Perspectives
107 The move towards Platform 2012: P2012 Memory Hierarchy
1 Intra-cluster L1 memory (256 Kbytes),
2 Inter-cluster L2 memory,
3 Off-chip L3 memory.
108 The move towards Platform 2012: DMA Latency
We measure the DMA latency on P2012. DMA performance model: T(s, p) = I + α(p) · b · s, with I ≈ 240 cycles.
[Figure: measured α(p) against the number of processors p.]
110 The move towards Platform 2012: DMA Transfers in a Cluster
Shared local memory and shared DMA: two approaches for transferring data,
1 Liberal: processors fetch data independently,
2 Collaborative: processors fetch data collectively.
111 The move towards Platform 2012: Liberal Approach, Processors Fetch Data Independently
Each of the processors P0..P16 issues its own DMA transfers from off-chip memory to the cluster's local memory:
  T(s, p) = I + α(p) · b · s
  C(s, p) = o + ω · s
  s ≤ M/p
OpenCL kernel: work-item = processor; async work-item copy: DMA fetch for each processor.
112 The move towards Platform 2012: Collaborative Approach, Processors Fetch Data Collectively
The cluster issues a single DMA transfer from off-chip memory to local memory, and the block is processed collectively by P0..P16:
  T(s, p) = I + α(1) · b · s
  C(s, p) = o(p) + (ω/p) · s
  s ≤ M
OpenCL kernel: work-group = cluster; async_work_group_copy: DMA fetch for the whole cluster.
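The two approaches can be compared directly from their per-block steady-state costs under double buffering. The cost models are the ones given on these two slides (with the byte-size factor b folded into α); the collaborative synchronization overhead o(p) is left as a parameter o_p, since the slides do not quantify it.

```python
def liberal_time(s, p, I, alpha1, o, w, M):
    # Liberal: each of the p processors issues its own DMA, so off-chip
    # contention scales the per-byte cost; buffers shrink to M / p.
    if s > M // p:
        return None              # block exceeds the per-processor budget
    T = I + p * alpha1 * s       # T(s, p) = I + alpha(p) * s
    C = o + w * s                # C(s, p) = o + omega * s
    return max(T, C)             # steady-state cost per block

def collaborative_time(s, p, I, alpha1, o_p, w, M):
    # Collaborative: one DMA per cluster at single-issuer bandwidth; the
    # block's computation is split among the p processors.
    if s > M:
        return None
    T = I + alpha1 * s           # T(s, p) = I + alpha(1) * s
    C = o_p + (w / p) * s        # C(s, p) = o(p) + (omega / p) * s
    return max(T, C)
```

Under these models the collaborative approach avoids the p-fold contention on the off-chip per-byte cost, at the price of the cluster-wide synchronization overhead o(p).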
113 The move towards Platform 2012: Liberal Approach, Processors Fetch Data Independently
Increasing the number of processors reduces the maximum buffer size. Double-buffering implementation results: the optimal granularity does not fit in the available memory space.
114 The move towards Platform 2012: Liberal Approach, Processors Fetch Data Independently
Synchronization overhead. Double-buffering implementation results: performance degrades when the number of processors increases.
115 The move towards Platform 2012: Ongoing Work
- Find the right balance between the number of processors and the memory space budget,
- Compare the liberal and collaborative approaches.
116 Outline: Conclusions and Perspectives
1 Context and Motivation
2 Contribution: Problem Definition; Optimal Granularity for a Single Processor (1Dim Data, 2Dim Data); Multiple Processors; Shared Data
3 Experiments on the Cell.BE
4 The move towards Platform 2012
Conclusions and Perspectives
117 Conclusions and Perspectives: Conclusion
We presented a general methodology for automating decisions about the optimal granularity for data transfers. We capture the following facts:
1 Block shape and size influence the transfer/computation ratio,
2 DMA performance (sensitivity to block shapes and to the number of processors),
3 Tradeoff between strided-DMA overhead and the size of replicated data.
118 Conclusions and Perspectives: Perspectives
1 Consider other application patterns,
2 Capture variations (hardware and software),
3 Generalize the approach to more than 2 memory levels,
4 Integrate this work in a complete compilation flow,
5 Combine task and data parallelism,
6 ...
119 Conclusions and Perspectives: Thank You!!!
Introduction I/O 1 I/O devices can be characterized by Behavior: input, output, storage Partner: human or machine Data rate: bytes/sec, transfers/sec I/O bus connections I/O Device Summary I/O 2 I/O System
More informationComputer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13
Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,
More informationIn multiprogramming systems, processes share a common store. Processes need space for:
Memory Management In multiprogramming systems, processes share a common store. Processes need space for: code (instructions) static data (compiler initialized variables, strings, etc.) global data (global
More informationCUDA OPTIMIZATIONS ISC 2011 Tutorial
CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationLecture 7: Parallel Processing
Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction
More informationBandwidth Avoiding Stencil Computations
Bandwidth Avoiding Stencil Computations By Kaushik Datta, Sam Williams, Kathy Yelick, and Jim Demmel, and others Berkeley Benchmarking and Optimization Group UC Berkeley March 13, 2008 http://bebop.cs.berkeley.edu
More informationAddress Translation. Tore Larsen Material developed by: Kai Li, Princeton University
Address Translation Tore Larsen Material developed by: Kai Li, Princeton University Topics Virtual memory Virtualization Protection Address translation Base and bound Segmentation Paging Translation look-ahead
More information