Optimizing DMA Data Transfers for Embedded Multi-Cores
Optimizing DMA Data Transfers for Embedded Multi-Cores. Selma Saïdi. Jury members: Oded Maler (thesis advisor), Ahmed Bouajjani (jury president), Luca Benini (reviewer), Albert Cohen (reviewer), Eric Flamand (examiner), Bruno Jego (examiner). Selma Saidi, Optimizing DMA Transfers, 24 October
Context of the Thesis. A CIFRE Ph.D. with STMicroelectronics, supervised by Oded Maler (Verimag) and Bruno Jego, in collaboration with Thierry Lepley, STMicroelectronics. Part of the Minalogic project ATHOLE, a low-power multi-core platform for embedded systems; partners: ST, CEA, Thales, CWS, Verimag.
Outline
1 Context and Motivation
2 Contribution: Problem Definition; Optimal Granularity for a Single Processor (1D Data, 2D Data); Multiple Processors; Shared Data
3 Experiments on the Cell BE
4 The Move towards Platform 2012
Conclusions and Perspectives
Embedded Systems. There is an increasing demand for performance under low-power constraints: embedded devices must integrate ever more functionalities, while applications become more computationally intensive and power hungry.
The Emergence of Multicore Architectures. Running two processors on the same chip at half the speed consumes less energy while delivering comparable performance.
Embedded Multicore Architectures: Platform 2012, a manycore computation fabric of N clusters, where each cluster contains processing elements PE0..PE16, a DMA engine, and an L1 memory. Its main characteristic is explicitly managed memories. Features: 1 scratchpad memories (no caches); 2 a DMA engine, i.e. hardware for managing data transfers, driven by commands of the form (source address, destination address, block size); 3 data transfers are explicitly managed by the software. DMA advantages: (a) it transfers large data blocks efficiently, and (b) it can work in parallel with the processor.
The P2012 fabric acts as a general-purpose programmable accelerator, giving rise to heterogeneous multicore architectures. This is the class of architectures in which we are interested.
Heterogeneous Multi-core Architectures: a powerful host processor and a multi-core fabric (PE0..PEn with local memories) used to accelerate computationally heavy kernels, connected through an interconnect to an off-chip memory; the host runs tasks (T0, T1, T2, ...) and offloads kernels to the fabric.
Offloadable kernels work on large data sets, initially stored in a distant off-chip memory, e.g. a data-parallel loop: for i = 0 to n−1: Y[i] = f(X[i]). Off-chip memory latency is high, so reading and writing off-chip data is very costly. Data is therefore transferred to a closer but smaller on-chip memory in blocks, using DMAs (Direct Memory Access).
DMA Data Transfers: Single Buffering. Let s be the number of array elements in one block; the array X[0..n−1] is split into m = n/s blocks block_0 .. block_{m−1}.

i = 0
while (i < n/s):
    Fetch(block_i)          // dma_get(local_buffer, block_i, s)
    Compute(block_i)
    Write_back(block_i)     // dma_put(block_i, local_buffer, s)
    i = i + 1

Computations and data transfers execute sequentially.
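The single-buffering loop can be emulated in plain Python (a sketch: the slice copies stand in for the real dma_get/dma_put primitives, and f is a placeholder kernel):

```python
def single_buffering(X, f, s):
    """Process X block by block through a local buffer of s elements."""
    n = len(X)
    assert n % s == 0, "assume the block size divides the array length"
    Y = [None] * n
    local = [None] * s                  # local (scratchpad) buffer
    for i in range(n // s):
        local[:] = X[i*s:(i+1)*s]       # Fetch(block_i): dma_get emulation
        local = [f(x) for x in local]   # Compute(block_i)
        Y[i*s:(i+1)*s] = local          # Write_back(block_i): dma_put emulation
    return Y

# Example: square each element with blocks of 4
print(single_buffering(list(range(8)), lambda x: x * x, 4))
# -> [0, 1, 4, 9, 16, 25, 36, 49]
```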
DMA Data Transfers: Double Buffering, using asynchronous DMA calls and two local buffers:

Fetch(block_0)              // dma_get(local_buffer[1], block_0, s)
i = 0
while (i < (n/s) − 1):
    Fetch(block_{i+1})      // dma_get(local_buffer[2], block_{i+1}, s), asynchronous
    Compute(block_i)
    Write_back(block_i)
    i = i + 1
Compute(block_{(n/s)−1})
Write_back(block_{(n/s)−1})

Computations and data transfers overlap.
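The buffer alternation can be sketched in Python as follows (a functional emulation only: here the prefetch of block i+1 is an ordinary copy, whereas on the real hardware it is an asynchronous DMA that overlaps the computation of block i):

```python
def double_buffering(X, f, s):
    """Double-buffered blockwise processing: prefetch block i+1 while block i computes."""
    n = len(X)
    assert n % s == 0, "assume the block size divides the array length"
    m = n // s
    Y = [None] * n
    buf = [None, None]                  # the two alternating local buffers
    buf[0] = X[0:s]                     # Fetch(block_0) into buffer 0
    for i in range(m):
        if i + 1 < m:                   # asynchronous Fetch(block_{i+1}) on real HW
            buf[(i + 1) % 2] = X[(i+1)*s:(i+2)*s]
        out = [f(x) for x in buf[i % 2]]  # Compute(block_i) on the current buffer
        Y[i*s:(i+1)*s] = out            # Write_back(block_i)
    return Y

print(double_buffering(list(range(8)), lambda x: x + 1, 2))
# -> [1, 2, 3, 4, 5, 6, 7, 8]
```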
Double Buffering: Pipelined Execution. The computation of the current block overlaps with the transfer of the next block. [Figure: pipelined schedule of input transfers, computations, and output transfers for blocks b0..b4, showing the prologue (first input transfer), processor idle time when transfers are not fully hidden, and the epilogue (last output transfer).] Performance can be further improved by an appropriate choice of data granularity.
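The benefit of the overlap is easy to quantify with a simplified model that counts one transfer per block: with m blocks, per-block transfer time T and computation time C, single buffering takes about m·(T + C), while double buffering in the computation regime takes about T + m·C (the prologue transfer plus the pipelined computations). A sketch under those assumptions:

```python
def single_time(m, T, C):
    """Single buffering: every block waits for its transfer, then computes."""
    return m * (T + C)

def double_time(m, T, C):
    """Double buffering: transfers hidden behind computation (computation regime,
    C >= T), except for the prologue transfer of the first block."""
    assert C >= T, "this simple model only holds in the computation regime"
    return T + m * C

m, T, C = 100, 3, 5          # arbitrary illustrative values, C > T
print(single_time(m, T, C))  # -> 800
print(double_time(m, T, C))  # -> 503
```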
Granularity of Transfers. For 1D data the granularity is the block size s: the array X[0..n−1] is split into m contiguous blocks of s elements. For 2D data it is the block shape (s1, s2): the n1 × n2 array X(i1, i2) is tiled into m1 × m2 blocks block(j1, j2) of s1 lines and s2 columns. Contribution: we derive the optimal granularity for 1D and 2D DMA transfers.
40 Context and Motivation Our Contribution: We derive optimal granularity for 1D and 2D DMA transfers, 1Dim data work was published in Hipeac 2012, S.Saidi, P.tendulkar, T.Lepley, O.Maler, Optimizing explicit data transfers for data parallel applicationson the Cell architecture 2D data work was published in DSD 2012, S.Saidi, P.tendulkar, T.Lepley, O.Maler, Optimal 2D Data Partitioning for DMA Transfers on MPSoCs extended version of the paper submitted to: Embedded Hardware Design: Microprocessors and Microsystems Journal. Selma Saidi Optimizing DMA Transfers 24 October
Problem Definition: Software Pipelining. We want to optimize the execution of the pipeline of reads R_i, computations C_i, and write-backs W_i over blocks 1..m, including its prologue and epilogue. Optimal granularity: which granularity choice optimizes performance?
Computation Regime and Transfer Regime. Let T and C be the transfer and computation time of one block. In the transfer regime (T > C), the processor sits idle between blocks waiting for transfers to complete. In the computation regime (C > T), transfers are hidden behind computation, and after the prologue the processor never waits.
In the computation regime, compare a granularity s with a coarser granularity s' > s: both hide the transfers, but the coarser blocks lengthen the prologue and epilogue, so within the computation regime the finer granularity performs better.
Optimal Granularity: Problem Formulation.
1D data (block size): find s minimizing T(s) subject to T(s) ≤ C(s), s ∈ [1..n], and s ≤ M.
2D data (block shape): find (s1, s2) minimizing T(s1, s2) subject to T(s1, s2) ≤ C(s1, s2), (s1, s2) ∈ [1..n1] × [1..n2], and s1 · s2 ≤ M.
Here M is the local memory capacity.
Optimal Granularity: 1D Data. Characterization of computation and transfer time, where s is the number of array elements clustered in one (contiguous) block:
Computation time: C(s) = ω · s, where ω is the time to compute one element.
DMA transfer time: T(s) = I + α · b · s, where I is the fixed DMA initialization cost, α the transfer cost per byte, and b the size in bytes of one array element.
Minimize T(s) subject to T(s) ≤ C(s), s ∈ [1..n], s ≤ M (the memory limitation). Plotting both costs against the block size s, C(s) is a line through the origin with slope ω, and T(s) a line with intercept I and slope α · b; block sizes below their crossing point lie in the transfer domain (T > C), and sizes above it in the computation domain (T < C). The optimal granularity s* is the crossing point, C(s*) = T(s*), provided it fits in local memory (s* ≤ M).
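With C(s) = ω·s and T(s) = I + α·b·s, the crossing point has the closed form s* = I / (ω − α·b), valid when ω > α·b (otherwise the computation regime is never reached). A sketch: I and α below are the values profiled later on the Cell, while ω, b, and M are illustrative assumptions.

```python
import math

def optimal_block_size(I, alpha, b, omega, M):
    """Smallest block size s with T(s) <= C(s), capped by the memory limit M (elements)."""
    slack = omega - alpha * b    # per-element compute time minus per-element transfer time
    if slack <= 0:
        return M                 # transfer-bound at any granularity: take the largest block
    s_star = I / slack           # solve I + alpha*b*s = omega*s
    return min(math.ceil(s_star), M)

# Illustrative values: I = 400 cycles, alpha = 0.22 cycles/byte (from the Cell
# measurements); b = 4 bytes/element, omega = 2.0 cycles/element, M = 4096 are assumed.
s = optimal_block_size(I=400, alpha=0.22, b=4, omega=2.0, M=4096)
print(s)  # -> 358
```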
Optimal Granularity: 2D Data. Characterization of computation and transfer time, where s1 · s2 is the number of array elements clustered in one block of the n1 × n2 array X(i1, i2). Computation time: C(s1, s2) = ω · s1 · s2, where ω is the time to compute one element.
A 2D block is fetched with a strided DMA transfer (a source address plus a stride between consecutive lines). Transfer time: T(s1, s2) = I + I1 · s1 + α · b · s1 · s2, where I1 is the transfer cost overhead per line. Strided DMA transfers are therefore costlier than contiguous transfers of the same size.
Influence of the Block Shape on the DMA Transfer Cost. Different block shapes with the same area have different DMA transfer times: for an area of 4 elements, (s1, s2) = (1, 4) is transferred as one line, (2, 2) as two lines, and (4, 1) as four lines, each line paying the overhead I1.
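Since T(s1, s2) = I + I1·s1 + α·b·s1·s2, equal-area blocks differ only in the per-line term I1·s1, so the flattest shape (fewest lines) is cheapest. A sketch with illustrative costs (I1 = 100 is an assumed value; I and α echo the Cell measurements, b is assumed):

```python
def transfer_time(s1, s2, I=400, I1=100, alpha=0.22, b=4):
    """Strided 2D DMA transfer cost: fixed init + per-line overhead + per-byte cost.
    The default parameter values are illustrative."""
    return I + I1 * s1 + alpha * b * s1 * s2

# Three shapes of the same area (4 elements): the per-line term grows with s1.
for shape in [(1, 4), (2, 2), (4, 1)]:
    print(shape, transfer_time(*shape))
# (1, 4) is the cheapest: a single line of per-line overhead.
```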
In the (s2, s1) plane of block shapes, the curve T = C separates the transfer domain (T > C, small blocks) from the computation domain (T < C, large blocks); its intercept on the s2 axis is at I1/ψ, where ψ denotes ω − α · b.
Problem formulation for 2D data: minimize T(s1, s2) subject to T(s1, s2) ≤ C(s1, s2), (s1, s2) ∈ [1..n1] × [1..n2], s1 · s2 ≤ M, where s1 is the block height and s2 the block width. For a fixed area, flattening the block helps: if s1' < s1 and s2' > s2 with s1' · s2' = s1 · s2, then T(s1', s2') < T(s1, s2).
The optimal granularity is the contiguous (single-line) block that just reaches the computation regime: block height s1* = 1 and block width s2* = (1/ψ)(I1 + I0), where I0 is the fixed DMA initialization cost I and ψ = ω − α · b, i.e. the point where T(1, s2) = C(1, s2).
Multiple Processors. Partitioning: the array of n elements is split into p contiguous chunks of n/p elements, one per processor P1..Pp. Processors are identical: same local store capacity, same double-buffering granularity, etc.
Pipelined execution with several processors: the processors issue their DMA requests concurrently, so the per-processor pipelines of input transfers, computations, and output transfers contend for the interconnect and the off-chip memory. The per-block transfer time becomes T(s, p) = I + α(p) · b · s, where α(p) is the transfer cost per byte given the contention of p concurrent transfer requests.
Multiple Processors: Optimal Granularity. Given p processors, the transfer line steepens from T1(s) to Tp(s), so its crossing with C(s) moves to the right, from s*(1) to s*(p); the same shift appears for two-dimensional block shapes, where the boundary T1 = C moves to Tp = C in the (s2, s1) plane. The optimal granularity therefore increases with the number of processors.
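Combining T(s, p) = I + α(p)·b·s with the linear contention model α(p) = p·α(1) measured later on the Cell, the crossing point becomes s*(p) = I / (ω − p·α(1)·b), which grows with p. A sketch (I and α(1) follow the Cell measurements; ω and b are illustrative assumptions):

```python
def optimal_granularity(p, I=400, alpha1=0.22, b=4, omega=8.0):
    """Optimal 1D block size when p processors issue DMA requests concurrently."""
    slack = omega - p * alpha1 * b   # contention raises the per-byte transfer cost
    assert slack > 0, "beyond this p the platform is transfer-bound at any granularity"
    return I / slack                 # solve I + p*alpha1*b*s = omega*s

sizes = [optimal_granularity(p) for p in (1, 2, 4, 8)]
print([round(s, 1) for s in sizes])  # -> [56.2, 64.1, 89.3, 416.7]
assert sizes == sorted(sizes)        # optimal granularity increases with p
```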
Summary: we derived the optimal granularity. The main idea is to balance the computation and transfer time of a block. For 2D data, the block shape influences the transfer time (through the per-line overhead I1); with multiple processors, the processor count influences the transfer time (through α(p)).
Local Memory Constraint. What if the optimal granularity does not fit in local memory? Either 1 take the largest block that fits in the available memory space (s = M, or a shape on the curve s1 · s2 = M in 2D), or 2 reduce the number of processors to p' < p, which lowers α(p) and hence the optimal block size to an s*(p') that fits.
Applications with Shared Data: 1D. A parallel loop with shared input data:
for i := 0 to n−1 do Y[i] := f(X[i], V[i]) od, with V[i] = {X[i−1], X[i−2], ..., X[i−k]}.
Neighboring blocks therefore share data: each block of s elements also needs the k elements adjacent to it in the neighboring block.
Shared Data: 2D. A parallel loop with shared input data:
for i1 := 0 to n1−1 do for i2 := 0 to n2−1 do Y[i1, i2] := f(X[i1, i2], V[i1, i2]) od od,
with V[i1, i2] = {X[i1−1, i2], X[i1, i2−1], ..., X[i1−k, i2]}: a symmetric window extending k/2 elements on each side.
Optimal Granularity: 1D Shared Data. We compare three strategies for transferring shared data: 1 Replication: the shared data is fetched redundantly via DMA transfers from the off-chip memory into each local memory. 2 Inter-processor communication: processors exchange the shared data via the network-on-chip between the cores. 3 Local buffering: the processors keep local copies of the shared elements. Based on a parametric study, we derive the optimal strategy and granularity for transferring shared data.
Optimal Granularity: 2D Shared Data. We consider replication for transferring shared data; let R be the size of the replicated data surrounding a block. The block shape influences R: with a window that widens each block dimension by k = 2, a flat block (s1, s2) = (1, 4) replicates R = 14 elements, while a square block (2, 2) replicates only R = 12. Comparing the transfer cost of a flat and a square block of equal area: the flat block pays more replicated-data overhead, while the square block pays more transfer-line overhead.
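These replication counts follow from a simple halo formula: a window that widens each block dimension by k makes the fetched region (s1 + k) × (s2 + k), so the replicated data is R = (s1 + k)(s2 + k) − s1·s2. A sketch (k = 2 reproduces the counts above):

```python
def replicated_data(s1, s2, k):
    """Elements fetched beyond the block itself when its halo is replicated."""
    return (s1 + k) * (s2 + k) - s1 * s2

print(replicated_data(1, 4, k=2))  # flat block   -> 14
print(replicated_data(2, 2, k=2))  # square block -> 12
```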
97 Contribution Shared Data Optimal Granularity: 2Dim Data
Problem Formulation: find (s1, s2) such that
  min T(s1 + k, s2 + k)
  s.t. T(s1 + k, s2 + k) ≥ C(s1, s2)
       (s1, s2) ∈ [1..n1] × [1..n2]
       (s1 + k) · (s2 + k) ≤ M
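Since the search space [1..n1] × [1..n2] is finite and small, the formulation can be solved by exhaustive search, as sketched below. The cost functions T and C are passed in as parameters; the constraint direction (T ≥ C, i.e. transfers dominating computation) is an assumption consistent with the double-buffering analysis, not a detail given on this slide.

```python
def optimal_block(n1, n2, k, M, T, C):
    """Minimize T(s1, s2) over [1..n1] x [1..n2], subject to
    T(s1, s2) >= C(s1, s2) and (s1 + k)(s2 + k) <= M."""
    best = None
    for s1 in range(1, n1 + 1):
        for s2 in range(1, n2 + 1):
            if (s1 + k) * (s2 + k) > M:
                continue  # replicated block must fit in local memory
            t = T(s1, s2)
            if t < C(s1, s2):
                continue  # transfer time must cover the computation
            if best is None or t < best[0]:
                best = (t, s1, s2)
    return best  # (cost, s1, s2), or None if no feasible shape exists
```

The closed-form analysis on the next slides characterizes where this optimum lies without enumerating the space.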
98 Contribution Shared Data Optimal Granularity: 2Dim Data
[Figure: the (s1, s2) search space, with axes s1 ∈ [1..n1] and s2 ∈ [1..n2], the curves T = C and T_k = C, and the diagonal s1 = s2.]
99 Contribution Shared Data Optimal Granularity: 2Dim Data
Optimal shape: between a square and a flat shape.
[Figure: the (s1, s2) search space with the curves T = C and T_k = C, the diagonal s1 = s2, and the optimal point s between them.]
[Equations: closed-form expressions for the optimal s1 and s2 in terms of I, ψ, c1, and D.]
100 Outline: Experiments on the Cell.BE
1 Context and Motivation
2 Contribution: Problem Definition; Optimal Granularity for a Single Processor (1Dim Data, 2Dim Data); Multiple Processors; Shared Data
3 Experiments on the Cell.BE
4 The move towards Platform 2012
Conclusions and Perspectives
101 Experiments on the Cell.BE: Overview of the Cell B.E. Architecture
[Figure: eight SPUs, each with an MFC and MMU, connected with the PPU, the L2 cache, the I/O interface, the XDR DRAM interface, and the coherent interface through the EIB.]
Platform characteristics:
- 9-core heterogeneous multi-core architecture, with a Power Processor Element (PPE) and 8 Synergistic Processing Elements (SPEs),
- Limited local store capacity per SPE: 256 Kbytes,
- Explicitly managed memory system, using DMAs.
102 Experiments on the Cell.BE: Measured DMA Latency
[Figure: measured transfer time (clock cycles) against super-block size s·b, for 1, 2, 4, and 8 SPUs.]
Profiled hardware parameters:
- DMA issue time: I ≈ 400 clock cycles,
- Off-chip memory transfer cost per byte, 1 processor: α(1) ≈ 0.22 clock cycles,
- Off-chip memory transfer cost per byte, p processors: α(p) ≈ p · α(1),
- Inter-processor communication transfer cost per byte: β ≈ 0.13 clock cycles.
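These profiled numbers plug directly into the DMA performance model used throughout the thesis. The values below are the Cell figures from this slide; the linear contention model α(p) = p · α(1) is the approximation stated above.

```python
I = 400          # DMA issue time in clock cycles (profiled on the Cell)
ALPHA_1 = 0.22   # off-chip transfer cost per byte, single processor
BETA = 0.13      # inter-processor (NoC) transfer cost per byte

def alpha(p):
    # Contention approximation: p processors issuing DMAs concurrently
    # see roughly p times the single-processor per-byte cost.
    return p * ALPHA_1

def dma_time(s, p):
    # Predicted latency of one DMA transfer of s bytes with p processors.
    return I + alpha(p) * s
```

For a 4 KB super-block on one SPU the model predicts roughly 1300 cycles; once s reaches a few kilobytes the fixed issue time I is amortized and the per-byte term dominates.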
103 Experiments on the Cell.BE: Optimal Granularity, 1Dim Data, No Sharing
[Figure: predicted and measured execution time (clock cycles) against super-block size s·b, for 2 and 8 SPUs.]
The predicted optimal granularities give good performance.
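The predicted curves come from combining the DMA latency model with a linear computation cost under double buffering. The sketch below is a simplified total-time model (an exposed first transfer, then one max(T, C) per block); the computation parameters o and w are assumed placeholders, not values taken from the slide.

```python
def total_time(n, s, I, a, o, w):
    # Process n bytes in super-blocks of s bytes with double buffering:
    # the first transfer is exposed, then each block costs max(T, C).
    blocks = -(-n // s)          # ceil(n / s)
    T = I + a * s                # per-block DMA transfer time
    C = o + w * s                # per-block computation time
    return T + blocks * max(T, C)

def best_granularity(n, I, a, o, w, M):
    # Exhaustive search for the super-block size minimizing total time,
    # under the local-memory bound s <= M.
    return min(range(1, M + 1), key=lambda s: total_time(n, s, I, a, o, w))
```

Very small blocks pay the issue time I once per block, which is why the measured curves fall steeply before flattening out near the predicted optimum.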
104 Experiments on the Cell.BE: Optimal Granularity, 1Dim Data, Sharing
Neighboring blocks b0..b5 share data; we compare the replication, inter-processor communication, and local buffering strategies.
[Figure: execution time (clock cycles) against super-block size s·b for the three strategies, on 2 and 8 SPUs.]
105 Experiments on the Cell.BE: Optimal Granularity, 2Dim Data, Sharing
We implement double buffering on a mean filtering algorithm; the predicted optimal granularities give good performance.
106 Outline: The move towards Platform 2012
1 Context and Motivation
2 Contribution: Problem Definition; Optimal Granularity for a Single Processor (1Dim Data, 2Dim Data); Multiple Processors; Shared Data
3 Experiments on the Cell.BE
4 The move towards Platform 2012
Conclusions and Perspectives
107 The move towards Platform 2012: P2012 Memory Hierarchy
1 Intra-cluster L1 memory (256 Kbytes),
2 Inter-cluster L2 memory,
3 Off-chip L3 memory.
108 The move towards Platform 2012: DMA Latency
We measure the DMA latency on P2012. DMA performance model: T(s, p) = I + α(p) · b · s, with I ≈ 240 cycles.
[Figure: measured α(p) against the number of processors p.]
110 The move towards Platform 2012: DMA Transfers in a Cluster
Shared local memory and shared DMA: two approaches for transferring data,
1 Liberal: processors fetch data independently,
2 Collaborative: processors fetch data collectively.
111 The move towards Platform 2012: Liberal Approach, Processors Fetch Data Independently
Each of the processors P0..P16 issues its own DMA transfers from off-chip memory to the cluster's local memory:
  T(s, p) = I + α(p) · b · s
  C(s, p) = o + ω · s
  s ≤ M/p
OpenCL kernel: work-item = processor; async work-item copy: DMA fetch for each processor.
112 The move towards Platform 2012: Collaborative Approach, Processors Fetch Data Collectively
The cluster issues a single DMA transfer from off-chip memory to local memory, and the block is processed collectively by P0..P16:
  T(s, p) = I + α(1) · b · s
  C(s, p) = o(p) + (ω/p) · s
  s ≤ M
OpenCL kernel: work-group = cluster; async_work_group_copy: DMA fetch for the whole cluster.
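The two approaches can be compared directly from their per-block steady-state costs under double buffering. The cost models are the ones given on these two slides (with the byte-size factor b folded into α); the collaborative synchronization overhead o(p) is left as a parameter o_p, since the slides do not quantify it.

```python
def liberal_time(s, p, I, alpha1, o, w, M):
    # Liberal: each of the p processors issues its own DMA, so off-chip
    # contention scales the per-byte cost; buffers shrink to M / p.
    if s > M // p:
        return None              # block exceeds the per-processor budget
    T = I + p * alpha1 * s       # T(s, p) = I + alpha(p) * s
    C = o + w * s                # C(s, p) = o + omega * s
    return max(T, C)             # steady-state cost per block

def collaborative_time(s, p, I, alpha1, o_p, w, M):
    # Collaborative: one DMA per cluster at single-issuer bandwidth; the
    # block's computation is split among the p processors.
    if s > M:
        return None
    T = I + alpha1 * s           # T(s, p) = I + alpha(1) * s
    C = o_p + (w / p) * s        # C(s, p) = o(p) + (omega / p) * s
    return max(T, C)
```

Under these models the collaborative approach avoids the p-fold contention on the off-chip per-byte cost, at the price of the cluster-wide synchronization overhead o(p).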
113 The move towards Platform 2012: Liberal Approach, Processors Fetch Data Independently
Increasing the number of processors reduces the maximum buffer size. Double-buffering implementation results: the optimal granularity does not fit in the available memory space.
114 The move towards Platform 2012: Liberal Approach, Processors Fetch Data Independently
Synchronization overhead. Double-buffering implementation results: performance degrades when the number of processors increases.
115 The move towards Platform 2012: Ongoing Work
- Find the right balance between the number of processors and the memory space budget,
- Compare the liberal and collaborative approaches.
116 Outline: Conclusions and Perspectives
1 Context and Motivation
2 Contribution: Problem Definition; Optimal Granularity for a Single Processor (1Dim Data, 2Dim Data); Multiple Processors; Shared Data
3 Experiments on the Cell.BE
4 The move towards Platform 2012
Conclusions and Perspectives
117 Conclusions and Perspectives: Conclusion
We presented a general methodology for automating decisions about the optimal granularity for data transfers. We capture the following facts:
1 Block shape and size influence the transfer/computation ratio,
2 DMA performance (sensitivity to block shapes and to the number of processors),
3 Tradeoff between strided-DMA overhead and the size of replicated data.
118 Conclusions and Perspectives: Perspectives
1 Consider other application patterns,
2 Capture variations (hardware and software),
3 Generalize the approach to more than 2 memory levels,
4 Integrate this work in a complete compilation flow,
5 Combine task and data parallelism,
6 ...
119 Conclusions and Perspectives: Thank You!!!
Introduction I/O 1 I/O devices can be characterized by Behavior: input, output, storage Partner: human or machine Data rate: bytes/sec, transfers/sec I/O bus connections I/O Device Summary I/O 2 I/O System
More informationComputer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13
Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,
More informationIn multiprogramming systems, processes share a common store. Processes need space for:
Memory Management In multiprogramming systems, processes share a common store. Processes need space for: code (instructions) static data (compiler initialized variables, strings, etc.) global data (global
More informationCUDA OPTIMIZATIONS ISC 2011 Tutorial
CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationLecture 7: Parallel Processing
Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction
More informationBandwidth Avoiding Stencil Computations
Bandwidth Avoiding Stencil Computations By Kaushik Datta, Sam Williams, Kathy Yelick, and Jim Demmel, and others Berkeley Benchmarking and Optimization Group UC Berkeley March 13, 2008 http://bebop.cs.berkeley.edu
More informationAddress Translation. Tore Larsen Material developed by: Kai Li, Princeton University
Address Translation Tore Larsen Material developed by: Kai Li, Princeton University Topics Virtual memory Virtualization Protection Address translation Base and bound Segmentation Paging Translation look-ahead
More information