Optimizing DMA Data Transfers for Embedded Multi-Cores


1 Optimizing DMA Data Transfers for Embedded Multi-Cores. Selma Saïdi. Jury members: Oded Maler (thesis director), Ahmed Bouajjani (jury president), Luca Benini (reviewer), Albert Cohen (reviewer), Eric Flamand (examiner), Bruno Jego (examiner). 24 October.

2 Context of the Thesis. A CIFRE Ph.D with STMicroelectronics, supervised by Oded Maler (Verimag) and Bruno Jego, in collaboration with Thierry Lepley (STMicroelectronics). Part of the Minalogic project ATHOLE: a low-power multi-core platform for embedded systems. Partners: ST, CEA, Thales, CWS, Verimag.

3 Outline. 1 Context and Motivation. 2 Contribution: Problem Definition; Optimal Granularity for a Single Processor (1Dim Data, 2Dim Data); Multiple Processors; Shared Data. 3 Experiments on the Cell.BE. 4 The Move towards Platform 2012. 5 Conclusions and Perspectives.

4 Embedded Systems. There is an increasing demand for performance under low-power constraints: embedded devices must integrate more functionalities, and applications are becoming more computationally intensive and power hungry.

5 The Emergence of Multicore Architectures. Running two processors on the same chip at half the clock speed consumes less energy while delivering comparable performance.

6-11 Embedded Multicore Architectures: Platform 2012, a manycore computation fabric. Its main characteristic is explicitly managed memories. Features: 1 scratchpad memories (no caches); 2 a DMA engine, i.e. dedicated hardware for managing data transfers, driven by commands of the form (src@, dst@, block size). [Figure: the P2012 fabric, clusters Cluster0 ... ClusterN; each cluster contains processors PE0 ... PE16, a DMA engine, and an L1 memory.]
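A DMA command, as the slide describes it, carries just a source address, a destination address, and a size. A minimal C sketch of such a descriptor (field and type names are illustrative assumptions, not the P2012 API):

    #include <stdint.h>
    #include <stddef.h>

    /* Minimal DMA command descriptor: (src@, dst@, block size).
     * Field names are hypothetical; real engines add flags, strides, etc. */
    typedef struct {
        uintptr_t src;  /* source address, e.g. in off-chip memory     */
        uintptr_t dst;  /* destination address, e.g. in L1 scratchpad  */
        size_t    size; /* block size in bytes                         */
    } dma_command_t;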

12-13 Feature 3: data transfers are explicitly managed by the software. DMA advantages: (a) it transfers large data blocks efficiently; (b) it can work in parallel with the processor.

14-15 Embedded Multicore Architectures: P2012 acts as a general-purpose programmable accelerator, yielding heterogeneous multicore architectures. This is the class of architectures in which we are interested!

16-17 Heterogeneous Multi-core Architectures: a powerful host processor and a multi-core fabric that accelerates computationally heavy kernels. [Figure: a host CPU and a multi-core fabric of processors PE0 ... PEn, each with a local memory, connected through an interconnect to an off-chip memory; tasks T0, T1, T2 are offloaded to the fabric.]

18 Offloadable kernels work on large data sets, initially stored in a distant off-chip memory. Algorithm: for i = 0 to n-1 do Y[i] = f(X[i]) od. [Figure: arrays X and Y reside in off-chip memory.]

19 High off-chip memory latency: accessing off-chip data is very costly; every element of X is read from, and every element of Y written to, the off-chip memory.

20 Data is therefore transferred in blocks to a closer but smaller on-chip memory, using DMA (Direct Memory Access) engines. [Figure: blocks block0, block1, ... of X are moved by the DMA from off-chip memory into local memory.]

21-26 DMA Data Transfers: Single Buffering. s: number of array elements in one block. [Figure: array X[0] ... X[n-1] of size n partitioned into blocks block0 ... block_{m-1} of s elements each.]

    i = 0
    while (i < n/s)
        Fetch(block_i)        // dma_get(local-buffer, block_i, s)
        Compute(block_i)
        Writeback(block_i)    // dma_put(block_i, local-buffer, s)
        i++

Sequential execution of computations and data transfers.
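A compilable C sketch of this single-buffered loop. The DMA primitives are stand-ins modeled with memcpy (on a real platform they would issue blocking DMA commands), and the element type, block size, and computation are assumptions:

    #include <stddef.h>
    #include <string.h>

    #define S 256                     /* block size in elements (assumed) */

    /* Blocking DMA stand-ins: one memcpy models one DMA block transfer. */
    static void dma_get(void *local, const void *remote, size_t bytes)
    { memcpy(local, remote, bytes); }
    static void dma_put(void *remote, const void *local, size_t bytes)
    { memcpy(remote, local, bytes); }

    static void compute(float *buf, size_t s)     /* placeholder for f */
    { for (size_t j = 0; j < s; j++) buf[j] = 2.0f * buf[j] + 1.0f; }

    /* Single buffering: the processor idles during each fetch/write-back. */
    void process_single_buffered(const float *X, float *Y, size_t n)
    {
        float local_buf[S];
        for (size_t i = 0; i < n / S; i++) {
            dma_get(local_buf, &X[i * S], S * sizeof(float)); /* Fetch     */
            compute(local_buf, S);                            /* Compute   */
            dma_put(&Y[i * S], local_buf, S * sizeof(float)); /* Writeback */
        }
    }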

27-30 DMA Data Transfers: Double Buffering. Asynchronous DMA calls:

    Fetch(block_0)                   // dma_get(local-buffer[1], block_0, s)
    i = 0
    while (i < (n/s) - 1)
        Fetch(block_{i+1})           // dma_get(local-buffer[2], block_{i+1}, s)
        Compute(block_i)
        Writeback(block_i)
        i++
    Compute(block_{(n/s)-1})
    Writeback(block_{(n/s)-1})

Overlap of computations and data transfers.
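The same scheme as a runnable C sketch. The asynchronous primitives are hypothetical names, modeled here synchronously with memcpy so the overlap is only illustrative; on hardware such as the Cell, tag-based waits play this role:

    #include <stddef.h>
    #include <string.h>

    #define S 256                     /* block size in elements (assumed) */

    /* Hypothetical async DMA: start a tagged transfer / wait on a tag. */
    static void dma_get_async(void *d, const void *s, size_t n, int tag)
    { (void)tag; memcpy(d, s, n); }
    static void dma_put_async(void *d, const void *s, size_t n, int tag)
    { (void)tag; memcpy(d, s, n); }
    static void dma_wait(int tag) { (void)tag; } /* no-op in this model */

    static void compute(float *buf, size_t s)
    { for (size_t j = 0; j < s; j++) buf[j] = 2.0f * buf[j] + 1.0f; }

    /* Double buffering: compute block i in one buffer while block i+1
     * is fetched into the other; tag t covers all transfers on buf[t]. */
    void process_double_buffered(const float *X, float *Y, size_t n)
    {
        static float buf[2][S];
        size_t m = n / S;

        dma_get_async(buf[0], &X[0], S * sizeof(float), 0);   /* prologue */
        for (size_t i = 0; i < m; i++) {
            int cur = (int)(i & 1), nxt = cur ^ 1;
            if (i + 1 < m) {
                dma_wait(nxt); /* buffer free: its write-back completed */
                dma_get_async(buf[nxt], &X[(i + 1) * S],
                              S * sizeof(float), nxt);
            }
            dma_wait(cur);     /* input of block i has arrived */
            compute(buf[cur], S);
            dma_put_async(&Y[i * S], buf[cur], S * sizeof(float), cur);
        }
        dma_wait(0); dma_wait(1);  /* epilogue: drain pending write-backs */
    }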

31-36 Double Buffering Pipelined Execution. Overlap of the computation of the current block with the transfer of the next block. [Figure: pipelined timeline of input transfers, computations, and output transfers for blocks b0 ... b4, showing the prologue, the epilogue, and processor idle time.] Performance can be further improved by an appropriate choice of data granularity.

37-39 Granularity of Transfers. 1Dim data: block size s. [Figure: array X[0] ... X[n-1] partitioned into blocks block0 ... block_{m-1} of s elements.] 2Dim data: block shape (s1, s2). [Figure: array X(i1, i2) partitioned into an m1 × m2 grid of blocks block(j1, j2), each of s1 × s2 elements.] Contribution: we derive the optimal granularity for 1D and 2D DMA transfers.

40 Our Contribution: we derive the optimal granularity for 1D and 2D DMA transfers. The 1Dim data work was published at HiPEAC 2012: S. Saidi, P. Tendulkar, T. Lepley, O. Maler, "Optimizing Explicit Data Transfers for Data Parallel Applications on the Cell Architecture". The 2Dim data work was published at DSD 2012: S. Saidi, P. Tendulkar, T. Lepley, O. Maler, "Optimal 2D Data Partitioning for DMA Transfers on MPSoCs"; an extended version was submitted to the journal Microprocessors and Microsystems: Embedded Hardware Design.

41-43 Outline (transition to section 2, Contribution: Problem Definition).

44-45 Software Pipelining. We want to optimize the execution of the pipeline. [Figure: pipelined schedule of block reads R1 ... Rm, computations C1 ... Cm, and write-backs W1 ... Wm, with a prologue and an epilogue.] Optimal granularity: which granularity choice optimizes performance?

46-47 Computation Regime and Transfer Regime. T and C: transfer and computation time of one block. Transfer regime, T > C: the processor idles between consecutive blocks. [Figure: pipelined schedule for blocks b0 ... b8 with processor idle time between computations.] Computation regime, C > T: transfers are hidden behind computations. [Figure: pipelined schedule for blocks b0 ... b2 with back-to-back computations.]
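Writing this down explicitly (a sketch, assuming m blocks, per-block transfer time T for input and for output, computation time C, and transfers that proceed in parallel with computation), the double-buffered makespan behaves roughly as:

    τ_total ≈ T (prologue) + m · max(T, C) + T (epilogue)

so in the computation regime (C > T) the pipeline runs at the speed of the computations and the transfers are hidden, while in the transfer regime (T > C) it runs at the speed of the DMA.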

48 In the computation regime, two granularities compared. [Figure: pipelined schedule with granularity s (blocks b0, b1, b2) versus a larger granularity s' > s (blocks b0, b1); larger blocks mean fewer pipeline stages but a longer prologue and epilogue.]

49-50 Optimal Granularity: Problem Formulation. 1Dim data (block size): find s* such that: min T(s) s.t. T(s) ≤ C(s), s ∈ [1..n], s ≤ M. 2Dim data (block shape): find (s1*, s2*) such that: min T(s1, s2) s.t. T(s1, s2) ≤ C(s1, s2), (s1, s2) ∈ [1..n1] × [1..n2], s1·s2 ≤ M.

51-52 Outline (transition to Contribution: Optimal Granularity for a Single Processor, 1Dim Data).

53-60 Optimal Granularity: 1Dim Data. Characterization of computation and transfer time. s: number of array elements clustered in one (contiguous) block; b: size of one array element in bytes. Computation time: with ω the time to compute one element, C(s) = ω·s. DMA transfer time: with I the fixed DMA initialization cost and α the transfer cost per byte, T(s) = I + α·b·s.

61-62 Optimal Granularity: 1Dim Data. Problem formulation: min T(s) s.t. T(s) ≤ C(s), s ∈ [1..n], s ≤ M (M: memory limitation). The optimal granularity s* satisfies C(s*) = T(s*). [Figure: C(s) and T(s) versus block size s; T(s) starts at I and grows more slowly than C(s); the transfer domain (T > C) lies to the left of the crossing point s*, the computation domain (T < C) to its right, bounded by M.]
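Solving the boundary condition with the two cost models above gives a closed form (a sketch; ψ is read here as the per-element slack ω − α·b, consistent with its use in the later 2D slides):

    C(s*) = T(s*)  ⟺  ω·s* = I + α·b·s*  ⟹  s* = I / (ω − α·b) = I/ψ,  assuming ω > α·b.

If ω ≤ α·b, the application stays in the transfer regime at every block size.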

63 Outline (transition to Contribution: Optimal Granularity for a Single Processor, 2Dim Data).

64-66 Optimal Granularity: 2Dim Data. Characterization of computation and transfer time. (s1 × s2): number of array elements clustered in one rectangular block of the array X(i1, i2). Computation time: with ω the time to compute one element, C(s1, s2) = ω·s1·s2. [Figure: the array partitioned into an m1 × m2 grid of blocks block(j1, j2).]

67-68 Optimal Granularity: 2Dim Data. Strided DMA transfer time: with I1 the transfer cost overhead per line, T(s1, s2) = I + I1·s1 + α·b·s1·s2. [Figure: a strided transfer gathers s1 lines of s2 elements, separated by a stride, starting at src addr.] Strided DMA transfers are costlier than contiguous transfers.

69 Influence of the Block Shape on the DMA Transfer Cost. Different block shapes with the same area have different DMA transfer times. [Figure: three blocks of area 4: (s1, s2) = (1, 4), (2, 2), and (4, 1).]
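Plugging the three shapes into the strided cost model above makes the effect explicit (a worked example; the per-byte term α·b·s1·s2 = 4·α·b is identical for all three, only the per-line overhead differs):

    T(1, 4) = I + 1·I1 + 4·α·b
    T(2, 2) = I + 2·I1 + 4·α·b
    T(4, 1) = I + 4·I1 + 4·α·b

For a fixed area, the flattest block (a single contiguous line, s1 = 1) is the cheapest to transfer.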

70-72 Optimal Granularity: 2Dim Data. C(s1, s2): computation time of a block; T(s1, s2): transfer time of a block; s1: block height, s2: block width. Problem formulation: min T(s1, s2) s.t. T(s1, s2) ≤ C(s1, s2), (s1, s2) ∈ [1..n1] × [1..n2], s1·s2 ≤ M. [Figure: the (s2, s1) plane split by the curve T = C into a transfer domain (T > C, small blocks) and a computation domain (T < C, large blocks); the curve approaches the asymptote s2 = I1/ψ.]

73 Along a curve of constant area (s'1·s'2 = s1·s2), flattening the block reduces its transfer time: for s'1 < s1, T(s'1, s'2) < T(s1, s2). [Figure: an iso-area curve in the (s2, s1) plane crossing the T = C boundary.]

74 Optimal Granularity: 2Dim Data. The optimal granularity is the contiguous block that reaches the computation regime: block height s1* = 1, block width s2* = (1/ψ)(I1 + I0). [Figure: the optimum lies on the T = C curve at s1 = 1, s2 = (1/ψ)(I1 + I0).]
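This follows from setting s1 = 1 in the strided cost model and balancing transfer against computation (a sketch, again reading ψ = ω − α·b and writing I0 for the fixed initialization cost I):

    T(1, s2) = C(1, s2)  ⟺  I0 + I1 + α·b·s2 = ω·s2  ⟹  s2* = (I0 + I1) / (ω − α·b) = (1/ψ)(I1 + I0).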

75 Outline (transition to Contribution: Multiple Processors).

76 Multiple Processors. Partitioning: the array is split into p contiguous chunks of n/p elements, one per processor P1 ... Pp. [Figure: the array divided into p equal chunks.] Processors are identical: same local store capacity, same double-buffering granularity, etc.

77-81 Multiple Processors. Pipelined execution for several processors: the processors' DMA requests are issued concurrently. [Figure: per-processor pipelined schedules for P0, P1, P2, each fetching, computing, and writing back its blocks (b0, b1, b2; b3, b4, b5; b6, b7, b8), with prologue, epilogue, and processor idle time.] The transfer time becomes T(s, p) = I + α(p)·b·s, where α(p) is the transfer cost per byte under the contention of p concurrent transfer requests.
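Combining this contended cost with the earlier 1D closed form gives the optimal granularity as a function of p. A small C helper (a sketch under the deck's linear cost models; parameter values are up to the caller):

    #include <math.h>

    /* Optimal 1D block size for p processors, balancing
     * T(s, p) = I + alpha_p * b * s against C(s) = omega * s.
     * Returns a negative value when omega <= alpha_p * b, i.e. when
     * the contended transfers can never be hidden (transfer regime). */
    double optimal_block_size(double I, double alpha_p, double b, double omega)
    {
        double slack = omega - alpha_p * b;  /* psi(p), per element */
        if (slack <= 0.0)
            return -1.0;
        return ceil(I / slack);              /* T(s*) = C(s*) boundary */
    }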

82-84 Multiple Processors: Optimal Granularity. Optimal granularity for p processors: s*(p). [Figure: (a) one-dimensional data: C(s) against the transfer curves T1(s) and Tp(s), giving s*(1) < s*(p); (b) two-dimensional data: the boundary T1 = C shifts to Tp = C.] The optimal granularity increases with the number of processors.

85 Summary: we derived the optimal granularity. The main idea is to balance the computation and transfer times of a block. For 2D data, the block shape influences the transfer time (overhead I1 per line); with multiple processors, the number of processors influences the transfer time (through α(p)).

86 Local Memory Constraint. What if the optimal granularity does not fit in local memory? Either 1 take the available memory space, or 2 reduce the number of processors. [Figure: for 1D and 2D data, the memory bound M cuts the feasible region before the Tp = C crossing, so the granularity is clamped to M or the curve is relaxed by using p' < p processors.]

87 Outline (transition to Contribution: Shared Data).

88-89 Applications with Shared Data: 1Dim Data. Parallel loop with shared input data: for i := 0 to n-1 do Y[i] := f(X[i], V[i]) od, where V[i] = {X[i-1], X[i-2], ..., X[i-k]}. Neighboring blocks share data. [Figure: blocks b0 ... b5 of size s with overlapping shared regions at their boundaries.]

90 Shared Data: 2Dim Data. Parallel loop with shared input data: for i1 := 0 to n1-1 do for i2 := 0 to n2-1 do Y[i1, i2] := f(X[i1, i2], V[i1, i2]) od od, where V[i1, i2] = {X[i1-1, i2], X[i1, i2-1], ..., X[i1-k, i2]}: a symmetric window of k/2 elements on each side. [Figure: the window around an element of the n1 × n2 array.]

91 Optimal Granularity: 1Dim Data. We compare three strategies for transferring shared data: 1 replication, via DMA transfers from the off-chip memory to local memory; 2 inter-processor communication, where processors exchange data over the network-on-chip between the cores; 3 local buffering, via local copies done by the processors. Based on a parametric study, we derive the optimal strategy and granularity for transferring shared data.

92 Optimal Granularity: 2Dim Data. We consider replication for transferring shared data. R: size of the replicated data. [Figure: a block (s1, s2) = (2, 2) with its surrounding window; R = 12 replicated elements.]

93-96 Influence of the block shape on the size of shared data: compare the transfer cost of a flat and a square block. [Figure: (s1, s2) = (1, 4) replicates R = 14 elements; (s1, s2) = (2, 2) replicates R = 12.] The flat block incurs more replicated-data overhead; the square block incurs more transfer-line overhead.

97 Optimal Granularity: 2Dim Data. Problem formulation: find (s1*, s2*) such that: min T(s1 + k, s2 + k) s.t. T(s1 + k, s2 + k) ≤ C(s1, s2), (s1, s2) ∈ [1..n1] × [1..n2], (s1 + k)·(s2 + k) ≤ M.

98-99 Optimal Granularity: 2Dim Data. With replication, the boundary T = C shifts to a curve T_k = C away from the diagonal s1 = s2. [Figure: the curves T = C and T_k = C in the (s2, s1) plane, with the optimum s* between them.] The optimal shape lies between a square and a flat shape: s1* = ... + (c1/ψ)(1/d), s2* = ... + (I1/ψ)(1 + D).

100 Outline (transition to section 3, Experiments on the Cell.BE).

101 Overview of the Cell B.E. Architecture. [Figure: eight SPUs, each with an MFC and MMU, attached to the EIB; the PPU with its MMU and L2; I/O interface, XDR DRAM interface, coherent interface.] Platform characteristics: a 9-core heterogeneous multi-core architecture with one Power Processor Element (PPE) and 8 Synergistic Processing Elements (SPEs); limited local store capacity per SPE: 256 Kbytes; an explicitly managed memory system, using DMAs.

102 Measured DMA Latency. [Figure: transfer time (clock cycles) versus super-block size s·b, for 1, 2, 4, and 8 SPUs.] Profiled hardware parameters: DMA issue time I ≈ 400 clock cycles; off-chip memory transfer cost per byte for 1 processor α(1) ≈ 0.22 clock cycles; for p processors α(p) ≈ p·α(1); inter-processor communication transfer cost per byte β ≈ 0.13 clock cycles.
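Plugging the profiled values into the closed form s*(p) = I / (ω − α(p)·b) gives concrete granularities. A worked example, assuming b = 4 bytes per element and an illustrative ω = 10 cycles per element (ω is application-dependent and not given on the slide):

    s*(1) = 400 / (10 − 1·0.22·4) ≈ 44 elements
    s*(8) = 400 / (10 − 8·0.22·4) ≈ 135 elements

This illustrates the earlier claim that contention (a larger α(p)) pushes the optimal granularity up.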

103 Optimal Granularity: 1Dim Data, No Sharing. [Figure: measured and predicted execution time (clock cycles) versus super-block size s·b, for 2 and 8 SPUs.] The predicted optimal granularities give good performance.

104 Optimal Granularity: 1Dim Data, Sharing. Comparing several strategies. [Figure: execution time versus super-block size for replication, inter-processor communication, and local buffering, on 2 and 8 SPUs.]

105 Optimal Granularity: 2Dim Data, Sharing. We implement double buffering for a mean-filtering algorithm; the predicted optimal granularities give good performance.

106 Outline (transition to section 4, The Move towards Platform 2012).

107 P2012 Memory Hierarchy: 1 intra-cluster L1 memory (256 Kbytes), 2 inter-cluster L2 memory, 3 off-chip L3 memory.

108-109 DMA Latency. We measure the DMA latency on P2012. DMA performance model: T(s, p) = I + α(p)·b·s, with I = 240 cycles. [Table: measured α(p) for increasing numbers of processors p.]

110 DMA Transfers in a Cluster. With a shared local memory and a shared DMA, there are two approaches for transferring data: 1 liberal, where processors fetch data independently; 2 collaborative, where processors fetch data collectively.

111 Liberal Approach: processors fetch data independently. T(s, p) = I + α(p)·b·s; C(s, p) = o + ω·s; s ≤ M/p. OpenCL kernel: a work-item maps to a processor, and an async work-item copy issues a DMA fetch for each processor. [Figure: each of P0 ... P16 issues its own transfer from off-chip memory into the cluster's local memory.]

112 Collaborative Approach: processors fetch data collectively. T(s, p) = I + α(1)·b·s; C(s, p) = o(p) + (ω/p)·s; s ≤ M. OpenCL kernel: a work-group maps to the cluster, and async_work_group_copy issues a single DMA fetch for the whole cluster. [Figure: one transfer from off-chip memory into the cluster's local memory, shared by P0 ... P16.]
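A minimal OpenCL C sketch of this collaborative scheme, using the standard async_work_group_copy built-in; the kernel name, block size, and computation are illustrative assumptions, while the collective copy is what maps to a cluster-level DMA transfer on P2012-like targets:

    #define BLOCK 1024   /* elements per work-group block (assumed) */

    __kernel void process_block(__global const float *X,
                                __global float *Y,
                                __local  float *buf)
    {
        size_t g = get_group_id(0);

        /* Collective fetch: one DMA transfer brings the whole block
         * into the cluster's local memory. */
        event_t ev = async_work_group_copy(buf, X + g * BLOCK, BLOCK, 0);
        wait_group_events(1, &ev);

        /* Each work-item (processor) computes its slice of the block. */
        for (size_t i = get_local_id(0); i < BLOCK; i += get_local_size(0))
            buf[i] = 2.0f * buf[i] + 1.0f;   /* placeholder computation */

        barrier(CLK_LOCAL_MEM_FENCE);

        /* Collective write-back of the computed block. */
        event_t wb = async_work_group_copy(Y + g * BLOCK, buf, BLOCK, 0);
        wait_group_events(1, &wb);
    }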

113-114 Liberal Approach: processors fetch data independently. Increasing the number of processors reduces the maximum buffer size per processor; in the double-buffering implementation, the optimal granularity no longer fits in the available memory space. Synchronization overhead also grows, so performance degrades as the number of processors increases.

115 On-going Work: find the right balance between the number of processors and the memory space budget, and compare the liberal and collaborative approaches.

116 Outline (transition to section 5, Conclusions and Perspectives).

117 Conclusion. We presented a general methodology for automating decisions about the optimal granularity of data transfers. It captures the following facts: 1 block shape and size influence the transfer/computation ratio; 2 DMA performance is sensitive to block shape and to the number of processors; 3 there is a tradeoff between strided-DMA overhead and the size of the replicated data.

118 Perspectives: 1 consider other application patterns, 2 capture variations (hardware and software), 3 generalize the approach to more than 2 memory levels, 4 integrate this work into a complete compilation flow, 5 combine task and data parallelism, ...

119 Thank You!!!


More information

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Organization Part II Ujjwal Guin, Assistant Professor Department of Electrical and Computer Engineering Auburn University, Auburn,

More information

CSC501 Operating Systems Principles. OS Structure

CSC501 Operating Systems Principles. OS Structure CSC501 Operating Systems Principles OS Structure 1 Announcements q TA s office hour has changed Q Thursday 1:30pm 3:00pm, MRC-409C Q Or email: awang@ncsu.edu q From department: No audit allowed 2 Last

More information

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:

More information

Memory Systems and Compiler Support for MPSoC Architectures. Mahmut Kandemir and Nikil Dutt. Cap. 9

Memory Systems and Compiler Support for MPSoC Architectures. Mahmut Kandemir and Nikil Dutt. Cap. 9 Memory Systems and Compiler Support for MPSoC Architectures Mahmut Kandemir and Nikil Dutt Cap. 9 Fernando Moraes 28/maio/2013 1 MPSoC - Vantagens MPSoC architecture has several advantages over a conventional

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

Arquitecturas y Modelos de. Multicore

Arquitecturas y Modelos de. Multicore Arquitecturas y Modelos de rogramacion para Multicore 17 Septiembre 2008 Castellón Eduard Ayguadé Alex Ramírez Opening statements * Some visionaries already predicted multicores 30 years ago And they have

More information

Lecture 14: Memory Hierarchy. James C. Hoe Department of ECE Carnegie Mellon University

Lecture 14: Memory Hierarchy. James C. Hoe Department of ECE Carnegie Mellon University 18 447 Lecture 14: Memory Hierarchy James C. Hoe Department of ECE Carnegie Mellon University 18 447 S18 L14 S1, James C. Hoe, CMU/ECE/CALCM, 2018 Your goal today Housekeeping understand memory system

More information

Finite Element Integration and Assembly on Modern Multi and Many-core Processors

Finite Element Integration and Assembly on Modern Multi and Many-core Processors Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,

More information

GPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA

GPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA GPGPU LAB Case study: Finite-Difference Time- Domain Method on CUDA Ana Balevic IPVS 1 Finite-Difference Time-Domain Method Numerical computation of solutions to partial differential equations Explicit

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

A Cache Hierarchy in a Computer System

A Cache Hierarchy in a Computer System A Cache Hierarchy in a Computer System Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available... We are... forced to recognize the

More information

Concurrent/Parallel Processing

Concurrent/Parallel Processing Concurrent/Parallel Processing David May: April 9, 2014 Introduction The idea of using a collection of interconnected processing devices is not new. Before the emergence of the modern stored program computer,

More information

Parallel Computer Architecture Spring Shared Memory Multiprocessors Memory Coherence

Parallel Computer Architecture Spring Shared Memory Multiprocessors Memory Coherence Parallel Computer Architecture Spring 2018 Shared Memory Multiprocessors Memory Coherence Nikos Bellas Computer and Communications Engineering Department University of Thessaly Parallel Computer Architecture

More information

Introduction to Parallel Computing. CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014

Introduction to Parallel Computing. CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014 Introduction to Parallel Computing CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014 1 Definition of Parallel Computing Simultaneous use of multiple compute resources to solve a computational

More information

On the Portability and Performance of Message-Passing Programs on Embedded Multicore Platforms

On the Portability and Performance of Message-Passing Programs on Embedded Multicore Platforms On the Portability and Performance of Message-Passing Programs on Embedded Multicore Platforms Shih-Hao Hung, Po-Hsun Chiu, Chia-Heng Tu, Wei-Ting Chou and Wen-Long Yang Graduate Institute of Networking

More information

( ZIH ) Center for Information Services and High Performance Computing. Event Tracing and Visualization for Cell Broadband Engine Systems

( ZIH ) Center for Information Services and High Performance Computing. Event Tracing and Visualization for Cell Broadband Engine Systems ( ZIH ) Center for Information Services and High Performance Computing Event Tracing and Visualization for Cell Broadband Engine Systems ( daniel.hackenberg@zih.tu-dresden.de ) Daniel Hackenberg Cell Broadband

More information

NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems

NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems Carl Pearson 1, I-Hsin Chung 2, Zehra Sura 2, Wen-Mei Hwu 1, and Jinjun Xiong 2 1 University of Illinois Urbana-Champaign, Urbana

More information

MCF5307 DRAM CONTROLLER. MCF5307 DRAM CTRL 1-1 Motorola ColdFire

MCF5307 DRAM CONTROLLER. MCF5307 DRAM CTRL 1-1 Motorola ColdFire MCF5307 DRAM CONTROLLER MCF5307 DRAM CTRL 1-1 MCF5307 DRAM CONTROLLER MCF5307 MCF5307 DRAM Controller I Addr Gen Supports 2 banks of DRAM Supports External Masters Programmable Wait States & Refresh Timer

More information

How to Write Fast Code , spring th Lecture, Mar. 31 st

How to Write Fast Code , spring th Lecture, Mar. 31 st How to Write Fast Code 18-645, spring 2008 20 th Lecture, Mar. 31 st Instructor: Markus Püschel TAs: Srinivas Chellappa (Vas) and Frédéric de Mesmay (Fred) Introduction Parallelism: definition Carrying

More information

Evaluating Multicore Architectures for Application in High Assurance Systems

Evaluating Multicore Architectures for Application in High Assurance Systems Evaluating Multicore Architectures for Application in High Assurance Systems Ryan Bradetich, Paul Oman, Jim Alves-Foss, and Theora Rice Center for Secure and Dependable Systems University of Idaho Contact:

More information

The future is parallel but it may not be easy

The future is parallel but it may not be easy The future is parallel but it may not be easy Michael J. Flynn Maxeler and Stanford University M. J. Flynn 1 HiPC Dec 07 Outline I The big technology tradeoffs: area, time, power HPC: What s new at the

More information

Comprehensive Review of Data Prefetching Mechanisms

Comprehensive Review of Data Prefetching Mechanisms 86 Sneha Chhabra, Raman Maini Comprehensive Review of Data Prefetching Mechanisms 1 Sneha Chhabra, 2 Raman Maini 1 University College of Engineering, Punjabi University, Patiala 2 Associate Professor,

More information

Architectural Differences nc. DRAM devices are accessed with a multiplexed address scheme. Each unit of data is accessed by first selecting its row ad

Architectural Differences nc. DRAM devices are accessed with a multiplexed address scheme. Each unit of data is accessed by first selecting its row ad nc. Application Note AN1801 Rev. 0.2, 11/2003 Performance Differences between MPC8240 and the Tsi106 Host Bridge Top Changwatchai Roy Jenevein risc10@email.sps.mot.com CPD Applications This paper discusses

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2014/15 Lecture X: Parallel Databases Topics Motivation and Goals Architectures Data placement Query processing Load balancing

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

Sony/Toshiba/IBM (STI) CELL Processor. Scientific Computing for Engineers: Spring 2008

Sony/Toshiba/IBM (STI) CELL Processor. Scientific Computing for Engineers: Spring 2008 Sony/Toshiba/IBM (STI) CELL Processor Scientific Computing for Engineers: Spring 2008 Nec Hercules Contra Plures Chip's performance is related to its cross section same area 2 performance (Pollack's Rule)

More information

CISC 879 Software Support for Multicore Architectures Spring Student Presentation 6: April 8. Presenter: Pujan Kafle, Deephan Mohan

CISC 879 Software Support for Multicore Architectures Spring Student Presentation 6: April 8. Presenter: Pujan Kafle, Deephan Mohan CISC 879 Software Support for Multicore Architectures Spring 2008 Student Presentation 6: April 8 Presenter: Pujan Kafle, Deephan Mohan Scribe: Kanik Sem The following two papers were presented: A Synchronous

More information

04 - DSP Architecture and Microarchitecture

04 - DSP Architecture and Microarchitecture September 11, 2015 Memory indirect addressing (continued from last lecture) ; Reality check: Data hazards! ; Assembler code v3: repeat 256,endloop load r0,dm1[dm0[ptr0++]] store DM0[ptr1++],r0 endloop:

More information

AN 831: Intel FPGA SDK for OpenCL

AN 831: Intel FPGA SDK for OpenCL AN 831: Intel FPGA SDK for OpenCL Host Pipelined Multithread Subscribe Send Feedback Latest document on the web: PDF HTML Contents Contents 1 Intel FPGA SDK for OpenCL Host Pipelined Multithread...3 1.1

More information

Conventional Computer Architecture. Abstraction

Conventional Computer Architecture. Abstraction Conventional Computer Architecture Conventional = Sequential or Single Processor Single Processor Abstraction Conventional computer architecture has two aspects: 1 The definition of critical abstraction

More information

Architecture at HP: Two Decades of Innovation

Architecture at HP: Two Decades of Innovation Architecture at HP: Two Decades of Innovation Joel S. Birnbaum Microprocessor Forum San Jose, CA October 14, 1997 Computer Architecture at HP A summary of motivations, innovations and implementations in

More information

Unit 9 : Fundamentals of Parallel Processing

Unit 9 : Fundamentals of Parallel Processing Unit 9 : Fundamentals of Parallel Processing Lesson 1 : Types of Parallel Processing 1.1. Learning Objectives On completion of this lesson you will be able to : classify different types of parallel processing

More information

Lecture 2: CUDA Programming

Lecture 2: CUDA Programming CS 515 Programming Language and Compilers I Lecture 2: CUDA Programming Zheng (Eddy) Zhang Rutgers University Fall 2017, 9/12/2017 Review: Programming in CUDA Let s look at a sequential program in C first:

More information

Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays

Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays Éricles Sousa 1, Frank Hannig 1, Jürgen Teich 1, Qingqing Chen 2, and Ulf Schlichtmann

More information

Introduction I/O 1. I/O devices can be characterized by Behavior: input, output, storage Partner: human or machine Data rate: bytes/sec, transfers/sec

Introduction I/O 1. I/O devices can be characterized by Behavior: input, output, storage Partner: human or machine Data rate: bytes/sec, transfers/sec Introduction I/O 1 I/O devices can be characterized by Behavior: input, output, storage Partner: human or machine Data rate: bytes/sec, transfers/sec I/O bus connections I/O Device Summary I/O 2 I/O System

More information

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,

More information

In multiprogramming systems, processes share a common store. Processes need space for:

In multiprogramming systems, processes share a common store. Processes need space for: Memory Management In multiprogramming systems, processes share a common store. Processes need space for: code (instructions) static data (compiler initialized variables, strings, etc.) global data (global

More information

CUDA OPTIMIZATIONS ISC 2011 Tutorial

CUDA OPTIMIZATIONS ISC 2011 Tutorial CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

Lecture 7: Parallel Processing

Lecture 7: Parallel Processing Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction

More information

Bandwidth Avoiding Stencil Computations

Bandwidth Avoiding Stencil Computations Bandwidth Avoiding Stencil Computations By Kaushik Datta, Sam Williams, Kathy Yelick, and Jim Demmel, and others Berkeley Benchmarking and Optimization Group UC Berkeley March 13, 2008 http://bebop.cs.berkeley.edu

More information

Address Translation. Tore Larsen Material developed by: Kai Li, Princeton University

Address Translation. Tore Larsen Material developed by: Kai Li, Princeton University Address Translation Tore Larsen Material developed by: Kai Li, Princeton University Topics Virtual memory Virtualization Protection Address translation Base and bound Segmentation Paging Translation look-ahead

More information