1 ENVISION. ACCELERATE. ARRIVE. ClearSpeed Programming Model: Optimizing Performance

2 Overview
- Compute considerations
- Memory considerations
- Latency hiding
- Miscellaneous
- Profiling
- Inline assembly
- Optimal performance: Top 10 tips

3 ENVISION. ACCELERATE. ARRIVE. Compute considerations

4 Source for further information
- The majority of this information is gleaned from: CSX600 Programming Manual (06-RM-1305), Chapter 4: Execution Pipeline

5 Poly ALU is 8-bit
- Hence, a 4-byte operation takes twice as long as a 2-byte operation
- Be mindful in your code: do you need a 32-bit int when a 16-bit short will do?
- Good example: array subscripts! PE memory is 6 kB, and a signed short will cover 32 kB
- Added benefit: variables will take up less poly memory space
- Poly memory is a scarce resource, so use it wisely
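A minimal Cn-style sketch of the subscript point, with hypothetical names; the 16-bit index halves the 8-bit ALU's work compared with a 32-bit int, yet still spans all of PE memory:

    poly float data[1024];   // 4 kB of the 6 kB poly memory
    poly short i;            // 16-bit subscript: 2 bytes instead of 4 through the 8-bit ALU
    for (i = 0; i < 1024; i++) {
        data[i] *= 2.0f;
    }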

6 Perform poly-only expressions
- Precopy constant items from a mono variable to a poly variable
- The expression will then be poly-only: faster
- Potential for value reuse: no need to resend mono -> poly
- Example, before:
    mono int mmax;           // Assume initialised
    poly int ploop;
    while (ploop < mmax)
    {
- And after:
    mono int mmax;           // Assume initialised
    poly int ploop, pmax;
    pmax = mmax;             // Single send!
    while (ploop < pmax) {

7 Poly Conditionals
- As far as possible, remove common subexpressions from poly if blocks
- Reduces the amount of replicated work
- Reminder: PEs do not skip poly conditionals
- All PEs process the same code; some just ignore instructions
- All PEs pay the same cycle cost
- Be prepared to compute and throw away results if it leads to fewer poly conditional blocks
- Increases efficiency, as PEs are enabled and processing more of the time
- A poly if uses predicated instructions
- Not a branch, so no jump overhead
- Cheap if few additional instructions are executed

8 Poly Conditionals (continued)
- Example of superfluous compute for speed; all instructions are processed (PEs are merely disabled in the conditional):
    poly double a, b;
    if <condition 1>
        if <condition 2>
            a = ComputationA(params)
        else
            b = ComputationB(params)
        endif
    else
        a = ComputationA(params)     // Computation carried out twice!
    endif

9 Poly Conditionals (continued)
- Superfluous compute for speed (continued): smaller total cycle count
    poly double a, b, atemp;
    atemp = ComputationA(params);    // Now a single computation, assigned twice
    if <condition 1>
        if <condition 2>
            a = atemp;
        else
            b = ComputationB(params)
        endif
    else
        a = atemp;
    endif

10 Standard procedural programming speedups
- Don't calculate in the for loop header; example:
    for (int i=0; i < (a / b); i++)    // Calculated each iteration!
- What if the result of the calculation is constant over the loop? It is faster to compute the result once and reuse it:
    int limit = a / b;
    for (int i=0; i < limit; i++)
- Remember: the fastest computation is the one you don't carry out!

11 More examples of superfluous computation
- Pointer arithmetic: don't calculate the absolute address each iteration
    for (i=0; i<10; i++) {
        pointer = start_address + i*128;
        function(pointer, ...);
    }
- Prefer:
    pointer = start_address;
    for (i=0; i<10; i++) {
        function(pointer, ...);
        pointer += 128;    // Not carrying out a multiply
    }

12 Division speedup
- Division is a slow operation when compared to multiply
- Are you dividing by a constant? It is more efficient to multiply by its inverse
    float data[10];    // Some nebulous data to process
    float divisor;     // Quantity we wish to divide by
    for (int i=0; i<10; i++) {
        data[i] /= divisor;
    }
- Instead, compute the inverse, store it and reuse it:
    float data[10];    // Some nebulous data to process
    float divisor;     // Quantity we wish to divide by
    float invdivisor = 1.0 / divisor;
    for (int i=0; i<10; i++) {
        data[i] *= invdivisor;
    }

13 Array lookup optimization
- You can achieve speedups with the 2.x compiler (the 3.x compiler is more efficient):
    poly double *x, *y;
    for (short i=0; i<32; i++) {
        y[i] = x[i];
    }
- You would be better coding:
    poly double *x, *y;
    poly double *xi, *yi;
    xi = x;
    yi = y;
    for (short i=0; i<32; i++) {
        *yi++ = *xi++;
    }
- Note the removal of the index: this removes the related address calculation

14 Literal Constants
- You can achieve speedups with the 2.x compiler (the 3.x compiler is more efficient)
- Compared with:
    poly double x, y;
    x = x + 1.0;
    y = y + 1.0;
- This may prove to be faster:
    static poly const double one = 1.0;
    poly double x, y;
    x = x + one;
    y = y + one;

15 Can anything be precomputed?
- If there is a set of constants you require:
- Precalculate on the host
- Send to the board (tightly packed structure, multiples of 32 bytes, 8-byte aligned)
- Potentially faster than calculating on the board, especially if they are loaded in advance (along with the executable)
- "Old-school" lookup tables win out in certain circumstances
- Factorial: 3249! ≈ 10^10,000, and a look-up is a lot faster than 3,249 multiplies...
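As a sketch of the table idea (names and sizes are hypothetical), the host might build a factorial table once and ship it to the board rather than multiplying on a PE:

    #include <math.h>

    #define NMAX 1024

    static double log_fact[NMAX];    // log10(n!), built once on the host

    void build_log_fact_table(void)
    {
        log_fact[0] = 0.0;           // log10(0!) = log10(1) = 0
        for (int n = 1; n < NMAX; n++)
            log_fact[n] = log_fact[n - 1] + log10((double)n);
    }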

16 Should you use vector instructions?
- Don't automatically assume you need to vectorise
- Only if you have a small set of working variables and are not mono -> poly bandwidth bound
- Don't assume vectorization is always a huge win, particularly with sets of expressions with lots of variables
- You could run out of registers and spill variables to memory
- The code will then run slower due to the additional memory accesses

17 Vector Math Library (VML)
- VML functions take up PE memory: for example, 128 bytes for the sin & cos double-precision functions
- But VML functions are faster than libcn, even if the arguments are scalar, not vector
- Refer to: The Cn Standard Library (Document ID 06-RM-1139), Section 5: The ClearSpeed Vector Math Library

18 Compiler optimizations
- Refer to: ClearSpeed SDK Reference Manual (Document ID 06-RM-1136), Section 3.3: Compiler optimizations
- Breaks down exactly what the different levels of optimization (O1, O2, O3, O4) will do
- Different optimization implementations:
- Compiler 2.x: not all optimizations are available with poly variables
- Compiler 3.x: supported on poly, with additional optimizations
- Check the documentation for further details

19 ENVISION. ACCELERATE. ARRIVE. Memory considerations

20 DRAM memory optimization: 32-byte access
- ECC in DRAM works on 8-byte-wide words
- If you write < 8 bytes, DRAM will: read 8 bytes, overwrite N bytes, recalculate the ECC, write 8 bytes
- So use multiples of 8 bytes
- DRAM has a burst length of 4, and 4 * 8 = 32 bytes
- So accessing 1 byte takes as long as accessing 32 bytes
- Further reading: CSX600 Programming Manual (ID 06-RM-1305), Section 3.4: DRAM; Section 3.4.6: Performance

21 DRAM memory optimization: 32-byte access
- PIO bus width is 64 bytes: access multiples of 64 bytes for peak performance
- Requires accesses aligned to an 8-byte address
- For performance, align to a 32-byte address: this enables the DMA engine to be used
- Cn support: #pragma align N will align the next data structure to N bytes
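A minimal sketch of the pragma in use, assuming a hypothetical structure; packing the fields to exactly 32 bytes matches one DRAM burst:

    #pragma align 32
    mono struct sample {
        double a, b, c, d;    // 4 * 8 bytes = 32 bytes: one full DRAM burst, DMA-friendly
    } buffer;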

22 DRAM memory optimization: random access
- Only one row in each DRAM bank can be open at any one time
- The controller must open & close rows as addresses come in
- This takes time, but the controller schedules open/close commands ahead of the data access
- Consecutive access to memory is penalty free
- Random access within a page is penalty free
- Random access within all open pages is penalty free
- The number of banks varies with the board: usually 4 or 8 banks
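A generic C sketch of the difference, with hypothetical names; the first loop walks within open rows, while the second keeps forcing the controller to open and close them:

    // Sequential vs. scattered DRAM access (sketch; table and N are assumptions)
    long walk(const long *table, int N)
    {
        long sum = 0;
        for (int i = 0; i < N; i++)
            sum += table[i];                 // consecutive: penalty free
        for (int i = 0; i < N; i++)
            sum += table[(i * 7919) % N];    // scattered: row open/close penalties
        return sum;
    }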

23 DRAM memory optimization: multiple access
- Keep to one data stream at a time
- Even if you have two perfectly well-behaved consecutive write streams from different sources (2 MTAPs, or MTAP + host), your bandwidth will be slashed
- The streams get interleaved, so they are seen as accesses to different pages in the same bank
- Changing between read and write also takes time
- Do as many accesses in the same direction as possible

24 DRAM memory optimization: multiple access
- However, if you know exactly what you are doing:
- There are 3 memory controllers: an FPGA controls PCIe access, and there is one in each CSX processor
- The underlying issue is DRAM banks
- If you access too many different banks, performance drops badly as memory is flushed

25 Board DRAM: bank differences
- PCI-X: 8 DRAM banks; PCIe: 4 DRAM banks
- Performance differences can be seen moving from PCI-X to PCIe, due to the number of DRAM banks

26 Remember: available bandwidths
- Mono memory to poly memory: 3.2 GB/s aggregate over 96 PEs
- Poly memory to registers: 840 MB/s per PE, ~160 GB/s aggregate per board
- Swazzle path: 1680 MB/s per PE, ~320 GB/s aggregate per board
- Total bandwidth for an Advance board (2 CSX600 processors): ~0.5 TB/s
- Consider these figures when looking to move large amounts of data

27 Summary of optimal DRAM usage
- Transferring large blocks of data is more efficient than transferring small blocks: there is a fixed overhead in initialising each transfer (host to board, mono to poly, poly to mono)
- Use multiples of 32 bytes, preferably 64 bytes
- Align to an 8-byte address, preferably a 32-byte one: #pragma align 8
- There are benefits if you can pack your data structures into 32 bytes!
- If you're using 24 bytes, there's an additional 8 bytes that will transfer at no additional cost
- Use one stream at a time
- Transfer to/from the host when the CSX isn't accessing DRAM
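A sketch of the packing point, using a hypothetical record: padding a 24-byte payload out to 32 bytes costs nothing on the wire and keeps every transfer burst-sized:

    #pragma align 8
    typedef struct {
        double x, y, z;    // 24 bytes of payload
        double spare;      // 8 bytes of padding that transfer for free
    } record32;            // 32 bytes: one DRAM burst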

28 DRAM usage note: you will hit at most 80% of peak traffic
- Due to: 64 bytes sent on the ClearConnect Bus (Programmed I/O) are divided into 4 lots of 16 bytes, plus a 16-byte address
- Hence sending 64 bytes implies 80 bytes sent; 64/80 = 80% of peak
- This can be pathologically hit with 1 byte sent: it becomes 1 lot of 16 bytes, plus a 16-byte address
- Hence sending 1 byte implies 32 bytes sent; 1/32 = 3% of peak
- And then there's the read-modify-write for ECC DRAM

29 Swazzle path
- The swazzle path is 8 bytes wide, so swazzling 1 byte takes as long as swazzling 8
- Peak: 8 bytes per cycle, ~161 GB/s per processor
- Don't use the swazzle_up_zero instruction; prefer set_swazzle_ends followed by swazzle_up
- Don't repeatedly set up the ends of the swazzle: there is no need to call set_swazzle_ends multiple times if you wish to reuse the same end values

30 Sending mono to poly
- Given 3.2 GB/s bandwidth mono -> poly, 80% of peak gives ~2.5 GB/s
- That is approximately 10 bytes per cycle across the processor
- Divided amongst 96 PEs: ~0.1 bytes per cycle per PE
- Alternative: mono-to-poly broadcast, at 1 byte per cycle
- Note the cache latency to load mono data (10s of cycles); the cache would, however, get 32 bytes in a line
- Bear this in mind when considering mono -> poly; the choice will depend on the amount of data to be sent

31 Memcpy variants: which to use?
- Never use memcpym2p/memcpyp2m: they are never better than 10% of peak, due to their relaxed memory alignment & size constraints
- Use the async_ versions instead, even if you wait immediately for them to complete
- Bear in mind the alignment and size requirements, e.g. the poly source must be 4-byte aligned
- Refer to: The Cn Standard Library (Document ID 06-RM-1139)
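A minimal sketch of the advice, reusing the async_memcpym2p call and semaphore style from the slide-36 example (src and the semaphore number are assumptions):

    poly double buf[8];    // 64 bytes; meets the 4-byte poly alignment requirement
    // src: a mono double* into board DRAM (assumed)
    async_memcpym2p(19, buf, src + 8 * get_penum(), 8 * sizeof(double));
    sem_wait(19);          // waiting immediately still beats a plain memcpym2p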

32 ENVISION. ACCELERATE. ARRIVE. Latency hiding

33 Using semaphores to reduce wait on transfers
- Given the case: send data to the card; run a program on the data; retrieve the data
- The delay between transfer and compute can be reduced by using GSU semaphores

34 Using semaphores to reduce wait on transfers
- Load and run the program on the card; it waits on a GSU semaphore (GSU1) before starting
- Transfer data to the card, linking the GSU1 semaphore to transfer completion
- When the transfer completes, the program will immediately start running
- Queue the transfer of results from the card, linking the GSU2 semaphore to the transfer start
- When the program on the card has completed, it signals a GSU semaphore (GSU2)
- This triggers the host to transfer the results immediately the process completes

35 Improved use of semaphores and host transfers
- The previous slide can be improved with double-buffering:
- Transfer two problems to be solved to the card
- The card starts processing when the first problem arrives (*)
- Transfer the first results back; this will start when the first problem has been solved
- The card will start on the second problem as soon as it has finished the first one
- Transfer a third problem to the card, over the top of the first problem
- Go to (*) and retrieve the second set of results...
- If data transfer is faster than compute, the CSX will never stall
- But remember to keep to 1 data stream at a time if possible

36 Asynchronous I/O example
    void foo(double *A, double *B, int n) {
        // Assume n is divisible by 24*96
        poly unsigned short penum = get_penum();
        poly double mat[4] = {1., 2., 3., 4.};
        poly double a_front[12], a_back[12];
        poly double b[4] = {0., 0., 0., 0.};
        int i;
        async_memcpym2p(19, a_front, A + 12*penum, 12*sizeof(double));
        A += 12*96;
        n -= 24*96;
        while (n) {
            // Request the next memory block in advance
            async_memcpym2p(17, a_back, A + 12*penum, 12*sizeof(double));
            A += 12*96;
            sem_wait(19);    // Wait for the transfer; it should already have completed
            for (i = 0; i < 12; i += 2) {    // Step in pairs, so a_front[i+1] stays in bounds
                b[0] += a_front[i]*mat[0] + a_front[i+1]*mat[1];
                b[1] += a_front[i+1]*mat[0] + a_front[i]*mat[1];
                b[2] += a_front[i]*mat[2] - a_front[i+1]*mat[3];
                b[3] += a_front[i+1]*mat[2] - a_front[i]*mat[3];
            }
            n -= 12*96;
            // Request the next memory block in advance
            async_memcpym2p(19, a_front, A + 12*penum, 12*sizeof(double));
            A += 12*96;
            sem_wait(17);
            for (i = 0; i < 12; i += 2) {
                // ... the same compute on a_back; the final block is drained outside the while loop

37 Should you double-buffer mono to poly access?
- Don't automatically assume this!
- What if the returned data is extremely small? Then it is not a bottleneck
- Only double-buffer if you are memory-bandwidth bound; otherwise you are just wasting PE memory
- You could get better memory reuse with single buffering
- Hence it could be more efficient to be greedy with memory for compute and single-buffer!
- Overall: there is no hard and fast rule

38 ENVISION. ACCELERATE. ARRIVE. Miscellaneous

39 Embedded SRAM impact on programs
- ESRAM is 128 KB; programs run at peak performance when located inside it
- Compiler/linker 2.x: if the program is > 128 KB, the program will not be placed in ESRAM: a dramatic performance hit
- Compiler/linker 3.0: #pragma hot identifies code to be placed in ESRAM
- Compiler/linker 3.1: will dynamically page code into ESRAM
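A sketch of how the 3.0 pragma might be applied, assuming it marks the function that follows; check the compiler/linker documentation for the exact syntax:

    #pragma hot
    void inner_kernel(void)
    {
        // performance-critical code, placed in the 128 KB ESRAM for peak speed
    }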

40 Custom stack/heap sizes in poly memory
- The default is 3 KB stack, 3 KB heap; you can change this through pragmas
- Consider: if you create code with (say) a 5.9 KB heap, and you are then called from a different function/library which has a different setup (e.g. a 0.5 KB heap), you're going to be in trouble!

41 Debugging: random errors occurring?
- Try running your code on the simulator: more diagnostics compared to hardware
- For instance: you've accidentally fallen off the end of the 6 kB poly memory
- Hardware will eventually wrap the address, masking off the irrelevant bits, so 8 kB will become 0 kB
- The simulator will tell you if you've fallen off the end of the memory map

42 ENVISION. ACCELERATE. ARRIVE. Profiling

43 ClearSpeed Visual Profiler
- Host tracing: trace CSAPI functions; the user can infer overlapping host/board utilization; locate hot-spots
- Board tracing: trace board-side functions without instrumentation; locate hot-spots
- Board hardware utilization: display the activity of CSX functional units, including ld/st, PIO, SIMD microcode, instruction cache, data cache and thread
- Cycle accurate
- View the corresponding source
- Unified GUI

44 Detailed profiling is essential for accelerator tuning
- HOST CODE PROFILING: visually inspect multiple host threads; time specific code sections; check the overlap of host threads
- HOST/BOARD INTERACTION: infer cause and effect; measure transfer bandwidth; check the overlap of host and board compute
- ACCELERATOR PIPE: view instruction issue; visualize the overlap of executing instructions; get cycle-accurate timing; remove instruction-level performance bottlenecks
- CSX600 SYSTEM: trace at the system level; inspect the overlap of compute and I/O; view cache utilization; graph performance
[Diagram: host CPU(s) alongside an Advance accelerator board with two CSX600 processors and their pipelines]

45 csvprof: Host Tracing
- Dynamic loading of the CSAPI Trace implementation, triggered with an environment variable: export CS_CSAPI_TRACE=1
- Recall the similar enabling of debug support: export CS_CSAPI_DEBUGGER=1
- Specify the tracing format: export CS_CSAPI_TRACE_CSVPROF=1 (currently this is the only implementation, but in the future...)
- Specify the output file for the trace: export CS_CSAPI_TRACE_CSVPROF_FILE=mytrace.cst (default filename: csvprof_data.cst)
- The output file is written during CSAPI_delete

46 Profile of a complete LINPACK run (x86 view)
- Overview of system performance during the LINPACK run
- Profiling of the x86 source code inside LINPACK
- CSX600 interaction displayed alongside the x86 code profile

47 A single LINPACK DGEMM call (x86 view)
- Scale from the full LINPACK run down to an individual DGEMM
- Individual GFLOPS for a DGEMM call displayed in the profile
- CSX600 aspects of the individual DGEMM call now visible

48 Multiple CSX600 DGEMM calls (x86 view)
- See each individual CSX600 processor's contribution
- Overlap of the x86 threads handling the CSX600 offload
- Overhead of data transfer between processors is visible

49 Multiple CSX600 DGEMM calls (CSX600 view)
- View the DGEMM calls on the CSX600 processor
- Each call ties up with the host view of card execution
- A much higher level of detail is available from the profiler

50 Single DGEMM inner loop on the CSX600 (CSX600 view)
- Scale down to the view of code executing on the CSX600
- View the host data being copied into CSX600 memory
- Tune the CSX600 code based on the data-flow profile

51 Pipeline view of the CSX600 DGEMM inner loop (CSX600 view)
- Profile the code running at the instruction level
- See the pipeline performance for each instruction
- Tune the instruction scheduling for the application code

52 Visual Profiler Board Tracing
- Enabled using the debugger, csgdb; it can be used interactively or through a gdb script
- You can select which events to profile, or all events
- Requires buffer allocation on the card; today, this is done statically
- One could use CSAPI to allocate the buffer, but the developer must then get the location and size of the buffer to the user, to be entered into csgdb
- Easy if running on only one chip: place the buffer in the other chip's memory
- An explicit dump generates the trace file; you can control the type of data to be dumped

53 csvprof: Sample gdb script
    % cat ./csgdb_trace.gdb
    connect
    load ./foo.csx
    cstrace buffer 0x... 0x...    (buffer address and size lost in transcription)
    cstrace event all on
    tbreak test_me
    continue
    cstrace enable
    continue
    cstrace dump foo.cst
    cstrace dump branch dgemm_test4_branch.cst
    quit
    % csgdb command=./csgdb_trace.gdb

54 ENVISION. ACCELERATE. ARRIVE. Inline assembly

55 Inline assembler within Cn
- Refer to: SDK Reference Manual (Document ID 06-RM-1136), Section 12.11: Inline assembler
- This assumes you also know the instruction set; refer to: CSX600 Instruction Set Reference Manual (Document ID 06-RM-1137)
- A brief example is presented here

56 Inline assembler: example
- Cn inline assembler: similar to function syntax
- Cannot be defined within basic blocks
- Uses the asm keyword to differentiate from normal functions
- Example:
    asm mono float addf(mono float x, mono float y)
    {
        ... @{y}f;    // Body truncated in transcription; @{...} denotes a parameter's register
    }
- This could be called from the Cn code as follows:
    int main(void)
    {
        float y = addf(10.0, 20.0);
        return (int) y;
    }

57 Overview of features
- Variables can be accessed: enables register allocation by the compiler
- Directives inform the compiler of user intentions: whether a parameter (register) is to be modified, and register requirements (for example, requesting a 32-bit mono register)
- Refer to the relevant section of the SDK Reference Manual (section number lost in transcription)
- Example: an insert that overwrites the x parameter:
    #pragma asm_inc <arith.inc>
    asm mono short adddbl(mono short x, mono short y)
    {
        ... @{x};    // Body truncated in transcription
    }

58 Once you're happy with assembler
- If your code isn't running as fast as expected:
- Examine the assembly produced from the Cn; verify that the compiler is doing what you expect
- Code in a tight inner kernel may not be optimal
- Compiler/linker 2.x has fewer optimizations than the 3.x series
- You can then discover whether it is worth hand-crafting part of a routine in assembler

59 ENVISION. ACCELERATE. ARRIVE. Optimal performance: Top 10 tips

60 Top tips: counting down
1. Use both chips on the board! Yes, don't forget you have 2 processors.
2. Asynchronous I/O (latency hiding): overlap everything! Mono with poly, on-/off-host, poly load/store with poly compute.
3. Move common code from poly conditionals to outside the conditional. Remember: poly does not branch; don't pay to run code twice. Refactor code to process common sub-expressions once.
4. Align memory accesses to 8 bytes: enables DMA access.
5. Use multiples of 64 bytes for DRAM access: maximises efficiency.

61 Top tips: counting down
6. Vector intrinsics: use them to achieve maximum performance.
7. Use async_memcpy: don't pay the overhead for more flexible memory sizes.
8. The poly ALU is 8-bit, so tighten integer math. Why calculate 32-bit (4 registers) if 8-bit (a single register) will suffice?
9. Use swazzle: massive transfer bandwidth that runs in parallel with DRAM.
10. Multiple boards: at least prepare your code for scalability.

62 ENVISION. ACCELERATE. ARRIVE. Summary

63 Summary
- Compute considerations
- Memory considerations
- Latency hiding
- Miscellaneous
- Profiling
- Inline assembly
- Optimal performance: Top 10 tips


65 ENVISION. ACCELERATE. ARRIVE. Reducing start-up of small applications

66 a. Training_2007_09_07, slides #2-#7: client-server code (small example; no code shown)
