ENVISION. ACCELERATE.


1 ENVISION. ACCELERATE. ARRIVE. ClearSpeed Technical Training December 2007 Overview 1

2 Presenters Ronald Langhi Technical Marketing Manager Brian Sumner Senior Engineer 2

3 ClearSpeed Technology: Company Background Founded in 2001 Focused on alleviating the power, heat, and density challenges of HPC systems 103 patents granted and pending (as of September 2007) Offices in San Jose, California and Bristol, UK 3

4 Agenda Accelerators ClearSpeed and HPC Hardware overview Installing hardware and software Thinking about performance Software Development Kit Application examples Help and support 4

5 ENVISION. ACCELERATE. ARRIVE. What is an accelerator? 5

6 What is an accelerator? A device to improve performance Relieve main CPU of workload Or to augment the CPU's capability An accelerator card can increase performance On specific tasks Without aggravating facility limits on clusters (power, size, cooling) 6

7 All accelerators are good for their intended purpose FPGAs Good for integer, bit-level ops Programming looks like circuit design Low power per chip, but 20x more power than custom VLSI Not for 64-bit FLOPS Cell and GPUs Good for video gaming tasks 32-bit FLOPS, not IEEE Unconventional programming model Small local memory High power consumption (> 200 W) ClearSpeed Good for HPC applications IEEE 64-bit and 32-bit FLOPS Custom VLSI, true coprocessor At least 1 GB local memory Very low power consumption (25 W) Familiar programming model 7

8 The case for accelerators Accelerators designed for HPC applications can improve performance as well as performance per (watt, cabinet, dollar) Accelerators enable: Larger problems for given compute time, or Higher accuracy for given compute time, or Same problem in shorter time Host-to-card latency and bandwidth are not major barriers to successful use of properly-designed accelerators. 8

9 ENVISION. ACCELERATE. ARRIVE. What can be accelerated? 9

10 Good application targets for acceleration An application needs to be both computationally intensive and contain a high degree of data parallelism. Computationally intensive: software depends on executing large numbers of arithmetic calculations, usually 64-bit Floating point Operations per Second (FLOPS), and should also have a high ratio of FLOPS to data movement (bandwidth). Computationally intensive applications may run for many hours or more, even on large clusters. Data parallelism: software performs the same sequence of operations again and again, but on a different item of data each time. Example computationally intensive, data parallel problems include: Large matrix arithmetic (linear algebra) Molecular simulations Monte Carlo options pricing in financial applications And many, many more 10
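
As a rough illustration of the FLOPS-to-bandwidth ratio (numbers chosen purely for illustration): multiplying two 1000x1000 matrices takes 2 x 10^9 floating-point operations but touches only 3 x 10^6 elements x 8 bytes = 24 MB of matrix data, about 80 FLOPS per byte moved. This high ratio is why dense linear algebra is such a good acceleration target.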

11 Example data parallel problems that can be accelerated Ab initio Computational Chemistry Structural Analysis Electromagnetic Modeling Radar Cross-Section Global Illumination Graphics 11

12 HPC Requirements Accelerator boards increase compute performance on highly specific tasks, without aggravating facility limits on clusters (power, size) Need to consider Type of application Software Data type and precision Compatibility with host (logical and physical) Memory size (local to accelerator) Latency and bandwidth to host 12

13 An HPC-specific accelerator CSX600 coprocessor for math acceleration Assists serial CPU running compute-intensive math libraries Available on add-in boards, e.g. PCI-X, PCIe Potentially integrated on the motherboard Can also be used for embedded applications Significantly accelerates certain libraries and applications Target libraries: Level 3 BLAS, LAPACK, ACML, Intel MKL Mathematical modeling tools: Mathematica, MATLAB, etc. In-house code: Using the SDK to port compute-intensive kernels ClearSpeed Advance board Dual CSX600 coprocessors Sustains 67 GFLOPS for 64-bit matrix multiply (DGEMM) calls PCI-X, PCI Express x8 Low power; typically 25 Watts 13

14 Plug-and-play Acceleration ClearSpeed host-side library CSXL Provides some of the most commonly used and important Level 3 BLAS and LAPACK functions Exploits standard shared/dynamic library mechanisms to intercept calls to L3 BLAS and LAPACK Executes calls heterogeneously across both the multicore host and the ClearSpeed accelerators simultaneously for maximum performance Compatible with ACML from AMD and MKL from Intel User & application do not need to be aware of ClearSpeed Except that the application suddenly runs faster 14
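
On Linux, this kind of interception typically rides on the dynamic linker's standard preload mechanism. A hypothetical invocation (the library path and name here are illustrative, not the documented ones) might look like:

   LD_PRELOAD=/opt/clearspeed/lib/libcsxl.so ./my_blas_application

so that an unmodified binary resolves DGEMM and friends through CSXL first, which then chooses the optimum split between host and accelerator.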

15 Programming considerations Is my main data type integer or floating-point? Is the data parallel in nature? What precision do I need? How much data needs to be local to the accelerated task? Does existing accelerator software meet my needs, or do I have to write my own? If I have to write my own code will the existing tools meet my needs for example: compiler, debugger, and simulator? 15

16 ENVISION. ACCELERATE. ARRIVE. Hardware Overview 16

17 CSX600: A chip designed for HPC ClearSpeed CSX600 Array of 96 Processor Elements; 64-bit and 32-bit floating point Single-Instruction, Multiple-Data (SIMD) 210 MHz -- key to low power 47% logic, 53% memory About 50% of the logic is FPU Hence around one quarter of the chip is floating point hardware Embedded SRAM Interface to DDR2 DRAM Inter-processor I/O ports ~ 1 TB/sec internal bandwidth 128 million transistors Approximately 10 Watts 17

18 CSX600 processor core [Block diagram: mono controller, poly controller, instruction and data caches, system and peripheral networks, control and debug, and programmable I/O to DRAM, feeding an array of PE 0 .. PE 95] Multi-Threaded Array Processing Programmed in familiar languages Hardware multi-threading Asynchronous, overlapped I/O Run-time extensible instruction set Array of 96 Processor Elements (PEs) Each has multiple execution units Including double precision floating point and integer units 18

19 CSX600 processing element (PE) [Diagram: PE n with its register file and SRAM, connected to neighbors PE n-1 and PE n+1 and to the PIO collection and distribution network] Multiple execution units 4-stage floating point adder 4-stage floating point multiplier Divide/square root unit Fixed-point MAC (16x) Integer ALU with shifter Load/store 5-port register file (3 reads, 2 writes), 128 Bytes Closely coupled 6 KB SRAM for data High bandwidth per PE DMA (PIO) Per-PE address generators (serves as hardware gather-scatter) Fast inter-PE communication path 32/64-bit IEEE 754 floating point 19

20 Advance accelerator memory hierarchy Tier 3: host DRAM, 1-32 GBytes typical; ~1 GB/s aggregate to the board Tier 2: on-board CSX DRAM, 2 banks x 0.5 GBytes = 1.0 GBytes; 5.4 GB/s to the PEs (~0.03 GB/s per PE) Tier 1: poly memory, 6 KBytes per PE; 192 PEs x 6 KB = 1.1 MB; 161 GB/s to the register files Tier 0: per-PE register memory, 128 Bytes; 192 PEs x 128 Bytes = 24 KB; swazzle path 322 GB/s; register bandwidth 725 GB/s Total: 80 GFLOPS and 1.1 TB/s, but only 25 Watts (0.42 GFLOPS per PE) 20

21 Acceleration by plug-in card Advance X620: 133 MHz PCI-X; two-thirds length (8 inch, 203 mm), full-height Advance e620: PCIe x8; half-length, full-height Both boards: dual ClearSpeed CSX600 coprocessors; sustain > 66 GFLOPS for 64-bit matrix multiply (DGEMM) calls; hardware also supports 32-bit floating point and integer calculations; 1 GB of memory on the board; drivers today for Linux (Red Hat and SLES) and Windows (XP, Server 2003); low power: 25 watts typical; multiple boards can be used together for greater performance 21

22 Host to board DMA performance The board includes a host DMA controller which can act as a bus master. All DMA transfers are at least 8-byte aligned. The host DMA engine will attempt to use the full bandwidth of the bus.

   Type of slot       Peak bandwidth   Expected DMA speed
   PCI Express x8     2,000 MB/s       Up to 1,300 MB/s
   PCI-X 133 MHz      1,066 MB/s       Up to 750 MB/s

Note: measured bandwidth is highly system-dependent; variations of up to 50% have been observed, depending on system chipset, operating system, and bus contention. 22

23 ENVISION. ACCELERATE. ARRIVE. Installing Hardware and Software 23

24 Configuration support Advance supports the following host operating systems, on IA32 (x86) and AMD64/EM64T (x86-64): SuSE Linux Enterprise Server 9; Red Hat Enterprise Linux 4; Windows XP SP2; Windows Server 2003 (preview). Supported host BLAS libraries: AMD ACML, Intel MKL, Goto, ATLAS. Supported compilers for Linux: gcc, icc, fort, pgf; for Windows XP, 2003: Visual C. For the latest support information go to the ClearSpeed support website, support.clearspeed.com. 24

25 Base software All ClearSpeed software on Linux is installed using the rpm command. The software consists of three parts: Runtime and driver software Diagnostics ClearSpeed standard libraries, CSXL & CSFFT You can download the latest versions from the ClearSpeed support website: support.clearspeed.com 25

26 Installing base software on Linux

1. Log in to the Linux machine as root and change to the directory containing the drivers package.
2. Install the runtime software, using the command:
   rpm -i csx600_m512_le-runtime-<version>.<arch>.rpm
3. Install the kernel module. For Linux 2.6, simply install the open source CSX driver using:
   /opt/clearspeed/csx600_m512_le/drivers/csx/install-csx
4. Install the board diagnostics:
   rpm -i csx600_m512_le-board_diagnostics-<version>.<arch>.rpm
5. Install the CSXL library package:
   rpm -i csx600_m512_le-csxl_<version>.<arch>.rpm

Note: For Windows, a Jungo driver will need to be installed and configured; see the installation manual for more details.

27 Confirming successful installation ClearSpeed distributes diagnostic tests to check that the board and drivers are successfully installed:

1. Open a shell window and go to an appropriate directory:
   cd /tmp
2. Set up the ClearSpeed environment variables, by typing:
   source /opt/clearspeed/csx600_m512_le/bin/bashrc
3. Run the diagnostic program, by typing the command:
   /opt/clearspeed/csx600_m512_le/bin/run_tests.pl

Some tests take several minutes to complete. Each test will write Pass or Fail to standard output. A log file test.log will be written in the current directory.

28 csreset The csreset command reinitializes an Advance board and its processors. It must be run after start-up or reboot of the system or simulator. It is also a good idea to run csreset at the start of a batch job that calls the Advance board. The csreset command can take argument flags to provide a finer level of control, including:
   -A   Specifies that all boards should be reset.
   -v   Verbose output. This shows the details about each board.
   -h   Help. This shows the full list of options.
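
For example, a batch job prologue might reset every board in the system with verbose output:

   csreset -A -v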

29 If you have problems with software installation Make sure you are logged in as super-user. As root for Linux. As administrator for Windows. If the configure or make install steps fail, check that you have the appropriate header files. Check the preconfigured header files and, if necessary, obtain the appropriate configured header file. If the system cannot access the board but the driver is installed, make sure the board is seated well. Try removing the board and reinstalling. 29

30 ENVISION. ACCELERATE. ARRIVE. Targeting ClearSpeed Advance: Exploiting Data Parallelism 30

31 Alternative approaches Three main approaches to acceleration: 1. Use an application which is already ported 2. Plug and play 3. Custom port using the SDK 31

32 Using an application which is already ported Acceleration: simply insert the ClearSpeed board. The latest list of ported applications is available on the ClearSpeed website and includes: Amber Mathematica MATLAB Star-P 32

33 Plug and play libraries: CSXL Underlying shared libraries are augmented with ClearSpeed CSXL accelerated functions Includes key functions from: LAPACK Level 3 BLAS As an example, BLAS is used by: AMD ACML Intel MKL Full list on the ClearSpeed website Application is transparently accelerated No modifications to application 33

34 Acceleration using CSXL and standard libraries [Diagram: application calls enter the CSXL intercept layer, which automatically selects the optimum path between the host library (LAPACK, BLAS, etc.) and the CSXL library (LAPACK, BLAS, etc.) on the accelerator] 34

35 Considerations for custom port of application Is the task large enough to consider acceleration? Takes time to ship data to the accelerator Accelerator can work in parallel with host Overlap computation Performance considerations Look for areas of data parallelism Overlap compute with data I/O Make full use of ClearSpeed I/O paths Analysis starts with model based on memory tiers and can be verified using performance profiling tools 35

36 Is this trip necessary? Considering I/O [Diagram: node memory and accelerator memory connected by a link of bandwidth B] Time to move N data to/from another node or an accelerator is approximately latency + N/B seconds. Because local memory bandwidth is usually higher than B, acceleration might be lost in the communication time. Estimate the break-even point for the task (note: offloading is different from accelerating, where the host continues working). [Plot: speed vs. problem size; the accelerator curve overtakes the node curve at the break-even point and pulls ahead for larger problems] 36
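
As an illustrative break-even calculation (all numbers assumed, not measured): with latency = 10 microseconds and B = 1 GB/s, shipping 100 MB of inputs and results costs roughly 0.2 s in total. Offloading a task the host could finish in 1 s is therefore only worthwhile if the accelerator completes it in well under 0.8 s; if the host keeps computing in parallel (the acceleration model on the following slides), the bar is lower, since the accelerator only needs to contribute more work than the transfer time costs.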

37 Memory bandwidth dictates performance [Diagram: multicore x86 node memory at 17 GB/s; PCI-X or PCIe link at 1 to 2 GB/s; accelerator DRAM at 5.4 GB/s; accelerator local RAM at 192 GB/s] Applications that can stage into local RAM can go 10x faster than current high-end Intel and AMD hosts. Applications residing in accelerator DRAM do not make use of the massive local memory bandwidth. GPUs face a very similar issue. 37

38 Latency and bandwidth: Simple offload model [Diagram: host sends data (latency + bandwidth cost), waits idle while the accelerator computes, then receives results (latency + bandwidth cost)] The accelerator must be quite fast for this approach to have benefit. This mental picture may stem from the early days of Intel 80x87 and Motorola 6888x math coprocessors. 38

39 Latency and bandwidth: Acceleration model [Diagram: host sends data and keeps computing while the accelerator works; results return later] The host continues working, so the accelerator needs only be fast enough to make up for the time lost to bandwidth + latency. Easiest use model: host and accelerator share the same task, like DGEMM. More flexible: host and accelerator each specialize in what they do. 39

40 Accelerator need not wait for all data before starting [Diagram: data transfer overlapped with both host and accelerator computation] The host can work while data is moved. PCI transfers might burden a single x86 core by up to 40%, but the other cores on the host continue productive work at full speed. The accelerator can also work while data is moved, so it can be slower than the host and still add performance! In practice, latency is microseconds while the accelerator task takes seconds; the latency gaps above would be microscopic if drawn to scale. 40

41 Performance considerations Look for data parallelism Fine-grained vector operations Medium-grained unrolled independent loops Coarse-grained multiple simultaneous data channels/sets Performance analysis for accelerator cards Like analysis for message-passing parallelism but with more levels of memory and communication Application porting success depends heavily on attention to memory bandwidths (Surprisingly) not so much on the bandwidth between host and accelerator card 41

42 PCI Bus ClearSpeed boards utilize either PCI-X or PCIe busses PCI-X 133 MHz: 1 GB/s peak PCIe x8: 1.6 GB/s peak Available memory on board: 1 GB of 200 MHz DDR2 SDRAM, shared by 2 CSX600 processors Must consider both the transfer rate AND the available memory If the application requires more memory, then more communication to the board is necessary Even with an infinitely fast board: Time = Total data size transferred / Bus bandwidth 42
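
For example, using the peak figures above: staging 1 GB of data onto the board over PCIe x8 takes at least 1 GB / 1.6 GB/s, or about 0.6 seconds, even if the board computed instantly. The offloaded computation must be worth at least that much host time.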

43 PCI Bus Driver performance is very machine-specific and depends on transfer size, direction, etc. [Chart: transfer size vs. transfer rate] See the Runtime User's Guide for current driver performance 43

44 On-board Memory 2-level memory hierarchy: 1 GB mono shared memory 6 KB poly memory per processing element (PE); 6 KB/PE * 96 PEs = 576 KB per CSX600 Peak bandwidth between levels: 2.7 GB/s x 2 chips = 5.4 GB/s Must consider both the transfer rate AND the available memory If the application requires more memory, then more communication to the board is necessary Even with infinitely fast PEs: Time = Total data size transferred / Bandwidth between levels Secondary considerations Burst size: 64 Bytes/PE (i.e., 8 doubles) Transfers can be smaller, but at reduced efficiency 44

45 SIMD Computing What is SIMD? Single Instruction, Multiple Data Each PE sees the same instruction stream Each PE issues load, multiply, etc., simultaneously But acts on different data per PE PARALLEL COMPUTATION ClearSpeed SIMD is enhanced by: Local memory for each PE data management is easier within poly memory; all 96 elements involved in a computation need not be adjacent in the shared memory pool PEs can be enabled/disabled not required to use all PEs always useful for handling boundaries 45

46 SIMD Array 96 PEs per CSX600 at 210 MHz, each capable of one double precision multiply-accumulate per cycle 4-cycle pipeline depth for multiply and accumulation For top performance, use operations on 4-element vectors on each PE Nearest neighbor communication swazzle path topology is a line or ring Bandwidth: 8 Bytes per cycle between register files 8*96*210 MHz = 161 GB/s Useful for fine grained communication 46

47 Good Example Kernels Dense Linear Algebra Matrix-Matrix products (DGEMM) Low memory bandwidth required = high data re-use Inner kernel: Matrix-multivector product, 96x96 matrix by 4 vectors (96x96 matrix due to 96 PEs; 4 vectors due to multiply/accumulate pipeline depth) Monte Carlo (computational finance) Embarrassingly parallel task distribution Very little data requirement Molecular Dynamics (Amber, BUDE) Large numbers of identical tasks can be found Requires small working data sets 47

48 Possible Kernel Partial Differential Equations Some are memory bandwidth limited, so not a good candidate for ClearSpeed acceleration: a small stencil implies little computation per grid point; a wide, sparse stencil implies a large active data set But some PDE simulations are good candidates: require a small grid, so can run entirely in PE memory (computational finance) have large, dense stencils large amounts of computation per grid point with a sufficiently small active data set implicit time stepping large systems of equations solved via direct methods, and direct solvers utilize dense linear algebra kernels (i.e., DGEMM) 48

49 Keys to Success Parallelism is essential Proper management of the poly memory is also critical Application must accept memory bandwidth limits PCIe or PCI-X On-board memory hierarchy SDK enables asynchronous data transfers permits efficient double buffering to manage data streams, accommodating the size limit Application must employ a small working data set less than 576 KB, distributed across 96 PEs also be aware of the 1 GB shared memory limit While developing ClearSpeed applications, use the ClearSpeed Visual Profiler to discover what is actually happening on the board! 49

50 Remember the host processor Today's multi-core hosts are very useful for managing other tasks that are not accelerated by ClearSpeed Many applications can overlap these tasks with ClearSpeed accelerated tasks Profile the host portion of your application as well, using any of a variety of tools Use the ClearSpeed Visual Profiler for CSAPI utilization 50

51 General optimization techniques Latency hiding Overlap compute with I/O Data reuse On-chip swazzle path Maximize PE usage Ensure all PEs are processing, not idle 51

52 Overlap data with compute Double-buffer Many levels of data I/O compute parallelism PE load/store overlaps PE compute PE to board memory can also overlap Board memory to host memory can also overlap Hence, if the task is compute bound, data transfer is completely hidden; if the task is I/O bound, computation is completely hidden 52

53 Data reuse Swazzle path Left or right 64-bit transfer (8 bytes) 8 bytes per cycle, so ~161 GB/s per CSX processor Can be a complete loop or linear chain Parallel with other data I/O Register-register move On-/off-chip transfers in parallel Doesn't impinge on DRAM access PE local memory to register in parallel Doesn't impinge on local memory access 53

54 Maximize PE usage Aim for 100% efficiency PEs use predicated execution PEs are disabled rather than code skipped Minimize the effects: extract common code from conditionals The mono processor can branch to skip blocks of code 54

55 Detail of I/O widths for performance analysis Each accelerator board has: 161 GB/s bandwidth, PE register to PE memory (4 bytes per cycle) 322 GB/s swazzle path bandwidth (8 bytes per cycle) 968 GB/s bandwidth, PE register to PE ALU (24 bytes per cycle) 5.4 GB/s DRAM bandwidth (32 bytes per cycle) (Aggregate bandwidth for two CSX600 chips.) [Diagram: PE n with FP Mul, FP Add, Div/Sqrt, MAC, ALU, 128-Byte register file and 6 KByte PE SRAM; swazzle links to PE n-1 and PE n+1 at 322 GB/s; PIO collection & distribution to the 1 GByte CSX DRAM at 5.4 GB/s] 55

56 ENVISION. ACCELERATE. ARRIVE. Software Development Kit 56

57 ClearSpeed SDK overview Cn compiler (C with extensions for SIMD control) Assembler Linker Simulator Debugger Graphical profiler Libraries Documentation Available for Windows XP / 2003 and Linux (Red Hat Enterprise Linux 4 and SLES 9) 57

58 Agenda 1. Introduction to C n 2. C n Libraries 3. Debugging C n 4. CSAPI: Host / Board Communication 58

59 ENVISION. ACCELERATE. ARRIVE. Introduction to C n 59

60 Software Development The CSX architecture is simpler to program: Single program for serial and parallel operations Architecture and compiler co-designed Instruction and data caches Simple, regular 32-bit instruction set Large, flexible register file Fast thread context switching Built-in debug support Same development process as traditional architectures: compile, assemble, link C n is a simple parallel extension of C 60

61 Cn C with vector extensions for CSX New keywords: mono and poly storage qualifiers mono is a serial (single) variable poly is a parallel (vector) variable Mono variables live in the 1 GB DRAM Poly variables live in the 6 KB SRAM of each PE 61

62 Cn differences from C New data type multiplicity modifiers:
mono: denotes a serial variable resident in mono memory; mono is the default multiplicity
poly: denotes a parallel/vector variable resident in poly memory, local to each PE
The modifiers apply to pointers, doubly so:
   mono int * poly foo;            // foo is a pointer in poly memory to an int in mono memory
   poly int * mono bar;            // bar is a pointer in mono memory to an int in poly memory
   int * poly * mono good_grief;   // as you would expect
Pointer sizes:
   mono int *   4 bytes (32-bit addressable space, 512 MB)
   poly int *   2 bytes (16-bit addressable space, 6 KB)

63 Cn differences from C Execution context alters branch/jump behavior:
In mono context, jumps occur as in a traditional architecture.
In poly context, PEs are enabled/disabled rather than branching:
   if (penum > 32) {
       // false PEs disabled on the true branch
   } else {
       // false PEs re-enabled, the other PEs disabled
   }
Both branches are executed on all PEs.
break, continue: the selected PEs are disabled until the end of the enclosing scope on all PEs.
return: the selected PEs are disabled until all PEs return, or the end of scope.

64 Porting C to Cn (Example 1)
C code:
   int i, j;
   for( i=0; i<96; i++ ) {
       j = 2*i;
   }
Similar Cn code:
   poly int i, j;
   i = get_penum();   // i=0 on PE0, i=1 on PE1, etc.
   j = 2*i;           // j=0 on PE0, j=2 on PE1, etc.

65 Porting C to Cn (Example 2)
C code:
   int i;
   for( i=0; i<n; i++ ) { }
Similar Cn code:
   poly int me, i;
   mono int npes;
   me = get_penum();        // me=0 on PE0, me=1 on PE1, etc.
   npes = get_num_pes();    // npes = 96
   // i = 0,96,192,... on PE0; 1,97,193,... on PE1; etc.
   for( i=me; i<n; i+=npes ) { }

66 Simple Cn example
   void foo (double *A, double *B, int n) {
       // Assume n is divisible by 24*96.
       poly double mat[4] = {1.,2.,3.,4.};
       poly double a[24];
       poly double b[4] = {0.,0.,0.,0.};
       int i;
       while (n) {
           memcpym2p (a, A+24*get_penum(), 24*sizeof(double));
           A += 24*96;
           for (i=0; i<24; i+=2) {   // step by 2 so a[i+1] stays in bounds
               b[0] += a[i]*mat[0] + a[i+1]*mat[1];
               b[1] += a[i+1]*mat[0] + a[i]*mat[1];
               b[2] += a[i]*mat[2] - a[i+1]*mat[3];
               b[3] += a[i+1]*mat[2] - a[i]*mat[3];
           }
           n -= 24*96;
       }
       memcpyp2m (B+4*get_penum(), b, 4*sizeof(double));
       return;
   }

67 ENVISION. ACCELERATE. ARRIVE. C n Libraries 67

68 Runtime libraries Cn supports the standard C runtime, including: malloc printf sqrt memcpy Cn extensions include: sqrtp memcpym2p / memcpyp2m get_penum swazzle any / all 68
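
A minimal sketch tying these together (prototypes assumed; the exact signatures are in the Cn library manual): each PE computes a value in parallel, and any collapses a poly condition so the mono side can act on it.

   void check(void) {
       poly int pe = get_penum();   // a different value, 0..95, on each PE
       poly int sq = pe * pe;       // computed simultaneously on all PEs
       if (any(sq > 8000)) {        // mono result: true if any PE satisfies it
           printf("at least one PE exceeded the threshold\n");
       }
   }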

69 Asynchronous I/O For the most efficient use of limited PE memory, overlap data transfers between mono memory and poly memory: async_memcpym2p / async_memcpyp2m sem_sig / sem_wait For greatest efficiency, the async_memcpy routines bypass the data cache, so coherency must be maintained: dcache_flush / dcache_flush_address 69

70 Asynchronous I/O example
   void foo(double *A, double *B, int n) {
       // Assume n is divisible by 24*96.
       poly unsigned short penum = get_penum();
       poly double mat[4] = {1.,2.,3.,4.};
       poly double a_front[12], a_back[12];
       poly double b[4] = {0.,0.,0.,0.};
       int i;
       async_memcpym2p(19, a_front, A+12*penum, 12*sizeof(double)); A += 12*96;
       n -= 24*96;
       while (n) {
           // Prefetch the next block while computing on the current one.
           async_memcpym2p(17, a_back, A+12*penum, 12*sizeof(double)); A += 12*96;
           sem_wait(19);   // wait for a_front to arrive
           for (i=0; i<12; i+=2) {
               b[0] += a_front[i]*mat[0] + a_front[i+1]*mat[1];
               b[1] += a_front[i+1]*mat[0] + a_front[i]*mat[1];
               b[2] += a_front[i]*mat[2] - a_front[i+1]*mat[3];
               b[3] += a_front[i+1]*mat[2] - a_front[i]*mat[3];
           }
           n -= 12*96;
           async_memcpym2p(19, a_front, A+12*penum, 12*sizeof(double)); A += 12*96;
           sem_wait(17);   // wait for a_back to arrive
           for (i=0; i<12; i+=2) {
               b[0] += a_back[i]*mat[0] + a_back[i+1]*mat[1];
               b[1] += a_back[i+1]*mat[0] + a_back[i]*mat[1];
               b[2] += a_back[i]*mat[2] - a_back[i+1]*mat[3];
               b[3] += a_back[i+1]*mat[2] - a_back[i]*mat[3];
           }
           n -= 12*96;
       }
       sem_wait(19);   // drain the final prefetched buffer outside the loop
       for (i=0; i<12; i+=2) {
           b[0] += a_front[i]*mat[0] + a_front[i+1]*mat[1];
           b[1] += a_front[i+1]*mat[0] + a_front[i]*mat[1];
           b[2] += a_front[i]*mat[2] - a_front[i+1]*mat[3];
           b[3] += a_front[i+1]*mat[2] - a_front[i]*mat[3];
       }
       memcpyp2m(B+4*penum, b, 4*sizeof(double));
   }

71 ENVISION. ACCELERATE. ARRIVE. C n Pointers 71

72 Cn mono and poly pointers Using mono and poly with pointers:
   mono int * mono mpmi;   // mono pointer to mono int
   poly int * mono mppi;   // mono pointer to poly int
   mono int * poly ppmi;   // poly pointer to mono int
   poly int * poly pppi;   // poly pointer to poly int
The most commonly used is a mono pointer to poly: poly <type> * mono <variable_name>;

73 Cn mono and poly pointers Mono pointer to mono int: mono int * mono mpmi; [Diagram: both the pointer and the int it references live in mono memory] 73

74 Cn mono and poly pointers Mono pointer to poly int: poly int * mono mppi; [Diagram: the pointer lives in mono memory and references an int at the same address in each PE's poly memory] Note: points to the same location in each PE 74

75 Cn mono and poly pointers Poly pointer to poly int: poly int * poly pppi; [Diagram: each PE holds its own pointer in poly memory, referencing an int in that PE's poly memory] Note: the pointer is stored in the same location in each PE 75

76 Cn mono and poly pointers Poly pointer to mono int: mono int * poly ppmi; [Diagram: each PE holds its own pointer in poly memory, each referencing an int in mono memory] Note: the pointer is stored in the same location in each PE 76

77 ENVISION. ACCELERATE. ARRIVE. Conditional Expressions 77

78 Conditional Expressions: mono-if Conditions based on mono expressions The expression has the same value on all PEs The code block is selected according to the expression and a branch instruction is executed
   mono int i, j;
   i = j = 1;
   if( i == j ) {
       // this block executed on all PEs
   } else {
       // this block branched over on all PEs
   }

79 Conditional Expressions: poly-if Conditions based on poly expressions The expression may have different values on different PEs But the SIMD model implies all PEs execute the same instruction simultaneously All branches are executed on all PEs, with a PE enabled only if the conditional expression is true there (like predicated instructions)
   poly int i;
   i = get_penum();
   if( i < 48 ) {
       // PEs 0..47 execute these instructions
       // on PEs 48..95 the instructions are issued but ignored
   } else {
       // on PEs 0..47 the instructions are issued but ignored
       // PEs 48..95 execute these instructions
   }

80 Conditional Expressions: poly-while While loops based on poly expressions The loop continues execution until the condition is false on all PEs PEs are disabled one by one until the while condition is false on all PEs count keeps track of the total number of iterations (95 in this case: PE95 starts with me = 95 and needs 95 decrements)
   mono int count = 0;
   poly int me;
   me = get_penum();
   while( me > 0 ) {
       --me;
       ++count;
   }

81 Other variations between C and C n Labeled break and continue statements No switch statement using poly variables (use multiple if statements) No goto statement in poly context 81

82 ENVISION. ACCELERATE. ARRIVE. Moving Data 82

83 Data flow Board and host communicate via Linux kernel module or Windows driver Create a handle and establish the connection 83

84 Data flow Register intent of using the first processor on the card Load the code onto the enabled processor 84

85 Data flow Transfer data from host to board Semaphores synchronize transfers between host and board 85

86 Data flow Run the code on the enabled processor Host can continue with other work 86

87 Data flow Send results back to host Halt board program and clean up 87

88 Implicit broadcast from mono to poly Implicit broadcast from mono to poly happens by assignment:
   mono int m = 7;
   poly int p;
   p = m;   // implicit broadcast to all PEs
Assigning poly to mono is not permitted:
   mono int m;
   poly int p = get_penum();
   m = p;   // NO! m would receive a different value from each PE

89 Explicit data movement mono to poly memcpym2p(); async_memcpym2p() Memory copy of n bytes from mono to poly The source is a poly pointer to mono memory, which can have a different value for each PE The destination is a mono pointer to poly memory, that is, the destination address is the same for all PEs [Diagram: per-PE source regions in mono memory are copied to the same destination address on each of PE0, PE1, PE2, ..., PE95] 89

90 Explicit data movement poly to mono memcpyp2m(); async_memcpyp2m() Memory copy of n bytes from poly to mono The source is a mono pointer to poly memory; therefore the source address is the same for every PE The destination is a poly pointer to mono memory, which can have a different value for each PE [Diagram: the same source address on each of PE0, PE1, PE2, ..., PE95 is copied to per-PE destination regions in mono memory] 90

91 Explicit data movement asynchronous async_memcpym2p(); async_memcpyp2m() Asynchronous memory copy of n bytes from mono to poly or from poly to mono Computation continues during the data copy The mono memory data cache is NOT flushed There are restrictions on the alignment of data Use semaphores to wait for completion of the copy Much higher bandwidth than the synchronous versions
   dcache_flush();
   async_memcpym2p( semaphore, ... );
   // computation continues
   sem_wait( semaphore );
   // use data that has been transferred from mono memory

92 Explicit data movement swazzle Register-to-register transfer between neighboring PEs [Diagram: the register file of PE n connects to PE n-1 and PE n+1 via the swazzle path, alongside the ALU, status flags, memory and enable stack] 92

93 Swazzle operations Assembly language versions operate directly on the register file Cn versions operate on data and include implicit data movement from memory to registers Variants:
   swazzle_up( poly int src );     // copy to higher numbered PE
   swazzle_down( poly int src );   // copy to lower numbered PE
   swazzle_up_generic( poly void *dst, poly void *src, unsigned int size );
   swazzle_down_generic( ... );
Similar swazzles operate on other data types There are functions to set the data copied into the ends of the swazzle chain 93
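
A sketch of a nearest-neighbor shift built on these (the return convention assumed here, each PE receiving its lower neighbor's value from swazzle_up, should be checked against the Cn library manual):

   poly int mine = get_penum();
   // send 'mine' to PE n+1; assume PE n receives PE n-1's value
   poly int from_below = swazzle_up(mine);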

94 Data movement bandwidths per CSX600 Mono memory to poly memory 2.7 GB/s aggregate over 96 PEs Poly memory to registers 840 MB/s per PE, 81 GB/s aggregate Swazzle path bandwidth 1680 MB/s per PE, 161 GB/s aggregate Total bandwidth for Advance board (2 CSX600 processors) ~0.5 TB/s 94

95 DMA performance The Advance board has a host DMA controller which can act as a PCI bus master All DMA transfers are at least 8-byte aligned The host DMA engine will attempt to use the entire bus bandwidth [Chart: ClearSpeed Advance DMA performance, MB/s vs. transfer size (MB), for e620 read/write and X620 read/write averages] 95

96 ENVISION. ACCELERATE. ARRIVE. CSAPI Host - Board communication 96

97 Host-Board interaction basics The basic model for interaction between the host and the card is very simple: The ClearSpeed board can signal and wait for semaphores; it cannot initiate data transactions with the host. The host pushes data to and pulls data from the board. The host can also signal and receive semaphores. 97

98 Connecting to the board A host application needs to perform the following sequence to launch a process on the board:
1. Create a CSAPI handle: CSAPI_new
2. Establish a connection with the board: CSAPI_connect
3. Register the host application with the driver: CSAPI_register_application
4. Load the CSX application on the desired chip: CSAPI_load
5. Run the CSX application on the desired chip: CSAPI_run

99 Interacting with the board Get the board memory address of a known symbol: CSAPI_get_symbol_value This must be done after the application is loaded, if the dynamic load capability is to be used. Write/read data at a retrieved memory address: CSAPI_write_mono_memory CSAPI_read_mono_memory Asynchronous variants of these routines also exist A process does not need to be running for these operations to succeed, but it does need to be loaded; they should not be performed DURING process termination. Managing semaphores: CSAPI_allocate_shared_semaphore declares a semaphore for use on both host and card CSAPI_semaphore_wait CSAPI_semaphore_signal 99

100 Cleaning up Process termination: CSAPI_wait_on_terminate CSAPI_get_return_value Clean-up: CSAPI_delete See the CSX600 Runtime Software User Guide for more details, including: managing multiple processes on the board/chip at once, managing board control registers, board reset, managing multi-threaded CSX applications, board memory allocation, managing multiple boards/chips, and error handling 100
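
Putting the whole CSAPI life cycle together, a host-side sketch might look like the following. The function names are the ones on these slides, but every signature, type and argument order here is an assumption; consult the CSX600 Runtime Software User Guide for the real prototypes and error handling.

   /* Hypothetical host program outline, not the documented API. */
   CSAPI *h = CSAPI_new();                          /* create a handle */
   CSAPI_connect(h);                                /* connect to the board */
   CSAPI_register_application(h);                   /* register with the driver */
   CSAPI_load(h, 0, "kernel.csx");                  /* load onto chip 0 */
   CSAPI_run(h, 0);                                 /* start the CSX process */

   unsigned addr;                                   /* board address of a symbol */
   CSAPI_get_symbol_value(h, "input_buf", &addr);   /* must follow CSAPI_load */
   CSAPI_write_mono_memory(h, addr, nbytes, host_data);   /* push inputs */

   CSAPI_semaphore_signal(h, go_sem);               /* tell the card data is ready */
   CSAPI_semaphore_wait(h, done_sem);               /* wait until results exist */
   CSAPI_read_mono_memory(h, addr, nbytes, host_data);    /* pull results */

   CSAPI_wait_on_terminate(h);                      /* wait for the CSX process */
   CSAPI_delete(h);                                 /* clean up the handle */

Here go_sem and done_sem stand for semaphores previously created with CSAPI_allocate_shared_semaphore, and host_data/nbytes are the application's own buffer and size.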

101 ENVISION. ACCELERATE. ARRIVE. Debugging C n 101

102 csgdb csgdb is a port of the open source gdb debugger full symbolic debugging of mono/poly variables full gdb breakpoint support step through Cn or assembly view mono and poly registers view PE enabled state also accessible via DDD DDD allows graphical data visualization 102

103 Debug control To enable debugging:
   export CS_CSAPI_DEBUGGER=1
initializes the debug interface within the host application.
   export CS_CSAPI_DEBUGGER_ATTACH=1
makes the host application write a port number to stdout and wait for <Return/Enter> to be pressed, so that csgdb can be manually attached to the connected board process.
Launch the host application. This can be done with or without a debugger.
Launch csgdb in a new shell:
   csgdb <csx_file_name> <port_number>
There is no need to connect, as the host application did this already; set the desired breakpoints and run.
Note that the host is currently blocked waiting for <Return/Enter>, so the card process may also be blocked waiting for the host. Press return in the host shell for the host and card applications to proceed.

104 csgdb Debugger (shown with the ddd front-end) [Screenshot callouts: on-chip poly array contents displayed; real-time plot of the contents of PE memory; Cn source-level breakpoints, watchpoints, single step, etc.; register contents; disassembly with breakpoints, watchpoints, single step, etc.] 104

105 csgdb Command-line example
   % cscn foo.cn -g -o foo.csx
   % csgdb ./foo.csx
   (gdb) connect
   0x in FRAME_BEGIN_MONO ()
   (gdb) break 109
   Breakpoint 1 at 0x800154c0: file foo.cn, line 109.
   (gdb) run
   Starting program: /home/kris/my_app/foo.csx
   Breakpoint 1, main () at foo.cn:109
   (gdb) next
   110 y = MINY + (get_penum() * STEPY);
   (gdb) print y
   $1 = {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1}

106 ENVISION. ACCELERATE. ARRIVE. ClearSpeed Visual Profiler Explaining Performance 106

107 ClearSpeed Visual Profiler (csvprof) Host tracing: traces CSAPI functions; the user can infer overlapping host/board utilization; locate hot-spots Board tracing: traces board-side functions without instrumentation; locate hot-spots Board hardware utilization: displays the activity of CSX functional units, including load/store, PIO, SIMD microcode, instruction cache, data cache, and thread; cycle accurate; view corresponding source Unified GUI 107

108 Detailed profiling is essential for accelerator tuning HOST CODE PROFILING Visually inspect multiple host threads. Time specific code sections. Check overlap of host threads. HOST/BOARD INTERACTION Infer cause and effect. Measure transfer bandwidth. Check overlap of host and board compute. ACCELERATOR PIPE View instruction issue. Visualize overlap of executing instructions. Get cycle-accurate timing. Remove instruction-level performance bottlenecks. CSX600 SYSTEM Trace at system level. Inspect overlap of compute and I/O. View cache utilization. Graph performance. [Diagram: host CPUs, Advance accelerator boards, CSX600 chips and their pipelines, annotated with these four profiling views] 108

109 csvprof: Host tracing Dynamic loading of the CSAPI trace implementation is triggered with an environment variable:
   export CS_CSAPI_TRACE=1
(recall the similar enabling of debug support: export CS_CSAPI_DEBUGGER=1)
Specify the tracing format:
   export CS_CSAPI_TRACE_CSVPROF=1
(currently this is the only implementation, but others may be added in the future)
Specify the output file for the trace:
   export CS_CSAPI_TRACE_CSVPROF_FILE=mytrace.cst
(default filename = csvprof_data.cst)
The output file is written during CSAPI_delete.

110 csvprof: Host-Board interaction 110

111 csvprof: Host code profile Linpack benchmark 111

112 csvprof: CSX600 system profile 112

113 csvprof: Accelerator pipeline profile 113

114 csvprof: Instruction pipeline stalls 114

115 csvprof: Advance board tracing Enabled using the debugger, csgdb Can be used interactively or through a gdb script Can select events to profile, or all events Requires buffer allocation on the card; today, this is done statically One could use CSAPI to allocate the buffer, but the developer must then get the buffer's location and size to the user to be entered into csgdb This is easy if running on only one chip: place the buffer in the other chip's memory Explicit dump to generate the trace file Can control the type of data to be dumped 115

116 csvprof: Sample gdb script
   % cat ./csgdb_trace.gdb
   connect
   load ./foo.csx
   cstrace buffer 0x x
   cstrace event all on
   tbreak test_me
   continue
   cstrace enable
   continue
   cstrace dump foo.cst
   cstrace dump branch dgemm_test4_branch.cst
   quit
   % csgdb --command=./csgdb_trace.gdb

117 ENVISION. ACCELERATE. ARRIVE. Tuning Tips 117

118 Pipelined arithmetic Four-stage floating-point pipeline Use vector types, vector intrinsic functions, and the vector math library for high efficiency
   DVECTOR a, b, c;
   poly double x[n];
   a = *((DVECTOR *)&x[0]);
   b = *((DVECTOR *)&x[4]);
   c = cs_sqrt( cs_vadd( a, b ) );

119 Poly conditionals When possible, remove common subexpressions from poly if-blocks to reduce the amount of replicated work. It may even pay to compute and throw away results if that leads to fewer poly conditional blocks. A poly if-block uses predicated instructions, not a branch, so it is cheap if not many additional instructions are executed. A sketch follows. 119
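
A sketch of the hoisting idea (declarations assumed; p, x, a, b, c are poly doubles):

   /* Unhoisted: the multiply is issued twice, once per predicated branch. */
   if (p > 0.) { x = a*b + c; } else { x = a*b - c; }

   /* Hoisted: the common subexpression runs once, with all PEs enabled. */
   poly double t = a*b;
   if (p > 0.) { x = t + c; } else { x = t - c; }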

120 Poly loop counters Loops with poly counters are more expensive than those with mono counters Use mono loop counters if possible 120
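
A minimal contrast (sketch, not measured):

   mono int i;   /* loop control runs once, on the mono side */
   for (i = 0; i < 12; i++) { /* poly work */ }

   poly int j;   /* per-PE counters must be maintained and tested on every PE: more costly */
   for (j = 0; j < 12; j++) { /* poly work */ }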

121 Arrays Pointer incrementing is more efficient than using array index notation Poly addresses require 16 bits Use short for poly pointer increments This avoids conversion of int to short 121

122 Data transfer Synchronous functions are completely general but flush the data cache on each transfer: memcpyp2m() memcpym2p() Asynchronous functions maximize performance: they do not flush the cache, have data size and alignment restrictions, and require the use of a wait semaphore: async_memcpyp2m(); sem_wait() async_memcpym2p(); sem_wait() Large data blocks are more efficient than small blocks: host to board, board to host, mono to poly, and poly to mono 122

123 ENVISION. ACCELERATE. ARRIVE. Application Examples 123

124 Math function speed comparison [Chart: 64-bit math function throughput (billions of operations per second) for Sqrt, InvSqrt, Exp, Ln, Cos, Sin, SinCos and Inv on a dual-core Opteron, a 3 GHz dual-core Woodcrest, and the ClearSpeed Advance card] Typical speedup of ~8X over the fastest x86 processors, because math functions stay in local memory on the card 124

125 Nucleic Acid Builder (NAB) Newton-Raphson refinement is now possible; large DGEMM calls from computed second derivatives will be in AMBER. A substantial speedup was obtained for this operation in three hours of programmer effort. Enables accurate computation of entropy and Gibbs Free Energy for the first time. AMBER itself has cases that ClearSpeed accelerates by 3.2x to 9x, with 5x to 17x possible once we exploit the symmetry of atom-atom interactions 125

126 AMBER molecular modeling with ClearSpeed AMBER Generalized Born models, run time in minutes [Table: Gen Born 1: 24.6 min on Advance X620, 3.4x speedup over host; Gen Born 2: 23.5 min, 3.6x speedup; third model: 4.0 min on Advance X620] 126

127 Monte Carlo methods exploit high local bandwidth Monte Carlo methods are ideal for ClearSpeed acceleration: High regularity and locality of the algorithm Very high compute to I/O ratio Very good scalability to high degrees of parallelism Needs 64-bit Excellent results for parallelization Achieving 10x performance per Advance card vs. highly optimized code on the fastest x86 CPUs available today Maintains high precision required by the computations True 64-bit IEEE 754 floating point throughout 25 W per card typical when card is computing ClearSpeed has a Monte Carlo example code, available in source form for evaluation 127

128 Monte Carlo applications scale very well No acceleration: 200M samples, 79 seconds 1 Advance board: 200M samples, 3.6 seconds 5 Advance boards: 200M samples, 0.7 seconds [Chart: European option pricing model speedup vs. number of ClearSpeed Advance boards] 128

129 Why do Monte Carlo applications need 64-bit? Accuracy increases as the square root of the number of trials, so five-decimal accuracy takes 10 billion trials. But when you sum many similar values, you start to lose the significant digits: 64-bit summation is needed to get even a single-precision result! Single precision: 1.0 x 10^8 + 1 = 1.0 x 10^8 (the contribution vanishes) Double precision: 1.0 x 10^8 + 1 = 100,000,001 129
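
A small host-side C program demonstrates the effect; the specific values here (adding 1 to 10^8) are an assumption about what the slide's garbled example showed, but the rounding behavior is exact IEEE 754:

   #include <stdio.h>

   int main(void) {
       float  s32 = 1.0e8f;   /* exactly representable in single precision */
       double s64 = 1.0e8;
       s32 += 1.0f;           /* the ulp at 1e8 is 8 in single: the 1 is lost */
       s64 += 1.0;            /* double precision keeps it */
       printf("float:  %.1f\n", s32);   /* prints 100000000.0 */
       printf("double: %.1f\n", s64);   /* prints 100000001.0 */
       return 0;
   }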

130 ENVISION. ACCELERATE. ARRIVE. Help and Support 130

131 Installed documentation The docs directory contains: CSXL user guide, runtime user guide, csvprof (Visual Profiler) overview and examples, SDK getting started, gdb manual, instruction set manual, Cn library manual, reference manual, and release notes. There is also an examples directory. 131

132 ClearSpeed online General information, news, etc. Company website Report a problem, find answers, etc. Support website support.clearspeed.com Support website has: Documentation, user guides, reference manuals Solutions knowledge base Software downloads Log a case 132

133 Join the ClearSpeed Developer Program! Designed to support the leading-edge community of developers using accelerators Membership is free and has the following benefits: Access to the ClearSpeed Developer website ClearSpeed Developer Community on-line forum Invitation to participate in ClearSpeed Developer & User Community meetings and events Repository to share and access demonstrations and sample codes within the ClearSpeed Developer Community Technical updates, tips and tricks from the gurus at ClearSpeed and the Developer Community And more, including opportunities to preview new software releases and developer discount programs. Leverage the expertise of developers worldwide. Ask a question, or share your knowledge. Register now at developer.clearspeed.com! 133



More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

BASIC COMPUTER ORGANIZATION. Operating System Concepts 8 th Edition

BASIC COMPUTER ORGANIZATION. Operating System Concepts 8 th Edition BASIC COMPUTER ORGANIZATION Silberschatz, Galvin and Gagne 2009 Topics CPU Structure Registers Memory Hierarchy (L1/L2/L3/RAM) Machine Language Assembly Language Running Process 3.2 Silberschatz, Galvin

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance

More information

Experts in Application Acceleration Synective Labs AB

Experts in Application Acceleration Synective Labs AB Experts in Application Acceleration 1 2009 Synective Labs AB Magnus Peterson Synective Labs Synective Labs quick facts Expert company within software acceleration Based in Sweden with offices in Gothenburg

More information

The Bifrost GPU architecture and the ARM Mali-G71 GPU

The Bifrost GPU architecture and the ARM Mali-G71 GPU The Bifrost GPU architecture and the ARM Mali-G71 GPU Jem Davies ARM Fellow and VP of Technology Hot Chips 28 Aug 2016 Introduction to ARM Soft IP ARM licenses Soft IP cores (amongst other things) to our

More information

Kampala August, Agner Fog

Kampala August, Agner Fog Advanced microprocessor optimization Kampala August, 2007 Agner Fog www.agner.org Agenda Intel and AMD microprocessors Out Of Order execution Branch prediction Platform, 32 or 64 bits Choice of compiler

More information

Latches. IT 3123 Hardware and Software Concepts. Registers. The Little Man has Registers. Data Registers. Program Counter

Latches. IT 3123 Hardware and Software Concepts. Registers. The Little Man has Registers. Data Registers. Program Counter IT 3123 Hardware and Software Concepts Notice: This session is being recorded. CPU and Memory June 11 Copyright 2005 by Bob Brown Latches Can store one bit of data Can be ganged together to store more

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA

Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Optimization Overview GPU architecture Kernel optimization Memory optimization Latency optimization Instruction optimization CPU-GPU

More information

Addressing the Memory Wall

Addressing the Memory Wall Lecture 26: Addressing the Memory Wall Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Cage the Elephant Back Against the Wall (Cage the Elephant) This song is for the

More information

The Memory Component

The Memory Component The Computer Memory Chapter 6 forms the first of a two chapter sequence on computer memory. Topics for this chapter include. 1. A functional description of primary computer memory, sometimes called by

More information

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

Integrating DMA capabilities into BLIS for on-chip data movement. Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali

Integrating DMA capabilities into BLIS for on-chip data movement. Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali Integrating DMA capabilities into BLIS for on-chip data movement Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali 5 Generations of TI Multicore Processors Keystone architecture Lowers

More information

CS 426 Parallel Computing. Parallel Computing Platforms

CS 426 Parallel Computing. Parallel Computing Platforms CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:

More information

Intra-MIC MPI Communication using MVAPICH2: Early Experience

Intra-MIC MPI Communication using MVAPICH2: Early Experience Intra-MIC MPI Communication using MVAPICH: Early Experience Sreeram Potluri, Karen Tomko, Devendar Bureddy, and Dhabaleswar K. Panda Department of Computer Science and Engineering Ohio State University

More information

The S6000 Family of Processors

The S6000 Family of Processors The S6000 Family of Processors Today s Design Challenges The advent of software configurable processors In recent years, the widespread adoption of digital technologies has revolutionized the way in which

More information

Misc. Third Generation Batch Multiprogramming. Fourth Generation Time Sharing. Last Time Evolution of OSs

Misc. Third Generation Batch Multiprogramming. Fourth Generation Time Sharing. Last Time Evolution of OSs Third Generation Batch Multiprogramming Misc. Problem: but I/O still expensive; can happen in middle of job Idea: have a pool of ready jobs in memory, switch to one when another needs I/O When one job

More information

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions.

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions. Schedule CUDA Digging further into the programming manual Application Programming Interface (API) text only part, sorry Image utilities (simple CUDA examples) Performace considerations Matrix multiplication

More information

Processors, Performance, and Profiling

Processors, Performance, and Profiling Processors, Performance, and Profiling Architecture 101: 5-Stage Pipeline Fetch Decode Execute Memory Write-Back Registers PC FP ALU Memory Architecture 101 1. Fetch instruction from memory. 2. Decode

More information

Introduction to Operating Systems. Chapter Chapter

Introduction to Operating Systems. Chapter Chapter Introduction to Operating Systems Chapter 1 1.3 Chapter 1.5 1.9 Learning Outcomes High-level understand what is an operating system and the role it plays A high-level understanding of the structure of

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008 SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008 What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem

More information

ARM Processors for Embedded Applications

ARM Processors for Embedded Applications ARM Processors for Embedded Applications Roadmap for ARM Processors ARM Architecture Basics ARM Families AMBA Architecture 1 Current ARM Core Families ARM7: Hard cores and Soft cores Cache with MPU or

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Intel Enterprise Processors Technology

Intel Enterprise Processors Technology Enterprise Processors Technology Kosuke Hirano Enterprise Platforms Group March 20, 2002 1 Agenda Architecture in Enterprise Xeon Processor MP Next Generation Itanium Processor Interconnect Technology

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

EXPLORING PARALLEL PROCESSING OPPORTUNITIES IN AERMOD. George Delic * HiPERiSM Consulting, LLC, Durham, NC, USA

EXPLORING PARALLEL PROCESSING OPPORTUNITIES IN AERMOD. George Delic * HiPERiSM Consulting, LLC, Durham, NC, USA EXPLORING PARALLEL PROCESSING OPPORTUNITIES IN AERMOD George Delic * HiPERiSM Consulting, LLC, Durham, NC, USA 1. INTRODUCTION HiPERiSM Consulting, LLC, has a mission to develop (or enhance) software and

More information

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

Intel C++ Compiler Professional Edition 11.0 for Linux* In-Depth

Intel C++ Compiler Professional Edition 11.0 for Linux* In-Depth Intel C++ Compiler Professional Edition 11.0 for Linux* In-Depth Contents Intel C++ Compiler Professional Edition for Linux*...3 Intel C++ Compiler Professional Edition Components:...3 Features...3 New

More information

An Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection

An Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection An Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection Hiroyuki Usui, Jun Tanabe, Toru Sano, Hui Xu, and Takashi Miyamori Toshiba Corporation, Kawasaki, Japan Copyright 2013,

More information

Introduction to Operating. Chapter Chapter

Introduction to Operating. Chapter Chapter Introduction to Operating Systems Chapter 1 1.3 Chapter 1.5 1.9 Learning Outcomes High-level understand what is an operating system and the role it plays A high-level understanding of the structure of

More information

Intel C++ Compiler Professional Edition 11.1 for Mac OS* X. In-Depth

Intel C++ Compiler Professional Edition 11.1 for Mac OS* X. In-Depth Intel C++ Compiler Professional Edition 11.1 for Mac OS* X In-Depth Contents Intel C++ Compiler Professional Edition 11.1 for Mac OS* X. 3 Intel C++ Compiler Professional Edition 11.1 Components:...3 Features...3

More information

Cell SDK and Best Practices

Cell SDK and Best Practices Cell SDK and Best Practices Stefan Lutz Florian Braune Hardware-Software-Co-Design Universität Erlangen-Nürnberg siflbrau@mb.stud.uni-erlangen.de Stefan.b.lutz@mb.stud.uni-erlangen.de 1 Overview - Introduction

More information

A Multi-Tiered Optimization Framework for Heterogeneous Computing

A Multi-Tiered Optimization Framework for Heterogeneous Computing A Multi-Tiered Optimization Framework for Heterogeneous Computing IEEE HPEC 2014 Alan George Professor of ECE University of Florida Herman Lam Assoc. Professor of ECE University of Florida Andrew Milluzzi

More information

IBM Cell Processor. Gilbert Hendry Mark Kretschmann

IBM Cell Processor. Gilbert Hendry Mark Kretschmann IBM Cell Processor Gilbert Hendry Mark Kretschmann Architectural components Architectural security Programming Models Compiler Applications Performance Power and Cost Conclusion Outline Cell Architecture:

More information

Architecture without explicit locks for logic simulation on SIMD machines

Architecture without explicit locks for logic simulation on SIMD machines Architecture without explicit locks for logic on machines M. Chimeh Department of Computer Science University of Glasgow UKMAC, 2016 Contents 1 2 3 4 5 6 The Using models to replicate the behaviour of

More information

OpenACC Course. Office Hour #2 Q&A

OpenACC Course. Office Hour #2 Q&A OpenACC Course Office Hour #2 Q&A Q1: How many threads does each GPU core have? A: GPU cores execute arithmetic instructions. Each core can execute one single precision floating point instruction per cycle

More information

Introduction to GPU computing

Introduction to GPU computing Introduction to GPU computing Nagasaki Advanced Computing Center Nagasaki, Japan The GPU evolution The Graphic Processing Unit (GPU) is a processor that was specialized for processing graphics. The GPU

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Trends in the Infrastructure of Computing

Trends in the Infrastructure of Computing Trends in the Infrastructure of Computing CSCE 9: Computing in the Modern World Dr. Jason D. Bakos My Questions How do computer processors work? Why do computer processors get faster over time? How much

More information

Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop

Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop http://cscads.rice.edu/ Discussion and Feedback CScADS Autotuning 07 Top Priority Questions for Discussion

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2

More information

This section covers the MIPS instruction set.

This section covers the MIPS instruction set. This section covers the MIPS instruction set. 1 + I am going to break down the instructions into two types. + a machine instruction which is directly defined in the MIPS architecture and has a one to one

More information

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION WHAT YOU WILL LEARN An iterative method to optimize your GPU code Some common bottlenecks to look out for Performance diagnostics with NVIDIA Nsight

More information

X-Stream II. Processing Method. Operating System. Hardware Performance. Elements of Processing Speed TECHNICAL BRIEF

X-Stream II. Processing Method. Operating System. Hardware Performance. Elements of Processing Speed TECHNICAL BRIEF X-Stream II Peter J. Pupalaikis Principal Technologist September 2, 2010 Summary This paper explains how X- Stream II techonlogy improves the speed and responsiveness of LeCroy oscilloscopes. TECHNICAL

More information

Laboratory Exercise 3 Comparative Analysis of Hardware and Emulation Forms of Signed 32-Bit Multiplication

Laboratory Exercise 3 Comparative Analysis of Hardware and Emulation Forms of Signed 32-Bit Multiplication Laboratory Exercise 3 Comparative Analysis of Hardware and Emulation Forms of Signed 32-Bit Multiplication Introduction All processors offer some form of instructions to add, subtract, and manipulate data.

More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information