1 ENVISION. ACCELERATE. ARRIVE. ClearSpeed Technical Training December 2007 Overview 1
2 Presenters Ronald Langhi Technical Marketing Manager Brian Sumner Senior Engineer 2
3 ClearSpeed Technology: Company Background Founded in 2001 Focused on alleviating the power, heat, and density challenges of HPC systems 103 patents granted and pending (as of September 2007) Offices in San Jose, California and Bristol, UK 3
4 Agenda Accelerators ClearSpeed and HPC Hardware overview Installing hardware and software Thinking about performance Software Development Kit Application examples Help and support 4
5 ENVISION. ACCELERATE. ARRIVE. What is an accelerator? 5
6 What is an accelerator? A device to improve performance Relieve main CPU of workload Or to augment CPU's capability An accelerator card can increase performance On specific tasks Without aggravating facility limits on clusters (power, size, cooling) 6
7 All accelerators are good for their intended purpose FPGAs Good for integer, bit-level ops Programming looks like circuit design Low power per chip, but 20x more power than custom VLSI Not for 64-bit FLOPS Cell and GPUs Good for video gaming tasks 32-bit FLOPS, not IEEE Unconventional programming model Small local memory High power consumption (> 200 W) ClearSpeed Good for HPC applications IEEE 64-bit and 32-bit FLOPS Custom VLSI, true coprocessor At least 1 GB local memory Very low power consumption (25 W) Familiar programming model 7
8 The case for accelerators Accelerators designed for HPC applications can improve performance as well as performance per (watt, cabinet, dollar) Accelerators enable: Larger problems for given compute time, or Higher accuracy for given compute time, or Same problem in shorter time Host to card latency and bandwidth are not major barriers to successful use of properly designed accelerators. 8
9 ENVISION. ACCELERATE. ARRIVE. What can be accelerated? 9
10 Good application targets for acceleration Application needs to be both computationally intensive and contain a high degree of data parallelism. Computationally intensive: Software depends on executing large numbers of arithmetic calculations Usually 64-bit FLoating point Operations per Second (FLOPS) Should also have a high ratio of FLOPS to data movement (bandwidth) Computationally intensive applications may run for many hours or more even on large clusters. Data parallelism: Software performs the same sequence of operations again and again but on a different item of data each time Example computationally intensive, data parallel problems include: Large matrix arithmetic (linear algebra) Molecular simulations Monte Carlo options pricing in financial applications And many, many more 10
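The "high ratio of FLOPS to data movement" criterion can be made concrete. A minimal C sketch (the helper name is illustrative, not from any ClearSpeed library) estimating the arithmetic intensity of an n x n matrix multiply, which performs about 2n^3 flops on 3n^2 doubles:

```c
#include <assert.h>

/* Arithmetic intensity (flops per byte) of an n x n double-precision
 * matrix multiply: ~2*n^3 flops over 3*n^2 * 8 bytes of data.
 * Intensity grows linearly with n (it equals n/12), which is why large
 * DGEMM calls tolerate a comparatively slow host-to-accelerator link. */
static double dgemm_intensity(double n)
{
    double flops = 2.0 * n * n * n;    /* one multiply + one add per term */
    double bytes = 3.0 * n * n * 8.0;  /* matrices A, B, C as 8-byte doubles */
    return flops / bytes;              /* = n / 12 */
}
```

A 1200 x 1200 multiply already does 100 flops per byte moved; the same counting shows why low-intensity kernels (e.g. vector addition, at a fraction of a flop per byte) are poor acceleration targets.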
11 Example data parallel problems that can be accelerated Ab initio Computational Chemistry Structural Analysis Electromagnetic Modeling Radar Cross-Section Global Illumination Graphics 11
12 HPC Requirements Accelerator boards increase compute performance on highly specific tasks, without aggravating facility limits on clusters (power, size) Need to consider Type of application Software Data type and precision Compatibility with host (logical and physical) Memory size (local to accelerator) Latency and bandwidth to host 12
13 An HPC-specific accelerator CSX600 coprocessor for math acceleration Assists serial CPU running compute-intensive math libraries Available on add-in boards, e.g. PCI-X, PCIe Potentially integrated on the motherboard Can also be used for embedded applications Significantly accelerates certain libraries and applications Target libraries: Level 3 BLAS, LAPACK, ACML, Intel MKL Mathematical modeling tools: Mathematica, MATLAB, etc. In-house code: Using the SDK to port compute-intensive kernels ClearSpeed Advance board Dual CSX600 coprocessors Sustains 67 GFLOPS for 64-bit matrix multiply (DGEMM) calls PCI-X, PCI Express x8 Low power; typically 25 Watts 13
14 Plug-and-play Acceleration ClearSpeed host-side library CSXL Provides some of the most commonly used and important Level 3 BLAS and LAPACK functions Exploits standard shared/dynamic library mechanisms to intercept calls to L3 BLAS and LAPACK Executes calls heterogeneously across both the multicore host and the ClearSpeed accelerators simultaneously for maximum performance Compatible with ACML from AMD and MKL from Intel User & application do not need to be aware of ClearSpeed Except that the application suddenly runs faster 14
15 Programming considerations Is my main data type integer or floating-point? Is the data parallel in nature? What precision do I need? How much data needs to be local to the accelerated task? Does existing accelerator software meet my needs, or do I have to write my own? If I have to write my own code will the existing tools meet my needs for example: compiler, debugger, and simulator? 15
16 ENVISION. ACCELERATE. ARRIVE. Hardware Overview 16
17 CSX600: A chip designed for HPC ClearSpeed CSX600 Array of 96 Processor Elements; 64-bit and 32-bit floating point Single-Instruction, Multiple-Data (SIMD) 210 MHz -- key to low power 47% logic, 53% memory About 50% of the logic is FPU Hence around one quarter of the chip is floating point hardware Embedded SRAM Interface to DDR2 DRAM Inter-processor I/O ports ~ 1 TB/sec internal bandwidth 128 million transistors Approximately 10 Watts 17
18 CSX600 processor core (Diagram: mono controller and poly controller with instruction and data caches, connected over the system network to PE 0 ... PE 95, programmable I/O to DRAM, the peripheral network, and control and debug.) Multi-Threaded Array Processing Programmed in familiar languages Hardware multi-threading Asynchronous, overlapped I/O Run-time extensible instruction set Array of 96 Processor Elements (PEs) Each has multiple execution units Including double precision floating point and integer units 18
19 CSX600 processing element (PE) (Diagram: each PE n links to PE n-1 and PE n+1 and contains FP Mul, FP Add, Div/Sqrt, MAC, a 128-byte register file, 6 KBytes of PE SRAM, an ALU, and programmed I/O with PIO collection & distribution.) Multiple execution units 4-stage floating point adder 4-stage floating point multiplier Divide/square root unit Fixed-point MAC (16x16) Integer ALU with shifter Load/store 5-port register file (3 reads, 2 writes) Closely coupled 6 KB SRAM for data High bandwidth per PE DMA (PIO) Per PE address generators (serves as hardware gather-scatter) Fast inter-PE communication path 32/64-bit IEEE floating point 19
20 Advance accelerator memory hierarchy Tier 3: host DRAM, 1-32 GBytes typical; ~1 GB/s aggregate to the board Tier 2: on-board DRAM, two banks of 0.5 GBytes (one per CSX600), 1.0 GBytes total; 5.4 GB/s Tier 1: poly memory, 6 KBytes per PE; 192 PEs * 6 KB = 1.1 MB; 161 GB/s aggregate (~0.03 GB/s per PE) Tier 0: per-PE register memory, 128 Bytes; 192 PEs * 128 Bytes = 24 KB; swazzle 322 GB/s, registers 725 GB/s Total: 80 GFLOPS, 1.1 TB/s, but only 25 Watts Per PE arithmetic: 0.42 GFLOPS 20
21 Acceleration by plug-in card Advance X620: 133 MHz PCI-X; two-thirds length (8", 203 mm), full-height form factor Advance e620: PCIe x8; half-length, full-height form factor Both boards: Dual ClearSpeed CSX600 coprocessors Can sustain over 66 GFLOPS for 64-bit matrix multiply (DGEMM) calls and other 64-bit HPC kernels Hardware also supports 32-bit floating point and integer calculations 1 GB of memory on the board Drivers today for Linux (Red Hat and SLES) and Windows (XP, Server 2003) Low power: 25 watts typical Multiple boards can be used together for greater performance 21
22 Host to board DMA performance The board includes a host DMA controller which can act as a bus master. All DMA transfers are at least 8-byte aligned. The host DMA engine will attempt to use the full bandwidth of the bus. Slot type / peak bandwidth / expected DMA speed: PCI Express x8: 2,000 MB/s peak, up to 1,300 MB/s PCI-X 133 MHz: 1,066 MB/s peak, up to 750 MB/s Note: measured bandwidth is highly system-dependent Variations of up to 50% have been observed Depends on system chipset, operating system, bus contention 22
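The table above translates directly into expected transfer times. A small C helper (hypothetical, for back-of-envelope estimation only) using the quoted DMA rates:

```c
#include <assert.h>

/* Seconds to move `mbytes` megabytes at an effective DMA rate of
 * `mb_per_s` MB/s. At the rates quoted above, filling the board's
 * 1 GB of memory takes roughly 0.8 s over PCIe x8 (1,300 MB/s) and
 * roughly 1.4 s over 133 MHz PCI-X (750 MB/s). */
static double dma_seconds(double mbytes, double mb_per_s)
{
    return mbytes / mb_per_s;
}
```

Since these are seconds-scale numbers against compute tasks that also run for seconds, this motivates the later slides on overlapping transfers with computation.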
23 ENVISION. ACCELERATE. ARRIVE. Installing Hardware and Software 23
24 Configuration support Advance supports the following host operating systems: Operating System SuSE Linux Enterprise Server 9 IA32 (x86) AMD64/EM64T (x86-64) Red Hat Enterprise Linux 4 Windows XP SP2 Windows Server 2003 preview Supported host BLAS libraries AMD ACML Intel MKL Goto ATLAS Supported compilers For Linux: gcc, icc, fort, pgf For Windows XP, 2003: Visual C For the latest support information go to 24
25 Base software All ClearSpeed software on Linux is installed using the rpm command. The software consists of three parts: Runtime and driver software Diagnostics ClearSpeed standard libraries, CSXL & CSFFT You can download the latest versions from the ClearSpeed support website: 25
26 Installing base software on Linux 1. Log in to the Linux machine as root and change to the directory containing the drivers package. 2. Install the runtime software, using the command: rpm -i csx600_m512_le-runtime-<version>.<arch>.rpm 3. Install the kernel module - for Linux 2.6 simply install the open source CSX driver using: /opt/clearspeed/csx600_m512_le/drivers/csx/install-csx 4. Install the board diagnostics: rpm -i csx600_m512_le-board_diagnostics-<version>.<arch>.rpm 5. Install the CSXL library package: rpm -i csx600_m512_le-csxl_<version>.<arch>.rpm Note: For Windows a Jungo driver will need to be installed and configured - see the installation manual for more details. 26
27 Confirming successful installation ClearSpeed distributes diagnostic tests to check that the board and drivers are successfully installed: 1. Open a shell window and go to an appropriate directory: cd /tmp 2. Set up ClearSpeed environment variables, by typing: source /opt/clearspeed/csx600_m512_le/bin/bashrc 3. Run the diagnostic program, by typing the command: /opt/clearspeed/csx600_m512_le/bin/run_tests.pl Some tests take several minutes to complete. Each test will write Pass or Fail to standard output. A log file test.log will be written in the current directory. 27
28 csreset The csreset command reinitializes an Advance board and its processors. It must be run after start-up or reboot of the system or simulator. It is also a good idea to run csreset at the start of a batch job that calls the Advance board. The csreset command can take argument flags to provide a finer level of control. These include: -A Specifies that all boards should be reset. -v Verbose output. This shows the details about each board. -h Help. This shows the full list of options. 28
29 If you have problems with software installation Make sure you are logged in as super-user. As root for Linux. As administrator for Windows. If the configure or make install steps fail, check that you have the appropriate header files. Check the preconfigured header files and, if necessary, obtain the appropriate configured header file. If the system cannot access the board but the driver is installed, make sure the board is seated well. Try removing the board and reinstalling. 29
30 ENVISION. ACCELERATE. ARRIVE. Targeting ClearSpeed Advance: Exploiting Data Parallelism 30
31 Alternative approaches Three main approaches to acceleration: 1. Use an application which is already ported 2. Plug and play 3. Custom port using the SDK 31
32 Using an application which is already ported Acceleration: simply insert ClearSpeed Latest list of ported applications: Includes: Amber Mathematica MATLAB Star-P 32
33 Plug and play libraries: CSXL Underlying shared libraries are augmented with ClearSpeed CSXL accelerated functions Includes key functions from: LAPACK Level 3 BLAS As an example, BLAS is used by: AMD ACML Intel MKL Full list on: Application is transparently accelerated No modifications to application 33
34 Acceleration using CSXL and standard libraries (Diagram: the application calls into a CSXL intercept layer, which automatically selects the optimum path between the host library (LAPACK, BLAS, etc.) and the CSXL library (LAPACK, BLAS, etc.) on the accelerator.) 34
35 Considerations for custom port of application Is the task large enough to consider acceleration? Takes time to ship data to the accelerator Accelerator can work in parallel with host Overlap computation Performance considerations Look for areas of data parallelism Overlap compute with data I/O Make full use of ClearSpeed I/O paths Analysis starts with model based on memory tiers and can be verified using performance profiling tools 35
36 Is this trip necessary? Considering I/O (Diagram: node memory <-> node <-> accelerator <-> accelerator memory, linked at bandwidth B.) Time to move N data to or from another node or an accelerator is approximately latency + N/B seconds. Because local memory bandwidth is usually higher than B, the acceleration might be lost in the communication time. Estimate the break-even point for the task. (Note: offloading is different from accelerating, where the host continues working.) (Graph: speed vs. problem size; the accelerator curve crosses the node curve at the break-even point, with the accelerator faster for larger problem sizes.) 36
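The break-even estimate above can be sketched in C. This models the pure offload case (host idle while the card works); all parameter values in the test are illustrative, not measured:

```c
#include <assert.h>

/* Pure offload model: the card wins once
 *     latency + N/B + N*t_acc  <  N*t_host
 * i.e. once N exceeds  latency / (t_host - t_acc - 1/B).
 * Units: latency in seconds, B in items/s, t_* in seconds per item.
 * Returns -1 if the card can never win: the per-item transfer cost
 * eats the entire per-item speedup, regardless of problem size. */
static double offload_breakeven(double latency, double B,
                                double t_host, double t_acc)
{
    double gain_per_item = t_host - t_acc - 1.0 / B;
    if (gain_per_item <= 0.0)
        return -1.0;
    return latency / gain_per_item;
}
```

The acceleration model on the later slides is strictly better: because the host keeps working, the accelerator only has to recover the bandwidth + latency cost, so its break-even point is smaller.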
37 Memory bandwidth dictates performance (Diagram: node memory <-> multicore x86 at 17 GB/s; host <-> accelerator over PCI-X or PCIe at 1 to 2 GB/s; accelerator <-> accelerator DRAM at 5.4 GB/s; accelerator <-> accelerator local RAM at 192 GB/s.) Applications that can stage into local RAM can go 10x faster than current high-end Intel and AMD hosts Applications residing in accelerator DRAM do not make use of the massive local memory bandwidth GPUs face a very similar issue 37
38 Latency and bandwidth: Simple offload model (Diagram: host sends data across the bandwidth and latency gaps, then idles while the accelerator computes and returns results.) Accelerator must be quite fast for this approach to have benefit This mental picture may stem from the early days of Intel 80x87 and Motorola 6888x math coprocessors 38
39 Latency and bandwidth: Acceleration model (Diagram: host keeps computing while data crosses the bandwidth and latency gaps to the accelerator and back.) Host continues working Accelerator needs only be fast enough to make up for time lost to bandwidth + latency Easiest use model: host and accelerator share the same task, like DGEMM More flexible: host and accelerator each specialize in what they do 39
40 Accelerator need not wait for all data before starting (Diagram: transfers overlapped with computation on both host and accelerator.) Host can work while data is moved PCI transfers might burden a single x86 core by up to 40%, but other cores on the host continue productive work at full speed Accelerator can work while data is moved It can be slower than the host, and still add performance! In practice, latency is microseconds while the accelerator task takes seconds; the latency gaps above would be microscopic if drawn to scale 40
41 Performance considerations Look for data parallelism Fine-grained vector operations Medium-grained unrolled independent loops Coarse-grained multiple simultaneous data channels/sets Performance analysis for accelerator cards Like analysis for message-passing parallelism but with more levels of memory and communication Application porting success depends heavily on attention to memory bandwidths (Surprisingly) not so much on the bandwidth between host and accelerator card 41
42 PCI Bus ClearSpeed boards utilize either PCI-X or PCIe busses PCI-X 133 MHz: 1 GB/s peak PCIe x8: 1.6 GB/s peak Available memory on board: 1 GB of 200 MHz DDR2 SDRAM shared by 2 CSX600 processors Must consider both the transfer rate AND the available memory If the application requires more memory, then more communication to the board is necessary Even with an infinitely fast board: Time = Total data size transferred / Bus bandwidth 42
43 PCI Bus Driver performance is very machine-specific and depends on transfer size, direction, etc. (Chart: transfer size vs. transfer rate.) See the Runtime User's Guide for current driver performance 43
44 On-board Memory 2-level memory hierarchy 1 GB mono shared memory 6 KB poly memory per processing element (PE) 6 KB/PE * 96 PEs = 576 KB per CSX600 Peak bandwidth between levels 2.7 GB/s x 2 chips = 5.4 GB/s Must consider both the transfer rate AND the available memory If the application requires more memory, then more communication to the board is necessary Even with infinitely fast PEs: Time = Total data size transferred / Bandwidth between levels Secondary considerations Burst size: 64 Bytes/PE (i.e., 8 doubles) Transfers can be smaller, but at reduced efficiency 44
45 SIMD Computing What is SIMD? Single Instruction, Multiple Data Each PE sees the same instruction stream Each PE issues load, multiply, etc., simultaneously But acts on different data per PE PARALLEL COMPUTATION ClearSpeed SIMD is enhanced by: Local memory for each PE data management is easier within poly memory does not require adjacent access for all 96 elements involved in the computation from shared memory pool PEs can be enabled/disabled not required to use all PEs always useful for handling boundaries 45
46 SIMD Array 96 PEs per CSX600 at 210 MHz One double precision multiply-accumulate per cycle 4-cycle pipeline depth for multiply and accumulation For top performance, use operations on 4-element vectors on each PE Nearest-neighbor communication swazzle path topology is a line or ring Bandwidth: 8 Bytes per cycle between register files 8 * 96 * 210 MHz = 161 GB/s Useful for fine-grained communication 46
47 Good Example Kernels Dense Linear Algebra Matrix-Matrix products (DGEMM) Low memory bandwidth required = high data re-use Inner kernel: Matrix-multivector product 96x96 matrix, x4 vectors» 96x96 matrix due to 96 PEs» 4 vectors due to multiply/accumulate pipeline depth Monte Carlo (computational finance) Embarrassingly parallel task distribution Very little data requirement Molecular Dynamics (Amber, BUDE) Large numbers of identical tasks can be found Requires small working data sets 47
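The inner kernel named above, a 96x96 matrix applied to 4 vectors, can be written as plain C for reference. This is a scalar sketch of the math only; on the card, each PE would own one matrix row and the 4 columns keep the 4-stage multiply-accumulate pipeline full:

```c
#include <assert.h>

#define N 96   /* one matrix row per PE */
#define V 4    /* vectors in flight: matches the 4-stage MAC pipeline */

/* Matrix-multivector product: C[N][V] += A[N][N] * X[N][V].
 * The j-loop gives 4 independent accumulations per (i,k) pair, which
 * is what hides the pipeline latency on the real hardware. */
static void matmv(double A[N][N], double X[N][V], double C[N][V])
{
    for (int i = 0; i < N; i++)          /* row i: PE i's work */
        for (int k = 0; k < N; k++)
            for (int j = 0; j < V; j++)  /* 4 independent accumulators */
                C[i][j] += A[i][k] * X[k][j];
}
```

High data re-use is visible here: each element of A is loaded once but contributes to 4 results, and each element of X contributes to 96.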
48 Possible Kernel Partial Differential Equations Some are memory bandwidth limited, so not a good candidate for ClearSpeed acceleration small stencil implies little computation per grid point wide, sparse stencil implies large active data set But, some PDE simulations are good candidates require a small grid, so can run entirely in PE memory (computational finance) have large, dense stencils large amounts of computation per grid point sufficiently small active data set implicit time stepping large systems of equations solved via direct methods direct solvers utilize dense linear algebra kernels (i.e., DGEMM) 48
49 Keys to Success Parallelism is essential Proper management of the poly memory is also critical Application must accept memory bandwidth limits PCIe or PCI-X On-board memory hierarchy SDK enables asynchronous data transfers permits efficient double buffering to manage data streams, accommodating the size limit Application must employ a small working data set less than 576 KB, distributed across 96 PEs also aware of 1 GB shared memory limit While developing ClearSpeed applications, use the ClearSpeed Visual Profiler to discover what is actually happening on the board! 49
50 Remember the host processor Today's multi-core hosts are very useful for managing other tasks that are not accelerated by ClearSpeed Many applications can overlap these tasks with ClearSpeed accelerated tasks Profile the host portion of your application as well using any of a variety of tools Use the ClearSpeed Visual Profiler for CSAPI utilization 50
51 General optimization techniques Latency hiding Overlap compute with I/O Data reuse On-chip swazzle path Maximize PE usage Ensure all PEs are processing, not idle 51
52 Overlap data with compute Double-buffer Many levels of data I/O compute parallelism PE load/store overlaps PE compute PE to board memory can also overlap Board memory to host memory can also overlap Hence, if task is compute bound: Data takes no time to transfer If task is I/O bound: Compute takes no time to calculate 52
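The claim at the end of the slide, compute-bound tasks hide all transfer time and I/O-bound tasks hide all compute time, follows from a simple two-stage pipeline model of double buffering. A C sketch (the formula and names are a model of the technique, not SDK code):

```c
#include <assert.h>

/* Two-stage pipeline model of double buffering: the transfer of chunk
 * k+1 (taking t_io) overlaps with computation on chunk k (taking t_c).
 * Total for n chunks = t_io + (n-1)*max(t_io, t_c) + t_c.
 * If compute bound (t_c >= t_io): total -> t_io + n*t_c, transfers free.
 * If I/O bound (t_io >= t_c): total -> n*t_io + t_c, compute free. */
static double overlapped_time(int n, double t_io, double t_c)
{
    double stage = (t_io > t_c) ? t_io : t_c;
    return t_io + (n - 1) * stage + t_c;
}
```

Compare against the unoverlapped cost n*(t_io + t_c): overlap saves (n-1)*min(t_io, t_c), which is why balanced pipelines benefit most.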
53 Data reuse Swazzle path Left or right 64 bit transfer (8 bytes) 8 bytes per cycle, so ~161GB/s per CSX processor Can be complete loop or linear chain Parallel with other data I/O Register-register move On-off chip in parallel Doesn t impinge on DRAM access PE local memory register in parallel Doesn t impinge on local memory access 53
54 Maximize PE usage Aim for 100% efficiency PEs use predicated execution PEs are disabled rather than code skipped Minimize effects extract common code from conditionals Mono processor can branch Skip blocks of code 54
55 Detail of I/O widths for performance analysis Each accelerator board has: 161 GB/s bandwidth PE register to PE memory (4 bytes per cycle) 322 GB/s swazzle path bandwidth (8 bytes per cycle) 968 GB/s bandwidth PE register to PE ALU (24 bytes per cycle) 5.4 GB/s DRAM bandwidth (32 bytes per cycle) (Aggregate bandwidth for two CSX600 chips.) (Diagram: PE n with FP Mul, FP Add, Div/Sqrt, MAC, 128-byte register file, 6 KBytes of PE SRAM, ALU, and programmed I/O; swazzle links to PE n-1 and PE n+1 at 322 GB/s; PIO collection & distribution to the 1 GByte CSX DRAM at 5.4 GB/s.) 55
56 ENVISION. ACCELERATE. ARRIVE. Software Development Kit 56
57 ClearSpeed SDK overview C n compiler C with extension for SIMD control Assembler Linker Simulator Debugger Graphical profiler Libraries Documentation Available for Windows XP / 2003 and Linux (Red Hat Enterprise Linux 4 and SLES 9) 57
58 Agenda 1. Introduction to C n 2. C n Libraries 3. Debugging C n 4. CSAPI: Host / Board Communication 58
59 ENVISION. ACCELERATE. ARRIVE. Introduction to C n 59
60 Software Development The CSX architecture is simpler to program: Single program for serial and parallel operations Architecture and compiler co-designed Instruction and data caches Simple, regular 32-bit instruction set Large, flexible register file Fast thread context switching Built-in debug support Same development process as traditional architectures: compile, assemble, link C n is a simple parallel extension of C 60
61 C n C with vector extensions for CSX New Keywords mono and poly storage qualifiers mono is a serial (single) variable poly is a parallel (vector) variable Mono variables live in the 1 GB DRAM Poly variables live in the 6 KB SRAM of each PE 61
62 C n differences from C New data type multiplicity modifiers: mono: denotes serial variable resident in mono memory mono is the default multiplicity poly: denotes parallel/vector variable resident in poly memory local to each PE applies to pointers, doubly so: mono int * poly foo; foo is a pointer in poly memory to an int in mono memory poly int * mono bar; bar is a pointer in mono memory to an int in poly memory int * poly * mono good_grief; as you would expect. Pointer sizes: mono int * 4 bytes (32-bit addressable space, 512 MB) poly int * 2 bytes (16-bit addressable space, 6 KB) 62
63 C n differences from C Execution context: Alters branch/jump behavior In mono context, jumps occur as in traditional architecture In poly context, PEs are enabled/disabled if (penum>32) { } else { } disables false PEs on true branch, then re-enables the false PEs and disables the other PEs for the false branch both branches executed break, continue return select PEs get disabled until the end of scope on all PEs select PEs get disabled until all PEs return, or end of scope 63
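The poly-context branch behavior can be emulated in plain C with an explicit enable mask (a sketch of the semantics only, not SDK code): both branches are issued to every PE, and the mask decides which PEs commit results.

```c
#include <assert.h>

#define NPES 96

/* Emulate the slide's example:
 *     if (penum > 32) { r = 1; } else { r = 2; }
 * Both branches are walked in full; each PE commits only where its
 * enable flag matches the branch being issued. */
static void poly_if(int r[NPES])
{
    int enable[NPES];
    for (int pe = 0; pe < NPES; pe++)      /* evaluate condition per PE */
        enable[pe] = (pe > 32);
    for (int pe = 0; pe < NPES; pe++)      /* "true" branch issued to all */
        if (enable[pe]) r[pe] = 1;
    for (int pe = 0; pe < NPES; pe++)      /* "false" branch issued to all */
        if (!enable[pe]) r[pe] = 2;
}
```

The cost model falls out of the emulation: a poly-if always pays for both branches, which is why the later optimization slide recommends extracting common code from conditionals.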
64 Porting C to C n (Example 1) C code int i, j; for( i=0; i<96; i++ ) { j = 2*i; } Similar C n code poly int i, j; i = get_penum(); // i=0 on PE0, i=1 on PE1 etc. j = 2*i; // j=0 on PE0, j=2 on PE1 etc. 64
65 Porting C to C n (Example 2) C code int i; for( i=0; i<n; i++ ) { } Similar C n code poly int me, i; mono int npes; me = get_penum(); // me=0 on PE0, me=1 on PE1 etc. npes = get_num_pes(); // npes = 96 // i=0,96,192, ; 1,97,193, etc. for( i=me; i<n; i+=npes ) { } 65
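The strided loop in Example 2 partitions 0..n-1 across the PEs with no gaps and no overlap. A quick C check of that property (the helper is hypothetical, written only to verify the distribution):

```c
#include <assert.h>
#include <string.h>

#define NPES 96

/* Count how many times each index in 0..n-1 is visited when every PE p
 * runs the slide's loop:  for (i = p; i < n; i += NPES).
 * Returns 1 if every index is visited exactly once. `hits` must have
 * room for n ints. */
static int covered_exactly_once(int n, int *hits)
{
    memset(hits, 0, n * sizeof(int));
    for (int pe = 0; pe < NPES; pe++)
        for (int i = pe; i < n; i += NPES)
            hits[i]++;
    for (int i = 0; i < n; i++)
        if (hits[i] != 1)
            return 0;
    return 1;
}
```

Note the distribution also load-balances well: when n is not a multiple of 96, PE counts differ by at most one iteration.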
66 Simple C n example void foo (double *A, double *B, int n) { // Assume n is divisible by 24*96. poly double mat[4]={1.,2.,3.,4.}; poly double a[24]; poly double b[4]={0.,0.,0.,0.}; int i; while (n) { memcpym2p (a, A+24*get_penum(), 24*sizeof(double)); A+=24*96; for (i=0; i<24; i+=2) { b[0] += a[i]*mat[0] + a[i+1]*mat[1]; b[1] += a[i+1]*mat[0] + a[i]*mat[1]; b[2] += a[i]*mat[2] - a[i+1]*mat[3]; b[3] += a[i+1]*mat[2] - a[i]*mat[3]; } n -= 24*96; } memcpyp2m (B+4*get_penum(), b, 4*sizeof(double)); return; } 66
67 ENVISION. ACCELERATE. ARRIVE. C n Libraries 67
68 Runtime libraries C n supports standard C runtime, including: malloc printf sqrt memcpy C n extensions include: sqrtp memcpym2p / memcpyp2m get_penum swazzle any / all 68
69 Asynchronous I/O For most efficient use of limited PE memory, overlap data transfers between mono memory and poly: async_memcpym2p/p2m sem_sig / sem_wait For greatest efficiency, async_memcpy routines bypass the data cache, so coherency must be maintained: dcache_flush / dcache_flush_address 69
70 Asynchronous I/O example void foo(double *A, double *B,int n) { // Assume n is divisible by 24*96 poly unsigned short penum=get_penum(); poly double mat[4]={1.,2.,3.,4.}; poly double a_front[12], a_back[12]; poly double b[4]={0.,0.,0.,0.}; int i; async_memcpym2p(19,a_front,A+12*penum,12*sizeof(double));A+=12*96; n-=24*96; while (n) { async_memcpym2p(17,a_back,A+12*penum,12*sizeof(double));A+=12*96; sem_wait(19); for (i=0;i<12;i+=2) { b[0] += a_front[i]*mat[0] + a_front[i+1]*mat[1]; b[1] += a_front[i+1]*mat[0] + a_front[i]*mat[1]; b[2] += a_front[i]*mat[2] - a_front[i+1]*mat[3]; b[3] += a_front[i+1]*mat[2] - a_front[i]*mat[3]; } n-=12*96; async_memcpym2p(19,a_front,A+12*penum,12*sizeof(double));A+=12*96; sem_wait(17); for (i=0;i<12;i+=2) { // compute on a_back, then finish outside while loop 70
71 ENVISION. ACCELERATE. ARRIVE. C n Pointers 71
72 C n mono and poly pointers Using mono and poly with pointers mono int * mono mpmi mono pointer to mono int poly int * mono mppi mono pointer to poly int mono int * poly ppmi poly pointer to mono int poly int * poly pppi poly pointer to poly int Most commonly used is mono pointer to poly poly <type> * mono <variable_name> 72
73 C n mono and poly pointers mono pointer to mono int mono int * mono mpmi (Diagram: the pointer lives in mono memory and points to an int in mono memory.) 73
74 C n mono and poly pointers mono pointer to poly int poly int * mono mppi (Diagram: the pointer lives in mono memory and points to an int in the poly memory of each PE. Note: Points to the same location in each PE.) 74
75 C n mono and poly pointers poly pointer to poly int poly int * poly pppi (Diagram: each PE holds its own pointer in poly memory, pointing to an int in its own poly memory. Note: Pointer stored in same location in each PE.) 75
76 C n mono and poly pointers poly pointer to mono int mono int * poly ppmi (Diagram: each PE holds its own pointer in poly memory, pointing to an int in mono memory. Note: Pointer stored in same location in each PE.) 76
77 ENVISION. ACCELERATE. ARRIVE. Conditional Expressions 77
78 Conditional Expressions: mono-if Conditions based on mono expressions Expression has same value on all PEs Code block selected according to expression and branch instruction executed mono int i, j; i = j = 1; if( i == j ) { // this block executed on all PEs } else { // this block branched over on all PEs } 78
79 Conditional Expressions: poly-if Conditions based on poly expressions Expression may have different values on different PEs But SIMD model implies all PEs execute same instruction simultaneously All branches executed on all PEs, with PE enabled if conditional expression is true (like predicated instructions) poly int i; i = get_penum(); if( i < 48 ) { // PEs 0, 1, 2, execute instructions // PEs 48, 49, instructions issued but ignored } else { // PEs 0, 1, 2, instructions issued but ignored // PEs 48, 49, execute instructions } 79
80 Conditional Expressions: poly-while While loops based on poly expressions Loop continues execution until condition is false on all PEs PEs will be disabled one by one until while condition is false on all PEs count keeps track of total number of iterations (95 in this case: the largest initial value of me) mono int count = 0; poly int me; me = get_penum(); while( me > 0 ) { --me; ++count; } 80
81 Other variations between C and C n Labeled break and continue statements No switch statement using poly variables (use multiple if statements) No goto statement in poly context 81
82 ENVISION. ACCELERATE. ARRIVE. Moving Data 82
83 Data flow Board and host communicate via Linux kernel module or Windows driver Create a handle and establish the connection 83
84 Data flow Register intent of using the first processor on the card Load the code onto the enabled processor 84
85 Data flow Transfer data from host to board Semaphores synchronize transfers between host and board 85
86 Data flow Run the code on the enabled processor Host can continue with other work 86
87 Data flow Send results back to host Halt board program and clean up 87
88 Implicit broadcast from mono to poly Implicit broadcast from mono to poly by assignment mono int m = 7; poly int p; p = m; // Implicit broadcast to all PEs Assigning poly to mono is not permitted mono int m; poly int p = get_penum(); m = p; // NO! m would receive a different value from each PE 88
89 Explicit data movement mono to poly memcpym2p(); async_memcpym2p() Memory copy of n bytes from mono to poly Source is a poly pointer to mono memory, which can have a different value for each PE Destination is a mono pointer to poly memory, that is, the destination address is the same for all PEs (Diagram: source data in mono memory scattered to the same destination on each PE, PE0, PE1, PE2, ..., PE95.) 89
90 Explicit data movement poly to mono memcpyp2m(); async_memcpyp2m() Memory copy of n bytes from poly to mono Source is a mono pointer to poly memory; therefore the source address is the same for every PE Destination is a poly pointer to mono memory, which can have a different value for each PE (Diagram: data gathered from the same source address on each PE, PE0, PE1, PE2, ..., PE95, into destination data in mono memory.) 90
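The pointer directions above can be pictured with plain arrays: mono memory as one flat array, poly memory as one small array per PE. A C sketch of a memcpym2p-style scatter (not the SDK's implementation; names and the chunk size are illustrative) where each PE receives a different slice of mono memory:

```c
#include <assert.h>
#include <string.h>

#define NPES  96
#define CHUNK 4

/* Scatter: PE p's poly buffer receives mono[p*CHUNK .. p*CHUNK+CHUNK-1].
 * The poly destination offset is the same on every PE, while the mono
 * source address differs per PE - mirroring memcpym2p's pointer types
 * (mono pointer to poly destination, poly pointer to mono source). */
static void m2p_scatter(const double *mono, double poly[NPES][CHUNK])
{
    for (int pe = 0; pe < NPES; pe++)
        memcpy(poly[pe], mono + pe * CHUNK, CHUNK * sizeof(double));
}
```

The memcpyp2m gather is the mirror image: same per-PE source offset, different mono destinations.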
91 Explicit data movement asynchronous async_memcpym2p(); async_memcpyp2m() Asynchronous memory copy of n bytes from mono to poly or from poly to mono Computation continues during data copy Mono memory data cache NOT flushed Restrictions on alignment of data Use semaphores to wait for completion of copy Much higher bandwidth than synchronous versions dcache_flush(); async_memcpym2p( semaphore, ); // computation continues sem_wait( semaphore ); // use data that has been transferred from mono memory 91
92 Explicit data movement swazzle Register-to-register transfer between neighboring PE s PE n ALU Status flags To: PE n-1 Register file To: PE n+1 Memory Enable stack 92
93 Swazzle operations Assembly language versions operate directly on register file C n versions operate on data and include implicit data movement from memory to registers Variants swazzle_up( poly int src ); // copy to higher numbered PE swazzle_down( poly int src ); // copy to lower numbered PE swazzle_up_generic( poly void *dst, poly void *src, unsigned int size ); swazzle_down_generic( ); Similar swazzles operating on other data types Functions to set data copied into ends of swazzle chain 93
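Swazzle semantics can be pictured as a simultaneous register shift across the whole PE array. A C emulation of the ring variant (function name illustrative; the real swazzle moves register-file contents in hardware):

```c
#include <assert.h>

#define NPES 96

/* Emulate swazzle_up on a ring: each PE's value is copied to the next
 * higher-numbered PE, so PE p receives the value PE p-1 held; PE 0
 * wraps around from PE NPES-1. All transfers are simultaneous, so we
 * shift from a snapshot rather than updating in place. */
static void swazzle_up_ring(int reg[NPES])
{
    int prev[NPES];
    for (int pe = 0; pe < NPES; pe++)
        prev[pe] = reg[pe];                      /* snapshot before shift */
    for (int pe = 0; pe < NPES; pe++)
        reg[pe] = prev[(pe + NPES - 1) % NPES];  /* take lower neighbor */
}
```

swazzle_down is the same shift in the opposite direction; the linear-chain variant would instead inject a specified value at the end of the chain rather than wrapping.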
94 Data movement bandwidths per CSX600 Mono memory to poly memory 2.7 GB/s aggregate over 96 PEs Poly memory to registers 840 MB/s per PE, 81 GB/s aggregate Swazzle path bandwidth 1680 MB/s per PE, 161 GB/s aggregate Total bandwidth for Advance board (2 CSX600 processors) ~0.5 TB/s 94
95 DMA performance Advance board has a host DMA controller which can act as a PCI bus master All DMA transfers are at least 8-byte aligned Host DMA engine will attempt to use the entire bus bandwidth (Chart: ClearSpeed Advance DMA performance, transfer rate in MB/s vs. transfer size in MB, for e620 read/write and X620 read/write averages.) 95
96 ENVISION. ACCELERATE. ARRIVE. CSAPI Host - Board communication 96
97 Host-Board interaction basics The basic model for interaction between the host and the card is very simple: The ClearSpeed board can signal and wait for semaphores; it cannot initiate data transactions with the host. The host pushes data to and pulls data from the board. The host can also signal and receive semaphores. 97
98 Connecting to the board A host application needs to perform the following sequence to launch a process on the board: Create a CSAPI handle CSAPI_new Establish a connection with the board CSAPI_connect Register the host application with the driver CSAPI_register_application Load the CSX application on the desired chip CSAPI_load Run the CSX application on the desired chip CSAPI_run 98
99 Interacting with the board Get board memory address of a known symbol CSAPI_get_symbol_value This must be done after the application is loaded, if the dynamic load capability is to be used. Write/Read data to a retrieved memory address CSAPI_write_mono_memory CSAPI_read_mono_memory Asynchronous variants of these routines also exist A process does not need to be running for these operations to succeed, but the process needs to be loaded. These should not be performed DURING process termination. Managing semaphores CSAPI_allocate_shared_semaphore Declares a semaphore for use on both host and card CSAPI_semaphore_wait CSAPI_semaphore_signal 99
100 Cleaning up
Process termination: CSAPI_wait_on_terminate, CSAPI_get_return_value
Clean-up: CSAPI_delete
See the CSX600 Runtime Software User Guide for more details, including: managing multiple processes on the board/chip at once; managing board control registers; board reset; managing multi-threaded CSX applications; board memory allocation; managing multiple boards/chips; error handling
100
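Put together, slides 98 through 100 suggest a host-side lifecycle along these lines. This is a C-style sketch only: the CSAPI function names come from the slides, but every argument list here is an assumption; consult the CSX600 Runtime Software User Guide for the real signatures and error handling.

```
/* Illustrative sketch -- argument lists are assumptions, not the real API */
CSAPI *h;
CSAPI_new(&h);                          /* create a handle          */
CSAPI_connect(h);                       /* connect to the board     */
CSAPI_register_application(h);          /* register with the driver */
CSAPI_load(h, chip, "foo.csx");         /* load the CSX application */
CSAPI_run(h, chip);                     /* start it                 */

CSAPI_get_symbol_value(h, "input", &addr);    /* after load         */
CSAPI_write_mono_memory(h, addr, buf, size);  /* push input data    */
CSAPI_semaphore_signal(h, sem);               /* tell the card: go  */
CSAPI_semaphore_wait(h, sem);                 /* wait for result    */
CSAPI_read_mono_memory(h, addr, buf, size);   /* pull output data   */

CSAPI_wait_on_terminate(h, &status);    /* process termination      */
CSAPI_get_return_value(h, &rv);
CSAPI_delete(h);                        /* clean up                 */
```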
101 ENVISION. ACCELERATE. ARRIVE. Debugging Cn 101
102 csgdb
csgdb is a port of the open-source gdb debugger: full symbolic debugging of mono/poly variables; full gdb breakpoint support; step through Cn or assembly; view mono and poly registers; view PE enabled state
Also accessible via DDD, which allows graphical data visualization
102
103 Debug control
To enable debugging:
export CS_CSAPI_DEBUGGER=1 initializes the debug interface within the host application
export CS_CSAPI_DEBUGGER_ATTACH=1 makes the host application write a port number to stdout and wait for <Return/Enter> to be pressed, so that csgdb can be manually attached to the connected board process
Launch the host application (this can be done with or without a debugger)
Launch csgdb in a new shell: csgdb <csx_file_name> <port_number>
No need to connect, as the host application did this already; set desired breakpoints; run
Note that the host is currently blocked waiting for <Return/Enter>, so the card process may also be blocked waiting for the host. Press Return in the host shell for the host and card applications to proceed.
103
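Collected into one session sketch (the host binary name and the port number are hypothetical placeholders; the host application prints the real port):

```shell
# Shell 1: enable the debug interface, then start the host application
export CS_CSAPI_DEBUGGER=1
export CS_CSAPI_DEBUGGER_ATTACH=1
# ./my_host_app             # hypothetical host binary: prints a port number,
                            # then blocks waiting for <Return/Enter>

# Shell 2: attach csgdb to the reported port
# csgdb foo.csx 4321        # port number as printed by the host
# (gdb) break main
# (gdb) run
# Back in shell 1, press <Return/Enter> so host and card proceed
```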
104 csgdb Debugger (Shown with ddd Front-end) On-chip poly array contents displayed Real time plot of contents of PE memory Cn source-level break point, watch points, single step, etc. Register contents Disassembly, break point, watch points, single step, etc. 104
105 csgdb Command-line example
% cscn foo.cn -g -o foo.csx
% csgdb ./foo.csx
(gdb) connect
0x in FRAME_BEGIN_MONO ()
(gdb) break 109
Breakpoint 1 at 0x800154c0: file foo.cn, line 109.
(gdb) run
Starting program: /home/kris/my_app/foo.csx
Breakpoint 1, main () at foo.cn:109
(gdb) next
110 y = MINY + (get_penum() * STEPY);
(gdb) print y
$1 = {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1}
105
106 ENVISION. ACCELERATE. ARRIVE. ClearSpeed Visual Profiler Explaining Performance 106
107 ClearSpeed Visual Profiler (csvprof)
Host tracing: trace CSAPI functions; user can infer overlapping host/board utilization; locate hot-spots
Board tracing: trace board-side functions without instrumentation; locate hot-spots
Board hardware utilization: display activity of CSX functional units, including ld/st, PIO, SIMD microcode, instruction cache, data cache, and thread; cycle accurate; view corresponding source
Unified GUI
107
108 Detailed profiling is essential for accelerator tuning
HOST CODE PROFILING: Visually inspect multiple host threads. Time specific code sections. Check overlap of host threads.
HOST/BOARD INTERACTION: Infer cause and effect. Measure transfer bandwidth. Check overlap of host and board compute.
ACCELERATOR PIPE: View instruction issue. Visualize overlap of executing instructions. Get cycle-accurate timing. Remove instruction-level performance bottlenecks.
CSX600 SYSTEM: Trace at system level. Inspect overlap of compute and I/O. View cache utilization. Graph performance.
[Diagram: host CPUs connected to Advance accelerator boards, each carrying two CSX600 processors with their pipelines]
108
109 csvprof: Host tracing
Dynamic loading of the CSAPI trace implementation
Triggered with an environment variable: export CS_CSAPI_TRACE=1
(Recall the similar enabling of debug support: export CS_CSAPI_DEBUGGER=1)
Specify the tracing format: export CS_CSAPI_TRACE_CSVPROF=1 (currently this is the only implementation, but others may be added in the future)
Specify the output file for the trace: export CS_CSAPI_TRACE_CSVPROF_FILE=mytrace.cst (default filename = csvprof_data.cst)
Output file written during CSAPI_delete
109
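The variables above can be collected into a small shell fragment; the host binary name is a hypothetical placeholder:

```shell
# Enable CSAPI host tracing in csvprof's format (values from the slide)
export CS_CSAPI_TRACE=1
export CS_CSAPI_TRACE_CSVPROF=1
export CS_CSAPI_TRACE_CSVPROF_FILE=mytrace.cst   # default: csvprof_data.cst
# ./my_host_app     # hypothetical host binary; the trace file is written
                    # when the application calls CSAPI_delete
```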
110 csvprof: Host-Board interaction 110
111 csvprof: Host code profile Linpack benchmark 111
112 csvprof: CSX600 system profile 112
113 csvprof: Accelerator pipeline profile 113
114 csvprof: Instruction pipeline stalls 114
115 csvprof: Advance board tracing
Enabled using the debugger, csgdb; can be used interactively or through a gdb script
Can select events to profile, or all events
Requires buffer allocation on the card; today this is done statically
One could use CSAPI to allocate the buffer, but the developer must then pass the buffer's location and size through to csgdb
Easy if running on only one chip: place the buffer in the other chip's memory
Explicit dump to generate the trace file; can control the type of data to be dumped
115
116 csvprof: Sample gdb script
% cat ./csgdb_trace.gdb
connect
load ./foo.csx
cstrace buffer 0x 0x
cstrace event all on
tbreak test_me
continue
cstrace enable
continue
cstrace dump foo.cst
cstrace dump branch dgemm_test4_branch.cst
quit
% csgdb --command=./csgdb_trace.gdb
116
117 ENVISION. ACCELERATE. ARRIVE. Tuning Tips 117
118 Pipelined arithmetic
Four-stage floating-point pipeline
Use vector types, vector intrinsic functions, and the vector math library for high efficiency

DVECTOR a, b, c;
poly double x[n];
a = *((DVECTOR *)&x[0]);
b = *((DVECTOR *)&x[4]);
c = cs_sqrt( cs_vadd( a, b ) );
118
119 Poly conditionals
When possible, remove common subexpressions from poly if-blocks to reduce the amount of replicated work.
It may even pay to compute and throw away results if that leads to fewer poly conditional blocks.
A poly if-block uses predicated instructions, not a branch, so it is cheap as long as few additional instructions are executed.
119
120 Poly loop counters Loops with poly counters are more expensive than those with mono counters Use mono loop counters if possible 120
121 Arrays
Pointer incrementing is more efficient than using array index notation
Poly addresses require 16 bits: use short for poly pointer increments, which avoids conversion of int to short
121
122 Data transfer
Synchronous functions are completely general but flush the data cache on each transfer: memcpyp2m(), memcpym2p()
Asynchronous functions maximize performance: they do not flush the cache, but they have data size and alignment restrictions and require use of a wait semaphore: async_memcpyp2m(); sem_wait(); async_memcpym2p(); sem_wait()
Large data blocks are more efficient than small blocks, in every direction: host to board, board to host, mono to poly, poly to mono
122
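As a hedged Cn-style sketch of the asynchronous pattern above (the function names come from the slide; the argument order and the semaphore handling are assumptions, so check the Cn library manual for the real signatures):

```
poly float buf[N];

/* Synchronous: completely general, but flushes the data cache per call */
memcpym2p(buf, mono_src, N * sizeof(float));

/* Asynchronous: no cache flush, but size/alignment restrictions apply
 * and completion must be awaited on a semaphore */
async_memcpym2p(sem, buf, mono_src, N * sizeof(float));
/* ... overlap independent compute here ... */
sem_wait(sem);   /* transfer is guaranteed complete after this */
```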
123 ENVISION. ACCELERATE. ARRIVE. Application Examples 123
124 Math function speed comparison
[Chart: 64-bit math function throughput, in billions of operations per second, for Sqrt, InvSqrt, Exp, Ln, Cos, Sin, SinCos, and Inv, comparing a dual-core Opteron, a 3 GHz dual-core Woodcrest, and the ClearSpeed Advance card]
Typical speedup of ~8X over the fastest x86 processors, because math functions stay in local memory on the card
124
125 Nucleic Acid Builder (NAB)
Newton-Raphson refinement now possible; large DGEMM calls from computed second derivatives will be in AMBER
Speedup obtained for this operation in three hours of programmer effort
Enables accurate computation of entropy and Gibbs free energy for the first time
AMBER itself has cases that ClearSpeed accelerates by 3.2x to 9x, with 5x to 17x possible once we exploit the symmetry of atom-atom interactions
125
126 AMBER molecular modeling with ClearSpeed
[Chart: AMBER Generalized Born models 1, 2, and 6, run time in minutes, host vs. Advance X620]

AMBER model   Host      Advance X620   Speedup
Gen Born 1    ... min   24.6 min       3.4
Gen Born 2    ... min   23.5 min       3.6
Gen Born 6    ... min    4.0 min       ...
126
127 Monte Carlo methods exploit high local bandwidth
Monte Carlo methods are ideal for ClearSpeed acceleration: high regularity and locality of the algorithm; very high compute-to-I/O ratio; very good scalability to high degrees of parallelism; needs 64-bit
Excellent results for parallelization: achieving 10x performance per Advance card vs. highly optimized code on the fastest x86 CPUs available today
Maintains the high precision required by the computations: true 64-bit IEEE 754 floating point throughout
25 W per card typical when the card is computing
ClearSpeed has a Monte Carlo example code, available in source form for evaluation
127
128 Monte Carlo applications scale very well
No acceleration: 200M samples, 79 seconds
1 Advance board: 200M samples, 3.6 seconds
5 Advance boards: 200M samples, 0.7 seconds
[Chart: European option pricing model speedup vs. number of ClearSpeed Advance boards]
128
129 Why do Monte Carlo applications need 64-bit?
Accuracy increases as the square root of the number of trials, so five-decimal accuracy takes 10 billion trials.
But when you sum many similar values, you start to lose all the significant digits. 64-bit summation is needed to get even a single-precision result!
Single precision: 1.0 x 10^8 + 1 = 1.0 x 10^8
Double precision: 1.0 x 10^8 + 1 = 1.00000001 x 10^8
129
130 ENVISION. ACCELERATE. ARRIVE. Help and Support 130
131 Installed documentation
docs directory: CSXL user guide; runtime user guide; csvprof (Visual Profiler) overview and examples; SDK getting started; gdb manual; instruction set manual; Cn library manual; reference manual; release notes
examples directory
131
132 ClearSpeed online
Company website: general information, news, etc.
Support website: support.clearspeed.com (report a problem, find answers, etc.)
The support website has: documentation, user guides, reference manuals; solutions knowledge base; software downloads; log a case
132
133 Join the ClearSpeed Developer Program! Designed to support the leading-edge community of developers using accelerators Membership is free and has the following benefits: Access to the ClearSpeed Developer website ClearSpeed Developer Community on-line forum Invitation to participate in ClearSpeed Developer & User Community meetings and events Repository to share and access demonstrations and sample codes within the ClearSpeed Developer Community Technical updates, tips and tricks from the gurus at ClearSpeed and the Developer Community And more, including opportunities to preview new software releases and developer discount programs. Leverage the expertise of developers worldwide. Ask a question, or share your knowledge. Register now at developer.clearspeed.com! 133
134 134
More informationIntel C++ Compiler Professional Edition 11.1 for Mac OS* X. In-Depth
Intel C++ Compiler Professional Edition 11.1 for Mac OS* X In-Depth Contents Intel C++ Compiler Professional Edition 11.1 for Mac OS* X. 3 Intel C++ Compiler Professional Edition 11.1 Components:...3 Features...3
More informationCell SDK and Best Practices
Cell SDK and Best Practices Stefan Lutz Florian Braune Hardware-Software-Co-Design Universität Erlangen-Nürnberg siflbrau@mb.stud.uni-erlangen.de Stefan.b.lutz@mb.stud.uni-erlangen.de 1 Overview - Introduction
More informationA Multi-Tiered Optimization Framework for Heterogeneous Computing
A Multi-Tiered Optimization Framework for Heterogeneous Computing IEEE HPEC 2014 Alan George Professor of ECE University of Florida Herman Lam Assoc. Professor of ECE University of Florida Andrew Milluzzi
More informationIBM Cell Processor. Gilbert Hendry Mark Kretschmann
IBM Cell Processor Gilbert Hendry Mark Kretschmann Architectural components Architectural security Programming Models Compiler Applications Performance Power and Cost Conclusion Outline Cell Architecture:
More informationArchitecture without explicit locks for logic simulation on SIMD machines
Architecture without explicit locks for logic on machines M. Chimeh Department of Computer Science University of Glasgow UKMAC, 2016 Contents 1 2 3 4 5 6 The Using models to replicate the behaviour of
More informationOpenACC Course. Office Hour #2 Q&A
OpenACC Course Office Hour #2 Q&A Q1: How many threads does each GPU core have? A: GPU cores execute arithmetic instructions. Each core can execute one single precision floating point instruction per cycle
More informationIntroduction to GPU computing
Introduction to GPU computing Nagasaki Advanced Computing Center Nagasaki, Japan The GPU evolution The Graphic Processing Unit (GPU) is a processor that was specialized for processing graphics. The GPU
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationTrends in the Infrastructure of Computing
Trends in the Infrastructure of Computing CSCE 9: Computing in the Modern World Dr. Jason D. Bakos My Questions How do computer processors work? Why do computer processors get faster over time? How much
More informationCenter for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop
Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop http://cscads.rice.edu/ Discussion and Feedback CScADS Autotuning 07 Top Priority Questions for Discussion
More informationComputer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2
More informationThis section covers the MIPS instruction set.
This section covers the MIPS instruction set. 1 + I am going to break down the instructions into two types. + a machine instruction which is directly defined in the MIPS architecture and has a one to one
More informationCUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION
CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION WHAT YOU WILL LEARN An iterative method to optimize your GPU code Some common bottlenecks to look out for Performance diagnostics with NVIDIA Nsight
More informationX-Stream II. Processing Method. Operating System. Hardware Performance. Elements of Processing Speed TECHNICAL BRIEF
X-Stream II Peter J. Pupalaikis Principal Technologist September 2, 2010 Summary This paper explains how X- Stream II techonlogy improves the speed and responsiveness of LeCroy oscilloscopes. TECHNICAL
More informationLaboratory Exercise 3 Comparative Analysis of Hardware and Emulation Forms of Signed 32-Bit Multiplication
Laboratory Exercise 3 Comparative Analysis of Hardware and Emulation Forms of Signed 32-Bit Multiplication Introduction All processors offer some form of instructions to add, subtract, and manipulate data.
More informationPerformance of Multicore LUP Decomposition
Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations
More information