1 ENVISION. ACCELERATE. ARRIVE. ClearSpeed Technical Training December 2007 Overview 1
2 Presenters Ronald Langhi Technical Marketing Manager Brian Sumner Senior Engineer 2
3 ClearSpeed Technology: Company Background Founded in 2001 Focused on alleviating the power, heat, and density challenges of HPC systems 103 patents granted and pending (as of September 2007) Offices in San Jose, California and Bristol, UK 3
4 Agenda Accelerators ClearSpeed and HPC Hardware overview Installing hardware and software Thinking about performance Software Development Kit Application examples Help and support 4
5 ENVISION. ACCELERATE. ARRIVE. What is an accelerator? 5
6 What is an accelerator? A device to improve performance Relieve main CPU of workload Or to augment CPU's capability An accelerator card can increase performance On specific tasks Without aggravating facility limits on clusters (power, size, cooling) 6
7 All accelerators are good for their intended purpose FPGAs Good for integer, bit-level ops Programming looks like circuit design Low power per chip, but 20x more power than custom VLSI Not for 64-bit FLOPS Cell and GPUs Good for video gaming tasks 32-bit FLOPS, not IEEE Unconventional programming model Small local memory High power consumption (> 200 W) ClearSpeed Good for HPC applications IEEE 64-bit and 32-bit FLOPS Custom VLSI, true coprocessor At least 1 GB local memory Very low power consumption (25 W) Familiar programming model 7
8 The case for accelerators Accelerators designed for HPC applications can improve performance as well as performance per (watt, cabinet, dollar) Accelerators enable: Larger problems for given compute time, or Higher accuracy for given compute time, or Same problem in shorter time Host to card latency and bandwidth are not major barriers to successful use of properly designed accelerators. 8
9 ENVISION. ACCELERATE. ARRIVE. What can be accelerated? 9
10 Good application targets for acceleration Application needs to be both computationally intensive and contain a high degree of data parallelism. Computationally intensive: Software depends on executing large numbers of arithmetic calculations Usually 64-bit FLoating point Operations per Second (FLOPS) Should also have a high ratio of FLOPS to data movement (bandwidth) Computationally intensive applications may run for many hours or more even on large clusters. Data parallelism: Software performs the same sequence of operations again and again but on a different item of data each time Example computationally intensive, data parallel problems include: Large matrix arithmetic (linear algebra) Molecular simulations Monte Carlo options pricing in financial applications And many, many more 10
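The "high ratio of FLOPS to data movement" criterion can be made concrete. A minimal C sketch (the helper name is illustrative, not from any ClearSpeed library) estimating the arithmetic intensity of an n x n matrix multiply, which performs about 2n^3 flops on 3n^2 doubles:

```c
#include <assert.h>

/* Arithmetic intensity (flops per byte) of an n x n double-precision
 * matrix multiply: ~2*n^3 flops over 3*n^2 * 8 bytes of data.
 * Intensity grows linearly with n (it equals n/12), which is why large
 * DGEMM calls tolerate a comparatively slow host-to-accelerator link. */
static double dgemm_intensity(double n)
{
    double flops = 2.0 * n * n * n;    /* one multiply + one add per term */
    double bytes = 3.0 * n * n * 8.0;  /* matrices A, B, C as 8-byte doubles */
    return flops / bytes;              /* = n / 12 */
}
```

A 1200 x 1200 multiply already does 100 flops per byte moved; the same counting shows why low-intensity kernels (e.g. vector addition, at a fraction of a flop per byte) are poor acceleration targets.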
11 Example data parallel problems that can be accelerated Ab initio Computational Chemistry Structural Analysis Electromagnetic Modeling Radar Cross-Section Global Illumination Graphics 11
12 HPC Requirements Accelerator boards increase compute performance on highly specific tasks, without aggravating facility limits on clusters (power, size) Need to consider Type of application Software Data type and precision Compatibility with host (logical and physical) Memory size (local to accelerator) Latency and bandwidth to host 12
13 An HPC-specific accelerator CSX600 coprocessor for math acceleration Assists serial CPU running compute-intensive math libraries Available on add-in boards, e.g. PCI-X, PCIe Potentially integrated on the motherboard Can also be used for embedded applications Significantly accelerates certain libraries and applications Target libraries: Level 3 BLAS, LAPACK, ACML, Intel MKL Mathematical modeling tools: Mathematica, MATLAB, etc. In-house code: Using the SDK to port compute-intensive kernels ClearSpeed Advance board Dual CSX600 coprocessors Sustains 67 GFLOPS for 64-bit matrix multiply (DGEMM) calls PCI-X, PCI Express x8 Low power; typically 25 Watts 13
14 Plug-and-play Acceleration ClearSpeed host-side library CSXL Provides some of the most commonly used and important Level 3 BLAS and LAPACK functions Exploits standard shared/dynamic library mechanisms to intercept calls to L3 BLAS and LAPACK Executes calls heterogeneously across both the multicore host and the ClearSpeed accelerators simultaneously for maximum performance Compatible with ACML from AMD and MKL from Intel User & application do not need to be aware of ClearSpeed Except that the application suddenly runs faster 14
15 Programming considerations Is my main data type integer or floating-point? Is the data parallel in nature? What precision do I need? How much data needs to be local to the accelerated task? Does existing accelerator software meet my needs, or do I have to write my own? If I have to write my own code will the existing tools meet my needs for example: compiler, debugger, and simulator? 15
16 ENVISION. ACCELERATE. ARRIVE. Hardware Overview 16
17 CSX600: A chip designed for HPC ClearSpeed CSX600 Array of 96 Processor Elements; 64-bit and 32-bit floating point Single-Instruction, Multiple-Data (SIMD) 210 MHz -- key to low power 47% logic, 53% memory About 50% of the logic is FPU Hence around one quarter of the chip is floating point hardware Embedded SRAM Interface to DDR2 DRAM Inter-processor I/O ports ~ 1 TB/sec internal bandwidth 128 million transistors Approximately 10 Watts 17
18 CSX600 processor core (Diagram: mono controller and poly controller with instruction and data caches, connected over the system network to PE 0 ... PE 95, programmable I/O to DRAM, the peripheral network, and control and debug.) Multi-Threaded Array Processing Programmed in familiar languages Hardware multi-threading Asynchronous, overlapped I/O Run-time extensible instruction set Array of 96 Processor Elements (PEs) Each has multiple execution units Including double precision floating point and integer units 18
19 CSX600 processing element (PE) (Diagram: each PE n links to PE n-1 and PE n+1 and contains FP Mul, FP Add, Div/Sqrt, MAC, a 128-byte register file, 6 KBytes of PE SRAM, an ALU, and programmed I/O with PIO collection & distribution.) Multiple execution units 4-stage floating point adder 4-stage floating point multiplier Divide/square root unit Fixed-point MAC (16x16) Integer ALU with shifter Load/store 5-port register file (3 reads, 2 writes) Closely coupled 6 KB SRAM for data High bandwidth per PE DMA (PIO) Per PE address generators (serves as hardware gather-scatter) Fast inter-PE communication path 32/64-bit IEEE floating point 19
20 Advance accelerator memory hierarchy Tier 3: host DRAM, 1-32 GBytes typical; ~1 GB/s aggregate to the board Tier 2: on-board DRAM, two banks of 0.5 GBytes (one per CSX600), 1.0 GBytes total; 5.4 GB/s Tier 1: poly memory, 6 KBytes per PE; 192 PEs * 6 KB = 1.1 MB; 161 GB/s aggregate (~0.03 GB/s per PE) Tier 0: per-PE register memory, 128 Bytes; 192 PEs * 128 Bytes = 24 KB; swazzle 322 GB/s, registers 725 GB/s Total: 80 GFLOPS, 1.1 TB/s, but only 25 Watts Per PE arithmetic: 0.42 GFLOPS 20
21 Acceleration by plug-in card Advance X620: 133 MHz PCI-X; two-thirds length (8", 203 mm), full-height form factor Advance e620: PCIe x8; half-length, full-height form factor Both boards: Dual ClearSpeed CSX600 coprocessors Can sustain over 66 GFLOPS for 64-bit matrix multiply (DGEMM) calls and other 64-bit HPC kernels Hardware also supports 32-bit floating point and integer calculations 1 GB of memory on the board Drivers today for Linux (Red Hat and SLES) and Windows (XP, Server 2003) Low power: 25 watts typical Multiple boards can be used together for greater performance 21
22 Host to board DMA performance The board includes a host DMA controller which can act as a bus master. All DMA transfers are at least 8-byte aligned. The host DMA engine will attempt to use the full bandwidth of the bus. Slot type / peak bandwidth / expected DMA speed: PCI Express x8: 2,000 MB/s peak, up to 1,300 MB/s PCI-X 133 MHz: 1,066 MB/s peak, up to 750 MB/s Note: measured bandwidth is highly system-dependent Variations of up to 50% have been observed Depends on system chipset, operating system, bus contention 22
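The table above translates directly into expected transfer times. A small C helper (hypothetical, for back-of-envelope estimation only) using the quoted DMA rates:

```c
#include <assert.h>

/* Seconds to move `mbytes` megabytes at an effective DMA rate of
 * `mb_per_s` MB/s. At the rates quoted above, filling the board's
 * 1 GB of memory takes roughly 0.8 s over PCIe x8 (1,300 MB/s) and
 * roughly 1.4 s over 133 MHz PCI-X (750 MB/s). */
static double dma_seconds(double mbytes, double mb_per_s)
{
    return mbytes / mb_per_s;
}
```

Since these are seconds-scale numbers against compute tasks that also run for seconds, this motivates the later slides on overlapping transfers with computation.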
23 ENVISION. ACCELERATE. ARRIVE. Installing Hardware and Software 23
24 Configuration support Advance supports the following host operating systems: Operating System SuSE Linux Enterprise Server 9 IA32 (x86) AMD64/EM64T (x86-64) Red Hat Enterprise Linux 4 Windows XP SP2 Windows Server 2003 preview Supported host BLAS libraries AMD ACML Intel MKL Goto ATLAS Supported compilers For Linux: gcc, icc, fort, pgf For Windows XP, 2003: Visual C For the latest support information go to 24
25 Base software All ClearSpeed software on Linux is installed using the rpm command. The software consists of three parts: Runtime and driver software Diagnostics ClearSpeed standard libraries, CSXL & CSFFT You can download the latest versions from the ClearSpeed support website: 25
26 Installing base software on Linux 1. Log in to the Linux machine as root and change to the directory containing the drivers package. 2. Install the runtime software, using the command: rpm -i csx600_m512_le-runtime-<version>.<arch>.rpm 3. Install the kernel module - for Linux 2.6 simply install the open source CSX driver using: /opt/clearspeed/csx600_m512_le/drivers/csx/install-csx 4. Install the board diagnostics: rpm -i csx600_m512_le-board_diagnostics-<version>.<arch>.rpm 5. Install the CSXL library package: rpm -i csx600_m512_le-csxl_<version>.<arch>.rpm Note: For Windows a Jungo driver will need to be installed and configured - see the installation manual for more details. 26
27 Confirming successful installation ClearSpeed distributes diagnostic tests to check that the board and drivers are successfully installed: 1. Open a shell window and go to an appropriate directory: cd /tmp 2. Set up ClearSpeed environment variables, by typing: source /opt/clearspeed/csx600_m512_le/bin/bashrc 3. Run the diagnostic program, by typing the command: /opt/clearspeed/csx600_m512_le/bin/run_tests.pl Some tests take several minutes to complete. Each test will write Pass or Fail to standard output. A log file test.log will be written in the current directory. 27
28 csreset The csreset command reinitializes an Advance board and its processors. It must be run after start-up or reboot of the system or simulator. It is also a good idea to run csreset at the start of a batch job that calls the Advance board. The csreset command can take argument flags to provide a finer level of control. These include: -A Specifies that all boards should be reset. -v Verbose output. This shows the details about each board. -h Help. This shows the full list of options. 28
29 If you have problems with software installation Make sure you are logged in as super-user. As root for Linux. As administrator for Windows. If the configure or make install steps fail, check that you have the appropriate header files. Check the preconfigured header files and, if necessary, obtain the appropriate configured header file. If the system cannot access the board but the driver is installed, make sure the board is seated well. Try removing the board and reinstalling. 29
30 ENVISION. ACCELERATE. ARRIVE. Targeting ClearSpeed Advance: Exploiting Data Parallelism 30
31 Alternative approaches Three main approaches to acceleration: 1. Use an application which is already ported 2. Plug and play 3. Custom port using the SDK 31
32 Using an application which is already ported Acceleration: simply insert ClearSpeed Latest list of ported applications: Includes: Amber Mathematica MATLAB Star-P 32
33 Plug and play libraries: CSXL Underlying shared libraries are augmented with ClearSpeed CSXL accelerated functions Includes key functions from: LAPACK Level 3 BLAS As an example, BLAS is used by: AMD ACML Intel MKL Full list on: Application is transparently accelerated No modifications to application 33
34 Acceleration using CSXL and standard libraries (Diagram: the application calls into a CSXL intercept layer, which automatically selects the optimum path between the host library (LAPACK, BLAS, etc.) and the CSXL library (LAPACK, BLAS, etc.) on the accelerator.) 34
35 Considerations for custom port of application Is the task large enough to consider acceleration? Takes time to ship data to the accelerator Accelerator can work in parallel with host Overlap computation Performance considerations Look for areas of data parallelism Overlap compute with data I/O Make full use of ClearSpeed I/O paths Analysis starts with model based on memory tiers and can be verified using performance profiling tools 35
36 Is this trip necessary? Considering I/O (Diagram: node memory <-> node <-> accelerator <-> accelerator memory, linked at bandwidth B.) Time to move N data to or from another node or an accelerator is approximately latency + N/B seconds. Because local memory bandwidth is usually higher than B, the acceleration might be lost in the communication time. Estimate the break-even point for the task. (Note: offloading is different from accelerating, where the host continues working.) (Graph: speed vs. problem size; the accelerator curve crosses the node curve at the break-even point, with the accelerator faster for larger problem sizes.) 36
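The break-even estimate above can be sketched in C. This models the pure offload case (host idle while the card works); all parameter values in the test are illustrative, not measured:

```c
#include <assert.h>

/* Pure offload model: the card wins once
 *     latency + N/B + N*t_acc  <  N*t_host
 * i.e. once N exceeds  latency / (t_host - t_acc - 1/B).
 * Units: latency in seconds, B in items/s, t_* in seconds per item.
 * Returns -1 if the card can never win: the per-item transfer cost
 * eats the entire per-item speedup, regardless of problem size. */
static double offload_breakeven(double latency, double B,
                                double t_host, double t_acc)
{
    double gain_per_item = t_host - t_acc - 1.0 / B;
    if (gain_per_item <= 0.0)
        return -1.0;
    return latency / gain_per_item;
}
```

The acceleration model on the later slides is strictly better: because the host keeps working, the accelerator only has to recover the bandwidth + latency cost, so its break-even point is smaller.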
37 Memory bandwidth dictates performance (Diagram: node memory <-> multicore x86 at 17 GB/s; host <-> accelerator over PCI-X or PCIe at 1 to 2 GB/s; accelerator <-> accelerator DRAM at 5.4 GB/s; accelerator <-> accelerator local RAM at 192 GB/s.) Applications that can stage into local RAM can go 10x faster than current high-end Intel and AMD hosts Applications residing in accelerator DRAM do not make use of the massive local memory bandwidth GPUs face a very similar issue 37
38 Latency and bandwidth: Simple offload model (Diagram: host sends data across the bandwidth and latency gaps, then idles while the accelerator computes and returns results.) Accelerator must be quite fast for this approach to have benefit This mental picture may stem from the early days of Intel 80x87 and Motorola 6888x math coprocessors 38
39 Latency and bandwidth: Acceleration model (Diagram: host keeps computing while data crosses the bandwidth and latency gaps to the accelerator and back.) Host continues working Accelerator needs only be fast enough to make up for time lost to bandwidth + latency Easiest use model: host and accelerator share the same task, like DGEMM More flexible: host and accelerator each specialize in what they do 39
40 Accelerator need not wait for all data before starting (Diagram: transfers overlapped with computation on both host and accelerator.) Host can work while data is moved PCI transfers might burden a single x86 core by up to 40%, but other cores on the host continue productive work at full speed Accelerator can work while data is moved It can be slower than the host, and still add performance! In practice, latency is microseconds while the accelerator task takes seconds; the latency gaps above would be microscopic if drawn to scale 40
41 Performance considerations Look for data parallelism Fine-grained vector operations Medium-grained unrolled independent loops Coarse-grained multiple simultaneous data channels/sets Performance analysis for accelerator cards Like analysis for message-passing parallelism but with more levels of memory and communication Application porting success depends heavily on attention to memory bandwidths (Surprisingly) not so much on the bandwidth between host and accelerator card 41
42 PCI Bus ClearSpeed boards utilize either PCI-X or PCIe busses PCI-X 133 MHz: 1 GB/s peak PCIe x8: 1.6 GB/s peak Available memory on board: 1 GB of 200 MHz DDR2 SDRAM shared by 2 CSX600 processors Must consider both the transfer rate AND the available memory If the application requires more memory, then more communication to the board is necessary Even with an infinitely fast board: Time = Total data size transferred / Bus bandwidth 42
43 PCI Bus Driver performance is very machine-specific and depends on transfer size, direction, etc. (Chart: transfer size vs. transfer rate.) See the Runtime User's Guide for current driver performance 43
44 On-board Memory 2-level memory hierarchy 1 GB mono shared memory 6 KB poly memory per processing element (PE) 6 KB/PE * 96 PEs = 576 KB per CSX600 Peak bandwidth between levels 2.7 GB/s x 2 chips = 5.4 GB/s Must consider both the transfer rate AND the available memory If the application requires more memory, then more communication to the board is necessary Even with infinitely fast PEs: Time = Total data size transferred / Bandwidth between levels Secondary considerations Burst size: 64 Bytes/PE (i.e., 8 doubles) Transfers can be smaller, but at reduced efficiency 44
45 SIMD Computing What is SIMD? Single Instruction, Multiple Data Each PE sees the same instruction stream Each PE issues load, multiply, etc., simultaneously But acts on different data per PE PARALLEL COMPUTATION ClearSpeed SIMD is enhanced by: Local memory for each PE data management is easier within poly memory does not require adjacent access for all 96 elements involved in the computation from shared memory pool PEs can be enabled/disabled not required to use all PEs always useful for handling boundaries 45
46 SIMD Array 96 PEs per CSX600 at 210 MHz One double precision multiply-accumulate per cycle 4-cycle pipeline depth for multiply and accumulation For top performance, use operations on 4-element vectors on each PE Nearest-neighbor communication swazzle path topology is a line or ring Bandwidth: 8 Bytes per cycle between register files 8 * 96 * 210 MHz = 161 GB/s Useful for fine-grained communication 46
47 Good Example Kernels Dense Linear Algebra Matrix-Matrix products (DGEMM) Low memory bandwidth required = high data re-use Inner kernel: Matrix-multivector product 96x96 matrix, x4 vectors» 96x96 matrix due to 96 PEs» 4 vectors due to multiply/accumulate pipeline depth Monte Carlo (computational finance) Embarrassingly parallel task distribution Very little data requirement Molecular Dynamics (Amber, BUDE) Large numbers of identical tasks can be found Requires small working data sets 47
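The inner kernel named above, a 96x96 matrix applied to 4 vectors, can be written as plain C for reference. This is a scalar sketch of the math only; on the card, each PE would own one matrix row and the 4 columns keep the 4-stage multiply-accumulate pipeline full:

```c
#include <assert.h>

#define N 96   /* one matrix row per PE */
#define V 4    /* vectors in flight: matches the 4-stage MAC pipeline */

/* Matrix-multivector product: C[N][V] += A[N][N] * X[N][V].
 * The j-loop gives 4 independent accumulations per (i,k) pair, which
 * is what hides the pipeline latency on the real hardware. */
static void matmv(double A[N][N], double X[N][V], double C[N][V])
{
    for (int i = 0; i < N; i++)          /* row i: PE i's work */
        for (int k = 0; k < N; k++)
            for (int j = 0; j < V; j++)  /* 4 independent accumulators */
                C[i][j] += A[i][k] * X[k][j];
}
```

High data re-use is visible here: each element of A is loaded once but contributes to 4 results, and each element of X contributes to 96.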
48 Possible Kernel Partial Differential Equations Some are memory bandwidth limited, so not a good candidate for ClearSpeed acceleration small stencil implies little computation per grid point wide, sparse stencil implies large active data set But, some PDE simulations are good candidates require a small grid, so can run entirely in PE memory (computational finance) have large, dense stencils large amounts of computation per grid point sufficiently small active data set implicit time stepping large systems of equations solved via direct methods direct solvers utilize dense linear algebra kernels (i.e., DGEMM) 48
49 Keys to Success Parallelism is essential Proper management of the poly memory is also critical Application must accept memory bandwidth limits PCIe or PCI-X On-board memory hierarchy SDK enables asynchronous data transfers permits efficient double buffering to manage data streams, accommodating the size limit Application must employ a small working data set less than 576 KB, distributed across 96 PEs also aware of 1 GB shared memory limit While developing ClearSpeed applications, use the ClearSpeed Visual Profiler to discover what is actually happening on the board! 49
50 Remember the host processor Today's multi-core hosts are very useful for managing other tasks that are not accelerated by ClearSpeed Many applications can overlap these tasks with ClearSpeed accelerated tasks Profile the host portion of your application as well using any of a variety of tools Use the ClearSpeed Visual Profiler for CSAPI utilization 50
51 General optimization techniques Latency hiding Overlap compute with I/O Data reuse On-chip swazzle path Maximize PE usage Ensure all PEs are processing, not idle 51
52 Overlap data with compute Double-buffer Many levels of data I/O compute parallelism PE load/store overlaps PE compute PE to board memory can also overlap Board memory to host memory can also overlap Hence, if task is compute bound: Data takes no time to transfer If task is I/O bound: Compute takes no time to calculate 52
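The claim at the end of the slide, compute-bound tasks hide all transfer time and I/O-bound tasks hide all compute time, follows from a simple two-stage pipeline model of double buffering. A C sketch (the formula and names are a model of the technique, not SDK code):

```c
#include <assert.h>

/* Two-stage pipeline model of double buffering: the transfer of chunk
 * k+1 (taking t_io) overlaps with computation on chunk k (taking t_c).
 * Total for n chunks = t_io + (n-1)*max(t_io, t_c) + t_c.
 * If compute bound (t_c >= t_io): total -> t_io + n*t_c, transfers free.
 * If I/O bound (t_io >= t_c): total -> n*t_io + t_c, compute free. */
static double overlapped_time(int n, double t_io, double t_c)
{
    double stage = (t_io > t_c) ? t_io : t_c;
    return t_io + (n - 1) * stage + t_c;
}
```

Compare against the unoverlapped cost n*(t_io + t_c): overlap saves (n-1)*min(t_io, t_c), which is why balanced pipelines benefit most.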
53 Data reuse Swazzle path Left or right 64 bit transfer (8 bytes) 8 bytes per cycle, so ~161GB/s per CSX processor Can be complete loop or linear chain Parallel with other data I/O Register-register move On-off chip in parallel Doesn t impinge on DRAM access PE local memory register in parallel Doesn t impinge on local memory access 53
54 Maximize PE usage Aim for 100% efficiency PEs use predicated execution PEs are disabled rather than code skipped Minimize effects extract common code from conditionals Mono processor can branch Skip blocks of code 54
55 Detail of I/O widths for performance analysis Each accelerator board has: 161 GB/s bandwidth PE register to PE memory (4 bytes per cycle) 322 GB/s swazzle path bandwidth (8 bytes per cycle) 968 GB/s bandwidth PE register to PE ALU (24 bytes per cycle) 5.4 GB/s DRAM bandwidth (32 bytes per cycle) (Aggregate bandwidth for two CSX600 chips.) (Diagram: PE n with FP Mul, FP Add, Div/Sqrt, MAC, 128-byte register file, 6 KBytes of PE SRAM, ALU, and programmed I/O; swazzle links to PE n-1 and PE n+1 at 322 GB/s; PIO collection & distribution to the 1 GByte CSX DRAM at 5.4 GB/s.) 55
56 ENVISION. ACCELERATE. ARRIVE. Software Development Kit 56
57 ClearSpeed SDK overview C n compiler C with extension for SIMD control Assembler Linker Simulator Debugger Graphical profiler Libraries Documentation Available for Windows XP / 2003 and Linux (Red Hat Enterprise Linux 4 and SLES 9) 57
58 Agenda 1. Introduction to C n 2. C n Libraries 3. Debugging C n 4. CSAPI: Host / Board Communication 58
59 ENVISION. ACCELERATE. ARRIVE. Introduction to C n 59
60 Software Development The CSX architecture is simpler to program: Single program for serial and parallel operations Architecture and compiler co-designed Instruction and data caches Simple, regular 32-bit instruction set Large, flexible register file Fast thread context switching Built-in debug support Same development process as traditional architectures: compile, assemble, link C n is a simple parallel extension of C 60
61 C n C with vector extensions for CSX New Keywords mono and poly storage qualifiers mono is a serial (single) variable poly is a parallel (vector) variable Mono variables live in the 1 GB DRAM Poly variables live in the 6 KB SRAM of each PE 61
62 C n differences from C New data type multiplicity modifiers: mono: denotes serial variable resident in mono memory mono is the default multiplicity poly: denotes parallel/vector variable resident in poly memory local to each PE applies to pointers, doubly so: mono int * poly foo; foo is a pointer in poly memory to an int in mono memory poly int * mono bar; bar is a pointer in mono memory to an int in poly memory int * poly * mono good_grief; as you would expect. Pointer sizes: mono int * 4 bytes (32-bit addressable space, 512 MB) poly int * 2 bytes (16-bit addressable space, 6 KB) 62
63 C n differences from C Execution context: Alters branch/jump behavior In mono context, jumps occur as in traditional architecture In poly context, PEs are enabled/disabled if (penum>32) { } else { } disables false PEs on true branch, then re-enables the false PEs and disables the other PEs for the false branch both branches executed break, continue return select PEs get disabled until the end of scope on all PEs select PEs get disabled until all PEs return, or end of scope 63
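The poly-context branch behavior can be emulated in plain C with an explicit enable mask (a sketch of the semantics only, not SDK code): both branches are issued to every PE, and the mask decides which PEs commit results.

```c
#include <assert.h>

#define NPES 96

/* Emulate the slide's example:
 *     if (penum > 32) { r = 1; } else { r = 2; }
 * Both branches are walked in full; each PE commits only where its
 * enable flag matches the branch being issued. */
static void poly_if(int r[NPES])
{
    int enable[NPES];
    for (int pe = 0; pe < NPES; pe++)      /* evaluate condition per PE */
        enable[pe] = (pe > 32);
    for (int pe = 0; pe < NPES; pe++)      /* "true" branch issued to all */
        if (enable[pe]) r[pe] = 1;
    for (int pe = 0; pe < NPES; pe++)      /* "false" branch issued to all */
        if (!enable[pe]) r[pe] = 2;
}
```

The cost model falls out of the emulation: a poly-if always pays for both branches, which is why the later optimization slide recommends extracting common code from conditionals.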
64 Porting C to C n (Example 1) C code int i, j; for( i=0; i<96; i++ ) { j = 2*i; } Similar C n code poly int i, j; i = get_penum(); // i=0 on PE0, i=1 on PE1 etc. j = 2*i; // j=0 on PE0, j=2 on PE1 etc. 64
65 Porting C to C n (Example 2) C code int i; for( i=0; i<n; i++ ) { } Similar C n code poly int me, i; mono int npes; me = get_penum(); // me=0 on PE0, me=1 on PE1 etc. npes = get_num_pes(); // npes = 96 // i=0,96,192, ; 1,97,193, etc. for( i=me; i<n; i+=npes ) { } 65
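The strided loop in Example 2 partitions 0..n-1 across the PEs with no gaps and no overlap. A quick C check of that property (the helper is hypothetical, written only to verify the distribution):

```c
#include <assert.h>
#include <string.h>

#define NPES 96

/* Count how many times each index in 0..n-1 is visited when every PE p
 * runs the slide's loop:  for (i = p; i < n; i += NPES).
 * Returns 1 if every index is visited exactly once. `hits` must have
 * room for n ints. */
static int covered_exactly_once(int n, int *hits)
{
    memset(hits, 0, n * sizeof(int));
    for (int pe = 0; pe < NPES; pe++)
        for (int i = pe; i < n; i += NPES)
            hits[i]++;
    for (int i = 0; i < n; i++)
        if (hits[i] != 1)
            return 0;
    return 1;
}
```

Note the distribution also load-balances well: when n is not a multiple of 96, PE counts differ by at most one iteration.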
66 Simple C n example void foo (double *A, double *B, int n) { // Assume n is divisible by 24*96. poly double mat[4]={1.,2.,3.,4.}; poly double a[24]; poly double b[4]={0.,0.,0.,0.}; int i; while (n) { memcpym2p (a, A+24*get_penum(), 24*sizeof(double)); A+=24*96; for (i=0; i<24; i+=2) { b[0] += a[i]*mat[0] + a[i+1]*mat[1]; b[1] += a[i+1]*mat[0] + a[i]*mat[1]; b[2] += a[i]*mat[2] - a[i+1]*mat[3]; b[3] += a[i+1]*mat[2] - a[i]*mat[3]; } n -= 24*96; } memcpyp2m (B+4*get_penum(), b, 4*sizeof(double)); return; } 66
67 ENVISION. ACCELERATE. ARRIVE. C n Libraries 67
68 Runtime libraries C n supports standard C runtime, including: malloc printf sqrt memcpy C n extensions include: sqrtp memcpym2p / memcpyp2m get_penum swazzle any / all 68
69 Asynchronous I/O For most efficient use of limited PE memory, overlap data transfers between mono memory and poly: async_memcpym2p/p2m sem_sig / sem_wait For greatest efficiency, async_memcpy routines bypass the data cache, so coherency must be maintained: dcache_flush / dcache_flush_address 69
70 Asynchronous I/O example void foo(double *A, double *B,int n) { // Assume n is divisible by 24*96 poly unsigned short penum=get_penum(); poly double mat[4]={1.,2.,3.,4.}; poly double a_front[12], a_back[12]; poly double b[4]={0.,0.,0.,0.}; int i; async_memcpym2p(19,a_front,A+12*penum,12*sizeof(double));A+=12*96; n-=24*96; while (n) { async_memcpym2p(17,a_back,A+12*penum,12*sizeof(double));A+=12*96; sem_wait(19); for (i=0;i<12;i+=2) { b[0] += a_front[i]*mat[0] + a_front[i+1]*mat[1]; b[1] += a_front[i+1]*mat[0] + a_front[i]*mat[1]; b[2] += a_front[i]*mat[2] - a_front[i+1]*mat[3]; b[3] += a_front[i+1]*mat[2] - a_front[i]*mat[3]; } n-=12*96; async_memcpym2p(19,a_front,A+12*penum,12*sizeof(double));A+=12*96; sem_wait(17); for (i=0;i<12;i+=2) { // compute on a_back, then finish outside while loop 70
71 ENVISION. ACCELERATE. ARRIVE. C n Pointers 71
72 C n mono and poly pointers Using mono and poly with pointers mono int * mono mpmi mono pointer to mono int poly int * mono mppi mono pointer to poly int mono int * poly ppmi poly pointer to mono int poly int * poly pppi poly pointer to poly int Most commonly used is mono pointer to poly poly <type> * mono <variable_name> 72
73 C n mono and poly pointers mono pointer to mono int mono int * mono mpmi (Diagram: the pointer lives in mono memory and points to an int in mono memory.) 73
74 C n mono and poly pointers mono pointer to poly int poly int * mono mppi (Diagram: the pointer lives in mono memory and points to an int in the poly memory of each PE. Note: Points to the same location in each PE.) 74
75 C n mono and poly pointers poly pointer to poly int poly int * poly pppi (Diagram: each PE holds its own pointer in poly memory, pointing to an int in its own poly memory. Note: Pointer stored in same location in each PE.) 75
76 C n mono and poly pointers poly pointer to mono int mono int * poly ppmi (Diagram: each PE holds its own pointer in poly memory, pointing to an int in mono memory. Note: Pointer stored in same location in each PE.) 76
77 ENVISION. ACCELERATE. ARRIVE. Conditional Expressions 77
78 Conditional Expressions: mono-if Conditions based on mono expressions Expression has same value on all PEs Code block selected according to expression and branch instruction executed mono int i, j; i = j = 1; if( i == j ) { // this block executed on all PEs } else { // this block branched over on all PEs } 78
79 Conditional Expressions: poly-if Conditions based on poly expressions Expression may have different values on different PEs But SIMD model implies all PEs execute same instruction simultaneously All branches executed on all PEs, with PE enabled if conditional expression is true (like predicated instructions) poly int i; i = get_penum(); if( i < 48 ) { // PEs 0, 1, 2, execute instructions // PEs 48, 49, instructions issued but ignored } else { // PEs 0, 1, 2, instructions issued but ignored // PEs 48, 49, execute instructions } 79
80 Conditional Expressions: poly-while While loops based on poly expressions Loop continues execution until condition is false on all PEs PEs will be disabled one by one until while condition is false on all PEs count keeps track of total number of iterations (95 in this case: the largest initial value of me) mono int count = 0; poly int me; me = get_penum(); while( me > 0 ) { --me; ++count; } 80
81 Other variations between C and C n Labeled break and continue statements No switch statement using poly variables (use multiple if statements) No goto statement in poly context 81
82 ENVISION. ACCELERATE. ARRIVE. Moving Data 82
83 Data flow Board and host communicate via Linux kernel module or Windows driver Create a handle and establish the connection 83
84 Data flow Register intent of using the first processor on the card Load the code onto the enabled processor 84
85 Data flow Transfer data from host to board Semaphores synchronize transfers between host and board 85
86 Data flow Run the code on the enabled processor Host can continue with other work 86
87 Data flow Send results back to host Halt board program and clean up 87
88 Implicit broadcast from mono to poly Implicit broadcast from mono to poly by assignment mono int m = 7; poly int p; p = m; // Implicit broadcast to all PEs Assigning poly to mono is not permitted mono int m; poly int p = get_penum(); m = p; // NO! m would receive a different value from each PE 88
89 Explicit data movement mono to poly memcpym2p(); async_memcpym2p() Memory copy of n bytes from mono to poly Source is a poly pointer to mono memory, which can have a different value for each PE Destination is a mono pointer to poly memory, that is, the destination address is the same for all PEs (Diagram: source data in mono memory scattered to the same destination on each PE, PE0, PE1, PE2, ..., PE95.) 89
90 Explicit data movement poly to mono memcpyp2m(); async_memcpyp2m() Memory copy of n bytes from poly to mono Source is a mono pointer to poly memory; therefore the source address is the same for every PE Destination is a poly pointer to mono memory, which can have a different value for each PE (Diagram: data gathered from the same source address on each PE, PE0, PE1, PE2, ..., PE95, into destination data in mono memory.) 90
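The pointer directions above can be pictured with plain arrays: mono memory as one flat array, poly memory as one small array per PE. A C sketch of a memcpym2p-style scatter (not the SDK's implementation; names and the chunk size are illustrative) where each PE receives a different slice of mono memory:

```c
#include <assert.h>
#include <string.h>

#define NPES  96
#define CHUNK 4

/* Scatter: PE p's poly buffer receives mono[p*CHUNK .. p*CHUNK+CHUNK-1].
 * The poly destination offset is the same on every PE, while the mono
 * source address differs per PE - mirroring memcpym2p's pointer types
 * (mono pointer to poly destination, poly pointer to mono source). */
static void m2p_scatter(const double *mono, double poly[NPES][CHUNK])
{
    for (int pe = 0; pe < NPES; pe++)
        memcpy(poly[pe], mono + pe * CHUNK, CHUNK * sizeof(double));
}
```

The memcpyp2m gather is the mirror image: same per-PE source offset, different mono destinations.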
91 Explicit data movement asynchronous async_memcpym2p(); async_memcpyp2m() Asynchronous memory copy of n bytes from mono to poly or from poly to mono Computation continues during data copy Mono memory data cache NOT flushed Restrictions on alignment of data Use semaphores to wait for completion of copy Much higher bandwidth than synchronous versions dcache_flush(); async_memcpym2p( semaphore, ); // computation continues sem_wait( semaphore ); // use data that has been transferred from mono memory 91
92 Explicit data movement swazzle Register-to-register transfer between neighboring PE s PE n ALU Status flags To: PE n-1 Register file To: PE n+1 Memory Enable stack 92
93 Swazzle operations Assembly language versions operate directly on register file C n versions operate on data and include implicit data movement from memory to registers Variants swazzle_up( poly int src ); // copy to higher numbered PE swazzle_down( poly int src ); // copy to lower numbered PE swazzle_up_generic( poly void *dst, poly void *src, unsigned int size ); swazzle_down_generic( ); Similar swazzles operating on other data types Functions to set data copied into ends of swazzle chain 93
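Swazzle semantics can be pictured as a simultaneous register shift across the whole PE array. A C emulation of the ring variant (function name illustrative; the real swazzle moves register-file contents in hardware):

```c
#include <assert.h>

#define NPES 96

/* Emulate swazzle_up on a ring: each PE's value is copied to the next
 * higher-numbered PE, so PE p receives the value PE p-1 held; PE 0
 * wraps around from PE NPES-1. All transfers are simultaneous, so we
 * shift from a snapshot rather than updating in place. */
static void swazzle_up_ring(int reg[NPES])
{
    int prev[NPES];
    for (int pe = 0; pe < NPES; pe++)
        prev[pe] = reg[pe];                      /* snapshot before shift */
    for (int pe = 0; pe < NPES; pe++)
        reg[pe] = prev[(pe + NPES - 1) % NPES];  /* take lower neighbor */
}
```

swazzle_down is the same shift in the opposite direction; the linear-chain variant would instead inject a specified value at the end of the chain rather than wrapping.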
94 Data movement bandwidths per CSX600 Mono memory to poly memory 2.7 GB/s aggregate over 96 PEs Poly memory to registers 840 MB/s per PE, 81 GB/s aggregate Swazzle path bandwidth 1680 MB/s per PE, 161 GB/s aggregate Total bandwidth for Advance board (2 CSX600 processors) ~0.5 TB/s 94
95 DMA performance Advance board has a host DMA controller which can act as a PCI bus master All DMA transfers are at least 8-byte aligned Host DMA engine will attempt to use the entire bus bandwidth (Chart: ClearSpeed Advance DMA performance, transfer rate in MB/s vs. transfer size in MB, for e620 read/write and X620 read/write averages.) 95
96 ENVISION. ACCELERATE. ARRIVE. CSAPI Host - Board communication 96
97 Host-Board interaction basics The basic model for interaction between the host and the card is very simple: The ClearSpeed board can signal and wait for semaphores; it cannot initiate data transactions with the host. The host pushes data to and pulls data from the board. The host can also signal and receive semaphores. 97
98 Connecting to the board A host application needs to perform the following sequence to launch a process on the board: Create a CSAPI handle CSAPI_new Establish a connection with the board CSAPI_connect Register the host application with the driver CSAPI_register_application Load the CSX application on the desired chip CSAPI_load Run the CSX application on the desired chip CSAPI_run 98
99 Interacting with the board Get board memory address of a known symbol CSAPI_get_symbol_value This must be done after the application is loaded, if the dynamic load capability is to be used. Write/Read data to a retrieved memory address CSAPI_write_mono_memory CSAPI_read_mono_memory Asynchronous variants of these routines also exist A process does not need to be running for these operations to succeed, but the process needs to be loaded. These should not be performed DURING process termination. Managing semaphores CSAPI_allocate_shared_semaphore Declares a semaphore for use on both host and card CSAPI_semaphore_wait CSAPI_semaphore_signal 99
100 Cleaning up
Process termination: CSAPI_wait_on_terminate, CSAPI_get_return_value
Clean-up: CSAPI_delete
See the CSX600 Runtime Software User Guide for more details, including: managing multiple processes on the board/chip at once; managing board control registers; board reset; managing multi-threaded CSX applications; board memory allocation; managing multiple boards/chips; error handling
100
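Put together, slides 98 through 100 suggest a host-side lifecycle along these lines. This is a C-style sketch only: the CSAPI function names come from the slides, but every argument list here is an assumption; consult the CSX600 Runtime Software User Guide for the real signatures and error handling.

```
/* Illustrative sketch -- argument lists are assumptions, not the real API */
CSAPI *h;
CSAPI_new(&h);                          /* create a handle          */
CSAPI_connect(h);                       /* connect to the board     */
CSAPI_register_application(h);          /* register with the driver */
CSAPI_load(h, chip, "foo.csx");         /* load the CSX application */
CSAPI_run(h, chip);                     /* start it                 */

CSAPI_get_symbol_value(h, "input", &addr);    /* after load         */
CSAPI_write_mono_memory(h, addr, buf, size);  /* push input data    */
CSAPI_semaphore_signal(h, sem);               /* tell the card: go  */
CSAPI_semaphore_wait(h, sem);                 /* wait for result    */
CSAPI_read_mono_memory(h, addr, buf, size);   /* pull output data   */

CSAPI_wait_on_terminate(h, &status);    /* process termination      */
CSAPI_get_return_value(h, &rv);
CSAPI_delete(h);                        /* clean up                 */
```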
101 ENVISION. ACCELERATE. ARRIVE. Debugging Cn 101
102 csgdb
csgdb is a port of the open-source gdb debugger: full symbolic debugging of mono/poly variables; full gdb breakpoint support; step through Cn or assembly; view mono and poly registers; view PE enabled state
Also accessible via DDD, which allows graphical data visualization
102
103 Debug control
To enable debugging:
export CS_CSAPI_DEBUGGER=1 initializes the debug interface within the host application
export CS_CSAPI_DEBUGGER_ATTACH=1 makes the host application write a port number to stdout and wait for <Return/Enter> to be pressed, so that csgdb can be manually attached to the connected board process
Launch the host application (this can be done with or without a debugger)
Launch csgdb in a new shell: csgdb <csx_file_name> <port_number>
No need to connect, as the host application did this already; set desired breakpoints; run
Note that the host is currently blocked waiting for <Return/Enter>, so the card process may also be blocked waiting for the host. Press Return in the host shell for the host and card applications to proceed.
103
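Collected into one session sketch (the host binary name and the port number are hypothetical placeholders; the host application prints the real port):

```shell
# Shell 1: enable the debug interface, then start the host application
export CS_CSAPI_DEBUGGER=1
export CS_CSAPI_DEBUGGER_ATTACH=1
# ./my_host_app             # hypothetical host binary: prints a port number,
                            # then blocks waiting for <Return/Enter>

# Shell 2: attach csgdb to the reported port
# csgdb foo.csx 4321        # port number as printed by the host
# (gdb) break main
# (gdb) run
# Back in shell 1, press <Return/Enter> so host and card proceed
```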
104 csgdb Debugger (Shown with ddd Front-end) On-chip poly array contents displayed Real time plot of contents of PE memory Cn source-level break point, watch points, single step, etc. Register contents Disassembly, break point, watch points, single step, etc. 104
105 csgdb Command-line example
% cscn foo.cn -g -o foo.csx
% csgdb ./foo.csx
(gdb) connect
0x in FRAME_BEGIN_MONO ()
(gdb) break 109
Breakpoint 1 at 0x800154c0: file foo.cn, line 109.
(gdb) run
Starting program: /home/kris/my_app/foo.csx
Breakpoint 1, main () at foo.cn:109
(gdb) next
110 y = MINY + (get_penum() * STEPY);
(gdb) print y
$1 = {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1}
105
106 ENVISION. ACCELERATE. ARRIVE. ClearSpeed Visual Profiler Explaining Performance 106
107 ClearSpeed Visual Profiler (csvprof)
Host tracing: trace CSAPI functions; user can infer overlapping host/board utilization; locate hot-spots
Board tracing: trace board-side functions without instrumentation; locate hot-spots
Board hardware utilization: display activity of CSX functional units, including ld/st, PIO, SIMD microcode, instruction cache, data cache, and thread; cycle accurate; view corresponding source
Unified GUI
107
108 Detailed profiling is essential for accelerator tuning
HOST CODE PROFILING: Visually inspect multiple host threads. Time specific code sections. Check overlap of host threads.
HOST/BOARD INTERACTION: Infer cause and effect. Measure transfer bandwidth. Check overlap of host and board compute.
ACCELERATOR PIPE: View instruction issue. Visualize overlap of executing instructions. Get cycle-accurate timing. Remove instruction-level performance bottlenecks.
CSX600 SYSTEM: Trace at system level. Inspect overlap of compute and I/O. View cache utilization. Graph performance.
[Diagram: host CPUs connected to Advance accelerator boards, each carrying two CSX600 processors with their pipelines]
108
109 csvprof: Host tracing
Dynamic loading of the CSAPI trace implementation
Triggered with an environment variable: export CS_CSAPI_TRACE=1
(Recall the similar enabling of debug support: export CS_CSAPI_DEBUGGER=1)
Specify the tracing format: export CS_CSAPI_TRACE_CSVPROF=1 (currently this is the only implementation, but others may be added in the future)
Specify the output file for the trace: export CS_CSAPI_TRACE_CSVPROF_FILE=mytrace.cst (default filename = csvprof_data.cst)
Output file written during CSAPI_delete
109
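The variables above can be collected into a small shell fragment; the host binary name is a hypothetical placeholder:

```shell
# Enable CSAPI host tracing in csvprof's format (values from the slide)
export CS_CSAPI_TRACE=1
export CS_CSAPI_TRACE_CSVPROF=1
export CS_CSAPI_TRACE_CSVPROF_FILE=mytrace.cst   # default: csvprof_data.cst
# ./my_host_app     # hypothetical host binary; the trace file is written
                    # when the application calls CSAPI_delete
```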
110 csvprof: Host-Board interaction 110
111 csvprof: Host code profile Linpack benchmark 111
112 csvprof: CSX600 system profile 112
113 csvprof: Accelerator pipeline profile 113
114 csvprof: Instruction pipeline stalls 114
115 csvprof: Advance board tracing
Enabled using the debugger, csgdb; can be used interactively or through a gdb script
Can select events to profile, or all events
Requires buffer allocation on the card; today this is done statically
One could use CSAPI to allocate the buffer, but the developer must then pass the buffer's location and size through to csgdb
Easy if running on only one chip: place the buffer in the other chip's memory
Explicit dump to generate the trace file; can control the type of data to be dumped
115
116 csvprof: Sample gdb script
% cat ./csgdb_trace.gdb
connect
load ./foo.csx
cstrace buffer 0x 0x
cstrace event all on
tbreak test_me
continue
cstrace enable
continue
cstrace dump foo.cst
cstrace dump branch dgemm_test4_branch.cst
quit
% csgdb --command=./csgdb_trace.gdb
116
117 ENVISION. ACCELERATE. ARRIVE. Tuning Tips 117
118 Pipelined arithmetic
Four-stage floating-point pipeline
Use vector types, vector intrinsic functions, and the vector math library for high efficiency

DVECTOR a, b, c;
poly double x[n];
a = *((DVECTOR *)&x[0]);
b = *((DVECTOR *)&x[4]);
c = cs_sqrt( cs_vadd( a, b ) );
118
119 Poly conditionals
When possible, remove common subexpressions from poly if-blocks to reduce the amount of replicated work.
It may even pay to compute and throw away results if that leads to fewer poly conditional blocks.
A poly if-block uses predicated instructions, not a branch, so it is cheap as long as few additional instructions are executed.
119
120 Poly loop counters Loops with poly counters are more expensive than those with mono counters Use mono loop counters if possible 120
121 Arrays
Pointer incrementing is more efficient than using array index notation
Poly addresses require 16 bits: use short for poly pointer increments, which avoids conversion of int to short
121
122 Data transfer
Synchronous functions are completely general but flush the data cache on each transfer: memcpyp2m(), memcpym2p()
Asynchronous functions maximize performance: they do not flush the cache, but they have data size and alignment restrictions and require use of a wait semaphore: async_memcpyp2m(); sem_wait(); async_memcpym2p(); sem_wait()
Large data blocks are more efficient than small blocks, in every direction: host to board, board to host, mono to poly, poly to mono
122
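As a hedged Cn-style sketch of the asynchronous pattern above (the function names come from the slide; the argument order and the semaphore handling are assumptions, so check the Cn library manual for the real signatures):

```
poly float buf[N];

/* Synchronous: completely general, but flushes the data cache per call */
memcpym2p(buf, mono_src, N * sizeof(float));

/* Asynchronous: no cache flush, but size/alignment restrictions apply
 * and completion must be awaited on a semaphore */
async_memcpym2p(sem, buf, mono_src, N * sizeof(float));
/* ... overlap independent compute here ... */
sem_wait(sem);   /* transfer is guaranteed complete after this */
```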
123 ENVISION. ACCELERATE. ARRIVE. Application Examples 123
124 Math function speed comparison
[Chart: 64-bit math function throughput, in billions of operations per second, for Sqrt, InvSqrt, Exp, Ln, Cos, Sin, SinCos, and Inv, comparing a dual-core Opteron, a 3 GHz dual-core Woodcrest, and the ClearSpeed Advance card]
Typical speedup of ~8X over the fastest x86 processors, because math functions stay in local memory on the card
124
125 Nucleic Acid Builder (NAB)
Newton-Raphson refinement now possible; large DGEMM calls from computed second derivatives will be in AMBER
Speedup obtained for this operation in three hours of programmer effort
Enables accurate computation of entropy and Gibbs free energy for the first time
AMBER itself has cases that ClearSpeed accelerates by 3.2x to 9x, with 5x to 17x possible once we exploit the symmetry of atom-atom interactions
125
126 AMBER molecular modeling with ClearSpeed
[Chart: AMBER Generalized Born models 1, 2, and 6, run time in minutes, host vs. Advance X620]

AMBER model   Host      Advance X620   Speedup
Gen Born 1    ... min   24.6 min       3.4
Gen Born 2    ... min   23.5 min       3.6
Gen Born 6    ... min    4.0 min       ...
126
127 Monte Carlo methods exploit high local bandwidth
Monte Carlo methods are ideal for ClearSpeed acceleration: high regularity and locality of the algorithm; very high compute-to-I/O ratio; very good scalability to high degrees of parallelism; needs 64-bit
Excellent results for parallelization: achieving 10x performance per Advance card vs. highly optimized code on the fastest x86 CPUs available today
Maintains the high precision required by the computations: true 64-bit IEEE 754 floating point throughout
25 W per card typical when the card is computing
ClearSpeed has a Monte Carlo example code, available in source form for evaluation
127
128 Monte Carlo applications scale very well
No acceleration: 200M samples, 79 seconds
1 Advance board: 200M samples, 3.6 seconds
5 Advance boards: 200M samples, 0.7 seconds
[Chart: European option pricing model speedup vs. number of ClearSpeed Advance boards]
128
129 Why do Monte Carlo applications need 64-bit?
Accuracy increases as the square root of the number of trials, so five-decimal accuracy takes 10 billion trials.
But when you sum many similar values, you start to lose all the significant digits. 64-bit summation is needed to get even a single-precision result!
Single precision: 1.0 x 10^8 + 1 = 1.0 x 10^8
Double precision: 1.0 x 10^8 + 1 = 1.00000001 x 10^8
129
130 ENVISION. ACCELERATE. ARRIVE. Help and Support 130
131 Installed documentation
docs directory: CSXL user guide; runtime user guide; csvprof (Visual Profiler) overview and examples; SDK getting started; gdb manual; instruction set manual; Cn library manual; reference manual; release notes
examples directory
131
132 ClearSpeed online
Company website: general information, news, etc.
Support website: support.clearspeed.com (report a problem, find answers, etc.)
The support website has: documentation, user guides, reference manuals; solutions knowledge base; software downloads; log a case
132
133 Join the ClearSpeed Developer Program! Designed to support the leading-edge community of developers using accelerators Membership is free and has the following benefits: Access to the ClearSpeed Developer website ClearSpeed Developer Community on-line forum Invitation to participate in ClearSpeed Developer & User Community meetings and events Repository to share and access demonstrations and sample codes within the ClearSpeed Developer Community Technical updates, tips and tricks from the gurus at ClearSpeed and the Developer Community And more, including opportunities to preview new software releases and developer discount programs. Leverage the expertise of developers worldwide. Ask a question, or share your knowledge. Register now at developer.clearspeed.com! 133
134 134
More informationIntel C++ Compiler Professional Edition 11.1 for Mac OS* X. In-Depth
Intel C++ Compiler Professional Edition 11.1 for Mac OS* X In-Depth Contents Intel C++ Compiler Professional Edition 11.1 for Mac OS* X. 3 Intel C++ Compiler Professional Edition 11.1 Components:...3 Features...3
More informationCell SDK and Best Practices
Cell SDK and Best Practices Stefan Lutz Florian Braune Hardware-Software-Co-Design Universität Erlangen-Nürnberg siflbrau@mb.stud.uni-erlangen.de Stefan.b.lutz@mb.stud.uni-erlangen.de 1 Overview - Introduction
More informationA Multi-Tiered Optimization Framework for Heterogeneous Computing
A Multi-Tiered Optimization Framework for Heterogeneous Computing IEEE HPEC 2014 Alan George Professor of ECE University of Florida Herman Lam Assoc. Professor of ECE University of Florida Andrew Milluzzi
More informationIBM Cell Processor. Gilbert Hendry Mark Kretschmann
IBM Cell Processor Gilbert Hendry Mark Kretschmann Architectural components Architectural security Programming Models Compiler Applications Performance Power and Cost Conclusion Outline Cell Architecture:
More informationArchitecture without explicit locks for logic simulation on SIMD machines
Architecture without explicit locks for logic on machines M. Chimeh Department of Computer Science University of Glasgow UKMAC, 2016 Contents 1 2 3 4 5 6 The Using models to replicate the behaviour of
More informationOpenACC Course. Office Hour #2 Q&A
OpenACC Course Office Hour #2 Q&A Q1: How many threads does each GPU core have? A: GPU cores execute arithmetic instructions. Each core can execute one single precision floating point instruction per cycle
More informationIntroduction to GPU computing
Introduction to GPU computing Nagasaki Advanced Computing Center Nagasaki, Japan The GPU evolution The Graphic Processing Unit (GPU) is a processor that was specialized for processing graphics. The GPU
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationTrends in the Infrastructure of Computing
Trends in the Infrastructure of Computing CSCE 9: Computing in the Modern World Dr. Jason D. Bakos My Questions How do computer processors work? Why do computer processors get faster over time? How much
More informationCenter for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop
Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop http://cscads.rice.edu/ Discussion and Feedback CScADS Autotuning 07 Top Priority Questions for Discussion
More informationComputer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2
More informationThis section covers the MIPS instruction set.
This section covers the MIPS instruction set. 1 + I am going to break down the instructions into two types. + a machine instruction which is directly defined in the MIPS architecture and has a one to one
More informationCUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION
CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION WHAT YOU WILL LEARN An iterative method to optimize your GPU code Some common bottlenecks to look out for Performance diagnostics with NVIDIA Nsight
More informationX-Stream II. Processing Method. Operating System. Hardware Performance. Elements of Processing Speed TECHNICAL BRIEF
X-Stream II Peter J. Pupalaikis Principal Technologist September 2, 2010 Summary This paper explains how X- Stream II techonlogy improves the speed and responsiveness of LeCroy oscilloscopes. TECHNICAL
More informationLaboratory Exercise 3 Comparative Analysis of Hardware and Emulation Forms of Signed 32-Bit Multiplication
Laboratory Exercise 3 Comparative Analysis of Hardware and Emulation Forms of Signed 32-Bit Multiplication Introduction All processors offer some form of instructions to add, subtract, and manipulate data.
More informationPerformance of Multicore LUP Decomposition
Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations
More information