COMP 635: Seminar on Heterogeneous Processors. Lecture 7: ClearSpeed CSX600 Processor.


COMP 635: Seminar on Heterogeneous Processors
Lecture 7: ClearSpeed CSX600 Processor
Vivek Sarkar, Department of Computer Science, Rice University
October 22, 2007

Announcements
- REMINDER: 4-page project/study write-up due by 12/7/07
- The report can be prepared in a group; plan on 4 pages per person in that case
- Send me an email ASAP with your proposed topic for the write-up if you haven't already done so

References
- "The best of both worlds: Delivering aggregated performance for high-performance math libraries in accelerated systems," James Irwin and Simon McIntosh-Smith, ISC
- ClearSpeed Software Overview Release (_release.pdf)
- ClearSpeed Introductory Programming Manual (ming.pdf)

Acknowledgments
- Overview slides on ClearSpeed CSX600 from Simon McIntosh-Smith

CSX600 accelerator chip
- Array of 96 Processor Elements (PEs)
- 64-bit and 32-bit floating point
- 210 MHz clock, key to low power
- 47% logic, 53% memory; about 50% of the logic is FPUs
- ~1 TB/sec internal bandwidth at the register file
- 128 million transistors
- Low power: approx. 10 Watts

MTAP processor core
[Block diagram: mono controller with data cache, instruction cache, control and debug; poly controller driving the array PE 0 .. PE 95; programmable I/O to DRAM; peripheral network and system network]
- Array of 96 Processor Elements (PEs); each is a Very Long Instruction Word (VLIW) core, not just an ALU
- Coarse-grained data-parallel processing
- Multi-Threaded Array Processing: hardware multi-threading; asynchronous, overlapped I/O; run-time extensible instruction set
- Cn is the natural language: single poly data type modifier, rich expressive semantics

Processing Elements
[PE block diagram: FP Mul, FP Add, Div/Sqrt, MAC, ALU, 128-byte register file, 6 KB PE SRAM, PE programmed I/O (PIO) collection & distribution, connections to neighboring PEs n-1 and n+1]
- Multiple execution units:
  - 4-stage floating point adder and 4-stage floating point multiplier (32/64-bit IEEE 754)
  - Divide/square root unit
  - Fixed-point MAC (16x16)
  - Integer ALU with shifter
  - Load/store
- High-bandwidth, 5-port register file (3 read, 2 write)
- Closely coupled 6 KB SRAM for data
- High-bandwidth per-PE DMA (PIO)
- Per-PE address generators; complete pointer model, including parallel pointer chasing and vectors of addresses

Three tiers of memory

Application Acceleration Model

Cn language for MTAP architecture
- Cn = C extended with mono and poly keywords
- mono declares a serial (single) variable
  - One copy exists on the mono execution unit
  - Visible to all processing elements in the poly execution unit
  - mono is assumed unless poly is specified
- poly declares a parallel (vector) variable
  - One copy per processing element in the poly execution unit
  - Visible to a single processing element; data can be shared between PEs via the swazzle operation
  - Not visible to the mono execution unit
- Cn supports the following basic types: char, unsigned char, signed char; short, unsigned short, signed short; int, unsigned int, signed int; long, unsigned long, signed long; float, double
- Cn also supports the following aggregate types: struct, union, pointers, arrays
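For concreteness, a few Cn declarations using the multiplicity keywords and types listed above (an illustrative sketch, not taken from the slides; the ClearSpeed Introductory Programming Manual is the authority on the exact syntax):

    int          n;              /* mono int: mono is assumed by default        */
    mono float   alpha;          /* explicitly mono: one copy, on the mono unit */
    poly int     lane;           /* poly: one copy of this int in every PE      */
    poly double  partial[8];     /* poly aggregate: an 8-element array per PE   */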

Simple example of C vs. Cn code
[Code slide; not captured in this transcription]

Rules for mono and poly expressions
- A mono rvalue can be assigned to a poly lvalue
  e.g., poly int x; int y; x = y;
- Mono expressions get promoted to poly (but not vice versa)
  e.g., poly int x; int y; x = x + y;
- A mono variable cannot be used as an lvalue with a poly-valued rvalue
  (all three rules are put together in the sketch after this slide)

PE enablement
- The poly unit uses an enable register to control execution of each PE
- The enable register is a stack; a new bit, specifying the result of a test, can be pushed onto the top of the stack, allowing nested predicated execution
- A PE with no 0s in its enable register is enabled
- A PE with at least one 0 in its enable register is disabled
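A compact sketch putting the slide's fragments together with the disallowed case (the comments are mine; the exact diagnostic a Cn compiler emits for the last line is not specified here):

    poly int x;
    mono int y = 7;     /* mono is the default; the keyword is shown for clarity */

    x = y;              /* OK: mono rvalue assigned to a poly lvalue (broadcast) */
    x = x + y;          /* OK: y is promoted to poly before the addition         */
    /* y = x; */        /* Not allowed: a mono lvalue cannot take a poly rvalue  */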

Poly-valued if-then-else
Consider the following code example:

    poly short penum = get_penum();
    mono int i;
    poly int j;
    if (penum < 32) {   // Poly-valued conditional
        j = 0;          // Only executed on enabled PEs
        i = 0;          // Always executed
    } else {
        j = 1;          // Only executed on enabled PEs
        i = 1;          // Always executed
    }

Semantics
- All statements in a poly-valued if-then-else are executed in sequence (the else-block follows the then-block)
- A poly statement is only executed on processors for which the poly-valued conditional is enabled
- A mono statement is always executed
- NOTE: a switch statement expression must be mono-valued

Poly-valued while loop

    poly int i;               // May be different on different PEs
    mono int loop_count = 0;
    ...
    while (i < N) {
        i++;                  /* Increment poly loop control */
        loop_count++;         /* Increment mono loop count   */
        ...
    }

Semantics
- The while loop continues execution so long as the condition is true for at least one PE
- The final value of loop_count contains the maximum number of iterations executed by any PE
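Because the enable register is a stack (see the PE enablement slide above), the poly conditionals shown here nest: each nested poly test pushes another bit, and a poly statement executes only on PEs whose entire enable stack is 1. A small illustrative sketch (not from the slides):

    poly short penum = get_penum();   /* PE number, 0..95, different on every PE */
    poly int tag = 0;

    if (penum < 48) {                 /* pushes one enable bit on each PE        */
        if ((penum & 1) == 0) {       /* pushes a second, nested enable bit      */
            tag = 1;                  /* only even PEs below 48 execute this     */
        } else {
            tag = 2;                  /* only odd PEs below 48 execute this      */
        }
    }                                 /* bits popped; all 96 PEs enabled again   */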

Data transfers between mono and poly
- memcpym2p: transfers data from mono space to poly space. Every enabled PE transfers the same amount of data to the same location in poly memory. (A usage sketch follows the pointer list below.)
- memcpyp2m: as above, but transfers data from poly to mono memory.
- Asynchronous versions are also available; use signal and wait operations on semaphores.
- Cache consistency with mono memory needs to be enforced by software: dcache_flush ensures consistency between mono memory and the cache.

Mono and poly pointers (4 kinds)
- mono int * mono mpmi;   (mono pointer to mono int)
- poly int * mono mppi;   (mono pointer to poly int)
- mono int * poly ppmi;   (poly pointer to mono int)
- poly int * poly pppi;   (poly pointer to poly int)
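A sketch of broadcasting a mono buffer into PE memory with memcpym2p (illustrative only: the destination-first, byte-count-last argument order mirrors memcpy and is an assumption, the declaring SDK header is omitted, and the per-PE addressing needed for a memcpyp2m gather is not shown; consult the Introductory Programming Manual for the real prototypes):

    #define N 16

    mono double coeffs[N];        /* table in mono (board DRAM) memory */
    poly double local_coeffs[N];  /* per-PE copy in PE SRAM            */

    void broadcast_coeffs(void)
    {
        /* Every enabled PE receives the same N doubles at the same poly address. */
        memcpym2p(local_coeffs, coeffs, N * sizeof(double));

        /* The reverse transfer, memcpyp2m, copies from poly back to mono memory;
           in practice each PE would target a distinct mono address, e.g. offset
           by get_penum(), which is omitted here. */
    }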

Mono and poly pointers: poly int * mono (mono pointer to poly int)
[Diagram slide]

Mono and poly pointers: mono int * poly (poly pointer to mono int)
[Diagram slide]

Mono and poly pointers: poly int * poly (poly pointer to poly int)
[Diagram slide]

DAXPY example in Cn
[Code slide; the Cn source was not captured in this transcription]
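As a hedged reconstruction of the idea on that slide, the kernel below assumes each PE already holds a CHUNK-element slice of x and y in its local memory (the data distribution, strip-mining of larger vectors, and the host driver are all omitted, and CHUNK is an invented name):

    #define CHUNK 64   /* elements of x and y held by each PE (assumption) */

    /* y = a*x + y on each PE's slice; `a` is mono and is promoted to poly
       inside the expression, as described by the promotion rules above. */
    void daxpy_poly(mono double a, poly double * poly x, poly double * poly y)
    {
        mono int i;
        for (i = 0; i < CHUNK; i++) {
            y[i] = a * x[i] + y[i];
        }
    }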

ClearSpeed Advance e620 & X620 accelerator boards
- Dual ClearSpeed CSX600 coprocessors
- 66 GFLOPS for 64-bit matrix multiply (DGEMM) calls
- 80 double-precision GFLOPS peak
- Hardware also supports 32-bit floating point and integer calculations
- Single PCI slot (PCI Express x8 or PCI-X)
- Multiple boards can be used together for greater performance
- Over 1 GByte/s between accelerator and host
- 1 GByte of ECC-protected memory on the board
- Flat 64-bit shared address space for the board
- Drivers for Linux (Red Hat and SUSE) and Windows
- 9 ounces, 6 inches long, 35 watts for the entire card (at socket)

ClearSpeed Linpack results
Standard system:
- Two 3.0 GHz Intel Xeon 5160 (Woodcrest) dual-core processors, 16 GB memory per node
- Single server: 34 GFLOPS
- Four-node cluster: 136 GFLOPS
- Power consumption: 1,940 Watts
- Benchmark runtime: 48.4 minutes
ClearSpeed-accelerated system:
- Add two Advance accelerator boards per node (25 W per board!)
- Single server: 90.1 GFLOPS
- Four-node cluster: … GFLOPS
- Power consumption: 2,140 Watts
- Benchmark runtime: 18.4 minutes
Source: "Accelerating HPC Applications with ClearSpeed" by Daniel Kliger (Slide 5), ed%20daresbury%20mew% pdf

DGEMM Power Efficiency
Source: "Accelerating HPC Applications with ClearSpeed" by Daniel Kliger (Slide 6)

Three categories of ClearSpeed users
- Application-level users
- Library-level programmers
- SDK-level programmers
(The SDK comprises a C compiler for the host CPU plus the Cn compiler for the CSX600.)

ClearSpeed Software = Runtime + SDK

Runtime software
- CSXL library (subset of BLAS and LAPACK)
  - Supports DGEMM, DGETRF, DGESV (a call sketch follows this list)
- CSDFT library
  - Supports FFT, inverse FFT, and convolution
- Vector Math Library (VML)
  - A set of random number generators
  - A set of vector math functions (sin, exp, log, etc.)
- Host interface library (csapi.h)
  - Processor control functions: controlling the state of the CSX600 processors (run, halt, start, wait, get return value)
  - CSX600 register access functions
  - CSX600 memory access functions
  - Thread functions, semaphore functions, callback functions
  - Memory allocation functions
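Because CSXL is described above as a subset of BLAS and LAPACK, a host program would typically reach the accelerated DGEMM through the standard BLAS binding rather than a ClearSpeed-specific call. The sketch below uses the conventional Fortran dgemm_ interface with column-major arrays; that this is the binding CSXL exports, and the link-time details, are assumptions to check against the CSXL documentation:

    #include <stdio.h>

    /* Standard Fortran BLAS binding for DGEMM (assumed to be provided by the
       accelerated CSXL build; consult the CSXL documentation for linkage). */
    extern void dgemm_(const char *transa, const char *transb,
                       const int *m, const int *n, const int *k,
                       const double *alpha, const double *a, const int *lda,
                       const double *b, const int *ldb,
                       const double *beta, double *c, const int *ldc);

    int main(void)
    {
        enum { N = 2 };
        /* Column-major 2x2 matrices. */
        double a[N * N] = { 1.0, 3.0, 2.0, 4.0 };   /* A = [[1,2],[3,4]] */
        double b[N * N] = { 5.0, 7.0, 6.0, 8.0 };   /* B = [[5,6],[7,8]] */
        double c[N * N] = { 0.0 };
        const int n = N;
        const double alpha = 1.0, beta = 0.0;

        dgemm_("N", "N", &n, &n, &n, &alpha, a, &n, b, &n, &beta, c, &n);

        /* Expect C = A*B = [[19,22],[43,50]] */
        printf("c = [[%g, %g], [%g, %g]]\n", c[0], c[2], c[1], c[3]);
        return 0;
    }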

ClearSpeed API for Host Applications
[Overview diagram]

ClearSpeed API for Host Applications
- CSAPI_new
- CSAPI_load
- CSAPI_connect
- CSAPI_run
- CSAPI_write_mono_memory
- CSAPI_signal
- CSAPI_wait
- CSAPI_read_mono_memory
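The slide lists the host-side CSAPI entry points without their arguments. The outline below restates them in the order shown, with one-line glosses inferred from the names only; the real prototypes live in csapi.h and the CSX600 Runtime Software User Guide, so no argument lists are invented here:

    /* Typical host-side flow, following the order on the slide (glosses inferred):
     *
     *   CSAPI_new(...)                -- create a CSAPI handle for talking to the driver
     *   CSAPI_load(...)               -- load the compiled CSX600 application image
     *   CSAPI_connect(...)            -- attach the handle to a CSX600 processor
     *   CSAPI_run(...)                -- start the loaded program on the CSX600
     *   CSAPI_write_mono_memory(...)  -- copy input data from the host into board (mono) memory
     *   CSAPI_signal(...)             -- raise a semaphore to tell the board the data is ready
     *   CSAPI_wait(...)               -- block until the board signals completion
     *   CSAPI_read_mono_memory(...)   -- copy results from board memory back to the host
     */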

Sample Code
[Code slide; not captured in this transcription]

Matrix multiply (DGEMM) performance
[Plot: GFLOPS added by the accelerator vs. matrix size, with PCIe (Sep 2007) and PCI-X (Dec) curves. Note: the curve only samples integer multiples of the vector size; performance measured from the host.]

ClearSpeed development environment
- Cn optimising compiler
  - C with the poly extension for SIMD control
  - Uses the ACE CoSy compiler development system
- Assembler, linker
- Simulators: fast high-level and slower timing-accurate versions
- Debuggers: gdb, csgdb
  - A port of the GNU debugger gdb for x86, and csgdb which runs on ClearSpeed's hardware; together they give a consistent host and CSX600 view
- Profiling: csprof
  - Visualises an accelerated application's performance while running on both a multi-core host and either ClearSpeed's Advance board or the simulator; integrates closely with the debuggers
- Libraries (BLAS, RNG, FFT, more...)
- High-level APIs (under development)
- Documentation, training materials
- Available for Windows and Linux (Red Hat 4 and SLES 9)

csgdb/ddd debugger
[Screenshot: on-chip vector contents displayed; real-time plot of the contents of PE memory; Cn source-level breakpoints, watchpoints, single step; register contents; disassembly with breakpoints, watchpoints, single step]

Profiling details of host and board system-level activity
[Diagram: host CPU(s), Advance accelerator board, CSX600 processors and their pipelines]

Host code profiling
- Visually inspect host code executing
- Supports multiple threads and processes
- Time specific code sections
- See overlap of host threads executing
- Platform- and processor-agnostic trace collection

Host/board interaction
- View host/board interactions
- Provides performance information for data transfer operations
- Trace cluster node/board interaction
- See overlap of host compute and board compute

CSX600 pipeline
- View detailed instruction issue information
- Visualize overlap of executing instructions
- Optimize code at the instruction level
- View instruction-level performance bottlenecks
- Get accurate instruction timing

CSX600 system
- View system-level trace
- Visually inspect the overlap of compute and I/O
- Visualize cache utilization
- View branch trace of code executing
- Find and analyse performance bottlenecks
- Get accurate event timing
