1 ENVISION. ACCELERATE. ARRIVE. ClearSpeed Programming Model: Optimizing Performance

2 Overview
- Compute considerations
- Memory considerations
- Latency hiding
- Miscellaneous
- Profiling
- Inline assembly
- Optimal performance: Top 10 tips

3 ENVISION. ACCELERATE. ARRIVE. Compute considerations

4 Source for further information
- The majority of this information is gleaned from: CSX600 Programming Manual (06-RM-1305), Chapter 4: Execution Pipeline

5 Poly ALU is 8-bit
- Hence, a 4-byte operation takes twice as long as a 2-byte operation
- Be mindful in your code: do you need a 32-bit int when a 16-bit short will do?
- Good example: array subscripts! PE memory is 6 kB, and a signed short will cover 32 kB
- Added benefit: variables will take up less poly memory space
- Poly memory is a scarce resource, so use it wisely
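A minimal Cn-style sketch of the subscript point, with hypothetical names; the 16-bit index halves the 8-bit ALU's work compared with a 32-bit int, yet still spans all of PE memory:

    poly float data[1024];   // 4 kB of the 6 kB poly memory
    poly short i;            // 16-bit subscript: 2 bytes instead of 4 through the 8-bit ALU
    for (i = 0; i < 1024; i++) {
        data[i] *= 2.0f;
    }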

6 Perform poly-only expressions
- Precopy constant items from a mono variable to a poly variable
- The expression will then be poly-only: faster
- Potential for value reuse: no need to resend mono -> poly
- Example, before:
    mono int mmax;           // Assume initialised
    poly int ploop;
    while (ploop < mmax)
    {
- And after:
    mono int mmax;           // Assume initialised
    poly int ploop, pmax;
    pmax = mmax;             // Single send!
    while (ploop < pmax) {

7 Poly Conditionals
- As far as possible, remove common subexpressions from poly if blocks
- Reduces the amount of replicated work
- Reminder: PEs do not skip poly conditionals
- All PEs process the same code; some just ignore instructions
- All PEs pay the same cycle cost
- Be prepared to compute and throw away results if it leads to fewer poly conditional blocks
- Increases efficiency, as PEs are enabled and processing more of the time
- A poly if uses predicated instructions
- Not a branch, so no jump overhead
- Cheap if few additional instructions are executed

8 Poly Conditionals (continued)
- Example of superfluous compute for speed; all instructions are processed (PEs are merely disabled in the conditional):
    poly double a, b;
    if <condition 1>
        if <condition 2>
            a = ComputationA(params)
        else
            b = ComputationB(params)
        endif
    else
        a = ComputationA(params)     // Computation carried out twice!
    endif

9 Poly Conditionals (continued)
- Superfluous compute for speed (continued): smaller total cycle count
    poly double a, b, atemp;
    atemp = ComputationA(params);    // Now a single computation, assigned twice
    if <condition 1>
        if <condition 2>
            a = atemp;
        else
            b = ComputationB(params)
        endif
    else
        a = atemp;
    endif

10 Standard procedural programming speedups
- Don't calculate in the for loop header; example:
    for (int i=0; i < (a / b); i++)    // Calculated each iteration!
- What if the result of the calculation is constant over the loop? It is faster to compute the result once and reuse it:
    int limit = a / b;
    for (int i=0; i < limit; i++)
- Remember: the fastest computation is the one you don't carry out!

11 More examples of superfluous computation
- Pointer arithmetic: don't calculate the absolute address each iteration
    for (i=0; i<10; i++) {
        pointer = start_address + i*128;
        function(pointer, ...);
    }
- Prefer:
    pointer = start_address;
    for (i=0; i<10; i++) {
        function(pointer, ...);
        pointer += 128;    // Not carrying out a multiply
    }

12 Division speedup
- Division is a slow operation when compared to multiply
- Are you dividing by a constant? It is more efficient to multiply by its inverse
    float data[10];    // Some nebulous data to process
    float divisor;     // Quantity we wish to divide by
    for (int i=0; i<10; i++) {
        data[i] /= divisor;
    }
- Instead, compute the inverse, store it and reuse it:
    float data[10];    // Some nebulous data to process
    float divisor;     // Quantity we wish to divide by
    float invdivisor = 1.0 / divisor;
    for (int i=0; i<10; i++) {
        data[i] *= invdivisor;
    }

13 Array lookup optimization
- You can achieve speedups with the 2.x compiler (the 3.x compiler is more efficient):
    poly double *x, *y;
    for (short i=0; i<32; i++) {
        y[i] = x[i];
    }
- You would be better coding:
    poly double *x, *y;
    poly double *xi, *yi;
    xi = x;
    yi = y;
    for (short i=0; i<32; i++) {
        *yi++ = *xi++;
    }
- Note the removal of the index: this removes the related address calculation

14 Literal Constants
- You can achieve speedups with the 2.x compiler (the 3.x compiler is more efficient)
- Compared with:
    poly double x, y;
    x = x + 1.0;
    y = y + 1.0;
- This may prove to be faster:
    static poly const double one = 1.0;
    poly double x, y;
    x = x + one;
    y = y + one;

15 Can anything be precomputed?
- If there is a set of constants you require:
- Precalculate on the host
- Send to the board (tightly packed structure, multiples of 32 bytes, 8-byte aligned)
- Potentially faster than calculating on the board, especially if they are loaded in advance (along with the executable)
- "Old-school" lookup tables win out in certain circumstances
- Factorial: 3249! ≈ 10^10,000, and a look-up is a lot faster than 3,249 multiplies...
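As a sketch of the table idea (names and sizes are hypothetical), the host might build a factorial table once and ship it to the board rather than multiplying on a PE:

    #include <math.h>

    #define NMAX 1024

    static double log_fact[NMAX];    // log10(n!), built once on the host

    void build_log_fact_table(void)
    {
        log_fact[0] = 0.0;           // log10(0!) = log10(1) = 0
        for (int n = 1; n < NMAX; n++)
            log_fact[n] = log_fact[n - 1] + log10((double)n);
    }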

16 Should you use vector instructions?
- Don't automatically assume you need to vectorise
- Only if you have a small set of working variables and are not mono -> poly bandwidth bound
- Don't assume vectorization is always a huge win, particularly with sets of expressions with lots of variables
- You could run out of registers and spill variables to memory
- The code will then run slower due to the additional memory accesses

17 Vector Math Library (VML)
- VML functions take up PE memory: for example, 128 bytes for the sin & cos double-precision functions
- But VML functions are faster than libcn, even if the arguments are scalar, not vector
- Refer to: The Cn Standard Library (Document ID 06-RM-1139), Section 5: The ClearSpeed Vector Math Library

18 Compiler optimizations
- Refer to: ClearSpeed SDK Reference Manual (Document ID 06-RM-1136), Section 3.3: Compiler optimizations
- Breaks down exactly what the different levels of optimization (O1, O2, O3, O4) will do
- Different optimization implementations:
- Compiler 2.x: not all optimizations are available with poly variables
- Compiler 3.x: supported on poly, with additional optimizations
- Check the documentation for further details

19 ENVISION. ACCELERATE. ARRIVE. Memory considerations

20 DRAM memory optimization: 32-byte access
- ECC in DRAM works on 8-byte-wide words
- If you write < 8 bytes, DRAM will: read 8 bytes, overwrite N bytes, recalculate the ECC, write 8 bytes
- So use multiples of 8 bytes
- DRAM has a burst length of 4, and 4 * 8 = 32 bytes
- So accessing 1 byte takes as long as accessing 32 bytes
- Further reading: CSX600 Programming Manual (ID 06-RM-1305), Section 3.4: DRAM; Section 3.4.6: Performance

21 DRAM memory optimization: 32-byte access
- PIO bus width is 64 bytes: access multiples of 64 bytes for peak performance
- Requires accesses aligned to an 8-byte address
- For performance, align to a 32-byte address: this enables the DMA engine to be used
- Cn support: #pragma align N will align the next data structure to N bytes
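A minimal sketch of the pragma in use, assuming a hypothetical structure; packing the fields to exactly 32 bytes matches one DRAM burst:

    #pragma align 32
    mono struct sample {
        double a, b, c, d;    // 4 * 8 bytes = 32 bytes: one full DRAM burst, DMA-friendly
    } buffer;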

22 DRAM memory optimization: random access
- Only one row in each DRAM bank can be open at any one time
- The controller must open & close rows as addresses come in
- This takes time, but the controller schedules open/close commands ahead of the data access
- Consecutive access to memory is penalty free
- Random access within a page is penalty free
- Random access within all open pages is penalty free
- The number of banks varies with the board: usually 4 or 8 banks
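A generic C sketch of the difference, with hypothetical names; the first loop walks within open rows, while the second keeps forcing the controller to open and close them:

    // Sequential vs. scattered DRAM access (sketch; table and N are assumptions)
    long walk(const long *table, int N)
    {
        long sum = 0;
        for (int i = 0; i < N; i++)
            sum += table[i];                 // consecutive: penalty free
        for (int i = 0; i < N; i++)
            sum += table[(i * 7919) % N];    // scattered: row open/close penalties
        return sum;
    }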

23 DRAM memory optimization: multiple access
- Keep to one data stream at a time
- Even if you have two perfectly well-behaved consecutive write streams from different sources (2 MTAPs, or MTAP + host), your bandwidth will be slashed
- The streams get interleaved, so they are seen as accesses to different pages in the same bank
- Changing between read and write also takes time
- Do as many accesses in the same direction as possible

24 DRAM memory optimization: multiple access
- However, if you know exactly what you are doing:
- There are 3 memory controllers: an FPGA controls PCIe access, and there is one in each CSX processor
- The underlying issue is DRAM banks
- If you access too many different banks, performance drops badly as memory is flushed

25 Board DRAM: bank differences
- PCI-X: 8 DRAM banks; PCIe: 4 DRAM banks
- Performance differences can be seen moving from PCI-X to PCIe, due to the number of DRAM banks

26 Remember: available bandwidths
- Mono memory to poly memory: 3.2 GB/s aggregate over 96 PEs
- Poly memory to registers: 840 MB/s per PE, ~160 GB/s aggregate per board
- Swazzle path: 1680 MB/s per PE, ~320 GB/s aggregate per board
- Total bandwidth for an Advance board (2 CSX600 processors): ~0.5 TB/s
- Consider these figures when looking to move large amounts of data

27 Summary of optimal DRAM usage
- Transferring large blocks of data is more efficient than transferring small blocks: there is a fixed overhead in initialising each transfer (host to board, mono to poly, poly to mono)
- Use multiples of 32 bytes, preferably 64 bytes
- Align to an 8-byte address, preferably a 32-byte one: #pragma align 8
- There are benefits if you can pack your data structures into 32 bytes!
- If you're using 24 bytes, there's an additional 8 bytes that will transfer at no additional cost
- Use one stream at a time
- Transfer to/from the host when the CSX isn't accessing DRAM
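A sketch of the packing point, using a hypothetical record: padding a 24-byte payload out to 32 bytes costs nothing on the wire and keeps every transfer burst-sized:

    #pragma align 8
    typedef struct {
        double x, y, z;    // 24 bytes of payload
        double spare;      // 8 bytes of padding that transfer for free
    } record32;            // 32 bytes: one DRAM burst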

28 DRAM usage note: you will hit at most 80% of peak traffic
- Due to: 64 bytes sent on the ClearConnect Bus (Programmed I/O) are divided into 4 lots of 16 bytes, plus a 16-byte address
- Hence sending 64 bytes implies 80 bytes sent; 64/80 = 80% of peak
- This can be pathologically hit with 1 byte sent: it becomes 1 lot of 16 bytes, plus a 16-byte address
- Hence sending 1 byte implies 32 bytes sent; 1/32 = 3% of peak
- And then there's the read-modify-write for ECC DRAM

29 Swazzle path
- The swazzle path is 8 bytes wide, so swazzling 1 byte takes as long as swazzling 8
- Peak: 8 bytes per cycle, ~161 GB/s per processor
- Don't use the swazzle_up_zero instruction; prefer set_swazzle_ends followed by swazzle_up
- Don't repeatedly set up the ends of the swazzle: there is no need to call set_swazzle_ends multiple times if you wish to reuse the same end values

30 Sending mono to poly
- Given 3.2 GB/s bandwidth mono -> poly, 80% of peak gives ~2.5 GB/s
- That is approximately 10 bytes per cycle across the processor
- Divided amongst 96 PEs: ~0.1 bytes per cycle per PE
- Alternative: mono-to-poly broadcast, at 1 byte per cycle
- Note the cache latency to load mono data (10s of cycles); the cache would, however, get 32 bytes in a line
- Bear this in mind when considering mono -> poly; the choice will depend on the amount of data to be sent

31 Memcpy variants: which to use?
- Never use memcpym2p/memcpyp2m: they are never better than 10% of peak, due to their relaxed memory alignment & size constraints
- Use the async_ versions instead, even if you wait immediately for them to complete
- Bear in mind the alignment and size requirements, e.g. the poly source must be 4-byte aligned
- Refer to: The Cn Standard Library (Document ID 06-RM-1139)
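A minimal sketch of the advice, reusing the async_memcpym2p call and semaphore style from the slide-36 example (src and the semaphore number are assumptions):

    poly double buf[8];    // 64 bytes; meets the 4-byte poly alignment requirement
    // src: a mono double* into board DRAM (assumed)
    async_memcpym2p(19, buf, src + 8 * get_penum(), 8 * sizeof(double));
    sem_wait(19);          // waiting immediately still beats a plain memcpym2p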

32 ENVISION. ACCELERATE. ARRIVE. Latency hiding

33 Using semaphores to reduce wait on transfers
- Given the case: send data to the card; run a program on the data; retrieve the data
- The delay between transfer and compute can be reduced by using GSU semaphores

34 Using semaphores to reduce wait on transfers
- Load and run the program on the card; it waits on a GSU semaphore (GSU1) before starting
- Transfer data to the card, linking the GSU1 semaphore to transfer completion
- When the transfer completes, the program will immediately start running
- Queue the transfer of results from the card, linking the GSU2 semaphore to the transfer start
- When the program on the card has completed, it signals a GSU semaphore (GSU2)
- This triggers the host to transfer the results immediately the process completes

35 Improved use of semaphores and host transfers
- The previous slide can be improved with double-buffering:
- Transfer two problems to be solved to the card
- The card starts processing when the first problem arrives (*)
- Transfer the first results back; this will start when the first problem has been solved
- The card will start on the second problem as soon as it has finished the first one
- Transfer a third problem to the card, over the top of the first problem
- Go to (*) and retrieve the second set of results...
- If data transfer is faster than compute, the CSX will never stall
- But remember to keep to 1 data stream at a time if possible

36 Asynchronous I/O example
    void foo(double *A, double *B, int n) {
        // Assume n is divisible by 24*96
        poly unsigned short penum = get_penum();
        poly double mat[4] = {1., 2., 3., 4.};
        poly double a_front[12], a_back[12];
        poly double b[4] = {0., 0., 0., 0.};
        int i;
        async_memcpym2p(19, a_front, A + 12*penum, 12*sizeof(double));
        A += 12*96;
        n -= 24*96;
        while (n) {
            // Request the next memory block in advance
            async_memcpym2p(17, a_back, A + 12*penum, 12*sizeof(double));
            A += 12*96;
            sem_wait(19);    // Wait for the transfer; it should already have completed
            for (i = 0; i < 12; i += 2) {    // Step in pairs, so a_front[i+1] stays in bounds
                b[0] += a_front[i]*mat[0] + a_front[i+1]*mat[1];
                b[1] += a_front[i+1]*mat[0] + a_front[i]*mat[1];
                b[2] += a_front[i]*mat[2] - a_front[i+1]*mat[3];
                b[3] += a_front[i+1]*mat[2] - a_front[i]*mat[3];
            }
            n -= 12*96;
            // Request the next memory block in advance
            async_memcpym2p(19, a_front, A + 12*penum, 12*sizeof(double));
            A += 12*96;
            sem_wait(17);
            for (i = 0; i < 12; i += 2) {
                // ... the same compute on a_back; the final block is drained outside the while loop

37 Should you double-buffer mono to poly access?
- Don't automatically assume this!
- What if the returned data is extremely small? Then it is not a bottleneck
- Only double-buffer if you are memory-bandwidth bound; otherwise you are just wasting PE memory
- You could get better memory reuse with single buffering
- Hence it could be more efficient to be greedy with memory for compute and single-buffer!
- Overall: there is no hard and fast rule

38 ENVISION. ACCELERATE. ARRIVE. Miscellaneous

39 Embedded SRAM impact on programs
- ESRAM is 128 KB; programs run at peak performance when located inside it
- Compiler/linker 2.x: if the program is > 128 KB, the program will not be placed in ESRAM: a dramatic performance hit
- Compiler/linker 3.0: #pragma hot identifies code to be placed in ESRAM
- Compiler/linker 3.1: will dynamically page code into ESRAM
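A sketch of how the 3.0 pragma might be applied, assuming it marks the function that follows; check the compiler/linker documentation for the exact syntax:

    #pragma hot
    void inner_kernel(void)
    {
        // performance-critical code, placed in the 128 KB ESRAM for peak speed
    }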

40 Custom stack/heap sizes in poly memory
- The default is 3 KB stack, 3 KB heap; you can change this through pragmas
- Consider: if you create code with (say) a 5.9 KB heap, and you are then called from a different function/library which has a different setup (e.g. a 0.5 KB heap), you're going to be in trouble!

41 Debugging: random errors occurring?
- Try running your code on the simulator: more diagnostics compared to hardware
- For instance: you've accidentally fallen off the end of the 6 kB poly memory
- Hardware will eventually wrap the address, masking off the irrelevant bits, so 8 kB will become 0 kB
- The simulator will tell you if you've fallen off the end of the memory map

42 ENVISION. ACCELERATE. ARRIVE. Profiling

43 ClearSpeed Visual Profiler
- Host tracing: trace CSAPI functions; the user can infer overlapping host/board utilization; locate hot-spots
- Board tracing: trace board-side functions without instrumentation; locate hot-spots
- Board hardware utilization: display the activity of CSX functional units, including ld/st, PIO, SIMD microcode, instruction cache, data cache and thread
- Cycle accurate
- View the corresponding source
- Unified GUI

44 Detailed profiling is essential for accelerator tuning
- HOST CODE PROFILING: visually inspect multiple host threads; time specific code sections; check the overlap of host threads
- HOST/BOARD INTERACTION: infer cause and effect; measure transfer bandwidth; check the overlap of host and board compute
- ACCELERATOR PIPE: view instruction issue; visualize the overlap of executing instructions; get cycle-accurate timing; remove instruction-level performance bottlenecks
- CSX600 SYSTEM: trace at the system level; inspect the overlap of compute and I/O; view cache utilization; graph performance
[Diagram: host CPU(s) alongside an Advance accelerator board with two CSX600 processors and their pipelines]

45 csvprof: Host Tracing
- Dynamic loading of the CSAPI Trace implementation, triggered with an environment variable: export CS_CSAPI_TRACE=1
- Recall the similar enabling of debug support: export CS_CSAPI_DEBUGGER=1
- Specify the tracing format: export CS_CSAPI_TRACE_CSVPROF=1 (currently this is the only implementation, but in the future...)
- Specify the output file for the trace: export CS_CSAPI_TRACE_CSVPROF_FILE=mytrace.cst (default filename: csvprof_data.cst)
- The output file is written during CSAPI_delete

46 Profile of a complete LINPACK run (x86 view)
- Overview of system performance during the LINPACK run
- Profiling of the x86 source code inside LINPACK
- CSX600 interaction displayed alongside the x86 code profile

47 A single LINPACK DGEMM call (x86 view)
- Scale from the full LINPACK run down to an individual DGEMM
- Individual GFLOPS for a DGEMM call displayed in the profile
- CSX600 aspects of the individual DGEMM call now visible

48 Multiple CSX600 DGEMM calls (x86 view)
- See each individual CSX600 processor's contribution
- Overlap of the x86 threads handling the CSX600 offload
- Overhead of data transfer between processors is visible

49 Multiple CSX600 DGEMM calls (CSX600 view)
- View the DGEMM calls on the CSX600 processor
- Each call ties up with the host view of card execution
- A much higher level of detail is available from the profiler

50 Single DGEMM inner loop on the CSX600 (CSX600 view)
- Scale down to the view of code executing on the CSX600
- View the host data being copied into CSX600 memory
- Tune the CSX600 code based on the data-flow profile

51 Pipeline view of the CSX600 DGEMM inner loop (CSX600 view)
- Profile the code running at the instruction level
- See the pipeline performance for each instruction
- Tune the instruction scheduling for the application code

52 Visual Profiler Board Tracing
- Enabled using the debugger, csgdb; it can be used interactively or through a gdb script
- You can select which events to profile, or all events
- Requires buffer allocation on the card; today, this is done statically
- One could use CSAPI to allocate the buffer, but the developer must then get the location and size of the buffer to the user, to be entered into csgdb
- Easy if running on only one chip: place the buffer in the other chip's memory
- An explicit dump generates the trace file; you can control the type of data to be dumped

53 csvprof: Sample gdb script
    % cat ./csgdb_trace.gdb
    connect
    load ./foo.csx
    cstrace buffer 0x... 0x...    (buffer address and size lost in transcription)
    cstrace event all on
    tbreak test_me
    continue
    cstrace enable
    continue
    cstrace dump foo.cst
    cstrace dump branch dgemm_test4_branch.cst
    quit
    % csgdb command=./csgdb_trace.gdb

54 ENVISION. ACCELERATE. ARRIVE. Inline assembly

55 Inline assembler within Cn
- Refer to: SDK Reference Manual (Document ID 06-RM-1136), Section 12.11: Inline assembler
- This assumes you also know the instruction set; refer to: CSX600 Instruction Set Reference Manual (Document ID 06-RM-1137)
- A brief example is presented here

56 Inline assembler: example
- Cn inline assembler: similar to function syntax
- Cannot be defined within basic blocks
- Uses the asm keyword to differentiate from normal functions
- Example:
    asm mono float addf(mono float x, mono float y)
    {
        ... @{y}f;    // Body truncated in transcription; @{...} denotes a parameter's register
    }
- This could be called from the Cn code as follows:
    int main(void)
    {
        float y = addf(10.0, 20.0);
        return (int) y;
    }

57 Overview of features
- Variables can be accessed: enables register allocation by the compiler
- Directives inform the compiler of user intentions: whether a parameter (register) is to be modified, and register requirements (for example, requesting a 32-bit mono register)
- Refer to the relevant section of the SDK Reference Manual (section number lost in transcription)
- Example: an insert that overwrites the x parameter:
    #pragma asm_inc <arith.inc>
    asm mono short adddbl(mono short x, mono short y)
    {
        ... @{x};    // Body truncated in transcription
    }

58 Once you're happy with assembler
- If your code isn't running as fast as expected:
- Examine the assembly produced from the Cn; verify that the compiler is doing what you expect
- Code in a tight inner kernel may not be optimal
- Compiler/linker 2.x has fewer optimizations than the 3.x series
- You can then discover whether it is worth hand-crafting part of a routine in assembler

59 ENVISION. ACCELERATE. ARRIVE. Optimal performance: Top 10 tips

60 Top tips: counting down
1. Use both chips on the board! Yes, don't forget you have 2 processors.
2. Asynchronous I/O (latency hiding): overlap everything! Mono with poly, on-/off-host, poly load/store with poly compute.
3. Move common code from poly conditionals to outside the conditional. Remember: poly does not branch; don't pay to run code twice. Refactor code to process common sub-expressions once.
4. Align memory accesses to 8 bytes: enables DMA access.
5. Use multiples of 64 bytes for DRAM access: maximises efficiency.

61 Top tips: counting down
6. Vector intrinsics: use them to achieve maximum performance.
7. Use async_memcpy: don't pay the overhead for more flexible memory sizes.
8. The poly ALU is 8-bit, so tighten integer math. Why calculate 32-bit (4 registers) if 8-bit (a single register) will suffice?
9. Use swazzle: massive transfer bandwidth that runs in parallel with DRAM.
10. Multiple boards: at least prepare your code for scalability.

62 ENVISION. ACCELERATE. ARRIVE. Summary

63 Summary
- Compute considerations
- Memory considerations
- Latency hiding
- Miscellaneous
- Profiling
- Inline assembly
- Optimal performance: Top 10 tips


65 ENVISION. ACCELERATE. ARRIVE. Reducing start-up of small applications

66 a. Training_2007_09_07, slides #2-#7: client-server code (small example; no code shown)
