GPU & Reproducibility, Predictability, Repeatability, Determinism.


1 GPU & Reproducibility, Predictability, Repeatability, Determinism. David Defour, DALI/LIRMM, UPVD

2 PhD thesis proposal. Subject: study the {numerical} predictability of GPUs. Why? A fuzzy definition, but useful for bounding execution time in real-time systems, debugging, building proofs, and...

3 FPS. Battlezone 2 used a lockstep networking model requiring absolutely identical results on every client, down to the least-significant bit of the mantissa, or the simulations would start to diverge. While this was difficult to achieve, it meant we only needed to send user input across the network; all other game state could be computed locally. During development, we discovered that AMD and Intel processors produced slightly different results for transcendental functions (sin, cos, tan, and their inverses), so we had to wrap them in non-optimized function calls to force the compiler to leave them at single precision. That was enough to make AMD and Intel processors consistent, but it was definitely a learning experience. Ken Miller, Pandemic Studios. In FSW1, when a desync was detected the player would be instantly killed by a magic sniper. All that stuff was fixed in FSW2. We just ran precise FP and used Havok FPU libs instead of SIMD on PC. Also, integer modulo is a problem too, because the C++ standard says it's implementation-defined (in case multiple compilers/platforms are used). In general I liked the tools for lockstep we developed; finding desyncs in code on FSW2 was trivial. Branimir Karadžić, Pandemic Studios

4 CUDA tutorial in... 1 min. Software model: a grid of blocks of threads. Hardware model: a set of CTAs on SMs made of CUDA cores, working in SIMT manner. No assumptions on execution order for blocks, warps, or threads within a warp. Communication with a synchronization barrier, __syncthreads(), and atomics, atomicAdd().
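To make this model concrete, here is a minimal sketch (an assumed example, not from the slides; the kernel name countAbove and its parameters are illustrative): threads are identified by block and thread index, cooperate through the block-wide barrier __syncthreads(), and update shared results with atomicAdd().

    // Minimal illustrative kernel: count how many inputs exceed a threshold.
    __global__ void countAbove(const int* in, int* count, int threshold, int n) {
        int gid = blockDim.x * blockIdx.x + threadIdx.x;   // global thread id
        __shared__ int hits;                               // per-block (per-CTA) counter
        if (threadIdx.x == 0) hits = 0;
        __syncthreads();                                   // barrier for all threads of the block
        if (gid < n && in[gid] > threshold)
            atomicAdd(&hits, 1);                           // atomic update, no ordering assumed
        __syncthreads();
        if (threadIdx.x == 0) atomicAdd(count, hits);      // one global update per block
    }
    // Host side: countAbove<<<numBlocks, threadsPerBlock>>>(d_in, d_count, 0, n);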

5 Hardware view: block scheduler, warp scheduler, instruction scheduler, clock tree, dynamic frequency scaling, ECC.

6 Your computer's configuration. [Table: the test GPUs (C870, GX2, GTX480, GTX560, GTX680) with chip generation (G..., G..., GF..., GF..., K...), CUDA capability, #MP/GPC, cores per MP, warp schedulers per MP, GPU clock (MHz), and memory clock (MHz).]

7 First work. Are GPUs predictable, regarding block scheduling and regarding thread scheduling? Can predictability be improved, by resetting the initial state of the GPU or by changing the frequency?

8 Measure of predictability. How to measure predictability in the context of data-parallel processing? Solution (statistical mode): for one problem, one program, one input dataset, and one processor, look at the output over several runs and take the frequency of the most probable output. Example: over 10 runs the outputs are x, y, z, w, x, x, y, x, z, t; the most frequent output, x, appears 4 times out of 10, so the predictability is 40%.
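A minimal host-side sketch of this statistical-mode measure (an assumed helper, not from the slides): given the outputs of repeated runs, it returns the fraction of runs that produced the most frequent output.

    #include <algorithm>
    #include <map>
    #include <string>
    #include <vector>

    // Predictability = (occurrences of the most frequent output) / (number of runs).
    double predictability(const std::vector<std::string>& outputs) {
        std::map<std::string, int> counts;
        for (const auto& o : outputs) ++counts[o];
        int best = 0;
        for (const auto& kv : counts) best = std::max(best, kv.second);
        return outputs.empty() ? 0.0 : static_cast<double>(best) / outputs.size();
    }
    // Example from the slide: {x,y,z,w,x,x,y,x,z,t} -> 4/10 = 0.40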

9 Block vs. warp? Test block scheduling: launch 1 to 32 blocks with 1 warp per block. Test warp scheduling: launch #MP blocks with 1 to maxThreads threads per block.

    __global__ void TestOrder(int* dmem) {
        const unsigned int gid = blockDim.x * blockIdx.x + threadIdx.x;
        int cl;
        SYNC;              // placeholder: nothing, __syncthreads(), or bar.sync depending on the experiment
        cl = clock();
        dmem[gid] = cl;
    }

Which is the more predictable, warp or block scheduling?
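A host-side launch sketch for the two tests (an assumed setup, not shown on the slides; numMP and the loop bounds are illustrative): the timestamp each thread writes is then used to recover the order in which warps and blocks were scheduled.

    // Assumed host driver for the two experiments.
    const int warpSize32 = 32;
    int numMP = 16;   // #MP of the card, e.g. obtained from cudaGetDeviceProperties
    int* dmem;

    // Block-scheduling test: 1..32 blocks, one warp (32 threads) each.
    for (int blocks = 1; blocks <= 32; ++blocks) {
        cudaMalloc(&dmem, blocks * warpSize32 * sizeof(int));
        TestOrder<<<blocks, warpSize32>>>(dmem);
        cudaDeviceSynchronize();
        // ... copy dmem back, sort threads by timestamp, record the observed order ...
        cudaFree(dmem);
    }

    // Warp-scheduling test: #MP blocks, 1..16 warps per block.
    for (int warps = 1; warps <= 16; ++warps) {
        cudaMalloc(&dmem, numMP * warps * warpSize32 * sizeof(int));
        TestOrder<<<numMP, warps * warpSize32>>>(dmem);
        cudaDeviceSynchronize();
        // ... same post-processing ...
        cudaFree(dmem);
    }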

10 Warp scheduling. [Chart: predictability on C870, GTX480, GTX560, and GTX680 as a function of the number of warps (2 to 16).]

11 Block scheduling. [Chart: predictability on C870, GTX480, GTX560, and GTX680 as a function of the number of blocks (5 to 25).]

12 What is next? Choice 1: I don't trust clock(), let's try global memory access. Choice 2, let's optimize it by: 2.a synchronizing the warps of a block before starting; 2.b resetting the device before starting; 2.c reducing the clock frequency.

13 I don't trust clock(), let's use a global memory access: atomicAdd(&cpt, 1);
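A sketch of how the test kernel changes (an assumed variant; cpt is a global counter reset to zero before each launch): instead of reading the per-SM clock, each thread records its rank in the global arrival order.

    // Assumed variant of TestOrder: record arrival order via a global counter
    // instead of the per-SM clock() register.
    __device__ int cpt = 0;                  // global counter, reset to 0 before each launch

    __global__ void TestOrderAtomic(int* dmem) {
        const unsigned int gid = blockDim.x * blockIdx.x + threadIdx.x;
        __syncthreads();                     // or whichever SYNC variant the experiment uses
        dmem[gid] = atomicAdd(&cpt, 1);      // rank of this thread in the global arrival order
    }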

14-16 Global memory access. [Charts: predictability measured with clock() vs. atomicAdd(), for warp scheduling (2 to 16 warps) and block scheduling (5 to 25 blocks), on C870, GTX480, GTX560, and GTX680.]

17 Synchronize the warps: asm volatile("bar.sync 0;");
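One way to plug this into the test kernel (a sketch; SYNC is the placeholder used in the kernel above, and the macro definitions are assumptions): the synchronization point is written as inline PTX so every warp of the block reaches barrier 0 before taking its timestamp.

    // Assumed definition of the SYNC placeholder for this experiment: all warps
    // of the block wait at PTX barrier 0 (equivalent to __syncthreads()) before
    // the measurement.
    #define SYNC asm volatile("bar.sync 0;")

    // The "32 synchronizations" variant of slides 20-21 simply repeats the barrier:
    #define SYNC_32 do { for (int i = 0; i < 32; ++i) __syncthreads(); } while (0)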

18-19 One synchronization. [Charts: predictability measured with clock() when one __syncthreads() is issued before the timestamp, for warp scheduling (2 to 16 warps) and block scheduling (5 to 25 blocks), on C870, GTX480, GTX560, and GTX680. Improvement for the GTX680.]

20-21 32 synchronizations. [Charts: predictability measured with clock() when 32 __syncthreads() are issued before the timestamp, same experiments as above. Improvement for every GPU except the G80.]

22 Play with clock frequency

23 Clock frequency. GTX480: default clocks GPU 701 MHz, memory 1848 MHz, shader 1401 MHz; each clock of each of the 4 performance levels set to 900 MHz. GTX560: default clocks memory 2004 MHz, shader 1620 MHz; each clock of each performance level set to 1 GHz.

24 Underclocking. [Charts: predictability measured with clock() with and without underclocking, for block scheduling (1 to 28 blocks) and warp scheduling (1 to 16 warps), on GTX480, GTX560, and GTX680.]

25 Reset the device: cudaDeviceReset();
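A sketch of how the reset fits into the measurement loop (an assumed harness; the constants are illustrative): the device context is torn down and re-created before every run so each launch starts from the same driver state.

    // Assumed measurement harness: reset the device before every run.
    const int numRuns = 100, numBlocks = 14, threadsPerBlock = 32;
    const int numThreads = numBlocks * threadsPerBlock;

    for (int run = 0; run < numRuns; ++run) {
        cudaDeviceReset();                       // destroy the context and its state
        cudaSetDevice(0);                        // re-create a context on the same GPU
        int* dmem;
        cudaMalloc(&dmem, numThreads * sizeof(int));
        TestOrder<<<numBlocks, threadsPerBlock>>>(dmem);
        cudaDeviceSynchronize();
        // ... copy back, record the observed schedule ...
        cudaFree(dmem);
    }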

26-27 Reset. [Charts: predictability measured with clock() and with atomicAdd() after cudaDeviceReset(), for warp scheduling (2 to 16 warps) and block scheduling (5 to 25 blocks), on C870, 980GX2, GTX480, GTX560, and GTX680.]

28 Impact of software. 2 examples: FP summation; tree operations (rootfix, leaffix).

29 A simple problem: FP summation. «Technology Challenges in Achieving Exascale Systems», DARPA report, 2008.

30 Solutions for reduction. [Diagram: three reduction strategies for a grid of blocks of threads, combining accumulation in private memory (registers), atomicAdd() into shared memory, and atomicAdd() into global memory — Solution no. 1, Solution no. 2, Solution no. 3.]
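As a concrete sketch of one such strategy (an assumption about which combination the diagram shows): each thread accumulates into a per-block shared accumulator with atomicAdd(), and one thread per block then pushes the partial sum to global memory with a second atomicAdd(). The order of the atomic additions, and therefore the rounded floating-point result, can differ from run to run.

    // Illustrative variant of a block-level FP reduction (assumed to correspond
    // to one of the three solutions sketched on the slide).
    __global__ void reduceAtomicShared(const float* in, float* result, int n) {
        __shared__ float blockSum;                        // per-block partial sum
        int gid = blockDim.x * blockIdx.x + threadIdx.x;
        if (threadIdx.x == 0) blockSum = 0.0f;
        __syncthreads();
        if (gid < n) atomicAdd(&blockSum, in[gid]);       // shared-memory atomics: order not fixed
        __syncthreads();
        if (threadIdx.x == 0) atomicAdd(result, blockSum); // one global atomic per block
    }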

31 Predictability. [Chart: predictability (0 to 100%) of versions V1, V2, and V3 on the GTX480 as a function of the number of warps (1 to 16).]

32 Optimized predictability. GPU + memory clocks at 1 GHz; device reset; __syncthreads(). [Chart: predictability of V1, V2, and V3 on the GTX480 as a function of the number of warps (1 to 16).]

33 CUDA SDK

34 It's time to graduate. 3 easy factors to improve predictability: reset the device; align the clock frequencies (and lower them, by the way); synchronize the warps before starting. However, software plays an important role, and all this still depends on floating-point non-associativity, the chosen algorithm, and the compiler's optimizations (a small illustration of the non-associativity follows below).
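A minimal illustration of why floating-point summation order matters (the values are chosen only for the example): changing the association of the same three operands changes the rounded result, so any reduction whose combination order varies between runs can return different sums.

    #include <cstdio>

    int main() {
        // Rounding to float makes addition non-associative: the small term is
        // absorbed when it is added to the large one first.
        float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
        float left  = (a + b) + c;   // (a + b) = 0, so the result is 1
        float right = a + (b + c);   // (b + c) rounds back to -1.0e8f, so the result is 0
        printf("(a+b)+c = %g, a+(b+c) = %g\n", left, right);
        return 0;
    }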

35 2 sweep operations. Rootfix: scans a tree from top to bottom and writes the sum of its parents into every child, down to the leaves. Leaffix: scans a tree from bottom to top and writes the sum of its children into every parent, up to the root.
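A sequential reference sketch of the two sweeps (an assumption about the exact convention: each node receives the sum of its strict ancestors for rootfix, and of its strict descendants for leaffix; the array-based tree encoding is also assumed).

    #include <vector>

    // Assumed tree encoding: parent[i] is the parent of node i, parent[root] == -1,
    // and nodes are numbered so that a parent always precedes its children.
    std::vector<float> rootfix(const std::vector<int>& parent, const std::vector<float>& val) {
        std::vector<float> out(val.size(), 0.0f);
        for (size_t i = 0; i < val.size(); ++i)               // top-down sweep
            if (parent[i] >= 0) out[i] = out[parent[i]] + val[parent[i]];
        return out;                                           // sum of all ancestors of node i
    }

    std::vector<float> leaffix(const std::vector<int>& parent, const std::vector<float>& val) {
        std::vector<float> out(val.size(), 0.0f);
        for (size_t i = val.size(); i-- > 0; )                // bottom-up sweep
            if (parent[i] >= 0) out[parent[i]] += out[i] + val[i];
        return out;                                           // sum of the subtree below node i
    }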
