iGPU: Exception Support and Speculative Execution on GPUs. Jaikrishnan Menon, Marc de Kruijf, University of Wisconsin-Madison, ISCA 2012

1 iGPU: Exception Support and Speculative Execution on GPUs. Jaikrishnan Menon, Marc de Kruijf, University of Wisconsin-Madison, ISCA 2012

2 Outline: Motivation and Challenges; Background; Mechanism; iGPU Architecture; Evaluation; Conclusion

3 Motivations and Challenges CPUs became general-purpose processors by introducing precise exceptions, context switching, and speculation. For the GPU to become a general-purpose processor, it needs to support these features too. To get there, we cannot simply mimic what CPUs do, because: 1. A huge amount of live state (tens of thousands of registers) must be preserved, so a reorder buffer is unrealistic; 2. GPU instructions can run for a very long time, which makes a reorder buffer even less practical; 3. The high fan-out control needed to maintain sequential ordering across SIMD lanes is a challenge, and it does not fit naturally in a parallel SIMD or vector structure.

4 Background: idempotent area To ensure precise exceptions, the key point is to manage writes: keep write-backs in order. CPUs use hardware mechanisms: a reorder buffer and register renaming. Vector processors and GPUs: to avoid these hardware mechanisms, we introduce a software mechanism, the idempotent area (also called an idempotent region). Definition: an idempotent area is a region of code that can be re-executed multiple times without changing the result.
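The definition can be checked on a toy model. The following is a minimal sketch, not the paper's mechanism: registers are a Python dict, an instruction is a hypothetical `(dst, fn)` pair, and the check only covers one starting state. It interrupts the region after every possible prefix, re-executes it from the start, and compares the result with a single clean run.

```python
def run(region, regs):
    """Execute a list of (dst, fn) instructions on a copy of the register state."""
    out = dict(regs)
    for dst, fn in region:
        out[dst] = fn(out)
    return out


def is_idempotent(region, regs):
    """True if restarting after any partial execution gives the clean-run result.

    This only tests the given starting state; it is an illustration, not a proof.
    """
    expected = run(region, regs)
    for k in range(len(region) + 1):
        partial = run(region[:k], regs)      # region was interrupted after k instructions
        if run(region, partial) != expected:  # re-execute from the region start
            return False
    return True


# Inputs (r0) are never overwritten, so re-execution is harmless.
safe = [("r1", lambda r: r["r0"] + 1), ("r2", lambda r: r["r1"] * 2)]
# r1 is read and then overwritten, so a second run sees a different input.
unsafe = [("r1", lambda r: r["r1"] + r["r2"])]

start = {"r0": 10, "r1": 1, "r2": 2}
safe_ok = is_idempotent(safe, start)
unsafe_ok = is_idempotent(unsafe, start)
```

The second region fails because it destroys its own input, which is exactly the situation that a restart-based scheme has to rule out.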

5 Idempotent Area How do we use idempotent areas? Set a synchronizing barrier between successive idempotent areas. Treat every barrier as a restart point. All potentially faulting instructions wait at the barrier so that their exceptions are checked before execution enters a new idempotent area.
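The restart-point idea can be sketched with a tiny interpreter. Everything here is an assumption made for illustration: the tuple instruction encoding, the `PAGE` size, the page-fault model, and the placeholder value 0 returned by a faulting load. A fault is only noticed at the next barrier, and the warp then jumps back to the last restart point and re-executes the area.

```python
PAGE = 4096  # hypothetical page size


def execute(program, regs, memory, resident_pages):
    """Run one warp; faults are checked at barriers and trigger re-execution."""
    pc, restart_pc, restarts = 0, 0, 0
    pending = []  # pages that faulted inside the current idempotent area
    while pc < len(program):
        ins = program[pc]
        if ins[0] == "boundary":
            if pending:
                resident_pages.update(pending)  # stand-in for the fault handler
                pending.clear()
                restarts += 1
                pc = restart_pc                 # re-execute the whole area
                continue
            restart_pc = pc + 1                 # area finished cleanly
        elif ins[0] == "load":
            _, dst, addr_reg = ins
            addr = regs[addr_reg]
            if addr // PAGE in resident_pages:
                regs[dst] = memory[addr]
            else:
                regs[dst] = 0                   # garbage until the area is re-run
                pending.append(addr // PAGE)
        elif ins[0] == "add":
            _, dst, a, b = ins
            regs[dst] = regs[a] + regs[b]
        pc += 1
    return regs, restarts


program = [
    ("boundary",),
    ("load", "r1", "r0"),
    ("add", "r3", "r1", "r2"),
    ("boundary",),
]
final_regs, restarts = execute(program, {"r0": 4096, "r2": 5}, {4096: 37}, set())
```

In this run the first pass computes `r3` from a garbage `r1`; because the area never overwrites its inputs `r0` and `r2`, the second pass repairs the result without any saved register state.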

6 Idempotent Area How do we find idempotent areas in code? Every single instruction is an idempotent area. Our goal is to find large idempotent areas, but how? There are some rules we can follow. Case 1: mov.u32 $r1, global[$r0]; Idem.boundary; add.u32 $r1, $r2, $r1; Idem.boundary (WAR dependence).

7 Idempotent Area Case 2: mov.u32 $r1, global[$r0]; Idem.boundary; mov.u32 $r0, $r1; mov.u32 $r2, $r0; Idem.boundary (RAW dependence). Case 3: mov.u32 $r1, global[$r0]; Idem.boundary; mov.u32 $r0, $r1; add.u32 $r0, $r0, $r2; Idem.boundary (RAW after WAR dependence). Following these rules, we can cut the code into larger, more desirable idempotent areas. But that is not enough: we should also be able to construct idempotent areas. The hint comes from comparing Case 1 with Case 3: with suitable register renaming, a non-idempotent area can be converted into an idempotent one. Conclusion: we can manually construct idempotent areas of any size.
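A simplified version of the cutting rule can be written down for straight-line code. This sketch assumes a register-only model (memory is ignored), a hypothetical `(dst, [srcs])` instruction encoding, and one specific rule: an area must be cut before an instruction that overwrites a register the area has already read without first writing it. It is applied to the Case 3 sequence, and then to a renamed variant in which `$r3` is assumed to be a free register.

```python
def find_boundaries(instrs):
    """Return indices i such that a boundary is placed just before instrs[i]."""
    boundaries = []
    live_in, written = set(), set()  # state of the current area
    region_start = 0
    self_clobber = False  # previous instruction overwrote its own live-in input
    for i, (dst, srcs) in enumerate(instrs):
        reads = {s for s in srcs if s not in written}  # inputs of the area
        clobbers = dst in live_in or dst in reads
        if i > region_start and (self_clobber or clobbers):
            boundaries.append(i)
            region_start = i
            live_in, written = set(), set()
            reads = set(srcs)
        self_clobber = dst in reads
        live_in |= reads
        written.add(dst)
    return boundaries


# Case 3: mov $r1, global[$r0]; mov $r0, $r1; add $r0, $r0, $r2
case3 = [("r1", ["r0"]), ("r0", ["r1"]), ("r0", ["r0", "r2"])]
# Same code with the second and third destinations renamed to r3 (assumed free).
renamed = [("r1", ["r0"]), ("r3", ["r1"]), ("r3", ["r3", "r2"])]

case3_cuts = find_boundaries(case3)
renamed_cuts = find_boundaries(renamed)
```

For Case 3 the sketch places a single cut before the second instruction, which overwrites `$r0` after the load has read it. After renaming, no input register is overwritten and the three instructions form one area.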

8 Observation The GPU already has supporting structures. 1. The I-Buffer in GPGPU manages the status of every warp, and the warp is the smallest unit handled by hardware. Thus I believe the warp should be the unit of context switching. 2. I believe a real GPU should have a similar component, because registers come in two types: general-purpose registers and predicate registers (boolean conditions). 3. A GPU differs from a vector processor in that warps are independent of each other, so we only need to consider one warp at a time.

9 Observation The number of live registers varies from instruction to instruction. If we place the barrier at a point with the fewest live variables, we can eliminate a lot of overhead.
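One way to find such a point is a backward liveness pass over straight-line code. The sketch below assumes a hypothetical `(dst, [srcs])` instruction encoding and a known set of registers that are live at the end of the block; it ignores control flow and the idempotence constraints themselves, and only shows how the cheapest candidate position could be ranked.

```python
def live_sets(instrs, live_out):
    """live[i] is the set of registers live just before instrs[i]."""
    live = [set() for _ in range(len(instrs) + 1)]
    live[len(instrs)] = set(live_out)
    for i in range(len(instrs) - 1, -1, -1):
        dst, srcs = instrs[i]
        # A register is live before i if i reads it, or if it is live
        # afterwards and i does not overwrite it.
        live[i] = (live[i + 1] - {dst}) | set(srcs)
    return live


def cheapest_boundary(instrs, live_out):
    """Interior position with the fewest live registers (first one on ties)."""
    live = live_sets(instrs, live_out)
    return min(range(1, len(instrs)), key=lambda i: len(live[i]))


block = [
    ("r1", ["r0"]),
    ("r2", ["r0"]),
    ("r3", ["r1", "r2"]),
    ("r4", ["r3"]),
    ("r5", ["r4"]),
]
live = live_sets(block, live_out={"r5"})
best = cheapest_boundary(block, live_out={"r5"})
```

In this example two registers are live before the second and third instructions, but only `r3` is live before the fourth, so a barrier there would need to preserve a single register.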

10 Observation A barrier at the point with the fewest live variables is not good enough to support speculation. Differences between exception and speculation: Exception: general-purpose registers must be saved and restored at the barrier; Speculation: only the PC must be saved and restored. Speculation occurs very frequently, so we can split idempotent areas into smaller partitions.

11 Compiler, ISA and Hardware Compiler: 1. Cut idempotent areas according to live state; 2. Construct areas using register renaming. Example: r1 will be used in the loop, which may introduce a WAR dependence, so a write is added before the read.

12 Compiler, ISA and Hardware ISA: 1. Add a special instruction to mark the boundary between two idempotent regions. Hardware: 1. Add a restart PC for every warp (it works like a stack register); 2. The contents of the general-purpose registers can be stored in memory, or another stack register can be added to preserve them.
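The per-warp state described above can be sketched as a small data structure. The names `Warp`, `on_boundary`, `switch_out`, and `switch_in` are hypothetical, and the list of live registers is assumed to come from the compiler. The point of the sketch is that a context switch only has to save the restart PC plus the registers live at the restart point, since an idempotent region never overwrites its own inputs.

```python
from dataclasses import dataclass, field


@dataclass
class Warp:
    pc: int = 0
    restart_pc: int = 0  # the extra per-warp register added in hardware
    regs: dict = field(default_factory=dict)


def on_boundary(warp, next_pc):
    """Effect of the boundary-marking instruction: record the new restart point."""
    warp.restart_pc = next_pc


def switch_out(warp, live_regs):
    """Save the restart PC and only the registers live at the restart point."""
    return {
        "restart_pc": warp.restart_pc,
        "regs": {r: warp.regs[r] for r in live_regs},
    }


def switch_in(saved):
    """Resume at the restart point; the region is simply re-executed."""
    return Warp(
        pc=saved["restart_pc"],
        restart_pc=saved["restart_pc"],
        regs=dict(saved["regs"]),
    )


# A warp interrupted in the middle of a region that started at pc 1.
warp = Warp(pc=0, regs={"r0": 4096, "r1": 0, "r2": 5})
on_boundary(warp, next_pc=1)
warp.pc = 3
saved = switch_out(warp, live_regs=["r0", "r2"])
resumed = switch_in(saved)
```

The partially computed `r1` is dropped on purpose: re-executing the region from `restart_pc` recomputes it.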

13 Evaluation Runtime overhead: page-fault handling plus re-execution. With 1 page fault per 1 M instructions, the overhead is less than 1 percent.

14 Evaluation Total overhead at different error rates
