The Use of Multithreading for Exception Handling Craig Zilles, Joel Emer*, Guri Sohi University of Wisconsin - Madison *Compaq - Alpha Development Group International Symposium on Microarchitecture - 32 November, 1999
Overview Extensions to a multithreaded processor to reclaim lost performance during exception handling in a pipelined, out-oforder processor HARDWARE EXCEPTIONS PERFORMANCE IN TRADITIONAL IMPLEMENTATION IMPORTANT CHARACTERISTICS OF EXCEPTION HANDLERS EXPLOIT THEM WITH EXTENSION TO SMT PROCESSOR METHODOLOGY/PERFORMANCE AN OPTIMIZATION: QUICK-STARTING CONCLUSIONS International Symposium on Microarchitecture - 32, November 1999 2
Hardware exceptions COST-EFFECTIVE HARDWARE UNCOMMON CASE HANDLED BY SOFTWARE RECOVERABLE EXCEPTIONS (NOT SEGFAULTS) TLB miss unaligned access emulated instructions EVENT DETECTED BY HARDWARE, RESOLVED BY SOFTWARE A short piece of code is executed Control is returned to the application at the exception International Symposium on Microarchitecture - 32, November 1999 3
Performance problem MUCH LIKE BRANCH MISPREDICT causes CHANGE IN CONTROL FLOW often DETECTED AT EXECUTE TIME EXCEPTION DETECTED PRE-EXCEPT APPLICATION POST-EXCEPT APPLICATION T I M E SQUASH THE EXCEPTION AND POST-EXCEPTION INSTRUCTIONS PRE-EXCEPT APPLICATION FETCH/EXECUTE EXCEPTION HANDLER PRE-EXCEPT APPLICATION EXCPT. HANDLER REFETCH APPLICATION CODE PRE-EXCEPT APPLICATION EXCPT. HANDLER POST-EXCEPT APPLI DYNAMIC INSTRUCTION STREAM International Symposium on Microarchitecture - 32, November 1999 4
Trends WITH INCREASED PIPELINE LENGTH, SUPERSCALAR WIDTH, AND WINDOW SIZE IT ONLY GETS WORSE 45 40 3 STAGE 3 pipe stages 7pipe STAGE stages 11 pipe stages 11 STAGE penalty cycles per TLB miss 35 30 25 20 15 10 5 0 ALPHADOOM alphadoom applu COMPRESS compress deltablue GCC hydro2dmurphi vortex gcc murphi AVERAGE average APPLU DELTABLUE HYDRO2D VORTEX International Symposium on Microarchitecture - 32, November 1999 5
Structure of Exception Handler RECONVERGENT CONTROL FLOW The same application instructions are executed in the same order INDEPENDENT of the exception handler s execution MINIMAL DATA DEPENDENCES between application and exception handler typically only involving excepting instruction Example: TLB MISS HANDLER reads miss address from privileged register loads from page table writes TLB International Symposium on Microarchitecture - 32, November 1999 6
Extension to SMT processor RECONVERGENT CONTROL FLOW DON T SQUASH ALLOCATE THE HANDLER TO SEPARATE THREAD FIFO management of window resources (within a thread) extra hardware required for ordering threads THREAD #1 PRE-EXCEPT APPLICATION POST-EXCEPT APPLICATION #2 EXCPT. HANDLER PROVIDE APPEARANCE OF SEQUENTIAL EXECUTION Control thread retirement order International Symposium on Microarchitecture - 32, November 1999 7
Extension to SMT processor MINIMAL DATA DEPENDENCES USE SEPARATE REGISTER FILE Avoids additional renamer complexity UNCOMMON CASE (TLB MISS PAGE FAULT CONTEXT SWITCH) REVERT TO NORMAL MECHANISM MEMORY DEPENDENCES DETECT CONFLICTS, RECOVER (MUCH LIKE R10K, OR ARB) International Symposium on Microarchitecture - 32, November 1999 8
Methodology EXAMPLE IMPLEMENTATION: SOFTWARE TLB MISS HANDLING EXECUTION DRIVEN SMT SIMULATOR BUILT FROM ALPHA ARCHITECTURE SIMPLESCALAR TOOLKIT SUPPORTS ENOUGH OF 21164 PRIVILEGED ARCHITECTURE TO RUN COMMON-CASE TLB HANDLER O SPECULATIVE EXECUTION, MULTIPLE IN-FLIGHT MISSES 8 WIDE, 128 WINDOW, 7 STAGE, BIG YAGS, 64K L1 S, 1M L2 BENCHMARKS WITH NON-TRIVIAL TLB BEHAVIOR (FROM SPEC AND ELSEWHERE) SCALED DOWN (64 ENTRY) DATA TLB METRIC: PENALTY PER MISS (additional overhead vs. simulation with perfect TLB) / misses International Symposium on Microarchitecture - 32, November 1999 9
Performance DOES MUCH BETTER THAN TRADITIONAL SOFTWARE APPROACH NOT AS GOOD AS AGGRESSIVE HARDWARE TLB MISS WIDGET 40 35 TRADITIONAL traditional software MULTITHREAD-1 multithreaded(1) MULTITHREAD-3 multithreaded(3) HARDWARE hardware penalty cycles per TLB miss 30 25 20 15 10 5 0 ALPHADOOM alphadoom applu COMPRESS compress deltablue GCC hydro2dmurphi vortexaverage APPLU gcc murphi average DELTABLUE HYDRO2D VORTEX International Symposium on Microarchitecture - 32, November 1999 10
Optimization: Quick-starting PERFORMANCE GAP BETWEEN HARDWARE AND MULTI-THREADED FETCH/DECODE LATENCY SOLUTION: CACHE EXCEPTION HANDLER PARTWAY DOWN PIPELINE OUR SMT IMPLEMENTATION: PER THREAD FETCH BUFFERS, IDLE RESOURCES WHEN THREAD IS IDLE PREDICT NEXT EXCEPTION, USE IDLE FETCH CYCLES TO PREFETCH HANDLER. REDUCES MULTI-THREADED EXCEPTION LATENCY. FETCH FETCH DECODE RENAME REGREAD REGREAD EXECUTE International Symposium on Microarchitecture - 32, November 1999 11
Performance: Quick-starting ALMOST CUTS PERFORMANCE GAP IN HALF 25 MULTI-1 multithreaded(1) QUICKSTART-1 quick start(1) HARDWARE hardware 20 penalty cycles per TLB miss 15 10 5 0 ALPHADOOM alphadoom applu COMPRESS compress deltablue GCC hydro2dmurphi vortexaverage gcc murphi average APPLU DELTABLUE HYDRO2D VORTEX International Symposium on Microarchitecture - 32, November 1999 12
Single Thread Performance vs. Throughput SINGLE APPLICATION: (PREVIOUS RESULTS) FOCUS: IMPROVE SINGLE THREAD PERFORMANCE MULTIPROGRAMED/MULTITHREADED WORKLOAD: FOCUS: MAXIMIZE THROUGHPUT OUR EXPERIMENT: (NOT NECESSARILY A FAIR COMPARISON) RUN 3 APPLICATIONS, 1 IDLE THREAD FOR EXCEPTION HANDLING International Symposium on Microarchitecture - 32, November 1999 13
Performance on Multiprogramed Workloads 12 10 TRADITIONAL traditional MULTI-1 multithreaded(1) QUICKSTART-1 quick start(1) HARDWARE hardware penalty cycles per TLB miss 8 6 4 2 0 ADMadm gcc vor ADMapl cmp h2d APL APL apl dbl vor APL CMP dbl gcc h2d DBL adm cmp vor adm h2d mph apl dbl mph cmp gcc mph AVERAGE average CMP GCC H2D CMP DBL DBL GCC GCC VOR VOR MPH H2D MPH VOR MPH H2D PERFORMANCE IS MORE COMPLICATED SMT IS MORE LATENCY TOLERANT SMT IS LESS TOLERANT OF WASTED BANDWIDTH International Symposium on Microarchitecture - 32, November 1999 14
Related Work ARCHITECTURES: M-MACHINE O O FILLO, KECKLER, DALLY, CARTER, CHANG, GUREVICH, LEE KECKLER, DALLY, CHANG, LEE, CHATTERJEE MULTISCALAR/KESTREL SUBORDINATE MULTITHREADING: CHAPPEL, STARK, KIM, REINHART, AND PATT SONG AND DUBOIS International Symposium on Microarchitecture - 32, November 1999 15
Conclusions SIGNIFICANTLY IMPROVES EXCEPTION HANDLING PERFORMANCE: software TLB miss performance approaching that of an aggressive hardware TLB miss performance NOT ALL EXCEPTIONS CAN BE IMPLEMENTED IN HARDWARE HIGH PERFORMANCE EXCEPTIONS ENABLE NOVEL SOFTWARE SYSTEMS a la SOFTWARE DSM or CONCURRENT GC International Symposium on Microarchitecture - 32, November 1999 16