Performance Cloning: A Technique for Disseminating Proprietary Applications as Benchmarks

1 Performance Cloning: A Technique for Disseminating Proprietary Applications as Benchmarks
Ajay Joshi (University of Texas), Lieven Eeckhout (Ghent University, Belgium), Robert H. Bell Jr. (IBM Corp.), Lizy John (University of Texas)
IEEE International Symposium on Workload Characterization, October 26, 2006

2 Outline: Background and Motivation; Performance Cloning Central Idea; Performance Cloning Framework; Workload Profiling; Algorithm for Clone Generation; Analysis and Results; Summary

3 Benchmark Spectrum
 - Toy Benchmarks, e.g. Hanoi, Heapsort
 - Microbenchmarks, e.g. STREAM
 - Kernel Codes, e.g. Livermore Loops
 - Application Suites, e.g. SPEC CPU
 - Synthetic Benchmarks, e.g. Dhrystone, Whetstone
 - Complete Application Code
One end of the spectrum: Less Development Effort, More Scalable, More Maintainable, Less Representative
Other end of the spectrum: More Development Effort, Less Scalable, Less Maintainable, More Representative

4 Real World Applications as Benchmarks
 - Increases confidence in making design tradeoffs
 - Customize microprocessor design to specific applications
 - Best way to understand processor's use
 - Perhaps the only way to understand emerging workload characteristics
 - Simplifies purchasing decisions for customers

5 Challenges With Using Real World Applications
Real world applications tend to be proprietary
Using real world applications for performance studies can be tedious
 - Difficult to duplicate user environment
 - Modifying application to research environment
 - Duplicating real input data set
Real world workloads are a moving target

6 The Problem
 - Need a methodology to create benchmarks that capture the main performance characteristics of real world applications
 - Resulting benchmarks should hide the functional meaning of the code
 - Ability to study what-if scenarios by varying program characteristics

7 Outline: Background and Motivation; Performance Cloning Central Idea; Performance Cloning Framework; Workload Profiling; Algorithm for Clone Generation; Analysis and Results; Summary

8 Performance Cloning Central Idea
Real World Application -> Measure Inherent Workload Characteristics -> Generate Clone with Similar Characteristics -> Performance Clone
Workload Characteristics: Instruction Mix, Basic Block Size, ILP, Data Locality, ...
Instruction sequences shown in the figure:
 - ADD R1, R2, R3; LD R4, R1, R6; MUL R3, R6, R7; ADD R3, R2, R5; DIV R10, R2, R1; SUB R3, R5, R6; STORE R3, R10, R20
 - ADD R1, R2, R3; LD R4, R1, R6; MUL R3, R6, R7; ADD R3, R2, R5; DIV R10, R2, R1; SUB R3, R5, R1; BEQ R3, R6, LOOP; SUB R3, R5, R6; STORE R3, R10, R20; DIV R10, R2, R1

9 Performance Cloning Framework
Stages: Microarchitecture-Independent Workload Profiling; Modeling Workload Attributes into Synthetic Workload; Experiment Environment
Flow: Real World Proprietary Workload -> Workload Profiler (Binary Instrumentation OR Simulation) -> Workload Profile -> Workload Synthesizer -> Synthetic Benchmark Clone -> Real Hardware or Execution Driven Simulator
Workload Profile = Workload Attributes + Distribution Of Attribute Values

10 Outline: Background and Motivation; Performance Cloning Central Idea; Performance Cloning Framework; Workload Profiling; Algorithm for Clone Generation; Analysis and Results; Summary

11 Microarchitecture-Independent Profile
 - Control Flow Behavior
 - Data Locality
 - Control Flow Predictability
 - Instruction Mix
 - Instruction Level Parallelism

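12 Control Flow Behavior (1)
Statistical Flow Graph (K=0) [Eeckhout et al., ISCA 2004]
[Figure: profiling the application binary yields a statistical flow graph. Nodes are annotated with occurrence counts (5, 3, 2 are legible) and edges with transition probabilities (40%, 60%, 100%, 33%, 67%).]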

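13 Control Flow Behavior (2)
Statistical Flow Graph (K=1) [Eeckhout et al., ISCA 2004]
[Figure: profiling the application binary yields a statistical flow graph with one block of history per node. Nodes are annotated with occurrence counts (2, 3, 1, 2, 1 are legible) and edges with transition probabilities (60%, 100%, 67%, 33%, 100%, 100%).]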
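The two statistical-flow-graph slides can be illustrated with a small sketch. It assumes the profiler already produces a trace of basic-block labels; the function name `build_sfg`, the trace, and the tuple-based node encoding are illustrative choices, not taken from the slides. A node of order K is a basic block together with its K predecessors, so K=0 gives one node per block and K=1 gives one node per (previous block, current block) pair.

```python
from collections import Counter, defaultdict

def build_sfg(block_trace, k=0):
    """Build an order-k statistical flow graph from a basic-block trace.

    Returns (counts, probs): counts maps each node to its occurrence count,
    probs maps each node to {successor node: transition probability}.
    A node is the tuple of the current block preceded by its k predecessors.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    # Node at trace position i covers blocks i-k .. i.
    nodes = [tuple(block_trace[i - k:i + 1]) for i in range(k, len(block_trace))]
    counts = Counter(nodes)
    edges = defaultdict(Counter)
    for src, dst in zip(nodes, nodes[1:]):
        edges[src][dst] += 1
    probs = {
        src: {dst: n / sum(out.values()) for dst, n in out.items()}
        for src, out in edges.items()
    }
    return counts, probs

# Hypothetical trace of basic-block labels (not the one on the slides).
trace = list("ABDACDABD")
counts0, probs0 = build_sfg(trace, k=0)
counts1, probs1 = build_sfg(trace, k=1)
print(probs0[("A",)])      # successors of block A with K=0
print(probs1[("D", "A")])  # successors of A when it was reached from D (K=1)
```

With K=1 the successor probabilities of a block depend on how it was reached, which is the extra context the second slide adds over the first.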

14 Modeling Memory Data Access Pattern
 - Identify streams of data references
 - Stream: a sequence of memory addresses in an arithmetic progression
 - Elements of arrays A, B, and C form 3 streams: for (ii = 0; ii < N; ii++) A[ii] = B[ii] + C[ii]
 - Streams: 200, 204, ...; 320, 324, ...; 404, 408, ...
 - Issuing sequence: 320, 404, 200, 324, 408, 204, ...
 - Streams are interleaved and may contain noise: 4, 8, 12, 16, 1, 3, 20, 24, 5, 7, 2, 9, 11, 28

15 Extracting Streams
Reference pattern of static Load / Store instructions
PC-correlated spatial locality
 - Dependence on address referenced by nearby Ld / St
 - Programs with pointer chasing codes
PC-correlated temporal locality
 - Dependence on previous address generated by same Ld / St
 - Programs with multidimensional arrays
Could static Load / Store instructions be natural sources of streams?
Profile every static Load / Store instruction
 - Number of different strides with which it accesses data
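Profiling each static load/store for the strides it uses might look like the sketch below. It assumes a memory trace of (PC, address) pairs; the PCs 0x10, 0x14, and 0x18 are hypothetical, while the addresses follow the issuing sequence from the array example (320, 404, 200, 324, 408, 204, ...). The function names are illustrative.

```python
from collections import Counter, defaultdict

def stride_profile(mem_trace):
    """Return {pc: Counter(stride -> count)} for a trace of (pc, address) pairs.

    The stride of a reference is its address minus the previous address
    generated by the same static load/store (same pc).
    """
    last_addr = {}
    strides = defaultdict(Counter)
    for pc, addr in mem_trace:
        if pc in last_addr:
            strides[pc][addr - last_addr[pc]] += 1
        last_addr[pc] = addr
    return dict(strides)

def dominant_stride(profile):
    """Return {pc: (most frequent stride, fraction of that pc's strides it covers)}."""
    result = {}
    for pc, counter in profile.items():
        stride, n = counter.most_common(1)[0]
        result[pc] = (stride, n / sum(counter.values()))
    return result

# Two loads and one store per loop iteration, interleaved as in the issuing sequence.
mem_trace = []
for ii in range(4):
    mem_trace += [(0x10, 320 + 4 * ii), (0x14, 404 + 4 * ii), (0x18, 200 + 4 * ii)]

profile = stride_profile(mem_trace)
print(dominant_stride(profile))
```

Keying on the PC separates the interleaved streams without any search: each static instruction in this example walks its own arithmetic progression with stride 4.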

16 Behavior of Static Load/Store Instructions
[Figure: Percentage of Dynamic Memory References (y-axis) per benchmark: basicmath, bitcount, crc32, dijkstra, fft, ghostscript_mibench, gsm, jpeg, patricia, qsort, rsynth, stringsearch, susan, typeset, cjpeg, djpeg, g721-decode, ghostscript_media, mpeg-decode, rasta, rawaudio, texgen, unepic]
As a First-Order Model, Static Load/Stores can be modeled as a single stream

17 Modeling Control Flow Predictability
Capture behavior of easy and difficult to predict branches
Inherent program feature that captures branch behavior: Transition Rate [Haungs et al., HPCA 00]
Transition Rate = # of Taken-Not Taken transitions / # of times executed
 - Branches with low transition-rate (easier to predict): TTTTTTTTTN, NNNNNNNNNT
 - Branches with high transition-rate (easier to predict): TNTNTNTNTN
 - Branches with moderate transition-rate (tougher to predict)
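The transition-rate definition above translates directly into a few lines of code. This is a minimal sketch that assumes a branch's history is available as a string of 'T' (taken) and 'N' (not taken) outcomes; the function name is illustrative.

```python
def transition_rate(outcomes):
    """Transition rate of one static branch.

    outcomes is a string of 'T'/'N' outcomes in execution order. The rate is
    the number of taken/not-taken changes divided by the number of executions.
    """
    if not outcomes:
        raise ValueError("branch was never executed")
    transitions = sum(1 for prev, cur in zip(outcomes, outcomes[1:]) if prev != cur)
    return transitions / len(outcomes)

for history in ("TTTTTTTTTN", "NNNNNNNNNT", "TNTNTNTNTN"):
    print(history, transition_rate(history))
```

The first two histories from the slide give a low rate and the alternating one a high rate; both extremes follow a simple pattern, which is why the slide marks them as easier to predict than branches with moderate rates.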

18 Modeling Instruction Level Parallelism
Dependency Distance
Example: ADD R1, R3, R4; MUL R5, R3, R2; ADD R5, R3, R6; LD R4, (R8); SUB R8, R2, R1
Read After Write Dependency Distance = 3
Measure Distribution of Dependency Distances: Up to 1, Up to 2, Up to 4, Up to 8, Up to 16, Up to 32, >32
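A sketch of how the distribution of read-after-write dependency distances could be collected into the buckets listed on the slide. It assumes each instruction is given as (destination register or None, list of source registers) and measures distance as the difference in instruction positions between producer and consumer, so back-to-back dependent instructions have distance 1. That counting convention is an assumption and may differ from the one used on the slide; the register names are hypothetical.

```python
import bisect
from collections import Counter

BUCKET_LIMITS = [1, 2, 4, 8, 16, 32]
BUCKET_NAMES = ["up to 1", "up to 2", "up to 4", "up to 8",
                "up to 16", "up to 32", ">32"]

def dependency_distance_histogram(instructions):
    """Histogram of read-after-write dependency distances.

    instructions is a list of (dest_reg_or_None, [src_regs]) in program order.
    Distance = consumer index - index of the most recent writer of the register
    (an assumed convention).
    """
    last_writer = {}
    hist = Counter()
    for i, (dest, srcs) in enumerate(instructions):
        for reg in srcs:
            if reg in last_writer:
                distance = i - last_writer[reg]
                bucket = bisect.bisect_left(BUCKET_LIMITS, distance)
                hist[BUCKET_NAMES[bucket]] += 1
        if dest is not None:
            last_writer[dest] = i
    return hist

# Hypothetical four-instruction sequence.
program = [
    ("r1", ["r2", "r3"]),
    ("r4", ["r1", "r2"]),
    ("r5", ["r1", "r4"]),
    (None, ["r5", "r1"]),   # e.g. a store: reads registers, writes none
]
hist = dependency_distance_histogram(program)
print(dict(hist))
```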

19 Outline: Background and Motivation; Performance Cloning Central Idea; Performance Cloning Framework; Workload Profiling; Algorithm for Clone Generation; Analysis and Results; Summary

20 Performance Clone Generation
Workload Profile: Instruction Mix; Register Dependency Distance; Stride Pattern of Load/Store; Branch Transition Rate; Branch Transition Probabilities
[Figure: flow graph with transition probabilities 0.9 and 0.1]

21 Performance Clone Generation
Workload Profile -> Synthetic Clone Generation
[Figure: flow graph with transition probabilities 0.8 and 0.2]

22 Performance Clone Generation
Workload Profile -> Synthetic Clone Generation
 - 1 Big Loop

23 Performance Clone Generation
Workload Profile -> Synthetic Clone Generation
 - 1 Big Loop
 - Memory Access Model (Strides)

24 Performance Clone Generation
Workload Profile -> Synthetic Clone Generation
 - 1 Big Loop
 - Memory Access Model (Strides)
 - Branching Model Based on Transition Rate

25 Performance Clone Generation
Workload Profile -> Synthetic Clone Generation
 - 1 Big Loop
 - Memory Access Model (Strides)
 - Branching Model Based on Transition Rate
 - Register Assignment
 - C code with asm & volatile constructs
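One part of clone generation, laying out a chain of basic blocks for the body of the single big loop, can be sketched as a weighted random walk over the branch transition probabilities. This is only an illustration under stated assumptions: the graph below is hypothetical (it reuses the 0.8/0.2 split visible in the figure but not its topology), the function name is invented, and the slides do not spell out the exact selection rule the synthesizer uses.

```python
import random

def generate_block_chain(probs, start, length, seed=0):
    """Walk a flow graph to produce a chain of basic-block labels.

    probs maps a block to {successor: transition probability}. The walk starts
    at start and picks each successor at random, weighted by probability.
    It stops early if a block has no successors.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    chain = [start]
    while len(chain) < length:
        successors = probs.get(chain[-1])
        if not successors:
            break
        blocks = list(successors)
        weights = [successors[b] for b in blocks]
        chain.append(rng.choices(blocks, weights=weights, k=1)[0])
    return chain

# Hypothetical flow graph: A branches to B (0.8) or C (0.2); both rejoin at D.
flow_graph = {
    "A": {"B": 0.8, "C": 0.2},
    "B": {"D": 1.0},
    "C": {"D": 1.0},
    "D": {"A": 1.0},
}
chain = generate_block_chain(flow_graph, start="A", length=12)
print(chain)
```

In a fuller synthesizer each block in the chain would then be populated according to the instruction mix, dependency distances, strides, and branch transition rates from the profile, and emitted as C code with asm and volatile constructs as the last slide describes.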

26 Outline: Background and Motivation; Performance Cloning Central Idea; Performance Cloning Framework; Workload Profiling; Algorithm for Clone Generation; Analysis and Results; Summary

27 Tools & Benchmarks
 - SimpleScalar/Wattch Simulators for profiling and cycle-accurate simulation
 - Alpha ISA
 - Programs compiled with Compaq cc v O3 level
 - Benchmarks from MiBench and MediaBench benchmark suites as representatives of characteristics of Embedded Applications
Application Domain: Program
 - Automotive: basicmath, qsort, bitcount, susan
 - Networking: crc32, dijkstra, patricia
 - Telecommunication: fft, gsm
 - Office: ghostscript, rsynth, stringsearch
 - Consumer: jpeg, typeset
 - Media: cjpeg, djpeg, g721-decode, ghostscript, mpeg, rasta, rawaudio, texgen, unepic

28 Evaluation
Absolute accuracy
 - Ability of performance clone to estimate absolute IPC and Power
Relative accuracy
 - Sensitivity (IPC and Power) of performance clone to cache & microarchitecture design changes
Base Configuration
 - L1 I-cache: 16 KB/2-way/32 B
 - L1 D-cache: 16 KB/2-way/32 B
 - L2 Unified cache: 64 KB/4-way/64 B
 - Fetch, Decode, and Issue Width: 1-wide out-of-order
 - Fetch Queue: 8 entry
 - Branch Predictor: 2-level GAp predictor
 - Functional Units: 2 Integer ALU, 1 FP Multiplication Unit, 1 FP ALU
 - Reorder Buffer: 16 entries
 - Load Store Queue: 8 entries
 - Memory (Bus Width, First Block Latency): 8 B, 40 cycles

29 Absolute Accuracy in IPC
[Figure: IPC on Base Configuration, Original Benchmark vs. Synthetic Clone, for each of the MiBench and MediaBench programs]
Average absolute error in estimating IPC is 8.7%
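The slide does not state the formula behind "average absolute error". A common reading, assumed here, is the mean over benchmarks of |clone - original| / original, expressed as a percentage. The benchmark names and IPC values below are made up purely to exercise the function and are not results from the study.

```python
def average_absolute_error_percent(original, clone):
    """Mean of |clone - original| / original over all benchmarks, in percent.

    original and clone map benchmark name -> metric (e.g. IPC or power).
    Assumed definition; the slide gives only the resulting averages.
    """
    if not original:
        raise ValueError("no benchmarks given")
    errors = [abs(clone[name] - original[name]) / original[name] for name in original]
    return 100.0 * sum(errors) / len(errors)

# Hypothetical IPC values, for illustration only.
ipc_original = {"bench_a": 1.00, "bench_b": 0.50}
ipc_clone = {"bench_a": 1.10, "bench_b": 0.45}
error = average_absolute_error_percent(ipc_original, ipc_clone)
print(round(error, 2))
```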

30 Absolute Accuracy in Power
[Figure: Power Consumption on Base Configuration, Original Benchmark vs. Synthetic Clone, for each of the MiBench and MediaBench programs]
Average absolute error in estimating power is 6.4%

31 Tracking Design Changes (1)
Across 28 cache configurations
[Figure: Pearson's Correlation Coefficient per benchmark (basicmath, bitcount, crc32, dijkstra, fft, ghostscript, gsm, jpeg, patricia, qsort, rsynth, stringsearch, susan, typeset, cjpeg, djpeg, g721-decode, ghostscript, mpeg-decode, rasta, rawaudio, texgen, unepic) and their Average]
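Pearson's correlation coefficient measures how closely the clone's metric tracks the original's as the cache configuration changes. The sketch below computes it from two equal-length lists, one value per configuration; the sample numbers are hypothetical and the function name is illustrative.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical metric values for an original program and its clone
# across four cache configurations.
original = [0.40, 0.55, 0.70, 0.90]
clone = [0.45, 0.58, 0.76, 0.93]
print(pearson(original, clone))
```

A coefficient near 1 means the clone moves in step with the original across configurations even where the absolute values differ, which is the property relative accuracy is concerned with.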

32 Tracking Design Changes (2)
Across 28 cache configurations
[Figure: Ranking of Cache Configuration (Real) vs. Ranking of Cache Configuration (Synthetic)]

33 Tracking Design Changes (3)
5 Different Microarchitecture Changes
Design Change: Average Relative Error in IPC, Average Relative Error in Power
 - Double the number of entries in the reorder buffer and load store queue: 5.81%, 3.41%
 - Reduce the L1 cache size to half: 1.48%, 0.39%
 - Double the fetch, decode, and issue width
 - Change the predictor from a 2-level to a not-taken predictor
 - Change the instruction issue policy to in-order
Values for the last three changes, in the order they appear in the source: 5.41%, 4.59%, 6.51%, 1.80%, 3.26%, 1.22%

34 Outline: Background and Motivation; Performance Cloning Central Idea; Performance Cloning Framework; Workload Profiling; Algorithm for Clone Generation; Analysis and Results; Summary

35 Conclusions
Technique that clones performance but hides functional meaning of code
 - Architects & Designers can get access to proprietary workloads
 - Foster benchmark sharing between industry and academia
 - Customers can make informed purchase decisions
Evaluation of technique on embedded benchmarks is promising
 - Synthetic clone exhibits similar power/performance characteristics
 - Synthetic clone is a good proxy to original application

36 Challenges & Limitations
Compiler technology is absorbed into the performance clone
 - Limited use for compiler studies
Benchmark contains ISA specific embedded asm statements
 - Every embedded microprocessor designer cares about single ISA
 - Possibilities for true portability: virtual ISA, binary translation
Abstract workload model simple by construction
 - Ability to perform what-if performance studies
 - Higher order models to capture complex dataflow
