Re-constructing High-Level Information for Language-specific Binary Re-optimization

Size: px

Start display at page:

Download "Re-constructing High-Level Information for Language-specific Binary Re-optimization"

Jacob Baldwin
5 years ago
Views:

IBM Research - Tokyo Reid Copeland IBM Canada Motohiro

1 CGO 2016 Re-constructing High-Level Information for Language-specific Binary Re-optimization Toshihiko Koju IBM Research - Tokyo Reid Copeland IBM Canada Motohiro Kawahito IBM Research - Tokyo Moriyoshi Ohara IBM Research - Tokyo 1

2 Summary Language-specific binary optimizer can achieve competitive performance relative to the state-of-the-art source compiler. source code latest compiler re-compiled binary earlier compiler Achieved +40.1% by our binary optimizer vs % by the latest source compiler. original binary binary optimizer optimized binary language-specific contextual info. 2

3 Outline! Motivation & Goal.! Our Approach. Four Major Optimizations.! Performance Evaluation.! Conclusions. 3

4 Compilation Technologies are becoming Increasingly Important! Recent processors are equipped with specialized HW units. e.g. SIMD, FPGA, GPU, DFP (Decimal Floating-Point)! Re-compilation with the latest compilation technologies is typically required to exploit them. single-thread perf. of CPU downlevel processor with specialized HW units w/o specialized HW units CPU generation re-compilation Specialized HW units complement diminishing single-thread performance improvements. latest processor DFP SIMD FPGA GPU 4 existing application latest compiler re-compiled application

5 Compiler Migration is Often Challenging for Large Enterprise Customers! Adapting to changes in the following can be a significant undertaking: language, compiler option, build environment, runtime environment downlevel processor changes in language behavior, compiler options, build environment, runtime environment compiler migration re-compilation latest processor DFP SIMD FPGA GPU existing application latest compiler re-compiled application 5

Source code re-compilation: compiler migration changes in language behavior, compiler options, build environment, runtime environment source code

6 Ease Migration by Binary Optimization! We have developed a production-level binary optimizer for mainframe COBOL applications. IBM Automatic Binary Optimizer for z/os.! The binary optimizer utilizes the same optimization engine of the latest source code COBOL compiler. Source code re-compilation: compiler migration changes in language behavior, compiler options, build environment, runtime environment source code Binary re-optimization: existing binary 6 latest compiler Provide strong compatibility with the current environment. binary optimizer re-compiled binary optimized binary

7 Traditional Approach does not Work! Naive translation from machine instructions to IR (intermediate Representation) degraded the performance. This is typically for binary translators based on general compiler engines. original binary naive translation w/o optimizations naive translation with optimizations 7 higher is better Relative Performance 180.0% 160.0% 140.0% 120.0% 100.0% 80.0% 60.0% 40.0% 20.0% 0.0% 155.2% 100.0% 90.0% 62.3% cob42.opt binopt.o0 binopt.o1 cob5.dev source code re-compilation

8 Our Goal! Achieve competitive performance relative to the state-of-theart source code compiler. original binary naive translation w/o optimizations naive translation with optimizations 8 higher is better Relative Performance 180.0% 160.0% 140.0% 120.0% 100.0% 80.0% 60.0% 40.0% 20.0% 0.0% 155.2% fill in the gap 100.0% 90.0% 62.3% cob42.opt binopt.o0 binopt.o1 cob5.dev source code re-compilation our results

9 Outline! Motivation & Goal.! Our Approach. Four Major Optimizations.! Performance Evaluation.! Conclusions. 9

10 We Need High-level Information to Optimize Binaries! Compiler optimizations require "HLI" (high-level information). e.g. variables, data types, scopes, constants, library semantics! Only limited HLI is provided by the naive translation. Resulting in limited optimization opportunities. poor HLI (high-level information): variables data types scopes constants library semantics existing binary naive IR generator IR optimization engine optimized binary limited optimization opportunities 10

11 Our Approach: Language-Specific Binary Optimization! General binary optimizers typically do not assume source languages which results in better applicability. e.g. Dynamo [Bala00], DynamoRIO [Bruening03], DBT86/RASP [Hertzberg11]! Our binary optimizer assumes the source language to reconstruct HLI by using the "contextual information". e.g. data structure, register usage, instruction pattern General binary optimizer: No assumption on source languages. general binary binary optimizer poor HLI optimized binary limited optimizations Our binary optimizer: Assuming the source language. COBOL binary binary optimizer optimized binary 11 contextual information rich HLI sophisticated optimizations

12 Re-construction of the Key HLI machine instructions Variable location inspection of usage: location instruction pattern abstract interpretation: data structure register usage use-define analysis: EFA data structure liveness analysis: location data structure EFA (Effective Address) scope generation of aliasing information: location scope inspection of literal pool: EFA data structure determination of routine: EFA data structure inspection of instruction pattern: variable instruction pattern constant value runtime routine analyze parameter: semantics of routine parameter information data type aliasing original operation 12

13 Abstract Interpretation: Analysis for EFA! Take advantage of knowledge about data structures and register usages of COBOL. Machine instructions: COBOL Data structures & register usage: Abstract interpretation (a) Load R2,(R9+0x200) (b) Load R3,(R2+0x100) (c) Load R4,(R2+0x200) (d) Load R5,(R2+R4) R9 control block R13 stack heap (a) EFA=control block+off_heap R2=&heap (b) EFA=heap+0x100 R3=unknown (c) EFA=heap+0x200 R4=unknown (d) EFA=heap+unknown R5=unknown R12 label table literal pool 13 Without the contextual information, we cannot tell EFA which is a basis of many other HLI.

14 Four Major Optimizations! There are four significant optimizations which are newly introduced to the latest source code compiler. 1. Type Reduction of Decimal Type. 2. Strength Reduction of Edited Moves. 3. Skipping Truncations of Binary Numbers. 4. Inlining Efficient Version of Routines. 14 higher is better original binary 180.0% 160.0% 140.0% 120.0% 100.0% 80.0% 60.0% 40.0% 20.0% 0.0% four optimizations Relative Performance 100.0% naive translation w/o optimizations 62.3% 90.0% 155.2% cob42.opt binopt.o0 binopt.o1 cob5.dev naive translation with optimizations source code re-compilation

15 Optimization 1: Type Reduction of Decimal Type (1/2)! Convert decimal types to decimal floating-point (DFP) types. Computation on decimal type: memory-to-memory Computation on DFP type: register-to-register Representation of integer number 87: binary integer type " 87 zoned decimal type " 0xF8F7 packed decimal type " 0x087F Business applications perform a lot of decimal computations. Simple Example: COBOL: ZonedDecimal A, B; A = A + B HLI about variables is necessary to enable the type reduction. Original instruction: ZonedToPacked (R13+272, 5 bytes), (R2+0, 8 bytes) ByteOR (R13+276), 0xf ZonedToPacked (R13+280, 5 bytes), (R2+8, 8 bytes) ByteOR (R13+284),0xf PackedAdd (R13+272, 5bytes), (R13+280, 5 bytes) PackedToZoned (R2+0, 8 bytes), (R13+272, 5 bytes) 15

16 Optimization 1: Type Reduction of Decimal Type (2/2) Original instruction: ZonedToPacked (R13+272, 5 bytes), (R2+0, 8 bytes) ByteOR (R13+276), 0xf ZonedToPacked (R13+280, 5 bytes), (R2+8, 8 bytes) ByteOR (R13+284),0xf PackedAdd (R13+272, 5bytes), (R13+280, 5 bytes) PackedToZoned (R2+0, 8 bytes), (R13+272, 5 bytes) 16 EFA information use-def analysis liveness analysis Use-Def Analysis: *D=DEF, U=Use ZonedToPacked D ByteOR UD ZonedToPacked D ByteOR UD PackedAdd PackedToZoned UD U U Variables: TMP1 TMP2 A B U D U Optimized instruction: ZonedToDFP FPR1,*(A) ZonedToDFP FPR2,*(B) DFPAdd FPR1,FPR2 DFPToZoned FPR1,*(A) Type Reduction ZonedToPacked TMP1, A PackedSetSign TMP1, plus ZonedToPacked TMP2, B PackedSetSign TMP2, plus PackedAdd TMP1, TMP2 PackedToZoned TMP1, A Re-constructed HLI about variables

17 Optimization 2: Strength Reduction of Edited Moves! Generate a specialized instruction sequence for a string format operation in COBOL. Simple Example: COBOL: * FOO: string of the format MOVE 123 TO FOO. " FOO=0123 Original instruction: EDIT control-bytes,numeric EDIT is a millicode instruction which interprets each byte of the control-bytes. Re-construct HLI about the original operation from control-bytes. Optimized instruction: ConvToZoned TMP, numeric CopyZoned FOO, TMP 17

18 Optimization 3: Skipping Truncations of Binary Numbers! Generate guard code to skip unnecessary truncation of binary integer numbers. Simple Example: COBOL: * FOO: binary number with 4 digits. FOO=FOO Original code: TMP=FOO+1000 FOO=TMP%10000 Divide instruction is unconditionally executed to suppress overflows. Overflow is rare. Re-construct HLI about the original operation: Quot is not in use. Dividend is a constant of Pow10. Optimized code: TMP=FOO+1000 if abs(tmp) >= TMP=TMP%10000 FOO=TMP 18

19 Optimization 4: Inlining Efficient Version of Routines! Replace a runtime routine with the equivalent code which exploits a more efficient algorithm and newer instructions. Simple Example: COBOL: * FOO, BAR: large packed decimal numbers (e.g. 18 digits). FOO=FOO/BAR Original instruction: CALL LargeDivide Runtime path length is more than 100 instructions. Exploit new DFP instructions. Re-construct HLI about the original operation: Semantics of the runtime routine. Parameter information. Optimized instruction: PackedToDFP FPR1,FOO PackedToDFP FPR2,BAR DFPDivide FPR1,FPR2 DFPRound FPR1 DFPToPacked FPR1,FOO 19

20 Outline! Motivation & Goal.! Our Approach. Four Major Optimizations.! Performance Evaluation.! Conclusions. 20

21 Experimental Environment! Machine: 5.5 GHz zec12 running z/os.! Benchmark: IBM Internal benchmark suite for the development of the COBOL compilers.! Binary optimizer: Development version of IBM Automatic Binary Optimizer (2015).! Original source code compiler: IBM Enterprise COBOL V4R2 (2009).! Latest source code compiler: Development version of IBM Enterprise COBOL V5 (2015). 21 cevalnd amort6 Decimal Benchmarks accoun*ng computa*ons. mortgage payment comparisons and amor*za*on computa*ons (external FP). djpcom85 record processing. seqpr ixseq upda*ng sequen*al file records. upda*ng indexed VSAM file records. telco invupd amort2 atm06 Non- Decimal Benchmarks telecom system. upda*ng sales master records. mortgage payment comparisons and amor*za*on computa*ons (long FP). automa*c teller machine transac*ons.

22 Performance Impact of Each Optimization a. O0 (no optimization) " -37.7% b. O1 (without HLI) " -10.0% c. b. + Skipping truncation " -1.4% d. c. + Type reduction " +25.4% e. d. + Optimization of Edited move " +36.7% f. e. + inlining routines (full) " +40.1% Improved performance from -10.0% to +40.1% by re-constructing HLI.! The baseline (100%) is the performance of the original binaries. Original binaries are compiled with the highest optimization-level of the old compiler. decimal benchmarks non-decimal benchmarks 300.0%$ higher is better 22 Rela?ve$Performance$ 250.0%$ 200.0%$ 150.0%$ 100.0%$ 50.0%$ 0.0%$ cevalnd$ amort6$ djpcom85$ seqpr$ ixseq$ telco$ invupd$ amort2$ atm06$ geo.$mean$ a.$ b.$ c.$ d.$ e.$ f.$ g.$

23 Performance Comparisons with the Re-compilation f. e. + inlining routines (full) " +40.1% g. latest source code compiler " +55.2% Good competitiveness of the binary optimizer relative to the state-of-theart source code compiler.! The baseline (100%) is the performance of the original binaries. Original binaries are compiled with the highest optimization-level of the old compiler. decimal benchmarks non-decimal benchmarks 300.0%$ higher is better 23 Rela?ve$Performance$ 250.0%$ 200.0%$ 150.0%$ 100.0%$ 50.0%$ 0.0%$ cevalnd$ amort6$ djpcom85$ seqpr$ ixseq$ telco$ invupd$ amort2$ atm06$ geo.$mean$ a.$ b.$ c.$ d.$ e.$ f.$ g.$

24 Performance Comparisons with the Re-compilation f. e. + inlining routines (full) " +40.1% g. latest source code compiler " +55.2% Can get even closer by: Enabling loop optimizations. Extend range of analysis of runtime routines.! The baseline (100%) is the performance of the original binaries. Original binaries are compiled with the highest optimization-level of the old compiler. decimal benchmarks non-decimal benchmarks 300.0%$ higher is better 24 Rela?ve$Performance$ 250.0%$ 200.0%$ 150.0%$ 100.0%$ 50.0%$ 0.0%$ cevalnd$ amort6$ djpcom85$ seqpr$ ixseq$ telco$ invupd$ amort2$ atm06$ geo.$mean$ a.$ b.$ c.$ d.$ e.$ f.$ g.$

25 Outline! Motivation & Goal.! Our Approach. Four Major Optimizations.! Performance Evaluation.! Conclusions. 25

26 Conclusion Language-specific binary optimizer can achieve competitive performance relative to the state-of-the-art source compiler. source code latest compiler re-compiled binary earlier compiler Achieved +40.1% by our binary optimizer vs % by the latest source compiler. original binary binary optimizer optimized binary language-specific contextual info. 26

27 Questions? 27

A Trace-based Java JIT Compiler Retrofitted from a Method-based Compiler

A Trace-based Java JIT Compiler Retrofitted from a Method-based Compiler Hiroshi Inoue, Hiroshige Hayashizaki, Peng Wu and Toshio Nakatani IBM Research Tokyo IBM Research T.J. Watson Research Center April