Delft-Java Link Translation Buffer

John Glossner (1,2) and Stamatis Vassiliadis (2)

(1) Lucent / Bell Labs, Advanced DSP Architecture and Compiler Research, Allentown, Pa, glossner@lucent.com
(2) Delft University of Technology, CARDIT - Computer Architecture and Digital Technique, Delft, The Netherlands, {glossner,stamatis}@cardit.et.tudelft.nl

Overview
- Java Properties
- S/W Bytecode Execution
- H/W Bytecode Execution
- Delft-Java Engine
- Java Hardware Support
- Java Dynamic Instruction Translation
- Link Translation Buffer
- Preliminary Results
- Conclusions

Java Properties
- Object-Oriented Programming Language
  - Inheritance and Polymorphism Supported
  - Programmer Supplied Parallelism (Threads)
- Dynamically Linked
  - Resolved C++'s fragile class problem but imposes performance constraints on class access
  - Entire set of objects in system not required at compile time
- Strongly Typed
  - Statically determinable type state enables simple on-the-fly translation of bytecodes into efficient machine code [Gos95]
- Compiled to Platform Independent Virtual Machine

S/W Bytecode Execution
- Interpretation
  - S/W emulates the Java Virtual Machine
  - Cross platform but poses performance issues
- Just-in-time Compilation
  - Translate from bytecode to native code just prior to execution
  - 5-10x performance improvement over Interpretation
  - Compilation is only resident for the current program invocation
- Native Compilation
  - Native code generated directly from the Java source
  - Best Performance but contrary to "write once, run anywhere"

S/W Bytecode Execution (2)
- Off-line Compilation
  - Program is distributed in bytecode form and translated into native machine code (and stored on disk) prior to execution
  - Additional (time-consuming) optimizations may be performed
  - Bytecode contains nearly the same amount of information as the Java source code
- Hybrid/Dynamic Compilation
  - Highly optimized JIT integrated in run-time environment
  - Interpreted code profiled during execution
  - Hotspots detected are dynamically compiled
  - Improvements of up to 140x over interpretation and 13x over JIT reported

H/W Bytecode Execution

Sun picoJava
- Direct Execution
  - Stack Cache
    - Implemented with registers
    - Automatic stack spill/fill
  - Acceleration
    - instruction folding
- Instruction and Data Cache
  - Global L1
- Extended bytecodes
  - Complex instructions trap
- Contiguous Stack Frame

Delft-Java
- Dynamic Translation
  - Translated to RISC instructions
  - Indirect register access
  - Runtime register allocation
  - Acceleration
    - compounding instruction issue
    - multiple thread units
    - Link Translation Buffer
- Instruction and Data Cache
  - Global L1 Cache
  - Per thread L0 Instruction, Stack, Local Variable Cache
- Superset of instructions (w/ BEX)
  - Complex instructions trap
- Contiguous Stack Frame

Delft-Java Engine
- RISC-style Architecture
  - 32-bit Instructions
  - Multiple Register Files
- Concurrent Multithreaded Organization
  - Multiple Hardware Thread Units
  - Multiple Instruction Issue Per Thread
- Indirect Register Access
- Supervisory Instructions
  - Branch Java View (bex)
- Integer & Floating Point
  - 8, 16, 32, and 64-bit Signed & Unsigned Integers
  - IEEE-754 Floating Point
- Multimedia Instructions
  - SIMD Parallelism
- DSP Arithmetic Extensions
  - Saturation Logic
  - Rounding Modes
- 32-bit Address Space
  - Base + Offset + Displacement

Java Hardware Support
- Transparent Extraction of Parallelism
  - Multiple concurrent thread units
- Dynamic Java Instruction Translation
  - Register file caches stack with indirect access
- JVM Reserved Instruction Used For BEX
- Link Translation Buffer For Dynamic Linking
  - Associates a caller's object reference and constant pool entry ID with a linked object invocation
- Logical Controller For Non-Supported Translations
  - Thin interpretive layer and Java run-time

Delft-Java Organization

Link Translation Buffer
- A global cache for dynamically resolved names
- Associates a caller's method invocation with the (previously) resolved method address
- The first time an object method is invoked, the controller resolves the constant pool name
- Additional invoke instructions with the same signature are executed from the LTB cache
- JVM invoke instructions maintain high-level information, which allows use of a cache

C++ vs. Java Invocation

C++

    class MyClass {
    public:
        virtual void instanceMethod() {}
    };

    class MySubClass : public MyClass {
    public:
        virtual void instanceMethod() {}
    };

    int main() {
        MyClass mc;
        MySubClass sub;
        MyClass& msc = sub;    // a reference avoids slicing, so the call below dispatches virtually
        mc.instanceMethod();   // calls MyClass::instanceMethod
        msc.instanceMethod();  // calls MySubClass::instanceMethod
    }

Java

    public class MyClass {
        public void instanceMethod() {}
    }

    public class MySubClass extends MyClass {
        public void instanceMethod() {}
    }

    class Test {
        public static void main(String args[]) {
            MyClass mc = new MyClass();
            MyClass msc = new MySubClass();
            mc.instanceMethod();
            msc.instanceMethod();
        }
    }

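C++ Virtual Tables

C++

    class MyClass {
    public:
        virtual void instanceMethod() {}
    };

    class MySubClass : public MyClass {
    public:
        virtual void instanceMethod() {}
    };

    int main() {
        MyClass mc;
        MySubClass sub;
        MyClass& msc = sub;    // a reference avoids slicing, so the call below dispatches virtually
        mc.instanceMethod();
        msc.instanceMethod();
    }

[Figure: Vtables and Memory. MyClass and MySubClass each have a vtable with slots 0 to 3; one slot in each table points to that class's instanceMethod() implementation in memory.]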

Java Method Dispatch

Java

    public class MyClass {
        public void instanceMethod() {}
    }

    public class MySubClass extends MyClass {
        public void instanceMethod() {}
    }

    class Test {
        public static void main(String args[]) {
            MyClass mc = new MyClass();
            MyClass msc = new MySubClass();
            mc.instanceMethod();
            msc.instanceMethod();
        }
    }

Java Bytecodes

    Method void main(java.lang.String[])

    Line  InstAddr  Instr
    1     0         new #3 <Class MyClass>
    2     3         dup
    3     4         invokenonvirtual #7 <Method MyClass.<init>()V>
    4     7         astore_1
    5     8         new #4 <Class MySubClass>
    6     11        dup
    7     12        invokenonvirtual #6 <Method MySubClass.<init>()V>
    8     15        astore_2
    9     16        aload_1
    10    17        invokevirtual #5 <Method MyClass.instanceMethod()V>
    11    20        aload_2
    12    21        invokevirtual #5 <Method MyClass.instanceMethod()V>
    13    24        return

- Lines 10 and 12 appear to invoke the same method
- Disambiguated by the aload in lines 9 and 11
  - Line 4 stored a MyClass Object Reference in LV[1]
  - Line 8 stored a MySubClass Object Reference in LV[2]
- Functions much like a C++ virtual table. A JVM may decide to build a vtable (or method dispatch table) dynamically at runtime

LTB Acceleration
- The Constant Pool contains the name of the method to be invoked, stored as a string (e.g. a symbol table)
- The JVM searches for the method based on run-time type and returns the address of the method being invoked
- The resolved address can now be associated with the run-time type and Constant Pool offset
- An LTB accelerates late binding by storing the association in a special fast-access memory
- If the invocation signature is found in the LTB, the invocation address is quickly returned. Otherwise, the request is forwarded to the control unit.
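The lookup flow above can be sketched in software as a small cache in front of a slow resolver. This is a minimal illustration, not the Delft-Java hardware: the class name LtbSketch, the packing of the key into a long, and the made-up addresses returned by resolveSlowly are all assumptions introduced here.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical software model of the LTB hit/miss flow; all names are illustrative. */
final class LtbSketch {
    // Maps (run-time type id, constant pool index) to a previously resolved address.
    private final Map<Long, Integer> ltb = new HashMap<>();
    private int slowResolutions = 0;

    /** Packs a 32-bit type/reference id and a 16-bit constant pool index into one key. */
    private static long key(int runtimeTypeId, int cpIndex) {
        return ((long) runtimeTypeId << 16) | (cpIndex & 0xFFFFL);
    }

    /** Stand-in for the control unit's symbolic search; returns a made-up address. */
    private int resolveSlowly(int runtimeTypeId, int cpIndex) {
        slowResolutions++;
        return 0x1000 + 0x100 * runtimeTypeId + cpIndex;
    }

    /** Returns the method address, consulting the LTB first and resolving on a miss. */
    int invoke(int runtimeTypeId, int cpIndex) {
        long k = key(runtimeTypeId, cpIndex);
        Integer address = ltb.get(k);
        if (address == null) {                 // LTB miss: forward to the "control unit"
            address = resolveSlowly(runtimeTypeId, cpIndex);
            ltb.put(k, address);               // remember the association for next time
        }
        return address;
    }

    int slowResolutions() {
        return slowResolutions;
    }

    public static void main(String[] args) {
        LtbSketch ltb = new LtbSketch();
        // Same constant pool entry (#5), two different run-time types.
        System.out.printf("0x%X%n", ltb.invoke(1, 5));
        System.out.printf("0x%X%n", ltb.invoke(2, 5));
        System.out.printf("0x%X%n", ltb.invoke(1, 5)); // served from the cache
        System.out.println("slow resolutions: " + ltb.slowResolutions());
    }
}
```

Keying on the run-time type as well as the constant pool index is what lets two invokevirtual instructions that name the same constant pool entry resolve to different method addresses.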

LTB Organization

[Figure: LTB line layout. Fields shown: Caller's Reference (32-bit), CPool Entry (16-bit), Callee's Object Ref (32-bit), LV[0] (32-bit), CP[0] (32-bit), Other.]

- Given a Caller's ID and a Callee's ID, a method or field can be associated with a (previously resolved) physical address
- Other Possible Optimizations
  - Each (JVM 16-bit) per-frame Local Variable is mapped to a starting 32-bit physical address
  - Each (JVM 16-bit) per-class Constant Pool location is mapped to a starting 32-bit physical address
  - Synchronization locks
  - Garbage collection reference counts
  - Field data cache
- Extended instructions may lock, free, or flush an LTB line

LTB Performance
- Assumptions:
  - Unit latency execution except LTB
    - > 100 cycles on 1st access
    - 1 cycle if in LTB
  - Perfect branch prediction
  - Perfect caches (except LTB)
  - Single-thread, single in-order issue
  - Random Replacement Algorithm
  - Fully Associative LTB
- Preliminary:
  - Synthetic benchmarks used
  - C++ Model is not a compliant JVM
    - JVM APIs not supported
    - GC and IO not implemented

Workload Characteristics

    Work Load   Objects Created   Percent Dynamic Instr   Ideal Speedup
    WL1         2048              40%                     1.67
    WL2         32                10%                     1.11
    WL3         512               20%                     1.25
    WL4         1024              30%                     1.43

[Figure: Available Speedup. Application speedup (1.0 to 1.5) versus % method invocations (0% to 30%), with curves labeled Ideal, 10x, 6x, 5x, 4x, 3x, 2x, 1.5x, and 1.1x.]
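The Ideal Speedup column agrees with Amdahl's law if the Percent Dynamic Instr column is read as the fraction f of execution attributable to dynamic method invocation and that fraction is assumed to cost nothing in the ideal case. This reading, and the symbols f and s, are assumptions introduced here rather than definitions from the slides.

```latex
% f : fraction of execution spent on dynamic method invocation (assumed reading)
% s : speedup factor applied to that fraction (assumed symbol)
S(s) = \frac{1}{(1 - f) + \dfrac{f}{s}},
\qquad
S_{\text{ideal}} = \lim_{s \to \infty} S(s) = \frac{1}{1 - f}

% Check against the table:
% f = 0.40 \Rightarrow 1/0.60 \approx 1.67 \quad (\text{WL1})
% f = 0.10 \Rightarrow 1/0.90 \approx 1.11 \quad (\text{WL2})
% f = 0.20 \Rightarrow 1/0.80 = 1.25 \quad (\text{WL3})
% f = 0.30 \Rightarrow 1/0.70 \approx 1.43 \quad (\text{WL4})
```

Under the same reading, the curves labeled 10x through 1.1x in the Available Speedup figure would plausibly correspond to finite values of s, though the slides do not state this explicitly.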

LTB Performance

[Figure: Speedup. Application speedup (axis 0.90 to 1.70) for each workload (WL1 to WL4) with LTB sizes of 16, 32, 64, 128, 256, 512, and 1024 entries, plus Ideal.]

[Figure: Miss Rate. LTB miss rate (axis -10% to 100%) for each workload (WL1 to WL4) with LTB sizes of 16, 32, 64, 128, 256, 512, and 1024 entries, plus Ideal.]

    Work Load   Objects Created   Percent Dynamic Instr   Ideal Speedup
    WL1         2048              40%                     1.67
    WL2         32                10%                     1.11
    WL3         512               20%                     1.25
    WL4         1024              30%                     1.43
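To make the miss-rate experiment concrete, the sketch below models a fully associative LTB with random replacement, the organization assumed for these results. It is a hypothetical software model, not the authors' simulator: the class LtbModel, its methods, the seed, and the cycle costs passed to invokeCycles are all assumptions introduced here.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

/** Hypothetical model of a fully associative LTB with random replacement. */
final class LtbModel {
    private final int capacity;
    private final Random random;
    private final Set<Long> resident = new HashSet<>(); // keys currently held in the LTB
    private final List<Long> slots = new ArrayList<>(); // slot index -> key stored there
    private long hits = 0;
    private long misses = 0;

    LtbModel(int capacity, long seed) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
        this.random = new Random(seed);
    }

    /** Records one invocation with the given lookup key; returns true on a hit. */
    boolean access(long key) {
        if (resident.contains(key)) {
            hits++;
            return true;
        }
        misses++;
        if (slots.size() < capacity) {
            slots.add(key);                          // free line available
        } else {
            int victim = random.nextInt(capacity);   // random replacement
            resident.remove(slots.get(victim));
            slots.set(victim, key);
        }
        resident.add(key);
        return false;
    }

    /** Fraction of accesses that missed, or 0.0 if nothing was accessed. */
    double missRate() {
        long total = hits + misses;
        return total == 0 ? 0.0 : (double) misses / total;
    }

    /** Cycles spent on invocations for the given (assumed) hit and miss costs. */
    long invokeCycles(long hitCycles, long missCycles) {
        return hits * hitCycles + misses * missCycles;
    }
}
```

When the number of distinct keys fits within the capacity, only the first access to each key misses; once the working set exceeds the capacity, random replacement begins evicting live entries and the miss rate rises.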

Conclusions
- LTBs May Improve Java Performance
  - 1.1x to 1.5x on synthetic benchmarks
- Dynamic method invocation supported in ISA
  - Provides for architecturally transparent acceleration (LTB)
- A sequence of instructions is produced for a typical ISA
  - High-level operation (dynamic link invocation) is lost
  - Little possibility for run-time acceleration
- Future Work
  - Extend model to compliant JVM (non-synthetic benchmarks)
  - Characterize use of an instruction address as the caller's ID
  - Characterize performance versus LTB associativity