Performance Impact of Multithreaded Java Server Applications

Size: px

Start display at page:

Download "Performance Impact of Multithreaded Java Server Applications"

Jasmin Johnston
6 years ago
Views:

1 Performance Impact of Multithreaded Java Server Applications Yue Luo, Lizy K. John Laboratory of Computer Architecture ECE Department University of Texas at Austin 1/2/01 1

2 Outline Motivation VolanoMark Benchmark Methodology Results Conclusion Further Work Needed 2

3 Motivation Performance under the presence of a large number of threads is crucial for a commercial Java server. Java applications are shifting from client-side to serverside. Server needs to support multiple simultaneous client connections. No select() or poll() or asynchronous I/O in Java Current Java programming paradigm: one thread for one connection. 3

4 VolanoMark Benchmark VolanoMark is a 100% Pure Java server benchmark characterized by long-lasting network connections and high thread counts. Based on real commercial software. Server benchmark. Long-lasting network connections and high thread counts. Two threads for each client connection. 4

5 VolanoMark Benchmark Server Message 1 user1 user2 user3 user1 user2 user3 Chat room 1 Chat room 2 Client 5

6 Methodology Performance counters used to study OS and user activity on Pentium III system. Monitoring Tools -- Pmon Developed in our lab. Better controlled. Device driver to read performance counters Low overhead 6

7 Platform Parameters Hardware Uni-processor CPU Frequency: 500MHz L1 I Cache: 16KB, 4-way, 32 Byte/Line, LRU L2 Cache: 512KB, 4-way, 32 Byte/Line, Non-blocking Main Memory: 1GB Software Windows NT Workstation 4.0 Sun JDK1.3.0-C with HotSpot server (build 2.0fcs-E, mixed mode) 7

8 Monitoring Issues Synchronize measurements with client connections to skip starting and shutdown process Add wrapper to the client. The wrapper starts an extra connection immediately before starting the client to trigger measurement. Avoid counter overflow Counting interval: 3sec 8

9 Results Decreasing CPI Hots pot CPI OS has larger CPI OS CPI decreases significantly User CPI sees small fluctuation. 2 Connections USER OS OVERALL 9

10 Results More instructions executed! Regardless of the number of connections, each thread basically does the same thing. Therefore instructions for each connection should remain the same. OS part increases Hotspot Instructions per Connections significantly 4.50E E E E E E E E E E+00 USER OS User part increases slightly Even more execution time is in OS mode due to the larger CPI in OS. One guess: overhead in connection and thread management; some OS algorithm with non-linear complexity (e.g. O(N*logN) ) 10

11 Results Decreasing L1 I-Cache miss ratio 14% 12% 10% 8% 6% 4% 2% Hotspot L1 Icache m isse s per Instruction 0% 0.80% 0.70% 0.60% 0.50% 0.40% 0.30% 0.20% 0.10% 0.00% USER OS OVERALL Hots pot ITLB M is se s pe r Instruction USER OS OVERALL Beneficial interference between threads: They share program codes. The more threads we have, the more likely we context switch to another thread that is executing the same part of the program thus codes in I cache and entries in ITLB are reused. 11

12 Results Decreasing I stalls per instruction Hotspot Istalls Cycles per Instruction As the result of decreasing I- cache miss ratio and ITLB miss ratio, instruction fetching stalls are lowered for both OS part and user part. 0 USER OS OVERALL 12

13 Results Increasing L1 D cache miss ratio 20% 15% 10% 5% Hotspot L1 Dcache m isses pe r Data Re ference OS: Huge increase More thread data, larger data footprint More context switches User: Slight decrease 0% USER OS OVERALL 13

14 Results Significant OS L1 D-cache Misses 6.00E E E E E E E+00 Hots pot Cache M isses per Connection OS D-CACHE USER D-CACHE OS I-CACHE USER I-CACHE With more connections, OS are doing more and incurring more data misses Send & receive network packets Threads scheduling & synchronization 14

15 Results L2 Cache miss ratio Hotspot L2 Cache Miss Ratio 16% 14% 12% 10% 8% 6% 4% 2% 0% USER OS OVERALL 15

16 Results Branches 25% 24% 23% 22% 21% 20% Hotspot Branch Fre quency More branches in OS code More branches are taken May be due to more loops in OS code 19% USER OS OVERALL 80% Hotspot Branch Taken Ratio 75% 70% 65% 60% 55% 50% USER OS OVERALL 16

17 Results More accurate branch predictions 20% 15% 10% 5% Hotspot Branch M is predict Ratio Branches in loops are easier to predict so we have more accurate branch predictions 0% USER OS OVERALL Hots pot BTB M is s Ratio 70% 60% 50% 40% 30% 20% 10% 0% Due to the beneficial code sharing among threads, BTB miss ratio decreases. USER OS OV ERA LL 17

18 Results More resource stalls Hotspot Resource Stalls per Instruction Lower instruction stalls and better branch prediction result in larger resource stalls May favor a CPU with more resources. USER OS OVERALL 18

19 Conclusions Multi-threading is an excellent approach to support multiple simultaneous client connections. Heavy multithreading is more crucial to Java server applications due to its lack of I/O multiplexing APIs. Thread creation and synchronization as well as network connection management are the responsibility of the operating system. With more concurrent connections, more OS activity is involved in the server execution. Threads usually share program code; thus instruction cache, ITLB and BTB will all benefit when the system context switch from one thread to another thread executing the same part of code. Multithreading also benefits branch predictors. Each thread will incur some code and data overheads especially in operating system mode. Given enough memory resources, the nonlinearly increasing overheads are the biggest impediment to performance scalability. Further tuning of the application and operating system may alleviate this problem. 19

20 Further work needed More complex benchmark needed (eg SPEC jbb2000) to validate the results. We need to distinguish characterization of multi-threaded server applications from that of VolanoMark. Find why much more instructions are executed in OS with more connections and try to reduce them. 20

21 Results on Sparc With Shade (backup slide) Also observed decreasing L1 I-cache miss ratio 4.0% L1 I-Cache M iss Ratio 3.5% 3.0% 2.5% 2.0% 1.5% 1.0% 0.5% 0.0% GREEN NA TIV E 21

22 Results on Sparc With Shade (backup slide) Also observed better branch prediction 16% 14% 12% 10% 8% 6% 4% 2% 0% Branch M ispredict Rate GREEN NA TIV E 22

Today s demanding Internet applications typically

Today s demanding Internet applications typically COVER FEATURE Benchmarking Internet Servers on Superscalar Machines The authors compared three popular Internet server benchmarks with a suite of CPU-intensive benchmarks to evaluate the impact of front-end