A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures


1 A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures W.M. Roshan Weerasuriya and D.N. Ranasinghe University of Colombo School of Computing

2 A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures Overview/Motivation of this research: a performance-oriented classification of different software workload types (categories/environments) on multi-core multi-processor NUMA and UMA architectures. Server processors evaluated: 'dual-processor dual-core AMD Opteron 2220 SE' and 'single-processor quad-core Intel Xeon E5506'.

3 Software program workload types/categories: memory intensive applications, processing intensive, memory and processing intensive (matrix multiplication), system call intensive applications (file reading and writing, socket based), message passing/middleware based, and thread based.

4 Target Audience End users who wish to set up server-based systems using commodity multi-core multi-processor hardware. Hardware engineers, to notice and understand that their processor designs perform well on some application categories but at the same time do not perform well on others. Compiler developers, to understand that, for a given processor architecture, the performance of their compiler-generated code also depends on the category/type/domain of the application being written.

5 Related Work In their paper [2], Haoqiang Jin, Robert Hood, Johnny Chang et al. analyze the effect of contention for shared resources (in the memory hierarchy) on the performance of applications running on multi-core based server processors. The authors consider computational ability and analyze the performance of computationally intensive benchmarks. My work differs in that I analyze this with respect to different application domains such as computationally intensive, memory intensive, system call intensive, and middleware execution intensive (i.e. execution performance of enterprise middleware).

7 Related Work cont... There are several on-line reports which compare the performance of processors from both companies [15, 16, 17, 18]. However, most of these reports simply present performance metrics such as execution time and throughput. We go beyond this: in addition to execution time and throughput, we also analyze performance metrics such as level 1 cache miss count, level 2 cache miss count, translation lookaside buffer (TLB) misses, conditional branch instructions mis-predicted, cycles stalled on any resource, total wall clock cycles, and total wall clock time.

8 Related Work cont... How our work differs from the above: all of the above studies lack an in-depth, performance-based classification of application categories on a multi-core architecture alone vs. a hybrid multi-core multi-processor architecture (i.e. a hybrid heterogeneous architectural comparison) with detailed hardware performance monitoring statistics. The work done in my research addresses this analysis.

9 Why is this work different from existing work Comparing the performance of a multi-core system against an equivalent hybrid multi-core multi-processor system. A categorization of generic application domains/types according to how well they perform on a multi-core system against an equivalent hybrid multi-core multi-processor system. Coming up with a simple set of generic benchmarks which could be used to evaluate hybrid heterogeneous systems, with detailed hardware performance monitoring counter statistics generation.

10 Evaluated Architectures Performance evaluation of a multi-processor multi-core architecture vs. an equivalent-capacity single-processor multi-core architecture. The 1st server processor architecture is a dual-processor dual-core architecture (hence a multi-processor multi-core architecture). The 2nd server processor architecture is a single-processor quad-core architecture (hence a single-processor multi-core architecture).

11 Evaluated Architectures
                 Dual-Core AMD Opteron                       Quad-Core Intel Xeon
Speed (GHz)      2.8                                         2.13
L1 Cache (KB)    64 data + 64 instruction = 128, per core    32 data + 32 instruction = 64, per core
L2 Cache         1 MB per core                               256 KB per core
L3 Cache         none                                        4 MB (Intel Smart Cache, shared)

12 CACHE PARAMETERS OF DUAL-CORE AMD OPTERON(TM) 2220 SE PROCESSOR
Level                     Capacity           Associativity (ways)   Line Size (bytes)   Access Latency (clocks)
First Level Data          64 KB (per core)
First Level Instruction   64 KB (per core)   2                      64                  N/A
Second Level              1 MB (per core)
Third Level               No level 3 cache
The L1 to L2 relationship is exclusive (i.e. the L1 content is not repeated in L2).

13 CACHE PARAMETERS OF INTEL XEON E5506 PROCESSOR
Level                     Capacity                          Associativity (ways)   Line Size (bytes)   Access Latency (clocks)
First Level Data          32 KB (per core)
First Level Instruction   32 KB (per core)                  4                       N/A                 N/A
Second Level              256 KB (per core)
Third Level               4 MB (shared among all 4 cores)
The Xeon L1 to L2 to L3 relationship is inclusive (i.e. the L1/L2 content must be present in L3).

14 INTEGRATED MEMORY CONTROLLER
                         Dual-Core AMD Opteron 2220 SE   Quad Core Intel Xeon E5506
Memory access channels   Dual channel, 128-bit wide      6-channel
Memory support           333 MHz DDR memory              800 MHz DDR memory
Peak memory bandwidth    5.3 GB/s                        up to 25.6 GB/s

15 Methodology and Implementation Custom Benchmarks:
memory intensive testing program (Test 1)
computational intensive testing program (Test 2)
memory plus computational intensive testing program (Test 3)
system call intensive testing programs: socket based testing program (Test 4), file reading/writing testing program (Test 5)
thread based, threading intensive testing program (Test 6)
middleware based testing program (Test 7)

16 CODE SEGMENTS FROM TEST 1

// Simple memory access application
#define ROWS 1        // NOTE: double data type size is 8 bytes, so 4096 doubles fit into a 32 KB cache
#define COLUMNS 4096

double matrix_a[ROWS][COLUMNS];

for (x = tmpstartcolumn; x < tmpendcolumn; x++) {
    int y = matrix_a[0][x];
    y++;
    matrix_a[0][x] = y;
}

17 CODE SEGMENTS FROM TEST 2

// Simple memory access application plus some further processing overhead
// (i.e. also utilizing the microprocessor more than Test 1)
for (x = tmpstartcolumn; x < tmpendcolumn; x++) {
    double y = matrix_a[0][x];
    // trying to exercise/use the microprocessor further
    int z, z2;
    for (z = 0; z < num_of_inner_loops; z++) {
        y = ((y + y * z) / 2) * 1.1;
        for (z2 = 0; z2 < 10; z2++) {
            if (((int)y) % 2 == 0)
                y = ((y * z) / 2) * ;   // multiplier constant missing in the source
            else
                y = ((y * z) / 2) * ;   // multiplier constant missing in the source
        }
    }
    matrix_a[0][x] = y;
}
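The slides do not reproduce the code for Test 3 (the memory plus computational intensive matrix multiplication). As a rough illustration of the kind of workload described later (a 500 x 500 multiplication split across four threads), a sketch like the following could be used; all names, the row partitioning, and the initialization are assumptions, not the authors' benchmark code.

/* Hypothetical sketch of a Test 3 style workload: naive 500 x 500 matrix
 * multiplication, rows partitioned across 4 pthreads (compile with -pthread). */
#include <pthread.h>
#include <stdio.h>

#define N        500
#define NTHREADS 4

static double a[N][N], b[N][N], c[N][N];

static void *worker(void *arg)
{
    int t     = (int)(long)arg;
    int first = t * (N / NTHREADS);
    int last  = (t == NTHREADS - 1) ? N : first + (N / NTHREADS);

    /* each thread multiplies its own band of rows */
    for (int i = first; i < last; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = (double)(i + j);
            b[i][j] = (double)(i - j);
        }

    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)t);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    printf("c[0][0]=%f c[N-1][N-1]=%f\n", c[0][0], c[N - 1][N - 1]);
    return 0;
}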

18 Motivation/Objective Analyze and describe the impact of processor performance parameters on the performance of different application domains, and analyze how this behavior varies among server processor architectures. We analyze how those processor performance parameters vary across the different application domains.

19 processor performance parameters Resource Stalls Branch Mis-predictions Translation Lookaside Buffer (TLB) Misses Cache Misses

20 Why Intel VTune and AMD CodeAnalyst were not used Intel VTune runs only on Intel processors and AMD CodeAnalyst runs only on AMD processors. Since my experiment architectures include both Intel and AMD processors, I would have to use both of the above tools, and I would then lose the uniformity of my results, as these are two independent tools from two different processor vendors. Therefore I decided to use the Performance Application Programming Interface (PAPI) library.

21 PAPI hardware performance monitoring counters L1 data cache misses, L1 instruction cache misses, L2 cache misses, data translation lookaside buffer misses (DTLB), total translation lookaside buffer misses (TLB), conditional branch instructions mispredicted, cycles stalled on any resource

22 PAPI events and methods used PAPI_L1_DCM: L1 data cache misses PAPI_L1_ICM: L1 instruction Cache misses PAPI_L2_TCM: L2 total cache misses PAPI_TLB_DM: Data translation lookaside buffer misses PAPI_TLB_TL: Total translation lookaside buffer misses

23 PAPI events and methods used PAPI_BR_MSP: Conditional branch instructions mispredicted PAPI_RES_STL: Cycles stalled on any resource PAPI_get_real_cyc(): Total wall clock cycles PAPI_get_real_usec(): Total wall clock time
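A minimal sketch of how these PAPI events and timing calls are typically combined around a benchmark region (assuming PAPI is installed and that the chosen events are available and can be counted together on the target processor; error handling and the actual kernel are omitted):

/* Minimal PAPI instrumentation sketch, not the thesis benchmark code. */
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void)
{
    int events[3] = { PAPI_L1_DCM, PAPI_L2_TCM, PAPI_BR_MSP };
    long long counts[3];
    int eventset = PAPI_NULL;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);
    PAPI_create_eventset(&eventset);
    PAPI_add_events(eventset, events, 3);

    long long cyc0  = PAPI_get_real_cyc();   /* total wall clock cycles  */
    long long usec0 = PAPI_get_real_usec();  /* total wall clock time    */
    PAPI_start(eventset);

    /* ... benchmark kernel under test goes here ... */

    PAPI_stop(eventset, counts);
    printf("L1 DCM=%lld L2 TCM=%lld BR MSP=%lld cycles=%lld usec=%lld\n",
           counts[0], counts[1], counts[2],
           PAPI_get_real_cyc() - cyc0, PAPI_get_real_usec() - usec0);
    return 0;
}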

24 RESULTS, EVALUATION, ANALYSIS AND DISCUSSION

25 STATISTICS OBTAINED FOR TEST 1 Table of L1 cache misses, L2 cache misses, TLB misses, and conditional branch instructions mis-predicted for the Opteron and Xeon at array sizes of 32 KB, 64 KB, 1 MB, 4 MB, and 5 MB (numeric values not recovered from the slide).

26 Better Processor for Respective Application Domain
Application Domain                                         In terms of Total Wall Clock Cycles   In terms of Total Wall Clock Time (Effective Performance)
processor intensive                                        Xeon                                  Opteron
memory and processing intensive (matrix multiplication)    Xeon                                  Xeon
system call intensive: socket based                        Xeon                                  Xeon
system call intensive: file reading and writing based      Xeon                                  Xeon
thread based                                               Xeon                                  Opteron/Xeon
middleware based                                           ---                                   Xeon

27 RESULTS, EVALUATION, ANALYSIS AND DISCUSSION cont Overall Analysis In Detail

28 Computational intensive performance analysis of the two processors

29 Computational intensive performance analysis cont... Charts: cycles stalled on any resource and total wall clock time in seconds for all threads, Opteron vs. Xeon, for workloads of 0.6 MB, 2.4 MB, 4.8 MB, and 6 MB RAM at 100% processor utilization (iteration counts not recovered from the slide).

30 Computational Intensive cont... Why has the older Opteron performed better? Number of pipeline stages: Opteron: 12 for integer, 17 for floating-point; Xeon: 14 for both integer and floating-point.

31 Computational Intensive cont... Why has the older Opteron performed better? cont... The Opteron has the higher number of pipeline stages for floating-point. The higher the number of pipeline stages, the less logic delay there is per stage, hence the processor clock frequency can be increased. The Opteron processor runs at 2.8 GHz and the Xeon at 2.13 GHz. This makes the Opteron's execution engine perform faster and give better throughput than the Xeon.
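The textbook pipelining relation behind this argument (standard reasoning, not taken from the slides): splitting a total combinational logic delay over N stages, each with a fixed latch/setup overhead, gives roughly $T_{cycle} \approx T_{logic}/N + T_{latch}$, so $f_{max} = 1/T_{cycle}$ rises as N grows. This is why a deeper pipeline permits a higher clock frequency, at the cost of a larger penalty each time the pipeline is flushed on a branch mis-prediction.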

32 Computational Intensive cont... Why has the older Opteron performed better? cont... The Opteron server used is a dual-core dual-processor machine, while the Xeon server used is a quad-core single-processor machine. Since in the Xeon all four cores are manufactured on a single die, the processor clock frequency is also constrained by the compressed die area. The more the die area is constrained for a single core, the more the single-core clock frequency has to be reduced. Hence the Xeon clock frequency is 2.13 GHz while the Opteron runs at 2.8 GHz.

33 Computational Intensive cont... Why has the older Opteron performed better? cont... Also, the Opteron's integer pipeline can fetch and decode floating-point instructions as well. This also accelerates the Opteron's execution engine and boosts throughput.

34 Computational intensive performance analysis cont...
Micro-architecture Parameters                                 AMD Opteron 2220 SE                                                                            Intel Xeon E5506
Super-scalar width                                            8-way super-scalar processor                                                                   6-way super-scalar processor
Number of pipeline stages                                     12 for integer, 17 for floating-point                                                          14
Scheduler can issue/dispatch up to how many micro-ops/cycle   11 micro-ops (the schedulers and the load/store unit can dispatch); 3 micro-ops to the instruction control unit   6 micro-ops

35 So now what is next? Will this be the same for other application domains?

36 Better Processor for Respective Application Domain
Application Domain                                         In terms of Total Wall Clock Cycles   In terms of Total Wall Clock Time (Effective Performance)
processor intensive                                        Xeon                                  Opteron
memory and processing intensive (matrix multiplication)    Xeon                                  Xeon
system call intensive: socket based                        Xeon                                  Xeon
system call intensive: file reading and writing based      Xeon                                  Xeon
thread based                                               Xeon                                  Opteron/Xeon
middleware based                                           ---                                   Xeon

37 Significance of Integrated Memory Controller in overruling microprocessor performance Computational intensive benchmark. Charts: level 2 cache misses, total translation lookaside buffer misses (TLB), cycles stalled on any resource, and total wall clock time in seconds for all threads, plotted against the number of loops/iterations, Opteron vs. Xeon.

38 Significance of IMC in overruling microprocessor performance cont. Computational and largely memory intensive benchmark (500 x 500 and 1000 x 1000 matrix multiplication with extra computation performed), 25 MB - 93 MB RAM utilization. Charts: level 2 cache misses, total translation lookaside buffer misses (TLB), cycles stalled on any resource, and total wall clock time in seconds for all threads, for the 500 x 500 and 1000 x 1000 matrix sizes, Opteron vs. Xeon.

39 Significance of IMC in overruling microprocessor performance cont.
Memory Access: Load and Store Operation Enhancements
                                       AMD Opteron 2220 SE          Intel Xeon E5506
Peak issue rate (operations/cycle)     Two 64-bit loads or stores   One 128-bit load and one 128-bit store
Load/store queue                       44-entry                     Deeper buffers for load and store operations: 48 load buffers, 32 store buffers, 10 fill buffers

40 Significance of IMC in overruling microprocessor performance cont.
Integrated Memory Controller
                         AMD Opteron 2220 SE          Intel Xeon E5506
Memory access channels   Dual channel, 128-bit wide   6-channel
Memory support           333 MHz DDR memory           800 MHz DDR memory
Peak memory bandwidth    5.3 GB/s                     up to 19.2 GB/s

41 Significance of TLB Unit in overruling microprocessor performance?

42 Significance of TLB Unit in overruling microprocessor performance The Xeon TLB unit is ahead of the Opteron's, hence in general the TLB statistics of the Xeon should be better than those of the Opteron. But from our results we observed the following points: 1st Point - the Xeon Instruction TLB (ITLB) statistics are poor compared to the Opteron. 2nd Point - when the memory access workload increases, the Xeon TLB statistics become poor compared to the Opteron. 3rd Point - the TLB miss penalty is overruled by a better IMC.
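As an illustration of the second point (a sketch under stated assumptions, not the authors' benchmark): touching only one element per 4 KB page over a working set much larger than the TLB reach makes almost every access a TLB miss, so growing the working set quickly exposes the TLB behaviour of each processor. The buffer size and repetition count below are arbitrary.

/* Illustrative page-strided walk that stresses the data TLB. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_BYTES  4096
#define WORKING_SET (256UL * 1024 * 1024)    /* 256 MB, far beyond the TLB reach */

int main(void)
{
    size_t pages = WORKING_SET / PAGE_BYTES;
    unsigned char *buf = malloc(WORKING_SET);
    unsigned long sum = 0;

    if (!buf)
        return 1;
    memset(buf, 1, WORKING_SET);             /* make sure all pages are mapped */

    for (int rep = 0; rep < 10; rep++)
        for (size_t p = 0; p < pages; p++)
            sum += buf[p * PAGE_BYTES];      /* one access per 4 KB page */

    printf("%lu\n", sum);
    free(buf);
    return 0;
}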

43 Significance of TLB Unit in overruling microprocessor performance cont...
Translation Lookaside Buffers (TLB) of the Microprocessors

Dual-Core AMD Opteron 2220 SE:
  Number of TLB levels: 2
  L1 Data TLB for 4-KByte pages: 32 entries
  L1 Data TLB for large pages (2MB/4MB): 8 entries
  L1 Instruction TLB for 4-KByte pages: 32 entries
  L1 Instruction TLB for large pages (2MB/4MB): 8 entries
  L1 TLB associativity (ways): fully associative
  L2 TLB for 4-KByte pages: 512 entries
  L2 TLB associativity (ways): 4

Quad-Core Intel Xeon E5506:
  Number of TLB levels: 2
  DTLB0 for 4-KByte pages: 64 entries
  DTLB0 for large pages (2MB/4MB): 32 entries
  ITLB for 4-KByte pages: 64 entries
  ITLB for large pages: 7 entries
  DTLB0 / ITLB associativity (ways): 4
  STLB for 4-KByte pages: 512 entries (services both data and instruction look-ups)
  STLB associativity (ways): 4
  A DTLB0 miss with an STLB hit causes a penalty of 7 cycles
  The delays associated with a miss to the STLB and the page miss handler (PMH) are largely non-blocking

44 Significance of TLB Unit in overruling microprocessor performance cont... Computational intensive benchmark (100% processor, 6 MB RAM). Charts: data translation lookaside buffer misses (DTLB), instruction translation lookaside buffer misses (ITLB), total TLB misses, and total wall clock time in seconds for all threads, Opteron vs. Xeon.

45 Significance of TLB Unit in overruling microprocessor performance cont... Computational and largely memory intensive benchmark (500 x 500 matrix multiplication), 25 MB RAM utilization. Charts (per thread, Threads 1-4): 500x500 data translation lookaside buffer misses (DTLB), instruction translation lookaside buffer misses (ITLB), total TLB misses, and total wall clock time in seconds for all threads, Opteron vs. Xeon.

46 Significance of TLB Unit in overruling microprocessor performance cont... Computational and largely memory intensive benchmark (1000 x 1000 matrix multiplication), 93 MB RAM utilization. Charts (per thread, Threads 1-4): 1000x1000 data translation lookaside buffer misses (DTLB), instruction translation lookaside buffer misses (ITLB), total TLB misses, and total wall clock time in seconds for all threads, Opteron vs. Xeon.

47 Significance of Branch Prediction Unit in overruling microprocessor performance?

48 Significance of Branch Prediction Unit in overruling microprocessor performance From the hardware performance counter statistics obtained for the Branch Prediction Units of the two server processors, we observed the following two points: 1st Point - the Xeon Branch Prediction Unit is optimized and works better when the source code actually contains branching 'if' instructions, but it does not work as well when the source code contains no 'if' instructions at all (only 'for' loops). 2nd Point - branch mis-predictions have little (or no) impact on the final application execution throughput.
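To make the distinction concrete, here is a pair of toy kernels of the kinds being contrasted (illustrative only, not the thesis benchmarks): the first contains data-dependent 'if' branches, the second only 'for' loop control; the xorshift generator and constants are arbitrary choices.

/* Two toy kernels: one branchy, one with only loop-closing branches. */
#include <stdio.h>

#define N 1000000

static unsigned rng = 12345u;

static unsigned next_rand(void)              /* xorshift: hard-to-predict bits */
{
    rng ^= rng << 13;
    rng ^= rng >> 17;
    rng ^= rng << 5;
    return rng;
}

static double with_if(void)                  /* contains branching 'if' instructions */
{
    double y = 1.0;
    for (int i = 0; i < N; i++) {
        if (next_rand() & 1)
            y *= 1.0001;
        else
            y *= 0.9999;
    }
    return y;
}

static double loops_only(void)               /* only 'for' loops, no 'if' at all */
{
    double y = 1.0;
    for (int i = 0; i < N; i++)
        y *= 1.0001;
    return y;
}

int main(void)
{
    printf("%f %f\n", with_if(), loops_only());
    return 0;
}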

49 Significance of Branch Prediction Unit in overruling microprocessor performance cont... Proving the above 1st point. Charts of conditional branch instructions mis-predicted, Opteron vs. Xeon, for: the system call intensive benchmark (source code contains branching 'if' instructions, plotted against request count), the computational intensive benchmark (contains 'if' instructions), and the computational and largely memory intensive benchmark (500 x 500 matrix multiplication with extra computation performed, contains 'if' instructions; per thread, Threads 1-4).

50 Significance of Branch Prediction Unit in overruling microprocessor performance cont... Proving the above 1st point. Charts of conditional branch instructions mis-predicted, Opteron vs. Xeon, for: the simple memory access intensive benchmark (no 'if' instructions, only 'for' loops; array sizes 32KB, 64KB, 1MB, 4MB, 5MB, 6MB, 8MB, 10MB, 12MB) and the computational and largely memory intensive benchmark (500 x 500 matrix multiplication, no 'if' instructions; per thread, Threads 1-4).

51 Significance of Branch Prediction Unit in overruling microprocessor performance cont... Proving the above 2nd point. Computational intensive benchmark (source code contains branching 'if' instructions). Charts: conditional branch instructions mis-predicted and total wall clock time in seconds for all threads, Opteron vs. Xeon.

52 Significance of Branch Prediction Unit in overruling microprocessor performance cont... Proving the above 2nd point cont... Computational and largely memory intensive benchmark (500 x 500 matrix multiplication; no 'if' instructions, only 'for' loops). Charts: conditional branch instructions mis-predicted (per thread, Threads 1-4) and total wall clock time in seconds for all threads, Opteron vs. Xeon.

53 Significance of Branch Prediction Unit in overruling microprocessor performance cont... Proving the above 2nd point cont... Computational and largely memory intensive benchmark (1000 x 1000 matrix multiplication; no 'if' instructions, only 'for' loops). Charts: conditional branch instructions mis-predicted (per thread, Threads 1-4) and total wall clock time in seconds for all threads, Opteron vs. Xeon.

54 Drawbacks / Limitations / Future work We compared only L1 and L2 cache misses, but the Xeon processor has an L3 cache, which has obvious advantages in memory access latency compared with the external memory accesses the Opteron must make on an L2 cache miss. Also, cache sizes and latencies are considered in this work but associativity is not; it is another parameter to look into. For all the performance metrics presented in this work there is a need for further explanation, reasoning, and justification, requiring further architectural analysis of why one processor outperforms the other, i.e. to relate the results to the architectural features of the microprocessors.

55 Drawbacks / Limitations / Future work One could argue that industry-standard benchmarks (such as SPEC2006) should be used rather than hand-written ones, since the former are accepted by the industry and the research community, or that at least an explanation is needed of why my hand-written benchmarks are preferable. It is possible that different benchmarks would draw different conclusions about the two processors; hence it is expected that the benchmarks themselves are accepted as credible.

56 Drawbacks / Limitations / Future work Most significantly, this work claims that single-processor multi-core architectures are better than multi-processor multi-core architectures based only on the small set of experiments performed in this research. This conclusion should be stated as being limited to my hand-written benchmarks.

57 Conclusion & Contribution The findings of this work show that in most application domains the 'single-processor quad-core Intel Xeon' gives better performance statistics than the 'dual-processor dual-core AMD Opteron', in terms of both the total wall clock cycles taken for execution and the total wall clock time taken for execution (effective performance). The performance of the evaluated single-processor quad-core UMA architecture is better than that of the equivalent dual-processor dual-core NUMA architecture. Hence I conclude that the evaluated single-processor multi-core server architecture performs better than the equivalent multi-processor multi-core server architecture in handling different workloads.

58 Conclusion & Contribution Coming up with a simple set of generic benchmarks which could be used to evaluate hybrid heterogeneous systems, with detailed hardware performance monitoring counter statistics generation. This research and its published results are of great benefit to server processor architects and designers, as we have provided a detailed set of performance metrics and statistics, together with a detailed analysis of the performance of different software domains on these server processor architectures. Using the published results, the hand-written custom benchmarks, and the detailed analysis given, processor architects and designers can revisit and re-evaluate their processor architectures and re-analyze their designs to see why they perform well on some application domains and not on others.

59 Paper Publications Done
W.M.R. Weerasuriya and D.N. Ranasinghe, Older Opteron Outperforms the Newer Xeon: A Memory Intensive Application Study of Server Based Microprocessors, 21st International Conference on Systems Engineering (ICSEng2011), Las Vegas, NV, USA.
W.M.R. Weerasuriya and D.N. Ranasinghe, A Comparative Performance Evaluation of Multi-Processor Multi-Core Server Processor Architectures on Enterprise Middleware Performance, 3rd APSIPA ASC 2011, Xi'an, China.
W.M.R. Weerasuriya and D.N. Ranasinghe, Performance Analysis of System Call Intensive Software Application Execution on Server Processor Architectures: Opteron and Xeon, 2nd International Conference on Emerging Trends in Engineering and Technology (IETET-2011), Kurukshetra (Haryana), India.

60 Thank you
