A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures W.M. Roshan Weerasuriya and D.N. Ranasinghe University of Colombo School of Computing

Overview / Motivation of this research
A performance-oriented classification of different software workload types (categories/environments) on multi-core, multi-processor NUMA and UMA architectures.
Server processors evaluated:
- 'dual processor dual core AMD Opteron 2220 SE'
- 'single processor quad core Intel Xeon E5506'

Software program workload types / categories:
- memory intensive applications
- processing intensive
- memory and processing intensive (matrix multiplication)
- system call intensive applications: file reading and writing, socket based
- message passing middleware based
- thread based

Target Audience
- End users who wish to set up server-based systems using commodity multi-core, multi-processor hardware
- Hardware engineers, to notice and understand that their processor designs perform well on some application categories but, at the same time, do not perform well on others
- Compiler developers, to understand that, for a given processor architecture, the performance of their compiler-generated code also depends on the category/type/domain of the application being written

Related Work
In their paper [2], Haoqiang Jin, Robert Hood, Johnny Chang et al. analyze the effect of contention for shared resources (in the memory hierarchy) on the performance of applications running on multi-core based server processors. The authors consider computational ability and analyze the performance of executing computationally intensive benchmarks. My work is different in that I try to analyze this with respect to different application domains, such as computational intensive, memory intensive, system call intensive and middleware execution intensive (i.e. execution performance of enterprise middleware).

Related Work cont...
There are several on-line reports which compare the performance of processors from both companies [15, 16, 17, 18]. However, most of these reports simply present performance metrics such as execution time and throughput. We go beyond this: we not only present execution time and throughput, but also analyze performance metrics such as:
- level 1 cache miss count
- level 2 cache miss count
- Translation Lookaside Buffer misses
- conditional branch instructions mis-predicted
- cycles stalled on any resource
- total wall clock cycles
- total wall clock time

Related Work cont... How is our work different from those above?
All the above studies lack an in-depth, performance-based classification of application categories on a multi-core architecture alone vs. a hybrid multi-core multi-processor architecture (i.e. a hybrid heterogeneous architectural comparison) with detailed hardware performance monitoring statistics. The work done in my research addresses this analysis.

Why is this work different from existing work
- Comparing the performance of a multi-core system against an equivalent hybrid multi-core multi-processor combined system.
- A categorization of generic application domains/types according to how well they perform on a multi-core system against an equivalent hybrid multi-core multi-processor system.
- Coming up with a simple set of generic benchmarks which could be used to evaluate hybrid heterogeneous systems, with detailed hardware performance monitoring counter statistics generation.

Evaluated Architectures
Performance evaluation of a multi-processor multi-core architecture vs. an equivalent-capacity single-processor multi-core architecture.
- The 1st server processor architecture is a dual-processor dual-core architecture (hence a multi-processor multi-core architecture).
- The 2nd server processor architecture is a single-processor quad-core architecture (hence a single-processor multi-core architecture).

Evaluated Architectures
Dual-Core AMD Opteron: speed 2.8 GHz; L1 cache 128 KB per core (64 KB data + 64 KB instruction); L2 cache 1 MB per core; no L3 cache.
Quad-Core Intel Xeon: speed 2.13 GHz; L1 cache 64 KB per core (32 KB data + 32 KB instruction); L2 cache 256 KB per core; L3 cache 4 MB (Intel Smart Cache, shared).

CACHE PARAMETERS OF DUAL-CORE AMD OPTERON(TM) 2220 SE PROCESSOR
Level: Capacity / Associativity (ways) / Line Size (bytes) / Access Latency (clocks)
First Level Data: 64 KB (per core) / 2 / 64 / 3
First Level Instruction: 64 KB (per core) / 2 / 64 / N/A
Second Level: 1 MB (per core) / 16 / 64 / 11
Third Level: no level 3 cache
The L1 to L2 relationship is exclusive (i.e. the L1 content is not repeated in L2).

The Xeon L1 to L2 to L3 relationship is inclusive (i.e. the L1/L2 content must be present in L3).
CACHE PARAMETERS OF INTEL XEON E5506 PROCESSOR
Level: Capacity / Associativity (ways) / Line Size (bytes) / Access Latency (clocks)
First Level Data: 32 KB (per core) / 8 / 64 / 4
First Level Instruction: 32 KB (per core) / 4 / N/A / N/A
Second Level: 256 KB (per core) / 8 / 64 / 10
Third Level: 4 MB (shared among all 4 cores) / 16 / 64 / 35-40+

INTEGRATED MEMORY CONTROLLER
Dual-Core AMD Opteron 2220 SE: dual channel, 128-bit wide; 333 MHz DDR memory; peak memory bandwidth 5.3 GB/s.
Quad Core Intel Xeon E5506: 6-channel; 800 MHz DDR memory; peak memory bandwidth up to 25.6 GB/s.

Methodology and Implementation: Custom Benchmarks
- memory intensive testing program (Test 1)
- computational intensive testing program (Test 2)
- memory plus computational intensive testing program (Test 3)
- system call intensive testing programs:
  - socket based testing program (Test 4)
  - file reading/writing testing program (Test 5)
- thread based, threading intensive testing program (Test 6)
- middleware based testing programs (Test 7)

CODE SEGMENTS FROM THE TEST 1

//Simple memory access application
#define ROWS 1         //NOTE: double datatype size is 8 bytes
#define COLUMNS 4096   //to fit into the 32 KB cache

double matrix_a[ROWS][COLUMNS];

for (x = tmpstartcolumn; x < tmpendcolumn; x++) {
    int y = matrix_a[0][x];
    y++;
    matrix_a[0][x] = y;
}
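The slides show only the 32 KB configuration of Test 1; presumably the larger array sizes reported later (64 KB up to 12 MB) were obtained by scaling COLUMNS. A minimal sketch of that scaling, where ARRAY_KB is a hypothetical parameter that does not appear in the original code:

//Hypothetical parameterization of the Test 1 working-set size (not in the original slides)
#define ARRAY_KB 64                                    //e.g. 32, 64, 1024, 4096, 5120
#define COLUMNS ((ARRAY_KB * 1024) / sizeof(double))   //number of 8-byte elements
double matrix_a[ROWS][COLUMNS];                        //ROWS is 1, as in the code above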

CODE SEGMENTS FROM THE TEST 2

//Simple memory access application plus some further processing overhead
//(i.e. also utilizing the microprocessor more than the Test 1)
for (x = tmpstartcolumn; x < tmpendcolumn; x++) {
    double y = matrix_a[0][x];
    //trying to exercise/use the microprocessor further
    int z, z2;
    for (z = 0; z < num_of_inner_loops; z++) {
        y = ((y + y * z) / 2) * 1.1;
        for (z2 = 0; z2 < 10; z2++) {
            if (((int)y) % 2 == 0)
                y = ((y * z) / 2) * 3.4567;
            else
                y = ((y * z) / 2) * 6.4567;
        }
    }
    matrix_a[0][x] = y;
}
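The memory plus computational intensive program (Test 3, the 500 x 500 and 1000 x 1000 matrix multiplication analysed later) is not reproduced in the slides. A minimal sketch of a pthread-partitioned matrix multiplication of the kind described follows; the matrix dimension N, the 4-thread row partitioning and the initial values are assumptions, not the author's original code.

//Hypothetical sketch of a Test 3 style benchmark: threaded matrix multiplication.
#include <pthread.h>
#include <stdio.h>

#define N 500            //assumed matrix dimension (500 or 1000 in the slides)
#define NUM_THREADS 4    //assumed: one worker per core, matching the 4-thread charts

static double a[N][N], b[N][N], c[N][N];

static void *multiply_rows(void *arg) {
    int t = (int)(long)arg;                   //thread index 0..NUM_THREADS-1
    int rows_per_thread = N / NUM_THREADS;
    int start = t * rows_per_thread;
    int end = (t == NUM_THREADS - 1) ? N : start + rows_per_thread;
    for (int i = start; i < end; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];     //classic triple loop, memory and FP intensive
            c[i][j] = sum;
        }
    return NULL;
}

int main(void) {
    pthread_t tid[NUM_THREADS];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { a[i][j] = i + j; b[i][j] = i - j; }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_create(&tid[t], NULL, multiply_rows, (void *)(long)t);
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);
    printf("c[0][0] = %f\n", c[0][0]);
    return 0;
}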

Motivation / Objective
Analyze and describe the impact of processor performance parameters on the performance of different application domains, and analyze how this behavior varies among server processor architectures. We analyze the varying behavior of those processor performance parameters among different application domains.

Processor performance parameters:
- Resource stalls
- Branch mis-predictions
- Translation Lookaside Buffer (TLB) misses
- Cache misses

Why Intel VTune and AMD CodeAnalyst were not used
Intel VTune runs only on Intel processors and AMD CodeAnalyst runs only on AMD processors. Since my experimental architectures include both Intel and AMD processors, I would have to use both of the above tools, and I would then lose the uniformity of my results, as these are two independent tools from two different processor vendors. Therefore I decided to use the Performance Application Programming Interface (PAPI) library.

PAPI hardware performance monitoring counters
- L1 data cache misses
- L1 instruction cache misses
- L2 cache misses
- Data translation lookaside buffer misses (DTLB)
- Total translation lookaside buffer misses (TLB)
- Conditional branch instructions mispredicted
- Cycles stalled on any resource

PAPI events and methods used
- PAPI_L1_DCM: L1 data cache misses
- PAPI_L1_ICM: L1 instruction cache misses
- PAPI_L2_TCM: L2 total cache misses
- PAPI_TLB_DM: data translation lookaside buffer misses
- PAPI_TLB_TL: total translation lookaside buffer misses

PAPI events and methods used
- PAPI_BR_MSP: conditional branch instructions mispredicted
- PAPI_RES_STL: cycles stalled on any resource
- PAPI_get_real_cyc(): total wall clock cycles
- PAPI_get_real_usec(): total wall clock time
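As an illustration of how such PAPI counters are typically read around a measured region, a minimal sketch follows. This is not the original instrumentation code from the benchmarks; the chosen event pair and the error handling are assumptions.

//Minimal PAPI instrumentation sketch (assumed, not the benchmark source):
//counts L1 data cache misses and mispredicted branches around a measured region.
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void) {
    int event_set = PAPI_NULL;
    long long counts[2];
    long long start_cyc, start_usec;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) exit(1);
    if (PAPI_create_eventset(&event_set) != PAPI_OK) exit(1);
    PAPI_add_event(event_set, PAPI_L1_DCM);   //L1 data cache misses
    PAPI_add_event(event_set, PAPI_BR_MSP);   //mispredicted conditional branches

    start_cyc  = PAPI_get_real_cyc();         //total wall clock cycles
    start_usec = PAPI_get_real_usec();        //total wall clock time (microseconds)
    PAPI_start(event_set);

    /* ... benchmark region under test (e.g. the Test 1 loop) ... */

    PAPI_stop(event_set, counts);
    printf("L1 DCM = %lld, BR MSP = %lld, cycles = %lld, usec = %lld\n",
           counts[0], counts[1],
           PAPI_get_real_cyc() - start_cyc,
           PAPI_get_real_usec() - start_usec);
    return 0;
}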

RESULTS, EVALUATION, ANALYSIS AND DISCUSSION

STATISTICS OBTAINED FOR TEST 1
Processor / Array Size: L1 Cache Misses / L2 Cache Misses / TLB Misses / Conditional Branch Instructions Mis-predicted
Opteron / 32 KB: 659 / 84 / 52 / 125
Xeon / 32 KB: 885 / 724 / 20 / 79
Opteron / 64 KB: 1199 / 93 / 28 / 119
Xeon / 64 KB: 1354 / 1221 / 29 / 86
Opteron / 1 MB: 16721 / 1103 / 384 / 147
Xeon / 1 MB: 17210 / 17735 / 270 / 107
Opteron / 4 MB: 65853 / 25076 / 1100 / 142
Xeon / 4 MB: 66306 / 73309 / 1056 / 1042
Opteron / 5 MB: 82281 / 23608 / 1367 / 153
Xeon / 5 MB: 82657 / 91377 / 1349 / 687

Better Processor for Respective Application Domain
Application Domain: In terms of Total Wall Clock Cycles / In terms of Total Wall Clock Time (Effective Performance)
processor intensive: Xeon / Opteron
memory and processing intensive (matrix multiplication): Xeon / Xeon
system call intensive, socket based: Xeon / Xeon
system call intensive, file reading and writing based: Xeon / Xeon
thread based: Xeon / Opteron/Xeon
middleware based: --- / Xeon

RESULTS, EVALUATION, ANALYSIS AND DISCUSSION cont Overall Analysis In Detail

Computational intensive performance analysis of the two processors

Computational intensive performance analysis cont...
100% processor utilization; 0.6 MB RAM at 120000 iterations, 2.4 MB at 480000, 4.8 MB at 960000 and 6 MB at 1200000 iterations.
[Charts: 'Cycles stalled on any resource' and 'Total Wall clock time in seconds for all Threads' for the Opteron and the Xeon, over iteration counts from 12000 to 1200000.]

Computational Intensive cont... Why has the older Opteron performed better?
Number of pipeline stages:
- Opteron: 12 for integer, 17 for floating-point
- Xeon: 14 for both integer and floating-point

Computational Intensive cont... Why has the older Opteron performed better? cont...
- The Opteron has the higher number of pipeline stages for floating-point.
- The higher the number of pipeline stages, the less work, and hence the shorter the clock cycle time, required per stage, so the processor clock frequency can be increased.
- The Opteron processor runs at 2.8 GHz and the Xeon at 2.13 GHz.
- This makes the Opteron's execution engine perform faster and give better throughput than the Xeon.
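As a rough worked ratio (an illustration only, assuming all other factors are equal), the clock advantage alone is:

    2.8 GHz / 2.13 GHz = approximately 1.31

i.e. each Opteron core completes roughly 31% more clock cycles per second than a Xeon core.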

Computational Intensive cont... Why has the older Opteron performed better? cont...
- The Opteron server used is a dual-core dual-processor machine, while the Xeon server used is a quad-core single-processor machine.
- Since in the Xeon all four cores are manufactured on a single die, the compressed die area also constrains the processor clock frequency.
- The more the die area is constrained for a single core, the more the single-core clock frequency has to be reduced.
- Hence the Xeon clock frequency is 2.13 GHz while the Opteron's is 2.8 GHz.

Computational Intensive cont... Why has the older Opteron performed better? cont...
- The Opteron's integer pipeline can also fetch and decode floating-point instructions.
- This too accelerates the Opteron's execution engine and boosts its throughput.

Computational intensive performance analysis cont...
Micro-architecture Parameters: AMD Opteron 2220 SE / Intel Xeon E5506
No. of pipeline stages: Opteron (8-way super-scalar processor): 12 for integer, 17 for floating-point; Xeon (6-way super-scalar processor): 14
Scheduler can issue/dispatch up to how many micro-ops per cycle: Opteron: 11 micro-ops (the schedulers and the load/store unit can dispatch), 3 micro-ops to the instruction control unit; Xeon: 6 micro-ops

So now what is next? Will this be the same for other application domains?

Better Processor for Respective Application Domain
Application Domain: In terms of Total Wall Clock Cycles / In terms of Total Wall Clock Time (Effective Performance)
processor intensive: Xeon / Opteron
memory and processing intensive (matrix multiplication): Xeon / Xeon
system call intensive, socket based: Xeon / Xeon
system call intensive, file reading and writing based: Xeon / Xeon
thread based: Xeon / Opteron/Xeon
middleware based: --- / Xeon

Significance of Integrated Memory Controller in overruling microprocessor performance
Computational Intensive benchmark.
[Charts: 'Level 2 cache misses', 'Total translation lookaside buffer misses (TLB)', 'Cycles stalled on any resource' and 'Total Wall clock time in seconds for all Threads' for the Opteron and the Xeon, at 120000 and 1200000 loops/iterations.]

Significance of IMC in overruling microprocessor performance cont.
Computational and largely memory intensive benchmark (500 x 500 and 1000 x 1000 matrix multiplication with extra computation performed), 25 MB to 93 MB RAM utilization.
[Charts: 'Level 2 cache misses', 'Total translation lookaside buffer misses (TLB)', 'Cycles stalled on any resource' and 'Total Wall clock time in seconds for all Threads' for the Opteron and the Xeon, for the 500 x 500 and 1000 x 1000 matrix sizes.]

Significance of IMC in overruling microprocessor performance cont.
Memory Access: Load and Store Operation Enhancements
Peak issue rate (operations per cycle): AMD Opteron 2220 SE: two 64-bit loads or stores; Intel Xeon E5506: one 128-bit load and one 128-bit store
Load/store queue: AMD Opteron 2220 SE: 44-entry; Intel Xeon E5506: deeper buffers for load and store operations (48 load buffers, 32 store buffers, 10 fill buffers)

Significance of IMC in overruling microprocessor performance cont.
Integrated Memory Controller
AMD Opteron 2220 SE: dual channel, 128-bit wide; 333 MHz DDR memory; peak memory bandwidth 5.3 GB/s
Intel Xeon E5506: 6-channel; 800 MHz DDR memory; peak memory bandwidth up to 19.2 GB/s

Significance of TLB Unit in overruling microprocessor performance?

Significance of TLB Unit in overruling microprocessor performance
The Xeon TLB unit is ahead of the Opteron's, hence in general the TLB statistics of the Xeon should be better than the Opteron's. But from our results we figured out the following points:
- 1st Point: the Xeon instruction TLB (ITLB) statistics are poor compared to the Opteron.
- 2nd Point: when the memory access workload increases, the Xeon TLB statistics become poor compared to the Opteron.
- 3rd Point: the TLB miss penalty is overruled by a better IMC.

Significance of TLB Unit in overruling microprocessor performance cont...
Translation Lookaside Buffers (TLB) of the Microprocessors

Dual-Core AMD Opteron 2220 SE:
- Number of levels of TLB: 2
- L1 data TLB for 4-KByte pages: 32 entries
- L1 data TLB for large pages (2MB/4MB): 8 entries
- L1 instruction TLB for 4-KByte pages: 32 entries
- L1 instruction TLB for large pages (2MB/4MB): 8 entries
- L1 TLB associativity (ways): fully associative
- L2 TLB for 4-KByte pages: 512 entries
- L2 TLB associativity (ways): 4

Quad-Core Intel Xeon E5506:
- Number of levels of TLB: 2
- DTLB0 for 4-KByte pages: 64 entries
- DTLB0 for large pages (2MB/4MB): 32 entries
- ITLB for 4-KByte pages: 64 entries
- ITLB for large pages: 7 entries
- DTLB0 / ITLB associativity (ways): 4
- STLB for 4-KByte pages: 512 entries (services both data and instruction look-ups)
- STLB associativity (ways): 4
- A DTLB0 miss that hits in the STLB incurs a penalty of 7 cycles
- The delays associated with a miss to the STLB and the Page Miss Handler (PMH) are largely non-blocking
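As a worked illustration (our arithmetic, not from the slides) of the reach of these TLBs for 4-KByte pages:

    Opteron: L1 data TLB 32 entries x 4 KB = 128 KB; L2 TLB 512 entries x 4 KB = 2 MB
    Xeon: DTLB0 64 entries x 4 KB = 256 KB; STLB 512 entries x 4 KB = 2 MB

So for the roughly 25 MB and 93 MB working sets of the matrix multiplication benchmarks shown later, both processors' data TLBs cover only a small fraction of the pages touched.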

Significance of TLB Unit in overruling microprocessor performance cont...
Computational Intensive benchmark, 100% processor utilization, 6 MB RAM.
[Charts: 'Data translation lookaside buffer misses (DTLB)', 'Instruction translation lookaside buffer misses (ITLB)', 'Total TLB misses' and 'Total Wall clock time in seconds for all Threads' for the Opteron and the Xeon, over iteration counts from 12000 to 1200000.]

Significance of TLB Unit in overruling microprocessor performance cont...
Computational and largely memory intensive benchmark (500 x 500 matrix multiplication), 25 MB RAM utilization.
[Charts: per-thread (Threads 1 to 4) 'Data translation lookaside buffer misses (DTLB)', 'Instruction translation lookaside buffer misses (ITLB)' and 'Total translation lookaside buffer misses (TLB)', plus 'Total Wall clock time in seconds for all Threads', for the Opteron and the Xeon.]

Significance of TLB Unit in overruling microprocessor performance cont...
Computational and largely memory intensive benchmark (1000 x 1000 matrix multiplication), 93 MB RAM utilization.
[Charts: per-thread (Threads 1 to 4) 'Data translation lookaside buffer misses (DTLB)', 'Instruction translation lookaside buffer misses (ITLB)' and 'Total translation lookaside buffer misses (TLB)', plus 'Total Wall clock time in seconds for all Threads', for the Opteron and the Xeon.]

Significance of Branch Prediction Unit in overruling microprocessor performance?

Significance of Branch Prediction Unit in overruling microprocessor performance
From the hardware performance counter statistics obtained for the Branch Prediction Units of the two server processors, we figured out the following two points:
- 1st Point: the Xeon Branch Prediction Unit is optimized and works better when the source code actually contains branching 'if' statements (as in the Test 2 code segment shown earlier), but does not work as well when the source code contains no branching 'if' statements, only 'for' loops (as in the Test 1 code segment).
- 2nd Point: branch mis-predictions have little impact (or no impact at all) on the final application execution throughput.

Significance of Branch Prediction Unit in overruling microprocessor performance cont...
Proving the 1st point above.
[Charts: 'Conditional branch instructions mispredicted' for the Opteron and the Xeon, for (a) the system call intensive benchmark (source code contains branching 'if' statements), at 1000 and 10000 requests; (b) the computational intensive benchmark (contains 'if' statements), over 12000 to 1200000 iterations; and (c) the computational and largely memory intensive benchmark (500 x 500 matrix multiplication with extra computation performed, contains 'if' statements), per thread (Threads 1 to 4).]

Significance of Branch Prediction Unit in overruling microprocessor performance cont...
Proving the 1st point above.
[Charts: 'Conditional branch instructions mispredicted' for the Opteron and the Xeon, for (a) the simple memory access intensive benchmark (no 'if' statements, only 'for' loops), over array sizes from 32 KB to 12 MB; and (b) the computational and largely memory intensive benchmark (500 x 500 matrix multiplication, no 'if' statements), per thread (Threads 1 to 4).]

Significance of Branch Prediction Unit in overruling microprocessor performance cont...
Proving the 2nd point above: computational intensive benchmark (source code contains branching 'if' statements).
[Charts: 'Conditional branch instructions mispredicted' and 'Total Wall clock time in seconds for all Threads' for the Opteron and the Xeon, over 12000 to 1200000 iterations.]

Significance of Branch Prediction Unit in overruling microprocessor performance cont...
Proving the 2nd point above cont: computational and largely memory intensive benchmark (500 x 500 matrix multiplication; no 'if' statements, only 'for' loops).
[Charts: per-thread (Threads 1 to 4) 'Conditional branch instructions mispredicted' and 'Total Wall clock time in seconds for all Threads' for the Opteron and the Xeon.]

Significance of Branch Prediction Unit in overruling microprocessor performance cont...
Proving the 2nd point above cont: computational and largely memory intensive benchmark (1000 x 1000 matrix multiplication; no 'if' statements, only 'for' loops).
[Charts: per-thread (Threads 1 to 4) 'Conditional branch instructions mispredicted' and 'Total Wall clock time in seconds for all Threads' for the Opteron and the Xeon.]

Drawbacks / Limitations / Future work
- We compared only L1 and L2 cache misses, but the Xeon processor has an L3 cache, which has obvious advantages in memory access latency compared with the external memory accesses that L2 cache misses incur on the Opteron processor.
- Cache sizes and latencies are considered in this work, but associativity is not; it is another parameter to look into.
- For all the performance metrics presented in this work, further explanation, reasoning and justification are required, demanding further architectural analysis of why one processor outperforms the other, i.e. relating the results to the architectural features of the microprocessors.

Drawbacks / Limitations / Future work
One could say that industry-standard benchmarks (such as SPEC2006) should be used rather than writing the benchmarks, since the former are accepted by the industry and research community, or at least that I should explain why my hand-written ones are better. It is possible that different benchmarks would draw different conclusions about the two processors; hence it is expected that the benchmarks themselves be accepted as credible.

Drawbacks / Limitations / Future work
Most significantly, the work claims that single-processor multi-core architectures are better than multi-processor multi-core architectures, based only on the small set of experiments performed in this research. This conclusion should be stated as limited to my hand-written benchmarks.

Conclusion & Contribution
The findings of this work show that in most application domains the 'single processor quad core Intel Xeon' gives better performance statistics than the 'dual processor dual core AMD Opteron', in terms of both the total wall clock cycles taken for the execution and the total wall clock time taken for the execution (effective performance). The experimented single-processor quad-core UMA architecture performs better than the equivalent dual-processor dual-core NUMA architecture. Hence I conclude that the evaluated single-processor multi-core server architecture performs better than the equivalent multi-processor multi-core server architecture in handling different workloads.

Conclusion & Contribution
- Coming up with a simple set of generic benchmarks which could be used to evaluate hybrid heterogeneous systems, with detailed hardware performance monitoring counter statistics generation.
- This research and its published results are of great benefit to server processor architects and designers, as we have provided a detailed set of performance metrics and statistics and a detailed analysis of the performance of different software domains on these server processor architectures.
- Using the published results, the hand-written custom benchmarks and the detailed analysis given, processor architects and designers could revisit and re-evaluate their processor architectures and re-analyze their designs to see why they perform well on some application domains and not on others.

Paper Publications Done
- W.M.R. Weerasuriya and D.N. Ranasinghe, Older Opteron Outperforms the Newer Xeon: A Memory Intensive Application Study of Server Based Microprocessors, 21st International Conference on Systems Engineering (ICSEng 2011), Las Vegas, NV, USA.
- W.M.R. Weerasuriya and D.N. Ranasinghe, A Comparative Performance Evaluation of Multi-Processor Multi-Core Server Processor Architectures on Enterprise Middleware Performance, 3rd APSIPA ASC 2011, Xi'an, China.
- W.M.R. Weerasuriya and D.N. Ranasinghe, Performance Analysis of System Call Intensive Software Application Execution on Server Processor Architectures: Opteron and Xeon, 2nd International Conference on Emerging Trends in Engineering and Technology (IETET-2011), Kurukshetra (Haryana), India.

Thank you