Detection of Weak Spots in Benchmarks Memory Space by using PCA and CA

Leonardo Electronic Journal of Practices and Technologies, ISSN 1583-1078, Issue 16, January-June 2010, p. 43-52
http://lejpt.academicdirect.org

Detection of Weak Spots in Memory Space by using PCA and CA

Abdul Kareem PARCHUR*, Fazal NOORBASHA and Ram Asaray SINGH
Department of Physics and Electronics, Dr. H. S. Gour University, Sagar, India-470003
* E-mail: kareemskpa@hotmail.com (* Corresponding author: +91-9907048098)

Abstract
This paper identifies the weak spots in the SPEC CPU INT 2006 memory space by using Principal Component Analysis and Cluster Analysis. We used recently published SPEC CPU INT 2006 benchmark scores of AMD Opteron 2000+ and AMD Opteron 8000+ series processors. Four significant principal components (PCs) are retained: PC1 accounts for 72.6% of the variance, while PC2, PC3, and PC4 cover 26.5%, 0.91%, and 0.019% of the variance respectively. The dendrogram is useful for identifying the similarities and dissimilarities between the benchmarks in workload space. These results and analysis can be used by performance engineers, scientists and developers to better understand the benchmark behavior in workload space and to design a benchmark suite that covers the complete workload space.

Keywords
SPEC CPU INT 2006, Principal Component Analysis (PCA), Cluster Analysis (CA), Performance.

Introduction
AMD Opteron 2000+ and AMD Opteron 8000+ series processors are members of a new family of seventh-generation AMD processors designed to meet the computation-intensive requirements of cutting-edge software applications running on high-performance desktop systems, workstations, and servers with the most advanced x86-64 architecture. Technology innovation for the x86 architecture drives today's personal computers. The architecture incorporates new microarchitectural features, including a system bus faster than 200 MHz, a high-performance cache architecture, and enhanced 3DNow! technology. 3DNow! technology is a set of 21 new instructions designed to open up the traditional processing bottlenecks for floating-point-intensive and multimedia applications. It enables faster frame rates on high-resolution scenes, better physical modeling of real-world environments, sharper and more detailed 3D imaging, smoother video playback, and near-theater-quality audio. Future AMD processors, designed to operate at frequencies greater than 3 GHz, should provide even higher-performance implementations of 3DNow! technology [1].

AMD Athlon processors are manufactured on AMD's robust 0.18-micron aluminum process technology and on AMD's leading-edge HiP6L 0.18-micron process technology featuring copper interconnects. The new AMD Athlon processor, with approximately 37 million transistors, has a die size of 120 mm² on 0.18-micron technology [2].

Computer architectural complexity is growing so dramatically that performance analysis has become an important means of taking full advantage of the hardware's computational potential [3]. With CMOS scaling leading to ever-increasing levels of transistor integration on a chip, designers of high-performance embedded processors have ample area available to increase processor resources in order to improve performance [4]. The SPEC CPU2006 benchmark suite contains several programs from different application areas such as Physics, Artificial Intelligence, and Combinatorial Optimization.
The recently released SPEC CPU2006 benchmark suite is expected to be used by computer designers and computer architecture researchers for pre-silicon early design analysis [5]. The accuracy of a processor performance estimate depends on the benchmarks selected for the simulation study. The selected benchmarks should cover a wide spectrum of the application area. Increasing the number of benchmark programs increases the simulation time, while an improper selection of benchmarks may not accurately determine the performance of the processor. The increasing size of the benchmarks makes detailed simulation an extremely time-consuming process [6]. In this paper we detect the weak spots in the benchmarks' memory space by using the SPEC CPU INT 2006 performance scores of AMD Opteron 2000+ and AMD Opteron 8000+ series processors together with PCA and CA techniques. The similarities and dissimilarities between the benchmarks have been identified.

Scope of This Study
Building a high-performance microprocessor presents many reliability challenges. Today we are moving towards the nanotechnology era, and also from a 32-bit processor environment to a 64-bit processor environment. Our study examines the weak spots in different series of AMD processors (AMD Opteron 2000+ and AMD Opteron 8000+ series), which are fabricated to meet the requirements of modern applications. This study helps to build a complete benchmark suite that covers the entire spectrum of the application area and to predict the performance of the processor more accurately. We previously reported the performance prediction of these processors and evaluated the scalability of the Memory Wait Time, which degrades processor performance, by using a simple statistical correlation technique [7]. The present analysis helps performance engineers, scientists and developers to better understand benchmark behavior in workload space and the scalability of performance in modern commercial processors.

Benchmarks are used for the performance evaluation of processors. SPEC, HINT, and TPC are among the most important and popular benchmarks available for performance evaluation. SPEC is a nonprofit corporation formed to establish, maintain, and endorse a standardized set of benchmarks. SPEC's membership includes computer hardware and software vendors, leading universities, and research facilities worldwide. SPEC CPU2006 is designed to provide a comparative measure of compute-intensive performance across a range of hardware. Comprising two suites of benchmarks, SPEC CPU2006 gauges compute-intensive integer performance with CINT2006 and measures floating-point performance with CFP2006.
CINT2006 and CFP2006 results are presented as ratios, which are calculated using a reference time determined by SPEC and the runtime of the benchmark; higher scores indicate better performance [8].
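The ratio-based scoring described above can be illustrated with a short sketch. It assumes the usual SPEC convention that a benchmark's ratio is the reference time divided by the measured run time, and that a suite-level figure is the geometric mean of the individual ratios. The benchmark names and timings below are hypothetical placeholders, not published SPEC results.

```python
import math

def spec_ratio(reference_seconds, measured_seconds):
    """SPEC-style ratio: reference time divided by measured run time."""
    return reference_seconds / measured_seconds

def geometric_mean(values):
    """Geometric mean, computed in log space for numerical stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical (reference, measured) times in seconds; not real SPEC data.
timings = {
    "bench_a": (9000.0, 600.0),
    "bench_b": (8000.0, 400.0),
    "bench_c": (12000.0, 1500.0),
}

# A faster run (smaller measured time) gives a larger ratio.
ratios = {name: spec_ratio(ref, run) for name, (ref, run) in timings.items()}
overall = geometric_mean(list(ratios.values()))

for name, ratio in ratios.items():
    print(f"{name}: ratio = {ratio:.2f}")
print(f"overall (geometric mean) = {overall:.2f}")
```

Because every ratio is normalized by its own reference time, a long-running benchmark does not dominate the overall figure the way it would in an arithmetic mean of raw run times.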

Table 1. The CINT 2006 Suite

S. No  Integer Benchmark  Language  Description
1      [missing]          C++       PERL Programming Language
2      [missing]          C         Data Compression
3      [missing]          C         C Language Optimizing Compiler
4      [missing]          C         Combinatorial Optimization
5      [missing]          C         Artificial Intelligence: Game Playing
6      [missing]          C         Search a Gene Sequence Database
7      [missing]          C         Artificial Intelligence: Chess
8      [missing]          C         Physics / Quantum Computing
9      [missing]          C         Video Compression
10     [missing]          C++       Discrete Event Simulation
11     [missing]          C++       Path Finding Algorithm
12     [missing]          C++       XSLT Processor

Table 2. The CFP2006 Suite

S. No  Floating Point Benchmark  Language      Description
1      410.bwaves                Fortran 77    Computational Fluid Dynamics
2      416.gamess                Fortran       Quantum Chemical Computations
3      433.milc                  C             Physics / Quantum Chromo Dynamics
4      434.zeusmp                Fortran 77    Physics / Magneto Hydro Dynamics
5      435.gromacs               C/Fortran     Chemistry / Molecular Dynamics
6      436.cactusADM             C/Fortran 90  Physics / General Relativity
7      437.leslie3d              Fortran 90    Computational Fluid Dynamics
8      444.namd                  C++           Scientific, Structural Biology, Classical Molecular Dynamics Simulation
9      447.dealII                C++           Solution of Partial Differential Equations using the Adaptive Finite Element Method
10     450.soplex                C++           Simplex Linear Programming Solver
11     453.povray                C++           Computer Visualization / Ray Tracing
12     454.calculix              C/Fortran 90  Structural Mechanics
13     459.GemsFDTD              Fortran 90    Computational Electromagnetics
14     465.tonto                 Fortran 95    Quantum Crystallography
15     470.lbm                   C             Computational Fluid Dynamics
16     481.wrf                   C/Fortran 90  Weather Processing
17     482.sphinx3               C             Speech Recognition

The SPEC CPU2006 suite contains 17 floating-point programs (written in C, C++, and Fortran) and 12 integer programs (8 listed as C and 4 as C++ in Table 1). Table 1 and Table 2 list the benchmarks in the SPEC CPU2006 suite. The SPEC CPU2006 benchmarks replace the SPEC89, SPEC92, SPEC95 and SPEC CPU 2000 benchmarks [8, 9, 10].

Methodology
In this study we use the integer benchmarks from the newly released SPEC CPU2006 suite for the detection of weak spots. Benchmark scores for AMD Opteron 2000+ series processors and AMD Opteron 8000+ series processors were obtained under the same operating conditions. We previously reported the performance scaling in AMD Opteron 2000+ and AMD Opteron 8000+ series processors [7]. Principal Component Analysis and Cluster Analysis are used to identify the weak spots in the workload memory space and to find the similarities and dissimilarities between different benchmarks in that space. We used the commercial statistical software STATISTICA v.7.0 [11] to perform PCA and CA.

Results and Discussion
Using the benchmark scores of AMD Opteron 2000+ series processors and AMD Opteron 8000+ series processors, we obtained the four most significant principal components: the first principal component (PC1) covers 72.6% of the variance, PC2 26.5%, PC3 0.91%, and PC4 0.019%. Among all PCs, the first two principal components give the most important information about benchmark behavior. The eigenvalue scree plot of all principal components (PC1-PC4) is shown in Figure 1. Figure 2 shows the benchmark behavior in the PC1 vs. PC2 memory space. Among all the benchmarks, [benchmark name missing] shows a high deviation in memory space. The C Language Optimizing Compiler and Discrete Event Simulation benchmarks overlap at the top of the memory space, showing high variance. The PERL Programming Language and Video Compression benchmarks, and the Artificial Intelligence: Game Playing and XSLT Processor benchmarks, overlap at the bottom of the memory space. Such overlapping benchmarks only increase the simulation time without providing extra information.
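As a rough illustration of how per-component variance shares like those above are obtained, the following sketch computes principal components from a small score matrix. It is not the STATISTICA procedure used in the paper: it assumes rows are benchmarks and columns are processors, uses the covariance matrix, and finds eigenpairs by power iteration with deflation. The score values and the helper names (`covariance_matrix`, `dominant_eigenpair`, `pca_eigen`) are hypothetical.

```python
import math
import random

def covariance_matrix(rows):
    """Sample covariance matrix of the columns of `rows`."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    d = len(means)
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def mat_vec(matrix, vec):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(a * x for a, x in zip(row, vec)) for row in matrix]

def dominant_eigenpair(matrix, iterations=500, seed=0):
    """Power iteration on a symmetric positive semidefinite matrix."""
    rng = random.Random(seed)
    vec = [rng.uniform(0.5, 1.0) for _ in matrix]
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = mat_vec(matrix, vec)
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    # Rayleigh quotient of the (unit-length) vector gives the eigenvalue.
    value = sum(v * m for v, m in zip(vec, mat_vec(matrix, vec)))
    return value, vec

def pca_eigen(rows):
    """Eigenvalues, eigenvectors, and explained-variance shares, largest first."""
    cov = covariance_matrix(rows)
    d = len(cov)
    total_variance = sum(cov[i][i] for i in range(d))
    values, vectors = [], []
    for _ in range(d):
        value, vec = dominant_eigenpair(cov)
        values.append(value)
        vectors.append(vec)
        # Deflation: remove the component just found before the next pass.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    explained = [v / total_variance for v in values]
    return values, vectors, explained

# Hypothetical scores: 5 benchmarks (rows) on 3 processors (columns).
scores = [
    [10.0, 11.0, 12.0],
    [14.0, 15.0, 17.0],
    [20.0, 22.0, 25.0],
    [9.0, 9.0, 11.0],
    [30.0, 33.0, 35.0],
]

eigenvalues, components, explained = pca_eigen(scores)
for index, share in enumerate(explained, start=1):
    print(f"PC{index}: {100 * share:.3f}% of variance")
```

Power iteration is chosen here only to keep the sketch dependency-free; a numerical library's symmetric eigensolver would be the more usual choice for real score tables.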
These weak spots, where benchmarks overlap, are represented by gray shapes in the PC1 vs. PC2 memory space. Identifying them provides the information needed to build a complete benchmark suite that covers the complete workload space.

Figure 1. Eigenvalue scree plot of all principal components (PC1-PC4), which explain the variance in the workload. Eigenvalues are plotted against the principal components; the labeled variance shares are 72.5993% (PC1), 26.4710% (PC2), 0.9103% (PC3) and 0.0195% (PC4).

Figure 2. SPEC CINT 2006 programs plotted in the PC space using memory access characteristics (PC1 vs. PC2). Weak spots are highlighted through gray shapes.

Figure 3 and Figure 4 show the SPEC CINT 2006 programs plotted in the PC space using memory access characteristics, PC2 vs. PC3 and PC3 vs. PC4 respectively; weak spots are highlighted through gray shapes. Figure 5 represents the variance in the four significant individual principal components. Panels (a)-(d) present the variation of the individual principal component score corresponding to each benchmark. Figure 5(a) shows the most significant variations in the benchmarks; PC1 covers 72.6% of the variation in memory space. The benchmarks with dissimilar behavior are marked by red circles in Figure 5(a).

Figure 3. SPEC CINT 2006 programs plotted in the PC space using memory access characteristics (PC2 vs. PC3). Weak spots are highlighted through gray shapes.

Figure 4. SPEC CINT 2006 programs plotted in the PC space using memory access characteristics (PC3 vs. PC4). Weak spots are highlighted through gray shapes.

Figure 6 shows the dendrogram, which explains the similarities and dissimilarities in the workload space of AMD Opteron 2000+ and AMD Opteron 8000+ series processors. The benchmarks [name missing] and [name missing] are linked with a smaller linkage distance; on the other hand, the Physics / Quantum Computing benchmark shows a long linkage distance. This dendrogram is useful for selecting a benchmark suite for performance evaluation. A line drawn at linkage distance L = 400 selects K = 4 benchmarks, so one can reduce the program execution time. Figure 7 shows the two-way joining results for AMD Opteron 2000+ and AMD Opteron 8000+ series processors and the SPEC CPU INT 2006 benchmarks. The benchmark [name missing] shows a high execution time, which is represented by the 1800 score-point boxes.

Figure 5. Variance in the four significant principal components. Panels (a)-(d) present the variation of the individual principal component score (Principal Component 1 to Principal Component 4) corresponding to each benchmark.

Disclaimer
All the observations and analysis in this paper on SPEC CPU2006int are the authors' opinions and should not be used as official or unofficial guidelines from SPEC in selecting benchmarks for any purpose. This paper only provides guidelines for performance engineers, academic users, scientists and developers to better understand the benchmark suite and to build a complete benchmark suite that covers the entire spectrum of the memory space without weak spots.
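The dendrogram cut discussed in the Results (a line drawn at a chosen linkage distance, with one benchmark kept per resulting cluster) can be sketched as follows. This is an illustration under stated assumptions, not the paper's STATISTICA run: it assumes Euclidean distance with single linkage, and the benchmark names, score vectors, and the rule for picking a representative are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length score vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(cluster_a, cluster_b, vectors):
    """Smallest pairwise distance between members of two clusters."""
    return min(euclidean(vectors[i], vectors[j])
               for i in cluster_a for j in cluster_b)

def cut_dendrogram(vectors, max_distance):
    """Merge clusters until the closest pair is farther apart than max_distance."""
    clusters = [[name] for name in vectors]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = single_linkage(clusters[i], clusters[j], vectors)
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        if best[0] > max_distance:
            break  # every remaining merge would cross the cut line
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

# Hypothetical per-benchmark scores on two processors; not real SPEC data.
vectors = {
    "bench_a": [1000.0, 1100.0],
    "bench_b": [1010.0, 1090.0],
    "bench_c": [1500.0, 1600.0],
    "bench_d": [1520.0, 1580.0],
    "bench_e": [300.0, 350.0],
}

clusters = cut_dendrogram(vectors, max_distance=400.0)
# Hypothetical selection rule: keep the alphabetically first member of each cluster.
representatives = sorted(min(cluster) for cluster in clusters)
print(clusters)
print(representatives)
```

Benchmarks that end up in the same cluster behave alike on the measured processors, so simulating only one representative per cluster shortens the run while still covering each distinct behavior.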

Figure 6. Dendrogram showing the similarities and dissimilarities in the workload space of AMD Opteron 2000+ and AMD Opteron 8000+ series processors. The axis shows the linkage distance (0 to 2000); the cut at L = 400 (K = 4) is marked, and the smaller and long linkage distances are annotated.

Figure 7. Cluster analysis: two-way joining results showing the similarities and dissimilarities in the workload space of AMD Opteron 2000+ and AMD Opteron 8000+ series processors. The processors shown are AMD Opteron 2222, 8220, 8218, 2356 and 8216; the score scale runs from 600 to 1800.

Acknowledgement
The authors would like to thank Prof. D. K. Gautam, Head, Department of Electronics, North Maharashtra University, Jalgaon (M.S.), India, and Prof. Ravi Pandey, Professor and Department Chair, Michigan Tech. University, USA, for many stimulating comments and discussions. One of the authors (A.K.P.) gratefully acknowledges the financial support of UGC through a meritorious research fellowship.

References
1. Oberman S., Favor G., Weber F., AMD 3DNow! Technology: architecture and implementations, IEEE Micro, 1999, 19, p. 37-48.
2. AMD x86-64 Architecture Manuals [online] [accessed on August, 2009]. Available at: http://www.amd.com.
3. Xue Y., Zhao C., Automated Phase-Ordering of Loop Optimizations Based on Polyhedron Model, Proc. 10th IEEE International Conference on High Performance Computing and Communications HPCC '08, 2008, p. 672-677.
4. Homayoun H., Pasricha S., Makhzar M., Veidenbaum A., Dynamic register file resizing and frequency scaling to improve embedded processor performance and energy-delay efficiency, Proc. 45th ACM/IEEE Design Automation Conference DAC 2008, 2008, p. 68-71.
5. Phansalkar A., Joshi A., John L. K., Analysis of Redundancy and Application Balance in the SPEC CPU2006 Benchmark Suite, ISCA '07, June 9-13, 2007.
6. Nair A., John L., Simulation points for SPEC CPU 2006, Proc. IEEE International Conference on Computer Design ICCD 2008, 2008, p. 397-403.
7. Abdul Kareem P., Singh R. A., Performance Scaling of Individual SPEC INT 2006 Results for AMD Processors, Leonardo Electronic Journal of Practices and Technologies, 2009, 14, p. 65-72.
8. SPEC CPU2000 Press Release FAQ [online] [accessed on August, 2009]. Available at: http://www.spec.org/osg/cpu2000/press/faq.html
9. KleinOsowski A.J., Lilja D.J., MinneSPEC: A new SPEC benchmark workload for simulation-based computer architecture research, Computer Architecture Letters, 2002, 1, p. 7-7.
10. Henning J.L., SPEC CPU2000: Measuring CPU performance in the new millennium, IEEE Computer, 2000, 33, p. 28-35.
11. StatSoft, Inc. (2004). STATISTICA (data analysis software system), version 7, for Windows. www.statsoft.com.