Cluster Computing Paul A. Farrell 9/15/2011. Dept of Computer Science Kent State University 1. Benchmarking CPU Performance

Similar documents
Benchmarking CPU Performance. Benchmarking CPU Performance

Benchmarking CPU Performance

CSE5351: Parallel Processing Part III

An evaluation of the Performance and Scalability of a Yellowstone Test-System in 5 Benchmarks

An Intelligent and Cost-effective Solution to Implement High Performance Computing

Trends in systems and how to get efficient performance

Performance of HPC Applications over InfiniBand, 10 Gb and 1 Gb Ethernet. Swamy N. Kandadai and Xinghong He and

Measuring Performance. Speed-up, Amdahl s Law, Gustafson s Law, efficiency, benchmarks

Low-Level Monitoring and High-Level Tuning of UPC on CC-NUMA Architectures

David Cronk University of Tennessee, Knoxville, TN

Evaluation of Intel Memory Drive Technology Performance for Scientific Applications

General Purpose GPU Computing in Partial Wave Analysis

Software Distributed Shared Memory with High Bandwidth Network: Production and Evaluation

Integrated Device Technology RISController and CPU Performance Report September 1997

J. Blair Perot. Ali Khajeh-Saeed. Software Engineer CD-adapco. Mechanical Engineering UMASS, Amherst

I/O Channels. RAM size. Chipsets. Cluster Computing Paul A. Farrell 9/8/2011. Memory (RAM) Dept of Computer Science Kent State University 1

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou

Node Hardware. Performance Convergence

A Relative Development Time Productivity Metric for HPC Systems

Implementing Efficient and Scalable Flow Control Schemes in MPI over InfiniBand

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Alpha AXP Workstation Family Performance Brief - OpenVMS

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Performance Evaluation of Fast Ethernet, Giganet and Myrinet on a Cluster

Technical Brief: Specifying a PC for Mascot

Algorithms and Architecture. William D. Gropp Mathematics and Computer Science

Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters

A Comparison of Three MPI Implementations

Performance of Voltaire InfiniBand in IBM 64-Bit Commodity HPC Clusters

Turbostream: A CFD solver for manycore

Performance of computer systems

AD910A M.2 (NGFF) to SATA III Converter Card

The Use of the MPI Communication Library in the NAS Parallel Benchmarks

UMBC. Rubini and Corbet, Linux Device Drivers, 2nd Edition, O Reilly. Systems Design and Programming

Intel released new technology call P6P

Fujitsu s Approach to Application Centric Petascale Computing

Memory Systems IRAM. Principle of IRAM

Performance and accuracy of hardware-oriented. native-, solvers in FEM simulations

High Performance MPI on IBM 12x InfiniBand Architecture

Cost-Performance Evaluation of SMP Clusters

Master Informatics Eng.

Minerva. Performance & Burn In Test Rev AD903A/AD903D Converter Card. Table of Contents. 1. Overview

Exploring the Effects of Hyperthreading on Scientific Applications

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

The bottom line: Performance. Measuring and Discussing Computer System Performance. Our definition of Performance. How to measure Execution Time?

WORTMANN AG IT Made in Germany

Outline. Parallel Algorithms for Linear Algebra. Number of Processors and Problem Size. Speedup and Efficiency

Architecture of Parallel Computer Systems - Performance Benchmarking -

Evaluating Computers: Bigger, better, faster, more?

Lecture 3 Notes Topic: Benchmarks

A Case for High Performance Computing with Virtual Machines

Slurm Configuration Impact on Benchmarking

Adaptive Scientific Software Libraries

High Performance Computing for PDE Some numerical aspects of Petascale Computing

Derivation and Verification of Parallel Components for the Needs of an HPC Cloud

Types of Workloads. Raj Jain Washington University in Saint Louis Saint Louis, MO These slides are available on-line at:

Performance Estimation of High Performance Computing Systems with Energy Efficient Ethernet Technology

ANALYSIS OF CLUSTER INTERCONNECTION NETWORK TOPOLOGIES

Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems?

Comparison of Parallel Processing Systems. Motivation

Organizational issues (I)

Outline. CSC 447: Parallel Programming for Multi- Core and Cluster Systems

NAS Applied Research Branch. Ref: Intl. Journal of Supercomputer Applications, vol. 5, no. 3 (Fall 1991), pg. 66{73. Abstract

Large-scale Gas Turbine Simulations on GPU clusters

Dell Guide to Server Benchmarks

CUDA. Matthew Joyner, Jeremy Williams

Communication Characteristics in the NAS Parallel Benchmarks

CUDA Accelerated Linpack on Clusters. E. Phillips, NVIDIA Corporation

Early Evaluation of the Cray X1 at Oak Ridge National Laboratory

Recap: Machine Organization

Performance evaluation. Performance evaluation. CS/COE0447: Computer Organization. It s an everyday process

High Performance Computing on GPUs using NVIDIA CUDA

CS232: Computer Architecture II

TEST REPORT. SEPTEMBER 2007 Linpack performance on Red Hat Enterprise Linux 5.1 and 3 AS Intel-based servers

Technology for a better society. hetcomp.com

Organizational issues (I)

Java Performance Analysis for Scientific Computing

A Multi-Tiered Optimization Framework for Heterogeneous Computing

Presentations: Jack Dongarra, University of Tennessee & ORNL. The HPL Benchmark: Past, Present & Future. Mike Heroux, Sandia National Laboratories

WHITE PAPER FUJITSU PRIMERGY SERVERS PERFORMANCE REPORT PRIMERGY BX924 S2

Parallel computer architecture classification

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery

Data parallel algorithms, algorithmic building blocks, precision vs. accuracy

Chapter 4 Main Memory

LINPACK Benchmark. on the Fujitsu AP The LINPACK Benchmark. Assumptions. A popular benchmark for floating-point performance. Richard P.

Intel Math Kernel Library

i960 Microprocessor Performance Brief October 1998 Order Number:

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Benchmark Results. 2006/10/03

The Alpha Microprocessor: Out-of-Order Execution at 600 MHz. Some Highlights

Systems Design and Programming. Instructor: Chintan Patel

LiMIC: Support for High-Performance MPI Intra-Node Communication on Linux Cluster

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

Maximizing Memory Performance for ANSYS Simulations

The Alpha Microprocessor: Out-of-Order Execution at 600 Mhz. R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example

Computers and Microprocessors. Lecture 34 PHYS3360/AEP3630

Performance Study of Hyper-Threading Technology on the LUSITANIA Supercomputer

Quad-Core Intel Xeon Processor-based 3200/3210 Chipset Server Platforms

Memory Technology. Assignment 08. CSTN3005 PC Architecture III October 25, 2005 Author: Corina Roofthooft Instructor: Dave Crabbe

Transcription:

Many benchmarks available MHz (cycle speed of processor) MIPS (million instructions per second) Peak FLOPS Whetstone Stresses unoptimized scalar performance, since it is designed to defeat any effort to find concurrency. Popular way to estimate MIPSMFLOPS (million floating point operations per second) SPEC benchmarks (SPECint, SPECfloat, SPECmark) Maintained by a consortium of workstation vendors Frequently-changing collection of programs one might run on a workstation, plus kernels like matrix multiplication It is virtually impossible to track SPEC performance from one year to the next since the definition of the problem set is always changing LINPACK - dense linear solver with partial pivoting 100x100, 1000x1000 or larger Cluster Computing 1 Cluster Computing 2 NAS parallel benchmark (NPB) - a small set of programs designed to help evaluate the performance of parallel supercomputers. Derived from computational fluid dynamics (CFD) applications Consist of five kernels and three pseudo-applications in three sizes called Sample Code, Class A and Class B Embarrassingly parallel EP Multigrid MG Conjugate gradient CG D FFT PDE FT Integer sort IS LU solver LU Pentadiagonal solver SP Block tridiagonal solver BT STREAM, peak memory bandwidth small collection of very simple loop operations tries to estimate the total rate at which all addressable memory spaces can deliver data to respective processors Fhourstones, Dhrystone, nsieve, heapsort, Hanoi, queens, flops, fft, mm assorted integer and floating-point benchmarks for small problems Cluster Computing 3 Cluster Computing 4 Kent State University 1

Problems with Benchmarks Correlation Examples Benchmark performance does not necessarily correlate with application performance Performance on two benchmarks may not correlate Benchmarks problems tend to be small, easily portable and easy to explain As speed increases the benchmarks run too quickly and must be redefined Benchmarks tend to measure performance for a particular size of problem LINPACK v GAMESS computational chemistry application Peak FLOPS v FLOPS from NAS benchmark 1 Correlation is -0.692 Cluster Computing 5 Cluster Computing 6 HINT - an attempted synthesis HINT benchmark created in 1995 at Ames DOE Laboratory by John L. Gustafson and Quinn Snell Problem with previous benchmarks - tended to emphasize one part of performance curve HINT - Hierarchical INTegration tries to produce curve rather than number HINT -relation to previous benchmarks Hint shows the effects of cache size, and memory size Corresponds to Fhourstone etc initially LINPACK 100x100 SPECint around 100K Eventually corresponds to Stream - a benchmark for memory performance when end up using Virtual Memory Aim: to provide a scalable benchmark that reflects the type of work done in iterative refinement Cluster Computing 7 Cluster Computing 8 Kent State University 2

HINT Infinitely scalable Speed defined by quality improvement per second (QUIPS). "Quality" is the reciprocal of the error, which combines precision loss and discretization error. The problem can be run with any data type: floating point (any precision), integer (any precision), extended-precision arithmetic, etc HINT provides a graph of performance, it also has a "single number" measure (the area under the graph) that summarizes performance As the size of the HINT task grows, the memory access pattern becomes more complicated in a way that defeats caches. Hint Algorithm Use interval subdivision to find rational bounds on the area in the xy plane for which x ranges from 0 to 1 and y ranges from 0 to (1- x) / (1+ x). Subdivide x and y ranges into 2 k equal subintervals and count the squares thus defined that are completely inside the area (lower bound) or completely contain the area (upper bound). The function (1- x) / (1+ x) is monotone decreasing, so the upper bound comes from the left function value and the lower bound from the right function value on any subinterval. No other knowledge about the function may be used. Cluster Computing 9 Cluster Computing 10 HINT - first subdivision Bounds after subdivision into two intervals Upper left and lower right contain 87 and 47 squares 87-square region should be subdivided 47-square error will then move to the front of the queue of subintervals to be split HINT Illustration hint.mpeg.mpg Cluster Computing 11 Cluster Computing 12 Kent State University 3

Different Precisions Hint Results on Some Processors Cluster Computing 13 Cluster Computing 14 More Recent Results - Machines Strider AMD Athlon(tm) MP 2100+ (1733.41 Mhz) 256KB cache each, 3 GB of memory. Memory bus speed Rc1, v1 (RocketCalc nodes) Intel Pentium Xeon 2.4 GHz, 512 KB cache 8 GB Registered ECC DDR SDRAM per processor Arakis Intel Pentium 4 2.4Ghz. 512KB cache Asus P4T533-C motherboard, 850E BIOS, 533/400 MHz FSB, 1GB RDRAM Frodo Intel Pentium 4 1.5GHz, 256KB cache RDRAM Fianna25 Intel Pentium III/450 MHz, 512 KB cache, 256MB PC100 Compliant SDRAM More Recent Floating Point Cluster Computing 15 Cluster Computing 16 Kent State University 4

More Recent Integer Recent Double Precision Cluster Computing 17 Cluster Computing 18 References http://discov.cs.kent.edu/resources/perf/hint/publicati ons Cluster Computing 19 Kent State University 5