Application Performance on Dual Processor Cluster Nodes

Application Performance on Dual Processor Cluster Nodes
Avijit Purkayastha, Kent Milfeld, Chona Guiang, Jay Boisseau
Texas Advanced Computing Center
Presented by Kent Milfeld (milfeld@tacc.utexas.edu)

Thanks
Newisys (Austin, TX): AMD Opteron system
Dell (Austin, TX) & Cray: Intel Xeon system

Outline
- HPC needs for single- and dual-processor commodity system nodes
- The architecture of Intel Xeon and AMD Opteron systems
- Single- and dual-processor Xeon and Opteron performance comparison
- Measured memory characteristics
- Parallel vs. serial execution of codes on a node: kernels and applications

Motivation
From 1995 to 2000, commodity massively parallel systems used uni-processor nodes: Beowulf systems, SP2 (SC), T3E. Today the e-commerce market has driven down the price of SMP servers; Dell, Gateway, and HP/Compaq compete for this market.

Motivation
Dual-processor scoreboard for HPC applications (single vs. dual nodes scored on each criterion):
- Peak performance (TFLOP)
- Cost per processor
- Memory subsystem: no shared bus system; no coherence in caches (processor, northbridge & OS); no false sharing; memory size
- Message passing: no shared interconnect adapters; on-node MPI performance
- I/O performance: local, parallel

Intel Architecture: Commodity IA-32 Server
[Block diagram] Two IA-32 processors share a front-side bus (FSB) to the North Bridge at 3.2 GB/s (400 MHz). Dual-channel 200 MHz memory provides 1.6 GB/s per channel. The South Bridge attaches the PCI bus and adapters (NIC) at 0.5 GB/s (66 MHz).

AMD Architecture: HyperTransport Link Widths and Speeds
[Block diagram] Each Opteron chip integrates the core, a system request queue, a crossbar (XBAR), a DDR memory controller (two channels at 2.66 GB/s, 333 MHz each), and HyperTransport ports. A HyperTransport link consists of two unidirectional point-to-point links, 2, 4, 8, 16, or 32 bits wide, at up to 800 MHz (DDR): 1.6 Gb/s per pin pair, 3.2 GB/s per direction at 800 MHz (x2 for DDR).

AMD Architecture
[Block diagram] Two AMD Opteron processors are connected by a 6.4 GB/s coherent HyperTransport link. Each processor has dual-channel 266/333 MHz DDR memory (PC2100/PC2700) at 2.1/2.7 GB/s per channel. Non-coherent 6.4 GB/s HyperTransport links lead to the AMD-8151 HT AGP tunnel and the AMD-8131 HT PCI-X tunnel.

IBM Power4
[Block diagram] Two 1.3 GHz cores share an on-chip L2 cache; an L3 directory connects to off-chip L3 and memory. Chip-to-chip communication: 13.8 GB/s. GX expansion bus: 1.7 GB/s.

Memory Latency
    I1 = IA(1)
    DO I = 2,N
       I2 = IA(I1)
       I1 = I2
    END DO
1.) Load IA with the sequence 1..N.
2.) Randomize the IA entries.
3.) Measure clock periods (CPs) for the loop; CPs/N = single memory access time = latency.
4.) The loop does not optimize: no prefetching or streams.
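For reference, a minimal runnable sketch of this pointer-chase measurement (the program name, working-set size, and SYSTEM_CLOCK timing are illustrative assumptions, not from the slides):

    ! Pointer-chase latency sketch: each load depends on the previous one,
    ! so prefetching and overlap cannot hide the access time.
    program chase_latency
      implicit none
      integer, parameter :: n = 2**24      ! assumed working-set size
      integer, allocatable :: ia(:)
      integer :: i, j, i1, tmp
      integer(kind=8) :: t0, t1, rate
      real :: r
      allocate(ia(n))
      do i = 1, n                          ! 1) load IA with the sequence 1..N
         ia(i) = i
      end do
      do i = n, 2, -1                      ! 2) randomize the entries (shuffle)
         call random_number(r)
         j = 1 + int(r * real(i))
         tmp = ia(i); ia(i) = ia(j); ia(j) = tmp
      end do
      call system_clock(t0, rate)
      i1 = ia(1)
      do i = 2, n                          ! 3) one dependent load per iteration
         i1 = ia(i1)
      end do
      call system_clock(t1)
      print *, 'average access time (ns):', &
               1.0d9 * real(t1 - t0, 8) / real(rate, 8) / real(n - 1, 8)
      print *, 'chain end (keeps the loop live):', i1
    end program chase_latency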

[Figure: Memory latency, Intel Xeon 2.4 GHz. Latency (clock periods) vs. array size, 128 bytes to 32 MB: ~2 CP in the L1/L2 region, rising to ~470 CP in main memory.]

[Figure: Memory latency, AMD Opteron 1.4 GHz. Latency (clock periods) vs. array size, 128 bytes to 32 MB: ~2-3 CP in the L1/L2 region, rising to ~170 CP in main memory.]

Memory Bandwidth
    DO I = 1,N
       S = S + A(I)
       T = T + B(I)
    END DO
1.) Compiled with -O3, unrolling = 2.
2.) Two read streams give the high but reasonable bandwidths expected across memory and the caches.
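A runnable sketch of the two-stream sum, reporting bandwidth in MB/s (array length and timing are assumed; compile with -O3 as noted above):

    ! Two-stream read-bandwidth sketch: sum two arrays and report MB/s.
    program two_stream_bw
      implicit none
      integer, parameter :: n = 2**22            ! assumed array length
      real(kind=8), allocatable :: a(:), b(:)
      real(kind=8) :: s, t
      integer(kind=8) :: t0, t1, rate
      integer :: i
      allocate(a(n), b(n))
      a = 1.0d0; b = 2.0d0
      s = 0.0d0; t = 0.0d0
      call system_clock(t0, rate)
      do i = 1, n                                ! two independent read streams
         s = s + a(i)
         t = t + b(i)
      end do
      call system_clock(t1)
      print *, 'bandwidth (MB/s):', &
               2.0d0 * 8.0d0 * real(n, 8) / 1.0d6 / (real(t1 - t0, 8) / real(rate, 8))
      print *, 's + t =', s + t                  ! keep the sums live
    end program two_stream_bw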

[Figure: AMD single/dual-processor memory bandwidth, 1.4 GHz Opteron. Bandwidth (MB/s) vs. size, 512 bytes to 256 KB: serial runs reach ~2.3 GB/s; in dual runs each CPU sustains ~2.0 GB/s from its local memory.]

[Figure: Xeon single/dual-processor memory bandwidth, 2.4 GHz Xeon. Bandwidth (MB/s) vs. size, 512 bytes to 256 KB: serial runs reach ~2.3 GB/s; in dual runs each CPU sustains only ~1.0 GB/s on the shared front-side bus.]

STREAM Results (MB/s)

                  Serial execution           Parallel execution (two threads)
  Kernel      Intel Xeon   AMD Opteron       Intel Xeon   AMD Opteron
  Copy           1213          2162             1167          3934
  Scale          1206          2093             1162          4087
  Add            1381          2341             1273          4561
  Triad          1375          2411             1281          4529
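The table reports the STREAM benchmark kernels; a triad-style sketch with OpenMP (not the official STREAM code; array length is assumed) illustrates how the serial and two-thread measurements differ only in the thread count:

    ! STREAM-triad-style kernel; run with OMP_NUM_THREADS=1 and =2
    ! to compare serial and two-thread bandwidth.
    program triad_like
      use omp_lib
      implicit none
      integer, parameter :: n = 4 * 1024 * 1024   ! assumed array length
      real(kind=8), allocatable :: a(:), b(:), c(:)
      real(kind=8) :: t0, t1
      integer :: i
      allocate(a(n), b(n), c(n))
      !$omp parallel do                           ! first touch: place each thread's data locally
      do i = 1, n
         a(i) = 0.0d0; b(i) = 1.0d0; c(i) = 2.0d0
      end do
      !$omp end parallel do
      t0 = omp_get_wtime()
      !$omp parallel do
      do i = 1, n
         a(i) = b(i) + 3.0d0 * c(i)               ! Triad: a = b + q*c
      end do
      !$omp end parallel do
      t1 = omp_get_wtime()
      print *, 'Triad bandwidth (MB/s):', 3.0d0 * 8.0d0 * real(n, 8) / 1.0d6 / (t1 - t0)
      print *, a(1)                               ! keep the result live
    end program triad_like

On the Opteron, the parallel first touch matters: it places each thread's portion of the arrays in that processor's local memory, which is why the two-thread numbers nearly double.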

MPI On-Node Bandwidth
On-node MPI bandwidth should be higher than node-to-node. Measured at a 2 MB message size (MB/s):
  Dell 2650                        295
  Opteron, SuSE-64, ch_p4          172
  Opteron, SuSE-64, ch_shmem       404
  IBM p690 HPC                    1324
  IBM p690 Turbo                  1398
  IBM p655 HPC                    1684
On-node performance varies widely across MPI implementations.
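A minimal MPI ping-pong sketch of the kind used for such on-node measurements (message size, repetition count, and program name are assumptions); run both ranks on one node, e.g. with mpirun -np 2:

    ! On-node MPI ping-pong bandwidth sketch.
    program pingpong
      use mpi
      implicit none
      integer, parameter :: nwords = 262144        ! 2 MB of 8-byte words
      integer, parameter :: nreps  = 100
      real(kind=8) :: buf(nwords), t0, t1
      integer :: rank, ierr, k, status(MPI_STATUS_SIZE)
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      buf = 1.0d0
      t0 = MPI_Wtime()
      do k = 1, nreps                              ! rank 0 sends, rank 1 echoes back
         if (rank == 0) then
            call MPI_Send(buf, nwords, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, ierr)
            call MPI_Recv(buf, nwords, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, status, ierr)
         else if (rank == 1) then
            call MPI_Recv(buf, nwords, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, status, ierr)
            call MPI_Send(buf, nwords, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, ierr)
         end if
      end do
      t1 = MPI_Wtime()
      if (rank == 0) print *, 'bandwidth (MB/s):', &
         2.0d0 * nreps * nwords * 8.0d0 / 1.0d6 / (t1 - t0)
      call MPI_Finalize(ierr)
    end program pingpong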

Hand-Coded Matrix-Matrix Multiply (AMD Opteron, 1.4 GHz)
    do i = 1, n
      do k = 1, n
        do j = 1, n
          C(i,k) = C(i,k) + A(i,j)*B(j,k)
        end do
      end do
    end do
The kernel accesses memory with one stream and one strided pattern. (Don't do this at home in your optimized code.)
[Figure: clock periods per iteration vs. matrix leading dimension n (16 to 2048) for serial and dual runs (CPU0, CPU1, average). Throughput doubles when run on two CPUs; ~26 ns per iteration at 1.4 GHz.]

Hand-Coded Matrix-Matrix Multiply (Intel Xeon, 2.4 GHz)
Same kernel as above. Two-CPU throughput suffers on the shared bus: ~29 ns per iteration at 2.4 GHz in the dual run.
[Figure: clock periods per iteration vs. matrix leading dimension n (16 to 2048) for serial and dual runs (CPU0, CPU1, average).]
Serial execution is faster on the Xeon: 15 ns vs. 26 ns per iteration.
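A runnable version of the hand-coded kernel, timed per inner-loop iteration (matrix size and timing are illustrative assumptions):

    ! Hand-coded matrix-matrix multiply with the i/k/j loop order shown above.
    program handmxm
      implicit none
      integer, parameter :: n = 512                ! assumed matrix order
      real(kind=8), allocatable :: a(:,:), b(:,:), c(:,:)
      integer(kind=8) :: t0, t1, rate
      integer :: i, j, k
      allocate(a(n,n), b(n,n), c(n,n))
      a = 1.0d0; b = 2.0d0; c = 0.0d0
      call system_clock(t0, rate)
      do i = 1, n
         do k = 1, n
            do j = 1, n                            ! inner loop: one stream (B), one strided access (A)
               c(i,k) = c(i,k) + a(i,j) * b(j,k)
            end do
         end do
      end do
      call system_clock(t1)
      print *, 'ns per iteration:', &
               1.0d9 * real(t1 - t0, 8) / real(rate, 8) / real(n, 8)**3
      print *, c(1,1)                              ! keep the result live
    end program handmxm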

Library Matrix-Matrix Multiply (DGEMM), MKL 5.1
[Figure, AMD Opteron 1.4 GHz: performance (MFLOPS) vs. matrix size (8-byte words) / matrix order, serial vs. two parallel MPI tasks.]
[Figure, Intel Xeon 2.4 GHz: performance (MFLOPS) vs. matrix size (8-byte words) / matrix order, serial vs. two parallel MPI tasks.]
Opteron results may be much higher with Opteron-optimized libraries (e.g., the NAG library).
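A sketch of the corresponding library call, timed and converted to MFLOPS (matrix order is assumed; link against an optimized BLAS such as MKL):

    ! Library DGEMM sketch: C = A*B.
    program libmxm
      implicit none
      integer, parameter :: n = 1000               ! assumed matrix order
      real(kind=8), allocatable :: a(:,:), b(:,:), c(:,:)
      integer(kind=8) :: t0, t1, rate
      allocate(a(n,n), b(n,n), c(n,n))
      a = 1.0d0; b = 2.0d0; c = 0.0d0
      call system_clock(t0, rate)
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
      call system_clock(t1)
      print *, 'MFLOPS:', 2.0d0 * real(n, 8)**3 / 1.0d6 / &
               (real(t1 - t0, 8) / real(rate, 8))
      print *, c(1,1)                              ! keep the result live
    end program libmxm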

Remote & Local Memory Read/Write (AMD Opteron)
Kernel: A(i,j) = time * A(i,j); in the remote case the two threads swap their j columns.
1.) Each processor writes a column to local memory.
2.) Each processor reads/writes its own column (local access).
3.) Each processor swaps column indices and reads/writes the other column in remote memory (remote access).
[Figure: clock cycles per iteration vs. matrix leading dimension n (400 to 13,000) for local and remote access, threads 0 and 1 and their average.]
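A sketch of this local/remote experiment with two OpenMP threads (column layout, sizes, and the first-touch placement are assumptions; it also assumes the two threads are bound to the two processors):

    ! Local vs. remote memory update with two threads on a NUMA node.
    program numa_cols
      use omp_lib
      implicit none
      integer, parameter :: n = 4 * 1024 * 1024    ! assumed column length
      real(kind=8), allocatable :: a(:,:)
      real(kind=8) :: t0, t1, time_fac
      integer :: i, me, col
      logical :: swap
      allocate(a(n,2))
      time_fac = 1.000001d0
      do i = 1, 2                                  ! case 1: local, case 2: remote
         swap = (i == 2)
         !$omp parallel private(me, col) num_threads(2)
         me = omp_get_thread_num()
         a(:, me + 1) = 1.0d0                      ! first touch: column me+1 is local to thread me
         !$omp barrier
         col = me + 1
         if (swap) col = 3 - col                   ! remote case: update the other thread's column
         !$omp master
         t0 = omp_get_wtime()
         !$omp end master
         !$omp barrier
         a(:, col) = time_fac * a(:, col)
         !$omp barrier
         !$omp master
         t1 = omp_get_wtime()
         if (swap) then
            print *, 'remote access time (s):', t1 - t0
         else
            print *, 'local  access time (s):', t1 - t0
         end if
         !$omp end master
         !$omp end parallel
      end do
    end program numa_cols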

Memory-Intensive and Compute-Intensive Applications
SM: Stommel model of ocean circulation; solves a 2-D partial differential equation using finite-difference approximations for the derivatives on a discretized domain, timed for a constant number of Jacobi iterations. Memory intensive.
MD: molecular dynamics of an argon lattice; uses the Verlet algorithm for propagation of displacements and velocities. Compute intensive.

  Platform          Serial SM (s)   Parallel SM (s)   Serial MD (s)   Parallel MD (s)
  AMD Opteron 2P        68.0             43.4              9.48            5.3
  Intel Xeon 2P         57.5             63.4              7.14            5.4
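To make the SM description concrete, here is a generic Jacobi-sweep sketch for a 2-D finite-difference solve (not the authors' Stommel code; grid size, forcing, and iteration count are assumed):

    ! Generic Jacobi sweep over a 2-D grid, timed for a fixed iteration count.
    program jacobi_sweep
      use omp_lib
      implicit none
      integer, parameter :: n = 1000, niter = 200
      real(kind=8), allocatable :: psi(:,:), new(:,:), f(:,:)
      real(kind=8) :: t0, t1
      integer :: i, j, it
      allocate(psi(0:n+1,0:n+1), new(0:n+1,0:n+1), f(0:n+1,0:n+1))
      psi = 0.0d0; new = 0.0d0; f = 1.0d0
      t0 = omp_get_wtime()
      do it = 1, niter                             ! constant number of iterations, as on the slide
         !$omp parallel do private(i)
         do j = 1, n
            do i = 1, n                            ! five-point finite-difference update
               new(i,j) = 0.25d0 * (psi(i-1,j) + psi(i+1,j) + psi(i,j-1) + psi(i,j+1) - f(i,j))
            end do
         end do
         !$omp end parallel do
         psi = new
      end do
      t1 = omp_get_wtime()
      print *, 'time (s):', t1 - t0, '  psi(n/2,n/2) =', psi(n/2, n/2)
    end program jacobi_sweep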

Summary

  Metric         Serial: Opteron   Serial: Xeon          Parallel: Opteron                        Parallel: Xeon
  Latency        Low               High                  Overlapped                               Overlapped
  Bandwidth      ~2 GB/s           ~2 GB/s               2x                                       1x
  MxM (per CP)   Lower             Higher                2x                                       1x
  MxM (time)     Higher            Lower                 mem slightly lower                       mem slightly higher
  DGEMM          Low               High (~2x Opteron)    scales 1.9x (MKL 5.1 not AMD-optimized)  scales 1.8x

Summary Performance of dual-processor systems varies with memory architecture and processor speed. AMD memory bandwidth scales by 2x when second processor is used (using local memory). Xeon memory bandwidth is shared by second processor. Xeon outperforms Opteron on serial compute- intensive codes (due to speed: 2.4GHz Xeon vs. 1.4GHz Opteron); but lead can be eliminated with dual-processor execution of (parallel) programs when memory bandwidths & synchronizations are involved. 24