Application Performance on Dual Processor Cluster Nodes

1 Application Performance on Dual Processor Cluster Nodes
by Kent Milfeld
Avijit Purkayastha, Kent Milfeld, Chona Guiang, Jay Boisseau
TEXAS ADVANCED COMPUTING CENTER

2 Thanks
Newisys (Austin, TX): AMD Opteron System
Dell (Austin, TX) & Cray: Intel Xeon System

3 OUTLINE
HPC needs for Single- & Dual-processor Commodity System Nodes
The architecture of Intel Xeon & AMD Opteron Systems
Single & Dual Processor Xeon & Opteron Performance Comparison
   Measured Memory Characteristics
   Parallel vs Serial Execution of Codes on a Node
      Kernels
      Applications

4 Motivation
Commodity Massively Parallel Systems used uni-processor nodes:
   Beowulf Systems
   SP2 (SC)
   T3E
Today the e-commerce market has driven the price of SMP servers down. Dell, Gateway, and HP/Compaq compete for this market.

5 Motivation
Dual processor scoreboard for HPC Applications (columns: Single, Dual):
Peak performance (TFLOP)
Cost Per Processor
Memory Subsystem
   No shared bus system
   No Coherence in Caches (processor and northbridge & OS)
   No False Sharing
   Memory Size
Message Passing
   No Shared interconnect adapters
   On-node MPI performance
I/O Performance
   Local
   Parallel
[The x marks assigning each criterion to Single or Dual are not recoverable.]

6 Intel Architecture
Commodity IA-32 Server
[Diagram: two IA-32 processors, Front-Side Bus (FSB), North Bridge (NB), two memory channels, South Bridge (SB), PCI bus with PCI adapter (NIC) to switch.]
Memory: 200 MHz dual channel, 1.6 GB/sec (200 MHz) per channel
Front-Side Bus: 3.2 GB/sec (400 MHz)
PCI: 0.5 GB/s (66 MHz)
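The rates on this slide are consistent with clock rate times bus width. A minimal sketch of that arithmetic, assuming 64-bit (8-byte) data paths for the memory channels, the front-side bus, and the PCI bus; the widths are not stated on the slide, and the function name is illustrative only:

```python
def peak_bandwidth_gb_per_s(transfers_mhz, width_bytes):
    """Peak rate in GB/s = transfers per second * bytes per transfer."""
    return transfers_mhz * 1e6 * width_bytes / 1e9

# MHz values are from the slide; the 8-byte widths are an assumption.
memory_channel = peak_bandwidth_gb_per_s(200, 8)
front_side_bus = peak_bandwidth_gb_per_s(400, 8)
pci_bus = peak_bandwidth_gb_per_s(66, 8)

print(f"one memory channel: {memory_channel:.1f} GB/s")
print(f"two memory channels: {2 * memory_channel:.1f} GB/s")
print(f"front-side bus: {front_side_bus:.1f} GB/s")
print(f"PCI: {pci_bus:.2f} GB/s")
```

Under these assumptions the two memory channels together supply the same peak rate as the front-side bus, which both processors share.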

7 AMD Architecture
HyperTransport Link Widths and Speeds
[Diagram: Opteron chip (core, System Request Queue, XBAR, DDR memory controller, HyperTransport links) linked by HyperTransport to a second Opteron chip.]
Memory: 2.66 GB/s (333 MHz) per channel, two channels
1.6 Gb/s per pin pair
Two unidirectional point-to-point links
2, 4, 8, 16 or 32 bits wide, up to 800 MHz (DDR)
3.2 GB/s per direction at 800 MHz x2
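The 3.2 GB/s figure can be reproduced from the per-pin rate. A rough sketch, assuming the figure refers to a 16-bit link (the slide lists several widths without saying which one the 3.2 GB/s applies to); the function name is hypothetical:

```python
def ht_link_gb_per_s(gbit_per_pin_pair, width_bits):
    """One-direction link rate in GB/s: per-pin Gb/s * link width / 8 bits per byte."""
    return gbit_per_pin_pair * width_bits / 8.0

PIN_RATE_GBIT = 1.6  # from the slide: 800 MHz clock, two transfers per cycle (DDR)

for width in (2, 4, 8, 16, 32):
    one_way = ht_link_gb_per_s(PIN_RATE_GBIT, width)
    print(f"{width:2d}-bit link: {one_way:.1f} GB/s per direction, "
          f"{2 * one_way:.1f} GB/s both directions")

# Assumption: the slide's 3.2 GB/s describes the 16-bit case.
sixteen_bit = ht_link_gb_per_s(PIN_RATE_GBIT, 16)
```

With two unidirectional links, a 16-bit connection would carry 6.4 GB/s in aggregate, which would match the 6.4GB/s HT labels on slide 8 if those count both directions.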

8 AMD Architecture
[Diagram: two AMD Opteron processors joined by coherent HyperTransport (HT), with HT links to the AMD-8151 HT AGP Tunnel and the AMD-8131 HT PCI-X Tunnel.]
Coherent HT: 6.4 GB/s
HT: 6.4 GB/s
Memory: dual channel, 266/333 MHz (PC2100/2700), 2.1/2.7 GB/sec

9 IBM Power4
[Diagram: two 1.3 GHz cores, shared L2, L3 directory, L3, memory.]
Chip-chip communication: 13.8 GB/sec
GX Expansion Bus: 1.7 GB/sec

10 Memory Latency
      I1 = IA(1)
      DO I = 2,N
         I2 = IA(I1)
         I1 = I2
      END DO
1.) Load IA with the sequence 1..N.
2.) Randomize IA entries.
3.) Measure clock periods of loop. (CPs/N = single memory access time = latency)
4.) Loop does not optimize: no prefetching or streams.
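The same dependent-load pattern can be sketched in Python. This is only an illustration of the access structure: interpreter overhead dominates the timing, so the printed number is not a hardware latency. Two things here are not from the slide: indices are 0-based, and the shuffle uses Sattolo's algorithm (a swapped-in choice) so that the chain is guaranteed to visit every element in a single cycle. The function names are illustrative.

```python
import random
import time

def make_chain(n, seed=0):
    """Return a list ia where following i -> ia[i] visits all n slots in one cycle."""
    rng = random.Random(seed)
    ia = list(range(n))
    # Sattolo's algorithm: swap each slot with a strictly earlier one.
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i
        ia[i], ia[j] = ia[j], ia[i]
    return ia

def chase(ia, steps):
    """Follow the chain; each load's address depends on the previous load."""
    i1 = ia[0]
    for _ in range(steps - 1):
        i1 = ia[i1]
    return i1

n = 1 << 16
ia = make_chain(n)
start = time.perf_counter()
last = chase(ia, n)
elapsed = time.perf_counter() - start
print(f"{elapsed / n * 1e9:.1f} ns per dependent access (includes interpreter overhead)")
```

Because every load needs the result of the one before it, neither a compiler nor a hardware prefetcher can overlap the accesses, which is the point of step 4.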

11 Memory Latency: Xeon
[Plot: latency (clock periods) vs. array size (bytes), with L1 and L2 marked. Annotations: ~470 CP; ~2 CP, L1.]

12 Memory Latency: AMD
[Plot: latency (clock periods) vs. array size (bytes), with L1 and L2 marked. Annotations: ~170 CP; ~2-3 CP, L1.]

13 Memory Bandwidth
      DO I = 1,N
         S = S + A(I)
         T = T + B(I)
      END DO
1.) -O3, unrolling = 2
2.) Two streams give the high, reasonable bandwidths expected across memory & caches.
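A hedged sketch of how a bandwidth figure is derived from this loop: bytes read from both arrays divided by elapsed time. It assumes 8-byte elements (the slides do not give the element size), and in Python the result reflects interpreter speed rather than the memory system; the function name is illustrative.

```python
import time
from array import array

def two_stream_sum(a, b):
    """Read two independent arrays in one pass, as in the slide's loop."""
    s = 0.0
    t = 0.0
    for x, y in zip(a, b):
        s += x
        t += y
    return s, t

n = 200_000
a = array("d", [1.0]) * n  # "d" = 8-byte floats (assumed element size)
b = array("d", [2.0]) * n

start = time.perf_counter()
s, t = two_stream_sum(a, b)
elapsed = time.perf_counter() - start

bytes_read = (len(a) + len(b)) * a.itemsize
print(f"{bytes_read / elapsed / 1e6:.1f} MB/s (interpreter-bound, not hardware-bound)")
```

Varying `n` so the arrays fit in L1, then L2, then only in main memory is what would produce the size axis of the bandwidth plots.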

14 AMD SP/DP memory bandwidth
[Plot: Opteron bandwidth (MB/s) vs. size (bytes); curves: serial, dual CPU0, dual CPU1. Annotations: 2.0 GB/s per CPU; 2.3 GB/s.]

15 Xeon SP/DP memory bandwidth
[Plot: Xeon bandwidth (MB/s) vs. size (bytes); curves: serial, dual CPU0, dual CPU1. Annotations: 2.3 GB/s; a per-CPU rate whose value is missing.]

16 STREAM Results
[Two tables of Copy, Scale, Add, and Triad rates (MB/sec) for Intel Xeon and AMD Opteron: serial execution, and parallel execution with two threads. Values not preserved.]
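For reference, the four kernels named here can be sketched as below, using the kernel definitions and byte-counting convention of the public STREAM benchmark (2 words moved per iteration for Copy and Scale, 3 for Add and Triad). The array length, scalar, and function names are illustrative choices, and pure-Python timings measure interpreter overhead rather than memory bandwidth.

```python
import time

WORD = 8  # bytes per double-precision element

def stream_copy(a, b, c, q):
    for i in range(len(a)):
        c[i] = a[i]

def stream_scale(a, b, c, q):
    for i in range(len(a)):
        b[i] = q * c[i]

def stream_add(a, b, c, q):
    for i in range(len(a)):
        c[i] = a[i] + b[i]

def stream_triad(a, b, c, q):
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]

# (name, kernel, words moved per iteration)
KERNELS = [("Copy", stream_copy, 2), ("Scale", stream_scale, 2),
           ("Add", stream_add, 3), ("Triad", stream_triad, 3)]

n = 100_000
a, b, c, q = [1.0] * n, [2.0] * n, [0.0] * n, 3.0
rates = {}
for name, kernel, words in KERNELS:
    start = time.perf_counter()
    kernel(a, b, c, q)
    elapsed = time.perf_counter() - start
    rates[name] = words * WORD * n / elapsed / 1e6  # MB/s
    print(f"{name:5s} {rates[name]:10.1f} MB/s")
```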

17 MPI On-Node Bandwidth
It should be faster than node-to-node.
[Table: bandwidth (MB/sec); rows: DELL, Opteron Suse-64 ch_p4 (2MB), Opteron Suse-64 ch_shmem (2MB), IBM P690 HPC (2MB), IBM P690 Turbo (2MB), IBM P655 HPC (2MB). Values not preserved.]
On-node performance varies across MPI implementations.

18 Hand Coded Matrix-Matrix Multiply
Accesses memory with 1 stream and 1 strided pattern. (Don't do this at home in your optimized code.)
      do i=1,n; do k=1,n; do j=1,n
         C(i,k)=C(i,k)+A(i,j)*B(j,k)
      end do; end do; end do
[Plot: AMD Opteron, clock periods per iteration vs. matrix leading dimension (n); curves: serial, dual run CPU0, dual run CPU1, dual run average. Annotation: 26 nsec (@1.4GHz).]
2x throughput when run on two CPUs.

19 Hand Coded Matrix-Matrix Multiply
      do i=1,n; do k=1,n; do j=1,n
         C(i,k)=C(i,k)+A(i,j)*B(j,k)
      end do; end do; end do
[Plot: Intel Xeon, clock periods per iteration vs. matrix leading dimension (n); curves: serial, dual run CPU0, dual run CPU1, dual run average. Annotation: 29 nsec.]
2-CPU throughput suffers with shared bus.
Serial execution is faster on Xeon: 15nsec vs. 26nsec
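The plots are in clock periods while the annotations are in nanoseconds; the conversion is time multiplied by clock rate. A small sketch using the 1.4GHz Opteron and 2.4GHz Xeon clock rates given elsewhere in these slides (the function name is illustrative):

```python
def ns_to_clock_periods(time_ns, clock_ghz):
    """Clock periods = seconds * cycles per second = ns * GHz."""
    return time_ns * clock_ghz

opteron_serial_cp = ns_to_clock_periods(26, 1.4)  # 26 nsec at 1.4 GHz
xeon_serial_cp = ns_to_clock_periods(15, 2.4)     # 15 nsec at 2.4 GHz

print(f"Opteron serial: {opteron_serial_cp:.1f} clock periods per iteration")
print(f"Xeon serial:    {xeon_serial_cp:.1f} clock periods per iteration")
print(f"Time ratio (Opteron / Xeon): {26 / 15:.2f}")
print(f"Clock ratio (Xeon / Opteron): {2.4 / 1.4:.2f}")
```

At these two annotated points the processors spend a comparable number of clock periods per iteration, so the Xeon's shorter serial time tracks its higher clock rate. These are single points, and the comparison may differ elsewhere on the curves.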

20 Library Matrix-Matrix Multiply (DGEMM)
MKL 5.1 Library
[Plots: performance (MFLOPS) vs. size (matrix order) for AMD Opteron 1.4GHz and Intel Xeon 2.4GHz; curves: serial, and parallel with two MPI tasks.]
May be much higher with Opteron-optimized Libs (e.g., NAG Lib.)

21 Remote & Local Memory Read/Write
      A(i,j)=time*A(i,j)
1.) Each processor writes a column to local memory.
2.) Each processor reads/writes to same column. (Local Access)
3.) Each processor swaps column index and reads/writes to remote memory. (Remote Access)
[Plot: AMD Opteron, clock cycles per iteration vs. matrix leading dimension (n); curves: "local" and "remote", each for thread 0, thread 1, and average. Annotation: swap the 2 j columns.]

22 Applications
SM: Stommel model of ocean circulation; solves 2-D partial differential equation. Uses finite difference approximation for derivatives on a discretized domain (timed for a constant number of Jacobi iterations). Memory intensive.
MD: Molecular dynamics of argon lattice. Uses Verlet algorithm for propagation (displacement & velocities). Compute intensive.

Platform         Serial SM (sec)   Parallel SM (sec)
AMD Opteron 2P   68.0              43.4
Intel Xeon 2P    (not preserved)   (not preserved)

Platform         Serial MD (sec)   Parallel MD (sec)
AMD Opteron 2P   9.48              5.3
Intel Xeon 2P    (not preserved)   (not preserved)
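The Opteron timings can be turned into speedup and parallel efficiency with the usual definitions (serial time over parallel time, and speedup over processor count). This is a worked calculation on the figures above, not a result reported on the slide; the function names are illustrative.

```python
def speedup(serial_sec, parallel_sec):
    """Speedup = serial time / parallel time."""
    return serial_sec / parallel_sec

def efficiency(serial_sec, parallel_sec, processors):
    """Parallel efficiency = speedup / number of processors."""
    return speedup(serial_sec, parallel_sec) / processors

# AMD Opteron 2P timings from the slide (seconds).
sm_speedup = speedup(68.0, 43.4)
md_speedup = speedup(9.48, 5.3)

print(f"SM (memory intensive):  {sm_speedup:.2f}x, "
      f"efficiency {efficiency(68.0, 43.4, 2):.0%}")
print(f"MD (compute intensive): {md_speedup:.2f}x, "
      f"efficiency {efficiency(9.48, 5.3, 2):.0%}")
```

On these numbers the compute-intensive MD code gains more from the second Opteron processor than the memory-intensive SM code does.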

23 Summary

SERIAL
                Opteron   Xeon
Latency         Low       High
Bandwidth       ~2GB/s    ~2GB/s
MxM (per CP)    Lower     Higher
MxM (time)      Higher    Lower
DGEMM           Low       High

PARALLEL
                Opteron                                        Xeon
Latency         Overlapped                                     Overlapped
Bandwidth       2x                                             1x
MxM             2x, mem slightly lower                         1x, mem slightly higher
DGEMM           Scale: 1.9x (MKL 5.1 not optimized for AMD)    Scale: 1.8x, 2x Opteron performance

24 Summary
Performance of dual-processor systems varies with memory architecture and processor speed.
AMD memory bandwidth scales by 2x when second processor is used (using local memory).
Xeon memory bandwidth is shared by second processor.
Xeon outperforms Opteron on serial compute-intensive codes (due to speed: 2.4GHz Xeon vs. 1.4GHz Opteron); but lead can be eliminated with dual-processor execution of (parallel) programs when memory bandwidths & synchronizations are involved.
