Exploring the Effects of Hyperthreading on Scientific Applications

Size: px

Start display at page:

Download "Exploring the Effects of Hyperthreading on Scientific Applications"

Isabel Cobb
5 years ago
Views:

1 Exploring the Effects of Hyperthreading on Scientific Applications by Kent Milfeld edu Kent Milfeld, Chona Guiang, Avijit Purkayastha, Jay Boisseau TEXAS ADVANCED COMPUTING CENTER

2 Linux Cluster 3.7 TFLOPS Cray-Dell Machine (to be installed in July) 3 Node Linux Cluster CrayRx (SDSC Rocks + Cray System Admin) Dell Service Dell 175 Xeon Dual-Processor Nodes 3.GHz processors dual channel 266MHz DDRAM (1.GB/cpu) Myrinet CLOS configuration (2Gb/sec switch, D adapters) 2

3 Linux Cluster * Server Switch interne tt PC+ PC PC PC PC PC+ * File Server Switch 3

4 Node Dell PowerEdge 265 / 175 2U/1U Processors: Two 3. GHz Intel Xeon Processors Chipset: ServerWorks Grand Champ LE chipset Memory: 2GB dual channel (266 MHz DDR SDRAM) FSB: 533 MHz (Front Side Bus) Cache: 512KB L2 Advanced Transfer Cache Disk: Dual-channel integrated Ultra3 (Ultra16) SCSI Adaptec AIC-7899 (16Mb/s) controller 4

5 Hyper-Threading Technology 1 OR 2 1 AND 2 Thread Process Task 1 1 Process Thread Task 1 Thread Process Task 2 2 STATE Process Thread Task 2 STATE STATE 1 Single- State Processor Hyper- Threaded Processor Non-HT HT 5

6 Hyper-Threading Technology Hyper-Threading is an implementation of an architectural technique called Simultaneous MultiThreading (SMT) Exploits Instruction Level Parallelism on a Single CPU Why? Comm. Server Workload efficiency is about 67% Performance Benefits from independent processes/threads Any time there is a stall for resources on a thread Any time disparate operations are used Commercial: Software (Algorithm & Code Modification) Multithreading HPC: Already has large number of parallel applications What about the SHARING? 6

7 Intel Xeon Seven-way superscalar Architecture Able to pipeline 128 instructions Intel Solution: Efficiency instead of Redundancy With only 5% more real estate (die area) HT 2 Logical processors / CPU Share most of the processor resources Two processes and/or tasks execute in logical units concurrently 7

8 Microprocessor Architecture System Bus Memory Interface Bus Frequent Path Less Frequent Path L3 Cache (Optional, Server Products) L2 Cache 8-way L1 Cache 4-way Fetch/ Decode Front End Trace Cache Execution out-of-order core Retirement BTB/Branch Prediction Branch History Update 8

9 Pipeline Depth 1 Fetch 2 Fetch Pentium III processor misprediction pipeline 3 Decode 4 Decode 5 Decode Rename ROB Rd Rdy/Sch Dispatch 1 Exec TC FetchTC FetchDriveAlloc Rename Que Sch Sch Sch Disp Disp FR FR Ex Flgs BrCkDrive Pentium 4 processor misprediction pipeline +Physical (renaming) Registers: 128 Integer and 128 Floating Point 9

10 Redesigned Pipeline for HT L2 Access Queue Decode Queue Trace Cache Queue IP Front End ITLB DECODE L2 Access Out-of-Order Execution Engine Rename Allocate Reg. Alias Table Queue Sched Reg Read Exec. L1 Cache Store Buffer Reg. Write Retire Allocate Registers Registers Re-Order Buffer 8-12 deep D-Cache 1

11 Memory Measurements & Application Scaling with HT Memory Characterization for a single processor Latency Bandwidth Memory Characterization for HT Worst Use of Memory (non-strided Gather/Scatter) Best Use of Memory (sequential access) Applications (MxM( MxM,, MD, LP) 11

12 Measuring Latency Memory Latency I1 = IA(1) DO I = 2,N I2 = IA(I1) I1 = I2 END DO 12

13 Memory Latency Measurement Single and Dual Processor Latency Single CPU Dual CPU Latency (Clock Periods) Size (Kbytes) Dual CPU

14 Bandwidth (with 2 Streams) Measuring Bandwidth DO I = 1,N S = S + A(I) T = T + B(I) END DO 14

15 Memory Bandwidth Memory Read Bandwidth (MB/sec) Size (Kbytes) 2.3GB/sec Total 15

16 Latency (Clock Periods) Latency (Clock Periods) HT Memory Latency Measurement Single and Dual Processor Latency HyperThreading Latency, 3 tasks on 2 processors Single CPU Dual LP CPU Dual LP1 CPU1 LP Size (Kbytes) 15 2 Size (Kbytes)

17 HT Memory Latency Measurement HyperThreading Latency, 4 tasks on 2 processors Latency (Clock Periods) Size (Kbytes) LP LP1 LP2 LP

18 Intro Xeon Xeon Memory Memory 2.1GB/sec 2.1GB/sec 4.3GB/sec North Bridge Dell Xeons (July, 23) System used In following In Analyses Memory Memory Xeon 1.6GB/sec 1.6GB/sec 3.2GB/sec North Bridge Xeon Experimental Cluster 18

19 HT Memory Bandwidth Measurement Memory Read (distributed memory model) Memory Read (distributed memory model) Bandwidth (MB/sec) CPU CPU1 2.2GB/sec Total Bandwidth (MB/sec) Size (Kbytes) Size (Kbytes) LP LP1 LP2 2.5GB/sec Total 19

20 HT Memory Bandwidth Measurement Memory Read (distributed memory model) Bandwidth (MB/sec) LP LP1 LP3 LP Size (Kbytes) 1.73GB/sec Total 2

21 HT Memory Bandwidth Measurement Bandwidth (MB/sec) Memory Read (shared memory model) GB/sec Total 3.6GB/sec Total Cache Friendly Size (Kbytes) LP LP1 LP2 LP3 21

22 Parallel Matrix-Matrix Multiply Single-Matrix Matrix Multiply Performance (GFLOPS) thread task 2 threads tasks 3 threads tasks 4 threads tasks Size (matrix order) 22

23 Molecular Dynamics Molecular dynamics simulation of a 256 particle argon lattice One picosecond Verlet algorithm Threading 1 Thread 2 Threads 3 Threads HT 4 Threads HT Time(sec): Scaling:

24 Monte Carlo Lithography Simulation 1**7 Monte Carlo iterations Each thread outputs the lattice configuration and various Monte Carlo statistics to disk. Threading 1 Thread 2 Threads 3 Threads HT 4 Threads HT Time(sec): Scaling:

25 Conclusions HT Performance Multiple independent Processes/Tasks/Threads are necessary. Bandwidth-limited and/or Compute Intensive codes may see degradation of performance. Non-extreme Code may see performance enhancements Cache sharing (synchronization may be required) Non-Streaming memory access I/O Disparate operations (e.g., integer mult.. & float mult.) 25

Application Performance on Dual Processor Cluster Nodes

Application Performance on Dual Processor Cluster Nodes by Kent Milfeld milfeld@tacc.utexas.edu edu Avijit Purkayastha, Kent Milfeld, Chona Guiang, Jay Boisseau TEXAS ADVANCED COMPUTING CENTER Thanks Newisys