Hyper-Threading Performance with Intel CPUs for Linux SAP Deployment on ProLiant Servers. Session #3798. Hein van den Heuvel

Hyper-Threading Performance with Intel CPUs for Linux SAP Deployment on ProLiant Servers
Session #3798
Hein van den Heuvel, Performance Engineer, Hewlett-Packard
2004 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Topics
- Hyper-Threading intro
- Implementation details: Intel, IBM, Sun
- Linux implementation
- My own tests
- SAP (SD) benchmark
- Benchmark results
- Conclusions (18% improvement for SAP 2-tier)

Intel Hyper-Threading Overview
Hyper-Threading Technology is a form of simultaneous multithreading technology (SMT), where multiple threads of software applications can be run simultaneously on one processor. This is achieved by duplicating the architectural state on each processor, while sharing one set of processor execution resources. The architectural state tracks the flow of a program or thread, and the execution resources are the units on the processor that do the work: add, multiply, load, etc.
http://www.intel.com/business/bss/products/hyperthreading/server/ht_server.pdf
http://www.intel.com/technology/hyperthread/

Intel HT in a picture

Hyper-Threading Versus Dual Core (to be updated)
- HP (PA + IPF) opted for dual-core technology.
- Each processor has a full set of resources.
- The only limitation is the shared system connection.
- Allows for dense (8p, 4U, 4640), minimally constrained systems.
- Software licensing impact (Oracle!).
- Hyper-Threading effectiveness depends on the application.

IBM P5 SMT Summary
Enhanced Simultaneous Multi-Threading features
To improve SMT performance for various workload mixes and provide robust quality of service, POWER5 provides two features:
- Dynamic resource balancing: the objective of dynamic resource balancing is to ensure that the two threads executing on the same processor flow smoothly through the system. Depending on the situation, the POWER5 processor resource balancing logic has a different thread throttling mechanism.
- Adjustable thread priority: adjustable thread priority lets software determine when one thread should have a greater (or lesser) share of execution resources. The POWER5 supports eight software-controlled priority levels for each thread.
(http://www.redbooks.ibm.com/redpapers/pdfs/redp9117.pdf)
(http://www-1.ibm.com/servers/eserver/iseries/perfmgmt/pdf/smt.pdf)

IBM P5 Picture
A single die contains two identical processor cores, each supporting two logical threads. This architecture makes the chip appear as a four-way symmetric multiprocessor to the operating system. The POWER5 processor core has been designed to support both enhanced simultaneous multi-threading (SMT) and single-threaded (ST) operation modes.

IBM P5 Picture
In the p5-570 system, the POWER5 chip has been packaged with the L3 cache chip into a cost-effective Dual Chip Module (DCM) package. The storage structure for the POWER5 processor chip is a distributed memory architecture that provides high memory bandwidth. Each processor can address all memory and sees a single shared memory resource. As such, a single DCM and its associated L3 cache and memory are packaged on a single processor card. Access to memory behind another processor is accomplished through the fabric buses. The p5-570 supports up to two processor cards (each card is a 2-way) in any building block. Each processor card has a single DCM containing a POWER5 processor chip and a 36 MB L3 module.

IBM P5 Picture

SUN Summary
Starting with the ability to run two concurrent threads in the UltraSPARC IV processor, support for multi-threading at the chip level will eventually enable single processors to process tens of threads simultaneously. Each Sun Fire Enterprise server supports multiple chip multi-threaded UltraSPARC IV processors, with each processor capable of running up to two concurrent threads, providing up to twice the throughput of previous-generation UltraSPARC III processors. Up to 72 UltraSPARC IV processors are supported in the high-end Sun Fire E25K server, for support of up to 144 concurrent threads.
http://www.sun.com/servers/highend/whitepapers/ - Sun_Fire_Enterprise_Servers_Performance.pdf

SUN Scheduling Picture

SUN Picture

Linux Scheduler
- A Hyper-Threaded CPU is seen by the kernel as two different CPUs, so Linux does not have to be made explicitly aware of it.
- EPROM config.
- Linux breaks the "oldest idle" rule and forces immediate rescheduling when it discovers that a Hyper-Threaded CPU is running two idle processes.
- Oldest idle: among the idle processors that can execute a given runnable process, the scheduler selects the least recently active one.

LINUX /proc/cpuinfo example
SuSE SLES 8 (UnitedLinux 1.0) (i586), kernel 2.4.21-138-smp (0)

    # cat /proc/cpuinfo
    processor   : 0
    vendor_id   : GenuineIntel
    cpu family  : 15
    model       : 2
    model name  : Intel(R) Xeon(TM) MP CPU 3.00GHz
    stepping    : 6
    cpu MHz     : 2990.372
    cache size  : 512 KB
    physical id : 0
    :
    processor   : 1
    physical id : 0
    :
    processor   : 2
    physical id : 1
    processor   : 3
    physical id : 1
    :
    :
    processor   : 4
    physical id : 2
    :
    processor   : 5
    physical id : 2
    :
    processor   : 6
    physical id : 3
    :
    processor   : 7
    physical id : 3

LINUX /proc/cpuinfo example 2.4.19
SuSE SLES 8 (UnitedLinux 1.0) (i586), kernel 2.4.19-64GB-SMP (2)

    cat /proc/cpuinfo
    :
    flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm

and NO HT shows:

    processor   : 0
    physical id : 0
    processor   : 1
    physical id : 0
    processor   : 2
    physical id : 0
    processor   : 3
    physical id : 0
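The physical id field in the listings above is what ties logical processors to physical packages. The following is a minimal sketch in C, not taken from the slides: count_cpus() and struct cpu_counts are hypothetical names, and the code assumes physical ids in the range 0 to 63. It counts the processor entries and the distinct physical id values in a /proc/cpuinfo-style stream. Note that the NO HT listing reports physical id 0 for all four processors, so on such kernels the distinct-id count is only a hint and not a reliable package count.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical result type: totals gathered from a cpuinfo-style stream. */
struct cpu_counts {
    int logical;   /* number of "processor" entries */
    int packages;  /* number of distinct "physical id" values (0..63) */
};

/* Hypothetical helper: scans the stream line by line and tallies entries. */
struct cpu_counts count_cpus(FILE *f)
{
    struct cpu_counts c = {0, 0};
    unsigned long long seen = 0;  /* bitmap of physical ids already counted */
    char line[1024];
    int id;

    assert(f != NULL);
    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, "processor", 9) == 0) {
            c.logical++;
        } else if (sscanf(line, "physical id : %d", &id) == 1
                   && id >= 0 && id < 64) {
            if ((seen & (1ULL << id)) == 0) {
                seen |= 1ULL << id;
                c.packages++;
            }
        }
    }
    return c;
}
```

A caller would open /proc/cpuinfo with fopen(), check the result, pass the stream to count_cpus(), and compare logical against packages; a ratio of two suggests Hyper-Threading is active.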

LINUX CPU busy indication
- CPU time no longer behaves linearly.
- Example for a 4-CPU HT system: 8 virtual processors.
- Start 4 tasks, each 100% CPU busy.
- The scheduler will push those onto the 4 physical processors.
- top and vmstat will show 50% CPU busy (only half of the virtual processors are in use!).
- But adding 4 more tasks will NOT give 2x throughput.
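The arithmetic behind the 50% figure can be sketched as follows. This is an illustration only; reported_busy_pct() is a hypothetical name, and the model assumes each task keeps exactly one logical processor fully busy.

```c
#include <assert.h>

/* Hypothetical helper: percentage of CPU that tools averaging over all
 * logical processors would report when `tasks` fully busy tasks run on
 * `logical_cpus` logical processors. Capped at 100. */
int reported_busy_pct(int tasks, int logical_cpus)
{
    assert(logical_cpus > 0);
    if (tasks <= 0)
        return 0;
    if (tasks >= logical_cpus)
        return 100;
    return 100 * tasks / logical_cpus;
}
```

With 4 tasks and 8 logical processors this model gives 50, while the same 4 tasks with Hyper-Threading off (4 logical processors) give 100, even though the physical processors are equally occupied in both cases.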

What would benefit from Xeon HT?
Yes:
- Multi-threaded apps.
- Concurrent single-threaded apps.
- Needs some (low-level) stalls that would otherwise leave the CPU essentially unused.
- BTW: a CPU waiting for memory to come in, while really idle for the application, is 100% CPU busy for the OS.
No:
- Single-threaded application.
- CPU-intensive, L1/L2 cache-effective apps.

Linux processor binding tools

    int sched_setaffinity(pid_t pid, unsigned int len, unsigned long *new_mask_ptr)
    int sched_getaffinity(pid_t pid, unsigned int *user_len_ptr, unsigned long *user_mask_ptr)

- Tool by Robert Love: affinity-run.c
  http://www.kernel.org/pub/linux/kernel/people/rml/cpu-affinity/affinity-run.c
- My variation on that tool, with get/setpriority included.
- .h file in case your distribution has improperly exported syscalls.
- C File, H File
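For comparison, current glibc exposes these calls with a cpu_set_t mask rather than the raw unsigned long shown on the slide. The following is a minimal sketch against that newer interface, not the author's tool; pin_to_cpu() is a hypothetical name.

```c
#define _GNU_SOURCE          /* exposes cpu_set_t and the CPU_* macros */
#include <assert.h>
#include <sched.h>

/* Hypothetical helper: restricts the calling thread to one logical CPU.
 * Returns 0 on success and -1 on failure, as sched_setaffinity() does. */
int pin_to_cpu(int cpu)
{
    cpu_set_t mask;

    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(&mask);        /* start from an empty set */
    CPU_SET(cpu, &mask);    /* allow exactly one logical CPU */
    /* pid 0 means "the calling thread" */
    return sched_setaffinity(0, sizeof mask, &mask);
}
```

On the 8-processor system in the /proc/cpuinfo listing, logical processors 0 and 1 share physical id 0, so pinning two processes to CPUs 0 and 2 would place them on different physical processors. That numbering is specific to that listing and should be checked on any other system.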

My own tests
- Process 10 loops over a 100 MB array.
- 10,000 x 10,000, so the outer index jumps across physical memory pages.
- Measure run time in milliseconds; lower is better.
- 5 sub-tests:
  - Inner loop first, read-only.
  - Outer loop first, read-only.
  - INNER loop first, read + write.
  - OUTER loop first, read + write.
  - No cell change in loop, read-only.
- C File
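The two read-only traversal orders can be sketched as below. This is not the author's C file. It assumes one-byte cells (100 MB divided by 10,000 x 10,000 elements), a row-major layout, and that "inner loop first" means the contiguous index varies fastest; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed "inner loop first": the column index varies fastest, so accesses
 * walk memory contiguously. */
unsigned long sum_inner_first(const unsigned char *a, size_t n)
{
    unsigned long sum = 0;

    assert(a != NULL);
    for (size_t row = 0; row < n; row++)
        for (size_t col = 0; col < n; col++)
            sum += a[row * n + col];
    return sum;
}

/* Assumed "outer loop first": the row index varies fastest, so consecutive
 * accesses are n bytes apart. With n = 10,000 that stride exceeds a 4 KB
 * page, so every access lands on a different page. */
unsigned long sum_outer_first(const unsigned char *a, size_t n)
{
    unsigned long sum = 0;

    assert(a != NULL);
    for (size_t col = 0; col < n; col++)
        for (size_t row = 0; row < n; row++)
            sum += a[row * n + col];
    return sum;
}
```

Both functions visit the same cells and return the same sum; only the access order, and therefore the cache and TLB behavior, differs. A read + write variant would update each cell inside the same loops.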

1-12 streams on 4 CPUs, no HT (average time, less is better)
Chart series: none, inner, outer, INNER, OUTER.

1-12 streams on 4 CPUs, HT (average time, less is better)
Chart series: none, inner, outer, INNER, OUTER.

DL760 with 8 real CPUs (average time, less is better)
Chart series: none, inner, outer, INNER, OUTER.

1-12 streams on 4 CPUs, mixed (average time, less is better)
Chart series: none-ht, inner-ht, outer-ht, INNER-ht, OUTER-ht, none, inner, outer, INNER, OUTER.

Hyper-Threading and BAD affinity (average time, less is better)
Chart series: none, inner, outer, INNER, OUTER.

1-12 streams on 4 CPUs (relative)
Chart series: none, inner, outer, INNER, OUTER.

1-12 streams on 4 CPUs, HT (relative)
Chart series: none, inner, outer, INNER, OUTER.

Hyper-Threading and BAD affinity (relative)
Chart series: none, inner, outer, INNER, OUTER.

Inner-loop first test (inner)
Chart series: 4p-10-ht-even.txt, 4p-10-ht-mask.txt, 4p-10-ht-odd.txt, 4p-10-ht.txt, 4p-10-noht.txt. Groups: 4, 8, and 12 streams.

Outer-loop first test (outer)
Chart series: 4p-10-ht-even.txt, 4p-10-ht-mask.txt, 4p-10-ht-odd.txt, 4p-10-ht.txt, 4p-10-noht.txt. Groups: 4, 8, and 12 streams.

No array access test (none)
Chart series: 4p-10-ht-even.txt, 4p-10-ht-mask.txt, 4p-10-ht-odd.txt, 4p-10-ht.txt, 4p-10-noht.txt. Groups: 4, 8, and 12 streams.

Raw data from tests
The % columns hold the small values that the source interleaves with the timings; they appear to be the percentage difference within each HT / no-HT pair and are present only for some cells.

Procs none-ht  none   % inner-ht inner   % outer-ht outer   % INNER-ht INNER   % OUTER-ht OUTER
    1    4685  4684         9357  9363        17241 17169        15365 15387        25719 26922
    2    4686  4684         9362  9365        17225 17163        16386 16172        26647 27303
    3    4685  4684         9358  9363        17243 17467        16405 16179        28268 27263
    4    4690  4709         9362  9398        24127 17303  40    16498 16376        28248 28386
    5    6561  5697  16    16234 11070  47    22123 20600   8    19370 19271        43312 32998
    6    7862  6132  29    15093 12188  24    26422 22396  18    19031 21293  10    66050 36426
    7    8647  7397  17    16741 14568  15    28906 26626   9    20199 25256  20    97666 43635
    8    9336  9312        17975 18587   3    30993 34305   9    20389 32615  37   128282 56479
    9    9801  9833        18689 19571   4    32239 35939  10    22002 34129  35   134646 59248
   10   10561 10716        19284 20968   8    35280 38625   8    23587 36634  35   133578 63651
   11   11252 11918   5    21481 23853   9    37648 44695  15    25051 41777  40   157133 72568
   12   12186 14025  13    23383 27890  16    40542 51671  21    26521 48869  45   166529 84976

SAP Architecture overview
- Each application instance maintains its own task (dispatch) queue: DIAlogue, UPDate, ENQue, SPOol, etc.
- Multiple instances are just fine, even on a single box: less contention on local resources (buffers).
- Work from the dispatch queue goes to the first free worker (disp+work), starting at the first worker.
- The first worker is idle only when waiting for the database (Oracle) or when there is simply no more new work.
- Used affinity to ensure that critical processes did not share the same physical processor (Update!).

SAP SD Benchmark Overview
- The official measurement is SAPS, which corresponds to a certain throughput (dialog steps per hour).
- The practical measurement is in users.
- Each user executes a fixed number (20) of dialog steps in a loop, with a 10-second think time and a 2.0-second response time requirement.
http://www.sap.com/benchmark/bm_description.htm#sd
http://www.sap.com/benchmark/sd2tier.asp
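The link between users and throughput can be approximated with the usual closed-loop relation: each user completes one dialog step per think time plus response time. The sketch below is an approximation under that assumption, not SAP's official SAPS calculation, and dialog_steps_per_hour() is a hypothetical name.

```c
#include <assert.h>

/* Hypothetical helper: approximate dialog steps per hour for `users`
 * simulated users, each spending think_s + resp_s seconds per step. */
double dialog_steps_per_hour(int users, double think_s, double resp_s)
{
    assert(think_s + resp_s > 0.0);
    return users * 3600.0 / (think_s + resp_s);
}
```

Because the 10-second think time dominates the cycle, throughput under this model is nearly proportional to the number of users as long as the response time stays under the 2.0-second limit.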

Response time knee (mock-up)
Chart series: Response Time, CPU Busy.

And now finally for the real work we did

SAP SD Result Matrix

Hyper-Threading  Users  Resp time  CPU time
On               550    0.3        80%
On               650    1.99       99%
Off              550    1.99       99%
Off              650    X          100%
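The 18% figure quoted in the topics and conclusions appears to follow from this matrix: at the same 1.99 response time, the system sustains 650 users with Hyper-Threading on and 550 with it off. A small sketch of that arithmetic, with user_gain_pct() as a hypothetical name:

```c
#include <assert.h>

/* Hypothetical helper: percentage gain in supported users at the same
 * response-time limit. */
double user_gain_pct(int users_ht_on, int users_ht_off)
{
    assert(users_ht_off > 0);
    return 100.0 * (users_ht_on - users_ht_off) / users_ht_off;
}
```

For 650 versus 550 users this is 100 x 100 / 550, a little over 18%.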

System Under Test: DL560
- DL560
- 3.0 GHz processor
- 4 GB memory
- SAP 4.7 (620 kernel)
- Oracle as DB

Benchmark Conclusions
- Hyper-Threading gives an 18% performance boost.
- Must take extra steps to exploit it fully.
- Why not enable it? Just do it.
- Not dual core.

Resources
- Linux
- HP
- Intel
- IBM
