Linux Performance on IBM zEnterprise 196


Transcription:

Martin Kammerer (martin.kammerer@de.ibm.com), 9/27/10
Linux Performance on IBM zEnterprise 196
Visit us at http://www.ibm.com/developerworks/linux/linux390/perf/index.html

Trademarks

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. A current list of IBM trademarks is available on the Web at "Copyright and trademark information", www.ibm.com/legal/copytrade.shtml. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Java and all Java-based trademarks and logos are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. Other product and service names might be trademarks of IBM or other companies.

Agenda
- zEnterprise 196 design
- Linux performance comparison: z196 and z10
- SPECCPU case study

z196 Continues the CMOS Mainframe Heritage

(Chart: CPU clock frequency by machine generation)
- 1997: G4, 300 MHz
- 1998: G5, 420 MHz
- 1999: G6, 550 MHz
- 2000: z900, 770 MHz
- 2003: z990, 1.2 GHz
- 2005: z9 EC, 1.7 GHz
- 2008: z10 EC, 4.4 GHz
- 2010: z196, 5.2 GHz

z196 Water Cooled, Under the Covers (Model M66 or M80), Front View
- Internal batteries (optional)
- Power supplies
- Ethernet cables for the internal system LAN, connecting Flexible Service Processor (FSP) cage controller cards
- Support Elements
- I/O cage, I/O drawers
- Processor books, memory, MBA and HCA cards
- InfiniBand I/O interconnects
- 2 x water cooling units

z196 Processor Design Basics

Processor design:
- CPU (core): cycle time; pipeline, execution order; branch prediction; hardware vs. millicode

Memory subsystem:
- High-speed buffers (caches): on chip, on book; private, shared
- Buses: number, bandwidth
- Latency: distance, speed of light

(Diagram: logical view of a single book — each CPU has a private cache; a shared cache sits between the private caches and memory.)

z196 vs. z10 Hardware Comparison

z10 EC:
- CPU: 4.4 GHz
- Caches: L1 private, 64 KB instruction / 128 KB data; L1.5 private, 3 MB; L2 shared, 48 MB per book
- Book interconnect: star

z196:
- CPU: 5.2 GHz, out-of-order execution
- Caches: L1 private, 64 KB instruction / 128 KB data; L2 private, 1.5 MB; L3 shared, 24 MB per chip; L4 shared, 192 MB per book
- Book interconnect: star

z196 Multi-Chip Module (MCM) Packaging
- 96 mm x 96 mm MCM: 103 glass ceramic layers, 8 chip sites, 7356 LGA connections; 20- and 24-way MCMs; maximum power used by the MCM is 1800 W
- CMOS 12s chip technology: PU, SC and S chips in 45 nm
- 6 PU chips per MCM, each with up to 4 cores; one memory controller (MC) per PU chip; 1.4 billion transistors per PU chip; L1 and L2 caches per PU core, L3 cache shared by the 4 PUs of a chip
- 2 Storage Control (SC) chips: 24.427 mm x 19.604 mm, 1.5 billion transistors per SC chip; 96 MB L4 cache per SC chip (192 MB per book); L4 access to/from other MCMs
- 4 SEEPROM (S) chips: 2 active and 2 redundant, holding product data for the MCM, chips and other engineering information
- Clock functions distributed across PU and SC chips; the master Time-of-Day (TOD) function is on the SC

z196 PU Core

Each core is a superscalar, out-of-order processor with these characteristics:
- Six execution units: 2 fixed point (integer), 2 load/store, 1 binary floating point, 1 decimal floating point
- Up to three instructions decoded per cycle (vs. 2 on z10)
- 211 complex instructions cracked into multiple internal operations; 246 of the most complex z/Architecture instructions are implemented via millicode
- Up to five instructions/operations executed per cycle (vs. 2 on z10)
- Execution can occur out of (program) order; memory address generation and memory accesses can also occur out of (program) order; special circuitry makes execution and memory accesses appear in order to software
- Each core has 3 private caches: a 64 KB first-level instruction cache, a 128 KB first-level data cache, and a 1.5 MB L2 cache containing both instructions and data

z196 New Instruction Set Architecture

Recompiled code/apps get further performance gains through 110+ new instructions, e.g.:
- High-Word Facility (30 new instructions): independent addressing of the high word of 64-bit GPRs; effectively provides the compiler/software with 16 additional 32-bit registers
- Interlocked-Access Facility (12 new instructions): interlocked (atomic) load, value update and store in a single instruction; immediate exploitation by Java
- Load/Store-on-Condition Facility (6 new instructions): load or store conditionally executed based on the condition code; dramatic improvement in certain codes with highly unpredictable branches
- Distinct-Operands Facility (22 new instructions): independent specification of the result register (different from either source register); reduces register value copying
- Population-Count Facility (1 new instruction): hardware implementation of bit counting, ~5x faster than prior software implementations
- Integer to/from floating point converts (21 new instructions)

z196 Out-of-Order (OOO) Value

OOO yields significant performance benefit for compute-intensive apps through:
- Re-ordering instruction execution: later (younger) instructions can execute ahead of an older stalled instruction
- Re-ordering storage accesses and parallel storage accesses

OOO maintains good performance growth for traditional apps.

(Diagram: five instructions with an L1 miss — the out-of-order core overlaps execution and storage accesses that the in-order core serializes, shortening total time.)

z196 Capacity/Performance Compared to z10 EC

LSPR mixed workload average, multi-image for z/OS 1.11, with HiperDispatch active on z196:

  Configuration                   z196 : z10 EC ratio   z196 PCI value
  Uni-processor                   1.33                    1202
  16-way z196 to 16-way z10 EC    1.38                   14371
  32-way z196 to 32-way z10 EC    1.38                   25241
  64-way z196 to 64-way z10 EC    1.41                   44953
  80-way z196 to 64-way z10 EC    1.64                   52286

PCI = Processor Capacity Index

Agenda
- zEnterprise 196 design
- Linux performance comparison: z196 and z10
- SPECCPU case study

Environment

Hardware:
- z196: 2817-718 M49
- z10: 2097-726 E26
- z9: 2094-718 S18

Linux:
- Linux distribution with recent kernel (SLES11 SP1: 2.6.32.12)
- Linux in LPAR, shared processors, other LPARs deactivated

File Server Benchmark Description

dbench 3:
- Emulation of the Netbench benchmark
- Generates file system load on the Linux VFS
- Issues the same I/O calls as the smbd server in Samba (without networking calls)
- Mixed file-operations workload for each process: create, write, read, append, delete
- Measures throughput of transferred data

Configuration:
- 2 GB memory, mainly memory operations
- Scaling processors: 1, 2, 4, 8, 16
- For each processor configuration, scaling processes: 1, 4, 8, 12, 16, 20, 26, 32, 40

dbench: SLES11 Improves on Average by 40%

(Chart: dbench throughput [MB/s], scale 0-7000, versus number of processes (1-40), for 1, 2, 4, 8 and 16 CPUs on z10 and z196.)

z10 versus z9

Improvement of z10 versus z9, measured with 8 CPUs: average improvement is 50%.

(Chart: throughput [MB/s] versus number of processes (1-40) for z9 and z10.)

Kernel Benchmark Description

lmbench 3:
- Suite of operating system microbenchmarks
- Focuses on interactions between the operating system and the hardware architecture
- Bandwidth measurements for memory and file access, operations and movement
- Latency measurements for process handling and communication
- Latency measurements for basic system calls

Configuration:
- 2 GB memory
- 4 processors

lmbench: Most Benefits in the L3 and L4 Cache

All numbers are the deviation of z196 versus z10 [%], SLES11:

  simple syscall                 -40
  simple read/write                0
  select of file descriptors      30
  signal handler                 -25
  process fork                    30

  L1 / L2 / L3 / L4 cache:
  libc bcopy aligned              10 / 10 / 100 / 300
  libc bcopy unaligned            15 /  0 /   0 /  40
  memory bzero                    35 / 90 / 300 / 800
  memory partial read             45 / 40 / 130 / 500
  memory partial read/write       20 / 20 /  20 / 150
  memory partial write            80 / 30 /  60 / 300
  memory read                     10 / 30 /  50 / 300
  memory write                    50 / 30 /  30 / 180
  mmap read                       50 / 35 /  85 / 300
  mmap read open2close            30 /  0 /  15 / 150
  read                            10 / 25 /  80 / 300
  read open2close                 15 / 30 /  85 / 300
  unrolled bcopy unaligned       100 / 75 /  75 / 200

z10 versus z9
- In many cases improvements of 25%, peaks at 100%
- Some results for context switch tests decreased

z10 improvement versus z9 [%]:

  simple syscall                  30
  simple read/write               30 / 40
  select of file descriptors      60
  signal handler overhead         50
  process fork                    30

  L1 / L1.5 / L2 cache / main memory:
  libc bcopy aligned             130 / 300 /  10 / -10
  libc bcopy unaligned           100 / -20 / -30 / -20
  memory bzero                    80 /  50 / -40 / -60
  memory partial read            150 / 250 /  30 /  15
  memory partial read/write      100 / 270 / 130 /  15
  memory partial write           100 / 270 / 130 /  15
  memory read                    120 / 200 / 100 /  10
  memory write                    40 /  80 /  60 /  15
  mmap read                       60 / 200 /  50 /   5
  mmap read open2close            50 /  70 /  40 /   0
  read                            80 / 130 /  60 /   0
  read open2close                 80 / 130 /  60 /   5
  unrolled bcopy unaligned        70 / 120 /  70 /  60

Java Benchmark Description

Java server benchmark:
- Evaluates the performance of server-side Java
- Exercises the Java Virtual Machine (JVM), the Just-In-Time compiler (JIT), garbage collection, and multiple threads
- Simulates real-world applications, including XML processing and floating point operations
- Can be used to measure performance of CPUs, memory hierarchy and scalability

Configurations:
- 8 processors, 2 GB memory, 1 JVM
- 16 processors, 8 GB memory, 4 JVMs
- Java version 6

SpecJBB

Overall throughput improvement: close to 45%.

  Throughput improvement [%], SLES11 SP1:
  44   (2 GB, 8 CPUs)
  45   (8 GB, 16 CPUs, 4 JVMs)

(Chart: SLES11 throughput of z10 versus z196 over 1 to 20 warehouses.)

SpecJBB Throughput Compared to Earlier Generations

In 2008 the message was 60% more throughput with z10 than z9.

(Chart: throughput at 4, 8 and 16 CPUs for the 2008 31-bit JVM on z9, the 2008 31-bit JVM on z10, and the 2010 64-bit JVM on z196.)

CPU-Intensive Benchmark Suite
- Stresses a system's processor, memory subsystem and compiler
- Workloads developed from real user applications
- Exercises integer and floating point in C, C++ and Fortran programs
- Can be used to evaluate compile options
- Can be used to optimize the compiler's code generation for a given target system

Configuration:
- 1 processor
- 2 GB memory
- Executing one test case at a time

SpecCPU

Linux: internal driver, kernel 2.6.29, gcc 4.5, glibc 2.9.3
- SpecFP suite improves by 86%
- SpecINT suite improves by 76%

(Charts: per-benchmark improvements [%] of z196 (-march=z196) versus z10 (-march=z10). SpecFP: 410.bwaves, 416.gamess, 433.milc, 434.zeusmp, 435.gromacs, 436.cactusADM, 437.leslie3d, 444.namd, 447.dealII, 450.soplex, 453.povray, 454.calculix, 459.GemsFDTD, 465.tonto, 470.lbm, 481.wrf, 482.sphinx3. SpecINT: 400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 456.hmmer, 458.sjeng, 462.libquantum, 464.h264ref, 471.omnetpp, 473.astar, 483.xalancbmk.)

z10 versus z9

Overall improvement with z10 versus z9: 90%.

(Chart: z10 runtime improvements versus z9 [%] for 168.wupwise, 171.swim, 172.mgrid, 173.applu, 177.mesa, 178.galgel, 179.art, 183.equake, 187.facerec, 188.ammp, 189.lucas, 191.fma3d, 200.sixtrack, 301.apsi.)

Agenda
- zEnterprise 196 design
- Linux performance comparison: z196 and z10
- SPECCPU case study

SPECINT2006, 456.hmmer
- Searches for patterns in a gene sequence database
- Uses profile Hidden Markov Models
- Programming language: C (57 source files)

Function profile:

  %Total   Cumulative   Function
  95.6     95.6         P7Viterbi
   1.9     97.5         sre_random
   1.8     99.3         FChoose

456.hmmer, hot loop:

    for (k = 1; k <= M; k++) {
        mc[k] = mpp[k-1] + tpmm[k-1];
        if ((sc = ip[k-1]  + tpim[k-1]) > mc[k]) mc[k] = sc;
        if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc;
        if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;
        mc[k] += ms[k];
        if (mc[k] < -INFTY) mc[k] = -INFTY;

        dc[k] = dc[k-1] + tpdd[k-1];
        if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
        if (dc[k] < -INFTY) dc[k] = -INFTY;

        if (k < M) {
            ic[k] = mpp[k] + tpmi[k];
            if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
            ic[k] += is[k];
            if (ic[k] < -INFTY) ic[k] = -INFTY;
        }
    }

456.hmmer, hot loop, code generated for z10:

    if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;

GCC generates:

    1834          LR   GPR3,GPR4
    1846          LR   GPR4,GPR6
    E380F0B00004  LG   GPR8,176(,GPR15)
    5A481004      A    GPR4,4(GPR8,GPR1)   <- problem: address generation interlock
    1943          CR   GPR4,GPR3
    A7240088      BHRC *+272
    ...
    50412004      ST   GPR4,4(GPR1,GPR2)
    A7F4FF79      BRC  *-270               <- problem: branch mispredict

LOCR, Load on Condition

    LOCR R1,R2,M3                        [RRF-c]

    'B9F2'      M3    ////    R1    R2
    0           16    20      24    28    31

The second operand is placed unchanged at the first operand location if the condition code has one of the values specified by M3; otherwise, the first operand remains unchanged.

456.hmmer, hot loop, code generated for z196:

    if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;

GCC generates:

    5AA06000  A       GPR10,0(,GPR6)
    191A      CR      GPR1,GPR10
    B9F2501A  LOCRNHE GPR1,GPR10
    50102004  ST      GPR1,4(,GPR2)

~70% more throughput for this code compared to the z10 code, both running on z196.

Summary

z196 performance advantages:
- Higher clock speed
- More cache
- OOO processing
- New instructions
- More processors, processor scalability

Performance gains with Linux workloads:
- Up to 45% for Java
- Up to 86% for single-threaded CPU-intensive workloads
- 40% when scaling processors and/or processes
- SLES11 SP1 improvements

Thank you. Questions?