System z13: First Experiences and Capacity Planning Considerations

1 System z13: First Experiences and Capacity Planning Considerations. Robert Vaupel, IBM R&D, Germany. Many thanks to Martin Recktenwald, Matthias Bangert and Alain Maneville for contributing information to this presentation. Wednesday 04/11/2015, Session BK

2 Content. z13: Installations. Planning for z13: preparation for the migration or upgrade. CPU Measurement Facility and SMF 113: what is it good for and what can be done with it? Processor topology. What else to consider? Service levels; Store Into Instruction Stream code (assembler). Appendix: Resources; the real value of z13 comes with exploiting its features.

3 z13: On one page. The newest System z processor, with: much more memory than previous systems (up to 10 TB); a new processor and mainboard design (SCM versus MCM); more cache; improved out-of-order processing; completely new features: SIMD and SMT. The real value of z13 comes with exploiting its features.

4 z13: Installations. More than 400 so far. Most installations, migrations and upgrades run smoothly without any problems. What we know: Most installations are within -1% to +8% of the zPCR expectation. The expectation is an ITR that is ~10% better than on zEC12. This is in fact not much compared to previous upgrades, so it is worth planning the upgrade carefully. So far ~3% of the installations have experienced some problems with the migration. This is less than for previous significant processor design changes (for example z10), but it is enough to talk about some planning considerations.

5 z13: Planning: To Start with. A successful migration starts with thorough planning. Start with zPCR (https://www-03.ibm.com/support/techdocs/atsmastr...): It is based on IBM-provided LSPR data and gives you the right size for your CEC.

6 z13 Planning: What else can be done? Plan your z13 LPARs carefully. Why is this important? We will see that the LPAR layout matters. That means the LPAR weight, the number of logical processors, and the resulting numbers of vertical high, medium and low processors (VH, VM, VL). A good starting point is the LPAR Design tool, available from the WLM homepage. What does it do? It allows you to lay out your LPARs, examine the VH, VM and VL processors, and it provides guidelines on how to set up the LPARs efficiently.

7 Preparation for z13: CPU MF. Use the CPU Measurement Facility (Hardware Instrumentation Sampling) to obtain insight into the processor and cache architecture. Value of the CPU Measurement Facility (CPU MF): It is the recommended methodology for successful z Systems processor capacity planning; for this it is needed on the "before" processor to determine the LSPR workload. It validates the achieved z Systems processor performance; for this it is needed on both the "before" and "after" processors. It provides insights into workload patterns, behavior, new features and functions; for this it should run continuously on all LPARs. Capturing CPU MF data is an industry best practice.

8 Preparation for z13: CPU MF. Introduced with the z10 and available on later processors. A facility that provides hardware instrumentation data for production systems. Two major components: Counters for capacity planning (cache and memory hierarchy information; supported SCPs include z/OS and z/VM) and Sampling for detailed, module-level analysis. z/OS: HIS started task; data is gathered on an LPAR basis and written to SMF 113 records. z/VM: Monitor records; data is gathered on an LPAR basis with all guests aggregated, and written to the new Domain 5 (Processor) Record 13 (CPU MF Counters). Minimal overhead.

9 Preparation for z13: Why is CPU MF Important? z13 provides lower single-thread improvements than previous processor changes, for example zEC12 versus z196. z13 shows more variability in capacity improvement. Capacity projections and expectations must therefore be as accurate as possible. Annotation: RNI (Relative Nest Intensity) is a metric which describes the access to the various cache levels of the processor architecture.

10 Preparation for z13: Enabling CPU MF. Overview: Configure the System z server to collect CPU MF data (image profile on the HMC/SE). Set up the HIS procedure in SYS1.PROCLIB (if not already available). Specify where to store the HIS output file (this setup step is also required for COUNTER data). Set up the SMFPRMxx member to collect SMF 113 records. Start HIS, start measurements, and synchronize with SMF (and RMF). In detail: Enable CPU MF in the LPAR security settings; the overhead for counters and extended counters is negligible. Set up the HIS address space and start it. Recommended: F HIS,B,TT='Text',PATH='/his/',CTRONLY,CTR=(B,E),SI=SYNC or F HIS,B,TT='Text',PATH='/his/',CTRONLY,CTR=ALL,SI=SYNC. Set up SMF 113 recording in SMFPRMxx. SMF 113 has two subtypes: 1 (delta counters) and 2 (total counters); our examples always use subtype 2. SMF 113 records took 1.2% of the space used by SMF 70 and SMF 72 records. Recommendation: Set it up once and run it continuously. Further information: CPU MF Webinar Replays and Presentations; z/OS CPU MF Detailed Instructions Step by Step Guide; z/VM Using CPU Measurement Facility Host Counters.

11 What can be done with SMF 113 data? First, it allows you to evaluate your LPARs before and after the migration: Collect data from comparable time periods before and after the migration, and compare time periods in which the same workload executes, for example the prime shift from 08:00 to 12:00 or from 08:00 to 16:00. Metrics: CPI = cycles per instruction; MIPS = million instructions per second = clock frequency in MHz / CPI (that is, GHz x 1000 / CPI). Annotation: This MIPS value is the average MIPS value per processor, whereas the LSPR tables show the MIPS value for the CEC (all regular processors). Example from a real installation (zEC12 to z13): Data was collected before and after the migration for comparable weeks, a comparable time frame and a comparable load. Notice: There are always fluctuations, so the difference is never a single number. In this case the improvement was between 4.4% and 13.1% in favor of the z13, which is within expectations (on average 9% in this case, versus an expectation of ~10%).
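The two metrics on this slide can be worked through in a few lines. The following is a minimal sketch, not the WLM reporting tool: the function names and the CPI values are hypothetical, and the clock rates (5.5 GHz for zEC12, 5.0 GHz for z13) are the published frequencies rather than values taken from the slide.

```python
def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction for one measurement interval."""
    if instructions <= 0:
        raise ValueError("instruction count must be positive")
    return cycles / instructions


def mips_per_processor(clock_ghz: float, cpi_value: float) -> float:
    """Average MIPS of one processor: clock in MHz divided by CPI."""
    if cpi_value <= 0:
        raise ValueError("CPI must be positive")
    return clock_ghz * 1000.0 / cpi_value


def improvement(before: float, after: float) -> float:
    """Relative improvement of 'after' over 'before' (0.10 means 10%)."""
    return (after - before) / before


# Hypothetical CPI values for the same workload and the same shift.
ZEC12_GHZ, Z13_GHZ = 5.5, 5.0            # published clock rates (assumption)
mips_before = mips_per_processor(ZEC12_GHZ, 4.40)
mips_after = mips_per_processor(Z13_GHZ, 3.60)

print(f"zEC12: {mips_before:.0f} MIPS per processor")
print(f"z13:   {mips_after:.0f} MIPS per processor")
print(f"change: {improvement(mips_before, mips_after):+.1%}")
```

Because the z13 runs at a lower clock rate than the zEC12, the CPI has to drop by more than the clock rate does before the per-processor MIPS value improves, which is why both values belong in the comparison.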

12 What can be done with SMF 113 data? Second, it is possible to analyze in detail how the processors access data and cache, for example where the data or instructions come from when a Level 1 miss occurs (why is this important? See the next page). CPI: Estimated Instruction Complexity tells whether the instruction mix is uniform and how much faster (or slower) it executes compared to other models. CPI: From Finite Cache/Memory gives the average number of cycles needed to acquire data or instructions from the cache hierarchy and memory. In the example shown on the slide: the processor of the z13 is on average 18% faster; access to data and instructions from cache and memory requires about the same number of cycles; on average the z13 is 9% faster for the three compared days.
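A small worked example may help to see why an 18% gain in the processor part can shrink to a single-digit gain overall. This sketch assumes that total CPI is the sum of the two components named on the slide (the usual CPU MF breakdown). All numbers are hypothetical and were chosen only to mirror the percentages above; the sketch compares cycle counts only and ignores the difference in clock frequency between the two machines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpiBreakdown:
    """Total CPI split into the two components named on the slide."""
    instruction_complexity: float   # cycles spent in the core itself
    finite_cache: float             # cycles waiting for cache and memory

    @property
    def total(self) -> float:
        return self.instruction_complexity + self.finite_cache


def reduction(before: float, after: float) -> float:
    """Fraction by which a value dropped (0.18 means 18% fewer cycles)."""
    return 1.0 - after / before


# Hypothetical values: the core part improves, the cache part does not.
before = CpiBreakdown(instruction_complexity=2.00, finite_cache=2.00)
after = CpiBreakdown(instruction_complexity=1.64, finite_cache=2.00)

core_gain = reduction(before.instruction_complexity,
                      after.instruction_complexity)
total_gain = reduction(before.total, after.total)

print(f"core component: {core_gain:.0%} fewer cycles")
print(f"total CPI:      {total_gain:.0%} fewer cycles")
```

The larger the finite cache/memory share of the total CPI, the less of the core improvement reaches the workload, which is one reason why cache-intensive workloads show more variability after a migration.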

13 SMF 113 data: Assess what happens if something goes wrong. Extreme example: the effect if data needs to be fetched from a different drawer. [Chart not reproduced; surviving labels: 18%, 13%]

14 z13: Some discussion points. SMF 113 data makes it possible to assess the efficiency of the LPARs before and after the migration; make sure the time periods can really be compared to each other. SMF 113 data also helps to identify problems. The question is what to look at when setting up LPARs on z13. Avoid splitting LPARs between drawers: This can be avoided in more than 99% of all cases for VH processors. Important: Make sure the memory is distributed between the drawers so that no unnecessary split occurs. Do not define an excessive number of vertical low processors: VL processors reduce the efficiency of VM processors by using some of their share. VL (and VM) processors may look for a PCP on a different chip, node and sometimes even drawer, so they often need to access data and instructions from remote cache structures. They also cause VH processors to access data and instructions from remote cache structures, and thus reduce the efficiency of all logical processors. Recommendations: Define only the number of logical processors which are really needed to handle peaks. Do not define an excessive number of VL processors. Define the weight so that, during the period when important workload executes, everything can be handled by VH processors at least 80% of the time.
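The last recommendation can be checked against measured data. The following is a minimal sketch under stated assumptions: the entitled share is taken as weight divided by the total weight of all active LPARs, multiplied by the number of physical processors in the pool; the demand samples (in units of physical processors per interval) and all names are hypothetical; and the number of VH processors is taken from the LPAR Design tool or a topology report rather than computed here.

```python
from typing import Sequence


def entitled_share(weight: int, total_weight: int, physical_cps: int) -> float:
    """LPAR share expressed in physical processors (PCPs)."""
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    return weight * physical_cps / total_weight


def vh_coverage(demand_in_cps: Sequence[float], vh_count: int) -> float:
    """Fraction of intervals whose CPU demand fits on the VH processors."""
    if not demand_in_cps:
        raise ValueError("no demand samples given")
    covered = sum(1 for demand in demand_in_cps if demand <= vh_count)
    return covered / len(demand_in_cps)


def meets_guideline(demand_in_cps: Sequence[float], vh_count: int,
                    target: float = 0.80) -> bool:
    """True if the VH processors can handle at least `target` of intervals."""
    return vh_coverage(demand_in_cps, vh_count) >= target


# Hypothetical LPAR: weight 300 of 1000 on a CEC with 10 physical CPs.
share = entitled_share(weight=300, total_weight=1000, physical_cps=10)

# Hypothetical prime-shift demand per interval, in physical CPs.
prime_shift = [1.2, 2.5, 2.9, 3.4, 2.0]
print(f"share: {share:.1f} PCPs")
print(f"coverage with 3 VH: {vh_coverage(prime_shift, 3):.0%}")
print(f"guideline met: {meets_guideline(prime_shift, 3)}")
```

If the guideline is not met, the options named on the slide apply: raise the weight so that more of the share is covered by VH processors, or check whether the number of logical processors is larger than the peaks require.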

15 Annotation: Processing SMF 113 data. If you need a quick start for processing SMF 113 data, look at the WLM home page. The tool provides a set of REXX programs which process SMF 113 subtype 2 data from all processor types: z10, z196, zEC12 and z13. The output is a CSV file containing the most common metrics which can be calculated from the SMF 113 counters. There is also a spreadsheet which displays the most basic statistics for cache access and cycles per instruction. Remark: The spreadsheet expects US number notation with a dot as the decimal point.

16 z13 Analysis: Logical Processor Topology. If you want to know how your logical processors are placed on chips, nodes and drawers: Collect SMF 99 subtype 14 records from, if possible, all z/OS partitions of your CECs. The amount of data is minimal, even much smaller than for SMF 113 data. Combine all SMF 99 subtype 14 records of one CEC into one SMF dataset. Then process the data with another tool available on the WLM home page, the Topology Reporting Tool. It creates a CSV file containing the topology information of all logical processors, processes data for z10, z196, zEC12 and z13, and provides a spreadsheet to display the topologies. Recommendations for running the program: Use the ALLINTV option. Combine SMF data from multiple LPARs of the same CEC into one SMF dataset. For z/OS 1.13: apply APAR OA47418.

17 z13: Taking a Breath. What did we learn so far? z13 provides ~10% ITR improvement compared to zEC12. z13 is a little more sensitive to the LPAR definition. Is there anything else which needs to be considered?

18 What else to consider? What about the program code? Recommendations: Use the newest compiler versions whenever possible, because they are optimized for the latest architecture. Use the latest service levels (operating system and firmware), especially after the introduction of a new architecture. Last but not least: New processor designs sometimes have implications for what was once good practice.

19 Service Levels. Bundle 14 changes how PR/SM assigns VH and VM processors for small partitions. Before, LPARs with a weight equivalent to a share of 1.5 to 2.0 PCPs were assigned 1 VH and 1 VM processor. Now these LPARs are assigned 2 VM processors. Why? Because PR/SM optimizes the placement of VH processors but not necessarily the placement of VM processors. It is therefore possible that the VH and VM processors of the same partition are located on different chips (or nodes), which is not ideal for small partitions, while 2 VM processors are most often placed close together on the same or adjacent chips. OA47968 (WLM HiperDispatch enhancements): One important change is the park/unpark sequence of VL processors. It is now ensured that VLs with logically high processor numbers are parked first, because those are very often not located close to VH or VM processors.
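The Bundle 14 change for small partitions can be written down as a small rule. This is only a sketch of the behavior described on the slide, not PR/SM's actual algorithm: the function name is hypothetical, the handling of the range boundaries (1.5 included, 2.0 excluded) is an assumption, and shares outside the range are deliberately not modeled.

```python
from typing import Dict


def small_partition_layout(pcp_share: float, bundle14_applied: bool) -> Dict[str, int]:
    """Vertical processor assignment for an LPAR share between 1.5 and 2.0 PCPs.

    Only the case described on the slide is covered; the boundary handling
    is an assumption made for this sketch.
    """
    if not 1.5 <= pcp_share < 2.0:
        raise ValueError("only shares from 1.5 up to (not including) 2.0 are modeled")
    if bundle14_applied:
        # Two VMs tend to be placed close together on the same or adjacent chips.
        return {"VH": 0, "VM": 2}
    # Before Bundle 14: the VH and the VM could end up on different chips or nodes.
    return {"VH": 1, "VM": 1}


for applied in (False, True):
    layout = small_partition_layout(1.7, bundle14_applied=applied)
    print(f"Bundle 14 applied={applied}: {layout}")
```

When comparing topology reports from before and after a firmware update, a small partition that loses its only VH processor is therefore expected behavior and not a configuration error.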

20 Store Into Instruction Stream (SIIS). Look at WSC Flash 10208 (Processor Design Considerations). The situation is not new (the flash was released for z10) but it still exists. It has to do with the out-of-order processing of modern processors. Out-of-order (OOO) processing yields a significant performance benefit for compute-intensive applications through: re-ordering of instruction execution, where later (younger) instructions can execute ahead of an older, stalled instruction; and re-ordering of storage accesses and parallel storage accesses. OOO also maintains good performance growth for traditional applications.

21 Background: SIIS. The split instruction/data cache design requires special SIIS handling. Ifetching runs (far!) ahead. Storing runs somewhat ahead (out of order). To really execute the store, the istream must have stopped using the cache line. Instruction sequence shown on the slide: LR 15,4 MR 14,5 LR 2,15 LR 15,4 MR 14,5 ST 3,208(0,13) LR 3,15 A 2,2604(0,7) ST 2,976(0,13) A 3,2604(0,7) ST 3,LABEL LABEL: L 2,92(0,9) L 15,44(0,2) L 3,64(0,12) LA 1,237(0,3) BASR 14,15 LR 1,4 MR 0,5 LR 4,1 L 5,208(0,13) LR 1,5 MR 0,6 LR 14,1 A 4,2604(0,7) ST 4,976(0,13) L 4,304(0,9) A 14,3944(0,4) ST 14,984(0,13) L 15,44(0,2) ... Markers on the slide: Fully done ("completed") up to here. Store into instruction stream. Out-of-order (started execution) up to here. Ifetch is already far, far away. This can be many instructions. This can be a lot more.

22 Background: SIIS. Instruction sequence shown on the slide: LR 15,4 MR 14,5 LR 2,15 LR 15,4 MR 14,5 ST 3,208(0,13) LR 3,15 A 2,2604(0,7) ST 2,976(0,13) A 3,2604(0,7) ST 3,LABEL LABEL: L 2,92(0,9) L 15,44(0,2) L 3,64(0,12) LA 1,237(0,3) BASR 14,15 LR 1,4 MR 0,5 LR 4,1 L 5,208(0,13) LR 1,5 MR 0,6 LR 14,1 A 4,2604(0,7) ST 4,976(0,13) L 4,304(0,9) A 14,3944(0,4) ST 14,984(0,13) L 15,44(0,2) ... Caches keep track of cache lines, so cache line accesses are used to detect SIIS cases. The cache line size is 256 bytes (and has been for a long time). IFetch just keeps on fetching: Even though only the blue instructions are done so far, IFetch will be way beyond that point, having already fetched the SIIS, the SIIS victim, and a lot of what follows. Out-of-order execution runs ahead quite a bit: Even though only the blue instructions are done so far, the SIIS into LABEL will already fetch the cache line into the data cache. Without special handling, at this point: The IStream will lose the cache line, potentially even the first green instruction. This means the SIIS will also be lost, even though it just got the cache line (it was not yet the next instruction to be done). IFetch will immediately ask for the first green instruction again, moving the cache line back from the data cache into the instruction cache. Repeat. A special core-internal interlock between the instruction and data cache prevents this loop, but there is no path to send the store data directly into the IStream.
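Because detection works on whole cache lines, a store does not need to hit the exact bytes of an instruction to be treated as a store into the instruction stream; it is enough that the store target and the executing code share a 256-byte line. The following sketch illustrates that check. The addresses and function names are hypothetical, and the code is an illustration of the cache-line arithmetic only, not an analysis tool.

```python
from typing import Iterable, List

CACHE_LINE_BYTES = 256   # cache line size stated on the slide


def cache_line_index(address: int) -> int:
    """Index of the 256-byte cache line that contains the address."""
    return address // CACHE_LINE_BYTES


def shares_cache_line(store_target: int, instruction_address: int) -> bool:
    """True if a store target lies in the same cache line as an instruction."""
    return cache_line_index(store_target) == cache_line_index(instruction_address)


def siis_candidates(store_targets: Iterable[int],
                    code_start: int, code_end: int) -> List[int]:
    """Store targets that fall into any cache line occupied by the code range."""
    code_lines = set(range(cache_line_index(code_start),
                           cache_line_index(code_end) + 1))
    return [target for target in store_targets
            if cache_line_index(target) in code_lines]


# Hypothetical layout: code at 0x1000-0x10FF, working storage at 0x2000.
label_address = 0x10F0          # target of a store such as "ST 3,LABEL"
working_storage = 0x2000        # target of a store into a separate data area

print(shares_cache_line(label_address, 0x10C0))       # same line as the code
print(shares_cache_line(working_storage, 0x10C0))     # different line
print([hex(t) for t in siis_candidates([label_address, working_storage],
                                       0x1000, 0x10FF)])
```

The same arithmetic explains a common remedy in old assembler routines: moving modified fields away from the instructions so that they no longer share a cache line with executing code.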

23 Background: SIIS. What changed from zEC12 to z13: zEC12 core cache hierarchy. Caches shown in the diagram: L1-I$ (64K), L1-D$ (96K), L2-D$ (1M), L2-I$ (1M), L3$ (48M). Fast, private L2-D$ for the data cache. Slow, shared L2-I$, usually bypassed for L2-D$ fetches. Cache lines that have to travel between the instruction and data side can do so through the L2-I$. Diagram labels: Core boundary; Not shared; Shared.

24 Background: SIIS. What changed from zEC12 to z13: z13 core cache hierarchy. Caches shown in the diagram: L1-I$ (96K), L2-I$ (2M), L1-D$ (128K), L2-D$ (2M), L3$ (48M). Fast, private L2-D$ for the data cache and fast, private L2-I$ for the instruction cache. This improves instruction cache miss latency. It also improves L2 cache miss latency. The first shared cache level is the L3 cache. Cache lines that have to travel between the instruction and data side can do so through the L3$. Diagram labels: Core boundary; Not shared; Shared.

25 Store Into Instruction Stream. On z13 this is sometimes more visible than on z196 and zEC12, because the bigger L2 instruction and data caches are no longer interconnected. What is in general an advantage can thus become a disadvantage in this particular case. Did this happen on z13? The answer is yes. Where does it typically happen? In assembler routines which have existed for a long time; typically batch programs use such routines. What to do? If you recognize that specific batch programs run significantly longer: Use tools to analyze the instruction pattern. Analyze the hot spots in the program code. Where SIIS coding patterns are found, remove them. Notice: The situation is not new, and modern processor designs require technologies which may interfere with historical coding practices.

26 z13: Summary. z13 provides new design points and sets the base to grow capacity and functionality for z Systems processors. Because it is new, some care in migration and planning is advised.

28 Resources. Processor Design Considerations (WSC Flash); LPAR Design Tool; System z Topology Report; SMF 113 Reporting Tool; CPU MF Webinar Replays and Presentations; z/OS CPU MF Detailed Instructions Step by Step Guide; z/VM Using CPU Measurement Facility Host Counters; CPU MF Extended Counter Description; The Load-Program-Parameter and the CPU-Measurement Facilities.

29 Annotation. The real value of z13 comes with exploiting its features. Examples: DB2 V11, zEDC.

30 (z13 and) DB2 V11

31 z13 and DB2 V11. DB2 workloads show improvements in the range of 4% to 38%, mostly better than the expected 10% improvement. 2.4x reduction in compression cost utility.

33 zEDC test results: Three European clients. Cost comparison of zEDC and tailored compression: tailored compression: 2.52 CPU seconds per GB; zEDC: 0.15 CPU seconds per GB.
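The two rates above translate directly into a ratio and a saving. The following sketch only does that arithmetic; the two rates come from the slide, while the data volume of 100 GB and the names are hypothetical.

```python
TAILORED_CPU_SEC_PER_GB = 2.52   # from the slide
ZEDC_CPU_SEC_PER_GB = 0.15       # from the slide


def cpu_seconds(gigabytes: float, cpu_sec_per_gb: float) -> float:
    """CPU time needed to compress the given volume at the given rate."""
    return gigabytes * cpu_sec_per_gb


ratio = TAILORED_CPU_SEC_PER_GB / ZEDC_CPU_SEC_PER_GB
saving = 1.0 - ZEDC_CPU_SEC_PER_GB / TAILORED_CPU_SEC_PER_GB

volume_gb = 100.0                # hypothetical daily volume
print(f"tailored: {cpu_seconds(volume_gb, TAILORED_CPU_SEC_PER_GB):.0f} CPU seconds")
print(f"zEDC:     {cpu_seconds(volume_gb, ZEDC_CPU_SEC_PER_GB):.0f} CPU seconds")
print(f"ratio:    {ratio:.1f}x, CPU time saved: {saving:.0%}")
```

For these rates the CPU cost of tailored compression is 16.8 times that of zEDC, so about 94% of the compression CPU time moves off the general purpose processors.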

34 zEDC benefits: European clients
