Fit for Purpose Platform Positioning and Performance Architecture

Fit for Purpose Platform Positioning and Performance Architecture Joe Temple IBM Monday, February 4, 11AM-12PM Session Number 12927 Insert Custom Session QR if Desired.

Fit for Purpose Categorized Workload Types Mixed Workload Type 1 Scales up Updates to shared data and work queues Complex virtualization Business Intelligence with heavy data sharing and ad hoc queries Parallel Data Structures Type 3 Scales well on clusters XML parsing Buisness intelligence with Structured Queries HPC applications Application Function Data Structure Usage Pattern SLA Integration Scale Highly Threaded Type 2 Scales well on large SMP Web application servers Single instance of an ERP system Some partitioned databases Small Discrete Type 4 Limited scaling needs HTTP servers File and print FTP servers Small end user apps Black are design factors Blue are local factors

These do not define workload Insert Custom Session QR if Desired. Languages c/c++,cobol,fortran, JAVA, etc. Middleware Oracle, DB2, Websphere, MQ, Tuxedo,CICS, Encina,etc. Workload Type OLTP, Analytics, Business Applications, Infrastructure Mixed/Consolidated, Highly Threaded, Parallel, Small Discrete Workload types can be used for positioning machines, but are not enough to guide platform selection 3

Workload Definition A workload consists of a workload type plus local factors: Usage Pattern Load Variability Scale Size of load Service Level Response or turnaround expectation at load Desired Efficiency Target utilization level Integration Connections and shared data impacts 4

x86 Performance Degrades As I/O Demand Increases Run multiple virtual machines on x86 server Each virtual machine has an average I/O rate x86 processor utilization is consumed as I/O rate increases Intel CPU As IO Load Increases 50 45 CPU CPU% utilization 40 35 30 25 20 15 Excess CPU cycles spent on processing I/O Saturation No IO 256KB IO 512KB IO 1MB IO 1.5MB IO 10 5 0 0 5 10 15 20 25 Source: CPO Internal Study. 12-core Westmere EP with KVM. FB at 22 tps with varying IO per transaction. VMs 05. What System z Can Do That x86-based Systems Can t 5

Scaling Matters Productive Nodes 12 11 10 9 8 7 6 5 4 3 2 1 1.78 DB2 for z/os Near Linear Scalability 1.69 3.48 Perfect Linear Performance Productive Nodes 5.24 2.44 6.86 9.98 1 2 3 4 5 6 7 8 9 10 11 12 Nodes in Cluster DB2 for z/os OLTP result (ITG 03) Oracle RAC Poor Scalability Oracle RAC is inefficient by design Network based lock and buffer management Scaling RAC requires complex tuning and partitioning Application partition awareness makes it difficult to add or remove nodes Published studies demonstrate difficult or poor scalability Dell (shown in chart): Poor scalability despite using InfiniBand for RAC interconnect CERN: Four month team effort to tune RAC, change database, change application Insight Technology: Even a simple application on two node RAC requires complex tuning and partitioning to scale Oracle RAC characteristics as shown in Dell RAC InfiniBand Study http://www.dell.com/downloads/global/power/ps2q07-20070279-mahmood.pdf CERN (European Organization for Nuclear Research) http://www.oracleracsig.org/pls/apex/rac_sig.download_my_file?p_file=1001900 Insight Technology http://www.insight-tec.com/en/mailmagazine/vol136.html

Cache Working Set Matters 1000$ Cache(Hierarchy(up(to(node(memory( 100$ Latency((nSec)( 10$ 1$ Sandy$Bridge$ p7+$ite$ p780$$ zec12$ 0.1$ 0.01$ 0.1$ 1$ 10$ 100$ Capacity((MB)( 7

Queuing and Load Variability Matter!Wait!Time!v!U+liza+on!by!variability! 10" 9" 8" 7" 6" 5" 4" 3" 2" 1" 0" 0" 0.2" 0.4" 0.6" 0.8" 1" Very"Low" Low" Moderate" High" Very"High" 8

Response (me and Consolida(on ma/er 10" 9" 8" Response'Teim' 7" 6" 5" 4" 3" 2" 1" Sandy"Bridge" Westmere"EX" P780" p7+"p6"mode" Power"7+" z196" zec12" 0" 0" 0.5" 1" 1.5" 2" 2.5" 3" 3.5" N' Each machine par,,on is three dedicated cores, number of loads varies

Modeling and benchmarks There is enough data in the machine specification to make an architectural performance model We know that distributed on line loads usually have high variability However, the resulting model has relatively low precision It is better to use measurements Traditional measurement of maximum throughput metrics will not help enhance the model. It simply replaces it with another low precision model. We should measure single thread speed Single thread Saturation Scaling with increased thread count (related to saturation) Interval data of the usage and possibly throughput pattern 10

Performance architecture involves requirements as well as comparisons How is response time defined? Completion of a single thread of work? Completion of many threads of work? What response time is required and what fraction of the peaks need to be covered? There is a trade off between peak coverage, cost and utilization efficiency. Feasibility can become and issue if the SLA is too tight. Is aggregate throughput meaningful to users or is the preferred metric the number of loads contained in the machine while meeting the SLA? 11

There is a design tradeoff between throughput and capacity 300" Max$Thruput$v$Thread$Capacity$ 250" 200" 150" 100" 50" SB" WM"EX" P460" P7+" ZEC12" 0" 0" 20" 40" 60" 80" 100" Here: Throughput is Clock * SMT muttiplier/threads per core * total threads Thread Capacity is Clock * SMT muttiplier/threads per core * cache/thread It is best to replace is Clock * SMT muttiplier/threads per core by measured thread speed Throughput can t be faster that thread speed * Threads Thread capacity is how much work can stack on a single thread which is related to both the thread speed and the cache available. 12

Virtual Machine Density and the Tradeoff As VM s per core of the workload increases the importance of aggregate throughput decreases As the size of a virtual machine increases The importance of its internal throughput rate increases. Increased density favors favors z; increased VM size favors Power Intel is favored when resources can be aggregate without scaling penalties. Power and z are favored when resources can be shared without scaling penalties. 13

Do you need a deep dive to understand workload fit? Workload fit involves more than determining the workload type and a throughput ratio rule of thumb. Operational considerations will change the relative capacity of machines Throughput ratios do not generally take operational tradeoffs into consideration An Performance Architecture workshop can provide such a deep dive. The objectives of the workshop are to build a model which produces characteristic curves Response time v Throughput Response time v Load Count or VM count Response time v utilization Throughput v utilization Scaling The workshop can work with machine specs and assumed usage patterns in lieu of data but collection of data will yield better results 14

Fit for purpose thinking comes down to: Know the legacy, workload, and costs Know the current IT Environment Understand the workload Examine costs Workload analysis gets technical fast, and real cost analysis is a deep dive.

Eagle Engagements Free of Charge total cost of ownership study that helps customers evaluate the lowest cost option among alternative approaches. The study usually requires one day for an on-site visit and is specifically tailored to a customer s enterprise. The study can be focused on at least one of the areas below : Fit For Purpose Platform Selection Private Cloud Implementation Enterprise Server Issues We conduct Eagle studies for System z, POWER, and PureSystems accounts Over 300 customer studies since the formation of the TCO Eagle team in 2007 Engage our Eagle-Eyed Experts! Start by requesting your IBM Contact to send an email to eagletco@us.ibm.com For deep workload analysis workshop use the same link and ask for Joe Temple Will be ramping up capability for workload deep dives in the coming quarters.