DB2 for z/OS Best Practices Recommendations from DB2 Health Check Studies: Operations


DB2 for z/OS Best Practices: Recommendations from DB2 Health Check Studies, Operations. John Campbell & Florence Dubois, DB2 for z/OS User Technology. 2012 IBM Corporation. Transcript of webcast.

Slide 1 (00:00) Hello, this is John Campbell, a distinguished engineer from DB2 for z/OS development. Today's web lecture is from the series on DB2 for z/OS best practices. Specifically, today's web lecture is part 4, called Operations, from the series on recommendations from DB2 health check studies.

Slide 2 (00:24) On slide 2 is a disclaimer and a list of trademarks related to this presentation. Now let's turn to slide 3.

Slide 3 (00:34) On slide 3 are the overall objectives of this series on DB2 best practices, which is intended to introduce and discuss the issues most commonly found, share the experience from customer DB2 health check studies, share the experience of live customer production incidents, provide proven global recommended best practices, and finally encourage proactive behavior as opposed to regret analysis. Now let's turn to slide 4.

Slide 4 (01:05) On slide 4 is the first part of the agenda, which describes part 1 and part 2, which are separate modules in this series of web lectures. Now let's turn to slide 5.

Slide 5 (01:17) On slide 5 of the agenda are parts 3 and 4. Today's web lecture is about

operations. We're going to cover three specific topics: preventative software maintenance, DB2 virtual and real storage management, and finally performance- and exception-based monitoring. Now let's turn to slide 6.

Slide 6 (01:41) Here on slide 6, I'd like to introduce and discuss the most common problems associated with preventative software maintenance. First of all, too many customers are very back level on preventative service. So much so that they experience problems that could have been avoided if they had applied the preventative service (in other words, the missing HIPER or resolved PE). Some of these problems were already experienced by other customers, and a PTF was readily available. In some cases, the very same customer may already have experienced that very same problem but have taken no action and continue to run into the same problem again and again. Many customers apply no HIPERs or PE fixes since the last preventative service upgrade. I've also experienced a number of high-profile production incidents that could have been readily avoided by applying a missing HIPER. Too many organizations have a fix-on-failure culture. The result is that they don't apply any preventative service until they actually experience a problem. The problem with this fix-on-failure culture is that when you do run into some problem there may be a very long prerequisite PTF chain to be applied in order to get that corrective fix applied. Many customers have deployed z/OS Parallel Sysplex and DB2 data sharing technologies to avoid planned outages and to remove dependencies on change windows. However, there are way too many customers here who are not getting the full benefits of DB2 data sharing technology, for example, because they have application affinities where an application becomes a single point of failure. Or there may be other single points of failure around the data sharing group. But being back level on preventative service leads to a delay in exploiting new availability functions, and it also means a delay in applying DB2 serviceability enhancements to prevent outages. Another trend with large organizations and with outsourcers is that they have a one-size-fits-all approach to maintenance, and the same maintenance is being applied across different application environments. The problem with this is that no two application environments are typically the same. And with a one-size-fits-all maintenance approach that's not complemented by a

one-size-fits-all testing approach, this can lead to escalating maintenance costs whereby the organization tries to roll out some maintenance with some success, but then runs into failures in particular application environments, and the maintenance has to be upgraded. Now let's turn to slide 7.

Slide 7 (04:21) So on slide 7, let me answer the question: why install preventative software maintenance? Achieving the highest level of availability depends on having an adaptive preventative maintenance process, basically learning from experience. There is no one-size-fits-all. So, for example, if an organization is running into defects where they are the first customer in the world to be experiencing that defect, and this is a recurring theme, this indicates that the customer is too aggressive about applying service. On the other hand, if a customer runs into problems where the fixes were readily available and not applied, this means that the customer's maintenance process is too conservative and they need to be more aggressive about applying preventative service. Secondly, applying preventative maintenance can and will avoid outages. From our own analysis in the DB2 for z/OS lab, up to twenty percent of multi-system outages could have been avoided by regularly installing critical PTFs, in other words, HIPERs and PE fixes. And finally, executing a preventative maintenance process requires a deep understanding of the trade-offs in achieving high systems availability. Now let's turn to slide 8.

Slide 8 (05:40) On slide 8 is a chart that tries to illustrate the trade-offs in maintenance. If you look at the graph, on the Y axis is the percentage chance, and on the X axis are the months: one, two, three, etcetera. When it shows three months here, this means three months after a PTF becomes available. What the graph shows, first of all looking at the yellow line, is that as time goes on and the maintenance falls further and further back level, there's an increased chance that a customer will run into old bugs, in other words, a bug where the fix was readily available. On the other hand, when you look at the blue line on this chart, this indicates the chance of hitting a PTF in error. So, I'm proposing here that the sweet spot is probably three or four months after PTF availability, which is a reasonable balance between avoiding problems where the fix was readily available and at the same time avoiding the excessive

chance of running into PEs. So what a customer must do here is balance severity versus risk. In other words, balance the risk of problems encountered versus problems avoided, factoring in the potential for PTFs in error, and also factoring in application workload type. Some workloads, such as traditional legacy workloads, are much more stable and use a very limited amount of DB2 function, or at least function that's well stabilized inside DB2. On the other hand, there may be new application workload types using very new features of DB2 where you need to be much more aggressive about applying preventative service. And also, for any customer, you need to factor in the available windows for change control to install the actual service. But most importantly, every customer needs an adaptive service strategy that is adjusted based on prevailing experience, looking at experience over the previous 12 to 18 months. You also need to factor in the organization's attitude toward risk, toward changing the environment, and toward exploiting new DB2 releases, associated products, and new feature function. For example, if the customer is very aggressive about using the latest features and functions of new DB2 releases, they need to be more aggressive about applying preventative service, applying it more often and not staying too far behind. On the other hand, if the customer is very conservative in terms of the usage of new features and functions and slow to adopt new DB2 releases, they can afford to apply preventative service less often and can afford to stay further behind. The last part of this equation is to factor in what's happening in terms of DB2 product and service plans. Now let's turn to slide 9.

Slide 9 (08:35) On slide 9, I want to talk about Consolidated Service Test. The goal of Consolidated Service Test is to enhance the way that IBM service for the z/OS software products is tested and delivered, and the intent is to provide a single, coordinated service recommendation. CST testing provides cross-product testing for the participating products, like DB2, CICS, z/OS, and the other products in the z/OS stack. So this is testing over and above what the respective development groups do. The list of products included in CST is continually expanding. The testing performed in CST, as I said, is in addition to that performed in the existing testing programs and does not replace any current testing performed by the individual program products. The end goal is to standardize on the maintenance recommendation for the z/OS software stack platform. The results of the CST are published quarterly on the

CST website. On slide 9 is the web page that will take you to the latest available quarterly report. IBM also publishes a monthly addendum with an update on tested HIPERs and PE fixes. After service has passed the CST testing, it is then packaged and marked with what's called RSU. RSU stands for Recommended Service Upgrade. And this is then made available for customers to order online. Now let's turn to slide 10.

Slide 10 (10:14) On slide 10 is an example (pictured here) that compares the CST RSU process versus what I would call the PUT calendar, because the PUT calendar is an alternative way for a customer to pull and apply maintenance. Let's go to the top half of the chart, which talks about the PUT calendar. We have the calendar going January, February, March, April, and so on. For each month there is a corresponding PUT from the previous month. So the PUT that's available in January 2012 is, in fact, PUT1112, which is from December 2011. When you look at PUT1112, which is available in January 2012, you see that the base code, in terms of maintenance, is December 2011, and the HIPERs and PEs are also current up to December 2011. And as each month goes forward on the PUT calendar, you see that both the base and the HIPERs and PEs come forward by one month. So basically, when you have a PUT calendar and you order a PUT tape, once you have the ordered tape in your hands, you're one month behind on the base, the HIPERs, and the PEs. The bottom half of this chart shows the CST RSU calendar. In the bottom left-hand corner, we have the CST testing in fourth quarter 2011, and ultimately that results in RSU1112. When you look inside the package, you've got all service through the end of September 2011 that was not already marked RSU, followed by HIPERs and PEs through the end of November 2011. That RSU (RSU1112) is orderable in January, so what you get in January is the base at September 2011 and HIPERs and PEs through to November 2011. The first thing that is obvious about the RSU process is that your base is further back in time to protect you against PEs related to non-HIPER maintenance, and you've got HIPERs and PEs that have been through extra testing through to the end of November. So this gives you more protection in terms of maintenance being stable. Now, as the RSU goes forward month by month, and we look at RSU1201, that's January 2012, that's orderable in February 2012. The base is still at September 2011, but the HIPERs and PEs have moved forward one extra month through

December 2011. Finally, every time a quarterly RSU becomes due, that's the point at which the base, in terms of non-HIPER maintenance, moves forward. So, when we get to RSU1203, which is orderable in April, the base moves forward from September 2011 to December 2011, and the HIPERs and PEs have moved forward to February 2012. Now let's turn to slide 11.

Slide 11 (13:25) On slide 11, I'd like to introduce and discuss enhanced HOLDDATA. This is a critical ingredient of a preventative service strategy. At the top of chart 11 is a link to information about what enhanced HOLDDATA is and how to use it. As I said, it's a key element of the CST RSU best practices process. The goal is to simplify service management. It can and should be used to identify missing PE fixes and HIPER PTFs using the SMP/E REPORT ERRSYSMODS command. What you're able to do is produce a summary report that includes the fixing PTF number when the PTF is available. It also includes the HIPER reason flags, such as DAL for data loss, FUL for major function loss, or PRF for performance. It identifies whether any fixing PTFs are in RECEIVE status, in other words, available for installation, and whether the chain of PTFs to fix the error has any outstanding PEs. The enhanced HOLDDATA is updated by IBM on a daily basis, and a single set of HOLDDATA is both cumulative and complete. Up to three years of history is available. Now let's turn to slide 12.

Slide 12 (14:52) Another thing that we've done in DB2 development is to exploit what's called fix category HOLDDATA, often referred to as FIXCAT for short. Again, at the top of slide 12 there is a web page to click on with information that describes what fix category HOLDDATA is and the categories that are used by DB2. The advantage of fix categories is that they can be used to identify a group of fixes that are required to support a particular hardware device or to provide a particular software function. FIXCAT is supplied in the form of SMP/E FIXCAT HOLDDATA statements. On the second half of this chart is a current list of FIXCAT HOLDs for the DB2 for z/OS product. I'll pick a few of them out here. For example, DB2STGLK flags fixes for DB2 storage leak problems, and DB2INCORR flags fixes for DB2 SQL incorrect output problems. This is a way of filtering through the many HIPERs to pull out the ones that really matter. I've picked out the ones that I've mentioned already. Things like storage leaks, storage overlays, and incorrect SQL output are obviously very critical problems and are of paramount importance. Therefore, it's essential that the fixes for these sorts of problems are put on as soon as possible.
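As a minimal sketch of the check just described (the CSI data set name and target zone name are assumptions; substitute your own DB2 global CSI and zone), an SMP/E batch job running REPORT ERRSYSMODS against the received enhanced HOLDDATA might look like this:

```
//HIPERCHK JOB (ACCT),'HOLDDATA CHECK',CLASS=A,MSGCLASS=X
//* Sketch only: DSN and zone names below are illustrative.
//* Lists missing HIPER and PE fixes for the DB2 target zone,
//* using the enhanced HOLDDATA already RECEIVEd into GLOBAL.
//REPORT   EXEC PGM=GIMSMP
//SMPCSI   DD DISP=SHR,DSN=DB2.V10.GLOBAL.CSI
//SMPOUT   DD SYSOUT=*
//SMPRPT   DD SYSOUT=*
//SMPCNTL  DD *
  SET BOUNDARY(GLOBAL).
  REPORT ERRSYSMODS ZONES(DB2TGT).
/*
```

Any HIPER or PE exposure the report flags can then be handled according to the three categories described below.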

Now let's turn to slide 13.

Slide 13 (16:21) On slide 13, I want to provide some recommendations for preventative software maintenance. One very important point to touch on to start with is the change management process in an individual customer environment. In too many customer environments the change management process is very risk averse and strangles all types of change, so it actually limits the amount of change going on. I'd like to encourage customers to also assess the impact of no change. Clearly there's a risk in making a change, but there's also risk in not making a change, and it's important to get the balance correct. The basis for the recommendation is to apply preventative maintenance every three months and to use the RSU calendar instead of the PUT calendar so that you're less aggressive in applying non-HIPER maintenance. The sample strategy is based on two major and two minor packages per calendar year. The major package is in fact a refresh of the base every six months, based on the latest available quarterly RSU. Each base upgrade based on this quarterly RSU should be RSU-only service; you should specify SOURCEID=RSU* in the supplied APPLY and ACCEPT jobs. In between those two major packages, the idea is to have what's called a mini-package or minor package: in between two successive major software upgrades, you roll up all the missing HIPERs and PE fixes that are available, package them into a mini-package, and then roll that into production. Having said this, this strategy is based on a conservative customer who is not aggressive about using new feature function. On the other hand, if you're a customer who is very aggressive about migrating to new releases, in other words an early adopter, or aggressive about using new functions, you need to be more aggressive than the strategy I've just outlined.
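As an illustration of the RSU-only base refresh just described (CSI and zone names are again assumptions), the APPLY step might select service by RSU SOURCEID like this; an ACCEPT with the same SOURCEID would follow once the level has proven stable:

```
//RSUAPPLY JOB (ACCT),'RSU BASE REFRESH',CLASS=A,MSGCLASS=X
//* Sketch only: run with CHECK first, review the reports,
//* then rerun without CHECK. DSN and zone names are illustrative.
//APPLY    EXEC PGM=GIMSMP
//SMPCSI   DD DISP=SHR,DSN=DB2.V10.GLOBAL.CSI
//SMPOUT   DD SYSOUT=*
//SMPRPT   DD SYSOUT=*
//SMPCNTL  DD *
  SET BOUNDARY(DB2TGT).
  APPLY CHECK
        SOURCEID(RSU*)
        GROUPEXTEND
        BYPASS(HOLDSYSTEM).
/*
```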

For all customers, it's very important to review the enhanced HOLDDATA on a weekly basis. Customers should pull the enhanced HOLDDATA, bring all the missing fixes onsite, and then analyze the missing HIPERs and PEs. Typically, they fall into three categories. First, there are those HIPERs and PEs that don't apply to the usage of DB2 at a particular installation. For example, there may be a HIPER that is related to query parallelism, and you don't use query parallelism in your installation, in which case it can be ignored. The second category is related to something like a storage overlay, a storage leak, a bad recovery, or DB2 crashes. These types of HIPERs are critical and need to be expedited into production after one to two weeks in test to actually prove the fix. The third category is HIPERs that do apply but where the exposure is very small and the impact is very small. Therefore, you could defer the application of that HIPER or PE fix until the next major or minor upgrade. Now let's turn to slide 14.

Slide 14 (19:38) Continuing with my recommendations about preventative software maintenance, it's important for customers who demand the highest levels of availability to develop the processes, procedures, and technical changes needed to implement rolling maintenance outside of heavily constrained change windows. Basically, they need to exploit the z/OS Parallel Sysplex and DB2 data sharing technologies. A few recommendations are to have a separate SDSNLOAD or SDSNLOAD alias per DB2 member, and a separate ICF user catalog alias per DB2 member. The benefits of rolling maintenance are: only one DB2 member at a time is stopped; the DB2 data is continuously available via the N minus one remaining members; and fallback to the prior level is fully supported if necessary. However, I want to make one very important point here: if you have applications with affinities that run in only one place, then clearly it's nearly impossible to implement rolling maintenance; if you do, you're going to incur application service outages. The second point on chart 14 is to aim for company-wide certification of new releases of maintenance. This particularly applies to many worldwide organizations that have many different application environments, and also to outsourcing companies, again where they have many different application environments and a lot of diversity across those different application development environments. The whole idea is to complement a company-wide concept of building a maintenance package with company-wide certification through testing of the new maintenance. So if you were to implement a shared test environment with

collaborative use by systems programming staff, DBAs, and application teams as a way of proving new maintenance, you can think of it a bit like a company-wide IVT to validate the software. This provides an additional insurance policy before starting to roll out new DB2 maintenance packages across multiple application environments. Now let's turn to slide 15.

Slide 15 (21:48) Continuing on this theme of large organizations with multiple different application environments, we also think it's important to have separated DB2 maintenance environments. This means having a single master SMP/E environment as the base and then one additional SMP/E environment to match each separate DB2 application environment. The certification testing that I talked about on slide 14 will be based on the master SMP/E environment before starting to promote the new maintenance to the various test and production environments. So what are the benefits of having separated DB2 maintenance environments, each with a separate SMP/E environment to support it? First of all, when you have multiple application environments and there's a lot of diversity in terms of SQL functionality and feature and function usage in DB2, there's no one-size-fits-all in terms of maintenance. It enables you to be more aggressive in applying maintenance to some particular environments while at the same time being conservative on other application environments in order to protect their stability. This gives you a very flexible approach that meets application development requirements to be aggressive about using new feature function or adopting new DB2 releases. At the same time, it supports the migration to new DB2 releases well, as the DB2 application environments can be treated independently. Now let's turn to slide 16.

Slide 16 (23:23) So on slide 16, I want to switch to the second major topic, which is DB2 virtual and real storage management. In a nutshell, the common problem here is that the CTHREAD and MAXDBAT system parameters are set too high. CTHREAD describes the total number of allied threads allowed into DB2, and MAXDBAT describes the total number of database access threads used by distributed applications that can run in DB2. If these two system parameters, in combination, are set too high, this inflates the size of the storage cushion. This is a storage cushion in the DBM1 address space below the bar, and it represents the storage cushion for 31-bit storage. If the storage cushion is inflated because

CTHREAD and MAXDBAT are set too high, full system storage contraction will occur more often, and this has two disadvantages. First, it will cause CPU burn in the DBM1 address space in TCB mode; second, it will put stress on the system because DB2 acquires the LPVT latch. The second aspect of this is the desire to have no denial of service. This applies to both version 9 and version 10, although what drives the problems is very different. First of all, you know that an EDM 31-bit pool full condition is pretty serious. Even in version 9 we can still have a 31-bit pool full condition, and when that condition is reached, applications will get a SQLCODE -904, which means the application failed, and it amounts to a denial of service. On the other hand, if all the work floods in because these parameters are set too high, then we will drive DBM1 full system storage contraction, and this will potentially degrade performance. Also note that if full system storage contraction cannot free up enough storage, then the DBM1 address space will go storage critical. This means that individual DB2 threads that are not marked as must-complete will end up abending with reason codes beginning with 00E200. Ultimately, DB2 can actually crash out, and that can result in a loss of business application services if there are affinities in those applications for a particular member, or even in non-data sharing. Particularly in version 10, over-commitment can lead to excessive paging to auxiliary storage. If both the available real storage and the auxiliary storage are overcommitted, then the LPAR can crash out, causing DB2 to terminate along with any other subsystems running on that particular LPAR. All of this can be aggravated by: a lack of workload balance across CICS and WAS versus DB2; workload failover conditions (when subsystems or LPARs fail); abnormal slowdowns due to, for example, application locking considerations, degraded I/O performance, etcetera; or people moving application workloads from one particular LPAR or member to another. Now let's turn to slide 17.

Slide 17 (26:40) Continuing on the theme of problems, let's talk about shortage of real storage. This was always important prior to version 10, but certainly with the advent of version 10, it needs to be re-emphasized. Shortage of real storage can lead to excessive paging to auxiliary storage and severe performance problems. The DB2 environment should be designed and provisioned so that there is no paging to auxiliary storage. If you think about it from a

buffer pool perspective, if you don't have enough real storage to back the buffer pool, then when DB2 has to steal a page using the LRU (least recently used) algorithm, we don't want to find that the LRU buffer is paged out to auxiliary storage, because if that happens, you'll get two I/Os: the MVS page-in I/O from auxiliary storage that you didn't want, followed by the real I/O, which is to bring the page out of your application page set into the DB2 buffer pool. Ultimately, if things get out of control and you over-commit all the available real memory and all of the available auxiliary storage, this can take the LPAR out. Once all of the auxiliary storage is consumed, the LPAR will actually go into a wait state. Shortage of real storage can also lead to long dump processing times and cause major disruptions not just to that LPAR, but across the data sharing group. A dump should complete in a small number of seconds (less than ten seconds) to make sure that no performance problems ensue on the LPAR and that we don't get any sympathy sickness around the data sharing group. Once paging begins, it's possible to have dump processing take tens of seconds, even a few minutes, with a high risk of system-wide or even sysplex-wide slowdowns. As I've said to several customers, you could be running just one dump away from disaster. Ultimately, if you don't have enough real storage to provision your system for both normal operation and abnormal events like dump processing, this can lead to wasted opportunities for CPU reduction. Today, cost reduction is of paramount importance to almost every customer that I know. This leads to a reluctance to use bigger or more buffer pools, a reluctance to use buffer pool long-term page fix, which was introduced in version 8, and also an inability to use the many performance opportunities opened up in DB2 version 10, which require additional real storage. Now let's turn to slide 18.

Slide 18 (29:12) Now I'll provide some recommendations. First of all, IFCID 225 provides comprehensive information about DB2 virtual storage usage and, now in V10, real storage usage. You should collect IFCID 225 data from DB2 start all the way through DB2 shutdown. These days, IFCID 225 is collected as part of statistics trace class 1 and should be written out to SMF. Prior to version 10, OMEGAMON PE (OMPE) provided a feature called the SPREADSHEETDD subcommand, which was available in the batch processor to post-process the SMF data and write it out as a comma-delimited file that could easily be loaded into something like Microsoft Excel. However, the SPREADSHEETDD support in OMPE has not been enhanced to support DB2

version 10. In the meantime, what we've done in DB2 development is to enhance the sample REXX programs MEMU2 and MEMUSAGE to be able to pull the data. There's a separate version of MEMU2 to support version 9 and a separate one to support version 10. Both versions are available in the DB2 for z/OS Exchange community on the IBM My developerWorks website. There's a web page on slide 18 that you can click on which takes you to the DB2 for z/OS technical exchange. Generally, what you need to do here is plan on keeping a basic storage cushion free. This is additional free storage over and above the storage cushion, to avoid the possibility of driving full system storage contraction. Generally speaking, this basic storage cushion needs to be about 100 megabytes to allow for some growth and some margin of error. Having done that, the idea is to project how many active threads can be supported, and then set CTHREAD and MAXDBAT to realistic values that are in line with the values you projected. The final piece of the jigsaw here is to balance the design across the CICS AORs connecting to DB2 with the number of threads that can be supported by DB2. So you need both a bottom-up and a top-down approach to make sure that the DB2 subsystem can support the defined number of threads without getting into trouble, and that the number of connections across all the CICS AORs connecting to that DB2 subsystem can be supported. Now let's turn to slide 19.

Slide 19 (31:51) Starting with version 9, DB2 introduced an additional function, a DB2 internal monitor, that issues messages when DB2 starts getting short on DBM1 31-bit storage. This DB2 internal monitor runs inside the master address space and automatically issues console message DSNV508I when DBM1 31-bit virtual storage crosses particular thresholds. The message is generated when usage goes past a threshold and also when it comes back down below that threshold. Here on the chart, I talk about increasing or decreasing with respect to the thresholds. When the actual storage gets depleted and goes over the threshold, the DSNV508I message is generated immediately. On the other hand, when the storage usage goes back below the threshold, there's a three-minute delay before the message is issued. It's most important to have the PTF for APAR PM38435 applied. The reason is that when this internal monitor was initially implemented, it did not take into account the storage cushion when applying the percentages. Here's an example based on the picture: here you see a DSNV508I message, and it's telling you

that the storage notification indicates that 77 percent is consumed, and 76 percent is consumed by DB2, leaving 352 MB left. Now, the first threshold that is used by DB2 to generate the DSNV508I message is actually 88 percent. That 88 percent is not 88 percent of the 31-bit region; it's 88 percent of the 31-bit region less the storage cushion. So this message, issued at the 88 percent threshold, represents 77 percent consumed of the total region size. The message also identifies the agents that consume the most storage. As a customer, you can get the status at any time by using the DISPLAY THREAD(*) TYPE(SYSTEM) command. You have an example of the message that's generated at the bottom of the chart. It tells you how much storage is used in the whole region, how many times threads were delayed holding a latch or were given a boost, and it also indicates the health of the system. If the DBM1 address space is short of available virtual storage, then the value of the health indicator will go below 100 percent. This will encourage sysplex workload balancing for DDF workloads to move work away from that member. Now let's turn to slide 20.

Slide 20 (34:49) Further recommendations here. One of the things that I've already given you is that it is absolutely important, of paramount importance, to configure sufficient real storage to get the best performance and to quickly capture diagnostics. It's not a good practice to configure the available real storage based on normal operating conditions only, and then rely on DASD paging space to absorb the peaks. You need to configure enough real memory to deal with the normal operating conditions plus the peaks and be able to take a dump. So, to add a bit more definition to this, you need to provision sufficient real storage to cover both the normal DB2 working set size and the MAXSPACE requirement. MAXSPACE typically needs to be up to eight or nine gigabytes in version 9 to get a full dump, and for version 10 this value is somewhere between twelve and sixteen gigabytes. If you undersize MAXSPACE, this will result in partial dumps, and this will seriously compromise problem determination and problem source identification. In order to protect the availability of production services on the same LPAR and on other LPARs in the data sharing group, dumps should be taken very quickly, i.e., in less than ten seconds, almost without anybody noticing, and with little or no disruption to the subject LPAR and to the rest of the Parallel Sysplex. You may also want to consider automation to kill dumps taking longer than ten seconds.
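As a small illustration of the MAXSPACE point (16000M is only an example value in the twelve-to-sixteen-gigabyte range quoted above for version 10; size it for your own dump requirements), the current setting can be displayed and raised with standard MVS console commands:

```
D D,O
CD SET,SDUMP,MAXSPACE=16000M
```

The first command displays the SVC dump options currently in effect, including MAXSPACE; the second raises MAXSPACE for the life of the IPL. To make the value permanent, the same CD SET command is typically added to a system command parmlib member so that it is issued automatically at each IPL.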

Now let's turn to slide 21.

Slide 21 (36:27) Thread storage contraction helps protect the availability of the system, and it applies to the 31-bit virtual storage in the DBM1 address space. So there's a strong recommendation, particularly in version 9, but this also applies to version 10, to run with CONTSTOR=YES. With YES, DB2 will compress out part of the agent local non-system storage based on the number of commits and the thread size. The overhead is fairly low: it's a maximum of one compress for every five commits, so it's very cheap to implement. What I'd like to point out is that thread storage contraction is ineffective for long-running persistent threads with use of RELEASE(DEALLOCATE). Once you've actually migrated to DB2 version 10 and rebound your static SQL packages and plans, you can turn off CONTSTOR thread storage contraction, so you actually save some CPU. In DB2 version 10, there's a new piece of function called the real storage monitor, which enables DISCARD mode to free up unused frames back to the operating system. This has the effect of contracting storage and protecting the system against excessive paging and use of auxiliary storage. This was introduced in DB2 APAR PM24723, and it has a prerequisite z/OS APAR (OA…). It's controlled by a new system parameter, or zparm, called REALSTORAGE_MANAGEMENT, and it has three particular values. The default is AUTO, and AUTO is strongly recommended. With AUTO, DB2 detects if excessive paging is imminent and tries to reduce the frame count to avoid system paging. It basically toggles between on and off, a bit like the thermostat on your air conditioner or central heating system. It will cut in when the system is in trouble and go into DISCARD mode to free up frames to the operating system, and it will go back into off mode once the condition is relieved. One slight problem that we ran into was that some customers reported a CPU increase when they were running multiple DB2 version 10 subsystems on the same LPAR. This is because the underlying RSM service that DB2 is using takes a CPU spin lock. So there's a new z/OS APAR, OA37821, that provides a new option on the RSM service that is used by DB2, and DB2 APAR PM49816 uses that new option to avoid this CPU burn.
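For reference, a minimal sketch of how the two parameters just discussed might appear among the DSN6SPRM keywords in the DSNTIJUZ subsystem parameter job (an excerpt only; the exact column and continuation formatting must follow your existing member, and the keyword spellings shown are based on the names given above):

```
*  Excerpt of DSN6SPRM keywords (illustrative only)
         CONTSTOR=YES,
         REALSTORAGE_MANAGEMENT=AUTO,
```

After reassembling and relinking the parameter module, the new values take effect at the next restart of DB2, or via the -SET SYSPARM command where the parameter is online changeable.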

Now let's turn to slide 22.

Slide 22 (39:21) Real storage needs to be monitored as much as virtual storage. Important subsystems like DB2 should not be paging to auxiliary storage in a production environment. The emphasis is on production. It's quite common in a pre-production development environment to over-commit available resources, but in a real production environment, where performance and availability are of paramount importance, we need to avoid paging to auxiliary storage. The recommendations here are to keep the page-in rates near zero, actively monitoring using RMF Monitor III to basically avoid paging, and to monitor the DB2 page-ins for reads and writes and also make sure the output log buffer is not being paged. As previously discussed, you want to collect IFCID 225 data from DB2 start time through to DB2 shutdown time. Any high use of auxiliary storage needs to be investigated. In other words, what time did it happen, and what triggering events, either inside or, as is often the case, outside of DB2, were driving DB2 to be pushed out to auxiliary storage? At the bottom of slide 22 is an extract from an OMPE report for DB2 version 9, which tells you how much real storage is being used and how much auxiliary storage is being used. In this example, there is no auxiliary storage being used by DB2. Now let's turn to slide 23.

Slide 23 (40:50) On slide 23, we have an OMPE batch report for version 10. As you can see straight away, we have much more comprehensive reporting of real and auxiliary storage starting with version 10. At the top left-hand side, we have the real and auxiliary storage for the DBM1 address space; on the top right-hand side, we have the real and auxiliary storage for the distributed address space, DDF; and the middle section shows the real and auxiliary storage for the shared private storage. We also give it to you at the LPAR level, and finally, at the bottom left-hand side, we also give it for the common storage. In all of these sections of the report, we report the 31-bit storage below the bar and also the 64-bit storage above the bar, broken out into both real and auxiliary for each section. Now let's turn to slide 24.

Slide 24 (41:49) How can we limit real storage usage? In version 9, we had a hidden system parameter, or zparm, called SPRMRSMX. I affectionately refer to this as the real storage kill switch. It was originally delivered in APAR PK…. It's been a secret and was not widely broadcast, and only a handful of customers were using it. Why was it introduced? It was introduced to prevent a runaway DB2 subsystem from taking the LPAR down and affecting other DB2 subsystems running on the LPAR and other MVS subsystems. It is only applicable to a customer who runs

multiple DB2 subsystems from the same data sharing group, or even different data sharing groups, on the same LPAR. The aim here is to prevent multiple outages caused by a single DB2 subsystem. In other words, you're prepared to sacrifice one DB2 subsystem in order to protect the availability of the other DB2 subsystems on that same LPAR. The general recommendation is to set the SPRMRSMX value to 1.5 to 2 times the normal DB2 subsystem usage. You need to be careful here, because if you set the value too small, there will be too many false positives where you'll take out the subsystem when you shouldn't have. On the other hand, if you set the value too high, the LPAR will die before the subsystem is sacrificed. So the general recommendation is: if the buffer pools are fairly large, that is, they represent a large amount of the DB2 working set size, then the multiplier should be about 1.5x. On the other hand, if the buffer pools are small and represent a small percentage of the overall working set size, then the value needed tends to be more toward 2x. What will happen is that when the real storage kill switch value is reached, the DB2 subsystem will abend. Now, in DB2 version 10, the real storage kill switch is actually formalized, so the hidden zparm becomes an opaque zparm called REALSTORAGE_MAX. For those customers who want to use this value, not only do you have to factor in the 31-bit storage, you also now need to factor in the 64-bit shared and common usage to establish a new footprint, and you have to increase the size of the real storage kill switch accordingly. Now let's turn to slide 25.

Slide 25 (44:27) On slide 25, I want to cover the last topic, the third topic in this web lecture: performance- and exception-based monitoring. To begin with, I want to talk about common problems. Many organizations are operating mostly in fire-fighting mode, reacting to today's performance problems, with every day being a surprise and having to react. In many cases, there's missing performance data or a lack of granularity therein, which limits the ability to drive a problem to root cause, and you may need to wait for a recurrence of the problem to get the tracing data you need or the granularity that you need. The majority of customer organizations do not have a performance database, and even if they do, they make very limited use of it. Without a performance database, there's no baseline for either DB2 system performance or DB2 application performance, and no basis for doing trend analysis. Most

organizations do not have a near-term history capability with their online monitor, and they haven't implemented any DB2 exception monitoring. The idea of exception monitoring is that you can identify, based on some rules of thumb, whether there are out-of-line conditions and basically filter for those out-of-line conditions. The idea is to get an early warning of problem conditions so that you have time to react or troubleshoot the problem quicker. Without exception monitoring, nobody knows there's an issue until the situation escalates either into a very bad performance situation or a serious availability situation. Ultimately, not having exception monitoring delays problem determination and problem source identification. Lastly, many organizations have an increasing amount of their applications using dynamic SQL from .NET, ODBC, and JDBC applications, and they have limited control over, and understanding of, the performance of those applications. Now let's turn to slide 26.

Slide 26 (46:37) First of all, I have some recommendations for data collection. The first rule is to set SMFSTAT=YES. This is the default, and it collects information for statistics trace classes 1, 3, 4, and 5. Every customer should have that set. There's also an additional trace class, statistics class 8, which gives you data set I/O statistics. I find that particular trace very useful. A lot of customers are concerned about the trace volume, but in actual fact we bucket multiple data sets into each record and only record statistics when there's at least one I/O per second. So one myth about stats class 8 is that it generates lots of SMF data; in actual fact, it does not. If it were my installation, I would collect statistics trace class 8 in addition to stats classes 1, 3, 4, and 5. In version 9, I have a strong recommendation to set the statistics interval STATIME equal to 1. This is highly valuable and essential to study the evolutionary trends which lead to complex system performance problems like slowdowns. Again, a lot of people have raised the myth that setting STATIME to 1 will generate a huge amount of SMF data. This is actually not true: there are only 1,440 minutes in a day, so that's the number of intervals with a STATIME of 1. I would recommend to you that it's very valuable and that all customers should set it to 1. Having said that, in DB2 version 10 the basic statistics records, IFCIDs 2, 202, 217, 225, and 230, are always cut at a one-minute interval. They're no longer

controlled by the STATIME parameter. So the best practice recommendation for version 9 is now implemented and hardwired in version 10. The statistics interval, or STATIME parameter, is still there in version 10, and it controls the frequency of the other IFCIDs: 105, 106, and 199. The other recommendation on this slide is to copy away the SMF 100 records (these are the records for statistics) and to keep them in a separate file. The new file is relatively small, it's much easier to post-process that data, and it's also much easier to send it to the DB2 lab to deal with PMRs. The whole goal here is to improve the elapsed time to post-process the data, both in your environment and at the DB2 lab. Now let's turn to slide 27.

Slide 27 (49:25) Now, when it comes to accounting, we strongly recommend that you collect accounting trace classes 1, 2, 3, 7, and 8. There's also an accounting class 10, but this is relatively expensive and so is typically run only for short periods of time; in other words, accounting class 10 is not run on a permanent basis. There are options to consider if the SMF data volume is an issue. Many customers are now starting to record their SMF data to the z/OS System Logger, which basically streams the SMF data rather than writing it directly to VSAM files. The System Logger improves both performance and throughput. In DB2 version 10, we have now enabled DB2 compression of SMF data, so that any instrumentation records written to the SMF destination are now subject to DB2 compression. Experience has been that the accounting records, which represent the bulk of the SMF data volume, compress to about 70 to 80%. The overhead of the compression is relatively tiny at approximately 1%. We also have accounting rollup, controlled by the DB2 system parameter called ACCUMACC. This applies only to DDF and the RRS attach. It's worth pointing out that there's no effective package-level rollup reporting prior to version 10, so in version 10, for the first time, we get accurate package-level rollup accounting. The one general disadvantage of accounting rollup is that you lose granularity. In many performance problems, the problem is not pervasive for each and every transaction. When you have the odd outlying transaction that is performing badly, by using accounting rollup you have actually lost the data for that outlying transaction; in other words, the impact of that outlying transaction is lost in the rollup. So many people in version 10 may decide to make a trade: turning off accounting rollup, but at the same time enabling SMF compression in order to reduce the volume.
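As a minimal sketch of the slide 26 recommendation to copy the SMF 100 (statistics) records into a separate, smaller file (all data set names and allocation parameters here are assumptions; substitute your installation's SMF dump data sets):

```
//SMFCOPY  JOB (ACCT),'COPY SMF 100',CLASS=A,MSGCLASS=X
//* Extract only the DB2 statistics records (SMF type 100) from the
//* daily SMF dump data set into a small, easy-to-ship file.
//COPY     EXEC PGM=IFASMFDP
//DUMPIN   DD DISP=SHR,DSN=SYS1.SMFDUMP.DAILY
//DUMPOUT  DD DSN=DB2PERF.SMF100.DAILY,DISP=(NEW,CATLG,DELETE),
//            UNIT=SYSDA,SPACE=(CYL,(50,50),RLSE),
//            DCB=(RECFM=VBS,LRECL=32760,BLKSIZE=0)
//SYSPRINT DD SYSOUT=*
//SYSIN    DD *
  INDD(DUMPIN,OPTIONS(DUMP))
  OUTDD(DUMPOUT,TYPE(100))
/*
```

The same job could add a second OUTDD with TYPE(101) if you also want the accounting records separated out, although, as noted above, they represent the bulk of the SMF volume.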

Now let's turn to slide 28.

Slide 28 (51:40) It's pretty important here to proactively monitor and review DB2 metrics on a regular basis, and to develop an automated process to store away the performance data into DB2 tables, at least the DB2 statistics data, which is relatively low volume, and to do this without any aggregation, and then to invest in a set of canned reports to transform the DB2 statistics data into real information that can be used. What you want to be able to do here is track the evolutionary trends of these key performance indicators at the DB2 system level from startup to shutdown, and then generate either red alerts or amber alerts based on out-of-normal conditions as and when they occur. This also becomes the baseline for further analysis. Now, there are some additional resources to help you. There's a series of web lectures related to optimizing DB2 for z/OS system performance using the DB2 statistics trace, available from the same web site. There's also the one-day seminar written by myself and Florence Dubois. In both cases, this material provides a list of performance metrics and rules of thumb in the areas of buffer pools, group buffer pools, lock/latch contention, etcetera. Those webcasts are also available on the DB2 for z/OS best practices web site using the link at the bottom of slide 28.
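As a minimal sketch of the kind of canned exception report just described, assume the statistics data has been loaded, unaggregated, into a hypothetical table PERFDB.DB2_STATS_INTERVAL; the table, its columns, and the thresholds below are all illustrative, with real rules of thumb coming from the seminar material mentioned above:

```sql
-- Flag one-minute statistics intervals from the last 24 hours where a
-- hypothetical DBM1 31-bit usage column crosses illustrative thresholds.
SELECT INTERVAL_TS,
       MEMBER_NAME,
       DBM1_31BIT_USED_MB,
       CASE
         WHEN DBM1_31BIT_USED_MB > 1400 THEN 'RED'
         WHEN DBM1_31BIT_USED_MB > 1200 THEN 'AMBER'
         ELSE 'OK'
       END AS ALERT_LEVEL
FROM   PERFDB.DB2_STATS_INTERVAL
WHERE  INTERVAL_TS > CURRENT TIMESTAMP - 24 HOURS
  AND  MEMBER_NAME = 'DB2A'
ORDER BY INTERVAL_TS;
```

A small family of such queries, one per key performance indicator, run on a schedule with the results logged, gives the red/amber alert history and baseline described above.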

Now let's turn to slide 29.

Slide 29 (53:16) Another recommendation is to enable near-term history collection in your DB2 online monitor. This provides the ability to retrieve and review DB2 statistics and accounting records for the past few hours of DB2 processing and gives you the ability to intercept adverse changing trends. It's also important to have effective exception monitoring at the DB2 system and application level to track key statistics performance indicators, both in your online monitor, to generate alerts on outlying conditions, and also out of the performance database. The performance metrics to use and the rules of thumb are covered in the performance one-day seminar; that could provide a good starting point. Also, whether the alerts are generated by the online monitor or from the performance database, keep a log and history of the alerts and analyze the trends. The other objectives here, beyond exception monitoring, are to get the virtual support teams engaged sooner, to help avoid, if possible, a performance problem escalating into a serious performance incident, and to help avoid extended recovery times, performance problems, and application availability issues. It also provides a way of narrowing down the scope of the problem to speed up the reaction time during problem determination and problem source identification. It also provides the vehicle to understand weak points and to provide strategic remedial actions. Now let's turn to slide 30.

Slide 30 (54:54) On slide 30 is a list of recommendations about enhancing the capture of diagnostic data. First of all, at one-minute intervals, we believe you should collect, through automation, the output of DISPLAY THREAD with SERVICE(WAIT); so use systems automation to drive the DISPLAY THREAD SERVICE(WAIT) command at a one-minute interval and then save the output away. Similarly, through systems automation, drive the other list of commands shown on the slide at 15-minute intervals and save the outputs away again. The objective here is to detect and correct as fast as possible, and to provide diagnostics so that when you investigate problems you can study this diagnostic information for the time leading up to the problem and go back in time to see what led up to the particular problem.
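A minimal sketch of the commands such automation might issue (DB2A is an assumed command prefix, and the three 15-minute commands are examples only; use the actual list from slide 30):

```
-DB2A DISPLAY THREAD(*) SERVICE(WAIT)
-DB2A DISPLAY DATABASE(*) SPACENAM(*) RESTRICT LIMIT(*)
-DB2A DISPLAY UTILITY(*)
-DB2A DISPLAY GROUP DETAIL
```

The first command is the one driven at the one-minute interval; the others illustrate the kind of commands that might run every 15 minutes, with each response time-stamped and saved so it can be reviewed for the period leading up to a problem.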

Now let's turn to slide 31.

Slide 31 (55:50) In the last part of this web lecture, I want to talk about exploiting the dynamic statement cache for .NET, ODBC, and JDBC applications. One of the trends of the last five to ten years has been the growth in workload coming through DDF (.NET, ODBC, and JDBC), and also growth in the amount of mission-critical workload coming through DDF with dynamic SQL. What I want to do here, as part of this web lecture, is encourage you to exploit the dynamic statement cache, which is a goldmine of information for performance analysis and for shooting performance problems. Here I have a sample procedure which I'd like to introduce and discuss to enable you to explain statements from the DB2 dynamic statement cache. We have five steps here. First, create the table DSN_STATEMENT_CACHE_TABLE and the EXPLAIN tables, including DSN_FUNCTION_TABLE; remember that the DSNTESC member of the DB2 sample library gives you sample DDL for those tables. Secondly, start the collection of the dynamic statement cache performance statistics. This is done by starting a performance trace using user-defined trace class 30 and specifying IFCIDs 316, 317, and 318. The third step is to use the EXPLAIN STMTCACHE ALL statement to extract the SQL statements from the global cache and dump their statistics into the table DSN_STATEMENT_CACHE_TABLE. All the statements in the cache are included if EXPLAIN is executed by SYSADM; otherwise, the statements exposed into the table are only those with matching authorization. Fourth, from the cache, you can generate an individual EXPLAIN for each SQL statement of interest by saying EXPLAIN STMTCACHE STMTID, specifying the statement ID from that table. And the fifth step is to import the contents of the four tables into a spreadsheet for post-processing. A short worked sketch of these five steps follows below. So that completes this web lecture on operational best practices. Thank you for listening. (58:02)
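As referenced above, here is a minimal worked sketch of steps 2 through 4 (the DB2A command prefix, the statement ID 1234, and the selected columns are illustrative; the DDL for step 1 comes from the DSNTESC sample member):

```sql
-- Step 2 (DB2 command, issued from the console or via DSN):
--   -DB2A START TRACE(PERFM) CLASS(30) IFCID(316,317,318) DEST(SMF)

-- Step 3: snapshot the global dynamic statement cache into
--         DSN_STATEMENT_CACHE_TABLE under your authorization ID.
EXPLAIN STMTCACHE ALL;

-- Find the heaviest statements captured by the snapshot.
SELECT STMT_ID, STAT_EXEC, STAT_CPU, STAT_ELAP, STMT_TEXT
FROM   DSN_STATEMENT_CACHE_TABLE
ORDER  BY STAT_CPU DESC
FETCH  FIRST 20 ROWS ONLY;

-- Step 4: explain one statement of interest by its statement ID
--         (1234 is an illustrative value taken from the query above).
EXPLAIN STMTCACHE STMTID 1234;
```

Step 5 is then simply an unload or export of DSN_STATEMENT_CACHE_TABLE and the populated EXPLAIN tables into a spreadsheet for post-processing.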


More information

Microsoft SQL Server Fix Pack 15. Reference IBM

Microsoft SQL Server Fix Pack 15. Reference IBM Microsoft SQL Server 6.3.1 Fix Pack 15 Reference IBM Microsoft SQL Server 6.3.1 Fix Pack 15 Reference IBM Note Before using this information and the product it supports, read the information in Notices

More information

The Dark Arts of MQ SMF Evaluation

The Dark Arts of MQ SMF Evaluation The Dark Arts of MQ SMF Evaluation Lyn Elkins elkinsc@us.ibm.com Session # 13884 August 13, 2013 Code! 1 The witch trial MQ is broken! Agenda Review of SMF 115 and SMF 116 class 3 data Hunting down the

More information

Optimizing Insert Performance - Part 1

Optimizing Insert Performance - Part 1 Optimizing Insert Performance - Part 1 John Campbell Distinguished Engineer DB2 for z/os development CAMPBELJ@uk.ibm.com 2 Disclaimer/Trademarks The information contained in this document has not been

More information

XP: Backup Your Important Files for Safety

XP: Backup Your Important Files for Safety XP: Backup Your Important Files for Safety X 380 / 1 Protect Your Personal Files Against Accidental Loss with XP s Backup Wizard Your computer contains a great many important files, but when it comes to

More information

An A-Z of System Performance for DB2 for z/os

An A-Z of System Performance for DB2 for z/os Phil Grainger, Lead Product Manager BMC Software March, 2016 An A-Z of System Performance for DB2 for z/os The Challenge Simplistically, DB2 will be doing one (and only one) of the following at any one

More information

Collecting Cached SQL Data and Its Related Analytics. Gerald Hodge HLS Technologies, Inc.

Collecting Cached SQL Data and Its Related Analytics. Gerald Hodge HLS Technologies, Inc. Collecting Cached SQL Data and Its Related Analytics Gerald Hodge HLS Technologies, Inc. Agenda Quick Review of SQL Prepare CACHEDYN=YES and KEEPDYNAMIC=YES CACHEDYN=YES and KEEPDYNAMIC=YES with COMMIT

More information

Memory for MIPS: Leveraging Big Memory on System z to Enhance DB2 CPU Efficiency

Memory for MIPS: Leveraging Big Memory on System z to Enhance DB2 CPU Efficiency Robert Catterall, IBM rfcatter@us.ibm.com Memory for MIPS: Leveraging Big Memory on System z to Enhance DB2 CPU Efficiency Midwest DB2 Users Group December 5, 2013 Information Management Agenda The current

More information

The Major CPU Exceptions in EPV Part 2

The Major CPU Exceptions in EPV Part 2 The Major CPU Exceptions in EPV Part 2 Mark Cohen Austrowiek EPV Technologies April 2014 6 System capture ratio The system capture ratio is an inverted measure of the internal system overhead. So the higher

More information

Optimizing Parallel Access to the BaBar Database System Using CORBA Servers

Optimizing Parallel Access to the BaBar Database System Using CORBA Servers SLAC-PUB-9176 September 2001 Optimizing Parallel Access to the BaBar Database System Using CORBA Servers Jacek Becla 1, Igor Gaponenko 2 1 Stanford Linear Accelerator Center Stanford University, Stanford,

More information

Fully Optimize FULLY OPTIMIZE YOUR DBA RESOURCES

Fully Optimize FULLY OPTIMIZE YOUR DBA RESOURCES Fully Optimize FULLY OPTIMIZE YOUR DBA RESOURCES IMPROVE SERVER PERFORMANCE, UPTIME, AND AVAILABILITY WHILE LOWERING COSTS WE LL COVER THESE TOP WAYS TO OPTIMIZE YOUR RESOURCES: 1 Be Smart About Your Wait

More information

Three requirements for reducing performance issues and unplanned downtime in any data center

Three requirements for reducing performance issues and unplanned downtime in any data center Three requirements for reducing performance issues and unplanned downtime in any data center DARRYL FUJITA TECHNICAL SOFTWARE SOLUTIONS SPECIALIST HITACHI DATA SYSTEMS How Big Is The Cost Of Unplanned

More information

Six Sigma in the datacenter drives a zero-defects culture

Six Sigma in the datacenter drives a zero-defects culture Six Sigma in the datacenter drives a zero-defects culture Situation Like many IT organizations, Microsoft IT wants to keep its global infrastructure available at all times. Scope, scale, and an environment

More information

Musewerx support for Application Maintenance in Software AG NATURAL and ADABAS TM environment

Musewerx support for Application Maintenance in Software AG NATURAL and ADABAS TM environment Musewerx support for Application Maintenance in Software AG NATURAL and ADABAS TM environment Musewerx provides Application Maintenance Services for your applications written in NATURAL and ADABAS environment.

More information

vrealize Operations Manager User Guide Modified on 17 AUG 2017 vrealize Operations Manager 6.6

vrealize Operations Manager User Guide Modified on 17 AUG 2017 vrealize Operations Manager 6.6 vrealize Operations Manager User Guide Modified on 17 AUG 2017 vrealize Operations Manager 6.6 vrealize Operations Manager User Guide You can find the most up-to-date technical documentation on the VMware

More information

Craig S. Mullins. A DB2 for z/os Performance Roadmap By Craig S. Mullins. Database Performance Management Return to Home Page.

Craig S. Mullins. A DB2 for z/os Performance Roadmap By Craig S. Mullins. Database Performance Management Return to Home Page. Craig S. Mullins Database Performance Management Return to Home Page December 2002 A DB2 for z/os Performance Roadmap By Craig S. Mullins Assuring optimal performance is one of a database administrator's

More information

Monitor Qlik Sense sites. Qlik Sense Copyright QlikTech International AB. All rights reserved.

Monitor Qlik Sense sites. Qlik Sense Copyright QlikTech International AB. All rights reserved. Monitor Qlik Sense sites Qlik Sense 2.1.2 Copyright 1993-2015 QlikTech International AB. All rights reserved. Copyright 1993-2015 QlikTech International AB. All rights reserved. Qlik, QlikTech, Qlik Sense,

More information

vrealize Operations Manager User Guide

vrealize Operations Manager User Guide vrealize Operations Manager User Guide vrealize Operations Manager 6.2 This document supports the version of each product listed and supports all subsequent versions until the document is replaced by a

More information

Monitoring Tool Made to Measure for SharePoint Admins. By Stacy Simpkins

Monitoring Tool Made to Measure for SharePoint Admins. By Stacy Simpkins Monitoring Tool Made to Measure for SharePoint Admins By Stacy Simpkins Contents About the Author... 3 Introduction... 4 Who s it for and what all can it do?... 4 SysKit Insights Features... 6 Drillable

More information

Introduction to DB2 11 for z/os

Introduction to DB2 11 for z/os Chapter 1 Introduction to DB2 11 for z/os This chapter will address the job responsibilities of the DB2 system administrator, what to expect on the IBM DB2 11 System Administrator for z/os certification

More information

Solution Pack. Managed Services Virtual Private Cloud Managed Database Service Selections and Prerequisites

Solution Pack. Managed Services Virtual Private Cloud Managed Database Service Selections and Prerequisites Solution Pack Managed Services Virtual Private Cloud Managed Database Service Selections and Prerequisites Subject Governing Agreement Term DXC Services Requirements Agreement between DXC and Customer

More information

Oracle Enterprise Manager 12c Sybase ASE Database Plug-in

Oracle Enterprise Manager 12c Sybase ASE Database Plug-in Oracle Enterprise Manager 12c Sybase ASE Database Plug-in May 2015 Safe Harbor Statement The following is intended to outline our general product direction. It is intended for information purposes only,

More information

Build a system health check for Db2 using IBM Machine Learning for z/os

Build a system health check for Db2 using IBM Machine Learning for z/os Build a system health check for Db2 using IBM Machine Learning for z/os Jonathan Sloan Senior Analytics Architect, IBM Analytics Agenda A brief machine learning overview The Db2 ITOA model solutions template

More information

End to End Analysis on System z IBM Transaction Analysis Workbench for z/os. James Martin IBM Tools Product SME August 10, 2015

End to End Analysis on System z IBM Transaction Analysis Workbench for z/os. James Martin IBM Tools Product SME August 10, 2015 End to End Analysis on System z IBM Transaction Analysis Workbench for z/os James Martin IBM Tools Product SME August 10, 2015 Please note IBM s statements regarding its plans, directions, and intent are

More information

IBM IMS Database Solution Pack for z/os Version 2 Release 1. Overview and Customization IBM SC

IBM IMS Database Solution Pack for z/os Version 2 Release 1. Overview and Customization IBM SC IBM IMS Database Solution Pack for z/os Version 2 Release 1 Overview and Customization IBM SC19-4007-04 IBM IMS Database Solution Pack for z/os Version 2 Release 1 Overview and Customization IBM SC19-4007-04

More information

IBM Tivoli OMEGAMON XE for Storage on z/os Version Tuning Guide SC

IBM Tivoli OMEGAMON XE for Storage on z/os Version Tuning Guide SC IBM Tivoli OMEGAMON XE for Storage on z/os Version 5.1.0 Tuning Guide SC27-4380-00 IBM Tivoli OMEGAMON XE for Storage on z/os Version 5.1.0 Tuning Guide SC27-4380-00 Note Before using this information

More information

High Availability through Warm-Standby Support in Sybase Replication Server A Whitepaper from Sybase, Inc.

High Availability through Warm-Standby Support in Sybase Replication Server A Whitepaper from Sybase, Inc. High Availability through Warm-Standby Support in Sybase Replication Server A Whitepaper from Sybase, Inc. Table of Contents Section I: The Need for Warm Standby...2 The Business Problem...2 Section II:

More information

ICANN and Technical Work: Really? Yes! Steve Crocker DNS Symposium, Madrid, 13 May 2017

ICANN and Technical Work: Really? Yes! Steve Crocker DNS Symposium, Madrid, 13 May 2017 ICANN and Technical Work: Really? Yes! Steve Crocker DNS Symposium, Madrid, 13 May 2017 Welcome, everyone. I appreciate the invitation to say a few words here. This is an important meeting and I think

More information

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 23 Hierarchical Memory Organization (Contd.) Hello

More information

vrealize Operations Manager User Guide 11 OCT 2018 vrealize Operations Manager 7.0

vrealize Operations Manager User Guide 11 OCT 2018 vrealize Operations Manager 7.0 vrealize Operations Manager User Guide 11 OCT 2018 vrealize Operations Manager 7.0 You can find the most up-to-date technical documentation on the VMware website at: https://docs.vmware.com/ If you have

More information

Optimize Your Databases Using Foglight for Oracle s Performance Investigator

Optimize Your Databases Using Foglight for Oracle s Performance Investigator Optimize Your Databases Using Foglight for Oracle s Performance Investigator Solve performance issues faster with deep SQL workload visibility and lock analytics Abstract Get all the information you need

More information

Sysplex: Key Coupling Facility Measurements Cache Structures. Contact, Copyright, and Trademark Notices

Sysplex: Key Coupling Facility Measurements Cache Structures. Contact, Copyright, and Trademark Notices Sysplex: Key Coupling Facility Measurements Structures Peter Enrico Peter.Enrico@EPStrategies.com 813-435-2297 Enterprise Performance Strategies, Inc (z/os Performance Education and Managed Service Providers)

More information

vrealize Operations Manager User Guide

vrealize Operations Manager User Guide vrealize Operations Manager User Guide vrealize Operations Manager 6.5 This document supports the version of each product listed and supports all subsequent versions until the document is replaced by a

More information

ABSTRACTING CONNECTIVITY FOR IOT WITH A BACKHAUL OPERATOR

ABSTRACTING CONNECTIVITY FOR IOT WITH A BACKHAUL OPERATOR ABSTRACTING CONNECTIVITY FOR IOT WITH A BACKHAUL OPERATOR NIGEL CHADWICK VIDEO TRANSCRIPT Welcome! What s your name and what do you do? Hi, it s Nigel Chadwick. I m one of the founders of Stream Technologies.

More information

DB2 and Memory Exploitation. Fabio Massimo Ottaviani - EPV Technologies. It s important to be aware that DB2 memory exploitation can provide:

DB2 and Memory Exploitation. Fabio Massimo Ottaviani - EPV Technologies. It s important to be aware that DB2 memory exploitation can provide: White Paper DB2 and Memory Exploitation Fabio Massimo Ottaviani - EPV Technologies 1 Introduction For many years, z/os and DB2 system programmers have been fighting for memory: the former to defend the

More information

<Insert Picture Here> Managing Oracle Exadata Database Machine with Oracle Enterprise Manager 11g

<Insert Picture Here> Managing Oracle Exadata Database Machine with Oracle Enterprise Manager 11g Managing Oracle Exadata Database Machine with Oracle Enterprise Manager 11g Exadata Overview Oracle Exadata Database Machine Extreme ROI Platform Fast Predictable Performance Monitor

More information

THE STATE OF IT TRANSFORMATION FOR RETAIL

THE STATE OF IT TRANSFORMATION FOR RETAIL THE STATE OF IT TRANSFORMATION FOR RETAIL An Analysis by Dell EMC and VMware Dell EMC and VMware are helping IT groups at retail organizations transform to business-focused service providers. The State

More information

Three Key Considerations for Your Public Cloud Infrastructure Strategy

Three Key Considerations for Your Public Cloud Infrastructure Strategy GOING PUBLIC: Three Key Considerations for Your Public Cloud Infrastructure Strategy Steve Follin ISG WHITE PAPER 2018 Information Services Group, Inc. All Rights Reserved The Market Reality The race to

More information

Controlling Costs and Driving Agility in the Datacenter

Controlling Costs and Driving Agility in the Datacenter Controlling Costs and Driving Agility in the Datacenter Optimizing Server Infrastructure with Microsoft System Center Microsoft Corporation Published: November 2007 Executive Summary To help control costs,

More information

DB2 Performance Health Check... in just few minutes. DUGI 8-9 April 2014

DB2 Performance Health Check... in just few minutes. DUGI 8-9 April 2014 DB2 Performance Health Check... in just few minutes DUGI 8-9 April 2014 Introduction DB2 is the preferred repository for mission critical data at all z/os sites Performance of z/os and non z/os based applications

More information

CICS insights from IT professionals revealed

CICS insights from IT professionals revealed CICS insights from IT professionals revealed A CICS survey analysis report from: IBM, CICS, and z/os are registered trademarks of International Business Machines Corporation in the United States, other

More information

Oracle Diagnostics Pack For Oracle Database

Oracle Diagnostics Pack For Oracle Database Oracle Diagnostics Pack For Oracle Database ORACLE DIAGNOSTICS PACK FOR ORACLE DATABASE Oracle Enterprise Manager is Oracle s integrated enterprise IT management product line, and provides the industry

More information

The Total Network Volume chart shows the total traffic volume for the group of elements in the report.

The Total Network Volume chart shows the total traffic volume for the group of elements in the report. Tjänst: Network Health Total Network Volume and Total Call Volume Charts Public The Total Network Volume chart shows the total traffic volume for the group of elements in the report. Chart Description

More information

WHY BUILDING SECURITY SYSTEMS NEED CONTINUOUS AVAILABILITY

WHY BUILDING SECURITY SYSTEMS NEED CONTINUOUS AVAILABILITY WHY BUILDING SECURITY SYSTEMS NEED CONTINUOUS AVAILABILITY White Paper 2 Why Building Security Systems Need Continuous Availability Always On Is the Only Option. If All Systems Go Down, How Can You React

More information

How IBM Can Identify z/os Networking Issues without tracing

How IBM Can Identify z/os Networking Issues without tracing How IBM Can Identify z/os Networking Issues without tracing Wed, August 12, 1:45-2:45 Session 17536 Speakers: Ernie Gilman, IBM (egilman@us.ibm.com) Dean Butler, IBM (butlerde@us.ibm.com) Abstract Running

More information

How Microsoft IT Reduced Operating Expenses Using Virtualization

How Microsoft IT Reduced Operating Expenses Using Virtualization How Microsoft IT Reduced Operating Expenses Using Virtualization Published: May 2010 The following content may no longer reflect Microsoft s current position or infrastructure. This content should be viewed

More information

CA OPS/MVS Event Management and Automation

CA OPS/MVS Event Management and Automation CA OPS/MVS Event Management and Automation Best Practices Guide Release 12.1 This Documentation, which includes embedded help systems and electronically distributed materials, (hereinafter referred to

More information

IBM InfoSphere Streams v4.0 Performance Best Practices

IBM InfoSphere Streams v4.0 Performance Best Practices Henry May IBM InfoSphere Streams v4.0 Performance Best Practices Abstract Streams v4.0 introduces powerful high availability features. Leveraging these requires careful consideration of performance related

More information

Virtualization. Q&A with an industry leader. Virtualization is rapidly becoming a fact of life for agency executives,

Virtualization. Q&A with an industry leader. Virtualization is rapidly becoming a fact of life for agency executives, Virtualization Q&A with an industry leader Virtualization is rapidly becoming a fact of life for agency executives, as the basis for data center consolidation and cloud computing and, increasingly, as

More information

Reducing Costs and Improving Systems Management with Hyper-V and System Center Operations Manager

Reducing Costs and Improving Systems Management with Hyper-V and System Center Operations Manager Situation The Microsoft Entertainment and Devices (E&D) division was encountering long lead times of up to two months to provision physical hardware for development and test purposes. Delays for production

More information

The Problem with Privileged Users

The Problem with Privileged Users Flash Point Paper Enforce Access Control The Problem with Privileged Users Four Steps to Reducing Breach Risk: What You Don t Know CAN Hurt You Today s users need easy anytime, anywhere access to information

More information

Avoiding the Cost of Confusion: SQL Server Failover Cluster Instances versus Basic Availability Group on Standard Edition

Avoiding the Cost of Confusion: SQL Server Failover Cluster Instances versus Basic Availability Group on Standard Edition One Stop Virtualization Shop Avoiding the Cost of Confusion: SQL Server Failover Cluster Instances versus Basic Availability Group on Standard Edition Written by Edwin M Sarmiento, a Microsoft Data Platform

More information

Diagnostics in Testing and Performance Engineering

Diagnostics in Testing and Performance Engineering Diagnostics in Testing and Performance Engineering This document talks about importance of diagnostics in application testing and performance engineering space. Here are some of the diagnostics best practices

More information

RAIFFEISENBANK BULGARIA

RAIFFEISENBANK BULGARIA RAIFFEISENBANK BULGARIA IT thought leader chooses EMC XtremIO and VMware for groundbreaking VDI project OVERVIEW ESSENTIALS Industry Financial services Company Size Over 3,000 employees, assets of approximately

More information

Total Cost of Ownership: Benefits of the OpenText Cloud

Total Cost of Ownership: Benefits of the OpenText Cloud Total Cost of Ownership: Benefits of the OpenText Cloud OpenText Managed Services in the Cloud delivers on the promise of a digital-first world for businesses of all sizes. This paper examines how organizations

More information

Best Practices for Alert Tuning. This white paper will provide best practices for alert tuning to ensure two related outcomes:

Best Practices for Alert Tuning. This white paper will provide best practices for alert tuning to ensure two related outcomes: This white paper will provide best practices for alert tuning to ensure two related outcomes: 1. Monitoring is in place to catch critical conditions and alert the right people 2. Noise is reduced and people

More information

Avoiding Storage Service Disruptions with Availability Intelligence

Avoiding Storage Service Disruptions with Availability Intelligence Avoiding Storage Service Disruptions with Availability Intelligence Brent Phillips, Managing Director, Americas Brett Allison, Director of Technical Services www.intellimagic.com 1 Today s Agenda 1. Availability

More information

ORACLE ENTERPRISE MANAGER 10g ORACLE DIAGNOSTICS PACK FOR NON-ORACLE MIDDLEWARE

ORACLE ENTERPRISE MANAGER 10g ORACLE DIAGNOSTICS PACK FOR NON-ORACLE MIDDLEWARE ORACLE ENTERPRISE MANAGER 10g ORACLE DIAGNOSTICS PACK FOR NON-ORACLE MIDDLEWARE Most application performance problems surface during peak loads. Often times, these problems are time and resource intensive,

More information

Cheryl s Hot Flashes #21

Cheryl s Hot Flashes #21 Cheryl s Hot Flashes #21 Cheryl Watson Watson & Walker, Inc. March 6, 2009 Session 2509 www.watsonwalker.com home of Cheryl Watson s TUNING Letter, CPU Chart, BoxScore, and GoalTender Agenda Survey Questions

More information

Oracle Rdb Hot Standby Performance Test Results

Oracle Rdb Hot Standby Performance Test Results Oracle Rdb Hot Performance Test Results Bill Gettys (bill.gettys@oracle.com), Principal Engineer, Oracle Corporation August 15, 1999 Introduction With the release of Rdb version 7.0, Oracle offered a powerful

More information

Running SNAP. The SNAP Team February 2012

Running SNAP. The SNAP Team February 2012 Running SNAP The SNAP Team February 2012 1 Introduction SNAP is a tool that is intended to serve as the read aligner in a gene sequencing pipeline. Its theory of operation is described in Faster and More

More information

AlwaysOn Availability Groups: Backups, Restores, and CHECKDB

AlwaysOn Availability Groups: Backups, Restores, and CHECKDB AlwaysOn Availability Groups: Backups, Restores, and CHECKDB www.brentozar.com sp_blitz sp_blitzfirst email newsletter videos SQL Critical Care 2016 Brent Ozar Unlimited. All rights reserved. 1 What I

More information

Key Metrics for DB2 for z/os Subsystem and Application Performance Monitoring (Part 1)

Key Metrics for DB2 for z/os Subsystem and Application Performance Monitoring (Part 1) Robert Catterall, IBM rfcatter@us.ibm.com Key Metrics for DB2 for z/os Subsystem and Application Performance Monitoring (Part 1) New England DB2 Users Group September 17, 2015 Information Management 2015

More information

Copyright 2018, Oracle and/or its affiliates. All rights reserved.

Copyright 2018, Oracle and/or its affiliates. All rights reserved. Beyond SQL Tuning: Insider's Guide to Maximizing SQL Performance Monday, Oct 22 10:30 a.m. - 11:15 a.m. Marriott Marquis (Golden Gate Level) - Golden Gate A Ashish Agrawal Group Product Manager Oracle

More information

#1593: The top 10 things that can go wrong with an IBM Traveler Server

#1593: The top 10 things that can go wrong with an IBM Traveler Server #1593: The top 10 things that can go wrong with an IBM Traveler Server plus how to detect and correct them Alan Forbes Acknowledgements and Disclaimer. Copyright IBM Corporation 2016. All rights reserved.

More information

Key metrics for effective storage performance and capacity reporting

Key metrics for effective storage performance and capacity reporting Key metrics for effective storage performance and capacity reporting Key Metrics for Effective Storage Performance and Capacity Reporting Objectives This white paper will cover the key metrics in storage

More information

Disaster Recovery Is A Business Strategy

Disaster Recovery Is A Business Strategy Disaster Recovery Is A Business Strategy A White Paper By Table of Contents Preface Disaster Recovery Is a Business Strategy Disaster Recovery Is a Business Strategy... 2 Disaster Recovery: The Facts...

More information

Practical Capacity Planning in 2010 zaap and ziip

Practical Capacity Planning in 2010 zaap and ziip Practical Capacity Planning in 2010 zaap and ziip Fabio Massimo Ottaviani EPV Technologies February 2010 1 Introduction When IBM released zaap (2004) and ziip(2006) most companies decided to acquire a

More information

QOS Quality Of Service

QOS Quality Of Service QOS Quality Of Service Michael Schär Seminar in Distributed Computing Outline Definition QOS Attempts and problems in the past (2 Papers) A possible solution for the future: Overlay networks (2 Papers)

More information

Why the Threat of Downtime Should Be Keeping You Up at Night

Why the Threat of Downtime Should Be Keeping You Up at Night Why the Threat of Downtime Should Be Keeping You Up at Night White Paper 2 Your Plan B Just Isn t Good Enough. Learn Why and What to Do About It. Server downtime is an issue that many organizations struggle

More information

BETA DEMO SCENARIO - ATTRITION IBM Corporation

BETA DEMO SCENARIO - ATTRITION IBM Corporation BETA DEMO SCENARIO - ATTRITION 1 Please Note: IBM s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM s sole discretion. Information regarding

More information

Title: Episode 11 - Walking through the Rapid Business Warehouse at TOMS Shoes (Duration: 18:10)

Title: Episode 11 - Walking through the Rapid Business Warehouse at TOMS Shoes (Duration: 18:10) SAP HANA EFFECT Title: Episode 11 - Walking through the Rapid Business Warehouse at (Duration: 18:10) Publish Date: April 6, 2015 Description: Rita Lefler walks us through how has revolutionized their

More information

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE RAID SEMINAR REPORT 2004 Submitted on: Submitted by: 24/09/2004 Asha.P.M NO: 612 S7 ECE CONTENTS 1. Introduction 1 2. The array and RAID controller concept 2 2.1. Mirroring 3 2.2. Parity 5 2.3. Error correcting

More information