Tuesday, September 29, 2015

An Oft-Overlooked DB2 for z/OS Big Memory Exploitation Play

Over the past few years, I've done a lot of writing and presenting and talking about ways in which large-scale z Systems memory resources (which I refer to as Big Memory) can be leveraged so as to enhance the performance of DB2 for z/OS subsystems and related applications. While I very much like to see z/OS LPARs with memory sizes of 100 GB or more, I DON'T like to see situations in which organizations stop well short of exploiting Big Memory to maximum positive effect with regard to DB2 performance. Big buffer pools are a great way to exploit available real storage gigabytes, but too often DB2 people don't look much beyond that particular use of mainframe memory to boost workload throughput and CPU efficiency. In this blog entry I want to highlight a DB2 application of Big Memory that, in my experience, is often overlooked: the EDM skeleton pool.

The EDM pool (long name: environmental descriptor manager pool) occupies a portion of a DB2 subsystem's database services address space (also known as DBM1). Various DB2 elements supporting SQL statement execution are cached in different parts of the EDM pool. Among these elements are skeleton package tables, or SKPTs -- one SKPT per package. Here's how this space is used: when an application process needs to execute a package, DB2 checks to see if the package (actually, the requisite sections of the package) is already in memory. If it isn't, DB2 loads the package (sections) from the SPT01 table space in the directory into an SKPT in the skeleton pool part of the EDM pool. From there, when the package is allocated to the thread through which it will be executed, it is copied into another part of the DBM1 address space (the thread-specific copy of the package goes into what is called agent local pool storage -- if a given package is allocated to 10 threads, there will be 10 copies of the package in agent local pool storage in DBM1). The more often a package being allocated to a thread is found in the skeleton pool, versus having to be loaded from the directory on disk, the better for the overall performance of the associated DB2 workload. Often, this is not taken into account by people charged with optimizing the performance of a DB2 for z/OS environment.
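
To make that flow a bit more concrete, here is a toy model of it in Python. This is purely illustrative -- it is not a depiction of DB2 internals, and the names in it (SkeletonPool, load_from_spt01, allocate_to_thread) are invented for the sketch. The point is simply that the skeleton pool acts as a cache between the directory on disk and the thread-level copies of packages:

# A toy model of the flow described above -- illustrative only, not DB2 internals.
# All names here (SkeletonPool, load_from_spt01, allocate_to_thread) are invented.

class SkeletonPool:
    def __init__(self):
        self.skpts = {}           # cached skeleton package tables, keyed by package name
        self.loads_from_disk = 0  # trips to the SPT01 table space (the slow path)

    def load_from_spt01(self, package):
        # Stand-in for reading package sections from the DB2 directory on disk.
        self.loads_from_disk += 1
        return f"sections-of-{package}"

    def allocate_to_thread(self, package, agent_local_pool):
        # Check the skeleton pool first; go to the directory only on a miss.
        if package not in self.skpts:
            self.skpts[package] = self.load_from_spt01(package)
        # Each thread gets its own copy in agent local pool storage.
        agent_local_pool.append(self.skpts[package])

# Ten threads executing the same package: at most one load from disk, ten thread-level copies.
pool, thread_copies = SkeletonPool(), []
for _ in range(10):
    pool.allocate_to_thread("PKG1", thread_copies)
print(pool.loads_from_disk, len(thread_copies))  # prints: 1 10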

Here's what you should do (if you haven't already): check on skeleton pool activity by way of your DB2 monitor. You can use an online display of EDM pool information for this purpose, but my preference is to use the EDM pool information found in a DB2 monitor-generated statistics long report (depending on the DB2 monitor product used at your site, this may be called a statistics detail report). In such a report, the EDM pool section would contain some fields like these (I've left out some lines to highlight the information of interest):

EDM POOL                     QUANTITY
---------------------------  --------

PAGES IN SKEL POOL (ABOVE)    3941.00
  HELD BY SKCT                  40.98
  HELD BY SKPT                3724.36
  FREE PAGES                   175.66
FAILS DUE TO SKEL POOL FULL      0.00


In many cases, a person looking at this information would see the 0.00 value for FAILS DUE TO SKEL POOL FULL, conclude that the skeleton pool is plenty big, and move on to something else. Now, it is true that having no program failures caused by DB2 being unable to load a package into the skeleton pool is a good thing; however, that's only part of the story. Look a little further down in the EDM pool section of the report, and you'll see numbers like these:

EDM POOL                     QUANTITY
---------------------------  --------

PT REQUESTS                   6446.1K
PT NOT FOUND                   193.0K 


Take note of the ratio of PT REQUESTS to PT NOT FOUND. The former value is the number of times that a package section was requested for allocation to a thread, and the latter is the number of times that a package request could not be satisfied from the skeleton pool (and so required a loading of the package section from the DB2 directory). The ratio calculated from the numbers above (which show activity in a real-world DB2 subsystem) is about 33:1. That might look OK to you, but you should be able to do a lot better than that on your system (unless you are really memory-constrained). Here are numbers from a different real-world DB2 subsystem:

EDM POOL                     QUANTITY
---------------------------  --------

PT REQUESTS                   1717.9K
PT NOT FOUND                   128.00 


The ratio of PT REQUESTS to PT NOT FOUND in this case is over 13,000 to 1. Looked at from an activity rate perspective, the rate of package section loads from SPT01 on disk associated with the first set of PT REQUESTS and PT NOT FOUND numbers is about 27 per second (193,000 loads in a 2-hour reporting interval), while the rate of loads associated with the second set of numbers is about 1 every 14 seconds (128 loads in a 30-minute reporting interval).
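
If it helps to see that arithmetic spelled out, here is a small Python sketch with the figures from the two reports plugged in (the function and its name are mine -- this is not output from any DB2 monitor):

# The arithmetic used above: requests-per-load ratio and loads per second,
# with the figures from the two statistics reports plugged in.

def skeleton_pool_stats(pt_requests, pt_not_found, interval_seconds):
    ratio = pt_requests / pt_not_found            # package section requests per load from the directory
    loads_per_second = pt_not_found / interval_seconds
    return ratio, loads_per_second

print(skeleton_pool_stats(6_446_100, 193_000, 2 * 3600))  # first subsystem: ~33:1, ~27 loads/second
print(skeleton_pool_stats(1_717_900, 128, 30 * 60))       # second subsystem: ~13,400:1, ~0.07 loads/second (1 every ~14 seconds)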

I'd rather have the second set of numbers. And you know what? It often doesn't take much memory to get to a skeleton pool size that yields a really high ratio of package section requests to package section loads from disk. Interestingly, although the numbers shown above come from two different systems (used by two different organizations), in both cases the rate of requests for package table sections was a little over 900 per second. An important difference between the two environments: the skeleton pool of the subsystem with the really high ratio of requests to loads is about 2.5 times larger than the skeleton pool of the subsystem with the much lower ratio. The size delta amounts to about 6000 pages (4K pages), and that's about 24 MB. The point I want to make here is this: for an investment of maybe a few tens of megabytes of memory, you might be able to dramatically increase the ratio of package section requests to package section loads in your DB2 environment, if that ratio is not already high (I like to see at least hundreds to one, and ideally thousands to one). Unless your z/OS LPAR is quite storage-constrained, there is a good chance that you could boost the size of the skeleton pool (via the EDM_SKELETON_POOL parameter in ZPARM) by a few tens of megabytes and still have a very low demand paging rate for the LPAR ("very low" is what you want for the demand paging rate in a production z/OS LPAR, and I define that as a rate -- available from a z/OS monitor -- in the low single digits or less per second during busy processing periods). Note that there are also request-versus-load numbers pertaining to the DBD cache in the EDM pool -- I've focused on the skeleton pool numbers because I more often see lower-than-desired ratios there.
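
And, just to make the sizing arithmetic from the preceding paragraph explicit, here it is in sketch form -- pages times page size, using the 6000-page delta from the example above (substitute your own figures):

# Skeleton pool sizing arithmetic: pages times page size. The 6000-page delta
# is the example from the text above.

def skeleton_pool_delta_mb(extra_pages, page_size_kb=4):
    return extra_pages * page_size_kb / 1024

print(skeleton_pool_delta_mb(6000))  # ~23.4 MB -- "a few tens of megabytes"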

That's it. Keep in mind that using Big Memory effectively means using it for all kinds of DB2 performance-boosting purposes -- bigger buffer pools, sure, but don't stop there. Give the skeleton pool enough space to satisfy the vast majority of package section requests out of memory, thereby reducing the rate of package section loads from the DB2 directory on disk. Doing this will get you that much closer to optimal DB2 application performance.

Sunday, September 27, 2015

I Don't Worry About DB2 for z/OS Buffer Pool Hit Ratios

And neither should you.

It's true that for years, mine was among a chorus of DB2 specialists' voices that placed a lot of emphasis on monitoring DB2 buffer pool hit ratios (there are variations with respect to calculating this ratio, but the basic formula is (GETPAGEs - synchronous reads) / GETPAGEs). Some years ago, it dawned on me (and on others -- and for some folks earlier than for me) that I'd been focusing on the wrong buffer pool performance metric. Nowadays, I don't look at buffer pool hit ratios at all.
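
For reference, here is that basic formula as a bit of Python. The figures fed to it below are hypothetical, chosen only to show how a pool can post a better-than-99% hit ratio and still drive a lot of synchronous read I/O when the GETPAGE rate is high:

# The basic hit-ratio formula, as a one-liner. The figures below are hypothetical,
# chosen only to show that a better-than-99% ratio can coexist with a lot of read I/O.

def buffer_pool_hit_ratio(getpages, sync_reads):
    return (getpages - sync_reads) / getpages

print(buffer_pool_hit_ratio(300_000, 2_000))  # 0.993... -- and yet that's still 2,000 synchronous reads per second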

What I look at instead is the total read I/O rate for each and every buffer pool. Before getting more into the details of that performance indicator, I'll give you some reasons for my not giving consideration to buffer pool hit ratios:
  • They can give you a false sense of being "done" with respect to leveraging buffer pool resources to boost system performance. People can see a value of greater than 99% for a buffer pool hit ratio and conclude, "The performance impact of this pool is as good as it's going to get. I'll turn my attention to other pools." That conclusion can be very much incorrect, as I'll explain momentarily.
  • They can cause you to disregard asynchronous buffer pool read activity. Because a commonly used formula for calculating a buffer pool's hit ratio ignores asynchronous read activity, it can lead to, at best, a benign neglect of prefetch read activity, or worse, actions that overly diminish asynchronous read space in buffer pools for the purpose of "goosing" a synchronous read-based hit ratio. That can work against your overall DB2 performance goals. As I pointed out in an entry posted to this blog a couple of years ago, prefetch reads matter.
  • Depending on how they're calculated, they can have weird values that confuse people. To avoid leaving prefetch reads out of the picture, some performance monitoring products will (or at least have, in the past) use buffer pool hit ratio formulas that take asynchronous read activity into account. On occasion, given a particular level and type of prefetch activity, such formulas will yield odd-looking results, like negative buffer pool hit ratios. Values of this nature can be hard for users to interpret.

[Note that I do care about a particular kind of hit ratio that is relevant to group buffer pools in a DB2 data sharing environment -- something I call the XI GBP read hit ratio. I described this ratio in a blog entry I posted about two months ago.]

I've mentioned that the buffer pool performance metric that matters most to me is the total read I/O rate. For a given buffer pool, that value is calculated as follows:

Total read I/O rate = (synchronous reads/second) + (asynchronous reads/second)

Getting the numbers to plug into this formula is a pretty easy thing. If you can generate a statistics long report (sometimes called a statistics detail report) for a DB2 subsystem by way of your DB2 monitor, look in such a report for the section containing buffer pool activity information. For each active pool in the environment, you'll see the synchronous read rate, and you'll also see the three asynchronous read rates, for sequential prefetch, list prefetch, and dynamic prefetch. Your monitor might already have these rates in per-second form for you, and that makes it really easy: just sum those four numbers (synchronous reads per second, sequential prefetch reads per second, list prefetch reads per second, and dynamic prefetch reads per second). If your monitor reports activity in per-minute form, just divide the four relevant values by 60 to get per-second numbers, and do the previously mentioned addition. You could also get these synchronous and asynchronous read numbers from an online display of buffer pool activity provided by your DB2 monitor, but for ongoing performance tuning, I prefer DB2 monitor-generated reports over online displays.
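
Here is a minimal sketch of that summation in Python, assuming you've pulled the four read rates for a pool from a statistics report (the figures below are illustrative, not from a real report):

# Total read I/O rate for a pool: synchronous reads plus the three prefetch read rates.
# If the monitor reports per-minute figures, divide the sum by 60.

def total_read_io_rate(sync_reads, seq_prefetch, list_prefetch, dyn_prefetch, per_minute=False):
    total = sync_reads + seq_prefetch + list_prefetch + dyn_prefetch
    return total / 60 if per_minute else total

print(total_read_io_rate(850.2, 210.4, 35.1, 402.7))                   # ~1498 read I/Os per second
print(total_read_io_rate(51012, 12624, 2106, 24162, per_minute=True))  # same pool, per-minute inputs: ~1498 per second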

An alternative means of getting buffer pool activity information is to use the output of the DB2 command -DISPLAY BUFFERPOOL. If you go the command route, my suggestion is to do the following:
  1. Issue the command -DISPLAY BUFFERPOOL(ACTIVE) DETAIL on the DB2 subsystem of interest. You can disregard the output of this issuance of the command.
  2. One hour after issuing the command the first time, issue -DISPLAY BUFFERPOOL(ACTIVE) DETAIL again. The output of this second issuance of the command will capture, for each active pool in the environment, activity for the one-hour period since the first issuance of the command (the command captures activity since the buffer pools were last allocated, or since the command was last issued -- whichever occurred most recently). You can then divide the activity numbers by 3600 to get per-second figures.

In the output of the second issuance of the -DISPLAY BUFFERPOOL command you'll see a timestamp value in a DSNB409I message (for example, "INCREMENTAL STATISTICS SINCE 10:00:32 SEP 1, 2015"). Check that timestamp to verify that the value is approximately one hour since the command was issued the first time. If you see a timestamp that is, say, 25 minutes prior to the time at which the command was issued the second time, it means that someone else got in there 25 minutes ago and issued the command.

Instead of adding together four numbers, as you'd do when using information from a DB2 monitor statistics long report (or from an online monitor display of buffer pool activity), you sum five numbers found in -DISPLAY BUFFERPOOL(ACTIVE) DETAIL command output. Why five instead of four? Because in the command output, synchronous read activity is reported in two "buckets": random synchronous reads and sequential synchronous reads. Total up, then, the random synchronous reads, the sequential synchronous reads, the sequential prefetch reads, the list prefetch reads, and the dynamic prefetch reads, and divide that sum by 3600 to get the total read I/O rate per second (assuming that you issued the command once and then again an hour later, and you're using the output of the second issuance of the command). At some sites, folks have created jobs that issue the -DISPLAY BUFFERPOOL(ACTIVE) DETAIL command at a certain time of day and again an hour later, then use REXX code to parse the output of the second issuance of the command, pulling out the read activity values for each pool, totaling them, and dividing the total by 3600 to get the total read I/O rate per second.
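
Here is the same calculation done from -DISPLAY BUFFERPOOL(ACTIVE) DETAIL counters, sketched in Python rather than REXX: five buckets instead of four, divided by the length of the interval between the two issuances of the command. The counter values below are made up for illustration; an actual job would parse them out of the command output:

# Total read I/O rate from -DISPLAY BUFFERPOOL counters: five read buckets divided
# by the interval (3600 seconds if you waited an hour between the two issuances).

def display_bpool_read_rate(random_sync, seq_sync, seq_prefetch, list_prefetch,
                            dyn_prefetch, interval_seconds=3600):
    return (random_sync + seq_sync + seq_prefetch + list_prefetch + dyn_prefetch) / interval_seconds

print(display_bpool_read_rate(2_750_000, 310_000, 980_000, 150_000, 1_200_000))  # ~1497 read I/Os per second over the hour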

What I like to see, for each buffer pool, is a total read I/O rate of less than 1000 per second. A rate of less than 100 per second for a pool is really good, unless that pool is used to "pin" objects in memory, in which case you'd like to see a read I/O rate of zero (and "pinning" pools should be defined with PGSTEAL(NONE), as described in a blog entry I wrote a few years ago). Driving read I/Os down benefits system performance in two important ways: 1) it reduces elapsed time for DB2-accessing programs, and 2) it improves the CPU efficiency of a DB2 workload (every I/O operation consumes some CPU time, so reducing that activity lowers the CPU cost of DB2 data access). The best way to lower DB2 read I/O rates is to increase the size of buffer pools that have a high level of read I/O operations. Thus, monitoring buffer pool read I/O rates points you to the pools for which the potential for performance gains through read I/O reduction is the greatest.
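
If you want to turn those rules of thumb into something you can run against a list of pools, a small helper along the lines of the Python sketch below would do it. The thresholds are the ones just stated; the pool names and rates are invented for illustration:

# Rules of thumb from the text: below 100/second is really good, below 1000/second is
# the target, and a "pinning" pool should show zero read I/O.

def assess_pool(read_io_rate, pinning_pool=False):
    if pinning_pool:
        return "OK" if read_io_rate == 0 else "pinning pool should show zero read I/O"
    if read_io_rate < 100:
        return "really good"
    if read_io_rate < 1000:
        return "good"
    return "top candidate for enlargement"

for name, rate, pinned in [("BP1", 2800, False), ("BP2", 640, False), ("BP7", 0, True)]:
    print(name, assess_pool(rate, pinned))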

Less than a week ago, I was reviewing buffer pool activity statistics for a DB2 subsystem that is a key component of an organization's IT infrastructure. One of the busier pools showed a hit ratio of 99.3%. That looks pretty good, doesn't it? Seeing that, might you turn your attention to another of the subsystem's buffer pools? You could, but that would be a mistake -- this same buffer pool had a total read I/O rate of over 2800 per second, higher than that of any other pool belonging to the subsystem. Judged by the read I/O rate -- the marker that matters -- THIS was the pool most in need of attention. The pool, with approximately 600,000 4K buffers (about 2.4 GB of space), could stand to be at least doubled in size so as to substantially lower the high rate of read I/O activity. Could the organization make that move? Yes, because the LPAR housing the DB2 subsystem has over 100 GB of real storage (an increasingly common configuration these days), and much of that memory is unused. With another 2.4 GB of memory utilized for buffering DB2 pages (in the form of a twice-as-large buffer pool), the z/OS LPAR's demand paging rate (available via a z/OS monitor), currently zero, would likely remain at that level. If doubling the size of the high-I/O pool took the associated read I/O rate from 2800 per second to, say, 1200 per second, I'd make the pool significantly larger again to try to get the read I/O rate below 1000 per second -- while keeping an eye on the LPAR's demand paging rate (a demand paging rate in the low single digits per second or less during busy processing periods is an indicator that a system's real storage resource is not under too much pressure).
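
For what it's worth, the buffer math behind that example is simple enough to put in a quick sketch (illustrative only): buffer count times page size gives the pool's footprint, which is what you weigh against the LPAR's available real storage:

# Buffer pool footprint: buffer count times page size, expressed in gigabytes.

def pool_size_gb(buffers, page_size_bytes=4096):
    return buffers * page_size_bytes / 1_000_000_000

print(pool_size_gb(600_000))    # ~2.46 GB -- the "about 2.4 GB" pool described above
print(pool_size_gb(1_200_000))  # doubling it means roughly another 2.4 GB of real storage in use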

Here's the really important point that I want to make through this blog entry: at more and more sites, you find production DB2 for z/OS subsystems running in LPARs with lots of real storage (more than 100 GB -- sometimes several hundred gigabytes). That's a good thing, if you put that big memory resource to productive use. A great way to do that is to grow DB2 buffer pools in a performance-positive way, and the total read I/O rate for the various buffer pools -- NOT the hit ratio -- is your guide for doing this in an effective manner. So, look at the right buffer pool performance numbers, and act on them in the right way.