Wednesday, May 25, 2016

DB2 for z/OS: When Did You Last Check Your Checkpoint Frequency?

I have written in this blog a few entries about various DB2 for z/OS ZPARM parameters, including a post on DSMAX and another on EDM_SKELETON_POOL. Time for another such entry, this time focused on the parameters that determine when a DB2 subsystem will perform checkpoint processing.

Checkpointing is an availability-enhancing mechanism that is related to DB2's deferred write functionality. DB2 externalizes changed pages to disk in a deferred manner so as to boost the efficiency of insert, update, and delete operations (the aim is to allow multiple changes to be made to a given page in memory before that page is written to a table space or index on disk). That's great for scalability, but it has an availability implication: if the DB2 subsystem terminates abnormally (i.e., "crashes"), pages in memory that were updated (and associated with committed data changes) and not yet externalized at the time of the crash will have to be identified (using data in the DB2 log) and processed in the course of DB2 restart (meaning, the corresponding table and index pages on disk will have to be updated accordingly) so that the data on disk will be in a consistent state before the subsystem resumes serving requests from applications. The more of these "pending writes" there are at the time of a DB2 failure, the longer it will take to complete the subsequent restart of the subsystem; thus, there is goodness in getting changed pages externalized to disk in a timely manner.

This is where checkpointing comes in. At regular intervals (based on a user-supplied criterion or criteria) DB2 will identify changed-but-not-externalized pages in the buffer pools and will schedule those pages for externalization via asynchronous write actions. At the time of the next checkpoint, DB2 will determine whether any of the updated-but-not-externalized pages identified at the time of the previous checkpoint are still un-externalized. Those pages (if any) will be written synchronously to disk. And so it goes, one checkpoint after another after another.

What makes for a "good" DB2 checkpoint frequency? That would be one that strikes a proper balance between greater CPU efficiency on the one hand -- allow pages to be updated at least a few times before being externalized to disk, and keep the consumption of resources expended in checkpointing at a reasonable level -- and faster restart on the other hand.

OK, how do you effect a desired frequency of checkpoint processing? For a long time, the only way to influence DB2 checkpoint frequency was to specify (by way of a ZPARM parameter) the number of records that would be written to the DB2 transaction log between checkpoints. The default value for this parameter used to be (if memory serves me correctly) 50,000, and if you went with that value for a subsystem then DB2 would go through checkpoint processing every time 50,000 records had been written to the log since the time of the last checkpoint. For many organizations, this default log-record setting resulted in a pretty reasonable checkpoint frequency.

As time went by, there came to be some problems with both the default value for log-records-between-checkpoints, and for the whole log-based means of regulating checkpoint frequency. The problem with the 50,000 log records threshold is that it came to be way too low at a growing number of sites -- a result of DB2 data-change workloads getting dramatically larger (sometimes thousands of insert/update/delete operations per second). At 50,000 log records between checkpoints, some systems would have DB2 checkpoints occurring every few seconds -- not where you want to be with respect to achieving a good balance balance between CPU efficiency and quick DB2 restart-ability (checkpointing every few seconds would be pretty expensive). The other problem with the log write-based checkpoint triggering threshold was variability -- sometimes extreme variability. At a given DB2 for z/OS site there could be times of the day when a read-dominated workload would lower log write activity to a point that caused lots of time to go by between checkpoints, and other times when high-volume insert jobs (for example) would drive log write levels to the point that checkpointing occurred with too much frequency.

DB2 for z/OS Version 7 delivered a very useful enhancement that enabled the management of checkpoint frequency based on an explicitly time-based threshold: you could tell a DB2 subsystem to checkpoint every X number of minutes, regardless of the volume of log write operations. DB2 10 for z/OS allowed one to specify that checkpointing is to occur after X minutes have elapsed, or after Y records have been written to the DB2 log, since the last checkpoint -- the threshold that's reached first triggers the next checkpoint.

So, where does all this leave us? First, you want to know how many checkpoints are driven on your DB2 subsystems (especially the production subsystems) during busy hours of the day and/or night. This can be easily done using a statistics long report generated by your DB2 monitor. I'd suggest creating a report for a busy daytime period of 1 or 2 hours, and a report for a busy nighttime period of 1 or 2 hours. In the report, find the section under the heading SUBSYSTEM SERVICES (or something similar -- different DB2 monitors might use slightly different wording for headings and fields). In that report section, find the field with a label like SYSTEM EVENT CHECKPOINT, and see how many checkpoints occurred during the time period for which the report was generated. Generally speaking, what you want to see is a checkpoint every 2 to 5 minutes (there is not a "right" or "wrong" place to be in that range -- closer to 5 minutes would make for a somewhat more CPU-efficient DB2 environment, and closer to 2 minutes would enable somewhat faster DB2 restart following an abnormal subsystem termination). If you're out of that range (on either side), an adjustment would likely be warranted.

While you can still manage DB2 checkpointing on a log-records-between-checkpoints basis, and also using both a number of log records and a number of minutes between checkpoints (whichever occurs first since the last checkpoint), my preference is to control checkpoint frequency solely by way of a minutes-between-checkpoints threshold -- it's simple and straightforward, and I like a fixed regularity of checkpoints per hour throughout the day. If you go with a minutes-between-checkpoints threshold, you'll have SINGLE for the value of the CHKTYPE parameter in ZPARM, and the desired number of minutes (again, something in the range of 2 to 5, inclusive, is recommended) specified as the value of the CHKFREQ parameter. By the way, if you go this route -- say, with CHKTYPE=SINGLE and CHKFREQ=5 if you want a DB2 checkpoint every 5 minutes -- then you will have a value of NOTUSED for the CHKLOGR and CHKMINS parameters in ZPARM, as these parameters are used only when checkpoint frequency is managed on both a minutes-between and a log-records-between basis (i.e., when you have CHKTYPE=BOTH). Also, if you see a value of something like 100,000 for the CHKFREQ parameter in ZPARM, that does not mean a checkpoint every 100,000 minutes -- it means a checkpoint after every 100,000 records have been written to the active log (when CHKTYPE is set to SINGLE, a CHKFREQ value between 1 and 60, inclusive, means that checkpoint frequency is time-based, and a CHKFREQ value between 1,000 and 16,000,000, inclusive, means that checkpoint frequency is based on that number of records being written to the DB2 log between checkpoints).

So, check your DB2 system's checkpoint frequency if you haven't in a while, and if you're getting fewer than 2 or more than 5 checkpoints per hour then consider making a change to get checkpoint frequency into the 2- to 5-minute range. Yes, it's technically possible to use log-records-between-checkpoints as a checkpointing threshold, but I'd prefer to make checkpoint frequency a fixed-number-of-minutes-between-checkpoints thing. I like simple, I like straightforward, and I like regularity.