Thursday, May 28, 2015

DB2 11 for z/OS: Setting Aside Space for Updates

A DB2 table space is well-organized when rows are physically located where they should be per the clustering index of the associated table. Because a high degree of "clustered-ness" is a desirable quality of a table space (mostly for performance reasons), a DB2 DBA will often take steps to preserve row sequencing between REORGs. One such step that is very commonly executed is the setting aside of space in a table space's pages to accommodate inserts of new rows (relevant for a table that is clustered by a key that is NOT continuously ascending -- in the case of a continuously-ascending clustering key, new rows will be inserted at the "end" of the table). This insert-accommodating space set-aside is accomplished by way of the PCTFREE option of CREATE and ALTER TABLESPACE.

PCTFREE has always been a nice tool in the DB2 DBA's physical database design tool kit, but it had a shortcoming: it didn't apply to UPDATE operations that caused a row's length to increase. If a row in a DB2 for z/OS table space becomes longer as a result of an UPDATE, and if that longer row will no longer fit into the page in which it had been stored, it will be moved to another page in the table space. Will the row's location, as indicated by the RID (row ID) for the row in entries of indexes defined on the table, be changed to reflect the update-driven move of the row to a new page? No. Index entries associated with the row will continue to point to the row's former location. When DB2, in using an index to build a query result set that will include the row that was moved because of a length-increasing UPDATE, arrives at the "original home" page to retrieve the row, it will find, in effect, a "We've moved!" sign, along with a pointer to the row's new page. And, if that moved row is again lengthened by an UPDATE and again is moved (if it became too long to go back into its "second home" page), another "We've moved!" sign will be posted, along with a pointer to the row's "thIrd home" page. And so on.

These "We've moved!" signs related to update-induced row relocations are known, in DB2 parlance, as indirect references. Why are things done this way? Because updating a row's RID in a table's indexes every time a row-lengthening UPDATE caused a row move would add too much cost to UPDATE operations (the affected RIDs will be corrected with the next table space REORG). Obviously, rows with varying-length columns (such as VARCHAR columns) can be lengthened (and potentially relocated) as a result of UPDATE operations. So, too, can rows containing only fixed-length columns. How's that? Compression, that's how -- a row in a COMPRESS YES table space might compress differently (and with more length) with different column values resulting from an UPDATE.

Because indirect references are a drag on query performance (because of the hops to "new home" pages to which lengthened-beyond-fitting rows are relocated), one would like to reduce their occurrence. But how? There wasn't a good answer to this question until DB2 11 came along and provided the new PCTFREE FOR UPDATE option for CREATE and ALTER TABLESPACE. Here's how it works: if you set PCTFREE FOR UPDATE for a table space to, say, 10 then 10% of the space in the table space's pages (following a REORG or a LOAD REPLACE) will be set aside for the sole purpose of accommodating length-increasing  UPDATEs. In other words, that space will NOT be available for newly inserted rows. Can you still set aside space for to-be-inserted row's in a table space's pages? Of course you can. If you wanted 15% of the space in pages of a table space to be set aside (again, following a REORG or a LOAD REPLACE) for new INSERTs, and 10% set aside for length-increasing UPDATEs, you'd specify the following:

PCTFREE 15 FOR UPDATE 10

Would it be appropriate to specify a PCTFREE  FOR UPDATE value for each and every one of your DB2 tables? Probably not. An unneeded, non-zero PCTFREE FOR UPDATE value will result in wasted space in a table space's pages. It's best to use this DB2 enhancement for tables that are the targets of length-increasing UPDATE operations. Is it easy to identify table spaces that are associated with such tables? Sure it is -- just check the new (with DB2 11) UPDATESIZE column of the SYSIBM.SYSTABLESPACESTATS real-time statistics table in the DB2 catalog.The value in this column shows the extent to which a table space grew (or shrank) as a result of length-changing UPDATE operations since the most recent REORG or LOAD REPLACE. Speaking of real-time stats, here's an interesting option: if you specify PCTFREE FOR UPDATE -1 for a table space, the initial space set aside in the table space's pages for length-increasing UPDATEs will be 5%, and that amount will subsequently be adjusted by DB2 based on certain real-time statistics values. This is one of a growing number of examples of DB2 using its own real-time statistics to enhance efficiency of operation -- I'll be writing more about that soon.

One more thing: there is a new (with DB2 11) ZPARM, PCTFREE_UPD, that provides the default value for PCTFREE FOR UPDATE for ALTER and CREATE TABLESPACE statements. Setting PCTFREE_UPD to AUTO would make PCTFREE FOR UPDATE -1 (described above) the default.

How do you know if you've done good by specifying a non-zero PCTFREE FOR UPDATE value for a table space? Check to see that the incidence of indirect references (i.e., "We've moved!" signs) has decreased. Indirect references for table spaces can be tracked via the NEARINDREF and FARINDREF columns of the SYSIBM.SYSTABLEPART table, and the REORGNEARINDREF and REORGFARINDREF columns of SYSIBM.SYSTABLESPACESTATS. [An indirect reference is "near" if the associated row was moved to a page within 16 pages (or SEGSIZE/2, for a segmented table space -- and universal table spaces are segmented) of its previous "home" page; otherwise, the indirect reference is considered to be "far".]

By the way, PCTFREE FOR UPDATE is expected to be particularly useful for tables that have VARCHAR columns that are initially set to NULL and are later updated to have non-null values -- that is a scenario in which you'd expect rows to become significantly longer as a result of UPDATEs.

And there you have it. Nice feature, eh? When you get to DB2 11 (or if you're already there), take advantage of it.

Wednesday, May 13, 2015

For a Large DB2 for z/OS Table, Should You Go With Partition-by-Range or Partition-by-Growth?

There are several aspects to this question: What do I mean by "large?" Is the table in question new, or does it exist already? What is the nature of the data in the table, and how will that data be accessed and maintained? I'll try to cover these various angles in this blog entry, and I hope that you will find the information provided to be useful.

What is a "large" table?

Why even ask this question? Because the relevance of the partition-by-range (PBR) versus partition-by-growth (PBG) question is largely dependent on table size. If a table is relatively small, the question is probably moot because it is unlikely that range-partitioning a smaller table will deliver much value. Partitioning by growth would, in that case, be the logical choice (for many smaller tables, given the default DSSIZE of 4G, a PBG table space will never grow beyond a single partition).

OK, so what is "smaller" and what is "larger" when you're talking about a DB2 for z/OS table? There is, of course, no hard and fast rule here. In my mind, a larger DB2 for z/OS table is one that has 1 million or more rows. That's not to say that a table with fewer than 1 million rows would never be range-partitioned -- it's just that the benefits of range-partitioning are likely to be more appealing for a table that holds (or will hold) millions of rows (or more).

When the table in question is a new one

This, to me, is the most interesting scenario, because it is the one in which the options are really wide open. I'll start be saying that you definitely want to go with a universal table space here, primarily because a number of recently delivered DB2 features and functions require the use of universal table spaces. But should the table space be PBR or PBG? A partition-by-growth table space can be as large as a partition-by-range table space, so that's not a differentiator. What, then, would be your criteria?

To me, the appeal of a PBG table space is mostly a factor of it being a labor-saving device for DB2 for z/OS DBAs. PBG table spaces have an almost "set it and forget it" quality. There is no need to identify a partitioning key, no need to determine partition limit key values, no worries about one partition getting to be much larger than others in a table space. You just choose reasonable DSSIZE and MAXPARTITION values, and you're pretty much done -- you might check back on the table space once in a while, to see if the MAXPARTITION value should be bumped up, but that's about it. Pretty sweet deal if you're a DBA.

On the other hand, PBR can deliver some unique benefits, and these should not be dismissed out of hand. Specifically:
  1. A PBR table space provides maximum partition independence from a utility perspective. You can even run the LOAD utility at the partition level for PBR table space -- something you can't do with a PBG table space. You can also create data-partitioned secondary indexes (DPSIs) on a PBR table space (not do-able for a PBG table space), and that REALLY maximizes utility-related partition independence (though it should be noted that DPSIs can negatively impact the performance of queries that do not reference a PBR table space's partitioning key).
  2. PBR table spaces enable the use of page-range screening, a technique whereby the DB2 for z/OS optimizer can limit the partitions that have to be scanned to generate a result set when a query has a predicate that references a range-partitioned table space's partitioning key (or at least the lead column or columns thereof). Page-range screening doesn't apply to PBG table spaces, because a particular row in such a table space could be in any of the table space's partitions.
  3. A PBR table space can be a great choice for a table that would be effectively partitioned on a time-period basis. Suppose, for example, that the rows most recently inserted into a table are those most likely to be retrieved from the table. In that case, date-based partitioning (e.g., having each partition hold data for a particular week) would have the effect of concentrating a table's most "popular" rows in the pages of the most current partition(s), thereby reducing GETPAGE activity associated with retaining sets of these rows. Date-based partitioning also enables very efficient purging of a partition's data (when the purge criterion is age-of-data) via a partition-level LOAD REPLACE operation with a dummy input data set (the partition's data could be first unloaded and archived, if desired).
  4. A PBR table space tends to maximize the effectiveness of parallel processing, whether of the DB2-driven query parallelization variety or in the form of user-managed parallel batch jobs. This optimization of parallel processing can be particularly pronounced for joins of tables that are partitioned on the same key and by the same limit key values.
Those are some attractive benefits, I'd say. Still, the previously mentioned DBA labor-saving advantages of PBG table spaces are not unimportant. That being the case, this is my recommendation when it comes to evaluating PBR versus PBG for a large, new table: consider first whether the advantages of PBR, listed above, are of significant value for the table in question. If they are, lean towards the PBR option. If they are not, PBG could be the right choice for the table's table space. In particular, PBG can make sense for a large table for which access will be mostly through transactions (as opposed to batch jobs), especially if those transactions will retrieve small result sets via queries for which most row filtering will occur at the index level. In that case, the advantages of range-partitioning could be of limited value.

When the table space in question is an existing one

Here, the assumption is that the table space is not currently of the universal type. When that is true, and the aim is (as it should be) to convert the table space from non-universal to universal, the PBR-or-PBG decision will usually be pretty straightforward and will be based on the easiest path to universal: you'll go with universal PBR for an existing non-universal range-partitioned table space (if it is a table-controlled, versus an index-controlled, partitioned table space), because that change can be accomplished non-disruptively with an ALTER TABLESPACE (to provide a SEGSIZE for the table space) followed by an online REORG (if you are have DB2 10 running in new-function mode, or DB2 11). Similarly, for an existing non-partitioned table space (segmented or simple, as long as it contains only one table), you'll go with universal PBG because that change can be accomplished non-disruptively with an ALTER TABLESPACE (to provide a MAXPARTITIONS value for the table space) followed by an online REORG (again, if your DB2 environment is Version 10 in new-function mode, or DB2 11).

I recently encountered an exception to this rule: if you have a non-universal, range-partitioned table space, with almost all of the data in the last of the table space's partitions (something that could happen, depending on how partition limit keys were initially set), you might decide not to go for the non-disruptive change to universal PBR, because then you'd have a PBR table space with almost all of the data in the last of the table space's partitions. Yes, with enough ALTER TABLE ALTER PARTITION actions, you could get the table's rows to be spread across many partitions (and with DB2 11, alteration of partition limit key values is a non-disruptive change), but that would involve a lot of work. You might in that case just opt to go to a PBG table space through an unload/drop/re-create/re-load process.

To sum things up: PBR and PBG have their respective advantages and disadvantages. In choosing between these flavors of universal table space, the most important thing is to put some thought into your decision. Give careful consideration to what PBR might deliver for a table, and think also of how useful PBG might be for the same table. If you weigh your options, the decision at which you ultimately arrive will likely be the right one.