Testing File Structure with HASH-TEST

HASH-TEST produces the same listing as ISTAT, shown in the preceding section, but for a test modulo that you supply. It lets you see how the file's records would be distributed across groups if the file were given that modulo.

The following example shows the result of the HASH-TEST command run on the same Master Dictionary as was used in the preceding section with ISTAT. A test modulo of 3 has been used instead of the actual modulo of 7:

>HASH-TEST MD

TEST MODULO:3

FILE= MD MODULO= 3                  15:44:46  DD MMM YYYY

FRAMES BYTES ITMS

    1   1740  78 *>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>!

    2   2005  98 *>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>!

    2   2265 104 *>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>!

  5

 

ITEM COUNT=     280, BYTE COUNT=  6010, AVG. BYTES/ITEM= 21.4

AVG.ITEMS/GROUP=93.3,STD.DEVIATION=13.6,AVG.BYTES/GROUP=2003.3

These items are fairly evenly distributed across the 3 groups, with an average of 93 items in each group. The obvious problem is that the groups hold too much data: as the FRAMES column shows, two of the three groups exceed one frame.
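The summary lines of the listing can be checked by hand from the three group rows. The following Python sketch is only an illustration, not part of HASH-TEST; the function and variable names are invented here. It assumes STD.DEVIATION is the sample standard deviation of the items-per-group column (divisor n - 1), an assumption that reproduces the 13.6 shown. The listing shows 21.4 for AVG. BYTES/ITEM where the exact quotient is about 21.46, so the system appears to truncate rather than round.

```python
import statistics

# (frames, bytes, items) for each group, copied from the test-modulo-3 listing.
groups = [(1, 1740, 78), (2, 2005, 98), (2, 2265, 104)]


def summarize(groups):
    """Recompute the totals and averages printed under the listing."""
    items = [g[2] for g in groups]
    item_count = sum(items)
    byte_count = sum(g[1] for g in groups)
    return {
        "frames": sum(g[0] for g in groups),
        "item_count": item_count,
        "byte_count": byte_count,
        "avg_bytes_per_item": byte_count / item_count,
        "avg_items_per_group": item_count / len(groups),
        # ASSUMPTION: sample standard deviation (n - 1 divisor) of items per group.
        "std_deviation": statistics.stdev(items),
        "avg_bytes_per_group": byte_count / len(groups),
    }


summary = summarize(groups)
for name, value in summary.items():
    print(f"{name:22s}{value:10.1f}")
```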

On the other hand, a test modulo of 79 produces this result:

>HASH-TEST MD

TEST MODULO:79

FILE= MD MODULO= 79              15:54:46  DD MMM YYYY

FRAMES BYTES ITMS

    1    119   5 *>>>>>

    1    178   9 *>>>>>>>>>

    1     34   2 *>>

    1    119   6 *>>>>>>

    1     24   1 *>

    1      0   0 *

    1     79   4 *>>>>

    .

    .

    .

    1     62   2 *>>

    1    117   5 *>>>>>

    1    105   6 *>>>>>>

    1     36   2 *>>

   79

 

ITEM COUNT=   280, BYTE COUNT= 6010, AVG. BYTES/ITEM=   21.4

AVG.ITEMS/GROUP=3.5,STD.DEVIATION= 2.0,AVG.BYTES/GROUP=76.0

This modulo errs in the opposite direction: the items are spread too thinly, and several of the groups are empty. On average, only about four percent of the space allocated to each group is being used, which is a very inefficient use of disk space.
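The four percent figure can be approximated with a short calculation. The Python sketch below is an illustration only. It assumes that a frame holds about 2000 bytes of data, a value that is not stated in this section but is consistent with both listings; FRAME_BYTES and the function names are hypothetical.

```python
import math

# ASSUMPTION: usable data bytes per frame. Not stated in the text; chosen
# because it is consistent with the FRAMES column and the four percent figure.
FRAME_BYTES = 2000


def frames_needed(group_bytes, frame_bytes=FRAME_BYTES):
    """Frames a group occupies: at least one, even when the group is empty."""
    return max(1, math.ceil(group_bytes / frame_bytes))


def utilization(byte_count, modulo, frame_bytes=FRAME_BYTES):
    """Fraction of the primary space (one frame per group) that holds data."""
    return byte_count / (modulo * frame_bytes)


print([frames_needed(b) for b in (1740, 2005, 2265)])  # [1, 2, 2]
print(f"{utilization(6010, 79):.1%}")                   # 3.8%
print(f"{utilization(6010, 3):.1%}")                    # 100.2%
```

Under this assumption, a modulo of 79 uses under four percent of the primary space, while a modulo of 3 needs slightly more than the primary space can hold, which is why two of its groups overflow into a second frame.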

A few general observations can be made about the number of items per group and the size of groups. If items have sequential numeric IDs, they are distributed evenly across the groups. If IDs are random (that is, not in any particular sequence), the distribution varies. A small modulo with a large number of bytes per group can raise the average access time, because more than one data frame may have to be scanned to find an item. A large modulo with a small number of bytes per group wastes disk space but provides faster access. The trick is to find a modulo that balances the two, so that each group holds a reasonable number of items. One frame per group is ideal.
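The difference between sequential and random IDs can be explored with a small simulation. The Python sketch below does not use the system's actual hashing algorithm, which this section does not describe. The function stand_in_hash is a hypothetical substitute that reads the ID's character codes as base-10 digits, and the item IDs are made up. It is meant only to show the kind of comparison that HASH-TEST lets you make on a real file.

```python
import random
import statistics
from collections import Counter


def stand_in_hash(item_id, modulo):
    """HYPOTHETICAL hash, not the system's own: treat character codes as digits."""
    value = 0
    for ch in item_id:
        value = value * 10 + ord(ch)
    return value % modulo


def group_counts(item_ids, modulo):
    """Number of items in each group, including groups that stay empty."""
    counts = Counter(stand_in_hash(item_id, modulo) for item_id in item_ids)
    return [counts.get(group, 0) for group in range(modulo)]


# 280 made-up IDs of each kind, matching the item count in the listings.
sequential_ids = [str(n) for n in range(1000, 1280)]
rng = random.Random(0)  # fixed seed so the run is repeatable
letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
random_ids = ["".join(rng.choices(letters, k=6)) for _ in range(280)]

for modulo in (3, 7, 79):
    for label, ids in (("sequential", sequential_ids), ("random", random_ids)):
        counts = group_counts(ids, modulo)
        print(f"modulo {modulo:3d} {label:10s} "
              f"min={min(counts)} max={max(counts)} "
              f"std.dev={statistics.stdev(counts):.1f} empty={counts.count(0)}")
```

With this stand-in hash, the sequential IDs fall into consecutive groups, so a modulo of 7 receives exactly 40 items in every group. The random IDs spread less evenly, and the spread differs from one modulo to the next.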

Every group occupies at least one frame, even when it is empty. Having a thousand empty groups is worse than having three hundred groups of which only some require an overflow frame. Although the larger file provides more direct access, INFO/ACCESS and the file-save processor are slowed by having to read the empty frames into memory.
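The cost to a process that reads the whole file can be compared by counting frames. In the Python sketch below, the figure of 60 groups with one overflow frame each is invented purely for illustration, as are the names.

```python
def frames_scanned(frames_per_group):
    """Frames a full pass over the file must read: every group, minimum one each."""
    return sum(max(1, frames) for frames in frames_per_group)


# 1000 groups, one frame each, whether or not they hold any items.
sparse_file = [1] * 1000

# 300 groups; HYPOTHETICALLY 60 of them need one overflow frame.
compact_file = [2] * 60 + [1] * 240

print(frames_scanned(sparse_file))   # 1000
print(frames_scanned(compact_file))  # 360
```

In this made-up case the smaller file is read in well under half the number of frames, even though some of its groups overflow.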

Also, remember that the system places an updated item at the end of its group. If the modulo is too small, each update takes extra time to find where the item starts, move the other items nearer the front of the group, and place the updated item at the end. Thus, performance suffers if each group contains many items.

See Also

Tools for Checking File Efficiency

Analyzing File Structure with ISTAT