Fragmentation means two things: 1) a condition in which individual files are
broken into two or more pieces scattered about the disk rather than stored
in one contiguous piece; and 2) a condition in which the free space on a disk
consists of little bits of free space here and there rather than one or a few
large free spaces.
Condition 1 is referred to as file fragmentation, while Condition
2 is referred to as disk fragmentation or, more precisely, free
space fragmentation. File fragmentation causes performance problems
when reading files, while free space fragmentation causes performance
problems when creating and extending files.
Neither condition has anything to do with the contents of a file.
We are concerned only with the files as containers for data and
with the arrangement of these containers on the disk.
The term fragmentation is sometimes applied to the contents of
a file. This type of fragmentation will be explained
here only to differentiate it from our real subjects, file and free space
fragmentation.
Files consist of records. Each record is a collection of fields considered as a unit. There are three basic kinds of files, each affected by file content fragmentation differently:
Sequential: In a sequential file, every record except the first
falls immediately after the preceding record. There are no gaps. An illustration
of a sequential file is a music cassette. You cannot get to any selection
without searching through the tape. Accordingly, sequential files are not
subject to internal fragmentation. The situation simply cannot exist.
Random: In a random access or direct access file, every record
is the same size. Because of this, records can be deleted and replaced
with new ones easily. An illustration of a direct access file is a bookshelf
full of books which are all the same size. You can go directly to any book
desired and withdraw it from the shelf. You can also replace it anywhere
there is a space on the shelf. Fragmentation of the contents of such a
file causes virtually no performance problems, as the file is designed
to be accessed in random order and any new record is guaranteed to fit
precisely within any free space in the file.
Indexed: Indexed files, however, do suffer from internal fragmentation.
An illustration of an indexed file is a floor of offices in a building.
The directory in the lobby tells you what floor the office is on, but you
still have to search the floor to find the right office. Such files have
an index that contains pointers to organized data records elsewhere in
the file. In such a file, variable length data records are stored in buckets
of a certain number of blocks each. If a record will not fit in a bucket
(because the bucket is already full of other records), the bucket is split
into two buckets to accommodate all the records. An indexed file with numerous
split buckets is said to be fragmented. This type of fragmentation
affects performance of only those applications accessing the affected file
(unless such activity is so intense that it degrades the performance of
the entire system). It is cured by reorganizing the data records within
the file, usually by creating a better-organized copy of the file to supersede
the fragmented one. This reorganization can be done safely only when access
to the file has been suspended.
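On OpenVMS, this kind of reorganization is ordinarily done with the
Analyze/RMS_File and Convert utilities. The following is only a minimal
sketch; CUSTOMER.IDX is a placeholder file name, and access to the file
must be suspended while the procedure runs.

$ ! A minimal sketch of reorganizing a fragmented indexed file.
$ ! CUSTOMER.IDX is a placeholder name; substitute your own file.
$ ! Generate an FDL description of the file (written to CUSTOMER.FDL).
$ ANALYZE /RMS_FILE /FDL CUSTOMER.IDX
$ ! Build a fresh, well-organized copy that supersedes the old version.
$ CONVERT /FDL=CUSTOMER.FDL CUSTOMER.IDX CUSTOMER.IDX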
This internal file fragmentation is not the type of fragmentation
with which this book is concerned.
Another type of fragmentation which occurs on OpenVMS systems but is
beyond the scope of this book is pagefile fragmentation.
As information is added to the pagefile and deleted from it, the space in the pagefile can become fragmented, leaving no single space large enough to hold more information. This type of fragmentation causes severe performance degradation and can even cause the system to become unusable. It is cured by rebooting the system, and is prevented by increasing the size of the pagefile or adding secondary pagefile(s) to the system.
Figure 2-1 Pagefile Fragmentation
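For what it is worth, secondary pagefiles are created and installed with
the SYSGEN utility. The example below is only a sketch; the device,
directory and size are placeholders, and AUTOGEN is normally the preferred
way to make such a change permanent.

$ ! A sketch of adding a secondary pagefile (placeholder device and size).
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> CREATE DUA2:[SYSEXE]PAGEFILE1.SYS /SIZE=100000
SYSGEN> INSTALL DUA2:[SYSEXE]PAGEFILE1.SYS /PAGEFILE
SYSGEN> EXIT

A pagefile installed this way lasts only until the next reboot unless the
system startup procedures are updated to install it again.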
It sometimes happens that a file is deliberately created in a fragmented
state. The best example of this is a standard OpenVMS file, needed for
every OpenVMS disk volume, called INDEXF.SYS. This
file contains the headers for all the files on that volume.
It also contains certain information critical to the system's ability to
access data on that disk volume, like the location of the INDEXF.SYS file
itself. This information is so important that it is separated into four pieces
and stored in four different places on the disk, minimizing the risk of
losing all four pieces at once and maximizing the ability to recover data
from a damaged disk. As these four pieces are part of the INDEXF.SYS file,
the file must be fragmented at all times, but only to the degree described
here. The part of the file containing file headers can be made contiguous
and kept so.
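If you want to see this deliberate fragmentation for yourself, you can dump
the header of INDEXF.SYS and look at its retrieval pointers. The device
name below is a placeholder, and suitable privileges are required.

$ ! Dump the file header of INDEXF.SYS and examine its retrieval pointers.
$ ! DUA0: is a placeholder device name.
$ DUMP /HEADER /BLOCK=COUNT=0 DUA0:[000000]INDEXF.SYS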
When OpenVMS allocates disk space for a file, it looks in the storage bitmap to find what clusters are available. In so looking, it always begins its scan of the storage bitmap from the beginning (LBN 0) when the disk has been recently mounted. Thus there is a tendency on the part of OpenVMS to group files near the logical beginning of a disk, leaving the higher LBNs free. This tendency is modified (for better or for worse) by the extent cache (see the Extent Cache section later in this chapter for a more complete explanation), but it is worth understanding clearly, as it is one of the primary causes of file and free space fragmentation on an OpenVMS disk.
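Incidentally, the cluster size and free-block count that the storage bitmap
describes can be checked for any mounted volume; DUA0: below is a
placeholder device name.

$ ! Display the volume's cluster size, free blocks and related attributes.
$ SHOW DEVICE /FULL DUA0: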
Starting with a completely empty disk, allocating space by choosing
the first available clusters in the storage bitmap is a reasonable approach.
At least it is until some files are deleted. Until file deletions begin,
you would see the storage bitmap bits changing steadily from "free"
to "allocated," from beginning to end, like mercury in a thermometer
rising from the bulb to the boiling point. The state of the disk is clear:
every cluster before a certain point is allocated to one file or another,
while every cluster after that same point is free, waiting to be allocated
to a new file. Additionally, every file is contiguous - the ideal state
for maximum disk I/O performance under most circumstances.
Figure 2-2 Contiguous Files On A Disk
Once even a single file is deleted, however, the OpenVMS scan-from-the-beginning
allocation strategy begins to trip over itself. When the file is deleted,
naturally, its clusters are marked "free" in the storage bitmap.
Our elegant thermometer is now broken, having a gap in the mercury somewhere
between the bulb and the mercury's highest point.
Figure 2-3 Fragmented Files On A Disk
The scan-from-the-beginning allocation strategy is going to find that
gap on the next allocation scan and allocate the space to the new file.
This is fine, presenting no performance problem or fragmentation susceptibility,
provided the new file fits entirely within the gap vacated by the deleted
file.
Figure 2-4 New File Allocation
But what if it doesn't fit? What if the new file is larger than the
one deleted? OpenVMS will allocate the entire gap (or what is left of it
if part has been used already) to the new file and then continue its scan
of the storage bitmap to find more space to allocate. With only a single
gap in the storage bitmap, this continued scan will take us all the way
to the end of the allocated portion of the storage bitmap and there we
will find the space to allocate for the remainder of the file. Not so bad.
The file has only two extents (fragments). And OpenVMS, as we have seen,
was specifically designed to deal with files broken into multiple fragments.
This two-fragment file is not a serious problem for OpenVMS, causing only
a slight degradation of performance. But what happens when more than a
few files are deleted? What happens when dozens, hundreds or even thousands
of files are deleted, as is the typical case for an interactive time-sharing
system like OpenVMS? What happens is that the mercury in our thermometer
becomes shattered into a zillion pieces, with a zillion
gaps into which file fragments can be allocated. In fact, even with a maximally
fragmented storage bitmap, in which precisely every other cluster is allocated,
with the intervening clusters free, OpenVMS continues to merrily allocate
disk space on a first-come-first-served, scan-from-the-beginning basis.
Under these circumstances, space for a 100-block file on a disk with a
one-block cluster size would be allocated in 100 separate pieces, giving
you a file that requires 100 separate disk I/O operations to service,
where a single I/O operation would serve for the same file existing in
only one piece.
Why? Well, scanning the storage bitmap takes precious time. Ending the
scan at the first available cluster makes for shorter scans and saves time.
At least it saves scanning time. But what about the 100 times greater
overhead required to access fragmented files?
As we have seen in Chapter 1, a decade ago there were good reasons for
this now seemingly awful blunder. Before inspecting its true impact, however,
we have to take into consideration the extent cache. The extent
cache is a portion of the system's memory that is set aside for the use
of the OpenVMS file allocation mechanism. The extent cache stores the LBNs
of released clusters, making it easy for OpenVMS to reuse these same clusters
without the overhead of a storage bitmap scan.
Figure 2-5 Extent Cache
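As a point of reference, the extent cache is governed by the SYSGEN
parameters ACP_EXTCACHE (the number of entries the cache can hold) and
ACP_EXTLIMIT (a cap on how much of the disk's free space those entries may
represent). You can inspect the current settings as follows:

$ ! Inspect the SYSGEN parameters that govern the extent cache.
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> SHOW ACP_EXTCACHE
SYSGEN> SHOW ACP_EXTLIMIT
SYSGEN> EXIT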
Some argue that the extent cache completely overcomes the drawbacks
of the scan-from-the-beginning allocation strategy, claiming that the majority
of deleted files (the ones whose clusters will be loaded into the extent
cache) tend to reside in the higher LBNs of a disk. While this may be true
in a contrived laboratory environment, it is not the case in a typical
production environment. In a production environment, with lots of users
running programs that create and delete files willy-nilly, the deleted
files tend to occur randomly over the entire range of LBNs on a disk.
The above description of the OpenVMS file allocation strategy may seem
beyond belief or exaggerated. If you lean towards skepticism, here is a
way to demonstrate the matter for yourself.
You need a disk that can be initialized. Of course,
this means all files on the disk will be lost, so don't go initializing
a disk containing data you need. Use a floppy disk, if you have one on
your system, or use a spare disk. If you have neither, use a data disk
only after you have backed it up carefully.
You will need two test files: one very small (1 to 4 blocks) called TEST_SMALL.DAT and one somewhat larger (about 100 blocks) called TEST_BIG.DAT. It does not matter what is in these files. Pick any two files you have on hand that are about the right size and copy them, or create a new file using the DCL CREATE command, a text editor or some other procedure of your choice.
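The command procedure itself is not reproduced here, but a minimal sketch of
one that does what is described below might look like the following. The
device name DUA1:, the directory [FRAGDEMO], and the temporary file names
are assumptions of this sketch; substitute your own, and make sure the
scratch disk is dismounted before initializing it.

$ ! FRAGDEMO.COM - a sketch of the demonstration described below.
$ ! DUA1:, [FRAGDEMO] and the .TMP file names are placeholders.
$ INITIALIZE DUA1: SCRATCH
$ MOUNT DUA1: SCRATCH
$ CREATE /DIRECTORY DUA1:[FRAGDEMO]
$ ! Create ten pairs of small files ...
$ I = 0
$ CREATE_LOOP:
$   COPY TEST_SMALL.DAT DUA1:[FRAGDEMO]KEEP_'I'.TMP
$   COPY TEST_SMALL.DAT DUA1:[FRAGDEMO]GAP_'I'.TMP
$   I = I + 1
$   IF I .LT. 10 THEN GOTO CREATE_LOOP
$ ! ... then delete every other one, leaving ten small gaps.
$ DELETE DUA1:[FRAGDEMO]GAP_*.TMP;*
$ ! Copy the large file; OpenVMS scatters it across the gaps.
$ COPY TEST_BIG.DAT DUA1:[FRAGDEMO]TEST_BIG.DAT
$ ! Examine the new file's header and count the retrieval pointers.
$ DUMP /HEADER /BLOCK=COUNT=0 DUA1:[FRAGDEMO]TEST_BIG.DAT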
This command procedure initializes the scratch disk so it consists entirely of one big contiguous free space, less the necessary system files. It then creates ten pairs of small files and deletes every other one, leaving ten small files separated by ten small gaps. Next, it copies one large file onto the disk. This large file is invariably broken up by OpenVMS into ten small pieces (occupying the ten small gaps) and one large piece (the remainder). In other words, the file is created by OpenVMS in a badly fragmented condition even though there is plenty of free space further along on the disk in which the file could have been created contiguously.
In the display resulting from the DUMP /HEADER command at the end of the command procedure, file fragments are represented by the Retrieval Pointers. If there is more than one pointer, the file is fragmented. In this example, you should see eleven retrieval pointers. In the dump, the counts and number of map area words are not important for our purposes; it is the number of pointers that you should pay attention to. As you can see from the dump of the header, the file TEST_BIG.DAT is split into many fragments even though far more than 100 free blocks remain on the disk in a single contiguous free space.
When you consider the long-term effects of this allocation strategy
on a disk in continuous use, you can see readily that fragmentation can
become extreme.
Fragmentation at its worst comes in the form of the multi-header file.
As its name implies, this is a file with more than one header or, to be
more precise, a file with so many retrieval pointers that they will not
fit into a single one-block header. OpenVMS, therefore, allocates
a second (or third or fourth!) block in the INDEXF.SYS file to accommodate
storage of the extra retrieval pointers. Just for the record, the first
block of a file header will hold all the information there is to know about
a file, plus approximately 70 retrieval pointers. A full header block,
therefore, can accommodate a file fragmented into as many as 70 pieces.
This is pretty miserable, as fragmentation goes, but it can get worse -
much worse.
A second header block can be allocated to hold approximately another 102 retrieval pointers. This gets us up to the positively gross level of 172 fragments in a single file. Not wanting to underestimate the depths to which disk management can fall, the VMS developers provided for even more additional header blocks - each one holding another 102 pointers or so. I don't want to take this line of discussion any further, though. Fragmentation to the tune of hundreds of fragments per file borders on outright sabotage.
How widespread is the fragmentation disease? Pandemic is the
word doctors use to describe a disease when virtually everyone has it.
Fragmentation is unquestionably pandemic. It occurs on every computer running
the OpenVMS system, with one exception: the system that is hardly ever used.
If you have a computer system that you don't use very often, its fragmentation
problem will be slight. But if you don't use it, who cares?
That leaves us with all the other systems - the vast majority by far.
These systems are typically running 24 hours a day, used interactively
by the users from somewhere around 8:00 AM to the vicinity of 5:00 PM,
with peaks of usage around 10:00 AM and 2:30 PM, and a real dead spot at
lunch time. Such systems typically have sporadic usage in the evening,
then slam to 100% utilization at midnight when a barrage of batch jobs
kick off and run for several hours; usage then tapers off to nearly nothing
until the users arrive again in the morning.
Such a system typically has several disk drives dedicated to user applications.
These disks get a lot of use, with hundreds of files being created and
deleted every day. Naturally, more are created than are deleted, so the
disk tends to fill up every few months and stay that way (nearly full)
until the System Manager forces users to delete excess files.
Under these circumstances, a disk will fragment badly. You can expect
to see a 10% to 20% increase in fragmentation each week. That is, if you
had 10,000 files, all contiguous at the beginning of the week, by the same
time the next week, you could expect those same 10,000 files to consist
of 11,000 pieces or more. A week later, there would be over 12,000 pieces,
then 13,000 and so on. After a month, the fragmentation level would exceed
40% with over 14,000 pieces. In three months, the level multiplies to over
240%, with over 34,000 pieces. After a year, the problem would theoretically
reach astronomical proportions, with those same 10,000 files fragmented
into some 1.4 million pieces. But it doesn't really, as there aren't enough
disk blocks to hold that many pieces (on this "typical" disk)
and the performance degradation is so bad that users aren't able to use
the system enough to keep up the fragmentation rate.
It is true, however, that a poorly managed disk, with nothing done about
fragmentation, will over time degrade so badly that it becomes, for all
practical purposes, unusable: each file ends up in so many pieces that the
time needed to access all the files a user wants is simply not worth the
effort.