A Fast File System for UNIX*
Revised July 27, 1983
Marshall Kirk McKusick, William N. Joy†,
Samuel J. Leffler‡, Robert S. Fabry
Computer Systems Research Group
Computer Science Division
Department of Electrical Engineering and Computer Science
University of California, Berkeley
Berkeley, CA 94720
ABSTRACT
A reimplementation of the UNIX file system is
described. The reimplementation provides substan-
tially higher throughput rates by using more flex-
ible allocation policies that allow better local-
ity of reference and that can be adapted to a wide
range of peripheral and processor characteristics.
The new file system clusters data that is sequen-
tially accessed and provides two block sizes to
allow fast access for large files while not wast-
ing large amounts of space for small files.  File
access rates up to ten times faster than those of
the traditional UNIX file system are experienced.
Long-needed enhancements to the user interface are
discussed. These include a mechanism to lock
files, extensions of the name space across file
systems, the ability to use arbitrary length file
names, and provisions for efficient administrative
control of resource usage.
_________________________
* UNIX is a trademark of Bell Laboratories.
† William N. Joy is currently employed by: Sun Microsys-
tems, Inc, 2550 Garcia Avenue, Mountain View, CA 94043
‡ Samuel J. Leffler is currently employed by: Lucasfilm
Ltd., PO Box 2009, San Rafael, CA 94912
This work was done under grants from the National Sci-
ence Foundation under grant MCS80-05144, and the Defense
Advanced Research Projects Agency (DoD) under Arpa Order
No. 4031 monitored by the Naval Electronics Systems Command
under Contract No. N00039-82-C-0235.
TABLE OF CONTENTS
1. Introduction
2. Old file system
3. New file system organization
.1. Optimizing storage utilization
.2. File system parameterization
.3. Layout policies
4. Performance
5. File system functional enhancements
.1. Long file names
.2. File locking
.3. Symbolic links
.4. Rename
.5. Quotas
6. Software engineering
References
1. Introduction
This paper describes the changes from the original 512
byte UNIX file system to the new one released with the 4.2
Berkeley Software Distribution. It presents the motivations
for the changes, the methods used to effect these changes,
the rationale behind the design decisions, and a description
of the new implementation. This discussion is followed by a
summary of the results that have been obtained, directions
for future work, and the additions and changes that have
been made to the user visible facilities. The paper con-
cludes with a history of the software engineering of the
project.
The original UNIX system that runs on the PDP-11† has
simple and elegant file system facilities. File system
input/output is buffered by the kernel; there are no align-
ment constraints on data transfers and all operations are
made to appear synchronous. All transfers to the disk are
in 512 byte blocks, which can be placed arbitrarily within
the data area of the file system. No constraints other than
available disk space are placed on file growth [Ritchie74],
[Thompson79].
_________________________
† DEC, PDP, VAX, MASSBUS, and UNIBUS are trademarks of
Digital Equipment Corporation.
When used on the VAX-11 together with other UNIX
enhancements, the original 512 byte UNIX file system is
incapable of providing the data throughput rates that many
applications require. For example, applications that need
to do a small amount of processing on large quantities of
data, such as VLSI design and image processing, need high
throughput from the file system. High throughput
rates are also needed by programs with large address spaces
that are constructed by mapping files from the file system
into virtual memory. Paging data in and out of the file
system is likely to occur frequently. This requires a file
system providing higher bandwidth than the original 512 byte
UNIX file system, which provides only about two percent of
the maximum disk bandwidth, or about 20 kilobytes per second
per arm
[White80], [Smith81b].
Modifications have been made to the UNIX file system to
improve its performance. Since the UNIX file system inter-
face is well understood and not inherently slow, this
development retained the abstraction and simply changed the
underlying implementation to increase its throughput. Con-
sequently users of the system have not been faced with mas-
sive software conversion.
Problems with file system performance have been dealt
with extensively in the literature; see [Smith81a] for a
survey. The UNIX operating system drew many of its ideas
from Multics, a large, high performance operating system
[Feiertag71]. Other work includes Hydra [Almes78], Spice
[Thompson80], and a file system for a lisp environment
[Symbolics81a].
A major goal of this project has been to build a file
system that is extensible into a networked environment
[Holler73]. Other work on network file systems describes
centralized file servers [Accetta80], distributed file
servers [Dion80], [Luniewski77], [Porcar82], and protocols
to reduce the amount of information that must be transferred
across a network [Symbolics81b], [Sturgis80].
2. Old file system
In the old file system developed at Bell Laboratories
each disk drive contains one or more file systems.† A file
system is described by its super-block, which contains the
basic parameters of the file system. These include the
number of data blocks in the file system, a count of the
maximum number of files, and a pointer to a list of free
blocks. All the free blocks in the system are chained
together in a linked list. Within the file system are
files. Certain files are distinguished as directories and
contain pointers to files that may themselves be direc-
tories. Every file has a descriptor associated with it
called an inode. The inode contains information describing
ownership of the file, time stamps marking last modification
and access times for the file, and an array of indices that
point to the data blocks for the file. For the purposes of
this section, we assume that the first 8 blocks of the file
are directly referenced by values stored in the inode struc-
ture itself*. The inode structure may also contain refer-
ences to indirect blocks containing further data block
indices. In a file system with a 512 byte block size, a
singly indirect block contains 128 further block addresses,
a doubly indirect block contains 128 addresses of further
single indirect blocks, and a triply indirect block contains
128 addresses of further doubly indirect blocks.
_________________________
† A file system always resides on a single drive.
* The actual number may vary from system to system, but is
usually in the range 5-13.
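To make the capacity arithmetic concrete, the following
sketch (illustrative code, not taken from the system) adds
up the bytes addressable with 8 direct blocks and 128
addresses per indirect block:

    #include <stdio.h>

    int
    main(void)
    {
        long bsize = 512;        /* file system block size */
        long nindir = bsize / 4; /* 4 byte disk addresses: 128 */
        long ndirect = 8;        /* direct blocks in the inode */
        double blocks;

        /* blocks reachable through the direct, singly,
           doubly, and triply indirect pointers */
        blocks = ndirect + nindir + (double)nindir * nindir +
            (double)nindir * nindir * nindir;
        printf("maximum file size: %.0f bytes\n", blocks * bsize);
        return 0;
    }

With these parameters a file can grow to just over one
gigabyte before the triply indirect addresses are exhausted.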
A traditional 150 megabyte UNIX file system consists of
4 megabytes of inodes followed by 146 megabytes of data.
This organization segregates the inode information from the
data; thus accessing a file normally incurs a long seek from
its inode to its data. Files in a single directory are not
typically allocated slots in consecutive locations in the 4
megabytes of inodes, causing many non-consecutive blocks to
be accessed when executing operations on all the files in a
directory.
The allocation of data blocks to files is also subop-
timal. The traditional file system never transfers more
than 512 bytes per disk transaction and often finds that the
next sequential data block is not on the same cylinder,
forcing seeks between 512 byte transfers. The combination
of the small block size, limited read-ahead in the system,
and many seeks severely limits file system throughput.
The first work at Berkeley on the UNIX file system
attempted to improve both reliability and throughput. The
reliability was improved by changing the file system so that
all modifications of critical information were staged so
that they could either be completed or repaired cleanly by a
program after a crash [Kowalski78]. The file system perfor-
mance was improved by a factor of more than two by changing
the basic block size from 512 to 1024 bytes. The increase
was because of two factors: each disk transfer accessed
twice as much data, and most files could be described
without needing to access indirect blocks since the
direct blocks contained twice as much data. The file system
with these changes will henceforth be referred to as the old
file system.
This performance improvement gave a strong indication
that increasing the block size was a good method for improv-
ing throughput. Although the throughput had doubled, the
old file system was still using only about four percent of
the disk bandwidth. The main problem was that although the
free list was initially ordered for optimal access, it
quickly became scrambled as files were created and removed.
Eventually the free list became entirely random, causing
files to have their blocks allocated randomly over the disk.
This forced the disk to seek before every block access.
Although old file systems provided transfer rates of up to
175 kilobytes per second when they were first created, this
rate deteriorated to 30 kilobytes per second after a few
weeks of moderate use because of randomization of their free
block list. There was no way of restoring the performance
of an old file system except to dump, rebuild, and restore the
file system. Another possibility would be to have a process
that periodically reorganized the data on the disk to
restore locality as suggested by [Maruyama76].
3. New file system organization
As in the old file system organization each disk drive
contains one or more file systems. A file system is
described by its super-block, which is located at the begin-
ning of its disk partition. Because the super-block con-
tains critical data it is replicated to protect against
catastrophic loss. This is done at the time that the file
system is created; since the super-block data does not
change, the copies need not be referenced unless a head
crash or other hard disk error causes the default super-
block to be unusable.
To ensure that it is possible to create files as large
as 2^32 bytes with only two levels of indirection, the
minimum size of a file system block is 4096 bytes (a doubly
indirect block of 4 byte disk addresses then maps 1024*1024
blocks of 4096 bytes, or 2^32 bytes). The size
of file system blocks can be any power of two greater than
or equal to 4096. The block size of the file system is
maintained in the super-block so it is possible for file
systems with different block sizes to be accessible simul-
taneously on the same system. The block size must be
decided at the time that the file system is created; it can-
not be subsequently changed without rebuilding the file sys-
tem.
The new file system organization partitions the disk
into one or more areas called cylinder groups. A cylinder
group is comprised of one or more consecutive cylinders on a
disk. Associated with each cylinder group is some bookkeep-
ing information that includes a redundant copy of the
super-block, space for inodes, a bit map describing avail-
able blocks in the cylinder group, and summary information
describing the usage of data blocks within the cylinder
group. For each cylinder group a static number of inodes is
allocated at file system creation time. The current policy
is to allocate one inode for each 2048 bytes of disk space,
expecting this to be far more than will ever be needed.
All the cylinder group bookkeeping information could be
placed at the beginning of each cylinder group. However if
this approach were used, all the redundant information would
be on the top platter. Thus a single hardware failure that
destroyed the top platter could cause the loss of all copies
of the redundant super-blocks. Consequently the cylinder group
bookkeeping information begins at a floating offset from the
beginning of the cylinder group. The offset for each suc-
cessive cylinder group is calculated to be about one track
further from the beginning of the cylinder group. In this
way the redundant information spirals down into the pack so
that any single track, cylinder, or platter can be lost
without losing all copies of the super-blocks. Except for
the first cylinder group, the space between the beginning of
the cylinder group and the beginning of the cylinder group
information is used for data blocks.†
_________________________
† While it appears that the first cylinder group could be
laid out with its super-block at the ``known'' location,
this would not work for file systems with block sizes of
16K or greater, because of the requirement that the cylinder
group information must begin at a block boundary.
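The offset calculation can be sketched as follows; the names
and the exact formula are illustrative, not those of the
4.2BSD sources:

    /* Byte offset of the bookkeeping information for cylinder
       group cg: each successive group places its bookkeeping
       about one track further in, wrapping so that the offset
       stays within the group. */
    long
    cgoffset(int cg, long cgsize, long tracksize, int ntracks)
    {
        return cg * cgsize + (cg % ntracks) * tracksize;
    }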
3.1. Optimizing storage utilization
Data is laid out so that larger blocks can be
transferred in a single disk transfer, greatly increasing
file system throughput. As an example, consider a file in
the new file system composed of 4096 byte data blocks. In
the old file system this file would be composed of 1024 byte
blocks. By increasing the block size, disk accesses in the
new file system may transfer up to four times as much infor-
mation per disk transaction. In large files, several 4096
byte blocks may be allocated from the same cylinder so that
even larger data transfers are possible before initiating a
seek.
The main problem with bigger blocks is that most UNIX
file systems are composed of many small files. A uniformly
large block size wastes space. Table 1 shows the effect of
file system block size on the amount of wasted space in the
file system. The machine measured to obtain these figures
is one of our time sharing systems that has roughly 1.2
gigabytes of on-line storage. The measurements are based on
the active user file systems containing about 920 megabytes
of formatted space.
_________________________________________________________________________
|Space used| % waste |Organization                                      |
|__________|_________|__________________________________________________|
|775.2 Mb  |   0.0   |Data only, no separation between files            |
|807.8 Mb  |   4.2   |Data only, each file starts on 512 byte boundary  |
|828.7 Mb  |   6.9   |512 byte block UNIX file system                   |
|866.5 Mb  |  11.8   |1024 byte block UNIX file system                  |
|948.5 Mb  |  22.4   |2048 byte block UNIX file system                  |
|1128.3 Mb |  45.6   |4096 byte block UNIX file system                  |
|__________|_________|__________________________________________________|

Table 1 - Amount of wasted space as a function of block size.
The space wasted is measured as the percentage of space on
the disk not containing user data. As the block size on the
disk increases, the waste rises quickly, to an intolerable
45.6% waste with 4096 byte file system blocks.
To be able to use large blocks without undue waste,
small files must be stored in a more efficient way. The new
file system accomplishes this goal by allowing the division
of a single file system block into one or more fragments.
The file system fragment size is specified at the time that
the file system is created; each file system block can be
optionally broken into 2, 4, or 8 fragments, each of which
is addressable. The lower bound on the size of these frag-
ments is constrained by the disk sector size, typically 512
bytes. The block map associated with each cylinder group
records the space availability at the fragment level; to
determine block availability, aligned fragments are exam-
ined. Figure 1 shows a piece of a map from a 4096/1024 file
system.
____________________________________________
|Bits in map     | XXXX  XXOO  OOXX  OOOO  |
|Fragment numbers| 0-3   4-7   8-11  12-15 |
|________________|_________________________|

Figure 1 - Example layout of blocks and fragments in a
4096/1024 file system.
Each bit in the map records the status of a fragment; an
``X'' shows that the fragment is in use, while a ``O'' shows
that the fragment is available for allocation. In this
example, fragments 0-5, 10, and 11 are in use, while frag-
ments 6-9, and 12-15 are free. Fragments of adjoining
blocks cannot be used as a block, even if they are large
enough. In this example, fragments 6-9 cannot be coalesced
into a block; only fragments 12-15 are available for alloca-
tion as a block.
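A sketch of the block availability test implied by this map
(the names and the bit convention are illustrative):

    #define isset(map, i) ((map)[(i) / 8] & (1 << ((i) % 8)))

    /* A block is available only if every one of its aligned
       fragments is marked free in the cylinder group map. */
    int
    blockfree(unsigned char *map, int blk, int fragsperblk)
    {
        int i, base = blk * fragsperblk;

        for (i = 0; i < fragsperblk; i++)
            if (!isset(map, base + i))
                return 0;   /* some fragment is in use */
        return 1;           /* all fragments are free */
    }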
On a file system with a block size of 4096 bytes and a
fragment size of 1024 bytes, a file is represented by zero
or more 4096 byte blocks of data, and possibly a single
fragmented block. If a file system block must be fragmented
to obtain space for a small amount of data, the remainder of
the block is made available for allocation to other files.
As an example consider an 11000 byte file stored on a
4096/1024 byte file system. This file would use two full
size blocks and a 3072 byte fragment. If no 3072 byte frag-
ments are available at the time the file is created, a full
size block is split yielding the necessary 3072 byte frag-
ment and an unused 1024 byte fragment. This remaining frag-
ment can be allocated to another file as needed.
The granularity of allocation is the write system call.
Each time data is written to a file, the system checks to
see if the size of the file has increased*. If the file
needs to be expanded to hold the new data, one of three
conditions exists:
1) There is enough space left in an already allocated
block to hold the new data. The new data is written
into the available space in the block.
2) Nothing has been allocated. If the new data contains
more than 4096 bytes, a 4096 byte block is allocated
and the first 4096 bytes of new data is written there.
This process is repeated until less than 4096 bytes of
new data remain. If the remaining new data to be writ-
ten will fit in three or fewer 1024 byte pieces, an
unallocated fragment is located, otherwise a 4096 byte
block is located. The new data is written into the
located piece.
3) A fragment has been allocated. If the number of bytes
in the new data plus the number of bytes already in the
fragment exceeds 4096 bytes, a 4096 byte block is allo-
cated. The contents of the fragment are copied to the
beginning of the block and the remainder of the block
is filled with the new data. The process then contin-
ues as in (2) above. If the number of bytes in the new
data plus the number of bytes already in the fragment
will fit in three or fewer 1024 byte pieces, an unallo-
cated fragment is located, otherwise a 4096 byte block
is located. The contents of the previous fragment,
appended with the new data, are written into the allo-
cated piece.
_________________________
* A program may be overwriting data in the middle of an
existing file, in which case space will already be allo-
cated.
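The size of the piece allocated for the tail of a file
follows directly from these rules; a minimal sketch,
assuming a 4096/1024 byte file system:

    #define BSIZE 4096
    #define FSIZE 1024

    /* Size of the piece to allocate for the final partial
       block of a file of newlen bytes; 0 if the file ends on
       a block boundary. Illustrative only. */
    long
    tailalloc(long newlen)
    {
        long tail = newlen % BSIZE; /* bytes past last full block */

        if (tail == 0)
            return 0;               /* full blocks only */
        if (tail > 3 * FSIZE)
            return BSIZE;           /* needs a whole block */
        return ((tail + FSIZE - 1) / FSIZE) * FSIZE; /* 1-3 frags */
    }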
The problem with allowing only a single fragment on a
4096/1024 byte file system is that data may be potentially
copied up to three times as its requirements grow from a
1024 byte fragment to a 2048 byte fragment, then a 3072 byte
fragment, and finally a 4096 byte block. The fragment real-
location can be avoided if the user program writes a full
block at a time, except for a partial block at the end of
the file. Because file systems with different block sizes
may coexist on the same system, the file system interface
has been extended to provide the ability to determine the
optimal size for a read or write. For files the optimal
size is the block size of the file system on which the file
is being accessed. For other objects, such as pipes and
sockets, the optimal size is the underlying buffer size.
This feature is used by the Standard Input/Output Library, a
package used by most user programs. This feature is also
used by certain system utilities such as archivers and
loaders that do their own input and output management and
need the highest possible file system bandwidth.
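For example, a program can size its buffers from the optimal
value that 4.2BSD reports in the st_blksize field of the
stat structure; a minimal sketch:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    /* Allocate an I/O buffer matched to the optimal transfer
       size of the file system holding this descriptor. */
    char *
    iobuf(int fd, long *sizep)
    {
        struct stat sb;

        if (fstat(fd, &sb) < 0)
            return NULL;
        *sizep = sb.st_blksize;
        return malloc(sb.st_blksize);
    }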
The space overhead in the 4096/1024 byte new file sys-
tem organization is empirically observed to be about the
same as in the 1024 byte old file system organization. A
file system with 4096 byte blocks and 512 byte fragments has
about the same amount of space overhead as the 512 byte
block UNIX file system. The new file system is more space
efficient than the 512 byte or 1024 byte file systems in
that it uses the same amount of space for small files while
requiring less indexing information for large files. This
savings is offset by the need to use more space for keeping
track of available free blocks. The net result is about the
same disk utilization when the new file system's fragment
size equals the old file system's block size.
In order for the layout policies to be effective, the
disk cannot be kept completely full. Each file system main-
tains a parameter that gives the minimum acceptable percen-
tage of file system blocks that can be free. If the number
of free blocks drops below this level, only the system
administrator can continue to allocate blocks. The value of
this parameter can be changed at any time, even when the
file system is mounted and active. The transfer rates to be
given in section 4 were measured on file systems kept less
than 90% full. If the reserve of free blocks is set to
zero, the file system throughput rate tends to be cut in
half, because of the inability of the file system to local-
ize the blocks in a file. If the performance is impaired
because of overfilling, it may be restored by removing
enough files to obtain 10% free space. Access speed for
files created during periods of little free space can be
restored by recreating them once enough space is available.
The amount of free space maintained must be added to the
percentage of waste when comparing the organizations given
in Table 1. Thus, a site running the old 1024 byte UNIX
file system wastes 11.8% of the space and one could expect
to fit the same amount of data into a 4096/512 byte new file
system with 5% free space, since a 512 byte old file system
wasted 6.9% of the space.
3.2. File system parameterization
Except for the initial creation of the free list, the
old file system ignores the parameters of the underlying
hardware. It has no information about either the physical
characteristics of the mass storage device, or the hardware
that interacts with it. A goal of the new file system is to
parameterize the processor capabilities and mass storage
characteristics so that blocks can be allocated in an
optimum, configuration-dependent way. Parameters used include
the speed of the processor, the hardware support for mass
storage transfers, and the characteristics of the mass
storage devices. Disk technology is constantly improving
and a given installation can have several different disk
technologies running on a single processor. Each file sys-
tem is parameterized so that it can adapt to the charac-
teristics of the disk on which it is placed.
For mass storage devices such as disks, the new file
system tries to allocate new blocks on the same cylinder as
the previous block in the same file. Optimally, these new
blocks will also be well positioned rotationally. The dis-
tance between ``rotationally optimal'' blocks varies
greatly; it can be a consecutive block or a rotationally
delayed block depending on system characteristics. On a
processor with a channel that does not require any processor
intervention between mass storage transfer requests, two
consecutive disk blocks often can be accessed without
suffering lost time because of an intervening disk revolu-
tion. For processors without such channels, the main pro-
cessor must field an interrupt and prepare for a new disk
transfer. The expected time to service this interrupt and
schedule a new disk transfer depends on the speed of the
main processor.
The physical characteristics of each disk include the
number of blocks per track and the rate at which the disk
spins. The allocation policy routines use this information
to calculate the number of milliseconds required to skip
over a block. The characteristics of the processor include
the expected time to schedule an interrupt. Given the pre-
vious block allocated to a file, the allocation routines
calculate the number of blocks to skip over so that the next
block in a file will be coming into position under the disk
head in the expected amount of time that it takes to start a
new disk transfer operation. For programs that sequentially
access large amounts of data, this strategy minimizes the
amount of time spent waiting for the disk to position
itself.
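A sketch of the calculation (illustrative, not the kernel's
code), assuming a constant rotation rate:

    #include <math.h>

    /* Number of blocks to skip so that the next block of a
       file comes under the head no sooner than the time
       needed to start a new transfer. */
    int
    rotskip(int blkspertrack, int rpm, double startupms)
    {
        double msperrev = 60000.0 / rpm;
        double msperblk = msperrev / blkspertrack;

        /* round up: a block that arrives too soon costs an
           entire extra revolution */
        return (int)ceil(startupms / msperblk);
    }

On a 3600 rpm drive holding four 8192 byte blocks per track,
a 4 millisecond startup time implies skipping one block.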
To ease the task of finding rotationally optimal
blocks, the cylinder group summary information includes a
count of the availability of blocks at different rotational
positions. Eight rotational positions are distinguished, so
the resolution of the summary information is 2 milliseconds
for a typical 3600 revolution-per-minute drive.
The parameter that defines the minimum number of mil-
liseconds between the completion of a data transfer and the
initiation of another data transfer on the same cylinder can
be changed at any time, even when the file system is mounted
and active. If a file system is parameterized to lay out
blocks with rotational separation of 2 milliseconds, and the
disk pack is then moved to a system that has a processor
requiring 4 milliseconds to schedule a disk operation, the
throughput will drop precipitously because of lost disk
revolutions on nearly every block. If the eventual target
machine is known, the file system can be parameterized for
it even though it is initially created on a different pro-
cessor. Even if the move is not known in advance, the rota-
tional layout delay can be reconfigured after the disk is
moved so that all further allocation is done based on the
characteristics of the new host.
3.3. Layout policies
The file system policies are divided into two distinct
parts. At the top level are global policies that use file
system wide summary information to make decisions regarding
the placement of new inodes and data blocks. These routines
are responsible for deciding the placement of new direc-
tories and files. They also calculate rotationally optimal
block layouts, and decide when to force a long seek to a new
cylinder group because there are insufficient blocks left in
the current cylinder group to do reasonable layouts. Below
the global policy routines are the local allocation routines
that use a locally optimal scheme to lay out data blocks.
Two methods for improving file system performance are
to increase the locality of reference to minimize seek
latency as described by [Trivedi80], and to improve the lay-
out of data to make larger transfers possible as described
by [Nevalainen77]. The global layout policies try to
improve performance by clustering related information. They
cannot attempt to localize all data references, but must
also try to spread unrelated data among different cylinder
groups. If too much localization is attempted, the local
cylinder group may run out of space forcing the data to be
scattered to non-local cylinder groups. Taken to an
extreme, total localization can result in a single huge
cluster of data resembling the old file system. The global
policies try to balance the two conflicting goals of local-
izing data that is concurrently accessed while spreading out
unrelated data.
One allocatable resource is inodes. Inodes are used to
describe both files and directories. Files in a directory
are frequently accessed together. For example, the ``list
directory'' command often accesses the inode for each file
in a directory. The layout policy tries to place all the
files in a directory in the same cylinder group. To ensure
that files are allocated throughout the disk, a different
policy is used for directory allocation. A new directory is
placed in the cylinder group that has a greater than average
number of free inodes and the smallest number of directories
in it already. The intent of this policy is to allow the
file clustering policy to succeed most of the time. The
allocation of inodes within a cylinder group is done using a
next free strategy. Although this allocates the inodes ran-
domly within a cylinder group, all the inodes for each
cylinder group can be read with 4 to 8 disk transfers. This
puts a small and constant upper bound on the number of disk
transfers required to access all the inodes for all the
files in a directory as compared to the old file system
where typically one disk transfer is needed to get the
inode for each file in a directory.
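A sketch of the directory placement decision; the structure
and names are illustrative rather than those of the 4.2BSD
sources:

    struct cg {
        int free_inodes;    /* free inodes in this group */
        int ndirs;          /* directories in this group */
    };

    /* Pick a cylinder group with at least the average number
       of free inodes and the fewest directories. */
    int
    dirpref(struct cg *cgs, int ncg)
    {
        long total = 0;
        int i, best = -1;

        for (i = 0; i < ncg; i++)
            total += cgs[i].free_inodes;
        for (i = 0; i < ncg; i++)
            if ((long)cgs[i].free_inodes * ncg >= total &&
                (best < 0 || cgs[i].ndirs < cgs[best].ndirs))
                best = i;
        return best;
    }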
The other major resource is the data blocks. Since
data blocks for a file are typically accessed together, the
policy routines try to place all the data blocks for a file
in the same cylinder group, preferably rotationally
optimally on the same cylinder. The problem with allocating
all the data blocks in the same cylinder group is that large
files will quickly use up available space in the cylinder
group, forcing a spill over to other areas. Using up all
the space in a cylinder group has the added drawback that
future allocations for any file in the cylinder group will
also spill to other areas. Ideally none of the cylinder
groups should ever become completely full. The solution
devised is to redirect block allocation to a newly chosen
cylinder group when a file exceeds 32 kilobytes, and at
every megabyte thereafter. The newly chosen cylinder group
is selected from those cylinder groups that have a greater
than average number of free blocks left. Although big files
tend to be spread out over the disk, a megabyte of data is
typically accessible before a long seek must be performed,
and the cost of one long seek per megabyte is small.
The global policy routines call local allocation rou-
tines with requests for specific blocks. The local alloca-
tion routines will always allocate the requested block if it
is free. If the requested block is not available, the allo-
cator allocates a free block of the requested size that is
rotationally closest to the requested block. If the global
layout policies had complete information, they could always
request unused blocks and the allocation routines would be
reduced to simple bookkeeping. However, maintaining com-
plete information is costly; thus the implementation of the
global layout policy uses heuristic guesses based on partial
information.
If a requested block is not available the local alloca-
tor uses a four level allocation strategy:
1) Use the available block rotationally closest to the
requested block on the same cylinder.
2) If there are no blocks available on the same cylinder,
use a block within the same cylinder group.
3) If the cylinder group is entirely full, quadratically
rehash among the cylinder groups looking for a free
block.
4) Finally if the rehash fails, apply an exhaustive
search.
The use of quadratic rehash is prompted by studies of
symbol table strategies used in programming languages. File
systems that are parameterized to maintain at least 10% free
space almost never use this strategy; file systems that are
run without maintaining any free space typically have so few
free blocks that almost any allocation is random. Conse-
quently the most important characteristic of the strategy
used when the file system is low on space is that it be
fast.
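A sketch of the rehash in step 3; try() stands for an
allocation attempt within one cylinder group and is
illustrative:

    /* Probe cylinder groups cg+1, cg+3, cg+7, ... (mod ncg),
       doubling the step each time, until an attempt succeeds
       or the probes are exhausted. */
    int
    hashalloc(int cg, int ncg, int (*try)(int))
    {
        int i;

        for (i = 1; i < ncg; i *= 2) {
            cg = (cg + i) % ncg;
            if (try(cg))
                return cg;
        }
        return -1;  /* caller falls back to exhaustive search */
    }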
4. Performance
Ultimately, the proof of the effectiveness of the algo-
rithms described in the previous section is the long term
performance of the new file system.
Our empirical studies have shown that the inode layout
policy has been effective. When running the ``list direc-
tory'' command on a large directory that itself contains
many directories, the number of disk accesses for inodes is
cut by a factor of two. The improvements are even more
dramatic for large directories containing only files, disk
accesses for inodes being cut by a factor of eight. This is
most encouraging for programs such as spooling daemons that
access many small files, since these programs tend to flood
the disk request queue on the old file system.
Table 2 summarizes the measured throughput of the new
file system. Several comments need to be made about the
conditions under which these tests were run. The test pro-
grams measure the rate that user programs can transfer data
to or from a file without performing any processing on it.
These programs must write enough data to ensure that buffer-
ing in the operating system does not affect the results.
They should also be run at least three times in succession;
the first to get the system into a known state and the
second two to ensure that the experiment has stabilized and
is repeatable. The methodology and test results are dis-
cussed in detail in [Kridle83]†. The systems were running
multi-user but were otherwise quiescent. There was no con-
tention for either the CPU or the disk arm. The only
difference between the UNIBUS and MASSBUS tests was the con-
troller. All tests used an Ampex Capricorn 330 Megabyte
Winchester disk. As Table 2 shows, all file system test
runs were on a VAX 11/750. All file systems had been in
production use for at least a month before being measured.
_________________________
† A UNIX command that is similar to the reading test that we
used is ``cp file /dev/null'', where ``file'' is eight
megabytes long.
Unlike the old file system, the transfer rates for the
new file system do not appear to change over time. The
throughput rate is tied much more strongly to the amount of
free space that is maintained. The measurements in Table 2
were based on a file system run with 10% free space. Syn-
thetic work loads suggest the performance deteriorates to
about half the throughput rates given in Table 2 when no
free space is maintained.
_______________________________________________________________________
|   Type of      Processor and|                 Read                  |
|  File System   Bus Measured |   Speed       Bandwidth    % CPU      |
|_____________________________|_______________________________________|
|  old 1024      750/UNIBUS   |  29 Kbytes/sec   29/1100    3%   11%  |
|  new 4096/1024 750/UNIBUS   | 221 Kbytes/sec  221/1100   20%   43%  |
|  new 8192/1024 750/UNIBUS   | 233 Kbytes/sec  233/1100   21%   29%  |
|  new 4096/1024 750/MASSBUS  | 466 Kbytes/sec  466/1200   39%   73%  |
|  new 8192/1024 750/MASSBUS  | 466 Kbytes/sec  466/1200   39%   54%  |
|_____________________________|_______________________________________|

Table 2a - Reading rates of the old and new UNIX file systems.
_______________________________________________________________________
|   Type of      Processor and|                 Write                 |
|  File System   Bus Measured |   Speed       Bandwidth    % CPU      |
|_____________________________|_______________________________________|
|  old 1024      750/UNIBUS   |  48 Kbytes/sec   48/1100    4%   29%  |
|  new 4096/1024 750/UNIBUS   | 142 Kbytes/sec  142/1100   13%   43%  |
|  new 8192/1024 750/UNIBUS   | 215 Kbytes/sec  215/1100   19%   46%  |
|  new 4096/1024 750/MASSBUS  | 323 Kbytes/sec  323/1200   27%   94%  |
|_____________________________|_______________________________________|

Table 2b - Writing rates of the old and new UNIX file systems.
The percentage of bandwidth given in Table 2 is a measure
of the effective utilization of the disk by the file system.
An upper bound on the transfer rate from the disk is
measured by doing 65536* byte reads from contiguous
tracks on the disk. The bandwidth is calculated by compar-
ing the data rates the file system is able to achieve as a
percentage of this rate. Using this metric, the old file
system is only able to use about 3-4% of the disk bandwidth,
while the new file system uses up to 39% of the bandwidth.
_________________________
* This number, 65536, is the maximal I/O size supported by
the VAX hardware; it is a remnant of the system's PDP-11
ancestry.
In the new file system, the reading rate is always at
least as fast as the writing rate. This is to be expected
since the kernel must do more work when allocating blocks
than when simply reading them. Note that the write rates
are about the same as the read rates in the 8192 byte block
file system; the write rates are slower than the read rates
in the 4096 byte block file system. The slower write rates
occur because the kernel has to do twice as many disk allo-
cations per second, and the processor is unable to keep up
with the disk transfer rate.
In contrast, the old file system is about 50% faster at
writing files than reading them. This is because the write
system call is asynchronous and the kernel can generate disk
transfer requests much faster than they can be serviced,
hence disk transfers build up in the disk buffer cache.
Because the disk buffer cache is sorted by minimum seek
order, the average seek between the scheduled disk writes is
much less than it would be if the data blocks were written
out in the order in which they were generated. However when
the file is read, the read system call is processed synchro-
nously so the disk blocks must be retrieved from the disk in
the order in which they are allocated. This forces the disk
scheduler to do long seeks resulting in a lower throughput
rate.
The performance of the new file system is currently
limited by a memory-to-memory copy operation because it
transfers data from the disk into buffers in the kernel
address space and then spends 40% of the processor cycles
copying these buffers to user address space. If the buffers
in both address spaces are properly aligned, this transfer
can be effected without copying by using the VAX virtual
memory management hardware. This is especially desirable
when large amounts of data are to be transferred. We did
not implement this because it would change the semantics of
the file system in two major ways; user programs would be
required to allocate buffers on page boundaries, and data
would disappear from buffers after being written.
Greater disk throughput could be achieved by rewriting
the disk drivers to chain together kernel buffers. This
would allow files to be allocated to contiguous disk blocks
that could be read in a single disk transaction. Most disks
contain either 32 or 48 512 byte sectors per track. The
inability to use contiguous disk blocks effectively limits
the performance on these disks to less than fifty percent of
the available bandwidth. Since each track has a multiple of
sixteen sectors it holds exactly two or three 8192 byte file
system blocks, or four or six 4096 byte file system blocks.
If the next block for a file cannot be laid out contigu-
ously, then the minimum spacing to the next allocatable
block on any platter is between a sixth and a half of a
revolution. The implication of this is that the best possible
layout without contiguous blocks uses only half of the
bandwidth of any given track. If each track contains an odd
number of sectors, then it is possible to resolve the rota-
tional delay to any number of sectors by finding a block
that begins at the desired rotational position on another
track. The reason that block chaining has not been imple-
mented is that it would require rewriting all the disk
drivers in the system, and the current throughput rates are
already limited by the speed of the available processors.
Currently only one block is allocated to a file at a
time. A technique used by the DEMOS file system when it
finds that a file is growing rapidly is to preallocate
several blocks at once, releasing them when the file is
closed if they remain unused. By batching up the allocation
the system can reduce the overhead of allocating at each
write, and it can cut down on the number of disk writes
needed to keep the block pointers on the disk synchronized
with the block allocation [Powell79].
5. File system functional enhancements
The speed enhancements to the UNIX file system did not
require any changes to the semantics or data structures
viewed by the users. However, several changes have been gen-
erally desired for some time but have not been introduced
because they would require users to dump and restore all
their file systems. Since the new file system already
requires that all existing file systems be dumped and
restored, these functional enhancements have been introduced
at this time.
5.1. Long file names
File names can now be of nearly arbitrary length. The
only user programs affected by this change are those that
access directories. To maintain portability among UNIX sys-
tems that are not running the new file system, a set of
directory access routines has been introduced that provides
a uniform interface to directories on both old and new sys-
tems.
Directories are allocated in units of 512 bytes. This
size is chosen so that each allocation can be transferred to
disk in a single atomic operation. Each allocation unit
contains variable-length directory entries. Each entry is
wholly contained in a single allocation unit. The first
three fields of a directory entry are fixed and contain an
inode number, the length of the entry, and the length of the
name contained in the entry. Following this fixed size
information is the null terminated name, padded to a 4 byte
boundary. The maximum length of a name in a directory is
currently 255 characters.
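The format just described corresponds to a declaration along
these lines (modeled on, though not copied from, the 4.2BSD
directory structure):

    #define MAXNAMLEN 255

    struct direct {
        unsigned long  d_ino;    /* inode number; 0 marks the
                                    first entry unallocated */
        unsigned short d_reclen; /* length of this record */
        unsigned short d_namlen; /* length of the name */
        char d_name[MAXNAMLEN + 1]; /* null terminated, padded
                                       to a 4 byte boundary */
    };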
Free space in a directory is held by entries that have
a record length that exceeds the space required by the
directory entry itself. All the bytes in a directory unit
are claimed by the directory entries. This normally results
in the last entry in a directory being large. When entries
are deleted from a directory, the space is returned to the
previous entry in the same directory unit by increasing its
length. If the first entry of a directory unit is free,
then its inode number is set to zero to show that it is
unallocated.
5.2. File locking
The old file system had no provision for locking files.
Processes that needed to synchronize the updates of a file
had to create a separate ``lock'' file to synchronize their
updates. A process would try to create a ``lock'' file. If
the creation succeeded, then it could proceed with its
update; if the creation failed, then it would wait, and try
again. This mechanism had three drawbacks. Processes
consumed CPU time by looping over attempts to create locks.
Locks were left lying around following system crashes and
had to be cleaned up by hand. Finally, processes running as
system administrator are always permitted to create files,
so they had to use a different mechanism. While it is pos-
sible to get around all these problems, the solutions are
not straightforward, so a mechanism for locking files has
been added.
The most general schemes allow multiple processes to
concurrently update a file. Several of these techniques are
discussed in [Peterson83]. A simpler technique is to
serialize access with locks. To attain reasonable effi-
ciency, certain applications require the ability to lock
pieces of a file. Locking down to the byte level has been
implemented in the Onyx file system by [Bass81]. However,
for the applications that currently run on the system, a
mechanism that locks at the granularity of a file is suffi-
cient.
Locking schemes fall into two classes, those using hard
locks and those using advisory locks. The primary differ-
ence between advisory locks and hard locks is the decision
of when to override them. A hard lock is always enforced
whenever a program tries to access a file; an advisory lock
is only applied when it is requested by a program. Thus
advisory locks are only effective when all programs access-
ing a file use the locking scheme. With hard locks there
must be some override policy implemented in the kernel; with
advisory locks the policy is implemented by the user pro-
grams. In the UNIX system, programs with system administra-
tor privilege can override any protection scheme. Because
many of the programs that need to use locks run as system
administrators, we chose to implement advisory locks rather
than create a protection scheme that was contrary to the
UNIX philosophy or could not be used by system administra-
tion programs.
The file locking facilities allow cooperating programs
to apply advisory shared or exclusive locks on files. Only
one process may hold an exclusive lock on a file, while
multiple shared locks may be present. Shared and exclusive
locks cannot be present on a file at the same time. If any
lock is requested when another process holds an exclusive
lock, or an exclusive lock is requested when another process
holds any lock, the open will block until the lock can be
gained. Because shared and exclusive locks are advisory
only, even if a process has obtained a lock on a file,
another process can override the lock by opening the same
file without a lock.
Locks can be applied or removed on open files, so that
locks can be manipulated without needing to close and reopen
the file. This is useful, for example, when a process
wishes to open a file with a shared lock to read some infor-
mation, to determine whether an update is required. It can
then get an exclusive lock so that it can do a read, modify,
and write to update the file in a consistent manner.
A request for a lock will cause the process to block if
the lock can not be immediately obtained. In certain
instances this is unsatisfactory. For example, a process
that wants only to check if a lock is present would require
a separate mechanism to find out this information. Conse-
quently, a process may specify that its locking request
should return with an error if a lock can not be immediately
obtained. Being able to poll for a lock is useful to ``dae-
mon'' processes that wish to service a spooling area. If
the first instance of the daemon locks the directory where
spooling takes place, later daemon processes can easily
check to see if an active daemon exists. Since the lock is
removed when the process exits or the system crashes, there
is no problem with stale locks that must be cleared by
hand.
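Using the 4.2BSD flock() call, the daemon's check reduces to
a few lines; a sketch, with error handling omitted:

    #include <sys/file.h>

    /* Try to become the active daemon by taking a non-blocking
       exclusive lock on the (already open) spool directory. */
    int
    claimspool(int spoolfd)
    {
        if (flock(spoolfd, LOCK_EX | LOCK_NB) < 0)
            return 0;   /* an active daemon already exists */
        return 1;       /* lock is held until exit or crash */
    }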
Almost no deadlock detection is attempted. The only
deadlock detection made by the system is that the file
descriptor to which a lock is applied does not currently
have a lock of the same type (i.e., the second of two succes-
sive calls to apply a lock of the same type will fail).
Thus a process can deadlock itself by requesting locks on
two separate file descriptors for the same object.
5.3. Symbolic links
The 512 byte UNIX file system allows multiple directory
entries in the same file system to reference a single file.
The link concept is fundamental; files do not live in direc-
tories, but exist separately and are referenced by links.
When all the links are removed, the file is deallocated.
This style of links does not allow references across physi-
cal file systems, nor does it support inter-machine linkage.
To avoid these limitations symbolic links have been added
similar to the scheme used by Multics [Feiertag71].
A symbolic link is implemented as a file that contains
a pathname. When the system encounters a symbolic link
while interpreting a component of a pathname, the contents
of the symbolic link are prepended to the rest of the path-
name, and this name is interpreted to yield the resulting
pathname. If the symbolic link contains an absolute path-
name, the absolute pathname is used; otherwise the contents
of the symbolic link are evaluated relative to the location
of the link in the file hierarchy.
Normally programs do not want to be aware that there is
a symbolic link in a pathname that they are using. However
certain system utilities must be able to detect and
manipulate symbolic links. Three new system calls provide
the ability to detect, read, and write symbolic links, and
seven system utilities were modified to use these calls.
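For example, a utility can detect and read a link with two
of the new calls (lstat and readlink); a minimal sketch:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    /* Print what path points to if it is a symbolic link. */
    int
    showlink(char *path)
    {
        struct stat sb;
        char buf[1024];
        int n;

        if (lstat(path, &sb) < 0)
            return -1;
        if ((sb.st_mode & S_IFMT) != S_IFLNK)
            return 0;   /* not a symbolic link */
        if ((n = readlink(path, buf, sizeof(buf) - 1)) < 0)
            return -1;
        buf[n] = '\0';
        printf("%s -> %s\n", path, buf);
        return 1;
    }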
In future Berkeley software distributions it will be
possible to mount file systems from other machines within a
local file system. When this occurs, it will be possible to
create symbolic links that span machines.
5.4. Rename
Programs that create new versions of data files typi-
cally create the new version as a temporary file and then
rename the temporary file with the original name of the data
file. In the old UNIX file system the renaming required
three calls to the system. If the program were interrupted
or the system crashed between these calls, the data file
could be left with only its temporary name. To eliminate
this possibility a single system call has been added that
performs the rename in an atomic fashion to guarantee the
existence of the original name.
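A sketch of the difference, with illustrative file names:

    #include <stdio.h>

    /* Old way: three calls, with a window in which only the
       temporary name exists:
           unlink("data");
           link("data.tmp", "data");
           unlink("data.tmp");
       New way: one atomic call. */
    int
    install(void)
    {
        if (rename("data.tmp", "data") < 0) {
            perror("rename");
            return -1;
        }
        return 0;
    }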
In addition, the rename facility allows directories to
be moved around in the directory tree hierarchy. The rename
system call performs special validation checks to ensure
that the directory tree structure is not corrupted by the
creation of loops or inaccessible directories. Such corrup-
tion would occur if a parent directory were moved into one
of its descendants. The validation check requires tracing
the ancestry of the target directory to ensure that it does
not include the directory being moved.
5.5. Quotas
The UNIX system has traditionally attempted to share
all available resources to the greatest extent possible.
Thus any single user can allocate all the available space in
the file system. In certain environments this is unaccept-
able. Consequently, a quota mechanism has been added for
restricting the amount of file system resources that a user
can obtain. The quota mechanism sets limits on both the
number of files and the number of disk blocks that a user
may allocate. A separate quota can be set for each user on
each file system. Each resource is given both a hard and a
soft limit. When a program exceeds a soft limit, a warning
is printed on the user's terminal; the offending program is
not terminated unless it exceeds its hard limit. The idea
is that users should stay below their soft limit between
login sessions, but they may use more space while they are
actively working. To encourage this behavior, users are
warned when logging in if they are over any of their soft
limits. If they fail to correct the problem for too many
login sessions, they are eventually reprimanded by having
their soft limit enforced as their hard limit.
6. Software engineering
The preliminary design was done by Bill Joy in late
1980; he presented the design at the USENIX Conference held
in San Francisco in January 1981. The implementation of his
design was done by Kirk McKusick in the summer of 1981.
Most of the new system calls were implemented by Sam
Leffler. The code for enforcing quotas was implemented by
Robert Elz at the University of Melbourne.
To understand how the project was done it is necessary
to understand the interfaces that the UNIX system provides
to the hardware mass storage systems. At the lowest level
is a raw disk. This interface provides access to the disk
as a linear array of sectors. Normally this interface is
only used by programs that need to do disk to disk copies or
that wish to dump file systems. However, user programs with
proper access rights can also access this interface. A disk
is usually formatted with a file system that is interpreted
by the UNIX system to provide a directory hierarchy and
files. The UNIX system interprets and multiplexes requests
from user programs to create, read, write, and delete files
by allocating and freeing inodes and data blocks. The
interpretation of the data on the disk could be done by the
user programs themselves. The reason that it is done by the
UNIX system is to synchronize the user requests, so that two
processes do not attempt to allocate or modify the same
resource simultaneously. It also allows access to be res-
tricted at the file level rather than at the disk level and
allows the common file system routines to be shared between
processes.
The implementation of the new file system amounted to
using a different scheme for formatting and interpreting the
disk. Since the synchronization and disk access routines
themselves were not being changed, the changes to the file
system could be developed by moving the file system
interpretation routines out of the kernel and into a user
program. Thus, the first step was to extract the file sys-
tem code for the old file system from the UNIX kernel and
change its requests to the disk driver to accesses to a raw
disk. This produced a library of routines that mapped what
would normally be system calls into read or write operations
on the raw disk. This library was then debugged by linking
it into the system utilities that copy, remove, archive, and
restore files.
A new cross file system utility was written that copied
files from the simulated file system to the one implemented
by the kernel. This was accomplished by calling the simula-
tion library to do a read, and then writing the resultant
data by using the conventional write system call. A similar
utility copied data from the kernel to the simulated file
system by doing a conventional read system call and then
writing the resultant data using the simulated file system
library.
The second step was to rewrite the file system simula-
tion library to interpret the new file system. By linking
the new simulation library into the cross file system copy-
ing utility, it was possible to easily copy files from the
old file system into the new one and from the new one to the
old one. Having the file system interpretation implemented
in user code had several major benefits. These included
being able to use the standard system tools such as the
debuggers to set breakpoints and single step through the
code. When bugs were discovered, the offending problem
could be fixed and tested without the need to reboot the
machine. There was never a period where it was necessary to
maintain two concurrent file systems in the kernel. Finally
it was not necessary to dedicate a machine entirely to file
system development, except for a brief period while the new
file system was bootstrapped.
The final step was to merge the new file system back
into the UNIX kernel. This was done in less than two weeks,
since the only bugs remaining were those that involved
interfacing to the synchronization routines that could not
be tested in the simulated system. Again the simulation
system proved useful since it enabled files to be easily
copied between old and new file systems regardless of which
file system was running in the kernel. This greatly reduced
the number of times that the system had to be rebooted.
The total design and debug time took about one man-
year. Most of the work was done on the file system utili-
ties, and changing all the user programs to use the new
facilities. The code changes in the kernel were minor,
involving the addition of only about 800 lines of code
(including comments).
Acknowledgements
We thank Robert Elz for his ongoing interest in the new
file system, and for adding disk quotas in a rational and
efficient manner. We also acknowledge Dennis Ritchie for
his suggestions on the appropriate modifications to the user
interface. We appreciate Michael Powell's explanations on
how the DEMOS file system worked; many of his ideas were
used in this implementation. Special commendation goes to
Peter Kessler and Robert Henry for acting like real users
during the early debugging stage when files were less stable
than they should have been. Finally we thank our sponsors,
the National Science Foundation under grant MCS80-05144, and
the Defense Advanced Research Projects Agency (DoD) under
Arpa Order No. 4031 monitored by the Naval Electronics
Systems Command under Contract No. N00039-82-C-0235.
References
[Accetta80] Accetta, M., Robertson, G., Satyanaray-
anan, M., and Thompson, M. "The Design
of a Network-Based Central File System",
Carnegie-Mellon University, Dept of Com-
puter Science Tech Report, #CMU-CS-80-134.
[Almes78] Almes, G., and Robertson, G. "An Exten-
sible File System for Hydra" Proceedings
of the Third International Conference on
Software Engineering, IEEE, May 1978.
[Bass81] Bass, J. "Implementation Description
for File Locking", Onyx Systems Inc, 73
E. Trimble Rd, San Jose, CA 95131 Jan
1981.
[Dion80] Dion, J. "The Cambridge File Server",
Operating Systems Review, 14, 4. Oct
1980. pp 26-35
[Eswaran74] Eswaran, K. "Placement of records in a
file and file allocation in a computer
network", Proceedings IFIPS, 1974. pp
304-307
[Holler73] Holler, J. "Files in Computer Net-
works", First European Workshop on Com-
puter Networks, April 1973. pp 381-396
[Feiertag71] Feiertag, R. J. and Organick, E. I.,
"The Multics Input-Output System",
Proceedings of the Third Symposium on
Operating Systems Principles, ACM, Oct
1971. pp 35-41
[Kridle83] Kridle, R., and McKusick, M., "Perfor-
mance Effects of Disk Subsystem Choices
for VAX Systems Running 4.2BSD UNIX",
Computer Systems Research Group, Dept of
EECS, Berkeley, CA 94720, Technical
Report #8.
[Kowalski78] Kowalski, T. "FSCK - The UNIX System
Check Program", Bell Laboratories, Murray
Hill, NJ 07974. March 1978
[Luniewski77] Luniewski, A. "File Allocation in a
Distributed System", MIT Laboratory for
Computer Science, Dec 1977.
[Maruyama76] Maruyama, K., and Smith, S. "Optimal
reorganization of Distributed Space Disk
Files", Communications of the ACM, 19,
11. Nov 1976. pp 634-642
[Nevalainen77] Nevalainen, O., Vesterinen, M. "Deter-
mining Blocking Factors for Sequential
Files by Heuristic Methods", The Com-
puter Journal, 20, 3. Aug 1977. pp 245-
247
[Peterson83] Peterson, G. "Concurrent Reading While
Writing", ACM Transactions on Program-
ming Languages and Systems, ACM, 5, 1.
Jan 1983. pp 46-55
[Powell79] Powell, M. "The DEMOS File System",
Proceedings of the Sixth Symposium on
Operating Systems Principles, ACM, Nov
1977. pp 33-42
[Porcar82] Porcar, J. "File Migration in Distri-
buted Computer Systems", Ph.D. Thesis,
Lawrence Berkeley Laboratory Tech Report
#LBL-14763.
[Ritchie74] Ritchie, D. M. and Thompson, K., "The
UNIX Time-Sharing System", CACM 17, 7.
July 1974. pp 365-375
[Smith81a] Smith, A. "Input/Output Optimization
and Disk Architectures: A Survey", Per-
formance Evaluation, 1. Jan 1981. pp
104-117
[Smith81b] Smith, A. "Bibliography on File and I/O
System Optimization and Related Topics",
Operating Systems Review, 15, 4. Oct
1981. pp 39-54
[Sturgis80] Sturgis, H., Mitchell, J., and Israel,
J. "Issues in the Design and Use of a
Distributed File System", Operating Sys-
tems Review, 14, 3. pp 55-79
[Symbolics81a] "Symbolics File System", Symbolics Inc,
9600 DeSoto Ave, Chatsworth, CA 91311
Aug 1981.
[Symbolics81b] "Chaosnet FILE Protocol". Symbolics
Inc, 9600 DeSoto Ave, Chatsworth, CA
91311 Sept 1981.
[Thompson79] Thompson, K. "UNIX Implementation",
Section 31, Volume 2B, UNIX Programmers
Manual, Bell Laboratories, Murray Hill, NJ
07974. Jan 1979
[Thompson80] Thompson, M. "Spice File System",
Carnegie-Mellon University, Dept of Com-
puter Science Tech Report, #CMU-CS-80-
???
[Trivedi80] Trivedi, K. "Optimal Selection of CPU
Speed, Device Capabilities, and File
Assignments", Journal of the ACM, 27, 3.
July 1980. pp 457-473
[White80] White, R. M. "Disk Storage Technology",
Scientific American, 243(2), August
1980.