PVFS V1.2 User's Guide (HTML version)

Contents

0. Introduction
1. Theory of operation
2. Preliminaries
3. Installation
4. Running and Compiling Programs
5. Writing PVFS programs
6. PVFS Utilities
Appendices

0. Introduction

The Parallel Virtual File System (PVFS) is a collection of daemons and library code that allows applications to access files stored in parallel across multiple disks/machines. Currently the code is meant to be compiled and run on Linux workstations, although it may compile for other architectures. This document serves as a starting point for setting up PVFS on an existing system and writing programs that access the file system.

1. Theory of operation

With respect to PVFS, the nodes in a parallel system can be thought of as being of three different types: compute nodes, I/O nodes, and management nodes. A single node can be one or more of these types. Typically a single node will serve as a management node, while a subset of the nodes will be compute nodes and another subset serve as I/O nodes, although it is reasonable (and sometimes preferable) to use all nodes as both I/O and compute nodes.

PVFS exists as a set of daemons and a library of calls to access the file system. There are two types of daemons, management daemons and I/O daemons. Typically there is a single management daemon (which runs on the management node) and a number of I/O daemons (which must run on the I/O nodes when I/O is taking place). The library of calls is used by applications running on compute nodes in order to communicate with both the management daemon and the I/O daemons.

Management daemons have two responsibilities: validating permission to access files and maintaining metadata on PVFS files. Both of these tasks revolve around the access of metadata files. Only one management daemon is needed to perform these operations for a file system, and a single management daemon can manage multiple file systems. As of version 1.4.2, NFS mounting of the metadata directory is no longer mandatory. However, all metadata files are still stored in a directory hierarchy on the node on which the manager runs.

I/O daemons serve a single purpose: to access PVFS file data and coordinate the transfer of that data between themselves and applications.

2. Preliminaries

PVFS makes a few assumptions about the system on which it will run:

PVFS allows multiple directories to be managed as separate "file systems." Each file system must have one node that is designated the "manager node" and one or more nodes designated as "I/O nodes." Both manager nodes and I/O nodes can also run compute tasks if you wish, and nodes can be used both as manager nodes and I/O nodes. The I/O nodes must have a directory on a local disk to store file data - the smallest such disk will usually limit the maximum file size. I/O nodes can use any existing mounted file system, but it makes little sense to store data on an NFS mounted disk.

The manager creates, removes, and modifies the metadata files stored in the metadata directory. These files have the same names that applications use when creating them. They are, however, rather small and hold only information on file permissions and data locations, not the file data itself.

The I/O daemons store the file data in a directory on their local node in files named "fXXXXXX.Y", where the X's represent the inode number of the file (actually the inode number of the metadata file) and the Y is a file index that is currently unused. These files are typically stored with permissions set so that they can only be read by the I/O daemons. As of version 1.4.3 of the file system, the I/O daemons utilize an inode hashing system to cut down the total number of files in the root directory of their data directory (to speed opens somewhat). These directories are created and managed automatically by the I/O daemons and will contain all of the data files.

The PVFS library calls allow applications to interact with the manager and the I/O daemons in order to perform file I/O. The library uses the metadata files to determine how to contact the appropriate manager when accessing a file. Once the manager has given permission to access a file, connections are established with the I/O daemons, which perform file I/O on behalf of the application. All of this is taken care of by the library; applications instead use one of the PVFS interfaces, which hide these details. The available interfaces are discussed both in the man pages and below. Before we discuss these interfaces, however, we will first describe the installation process.

3. Installation

Installation involves four steps: installing the binaries, creating the metadata and data directories, configuring the system, and starting the daemons. Each of these steps will be discussed in turn. Read through this section before beginning; a few steps can be completed at the same time while logged into a remote node.

3.1. Installing the binaries

The PVFS binaries must be installed on all the nodes which will use PVFS; this includes I/O nodes, management nodes, and compute nodes. The easiest method for installing PVFS (on RedHat machines, anyway) is to use the RPM. Simply installing the RPM will place the executables, libraries, examples, and documentation in the appropriate places on your system. By default the PVFS binaries are installed in /usr/pvfs/bin/.

If for some reason you do not wish to use RPM, there should be a tar archive available as well. Check the PVFS web pages for more information on obtaining either distribution.

3.2. Creating metadata and data directories

A directory should be created to serve as the root directory for each PVFS file system. The manager will need to be able to write to this directory, and others should have read and execute privileges. In previous versions of PVFS it was necessary to NFS-mount the metadata directory on all nodes on which clients might run. This is still acceptable; however, it is no longer necessary. Instead a new configuration file, /etc/pvfstab, is used by the clients to determine the location of PVFS file systems; the clients then contact the manager for all metadata operations. The format of this file is covered below. An empty directory should still be created on all client nodes in order to serve as the "mount point" for the PVFS file system.

Log into the management node, create the directory, and change the permissions on it to "0755". Make sure the owner is root.

The I/O daemons will need a directory in which to store their file data locally on their nodes. By default they will look for the directory "/pvfs_data". These daemons by default run as user "nobody", so you will want to change the owner of this directory appropriately.

Log into each of the I/O nodes, create the /pvfs_data directory, change the owner to "nobody", and change the permissions to "0700". Alternatively create a directory elsewhere and create a link to it from /pvfs_data. As a final option, you can change the configuration of the I/O daemon in the /etc/iod.conf file. See Appendix 2 or the man page on iod.conf for more information.

3.3. Configuring the system

Configuring the system requires a couple of steps. First you need to create the .pvfsdir and .iodtab files for the file system. This is done with the "mkiodtab" script, which is included in the PVFS package and typically installed in "/usr/pvfs/bin".

Decide which nodes in your cluster will serve as manager nodes, I/O nodes, and compute nodes. I would suggest starting with a single manager node and a single file system using a number of I/O nodes. Remember that the root directory needs to be writable by root on the manager node.

Here's an example of running mkiodtab; make sure to run it as root from the manager node:

[root@crack /root]# /usr/pvfs/bin/mkiodtab
This is the iodtab setup for the Parallel Virtual File System.
It will also make the .pvfsdir file in the root directory.

Enter the root directory:
/pvfs
Enter the user id of directory:
root
Enter the group id of directory:
root
Enter the mode of the root directory:
777
Enter the hostname that will run the manager:
localhost
Searching for host...success
Enter the port number on the host for manager:
(Port number 3000 is the default)
3000
Enter the I/O nodes: (can use form node1, node2, ... or
nodename{#-#,#,#})
localhost
Searching for hosts...success
I/O nodes: localhost
Enter the port number for the iods:
(Port number 7000 is the default)
7000
Done!

New to PVFS versions 1.4.2 and later is the use of the /etc/pvfstab file. The format of this file is identical to the traditional /etc/fstab file. In this file the location, port, and metadata directory for the manager is listed as well as a "mount point". This file replaces the need for NFS-mounted metadata directories. The format of this file is covered in Appendix 3 and in the pvfstab man page.

3.4. Starting the daemons

Once the directories are created and the configuration files are in place, the daemons should be enabled and started on the appropriate nodes. The RPM, when installed, includes startup scripts for both the manager and iod, but does not set up links to start either by default (in the /etc/rc.d/rc[0-6].d/ directories). The scripts "enableiod" and "enablemgr" are included to take care of creating these links. They are installed in /usr/pvfs/bin by default. Simply run "enableiod" on I/O nodes and "enablemgr" on your management node, and the next time the machine is booted, the daemons will be started automatically.

Log into the manager node and run enablemgr. Start the manager with "/etc/rc.d/init.d/mgr start" (if you installed the RPMs) or "/usr/pvfs/bin/mgr". Log into all the I/O nodes and run enableiod. Start the iod on each node with "/etc/rc.d/init.d/iod start" or "/usr/pvfs/bin/iod". Both of these by default create output files in /tmp; for the manager this is called mgrlog.XXXXXX (where the X's are filled in randomly) and for the iod, iolog.XXXXXX. These are a good source of information if problems occur.

That's it. All the daemons are now running and should be ready to go!

4. Running and Compiling Programs

The easiest method of interacting with PVFS is simply to use the LD_PRELOAD environment variable in your shell. For the bash shell:

[root@crack /root]# LD_PRELOAD=/usr/lib/libpvfs.so
[root@crack /root]# export LD_PRELOAD

should enable the PVFS shared library. Dynamically linked executables will then resolve the I/O routines from this library before any others. This allows the PVFS library to catch I/O calls and handle them appropriately for PVFS files. Note that in all cases one should refer to a PVFS file by the name of its metadata file.

Using shared libraries has the advantage that existing programs compiled and linked with your existing shared libraries can also use PVFS. Standard programs such as ls, rm, mv, chmod, etc. are of course quite useful with PVFS.

To stop this, simply unset the environment variable. Note that statically linked programs will not use the PVFS library. If you are concerned that a specific binary may be statically linked, use the "file" command to check whether it is dynamically or statically linked. If it is statically linked, it will not be able to correctly access PVFS files, and you should consider finding or creating a replacement binary that can.

Alternatively one can directly link to the PVFS libraries. This is necessary if the program uses any of the additional interface features supported by PVFS, such as the MDBI interface. When compiling programs to use PVFS, one should include the PVFS header file, typically installed in /usr/include/pvfs/, in the source:

#include <pvfs/pvfs.h>

To link to the PVFS library, typically installed in /usr/lib/, one should add "-lpvfs" to the link line.

Note that libpvfs.a and libpvfs.so provide their own copies of the stdio routines from the C library (such as fopen, fread, fwrite, fprintf, etc.) and a set of system call wrappers (such as open, read, write, lseek, etc.). The stdio routines are not modified from the standard GNU distribution, but they must be included so that they resolve to the PVFS system call wrappers rather than to your existing C library.
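As an illustration, here is a minimal sketch of a directly linked program. The file name /pvfs/example_file is just a placeholder for a path on a mounted PVFS file system, and the compile line in the comment assumes the default installation locations described above.

/* example.c - a minimal sketch of a program linked directly against
 * the PVFS library.  Build with something like:
 *     cc -o example example.c -lpvfs
 * /pvfs/example_file is a placeholder for a file on a PVFS mount. */
#include <pvfs/pvfs.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

int main(void)
{
    char msg[] = "hello from PVFS\n";
    int fd;

    /* the open/write/close wrappers in libpvfs recognize PVFS paths
     * and fall through to the normal system calls for other files */
    fd = open("/pvfs/example_file", O_WRONLY|O_CREAT|O_TRUNC, 0644);
    if (fd < 0)
        return 1;
    write(fd, msg, strlen(msg));
    close(fd);
    return 0;
}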

Also note that there are limitations to this user-level support. First, program files stored on PVFS cannot be executed. Second, I/O redirection cannot be used because the exec() process destroys the file information stored by PVFS. However, all of these things will still work as before on non-PVFS files even with the library preloaded.

Finally, it is useful to know that the PVFS interface calls, including the MDBI interface, will also operate correctly on standard, non-PVFS, files. This can be helpful when debugging code in that it can help isolate application problems from bugs in the still developing PVFS system.

5. Writing PVFS programs

Programs written to use normal UNIX I/O will work fine with PVFS. Files created this way will be striped according to the file system defaults set at compile time, usually a 16K stripe size across all of the I/O nodes, starting with node 0. Note that use of the UNIX system calls read() and write() results in exactly the data specified being exchanged with the I/O nodes each time the call is made. Large numbers of small accesses performed with these calls will not perform well at all. On the other hand, the buffered routines of the standard I/O library, fread() and fwrite(), locally buffer small accesses and perform exchanges with the I/O nodes in chunks of at least some minimum size. This buffer size can be set manually by a program with the setvbuf() call in the usual way, and generally PVFS will perform better with larger buffer sizes.
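As a rough sketch of the buffered approach, the following program writes many small records through stdio after enlarging the buffer with setvbuf(); the file name, record size, and 256 KB buffer size are arbitrary choices for illustration, not PVFS requirements.

/* A sketch of many small writes through the buffered stdio calls.
 * The file name, record size, and buffer size are arbitrary. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char record[1000];              /* one small record              */
    char *iobuf;
    FILE *fp;
    int i;

    memset(record, 0, sizeof(record));

    fp = fopen("/pvfs/outfile", "w");
    if (fp == NULL)
        return 1;

    /* enlarge the stdio buffer so the many small fwrite() calls are
     * exchanged with the I/O daemons in large chunks */
    iobuf = malloc(256*1024);
    setvbuf(fp, iobuf, _IOFBF, 256*1024);

    for (i = 0; i < 10000; i++)
        fwrite(record, sizeof(record), 1, fp);

    fclose(fp);                     /* flushes the remaining buffer  */
    free(iobuf);
    return 0;
}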

In order to fully take advantage of PVFS in a parallel program, one of the PVFS partitioning mechanisms should be used. Currently there are two ways to do this; a program can set logical partitioning parameters directly, or a program can use the multi-dimensional blocking facility. In either case, PVFS can transfer data that is non-contiguous in the file to compute nodes without any of the undesired data, regardless of physical disk layout and physical striping of the data. Essentially, partitioning causes PVFS to automatically seek over parts of the file the program does not need. This is superior to manually seeking and reading because the library can combine a large number of seeks and reads into a single file system request. This also eases parallel programming, because once a partition is set, the program can treat the file as a normal file, but it will only see that part of the file it is designated to operate on. Thus file partitioning promotes the use of data parallel programming techniques by automatically dividing the file data among a number of cooperating processes.

In addition to these interfaces, it is important to know how to control the physical distribution of files as well. In the next three sections, we will discuss how to specify the physical partitioning, or striping, of a file, how to set logical partitions on file data, and how the PVFS multi-dimensional block interface can be used.

5.1. Specifying striping parameters

The current physical distribution mechanism used by PVFS is a simple striping scheme. The distribution of data is described with three parameters: the base node (base), the index of the first I/O node holding data for the file; the number of I/O nodes (pcount) across which the data is striped; and the stripe size (ssize), the size of the contiguous chunks stored on each I/O node.

The Striping Example shows an example where the base node is 0 and the pcount is 4.

[Figure: Striping Example]

Physical distribution is determined when the file is first created. Using pvfs_open(), one can specify these parameters.

int pvfs_open(char *pathname, int flag, mode_t mode);
int pvfs_open(char *pathname, int flag, mode_t mode, struct pvfs_stat *dist);

If the first form is used, a default distribution will be imposed. If instead a structure defining the distribution is passed in and the O_META flag is OR'd into the flag parameter, the physical distribution is defined by the user. This structure, declared in the PVFS header files, is as follows:

struct pvfs_stat {
   int base;   /* The first iod node to be used */
   int pcount; /* The number of iod nodes for the file */
   int ssize;  /* stripe size */
   int soff;   /* stripe offset */
   int bsize;  /* base size */
};

The soff and bsize fields are not in use at this time. Setting the pcount value to -1 will use all available I/O daemons for the file. Setting -1 in the other fields will result in the default values being used. If you wish to obtain information on the physical distribution of a file, use pvfs_ioctl() on an open file descriptor:

pvfs_ioctl(fd, GETMETA, &dist);

where dist is a struct pvfs_stat. The call will fill in the structure with the physical distribution information for the file.
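Putting the pieces of this section together, the following sketch creates a file with an explicit distribution and then reads the distribution back; the file name and the particular base, pcount, and ssize values are arbitrary examples.

/* striping.c - sketch of creating a PVFS file with explicit striping
 * and reading the distribution back.  Values are arbitrary examples. */
#include <pvfs/pvfs.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    struct pvfs_stat dist, check;
    int fd;

    dist.base   = 0;        /* start on the first I/O node   */
    dist.pcount = 4;        /* stripe across four I/O nodes  */
    dist.ssize  = 65536;    /* 64 KB stripe size             */
    dist.soff   = -1;       /* not used; take the defaults   */
    dist.bsize  = -1;

    /* O_META tells pvfs_open() to honor the distribution structure */
    fd = pvfs_open("/pvfs/striped_file", O_WRONLY|O_CREAT|O_META, 0644, &dist);
    if (fd < 0)
        return 1;

    /* read the physical distribution back */
    pvfs_ioctl(fd, GETMETA, &check);
    printf("base %d, pcount %d, ssize %d\n",
           check.base, check.pcount, check.ssize);

    close(fd);
    return 0;
}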

5.2. Setting a logical partition

With the current PVFS partitioning mechanism, partitions are defined with three parameters: offset, group-size, and stride. The offset is the distance in bytes from the beginning of the file to the first byte in the partition. Group-size is the number of contiguous bytes included in the partition. Stride is the distance from the beginning of one group of bytes to the next. Partitioning Parameters shows these parameters.

[Figure: Partitioning Parameters]

To set the file partition, the program uses a pvfs_ioctl() call. The parameters are as follows:

pvfs_ioctl(fd, SETPART, &part);

where part is a structure defined as follows:

struct fpart {
        int offset;
        int gsize;
        int stride;
        int gstride;
        int ngroups;
};

The last two fields, gstride and ngroups, are no longer used and should be set to zero. The pvfs_ioctl() call can also be used to get the current partitioning parameters by substituting the GETPART flag. Note that whenever the partition is set, the file pointer is reset to the beginning of the new partition. Also note that setting the partition is a purely local call; it does not involve contacting any of the PVFS daemons, so it is reasonable to reset the partition as often as needed during the execution of a program.

[Figure: Partitioning Example 1]

As an example, suppose a file contains 40,000 records of 1000 bytes each, you have 4 parallel tasks, and you want to divide the file into 4 partitions of 10,000 records each for processing. In this case you would set the group-size to 10,000 records times 1000 bytes or 10,000,000 bytes. Then each task (0..3) would set its offset so that it would access a disjoint portion of the data. This is shown in Partitioning Example 1.

[Figure: Partitioning Example 2]

Alternatively, suppose you want to allocate the records in a cyclic or "round-robin" manner. In this case the group-size would be set to 1000 bytes, the stride would be set to 4000 bytes and the offsets would again be set to access disjoint regions, as shown in Partitioning Example 2.
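The sketch below shows how one of the four tasks might set up both of these partitions; the task number would normally come from the message-passing library in use, and the file name is a placeholder.

/* partition.c - sketch of the two partitioning examples above for
 * one of four tasks.  taskno and the file name are placeholders. */
#include <pvfs/pvfs.h>
#include <fcntl.h>
#include <unistd.h>

#define RECSIZE  1000       /* 1000-byte records            */
#define NRECS    40000      /* 40,000 records in the file   */
#define NTASKS   4

int main(void)
{
    struct fpart part;
    int fd, taskno = 0;     /* 0..3; normally the task's rank */

    fd = pvfs_open("/pvfs/records", O_RDONLY, 0);
    if (fd < 0)
        return 1;

    /* Example 1: block distribution, 10,000 records per task */
    part.offset  = taskno * (NRECS/NTASKS) * RECSIZE;
    part.gsize   = (NRECS/NTASKS) * RECSIZE;
    part.stride  = NRECS * RECSIZE;   /* next group lies past EOF,
                                         so each task sees one group */
    part.gstride = 0;
    part.ngroups = 0;
    pvfs_ioctl(fd, SETPART, &part);

    /* ... read this task's 10,000,000-byte partition ... */

    /* Example 2: round-robin distribution, one record at a time */
    part.offset = taskno * RECSIZE;
    part.gsize  = RECSIZE;
    part.stride = NTASKS * RECSIZE;
    pvfs_ioctl(fd, SETPART, &part);   /* file pointer resets to the
                                         start of the new partition */

    /* ... read this task's records cyclically ... */

    close(fd);
    return 0;
}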

It is important to realize that setting the partition for one task has no effect whatsoever on any other tasks. There is also no reason that the partitions set by each task be distinct; we can overlap the partitions of different tasks if we like.

Simple partitioning is useful for one-dimensional data and simple distributions of two-dimensional data. More complex distributions and multi-dimensional data are often more easily partitioned using the multi-dimensional block interface.

5.3. Using multi-dimensional blocking

An important thing to note is that partitioning in PVFS never causes data in a file to be read "out of order". In other words, one byte never appears before another byte in a partition unless the first byte also appears before the second in the original file. This is somewhat different from some other parallel file systems. Situations where out-of-order access is desired usually arise when a file is being treated as a multi-dimensional data set and the data is being accessed in blocks. PVFS supports this kind of access with its multi-dimensional block interface, which is really just an improved partitioning interface.

[Figure: MDBI Example 1]

The PVFS multi-dimensional block interface (MDBI) is an interface that provides an alternative view of file data. With the MDBI, file data is considered as an N-dimensional array of records. This array is divided into "blocks" of records by specifying the dimensions of the array and the size of the blocks in each dimension. The parameters used to describe the array are the number of dimensions (D), the record size in bytes (rs), and, for each dimension, the number of records in a block along that dimension (ne) and the number of blocks along that dimension (nb).

Once the programmer has defined the view of the data set, blocks of data can be read with single function calls, greatly simplifying the act of accessing these types of data sets.

There are five basic calls used for accessing files with MDBI:

int open_blk(char *path, int flags, int mode);
int set_blk(int fd, int D, int rs, int ne1, int nb1, ..., int nen, int nbn);
int read_blk(int fd, char *buf, int index1, ..., int indexn);
int write_blk(int fd, char *buf, int index1, ..., int indexn);
int close_blk(int fd);

The open_blk() and close_blk() calls operate similarly to the standard UNIX open() and close() calls. set_blk() is the call used to set the blocking parameters for the array before reading or writing. It can be used as often as necessary and does not entail communication. read_blk() and write_blk() are used to read blocks of records once the blocking has been set.

In MDBI Example 1 we can see an example of blocking. Here a file has been described as a two dimensional array of blocks, with blocks consisting of a two by three array of records. Records are shown with dotted lines, with groups of records organized into blocks denoted with solid lines.

In this example, the array would be described with a call to set_blk() as follows:

set_blk(fd, 2, 500, 2, 6, 3, 3);

If we wanted to read block (2, 0) from the array, we could then call:

read_blk(fd, &buf, 2, 0);

Similarly, to write block (5, 2):

write_blk(fd, &blk, 5, 2);
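Pulling these calls together, a sketch of the complete sequence for this example might look like the following; the file name is a placeholder, and the buffer is sized to hold one 2 x 3 block of 500-byte records.

/* mdbi.c - sketch of the MDBI example above: a 2-D array of 500-byte
 * records, 2x3 records per block, 6x3 blocks.  File name is arbitrary. */
#include <pvfs/pvfs.h>
#include <fcntl.h>

int main(void)
{
    char buf[2*3*500];      /* room for one block of records */
    int fd;

    fd = open_blk("/pvfs/dataset", O_RDWR, 0);
    if (fd < 0)
        return 1;

    /* 2 dimensions, 500-byte records, blocks of 2x3 records,
     * 6 blocks in the first dimension and 3 in the second */
    set_blk(fd, 2, 500, 2, 6, 3, 3);

    read_blk(fd, buf, 2, 0);    /* read block (2, 0)  */
    /* ... work on the block ... */
    write_blk(fd, buf, 5, 2);   /* write block (5, 2) */

    close_blk(fd);
    return 0;
}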

A final feature of the MDBI is block buffering. Sometimes multi-dimensional blocking is used to set the size of the data that the program wants to read from and write to disk. Other times the block size has some physical meaning in the program and is set for other reasons. In this case, individual blocks may be rather small, resulting in poor I/O performance and underutilization of memory. The PVFS MDBI provides a buffering mechanism that causes multiple blocks to be read and written from disk and stored in a buffer in the program's memory address space. Subsequent transfers using read_blk() and write_blk() result in memory-to-memory transfers unless a block outside of the current buffer is accessed.

Since it is difficult to predict what blocks will be read when, PVFS relies on user cues to determine what to buffer. This is done by defining "blocking factors" which group blocks together. A single function is used to define the blocking factor:

int buf_blk(int fd, int bf1, ..., int bfn);

The blocking factor indicates how many blocks in the given dimension should be buffered.

Looking at MDBI Example 1 again, we can see how blocking factors can be defined. In the example, the call:

buf_blk(fd, 2, 1);

is used to specify the blocking factor. We denote the larger resulting buffered blocks as "superblocks", one of which is shown in blue in the example.

Whenever a block is accessed, if its superblock is not in the buffer, the current superblock is written back to disk and the new superblock is read - then the desired block is copied into the given buffer. The default blocking factor for all dimensions is 1, and any time the blocking factor is changed the buffer is written back to disk.
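Continuing the MDBI sketch from the previous section, buffering for this example could be enabled by adding the following lines immediately after the set_blk() call; with a blocking factor of 2 x 1, each superblock holds two adjacent blocks along the first dimension.

/* added to the earlier MDBI sketch, right after set_blk() */
buf_blk(fd, 2, 1);          /* superblocks of 2 x 1 blocks           */
read_blk(fd, buf, 2, 0);    /* reads the superblock containing (2,0),
                               then copies that block into buf       */
read_blk(fd, buf, 3, 0);    /* (3,0) lies in the same superblock, so
                               only a memory-to-memory copy is needed */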

It is important to understand that no cache-coherency is performed here; if application tasks are sharing superblocks, unexpected results will occur. It is up to the user to ensure that this does not happen. A good strategy for buffering is to develop your program without buffering turned on, and then enable it later in order to improve performance.

6. PVFS Utilities

In the PVFS bin directory, you'll find a couple of utilities useful in dealing with PVFS files. In addition to the mkiodtab command discussed earlier, the u2p command can be used to convert an existing UNIX file to a PVFS file (and vice versa). The syntax for u2p is:

u2p -s <stripe size> -b <base> -n <# of nodes> <srcfile> <destfile>

This function is most useful in converting pre-existing data files to PVFS so that they can be used in parallel programs. The "cp" utility, when used with the PVFS shared library, can also copy data onto a PVFS file system, but the user then loses control over the physical distribution (the default is used).

The pvstat utility will print out the pvfs_stat information of a file plus the owner's uid and the inode number without opening the iod files. The inode number will correspond to the data files on the iods. The syntax of pvstat is simply:

pvstat <filename>

The call will open the metadata file in a PVFS directory to produce the pertinent stat information, so if you have preloaded the PVFS library it will not work. Obviously we need to update this...

Appendices

Appendix 1: .pvfsdir and .iodtab files

These files are created by the system administrator before the file system is ever used. .pvfsdir files hold information on the location of the manager for the file system and information on the root directory. The .iodtab file holds a list of the I/O daemon locations and port numbers that make up the file system. Both of these files should be created using the "mkiodtab" script, which is described previously.

The .pvfsdir file is in text format and includes the following information in this order, with one entry per line: the inode number of the directory's metadata file, the user ID of the directory, the group ID of the directory, the mode (permissions) of the directory, the port number of the manager, the hostname of the manager, the root metadata directory of the file system, and the path of this directory relative to the root of the file system.

Here's a sample .pvfsdir file:

116314
25
6000
0040775
3000
grendel
/pvfs
/

There will be a .pvfsdir file in each of the subdirectories as well. The manager will automatically create these new files when subdirectories are created.

The .iodtab file is also created by the system administrator. It consists simply of an ordered list of hosts (or IP addresses) and optional port numbers. It is stored in the root directory of the PVFS file system.

An example of an .iodtab file would be:

192.168.0.1:7010
192.168.0.2:7010
192.168.0.3:7010

Another, using the default port (7000) and hostnames:

grendel1
grendel2
grendel3

Appendix 2: /etc/iod.conf files

The iod will look for an optional configuration file named /etc/iod.conf when it is started. This file can specify a number of configuration parameters for the I/O daemon, including changing the data directory, the user and group under which the I/O daemon runs, and the port on which the I/O daemons operate.

Every line consists of two fields, a selector field and a value field. These two fields are separated by one or more spaces or tabs. The selector field specifies a configuration parameter, which is set to the given value.

Lines starting with a hash mark ("#") and empty lines are ignored.

The selector field consists of one of the following keywords: port, user, group, rootdir, datadir, logdir, or debug. Selectors are case insensitive. If the same selector is used again, the later instance overrides the first.

port specifies the port on which the iod should accept requests.

user specifies the user under which the iod should run.

group specifies the group under which the iod should run.

rootdir gives the directory the iod should use as its rootdir. The iod uses chroot(2) to change to this directory before accessing files.

datadir gives the directory the iod should use as its data directory. The iod uses chdir(2) to change to this directory after changing the root directory.

logdir gives the directory in which the iod should write its log file (iolog.XXXXXX, as described above).

debug sets the level of debugging output written to the log.

Here is an example iod.conf file:


# IOD Configuration file, iod.conf
#

port 7001
user nobody
group nobody
rootdir /
datadir /pvfs_data
logdir /tmp
debug 0               

Appendix 3: /etc/pvfstab files

When the client library is used, it will search for a /etc/pvfstab file in order to discover the local directories for PVFS files and the locations of the manager(s) responsible for these files. The format of this file is the same as the fstab file:

grendel:/pvfs_meta  /pvfs  pvfs  port=3000  0  0

Here we have specified that the manager is grendel, that the directory in which the manager stores metadata is /pvfs_meta, that this is "mounted" on /pvfs on the client (local) system, and that the port on which the manager is listening is 3000. The third field should be set to "pvfs" and the last two fields to 0. This file must be readable by all users.