PVFS Development Team
The Trove storage interface will be the lowest level interface used by the PVFS server for storing both file data and metadata. It will be used by individual servers (and servers only) to keep track of locally stored information. There are several goals and ideas that we should keep in mind when discussing this interface:
Our first cut implementation of this interface will have the following restrictions:
TODO: partial completion semantics need much work.
A server controls one storage space.
Within this storage space are some number of collections, which are akin to file systems. Collections serve as a mechanism for supporting multiple traditional file systems on a single server and for separating the use of various physical resources. (Collections can span multiple underlying storage devices, and hints would be used in that case to specify the device on which to place files. This concept might be used in systems that can migrate data from slow storage to faster storage as well).
Two collections will be created for each file system: one collection will support the dataspaces needed for the file system's data and metadata objects, while a second collection will be created for administrative purposes. If the underlying implementation needs to perform disk I/O, for example, it can use bytestream and keyval objects from the administration collection.
A collection id will be used in conjunction with other parameters in order to specify a unique entity on a server to access or modify, just as a file system ID might be used.
This storage interface stores and accesses what we will call dataspaces. These are logical collections of data organized in one of two possible ways. The first organization for a dataspace is the traditional ``byte stream''. This term refers to arbitrary binary data that can be referenced using offsets and sizes. The second organization is ``keyword/value'' data. This term refers to information that is accessed in a simplified database-like manner. The data is indexed by way of a variable length key rather than an offset and size. Both keyword and value are arbitrary byte arrays with a length parameter (i.e. need not be readable strings). We will refer to a dataspace organized as a byte stream as a bytestream dataspace or simply a bytestream space, and a dataspace organized by keyword/value pairs as a keyval dataspace or keyval space. Each dataspace will have an identifier that is unique to its server, which we will simply call a handle. Physically these dataspaces may be stored in any number of ways on underlying storage.
Here are some potential uses of each type:
In our design thus far (see the system interface documents) we have defined four types of system-level objects: data files, metadata files, directories, and symlinks. All four will be implemented using a combination of bytestream and/or keyval dataspaces. At the storage interface level there is no real distinction between the different types of system-level objects.
Vtags are a capability that can be used to implement atomic updates in shared storage systems. In this case they can be used to implement atomic access to a set of shared storage devices through the storage interface. To clarify, these would be of particular use when multiple threads are using the storage interface to access local storage or when multiple servers are accessing shared storage devices such as a MySQL database or SAN storage.
This section can be skipped if you are not interested in consistency semantics. Vtags will probably not be implemented in the first cut anyway.
Vtags are an approach to ensuring consistency for multiple readers and writers that avoids the use of locks and their associated problems within a distributed environment. These problems include complexity, poor performance in the general case, and awkward error recovery.
A vtag fundamentally provides a version number for any region of a byte stream or any individual key/value pair. This allows the implementation of an optimistic approach to consistency. Take the example of a read-modify-write operation. The caller first reads a data region, obtaining a version tag in the process. It then modifies its own copy of the data. When it writes the data back, it gives the vtag back to the storage interface. The storage interface compares the given vtag against the current vtag for the region. If the vtags match, the data has not been modified since it was read by the caller, and the operation succeeds. If the vtags do not match, the operation fails and the caller must retry it.
This is an optimistic approach in that the caller always assumes that the region has not been modified.
Many different locking primitives can be built upon the vtag concept...
Layers above trove can take advantage of vtags as a way to simplify the enforcement of consistency semantics (rather than keeping complicated lists of concurrent operations, simply use the vtag facility to ensure that operations occur atomically). Alternatively they could be used to handle the case of trove resources shared by multiple upper layers. Finally they might be used in conjunction with higher-level consistency control in some complementary fashion (exactly how is not yet determined).
In this section we describe all the functions that make up the storage interface. The storage interface functions can be divided into four categories: dataspace management functions, bytestream access functions, keyval access functions, and completion test functions. The access functions can be further subdivided into contiguous and noncontiguous access capabilities.
First we describe the return values and error values for the interface. Then we describe special vtag values and the implementation of keys. Next we describe the dataspace management functions. Next we describe the contiguous and noncontiguous dataspace access functions. Finally we cover the completion test functions.
Unless otherwise noted, all functions return an integer with three possible values:
Table 1 shows the error values. All values will be returned as integers in the native format (size and byte order).
Needs to be fleshed out. Need to pick a reasonable prefix.
Phil: Once this is fleshed out, can we apply the same sort of scheme to BMI? BMI doesn't have a particularly informative error reporting mechanism.
Rob: Definitely. I would really like to make sure that in addition to getting error values back, the error values actually make sense :). This was (and still is in some cases) a real problem for PVFS1.
Value         | Meaning
------------- | ---------------------------------------
TROVE_ENOENT  | no such dataspace
TROVE_EIO     | I/O error
TROVE_ENOSPC  | no space on storage device
TROVE_EVTAG   | vtag did not match
TROVE_ENOMEM  | unable to allocate memory for operation
TROVE_EINVAL  | invalid input parameter
As mentioned earlier, the use of vtags is not mandatory. Therefore we define two flag values that can be used to control the behavior of the calls with respect to vtags:
TODO: pick a reasonable prefix for our flags.
By default calls ignore vtag values on input and do not create vtag values for output.
TODO: sync with the code on the data_sz element.
struct TROVE_keyval {
    void    *buffer;
    int32_t  buffer_sz;
    int32_t  data_sz;
};
typedef struct TROVE_keyval TROVE_keyval_s;
Keys, values, and hints are all implemented with the same TROVE_keyval structure (do we want a different name?), shown above. Keys and values used in keyval spaces are arbitrary binary data values with an associated length.
Hint keys and values have the additional constraint of being null-terminated, readable strings. This makes them very similar to MPI_Info key/value pairs.
TODO: we should build hints out of a pair of the TROVE_keyvals. We'll call them a TROVE_hint_s in here for now.
Note: need to add valid error values for each function.
TODO: find a better format for function descriptions.
In this context, IDs are unique identifiers assigned to each storage interface operation. They are used as handles to test for completion of operations once they have been submitted. If an operation completes immediately, then the ID field should be ignored.
These IDs are only unique in the context of the storage interface, so upper layers may have to handle management of multiple ID spaces (if working with both a storage interface and a network interface, for instance).
The type for these IDs is TROVE_op_id.
Each function allows the user to pass in a pointer value (void *). This value is returned by the test functions, and it allows for quick reference to user data structures associated with the completed operation.
The motivation: normally there is some data at the caller's level that corresponds to the trove operation. Without some help, the caller would have to manually map the IDs of completed operations back to its own data structures. By passing in this parameter, the caller can directly reference those structures when a trove operation completes.
The type field can be used by the caller to assign an arbitrary integer type to the object. This may, for example, be used to distinguish between directories, symlinks, datafiles, and metadata files. The storage interface does not assign any meaning to the type value. Do we even need this type field?
The hint field may be used to specify what type of underlying storage should be used for this dataspace in the case where multiple potential underlying storage methods are available.
Parameters in the read-at and write-at calls are ordered similarly to those of pread and pwrite.
TODO: the size is [in/out] in the code; figure out the semantics.
Writes a contiguous region to the bytestream. Same arguments as read_bytestream, except that the vtag is an in/out parameter.
TODO: the size is [in/out] in the code; figure out the semantics.
Flags?
An important call for keyval spaces is the iterator function. The iterator function is used to obtain all keyword/value pairs from the keyval space with a sequence of calls from the client. The iterator function returns a logical, opaque ``position'' value that allows a client to continue reading pairs from the keyval space where it last left off.
The amount of data actually placed in the value buffer should be indicated by the data_sz element of the structure.
keyval_iterate will always read count items unless it hits the end of the keyval space (EOK). After hitting EOK, count will be set to the number of pairs actually processed. Thus, callers must compare the value of count after the call with the value it had before the call: if they differ, EOK has been reached. If there are N items left in the keyval space and keyval_iterate requests N items, there will be no indication that EOK has been reached; only after making another call will the caller know that it is at EOK. The value of position is not meaningful after EOK has been reached.
How do vtags work with noncontiguous calls?
The byte stream functions will implement simple listio style noncontiguous access. Any more advanced data types should be unrolled into flat regions before reaching this interface. The process for unrolling is outside the scope of this document, but examples are available in the ROMIO code.
TODO: SEMANTICS!!!!!
TODO: how do we report partial success for listio calls?
Do we need coll_ids here?
TODO: fix up semantics for testsome; look at MPI functions for ideas.
TODO: a wait function, for testing purposes if nothing else?
Note: need to discuss completion queue, internal or external?
Phil: See pvfs2-internal email at
http://beowulf-underground.org/pipermail/pvfs2-internal/2001-October/000010.html
for my thoughts on this topic.
Batch operations are used to perform a sequence of operations possibly as an atomic whole. These will be handled at a higher level.
This section lists some potential optimizations that might be applied at this layer or that are related to this layer.
In many file systems ``inode stuffing'' is used to store the data for small files in the space used to store pointers to indirect blocks. The analogous approach for PVFS2 would be to store the data for small files in the bytestream space associated with the metafile.
This document was generated from storage-interface.tex using the LaTeX2HTML translator (Version 2002-2-1 (1.71)) on 2004-06-08.