PVFS Development Team
The purpose of this document is to sketch out the Database + Files (DBPF) implementation of the Trove storage interface, including some discussion of how the implementation might be used in practice.
DBPF uses UNIX files and Berkeley DB (version 3 or 4) databases to store file and directory data and metadata.
The locations of these entities are defined in dbpf.h, although in future versions they should be relative to the storage name passed in at initialize time.
For a given server there are single instances of each of the following:
For each collection there is one of each of the following:
In addition, each dataspace has a one-to-one mapping to the following, although they are created lazily (so they will not exist if not yet used):
At this time both bstream files and keyval DBs are stored in a flat directory structure. This may change to be a hashed directory system as in the PVFS1 iod if/when we see that we are incurring substantial overhead from lookups in a directory with many entries.
The DBPF implementation hooks into the upper Trove ``wrapper'' layer through a set of structures containing pointers to the functions that make up the Trove API. These structures are defined in the upper-layer trove.h and are TROVE_bstream_ops, TROVE_keyval_ops, TROVE_dspace_ops, TROVE_mgmt_ops, and TROVE_fs_ops.
The TROVE_mgmt_ops structure includes a pointer to the function initialize(). This is used to initialize the underlying Trove implementation (e.g. DBPF). The initialize function takes the storage name as a parameter; this should be used to locate the various entities that are used for storing the DBPF collections.
Additionally a method ID is passed in; why this is necessary is an open question.
The method name field is filled in by the DBPF implementation to describe what type of underlying storage is available.
The DBPF implementation has its own queue that it uses to hold operations in progress. Operations may have one of a number of states (defined in dbpf.h):
Dataspaces are stored as:
Keyval and bstream files are named by the handle (because they don't really have names...).
Currently all keyval space operations use blocking DB calls. They queue the operation when the I/O call is made, and they service the operation when the service function is called (at test time).
On each call the database is opened and closed, rather than caching the open database handle. This will obviously need to change in the near future; we will probably do something similar to the bstream fdcache.
The dbpf_keyval_iterate_op_svc function walks through all the keyval pairs. The current implementation uses the btree format with record numbers to make it easy to pick up where the last call to the iterator function left off. The iterator function will process up to count items per call; if fewer than count items remain in the database, count is set to the number of items actually processed. The caller should compare the returned count with the value passed in: a smaller value indicates that the end of the keyval space has been reached.
The read_at and write_at functions are implemented using blocking UNIX file I/O calls. They queue the operation when the I/O call is made, and they service the operation when the service function is called (at test time).
The read_list and write_list functions are implemented using lio_listio. Both of these functions actually call a function dbpf_bstream_rw_list, which implements the necessary functionality for both reads and writes. This function sets up the queue entry and immediately calls the service function.
The necessary aiocb array is allocated within the service function if one isn't already allocated for the operation. The maximum number of aiocb entries is controlled by a compile-time constant AIOCB_ARRAY_SZ; if there are more list elements than this, the list will be split up and handled by multiple lio_listio calls.
If the service function returns to dbpf_bstream_rw_list without completing the entire operation, the operation is queued.
Both the _at and _list functions use a FD cache for getting open FDs.
Theoretically Trove in general doesn't know anything about ``file systems'' per se. However, it's helpful for us to provide functions intended to ease the creation of FSs in Trove. These functions are subject to change or removal as we get a better understanding of how we're going to use Trove.
There's more than one way to build a file system given the infrastructure provided by Trove; we're just going through a single example here.
At this time, a single file system is associated with one and only one collection, and a single collection is associated with one and only one file system.
Once a collection is created, it can be looked up by name with collection_lookup. This returns a collection ID that can be used to direct subsequent operations to the collection. Currently the implementation only supports creation of one collection.
The collection lookup routine opens the collection attribute database and finds the root handle. This really isn't right; the collection routines don't need to know about the root handle, the fs routines should handle it instead.
It also opens the dataspace database and returns the collection ID.
In addition to creating one collection for the file system, a second ``administrative'' collection will be created. Currently, the handle allocator uses the admin collection to store handle state in a bstream.
The dataspace create code checks to see if the handle is in use. Currently this code will not try to come up with a new handle if the proposed one is taken; this should probably be fixed.
An entry is placed in the dataspace attribute DB for the collection.
What is the ``type'' field used for?
I think that the basic metadata for a file will be stored in the dataspace attribute DB by adding members to the dbpf_dspace_attr structure (defined in dbpf.h).
First we create a dataspace. Then we add a key/value pair to the parent directory's keyval space mapping the new directory's name to its handle.
First we create a dataspace. Then we add a key/value pair to the parent directory's keyval space mapping the name to the handle.
Only one collection can be used with the current implementation.
Error checking is weak at best; assert(0) calls are used in many error conditions.
This document was generated using the LaTeX2HTML translator Version 2002 (1.62)
The translation was initiated by Samuel Lang (ANL) on 2008-04-14