PVFS Development Team
$Id: concepts.tex,v 1.3 2006/09/13 20:22:43 vilayann Exp $
PVFS2 is a complete redesign and reimplementation of the parallel file system concepts in PVFS1. PVFS2 has new entities acting in new ways on new objects. This document introduces the terminology and concepts used in the other PVFS2 documents.
Refer to figure 1 for an idea of how the words above fit together.
All end-user access to PVFS will still be provided by one of several front ends (VFS kernel interface, ROMIO, libpvfs). (What is the right term here: API, front end, or interface?) The new pvfs library has not been written yet, but it will probably be largely similar to the current pvfs library. The ROMIO and VFS interfaces should remain largely unchanged for the end user, aside from extensions that take advantage of new PVFS2 features.
The end-user interfaces converge at the system interface. If a user request requires talking to several servers, the system interface submits a job request for each server to the job manager. (Presumably, if the job manager cannot split up requests, the submission of multiple jobs happens in the system interface. Or will the client find out which servers it has to talk to after opening the file?) Requests for large or noncontiguous data chunks need only one job, as explained below.
The job manager is a fairly thin layer between the system interface and BMI, Trove, and flow. Nearly every request requires multiple steps (communicate over the network, read bytes from storage, ...), and each step becomes a job. The job manager provides a common handle space (terminology?) and thread management to keep everything progressing.
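The idea of one handle space shared by all job types can be sketched as follows. This is a toy illustration only: the names Job, JobManager, post, complete, and test are hypothetical and are not taken from the PVFS2 source.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Job:
    kind: str           # "bmi", "trove", or "flow"
    description: str
    done: bool = False


class JobManager:
    """Toy job manager (hypothetical): one id space for every job kind."""

    KINDS = ("bmi", "trove", "flow")

    def __init__(self):
        self._next_id = count(1)   # shared id counter for all job kinds
        self._jobs = {}

    def post(self, kind, description):
        """Register one step of a request as a job and return its id."""
        if kind not in self.KINDS:
            raise ValueError(f"unknown job kind: {kind}")
        job_id = next(self._next_id)
        self._jobs[job_id] = Job(kind, description)
        return job_id

    def complete(self, job_id):
        """Mark a job finished (a real layer would do this on I/O completion)."""
        self._jobs[job_id].done = True

    def test(self, job_id):
        """Report whether the job has completed."""
        return self._jobs[job_id].done


# One request, two steps, two jobs drawn from the same id space.
mgr = JobManager()
recv_id = mgr.post("bmi", "receive request from the network")
read_id = mgr.post("trove", "read bytes from storage")
mgr.complete(recv_id)
```

Because both ids come from the same counter, a caller can test any outstanding job the same way regardless of which lower layer is doing the work.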
If the user performs a data operation, the system interface submits a flow job. The system interface knows what has to happen: some bytes from here have to go over there. The flow job figures out how to accomplish the request. The flow can compute how much data comes from which servers based on the I/O request and the distribution parameters. The flow is then responsible for making the right BMI calls to keep the I/O request progressing.
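As a rough illustration of computing how much data comes from which servers, the sketch below assumes a simple round-robin striping distribution. That distribution, the function name bytes_per_server, and the parameters strip_size and num_servers are assumptions made for this example; the text does not say which distributions PVFS2 supports.

```python
def bytes_per_server(offset, length, strip_size, num_servers):
    """Return a list where entry i is the byte count handled by server i.

    Assumes round-robin striping: strip k of the file lives on
    server k % num_servers (an assumption for illustration only).
    """
    if strip_size <= 0 or num_servers <= 0:
        raise ValueError("strip_size and num_servers must be positive")
    totals = [0] * num_servers
    pos = offset
    end = offset + length
    while pos < end:
        strip_index = pos // strip_size
        server = strip_index % num_servers
        strip_end = (strip_index + 1) * strip_size
        chunk = min(end, strip_end) - pos   # bytes left in this strip
        totals[server] += chunk
        pos += chunk
    return totals


# A 10-byte request at offset 0 over 3 servers with 4-byte strips.
aligned = bytes_per_server(offset=0, length=10, strip_size=4, num_servers=3)
# The same length starting mid-strip over 2 servers.
unaligned = bytes_per_server(offset=6, length=10, strip_size=4, num_servers=2)
```

Here aligned is [4, 4, 2] and unaligned is [4, 6]; in both cases the per-server counts add up to the requested length.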
Metadata requests go directly to BMI jobs. ... (Client requests will never go directly to Trove, right?)
Wind back up the protocol stack to the servers for a moment. We'll come back to BMI in a bit. From the client side, all jobs are "expected": the client asks for something to happen and can test for completion of that job. PVFS2 servers can additionally receive "unexpected" jobs, generally (always?) when a client initiates a request to a server. (Where can more information be found about the "request handler" and the "op state machine" in figure 1?)
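The expected/unexpected distinction can be illustrated with a toy in-memory endpoint. ToyEndpoint and its methods post_recv, deliver, test, and test_unexpected are hypothetical names invented for this sketch; they do not describe the actual BMI or job interfaces.

```python
class ToyEndpoint:
    """Hypothetical endpoint that separates expected from unexpected messages."""

    def __init__(self):
        self.posted = {}       # tag -> message (None until it arrives)
        self.unexpected = []   # messages for which no receive was posted

    def post_recv(self, tag):
        """Declare in advance that a message with this tag is expected."""
        self.posted[tag] = None

    def deliver(self, tag, message):
        """Simulate the network handing a message to this endpoint."""
        if tag in self.posted:
            self.posted[tag] = message
        else:
            self.unexpected.append((tag, message))

    def test(self, tag):
        """Return the expected message if it has arrived, else None."""
        return self.posted.get(tag)

    def test_unexpected(self):
        """Drain and return all unexpected messages received so far."""
        messages, self.unexpected = self.unexpected, []
        return messages


client = ToyEndpoint()
server = ToyEndpoint()

client.post_recv(tag=7)                  # the client expects a reply
server.deliver(7, "getattr request")     # the server posted nothing for tag 7
requests = server.test_unexpected()      # so the request shows up as unexpected

pending = client.test(7)                 # reply has not arrived yet
client.deliver(7, "getattr reply")
reply = client.test(7)                   # now the expected job has completed
```

In this sketch the client only ever tests jobs it posted itself, while the server learns about new requests by polling for messages it did not ask for.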
The job manager works the same way for the server as it does for the client, keeping track of BMI, Trove, and flow jobs.
Figure 2 shows a setmeta operation. The client starts a BMI job to send a request to the metadata server. The server then receives a job indicating that an unexpected BMI message has arrived. The server issues a Trove job to store the metadata, then issues a BMI job to send an ack. The client does a BMI job to receive the ack. A setmeta therefore requires 2 jobs on the client side (send request, receive ack) and 3 jobs on the server side (receive request, do meta operation, send ack). (So "unexpected" is not completely accurate? The server expects a request enough to post a receive.)
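The setmeta exchange described above can be traced with a small simulation that simply records each job as it is issued. The queues, the handle value "handle-42", and the attribute dictionary are made up for illustration; only the order and count of jobs come from the text.

```python
from collections import deque

client_jobs = []
server_jobs = []
to_server = deque()     # stands in for the network, client -> server
to_client = deque()     # stands in for the network, server -> client
metadata_store = {}     # stands in for Trove-managed metadata

# Client job 1 (BMI): send the setmeta request.
to_server.append(("setmeta", "handle-42", {"mode": 0o644}))
client_jobs.append("bmi: send request")

# Server job 1 (BMI): an unexpected message arrives.
op, handle, attrs = to_server.popleft()
server_jobs.append("bmi: receive unexpected request")

# Server job 2 (Trove): store the metadata.
metadata_store[handle] = attrs
server_jobs.append("trove: store metadata")

# Server job 3 (BMI): send the ack.
to_client.append(("ack", handle))
server_jobs.append("bmi: send ack")

# Client job 2 (BMI): receive the ack.
ack = to_client.popleft()
client_jobs.append("bmi: receive ack")
```

After the exchange, client_jobs holds two entries and server_jobs holds three, matching the job counts given for a setmeta.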
Data operations are largely similar to metadata operations: the client posts jobs to send the request and receive the response, and the server posts jobs to receive the request, do the operation, and send an ack. The difference is that a flow does the work of moving data. (XXX: I have a figure for this. Is this type of figure useful?)
Jobs and flows use BMI abstractions any time they have to communicate over the network. The BMI level resolves these abstract "connections" into real network activity. BMI will open a TCP socket, do some GM magic, or do whatever else the underlying network needs in order to move bytes.
Similarly, jobs and flows use Trove abstractions and let Trove deal with the actual storage of bytestream and keyval objects.
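To make the two object types concrete, here is a minimal in-memory sketch of a store that keeps bytestream and keyval objects side by side. ToyStore and its method names are hypothetical and are not the Trove interface; a real storage layer would persist the data rather than hold it in dictionaries.

```python
class ToyStore:
    """Hypothetical store holding bytestream and keyval objects by handle."""

    def __init__(self):
        self.bytestreams = {}   # handle -> bytearray of file data
        self.keyvals = {}       # handle -> dict of key/value pairs

    def bstream_write(self, handle, offset, data):
        """Write data at offset, zero-filling any gap before it."""
        buf = self.bytestreams.setdefault(handle, bytearray())
        if len(buf) < offset:
            buf.extend(b"\0" * (offset - len(buf)))
        buf[offset:offset + len(data)] = data

    def bstream_read(self, handle, offset, size):
        """Read up to size bytes starting at offset."""
        buf = self.bytestreams.get(handle, bytearray())
        return bytes(buf[offset:offset + size])

    def keyval_write(self, handle, key, value):
        self.keyvals.setdefault(handle, {})[key] = value

    def keyval_read(self, handle, key):
        return self.keyvals[handle][key]


store = ToyStore()
store.bstream_write("data-1", 0, b"hello")
store.bstream_write("data-1", 3, b"XY")          # overwrite part of the stream
store.keyval_write("meta-1", "owner", "alice")   # made-up attribute
data = store.bstream_read("data-1", 0, 5)
owner = store.keyval_read("meta-1", "owner")
```

Callers address both kinds of object by handle and never see how or where the bytes are kept, which is the separation the paragraph above describes.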
The HTML version of this document was generated from concepts.tex with LaTeX2HTML by Samuel Lang (ANL) on 2008-04-14.