- .\" XXX standard disclaimer belongs here....
- .\" $Header: /private/postgres/ref/RCS/large_objects,v 1.8 1992/07/14 05:54:17 ptong Exp $
- .ds UX "\\s-2UNIX\\s0
- .SS "LARGE OBJECTS" 6/14/90
- .XA 0 "Section 7 \*- Large Objects"
- .sp 2i
- .ps 14
- .ce
- .b "SECTION 7 \*- LARGE OBJECTS"
- .sp 3
- .uh NAME
- .lp
- .lp
- Large Object Interface \*- interface to \*(PP large objects
- .uh DESCRIPTION
- .lp
- In \*(PP,
- data values are stored in tuples,
- and individual tuples cannot span multiple data pages.
- Since the size of a data page is 8192 bytes,
- the upper limit on the size of a data value is relatively low.
- To support the storage of larger atomic values,
- \*(PP provides a
- .i "large object"
- interface.
- This interface provides file-oriented access to user data
- that has been explicitly declared to be a large type.
- .lp
- Version 4 of \*(PP supports two different implementations of large
- objects.
- These two implementations allow users to trade off speed of access
- against transaction protection and crash recovery on large object data.
- Applications that can tolerate lost data may store object data in
- conventional files that are fast to access,
- but cannot be recovered in the case of system crashes.
- For applications that require stricter guarantees of durability,
- a transaction-protected large object implementation is available.
- This section describes the two implementations
- and the programmatic and query language interfaces to large object
- data.
- .lp
- Unlike the BLOB support provided by most commercial relational
- database management systems,
- \*(PP allows users to define specific large object types.
- \*(PP large objects are first-class objects in the database,
- and any operation that can be applied to a conventional (small)
- abstract data type (ADT) may also be applied to a large one.
- For example,
- two different large object types,
- such as
- .i image
- and
- .i voice ,
- may be created.
- Functions that operate on image data,
- and other functions that operate on voice data,
- may be declared to the database system.
- The data manager will distinguish between image and voice data
- automatically,
- and will allow users to invoke the appropriate functions on values
- of each of these types.
- In addition,
- indices may be created on large data values,
- or on functions of them.
- Finally,
- operators may be defined that operate on large values.
- Users may invoke these functions and operators from the query language.
- The database system will enforce type restrictions on large
- object data values.
- .lp
- The \*(PP large object interface is modeled after the \*(UX file system
- interface, with analogs of open(), read(), write(), lseek(), etc.
- User functions call these routines to retrieve only the data of
- interest from a large object.
- For example,
- if a large object type called
- .CW mugshot
- existed that stored photographs of faces,
- then a function called
- .CW beard
- could be declared on
- .CW mugshot
- data.
- .CW Beard
- could look at the lower third of a photograph,
- and determine the color of the beard that appeared there,
- if any.
- The entire large object value need not be buffered,
- or even examined,
- by the
- .CW beard
- function.
- As mentioned above,
- \*(PP supports functional indices on large object data.
- In this example,
- the results of the
- .CW beard
- function could be stored in a B-tree index to provide
- fast searches for people with red beards.
- .uh "\*(UX FILES AS LARGE OBJECT ADTS"
- .lp
- The simplest large object interface supplied with \*(PP is also
- the least robust.
- It does not support transaction protection,
- crash recovery,
- or time travel.
- On the other hand,
- it can be used on existing data files
- (such as word-processor files)
- that must be accessed simultaneously by the database system
- and existing application programs.
- .pp
- This implementation stores large object data in a \*(UX file,
- and stores only the file name in the database.
- Importing a large object into the database is as simple as storing the
- file name in a distinguished
- .q "large object name"
- relation.
- Interface routines allow the database system to open,
- seek,
- read,
- write,
- and close these \*(UX files by an internal large object identifier.
- .lp
- The functions
- .CW lo_filein
- and
- .CW lo_fileout
- convert between \*(UX filenames and internal large
- object identifiers.
- These functions are \*(PP registered functions,
- meaning they can be used directly in Postquel queries as well as from
- dynamically loaded C functions.
- If you are defining a simple large object ADT,
- these functions can be used as your
- .q input
- and
- .q output
- functions (see
- .b "define type"
- and the \*(PP Manual sections concerning user-defined types for details).
- .(b
- .ta 0.5i 1i 1.5i 2i 2.5i 3i 3.5i 4i 4.5i 5i
- char *lo_filein(filename)
- char *filename;
- .sp 0.5v
- .i
- Import a new \*(UX file storing large object
- data into the database system. This routine stores
- the filename in a large object naming relation and
- assigns it a unique large object identifier.
- .r
- .sp
- char * lo_fileout (object)
- LargeObject *object;
- .sp 0.5v
- .i
- This routine returns the \*(UX filename associated
- with a large object.
- .r
- .)b
- .lp
- The file storing the large object must be accessible on the machine
- on which \*(PP is running.
- The data is not copied into the database system,
- so if the file is later removed,
- it is unrecoverable.
- .lp
- Large objects are accessible from both the \*(PP backend,
- using dynamically-loaded functions,
- and from the front-end,
- using the LIBPQ interface.
- These interfaces will be described in detail below.
- .uh "INVERSION LARGE OBJECTS"
- .lp
- In contrast to \*(UX files as large objects,
- the Inversion large object implementation guarantees transaction protection,
- crash recovery,
- and time travel on user large object data.
- This implementation breaks large objects up into
- .q chunks
- and stores the chunks in tuples in the database.
- A B-tree index guarantees fast searches for the correct chunk number
- when doing random access reads and writes.
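- .lp
- As a rough illustration of the chunk lookup,
- a byte offset can be mapped to a chunk number and an offset within
- that chunk.
- The sketch below is standalone C, not \*(PP source;
- the chunk size and every
- .CW demo_
- name are assumptions made for the example,
- since the actual chunk size is not stated here.

```c
#include <assert.h>

/* Hypothetical chunk size, assumed for illustration only. */
#define DEMO_CHUNK_SIZE 8000

struct demo_chunk_pos {
    long chunk_no;   /* which chunk holds the byte */
    long chunk_off;  /* offset of the byte inside that chunk */
};

/* Map a byte offset in a large object to its chunk position. */
struct demo_chunk_pos
demo_locate(long byte_offset)
{
    struct demo_chunk_pos pos;

    assert(byte_offset >= 0);
    pos.chunk_no = byte_offset / DEMO_CHUNK_SIZE;
    pos.chunk_off = byte_offset % DEMO_CHUNK_SIZE;
    return pos;
}
```

- With these assumed values,
- byte 100000 falls in chunk 12 at offset 4000.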
- .lp
- If a transaction that has made changes to an Inversion large object
- subsequently aborts,
- the changes are backed out in the normal way.
- Inversion large objects are stored in the database,
- and so are not directly accessible to other programs.
- Only programs that use the \*(PP data manager can read and
- write Inversion large objects.
- .lp
- To use Inversion large objects,
- a new large object should be created using the LOcreat()
- interface,
- defined below.
- Afterwards,
- the name of the large object can be stored in an ordinary
- tuple.
- .lp
- The next section describes the programmatic interface to both
- \*(UX and Inversion large objects.
- .uh "BACKEND INTERFACE TO LARGE OBJECTS"
- .lp
- Large object data is accessible from front-end programs
- linked with the LIBPQ library,
- and from dynamically-loaded routines that execute in the \*(PP
- backend.
- This section describes access from dynamically loaded C functions.
- .uh "Creating New Large Objects"
- .lp
- The routine
- .(b
- .ft C
- int LOcreat(path, mode, objtype)
- char *path;
- int mode;
- int objtype;
- .ft
- .)b
- creates a new large object.
- .lp
- The pathname is a slash-separated list of components,
- and must be a unique pathname in the \*(PP large object namespace.
- There is a virtual root directory (``/'') in which objects
- may be placed.
- .lp
- The
- .CW objtype
- parameter can be one of
- .CW Inversion
- or
- .CW Unix ,
- which are symbolic constants defined in
- .(b
- .ft C
- ~postgres/src/lib/H/catalog/pg_lobj.h
- .ft
- .)b
- The interpretation of the
- .CW mode
- argument depends on the
- .CW objtype
- selected.
- .lp
- For \*(UX files,
- .CW mode
- is the mode used to protect the file on the \*(UX file system.
- On creation,
- the file is open for reading and writing.
- .lp
- For Inversion large objects,
- .CW mode
- is a bitmask describing several different attributes
- of the new object.
- The symbolic constants listed here are defined in
- .(b
- .ft C
- ~postgres/src/lib/H/tmp/libpq-fs.h
- .ft
- .)b
- The access type (read, write, or both) is controlled by
- OR'ing together the bits INV_READ and INV_WRITE.
- If the large object should be archived \*-
- that is,
- if historical versions of it should be moved periodically
- to a special archive relation \*-
- then the INV_ARCHIVE bit should be set.
- The low-order sixteen bits of
- .CW mode
- are the storage manager number on which the large object
- should reside\**.
- .(f
- \**
- In the distributed version of \*(PP,
- only the magnetic disk storage manager is supported.
- For users running \*(PP at UC Berkeley,
- additional storage managers are available.
- .)f
- For sites other than Berkeley,
- these bits should always be zero.
- At Berkeley,
- storage manager zero is magnetic disk,
- storage manager one is a Sony optical disk jukebox,
- and storage manager two is main memory.
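- .lp
- The way the access bits and the storage manager number share one
- .CW mode
- word can be sketched as follows.
- The numeric values are placeholders assumed for illustration;
- the real definitions come from libpq-fs.h and may differ,
- and the
- .CW demo_
- helpers are hypothetical.

```c
#include <assert.h>

/* Placeholder values, assumed for illustration only; the real
 * definitions live in libpq-fs.h and may differ. */
#define DEMO_INV_ARCHIVE  0x00010000
#define DEMO_INV_WRITE    0x00020000
#define DEMO_INV_READ     0x00040000
#define DEMO_SMGR_MASK    0x0000ffff  /* low-order sixteen bits */

/* Combine access and archive bits with a storage manager number. */
int
demo_inv_mode(int attr_bits, int smgr)
{
    assert(smgr >= 0 && smgr <= DEMO_SMGR_MASK);
    assert((attr_bits & DEMO_SMGR_MASK) == 0);
    return attr_bits | smgr;
}

/* Recover the storage manager number from a mode word. */
int
demo_inv_smgr(int mode)
{
    return mode & DEMO_SMGR_MASK;
}
```

- Passing zero as the storage manager number leaves the low-order
- bits clear,
- which is the case described for sites other than Berkeley.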
- .lp
- The commands below create large objects of the two types,
- leaving each open for reading and writing.
- The Inversion large object is not archived,
- and is located on magnetic disk:
- .(b
- .ft C
- unix_fd = LOcreat("/my_unix_obj", 0600, Unix);
- .ft
- .sp 0.5v
- .ft C
- inv_fd = LOcreat("/my_inv_obj",
- INV_READ|INV_WRITE, Inversion);
- .ft
- .)b
- .uh "Opening Large Objects"
- .lp
- Existing large objects may be opened for reading or writing by
- calling the routine
- .(b
- .ft C
- int LOopen(path, mode)
- char *path;
- int mode;
- .ft
- .)b
- The
- .CW path
- argument specifies the large object's pathname,
- and is the same as the pathname used to create the object.
- The
- .CW mode
- argument is interpreted differently by the two implementations.
- For \*(UX large objects,
- values should be chosen from the set of mode bits passed to the
- .CW open
- system call;
- that is,
- O_CREAT,
- O_RDONLY,
- O_WRONLY,
- O_RDWR,
- and O_TRUNC.
- For Inversion large objects,
- only the bits
- INV_READ and INV_WRITE have any meaning.
- .lp
- To open the two large objects created in the last example,
- a programmer would issue the commands
- .(b
- .ft C
- unix_fd = LOopen("/my_unix_obj", O_RDWR);
- .ft
- .sp 0.5v
- .ft C
- inv_fd = LOopen("/my_inv_obj", INV_READ|INV_WRITE);
- .ft
- .)b
- .lp
- If a large object is opened before it has been created,
- then a new large object is created using the \*(UX
- implementation,
- and the new object is opened.
- .uh "Seeking on Large Objects"
- .lp
- The routine
- .(b
- .ft C
- int
- LOlseek(fd, offset, whence)
- int fd;
- int offset;
- int whence;
- .ft
- .)b
- moves the current location pointer for a large object to the
- specified position.
- The
- .CW fd
- parameter is the file descriptor returned by either
- .CW LOcreat
- or
- .CW LOopen .
- .CW Offset
- is the byte offset in the large object to which to seek.
- The only legal value for
- .CW whence
- in the current release of the system is
- .CW L_SET ,
- as defined in <sys/file.h>.
- .lp
- \*(UX large objects allow holes to exist in objects;
- that is,
- a program may seek well past the end of the object and write
- bytes.
- Intervening blocks will not be created;
- reading them will return zero-filled blocks.
- Inversion large objects do not support holes.
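- .lp
- The hole behaviour of \*(UX files can be observed with ordinary
- system calls, outside \*(PP altogether.
- The following sketch assumes a POSIX system and uses
- .CW SEEK_SET ,
- the POSIX name for
- .CW L_SET ;
- the function name and scratch file path are made up for the example.

```c
#define _POSIX_C_SOURCE 200809L  /* ask for POSIX declarations (mkstemp) */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Create a scratch file, write one byte at offset gap, then read
 * back the byte at offset 0, which lies inside the hole.
 * Returns that byte, or -1 on any failure. */
int
demo_read_hole(long gap)
{
    char path[] = "/tmp/demo_holeXXXXXX";
    char byte = 'x';
    int fd, result = -1;

    assert(gap > 0);
    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    (void) unlink(path);  /* file disappears once fd is closed */

    if (lseek(fd, (off_t) gap, SEEK_SET) == (off_t) gap &&
        write(fd, &byte, 1) == 1 &&
        lseek(fd, (off_t) 0, SEEK_SET) == (off_t) 0 &&
        read(fd, &byte, 1) == 1)
        result = (unsigned char) byte;

    (void) close(fd);
    return result;
}
```

- A call such as demo_read_hole(100000) is expected to return zero,
- since the unwritten region reads back as zero bytes.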
- .lp
- The following code
- seeks to byte location 100000 of the example large objects:
- .(b
- .ft C
- unix_status = LOlseek(unix_fd, 100000, L_SET);
- .ft
- .sp 0.5v
- .ft C
- inv_status = LOlseek(inv_fd, 100000, L_SET);
- .ft
- .)b
- On error,
- .CW LOlseek
- returns a value less than zero.
- On success,
- the new offset is returned.
- .uh "Writing to Large Objects"
- .lp
- Once a large object has been created,
- it may be filled by calling
- .(b
- .ft C
- int
- LOwrite(fd, wbuf)
- int fd;
- struct varlena *wbuf;
- .ft
- .)b
- Here,
- .CW fd
- is the file descriptor returned by
- .CW LOcreat
- or
- .CW LOopen ,
- and
- .CW wbuf
- describes the data to write.
- The
- .CW varlena
- structure in \*(PP consists of four bytes in which the length
- of the datum is stored,
- followed by the data itself.
- The four length bytes include themselves.
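- .lp
- The layout can be pictured with a small standalone sketch.
- The structure and helpers below are hypothetical stand-ins,
- not the definitions from postgres.h,
- and
- .CW malloc
- takes the place of
- .CW palloc
- so that the code compiles outside the backend.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the varlena layout: a four-byte
 * length word followed directly by the data bytes. */
struct demo_varlena {
    int32_t vl_len;   /* total size in bytes, including this word */
    char    vl_dat[]; /* the data itself */
};

#define DEMO_HDRSZ ((int) sizeof(int32_t))

/* Copy len bytes from src into a newly allocated demo_varlena.
 * Returns NULL if allocation fails; the caller frees the result. */
struct demo_varlena *
demo_make_varlena(const char *src, int len)
{
    struct demo_varlena *vl;

    assert(src != NULL && len >= 0);
    vl = malloc(sizeof(*vl) + (size_t) len);
    if (vl == NULL)
        return NULL;
    vl->vl_len = DEMO_HDRSZ + len;  /* the length counts itself */
    memcpy(vl->vl_dat, src, (size_t) len);
    return vl;
}

/* Number of data bytes carried, excluding the length word. */
int
demo_data_len(const struct demo_varlena *vl)
{
    return vl->vl_len - DEMO_HDRSZ;
}
```

- A datum built from four data bytes therefore records a length of
- eight.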
- .lp
- For example,
- to write 1024 bytes of zeroes to the sample large objects:
- .(b
- .ft C
- struct varlena *vl;
-
- vl = (struct varlena *) palloc(1028);
- VARSIZE(vl) = 1028;
- bzero(VARDATA(vl), 1024);
-
- nwrite_unix = LOwrite(unix_fd, vl);
- .sp 0.5v
- nwrite_inv = LOwrite(inv_fd, vl);
- .ft
- .)b
- .CW LOwrite
- returns the number of bytes actually written,
- or a negative number on error.
- For Inversion large objects,
- the entire write is guaranteed to succeed or fail.
- That is,
- if the number of bytes written is non-negative,
- then it equals VARSIZE(vl).
- .lp
- The VARSIZE()
- and VARDATA()
- macros are declared in the file
- .(b
- .ft C
- ~postgres/src/lib/H/tmp/postgres.h
- .ft
- .)b
- .uh "Reading from Large Objects"
- .lp
- Data may be read from large objects by calling the routine
- .(b
- .ft C
- struct varlena *
- LOread(fd, len)
- int fd;
- int len;
- .ft
- .)b
- This routine returns the data read in a varlena structure;
- the length field of the structure reflects the byte count actually read.
- For example,
- .(b
- .ft C
- struct varlena *unix_vl, *inv_vl;
- int nread_ux, nread_inv;
- char *data_ux, *data_inv;
-
- unix_vl = LOread(unix_fd, 100);
- nread_ux = VARSIZE(unix_vl) - 4;	/* less the four length bytes */
- data_ux = VARDATA(unix_vl);
- .sp 0.5v
- inv_vl = LOread(inv_fd, 100);
- nread_inv = VARSIZE(inv_vl) - 4;	/* less the four length bytes */
- data_inv = VARDATA(inv_vl);
- .ft
- .)b
- The returned varlena structures have been allocated by the
- \*(PP memory manager
- .CW palloc ,
- and may be released with
- .CW pfree
- when they are no longer needed.
- .uh "Closing a Large Object"
- .lp
- Once a large object is no longer needed,
- it may be closed by calling
- .(b
- .ft C
- int
- LOclose(fd)
- int fd;
- .ft
- .)b
- where
- .CW fd
- is the file descriptor returned by
- .CW LOopen
- or
- .CW LOcreat .
- On success,
- .CW LOclose
- returns zero.
- A negative return value indicates an error.
- .lp
- For example,
- .(b
- .ft C
- if (LOclose(unix_fd) < 0)
- /* error */;
- .sp 0.5v
- if (LOclose(inv_fd) < 0)
- /* error */;
- .ft
- .)b
- .uh "LIBPQ LARGE OBJECT INTERFACE"
- .lp
- Large objects may also be accessed from database client
- programs that link the LIBPQ library.
- This library provides a set of routines that support opening,
- reading, writing, closing,
- and seeking on large objects.
- The interface is similar to that provided via the backend,
- but rather than using varlena structures,
- a more conventional \*(UX-style buffer scheme is used.
- .lp
- In version 4 of \*(PP,
- large object operations must be enclosed in a transaction
- block.
- This is true even for \*(UX large objects,
- which are not transaction-protected.
- This is due to a shortcoming in the memory management scheme
- for large objects,
- and will be rectified in version 4.1.
- The end of this section shows a short example program
- that correctly transaction-protects its file system operations.
- .lp
- This section describes the LIBPQ interface in detail.
- .uh "Creating a Large Object"
- .lp
- The routine
- .(b
- .ft C
- int
- p_creat(path, mode, objtype)
- char *path;
- int mode;
- int objtype;
- .ft
- .)b
- creates a new large object.
- The
- .CW path
- argument specifies a large-object system pathname.
- .lp
- The
- .CW objtype
- parameter can be one of
- .CW Inversion
- or
- .CW Unix ,
- which are symbolic constants defined in
- .(b
- .ft C
- ~postgres/src/lib/H/catalog/pg_lobj.h
- .ft
- .)b
- The interpretation of the
- .CW mode
- argument depends on the
- .CW objtype
- selected.
- .lp
- For \*(UX files,
- .CW mode
- is the mode used to protect the file on the \*(UX file system.
- On creation,
- the file is open for reading and writing.
- .lp
- For Inversion large objects,
- .CW mode
- is a bitmask describing several different attributes
- of the new object.
- The symbolic constants listed here are defined in
- .(b
- .ft C
- ~postgres/src/lib/H/tmp/libpq-fs.h
- .ft
- .)b
- The access type (read, write, or both) is controlled by
- OR'ing together the bits INV_READ and INV_WRITE.
- If the large object should be archived \*-
- that is,
- if historical versions of it should be moved periodically
- to a special archive relation \*-
- then the INV_ARCHIVE bit should be set.
- The low-order sixteen bits of
- .CW mode
- are the storage manager number on which the large object
- should reside.
- For sites other than Berkeley,
- these bits should always be zero.
- At Berkeley,
- storage manager zero is magnetic disk,
- storage manager one is a Sony optical disk jukebox,
- and storage manager two is main memory.
- .lp
- The commands below create large objects of the two types,
- leaving each open for reading and writing.
- The Inversion large object is not archived,
- and is located on magnetic disk:
- .(b
- .ft C
- unix_fd = p_creat("/my_unix_obj", 0600, Unix);
- .sp 0.5v
- inv_fd = p_creat("/my_inv_obj",
- INV_READ|INV_WRITE, Inversion);
- .ft
- .)b
- .uh "Opening an Existing Large Object"
- .lp
- To open an existing large object,
- call
- .(b
- .ft C
- int
- p_open(path, mode)
- char *path;
- int mode;
- .ft
- .)b
- .lp
- The
- .CW path
- argument specifies the large object pathname for the object to open.
- The mode bits control whether the object is opened for reading,
- writing,
- or both.
- For \*(UX large objects,
- the appropriate flags are
- O_CREAT,
- O_RDONLY,
- O_WRONLY,
- O_RDWR,
- and O_TRUNC.
- For Inversion large objects,
- only INV_READ and INV_WRITE are recognized.
- .lp
- If a large object is opened before it is created,
- it is created by default using the \*(UX file implementation.
- .uh "Writing Data to a Large Object"
- .lp
- The routine
- .(b
- .ft C
- int
- p_write(fd, buf, len)
- int fd;
- char *buf;
- int len;
- .ft
- .)b
- writes
- .CW len
- bytes from
- .CW buf
- to large object
- .CW fd .
- The
- .CW fd
- argument must have been returned by a previous
- .CW p_creat
- or
- .CW p_open .
- .lp
- The number of bytes actually written is returned.
- In the event of an error,
- the return value is negative.
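- .lp
- Because the count returned may be smaller than
- .CW len ,
- a client may want to retry until the whole buffer has been written.
- A minimal sketch of such a loop follows;
- .CW demo_write_all
- and the in-memory
- .CW demo_sink_write
- are hypothetical helpers,
- with the latter standing in for
- .CW p_write
- so that the code runs without a database connection.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Calling convention shared by p_write and the stand-in below. */
typedef int (*demo_write_fn)(int fd, char *buf, int len);

/* Keep calling writer until all len bytes have been accepted.
 * Returns len on success, or a negative value on error. */
int
demo_write_all(demo_write_fn writer, int fd, char *buf, int len)
{
    int done = 0;

    assert(writer != NULL && buf != NULL && len >= 0);
    while (done < len) {
        int n = writer(fd, buf + done, len - done);

        if (n < 0)
            return n;   /* propagate the error */
        if (n == 0)
            return -1;  /* no progress was made; give up */
        done += n;
    }
    return done;
}

char demo_sink[64];     /* bytes accepted so far */
int  demo_sink_len = 0;

/* In-memory stand-in for p_write: accepts at most 3 bytes per call. */
int
demo_sink_write(int fd, char *buf, int len)
{
    int n = len < 3 ? len : 3;

    (void) fd;          /* unused by the stand-in */
    if (demo_sink_len + n > (int) sizeof(demo_sink))
        return -1;
    memcpy(demo_sink + demo_sink_len, buf, (size_t) n);
    demo_sink_len += n;
    return n;
}
```

- In a LIBPQ client,
- .CW p_write
- would be passed in place of the stand-in.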
- .uh "Reading Data from a Large Object"
- .lp
- The routine
- .(b
- .ft C
- int
- p_read(fd, buf, nbytes)
- int fd;
- char *buf;
- int nbytes;
- .ft
- .)b
- reads
- .CW nbytes
- bytes into buffer
- .CW buf
- from the large object descriptor
- .CW fd .
- The number of bytes actually read is returned.
- In the event of an error,
- the return value is less than zero.
- .uh "Seeking on a Large Object"
- .lp
- To change the current read or write location on a large object,
- call
- .(b
- .ft C
- int
- p_lseek(fd, offset, whence)
- int fd;
- int offset;
- int whence;
- .ft
- .)b
- This routine moves the current location pointer for the large object
- described by
- .CW fd
- to the new location specified by
- .CW offset .
- For this release of \*(PP,
- only
- .CW L_SET
- is a legal value for
- .CW whence .
- .uh "Closing a Large Object"
- .lp
- A large object may be closed by calling
- .(b
- .ft C
- int
- p_close(fd)
- int fd;
- .ft
- .)b
- where
- .CW fd
- is a large object descriptor returned by
- .CW p_creat
- or
- .CW p_open .
- On success,
- .CW p_close
- returns zero.
- On error,
- the return value is negative.
- .uh "SAMPLE LARGE OBJECT PROGRAMS"
- .lp
- The \*(PP large object implementation serves as the basis
- for a file system (the
- .q Inversion
- file system)
- built on top of the data manager.
- This file system provides time travel,
- transaction protection,
- and fast crash recovery to clients of ordinary
- file system services.
- It uses the Inversion large object implementation to
- provide these services.
- .lp
- The programs that comprise the Inversion file system are
- included in the \*(PP source distribution,
- in directories
- .(b
- .ft C
- $POSTGRESHOME/test/postfs
- $POSTGRESHOME/test/postfs.usr.bin
- .ft
- .)b
- These directories contain a set of programs for manipulating
- files and directories.
- These programs are based on the Berkeley Software Distribution
- NET-2 release.
- .lp
- These programs are useful in manipulating Inversion files,
- but they also serve as examples of how to code large object
- accesses in LIBPQ.
- All of the programs are LIBPQ clients,
- and all use the interfaces that have been described
- in this section.
- .lp
- Interested readers should refer to the files in the postfs
- directories for in-depth examples of the use of large objects.
- Below,
- a more terse example is provided.
- This code fragment creates a new large object managed
- by Inversion,
- fills it with data from a \*(UX file,
- and closes it.
- .(b
- .ft C
- #include <fcntl.h>
- #include "tmp/c.h"
- #include "tmp/libpq-fe.h"
- #include "tmp/libpq-fs.h"
- #include "catalog/pg_lobj.h"
-
- #define MYBUFSIZ 1024
-
- main()
- {
- int inv_fd;
- int fd;
- char *qry_result;
- char buf[MYBUFSIZ];
- int nbytes;
- int tmp;
-
- PQsetdb("mydatabase");
-
- /* large object accesses must be */
- /* transaction-protected */
- qry_result = PQexec("begin");
-
- if (*qry_result == 'E') /* error */
- exit (1);
-
- /* open the unix file */
- fd = open("/my_unix_file", O_RDONLY, 0666);
- if (fd < 0) /* error */
- exit (1);
-
- /* create the inversion large object */
- inv_fd = p_creat("/inv_file", INV_WRITE, Inversion);
- if (inv_fd < 0) /* error */
- exit (1);
-
- /* copy the unix file to the inversion */
- /* large object */
- while ((nbytes = read(fd, buf, MYBUFSIZ)) > 0)
- {
- tmp = p_write(inv_fd, buf, nbytes);
- if (tmp < nbytes) /* error */
- exit (1);
- }
-
- (void) close(fd);
- (void) p_close(inv_fd);
-
- /* commit the transaction */
- qry_result = PQexec("end");
-
- if (*qry_result == 'E') /* error */
- exit (1);
-
- /* by here, success */
- exit (0);
- }
- .ft
- .)b
- .uh "BUGS"
- .lp
- It should not be necessary to distinguish between Inversion and \*(UX large
- objects when opening an existing large object.
- The system knows which implementation was used,
- so the mode argument should be interpreted the same way in both cases.
- .uh "SEE ALSO"
- .lp
- define type (commands),
- define function (commands),
- load (commands).
-