
		JPEG SYSTEM ARCHITECTURE		1-DEC-92


This file provides an overview of the "architecture" of the portable JPEG
software; that is, the functions of the various modules in the system and the
interfaces between modules. For more precise details about any data structure
or calling convention, see the header files.

Important note: when I say "module" I don't mean "a C function", which is what
some people seem to think the term means. A separate C source file is closer
to the mark. Also, it is frequently the case that several different modules
present a common interface to callers; the term "object" or "method" refers to
this common interface (see "Poor man's object-oriented programming", below).

JPEG-specific terminology follows the JPEG standard:
  A "component" means a color channel, e.g., Red or Luminance.
  A "sample" is a pixel component value (i.e., one number in the image data).
  A "coefficient" is a frequency coefficient (a DCT transform output number).
  The term "block" refers to an 8x8 group of samples or coefficients.
  "MCU" (minimum coded unit) is the same as "MDU" of the R8 draft; i.e., an
    interleaved set of blocks of size determined by the sampling factors,
    or a single block in a noninterleaved scan.


*** System requirements ***

We must support compression and decompression of both Huffman and
arithmetic-coded JPEG files. Any set of compression parameters allowed by the
JPEG spec should be readable for decompression. (We can be more restrictive
about what formats we can generate.) (Note: for legal reasons no arithmetic
coding implementation is currently included in the publicly available sources.
However, the architecture still supports it.)

We need to be able to handle both raw JPEG files (more specifically, the JFIF
format) and JPEG-in-TIFF (C-cubed's format, and perhaps Kodak's). Even if we
don't implement TIFF ourselves, other people will want to use our code for
that. This means that generation and scanning of the file header has to be
separated out.

Perhaps we should be prepared to support the JPEG lossless mode (also referred
to in the spec as spatial DPCM coding). A lot of people seem to believe they
need this... whether they really do is debatable, but the customer is always
right. On the other hand, there will not be much sharable code between the
lossless and lossy modes! At best, a lossless program could be derived from
parts of the lossy version. For now we will only worry about the lossy mode.

I see no real value in supporting the JPEG progressive modes (note that
spectral selection and successive approximation are two different progressive
modes). These are only of interest when painting the decompressed image in
real-time, which nobody is going to do with a pure software implementation.

There is some value in supporting the hierarchical mode, which allows for
successive frames of higher resolution. This could be of use for including
"thumbnail" representations. However, this appears to add a lot more
complexity than it is worth.

A variety of uncompressed image file formats and user interfaces must be
supported. These aspects therefore have to be kept separate from the rest of
the system. A particularly important issue is whether color quantization of
the output is needed (i.e., whether a colormap is used). We should be able to
support both adaptive quantization (which requires two or more passes over the
image) and nonadaptive (quantization to a prespecified colormap, which can be
done in one pass).

Memory usage is an important concern, since we will port this code to 80x86
and other limited-memory machines. For large intermediate structures, we
should be able to use either virtual memory or temporary files.

It should be possible to build programs that handle compression only,
decompression only, or both, without much duplicate or unused code in any
version. (In particular, a decompression-only version should have no extra
baggage.)


*** Compression overview ***

The *logical* steps needed in (non-lossless) JPEG compression are:

1. Conversion from incoming image format to a standardized internal form
   (either RGB or grayscale).

2. Color space conversion (e.g., RGB to YCbCr). This is a null step for
   grayscale (unless we support mapping color inputs to grayscale, which
   would most easily be done here). Gamma adjustment may also be needed here.

3. Downsampling (reduction of number of samples in some color components).
   This step operates independently on each color component.

4. MCU extraction (creation of a single sequence of 8x8 sample blocks).
   This step and the following ones are performed once for each scan
   in the output JPEG file, i.e., once if making an interleaved file and more
   than once for a noninterleaved file.
   Note: both this step and the previous one must deal with edge conditions
   for pictures that aren't a multiple of the MCU dimensions. Alternately,
   we could expand the picture to a multiple of an MCU before doing these
   two steps. (The latter seems better and has been adopted below; the
   padding arithmetic is sketched below.)

5. DCT transformation of each 8x8 block.

6. Quantization scaling and zigzag reordering of the elements in each 8x8
   block.

7. Huffman or arithmetic encoding of the transformed block sequence.

8. Output of the JPEG file with whatever headers/markers are wanted.

Of course, the actual implementation will combine some of these logical steps
for efficiency. The trick is to keep these logical functions as separate as
possible without losing too much performance.
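
To make the edge-expansion choice in step 4 concrete, here is a tiny
self-contained sketch of rounding the picture dimensions up to a multiple of
the MCU size. The names and numbers are illustrative only, not final code:

    #include <stdio.h>

    #define DCTSIZE 8                   /* the JPEG block dimension */

    /* Round a up to the next multiple of b. */
    static long
    round_up (long a, long b)
    {
      return ((a + b - 1) / b) * b;
    }

    int
    main ()
    {
      long width = 1000, height = 700;  /* example picture size */
      int max_h_samp = 2, max_v_samp = 2; /* largest sampling factors */
      long mcu_width = (long) max_h_samp * DCTSIZE;   /* 16 */
      long mcu_height = (long) max_v_samp * DCTSIZE;  /* 16 */

      /* prints "expand 1000x700 to 1008x704" */
      printf ("expand %ldx%ld to %ldx%ld\n", width, height,
              round_up (width, mcu_width), round_up (height, mcu_height));
      return 0;
    }

The added edge samples might be filled by replicating the last real column and
row, so that the padding compresses cheaply and is easy to crop off again
during decompression (step 9 of the decompression pipeline).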

In addition to these logical pipeline steps, we need various modules that
aren't part of the data pipeline. These are:

A. Overall control (sequencing of other steps & management of data passing).

B. User interface; this will determine the input and output files, and supply
   values for some compression parameters. Note that this module is highly
   platform-dependent.

C. Compression parameter selection: some parameters should be chosen
   automatically rather than requiring the user to find a good value.
   The prototype only does this for the back-end (Huffman or arithmetic)
   parameters, but more might be done in the future. A straightforward
   approach to selection is to try several values; this requires being able
   to repeatedly apply some portion of the pipeline and inspect the results
   (without actually outputting them). Probably only entropy encoding
   parameters can reasonably be done this way; optimizing earlier steps would
   require too much data to be reprocessed (not to mention the problem of
   interactions between parameters for different steps). What other
   facilities do we need to support automatic parameter selection?

D. A memory management module to deal with small-memory machines. This must
   create the illusion of virtual memory for certain large data structures
   (e.g., the downsampled image or the transformed coefficients).
   The interface to this must be defined to minimize the overhead incurred,
   especially on virtual-memory machines where the module won't do much.
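
For concreteness, module D's virtual-array service might present an interface
along these lines. This is only a sketch; the type and function names are
invented here, not settled:

    /* Big arrays are described by an opaque control block.  The client
     * declares the full size up front, then asks for a window of rows at
     * a time; it gets back ordinary in-memory pointers regardless of
     * whether the data really lives in RAM or in a temp file. */

    typedef unsigned char JSAMPLE;
    typedef JSAMPLE *JSAMPROW;
    typedef JSAMPROW *JSAMPARRAY;

    typedef struct big_sarray_control * big_sarray_ptr;

    /* Announce the dimensions of a big sample array, before any access. */
    extern big_sarray_ptr request_big_sarray (long rows, long cols);

    /* Make rows [start_row .. start_row+num_rows-1] addressable and return
     * a pointer list for them; the rows remain valid until the next access
     * call.  'writable' tells the manager whether they will be modified. */
    extern JSAMPARRAY access_big_sarray (big_sarray_ptr arr,
                                         long start_row, long num_rows,
                                         int writable);

    /* Release the array and any temp file backing it. */
    extern void free_big_sarray (big_sarray_ptr arr);

On a virtual-memory machine these calls can degenerate to simple pointer
arithmetic over one big malloc'd block, which is how we keep the overhead
near zero there.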

In many cases we can arrange things so that a data stream is produced in
segments by one module and consumed by another without the need to hold it all
in (virtual) memory. This is obviously not possible for any data that must be
scanned more than once, so it won't work everywhere.

The major variable at this level of detail is whether the JPEG file is to be
interleaved or not; that affects the order of processing so fundamentally that
the central control module must know about it. Some of the other modules may
need to know it too. It would simplify life if we didn't need to support
noninterleaved images, but that is not reasonable.

Many of these steps operate independently on each color component; the
knowledge of how many components there are, and how they are interleaved,
ought to be confined to the central control module. (Color space conversion
and MCU extraction probably have to know it too.)


*** Decompression overview ***

Decompression is roughly the inverse process from compression, but there are
some additional steps needed to produce a good output image.

The *logical* steps needed in (non-lossless) JPEG decompression are:

1. Scanning of the JPEG file, decoding of headers/markers etc.

2. Huffman or arithmetic decoding of the coefficient sequence.

3. Quantization descaling and zigzag reordering of the elements in each 8x8
   block.

4. MCU disassembly (conversion of a possibly interleaved sequence of 8x8
   blocks back to separate components in pixel map order).

5. (Optional) Cross-block smoothing per JPEG section K.8 or a similar
   algorithm. (Steps 5-8 operate independently on each component.)

6. Inverse DCT transformation of each 8x8 block.

7. Upsampling. At this point a pixel image of the original dimensions
   has been recreated.

8. Post-upsampling smoothing. This can be combined with upsampling,
   by using a convolution-like calculation to generate each output pixel
   directly from one or more input pixels.

9. Cropping to the original pixel dimensions (throwing away duplicated
   pixels at the edges). It is most convenient to do this now, as the
   preceding steps are simplified by not having to worry about odd picture
   sizes.

10. Color space reconversion (e.g., YCbCr to RGB). This is a null step for
    grayscale. (Note that mapping a color JPEG to grayscale output is most
    easily done in this step.) Gamma adjustment may also be needed here.

11. Color quantization (only if a colormapped output format is requested).
    NOTE: it is probably preferable to perform quantization in the internal
    (JPEG) colorspace rather than the output colorspace. Doing it that way,
    color conversion need only be applied to the colormap entries, not to
    every pixel; and quantization gets to operate in a non-gamma-corrected
    space. But the internal space may not be suitable for some algorithms.
    The system design is such that only the color quantizer module knows
    whether color conversion happens before or after quantization. (A sketch
    of the colormap-conversion idea appears just after this list.)

12. Writing of the desired image format.
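
To illustrate the note in step 11: if quantization runs in the internal YCbCr
space, the per-pixel color conversion of step 10 collapses to a
per-colormap-entry conversion. A minimal sketch, with invented names and the
usual CCIR 601 conversion equations:

    #define MAP_SIZE 256                /* illustrative colormap size */

    /* colormap entries chosen by the quantizer, in YCbCr */
    static unsigned char map_y[MAP_SIZE], map_cb[MAP_SIZE], map_cr[MAP_SIZE];
    /* the same entries converted to output RGB */
    static unsigned char map_r[MAP_SIZE], map_g[MAP_SIZE], map_b[MAP_SIZE];

    static unsigned char
    clamp (double x)
    {
      return (unsigned char) (x < 0.0 ? 0.0 : x > 255.0 ? 255.0 : x + 0.5);
    }

    /* Convert just the ncolors colormap entries, not every pixel. */
    static void
    convert_colormap (int ncolors)
    {
      int i;

      for (i = 0; i < ncolors; i++) {
        double y  = map_y[i];
        double cb = map_cb[i] - 128.0;
        double cr = map_cr[i] - 128.0;

        map_r[i] = clamp (y + 1.40200 * cr);
        map_g[i] = clamp (y - 0.34414 * cb - 0.71414 * cr);
        map_b[i] = clamp (y + 1.77200 * cb);
      }
    }

After this, emitting a pixel is only a matter of writing its colormap index;
no per-pixel conversion arithmetic remains.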

As before, some of these will be combined into single steps. When dealing
with a noninterleaved JPEG file, steps 2-9 will be performed once for each
scan; the resulting data will need to be buffered up so that steps 10-12 can
process all the color components together.

The same auxiliary modules are needed as before, except for compression
parameter selection. Note that rerunning a pipeline stage should never be
needed during decompression. This may allow a simpler control module. The
user interface might also be simpler since it need not supply any compression
parameters.

As before, not all of these steps require the whole image to be stored.
Actually, two-pass color quantization is the only step that logically requires
this; everything else could be done a few raster lines at a time (at least for
interleaved images). We might want to make color quantization a separate
program because of this fact.

Again, many of the steps should be able to work on one color component in
ignorance of the other components.


*** Implications of noninterleaved formats ***

Much of the work can be done in a single pass if an interleaved JPEG file
format is used. With a noninterleaved JPEG file, separating or recombining
the components will force use of virtual memory (on a small-memory machine,
we probably would want one temp file per color component).

If any of the image formats we read or write are noninterleaved, the opposite
condition might apply: processing a noninterleaved JPEG file would be more
efficient. Offhand, though, I can't think of any popular image formats that
work that way; besides, the win would only come if the same color space were
used in JPEG and non-JPEG files. It's not worth the complexity to make the
system design accommodate that case efficiently.

An argument against interleaving is that it makes the decompressor need more
memory for cross-block smoothing (since the minimum processable chunk of the
image gets bigger). With images more than 1000 pixels across, 80x86 machines
are likely to have difficulty in handling this feature.

Another argument against interleaving is that the noninterleaved format allows
a wider range of sampling factors, since the limit of ten blocks per MCU no
longer applies. We could get around this by blithely ignoring the spec's
limit of ten blocks, but that seems like a bad idea (especially since it makes
the above problem worse).

The upshot is that we need to support both interleaved and noninterleaved JPEG
formats, since for any given machine and picture size one may be much more
efficient than the other. However, the non-JPEG format we convert to or from
will be assumed to be an interleaved format (i.e., it produces or stores all
the components of a pixel together).

I do not think it is necessary for the compressor to be able to output
partially-interleaved formats (multiple scans, some of which interleave a
subset of the components). However, the decompressor must be able to read
such files to conform to the spec.


*** Data formats ***

Pipeline steps that work on pixel sample values will use the following data
structure:

    typedef something JSAMPLE;          a pixel component value, 0..MAXJSAMPLE
    typedef JSAMPLE *JSAMPROW;          ptr to a row of samples
    typedef JSAMPROW *JSAMPARRAY;       ptr to a list of rows
    typedef JSAMPARRAY *JSAMPIMAGE;     ptr to a list of color-component arrays

The basic element type JSAMPLE will be one of unsigned char, (signed) char, or
unsigned short. Unsigned short will be used if samples wider than 8 bits are
to be supported (this is a compile-time option). Otherwise, unsigned char is
used if possible. If the compiler only supports signed chars, then it is
necessary to mask off the value when reading. Thus, all reads of sample
values should be coded as "GETJSAMPLE(value)", where the macro will be defined
as "((value)&0xFF)" on signed-char machines and "(value)" elsewhere.

With these conventions, JSAMPLE values can be assumed to be >= 0. This should
simplify correct rounding during downsampling, etc. The JPEG draft's
specification that sample values run from -128..127 will be accommodated by
subtracting 128 just as the sample value is copied into the source array for
the DCT step (this will be an array of signed shorts or longs). Similarly,
during decompression the output of the IDCT step will be immediately shifted
back to 0..255. (NB: different values are required when 12-bit samples are in
use. The code should be written in terms of MAXJSAMPLE and CENTERJSAMPLE,
which will be #defined as 255 and 128 respectively in an 8-bit implementation,
and as 4095 and 2048 in a 12-bit implementation.)

On compilers that don't support "unsigned short", signed short can be used for
a 12-bit implementation. To support lossless coding (which allows up to
16-bit data precision) masking with 0xFFFF in GETJSAMPLE might be necessary.
(But if "int" is 16 bits then using "unsigned int" is the best solution.)
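
A plausible set of definitions implementing these conventions might look like
this; BITS_IN_JSAMPLE and HAVE_UNSIGNED_CHAR stand for whatever symbols the
configuration process actually provides:

    #if BITS_IN_JSAMPLE == 8

    #ifdef HAVE_UNSIGNED_CHAR
    typedef unsigned char JSAMPLE;
    #define GETJSAMPLE(value)   ((int) (value))
    #else
    typedef char JSAMPLE;               /* may well be signed */
    #define GETJSAMPLE(value)   ((int) (value) & 0xFF)
    #endif

    #define MAXJSAMPLE          255
    #define CENTERJSAMPLE       128

    #else /* 12-bit samples */

    typedef short JSAMPLE;              /* unsigned short where available */
    #define GETJSAMPLE(value)   ((int) (value))

    #define MAXJSAMPLE          4095
    #define CENTERJSAMPLE       2048

    #endif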

Notice that we use a pointer per row, rather than a two-dimensional JSAMPLE
array. This choice costs only a small amount of memory and has several
benefits:

* Code using the data structure doesn't need to know the allocated width of
  the rows. This will simplify edge expansion/compression, since we can work
  in an array that's wider than the logical picture width.

* The rows forming a component array may be allocated at different times
  without extra copying. This will simplify working a few scanlines at a
  time, especially in smoothing steps that need access to the previous and
  next rows.

* Indexing doesn't require multiplication; this is a performance win on many
  machines.

Note that each color component is stored in a separate array; we don't use the
traditional structure in which the components of a pixel are stored together.
This simplifies coding of steps that work on each component independently,
because they don't need to know how many components there are. Furthermore,
we can read or write each component to a temp file independently, which is
helpful when dealing with noninterleaved JPEG files.

A specific sample value will be accessed by code such as
    GETJSAMPLE(image[colorcomponent][row][col])
where col is measured from the image left edge, but row is measured from the
first sample row currently in memory. Either of the first two indexings can
be precomputed by copying the relevant pointer.
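
For instance, an inner loop would normally hoist the first two indexings out
of the loop by copying the row pointer. A sketch, assuming the 8-bit
definitions above (the function itself is a made-up example):

    typedef unsigned char JSAMPLE;
    typedef JSAMPLE *JSAMPROW;
    typedef JSAMPROW *JSAMPARRAY;
    typedef JSAMPARRAY *JSAMPIMAGE;
    #define GETJSAMPLE(value)   ((int) (value))

    /* Sum one row of one color component. */
    static long
    sum_row (JSAMPIMAGE image, int component, long row, long width)
    {
      JSAMPROW ptr = image[component][row];     /* hoist two indexings */
      long col, sum = 0;

      for (col = 0; col < width; col++)
        sum += GETJSAMPLE (ptr[col]);
      return sum;
    }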


Pipeline steps that work on frequency-coefficient values will use the
following data structure:

    typedef short JCOEF;                a 16-bit signed integer
    typedef JCOEF JBLOCK[64];           an 8x8 block of coefficients
    typedef JBLOCK *JBLOCKROW;          ptr to one horizontal row of 8x8 blocks
    typedef JBLOCKROW *JBLOCKARRAY;     ptr to a list of such rows
    typedef JBLOCKARRAY *JBLOCKIMAGE;   ptr to a list of color component arrays

The underlying type is always a 16-bit signed integer (this is "short" on all
machines of interest, but let's use the typedef name anyway). These are
grouped into 8x8 blocks (we should use #defines DCTSIZE and DCTSIZE2 rather
than "8" and "64"). The contents of a block may be either in "natural" or
zigzagged order, and may be true values or divided by the quantization
coefficients, depending on where the block is in the pipeline.

Notice that the allocation unit is now a row of 8x8 blocks, corresponding to
eight rows of samples. Otherwise the structure is much the same as for
samples, and for the same reasons.

On machines where malloc() can't handle a request bigger than 64Kb, this data
structure limits us to rows of less than 512 JBLOCKs, which would be a picture
width of about 4000 pixels. This seems an acceptable restriction.
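
Written out with those #defines, and with the 64Kb arithmetic spelled out
(a sketch, not settled code):

    #define DCTSIZE     8               /* blocks are DCTSIZE x DCTSIZE */
    #define DCTSIZE2    64              /* DCTSIZE squared */

    typedef short JCOEF;                /* 16-bit signed integer */
    typedef JCOEF JBLOCK[DCTSIZE2];     /* 128 bytes per block */
    typedef JBLOCK *JBLOCKROW;
    typedef JBLOCKROW *JBLOCKARRAY;
    typedef JBLOCKARRAY *JBLOCKIMAGE;

    /* 65536 bytes / 128 bytes per JBLOCK = 512 blocks per malloc'd row;
       512 blocks * 8 samples of width each = about 4000 pixels. */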


On 80x86 machines, the bottom-level pointer types (JSAMPROW and JBLOCKROW)
must be declared as "far" pointers, but the upper levels can be "near"
(implying that the pointer lists are allocated in the DS segment).
To simplify sharing code, we'll have a #define symbol FAR, which expands to
the "far" keyword when compiling on 80x86 machines and to nothing elsewhere.


The data arrays used as input and output of the DCT transform subroutine will
be declared using a separate typedef; they could be arrays of "short", "int"
or "long" independently of the above choices. This would depend on what is
needed to make the compiler generate correct and efficient multiply/add code
in the DCT inner loops. No significant speed or memory penalty will be paid
to have a different representation than is used in the main image storage
arrays, since some additional value-by-value processing is done at the time of
creation or extraction of the DCT data anyway (e.g., add/subtract 128).
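
A sketch of the copy into the DCT input array, showing where the
CENTERJSAMPLE subtraction happens (DCTELEM and the function name are
placeholders):

    #define DCTSIZE             8
    #define CENTERJSAMPLE       128
    #define GETJSAMPLE(value)   ((int) (value))

    typedef unsigned char JSAMPLE;
    typedef JSAMPLE *JSAMPROW;
    typedef JSAMPROW *JSAMPARRAY;
    typedef int DCTELEM;        /* short, int, or long, whichever multiplies best */

    /* Load one 8x8 block of samples into the DCT workspace, shifting
       from 0..MAXJSAMPLE down to the signed range the DCT expects. */
    static void
    load_dct_block (DCTELEM *block, JSAMPARRAY sample_rows, long start_col)
    {
      int row, col;

      for (row = 0; row < DCTSIZE; row++) {
        JSAMPROW src = sample_rows[row] + start_col;

        for (col = 0; col < DCTSIZE; col++)
          *block++ = (DCTELEM) GETJSAMPLE (src[col]) - CENTERJSAMPLE;
      }
    }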


*** Poor man's object-oriented programming ***

It should be pretty clear by now that we have a lot of quasi-independent
steps, many of which have several possible behaviors. To avoid cluttering the
code with lots of switch statements, we'll use a simple form of object-style
programming to separate out the different possibilities.

For example, Huffman and arithmetic coding will be implemented as two separate
modules that present the same external interface; at runtime, the calling code
will access the proper module indirectly through an "object".

We can get the limited features we need while staying within portable C. The
basic tool is a function pointer. An "object" is just a struct containing one
or more function pointer fields, each of which corresponds to a method name in
real object-oriented languages. During initialization we fill in the function
pointers with references to whichever module we have determined we need to use
in this run. Then invocation of the module is done by indirecting through a
function pointer; on most architectures this is no more expensive (and
possibly cheaper) than a switch, which would be the only other way of making
the required run-time choice. The really significant benefit, of course, is
keeping the source code clean and well structured.

For example, the interface for entropy decoding (Huffman or arithmetic
decoding) might look like this:

    struct function_ptr_struct {
        ...
        /* Entropy decoding methods */
        void (*prepare_for_scan) ();
        void (*get_next_mcu) ();
        ...
    };

    typedef struct function_ptr_struct * function_ptrs;

The struct pointer is what will actually be passed around. A call site might
look like this:

    some_function (function_ptrs fptrs)
    {
        ...
        (*fptrs->get_next_mcu) (...);
        ...
    }

(It might be worth inventing some specialized macros to hide the rather ugly
syntax for method definition and call.) Note that the caller doesn't know how
many different get_next_mcu procedures there are, what their real names are,
nor how to choose which one to call.
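
To show how the pointers get filled in, here is a sketch of one module's
selection routine together with the run-time choice. All of the names below
(jselhuffman, jselarithmetic, select_entropy_decoder) are invented for
illustration:

    struct function_ptr_struct {
      /* Entropy decoding methods */
      void (*prepare_for_scan) ();
      void (*get_next_mcu) ();
    };

    typedef struct function_ptr_struct * function_ptrs;

    /* The Huffman module's concrete methods stay private to its file. */
    static void huff_prepare_for_scan () { /* ... */ }
    static void huff_get_next_mcu ()     { /* ... */ }

    /* Each module exports one entry point that installs its methods. */
    void
    jselhuffman (function_ptrs fptrs)
    {
      fptrs->prepare_for_scan = huff_prepare_for_scan;
      fptrs->get_next_mcu = huff_get_next_mcu;
    }

    extern void jselarithmetic ();      /* twin routine, arithmetic module */

    /* During initialization, the control logic makes the run-time choice;
       thereafter callers just indirect through fptrs. */
    void
    select_entropy_decoder (function_ptrs fptrs, int arith_code)
    {
      if (arith_code)
        jselarithmetic (fptrs);
      else
        jselhuffman (fptrs);
    }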

An important benefit of this scheme is that it is easy to provide multiple
versions of any method, each tuned to a particular case. While a lot of
precalculation might be done to select an optimal implementation of a method,
the cost per invocation is constant. For example, the MCU extraction step
might have a "generic" method, plus one or more "hardwired" methods for the
most popular sampling factors; the hardwired methods would be faster because
they'd use straight-line code instead of for-loops. The cost to determine
which method to use is paid only once, at startup, and the selection criteria
are hidden from the callers of the method.

This plan differs a little bit from usual object-oriented structures, in that
only one instance of each object class will exist during execution. The
reason for having the class structure is that on different runs we may create
different instances (choose to execute different modules).

To minimize the number of object pointers that have to be passed around, it
will be easiest to have just a few big structs containing all the method
pointers. We'll actually use two such structs, one for "system-dependent"
methods (memory allocation and error handling) and one for everything else.

Because of this choice, it's best not to think of an "object" as a specific
data structure. Rather, an "object" is just a group of related methods.
There would typically be one or more C modules (source files) providing
concrete implementations of those methods. You can think of the term
"method" as denoting the common interface presented by some set of functions,
and "object" as denoting a group of common method interfaces, or the total
shared interface behavior of a group of modules.


*** Data chunk sizes ***

To make the cost of this object-oriented style really minimal, we should make
sure that each method call does a fair amount of computation. To do that we
should pass large chunks of data around; for example, the colorspace
conversion method should process much more than one pixel per call.

For many steps, the most natural unit of data seems to be an "MCU row".
This consists of one complete horizontal strip of the image, as high as an
MCU. In a noninterleaved scan, an MCU row is always eight samples high (when
looking at samples) or one 8x8 block high (when looking at coefficients). In
an interleaved scan, an MCU row consists of all the data for one horizontal
row of MCUs; this may be from one to four blocks high (eight to thirty-two
samples) depending on the sampling factors. The height and width of an MCU
row may be different in each component. (Note that the height and width of an
MCU row changes at the downsampling and upsampling steps. An unsubsampled
image has the same size in each component. The preceding statements apply to
the downsampled dimensions.)

For example, consider a 1024-pixel-wide image using (2h:2v)(1h:1v)(1h:1v)
subsampling. In the noninterleaved case, an MCU row of Y would contain 8x1024
samples or the same number of frequency coefficients, so it would occupy
8K bytes (samples) or 16K bytes (coefficients). An MCU row of Cb or Cr would
contain 8x512 samples and occupy half as much space. In the interleaved case,
an MCU row would contain 16x1024 Y samples, 8x512 Cb and 8x512 Cr samples, so
a total of 24K (samples) or 48K (coefficients) would be needed. This is a
reasonable amount of data to expect to retain in memory at one time. (Bear in
mind that we'll usually need to have several MCU rows resident in memory at
once, at the inputs and outputs to various pipeline steps.)
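
The arithmetic of that example, spelled out as a quick check (illustrative
code, using the interleaved case):

    #include <stdio.h>

    int
    main ()
    {
      long width = 1024;                /* Y samples across */
      long y_rows = 16, c_rows = 8;     /* MCU-row heights per component */
      long c_width = width / 2;         /* Cb,Cr downsampled 2:1 horizontally */
      long samples = y_rows * width + 2 * c_rows * c_width;

      /* prints "24K of samples, 48K of coefficients" */
      printf ("%ldK of samples, %ldK of coefficients\n",
              samples / 1024, (samples * 2L) / 1024);
      return 0;
    }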

The worst case is probably (2h:4v)(1h:1v)(1h:1v) interleaving (this uses 10
blocks per MCU, which is the maximum allowed by the spec). An MCU row will
then contain 32 sample rows worth of Y, so it would occupy 40K or 80K bytes
for a 1024-pixel-wide image. The most memory-intensive step is probably
cross-block smoothing, for which we'd need 3 MCU rows of coefficients as input
and another one as output; that would be 320K of working storage. Anything
much larger would not fit in an 80x86 machine. (To decompress wider pictures
on an 80x86, we'll have to skip cross-block smoothing or else use temporary
files.)

This unit is thus a reasonable-sized chunk for passing through the pipeline.
Of course, its major advantage is that it is a natural chunk size for the MCU
assembly and disassembly steps to work with.

For the entropy (Huffman or arithmetic) encoding/decoding steps, the most
convenient chunk is a single MCU: one 8x8 block if not interleaved, three to
ten such blocks if interleaved. The advantage of this is that when handling
interleaved data, the blocks have the same sequence of component membership on
each call. (For example, Y,Y,Y,Y,Cb,Cr when using (2h:2v)(1h:1v)(1h:1v)
subsampling.) The code needs to know component membership so that it can
apply the right set of compression coefficients to each block. A prebuilt
array describing this membership can be used during each call. This chunk
size also makes it easy to handle restart intervals: just count off one MCU
per call and reinitialize when the count reaches zero (restart intervals are
specified in numbers of MCUs).
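
A sketch of the per-MCU restart bookkeeping; the names are invented, and
process_restart_marker stands for whatever re-initializes the entropy coder
state and DC predictors:

    extern void process_restart_marker ();     /* hypothetical */

    static unsigned int restart_interval;      /* in MCUs; 0 = no restarts */
    static unsigned int mcus_left_in_interval;

    /* Called once per MCU by the entropy decoding loop. */
    void
    decode_one_mcu ()
    {
      if (restart_interval != 0) {
        if (mcus_left_in_interval == 0) {
          process_restart_marker ();    /* resync the entropy state */
          mcus_left_in_interval = restart_interval;
        }
        mcus_left_in_interval--;
      }
      /* ... decode the 1 to 10 blocks belonging to this MCU ... */
    }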