James Roskind C Porting Preprocessor (JRCPP)
JRCPP LANGUAGE REFERENCE MANUAL (3/23/90)
Copyright (C) 1990 James Roskind, All rights reserved.  Permission
is granted to copy and distribute this file as part of any machine
readable archive containing the entire, unmodified, JRCPP PUBLIC
DISTRIBUTION PACKAGE (henceforth called the "Package").  The set of
files that form the Package are described in the README file that
is a part of the Package. Permission is granted to individual
users of the Package to copy individual portions of the Package
(i.e., component files) in any form (e.g.: printed, electronic,
electro-optical, etc.) desired for the purpose of supporting
users of the Package (i.e., providing online, or onshelf
documentation access; executing the binary JRCPP code, etc.).
Permission is not granted to distribute copies of individual
portions of the Package, unless a machine readable version of the
complete Package is also made available with such distribution.
Abstracting with credit is permitted. There is no charge or
royalty fee required for copies made in compliance with this
notice. To otherwise copy elements of this package requires
prior permission in writing from James Roskind.
James Roskind
516 Latania Palm Drive
Indialantic FL 32903
End of copyright notice
What the above copyright means is that you are free to use and
distribute (or even sell) the entire set of files in this Package,
but you can't split them up, and distribute them as separate files.
The notice also says that you cannot modify the copies that you
distribute, and this ESPECIALLY includes NOT REMOVING any part of
the copyright notice in any file. JRCPP currently implements a C
Preprocessor, but the users of this Package do NOT surrender any
right of ownership or copyright to any source text that is processed
by JRCPP, either before or after processing. Similarly, there are no
royalty or fee requirements for using the post-preprocessed output of
JRCPP.
This Package is expected to be distributed by shareware and freeware
channels (including BBS sites), but the fees paid for "distribution"
costs are strictly exchanged between the distributor, and the
recipient, and James Roskind makes no express or implied warranties
about the quality or integrity of such indirectly acquired copies.
Distributors and users may obtain the Package (the Public
distribution form) directly from the author by following the ordering
procedures in the REGISTRATION file.
DISCLAIMER:
JAMES ROSKIND PROVIDES THIS FILE "AS IS" WITHOUT WARRANTY OF ANY
KIND, EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR
PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE
PROGRAM AND DOCUMENTATION IS WITH YOU. Some states do not allow
disclaimer of express or implied warranties in certain transactions,
therefore, this statement may not apply to you.
UNIX is a registered trademark of AT&T Bell Laboratories.
____________________________________________________________________
James Roskind C Porting Preprocessor (JRCPP)
JRCPP LANGUAGE REFERENCE MANUAL
INTRODUCTION
This document, in the company of the "ANSI Programming Language C"
Standard, is intended to act as a language reference manual. Most
significantly, this document discusses the performance of JRCPP in
official "ANSI undefined", "ANSI unspecified" and "ANSI
implementation defined" domains of the C Language. In addition, it
lists performance limitations of JRCPP, and directly relates these
limitations to the standard's requirements for "Implementation
limits".
As an additional matter, this document identifies vaguenesses (and in
the rare case, errors) in the ANSI C Standard, and describes the
resolution adopted by JRCPP. Hence this document is also the
"Rationale" for JRCPP, in much the same way that the ANSI C standard
has an accompanying document, "Rationale for ANSI Programming Language
C". This document will generally not discuss aspects of the standard
that do not involve preprocessing activities performed on source
files.
Note that this document was written based on the Draft Proposed ANSI
C Standard, X3J11/88-158, dated December 8, 1988.  After a drawn out
appeals process, I believe this draft was accepted in January 1990 by
the ANSI Standards Committee. I am not aware that any changes were
made during that appeals process, and I apologize in advance for any
errors I might have made in this regard, or in the description that
follows.
In all cases where this Language Reference Manual deviates from the
ANSI C Standard, this document should be assumed to be in error, and
the corresponding bug/misperformance in the JRCPP program (if any)
should be reported.  The ANSI C Standard is a tremendous work, and I
realize that my abridged commentary in many areas does not do justice
to the meticulous selection of elaborate wording in the official
Standard. For many, my description will be enough, but for language
lawyers, there is no replacement for the official ANSI document.
Section numbers in this document have been chosen to match those of
the ANSI C standard, and hence certain gaps are present. These gaps
represent areas where either there is generally no impact on
preprocessing activities, or no additional commentary seems necessary.
LISTING OF SECTIONS
1.3 References
1.6 Definition of Terms
2. ENVIRONMENT
2.1.1.2 ENVIRONMENT- Translation phases
2.1.1.3 ENVIRONMENT- Diagnostics
2.2 ENVIRONMENTAL CONSIDERATION
2.2.1 ENVIRONMENTAL CONSIDERATION- Character sets
2.2.1.1 ENVIRONMENTAL CONSIDERATION- Trigraph Sequences
2.2.1.3 ENVIRONMENTAL CONSIDERATION- Character sets- Multibyte characters
2.2.4 ENVIRONMENTAL CONSIDERATION- Translation Limits
3.1 LANGUAGE- LEXICAL ELEMENTS
3.1.2 LANGUAGE- LEXICAL ELEMENTS- Identifiers
3.1.3.3 LANGUAGE- LEXICAL ELEMENTS- Character constants
3.1.4 LANGUAGE- LEXICAL ELEMENTS- String literals
3.1.7 LANGUAGE- LEXICAL ELEMENTS- Header names
3.1.8 LANGUAGE- LEXICAL ELEMENTS- Preprocessing numbers
3.8 LANGUAGE- PREPROCESSING DIRECTIVES
3.8.1 LANGUAGE- PREPROCESSING DIRECTIVES- Conditional inclusion
3.8.2 LANGUAGE- PREPROCESSING DIRECTIVES- Source file inclusion
3.8.3 LANGUAGE- PREPROCESSING DIRECTIVES- Macro replacement
3.8.3.2 LANGUAGE- PREPROCESSING DIRECTIVES- The # operator
3.8.3.3 LANGUAGE- PREPROCESSING DIRECTIVES- The ## operator
3.8.3.5 LANGUAGE- PREPROCESSING DIRECTIVES- Scope of macro definitions
3.8.4 LANGUAGE- PREPROCESSING DIRECTIVES- Line control
3.8.6 LANGUAGE- PREPROCESSING DIRECTIVES- Pragma directive
3.8.8 LANGUAGE- PREPROCESSING DIRECTIVES- Predefined macro names
1.3 References
In addition to the 6 references listed in the standard (the most
significant of which is probably "The C Reference Manual", by
Kernighan and Ritchie), an additional reference set should be
considered. Since JRCPP is intended to support many dialects of C,
as well as C++, references for C++ are:
"The C++ Programming Language", by Bjarne Stroustrup, Addison-Wesley
(1986), Copyright Bell Telephone Laboratories Inc.
"The Annotated C++ Reference Manual" by Margaret A. Ellis and Bjarne
Stroustrup, Addison-Wesley (to be published).
1.6 Definition of Terms
Among the 17 terms defined in this section (such as "bit", "byte",
"argument", "parameter"...) which are certainly crucial to a reference
manual, there are also several terms which identify the focus of this
manual.  The definitions are for the phrases "Unspecified behavior",
"Undefined behavior", and "Implementation defined behavior". The
following are my interpretations of these definitions:
"Unspecified behavior": Although the source code is considered
correct, the standard has no requirements on any implementation.  An
example of this is the precedence of the paste (##) and stringize
operators.  Notice that it is not even required that an implementation
be CONSISTENT in its handling of this issue!
"Undefined behavior": The relevant source construct is not portable
ANSI C.  As a result, the implementation can accept or reject the
construct at any point, from preprocessing and compilation through
execution of the linked program.  Fundamentally, such behavior is
used to clearly identify non-portable source constructs.
"Implementation defined behavior": The relevant source code is
considered correct, and each implementation is responsible for
defining the behavior of that construct. An example of this is the
number of significant characters in identifier names (above and
beyond what are required in a minimally ANSI C implementation).
The above definitions will be referred to regularly during the
commentary on JRCPP, and its support for the ANSI C standard.
2. ENVIRONMENT
2.1.1.2 ENVIRONMENT- Translation phases
This section describes the actual phases of translation of C source
code. The phases also serve to delineate the points between
preprocessing, and compilation. The phases may be summarized as
follows:
Phase 1) The characters in the source are translated into those of
the "source character set". During this process, JRCPP translates 8
bit characters into 7 bit characters, by ignoring the high order bit,
and by translating the source file characters 0 and 128 into simple
spaces (ASCII 32). This phase also includes identification of line
delimiters, and the removal of trigraphs.  On a DOS/OS2 platform,
JRCPP identifies the two source file characters
<carriage-return><line-feed> as the terminator for each line, which
is henceforth referred to as <newline>. The ANSI Standard also
requires that complete trigraph removal be performed in this phase,
and JRCPP fully supports this. Note that all diagnostics issued by
JRCPP are based upon line counts generated in this phase, and hence
most editors can be used to move to the line identified in a
diagnostic.
Phase 2) All occurrences of a backslash followed immediately by a
<newline> are removed. This removal "splices" together the
consecutive lines that were only separated by this "escaped out
newline". This process may also be seen as combining several
physical lines, as viewed by an editor, into a long logical line.
This activity is most useful for programmers that wish to have many
characters on a single source line, in order to, for example, make
them part of a single preprocessing directive. The ANSI Standard also
requires that every non-empty source file end in a newline that is
not escaped by a backslash (JRCPP diagnoses these conditions).
Notice that phase 1 is complete before phase 2 is started. Hence the
removal of an escaped newline CANNOT create a trigraph that is
eligible for translation.
Phase 3) This phase of translation is referred to as "tokenization".
In this phase, sequences of characters are gathered together for
processing as whole units (tokens). This phase also defines comments
to be interpreted as equivalent to a single space character. The
standard allows implementations to consider consecutive (non-newline)
whitespace (space, tab, page feed, lone carriage return) as
equivalent to single spaces. The ANSI Standard also specifies that a
source file cannot end in either a partial (unterminated) comment, or
in a partial preprocessing token. (JRCPP diagnoses an unterminated
comment at the end of a file).
Note that since comments and tokens are removed at the same time
(i.e.: via a single left to right scan for the largest possible
lexical group), there is some contention between "otherwise
overlapping" string literals, character constants, and comments.
This contention is always resolved by accepting the largest possible
token (or comment) before allowing a new token to begin. For example,
the following is a pair of comments surrounding an identifier:
/* ignore " inside comment*/ identifier /* still ignore " in comment*/
Hence we see that not only don't comments nest, but string literals
do not form tokens within comments (and hence cannot "hide" the
comment terminator). Similarly, the following is a pair of
consecutive string literals:
"comments begin /* outside" " and end with the */ sequence"
This example shows that comments are not scanned for within string
literals (and hence cannot "hide" the terminal close quote). Finally,
the following is the sum of two character constants,
'"' + '"'
which demonstrates that character constants are likewise not scanned
internally for any other extended sequences (such as comments or
string literals).
The standard does have the confusing phrase: "a source file shall not
end in a partial preprocessing token", as part of its description of
this phase. Recall that phase 2 ensured that the file ended in a
newline, which "terminates" any preprocessing token!  It
appears impossible to have a "non-terminated preprocessing token" at
the end of a file. There is a CHANCE that the standard meant to say
"shall not terminate in a partial preprocessing #if directive group",
but this would not make sense as such items are not identified until
later phases. Finally there is the possibility that this requirement
was installed in the Standard before the agreement was reached that a
file should not end in an escaped newline (re: phase 2 requirement),
and then (accidentally) never taken out. We assume this latter
interpretation is correct, and we ignore the constraint on "partial
preprocessing token at end of file".
JRCPP adopts the aforementioned policy allowing sequences of
non-newline whitespaces to be equivalent to a single space, and
compacts comments and whitespace into single spaces. It is critical
to note that, because comments are NOT removed prior to this phase, a
program cannot "comment out" trigraph sequences, or any other
activity performed in the earlier phases.  In addition, the
fact that comments are removed in this phase means that constructs
that "look like comments" in later phases (e.g.: after macro
expansion activity) are not regarded as comments.
Finally, the translation of comments into single space characters
applies even when the comment contains a newline!
This specifically means that preprocessing directives (discussed in
next phase) are not terminated at the end of the line, if the newline
marking that point is within a comment. The implication of this
should be clear to programmers who had previously used a macro
definition of the following form on some non-ANSI compiler:
#define start_comment /*
code /* comment_text */ more_code
The above lines do not define "start_comment" (as understood by later
phases) to be the sequence "/*".  In the above sample, the "/*" and
all characters up until the comment terminator "*/", are compacted
into a single space. Since the next comment terminator occurs on the
next line of the example, the above code is equivalent to:
#define start_comment more_code
On the brighter side, JRCPP would have issued a warning about the
above sequence as having a "/*" within a comment.
Phase 4) In phase 4, the tokens are parsed (grouped together) to form
preprocessing directives and source code. This activity includes
establishing and maintaining a database of macros (re: #define and
#undef), conditionally including sections of source code (re: #if,
#ifdef, #ifndef, #elif, #else, #endif), inserting additional files
(re: #include), providing user supplied error messages (#error), and
servicing implementation defined directives (#pragma). Note that
when a #include directive is processed, the phases 1-4 are all
applied to the source file as it is inserted.
Phases 5-8 of the ANSI Standard relate to phases of processing that I
would refer to as compilation and linking. It is conceivable that
the concatenation of adjacent string literals (phase 6) should be
considered as part of the preprocessing effort, but it has NOT
been included in JRCPP for two reasons:
Reason 1) If a series of large string literals were concatenated,
then there is a good chance that the result would be too large for
many lexical analysers (re: the first scanning phase of a compiler)
to handle. I would prefer to produce code that is acceptable to a
larger range of compilers.
Reason 2) Hexadecimal escape sequences have no termination mark.
Hence the concatenation of two string literals may be MUCH more
complex than concatenating the "sections between the quotes".
(Example: "\x1" "b" is NOT the same as "\x1b".  Specifically, in most
DOS environments, "\x1" "b" is the same as "\001" "\142" or
equivalently "\001\142", whereas "\x1b" is the same as "\033".)  Since
hex escape sequences have no terminator, an example such as what was
just given MUST be translated into a series of octal escape sequences
(at least the trailing hex sequence in the first literal must).
Unfortunately, the translation of long hex escape sequences with
"equivalent" octal escape sequences, would introduce an area of
platform dependency that is probably best avoided in a portable
preprocessor.
2.1.1.3 ENVIRONMENT- Diagnostics
The standard requires that at least one diagnostic message be emitted
for every "violation of any syntax rule or constraint". JRCPP
attempts to support this, with the caveat that parsing of the C
language output is NOT performed, so the associated error checking is
not provided. As an interesting example of this support, the user
should note that special JRCPP features (re: #pragma
diagnostic_adjust) allow arbitrary diagnostic messages to be
"silenced". In order to support the above ANSI C requirement, the
first such adjustment of a diagnostic severity level CAUSES a
diagnostic to be issued.  Hence at least that diagnostic notification
is present no matter what user customizations are applied to
diagnostics.
2.2 ENVIRONMENTAL CONSIDERATION
2.2.1 ENVIRONMENTAL CONSIDERATION- Character sets
The character set supported by JRCPP includes the full range of
characters that are required by the standard.  In addition, byte
values in the range 129 to 255 are interpreted as though the high
bit of the eight-bit byte were 0 (i.e., mapped into values 1
through 127), and the values 0 and 128 are treated as spaces
(ASCII 32).
2.2.1.1 ENVIRONMENTAL CONSIDERATION- Trigraph Sequences
All the standard trigraph sequences are supported. These include:
Trigraph Equivalent Character
??= #
??( [
??/ \
??) ]
??' ^
??< {
??! |
??> }
??- ~
Here again it is significant to recall that trigraph sequences are
replaced in the very first phase of translation. Hence the following
is not a comment:
??/* test*/
as it is equivalent to:
\* test*/
In a similar vein, the following obscure code has surprising meaning:
"/control??/" /* continue till "/* real comment */
as it is equivalent to:
"/control\" /* continue till "/* real comment */
Which is later tokenized as the long single string literal:
"/control\" /* continue till "
2.2.1.3 ENVIRONMENTAL CONSIDERATION- Character sets- Multibyte
characters
Other than support for the single byte characters required in the
standard, multibyte characters are not specified or supported in
JRCPP. If multibyte characters are encountered, they are passed along
blindly, but they cannot be evaluated in any meaningful way in a #if/#elif
expression (a diagnostic is produced if such an attempt is made).
This stance is in keeping with the requirements of the standard.
2.2.4 ENVIRONMENTAL CONSIDERATION- Translation Limits
This section of the standard requires that at least one program exist
that satisfies all of the limits, and can be translated. The
following are the limits that relate to a preprocessor, and the
details of how to construct a program exercising those limits. Note
that the limits are very easy for JRCPP to handle, and no true
"cunning" is required to generate a required test program.
8 nesting levels of conditional inclusion
Currently about 38+ (see Appendix B of USERS MANUAL) levels are
supported, but this static limit may be removed in future
releases.
32 nesting levels of parenthesized expressions within a full
expression
The parsing stack for evaluation of preprocessor expressions is
currently set at 150 levels (see Appendix B of USERS MANUAL).  At a
minimum, it
would require 150 non-white tokens (not just characters, but
whole tokens) on a preprocessing #if line to cause a parser stack
overflow. Lines less than 150 tokens cannot cause an overflow,
but the absolute limit on parenthesis nesting depends upon the
number of additional operators, along with their precedence and
placement.  Any demonstration program that has expressions that
"tower to the left", such as:
"((((((...(((5+4)*3-3)+.../7)|6-8)+1", and has fewer than about
130 nested parentheses, should also be acceptable to JRCPP.
31 significant initial characters in an identifier name
All identifiers are considered significant in all their
characters, which may extend well beyond 31 characters. See
Appendix B of USERS MANUAL for actual restrictions on the
absolute length of identifiers.
511 External identifier names in one translation unit
JRCPP has no static limit on the number of distinct identifiers
of any type.
1024 Macro identifiers simultaneously defined in one translation unit
JRCPP has no static limit on the number of macros defined.
31 Parameters in one macro definition
JRCPP has no static limit on the number of parameters for a
function like macro.
31 Arguments in one macro invocation
JRCPP has no static limit on the number of arguments supplied to
a macro invocation.
509 Characters in a logical source line
JRCPP has no static limit on the number of characters in a source
line. There is a limit on the number of characters in a single
token, but there is no limit on the number of tokens on a line.
(see Appendix B of USERS MANUAL).
509 Characters in a string literal (after concatenation)
Since JRCPP does not currently perform string concatenation, this
limit does not generally apply. The limit on the length of a
single token applies to individual string literals (see Appendix
B of USERS MANUAL).
8 levels of nested #include
JRCPP has no static limit on the number of nested include files.
To support this (no limit) stance, JRCPP does require that at
least 2 file handles be made available to it, in addition to the
standard set of stdin, stdout, stderr. See your operating system
manual for details. Note that there is a limit on the depth of
nested file inclusion when the original source file is actually
stdin. This limit is based on the operating system restriction on
the number of files that may be open at one time. This odd
limitation may be removed in future versions, but typical DOS
configurations would only limit nested inclusion at about 16
levels.
3.1 LANGUAGE- LEXICAL ELEMENTS
This section defines exactly how to interpret a series of characters
as tokens. The one point of undefined behavior in this section
concerns the presence of unmatched single (') or double (") quotes
appearing on a logical line. JRCPP makes an effort to not abandon
compilation when it encounters errors, and its behavior in this area
is typical of such resolutions.
In the case of an unmatched single quote ('), JRCPP assumes that the
programmer forgot the quote, but assumes that only a single character
"character constant" was intended. Hence for the purposes of error
recovery, the single quote and at most one following c-char (which
includes single characters, and a select set of escape sequences, but
excludes newlines) is accepted as a character constant. This
construction of an erroneous token is performed despite the fact that
without the terminal quote, the spelling of the token is invalid.
In the case of an unmatched double quote ("), JRCPP also assumes that
the programmer forgot the quote. In the case of string literals, it
is assumed that most literals are fairly long.  For the purposes
of error recovery, JRCPP assumes that the original quote, along with
the longest possible sequence of s-chars (a class of characters that
includes single characters, and a select set of escape sequences, but
excludes newlines) formed the string literal.
Note that in both cases diagnostics are generated that will, by
default, prevent any preprocessed output from being generated. The
default settings of these diagnostics can however be overridden for
the purposes of generating some output.
3.1.2 LANGUAGE- LEXICAL ELEMENTS- Identifiers
JRCPP supports the standard definition of identifiers, consisting of
a leading alphabetic character (or an underscore), and continuing
with an arbitrary sequence of alphanumeric characters and underscores.
As an extension, JRCPP also supports the presence of the character
'$' at any position (including first character) of an identifier, but
it flags such usage as an error. Here again JRCPP can be seen to
comply with the ANSI requirements for diagnosing
nonportable/nonstandard constructs, while still allowing the user the
opportunity to ignore the error, and facilitate a porting operation
(note that the default diagnostic level of such an error is
sufficient to preclude output, but this level may be modified via the
#pragma adjust_diagnostic ... directive). This extension does not in
any way conflict with the ANSI standard, as a '$' character, outside
of a string literal or character constant token, is usually illegal
anyway. Hence incorporating it into an identifier does not preclude
any valid constructs.
In certain obscure cases, an ANSI conformant program might have a '$'
character provided outside of a string literal, or character
constant. This placement is only potentially legal if the '$' is
formed into part of a valid token by the end of the preprocessing
phases. If this obscure case is actually significant to a user,
modification of diagnostic levels can permit this construct. If I am
pressed by registered users, I may modify the performance of the
preprocessor to more naturally support such obscure ANSI C conformant
cases.
This section of the Standard also discusses the significance of
characters in an identifier name. Specifically, it requires that all
of the first 31 characters in a macro name be considered when
comparing names and invocations. In order to support the many
existing implementations, the standard leaves as "undefined behavior"
the treatment of identifiers that differ ONLY beyond the 31st character.
JRCPP resolves this simply by treating all characters in any
identifier name as significant. This may identify as errors some
typos that other compilers overlook, but this only tends to make the
code more robust in terms of portability.
3.1.3.3 LANGUAGE- LEXICAL ELEMENTS- Character constants
In the discussion of character constants by the ANSI Standard, it is
mentioned that when undefined escape sequences are encountered in a
character constant, the results are undefined.
Note that the defined escape sequences for use within character
constants include:
'\\' (backslash),
'\'' (single quote),
'\"' (double quote),
'\?' (question mark),
'\a' (alarm or bell),
'\b' (backspace),
'\f' (form feed),
'\n' (newline),
'\r' (carriage return),
'\t' (tab),
'\v' (vertical tab),
octal escape sequences with 1-3 octal digits, and
hexadecimal escape sequences with arbitrarily many hex digits.
Examples of the latter two types are: '\27', and '\xab10cd'.
When JRCPP finds an invalid escape sequence within a character
constant (and there is a trailing quote found later on that line), a
diagnostic is produced, but the character sequence is accepted as a
character constant. The severity level of the diagnostic is
sufficient to prevent the preprocessor from producing output, but the
level may be varied by the user if acceptance of such sequences is
considered reasonable for the user's target compiler.
3.1.4 LANGUAGE- LEXICAL ELEMENTS- String literals
The undefined behavior in string literals is also centered on the
presence of illegal escape sequences within the literal. In an
analogous fashion to the handling of character constants, the
presence of illegal escape sequences generates a diagnostic, but
(error recovery) accepts the sequence.  The default severity of the
diagnostic is high enough that no output will be produced by the
preprocessor unless the level is adjusted downward.
3.1.7 LANGUAGE- LEXICAL ELEMENTS- Header names
This section of the Standard discusses the lexical form of file names
that are used in #include directives. The undefined behavior in this
area involves the presence of the characters ', \, ", or /* within
the <....> form of an include directive, and the presence of ', \, or
/* within the "...." form of the directive. Since the original
platform for JRCPP was DOS/OS2, the defining of behavior of such
sequences is quite important (DOS and OS2 file systems use '\' as a
separator in path names, in the same way as UNIX systems use '/' as a
separator). In order to support the use of standard DOS/OS2 path
names, a header name is considered a special and distinct token.
JRCPP defines a "...." style header name to begin with a double
quote, and continue until a matching double quote is encountered,
without passing the newline. Note that escape sequences are NOT
honored during the scanning of this token, and hence backslash
characters represent themselves directly (and the final quote CANNOT
be escaped using a backslash). In addition, since this is a single
token, the presence of /* within it is of no consequence. The only
context in which a "...." style header name is permitted by JRCPP
(and hence scanned for), is as the first non-whitespace token,
following the keyword "include", on a #include directive line. Note
that comments are considered whitespace, and may precede the "...."
style header name. The following are examples of entirely legal
include directives:
#include /* comment */ "sys\header.h"
#include "weird/*b.h"
#include "any char is legal !@' wow \"
Note however that the operating system will more than likely be
unable to find such files!  The mapping from include directives into
actual file names involves replacing each occurrence of a '\' or '/'
with the appropriate path name separator, and requesting that file be
opened. Consider for example the following:
#define stringize(x) #x
#include stringize( \sys\header )
Since the macro will expand its argument to "\\sys\\header" (details
of stringization are defined in section 3.8.3.2), the file will be
searched for using four backslashes! This mapping into a file name
is independent of whether the file name was provided in the "...."
style, or it was a string literal generated by some preprocessing
replacement.
In an identical fashion, a <....> style header name is defined by
JRCPP to begin with a '<' character, and to not terminate until the
first '>' character is reached, without extending past the newline.
The context for scanning for this token is identical to that of the
"...." style header name. As with the "...." style header names,
there are NO special characters (i.e.: escape sequences) interpreted
during the scanning for such a token. The following are legal
#include directives, and demonstrate this:
#include <system.h\header.h/*this_too>
#include /*comment*/ /*comment*/ <any char ' or " or even \>
#/*comment*/ include /*comment
comment continues*/ < spaces even count >
The last example also demonstrates that comments are reduced to a
single space, and hence do not disrupt the context of the scan as
defined.
Note that all characters between the delimiters < and >, (or between
the double quotes in the "...." style) are interpreted as being part
of the file name.
In the interest of portability, it is suggested that the user refrain
from using the standard '\' path delimiter in a DOS/OS2 environment,
and instead make use of the equivalent character '/'.
3.1.8 LANGUAGE- LEXICAL ELEMENTS- Preprocessing numbers
The lexical token "preprocessing number" appears to have been placed
into the standard to allow for arbitrary substrings of valid numbers
to be manipulated conveniently. The need for such a token is perhaps
motivated by the requirement (given elsewhere in the standard) that
if the result of some preprocessing operation (such as token pasting
via ##) is not a valid preprocessing token, then the resulting
behavior is undefined. With that requirement "on the books", it then
follows that substrings of numbers should be considered valid. For
example, the substring `3.14' could be pasted onto `0e4', to yield
the result `3.140e4'. In order to make life as easy as possible for
the implementers, the standard is VERY broad in its allowance of what
is a valid preprocessing number. For example, the sequence
`1.2aZ4E-_6.7.2_3' is a valid preprocessing number. As per the
standard, this token is supported, in the full generality that it is
specified.
The user should also be warned of the fact that a preprocessing
number is a SINGLE TOKEN, and hence is not scanned internally for the
presence of macro names. For example, when the above example
`1.2aZ4E-_6.7.2_3' is present in a file, the preprocessor will NOT
consider the macro aZ4E for expansion, even if it is defined! The
point being made here is that when numbers are placed adjacent to
letters in the source file, they will typically be blended together
into a single token, and the letters will not be eligible for macro
substitution. Similarly, even though the `.' operator may be
overloaded in C++, if it is placed to the right of and adjacent to
ANY number sequence, it will be absorbed as part of that token!
3.8 LANGUAGE- PREPROCESSING DIRECTIVES
The descriptions given in this section cover all aspects of
preprocessor directives. I will in general paraphrase some of the
significant areas that I consider non-intuitive. An interested
reader should certainly consider examining the actual standard for
any additional details.
One notable item in the overview section is that tokens within
directives are generally NOT subject to macro expansion, unless
otherwise noted. Hence logical lines of text are categorized as
directive or non-directive lines BEFORE macro expansion of such lines
takes place.  In addition, as pointed out in a later section, if macro
expansion produces something that "resembles" a directive line, it is
NOT processed as a preprocessing directive. The following is a
summary of the actions of various directives with regards to
expansion of the tokens that follow the "# directive_name":
Directives for which tokens, on the line with them, are expanded:
#if
#elif
#include
#line
Directives for which tokens, on the line with them, are not expanded:
#ifdef
#ifndef
#define
#undef
#error
#pragma
Directives which cannot legally have other tokens on the line with
them:
#else
#endif
#
Note that the #if and #elif have some additional translation that is
performed on their tokens both BEFORE and AFTER macro expansion. The
#include and #line directives are expanded only when a standard form
of arguments is not present. In the case of a #include directive
that does require expansion of the tokens, the post-expansion tokens
are processed (concatenated) after expansion and rescanned for a
standard format. The fact that additional tokens are not allowed
following the null directive (the lone #), is significant in that any
other lines that begin with # are strictly illegal.
3.8.1 LANGUAGE- PREPROCESSING DIRECTIVES- Conditional inclusion
Conditional inclusion refers to the use of #if directives (along with
its various forms and grouping directives) to cause a section of code
to be optionally included or excluded from the preprocessed result.
Fundamentally, there are three ways to start an if directive group
(#if, #ifndef, #ifdef), two ways to continue a group (#elif, #else),
and one directive to mark the end of the group (#endif). Since such
if groups can nest (i.e., contain inner groups), we will start with
the description of an outermost conditional group, and then discuss
the ramifications on inner groups. We will also defer discussion of
#ifdef and #ifndef, as their definition follows directly from the
definition of #if.
Subsection A) Evaluation Of #if Expressions
The first point to address is how the tokens on a line with a #if are
evaluated, and what their resulting "value" signifies. To make the
discussion clearer, we will assume the following example
#define hi 5
#if (1u == defined(hello)) || (3 < hi + low + int)
...
The process of evaluating the tokens on a line with a directive
consists of six phases:
1) remove all occurrences of "defined identifier" and "defined (
identifier )", and replace them with either 0 or 1 (1 iff
the identifier is currently defined as a macro)
2) macro expand the line that resulted from phase 1 (using
standard rules described in section 3.8.3)
3) replace all identifiers and keywords in the result of phase 2
with the number 0 (this result is a list of constants and
operators)
4) convert all constants of type "int" to identical constants of
type "long", and constants of type "unsigned int" to
"unsigned long".
5) evaluate the expression produced by phase 4 according to
standard C expressions methods, but always use "long"
types for subexpressions that evaluate to an "int", and
"unsigned long" for expressions that evaluate to an
"unsigned int" (the final result is an integral constant
of type "long" or "unsigned long")
6) if the final result of phase 5 is equal to 0, then the
expression was false, otherwise it is true.
The process can be demonstrated on the example given, with the
following evaluations. After phase 1 removal of "defined":
#define hi 5
#if (1u == 0) || (3 < hi + low + int)
...
After phase 2 macro expansion
#define hi 5
#if (1u == 0) || (3 < 5 + low + int)
...
After phase 3 replacement of identifiers and keywords with 0:
#define hi 5
#if (1u == 0) || (3 < 5 + 0 + 0)
...
After phase 4 conversion of "int"s to "long"s:
#if (1uL == 0L) || (3L < 5L + 0L + 0L)
...
The phase 5 evaluation might proceed something like:
(1uL == 0L) || (3L < 5L)
(    0L   ) || (  1L  )
1L
Finally, in phase 6, the above constant can be seen to be non-zero,
and hence the result of the evaluation is true.
The above rules work for the most part, as expected by "almost"
everyone, but the following details and anomalies are worth noting.
Note that in phase 1 an attempt is made to remove all occurrences of
the operator "defined". If this operator is not applied to an
identifier, the Standard indicates that the results are undefined.
JRCPP considers this scenario to be a syntax error, and aborts
evaluation of the expression. As a means of error recovery, JRCPP
assumes an evaluated result of FALSE, and a diagnostic is generated.
A second point of ANSI C undefined behavior takes place when the
result of macro expansion produces the operator "defined". For
simplicity and portability of code, JRCPP disregards the presence of
such an operator as the result of macro expansion, and follows
exactly the multiphase algorithm supplied above. Hence occurrences
of the keyword "defined" that are produced by macro expansion are
replaced in phase 3 with a value of 0.
During the conversions of phase 4, some ANSI C undefined behavior
exists with regard to evaluating character constants. The two points
here that must be resolved are how to evaluate multicharacter
character constants (which contain more than one item, which is
either an escape sequence or a simple character), and whether
character constants may assume negative values. As mentioned in
earlier sections, multicharacter character constants represent a
major area of non-portability, and hence they are not effectively
supported in #if expression evaluation. Specifically, if a
multicharacter character constant (such as 'zq') appears in an
expression, it is truncated to a single character constant, keeping
only the leftmost character (or escape sequence) and of course a
diagnostic is generated. Character constants under JRCPP evaluate as
"signed int", which in a DOS 8088-80x86 environment is taken to be a
16 bit signed integer. Single character character constants always
evaluate as positive numbers. Octal character constants are all
considered positive (i.e., '\000' through '\777'), but hexadecimal
character constants may evaluate to a negative number. Specifically,
if the high order bit of a hexadecimal character constant (when
viewed as a "signed int") is set (i.e., '\x8000' through '\xffff' on
a 16 bit signed int architecture), then the number is negative, using
a two's complement representation. Additionally, if a hexadecimal
escape sequence exceeds the representational precision or range of a
character constant (e.g., "signed int" under JRCPP, which corresponds
to 16 bits under a DOS environment), then the high order bits are
discarded.
There are several subtleties involving preprocessor "#if" expression
evaluation. The first item to observe is that the expression must be
formed using only integral constant subexpressions (i.e., no floating
point; no pointers); casts may not be used; and the 'sizeof' operator
is not evaluated (in fact, 'sizeof' is replaced in phase 3 by the
value 0). As per allowance by the standard, there is no guarantee
that character constants will be evaluated identically in the
preprocessor as they are in the compiler (since JRCPP is external and
unknown to your compiler, this is all the more important to be aware
of).
One slightly quirky aspect of the evaluation of #if centers around
the consistent use of "long" types to replace "int" types. The
following demonstrates this "quirk", and is commonly a thorn (bug?)
in the side of many "would be ANSI compatible" preprocessors:
#if 1 > (0 ? 1u : -1)
The tricky aspect of evaluation of this example involves the value of
the ternary "?:" subexpression AFTER the transition to "long" types
is made.  The subexpression looks like "(0L ? 1uL : -1L)".  Note that
the type associated with this ternary must be the "larger" of the
types "unsigned long" (from 1uL) and "signed long" (from -1L). Hence,
according to ANSI, the result of the ternary expression must be the
"unsigned long" representation of "-1", which is actually the largest
possible "unsigned long". So we can see that the above expression
ends up evaluating to FALSE! The moral of the story for programmers
is to exercise care when working with negative numbers in the
preprocessor #if statements.
Subsection B: Other conditional inclusion directives
As mentioned in the standard the lines:
#ifdef any_identifier
and
#ifndef any_identifier
are equivalent, respectively, to:
#if defined any_identifier
and
#if ! defined any_identifier
Each basic conditional inclusion section of source code consists of a
"if group", followed by any number of "elif groups", followed
optionally by an "else group", and terminated by a line with a #endif
directive. An "if group" consists of a #if directive (or
equivalent), followed optionally by lines of code. An "elif group"
consists of a #elif directive (which has an expression to evaluate),
followed optionally by lines of code. An "else group", consists of a
#else directive, followed optionally by lines of code. Note that for
each #if directive, there MUST be a matching #endif directive that
follows it, and that these conditional inclusion sections do nest
within a single "group" of code (i.e., within a single "if group", or
within a single "elif group", or within a single "else group").
The semantics (meaning) of these directives is most simply given by
an example:
#if expression_1
block 1
#elif expression_2
block 2
#elif expression_3
block 3
#else
block 4
#endif
Fundamentally, only one of the blocks 1,2,3, and 4 can EVER be passed
to the output of the preprocessor (if we were missing the "else
group", then it is possible that none of the blocks would be
processed). If expression_1 evaluates to TRUE, then ONLY block 1 is
processed, and block 2, 3, and 4 are discarded, and expression 2 and
3 need not even be evaluated. On the other hand, if expression 1 is
FALSE, then block 1 is discarded, and expression 2 is evaluated as
though it were the start of a conditional inclusion section. Hence
the first #if or #elif directive that evaluates to true causes its
associated section of code to be included, and all other sections in
the other groups to be discarded.  If none of these expressions
evaluate to TRUE, then the code in the "else group", if it exists and
has code, is processed.
One very common use of conditional inclusion is to effectively
comment out a large section of code, that more than likely has
/*...*/ based comments. Since standard /*...*/ based comments do not
nest, large blocks of code CANNOT be safely removed using standard
/*...*/ delimiters. In contrast, since conditional inclusion
directives do nest, placing "#if 0" at the start of the section, and
"#endif" at the end of the section effectively comments out (safely)
an arbitrary block of code. Also, from a stylistic point of view,
the fact that these directives DO NOT have to appear directly
adjacent to the left margin (as was the case in some early C
preprocessors) allows such commenting to be done in a very
aesthetically pleasing format.
3.8.2 LANGUAGE- PREPROCESSING DIRECTIVES- Source file inclusion
Source file inclusion is performed using the "#include" directive.
For our discussion, we will refer to directives like:
#include <stdio.h>
as <...> style includes, and
#include "header.h"
as "..." style includes.
Subsection A) Interpretation of expanded tokens following #include
One notable change made to many compilers to support the ANSI standard is
the acceptance of include directives of the form:
#include token1 token2 ..... tokeni
wherein the token sequence cannot be interpreted as either a <...> or
"..." style include. We will refer to this include directive as a
"macro derived" directive, in honor of the fact that the tokens must
be macro expanded before the include directive can be acted upon.
Note that to not be a "..." style include, the first character of
token1 must be other than `"', or at least there must be no other `"'
character later on the include line. Similarly, to avoid the <...>
style, the first character of token1 must be other than `<', or there
must be no terminating `>' later on the line. Recall also from the
discussion of tokenization, that backslashes cannot escape out a
closing quote, and that the sequence /* is NOT honored within the
file name in either the "..." or <...> style include (this is a
special context).
The first element of undefined behavior for #include directives
concerns the method by which the results of macro expanding the
tokens in a macro derived directive are interpreted. The most
obvious (and simple) case is when the expansion of the entire token
sequence is a simple string literal. The more complex case involves
an expansion that is still a token sequence, such that the first
token begins with a `<', and the last token ends with a `>'. The
following are examples of such include directives:
#define myfile "header.h"
#define yourfile /* this macro has a null definition */
#define less_than <
#define greater_than >
#define big_system_header os2.h
#include yourfile myfile
#include less_than big_system_header greater_than
Although the above macro defined directives expand to token
sequences that "look like" more common include directives, there are
some special differences. Specifically, using the above macro
definitions, the resulting token sequences look like:
#include "header.h"
#include < os2.h >
Note that the above casual macro definitions left leading and
trailing spaces in the file name for the latter example. Although
this could have been avoided by using function like macros, which can
be placed sequentially with no white space between them, the presence
of whitespace in the result is believed to be common among users of
this feature. With the above examples understood, the JRCPP
resolution is simply to concatenate all the tokens, ignoring
inter-token whitespace, and THEN reinterpret the resulting character
sequence in the context appropriate to <...> or "..." style include
directives (i.e., no special meaning for backslashes, etc.).
Subsection B) Search algorithm for included files
A second point of undefined behavior in the ANSI C standard involves
where exactly included files are searched for. There are actually
several very distinct conventions for this search mechanism, even
within file systems which are hierarchically based (such as UNIX).
JRCPP has adopted a default strategy that is consistent with
Microsoft C 5.1, and several other more recent compilers. There is
also support (selected via #pragma include_search) for algorithms
compatible with Microsoft C 4.0, and support for approaches
compatible with older UNIX system cpp implementations.
If an application is placing all the source and header files in a
single directory, which is also the current working directory during
the compilation, and the system include directory is a single
absolute directory, then almost any search
algorithm will suffice.  On the other hand, if header files in one
directory include header files in a second directory, which then
include header files in yet a third directory, while the user has a
current working directory yet elsewhere, and some of the system
include or application include directories use relative path names,
the meaning of:
#include "header.h"
is far from obvious. Historically, projects developed under UNIX
placed all source and header files in a single directory, and the
discussion of search algorithms was irrelevant. With the growing
complexity of code, and the presence of a multitude of programmers on
a project, the need has arisen to hierarchically segregate sections
of a large project into file hierarchies.  In order to support function
calls between these sections, header file inclusion well outside the
current directory has become commonplace. Various vendors have
adopted algorithms that support the "trivial" case described
originally, but there has often been disagreement about how to
process the more complex cases.
The philosophy that drove the development of the slightly complex
include strategy of JRCPP was motivated by the following requirements:
1) It should be possible to write include files, that include
other files, without concern about what source file was being
compiled, where that file was, and what directory the user was in
when the compilation was requested. This allows complex systems
of header files to be written INDEPENDENT of the application that
uses them.
2) To allow for even more complex sets of include files, if file
A included file B, and file A was able to include file C, then
file B should be able to include file C with equal ease. In some
sense, this concept is similar to inheritance in object oriented
programming.
3) It should be possible for a user to change current
directories, and in doing so change what files are accessible via
include searches (assuming the programmer has orchestrated the
placement of header files to support this strategy). This allows
different versions of an application to be compiled easily into
different directory areas.
4) It should be possible to define the location of include files
via a relative path. This facility would allow the construction
of source file hierarchies that are easily ported to distinct
absolute positions in file system hierarchies.
We will start with some definitions of terms. We define the "system
include path" to be a list of directories where system header files
are provided. Typically all the ANSI C specified library header
files are in directories listed in the system include path, and files
in such directories can be expected to never change (hence it is rare
to provide "make" dependencies on such files). The "application
include path" is a list of directories that contain header files
significant to a specific application or project. Typically, files
in the application include path tend to change often during program
development. Next there is the "current directory", with its
standard meaning in a UNIX or DOS like file hierarchy. Three
additional directory lists need to be defined, the "original source
directory", the "current source directory", and the "ancestral source
directories". The "original source directory" is the directory in
which the source file specified on the compilation or preprocessor
command line was found. The "current source directory" is the
directory which contains the include directive file that is currently
being parsed, and has the include directive that we are trying to
process. The "ancestral source directories" are a list of
directories that begin with the "current source directory", and
proceed back through each level of nested inclusion (specifying the
directory in which that source file was found), all the way to the
"original source directory".
The algorithm for searching for header.h consists of the bottom level
ancestral search, and a top level search. The algorithm terminates
the first time a file can be accessed.  In pseudo-code, the algorithm
would look like:
ancestral_search(file_name)
{
if (file_name has relative prefix)
{
for (every prefix p, in the ancestral include list)
do try to open (p/file_name)
}
/* but as a last resort ... */
try to open (file_name) /* with no prefix */
}
Driving the code that we just listed is the higher level search, that
is directed by the use of user specified include paths:
standard_search(file_name)
{
try ancestral_search(file_name) /* with no prefix */
if (file_name has relative path)
{
for every prefix p in the application include path
do ancestral_search(p/file_name)
for every prefix p in the system include path
do ancestral_search(p/file_name)
}
}
The above pseudo-code corresponds to the algorithm for finding "..."
style includes. When <...> style includes are searched for, the file
is not searched for directly (with no prefix) unless it has an absolute path
specifier, and the "application include path" is never made use of.
The following pseudo-code describes the search for <...> style
include files:
system_search(file_name)
{
if (file_name has relative path)
{
for every prefix p in the system include path
do ancestral_search(p/file_name)
}
else /* file has an absolute path specification */
try to open (file_name)
}
It should also be pointed out that when the ancestral_search function
is used, the path prefixes for the ancestral include files are
tried sequentially, starting with the current source file, and
proceeding back to the original source file.
Next, we will present an example of the above definitions, and the
resulting search path. We will assume:
The current working directory is /current_path. The source file
original_path/original.c included the file header1path/header1.h
(hence original_path is the "original source directory"). File
header1.h included header2path/header2.h (hence the "ancestral
directory list" is: header2path, header1path, original_path). The
system include path contains the directory sysdir1, and sysdir2. The
application include path contains the directories appdir1, and
appdir2. For our example of searching, we will examine the search
pattern when header file header2.h contains the include directive:
#include "header3.h"
If the file specified as header3.h does have an absolute path prefix,
then only that absolute path/filename is used in searching for the
file (an example with an absolute path is "/usr/lex/lexdefs.h"). If
header3.h does not have an absolute path prefix, then the following
search pattern is followed. With the caveat to be mentioned in a
moment, the search for a file consists of trying to open the
following:
header3.h
appdir1/header3.h
appdir2/header3.h
sysdir1/header3.h
sysdir2/header3.h
There is, as mentioned, one caveat to the above search sequence. The
caveat is that whenever a file in the above list has a relative path
prefix, then the prefixes provided by the ancestral directories are
tried sequentially, followed by the unadorned filename, before moving
on down the list. Since we assumed that header3.h did not have an
absolute path prefix, the following files would actually be subject
to fopen() calls:
header2path/header3.h
header1path/header3.h
original_path/header3.h
current_path/header3.h (same as simply `header3.h')
The above list corresponds to a search using path prefixes taken from
the ancestral include list, and then using the current directory.
Note that if one of the application include path directories had only
a relative path prefix, then it too would make use of the ancestral
include directories for prefixes.  For example, if appdir2 was a
relative path, then header3.h would (if it were not found earlier in
our list) be searched for in:
header2path/appdir2/header3.h
header1path/appdir2/header3.h
original_path/appdir2/header3.h
current_path/appdir2/header3.h    (same as `appdir2/header3.h')
Subsection C: Include strategy compatibility
The include strategy described above is compatible with Microsoft C
5.1, and several other major compilers. Since this strategy is
rather general, it tends to provide coverage of search areas
sufficient for most any project. If however, absolute compatibility
is desired with other compilers, the search strategy may be modified
by use of the appropriate pragma options. The adjustments permitted
all represent restrictions to search in only a subset of the default
areas.
For example, the SUN UNIX cpp searches for "..." style includes first
in the parent directory, and then in the application's include path.
To achieve complete compatibility with SUN, it is necessary to use
the pragma:
#pragma include_search youngest_ancestor_only
As a second example, Lattice C 3.0 searched for "..." include files
in the current working directory, and then in the application's
include path. To be compatible with such a strategy, the following
pragmas should be entered:
#pragma include_search current_directory
Note that the options for the different ancestral search modes
include: `youngest_ancestor_only', `eldest_ancestor_only', and
`all_ancestors'.  These options correspond, respectively, to using
the directory of the current_path, the original_path, and the full search
algorithm. The default provided by JRCPP is `all_ancestors', but
omitting an ancestral selection (as in the last example) implies the
use of no ancestral directories.
Note that if searching in the current directory, and searching in the
directories associated with all ancestral modes, are disabled (via
omission in such a pragma), then it is impossible to include any
files without using an absolute prefix supplied in a system include
path, or an application include path.
3.8.3 LANGUAGE- PREPROCESSING DIRECTIVES- Macro replacement
This section has only two mentions of undefined behavior, but it has
several topics worthy of commentary. We will start with the
comments, and then proceed to resolve the cases of undefined behavior.
It should be mentioned here that JRCPP has a rather novel pragma that
explains in detail the steps taken during a macro expansion. This
information may be used as a tutorial assistant (for the novice), as
a debugging assistant (for the professional programmer), or as a
method of proof of a bug (for a registered user to file a bug
report). The explanation given includes a step by step breakdown of
the macro expansion process, along with reference listings of current
macro definitions as they are applied, and reasons for ignoring
current definitions (such as the fact that an identifier is already a
part of a recursive expansion of itself). For more details see
"#pragma describe_macro_expansions", and note that this feature can
be turned on and off to localize its impact. It is expected that
this pragma can easily augment this Language Reference Manual by
providing annotated examples.
ANSI C differs from many prior C preprocessor implementations in that
an identifier may not be redefined "differently" from its current
macro definition, without first undefining the existing definition.
Some non-ANSI implementations allowed redefinitions to mask existing
definitions, and future #undef directives to "pop" into visibility
older definitions. Other implementations simply allowed new macro
definitions to overwrite existing definitions.  JRCPP only supports
the ANSI C approach, in agreement with the ANSI C Rationale
that other approaches are error prone, and generally represent actual errors.
Note that redefinitions that are identical to the original (such as
occurs when a header file is included twice) are fully legal.  The
Standard is quite specific in defining what a benign redefinition is,
but it can be summarized by saying that the only difference allowed
in a redefinition is the length of white space interludes. (There is
actually a subtle error in the Standard. The restriction that the
"order of the parameters" be unchanged in a redefinition, was not
listed. JRCPP assumes this was a typo, and requires that the order
of parameters be unchanged in a macro redefinition). If I receive
sufficient input from users that my decision here is a hindrance, I
will support the other standards involving non-benign redefinitions
(with appropriate diagnostics), controlled via pragma directives.
A second item worthy of noting is that whitespace at the start and
end of a replacement list is not considered part of the replacement
list. Hence it is impossible to define a macro that expands to
whitespace.
The two aspects of undefined behavior involve odd occurrences during
the gathering of arguments for a function like macro invocation. The
first problem that needs to be addressed is what the behavior is when
a token sequence that could be interpreted as a directive is
encountered while gathering a list of arguments. The second point of
undefined behavior is present when an argument consists of "no
preprocessing tokens".
Fundamentally, the problem with allowing preprocessor directives to
occur during the gathering of a parameter list for a macro, is that a
#define or #undef directive might be encountered. Consider the
following code:
#define f(x) x+1
f(2) /* that was easy */
f( /* look for the argument ... */
#undef f
2 ) /* found the arg, but should we use it? */
JRCPP resolves such ambiguities very easily by not generally allowing
directives to be present within macro argument lists. Specifically,
JRCPP would generate a diagnostic indicating that there was no
closing parenthesis for the macro invocation (JRCPP stopped looking
when it reached the directive). Although this is a VERY reasonable
result when #define and #undef directives are reached, it is not so
obviously necessary when certain other directives are reached. To
assist the programmer that is using existing code such as:
#define f(x) x+1
f(
#if foo
2
#else
3
#endif
)
a pragma (#pragma delayed_expansion [on|off]) is provided that causes
such "ANSI C undefined behavior" to be acceptable. The restriction
on this pragma based extension is that no directives are allowed
within the argument list that can even possibly cause a change in the
macro database (i.e., the relevance of the change is not considered;
the significance of the change, such as a benign redefinition, is not
considered). Simply put, if a #define, #undef, or #pragma is
encountered during a scan for arguments to a macro, the scan is
terminated with a "missing close paren" diagnostic, even if the
delayed_expansion pragma is active.
As mentioned, the second area of ANSI C undefined behavior involves
the "presence of an argument consisting of no tokens... before
argument substitution...". This phrase unfortunately contradicts the
definition of argument: "... sequence of preprocessing tokens ... in
a macro invocation". Note also that "substitution" is the time at
which expansion of the argument is considered, and hence the odd
phrase cannot possibly refer to the situation where an argument is
expanded, but the result is nil (or whitespace). Rather than harp on
this inconsistency, we will discuss what perhaps is a related (or
even intended) problem: What is the interpretation of a missing
argument (or at least white space where an argument should be)?
It is clearly stated that the number of arguments in a macro
invocation must agree exactly with the number of arguments in the
corresponding macro definition, and hence JRCPP generates a
diagnostic if this is not the case. In addition, JRCPP employs an
error recovery strategy that consists of substituting whitespace
(which is clearly not a valid ANSI C argument) for any missing
arguments. This strategy is intended to be compatible with some
prior non-ANSI implementations. Note that this action can cause a
secondary error if the macro attempts to use this argument in
certain ways. The following example demonstrates these secondary
errors:
#define paste_left(x) something ## x
#define paste_right(x) x ## something
#define stringize(x) # x
paste_left(/*white*/)
paste_right(/*white*/)
stringize(/*white*/)
Fortunately, most code that exploits this non-ANSI behavior (a
missing argument treated as whitespace) does not use the paste
operator (##) or the stringize operator (#), so this secondary
error tends not to occur. Applications that are porting
non-ANSI code through JRCPP may then choose to lower the severity
level of the diagnostic that reports the whitespace argument, and
accept the error recovery procedure as reasonable. (User feedback on
my error recovery scheme may improve compatibility with other
implementations).
3.8.3.2 LANGUAGE- PREPROCESSING DIRECTIVES- The # operator
The # operator provides the stringizing functionality to ANSI C, that
was often provided via expansion of parameters within string literals
in older (and NON-ANSI) implementations. One key point that should
be stressed is that when this functionality is used, then the
argument is NOT macro expanded before being stringized (i.e., placed
into quotes). For example:
#define stringize(x) #x
#define A 2
stringize(2) /* becomes "2" */
stringize(A) /* becomes "A" */
If a user wants the argument to be expanded and THEN stringized, the
following construction should be used:
#define stringize(x) # x
#define expand_then_stringize(x) stringize(x)
#define A 2
expand_then_stringize(2) /* becomes "2" */
expand_then_stringize(A) /* becomes stringize(2), which
becomes "2"*/
The Standard indicates that the order of evaluation of the operators
# and ## is unspecified. Since the token to the right of the
stringize operator (#) must be a parameter, it would appear that the
following are the only two cases to consider:
#define F(x) word ## # x
#define G(x) # x ## other
JRCPP follows the C tradition of providing higher precedence for
unary operators than for binary operators. JRCPP parses the
definition of F to attempt to paste a stringized version of parameter
x onto the right side of the identifier `word'. This decision is a
bit immaterial, as the result of pasting any valid preprocessing
token to the left side of a string literal (the stringized version of
x) is almost always an invalid preprocessing token. Similarly, the
definition of G provides a request to paste the word `other' onto the
right side of the stringized version of parameter x. Defining the
precedence in any other way appears to be of equally little use.
3.8.3.3 LANGUAGE- PREPROCESSING DIRECTIVES- The ## operator
The pasting operator ## supplies the functionality in ANSI C that was
supplied in various compilers in the past, by means of various hacks.
Most hacks were based on methods of getting adjacent identifiers to
"flow together". The two methods that I am aware of are:
#define f() start
f()end /* for SOME non-ANSI cpp, becomes: `startend' */
and
#define g start
g/*comments go away*/end /* some non-ANSI cpp: `startend' */
Neither of these constructs is supported under ANSI C, and in both
cases JRCPP by default produces the two tokens `start' and `end',
separated by a space. The first of the two approaches is supported
via a pragma under JRCPP (see #pragma space_between_tokens).
It should be emphasized that, just as with the stringize operator,
arguments are NOT expanded prior to insertion in the replacement
list at points where they are operands of a ## operator. For
example:
#define A 2
#define append(x) x ## right left ## x x
append(A) /* becomes: Aright leftA 2 */
If the user desires an expansion prior to pasting, the construct
described earlier with regard to stringization must be used.
3.8.3.5 LANGUAGE- PREPROCESSING DIRECTIVES- Scope of macro definitions
This section of the standard has some very nice examples of the
process of macro expansion. These include the use of the paste
operator (##), the stringize operator (#), and the prevention of
infinite recursion of macros. If the user tries these torture tests
on JRCPP, rather than reveal an error in JRCPP, they will reveal a
typo in the ANSI C Standard. The specific error in the standard
involves:
#define str(x) #x
str(: @\n)
which the standard incorrectly expands to ": \n", but should have
expanded to ": @\\n". The goal of the stringize operation is to
produce text that, when printed, reads exactly as the argument that
was supplied. The Standard is clear on this point, and other items
in the expansion demonstrate this. Unfortunately, I am sure that
this typo
in the Standard will also be the source of many bug reports.
The user that is trying these tests should also run them with the
pragma space_between_tokens set to off, if they would like the format
to be closer to that of the listing in the Standard. In either case
the results should be correct. The user may also note a slight
discrepancy in the format of the output, due to the fact that JRCPP
maintains line position information much more accurately than most
other preprocessors. In this regard, consider the example:
#define swap(x,y) y x
swap (
+i
+j,
+k
+l)
Most preprocessors produce the output:
#line 2
+k +l +i +j
Whereas JRCPP produces:
#line 5
+k
+l
#line 3
+i
+j
The big advantage of the method provided by JRCPP is exposed when
compiler diagnostics wish to refer to a token in such a stream. When
JRCPP is used, a diagnostic for "syntax error on token `+'" can be
very specific about the line number with the offending character.
With other preprocessors, the user is just told the error was on line
2. (As a historical note, many pre-ANSI preprocessors required that
the macro name and all the arguments be placed on a single line.
Many users have, as a result, built large logical lines when a macro
was being invoked. ANSI C established a standard whereby this was no
longer necessary, but many compiler manufacturers are slow to service
the users that have moved to this more readable notation.)
3.8.4 LANGUAGE- PREPROCESSING DIRECTIVES- Line control
The #line directive is fully supported by JRCPP. There are several
points to note about its performance. The first item worthy of note
is that the standard provides for macro expansion of the tokens on
the logical line with the #line directive. Unfortunately, it does
not provide for arithmetic reductions. The result of macro expansion
must be either a digit sequence (representing a line number), or a
digit sequence with a string literal (file name).
Note that the line directive requires a string literal, and has no
consideration of the sort of context that was provided for evaluation
of a file name in an include directive. This distinction means that
a backslash used in a file name specified with a #line directive
must be "escaped" using another backslash.
3.8.6 LANGUAGE- PREPROCESSING DIRECTIVES- Pragma directive
JRCPP makes extensive use of pragmas in order to direct customization
of the performance of the preprocessor. Users should refer to the
JRCPP Users Guide for a complete list of valid pragmas, along with
their meaning. As per the Standard, unrecognized pragmas are ignored
by JRCPP. In order to facilitate the use of pragma directives to
control the compiler, unrecognized pragmas are passed unchanged
(except for comment removal and whitespace compaction) through to the
post-preprocessed output file.
Note that because some pragmas do modify the macro database, pragma
directives are not permitted within macro invocation argument lists.
If there is some need to pass forward a pragma to the compiler
without having it acted upon by the preprocessor (for example, when
JRCPP would misunderstand it), then the following sort of approach
can be taken:
#define HIDE_PRAGMA
HIDE_PRAGMA # pragma any tokens
Because the result of expansion will NEVER be considered by JRCPP
to be a directive, the pragma presented in this cloak will not be
processed by JRCPP. Unfortunately, when a user is
forced to this extreme, the protection against macro expanding the
list of tokens for an unknown pragma is lost. JRCPP has endeavored
to use novel names that should not clash with the specifications of
many other implementations.
3.8.8 LANGUAGE- PREPROCESSING DIRECTIVES- Predefined macro names
All five of the ANSI C predefined macros are supported by JRCPP. The
macros are:
__LINE__ Current presumed line number in the source file. The value
of __LINE__ will be in the range permissible for a signed
long integer.
__FILE__ Current presumed file name, provided as a string literal.
Note that because __FILE__ is a string literal, any
occurrences of the backslash character in the actual source
name have been replaced by `\\' (an escaped backslash).
__DATE__ The date on which JRCPP began to preprocess the original
source file, expressed as a character string. The format
will always be "Mon dd yyyy", where Mon is the abbreviation
for the month, dd is a 1- or 2-digit representation of the day
of the month, and yyyy is the 4-digit calendar year.
__TIME__ The time at which JRCPP began to preprocess the original
source file, expressed as a character string. The format is
"hh:mm:ss", where hh is the number of hours past midnight
local time, mm is the number of minutes past the hour, and
ss is the number of seconds past the whole minute.
__STDC__ The integer constant 1. This constant is meant to indicate
that the compiler/preprocessor is a conforming ANSI C
implementation.
The Standard restricts the preprocessor from redefining any of these
macro names, as well as attempting to undefine any of them.
Similarly, the standard precludes the use of #define or #undef on the
identifier `defined' (which has special meaning in evaluating a
#if/elif directive).
JRCPP supports the full ANSI C Standard as indicated above, but
maintains customization features that allow it to be modified
slightly to be non-conforming.
One major point is that most of the time, the compiler that JRCPP is
preprocessing for is not ANSI conformant. With this situation in
mind, there must be some mechanism for undefining __STDC__. Special
pragmas have been provided in JRCPP to accomplish most of these tasks
(see #pragma undefine_macros).
As a special note, the pragma that switches to C++ mode (see #pragma
cplusplus_mode) has the following effect: The macro __STDC__ is
undefined, and the macro __cplusplus is defined. Moreover, this new
macro __cplusplus has the same reserved status (i.e.: cannot be
#define'd or #undef'ed) as __STDC__ has under default JRCPP. In
addition, one-line // style comments are also supported in C++
mode.