Text File | 1990-03-27 | 83KB | 1,729 lines

James Roskind C Porting Preprocessor (JRCPP) JRCPP LANGUAGE REFERENCE MANUAL (3/23/90) Copyright (C) 1990 James Roskind, All rights reserved. Permission is granted to copy and distribute this file as part any machine readable archive containing the entire, unmodified, JRCPP PUBLIC DISTRIBUTION PACKAGE (henceforth call the "Package"). The set of files that form the Package are described in the README file that is a part of the Package. Permission is granted to individual users of the Package to copy individual portions of the Package (i.e., component files) in any form (e.g.: printed, electronic, electro-optical, etc.) desired for the purpose of supporting users of the Package (i.e., providing online, or onshelf documentation access; executing the binary JRCPP code, etc.). Permission is not granted to distribute copies of individual portions of the Package, unless a machine readable version of the complete Package is also made available with such distribution. Abstracting with credit is permitted. There is no charge or royalty fee required for copies made in compliance with this notice. To otherwise copy elements of this package requires prior permission in writing from James Roskind. James Roskind 516 Latania Palm Drive Indialantic FL 32903 End of copyright notice What the above copyright means is that you are free to use and distribute (or even sell) the entire set of files in this Package, but you can't split them up, and distribute them as separate files. The notice also says that you cannot modify the copies that you distribute, and this ESPECIALLY includes NOT REMOVING the any part of the copyright notice in any file. JRCPP currently implements a C Preprocessor, but the users of this Package do NOT surrender any right of ownership or copyright to any source text that is processed by JRCPP, either before or after processing. Similarly, there are no royalty or fee requirements for using the post-preprocessed output of JRCPP. 
This Package is expected to be distributed by shareware and freeware channels (including BBS sites), but the fees paid for "distribution" costs are strictly exchanged between the distributor, and the recipient, and James Roskind makes no express or implied warranties about the quality or integrity of such indirectly acquired copies. Distributors and users may obtain the Package (the Public distribution form) directly from the author by following the ordering procedures in the REGISTRATION file.

DISCLAIMER: JAMES ROSKIND PROVIDES THIS FILE "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM AND DOCUMENTATION IS WITH YOU. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

UNIX is a registered trademark of AT&T Bell Laboratories.

____________________________________________________________________

James Roskind C Porting Preprocessor (JRCPP)
JRCPP LANGUAGE REFERENCE MANUAL

INTRODUCTION

This document, in the company of the "ANSI Programming Language C" Standard, is intended to act as a language reference manual. Most significantly, this document discusses the performance of JRCPP in official "ANSI undefined", "ANSI unspecified" and "ANSI implementation defined" domains of the C Language. In addition, it lists performance limitations of JRCPP, and directly relates these limitations to the standard's requirements for "Implementation limits". As an additional matter, this document identifies vaguenesses (and in the rare case, errors) in the ANSI C Standard, and describes the resolution adopted by JRCPP. Hence this document is also the "Rationale" for JRCPP, in much the same way that the ANSI C standard has an accompanying document "Rationale For ANSI Programming Language C".
This document will generally not discuss aspects of the standard that do not involve preprocessing activities performed on source files.

Note that this document was written based on the Draft Proposed ANSI C Standard, X3J11/88-158, dated December 8, 1988. After a drawn out appeals process, I believe this draft was accepted in January 1990 by the ANSI Standards Committee. I am not aware that any changes were made during that appeals process, and I apologize in advance for any errors I might have made in this regard, or in the description that follows. In all cases where this Language Reference Manual deviates from the ANSI C Standard, this document should be assumed to be in error, and the corresponding bug/misperformance in the JRCPP program (if any) should be reported.

The ANSI C Standard is a tremendous work, and I realize that my abridged commentary in many areas does not do justice to the meticulous selection of elaborate wording in the official Standard. For many, my description will be enough, but for language lawyers, there is no replacement for the official ANSI document.

Section numbers in this document have been chosen to match those of the ANSI C standard, and hence certain gaps are present. These gaps represent areas where either there is generally no impact on preprocessing activities, or no additional commentary seems necessary.

LISTING OF SECTIONS

    1.3      References
    1.6      Definition of Terms
    2.       ENVIRONMENT
    2.1.1.2  ENVIRONMENT- Translation phases
    2.1.1.3  ENVIRONMENT- Diagnostics
    2.2      ENVIRONMENTAL CONSIDERATION
    2.2.1    ENVIRONMENTAL CONSIDERATION- Character sets
    2.2.1.1  ENVIRONMENTAL CONSIDERATION- Trigraph Sequences
    2.2.1.3  ENVIRONMENTAL CONSIDERATION- Character sets- Multibyte characters
    2.2.4    ENVIRONMENTAL CONSIDERATION- Translation Limits
    3.1      LANGUAGE- LEXICAL ELEMENTS
    3.1.2    LANGUAGE- LEXICAL ELEMENTS- Identifiers
    3.1.3.3  LANGUAGE- LEXICAL ELEMENTS- Character constants
    3.1.4    LANGUAGE- LEXICAL ELEMENTS- String literals
    3.1.7    LANGUAGE- LEXICAL ELEMENTS- Header names
    3.1.8    LANGUAGE- LEXICAL ELEMENTS- Preprocessing numbers
    3.8      LANGUAGE- PREPROCESSING DIRECTIVES
    3.8.1    LANGUAGE- PREPROCESSING DIRECTIVES- Conditional inclusion
    3.8.2    LANGUAGE- PREPROCESSING DIRECTIVES- Source file inclusion
    3.8.3    LANGUAGE- PREPROCESSING DIRECTIVES- Macro replacement
    3.8.3.2  LANGUAGE- PREPROCESSING DIRECTIVES- The # operator
    3.8.3.3  LANGUAGE- PREPROCESSING DIRECTIVES- The ## operator
    3.8.3.5  LANGUAGE- PREPROCESSING DIRECTIVES- Scope of macro definitions
    3.8.4    LANGUAGE- PREPROCESSING DIRECTIVES- Line control
    3.8.6    LANGUAGE- PREPROCESSING DIRECTIVES- Pragma directive
    3.8.8    LANGUAGE- PREPROCESSING DIRECTIVES- Predefined macro names

1.3 References

In addition to the 6 references listed in the standard (the most significant of which is probably "The C Reference Manual", by Kernighan and Ritchie), an additional reference set should be considered. Since JRCPP is intended to support many dialects of C, as well as C++, references for C++ are:

    "The C++ Programming Language", by Bjarne Stroustrup, Addison-Wesley (1986), Copyright Bell Telephone Laboratories Inc.

    "The Annotated C++ Reference Manual" by Margaret A. Ellis and Bjarne Stroustrup, Addison-Wesley (to be published).

1.6 Definition of Terms

Among the 17 terms defined in this section (such as "bit", "byte", "argument", "parameter"...)
which are certainly crucial to a reference manual, there are also several terms which identify the focus of this manual. The definitions are for the phrases "Unspecified behavior", "Undefined behavior", and "Implementation defined behavior". The following are my interpretations of these definitions:

"Unspecified behavior": Although the source code is considered correct, the standard has no requirements on any implementation. An example of this is the order of evaluation of the paste (##) and stringize (#) operators. Notice that it is not even required that an implementation be CONSISTENT in its handling of this issue!

"Undefined behavior": The relevant source construct is not portable ANSI C. As a result, the implementation can accept or reject the construct, at any point in time from preprocessing and compilation, through bound execution. Fundamentally such behavior is used to clearly identify non-portable source constructs.

"Implementation defined behavior": The relevant source code is considered correct, and each implementation is responsible for defining the behavior of that construct. An example of this is the number of significant characters in identifier names (above and beyond what are required in a minimally ANSI C implementation).

The above definitions will be referred to regularly during the commentary on JRCPP, and its support for the ANSI C standard.

2. ENVIRONMENT

2.1.1.2 ENVIRONMENT- Translation phases

This section describes the actual phases of translation of C source code. The phases also serve to delineate the points between preprocessing, and compilation. The phases may be summarized as follows:

Phase 1) The characters in the source are translated into those of the "source character set". During this process, JRCPP translates 8 bit characters into 7 bit characters, by ignoring the high order bit, and by translating the source file characters 0 and 128 into simple spaces (ASCII 32).
This phase also includes identification of line delimiters, and the removal of trigraphs. On a DOS/OS2 platform, JRCPP identifies the two source file characters <carriage-return><line-feed> as the terminator for each line, which is henceforth referred to as <newline>. The ANSI Standard also requires that complete trigraph removal be performed in this phase, and JRCPP fully supports this. Note that all diagnostics issued by JRCPP are based upon line counts generated in this phase, and hence most editors can be used to move to the line identified in a diagnostic.

Phase 2) All occurrences of a backslash followed immediately by a <newline> are removed. This removal "splices" together the consecutive lines that were only separated by this "escaped out newline". This process may also be seen as combining several physical lines, as viewed by an editor, into a long logical line. This activity is most useful for programmers that wish to have many characters on a single source line, in order to, for example, make them part of a single preprocessing directive. The ANSI Standard also requires that every non-empty source file end in a newline that is not escaped by a backslash (JRCPP diagnoses these conditions). Notice that phase 1 is complete before phase 2 is started. Hence the removal of an escaped newline CANNOT create a trigraph that is eligible for translation.

Phase 3) This phase of translation is referred to as "tokenization". In this phase, sequences of characters are gathered together for processing as whole units (tokens). This phase also defines comments to be interpreted as equivalent to a single space character. The standard allows implementations to consider consecutive (non-newline) whitespace (space, tab, page feed, lone carriage return) as equivalent to single spaces. The ANSI Standard also specifies that a source file cannot end in either a partial (unterminated) comment, or in a partial preprocessing token.
(JRCPP diagnoses an unterminated comment at the end of a file). Note that since comments and tokens are removed at the same time (i.e.: via a single left to right scan for the largest possible lexical group), there is some contention between "otherwise overlapping" string literals, character constants, and comments. This contention is always resolved by accepting the largest possible token (or comment) before allowing a new token to begin. For example, the following is a pair of comments surrounding an identifier:

    /* ignore " inside comment*/ identifier /* still ignore " in comment*/

Hence we see that not only don't comments nest, but string literals do not form tokens within comments (and hence cannot "hide" the comment terminator). Similarly, the following is a pair of consecutive string literals:

    "comments begin /* outside" " and end with the */ sequence"

This example shows that comments are not scanned for within string literals (and hence cannot "hide" the terminal close quote). Finally, the following is the sum of two character constants:

    '"' + '"'

which demonstrates that character constants are likewise not scanned internally for any other extended sequences (such as comments or literals).

The standard does have the confusing phrase: "a source file shall not end in a partial preprocessing token", as part of its description of this phase. Recall that phase 2 ensured that the file ended in a newline, which "terminates" any preprocessing token! It appears impossible to have a "non-terminated preprocessing token" at the end of a file. There is a CHANCE that the standard meant to say "shall not terminate in a partial preprocessing #if directive group", but this would not make sense as such items are not identified until later phases.
Finally there is the possibility that this requirement was installed in the Standard before the agreement was reached that a file should not end in an escaped newline (re: phase 2 requirement), and then (accidentally) never taken out. We assume this latter interpretation is correct, and we ignore the constraint on "partial preprocessing token at end of file".

JRCPP adopts the aforementioned policy allowing sequences of non-newline whitespaces to be equivalent to a single space, and compacts comments and whitespace into single spaces.

It is critical to note that, because comments are NOT removed prior to this phase, a program cannot "comment out" trigraph sequences, or any activity performed in the earlier phases. In addition, the fact that comments are removed in this phase means that constructs that "look like comments" in later phases (e.g.: after macro expansion activity) are not regarded as comments. Finally, the fact that comments are translated into single space characters includes the case where the comment contains a newline! This specifically means that preprocessing directives (discussed in next phase) are not terminated at the end of the line, if the newline marking that point is within a comment. The implication of this should be clear to programmers who had previously used a macro definition of the following form on some non-ANSI compiler:

    #define start_comment /*
    code /* comment_text */ more_code

The above lines do not define "start_comment" (as understood by later phases) to be the sequence "/*". In the above sample, the "/*" and all characters up until the comment terminator "*/" are compacted into a single space. Since the next comment terminator occurs on the next line of the example, the above code is equivalent to:

    #define start_comment more_code

On the brighter side, JRCPP would have issued a warning about the above sequence as having a "/*" within a comment.
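The rule that a comment containing a newline does not end a directive can be seen in a small sketch. This is an illustration for any ANSI conforming compiler, not JRCPP-specific code; the names SPANNING_VALUE and spanning_value are hypothetical and chosen only for this example.

```c
/* The comment in the definition below contains a newline. Since the whole
   comment is replaced by one space in phase 3, the directive continues
   onto the following physical line. */
#define SPANNING_VALUE 1 /* this comment crosses
   the end of the line */ + 2

/* Hypothetical helper: returns the value the macro expands to. */
int spanning_value(void)
{
    return SPANNING_VALUE;   /* expands to: 1 + 2 */
}
```

Had the comment ended on the first line, the macro body would have been just 1, and the text "+ 2" would have been left behind as ordinary source.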
Phase 4) In phase 4, the tokens are parsed (grouped together) to form preprocessing directives and source code. This activity includes establishing and maintaining a database of macros (re: #define and #undef), conditionally including sections of source code (re: #if, #ifdef, #ifndef, #elif, #else, #endif), inserting additional files (re: #include), providing user supplied error messages (#error), and servicing implementation defined directives (#pragma). Note that when a #include directive is processed, the phases 1-4 are all applied to the source file as it is inserted.

Phases 5-8 of the ANSI Standard relate to phases of processing that I would refer to as compilation and linking. It is conceivable that the concatenation of adjacent string literals (phase 6) should be considered as part of the preprocessing effort, but it has NOT been included in JRCPP for two reasons:

Reason 1) If a series of large string literals were concatenated, then there is a good chance that the result would be too large for many lexical analysers (re: the first scanning phase of a compiler) to handle. I would prefer to produce code that is acceptable to a larger range of compilers.

Reason 2) Hexadecimal escape sequences have no termination mark. Hence the concatenation of two string literals may be MUCH more complex than concatenating the "sections between the quotes". (Example: "\x1" "b" is NOT the same as "\x1b". Specifically, in most DOS environments, "\x1" "b" is the same as "\001" "\142" or equivalently "\001\142", whereas "\x1b" is the same as "\033".) Since hex escape sequences have no terminator, an example such as what was just given MUST be translated into a series of octal escape sequences (at least the trailing hex sequence in the first literal must). Unfortunately, the translation of long hex escape sequences with "equivalent" octal escape sequences would introduce an area of platform dependency that is probably best avoided in a portable preprocessor.
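The hazard described in Reason 2 can be checked with a compiler that performs the concatenation itself. The following is a minimal sketch; the variable and function names are hypothetical and exist only for this illustration.

```c
#include <string.h>

/* Two adjacent literals: the hex escape \x1 ends at the closing quote,
   so after concatenation the string holds the character 0x01 and then 'b'. */
static const char two_chars[] = "\x1" "b";

/* One literal: the hex escape absorbs the 'b' as a hex digit,
   so the string holds the single character 0x1b. */
static const char one_char[] = "\x1b";

/* Hypothetical accessors used to inspect the two strings. */
size_t two_chars_length(void) { return strlen(two_chars); }
size_t one_char_length(void)  { return strlen(one_char); }
int two_chars_first(void)     { return two_chars[0]; }
int two_chars_second(void)    { return two_chars[1]; }
int one_char_first(void)      { return one_char[0]; }
```

A preprocessor that naively pasted the text between the quotes would turn the first form into the second, and silently change a two character string into a one character string.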
2.1.1.3 ENVIRONMENT- Diagnostics

The standard requires that at least one diagnostic message be emitted for every "violation of any syntax rule or constraint". JRCPP attempts to support this, with the two caveats that a) parsing of the C language output is NOT performed, and associated error checking is not provided. As an interesting example of this support, the user should note that special JRCPP features (re: #pragma diagnostic_adjust) allow arbitrary diagnostic messages to be "silenced". In order to support the above ANSI C requirement, the first such adjustment of a diagnostic severity level CAUSES a diagnostic to be issued. Hence at least that diagnostic notification is present no matter what user customizations are applied to diagnostics.

2.2 ENVIRONMENTAL CONSIDERATION

2.2.1 ENVIRONMENTAL CONSIDERATION- Character sets

The character set supported by JRCPP includes the full range of characters that are required by the standard. In addition, the ASCII characters in the range 129 to 255 are interpreted as though the high bit of their eight bit byte was 0 (i.e., mapped into values 1 through 127), and the ASCII values 0 and 128 are treated as spaces (ASCII 32).

2.2.1.1 ENVIRONMENTAL CONSIDERATION- Trigraph Sequences

All the standard trigraph sequences are supported. These include:

    Trigraph    Equivalent Character
    ??=         #
    ??(         [
    ??/         \
    ??)         ]
    ??'         ^
    ??<         {
    ??!         |
    ??>         }
    ??-         ~

Here again it is significant to recall that trigraph sequences are replaced in the very first phase of translation.
Hence the following is not a comment:

    ??/* test*/

as it is equivalent to:

    \* test*/

In a similar vein, the following obscure code has surprising meaning:

    "/control??/" /* continue till "/* real comment */

as it is equivalent to:

    "/control\" /* continue till "/* real comment */

which is later tokenized as the long single string literal:

    "/control\" /* continue till "

2.2.1.3 ENVIRONMENTAL CONSIDERATION- Character sets- Multibyte characters

Other than support for the single byte characters required in the standard, multibyte characters are not specified or supported in JRCPP. If multibyte characters are encountered, they are passed along blindly, but they cannot be evaluated in any meaningful way as a #if/elif expression (a diagnostic is produced if such an attempt is made). This stance is in keeping with the requirements of the standard.

2.2.4 ENVIRONMENTAL CONSIDERATION- Translation Limits

This section of the standard requires that at least one program exist that satisfies all of the limits, and can be translated. The following are the limits that relate to a preprocessor, and the details of how to construct a program exercising those limits. Note that the limits are very easy for JRCPP to handle, and no true "cunning" is required to generate a required test program.

    8 nesting levels of conditional inclusion

Currently about 38+ (see Appendix B of USERS MANUAL) levels are supported, but this static limit may be removed in future releases.

    32 nesting levels of parenthesized expressions within a full expression

The parsing stack for evaluation of preprocessor expressions is currently set at 150 (see Appendix B of USERS MANUAL) levels. At a minimum, it would require 150 non-white tokens (not just characters, but whole tokens) on a preprocessing #if line to cause a parser stack overflow. Lines with fewer than 150 tokens cannot cause an overflow, but the absolute limit on parenthesis nesting depends upon the number of additional operators, along with their precedence and placement.
Any demonstration program that has expressions that "tower to the left", such as: "((((((...(((5+4)*3-3)+.../7)|6-8)+1", and has fewer than about 130 nested parentheses, should also be acceptable to JRCPP.

    31 significant initial characters in an identifier name

All identifiers are considered significant in all their characters, which may extend well beyond 31 characters. See Appendix B of USERS MANUAL for actual restrictions on the absolute length of identifiers.

    511 External identifier names in one translation unit

JRCPP has no static limit on the number of distinct identifiers of any type.

    1024 Macro identifiers simultaneously defined in one translation unit

JRCPP has no static limit on the number of macros defined.

    31 Parameters in one macro definition

JRCPP has no static limit on the number of parameters for a function like macro.

    31 Arguments in one macro invocation

JRCPP has no static limit on the number of arguments supplied to a macro invocation.

    509 Characters in a logical source line

JRCPP has no static limit on the number of characters in a source line. There is a limit on the number of characters in a single token, but there is no limit on the number of tokens on a line (see Appendix B of USERS MANUAL).

    509 Characters in a string literal (after concatenation)

Since JRCPP does not currently perform string concatenation, this limit does not generally apply. The limit on the length of a single token applies to individual string literals (see Appendix B of USERS MANUAL).

    8 levels of nested #include

JRCPP has no static limit on the number of nested include files. To support this (no limit) stance, JRCPP does require that at least 2 file handles be made available to it, in addition to the standard set of stdin, stdout, stderr. See your operating system manual for details. Note that there is a limit on the depth of nested file inclusion when the original source file is actually stdin.
This limit is based on the operating system restriction on the number of files that may be open at one time. This odd limitation may be removed in future versions, but typical DOS configurations would only limit nested inclusion at about 16 levels.

3.1 LANGUAGE- LEXICAL ELEMENTS

This section defines exactly how to interpret a series of characters as tokens. The one point of undefined behavior in this section concerns the presence of unmatched single (') or double (") quotes appearing on a logical line. JRCPP makes an effort to not abandon compilation when it encounters errors, and its behavior in this area is typical of such resolutions.

In the case of an unmatched single quote ('), JRCPP assumes that the programmer forgot the closing quote, and that only a single character "character constant" was intended. Hence for the purposes of error recovery, the single quote and at most one following c-char (which includes single characters, and a select set of escape sequences, but excludes newlines) is accepted as a character constant. This construction of an erroneous token is performed despite the fact that without the terminal quote, the spelling of the token is invalid.

In the case of an unmatched double quote ("), JRCPP also assumes that the programmer forgot the closing quote. In the case of string literals, it is assumed that most literals are fairly long. For the purposes of error recovery, JRCPP assumes that the original quote, along with the longest possible sequence of s-chars (a class of characters that includes single characters, and a select set of escape sequences, but excludes newlines) formed the string literal.

Note that in both cases diagnostics are generated that will, by default, prevent any preprocessed output from being generated. The default settings of these diagnostics can however be overridden for the purposes of generating some output.
3.1.2 LANGUAGE- LEXICAL ELEMENTS- Identifiers

JRCPP supports the standard definition of identifiers, consisting of a leading alphabetic character (or an underscore), and continuing with an arbitrary sequence of alphanumeric characters and underscores. As an extension, JRCPP also supports the presence of the character '$' at any position (including first character) of an identifier, but it flags such usage as an error. Here again JRCPP can be seen to comply with the ANSI requirements for diagnosing nonportable/nonstandard constructs, while still allowing the user the opportunity to ignore the error, and facilitate a porting operation (note that the default diagnostic level of such an error is sufficient to preclude output, but this level may be modified via the #pragma adjust_diagnostic ... directive). This extension does not in any way conflict with the ANSI standard, as a '$' character, outside of a string literal or character constant token, is usually illegal anyway. Hence incorporating it into an identifier does not preclude any valid constructs.

In certain obscure cases, an ANSI conformant program might have a '$' character provided outside of a string literal, or character constant. This placement is only potentially legal if the '$' is formed into part of a valid token by the end of the preprocessing phases. If this obscure case is actually significant to a user, modification of diagnostic levels can permit this construct. If I am pressed by registered users, I may modify the performance of the preprocessor to more naturally support such obscure ANSI C conformant cases.

This section of the Standard also discusses the significance of characters in an identifier name. Specifically, it requires that all of the first 31 characters in a macro name be considered when comparing names and invocations. In order to support the many existing implementations, the standard leaves as "undefined behavior" whether identifiers that differ ONLY beyond the 31st character are treated as distinct.
JRCPP resolves this simply by treating all characters in any identifier name as significant. This may identify as errors some typos that other compilers overlook, but this only tends to make the code more robust in terms of portability.

3.1.3.3 LANGUAGE- LEXICAL ELEMENTS- Character constants

In the discussion of character constants by the ANSI Standard, it is mentioned that when undefined escape sequences are encountered in a character constant, the results are undefined. Note that the defined escape sequences for use within character constants include: '\\' (backslash), '\'' (single quote), '\"' (double quote), '\?' (question mark), '\a' (alarm or bell), '\b' (backspace), '\f' (form feed), '\n' (newline), '\r' (carriage return), '\t' (tab), '\v' (vertical tab), octal escape sequences with 1-3 octal digits, and hexadecimal escape sequences with arbitrarily many hex digits. Examples of the latter two types are: '\27', and '\xab10cd'.

When JRCPP finds an invalid escape sequence within a character constant (and there is a trailing quote found later on that line), a diagnostic is produced, but the character sequence is accepted as a character constant. The severity level of the diagnostic is sufficient to prevent the preprocessor from producing output, but the level may be varied by the user if acceptance of such sequences is considered reasonable for the user's target compiler.

3.1.4 LANGUAGE- LEXICAL ELEMENTS- String literals

The undefined behavior in string literals is also centered on the presence of illegal escape sequences within the literal. In an analogous fashion to the handling of character constants, the presence of illegal escape sequences generates a diagnostic, but (error recovery) accepts the sequence. The default severity of the diagnostic is high enough that no output will be produced by the preprocessor unless the level is adjusted downward.
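The numeric escape sequences listed in 3.1.3.3, which apply equally within string literals, can be illustrated with a short sketch. The function names are hypothetical, and the values shown follow from the digits alone, so they do not depend on the character set in use.

```c
#include <string.h>

/* Octal escape: '\27' is 2*8 + 7 = 23. */
int octal_escape_value(void) { return '\27'; }

/* Hex escape: '\x1b' is 1*16 + 11 = 27, the same value as octal '\033'. */
int hex_escape_value(void)   { return '\x1b'; }
int octal_033_value(void)    { return '\033'; }

/* An octal escape stops after at most three digits, so "\0331" holds two
   characters: the value 27 followed by the digit '1'. A hex escape has no
   such stopping point. */
size_t octal_limit_length(void) { return strlen("\0331"); }
```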
3.1.7 LANGUAGE- LEXICAL ELEMENTS- Header names

This section of the Standard discusses the lexical form of file names that are used in #include directives. The undefined behavior in this area involves the presence of the characters ', \, ", or /* within the <....> form of an include directive, and the presence of ', \, or /* within the "...." form of the directive. Since the original platform for JRCPP was DOS/OS2, defining the behavior of such sequences is quite important (DOS and OS2 file systems use '\' as a separator in path names, in the same way as UNIX systems use '/' as a separator). In order to support the use of standard DOS/OS2 path names, a header name is considered a special and distinct token.

JRCPP defines a "...." style header name to begin with a double quote, and continue until a matching double quote is encountered, without passing the newline. Note that escape sequences are NOT honored during the scanning of this token, and hence backslash characters represent themselves directly (and the final quote CANNOT be escaped using a backslash). In addition, since this is a single token, the presence of /* within it is of no consequence. The only context in which a "...." style header name is permitted by JRCPP (and hence scanned for), is as the first non-whitespace token, following the keyword "include", on a #include directive line. Note that comments are considered whitespace, and may precede the "...." style header name. The following are examples of entirely legal include directives:

    #include /* comment */ "sys\header.h"
    #include "weird/*b.h"
    #include "any char is legal !@' wow \"

Note however that the operating system will more than likely be unable to find such files! The mapping from include directives into actual file names involves replacing each occurrence of a '\' or '/' with the appropriate path name separator, and requesting that file be opened.
Consider for example the following:

    #define stringize(x) #x
    #include stringize( \sys\header )

Since the macro will expand its argument to "\\sys\\header" (details of stringization are defined in section 3.8.3.2), the file will be searched for using four backslashes! This mapping into a file name is independent of whether the file name was provided in the "...." style, or it was a string literal generated by some preprocessing replacement.

In an identical fashion, a <....> style header name is defined by JRCPP to begin with a '<' character, and to not terminate until the first '>' character is reached, without extending past the newline. The context for scanning for this token is identical to that of the "...." style header name. As with the "...." style header names, there are NO special characters (i.e.: escape sequences) interpreted during the scanning for such a token. The following are legal #include directives, and demonstrate this:

    #include <system.h\header.h/*this_too>
    #include /*comment*/ /*comment*/ <any char ' or " or even \>
    #/*comment*/ include /*comment comment continues*/ < spaces even count >

The last example also demonstrates that comments are reduced to a single space, and hence do not disrupt the context of the scan as defined. Note that all characters between the delimiters < and >, (or between the double quotes in the "...." style) are interpreted as being part of the file name.

In the interest of portability, it is suggested that the user refrain from using the standard '\' path delimiter in a DOS/OS2 environment, and instead make use of the equivalent character '/'.

3.1.8 LANGUAGE- LEXICAL ELEMENTS- Preprocessing numbers

The lexical token "preprocessing number" appears to have been placed into the standard to allow for arbitrary substrings of valid numbers to be manipulated conveniently.
The need for such a token is perhaps motivated by the requirement (given elsewhere in the standard) that if the result of some preprocessing operation (such as token pasting via ##) is not a valid preprocessing token, then the resulting behavior is undefined. With that requirement "on the books", it then follows that substrings of numbers should be considered valid. For example, the substring `3.14' could be pasted onto `0e4', to yield the result `3.140e4'.

In order to make life as easy as possible for the implementers, the standard is VERY broad in its allowance of what is a valid preprocessing number. For example, the sequence `1.2aZ4E-_6.7.2_3' is a valid preprocessing number. As per the standard, this token is supported, in the full generality that it is specified.

The user should also be warned that a preprocessing number is a SINGLE TOKEN, and hence is not scanned internally for the presence of macro names. For example, when the above example `1.2aZ4E-_6.7.2_3' is present in a file, the preprocessor will NOT consider the macro aZ4E for expansion, even if it is defined! The point being made here is that when numbers are placed adjacent to letters in the source file, they will typically be blended together into a single token, and the letters will not be eligible for macro substitution. Similarly, even though the `.' operator may be overloaded in C++, if it is placed to the right of and adjacent to ANY number sequence, it will be absorbed as part of that token!

3.8 LANGUAGE- PREPROCESSING DIRECTIVES

The descriptions given in this section cover all aspects of preprocessor directives. I will in general paraphrase some of the significant areas that I consider non-intuitive. An interested reader should certainly consider examining the actual standard for any additional details.

One notable item in the overview section is that tokens within directives are generally NOT subject to macro expansion, unless otherwise noted.
Hence logical lines of text are categorized as directive or non-directive lines BEFORE macro expansion of such lines takes place. In addition, as pointed out in a later section, if macro expansion produces something that "resembles" a directive line, it is NOT processed as a preprocessing directive. The following is a summary of the actions of various directives with regards to expansion of the tokens that follow the "# directive_name":

    Directives for which tokens, on the line with them, are expanded:
        #if  #elif  #include  #line

    Directives for which tokens, on the line with them, are not expanded:
        #ifdef  #ifndef  #define  #undef  #error  #pragma

    Directives which cannot legally have other tokens on the line with them:
        #else  #endif  #

Note that the #if and #elif have some additional translation that is performed on their tokens both BEFORE and AFTER macro expansion. The #include and #line directives are only expanded when a standard form of arguments is not present. In the case of a #include directive that does require expansion of the tokens, the post-expansion tokens are processed (concatenated) after expansion and rescanned for a standard format. The fact that additional tokens are not allowed following the null directive (the lone #), is significant in that any other lines that begin with # are strictly illegal.

3.8.1 LANGUAGE- PREPROCESSING DIRECTIVES- Conditional inclusion

Conditional inclusion refers to the use of #if directives (along with their various forms and grouping directives) to cause a section of code to be optionally included or excluded from the preprocessed result. Fundamentally, there are three ways to start an if directive group (#if, #ifndef, #ifdef), two ways to continue a group (#elif, #else), and one directive to mark the end of the group (#endif). Since such if groups can nest (i.e., contain inner groups), we will start with the description of an outermost conditional group, and then discuss the ramifications on inner groups.
We will also defer discussion of #ifdef and #ifndef, as their definition follows directly from the definition of #if.

Subsection A) Evaluation Of #if Expressions

The first point to address is how the tokens on a line with a #if are evaluated, and what their resulting "value" signifies. To make the discussion clearer, we will assume the following example:

    #define hi 5
    #if (1u == defined(hello)) || (3 < hi + low + int)
    ...

The process of evaluating the tokens on a line with a #if directive consists of 6 phases:

    1) remove all occurrences of "defined identifier" and "defined ( identifier )", and replace them with either 0 or 1 (1 iff the identifier is currently defined as a macro)

    2) macro expand the line that resulted from phase 1 (using standard rules described in section 3.8.3)

    3) replace all identifiers and keywords in the result of phase 2 with the number 0 (this result is a list of constants and operators)

    4) convert all constants of type "int" to identical constants of type "long", and constants of type "unsigned int" to "unsigned long".

    5) evaluate the expression produced by phase 4 according to standard C expression methods, but always use "long" types for subexpressions that evaluate to an "int", and "unsigned long" for expressions that evaluate to an "unsigned int" (the final result is an integral constant of type "long" or "unsigned long")

    6) if the final result of phase 5 is equal to 0, then the expression was false, otherwise it is true.

The process can be demonstrated on the example given, with the following evaluations.

After phase 1 removal of "defined" (hello is not currently defined as a macro, so the operator yields 0):

    #define hi 5
    #if (1u == 0) || (3 < hi + low + int)
    ...

After phase 2 macro expansion:

    #define hi 5
    #if (1u == 0) || (3 < 5 + low + int)
    ...

After phase 3 replacement of identifiers and keywords with 0:

    #define hi 5
    #if (1u == 0) || (3 < 5 + 0 + 0)
    ...

After phase 4 conversion of "int"s to "long"s:

    #if (1uL == 0L) || (3L < 5L + 0L + 0L)
    ...

The phase 5 evaluation might proceed something like:

    (1uL == 0uL) || (3L < 5L)
    (     0    ) || (   1   )
    (     0L   ) || (   1L  )
                 1
                 1L

Finally in phase 6, the above constant can be seen to be non-zero, and hence the result of the evaluation is true.

The above rules work for the most part, as expected by "almost" everyone, but the following details and anomalies are worth noting.

Note that in phase 1 an attempt is made to remove all occurrences of the operator "defined". If this operator is not applied to an identifier, the Standard indicates that the results are undefined. JRCPP considers this scenario to be a syntax error, and aborts evaluation of the expression. As a means of error recovery, JRCPP assumes an evaluated result of FALSE, and a diagnostic is generated.

A second point of ANSI C undefined behavior takes place when the result of macro expansion produces the operator "defined". For simplicity and portability of code, JRCPP disregards the presence of such an operator as the result of macro expansion, and follows exactly the multiphase algorithm supplied above. Hence occurrences of the keyword "defined" that are produced by macro expansion are replaced in phase 3 with a value of 0.

During the expansion of phase 4, some ANSI C undefined behavior exists with regard to evaluating character constants. The two points here that must be resolved are how to evaluate multicharacter character constants (which contain more than one item, each of which is either an escape sequence or a simple character), and whether character constants may assume negative values. As mentioned in earlier sections, multicharacter character constants represent a major area of non-portability, and hence they are not effectively supported in #if expression evaluation. Specifically, if a multicharacter character constant (such as 'zq') appears in an expression, it is truncated to a single character constant, keeping only the leftmost character (or escape sequence), and of course a diagnostic is generated.
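
As a quick check of the six-phase evaluation walked through earlier in this subsection, the same directive can be handed to any ANSI-conforming preprocessor. The following is only a sketch and is not specific to JRCPP: the names IF_EXAMPLE_RESULT and if_example_result are hypothetical, and the sketch assumes that neither `hello' nor `low' is defined as a macro at this point.

```c
#define hi 5

/* Assumption: hello and low are NOT defined as macros here, so
   defined(hello) yields 0, and low (like the keyword int) is
   replaced by 0 before the expression is evaluated. */
#if (1u == defined(hello)) || (3 < hi + low + int)
#define IF_EXAMPLE_RESULT 1   /* expression was non-zero: group is kept */
#else
#define IF_EXAMPLE_RESULT 0   /* expression was zero: group is skipped  */
#endif

/* Hypothetical helper: reports which group the preprocessor selected. */
int if_example_result(void)
{
    return IF_EXAMPLE_RESULT;
}
```

The left operand of || is false (1 is not equal to 0), while the right operand reduces to 3 < 5, so the #if group is the one that should be selected.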
Character constants under JRCPP evaluate as "signed int", which in a DOS 8088-80x86 environment is taken to be a 16 bit signed integer. Single character character constants always evaluate as positive numbers. Octal character constants are all considered positive (i.e., '\000' through '\777'), but hexadecimal character constants may evaluate to a negative number. Specifically, if the high order bit of a hexadecimal character constant (when viewed as a "signed int") is set (i.e., '\x8000' through '\xffff' on a 16 bit signed int architecture), then the number is negative, using a two's complement representation. Additionally, if a hexadecimal escape sequence exceeds the representational precision or range of a character constant (e.g., "signed int" under JRCPP, which corresponds to 16 bits under a DOS environment), then the high order bits are discarded.

There are several subtleties involving preprocessor "#if" expression evaluation. The first item to observe is that the expression must be formed using only integral constant subexpressions (i.e., no floating point; no pointers); casts may not be used; and the 'sizeof' operator is not evaluated (in fact, 'sizeof' is replaced in phase 3 by the value 0). As per allowance by the standard, there is no guarantee that character constants will be evaluated identically in the preprocessor as they are in the compiler (since JRCPP is external and unknown to your compiler, this is all the more important to be aware of).

One slightly quirky aspect of the evaluation of #if centers around the consistent use of "long" types to replace "int" types. The following demonstrates this "quirk", and is commonly a thorn (bug?) in the side of many "would be ANSI compatible" preprocessors:

    #if 1 > (0 ? 1u : -1)

The tricky aspect of evaluation of this example involves the value of the ternary "?:" subexpression AFTER the transition to "long" types is made. The subexpression looks like "(0L ? 1uL : -1L)".
Note that the type associated with this ternary must be the "larger" of the types "unsigned long" (from 1uL) and "signed long" (from -1L). Hence, according to ANSI, the result of the ternary expression must be the "unsigned long" representation of "-1", which is actually the largest possible "unsigned long". So we can see that the above expression ends up evaluating to FALSE! The moral of the story for programmers is to exercise care when working with negative numbers in the preprocessor #if statements.

Subsection B) Other conditional inclusion directives

As mentioned in the standard the lines:

    #ifdef any_identifier

and

    #ifndef any_identifier

are equivalent, respectively, to:

    #if defined any_identifier

and

    #if ! defined any_identifier

Each basic conditional inclusion section of source code consists of an "if group", followed by any number of "elif groups", followed optionally by an "else group", and terminated by a line with a #endif directive. An "if group" consists of a #if directive (or equivalent), followed optionally by lines of code. An "elif group" consists of a #elif directive (which has an expression to evaluate), followed optionally by lines of code. An "else group" consists of a #else directive, followed optionally by lines of code. Note that for each #if directive, there MUST be a matching #endif directive that follows it, and that these conditional inclusion sections do nest within a single "group" of code (i.e., within a single "if group", or within a single "elif group", or within a single "else group").

The semantics (meaning) of these directives is most simply given by an example:

    #if expression_1
        block 1
    #elif expression_2
        block 2
    #elif expression_3
        block 3
    #else
        block 4
    #endif

Fundamentally, only one of the blocks 1, 2, 3, and 4 can EVER be passed to the output of the preprocessor (if we were missing the "else group", then it is possible that none of the blocks would be processed).
If expression_1 evaluates to TRUE, then ONLY block 1 is processed, and blocks 2, 3, and 4 are discarded, and expressions 2 and 3 need not even be evaluated. On the other hand, if expression_1 is FALSE, then block 1 is discarded, and expression_2 is evaluated as though it were the start of a conditional inclusion section. Hence the first #if or #elif directive that evaluates to true causes its associated section of code to be included, and all other sections in the other groups to be discarded. If none of these expressions evaluate to TRUE, then the code in the "else group", if it exists and has code, is processed.

One very common use of conditional inclusion is to effectively comment out a large section of code, that more than likely has /*...*/ based comments. Since standard /*...*/ based comments do not nest, large blocks of code CANNOT be safely removed using standard /*...*/ delimiters. In contrast, since conditional inclusion directives do nest, placing "#if 0" at the start of the section, and "#endif" at the end of the section effectively comments out (safely) an arbitrary block of code. Also, from a stylistic point of view, the fact that these directives DO NOT have to appear directly adjacent to the left margin (as was the case in some early C preprocessors) allows such commenting to be done in a very aesthetically pleasing format.

3.8.2 LANGUAGE- PREPROCESSING DIRECTIVES- Source file inclusion

Source file inclusion is performed using the "#include" directive. For our discussion, we will refer to directives like:

    #include <stdio.h>

as <...> style includes, and

    #include "header.h"

as "..." style includes.

Subsection A) Interpretation of expanded tokens following #include

One notable change to many compilers to support the ANSI standard, is the acceptance of include directives of the form:

    #include token1 token2 ..... tokeni

wherein the token sequence cannot be interpreted as either a <...> or "..." style include.
We will refer to this include directive as a "macro derived" directive, in honor of the fact that the tokens must be macro expanded before the include directive can be acted upon. Note that to not be a "..." style include, the first character of token1 must be other than `"', or at least there must be no other `"' character later on the include line. Similarly, to avoid the <...> style, the first character of token1 must be other than `<', or there must be no terminating `>' later on the line. Recall also from the discussion of tokenization, that backslashes cannot escape out a closing quote, and that the sequence /* is NOT honored within the file name in either the "..." or <...> style include (this is a special context).

The first element of undefined behavior for #include directives concerns the method by which the results of macro expanding the tokens in a macro derived directive are interpreted. The most obvious (and simple) case is when the expansion of the entire token sequence is a simple string literal. The more complex case involves an expansion that is still a token sequence, such that the first token begins with a `<', and the last token ends with a `>'. The following are examples of such include directives:

    #define myfile "header.h"
    #define yourfile /* this macro has a null definition */
    #define less_than <
    #define greater_than >
    #define big_system_header os2.h
    #include yourfile myfile
    #include less_than big_system_header greater_than

Although the above macro derived directives expand to token sequences that "look like" more common include directives, there are some special differences. Specifically, using the above macro definitions the resulting token sequences look like:

    #include "header.h"
    #include < os2.h >

Note that the above casual macro definitions left leading and trailing spaces in the file name for the latter example.
Although this could have been avoided by using function like macros, which can be placed sequentially with no white space between them, the presence of whitespace in the result is believed to be common among users of this feature. With the above examples understood, the JRCPP resolution is simply to concatenate all the tokens, ignoring inter-token whitespace, and THEN reinterpret the resulting character sequence in the context appropriate to <...> or "..." style include directives (i.e., no special meaning for backslashes, etc.).

Subsection B) Search algorithm for included files

A second point of undefined behavior in the ANSI C standard involves where exactly included files are searched for. There are actually several very distinct conventions for this search mechanism, even within file systems which are hierarchically based (such as UNIX). JRCPP has adopted a default strategy that is consistent with Microsoft C 5.1, and several other more recent compilers. There is also support (selected via #pragma include_search) for algorithms compatible with Microsoft C 4.0, and support for approaches compatible with older UNIX system cpp implementations.

If an application is placing all the source and header files in a single directory, which is also the current working directory during the compilation, and the system include directory is a single absolute directory, then almost any search algorithm will suffice. On the other hand, if header files in one directory include header files in a second directory, which then include header files in yet a third directory, while the user has a current working directory yet elsewhere, and some of the system include or application include directories use relative path names, the meaning of:

    #include "header.h"

is far from obvious.

Historically, projects developed under UNIX placed all source and header files in a single directory, and the discussion of search algorithms was irrelevant.
With the growing complexity of code, and the presence of a multitude of programmers on a project, the need has arisen to hierarchically segregate sections of a large project into file hierarchies. In order to support function calls between these sections, header file inclusion well outside the current directory has become commonplace. Various vendors have adopted algorithms that support the "trivial" case described originally, but there has often been disagreement about how to process the more complex cases.

The philosophy that drove the development of the slightly complex include strategy of JRCPP was motivated by the following requirements:

    1) It should be possible to write include files, that include other files, without concern about what source file was being compiled, where that file was, and what directory the user was in when the compilation was requested. This allows complex systems of header files to be written INDEPENDENT of the application that uses them.

    2) To allow for even more complex sets of include files, if file A included file B, and file A was able to include file C, then file B should be able to include file C with equal ease. In some sense, this concept is similar to inheritance in object oriented programming.

    3) It should be possible for a user to change current directories, and in doing so change what files are accessible via include searches (assuming the programmer has orchestrated the placement of header files to support this strategy). This allows different versions of an application to be compiled easily into different directory areas.

    4) It should be possible to define the location of include files via a relative path. This facility would allow the construction of source file hierarchies that are easily ported to distinct absolute positions in file system hierarchies.

We will start with some definitions of terms. We define the "system include path" to be a list of directories where system header files are provided.
Typically all the ANSI C specified library header files are in directories listed in the system include path, and files in such directories can be expected to never change (hence it is rare to provide "make" dependencies on such files). The "application include path" is a list of directories that contain header files significant to a specific application or project. Typically, files in the application include path tend to change often during program development. Next there is the "current directory", with its standard meaning in a UNIX or DOS like file hierarchy.

Three additional terms need to be defined: the "original source directory", the "current source directory", and the "ancestral source directories". The "original source directory" is the directory in which the source file specified on the compilation or preprocessor command line was found. The "current source directory" is the directory which contains the file that is currently being parsed, and which has the include directive that we are trying to process. The "ancestral source directories" are a list of directories that begin with the "current source directory", and proceed back through each level of nested inclusion (specifying the directory in which that source file was found), all the way to the "original source directory".

The algorithm for searching for header.h consists of the bottom level ancestral search, and a top level search. The algorithm terminates the first time a file can be accessed. In pidgin code, the algorithm would look like:

    ancestral_search(file_name)
    {
        if (file_name has relative prefix)
        {
            for (every prefix p, in the ancestral include list) do
                try to open (p/file_name)
        }
        /* but as a last resort ... */
        try to open (file_name)    /* with no prefix */
    }

Driving the code that we just listed is the higher level search, that is directed by the use of user specified include paths:

    standard_search(file_name)
    {
        try ancestral_search(file_name)    /* with no prefix */
        if (file_name has relative path)
        {
            for every prefix p in the application include path do
                ancestral_search(p/file_name)
            for every prefix p in the system include path do
                ancestral_search(p/file_name)
        }
    }

The above pidgin code corresponds to the algorithm for finding "..." style includes. When <...> style includes are searched for, the file is not searched for directly (with no prefix) unless it has an absolute path specifier, and the "application include path" is never made use of. The following pidgin code describes the search for <...> style include files:

    system_search(file_name)
    {
        if (file_name has relative path)
        {
            for every prefix p in the system include path do
                ancestral_search(p/file_name)
        }
        else    /* file has an absolute path specification */
            try to open (file_name)
    }

It should also be pointed out that when the ancestral_search function is used, the path prefixes for the ancestral include files are tried sequentially, starting with the current source file, and proceeding back to the original source file.

Next, we will present an example of the above definitions, and the resulting search path. We will assume:

    The current working directory is /current_path.

    The source file original_path/original.c included the file header1path/header1.h (hence original_path is the "original source directory").

    File header1.h included header2path/header2.h (hence the "ancestral directory list" is: header2path, header1path, original_path).

    The system include path contains the directories sysdir1 and sysdir2.

    The application include path contains the directories appdir1 and appdir2.
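
The ancestral_search and standard_search routines listed above can also be sketched in C. This is only an illustration under stated assumptions, not JRCPP's actual source: instead of opening files and stopping at the first success, it merely records every candidate name in the order it would be tried. The names try_open, is_relative, candidates and candidate_count are hypothetical, and a leading '/' is assumed to be the only mark of an absolute path (DOS drive letters are ignored).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_CANDIDATES 32
#define MAX_PATH_LEN   128

/* Candidate file names, recorded in the order they would be tried. */
static char candidates[MAX_CANDIDATES][MAX_PATH_LEN];
static int  candidate_count = 0;

/* Assumption: only a leading '/' makes a name absolute. */
static int is_relative(const char *name)
{
    return name[0] != '/';
}

/* Stand-in for "try to open": record the name instead of opening it. */
static void try_open(const char *name)
{
    assert(candidate_count < MAX_CANDIDATES);
    snprintf(candidates[candidate_count++], MAX_PATH_LEN, "%s", name);
}

/* Bottom level: a relative name is tried under each ancestral directory
   (youngest first), and then, as a last resort, with no prefix. */
static void ancestral_search(const char *name,
                             const char *const ancestors[], int n_ancestors)
{
    char path[MAX_PATH_LEN];
    int i;

    if (is_relative(name)) {
        for (i = 0; i < n_ancestors; i++) {
            snprintf(path, sizeof path, "%s/%s", ancestors[i], name);
            try_open(path);
        }
    }
    try_open(name);
}

/* Top level search for "..." style includes. */
void standard_search(const char *name,
                     const char *const ancestors[], int n_ancestors,
                     const char *const app_path[], int n_app,
                     const char *const sys_path[], int n_sys)
{
    char path[MAX_PATH_LEN];
    int i;

    candidate_count = 0;
    ancestral_search(name, ancestors, n_ancestors);
    if (is_relative(name)) {
        for (i = 0; i < n_app; i++) {
            snprintf(path, sizeof path, "%s/%s", app_path[i], name);
            ancestral_search(path, ancestors, n_ancestors);
        }
        for (i = 0; i < n_sys; i++) {
            snprintf(path, sizeof path, "%s/%s", sys_path[i], name);
            ancestral_search(path, ancestors, n_ancestors);
        }
    }
}
```

Calling standard_search("header3.h", ...) with the ancestral list header2path, header1path, original_path records the three ancestral candidates first and then the unadorned header3.h. An absolute include directory (say a hypothetical /appdir1) contributes a single candidate, while a relative one is itself expanded through the ancestral list.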
For our example of searching, we will examine the search pattern when header file header2.h contains the include directive:

    #include "header3.h"

If the file specified as header3.h does have an absolute path prefix, then only that absolute path/filename is used in searching for the file (an example with an absolute path is "/usr/lex/lexdefs.h"). If header3.h does not have an absolute path prefix, then the following search pattern is followed. With the caveat to be mentioned in a moment, the search for a file consists of trying to open the following:

    header3.h
    appdir1/header3.h
    appdir2/header3.h
    sysdir1/header3.h
    sysdir2/header3.h

There is, as mentioned, one caveat to the above search sequence. The caveat is that whenever a file in the above list has a relative path prefix, then the prefixes provided by the ancestral directories are tried sequentially, followed by the unadorned filename, before moving on down the list. Since we assumed that header3.h did not have an absolute path prefix, the following files would actually be subject to fopen() calls:

    header2path/header3.h
    header1path/header3.h
    original_path/header3.h
    current_path/header3.h    (same as simply `header3.h')

The above list corresponds to a search using path prefixes taken from the ancestral include list, and then using the current directory. Note that if one of the application include path directories had only a relative path prefix, then it too would make use of the ancestral include directories for prefixes. For example, if appdir2 was a relative path, then header3.h would (if it were not found earlier in our list) be searched for in:

    header2path/appdir2/header3.h
    header1path/appdir2/header3.h
    original_path/appdir2/header3.h
    current_path/appdir2/header3.h    (same as `appdir2/header3.h')

Subsection C) Include strategy compatibility

The include strategy described above is compatible with Microsoft C 5.1, and several other major compilers.
Since this strategy is rather general, it tends to provide coverage of search areas sufficient for most any project. If however, absolute compatibility is desired with other compilers, the search strategy may be modified by use of the appropriate pragma options. The adjustments permitted all represent restrictions to search in only a subset of the default areas. For example, the SUN UNIX cpp searches for "..." style includes first in the parent directory, and then in the application's include path. To achieve complete compatibility with SUN, it is necessary to use the pragma:

    #pragma include_search youngest_ancestor_only

As a second example, Lattice C 3.0 searched for "..." include files in the current working directory, and then in the application's include path. To be compatible with such a strategy, the following pragma should be entered:

    #pragma include_search current_directory

Note that the options for the different ancestral search modes include: `youngest_ancestor_only', `eldest_ancestor_only', and `all_ancestors'. These options correspond respectively to using the current source directory, the original source directory, and the full ancestral search algorithm. The default provided by JRCPP is `all_ancestors', but omitting an ancestral selection (as in the last example) implies the use of no ancestral directories. Note that if searching in the current_directory, and searching in directories associated with all ancestral modes are disabled (via omission in such a pragma), then it is impossible to include any files without using an absolute prefix supplied in a system include path, or an application include path.

3.8.3 LANGUAGE- PREPROCESSING DIRECTIVES- Macro replacement

This section has only two mentions of undefined behavior, but it has several topics worthy of commentary. We will start with the comments, and then proceed to resolve the cases of undefined behavior.
It should be mentioned here that JRCPP has a rather novel pragma that explains in detail the steps taken during a macro expansion. This information may be used as a tutorial assistant (for the novice), as a debugging assistant (for the professional programmer), or as a method of proof of a bug (for a registered user to file a bug report). The explanation given includes a step by step breakdown of the macro expansion process, along with reference listings of current macro definitions as they are applied, and reasons for ignoring current definitions (such as the fact that an identifier is already a part of a recursive expansion of itself). For more details see "#pragma describe_macro_expansions", and note that this feature can be turned on and off to localize its impact. It is expected that this pragma can easily augment this Language Reference Manual by providing annotated examples.

ANSI C differs from many prior C preprocessor implementations in that an identifier may not be redefined "differently" from its current macro definition, without first undefining the existing definition. Some non-ANSI implementations allowed redefinitions to mask existing definitions, and future #undef directives to "pop" into visibility older definitions. Other implementations simply allowed new macro definitions to overwrite existing definitions. JRCPP only supports the ANSI C approach, in agreement with the ANSI C Rationale that other formats are error prone, and generally actual errors. Note that redefinitions that are identical to the original (such as occur when a header file is included twice) are fully legal. The Standard is quite specific in defining what a benign redefinition is, but it can be summarized by saying that the only difference allowed in a redefinition is the length of white space interludes. (There is actually a subtle error in the Standard. The restriction that the "order of the parameters" be unchanged in a redefinition, was not listed.
JRCPP assumes this was a typo, and requires that the order of parameters be unchanged in a macro redefinition). If I receive sufficient input from users that my decision here is a hindrance, I will support the other standards involving non-benign redefinitions (with appropriate diagnostics), controlled via pragma directives.

A second item worthy of noting is that whitespace at the start and end of a replacement list is not considered part of the replacement list. Hence it is impossible to define a macro that expands to whitespace.

The two aspects of undefined behavior involve odd occurrences during the gathering of arguments for a function like macro invocation. The first problem that needs to be addressed is what the behavior is when a token sequence that could be interpreted as a directive is encountered while gathering a list of arguments. The second point of undefined behavior is present when an argument consists of "no preprocessing tokens".

Fundamentally, the problem with allowing preprocessor directives to occur during the gathering of the argument list for a macro, is that a #define or #undef directive might be encountered. Consider the following code:

    #define f(x) x+1
    f(2)    /* that was easy */
    f(      /* look for the argument ... */
    #undef f
    2 )     /* found the arg, but should we use it? */

JRCPP resolves such ambiguities very easily by not generally allowing directives to be present within macro argument lists. Specifically, JRCPP would generate a diagnostic indicating that there was no closing parenthesis for the macro invocation (JRCPP stopped looking when it reached the directive). Although this is a VERY reasonable result when #define and #undef directives are reached, it is not so obviously necessary when certain other directives are reached.
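
Code that must remain portable can sidestep the question by keeping every directive outside the parentheses of the invocation, and selecting the argument beforehand through a helper macro. The following is a minimal sketch of that idea in plain ANSI C, and does not describe any JRCPP feature: the names foo, F_ARG and f_example are hypothetical, and foo is assumed not to be defined as a macro.

```c
#define f(x) x+1

/* Choose the argument BEFORE the macro is invoked, so that no
   directive appears while the arguments of f( ... ) are gathered. */
#if foo
#define F_ARG 2
#else
#define F_ARG 3
#endif

/* Assumption: foo is undefined, so #if foo is false and F_ARG is 3.
   The argument F_ARG is macro expanded before it is substituted,
   so the body of this function becomes: return 3+1; */
int f_example(void)
{
    return f(F_ARG);
}
```

Because the conditional group has been closed before f is invoked, the result does not depend on how a given preprocessor treats directives inside an argument list.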
To assist the programmer that is using existing code such as:

    #define f(x) x+1
    f(
    #if foo
    2
    #else
    3
    #endif
    )

a pragma (#pragma delayed_expansion [on|off]) is provided that causes such "ANSI C undefined behavior" to be acceptable. The restriction on this pragma based extension is that no directives are allowed within the argument list that can even possibly cause a change in the macro database (i.e., the relevance of the change is not considered; the significance of the change, such as a benign redefinition, is not considered). Simply put, if a #define, #undef, or #pragma is encountered during a scan for arguments to a macro, the scan is terminated with a "missing close paren" diagnostic, even if the delayed_expansion pragma is active.

As mentioned, the second area of ANSI C undefined behavior involves the "presence of an argument consisting of no tokens... before argument substitution...". This phrase unfortunately contradicts the definition of argument: "... sequence of preprocessing tokens ... in a macro invocation". Note also that "substitution" is the time at which expansion of the argument is considered, and hence the odd phrase cannot possibly refer to the situation where an argument is expanded, but the result is nil (or whitespace). Rather than harp on this inconsistency, we will discuss what perhaps is a related (or even intended) problem: What is the interpretation of a missing argument (or at least white space where an argument should be)?

It is clearly stated that the number of arguments in a macro invocation must agree exactly with the number of arguments in the corresponding macro definition, and hence JRCPP generates a diagnostic if this is not the case. In addition, JRCPP enlists an error recovery strategy that consists of substituting whitespace (which is clearly not a valid ANSI C argument) for any missing arguments. This strategy is intended to be compatible with some prior non-ANSI implementations.
Note that this action can cause a secondary error if the macro attempts to use this argument in certain ways. The following examples demonstrate these secondary errors:

    #define paste_left(x) something ## x
    #define paste_right(x) x ## something
    #define stringize(x) # x
    paste_left(/*white*/)
    paste_right(/*white*/)
    stringize(/*white*/)

Fortunately, most code that exploits this non-ANSI behavior (missing argument is actually whitespace) does not use the paste operator (##) or the stringize operator (#), hence this secondary error will tend not to occur. Applications that are porting non-ANSI code through JRCPP may then choose to lower the severity level of the diagnostic that reports the whitespace argument, and accept the error recovery procedure as reasonable. (User feedback on my error recovery scheme may improve compatibility with other implementations).

3.8.3.2 LANGUAGE- PREPROCESSING DIRECTIVES- The # operator

The # operator provides the stringizing functionality to ANSI C, that was often provided via expansion of parameters within string literals in older (and NON-ANSI) implementations. One key point that should be stressed is that when this functionality is used, the argument is NOT macro expanded before being stringized (i.e., placed into quotes). For example:

    #define stringize(x) #x
    #define A 2
    stringize(2)    /* becomes "2" */
    stringize(A)    /* becomes "A" */

If a user wants the argument to be expanded and THEN stringized, the following construction should be used:

    #define stringize(x) # x
    #define expand_then_stringize(x) stringize(x)
    #define A 2
    expand_then_stringize(2)    /* becomes "2" */
    expand_then_stringize(A)    /* becomes stringize(2), which becomes "2" */

The Standard indicates that the order of evaluation of the operators # and ## is unspecified.
Since the token to the right of the stringize operator (#) must be a parameter, it would appear that the following are the only two cases to consider:

        #define F(x) word ## # x
        #define G(x) # x ## other

JRCPP follows the C tradition of providing higher precedence for unary operators than for binary operators. JRCPP parses the definition of F as an attempt to paste a stringized version of parameter x onto the right side of the identifier `word'. This decision is a bit immaterial, as the result of pasting any valid preprocessing token to the left side of a string literal (the stringized version of x) is almost always an invalid preprocessing token. Similarly, the definition of G provides a request to paste the word `other' onto the right side of the stringized version of parameter x. Defining the precedence in any other way appears to be of equally little use.

3.8.3.3 LANGUAGE- PREPROCESSING DIRECTIVES- The ## operator

The pasting operator ## supplies the functionality in ANSI C that was supplied in various compilers in the past by means of various hacks. Most hacks were based on methods of getting adjacent identifiers to "flow together". The two methods that I am aware of are:

        #define f() start
        f()end    /* for SOME non-ANSI cpp, becomes: `startend' */

and

        #define g start
        g/*comments go away*/end    /* some non-ANSI cpp: `startend' */

Neither of these constructs is supported under ANSI C, and in both cases JRCPP defaults to producing the two tokens `start' and `end', separated by a space. The first of the two approaches is supported via a pragma under JRCPP (see #pragma space_between_tokens).

It should be emphasized that, just as with the stringize operator, arguments are NOT expanded, prior to insertion in the replacement list, at points where they are about to be subjects of a ## operator.
For example:

        #define A 2
        #define append(x) x ## right left ## x x
        append(A)    /* becomes: Aright leftA 2 */

If the user desires an expansion prior to pasting, the construct described earlier with regard to stringization must be used.

3.8.3.5 LANGUAGE- PREPROCESSING DIRECTIVES- Scope of macro definitions

This section of the standard has some very nice examples of the process of macro expansion. These include the use of the paste operator (##), the stringize operator (#), and the prevention of infinite recursion of macros. If the user tries these torture tests on JRCPP, rather than reveal an error in JRCPP, they will reveal a typo in the ANSI C Standard. The specific error in the standard involves:

        #define str(x) #x
        str(: @\n)

which the standard incorrectly expands to ": \n", but should have expanded to ": @\\n". The goal of the stringize operation is to produce text that can print exactly as the argument that was supplied. The standard is clear on this point, and other items in the expansion demonstrate this. Unfortunately, I am sure that this typo in the Standard will also be the source of many bug reports.

The user that is trying these tests should also run them with the pragma space_between_tokens set to off, if they would like the format to be closer to that of the listing in the Standard. In either case the results should be correct.

The user may also note a slight discrepancy in the format of the output, due to the fact that JRCPP maintains line position information much more accurately than most other preprocessors. In this regard, consider the example:

        #define swap(x,y) y x
        swap (
        +i +j,

        +k +l)

Most preprocessors produce the output:

        #line 2
        +k +l +i +j

Whereas JRCPP produces:

        #line 5
        +k +l
        #line 3
        +i +j

The big advantage of the method provided by JRCPP is exposed when compiler diagnostics wish to refer to a token in such a stream.
When JRCPP is used, a diagnostic for "syntax error on token `+'" can be very specific about the line number with the offending character. With other preprocessors, the user is just told the error was on line 2. (As a historical note, many pre-ANSI preprocessors required that the macro name and all the arguments be placed on a single line. Many users have, as a result, built large logical lines when a macro was being invoked. ANSI C established a standard whereby this was no longer necessary, but many compiler manufacturers are slow to service the users that have moved to this more readable notation.)

3.8.4 LANGUAGE- PREPROCESSING DIRECTIVES- Line control

The #line directive is fully supported by JRCPP. There are several points to note about its performance. The first item worthy of note is that the standard provides for macro expansion of the tokens on the logical line with the #line directive. Unfortunately, it does not provide for arithmetic reductions. The result of macro expansion must be either a digit sequence (representing a line number), or a digit sequence followed by a string literal (file name). Note that the line directive requires a string literal, and has no consideration of the sort of context that was provided for evaluation of a file name in an include directive. This distinction means that when a backslash is used in a file name that is specified using a #line directive, it must be "escaped" using another backslash.

3.8.6 LANGUAGE- PREPROCESSING DIRECTIVES- Pragma directive

JRCPP makes extensive use of pragmas in order to direct customization of the performance of the preprocessor. Users should refer to the JRCPP Users Guide for a complete list of valid pragmas, along with their meaning. As per the Standard, unrecognized pragmas are ignored by JRCPP.
In order to facilitate the use of pragma directives to control the compiler, unrecognized pragmas are passed unchanged (except for comment removal and whitespace compaction) through to the post-preprocessed output file. Note that because some pragmas do modify the macro database, pragma directives are not permitted within macro invocation argument lists.

If there is some need to pass forward a pragma to the compiler without having it acted upon by the preprocessor (for example, when JRCPP would misunderstand it), then the following sort of approach can be taken:

        #define HIDE_PRAGMA
        HIDE_PRAGMA # pragma any tokens

Due to the fact that the results of expansion will NEVER be considered by JRCPP to be a directive, the pragma presented in this cloak will not be processed by JRCPP. Unfortunately, when a user is forced to this extreme, the protection against macro expanding the list of tokens for an unknown pragma is lost. JRCPP has endeavored to use novel names that should not clash with the specifications of many other implementations.

3.8.8 LANGUAGE- PREPROCESSING DIRECTIVES- Predefined macro names

All five of the ANSI C predefined macros are supported by JRCPP. The macros are:

__LINE__    Current presumed line number in the source file. The value of __LINE__ will be in the range permissible for a signed long integer.

__FILE__    Current presumed file name, provided as a string literal. Note that because __FILE__ is a string literal, any occurrences of the backslash character in the actual source name have been replaced by `\\' (an escaped backslash).

__DATE__    The date on which JRCPP began to preprocess the original source file, expressed as a character string. The format will always be "Mon dd yyyy", where Mon is the abbreviation for the month, dd is a 1 or 2 digit representation of the day of the month, and yyyy is the 4 digit calendar year.

__TIME__    The time at which JRCPP began to preprocess the original source file, expressed as a character string. The format is "hh:mm:ss", where hh is the number of hours past midnight local time, mm is the number of minutes past the hour, and ss is the number of seconds past the whole minute.

__STDC__    The integer constant 1. This constant is meant to indicate that the compiler/preprocessor is a conforming ANSI C implementation.

The Standard prohibits redefining any of these macro names, as well as attempting to undefine any of them. Similarly, the standard precludes the use of #define or #undef on the identifier `defined' (which has special meaning in evaluating a #if/#elif directive).

JRCPP supports the full ANSI C Standard as indicated above, but maintains customization features that allow it to be modified slightly to be non-conforming. One major point is that most of the time, the compiler that JRCPP is preprocessing for is not ANSI conformant. With this situation in mind, there must be some mechanism for undefining __STDC__. Special pragmas have been provided in JRCPP to accomplish most of these tasks (see #pragma undefine_macros).

As a special note, the pragma that switches to C++ mode (see #pragma cplusplus_mode) has the following effect: The macro __STDC__ is undefined, and the macro __cplusplus is defined. Moreover, this new macro __cplusplus has the same reserved status (i.e.: cannot be #define'd or #undef'ed) as __STDC__ has under default JRCPP. In addition, the one line // style comments are also supported in C++ mode.