The C Users' Group Library 1994 August

home *** CD-ROM | disk | FTP | other *** search

/ The C Users' Group Library 1994 August / wc-cdrom-cusersgrouplibrary-1994-08.iso / vol_300 / 357_01 / cstar1.exe / CHANGES.DOC < prev next >

Wrap

Text File | 1991-06-20 | 6KB | 110 lines

CHANGES.DOC: How the CSTAR compiler could be improved June 25, 1991 If I were to do CSTAR all over again I would make several significant changes. Although I doubt that anyone would want to revise CSTAR a whole lot, the following comments would also apply to speeding up a C compiler, so they may be of more general interest. What follows are terse hints about how to improve CSTAR's performance and coding style. I would be glad to discuss any of these ideas at greater length with you. Just give me a call. 1. I would make CSTAR token based, rather than character based. CSTAR spends a lot of time scanning identifiers and copying characters from one buffer to another. (I've measured.) Tokens would be created for identifiers, strings, constants and other entities very early on. 2. I would make CSTAR a multiple pass compiler. This would significantly *increase* its speed because only the first two passes would deal with characters. All other passes would deal with tokens. The pass structure would closely follow the translation phases described in the ANSI C Standard. Pass one would be a state machine which read the entire file into memory and then gather statistics about the file. These statistics would be used to allocate memory for all strings, identifiers and tokens. This is an important point: although this initial pass would seem to be pure overhead, it would *greatly* simplify later storage allocation. (No more testing to see if we have to allocate more memory.) Storage for all tokens, symbol tables and strings (including the spellings of identifiers) would then be allocated *all at once.* Pass two would also be a state machine, similar to the first pass. Pass two would simply "fill in" the already allocated storage with all the tokens and strings. Symbol tables would be created in a later pass. Later passes would handle parsing and output. By the way, I am not sure that the initial statistics gathering pass would really be a good idea, and it is worth looking into. If if doesn't turn out it could be abandoned without giving up tokens. 3. I would distinguish between an id's spelling table and its symbol table. The spelling table would be created when the id is first tokenized. It would contain information about whether the id could be a reserved word, is currently defined as a macro, etc. The symbol table would be used during parsing, and would contain a pointer to the spelling table. The name table would have the following format: attribute byte: reserved-word and macro-expansion bits and possibly other bits. length byte: the length of the identifier. the spelling of the identifier itself, \0 terminated. 4. I would have CSTAR look up an id only once, when it is first tokenized. All further references to an identifier would be via a pointer to the id's spelling table entry which would be part of the id's token. This will save *lots* of time. 5. I would rewrite the semantic routines. (Notice all the bug fixes in sem.c) I tried to do all the semantic actions on the fly which turned out to be a big mistake. It would be much better, and probably *faster*, to use three passes for doing semantic analysis. The first pass would build a full parse tree for declarations, the second pass would do flow analysis, and a third pass would output the program including the Sherlock macros. Having a tokenized parse tree helps a lot, since each pass can move back and forth in the tree as required. 6. I would scrap the file copier logic in sysnext. Although this logic works, it is *very* slow since it is used on every input character. CSTAR should produce its output in a final pass after all tokenizing and parsing is complete. That way very little scanning will need to be done. To summarize these first 6 points, I would use tokens and multiple passes instead of a single-pass character-oriented approach. The remaining points deal with programming style. 7. I would use stream routines in the macro expansion logic instead of a profusion of buffers. (See the code in def.c) The stream routines would hide the details of testing for full buffers and allocating new buffers. Although the details of macro expansion and token pasting are inherently complex, using streams would have significantly simplified matters. (Dealing with tokens instead of characters would also remove some complications from the macro expansion code.) 8. I would use many more header files. This is made possible and desirable by several new features in the new ANSI C Standard. First, the Standard defines many new headers, which I would naturally incorporate into the program. Second, having function prototypes is a godsend, so I would define a header file for just about every .c file. The Macintosh version of CSTAR uses this new style and it has been quite an improvement. I have found two headers to be particularly useful: env.h and ver.h. env.h contains #defines that describe the current environment. So any machine- dependent code is enclosed in #ifdef ENV_XXX code #endif The ver.h header contains strings describing the current version of the code. Although my opinions are based on statistics and a great deal of reflection I am not attached to them. If you have any questions or comments or other suggestions I would be happy to hear from you.