home *** CD-ROM | disk | FTP | other *** search
Text File | 1991-03-02 | 34.6 KB | 1,504 lines |
- .\" @(#)doc.ms 1.6 Date 87/06/29
- .if t .rm CM
- .ds LH Holistic Technology AB
- .ds CH PTC
- .ds RH HD870410-1 Rev: 1.6
- .ds LF
- .ds CF
- .ds RF
- .nr LL 6.5i
- .nr PO 1i
- .nr HM 0.75i
- .nr FM 1.5i
- .ie t .pl 842p
- .el .pl 72v
- .ch BT -\n(FMu/2u
- .\" 3 tabs/inch at 12 cpi corresponds to 4 chars/tab
- .nr t 1i/3u
- .am DS
- .ta \ntu +\ntu +\ntu +\ntu +\ntu +\ntu +\ntu +\ntu +\ntu +\ntu +\ntu +\ntu +\ntu +\ntu +\ntu +\ntu +\ntu +\ntu
- ..
- .LP
- .rs
- .sp |8c
- .nf
- .ce 10
- .if t .ps +6
- .B "PTC implementation note"
- .if t .ps -6
-
- by
-
- Per Bergsten
-
- Holistic Technology AB
- Grona Gatan 59
- 414 54 Gothenburg
- Sweden
- .ce 0
- .sp 12
- .LP
- This note describes the implementation of ptc, a Pascal to C translator.
- The program was developed by Per Bergsten of Holistic Technology AB,
- Gothenburg, Sweden.
- The paper is intended to provide a guide for those who need to transport
- ptc to a new environment,
- it describes how Pascal constructs are mapped onto C constructs.
- .bp
- .NH S 1
- Background
- .LP
- .ds CF - % -
- .nr % 1
- The aim was originally to provide a simple tool for transporting finished
- applications to systems lacking Pascal compilers.
- The first versions,
- constructed in about 1984,
- were capable of translating simple Pascal programs.
- It was never intended to become a released product,
- however,
- as time went by, programs and ambitions grew to the point where nearly
- the full Pascal language was supported.
- Thus the program as it stands today has a long ancestry
- but it has never been redesigned (which it should have been).
- .NH 1
- Pascal vs C
- .LP
- To anyone familiar with the two languages it is obvious that
- they are very similar in structure.
- The major features may be summarised as follows:
- .TS
- c c
- l l .
- Pascal C
- .sp
- Block-structured Block-structured
- - multiple declaration levels - single declaration level
- Statically scoped Statically scoped
- Strongly typed Weakly typed
- - allows unbounded pointers
- Call by value Mostly call by value
- - and call by reference
- Highly independent Highly integrated
- - of host system - with system
- Self contained Allows external definitions.
- .TE
- .LP
- On the syntactic level the languages differ only in minor ways,
- mostly in the order in which keywords and other symbols occur,
- and of course in that the languages uses different symbols for the same
- purposes.
- The only complication is that C prohibits nested subroutine declarations.
- .LP
- On the semantic level the situation is worse.
- C has the peculiarity that array variables are treated differently from other
- variables,
- this forces us to adopt some general way to handle array variables.
- Furthermore, since Pascal offers nested subroutine declarations
- it becomes necessary to simulate part of the activation record mechanism
- in the translated code,
- in one case it is extremely difficult to completely do this.
- It is also clear that the C typedef mechanism has some shortcomings that
- preclude an easy translation of Pascal types.
- .bp
- .NH 1
- Mapping Pascal to C
- .LP
- In this part of the paper we briefly illustrate how to translate
- Pascal code into equivalent C code.
- .NH 2
- Programs
- .LP
- A minimal Pascal program:
- .DS
- program p;
- begin
- end.
- .DE
- translates to the C equivalent:
- .DS
- extern void exit(\^);
- main(\^)
- {
- exit(0);
- }
- .DE
- .LP
- It should be noted here that
- the translator does not actually require a complete Pascal program,
- the implementation is such that any consistent set of declarations can
- be translated.
- .NH 2
- Lexical issues
- .LP
- The C language uses ASCII as a carrier,
- almost all of the availible characters are used.
- The Pascal definition uses a smaller set of characters.
- Since few features of the languages depend on the underlying character set
- this does not introduce much difficulties.
- .LP
- One serious problem does occur.
- Both language definitions states that comments have no meaning and no
- clear place in the language syntax.
- Furthermore, the Pascal definition states that a comment is equivalent
- to a blank character.
- This implies that if comments are handled accurately
- the translator should also be able to collect and classify each
- blank character in a Pascal program and to generate a C program with the
- same number of blank characters in the corresponding positions.
- This implication conflicts with the fact that the languages have different
- syntax rules, it is not obvious what the "corresponding positions" would be.
- .LP
- Since comments have no defined meaning a user is free to interpret them
- in any way and, in particular, to associate them with the surrounding code
- in any way s/he chooses.
- Although humans usually are able to deduce what bearing a comment has on the
- surrounding program code there are no formal rules for how to do this.
- Therefore there is no
- .I "a priori"
- correct way to translate comments
- and the translator described here ignores comments altogether.
- If/when a reasonable
- .I "ad hoc"
- solution is found that feature may be incorporated.
- .NH 2
- Declarations
- .LP
- The program may introduce local declarations which are handled as follows.
- .bp
- .NH 3
- Labels
- .LP
- .DS
- program p;
-
- label 9;
-
- begin
- 9:
- goto 9
- end.
- .DE
- which we simply translate into:
- .DS
- extern void exit(\^);
- main(\^)
- {
- L9:
- goto L9;
- exit(0);
- }
- .DE
- 'ne 12
- .LP
- If the label is reached from an inner procedure:
- .DS
- program p;
-
- label 100;
-
- procedure q;
-
- begin
- goto 100
- end;
-
- begin
- 100:
- end.
- .DE
- a more complicated translation must be used:
- .DS
- # define Line __LINE__
- void Caseerror(\^);
- # include <setjmp.h>
- static struct Jb { jmp_buf jb; } J[1];
-
- void
- q(\^)
- {
- longjmp(J[0].jb, 100);
- }
-
- main(\^)
- {
- if (setjmp(J[0].jb))
- goto L100;
- L100:
- exit(0);
- }
- .DE
- .LP
- We assume the existence of the standard
- .I setjmp(\^)
- and
- .I longjmp(\^)
- library functions.
- Jumpbuffers are statically allocated as needed depending on the number
- of declarationlevels in the program.
- .NH 3
- Constants
- .LP
- Constant declarations are treated in two different ways.
- Numbers aliases etc are simply
- .I "# define" 'd
- but string constants are converted to static character arrays in order
- to avoid unnecessary duplication of string-constants in the object code,
- thus:
- .DS
- const
- p = 3.14;
- pie = '3.14';
- .DE
- translates to:
- .DS
- # define pi 3.14
- static char pie[\^] = "3.14";
- .DE
- .NH 3
- Types and variables
- .LP
- Types and variables are mostly easy to translate:
- .DS
- program p;
-
- const length = 15;
-
- type
- struct = 0 .. length;
- vect = array [ struct ] of struct;
- color = (red, blue, ada, yellow);
- pointer = ^ recd;
- recd = record
- r : pointer;
- case b : boolean of
- false: (c : color);
- true: (v : vect);
- end;
-
- var SP : pointer;
-
- begin
- new(SP, true);
- end.
- .DE
- becomes
- .DS
- typedef char boolean;
- # define false (boolean)0
- # define true (boolean)1
- extern char *Bools[\^];
- # define Unionoffs(p, m) (((long)(&(p)->m))-((long)(p)))
- extern char *malloc(\^);
- # define length 15
- typedef unsigned char C47_struct;
- typedef struct { C47_struct A[length + 1]; } vect;
- typedef enum { red, blue, ada, yellow } color;
- typedef struct S57 *pointer;
- typedef struct S57 {
- pointer r;
- boolean b;
- union {
- struct {
- color c;
- } V1;
- struct {
- vect v;
- } V2;
- } U;
- } recd;
- pointer sp;
-
- main(\^)
- {
- sp = (struct S57 *)malloc((unsigned)
- (Unionoffs(sp, U.V2.v) + sizeof(sp->U.V2)));
- exit(0);
- }
- .DE
- The rationale is as follows:
- .LP
- Identifiers in the Pascal program which conflicts with reserved words in C are
- renamed by adding a unique prefix Cnnn where nnn is a number.
- .LP
- We also note here that uppercase letters in identifiers and keywords
- are translated to lowercase.
- Pascal specifies that upper/lower case is insignificant whereas C
- (for the present) separates the two.
- This fact is used to advantage by the translator as all
- subroutines and macros defined by the translator have an initial uppercase
- letter which prevents confusion.
- .IP \-
- The type
- .I boolean
- is a predefined Pascal type,
- when it is used the translator emits code which defines boolean to
- be the smallest convenient type:
- .I char .
- The constants
- .I false
- and
- .I true
- are defined and the vector
- .I Bools
- will contain textstrings for output if needed.
- .IP \-
- The predefined types
- .I integer
- and
- .I real
- are likewise mapped directly onto the standard C types
- .I int
- and
- .I double
- through a typedef declaration.
- .sp
- Integer subranges are mapped onto standard C arithmetic types according to
- a short table in the translator.
- The table is scanned from top to bottom until an enclosing range is found
- and the corresponding type-name is emitted.
- .IP \-
- C-arrays have peculiar semantix.
- To unify the treatment of arrays and other datatypes we always encapsulate
- an array in a struct,
- thus an array always becomes a
- .I struct
- with a single member named A.
- .IP \-
- Records and their variants are translated into C
- .I struct
- and
- .I union
- definitions.
- Since C requires all union members to have a name we must supply artificial
- names for all record variants.
- A record with variants will therefore always contain one field with the name U
- which have sub-fields named Vnnn where nnn is a positive number.
- .sp
- 'ne 12
- When allocating dynamic storage for a record with variants naming
- the desired variant
- .DS
- new(sp, true)
- .DE
- we face the problem of determining the amount of memory needed.
- .QP
- .B
- C does not provide a safe way
- to compute the size of a particular struct variant.
- .IP
- The strategy adopted to cope with this problem is to attempt to compute the
- offset of a fieldmember in the variant matching the constant and then
- to add the size of the variant.
- The offset computation is expressed as a macro,
- .I Unionoffs ,
- which uses rather foul typecasting to achive the result.
- The only availible alternative would be to use the same size of all variants.
- This method,
- while being safe,
- wastes memory when many small and few large
- records of the same type are created dynamically.
- .IP \-
- Pascal enumeration types are converted directly to C
- .I enum
- types.
- .IP \-
- Pascal pointer types are translated into C pointer types.
- Pascal allows the programmer to construct recursive types as pointer
- types need not be fully defined until the end of the declaration-section
- where the pointer type is used.
- In practice this is only used to introduce record types which
- contain pointers to themselves.
- This problem is partially solved by introducing a name for the record type.
- .ne 10
- Hence
- .DS
- type
- ptr = ^ node;
- node = record
- next : ptr;
- info : integer
- end;
- .DE
- becomes
- .DS
- typedef struct S56 * ptr;
- typedef struct S56 {
- ptr next;
- integer info;
- } node;
- .DE
- we note in passing that the problem cannot be completely solved since
- .DS
- type pureptr = ^ pureptr;
- .DE
- which is valid Pascal, cannot be expressed in C.
- .ne 10v
- .IP \-
- A pascal set-type does not have any direct counterpart in C.
- The C language does however have a adequate set of operators for
- bit manipulation.
- We use these to implement a Pascal set as an array of
- .I setword .
- So:
- .DS
- type
- s = set of 1 .. 100;
-
- var
- ss : s;
- .DE
- is translated into:
- .DS
- typedef unsigned short setword;
- typedef struct { setword S[8]; } s;
-
- s ss;
- .DE
- The situation is slightly complicated by the fact that Pascal has
- a set constructor which permits the construction of arbitrary large sets,
- for example:
- .DS
- s := [ 50 .. maxint ] * [ 1 .. 80 ]
- .DE
- for that reason the first member in the array of words gives
- size of the set (in words).
- In the example above s.S[0] would have the value 7,
- and s.S[1] through s.S[7] would hold the bits.
- The number 7 is computed on the assumption that the type
- .I "unsigned short"
- on the target host is sufficiently large to holds 16 bits.
- The set operators of Pascal are implemented as C functions returning
- pointers to arrays of setwords,
- the intermediary results are held in a static area of fixed size.
- .IP \-
- Pascal files are implemented using the standard i/o package availible
- in most C implementations.
- Since Pascal has the requirement that the next element of a file is visible
- in the filebuffer before it is read,
- and the requirement that linemarkers in textfiles are given special treatement
- we are forced to extend the
- .I FILE
- type provided in
- .I <stdio.h> .
- .ne 20
- Hence:
- .DS
- var f : text;
- .DE
- becomes
- .DS
- typedef struct {
- FILE *fp;
- unsigned short
- eoln:1,
- eof:1,
- init:1,
- out:1,
- :12;
- char buf;
- } text;
- text f;
- .DE
- where
- .I buf
- is our filebuffer and
- .I eoln ,
- .I eof
- and
- .I init
- are flags giving the state of the file.
- All Pascal i/o operations are implemented using macros that maintain the flags
- and the buffer variable.
- The actual reading and writing of data is deferred to the standard i/o package.
- .NH 3
- Procedures and functions
- .LP
- Pascal procedures and functions are somewhat difficult to translate to C.
- The main problems lie in nested declarations and procedural parameters.
- Nested declarations are handled in the following manner:
- .RS
- .IP \-
- If procedure B is declared in procedure A,
- then procedure B is emitted before A and A is forward declared before B.
- .IP \-
- Any constants and types declared in A is moved to the global scope,
- this may force renaming of those objects.
- .IP \-
- Any variable declared in A
- .I
- and used in B
- .R
- is converted to a pointer and moved to the global scope.
- The pointer value is saved and re-initialized upon each entry of A
- and restored upon exit from A.
- .RE
- .br
- 'ne 20
- .LP
- Hence:
- .DS
- procedure A;
-
- const limit = 20;
-
- type loctyp = 0 .. limit;
-
- var i, j : loctyp;
-
- procedure B(k : loctyp);
-
- begin
- j := k + 2
- end;
-
- begin
- B(i)
- end;
- .DE
- becomes
- .DS
- typedef unsigned char loctyp;
- loctyp *G56_j;
-
- void a(\^);
-
- void
- b(k)
- loctyp k;
- {
- (*G56_j) = k + 2;
- }
-
- void
- a(\^)
- {
- # define limit 20
- loctyp i, j;
- loctyp *F57;
-
- F57 = G56_j;
- G56_j = &j;
- b(i);
- G56_j = F57;
- }
- .DE
- we see that references to
- .I j
- inside procedure
- .I b
- are replaced by the pointer
- .I G56_j
- which points to the local variable
- .I j
- in procedure
- .I a.
- In order to preserve the proper semantix in the face of recursion
- the value of the pointer need also be saved in the local variable
- .I F57
- during the invocation of
- .I a .
- .IP \-
- Procedure parameters offer little problems.
- We translate Pascal value-parameters into ordinary C parameters
- and Pascal var-parameters are treated as pointers.
- .br
- 'ne 20
- .IP \-
- Procedural parameters appear at first to be easy to handle.
- The ordinary program:
- .DS
- program p;
-
- procedure pp(procedure q(i : integer));
-
- begin
- q(2)
- end;
-
- procedure qq(i : integer);
- begin
- end;
-
- begin
- pp(qq)
- end.
- .DE
- becomes
- .DS
- extern void exit(\^);
- typedef int integer;
-
- void
- pp(q)
- void (*q)(\^);
- {
- (*q)(2);
- }
-
- void
- qq(i)
- integer i;
- {
- }
-
- main(\^)
- {
- pp(qq);
- exit(0);
- }
- .DE
- which looks simple enough.
- .br
- However,
- Pascal requires that the scope of a procedure
- .I "remains unchanged"
- throughout its lifetime.
- .ne 35
- Consider:
- .DS
- program demo(output);
-
- var i : integer;
-
- procedure p(procedure q);
-
- var j : integer;
-
- procedure qq;
-
- begin
- writeln(j)
- end;
-
- begin
- j := i;
- q;
- if i < 1 then
- begin
- i := i + 1;
- p(qq)
- end
- end;
-
- procedure dummy;
- begin
- end;
-
- begin
- i := 0;
- p(dummy)
- end.
- .DE
- When
- .I p
- is first invoked it assigns the local variable
- .I j
- the value 0.
- This variable is accessible from
- .I qq
- which is passed as a parameter in
- the recursive call of
- .I p .
- The second invocation of
- .I p
- then sets
- .I its
- variable
- .I j
- to 1,
- and calls
- .I q
- which is bound to the first instance of
- .I qq ,
- and should therefore print the number
- .I 0 .
- .B
- Sadly,
- the code produced by the translator fails to do this.
- .R
- It is obvious that the program above calls for a complete simulation
- of the activation record mechanism of Pascal to work correctly.
- .RS
- .LP
- A workable but unpractical solution would be:
- .IP 1)
- When qq is used as parameter we call a function q1 which saves the environment
- for qq (i.e. the address of j) in a well known place
- and returns a pointer to q2.
- .IP 2)
- When qq is later called (under the name q) the actual target will be q2
- which sets up the proper environment calls qq.
- .RE
- .IP
- The problem is that this requires a save-area for
- .I each
- procedural parameter which can hold the intresting
- parts of its environment for
- .I each
- of its invocations.
- In the example above we need one are which holds the address of i
- for each instance of qq
- (say N instances, where N depends on the run-time behaviour of p).
- It also requires a set of different procedures to perform the work of q2
- (N different procedures which sets up the environment for the N instances).
- .B
- This requires much to much work to implement so the problem is left unsolved,
- .R
- this is hardly a problem in practice since humans rarely write such code
- but
- .B
- it could introduce problems
- .R
- in machine generated Pascal code.
- .IP
- It should be noted that
- the translator accepts the keyword
- .B external
- in place of the Pascal
- .B forward
- directive and assumes that the so declared procedure is defined elsewhere.
- .bp
- .NH 2
- Statements.
- .LP
- Pascal statements are comparatively easy to translate to C.
- The only parts that require special care are
- non-local
- .I goto ,
- .I with
- and
- .I for
- statements.
- The code
- .DS
- program p(output);
-
- type
- msgtyp = packed array [ 1 .. 12 ] of char;
-
- var
- a : array [ 1 .. 10 ] of
- record
- r : real
- end;
- i : integer;
- ok : boolean;
-
- procedure fatal(m : msgtyp);
-
- begin
- writeln('Fatal error: ', m)
- end;
-
- begin
- while true do
- repeat
- ok := false;
- i := 1;
- for i := i + 1 to i + 10 do
- if i > 10 then
- fatal(' 10 exceeded')
- else
- with a[i] do
- if r > 9.99e+38 then
- ok := true
- else
- writeln(r)
- until ok
- end.
- .DE
- 'ne 20
- becomes
- .DS
- typedef char boolean;
- # define false (boolean)0
- # define true (boolean)1
- typedef int integer;
- typedef double real;
-
- typedef struct { char A[12 - 1 + 1]; } msgtyp;
- typedef struct { struct S57 {
- real r;
- } A[10 - 1 + 1]; } T56;
- T56 a;
- integer i;
- boolean done;
-
- void
- fatal(m)
- msgtyp m;
- {
- (void)fprintf(output.fp, "Fatal error: %.12s", m.A),
- Putchr('\en', output);
- }
-
- main(\^)
- {
- while (true)
- do {
- done = false;
- i = 1;
- {
- integer B1 = i + 1, B2 = i + 10;
-
- if (B1 <= B2)
- for (i = B1; ; i++) {
- if (i > 10)
- fatal(*((msgtyp *)" 10 exceeded"));
- else {
- register struct S57 *W3 = &a.A[i - 1];
-
- if (W3->r > 9.99e+38)
- done = true;
- else
- (void)fprintf(output.fp, "% 20e", W3->r),
- Putchr('\en', output);
- }
- if (i == B2) break;
- }
- }
- } while (!(done));
- exit(0);
- }
- .DE
- for the following reasons:
- .NH 3
- Begin
- .LP
- The compound statements
- are very similar in the two languages and need no further explanation.
- .NH 3
- If
- .LP
- Pascal if-statements
- have the same structure and semantix as C if-statments.
- .NH 3
- Case
- .LP
- Pascal case-statements
- have the same structure and semantix as C switch-statements provided that a
- .I break
- is always added to each entry.
- .LP
- The translator supports a common Pascal extension
- in that it recognizes the keyword
- .B otherwise
- to signify a default entry in a case-statement.
- .NH 3
- Labels
- .LP
- Pascal labeled statements and labels have the same structure and semantix as
- C labeled statements except that Pascal labels are numbers where C labels
- are identifiers,
- this difference is solved by simply prefixing the labels with the letter
- .I L .
- .NH 3
- Goto
- .LP
- Pascal goto-statements
- have the same structure as C goto statements but the semantix differ in
- that Pascal allows a goto to lead out of the current procedure.
- This is implemented using the
- .I setjmp/longjmp
- library functions of C as described earlier.
- .NH 3
- With
- .LP
- The with-statement
- of Pascal has no counterpart in C.
- It is translated into nested compund statements where pointervariables,
- referencing the corresponding records,
- are declared and initialized.
- Within the scope of the with-statement,
- the accessible record fields are renamed to include the pointer.
- The effect of this is that the record address is evaluated once as the
- Pascal standard requires.
- .NH 3
- For
- .LP
- The for-statement
- of Pascal has a structure that is similar to the C for-statement
- but the semantix differ completely.
- Pascal requires that a loop be exited when the upper bound
- has been reached,
- Pascal also requires that the loop-boundaries be evaluated exactly once.
- The standard C for-loop is exited when the loop-condition becomes false.
- This implies that it is not always true that
- .DS
- for (i = 0; i <= e; i++) ;
- .DE
- behaves in the same manner as
- .DS
- for i := 0 to e do ;
- .DE
- since (in most implementations)
- the C version becomes an infinite loop if
- .I e
- equals
- .I maxint
- or if
- .I e
- is the expression
- .I i .
- For that reason Pascal for-statments are translated into compound statements
- where the upper/lower bounds of the for-loop are held in local variables.
- It is also necessary to add a conditional break-statement at
- the end of the loop.
- It is possible to obtain the more relaxed translation by configuring the
- translator appropriately (see "Tuning" below).
- .NH 3
- While
- .LP
- The while-statement behaves exactly the same in both languages.
- .NH 3
- Repeat
- .LP
- The repeat-statement of Pascal matches the do-while statement of C except
- for the trivial difference that Pascal permits a statement-list
- where C permits a single statment in the loop.
- .NH 3
- Empty
- .LP
- The empty statement has (of course) the same structure and semantix
- in both languages.
- .NH 2
- Expressions and assignments
- .LP
- In most cases Pascal expressions can be literally translated into equivalent
- C expressions.
- .IP identifiers 15
- Except where identifiers clash with reserved words or with other
- identifiers declared in the same scope,
- they may simply be printed.
- In some cases the translator is forced to rename identifiers or to
- invent new identifiers.
- .IP operators
- The operators +, -, *
- .I div
- and
- .I mod
- when applied to real or integer operands
- have exact counterparts in C and are therefore easy to handle.
- The operator for real division, /,
- corresponds to the same C operator except that the operands may be integers.
- In that case a cast is necessary.
- When the operands are sets the expression is converted into a function call.
- .sp
- The operators <, <=, >, >=, = and <> all have exact counterparts in C
- for integer and real operands.
- Most C implementations disallows
- .I enum
- operands,
- the translator therefore casts such operands to
- .I int .
- Comparisons on structures (i.e. strings or sets)
- are converted into function calls.
- .IP assignments
- Assignments are straightforward to handle since arrays are encapsulated
- in structures.
- Therefore:
- .DS
- a := b
- .DE
- becomes
- .DS
- a = b
- .DE
- .I unless
- b is a string or a set,
- in which case the assignment is converted into a function call.
- .IP indexing
- Given the translation for array declarations (described above) we are forced
- to translate
- .DS
- a[i]
- .DE
- into
- .DS
- a.A[i - c]
- .DE
- where
- .I c
- is the lower bound for the index type.
- .IP selection
- Given the translation for records with variants (described above) the
- translation of
- .DS
- a.b
- .DE
- becomes
- .DS
- a.b
- .DE
- .I or ,
- if b is declared in a variant,
- .DS
- a.Vxxx.b
- .DE
- where Vxxx is a name manufactured by the translator for the particular variant.
- .IP dereferencing
- Pointer references and
- .B var -parameters
- are handled by prefixing the expression with an asterisk,
- but the special case dereferencing followed by selection is also recognized,
- so:
- .DS
- p^ := q^.f
- .DE
- becomes
- .DS
- *p = q->f
- .DE
- .IP miscellanea
- The boolean operators
- .B and ,
- .B or
- and
- .B not
- are simply translated into their C counterparts.
- The set contructors
- .B "[\^]" ,
- and
- .B ".."
- and the operator
- .B in
- are converted to function calls.
- .bp
- .NH 1
- Implementation
- .LP
- The general strategy is to convert the Pascal source program
- into a parsetree.
- The tree is then traversed by a set of procedures that perform some
- necessary transformations.
- The tree is finally traversed by a set of procedures that print the
- corresponding C constructs.
- The translator consists of three major procedures that perform
- these actions.
- They are augmented by a set of procedures that maintain a symbol table
- that holds information about identifiers found in the source,
- and by a procedure that initializes all internal datastructures.
- .LP
- There are three major datastructures that interact in complicated ways:
- .IP 1)
- a store for identifiers and strings
- .IP 2)
- a multilevel symbol table
- .IP 3)
- a parse-tree.
- .LP
- The identifier and string store,
- .B strstor
- is in principle an array of characters that grow
- in increments of one string block.
- Exactly one copy of each identifier is stored in that array.
- Whenever an identifier is found in the input stream the array is
- scanned to see if that identifier has been seen before.
- In order to speed up the search all identifiers are represented by nodes
- which are chained together such that all nodes in a particular chain
- have the same hashvalue as determined by the function
- .B hash .
- Each
- .B idnode
- holds an index to strstor where the corresponding identifier text is stored.
- The start of the hashchains are found in the global variable
- .B idtab .
- .de Ds
- .nr x \\w'\\$2'u
- .ie t \{
- .nr x \\nx/2
- .ds \\$1 \\\\h'-\\nxu'\\$2\\\\h'-\\nxu'
- .\}
- .el .ds \\$1 \\$2\\\\h'-\\nxu'
- ..
- .ds l \-
- .ie t .ds a \z\*l\a
- .el .ds a \a\z\*l
- .nr x \ntu/2
- .ds g \z\*l\\l'\nxu\&\*l'
- .ds h \\h'\nxu'
- .Ds c +
- .Ds d \(da
- .Ds u \(ua
- .Ds r \(br
- .ds s \\*r\z\*l
- .ds > \*l\z\*l\(->
- .ds < \(<-\\h'-1n'\*l
- .ds k \\kx
- .ds b \\h'|\\nxu'
- .ds t \t
- .DS L
- .lc \*l
- .fc #
- idtab
- \*c\*a\*c
- \*r\*t\*r\*tchain of idnodes with same hashvalue
- \*c\*a\*c\*t\*c\*a\*a\*c\*t\*c\*a\*a\*c
- \*k\*r\*t\*r\*b\*h\*a#\*> #\*r\*t\*k\*t\*r\*b\*h\*a#\*> #\*r\*t\*t\*r idnode representing
- \*c\*a\*c\*t\*r\*t\*t\*r\*t\*k index=2\*b\*r\*t\*t\*r identifier "demo"
- \*r\*t\*r\*t\*c\*a\*a\*c\*t\*c\*a\*a\*c
-
- \*tstrstor
- \*c\*a\*a\*a\ \*l \*a\*a\*c
- \*r\*t\*r\*t\*r\*t\*t\*r\*t\*r
- \*k\*c\*a\*a\*a\ \*l \*a\*a\*c\*b\*h\*r
- \*h\*r
- \*h\*d
- \*c\*a\*a\*a\*a\*a\*a\*a\*a
- \*r\*t\*r\*t\*r# d #\*r# e #\*r# m #\*r# o #\*r# / #\*r
- \*c\*a\*a\*a\*a\*a\*a\*a\*a\*tfirst idblock
- \*t\*t# \*u #
- \*t\*tindex=2
- .DE
- .LP
- So:
- the global representation of the identifier "demo" is a particlular
- idnode;
- whenever the lexical analysis routine
- recognizes the identifier "demo" it returns a pointer to that idnode.
- .LP
- In order to represent different identifiers with the same name we need to
- be able to distinguish between identifiers declared in different scopes.
- This is accomplished by the
- .B symnode
- structures.
- When an identifier is first encountered in a given scope it is "declared",
- meaning that a new symnode is created that references the identifier.
- Occurrences of the same identifier in that scope are then represented in
- the parse-tree by treenodes referencing the same symnode.
- .bp
- The program:
- .DS
- program p;
-
- var demo : integer;
-
- begin
- demo := 3
- end.
- .DE
- yilds the following structure:
- .DS L
- .lc \*l
- .fc #
- \*t# top #
- \*t# \*d #
- \*t\*c\*a\*a\*a\*a\*c treenode representing
- \*t\*k npgm\*b\*r\*t\*t\*t\*t\*r the program
- \*t\*c\*a\*s\*a\*s\*a\*s\*a\*c
- \*t\*t\*r\*t\*r\*h\*u\*t\*r\*h\*u
- \*t\*t\*r\*t\*r\*h\*r\*t\*r\*h\*c\*a\*a\*a\*a\*a\*a\*a\*g\*c
- \*t\*t\*r\*t\*r\*h\*r\*t\*c\*a\*a\*a\*a\*a\*a\*a\*c\*h\*r
- \*t\*t\*r\*t\*d\*h\*r\*t\*t\*t\*t\*t\*t\*t\*t\*r\*h\*r
- \*t\*t\*r\*h\*c\*a\*g\*s\*a\*a\*c\*k treenode representing\*b\*t\*t\*t\*t\*t\*t\*r\*h\*r
- \*t\*t\*r\*h\*k nvar\*b\*r\*t\*t\*t\*\*kr the var-declaration\*b\*t\*t\*t\*t\*t\*t\*d\*h\*r
- \*t\*t\*r\*h\*c\*g\*s\*a\*s\*a\*c\*t\*t\*t\*c\*a\*a\*a\*g\*s\*a\*c treenode repr.
- \*t\*t\*r\*t\*r\*h\*u\*t\*r\*t\*t\*t\*t\*k nassign\*b\*r\*t\*t\*t\*t\*r assignment
- \*t\*t\*r\*t\*r\*h\*r\*t\*c\*a\*k\*> to type\*b\*t\*t\*t\*c\*a\*s\*a\*a\*s\*a\*c
- \*ksymtab\*b\*t\*t\*r\*t\*d\*h\*r\*t\*t\*t\*t\*t\*t\*d\*h\*u\*t\*t\*d\*h\*u
- \*c\*a\*c\*t\*r\*h\*c\*a\*g\*s\*a\*c\*k treenode repr.\*b\*t\*t\*t\*t\*c\*a\*g\*s\*a\*c\*h\*c\*a\*g\*s\*a\*a\*c
- \*r\*t\*r \*<\*a\*c\*h\*k nid\*b\*r\*t\*t\*r\*k occurrence of\*b\*t\*t\*t\*t\*k nid\*b\*r\*t\*t\*r\*h\*k ninteger\*b\*r\*t\*t\*t\*r
- \*t\*t\*h\*c\*a\*s\*a\*c\*k id. "demo"\*b\*t\*t\*t\*t\*c\*a\*s\*a\*c\*h\*c\*a\*a\*s\*a\*c
- \*r\*t\*r\*t\*t\*r\*h\*u\*t\*t\*t\*t\*t\*t\*r\*t\*t\*t\*r\*h\*u
- \*c\*a\*c\*t\*t\*r\*h\*r\*t\*c\*a\*a\*a\*a\*a\*c\*t\*t\*t\*r\*h\*r
- \*r\*t\*r\*t\*t\*d\*h\*r\*t\*d\*t\*t\*t\*t\*t\*t\*t\*t\*d\*h\*r
- \*c\*a\*c\*t\*c\*a\*g\*s\*a\*a\*c\*k symnode representing\*b\*t\*t\*t\*t\*t\*t\*c\*a\*g\*s\*a\*g\*c
- \*r\*k\*t\*r\*b\*h\*a\*>\*t\*k lidentifier\*b\*r\*t\*t\*t\*r\*k identifier "demo"\*b\*t\*t\*t\*t\*t\*t\*k linteger\*b\*r\*t\*t\*h\*r
- \*c\*a\*c\*t\*c\*a\*s\*a\*a\*c\*k in the current scope\*b\*t\*t\*t\*t\*t\*t\*k linum=3\*b\*r\*t\*t\*h\*r
- \*t\*t\*t\*r\*t\*t\*t\*t\*t\*t\*t\*t\*c\*a\*a\*g\*c
- \*kidtab\*b\*t\*t\*t\*c\*a\*a\*a\*c
- \*c\*a\*c\*t\*t\*t\*t\*t\*r
- \*r\*t\*r\*t\*t\*t\*t\*t\*d
- \*c\*a\*c\*t\*c\*a\*a\*c\*t\*c\*a\*a\*c
- \*k\*r\*t\*r\*b\*h\*a#\*> #\*r\*t\*k\*t\*r\*b\*h\*a#\*> #\*r\*t\*t\*r idnode representing
- \*c\*a\*c\*t\*r\*t\*t\*r\*t\*k index=2\*b\*r\*t\*t\*r identifier "demo"
- \*r\*t\*r\*t\*c\*a\*a\*c\*t\*c\*a\*a\*c
-
- \*tstrstor
- \*c\*a\*a\*a\ \*l \*a\*a\*c
- \*r\*t\*r\*t\*r\*t\*t\*r\*t\*r
- \*c\*g\*s\*a\*a\*a\ \*l \*a\*a\*c
- \*h\*r
- \*h\*d
- \*c\*a\*a\*a\*a\*a\*a\*a\*a
- \*r\*t\*r\*t\*r# d #\*r# e #\*r# m #\*r# o #\*r# / #\*r
- \*c\*a\*a\*a\*a\*a\*a\*a\*a\*tfirst idblock
- \*t\*t# \*u #
- \*t\*tindex=2
- .fc
- .lc
- .DE
- We see that the two occurrences of the identifier "demo" are represented by
- two
- .B treenodes
- of variant
- .B nid
- that both have pointers to the same
- .B symnode
- representing the declaration of "demo".
- All symnodes at a given declarationlevel are chained together (in the
- same manner as the idnodes) and the chains are attached to the global
- variable
- .B symtab .
- In order to find out if an identifer is declared in the current scope we
- scan the chain of symnodes starting from symtab, and check if any of them
- references the idnode we are looking for.
- A symnode also have a pointer to the treenode which defines the symbol.
- The
- .B symtabs
- themselves are also chained,
- the chain implements a stack of declarationlevels.
- The symtab which is created when the scope of a procedure is entered
- is also attached to that procedure.
- When a procedure is entered we create a new symtab, push it onto the stack,
- parse the procedure and pop the stack.
- The symbols then visible are those that still can be reached from the stack.
- .LP
- The parse-tree consists of
- .B treenodes
- representing each declaration, statement, expression etc.
- Each node except the nprogram node
- has a pointer to its immediate father (giving the defining point),
- to its children (if it has any) and to its right brother (if it is
- a member of a list).
- The top of the tree is found in the global variable
- .B top .
- .LP
- In order to find the defining point for the identifier in the assignment,
- we follow pointers from the nassign-treenode
- to the nid-treenode, to the lidentifier-symnode,
- and then up to the nid-treenode in the declaration.
- If we want to know the type of the identifier
- we proceed up to the nvar-treenode
- and then down to the node giving the type in the declaration
- (in our example that node would also be a nid-treenode referencing a
- linteger-symnode that represents the identifier "integer").
- There is a function
- .B typeof
- that performs exactly this operation.
- It will return a pointer to a npredef-, nsubrange-, nscalar-, nptr-
- narray-, nrecord-, nfileof- or nsetof-treenode.
- In those cases where the translator pays attention to types
- it uses a pointer (obtained from typeof) as representation of a type.
- .LP
- Given the parse-tree and the layered symbol table
- it is not hard to see how the translations described earlier are performed.
- The one place where the code becomes more than usually complex is when a
- .B write
- statement with format specifications is to be translated into a call to
- .B fprintf .
- .bp
- .NH 1
- Tuning
- .LP
- The behaviour of the translator may be modified for some cases simply
- by changing constants.
- These constants are all located at the top of the program text.
- .IP output 12
- The translator will copy the source during input if the constant
- .B echo
- is true.
- The copy is preceeded by the line
- .DS
- # ifdef PASCAL
- .DE
- and ended by the line
- .DS
- # else
- .DE
- and then follows the translated code
- and finally
- .DS
- # endif
- .DE
- This may be used to hold both Pascal and C source in the same file.
- If the Pascal part is changed the C part may be updated through:
- .DS
- cpp -P -DPASCAL < orig > tmp
- ptc < tmp > new
- move new orig
- .DE
- assuming that
- .B echo
- is true and that
- .B cpp
- is the standard C preprocessor.
- .DS
- Default value:
-
- echo = false;
- .DE
- .IP comments
- The translator recognizes both (* and { as comment delimiters.
- They are treated as different,
- allowing 1 level nested comments,
- if the constant
- .B diffcom
- is true.
- .DS
- Default value:
-
- diffcomm = false;
- .DE
- .IP symbols
- The translator accepts default entries in case-statements provided
- that the keyword defined through
- .B othersym
- is used in place of the constant list.
- .DS
- Default value:
-
- othersym = 'otherwise ';
- .DE
- substitute for
- .DS
- othersym = 'otherwise%';
- .DE
- if that feature is undesired.
- .IP
- The translator accepts externally declared procedures and functions
- provided that the directive defined through
- .B externsym
- is used in place of the keyword
- .B forward .
- .DS
- Default value:
-
- externsym = 'external ';
- .DE
- .IP sets
- Sets are implemented as arrays of
- .B wordtype .
- The type is assumed to hold
- .B "setbits + 1"
- bits numbered from 0.
- .DS
- Default value:
-
- wordtype = 'unsigned short';
- setbits = 15;
- .DE
- .ne 10
- .IP files
- The implementation of files uses a flag-word which has the type given as
- .B filebits
- which is assumed to hold
- .B "filefill + 4"
- bits.
- .DS
- Default value:
-
- filebits = 'unsigned short';
- filefill = 12;
- .DE
- .ne 20
- .IP stmts
- If the Pascal source is known to be "normal" in the sense that for-loops
- always have an upper bound that is less than the maximum value of the
- type of the loop-variable, and in the sense that the upper bound doesn't
- change by computations inside the loop, then the translator may generate
- a more natural code.
- I.e:
- .DS
- for i := 1 to 10 do ;
- .DE
- becomes
- .DS
- for (i = 1; i <= 10; i++) ;
- .DE
- Since the requirements cannot be determined by the translator itself
- this kind of code is generated when the constant
- .B lazyfor
- is true.
- .DS
- Default value:
-
- lazyfor = false;
- .DE
- .ne 10
- .IP new
- The second and following parameters to the procedure
- .B new
- will be ignored if the constant
- .B unionnew
- is false.
- .DS
- Default value:
-
- unionnew = true;
- .DE
- .ne 10
- .IP strings
- All identifiers and strings are stored in the translator with the special
- character having the ordinal value
- .B null
- as endmarker.
- Hence, that character can not occur in strings in the Pascal source.
- .DS
- Default value:
-
- null = 0;
- .DE
- .ne 20
- .IP types
- The names of predefined types are given by the constants:
- .B inttyp ,
- .B chartyp ,
- .B floattyp ,
- and
- .B doubletyp .
- .DS
- Default value:
-
- inttyp = 'int';
- chartyp = 'char';
- floattyp = 'float';
- doubletyp = 'double';
- .DE
- The typename for real variables and functions defined by the user
- is given by the constant
- .B realtyp .
- .DS
- Default value:
-
- realtyp = doubletyp;
- .DE
- The typename for procedures is given by the constant
- .B voidtyp .
- .DS
- Default value:
-
- voidtyp = 'void';
- .DE
- .ne 10
- .IP i/o
- The default fieldwidths for integer and real values written on textfiles
- are given by the constants
- .B intlen
- and
- .B fixlen .
- .DS
- Default value:
-
- intlen = 10;
- fixlen = 20;
- .DE
- .IP types
- A table in the translator gives the mapping from Pascal
- integer subrange types to C arithmetic types.
- The table is initialized by code located at the end of procedure
- .I initialize
- giving the following default configuration:
- .TS
- l l l
- r r l .
- Low bound High bound Selected type
- .sp
- 0 255 unsigned char
- -128 127 char
- 0 65535 unsigned short
- -32768 32767 short
- -2147483647 2147483647 long
- .TE
-