- ACE Assembler Documentation for version 1.00 [October 23, 1994]
- ------------------------------------------------------------------------------
- 1. INTRODUCTION
-
- The ACE assembler is a one-pass assembler. The only real limitation on the
- size of assembly jobs is the amount of near+far memory you have available.
- Labels are "limited" to 240 characters (all significant), and the object size
- is limited to 64K (of course). Numerical values are "limited" to 32-bits or
- less. Relative labels ("+" and "-" labels) are implemented in the same way as
- in the Buddy assembler. Expressions currently support only the add and
- subtract dyadic operators, plus the positive, negate, high-byte, and low-byte
- monadic operators; the planned macro and conditional-assembly features are
- not yet implemented. Expressions are limited to 17 operands (with 255 monadic
- operators each), but references to unresolved identifiers are allowed
- anywhere, including equate definitions. Hierarchical inclusion of source
- files is not yet implemented. The ACE source code will eventually be
- converted to use this assembler. The assembler code itself has been
- converted.
-
- The assembler is designed to be a "heavy hitter", operates at moderate speed,
- and uses a fair amount of dynamically allocated memory. In fact, on an
- unexpanded 64, you won't be able to assemble programs that are too large,
- including the assembler itself (85K of source). You'll be able to do larger
- jobs on an unexpanded 64 if you deactivate the soft-80 screen in the
- configuration. I'll be working on making more RAM0 memory available to ACE
- applications. (Of course, one could argue that any serious 64 hacker would
- have expanded memory anyways...).
-
- In addition to the regular 6502 instructions, this release of the assembler
- has the following directives:
-
- label = value             ;assign given value to the label
- label:                    ;assign the current assembly address to label
- +                         ;generate a temporary label, assign cur address
- -                         ;generate a temporary label, assign cur address
- org address               ;set the origin of the assembly
- buf size                  ;reserve "size" bytes of space, filled with zeroes
- db val1, val2, ..., valN  ;put byte values into memory
- dw val1, val2, ..., valN  ;put word values into memory
- dt val1, val2, ..., valN  ;put "triple" (3-byte) values into memory, lo->hi
- dl val1, val2, ..., valN  ;put "long" (4-byte) values into memory, lo->hi
-
- These directives are described in more detail below. Note that throughout the
- documentation, I use the terms "identifier", "symbol", and "label"
- interchangeably.
- ------------------------------------------------------------------------------
- 2. USAGE
-
- The usage for the "as" command, stated in Unix notation, is:
-
- usage: as [-help] [-s] [-d] [file ...]
-
- The "-help" flag will cause the assembler to display the usage information and
- then exit, without assembling any code. Actually, any flag that it doesn't
- understand will be taken as if you had said "-help", but note that if you type
- the "as" command alone on a command line that usage information will not be
- given.
-
- The "-s" flag tells the assembler to generate a symbol-table listing when the
- assembly job is finished. The table is formatted for an 80-column display.
-
- The "-d" flag tells the assembler to produce debugging information while it is
- working. It will generate a lot of output, so you can see exactly what is
- going on.
-
- The object-code module name will be "a.out" unless the name of the first
- source file ends with a ".s" extension, in which case the object module will
- be the base name of the first source file (without the extension). The object
- module will be written as a PRG file and will be in Commodore-DOS program
- format: the first two bytes will be the low and high bytes of the code
- address, and the rest will be the binary image of the assembled code.
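-
- The output naming rule and the object-file layout described above can be
- sketched in a few lines of Python. This is only an illustration, not the
- assembler's own code: the names object_name and make_prg are made up here,
- and the sketch assumes that "base name" simply means the filename with the
- trailing ".s" removed.
-
```python
def object_name(first_source=None):
    """Pick the object-module name: strip a trailing '.s', else use 'a.out'."""
    if first_source and first_source.endswith(".s"):
        return first_source[:-2]
    return "a.out"


def make_prg(origin, image):
    """Build a Commodore-DOS program file: load address (low, high), then code."""
    if not 0 <= origin <= 0xFFFF:
        raise ValueError("origin must fit in 16 bits")
    return bytes([origin & 0xFF, origin >> 8]) + bytes(image)
```
-
- For instance, make_prg(0x1300, [0xA9, 0x01, 0x60]) starts with the bytes $00
- and $13, followed by the three code bytes.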
-
- If no source filename is given on the command line, then input is taken from
- the stdin file stream (and written to "a.out"). If more than one filename is
- given, then each is read, in turn, into the same assembly job (as if the files
- were "cat"ted together into one source file). (This will change subtly when
- the assembler is completed).
-
- This assembler does not produce a listing of the code assembled and will
- stop the whole assembly job on the first error it encounters.
- ------------------------------------------------------------------------------
- 3. TOKENS
-
- While reading your source code, the assembler groups characters into tokens
- and interprets them as a complete unit. The assembler works with five
- different types of tokens: identifiers, numeric literals, string literals,
- special characters, and end-of-file (eof). Eof is special since it doesn't
- actually include any characters, and its only meaning is to stop reading from
- the current source. Your input source file should consist only of characters
- that are printable in standard ASCII (don't be confused by this; the assembler
- expects its input to be in PETSCII) plus TAB and Carriage-Return. Other
- characters may confuse the assembler.
-
- Identifiers consist of a lowercase or uppercase letter or an underscore
- followed by a sequence of such letters or decimal digits. This is a pretty
- standard definition of an identifier. Identifiers are limited to 240
- characters in length and an error will be reported if you try to use one
- longer than that. All of the characters of all identifiers are significant,
- and letters are case-sensitive. Here are some examples of all-unique
- identifiers:
-
- hello Hello _time4 a1_x140J HelloThereThisIsA_LongOne
-
- Numeric literals come in three types: decimal, hexadecimal, and binary.
- Decimal literals consist of an initial digit from 0 to 9 followed by any
- number of digits, provided that the value does not exceed 2^32-1 (approx. 4
- billion). All types of literals can also have embedded underscore characters,
- which are ignored by the assembler. Use them for grouping digits (like the comma
- for big American numbers).
-
- Hexadecimal literals consist of a dollar sign ($) followed by any number of
- hexadecimal digits, provided the value doesn't overflow 32 bits. Hexadecimal
- digits include the decimal digits (0-9), and the first six uppercase or
- lowercase letters of the alphabet (either a-f or A-F). Hexadecimal literals
- can also have embedded underscore characters for separators.
-
- Binary literals consist of a percent sign (%) followed by any number of binary
- digits, provided the value doesn't overflow 32 bits. The binary digits are, of
- course, 0 and 1, and literals may include embedded underscore characters. Note
- that negative values are not literals. Here are some examples of valid
- literals:
-
- 0 123 0001 4_294_967_295 $aeFF $0123_4567 %010100 %110_1010_0111_1010
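-
- The three literal forms can be mimicked with a short Python function. This
- is a sketch of the rules as stated above, not the assembler's actual
- scanner; parse_literal is a hypothetical name, and the two error strings are
- borrowed from the error list in section 7.
-
```python
def parse_literal(text):
    """Return the value of a decimal, $hex, or %binary literal."""
    if text.startswith("$"):
        base, digits = 16, text[1:]
    elif text.startswith("%"):
        base, digits = 2, text[1:]
    elif text[:1].isdigit():
        base, digits = 10, text          # decimal must begin with a digit
    else:
        raise ValueError("Invalid numeric literal")
    digits = digits.replace("_", "")     # embedded underscores are ignored
    allowed = "0123456789abcdef"[:base]
    if not digits or any(ch not in allowed for ch in digits.lower()):
        raise ValueError("Invalid numeric literal")
    value = int(digits, base)
    if value > 0xFFFFFFFF:
        raise ValueError("Numeric literal value overflows 32-bits")
    return value
```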
-
- String literals are sequences of characters enclosed in either single (') or
- double (") quotation marks. The enclosed characters are not interpreted to be
- independent tokens, no matter what they are. One exception is that the
- carriage-return character cannot be enclosed in a string (this normally
- indicates an error anyway). To get special non-printable characters into your
- strings, an "escape" character is provided: the backslash (\). If the
- backslash character is encountered, then the character following it is
- interpreted and a special character code is put into the string in place
- of the backslash and the following character. Here are the characters
- allowed to follow a backslash:
-
- CHAR  CODE  MEANING
- ----  ----  -------
- \       92  backslash character (\)
- n       13  carriage return (newline)
- b       20  backspace (this is a non-destructive backspace for ACE)
- t        9  tab
- r       10  goto beginning of line (for ACE, linefeed for CBM)
- a        7  bell sound
- z        0  null character (often used as a string terminator in ACE)
- '       39  single quote (')
- e       27  escape
- 0        0  null character
- q       34  quotation mark
- "       34  quotation mark
- So, if you really want a backslash then you have to use two of them. If you
- wish to include an arbitrary character in a literal string, no facility is
- provided for doing that. However, the assembler will allow you to intermix
- strings and numeric expressions at a higher level, so you can do it this way.
- Strings are limited to 240 (encoded) characters or fewer. This is
- really no limitation to assembling, since you can put as many string literals
- contiguously into memory as you wish. Here are some examples:
-
- "Hello there" "error!\a\a" 'file "output" could not be opened\n\0'
- "you 'dummy'!" 'you \'dummy\'!' "Here are two backslashes: \\\\"
-
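- The escape table can be read as a simple lookup. The Python sketch below
- decodes the text between the quotes into character codes. It rests on two
- assumptions not stated in this document: ordinary characters are mapped
- with Python's ord() as a stand-in for their PETSCII codes, and a backslash
- followed by a character outside the table is treated as an error. The name
- decode_string is hypothetical.
-
```python
# Codes taken from the escape table above.
ESCAPES = {
    "\\": 92, "n": 13, "b": 20, "t": 9, "r": 10, "a": 7,
    "z": 0, "'": 39, "e": 27, "0": 0, "q": 34, '"': 34,
}


def decode_string(body):
    """Translate the text between the quotes into a list of character codes."""
    codes = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            i += 1
            if i >= len(body) or body[i] not in ESCAPES:
                raise ValueError("Syntax error")   # assumed behaviour
            codes.append(ESCAPES[body[i]])
        else:
            codes.append(ord(ch))                  # stand-in for PETSCII
        i += 1
    if len(codes) > 240:
        raise ValueError("A string literal exceeds 240 chars in length")
    return codes
```
-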
- Special characters are single characters that cannot be interpreted as any of
- the other types of tokens. These are usually "punctuation" characters, but
- carriage return is also a special-character token (it is a statement
- separator). Some examples follow:
-
- , ( # & ) = / ? \ ~ {
-
- Tokens are separated by either the next character of input not being allowed
- to belong to the current token type, or are separated by whitespace.
- Whitespace characters include SPACE (" ") and TAB. Note that carriage return
- is not counted as whitespace. Comments are allowed by using a ";" character.
- Everything following the semicolon up to but not including the carriage return
- at the end of the line will be ignored by the assembler. (I may implement an
- artificial-intelligence comment parser to make sure the assembler does what you
- want it to, but this will be strictly an optional, time-permitting feature).
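-
- To make the token rules of this section concrete, here is a rough Python
- sketch that splits one source line into tokens. It is not the assembler's
- own tokenizer: the regular expression, the tokenize name, and the use of the
- text "CR" for the end-of-line token are all choices made for this
- illustration, and the length limits are not checked.
-
```python
import re

TOKEN = re.compile(r"""
    (?P<identifier> [A-Za-z_][A-Za-z_0-9]* )
  | (?P<number>     [0-9][0-9_]* | \$[0-9A-Fa-f_]+ | %[01_]+ )
  | (?P<string>     "(?:\\.|[^"\\])*" | '(?:\\.|[^'\\])*' )
  | (?P<special>    [^ \t] )
""", re.VERBOSE)


def tokenize(line):
    """Split one source line (without its line ending) into (type, text) pairs."""
    tokens = []
    pos = 0
    while pos < len(line):
        ch = line[pos]
        if ch in " \t":                  # whitespace only separates tokens
            pos += 1
            continue
        if ch == ";":                    # comment runs to the end of the line
            break
        match = TOKEN.match(line, pos)
        if match.lastgroup == "special" and ch in "\"'":
            raise ValueError("Ran into a CR before end of string literal")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    tokens.append(("special", "CR"))     # carriage return ends the statement
    return tokens
```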
- ------------------------------------------------------------------------------
- 4. EXPRESSIONS
-
- Numeric expressions consist of operands and operators. If you don't know what
- operands and operators are, then go buy an elementary-school math book. There
- are six types of operands: numeric literals, single-character string literals,
- identifiers, the asterisk character, one or more plus signs, and one or more
- minus signs. These last three types can make parsing an expression a bit
- confusing, but they are necessary and useful.
-
- Numeric literals are pretty easy to think about. They're just 32-bit numbers
- and work in the usual way. Single-character string literals are also
- interpreted (in the context of a numeric expression) as being a numeric
- literal. The value of a single-character string is simply the PETSCII
- code for the character.
-
- Identifiers or "symbols" or "labels" used in expressions refer to numeric
- values that have been or will be assigned to the identifiers. Binding values
- to identifiers is done by assembler directives discussed in a later section.
- If an identifier already has a value assigned to it by the time that the
- current expression is reached in assembly, then it is treated as if it were a
- numeric literal of the value assigned to the identifier. If the identifier
- currently has no value assigned to it (i.e., it is "unresolved"), then the
- entire expression will be unresolved. In this case, the expression will be
- recorded and will be evaluated at a later time when all of its identifiers
- become resolved. A "hole" will be created where the expression should go, and
- the hole will be "filled in" later. Note that there are a couple of
- directives for which an expression must be resolved at the time it is
- referenced.
-
- The asterisk character operates much like a numeric literal, except that its
- value is the current code address rather than a constant. The current code
- address will always be for the start of an assembler instruction. I.e., the
- current code address is incremented only after an instruction is assembled.
- This has some subtle implications, and other assemblers may implement slightly
- different semantics. Directives are a little different in that the address is
- incremented after every value in a "commalist" is put into memory.
-
- Relative references, operands consisting of a number of pluses or minuses,
- operate much like identifiers. They are provided for convenience and
- work exactly how they do in the Buddy assembler. Operands of all minuses
- are backward references and operands of all pluses are forward references.
- Because of parsing difficulties, relative-reference operands must either
- be the last operand in an expression or must be followed by a ":" character.
-
- The number of pluses or minuses tell which relative reference "point"
- is being referred to. A reference point is set by the "+" and "-"
- assembler directives discussed later. This gets difficult to explain with
- words, so here is a code example:
-
-         ldy #5
- -       ldx #0
- -       lda name1,x
-         sta name2,x
-         beq +
-         cmp #"x"
-         beq ++
-         inx
-         bne -
- +       dey
-         bne --
- +       rts
-
- This relatively bogus subroutine will copy a null-terminated character string
- from name1 to name2 five times, unless the string contains an "x" character,
- in which case the copy operation terminates immediately upon encountering the
- "x". The "beq +" branches to the next "+" label to occur in the code, to the
- "dey" instruction. The "beq ++" branches to the "rts", to the "+" label
- following the next "+" label encountered. The "-" and "--" references work
- similarly, except that they refer to the previous "-" label and the previous
- to the previous "-" label. You can use up to 255 pluses or minus signs in
- a relative-reference operand to refer to that many reference points away.
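-
- One way to picture the counting rule is with two lists of reference points,
- one for "+" and one for "-", kept in source order. The Python sketch below
- is an illustration only (the assembler itself uses generated identifiers);
- resolve_relative and its parameters are hypothetical names.
-
```python
def resolve_relative(ref, defined_so_far, points):
    """Return the address a relative reference refers to.

    ref            -- the operand: '+', '++', '-', '--', ...
    defined_so_far -- how many points of that kind precede the reference
    points         -- addresses of every point of that kind, in source order
    """
    n = len(ref)
    if ref[0] == "+":
        index = defined_so_far + n - 1   # n-th point after the reference
    else:
        index = defined_so_far - n       # n-th point before the reference
    if not 0 <= index < len(points):
        raise LookupError("Unresolved symbol: relative reference " + ref)
    return points[index]
```
-
- In the subroutine above, both "-" points precede "bne -" and "bne --", so
- they pick the second and first "-" point respectively; no "+" point
- precedes "beq +" or "beq ++", so they pick the first and second "+" point.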
-
- That I said relative-reference operands work much like identifiers above
- is no coincidence. For each definition of a reference point and reference
- to a point, an internal identifier is generated that looks like "L+123c" or
- "L-123c". Note that you can't define or refer to these identifiers yourself.
-
- There are two types of operators that can be used in expressions: monadic and
- dyadic operators. Monadic operators affect one operand, and dyadic operators
- affect two operands. At about this point, I should spell out the actual form
- of an expression. It is:
-
- [monadic_operators] operand [ operator [monadic_operators] operand [...] ]
-
- for example:
-
- 1 + 2
- -1 + -+-2 + 3
-
- An expression may have up to 17 operands.
-
- The monadic (one-operand) operators are: positive (+), negative (-), low-byte
- (<), and high-bytes (>). You can have up to 255 of each of these monadic
- operators for each operand of an expression. Positive doesn't actually do
- anything. Negative will return the 32-bit 2's complement of the operand that
- it is attached to. Low-byte will return the lowest eight bits of the operand
- it is attached to. High-byte will return the high-order 24-bits of the 32-bit
- operand it is attached to. All expressions are evaluated in full 32-bit
- precision. Note that you can use the high-bytes operator more than once to
- extract even higher bytes. For example, "<>>value" will extract the
- second-highest byte of the 32-bit value.
-
- The dyadic (two-operand) operators are currently only add (+) and subtract
- (-). Yes, the plus and minus symbols are horribly overloaded. I hope that we
- all know what add and subtract do. I am planning to implement more dyadic
- operators in the future (multiply, divide, and, or, not, exclusive-or).
-
- Evaluation of dyadic operators is strictly left-to-right, and value overflows
- and underflows are ignored. Values are always considered to be positive,
- but this doesn't impact 2's complement negative arithmetic for add and subtract
- dyadic operators.
-
- Monadic operators take precedence over dyadic operators. Evaluation of
- monadic operators is done a little differently. All positive operators are
- thrown out since they don't actually do anything. Then, if there is an even
- number of negative operators, they are thrown out. If there is an odd number
- of negative operators, then the 2's complement negative of the operand is
- returned. Then, if there are any high-bytes operators, the value is shifted
- that number of bytes to the right and the highest-order byte of the value is
- set to zero. Note that it really doesn't make any sense to perform any more
- than three high-bytes operators. Then, the low-byte operator is performed, if
- asked for. It is equivalent to ANDing the value with $000000ff. It
- really doesn't make much sense to perform this operator more than once. Also,
- it doesn't make any difference in which order you place the monadic operators
- in an expression; they are always evaluated in the static order given above.
-
- There is one exception here. If the first operand of an expression has
- high-bytes and/or low-byte monadic operators, then the rest of the expression
- is evaluated first and then the high/low-byte monadic operators are performed
- on the result. This is done to be consistent with other assemblers and with
- user expectations.
-
- Parentheses are not supported. Here are some examples of valid expressions:
-
- 2
- +2+1
- 2+-1
- 2+-------------------------------------1
- ++++:-+++:+---
- 1+"x"-"a"+"A"
- <>>>4_000_000_000
- <label+1
- >label+1
- -1
-
- This last one ends up with a value of negative one, which is interpreted
- as really being 4_294_967_295. If you were to try and do something like
- "lda #-1", you would get an error because the value would be interpreted
- as being way too big.
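-
- The evaluation rules of this section can be summarized in a small Python
- model. It works on already-parsed terms rather than on source text, and it
- is only a sketch of the rules as described, not the assembler's code.
- apply_monadic and evaluate are hypothetical names, and the treatment of a
- negate mixed with high/low-byte operators on the first operand (negate
- first, byte operators deferred) is an assumption.
-
```python
MASK32 = 0xFFFFFFFF


def apply_monadic(ops, value):
    """Apply monadic operators in the fixed order: negate, high-bytes, low-byte."""
    if ops.count("-") % 2:              # an even number of negates cancels out
        value = -value & MASK32
    value >>= 8 * ops.count(">")        # each '>' drops the lowest byte
    if "<" in ops:
        value &= 0xFF
    return value


def evaluate(terms):
    """terms: list of (dyadic_op, monadic_ops, operand); the first dyadic_op is '+'."""
    if len(terms) > 17:
        raise ValueError("Expression has more than 17 operands")
    # High/low-byte operators on the first operand apply to the whole result.
    deferred = "".join(c for c in terms[0][1] if c in "<>")
    total = 0
    for i, (dyadic, ops, operand) in enumerate(terms):
        if i == 0:
            ops = "".join(c for c in ops if c not in "<>")
        value = apply_monadic(ops, operand)
        total = (total + value if dyadic == "+" else total - value) & MASK32
    return apply_monadic(deferred, total)
```
-
- With this model, evaluate([("+", "-", 1)]) gives 4294967295, matching the
- remark about "-1" above, and "<label+1" with label = $12ff gives $00 rather
- than $100 because the low-byte operator is applied last.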
-
- Expression results are currently considered to be one of two types: value and
- address. (The complete set must be value, address, address-low-byte, and
- address-high-byte in order to be actually useful). Values are what you would
- expect and come from numeric and single-character-string-literal operands.
- The address type comes from the asterisk and relative reference operands and
- from identifier operands which are defined to be addresses. An address is
- defined to be only an address in the range of the assembled code. Addresses
- outside of this range are considered to be values. The distinction of values
- and addresses is currently not used, but will be in the future when code
- relocation features are implemented. Keeping track of expression types makes
- it possible to generate a list of all values in memory that must be modified
- in order to relocate a program to a new address without reassembling it.
-
- String "expressions" consist of only a single string literal. No operators
- are allowed. Some assembler directives accept either numeric or string
- expressions and interpret them appropriately (like "db").
- ------------------------------------------------------------------------------
- 5. PROCESSOR INSTRUCTIONS
-
- This assembler accepts the 56 standard 6502 processor instructions. It does
- not provide un-documented 6502 instructions nor 65c02 nor 65816 instructions
- nor custom pseudo-ops. The latter will be provided by future macro features.
- All of the assembler instructions must be in lowercase or they will not be
- recognized. Here are the instructions:
-
- NUM INS NUM INS NUM INS NUM INS NUM INS
- --- --- 12. bvc 24. eor 36. pha 48. sta
- 01. adc 13. bvs 25. inc 37. php 49. stx
- 02. and 14. clc 26. inx 38. pla 50. sty
- 03. asl 15. cld 27. iny 39. plp 51. tax
- 04. bcc 16. cli 28. jmp 40. rol 52. tay
- 05. bcs 17. clv 29. jsr 41. ror 53. tsx
- 06. beq 18. cmp 30. lda 42. rti 54. txa
- 07. bit 19. cpx 31. ldx 43. rts 55. txs
- 08. bmi 20. cpy 32. ldy 44. sbc 56. tya
- 09. bne 21. dec 33. lsr 45. sec
- 10. bpl 22. dex 34. nop 46. sed
- 11. brk 23. dey 35. ora 47. sei
-
- The assembler also supports 12 addressing modes. The "accumulator" addressing
- mode that can be used with the rotate and shift instructions is treated like
- the implied addressing mode, so a shift-left-accumulator instruction would
- be just "asl" rather than "asl a". Many other assemblers get rid of the
- accumulator addressing mode also. Also, the ",x" and ",y" addressing modes
- must be given with a lowercase "x" or "y" or they will not be recognized.
- Here is the token syntax for the addressing modes (CR means carriage return):
-
- num  name       gen  byt  example  tokens
- ---  ---------  ---  ---  -------  -------
- 01.  implied    00.  1             CR
- 02.  immediate  00.  2    #123     # / exp8 / CR
- 03.  relative   00.  2    *+20     exp16 / CR
- 04.  zeropage   07.  2    123      exp8 / CR
- 05.  zp,x       08.  2    123,x    exp8 / , / x / CR
- 06.  zp,y       09.  2    123,y    exp8 / , / y / CR
- 07.  absolute   00.  3    12345    exp16 / CR
- 08.  abs,x      00.  3    12345,x  exp16 / , / x / CR
- 09.  abs,y      00.  3    12345,y  exp16 / , / y / CR
- 10.  indirect   00.  3    (12345)  ( / exp16 / ) / CR
- 11.  (ind,x)    00.  2    (123,x)  ( / exp8 / , / x / ) / CR
- 12.  (ind),y    00.  2    (123),y  ( / exp8 / ) / , / y / CR
-
- Each instruction takes a complete line and each addressing mode must be
- terminated by a carriage return token (comments are skipped). The format of
- an instruction line is as follows:
-
- [prefix_directives] instruction address_mode_operand
-
- In the case that an expression in an addressing mode is resolved at the point
- it is encountered and its value is less than 256, the assembler will try to
- use the zero-page addressing modes if possible. On the other hand, if a
- zero-page addressing mode is unavailable for an instruction, then the
- assembler will promote or generalize the zero-page addressing mode to
- an absolute addressing mode, if possible. This is what the "gen" column in
- the table above shows. If after attempting to generalize the addressing
- mode the given addressing mode is still not valid with the given instruction,
- then an error will be generated.
-
- In the case that an expression in an addressing mode cannot be resolved at
- the point where it is encountered in the assembler's single pass, a hole is
- left behind, and that hole is made as "large" as possible; it is assumed
- that you will fill in the hole with the largest value possible. This means,
- for example, if you were to assemble the following instruction:
-
- lda var,x
-
- then the assembler would assume this is an absolute mode, and will fill in the
- hole later as such, even if it turns out that "var" is assigned a value less
- than 256 later on. This results in slight inefficiency in the code produced
- by this assembler, but it causes most two-pass assemblers to fail completely
- on a "phase error". An easy way to avoid this circumstance is to make sure
- that all zero-page labels are defined before they are referred to.
-
- The addressing modes that require a single byte value and that will not
- "generalize" to an absolute mode will have a single-byte hole created for
- them. Only the branching instructions will be interpreted as having the
- relative addressing mode, and a single-byte hole will be left. Two exceptions
- to the above rules are the "stx zp,y" and "sty zp,x", which will leave a
- single-byte hole on an unresolved expression, since the absolute-mode
- generalizations for these instructions are not supported by the processor.
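-
- The operand-size decision described in this section can be modelled as
- follows. This Python sketch covers only the zero-page/absolute choice, not
- the full addressing-mode table, and operand_size with its two flags is a
- hypothetical helper rather than anything in the assembler.
-
```python
def operand_size(value, has_zero_page, has_absolute=True):
    """Return the operand size in bytes; value is None while unresolved."""
    if value is None:
        # Unresolved: leave the largest hole the instruction allows.
        return 2 if has_absolute else 1
    if value < 256 and has_zero_page:
        return 1                         # zero-page form is used when possible
    if value <= 0xFFFF and has_absolute:
        return 2                         # generalized to the absolute form
    raise ValueError("Value is too large or negative")
```
-
- So an unresolved "lda var,x" gets a two-byte hole, while an unresolved
- "stx var,y" gets a one-byte hole because no absolute form exists for it.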
- ------------------------------------------------------------------------------
- 6. DIRECTIVES
-
- There are currently five classes of assembler directives; there will be
- more in the future.
-
- 6.1. DO-NOTHING DIRECTIVES
-
- There are two do-nothing directives:
-
- # ;does nothing
- ;blank line--does nothing
-
- A blank line in your source code will simply be ignored. This helps to make
- code much more readable. The "#" directive is a prefix directive. This means
- that it does not occupy an entire line but allows other directives and
- processor instructions to follow it on the same line (including other prefix
- directives). (But note that you can follow any prefix directive by the
- blank-line directive, effectively allowing prefix directives to be regular
- full-line directives (powerful combining forms)). The "#" directive is simply
- ignored by the assembler, but you can use it to highlight certain lines of
- code or other directives, like the future "include" directive.
-
- 6.2. ASSIGNMENT DIRECTIVES
-
- There are four assignment directives. They all assign (bind) a value to an
- identifier. Here they are:
-
- label = expression ;assign given value to the label
- label: ;assign the current assembly address to label
- + ;generate a temporary label, assign cur address
- - ;generate a temporary label, assign cur address
-
- The first (label=expr) is the most general. It assigns the result of
- evaluating the expression to the given label. Because this assembler is so
- gosh-darned awesome, the expression doesn't even have to be resolved; a "hole"
- will be created saying to fill in the assigned label when all of the
- unresolved identifiers in the expression eventually become resolved. Most
- other assemblers (in fact, all that I have ever heard of) can't do this
- because it causes ugly implementation problems, like cascading label
- resolutions. Consider the following example:
-
- lda #a
- sta b,x
- a = b+3
- b = c-1
- c = 5
-
- At the point where c becomes defined, there are no "memory holes" but the
- label hole "b" must be evaluated and filled in. "b" gets assigned the value
- 4. At this point, there are two holes: the one in the "sta" instruction and
- the label "a". We fill them both in, assigning "a" the value 7, and we
- discover that we can fill in a new hole: the one in the "lda" instruction. We
- do that and we are finally done. The implementation can handle any number of
- these recursive label hole-fillings, limited only by the amount of near+far
- memory you have.
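-
- The cascading resolution can be imitated with a worklist. The Python
- sketch below handles only equates of the form "label = other + constant"
- and processes them as a batch, so it is a simplified model and not the
- assembler's one-pass data structures; the name resolve and the shape of the
- equates table are invented for the example.
-
```python
def resolve(equates):
    """equates maps label -> (dependency or None, offset); value = dependency + offset."""
    values = {}
    waiting = {}                         # dependency -> labels waiting on it
    ready = []
    for label, (dep, offset) in equates.items():
        if dep is None:
            values[label] = offset & 0xFFFFFFFF
            ready.append(label)
        else:
            waiting.setdefault(dep, []).append(label)
    while ready:
        done = ready.pop()
        # Filling one label may make further labels resolvable.
        for label in waiting.pop(done, []):
            values[label] = (values[done] + equates[label][1]) & 0xFFFFFFFF
            ready.append(label)
    if waiting:
        raise ValueError("Unresolved symbol: " + sorted(waiting)[0])
    return values
```
-
- For the example above, resolve({"a": ("b", 3), "b": ("c", -1),
- "c": (None, 5)}) yields c = 5, b = 4, and a = 7.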
-
- A label can only be assigned a value once, and you will get an error if you
- try to redefine a label, even if it is currently unresolved. Also, all
- expressions must be resolved by the end of the assembly job, or an error will
- be reported (but only one--naming the first unresolved label that the
- assembler runs across; I may fix this up in the future).
-
- The second assignment directive is equivalent to "label = *", but it is more
- convenient and is also a prefix directive. It assigns the current address (as
- of the start of the current line) to the given identifier. The colon is used
- with this directive to make it easy and efficient to parse, and to make it
- easy for a human to see that a label is being defined. Many other assemblers
- follow this directive with just whitespace and rely on other tricks, like
- putting an ugly dot before each directive, to bail them out.
-
- The third and fourth set relative reference points. They are equivalent
- to "rel_label = *", where "rel_label" is a specially generated internal
- identifier of the form "L+123c" mentioned in the expression section. The
- labels defined by these directives show up in the symbol table dump, if you
- ask for one on the command line. These are also prefix directives, so if
- you wanted to set a forward and a backward reference to the same address,
- then you would do something like:
-
- +- lda #1
-
- In fact, you could put as many of these directives on the front of a line as
- you want, though more than one of each will be of little use. Note that
- backward relative labels will always be defined at the point that they are
- referenced and forward relative labels will always be undefined (unresolved)
- when they are referenced. If at the end of your assembly job the assembler
- complains of an unresolved reference involving a label of the form "L+123c",
- then you refer to a forward-relative point that you don't set, and if the
- label is of the form "L-4000000000c", then you refer to a backward relative
- point that you don't define.
-
- 6.3. ORIGIN DIRECTIVE
-
- org address_expression ;set the origin of the assembly
-
- This directive will set the code origin to the given expression. The
- expression MUST be resolved at the point where it appears, since it
- would be very difficult to fill in the type of hole this would leave
- behind (though not impossible, hmmm...). The origin must be set before
- any processor instruction or assembler directive that generates memory
- values is encountered, and the code origin can only be set once. This
- results in a contiguous code region, which is what ACE and the Commodore
- Kernal require.
-
- 6.4. DEFINE-BYTES DIRECTIVES
-
- db exp1, exp2, ..., expN ;put byte values into memory
- dw exp1, exp2, ..., expN ;put word values into memory
- dt exp1, exp2, ..., expN ;put "triple" (3-byte) values into memory, lo->hi
- dl exp1, exp2, ..., expN ;put "long" (4-byte) values into memory, lo->hi
-
- These directives all put values into code memory, at the current
- address. The only difference between the four of them is the size of data
- values they put into memory: bytes (8 bits), words (16 bits), triples (24
- bits), and longs (32 bits). The code address is incremented by the
- appropriate number of bytes between putting each value into memory. Any
- number of values can be specified by separating them by commas. All
- expressions are evaluated in full 32 bits, but must fit into the size for the
- directive. The expressions don't have to be resolved at the time they appear.
-
- These directives can also be given strings for arguments, which means that
- each character of the string will be stored as one byte/word/etc. in memory,
- for example:
-
- db 123, abc+xyz+%1101-"a"+$1, "hello", 0, "yo!", "keep on hackin'\0"
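-
- The byte layout these directives produce can be sketched in Python. The
- function below takes already-evaluated numbers and already-decoded strings;
- it uses ord() as a stand-in for PETSCII codes, which is an assumption, and
- define_bytes is a hypothetical name.
-
```python
SIZES = {"db": 1, "dw": 2, "dt": 3, "dl": 4}


def define_bytes(directive, values):
    """Encode numbers and strings as low-to-high byte sequences."""
    size = SIZES[directive]
    out = bytearray()
    for v in values:
        # A string contributes one item per character.
        items = [ord(ch) for ch in v] if isinstance(v, str) else [v]
        for item in items:
            if not 0 <= item < 1 << (8 * size):
                raise ValueError("Value is too large or negative")
            out += item.to_bytes(size, "little")
    return bytes(out)
```
-
- For example, define_bytes("dw", [0x1234]) gives the bytes $34 $12, and
- define_bytes("db", [0xFFFFFFFF]) fails the range check, just as "-1" would.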
-
- 6.5. BUF DIRECTIVE
-
- buf size_expression ;reserve "size" bytes of space, filled with zeroes
-
- This directive reserves the given number of bytes of space from the current
- code address and fills them with zeroes. The expression must be resolved,
- and can be any value from 0 up to 65535 or the number of bytes remaining
- until the code address overflows the 64K code space limit.
-
- 6.6. PARSING
-
- Because of the way that the assembler parses the source code (it uses a
- one-character-peek-ahead ad-hoc parser), you can define labels that are also
- directive names or processor-instruction names. This is not a recommended
- practice, since you can end up with lines that look like:
-
- x: lda: lda lda,x
-
- The parser will know what to do, but most humans won't. Also, because of the
- tokenizer, you can put arbitrary spacing between tokens, except between tokens
- that would otherwise merge together (like two adjacent identifiers or decimal
- numbers).
- ------------------------------------------------------------------------------
- 7. ERROR HANDLING
-
- When an error is detected, the assembler will stop the whole assembly job and
- print out one error message (to the stderr file stream). Here are two
- examples of error messages:
-
- err ("k:":2:0) Value is too large or negative
-
- err ("k:":3:0), ref("k:":2:0) Value is too large or negative
-
- In both error messages, the stuff inside of the parentheses is the filename of
- the source file (the keyboard here), the source line where the error was
- detected, and the column number where the error was detected. Currently, the
- column number is not implemented so it is always zero. When it is
- implemented, the column numbers will start from 1, like in the Zed text
- editor, and it will point to the first character of the token where the
- error was discovered.
-
- In the first example, the error occurred because the expression was resolved
- and the value was found to be too large for whatever operation was attempted.
- In the second example, an expression was used but unresolved on line 2 of the
- source file, and when its unresolved identifier(s) were finally filled in on
- line 3 of the source, the "hole" to be filled in was found to be too small for
- the value, so an error resulted. This is what the "ref" file position means.
- Filenames are included in error messages because in the future, it will be
- possible to have errors crop up in included files and elsewhere.
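-
- A tool that invokes the assembler could pick these messages apart with a
- regular expression. The Python sketch below is inferred from the two sample
- messages above only, so the exact spacing it expects is an assumption;
- parse_error is a hypothetical name.
-
```python
import re

POSITION = r'\("([^"]*)":(\d+):(\d+)\)'
ERROR_LINE = re.compile(r"err " + POSITION + r"(?:, ref" + POSITION + r")? (.*)")


def parse_error(line):
    """Split an error line into file, line, column, optional ref, and message."""
    match = ERROR_LINE.fullmatch(line.strip())
    if match is None:
        return None
    file1, line1, col1, file2, line2, col2, message = match.groups()
    result = {"file": file1, "line": int(line1), "column": int(col1),
              "ref": None, "message": message}
    if file2 is not None:
        # Position where the still-unresolved expression was first used.
        result["ref"] = (file2, int(line2), int(col2))
    return result
```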
-
- Here is the entire list of possible error messages:
-
- NUM MEANING
- --- -------
- 01. "An identifier token exceeds 240 chars in length"
- 02. "A string literal exceeds 240 chars in length"
- 03. "Ran into a CR before end of string literal"
- 04. "Invalid numeric literal"
- 05. "Numeric literal value overflows 32-bits"
- 06. "Syntax error"
- 07. "Attempt to perform numeric operators on a string"
- 08. "Expression has more than 17 operands"
- 09. "Ran out of memory during compilation process"
- 10. "Attempt to redefine a symbol"
- 11. "Attempt to assemble code with code origin not set"
- 12. "Internal error: attempt to assign to unexpected id"
- 13. "Non-numeric symbol in a numeric expression"
- 14. "Expecting an operator"
- 15. "Expecting an operand"
- 16. "Expecting a command"
- 17. "Value is too large or negative"
- 18. "Branch out of range"
- 19. "Feature is not (yet) implemented"
- 20. "Instruction does not support given address mode"
- 21. "Address wraped around 64K code address space"
- 22. "Error trying to write output object file"
- 23. "Directive requires resolved expression"
- 24. "Code origin already set; you can't set it twice"
- 25. "Unresolved symbol: "
- 26. "Thus assembler doesn't accept .dot commands, Buddy!"
-
- A "Syntax error" (#06) will be reported whenever a token other than one that
- was expected is found. "Ran out of memory" (#09) may turn up often on an
- unexpanded 64. "Expecting command" (#16) means that the assembler was
- expecting either a processor instruction or directive but found something else
- instead. "Not implemented" (#19) means that you've tried to use a directive
- that isn't implemented yet. "Unresolved symbol" (#25) will be printed with a
- randomly chosen unresolved symbol, with the last place in the source code
- where it was referenced. "Dot commands" (#26) is a reminder that directives
- in this assembler are not prefixed with a dot (.).
-
- There are two main reasons behind the idea of stopping at the first error
- encountered: simplicity and interoperability. When Zed is implemented for
- ACE, it will have a feature that will allow it to invoke the assembler (as a
- sub-process) and have the assembler return an error location and message to
- Zed, which will display the error message and position the cursor to the error
- location (if the source file is loaded).
-
- While on the subject of messages coming out of the assembler, here is an
- example of the format of the symbol table dump that you can ask for on the
- command line. One line is printed for each identifier. The "hash" value is
- the bucket in the hash table chosen for the identifier. This may not have a
- whole lot of meaning for a user, but a good distribution of these hash buckets
- in the symbol table is a good thing. Next is the 32-bit "hexvalue" of the
- label followed by the value in "decimal". Then comes the type. A type of "v"
- means value and "a" means an in-code-range address. Then comes the name of
- the identifier. It comes last to give lots of space to print it. If an
- identifier is ten or fewer characters long, its symbol-table-dump line will
- fit on a 40-column screen. At the bottom, the number of symbols is printed.
- This table is directed to the stdout file stream, so you can redirect it to a
- file in order to save it.
-
- HASH HEXVALUE DECIMAL    T NAME
- ---- -------- ---------- - -----
-    8 00000f06       3846 v aceArgv
-  469 00007008      28680 a main
- --
- Number of symbols: 2
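-
- Since the dump goes to stdout and can be saved to a file, it can also be
- read back by other tools. The following Python sketch parses the five
- columns described above; the function name parse_symbol_dump is
- hypothetical, and the code assumes that the columns are separated by
- whitespace and that identifiers contain no spaces.

```python
def parse_symbol_dump(text):
    """Collect one dict per symbol line; skip headers and the footer."""
    symbols = []
    for raw in text.splitlines():
        fields = raw.split(None, 4)
        # Symbol lines have five fields and start with the hash bucket number.
        if len(fields) != 5 or not fields[0].isdigit():
            continue
        bucket, hexvalue, decimal, kind, name = fields
        symbols.append({
            "hash": int(bucket),
            "value": int(hexvalue, 16),
            "decimal": int(decimal),
            "type": {"v": "value", "a": "address"}.get(kind, kind),
            "name": name,
        })
    return symbols
```

- The hexadecimal and decimal columns carry the same number, so a reader of
- the dump can use one to cross-check the other.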
- ------------------------------------------------------------------------------
- 8. IMPLEMENTATION
-
- In each of the ways in which this assembler is heavier and slower than other
- assemblers, it is also more powerful and more flexible.
-
- - It uses far memory for storing symbols, so there is no static or arbitrarily
- small limit on the number of symbols. Macro sizes, like the "hole table",
- will also be limited only by the amount of memory available.
-
- - It has to maintain a "hole table" because of its structure, but this means
- that you can define labels in terms of other unresolved labels, that you
- will never get a "sync error" because of incorrect assumptions made (and not
- recorded) about unresolved labels, and that modular assembly can be
- implemented without too much further effort (i.e., ".o" or ".obj" files),
- since an unresolved external reference handling mechanism is already
- implemented.
-
- - The assembler keeps track of the "types" of labels, either "address" or
- "value", which makes it possible to provide code relocation information that
- will be needed by modular assembly and by future multitasking operating
- systems.
-
- - Because a "hole table" approach is used, the raw object code must be stored
- internally until the assembly is complete and then it can be written out to
- a file, but this also means that header information can be provided in an
- output file since all assembly results will be known before any output is
- written.
-
- - I took the easy way out for handling errors; when an error is detected, an
- error message is generated and printed and the assembler STOPs. But the
- exit mechanism provided by ACE makes it possible to integrate the assembler
- with other programs, like a text editor, to move the text editor cursor to
- the line and column containing the error and display a message in the text
- editor.
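-
- The "hole table" idea in the points above can be sketched in a few lines of
- Python. This is a simplified illustration, not the assembler's actual code:
- the class and method names are hypothetical, each hole refers to a single
- identifier rather than a full expression, and values are stored low byte
- first as on the 6502.

```python
class ValueTooLarge(Exception):
    """Stands in for error #17, "Value is too large or negative"."""


class HoleTableSketch:
    def __init__(self):
        self.code = bytearray()   # object code is buffered until the end
        self.symbols = {}         # name -> resolved value
        self.holes = {}           # name -> [(offset, size, source line)]

    def emit_ref(self, name, size, line):
        """Reserve `size` bytes for `name`; fill now or leave a hole."""
        offset = len(self.code)
        self.code += bytes(size)
        if name in self.symbols:
            self._fill(offset, size, self.symbols[name], line, None)
        else:
            self.holes.setdefault(name, []).append((offset, size, line))

    def define(self, name, value, line):
        """Record the value and fill every hole that was waiting for it."""
        self.symbols[name] = value
        for offset, size, ref_line in self.holes.pop(name, []):
            self._fill(offset, size, value, line, ref_line)

    def _fill(self, offset, size, value, line, ref_line):
        if not 0 <= value < (1 << (8 * size)):
            where = f"line {line}"
            if ref_line is not None:
                where += f", ref line {ref_line}"
            raise ValueTooLarge(f"{where}: Value is too large or negative")
        self.code[offset:offset + size] = value.to_bytes(size, "little")
```

- Any names still left in holes at the end of the job correspond to the
- "Unresolved symbol" error, and the buffered code can only be written out
- once that table is empty.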
-
- There are two speed advantages that this assembler has over (some?) others:
-
- - It uses a 1024-entry hash table of pointers to chains of labels, so, for a
- program that has 800 or so symbols, each can be accessed in something like
- 1.3 tries. For N total symbols, the required number of references is
- approximately MAX( N/1024, 1 ).
-
- - It is one-pass, so it only has to go through the overhead of reading the
- source file once. Depending on the type of device the file is stored on,
- this may give a considerable savings. This also makes it possible to
- "pipe" the output of another program into the assembler, without any
- "rewind" problems.
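-
- The "something like 1.3 tries" figure can be compared with the textbook
- estimate for chained hash tables. The Python sketch below assumes that the
- hash function spreads symbols uniformly over the buckets, which is an
- assumption and not a measurement of this assembler; the function name is
- hypothetical.

```python
BUCKETS = 1024  # size of the hash table, as stated above


def expected_probes(n_symbols, buckets=BUCKETS):
    """Average chain entries examined to find a symbol that is stored.

    Textbook estimate for chaining under uniform hashing:
    1 + (n - 1) / (2 * m) for n symbols in m buckets.
    """
    return 1 + (n_symbols - 1) / (2 * buckets)
```

- For 800 symbols this estimate gives roughly 1.39 probes per lookup, the
- same ballpark as the figure quoted above.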
-
- Here are some performance figures, compared to the Buddy assembler for the 128.
- All test cases were run on a C128 in 2-MHz mode with a RAMLink, REU, and 1571
- available.
-
- ASSEMB TIME(sec) FILE DEVICE FAR STORAGE
- ------ --------- ----------- -----------
- Buddy       45.5 RAMLink     n/a
- ACE-as      61.5 RAMLink     REU
- ACE-as      49.5 ACE ramdisk REU
- ACE-as      75.6 RAMLink     RAM0+RAM1
- ACE-as     150.5 1571        RAM0+RAM1
- Buddy      240.0 1571        n/a
-
- Part of the assembly job was loaded into memory for the Buddy assembler, but
- the load time is included in the figure. As you can see, Buddy performs
- faster with a fast file device and slower with a slow file device (because it
- requires two passes). I have a couple of tricks up my sleeve to improve the
- ACE assembler's performance.
-
- There are also a couple of subtle errors in this implementation. First, if it
- receives a "short block" from the source device, it will put whitespace
- between the current block and the next, thus potentially splitting a token.
- Also, if multiple files are used, the "ref" filename may not be valid.
-
- Here are a few data structures for your enjoyment.
-
- Identifier descriptor:
-
- OFF SIZ DESCRIPTION
- --- --- ------------
-   0   4 next link in hash table bucket
-   4   4 value of symbol, pointer to reference list, or ptr to macro defn
-   8   1 offset of reference in expression of reference list
-   9   1 type: $00=value, $01=address, $80=unresolved, $ff=unresolved define
-  10   1 class: $00=normal, $01=private, $80=global (not used yet)
-  11   1 name length
-  12   * null-terminated name string (1-240 chars)
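-
- To make the layout concrete, here is a Python sketch that decodes one
- identifier descriptor from raw bytes. The little-endian byte order (usual
- on the 6502) and the ASCII decoding of the name are assumptions, and the
- names ID_HEADER and unpack_identifier are hypothetical.

```python
import struct

# Fixed 12-byte part: next link, value, ref offset, type, class, name length.
ID_HEADER = struct.Struct("<IIBBBB")   # byte order is an assumption

TYPE_NAMES = {
    0x00: "value",
    0x01: "address",
    0x80: "unresolved",
    0xFF: "unresolved define",
}


def unpack_identifier(record):
    """Decode the fixed header and the name that follows at offset 12."""
    (next_link, value, ref_offset,
     type_code, class_code, name_len) = ID_HEADER.unpack_from(record, 0)
    name = record[12:12 + name_len].decode("ascii")  # encoding is assumed
    return {
        "next": next_link,
        "value": value,
        "ref_offset": ref_offset,
        "type": TYPE_NAMES.get(type_code, type_code),
        "class": class_code,
        "name": name,
    }
```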
-
- Expression/Hole descriptor:
-
- OFF SIZ DESCRIPTION
- --- --- -----------
-   0   1 hole type: $01=byte, $02=word, $03=triple, $04=long, $40=branch,
-         $80=label
-   1   1 expression length: maximum offset+1 in bytes
-   2   1 number of unresolved references in expression
-   3   1 source column of reference
-   4   4 address of hole
-   8   4 source line of reference
-  12   4 source file pointer
-  16  14 expression operand descriptor slot #1
-  30  14 expression operand descriptor slot #2
-  44  14 expression operand descriptor slot #3
-  58  14 expression operand descriptor slot #4
-  72  14 expression operand descriptor slot #5
-  86  14 expression operand descriptor slot #6
- 100  14 expression operand descriptor slot #7
- 114  14 expression operand descriptor slot #8
- 128  14 expression operand descriptor slot #9
- 142  14 expression operand descriptor slot #10
- 156  14 expression operand descriptor slot #11
- 170  14 expression operand descriptor slot #12
- 184  14 expression operand descriptor slot #13
- 198  14 expression operand descriptor slot #14
- 212  14 expression operand descriptor slot #15
- 226  14 expression operand descriptor slot #16
- 240  14 expression operand descriptor slot #17
- 254   - END+1
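-
- The slot offsets in the table follow a simple rule: a 16-byte header
- followed by 17 operand slots of 14 bytes each. The Python sketch below
- recomputes them; the names are hypothetical and the little-endian packing
- of the header is an assumption.

```python
import struct

# Header fields at offsets 0..15: four single bytes, then three 4-byte values.
HOLE_HEADER = struct.Struct("<BBBBIII")   # byte order is an assumption
OPERAND_SIZE = 14
MAX_OPERANDS = 17


def operand_slot_offset(slot):
    """Byte offset of operand slot 1..17 inside the hole descriptor."""
    if not 1 <= slot <= MAX_OPERANDS:
        raise ValueError("slot must be between 1 and 17")
    return HOLE_HEADER.size + (slot - 1) * OPERAND_SIZE


# Total size of the descriptor, matching the END+1 row.
HOLE_SIZE = operand_slot_offset(MAX_OPERANDS) + OPERAND_SIZE
```

- With 17 slots the descriptor ends at offset 254, so every offset into it
- still fits in the one-byte offset fields used by these structures.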
-
- Expression operand descriptor:
-
- OFF SIZ DESCRIPTION
- --- --- -----------
-   0   1 operator: "+" or "-"
-   1   1 type of value: $00=number, $01=address, $80=unresolved identifier
-   2   1 monadic-operator result sign of value: $00=positive, $80=negative
-   3   1 hi/lo operator counts: high_nybble=">" count, low_nybble="<" cnt
-   4   4 numeric value or unresolved-identifier pointer
-   8   4 next unresolved reference in chain for unresolved identifier
-  12   1 offset in hole structure of next unresolved reference (operand)
-  13   1 reserved
-  14   - END+1
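-
- As a rough illustration of the operand layout, this Python sketch decodes
- one 14-byte operand descriptor, including the two operator counts packed
- into the byte at offset 3. The little-endian byte order and the storage of
- the operator as a character code are assumptions; the names are
- hypothetical.

```python
import struct

OPERAND = struct.Struct("<BBBBIIBB")   # 14 bytes; byte order is an assumption


def unpack_operand(raw):
    """Decode one expression operand descriptor."""
    (op, vtype, sign, hilo,
     value, next_ref, next_off, _reserved) = OPERAND.unpack(raw)
    return {
        "operator": chr(op),          # assumed to be stored as "+" or "-"
        "type": vtype,                # $00, $01, or $80
        "negative": sign == 0x80,
        "high_count": hilo >> 4,      # number of ">" operators applied
        "low_count": hilo & 0x0F,     # number of "<" operators applied
        "value": value,
        "next_ref": next_ref,
        "next_offset": next_off,
    }
```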
- ------------------------------------------------------------------------------
- 9. THE FUTURE
-
- This section is just random notes since I don't have the time right now to
- fill it in. I will be implementing include files, conditional assembly, and
- macro assembly features in the future. Modular assembly and relocatable-
- code generation are also in my plans.
-
- ;todo: -implement storage classes: $00=internal, $01=rel.label, $80=exported
- ; -implement all var types: 0=value, 1=address, 2=addr.high, 3=addr.low
- ; -implement source column, make line:col point to start of cur token
- ; -make it so you can use a "\<CR>" to continue a line
- ; -add more operators: * / & | ~ full precedence?
- ; -cache current symbol
- ;
- ; usage: as [-help] [-s] [-d] [-b] [-r] [-l] [-a addr] [file ...] [-o filename]
- ;
- ; -help : produce this information, don't run
- ; -s : produce symbol table dump at end
- ; -d : provide debugging information (lots)
- ; -b : produce binary module at end (default)
- ; -r : produce relocatable module rather than binary module
- ; -l : produce linkable ".o" module(s)
- ; -a : set global code origin to given address
- ; -o : put output into given filename
- ;
- ; If the -l option is not used, all files, including source and object modules,
- ; will be assembled together. The output module name will be the base name of
- ; the first file given if it has a ".s" or ".o" extension, "a.out" if the first
- ; file has none of these extensions, or will be the filename given by the -o
- ; option if used.
- ; If the -l option is used, then each given source module will be
- ; assembled independently into its own ".o" module. Object modules will be
- ; ignored.
- ; The global origin will be either that given by the -a option (if it is
- ; used) or by the local origin of the first source/object module. Each
- ; source module that generates code must have a local code origin.
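-
- The planned rule for naming the output module could be sketched as follows.
- This is only an illustration of the notes above, not existing behavior: the
- function name is hypothetical, and it assumes that at least one file is
- given and that "base name" means the filename with its extension removed.

```python
def output_name(files, o_option=None):
    """Pick the output module name when -l is not used (planned behavior)."""
    if o_option is not None:
        return o_option                 # -o always wins
    first = files[0]                    # assumes at least one file was given
    for ext in (".s", ".o"):
        if first.endswith(ext) and len(first) > len(ext):
            return first[:-len(ext)]    # base name of the first file
    return "a.out"
```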
-
- More Directives:
-
- include "filename"
- if <expression> <relop> <expression>
- elsif <expression> <relop> <expression>
- else
- endif
- macro macroname
- endmacro
- export label1, label2, ..., labelN
- bss size_expression
-
- macro blt there
- bcc there
- endmacro
-
- macro add ;?1=operand
- clc
- adc ?1
- endmacro
-
- macro ldw ;?1=dest, ?2=source
- if ?# != 2
- error "the ldw macro instance doesn't have two arguments"
- endif
- if @1 = #
- argshift 2 0
- lda #<?2
- sta ?1+0
- lda #>?2
- sta ?1+1
- else
- lda ?2+0
- sta ?1+0
- lda ?2+1
- sta ?1+1
- endif
- endmacro
- ------------------------------------------------------------------------------
- So, there is finally a powerful and convenient assembler universally available
- for both the 64 and 128... for free. The source code for the assembler
- (written in the assembler's own assembly format, of course) is also available
- for free. There are a few more features that need to be implemented, but I
- know exactly how to implement them.
-
- Keep on Hackin'!
-
- -Craig Bruce
- csbruce@ccnga.uwaterloo.ca
- "Give them applications and they will only want more; give them development
- tools and they will give you applications, and more."
- ------------------------------------------------------------------------END---