ACE Assembler Documentation for version 1.00  [October 23, 1994]
------------------------------------------------------------------------------
1. INTRODUCTION

The ACE assembler is a one-pass assembler. The only real limitation on the size of assembly jobs is the amount of near+far memory you have available. Labels are "limited" to 240 characters (all significant), and the object size is limited to 64K (of course). Numerical values are "limited" to 32 bits or less. Relative labels ("+" and "-" labels) are implemented in the same way as in the Buddy assembler.

Only the add and subtract dyadic operators are currently implemented for expressions, along with the positive, negate, high-byte, and low-byte monadic operators. The planned macro and conditional assembly features are not yet implemented. Expressions are limited to 17 operands (with 255 monadic operators each), but references to unresolved identifiers are allowed anywhere, including equate definitions. Hierarchical inclusion of source files is not yet implemented.

The ACE source code will eventually be converted to use this assembler. The assembler code itself has been converted.

The assembler is designed to be a "heavy hitter", operates at moderate speed, and uses a fair amount of dynamically allocated memory. In fact, on an unexpanded 64, you won't be able to assemble programs that are too large, including the assembler itself (85K of source). You'll be able to do larger jobs on an unexpanded 64 if you deactivate the soft-80 screen in the configuration. I'll be working on making more RAM0 memory available to ACE applications. (Of course, one could argue that any serious 64 hacker would have expanded memory anyways...).
In addition to the regular 6502 instructions, this release of the assembler has the following directives:

    label = value               ;assign given value to the label
    label:                      ;assign the current assembly address to label
    +                           ;generate a temporary label, assign cur address
    -                           ;generate a temporary label, assign cur address
    org address                 ;set the origin of the assembly
    buf size                    ;reserve "size" bytes of space, filled with zeroes
    db val1, val2, ..., valN    ;put byte values into memory
    dw val1, val2, ..., valN    ;put word values into memory
    dt val1, val2, ..., valN    ;put "triple" (3-byte) values into memory, lo->hi
    dl val1, val2, ..., valN    ;put "long" (4-byte) values into memory, lo->hi

These features are described in more detail below. Note that throughout the documentation, I use the terms "identifier", "symbol", and "label" interchangeably.

------------------------------------------------------------------------------
2. USAGE

The usage for the as command is, stated in Unix notation:

    usage: as [-help] [-s] [-d] [file ...]

The "-help" flag will cause the assembler to display the usage information and then exit, without assembling any code. Actually, any flag that it doesn't understand will be taken as if you had said "-help", but note that if you type the "as" command alone on a command line, the usage information will not be given.

The "-s" flag tells the assembler to generate a symbol-table listing when the assembly job is finished. The table is formatted for an 80-column display.

The "-d" flag tells the assembler to produce debugging information while it is working. It will generate a lot of output, so you can see exactly what is going on.

The object-code module name will be "a.out" unless the name of the first source file ends with a ".s" extension, in which case the object module will be the base name of the first source file (without the extension).
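The output-naming rule just described can be sketched in a few lines of Python. This only models the documented behaviour and is not the assembler's own code; the function name `object_module_name` is made up for illustration, and treating the ".s" test as case-sensitive is an assumption.

```python
def object_module_name(source_files):
    """Pick the output name from the source files given on the command line.

    Assumption: the ".s" check is case-sensitive; the documentation does not say.
    """
    if source_files and source_files[0].endswith(".s"):
        # Base name of the FIRST source file, without the ".s" extension.
        return source_files[0][: -len(".s")]
    # No files (stdin input) or a first file without ".s": default name.
    return "a.out"


if __name__ == "__main__":
    print(object_module_name(["hello.s", "lib.s"]))
    print(object_module_name(["hello.asm"]))
    print(object_module_name([]))
```

Only the first file name matters; later files are read into the same job but do not affect the output name.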
The object module will be written as a PRG file and will be in Commodore-DOS program format: the first two bytes will be the low and high bytes of the code address, and the rest will be the binary image of the assembled code.

If no source filename is given on the command line, then input is taken from the stdin file stream (and written to "a.out"). If more than one filename is given, then each is read, in turn, into the same assembly job (as if the files were "cat"ted together into one source file). (This will change subtly when the assembler is completed).

This assembler does not produce a listing of the code assembled and will stop the whole assembly job on the first error it encounters.

------------------------------------------------------------------------------
3. TOKENS

While reading your source code, the assembler groups characters into tokens and interprets them as a complete unit. The assembler works with five different types of tokens: identifiers, numeric literals, string literals, special characters, and end-of-file (eof). Eof is special since it doesn't actually include any characters, and its only meaning is to stop reading from the current source.

Your input source file should consist only of characters that are printable in standard ASCII (don't be confused by this; the assembler expects its input to be in PETSCII) plus TAB and Carriage-Return. Other characters may confuse the assembler.

Identifiers consist of a lowercase or uppercase letter or an underscore followed by a sequence of such letters or decimal digits. This is a pretty standard definition of an identifier. Identifiers are limited to 240 characters in length and an error will be reported if you try to use one longer than that. All of the characters of all identifiers are significant, and letters are case-sensitive. Here are some examples of all-unique identifiers:

    hello
    Hello
    _time4
    a1_x140J
    HelloThereThisIsA_LongOne

Numeric literals come in three types: decimal, hexadecimal, and binary.
Decimal literals consist of an initial digit from 0 to 9 followed by any number of digits, provided that the value does not exceed 2^32-1 (approx. 4 billion). All types of literals can also have embedded underscore characters, which are ignored by the assembler. Use them for grouping digits (like the comma for big American numbers).

Hexadecimal literals consist of a dollar sign ($) followed by any number of hexadecimal digits, provided the value doesn't overflow 32 bits. Hexadecimal digits include the decimal digits (0-9), and the first six uppercase or lowercase letters of the alphabet (either a-f or A-F). Hexadecimal literals can also have embedded underscore characters for separators.

Binary literals consist of a percent sign (%) followed by any number of binary digits that don't overflow a 32-bit value. The binary digits are, of course, 0 and 1, and literals may include embedded underscore characters. Note that negative values are not literals. Here are some examples of valid literals:

    0
    123
    0001
    4_294_967_295
    $aeFF
    $0123_4567
    %010100
    %110_1010_0111_1010

String literals are sequences of characters enclosed in either single (') or double (") quotation marks. The enclosed characters are not interpreted to be independent tokens, no matter what they are. One exception is that the carriage-return character cannot be enclosed in a string (this normally indicates an error anyway). To get special non-printable characters into your strings, an "escape" character is provided: the backslash (\). If the backslash character is encountered, then the character following it is interpreted and a special character code is put into the string in place of the backslash and the following character.
Here are the characters allowed to follow a backslash:

    CHAR  CODE  MEANING
    ----  ----  --------
     \     92   backslash character (\)
     n     13   carriage return (newline)
     b     20   backspace (this is a non-destructive backspace for ACE)
     t      9   tab
     r     10   goto beginning of line (for ACE, linefeed for CBM)
     a      7   bell sound
     z      0   null character (often used as a string terminator in ACE)
     '     39   single quote (')
     e     27   escape
     0      0   null character
     q     34   quotation mark
     "     34   quotation mark

So, if you really want a backslash then you have to use two of them. If you wish to include an arbitrary character in a literal string, no facility is provided for doing that. However, the assembler will allow you to intermix strings and numeric expressions at a higher level, so you can do it this way. Strings are limited to 240 (encoded) characters or less. This is really no limitation to assembling, since you can put as many string literals contiguously into memory as you wish. Here are some examples:

    "Hello there"
    "error!\a\a"
    'file "output" could not be opened\n\0'
    "you 'dummy'!"
    'you \'dummy\'!'
    "Here are two backslashes: \\\\"

Special characters are single characters that cannot be interpreted as any of the other types of tokens. These are usually "punctuation" characters, but carriage return is also a special-character token (it is a statement separator). Some examples follow:

    ,  (  #  &  )  =  /  ?  \  ~  {

Tokens are separated either by the next character of input not being allowed to belong to the current token type, or by whitespace. Whitespace characters include SPACE (" ") and TAB. Note that carriage return is not counted as whitespace.

Comments are allowed by using a ";" character. Everything following the semicolon up to but not including the carriage return at the end of the line will be ignored by the assembler. (I may implement an artificial-intelligence comment parser to make sure the assembler does what you want it to, but this will be strictly an optional, time-permitting feature).
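The numeric-literal and string-escape rules from this section can be modelled with a short Python sketch. It is an illustration of the documented rules only, not the assembler's tokenizer: the function names are made up, `ord()` stands in for the PETSCII code of an ordinary character, and rejecting an unknown escape with "Syntax error" is an assumption.

```python
# Escape codes taken from the table in this section.
ESCAPES = {"\\": 92, "n": 13, "b": 20, "t": 9, "r": 10, "a": 7,
           "z": 0, "'": 39, "e": 27, "0": 0, "q": 34, '"': 34}

DIGITS = {10: "0123456789", 16: "0123456789abcdefABCDEF", 2: "01"}


def parse_numeric_literal(text):
    """Value of a decimal, $hex or %binary literal; underscores are ignored."""
    if text[:1] == "$":
        base, body = 16, text[1:]
    elif text[:1] == "%":
        base, body = 2, text[1:]
    else:
        base, body = 10, text
        if body[:1] == "_":            # a decimal literal starts with a digit
            raise ValueError("Invalid numeric literal")
    digits = body.replace("_", "")
    if not digits or any(ch not in DIGITS[base] for ch in digits):
        raise ValueError("Invalid numeric literal")
    value = int(digits, base)
    if value > 0xFFFFFFFF:
        raise ValueError("Numeric literal value overflows 32-bits")
    return value


def decode_string_body(body):
    """Character codes for the text between the quotes of a string literal."""
    codes, i = [], 0
    while i < len(body):
        if body[i] == "\\":
            i += 1
            if i >= len(body) or body[i] not in ESCAPES:
                raise ValueError("Syntax error")   # assumed behaviour
            codes.append(ESCAPES[body[i]])
        else:
            codes.append(ord(body[i]))             # stand-in for PETSCII
        i += 1
    if len(codes) > 240:
        raise ValueError("A string literal exceeds 240 chars in length")
    return codes


if __name__ == "__main__":
    print(parse_numeric_literal("$0123_4567"))
    print(decode_string_body("error!\\a\\a"))
```

Note that the 240-character limit is checked on the encoded length, so an escape pair counts as one character.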
------------------------------------------------------------------------------
4. EXPRESSIONS

Numeric expressions consist of operands and operators. If you don't know what operands and operators are, then go buy an elementary-school math book. There are six types of operands: numeric literals, single-character string literals, identifiers, the asterisk character, one or more plus signs, and one or more minus signs. These last three types can make parsing an expression a bit confusing, but they are necessary and useful.

Numeric literals are pretty easy to think about. They're just 32-bit numbers and work in the usual way. Single-character string literals are also interpreted (in the context of a numeric expression) as being a numeric literal. The value of a single-character string is simply the PETSCII code for the character.

Identifiers or "symbols" or "labels" used in expressions refer to numeric values that have been or will be assigned to the identifiers. Binding values to identifiers is done by assembler directives discussed in a later section. If an identifier already has a value assigned to it by the time that the current expression is reached in assembly, then it is treated as if it were a numeric literal of the value assigned to the identifier. If the identifier currently has no value assigned to it (i.e., it is "unresolved"), then the entire expression will be unresolved. In this case, the expression will be recorded and will be evaluated at a later time when all of its identifiers become resolved. A "hole" will be created where the expression should go, and the hole will be "filled in" later. Note that there are a couple of directives for which an expression must be resolved at the time it is referenced.

The asterisk character operates much like a numeric literal, except that its value is the current code address rather than a constant. The current code address will always be for the start of an assembler instruction.
I.e., the current code address is incremented only after an instruction is assembled. This has some subtle implications, and other assemblers may implement slightly different semantics. Directives are a little different in that the address is incremented after every value in a "commalist" is put into memory.

Relative references, operands consisting of a number of pluses or minuses, operate much like identifiers. They are provided for convenience and work exactly how they do in the Buddy assembler. Operands of all minuses are backward references and operands of all pluses are forward references. Because of parsing difficulties, relative-reference operands must either be the last operand in an expression or must be followed by a ":" character. The number of pluses or minuses tells which relative reference "point" is being referred to. A reference point is set by the "+" and "-" assembler directives discussed later. This gets difficult to explain with words, so here is a code example:

       ldy #5
    -  ldx #0
    -  lda name1,x
       sta name2,x
       beq +
       cmp #"x"
       beq ++
       inx
       bne -
    +  dey
       bne --
    +  rts

This relatively bogus subroutine will copy a null-terminated character string from name1 to name2 five times, unless the string contains an "x" character, in which case the copy operation terminates immediately upon encountering the "x". The "beq +" branches to the next "+" label to occur in the code, to the "dey" instruction. The "beq ++" branches to the "rts", to the "+" label following the next "+" label encountered. The "-" and "--" references work similarly, except that they refer to the previous "-" label and the previous to the previous "-" label. You can use up to 255 plus or minus signs in a relative-reference operand to refer to that many reference points away.

That I said relative-reference operands work much like identifiers above is no coincidence.
For each definition of a reference point and reference to a point, an internal identifier is generated that looks like "L+123c" or "L-123c". Note that you can't define or refer to these identifiers yourself.

There are two types of operators that can be used in expressions: monadic and dyadic operators. Monadic operators affect one operand, and dyadic operators affect two operands. At about this point, I should spell out the actual form of an expression. It is:

    [monadic_operators] operand [ operator [monadic_operators] operand [...] ]

or:

    1 + 2 -1 + -+-2 + 3

An expression may have up to 17 operands.

The monadic (one-operand) operators are: positive (+), negative (-), low-byte (<), and high-bytes (>). You can have up to 255 of each of these monadic operators for each operand of an expression. Positive doesn't actually do anything. Negative will return the 32-bit 2's complement of the operand that it is attached to. Low-byte will return the lowest eight bits of the operand it is attached to. High-bytes will return the high-order 24 bits of the 32-bit operand it is attached to. All expressions are evaluated in full 32-bit precision. Note that you can use the high-bytes operator more than once to extract even higher bytes. For example, "<>>value" will extract the second-highest byte of the 32-bit value.

The dyadic (two-operand) operators are currently only add (+) and subtract (-). Yes, the plus and minus symbols are horribly overloaded. I hope that we all know what add and subtract do. I am planning to implement more dyadic operators in the future (multiply, divide, and, or, not, exclusive-or).

Evaluation of dyadic operators is strictly left-to-right, and value overflows and underflows are ignored. Values are always considered to be positive, but this doesn't impact 2's complement negative arithmetic for add and subtract dyadic operators.

Monadic operators take precedence over dyadic operators. Evaluation of monadic operators is done a little differently.
All positive operators are thrown out since they don't actually do anything. Then, if there is an even number of negative operators, they are thrown out. If there is an odd number of negative operators, then the 2's complement negative of the operand is returned. Then, if there are any high-bytes operators, the value is shifted that number of bytes to the right and the highest-order byte of the value is set to zero. Note that it really doesn't make any sense to perform any more than three high-bytes operators. Then, the low-byte operator is performed, if asked for. It is equivalent to ANDing the value with $000000ff. It really doesn't make much sense to perform this operator more than once. Also, it doesn't make any difference in which order you place the monadic operators in an expression; they are always evaluated in the static order given above.

There is one exception here. If the first operand of an expression has high-bytes and/or low-byte monadic operators, then the rest of the expression is evaluated first and then the high/low-byte monadic operators are performed on the result. This is done to be consistent with other assemblers and with user expectations.

Parentheses are not supported. Here are some examples of valid expressions:

    2
    +2+1
    2+-1
    2+-------------------------------------1
    ++++:-+++:+---
    1+"x"-"a"+"A"
    <>>>4_000_000_000
    <label+1
    >label+1
    -1

This last one ends up with a value of negative one, which is interpreted as really being 4_294_967_295. If you were to try and do something like "lda #-1", you would get an error because the value would be interpreted as being way too big.

Expression results are currently considered to be one of two types: value and address. (The complete set must be value, address, address-low-byte, and address-high-byte in order to be actually useful). Values are what you would expect and come from numeric and single-character-string-literal operands.
The address type comes from the asterisk and relative reference operands and from identifier operands which are defined to be addresses. An address is defined to be only an address in the range of the assembled code. Addresses outside of this range are considered to be values. The distinction of values and addresses is currently not used, but will be in the future when code relocation features are implemented. Keeping track of expression types makes it possible to generate a list of all values in memory that must be modified in order to relocate a program to a new address without reassembling it.

String "expressions" consist of only a single string literal. No operators are allowed. Some assembler directives accept either numeric or string expressions and interpret them appropriately (like "db").

------------------------------------------------------------------------------
5. PROCESSOR INSTRUCTIONS

This assembler accepts the 56 standard 6502 processor instructions. It does not provide un-documented 6502 instructions nor 65c02 nor 65816 instructions nor custom pseudo-ops. The latter will be provided by future macro features. All of the assembler instructions must be in lowercase or they will not be recognized. Here are the instructions:

    NUM  INS   NUM  INS   NUM  INS   NUM  INS   NUM  INS
    ---  ---   ---  ---   ---  ---   ---  ---   ---  ---
               12.  bvc   24.  eor   36.  pha   48.  sta
    01.  adc   13.  bvs   25.  inc   37.  php   49.  stx
    02.  and   14.  clc   26.  inx   38.  pla   50.  sty
    03.  asl   15.  cld   27.  iny   39.  plp   51.  tax
    04.  bcc   16.  cli   28.  jmp   40.  rol   52.  tay
    05.  bcs   17.  clv   29.  jsr   41.  ror   53.  tsx
    06.  beq   18.  cmp   30.  lda   42.  rti   54.  txa
    07.  bit   19.  cpx   31.  ldx   43.  rts   55.  txs
    08.  bmi   20.  cpy   32.  ldy   44.  sbc   56.  tya
    09.  bne   21.  dec   33.  lsr   45.  sec
    10.  bpl   22.  dex   34.  nop   46.  sed
    11.  brk   23.  dey   35.  ora   47.  sei

The assembler also supports 12 addressing modes.
The "accumulator" addressing mode that can be used with the rotate and shift instructions is treated like the implied addressing mode, so a shift-left-accumulator instruction would be just "asl" rather than "asl a". Many other assemblers get rid of the accumulator addressing mode also. Also, the ",x" and ",y" addressing modes must be given with a lowercase "x" or "y" or they will not be recognized. Here is the token syntax for the addressing modes (CR means carriage return):

    num  name       gen  byt  example  tokens
    ---  ---------  ---  ---  -------  -------
    01.  implied    00.   1            CR
    02.  immediate  00.   2   #123     # / exp8 / CR
    03.  relative   00.   2   *+20     exp16 / CR
    04.  zeropage   07.   2   123      exp8 / CR
    05.  zp,x       08.   2   123,x    exp8 / , / x / CR
    06.  zp,y       09.   2   123,y    exp8 / , / y / CR
    07.  absolute   00.   3   12345    exp16 / CR
    08.  abs,x      00.   3   12345,x  exp16 / , / x / CR
    09.  abs,y      00.   3   12345,y  exp16 / , / y / CR
    10.  indirect   00.   3   (12345)  ( / exp16 / ) / CR
    11.  (ind,x)    00.   2   (123,x)  ( / exp8 / , / x / ) / CR
    12.  (ind),y    00.   2   (123),y  ( / exp8 / ) / , / y / CR

Each instruction takes a complete line and each addressing mode must be terminated by a carriage return token (comments are skipped). The format of an instruction line is as follows:

    [prefix_directives] instruction address_mode_operand

In the case that an expression in an addressing mode is resolved at the point it is encountered and its value is less than 256, the assembler will try to use the zero-page addressing modes if possible. On the other hand, if a zero-page addressing mode is unavailable for an instruction, then the assembler will promote or generalize the zero-page addressing mode to an absolute addressing mode, if possible. This is what the "gen" column in the table above shows: it gives the number of the mode that a zero-page mode generalizes to. If, after attempting to generalize the addressing mode, the given addressing mode is still not valid with the given instruction, then an error will be generated.
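The mode selection for a resolved operand described above can be sketched in Python as follows. This is a model of the documented rule and not the assembler's code: the names `choose_mode`, `SUPPORTED` and `GENERALIZE` are made up for the sketch, and the instruction table is a four-entry excerpt of the 6502 set chosen for illustration.

```python
# "gen" column of the table: each zero-page mode and the absolute mode it
# generalizes to.
GENERALIZE = {"zp": "abs", "zp,x": "abs,x", "zp,y": "abs,y"}

# Illustrative excerpt: non-parenthesized modes supported by a few instructions.
SUPPORTED = {
    "lda": {"imm", "zp", "zp,x", "abs", "abs,x", "abs,y"},
    "ldx": {"imm", "zp", "zp,y", "abs", "abs,y"},
    "stx": {"zp", "zp,y", "abs"},
    "jmp": {"abs"},
}


def choose_mode(instruction, index, value):
    """Pick the addressing mode for `instr value`, `instr value,x` or `instr value,y`.

    `index` is "", "x" or "y"; `value` is the already-resolved operand.
    """
    suffix = "," + index if index else ""
    # Resolved values below 256 try the zero-page form first.
    mode = ("zp" if value < 256 else "abs") + suffix
    modes = SUPPORTED[instruction]
    # Zero-page form unavailable: promote it to the absolute form.
    if mode not in modes and mode in GENERALIZE:
        mode = GENERALIZE[mode]
    if mode not in modes:
        raise ValueError("Instruction does not support given address mode")
    return mode


if __name__ == "__main__":
    print(choose_mode("lda", "", 0x12))    # zero page is available
    print(choose_mode("lda", "y", 0x12))   # no "zp,y" for lda, so "abs,y"
    print(choose_mode("jmp", "", 0x12))    # jmp has no zero-page form
```

For example, "lda 18,y" ends up as a three-byte absolute,y instruction because the 6502 has no zero-page,y form of lda, while "stx 4660,y" is rejected because stx has no absolute,y form.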
In the case that an expression in an addressing mode cannot be resolved at the point where it is encountered in the assembler's single pass, a hole is left behind, and that hole is made as "large" as possible; it is assumed that you will fill in the hole with the largest value possible. This means, for example, if you were to assemble the following instruction:

    lda var,x

then the assembler would assume this is an absolute mode, and will fill in the hole later as such, even if it turns out that "var" is assigned a value less than 256 later on. This results in slight inefficiency in the code produced by this assembler, but the same situation causes most two-pass assemblers to fail completely on a "phase error". An easy way to avoid this circumstance is to make sure that all zero-page labels are defined before they are referred to.

The addressing modes that require a single byte value and that will not "generalize" to an absolute mode will have a single-byte hole created for them. Only the branching instructions will be interpreted as having the relative addressing mode, and a single-byte hole will be left. Two exceptions to the above rules are the "stx zp,y" and "sty zp,x", which will leave a single-byte hole on an unresolved expression, since the absolute-mode generalizations for these instructions are not supported by the processor.

------------------------------------------------------------------------------
6. DIRECTIVES

There are currently five classes of assembler directives; there will be more in the future.

6.1. DO-NOTHING DIRECTIVES

There are two do-nothing directives:

    #            ;does nothing
                 ;blank line--does nothing

A blank line in your source code will simply be ignored. This helps to make code much more readable. The "#" directive is a prefix directive. This means that it does not occupy an entire line but allows other directives and processor instructions to follow it on the same line (including other prefix directives).
(But note that you can follow any prefix directive by the blank-line directive, effectively allowing prefix directives to be regular full-line directives (powerful combining forms)). The "#" directive is simply ignored by the assembler, but you can use it to highlight certain lines of code or other directives, like the future "include" directive.

6.2. ASSIGNMENT DIRECTIVES

There are four assignment directives. They all assign (bind) a value to an identifier. Here they are:

    label = expression    ;assign given value to the label
    label:                ;assign the current assembly address to label
    +                     ;generate a temporary label, assign cur address
    -                     ;generate a temporary label, assign cur address

The first (label=expr) is the most general. It assigns the result of evaluating the expression to the given label. Because this assembler is so gosh-darned awesome, the expression doesn't even have to be resolved; a "hole" will be created saying to fill in the assigned label when all of the unresolved identifiers in the expression eventually become resolved. Most other assemblers (in fact, all that I have ever heard of) can't do this because it causes ugly implementation problems, like cascading label resolutions. Consider the following example:

    lda #a
    sta b,x
    a = b+3
    b = c-1
    c = 5

At the point where c becomes defined, there are no "memory holes" that depend on it directly, but the label hole "b" must be evaluated and filled in. "b" gets assigned the value 4. At this point, there are two holes that depend on "b": the one in the "sta" instruction and the label "a". We fill them both in, assigning "a" the value 7, and we discover that we can fill in a new hole: the one in the "lda" instruction. We do that and we are finally done. The implementation can handle any number of these recursive label hole-fillings, limited only by the amount of near+far memory you have.

A label can only be assigned a value once, and you will get an error if you try to redefine a label, even if it is currently unresolved.
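The cascading hole-filling in the example above can be modelled with a small Python sketch. It is a simplified illustration under stated assumptions, not the assembler's data structure: the names `HoleTable`, `reference` and `define` are made up, expressions are reduced to lists of (sign, operand) terms, and each blocked hole waits on the first unresolved label it meets.

```python
def evaluate(terms, values):
    """Return (value, None) when resolved, else (None, first_unresolved_label)."""
    total = 0
    for sign, operand in terms:
        if isinstance(operand, str):
            if operand not in values:
                return None, operand
            operand = values[operand]
        total = (total + sign * operand) & 0xFFFFFFFF   # 32-bit wraparound
    return total, None


class HoleTable:
    def __init__(self):
        self.values = {}      # resolved label -> value
        self.defined = set()  # labels already given a definition
        self.waiting = {}     # unresolved label -> holes blocked on it
        self.memory = []      # (description, value) of filled memory holes

    def reference(self, where, terms):
        """An instruction operand: fill it now or leave a memory hole."""
        self._resolve(("mem", where, terms))

    def define(self, label, terms):
        """label = expression; the expression may still be unresolved."""
        if label in self.defined:
            raise ValueError("Attempt to redefine a symbol")
        self.defined.add(label)
        self._resolve(("label", label, terms))

    def _resolve(self, hole):
        work = [hole]
        while work:
            kind, target, terms = work.pop()
            value, missing = evaluate(terms, self.values)
            if missing is not None:
                self.waiting.setdefault(missing, []).append((kind, target, terms))
            elif kind == "mem":
                self.memory.append((target, value))
            else:
                self.values[target] = value
                # A newly resolved label may unblock other holes (the cascade).
                work.extend(self.waiting.pop(target, []))


if __name__ == "__main__":
    table = HoleTable()
    table.reference("lda #a", [(1, "a")])
    table.reference("sta b,x", [(1, "b")])
    table.define("a", [(1, "b"), (1, 3)])
    table.define("b", [(1, "c"), (-1, 1)])
    table.define("c", [(1, 5)])
    print(table.values, table.memory)
```

Defining "c" triggers the whole chain: b = 5 - 1 = 4, then a = 4 + 3 = 7, and both instruction holes are filled. Anything still left in `waiting` at the end of the job corresponds to an unresolved-symbol error.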
Also, all expressions must be resolved by the end of the assembly job, or an error will be reported (but only one--naming the first unresolved label that the assembler runs across; I may fix this up in the future).

The second assignment directive is equivalent to "label = *", but it is more convenient and is also a prefix directive. It assigns the current address (as of the start of the current line) to the given identifier. The colon is used with this directive to make it easy and efficient to parse, and to make it easy for a human to see that a label is being defined. Many other assemblers follow this directive with just whitespace and rely on other tricks, like putting an ugly dot before each directive, to bail them out.

The third and fourth set relative reference points. They are equivalent to "rel_label = *", where "rel_label" is a specially generated internal identifier of the form "L+123c" mentioned in the expression section. The labels defined by these directives show up in the symbol table dump, if you ask for one on the command line. These are also prefix directives, so if you wanted to set a forward and a backward reference to the same address, then you would do something like:

    +- lda #1

In fact, you could put as many of these directives on the front of a line as you want, though more than one of each will be of little use. Note that backward relative labels will always be defined at the point that they are referenced and forward relative labels will always be undefined (unresolved) when they are referenced. If at the end of your assembly job the assembler complains of an unresolved reference involving a label of the form "L+123c", then you have referred to a forward-relative point that you never set, and if the label is of the form "L-4000000000c", then you have referred to a backward-relative point that you never defined.

6.3. ORIGIN DIRECTIVE

    org address_expression    ;set the origin of the assembly

This directive will set the code origin to the given expression.
The expression MUST be resolved at the point where it appears, since it would be very difficult to fill in the type of hole this would leave behind (though not impossible, hmmm...). The origin must be set before any processor instruction or assembler directive that generates memory values is encountered, and the code origin can only be set once. This results in a contiguous code region, which is what ACE and the Commodore Kernal require.

6.4. DEFINE-BYTES DIRECTIVES

    db exp1, exp2, ..., expN    ;put byte values into memory
    dw exp1, exp2, ..., expN    ;put word values into memory
    dt exp1, exp2, ..., expN    ;put "triple" (3-byte) values into memory, lo->hi
    dl exp1, exp2, ..., expN    ;put "long" (4-byte) values into memory, lo->hi

These directives all put values into code memory, at the current address. The only difference between the four of them is the size of the data values they put into memory: bytes (8 bits), words (16 bits), triples (24 bits), and longs (32 bits). The code address is incremented by the appropriate number of bytes between putting each value into memory. Any number of values can be specified by separating them by commas. All expressions are evaluated in full 32 bits, but must fit into the size for the directive. The expressions don't have to be resolved at the time they appear. These directives can also be given strings for arguments, which means that each character of the string will be stored as one byte/word/etc. in memory, for example:

    db 123, abc+xyz+%1101-"a"+$1, "hello", 0, "yo!", "keep on hackin'\0"

6.5. BUF DIRECTIVE

    buf size_expression    ;reserve "size" bytes of space, filled with zeroes

This directive reserves the given number of bytes of space from the current code address and fills them with zeroes. The expression must be resolved, and can be any value from 0 up to 65535 or the number of bytes remaining until the code address overflows the 64K code space limit.

6.6. PARSING

Because of the way that the assembler parses the source code (it uses a one-character-peek-ahead ad-hoc parser), you can define labels that are also directive names or processor-instruction names. This is not a recommended practice, since you can end up with lines that look like:

    x: lda: lda lda,x

The parser will know what to do, but most humans won't. Also, because of the tokenizer, you can put arbitrary spacing between tokens, except between tokens that would otherwise merge together (like two adjacent identifiers or decimal numbers).

------------------------------------------------------------------------------
7. ERROR HANDLING

When an error is detected, the assembler will stop the whole assembly job and print out one error message (to the stderr file stream). Here are two examples of error messages:

    err ("k:":2:0) Value is too large or negative
    err ("k:":3:0), ref("k:":2:0) Value is too large or negative

In both error messages, the stuff inside of the parentheses is the filename of the source file (the keyboard here), the source line where the error was detected, and the column number where the error was detected. Currently, the column number is not implemented so it is always zero. When it is implemented, the column numbers will start from 1, like in the Zed text editor, and it will point to the first character of the token where the error was discovered.

In the first example, the error occurred because the expression was resolved and the value was found to be too large for whatever operation was attempted. In the second example, an expression was used but unresolved on line 2 of the source file, and when its unresolved identifier(s) were finally filled in on line 3 of the source, the "hole" to be filled in was found to be too small for the value, so an error resulted. This is what the "ref" file position means.

Filenames are included in error messages because in the future, it will be possible to have errors crop up in included files and elsewhere.
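A tool that wants to jump to the error location (an editor, say) could pick the two message forms apart along these lines. This is a hedged Python sketch based only on the two sample messages above: the function name `parse_error` is made up, and the tolerance for extra spaces is an assumption, since the exact spacing the assembler prints is not specified here.

```python
import re

# One position of the form ("file":line:col); {0} prefixes the group names.
POSITION = r'\("(?P<{0}file>[^"]*)":(?P<{0}line>\d+):(?P<{0}col>\d+)\)'

ERROR_RE = re.compile(
    r"err\s*" + POSITION.format("")
    + r"(?:\s*,\s*ref\s*" + POSITION.format("ref_") + r")?"
    + r"\s*(?P<message>.*)"
)


def parse_error(line):
    """Split an assembler error line into its parts, or return None."""
    match = ERROR_RE.fullmatch(line.strip())
    if match is None:
        return None
    result = {
        "file": match["file"],
        "line": int(match["line"]),
        "col": int(match["col"]),       # currently always 0
        "ref": None,
        "message": match["message"],
    }
    if match["ref_file"] is not None:
        # Position where the later-resolved expression was originally used.
        result["ref"] = (match["ref_file"], int(match["ref_line"]),
                         int(match["ref_col"]))
    return result


if __name__ == "__main__":
    print(parse_error('err ("k:":3:0), ref("k:":2:0) Value is too large or negative'))
```

The "ref" position, when present, is usually the more useful one to show the user, because it is where the offending expression was written.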
Here is the entire list of possible error messages:

    NUM   MEANING
    ---   -------
    01.   "An identifier token exceeds 240 chars in length"
    02.   "A string literal exceeds 240 chars in length"
    03.   "Ran into a CR before end of string literal"
    04.   "Invalid numeric literal"
    05.   "Numeric literal value overflows 32-bits"
    06.   "Syntax error"
    07.   "Attempt to perform numeric operators on a string"
    08.   "Expression has more than 17 operands"
    09.   "Ran out of memory during compilation process"
    10.   "Attempt to redefine a symbol"
    11.   "Attempt to assemble code with code origin not set"
    12.   "Internal error: attempt to assign to unexpected id"
    13.   "Non-numeric symbol in a numeric expression"
    14.   "Expecting an operator"
    15.   "Expecting an operand"
    16.   "Expecting a command"
    17.   "Value is too large or negative"
    18.   "Branch out of range"
    19.   "Feature is not (yet) implemented"
    20.   "Instruction does not support given address mode"
    21.   "Address wraped around 64K code address space"
    22.   "Error trying to write output object file"
    23.   "Directive requires resolved expression"
    24.   "Code origin already set; you can't set it twice"
    25.   "Unresolved symbol: "
    26.   "Thus assembler doesn't accept .dot commands, Buddy!"

A "Syntax error" (#06) will be reported whenever a token other than one that was expected is found. "Ran out of memory" (#09) may turn up often on an unexpanded 64. "Expecting command" (#16) means that the assembler was expecting either a processor instruction or directive but found something else instead. "Not implemented" (#19) means that you've tried to use a directive that isn't implemented yet. "Unresolved symbol" (#25) will be printed with a randomly chosen unresolved symbol, with the last place in the source code where it was referenced. "Dot commands" (#26) is a reminder that directives in this assembler are not prefixed with a dot (.).

There are two main reasons behind the idea of stopping at the first error encountered: simplicity and interoperability.
When Zed is implemented for ACE, it will have a feature that will allow it to
invoke the assembler (as a sub-process) and have the assembler return an error
location and message to Zed, which will display the error message and position
the cursor to the error location (if the source file is loaded).

While on the subject of messages coming out of the assembler, here is an
example of the format of the symbol table dump that you can ask for on the
command line. One line is printed for each identifier. The "hash" value is
the bucket in the hash table chosen for the identifier. This may not have a
whole lot of meaning for a user, but a good distribution of these hash buckets
in the symbol table is a good thing. Next is the 32-bit "hexvalue" of the
label followed by the value in "decimal". Then comes the type. A type of "v"
means value and "a" means an in-code-range address. Then comes the name of
the identifier. It comes last to give lots of space to print it. If an
identifier is ten or fewer characters long, its symbol-table-dump line will
fit on a 40-column screen. At the bottom, the number of symbols is printed.
This table is directed to the stdout file stream, so you can redirect it to a
file in order to save it.

HASH HEXVALUE    DECIMAL T NAME
---- -------- ---------- - -----
   8 00000f06       3846 v aceArgv
 469 00007008      28680 a main
--
Number of symbols: 2

------------------------------------------------------------------------------
8. IMPLEMENTATION

In each of the ways in which this assembler is heavy-weight and slowed-down
compared to other assemblers, it is also more powerful and more flexible.

- It uses far memory for storing symbols, so there is no static or arbitrarily
  small limit on the number of symbols. Macro sizes will also be limited by
  only the amount of memory available, as well as the "hole table".
- It has to maintain a "hole table" because of its structure, but this means
  that you can define labels in terms of other unresolved labels, that you
  will never get a "sync error" because of incorrect assumptions made (and not
  recorded) about unresolved labels, and that modular assembly can be
  implemented without too much further effort (i.e., ".o" or ".obj" files),
  since an unresolved external reference handling mechanism is already
  implemented.

- The assembler keeps track of the "types" of labels, either "address" or
  "value", which makes it possible to provide code relocation information that
  will be needed by modular assembly and by future multitasking operating
  systems.

- Because a "hole table" approach is used, the raw object code must be stored
  internally until the assembly is complete and then it can be written out to
  a file, but this also means that header information can be provided in an
  output file since all assembly results will be known before any output is
  written.

- I took the easy way out for handling errors; when an error is detected, an
  error message is generated and printed and the assembler STOPs. But the
  exit mechanism provided by ACE makes it possible to integrate the assembler
  with other programs, like a text editor, to move the text editor cursor to
  the line and column containing the error and display a message in the text
  editor.

There are two speed advantages that this assembler has over (some?) others:

- It uses a 1024-entry hash table of pointers to chains of labels, so, for a
  program that has 800 or so symbols, each can be accessed in something like
  1.3 tries. For N total symbols, the required number of references is
  approximately MAX( N/1024, 1 ).

- It is one-pass, so it only has to go through the overhead of reading the
  source file once. Depending on the type of device the file is stored on,
  this may give a considerable savings.
  This also makes it possible to "pipe" the output of another program into
  the assembler, without any "rewind" problems.

Here are some performance figures, compared to the Buddy assembler for the
128. All test cases were run on a C128 in 2-MHz mode with a RAMLink, REU, and
1571 available.

ASSEMB   TIME(sec)   FILE DEVICE   FAR STORAGE
------   ---------   -----------   -----------
Buddy         45.5   RAMLink       n/a
ACE-as        61.5   RAMLink       REU
ACE-as        49.5   ACE ramdisk   REU
ACE-as        75.6   RAMLink       RAM0+RAM1
ACE-as       150.5   1571          RAM0+RAM1
Buddy        240.0   1571          n/a

Part of the assembly job was loaded into memory for the Buddy assembler, but
the load time is included in the figure. As you can see, Buddy performs
faster with a fast file device and slower with a slow file device (because it
requires two passes). I have a couple of tricks up my sleeve to improve the
ACE assembler's performance.

There are also a couple of subtle errors in this implementation. First, if it
receives a "short block" from the source device, it will put whitespace
between the current block and the next, thus potentially splitting a token.
Also, if multiple files are used, the "ref" filename may not be valid.

Here are a few data structures for your enjoyment.
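
The fixed-size parts of the descriptor layouts listed in this section can be
cross-checked with a short sketch. It is written in Python rather than
assembly and is not taken from the assembler source. The field order and
sizes come from the layout tables; little-endian byte order is an assumption
(it matches the lo->hi order used by the dw/dt/dl directives), and all of the
names are hypothetical.

```python
import struct

# Identifier descriptor, offsets 0..11 (the name string follows at offset 12):
# next link, value/pointer, reference offset, type, class, name length.
IDENT_HEADER = struct.Struct("<IIBBBB")

# Expression/hole descriptor, offsets 0..15:
# hole type, expression length, unresolved count, source column,
# hole address, source line, source file pointer.
HOLE_HEADER = struct.Struct("<BBBBIII")

# Expression operand descriptor, 14 bytes:
# operator, value type, sign, hi/lo counts, value, next reference,
# offset of next reference in its hole structure, reserved.
OPERAND = struct.Struct("<BBBBIIBB")

MAX_OPERANDS = 17

# Documented type codes for an identifier.
TYPE_VALUE, TYPE_ADDRESS, TYPE_UNRESOLVED, TYPE_UNRES_DEFINE = 0x00, 0x01, 0x80, 0xFF


def pack_identifier(next_ptr, value, ref_offset, sym_type, sym_class, name):
    """Build the bytes of one identifier descriptor (header + name + NUL)."""
    raw = name.encode("ascii")
    if not 1 <= len(raw) <= 240:
        raise ValueError("identifier names are 1-240 characters long")
    header = IDENT_HEADER.pack(next_ptr, value, ref_offset,
                               sym_type, sym_class, len(raw))
    return header + raw + b"\x00"


def unpack_identifier(blob):
    """Decode an identifier descriptor produced by pack_identifier."""
    (next_ptr, value, ref_offset,
     sym_type, sym_class, name_len) = IDENT_HEADER.unpack_from(blob, 0)
    start = IDENT_HEADER.size
    name = blob[start:start + name_len].decode("ascii")
    return {"next": next_ptr, "value": value, "ref_offset": ref_offset,
            "type": sym_type, "class": sym_class, "name": name}


# Total size of one expression/hole descriptor: header plus 17 operand slots.
HOLE_DESCRIPTOR_SIZE = HOLE_HEADER.size + MAX_OPERANDS * OPERAND.size
```

The computed sizes agree with the tables: a 12-byte identifier header, a
16-byte hole header, 14-byte operand slots, and 16 + 17*14 = 254 bytes for a
full expression/hole descriptor, which is the END+1 offset given in the
layout.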

Identifier descriptor:

OFF  SIZ  DESCRIPTION
---  ---  ------------
  0    4  next link in hash table bucket
  4    4  value of symbol, pointer to reference list, or ptr to macro defn
  8    1  offset of reference in expression of reference list
  9    1  type: $00=value, $01=address, $80=unresolved, $ff=unresolved define
 10    1  class: $00=normal, $01=private, $80=global (not used yet)
 11    1  name length
 12    *  null-terminated name string (1-240 chars)

Expression/Hole descriptor:

OFF  SIZ  DESCRIPTION
---  ---  -----------
  0    1  hole type: $01=byte, $02=word, $03=triple, $04=long, $40=branch,
          $80=label
  1    1  expression length: maximum offset+1 in bytes
  2    1  number of unresolved references in expression
  3    1  source column of reference
  4    4  address of hole
  8    4  source line of reference
 12    4  source file pointer
 16   14  expression operand descriptor slot #1
 30   14  expression operand descriptor slot #2
 44   14  expression operand descriptor slot #3
 58   14  expression operand descriptor slot #4
 72   14  expression operand descriptor slot #5
 86   14  expression operand descriptor slot #6
100   14  expression operand descriptor slot #7
114   14  expression operand descriptor slot #8
128   14  expression operand descriptor slot #9
142   14  expression operand descriptor slot #10
156   14  expression operand descriptor slot #11
170   14  expression operand descriptor slot #12
184   14  expression operand descriptor slot #13
198   14  expression operand descriptor slot #14
212   14  expression operand descriptor slot #15
226   14  expression operand descriptor slot #16
240   14  expression operand descriptor slot #17
254    -  END+1

Expression operand descriptor:

OFF  SIZ  DESCRIPTION
---  ---  -----------
  0    1  operator: "+" or "-"
  1    1  type of value: $00=number, $01=address, $80=unresolved identifier
  2    1  monadic-operator result sign of value: $00=positive, $80=negative
  3    1  hi/lo operator counts: high_nybble=">" count, low_nybble="<" cnt
  4    4  numeric value or unresolved-identifier pointer
  8    4  next unresolved reference in chain for unresolved identifier
 12    1  offset in hole structure of next unresolved reference (operand)
 13    1  reserved
 14    -  END+1

------------------------------------------------------------------------------
9. THE FUTURE

This section is just random notes since I don't have the time right now to
fill it in. I will be implementing include files, conditional assembly, and
macro assembly features in the future. Modular assembly and relocatable-code
generation are also in my plans.

;todo: -implement storage classes: $00=internal, $01=rel.label, $80=exported
;      -implement all var types: 0=value, 1=address, 2=addr.high, 3=addr.low
;      -implement source column, make line:col point to start of cur token
;      -make it so you can use a "\<CR>" to continue a line
;      -add more operators: * / & | ~ full precedence?
;      -cache current symbol
;
; usage: as [-help] [-s] [-d] [-b] [-r] [-l] [-a addr] [file ...] [-o filename]
;
;     -help : produce this information, don't run
;        -s : produce symbol table dump at end
;        -d : provide debugging information (lots)
;        -b : produce binary module at end (default)
;        -r : produce relocatable module rather than binary module
;        -l : produce linkable ".o" module(s)
;        -a : set global code origin to given address
;        -o : put output into given filename
;
; If -l option is not used, all files, including source and object modules,
; will be assembled together. The output module name will be the base name of
; the first file given if it has a ".s" or ".o" extension, "a.out" if the first
; file has none of these extensions, or will be the filename given by the -o
; option if used.
;    If the -l option is used, then each given source module will be
; assembled independently into its own ".o" module. Object modules will be
; ignored.
;    The global origin will be either that given by the -a option (if it is
; used) or by the local origin of the first source/object module. Each
; source module that generates code must have a local code origin.
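
The planned output-naming rule in the notes above can be restated as a small
sketch. This is Python, not assembler code, and it describes intended
behavior only, since the -o option is not implemented in this release. The
function name output_module_name and its parameters are hypothetical, and
giving -o priority over the other rules is an assumption drawn from the
wording of the notes.

```python
def output_module_name(files, o_option=None):
    """Pick the output module name for the planned (non -l) mode.

    files    -- list of source/object filenames in command-line order
    o_option -- filename given with -o, or None if -o was not used
    """
    # Assumption: an explicit -o filename overrides everything else.
    if o_option is not None:
        return o_option
    if files:
        first = files[0]
        for ext in (".s", ".o"):
            # Use the base name of the first file if it has a known extension.
            if first.endswith(ext) and len(first) > len(ext):
                return first[:-len(ext)]
    # No files, or the first file has neither extension.
    return "a.out"
```

For example, output_module_name(["as.s", "lib.o"]) gives "as", while a first
file named "prog" falls back to "a.out".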

More Directives:

include "filename"
if <expression> <relop> <expression>
elsif <expression> <relop> <expression>
else
endif
macro macroname
endmacro
export label1, label2, ..., labelN
bss size_expression

macro blt
   there
   bcc there
endmacro

macro add  ;?1=operand
   clc
   adc ?1
endmacro

macro ldw  ;?1=dest, ?2=source
   if ?# != 2
      error "the ldw macro instance doesn't have two arguments"
   endif
   if @1 = #
      argshift 2 0
      lda #<?2
      sta ?1+0
      lda #>?2
      sta ?1+1
   else
      lda ?2+0
      sta ?1+0
      lda ?2+1
      sta ?1+1
   endif
endmacro

------------------------------------------------------------------------------
So, there is finally a powerful and convenient assembler universally available
for both the 64 and 128... for free. The source code for the assembler
(written in the assembler's own assembly format, of course) is also available
for free. There are a few more features that need to be implemented, but I
know exactly how to implement them.

Keep on Hackin'!

-Craig Bruce
csbruce@ccnga.uwaterloo.ca

"Give them applications and they will only want more; give them development
tools and they will give you applications, and more."
------------------------------------------------------------------------END---