For Beginners & Professional Hackers

Text File | 1992-04-21 | 49.6 KB | 964 lines

────────────────────────────────────────────────────────────────────────────
Chapter 3  Structure of MS-DOS Application Programs

Programs that run under MS-DOS come in two basic flavors: .COM programs, which have a maximum size of approximately 64 KB, and .EXE programs, which can be as large as available memory. In Intel 8086 parlance, .COM programs fit the tiny model, in which all segment registers contain the same value; that is, the code and data are mixed together. In contrast, .EXE programs fit the small, medium, or large model, in which the segment registers contain different values; that is, the code, data, and stack reside in separate segments. .EXE programs can have multiple code and data segments, which are respectively addressed by long calls and by manipulation of the data segment (DS) register.

A .COM-type program resides on the disk as an absolute memory image, in a file with the extension .COM. The file does not have a header or any other internal identifying information. A .EXE program, on the other hand, resides on the disk in a special type of file with a unique header, a relocation map, a checksum, and other information that is (or can be) used by MS-DOS.

Both .COM and .EXE programs are brought into memory for execution by the same mechanism: the EXEC function, which constitutes the MS-DOS loader. EXEC can be called with the filename of a program to be loaded by COMMAND.COM (the normal MS-DOS command interpreter), by other shells or user interfaces, or by another program that was previously loaded by EXEC. If there is sufficient free memory in the transient program area, EXEC allocates a block of memory to hold the new program, builds the program segment prefix (PSP) at its base, and then reads the program into memory immediately above the PSP. Finally, EXEC sets up the segment registers and the stack and transfers control to the program.

When it is invoked, EXEC can be given the addresses of additional information, such as a command tail, file control blocks, and an environment block; if supplied, this information will be passed on to the new program. (The exact procedure for using the EXEC function in your own programs is discussed, with examples, in Chapter 12.)

.COM and .EXE programs are often referred to as transient programs. A transient program "owns" the memory block it has been allocated and has nearly total control of the system's resources while it is executing. When the program terminates, either because it is aborted by the operating system or because it has completed its work and systematically performed a final exit back to MS-DOS, the memory block is then freed (hence the term transient) and can be used by the next program in line to be loaded.

The Program Segment Prefix

A thorough understanding of the program segment prefix is vital to successful programming under MS-DOS. It is a reserved area, 256 bytes long, that is set up by MS-DOS at the base of the memory block allocated to a transient program. The PSP contains some linkages to MS-DOS that can be used by the transient program, some information MS-DOS saves for its own purposes, and some information MS-DOS passes to the transient program, to be used or not, as the program requires (Figure 3-1).

Offset        Contents
──────────────────────────────────────────────────────────────────────────
0000H         Int 20H
0002H         Segment, end of allocation block
0004H         Reserved
0005H         Long call to MS-DOS function dispatcher
000AH         Previous contents of termination handler interrupt vector (Int 22H)
000EH         Previous contents of Ctrl-C interrupt vector (Int 23H)
0012H         Previous contents of critical-error handler interrupt vector (Int 24H)
0016H         Reserved
002CH         Segment address of environment block
002EH         Reserved
005CH         Default file control block #1
006CH         Default file control block #2 (overlaid if FCB #1 opened)
0080H-00FFH   Command tail and default disk transfer area (buffer)
──────────────────────────────────────────────────────────────────────────

Figure 3-1. The structure of the program segment prefix.

In the first versions of MS-DOS, the PSP was designed to be compatible with a control area that was built beneath transient programs under Digital Research's venerable CP/M operating system, so that programs could be ported to MS-DOS without extensive logical changes.
Although MS-DOS has evolved considerably since those early days, the structure of the PSP is still recognizably similar to its CP/M equivalent. For example, offset 0000H in the PSP contains a linkage to the MS-DOS process-termination handler, which cleans up after the program has finished its job and performs a final exit. Similarly, offset 0005H in the PSP contains a linkage to the MS-DOS function dispatcher, which performs disk operations, console input/output, and other such services at the request of the transient program. Thus, calls to PSP:0000 and PSP:0005 have the same effect as CALL 0000 and CALL 0005 under CP/M. (These linkages are not the "approved" means of obtaining these services, however.)

The word at offset 0002H in the PSP contains the segment address of the top of the transient program's allocated memory block. The program can use this value to determine whether it should request more memory to do its job or whether it has extra memory that it can release for use by other processes.

Offsets 000AH through 0015H in the PSP contain the previous contents of the interrupt vectors for the termination, Ctrl-C, and critical-error handlers. If the transient program alters these vectors for its own purposes, MS-DOS restores the original values saved in the PSP when the program terminates.

The word at PSP offset 002CH holds the segment address of the environment block, which contains a series of ASCIIZ strings (sequences of ASCII characters terminated by a null, or zero, byte). The environment block is inherited from the program that called the EXEC function to load the currently executing program. It contains such information as the current search path used by COMMAND.COM to find executable programs, the location on the disk of COMMAND.COM itself, and the format of the user prompt used by COMMAND.COM.

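The PSP fields described so far can be illustrated with a small decoder. The following is only a sketch in Python (not the MASM used for this chapter's listings), working on a 256-byte copy of a PSP rather than on live memory. The function names, the sample values, and the assumption that an extra zero byte closes the list of environment strings are all choices made for this illustration; only the field offsets come from Figure 3-1.

```python
import struct

def parse_psp(psp):
    """Decode a few PSP fields (offsets from Figure 3-1) from a 256-byte image."""
    if len(psp) != 256:
        raise ValueError("a PSP is exactly 256 bytes long")

    def word(offset):
        return struct.unpack_from("<H", psp, offset)[0]

    def far_pointer(offset):
        # An interrupt vector is stored as an offset word, then a segment word.
        off, seg = struct.unpack_from("<HH", psp, offset)
        return seg, off

    return {
        "exit_linkage": bytes(psp[0:2]),       # the Int 20H instruction
        "top_of_block": word(0x02),            # segment, end of allocation block
        "saved_int22": far_pointer(0x0A),      # termination handler
        "saved_int23": far_pointer(0x0E),      # Ctrl-C handler
        "saved_int24": far_pointer(0x12),      # critical-error handler
        "environment_segment": word(0x2C),
    }

def parse_environment(block):
    """Split an environment block into its ASCIIZ strings."""
    strings = []
    pos = 0
    # Assumption: an empty string (a lone zero byte) marks the end of the list.
    while pos < len(block) and block[pos] != 0:
        end = block.index(b"\x00", pos)
        strings.append(block[pos:end].decode("ascii"))
        pos = end + 1
    return strings

# Hypothetical sample data, invented purely for illustration.
sample = bytearray(256)
sample[0:2] = b"\xCD\x20"                                # Int 20H
struct.pack_into("<H", sample, 0x02, 0xA000)             # top of memory block
struct.pack_into("<HH", sample, 0x0A, 0x0123, 0x0F00)    # saved Int 22H vector
struct.pack_into("<H", sample, 0x2C, 0x0E50)             # environment segment

fields = parse_psp(sample)
env = parse_environment(
    b"PATH=C:\\DOS\x00COMSPEC=C:\\COMMAND.COM\x00PROMPT=$p$g\x00\x00"
)
print(fields)
print(env)
```

A real program would read these values directly through the segment register that points to the PSP; the byte-array approach here is used only so that the layout can be checked away from MS-DOS.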
The command tail (the remainder of the command line that invoked the transient program, after the program's name) is copied into the PSP starting at offset 0081H. The length of the command tail, not including the return character at its end, is placed in the byte at offset 0080H. Redirection or piping parameters and their associated filenames do not appear in the portion of the command line (the command tail) that is passed to the transient program, because redirection is transparent to applications.

To provide compatibility with CP/M, MS-DOS parses the first two parameters in the command tail into two default file control blocks (FCBs) at PSP:005CH and PSP:006CH, under the assumption that they may be filenames. However, if the parameters are filenames that include a path specification, only the drive code will be valid in these default FCBs, because FCB-type file- and record-access functions do not support hierarchical file structures. Although the default FCBs were an aid in earlier years, when compatibility with CP/M was more of a concern, they are essentially useless in modern MS-DOS application programs that must provide full path support. (File control blocks are discussed in detail in Chapter 8 and hierarchical file structures are discussed in Chapter 9.)

The 128-byte area from 0080H through 00FFH in the PSP also serves as the default disk transfer area (DTA), which is set by MS-DOS before passing control to the transient program. If the program does not explicitly change the DTA, any file read or write operations requested with the FCB group of function calls automatically use this area as a data buffer. This is rarely useful and is another facet of MS-DOS's handling of the PSP that is present only for compatibility with CP/M.

──────────────────────────────────────────────────────────────────────────
WARNING
   Programs must not alter any part of the PSP below offset 005CH.
──────────────────────────────────────────────────────────────────────────

Introduction to .COM Programs

Programs of the .COM persuasion are stored in disk files that hold an absolute image of the machine instructions to be executed. Because the files contain no relocation information, they are more compact, and are loaded for execution slightly faster, than equivalent .EXE files. Note that MS-DOS does not attempt to ascertain whether a .COM file actually contains executable code (there is no signature or checksum, as in the case of a .EXE file); it simply brings any file with the .COM extension into memory and jumps to it.

Because .COM programs are loaded immediately above the program segment prefix and do not have a header that can specify another entry point, they must always have an origin of 0100H, which is the length of the PSP. Location 0100H must contain an executable instruction. The maximum length of a .COM program is 65,536 bytes, minus the length of the PSP (256 bytes) and a mandatory word of stack (2 bytes).

When control is transferred to the .COM program from MS-DOS, all of the segment registers point to the PSP (Figure 3-2). The stack pointer register contains 0FFFEH if memory allows; otherwise, it is set as high as possible in memory minus 2 bytes. (MS-DOS pushes a zero word on the stack before entry.)

SS:SP     ┌──────────────────────────────────────────┐
          │ Stack grows downward from top of segment │
          │                                          │
          │                                          │
          │                                          │
          │ Program code and data                    │
CS:0100H  ├──────────────────────────────────────────┤
          │ Program segment prefix                   │
CS:0000H  └──────────────────────────────────────────┘
DS:0000H
ES:0000H
SS:0000H

Figure 3-2. A memory image of a typical .COM-type program after loading. The contents of the .COM file are brought into memory just above the program segment prefix. Program code and data are mixed together in the same segment, and all segment registers contain the same value.

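The size limit and entry conditions just described reduce to a little arithmetic. The sketch below, written in Python for convenience, models them under two assumptions that go beyond the text: that a full 64 KB is available (so SP starts at 0FFFEH), and that addresses are formed by the standard 8086 real-mode rule of segment times 16 plus offset. The function names and the PSP segment value 1000H are hypothetical.

```python
PSP_SIZE = 0x100         # 256-byte program segment prefix
SEGMENT_SIZE = 0x10000   # 65,536 bytes reachable through one segment register
STACK_WORD = 2           # the zero word MS-DOS pushes before entry

# Largest .COM image: 65,536 - 256 - 2 bytes.
MAX_COM_IMAGE = SEGMENT_SIZE - PSP_SIZE - STACK_WORD

def physical_address(segment, offset):
    """8086 real-mode address: segment * 16 + offset, wrapped to 20 bits."""
    return ((segment << 4) + offset) & 0xFFFFF

def com_entry_state(psp_segment, image_size):
    """Register values at entry to a .COM program (assumes 64 KB is free)."""
    if image_size > MAX_COM_IMAGE:
        raise ValueError("image is too large for a .COM program")
    return {
        "CS": psp_segment, "DS": psp_segment,
        "ES": psp_segment, "SS": psp_segment,
        "IP": PSP_SIZE,    # execution starts at offset 0100H
        "SP": 0xFFFE,      # points at the zero word on the stack
    }

state = com_entry_state(0x1000, 34)   # 0x1000 is a made-up PSP segment
entry = physical_address(state["CS"], state["IP"])
print(MAX_COM_IMAGE, hex(entry))
```

With these inputs the first instruction lies 256 bytes above the start of the PSP, which is exactly the relationship shown in Figure 3-2.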
Although the size of an executable .COM file can't exceed 64 KB, the current versions of MS-DOS allocate all of the transient program area to .COM programs when they are loaded. Because many such programs date from the early days of MS-DOS and are not necessarily "well-behaved" in their approach to memory management, the operating system simply makes the worst-case assumption and gives .COM programs everything that is available. If a .COM program wants to use the EXEC function to invoke another process, it must first shrink down its memory allocation to the minimum memory it needs in order to continue, taking care to protect its stack. (This is discussed in more detail in Chapter 12.)

When a .COM program finishes executing, it can return control to MS-DOS by several means. The preferred method is Int 21H Function 4CH, which allows the program to pass a return code back to the program, shell, or batch file that invoked it. However, if the program is running under MS-DOS version 1, it must exit by means of Int 20H, Int 21H Function 0, or a NEAR RETURN. (Because a word of zero was pushed onto the stack at entry, a NEAR RETURN causes a transfer to PSP:0000, which contains an Int 20H instruction.)

A .COM-type application can be linked together from many separate object modules. All of the modules must use the same code-segment name and class name, and the module with the entry point at offset 0100H within the segment must be linked first. In addition, all of the procedures within a .COM program should have the NEAR attribute, because all executable code resides in one segment.

When linking a .COM program, the linker will display the message

   Warning: no stack segment

This message can be ignored. The linker output is a .EXE file, which must be converted into a .COM file with the MS-DOS EXE2BIN utility before execution. You can then delete the .EXE file. (An example of this process is provided in Chapter 4.)

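Returning to the memory-shrinking step mentioned earlier in this section: the size a .COM program keeps is expressed in 16-byte paragraphs, measured from the base of the PSP. The following Python sketch shows only that rounding calculation. It assumes the program retains its PSP, its code and data, and a private stack of a size it chooses; the function name and the sample sizes are hypothetical, and the actual MS-DOS call (the resize-memory-block service, Int 21H Function 4AH, which is not named in the text above) is left to Chapter 12.

```python
PARAGRAPH = 16     # bytes per paragraph
PSP_SIZE = 0x100   # the PSP sits at the base of the block and must be kept

def paragraphs_to_keep(image_size, stack_size):
    """Paragraphs needed for the PSP, the program image, and its stack."""
    total_bytes = PSP_SIZE + image_size + stack_size
    # Round up so that a partly used paragraph is still retained.
    return (total_bytes + PARAGRAPH - 1) // PARAGRAPH

# Hypothetical example: a 34-byte image that reserves a 256-byte stack.
keep = paragraphs_to_keep(34, 256)
print(keep)
```

The stack pointer must be moved inside the retained area before the block is shrunk; otherwise the stack would be left in memory the program no longer owns.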
An Example .COM Program

The HELLO.COM program listed in Figure 3-3 demonstrates the structure of a simple assembly-language program that is destined to become a .COM file. (You may find it helpful to compare this listing with the HELLO.EXE program later in this chapter.) Because this program is so short and simple, a relatively high proportion of the source code is actually assembler directives that do not result in any executable code.

The NAME statement simply provides a module name for use during the linkage process. This aids understanding of the map that the linker produces. In MASM versions 5.0 and later, the module name is always the same as the filename, and the NAME statement is ignored.

The PAGE command, when used with two operands, as in line 2, defines the length and width of the page. These default respectively to 66 lines and 80 characters. If you use the PAGE command without any operands, a formfeed is sent to the printer and a heading is printed. In larger programs, use the PAGE command liberally to place each of your subroutines on separate pages for easy reading.

The TITLE command, in line 3, specifies the text string (limited to 60 characters) that is to be printed at the upper left corner of each page. The TITLE command is optional and cannot be used more than once in each assembly-language source file.

──────────────────────────────────────────────────────────────────────────
 1:         name    hello
 2:         page    55,132
 3:         title   HELLO.COM--print hello on terminal
 4:
 5: ;
 6: ; HELLO.COM:    demonstrates various components
 7: ;               of a functional .COM-type assembly-
 8: ;               language program, and an MS-DOS
 9: ;               function call.
10: ;
11: ; Ray Duncan, May 1988
12: ;
13:
14: stdin   equ     0               ; standard input handle
15: stdout  equ     1               ; standard output handle
16: stderr  equ     2               ; standard error handle
17:
18: cr      equ     0dh             ; ASCII carriage return
19: lf      equ     0ah             ; ASCII linefeed
20:
21:
22: _TEXT   segment word public 'CODE'
23:
24:         org     100h            ; .COM files always have
25:                                 ; an origin of 100h
26:
27:         assume  cs:_TEXT,ds:_TEXT,es:_TEXT,ss:_TEXT
28:
29: print   proc    near            ; entry point from MS-DOS
30:
31:         mov     ah,40h          ; function 40h = write
32:         mov     bx,stdout       ; handle for standard output
33:         mov     cx,msg_len      ; length of message
34:         mov     dx,offset msg   ; address of message
35:         int     21h             ; transfer to MS-DOS
36:
37:         mov     ax,4c00h        ; exit, return code = 0
38:         int     21h             ; transfer to MS-DOS
39:
40: print   endp
41:
42:
43: msg     db      cr,lf           ; message to display
44:         db      'Hello World!',cr,lf
45:
46: msg_len equ     $-msg           ; length of message
47:
48:
49: _TEXT   ends
50:
51:         end     print           ; defines entry point
──────────────────────────────────────────────────────────────────────────

Figure 3-3. The HELLO.COM program listing.

Dropping down past a few comments and EQU statements, we come to a declaration of a code segment that begins in line 22 with a SEGMENT command and ends in line 49 with an ENDS command. The label in the leftmost field of line 22 gives the code segment the name _TEXT. The operand fields at the right end of the line give the segment the attributes WORD, PUBLIC, and 'CODE'. (You might find it helpful to read the Microsoft Macro Assembler manual for detailed explanations of each possible segment attribute.)

Because this program is going to be converted into a .COM file, all of its executable code and data areas must lie within one code segment. The program must also have its origin at offset 0100H (immediately above the program segment prefix), which is taken care of by the ORG statement in line 24.

Following the ORG instruction, we encounter an ASSUME statement on line 27.
The concept of ASSUME often baffles new assembly-language programmers. In a way, ASSUME doesn't "do" anything; it simply tells the assembler which segment registers you are going to use to point to the various segments of your program, so that the assembler can provide segment overrides when they are necessary. It's important to notice that the ASSUME statement doesn't take care of loading the segment registers with the proper values; it merely notifies the assembler of your intent to do that within the program. (Remember that, in the case of a .COM program, MS-DOS initializes all the segment registers before entry to point to the PSP.)

Within the code segment, we come to another type of block declaration that begins with the PROC command on line 29 and closes with ENDP on line 40. These two instructions declare the beginning and end of a procedure, a block of executable code that performs a single distinct function. The label in the leftmost field of the PROC statement (in this case, print) gives the procedure a name. The operand field gives it an attribute. If the procedure carries the NEAR attribute, only other code in the same segment can call it, whereas if it carries the FAR attribute, code located anywhere in the CPU's memory-addressing space can call it. In .COM programs, all procedures carry the NEAR attribute.

For the purposes of this example program, I have kept the print procedure ridiculously simple. It calls MS-DOS Int 21H Function 40H to send the message Hello World! to the video screen, and calls Int 21H Function 4CH to terminate the program.

The END statement in line 51 tells the assembler that it has reached the end of the source file and also specifies the entry point for the program. If the entry point is not a label located at offset 0100H, the .EXE file resulting from the assembly and linkage of this source program cannot be converted into a .COM file.

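To see why the origin of 0100H matters, it can help to lay out the bytes of such a program by hand. The Python sketch below builds a plausible HELLO.COM image using the instruction encodings that are visible in the hex dump of HELLO.EXE (Figure 3-6). It is a hand-assembled illustration, not the output MASM would necessarily produce (the assembler may, for example, pad an instruction with a NOP, as the dump shows after the CX load), and the function name is hypothetical.

```python
ORIGIN = 0x100   # a .COM image is loaded at offset 0100H, just above the PSP

MESSAGE = b"\r\nHello World!\r\n"

def build_hello_com():
    """Return (image, message_offset) for a hand-assembled HELLO.COM."""
    def code(msg_offset):
        return (
            b"\xB4\x40"                                  # mov ah,40h
            + b"\xBB\x01\x00"                            # mov bx,1 (stdout)
            + b"\xB9" + len(MESSAGE).to_bytes(2, "little")   # mov cx,msg_len
            + b"\xBA" + msg_offset.to_bytes(2, "little")     # mov dx,offset msg
            + b"\xCD\x21"                                # int 21h
            + b"\xB8\x00\x4C"                            # mov ax,4c00h
            + b"\xCD\x21"                                # int 21h
        )

    # The message follows the code, so its offset within the segment is the
    # origin plus the code length, not its position within the file.
    msg_offset = ORIGIN + len(code(0))
    return code(msg_offset) + MESSAGE, msg_offset

image, msg_offset = build_hello_com()
print(len(image), hex(msg_offset))
```

The value loaded into DX is 0100H larger than the message's position in the file, because DS points to the PSP rather than to the first byte of the program. That adjustment is what the ORG statement accomplishes in the assembly source.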
Introduction to .EXE Programs

We have just discussed a program that was written in such a way that it could be assembled into a .COM file. Such a program is simple in structure, so a programmer who needs to put together this kind of quick utility can concentrate on the program logic and do a minimum amount of worrying about control of the assembler. However, .COM-type programs have some definite disadvantages, and so most serious assembly-language efforts for MS-DOS are written to be converted into .EXE files.

Although .COM programs are effectively restricted to a total size of 64 KB for machine code, data, and stack combined, .EXE programs can be practically unlimited in size (up to the limit of the computer's available memory). .EXE programs also place the code, data, and stack in separate parts of the file. Although the normal MS-DOS program loader does not take advantage of this feature of .EXE files, the ability to load different parts of large programs into several separate memory fragments, as well as the opportunity to designate a "pure" code portion of your program that can be shared by several tasks, is very significant in multitasking environments such as Microsoft Windows.

The MS-DOS loader always brings a .EXE program into memory immediately above the program segment prefix, although the order of the code, data, and stack segments may vary (Figure 3-4). The .EXE file has a header, or block of control information, with a characteristic format (Figures 3-5 and 3-6). The size of this header varies according to the number of program instructions that need to be relocated at load time, but it is always a multiple of 512 bytes.

Before MS-DOS transfers control to the program, the initial values of the code segment (CS) register and instruction pointer (IP) register are calculated from the entry-point information in the .EXE file header and the program's load address.
This information derives from an END statement in the source code for one of the program's modules. The data segment (DS) and extra segment (ES) registers are made to point to the PSP so that the program can access the environment-block pointer, command tail, and other useful information contained there.

SS:SP     ┌──────────────────────────────────────────┐
          │ Stack segment:                           │
          │ stack grows downward from top of segment │
          │                                          │
          │                                          │
SS:0000H  ├──────────────────────────────────────────┤
          │ Data segment                             │
          ├──────────────────────────────────────────┤
          │ Program code                             │
CS:0000H  ├──────────────────────────────────────────┤
          │ Program segment prefix                   │
DS:0000H  └──────────────────────────────────────────┘
ES:0000H

Figure 3-4. A memory image of a typical .EXE-type program immediately after loading. The contents of the .EXE file are relocated and brought into memory above the program segment prefix. Code, data, and stack reside in separate segments and need not be in the order shown here. The entry point can be anywhere in the code segment and is specified by the END statement in the main module of the program. When the program receives control, the DS (data segment) and ES (extra segment) registers point to the program segment prefix; the program usually saves this value and then resets the DS and ES registers to point to its data area.

The initial contents of the stack segment (SS) and stack pointer (SP) registers come from the header. This information derives from the declaration of a segment with the attribute STACK somewhere in the program's source code. The memory space allocated for the stack may be initialized or uninitialized, depending on the stack-segment definition; many programmers like to initialize the stack memory with a recognizable data pattern so that they can inspect memory dumps and determine how much stack space is actually used by the program.

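The way the loader derives the entry registers from the header can be sketched in a few lines. The Python fragment below decodes the first bytes of HELLO.EXE, copied from the hex dump in Figure 3-6, using the field offsets listed in Figure 3-5. It then adds the segment displacements to the load address, taken here to be the paragraph immediately above the PSP. The function names, the PSP segment value 1000H, and the reading of each relocation item as an offset word followed by a segment word are assumptions made for this illustration.

```python
import struct

# First 48 bytes of HELLO.EXE, copied from the hex dump in Figure 3-6.
HELLO_HEADER = bytes.fromhex(
    "4D 5A 28 00 02 00 01 00 20 00 09 00 FF FF 03 00 "
    "80 00 20 05 00 00 00 00 1E 00 00 00 01 00 01 00 "
    "00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00"
)

# Names for the thirteen words at offsets 0002H through 001AH (Figure 3-5).
FIELDS = ("last_page_bytes", "pages", "reloc_count", "header_paragraphs",
          "min_alloc", "max_alloc", "ss", "sp", "checksum", "ip", "cs",
          "reloc_offset", "overlay")

def parse_exe_header(data):
    """Decode the fixed part of a .EXE header and its relocation items."""
    signature, *words = struct.unpack_from("<2s13H", data, 0)
    if signature != b"MZ":
        raise ValueError("missing .EXE signature (4DH 5AH)")
    header = dict(zip(FIELDS, words))
    # Assumption: each relocation item is an (offset, segment) pair of words.
    header["relocations"] = [
        struct.unpack_from("<HH", data, header["reloc_offset"] + 4 * i)
        for i in range(header["reloc_count"])
    ]
    return header

def entry_registers(header, psp_segment):
    """Initial register values for a program loaded just above its PSP."""
    start = psp_segment + 0x10          # the PSP occupies 16 paragraphs
    return {
        "CS": (start + header["cs"]) & 0xFFFF,
        "IP": header["ip"],
        "SS": (start + header["ss"]) & 0xFFFF,
        "SP": header["sp"],
        "DS": psp_segment,              # DS and ES point to the PSP
        "ES": psp_segment,
    }

header = parse_exe_header(HELLO_HEADER)
regs = entry_registers(header, 0x1000)  # 0x1000 is a made-up PSP segment
print(header)
print({name: hex(value) for name, value in regs.items()})
```

For this file the header occupies 32 paragraphs (512 bytes), the stack segment lies three paragraphs above the start of the program, and SP starts at 0080H, which matches the 128-byte stack declared in the HELLO.EXE source.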
When a .EXE program finishes processing, it should return control to MS-DOS through Int 21H Function 4CH. Other methods are available, but they offer no advantages and are considerably less convenient (because they usually require the CS register to point to the PSP).

Byte offset   Contents
──────────────────────────────────────────────────────────────────────────
0000H         First part of .EXE file signature (4DH)
0001H         Second part of .EXE file signature (5AH)
0002H         Length of file MOD 512
0004H         Size of file in 512-byte pages, including header
0006H         Number of relocation-table items
0008H         Size of header in paragraphs (16-byte units)
000AH         Minimum number of paragraphs needed above program
000CH         Maximum number of paragraphs desired above program
000EH         Segment displacement of stack module
0010H         Contents of SP register at entry
0012H         Word checksum
0014H         Contents of IP register at entry
0016H         Segment displacement of code module
0018H         Offset of first relocation item in file
001AH         Overlay number (0 for resident part of program)
001CH         Variable reserved space
              Relocation table
              Variable reserved space
              Program and data segments
              Stack segment
──────────────────────────────────────────────────────────────────────────

Figure 3-5. The format of a .EXE load module.

The input to the linker for a .EXE-type program can be many separate object modules. Each module can use a unique code-segment name, and the procedures can carry either the NEAR or the FAR attribute, depending on naming conventions and the size of the executable code. The programmer must take care that the modules linked together contain only one segment with the STACK attribute and only one entry point defined with an END assembler directive. The output from the linker is a file with a .EXE extension. This file can be executed immediately.

──────────────────────────────────────────────────────────────────────────
C>DUMP HELLO.EXE

       0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
0000  4D 5A 28 00 02 00 01 00 20 00 09 00 FF FF 03 00  MZ(..... .......
0010  80 00 20 05 00 00 00 00 1E 00 00 00 01 00 01 00  .. .............
0020  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0030  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0040  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0050  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  .
  .
  .
0200  B8 01 00 8E D8 B4 40 BB 01 00 B9 10 00 90 BA 08  ......@.........
0210  00 CD 21 B8 00 4C CD 21 0D 0A 48 65 6C 6C 6F 20  ..!..L.!..Hello
0220  57 6F 72 6C 64 21 0D 0A                          World!..
──────────────────────────────────────────────────────────────────────────

Figure 3-6. A hex dump of the HELLO.EXE program, demonstrating the contents of a simple .EXE load module.
Note the following interesting values: the .EXE signature in bytes 0000H and 0001H, the number of relocation-table items in bytes 0006H and 0007H, the minimum extra memory allocation (MIN_ALLOC) in bytes 000AH and 000BH, the maximum extra memory allocation (MAX_ALLOC) in bytes 000CH and 000DH, and the initial IP (instruction pointer) register value in bytes 0014H and 0015H. See also Figure 3-5.

An Example .EXE Program

The HELLO.EXE program in Figure 3-7 demonstrates the fundamental structure of an assembly-language program that is destined to become a .EXE file. At minimum, it should have a module name, a code segment, a stack segment, and a primary procedure that receives control of the computer from MS-DOS after the program is loaded. The HELLO.EXE program also contains a data segment to provide a more complete example.

The NAME, TITLE, and PAGE directives were covered in the HELLO.COM example program and are used in the same manner here, so we'll move to the first new item of interest.

After a few comments and EQU statements, we come to a declaration of a code segment that begins on line 21 with a SEGMENT command and ends on line 41 with an ENDS command. As in the HELLO.COM example program, the label in the leftmost field of the line gives the code segment the name _TEXT. The operand fields at the right end of the line give the attributes WORD, PUBLIC, and 'CODE'.

Following the code-segment instruction, we find an ASSUME statement on line 23. Notice that, unlike the equivalent statement in the HELLO.COM program, the ASSUME statement in this program specifies several different segment names. Again, remember that this statement has no direct effect on the contents of the segment registers but affects only the operation of the assembler itself.

──────────────────────────────────────────────────────────────────────────
 1:         name    hello
 2:         page    55,132
 3:         title   HELLO.EXE--print Hello on terminal
 4: ;
 5: ; HELLO.EXE:    demonstrates various components
 6: ;               of a functional .EXE-type assembly-
 7: ;               language program, use of segments,
 8: ;               and an MS-DOS function call.
 9: ;
10: ; Ray Duncan, May 1988
11: ;
12:
13: stdin   equ     0               ; standard input handle
14: stdout  equ     1               ; standard output handle
15: stderr  equ     2               ; standard error handle
16:
17: cr      equ     0dh             ; ASCII carriage return
18: lf      equ     0ah             ; ASCII linefeed
19:
20:
21: _TEXT   segment word public 'CODE'
22:
23:         assume  cs:_TEXT,ds:_DATA,ss:STACK
24:
25: print   proc    far             ; entry point from MS-DOS
26:
27:         mov     ax,_DATA        ; make our data segment
28:         mov     ds,ax           ; addressable...
29:
30:         mov     ah,40h          ; function 40h = write
31:         mov     bx,stdout       ; standard output handle
32:         mov     cx,msg_len      ; length of message
33:         mov     dx,offset msg   ; address of message
34:         int     21h             ; transfer to MS-DOS
35:
36:         mov     ax,4c00h        ; exit, return code = 0
37:         int     21h             ; transfer to MS-DOS
38:
39: print   endp
40:
41: _TEXT   ends
42:
43:
44: _DATA   segment word public 'DATA'
45:
46: msg     db      cr,lf           ; message to display
47:         db      'Hello World!',cr,lf
48:
49: msg_len equ     $-msg           ; length of message
50:
51: _DATA   ends
52:
53:
54: STACK   segment para stack 'STACK'
55:
56:         db      128 dup (?)
57:
58: STACK   ends
59:
60:         end     print           ; defines entry point
──────────────────────────────────────────────────────────────────────────

Figure 3-7. The HELLO.EXE program listing.

Within the code segment, the main print procedure is declared by the PROC command on line 25 and closed with ENDP on line 39. Because the procedure resides in a .EXE file, we have given it the FAR attribute as an example, but the attribute is really irrelevant because the program is so small and the procedure is not called by anything else in the same program.

The print procedure first initializes the DS register, as indicated in the earlier ASSUME statement, loading it with a value that causes it to point to the base of the data area. (MS-DOS automatically sets up the CS and SS registers.) Next, the procedure uses MS-DOS Int 21H Function 40H to display the message Hello World! on the screen, just as in the HELLO.COM program. Finally, the procedure exits back to MS-DOS with an Int 21H Function 4CH on lines 36 and 37, passing a return code of zero (which by convention means a success).

Lines 44 through 51 declare a data segment named _DATA, which contains the variables and constants the program will use. If the various modules of a program contain multiple data segments with the same name, the linker will collect them and place them in the same physical memory segment.

Lines 54 through 58 establish a stack segment; PUSH and POP instructions will access this area of scratch memory. Before MS-DOS transfers control to a .EXE program, it sets up the SS and SP registers according to the declared size and location of the stack segment. Be sure to allow enough room for the maximum stack depth that can occur at runtime, plus a safe number of extra words for registers pushed onto the stack during an MS-DOS service call. If the stack overflows, it may damage your other code and data segments and cause your program to behave strangely or even to crash altogether!

The END statement on line 60 winds up our brief HELLO.EXE program, telling the assembler that it has reached the end of the source file and providing the label of the program's point of entry from MS-DOS.

The differences between .COM and .EXE programs are summarized in Figure 3-8.

                    .COM program                   .EXE program
──────────────────────────────────────────────────────────────────────────
Maximum size        65,536 bytes minus 256         No limit
                    bytes for PSP and 2 bytes
                    for stack
Entry point         PSP:0100H                      Defined by END statement
AL at entry         00H if default FCB #1 has      Same
                    valid drive, 0FFH if
                    invalid drive
AH at entry         00H if default FCB #2 has      Same
                    valid drive, 0FFH if
                    invalid drive
CS at entry         PSP                            Segment containing module
                                                   with entry point
IP at entry         0100H                          Offset of entry point
                                                   within its segment
DS at entry         PSP                            PSP
ES at entry         PSP                            PSP
SS at entry         PSP                            Segment with STACK attribute
SP at entry         0FFFEH or top word in          Size of segment defined
                    available memory,              with STACK attribute
                    whichever is lower
Stack at entry      Zero word                      Initialized or uninitialized
Stack size          65,536 bytes minus 256         Defined in segment with
                    bytes for PSP and size of      STACK attribute
                    executable code and data
Subroutine calls    Usually NEAR                   NEAR or FAR
Exit method         Int 21H Function 4CH           Int 21H Function 4CH
                    preferred, NEAR RET if         preferred
                    MS-DOS version 1
Size of file        Exact size of program          Size of program plus
                                                   header (multiple of 512
                                                   bytes)
──────────────────────────────────────────────────────────────────────────

Figure 3-8. Summary of the differences between .COM and .EXE programs, including their entry conditions.

More About Assembly-Language Programs

Now that we've looked at working examples of .COM and .EXE assembly-language programs, let's backtrack and discuss their elements a little more formally. The following discussion is based on the Microsoft Macro Assembler, hereafter referred to as MASM. If you are familiar with MASM and are an experienced assembly-language programmer, you may want to skip this section.

MASM programs can be thought of as having three structural levels:

   ■ The module level
   ■ The segment level
   ■ The procedure level

Modules are simply chunks of source code that can be independently maintained and assembled.
Segments are physical groupings of like items (machine code or data) within a program and a corresponding segregation of dissimilar items. Procedures are functional subdivisions of an executable program: routines that carry out a particular task.

Program Modules

Under MS-DOS, the module-level structure consists of files containing the source code for individual routines. Each source file is translated by the assembler into a relocatable object module. An object module can reside alone in an individual file or with many other object modules in an object-module library of frequently used or related routines. The Microsoft Object Linker (LINK) combines object-module files, often with additional object modules extracted from libraries, into an executable program file.

Using modules and object-module libraries reduces the size of your application source files (and vastly increases your productivity), because these files need not contain the source code for routines they have in common with other programs. This technique also allows you to maintain the routines more easily, because you need to alter only one copy of their source code stored in one place, instead of many copies stored in different applications. When you improve (or fix) one of these routines, you can simply reassemble it, put its object module back into the library, relink all of the programs that use the routine, and voilà: instant upgrade.

Program Segments

The term segments refers to two discrete programming concepts: physical segments and logical segments.

Physical segments are 64 KB blocks of memory. The Intel 8086/8088 and 80286 microprocessors have four segment registers, which are essentially used as pointers to these blocks. (The 80386 has six segment registers, which are a superset of those found on the 8086/8088 and 80286.) Each segment register can point to the bottom of a different 64 KB area of memory.
Thus, a program can address any location in memory by appropriate manipulation of the segment registers, but the maximum amount of memory that it can address simultaneously is 256 KB.

As we discussed earlier in the chapter, .COM programs assume that all four segment registers always point to the same place: the bottom of the program. Thus, they are limited to a maximum size of 64 KB. .EXE programs, on the other hand, can address many different physical segments and can reset the segment registers to point to each segment as it is needed. Consequently, the only practical limit on the size of a .EXE program is the amount of available memory. The example programs throughout the remainder of this book focus on .EXE programs.

Logical segments are the program components. A minimum of three logical segments must be declared in any .EXE program: a code segment, a data segment, and a stack segment. Programs with more than 64 KB of code or data have more than one code or data segment. The routines or data that are used most frequently are put into the primary code and data segments for speed, and routines or data that are used less frequently are put into secondary code and data segments.

Segments are declared with the SEGMENT and ENDS directives in the following form:

    name    SEGMENT attributes
            .
            .
            .
    name    ENDS

The attributes of a segment include its align type (BYTE, WORD, or PARA), combine type (PUBLIC, PRIVATE, COMMON, or STACK), and class type. The segment attributes are used by the linker when it is combining logical segments to create the physical segments of an executable program. Most of the time, you can get by just fine using a small selection of attributes in a rather stereotypical way. However, if you want to use the full range of attributes, you might want to read the detailed explanation in the MASM manual.

Programs are classified into one memory model or another based on the number of their code and data segments.
The most commonly used memory model for assembly-language programs is the small model, which has one code and one data segment, but you can also use the medium, compact, and large models (Figure 3-9). (Two additional models exist with which we will not be concerning ourselves further: the tiny model, which consists of intermixed code and data in a single segment, as in a .COM file under MS-DOS; and the huge model, which is supported by the Microsoft C Optimizing Compiler and which allows use of data structures larger than 64 KB.)

Model          Code segments        Data segments
──────────────────────────────────────────────────────────────────────────
Small          One                  One
Medium         Multiple             One
Compact        One                  Multiple
Large          Multiple             Multiple
──────────────────────────────────────────────────────────────────────────
Figure 3-9. Memory models commonly used in assembly-language and C programs.

For each memory model, Microsoft has established certain segment and class names that are used by all its high-level-language compilers (Figure 3-10). Because segment names are arbitrary, you may as well adopt the Microsoft conventions. Their use will make it easier for you to integrate your assembly-language routines into programs written in languages such as C, or to use routines from high-level-language libraries in your assembly-language programs.

Another important Microsoft high-level-language convention is to use the GROUP directive to name the near data segment (the segment the program expects to address with offsets from the DS register) and the stack segment as members of DGROUP (the automatic data group), a special name recognized by the linker and also by the program loaders in Microsoft Windows and Microsoft OS/2. The GROUP directive causes logical segments with different names to be combined into a single physical segment so that they can be addressed using the same segment base address.
In C programs, DGROUP also contains the local heap, which is used by the C runtime library for dynamic allocation of small amounts of memory.

Memory     Segment        Align    Combine    Class       Group
model      name           type     type       type
──────────────────────────────────────────────────────────────────────────
Small      _TEXT          WORD     PUBLIC     CODE
           _DATA          WORD     PUBLIC     DATA        DGROUP
           STACK          PARA     STACK      STACK       DGROUP

Medium     module_TEXT    WORD     PUBLIC     CODE
           .
           .
           .
           _DATA          WORD     PUBLIC     DATA        DGROUP
           STACK          PARA     STACK      STACK       DGROUP

Compact    _TEXT          WORD     PUBLIC     CODE
           data           PARA     PRIVATE    FAR_DATA
           .
           .
           .
           _DATA          WORD     PUBLIC     DATA        DGROUP
           STACK          PARA     STACK      STACK       DGROUP

Large      module_TEXT    WORD     PUBLIC     CODE
           .
           .
           .
           data           PARA     PRIVATE    FAR_DATA
           .
           .
           .
           _DATA          WORD     PUBLIC     DATA        DGROUP
           STACK          PARA     STACK      STACK       DGROUP
──────────────────────────────────────────────────────────────────────────
Figure 3-10. Segments, groups, and classes for the standard memory models as used with assembly-language programs. The Microsoft C Optimizing Compiler and other high-level-language compilers use a superset of these segments and classes.

For pure assembly-language programs that will run under MS-DOS, you can ignore DGROUP. However, if you plan to integrate assembly-language routines and programs written in high-level languages, you'll want to follow the Microsoft DGROUP convention. For example, if you are planning to link routines from a C library into an assembly-language program, you should include the line

    DGROUP  group   _DATA,STACK

near the beginning of the program.

The final Microsoft convention of interest in creating .EXE programs is segment order. The high-level compilers assume that code segments always come first, followed by far data segments, followed by the near data segment, with the stack and heap last. This order won't concern you much until you begin integrating assembly-language code with routines from high-level-language libraries, but it is easiest to learn to use the convention right from the start.
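
The segment names, attributes, DGROUP convention, and segment order described above can be pulled together in a minimal small-model skeleton. This is only a sketch under stated assumptions: the names main and retcode and the 64-word stack size are hypothetical choices made for illustration, and the program does nothing except return a code to MS-DOS.

```asm
DGROUP  group   _DATA,STACK             ; near data and stack form DGROUP

_TEXT   segment word public 'CODE'      ; code segment comes first

        assume  cs:_TEXT,ds:DGROUP

main    proc    far                     ; entry point, named on END below

        mov     ax,DGROUP               ; at entry DS and ES point to the
        mov     ds,ax                   ; PSP, so load DS explicitly

        mov     al,retcode              ; AL = return code (hypothetical)
        mov     ah,4ch                  ; Int 21H Function 4CH: terminate
        int     21h

main    endp

_TEXT   ends

_DATA   segment word public 'DATA'      ; near data segment follows code

retcode db      0                       ; hypothetical variable, zero = success

_DATA   ends

STACK   segment para stack 'STACK'      ; stack segment comes last

        dw      64 dup (?)              ; 64 words, an arbitrary size

STACK   ends

        end     main                    ; names the program's entry point
```

The sketch deliberately avoids the OFFSET operator; when a variable belongs to a group, its offset is normally wanted relative to the group rather than to the individual segment, and that detail is outside the scope of this skeleton.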

Program Procedures

The procedure level of program structure is partly real and partly conceptual. Procedures are basically just a fancy guise for subroutines.

Procedures within a program are declared with the PROC and ENDP directives in the following form:

    name    PROC    attribute
            .
            .
            .
            RET
    name    ENDP

The attribute carried by a PROC declaration, which is either NEAR or FAR, tells the assembler what type of call you expect to use to enter the procedure; that is, whether the procedure will be called from other routines in the same segment or from routines in other segments. When the assembler encounters a RET instruction within the procedure, it uses the attribute information to generate the correct opcode for either a near (intra-segment) or far (inter-segment) return.

Each program should have a main procedure that receives control from MS-DOS. You specify the entry point for the program by including the name of the main procedure in the END statement in one of the program's source files. The main procedure's attribute (NEAR or FAR) is really not too important, because the program returns control to MS-DOS with a function call rather than a RET instruction. However, by convention, most programmers assign the main procedure the FAR attribute anyway.

You should break the remainder of the program into procedures in an orderly way, with each procedure performing a well-defined single function, returning its results to its caller, and avoiding actions that have global effects within the program. Ideally, procedures invoke each other only by CALL instructions, have only one entry point and one exit point, and always exit by means of a RET instruction, never by jumping to some other location within the program.

For ease of understanding and maintenance, a procedure should not exceed one page (about 60 lines); if it is longer than a page, it is probably too complex and you should delegate some of its function to one or more subsidiary procedures.
You should preface the source code for each procedure with a detailed comment that states the procedure's calling sequence, results returned, registers affected, and any data items accessed or modified. The effort invested in making your procedures compact, clean, flexible, and well-documented will be repaid many times over when you reuse the procedures in other programs.
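
As an illustration of these guidelines, the following sketch shows a FAR main procedure calling a small NEAR procedure that carries the kind of header comment just described. The routine strlen, the variable msg, and the choice to return the string length as the exit code are all hypothetical, chosen only to keep the example short.

```asm
_TEXT   segment word public 'CODE'

        assume  cs:_TEXT,ds:_DATA

main    proc    far                     ; receives control from MS-DOS

        mov     ax,_DATA                ; make our data segment
        mov     ds,ax                   ; addressable

        mov     si,offset msg           ; DS:SI = address of string
        call    strlen                  ; CX = length of string

        mov     al,cl                   ; return length as exit code
        mov     ah,4ch                  ; Int 21H Function 4CH: terminate
        int     21h

main    endp

;
; STRLEN:       return length of ASCIIZ (zero-terminated) string
;
; Call with:    DS:SI = address of string
;
; Returns:      CX    = length in bytes, not counting
;                       the terminating zero byte
;
; Registers:    CX and flags changed, all others preserved
;
; Data:         reads the caller's string, modifies nothing
;
strlen  proc    near                    ; called only from within _TEXT

        push    ax                      ; save scratch registers
        push    si

        xor     cx,cx                   ; CX = running count
        cld                             ; make LODSB scan upward

strlen1:
        lodsb                           ; AL = next byte, SI = SI+1
        or      al,al                   ; terminating zero reached?
        jz      strlen2                 ; yes, finished
        inc     cx                      ; no, count this byte
        jmp     strlen1                 ; and keep scanning

strlen2:
        pop     si                      ; restore registers
        pop     ax
        ret                             ; single exit, near return

strlen  endp

_TEXT   ends

_DATA   segment word public 'DATA'

msg     db      'Hello',0               ; hypothetical test string

_DATA   ends

STACK   segment para stack 'STACK'

        dw      64 dup (?)              ; 64 words, an arbitrary size

STACK   ends

        end     main                    ; names the entry point
```

Because strlen is declared NEAR, the assembler generates a near return for its RET instruction; if the routine were moved into a different code segment from its callers, both the PROC attribute and the calls would have to become FAR.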