Greythorne's Tutorial: Enter The Assembler

Assembly and Cracking From the Ground Up

An Introduction by Greythorne the Technomancer

Assembly Beginnings

Now that you have read the program development intro, you probably are starting to wonder
whether any coding comes next. Yep. Our first lesson or so will be on building a better program shell,
one that handles several types of input and fits into the IPO model.

After all, what good is a program when it can't be interactive?

Okay, a program patcher.. but wait, it inputs from a file doesn't it?
Thought you had me didn't you.
Quit that.
Stop interrupting.
Now let's go on.

If you are already familiar with how DOS calls are made, skip this part.
This is here to ensure that no one is left in the dark - this is a tutorial for those who want to learn - even if it's from scratch.

Since the first thing that everyone wants to learn how to do is make a program display some silly hello world message, we will do that.

First I need to explain registers (the hyper-fast variables built into the processor of your x86-based PC compatible, be it a Pentium, a 486 or the like).

In older high level programming languages, BASIC being the easiest one that everyone has heard of (not Visual Basic, that is a monstrosity put together by our friends at microsquash, not QBasic, not BASICA, but plain old BASIC that every machine emulated to some strange degree through the years, leaving everyone who wanted to learn a programming language stuck with something that wasn't very useful), there is some way of loading a string and displaying it to the user.

In the BASIC example the command was something like this:
(not entirely exact because I couldn't care less about teaching BASIC)

DATA "Hello Planet Hollywood"
READ D$ : REM (from data)
PRINT D$

or more simply:

LET D$ = "Hello Planet Hollywood"
PRINT D$

and the computer would spit out something - preferably our message if we typed it right - to the screen.

Assembly is not so different, but we are stuck with the first model.

In assembly we say something of the sort:

MYSTRING DB 'Hello Planet Hollywood','$'
MOV DX, OFFSET MYSTRING

and then we print it:

We use DOS's print-a-string command by issuing this statement:

MOV AH, 09h
and we tell DOS to do it using this statement:
INT 21h

concisely, here it is:

MYSTRING DB 'Hello Planet Hollywood','$'
MOV DX, OFFSET MYSTRING
MOV AH, 09h
INT 21h

That wasn't so bad was it?

Notice the '$' at the end of the string.
DOS needs a marker to point out where the string ends. That is all the '$' is for: to let function 09h know where to stop. If we don't have it, DOS will keep printing whatever bytes come after your string as if they were text. Usually this is just random junk in memory or your program itself. Not very useful to forget the '$', in other words.

There is another simple way of handling strings as well, called string-zero (ASCIIZ).
Simply this means that the strings are terminated with a zero byte instead of a '$'.
No biggie, just another way.

In all honesty, with assembly you can make a print string routine that ends with anything you want, though it is rarely worth the trouble: DOS already prints '$' terminated strings with function 09h, expects zero terminated strings in calls such as the file functions, and can write a fixed number of characters with function 40h. You may later on want to develop your own for the purposes of encryption, since the 0 and the $ markers tend to stand out in a hex dump. Even that is not always necessary if you encrypt the terminator character along with the string, or use a call that only prints a certain number of characters, with no terminator required.
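
To make the roll-your-own idea concrete, here is a minimal sketch of a zero terminated print routine. It assumes TASM/MASM syntax and a .COM program; the names PRINTZ, MSGZ, NEXTCH and DONE are made up for this example and do not come from the skeleton files. It prints one character at a time with DOS function 02h (character in DL).

```asm
.MODEL TINY
.CODE
        ORG  100h                ; .COM programs start at offset 100h
START:  JMP  MAIN                ; skip over the data

MSGZ    DB 'Hello Planet Hollywood', 0Dh, 0Ah, 0   ; ends with a zero byte

MAIN:   MOV  SI, OFFSET MSGZ     ; DS:SI -> string to print
        CALL PRINTZ
        MOV  AX, 4C00h           ; function 4Ch: exit, return code 0
        INT  21h

; PRINTZ (hypothetical name): print the zero terminated string at DS:SI
PRINTZ  PROC NEAR
        CLD                      ; make LODSB walk upward through memory
NEXTCH: LODSB                    ; AL = byte at DS:SI, then SI = SI + 1
        OR   AL, AL              ; is it the zero terminator?
        JZ   DONE                ; yes: stop
        MOV  DL, AL              ; function 02h wants the character in DL
        MOV  AH, 02h
        INT  21h
        JMP  NEXTCH
DONE:   RET
PRINTZ  ENDP
        END  START
```

Swapping the OR AL, AL test for a compare against some other byte is all it takes to use a different terminator.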

Back to our example:

I didn't mention that the data needs to be in a separate place, or the program will try to execute the data as machine code commands. That is okay: if you download the example .COM and .EXE skeleton files I have put online, you will notice that there is a place for data and a place for your actual executing code. The program jumps past the data when it starts, so it never mistakes the data for assembly instructions.
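
For those who want to see the shape of it right away, here is a bare-bones sketch of such a layout for a .COM file. This is not the skeleton file itself, just an illustration in TASM/MASM syntax; the labels START and MAIN are my own choice, and the carriage return/line feed after the text is an addition so the prompt starts on a fresh line.

```asm
.MODEL TINY
.CODE
        ORG  100h                 ; .COM programs are loaded at offset 100h
START:  JMP  MAIN                 ; hop over the data so it never gets executed

; ---- data area ----
MYSTRING DB 'Hello Planet Hollywood', 0Dh, 0Ah, '$'

; ---- code area ----
MAIN:   MOV  DX, OFFSET MYSTRING  ; DS:DX -> the '$' terminated string
        MOV  AH, 09h              ; function 09h: print string
        INT  21h
        MOV  AX, 4C00h            ; function 4Ch: exit with return code 0
        INT  21h
        END  START
```

With TASM this would typically be built as a .COM by assembling and then linking with the tiny-model option of the linker; check your own assembler's documentation for the exact switches.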

That also isn't much different than BASIC, people tended to have to put the DATA at the end of the program after it was complete anyway, so nothing has changed. Considering that all decent programming languages have to be written in assembly anyway, it really shouldn't surprise you that they use the same type of string data format.

HOW THE REGISTERS WORK

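Above, I showed you that you could use DX much like a string variable. Strictly speaking DX does not hold the string itself, it holds the offset (the address) where the string starts.
For convenience's sake I chose D$ from BASIC to compare easily.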

In BASIC you have any variables you want...
In assembler you only have a few registers to choose from. That is okay though, you only need a few.
Assembler also lets you store things virtually anywhere you like in memory, not just in variables as in BASIC.

Here is the rundown of your general purpose registers:

AX - The Accumulator (usually where answers to mathematics go)
BX - The Base register (usually holds the location of some structure in memory)
CX - The Counter (use this to count anything, including length of strings)
DX - The Data Register (usually points at things such as strings or memory areas)

The above general purpose registers are exactly that, general purpose. You can interchange them in your own code, but DOS INT 21h calls and such expect certain ones to be used exactly as written. The "Hello" example above is a good illustration: function 09h wants its function number in AH and the offset of the string in DX.

The ACCUMULATOR (AX) is the most high profile one though. It has many standard uses. It tends to 'accumulate' everything. When you exit a program, or a subroutine of some kind, the results tend to wind up there; error codes and results of arithmetic are the most usual. When calling a subroutine or another program it is also used to hold the code for a command, such as the AH=09h example from above.

Each of these is a 16 bit register, which means it holds 16 bits (2 bytes) of information, and the two bytes can be accessed directly in each of them (called low and high).

AX is therefore made up of AH and AL
BX is made of BH and BL
and so on.

(The above are 16 bit examples. On a 386 or later the 32-bit versions are simply named EAX, EBX, ECX, and EDX, and AX is the low half of EAX.)
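
A tiny sketch of how the halves and the whole register relate, written as a do-nothing .COM program in TASM/MASM syntax. The values are arbitrary; step through it in a debugger and watch AX and BX change.

```asm
.MODEL TINY
.CODE
        ORG  100h
START:  MOV  AX, 1234h           ; AH = 12h, AL = 34h
        MOV  AL, 0FFh            ; only the low byte changes: AX = 12FFh
        MOV  AH, 09h             ; only the high byte changes: AX = 09FFh
        MOV  BX, AX              ; BX = 09FFh, so BH = 09h and BL = 0FFh
        MOV  AX, 4C00h           ; AH = 4Ch (exit), AL = 00h (return code)
        INT  21h
        END  START
```

The last MOV shows why you often see one 16 bit load before an INT 21h: it sets the function number in AH and an argument in AL in a single instruction.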

NOW we have the ones that become more specialized in usage:

These next two are usually for copying memory arrays or strings
(The String Registers)

DI - Destination Index (Where to move things to)
SI - Source Index (Where to move things from)
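
As a rough illustration of the pair at work, the sketch below copies a string from one buffer to another with REP MOVSB and then prints the copy. It assumes TASM/MASM syntax and a .COM program; SRC, DST and SRCLEN are names invented for this example. MOVSB copies from DS:SI to ES:DI, and CX says how many bytes to move.

```asm
.MODEL TINY
.CODE
        ORG  100h
START:  JMP  MAIN                ; skip over the data

SRC     DB 'Hello Planet Hollywood', 0Dh, 0Ah, '$'
SRCLEN  EQU $ - SRC              ; number of bytes in SRC, '$' included
DST     DB SRCLEN DUP (0)        ; room for the copy

MAIN:   CLD                      ; direction flag clear: SI and DI count upward
        PUSH DS
        POP  ES                  ; ES = DS, since the destination uses ES:DI
        MOV  SI, OFFSET SRC      ; where to move things from
        MOV  DI, OFFSET DST      ; where to move things to
        MOV  CX, SRCLEN          ; how many bytes
        REP  MOVSB               ; copy CX bytes from DS:SI to ES:DI

        MOV  DX, OFFSET DST      ; print the copy to prove it arrived
        MOV  AH, 09h
        INT  21h
        MOV  AX, 4C00h
        INT  21h
        END  START
```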

then there's this one:

BP - Base Pointer

Very often, SI, DI, and BP are used to keep track of where you are in the code - it really doesn't matter which you use until you interface your code with something that expects one rather than the other. That tends to happen a lot. Examine virus code sometime - you will see quite a few that use SI and quite a few that use BP.

There is a special one that tends to be like putting the hand of god into the program. It is the Instruction Pointer.
IP is used by the machine to keep track of which instruction is the one to be executed.

Why is this important?

Here is a neat trick. Say, for instance, you are stuck in a loop in a debugger like SoftICE and want to get out.
Change IP so that it points past the end of the loop and you bypass it. It effectively allows you to jump a few instructions ahead. Make sure the new value lands on the start of a real instruction, since adding an arbitrary number can drop you into the middle of one. You could do a lot more with it of course, but be careful anytime you alter the flow of code, the results can be quite unexpected.

Viruses (again using the little nasties as example) tend to make use of IP as well. The EXE file format includes a header at the beginning of the file that informs DOS which segment of memory the code starts in (CS, the code segment) and which instruction to start at (our friend IP).

What this means is that a virus sitting at the end of the program normally wouldn't be executed, but altering the header of any normal EXE file so that CS:IP points at the end makes the virus execute first. When it is done, it hands control back to the original CS:IP, the real start of the program. Rather creative really - whoever first thought that up.
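
For reference, here is where those two values live. This is only a sketch of the standard DOS EXE ("MZ") header layout written as a TASM/MASM structure; the structure and field names are invented for this example. The initial CS and SS are stored relative to the segment the program image is loaded at, and DOS adds that segment when it starts the program.

```asm
; Layout of the DOS EXE ("MZ") header. All names are made up for this sketch.
EXEHDR  STRUC
ehSignature   DW ?   ; 00h  the letters 'MZ'
ehLastPage    DW ?   ; 02h  bytes used in the last 512-byte page
ehPages       DW ?   ; 04h  number of 512-byte pages in the file
ehRelocCount  DW ?   ; 06h  number of relocation entries
ehHdrParas    DW ?   ; 08h  header size in 16-byte paragraphs
ehMinAlloc    DW ?   ; 0Ah  minimum extra paragraphs needed
ehMaxAlloc    DW ?   ; 0Ch  maximum extra paragraphs wanted
ehInitSS      DW ?   ; 0Eh  initial SS, relative to the load segment
ehInitSP      DW ?   ; 10h  initial SP
ehChecksum    DW ?   ; 12h  checksum (usually ignored)
ehInitIP      DW ?   ; 14h  initial IP: the first instruction to run
ehInitCS      DW ?   ; 16h  initial CS, relative to the load segment
ehRelocOfs    DW ?   ; 18h  file offset of the relocation table
ehOverlay     DW ?   ; 1Ah  overlay number
EXEHDR  ENDS
        END
```

Dump the first 20h bytes of any EXE in a hex viewer and you can pick out the words at 14h and 16h by hand.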

For fun, take a look at the symbiote I wrote. It does that very same thing - it is the way you add code to a program. Files ending with .COM are a little different, but a little simpler.
The symbiote handles both really, just take a look at it sometime. It may be a somewhat large program, but the code that does this is commented so you can at least see that it does what I have been describing.

IF YOU HAVEN'T DONE THIS BEFORE:

Go ahead and modify either the .COM file skeleton or the .EXE file example to do the hello world example from above.

Access the .COM and .EXE Skeletons HERE

Considering that most DOS calls use the same basic method - once you get the hang of it, you will not have much trouble calling others.
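
The pattern, roughly, is always the same: function number in AH, arguments in whichever registers that function expects, then INT 21h. A small sketch in TASM/MASM syntax using two other calls, function 02h (print the character in DL) and function 4Ch (exit with the return code in AL):

```asm
.MODEL TINY
.CODE
        ORG  100h
START:  MOV  DL, 'A'             ; argument: the character to print
        MOV  AH, 02h             ; function 02h: display the character in DL
        INT  21h                 ; ask DOS to do it

        MOV  AL, 0               ; argument: the return code
        MOV  AH, 4Ch             ; function 4Ch: terminate the program
        INT  21h
        END  START
```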

In the next lesson or so I will be dealing with user interactivity, from getting input to reading the command line. If you want to get ahead, though it isn't necessary until the next lesson, try finding out how to input data from the user. Or, even more enterprising, look into the fact that offset 80h of the Program Segment Prefix (PSP) records the length of the command line, and that the command line text itself follows it: 81h usually holds the separating space and 82h the first real character. Might be fun.
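
If you want a head start, here is a minimal sketch that echoes whatever was typed after the program name. It assumes a .COM program (where DS points at the PSP on entry) and TASM/MASM syntax; the label QUIT is my own. It uses DOS function 40h, which writes CX bytes from DS:DX to a file handle (handle 1 is standard output), so no terminator is needed.

```asm
.MODEL TINY
.CODE
        ORG  100h
START:  MOV  CL, DS:[80h]        ; PSP offset 80h: length of the command tail
        XOR  CH, CH              ; CX = number of bytes to write
        JCXZ QUIT                ; nothing was typed: skip the write
        MOV  DX, 81h             ; PSP offset 81h: first byte of the tail
        MOV  BX, 1               ; handle 1 = standard output
        MOV  AH, 40h             ; function 40h: write CX bytes from DS:DX
        INT  21h
QUIT:   MOV  AX, 4C00h           ; exit with return code 0
        INT  21h
        END  START
```

Starting at 81h means the leading space is echoed too; start at 82h (and write one byte fewer) if you only want the text itself.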

I will cover all of that of course, but reading ahead never hurt anyone ;)

+gthorne'97