home *** CD-ROM | disk | FTP | other *** search
- Xref: sparky comp.os.msdos.programmer:9098 sci.math.symbolic:2331
- Path: sparky!uunet!convex!darwin.sura.net!spool.mu.edu!olivea!mintaka.lcs.mit.edu!zurich.ai.mit.edu!ara
- From: ara@zurich.ai.mit.edu (Allan Adler)
- Newsgroups: comp.os.msdos.programmer,sci.math.symbolic
- Subject: studying executables
- Message-ID: <ARA.92Sep6131908@camelot.ai.mit.edu>
- Date: 6 Sep 92 18:19:08 GMT
- Sender: news@mintaka.lcs.mit.edu
- Distribution: comp
- Organization: M.I.T. Artificial Intelligence Lab.
- Lines: 129
-
-
- For years I have had an interest in understanding, in the absence of any
- documentation, what a given executable does. I understand that one can simply
- use a debugger, preferably a smart one, and follow it step by step in
- the hope of discovering what the code does. The fact that DOS is nonreentrant
- has always been the primary obstacle to doing this, at least for me.
-
- It seems to me that there are more enterprising things one can do, but I don't
- know how successful they can be. I would appreciate feedback on them and on
- tools that might be available or feasible to carry them out.
-
- First, it seems to me that every compiler and every assembler must have its
- own idiosyncratic way of producing the executable. Is that true? If so, then
- it might be possible to have a database of idiosyncrasies that lets one
- figure out what compiler or assembler produced the executable. From that,
- one would know what the source language. For each compiler, there might
- be a specific decompiler that takes into account the special features of
- that compiler. Also, if by some miracle the code was compiled but
- not optimized, it seems like it would retain more features of the original.
- Or if compiled with C and debugging information was not stripped from the
- executable, that would be quite helpful. Knowledge of the source language
- and the specific compiler seems like it would be quite helpful.
-
- The second idea is that even if one knows nothing about the source language
- or the compiler, one can still imagine tools that will give an overview
- of the structure of the program. The trouble with debuggers and disassemblers
- is that they work at the lowest level, one instruction at a time. Imagine
- instead an attack on the entire executable. First, one determines all of
- the bytes in the file that could be the first byte of an instruction,
- even if perhaps spuriously (since data could have bytes like that, for example,
- or starting an instruction on the second byte could look like some other
- instruction). Next, one picks out those places that could conceivably be
- jumps, branches, returns or maybe interrupts. This has the effect of cutting
- the code up in a natural way into intervals. At this stage, we don't know
- which of these intervals is really code and which is data. Next, one
- takes each interval and for each byte in the interval one examines
- the longest consecutive chain of bytes that constitutes a valid sequence
- of instructions. This will give us a collection of subintervals, some of
- them overlapping. Now, any consecutive sequence of instructions that
- does not jump or branch or return will have to involve one of these chains.
-
- So one enumerates all these chains (just storing their endpoints is sufficient)
- and tentatively one treats them like a library of routines one can call,
- except we don't know what they do. But they are much smaller than the
- original executable and can be examined individually in detail if the need
- arises. Even without stepping through them, it seems as though certain things
- can be done automatically, such as determining all of the registers and
- (perhaps unknwon) memory locations that are the "inputs" to that chain
- and all the registers and memory locations (perhaps unknown) that are the
- outputs.
-
- It then seems conceivable that for each such chain, one could use or write
- a computer algebra program that determines how the outputs depend on
- its inputs. After all, we have removed all the noncomputational things like
- jumps and branches and interrupts and returns. So what it is doing is in effect
- implementing some disgusting formula.
-
- At the moment, that is as far as I can see into the chains from a general
- point of view. On the other hand, we still have to look more closely at
- the branches, jumps, returns and interrupts. Now, none of these was actually
- known to be one, merely that it could be one. So one wants techniques to
- determine whether they are really there or not and if so, where do they
- branch, jump, return to and which interrupt. In the case of interrupts,
- a certain register will be loaded with the number of the interrupt before
- the interrupt is executed. In code I have looked at, that happens immediately
- or almost immediately before, but that could be either a programming convention
- or one of the idiosyncrasies I mentioned that each compiler has. It is
- conceivable that the register is loaded much earlier. So one has two choices
- here: (1) one can assume that if the register is not loaded within a certain
- number of instructions before the interrupt, then the interrupt is not real.
- The number of instructions back one looks can then be called the "resolution"
- of the search (2) one can look arbitrarily far back. This would probably
- be an act of desperation. I'll assume that one takes the first approach
- and has a fairly small resolution. Similarly, branches and jumps have some
- information regarding where one is jumping or branching to and that information
- would like in a certain register. So one can look at when that register was
- loaded, again with a certain resolution. I'm not sure what to do with returns,
- but if a return is genuine, it has to be regarded as terminating whatever
- chains it lies at the end of. Similarly, any unconditional jump instruction
- has to be regarded as terminating whatever chains it lies on. A conditional
- branch instruction will not in general terminate the chain it is on, since
- it ought to be conceivable that the branch condition fails and that the
- code continues. Therefore, a valid branch condition ought to be the nexus
- of two chains.
-
- One of the lacuna of what has already been said is that it is hard to know
- what specific values one is working with in jumps, etc. One can also
- try to automatie some methods of getting those values. For example, look
- at every byte that can be a load immediate instruction and look at every
- byte that can be setting the value of the code segment, data segment,etc.
-
- The more information one get get about the numbers that are actually
- being loaded in turn narrows the possibilities for where the jump and
- branches could be going, which in turn reduces the number of possible
- chains.
-
- There is more one can do. For example, with interrupts, some interrupts are
- more likely to be used than others. I would expect that an interrupt to
- print something on the screen would be likely to occur at least once. One
- can also look through the executable for all sequences of ascii characters
- and to enumerate them. Some of them will be messages and one can note
- their locations. That gives some more clues to the numbers. There might be
- similar tricks one can do with other interrupts or for other ways in which
- strings are used (e.g. environment variables, file names...).
-
- Assuming that one has determined the locations of the set of chains, jumps,
- branches, interrupts, etc, that actually do occur in the program, one can
- then "condense" the executable down to a much smaller object, by treating
- each chain as an instruction in some "higher language". It is possible that
- some chains are contained in others, but that is not a problem, just as it
- is not a problem that the effects of some instructions include the effects
- of other instructions. Once "condensed" in this way, it becomes possible
- to study the program more directly.
-
- Most likely, there will be more than one way to condense an executable
- given the information one has, so one might have to examine several
- condensations. However, in principle each is a possible way to use the
- unknown executable.
-
- Admittedly this does not solve the problem, but I feel that automating what
- I have described would put one in a position to take the next step.
-
- I've posted this on comp.os.msdos.programmer because usually it is on
- MSDOS systems that I am looking at executables. I am crossposting
- to sci.math.symbolic because of the possible applications here of computer
- algebra.
-
- Allan Adler
- ara@altdorf.ai.mit.edu
-