NetNews Usenet Archive 1992 #20

home *** CD-ROM | disk | FTP | other *** search

/ NetNews Usenet Archive 1992 #20 / NN_1992_20.iso / spool / comp / os / msdos / programm / 9098 < prev next >

Wrap

Internet Message Format | 1992-09-08 | 8.2 KB

Xref: sparky comp.os.msdos.programmer:9098 sci.math.symbolic:2331 Path: sparky!uunet!convex!darwin.sura.net!spool.mu.edu!olivea!mintaka.lcs.mit.edu!zurich.ai.mit.edu!ara From: ara@zurich.ai.mit.edu (Allan Adler) Newsgroups: comp.os.msdos.programmer,sci.math.symbolic Subject: studying executables Message-ID: <ARA.92Sep6131908@camelot.ai.mit.edu> Date: 6 Sep 92 18:19:08 GMT Sender: news@mintaka.lcs.mit.edu Distribution: comp Organization: M.I.T. Artificial Intelligence Lab. Lines: 129 For years I have had an interest in understanding, in the absence of any documentation, what a given executable does. I understand that one can simply use a debugger, preferably a smart one, and follow it step by step in the hope of discovering what the code does. The fact that DOS is nonreentrant has always been the primary obstacle to doing this, at least for me. It seems to me that there are more enterprising things one can do, but I don't know how successful they can be. I would appreciate feedback on them and on tools that might be available or feasible to carry them out. First, it seems to me that every compiler and every assembler must have its own idiosyncratic way of producing the executable. Is that true? If so, then it might be possible to have a database of idiosyncrasies that lets one figure out what compiler or assembler produced the executable. From that, one would know what the source language. For each compiler, there might be a specific decompiler that takes into account the special features of that compiler. Also, if by some miracle the code was compiled but not optimized, it seems like it would retain more features of the original. Or if compiled with C and debugging information was not stripped from the executable, that would be quite helpful. Knowledge of the source language and the specific compiler seems like it would be quite helpful. The second idea is that even if one knows nothing about the source language or the compiler, one can still imagine tools that will give an overview of the structure of the program. The trouble with debuggers and disassemblers is that they work at the lowest level, one instruction at a time. Imagine instead an attack on the entire executable. First, one determines all of the bytes in the file that could be the first byte of an instruction, even if perhaps spuriously (since data could have bytes like that, for example, or starting an instruction on the second byte could look like some other instruction). Next, one picks out those places that could conceivably be jumps, branches, returns or maybe interrupts. This has the effect of cutting the code up in a natural way into intervals. At this stage, we don't know which of these intervals is really code and which is data. Next, one takes each interval and for each byte in the interval one examines the longest consecutive chain of bytes that constitutes a valid sequence of instructions. This will give us a collection of subintervals, some of them overlapping. Now, any consecutive sequence of instructions that does not jump or branch or return will have to involve one of these chains. So one enumerates all these chains (just storing their endpoints is sufficient) and tentatively one treats them like a library of routines one can call, except we don't know what they do. But they are much smaller than the original executable and can be examined individually in detail if the need arises. Even without stepping through them, it seems as though certain things can be done automatically, such as determining all of the registers and (perhaps unknwon) memory locations that are the "inputs" to that chain and all the registers and memory locations (perhaps unknown) that are the outputs. It then seems conceivable that for each such chain, one could use or write a computer algebra program that determines how the outputs depend on its inputs. After all, we have removed all the noncomputational things like jumps and branches and interrupts and returns. So what it is doing is in effect implementing some disgusting formula. At the moment, that is as far as I can see into the chains from a general point of view. On the other hand, we still have to look more closely at the branches, jumps, returns and interrupts. Now, none of these was actually known to be one, merely that it could be one. So one wants techniques to determine whether they are really there or not and if so, where do they branch, jump, return to and which interrupt. In the case of interrupts, a certain register will be loaded with the number of the interrupt before the interrupt is executed. In code I have looked at, that happens immediately or almost immediately before, but that could be either a programming convention or one of the idiosyncrasies I mentioned that each compiler has. It is conceivable that the register is loaded much earlier. So one has two choices here: (1) one can assume that if the register is not loaded within a certain number of instructions before the interrupt, then the interrupt is not real. The number of instructions back one looks can then be called the "resolution" of the search (2) one can look arbitrarily far back. This would probably be an act of desperation. I'll assume that one takes the first approach and has a fairly small resolution. Similarly, branches and jumps have some information regarding where one is jumping or branching to and that information would like in a certain register. So one can look at when that register was loaded, again with a certain resolution. I'm not sure what to do with returns, but if a return is genuine, it has to be regarded as terminating whatever chains it lies at the end of. Similarly, any unconditional jump instruction has to be regarded as terminating whatever chains it lies on. A conditional branch instruction will not in general terminate the chain it is on, since it ought to be conceivable that the branch condition fails and that the code continues. Therefore, a valid branch condition ought to be the nexus of two chains. One of the lacuna of what has already been said is that it is hard to know what specific values one is working with in jumps, etc. One can also try to automatie some methods of getting those values. For example, look at every byte that can be a load immediate instruction and look at every byte that can be setting the value of the code segment, data segment,etc. The more information one get get about the numbers that are actually being loaded in turn narrows the possibilities for where the jump and branches could be going, which in turn reduces the number of possible chains. There is more one can do. For example, with interrupts, some interrupts are more likely to be used than others. I would expect that an interrupt to print something on the screen would be likely to occur at least once. One can also look through the executable for all sequences of ascii characters and to enumerate them. Some of them will be messages and one can note their locations. That gives some more clues to the numbers. There might be similar tricks one can do with other interrupts or for other ways in which strings are used (e.g. environment variables, file names...). Assuming that one has determined the locations of the set of chains, jumps, branches, interrupts, etc, that actually do occur in the program, one can then "condense" the executable down to a much smaller object, by treating each chain as an instruction in some "higher language". It is possible that some chains are contained in others, but that is not a problem, just as it is not a problem that the effects of some instructions include the effects of other instructions. Once "condensed" in this way, it becomes possible to study the program more directly. Most likely, there will be more than one way to condense an executable given the information one has, so one might have to examine several condensations. However, in principle each is a possible way to use the unknown executable. Admittedly this does not solve the problem, but I feel that automating what I have described would put one in a position to take the next step. I've posted this on comp.os.msdos.programmer because usually it is on MSDOS systems that I am looking at executables. I am crossposting to sci.math.symbolic because of the possible applications here of computer algebra. Allan Adler ara@altdorf.ai.mit.edu