home *** CD-ROM | disk | FTP | other *** search
- May, 06 1994
-
-
- Technical information "heuristic scanning"
-
-
-
-
-
-
- Combatting viruses heuristically
- ================================
-
- Generally speaking, there are two basic methods to detect viruses specific and
- generic. Specific virus detection requires the anti-virus program to have
- some pre-defined information about a specific virus (like a scan string). The
- anti-virus program must be frequently updated in order to make it detect new
- viruses as they appear. Generic detection methods however are based on
- generic characteristics of the virus, so theoretically they are able to detect
- every virus, including the new and unknown ones.
-
- Why is generic detection gaining importance? There are four reasons:
- 1) The number of viruses increases rapidly. Studies indicate that the total
- number of viruses doubles roughly every nine months. The amount of work for
- the virus researcher increases, and the chances that someone will be hit by
- one of these unrecognizable new viruses increases too.
-
- 2) The number of virus mutants increases. Virus source codes are widely
- spread and many people can't resist the temptation to experiment with them,
- creating many slightly modified viruses. These modified viruses may or may
- not be recognized by the anti-virus product. Sometimes they are, but
- unfortunately often they are not.
-
- 3) The development of polymorphic viruses. Polymorphic viruses like MTE and
- TPE are more difficult to detect with virus scanners. It is often months
- after a polymophic virus has been discovered before a reliable detection
- algorithm has been developed. In the meantime many users have an increased
- chance of being infected by that virus.
-
- 4) Viruses directed at a specific organization or company. It is possible
- for individuals to utilize viruses as weapons. By creating a virus that only
- works on machines owned by a specific organization or company it is very
- unlikely that the virus will spread outside of the organization. Thus it is
- very unlikely that any virus scanner will be able to detect the virus before
- the payload of the virus does its destructive work and reveals itself.
-
- Each of these scenarios demonstrates the fact that virus scanners can not
- recognize a virus until the virus has been discovered and analyzed by an
- anti-virus vendor.
-
-
- These same scenarios do not hold true for generic detectors, therefore many
- people are becoming more interested in generic anti-virus products. Of the
- many generic detection methods, heuristic scanning is currently becoming the
- most important.
-
-
- Heuristic scanning
- ------------------
-
- One of the most time consuming tasks that a virus researcher faces is the
- examination of files. People often send files to researchers because they
- believe the files are infected by a new virus. Sometimes these files are
- indeed infected, sometimes not. Every researcher is able to determine very
- quickly what is going on by loading the suspected file into a debugger. A few
- seconds is often enough, and many researchers must have asked themselves: "How
- can I determine this so quickly"?
-
- One time I demonstrated this effect to the audience on an international
- conference. I showed the first page of the assembly listing of a MTE-infected
- file, and within about a second, Vesselin Bontchev came with the correct
- answer.
-
- How is this possible?
-
- Artificial intelligence
-
- Some of the many differences between viruses and normal programs is that
- normal programs typically start searching the command line for options,
- clearing the screen, etc. Viruses however never search for command line
- options or clear the screen. Instead they start with a search for other
- executable files, by writing to the disk, or by decrypting themselves.
-
- A researcher who has loaded the suspected file into a debugger can notice this
- difference in only a glance. Heuristic scanning is an attempt to put this
- experience and knowledge into a virus scanner.
-
- The word 'heuristic' means (according to a Dutch dictionary) 'the self
- finding' and 'the knowledge to determine something in a methodic way'.
-
- A heuristic scanner is a type of automatic debugger or disassembler. The
- instructions are disassembled and their purposes are determined. If a program
- starts with the sequence
- MOV AH,5
- INT 13h
- which is a disk format instruction for the BIOS, this is highly
- suspected, especially if the program does not process any command line options
- or interact with the user.
-
-
-
- Suspected abilities
-
- In reality, heuristics is much more complicated. The heuristic scanners that
- I am familiar with are able to detect suspicious instruction sequences, like
- the ability to format a disk, the ability to search for other executables, the
- ability to remain resident in memory, the ability to issue non-standard or
- undocumented system calls, etc. Each of these abilities has a value assigned
- to it. The values assigned to the various
- suspicious abilities are dependant on various fact. A disk format
- routine doesn't appear in many normal programs, but often in viruses. So it
- gets a high value. The abilities to remain resident in memory are found in
- many normal programs, so despite of the fact that they also appear in many
- viruses it doesn't get a high value. If the total of the values for one
- program exceeds a predefined treshold, the scanner yells "Virus!". A single
- suspected ability is never enough to trigger the alarm. It is always the
- combination of the suspected abilities which
- convince the scanner that the file is a virus.
-
- Heuristic flags
-
- Some scanners set a flag for each suspected ability which has been found in
- the file being analyzed. This makes it easier to explain to the user what has
- been found.
-
- TbScan for instance recognizes many suspected instruction sequences. Every
- suspected instruction sequence has a flag assigned to it:
-
- Flag Description
- ---- -----------
- F = Suspicious file access. Might be able to infect a file.
- R = Relocator. Program code will be relocated in a suspicious way.
- A = Suspicious Memory Allocation. The program uses a non-standard
- way to search for, and/or allocate memory.
- N = Wrong name extension. Extension conflicts with program structure.
- S = Contains a routine to search for executable (.COM or .EXE) files.
- # = Found an instruction decryption routine. This is common
- for viruses but also for some protected software.
- E = Flexible Entry-point. The code seems to be designed to be
- linked on any location within an executable file. Common for
- viruses.
- L = The program traps the loading of software. Might be a
- virus that intercepts program load to infect the software.
- D = Disk write access. The program writes to disk without using
- DOS.
- M = Memory resident code. This program is designed to stay in
- memory.
-
-
- ! = Invalid opcode (non-8088 instructions) or out-of-range
- branch.
- T = Incorrect timestamp. Some viruses use this to mark
- infected files.
- J = Suspicious jump construct. Entry point via chained or
- indirect jumps. This is unusual for normal software but common
- for viruses.
- ? = Inconsistent exe-header. Might be a virus but can also be a bug.
- G = Garbage instructions. Contains code that seems to have no
- purpose other than encryption or avoiding recognition by virus
- scanners.
- U = Undocumented interrupt/DOS call. The program might be just
- tricky but can also be a virus using a non-standard way to detect
- itself.
- Z = EXE/COM determination. The program tries to check whether a
- file is a COM or EXE file. Viruses need to do this to infect a
- program.
- O = Found code that can be used to overwrite/move a program in memory.
- B = Back to entry point. Contains code to re-start the program after
- modifications at the entry-point are made. Very usual for viruses.
- K = Unusual stack. The program has a suspicious stack or an odd
- stack.
-
- TbScan would for instance output the following flags:
- Virus Heuristic flags
- ----- ---------------
- Jerusalem/PLO: FRLMUZ
- Backfont: FRALDMUZK
- Minsk_Ghost: FELDTGUZB
- Murphy: FSLDMTUZO
- Ninja: FEDMTUZOBK
- Tolbuhin: ASEDMUOB
- Yankee_Doodle: FN#ELMUZB
-
- The more flags that are triggered by a file, the more likely it is that the
- file is infected by a virus. Normal programs rarely trigger even one flag,
- while at least two flags are required to trigger the alarm. To make it more
- complicated, not all flags carry the same 'weight'.
-
- False positives
- ---------------
-
- Just like all other generic detection techniques, heuristic scanners sometimes
- blame innocent programs for being contaminated by a virus. This is called a
- "False Positive" or "False Alarm".
-
-
-
-
-
- The reason for this is simple. Some programs happen to have several suspected
- abilities. For instance, the LOADHI.COM file of QEMM has the following
- suspected abilities (according to an older, yet obsolete version of TbScan):
-
- A = Suspicious Memory Allocation. The program uses a
- non-standard way to search for, and/or allocate memory.
- M = Memory resident code. This program may be a TSR but also a virus.
- U = Undocumented interrupt/DOS call. The program might be just tricky
- but can also be a virus using a non-standard way to detect itself.
- Z = EXE/COM determination. The program tries to check whether a
- file is a COM or EXE file. Viruses need to do this to infect a
- program.
- O = Found code that can be used to overwrite/move a program in
- memory.
-
- All of these abilities are available in LoadHi, and the flags are enough to
- trigger the heuristic alarm. As LoadHi is supposed to allocate upper memory,
- load resident programs in memory, move them to upper memory, etc., all these
- suspected abilities can easily be explained and verified. However, the
- scanner is not able to know the intended purpose of the program, and as most
- of these suspected abilities are often found in viruses, it just describes the
- LoadHi program as "a possible virus".
-
- How serious is the issue of false alarms?
-
- If a heuristic scanner pops up with a message saying: "This program is able to
- format a disk and it stays resident in memory", and the program is a resident
- disk format utility, is this really a false alarm? Actually, the scanner is
- right. A resident format utility obviously contains code to format a disk, and
- it contains code to stay resident in memory. The heuristic scanner is therfore
- completely right! You could name it a false suspicion, but not a false
- positive. The only problem here is that the scanner says that it might be a
- virus. If you think the
- scanner tells you it has found a virus, it turns out to be a false alarm.
- However, if you take this information as is, saying 'ok, the facts you
- reported are true for this program, I can verify this so it is not a virus', I
- wouldn't count it as a false alarm. The scanner just tells the truth. The main
- problem here is the person who has to make decisions with the information
- supplied by the scanner. If it is a novice user, it is a problem. More about
- that later.
-
-
-
-
-
-
-
-
-
-
- Avoiding false positives
-
- Whether we call it a false positive or a false suspicion doesn't matter.
- We do not like the scanner to yell every time we scan. So we need to avoid
- this situation. How do we achieve this?
-
- 1) Definition of (combinations of) suspicious abilities
- The scanner does not issue an alarm unless at least two
- separate suspected program abilities have been found.
- 2) Recognition of common program codes
- Some known compiler codes or run time compression or
- decryption routines can cause false alarms. These specific compression
- or decryption codes can be recognized by the scanner to avoid false
- alarms.
- 3) Recognition of specific programs
- Some programs which normally cause a problem (like the LoadHi
- program used in the example) can be recognized by the
- heuristic scanner.
- 4) Assumption that the machine is initially not infected
- Some heuristic scanners have a 'learn' mode, i.e. they are able to
- learn that a file causing a false alarm is not a virus.
-
- Dealing with false positives
-
- Some false positives are not easily avoided. So, the user has to deal with a
- certain amount of false alarms, and must make the final decision as to whether
- a file is infected or not.
-
- Ok, you may say, how do we know whether a suspicious program is a virus or
- innocent. There is no way to find out, that is what most people believe.
- Actually there is a way to find out, but this depends on the scanner.
-
- The scanner has to explain to the user the reasons why the program is suspect.
- 'This file might contain a virus' actually doesn't say much to the user. It is
- always right. Every file MIGHT contain a virus, but MAY also be clean. We
- actually use a scanner to find out! What is the user supposed to do with this
- information?
-
- However, if the scanner says that some program is able to remain resident in
- memory and able to format a disk, the user can more easily figure out what is
- going on. If a word processor gives such an alarm, it is extremely likely
- that the program carries a virus, because word processors generally are not
- able to format disks and remain resident in memory. However, if the suspected
- file is a resident disk formatting utility, then all of the suspected
- abilities can be explained by the intended purpose of the program.
-
-
-
-
- Reason for suspicion: memory resident and disk formatting
- abilities.
-
- Program Probably
- ------------------------ --------
- Resident disk formatter No Virus (innocent)
- Word processor Malicious (virus)
-
- Both programs cause the same heuristic alarms, but the
- final conclusion is different.
-
-
- Naturally, it requires an advanced user to draw a conclusion for the question
- "infected or not?". However, my opinion is that judging the results of any
- scanner (also conventional scanners) is a task for an advanced user only. If
- the scanner has a 'learn' mode, i.e. is able to remember which programs cause
- a false alarm, the initial scan should be performed by an advanced user, but
- the subsequent scans (when the possible false positives have been eliminated)
- can be performed by a novice user. This is already common practice in most
- organizations.
-
- Anyway, it isn't as bad as it seems, as all other detection methods (including
- signature scanning) are known to cause some false alarms as well. Heuristics
- however has the advantage that it is able to supply you with enough
- information to establish for yourself whether a suspected file is likely a
- virus or not.
-
-
- How does heuristic scanning perform?
- ------------------------------------
-
- Heuristics is a relatively new technique and still under development. It is
- however gaining importance rapidly. This is not surprising as heuristic
- scanners are able to detect over 90% of the viruses without using any
- predefined information like signatures or checksum values. The amount of false
- positives depends on the scanner, but a figure as low as 0.1% can be reached
- easily.
-
- TbScan 6.02 used on the large virus collection of Vesselin Bontchev showed
- the following results:
-
- Scanning 7210 detection
- method files percentage
- ------------- ----- ----------
- Conventional 7056 97.86%
- Heuristics 6465 89.67%
-
-
-
-
- A false positive test however is more difficult to perform so there are no
- independent results available.
-
-
- Combination of conventional and heuristic scanning
- --------------------------------------------------
-
- Some people think heuristic scanning is a replacement for conventional
- scanning. In my opinion it is not. Heuristic scanning serves a very useful
- purpose when used in combination with conventional scanning. The results of
- both scanning methods can be validated by each other, thereby reducing false
- positives and also false negatives.
-
- Combined result of analysis:
- Heuristic Conventional Probability
- clean clean very probably clean
- clean virus might be a false
- positive
- virus clean might be a false
- negative
- virus virus very probably infected
-
- fn: 10% fn: 1% combined false negatives: 0.1%
- fp: 0.1% fp: 0.001% combined false positives: 0.00001%
-
- The chances of both the heuristic scanner and the conventional
- scanner failing is minimal. If both scanning methods have the same results,
- the result is almost certain. In the few cases that the results don't agree
- with each other additional analysis is required. TbScan 6.02 used on the large
- virus collection of Vesselin Bontchev showed the following results:
-
- Scanning 7210 detection
- method files percentage
- ------------- ----- ----------
- Conventional 7056 97.86%
- Heuristics 6465 89.67%
- Combined 7194 99.78%
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- What can be expected from it in the future?
- -------------------------------------------
-
- The development continues
-
- Most anti-virus developers still do not supply a ready-to-use heuristic
- analyzer. Those who have heuristics already available are still improving it.
- It is however unlikely that the detection rate will ever reach 100% without a
- certain amount of false positives. On the other hand it is unlikely that the
- amount of false positives will ever reach 0%.
-
- Maybe you wonder why it isn't possible to achieve 100% correct results. There
- is a large grey area between viruses and non-viruses. Even for humans it is
- hard to describe what a virus is or not, an often used definition of a
- computer virus is this: "A virus is a program that is able to copy itself".
- According to this definition the DiskCopy.Com program is a virus...
-
-
- Reaction of virus writers
-
- An important issue is the effect on virus writers. It is likely
- that they will try to avoid detection by heuristic scanners. Until now the
- goal was to avoid detection by signature scanners, and this was very easy to
- do, as it was sufficient to modify only a small part of an existing virus.
- Teenagers with some basic understanding of programming could do so easily .
- Avoiding heuristic scanners however requires a lot more knowledge, if even
- possible at all.
-
- Fortunately, this detection-avoiding method of programming makes
- detection by conventional anti-virus products easier because it
- means that the programmer can not use very tight and straight code. The virus
- writer will be forced to write more complex viruses.
-
-
- The pro's and con's of heuristic scanning
- -----------------------------------------
- - Advantages:
- - Can detect 'future' viruses
- - User is less dependant on product updates
- - Disadvantages:
- - False positives are possible
- - Judgement of the result requires some basic knowledge
-
-
-
-
-
-
-
-
-
- Heuristic cleaning
- ==================
-
- Before we can discuss heuristic cleaning, it is important to know how a virus
- infects a program.
-
- The basic principle is not difficult. A virus - a program by itself - adds
- itself to the end of the program. The size of the program increases due to
- this addition of the viral code. Appending a virus program to another program
- is however not enough, the virus code should also be executed. To make this
- happen, the virus overwrites the first bytes of the file with a 'jump'
- instruction, which makes the processor jump to the viral code. The virus now
- gains control when the program is
- invoked, and it will finally pass control to the original program. Since the
- first bytes of the file are overwritten by the jump instruction, the virus has
- to 'repair' these bytes first. After that the virus just jumps to the
- beginning of the original program, and most often this program works as usual.
-
-
- original program infected program
-
- +--------------+ +--------------+
- | p | 100: |jump |
- | r | |to 2487 |
- | o | | o |
- | g | | g |
- | r | | r |
- | a | | a |
- | m | | m |
- | | | |
- | c | | c |
- | o | | o |
- | d | | d |
- | e | | e |
- | | | |
- +--------------+ +--------------+
- 2487: | |
- | VIRUS! p |
- | r |
- |jmp 100 |
- +--------------+
-
- To clean an infected program, it is of vital importance to restore the bytes
- being overwritten by the jump to the virus code. The virus has to restore
- these bytes also, so somewhere in this virus code these original bytes are
- stored. The cleaner searches for those bytes, puts them back in their
- original location, and truncates the file to the original size.
-
-
-
-
-
-
- How does a conventional cleaner work?
-
- A conventional cleaner has to know which virus to remove. Suppose your system
- is infected with the Jerusalem/PLO virus. You invoke your cleaner and it
- proceeds like this:
-
- "Hey, this file is infected with the Jerusalem/PLO virus. OK, this virus is
- 1873 bytes in size, and it overwrites the first three bytes of the original
- program with a jump to itself. The original bytes are located at offset 483
- in the viral code. So, I have to take those bytes, copy them to the beginning
- of the file, and I have to remove 1873 bytes of the file. That's it!"
-
-
- Shortcomings of conventional cleaners
-
- The cleaner has to know the virus it has to remove. It is impossible to
- remove an unknown virus.
-
- The virus should be the same as the virus known to the cleaner.
- Imagine what whould happen if the virus used in the example was
- modified and now 1869 bytes in size instead of 1873... the cleaner would
- remove too much! This is not an exception, but it happens quite often since
- there are so many mutants. For instance, the Jerusalem/PLO family now
- contains more than 100 mutants!
-
- Many polymorphic viruses have variable lengths and maintain the
- original instructions encrypted. Most conventional cleaners are
- therefore unable to clean MTE infected programs.
-
-
- The virus will remove itself before actual execution
-
- We have seen above how a virus works. The interesting part is that when the
- virus passes control to the original program it restores the original bytes at
- the beginning of the program and jumps back to start the program. Every virus
- is able to repair the original program in order to keep it functional (except
- for overwriting viruses, but these can't be cleaned anyway).
-
-
- Let the virus do the dirty work
-
- The idea is: why not let do the virus the dirty work? The basic
- principle of heuristic cleaning is simple. The heuristic cleaner loads the
- infected file and starts emulating the program code. It uses a combination of
- disassembly, emulation and sometimes execution to trace the flow of the virus,
- and to emulate what the virus is normally doing. When the virus restores the
- original instructions and jumps back to the original program code, the cleaner
- stops the emulation process, and says 'thank you' to the virus for its
- cooperation in restoring the original bytes. The now repaired start of the
- program is copied back to the program file on disk, and the part of the
- program that gained 'execution' will be chopped off. An additional analysis
- of the cleaned program file will be performed to be on the safe side.
-
- Note that the cleaner is actually removing the unknown from the
- unknown. No predefined information about the virus or infected file is
- necessary.
-
- The process of emulation is just like hitchhiking. The emulator
- convinces the viral code that it is actually executing, and it
- hitchhikes to the point where the virus passes control to the original
- program.
-
- However, the actual process is very complicated. As with hitchhiking, many
- things can go wrong:
-
- - Driver takes you to the wrong place
- The virus does not intend to execute the original program, but it
- starts doing completely different things. As the purpose of the
- emulation is to restore the original program, we never reach our goal.
-
- - Driver won't let you out If the viral code performs an endless
- loop, the origial program will never be restored so the cleaner might
- wait forever.
-
- - Driver leaves the car
- A potentially dangerous situation is that the cleaner is too
- ambitious in its task to emulate everything, and that the virus
- gets control inside the emulated environment and finally escapes
- from it.
-
- - Driver hits a tree and kills you too
- Many viruses are badly programmed. If they crash inside the
- emulator, chances are that the emulator crashes too.
-
- Heuristic cleaners are so complicated that there is only one available right
- now. However, the great potential of heuristic cleaning make it likely that
- there will be more heuristic cleaners soon.
-
-
- The pro's and con's (of Hitchhiking)
-
- - Advantages:
- - no need to recognize mutants
- - no problems with polymorphic viruses
- - can clean 'future' viruses
- - user is less dependant on product updates
- - Disadvantages:
- - No exact copy of the original
- - It cleans everything: even clean files!
-
-
- Being the author of the first heuristic cleaner I have received many reactions
- to it. Most people were surprised that my cleaner was able to remove MTE
- viruses before my scanner was even able to recognize them. This is especially
- interesting as most anti-virus products are still not able to remove MTE
- infections.
-
- Of course everybody wants to know how many viruses can be removed this way. I
- can't show a reliable figure, as testing a cleaner is extremely tedious and
- time consuming task. However, a figure of 80% is a rough estimate. Many
- conventional cleaners do not even come close to this percentage.
-
- What can be expected from it in the future?
-
- Heuristic cleaning needs additional improvements. Some viruses use
- anti-debugger features that also make an emulator fail. It is also still
- possible that a virus detects that it is being emulated, and it can simply
- refuse to cooperate. The better the emulator performs, the less likely this
- is. Major improvements however are more likely to show up after multiple
- heuristic cleaners are available and some competition occurs.
-
- Frans Veldman,
- Author of Thunderbyte Anti-Virus
- Chief Executive of ESaSS B.V.
-
-
-
-
-
-
-
-
-
-