Text File | 1993-11-26 | 5KB | 129 lines
What is TbWeeder
----------------
TbWeeder is a utility to weed out duplicate files.
Virus researchers often receive large virus collections which contain
many duplicate files. Not all anti-virus vendors use the same virus
naming convention, and often a virus sample is renamed to match the
name printed by the scanner used to identify the virus. These renamed
files are copied into other collections, so many renamed but identical
files float around in all kinds of virus collections.
TbWeeder can help to identify duplicate files, and automatically delete them.
Duplicate files are files with the same 32-bit CRC and length. To be
absolutely sure, TbWeeder will perform a full match - byte by byte - of
the files if both files are available.
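The check described above can be sketched in Python as follows. This is
only an illustration, not TbWeeder's actual code: the function names
signature and is_duplicate are made up here, and zlib.crc32 is assumed to
stand in for whatever 32-bit CRC TbWeeder computes.

```python
import filecmp
import zlib


def signature(path, chunk_size=65536):
    """Return (crc32, length) for a file, reading it in chunks."""
    crc = 0
    length = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # Feed the running CRC back in so the result covers the whole file.
            crc = zlib.crc32(chunk, crc)
            length += len(chunk)
    return crc & 0xFFFFFFFF, length


def is_duplicate(path_a, path_b):
    """Cheap CRC/length test first, full byte-by-byte compare to confirm."""
    if signature(path_a) != signature(path_b):
        return False
    # shallow=False makes filecmp compare the file contents,
    # not just the os.stat() information.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

The CRC and length test rejects most non-duplicates cheaply; the full
comparison is only possible when both files are at hand.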
TbWeeder can also maintain a database so it is not necessary to rescan
all files over and over again to search for duplicates.
Interesting features
--------------------
- TbWeeder can handle up to 65534 files.
- TbWeeder can optionally delete duplicate files
- TbWeeder can be used to compare and weed files from one path against
another path, but also to compare and weed within a single path.
- TbWeeder accepts filename specifications, so it can be used to
check just one file against a huge collection.
- TbWeeder can maintain two databases, one for the CRC and length
information, another one for the names of the files in the database.
To weed out remotely, the relatively small CRC database is sufficient.
- TbWeeder is able to compare files byte for byte for additional security.
- TbWeeder is able to output a report file with all duplicate files.
- TbWeeder is fast (due to a 128Kb hash table and nifty linked lists!).
Intended purpose
----------------
Example 1:
Suppose you have a virus collection in directory C:\MYVIRS with viruses
sorted out. In directory C:\NEWVIRS you receive new virus samples.
Enter:
TbWeeder c:\MyVirs /add
This causes TbWeeder to generate a database with file information.
To find out which viruses in directory C:\NEWVIRS are duplicates, execute:
TbWeeder c:\NewVirs
You can optionally list all duplicate files in a log file by using option /log
or automatically delete the duplicates by using option /del.
Example 2:
Suppose you have a directory VIRUSES and you want to delete all duplicates.
Enter:
TbWeeder Viruses /add /del
This causes TbWeeder to build a database and delete duplicate files at the
same time!
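As a rough illustration of example 2 (not TbWeeder's real implementation),
the sketch below walks a directory once, records each new (CRC, length)
pair, and reports any file that matches one already seen. The names
crc_and_length and weed are hypothetical, zlib.crc32 is assumed in place of
TbWeeder's own CRC, and the delete flag only mimics the effect of /del.

```python
import filecmp
import os
import zlib


def crc_and_length(path):
    """Return (crc32, length) of a file; reads the whole file at once."""
    with open(path, "rb") as handle:
        data = handle.read()
    return zlib.crc32(data) & 0xFFFFFFFF, len(data)


def weed(root, database=None, delete=False):
    """Walk root once; add unique files to database, collect duplicates.

    database maps (crc, length) -> path of the first file seen.
    Returns a list of (duplicate_path, original_path) pairs.
    """
    if database is None:
        database = {}
    duplicates = []
    for folder, subdirs, names in os.walk(root):
        subdirs.sort()                      # deterministic walk order
        for name in sorted(names):
            path = os.path.join(folder, name)
            key = crc_and_length(path)
            original = database.get(key)
            if original is None:
                database[key] = path        # first file with this signature
            elif filecmp.cmp(path, original, shallow=False):
                duplicates.append((path, original))
                if delete:
                    os.remove(path)
            # Same CRC and length but different bytes: left untouched.
    return duplicates
```

Calling weed("Viruses") only reports duplicates; passing delete=True also
removes them, so it is worth running without it first.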
Example 3:
Suppose you want to know whether viruses from someone else's collection
are the same ones you have. Run TbWeeder on your own collection with
option /noname, and distribute TbWeeder and the TbWeeder.Dat file to
the other collection. TbWeeder can now be used to create a log file of
all known files.
The database
------------
TbWeeder can only be used with an external database, due to the excessive
amount of data it has to handle when comparing a file against 65000 others!
TbWeeder.Dat will contain the 32-bit CRC and length of all files. This
information is usually sufficient to find out whether a file is a duplicate
or not. To be completely sure, TbWeeder can also perform a byte for byte
comparison after it thinks that two files are identical. However, in this
case TbWeeder needs the name of the original file and of course the original
file itself. Therefore TbWeeder will also maintain a name reference, named
TbWeeder.Lst. This file can become quite large; several megabytes is not
unusual. If you don't want these extended features, you can save disk
space by specifying option /noname.
Since TbWeeder.Lst will become very large and is only needed to
list the name of the first - original - file and to perform a byte by byte
match, you may choose not to distribute this file to others. It can
however be useful to distribute the other file, TbWeeder.Dat, to others,
to weed out files remotely (so that people do not send you files you
already have). The maximum size of TbWeeder.Dat is 512Kb (with over 65000 files!).
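The 512Kb figure is consistent with a fixed 8-byte record per file. The
sketch below assumes, purely for illustration, that each record is a 32-bit
CRC followed by a 32-bit length in little-endian order; the real layout of
TbWeeder.Dat is not documented here, and pack_record, database_size and the
constants are hypothetical names.

```python
import struct

# Assumed layout: 32-bit CRC, then 32-bit length, little-endian.
RECORD = struct.Struct("<II")
MAX_FILES = 65534   # file limit stated in the feature list


def pack_record(crc, length):
    """Pack one (crc, length) pair into the assumed 8-byte record."""
    return RECORD.pack(crc, length)


def database_size(file_count):
    """Size in bytes of a database holding file_count records."""
    return file_count * RECORD.size


print(database_size(MAX_FILES))                 # 524272
print(database_size(MAX_FILES) <= 512 * 1024)   # True
```

Under that assumption a full database of 65534 files takes 524272 bytes,
just under 512Kb (524288 bytes).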
Usage
-----
Usage:
TbWeeder [<path>][<filename>] [<options>...]
If no options are specified, the specified path will be scanned for
duplicate files. TbWeeder will compare these files against the files
in the TbWeeder.Dat database, and against the files in the specified
path itself.
-> IF THERE IS NOT ALREADY A DATABASE YOU NEED TO SPECIFY OPTION /ADD
Command line options (abbreviations between brackets):

  help   (h)   Display a help file.
  nosub  (s)   Do not process sub directories.
  add    (a)   Store the files which have been found to be unique
               in the database files.
  del    (d)   Delete duplicate files.
  noname (n)   Do not create or consult the large name reference
               database. This will disable the full byte by byte
               comparison as well.
  log    (l)   Log duplicate files.