ISAMFIND.DOC 1 Revised: 09/09/95
Program written by:
Bruce Guthrie
Room H-4885
U.S. Dept of Commerce/ESA/STAT-USA
Washington, DC 20230
(202) 482-3234
You may freely copy and re-distribute this program; however, the U.S. Department
of Commerce neither guarantees nor assures compatibility of the program with all
computer software or hardware.
Foreign users: Please provide an Internet e-mail address in all correspondence
or just e-mail your problems to me at bgu@cu.nih.gov
Note:
Since ISAMMAKE.EXE and ISAMFIND.EXE are related and share some of the same
options, there are some common features that are documented in the documentation
for one of the routines and not the other. In general, most of the shared
documentation ends up in ISAMFIND.DOC since that's all that people need to
search the documents. Shared documentation is as follows:
    Introduction             see ISAMFIND.DOC documentation
    Features                 see ISAMFIND.DOC documentation
    The ISAMFIND.INI file    see ISAMFIND.DOC documentation
    Quick demo               see ISAMMAKE.DOC documentation
Introduction:
The ISAMMAKE.EXE program builds an ISAM data base (an "indexed sequential access
method" data base--something which is fairly common for mainframes and was for
some reason used as the DBMS-building language in Microsoft's VisualBASIC for
DOS) that includes every word found in a particular set of files. This program
is used in conjunction with the ISAMFIND.EXE program which actually searches and
displays the files.
The purpose of the ISAMMAKE/ISAMFIND pair is to build a text data base and allow
you to easily search it. This is useful in a number of applications:
* Help-desk applications (someone calls up and asks for information about
"computer parts"--find the descriptive documents that relate to this)
* Disk-searching apps (find all files that mention the word "Clinton")
* Sample files (find a representative file for someone to give them an idea
of what a particular type of report might look like)
The data base that's built includes the number of times the word appears in the
document relative to the total number of words in the document. ISAMFIND
weights the documents for you, allowing you to retrieve the "best" documents for
any given search. By default, the program will then list the files for you and
then try to view them using the READY.EXE program (or any other text-file
viewer).
Definition of "word": Currently, the program defines a "word" as consisting
only of letters of the alphabet. By default, non-letters are treated as word
delimiters (you can override this with the /ACCEPT=string parameter). Words are
a minimum of three characters in length (can be changed to be from 2 to 5) and a
maximum of 10.
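As a rough illustration of the word definition above, the following Python sketch splits text into candidate index words. The function name extract_words and its arguments are hypothetical. Cutting over-length words down to the maximum (rather than skipping them) is an assumption; this documentation does not say which the programs do.

```python
def extract_words(text, accept="", min_len=3, max_len=10):
    """Split text into lowercase words made of the letters A-Z plus any
    extra characters listed in `accept` (mirrors /ACCEPT=string)."""
    allowed = set("abcdefghijklmnopqrstuvwxyz") | set(accept.lower())
    words, current = [], []
    for ch in text.lower() + " ":  # the trailing space flushes the last word
        if ch in allowed:
            current.append(ch)
        else:
            if len(current) >= min_len:
                # Assumption: words longer than max_len are cut, not dropped.
                words.append("".join(current)[:max_len])
            current = []
    return words


print(extract_words("The dog's 3 e-mail addresses"))
print(extract_words("e-mail", accept="-"))
```

With the defaults, the apostrophe, the digit, and the hyphen all act as delimiters, so "s" and "e" are too short to keep. Passing accept="-" keeps "e-mail" together as one word.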
Features (of ISAMMAKE and ISAMFIND combination):
* Programs allow basically an unlimited number of files to be indexed.
* For primarily text documents, index is roughly the size of all the original
text files combined.
* Non-text documents can be scanned but, obviously, not many words will be
found in them.
* Documents are retrieved based on a "best match" formula which returns first
those documents which have a higher proportion of hits relative to their
size.
* Can count words in the document title higher than words within the text of
the document.
* Can specify what characters constitute a word (/ACCEPT=string parm).
* Can specify 8-character "file areas" for each file and restrict the search
based on these file areas.
* Can retrieve based on date so only the newer documents are retrieved.
* Can incrementally update the data bases with just newly modified files.
* Can either view the resulting files on the screen or write out a file which
contains the file names that matched the request.
* Output format is configurable.
* Can specify exclusion words that will be skipped when the user enters them
as words to search for.
* Ideal tool for a help desk which has a lot of information to search through.
Scoring mechanism:
The ISAMFIND program presents your documents to you based on a "best document"
score. The score is based on looking at the number of times your particular
word appears in the document and dividing the result by the number of words in
the document. The result is then multiplied by 1000. The documents which rise
to the top of the list will be those documents where the word frequency is
proportionally higher than it is in other documents.
For example, let's look at three of the test documents referenced in the
ISAMDEMO.LST demo file:
ISAMDEMO.001 Cat Story
The cat and the dog came back and danced long.
ISAMDEMO.002 Dog Story
The dog was in house.
ISAMDEMO.003 House story
The house was four stories high and pretty darn big.
There are 5 words in the second document, 10 in the other two. The word "the"
appears in all three documents. In the first document, it appears twice and
once in the other two documents. Since the score is based on the word frequency
divided by the total number of words (multiplied by 1000), "the" rates scores of
200 (2/10*1000), 200 (1/5*1000), and 100 (1/10*1000) in each of the documents.
The first and second documents will appear in the listing before the third one
does.
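The arithmetic can be checked with a short Python sketch over the three demo documents quoted above. The function name score and the regular-expression word splitter are assumptions made for illustration, not ISAMFIND's actual code. Note that the demo's totals count short words such as "in", so the sketch counts every run of letters toward the total.

```python
import re

DOCS = {
    "ISAMDEMO.001": "The cat and the dog came back and danced long.",
    "ISAMDEMO.002": "The dog was in house.",
    "ISAMDEMO.003": "The house was four stories high and pretty darn big.",
}


def score(text, word):
    """Occurrences of `word`, times 1000, divided by the total word count."""
    words = re.findall(r"[a-z]+", text.lower())
    return words.count(word.lower()) * 1000 / len(words)


for name, text in DOCS.items():
    print(name, round(score(text, "the")))
```

This prints 200, 200, and 100 for the three documents, matching the figures in the text.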
In the case of wildcarded words (so "THE" would find "THESE" and "THEM" as well
as "THE"), scores are added together.
In cases of multiple words in a string search, the scores are again added
together but a single document is required to have all words in order to show
up.
In addition to the words of the text, the words which appear in the title of a
document ("Cat Story", "Dog Story", and "House story" in our case) are also
considered in the search. Words in the title are typically considered more
important than words in the text. The default title weight is three, meaning
that words in the title count three times as much toward the score as words in
the text; you can assign a higher weight to the title words when the index is
created by the ISAMMAKE program. In addition, words in the title do not count in
the document's word count at all.
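In our case, a search request of "dog" would result in scores of 100
(1/10*1000), 800 ((1*3+1)/5*1000), and 0 (0/10*1000) for our three documents.
The second document scores highest because "dog" appears once in its title
(weighted three) and once in its five-word text.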
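The title weighting and the all-words rule can be sketched in Python as follows. The names TITLE_WEIGHT, score, and search are hypothetical, and the sketch uses exact word matching for simplicity. It runs the word "house" through the three demo documents as a further worked case.

```python
import re

TITLE_WEIGHT = 3  # the default weight described above

DOCS = {
    "ISAMDEMO.001": ("Cat Story", "The cat and the dog came back and danced long."),
    "ISAMDEMO.002": ("Dog Story", "The dog was in house."),
    "ISAMDEMO.003": ("House story", "The house was four stories high and pretty darn big."),
}


def tokens(text):
    return re.findall(r"[a-z]+", text.lower())


def score(title, text, word):
    """Title hits count TITLE_WEIGHT times; title words are left out
    of the document's word count."""
    word = word.lower()
    body = tokens(text)
    hits = body.count(word) + TITLE_WEIGHT * tokens(title).count(word)
    return hits * 1000 / len(body)


def search(words):
    """Add up the per-word scores; a document must contain every word."""
    results = {}
    for name, (title, text) in DOCS.items():
        per_word = [score(title, text, w) for w in words]
        if all(s > 0 for s in per_word):
            results[name] = sum(per_word)
    return sorted(results.items(), key=lambda item: item[1], reverse=True)


print(search(["house"]))
print(search(["cat", "house"]))
```

For "house", the third document scores 400 ((1*3+1)/10*1000) and the second scores 200 (1/5*1000). The two-word request returns nothing because no single document contains both words.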
The program, by default, presumes that search words should begin with the
characters specified but may include additional characters. Thus, "THE" will
find "THESE" and "THEM". You can override this by using the /TRUNC parameter.
Even if the /TRUNC parameter is specified, you can still ask for this kind of
begins-with matching on a word-by-word basis by ending the word with an
asterisk; "DOG* STORY /TRUNC" will find either "DOG" or "DOGS".
Specifying parameters:
Parameters for this program can be set in the following ways. The last setting
encountered always wins:
- Read from an *.INI file (see below),
- Through the use of an environmental variable (SET ISAMFIND=whatever), or
- From the command line (see "Syntax" below)
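A minimal Python sketch of the "last setting wins" rule follows. It assumes the three sources are read in the order listed above and handles only /NAME=value parameters. The function name merge_settings is hypothetical, and paired switches such as /TRUNC and /-TRUNC are not reconciled here.

```python
def merge_settings(ini_args, env_args, cli_args):
    """Apply parameters source by source; a later value replaces an earlier one."""
    settings = {}
    for source in (ini_args, env_args, cli_args):
        for arg in source:
            name, _, value = arg.partition("=")
            settings[name.upper()] = value
    return settings


merged = merge_settings(
    ini_args=["/TOP=15", "/AREA=NEWS"],  # from the INI file
    env_args=["/TOP=30"],                # from SET ISAMFIND=/TOP=30
    cli_args=["/TOP=50"],                # typed on the command line
)
print(merged)
```

Here /TOP ends up as 50 because the command line is read last, while /AREA keeps the value from the INI file because nothing later overrides it.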
The ISAMFIND.INI file:
ISAMFIND will read an ISAMFIND.INI file if one is found. (You can specify a
different file name if desired.) (Note that ISAMFIND and ISAMMAKE both, by
default, share the same INI file; options which are valid in one routine but not
in the other are ignored by the non-supporting routine.)
The file is an ASCII text file that can be created and maintained by hand. The
file can consist of one or more command line parameters (only those that begin with a
"/"), one statement per line. The file can also contain the output format
("F=") statements; note that these commands cannot be passed in from the command
line so you will typically need an initialization file if you don't like the
default values for them.
The file can also contain comments which are blank lines or any line beginning
with:
; (semi-colon)
: (colon)
' (quote)
ISAMFIND looks for the initialization file in your default subdirectory first.
It then searches for it in the subdirectory where the executable was and then
goes through your DOS path.
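The search order can be sketched in Python as below. The function find_ini and its arguments are hypothetical. The standard ntpath module is used only to build DOS-style paths, and the exists argument stands in for a real file check so the example runs anywhere.

```python
import ntpath


def find_ini(name, default_dir, exe_dir, dos_path, exists):
    """Try the default subdirectory, then the executable's subdirectory,
    then each entry of the DOS path, in that order."""
    search_dirs = [default_dir, exe_dir] + [d for d in dos_path.split(";") if d]
    for directory in search_dirs:
        candidate = ntpath.join(directory, name)
        if exists(candidate):
            return candidate
    return None


# Pretend only this one file exists on disk.
on_disk = {r"C:\UTIL\ISAMFIND.INI"}
found = find_ini("ISAMFIND.INI", r"C:\WORK", r"C:\UTIL", r"C:\DOS;C:\BIN",
                 exists=on_disk.__contains__)
print(found)
```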
Passing in "/-I" or "/INULL" skips loading the INI file. This saves some
execution time as the program does not need to search your path for the file.
You can combine *.INI files from this and other routines I have out there. This
is useful if you're tired of having a lot of *.INI files out there. To do this,
make a single *.INI file (such as ALL.INI) and include blocks in it. The
routine will look for the block that's the name of the core routine (in this
case, "[ISAMFIND]") and only processes the records within that block. For
example,
; ALL.INI -- contains all of the INI statements
[DATES]
/SORT
[FILL]
/ON
/SPLIT
[ISAMFIND]
/VC:\UTIL\LIST.COM
You can either pass in the name of the INI file ("/IALL.INI") or the routine
will use a "SET BG=filename" (e.g. "SET BG=ALL.INI") parameter if one is
provided.
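The block selection described above might look like this in Python. The function read_ini_block is hypothetical. Treating a file with no bracketed block names as applying entirely to the routine is an assumption based on the plain ISAMFIND.INI case.

```python
def is_block_header(line):
    return line.startswith("[") and line.endswith("]")


def read_ini_block(lines, routine="ISAMFIND"):
    """Return the statements for `routine`, skipping comments and other blocks."""
    stripped = [line.strip() for line in lines]
    # Assumption: a file without any block headers belongs wholly to the routine.
    in_block = not any(is_block_header(line) for line in stripped)
    statements = []
    for line in stripped:
        if not line or line[0] in ";:'":
            continue  # blank lines and comment lines
        if is_block_header(line):
            in_block = line[1:-1].upper() == routine.upper()
            continue
        if in_block:
            statements.append(line)
    return statements


ALL_INI = r"""; ALL.INI -- contains all of the INI statements
[DATES]
/SORT
[FILL]
/ON
/SPLIT
[ISAMFIND]
/VC:\UTIL\LIST.COM
""".splitlines()

print(read_ini_block(ALL_INI))
```

Only the /V statement under [ISAMFIND] is returned; the [DATES] and [FILL] statements are ignored.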
Format statements:
Within the ISAMFIND.INI file, you can specify the format string to be used for
displaying the file-selection list.
The format string begins with "F=" followed by string literals (like spaces) and
information variables. The default format statement is:
F=%fratc% %fname(12)% %fdate(mm-dd-yy)% %fdesc(45)%
Format variables that are available are the following:
%fpath% the path of the file including drive
%fname% the immediate file name for the file; a complete file
name might require "%fpath%%fname%"
%fsize% the size of the file in bytes
%fdate% the creation date for the file
%ftime% the creation time for the file in hh:mm format
%fdesc% the file description
%fword% the number of words in the document
%foccr% the number of times your search word(s) appear
%frate% the numerical rating of the file (see the ratings
discussion in ISAMFIND.DOC)
%fratc% the one-letter representation of the scoring of the
file (see ISAMFIND.DOC for a discussion of scoring;
characters used are .░▒▓█)
%fillr% filler field (skip on input in ISAMMAKE; don't use at
all in ISAMFIND)
Most of the variables can include a format specifier in parentheses before the
second "%". Typically, this indicates the number of spaces to allocate for the
field. For example, "%fsize(10)%" will show the file size in a 10-character
field.
For file descriptions, the format specifier indicates the maximum number of
letters of the description to print out. If there is a continuation indicator
(/CONT=string), the remainder of the description will print out on subsequent
lines.
For the file date, the format specifier can be any combination of "mm" (month),
"dd" (day), and "yy" (year) with separators. For example, the default format
specifier for date is "mm/dd/yy". You can make this "yy-mm-dd" if you'd like by
including "%fdate(yy-mm-dd)%" in your output format string.
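To make the format variables concrete, here is a small Python sketch that expands a format string. The render function and the info dictionary are hypothetical. Left-justified padding and silent cutting of over-long values are assumptions, and the /CONT continuation of long descriptions is not handled.

```python
import re
from datetime import date

VARIABLE = re.compile(r"%(f[a-z]+)(?:\(([^)]*)\))?%")


def render(fmt, info):
    """Replace each %name% or %name(spec)% with the matching value."""
    def expand(match):
        name, spec = match.group(1), match.group(2)
        value = info[name]
        if name == "fdate":
            spec = spec or "mm/dd/yy"  # default date layout per the text
            return (spec.replace("mm", f"{value.month:02d}")
                        .replace("dd", f"{value.day:02d}")
                        .replace("yy", f"{value.year % 100:02d}"))
        text = str(value)
        if spec:
            width = int(spec)
            text = text[:width].ljust(width)  # assumption: pad on the right
        return text
    return VARIABLE.sub(expand, fmt)


info = {"fname": "ISAMDEMO.001", "fdate": date(1995, 9, 9), "fdesc": "Cat Story"}
print(render("%fname(12)% %fdate(yy-mm-dd)% %fdesc(9)%", info))
# prints: ISAMDEMO.001 95-09-09 Cat Story
```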
Syntax:
ISAMFIND [ string [ string ]... ] /Fcorename [ /2 | /3 | /4 | /5 ]
[ /Vfilename | /-VIEW ] [ /TOP=n ] [ /TRUNC | /-TRUNC ]
[ /ACCEPT=string ] [ /AREA=string ] [ /Xfilename ] [ /SINCE=yymmdd ]
[ /Ofilename | /-O ] [ /OVERWRITE | /-OVERWRITE | /OVERASK | /APPEND ]
[ /CONT=string ] [ /PATH | /-PATH ] [ /Iinitfile | /-I ] [ /? ]
where:
"string" is from one to ten words to search for. For example,
ISAMFIND IMPLICIT PRICE
The program automatically does an AND test on the words. The searching is done
exactly although case is ignored. You can search for words that begin with a
given string by ending the request with an asterisk. For example:
ISAMFIND PROD*
If no search string is provided, the routine will prompt you for one. Currently,
the routine only allows up to 10 search strings to be entered. You can include
hexadecimal codes in your search string.
"/Fcorename" specifies the corename of the ISAM data base to list. Should
correspond to the name plugged into ISAMMAKE when the files are originally
built. (See ISAMMAKE.DOC for a description of the ISAM files.)
"/2", "/3", "/4", and "/5" specify the minimum word length allowed in a search. The
routine should use the same minimum specified when the ISAM files were built by
ISAMMAKE.EXE. By default, the minimum word length is 3.
"/Vfilename" says to load a specified text-viewing program (e.g. "/VLIST"). If
none is specified (and neither /-VIEW nor /Ofilename is specified),
ISAMFIND will try to load Bruce Guthrie's READY.EXE program (which should have
been included with the ISAMFymm.ZIP file). A path can be specified if necessary
which is useful if the file viewer is not in your path. If READY.EXE is used,
READY will automatically highlight the search words you've provided. (One
caveat: READY shows the search words anywhere within the text including within
words whereas ISAMFIND only finds the text when it starts the string. If you
search for "WAR", ISAMFIND will only show you documents that contain the letters
"WAR" at the start of a word (like "WARPATH") whereas READY will highlight the
letters "WAR" within words like "SWARM" and "THWART".) Also note that READY
doesn't handle lines over 80 characters in length; read the READY.DOC file to
see how to get long lines wrapped on the screen if desired.
"/-VIEW" says to skip loading any text-viewing program.
"/TOP=n" specifies that you want the top n-number of files to be shown. Based
on their relative scores, the documents are always shown in best to worst order.
Initially, the default is "/TOP=15". The maximum you can specify is "/TOP=100".
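A minimal Python sketch of the /TOP=n selection, assuming the scores have already been computed. The function top_files is hypothetical. Clamping an out-of-range value to the 1 to 100 range is an assumption; the program may report an error instead.

```python
def top_files(scores, top=15):
    """Return the `top` best-scoring files, best first."""
    top = max(1, min(top, 100))  # assumption: clamp rather than reject
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]


print(top_files({"A.TXT": 100, "B.TXT": 400, "C.TXT": 250}, top=2))
# prints: [('B.TXT', 400), ('C.TXT', 250)]
```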
"/TRUNC" says that all searches are to be, by default, only for exactly the word
specified. So "THE" will not find "THESE". You can override any individual
specification by including an asterisk after it ("THE*" will find "THESE" even
if /TRUNC is specified). Initially defaults to "/-TRUNC".
"/-TRUNC" says that all searches should presume any number of characters can
follow the search characters specified. So "THE" will find "THESE". This is
initially the default.
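The interaction between /TRUNC, /-TRUNC, and a trailing asterisk can be summarized in a few lines of Python. The function matches is hypothetical; its trunc argument stands in for the /TRUNC switch.

```python
def matches(term, word, trunc=False):
    """Return True if an indexed `word` satisfies the search `term`.

    trunc=False mirrors /-TRUNC: the word only has to begin with the term.
    trunc=True mirrors /TRUNC: the word must equal the term, unless the
    term ends with an asterisk.
    """
    term, word = term.lower(), word.lower()
    if term.endswith("*"):
        return word.startswith(term[:-1])
    if trunc:
        return word == term
    return word.startswith(term)


print(matches("THE", "these"))              # True: default begins-with match
print(matches("THE", "these", trunc=True))  # False: /TRUNC wants the exact word
print(matches("THE*", "these", trunc=True)) # True: asterisk overrides /TRUNC
```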
"/ACCEPT=string" allows you to specify characters *other than A to Z* that
should be accepted as parts of words. Foreign users, for example, might want to
include some foreign characters. The string can include hexadecimal codes.
"/AREA=string" restricts the output to words from those files that are in a
specific file area as defined by the "/AREA=string" specification when the
ISAMMAKE program is run or by the filename when /C=F is used by ISAMMAKE.
"/Xfilename" specifies that there are certain words you want excluded from
consideration when the user enters them. This file should be an ASCII text file
with the words beginning in column 1. The words can be in any order. Remember
that documents are retrieved which have every word the user enters. If they
enter something like "DOG AND CAT" and the word "AND" is not in a given
document, the document will not show up in the final result. A typical
exclusion file might have all of the following common words in it: AND, THE,
FOR, BUT, NOT, etc.
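Applying an exclusion file to the user's request might look like the following Python sketch. The function drop_excluded is hypothetical, and one word per line is an assumption about the file layout.

```python
def drop_excluded(search_words, exclusion_lines):
    """Remove any search word that appears in the exclusion list (case ignored)."""
    excluded = {line.split()[0].upper() for line in exclusion_lines if line.strip()}
    return [word for word in search_words if word.upper() not in excluded]


exclusions = ["AND", "THE", "FOR", "BUT", "NOT"]
print(drop_excluded(["DOG", "AND", "CAT"], exclusions))
# prints: ['DOG', 'CAT']
```

With "AND" removed, a document no longer has to contain the word "AND" in order to be retrieved.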
"/SINCE=yymmdd" specifies that you only want documents that were created on the
above date or afterward. This is useful for cases where you only want recent
documents to be retrieved.
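A Python sketch of the /SINCE=yymmdd test follows. The functions parse_since and since_filter are hypothetical. Reading two-digit years of 80 or above as 19yy and the rest as 20yy is an assumption; this documentation does not say how the year is interpreted.

```python
from datetime import date


def parse_since(value, pivot=80):
    """Turn a yymmdd string into a date (pivot year is an assumption)."""
    yy, mm, dd = int(value[0:2]), int(value[2:4]), int(value[4:6])
    century = 1900 if yy >= pivot else 2000
    return date(century + yy, mm, dd)


def since_filter(files, since):
    """Keep files created on the given date or afterward."""
    cutoff = parse_since(since)
    return [name for name, created in files if created >= cutoff]


files = [("OLD.TXT", date(1994, 12, 31)), ("NEW.TXT", date(1995, 9, 9))]
print(since_filter(files, "950101"))
# prints: ['NEW.TXT']
```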
"/Ofilename" sends the results to a file instead of to the screen. This
automatically invokes the /-VIEW option too.
"/-O" skips creating an output file with the results. This option is mainly
designed to allow you to override a specification in an INI file.
"/OVERWRITE" says to overwrite the output file if it exists already.
"/-OVERWRITE" says to abort if the output file exists already.
"/OVERASK" says to ask if the output file exists already. This is initially the
default.
"/APPEND" says to append (add) to the output file if it exists already.
"/CONT=string" specifies a string to place at the start of each continuation
line when a file description is too long for one line and is continued on the
next. Defaults to none.
"/PATH" specifies that each file is to be looked for in its original path
first. If the file cannot be found in the original path, it will
be looked for in the default subdirectory on the default drive. If it can't be
found in either location, an error message will be shown. This is initially the
default.
"/-PATH" specifies that the file is to be looked for in the default subdirectory
on the default drive only. This is typically done if you've created the word
index in one place and then copied all of the indexes and files to another drive
for distribution.
"/Iinitfile" says to read an initialization file with the file name "initfile".
The file specification *must* contain a period. If no drive or path information
is specified, the program will search for initfile beginning in your default
subdirectory and then going throughout your DOS path. The use of an
initialization file is optional. Initially defaults to "/IISAMFIND.INI".
"/-I" (or "/INULL") says to skip loading the initialization file.
"/?" or "/HELP" or "HELP" shows you the syntax for the command.
Decimal and hexadecimal codes:
e.g. "\066\097\116" and "&H426174" both are "Bat"
+--------------+--------------+--------------+--------------+--------------+
| dec hex chr | dec hex chr | dec hex chr | dec hex chr | dec hex chr |
+--------------+--------------+--------------+--------------+--------------+
| \000 &H00 nul| \052 &H34 4 | \104 &H68 h | \156 &H9C £ | \208 &HD0 ╨ |
| \001 &H01 | \053 &H35 5 | \105 &H69 i | \157 &H9D ¥ | \209 &HD1 ╤ |
| \002 &H02 | \054 &H36 6 | \106 &H6A j | \158 &H9E ₧ | \210 &HD2 ╥ |
| \003 &H03 | \055 &H37 7 | \107 &H6B k | \159 &H9F ƒ | \211 &HD3 ╙ |
| \004 &H04 | \056 &H38 8 | \108 &H6C l | \160 &HA0 á | \212 &HD4 ╘ |
| \005 &H05 | \057 &H39 9 | \109 &H6D m | \161 &HA1 í | \213 &HD5 ╒ |
| \006 &H06 | \058 &H3A : | \110 &H6E n | \162 &HA2 ó | \214 &HD6 ╓ |
| \007 &H07 bel| \059 &H3B ; | \111 &H6F o | \163 &HA3 ú | \215 &HD7 ╫ |
| \008 &H08 bs | \060 &H3C < | \112 &H70 p | \164 &HA4 ñ | \216 &HD8 ╪ |
| \009 &H09 tab| \061 &H3D = | \113 &H71 q | \165 &HA5 Ñ | \217 &HD9 ┘ |
| \010 &H0A lf | \062 &H3E > | \114 &H72 r | \166 &HA6 ª | \218 &HDA ┌ |
| \011 &H0B vt | \063 &H3F ? | \115 &H73 s | \167 &HA7 º | \219 &HDB █ |
| \012 &H0C pg | \064 &H40 @ | \116 &H74 t | \168 &HA8 ¿ | \220 &HDC ▄ |
| \013 &H0D cr | \065 &H41 A | \117 &H75 u | \169 &HA9 ⌐ | \221 &HDD ▌ |
| \014 &H0E | \066 &H42 B | \118 &H76 v | \170 &HAA ¬ | \222 &HDE ▐ |
| \015 &H0F | \067 &H43 C | \119 &H77 w | \171 &HAB ½ | \223 &HDF ▀ |
| \016 &H10 | \068 &H44 D | \120 &H78 x | \172 &HAC ¼ | \224 &HE0 α |
| \017 &H11 | \069 &H45 E | \121 &H79 y | \173 &HAD ¡ | \225 &HE1 ß |
| \018 &H12 | \070 &H46 F | \122 &H7A z | \174 &HAE « | \226 &HE2 Γ |
| \019 &H13 | \071 &H47 G | \123 &H7B { | \175 &HAF » | \227 &HE3 π |
| \020 &H14 | \072 &H48 H | \124 &H7C | | \176 &HB0 ░ | \228 &HE4 Σ |
| \021 &H15 | \073 &H49 I | \125 &H7D } | \177 &HB1 ▒ | \229 &HE5 σ |
| \022 &H16 | \074 &H4A J | \126 &H7E ~ | \178 &HB2 ▓ | \230 &HE6 µ |
| \023 &H17 | \075 &H4B K | \127 &H7F | \179 &HB3 │ | \231 &HE7 τ |
| \024 &H18 | \076 &H4C L | \128 &H80 Ç | \180 &HB4 ┤ | \232 &HE8 Φ |
| \025 &H19 | \077 &H4D M | \129 &H81 ü | \181 &HB5 ╡ | \233 &HE9 Θ |
| \026 &H1A eof| \078 &H4E N | \130 &H82 é | \182 &HB6 ╢ | \234 &HEA Ω |
| \027 &H1B esc| \079 &H4F O | \131 &H83 â | \183 &HB7 ╖ | \235 &HEB δ |
| \028 &H1C | \080 &H50 P | \132 &H84 ä | \184 &HB8 ╕ | \236 &HEC ∞ |
| \029 &H1D ???| \081 &H51 Q | \133 &H85 à | \185 &HB9 ╣ | \237 &HED φ |
| \030 &H1E ???| \082 &H52 R | \134 &H86 å | \186 &HBA ║ | \238 &HEE ε |
| \031 &H1F ???| \083 &H53 S | \135 &H87 ç | \187 &HBB ╗ | \239 &HEF ∩ |
| \032 &H20 | \084 &H54 T | \136 &H88 ê | \188 &HBC ╝ | \240 &HF0 ≡ |
| \033 &H21 ! | \085 &H55 U | \137 &H89 ë | \189 &HBD ╜ | \241 &HF1 ± |
| \034 &H22 " | \086 &H56 V | \138 &H8A è | \190 &HBE ╛ | \242 &HF2 ≥ |
| \035 &H23 # | \087 &H57 W | \139 &H8B ï | \191 &HBF ┐ | \243 &HF3 ≤ |
| \036 &H24 $ | \088 &H58 X | \140 &H8C î | \192 &HC0 └ | \244 &HF4 ⌠ |
| \037 &H25 % | \089 &H59 Y | \141 &H8D ì | \193 &HC1 ┴ | \245 &HF5 ⌡ |
| \038 &H26 & | \090 &H5A Z | \142 &H8E Ä | \194 &HC2 ┬ | \246 &HF6 ÷ |
| \039 &H27 ' | \091 &H5B [ | \143 &H8F Å | \195 &HC3 ├ | \247 &HF7 ≈ |
| \040 &H28 ( | \092 &H5C \ | \144 &H90 É | \196 &HC4 ─ | \248 &HF8 ° |
| \041 &H29 ) | \093 &H5D ] | \145 &H91 æ | \197 &HC5 ┼ | \249 &HF9 ∙ |
| \042 &H2A * | \094 &H5E ^ | \146 &H92 Æ | \198 &HC6 ╞ | \250 &HFA · |
| \043 &H2B + | \095 &H5F _ | \147 &H93 ô | \199 &HC7 ╟ | \251 &HFB √ |
| \044 &H2C , | \096 &H60 ` | \148 &H94 ö | \200 &HC8 ╚ | \252 &HFC ⁿ |
| \045 &H2D - | \097 &H61 a | \149 &H95 ò | \201 &HC9 ╔ | \253 &HFD ² |
| \046 &H2E . | \098 &H62 b | \150 &H96 û | \202 &HCA ╩ | \254 &HFE ■ |
| \047 &H2F / | \099 &H63 c | \151 &H97 ù | \203 &HCB ╦ | \255 &HFF |
| \048 &H30 0 | \100 &H64 d | \152 &H98 ÿ | \204 &HCC ╠ | |
| \049 &H31 1 | \101 &H65 e | \153 &H99 Ö | \205 &HCD ═ | |
| \050 &H32 2 | \102 &H66 f | \154 &H9A Ü | \206 &HCE ╬ | |
| \051 &H33 3 | \103 &H67 g | \155 &H9B ¢ | \207 &HCF ╧ | |
+--------------+--------------+--------------+--------------+--------------+
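The two notations can be decoded with a few lines of Python. The function decode is hypothetical. The characters in the table match the IBM PC code page 437, so the sketch uses Python's cp437 codec. It assumes a hexadecimal value takes up the whole string; whether the programs allow codes to be mixed with literal text is not stated here.

```python
import re


def decode(text):
    """Expand a hex string (leading &H) or backslash-decimal codes."""
    if text.upper().startswith("&H"):
        return bytes.fromhex(text[2:]).decode("cp437")
    # Each backslash followed by three digits becomes one character.
    return re.sub(r"\\(\d{3})",
                  lambda m: bytes([int(m.group(1))]).decode("cp437"),
                  text)


print(decode(r"\066\097\116"))  # prints: Bat
print(decode("&H426174"))       # prints: Bat
```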