ISAMFIND.DOC
12/14/94
Program written by:
Bruce Guthrie
Room H-4885
U.S. Dept of Commerce/ESA/OBA/BSISD
Washington, D.C. 20230
(202) 482-3234
You may freely copy and re-distribute this program; however, the U.S.
Department of Commerce neither guarantees nor assures compatibility of the
program with all computer software or hardware.
Foreign users: Please provide an Internet e-mail address in all correspondence
or just e-mail your problems to me at bgu@cu.nih.gov
The ISAMFIND.EXE program takes an ISAM word-index created by the ISAMMAKE.EXE
program and searches it for you. If the word(s) you specify are found in the
data base, the program will typically list the files for you and then try to
view them using the READ program (or any other text-file viewer).
Definition of "word": Currently, the program defines a "word" as consisting
only of letters of the alphabet. By default, non-letters are treated as word
delimiters (you can override this with the /ACCEPT=string parameter). Words are
a minimum of three characters in length (can be changed to be from 2 to 5) and a
maximum of 10.
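
The word definition above can be sketched in Python (not the language of the
original program). The function name extract_words and its parameters are
hypothetical. Two points are assumptions, since the text does not say:
that words longer than the maximum are cut to 10 characters rather than
dropped, and that words are folded to upper case.

```python
import re

def extract_words(text, accept="", min_len=3, max_len=10):
    """Split text into words made of A-Z plus any /ACCEPT-style extras."""
    # Characters allowed inside a word: letters plus the caller's extras.
    allowed = "A-Za-z" + re.escape(accept)
    words = []
    for run in re.findall("[" + allowed + "]+", text):
        if len(run) < min_len:
            continue  # shorter than the minimum word length
        # Assumption: longer runs are cut to max_len rather than dropped.
        words.append(run[:max_len].upper())
    return words

print(extract_words("The cat's 2 dogs ran to internationalization."))
print(extract_words("cat's", accept="'"))
```

In the first call the apostrophe and the digit act as delimiters; in the
second the apostrophe is accepted as part of the word.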
Features (of ISAMMAKE and ISAMFIND combination):
* Programs allow basically an unlimited number of files to be indexed.
* For primarily text documents, index is roughly the size of all the original
text files combined.
* Non-text documents can be scanned but, obviously, not many words will be
found in them.
* Documents are retrieved based on a "best match" formula which returns first
those documents which have a higher proportion of hits relative to their
size.
* Can count words in the document title higher than words within the text of
the document.
* Can specify what characters constitute a word (/ACCEPT=string parm).
* Can specify 8-character "file areas" for each file and restrict the search
based on these file areas.
* Can retrieve based on date so only the newer documents are retrieved.
* Can incrementally update the data bases with just newly modified files.
* Can either view the resulting files on the screen or write out a file which
contains the file names that matched the request.
* Output format is configurable.
* Can specify exclusion words that will be skipped when the user enters
words to search for.
* Ideal tool for a help desk which has a lot of information to search through.
Since ISAMMAKE.EXE and ISAMFIND.EXE are related and share some of the same
options, there are some common features that are documented in the documentation
for one of the routines and not the other. In general, most of the shared
documentation ends up in ISAMFIND.DOC since that's all that people need to
search the documents. Shared documentation is as follows:
    Features                 see ISAMFIND.DOC documentation
    The ISAMFIND.INI file    see ISAMFIND.DOC documentation
    Format statements        see ISAMFIND.DOC documentation
    Quick demo               see ISAMMAKE.DOC documentation
Scoring mechanism:
The ISAMFIND program presents your documents to you based on a "best document"
score. The score is based on looking at the number of times your particular
word appears in the document and dividing the result by the number of words in
the document. The result is then multiplied by 1000. The documents which rise
to the top of the list will be those documents where the word frequency is
proportionally higher than it is in other documents.
For example, let's look at three of the test documents referenced in the
ISAMDEMO.LST demo file:
ISAMDEMO.001   Cat Story
    The cat and the dog came back and danced long.
ISAMDEMO.002   Dog Story
    The dog was in house.
ISAMDEMO.003   House story
    The house was four stories high and pretty darn big.
There are 5 words in the second document, 10 in the other two. The word "the"
appears in all three documents: twice in the first document and once in each of
the other two. Since the score is the word frequency divided by the total
number of words (multiplied by 1000), "the" scores 200 (2/10*1000), 200
(1/5*1000), and 100 (1/10*1000) in the three documents respectively.
The first and second documents will appear in the listing before the third one
does.
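
The arithmetic for "the" can be reproduced with a short Python sketch. This
is an illustration of the formula as described, not the program's own code.
The helper name score is hypothetical, every run of letters is counted as a
word so that the counts of 10, 5, and 10 come out as stated, and integer
division is an assumption about rounding.

```python
import re

DOCS = {
    "ISAMDEMO.001": "The cat and the dog came back and danced long.",
    "ISAMDEMO.002": "The dog was in house.",
    "ISAMDEMO.003": "The house was four stories high and pretty darn big.",
}

def score(text, word):
    """Occurrences of word divided by total words, times 1000."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0
    # Assumption: the result is truncated to a whole number.
    return words.count(word.lower()) * 1000 // len(words)

# Highest score first; documents with equal scores keep their original order.
ranked = sorted(DOCS, key=lambda name: score(DOCS[name], "the"), reverse=True)
for name in ranked:
    print(name, score(DOCS[name], "the"))
```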
In the case of wildcarded words (so "THE" would find "THESE" and "THEM" as well
as "THE"), scores are added together.
In cases of multiple words in a string search, the scores are again added
together but a single document is required to have all words in order to show
up.
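
A rough Python sketch of how the two rules might combine, under the
assumption that each matching word is scored separately and the scores are
then summed. The names word_counts, term_score, and query_score are
hypothetical, and title words are left out to keep the example small.

```python
import re

DOCS = {
    "ISAMDEMO.001": "The cat and the dog came back and danced long.",
    "ISAMDEMO.002": "The dog was in house.",
    "ISAMDEMO.003": "The house was four stories high and pretty darn big.",
}

def word_counts(text):
    """Count how often each word occurs in the text."""
    counts = {}
    for w in re.findall(r"[a-z]+", text.lower()):
        counts[w] = counts.get(w, 0) + 1
    return counts

def term_score(counts, prefix):
    """Sum the scores of every word that begins with prefix, or None."""
    total = sum(counts.values())
    matches = [n for w, n in counts.items() if w.startswith(prefix.lower())]
    if not matches:
        return None  # the term does not occur in this document
    return sum(n * 1000 // total for n in matches)

def query_score(text, terms):
    """Add the term scores; a document missing any term is rejected."""
    counts = word_counts(text)
    scores = [term_score(counts, t) for t in terms]
    if None in scores:
        return None
    return sum(scores)

for name, text in DOCS.items():
    print(name, query_score(text, ["the", "dog"]))
```

The third document contains "the" but not "dog", so it drops out of the
result entirely rather than receiving a low score.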
In addition to the words of the text, the words which appear in the title of a
document ("Cat Story", "Dog Story", and "House story" in our case) are also
considered in the search. Words in the title are typically considered more
important than words in the text. You can assign a higher weight to the title
words when the index is created by the ISAMMAKE program; the default weighting
is three, meaning a word in the title counts three times as much toward the
score as a word in the text. In addition, words in the title do not count in
the document's word count at all.
In our case, a search request of "dog" would result in scores of 100
(1/10*1000), 800 (((1*3)+1)/5*1000), and 0 (0/10*1000) for our three documents.
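
Title weighting can be sketched the same way in Python. The function name
weighted_score and its parameters are hypothetical. The sketch divides by
each document's own text word count (five for ISAMDEMO.002) and assumes the
result is truncated to a whole number.

```python
import re

DOCS = [
    ("Cat Story", "The cat and the dog came back and danced long."),
    ("Dog Story", "The dog was in house."),
    ("House story", "The house was four stories high and pretty darn big."),
]

def words_of(text):
    """Lower-cased runs of letters in the text."""
    return re.findall(r"[a-z]+", text.lower())

def weighted_score(title, body, word, title_weight=3):
    """Title hits count title_weight times; only body words are counted."""
    word = word.lower()
    body_words = words_of(body)
    if not body_words:
        return 0
    hits = words_of(title).count(word) * title_weight + body_words.count(word)
    # Title words add hits but are not part of the word count.
    return hits * 1000 // len(body_words)

for title, body in DOCS:
    print(title, weighted_score(title, body, "dog"))
```

With the default weight of three, the three demo documents come out at 100,
800, and 0.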
The program, by default, presumes that search words should begin with the
characters specified but may include additional characters. Thus, "THE" will
find "THESE" and "THEM". You can override this by using the /TRUNC parameter.
Even if the /TRUNC parameter is specified, you can ask for non-truncated
searching on a word-by-word basis by ending a word with an asterisk; "DOG* STORY
/TRUNC" will find either "DOG" or "DOGS" but only the exact word "STORY".
The ISAMFIND.INI file:
ISAMFIND will read an ISAMFIND.INI file if one is found. (You can specify a
different file name if desired.) (Note that ISAMFIND and ISAMMAKE both, by
default, share the same INI file; options which are valid in one routine but not
in the other are ignored by the non-supporting routine.)
The file is an ASCII text file that can be created and maintained by hand. The
file can consist of one or more command line parameters (only those that begin with a
"/"), one statement per line. The file can also contain the input format
("FI=") and output format ("FO=") statements; note that these commands cannot
be passed in from the command line so you will typically need an initialization
file.
The file can also contain comments which are blank lines or any line beginning
with:
; (semi-colon)
: (colon)
' (quote)
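
The file layout described above might be read along these lines. This is a
Python sketch with a hypothetical function name (parse_ini), not the
program's actual parser. It assumes that FI= and FO= are recognized in any
letter case and that unrecognized lines are silently ignored.

```python
def parse_ini(lines):
    """Sort INI lines into "/" parameters and FI=/FO= format strings."""
    switches, formats = [], {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip() or line[0] in ";:'":
            continue  # blank line or comment
        if line.startswith("/"):
            switches.append(line)
        elif line[:3].upper() in ("FI=", "FO="):
            formats[line[:2].upper()] = line[3:]
        # Assumption: anything else is ignored.
    return switches, formats

sample = [
    "; search defaults",
    "",
    "/TOP=20",
    "/-READ",
    "FO=%fratc% %fname(12)% %fdesc(45)%",
    "' end of file",
]
print(parse_ini(sample))
```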
ISAMFIND looks for the initialization file in your default subdirectory first.
It then searches for it in the subdirectory where the executable was and then
goes through your DOS path.
Passing in "/-I" or "/INULL" skips loading the INI file. This saves some
execution time as the program does not need to search your path for the file.
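
The search order can be pictured with a small Python sketch. The function
find_ini and its arguments are hypothetical; exe_dir stands in for the
subdirectory of the executable, and the PATH environment variable stands in
for the DOS path.

```python
import os

def find_ini(name="ISAMFIND.INI", exe_dir=None, skip=False):
    """Return the first matching INI file, or None if there is none."""
    if skip:
        return None  # equivalent of "/-I" or "/INULL": no searching at all
    dirs = [os.getcwd()]              # 1. default subdirectory
    if exe_dir:
        dirs.append(exe_dir)          # 2. where the executable lives
    # 3. every directory on the path, in order
    dirs += [d for d in os.environ.get("PATH", "").split(os.pathsep) if d]
    for d in dirs:
        candidate = os.path.join(d, name)
        if os.path.isfile(candidate):
            return candidate
    return None

print(find_ini(skip=True))
```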
Format statements:
Within the ISAMFIND.INI file, you can specify two format strings, one for input
and the second for output. ISAMMAKE uses the input string to read the input
directory listings. ISAMFIND uses the output string to determine how to
display the file directory listings.
The format strings begin with "FI=" (for input format) or "FO=" (for output
format) followed by string literals (like spaces) and information variables.
For ISAMMAKE, the default format is presumed to consist of a filename, followed
by a file description. In fact, ISAMMAKE can only process these fields (it
ignores date, time, and other information). So, for ISAMMAKE, the format statement
defaults to:
FI=%fname% %fdesc%
For ISAMFIND, on the other hand, the program presumes you want a bit more
information available such as %fdate% (file date) and %ftime% (file time). The
default format statement is:
FO=%fratc% %fname(12)% %fdate(mm-dd-yy)% %fdesc(45)%
Format variables that are available are the following:
%fpath%    the path of the file including drive
%fname%    the immediate file name for the file; a complete file
           name might require "%fpath%%fname%"
%fsize%    the size of the file in bytes
%fdate%    the creation date for the file
%ftime%    the creation time for the file in hh:mm format
%fdesc%    the file description
%fword%    the number of words in the document
%foccr%    the number of times your search word(s) appear
%frate%    the numerical rating of the file (see the ratings
           discussion in ISAMFIND.DOC)
%fratc%    the one-letter representation of the scoring of the
           file (see ISAMFIND.DOC for a discussion of scoring;
           characters used are .░▒▓█)
%fillr%    filler field (skip on input in ISAMMAKE; don't use at
           all in ISAMFIND)
Most of the variables can include a format specifier in parentheses before the
second "%". Typically, this indicates the number of spaces to allocate for the
field. For example, "%fsize(10)%" will show the file size in a 10-character
field.
For file descriptions, the format specifier indicates the maximum number of
letters of the description to print out. If there is a continuation indicator
(/CONT=string), the remainder of the description will print out on subsequent
lines.
For the file date, the format specifier can be any combination of "mm" (month),
"dd" (day), and "yy" (year) with separators. For example, the default format
specifier for date is "mm/dd/yy". You can make this "yy-mm-dd" if you'd like
by including "%fdate(yy-mm-dd)%" in your output format string.
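
To make the format variables concrete, here is a Python sketch of how an
output string might be expanded. The function render and the info dictionary
are hypothetical. The sketch assumes fields are left-justified and cut to the
given width, and that the date is available as a (yy, mm, dd) tuple;
continuation lines (/CONT=string) are not handled.

```python
import re

def render(fmt, info):
    """Expand %name% and %name(spec)% variables in an FO= style string."""
    def expand(match):
        name, spec = match.group(1), match.group(2)
        value = info[name]
        if name == "fdate" and spec:
            yy, mm, dd = value  # hypothetical (yy, mm, dd) tuple
            return (spec.replace("yy", "%02d" % yy)
                        .replace("mm", "%02d" % mm)
                        .replace("dd", "%02d" % dd))
        if spec:
            width = int(spec)
            # Assumption: cut to the width, then pad on the right.
            return str(value)[:width].ljust(width)
        return str(value)
    return re.sub(r"%(\w+)(?:\(([^)]*)\))?%", expand, fmt)

info = {"fname": "ISAMDEMO.002", "fdate": (94, 12, 14), "fdesc": "Dog Story"}
print(render("%fname(12)% %fdate(mm-dd-yy)% %fdesc%", info))
print(render("%fdate(yy-mm-dd)%", info))
```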
Syntax:
ISAMFIND [ string [ string ]... ] /Fcorename [ /2 | /3 | /4 | /5 ]
[ /READ | /Vpgm | /-READ ] [ /TOP=n ] [ /TRUNC | /-TRUNC ]
[ /ACCEPT=string ] [ /AREA=string ] [ /Xfilename ] [ /SINCE=yymmdd ]
[ /Ofilename | /-O ] [ /OVERWRITE | /-OVERWRITE | /OVERASK | /APPEND ]
[ /CONT=string ] [ /PATH | /-PATH ] [ /Iinitfile | /-I ] [ /? ]
where:
"string" is from one to ten words to search for. For example,
ISAMFIND IMPLICIT PRICE
The program automatically does an AND test on the words. Case is ignored. When
/TRUNC is in effect, words are matched exactly; you can still search for words
that begin with a given string by ending the request with an asterisk. For example:
ISAMFIND PROD*
If no search string is provided, the routine will prompt you for one.
Currently, the routine only allows up to 10 search strings to be entered. You
can include hexadecimal codes in your search string.
"/Fcorename" specifies the corename of the ISAM data base to list. Should
correspond to the name plugged into ISAMMAKE when the files are originally
built. (See ISAMMAKE.DOC for a description of the ISAM files.)
"/2", "/3", "/4", and "/5" specify the minimum word length allowed in a search.
The routine should use the same minimum specified when the ISAM files were built
by ISAMMAKE.EXE. By default, the minimum word length is 3.
"/READ" says to load the READ program and view the resulting C:\ISAMFIND.OUT
file. The file can be viewed outside of ISAMFIND later if you want. Note that
READ.EXE must exist in your path in order for this option to work. You can use
the /Vpgm option (with a path) if you have access to READ but it's not in your
path.
"/Vpgm" says to load another text-viewing program (e.g. "/VLIST") instead of
READ.EXE to view the file. A path can be specified if necessary which is useful
if the file viewer is not in your path.
"/-READ" says to skip loading the text-viewing program. This is the default.
"/TOP=n" specifies that you want the top n-number of files to be shown. Based
on their relative scores, the documents are always shown in best to worst order.
Initially, the default is "/TOP=15". The maximum you can specify is "/TOP=100".
"/TRUNC" says that all searches are to be, by default, only for exactly the
word specified. So "THE" will not find "THESE". You can override any
individual specification by including an asterisk after it ("THE*" will find
"THESE" even if /TRUNC is specified). Initially defaults to "/-TRUNC".
"/-TRUNC" says that all searches should presume any number of characters can
follow the search characters specified. So "THE" will find "THESE". This is
initially the default.
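
The interplay of /TRUNC, /-TRUNC, and the trailing asterisk can be summarized
in a few lines of Python. The function name matches is hypothetical, and the
sketch only illustrates the rules as described here.

```python
def matches(term, word, trunc=False):
    """True if an indexed word satisfies a search term."""
    term, word = term.upper(), word.upper()
    if term.endswith("*"):
        # A trailing asterisk always requests a begins-with match.
        return word.startswith(term[:-1])
    if trunc:
        return word == term       # /TRUNC: exact words only
    return word.startswith(term)  # /-TRUNC (default): begins-with match

print(matches("THE", "THESE"))               # default behaviour
print(matches("THE", "THESE", trunc=True))   # with /TRUNC
print(matches("THE*", "THESE", trunc=True))  # asterisk overrides /TRUNC
```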
"/ACCEPT=string" allows you to specify characters *other than A to Z* that
should be accepted as parts of words. Foreign users, for example, might want to
include some foreign characters. The string can include hexadecimal codes.
"/AREA=string" restricts the output to words from those files that are in a
specific file area as defined by the "/AREA=string" specification when the
ISAMMAKE program is run or by the filename when /C=F is used by ISAMMAKE.
"/Xfilename" specifies that there are certain words you want excluded from
consideration when the user enters them. This file should be an ASCII text file
with the words beginning in column 1. The words can be in any order. Remember
that documents are retrieved which have every word the user enters. If they
enter something like "DOG AND CAT" and the word "AND" is not in a given
document, the document will not show up in the final result. A typical
exclusion file might have all of the following common words in it: AND, THE,
FOR, BUT, NOT, etc.
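
The effect of an exclusion file can be sketched in Python as follows. The
helper names load_exclusions and filter_terms are hypothetical, and the
sketch assumes one word per line with case ignored.

```python
def load_exclusions(lines):
    """Collect the first word of every non-blank line, upper-cased."""
    return {line.split()[0].upper() for line in lines if line.strip()}

def filter_terms(terms, excluded):
    """Drop excluded words before the AND search is carried out."""
    return [t for t in terms if t.upper() not in excluded]

excluded = load_exclusions(["AND", "THE", "FOR", "BUT", "NOT"])
print(filter_terms(["DOG", "AND", "CAT"], excluded))
```

With "AND" excluded, the request "DOG AND CAT" is searched as "DOG CAT", so a
document no longer has to contain the word "AND" to be retrieved.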
"/SINCE=yymmdd" specifies that you only want documents that were created on the
above date or afterward. This is useful for cases where you only want recent
documents to be retrieved.
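
A date restriction of this kind could look like the following Python sketch.
The function since_filter is hypothetical. How the program interprets
two-digit years is not documented here; the sketch assumes 80 to 99 mean 19xx
and 00 to 79 mean 20xx.

```python
def since_filter(files, since):
    """Keep the names of (name, yymmdd) pairs dated on or after since."""
    def key(yymmdd):
        yy = int(yymmdd[:2])
        # Assumption: two-digit years 80-99 mean 19xx, 00-79 mean 20xx.
        century = 1900 if yy >= 80 else 2000
        return (century + yy, int(yymmdd[2:4]), int(yymmdd[4:6]))
    return [name for name, date in files if key(date) >= key(since)]

files = [("A.TXT", "941201"), ("B.TXT", "941214"), ("C.TXT", "950102")]
print(since_filter(files, "941214"))
```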
"/Ofilename" sends the results to a file instead of to the screen. This
automatically invokes the /-READ option too.
"/-O" skips creating an output file with the results. This option is mainly
designed to allow you to override a specification in an INI file.
"/OVERWRITE" says to overwrite the output file if it exists already.
"/-OVERWRITE" says to abort if the output file exists already.
"/OVERASK" says to ask if the output file exists already. This is initially
the default.
"/APPEND" says to append (add) to the output file if it exists already.
"/CONT=string" says that lines whose descriptions should be continued on a
second line due to length should have their continuations begin with a specific
string. Defaults to none.
"/PATH" specifies that each file is to be looked for in its original path. If
the file cannot be found in the original path, it will
be looked for in the default subdirectory on the default drive. If it can't
be found in either location, an error message will be shown. This is initially
the default.
"/-PATH" specifies that the file is to be looked for in the default subdirectory
on the default drive only. This is typically done if you've created the word
index in one place and then copied all of the indexes and files to another drive
for distribution.
"/Iinitfile" says to read an initialization file with the file name "initfile".
The file specification *must* contain a period. If no drive or path information
is specified, the program will search for initfile beginning in your default
subdirectory and then going through your DOS path. The use of an
initialization file is optional. Initially defaults to "/IISAMFIND.INI".
"/-I" (or "/INULL") says to skip loading the initialization file.
"/?" or "/HELP" or "HELP" shows you the syntax for the command.
Decimal and hexadecimal codes:
e.g. "\066\097\116" and "&H426174" both are "Bat"
+---------------------------------------------------------------------------
| dec hex chr | dec hex chr | dec hex chr | dec hex chr | dec hex chr |
+--------------+--------------+--------------+--------------+--------------+
| \000 &H00 nul| \052 &H34 4 | \104 &H68 h | \156 &H9C £ | \208 &HD0 ╨ |
| \001 &H01 | \053 &H35 5 | \105 &H69 i | \157 &H9D ¥ | \209 &HD1 ╤ |
| \002 &H02 | \054 &H36 6 | \106 &H6A j | \158 &H9E ₧ | \210 &HD2 ╥ |
| \003 &H03 | \055 &H37 7 | \107 &H6B k | \159 &H9F ƒ | \211 &HD3 ╙ |
| \004 &H04 | \056 &H38 8 | \108 &H6C l | \160 &HA0 á | \212 &HD4 ╘ |
| \005 &H05 | \057 &H39 9 | \109 &H6D m | \161 &HA1 í | \213 &HD5 ╒ |
| \006 &H06 | \058 &H3A : | \110 &H6E n | \162 &HA2 ó | \214 &HD6 ╓ |
| \007 &H07 bel| \059 &H3B ; | \111 &H6F o | \163 &HA3 ú | \215 &HD7 ╫ |
| \008 &H08 bs | \060 &H3C < | \112 &H70 p | \164 &HA4 ñ | \216 &HD8 ╪ |
| \009 &H09 tab| \061 &H3D = | \113 &H71 q | \165 &HA5 Ñ | \217 &HD9 ┘ |
| \010 &H0A lf | \062 &H3E > | \114 &H72 r | \166 &HA6 ª | \218 &HDA ┌ |
| \011 &H0B vt | \063 &H3F ? | \115 &H73 s | \167 &HA7 º | \219 &HDB █ |
| \012 &H0C pg | \064 &H40 @ | \116 &H74 t | \168 &HA8 ¿ | \220 &HDC ▄ |
| \013 &H0D cr | \065 &H41 A | \117 &H75 u | \169 &HA9 ⌐ | \221 &HDD ▌ |
| \014 &H0E | \066 &H42 B | \118 &H76 v | \170 &HAA ¬ | \222 &HDE ▐ |
| \015 &H0F | \067 &H43 C | \119 &H77 w | \171 &HAB ½ | \223 &HDF ▀ |
| \016 &H10 | \068 &H44 D | \120 &H78 x | \172 &HAC ¼ | \224 &HE0 α |
| \017 &H11 | \069 &H45 E | \121 &H79 y | \173 &HAD ¡ | \225 &HE1 ß |
| \018 &H12 | \070 &H46 F | \122 &H7A z | \174 &HAE « | \226 &HE2 Γ |
| \019 &H13 | \071 &H47 G | \123 &H7B { | \175 &HAF » | \227 &HE3 π |
| \020 &H14 | \072 &H48 H | \124 &H7C | | \176 &HB0 ░ | \228 &HE4 Σ |
| \021 &H15 | \073 &H49 I | \125 &H7D } | \177 &HB1 ▒ | \229 &HE5 σ |
| \022 &H16 | \074 &H4A J | \126 &H7E ~ | \178 &HB2 ▓ | \230 &HE6 µ |
| \023 &H17 | \075 &H4B K | \127 &H7F | \179 &HB3 │ | \231 &HE7 τ |
| \024 &H18 | \076 &H4C L | \128 &H80 Ç | \180 &HB4 ┤ | \232 &HE8 Φ |
| \025 &H19 | \077 &H4D M | \129 &H81 ü | \181 &HB5 ╡ | \233 &HE9 Θ |
| \026 &H1A eof| \078 &H4E N | \130 &H82 é | \182 &HB6 ╢ | \234 &HEA Ω |
| \027 &H1B esc| \079 &H4F O | \131 &H83 â | \183 &HB7 ╖ | \235 &HEB δ |
| \028 &H1C | \080 &H50 P | \132 &H84 ä | \184 &HB8 ╕ | \236 &HEC ∞ |
| \029 &H1D ???| \081 &H51 Q | \133 &H85 à | \185 &HB9 ╣ | \237 &HED φ |
| \030 &H1E ???| \082 &H52 R | \134 &H86 å | \186 &HBA ║ | \238 &HEE ε |
| \031 &H1F ???| \083 &H53 S | \135 &H87 ç | \187 &HBB ╗ | \239 &HEF ∩ |
| \032 &H20 | \084 &H54 T | \136 &H88 ê | \188 &HBC ╝ | \240 &HF0 ≡ |
| \033 &H21 ! | \085 &H55 U | \137 &H89 ë | \189 &HBD ╜ | \241 &HF1 ± |
| \034 &H22 " | \086 &H56 V | \138 &H8A è | \190 &HBE ╛ | \242 &HF2 ≥ |
| \035 &H23 # | \087 &H57 W | \139 &H8B ï | \191 &HBF ┐ | \243 &HF3 ≤ |
| \036 &H24 $ | \088 &H58 X | \140 &H8C î | \192 &HC0 └ | \244 &HF4 ⌠ |
| \037 &H25 % | \089 &H59 Y | \141 &H8D ì | \193 &HC1 ┴ | \245 &HF5 ⌡ |
| \038 &H26 & | \090 &H5A Z | \142 &H8E Ä | \194 &HC2 ┬ | \246 &HF6 ÷ |
| \039 &H27 ' | \091 &H5B [ | \143 &H8F Å | \195 &HC3 ├ | \247 &HF7 ≈ |
| \040 &H28 ( | \092 &H5C \ | \144 &H90 É | \196 &HC4 ─ | \248 &HF8 ° |
| \041 &H29 ) | \093 &H5D ] | \145 &H91 æ | \197 &HC5 ┼ | \249 &HF9 ∙ |
| \042 &H2A * | \094 &H5E ^ | \146 &H92 Æ | \198 &HC6 ╞ | \250 &HFA · |
| \043 &H2B + | \095 &H5F _ | \147 &H93 ô | \199 &HC7 ╟ | \251 &HFB √ |
| \044 &H2C , | \096 &H60 ` | \148 &H94 ö | \200 &HC8 ╚ | \252 &HFC ⁿ |
| \045 &H2D - | \097 &H61 a | \149 &H95 ò | \201 &HC9 ╔ | \253 &HFD ² |
| \046 &H2E . | \098 &H62 b | \150 &H96 û | \202 &HCA ╩ | \254 &HFE ■ |
| \047 &H2F / | \099 &H63 c | \151 &H97 ù | \203 &HCB ╦ | \255 &HFF |
| \048 &H30 0 | \100 &H64 d | \152 &H98 ÿ | \204 &HCC ╠ | |
| \049 &H31 1 | \101 &H65 e | \153 &H99 Ö | \205 &HCD ═ | |
| \050 &H32 2 | \102 &H66 f | \154 &H9A Ü | \206 &HCE ╬ | |
| \051 &H33 3 | \103 &H67 g | \155 &H9B ¢ | \207 &HCF ╧ | |
+--------------+--------------+--------------+--------------+--------------+
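
The two notations in the table can be decoded with a short Python sketch.
The function decode_codes is hypothetical. It assumes decimal codes are always
written as a backslash and three digits, that hexadecimal codes are written as
"&H" followed by pairs of hex digits, and that the characters follow the IBM
PC code page 437 shown in the table.

```python
import re

# A decimal code (\ddd) or a run of hexadecimal byte pairs (&Hxx...).
PATTERN = re.compile(r"\\(\d{3})|&H((?:[0-9A-Fa-f]{2})+)")

def decode_codes(s):
    """Replace decimal and hexadecimal codes with their characters."""
    def expand(m):
        if m.group(1) is not None:
            raw = bytes([int(m.group(1))])   # one decimal code, 000-255
        else:
            raw = bytes.fromhex(m.group(2))  # one or more hex pairs
        return raw.decode("cp437")
    return PATTERN.sub(expand, s)

print(decode_codes(r"\066\097\116"))
print(decode_codes("&H426174"))
```

Both calls print "Bat", matching the example at the head of the table.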