home *** CD-ROM | disk | FTP | other *** search
-
- ============================================================================
-
- A very quick summary of Aim:
-
- This package is a Turbo Pascal implementation of a file access method called
- Aim. It is a separately compiled, self-contained unit, that can be added to
- your own Pascal programs.
-
- Aim allows for searches using 'free float' keys as opposed to the traditional
- indexed sequential access method (ISAM) which uses a left justified one. For
- example, say you have a database of law firms, and you are searching for the
- firm 'Dewey, Takem, and Howe'. With ISAM, you would have to know to search for
- 'Dewey', but if you could only remember the 'Takem' part, you'd be out of luck.
- Aim allows you to search for any of the three names. It does not do it by
- sequentially searching the datafile and pattern matching, but instead uses an
- index-like structure, making it extremely fast.
-
- ============================================================================
-
- TERMS OF USE:
-
- This code is being distributed as Shareware (sometimes known as user supported
- software). This should not be confused with Public Domain, even though both
- are often distributed from the same source.
-
- As Shareware, you may distribute this freely provided you distribute all the
- files together and do not charge more than a nominal fee (to cover distribution
- costs).
-
- You may freely try out the code, but should you decide to use ANY of it, you
- are expected to register and pay the registration fee.
-
- By distributing source code, I have decided to allow you to modify the code in
- any way that fits your needs. And should you decide to incorporate this code
- in some larger application, I do not require a per site royalty, but only the
- basic registration fee.
-
- If you make any major enhancements, I'd love to hear about them. And certainly,
- if you encounter any bugs or have problems of any kind, I want to know. Any
- comments would also be appreciated.
-
- To register, send $25 to:
-
- Matt Goodrich
- PO Box 31855
- Oakland, CA 94604
-
-
- I do strongly urge you to register as only by paying for Shareware do you
- enable the authors to continue to support their software and create new
- programs. Considering that the Shareware registration fees are almost always
- far less than the purchase price of comparable commercial software it's obvious
- that Shareware is a good deal for everyone.
-
- Thank you for your support.
-
- ============================================================================
-
- FILES INCLUDED:
-
- AIMDEX .DOC - This document.
-
- AIMUNIT .PAS - Source code to a unit that contains all of the code that
- would be called by an application program.
-
- AIMDEXP .PAS - Source code to the program that creates the aimdex file by
- reading the datafile.
-
- AIMDEMO .PAS - Source code to a program that demonstrates how to use aim.
-
- AIMVAR .PAS - Source code inclusion with some commonly shared variables.
-
- AIMDEXP .EXE - Compiled program.
-
- AIMDEMO .EXE - Compiled program.
-
- AIMDEMO .TXT - Datafile for demo program.
-
- AIMDEMO .AIM - Aimdex for demo program.
-
- AIMUNIT .TPU - Compiled unit.
-
- MISCSTUF.PAS - Source code to some handy routines. Included are routines
- that: keyin to a string; display a string; display a number
- (with commas inserted); turn the cursor on & off; save &
- restore the screen; copy a file; take a filespec and break it
- into its directory and file components; Gregorian date logic.
-
- MISCSTUF.INC - 'scan codes' used by Keyin_Char in MISCSTUF.PAS.
-
- MISCSTUF.TPU - Compiled unit.
-
- ============================================================================
-
- Aim was written with Turbo Pascal 5.5 and DOS 3.2. There are no particular
- constraints such as type of video or CPU. The amount of memory used will
- depend on the size of the various buffers and constants and such, but it is
- reasonably small. The AIMUNIT is about 10K of code, 6K of data.
-
- My coding conventions are to use all uppercase letters for 'built in' Pascal
- names (functions, procedures, constants, etc) and a mix of upper and lower case
- for those names I defined myself.
-
- In several places in the documentation, I site examples of how something works.
- Often I'll say something like "suppose you have the string 'BOO' and it hashes
- to 149". In reality, 'BOO' probably doesn't hash to 149, but since I'm trying
- to come up with a clear example, and the actual hash value is irrelevant, just
- pretend it's correct.
-
- Aim involves a file that is separate from the user created datafile. It's a
- critical distinction in the documentation. I always use the term 'datafile'
- for the original user created file, and 'aimdex file' for the file created by
- AIMDEXP.EXE. There may be places where I'm describing buffers and blocks to
- both files, possibly in the same sentence. But always, the terms datafile and
- aimdex file are honored.
-
- I believe aim originated from an article published in 1971 (I don't know the
- publication) by Malcolm C. Harrison. In the mid 1970's, it was popularized by
- Datapoint in their Databus programming language. For those who want to know
- more about some of the different implementations of aim (also referred to as
- superimposed coding), there is an excellent article written by Don Wills of
- Subject, Wills & Company in the October 1988 issue of 'The RtCulator', a trade
- newsletter published by 'Project: Artie'. Knuth's book 'The Art of Computer
- Programming Vol 3 - Searching and Sorting' also discusses it.
-
- ============================================================================
-
- A quick summary of how to use Aim in your program.
-
- (Look at 'AIMUNIT.PAS' and 'AIMDEMO.PAS' for additional details).
-
-
- You first create the aimdex file with the utility 'AIMDEXP.EXE'.
-
- Your Pascal program must use the unit 'AIMUNIT'.
-
- Your program must define a variable of TYPE 'AimVars' (which is defined in
- 'AIMUNIT.PAS').
-
- You open the aimdex file with the procedure 'Aim_OpenFile'.
-
- You read datafile records by calling the procedures 'Aim_Read',
- 'Aim_ReadKG', and 'Aim_ReadKP'.
-
- When you insert a new datafile record, you add it to the aimdex file with the
- procedure 'Aim_InsertKey'.
-
- When your program is finished, you close the aimdex file by calling
- 'Aim_CloseFile'.
-
- ============================================================================
-
- How Aim works
-
- Each key field in the datafile record is broken up into 3 letter subkeys,
- called triplets. Let's assume we have a database of foods. Say you have a
- record with 'WHOLE WHEAT'. That would be broken up into 9 triplets ('WHO',
- 'HOL', 'OLE', 'LE ', 'E W', ' WH', 'WHE', 'HEA', 'EAT'). Each triplet is
- converted to an integer from 1 to 1009 by hashing (specifically, the division
- remainder method). You end up with a table something like this:
-
- Datafile Record Triplet Hash values
- =============== ===========================================
- 1 212, 807, 213, 1, 16, 1009, 19, 474, 1001, 79
- 2 53, 2
- 3 200, 42, 2, 996, 213, 769,999
- 4 74, 1009
- etc etc
-
-
-
- To write this to the aimdex file, we 'invert' it to look something like:
-
- Triplet Hash Value Datafile Records
- ================== ====================
- 1 1, 6, 11, 35
- 2 2, 3, 11, 33, 34, 42
- 3 7, 8, 23
- .
- .
- .
- 1009 1, 4, 9, 11, 35, 42
-
-
- Now, to do a search, lets try a search key of 'WHEA'. Let's say its two
- triplets hash to 1 & 1009. You can see from the table that the only records
- that hash to both of those values are #11 & #35. At this point we need to read
- those datafile records to see if we have a match. Suppose you have the search
- key 'WHEAT' and suppose its triplets hash to 1, 2, & 1009. Now the only record
- that hashes to those three values is #11.
-
- Generally speaking, the longer the search key is, the fewer 'false hits' you
- get, and thus the faster the search. But, there is a point where the longer
- search key can actually slow down the search. You want to make the search key
- long enough to minimize the number of false hits, but no longer than that. For
- example, if you are searching for 'SMI' it may have the same hash value as
- 'THE', resulting in lots of false hits. Searching for 'SMITH' may be unique
- enough to have few if any false hits. Searching for 'SMITHSONIAN' may not
- filter out any more false hits, but since it has 9 triplets ('SMITH' has 3)
- each time another block is read from the aimdex file, it requires 9 reads (to
- the aimdex file) instead of 3. The aimdex file always has one read per
- triplet. If the record being searched for isn't found until the 15th block in
- the aimdex file, you may have a significant wait.
-
- There's no realistic way of knowing how long the search key should be in order
- to make the fastest searches. It all depends on hash values colliding, and
- there is no reasonable way to predict that. In general, I would say the best
- (ie fastest) search key is about 5-7 characters long.
-
- ============================================================================
-
- Aimdex file layout:
-
- The first record is a header with the following:
-
- type columns bytes explanation
- ======= ======= ===== ==================================
- DataFileName : STRING 1- 40 (40) datafile the aimdex file points to
- BlocksPerHash : LONGINT 41- 46 ( 6) number of blocks allocated
- RecLenData : LONGINT 47- 52 ( 6) length of datafile records
-
- KeyBeg [1] : INTEGER 53- 56 ( 4) beginning position of 1st aimdex key
- KeyEnd [1] : INTEGER 57- 60 ( 4) ending position of 1st aimdex key
- KeyBeg [2] : 61- 64 2nd
- KeyEnd : 65- 68
- KeyBeg [3] : 69- 72 3rd
- KeyEnd : 73- 76
- KeyBeg [4] : 77- 80 4th
- KeyEnd : 81- 84
- KeyBeg [5] : 85- 88 5th
- KeyEnd : 89- 92
- KeyBeg [6] : 93- 96 6th
- KeyEnd : 97-100
- KeyBeg [7] : 101-104 7th
- KeyEnd : 105-108
-
- filler : STRING 109-??? some filler (the amount is
- determined by the constant
- 'AimBlockSize'.
-
-
-
-
- The rest of the aimdex file consists of a series of blocks.
-
- Each block consists of 1 header record (although only the first header record
- in the aimdex file actually has anything written to it) followed by 1009
- records, corresponding to the hash values. If the block size (AimBlockSize) was
- set to 128 bytes, and there were 2 BlocksPerHash, there would be 2 sets of 1010
- records, each 128 bytes long, (for a total file size of 258,560 bytes). There
- aren't really any distinct records, since there is no EOR marker. It is just a
- contiguous group of bytes that are logically grouped into records and therefore
- referred to as records.
-
- Each byte represents 8 datafile records (1 per bit). That is, the first byte
- represents datafile records 1-8, the seconds byte is records 9-16, etc.
-
- You end up with a table that looks something like:
-
-
- 1st BLOCK
-
- datafile record number
-
- 1-8 9-16 17-24 25-32 . . . 1017-1024
- -------------------------------------------
- header |
- hash 1 | x x x x x <-- each 'x'
- value 2 | x x x x x represents
- 3 | x x x x x 1 byte
- 4 | x x x x x
- . |
- . |
- . |
- 1009 | x x x x x
-
-
-
-
-
- 2nd BLOCK
-
- 1025-1032 1033-1040 . . . 2041-2048
- -------------------------------------------
- header |
- 1 | x x x
- 2 | x x x
- 3 | x x x
- 4 | x x x
- . |
- . |
- . |
- 1009 | x x x
-
-
- ============================================================================
-
- Some misc notes:
-
- ------------------------------
-
- Aim is based on a datafile with fixed length records, each of which ends with a
- CR/LF. There must be an EOF after the last record. If there is any deviation
- from this (such as CR, but no LF, or no EOF, or variable length records) aim
- will crash.
-
- ------------------------------
-
- When you run 'AIMDEXP.EXE', it puts the datafile name into the aimdex file, so
- that programs know which datafile that aimdex file points to. If you happen to
- give a path (ie drive or subdirectory) for the input file name, such as:
-
- AIMDEXP K:CUSTFILE.TXT CUSTFILE.AIM 10-50
-
- that path also gets put into the aimdex file. If you ever decide to move the
- datafile you are going to have to run 'AIMDEXP.EXE' again, or use a dump
- utility to change the filename in the header record of the aimdex (don't use an
- editor unless you know it works on binary files). I considered having the
- AIMDEXP utility strip that path off, but then you'd lose the ability to have
- the datafile and indexes in separate directories.
-
- ------------------------------
-
- There are many situations that will cause aim to crash (look for the procedure
- 'AimFatalError'). Most of these are related to problems like file damage or a
- full disk when writing. My philosophy was that these are serious problems and
- when they occur, you should stop immediately and deal with them.
-
- The 'AimFatalError' procedure will not be invoked by all errors that could
- possibly occur. I chose not to put in all the extra code to deal with every
- run-time error that could occur. Virtually all the errors that aren't guarded
- against are hardware related.
-
- ------------------------------
-
- When doing a ReadKG or ReadKP, records are not found in any key sequence. They
- are actually found in their order of appearance in the datafile.
-
- ------------------------------
-
- Read, ReadKG, & ReadKP return an offset into the datafile. They don't return
- the actual datafile record. Your application program must then read the
- datafile. If you want aim to return the datafile record, you will have to add
- a read list to the procedure 'ReadThruBuff'. I did a little benchmarking, and
- doing that one extra read to the datafile seems to have virtually no effect on
- performance.
-
- ------------------------------
-
- InsertKey works much the same. Instead of passing in a datafile record, you
- pass in the offset of the datafile record being inserted (added), and the
- InsertKey procedure will then read the record for its info.
-
- There is a complication caused by the way DOS handles file updates. When new
- records get added to a file (ie, written past the physical EOF), DOS doesn't
- update the file size in the directory information until the file gets closed.
- Normally, this isn't a problem. But, if some other procedure tries to read
- that file (using a different file variable), it is going to believe the
- datafile is still the old size, and thus be unable to read any of the new
- records which are now past the point that used to be the physical EOF. (If you
- modify an existing record, there is no problem, because the file size doesn't
- change).
-
- For example, suppose you have a datafile with 1 record (followed by an EOF
- marker, of course) and you add a second record. If you then call InsertKey to
- 'insert' that new record into the aimdex file, it would crash with a message
- like: "AIMUNIT.TPU error #11 - There appears to be damage to the data file".
- This is because when the 'AIMUNIT' tries to read the new record to get the
- information, and it still thinks the datafile has only the first record.
-
- There are several solutions to this problem. If you are running on a Novell
- network server (and executing on a server disk!), this doesn't happen and you
- simply don't need to worry about it. I don't know specifically about other
- networks, but am guessing that some others will also correct this problem.
-
- You can also get around it by running the DOS utility 'SHARE.EXE'. This is a
- TSR that must be loaded before the Pascal program is run.
-
- Another solution is to have your program close & reopen the datafile after the
- new record gets written. This forces the buffer to be written out and the
- directory to get updated, so that the 'AIMUNIT' can read the new record to
- insert it into aimdex file. See 'AIMDEMO.PAS' for an example of what you need
- to do.
-
- ------------------------------
-
- Aim is NOT case sensitive. It will find SMITH, whether your search key was
- Smith or SMITH or sMiTh. If you want it to be case sensitive, just look for
- the calls to the 'Convert_Upper' procedure.
-
- ------------------------------
-
- Despite the fact that 'AimMaxKeys' is a constant, if you want to increase the
- maximum number of search keys, you may have to do more than just increase
- 'AimMaxKeys'. For example, the searching mechanism is assuming the first
- character of a search key to be a 1 digit number indicating which search key to
- use, so clearly 'AimMaxKeys' cannot be greater than 9. There are a few places
- where string manipulation occurs and increasing 'AimMaxKeys' might result in
- some temporary variable overflowing. If you want to increase it, look things
- over carefully. Basically, I defined 'AimMaxKeys' for the purpose of program
- documentation.
-
- ------------------------------
-
- The maximum number of datafile records is 1 million. That's because the record
- number gets stored in the last 6 digits of the temporary array in 'AIMDEXP'.
- The first 4 digits hold the hash value. (It's a LONGINT field, and therefore
- can have values from -2,147,483,648 to +2,147,483,647). It would be relatively
- easy to bump the maximum up to about 1 billion by being a little trickier in
- how you store the hash value. Since the hash value is always between 1 & 1009,
- and those first 4 digits can go as high as 2147, you could use the first 1137
- to also hold the record number. I suppose you could really get tricky and take
- advantage of the sign bit, giving you another 2 billion numbers to use. But, I
- wanted to keep it simple, and felt 1 million data records was acceptable.
-
- The maximum search key length is 30 bytes (because it's defined as a STRING
- [30]). That could be increased without too much trouble, though I felt 30
- bytes was sufficient.
-
- The maximum aimdex key is 255 bytes. That's because 255 is the maximum length
- of a STRING field and there is a fair amount of string manipulation happening
- here. If you want it longer, you are going to have to write your own string
- manipulation routines.
-
- The maximum length for the datafile records is 4096. You could make it bigger,
- just as long as it is less than 'ReadBufferSize'. Bigger just means more
- memory gets used.
-
- ------------------------------
-
- Aim keys must be at least 3 bytes long. It's likely future versions will allow
- 1 and 2 byte keys, but that won't help you now. When doing a Read, any invalid
- search key is ignored. Some examples that would make a key invalid are: non
- ASCII characters, the first character being something other than a number from
- 1-7 (AimMaxKeys), an all blank key, a null key. If you do a read and pass all
- null search keys, it will simply return nothing found.
-
- ------------------------------
-
- The '-F=' option of 'AIMDEXP' (which specifies the record length in the
- datafile) is only necessary when the datafile is empty. 'AIMDEXP' will read
- the first record of the datafile and assume it to be the correct record length.
- It will always use the '-F=' value if you specify it. The value you specify is
- the record length NOT counting the CR/LF. For example, if the datafile has 307
- bytes plus 2 bytes for CR/LF, then you would say "-F=307".
-
- ------------------------------
-
- Doing an InsertKey will mess up any following ReadKG or ReadKP (it may skip
- some records). Perhaps I will change it so that the subsequent ReadKG & ReadKP
- just return nothing found. I don't consider it a problem, because I don't
- envision a situation where you would want to do that. You should only be doing
- a ReadKG or ReadKP immediately after a successful Read.
-
- ------------------------------
-
- This code is not completely multiuser. For the most part it won't be a
- problem, since bits in the aimdex file never get 'turned off', but only 'turned
- on'. The need for multiuser code (file/record locking) is in how you handle
- your datafile. The only place you might encounter a problem is when the aimfile
- needs to be expanded. When you 'AIMDEXP' the datafile, it allocates space in
- the aimdex file for (at least) 1000 new datafile records. Once you've filled
- that extra space, the aimdex file must be expanded before any more records can
- be added. If two people were inserting records at the same time, and both tried
- to expand, it might be a problem (it might not, though, depending on the timing
- of the collision). Probably, the worst case scenario would be that the second
- person would overwrite the first person's aimdex file entry, thereby 'losing' a
- record from the aimdex file. Running 'AIMDEXP' again would fix it. The way to
- avoid this would be to run 'AIMDEXP' occasionally, which will reallocate those
- 1000 empty slots.
-
- ------------------------------
-
- You can change the constant 'AimBlockSize' to effect performance (don't forget
- to recompile and run 'AIMDEXP' again). A smaller number results in 'AIMDEXP'
- running faster. A larger number results in faster Reads, ReadKG, ReadKP, but
- more memory being used. Since 'AIMDEXP' is a batch process where a little loss
- in speed probably won't be noticed, I tend to recommend erring on the side of a
- larger setting. Whatever the setting, 'AimBlockSize' must not be less than
- 'HeaderBuffSize'.
-
- The nature of your searches could also have an impact on the best setting for
- 'AimBlockSize'. If you do a lot of searches that don't find anything, (or that
- have to search much of the datafile to find a hit), then you may want to set it
- to a higher number. If most of your searches find a record pretty early in the
- datafile, you may find it faster when 'AimBlockSize' is set to a smaller
- number.
-
- Also, a smaller number means it creates fewer empty slots when expanding,
- thereby causing the next expand to happen sooner. (This is independent of the
- number of empty slots that get created by the 'AIMDEXP' program).
-
- Basically, the only way to really know what the 'ideal' setting is would be to
- do some benchmarks with some typical data. That may be more trouble than it's
- worth. The alternative is to just leave it at 512, and not worry about it.
-
- ------------------------------
-
- You could use an aimdex file in place of an index, but there are a couple of
- drawbacks. First, the key sequential doesn't return records in order by key,
- but instead returns them in order of appearance in the datafile. Second, aim
- is perfectly content to allow multiple records with the same aimdex key, so if
- you don't want that, your code would have to take steps to avoid it.
-
- ------------------------------
-
- When searching, aim must always verify that the datafile record actually
- matches the search key. Therefore, if you want to delete a datafile record,
- all you have to do is overwrite it with blanks ('AIMDEXP' will ignore blank
- keys). If you want to overwrite it with some other delete character, you will
- need to change the procedure 'ProcessInRecord' in 'AIMDEXP.PAS' so that those
- records also get ignored. If you want to delete datafile records by just
- having some byte that says "I'm a deleted record", you need to change the
- procedure 'MatchDataRec' in 'AIMUNIT.PAS' and the procedure 'ProcessInRecord'
- in 'AIMDEXP.PAS'.
-
- Since deleted datafile records (no matter how you deleted those records) don't
- get removed from the aimdex file, you may wonder if the aimdex file accumulates
- 'false hits'. The answer is yes, but it's unlikely to be a problem. First of
- all, running 'AIMDEXP' will clean it up completely. And those 'false hits'
- only effect performance. They don't return erroneous info. And the slowing in
- performance is likely to be miniscule. For example, if you had a datafile with
- 10000 records, and you delete 100, that will leave 100 records worth of 'false
- hits'. That means that when searching, you will do one extra read 1% of the
- time. Not a problem considering that a typical search might be 4 or 5 reads
- anyway. The only way it might be a problem is if you are deleting a huge chunk
- of your database (say 50%) or if for some reason you delete all the records
- with one particular search key. For example, if your database had 100 Smiths
- and you deleted 99 of them, the search for that remaining Smith would, on
- average, do 50 extra reads to find that 1 record. Again, to fix this, just run
- 'AIMDEXP'.
-
- ------------------------------
-
- To change a key field in the datafile, you simply need to change the datafile,
- then call InsertKey to insert new 'new' record. Like deletes, you will have
- some 'false' hits left in the aimdex file, but again, it is unlikely to be a
- problem unless you are changing 50% of your records. As before, just 'AIMDEXP'
- the datafile again.
-
- ------------------------------
-
- One feature I might add is 'OR' logic. You would be able to say "find all
- records that are either 'SMITH' or 'JONES'". It wouldn't adversely affect the
- existing algorithm in any significant way, nor would it slow down searches. (I
- would probably make the 'AimKey' variables into 2 dimensional arrays.) But, I'm
- not certain how useful it would be. If anyone feels strongly, I'd love to hear
- about it.
-
- ============================================================================