CD Actual 13

home *** CD-ROM | disk | FTP | other *** search

/ CD Actual 13 / CDA13.ISO / cdactual / demobin / share / program / Pascal / AIM.ZIP / AIMDEX.DOC < prev next >

Wrap

Text File | 1991-06-29 | 27.0 KB | 552 lines

============================================================================ A very quick summary of Aim: This package is a Turbo Pascal implementation of a file access method called Aim. It is a separately compiled, self-contained unit, that can be added to your own Pascal programs. Aim allows for searches using 'free float' keys as opposed to the traditional indexed sequential access method (ISAM) which uses a left justified one. For example, say you have a database of law firms, and you are searching for the firm 'Dewey, Takem, and Howe'. With ISAM, you would have to know to search for 'Dewey', but if you could only remember the 'Takem' part, you'd be out of luck. Aim allows you to search for any of the three names. It does not do it by sequentially searching the datafile and pattern matching, but instead uses an index-like structure, making it extremely fast. ============================================================================ TERMS OF USE: This code is being distributed as Shareware (sometimes known as user supported software). This should not be confused with Public Domain, even though both are often distributed from the same source. As Shareware, you may distribute this freely provided you distribute all the files together and do not charge more than a nominal fee (to cover distribution costs). You may freely try out the code, but should you decide to use ANY of it, you are expected to register and pay the registration fee. By distributing source code, I have decided to allow you to modify the code in any way that fits your needs. And should you decide to incorporate this code in some larger application, I do not require a per site royalty, but only the basic registration fee. If you make any major enhancements, I'd love to hear about them. And certainly, if you encounter any bugs or have problems of any kind, I want to know. Any comments would also be appreciated. To register, send $25 to: Matt Goodrich PO Box 31855 Oakland, CA 94604 I do strongly urge you to register as only by paying for Shareware do you enable the authors to continue to support their software and create new programs. Considering that the Shareware registration fees are almost always far less than the purchase price of comparable commercial software it's obvious that Shareware is a good deal for everyone. Thank you for your support. ============================================================================ FILES INCLUDED: AIMDEX .DOC - This document. AIMUNIT .PAS - Source code to a unit that contains all of the code that would be called by an application program. AIMDEXP .PAS - Source code to the program that creates the aimdex file by reading the datafile. AIMDEMO .PAS - Source code to a program that demonstrates how to use aim. AIMVAR .PAS - Source code inclusion with some commonly shared variables. AIMDEXP .EXE - Compiled program. AIMDEMO .EXE - Compiled program. AIMDEMO .TXT - Datafile for demo program. AIMDEMO .AIM - Aimdex for demo program. AIMUNIT .TPU - Compiled unit. MISCSTUF.PAS - Source code to some handy routines. Included are routines that: keyin to a string; display a string; display a number (with commas inserted); turn the cursor on & off; save & restore the screen; copy a file; take a filespec and break it into its directory and file components; Gregorian date logic. MISCSTUF.INC - 'scan codes' used by Keyin_Char in MISCSTUF.PAS. MISCSTUF.TPU - Compiled unit. ============================================================================ Aim was written with Turbo Pascal 5.5 and DOS 3.2. There are no particular constraints such as type of video or CPU. The amount of memory used will depend on the size of the various buffers and constants and such, but it is reasonably small. The AIMUNIT is about 10K of code, 6K of data. My coding conventions are to use all uppercase letters for 'built in' Pascal names (functions, procedures, constants, etc) and a mix of upper and lower case for those names I defined myself. In several places in the documentation, I site examples of how something works. Often I'll say something like "suppose you have the string 'BOO' and it hashes to 149". In reality, 'BOO' probably doesn't hash to 149, but since I'm trying to come up with a clear example, and the actual hash value is irrelevant, just pretend it's correct. Aim involves a file that is separate from the user created datafile. It's a critical distinction in the documentation. I always use the term 'datafile' for the original user created file, and 'aimdex file' for the file created by AIMDEXP.EXE. There may be places where I'm describing buffers and blocks to both files, possibly in the same sentence. But always, the terms datafile and aimdex file are honored. I believe aim originated from an article published in 1971 (I don't know the publication) by Malcolm C. Harrison. In the mid 1970's, it was popularized by Datapoint in their Databus programming language. For those who want to know more about some of the different implementations of aim (also referred to as superimposed coding), there is an excellent article written by Don Wills of Subject, Wills & Company in the October 1988 issue of 'The RtCulator', a trade newsletter published by 'Project: Artie'. Knuth's book 'The Art of Computer Programming Vol 3 - Searching and Sorting' also discusses it. ============================================================================ A quick summary of how to use Aim in your program. (Look at 'AIMUNIT.PAS' and 'AIMDEMO.PAS' for additional details). You first create the aimdex file with the utility 'AIMDEXP.EXE'. Your Pascal program must use the unit 'AIMUNIT'. Your program must define a variable of TYPE 'AimVars' (which is defined in 'AIMUNIT.PAS'). You open the aimdex file with the procedure 'Aim_OpenFile'. You read datafile records by calling the procedures 'Aim_Read', 'Aim_ReadKG', and 'Aim_ReadKP'. When you insert a new datafile record, you add it to the aimdex file with the procedure 'Aim_InsertKey'. When your program is finished, you close the aimdex file by calling 'Aim_CloseFile'. ============================================================================ How Aim works Each key field in the datafile record is broken up into 3 letter subkeys, called triplets. Let's assume we have a database of foods. Say you have a record with 'WHOLE WHEAT'. That would be broken up into 9 triplets ('WHO', 'HOL', 'OLE', 'LE ', 'E W', ' WH', 'WHE', 'HEA', 'EAT'). Each triplet is converted to an integer from 1 to 1009 by hashing (specifically, the division remainder method). You end up with a table something like this: Datafile Record Triplet Hash values =============== =========================================== 1 212, 807, 213, 1, 16, 1009, 19, 474, 1001, 79 2 53, 2 3 200, 42, 2, 996, 213, 769,999 4 74, 1009 etc etc To write this to the aimdex file, we 'invert' it to look something like: Triplet Hash Value Datafile Records ================== ==================== 1 1, 6, 11, 35 2 2, 3, 11, 33, 34, 42 3 7, 8, 23 . . . 1009 1, 4, 9, 11, 35, 42 Now, to do a search, lets try a search key of 'WHEA'. Let's say its two triplets hash to 1 & 1009. You can see from the table that the only records that hash to both of those values are #11 & #35. At this point we need to read those datafile records to see if we have a match. Suppose you have the search key 'WHEAT' and suppose its triplets hash to 1, 2, & 1009. Now the only record that hashes to those three values is #11. Generally speaking, the longer the search key is, the fewer 'false hits' you get, and thus the faster the search. But, there is a point where the longer search key can actually slow down the search. You want to make the search key long enough to minimize the number of false hits, but no longer than that. For example, if you are searching for 'SMI' it may have the same hash value as 'THE', resulting in lots of false hits. Searching for 'SMITH' may be unique enough to have few if any false hits. Searching for 'SMITHSONIAN' may not filter out any more false hits, but since it has 9 triplets ('SMITH' has 3) each time another block is read from the aimdex file, it requires 9 reads (to the aimdex file) instead of 3. The aimdex file always has one read per triplet. If the record being searched for isn't found until the 15th block in the aimdex file, you may have a significant wait. There's no realistic way of knowing how long the search key should be in order to make the fastest searches. It all depends on hash values colliding, and there is no reasonable way to predict that. In general, I would say the best (ie fastest) search key is about 5-7 characters long. ============================================================================ Aimdex file layout: The first record is a header with the following: type columns bytes explanation ======= ======= ===== ================================== DataFileName : STRING 1- 40 (40) datafile the aimdex file points to BlocksPerHash : LONGINT 41- 46 ( 6) number of blocks allocated RecLenData : LONGINT 47- 52 ( 6) length of datafile records KeyBeg [1] : INTEGER 53- 56 ( 4) beginning position of 1st aimdex key KeyEnd [1] : INTEGER 57- 60 ( 4) ending position of 1st aimdex key KeyBeg [2] : 61- 64 2nd KeyEnd : 65- 68 KeyBeg [3] : 69- 72 3rd KeyEnd : 73- 76 KeyBeg [4] : 77- 80 4th KeyEnd : 81- 84 KeyBeg [5] : 85- 88 5th KeyEnd : 89- 92 KeyBeg [6] : 93- 96 6th KeyEnd : 97-100 KeyBeg [7] : 101-104 7th KeyEnd : 105-108 filler : STRING 109-??? some filler (the amount is determined by the constant 'AimBlockSize'. The rest of the aimdex file consists of a series of blocks. Each block consists of 1 header record (although only the first header record in the aimdex file actually has anything written to it) followed by 1009 records, corresponding to the hash values. If the block size (AimBlockSize) was set to 128 bytes, and there were 2 BlocksPerHash, there would be 2 sets of 1010 records, each 128 bytes long, (for a total file size of 258,560 bytes). There aren't really any distinct records, since there is no EOR marker. It is just a contiguous group of bytes that are logically grouped into records and therefore referred to as records. Each byte represents 8 datafile records (1 per bit). That is, the first byte represents datafile records 1-8, the seconds byte is records 9-16, etc. You end up with a table that looks something like: 1st BLOCK datafile record number 1-8 9-16 17-24 25-32 . . . 1017-1024 ------------------------------------------- header | hash 1 | x x x x x <-- each 'x' value 2 | x x x x x represents 3 | x x x x x 1 byte 4 | x x x x x . | . | . | 1009 | x x x x x 2nd BLOCK 1025-1032 1033-1040 . . . 2041-2048 ------------------------------------------- header | 1 | x x x 2 | x x x 3 | x x x 4 | x x x . | . | . | 1009 | x x x ============================================================================ Some misc notes: ------------------------------ Aim is based on a datafile with fixed length records, each of which ends with a CR/LF. There must be an EOF after the last record. If there is any deviation from this (such as CR, but no LF, or no EOF, or variable length records) aim will crash. ------------------------------ When you run 'AIMDEXP.EXE', it puts the datafile name into the aimdex file, so that programs know which datafile that aimdex file points to. If you happen to give a path (ie drive or subdirectory) for the input file name, such as: AIMDEXP K:CUSTFILE.TXT CUSTFILE.AIM 10-50 that path also gets put into the aimdex file. If you ever decide to move the datafile you are going to have to run 'AIMDEXP.EXE' again, or use a dump utility to change the filename in the header record of the aimdex (don't use an editor unless you know it works on binary files). I considered having the AIMDEXP utility strip that path off, but then you'd lose the ability to have the datafile and indexes in separate directories. ------------------------------ There are many situations that will cause aim to crash (look for the procedure 'AimFatalError'). Most of these are related to problems like file damage or a full disk when writing. My philosophy was that these are serious problems and when they occur, you should stop immediately and deal with them. The 'AimFatalError' procedure will not be invoked by all errors that could possibly occur. I chose not to put in all the extra code to deal with every run-time error that could occur. Virtually all the errors that aren't guarded against are hardware related. ------------------------------ When doing a ReadKG or ReadKP, records are not found in any key sequence. They are actually found in their order of appearance in the datafile. ------------------------------ Read, ReadKG, & ReadKP return an offset into the datafile. They don't return the actual datafile record. Your application program must then read the datafile. If you want aim to return the datafile record, you will have to add a read list to the procedure 'ReadThruBuff'. I did a little benchmarking, and doing that one extra read to the datafile seems to have virtually no effect on performance. ------------------------------ InsertKey works much the same. Instead of passing in a datafile record, you pass in the offset of the datafile record being inserted (added), and the InsertKey procedure will then read the record for its info. There is a complication caused by the way DOS handles file updates. When new records get added to a file (ie, written past the physical EOF), DOS doesn't update the file size in the directory information until the file gets closed. Normally, this isn't a problem. But, if some other procedure tries to read that file (using a different file variable), it is going to believe the datafile is still the old size, and thus be unable to read any of the new records which are now past the point that used to be the physical EOF. (If you modify an existing record, there is no problem, because the file size doesn't change). For example, suppose you have a datafile with 1 record (followed by an EOF marker, of course) and you add a second record. If you then call InsertKey to 'insert' that new record into the aimdex file, it would crash with a message like: "AIMUNIT.TPU error #11 - There appears to be damage to the data file". This is because when the 'AIMUNIT' tries to read the new record to get the information, and it still thinks the datafile has only the first record. There are several solutions to this problem. If you are running on a Novell network server (and executing on a server disk!), this doesn't happen and you simply don't need to worry about it. I don't know specifically about other networks, but am guessing that some others will also correct this problem. You can also get around it by running the DOS utility 'SHARE.EXE'. This is a TSR that must be loaded before the Pascal program is run. Another solution is to have your program close & reopen the datafile after the new record gets written. This forces the buffer to be written out and the directory to get updated, so that the 'AIMUNIT' can read the new record to insert it into aimdex file. See 'AIMDEMO.PAS' for an example of what you need to do. ------------------------------ Aim is NOT case sensitive. It will find SMITH, whether your search key was Smith or SMITH or sMiTh. If you want it to be case sensitive, just look for the calls to the 'Convert_Upper' procedure. ------------------------------ Despite the fact that 'AimMaxKeys' is a constant, if you want to increase the maximum number of search keys, you may have to do more than just increase 'AimMaxKeys'. For example, the searching mechanism is assuming the first character of a search key to be a 1 digit number indicating which search key to use, so clearly 'AimMaxKeys' cannot be greater than 9. There are a few places where string manipulation occurs and increasing 'AimMaxKeys' might result in some temporary variable overflowing. If you want to increase it, look things over carefully. Basically, I defined 'AimMaxKeys' for the purpose of program documentation. ------------------------------ The maximum number of datafile records is 1 million. That's because the record number gets stored in the last 6 digits of the temporary array in 'AIMDEXP'. The first 4 digits hold the hash value. (It's a LONGINT field, and therefore can have values from -2,147,483,648 to +2,147,483,647). It would be relatively easy to bump the maximum up to about 1 billion by being a little trickier in how you store the hash value. Since the hash value is always between 1 & 1009, and those first 4 digits can go as high as 2147, you could use the first 1137 to also hold the record number. I suppose you could really get tricky and take advantage of the sign bit, giving you another 2 billion numbers to use. But, I wanted to keep it simple, and felt 1 million data records was acceptable. The maximum search key length is 30 bytes (because it's defined as a STRING [30]). That could be increased without too much trouble, though I felt 30 bytes was sufficient. The maximum aimdex key is 255 bytes. That's because 255 is the maximum length of a STRING field and there is a fair amount of string manipulation happening here. If you want it longer, you are going to have to write your own string manipulation routines. The maximum length for the datafile records is 4096. You could make it bigger, just as long as it is less than 'ReadBufferSize'. Bigger just means more memory gets used. ------------------------------ Aim keys must be at least 3 bytes long. It's likely future versions will allow 1 and 2 byte keys, but that won't help you now. When doing a Read, any invalid search key is ignored. Some examples that would make a key invalid are: non ASCII characters, the first character being something other than a number from 1-7 (AimMaxKeys), an all blank key, a null key. If you do a read and pass all null search keys, it will simply return nothing found. ------------------------------ The '-F=' option of 'AIMDEXP' (which specifies the record length in the datafile) is only necessary when the datafile is empty. 'AIMDEXP' will read the first record of the datafile and assume it to be the correct record length. It will always use the '-F=' value if you specify it. The value you specify is the record length NOT counting the CR/LF. For example, if the datafile has 307 bytes plus 2 bytes for CR/LF, then you would say "-F=307". ------------------------------ Doing an InsertKey will mess up any following ReadKG or ReadKP (it may skip some records). Perhaps I will change it so that the subsequent ReadKG & ReadKP just return nothing found. I don't consider it a problem, because I don't envision a situation where you would want to do that. You should only be doing a ReadKG or ReadKP immediately after a successful Read. ------------------------------ This code is not completely multiuser. For the most part it won't be a problem, since bits in the aimdex file never get 'turned off', but only 'turned on'. The need for multiuser code (file/record locking) is in how you handle your datafile. The only place you might encounter a problem is when the aimfile needs to be expanded. When you 'AIMDEXP' the datafile, it allocates space in the aimdex file for (at least) 1000 new datafile records. Once you've filled that extra space, the aimdex file must be expanded before any more records can be added. If two people were inserting records at the same time, and both tried to expand, it might be a problem (it might not, though, depending on the timing of the collision). Probably, the worst case scenario would be that the second person would overwrite the first person's aimdex file entry, thereby 'losing' a record from the aimdex file. Running 'AIMDEXP' again would fix it. The way to avoid this would be to run 'AIMDEXP' occasionally, which will reallocate those 1000 empty slots. ------------------------------ You can change the constant 'AimBlockSize' to effect performance (don't forget to recompile and run 'AIMDEXP' again). A smaller number results in 'AIMDEXP' running faster. A larger number results in faster Reads, ReadKG, ReadKP, but more memory being used. Since 'AIMDEXP' is a batch process where a little loss in speed probably won't be noticed, I tend to recommend erring on the side of a larger setting. Whatever the setting, 'AimBlockSize' must not be less than 'HeaderBuffSize'. The nature of your searches could also have an impact on the best setting for 'AimBlockSize'. If you do a lot of searches that don't find anything, (or that have to search much of the datafile to find a hit), then you may want to set it to a higher number. If most of your searches find a record pretty early in the datafile, you may find it faster when 'AimBlockSize' is set to a smaller number. Also, a smaller number means it creates fewer empty slots when expanding, thereby causing the next expand to happen sooner. (This is independent of the number of empty slots that get created by the 'AIMDEXP' program). Basically, the only way to really know what the 'ideal' setting is would be to do some benchmarks with some typical data. That may be more trouble than it's worth. The alternative is to just leave it at 512, and not worry about it. ------------------------------ You could use an aimdex file in place of an index, but there are a couple of drawbacks. First, the key sequential doesn't return records in order by key, but instead returns them in order of appearance in the datafile. Second, aim is perfectly content to allow multiple records with the same aimdex key, so if you don't want that, your code would have to take steps to avoid it. ------------------------------ When searching, aim must always verify that the datafile record actually matches the search key. Therefore, if you want to delete a datafile record, all you have to do is overwrite it with blanks ('AIMDEXP' will ignore blank keys). If you want to overwrite it with some other delete character, you will need to change the procedure 'ProcessInRecord' in 'AIMDEXP.PAS' so that those records also get ignored. If you want to delete datafile records by just having some byte that says "I'm a deleted record", you need to change the procedure 'MatchDataRec' in 'AIMUNIT.PAS' and the procedure 'ProcessInRecord' in 'AIMDEXP.PAS'. Since deleted datafile records (no matter how you deleted those records) don't get removed from the aimdex file, you may wonder if the aimdex file accumulates 'false hits'. The answer is yes, but it's unlikely to be a problem. First of all, running 'AIMDEXP' will clean it up completely. And those 'false hits' only effect performance. They don't return erroneous info. And the slowing in performance is likely to be miniscule. For example, if you had a datafile with 10000 records, and you delete 100, that will leave 100 records worth of 'false hits'. That means that when searching, you will do one extra read 1% of the time. Not a problem considering that a typical search might be 4 or 5 reads anyway. The only way it might be a problem is if you are deleting a huge chunk of your database (say 50%) or if for some reason you delete all the records with one particular search key. For example, if your database had 100 Smiths and you deleted 99 of them, the search for that remaining Smith would, on average, do 50 extra reads to find that 1 record. Again, to fix this, just run 'AIMDEXP'. ------------------------------ To change a key field in the datafile, you simply need to change the datafile, then call InsertKey to insert new 'new' record. Like deletes, you will have some 'false' hits left in the aimdex file, but again, it is unlikely to be a problem unless you are changing 50% of your records. As before, just 'AIMDEXP' the datafile again. ------------------------------ One feature I might add is 'OR' logic. You would be able to say "find all records that are either 'SMITH' or 'JONES'". It wouldn't adversely affect the existing algorithm in any significant way, nor would it slow down searches. (I would probably make the 'AimKey' variables into 2 dimensional arrays.) But, I'm not certain how useful it would be. If anyone feels strongly, I'd love to hear about it. ============================================================================