NetNews Usenet Archive 1992 #19

home *** CD-ROM | disk | FTP | other *** search

/ NetNews Usenet Archive 1992 #19 / NN_1992_19.iso / spool / comp / database / 6528 < prev next >

Wrap

Text File | 1992-09-03 | 4.0 KB | 101 lines

Newsgroups: comp.databases From: gtoal@pizzabox.demon.co.uk (Graham Toal) Path: sparky!uunet!pipex!demon!pizzabox.demon.co.uk!gtoal Subject: SUMMARY: Re: free text indexing algorithm refs wanted... Distribution: world Organization: Cuddlehogs Anonymous Lines: 89 Date: Thu, 3 Sep 1992 01:18:18 +0000 Message-ID: <715483098snx@pizzabox.demon.co.uk> Sender: usenet@gate.demon.co.uk I wrote: >Having implemented a database using a really cool set of algorithms >and data structures, which let us do just about anything, we find it's >depressingly slow compared to some commercial systems. We've have >worked out what they must be doing by treating them as a black box >and throwing queries at them, and working out the complexity of the >search algorithms they must be using. Any pointers to papers or >articles on different data structures for free text systems? Thanks to all of you who answered. I'll follow up the references suggested, and write some more on what we're doing later. A quick comment though is that the time for turning a key into a pointer to a list of occurances of that word *isn't* the bottleneck; the problems we have are more to do with efficient operations on these lists - including AND and OR, but also more powerful ones. We have hierarchical SGML structured documents, so simple solutions like separate indices for each field don't work. Our structure simply has a 32-bit pointer to the word; other people may include various bits of info too. (Liam Quinn - we chatted about this a long time ago when we first started this project. Your scheme is the sort of thing I was looking for info on, though I'm particularly wondering about real live systems that perform well on CDs at slow CD speeds - there's probably only commercial ones like this...) My apologies to the people who mailed me for not replying individually; I've fire-fighting at work at the moment and can't keep up. I've filed all your mail and will reply personally as soon as things lighten up. [Ed Fox - I won't be in Cambridge, but will you be passing through London at all?] ======== From: "Sanjiv K. Bhatia" <sanjiv@redbird.umsl.edu> Have you checked the standard books on Information Retrieval like the classic text by van Rijsbergen and another book by Salton. I can send you the complete reference if you desire. It may also help to check the implemented schemes in the SMART retrieval environment available from Cornell University for a small fee. And there is the overly stupid indxbib in Unix but it comes for free. ======== From: David Grossman <grossman@eola.cs.ucf.edu> Do you have a copy of Mark Zimmerman's free text browser. Its shareware and comes with source code. I can mail it to you if you'd like, he's got some comments that might help. As for IR design there are probably some articles that might interest you in any of the last couple SIGIR (Special interest group on information retrieval). ======== From: QiFan Chen <chenq@hal.com> I noticed that you have posted a message "free text indexing algorithm refs wanted..." on the comp.databases. It is interesting as my colleagues and I also work on SGML database related issues and like to share expereience with each other on this subject. Back in school, my professors and I have jointly designed a set of hash algorithms such that given a key, the record it associated with can be read in with one disk access. The hash function maintains 1-to-1 mapping from key to record. So there is no collision and the exact record address can be computed from the key. We call these functions minimal perfect hash functions. The material is covered in the first issue of CACM of this year. You can contact Dr. Edward A. Fox Dept. of Computer Science Va Tech Blacksburg, Va 24061 email: fox@fox.cs.vt.edu ph: 703-231 5113 for more information on those algorithms. These algorithms should be very useful to speed your database access up. Also in Hal, we are using object-oriented database tech. to store SGML data. It is our experience that the data should be carefully re-origanized to allow good performance.