home *** CD-ROM | disk | FTP | other *** search
- Newsgroups: comp.databases
- From: gtoal@pizzabox.demon.co.uk (Graham Toal)
- Path: sparky!uunet!pipex!demon!pizzabox.demon.co.uk!gtoal
- Subject: SUMMARY: Re: free text indexing algorithm refs wanted...
- Distribution: world
- Organization: Cuddlehogs Anonymous
- Lines: 89
- Date: Thu, 3 Sep 1992 01:18:18 +0000
- Message-ID: <715483098snx@pizzabox.demon.co.uk>
- Sender: usenet@gate.demon.co.uk
-
- I wrote:
-
- >Having implemented a database using a really cool set of algorithms
- >and data structures, which let us do just about anything, we find it's
- >depressingly slow compared to some commercial systems. We've have
- >worked out what they must be doing by treating them as a black box
- >and throwing queries at them, and working out the complexity of the
- >search algorithms they must be using. Any pointers to papers or
- >articles on different data structures for free text systems?
-
-
- Thanks to all of you who answered. I'll follow up the references
- suggested, and write some more on what we're doing later. A quick
- comment though is that the time for turning a key into a pointer
- to a list of occurances of that word *isn't* the bottleneck; the
- problems we have are more to do with efficient operations on these
- lists - including AND and OR, but also more powerful ones. We have
- hierarchical SGML structured documents, so simple solutions like
- separate indices for each field don't work.
-
- Our structure simply has a 32-bit pointer to the word; other people
- may include various bits of info too. (Liam Quinn - we chatted about
- this a long time ago when we first started this project. Your scheme
- is the sort of thing I was looking for info on, though I'm particularly
- wondering about real live systems that perform well on CDs at slow CD
- speeds - there's probably only commercial ones like this...)
-
- My apologies to the people who mailed me for not replying individually;
- I've fire-fighting at work at the moment and can't keep up. I've filed
- all your mail and will reply personally as soon as things lighten up.
-
- [Ed Fox - I won't be in Cambridge, but will you be passing through
- London at all?]
-
- ========
-
- From: "Sanjiv K. Bhatia" <sanjiv@redbird.umsl.edu>
-
- Have you checked the standard books on Information Retrieval like the classic
- text by van Rijsbergen and another book by Salton. I can send you the complete
- reference if you desire.
-
- It may also help to check the implemented schemes in the SMART retrieval
- environment available from Cornell University for a small fee. And there is
- the overly stupid indxbib in Unix but it comes for free.
-
- ========
-
- From: David Grossman <grossman@eola.cs.ucf.edu>
-
- Do you have a copy of Mark Zimmerman's free text browser. Its
- shareware and comes with source code. I can mail it to you if
- you'd like, he's got some comments that might help. As for IR
- design there are probably some articles that might interest you
- in any of the last couple SIGIR (Special interest group on
- information retrieval).
-
- ========
-
- From: QiFan Chen <chenq@hal.com>
-
- I noticed that you have posted a message "free text indexing algorithm
- refs wanted..." on the comp.databases. It is interesting
- as my colleagues and I also work on SGML database related issues
- and like to share expereience with each other on this subject.
-
- Back in school, my professors and I have jointly designed a set
- of hash algorithms such that given a key, the record it
- associated with can be read in with one disk access. The
- hash function maintains 1-to-1 mapping from key to record.
- So there is no collision and the exact record address can be
- computed from the key. We call these functions minimal
- perfect hash functions. The material is covered in the first
- issue of CACM of this year. You can contact
-
- Dr. Edward A. Fox
- Dept. of Computer Science
- Va Tech
- Blacksburg, Va 24061
-
- email: fox@fox.cs.vt.edu
- ph: 703-231 5113
-
- for more information on those algorithms. These algorithms
- should be very useful to speed your database access up.
-
- Also in Hal, we are using object-oriented database tech. to
- store SGML data. It is our experience that the data should be
- carefully re-origanized to allow good performance.
-