NetNews Usenet Archive 1992 #31

Text File | 1993-01-02 | 2.9 KB | 63 lines

Newsgroups: comp.databases
Path: sparky!uunet!noc.near.net!mv!jlc!john
From: john@jlc.mv.com (John Leslie)
Subject: Content-Based Retrieval
Message-ID: <1993Jan02.163830.8973@jlc.mv.com>
Organization: John Leslie Consulting, Milford NH
References: <18774@mindlink.bc.ca>
 <1993Jan1.021012.24215@news.arc.nasa.gov>
 <1993Jan1.182624.2993@uunet!tellab5!odgate>
Date: Sat, 02 Jan 1993 16:38:30 GMT
Lines: 52

mike@uunet!tellab5!odgate (Mike J. Kelly) writes:
>
> As far as I know, no one has yet integrated content-based retrieval (CBR)
> with BLOBs and large text fields, which is what you really want...
>
> It might be interesting to the vendors who are reading this group if we
> were to start a thread on what a CBR-BLOB capability would look like.

I second the motion.

Of course, right at the start we have a problem. CBR-BLOB is an
oxymoron. A Binary Large OBject does not *have* contents interpretable
by the database engine -- interpretation, by definition, is up to the
application code.

Nonetheless, it is clear that people *are* going to store ASCII text
in BLOBs. There are straightforward things we can do with arbitrary
ASCII text; and we could design the engine to degrade gracefully if
it's asked to do these to non-ASCII BLOBs.

> The way I conceive it, at its simplest, it would involve a new type of index
> ("create text index"?) which would set up a varchar or BLOB column to be
> retrieved via CBR operators, along with some new CBR operators (at least
> contains, and probably more sophisticated stuff than that.)

Text-indexing a BLOB should create a (variable-length) index capable
of being searched more efficiently to yield a large subset of the
answers to possible inquiries under the CBR expression parser. Whether
the BLOB itself needs to be accessed to determine the final retrieval
list seems to me an implementation detail.

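To make the index-then-verify idea concrete, here is a minimal sketch in
Python. Everything in it is an assumption for illustration, not any
vendor's design: the names TextIndex, tokens, candidates and contains
are hypothetical, "contains" is taken to mean a whole-word phrase match,
and the index is a plain in-memory word-to-rows map. The index alone
narrows the search to candidate rows; this particular sketch then goes
back to the BLOB to settle the final list, which is only one of the
possible implementation choices. Non-ASCII BLOBs are simply left out of
the index, as one way of degrading gracefully.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")


def words_of(text: str) -> list[str]:
    """Lowercased alphanumeric words of a string, in order."""
    return WORD.findall(text.lower())


def tokens(blob: bytes) -> list[str]:
    """Words of an ASCII BLOB; an empty list for anything else."""
    try:
        return words_of(blob.decode("ascii"))
    except UnicodeDecodeError:
        return []  # assumption: non-ASCII BLOBs are silently not indexed


class TextIndex:
    """Hypothetical stand-in for what "create text index" might build."""

    def __init__(self) -> None:
        self.postings = defaultdict(set)  # word -> set of row ids
        self.rows = {}                    # row id -> original BLOB

    def add(self, row_id: int, blob: bytes) -> None:
        self.rows[row_id] = blob
        for word in tokens(blob):
            self.postings[word].add(row_id)

    def candidates(self, phrase: str) -> set[int]:
        """Rows holding every word of the phrase, found from the index alone."""
        words = words_of(phrase)
        if not words:
            return set()
        return set.intersection(*(self.postings.get(w, set()) for w in words))

    def contains(self, phrase: str) -> list[int]:
        """Re-read each candidate BLOB to check the words are adjacent."""
        words = words_of(phrase)
        n = len(words)
        hits = []
        for row_id in self.candidates(phrase):
            toks = tokens(self.rows[row_id])
            if any(toks[i:i + n] == words for i in range(len(toks) - n + 1)):
                hits.append(row_id)
        return sorted(hits)


idx = TextIndex()
idx.add(1, b"Content-based retrieval of large text fields")
idx.add(2, b"Retrieval based on content, sometimes")
idx.add(3, b"\xff\xd8\xff\xe0 binary image data")  # not ASCII, so not indexed
print(idx.candidates("content based"))  # rows 1 and 2 survive the index pass
print(idx.contains("content based"))    # only row 1 has the words adjacent
```

An engine that stored word positions in the index could skip the second
pass entirely, at the cost of a larger index.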
Open Question: In a client-server environment, is it acceptable for
the server to pass the *entire* CBR index to a client in order to
perform the CBR operation?

The generally accepted operators are "contains" / "does not contain"
and "starts with" / "ends with". I would think we should add
multiple-operand cases for "and" / "or" / "precedes" and "follows".

For long text, usually there is a "within" <N> "words" / "lines" (etc.)
clause for multiple-operand searches.

Open Question: What is a reasonable limit to number of operands and
nesting depth?

In practice, other fields are created to save disambiguating
information which is not easily generated from the text. At user-level,
we will need expressions which cover multiple fields of the same record.

Food for Thought: As text gets larger, formatting becomes more and
more important. Can we remotely hope for a definition of "rich text"
which clearly-enough distinguishes formatting information so that the
database engine can ignore it in generating the CBR index?

John Leslie <john@jlc.mv.com>
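
Appended as an illustration of the "precedes" / "within" <N> "words"
operators mentioned above: a minimal Python sketch, assuming that
distance is counted in whole alphanumeric words and that "precedes" is
just the ordered form of "within". The function names word_positions
and within are hypothetical, and a real engine would presumably read
positions from its index rather than re-scan the text.

```python
import re


def word_positions(text: str) -> dict[str, list[int]]:
    """Map each lowercased word to the list of its word offsets in the text."""
    positions: dict[str, list[int]] = {}
    for offset, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
        positions.setdefault(word, []).append(offset)
    return positions


def within(text: str, a: str, b: str, n: int, ordered: bool = False) -> bool:
    """True if words a and b occur at most n words apart.

    With ordered=True, a must also come before b ("a precedes b").
    """
    pos = word_positions(text)
    for i in pos.get(a.lower(), []):
        for j in pos.get(b.lower(), []):
            distance = j - i
            if ordered and 0 < distance <= n:
                return True
            if not ordered and 0 < abs(distance) <= n:
                return True
    return False


text = "The server passes the index to the client"
print(within(text, "server", "index", 3))                # True: 3 words apart
print(within(text, "index", "server", 3, ordered=True))  # False: wrong order
print(within(text, "server", "client", 3))               # False: 6 words apart
```

"follows" is the same call with the two operands swapped, and a
"within" <N> "lines" variant would only need line numbers in place of
word offsets.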