- Newsgroups: comp.databases
- Path: sparky!uunet!noc.near.net!mv!jlc!john
- From: john@jlc.mv.com (John Leslie)
- Subject: Content-Based Retrieval
- Message-ID: <1993Jan02.163830.8973@jlc.mv.com>
- Organization: John Leslie Consulting, Milford NH
- References: <18774@mindlink.bc.ca> <1993Jan1.021012.24215@news.arc.nasa.gov> <1993Jan1.182624.2993@uunet!tellab5!odgate>
- Date: Sat, 02 Jan 1993 16:38:30 GMT
- Lines: 52
-
- mike@uunet!tellab5!odgate (Mike J. Kelly) writes:
- >
- > As far as I know, no one has yet integrated content-based retrieval (CBR)
- > with BLOBs and large text fields, which is what you really want...
- >
- > It might be interesting to the vendors who are reading this group if we
- > were to start a thread on what a CBR-BLOB capability would look like.
-
- I second the motion. Of course, right at the start we have a problem.
- CBR-BLOB is an oxymoron. A Binary Large OBject does not *have* contents
- interpretable by the database engine -- interpretation, by definition, is
- up to the application code.
-
- Nonetheless, it is clear that people *are* going to store ASCII text
- in BLOBs. There are straightforward things we can do with arbitrary
- ASCII text; and we could design the engine to degrade gracefully if it's
- asked to do these to non-ASCII BLOBs.
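
One way to picture "degrade gracefully" is a tokenizer that simply declines to index anything that is not plain ASCII text. This is only a sketch: the name `tokenize_blob`, the choice of printable bytes plus tab/LF/CR as the definition of "ASCII text", and the word-splitting rule are all assumptions made for illustration, not anything a vendor has specified.

```python
import re

# Assumption: "ASCII text" means printable 7-bit characters plus tab, LF and CR.
_TEXT_BYTES = frozenset(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}

def tokenize_blob(blob: bytes):
    """Return lowercase word tokens for an ASCII BLOB, or None otherwise."""
    if not all(b in _TEXT_BYTES for b in blob):
        # Degrade gracefully: a non-ASCII BLOB is just left out of the index.
        return None
    return re.findall(r"[a-z0-9]+", blob.decode("ascii").lower())
```

A BLOB that returns None would still be stored and retrievable by key; it would merely never match a CBR operator.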
-
- > The way I conceive it, at its simplest, it would involve a new type of index
- > ("create text index"?) which would set up a varchar or BLOB column to be
- > retrieved via CBR operators, along with some new CBR operators (at least
- > contains, and probably more sophisticated stuff than that.)
-
- Text-indexing a BLOB should create a (variable-length) index capable
- of being searched more efficiently to yield a large subset of the answers
- to possible inquiries under the CBR expression parser. Whether the BLOB
- itself needs to be accessed to determine the final retrieval list seems
- to me an implementation detail.
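
A rough sketch of that split between index and BLOB, assuming a simple inverted index (word to row ids) that produces candidate rows, followed by a check against the stored text to settle the final list. The class `TextIndex`, its method names, and the whole-word phrase semantics are hypothetical choices for the example, not a proposal for the actual engine design.

```python
import re
from collections import defaultdict

def words(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

class TextIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # word -> set of row ids
        self.rows = {}                    # row id -> stored text (stands in for the BLOB)

    def add(self, row_id, text):
        self.rows[row_id] = text
        for w in words(text):
            self.postings[w].add(row_id)

    def candidates(self, phrase):
        """Rows containing every word of the phrase, using the index only."""
        ws = words(phrase)
        if not ws:
            return set(self.rows)
        return set.intersection(*(self.postings.get(w, set()) for w in ws))

    def contains(self, phrase):
        """Final answer: re-read each candidate row to check word order."""
        ws = words(phrase)
        n = len(ws)
        hits = []
        for row_id in sorted(self.candidates(phrase)):
            toks = words(self.rows[row_id])
            if any(toks[i:i + n] == ws for i in range(len(toks) - n + 1)):
                hits.append(row_id)
        return hits
```

Storing word positions in the postings would let `contains` finish without touching the BLOB at all, which is the sense in which the BLOB access is an implementation detail.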
-
- Open Question: In a client-server environment, is it acceptable for
- the server to pass the *entire* CBR index to a client in order to perform
- the CBR operation?
-
- The generally accepted operators are "contains" / "does not contain"
- and "starts with" / "ends with". I think we should add multiple-operand
- cases for "and" / "or" and for "precedes" / "follows". For long text,
- there is usually a "within" <N> "words" / "lines" (etc.) clause for
- multiple-operand searches.
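
To make the operator list concrete, here is a small sketch of how these might behave on a tokenized document. The function names, the word-level (rather than substring) matching, and the `within` parameter counting distance in words are all assumptions for illustration.

```python
import re

def tokens(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def positions(toks, word):
    return [i for i, t in enumerate(toks) if t == word.lower()]

def contains(toks, word):
    return word.lower() in toks

def starts_with(toks, word):
    return bool(toks) and toks[0] == word.lower()

def ends_with(toks, word):
    return bool(toks) and toks[-1] == word.lower()

def precedes(toks, a, b, within=None):
    """True if some `a` occurs before some `b`, at most `within` words apart."""
    return any(
        j > i and (within is None or j - i <= within)
        for i in positions(toks, a)
        for j in positions(toks, b)
    )

def follows(toks, a, b, within=None):
    return precedes(toks, b, a, within)
```

"Does not contain" is then just the negation of `contains`, and "and" / "or" are ordinary boolean combinations of these results.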
-
- Open Question: What is a reasonable limit on the number of operands and
- on nesting depth?
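
Whatever the limits turn out to be, checking them is cheap if the parser produces a tree. The sketch below assumes expressions are nested tuples such as ("and", expr, expr, ...) with leaf operators at the bottom; that representation and the default limits of 16 operands and depth 4 are placeholders, not suggested answers to the question.

```python
def measure(expr):
    """Return (operand_count, depth) for a nested-tuple CBR expression."""
    op, *args = expr
    if op in ("and", "or"):
        sizes = [measure(a) for a in args]
        count = sum(n for n, _ in sizes)
        depth = 1 + max((d for _, d in sizes), default=0)
        return count, depth
    return 1, 1  # a leaf such as ("contains", "blob") is one operand

def within_limits(expr, max_operands=16, max_depth=4):
    # Hypothetical defaults; the real limits are the open question.
    count, depth = measure(expr)
    return count <= max_operands and depth <= max_depth
```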
-
- In practice, other fields are created to hold disambiguating information
- that cannot easily be generated from the text. At the user level, we will
- need expressions that cover multiple fields of the same record.
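
A sketch of what a multi-field expression could look like, assuming each record is a mapping from field name to value and each condition is a predicate over the whole record. The helper names (`field_equals`, `field_contains`, `all_of`, `any_of`) and the sample field names are invented for the example.

```python
def field_equals(field, value):
    """Ordinary comparison on a conventional (non-text) field."""
    return lambda rec: rec.get(field) == value

def field_contains(field, word):
    """Word-level 'contains' on a text field."""
    return lambda rec: word.lower() in str(rec.get(field, "")).lower().split()

def all_of(*preds):
    return lambda rec: all(p(rec) for p in preds)

def any_of(*preds):
    return lambda rec: any(p(rec) for p in preds)

records = [
    {"id": 1, "author": "Leslie", "body": "Text index on a BLOB"},
    {"id": 2, "author": "Kelly", "body": "create text index"},
    {"id": 3, "author": "Leslie", "body": "Rich text formatting"},
]

# One expression spanning two fields of the same record.
query = all_of(field_equals("author", "Leslie"), field_contains("body", "index"))
matches = [r["id"] for r in records if query(r)]
```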
-
- Food for Thought: As text gets larger, formatting becomes more and
- more important. Can we remotely hope for a definition of "rich text"
- that distinguishes formatting information clearly enough for the
- database engine to ignore it when generating the CBR index?
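
If such a definition existed, the indexing side would be simple. The sketch below assumes, purely for illustration, a rich-text convention in which all formatting lives in angle-bracket tags; no such standard is implied, and `strip_formatting` and `index_words` are hypothetical names.

```python
import re

# Assumption: formatting is confined to <...> tags and never nests a '>' inside.
_TAG = re.compile(r"<[^>]*>")

def strip_formatting(rich_text):
    """Remove formatting tags, leaving only the text to be indexed."""
    return _TAG.sub("", rich_text)

def index_words(rich_text):
    """Words the CBR index would see once formatting is ignored."""
    return re.findall(r"[a-z0-9]+", strip_formatting(rich_text).lower())
```

The hard part is the one raised above: without an agreed way to tell formatting from content, the engine cannot know what to strip.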
-
- John Leslie <john@jlc.mv.com>
-