- Path: sparky!uunet!cis.ohio-state.edu!ucbvax!RALVM13.VNET.IBM.COM!DRMACRO
- From: DRMACRO@RALVM13.VNET.IBM.COM ("Dr. "Eliot Kimber" Macro")
- Newsgroups: comp.text.sgml
- Subject: Identifying Overlaps
- Message-ID: <9207231518.AA22593@ucbvax.Berkeley.EDU>
- Date: 23 Jul 92 14:37:31 GMT
- Sender: daemon@ucbvax.BERKELEY.EDU
- Lines: 92
-
- Darrell Raymond writes:
-
- > Once again, consider the markup of a stream with overlapping elements
- >(three elements this time).
- >
- > Here is the text:
- >
- > now is the time for all good
- >
- > And the contents of each element (and of their overlaps) is defined as
- >follows (where the operator "^" means intersection):
- >
- > X: now is the for
- > Y: is the for all
- > Z: the for all good <-- Typo? Should Z contain 'the'?
- > X^Y: is for
- > X^Z: the for
- > Y^Z: for all
- > X^Y^Z: for
- >
- > Note that the elements are not contiguous subsequences of text. If
- >you draw a Venn diagram with three overlapping circles and put the words
- >in the various regions as the element membership requires, it might be
- >easier to visualize. Never mind that it seems bizarre.
- >
- > Now, the markup problem I pose is the following: insert tags in the
- >text, corresponding to X, Y, and Z, that demarcate the various regions
- >in such a way that a "reasonable" person or program can identify all
- >the elements and overlaps, and would not infer non-existent relationships.
- >
- >-Darrell.
- >
-
- First, here's a visual representation of the overlap, to make
- sure I've got it right:
-
- and X contains 'for'?
- X now is the for
- Y     is the for all
- Z            for all good   (Assuming Z does not contain 'the')
-
- The obvious solution is Michael's use of non-containing marker
- elements, which the parser cannot validate but that an
- application can:
-
- <x start>now <y start>is the <z start>for<x end> all<y end> good<z end>
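As a rough illustration of what "an application can" validate here, the following Python sketch scans a marked-up string, tracks which marker pairs are open, and records which words fall inside each span. The function names (`spans_from_markers`, `overlap`) and the regular expression are my own assumptions for illustration, not part of any SGML tool, and the sample string closes Z with an end marker.

```python
import re

# Hypothetical marker syntax: <name start> and <name end>, as in the example above.
MARKER = re.compile(r"<(\w+) (start|end)>")

def spans_from_markers(marked):
    """Return (words, members): the word list and, per span name, the set of word indices."""
    words, members, open_names = [], {}, set()
    pos = 0
    for m in MARKER.finditer(marked):
        # Words between the previous marker and this one belong to every open span.
        for word in marked[pos:m.start()].split():
            words.append(word)
            for name in open_names:
                members[name].add(len(words) - 1)
        name, kind = m.group(1), m.group(2)
        if kind == "start":
            open_names.add(name)
            members.setdefault(name, set())
        else:
            open_names.discard(name)
        pos = m.end()
    words.extend(marked[pos:].split())  # text after the last marker, in no span
    return words, members

def overlap(words, members, *names):
    """Words lying inside all of the named spans, in document order."""
    common = set.intersection(*(members[n] for n in names))
    return [words[i] for i in sorted(common)]

marked = "<x start>now <y start>is the <z start>for<x end> all<y end> good<z end>"
words, members = spans_from_markers(marked)
for combo in (("x", "y"), ("x", "z"), ("y", "z"), ("x", "y", "z")):
    print("^".join(combo), "=", " ".join(overlap(words, members, *combo)))
```

Because membership is tracked by word position rather than by string comparison, repeated words in the text would not confuse this approach.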
-
- If the markup didn't all have to be embedded in the string itself,
- I could use HyTime location functions to define each of the spans
- (or DSSSL location functions), or define my own markup that
- did so. I could also define indirect methods where I identify
- each different part and then combine them. I could also define
- my processor such that it does string comparisons, with something
- like this:
-
- <text id=x>now is the for</>
- <text id=y>is the for all</>
- <text id=z>for all good</>
- <intersect refids='x y'>X intersect Y
- <intersect refids='x z'>X intersect Z
- <intersect refids='y z'>Y intersect Z
- <intersect refids='x y z'>X intersect Y intersect Z
-
- Where the processor would compare the strings referenced
- left to right to find matches and report the result.
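A minimal sketch of such a string-comparing processor, in Python, might look like the following. It assumes the comparison is word by word and that no word repeats within a string; the `intersect` function and the `texts` dictionary are hypothetical stand-ins for the INTERSECT element and the TEXT elements it references.

```python
def intersect(texts, refids):
    """Words common to all referenced strings, in the order of the first one.

    Assumption: words are compared as whole tokens and are unique per string.
    """
    first, *rest = (texts[r].split() for r in refids)
    return [w for w in first if all(w in other for other in rest)]

# Contents of the TEXT elements, keyed by their id attribute.
texts = {
    "x": "now is the for",
    "y": "is the for all",
    "z": "for all good",
}

for refids in (["x", "y"], ["x", "z"], ["y", "z"], ["x", "y", "z"]):
    print(" ".join(refids), "->", " ".join(intersect(texts, refids)))
```

A real processor would need a sturdier notion of "match" (positions rather than word identity) once the text contains repeated words.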
-
- I could also define a "delta-chain" structure, like so:
-
- <text id=base>now is the for</>
- <text id=x refid=base><!-- No change as x=base -->
- <text id=y refid=x><deletion start=1 len=4><text>all</text></text>
- <text id=z refid=y><deletion start=1 len=7><text>good</text></text>
-
- Here, the DELETION element is not procedural but merely identifies
- a deletion using a location method. The processing of TEXT
- is defined such that contained text is appended to referenced
- text. Without writing the program to check, I think there's enough
- information above to let you determine the intersections
- of any of the defined text strings, as the DELETION information gives
- enough information to derive the start and end of each string
- relative to others in the chain.
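To check that intuition, here is a hedged Python sketch of resolving the chain. It rests on several assumptions of mine that the markup above does not state: every DELETION starts at position 1 (so it removes a prefix), `len` counts characters including the trailing space, and appended text is joined with a single space. The tuple layout and the names `resolve` and `common` are purely illustrative.

```python
def resolve(chain):
    """Resolve a delta chain.

    chain: list of (id, refid, deleted_prefix_len, appended_text), in chain order.
    Returns {id: (start, end, text)}, with offsets relative to the base string.
    """
    out = {}
    for ident, refid, cut, added in chain:
        if refid is None:
            start, text = 0, added
        else:
            ref_start, _, ref_text = out[refid]
            text = ref_text[cut:]          # apply the prefix deletion
            if added:
                text = text + " " + added  # contained text is appended
            start = ref_start + cut
        out[ident] = (start, start + len(text), text)
    return out

def common(resolved, *ids):
    """Text shared by all named strings, found by intersecting their offset ranges."""
    lo = max(resolved[i][0] for i in ids)
    hi = min(resolved[i][1] for i in ids)
    if lo >= hi:
        return ""
    start, _, text = resolved[ids[0]]
    return text[lo - start:hi - start].strip()

chain = [
    ("base", None, 0, "now is the for"),
    ("x", "base", 0, ""),    # no change, x = base
    ("y", "x", 4, "all"),    # deletion start=1 len=4, then append "all"
    ("z", "y", 7, "good"),   # deletion start=1 len=7, then append "good"
]
resolved = resolve(chain)
for combo in (("x", "y"), ("x", "z"), ("y", "z"), ("x", "y", "z")):
    print("^".join(combo), "=", common(resolved, *combo))
```

Under those assumptions the offsets alone are enough to derive every intersection, without comparing any strings.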
-
- I think the first solution meets the letter of Darrell's new
- challenge, if not the spirit. The other solutions demonstrate
- other approaches to the problem of identifying relationships
- among data elements that are not strictly hierarchical.
-
- Eliot Kimber Internet: drmacro@ralvm13.vnet.ibm.com
- Dept E14/B500 IBMMAIL: USIB2DK9@IBMMAIL
- Network Programs Information Development Phone: 1-919-543-7091
- IBM Corporation
- Research Triangle Park, NC 27709
-