NetNews Usenet Archive 1992 #16

Text File | 1992-07-23 | 3.8 KB | 102 lines

Path: sparky!uunet!cis.ohio-state.edu!ucbvax!RALVM13.VNET.IBM.COM!DRMACRO
From: DRMACRO@RALVM13.VNET.IBM.COM ("Dr. "Eliot Kimber" Macro")
Newsgroups: comp.text.sgml
Subject: Identifying Overlaps
Message-ID: <9207231518.AA22593@ucbvax.Berkeley.EDU>
Date: 23 Jul 92 14:37:31 GMT
Sender: daemon@ucbvax.BERKELEY.EDU
Lines: 92

Darrell Raymond writes:

>  Once again, consider the markup of a stream with overlapping elements
>(three elements this time).
>
>  Here is the text:
>
>      now is the time for all good
>
>  And the contents of each element (and of their overlaps) is defined as
>follows (where the operator "^" means intersection):
>
>      X:  now is the for
>      Y:  is the for all
>      Z:  the for all good     <-- Typo?  Should Z contain 'the'?
>    X^Y:  is for
>    X^Z:  the for
>    Y^Z:  for all
>  X^Y^Z:  for
>
>  Note that the elements are not contiguous subsequences of text.  If
>you draw a Venn diagram with three overlapping circles and put the words
>in the various regions as the element membership requires, it might be
>easier to visualize.  Never mind that it seems bizarre.
>
>  Now, the markup problem I pose is the following: insert tags in the
>text, corresponding to X, Y, and Z, that demarcate the various regions
>in such a way that a "reasonable" person or program can identify all
>the elements and overlaps, and would not infer non-existent relationships.
>
>-Darrell.
>

First, here's a visual representation of the overlap, to make sure I've
got it right (and that X contains 'for'):

  X  now is the for
  Y      is the for all
  Z             for all good

(Assuming Z does not contain 'the')

The obvious solution is Michael's use of non-containing marker elements,
which the parser cannot validate but that an application can:

 <x start>now <y start>is the <z start>for<x end> all<y end> good<z end>

If the markup didn't all have to be embedded in the string itself I could
use HyTime location functions to define each of the spans (or DSSSL
location function), or define my own markup that did so.
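
The application-side check mentioned for the marker elements can be sketched in a few lines of Python. This is only an illustration, not part of any SGML toolkit: the names parse_markers and overlap are made up here, the final marker for Z is taken to read <z end>, and each element is assumed to be opened and closed exactly once.

```python
import re

# Marker-tagged text from the post; the last marker is assumed to be <z end>.
MARKED = "<x start>now <y start>is the <z start>for<x end> all<y end> good<z end>"

def parse_markers(marked):
    """Strip <name start>/<name end> markers and record each element's span.

    Returns (plain_text, {name: (start, end)}) with offsets into plain_text.
    Hypothetical helper: assumes every element is opened and closed once.
    """
    pieces = []      # text chunks with the markers removed
    pos = 0          # current offset into the marker-free text
    opened = {}      # name -> offset where its start marker was seen
    spans = {}       # name -> (start, end)
    for m in re.finditer(r"<(\w+) (start|end)>|([^<]+)", marked):
        name, kind, text = m.groups()
        if text is not None:
            pieces.append(text)
            pos += len(text)
        elif kind == "start":
            opened[name] = pos
        else:
            spans[name] = (opened.pop(name), pos)
    return "".join(pieces), spans

def overlap(text, spans, *names):
    """Return the text shared by all named elements, or '' if they are disjoint."""
    lo = max(spans[n][0] for n in names)
    hi = min(spans[n][1] for n in names)
    return text[lo:hi] if lo < hi else ""

text, spans = parse_markers(MARKED)
print(spans)                               # {'x': (0, 14), 'y': (4, 18), 'z': (11, 23)}
print(overlap(text, spans, "x", "y"))      # is the for
print(overlap(text, spans, "x", "y", "z")) # for
```

Because the markers only delimit contiguous runs, this recovers the spans as drawn in the visual representation, not the non-contiguous memberships listed in the quoted table.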
I could also define indirect methods where I identify each different part
and then combine them.  I could also define my processor such that it does
string comparisons, with something like this:

 <text id=x>now is the for</>
 <text id=y>is the for all</>
 <text id=z>for all good</>

 <intersect refids='x y'>X intersect Y
 <intersect refids='x z'>X intersect Z
 <intersect refids='y z'>Y intersect Z
 <intersect refids='x y z'>X intersect Y intersect Z

Where the processor would compare the strings referenced left to right
to find matches and report the result.

I could also define a "delta-chain" structure, like so:

 <text id=base>now is the for</>
 <text id=x refid=base>
 <text id=y refid=x><deletion start=1 len=4><text>all</text></text>
 <text id=z refid=y><deletion start=1 len=7><text>good</text></text>

Here, the DELETION element is not procedural but merely identifies a
deletion using a location method.  The processing of TEXT is defined
such that contained text is appended to referenced text.  Without writing
the program to check, I think there's enough information above to let
you determine the intersections of any of the defined text strings, as
the DELETION information gives enough information to derive the start
and end of each string relative to others in the chain.

I think the first solution meets the letter of Darrell's new challenge,
if not the spirit.  The other solutions demonstrate other approaches to
the problem of identifying relationships among data elements that are
not strictly hierarchical.

Eliot Kimber                          Internet: drmacro@ralvm13.vnet.ibm.com
Dept E14/B500                         IBMMAIL:  USIB2DK9@IBMMAIL
Network Programs Information Development        Phone: 1-919-543-7091
IBM Corporation
Research Triangle Park, NC 27709
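
The delta-chain claim, that the DELETION data is enough to derive each string's start and end relative to the others, can be checked with a rough Python sketch. Several things here are assumptions rather than anything the markup itself defines: the chain is hand-transcribed into (id, deleted length, appended text) tuples instead of being parsed from SGML, every deletion is taken to begin at start=1, each text refers to the one just before it, and appended text is joined with a single space. The names resolve_chain and intersect are made up for the illustration.

```python
# Hand-transcribed form of the delta chain (hypothetical encoding):
# (id, characters deleted from the front of the referenced text, appended text)
BASE = "now is the for"
CHAIN = [
    ("x", 0, ""),       # <text id=x refid=base>
    ("y", 4, "all"),    # <deletion start=1 len=4> then append "all"
    ("z", 7, "good"),   # <deletion start=1 len=7> then append "good"
]

def resolve_chain(base_text, chain):
    """Place every text in the chain on one shared string.

    Returns (full_text, {id: (start, end)}). Assumes a linear chain in which
    each entry deletes only a prefix of its predecessor, and assumes appended
    text is separated from what precedes it by a single space.
    """
    full = base_text
    start = 0
    spans = {}
    for text_id, deleted, appended in chain:
        start += deleted              # deleting a prefix moves the start right
        if appended:
            full += " " + appended    # appending moves the end right
        spans[text_id] = (start, len(full))
    return full, spans

def intersect(full, spans, *ids):
    """Return the text common to all the named strings, or '' if none."""
    lo = max(spans[i][0] for i in ids)
    hi = min(spans[i][1] for i in ids)
    return full[lo:hi] if lo < hi else ""

full, spans = resolve_chain(BASE, CHAIN)
for ids in (("x", "y"), ("x", "z"), ("y", "z"), ("x", "y", "z")):
    print(" ^ ".join(ids), "=", repr(intersect(full, spans, *ids)))
# x ^ y = 'is the for'
# x ^ z = 'for'
# y ^ z = 'for all'
# x ^ y ^ z = 'for'
```

Under those assumptions the accumulated deletion lengths give each string's start, the accumulated appended text gives its end, and the intersections fall out as plain range overlaps.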