CD Actual Thematic 7: Programming

CD Actual Thematic 7: Programming / CDAT7.iso / Share / Editores / Perl5 / perl / lib / site / Tk / Parse.pm

Perl POD Document | 1997-08-10 | 22.8 KB | 876 lines

#!/usr/bin/perl

package Tk::Parse;

require Exporter;

@ISA=qw(Exporter);
@EXPORT=qw(Parse Simplify hide start_hide unhide Normalize Normalize2 %Escapes
    $VERBATIM $HEADING $ITEM $INDEX $TEXT $PRAGMA $INDENT
    );

# Different types of text:

# 0. Name of file
# 1. Verbatim paragraphs
# 2. Headings
# 3. Items
# 4. Index mark
# 5. Comment block
# 6. Formatted paragraphs
# 7. Pragmas
# 8. Indented sections (which can contain 1-9)
# 9. Cut (conveys no information, but may be useful in interparagraph spacing)

# 0 = [0,0,0,0,"filename"]
# 1 = [1,line,pos,0,"verbatim paragraph"]
# 2 = [2,line,pos,level,"heading"]
# 3 = [3,line,pos,0,item]
# 4 = [4,line,pos,0,"indexing"]
# #5 = [5,line,pos,0,"comment"]
# 6 = [6,line,pos,0,"paragraph"]
# 7 = [7,line,pos,0,"pragma"]
# 8 = [8,line,pos,indentation,type...] #type 1 = "*", type 2 = "1.,2." 3=else
# 9 = [9,line,pos,0,"cut"]

=head1 NAME

Pod::Parse - Parse perl's pod files.

=head1 SYNOPSIS

B<THIS TK SNAPSHOT SHOULD BE REPLACED BY A CPAN MODULE>

=head1 DESCRIPTION

A module designed to simplify the job of parsing and formatting ``pods'',
the documentation format used by perl5. This consists of several different
functions to present and modify predigested pod files.

=head1 GUESSES

This is a work in progress, so I may have some stuff wrong, perhaps badly.
Some of my more reaching guesses:

=over 4

=item *

An =index paragraph should be split into lines, and each line placed inside
an `X' formatting command which is then prepended to the next paragraph,
like this:

    =index foo
    foo2
    foo3
    foo2!subfoo

    Foo!

Will become:

    X<foo>X<foo2>X<foo3>X<foo2!subfoo>Foo!

=item *

A related change: that an `X' command is to be used for indexing data. This
implies that all formatters need to at least ignore the `X' command.

=item *

Inside an =command, no special significance is to be placed on the first
line of the argument. Thus the following two commands should be parsed
identically:

    =item 1. ABC

    =item 1.
    ABC

Note that neither of these is identical to this:

    =item 1.

    ABC

which puts the "ABC" in a separate paragraph.

=item *

I actually violate this rule twice: in parsing =index commands, and in
passing through the =pragma commands. I hope this makes sense.

=item *

I added the =comment command, which simply ignores the next paragraph.

=item *

I also added =pragma, which also ignores the next paragraph, but this time
it gives the formatter a chance at doing something sinister with it.

=back

=head1 POD CONVENTIONS

This module has two goals: first, to simplify the usage of the pod format,
and second, to codify the pod format. While perlpod contains some
information, it hardly gives the entire story.

Here I present "the rules", or at least the rules as far as I've managed to
work them out.

=over 4

=item Paragraphs: The basic element

The fundamental "atom" of a pod file is the paragraph, where a paragraph is
defined as the text up to the next completely blank line ("\n\n"). Any pod
parser will read in paragraphs sequentially, deciding what to do with each
based solely on the current state and on the text at the _beginning_ of the
paragraph.

=item Commands: The method of communication

A paragraph that starts with the `=' symbol is assumed to be a special
command. All of the alphanumeric characters directly after the `=' are
assumed to be part of the name of the command, up to the first whitespace.
Anything past that whitespace is considered "the argument", and the argument
continues up till the end of the paragraph, regardless of newlines or other
whitespace.

=item Text: Commands that aren't Commands

A paragraph that doesn't start with `=' is treated as either of two types of
text. If it starts with a space or tab, it is considered a B<verbatim>
paragraph, which will be printed out... verbatim. No formatting changes
whatsoever may be done. (Actually, this isn't quite true, but I'll get back
to that at a later date.)

A paragraph that doesn't start with whitespace or `=' is assumed to consist
of formatted text that can be molded as the formatter sees fit. Reformatting
to fit margins, whatever, it's fair game. These paragraphs also can contain
a number of different formatting codes, which verbatim paragraphs can't.
These formatting codes are covered later.

=item =cut: The uncommand

There is one command that needs special mention: =cut. Anything after a
paragraph starting with =cut is simply ignored by the formatter. In
addition, any text B<before> a valid command is equally ignored. Any valid
`=' command will reenable formatting. This fact is used to great benefit by
Perl, which is glad to ignore anything between an `=' command and `=cut', so
you can embed a pod document right inside a perl program, and neither will
bother the other.

=item Reference to paragraph commands

=over 4

=item =cut

Ignore anything till the next paragraph starting with `='.

=item =head1

A top-level heading. Anything after the command (either on the same line or
on further lines) is included in the heading, up until the end of the
paragraph.

=item =head2

Secondary heading. Same as =head1, but different. No, there isn't a head3,
head4, etc.

=item =over [N]

Start a list. The C<N> is the number of characters to indent by. Not all
formatters will listen to this, though. A good number to use is 4.

While =over sounds like it should just be indentation, it's more complex
than that. It actually starts a nested environment, specifically for the use
of =item's. As this command recurses properly, you can use more than one;
you just have to make sure they are closed off properly by =back commands.

=item =back

Ends the last =over block. Resets the indentation to whatever it was
previously. Closes off the list of =item's.

=item =item

The point behind =over and =back. This command should only be used between
them.
The argument supplied should be consistent (within a list) with one of three
types: enumeration, itemization, or description. To exemplify:

An itemized list

    =over 4

    =item *

    A bulleted item

    =item *

    Another bulleted item

    =back

An enumerated list

    =over 4

    =item 1.

    First item.

    =item 2.

    Second item.

    =back

A described list

    =over 4

    =item Item #1

    First item

    =item Item #2 (which isn't really like #1, but is the second).

    Second item

    =back

If you aren't consistent about the arguments to =item, Pod::Parse will
complain.

=item =comment

Ignore this paragraph

=item =pragma

Ignore this paragraph, as well, unless you know what you are doing.

=item =index

Undecided at this time, but probably magic involving XZ<><>.

=back

=item Reference to formatting directives

=over 4

=item BZ<><...>

Format text inside the brackets as bold.

=item IZ<><...>

Format text inside the brackets as italics.

=item ZZ<><>

Replace with a zero-width character. You'll probably figure out some uses
for this.

=item And yet more that I haven't described yet...

=back

=back

=head1 USAGE

=head2 Parse

This function takes a list of files as an argument. If no argument is given,
it defaults to the contents of @ARGV. Parse then reads through each file and
returns the data as a list. Each element of this list will be a nested list
containing data from a paragraph of the pod file. Elements pertaining to
"=over" paragraphs will themselves contain the nested entries for all of the
paragraphs within that list. Thus, it's easier to parse the output of Parse
using a recursive parser. (Um, did that parse?)

It is I<highly> recommended that you use the output of Simplify, not Parse,
as it's simpler.

The output will consist of a list, where each element in the list matches
one of these prototypes:

=over 4

=item [0,0,0,0,$filename]

This is produced at the beginning of each file parsed, where $filename is
the name of that file.

=item [-1,0,0,0,$filename]

End of same.

=item [1,$line,$pos,0,$verbatim]

This is produced for each paragraph of verbatim text. $verbatim is the text,
$line is the line offset of the paragraph within the file, and $pos is the
byte offset. (In all of the following elements, $pos and $line have
identical meanings, so I'll skip explaining them each time.)

=item [2,$line,$pos,$level,$heading]

Produced by a =head1 or =head2 command. $level is either 1 or 2, and
$heading is the argument.

=item [3,$line,$pos,0,$item]

$item is the argument from an =item paragraph.

=item [4,$line,$pos,0,$index]

$index is the argument from an =index paragraph.

=item [6,$line,$pos,0,$text]

Normal formatted text paragraph. $text is the text.

=item [7,$line,$pos,0,$pragma]

$pragma is the argument from a =pragma paragraph.

=item [8,$line,$pos,$indentation,$type,...]

This item is produced for each matching =over/=back pair. $indentation is
the argument to =over, $type is 1 if the embedded =item's are bulleted, 2 if
they are enumerated, 3 if they are text, and 0 if there are no items.

The "..." indicates an unlimited number of further elements which are
themselves nested arrays in exactly the format being described. In other
words, a list item includes all the paragraphs inside the list inside
itself. (Clear? No? Nevermind.)

=item [9,$line,$pos,0,$cut]

$cut contains the text from a =cut paragraph. You shouldn't need to use
this, but I _suppose_ it might be necessary to do special breaks on a cut. I
doubt it though. This one is "depreciated", as Larry put it. Or perhaps
disappreciated.

=back

=head2 Simplify

This procedure takes as its input the convoluted output from Parse(), and
outputs a much simpler array consisting of pairs of commands and arguments,
designed to be easy (easier?) to parse in your pod formatting code.

It is used very simply by saying something like:

    @Pod = Simplify(Parse());

    while($cmd = shift @Pod) { $arg = shift @Pod;
        #...
    }

Where #... is the code that responds to any of the commands from the
following list.
Note that you are welcome to ignore any of the commands that you want to.
Many contain duplicate information, or at least information that will go
unused. A formatter based on this data can be quite simple indeed. (See
pod2text for entirely too simple an example.)

=head2 Reference to Simplify commands

=over 4

=item "filename"

The argument contains the name of the pod file that is being parsed. These
will be present at the start of each file. You should open an output file,
output headers, etc., based on this, and not when you start parsing.

=item "endfile"

The end of the file. Each file will be ended before the next one begins, and
after all files are done with. You can do end processing here. The argument
is the same name as in "filename".

=item "setline"

This gives you a chance to record the "current" input line, probably for
debugging purposes. In this case, "current" means that the next command you
see that was derived from an input paragraph will start at the argument's
line in the file.

=item "setloc"

Same as setline, but the byte offset in the input, instead of the line
offset.

=item "pragma"

The argument contains the text of a pragma command.

=item "text"

The argument contains a paragraph of formatted text.

=item "verbatim"

The argument contains a paragraph of verbatim text.

=item "cut"

A =cut command was hit. You shouldn't really need to listen for this one.

=item "index"

The argument contains an =index paragraph. (Note: Current =index commands
are not fed through, but turned into XZ<><> commands.)

=item "head1"

=item "head2"

The argument contains the argument from a header command.

=item "setindent"

If you are tracking indentation, use the argument to set the indentation
level.

=item "listbegin"

Start a list environment. The argument is the type of list (1,2,3 or 0).

=item "listend"

Ends a list environment. Same argument as listbegin.

=item "listtype"

The argument is the type of list.
You can just record the argument when you see one of these, instead of
paying attention to listbegin & listend.

=item "over"

The argument is the indentation. It's probably better to listen to the
"list..." commands.

=item "back"

Ends an "over" list. The argument is the original indentation.

=item "item"

The argument is the text of the =item command.

=back

Note that all of these various commands you've seen are synchronized
properly so you don't have to pay attention to all at once, but they are all
output for your benefit. Consider the following example:

    listtype 2
    listbegin 2
    setindent 4
    over 4
    item 1.
    text Item #1
    item 2.
    text Item #2
    setindent 0
    listend 2
    back 0
    listtype 0

=head2 Normalize

This command is normally invoked by Parse, so you shouldn't need to deal
with it. It just cleans up text a little, turning spare '<', '>', and '&'
characters into HTML escapes (E<lt>, etc.) as well as generating warnings
for some pod formatting mistakes.

=head2 Normalize2

A little more aggressive formatting based on heuristics. Not applied by
default, as it might confuse your own heuristics.

=head2 %Escapes

This hash is exported from Pod::Parse, and contains default ASCII
translations for some common HTML escape sequences. You might like to use
this as a basis for an %HTML_Escapes hash in your own formatter.

=cut

$ENDFILE = -1;
$FILE = 0;
$VERBATIM = 1;
$HEADING = 2;
$ITEM = 3;
$INDEX = 4;
$TEXT = 6;
$PRAGMA = 7;
$INDENT = 8;
$CUT = 9;

# "hide" suite

sub hide {
    local($thing_to_hide) = shift;
    $thing_to_hide =~ tr/\000-\177/\200-\377/;
    return $thing_to_hide;
}

sub start_hide {
    if ( /[\200-\377]/ ) {
        warn "hit bit char in input stream";
    }
}

sub unhide {
    local($tmp) = shift;
    $tmp =~ tr/\200-\377/\000-\177/;
    return $tmp;
}

# Turn formatted text into a more normalized version. All '<' and '>' will
# belong to a command, the rest will have turned into E<lt> and E<gt>. '&'
# has been changed into E<amp>.
# Possibly generate some warnings.

sub Normalize {
    local($_) = $_[0];

    start_hide;
    s/(E<[^<>]*>)/hide($1)/ge;
    s/([A-Z]<[^<>]*>)/hide($1)/ge;

    s/</hide("E<lt>")/ge;
    s/>/hide("E<gt>")/ge;
    s/&/hide("E<amp>")/ge;

    #if (m{ ([\-\w]+\([^\051]*?[\@\$,][^\051]*?\))
    #    }x && $` !~ /([LCI]<[^<>]*|-)$/ && !/^=\w/)
    #{
    #    warn "``$1'' should be a [LCI]<$1> ref near line $line of $ARGV\n";
    #}

    while (/(-[a-zA-Z])\b/g && $` !~ /[\w\-]$/) {
        warn "``$1'' should be [CB]<$1> ref near line $line of $ARGV\n";
    }

    # put back pod quotes so we get the inside of <> processed;
    $_ = unhide($_);
}

# Apply heuristics to a formatted string.

sub Normalize2 {
    local($_) = @_;

    # func() is a reference to a perl function
    s{\b([:\w]+\(\))}{I<$1>}g;

    # func(n) is a reference to a man page
    s{(\w+)(\([^\s,\051]+\))}{I<$1>$2}g;

    # convert simple variable references
    s/(\s+)([\$\@%][\w:]+)/${1}C<$2>/g;

    # s/([\$\@%][\w:]+)/C<$1>/g;
    # s/\$[\w:]+\[[0-9]+\]/C<$&>/g;

    $_;
}

# Take output from the following Parse routine, and turn it into a much
# more straightforward, non-recursive, data structure. It returns an
# array consisting of pairs of elements, the first of each pair being a
# command, and the second its argument. Hopefully this should prove
# simple to parse. Note that it is intended that your formatter only "listens"
# for the commands it is interested in, and simply discards the rest.
sub Simplify {
    &Simplify2(0,0,@_);
}

sub Simplify2 {
    my($indent,$type,@list) = @_;
    my(@result)=();

    foreach(@list) {
        my($code,$line,$loc,$param,$text) = @{$_};
        push(@result,"setline",$line);
        push(@result,"setloc",$loc);
        if( $code == $INDENT) {
            my($code_dummy,$line,$loc,$i,$t,@more) = @{$_};
            #  ^^^^^^^^^^ This may be bug of perl5.002b2
            push(@result,"listtype",$t);
            push(@result,"listbegin",$t);
            push(@result,"setindent",$i);
            push(@result,"over",$i);
            push(@result,&Simplify2($i,$t,@more));
            push(@result,"setindent",$indent);
            push(@result,"listend",$t);
            push(@result,"back",$indent);
            push(@result,"listtype",$type);
        } elsif( $code == $PRAGMA) {
            push(@result,"pragma",$text);
        } elsif( $code == $ITEM) {
            push(@result,"item",$text);
        } elsif( $code == $INDEX) {
            push(@result,"index",$text);
        } elsif( $code == $TEXT) {
            push(@result,"text",$text);
        } elsif( $code == $VERBATIM) {
            push(@result,"verbatim",$text);
        } elsif( $code == $HEADING) {
            push(@result,"head$param",$text);
        } elsif( $code == $CUT) {
            push(@result,"cut",0);
        } elsif( $code == $FILE) {
            push(@result,"filename",$text);
        } elsif( $code == $ENDFILE) {
            push(@result,"endfile",$text);
        }
    }
    @result;
}

# Read input from a pod file, and generate a list describing it. Keeps
# track of the line number and position in the stream. Recursive.

sub Parse {
    local(@ARGV)=@ARGV;
    if(@_) { @ARGV = @_ }
    local($/);
    $type=0;
    $typecount=0;
    $eof=0;
    $bof=1;
    $saveindex="";
    $/="";
    $cutting=1;
    $recurse=0;
    $line=0;
    $loc=0;
    $newloc=0;
    $newline=0;
    $infile = undef;
    &Parse2();
}

sub Parse2 {
    my(@result)=();

    while(<>) {
        if($bof) {
            push(@result,[-1,0,0,0,$infile]) if $infile;
            push(@result,[0,0,0,0,$ARGV]);
            $infile = $ARGV;
            $newloc=0;
            $newline=0;
            $bof=0;
        }
        if(eof) {
            $bof=1;
        }
        $loc=$newloc;
        $line=$newline;
        $newloc = $loc + length($_);
        $newline= $line + (tr/\n/\n/);

        #Should I?
        #s/[ \t]+$//gm;

        #print STDERR "Read $_\n";

        if($cutting && !/^=/) { next; }
        $cutting=0;

        chomp;

        if(/^=cut/) {
            $cutting=1;
            push(@result,[9,$line,$loc,0,0]);
            next;
        }

        if(/^\s/) {
            push(@result,[1,$line,$loc,0,$_]);
        } elsif( /^=head(\d+)\s*/ ) {
            my($data) = $';
            $data =~ s/\n/ /g;
            push(@result,[2,$line,$loc,$1,Normalize($data)]);
        } elsif( /^=item\s*/ ) {
            my($data) = $';
            $data =~ s/\n/ /g;
            if(!$recurse) {
                warn "=item outside of an =over block near line $line of $ARGV\n";
            }
            if( $data eq "*" ) {
                if( $type == 0 || $type == 1) {
                    $type = 1;
                } else {
                    warn "Inconsistent =item near line $line of $ARGV\n";
                }
            } elsif( $data =~ /^(\d+)\.$/ ) {
                if( $type == 0 ) {
                    $type=2;
                    $typecount=0;
                } elsif( $type != 2 ) {
                    warn "Inconsistent =item near line $line of $ARGV\n";
                }
                if( ++$typecount != $1) {
                    warn "Inconsistently numbered =item near line $line of $ARGV\n";
                    $typecount = $1;
                }
            } else {
                if( $type == 0 || $type == 3) {
                    $type = 3;
                } else {
                    warn "Inconsistent =item near line $line of $ARGV\n";
                }
            }
            push(@result,[3,$line,$loc,0,Normalize($data)]);
        } elsif( /^=over(?:\s+(\d+))?/ ) {
            my($indent,$l1,$l2)=($1,$line,$loc);
            $indent ||= 5; # good?
            $recurse++;
            local($type)=0;
            local($typecount)=0;
            my(@newresult) = Parse2();
            #print STDERR "PUSH\n";
            push(@result,[8,$l1,$l2,$indent,$type,@newresult]);
            #print STDERR "POP\n";
            $recurse--;
            last if $eof;
        } elsif( /^=back/ ) {
            if(!$recurse) {
                die "Unmatched =back near line $line of $ARGV\n";
            }
            return @result;
        } elsif( /^=pragma\s*/) {
            push(@result,[7,$line,$loc,0,$']);
        } elsif( /^=index\s*/) {
            #push(@result,[4,$line,$loc,0,$']);
            $saveindex=$';
        } elsif( /^=comment/ ) {
            #push(@result,[5,$line,$loc]);
        } elsif( /^=/ ) {
            m/^(=\S+)/;
            warn "Unknown pod command `$1' near line $line of $ARGV\n";
        } else {
            if($saveindex) {
                $_ = join("",map("X<$_>",grep(!/^\s*$/,split(/\n/,$saveindex)))) . $_;
                $saveindex="";
            }
            push(@result,[6,$line,$loc,0,Normalize($_)]);
        }
    }
    $eof=1;
    if($recurse) {
        #die ...
        warn "Unmatched =over near line $line of $ARGV\n";
        #Assume =back
    }
    push(@result,[-1,0,0,0,$infile]) if $infile;
    @result;
}

# for testing
#@result=Parse();
#print Dumpstruct::Dumpstruct(\@result);

# Common escapes with ASCII translations. You should copy this into your
# own local escapes hash and override the ones you need to change.

%Escapes = (
    'amp'    => '&',   # ampersand
    'lt'     => '<',   # left chevron, less-than
    'gt'     => '>',   # right chevron, greater-than
    'quot'   => '"',   # double quote

    "Aacute" => "A",   # capital A, acute accent
    "aacute" => "a",   # small a, acute accent
    "Acirc"  => "A",   # capital A, circumflex accent
    "acirc"  => "a",   # small a, circumflex accent
    "AElig"  => 'Ae',  # capital AE diphthong (ligature)
    "aelig"  => 'ae',  # small ae diphthong (ligature)
    "Agrave" => "A",   # capital A, grave accent
    "agrave" => "a",   # small a, grave accent
    "Aring"  => 'A',   # capital A, ring
    "aring"  => 'a',   # small a, ring
    "Atilde" => 'A',   # capital A, tilde
    "atilde" => 'a',   # small a, tilde
    "Auml"   => 'A',   # capital A, dieresis or umlaut mark
    "auml"   => 'a',   # small a, dieresis or umlaut mark
    "Ccedil" => 'C',   # capital C, cedilla
    "ccedil" => 'c',   # small c, cedilla
    "Eacute" => "E",   # capital E, acute accent
    "eacute" => "e",   # small e, acute accent
    "Ecirc"  => "E",   # capital E, circumflex accent
    "ecirc"  => "e",   # small e, circumflex accent
    "Egrave" => "E",   # capital E, grave accent
    "egrave" => "e",   # small e, grave accent
    "ETH"    => 'Oe',  # capital Eth, Icelandic
    "eth"    => 'oe',  # small eth, Icelandic
    "Euml"   => 'E',   # capital E, dieresis or umlaut mark
    "euml"   => 'e',   # small e, dieresis or umlaut mark
    "Iacute" => "I",   # capital I, acute accent
    "iacute" => "i",   # small i, acute accent
    "Icirc"  => "I",   # capital I, circumflex accent
    "icirc"  => "i",   # small i, circumflex accent
    "Igrave" => "I",   # capital I, grave accent
    "igrave" => "i",   # small i, grave accent
    "Iuml"   => 'I',   # capital I, dieresis or umlaut mark
    "iuml"   => 'i',   # small i, dieresis or umlaut mark
    "Ntilde" => 'N',   # capital N, tilde
    "ntilde" => 'n',   # small n, tilde
    "Oacute" => "O",   # capital O, acute accent
    "oacute" => "o",   # small o, acute accent
    "Ocirc"  => "O",   # capital O, circumflex accent
    "ocirc"  => "o",   # small o, circumflex accent
    "Ograve" => "O",   # capital O, grave accent
    "ograve" => "o",   # small o, grave accent
    "Oslash" => "O",   # capital O, slash
    "oslash" => "o",   # small o, slash
    "Otilde" => "O",   # capital O, tilde
    "otilde" => "o",   # small o, tilde
    "Ouml"   => 'O',   # capital O, dieresis or umlaut mark
    "ouml"   => 'o',   # small o, dieresis or umlaut mark
    "szlig"  => 'ss',  # small sharp s, German (sz ligature)
    "THORN"  => 'L',   # capital THORN, Icelandic
    "thorn"  => 'l',   # small thorn, Icelandic
    "Uacute" => "U",   # capital U, acute accent
    "uacute" => "u",   # small u, acute accent
    "Ucirc"  => "U",   # capital U, circumflex accent
    "ucirc"  => "u",   # small u, circumflex accent
    "Ugrave" => "U",   # capital U, grave accent
    "ugrave" => "u",   # small u, grave accent
    "Uuml"   => 'U',   # capital U, dieresis or umlaut mark
    "uuml"   => 'u',   # small u, dieresis or umlaut mark
    "Yacute" => "Y",   # capital Y, acute accent
    "yacute" => "y",   # small y, acute accent
    "yuml"   => 'y',   # small y, dieresis or umlaut mark
);

1;
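# Usage sketch (documentation only; the pod below is never executed).

=head1 EXAMPLE

What follows is a minimal sketch, not part of the original module, of the
command/argument loop described under "Simplify" above. The file name
C<example.pod> and the output layout are assumptions made for illustration
only. A real formatter would also have to expand the formatting codes
(BZ<><...>, IZ<><...> and so on) that are still present in each argument.

    use Tk::Parse;

    # Flatten the parse tree of one (hypothetical) file into a list of
    # command/argument pairs.
    my @pod = Simplify(Parse('example.pod'));

    while (@pod) {
        # Take one command and its argument off the front of the list.
        my ($cmd, $arg) = splice(@pod, 0, 2);

        if    ($cmd eq 'head1')    { print "\n", uc($arg), "\n\n"; }
        elsif ($cmd eq 'head2')    { print "\n  $arg\n\n"; }
        elsif ($cmd eq 'item')     { print "    $arg\n"; }
        elsif ($cmd eq 'text')     { print "      $arg\n\n"; }
        elsif ($cmd eq 'verbatim') { print "$arg\n\n"; }
        # setline, setloc, listbegin and the rest are simply ignored here.
    }

Because the list is flat, a loop like this needs no recursion. A formatter
that cares about nesting could keep a counter that it bumps on "listbegin"
and drops on "listend".

=cut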