Text File | 1996-09-28 | 4KB | 92 lines

TODO file for GNU ptx

Tell <pinard@iro.umontreal.ca> if you feel like volunteering for any
of these ideas, listed more or less in decreasing order of priority.

* Use mmap for swallowing files (maybe wrong when memory edited).

* Sort keywords intelligently for Latin-1 code.  See how to interface
  this character set with various output formats.  Also, introduce
  options to inverse-sort and possibly to reverse-sort.

* Use rx instead of regex.

* Correct the infinite loop using -S '$' or -S '^'.

* Improve speed for Ignore and Only tables.  Consider hashing instead
  of sorting.  Consider playing with obstacks to digest them.

* Provide better handling of format effectors obtained from input, and
  also attempt white space compression on output which would still
  maximize full output width usage.

* See how TeX mode could be made more useful, and if a texinfo mode
  would mean something to someone.

* Understand and mimic `-t' option, if I can.

* Provide multiple language support

  Most of the boosting work should go along the line of fast
  recognition of multiple and complex boundaries, which define various
  `languages'.  Each such language has its own rules for words,
  sentences, paragraphs, and reporting requests.  This is less
  difficult than I first thought:

  . Recognize language modifiers with each option.  At least -b, -i,
    -o, -W, -S, and also new language switcher options, will have such
    modifiers.  Modifiers on language switchers will allow or disallow
    language transitions.

  . Complete the transformation of underlying variables into arrays in
    the code.

  . Implement a heap of positions in the input file.  There is one
    entry in the heap for each compiled regexp; it is initialized by a
    re_search after each regexp compile.  Regexps reschedule
    themselves in the heap when their position passes while scanning
    input.  In this way, looking simultaneously for a lot of regexps
    should not be too inefficient, once the scanning starts.
    If this works ok, maybe consider accepting regexps in Only and
    Ignore tables.

  . Merge the boundary processing options with language processing,
    really integrating -S processing as a special case.  Maybe,
    implement several levels of boundaries.  See how to implement a
    stack of languages, for handling quotations.  See if more
    sophisticated references could be handled as another special case
    of a language.

* Tackle other aspects, in a more long term view

  . Add options for statistics, frequency lists, referencing, and all
    other prescreening tools and subsidiary tasks of concordance
    production.

  . Develop an interactive mode.  Even better, construct a GNU emacs
    interface.  I'm looking at Gene Myers' <gene@cs.arizona.edu>
    suffix arrays as a possible implementation along those ideas.

  . Implement hooks so word classification and tagging could be merged
    in.  See how to effectively hook in lemmatisation or other
    morphological features.  It is far from clear yet how to interface
    this correctly, so some experimentation is mandatory.

  . Profile and speed up the whole thing.

  . Make it work on small address space machines.  Consider three
    levels of hugeness for files, and three corresponding algorithms
    to make optimal use of memory.  The first case is when all the
    input files and all the word references fit in memory: this is the
    case currently implemented.  The second case is when the files
    cannot fit all together in memory, but the word references do.
    The third case is when even the word references cannot fit in
    memory.

  . There also are subsidiary developments for in-core incremental
    sort routines as well as for external sort packages.  The need for
    more flexible sort packages comes partly from the fact that
    linguists use kinds of keys which compare in unusual and more
    sophisticated ways.  GNU `sort' and `ptx' could evolve together.

Local Variables:
mode: outline
outline-regexp: " *[-+*.] \\|"
End:
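
The heap of positions described under multiple language support can be
sketched roughly as follows.  This is only an illustration, not ptx
code: it is written in Python for brevity, Python's `re' module stands
in for the GNU regex re_search call, and the function name
scan_boundaries is hypothetical.  Each compiled regexp owns one heap
entry holding its next match position, and an entry is rescheduled
once the scan has passed it.

```python
import heapq
import re


def scan_boundaries(text, patterns):
    """Yield (start, end, pattern_index) for every match, in input order.

    Hypothetical helper: one heap entry per compiled regexp, keyed on
    the position of that regexp's next match.
    """
    compiled = [re.compile(p) for p in patterns]
    heap = []

    # Seed the heap with one initial search per regexp.
    for index, regexp in enumerate(compiled):
        match = regexp.search(text)
        if match:
            heapq.heappush(heap, (match.start(), index, match.end()))

    while heap:
        start, index, end = heapq.heappop(heap)
        yield start, end, index

        # Reschedule this regexp past the match just reported.  An empty
        # match advances by one character so the scan always progresses.
        resume = end if end > start else start + 1
        if resume <= len(text):
            match = compiled[index].search(text, resume)
            if match:
                heapq.heappush(heap, (match.start(), index, match.end()))


if __name__ == "__main__":
    for start, end, index in scan_boundaries("ab. cd! ef.", [r"\.", r"!"]):
        print(start, end, index)
```

Only the regexp whose match was just consumed is searched again, so
the cost of each step does not grow with the number of regexps that
are idle.  The one-character advance on an empty match is an
assumption of this sketch about how patterns such as '$' ought to be
handled, nothing more.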