HTML2TEXT is a utility that converts HTML files to plain text. Optionally it also tries to figure out if the HTML file is well-constructed.
All Rights Reserved.
Permission to use, copy, and distribute this software and its documentation for any purpose and without fee is hereby granted, provided that the above copyright notice appear in all copies and that both that copyright notice and this permission notice appear in supporting documentation, and that the name Gavin Spearhead not be used in advertising or publicity pertaining to distribution of the software without specific, written prior permission.
GAVIN SPEARHEAD DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL GAVIN SPEARHEAD BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTUOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
This documentation is written in HTML in such a way that it remains as readable as possible in a text viewer, as far as the HTML format allows. A version converted to plain text is also included.
All my documentation has been written in HTML since about June 1997; before that date it was written in plain text. This means that a) I can put it on the web, b) it is readable at all times, c) on any machine that has a browser, d) it is easily converted to other formats, such as PostScript, WordPerfect and of course plain text, and e) I can easily produce formatted documentation.
Any bugs, errors, suggestions, thoughts, ideas, etc. should be sent to
the author; this also includes errors in the documentation. Unsupported
HTML tags or entity sequences can also be reported to the author, along
with a description, restrictions and options. No matter how puny or
important your help is, I need it to improve this program.
If you want to become a beta tester of this program,
contact me and I'll send you
the details. Unfortunately I cannot give rewards other than
gratitude.
You are encouraged to register this piece of software. This means
that you will receive either the latest version when it is released
or a note that a new version has been released. It also gives me an
idea of how many people use this program and how it has spread. The
information provided to register will not be used for any purpose
other than HTML2TEXT, nor by anyone other than me.
There are three ways to register:
Note that registration is Free of charge!
When you are registered you will receive a private registration key, so that your name is shown when you execute the program instead of Unregistered. However, no additional functions become available in the registered version; in other words, the unregistered version is not crippled. The key file will be sent via email if possible. This is currently the only way to receive the registration key. When you order (see below) you will automatically be registered and the key can be found on the disk.
HTML2TXT.EXE | The executable. |
HTML2TXT.CFG | Configuration file with options. |
HTML2TXT.INI | Ini-file with entity references. |
HTML2TXT.HTM | Documentation for HTML2TEXT in HTML format. |
HTML2TXT.TXT | Documentation for HTML2TEXT in plain text format. |
REGISTER.HTM | Registration form in HTML format. |
LONGFILE.BTM | 4DOS batch file to convert HTML files which does support Windows 95 long filenames. |
If one of the files is missing, throw the package away and ask the author for a new and complete copy. The address is at the end of the file.
HTML2TEXT converts HyperText Mark-up Language (HTML) files to plain-text (ASCII) files. The following rules apply:
A quick start instruction is to type on the commandline:
HTML2TXT file.htm
This will convert the file "file.htm" to "file.txt".
The full syntax of HTML2TEXT is:
HTML2TXT <filespecification> @<listfile> <options>
<Filespecification> is the name of the files to convert; it may include wildcards (* and ?). More than one file specification may appear on the command line. Note that long filenames (Windows 95/NT) are not supported. This means that input filenames have to be in the 8.3 format (every W95/NT file has an 8.3 filename and optionally a long filename). The output will always be an 8.3 filename. 4DOS users (v5.5+) can use the %@SFN[...] function to get the short filename of a long filename (see the 4DOS documentation for details). A 4DOS batch file is also included to convert files with long filenames, see below. Windows 95 users can drag and drop files onto the executable; Windows then uses the short filename anyway.
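To check up front whether a name fits the 8.3 restriction, something like the following Python sketch could be used. It is not part of HTML2TEXT; the function name is_83_name and the exact set of forbidden characters are assumptions made for illustration, and real DOS forbids a few more characters than this.

```python
import re

# Assumption: an 8.3 name is 1-8 characters, optionally followed by a dot
# and a 1-3 character extension. Whitespace, extra dots, path separators
# and wildcard characters are rejected; real DOS is slightly stricter.
_PART = r'[^\s./\\:*?"<>|]'
_NAME_83 = re.compile(_PART + r'{1,8}(\.' + _PART + r'{1,3})?')


def is_83_name(filename):
    """Return True if filename (without a directory part) looks like 8.3."""
    return _NAME_83.fullmatch(filename) is not None


if __name__ == "__main__":
    for name in ("FILE.HTM", "INDEX~1.HTM", "mylongfilename.html"):
        print(name, is_83_name(name))
```

Names that fail this check would have to be passed by their short name, for example via the 4DOS %@SFN[...] function mentioned above.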
@<listfile> is the name of a file that contains the names of the files to convert. The <listfile> name may not contain wildcards. Each filename must appear on a line of its own. Empty lines are permitted and filenames may have leading or trailing whitespace characters. The names of these files have the same restrictions as the files from the <filespecification>. You cannot use options in a listfile.
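The listfile rules above (one name per line, empty lines allowed, surrounding whitespace ignored) can be sketched in Python as follows. This is only an illustration of the described format, not the program's own code; read_listfile is a hypothetical name.

```python
def read_listfile(text):
    """Return the filenames found in the text of a listfile.

    One filename per line; empty lines are skipped and leading or
    trailing whitespace around a name is dropped.
    """
    names = []
    for line in text.splitlines():
        name = line.strip()
        if name:
            names.append(name)
    return names


if __name__ == "__main__":
    sample = "INDEX.HTM\n\n   FAQ.HTM  \nNEWS.HTM\n"
    print(read_listfile(sample))
```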
<options> can be the following:
-a- | Do not display text for links. |
-b- | Do not mark bold text. |
-b+ | Mark bold text by embracing it with stars (*). |
-b:<chars> | Specifies the two characters used to mark bold text. Exactly two characters have to be given. |
-B- | Don't print borders for tables. |
-B:<num> | Use predefined border <num>, where <num> is
|
-c | Automatically create directory when the specified output path does not exist. |
-c- | Ask to create when the specified output path does not exist. |
-C:<num> | Sets the charset to use. <num> can be any number from 1 to 9. The default is 1, except when Windows is detected (in enhanced mode, but who cares), in which case the default is 3. 1: ASCII (7 bit); 2: Extended ASCII (8 bit); 3: Windows ISO 8859/1; 4-9: user definable. |
-f | Get input from standard input, then read the files specified. Input from standard input will be converted and then output to standard output. |
-f- | Only read the files specified. |
-F- | Do not display input fields in forms. |
-h | Do not display HTML2TEXT messages. (hush) |
-h- | Do display HTML2TEXT messages. |
-H | Stop after reading </html>. |
-H- | Continue after reading </html>. |
-i- | Do not mark italic text. |
-i+ | Mark italic text by embracing it with slashes (/). |
-i:<chars> | Specifies the two characters used to mark italic text. Exactly two characters have to be given. |
-I- | Don't display any text for images. |
-l:<chars> | Specifies the characters used for list elements of unordered lists. The first is for the type Square, the second for Disc and the third for Circle. None of the three may be omitted. |
-L:<char> | Specifies the character to use for filling up empty spaces in text fields of forms. |
-m | Mark errors in the output file. |
-m- | Do not mark errors in the output file. |
-N | Display the full path of the converted file in the output. |
-N- | Do not display the full path of the converted file in the output. |
-o:<char> | Controls the overwriting policy for existing files
<char> can be one of the following:
|
-O- | Don't use a log file. |
-O+ | Use a log file. If none is specified previously HTML2TXT.LOG will be used. |
-O:<File> | Use <File> as a log file. |
-p:<num> | Set the indenting for lists to <num>. |
-P- | Do not display a progress indicator |
-P+ | Display a progress indicator. In the upper left corner a wheel will be shown |
-q | Treat balancing of quotes strictly. |
-q- | Treat balancing of quotes relaxed. A '>' always ends a tag. |
-s | Output will be redirected to standard output. |
-s- | Files will be used for output, the output filename will be derived from the output path and the filename entered. |
-S | Stretch tables to the full width of the page. This looks neat when you've more short tables with similar data. |
-S- | Do not stretch tables to the full width of the page. The width of the table is the minimum width required. |
-t | Reformat tables. |
-t- | Simple table conversion. Just insert a number of spaces between elements. |
-T | Display the title in the output file as -= <title> =-. |
-T2 | The same as -T but omits the -= and =-. |
-T- | Do not display any title. |
-u- | Do not mark underlined text. |
-u+ | Mark underlined text by embracing it with underscores (_). |
-u:<chars> | Specifies the two characters used to mark underlined text. Exactly two characters have to be given. |
-v | Display version info. |
-w | Warn for HTML-errors in the source file. |
-w- | Do not warn for HTML-errors in the source file. |
-W:<num> | Set the line length to <num>. Maximum line length is 255. |
-W- | No line wrapping will be performed. |
-W | Set line length to the screenwidth (usually 80). |
-y- | Do not use hyphenation. |
-? | Displays a short help. |
I know the program suffers from creeping featuritis!
WARNING
Most options are different from the options of the previous versions
of HTML2TEXT. This was necessary to maintain some logic in the
naming of the options, which have increased immensely in number.
Otherwise the characters used would be quite cryptic. Please read
the section above carefully. And options are now case-sensitive.
You can concatenate options to save space, except -O:<File>, which must come last. It is better to set the log file in the config file.
The result file will have the same name as the original file, but with the extension specified in the config-file (default is .txt). If the original extension is the same as the output extension, the output extension will always be `.tx1'.
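The naming rule for the result file can be sketched in Python as below. This is an illustration only: output_name is a hypothetical helper, the comparison is assumed to be case-insensitive (as DOS filenames are), and the output path from the config file is left out.

```python
import os


def output_name(input_name, out_ext=".txt"):
    """Derive the output filename from the input filename.

    The extension is replaced by out_ext, unless the input already has
    that extension, in which case `.tx1' is used instead.
    """
    stem, ext = os.path.splitext(input_name)
    # Assumption: extensions are compared without regard to case.
    if ext.lower() == out_ext.lower():
        return stem + ".tx1"
    return stem + out_ext


if __name__ == "__main__":
    print(output_name("file.htm"))
    print(output_name("notes.txt"))
```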
All messages are written to standard error, unless hush is set. 4DOS users can also redirect standard error or use the Windows clipboard by the device `clip:'.
Yes, you can. There are two other ways to specify the options. First
of all you can use the HTML2TXT.CFG file (see next section). And
second you can use the environment variable H2T_SW. You can set all
options in this variable.
Eg. set H2T_SW=-b-u-T2
will switch marking of underlined text and bold text off and switch
the title to setting 2. The leading `-' can be omitted.
This file contains the translation table for ampersand sequences,
also known as entity references, ie. sequences of characters of
the form: &some_text;. The lines have the following format:
<identifier>="<result>"
Where <identifier> is the text between the `&'
and the `;'. <result> is the text that will replace the entity
reference. The quotes are optional. The text can contain escape
sequences (C-style) of the format \<char> where <char>
can be:
Or of the format \<number>, in which case the character whose ASCII value equals <number> is inserted. Every other character is inserted literally, including quotes and backslashes.
Entity references of the format &#nnn; that are not specified in the ini-file will be converted to the character with ASCII value nnn, as are entity references of the format &#xnn;, the hexadecimal representation.
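The fallback for numeric entity references can be illustrated with a small Python sketch. It only shows the decimal (&#nnn;) and hexadecimal (&#xnn;) conversion described above; decode_numeric_refs is a hypothetical name, and the lookup in the ini-file that precedes this step in HTML2TEXT is not modelled.

```python
import re

# Matches &#nnn; (decimal) and &#xnn; (hexadecimal) references.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def decode_numeric_refs(text):
    """Replace numeric entity references by the character they denote."""

    def replace(match):
        if match.group(1) is not None:
            code = int(match.group(1), 16)
        else:
            code = int(match.group(2))
        # Leave references outside the valid code point range untouched.
        if code > 0x10FFFF:
            return match.group(0)
        return chr(code)

    return _NUMERIC_REF.sub(replace, text)


if __name__ == "__main__":
    print(decode_numeric_refs("&#72;TML&#x32;TEXT"))
```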
This file contains the various options that can be set. These are all eloquently commented in the file itself. So take a peek at that file.
Beware that some options have side effects, eg. turning off line wrapping means also that text will not be centered.
In both files any line starting with a semicolon (;) is treated as a comment and thus ignored. In the config-file, lines starting with a double cross (#) are also treated as comments.
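Reading one line of the ini-file in the format <identifier>="<result>" could look like the Python sketch below. It is a simplified illustration: parse_ini_line is a hypothetical name, leading whitespace before a semicolon is assumed to be allowed, and the escape sequences are deliberately not interpreted.

```python
def parse_ini_line(line):
    """Parse one ini-file line into (identifier, result).

    Returns None for empty lines, comment lines starting with `;' and
    lines without an `=' sign. Optional surrounding quotes are removed.
    Escape sequences (\\<char>, \\<number>) are NOT processed here.
    """
    stripped = line.strip()
    if not stripped or stripped.startswith(";"):
        return None
    identifier, separator, result = stripped.partition("=")
    if not separator:
        return None
    result = result.strip()
    if len(result) >= 2 and result[0] == '"' and result[-1] == '"':
        result = result[1:-1]
    return identifier.strip(), result


if __name__ == "__main__":
    print(parse_ini_line('amp="&"'))
    print(parse_ini_line("; a comment"))
```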
Both files are searched for in the current directory first and then in the directory from which HTML2TEXT was started. Usually these files will be placed in the same directory as HTML2TXT.EXE, a directory in your path. The path to the ini-file can also be specified in the config-file; if so, it will be searched for there first.
When HTML2TEXT starts all options are set to the default value. Then the HTML2TXT.CFG file is read. Next the environment variable H2T_SW is read and finally the commandline options are read. This means that any redefining of options will overwrite the previous settings.
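The order in which settings override each other can be pictured with a short Python sketch. The option names used here (line_length, mark_bold) are invented for the example and are not the names used in HTML2TXT.CFG.

```python
def resolve_options(defaults, config_file, environment, command_line):
    """Merge option layers; later layers overwrite earlier ones.

    Order: built-in defaults, then HTML2TXT.CFG, then the H2T_SW
    environment variable, and finally the commandline options.
    """
    resolved = dict(defaults)
    for layer in (config_file, environment, command_line):
        resolved.update(layer)
    return resolved


if __name__ == "__main__":
    # Hypothetical option names, for illustration only.
    print(resolve_options(
        {"line_length": 80, "mark_bold": True},
        {"line_length": 72},
        {"mark_bold": False},
        {"line_length": 100},
    ))
```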
Tag | What it does in HTML2TEXT |
---|---|
A | Checks; optionally link_text or name_text is written. Contrary to previous versions, this one always needs a closing tag. |
ABBR | Checks. |
ADDRESS | See tag: I. |
APPLET | Checks, ignores text between <APPLET></APPLET>. |
AREA | Ignores. |
B | Checks, Optionally writes BOLD-token. |
BASE | Ignores. |
BASEFONT | Ignores. |
BDO | Checks. |
BGSOUND | Ignores. |
BIG | Checks. |
BLINK | Checks. |
BLOCKQUOTE | Checks, indents. |
BODY | Checks. |
BR | Writes a newline. |
BUTTON | Writes a button and uses button_text. |
CAPTION | Determines a caption for tables, Checks. |
COLGROUP | Ignores. |
CENTER | Checks, centers when linewrap is on. |
CITE | see tag: I. |
CODE | See tag: Pre. |
COL | Checks. |
COMMENT | Ignores anything between <COMMENT></COMMENT>, Checks. |
DD | Inserts newline and indents. |
DEL | Checks. |
DFN | See tag: I. |
DIR | See tag: OL. |
DIV | Checks, writes a newline at both open and close tag. |
DL | Starts a definition list, Checks. |
DT | Inserts a newline. |
EM | see tag: I. |
EMBED | Ignores. |
FIELDSET | Checks. |
FRAME | Ignores. |
FRAMESET | Checks. |
FONT | Checks. |
FORM | Checks. |
H | Checks. |
HD1 | Writes the text to screen with embracing newlines. |
HD2 | |
HD3 | |
HD4 | |
HD5 | |
HD6 | |
HEAD | Checks. |
HR | Writes a line of `='s in case size >3 or else a line of `-'s. The length is absolute or relative set according to the width value. |
HTML | Everything after </HTML> is optionally ignored, Checks. |
I | Checks, Optionally writes ITALIC-token. |
IFRAME | Checks. |
IMG | Ignored, Optionally writes image_text. |
INPUT | Ignored. |
ISINDEX | Write a prompt plus optionally [ Input ]. |
KBD | See B. |
LI | Writes a list element identifier, for ULs a * or specified in config-file, for OL a number, parameter type and value used. |
LEGEND | Checks. |
LINK | Ignores. |
LISTING | See tag: Pre. |
MAP | Checks. |
MARQUEE | Checks. |
MENU | See tag: OL. |
META | Ignores. |
NEXTID | Ignores. |
NOBR | Checks. |
NOFRAMES | Checks. |
NOSCRIPT | Checks. |
OBJECT | Checks. |
OL | An ordered list, Checks, type parameter used. |
OPTION | Ignores. |
OPTGROUP | Checks. |
P | Starts a new paragraph and adjusts alignment. |
PARAM | Ignores. |
PRE | Output as is, Checks (line wrap is not ignored, if on). |
Q | Checks. |
S | see tag: strike. |
SAMP | Checks. |
SCRIPT | Ignores anything between <SCRIPT></SCRIPT>, Checks. |
SELECT | Checks. |
SMALL | Checks. |
SOUND | Ignores. |
SPACER | Ignores. |
SPAN | Checks. |
STRIKE | Checks. |
STRONG | See tag: B. |
STYLE | This is treated as it were a comment. Technically it sets info on various colours, etc. |
SUB | Checks. |
SUP | Checks. |
TABLE | Checks, starts/finishes a table. |
TBODY | Checks. |
TD | Defines a table cell. |
TEXTAREA | Ignores. |
TFOOT | Checks. |
THEAD | Checks. |
TH | Defines a table header cell. |
TITLE | Writes the title, if within <HEAD></HEAD>, Checks. |
TR | Defines a table row. |
TT | Checks. |
U | Checks. |
UL | An unordered list, Checks, type parameter used. |
VAR | See tag: Pre. |
WBR | Ignores. |
!DOCTYPE | Ignores. |
These are all the tags from the HTML specification version 4.0. Most of them, though not all, are currently fully implemented. Also included are some browser-specific tags, for either Netscape or Microsoft Internet Exploder. The full specification of HTML can be found here.
Here Checks means that for every open tag a matching closing tag is sought. In most cases the order of the closing tags is not relevant, but sometimes the output will be unexpected when tags are closed in the wrong order.
Here Ignores means that the tag is just ignored: no output is generated and no checks are performed. Mostly these will be tags that set output options that have no meaning in plain text files. What would a client-side map, for example, do in a text file?
Some tags have optional closing tags; these are ignored and not checked, eg. <tr>, <td>, <th>, <p>. Changed from previous versions is that the anchor tag (<a ...>) now always needs a closing tag, although for most name definitions it is usually left out. Tags that need a closing tag (preceded by a slash) are checked if the tag was opened before, and it is also checked whether those tags are closed in the right order. Furthermore, it is checked that tags are not nested unnecessarily (eg. bold); such nesting might indicate a missing slash in the second tag. Lots of tags are simply ignored and thus generate no output. Some tags optionally generate output. Any text after </html> is optionally ignored, mainly to prevent garbage output. Some tags cause the following text to be ignored until a closing tag appears.
Unknown tags are ignored and optionally a message is generated.
Note that this just specifies the actions taken by HTML2TEXT and not what the HTML specification says about tags; however, I have tried to implement the tags as close to the specification as possible.
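The idea behind Checks, namely looking for a matching closing tag for every open tag, can be sketched in Python as below. This is a rough illustration and not the algorithm HTML2TEXT uses: it only counts open and close tags per tag name, does not look at closing order, and the function name and the default set of checked tags are assumptions.

```python
import re
from collections import Counter

# A very small tag pattern: optional slash, tag name, then anything up to `>'.
_TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")


def unmatched_tags(html, checked=("a", "b", "i", "table")):
    """Return names of checked tags that lack a matching partner.

    A closing tag without an earlier open tag is reported as "/name";
    an open tag that is never closed is reported as "name".
    """
    open_count = Counter()
    problems = []
    for match in _TAG.finditer(html):
        is_closing = match.group(1) == "/"
        name = match.group(2).lower()
        if name not in checked:
            continue
        if not is_closing:
            open_count[name] += 1
        elif open_count[name] > 0:
            open_count[name] -= 1
        else:
            problems.append("/" + name)
    for name, count in sorted(open_count.items()):
        problems.extend([name] * count)
    return problems


if __name__ == "__main__":
    print(unmatched_tags("Some <b>bold and <i>italic</b> text"))
```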
Tables generate the following output. Every table row is written on
at least one line, and every row yields a linefeed. Table columns
are separated by at least one space or another cell separator. Some
options are implemented for tables, but not all of them currently
work very well: a row can be affected by at most one rowspan and one
colspan. Also, text won't be stretched over the full length of cells
with rowspans; the cells below will be empty instead. Colspan is
currently implemented: the content of the cell is stretched over
the number of columns given by the colspan. If linewrap is chosen,
tables are squeezed to a minimum size or stretched to the full
width. Otherwise a cell will have the length of the longest cell
in its column. The squeezing and stretching of cells is done in
quite a rough fashion.
For long tables, check the config-file for parameters that let
those be handled well too. (Who uses tables larger than 256
× 10 with cells of more than 64 KB? However, some people build
their whole pages in a table...) If you do, you will have to
increase max_rows and max_cols in the config-file. If this
is necessary, the most likely error messages are error 13 and error
14. Also possible are error 7 and error 12. Except for error 12,
these errors are fatal errors. Occasionally this may also lead to a
situation in which your machine seems not to respond. Cells that
are too long, ie. larger than 64 KB, will be truncated.
Nested tables aren't supported either, those will be treated as if
the notables option is set to on. Only the outer-table will be
formatted.
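The simplest case described above, where every cell gets the length of the longest cell in its column and columns are separated by one space, can be sketched in Python as follows. This is an illustration only: it ignores rowspan, colspan, borders, linewrap and stretching, and format_table is a hypothetical name. Like the program, it does not remove trailing spaces.

```python
def format_table(rows, separator=" "):
    """Format rows (lists of cell strings) as fixed-width text columns.

    Each column is as wide as its longest cell. Short rows are padded
    with empty cells. At least one row is expected.
    """
    column_count = max(len(row) for row in rows)
    padded = [list(row) + [""] * (column_count - len(row)) for row in rows]
    widths = [
        max(len(row[col]) for row in padded) for col in range(column_count)
    ]
    lines = []
    for row in padded:
        cells = [cell.ljust(widths[col]) for col, cell in enumerate(row)]
        lines.append(separator.join(cells))
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_table([["Error", "Description"], ["13", "Too many rows"]]))
```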
HTML2TEXT can have two kinds of output:
Note that all messages are written to standard error. This is because one needs to make a distinction between the converted text and the additional info output by HTML2TEXT. Thus any messages are written to the screen even if stdout is redirected. Standard error can be redirected as well btw (however command.com does not support it). Also the hush option will prevent output to stderr.
When the output path specified in the config file does not exist, by default, the user is asked whether the path should be created. When the answer is no or the creation did not succeed the current path is used.
HTML2TEXT sometimes uses temporary files to store data. These have names of the format H2T_xxxx.$$$, where xxxx stands for a number. Those files can safely be deleted after using HTML2TEXT. Most of the time the program handles that itself. However, the files are not deleted on fatal errors or user interruptions.
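Leftover temporary files can be cleaned up by hand or with a small script. The Python sketch below is one possible way; the function names are hypothetical, and it assumes that xxxx consists of digits only and that the files sit directly in the given directory.

```python
import os
import re

# Assumption: the xxxx part of H2T_xxxx.$$$ consists of digits only.
_TEMP_NAME = re.compile(r"H2T_[0-9]+\.\$\$\$", re.IGNORECASE)


def is_temp_file(name):
    """Return True if name looks like an HTML2TEXT temporary file."""
    return _TEMP_NAME.fullmatch(name) is not None


def remove_leftover_temp_files(directory="."):
    """Delete leftover temporary files in directory; return their names."""
    removed = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if is_temp_file(name) and os.path.isfile(path):
            os.remove(path)
            removed.append(name)
    return removed
```

Run it only when HTML2TEXT itself is not running, since the program may still need its temporary files during a conversion.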
HTML2TEXT can get input from two sources. First, it can read files
specified on the commandline (wildcards are permitted) and in the
listfile. Secondly, the input can be taken from the console; in
other words, input can be redirected from standard input. Use the -f
option to turn this on. If no filename is specified, it
expects input from standard input too. This means that programs
can `pipe' their output to HTML2TEXT and that input can also be
redirected from files. Eg.
type myhtml.htm | html2txt -f
This means that type will output the myhtml.htm file to stdout and
it will be `piped' to HTML2TXT. Practically this way will have the
same effect as
html2txt myhtml.htm -s
The output from redirection from standard input will always be sent
to standard output, that means that if it has to be sent to a file,
redirection has to be used again. Eg.
type myhtml.htm | html2txt -f > myhtml.txt
Now the output will be written to a file called myhtml.txt.
If files are specified on the command line as well, the output will
still depend on the standard_output parameter.
For usage of redirection and `piping' check the manuals of the
commandline interpreter (eg. command.com or 4dos.com).
The log file contains for each file it converts the full path of the input file and the full path of the output file. In addition to that the date and time will be printed. When standard input or output is used <stdin> and <stdout> will be printed instead. Also it contains all output written to the screen during conversion. However each line will be preceded by two double crosses (`#').
Error | Error string | Description |
---|---|---|
1 | Illegal parameter. | A command line parameter was not recognised. |
2 | No such file. | No file was found matching the file name specification. |
3 | No filename specified. | No file specification was found on the command line. |
4 | Config-file not found. | The program could not locate the config-file, which is usually found in the current directory or in the directory containing html2txt.exe. |
5 | Ini-file not found. | The program could not locate the ini-file (see error 4). |
6 | Error in ini-file. | One entry in the ini-file was not defined. |
7 | Not enough memory. | There was not enough memory to execute the program. |
8 | File could not be opened. | One input file could not be found or opened. |
9 | Error in config-file. | One entry in the config-file was not defined. |
10 | Too many entity references in ini-file. | The ini-file contains too many codes to hold in memory. Increase the value in the config file. |
11 | File skipped. | The file could not be converted. The output file already exists and do not overwrite was chosen. |
12 | Heap corrupted. | Memory is being corrupted during the conversion, mostly of tables. |
13 | Too many rows in table. | The table contains more rows than the program can keep in memory. |
14 | Too many columns in table. | The table contains more columns than the program can keep in memory. |
15 | Specified path is illegal. | The ini-file could not be found in the specified path. |
16 | Could not create temporary file. | There is not enough space on the disk or there are not enough handles to open a temporary file. |
17 | File writing error: Disk full. | An error occurred while writing a file; most likely the disk is full. |
Warning | Warning String | Description |
---|---|---|
256 | Unrecognised HTML-code. | The HTML-code was not recognised, probably not defined. |
257 | Ill-constructed HTML-code. | The HTML-code was different from the one expected, probably a closing tag forgotten or a missing `/' in a tag. |
258 | Illegal list item. | The list item or list was of an illegal type. |
259 | Semicolon expected. | The semicolon after an entity reference is missing. However there seems to be some decreasing use of the semicolon. |
260 | Illegal token. | A token was encountered which was not legal in the context. Mostly those will be > or &. |
261 | Ill-constructed entity reference. | The entity reference is not defined in the ini-file. |
262 | Misplaced tag, expected within <head>...</head>. | The tag appeared outside of the head section, usually this is the title tag. |
263 | HTML-tag starts with space. | The HTML tag starts with a space character. |
264 | Invalid list type. | The type specified for a list item or a list was illegal. |
265 | Unexpected `>' encountered. | A greater-than token was encountered without a matching less than token. |
266 | LI without list. | An LI tag was encountered outside a list section. |
267 | DD without definition list. | A DD tag was encountered outside a definition list. |
268 | DT without definition list. | A DT tag was encountered outside a definition list. |
269 | Tables within tables not supported. | A table section within a table was encountered. |
270 | Table cell truncated. | A table cell contained more than 65K data. |
271 | Output path could not be created, default path used. | The specified output path could not be used or created. Probably one of the directories in the path is a file. |
LONGFILE.BTM is a simple 4DOS batch file that uses HTML2TEXT to convert HTML-files to TEXT-files. This file will use Windows 95 long filenames. However this batch file has some limitations.
Usage:
LONGFILE.BTM <filespecification>
Where <filespecification> is the name of the file you want to
convert. Wildcards are permitted and you can have multiple file
specifications on a line.
Use this batch file at your own risk! It is not well tested. Future editions will probably have more advanced methods of using W95 long filenames. The source code is included, however, so that one can experiment with it a little.
Some people misuse HTML tags. For example, <pre>...</pre> sections are used to insert newlines in the output, or tables are used to define the layout of a whole page. In a graphical browser, such as Netscape, this may lead to the intended result. However, in a text-based system this may lead to large gaps of newlines and empty sections, and of course to narrow columns with the useful information. This is not a bug in HTML2TEXT but usually a result of poorly written HTML code, since HTML2TEXT tries to conform to the specification as well as possible. Most of the authors of such pages do not even bother to try and understand the HTML specs.
Other `weird things' happen when someone writes something similar
to the following fragment of HTML-code:
Some text and<b> this </b>is very important.
This would look like this in the output:
Some text and* this *is very important.
The spaces are on the `wrong' side of the `*'. This is just what is
done by the program (this is probably what Netscape does too, but you
cannot see it (or are the spaces slightly wider?)). The spaces are
made bold too. Do remember that most tags are not word delimiters.
Some are treated that way, though, such as <br> and <p>.
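The effect can be reproduced with a tiny Python sketch that simply swaps the bold tags for the marker characters without touching the spaces around them. This is not how HTML2TEXT is implemented; mark_bold is a hypothetical helper, and the two marker characters are assumed to be the opening and the closing marker, as set by the -b:<chars> option.

```python
import re


def mark_bold(html, markers="**"):
    """Replace <b> and </b> by the opening and closing marker."""
    opener, closer = markers[0], markers[1]
    html = re.sub(r"<b>", opener, html, flags=re.IGNORECASE)
    return re.sub(r"</b>", closer, html, flags=re.IGNORECASE)


if __name__ == "__main__":
    print(mark_bold("Some text and<b> this </b>is very important."))
```

Writing the source as "Some text and <b>this</b> is very important." puts the spaces outside the bold section and gives the expected output.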
One final word about weird output. The program does not remove trailing spaces. Mostly this is not necessary, but if it is, you'll have to do it yourself. Most good editors, however, have an option to remove them. The reason why the trailing spaces are not removed is that it would degrade performance drastically.
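If your editor cannot do it, trailing spaces can also be removed afterwards with a few lines of Python. This is a separate post-processing sketch, not a feature of HTML2TEXT; the function name is an assumption.

```python
def strip_trailing_spaces(text):
    """Remove spaces and tabs at the end of every line of text."""
    return "\n".join(line.rstrip(" \t") for line in text.split("\n"))


if __name__ == "__main__":
    converted = "Title   \nSome text \t\nLast line\n"
    print(repr(strip_trailing_spaces(converted)))
```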
In most cases the program works quite fast. However, on an older
machine it may run quite slowly. Mostly this is due to the
conversion of tables, which takes quite some time. Anyway, here are
some hints to make it run faster (some are quite obvious):
Version numbers are divided into several sections.
First there is a major version number, currently 1,
which is followed by a dot. Then there is the minor version number,
now 50. The version is followed by a single character. `a' is an
alpha release, which is never released and is only for my own use;
lots of new functions still have to be implemented or finished. `b'
is the beta release and most likely the normal release, since this
version will not change much more, except for some bug fixes. `c'
denotes an updated version of the beta release. Then there is an
internal revision number, which is increased each time changes are
implemented. In addition, the date of last editing is recorded
(in Frisian). This date should be the same as the date that your
operating system returns.
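As an illustration of this scheme, a version string could be split into its parts with the Python sketch below. The exact way the program prints its version is not given here, so the form "1.50b" (major, dot, minor, release letter) is an assumption, as is the function name; the internal revision and the date are not handled.

```python
import re

# Assumed form: <major>.<minor><letter>, e.g. "1.50b".
_VERSION = re.compile(r"([0-9]+)\.([0-9]+)([abc])")

_RELEASE_KINDS = {
    "a": "alpha",
    "b": "beta",
    "c": "updated beta",
}


def parse_version(version):
    """Return (major, minor, release kind) for a string like "1.50b"."""
    match = _VERSION.fullmatch(version)
    if match is None:
        raise ValueError("not a version string: %r" % version)
    major, minor, letter = match.groups()
    return int(major), int(minor), _RELEASE_KINDS[letter]


if __name__ == "__main__":
    print(parse_version("1.50b"))
```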
There are several ways to obtain a copy of html2text:
No, not yet. Currently it only works in DOS (3.3 or better). It should work in Windows and OS/2 DOS boxes. It should also work with most DOS emulators on other platforms.
1.50 |
|
1.21. |
|
1.20 |
|
1.10 |
|
1.02 |
|
1.01 |
|
1.00 |
|
Write to:
Gavin Spearhead
Witbreuksweg 387-302
7522 ZA Enschede
The Netherlands
Note that the previous email addresses (wieger@epsilon.nl and wieger1@noord.bart.nl) are still valid.
The latest version of this file can be found at http://www.noord.bart.nl/~wieger1/html2txt.htm.
Greetings