MIME::ParserBase - abstract class for parsing MIME streams
This is an abstract class; however, here's how one of its concrete subclasses is used:
    # Create a new parser object:
    my $parser = new MIME::Parser;

    # Parse an input stream:
    $entity = $parser->read(\*STDIN) or die "couldn't parse MIME stream";

    # Congratulations: you now have a (possibly multipart) MIME entity!
    $entity->dump_skeleton;          # for debugging
There are also some convenience methods:
    # Parse an in-core MIME message:
    $entity = $parser->parse_data($message) or die "parse";

    # Parse a MIME message in a file:
    $entity = $parser->parse_in("/some/file.msg") or die "parse";

    # Parse a MIME message out of a pipeline:
    $entity = $parser->parse_in("gunzip - < file.msg.gz |") or die "parse";

    # Parse already-split input (as "deliver" would give it to you):
    $entity = $parser->parse_two("msg.head", "msg.body") or die "parse";
In case a parse fails, it's nice to know who sent it to us. So...
    # Parse an input stream:
    if (!($entity = $parser->read(\*STDIN))) {      # oops!
        $decapitated = $parser->last_head;          # get last top-level head
    }
You can also alter the behavior of the parser:
    # Parse contained "message/rfc822" objects as nested MIME streams:
    $parser->parse_nested_messages('REPLACE');

    # Automatically attempt to RFC-1522-decode the MIME headers:
    $parser->decode_headers(1);
Cute stuff...
    # Convert a Mail::Internet object to a MIME::Entity:
    @lines = (@{$mail->header}, "\n", @{$mail->body});
    $entity = $parser->parse_data(\@lines);
Where it all begins.
This is the class that contains all the knowledge for parsing MIME streams. It's an abstract class, containing no methods governing the output of the parsed entities: such methods belong in the concrete subclasses.
You can inherit from this class to create your own subclasses that parse MIME streams into MIME::Entity objects. One such subclass, MIME::Parser, is already provided in this kit. I strongly suggest you base your application classes off of MIME::Parser instead of this class.
init() method.
Once you create a parser object, you can then set up various parameters before doing the actual parsing. Here's an example using one of our concrete subclasses:
    my $parser = new MIME::Parser;
    $parser->output_dir("/tmp");
    $parser->output_prefix("msg1");
    my $entity = $parser->read(\*STDIN);
If set false, no attempt at decoding will be done.
With no argument, just returns the current setting.
Warning: some folks already have code which assumes that no decoding is done, and since this is pretty new and radical stuff, I have initially made "off" the default setting for backwards compatibility in 2.05. However, I will possibly change this in future releases, so please: if you want a particular setting, declare it when you create your parser object.
If so, then this is the method that your subclass should invoke during init. Use it like this:
    package MyParser;

    @ISA = qw(MIME::Parser);

    ...

    sub init {
        my $self = shift;
        $self->SUPER::init(@_);        # do my parent's init
        $self->interface(ENTITY_CLASS => 'MIME::MyEntity');
        $self->interface(HEAD_CLASS   => 'MIME::MyHead');
        $self;                         # return
    }
With no VALUE, returns the VALUE currently associated with that ROLE.
    # Parse an input stream:
    $entity = $parser->read(\*STDIN);
    if (!$entity) {                                 # oops!
        my $decapitated = $parser->last_head;       # last top-level head
    }
message/rfc822: literally, the text of an embedded mail/news/whatever message.
The normal behavior is to save such a message just as if it were a text/plain document, without attempting to decode it. However, you can change this: before parsing, invoke this method with the OPTION you want:
If OPTION is false, the normal behavior will be used.
If OPTION is true, the body of the message/rfc822 part is decoded (after all, it might be encoded!) into a temporary filehandle, which is then rewound and parsed by this parser, creating an entity object. What happens then is determined by the OPTION:
message/rfc822 entity, as though the message/rfc822 were a special kind of multipart entity. However, the message/rfc822 header (and the content-type) is retained.
Warning: since it is not legal MIME for anything but multipart to have a "part", the message/rfc822 message will appear to have no content if you simply print() it out. You will have to get at the reparsed body manually, via the MIME::Entity::parts() method.
IMHO, this option is probably only useful if you're processing messages, but not saving or re-sending them. If you are saving or re-sending them, it is best not to use "parse nested" at all.
message/rfc822 entity, as though the message/rfc822 "envelope" never existed.
Warning: notice that, with this option, all the header information in the message/rfc822 header is lost. This might seriously bother you if you're dealing with a top-level message, and you've just lost the sender's address and the subject line. :-/
Thanks to Andreas Koenig for suggesting this method.
Returns a MIME::Entity, which may be a single entity, or an arbitrarily-nested multipart entity. Returns undef on failure.
Note: where the parsed body parts are stored (e.g., in-core vs. on-disk) is not determined by this class, but by the subclass you use to do the actual parsing (e.g., MIME::Parser). For efficiency, if you know you'll be parsing a small amount of data, it is probably best to tell the parser to store the parsed parts in core. For example, here's a short test program, using MIME::Parser:
    use MIME::Parser;

    my $msg = <<EOF;
    Content-type: text/html
    Content-transfer-encoding: 7bit

    <H1>Hello, world!</H1>;

    EOF
    $parser = new MIME::Parser;
    $parser->output_to_core('ALL');
    $entity = $parser->parse_data($msg);
    $entity->print(\*STDOUT);
read().
Simply give this method any expression that may be sent as the second argument to open() to open a filehandle for reading.
Returns the parsed entity, or undef on error.
parse_in(), intended for programs running under mail-handlers like deliver, which splits the incoming mail message into a header file and a body file.
Simply give this method the paths to the respective files.
Warning: it is assumed that, once the files are cat'ed together, there will be a blank line separating the head part and the body part.
Warning: new implementation slurps files into line array for portability, instead of using 'cat'. May be an issue if your messages are large.
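To make the memory concern concrete, here is a minimal sketch of the "slurp both files into one line array" approach described above. It is an illustration only, not the MIME-tools implementation; slurp_two() is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of MIME-tools): read a header file and a
# body file, and return a reference to one array of lines, head first.
# Every line of both files is held in memory at once, which is why large
# messages may be an issue.
sub slurp_two {
    my ($headfile, $bodyfile) = @_;
    my @lines;
    for my $path ($headfile, $bodyfile) {
        open(my $fh, '<', $path) or die "open $path: $!";
        push @lines, <$fh>;          # list context: slurp every line
        close($fh);
    }
    return \@lines;
}
```

The returned array reference has the same shape as the \@lines passed to parse_data() in the Mail::Internet example in the synopsis. Note that the sketch does not insert a blank line between head and body; like the method described above, it assumes one is already there.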
Returns the parsed entity, or undef on error.
The INSTREAM can be given as a readable FileHandle, a globref'd filehandle (like \*STDIN), or as any blessed object conforming to the IO:: interface.
Returns a MIME::Entity, which may be a single entity, or an arbitrarily-nested multipart entity. Returns undef on failure.
All you have to do to write a subclass is to provide or override the following methods:
new().
You don't need to override this in your subclass. If you override it, however, make sure you call the inherited method to init your parents!
    package MyParser;

    @ISA = qw(MIME::ParserBase);

    ...

    sub init {
        my $self = shift;
        $self->SUPER::init(@_);        # do my parent's init

        # ...my init stuff goes here...

        $self;                         # return
    }
Should return the self object on success, and undef on failure.
If you want the parser to do something other than write its parts out to files, you should override this method in a subclass. For an example, see MIME::Parser.
Note: the reason that we don't use the "interface" mechanism for this is that your choice of (1) which body class to use, and (2) how its new() method is invoked, may be very much based on the information in the header.
You are of course free to override any other methods as you see fit, like new.
This is an abstract class. If you actually want to parse a MIME stream, use one of the children of this class, like the backwards-compatible MIME::Parser.
A better solution for this case would be to set up some form of state machine for input processing. This will be left for future versions.
The revised implementation uses a temporary file (a la tmpfile()) during parsing to hold the encoded portion of the current MIME document or part. This file is deleted automatically after the current part is decoded and the data is written to the "body stream" object; you'll never see it, and should never need to worry about it.
Some folks have asked for the ability to bypass this temp-file mechanism, I suppose because they assume it would slow down their application. I considered accommodating this wish, but the temp-file approach solves a lot of thorny problems in parsing, and it also protects against hidden bugs in user applications (what if you've directed the encoded part into a scalar, and someone unexpectedly sends you a 6 MB tar file?). Finally, I'm just not convinced that the temp-file use adds significant overhead.
"\r\n"
). However, it is extremely likely that folks will want to
parse MIME streams where each line ends in the local newline
character "\n"
instead.
An attempt has been made to allow the parser to handle both CRLF and newline-terminated input.
"7bit"
and "8bit"
decoders will decode both
a "\n"
and a "\r\n"
end-of-line sequence into a "\n"
.
The "binary" decoder (default if no encoding specified) still outputs stuff verbatim... so a MIME message with CRLFs and no explicit encoding will be output as a text file that, on many systems, will have an annoying ^M at the end of each line... but this is as it should be.
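The end-of-line behavior described above can be illustrated with a small sketch. The two subroutine names are hypothetical and this is not the MIME::Decoder code; it only shows the difference between normalizing line endings and passing data through verbatim.

```perl
use strict;
use warnings;

# Hypothetical helpers, not the MIME::Decoder implementation.

# "7bit"/"8bit" style: map a trailing CRLF or bare LF to a single "\n".
sub decode_text_line {
    my ($line) = @_;
    $line =~ s/\r?\n\z/\n/;
    return $line;
}

# "binary" style: pass the data through untouched, CR included.
sub decode_binary_chunk {
    my ($chunk) = @_;
    return $chunk;
}
```

With these, a CRLF-terminated line comes out of decode_text_line() ending in a bare "\n", while decode_binary_chunk() leaves the "\r" in place, which is the source of the ^M seen on many systems.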
If your mailer creates multipart boundary strings that contain newlines when they appear in the message body, give it two weeks' notice and find another one. If your mail robot receives MIME mail like this, regard it as syntactically incorrect MIME, which it is.
Why do I say that? Well, in RFC-1521, the syntax of a boundary is given quite clearly:
boundary := 0*69<bchars> bcharsnospace
bchars := bcharsnospace / " "
bcharsnospace := DIGIT / ALPHA / "'" / "(" / ")" / "+" /"_" / "," / "-" / "." / "/" / ":" / "=" / "?"
All of which means that a valid boundary string cannot have newlines in it, and any newlines in such a string in the message header are expected to be solely the result of folding the string (i.e., inserting to-be-removed newlines for readability and line-shortening only).
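As a rough illustration, the grammar quoted above translates into a regular expression fairly directly. The helper name is_valid_boundary() is hypothetical (it is not part of this package), and the sketch assumes the boundary value has already been unfolded and stripped of its surrounding quotes.

```perl
use strict;
use warnings;

# Character classes taken from the RFC-1521 grammar quoted above.
my $BCHARS_NOSPACE = qr{[0-9A-Za-z'()+_,\-./:=?]};
my $BCHARS         = qr{[0-9A-Za-z'()+_,\-./:=? ]};

# Hypothetical helper: true (1) if $boundary matches
#     boundary := 0*69<bchars> bcharsnospace
# that is, 1 to 70 legal characters, the last of which is not a space.
sub is_valid_boundary {
    my ($boundary) = @_;
    return 0 unless defined $boundary;
    return $boundary =~ /\A(?:$BCHARS){0,69}$BCHARS_NOSPACE\z/ ? 1 : 0;
}
```

Since neither character class admits "\r" or "\n", any boundary value that still contains a newline after unfolding is rejected.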
Yet, there is at least one brain-damaged user agent out there that composes mail like this:
    MIME-Version: 1.0
    Content-type: multipart/mixed; boundary="----ABC-
     123----"
    Subject: Hi... I'm a dork!

    This is a multipart MIME message (yeah, right...)

    ----ABC-
     123----

    Hi there!
We have got to discourage practices like this (and the recent file upload idiocy where binary files that are part of a multipart MIME message aren't base64-encoded) if we want MIME to stay relatively simple, and MIME parsers to be relatively robust.
Thanks to Andreas Koenig for bringing a baaaaaaaaad user agent to my attention.
If anyone wants to test out this package's handling of both binary and textual email on a system where binmode() is not a NOOP, I would be most grateful. If stuff breaks, send me the pieces (including the original email that broke it, and at the very least a description of how the output was screwed up).
RFC-1521 gives us the following BNF grammar for the body of a multipart MIME message:
    multipart-body  := preamble 1*encapsulation close-delimiter epilogue

    encapsulation   := delimiter body-part CRLF

    delimiter       := "--" boundary CRLF ; taken from Content-Type field.
                                          ; There must be no space between "--"
                                          ; and boundary.

    close-delimiter := "--" boundary "--" CRLF ; Again, no space by "--"

    preamble        := discard-text       ; to be ignored upon receipt.

    epilogue        := discard-text       ; to be ignored upon receipt.

    discard-text    := *(*text CRLF)

    body-part       := <"message" as defined in RFC 822, with all
                        header fields optional, and with the specified
                        delimiter not occurring anywhere in the message
                        body, either on a line by itself or as a
                        substring anywhere.  Note that the semantics of
                        a part differ from the semantics of a message,
                        as described in the text.>
From this we glean the following algorithm for parsing a MIME stream:
    PROCEDURE parse
    INPUT
        A FILEHANDLE for the stream.
        An optional end-of-stream OUTER_BOUND (for a nested multipart message).

    RETURNS
        The (possibly-multipart) ENTITY that was parsed.
        A STATE indicating how we left things: "END" or "ERROR".

    BEGIN
        LET OUTER_DELIM = "--OUTER_BOUND".
        LET OUTER_CLOSE = "--OUTER_BOUND--".

        LET ENTITY = a new MIME entity object.
        LET STATE  = "OK".

        Parse the (possibly empty) header, up to and including the
        blank line that terminates it.  Store it in the ENTITY.

        IF the MIME type is "multipart":
            LET INNER_BOUND = get multipart "boundary" from header.
            LET INNER_DELIM = "--INNER_BOUND".
            LET INNER_CLOSE = "--INNER_BOUND--".

            Parse preamble:
                REPEAT:
                    Read (and discard) next line
                UNTIL (line is INNER_DELIM) OR we hit EOF (error).

            Parse parts:
                REPEAT:
                    LET (PART, STATE) = parse(FILEHANDLE, INNER_BOUND).
                    Add PART to ENTITY.
                UNTIL (STATE != "DELIM").

            Parse epilogue:
                REPEAT (to parse epilogue):
                    Read (and discard) next line
                UNTIL (line is OUTER_DELIM or OUTER_CLOSE) OR we hit EOF
                LET STATE = "EOF", "DELIM", or "CLOSE" accordingly.

        ELSE (if the MIME type is not "multipart"):
            Open output destination (e.g., a file)

            DO:
                Read, decode, and output data from FILEHANDLE
            UNTIL (line is OUTER_DELIM or OUTER_CLOSE) OR we hit EOF.
            LET STATE = "EOF", "DELIM", or "CLOSE" accordingly.

        ENDIF

        RETURN (ENTITY, STATE).
    END
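For readers who prefer code to pseudocode, here is a minimal pure-Perl sketch of the procedure above. It is not the MIME::ParserBase implementation, and it rests on several assumptions: the input is an array ref of lines rather than a filehandle, entities are plain hashes rather than MIME::Entity objects, no transfer decoding is done, and preamble and epilogue lines are kept instead of discarded. All names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sketch: parse(\@lines, $outer_bound) consumes lines from the
# array and returns ($entity, $state); $state is "EOF", "DELIM" or "CLOSE".
sub parse {
    my ($lines, $outer_bound) = @_;
    my ($outer_delim, $outer_close);
    if (defined $outer_bound) {
        $outer_delim = "--$outer_bound";
        $outer_close = "--$outer_bound--";
    }
    my $entity = { head => {}, preamble => [], parts => [],
                   body => [], epilogue => [] };

    # Classify a line against the outer boundary (undef line means EOF).
    my $classify = sub {
        my ($line) = @_;
        return 'EOF' unless defined $line;
        return 'OK'  unless defined $outer_bound;
        (my $bare = $line) =~ s/\r?\n\z//;
        return 'DELIM' if $bare eq $outer_delim;
        return 'CLOSE' if $bare eq $outer_close;
        return 'OK';
    };

    # Parse the (possibly empty) header, up to and including the blank line.
    my $last;
    while (defined(my $line = shift @$lines)) {
        $line =~ s/\r?\n\z//;
        last if $line eq '';
        if ($line =~ /\A[ \t]/ && defined $last) {    # folded continuation
            $entity->{head}{$last} .= $line;
        }
        elsif ($line =~ /\A([^:\s]+):\s*(.*)\z/s) {
            $last = lc $1;
            $entity->{head}{$last} = $2;
        }
    }

    my $type  = $entity->{head}{'content-type'} // 'text/plain';
    my $state = 'EOF';
    if ($type =~ m{\Amultipart/}i) {
        my ($inner_bound) = $type =~ /boundary\s*=\s*"?([^";]+)"?/i;
        die "multipart without boundary\n" unless defined $inner_bound;
        my $inner_delim = "--$inner_bound";

        # Preamble: everything up to the first inner delimiter.
        my $found = 0;
        while (defined(my $line = shift @$lines)) {
            (my $bare = $line) =~ s/\r?\n\z//;
            if ($bare eq $inner_delim) { $found = 1; last }
            push @{ $entity->{preamble} }, $line;
        }
        die "unexpected EOF in preamble\n" unless $found;

        # Parts: recurse until a part ends on something other than a delimiter.
        my $part_state;
        do {
            (my $part, $part_state) = parse($lines, $inner_bound);
            push @{ $entity->{parts} }, $part;
        } until $part_state ne 'DELIM';

        # Epilogue: everything up to the outer boundary or EOF.
        while (1) {
            my $line = shift @$lines;
            $state = $classify->($line);
            last if $state ne 'OK';
            push @{ $entity->{epilogue} }, $line;
        }
    }
    else {
        # Single part: collect body lines up to the outer boundary or EOF.
        while (1) {
            my $line = shift @$lines;
            $state = $classify->($line);
            last if $state ne 'OK';
            push @{ $entity->{body} }, $line;
        }
    }
    return ($entity, $state);
}
```

Calling parse(\@lines) with no second argument treats the input as a top-level message, so the returned state is then always "EOF". Working from an in-memory array keeps the sketch short; a real parser would read from a stream and hand each body to a decoder.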
For reasons discussed in MIME::Entity, we can't just discard the "discard text": some mailers actually put data in the preamble.
Copyright (c) 1996, 1997 by Eryq / eryq@zeegee.com
All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
$Revision: 4.109 $ $Date: 1998/02/12 03:11:27 $