home *** CD-ROM | disk | FTP | other *** search
- """HTTP cookie handling for web clients.
-
- ClientCookie is a Python module for handling HTTP cookies on the client
- side, useful for accessing web sites that require cookies to be set and
- then returned later. It also provides some other (optional) useful stuff:
- HTTP-EQUIV handling, zero-time Refresh handling, and lazily-seekable
- responses. It has developed from a port of Gisle Aas' Perl module
- HTTP::Cookies, from the libwww-perl library.
-
- Cookies are a general mechanism which server side connections can use to
- both store and retrieve information on the client side of the connection.
- For more information about cookies, refer to the following:
-
- http://www.netscape.com/newsref/std/cookie_spec.html
- http://www.cookiecentral.com/
-
- This module also implements the new style cookies described in RFC 2965.
- The two variants of cookies are supposed to be able to coexist happily.
- RFC 2965 handling can be switched off completely if required.
-
- http://www.ietf.org/rfc/rfc2965.txt
-
-
- Examples
- --------------------------------------------------------------------------
-
- import ClientCookie
- response = ClientCookie.urlopen("http://foo.bar.com/")
-
- This function behaves identically to urllib2.urlopen, except that it deals
- with cookies automatically. That's probably all you need to know.
-
- Here is a more complicated example, involving Request objects (useful if
- you want to pass Requests around, add headers to them, etc.):
-
- import ClientCookie
- import urllib2
- request = urllib2.Request("http://www.acme.com/")
- # note we're using the urlopen from ClientCookie, not urllib2
- response = ClientCookie.urlopen(request)
- # let's say this next request requires a cookie that was set in response
- request2 = urllib2.Request("http://www.acme.com/flying_machines.html")
- response2 = ClientCookie.urlopen(request2)
-
- In these examples, the workings are hidden inside the ClientCookie.urlopen
- method, which is an extension of urllib2.urlopen. Redirects, proxies and
- cookies are handled automatically by this function. Other, lower-level,
- cookie-aware extensions of urllib2 callables provided are: build_opener,
- install_opener, HTTPHandler and HTTPSHandler (if your Python installation
- has HTTPS support). A bugfixed HTTPRedirectHandler is also included (the
- bug, related to redirection, should be fixed in 2.3, but hasn't been yet).
- Note that extraction and setting of RFC 2965 cookies (but not Netscape
- cookies) is currently turned off during automatic urllib2 redirects (until
- I figure out exactly when they're allowed).
-
- An example at a slightly lower level shows what the module is doing more
- clearly:
-
- import ClientCookie
- import urllib2
- request = urllib2.Request("http://www.acme.com/")
- response = urllib2.urlopen(request)
- c = ClientCookie.Cookies()
- c.extract_cookies(response, request)
- # let's say this next request requires a cookie that was set in response
- request2 = urllib2.Request("http://www.acme.com/flying_machines.html")
- c.add_cookie_header(request2)
- response2 = urllib2.urlopen(request2)
-
- print response2.geturl()
- print response2.info() # headers
- for line in response2.readlines(): # body
- print line
-
- The Cookie class does all the work. There are essentially two operations:
- extract_cookies extracts HTTP cookies from Set-Cookie (the original
- Netscape cookie standard) and Set-Cookie2 (RFC 2965) headers from a
- response if and only if they should be set given the request, and
- add_cookie_header adds Cookie headers if and only if they are appropriate
- for a particular HTTP request. Incoming cookies are checked for
- acceptability based on the host name, etc. Cookies are only set on
- outgoing requests if they match the request's host name, path, etc.
- Cookies may be also be saved to and loaded from a file.
-
- Note that if you're using ClientCookie.urlopen (or
- ClientCookie.HTTPHandler or ClientCookie.HTTPSHandler), you don't need to
- call extract_cookies or add_cookie header yourself. If, on the other
- hand, you don't want to use urllib2, you will need to use this pair of
- methods. You can make your own request and response objects, which must
- support the interfaces described in the docstrings of extract_cookies and
- add_cookie_header.
-
-
- Important note
- --------------------------------------------------------------------------
-
- The distribution includes some associated modules (_HTTPDate and
- _HeadersUtil) upon which ClientCookie depends, also ported from
- libwww-perl. These associated modules may change or disappear with time,
- so don't rely on them staying put. Anything you can import directly from
- the ClientCookie package, and that doesn't start with a single underscore,
- will not go away.
-
-
- Cooperating with Netscape and Internet Explorer
- --------------------------------------------------------------------------
-
- The subclass NetscapeCookies differs from Cookies only in storing cookies
- using a different, Netscape-compatible, file format. This Netscape-
- compatible format loses some information when you save cookies to a file.
- Cookies itself uses a libwww-perl specific format (`Set-Cookie3'). Python
- and Netscape should be able to share a cookies file (note that the file
- location here will differ on non-unix OSes):
-
- import os
- home = os.environ["HOME"]
- cookies = NetscapeCookies(
- file=os.path.join(home, "/.netscape/cookies.txt"),
- autosave=1)
-
- XXX Does this work when Netscape is running? Probably not.
-
- MSIECookies (UNTESTED!) does the same for Microsoft Internet Explorer
- (MSIE) 5.x and 6.x on Windows, but does not allow saving cookies (because
- nobody has fully decoded the file format).
-
-
- Using your own Cookies instance
- --------------------------------------------------------------------------
-
- If you want to use the higher-level urllib2-like interface, but need to
- get at the cookies (usually only needed for debugging, or saving cookies
- between sessions) and/or pass arguments to the Cookies constructor,
- HTTPHandler and HTTPSHandler accept a Cookies instance in their cookies
- keyword argument.
-
- The urlopen function uses an OpenerDirector instance to do its work, so if
- you want to use urlopen, install your own OpenerDirector using the
- ClientCookie.install_opener function, then proceed as usual:
-
- import ClientCookie
- from ClientCookie import Cookies
- cookies = Cookies(netscape_only=1, blocked_domains=["doubleclick.net"])
- # Build an OpenerDirector that uses an HTTPHandler that uses the cookies
- # instance we've just made. build_opener will add other handlers (such
- # as FTPHandler) automatically, so we only have to pass an HTTPHandler.
- opener = ClientCookie.build_opener(ClientCookie.HTTPHandler(cookies))
- ClientCookie.install_opener(opener)
- r = urlopen("http://www.adverts-r-us.co.uk/")
-
- Note that the OpenerDirector instance used by urlopen is global, and
- shares the Cookies instance you pass in: all code that uses
- ClientCookie.urlopen will therefore be sharing the same set of cookies.
- If you don't want global cookies, build your own OpenerDirector object
- using ClientCookie.build_opener as shown above, then use it directly
- instead of calling urlopen:
-
- r = opener.open("http://acme.com/") # GET
- r = opener.open("http://acme.com/", data) # POST
-
-
- Optional goodies: HTTP-EQUIV, Refresh and seekable responses
- --------------------------------------------------------------------------
-
- These are implemented as three arguments to the HTTPHandler and
- HTTPSHandler constructors. Example code is below.
-
- seekable_responses:
-
- By default, ClientCookie's response objects are seekable. Seeking is done
- lazily (ie. the response object only reads from the socket as necessary,
- rather than slurping in all the data before the response is returned to
- you), but if you don't want it, you can turn it off.
-
- handle_http_equiv:
-
- The <META HTTP-EQUIV> tag is a way of including data in HTML to be treated
- as if it were part of the HTTP headers. ClientCookie can automatically
- read these tags and add the HTTP-EQUIV headers to the response object's
- real HTTP headers. The HTML is left unchanged.
-
- handle_refresh:
-
- The Refresh HTTP header is a non-standard header which is widely used. It
- requests that the user-agent follow a URL after a specified time delay.
- ClientCookie can treat these headers (which may have been set in <META
- HTTP-EQUIV> tags) as if they were 302 redirections.
-
- import ClientCookie
- from ClientCookie import Cookies
- cookies = Cookies()
- hh = ClientCookie.HTTPHandler(cookies,
- seekable_responses=0, handle_refresh=1)
- opener = ClientCookie.build_opener(hh)
- opener.open("http://www.rhubarb.com/")
-
-
- Adding headers
- --------------------------------------------------------------------------
-
- Adding headers is done like so:
-
- import ClientCookie, urllib2
- req = urllib2.Request("http://foobar.com/")
- req.add_header("Referer", "http://wwwsearch.sourceforge.net/ClientCookie/")
- r = ClientCookie.urlopen(req)
-
- You can also use the headers argument to the urllib2.Request constructor.
- Removing headers should not be necessary, because urllib2.Request objects
- start out with no headers.
-
-
- Changing the automatically-added headers (User-Agent)
- --------------------------------------------------------------------------
-
- urllib2.OpenerDirector automatically adds a User-Agent header to every
- request.
-
- Again, since ClientCookie.urlopen uses an OpenerDirector instance, you
- need to install your own OpenerDirector using the
- ClientCookie.install_opener function to change this behaviour.
-
- import ClientCookie
- cookies = ClientCookie.Cookies()
- opener = ClientCookie.build_opener(ClientCookie.HTTPHandler(cookies))
- opener.add_headers = [("User-agent", "Mozilla/4.76")]
- ClientCookie.install_opener(opener)
- r = urlopen("http://acme.com/")
-
- Again, you can always call opener.open directly (instead of urlopen) if
- you don't want global cookies.
-
-
- Debugging
- --------------------------------------------------------------------------
-
- First, a few common problems. The most frequent mistake people seem to
- make is to use ClientCookie.urlopen, *and* the extract_cookies and
- add_cookie_header methods on a cookie object themselves. If you use
- ClientCookie.urlopen, the module handles extraction and adding of cookies
- by itself, so you should not call extract_cookies or add_cookie_header.
-
- If things don't seem to be working as expected, the first thing to try is
- to switch off RFC 2965 handling, using the netscape_only argument to the
- Cookies constructor. This is because few browsers implement it, so it is
- likely that some servers incorrectly implement it. This switch is also
- useful because ClientCookie does not yet fully implement redirects with
- RFC 2965 cookies. 2965 cookies are always switched off during redirects,
- while the standard allows setting and returning cookies under some
- circumstances, which will probably cause some servers to refuse to provide
- content. XXX actually, there are probably almost no RFC 2965 servers out
- there, at least ATM...
-
- Are you sure the server is sending you any cookies in the first place?
- Maybe the server is keeping track of state in some other way (HIDDEN HTML
- form entries (possibly in a separate page referenced by a frame),
- URL-encoded session keys, IP address)? Perhaps some embedded script in
- the HTML is setting cookies (see below)? Maybe you messed up your
- request, and the server is sending you some standard failure page (even if
- the page doesn't appear to indicate any failure). Sometimes, a server
- wants particular headers set to the values it expects, or it won't play
- nicely. The most frequent offenders here are the Referer [sic] and / or
- User-Agent HTTP headers. See above for how to change the value of the
- User-Agent header; otherwise, use Request.add_header. The User-Agent
- header may need to be set to a value like that above. The Referer header
- may need to be set to the URL that the server expects you to have followed
- a link from. Occasionally, it may even be that operators deliberately
- configure a server to insist on precisely the headers that the big two
- browsers (MS Internet Explorer and Netscape) generate, but remember that
- incompetence (possibly on your part) is more probable than deliberate
- sabotage.
-
- When you save to a file, single-session cookies will expire unless you
- explicitly request otherwise by setting ignore_discard to true in the
- Cookies constructor. This may be your problem if you find cookies are
- going away after saving and loading.
-
- If none of the advice above seems to solve your problem, the last resort
- is to compare the headers and data that you are sending out with those
- that a browser emits. Of course, you'll want to check that the browser is
- able to do manually what you're trying to achieve programatically before
- minutely examining the headers. Make sure that what you do manually is
- *exactly* the same as what you're trying to do from Python -- you may
- simply be hitting a server bug that only gets revealed if you view pages
- in a particular order, for example. In order to see what your browser is
- sending to the server, you can use a TCP network sniffer (netcat --
- usually installed as nc, or ethereal, for example), or use a feature like
- lynx's -trace switch. If nothing is obviously wrong with the requests
- your program is sending, you may have to temporarily switch to sending
- HTTP headers (with httplib). Start by copying Netscape or IE slavishly
- (apart from session IDs, etc., of course), then begin the tedious process
- of mutating your headers and data until they match what your higher-level
- code was sending. This will reliably find your problem.
-
- You can globally turn on display of HTTP headers:
-
- import ClientCookie
- ClientCookie.HTTP_DEBUG = 1
-
- (Note that doing this won't work:
-
- from ClientCookie import HTTP_DEBUG
- HTTP_DEBUG = 1
-
- If you don't understand that, you've misunderstood what the = operator
- does.)
-
- Alternatively, you can examine your individual request and response
- objects to see what's going on. ClientCookie's responses are seek()able
- unless you request otherwise.
-
- If you would like to see what is going on in ClientCookie's tiny mind, do
- this:
-
- ClientCookie.CLIENTCOOKIE_DEBUG = 1
-
-
- Embedded script that sets cookies
- --------------------------------------------------------------------------
-
- It is possible to embed script in HTML pages (within <SCRIPT>here</SCRIPT>
- tags) -- Javascript / ECMAScript, VBScript, or even Python -- that causes
- cookies to be set in a browser. If you come across this in a page you
- want to automate, you have three options. Here they are, roughly in order
- of simplicity. First, you can simply figure out what the embedded script
- is doing and imitate it by manually adding cookies to your Cookies
- instance. Second, if you're working on a Windows machine (or another
- platform where the MSHTML COM library is available) you could give up the
- fight and automate Microsoft Internet Explorer (MSIE) with COM. Third,
- you could get ambitious and delegate the work to an appropriate
- interpreter (Netscape's Javascript interpreter, for instance).
-
-
- Parsing HTTP date strings
- --------------------------------------------------------------------------
-
- A function named str2time is provided by the package, which may be useful
- for parsing dates in HTTP headers. str2time is intended to be liberal,
- since HTTP date/time formats are poorly standardised in practice. There
- is no need to use this function in normal operations: Cookies instances
- keep track of cookie lifetimes automatically. This function will stay
- around in some form, though the supported date/time formats may change.
-
-
- A final note: docstrings, comments and debug strings in this code refer to
- the attributes of the HTTP cookie system as cookie-attributes, to
- distinguish them clearly from Python attributes.
-
-
- Copyright 1997-1999 Gisle Aas
- Copyright 2002-2003 Johnny Lee <typo_pl@hotmail.com> (MSIE Perl code)
- Copyright 2002-2003 John J Lee <jjl@pobox.com> (The Python port)
-
- This code is free software; you can redistribute it and/or modify it under
- the terms of the MIT License (see the file COPYING included with the
- distribution).
-
- """
-
- # XXXX
- # Check two-component urls work.
- # Write new 1.5.2 code to test ClientCookie on new site (Yahoo mail?).
- # Test urllib / urllib2 and ClientCookie with 1.5.2.
- # Fix XXXXs and test MSIECookies
-
- VERSION = "0.3.1b" # based on Gisle Aas's CVS revision 1.24, libwww-perl 5.64
-
-
- # These quotes from the RFC are sitting here for when I fix the redirect
- # behaviour (see TODO file).
-
- # Redirects: RFC 2965, section 3.3.6:
- #------------------------------------
- # An unverifiable transaction is to a third-party host if its request-
- # host U does not domain-match the reach R of the request-host O in the
- # origin transaction.
-
- # When it makes an unverifiable transaction, a user agent MUST disable
- # all cookie processing (i.e., MUST NOT send cookies, and MUST NOT
- # accept any received cookies) if the transaction is to a third-party
- # host.
-
- # request-host: RFC 2965, section 1:
- # Host name (HN) means either the host domain name (HDN) or the numeric
- # Internet Protocol (IP) address of a host. The fully qualified domain
- # name is preferred; use of numeric IP addresses is strongly
- # discouraged.
-
- # The terms request-host and request-URI refer to the values the client
- # would send to the server as, respectively, the host (but not port)
- # and abs_path portions of the absoluteURI (http_URL) of the HTTP
- # request line. Note that request-host is a HN.
-
- # Reach: RFC 2965, section 1:
-
- # The reach R of a host name H is defined as follows:
- #
- # * If
- #
- # - H is the host domain name of a host; and,
- #
- # - H has the form A.B; and
- #
- # - A has no embedded (that is, interior) dots; and
- #
- # - B has at least one embedded dot, or B is the string "local".
- # then the reach of H is .B.
-
- # * Otherwise, the reach of H is H.
-
- import sys, os, re, urlparse, string, socket, copy, struct, htmllib, formatter
- from urllib2 import URLError
- from time import time
-
- import ClientCookie
- from ClientCookie._HTTPDate import str2time, time2isoz
- from ClientCookie._HeadersUtil import split_header_words, join_header_words
- from ClientCookie._Util import startswith, endswith
- from ClientCookie._Debug import debug
-
- try: True
- except NameError:
- True = 1
- False = 0
-
- CHUNK = 1024 # size of chunks fed to HTML HEAD parser, in bytes
- MISSING_FILENAME_TEXT = ("a filename was not supplied (nor was the Cookies "
- "instance initialised with one)")
-
-
- SPACE_DICT = {}
- for c in string.whitespace:
- SPACE_DICT[c] = None
- del c
- def isspace(string):
- for c in string:
- if not SPACE_DICT.has_key(c): return False
- return True
-
- def getheaders(msg, name):
- """Get all values for a header.
-
- This returns a list of values for headers given more than once; each
- value in the result list is stripped in the same way as the result of
- getheader(). If the header is not given, return an empty list.
- """
- result = []
- current = ''
- have_header = 0
- for s in msg.getallmatchingheaders(name):
- if isspace(s[0]):
- if current:
- current = "%s\n %s" % (current, string.strip(s))
- else:
- current = string.strip(s)
- else:
- if have_header:
- result.append(current)
- current = string.strip(s[string.find(s, ":") + 1:])
- have_header = 1
- if have_header:
- result.append(current)
- return result
-
-
- IPV4_RE = re.compile(r"\.\d+$")
- def is_HDN(text):
- """Return True if text is a host domain name."""
- # XXX
- # This may well be wrong. Which RFC is HDN defined in, if any?
- # For the current implementation, what about IPv6? Remember to look
- # at other uses of IPV4_RE also, if change this.
- if IPV4_RE.search(text):
- return False
- if text == "":
- return False
- if text[0] == "." or text[-1] == ".":
- return False
- return True
-
- def domain_match(A, B):
- """Return True if domain A domain-matches domainB, according to RFC 2965.
-
- A and B may be host domain names or IP addresses.
-
- RFC 2965, section 1:
-
- Host names can be specified either as an IP address or a HDN string.
- Sometimes we compare one host name with another. (Such comparisons SHALL
- be case-insensitive.) Host A's name domain-matches host B's if
-
- * their host name strings string-compare equal; or
-
- * A is a HDN string and has the form NB, where N is a non-empty
- name string, B has the form .B', and B' is a HDN string. (So,
- x.y.com domain-matches .Y.com but not Y.com.)
-
- Note that domain-match is not a commutative operation: a.b.c.com
- domain-matches .c.com, but not the reverse.
-
- """
- # Note that, if A or B are IP addresses, the only relevant part of the
- # definition of the domain-match algorithm is the direct string-compare.
- A = string.lower(A)
- B = string.lower(B)
- if A == B:
- return True
- if not is_HDN(A):
- return False
- i = string.rfind(A, B)
- if i == -1 or i == 0:
- # A does not have form NB, or N is the empty string
- return False
- if not startswith(B, "."):
- return False
- if not is_HDN(B[1:]):
- return False
- return True
-
- def liberal_is_HDN(text):
- """Return True if text is a host domain name; for blocking domains."""
- if IPV4_RE.search(text):
- return False
- return True
-
- def liberal_domain_match(A, B):
- """For blocking domains.
-
- A and B may be host domain names or IP addresses.
-
- "" is matched by everything, including all IP addresses:
-
- assert liberal_domain_match(whatever, "")
-
- """
- A = string.lower(A)
- B = string.lower(B)
- if B == "":
- return True
- if A == B:
- return True
- if not (liberal_is_HDN(A) and liberal_is_HDN(B)):
- return False
- if endswith(A, B):
- return True
- return False
-
- ## # XXXX I'm pretty sure this is incorrect -- we only need to check the
- ## # original request's absoluteURL.
- ## cut_port_re = re.compile(r":\d+$")
- ## def request_host(request):
- ## header = request.headers.get("Host")
- ## if header is not None:
- ## host = header
- ## else:
- ## # XXX I think these two actually do the same thing, essentially, but
- ## # I'm not sure of the precise semantics of request.get_host.
- ## # Actually, I think urllib2's behaviour here is wrong (SF Python bug
- ## # 413135).
- ## url = request.get_full_url()
- ## host = urlparse.urlparse(url)[1]
- ## #host = request.get_host()
-
- ## return cut_port_re.sub("", host, 1) # remove port, if present
-
- cut_port_re = re.compile(r":\d+$")
- def request_host(request):
- url = request.get_full_url()
- host = urlparse.urlparse(url)[1]
- if host == "":
- host = request.headers.get("Host", "")
-
- return cut_port_re.sub("", host, 1) # remove port, if present
-
- def request_path(request):
- url = request.get_full_url()
- #scheme, netloc, path, parameters, query, frag = urlparse.urlparse(url)
- req_path = normalize_path(string.join(urlparse.urlparse(url)[2:], ""))
- if not startswith(req_path, "/"):
- # fix bad RFC 2396 absoluteURI
- req_path = "/"+req_path
- return req_path
-
- unescape_re = re.compile(r"%([0-9a-fA-F][0-9a-fA-F])")
- normalize_re = re.compile(r"([\0-\x20\x7f-\xff])")
- def normalize_path(path):
- """Normalise URI path so that plain string compare can be used.
-
- >>> normalize_path("%19\xd3%Fb%2F%25%26")
- '%19%D3%FB%2F%25&'
- >>>
-
- In normalised form, all non-printable characters are %-escaped, and all
- printable characters are given literally (not escaped). All remaining
- %-escaped characters are capitalised. %25 and %2F are special-cased,
- because they represent the printable characters "%" and "/", which are used
- as escape and URI path separator characters respectively.
-
- """
- def unescape_fn(match):
- x = string.upper(match.group(1))
- if x == "2F" or x == "25":
- return "%%%s" % (x,)
- else:
- # string.atoi deprecated in 2.0, but 1.5.2 int function won't do
- # radix conversion
- return struct.pack("B", string.atoi(x, 16))
- def normalize_fn(match):
- return "%%%02X" % ord(match.group(1))
- path = unescape_re.sub(unescape_fn, path)
- path = normalize_re.sub(normalize_fn, path)
- return path
-
-
- class Cookies:
- """Collection of HTTP cookies.
-
- The major methods are extract_cookies and add_cookie_header; these are all
- you are likely to need. In fact, you probably don't even need to know
- about this class: use the cookie-aware extensions to the urllib2 callables
- provided by this module: urlopen in particular (and perhaps also
- build_opener, HTTPHandler, HTTPSHandler (only if your Python has https
- support compiled in), and HTTPRedirectHandler).
-
- You can give a sequence of domain names from which we never accept cookies,
- nor return cookies to. Use the blocked_domains argument to the
- constructor, or use the blocked_domains and set_blocked_domains methods.
- Note that all domains which end with elements of blocked_domains are
- blocked. IP addresses are an exception, and must match exactly. For
- example, if blocked_domains == ["acme.com", "roadrunner.org",
- "192.168.1.2", ".168.1.2"], then "www.acme.com", "acme.com",
- "roadrunner.org" and 192.168.1.2 are all blocked, but 193.168.1.2 is not
- blocked.
-
- Methods:
-
- Cookies(filename=None,
- autosave=False, ignore_discard=False,
- hide_cookie2=False, netscape_only=False,
- blocked_domains=None)
- add_cookie_header(request)
- extract_cookies(response, request)
- set_cookie(version, key, val, path, domain, port, path_spec,
- secure, maxage, discard, rest=None)
- blocked_domains()
- set_blocked_domains(blocked_domains)
- save(filename=None)
- load(filename=None)
- revert(filename=None)
- clear(domain=None, path=None, key=None)
- clear_temporary_cookies()
- scan(callback)
- as_string(skip_discard=False) (str(cookie) also works)
-
-
- Public attributes
-
- cookies: a three-level dictionary [domain][path][key]; you probably don't
- need to use this
- filename: default filename for saving cookies
- autosave: save cookies on instance destruction
- ignore_discard: save even cookies that are requested to be discarded
-
- """
- non_word_re = re.compile(r"\W")
- quote_re = re.compile(r"([\"\\])")
- port_re = re.compile(r"^_?\d+(?:,\d+)*$")
- domain_re = re.compile(r"[^.]*")
- dots_re = re.compile(r"^\.+")
-
- ## # for Netscape protocol
- ## # XXXX complete?
- ## special_toplevel_domains = ("com", "edu", "gov", "int", "mil", "net", "org")
-
- def __init__(self, filename=None,
- autosave=False, ignore_discard=False,
- hide_cookie2=False, netscape_only=False,
- blocked_domains=None, delayload=False):
- """
- filename: name of file in which to save and restore cookies
- autosave: save to file during destruction
- ignore_discard: save even cookies that the server indicates should be
- discarded
- hide_cookie2: don't add Cookie2 header to requests (the presence of
- this header indicates to the server that we understand RFC 2965
- cookies)
- netscape_only: switch off RFC 2965 cookie handling altogether (implies
- hide_cookie2 also)
- blocked_domains: sequence of domain names that we never accept cookies
- from, nor return cookies to
- delayload: request that cookies are lazily loaded per-domain from disk;
- this is only a hint since this is only affects performance, not
- behaviour (unless the cookies on disk are changing); a Cookies object
- may ignore it (in fact, only MSIECookies lazily loads cookies)
-
- If a filename is given and refers to a valid cookies file (as defined
- by the class documentation), all cookies are loaded from it. If
- delayload is not true, this will happen immediately.
-
- Future keyword arguments might include (not yet implemented):
-
- max_cookies=None
- max_cookies_per_domain=None
- max_cookie_size=None
-
- """
- self.filename = filename
- self.autosave = autosave
- self.ignore_discard = ignore_discard
- self._hide_cookie2 = hide_cookie2
- self._disallow_2965 = netscape_only
- if self._disallow_2965:
- self._hide_cookie2 = True
- self._delayload = delayload
-
- if blocked_domains is not None:
- self._blocked_domains = tuple(blocked_domains)
- else:
- self._blocked_domains = ()
- self.cookies = {}
-
- if filename is None:
- if autosave is None:
- raise ValueError, \
- "a filename must be given if autosave is requested"
- else:
- try: self.load(filename)
- except IOError: pass
-
- def blocked_domains(self):
- """Return the sequence of blocked domains (as a tuple)."""
- return self._blocked_domains
- def set_blocked_domains(self, blocked_domains):
- """Set the sequence of blocked domains."""
- self._blocked_domains = tuple(blocked_domains)
-
- def _is_blocked(self, domain):
- for blocked_domain in self._blocked_domains:
- if liberal_domain_match(domain, blocked_domain):
- return True
- return False
-
- def __len__(self):
- """Return number of contained cookies."""
- count = [0]
- def callback(args, c=count): c[0] = c[0] + 1
- self.scan(callback)
- return count[0]
-
- def _return_cookie_path_ok(self, path, req_path):
- """Decide whether cookie should be returned to server, given only path.
-
- If cookie should be returned to server, return True. Otherwise, return
- False.
-
- path: path set in cookie
- req_path
-
- """
- # this is identical for Netscape and RFC 2965
- debug("- checking cookie path=%s" % path)
- if not startswith(req_path, path):
- debug(" %s does not path-match %s" % (req_path, path))
- return False
- return True
-
- def _return_cookie_ok(self, domain, path, key, value,
- request, redirect, now):
- """Decide whether cookie should be returned to server, given all info.
-
- If cookie should be returned to server, return true. Otherwise, return
- false.
-
- """
- # path has already been checked by _return_cookie_path_ok
- # domain should be OK thanks to the algorithm in add_cookie_header
- # that found this cookie in the first place
- (version, val, port, path_specified,
- secure, expires, discard, rest) = value
- debug(" - checking cookie %s=%s" % (key, val))
- secure_request = (request.get_type() == "https")
- req_port = request.port
- if req_port is None:
- req_port = "80"
-
- if self._disallow_2965 and int(version) > 0:
- debug(" RFC 2965 cookie disallowed by user")
- return False
- if redirect and int(version) > 0:
- debug(" RFC 2965 cookie disallowed during redirect")
- return False
- if secure and not secure_request:
- debug(" not a secure request")
- return False
- if expires and expires < now:
- debug(" expired")
- return False
- if port:
- for p in string.split(port, ","):
- if p == req_port:
- break
- else:
- debug(" request port %s does not match cookie port %s" % (
- req_port, port))
- return False
- if int(version) > 0 and self._is_netscape_domain:
- debug(" domain %s applies to Netscape-style cookies only" %
- domain)
- return False
- if self._is_blocked(domain):
- debug(" domain %s is in user block-list")
- return False
-
- ehn = request_host(request)
- if string.find(ehn, ".") == -1:
- ehn = ehn + ".local"
- if int(version) > 0:
- # origin server effective host name should domain-match
- # domain attribute of cookie
- assert domain_match(ehn, domain)
- else:
- assert endswith(ehn, domain)
-
- debug(" it's a match")
- return True
-
- def _get_cookie_attributes(self, cookies, domain, request,
- req_path, redirect, now):
- """Return a list of cookie-attributes to be returned to server.
-
- like ['$Path="/"', ...]
-
- The $Version attribute is also added when appropriate (currently only
- once per request).
-
- Also adds Cookie2 header to request, unless hide_cookie2 argument
- to Cookies constructor was true.
-
- """
- # Add cookies in order of most specific path first (i.e. longest
- # path first).
- paths = cookies.keys()
- def decreasing_size(a, b): return cmp(len(b), len(a))
- paths.sort(decreasing_size)
-
- cattrs = []
- for path in paths:
-
- if not self._return_cookie_path_ok(path, req_path):
- continue
-
- for key, value in cookies[path].items():
- if not self._return_cookie_ok(domain, path, key, value,
- request, redirect, now):
- continue
-
- (version, val, port, path_specified,
- secure, expires, discard, rest) = value
-
- # set version of Cookie header, and add Cookie2 header
- # XXX
- # What should it be if multiple matching Set-Cookie headers
- # have different versions themselves?
- # Answer: this is undecided as of 2003-01-11 -- will be
- # settled when RFC 2965 errata appears.
- if not self._version_has_been_set:
- self._version_has_been_set = True
- if (int(version) > 0):
- cattrs.append("$Version=%s" % version)
- elif not self._hide_cookie2:
- # advertise that we know RFC 2965
- request.add_header("Cookie2", '$Version="1"')
-
- # quote cookie value if necessary
- # (not for Netscape protocol, which already has any quotes
- # intact, due to the poorly-specified Netscape Cookie: syntax)
- if self.non_word_re.search(val) and int(version):
- val = self.quote_re.sub(r"\\\1", val)
-
- # add cookie-attributes to be returned in Cookie header
- cattrs.append("%s=%s" % (key, val))
- if int(version) > 0:
- if path_specified:
- cattrs.append('$Path="%s"' % path)
- if startswith(domain, "."):
- cattrs.append('$Domain="%s"' % domain)
- if port is not None:
- p = "$Port"
- if port != "":
- p = p + ('="%s"' % port)
- cattrs.append(p)
-
- return cattrs
-
- def add_cookie_header(self, request, redirect=False):
- """Add correct Cookie: header to request (urllib2.Request object).
-
- The Cookie2 header is also added unless the hide_cookie2 argument
- to the Cookies constructor was false.
-
- The request object (usually a urllib2.Request instance) must support
- the methods get_full_url, get_host, get_type and add_header, as
- documented by urllib2, and the attributes headers (a mapping containing
- the request's HTTP headers) and port (the port number).
-
- If redirect is true, it will be assumed that the request is to a
- redirect URL, and appropriate action will be taken. This has no effect
- for Netscape cookies. At the moment, adding of RFC 2965 cookies is
- switched off entirely if the redirect argument is true: this will
- change in future, to follow the RFC, which allows some cookie use
- during redirections.
-
- """
- now = time()
-
- # origin server effective host name
- erhn = string.lower(request_host(request))
- if string.find(erhn, ".") == -1:
- erhn = erhn + ".local"
-
- req_path = request_path(request)
-
- cattrs = [] # cookie-attributes to be put in the "Cookie" header
-
- self._version_has_been_set = False
- self._is_netscape_domain = False
-
- # Start with origin server effective host name (erhn -- say
- # foo.bar.baz.com), and check all possible domains (foo.bar.baz.com,
- # .bar.baz.com, bar.baz.com) for cookies. For resulting domains that
- # begin with a dot, this should ensure we have an RFC 2965
- # domain-match. For domains that don't start with a dot, we still
- # have a match for Netscape protocol, but not for RFC 2965; in this
- # case, self._is_netscape_domain is true.
- domain = erhn
- while string.find(domain, ".") != -1:
- # Do we have any cookies to send back to the server for this
- # domain?
- debug("Checking %s for cookies" % domain)
- cookies = self.cookies.get(domain)
- if cookies is None:
- # XXX I *think* the only reason why we'd get a domain back
- # from _next_domain that doesn't domain-match the erhn is that
- # domain is in fact an IP address, so check for that.
- if IPV4_RE.search(domain):
- # no point in continuing, since IP addresses must string-
- # compare equal in order to domain-match
- break
- domain = self._next_domain(domain)
- continue
-
- # What cookie-attributes do we need to send back?
- # (get_cookie_attributes also, as necessary, adds the $Version
- # attribute to the returned list, and the Cookie2 header to
- # request)
- attrs = self._get_cookie_attributes(
- cookies, domain, request, req_path, redirect, now)
- cattrs.extend(attrs)
-
- domain = self._next_domain(domain)
-
- if cattrs:
- request.add_header("Cookie", string.join(cattrs, "; "))
-
- def _next_domain(self, domain):
- """Return next domain string in which to look for stored cookies.
-
- Domain string must contain at least one dot.
-
- I say 'domain string' rather than 'domain name' because many of these
- domain strings start with a dot, unlike real DNS domain names.
-
- """
- # Try with a more general domain, alternately stripping leading
- # name components and leading dots. When this results in a domain
- # with no leading dot, it is for Netscape cookie compatibility
- # only:
- #
- # a.b.c.net Any cookie
- # .b.c.net Any cookie
- # b.c.net Netscape cookie only
- # .c.net Any cookie
- # (further stripping shouldn't match any cookies that we stored)
-
- if startswith(domain, "."):
- domain = domain[1:]
- self._is_netscape_domain = True
- else:
- domain = self.domain_re.sub("", domain, 1)
- self._is_netscape_domain = False
- return domain
-
- def _set_cookie_if_ok(self, key, val, hash, rest, request,
- have_ns_cookies, redirect):
- """Decide whether cookie should be set, and if it should, set it."""
- # find request host, path and port
- # in fact we need the effective request-host name here:
- erhn = string.lower(request_host(request))
- if string.find(erhn, ".") == -1:
- erhn = erhn + ".local"
- req_path = request_path(request)
- req_port = request.port
- if req_port is None:
- req_port = "80"
- else:
- req_port = str(req_port)
-
- # Now get the cookie info from hash, checking whether cookie is ok and
- # setting defaults.
-
- max_age = hash.get("max-age")
- version = hash.get("version")
- domain = hash.get("domain")
- path = hash.get("path")
-
- if max_age is not None:
- max_age = float(max_age)
-
- # check version
- if version is None:
- # Version is always set to 0 by _parse_ns_attrs if it's a Netscape
- # cookie, so this must be an invalid RFC 2965 cookie.
- debug("Set-Cookie2 without version attribute disallowed (%s=%s)" % (key, val))
- return
- if int(version) > 0:
- if redirect:
- debug("Setting RFC 2965 cookie during redirect disallowed")
- return
- if self._disallow_2965:
- debug("Setting RFC 2965 cookie disallowed by user")
- return
-
- # check path
- path_specified = False
- if path is not None and path != "":
- path_specified = True
- path = normalize_path(path)
- if not have_ns_cookies and not startswith(req_path, path):
- debug("Path attribute %s is not a prefix of request path %s" %
- (path, req_path))
- return
- else:
- path = req_path
- i = path.rfind("/")
- if i != -1:
- if int(version) == 0:
- # Netscape spec parts company from reality here
- path = path[:i]
- #path = re.sub(r"/[^/]*$", "", path, 1)
- else:
- path = path[:i+1]
- if len(path) == 0: path = "/"
-
- # check domain
- if (domain is None or
- # XXX is this the best hack for Netscape protocol? We need
- # *something*, because explicitly-set cookie domain like acme.com
- # must match erhn acme.com, whereas RFC 2965 logic below would
- # rewrite domain attribute to .acme.com, erroneously resulting in
- # no match.
- (int(version) == 0 and (domain == erhn or domain == "."+erhn))):
- domain = erhn
- else:
- if not startswith(domain, "."):
- # Netscape protocol doesn't ask for this, but doesn't make
- # sense otherwise (two-component domain names, like acme.com,
- # could never set cookies if we didn't do this).
- domain = ".%s" % domain
- if domain.startswith("."):
- undotted_domain = domain[1:]
- else:
- undotted_domain = domain
- nr_embedded_dots = string.count(undotted_domain, ".")
- if nr_embedded_dots == 0 and domain != ".local":
- debug("Non-local domain %s contains no embedded dot" % domain)
- return
- # not actually implemented by Netscape Navigator, apparently
- ## if int(version) == 0: # Netscape cookie
- ## tld = domain[domain.rfind(".")+1:]
- ## if (nr_embedded_dots < 2 and
- ## tld not in self.special_toplevel_domains):
- ## # For example, ".acme.tx.us" is ok, but ".tx.us" is not.
- ## debug("Domain %s contains too few dots" % domain)
- ## # Note that the other case, where toplevel domain is
- ## # special (.com etc. disallowed), is already taken care of
- ## # by always requiring at least one embedded dot.
- ## return
- if int(version) == 0:
- # XXX maybe should just do RFC 2965 domain-match here?
- if IPV4_RE.search(domain):
- debug("IP-address %s illegal as domain" % domain)
- return
- if not endswith(erhn, domain):
- debug("Effective request-host %s does not end with %s" % (
- erhn, domain))
- return
- else:
- if not domain_match(erhn, domain):
- debug("Effective request-host %s does not domain-match "
- "%s" % (erhn, domain))
- return
- host_prefix = erhn[:-len(domain)]
- if string.find(host_prefix, ".") != -1 and int(version) > 0:
- debug("Host prefix %s for domain %s contains a dot" % (
- host_prefix, domain))
- return
- if self._is_blocked(domain):
- debug("Domain %s is in user block-list")
- return
-
- # check port
- if hash.has_key("port"):
- port = hash["port"]
- if port is None:
- # Port attr is present, but has no value: need to remember
- # request port so we can ensure that cookie is only sent
- # back on that port.
- port = req_port
- else:
- port = re.sub(r"\s+", "", port)
- for p in string.split(port, ","):
- try:
- int(p)
- except ValueError:
- debug("Bad port %s (not numeric)" % port)
- return
- if p == req_port:
- break
- else:
- debug("Request port (%s) not found in %s" % (
- req_port, port))
- return
- else:
- # No port attr present, so will be able to send back this
- # cookie on any port.
- port = None
-
- all_attrs = hash.copy()
- all_attrs.update(rest)
-
- if self._set_cookie_ok(all_attrs):
- h = hash.get
- self.set_cookie(h("version"), key, val, path, domain, port,
- path_specified, h("secure"), max_age,
- h("discard"), rest)
-
- def _parse_ns_attrs(self, ns_set_strings):
- """Ad-hoc parser for Netscape protocol cookie-attributes.
-
- The old Netscape cookie format for Set-Cookie
-
- http://www.netscape.com/newsref/std/cookie_spec.html
-
- can for instance contain an unquoted "," in the expires field, so we
- have to use this ad-hoc parser instead of split_header_words.
-
- """
- now = time()
- ns_set = []
- for attrs_string in ns_set_strings:
- ns_attrs = []
- expires = False
- for param in re.split(r";\s*", attrs_string):
- if string.rstrip(param) == "": continue
- if "=" not in param:
- k, v = string.rstrip(param), None
- if k != "secure":
- debug("unrecognised Netscape protocol boolean "
- "cookie-attribute '%s'" % k)
- else:
- k, v = re.split(r"\s*=\s*", param, 1)
- v = string.rstrip(v)
- lc = string.lower(k)
- if lc == "expires":
- # convert expires date to max-age delta
- etime = str2time(v)
- if etime is not None:
- ns_attrs.append(("max-age", etime - now))
- expires = True
- else:
- ns_attrs.append((k, v))
-
- # XXX commented out in original Perl -- should it be here,
- # or not?
- #ns_attrs.append(("port", req_port))
- # anyway, it should really be this, if anything at all:
- #ns_attrs.append(("port", None))
-
- # XXX surely this is wrong: RFC 2965 *also* states that a
- # missing expiry date means should be set to expire -- so
- # why are we only doing this for Netscape cookies??
- if not expires: ns_attrs.append(("discard", None))
- ns_attrs.append(("version", "0"))
- ns_set.append(ns_attrs)
-
- return ns_set
-
- def _normalized_cookie_info(self, set):
- """Return list of tuples containing normalised cookie information.
-
- Tuples are name, value, hash, rest, where name and value are the cookie
- name and value, hash is a dictionary containing the most important
- cookie-attributes (discard, secure, version, max-age, domain, path and
- port) and rest is a dictionary containing the rest of the cookie-
- attributes.
-
- """
- cookie_tuples = []
-
- boolean_attrs = "discard", "secure"
- value_attrs = "version", "max-age", "domain", "path", "port"
- #print "set", set
-
- for cookie_attrs in set:
- name, value = cookie_attrs[0]
- debug("Attempt to set cookie %s=%s" % (name, value))
-
- # Build dictionary of common cookie-attributes (hash) and
- # dictionary of other cookie-attributes (rest).
- hash = {}
- rest = {}
- for k, v in cookie_attrs[1:]:
- lc = string.lower(k)
- # don't lose case distinction for unknown fields
- if (lc in value_attrs) or (lc in boolean_attrs):
- k = lc
- if k in boolean_attrs:
- if v is None:
- # set boolean default
- # Note that this is a default for the case where the
- # cookie-attribute *is* present, but has no value
- # (like "discard", as contrasted with "path=/").
- # If the cookie-attribute *isn't* present (if "path"
- # is missing, for example), the value stored for it
- # will always be false.
- v = True
- if hash.has_key(k):
- # only first value is significant
- continue
- if k == "domain":
- # RFC 2965 section 3.3.3
- v == string.lower(v)
- if (k in value_attrs) or (k in boolean_attrs):
- hash[k] = v
- else:
- rest[k] = v
-
- cookie_tuples.append((name, value, hash, rest))
-
- return cookie_tuples
-
- def extract_cookies(self, response, request, redirect=0):
- """Extract cookies from response, where allowable given the request.
-
- Look for allowable Set-Cookie: and Set-Cookie2: headers in the response
- object passed as argument. Any of these headers that are found are
- used to update the state of the object (subject to the _set_cookie_ok
- method's approval).
-
- The response object (which will usually be the result of a call to
- ClientCookie.urlopen, or similar) must support the methods read,
- readline, readlines, fileno, close and info, as described in the
- documentation for the standard urllib and urllib2 modules. In
- particular, these methods work like those on standard file objects,
- with the exception of info, which returns a mimetools.Message object.
-
- The request object (usually a urllib2.Request instance) must support
- the methods get_full_url and get_host, as documented by urllib2, and
- the attributes headers (a mapping containing the request's HTTP
- headers) and port (the port number).
-
- If redirect is true, it will be assumed that the request was to a
- redirect URL, and appropriate action will be taken. This has no effect
- for Netscape cookies. At the moment, extraction of RFC 2965 cookies is
- switched off entirely if the redirect argument is true: this will
- change in future, to follow the RFC, which allows some cookie use
- during redirections.
-
- """
- # get cookie-attributes for RFC 2965 and Netscape protocols
- headers = response.info()
- rfc2965_strings = getheaders(headers, "Set-Cookie2")
- ns_strings = getheaders(headers, "Set-Cookie")
-
- if ((not rfc2965_strings and not ns_strings) or
- (not ns_strings and self._disallow_2965)):
- return # no cookie headers: quick exit
-
- # Parse out cookie-attributes from RFC 2965 Set-Cookie2 headers.
- set = split_header_words(rfc2965_strings)
- cookie_tuples = self._normalized_cookie_info(set)
-
- have_ns_cookies = False
- if ns_strings:
- # Parse out cookie-attributes from Netscape Set-Cookie headers.
- ns_set = self._parse_ns_attrs(ns_strings)
- ns_cookie_tuples = self._normalized_cookie_info(ns_set)
-
- # Look for Netscape cookies (from a Set-Cookie headers) that match
- # corresponding RFC 2965 cookies (from Set-Cookie2 headers).
- # For each match, keep the RFC 2965 cookie and ignore the Netscape
- # cookie (RFC 2965 section 9.1).
- if not self._disallow_2965:
- # Build a dictionary of cookies that are present in Set-Cookie2
- # headers.
- rfc2965_cookies = {}
- for name, value, hash, rest in cookie_tuples:
- key = hash.get("domain", ""), hash.get("path", ""), name
- rfc2965_cookies[key] = None
-
- def no_matching_rfc2965(ns_cookie_tuple,
- rfc2965_cookies=rfc2965_cookies):
- name, value, hash, rest = ns_cookie_tuple
- key = hash.get("domain", ""), hash.get("path", ""), name
- if not rfc2965_cookies.has_key(key):
- return True
- ns_cookie_tuples = filter(no_matching_rfc2965, ns_cookie_tuples)
-
- if ns_cookie_tuples:
- have_ns_cookies = True
- cookie_tuples.extend(ns_cookie_tuples)
-
- for name, value, hash, rest in cookie_tuples:
- self._set_cookie_if_ok(name, value, hash, rest,
- request, have_ns_cookies, redirect)
-
- def _set_cookie_ok(self, headers):
- """Return False if the cookie should not be set.
-
- This is intended for overloading by subclasses. Do not call this
- method.
-
- The cookie has already been approved by the extract_cookies method by
- the time it gets here, so there's no need to reimplement the standard
- acceptance rules.
-
- headers: dictionary containing HTTP "Cookie" and "Cookie2" headers.
-
- """
- return True
-
- def set_cookie(self, version, key, val, path, domain, port, path_spec,
- secure, max_age, discard, rest=None):
- """Add a cookie.
-
- The version, key, val, path, domain and port arguments are strings.
- The path_spec, secure, discard arguments are boolean values. The
- max_age argument is a number indicating number of seconds that this
- cookie will live. A value <= 0 will delete this cookie. The
- dictionary rest defines various other cookie-attributes like "Comment"
- and "CommentURL".
-
- """
- ## print "set_cookie:"
- ## print " version", version
- ## print " key", key
- ## print " val", val
- ## print " path", path
- ## print " domain", domain
- ## print " port", port
- ## print " path_spec", path_spec
- ## print " secure", secure
- ## print " max_age", max_age
- ## print " discard", discard
- ## print " rest", rest
- ## if rest: print " rest is not empty"
- ## else: print " rest is empty"
-
- if path is None or not startswith(path, "/"):
- raise ValueError, "Illegal path: '%s'" % path
- if key is None or key == "" or startswith(key, "$"):
- raise ValueError, "Illegal key: '%s'" % key
-
- if port is not None:
- if not self.port_re.search(port):
- msg = "Illegal port: '%s'" % port
- debug(msg)
- raise ValueError, msg
-
- # normalise case, as per RFC 2965 section 3.3.3
- # XXX RFC 1034 says should preserve case, but use case-insensitive
- # string compare. This would complicate things here, because we're
- # using a dictionary to store cookies by domain. I don't think this
- # really matters here.
- domain = string.lower(domain)
-
- # If no Max-Age cookie-attribute, cookie will be set to discarded. This
- # will happen on next call to clear_temporary_cookies, or on next save.
- expires = 0
- if max_age is not None:
- if max_age <= 0:
- try:
- del self.cookies[domain][path][key]
- except KeyError:
- pass
- else:
- debug("Expiring cookie, "
- "domain='%s', path='%s', key='%s'" % (
- domain, path, key))
- return
- expires = str(time() + float(max_age))
-
- if version is None:
- version = "0"
-
- debug("Set cookie %s=%s" % (key, val))
- self._set_cookie(
- domain, path, key,
- [version, val, port, path_spec, secure, expires, discard,
- rest])
-
- def _set_cookie(self, domain, path, key, cookie_info):
- c = self.cookies
- if not c.has_key(domain): c[domain] = {}
- c2 = c[domain]
- if not c2.has_key(path): c2[path] = {}
- c3 = c2[path]
- c3[key] = cookie_info
-
- def save(self, filename=None):
- """Save cookies to a file.
-
- The cookies can be restored later using the load method. If filename
- is not specified, the name specified during construction (if any) is
- used. If the attribute ignore_discard is set, then even cookies marked
- to be discarded are saved.
-
- The Cookies base class saves a sequence of "Set-Cookie3" lines.
- "Set-Cookie3" is the format used by the libwww-perl libary, not known
- to be compatible with any browser. The NetscapeCookies subclass can be
- used to save in a format compatible with Netscape.
-
- Cookies set to be discarded are only saved if the ignore_discard
- attribute is set.
-
- The implementation of this method in the Cookies base class always
- saves cookies which have expired by outliving their Max-Age
- cookie-attribute (unlike the NetscapeCookies implementation). This may
- change in future.
-
- """
- if filename is None:
- if self.filename is not None: filename = self.filename
- else: raise ValueError, MISSING_FILENAME_TEXT
- f = open(filename, "w")
- f.write("#LWP-Cookies-1.0\n")
- f.write(self.as_string(not self.ignore_discard))
- f.close()
- self.filename = filename
-
- def load(self, filename=None):
- """Append cookies from a file.
-
- The named file must be in the format written by the save method, or
- IOError will be raised.
-
- Note for subclassers: overridden versions of this method should not
- alter the object's state other than by setting self.filename (if and
- only if the load was successful) and calling self.set_cookie.
-
- """
- if filename is None:
- if self.filename is not None: filename = self.filename
- else: raise ValueError, MISSING_FILENAME_TEXT
- f = open(filename)
- magic = f.readline()
- if not re.search(r"^\#LWP-Cookies-(\d+\.\d+)", magic):
- msg = "%s does not seem to contain cookies" % (file,)
- raise IOError, msg
-
- boolean_attrs = "path_spec", "secure", "discard"
- value_attrs = "version", "port", "path", "domain", "expires"
-
- try:
- while 1:
- line = f.readline()
- if line == "": break
- header = "Set-Cookie3:"
- if not startswith(line, header):
- continue
- line = string.strip(line[len(header):])
- for cookie in split_header_words([line]):
- key, val = cookie[0]
- hash = {}
- rest = {}
- for name in boolean_attrs:
- hash[name] = False
- for k, v in cookie[1:]:
- if k in boolean_attrs:
- if v is None: v = True
- hash[k] = v
- elif k in value_attrs:
- hash[k] = v
- else:
- rest[k] = v
- h = hash.get
- expires = h("expires")
- if expires is not None:
- expires = str2time(expires)
- value = [h("version"), val, h("port"), h("path_spec"),
- h("secure"), expires, h("discard"), rest]
- self._set_cookie(h("domain"), h("path"), key, value)
- except:
- type = sys.exc_info()[0]
- if issubclass(type, IOError):
- raise
- else:
- raise IOError, "invalid Set-Cookie3 format file %s" % filename
- self.filename = filename
-
- def revert(self, filename=None):
- """Clear all cookies and reload cookies from a saved file.
-
- Raises IOError if reversion is not successful; the object's state will
- not be altered if this happens.
-
- """
- if filename is None:
- if self.filename is not None: filename = self.filename
- else: raise ValueError, MISSING_FILENAME_TEXT
- old_state = copy.deepcopy(self.cookies)
- self.clear()
- try:
- self.load()
- except IOError:
- self.cookies = old_state
- raise
-
- def clear(self, domain=None, path=None, key=None):
- """Clear some cookies.
-
- Invoking this method without arguments will clear all cookies. If
- given a single argument, only cookies belonging to that domain will be
- removed. If given two arguments, cookies belonging to the specified
- path within that domain are removed. If given three arguments, then
- the cookie with the specified key, path and domain is removed.
-
- Raises KeyError if no matching cookie exists.
-
- """
- if key is not None:
- if (domain is None) or (path is None):
- raise ValueError, \
- "domain and path must be given to remove a cookie by key"
- del self.cookies[domain][path][key]
- elif path is not None:
- if domain is None:
- raise ValueError, \
- "domain must be given to remove cookies by path"
- del self.cookies[domain][path]
- elif domain is not None:
- del self.cookies[domain]
- else:
- self.cookies = {}
-
- def clear_temporary_cookies(self):
- """Discard all temporary cookies.
-
- Scans for all cookies held by object having either no Max-Age
- cookie-attribute or a true discard flag. RFC 2965 says you should call
- this when the user agent shuts down.
-
- """
- def callback(args, self=self):
- if (args[9] or not args[8]):
- # "Discard" flag set or there was no Max-Age cookie-attribute.
- # clear the cookie, by setting negative Max-Age
- args[8] = -1
- apply(self.set_cookie, args)
- self.scan(callback)
-
- def __del__(self):
- if self.autosave: self.save()
-
- def scan(self, callback):
- """Apply supplied function to each stored cookie.
-
- The callback function will be invoked with a sequence argument:
-
- index content
- --------------
- 0 version
- 1 key
- 2 value
- 3 path
- 4 domain
- 5 port
- 6 path_specified
- 7 secure
- 8 expires
- 9 discard
- 10 dictionary containing other cookie-attributes, eg. "Comment"
-
- """
- domains = self.cookies.keys()
- domains.sort()
- for domain in domains:
- paths = self.cookies[domain].keys()
- paths.sort()
- for path in paths:
- for key, value in self.cookies[domain][path].items():
- (version, val, port, path_specified,
- secure, expires, discard, rest) = value
- if rest is None:
- rest = {}
- callback([version, key, val, path, domain, port,
- path_specified, secure, expires, discard, rest])
-
- def __str__(self): return self.as_string()
-
- def as_string(self, skip_discard=False):
- """Return cookies as a string of "\n"-separated "Set-Cookie3" headers.
-
- If skip_discard is true, it will not return lines for cookies with the
- Discard cookie-attribute.
-
- str(cookies) also works.
-
- """
- result = []
- def callback(args, result=result, skip_discard=skip_discard):
- (version, key, val, path, domain, port,
- path_specified, secure, expires, discard, rest) = args
- if discard and skip_discard: return
- h = [(key, val),
- ("path", path),
- ("domain", domain)]
- if port is not None: h.append(("port", port))
- if path_specified: h.append(("path_spec", None))
- if secure: h.append(("secure", None))
- if expires: h.append(("expires", time2isoz(float(expires))))
- if discard: h.append(("discard", None))
- keys = rest.keys()
- keys.sort()
- for k in keys:
- h.append((k, str(rest[k])))
- h.append(("version", version))
- result.append(("Set-Cookie3: %s" % (join_header_words([h]),)))
- self.scan(callback)
- return string.join(result+[""], "\n")
-
-
- class NetscapeCookies(Cookies):
- """
- This class differs from Cookies only in the format it uses to save and load
- cookies to and from a file. This class uses the Netscape `cookies.txt'
- format.
-
- Note that the Netscape format will lose information on saving and
- restoring. In particular, the port number and cookie protocol version
- information is lost. XXX path_specified, discard??
-
- Unlike the Cookies base class, this class currently checks cookie expiry
- times on saving, and expires cookies appropriately. Cookies instead waits
- until you call clear_temporary_cookies. This may change in future.
-
- """
- def load(self, filename=None):
- if filename is None:
- if self.filename is not None: filename = self.filename
- else: raise ValueError, MISSING_FILENAME_TEXT
- cookies = []
- f = open(filename)
- magic = f.readline()
- if not startswith(string.lstrip(magic), "# Netscape HTTP Cookie File"):
- f.close()
- raise IOError, (
- "%s does not look like a Netscape format cookies file" % (
- filename,))
- now = time()
-
- try:
- while 1:
- line = f.readline()
- if line == "": break
- line = string.strip(line)
- if (startswith(line, "#") or startswith(line, "$") or
- line == ""):
- continue
- domain, bool1, path, secure, expires, key, val = \
- string.split(line, "\t")
- secure = (secure == "TRUE")
- self.set_cookie(None, key, val, path, domain, None,
- 0, secure, float(expires)-now, 0)
- except:
- type = sys.exc_info()[0]
- if issubclass(type, IOError):
- raise
- else:
- raise IOError, "invalid Netscape format file %s" % filename
- self.filename = filename
-
- def save(self, filename=None):
- if filename is None:
- if self.filename is not None: filename = self.filename
- else: raise ValueError, MISSING_FILENAME_TEXT
- f = open(filename, "w")
-
- f.write("""\
- # Netscape HTTP Cookie File
- # http://www.netscape.com/newsref/std/cookie_spec.html
- # This is a generated file! Do not edit.
-
- """)
- now = time()
- debug("Saving Netscape cookies.txt file")
- def callback(args, f=f, now=now, self=self):
- (version, key, val, path, domain, port,
- path_specified, secure, expires, discard, rest) = args
- expires = float(expires)
- if discard and not self.ignore_discard:
- debug(" Not saving %s: marked for discard" % key)
- return
- if not expires: expires = 0
- if now > expires:
- debug(" Not saving %s: expired" % key)
- return
- if secure: secure = "TRUE"
- else: secure = "FALSE"
- if startswith(domain, "."): bool = "TRUE"
- else: bool = "FALSE"
- f.write(
- string.join([domain, bool, path, secure,
- str(expires), key, val], "\t")+"\n")
-
- self.scan(callback)
- f.close()
- self.filename = filename
-
-
- class MSIECookies(Cookies):
- """This class differs from Cookies only in the format it uses to save and
- load cookies to and from a file.
-
- WARNING: This class is UNTESTED (ie. it does not work)!
-
- This class can read Microsoft Internet Explorer 5.x and 6.x for
- Windows (MSIE) cookie files. Does NOT support saving cookies in MSIE
- format. If you save cookies, they'll be in the usual Set-Cookie3
- format, which you can read back in using an instance of the plain old
- Cookies class.
-
- You should be able to have LWP share Internet Explorer's cookies like
- this:
-
- XXXX how to do this with winreg, or win32all?
- cookies_dir = Registry(
- "CUser/Software/Microsoft/Windows/CurrentVersion/Explorer/"
- "Shell Folders/Cookies")
- file = os.path.join(cookies_dir, "index.dat")
- cookies = MSIECookies(file=file, delayload=1)
-
- Additional methods:
-
- load_cookies(file)
-
- """
- domain_re_2 = re.compile(r"^([^/]+)(/.*)$")
- cookie_re = re.compile("Cookie\:.+\@([\x21-\xFF]+).*?"
- "(.+\@[\x21-\xFF]+\.txt)")
-
- win32_epoch = 0x019db1ded53e8000L # 1970 Jan 01 00:00:00 in Win32 FILETIME
-
- def _epoch_time_offset_from_win32_filetime(self, filetime):
- """Convert from win32 filetime to seconds-since-epoch value.
-
- MSIE stores create and expire times as Win32 FILETIME, which is 64
- bits of 100 nanosecond intervals since Jan 01 1601.
-
- Cookies code expects time in 32-bit value expressed in seconds since
- the epoch (Jan 01 1970).
-
- """
- if filetime < self.win32_epoch:
- raise ValueError, "filetime (%d) is before epoch (%d)" % (
- filetime, self.win32_epoch)
-
- return (filetime - self.win32_epoch) / 10000000L
-
- def _get_cookie_attributes(self, cookies, domain,
- request, req_path, now, redirect):
- # lazily load cookies for this domain
- if self._delayload and cookies["//+delayload"] is not None:
- # Extract cookie filename from the cookie value, into which it
- # was stuffed by the load method.
- cookie_file = cookies["//+delayload"]["cookie"][1]
- if self.cookies.has_key(domain):
- del self.cookies[domain]
- self.load_cookies(cookie_file)
- cookies = self.cookies[domain]
- Cookies._get_cookie_attributes(self, cookies, domain,
- request, req_path, now, redirect)
-
- def _load_cookies_from_file(self, filename):
- cookies = []
-
- cookies_fh = open(filename)
-
- while 1:
- key = cookies_fh.readline()
- if key == "": break
-
- key = chomp(key)
- value = cookies_fh.readline()
- value = chomp(value)
- domain_path = cookies_fh.readline()
- domain_path = chomp(domain_path)
- flags = cookies_fh.readline()
- flags = chomp(flags) # 0x2000 bit is for secure I think
- lo_expire = cookies_fh.readline()
- lo_expire = chomp(lo_expire)
- hi_expire = cookies_fh.readline()
- hi_expire = chomp(hi_expire)
- lo_create = cookies_fh.readline()
- lo_create = chomp(lo_create)
- hi_create = cookies_fh.readline()
- hi_create = chomp(hi_create)
- sep = cookies_fh.readline()
- sep = chomp(sep)
-
- if "" in (key, value, domain_path, flags, hi_expire, lo_expire,
- hi_create, lo_create, sep) or (sep != "*"):
- break
-
- m = self.domain_re_2.search(domain_path)
- if m:
- domain = m.group(1)
- path = m.group(2)
-
- cookies.append({"KEY": key, "VALUE": value, "DOMAIN": domain,
- "PATH": path, "FLAGS": flags, "HIXP": hi_expire,
- "LOXP": lo_expire, "HICREATE": hi_create,
- "LOCREATE": lo_create})
-
- return cookies
-
- def load_cookies(self, filename):
- """Load cookies from file containing all cookies for one domain/user.
- """
- now = time()
-
- cookie_data = self._load_cookies_from_file(filename)
-
- for cookie in cookie_data:
- secure = ((cookie["FLAGS"] & 0x2000) != 0)
- filetime = (cookie["HIXP"] << 32) + cookie["LOXP"]
- expires = self._epoch_time_offset_from_win32_filetime(filetime)
-
- self.set_cookie(None, cookie["KEY"], cookie["VALUE"],
- cookie["PATH"], cookie["DOMAIN"], None,
- 0, secure, expires - now, 0)
-
- def load(self, filename=None):
- """Load cookies from an MSIE 'index.dat' cookies index file.
-
- filename: full path to cookie index file
-
- """
- if filename is None:
- if self.filename is not None: filename = self.filename
- else: raise ValueError, MISSING_FILENAME_TEXT
-
- now = time()
- #user_name = string.lower(Win32::LoginName())
- # XXXX win32all, calldll, ctypes, or whatever
- # Actually, is this really needed? Surely there's only one user per
- # index file anyway?? Maybe not, on win9x. :(
-
- cookie_dir = os.path.dirname(filename)
-
- index = open(filename, "rb")
- data = index.read(256)
- if len(data) != 256:
- raise IOError, "%s file is too short" % filename
-
- # Cookies' index.dat file starts with 32 bytes of signature
- # followed by an offset to the first record, stored as a little-
- # endian DWORD.
- sig, size, data = data[:32], data[32:36], data[36:]
- size = struct.unpack("<L")[0]
- #sig, size = struct.unpack("a32 V", data)
-
- # check that sig is valid (only tested in IE6.0)
- if not sig.startswith("Client UrlCache MMF Ver 5.2") or size != 0x4000:
- raise IOError, ("%s ['%s' %s] does not seem to contain cookies" % (
- filename, sig, size))
-
- index.seek(size, 0) # skip to start of first record
-
- # Cookies are usually in two contiguous 128 byte sectors, so read
- # in two 128 byte sectors and adjust if not a Cookie.
- while 1:
- d = index.read(256)
- if len(d) != 256: break
- data = data + d
-
- # Each record starts with a 4-byte signature and a count
- # (little-endian DWORD) of 128 byte sectors for the record.
- sig, size, data = data[:4], data[4:8], data[8:]
- size = struct.unpack("<L", size)
- #sig, size = struct.unpack("a4 V", data)
-
- # '-2' takes into account the two 128 byte sectors we've just
- # read in
- size_to_read = (size-2)*128
-
- # ignore all but "URL" records
- if sig != "URL ":
- # I've seen "HASH" and "LEAK" records
- assert sig in "HASH", "LEAK"
- if size > 0 and size != 2:
- index.seek(size_to_read, 1)
- continue
-
- # read in rest of record if necessary
- if size > 2:
- more_data = index.read(size_to_read)
- if len(more_data) != size_to_read:
- break
- data = data + more_data
-
- #cookie_re = ("Cookie\:%s\@([\x21-\xFF]+).*?"
- # "(%s\@[\x21-\xFF]+\.txt)" % (user_name,)*2)
- #m = re.search(cookie_re, data)
- m = self.cookie_re.search(data)
- if m:
- cookie_file = os.path.join(cookie_dir, m.group(2))
-
- if not self._delayload:
- self.load_cookies(cookie_file)
- else:
- domain = m.group(1)
- i = domain.find("/")
- if i != -1:
- domain = domain[:i]
-
- # Set a fake cookie for this domain, whose cookie value is
- # in fact the cookie file for this domain / user. This
- # is used in the _get_cookie_attributes method to lazily
- # load cookies.
- self.set_cookie(
- version=None,
- key="cookie", val=cookie_file, path="//+delayload",
- domain=domain, port=None, path_spec=False,
- secure=False, maxage=now + 86400, discard=False)
-
-
- # urllib2 support
-
- try:
- from urllib2 import AbstractHTTPHandler
- except ImportError:
- pass
- else:
- import urllib2, urllib, httplib, urlparse, types
- from cStringIO import StringIO
- from _Util import seek_wrapper
-
- def request_method(req):
- try:
- return req.method()
- except AttributeError:
- if req.has_data():
- return "POST"
- else:
- return "GET"
-
- # This fixes a bug in urllib2 as of Python 2.1.3 and 2.2.1
- # (sourceforge bug #549151 -- see file 'patch v2')
- class HTTPRedirectHandler(urllib2.BaseHandler):
- # maximum number of redirections before assuming we're in a loop
- max_redirections = 10
-
- # Implementation notes:
-
- # To avoid the server sending us into an infinite loop, the request
- # object needs to track what URLs we have already seen. Do this by
- # adding a handler-specific attribute to the Request object.
-
- # Another handler-specific Request attribute, original_url, is used to
- # remember the URL of the original request so that it is possible to
- # decide whether or not RFC 2965 cookies should be turned on during
- # redirect.
-
- # Always unhandled redirection codes:
- # 300 Multiple Choices: should not handle this here.
- # 304 Not Modified: no need to handle here: only of interest to caches
- # that do conditional GETs
- # 305 Use Proxy: probably not worth dealing with here
- # 306 Unused: what was this for in the previous versions of protocol??
-
- def redirect_request(self, newurl, req, fp, code, msg, headers):
- """Return a Request or None in response to a redirect.
-
- This is called by the http_error_30x methods when a redirection
- response is received. If a redirection should take place, return a
- new Request to allow http_error_30x to perform the redirect;
- otherwise, return None to indicate that an HTTPError should be
- raised.
-
- """
- method = request_method(req)
- if (code in (301, 302, 303, 307) and method in ("GET", "HEAD") or
- code in (302, 303) and method == "POST"):
- return urllib2.Request(newurl, headers=req.headers)
- else:
- return None
-
- def http_error_302(self, req, fp, code, msg, headers):
- if headers.has_key('location'):
- newurl = headers['location']
- elif headers.has_key('uri'):
- newurl = headers['uri']
- else:
- return
- newurl = urlparse.urljoin(req.get_full_url(), newurl)
-
- # XXX Probably want to forget about the state of the current
- # request, although that might interact poorly with other
- # handlers that also use handler-specific request attributes
- new = self.redirect_request(newurl, req, fp, code, msg, headers)
- if new is None:
- return
- new.original_url = req.get_full_url()
-
- # loop detection
- new.error_302_dict = {}
- if hasattr(req, 'error_302_dict'):
- if len(req.error_302_dict)>=self.max_redirections or \
- req.error_302_dict.has_key(newurl):
- raise HTTPError(req.get_full_url(), code,
- self.inf_msg + msg, headers, fp)
- new.error_302_dict.update(req.error_302_dict)
- new.error_302_dict[newurl] = newurl
-
- # Don't close the fp until we are sure that we won't use it
- # with HTTPError.
- fp.read()
- fp.close()
-
- return self.parent.open(new)
-
- http_error_301 = http_error_303 = http_error_307 = http_error_302
-
- inf_msg = "The HTTP server returned a redirect error that would" \
- "lead to an infinite loop.\n" \
- "The last 302 error message was:\n"
-
- class addinfourlseek(seek_wrapper):
- def __init__(self, fp, hdrs, url):
- seek_wrapper.__init__(self, fp)
- self.fp = fp
- self.headers = hdrs
- self.url = url
- self.seek(0)
-
- def info(self):
- return self.headers
-
- def geturl(self):
- return self.url
-
- class AbstractHTTPHandler(urllib2.BaseHandler):
- def __init__(self, cookies=None,
- handle_http_equiv=False, handle_refresh=False,
- seekable_responses=True):
- if cookies is None:
- cookies = Cookies()
- self.c = cookies
-
- if handle_http_equiv and not seekable_responses:
- raise ValueError, ("seekable responses are required if "
- "handling HTTP-EQUIV headers")
-
- self._http_equiv = handle_http_equiv
- self._refresh = handle_refresh
- self._seekable_responses = seekable_responses
-
- def do_open(self, http_class, req):
- if hasattr(req, "error_302_dict") and req.error_302_dict:
- redirect = 1
- else:
- redirect = 0
- self.c.add_cookie_header(req, redirect=redirect)
- host = req.get_host()
- if not host:
- raise URLError('no host given')
-
- try:
- h = http_class(host) # will parse host:port
- if ClientCookie.HTTP_DEBUG:
- h.set_debuglevel(1)
- if req.has_data():
- data = req.get_data()
- h.putrequest('POST', req.get_selector())
- if not req.headers.has_key('Content-type'):
- h.putheader('Content-type',
- 'application/x-www-form-urlencoded')
- if not req.headers.has_key('Content-length'):
- h.putheader('Content-length', '%d' % len(data))
- else:
- h.putrequest('GET', req.get_selector())
- except socket.error, err:
- raise URLError(err)
-
- h.putheader('Host', host)
- for args in self.parent.addheaders:
- apply(h.putheader, args)
- for k, v in req.headers.items():
- h.putheader(k, v)
- h.endheaders()
- if req.has_data():
- h.send(data)
-
- code, msg, hdrs = h.getreply()
- fp = h.getfile()
-
- if self._seekable_responses:
- response = addinfourlseek(fp, hdrs, req.get_full_url())
- else:
- response = urllib.addinfourl(fp, hdrs, req.get_full_url())
- self.c.extract_cookies(response, req, redirect=redirect)
-
- if self._refresh and hdrs.has_key("refresh"):
- refresh = hdrs["refresh"]
- i = string.find(refresh, ";")
- if i != -1:
- time, newurl_spec = refresh[:i], refresh[i+1:]
- i = string.find(newurl_spec, "=")
- if i != -1:
- if int(time) == 0:
- newurl = newurl_spec[i+1:]
- # fake a 302 response
- hdrs["location"] = newurl
- return self.parent.error(
- 'http', req, fp, 302, msg, hdrs)
-
- if code == 200:
- return response
- else:
- return self.parent.error('http', req, fp, code, msg, hdrs)
-
-
- class EndOfHeadError(Exception): pass
- class HeadParser(htmllib.HTMLParser):
- # only these elements are allowed in or before HEAD of document
- head_elems = ("html", "head",
- "title", "base",
- "script", "style", "meta", "link", "object")
- def __init__(self):
- htmllib.HTMLParser.__init__(self, formatter.NullFormatter())
- self.http_equiv = []
-
- def start_meta(self, attrs):
- http_equiv = content = None
- for key, value in attrs:
- if key == "http-equiv":
- http_equiv = value
- elif key == "content":
- content = value
- if http_equiv is not None:
- self.http_equiv.append((http_equiv, content))
-
- def handle_starttag(self, tag, method, attrs):
- if tag in self.head_elems:
- method(attrs)
- else:
- raise EndOfHeadError
-
- def handle_endtag(self, tag, method):
- if tag in self.head_elems:
- method()
- else:
- raise EndOfHeadError
-
- def end_head(self):
- raise EndOfHeadError
-
- def parse_head(file):
- """Return a list of key, value pairs"""
- hp = HeadParser()
- while 1:
- data = file.read(CHUNK)
- try:
- hp.feed(data)
- except EndOfHeadError:
- break
- if len(data) != CHUNK:
- # this should only happen if there is no HTML body, or if
- # CHUNK is big
- break
- return hp.http_equiv
-
- class EQUIVMixin:
- def getreply(self):
- """Returns information about response from the server.
-
- Return value is a tuple consisting of:
- - server status code (e.g. '200' if all goes well)
- - server "reason" corresponding to status code
- - any RFC822 headers in the response from the server
-
- """
- try:
- # response supports httplib.HTTPResponse interface
- response = self._conn.getresponse()
- except httplib.BadStatusLine, e:
- ### hmm. if getresponse() ever closes the socket on a bad request,
- ### then we are going to have problems with self.sock
-
- ### should we keep this behavior? do people use it?
- # keep the socket open (as a file), and return it
- self.file = self._conn.sock.makefile('rb', 0)
-
- # close our socket -- we want to restart after any protocol error
- self.close()
-
- self.headers = None
- return -1, e.line, None
-
- # response supports mimetools.Message interface
- self.headers = response.msg
- # grab HTTP-EQUIV headers and add them to the true HTTP headers
- self.file = seek_wrapper(response.fp)
- equiv_hdrs = parse_head(self.file)
- self.file.seek(0)
- for hdr, val in equiv_hdrs:
- self.headers[hdr] = val
-
- return response.status, response.reason, response.msg
-
-
- class HTTP(EQUIVMixin, httplib.HTTP):
- """Extends httplib.HTTP to deal with HTTP-EQUIV headers.
-
- HTTP-EQUIV headers (HTTP headers in the HEAD section of the HTML
- document) are treated by this class as if they're normal HTTP
- headers.
-
- """
- pass
-
- class HTTPHandler(AbstractHTTPHandler):
- """Extends urllib2.HTTPHandler with automatic cookie handling.
-
- This class also honours zero-time Refresh headers, if the
- handle_refresh argument to the constructor is true.
-
- """
- def http_open(self, req):
- if self._http_equiv:
- klass = HTTP
- else:
- klass = httplib.HTTP
- return self.do_open(klass, req)
-
- if hasattr(httplib, 'HTTPS'):
- class HTTPS(EQUIVMixin, httplib.HTTPS):
- """Extends httplib.HTTPS to deal with HTTP-EQUIV headers.
-
- HTTP-EQUIV headers (HTTP headers in the HEAD section of the HTML
- document) are treated by this class as if they're normal HTTP
- headers.
-
- """
- pass
-
- class HTTPSHandler(AbstractHTTPHandler):
- """Extends urllib2.HTTPHandler with automatic cookie handling.
-
- This class also honours zero-time Refresh headers, if the
- handle_refresh argument to the constructor is true.
-
- """
- def https_open(self, req):
- if self._http_equiv:
- klass = HTTPS
- else:
- klass = httplib.HTTPS
- return self.do_open(klass, req)
-
- def build_opener(*handlers):
- """Create an opener object from a list of handlers.
-
- The opener will use several default handlers, including support
- for HTTP and FTP. If there is a ProxyHandler, it must be at the
- front of the list of handlers. (Yuck.)
-
- If any of the handlers passed as arguments are subclasses of the
- default handlers, the default handlers will not be used.
- """
-
- opener = urllib2.OpenerDirector()
- default_classes = [urllib2.ProxyHandler, urllib2.UnknownHandler,
- HTTPHandler, # from this module (extended)
- urllib2.HTTPDefaultErrorHandler,
- HTTPRedirectHandler, # from this module (bugfixed)
- urllib2.FTPHandler, urllib2.FileHandler]
- if hasattr(httplib, 'HTTPS'):
- default_classes.append(HTTPSHandler)
- skip = []
- for klass in default_classes:
- for check in handlers:
- if type(check) == types.ClassType:
- if issubclass(check, klass):
- skip.append(klass)
- elif type(check) == types.InstanceType:
- if isinstance(check, klass):
- skip.append(klass)
- for klass in skip:
- default_classes.remove(klass)
-
- for klass in default_classes:
- opener.add_handler(klass())
-
- for h in handlers:
- if type(h) == types.ClassType:
- h = h()
- opener.add_handler(h)
- return opener
-
- _opener = None
- def urlopen(url, data=None):
- global _opener
- if _opener is None:
- cookies = Cookies()
- _opener = build_opener(
- HTTPHandler(cookies), # from this module (extended)
- )
- return _opener.open(url, data)
-
- def install_opener(opener):
- global _opener
- _opener = opener
-