Encoding URLs when downloading data from the Internet

This document explains why you need to encode/escape or decode/un-escape URLs when using URL Monikers, URLOpenStream functions, and Win32 Internet Functions (WinInet). For further information on URL encoding, please refer to the Uniform Resource Locator RFC at http://ds.internic.net/rfc/rfc1738.txt.

The Short Answer when using URL Monikers or URLOpenStream

All URLs passed to the URL Moniker interface and URLOpenStream functions are assumed to be encoded - in other words, special characters must be "escaped". All URLs returned from these interfaces are likewise encoded, and if they are to be presented to users, they must first be decoded or "un-escaped".

Encoding and decoding are not built into the URL Moniker interface because the algorithm for encoding a given URL is not deterministic, but rather needs to follow heuristics (see details below). If you want to use the encoding heuristic of Microsoft Internet Explorer, you may encode and decode URLs using InternetCanonicalizeURL (declared in wininet.h as InternetCanonicalizeUrl) from the Win32 Internet Functions (WinInet). Please follow these guidelines when using InternetCanonicalizeURL with the URL Moniker and URLOpenStream interfaces:

  1. In short: If it's going IN, encode it. If it's coming OUT, decode it.
  2. When calling URL Moniker or URLOpenStream using an URL that has already been escaped (for example URLs inside HTML files), first canonicalize the URL by calling InternetCanonicalizeURL with no extra flags.
  3. When calling URL Moniker or URLOpenStream using an URL that has not already been escaped (for example URLs typed in by a user), first canonicalize the URL by calling InternetCanonicalizeURL with the special flags ICU_DECODE | ICU_BROWSER_MODE.
  4. When presenting the user with an URL returned by URL Moniker (e.g. in IMoniker::GetDisplayName), first decode the URL by calling InternetCanonicalizeURL with the special flags ICU_DECODE | ICU_NO_ENCODE. As a rule, IMoniker::GetDisplayName() always returns the canonicalized version of the URL used to create the moniker. No additional escaping or encoding is done beyond that done by the caller of CreateURLMoniker().
  5. When converting a "file:" URL returned by URL Moniker into a Win32 filename, decode the file-path portion of the URL by calling InternetCanonicalizeURL with the special flags ICU_DECODE | ICU_NO_ENCODE. Note: separating out the file-path portion involves removing intra-page link information, if present (anything following a '#' character).
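
The file-path rule in step 5 can be sketched as follows. This is an illustration in Python, not a call to InternetCanonicalizeURL: the standard library's unquote stands in for the decoding step, the helper name file_url_to_win32_path is hypothetical, and the sketch assumes the "file://" prefix form used in this document's examples.

```python
from urllib.parse import unquote

FILE_PREFIX = "file://"

def file_url_to_win32_path(url: str) -> str:
    """Hypothetical helper: turn an encoded file: URL into a filename."""
    if not url.lower().startswith(FILE_PREFIX):
        raise ValueError("not a file: URL: " + url)
    remainder = url[len(FILE_PREFIX):]
    # Anything after a literal '#' is intra-page link information. Drop it
    # BEFORE decoding, so that an escaped '#' (%23) stays in the filename.
    path_part, _, _link = remainder.partition("#")
    # Decode exactly once.
    return unquote(path_part)

print(file_url_to_win32_path(r"file://\\server\This%23File.htm#link"))
```

The order matters: splitting after decoding would cut the filename at the '#' that %23 stood for.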

The Short Answer when using Win32 Internet Functions (WinInet)

All the above rules also apply to the Win32 Internet Functions (WinInet), but only when using the HTTP or HTTPS protocols. The Win32 Internet Functions do not assume encoding of URLs for the FTP or GOPHER protocols. Specifically, any URLs passed to these functions for FTP or GOPHER should already be in the exact protocol-specific format that is to be passed "on the wire". So HTTP and HTTPS URLs must be encoded, but FTP and GOPHER URLs must not. If this is confusing, you can encode all URLs (as described above) and use the URL Moniker or URLOpenStream interfaces, which will perform the appropriate encoding/decoding per protocol.
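
The per-protocol rule can be written down as a small lookup. This is only a sketch of the rule as stated in the paragraph above; the function name wininet_expects_encoded is hypothetical and is not part of WinInet.

```python
from urllib.parse import urlsplit

ENCODED_SCHEMES = {"http", "https"}      # pass these encoded
WIRE_FORMAT_SCHEMES = {"ftp", "gopher"}  # pass these exactly as sent on the wire

def wininet_expects_encoded(url: str) -> bool:
    """Hypothetical helper: True if the URL's protocol expects an encoded URL."""
    scheme = urlsplit(url).scheme.lower()
    if scheme in ENCODED_SCHEMES:
        return True
    if scheme in WIRE_FORMAT_SCHEMES:
        return False
    # The document gives no rule for other protocols, so refuse to guess.
    raise ValueError("no encoding rule stated for scheme: " + repr(scheme))

print(wininet_expects_encoded("http://server.com/My%20File.htm"))  # True
print(wininet_expects_encoded("ftp://server.com/My File.htm"))     # False
```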

Details

Why encoding cannot always be performed algorithmically

Encoding of URLs is necessary because it makes it possible to tell deterministically whether a special character is literal data (part of a path or of a query parameter) or a delimiter. The reverse step is harder: an unencoded URL cannot always be encoded by a deterministic algorithm, so any encoding algorithm must include heuristics for choosing between two possible interpretations of the same URL. The InternetCanonicalizeURL function includes the heuristics that are used by Microsoft Internet Explorer.

Examples

http://www.foo.com/myscript?query=What+Is+This%3F&text=ScriptParameters // query is "What Is This?", text is "ScriptParameters"

Note: were this URL not already escaped, it would not be possible to tell where the query parameters begin. Specifically, an URL such as http://foo?bar?goo cannot be encoded algorithmically, because it is unclear which ? is the delimiter between the path name and the query section.
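
To make the query example concrete, here is a minimal sketch that parses the escaped URL with Python's urllib.parse. The Python parser is an assumption chosen for illustration; it is not the heuristic used by InternetCanonicalizeURL.

```python
from urllib.parse import urlsplit, parse_qs

url = "http://www.foo.com/myscript?query=What+Is+This%3F&text=ScriptParameters"
parts = urlsplit(url)

# The only literal '?' left in the escaped URL is the delimiter.
print(parts.path)             # /myscript
print(parse_qs(parts.query))  # {'query': ['What Is This?'], 'text': ['ScriptParameters']}

# For the unescaped URL, a parser simply has to pick a rule. This one
# treats the FIRST '?' as the delimiter, which is one guess among several.
ambiguous = urlsplit("http://foo?bar?goo")
print(ambiguous.query)        # bar?goo
```

Because the question mark inside the query value arrives as %3F, the parser never has to guess. For http://foo?bar?goo it applies a fixed rule, which may or may not match what the author of the URL meant.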

file://\\server\This%23File.htm#link // open the file \\server\This#File.htm and navigate to intra-page link "link"

Again, were this URL not escaped, there would be no accurate way to determine where the intra-page link really starts.

file://\\server\This%23File.htm // open the file \\server\This#File.htm

As above, only this time there would be no way at all to differentiate between an intra-page link and the file "This#File.htm".
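
The ambiguity can be counted. The sketch below lists every way an unescaped string containing '#' could be read, and contrasts that with the single reading of the escaped form. Both function names are hypothetical, and unquote from Python's standard library stands in for the decoding step.

```python
from urllib.parse import unquote

def possible_readings(unescaped: str) -> list:
    """Every (file, link) interpretation of an unescaped string."""
    readings = [(unescaped, "")]  # reading with no link: every '#' is part of the name
    pos = unescaped.find("#")
    while pos != -1:
        # Reading in which this particular '#' starts the intra-page link.
        readings.append((unescaped[:pos], unescaped[pos + 1:]))
        pos = unescaped.find("#", pos + 1)
    return readings

def read_escaped(escaped: str) -> tuple:
    """The single (file, link) interpretation of an escaped string."""
    target, _, link = escaped.partition("#")
    return unquote(target), link

for reading in possible_readings(r"\\server\This#File.htm"):
    print(reading)
print(read_escaped(r"\\server\This%23File.htm"))
```

The unescaped name yields two candidate readings (a file named This#File.htm, or a file named This with the link File.htm), while the escaped name yields exactly one.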

http://server.com/script?param1=1&param2=2&param3=3

http://server.com/script?param1=1%26param2=2&param3=3

The above two URLs send different information to the script "script"; in the first case, there are three parameters, "param1", "param2", and "param3", with the values "1", "2", and "3", respectively. In the second case, however, there are two parameters, "param1" and "param3", with the values (after decoding) "1&param2=2" and "3", respectively. Without encoding, these URLs would be indistinguishable from each other.
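
The difference between the two URLs can be checked with a short sketch. It uses parse_qsl from Python's urllib.parse as an assumed stand-in for whatever parsing the script "script" performs on the server.

```python
from urllib.parse import urlsplit, parse_qsl

first = "http://server.com/script?param1=1&param2=2&param3=3"
second = "http://server.com/script?param1=1%26param2=2&param3=3"

# Split on the literal '&' delimiters first, then decode each value once.
first_params = parse_qsl(urlsplit(first).query)
second_params = parse_qsl(urlsplit(second).query)

print(first_params)   # [('param1', '1'), ('param2', '2'), ('param3', '3')]
print(second_params)  # [('param1', '1&param2=2'), ('param3', '3')]
```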

Some Frequently Asked Questions

Q: How many times should I encode or decode?

A: ONCE. Never submit an URL multiply encoded or decoded. The rule here is "How many times will the HTTP server decode my request?" Once. So don't encode any more than that, or you'll confuse the server and yourself.
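
A short sketch of what goes wrong when an URL is encoded twice. It uses quote and unquote from Python's urllib.parse to stand in for the encode and decode steps; the filename is made up for the example.

```python
from urllib.parse import quote, unquote

name = "My File #1.htm"      # hypothetical filename

once = quote(name)           # encoded once, as it should be
twice = quote(once)          # encoded again by mistake: every '%' becomes '%25'

print(once)                  # My%20File%20%231.htm
print(twice)                 # My%2520File%2520%25231.htm

# The receiver decodes once. Only the singly encoded form round-trips.
print(unquote(once) == name)   # True
print(unquote(twice) == name)  # False: one layer of escapes is still there
```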

Q: What if I want a % in my filename?

A: % is encoded like all other special characters, and becomes %25. DO NOT USE %% TO ESCAPE THE % CHARACTER. It will result in at least one, and most likely two, bogus escape sequences.
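
The sketch below shows the correct escape for a percent sign and counts the bogus escape sequences that %% produces. The regular expression and the helper name count_bogus_escapes are assumptions made for this illustration: a '%' is counted as bogus when it is not followed by two hexadecimal digits.

```python
import re
from urllib.parse import quote, unquote

# A '%' that is NOT followed by two hex digits is a malformed escape.
BOGUS_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})")

def count_bogus_escapes(url_part: str) -> int:
    """Hypothetical helper: number of malformed '%' escapes in a string."""
    return len(BOGUS_ESCAPE.findall(url_part))

print(quote("100%.txt"))                 # 100%25.txt
print(unquote("100%25.txt"))             # 100%.txt
print(count_bogus_escapes("100%25.txt")) # 0
print(count_bogus_escapes("100%%.txt"))  # 2
```

In 100%%.txt neither percent sign is followed by two hexadecimal digits, so both are malformed.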

Q: What about my INF in my CODEBASE= cabinet file?

A: Any references to URLs must be escaped WITHIN THE INF.

Think of your INF as an HTML page. Anything you must escape in HTML, you must escape in this INF. While this may seem an onerous restriction, it is the only choice. If INFs didn't require escaping special characters, then the INF parser would have to either (1) remove all meaning for special characters, or (2) remove the ability to use files with special characters. Each of these is worse than requiring an INF author to encode URLs. For example, if a file in a .CAB cabinet were named "OCX#1.OCX", how could one specify that filename without the '#' being read as the start of an intra-page link? Once encoded, the name is written OCX%231.OCX, which is unambiguous. Requiring encoding of URLs in .INF files is the only way to solve this problem.