This document explains why you need to encode/escape or decode/un-escape URLs when using URL Monikers, URLOpenStream functions, and Win32 Internet Functions (WinInet). For further information on URL encoding, please refer to the Uniform Resource Locator RFC at http://ds.internic.net/rfc/rfc1738.txt.
All URLs passed to the URL Moniker interface and URLOpenStream functions are assumed to be encoded - in other words, special characters must be "escaped". All URLs returned from these interfaces are likewise encoded, and if they are to be presented to users, they must first be decoded or "un-escaped".
Encoding and decoding are not built into the URL Moniker interface because a given URL cannot always be encoded deterministically; the encoder must instead follow heuristics (see details below). If you want to use the encoding heuristics of Microsoft Internet Explorer, you can encode and decode URLs using InternetCanonicalizeUrl from the Win32 Internet Functions (WinInet). Please follow these guidelines when using InternetCanonicalizeUrl with the URL Moniker and URLOpenStream interfaces:
All the above rules also apply to the Win32
Internet Functions (WinInet) but only when using HTTP or
HTTPS protocols. However, the Win32 Internet Functions do not
assume encoding of URLs for FTP or GOPHER protocols. Specifically,
any URLs passed to these functions should already be in the exact
protocol-specific format that is to be passed "on the wire".
In short, HTTP and HTTPS URLs must be encoded, but FTP and GOPHER URLs
must not. If this is confusing, you can encode all URLs
(as described above) and use the URL Moniker or URLOpenStream
interfaces, which perform the appropriate encoding or decoding
for each protocol.
Encoding is necessary because it distinguishes special characters that are part of the URL or its query parameters from the same characters used as delimiters. An unencoded URL cannot always be encoded by a deterministic algorithm, so any encoding algorithm must include heuristics for choosing between two possible interpretations of the same URL. The InternetCanonicalizeUrl function includes the heuristics that are used by Microsoft Internet Explorer.
http://www.foo.com/myscript?query=What+Is+This%3F&text=ScriptParameters // query is "What Is This?", text is "ScriptParameters"
Note: were this URL not already escaped, it would not be possible to tell where the query parameters begin. Specifically, a URL such as http://foo?bar?goo cannot be encoded algorithmically because it is unclear which ? is the delimiter between the path name and the query section.
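As a rough illustration of the example above, the following sketch decodes the escaped URL with Python's standard `urllib.parse` module. This is only a stand-in: it is not the Internet Explorer heuristic and not InternetCanonicalizeUrl, and its handling of the ambiguous URL (splitting at the first ?) is just one possible choice.

```python
from urllib.parse import urlsplit, parse_qs

# The escaped URL from the example above.
url = "http://www.foo.com/myscript?query=What+Is+This%3F&text=ScriptParameters"
parts = urlsplit(url)

# parse_qs splits on & and =, then decodes + and %XX in each value.
params = parse_qs(parts.query)
print(params["query"][0])  # What Is This?
print(params["text"][0])   # ScriptParameters

# The ambiguous, unescaped URL: this parser simply treats the FIRST ? as
# the delimiter. Nothing in the URL itself says that this is what the
# author intended.
ambiguous = urlsplit("http://foo?bar?goo")
print(repr(ambiguous.path), repr(ambiguous.query))
```

Because the ? inside the query value was escaped as %3F, the parser never has to guess which ? starts the query section.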
file://\\server\This%23File.htm#link // open the file \\server\This#File.htm and navigate to intra-page link "link"
Again, were this URL not escaped, there would be no accurate way to determine where the intra-page link really starts.
file://\\server\This%23File.htm // open the file \\server\This#File.htm
As above, except that without escaping there would be no way at all to differentiate between an intra-page link and the file "This#File.htm".
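The same point can be sketched with Python's `urllib.parse`, again only as a stand-in for the real URL Moniker and WinInet behavior. The forward-slash form `file://server/...` is an assumption made here because this parser does not understand the `\\server\` form used in the examples above.

```python
from urllib.parse import urlsplit, unquote

# Escaped: %23 is part of the file name, the real # starts the link.
escaped = urlsplit("file://server/This%23File.htm#link")
print(unquote(escaped.path))  # /This#File.htm
print(escaped.fragment)       # link

# Unescaped: the parser has no choice but to treat # as the start of an
# intra-page link, so the file name is cut short.
unescaped = urlsplit("file://server/This#File.htm")
print(unescaped.path)      # /This
print(unescaped.fragment)  # File.htm
```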
http://server.com/script?param1=1&param2=2&param3=3
http://server.com/script?param1=1%26param2=2&param3=3
The above two URLs send different information
to the script "script"; in the first case, there are
three parameters, "param1", "param2", and
"param3", with the values "1", "2",
and "3", respectively. In the second case, however,
there are two parameters, "param1" and "param3",
with the values (after decoding) "1&param2=2" and
"3", respectively. Without encoding, these URLs would
be indistinguishable from each other.
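A minimal sketch of how a receiving script might see these two query strings, using Python's `urllib.parse.parse_qsl` as an assumed stand-in for whatever parsing the server-side script actually performs:

```python
from urllib.parse import parse_qsl

# Query section of the first URL: three separate parameters.
first = parse_qsl("param1=1&param2=2&param3=3")

# Query section of the second URL: the first & is escaped as %26, so it
# is data inside the value of param1 rather than a delimiter.
second = parse_qsl("param1=1%26param2=2&param3=3")

print(first)   # [('param1', '1'), ('param2', '2'), ('param3', '3')]
print(second)  # [('param1', '1&param2=2'), ('param3', '3')]
```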
Q: How many times should I encode? decode?
A: ONCE. Never submit a URL that has been encoded or decoded
more than once. The rule here is "How
many times will the HTTP server decode my request?" Once.
So don't encode any more than that, or you'll confuse the server
and yourself.
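To see why encoding twice goes wrong, here is a small sketch using Python's `urllib.parse.quote` and `unquote` as stand-ins for the encoder and for the single decode performed by the server. The sample string is taken from the query example earlier in this document.

```python
from urllib.parse import quote, unquote

text = "What Is This?"

once = quote(text)   # What%20Is%20This%3F
twice = quote(once)  # What%2520Is%2520This%253F (each % became %25)

# The server decodes exactly once.
print(unquote(once))   # What Is This?
print(unquote(twice))  # What%20Is%20This%3F, still escaped
```

After a single decode, the doubly encoded value still contains escape sequences, so the server receives the wrong data.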
Q: What if I want a % in my filename?
A: % is encoded like all other special characters,
and becomes %25. DO NOT USE %% TO ESCAPE THE % CHARACTER. It will
result in at least one, and most likely two, bogus escape sequences.
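As a hedged illustration, Python's `urllib.parse` follows the same %25 convention. The file name below is hypothetical, and the lenient way this particular decoder treats a malformed %% sequence is specific to Python; other decoders, including the ones discussed in this document, may react differently.

```python
from urllib.parse import quote, unquote

filename = "100%.htm"  # hypothetical file name containing a literal %

encoded = quote(filename)
print(encoded)           # 100%25.htm
print(unquote(encoded))  # 100%.htm

# %% is not an escape for %. This decoder leaves the malformed sequence
# untouched, so the result is not the intended file name.
doubled = unquote("100%%.htm")
print(doubled)           # 100%%.htm
```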
Q: What about my INF in my CODEBASE= cabinet file?
A: Any references to URLs must be escaped WITHIN THE INF.
Think of your INF as an HTML page. Anything
you must escape in HTML, you must escape in this INF. While this
may seem an onerous restriction, it is the only choice. If INFs
didn't require escaping special characters, then the INF
parser would have to either (1) remove all meaning for special
characters, or (2) remove the ability to use files with special
characters. Each of these is worse than requiring an INF author
to encode URLs. For example, if a file in a .CAB cabinet was named
"OCX#1.OCX", how can one specify that filename? Requiring
encoding of URLs in .INF files is the only way to solve this problem.
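For the file name in this example, the escaped form can be sketched as follows. Python's `urllib.parse` is used purely for illustration; it is an assumption that the INF parser decodes the reference in the same way.

```python
from urllib.parse import quote, unquote, urlsplit

name = "OCX#1.OCX"

# Unescaped, the # is read as the start of an intra-page link.
raw = urlsplit(name)
print(raw.path, raw.fragment)  # OCX 1.OCX

# Escaped, the whole name survives as the path.
escaped_name = quote(name)
print(escaped_name)            # OCX%231.OCX
print(unquote(urlsplit(escaped_name).path))  # OCX#1.OCX
```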