XLI. Multi-Byte String Functions

Introduction

Warning

This module is EXPERIMENTAL. Function names and the API are subject to change. The current conversion filter supports Japanese only.

In many languages, the full set of characters cannot be represented in a single byte. Multi-byte character encodings are used to represent the large character sets of these languages. mbstring was developed to handle Japanese characters; however, many mbstring functions can also handle character encodings other than Japanese ones.

A multi-byte character encoding represents a single character with one or more consecutive bytes. Some character encodings use shift (escape) sequences to start and end a multi-byte character string. As a result, a multi-byte string may be corrupted when it is split or counted, unless a method that is safe for the encoding is used. mbstring provides multi-byte safe string functions as well as utility functions such as encoding conversion.

Basics of Japanese multi-byte characters

Most Japanese characters need more than one byte per character. In addition, several character encodings are in use in Japanese environments: EUC-JP, Shift_JIS and ISO-2022-JP. As Unicode becomes more popular, UTF-8 is also used. To develop Web applications for a Japanese environment, it is important to use the character encoding appropriate to each purpose: HTTP input/output, RDBMS and e-mail.

  • Storage for a character can be up to four bytes.

  • A multi-byte character is usually twice as wide as a single-byte character. The wider characters are called "zen-kaku", meaning full width; the narrower characters are called "han-kaku", meaning half width. "zen-kaku" characters are usually fixed width.

  • Some character encodings define shift sequences for entering and exiting multi-byte character strings.

  • A database may allocate a different amount of storage per character than PHP uses, even if the same character encoding is used (for example, PostgreSQL).

  • E-mail is expected to use ISO-2022-JP.

  • "i-mode" web sites are expected to use Shift_JIS.

Supported character encodings

The following character encodings are supported by this PHP extension: UCS-4, UCS-4BE, UCS-4LE, UCS-2, UCS-2BE, UCS-2LE, UTF-32, UTF-32BE, UTF-32LE, UTF-16, UTF-16BE, UTF-16LE, UTF-8, UTF-7, ASCII, EUC-JP, SJIS, eucJP-win, SJIS-win, ISO-2022-JP (JIS), ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-10, ISO-8859-13, ISO-8859-14, ISO-8859-15.
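
Encoding names from this list are passed as strings to the conversion functions. The following is a small sketch of a round trip between two of the listed encodings, assuming a UTF-8 source file; the variable names are illustrative.

```php
<?php
// Assumption: this source file is saved as UTF-8.
$utf8 = "日本語";

// Convert to another supported encoding and back again.
$utf16     = mb_convert_encoding($utf8, "UTF-16BE", "UTF-8");
$roundTrip = mb_convert_encoding($utf16, "UTF-8", "UTF-16BE");

// mb_preferred_mime_name() maps an encoding name to its MIME charset string.
$mimeName = mb_preferred_mime_name("SJIS");

echo strlen($utf16), " bytes in UTF-16BE\n";
echo ($roundTrip === $utf8) ? "round trip ok\n" : "round trip changed the text\n";
echo $mimeName, "\n";
```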

php.ini settings

  • mbstring.internal_encoding defines the default internal character encoding.

  • mbstring.http_input defines the default HTTP input character encoding.

  • mbstring.http_output defines the default HTTP output character encoding.

  • mbstring.detect_order defines the default character encoding detection order.

  • mbstring.substitute_character defines the character substituted for invalid character codes.

Example 1. php.ini setting example

;; Set default internal encoding
mbstring.internal_encoding    = UTF-8  ; Set internal encoding to UTF-8

;; Set default HTTP input character code
mbstring.http_input = auto     ; Set HTTP input to auto
; or
; mbstring.http_input = SJIS     ; Set HTTP input to SJIS
; mbstring.http_input = eucjp-win, sjis-win, UTF-8 ; Specify order

;; Set default HTTP output character code 
mbstring.http_output = UTF-8   ; Set HTTP output encoding to UTF-8

;; Set default character code detection order
mbstring.detect_order = auto   ; Set detection order to auto
; or 
; mbstring.detect_order = eucjp-win, sjis-win, UTF-8 ; Specify order

;; Set default substitute character
mbstring.substitute_character = 12307 ; Specify character code
; or
; mbstring.substitute_character = none  ; No output (invalid characters are removed)
; mbstring.substitute_character = long  ; Output the character code value (for example, U+3000)
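
Most of these settings also have a runtime counterpart among the functions listed below. The following sketch, which assumes a UTF-8 source file and uses illustrative variable names, sets a few of them from a script and shows the effect of the substitute character when a character cannot be represented in the target encoding.

```php
<?php
// Assumption: this source file is saved as UTF-8.

// Runtime counterparts of the php.ini settings.
mb_internal_encoding("UTF-8");
mb_http_output("UTF-8");
mb_detect_order("eucjp-win, sjis-win, UTF-8");

$internal = mb_internal_encoding(); // called without arguments, returns the current value
$order    = mb_detect_order();      // returns the detection order as an array

// Substitute with a specific character, given as a code point (63 is "?").
mb_substitute_character(63);
$withMark = mb_convert_encoding("日a", "ASCII", "UTF-8");

// With "none", characters that cannot be converted are dropped.
mb_substitute_character("none");
$dropped = mb_convert_encoding("日a", "ASCII", "UTF-8");

echo $internal, "\n";
echo implode(", ", $order), "\n";
echo $withMark, "\n";
echo $dropped, "\n";
```

Setting values in the script keeps the behaviour independent of the server configuration, at the cost of repeating the calls in every entry point.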
       

Table of Contents
mb_internal_encoding — Set/Get internal character encoding
mb_http_input — Detect HTTP input character encoding
mb_http_output — Set/Get HTTP output character encoding
mb_detect_order — Set/Get character encoding detection order
mb_substitute_character — Set/Get substitution character
mb_output_handler — Callback function converts character encoding in output buffer
mb_preferred_mime_name — Get MIME charset string
mb_strlen — Get string length
mb_strpos — Find position of first occurrence of string in a string
mb_strrpos — Find position of last occurrence of a string in a string
mb_substr — Get part of string
mb_strcut — Get part of string
mb_strwidth — Return width of string
mb_strimwidth — Get truncated string with specified width
mb_convert_encoding — Convert character encoding
mb_detect_encoding — Detect character encoding
mb_convert_kana — Convert one type of "kana" to another ("zen-kaku", "han-kaku" and more)
mb_encode_mimeheader — Encode string for MIME header
mb_decode_mimeheader — Decode string in MIME header field
mb_convert_variables — Convert character code in variable(s)
mb_encode_numericentity — Encode character to HTML numeric string reference
mb_decode_numericentity — Decode HTML numeric string reference to character
mb_send_mail — Send mail with ISO-2022-JP character code. (Japanese specific)