Posts Tagged ‘chinese’

Java with UTF-8 encoding (especially with Chinese)

May 5th, 2013

Here are some notes on how to handle non-ASCII characters (such as Chinese) in Java.

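  • Decode URL parameters in a servlet (needed when the container decodes the query string as ISO-8859-1, which was the default in older Tomcat versions):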
    String s = request.getParameter("mykeywords");
    s = new String(s.getBytes("ISO-8859-1"), "UTF-8");

    But if the URI encoding is specified in server.xml as:
    <Connector URIEncoding="UTF-8" connectionTimeout="20000" port="8080" protocol="HTTP/1.1" redirectPort="8443"/>
    there is no need to re-decode it. Simply use:
    String s = request.getParameter("mykeywords");
    and s will already contain the correctly decoded text.

  • Send response with UTF-8 (set this before obtaining the response writer):
    response.setCharacterEncoding("UTF-8");
  • Connect to MySQL with UTF-8:
    private static final String DB_URL = "jdbc:mysql://DB_HOST:3306/DB_SCHEMA?useUnicode=true&characterEncoding=utf8";
  • To encode JSP files with UTF-8, put this at the beginning of a JSP file:
    <%@ page contentType="text/html;charset=UTF-8" %>
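  • With SpringMVC, for GET requests, set URIEncoding on the connector in server.xml: <Connector URIEncoding="UTF-8" connectionTimeout="20000" port="8080" protocol="HTTP/1.1" redirectPort="8443"/>.
    For POST requests, register a character-encoding filter such as Spring's org.springframework.web.filter.CharacterEncodingFilter, with its encoding init-param set to UTF-8, as the first filter in web.xml.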
  • And finally, make sure every file itself is stored as UTF-8.
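
To see why the re-decoding trick in the first memo works, here is a minimal sketch that needs no servlet container. It assumes the container turned the UTF-8 request bytes into a String using ISO-8859-1; the class and method names (Utf8RoundTrip, misdecode, redecode) are made up for illustration. Because ISO-8859-1 maps every byte to exactly one character, the original bytes can be recovered and decoded again as UTF-8.

```java
import java.nio.charset.StandardCharsets;

class Utf8RoundTrip {
    // Simulates a container that decodes UTF-8 request bytes as ISO-8859-1
    // (an assumption about the server setup, not something read from a real request).
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // The fix from the memo: get the raw bytes back, then decode them as UTF-8.
    static String redecode(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // the two characters of "Chinese (language)"
        String garbled = misdecode(text);
        System.out.println(garbled.length());               // 6: one char per UTF-8 byte
        System.out.println(redecode(garbled).equals(text)); // true
    }
}
```

Applying redecode to a parameter that the container has already decoded as UTF-8 would corrupt it, which is why the step has to be dropped once URIEncoding="UTF-8" is configured.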

PHP Hash Chinese Character or String

April 22nd, 2013

Hash a Chinese character or string into an integer. This can be useful when one wants to split a huge table that is indexed by Chinese strings horizontally into many smaller tables.

	/**
	 * Hash a Chinese character into an integer.
	 * Only the first two bytes of $c are used; the constants (176, 161, 94) follow the GB2312 row/cell layout.
	 * @param string $c A Chinese character
	 * @return int
	 */
	public static function hashZhChar($c) {
		return (ord(substr($c, 0, 1)) - 176) * 94 + ord(substr($c, 1, 1)) - 161;
	}

	/**
	 * Hash a Chinese string into an integer by combining its first, last and middle characters.
	 * @param string $s A Chinese string (UTF-8 encoded).
	 * @return int
	 */
	public static function hashZh($s) {
		$first = mb_substr($s, 0, 1, 'UTF-8');
		$last = mb_substr($s, -1, 1, 'UTF-8');
		$middle = mb_substr($s, intval(mb_strlen($s, 'UTF-8') / 2), 1, 'UTF-8');
		return self::hashZhChar($first) + self::hashZhChar($last) + self::hashZhChar($middle);
	}
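
For readers on the Java side, the same idea can be sketched as follows. This is a rough port under a few assumptions, not the original implementation: the class name ZhShard, the helper tableIndex and the table count of 16 are hypothetical, and unlike the PHP version the sketch rejects empty strings. It reads the first two UTF-8 bytes of the first, last and middle characters, sums the three values, and maps the sum to a table number with a non-negative modulo.

```java
import java.nio.charset.StandardCharsets;

class ZhShard {
    // Mirrors hashZhChar: combines the first two UTF-8 bytes of one character.
    static int hashZhChar(String c) {
        byte[] b = c.getBytes(StandardCharsets.UTF_8);
        int b0 = b.length > 0 ? (b[0] & 0xFF) : 0; // treat a missing byte as 0
        int b1 = b.length > 1 ? (b[1] & 0xFF) : 0;
        return (b0 - 176) * 94 + b1 - 161;
    }

    // Returns the character (code point) at the given character index as a String.
    static String charAt(String s, int index) {
        int start = s.offsetByCodePoints(0, index);
        return new String(Character.toChars(s.codePointAt(start)));
    }

    // Mirrors hashZh: first + last + middle character hashes.
    static int hashZh(String s) {
        int len = s.codePointCount(0, s.length());
        if (len == 0) {
            throw new IllegalArgumentException("empty string"); // differs from the PHP version
        }
        return hashZhChar(charAt(s, 0))
             + hashZhChar(charAt(s, len - 1))
             + hashZhChar(charAt(s, len / 2));
    }

    // Hypothetical helper: picks one of tableCount tables for a key.
    // floorMod keeps the result non-negative even if the hash is negative (e.g. ASCII keys).
    static int tableIndex(String key, int tableCount) {
        return Math.floorMod(hashZh(key), tableCount);
    }

    public static void main(String[] args) {
        String key = "\u4e2d\u6587";
        System.out.println(hashZh(key));         // 15041
        System.out.println(tableIndex(key, 16)); // 1
    }
}
```

Since only three characters contribute to the sum, keys that share their first, middle and last characters always land in the same table, so the distribution is worth checking against real data before relying on it.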