Stopgap guide to going UTF-8

PHP UTF-8 cheatsheet

03 July 2006 - PHP

When we started building DropSend, we decided to support all languages worldwide from the start. The interface is currently in English only, but the application can send, store, sort and process your data in whatever language you want. As a result, we have a good number of customers out east.

To support worldwide languages, you need to use UTF-8 encoding for your web pages, emails and application, rather than ISO 8859-1 or another common western encoding, since these don't support characters used in languages such as Japanese and Chinese.

Happily, UTF-8 is backward compatible with ASCII: the first 128 code points are encoded identically, so plain ASCII data needs no conversion before you start using UTF-8. (Accented Latin characters above that range, such as 'à', are encoded differently in ISO 8859-1 and UTF-8, so existing data and scripts containing them do need converting.) But there are a number of other issues to deal with. In particular, UTF-8 is a multibyte encoding, meaning one character can be represented by one or more bytes. This causes trouble for PHP, because the language parses and processes strings based on bytes, not characters, and makes mincemeat of multibyte strings - for example, by splitting characters 'in half', bodging up regular expressions, and rendering email unreadable.

There are a number of great articles online about UTF-8 and how it works - Joel Spolsky's comes to mind - but very few about how to actually get it working with PHP and iron out all the bugs. So here, to save you the time we put in, is a quick cheatsheet and info about a few common issues.

1. Update your database tables to use UTF-8

CREATE DATABASE db_name
DEFAULT CHARACTER SET utf8
DEFAULT COLLATE utf8_general_ci
;

ALTER DATABASE db_name
DEFAULT CHARACTER SET utf8
DEFAULT COLLATE utf8_general_ci
;

-- Sets the default for columns added later; existing columns keep their current character set
ALTER TABLE tbl_name
DEFAULT CHARACTER SET utf8
COLLATE utf8_general_ci
;

-- Converts the existing text columns of a table as well
ALTER TABLE tbl_name
CONVERT TO CHARACTER SET utf8
COLLATE utf8_general_ci
;
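Several readers in the comments found that the connection between PHP and MySQL also has to be told to use UTF-8, otherwise it can default to latin1 and hand back '???'. Here is a minimal sketch of one way to do that with PDO, assuming PHP 5.3.6 or later (where the charset part of the DSN is honoured). The helper name utf8_dsn() is made up for illustration, as are the host and database names.

```php
<?php
// Hypothetical helper: builds a PDO DSN that requests a utf8 connection to MySQL.
function utf8_dsn($host, $dbName)
{
    return 'mysql:host=' . $host . ';dbname=' . $dbName . ';charset=utf8';
}

$dsn = utf8_dsn('localhost', 'db_name');
echo $dsn, "\n";

// With real credentials you would then connect like this:
//   $pdo = new PDO($dsn, $user, $pass);
// With the older mysql extension, the equivalent is to run this straight after connecting:
//   mysql_query("SET NAMES 'utf8'");
```

Either way, the point is the same: the tables, the connection and the page all need to agree on UTF-8.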

2. Install the mbstring extension for PHP

Windows: download the DLL if it's not in your PHP extensions folder, and uncomment the relevant line in your php.ini file: extension=php_mbstring.dll
Linux (yum-based distributions): yum install php-mbstring

3. Configure mbstring

Do this in php.ini, httpd.conf or .htaccess. (In httpd.conf or .htaccess, write each setting as 'php_value name value', with no '=' sign and no ';' comment; for On/Off settings use 'php_flag name on' instead.)

mbstring.language = Neutral ; Set default language to Neutral (UTF-8) (default)
mbstring.internal_encoding = UTF-8 ; Set default internal encoding to UTF-8
mbstring.encoding_translation = On ; HTTP input encoding translation is enabled
mbstring.http_input = auto ; Set HTTP input character set detection to auto
mbstring.http_output = UTF-8 ; Set HTTP output encoding to UTF-8
mbstring.detect_order = auto ; Set default character encoding detection order to auto
mbstring.substitute_character = none ; Do not print invalid characters
default_charset = UTF-8 ; Default character set for auto content type header
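If you can't touch php.ini or .htaccess, some of these settings can also be applied at the top of a script. The following is a rough sketch of that, not a full replacement for the list above: as a reader notes in the comments, mbstring.encoding_translation (and, per the same comment, mbstring.language) can't be changed at runtime, so they are left out.

```php
<?php
// Runtime equivalents for some of the settings listed above.
mb_internal_encoding('UTF-8');        // mbstring.internal_encoding
mb_http_output('UTF-8');              // mbstring.http_output
mb_substitute_character('none');      // mbstring.substitute_character
ini_set('default_charset', 'UTF-8');  // default_charset

// Quick check that the settings took effect.
echo mb_internal_encoding(), "\n";
echo ini_get('default_charset'), "\n";
```

Put this in a shared include so every script runs it before producing any output.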

4. Deal with non-multibyte-safe functions in PHP

The fast-and-loose way to do this is with the following PHP configuration:

mbstring.func_overload = 7 ; All non-multibyte-safe functions are overloaded with the mbstring alternatives

But there are problems with this. php.net has a warning about this potentially affecting the whole server. And even if this isn't an issue for you, mbstring can make a mess of binary strings.

So, a better route is to search your application code for the following functions, and replace them with mbstring's 'slot-in' alternatives:

mail() -> mb_send_mail()
strlen() -> mb_strlen()
strpos() -> mb_strpos()
strrpos() -> mb_strrpos()
substr() -> mb_substr()
strtolower() -> mb_strtolower()
strtoupper() -> mb_strtoupper()
substr_count() -> mb_substr_count()
ereg() -> mb_ereg()
eregi() -> mb_eregi()
ereg_replace() -> mb_ereg_replace()
eregi_replace() -> mb_eregi_replace()
split() -> mb_split()
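To see why the swap matters, here is a small sketch comparing the byte-based functions with their mbstring counterparts. It assumes the script file itself is saved as UTF-8; the variable names are just for illustration.

```php
<?php
mb_internal_encoding('UTF-8');

$word = '日本語';                          // three characters, nine bytes in UTF-8

$byteCount = strlen($word);               // counts bytes
$charCount = mb_strlen($word);            // counts characters

$brokenPrefix = substr($word, 0, 4);      // cuts the second character partway through
$cleanPrefix  = mb_substr($word, 0, 2);   // the first two whole characters

$upper = mb_strtoupper('é');              // case-maps the accented character too

printf("%d bytes, %d characters\n", $byteCount, $charCount);
printf("substr() result is valid UTF-8: %s\n", mb_check_encoding($brokenPrefix, 'UTF-8') ? 'yes' : 'no');
printf("mb_substr() result: %s\n", $cleanPrefix);
printf("mb_strtoupper() result: %s\n", $upper);
```

The substr() call is the 'splitting characters in half' problem in miniature: the result ends with a lone lead byte, which is no longer valid UTF-8.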

5. Sort out HTML entities

The htmlentities() function doesn't work automatically with multibyte strings. To save time, you'll want to create a wrapper function and use this instead:

/**
 * Encodes HTML safely for UTF-8. Use instead of htmlentities.
 *
 * @param string $var
 * @return string
 */
function html_encode($var)
{
    return htmlentities($var, ENT_QUOTES, 'UTF-8') ;
}
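As a quick illustration of what the wrapper does, here is a sketch that runs it over a string mixing markup, an accented character and some Japanese. The function is repeated only so the snippet runs on its own, and the sample string is made up.

```php
<?php
// Same wrapper as above, repeated so this snippet is self-contained.
function html_encode($var)
{
    return htmlentities($var, ENT_QUOTES, 'UTF-8');
}

$raw  = '<b>café & "日本語"</b>';
$safe = html_encode($raw);

// Markup characters and 'é' become entities; the Japanese text has no
// named entities, so it passes through as plain UTF-8.
echo $safe, "\n";
```

This prints &lt;b&gt;caf&eacute; &amp; &quot;日本語&quot;&lt;/b&gt;, which the browser renders correctly as long as the page itself is served as UTF-8.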

6. Check content-type headers

Check through your code for any text-based content-type headers, and append the UTF-8 charset, so the browser knows what it's working with:

header('Content-type: text/html; charset=UTF-8') ;

You should also repeat this in the head of your HTML pages:

<meta http-equiv="Content-type" content="text/html; charset=UTF-8" />

7. Update email scripts

Email can be tricky. You'll need to update the content-type for any emails and text-based mime parts to use UTF-8 encoding. You'll also need to alter the way in which headers are encoded to use UTF-8. mbstring provides a function mb_encode_mimeheader() to handle this for you, but it does make a mess of address lists: you'll need to encode the name and address parts separately, then compile them into an address list.

Be sure to encode the subject and other headers too - Korean speakers will tend to put Korean text in the subject.
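Here is one way the address list handling might look. This is a sketch rather than production code: format_address() and format_address_list() are made-up helper names, the recipients are examples, and it doesn't deal with display names containing commas or quotes, which would need quoting first.

```php
<?php
mb_internal_encoding('UTF-8');

// Hypothetical helper: encodes only the display name, never the address itself.
function format_address($name, $email)
{
    return mb_encode_mimeheader($name, 'UTF-8', 'B') . ' <' . $email . '>';
}

// Hypothetical helper: builds a comma-separated list from name => email pairs.
function format_address_list(array $recipients)
{
    $parts = array();
    foreach ($recipients as $name => $email) {
        $parts[] = format_address($name, $email);
    }
    return implode(', ', $parts);
}

$to = format_address_list(array(
    '山田太郎' => 'taro@example.com',
    'Anna'     => 'anna@example.com',
));

// Encode the subject the same way if you are passing it to mail() yourself.
$subject = mb_encode_mimeheader('안녕하세요', 'UTF-8', 'B');

echo $to, "\n", $subject, "\n";
```

Encoding each name on its own keeps the angle brackets, commas and addresses as plain ASCII, which is what mail servers expect.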

9. Check binary files and strings

Finally, double check any binary files and strings handled by PHP, particularly uploads, downloads and encryption. In some cases it may be necessary to revert to ASCII just before a download or processing a binary string.
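One way to keep binary data safe without changing the global encoding is to pass '8bit' as the encoding to the mbstring functions, so they count and cut raw bytes whatever the internal encoding is. A small sketch, using a made-up byte string:

```php
<?php
mb_internal_encoding('UTF-8');

// The eight-byte PNG signature followed by three arbitrary bytes.
$binary = "\x89PNG\r\n\x1a\n" . "\xE6\x97\xA5";

// '8bit' tells mbstring to treat every byte as one unit.
$byteLength = mb_strlen($binary, '8bit');
$signature  = mb_substr($binary, 0, 4, '8bit');

printf("%d bytes, starts with %s\n", $byteLength, bin2hex($signature));
```

This is mainly useful when mbstring.func_overload is switched on and plain strlen() or substr() can no longer be trusted to work in bytes.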

Comments

IS - 09 August 2006 16:14

Can anyone guide me on how to convert the UTF8 (三寶山) to Chinese characters? Someone told me to use mb_string or iconv but I don't know how to use them.

Sal Randolph - 11 August 2006 20:51

You are a god. Seriously. I can't thank you enough for this incredibly helpful page. I had naively thought just setting up my database for unicode would be enough, and was dismayed to see a page of chinese text turned into question marks! After working through your checklist, chinese is chinese again! Happiness. (bow)

Kris - 14 September 2006 18:31

Yes I agree, that's the way it should be done!

Ricky - 12 October 2006 01:37

I originally thought making multi-language websites was merely a copy and paste solution, and then discovered all the fun that is UTF-8.

This is by far the most useful PHP mbstring write up I've come across. Many thanks.

johnszot - 17 October 2006 05:36

so stumped. PHP is still spitting out '?????'s from MySQL's utf collated fields. i'm sure it's a rookie mistake - but i've triple-checked the stuff on this site (which is helpful despite my problem)... troubleshooting tips?

Bakyt Niyazov - 28 October 2006 17:22

Thank you! You've really helped me!!!

amagondes - 14 November 2006 18:37

johnszot, i'm not sure but check your browser's character encoding. I had the same problem and for some reason my browser was not doing the right thing

Daniel - 17 November 2006 13:36

hey guys.. try this: http://people.w3.org/rishida/scripts/uniview/conversion

nicolas - 01 December 2006 12:54

Hi! My name is Nicolas and I'm from Argentina! I found your explanation really useful, but still, I have a doubt about emails; I need to send emails in different languages (chinese, english, french, for example) and I'm having problems with special characters (for example "á"). The encoding I'm using, it's UTF-8, that works perfect with chinese, but characters like the one I mentioned before, are not displayed... Do you have any Idea for solving this??? Thanks!!!

PS: Sorry for mi english!

Helen - 28 January 2007 04:11

PHP Guru,

Could you please help me change the default UTF-8 (charset) to GB2312? Although I set GB2312 in the php file like , the response still is UTF-8.

Your help and advice are appreciated.

Helen

loch - 12 February 2007 16:59

I did steps above and still get buncha ???

In phpmyadmin, I see the utf-8 char just fine yet when displayed on a web page, i see only ???

the web page header info is set to utf-8

kajetan - 14 March 2007 23:41

Great. I'm about to convert my site to UTF-8 and this will save me hours if not days of trial and error. Thanx.

artoodetoo - 22 March 2007 10:36

This may be useful for Russian-reading users: http://punbb.ru/viewtopic.php?id=1222

Misha - 25 March 2007 00:52

Amazing guide!!! Thanks!

I have just one question about the upload bit script - I do not understand in which cases we should fix our upload scripts.

Thanks!

johnny - 25 March 2007 17:37

Really good guide, but I still have the problem with the '???' characters. Like 'loch' user above, everything displays nice in the phpmyadmin interface, but when i try to run the application I get a lot of '???' instead of greek characters. All I'm trying to do in my test page is query the database (using pear::MDB2) and then show the results like this:

Name Text Author Category $article[article_name] $article[article_text] $article[author_name] $article[category_name] "; } ?>

Any help???

johnny - 25 March 2007 17:51

OOPS, mistake above... :) The code is:

<table border=1> <tr> <th>Name</th> <th>Text</th> <th>Author</th> <th>Category</th> </tr> <?php foreach ($articles as $article) { echo "<tr> <td>$article[article_name]</td> <td>$article[article_text]</td> <td>$article[author_name]</td> <td>$article[category_name]</td> </tr>"; } ?> </table>

Shaun - 28 March 2007 14:01

I am trying to output my data to a text file that then needs to be read into a separate system. I can get the foreign characters to appear on the html pages without a problem, but using the same code with fopen, fwrite etc to write the text file, when I open the text file in Notepad the characters are mumbo jumbo. I can open this file in MS Word and set the encoding to utf8 and save the file, but I would like php to know that the file being saved should be utf8. Any suggestions? I have followed everything here and tried utf8_encode and various other suggestions but to no avail.

Jason Lefkowitz - 09 April 2007 21:05

"Try to learn some english man ..."

Am I the only one amused to see this comment attached to a post about properly internationalizing your code?

Isn't it ironic... doncha think...

faye - 12 April 2007 12:01

this article is a great help... by the way, i would like to ask how to read files with japanese characters in it and display on the screen.. i am working on this problem but i couldn't seem to find the solution... help would be gladly appreciated...

Chris Bloom - 25 April 2007 05:51

Thanks for this! It just saved me days worth of trial and error. For what it's worth, if you're importing Unicode text from Windows (via file upload) and you want to convert it to UTF-8 (Windows Unicode is actually UTF-16), use:

$string = mb_convert_encoding($string, 'UTF-8', 'UTF-16');

See http://us.php.net/manual/en/ref.mbstring.php#50298 for more info.

Dan - 26 April 2007 00:27

Great article, but when I add the configuration settings in my .htaccess file, I get an error 500. Any idea why?

BB - 26 April 2007 08:52

i'm using mb_convert_encoding to convert utf8 to big5, but every conversion will have "?" at the beginning of the string. For example: ?情在人間. May i know why??

DD - 22 May 2007 21:57

You can try to trim the string before using mb_convert_encoding, then "?" will be gone.

Nick Nettleton - 30 May 2007 22:28

A couple of tips if you're still seeing a lot of '???':

1. View source - if the characters appear correctly in the source code, then you're probably not html encoding correctly, as in Johnny's case above. You should always use the html_encode() function above to encode plain text content as you output it to a web page.

2. Use the View > Character Encoding menu in your web browser to see if it understands that you are working in UTF-8. If not, review your HTTP headers and meta tags as above.

3. If things are still wrong, your characters are getting mashed somewhere in transit - run through your code starting at the point of communication with the database, printing out key variables at each point, to see if you can find the source of the error.
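For tip 3, printing the raw bytes is often more telling than printing the string itself. Here is a rough sketch of a debugging helper along those lines; describe_bytes() is a made-up name, and the two sample values are the UTF-8 and ISO 8859-1 encodings of 'à'.

```php
<?php
// Hypothetical debugging helper: shows the raw bytes of a value and whether
// they form valid UTF-8.
function describe_bytes($label, $value)
{
    return sprintf(
        '%s: hex=%s valid_utf8=%s',
        $label,
        bin2hex($value),
        mb_check_encoding($value, 'UTF-8') ? 'yes' : 'no'
    );
}

echo describe_bytes('utf8', "\xC3\xA0"), "\n";   // 'à' encoded as UTF-8
echo describe_bytes('latin1', "\xE0"), "\n";     // 'à' encoded as ISO 8859-1
```

If a value coming straight from the database already shows up as 3f (a literal question mark), the characters were lost before PHP ever saw them, which usually points at the connection or table character set.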

Petronel - 08 June 2007 11:53

I am so happy that what I did on my own over the past month gave me the same results I have now read in this article ;)

Callum - 08 June 2007 17:55

Thank you so much for this. I've been looking for practical, easy-to-understand advice on the different things to bear in mind when using UTF-8 in PHP projects for ages.

One minor thing, shouldn't the name of the attribute in the meta tag be "content", not "value"? It probably works either way, but I thought I'd mention it in case it didn't.

Moeh Bass - 11 June 2007 02:37

PHP Bug #34776 mb_convert_encoding() - wrong convertion from UTF-16 (problem with BOM) http://bugs.php.net/bug.php?id=34776

Do you have any idea about fixes for this bug? Was it fixed? Does it matter a lot?

Callum - 18 June 2007 13:43

For part 3 (Configure mbstring), I tried doing this in PHP using ini_set(), instead of doing it in .htaccess/php.ini/httpd.conf. (I realise those methods are probably quicker but there are various reasons why it's easier for me to set the options in PHP in a few of my projects.) They all worked fine apart from "mbstring.encoding_translation". I set that to "On", but it didn't work; when I called ini_get() just after the ini_set(), the value was still "0".

Do you have any idea why this might be? And more importantly, because I have a few situations when I cannot use .htaccess etc, could you explain to me the importance of the encoding_translation setting? I mean, can I get by without it; is there a workaround I could use in my code to manually translate HTTP input? Perhaps a bit of code I could put at the top of my script that would just convert everything? (And what is it exactly that needs translating - file uploads? form submissions?)

Any advice much appreciated.

Claudia - 21 June 2007 09:06

Just to add to all the other tips: If the database is utf8 and your website is utf8 and you still see a lot of question marks/hollow squares you might need to change the connection encoding for MySQL. Often this is still set to latin. See here: http://www.mysql.org/doc/refman/4.1/en/charset-connection.html

Leander - 30 June 2007 20:30

Great page! Thanks! It has helped me, but I'm still missing something. When I insert, for example, greek text in the database. How should I do this? Do I need to use html_encode? Or maybe utf8_encode/decode?

Because it works for greek characters, but not all of them. Κι ότι άτομα αλλάζοντας πιθανότητες is displayed as: Κι ?τι ?τομα αλλ?ζοντας πιθαν?τητες is it because characters like: όά don't exist in htmlentities?

I'm probably doing something wrong, because thai, chinese, japanese don't show up at all.

Hope you can help me! Thanks!

Yacahuma - 03 July 2007 15:21

I was trying to read an xml service that was generating spanish characters and was getting invalid characters error by the simplexml_load_file

I created this to fix the problem

file: proxy.php
$xml = implode('', file('http://address/xml_service.php'));
header("Content-Type: text/html;charset=ISO-8859-1");
print "\n";
echo utf8_encode($xml);

file: reader.php
...
$uri='http://localhost/proxy.php';
$s = simplexml_load_file($uri);
...

This was faster than using simplexml_load_string

Leander - 04 July 2007 21:35

Just wanted to thank you! Everything works fine and there are no problems with any language so far. Japanese, Thai, Russian, Turkish, Chinese, Greek and Arabian all work, without showing question marks!

Thank you so much!

fkhan - 12 July 2007 00:56

Great info! I would add one thing. To prevent ???? (ISO-8859-1) characters from being returned by MySQL I had to perform this query after the initial database connection/selection: mysql_query("SET NAMES 'utf8'")

Lee McLaughlin - 30 July 2007 04:29

Great tutorial, thanks. And that little gem of info left by **fkhan** sorted me right out!

gia - 23 August 2007 21:46

Yeah I had everything set and this is what I had missing: PHP created latin1 connections by default. To fix that just call after connection:

mysql_query("SET NAMES 'utf8'")

Ivan Chu - 11 September 2007 19:38

The good page. Пасиб ))

yair - 10 November 2007 23:19

gia, your last post saved meeeeeee! thanks!!!

Dave Gregory - 18 November 2007 16:16

Really really useful, thanks! Just thought I'd share something that has really messed me up on this...

My web host, ninja legend that he is, installed suhosin (www.hardened-php.net/suhosin/) to save us from the big bad monsters. Unfortunately, it didn't want to play nicely with the mbstring.encoding_translation php_flag recommended here.

Suhosin was throwing errors like this into my logs: [error] ALERT - COOKIE variable name begins with disallowed whitespace - dropped variable ' PHPSESSID' (attacker '', file '.php') It was also dropping sessions (logins, etc) left right and centre. Annoying since I was relying heavily on sessions for functionality.

I only discovered this was related to mbstring by mistake when I started a new project (different path, different .htaccess file!) and came round to applying these fixes all over again. Suddenly everything broke. It was about 5am so I sobbed a bit and went to bed... and as I slept, the Lord sent unto me a vision, saying "It worked, then you did all that mbstring stuff to fix your special characters, and now it doesn't. Oh, and 'disallowed whitespace' implies a character encoding problem. Join the dots, my son." Long story short, I woke up and took out all the settings one by one until I found the culprit: mbstring.encoding_translation.

I don't know for sure why Nick's suggested this one; I suspect it's to do with surviving random client browser settings, but luckily for me I'm not really doing i18n, just special-character-dodging. Anyway, hopefully this will save someone a bit of pain.

viral - 02 December 2007 18:13

Many thanks for this superb info.

The only line missing was ... mysql_query("SET NAMES 'utf8'")

just after the mysql connection.

All problem solved at one shot !!!

Thanks buddy, fkhan

Ben - 13 December 2007 10:31

A most excellent walk-through. Had an intranet solution sorted for multi-language within half a day!

A big thanks to **fkhan** above as well for pointing out that you also need to add mysql_query("SET NAMES 'utf8'"); To your PHP code after you connect and select the DB for the first time in a script! Worked like a treat! Thanks

Ben - 14 December 2007 11:20

Oh and also... pay attention to step 8... that's the most important ;)

Sandbergen - 05 January 2008 11:57

You might not always have the permissions to set the correct settings in the database as your website may be running on a shared webserver with many different users. This could imply that after you have set everything as described, you still aren't getting the proper results. This little piece of code might do the trick:

first connect to the database with: $conn = mysql_connect($host, $user, $pass); mysql_select_db($db_name);

then execute those 2 queries, only the second really matters though: mysql_query("SET CHARACTER SET utf8"); mysql_query("SET NAMES utf8");

SlyBaby - 14 February 2008 14:42

For: Callum - 18 June 2007 13:43

mbstring.encoding_translation and mbstring.language are PHP_INI_PERDIR, which means they cannot be set in scripts. All the other settings discussed are PHP_INI_ALL, which means they can be set anywhere, including scripts.

Default values:
- mbstring.encoding_translation = "0"
- mbstring.language = "neutral"

That's why setting the encoding translation didn't work, while the language seemed to work. So those two must be set in an .htaccess file, to which it's pretty safe to assume you have access.

Source: http://www.php.net/manual/en/ini.php#ini.list

to the author: great article, very helpful, keep'em coming :).

Sean Kealn - 15 February 2008 18:23

Hello, i need urgent help!

I was trying to send emails with RUSSIAN and GREEK subjects, however I couldn't. Could anyone help me with this?

Thanks!

Tom - 29 February 2008 14:08

I'm also having troubles with UTF8 vs Latin1 with passing € variables through PHP.

thx for the info I'll give it a try

Stefan - 06 March 2008 12:25

Followed all instructions, pointed everything possible to utf-8, but still ?????? when updating the database through a form.

Appeared in the end in my case that

mysql_query("SET NAMES 'utf8'");

was not enough. It had to be:

mysql_query("SET NAMES 'utf8' COLLATE 'utf8_general_ci'");

Glad that this nightmare has been solved!

Hendricus - 24 March 2008 11:08

Thanx for this article... been having troubles with UTF-8 encoding quite a bit! Since I read this article I've been playing around with it and testing stuff and found out something weird. I did the settings thru .htaccess;

### Set default language to Neutral(UTF-8) (default)
php_value mbstring.language "Neutral"
### Set default internal encoding to UTF-8
php_value mbstring.internal_encoding "UTF-8"
### HTTP input encoding translation is enabled
php_value mbstring.encoding_translation "On"
### Set HTTP input character set detection to auto
php_value mbstring.http_input "auto"
### Set HTTP output encoding to UTF-8
php_value mbstring.http_output "UTF-8"
### Set default character encoding detection order to auto
php_value mbstring.detect_order "auto"
### Do not print invalid characters
php_value mbstring.substitute_character "none"
### Default character set for auto content type header
php_value default_charset "UTF-8"
### Use multibyte functions by default, so strtoupper automatically becomes mb_strtoupper
php_value func_overload "7"

I load an external file like this;

$data = file_get_contents("flatfile.html"); //ISO 8859-1 contents

now if I; echo strtoupper(utf8_encode(nl2br($data))); All characters get uppercased, EXCEPT for accented chars like é è ä etc etc.

but if I; echo mb_strtoupper(utf8_encode(nl2br($data)), "utf-8"); It uppercases all chars, even the accented ones...

But I thought the .htaccess settings; php_value func_overload "7" default_charset "UTF-8" would automatically make php use mb_ functions with utf-8??

Any thoughts on this?

Hendricus - 24 March 2008 11:19

Hmmz I found out that;

echo ini_get("mbstring.func_overload"); // returns 0 even though set to 7 in htaccess

and that;

ini_set("mbstring.func_overload", 7); echo ini_get("mbstring.func_overload"); // returns 0 even though set to 7 in htaccess AND in PHP

Any thoughts on this then?

Hendricus - 24 March 2008 12:04

php_value mbstring.func_overload 7

instead of

php_value func_overload 7

Code blindness :)

Michael Robinson - 12 June 2008 06:08

Thank you so much for this!

I blundered through setting up a database for my Chinese Idiom Database, and ran into a load of problems!

I ended up changing all fields that could contain Chinese to "binary", because this was the only format that consistently displayed Chinese, instead of nonsense.

I really like the tip "Check the browser knows which encoding the file is" made by Nick Nettleton - I was puzzling over why a file that was output by a php script was displaying strange characters! I know I could put the meta tag at the top of the file, but this file needs to have only links, as it forms the input for a flash animation.

Anyway, I'll surely be bookmarking this,

Thanks!

Davide Romanini - 27 June 2008 13:01

"UTF-8 is transparent to the core Latin characterset" is not true: only the first 128 code points (0-127) are the same. For example the character 'à' is a valid Latin character (0xE0) but has a totally different representation in UTF-8 (0xC3 0xA0). That means, if you have Latin characters >127 in your php scripts, you should also recode the script itself in UTF-8, otherwise you'll have problems (ex: a simple print 'à' will throw garbage in the page).

Kay - 29 June 2008 00:54

Hi, your advice has been very helpful. Thank you for sharing your knowledge!

I'm creating a website with PHP and MySQL which is to support English and Chinese.

I'm wondering if you could help on a problem I have:

If I select something from the db to be displayed on a webpage, the Chinese shows up fine.

If I insert something into the db via an html form, the Chinese shows as gibberish when I view it with phpMyAdmin.

If I insert something into the db via phpMyAdmin, and then view that record in phpMyAdmin, it shows up as the Chinese character (non-gibberish).

My question is, should the Chinese show up as the actual character in the db or is it alright that it's showing as gibberish when I view it via phpMyAdmin?

Thanks for your time, Kay

MD Nur Hossain - 29 June 2008 19:51

cool article for dynamic multi-language support, but got problem when I used

mysql_query("SET NAMES 'utf8' COLLATE 'utf8_general_ci'");

Without the above query it works fine with IE and FF (Safari doesn't work). Can anybody explain why the above query causes a problem? Thanks in advance

umpirsky - 01 July 2008 07:54

You forgot parse_str -> mb_parse_str.

Veeru - 20 August 2008 10:51

Hi there, I have been experimenting with Unicode and php, but i got no where, this post looks promising, but being a beginner with unicode, can somebody provide simple "hello world" kind of examples with unicode?

I just need to know how to show chinese or thai or korean, text on my web page. How can i do it? Is it possible to store language strings in a text file and show them on a html page?

I am looking at developing a multi-lingual website; what is the easiest way to switch between output languages?. Any help is very much appreciated.

Thanks

Itzco - 21 August 2008 10:03

Good guide. I work in Thailand and Vietnam so we get a lot of weird issues with encoding, so I just want to add the ones that come to mind:

1) As someone already noted: when you connect to the database you need to set the names and collation:

mysql_query("SET NAMES 'utf8'");
mysql_query("SET NAMES 'utf8' COLLATE 'utf8_general_ci'");

Still, sometimes we get garbage in generated files or even in the browser.

Some of these recommendations are also useful if you are sending images or any other kind of file that is not just HTML, where spaces are ignored.

2) Verify that all your PHP files contain NO spaces before or after the <?php ... ?> tags.

3) If you are using templates, verify that all files (this can also apply to PHP files) are saved as UTF-8 without BOM; this can be set in many editors, such as Notepad++ (BOM is evil, be careful!).

4) In case you cannot be sure all spaces are gone, you can use caching (output buffering) to be sure nothing is sent before your content: ob_start(); then you can delete everything in the buffer before sending your content: ob_clean();

** All the previous ones might look unrelated but believe me, they are not. When you are working with UTF-8, save all your files correctly as UTF-8 without BOM.

5) Don't forget to change the encoding for your ajax XML

6) And lastly an error on redirections that I think most people will never see

Content Encoding Error (content_encoding_error)

“Server response could not be decoded using encoding type returned by server.

This is typically caused by a Web Site presenting a content encoding header of one type, and then encoding the data differently.”

This problem occurs when compression is activated and the content has different encodings; it normally occurs when you try to redirect to another page.

Before your redirection call:

@ini_set('zlib.output_compression', 'Off');
header("Location: whereveryouwanttogo");

Ok, hope this can help anyone

Thomas Steven - 01 September 2008 11:47

I was using the MDB2 library to connect to my MySQL database, and I had to do the following to make the connection work correctly :

$this->dbh =& MDB2::connect($dsn,$options); $this->dbh->setCharset('utf8'); // set the connection charset to utf8

I guess other libraries may have the same issues. Without this I was seeing question marks occasionally.

david - 16 September 2008 12:29

For me, adding this right after connecting to my database did the trick!

mysql_query("SET NAMES 'utf8'");

(Of course, I set the encoding as instructed in #1 above)

Thanks for the help!
