This doesn't work; it turns the letter into gibberish:

$foo = 'נ';
$bar = mb_convert_encoding($foo, 'UTF-8', mb_detect_encoding($foo));
print_r(preg_split('/\s/', $bar));

Array ( [0] => � [1] => )

But this works:

$foo = 'נ';
$bar = mb_convert_encoding($foo, 'ISO-8859-8', mb_detect_encoding($foo));
$baz = preg_split('/\s/', $bar);
echo(mb_convert_encoding($baz[0], 'UTF-8', 'ISO-8859-8'));

נ

The problem is only with the letter "נ". It works fine with all the other Hebrew letters. Is there a solution for that?

Accepted Answer

When working with UTF-8 data, always use the u modifier in your patterns:

/\s/u

Otherwise, the pattern and the subject string are not interpreted as UTF-8.
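As a minimal sketch of that advice, the snippet below splits the same letter with the u modifier. The letter is written as explicit bytes (`"\xD7\xA0"`) so the example does not depend on how the source file is saved; that choice and the variable names are assumptions made for illustration.

```php
<?php
// "\xD7\xA0" is the UTF-8 byte sequence for the Hebrew letter nun (U+05E0).
$foo = "\xD7\xA0";

// With the u modifier, pattern and subject are treated as UTF-8,
// so the byte 0xA0 is never examined on its own.
$parts = preg_split('/\s/u', $foo);

// Show the resulting pieces as hex so invalid bytes would be visible.
print_r(array_map('bin2hex', $parts));
```

The result is a single element holding the intact two-byte letter, rather than a lone 0xD7 byte followed by an empty string.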

In this case, the character נ (U+05E0) is encoded as 0xD7A0 in UTF-8, and \s represents any whitespace character (according to PCRE):

The \s characters are HT (9), LF (10), FF (12), CR (13), and space (32).

PCRE's Unicode support also includes a special option called PCRE_UCP, which makes \b, \d, \s, and \w match not just US-ASCII characters but also other Unicode characters according to their Unicode properties:

By default, in UTF-8 mode, characters with values greater than 128 never match \d, \s, or \w, and always match \D, \S, and \W. […] However, if PCRE is compiled with Unicode property support, and the PCRE_UCP option is set, the behaviour is changed so that Unicode properties are used to determine character types, as follows:

  • \d any character that \p{Nd} matches (decimal digit)
  • \s any character that \p{Z} matches, plus HT, LF, FF, CR
  • \w any character that \p{L} or \p{N} matches, plus underscore

And the non-breaking space U+00A0 has the separator property (\p{Z}).
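The property claims above can be checked directly with \p{…} classes. This is a small sketch, assuming PHP's PCRE was built with Unicode property support (the usual case); the variable names are illustrative only.

```php
<?php
// U+00A0 (no-break space) is encoded as 0xC2 0xA0 in UTF-8.
$nbsp = "\xC2\xA0";
// U+05E0 (Hebrew letter nun) is encoded as 0xD7 0xA0 in UTF-8.
$nun = "\xD7\xA0";

// The u modifier makes PCRE read the subjects as UTF-8 code points.
// preg_match returns 1 on a match and 0 otherwise.
$nbspIsSeparator = preg_match('/\p{Z}/u', $nbsp);
$nunIsSeparator  = preg_match('/\p{Z}/u', $nun);
$nunIsLetter     = preg_match('/\p{L}/u', $nun);

var_dump($nbspIsSeparator, $nunIsSeparator, $nunIsLetter);
```

The no-break space matches \p{Z}, while the Hebrew letter matches \p{L} and not \p{Z}, which is why a correct UTF-8 match never splits on it.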

So although your pattern is not in UTF-8 mode, it seems that \s does match that 0xA0 in the UTF-8 code word 0xD7A0, splitting the string at that position and returning an array that is equivalent to array("\xD7", "").
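To make the byte-level picture concrete, here is a hedged sketch that reproduces the described result without relying on \s at all. It uses `explode` on the single byte 0xA0 as a stand-in for the byte-wise match, because whether \s matches 0xA0 without the u modifier may depend on the PCRE version and locale in use.

```php
<?php
// The Hebrew letter nun (U+05E0) as raw UTF-8 bytes.
$nun = "\xD7\xA0";

echo bin2hex($nun), "\n"; // d7a0
echo strlen($nun), "\n";  // 2 (bytes, not characters)

// Stand-in for a byte-wise \s match on 0xA0: split on that single byte.
$parts = explode("\xA0", $nun);

// Hex view of the pieces: a lone lead byte "d7" and an empty string.
print_r(array_map('bin2hex', $parts));
```

The lone 0xD7 byte is not valid UTF-8 on its own, which is why it is displayed as � in the question's output.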

And that’s obviously a bug: the pattern is not in UTF-8 mode, yet \s matches 0xA0, which is greater than 0x80 (additionally, in UTF-8 the non-breaking space would be encoded as 0xC2A0, not as a lone 0xA0). Bug #52971, "PCRE-Meta-Characters not working with utf-8", could be related to this.

Written by Gumbo
The content is written by members of the stackoverflow.com community.
It is licensed under cc-wiki