Code sample

I'm searching articles for keywords which are in both English and Arabic. The articles can be either in English or Arabic.

My current code is:

$k = implode("|", $keywords);
$regexp = "/(?i)\b(".$k.")\b/";
preg_match_all( $regexp, $content, $matches );

But this doesn't find keywords in Arabic articles for some reason. I've verified that both the keywords and articles are being read correctly; no encoding issues.

What can I do to fix this? Note that there is no way for me to detect whether an article or keyword is in English or Arabic, so there has to be a single regex to match them all.

Accepted Answer

Your regex might simply lack the /u (Unicode) modifier:

$regexp = "/(?i)\b(".$k.")\b/u";

Otherwise PCRE has to compare bytes. In that case it might still find the words (when the UTF-8 byte sequences are identical), but it will never detect the \b word boundaries around them.
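
To see the byte-versus-character difference for yourself, here is a minimal sketch. The sample word is made up for illustration (it is not from the question), and the script assumes the source file is saved as UTF-8:

```php
<?php
// Hypothetical sample word: four Arabic letters, two bytes each in UTF-8.
$word = 'عربي';

// Without /u the dot matches one byte at a time.
$bytes = preg_match_all('/./', $word);

// With /u the dot matches one whole code point at a time.
$chars = preg_match_all('/./u', $word);

echo "without /u: $bytes, with /u: $chars\n"; // without /u: 8, with /u: 4
```

The same subject string yields twice as many matches without the modifier, because each Arabic letter is seen as two separate bytes.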

Update
Okay, \b really only detects \w boundaries, so it depends on the locale setting rather than on the /u flag. Try this instead, which uses lookaround assertions to require that no letter (\p{L}) sits directly before or after the keyword:

$regexp = "/(?<!\p{L})(".$k.")(?!\p{L})/ui";
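
Put together, it could look roughly like the sketch below. The keyword list and the article text are hypothetical sample data, and the preg_quote() step is an addition that is not in the question's code; it is there on the assumption that keywords may contain regex metacharacters.

```php
<?php
// Hypothetical sample data, not taken from the original question.
$keywords = ['cat', 'كتاب'];
$content  = 'The cat read a كتاب, not a category or الكتاب.';

// Escape each keyword so any regex metacharacters in it match literally.
$quoted = array_map(function ($kw) {
    return preg_quote($kw, '/');
}, $keywords);
$k = implode('|', $quoted);

// A keyword matches only when no Unicode letter touches it on either side.
$regexp = '/(?<!\p{L})(' . $k . ')(?!\p{L})/ui';
preg_match_all($regexp, $content, $matches);

print_r($matches[1]);
```

With this input, $matches[1] holds "cat" and "كتاب". The occurrences inside "category" and "الكتاب" are skipped because a letter is directly attached to them.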
Written by mario
The content is written by members of the stackoverflow.com community.
It is licensed under cc-wiki