I'm searching articles for keywords which are in both English and Arabic. The articles can be either in English or Arabic.

My current code is:

$k = implode("|", $keywords);
$regexp = "/(?i)\b(".$k.")\b/";
preg_match_all( $regexp, $content, $matches );

But this doesn't find keywords in Arabic articles for some reason. I've verified that both the keywords and articles are being read correctly; no encoding issues.

What can I do to fix this? Note that there is no way for me to detect whether an article or keyword is in English or Arabic, so there has to be a single regex to match them all.

Your regex may simply be missing the /u (UTF-8) modifier:

$regexp = "/(?i)\b(".$k.")\b/u";

Without it, PCRE treats the pattern and the subject as plain byte sequences. It may still find the words themselves (when the UTF-8 byte sequences are identical), but it will never detect word boundaries (\b) around non-ASCII letters.
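To make the byte-versus-character difference concrete, here is a small sketch. The sample word is an arbitrary choice of mine, and the script is assumed to be saved as UTF-8. It counts how many times . matches with and without the modifier:

```php
<?php
// Assumes this file is saved as UTF-8.
$word = "كتاب"; // four Arabic letters, two bytes each in UTF-8

// Without /u, "." consumes one byte at a time.
$bytes = preg_match_all('/./', $word, $unused);

// With /u, "." consumes one whole code point at a time.
$chars = preg_match_all('/./u', $word, $unused);

echo $bytes, " ", $chars, "\n"; // 8 4
```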

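Correction: \b only detects boundaries between \w and non-\w characters, and what counts as \w can depend on the locale setting or the PCRE build rather than on the /u flag alone. If \b still fails for you, try this instead, which uses lookaround assertions on the Unicode letter property: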

$regexp = "/(?<!\p{L})(".$k.")(?!\p{L})/ui";
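
Putting it together, a minimal sketch might look like the following. The function name find_keywords, the sample keywords, and the sample text are all made up for illustration, and the file is assumed to be saved as UTF-8. It also runs each keyword through preg_quote(), which the original code did not do, so that keywords containing regex metacharacters cannot break the pattern.

```php
<?php
// Hypothetical helper: returns every keyword occurrence found in $content.
// Assumes keywords and content are valid UTF-8.
function find_keywords(array $keywords, string $content): array
{
    if (count($keywords) === 0) {
        return []; // an empty alternation would match the empty string
    }

    // Escape regex metacharacters (and the "/" delimiter) in each keyword.
    $quoted = array_map(function ($kw) {
        return preg_quote($kw, '/');
    }, $keywords);
    $k = implode('|', $quoted);

    // No letter directly before or after the keyword, in any script.
    $regexp = '/(?<!\p{L})(' . $k . ')(?!\p{L})/ui';

    if (preg_match_all($regexp, $content, $matches) === false) {
        return []; // e.g. invalid UTF-8 in the subject
    }
    return $matches[1];
}

$content = "The Book is here. هذا كتاب جديد، والكتابة صعبة.";
$found = find_keywords(["book", "كتاب"], $content);

print_r($found); // "Book" and the standalone "كتاب" only
```

The longer word والكتابة contains the keyword but is not reported, because letters sit directly on both sides of it. This sketch does not try to handle Arabic diacritics or prefixed forms such as the definite article; those would need extra handling.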
