I'm using DOMDocument and DOMXPath in PHP to find elements in an HTML document. The document contains HTML entities such as &nbsp;, and I would like these entities to be preserved in the XPath output.

$doc = new DOMDocument();
$doc->loadHTML('<html><head></head><body>&nbsp;Test</body></html>');

$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//body');

foreach($nodes as $node) {
    echo $node->textContent;
}

This code produces the following output (UTF-8):

[space]Test

But I would like to have this:

&nbsp;Test

This may have something to do with libxml, which PHP uses internally, but I couldn't find any function that preserves the HTML entities.

Do you have an idea?

Comments

[space] is not UTF-8. Are you sure that it is U+0020 and not U+00A0?

Written by Alohci

@Alohci: Yes, you are right, it's U+00A0. I just wanted to make clear that the output shows whitespace instead of the nbsp entity.

Written by ChristianK

@Dimitre: Sorry, but this is an XPath specific question. It's about the output of an XPath query.

Written by ChristianK

Accepted Answer

XPath always sees a representation of the XML document in which entity references have been expanded. The only way to prevent this is to preprocess the XML document, replacing the entity references with something that won't be expanded, for example changing &nbsp; to §nbsp;.

Written by Michael Kay
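
The preprocessing idea from the accepted answer could look roughly like the sketch below. It is only an illustration, not code from the answer: the helper names protectEntities and restoreEntities and the @@ENT:...@@ marker are hypothetical choices. The marker is plain ASCII rather than § on the assumption that loadHTML() may misread non-ASCII bytes when the document declares no charset, and the sketch also assumes the marker never appears in the real input.

```php
<?php
// Hypothetical helpers; the marker format is an arbitrary choice for this sketch.

function protectEntities(string $html): string
{
    // "&nbsp;" becomes "@@ENT:nbsp@@", which the parser treats as ordinary text.
    return preg_replace('/&(#?[A-Za-z0-9]+);/', '@@ENT:$1@@', $html);
}

function restoreEntities(string $text): string
{
    // Reverse the substitution on the string extracted via XPath.
    return preg_replace('/@@ENT:(#?[A-Za-z0-9]+)@@/', '&$1;', $text);
}

$html = '<html><head></head><body>&nbsp;Test</body></html>';

$doc = new DOMDocument();
$doc->loadHTML(protectEntities($html));

$xpath = new DOMXPath($doc);
$results = [];
foreach ($xpath->query('//body') as $node) {
    $results[] = restoreEntities($node->textContent);
}

echo implode("\n", $results), "\n"; // prints: &nbsp;Test
```

Note that this pattern protects every entity reference, including &amp; and &lt;, so the restored string contains the entities exactly as they were written in the source rather than their decoded characters.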
The content is written by members of the stackoverflow.com community.
It is licensed under cc-wiki