I'm trying to write a regular expression using the PCRE library in PHP.

I need a regex that matches only the &, > and < chars that appear within the text content of an XML node, not those that are part of the tags themselves.

Input XML:

<pnode>
  <cnode>This string contains > and < and & chars.</cnode>
</pnode>

The idea is to search for these chars and replace them with their XML entity equivalents.

If I were to convert the entire XML to entities, it would look like this:

Entire XML converted to entities

&lt;pnode&gt;
  &lt;cnode&gt;This string contains &gt; and &lt; and &amp; chars.&lt;/cnode&gt;
&lt;/pnode&gt;

I need it to look like this:

Correct XML

<pnode>
  <cnode>This string contains &gt; and &lt; and &amp; chars.</cnode>
</pnode>

I have tried to write a regular expression to match these chars using a look-ahead, but I don't know enough to get it to work. My attempt (currently only trying to match > symbols):

/>(?=[^<]*<)/g

Just to make it clear, the XML I'm trying to fix comes from a 3rd party and they seem unable to fix it at their end, hence my attempt to fix it.
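
To see what the attempted look-ahead actually matches, the sketch below runs it against the sample input from the question. The variable names are illustrative only, and the /g flag is dropped because PHP's preg_* functions do not accept it; preg_match_all() already returns every match.

```php
<?php
// Sample input copied from the question.
$xml = "<pnode>\n  <cnode>This string contains > and < and & chars.</cnode>\n</pnode>";

// The attempted pattern: a ">" followed, before any other "<", by a "<".
$pattern = '/>(?=[^<]*<)/';

// PREG_OFFSET_CAPTURE returns each match together with its byte offset.
$count = preg_match_all($pattern, $xml, $matches, PREG_OFFSET_CAPTURE);

// Print a few characters of context before each matched ">" so that
// tag endings and stray characters can be told apart by eye.
foreach ($matches[0] as [$char, $offset]) {
    $start   = max(0, $offset - 8);
    $context = substr($xml, $start, $offset - $start + 1);
    echo $offset, ': ', str_replace("\n", '\n', $context), "\n";
}
```

On this input the pattern matches four of the five > characters: the ones closing <pnode>, <cnode> and </cnode> as well as the stray one in the text. The look-ahead alone therefore cannot tell markup from content.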

Comments

Your input is not XML.

Written by Rowland Shaw

@Rowland, while I agree with you, that's exactly his point: he wants to take the input and make it into valid XML by escaping the >, < and & characters.

Written by Lazarus

Unless you have a schema defined, how could you possibly know that any given < is not the beginning of a tag?

Written by John M Gant

Why do you have invalid XML to start with? Is it possible to avoid generating malformed XML rather than try to fix it up after the fact?

Written by John Kugelman

@Camsoft, have you tried regexlib.com as a resource for this kind of thing? It might provide some clues if not the final solution.

Written by Lazarus

@jmgant, that's a good point. If you assume that the nodes only have either text or child nodes between them, then by matching tag pairs you could identify the text that needs the substitutions.

Written by Lazarus
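
The tag-pair idea could be sketched in PHP roughly as follows. This is only an illustration under strong assumptions: the names of the text-only elements are known in advance (here cnode), those elements carry no attributes, and their text never contains their own closing tag. escapeText and escapeLeafText are hypothetical helper names, not part of any library.

```php
<?php
// Hypothetical helper: escape the three problem characters in plain text.
function escapeText(string $text): string
{
    // Escape only ampersands that do not already start an entity,
    // so text that is already escaped is not escaped twice.
    $text = preg_replace(
        '/&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)/',
        '&amp;',
        $text
    );
    return str_replace(['<', '>'], ['&lt;', '&gt;'], $text);
}

// Hypothetical helper: escape the text between <tag> and </tag> for each
// element name in $leafTags. Assumes no attributes and no nested markup.
function escapeLeafText(string $xml, array $leafTags): string
{
    foreach ($leafTags as $tag) {
        $name = preg_quote($tag, '~');
        // Lazy match from the opening tag to the nearest closing tag;
        // the s modifier lets "." span line breaks.
        $pattern = '~(<' . $name . '>)(.*?)(</' . $name . '>)~s';
        $xml = preg_replace_callback(
            $pattern,
            function (array $m): string {
                return $m[1] . escapeText($m[2]) . $m[3];
            },
            $xml
        );
    }
    return $xml;
}

$input = "<pnode>\n  <cnode>This string contains > and < and & chars.</cnode>\n</pnode>";
echo escapeLeafText($input, ['cnode']), "\n";
```

As John M Gant's comment points out, this breaks down as soon as the text itself can contain something that looks like a tag, so it is a stopgap rather than a general repair.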

s|<cnode>|<cnode><![CDATA[|g, s|</cnode>|]]></cnode>|g.

Written by KennyTM
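
The two sed-style substitutions above might translate to PHP along these lines. It is a sketch that assumes the element is written exactly as <cnode> with no attributes and that its text never contains the sequence ]]>, which would end the CDATA section early. wrapInCdata is a hypothetical helper name.

```php
<?php
// Hypothetical helper: wrap the content of every <tag>...</tag> pair in a
// CDATA section so that &, < and > inside it are treated as literal text.
function wrapInCdata(string $xml, string $tag): string
{
    return str_replace(
        ["<$tag>", "</$tag>"],
        ["<$tag><![CDATA[", "]]></$tag>"],
        $xml
    );
}

$input = "<pnode>\n  <cnode>This string contains > and < and & chars.</cnode>\n</pnode>";
echo wrapInCdata($input, 'cnode'), "\n";
```

Unlike the entity approach, this leaves the text itself untouched, at the cost of having to know which element names to wrap.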

@John Kugelman, that's usually my first response and probably the most valid one. Fixing the problem this way is a kludge at best; we should always try to solve the problem at its source. +1 for that.

Written by Lazarus

@jmgant Indeed. There is no schema with this so-called XML feed. It's worth noting that I get the XML feed from a 3rd party and have no control over its data. I was thinking it might be possible to write a crude regex that, when it finds a matching char, makes sure that a tag of the same name exists before and after it (i.e. the char is enclosed).

Written by Camsoft

@Lazarus Thanks for that, I'm looking in to it now.

Written by Camsoft

@Camsoft, "It's worth noting that I get the XML feed from a 3rd party and have no control over its data." No, you get a data feed. It's not an XML feed. If your 3rd party says it is, he's selling defective goods.

Written by LarsH

Accepted Answer

In the end I've opted to use the Tidy library in PHP. The code I used is shown below:

  // Specify configuration: treat the input as XML, emit XML, and
  // write entities other than the built-in ones in numeric form
  $config = array(
    'input-xml'        => true,
    'show-warnings'    => false,
    'numeric-entities' => true,
    'output-xml'       => true);

  // Parse the feed (read as Latin-1) and repair it
  $tidy = new tidy();
  $tidy->parseFile('feed.xml', $config, 'latin1');
  $tidy->cleanRepair();

This works perfectly, correcting all the encoding errors and converting invalid characters to XML entities.

Written by Camsoft
The content is written by members of the stackoverflow.com community.
It is licensed under cc-wiki