I have searched Stack Overflow for this problem and found a few topics, but none of them gave me a solid answer.

I have a form that users submit, and each field's value is stored in an XML file. The XML is set to be encoded with UTF-8.

Every now and then a user will copy/paste text from somewhere, and that's when I get the "entity not defined" error.

I realize XML only predefines five entities (&amp;, &lt;, &gt;, &quot; and &apos;); any other named entity is not recognized unless it is declared, hence the parser error.
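The behaviour is easy to reproduce outside PHP. Here is a minimal sketch in Python, using the standard library's ElementTree parser in place of SimpleXML; the field element and the helper name is_well_formed are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The five predefined XML entities parse without any DTD.
print(is_well_formed("<field>&amp; &lt; &gt; &quot; &apos;</field>"))  # True

# Numeric character references are always allowed too.
print(is_well_formed("<field>a&#160;b</field>"))  # True

# An HTML-only named entity is undefined as far as XML is concerned.
print(is_well_formed("<field>a&nbsp;b</field>"))  # False
```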

From what I gather, there are a few options:

  1. I can find and replace all &nbsp; and swap them out with &#160; or an actual space.
  2. I can place the code in question within a CDATA section.
  3. I can include these entities within the XML file.
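Options 1 and 3 can be sketched as follows, in Python rather than PHP purely for illustration. The sketch assumes the offending entities are HTML named entities such as &nbsp;. The function names and the field element are hypothetical, and the entity table comes from Python's html.entities module (HTML 4 names only).

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities XML predefines; these must be left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(markup):
    """Option 1: rewrite HTML named entities as numeric character references."""

    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML's own and unknown names alone
        return "&#%d;" % name2codepoint[name]

    return NAMED_ENTITY.sub(replace, markup)


def with_declared_entities(fragment, names=("nbsp",), root="field"):
    """Option 3: declare the entities in an internal DTD subset."""
    decls = "".join(
        '<!ENTITY %s "&#%d;">' % (name, name2codepoint[name]) for name in names
    )
    return "<!DOCTYPE %s [%s]>%s" % (root, decls, fragment)


raw = "<field>caf&eacute;&nbsp;&amp; tea</field>"

print(named_to_numeric(raw))
# <field>caf&#233;&#160;&amp; tea</field>

# Both variants now parse and yield the same text.
via_numeric = ET.fromstring(named_to_numeric(raw)).text
via_dtd = ET.fromstring(with_declared_entities(raw, ("eacute", "nbsp"))).text
print(via_numeric == via_dtd)  # True
```

Option 1 keeps the stored file free of any DTD, at the cost of a preprocessing step on every submission. Option 3 leaves the submitted text alone, but every entity a user might paste has to be declared up front.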

The workflow is this: the user enters content into a form, it gets stored in an XML file, and that content is then displayed as XHTML on a Web page (parsed with SimpleXML).

Of the three options, or any other option(s) I'm not aware of, what's really the best way to deal with these entities?

Thanks, Ryan


I want to thank everyone for the great feedback. I actually determined what caused my entity errors. All the suggestions made me look into it more deeply!

Some textboxes were plain old textboxes, but my textareas were enhanced with TinyMCE. On closer inspection, the PHP warnings always referenced data from the TinyMCE-enhanced textareas. Later I noticed that on a PC all the characters were stripped out (because it couldn't read them), while on a Mac you could see little square boxes showing the Unicode number of each character. The reason they showed up as squares on a Mac in the first place is that I used utf8_encode to encode data that wasn't in UTF-8, to prevent other parsing errors (which are somehow also related to TinyMCE).
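One possible explanation for the square boxes, offered as an assumption rather than a confirmed diagnosis: PHP's utf8_encode treats its input as ISO-8859-1, so running it over text that is already UTF-8 encodes every byte a second time. The Python sketch below imitates that conversion (both helper names are made up for illustration) and shows a guard that only converts when the bytes are not valid UTF-8 already.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Imitate an ISO-8859-1 -> UTF-8 conversion (hypothetical helper)."""
    return data.decode("latin-1").encode("utf-8")


def ensure_utf8(data: bytes) -> bytes:
    """Convert only when the input is not valid UTF-8 already."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return latin1_to_utf8(data)
    return data


already_utf8 = "é".encode("utf-8")       # b'\xc3\xa9'

print(latin1_to_utf8(already_utf8))      # b'\xc3\x83\xc2\xa9' (double-encoded)
print(ensure_utf8(already_utf8))         # b'\xc3\xa9' (left alone)
print(ensure_utf8(b"\xe9"))              # b'\xc3\xa9' (Latin-1 input converted)
```

The guard is only a heuristic: a few Latin-1 byte sequences happen to be valid UTF-8 as well, so knowing the real source encoding is still the more reliable fix.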

The solution to all this was quite simple:

I added the line entity_encoding : "utf-8" to my tinyMCE.init. Now all the characters show up the way they are supposed to.

The only thing I still don't understand is why the characters display correctly when entered in plain textboxes, where nothing converts them to UTF-8, while with TinyMCE they were a problem.

Thanks, Ryan


Some important parts of your question are invisible because they got parsed as markup. Please surround those bits with backquotes (``).

Written by LarsH

@LarsH: Hm, I don't see anything in the question source that would need this.

Written by Tomalak

@Tomalak: "1. I can find and replace all ?? and swap them out with ?? or an actual space." Sure looks to me like something is missing there.

Written by LarsH

@LarsH: Oh, you're right. I hadn't noticed these. Only a few more rep to go and you can edit questions yourself. :)

Written by Tomalak

+1 useful question.

Written by LarsH

Accepted Answer

This is generally an encoding issue. PHP, or SimpleXML in this particular case, does not like the Danish O you've got in that fornames tag.

Try encoding the file in UTF-8 and removing the escaped version from the tag.
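A minimal sketch of what that advice could look like, written in Python for illustration only: the text is stored as literal characters in a UTF-8 encoded file, and the serializer escapes only what XML itself requires. The helper name to_utf8_xml and the field element are assumptions, not anything from the question.

```python
import io
import xml.etree.ElementTree as ET


def to_utf8_xml(tag: str, text: str) -> bytes:
    """Serialize one element as UTF-8 bytes with an XML declaration."""
    element = ET.Element(tag)
    element.text = text  # literal characters, no named entities
    buffer = io.BytesIO()
    ET.ElementTree(element).write(buffer, encoding="utf-8", xml_declaration=True)
    return buffer.getvalue()


xml_bytes = to_utf8_xml("field", "Søren & café")

# Non-ASCII characters are written as UTF-8 bytes; only "&" is escaped.
print("ø".encode("utf-8") in xml_bytes)   # True
print(b"&amp;" in xml_bytes)              # True

# Reading the bytes back yields the original text.
print(ET.fromstring(xml_bytes).text == "Søren & café")  # True
```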

Written by Pramendra
The content is written by members of the stackoverflow.com community.
It is licensed under cc-wiki