Handling Special Markup Characters for HTML and XML, Entities

There are several key "special" characters in markup languages that can wreak havoc if not used properly. This is because a markup parser treats some characters as delimiters with a special meaning rather than as character data: the ampersand and the less than and greater than characters (&, <, >). I have seen several tutorials that present ways to allow these characters as input that are far more complicated than they need be (even a developerWorks tutorial that suggested a CDATA section for each character that required it).

The solution to this issue lies in entities. An entity is a representation of a character that is not made up of just the character itself, so the parser reads it as data and the character still appears in the output. This holds true for many markup languages such as HTML and XML. For XML and HTML alike the main "problem" characters are again the ampersand and the greater than and less than characters (and for HTML the non breaking space as well). Every character can be written as an entity (at minimum in numeric form), but generally only a few are required because in any given markup language only a few characters cause problems.

Enough about what special characters are and why they can cause "problems". So what do you do with entities, and how? Well, for HTML and XML this is very simple.
  • ampersand = & and the entity is &#038; or &amp;
  • greater than = > and the entity is &gt;
  • less than = < and the entity is &lt;
  • non breaking space = a space that will not wrap, and the entity is &nbsp; or &#160;
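
As a rough sketch of the substitutions in the list above, here is how they might be applied with the Python standard library. The choice of Python, the sample string and the helper name `parses` are my assumptions for illustration only; any language with an escaping routine would do the same job.

```python
import html
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Sample input (made up for this example) containing all three problem characters.
raw = "Fish & Chips <cheap> for x > 3"

# xml.sax.saxutils.escape() replaces &, < and > with &amp;, &lt; and &gt;.
xml_safe = escape(raw)
print(xml_safe)

# html.escape() does the same for HTML; quote=False leaves quote characters alone.
html_safe = html.escape(raw, quote=False)
print(html_safe)


def parses(text):
    """Return True if text is accepted as the content of an XML element."""
    try:
        ET.fromstring("<note>" + text + "</note>")
        return True
    except ET.ParseError:
        return False


print(parses(raw))       # the raw text is rejected by the parser
print(parses(xml_safe))  # the escaped text is accepted
```

The ampersand has to be replaced first (the library routines take care of that), otherwise the & inside a freshly written &lt; would itself be turned into &amp;.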
Note that the named forms (&amp;, &gt;, &lt;, &nbsp;) are human readable entities based on abbreviations. Each of these (and indeed each character) also has a Unicode decimal and a hexadecimal representation. For example the ampersand, which is &, is also represented by &#038; in decimal and &#x26; in hexadecimal. There are also more entities for other less common but sometimes helpful characters such as the AE ligature, arrows, accents, the copyright symbol and more (again, every character can be written as a numeric entity). Check the linked entity article at O'Reilly (XML.com) and the HTMLhelp.com chart for more info. The real source, with entities for all types of character sets, is the Unicode Charts at the Unicode specification website.
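
To see that the named, decimal and hexadecimal forms really are the same character, here is a small Python sketch. The helper name `references` is made up for this example; `html.unescape` and `html.entities.name2codepoint` come from the standard library and cover the HTML entity names (XML itself only predefines a handful, such as amp, lt and gt).

```python
import html
from html.entities import name2codepoint

# Four spellings of the ampersand: named, decimal (with and without
# a leading zero) and hexadecimal.
forms = ["&amp;", "&#038;", "&#38;", "&#x26;"]
decoded = [html.unescape(form) for form in forms]
print(decoded)


def references(name):
    """Return the named, decimal and hexadecimal forms of an HTML entity."""
    codepoint = name2codepoint[name]
    return "&%s;" % name, "&#%d;" % codepoint, "&#x%X;" % codepoint


# A few of the entities mentioned in the text.
for name in ("amp", "lt", "gt", "nbsp", "copy", "AElig"):
    print(references(name))
```

The numeric forms are the safer choice when the same text has to work in both HTML and plain XML, since a name like nbsp is not defined in XML unless a DTD declares it.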

Comments

Re: Handling Special Markup Characters, Entities

Uh, yer entities are the same as the characters your telling me that yer entities are s'posed to represent. So, what are the entities? -jeepmutt

Re: Handling Special Markup Characters, Entities

ahh, grasshopper, but that is the mystery is not, once you see the answer you will know the problem! uh, and I @#cked up the story text, fixing it now, man that "preview" button sure would be handy

Re: Handling Special Markup Characters, Entities

Ok fixed it. It was actually another Mozilla issue. The story was published using Mozilla and some really weird XML stuff is happening while using Mozilla (replacing stuff with entities for me and removing entities?)
