Tuesday 10 March 2015

Escaping special characters of XML in Java

Background

In a previous post I showed how to parse XML from a String. So let's parse a simple String that contains a Google Play app name. E.g. -
  • <AppName>Temple Run</AppName>
The code is as follows -

    public static void main(String args[]) throws SAXException, IOException, ParserConfigurationException {
            String xmlString = "<AppName>Temple Run</AppName>";
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = db.parse(new InputSource(new StringReader(xmlString)));
            System.out.println(doc.getFirstChild().getTextContent());
    }

and the output is as expected - Temple Run

Now let's change our input XML string/app name as follows -
  • <AppName>Angels & Demons</AppName>
Run the code again with the above XML string as input. You will get the following exception -

[Fatal Error] :1:18: The entity name must immediately follow the '&' in the entity reference.
Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 18; The entity name must immediately follow the '&' in the entity reference.
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at StringXMLParser.main(StringXMLParser.java:20)


The reason is that '&' is a special character, and you need to escape it in a String before parsing that String as XML. The same goes for HTML. In its escaped form '&' looks like '&amp;', so the input should be something like -
  • <AppName>Angels &amp; Demons</AppName>

Special Characters in XML

The special characters in XML and their escaped forms are -
  1. & - &amp;
  2. < - &lt;
  3. > - &gt;
  4. " - &quot;
  5. ' - &apos;
So when you create XML from input that contains these special characters, you need to take care of escaping them yourself. Obviously you can't expect your clients to enter the app name as "Angels &amp; Demons".



The reason for escaping these so-called special characters is that they have special meaning in XML, and using them unescaped in data leads to parsing errors like the one shown in the stack trace above. For example, the & character starts an entity reference (such as &amp; or &lt;), so the parser expects an entity name immediately after it.
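
To see the other direction, here is a minimal sketch showing that the parser turns the five escaped forms back into the original characters. The class name PredefinedEntitiesDemo and the helper textOf are made up for this illustration; only the standard JDK parser is used.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the original post.
class PredefinedEntitiesDemo {

    // Parses a single-element XML string and returns the element's text.
    // Parser exceptions are wrapped so that callers stay simple.
    static String textOf(String xml) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = db.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        // All five special characters appear in their escaped form here.
        String xml = "<AppName>&lt;Angels&gt; &amp; &quot;Demons&quot; &apos;HD&apos;</AppName>";
        // prints: <Angels> & "Demons" 'HD'
        System.out.println(textOf(xml));
    }
}
```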

Escaping Input for XML in Java

You can very well write your own piece of code to find these special characters in the input and replace them with their escaped versions. For this tutorial I am going to use the StringEscapeUtils class from Apache Commons Lang, which provides escaping for several formats such as XML, HTML, Java and CSV.
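
If you do want to roll your own, a minimal sketch could look like the following. The class name SimpleXmlEscaper is made up for this illustration, and it only handles the five characters listed above. It does not deal with characters that are not allowed in XML at all (such as most control characters), so treat it as a starting point rather than a replacement for a library.

```java
// Hypothetical hand-written escaper, not part of the original post.
class SimpleXmlEscaper {

    // Replaces the five XML special characters with their escaped forms.
    // Every other character is copied through unchanged.
    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: Angels &amp; Demons
        System.out.println(escape("Angels & Demons"));
        // prints: Tom &amp; &quot;Jerry&quot; &lt;3
        System.out.println(escape("Tom & \"Jerry\" <3"));
    }
}
```

Note that the loop handles '&' per character, so an input that is already escaped (for example "&amp;") gets escaped a second time. Escape raw input exactly once, right before building the XML.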

As usual I am using Ivy as my dependency manager and Eclipse as my IDE. To install and configure Apache Ivy, refer to the link provided in the "Related Links" section at the bottom.

My Ivy file looks like the following -

<ivy-module version="2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:noNamespaceSchemaLocation="http://ant.apache.org/ivy/schemas/ivy.xsd">
    <info
        organisation="OpenSourceForGeeks"
        module="XMLEscaper"
        status="integration">
    </info>
    
    <dependencies>
        <dependency org="org.apache.commons" name="commons-lang3" rev="3.3.2"/>        
    </dependencies>
    
</ivy-module>

Now let's get to the code -

import java.io.IOException;
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.apache.commons.lang3.StringEscapeUtils;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;


public class StringXMLParser {
    
    public static void main(String args[]) throws SAXException, IOException, ParserConfigurationException {
           
            String appNameInput = "Angels & Demons";
            System.out.println("App Name Before Escaping : " + appNameInput);
            // escapeXml10 replaces escapeXml, which is deprecated since commons-lang3 3.3
            String escapedInput = StringEscapeUtils.escapeXml10(appNameInput);
            System.out.println("App Name After Escaping : " + escapedInput);
            String xmlString = "<AppName>" + escapedInput + "</AppName>";
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = db.parse(new InputSource(new StringReader(xmlString)));
            System.out.println(doc.getFirstChild().getTextContent());
    }
}
 

Compile and run the above code. You should get the following output -

App Name Before Escaping : Angels & Demons
App Name After Escaping : Angels &amp; Demons
Angels & Demons

No exception this time. I have only shown this demo for the '&' special character, but you can do the same for all the special characters listed in the "Special Characters in XML" section above.
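
If adding a dependency is not an option, another way to avoid escaping by hand is to not build the XML through String concatenation at all. The sketch below is an alternative to the approach above, not what the rest of this post uses: it builds the element with the JDK's DOM API and lets the Transformer do the escaping when the document is written out. The class name DomXmlWriter and the method toXml are made up for this illustration.

```java
import java.io.StringWriter;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class, not part of the original post.
class DomXmlWriter {

    // Builds <AppName>...</AppName> from raw, unescaped input and serializes it.
    static String toXml(String appName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("AppName");
            // The raw text goes in as-is; no manual escaping here.
            root.setTextContent(appName);
            doc.appendChild(root);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            // Leave out the <?xml ...?> header so only the element is returned.
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize XML", e);
        }
    }

    public static void main(String[] args) {
        // The '&' in the input is written out as &amp; by the serializer.
        System.out.println(toXml("Angels & Demons"));
    }
}
```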



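Note: Only the characters "<" and "&" are strictly illegal in XML text content. The greater-than character is legal there (except as part of the sequence "]]>"), but it is a good habit to escape it. Quotes only have to be escaped inside attribute values that are delimited by the same kind of quote.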
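
The note above can be checked with a small sketch like the one below, which feeds unescaped characters to the JDK parser and reports whether the input was accepted. The class name RawCharacterCheck and the helper isWellFormed are made up for this illustration.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original post.
class RawCharacterCheck {

    // Returns true if the parser accepts the given XML string, false if it
    // rejects it with a SAXException.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing "[Fatal Error]" to stderr.
            db.setErrorHandler(new DefaultHandler());
            db.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            throw new IllegalStateException("Parser setup or I/O problem", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<AppName>5 > 3</AppName>"));               // true
        System.out.println(isWellFormed("<AppName>\"Angels\" 'Demons'</AppName>")); // true
        System.out.println(isWellFormed("<AppName>Angels & Demons</AppName>"));     // false
        System.out.println(isWellFormed("<AppName>1 < 2</AppName>"));               // false
    }
}
```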


Related Links
