Reading XML document nodes containing special characters (&, -, etc) with Java

My code does not retrieve the full text of element nodes that contain special characters. For example, for this node:

<theaterName>P&G Greenbelt</theaterName>

It would only retrieve “P” due to the ampersand. I need to retrieve the entire string.

Here’s my code:

public List<String> findTheaters() {

    //Clear theaters application global
    FilmhopperActivity.tData.clearTheaters();

    ArrayList<String> theaters = new ArrayList<String>();

    NodeList theaterNodes = doc.getElementsByTagName("theaterName");

    for (int i = 0; i < theaterNodes.getLength(); i++) {

        Node node = theaterNodes.item(i);
        if (node.getNodeType() == Node.ELEMENT_NODE) {

            //Found theater, add to return array
            Element element = (Element) node;
            NodeList children = element.getChildNodes();
            String name = children.item(0).getNodeValue();
            theaters.add(name);

            //Logging
            android.util.Log.i("MoviefoneFetcher", "Theater found: " + name);

            //Add theater to application global
            Theater t = new Theater(name);
            FilmhopperActivity.tData.addTheater(t);
        }
    }

    return theaters;
}

I tried adding code to extend the name string to concatenate additional children.items, but it didn’t work. I’d only get “P&”.

...
String name = children.item(0).getNodeValue();
for (int j = 1; j < children.getLength() - 1; j++) {
    name += children.item(j).getNodeValue();
}

Thanks for your time.


UPDATE: Found a method called normalize() that you can call on a Node. It merges adjacent text child nodes, so children.item(0) then contains the text of all the children, including ampersands!

Answer

The & is an escape character in XML. XML that looks like this:

<theaterName>P&G Greenbelt</theaterName>

should actually be rejected by the parser. Instead, it should look like this:

<theaterName>P&amp;G Greenbelt</theaterName>

There are a few such characters, such as < (&lt;), > (&gt;), " (&quot;) and ' (&apos;). There are also other ways to escape characters, such as via their Unicode value, as in &#x2022; or &#12345;.
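To check which of the two forms a given parser accepts, a small test along these lines may help. It is only a sketch using the standard JAXP DOM parser; the class name AmpersandCheck and the helper parseRootText are made up for illustration and are not part of the question's code.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class AmpersandCheck {

    // Hypothetical helper: returns the text of the root element,
    // or null if the parser rejects the input as not well-formed.
    public static String parseRootText(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors; installing it replaces the
        // built-in handler, which typically prints them to stderr first.
        builder.setErrorHandler(new DefaultHandler());
        try {
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getTextContent();
        } catch (SAXException e) {
            // A bare & that does not start a valid reference ends up here.
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        // Unescaped ampersand: not well-formed, so the helper returns null.
        System.out.println(parseRootText("<theaterName>P&G Greenbelt</theaterName>"));
        // Escaped ampersand: parses, and the text comes back with a literal &.
        System.out.println(parseRootText("<theaterName>P&amp;G Greenbelt</theaterName>"));
    }
}
```

If the first call returns null for your data, the feed itself is malformed and needs fixing at the source. If your real document parses without error, the ampersand is already escaped there and the problem is in how the text nodes are read.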

For more information, the XML specification is fairly clear.

The other possibility, depending on how your tree was constructed, is that the character is escaped properly in the source, and the sample you showed isn’t the raw XML but how the data is represented in the tree.

For example, when using SAX to build a tree, entities (the &-thingies) are broken apart and delivered separately. The SAX parser tries to return contiguous chunks of data, so when it reaches the escape character it sends what it has so far and starts a new chunk with the translated &-value. You may therefore need to combine consecutive text nodes in your tree to get the whole value.
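Combining the text nodes could look roughly like the sketch below. It builds the split tree by hand (three text children, which is an assumption about what the parser produced in the question) and then reads it back two ways with standard org.w3c.dom calls. The class name TextNodeMerge and the helper splitTheaterName are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class TextNodeMerge {

    // Hypothetical helper: builds <theaterName> with three separate text
    // children ("P", "&", "G Greenbelt") to mimic a parser that splits
    // the text at the entity reference.
    public static Element splitTheaterName() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element element = doc.createElement("theaterName");
        element.appendChild(doc.createTextNode("P"));
        element.appendChild(doc.createTextNode("&"));
        element.appendChild(doc.createTextNode("G Greenbelt"));
        doc.appendChild(element);
        return element;
    }

    public static void main(String[] args) throws Exception {
        Element element = splitTheaterName();

        // Reading only the first child gives just the first chunk.
        System.out.println(element.getFirstChild().getNodeValue()); // prints "P"

        // Option 1: getTextContent() concatenates all descendant text.
        System.out.println(element.getTextContent()); // prints "P&G Greenbelt"

        // Option 2: normalize() merges adjacent text nodes in place,
        // after which the first child holds the whole value.
        element.normalize();
        System.out.println(element.getChildNodes().getLength()); // prints "1"
        System.out.println(element.getFirstChild().getNodeValue()); // prints "P&G Greenbelt"
    }
}
```

A manual loop over the children works as well, provided it visits every child, that is, it runs while j < children.getLength() rather than stopping one short.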
