11 Reasons Why I Hate XML

… at least in Java.

1 – Namespace and import

XML is only apparently simple. As soon as namespaces are used, it immediately gets complicated. What is the difference between targetNamespace=”…”, xmlns=”…” and xmlns:tns=”…”? Can I declare several prefixes for the same namespace? Can I change the default namespace from within a document? What happens if I import a schema and rebind it to another namespace? How do I reference an element unambiguously? Ever wondered how to really create a QName correctly? Ever wondered what happens if you have a cycle in your dependencies?
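To make the prefix questions a bit more concrete, here is a minimal sketch that uses only the JDK (javax.xml.namespace.QName and a namespace-aware DOM parser). The namespace URI urn:example:orders, the element name and the class name are made up for the illustration. It suggests that a prefix is only a local alias: what identifies an element is the pair (namespace URI, local name).

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class NamespaceDemo {

    // Hypothetical namespace URI, used only for this illustration.
    static final String NS = "urn:example:orders";

    // Parses the XML with a namespace-aware parser and returns the root's qualified name.
    static QName rootName(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return new QName(root.getNamespaceURI(), root.getLocalName());
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        QName expected = new QName(NS, "order");

        // Two different prefixes and the default namespace all name the same element.
        System.out.println(expected.equals(rootName("<a:order xmlns:a='" + NS + "'/>")));
        System.out.println(expected.equals(rootName("<b:order xmlns:b='" + NS + "'/>")));
        System.out.println(expected.equals(rootName("<order xmlns='" + NS + "'/>")));

        // The prefix is carried by QName but ignored by equals().
        System.out.println(expected.equals(new QName(NS, "order", "zzz")));
    }
}
```

All four lines print true, which is why comparing raw tag names such as "a:order" as strings is fragile.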

2 – Encoding and CDATA

XML encoding and file encoding are not the same. This is a huge source of trouble. Both encodings must match, and the XML file should be read and parsed according to the encoding specified in the XML declaration. Depending on the encoding, characters will be serialized differently, which is again a huge source of confusion. If the reader or writer of an XML document behaves incorrectly, the document can be dangerously corrupted and information can be lost. Editors don’t necessarily display the characters correctly, even when the document is right. Ever got a ? or ¿ in your text? Ever made a distinction between &amp; and & ? Ever wondered whether a CDATA section was necessary or if using UTF-8 would be ok? Ever realized that > can usually be written as-is, whereas < must be escaped both in element content and in attribute values?
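The mismatch between declared encoding and actual bytes can be sketched with the JDK parser alone. The sample document and the class name below are invented for the illustration. The idea is to hand raw bytes to the parser so that it decodes them according to the XML declaration, and to compare that with decoding the same bytes by hand using the wrong charset.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

class EncodingDemo {

    // Lets the parser decode the bytes itself, based on the XML declaration.
    static String parseText(byte[] bytes) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes))
                    .getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<t>caf\u00e9 &amp; cr\u00e8me</t>";

        // Declaration and byte encoding agree: accents survive and &amp; becomes &.
        byte[] latin1 = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(parseText(latin1));

        // Same bytes decoded by hand as UTF-8: the accented characters are replaced.
        System.out.println(new String(latin1, StandardCharsets.UTF_8));
    }
}
```

Passing a byte stream rather than a String or Reader is a deliberate choice here: with a Reader, the caller has already picked a charset and the declared encoding is no longer what drives decoding.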

3 – Entities and DOCTYPE

Somewhat related to #2, but not only. XML entities are a generic way to define variables and are declared in the DOCTYPE. You can define custom entities; this is rather unusual but still needs to be supported. Entities can be internal or external to your XML document, and entity resolution differs between the two cases. Because entities are also used to escape special characters, you cannot consider them an advanced feature that you won’t use. XML entities need to be handled with care and are always a source of trouble. For instance, the tag <my-tag>hello&amp;world</my-tag> will typically trigger 3 characters(...) events with SAX, because the parser is free to split character data into several chunks.
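Since SAX makes no promise about how character data is chunked, a handler should accumulate text instead of assuming one characters(...) call per element. A minimal sketch with the JDK SAX parser follows; the class names are hypothetical, and the number of events reported depends on the parser.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class CharactersDemo {

    // Collects character data across however many characters(...) calls the parser makes.
    static final class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        int events = 0;

        @Override
        public void characters(char[] ch, int start, int length) {
            events++;
            text.append(ch, start, length);
        }
    }

    static TextCollector collect(String xml) {
        try {
            TextCollector handler = new TextCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        TextCollector result = collect("<my-tag>hello&amp;world</my-tag>");
        System.out.println(result.events + " characters(...) event(s)");
        System.out.println(result.text);
    }
}
```

Whatever the event count, the accumulated text is hello&world. In a real handler the buffer would be reset in startElement and consumed in endElement.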

4 – Naming convention

Ever wondered whether it was actually better to name your tag <my-tag/>, <myTag/> or <MyTag/>? The same goes for attributes.

5 – Null, empty string and white spaces

Distinguishing null from the empty string in XML is always painful. Null would be represented by the absence of the tag or attribute, whereas the empty string would be represented by an empty tag or empty attribute. The same problem appears if you want to distinguish an empty list from no list at all. If this is not considered clearly upfront (which is frequently the case), it can be very hard to retrofit the distinction cleanly into an application.
Whitespace is another issue of its own. The way tabs, spaces, carriage returns and line feeds are processed is always confusing. There are some options to control that, but they are way too complicated for most usages. As a consequence, sometimes these special characters will be encoded as entities, sometimes embedded in CDATA and sometimes stored as-is in the XML.
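The null versus empty string convention described above (absent tag means null, empty tag means empty string) can be sketched with DOM. The <person> and <name> elements and the class name are invented for the example, and the convention itself is only one possible choice; xsi:nil is another.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NullVsEmptyDemo {

    // Returns null when <name> is absent, and its text (possibly "") when it is present.
    static String readName(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            NodeList names = root.getElementsByTagName("name");
            return names.getLength() == 0 ? null : names.item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readName("<person/>"));                             // absent
        System.out.println("[" + readName("<person><name/></person>") + "]");   // empty
        System.out.println("[" + readName("<person><name> </name></person>") + "]"); // one space
    }
}
```

The three calls give null, an empty string and a single space respectively, so the whitespace question shows up immediately: the application has to decide whether " " counts as empty.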

6 – Normalization

XML encryption and signature look fine on paper. But as soon as you dig into the spec, you realize that it’s not so easy because of the syntactic and semantic equivalence of XML documents. Is <my-tag></my-tag> the same as <my-tag/>? To solve this issue, XML normalization (canonicalization) was introduced, which defines the canonical representation of a document. Good luck understanding all the subtleties when considering remarks #1, #2, #3 and #5.
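The gap between textual and logical equality can be illustrated without a canonicalization library. The sketch below does not implement XML canonicalization; it swaps in DOM's Node.isEqualNode as a rough stand-in, simply to show documents that differ byte for byte yet parse to the same tree. The class name is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class EquivalenceDemo {

    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Compares the parsed trees rather than the serialized text.
    static boolean sameTree(String a, String b) {
        return parseRoot(a).isEqualNode(parseRoot(b));
    }

    public static void main(String[] args) {
        // Different text, same tree: empty-element syntax, attribute order, quote style.
        System.out.println(sameTree("<my-tag></my-tag>", "<my-tag/>"));
        System.out.println(sameTree("<t a='1' b='2'/>", "<t b=\"2\" a=\"1\"/>"));

        // A single space is content, so these trees differ.
        System.out.println(sameTree("<t> </t>", "<t/>"));
    }
}
```

A signature computed over the raw bytes would treat the first two pairs as different documents, which is the problem canonicalization is meant to address.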

7 – Too many APIs and implementations

Even if things have improved in this area, there are too many APIs and implementations available. Sometimes I wish there was one unified API and one single implementation… Ever wondered how to select a specific implementation? Ever got a classloader issue due to an XML library? Ever got confused about whether StAX was actually better than SAX for reading XML documents?
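A first step when chasing such issues is to find out which implementation the JAXP lookup actually picked. The sketch below only prints the concrete factory classes; the class and method names are hypothetical, and the output depends entirely on the JDK and on the libraries present on the classpath.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.transform.TransformerFactory;

class ImplementationDemo {

    // Returns the concrete class behind a JAXP factory instance.
    static String implementationOf(Object factory) {
        return factory.getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("DOM:  " + implementationOf(DocumentBuilderFactory.newInstance()));
        System.out.println("SAX:  " + implementationOf(SAXParserFactory.newInstance()));
        System.out.println("StAX: " + implementationOf(XMLInputFactory.newInstance()));
        System.out.println("StAX: " + implementationOf(XMLOutputFactory.newInstance()));
        System.out.println("XSLT: " + implementationOf(TransformerFactory.newInstance()));
    }
}
```

Each newInstance() call consults a system property named after the factory class (for example javax.xml.parsers.DocumentBuilderFactory) before falling back to service discovery and the built-in default, so adding a jar to the classpath can silently change the result.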

8 – Implementation options

Most XML implementations have options or features to deal with the subtleties I just described. This is especially true for namespace handling. As a consequence, your code may work on one implementation but not on another. For instance, startDocument should be used to start an XML document and deal with namespaces correctly. The strictness of the implementations differs, so don’t take for granted that portability is 100%.
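One option that regularly surprises people is namespace awareness, which JAXP leaves off by default for DocumentBuilderFactory. The sketch below parses the same document with the flag off and on; the URI urn:example:a and the class name are made up for the illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class NamespaceAwareDemo {

    // Returns the namespace URI the parser reports for the root element.
    static String rootNamespace(String xml, boolean namespaceAware) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getNamespaceURI();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<a:x xmlns:a='urn:example:a'/>";
        System.out.println(rootNamespace(xml, false)); // namespace information is not reported
        System.out.println(rootNamespace(xml, true));
    }
}
```

With the flag off the root element has no namespace URI at all (null), so code that relies on getNamespaceURI() or getLocalName() quietly breaks when it is handed a tree built by a differently configured parser.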

9 – Pretty printing

There are so many APIs and frameworks that it’s always a mess to deal with pretty printing, assuming the framework supports it at all.
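As an illustration of the mess, here is a minimal sketch of pretty printing through the JAXP identity Transformer. OutputKeys.INDENT is standard, but the indent width is not: the indent-amount property below is a Xalan-style, implementation-specific key and is only assumed to be honoured by the transformer in use. The class name is hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class IndentDemo {

    // Re-serializes the XML through an identity transform with indentation requested.
    static String indent(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            // Implementation-specific key (assumption): other transformers may ignore it.
            transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(indent("<root><child>text</child><empty/></root>"));
    }
}
```

The exact layout (indent width, line separators, handling of existing whitespace) varies between implementations and JDK versions, so the output should not be compared byte for byte in tests.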

10 – Security

XML was not designed for security. Notorious problems are: dangerous framework extensions, XML bombs, outbound connections to access remote schemas, excessive memory consumption, and many more problems documented in this excellent article from MISC. As a consequence, XML documents can easily be abused to disrupt the system.
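A common mitigation is to switch off the features that make these attacks possible before parsing untrusted input. The sketch below is one possible hardening of a DOM parser, not a complete answer. FEATURE_SECURE_PROCESSING is standard JAXP, whereas the disallow-doctype-decl feature is a Xerces feature name and is only assumed to be supported by the parser in use (the parser bundled with the JDK accepts it). The class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class HardenedParserDemo {

    static DocumentBuilder hardenedBuilder() throws ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        // Xerces-specific (assumption): reject any document that carries a DOCTYPE,
        // which rules out entity bombs and external entities in one go.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        factory.setXIncludeAware(false);
        factory.setExpandEntityReferences(false);
        return factory.newDocumentBuilder();
    }

    // Returns true if the hardened parser accepts the document, false if it rejects it.
    static boolean accepts(String xml) {
        try {
            hardenedBuilder().parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(accepts("<a>plain</a>"));
        System.out.println(accepts("<!DOCTYPE a [<!ENTITY x 'y'>]><a>&x;</a>"));
    }
}
```

Rejecting DOCTYPE outright is the bluntest option; when DTDs are genuinely needed, the external entity and external DTD features have to be disabled one by one instead.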

11 – XPath and XSLT

XPath and XSLT belong to the XML ecosystem and suffer from the same problems as XML itself: apparent simplicity but internal complexity. I won’t speak here about everything else that surrounds XML and that forms the big picture of the XML family of specifications. I will just say that I recently got an NPE in NetBeans because “/wsa:MessageID” was not ok but using “/wsa:MessageID/.” was just fine. Got the point?
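For reference, this is roughly what evaluating a prefixed expression looks like with the standard javax.xml.xpath API. It does not reproduce the NetBeans NPE. The namespace URI is assumed to be the WS-Addressing 1.0 one (the text above only shows the wsa prefix), and the sample message and class names are invented.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathNamespaceDemo {

    // Assumption: "wsa" stands for the WS-Addressing 1.0 namespace.
    static final String WSA = "http://www.w3.org/2005/08/addressing";

    // Minimal context that only knows the "wsa" prefix.
    static final class WsaContext implements NamespaceContext {
        @Override
        public String getNamespaceURI(String prefix) {
            if (prefix == null) {
                throw new IllegalArgumentException("prefix must not be null");
            }
            return "wsa".equals(prefix) ? WSA : XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String uri) {
            return WSA.equals(uri) ? "wsa" : null;
        }

        @Override
        public Iterator<String> getPrefixes(String uri) {
            return WSA.equals(uri)
                    ? Collections.singletonList("wsa").iterator()
                    : Collections.<String>emptyIterator();
        }
    }

    static String evaluate(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for prefixed XPath steps to match
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new WsaContext());
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The document uses prefix "x", the expression uses "wsa": only the URI matters.
        String xml = "<x:MessageID xmlns:x='" + WSA + "'>urn:uuid:1234</x:MessageID>";
        System.out.println(evaluate(xml, "/wsa:MessageID"));
        System.out.println(evaluate(xml, "/wsa:MessageID/."));
        System.out.println("[" + evaluate(xml, "/MessageID") + "]");
    }
}
```

The first two expressions select the same node, while the unprefixed /MessageID matches nothing and evaluates to an empty string, which ties back to reason #1.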

StAX pretty printer

Using StAX to write XML is a lot easier than using either DOM or SAX. There is however no option to indent the generated XML, unlike with SAX or DOM. When faced with this problem, I came up with a simple yet generic solution: I would intercept all write calls and prepend the necessary whitespace according to the current depth in the XML. To achieve this easily, an InvocationHandler can be used to decorate the XMLStreamWriter.

Here is a sample usage:

// Needs: java.io.ByteArrayOutputStream, java.lang.reflect.Proxy,
// javax.xml.stream.XMLOutputFactory, javax.xml.stream.XMLStreamWriter
XMLOutputFactory factory = XMLOutputFactory.newInstance();
ByteArrayOutputStream baos = new ByteArrayOutputStream();

XMLStreamWriter wstxWriter = factory.createXMLStreamWriter(baos, "UTF-8"); // specify encoding

// Wrap with pretty print proxy
PrettyPrintHandler handler = new PrettyPrintHandler(wstxWriter);
XMLStreamWriter prettyPrintWriter = (XMLStreamWriter) Proxy.newProxyInstance(
    XMLStreamWriter.class.getClassLoader(),
    new Class<?>[]{XMLStreamWriter.class},
    handler);

prettyPrintWriter.writeStartDocument();

And the InvocationHandler looks like this (see this gist):

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLStreamWriter;
import org.apache.log4j.Logger; // log4j 1.x is assumed, based on the isDebugEnabled()/debug() calls

public class PrettyPrintHandler implements InvocationHandler {

  private static final Logger LOGGER = Logger.getLogger(PrettyPrintHandler.class.getName());
  private final XMLStreamWriter target;
  private int depth = 0;
  // For each depth, whether the element currently open at that depth has child elements
  private final Map<Integer, Boolean> hasChildElement = new HashMap<Integer, Boolean>();
  private static final String INDENT_CHAR = " ";
  private static final String LINEFEED_CHAR = "\n";

  public PrettyPrintHandler(XMLStreamWriter target) {
    this.target = target;
  }

  public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
    String m = method.getName();
    if (LOGGER.isDebugEnabled()) {
      LOGGER.debug("XML event: " + m);
    }
    // Needs to be BEFORE the actual event, so that for instance the
    // sequence writeStartElem, writeAttr, writeStartElem, writeEndElem, writeEndElem
    // is correctly handled
    if ("writeStartElement".equals(m)) {
      // update state of parent node
      if (depth > 0) {
        hasChildElement.put(depth - 1, true);
      }
      // reset state of current node
      hasChildElement.put(depth, false);
      // indent for current depth
      target.writeCharacters(LINEFEED_CHAR);
      target.writeCharacters(repeat(depth, INDENT_CHAR));
      depth++;
    }
    else if ("writeEndElement".equals(m)) {
      depth--;
      // null-safe: avoids an unboxing NPE if no state was recorded for this depth
      if (Boolean.TRUE.equals(hasChildElement.get(depth))) {
        target.writeCharacters(LINEFEED_CHAR);
        target.writeCharacters(repeat(depth, INDENT_CHAR));
      }
    }
    else if ("writeEmptyElement".equals(m)) {
      // update state of parent node
      if (depth > 0) {
        hasChildElement.put(depth - 1, true);
      }
      // indent for current depth
      target.writeCharacters(LINEFEED_CHAR);
      target.writeCharacters(repeat(depth, INDENT_CHAR));
    }
    try {
      // forward the call and return its result (getPrefix, getProperty, ... return values)
      return method.invoke(target, args);
    } catch (InvocationTargetException e) {
      // rethrow the writer's own exception (e.g. XMLStreamException), not the reflection wrapper
      throw e.getCause();
    }
  }

  private String repeat(int d, String s) {
    String _s = "";
    while (d-- > 0) {
      _s += s;
    }
    return _s;
  }
}

The repeat method is quite ugly. You can use StringUtils from commons-lang instead, or check one of the other repeat implementations on Stack Overflow.
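If pulling in a library just for this feels excessive, a StringBuilder-based version is a small improvement. This is only a sketch with the same argument order as the method above; the class name Repeat is hypothetical.

```java
class Repeat {

    // Returns s repeated count times, or an empty string when count is zero or negative.
    static String repeat(int count, String s) {
        if (count <= 0) {
            return "";
        }
        StringBuilder sb = new StringBuilder(count * s.length());
        for (int i = 0; i < count; i++) {
            sb.append(s);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + repeat(4, " ") + "]");
    }
}
```

It avoids creating a new String on every iteration, which matters little for shallow documents but adds up when indentation is computed for every element.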