Windward

XML encoding error

Overview

What to do if you get an XML encoding error.

Resolution

Step 1: Read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky:

There Ain't No Such Thing As Plain Text.

If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.

If you get an error like this:

net.windward.datasource.DataSourceException: net.windward.datasource.DataSourceException: Dom4jDataSource ctor
            at net.windward.datasource.dom4j.Dom4jDataSource.<init>(Unknown Source)
            ...

Caused by: org.dom4j.DocumentException: Error on line 2 of document  : Invalid byte 1 of 1-byte UTF-8 sequence. Nested exception: Invalid byte 1 of 1-byte UTF-8 sequence.
            ...

The problem is that the XML contains a character whose byte value is greater than 127, and that byte is not part of a valid UTF-8 sequence. If you do not set the encoding, Windward assumes that the data is encoded as UTF-8. (You also get this problem if you explicitly set the encoding to UTF-8.)
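The failure can be reproduced outside Windward. The following is a minimal sketch using Python's standard XML parser rather than dom4j, so the exact error text differs. It assumes the offending byte came from a file saved as ISO-8859-1; your file may use a different single-byte encoding. The helper name `parses` is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The pound sign stored as the single byte 0xA3 (as ISO-8859-1 would store it),
# inside a document that claims to be UTF-8. ISO-8859-1 is an assumption here.
bad_xml = b'<?xml version="1.0" encoding="UTF-8"?>\n<pound>\xa3</pound>'


def parses(xml_bytes: bytes) -> bool:
    """Return True if the bytes parse as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(xml_bytes)
        return True
    except ET.ParseError:
        return False


# Option 1: declare the encoding the bytes are really in.
declared_latin1 = bad_xml.replace(b'encoding="UTF-8"', b'encoding="ISO-8859-1"')

# Option 2: keep the UTF-8 declaration and re-encode the content as UTF-8.
reencoded_utf8 = bad_xml.decode("iso-8859-1").encode("utf-8")

print(parses(bad_xml))          # the lone 0xA3 byte is rejected
print(parses(declared_latin1))  # declaration now matches the bytes
print(parses(reencoded_utf8))   # bytes now match the declaration
```

Either option makes the declared encoding and the actual bytes agree, which is what the parser needs.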

With UTF-8, a character can require between 1 and 4 bytes. A character in the ASCII 0-127 subset is a 1-byte character, so it works as expected. Characters with higher values take more than one byte. For example, an XML file that uses the £ symbol looks like this in an XML viewer:
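To make the 1-to-4-byte range concrete, here is a small sketch that encodes a few sample characters as UTF-8 and reports how many bytes each one needs. The sample characters and the helper name `utf8_hex` are chosen only for illustration.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of ch as space-separated hex (hypothetical helper)."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))


samples = {
    "letter A": "\u0041",        # ASCII, code point below 128
    "pound sign": "\u00a3",      # the character used in this article
    "euro sign": "\u20ac",
    "emoji": "\U0001F600",       # outside the Basic Multilingual Plane
}

for name, ch in samples.items():
    hex_bytes = utf8_hex(ch)
    print(f"{name}: {len(hex_bytes.split())} byte(s): {hex_bytes}")
```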

<?xml version="1.0" encoding="UTF-8"?>
<pound>£</pound>

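But if you open the file in a text editor that treats each byte as a separate character (for example, one that assumes ISO-8859-1), you will see the individual bytes of the file:

<?xml version="1.0" encoding="UTF-8"?>
<pound>Â£</pound>

Here the two characters Â and £ have the byte values 0xC2 and 0xA3, which together form the UTF-8 encoding of the single character £. For more on this, see Wikipedia.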
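The relationship between the single £ character and the byte values 0xC2 and 0xA3 can be checked with a short sketch. It assumes the text editor in question interprets the file as ISO-8859-1; other single-byte encodings may display the first byte differently.

```python
pound = "\u00a3"  # the pound sign

# Encoding the one character as UTF-8 yields two bytes.
utf8_bytes = pound.encode("utf-8")
print([hex(b) for b in utf8_bytes])

# Reading those two bytes as ISO-8859-1 yields two separate characters,
# which is what a single-byte text editor would display.
as_latin1 = utf8_bytes.decode("iso-8859-1")
print(len(as_latin1), [hex(ord(c)) for c in as_latin1])

# Decoding with the correct encoding recovers the original character.
round_trip = utf8_bytes.decode("utf-8")
print(round_trip == pound)
```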

Additional Info

Here is another useful reference on why 2-byte and sometimes 3-byte sequences (mostly for Asian characters) are needed in XML encoding:

XML Special Characters Encoding
