XML Character Encoding Validation

The character encoding of an XML file should be validated so that the text it contains can be displayed or read correctly by other systems, without any "broken character" problems. The ideal method for validating the character encoding of an XML file is a two-step process.

First, check that the XML file contains only byte sequences that are valid Unicode. This is fairly straightforward: libraries are available that check the Unicode encoding of any text file, and they can be applied to an XML file because XML is essentially text.
Second, after the XML file has been parsed, check that the text nodes and attribute values contain only valid Unicode characters. This can be done using regular expressions in XPath, Schematron, XQuery, Java, etc.

The XML Recommendation specifies that XML documents consist of characters from the Unicode character set. Some XML parsers reject an XML file that contains byte sequences that are not valid Unicode, but those that do often give cryptic error messages. The first validation step ensures that the XML file contains only valid Unicode byte sequences. By coding this validation yourself, you can be sure that it is actually performed and that it produces meaningful error messages.
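
As a rough illustration of what "coding this validation yourself" can mean, the sketch below checks a byte array with only the JDK, by asking a UTF-8 decoder to report malformed input instead of silently replacing it. This is not how the utf8-validator library used later in this document works, and the class and method names (StrictUtf8Check, isValidUtf8) are hypothetical. It only answers yes or no; it does not report where the error is.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library mentioned in this document.
final class StrictUtf8Check {

    /** Returns true if the bytes decode as UTF-8 without any malformed sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = "caf\u00E9".getBytes(StandardCharsets.UTF_8);
        // A lone 0xE9 is how Latin-1 encodes the same letter; it is not valid UTF-8.
        byte[] bad = {'c', 'a', 'f', (byte) 0xE9};
        System.out.println("good is valid UTF-8: " + isValidUtf8(good));
        System.out.println("bad is valid UTF-8:  " + isValidUtf8(bad));
    }
}
```

A yes/no answer is usually not enough for someone who has to fix the file, which is why the step 1 example further down collects byte offsets and prints the surrounding text.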

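The first step alone is not sufficient because of the entity expansion facility of XML: when the XML file is parsed into memory, every entity reference and character reference is replaced by the text it stands for. For example, &apos; is expanded to the character ', and &#xA7; is expanded to the character §. As a result, the parsed document can contain characters that never appear as raw bytes in the file, and a byte-level check cannot see them.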
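
The effect of expansion can be seen in a small sketch that uses the JDK's built-in DOM parser. The class and method names (ExpansionDemo, rootText) are hypothetical. The input below consists only of ASCII bytes, so any byte-level Unicode check accepts it, yet the parsed text begins with the control character U+0093.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical demo class for illustration only.
final class ExpansionDemo {

    /** Parses the XML and returns the text content of the root element. */
    static String rootText(String xml) {
        try {
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("XML could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        // Pure ASCII on disk; the character reference is expanded by the parser.
        String text = rootText("<p>&#x93;quoted&#x94;</p>");
        System.out.printf("first code point after parsing: U+%04X%n", text.codePointAt(0));
    }
}
```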

In addition, there are valid Unicode characters (code points) that might not be, and in fact are practically guaranteed not to be, read or displayed correctly by all systems. The Unicode Private Use Areas consist entirely of code points that are intended to be used in customized ways, and these code points might not be displayed on another system as intended by the system where the file was created. There are also control characters in Unicode that can be problematic. For instance, code points in the range U+007F–U+009F (decimal 127–159) are defined as control characters in Unicode, but many of the same values are printable characters in the Windows-1252 character set. This is frequently a problem when smart quotes are incorrectly transcribed from Microsoft programs into Unicode. For example, the left double quote character (“) is represented in Windows-1252 by the byte value 0x93, while in Unicode the same character is at code point U+201C and code point U+0093 is an undisplayable control character.
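
The smart quote case can be reproduced with a few lines of Java. The sketch below decodes the single byte 0x93 twice: once as Windows-1252, which is what the producing program meant, and once as ISO-8859-1, which maps each byte straight to the code point with the same number and so stands in for the incorrect transcription. The class and method names are hypothetical, and the sketch assumes the JDK in use ships the windows-1252 charset, which is not one of the charsets every Java platform is required to support.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
final class SmartQuoteDemo {

    /** Decodes one byte with the given charset and returns the resulting code point. */
    static int decodeSingleByte(int byteValue, Charset charset) {
        String decoded = new String(new byte[] {(byte) byteValue}, charset);
        return decoded.codePointAt(0);
    }

    public static void main(String[] args) {
        // Assumption: this charset name is available in the running JDK.
        Charset cp1252 = Charset.forName("windows-1252");

        int intended = decodeSingleByte(0x93, cp1252);
        int mistaken = decodeSingleByte(0x93, StandardCharsets.ISO_8859_1);

        System.out.printf("as Windows-1252: U+%04X%n", intended);
        System.out.printf("as ISO-8859-1:   U+%04X (control character: %b)%n",
                mistaken, Character.isISOControl(mistaken));
    }
}
```

Both results are valid Unicode, so only a check on specific code point ranges can flag the second one.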

The following are examples of step 1 and step 2 validation.

Step 1: Unicode byte sequence validation

The utf8-validator is a Java library provided by the UK National Archives that verifies whether the byte sequence of a File or InputStream is valid according to the rules of UTF-8 encoding. The utf8-validator is available on GitHub (https://github.com/digital-preservation/utf8-validator) and Maven Central.

The following Java class uses the utf8-validator library to validate an InputStream. When validation fails, it produces an error message that shows the invalid characters in context so that a person can find and fix them.

ValidateUTF8.java

package XMLValidateUnicode;

import org.apache.commons.io.IOUtils;
import uk.gov.nationalarchives.utf8.validator.Utf8Validator;
import uk.gov.nationalarchives.utf8.validator.ValidationException;
import uk.gov.nationalarchives.utf8.validator.ValidationHandler;

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.TreeMap;

public class ValidateUTF8 {

    public void process(InputStream is) throws Exception {
        // Buffer the stream first: the validator consumes it, and the same
        // bytes are needed again to show each error in context.
        byte[] bytes = IOUtils.toByteArray(is);
        Utf88ValidationHandler handler = new Utf88ValidationHandler();
        new Utf8Validator(handler).validate(new ByteArrayInputStream(bytes));
        if (handler.isErrored()) {
            throw new Exception(handler.toMessage(bytes));
        }
    }


    private class Utf88ValidationHandler implements ValidationHandler {

        private TreeMap<Long, String> errors = new TreeMap<>();

        @Override
        public void error(String message, long byteOffset) throws ValidationException {
            errors.put(byteOffset, message);
        }

        private boolean isErrored() {
            return !errors.isEmpty();
        }

        private String toMessage(byte[] bytes) {
            StringBuilder msg = new StringBuilder("Failed UTF-8 character encoding check. Please correct these errors.\n\n");

            int start, end, pos;
            int window = 10;

            for (Long position : errors.keySet()) {
                msg.append(errors.get(position));
                msg.append(" at byte ").append(position);

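                // Clamp to the buffer so that an offset reported at either end
                // of the input cannot produce an out-of-range index below.
                pos = Math.max(0, Math.min(Math.toIntExact(position - 1), bytes.length - 1));
                // Show up to `window` bytes on each side of the error.
                start = Math.max(0, pos - window);
                end = Math.min(bytes.length, pos + window);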

                msg.append(",\nwithin \"");
                msg.append(new String(Arrays.copyOfRange(bytes, start, end), StandardCharsets.UTF_8).replaceAll("[\r\n]",""));
                msg.append("\"");

                for (start = pos; start > 0; start--) {
                    if (bytes[start] == 10) {
                        start++;
                        break;
                    }
                }
                end = (start >= bytes.length - 20) ? bytes.length : start + 20;
                msg.append("\non line beginning with \"");
                msg.append(new String(Arrays.copyOfRange(bytes, start, end), StandardCharsets.UTF_8).replaceAll("[\r\n]",""));
                msg.append("\"\n\n");
            }

            return msg.toString();
        }
    }
}

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>XMLValidateUnicode</groupId>
    <artifactId>XMLValidateUnicode</artifactId>
    <version>1.0-SNAPSHOT</version>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>8</source>
                    <target>8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
    <packaging>jar</packaging>

    <properties>
        <java.version>1.8</java.version>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
    </properties>

    <dependencies>
        <dependency>
            <groupId>uk.gov.nationalarchives</groupId>
            <artifactId>utf8-validator</artifactId>
            <version>1.2</version>
        </dependency>
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.6</version>
        </dependency>
    </dependencies>

</project>

Step 2: Check for invalid characters

After the XML has been parsed, check that no text node or attribute value contains a code point in the following ranges:

Code points that frequently appear in Windows-1252 and are different in Unicode: U+007F–U+009F (decimal 127-159)
Private Use Areas: U+E000–U+F8FF, U+F0000–U+FFFFD, U+100000–U+10FFFD

This can be validated by using XPath to select all text nodes //text() and all attribute nodes //@*, and applying a regular expression match.
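
A minimal sketch of this check in Java, using only the JDK's DOM, XPath and regular expression support, might look like the following. The class and method names (SuspectCharacterCheck, findSuspect, describe) are hypothetical, and the parser is used with its default settings, which may not be appropriate for untrusted input. The pattern is written with \x{...} code point escapes so that the supplementary Private Use planes are matched as single characters.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration only.
final class SuspectCharacterCheck {

    // U+007F-U+009F plus the three Private Use Areas listed above.
    private static final Pattern SUSPECT = Pattern.compile(
            "[\\x{7F}-\\x{9F}\\x{E000}-\\x{F8FF}\\x{F0000}-\\x{FFFFD}\\x{100000}-\\x{10FFFD}]");

    /** Parses the XML and returns one finding per suspect code point. */
    static List<String> findSuspect(byte[] xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml));
            // Select every text node and every attribute node in the document.
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//text() | //@*", doc, XPathConstants.NODESET);

            List<String> findings = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Node node = nodes.item(i);
                Matcher m = SUSPECT.matcher(node.getNodeValue());
                while (m.find()) {
                    findings.add(String.format("U+%04X in %s",
                            m.group().codePointAt(0), describe(node)));
                }
            }
            return findings;
        } catch (ParserConfigurationException | SAXException
                | IOException | XPathExpressionException e) {
            throw new IllegalArgumentException("XML could not be checked", e);
        }
    }

    /** Gives a short, human-readable location for a text or attribute node. */
    private static String describe(Node node) {
        if (node.getNodeType() == Node.ATTRIBUTE_NODE) {
            return "attribute " + node.getNodeName();
        }
        Node parent = node.getParentNode();
        return "text of <" + (parent == null ? "?" : parent.getNodeName()) + ">";
    }
}
```

Because the check runs on the parsed document, it also catches characters that were written as character references, which the byte-level check in step 1 cannot see.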

vincentml/XmlCharacterEncodingValidation.md