GitHub   @OpenKivaKit Twitter Zulip Chat RSS Feed Java Code Geeks Mailing List

State(Art)

Jonathan's thoughts on the state of the art in software design.

2021.11.12

KivaKit XML Streaming  

Since Java 1.6 in 2006, Java has had a built-in XML streaming API in the package javax.xml.stream. This API is known as StAX (Streaming API for XML), and it is a very efficient “pull parser”, allowing clients to iterate through the sequence of elements in an XML document. Other approaches to working with XML are event-handling “push parsers”, and full-blown, in-memory DOMs (Document Object Models). Although StAX is convenient and very fast, it can be significantly harder to work with than a DOM, because the hierarchy of the document that’s being streamed is lost. Our code sees just one element at a time.

KivaKit’s New XML Mini-framework

KivaKit 1.1 quietly added a small, but useful mini-framework to the kivakit-extensions repository called kivakit-data-formats-xml. The project contains just two simple classes: StaxReader and StaxPath. The StaxReader class adds a layer of convenience to the Java StAX API by making it easy to:

Moving Through an XML Stream

The static StaxReader.open(Resource) method is used to start reading an XML stream. The method either returns a valid StaxReader that’s ready to go, or it throws an exception. Since StaxReader implements Closeable, it can be used within a try-with-resources statement:

try (var reader = StaxReader.read(file))
{
    [...]
}

Within our try-with-resources block, we can advance through the stream with these methods:

When we reach the end of the stream, hasNext() will return false. So, processing an XML file looks like this:

try (var reader = StaxReader.read(file))
{
    for (; reader.hasNext(); reader.next())
    {
        var element = reader.at();
        
        [...]
        
    }        
}

As we stream our way through an XML document, a few simple methods can help us to identify what kind of tag the reader is currently at:

Streaming Through an XML Hierarchy

Although the underlying StAX API can only move through a document in sequential order, StaxReader adds functionality that allows us to determine where we are in the document hierarchy as we move along. Using the hierarchical path to our current position the stream, we can search for specific elements in the nested document structure, and we can process data when we reach those elements.

Okay. Let’s make this concrete. Here is a simple document:

<a>   <---- The path here is a
    <b>   <---- The path here is a/b
        <c>   <---- The path here is a/b/c
        </c>
    </b>
</a>

The StaxPath class represents hierarchical paths in our XML document. As seen above, the path at <a> is a. The path at <b> in our document is a/b and the path at <c> is a/b/c.

As we read elements from the XML stream, StaxReader tracks the current path using a stack of elements. When the reader encounters an open tag, it pushes the name of the tag onto the end of the current path. When it encounters a close tag, it pops the last element off the end of the path. So, the sequence of steps as StaxReader streams through our document is:

Step  Element     Action    StaxPath
1.    <a>         push a    a
2.      <b>       push b    a/b
3.        <c>     push c    a/b/c
4.        </c>    pop       a/b
5.      </b>      pop       a
6.    </a>        pop       

The current StaxPath for a StaxReader can be obtained by calling StaxReader.path().

Finding Elements in the Document Hierarchy

The following methods test the current path of our StaxReader reader against a given path. The reader is considered at a given path if the the reader’s path is equal to the given path. For example, if the reader is at the path a/b/c and the given path is a/b/c, the reader is at the given path. The reader is inside the given path if the given path is a prefix of the reader’s current path. For example, if the reader is at a/b/c/d and the path is a/b/c, then the reader is inside the given path. Finally, the reader is outside the given path in the reverse situation, where the reader is at a/b and the path is a/b/c.

Putting it All Together: Fiasco

And now, the reason for doing all of this. I am working on a build tool for Java projects called Fiasco (named for Stanislaw Lem’s 1986 science fiction novel, Fiasko).

Fiasco is in the design stages still. When it is completed, it will be a pure-Java build tool with these design goals, and non-goals:

Goals
Non-Goals
How Fiasco Reads POM Files with StaxReader

To build a Java project, Fiasco needs to access artifacts in Maven repositories, like Maven Central. To be able to do this, it is necessary to parse Maven pom.xml files, and since Fiasco will be parsing a lot of these files, it is desirable to do it fairly efficiently. The PomReader class demonstrates how StaxReader can be used to parse a complex, hierarchical XML file like a Maven POM file. The relevant details are found in the method read(MavenRepository, Resource), which opens a pom.xml resource, parses it and returns a Pom model:

public class PomReader extends BaseComponent
{
    StaxPath PROPERTIES_PATH = StaxPath.parseXmlPath("project/properties");
    StaxPath DEPENDENCY_PATH = StaxPath.parseXmlPath("project/dependencies/dependency");

    [...]
    
    Pom read(MavenRepository repository, Resource resource)
    {
        [...]
        
        try (var reader = StaxReader.open(resource))
        {
            var pom = new Pom(resource);

            for (reader.next(); reader.hasNext(); reader.next())
            {
                [...]
                
                if (reader.isAt(PROPERTIES_PATH))
                {
                    pom.properties = readProperties(reader);
                }

                if (reader.isAt(DEPENDENCY_PATH))
                {
                    pom.dependencies.add(readDependency(reader));
                }
                
                [...]
             }
         }
         
     [...]        

 }

The code here that opens our pom.xml resource and moves through the elements of the XML document is essentially the same as we saw earlier. Within the for loop, is where we deal with the POM hierarchy. The StaxReader.isAt(StaxPath) method is used to determine when the reader lands on the open tag for the given path. When PROPERTIES_PATH (project/properties) is reached, we call a method that reads the properties nested within the <properties> open tag (see readProperties(StaxReader) for the gory details). For each DEPENDENCY_PATH (project/dependencies/dependency) we reach (there may be many of them), we read the <dependency> with logic similar to this:

Dependency readDependency(StaxReader reader)
{
    MavenArtifactGroup artifactGroup = null;
    String artifactIdentifier = null;
    String version = null;
    var scope = DEPENDENCY_PATH;

    // Skip past the <dependency> open tag we landed on,
    reader.next();

    // and while we're not outside the <dependency> tag scope,
    for (; !reader.isOutside(scope); reader.next())
    {
        // populate any group id,
        if (reader.isAt(scope.withChild("groupId")))
        {
            artifactGroup = MavenArtifactGroup.parse(this, reader.enclosedText());
        }

        // any artifact id,
        if (reader.isAt(scope.withChild("artifactId")))
        {
            artifactIdentifier = reader.enclosedText();
        }

        // and any version.
        if (reader.isAt(scope.withChild("version")))
        {
            version = reader.enclosedText();
        }
    }

So, there we have it. It is true that our code here is not as succinct as it would be with a DOM model. However, we can parse POM files well enough with StaxReader, and we save the time and memory required by a full, in-memory DOM model.

Code

The code discussed above is available on GitHub:

kivakit-data-formats-xml
Fiasco (GitHub)
PomReader.java

The KivaKit XML API is available on Maven Central at these coordinates:

<dependency>
    <groupId>com.telenav.kivakit</groupId>
    <artifactId>kivakit-data-formats-xml</artifactId>
    <version>${kivakit.version}</version>
</dependency>


Questions? Comments? Tweet yours to @OpenKivaKit or post here:


Copyright © 2021 Jonathan Locke