Getting Started with XML and Java

What is XML?

XML is a simplified subset of SGML (the comprehensive but difficult markup standard of which HTML is one application) that lets you define and manipulate your own tags. With HTML you are stuck with its fixed set of tags, but browsers know how to display them (it's all standardized now). With XML you can define your own tags, but you need to provide a "browser" that can interpret the tags and take appropriate action. Here's an example of an XML document (in a file called students.xml, say):

    <?xml version="1.0"?>
    <DOCUMENT>
       <STUDENT>
         <NAME>
	   Anorexic Al
	 </NAME>
	 <ADDR>
	   10 Slimway Rd
	 </ADDR>
	 <GRADE>
	   A
	 </GRADE>
       </STUDENT>
       <STUDENT>
         <NAME>
	   Bulimic Bill
	 </NAME>
	 <ADDR>
	   123 Upchuck Drive
	 </ADDR>
	 <GRADE>
	   B+
	 </GRADE>
       </STUDENT>
       <STUDENT>
         <NAME>
	   Cadaverous Chen
	 </NAME>
	 <ADDR>
	   14 Mordant St.
	 </ADDR>
	 <GRADE>
	   B-
	 </GRADE>
       </STUDENT>
    </DOCUMENT>

You see that this is really a tiny database with three records, each of which is "self-describing". This is one current use of XML. Another use is to define tags like <audio></audio> and include actual audio between the tags; this is the idea of creating complex documents.

What does a browser do with the above student-database file? Standard HTML browsers will ignore the new tags. It is you, the creator of a particular set of tags, who needs to provide a "browser" for the tags you create. Such a browser will parse the file, extract the content and "do" whatever needs to be done. These actions might include display in a window, or might have nothing to do with displaying at all (such as placing the database entries above in a regular relational database).

How does one write a program to read in an XML file and extract information? Fortunately, this task is made easier by available tools. One such fundamental tool is an XML parser. There are two kinds of parsers:

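A non-validating parser: a parser that checks only that the document is well-formed (tags properly nested and closed); it does not check the tags against a declared structure.
A validating parser: a parser that also checks that the document obeys the structure declared for it (the DTD, described below).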
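
The difference between "well-formed" and "valid" can be seen with a small sketch. It uses the parser that ships with the JDK (the javax.xml.parsers package) rather than any particular downloadable parser; the class name WellFormedCheck and the sample strings are made up for illustration. Even with validation switched off, the parser rejects a document whose tags do not match:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {

  // Returns true if the text parses as well-formed XML.
  static boolean isWellFormed (String xml)
  {
    DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
    f.setValidating (false);   // Non-validating: no DTD checks.
    try {
      DocumentBuilder b = f.newDocumentBuilder();
      // Replace the default error handler, which prints to stderr.
      b.setErrorHandler (new DefaultHandler());
      b.parse (new InputSource (new StringReader (xml)));
      return true;
    }
    catch (SAXException e) {
      // Fatal errors such as mismatched tags end up here.
      return false;
    }
    catch (ParserConfigurationException | IOException e) {
      throw new IllegalStateException (e);
    }
  }

  public static void main (String[] argv)
  {
    System.out.println (isWellFormed ("<NAME>Al</NAME>"));   // matching tags
    System.out.println (isWellFormed ("<NAME>Al</ADDR>"));   // mismatched tags
  }
}
```

A validating parser performs this same check and, in addition, compares the tags against the declared structure.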

These parsers usually operate in two ways:

Event-based: these parsers will call particular functions that you name whenever tags are encountered. Thus, your favorite method can be called by the parser every time a start-tag or end-tag is encountered. This gives you control during the scan of the file.
Tree-based: these parsers give you the whole tree data structure containing the file.

The tree-based method is usually preferred although more cumbersome to use. Typically, good parsing packages will provide both types of parsing.
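
To make the two styles concrete, here is a minimal sketch that counts STUDENT tags both ways. It assumes the parsers bundled with the JDK (javax.xml.parsers); the class name TwoStyles, the method names and the sample string are illustrative only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class TwoStyles {

  static final String XML =
    "<DOCUMENT><STUDENT><NAME>A</NAME></STUDENT>" +
    "<STUDENT><NAME>B</NAME></STUDENT></DOCUMENT>";

  // Event-based: the parser calls startElement() for every start-tag.
  static int countByEvents (String xml, final String tag)
  {
    final int[] count = { 0 };
    DefaultHandler h = new DefaultHandler() {
      @Override
      public void startElement (String uri, String local,
                                String qName, Attributes atts)
      {
        if (qName.equals (tag)) count[0]++;
      }
    };
    try {
      SAXParserFactory.newInstance().newSAXParser()
        .parse (new InputSource (new StringReader (xml)), h);
    }
    catch (Exception e) {
      throw new IllegalStateException (e);
    }
    return count[0];
  }

  // Tree-based: the parser hands back the whole document as a tree.
  static int countByTree (String xml, String tag)
  {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse (new InputSource (new StringReader (xml)));
      return doc.getElementsByTagName (tag).getLength();
    }
    catch (Exception e) {
      throw new IllegalStateException (e);
    }
  }

  public static void main (String[] argv)
  {
    System.out.println (countByEvents (XML, "STUDENT"));
    System.out.println (countByTree (XML, "STUDENT"));
  }
}
```

The event-based version never holds more than the current tag in memory, whereas the tree-based version can be queried repeatedly once parsing is done.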

DTD's.

How does the parser know what to expect in a file? The answer is: the file usually contains "type" information, called a DTD (Document Type Definition), detailing the structure and names of the tags. For simplicity we did not show the DTD for the file above. Usually, the DTD occurs before the data. It is a set of declarations that "explain" the tags and structure. Here's the complete file students.xml:

    <?xml version="1.0"?>

    <!-- Comments are the same as in HTML -->
    
    <!-- This is the DTD part -->
    <!DOCTYPE DOCUMENT [
      <!ELEMENT DOCUMENT (STUDENT)*>
      <!ELEMENT STUDENT (NAME, ADDR, GRADE)>
      <!ELEMENT NAME (#PCDATA)>
      <!ELEMENT ADDR (#PCDATA)>
      <!ELEMENT GRADE (#PCDATA)>
    ]>

    <DOCUMENT>
       <STUDENT>
         <NAME>
	   Anorexic Al
	 </NAME>
	 <ADDR>
	   10 Slimway Rd
	 </ADDR>
	 <GRADE>
	   A
	 </GRADE>
       </STUDENT>
       <STUDENT>
         <NAME>
	   Bulimic Bill
	 </NAME>
	 <ADDR>
	   123 Upchuck Drive
	 </ADDR>
	 <GRADE>
	   B+
	 </GRADE>
       </STUDENT>
       <STUDENT>
         <NAME>
	   Cadaverous Chen
	 </NAME>
	 <ADDR>
	   14 Mordant St.
	 </ADDR>
	 <GRADE>
	   B-
	 </GRADE>
       </STUDENT>
    </DOCUMENT>

Thus, the DTD part of the document is really:

    <!DOCTYPE DOCUMENT [
      <!ELEMENT DOCUMENT (STUDENT)*>
      <!ELEMENT STUDENT (NAME, ADDR, GRADE)>
      <!ELEMENT NAME (#PCDATA)>
      <!ELEMENT ADDR (#PCDATA)>
      <!ELEMENT GRADE (#PCDATA)>
    ]>

Notice the strange syntax (the actual information is inside a pair of matching square brackets). It starts by saying that the top-level structure is an "element" called a DOCUMENT.

NOTE: Unlike HTML, XML is case-sensitive so "STUDENT" is different from "Student", which is different from "student".

Here, the DTD is saying "DOCUMENT will contain a substructure called STUDENT, and STUDENT will contain NAME, ADDR and GRADE." Each substructure is assumed to be defined by begin-end tags of that substructure's name. Thus, the STUDENT stuff will be everything that occurs within a <STUDENT> and a </STUDENT>.

Notice that there is a * after (STUDENT). This indicates that the item may occur any number of times: zero, one or many.

Finally, each of the items NAME, ADDR and GRADE contains something mysterious called #PCDATA ("parsed character data"). This is just an ugly way of saying that there is no further subdivision and that only plain character data will be contained in those items.

There is more to DTD's, but not a whole lot more. For example, other features allow you to point (using a URL) across the web to an external DTD instead of including the DTD in the file itself.
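
As a sketch of what a validating parser does with this DTD, the code below checks three small documents against it. It assumes the JDK's built-in parser with validation turned on; the class name DtdCheck and the test documents are invented for the illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class DtdCheck {

  // The DTD from the text, placed in front of each test document.
  static final String DTD =
    "<!DOCTYPE DOCUMENT [\n" +
    "  <!ELEMENT DOCUMENT (STUDENT)*>\n" +
    "  <!ELEMENT STUDENT (NAME, ADDR, GRADE)>\n" +
    "  <!ELEMENT NAME (#PCDATA)>\n" +
    "  <!ELEMENT ADDR (#PCDATA)>\n" +
    "  <!ELEMENT GRADE (#PCDATA)>\n" +
    "]>\n";

  // Returns true if the body obeys the DTD above.
  static boolean isValid (String body)
  {
    DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
    f.setValidating (true);
    final boolean[] ok = { true };
    try {
      DocumentBuilder b = f.newDocumentBuilder();
      // Validity problems are reported through error().
      b.setErrorHandler (new DefaultHandler() {
        @Override
        public void error (SAXParseException e) { ok[0] = false; }
      });
      b.parse (new InputSource (new StringReader (DTD + body)));
    }
    catch (SAXException e) {
      return false;   // Not even well-formed.
    }
    catch (Exception e) {
      throw new IllegalStateException (e);
    }
    return ok[0];
  }

  public static void main (String[] argv)
  {
    // Zero students: allowed by the * in (STUDENT)*.
    System.out.println (isValid ("<DOCUMENT></DOCUMENT>"));
    // One complete student.
    System.out.println (isValid (
      "<DOCUMENT><STUDENT><NAME>A</NAME><ADDR>B</ADDR>" +
      "<GRADE>C</GRADE></STUDENT></DOCUMENT>"));
    // ADDR is missing, so the STUDENT content model is violated.
    System.out.println (isValid (
      "<DOCUMENT><STUDENT><NAME>A</NAME>" +
      "<GRADE>C</GRADE></STUDENT></DOCUMENT>"));
  }
}
```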

Attributes.

XML allows you to define attributes for tags, as in:


       <STUDENT LEVEL="graduate" STATUS="alum">
         <NAME>
	   Cadaverous Chen
	 </NAME>
	 <ADDR>
	   14 Mordant St.
	 </ADDR>
	 <GRADE>
	   B-
	 </GRADE>
       </STUDENT>

You can set up as many attributes as you like, and retrieve the values during parsing. Of course, the DTD must specify the attributes:

    <!DOCTYPE DOCUMENT [
      <!ELEMENT DOCUMENT (STUDENT)*>
      <!ELEMENT STUDENT (NAME, ADDR, GRADE)>
      <!ATTLIST STUDENT
        LEVEL CDATA #REQUIRED
        STATUS CDATA #IMPLIED
      >
      <!ELEMENT NAME (#PCDATA)>
      <!ELEMENT ADDR (#PCDATA)>
      <!ELEMENT GRADE (#PCDATA)>
    ]>

Here, there are two attributes for element STUDENT: LEVEL is always required (#REQUIRED), whereas STATUS is optional (#IMPLIED). Both are of type CDATA (character data).
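
Retrieving attribute values during parsing might look like the following sketch. It uses the standard org.w3c.dom interfaces with the JDK's built-in parser; the class name AttrDemo and the helper root() are invented for the illustration. Note what happens for an optional attribute that is absent:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class AttrDemo {

  // A STUDENT with LEVEL given and the optional STATUS left out.
  static final String XML =
    "<STUDENT LEVEL=\"graduate\"><NAME>Cadaverous Chen</NAME></STUDENT>";

  // Parse the text and return its root element.
  static Element root (String xml)
  {
    try {
      return DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse (new InputSource (new StringReader (xml)))
        .getDocumentElement();
    }
    catch (Exception e) {
      throw new IllegalStateException (e);
    }
  }

  public static void main (String[] argv)
  {
    Element s = root (XML);
    // The value of an attribute that is present.
    System.out.println (s.getAttribute ("LEVEL"));
    // Test for presence before relying on an optional attribute.
    System.out.println (s.hasAttribute ("STATUS"));
    // getAttribute() gives an empty string, not null, when absent.
    System.out.println (s.getAttribute ("STATUS").length());
  }
}
```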

Advanced XML.

We have only cursorily covered XML so far. XML has many other features, including:

Style sheets: common style parameters that apply to groups of tags. (CSS, DSSSL and XSL are examples.)

XLink and XPointer: more powerful forms of links.

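NOTE: HTML browsers collapse whitespace when displaying a page. An XML parser, by contrast, passes the whitespace found between tags on to your program. Some parsers will let you discard whitespace that carries no information. Others will ignore whitespace in the processing of tags but leave the whitespace in the data intact.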
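
A quick way to see how a particular parser treats whitespace is to count the children of an element with and without line breaks between the tags. The sketch below does this with the JDK's built-in tree-based parser in its default configuration (an assumption; other parsers and settings may behave differently). The class name WhitespaceDemo is illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class WhitespaceDemo {

  // Number of direct children of the root element, text nodes included.
  static int rootChildCount (String xml)
  {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse (new InputSource (new StringReader (xml)));
      NodeList nl = doc.getDocumentElement().getChildNodes();
      return nl.getLength();
    }
    catch (Exception e) {
      throw new IllegalStateException (e);
    }
  }

  public static void main (String[] argv)
  {
    // No whitespace between tags: the only child is <B/>.
    System.out.println (rootChildCount ("<A><B/></A>"));
    // Line breaks and indentation become text nodes around <B/>.
    System.out.println (rootChildCount ("<A>\n  <B/>\n</A>"));
  }
}
```

With this parser the first call reports one child and the second reports three, the two extra children being whitespace-only text nodes.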

XML and Java: overview

Since XML has nothing to do with programming languages, any programming language can be used to build a parser and then, an application on top of the parser. Currently, XML parsers are available in C/C++, VB and of course, Java.

An XML parser in Java allows a Java programmer to include a parser package and write applications to extract data from XML files.

Several free and commercial parsers for XML are now available in Java, see for example, the Java parsers at

http://www.alphaworks.ibm.com

http://www.jclark.com/xml

The W3C, which defined the XML standard, also publishes a set of Java interfaces (the org.w3c.dom package) for a tree-based parser to implement. This means that if you write code with these interfaces, then you can switch among parsers that follow the interface. The IBM product, for example, follows the W3C interfaces. This means that you usually import both the parser and the W3C package that has the interface definitions.

There are two sets of interfaces that are most relevant to Java parsers (and programmers):

The SAX (Simple API for XML) API. This set of Java interfaces supports the "event-based" style of parsing. That is, you get control as and when tags are discovered. SAX is not a W3C standard; it was developed by members of the XML-DEV mailing list.
The W3C DOM (Document Object Model) API. This set of Java interfaces supports the "tree-based" style of parsing. There are methods for traversing a tree and extracting/inserting information.

XML and Java: An example

We will parse the file student.xml using IBM's XML parser. After downloading the parser and setting up the CLASSPATH variable to access the jar file in the package, we will be ready to use their code.

The file student.xml is:

<?xml version="1.0"?>
  <!-- Comments are the same as in HTML -->
    
<!-- This is the DTD part -->
<!DOCTYPE DOCUMENT [
  <!ELEMENT DOCUMENT (STUDENT)*>
  <!ELEMENT STUDENT (NAME, ADDR, GRADE)>
  <!ELEMENT NAME (#PCDATA)>
  <!ELEMENT ADDR (#PCDATA)>
  <!ELEMENT GRADE (#PCDATA)>
]>

    <DOCUMENT>
       <STUDENT>
         <NAME>
	   Anorexic Al</NAME> <ADDR>
	   10 Slimway Rd
	 </ADDR>
	 <GRADE>
	   A
	 </GRADE>
       </STUDENT>
       <STUDENT>
         <NAME>
	   Bulimic Bill
	 </NAME>
	 <ADDR>
	   123 Upchuck Drive
	 </ADDR>
	 <GRADE>
	   B+
	 </GRADE>
       </STUDENT>
       <STUDENT>
         <NAME>
	   Cadaverous Chen
	 </NAME>
	 <ADDR>
	   14 Mordant St.
	 </ADDR>
	 <GRADE>
	   B-
	 </GRADE>
       </STUDENT>
    </DOCUMENT>

Here is a sample program that performs a simple traversal:

import com.ibm.xml.parser.*;   // This is the import required for the parser.
import org.w3c.dom.*;          // A list of "standard" interfaces.

import java.io.*;

public class xmltest {

  // Traverse a subtree whose root is the parameter.

  public static void traverse (Node n)
  {
    // Extract node info:
    String nodename = n.getNodeName();
    String valueStr = n.getNodeValue();
    // Print and continue traversing.
    System.out.println ("Node: " + nodename + " value=[" + valueStr + "]");

    // Now traverse the rest of the tree in depth-first order.
    if (n.hasChildNodes()) {
      // Get the children in a list.
      NodeList nl = n.getChildNodes();
      // How many of them?
      int size = nl.getLength();
      for (int i=0; i<size; i++)
        // Recursively traverse each of the children.
	traverse (nl.item(i));
    }
  }

  public static void main (String[] argv)
  {
    // Pick off the XML file name from the command-line arguments.
    if (argv.length > 0) {
      try {
        // Get the XML file name, e.g., "student.xml".
	FileReader fr = new FileReader (argv[0]);

        // Create a parser instance.
	Parser p = new Parser (argv[0]);

        // These two will decide whether the parser will
        // retain whitespace and comments from the original file.
	p.setKeepComment (false);
	p.setPreserveSpace (false);

        // Now parse and create a document tree as a result.
        // Note that we are passing a FileReader wrapped around
        // the file.
	TXDocument doc = p.readStream(fr);

        // Once we have the tree, extract the root.
	Element root = doc.getDocumentElement ();

        // Now traverse the tree.
	System.out.println ("TRAVERSE ...");
	traverse (root);

        // Alternatively, here is a scan by TAG name for
        // the tag "NAME"
        System.out.println ("BY THE TAG 'NAME': ");
	NodeList nl = doc.getElementsByTagName ("NAME");
	int size = nl.getLength();
	for (int i=0; i<size; i++)
	  traverse (nl.item(i));
      }
      catch (IOException e) {
        // The file could not be opened or read.
        System.out.println ("Could not read " + argv[0] + ": " + e.getMessage());
      }
    }
    else {
	System.out.println ("Usage: java xmltest <filename>");
    }
  }
}

This produces the output:

TRAVERSE ...
Node: DOCUMENT value=[null]
Node: #text value=[
       ]
Node: STUDENT value=[null]
Node: #text value=[
         ]
Node: NAME value=[null]
Node: #text value=[
	   Anorexic Al]
Node: #text value=[ ]
Node: ADDR value=[null]
Node: #text value=[
	   10 Slimway Rd
	 ]
Node: #text value=[
	 ]
Node: GRADE value=[null]
Node: #text value=[
	   A
	 ]
Node: #text value=[
       ]
Node: #text value=[
       ]
Node: STUDENT value=[null]
Node: #text value=[
         ]
Node: NAME value=[null]
Node: #text value=[
	   Bulimic Bill
	 ]
Node: #text value=[
	 ]
Node: ADDR value=[null]
Node: #text value=[
	   123 Upchuck Drive
	 ]
Node: #text value=[
	 ]
Node: GRADE value=[null]
Node: #text value=[
	   B+
	 ]
Node: #text value=[
       ]
Node: #text value=[
       ]
Node: STUDENT value=[null]
Node: #text value=[
         ]
Node: NAME value=[null]
Node: #text value=[
	   Cadaverous Chen
	 ]
Node: #text value=[
	 ]
Node: ADDR value=[null]
Node: #text value=[
	   14 Mordant St.
	 ]
Node: #text value=[
	 ]
Node: GRADE value=[null]
Node: #text value=[
	   B-
	 ]
Node: #text value=[
       ]
Node: #text value=[
    ]
BY THE TAG 'NAME': 
Node: NAME value=[null]
Node: #text value=[
	   Anorexic Al]
Node: NAME value=[null]
Node: #text value=[
	   Bulimic Bill
	 ]
Node: NAME value=[null]
Node: #text value=[
	   Cadaverous Chen
	 ]

Observe:

Tree nodes are either tag-nodes or "#text" nodes containing actual data strings.
For a tag-node, which holds no data of its own, the value string is null.
When the node is of "#text" type, it contains the actual string.

The string in this case is the complete string including whitespace. To remove whitespace, we can call the trim() method of String:

import com.ibm.xml.parser.*;
import org.w3c.dom.*;
import java.io.*;

public class xmltest {

  // ...

  public static void traverse (Node n)
  {
    // ...

    String valueStr = n.getNodeValue();
    // Remove whitespace.
    if (valueStr != null)
      valueStr = valueStr.trim();

    // Print and continue traversing.
    System.out.println ("Node: " + n.getNodeName() + " value=[" + valueStr + "]");

    // ...
  }

  // ...

}

In this case the output looks like:

TRAVERSE ...
Node: DOCUMENT value=[null]
Node: #text value=[]
Node: STUDENT value=[null]
Node: #text value=[]
Node: NAME value=[null]
Node: #text value=[Anorexic Al]
Node: #text value=[]
Node: ADDR value=[null]
Node: #text value=[10 Slimway Rd]
Node: #text value=[]
Node: GRADE value=[null]
Node: #text value=[A]
Node: #text value=[]
Node: #text value=[]
...

Note:

The tags themselves have child "#text" nodes that have no content. Clearly, we would like to delete these from the tree.
If an invalid XML tree is given to the parser, we would like to trap an exception or "know" when and where the error occurs.
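
One possible way to delete the empty "#text" nodes, using only the standard org.w3c.dom interfaces, is sketched below. It is an illustration rather than the parser vendor's recommended method: the class name CleanSketch and the helper parse() are invented, and the JDK's built-in parser stands in for the IBM one. The children are visited from last to first so that removing one does not shift the positions of those still to be visited.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class CleanSketch {

  // Remove whitespace-only text nodes below n; trim the others.
  static void clean (Node n)
  {
    NodeList nl = n.getChildNodes();
    // Walk backwards: a removal only affects positions already visited.
    for (int i = nl.getLength() - 1; i >= 0; i--) {
      Node c = nl.item (i);
      if (c.getNodeType() == Node.TEXT_NODE) {
        String v = c.getNodeValue().trim();
        if (v.length() == 0)
          n.removeChild (c);      // Nothing but whitespace.
        else
          c.setNodeValue (v);     // Keep the trimmed data.
      }
      else {
        clean (c);                // Recurse into tag-nodes.
      }
    }
  }

  // Parse XML text into a tree.
  static Document parse (String xml)
  {
    try {
      return DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse (new InputSource (new StringReader (xml)));
    }
    catch (Exception e) {
      throw new IllegalStateException (e);
    }
  }

  public static void main (String[] argv)
  {
    Document doc = parse ("<STUDENT>\n  <NAME>\n   Al\n  </NAME>\n</STUDENT>");
    clean (doc.getDocumentElement());
    Node name = doc.getDocumentElement().getFirstChild();
    System.out.println (name.getNodeName() + "=[" +
                        name.getFirstChild().getNodeValue() + "]");
  }
}
```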

A second example

We will enhance our example with three additional features:

First, we will add some error-handling capability.

Then, we will "clean" the tree to remove useless nodes.

Lastly, we will write code to add a new "student" record to the tree.

Here is the program:

import com.ibm.xml.parser.*;
import org.w3c.dom.*;
import java.io.*;


// An error-handling class passed to the parser that
// overrides the error method.

class MyErrorHandler implements ErrorListener {
  public int error (String fname, int lineno, int charoff,
		    Object key, String mes)
  {
    System.out.println ("ERR:" + fname + " at line " + lineno +
			": " + mes);
    return 1;
  }
}


public class xmltest2 {

  // The method clean removes all nodes with "junk" data.

  public static void clean (Node n)
  {
    boolean remove = false;

    // Extract the nodename.
    String nodename = n.getNodeName();

    if (nodename != null) {
      // Removing is a possibility only if the node is of type "#text".
      if (nodename.equals ("#text"))
	remove = true;
      else
	remove = false;
    }

    // Extract the node value.
    String valueStr = n.getNodeValue();
    int len = 0;

    // Now trim and check if it needs to be removed.
    if (valueStr != null) {
      valueStr = valueStr.trim();
      len = valueStr.length();
      if (len > 0) {
        // Useful
	remove = false;
        // Put the trimmed string back in.
        n.setNodeValue (valueStr);
      }
    }

    if (remove) {
      // To remove, go to the parent node and delete from there.
      Node p = n.getParentNode();
      p.removeChild (n);
      // Recursively clean.
      clean (p);
      return;
    }

    // Continue exploration.
    if (n.hasChildNodes()) {
      NodeList nl = n.getChildNodes();
      int size = nl.getLength();
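      for (int i=0; i<size; i++) {
        // item() returns null if an earlier removal shortened the list.
        Node c = nl.item(i);
        if (c != null)
          clean (c);
      }
    }
  }

  // Traverse a subtree whose root is the parameter (as in the first example).

  public static void traverse (Node n)
  {
    System.out.println ("Node: " + n.getNodeName() + " value=[" + n.getNodeValue() + "]");
    if (n.hasChildNodes()) {
      NodeList nl = n.getChildNodes();
      int size = nl.getLength();
      for (int i=0; i<size; i++)
        traverse (nl.item(i));
    }
  }

  // ... (makeStudent (Node, String, String, String), which builds the
  // new STUDENT subtree used in main below, is not shown)

  public static void main (String[] argv)
  {
    if (argv.length > 0) {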
      try {
	FileReader fr = new FileReader (argv[0]);

	// Pass errorhandler to parser.
	Parser p = new Parser (argv[0], new MyErrorHandler(), null);
        // The next few steps are straightforward:
	p.setKeepComment (false);
	p.setPreserveSpace (false);
	TXDocument doc = p.readStream(fr);
	Element root = doc.getDocumentElement ();

        // Always a good check:
	if (root == null) return;

        // First clean:
	System.out.println ("CLEAN ...");
	clean (root);

        // Then, traverse:
	System.out.println ("TRAVERSE ...");
	traverse (root);

        // Add a new "student" to the tree.
	System.out.println ("ADDING A NEW STUDENT: ");
	NodeList nl = doc.getElementsByTagName ("STUDENT");
	Node n = nl.item(1);
	Node s = makeStudent (n, "Dangerous Dave", "333 Ouch St", "C");
	Node parent = n.getParentNode();
        // Note the use of appendChild():
	parent.appendChild (s);

        // OK, traverse again to see what we've done.
	System.out.println ("TRAVERSE ...");
	traverse (root);
      }
      catch (IOException e) {
        // The file could not be opened or read.
        System.out.println ("Could not read " + argv[0] + ": " + e.getMessage());
      }
    }
    else {
        System.out.println ("Usage: java xmltest2 <filename>");
    }
  }
}
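
The helper makeStudent(), called from main, has to build a new STUDENT subtree. The sketch below shows one way this could be done with the standard org.w3c.dom interfaces; it is an assumption about how such a helper might look, not the listing's own code. The class name MakeStudentSketch and the helpers leaf() and emptyDocument() are invented, and the JDK's built-in classes are used to create the document.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class MakeStudentSketch {

  // Build <tag>text</tag> as a node owned by doc.
  static Element leaf (Document doc, String tag, String text)
  {
    Element e = doc.createElement (tag);
    e.appendChild (doc.createTextNode (text));
    return e;
  }

  // Build a STUDENT subtree. The node n is any node already in the
  // target tree; it is used only to find the owning document.
  static Node makeStudent (Node n, String name, String addr, String grade)
  {
    Document doc = n.getOwnerDocument();
    Element s = doc.createElement ("STUDENT");
    s.appendChild (leaf (doc, "NAME", name));
    s.appendChild (leaf (doc, "ADDR", addr));
    s.appendChild (leaf (doc, "GRADE", grade));
    return s;
  }

  // Create an empty document to work with.
  static Document emptyDocument ()
  {
    try {
      return DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().newDocument();
    }
    catch (Exception e) {
      throw new IllegalStateException (e);
    }
  }

  public static void main (String[] argv)
  {
    Document doc = emptyDocument();
    Element root = doc.createElement ("DOCUMENT");
    doc.appendChild (root);

    // The new subtree is not part of the tree until it is appended.
    Node s = makeStudent (root, "Dangerous Dave", "333 Ouch St", "C");
    root.appendChild (s);

    System.out.println (root.getElementsByTagName ("STUDENT").getLength());
  }
}
```

Nodes must be created by the document that will hold them, which is why the helper takes an existing node as its first argument.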

A third example

We will enhance our current example with two additional features:

We will show how another XML tree can be "joined" into the current XML tree. This XML file is assumed to be the file newstudent2.xml.

Then, we will show how to write the result to a file.

Here is the program:

import com.ibm.xml.parser.*;
import org.w3c.dom.*;
import java.io.*;
import java.util.*;

// An error-handling class passed to the parser that
// overrides the error method.

class MyErrorHandler implements ErrorListener {
  public int error (String fname, int lineno, int charoff,
		    Object key, String mes)
  {
    System.out.println ("ERR:" + fname + " at line " + lineno +
			": " + mes);
    return 1;
  }
}


public class xmltest3 {

  // The method clean removes all nodes with "junk" data.

  public static void clean (Node n)
  {
    boolean remove = false;
    String nodename = n.getNodeName();

    if (nodename != null) {
      // Removing is a possibility only if the node is of type "#text".
      if (nodename.equals ("#text"))
	remove = true;
      else
	remove = false;
    }

    // Extract the node value.
    String valueStr = n.getNodeValue();
    int len = 0;

    // Now trim and check if it needs to be removed.
    if (valueStr != null) {
      valueStr = valueStr.trim();
      len = valueStr.length();
      if (len > 0) {
        // Useful
	remove = false;
        // Put the trimmed string back in.
        n.setNodeValue (valueStr);
      }
    }

    if (remove) {
      // To remove, go to the parent node and delete from there.
      Node p = n.getParentNode();
      p.removeChild (n);
      // Recursively clean.
      clean (p);
      return;
    }

    // Continue exploration.
    if (n.hasChildNodes()) {
      NodeList nl = n.getChildNodes();
      int size = nl.getLength();
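      for (int i=0; i<size; i++) {
        // item() returns null if an earlier removal shortened the list.
        Node c = nl.item(i);
        if (c != null)
          clean (c);
      }
    }
  }

  // ... (traverse and makeStudent as in the previous example)

  // Print a subtree as XML text: start-tag, children, end-tag.
  // NOTE: the method name, the PrintWriter parameter and the handling
  // of text nodes are assumptions.

  public static void printTree (Node n, PrintWriter pw)
  {
    String nodename = n.getNodeName();
    if (nodename.equals ("#text")) {
      pw.println (n.getNodeValue());
      return;
    }
    pw.println ("<" + nodename + ">");

    // Continue printing recursively.
    if (n.hasChildNodes()) {
      NodeList nl = n.getChildNodes();
      int size = nl.getLength();
      for (int i=0; i<size; i++)
        printTree (nl.item(i), pw);
    }
    pw.println ("</" + nodename + ">");
  }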



  public static void main (String[] argv)
  {
    if (argv.length > 0) {
      try {
	FileReader fr = new FileReader (argv[0]);
	Parser p = new Parser (argv[0]);
	p.setKeepComment (false);
	p.setPreserveSpace (false);
	TXDocument doc = p.readStream(fr);
	Element root = doc.getDocumentElement ();

	clean (root);
	System.out.println ("TRAVERSE ...");
	traverse (root);

	System.out.println ("ADDING A NEW STUDENT: ");
	NodeList nl = doc.getElementsByTagName ("STUDENT");
	Node n = nl.item(1);
	Node s = makeStudent (n, "Dangerous Dave", "333 Ouch St", "C");
	Node parent = n.getParentNode();
	parent.appendChild (s);
	System.out.println ("TRAVERSE ...");
	traverse (root);

	// Next, output using the document's print method.
	FileWriter fw = new FileWriter ("studentout.xml");
	PrintWriter pw = new PrintWriter (fw);
	doc.print (pw);
	pw.close();

	// Our own custom writing approach.
	// NOTE: this does not print the DTD, for unknown reasons.
	String charset = "ISO-8859-1";
	String jencode = MIME2Java.convert (charset);
	fw = new FileWriter ("studentout2.xml");
        PrintWriter pw_out = new PrintWriter (fw);
	DTD dtd = doc.getDTD();
	dtd.toXMLString (pw_out);
	pw_out.flush();
	pw_out.close();
      }
      catch (IOException e) {
        // A file could not be opened, read or written.
        System.out.println ("I/O error: " + e.getMessage());
      }
    }
    else {
        System.out.println ("Usage: java xmltest3 <filename>");
    }
  }
}
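
For comparison, joining a second tree into the first and writing the result can also be sketched with the standard interfaces alone. The code below is an illustration under several assumptions: it uses the JDK's built-in parser and its javax.xml.transform classes instead of the IBM parser's own print method, it reads both trees from strings rather than from student.xml and newstudent2.xml, and the class and method names (JoinAndWrite, parse, join, toXml) are invented.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class JoinAndWrite {

  // Parse XML text into a tree.
  static Document parse (String xml)
  {
    try {
      return DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse (new InputSource (new StringReader (xml)));
    }
    catch (Exception e) {
      throw new IllegalStateException (e);
    }
  }

  // Copy the root of 'extra' into 'main' as the last child of main's root.
  static void join (Document main, Document extra)
  {
    // A node belongs to one document; importNode() makes a copy
    // that belongs to 'main'.
    Node copy = main.importNode (extra.getDocumentElement(), true);
    main.getDocumentElement().appendChild (copy);
  }

  // Serialize the tree as XML text.
  static String toXml (Document doc)
  {
    try {
      Transformer t = TransformerFactory.newInstance().newTransformer();
      t.setOutputProperty (OutputKeys.OMIT_XML_DECLARATION, "yes");
      StringWriter w = new StringWriter();
      t.transform (new DOMSource (doc), new StreamResult (w));
      return w.toString();
    }
    catch (Exception e) {
      throw new IllegalStateException (e);
    }
  }

  public static void main (String[] argv)
  {
    Document main = parse (
      "<DOCUMENT><STUDENT><NAME>A</NAME></STUDENT></DOCUMENT>");
    Document extra = parse ("<STUDENT><NAME>B</NAME></STUDENT>");
    join (main, extra);
    System.out.println (toXml (main));
  }
}
```

To write to a file instead of a string, the StreamResult can wrap a FileWriter in place of the StringWriter.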