XML is a simplified subset of SGML (a comprehensive but difficult
markup standard, of which HTML is one application) that allows
you to define and manipulate your own tags. Thus, with HTML you
are stuck with its tags, but browsers know how to display the
material (it's all standardized now), whereas with XML you can
define your own tags, but you need to provide a "browser" that
can interpret the tags and take appropriate action.
Here's an example of an XML document (in a file, called students.xml, say):
You see that this is really a tiny database with three records,
each of which is "self-describing". This is one current use of
XML. Another use is to define tags like <audio></audio> and
include actual audio between the tags - this is the idea of
creating complex documents.
What does a browser do with the above student-database file?
Standard HTML browsers will ignore the new tags. So it is
YOU, the creator of a particular set of tags, who needs to
provide a "browser" for the tags you create. Such a browser
will parse the file, and extract the content and "do" whatever
it is that needs to be done. These actions might include
display in a window, or might have nothing to do with
displaying at all (such as placing the database entries above
in a regular relational database).
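As a rough illustration of such a "browser", the sketch below reads student data and turns each record into a plain Java object, which could then be displayed or handed to a database. It is only a sketch under some assumptions: it uses the JAXP parser that ships with the JDK rather than any particular vendor's parser, and the names StudentExtractor, Student, textOf and extract are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class StudentExtractor {

    // Hypothetical record type: one instance per <STUDENT> element.
    static class Student {
        final String name, addr, grade;
        Student (String name, String addr, String grade) {
            this.name = name;
            this.addr = addr;
            this.grade = grade;
        }
    }

    // Trimmed text of the first <tag> inside parent, or "" if there is none.
    static String textOf (Element parent, String tag) {
        NodeList nl = parent.getElementsByTagName (tag);
        if (nl.getLength() == 0)
            return "";
        return nl.item(0).getTextContent().trim();
    }

    // Parse the XML text and build one Student per <STUDENT> element.
    static List<Student> extract (String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse (new InputSource (new StringReader (xml)));
        List<Student> result = new ArrayList<>();
        NodeList students = doc.getElementsByTagName ("STUDENT");
        for (int i=0; i<students.getLength(); i++) {
            Element s = (Element) students.item(i);
            result.add (new Student (textOf (s, "NAME"),
                                     textOf (s, "ADDR"),
                                     textOf (s, "GRADE")));
        }
        return result;
    }

    public static void main (String[] argv) throws Exception {
        String xml =
            "<DOCUMENT>\n" +
            "<STUDENT>\n<NAME>\nBulimic Bill\n</NAME>\n" +
            "<ADDR>\n123 Upchuck Drive\n</ADDR>\n<GRADE>\nB+\n</GRADE>\n</STUDENT>\n" +
            "<STUDENT>\n<NAME>\nCadaverous Chen\n</NAME>\n" +
            "<ADDR>\n14 Mordant St.\n</ADDR>\n<GRADE>\nB-\n</GRADE>\n</STUDENT>\n" +
            "</DOCUMENT>";
        for (Student s : extract (xml))
            System.out.println (s.name + " | " + s.addr + " | " + s.grade);
    }
}
```

Each Student here corresponds to one row of the "tiny database"; writing the rows to a real database is left out of the sketch.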
How does one write a program to read in an XML file and
extract information? Fortunately, this task is made easier
by using available tools. One such fundamental tool is an
XML parser. There are two kinds of parsers:
How does the parser know what to expect in a file? The answer is:
the file usually contains "type" information detailing the
structure and names of the tags. This type information is called
a DTD (Document Type Definition). For simplicity we did not
show the DTD for the file above. Usually, the DTD occurs
before the data. It is a set of declarations, in a syntax of
their own, that "explain" the tags and structure. Here's the
complete file students.xml:
Notice the strange syntax (the actual information is inside a
pair of matching square brackets). It starts by saying that
the top-level structure is an "element" called DOCUMENT.
Notice that there is a * after (STUDENT). This indicates that
the item may occur any number of times (zero or more).
Finally, each of the items NAME, ADDR and GRADE contains
something mysterious called #PCDATA ("parsed character data").
This is just an ugly way of saying that there is no further
subdivision and that only plain character data will be
contained in those items.
There is more to DTDs, but not a whole lot more. Other
features allow you to point (using a URL) across the web
to a DTD instead of including the DTD in the file itself.
XML allows you to define attributes for tags, as in:
You can set up as many attributes as you like, and retrieve
the values during parsing. Of course, the DTD must specify
the attributes:
We have only cursorily covered XML so far. XML has many other
features, including:
NOTE:
As in HTML, whitespace between XML tags usually carries no
meaning. What happens to it depends on the parser: some let you
choose whether to keep or discard it, while others ignore
whitespace in the processing of tags but leave it intact in
the text they hand back.
Since XML has nothing to do with programming languages,
any programming language can be used to build a parser and
then, an application on top of the parser. Currently, XML
parsers are available in C/C++, VB and of course, Java.
An XML parser in Java allows a Java programmer to include
a parser package and write applications to extract data from XML files.
Several free and commercial parsers for XML are now available
in Java, see for example, the Java parsers at
The W3C, which defined the XML standard, also defines a set
of Java interfaces for a parser to implement. This means that
if you write code against these interfaces, you can switch
among parsers that follow them. The IBM product, for example,
follows the W3C interfaces. In practice, you usually import
both the parser package and the W3C package that has the
interface definitions.
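The idea of coding against the standard interfaces can be sketched as follows. This is an illustration under an assumption: it uses JAXP, the parser-independent factory API bundled with current JDKs, not the IBM package used in these notes. Only the factory call depends on which parser is installed; the traversal touches nothing but org.w3c.dom interfaces. The names DomTraverse, collect and elementNames are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomTraverse {

    // Append the names of all element nodes in the subtree, depth-first.
    static void collect (Node n, StringBuilder out) {
        if (n.getNodeType() == Node.ELEMENT_NODE)
            out.append (n.getNodeName()).append (' ');
        NodeList nl = n.getChildNodes();
        for (int i=0; i<nl.getLength(); i++)
            collect (nl.item(i), out);
    }

    // Parse the XML text and list its element names in document order.
    static String elementNames (String xml) throws Exception {
        // The factory hands back whichever JAXP-compliant parser is installed.
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse (new InputSource (new StringReader (xml)));
        StringBuilder out = new StringBuilder();
        collect (doc.getDocumentElement(), out);
        return out.toString().trim();
    }

    public static void main (String[] argv) throws Exception {
        String xml =
            "<DOCUMENT><STUDENT><NAME>Bulimic Bill</NAME>" +
            "<GRADE>B+</GRADE></STUDENT></DOCUMENT>";
        System.out.println (elementNames (xml));
    }
}
```

Because collect and elementNames mention only Node, NodeList and Document, swapping in a different parser should not require changing them.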
There are two sets of classes defined by the W3C that
are most relevant to Java parsers (and programmers):
We will parse the file student.xml using IBM's XML parser. After
downloading the parser and setting up the CLASSPATH variable to
access the jar file in the package, we will be ready to use
their code.
The file student.xml is:
Here is a sample program that performs a simple traversal:
(source code):
This produces the output:
Observe:
Note:
We will enhance our example with three additional features:
We will enhance our current example with two additional features:
<?xml version="1.0"?>
<DOCUMENT>
<STUDENT>
<NAME>
Anorexic Al
</NAME>
<ADDR>
10 Slimway Rd
</ADDR>
<GRADE>
A
</GRADE>
</STUDENT>
<STUDENT>
<NAME>
Bulimic Bill
</NAME>
<ADDR>
123 Upchuck Drive
</ADDR>
<GRADE>
B+
</GRADE>
</STUDENT>
<STUDENT>
<NAME>
Cadaverous Chen
</NAME>
<ADDR>
14 Mordant St.
</ADDR>
<GRADE>
B-
</GRADE>
</STUDENT>
</DOCUMENT>
These parsers usually operate in two ways:
The tree-based method is usually preferred although more
cumbersome to use. Typically, good parsing packages will
provide both types of parsing.
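To make the contrast concrete, here is a small sketch that counts STUDENT elements both ways. It assumes the two modes meant here are tree-based parsing (DOM) and event-based parsing (SAX), and it uses the JAXP classes bundled with the JDK rather than a particular vendor's parser. The class and method names (TwoWays, countWithTree, countWithEvents) are made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TwoWays {

    // Tree-based: build the whole document in memory, then query it.
    static int countWithTree (String xml, String tag) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse (new InputSource (new StringReader (xml)));
        return doc.getElementsByTagName (tag).getLength();
    }

    // Event-based: the parser calls back as it reads; no tree is kept.
    static int countWithEvents (String xml, final String tag) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement (String uri, String localName,
                                      String qName, Attributes atts) {
                // Called once for every opening tag the parser encounters.
                if (qName.equals (tag))
                    count[0]++;
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse (new InputSource (new StringReader (xml)), handler);
        return count[0];
    }

    public static void main (String[] argv) throws Exception {
        String xml =
            "<DOCUMENT>" +
            "<STUDENT><NAME>Bulimic Bill</NAME></STUDENT>" +
            "<STUDENT><NAME>Cadaverous Chen</NAME></STUDENT>" +
            "</DOCUMENT>";
        System.out.println ("tree:   " + countWithTree (xml, "STUDENT"));
        System.out.println ("events: " + countWithEvents (xml, "STUDENT"));
    }
}
```

The tree version can revisit any part of the document after parsing; the event version sees each tag once and must keep whatever state it needs itself.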
DTD's.
<?xml version="1.0"?>
<!-- Comments are the same as in HTML -->
<!-- This is the DTD part -->
<!DOCTYPE DOCUMENT [
<!ELEMENT DOCUMENT (STUDENT)*>
<!ELEMENT STUDENT (NAME, ADDR, GRADE)>
<!ELEMENT NAME (#PCDATA)>
<!ELEMENT ADDR (#PCDATA)>
<!ELEMENT GRADE (#PCDATA)>
]>
<DOCUMENT>
<STUDENT>
<NAME>
Anorexic Al
</NAME>
<ADDR>
10 Slimway Rd
</ADDR>
<GRADE>
A
</GRADE>
</STUDENT>
<STUDENT>
<NAME>
Bulimic Bill
</NAME>
<ADDR>
123 Upchuck Drive
</ADDR>
<GRADE>
B+
</GRADE>
</STUDENT>
<STUDENT>
<NAME>
Cadaverous Chen
</NAME>
<ADDR>
14 Mordant St.
</ADDR>
<GRADE>
B-
</GRADE>
</STUDENT>
</DOCUMENT>
Thus, the DTD part of the document is really:
<!DOCTYPE DOCUMENT [
<!ELEMENT DOCUMENT (STUDENT)*>
<!ELEMENT STUDENT (NAME, ADDR, GRADE)>
<!ELEMENT NAME (#PCDATA)>
<!ELEMENT ADDR (#PCDATA)>
<!ELEMENT GRADE (#PCDATA)>
]>
NOTE: Unlike HTML, XML is case-sensitive so "STUDENT" is different
from "Student", which is different from "student".
Here, the DTD is saying "DOCUMENT will contain a substructure
called STUDENT, and STUDENT will contain NAME, ADDR and GRADE."
Each substructure is assumed to be defined by begin-end tags of
that substructure's name. Thus, the STUDENT stuff will be
everything that occurs between a <STUDENT> and its matching </STUDENT>.
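A parser can also check a document against its DTD. The sketch below shows one way this might look with the JAXP parser bundled in the JDK (an assumption: it is not the IBM parser used in these notes). It turns validation on and counts the validity errors reported for a document whose STUDENT lacks a GRADE. The names DtdCheck and countValidityErrors are made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class DtdCheck {

    // The DTD from the notes, placed in front of the data.
    static final String DTD =
        "<?xml version=\"1.0\"?>\n" +
        "<!DOCTYPE DOCUMENT [\n" +
        "<!ELEMENT DOCUMENT (STUDENT)*>\n" +
        "<!ELEMENT STUDENT (NAME, ADDR, GRADE)>\n" +
        "<!ELEMENT NAME (#PCDATA)>\n" +
        "<!ELEMENT ADDR (#PCDATA)>\n" +
        "<!ELEMENT GRADE (#PCDATA)>\n" +
        "]>\n";

    // Parse with validation on and return how many validity errors occurred.
    static int countValidityErrors (String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating (true);          // check the data against the DTD
        DocumentBuilder builder = factory.newDocumentBuilder();
        final int[] errors = {0};
        builder.setErrorHandler (new ErrorHandler() {
            public void warning (SAXParseException e) { }
            public void error (SAXParseException e) { errors[0]++; }
            public void fatalError (SAXParseException e) throws SAXException {
                throw e;                       // not well-formed: give up
            }
        });
        builder.parse (new InputSource (new StringReader (xml)));
        return errors[0];
    }

    public static void main (String[] argv) throws Exception {
        String good = DTD +
            "<DOCUMENT>\n<STUDENT>\n<NAME>Bulimic Bill</NAME>\n" +
            "<ADDR>123 Upchuck Drive</ADDR>\n<GRADE>B+</GRADE>\n" +
            "</STUDENT>\n</DOCUMENT>";
        // Same record, but the GRADE element is missing.
        String bad = DTD +
            "<DOCUMENT>\n<STUDENT>\n<NAME>Bulimic Bill</NAME>\n" +
            "<ADDR>123 Upchuck Drive</ADDR>\n</STUDENT>\n</DOCUMENT>";
        System.out.println ("good: " + countValidityErrors (good) + " error(s)");
        System.out.println ("bad:  " + countValidityErrors (bad) + " error(s)");
    }
}
```

Note that a validity error is not fatal: the parser still builds a tree for the second document, so it is up to the application to decide what to do with the error count.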
Attributes.
<STUDENT LEVEL="graduate" STATUS="alum">
<NAME>
Cadaverous Chen
</NAME>
<ADDR>
14 Mordant St.
</ADDR>
<GRADE>
B-
</GRADE>
</STUDENT>
<!DOCTYPE DOCUMENT [
<!ELEMENT DOCUMENT (STUDENT)*>
<!ELEMENT STUDENT (NAME, ADDR, GRADE)>
<!ATTLIST STUDENT
LEVEL CDATA #REQUIRED
STATUS CDATA #IMPLIED
>
<!ELEMENT NAME (#PCDATA)>
<!ELEMENT ADDR (#PCDATA)>
<!ELEMENT GRADE (#PCDATA)>
]>
Here, there are two attributes for element STUDENT: LEVEL, which
is always required (#REQUIRED), and STATUS, which is optional
(#IMPLIED). Both are of type CDATA (character data).
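Retrieving attribute values during parsing might look like the sketch below. It uses the standard org.w3c.dom Element interface together with the JAXP parser bundled in the JDK (an assumption; it is not the IBM parser used in these notes), and the names AttrDemo, parse and describe are made up. The DTD is left out to keep the sketch short, so nothing here enforces that LEVEL is present.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class AttrDemo {

    // Parse XML text into a DOM document.
    static Document parse (String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse (new InputSource (new StringReader (xml)));
    }

    // Summarize the LEVEL and STATUS attributes of one STUDENT element.
    static String describe (Element student) {
        String level = student.getAttribute ("LEVEL");
        // getAttribute returns "" for a missing attribute, so test first.
        String status = student.hasAttribute ("STATUS")
            ? student.getAttribute ("STATUS")
            : "(none)";
        return "LEVEL=" + level + " STATUS=" + status;
    }

    public static void main (String[] argv) throws Exception {
        String xml =
            "<DOCUMENT>" +
            "<STUDENT LEVEL=\"graduate\" STATUS=\"alum\">" +
            "<NAME>Cadaverous Chen</NAME></STUDENT>" +
            "<STUDENT LEVEL=\"undergraduate\">" +
            "<NAME>Bulimic Bill</NAME></STUDENT>" +
            "</DOCUMENT>";
        NodeList students = parse (xml).getElementsByTagName ("STUDENT");
        for (int i=0; i<students.getLength(); i++)
            System.out.println (describe ((Element) students.item(i)));
    }
}
```

The second STUDENT omits STATUS, which the DTD above permits because that attribute is declared #IMPLIED.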
Advanced XML.
XML and Java: overview
XML and Java: An example
<?xml version="1.0"?>
<!-- Comments are the same as in HTML -->
<!-- This is the DTD part -->
<!DOCTYPE DOCUMENT [
<!ELEMENT DOCUMENT (STUDENT)*>
<!ELEMENT STUDENT (NAME, ADDR, GRADE)>
<!ELEMENT NAME (#PCDATA)>
<!ELEMENT ADDR (#PCDATA)>
<!ELEMENT GRADE (#PCDATA)>
]>
<DOCUMENT>
<STUDENT>
<NAME>
Anorexic Al</NAME> <ADDR>
10 Slimway Rd
</ADDR>
<GRADE>
A
</GRADE>
</STUDENT>
<STUDENT>
<NAME>
Bulimic Bill
</NAME>
<ADDR>
123 Upchuck Drive
</ADDR>
<GRADE>
B+
</GRADE>
</STUDENT>
<STUDENT>
<NAME>
Cadaverous Chen
</NAME>
<ADDR>
14 Mordant St.
</ADDR>
<GRADE>
B-
</GRADE>
</STUDENT>
</DOCUMENT>
import com.ibm.xml.parser.*; // This is the import required for the parser.
import org.w3c.dom.*;        // A list of "standard" interfaces.
import java.io.*;

public class xmltest {

    // Traverse a subtree whose root is the parameter.
    public static void traverse (Node n)
    {
        // Extract node info:
        String nodename = n.getNodeName();
        String value = n.getNodeValue();
        // Print and continue traversing.
        System.out.println ("Node: " + nodename + " value=[" + value + "]");
        // Now traverse the rest of the tree in depth-first order.
        if (n.hasChildNodes()) {
            // Get the children in a list.
            NodeList nl = n.getChildNodes();
            // How many of them?
            int size = nl.getLength();
            for (int i=0; i<size; i++)
                // Recursively traverse each of the children.
                traverse (nl.item(i));
        }
    }

    public static void main (String[] argv)
    {
        // Pick off the XML file name from the command-line arguments.
        if (argv.length > 0) {
            try {
                // Get the XML file name, e.g., "student.xml".
                FileReader fr = new FileReader (argv[0]);
                // Create a parser instance.
                Parser p = new Parser (argv[0]);
                // These two will decide whether the parser will
                // retain whitespace and comments from the original file.
                p.setKeepComment (false);
                p.setPreserveSpace (false);
                // Now parse and create a document tree as a result.
                // Note that we are passing a FileReader wrapped around
                // the file.
                TXDocument doc = p.readStream(fr);
                // Once we have the tree, extract the root.
                Element root = doc.getDocumentElement ();
                // Now traverse the tree.
                System.out.println ("TRAVERSE ...");
                traverse (root);
                // Alternatively, here is a scan by TAG name for
                // the tag "NAME"
                System.out.println ("BY THE TAG 'NAME':");
                NodeList nl = doc.getElementsByTagName ("NAME");
                int size = nl.getLength();
                for (int i=0; i<size; i++)
                    traverse (nl.item(i));
            }
            catch (IOException e) {
                // The file could not be opened or read.
                System.out.println ("Could not read " + argv[0] + ": " + e.getMessage());
            }
        }
        else {
            System.out.println ("Usage: java xmltest <filename>");
        }
    }
}
TRAVERSE ...
Node: DOCUMENT value=[null]
Node: #text value=[
]
Node: STUDENT value=[null]
Node: #text value=[
]
Node: NAME value=[null]
Node: #text value=[
Anorexic Al]
Node: #text value=[ ]
Node: ADDR value=[null]
Node: #text value=[
10 Slimway Rd
]
Node: #text value=[
]
Node: GRADE value=[null]
Node: #text value=[
A
]
Node: #text value=[
]
Node: #text value=[
]
Node: STUDENT value=[null]
Node: #text value=[
]
Node: NAME value=[null]
Node: #text value=[
Bulimic Bill
]
Node: #text value=[
]
Node: ADDR value=[null]
Node: #text value=[
123 Upchuck Drive
]
Node: #text value=[
]
Node: GRADE value=[null]
Node: #text value=[
B+
]
Node: #text value=[
]
Node: #text value=[
]
Node: STUDENT value=[null]
Node: #text value=[
]
Node: NAME value=[null]
Node: #text value=[
Cadaverous Chen
]
Node: #text value=[
]
Node: ADDR value=[null]
Node: #text value=[
14 Mordant St.
]
Node: #text value=[
]
Node: GRADE value=[null]
Node: #text value=[
B-
]
Node: #text value=[
]
Node: #text value=[
]
BY THE TAG 'NAME':
Node: NAME value=[null]
Node: #text value=[
Anorexic Al]
Node: NAME value=[null]
Node: #text value=[
Bulimic Bill
]
Node: NAME value=[null]
Node: #text value=[
Cadaverous Chen
]
import com.ibm.xml.parser.*;
import org.w3c.dom.*;
import java.io.*;

public class xmltest {
    // ...
    public static void traverse (Node n)
    {
        // ...
        String valueStr = n.getNodeValue();
        // Remove whitespace.
        if (valueStr != null)
            valueStr = valueStr.trim();
        // Print and continue traversing.
        System.out.println ("Node: " + n.getNodeName() + " value=[" + valueStr + "]");
        // ...
    }
    // ...
}
In this case the output looks like:
TRAVERSE ...
Node: DOCUMENT value=[null]
Node: #text value=[]
Node: STUDENT value=[null]
Node: #text value=[]
Node: NAME value=[null]
Node: #text value=[Anorexic Al]
Node: #text value=[]
Node: ADDR value=[null]
Node: #text value=[10 Slimway Rd]
Node: #text value=[]
Node: GRADE value=[null]
Node: #text value=[A]
Node: #text value=[]
Node: #text value=[]
...
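Trimming only changes what is printed; the empty #text nodes are still in the tree. One possible way to drop them from the tree itself is sketched below. It assumes the JAXP parser bundled with the JDK rather than the IBM parser, and the names BlankTextStripper, parse and stripBlankText are made up. Children are visited from last to first so that removing one does not shift the positions of those still to be visited.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class BlankTextStripper {

    // Parse XML text into a DOM document.
    static Document parse (String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse (new InputSource (new StringReader (xml)));
    }

    // Remove whitespace-only text nodes below n and trim the others.
    static void stripBlankText (Node n) {
        NodeList nl = n.getChildNodes();
        // Walk backwards: the list is live, so removals shift later indices.
        for (int i = nl.getLength()-1; i >= 0; i--) {
            Node child = nl.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                String value = child.getNodeValue().trim();
                if (value.isEmpty())
                    n.removeChild (child);        // pure whitespace: drop it
                else
                    child.setNodeValue (value);   // keep the trimmed text
            }
            else {
                stripBlankText (child);
            }
        }
    }

    public static void main (String[] argv) throws Exception {
        String xml =
            "<DOCUMENT>\n<STUDENT>\n<NAME>\nBulimic Bill\n</NAME>\n" +
            "<GRADE>\nB+\n</GRADE>\n</STUDENT>\n</DOCUMENT>";
        Document doc = parse (xml);
        Node root = doc.getDocumentElement();
        System.out.println ("children before: " + root.getChildNodes().getLength());
        stripBlankText (root);
        System.out.println ("children after:  " + root.getChildNodes().getLength());
    }
}
```

After the call, a traversal like the one above would no longer visit any empty #text nodes.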
A second example
Here is the program (source code):
import com.ibm.xml.parser.*;
import org.w3c.dom.*;
import java.io.*;

// An error-handling class passed to the parser that
// overrides the error method.
class MyErrorHandler implements ErrorListener {
    public int error (String fname, int lineno, int charoff,
                      Object key, String mes)
    {
        System.out.println ("ERR:" + fname + " at line " + lineno +
                            ": " + mes);
        return 1;
    }
}

public class xmltest2 {

    // The method clean removes all nodes with "junk" data.
    public static void clean (Node n)
    {
        boolean remove = false;
        // Extract the nodename.
        String nodename = n.getNodeName();
        if (nodename != null) {
            // Removing is a possibility only if the node is of type "#text".
            if (nodename.equals ("#text"))
                remove = true;
            else
                remove = false;
        }
        // Extract the node value.
        String valueStr = n.getNodeValue();
        int len = 0;
        // Now trim and check if it needs to be removed.
        if (valueStr != null) {
            valueStr = valueStr.trim();
            len = valueStr.length();
            if (len > 0) {
                // Useful
                remove = false;
                // Put the trimmed string back in.
                n.setNodeValue (valueStr);
            }
        }
        if (remove) {
            // To remove, go to the parent node and delete from there.
            Node p = n.getParentNode();
            p.removeChild (n);
            // Recursively clean.
            clean (p);
            return;
        }
        // Continue exploration.
        if (n.hasChildNodes()) {
            NodeList nl = n.getChildNodes();
            int size = nl.getLength();
            for (int i=0; i<size; i++)
                // ...
A third example
Here is the program (source code):
import com.ibm.xml.parser.*;
import org.w3c.dom.*;
import java.io.*;
import java.util.*;

// An error-handling class passed to the parser that
// overrides the error method.
class MyErrorHandler implements ErrorListener {
    public int error (String fname, int lineno, int charoff,
                      Object key, String mes)
    {
        System.out.println ("ERR:" + fname + " at line " + lineno +
                            ": " + mes);
        return 1;
    }
}

public class xmltest3 {

    // The method clean removes all nodes with "junk" data.
    public static void clean (Node n)
    {
        boolean remove = false;
        String nodename = n.getNodeName();
        if (nodename != null) {
            // Removing is a possibility only if the node is of type "#text".
            if (nodename.equals ("#text"))
                remove = true;
            else
                remove = false;
        }
        // Extract the node value.
        String valueStr = n.getNodeValue();
        int len = 0;
        // Now trim and check if it needs to be removed.
        if (valueStr != null) {
            valueStr = valueStr.trim();
            len = valueStr.length();
            if (len > 0) {
                // Useful
                remove = false;
                // Put the trimmed string back in.
                n.setNodeValue (valueStr);
            }
        }
        if (remove) {
            // To remove, go to the parent node and delete from there.
            Node p = n.getParentNode();
            p.removeChild (n);
            // Recursively clean.
            clean (p);
            return;
        }
        // Continue exploration.
        if (n.hasChildNodes()) {
            NodeList nl = n.getChildNodes();
            int size = nl.getLength();
            for (int i=0; i<size; i++)
                // ...