
Extensible Markup Language Basics: An Overview of XML -- Definitions and Discussions

The explosive growth of the World Wide Web is testimony to the power of the technology. The Web allows information to be presented in a somewhat structured format guided by HTML. Presenting information with HTML, however, is limited by HTML’s fixed set of display tags and by the sad fact that browser vendors differ in their support for various HTML tags (the same page looks different in different browsers). This unfortunate situation cries out for a standard approach to organizing and displaying information.

This article provides an overview of XML, or Extensible Markup Language, a universal document format for structuring data for presentation on the Web. The article starts with an overview of XML features that overcome existing problems with HTML. Next, the article shows a simple XML document, followed by a discussion of XML document components. Two important XML terms, well-formed documents and valid documents, are described. XML Document Type Definitions (DTDs) are covered, and an example of a DTD is provided. The article closes with a brief description of related technologies.

An Overview of XML Features

XML does not have a fixed set of markup tags. Here, XML overcomes what some believe to be HTML’s greatest deficiency. XML is not a markup language per se; XML is a meta-markup language that allows an author to define her own tags. This way, an author can create a markup language peculiar to her industry, and XML document authors can use this markup language to encode data in industry-specific terminology.

XML requires document authors to follow certain rules in creating what are known as well-formed XML documents. If these rules are not followed, the XML document is useless. The XML specification prohibits XML tools from trying to fix problems with the document. The intent is to stop the browser madness prevalent in HTML, where different browsers attempt to "fix" broken HTML in different ways and, of course, parse and display this HTML differently. For example, an HTML document author can write some HTML with missing end tags, which the major browsers will parse and display. Such foolishness cannot fly with XML; if an XML document is broken, the document cannot be rendered. Hence, an XML author can create XML documents secure in the knowledge that these documents will be parsed identically with different pieces of compliant software.

XML stresses the separation of data content from data presentation. Over time, HTML has blurred the distinction between organizing document content and displaying the content. A typical HTML document has tags that describe relationships among document content (like <LI> tags) and tags that govern the display of this content (<U>, <B>, etc.). XML describes document content structure and semantic relationships, not the content formatting. The XML author would use a related style sheet technology, like CSS (Cascading Style Sheets) or XSL (Extensible Style Language), to govern the display of the document. One upshot of this clean separation of structure and display is that the same XML document can be displayed in different ways by using different style sheets, or the same style sheet could govern the display of similarly structured XML documents.

The nonproprietary nature of XML, combined with its ease of writing, makes XML an ideal format for data exchange among applications.

A Simple XML Document

This section shows what may be the simplest XML document one can create, and explores some of the document’s properties. First, the simple XML document:

<?xml version="1.0" standalone="yes"?>

<First>

My First XML Document

</First>

This document contains a line that tells a parser the document is an XML document, followed by the document’s content, the phrase "My First XML Document." The first line is the XML declaration. Notice the characters <? and ?>; these characters delimit an XML processing instruction. The word after the characters <? (xml, in this case) identifies this particular processing instruction as the XML declaration.

XML documents are free form; XML parsers typically do not care about column positions or white space. An XML document author should take some time making her document easy to read with judicious use of tabs, white space and blank lines.

XML processing instructions and tags often use XML attributes, which are name-value pairs separated by an equals sign; the values must be enclosed in quotes. (Another difference between XML and HTML is that many HTML attribute values do not require the quotes.) The XML declaration requires the version attribute; the standalone attribute is optional. For now, know that this XML declaration states that the document conforms to XML version 1.0 and does not require any other documents for parsing its content.

This document does not contain any display or formatting information. Therefore, one would need a style sheet, and a way of telling the XML document to use that style sheet to display the XML document. Let’s use the simple style sheet below, saved as forfirst.css in the same directory as the XML document. Notice that the style sheet refers to the tag <First> used in the document.

First {display: block; font-size: 36pt; font-weight: bold; color: #00FF00;}

The second processing instruction below associates the style sheet with the XML document:

<?xml version="1.0" standalone="yes"?>

<?xml-stylesheet type="text/css" href="forfirst.css"?>

<First>

My First XML Document

</First>

XML Document Components

XML documents are text consisting of data and markup. The data is what the author encodes; the markup tells the XML parser how this data is organized and structured. Markup includes processing instructions, comments, elements, entity references, CDATA delimiters and Document Type Definitions (DTDs). All markup is case sensitive; each kind is described later in this article.

The above simple XML document contains an XML declaration, a processing instruction that associates a style sheet with the document, and a tag, or element. The XML declaration, a processing instruction, is an optional statement that identifies the XML version in use; version 1.0 is the version nearly all documents use. If present, the XML declaration must be the first statement in the document. If an XML document must be displayed, a processing instruction that associates one or more display documents (such as a CSS style sheet) with it must be coded as well.

XML comments begin with <!-- and end with -->. However, an XML author cannot place comments with reckless abandon. Comments cannot be coded before the XML declaration. Comments may not be coded inside element tags. Comments may not be nested. Comments may not include two successive dashes other than those that start and end the comment.

XML document content is enclosed in tags, or elements. Our simple example above contains one element coded as <First>. XML documents require that one element encloses all others. In XML lingo, that special element is called the root element. Put differently, every element coded in an XML document must be sandwiched between the opening and closing tags of the root element. Elements that are coded within (sandwiched between) other elements are known as child elements. For example:

<?xml version="1.0" standalone="yes"?>

<?xml-stylesheet type="text/css" href="mystylesheet.css"?>

<Root>

<Employee_Data>

<Name>

John Q. Public

</Name>

<Department>

Human Resources

</Department>

<Hire_Date>

May 1 2000

</Hire_Date>

<Review_Date>

December 1 2000

</Review_Date>

</Employee_Data>

</Root>
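The parent and child relationships in this document can be inspected with any XML parser. The sketch below is one way to do so, assuming Python and its standard xml.etree.ElementTree module (neither appears in the original article); the helper name describe is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The employee document from the article. The XML declaration and the
# style sheet instruction are left out because neither changes the tree.
DOCUMENT = """<Root>
  <Employee_Data>
    <Name>John Q. Public</Name>
    <Department>Human Resources</Department>
    <Hire_Date>May 1 2000</Hire_Date>
    <Review_Date>December 1 2000</Review_Date>
  </Employee_Data>
</Root>"""


def describe(element, depth=0):
    """Return one indented line per element, children listed under parents."""
    text = (element.text or "").strip()
    lines = ["  " * depth + element.tag + (": " + text if text else "")]
    for child in element:  # iterating an element yields its direct children
        lines.extend(describe(child, depth + 1))
    return lines


root = ET.fromstring(DOCUMENT)
outline = describe(root)
print("\n".join(outline))
```

The indentation of each printed line mirrors how deeply the element is sandwiched inside the root element.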

All elements other than <Root> (which is not a keyword) are child elements. Style sheet display attributes may be attached to each element, or a child element may inherit display attributes from its parent element. For example, if the file mystylesheet.css, associated with the above XML document, contained only the line below:

Root {display: block; font-size: 36pt; font-weight: bold; color: blue;}

then all the content in the XML document would be 36-point blue text, whereas if this line were also present:

Hire_Date {position: absolute; top: 90px; left: 190px; display: block; font-size: 18pt; font-weight: bold; color: red;}

the content for element Hire_Date would be 18-point red text, placed at the absolute position given by the top and left properties.

Element names must begin with a letter or an underscore character; the following characters may be letters, digits, hyphens, underscores or periods, but not spaces. Also, element names are case sensitive; </aTag> is not the closing tag for <Atag>.

XML allows for tags that contain no data. Some HTML tags that contain no data have no closing tags (there is no </IMG> tag), and HTML browsers may or may not ignore unknown tags. An XML parser, however, must be able to recognize and process every tag present in a document. To deal with tags that contain no data, XML allows for empty tags. An empty tag is closed with /> (for example, <IMG/>).

XML allows an author to categorize data by using meaningful tag names, and organize data by developing a hierarchy between parent and child elements. An author may also attach attributes to an element. As in HTML, XML allows an author to code name/value pairs with elements. For example, this element contains one attribute:

<Department Location="Home Office">

Notice that an author could have coded a separate Location element.

A natural question is: When should an author code attributes instead of elements? Rather than supply a laundry list of do’s and don’ts, a simple rule works best here. If an author needs to access and display some data independently of other data, or the data has structure, then encoding that data in an element may work best. If an author needs to encode some data about the data (what is sometimes called metadata), or a piece of data that need not be independently accessed, encoding that data in an attribute may be called for. A good example of using attributes instead of elements is coding the Height and Width attributes of the <IMG> tag in an HTML document.
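To make the trade-off concrete, the sketch below encodes the Location value both ways and reads each back. It assumes Python's standard xml.etree.ElementTree module, and the second layout (with Name and Location child elements) is a hypothetical alternative, not something the article prescribes.

```python
import xml.etree.ElementTree as ET

# Location coded as an attribute of Department, as in the article.
as_attribute = ET.fromstring(
    '<Department Location="Home Office">Human Resources</Department>'
)

# The same fact coded as a child element (hypothetical alternative layout).
as_element = ET.fromstring(
    "<Department>"
    "<Name>Human Resources</Name>"
    "<Location>Home Office</Location>"
    "</Department>"
)

# An attribute is read from the element that carries it ...
location_from_attribute = as_attribute.get("Location")
# ... while a child element is located by name and can hold structure of its own.
location_from_element = as_element.findtext("Location")

print(location_from_attribute, "|", location_from_element)
```

Either encoding carries the same value; the element form costs more markup but lets Location grow child elements or attributes of its own later.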

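Entity references are markup that the XML parser replaces with a single character. XML predefines five entity references:

• &lt; for < (less-than sign)

• &gt; for > (greater-than sign)

• &amp; for & (ampersand)

• &quot; for " (double quote)

• &apos; for ' (apostrophe)

If an author needs to encode data that includes a less-than sign, the author needs to use the entity reference &lt; lest the XML parser interpret < as the start of a tag. Note the presence of the semicolon after each entity reference.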
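The substitution is easy to observe with a parser. This is a minimal sketch assuming Python's standard library (xml.etree.ElementTree and xml.sax.saxutils.escape); the Note element and its text are invented for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Data containing characters that would otherwise be read as markup.
source = "<Note>a &lt; b &amp;&amp; b &gt; c</Note>"
note = ET.fromstring(source)
print(note.text)  # the parser has replaced each entity reference

# A bare less-than sign in the data makes the document not well-formed.
try:
    ET.fromstring("<Note>a < b</Note>")
    raw_less_than_accepted = True
except ET.ParseError:
    raw_less_than_accepted = False

# When generating XML, escape() produces the entity references for <, > and &.
escaped = escape("a < b & c")
print(escaped)
```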

An XML author may want to include a block of text as is, without having the XML parser perform any translations. The data may contain numerous characters that would require entity references, and coding those references may become tiresome. XML has markup, called the CDATA section, that allows an author to include text as is, even text that looks like tags or comments. Just enclose the text between the CDATA delimiters <![CDATA[ and ]]>.
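As a quick illustration, the sketch below parses an element whose content sits in a CDATA section. It assumes Python's standard xml.etree.ElementTree module; the Snippet element is a made-up name.

```python
import xml.etree.ElementTree as ET

# Everything between <![CDATA[ and ]]> is taken literally, so the text may
# contain <, & and even something that looks like a tag or a comment.
source = (
    "<Snippet><![CDATA[<b>if a < b & c</b> <!-- not a comment -->]]></Snippet>"
)
snippet = ET.fromstring(source)

print(snippet.text)
child_count = len(list(snippet))  # the <b> inside CDATA is text, not a child
```

The parser reports no child elements: the apparent <b> tag and comment arrive as plain character data.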

Well-Formed XML Documents

XML requires that an author use the above described components in certain ways, or by following certain rules. These rules were designed to allow XML parsers to understand properly-constructed XML documents. In XML lingo, an XML document that follows the rules for proper construction is said to be well-formed. This section discusses these rules.

• The XML declaration, if present, must be the first statement in the document.

• Every XML tag that contains data must have a closing tag. Close empty tags with />.

• Include a root element.

• Put attribute values in quotes.

• Only use < to start element tags; use & to start entity references.

• Nest elements properly: every element other than the root must open and close inside the same enclosing element. For example, the following line is not well-formed XML:

<Outer> This is NOT <Inner> Well-Formed </Outer> XML </Inner>
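The rules above can be checked mechanically, because a conforming parser must reject any document that breaks them. The following is a rough sketch, assuming Python's standard xml.etree.ElementTree module; the function name is_well_formed and the sample documents are illustrative only.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text, False if it rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


samples = {
    "properly nested": "<Outer>This is <Inner>well-formed</Inner> XML</Outer>",
    "overlapping tags": "<Outer>This is NOT <Inner>well-formed</Outer> XML</Inner>",
    "missing end tag": "<Outer><Inner>text</Outer>",
    "unquoted attribute": "<Outer size=1>text</Outer>",
    "mismatched case": "<Atag>text</aTag>",
    "two root elements": "<A>one</A><B>two</B>",
    "empty tag": "<Outer><IMG/></Outer>",
    "declaration not first": ' <?xml version="1.0"?><Outer/>',
}

results = {name: is_well_formed(text) for name, text in samples.items()}
for name, ok in results.items():
    print(name, "->", "well-formed" if ok else "rejected")
```

Note that the parser does not try to repair any of the rejected samples; it simply reports an error, which is the behavior the XML specification calls for.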

Document Type Definitions

XML allows an author to create entirely new markup languages with tags that contain industry-specific language. The markup language creator can define the language with a Document Type Definition (DTD). In short, DTDs define a set of rules that govern the relationships among the tags contained in a document. A DTD may specify that every Employee element has at least one Hire_Date child element, or that every Employee element has one and only one Name child element.

When the tags in an XML document conform to the specifications in an associated DTD, we say that the XML document is valid. An XML document can be well-formed without being valid. Also, DTD keywords are case sensitive.

A Simple DTD

Let’s look at the simple XML document with a DTD:

<?xml version="1.0" standalone="yes"?>

<!DOCTYPE First [

<!ELEMENT First (#PCDATA)>

]>

<First>

My First XML Document

</First>

The three lines from <!DOCTYPE First [ through ]> are the Document Type Declaration, not a DTD. The declaration is delimited by <!DOCTYPE rootname [ and ]>; the DTD is what is sandwiched between these delimiters. The name of the XML document’s root element must follow <!DOCTYPE.

The DTD author may opt to store the DTD in a separate file as opposed to coding the DTD in the XML document. If so, the Document Type Declaration changes to:

<!DOCTYPE First SYSTEM "myDTD.dtd">

The entire DTD, <!ELEMENT First (#PCDATA)>, states that the element First must contain parsed character data only; that is, text with no markup such as child elements. What follows are other declarations a DTD author can code.

Element DTD Declarations

DTD element declarations begin with <!ELEMENT and end with >. The name of the element must follow the starting delimiter. The element's content specification (the child elements or data the element may contain) follows, usually in parentheses. Each element must have one and only one declaration. The order of declarations is not important; XML allows forward and backward references.

DTD Entities

DTDs support the inclusion of text from internal and external sources. In essence, the DTD author codes a general entity reference which, when processed, substitutes text in the XML document for the entity reference. The mechanism is identical to that described for the XML entity references. First, the DTD author codes the ENTITY tag:

<!ENTITY entityName "Replacement Text">

The XML document may contain the element:

<SomeTag> Where is that &entityName; going? </SomeTag>

During processing, the XML processor will substitute "Replacement Text" for "&entityName;." The semicolon must be the last character in the entity reference.
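The substitution can be observed directly. This is a small sketch, assuming Python's standard xml.etree.ElementTree module, which expands internal general entities declared in a document's own DTD; the document text is assembled from the two fragments shown above.

```python
import xml.etree.ElementTree as ET

# The ENTITY declaration sits in an internal DTD; the reference sits in the
# element content. SomeTag and entityName are the names used in the article.
DOCUMENT = """<?xml version="1.0" standalone="yes"?>
<!DOCTYPE SomeTag [
  <!ENTITY entityName "Replacement Text">
]>
<SomeTag> Where is that &entityName; going? </SomeTag>"""

element = ET.fromstring(DOCUMENT)
print(element.text)
```

By the time an application sees the element's text, the reference is gone and only the replacement text remains.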

External general entities allow the author to include text from other locations into a document. The author codes:

<!ENTITY entityName SYSTEM "someFile">

where "someFile" can be a URL, a file on the network or the local machine.

The internal and external general entities become part of the XML document, not the DTD. XML provides a mechanism, called a Parameter Entity Reference, to allow authors to substitute text in the DTD. The coding of a parameter entity reference is similar to that of a general entity reference, with two differences: parameter entity references start with a percent sign (not an ampersand), and parameter entity references cannot appear in the document content.

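Here is a DTD entry for a parameter entity, and some references to the entity. The replacement text must be something that can legally appear where the reference is coded, so here it is a list of child elements (Child2 and Child3 are placeholder names). XML permits references like these inside declarations only in an external DTD file.

<!ENTITY % entityNameList "Child2, Child3">

<!ELEMENT anElement1 (Child1, (%entityNameList;)) >

<!-- anElement2 through anElement999 follow with the same parameter entity reference -->

<!ELEMENT anElement1000 (ChildA, (%entityNameList;)) >

If the author needed to add or remove an element from the list, the author changes the parameter entity declaration instead of each <!ELEMENT declaration.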

Related XML Technologies

As previously mentioned, XML stresses the separation of content from presentation. This article cites the use of CSS for XML document display. However, a strong up-and-coming contender for XML display is XSL, the Extensible Style Language. XSL, an XML application, consists of a transformation language and a formatting language. The transformation language specifies how one XML document may be transformed into another; how the custom tags used in an XML document may be changed to tags of another, for example. The formatting language, similar to CSS, describes how an XML document is rendered.

XML developers can mix and match tags from multiple applications. However, there needs to be a mechanism in place that allows XML applications to distinguish elements and attributes of the same name. A related XML technology called Namespaces allows an XML developer to qualify custom tags with a prefix that is bound to a unique identifier, typically a URL.
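As an illustration of the idea, the sketch below mixes two Name elements that come from different vocabularies. It assumes Python's standard xml.etree.ElementTree module; the Report element, the hr and fin prefixes, and the example.com URLs are all invented for this example.

```python
import xml.etree.ElementTree as ET

# Two vocabularies both define a Name element; the prefixes keep them apart.
DOCUMENT = """<Report xmlns:hr="http://example.com/hr"
        xmlns:fin="http://example.com/finance">
  <hr:Name>John Q. Public</hr:Name>
  <fin:Name>Travel Budget</fin:Name>
</Report>"""

root = ET.fromstring(DOCUMENT)

# The parser resolves each prefix to the identifier it is bound to.
tags = [child.tag for child in root]
print(tags)

# Lookups name the identifier (via a prefix map), not just the local name.
namespaces = {
    "hr": "http://example.com/hr",
    "fin": "http://example.com/finance",
}
employee = root.findtext("hr:Name", namespaces=namespaces)
account = root.findtext("fin:Name", namespaces=namespaces)
print(employee, "|", account)
```

The prefix itself is only shorthand; what distinguishes the two Name elements is the identifier each prefix is bound to.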

With the proliferation of Web pages on the Internet, technology that facilitates searches is in order. The XML application RDF, the Resource Description Framework, encodes data about data, or metadata. Put another way, RDF provides a consistent way to describe metadata. The basic idea is that RDF standardizes vocabularies used to describe metadata.

About the Author:

Lou Marco is a writer, technical instructor and a consultant with over 20 years of computing experience. He has authored ISPF/REXX Development for Experienced Programmers, and is writing a book titled Developing Java Mainframe Applications for John Wiley & Sons. He can be reached at loumarco@hotmail.com.
