Introduction to XML
Published on September 17, 2003
What is XML?
If you’ve been producing sites for any length of time, you’ve probably heard about XML and wondered exactly why so many people are excited about it. XML stands for Extensible Markup Language. Let’s take a closer look at the parts of the acronym, and then we’ll show you how it all fits together.
Markup comes from the bad old days before word processors. If you needed a brochure, you’d type it on a typewriter, and then literally mark it up with a red pen to tell the typesetter what you wanted it to look like. The typesetter would follow your instructions and return a finished document to you:
How to Buy a Wrench
There are two kinds of wrenches: wrenches with fixed size, and adjustable wrenches.
In this instance, we’re using markup not only to show how text should be presented (italic rather than normal text), but also to tell how the document is structured: some of the words form a heading, the other words are just ordinary text.
The idea of using markup to impose structure on otherwise anonymous data is such a good one that people came up with a standardized way to create markups for general use. This method was called the Standard Generalized Markup Language, or SGML. SGML really isn’t a language in and of itself, but instead is more of a rulebook that tells you how to develop these markup languages. Any markup that follows the SGML rulebook is called an application of SGML.
The most widely known application of SGML is a language used to mark up text for delivery and presentation on the World Wide Web. That language is HTML, the HyperText Markup Language. In HTML, we can mark up the example above to send to a web browser instead of a typesetter:
<h3>How to Buy a Wrench</h3>
<p>There are two kinds of wrenches: wrenches with fixed size, and <i>adjustable</i> wrenches.</p>
There are many other applications of SGML, but they’re mostly found in large corporations and government agencies. That’s because the SGML rulebook is very complex, which makes it hard to learn. For example, SGML allows optional opening and closing tags. Quick: is </li> required or not? How about <body>? Additionally, it’s difficult (and expensive!) to develop tools that can manage data marked up according to those rules.
HTML Doesn’t Do It All
While HTML is a good thing, it doesn’t solve all our problems. Consider the following two tables. While the data is structured into rows and cells, there’s nothing to tell you (other than your intuition) that the first table gives maximum and minimum temperatures, while the second table gives current and maximum capacities for water reservoirs.
<table border="1">
  <tr> <td>Chicago</td> <td>13</td> <td>6</td> </tr>
  <tr> <td>Dallas</td> <td>60</td> <td>20</td> </tr>
</table>

<table border="1">
  <tr> <td>Calero</td> <td>5538</td> <td>10050</td> </tr>
  <tr> <td>Uvas</td> <td>6095</td> <td>9935</td> </tr>
</table>
XML Solves the Problems
To solve the complexity issue, XML was designed as a subset of SGML. It eliminates the features that make SGML difficult to learn and parse while retaining most of the power of SGML. Tools that analyze and display XML are easier to write, and are widespread and inexpensive. Since XML is a subset of SGML, it lets you devise any set of tags you wish, thus solving the problem of differentiating what would otherwise be anonymous numbers:
<temperatures>
  <city name="Chicago">
    <max>13</max><min>6</min>
  </city>
  <city name="Dallas">
    <max>60</max><min>20</min>
  </city>
</temperatures>

<water-banks>
  <reservoir name="Calero">
    <current>5538</current>
    <capacity>10050</capacity>
  </reservoir>
  <reservoir name="Uvas">
    <current>6095</current>
    <capacity>9935</capacity>
  </reservoir>
</water-banks>
With XML, you can devise tags for marking up all the data that appears on the weather page of the newspaper. With this custom markup, the purpose of each number in the table is unambiguous.
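To make that concrete, here is a minimal sketch of a program reading the temperature markup by tag name rather than by table position. It assumes Python's standard `xml.etree.ElementTree` module; the function name `read_temperatures` and the variable names are illustrative, not part of any XML tool.

```python
import xml.etree.ElementTree as ET

# The temperature markup from the example above, held in a string.
TEMPERATURES_XML = """
<temperatures>
  <city name="Chicago"><max>13</max><min>6</min></city>
  <city name="Dallas"><max>60</max><min>20</min></city>
</temperatures>
"""


def read_temperatures(xml_text):
    """Return {city name: {"max": int, "min": int}} from the markup."""
    root = ET.fromstring(xml_text)
    report = {}
    for city in root.findall("city"):
        # Each number is looked up by the name of its tag,
        # so there is no guessing about which column means what.
        report[city.get("name")] = {
            "max": int(city.findtext("max")),
            "min": int(city.findtext("min")),
        }
    return report


temperatures = read_temperatures(TEMPERATURES_XML)
print(temperatures["Dallas"]["max"])
```

The same lookup against the HTML table would have to rely on the order of the cells, which is exactly the guesswork the custom tags remove.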
People have developed custom XML markup for such diverse content areas as chemical formulas, descriptions of real estate, news stories, and even cooking recipes. This shows the extensible part of XML, making it a very flexible, customizable markup language.
XML and the Web
These custom tags are all well and good, but your browser, which is designed to interpret HTML tags (see note 3 below), doesn’t understand them.
If you’re using the very latest browsers, you can use Cascading Style Sheets to tell a browser how to display your tags. For example, you could present <min> temperature tags in blue and <max> tags in red. That’s a client-side solution.
For example, an HTML-formatted weather report is good for one purpose: web display. As we saw earlier, it’s difficult to figure out what the numbers mean in a mass of HTML. However, if we have the XML-formatted weather report, we may then use freely available XML tools to convert that one document to:
- a plain text file suitable for sending in email
- an XHTML file suitable for display on a desktop computer’s browser
- a WML (Wireless Markup Language) document suitable for display on a PDA
- an Adobe PDF document suitable for hard copy
- a VoiceXML (Voice Extensible Markup Language) dialog for a voicemail information system
- an SVG (Scalable Vector Graphics) document that draws pictures of thermometers and water containers
This, then, is why everyone is excited about XML. By carefully constructing a markup that shows your data’s structure, you create your content once and use XML tools to pour that content into a variety of other molds.
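As a rough illustration of pouring one document into two molds, the sketch below turns the temperature markup into a plain text report and into an XHTML table. It again assumes Python's standard `xml.etree.ElementTree` module; the function names, the wording of the text report, and the column headings are all made up for this example, and a real project might well use a dedicated transformation tool instead.

```python
import xml.etree.ElementTree as ET

# One source document, written once.
WEATHER_XML = """<temperatures>
  <city name="Chicago"><max>13</max><min>6</min></city>
  <city name="Dallas"><max>60</max><min>20</min></city>
</temperatures>"""


def to_plain_text(root):
    """Build a plain text report, one line per city."""
    lines = []
    for city in root.findall("city"):
        lines.append("{}: high {}, low {}".format(
            city.get("name"), city.findtext("max"), city.findtext("min")))
    return "\n".join(lines)


def to_xhtml_table(root):
    """Build an XHTML table with a labeled header row."""
    table = ET.Element("table")
    header = ET.SubElement(table, "tr")
    for label in ("City", "Max", "Min"):
        ET.SubElement(header, "th").text = label
    for city in root.findall("city"):
        row = ET.SubElement(table, "tr")
        for value in (city.get("name"),
                      city.findtext("max"),
                      city.findtext("min")):
            ET.SubElement(row, "td").text = value
    return ET.tostring(table, encoding="unicode")


root = ET.fromstring(WEATHER_XML)
text_report = to_plain_text(root)
xhtml_report = to_xhtml_table(root)
print(text_report)
print(xhtml_report)
```

Both outputs come from the same parsed document, so a correction to the source data shows up in every format the next time the conversions run.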
Where Do I Go From Here?
If you’re centered in the web design area, and don’t have to work with custom tags, you may want to start by producing your new web pages in XHTML (HTML written according to the XML rulebook). This won’t give you the abstraction of custom tags, but it will make your documents available to be manipulated by XML tools.
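A small sketch of what "available to be manipulated by XML tools" can mean in practice: an XML parser accepts a fragment written to the XML rulebook, and rejects the same list written with the optional closing tags that SGML-style HTML permits. This assumes Python's standard `xml.etree.ElementTree` module; the two fragments and the helper `is_well_formed` are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Written according to the XML rulebook: every <li> is closed.
XHTML_FRAGMENT = ("<ul><li>fixed-size wrenches</li>"
                  "<li>adjustable wrenches</li></ul>")

# The same list with the closing </li> tags left out.
HTML_FRAGMENT = "<ul><li>fixed-size wrenches<li>adjustable wrenches</ul>"


def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, else False."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


xhtml_ok = is_well_formed(XHTML_FRAGMENT)
html_ok = is_well_formed(HTML_FRAGMENT)
print(xhtml_ok, html_ok)

# Once the fragment parses, ordinary XML tools can work with it.
items = [li.text for li in ET.fromstring(XHTML_FRAGMENT).findall("li")]
print(items)
```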
If you need to produce your own markup, you’d be well advised to read XML for the World Wide Web by Elizabeth Castro (Peachpit), Learning XML by Erik T. Ray (O’Reilly) or XML, HTML, XHTML Magic by Molly E. Holzschlag (New Riders).
1 Technically, SGML and XML are meta-languages: languages used to describe other languages. However, it’s easier to think of SGML and XML as “rulebooks.”
2 Technically, this isn’t what the X in XML stands for. The XML specification itself doesn’t say what extensible means, and we’ve yet to find it in any of the books on the subject.
3 You may be wondering where XHTML fits into all of this. XHTML is simply the HTML that you know and love, written according to the XML rulebook. Properly-written XHTML will render properly in any browser.
J. David Eisenberg is a programmer and instructor in San Jose, California, where he lives with his kittens, Marco and Zoë.