Writing Semantic Markup

In: Columns > Web 2.0 Design: Bootstrapping the Social Web

By Joshua Porter

By Richard MacManus

Published on September 5, 2005

The biggest and most welcome change on the Web in the last five years has been the astronomical growth of Web feeds: XML files containing a snapshot of a Web site’s newest content, which save readers a tremendous amount of time. In 2000, there were only a handful of feeds. In 2005, there are millions.

In Web 2.0, the Web as platform, Web feeds are a simple way to share and receive content. If we think of a Web URL as a really simple interface for requesting information—a simple application programming interface (API)—then a Web feed is one of the simplest responses we can receive. We type in a feed URL and receive the content that is there—usually the last 10 items added to a Web site.

RSS, or Really Simple Syndication, is one type of Web feed format. Its adoption by blog software vendors and major media outlets has created an amazing efficiency in the way that we use the Web. Instead of having to browse to our favorite sites over and over again to see if something is new, we can simply subscribe to an RSS feed with any feed aggregator. There are Web and desktop aggregators that periodically poll the sites you’re subscribed to and notify you if something is new.
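To make the polling idea concrete, here is a minimal sketch of the check an aggregator might run on each poll. The feed content, the `find_new_items` name, and the choice to track items by their link are all our own assumptions for illustration; a real aggregator would download the feed from its URL and handle many more cases.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content; a real aggregator would fetch this from a feed URL.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Site</title>
    <link>http://www.example.com/</link>
    <description>An example feed</description>
    <item>
      <title>Second post</title>
      <link>http://www.example.com/2</link>
    </item>
    <item>
      <title>First post</title>
      <link>http://www.example.com/1</link>
    </item>
  </channel>
</rss>"""


def find_new_items(feed_xml, seen_links):
    """Return (title, link) pairs for items whose link is not in seen_links."""
    root = ET.fromstring(feed_xml)
    new_items = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link not in seen_links:
            new_items.append((item.findtext("title"), link))
    return new_items


# Pretend the first post was already seen on an earlier poll.
seen = {"http://www.example.com/1"}
fresh = find_new_items(SAMPLE_FEED, seen)
print(fresh)
```

Only the item we have not seen before is reported, which is the whole time-saving trick: the machine does the revisiting for us.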

Semantic Markup From the Ground Up

The utility of RSS results from a characteristic of its markup: it is semantic. Defining “semantic” is a fool’s errand—it means “having to do with meaning,” so any theoretical discussion on the matter is difficult.

Practically speaking, however, semantic markup is markup that is descriptive enough to allow us and the machines we program to recognize it and make decisions about it. In other words, markup means something when we can identify it and do useful things with it. In this way, semantic markup becomes more than merely descriptive. It becomes a brilliant mechanism that allows both humans and machines to “understand” the same information.

Let’s take a closer look. Consider the following text:

Web 2.0 Design: Bootstrapping the Social Web
By Richard MacManus & Joshua Porter

Humans can instantly recognize this as a title and authors of a work, in this case a column here at Digital Web Magazine. We know this because of past experience. We’ve seen similar things before. It is apparent that the first line is a title and the second line is two authors. Given this information, humans are able to act on it in a meaningful way. For instance, you could answer someone if they asked you “Who wrote that?”

Machines, with their rigid information processing capabilities, need everything spelled out for them. To be able to do something useful with this title and byline, a machine would need to be able to parse it correctly. It would need to know that the number (2.0) in the first line is part of the title and shouldn’t be interpreted as a numeric value, that the spaces around it separate words from each other, and that the second line is made up of two names and not one. In other words, a machine would need to be able to do algorithmically what we humans do almost without thinking.

This would work amazingly well, and is very possible even today, except that the syntax of titles and bylines changes from person to person and from usage to usage. What if I changed my first name to just be the initial “J”? Or misspelled it? Humans would still understand the endless permutations. Machines, though, unless programmed for every single possible permutation, cannot reliably make the same decisions that we can. The human ability to adapt and interpret is special.
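A small sketch shows how brittle this kind of guessing is. The pattern below is a deliberately naive one of our own invention (it is not taken from any real parser): it expects exactly "By First Last & First Last" and nothing else.

```python
import re

# A deliberately naive pattern (our own assumption): "By First Last & First Last".
BYLINE = re.compile(r"^By (\w+) (\w+) & (\w+) (\w+)$")


def parse_byline(line):
    """Return a list of (first, last) pairs, or None if the line does not fit."""
    match = BYLINE.match(line)
    if match is None:
        return None
    first1, last1, first2, last2 = match.groups()
    return [(first1, last1), (first2, last2)]


exact = parse_byline("By Richard MacManus & Joshua Porter")
with_and = parse_byline("By Richard MacManus and Joshua Porter")
with_initial = parse_byline("By Richard MacManus & J. Porter")
print(exact, with_and, with_initial)
```

The first byline parses, but swapping "&" for "and" or shortening a first name to "J." makes the pattern give up, even though a human reader would not notice the difference.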

An even tougher problem occurs when a machine doesn’t know where to start finding a byline. What if, for example, this byline was embedded somewhere in a Web page? How would a machine know where the byline started and where it ended? And without knowing where it begins and ends, how would it distinguish this byline from any other byline in the page? And if you need to look in multiple files across the Web, well, you can forget reliability, because even if one Web page used a consistent format, other Web pages might have their own formats.

Semantic markup addresses this challenge by providing a framework for describing things explicitly. This is a snippet of what the same information might look like in RSS 2.0 syntax (see note 1):

    <item>
      <title>Web 2.0 Design: Bootstrapping the Social Web</title>
      <dc:creator>MacManus, Richard</dc:creator>
      <dc:creator>Porter, Joshua</dc:creator>
    </item>

The Benefits of Semantic Markup

A computer equipped with an XML parser can make decisions about this information. It can identify which part is the title and which parts are the authors because the markup accurately describes its contents based on the RSS 2.0 format. (Telling our first names from our last names still depends on the “Last, First” convention used inside each element, not on the markup itself.) In the previous example, there were no clues about which part was which. The computer had to guess what it was looking at. When each part is self-describing as part of a valid RSS 2.0 document, computers can understand. That is, they can “understand” enough to provide us with the answer about who writes the column.
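As a rough sketch of what that looks like in practice, the following uses the XML parser in Python’s standard library to pull the title and authors out of the item above. We have added the Dublin Core namespace declaration to the snippet so that it parses on its own, and splitting each name on the comma is our own assumption about the “Last, First” convention, not something the markup guarantees.

```python
import xml.etree.ElementTree as ET

# The Dublin Core namespace that the dc: prefix refers to.
DC = "http://purl.org/dc/elements/1.1/"

ITEM = """<item xmlns:dc="http://purl.org/dc/elements/1.1/">
  <title>Web 2.0 Design: Bootstrapping the Social Web</title>
  <dc:creator>MacManus, Richard</dc:creator>
  <dc:creator>Porter, Joshua</dc:creator>
</item>"""

item = ET.fromstring(ITEM)
title = item.findtext("title")

authors = []
for creator in item.findall("{%s}creator" % DC):
    # Assumption: names are written "Last, First", as in the example.
    last, first = [part.strip() for part in creator.text.split(",", 1)]
    authors.append({"first": first, "last": last})

print(title)
print(authors)
```

No guessing about where the byline starts or ends is needed: the element names tell the parser exactly where to look.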

Once computers can understand markup, we can program them to do a lot of hard work for us. The hard work that RSS does allows us to keep up with many more Web sites in much less time. But it also allows us to share information that we weren’t sharing before, allowing others to remix our content in new, useful ways. The real and future gains like the ones we have with RSS are the shiny apple of Web 2.0. When developers offer their content wrapped in semantic markup, it becomes everyone’s content, because everyone (and their computers) can understand it. Time is saved across the universe.

XHTML as a Starting Point

Designers familiar with XHTML will see the RSS code above and recognize it as only a variation of what they are used to. It still has tags surrounding content nested within other tags. It is clearly of the same lineage: that of a markup language.

The major difference, however, is that instead of having a small, highly descriptive set of tags and a well-defined role like RSS does, the XHTML tag set is large, ambiguous, and used for many purposes it wasn’t designed for. For example, to mark up the title and authors in our example above using XHTML we would need to rely on tags such as <h1> and <p>. These tags (“heading” and “paragraph”) don’t describe the content nearly as well as <title> and <author> tags would (see note 2).

Also, developers currently use XHTML tags in countless ways, many of which aren’t descriptive. For example, some developers use <h1> as a page title, limiting themselves to just one <h1> per page. Others use it as a page title but don’t limit its use to one per page. Still others use <h1> as a paragraph header and never as a page title. And there are some people who simply use it to make their text big. This inconsistency greatly diminishes the descriptive power of the tag. Unfortunately, most XHTML tags suffer the same fate.

The dilution of description in XHTML isn’t necessarily a problem with XHTML itself, but rather a problem stemming from our evolved usage of it. For example, in addition to static Web pages, we now build many different kinds of applications: e-commerce apps like eBay, email apps like Gmail, search engines like A9.com, collaborative apps like Flickr, bookmarking apps like Del.icio.us, and a myriad of other apps that couldn’t have been envisioned by the most prescient HTML working group. Over time, our usage of XHTML has drained it of semantics.

Enhancing XHTML Using Class Attributes

Several approaches have been put forward to bolster the descriptive power of the XHTML tag set without inventing a new format like RSS. The theory behind these approaches is to use the existing XHTML format so that we don’t need any special software to view it. Two interesting examples are structured blogging and microformats. The goal of these approaches is to provide an agreed-upon semantic framework built on top of XHTML. One way they do this is by adding descriptive values to the class attribute. To mark up our article title and byline above as a microformat we might use something like this (see note 3):

    <div class="column" id="column-Web20-design">
      <span class="title">Web 2.0 Design: Bootstrapping the Social Web</span>
      <span class="author">Porter, Joshua</span>
      <span class="author">MacManus, Richard</span>
    </div>

Though not as clear as our RSS example above, this code snippet contains similar descriptive information. One benefit of this approach is that it is written in XHTML, so developers won’t have to learn a new XML format. One difficulty with this approach, however, is that by design, the <span> element doesn’t convey meaning itself. All meaning must come from class names, which are not unique like RSS elements are. For example, the code of someone using the class="column" in their own, personal way could be confused with someone using it to identify a column and authors like we do in our example.
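Here is a minimal sketch of how a program might read the class-based markup above. It assumes the page is well-formed XHTML, so an ordinary XML parser can read it, and the `texts_with_class` helper is a hypothetical name of our own. Notice that the program keys entirely on class names, which is exactly where the ambiguity described above creeps in.

```python
import xml.etree.ElementTree as ET

XHTML = """<div class="column" id="column-Web20-design">
  <span class="title">Web 2.0 Design: Bootstrapping the Social Web</span>
  <span class="author">Porter, Joshua</span>
  <span class="author">MacManus, Richard</span>
</div>"""


def texts_with_class(root, class_name):
    """Collect the text of every element whose class attribute lists class_name."""
    return [
        element.text
        for element in root.iter()
        if class_name in element.get("class", "").split()
    ]


root = ET.fromstring(XHTML)
column_title = texts_with_class(root, "title")
column_authors = texts_with_class(root, "author")
print(column_title, column_authors)
```

If another site used class="author" for something unrelated, this code would happily collect that text too, because nothing but convention ties the class name to its meaning.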

Embedding XML in XHTML

In addition to utilizing class attributes, like microformats do, structured blogging adds another layer above the XHTML one. That layer comes in the form of a copy of the data in XML format, embedded using the <script> tag, similar to the following (see note 4):

    <script type="application/x-column; charset=utf-8">
      <column alternate-for-id="column-Web20-design">
        <title>Web 2.0 Design: Bootstrapping the Social Web</title>
        <author>Porter, Joshua</author>
        <author>MacManus, Richard</author>
      </column>
    </script>

Embedding XML allows for richer data description than using just XHTML because developers can define certain elements for whatever application they’re creating. For example, developers could require that all applications verify any numerical data before they do anything with it, an important security concern in some cases. A drawback of this approach is that until all browsers can understand the embedded XML (which is simply ignored by older browsers), this method requires delivering two copies of the same data for each request.
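The sketch below shows one way a program could pick the embedded copy out of a page. It is built on the same hypothetical markup as our example, assumes the page is well-formed XHTML read with an XML parser (so the embedded elements appear as children of the script element), and uses an `embedded_columns` function name of our own invention. It is not a representation of any actual structured blogging tool.

```python
import xml.etree.ElementTree as ET

# A hypothetical page fragment carrying both copies of the data.
PAGE = """<div>
  <div class="column" id="column-Web20-design">
    <span class="title">Web 2.0 Design: Bootstrapping the Social Web</span>
  </div>
  <script type="application/x-column; charset=utf-8">
    <column alternate-for-id="column-Web20-design">
      <title>Web 2.0 Design: Bootstrapping the Social Web</title>
      <author>Porter, Joshua</author>
      <author>MacManus, Richard</author>
    </column>
  </script>
</div>"""


def embedded_columns(page_xml):
    """Return one record per <column> embedded in a matching <script> element."""
    root = ET.fromstring(page_xml)
    records = []
    for script in root.iter("script"):
        # Skip ordinary scripts; only our made-up media type is of interest.
        if not script.get("type", "").startswith("application/x-column"):
            continue
        for column in script.findall("column"):
            records.append({
                "for_id": column.get("alternate-for-id"),
                "title": column.findtext("title"),
                "authors": [author.text for author in column.findall("author")],
            })
    return records


records = embedded_columns(PAGE)
print(records)
```

The alternate-for-id value lets the program tie the embedded record back to the visible XHTML element it duplicates.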

The two preceding code examples are hypothetical because structured blogging and microformats are works in progress. Each has defined several description-rich mini-formats, but they have not seen widespread adoption yet, despite being promoted heavily by their creators. (Technorati promotes microformats and PubSub promotes structured blogging.) Like any format, until developers start supporting them, our machines can’t help answer any questions for us.

Providing New XML Formats

With the overwhelming success of RSS becoming more clear every day, one might ask why we don’t simply define a set of new, semantic XML formats to replace XHTML entirely. There are two huge barriers to doing so: developer adoption and usefulness.

Developer adoption comes in many forms. It comes in the form of Web developers writing their code according to a particular format. It comes in the form of browser makers supporting those formats. It comes in the form of independent application developers creating new applications on top of those formats. No matter what form adoption comes in, it means change.

Usefulness is the other big obstacle to providing new XML formats. It can be extremely difficult to tell if a format is worthwhile—that is, whether the benefits of a format outweigh the drawbacks. RSS is a pretty clear case. The benefits of being able to track many Web sites outweigh the challenge of learning and using the new RSS tools that allow it. Efficiency of this sort is not always apparent when we consider a new XML format.

Despite these difficulties, several new XML formats are gaining adoption. One example is Google Sitemaps. Google Sitemaps is markup for describing the contents of a Web site, to tell search engines about URLs that can be crawled on specific sites. The example code they provide looks like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
      <url>
        <loc>http://www.example.com/</loc>
        <lastmod>2005-01-01</lastmod>
        <changefreq>monthly</changefreq>
        <priority>0.8</priority>
      </url>
    </urlset>

This code is almost self-explanatory. By looking at both the tags and their content, it is easy to see what the tags mean (though it’s not entirely clear what a search engine uses them for). Google suggests placing sitemap files at the root of a Web server, using a standard filename, like we do with robots.txt. Providing clear conventions like this, in general, helps spur adoption of new formats.
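Because the format is this regular, reading it takes very little code. The following is a minimal sketch, not Google’s own tooling: the `read_sitemap` name and the `sm` prefix are our own choices, and the only thing taken from the example is the namespace and the element names.

```python
import xml.etree.ElementTree as ET

# Map a prefix of our choosing to the namespace declared in the example.
NS = {"sm": "http://www.google.com/schemas/sitemap/0.84"}

SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>"""


def read_sitemap(sitemap_xml):
    """Return one dictionary per <url> entry in the sitemap."""
    root = ET.fromstring(sitemap_xml)
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "changefreq": url.findtext("sm:changefreq", namespaces=NS),
            # This sketch assumes every entry carries a priority, as the example does.
            "priority": float(url.findtext("sm:priority", namespaces=NS)),
        })
    return entries


entries = read_sitemap(SITEMAP)
print(entries)
```

The namespace does the same job here that the dc prefix does in our RSS example: it makes sure that this loc element cannot be confused with an element of the same name from some other format.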

Markup that Everyone Can Understand

The success of semantic markup (and RSS in particular) has taught us that we need markup that everyone can understand. It does us no good to have overly technical specifications that only hardcore developers can use, in the same way that it does us no good to have simplistic solutions written for non-techies that don’t help us do things more efficiently. Semantic markup, like most things, is a fine balance.

To reach the lofty goal of having machines understand information the way we do, we will need semantic markup. Though it is not clear what formats will be favored and what formats will be forgotten in Web 2.0, we will certainly appreciate it when computers can start answering questions for us that we previously had to find out on our own. Perhaps at that point we’ll be able to focus on even more important questions.

So, by now you know who writes the Web 2.0 Design column. Does your computer?

Special Thanks

Many thanks to Ethan Marcotte of sidesh0w.com and Vertua Studios for his helpful comments during the writing process.


1 This example is somewhat complicated by having two authors, not supported by RSS 2.0 except via the use of a namespace. Thus the dc prefix in the <dc:creator> element refers to the Dublin Core namespace. Because this was viewed by some as a weakness of RSS 2.0, the newly released Atom 1.0 format addresses this and handles multiple authors without the need for namespaces. Namespaces, in general, are the way to extend XML to support additional elements or distinguish between similar ones.

2 Note that this refers to a list of items within a document, not individual documents. XHTML provides mechanisms for metadata at the document level via the title and meta elements (children of the head element). The type of listing in the example is common in such genres as blogs and news publications.

3 This is not an exact representation of what the microformat markup would be, as in practice several options are debated before a final format is agreed upon.

4 Again, we do not claim this as an exact representation of what the structured blogging markup would be.

Joshua Porter lives in Newburyport, MA, USA, and is the director of web development at User Interface Engineering. He writes about web design and usability on his blog, Bokardo.

Richard MacManus is a Freelance Web Analyst/Writer from Wellington, New Zealand. His personal Web site is Read/Write Web.