Smarter Content Publishing
Building a semantic website to increase the efficiency and usability of publishing systems
By Victor Lombardi
Do you remember the days before WYSIWYG word processors, when you had to mark up the text, much like we mark up web documents using HTML? I don’t. I started word processing using WYSIWYG applications on the Macintosh and only heard about markup in 1993 when I first saw a website. After all those years of refining the word processor to become a more efficient tool, I had to wonder why we reverted to manually creating markup for the web.
One reason is that the web is not print. The need to create hypertext within pages requires more control over documents. Another is that web pages can also be applications, so access to the “guts” of the page is needed to insert programming code. And ideally, we would like to separate presentation from content, enabling us to format the content in different ways for different purposes. How can we achieve all this with the efficiency and usability we’ve come to expect of other publishing tools?
Content Management Systems
While WYSIWYG web page editors are now available, more of our needs can be met using an entire content management system (CMS). A CMS offers us:
- WYSIWYG editing or preview
- The ability to create content components that can be used repeatedly
- A centralized place to edit content
- Separation of presentation and content by using presentation templates
- A user interface customized for the task of content management and publishing
While the CMS was once the domain of large companies with big budgets and talented programmers, low-cost or free CMS programs such as Movable Type are now powering thousands of simple, personal sites. There is a clear trend toward easier publishing and less costly software becoming widely available. The next step in this trend is to make the CMS more efficient.
Several steps in the traditional content management process require manual work that could be made more efficient. One area is telling the system how to assemble content components into pages. Publishing content onto a page usually requires several steps:
- Input the content itself, along with the corresponding metadata
- Specify how that content should be displayed in relation to similar content, such as its placement in a list and that list’s order
- Specify how that group of content should be displayed on a page
If we can provide pre-determined rules for the system to assemble the content components, we could simply enter the content and let the system do the rest. For example, a company that makes products could create a rule that states, “Whenever a new product description page is published, create links in the sidebar to related white papers, support documents, and local retailers.” On the page for a particular white paper, a corresponding rule could state, “If a new product description is published and is relevant to this white paper, create a link to it in the sidebar.” Rules could be created hierarchically, so that sub-rules could determine what links appear in the sidebar and sub-sub-rules determine how each link in the sidebar is displayed.
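The product and white paper rules quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any real CMS: the `Content` class, the `SIDEBAR_RULES` table, and the idea that two items are "relevant" to each other when they share a topic are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Content:
    title: str
    info_type: str                      # e.g. "product", "white_paper"
    topics: set = field(default_factory=set)


# Hypothetical rule table: page type -> information types to link in its sidebar.
SIDEBAR_RULES = {
    "product": ["white_paper", "support_document", "retailer"],
    "white_paper": ["product"],
}


def is_relevant(page, other):
    """Assumed relevance test: two different items are related if they share a topic."""
    return page is not other and bool(page.topics & other.topics)


def sidebar_links(page, site):
    """Apply the rule for the page's type and return the titles to link, in rule order."""
    links = []
    for wanted_type in SIDEBAR_RULES.get(page.info_type, []):
        for item in site:
            if item.info_type == wanted_type and is_relevant(page, item):
                links.append(item.title)
    return links


site = [
    Content("Widget 3000", "product", {"widgets"}),
    Content("Why Widgets Matter", "white_paper", {"widgets"}),
    Content("Gadget Pro FAQ", "support_document", {"gadgets"}),
    Content("Acme Hardware", "retailer", {"widgets"}),
]

print(sidebar_links(site[0], site))  # sidebar for the product page
print(sidebar_links(site[1], site))  # sidebar for the white paper page
```

Because the sidebar is computed from the rules each time, publishing a new item only means adding it to `site`; nobody has to edit the product page by hand to link to it.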
To create a CMS that follows rules like these, the system would have to know 1) the information type of a piece of content – a white paper, product description, etc. – as well as 2) how certain information types relate to each other. Once we specify this information, the system, by following rules, is smarter, and ultimately reduces work for the people using it. There are a few major benefits to this kind of system over a conventional CMS:
- Pages are automatically updated with links to new, relevant content
- Users can search the site using an interface that allows them to create their own rules, with more precise results than free-text search
- People without web design and development skills can publish websites using concepts they understand instead of having to learn how to create web pages
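To make the search benefit in the list above concrete, here is a rough sketch comparing a search built from a user's own rule ("support documents related to the Widget 3000") with a plain free-text match. The `Item` class, the `related_to` field, and the sample catalog are assumptions invented for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    title: str
    info_type: str
    related_to: tuple = ()   # titles of items this one is related to (hypothetical field)


CATALOG = [
    Item("Widget 3000", "product"),
    Item("Installing the Widget 3000", "support_document", ("Widget 3000",)),
    Item("Why Widgets Matter", "white_paper", ("Widget 3000",)),
    Item("Gadget Pro FAQ", "support_document", ("Gadget Pro",)),
]


def search(catalog, info_type=None, related_to=None):
    """Rule-style search: every criterion the user supplies must match."""
    results = []
    for item in catalog:
        if info_type is not None and item.info_type != info_type:
            continue
        if related_to is not None and related_to not in item.related_to:
            continue
        results.append(item.title)
    return results


def free_text(catalog, word):
    """Naive free-text search: case-insensitive substring match on the title."""
    return [item.title for item in catalog if word.lower() in item.title.lower()]


print(search(CATALOG, info_type="support_document", related_to="Widget 3000"))
print(free_text(CATALOG, "widget"))
```

With this sample data the rule-style search returns the single installation guide, while the free-text search also returns the product page and the white paper, because it can only match words and knows nothing about what each item is.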
More Metadata
In order for the publishing system to follow rules, it has to know something about each piece of content and how it relates to all the other content. As with a conventional CMS, we tag the content elements with metadata that describes the element type, like headline, body text, and publication date. We also need to tag the content with metadata that describes what the content is, like product description, white paper, support document, or retailer. In addition, the system must know how these components relate to each other. This model can be represented using a ball-and-stick chart, where each oval represents a component and each arrow represents a relationship.
Part of this exercise is to create a more usable user interface for content managers who are unfamiliar with web design. Such an interface would allow them to enter metadata about how the information is used in the organization instead of how the content should be displayed on the site. We will first need to model the organization in the way its members understand it.
With a model like this, we begin to describe not just content but concepts: ideas about what is in an organization and how those concepts relate to each other. The goal behind this kind of model is to represent our understanding of the organization and to document it in a way that will eventually become readable by the CMS. You can imagine how many types of information and relationships exist on an actual company website; our model in a real-world situation would be significantly larger.
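One simple way to write such a model down is as a set of (concept, relationship, concept) statements, one per arrow in the chart. The sketch below assumes this representation; the concept names come from the examples in this article, but the relationship names (`is_explained_by` and so on) are hypothetical labels chosen for illustration.

```python
# Each entry is one arrow in the chart: (subject concept, relationship, object concept).
# The relationship names are invented for this example.
MODEL = {
    ("product", "is_explained_by", "white_paper"),
    ("product", "is_supported_by", "support_document"),
    ("product", "is_sold_by", "retailer"),
}


def is_allowed(subject, relationship, obj):
    """True if the model contains this exact arrow."""
    return (subject, relationship, obj) in MODEL


def related_concepts(concept):
    """Concepts connected to the given one by an arrow in either direction."""
    found = set()
    for subject, _relationship, obj in MODEL:
        if subject == concept:
            found.add(obj)
        elif obj == concept:
            found.add(subject)
    return found


print(sorted(related_concepts("product")))
print(is_allowed("product", "is_sold_by", "retailer"))
```

A publishing rule can then ask the model which concepts are connected to a product instead of having that list hard-coded into a page template.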
This method allows many more possibilities. For example, to add personalization to a website you could add different types of customers to this model, describing where they live, what kind of products they buy, what sort of technical support they need, and so on.
Creating the model is the conceptual, logical part of this work. Eventually a program or a person has to convert the model into a physical format that the CMS can read. There are a number of relatively new XML formats designed for this sort of information model, including Resource Description Framework (RDF) and Topic Maps. But it’s also possible to create a relational database model that functions similarly.
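As a minimal sketch of the relational database option, the following uses Python's built-in `sqlite3` module with two tables: one for content items and their information types, and one for the relationships between them. The table names, column names, relationship labels, and sample rows are all assumptions made for the example, not a recommended schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for the example

# Hypothetical schema: content items plus typed links between them.
conn.executescript("""
CREATE TABLE content (
    id        INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    info_type TEXT NOT NULL
);
CREATE TABLE relationship (
    subject_id INTEGER NOT NULL REFERENCES content(id),
    predicate  TEXT NOT NULL,
    object_id  INTEGER NOT NULL REFERENCES content(id)
);
""")

conn.executemany(
    "INSERT INTO content (id, title, info_type) VALUES (?, ?, ?)",
    [
        (1, "Widget 3000", "product"),
        (2, "Why Widgets Matter", "white_paper"),
        (3, "Installing the Widget 3000", "support_document"),
    ],
)
conn.executemany(
    "INSERT INTO relationship (subject_id, predicate, object_id) VALUES (?, ?, ?)",
    [
        (1, "is_explained_by", 2),
        (1, "is_supported_by", 3),
    ],
)


def related(conn, title, predicate):
    """Titles of items linked from the named item by the given relationship."""
    rows = conn.execute(
        """
        SELECT o.title
        FROM relationship AS r
        JOIN content AS s ON s.id = r.subject_id
        JOIN content AS o ON o.id = r.object_id
        WHERE s.title = ? AND r.predicate = ?
        ORDER BY o.title
        """,
        (title, predicate),
    ).fetchall()
    return [row[0] for row in rows]


print(related(conn, "Widget 3000", "is_supported_by"))
```

The `relationship` table plays the same role as a subject, predicate, object statement in RDF, which is why a relational model can function similarly for a single site.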
The remaining work is similar to conventional content management systems: creating templates for presentation, programming logic that will populate the templates, migrating content into the system, and so on.
The O-Word
The model described above – the concepts, relationships, plus some additional information – is called an ontology. A true ontology has more features than described here. A fine tutorial on creating ontologies is Ontology Development 101: A Guide to Creating Your First Ontology.
If you have an information architecture or library and information science background, then much of the work of creating an ontology may be familiar to you. Ontologies provide vocabulary control just as taxonomies and thesauri do, but there are important differences. In general, ontologies aim to represent knowledge rather than describe content. This is why they require such rich descriptions of relationships among terms, rather than relationships that are merely equivalent, hierarchical, and associative. There are additional features that allow them to be processed effectively by computers. To learn more about the differences, see Ontologies Come of Age.
If you start to learn about ontologies you can soon find yourself mixed up in the fields of philosophy and artificial intelligence. You can safely ignore these as they aren’t relevant to using ontologies for powering websites. Delving into the philosophical discipline of Ontology (with a capital “O”) will show you where ontologies came from but not how to use them. The field of artificial intelligence is studying how to use ontologies but with a complexity far beyond what is required for websites. The field of computer science, particularly the W3C’s Web Ontology Group, is working to take academic theory and put it to practical use.
Challenges
Creating an ontology of an organization can be difficult for a number of reasons. You must not only understand the content but also how the organization operates. In large organizations, there may be no one person or document that describes all the needed information. Discovering the tacit operations knowledge may be more like investigative work than information architecture. And once discovered, reaching consensus on the concepts and relationships used in the ontology can require considerable political and diplomatic skill. Since you are attempting to document the very nature of the organization, everyone will have a reason to want to advance his or her perspective of what the organization is.
While the goal of this exercise is a more efficient, usable publishing system, you may have noticed there is more up-front work involved than with a traditional content management system. This approach does require more work organizing information at the beginning in return for simpler publishing over the course of the system’s lifetime. This kind of system may be right in cases where there is such a large volume of content that content management done the traditional way is too onerous or even impossible. Or this system could be created to bring about a more usable interface for those managing content, allowing them to organize the content according to how they understand it and letting the system use rules and templates to build the site correctly. This approach is not simple enough for all sites yet, but if the CMS trend towards greater functionality at lower cost continues, we could all be publishing this way in a few years.
The good news is that once the ontology is created, you can use it to power other kinds of information systems within the organization. Using it for content management and publishing is only one example.
The Semantic Website
In this process of modeling information so the computer “understands” more about content, the meaning of information is referred to as semantics. If you’re familiar with the idea of the Semantic Web, some of these ideas might already be familiar to you. But while it may take several years and a great deal of work before we realize the benefits of a network of semantic websites, these concepts are being applied to individual sites today. After all, a collection of web pages on a website is simply a subset of the larger web; the difference being you have control over your own site. The same advanced ideas planned for the larger web can be leveraged today on the individual site level.
Resources
Metadata Principles and Practicalities
Ontology Development 101: A Guide to Creating Your First Ontology
Formal Ontology and Information Systems (PDF file)