Using controlled vocabularies to improve findability
Many moons ago I waited tables. One day our manager came down to tell us that from now on we were to refer to our customers as “guests.” We were also to refer to courses as “first course” and “second course.” Our chef was French, and found the American use of “entrée” for the main course annoying: in French, “entrée” means appetizer. This was my first experience with a controlled vocabulary.
English is a complex, flexible, and powerful language. I once heard a comedian say “Those French! They have a different word for everything!” But really it’s the English language that is full of mischief. You can begin your meal with:
- A starter
- A first course
- An appetizer
Or terms English has borrowed from other languages:
- hors d’oeuvres
- An amuse-gueule [1]
Moreover, a Western restaurant could call this first course “grazing,” or a sports bar could call it a “warm-up.” You can see where the task of serving an appetizer might lead to some confusion.
At the restaurant, if I asked a “guest” if they would like a first course, they would look at me funny and say, “Huh?” I would say, “An appetizer? Hors d’oeuvres? A nibble?”
However, on the web, no one can hear you scream, much less whisper. Thus we realize we need to create a controlled vocabulary.
A controlled vocabulary is simply what it sounds like: a way to control the meaning of the vocabulary used and to keep track of the related terms. In our restaurant we had the preferred term, “first course,” and all the terms our patrons might use, “starter, first course, hors d’oeuvres, appetizer,” neatly tucked into our heads. So if a patron wanted an appetizer of smoked salmon, we would write on the check “first course: smoked salmon.”
We also kept track of related concepts: “Madam, would you care for an aperitif?” Or the more casual, “Can I get you a drink while you’re looking at the menu?”
A computer tends to be as inflexible as a French chef. If you search for “salmon,” the computer will only give you results featuring the word “salmon.” If you type “fish” or “gravlax,” you will go hungry, unless the designer of the search has created some type of controlled vocabulary.
There are many kinds of controlled vocabulary, from the simple one made of equivalence relationships that says “yes, gravlax and cured salmon are the same” to a complex thesaurus that could tell you that “gravlax is a type of salmon that is the same as cured salmon and is an ingredient for bagels and lox.”
Let’s dig a little deeper…
The simplest type of controlled vocabulary is a list of equivalence relationships: “salmon” and “gravlax” are the same for the purposes of a search. The table below shows an example.
The relationships can be as simple as two words for the same thing: “cat” and “kittycat.” These are synonyms.
And they can be different spellings or acronyms for the same thing. “Tiger” is “Tyger,” “SPCA” is the “Society for the Prevention of Cruelty to Animals.” These are variants.
The words can be slightly different, but for the purposes of search, you may choose to treat them the same: “cat” and “kitten.” Perhaps you have a greeting card site and someone wants a card with a picture of a kitten but you only have one card with a cat on it. It’s better to offer up the cat than show the user a “no results found” page.
It’s a lot like the index in the back of a book: you look up “Moon” in a book on the solar system and it says, “See satellites.” For the purpose of that particular book, satellite and moon are the same. Another book (a thicker one perhaps) might differentiate them. The key is to consider what people are searching for and what words they use, and then get them to the content you have.
Equivalence Relationship Example
|Preferred term||Variants|
|Smoked Salmon||Fish, gravlax, lox, cured salmon, smoked fish, preserved fish, nova|
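To make the idea concrete, here is a minimal sketch of how a search box might use an equivalence list like the one above. The names `EQUIVALENTS`, `build_lookup`, and `normalize_query` are made up for illustration; a real site would load its vocabulary from wherever it is maintained rather than hard-code it.

```python
# Hypothetical equivalence table: preferred term -> variants,
# copied from the smoked salmon example above.
EQUIVALENTS = {
    "smoked salmon": ["fish", "gravlax", "lox", "cured salmon",
                      "smoked fish", "preserved fish", "nova"],
}


def build_lookup(equivalents):
    """Map every variant (and the preferred term itself) to the preferred term."""
    lookup = {}
    for preferred, variants in equivalents.items():
        lookup[preferred] = preferred
        for variant in variants:
            lookup[variant.lower()] = preferred
    return lookup


LOOKUP = build_lookup(EQUIVALENTS)


def normalize_query(query):
    """Return the preferred term for a query, or the cleaned query if unknown."""
    cleaned = query.strip().lower()
    return LOOKUP.get(cleaned, cleaned)


print(normalize_query("Gravlax"))
print(normalize_query("cheddar"))
```

A query for “Gravlax” is rewritten to “smoked salmon” before the search runs, while an unknown word such as “cheddar” passes through unchanged.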
A more complex type of controlled vocabulary is a taxonomy. It shows hierarchical relationships as well as equivalence relationships. It is useful not only for searches, but also for creating hierarchies for browsing (à la the front page of Yahoo!) and for tying the two together.
Hierarchical Relationships Example
|Preferred term||Variants||Broader terms||Narrower terms|
|Smoked Salmon||Gravlax, lox, cured salmon||Fish, smoked fish, preserved fish, salmon, smoked meats||Smoked salmon flatbread with crème fraîche, Linguini with smoked salmon and asparagus|
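A hierarchy like this can be sketched in a few lines. The example below is only an illustration: the `BROADER` table and both function names are hypothetical, and it simplifies the example above by giving each term a single parent (and adds “smoked trout” as a second child purely for demonstration). It shows the two uses mentioned earlier: building a browsable path, and widening a search to cover everything beneath a broad term.

```python
# Hypothetical taxonomy: each term points to its one broader (parent) term.
BROADER = {
    "smoked salmon": "smoked fish",
    "smoked trout": "smoked fish",
    "smoked fish": "fish",
    "fish": None,
}


def breadcrumb(term):
    """Walk up the hierarchy and return the path from the top term down to `term`."""
    path = []
    while term is not None:
        path.append(term)
        term = BROADER.get(term)
    return list(reversed(path))


def narrower_terms(term):
    """Return every term below `term`, so a broad search can include them."""
    found = []
    for child, parent in BROADER.items():
        if parent == term:
            found.append(child)
            found.extend(narrower_terms(child))
    return found


print(" > ".join(breadcrumb("smoked salmon")))
print(narrower_terms("fish"))
```

With this in place, a search for “fish” can also return items filed under smoked fish, smoked salmon, and smoked trout.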
You can see a taxonomy in action on Yahoo!. A search for “coffee mug” brings up a number of results.
Take a close look. Each result is accompanied not only by the title, description and link, but also by a link to Yahoo!’s famous hierarchy. A searcher who was looking for a tchotchke to put their company logo on can click on Promotional Items > Mugs and find other companies that offer that service, or a mug collector could find funny mugs for his collection. The categories also provide context for the searcher… the mug collector is not going to click on the second result once he notices it’s in the “Punk and Hardcore Artists” section.
The ultimate in controlled vocabularies is a thesaurus. You may remember using the thesaurus in grade school. It was a way to make yourself look smarter. Instead of writing “she said,” you could use a thesaurus and write “she yelled, spoke, whispered, insinuated, articulated, uttered, insisted” and so on. [2]
Thesauri have come back into our everyday life via the web. More than a tool to get more and better words, thesauri are used to create a web of words interconnected to help people find the things that they just don’t have language to describe. A thesaurus shows not only hierarchical relationships but also associative ones.
The beginnings of a thesaurus
|Preferred term||Variants||Related Terms||Associated Terms|
|Smoked Salmon||Gravlax, lox, cured salmon||Preserved fish; Smoked Trout, Bacalao, salt-cured sardines, pickled anchovies; Smoked salmon flatbread with crème fraîche, Linguini with smoked salmon and asparagus||Jewish cuisine, kosher foods, crème fraîche, bagels, capers, dill, crackers, fish knife, caviar|
As you can see, organizing metadata into a controlled vocabulary is a somewhat subjective exercise. On a different website, Jewish cuisine might be the parent, and preserved fish the associated term. It depends on the type of website it is and who the visitors to it are.
Associated Terms are terms that belong together, but are not the same, nor are they broader or narrower terms. They just kind of go together. For example, if the table above were for a thesaurus for a recipe site, it might prove useful to list ingredients commonly combined with the main term (crème fraîche, bagels, capers, dill, cream cheese). On a gourmet food store site, it might be useful to list other purchases someone interested might wish to make (crackers, fish knife, caviar). These are terms associated with smoked salmon, but no one would confuse them for being the same.
All of these types of controlled vocabulary are aimed at getting people to what they are seeking, no matter what crazy thing they type in the search box. Let’s see it in action.
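One way to picture a thesaurus entry is as a small record with a slot for each kind of relationship. The sketch below is an assumption about how such a record could look, not a description of any real system: `ThesaurusEntry`, `also_suggest`, and the `in_stock` set are invented names, and the values are abbreviated from the smoked salmon row above.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusEntry:
    """One preferred term and the terms connected to it."""
    preferred: str
    variants: list = field(default_factory=list)    # same thing, other words
    broader: list = field(default_factory=list)     # parent terms
    narrower: list = field(default_factory=list)    # child terms
    associated: list = field(default_factory=list)  # "goes with" terms


# Hypothetical entry for a gourmet food store.
SMOKED_SALMON = ThesaurusEntry(
    preferred="smoked salmon",
    variants=["gravlax", "lox", "cured salmon"],
    broader=["preserved fish"],
    narrower=["smoked salmon flatbread", "linguini with smoked salmon"],
    associated=["bagels", "capers", "dill", "crackers", "fish knife", "caviar"],
)


def also_suggest(entry, in_stock):
    """Return the associated terms that the site can actually offer."""
    return [term for term in entry.associated if term in in_stock]


print(also_suggest(SMOKED_SALMON, {"crackers", "caviar", "olive oil"}))
```

Keeping associated terms in their own slot means they can drive suggestions without ever being treated as synonyms in search.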
Everybody Spels Difernt
Well, I certainly spell differently.
Here are two results of a recent attempt to find different kinds of gourmet chedder.
According to their search, Dean & Deluca doesn’t have chedder. Except they do… only they call it by the proper spelling, cheddar. Google, however, recognizes the wide variety of spellings humans manage to invent, and although “chedder” works rather well, it graciously prompts you to try “cheddar.”
Let’s try reverse engineering [3] Dean & Deluca. I didn’t make the site, and I don’t know anyone who did, but by playing with it, I can make a good guess at how it works.
So, if I were unwilling to believe Dean & Deluca didn’t sell cheddar, I might search for “cheese” instead.
Which turns up quite a lot of cheese, including a cheddar.
The Dean & Deluca controlled vocabulary includes hierarchical information, which shows that cheddar is a subset of “Cow’s milk and other cheeses.” (Look in the brackets… they show three parents of “cheese.”)
If we continue on to the Montgomery Cheddar page, we see the thesaurus being used to seduce the buyer into making more purchases.
By examining the “May We Also Suggest:” section, we can guess that the parent of Montgomery’s is English cheese, as it leads off the selection of related items:
- Also offered is a sibling cheddar, Mrs. Appleby’s Cheshire.
- A cousin, Colston Bassett Stilton, which is a non-cheddar English cheese; stinky but tasty.
- And finally an associated item: Crackers, which aren’t cheese at all, but are quite good for eating cheese.
A person seeking a good English cheddar might not only be better able to find what they want, but also be gently nudged into purchasing a few items they didn’t know they wanted. But the thesaurus suspected they might…
And Dean & Deluca goes to the bank. Now if they would only understand my little spelling problem they would be the perfect site:
|Preferred term||Variants||Related Terms||Associated Terms|
|Cheddar||Chedder, Cheder, Chedar||Cow’s Milk & Other Cheeses; Montgomery Cheddar; Mrs. Appleby’s Cheshire, Colston Bassett Stilton, English Cheese Collection||Cracker Collection|
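Here is a rough sketch of how a site might forgive my spelling. It checks a hand-built variant list first (the misspellings in the table above) and then falls back to fuzzy matching with `difflib.get_close_matches` from Python’s standard library. The `VARIANTS` and `PREFERRED_TERMS` data, the `resolve` function, and the 0.8 cutoff are all assumptions chosen for illustration; nothing here reflects how Dean & Deluca or Google actually work.

```python
import difflib

# Hypothetical variant list, taken from the cheddar row above.
VARIANTS = {"chedder": "cheddar", "cheder": "cheddar", "chedar": "cheddar"}

# Hypothetical list of preferred terms the site knows about.
PREFERRED_TERMS = ["cheddar", "cheshire", "stilton", "crackers"]


def resolve(query):
    """Return the preferred term for a query, or None if nothing is close."""
    cleaned = query.strip().lower()
    if cleaned in PREFERRED_TERMS:
        return cleaned
    if cleaned in VARIANTS:
        return VARIANTS[cleaned]
    # Fall back to fuzzy matching for misspellings nobody anticipated.
    close = difflib.get_close_matches(cleaned, PREFERRED_TERMS, n=1, cutoff=0.8)
    return close[0] if close else None


print(resolve("Chedder"))
print(resolve("cheddr"))
print(resolve("xyzzy"))
```

The hand-built list catches the misspellings you already know about from your search logs; the fuzzy fallback is a safety net for the ones you have not seen yet.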
Forrester Research’s “Why Websites Fail” reports that “Poorly architected retailing sites are underselling by as much as 50%.”
Looking at Dean & Deluca, it’s easy to imagine the opportunities a site loses when it doesn’t use metadata effectively in a controlled vocabulary. By understanding your products and speaking the language of your customers, you can make sure no cheese lover ever has to go home empty-handed.
1 This is French for “little tiny bit of nothing that we tease you with at the beginning of the meal.” Imagine a quarter teaspoon of caviar with two croutons on a small white plate.
2 Of course, later on we all learned that we should save our ten dollar words for the dialog, and just leave “said” alone. And the thesaurus became a paperweight. Well… it’s baaaaack…
3 When an engineer does it, he usually goes on to reassemble it. And then build his own, perhaps even better. This is called “reverse engineering.” It’s a great way to understand how things work. Study the product, take it apart, reassemble it and then try to build your own.