Programming Localization
You might usually have the luxury of working on a site or application in just a single language, but there will come a time when you’ll have a project that must be done in two or more languages. This is where localization comes in. Localization, often referred to as “l10n” (the 10 represents the number of letters between the L and the N), is the implementation of features for a specific locale.
Most current programming languages include support for setting the locale. Once the locale is set, the built-in functions for formatting dates, currencies and numbers return output in that locale’s format.
Localization consists of various things: date, currency, and number formats; translations; time zones; and even things like punctuation. Trying to handle all of these things can be a daunting task. This article will only cover a fraction of this but will offer some ideas and approaches for future projects.
Locale Format
A locale is normally defined using two pieces of information: a language identifier and a country identifier. Most developers use the lowercase two-letter ISO 639-1 standard for the language identifier and the uppercase two-letter ISO 3166 standard for the country identifier. Example locales include fr_CA for Canadian French, en_US for American English, and en_GB for British English.
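As a rough illustration of the format, here is a minimal Java sketch that turns an identifier such as fr_CA into its language and country parts. The class name LocaleIds and its parse helper are made up for this example; Locale.forLanguageTag is the standard JDK call, and it expects hyphens rather than underscores.

```java
import java.util.Locale;

// Hypothetical helper, not part of any library.
final class LocaleIds {
    // Converts an identifier such as "fr_CA" (or "fr-ca") into a java.util.Locale.
    static Locale parse(String id) {
        // Language tags use hyphens, so normalise the underscore first.
        return Locale.forLanguageTag(id.replace('_', '-'));
    }

    public static void main(String[] args) {
        Locale locale = parse("fr_CA");
        System.out.println(locale.getLanguage()); // fr
        System.out.println(locale.getCountry());  // CA
        System.out.println(locale);               // fr_CA
    }
}
```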
If you’re still developing using Microsoft technologies that are not under the .NET umbrella (like classic ASP), the locale is set using a numerical locale identifier or a string-based identifier. This also applies to use of PHP on Windows.
Determining Language Selection
Browsers allow you to specify your language preferences. The browser then passes those preferences on to the Web server through the HTTP header Accept-Language:
Accept-Language: en-ca,en-us;q=0.7,en;q=0.3
As you can see in this example, the locale information still has the language and country identifiers but they are all lowercase and separated by a hyphen instead of an underscore. Each locale is separated by a comma. The preference for a particular language can be specified by using the q parameter. The higher the number, the greater the preference. In this case, I’m specifying Canadian English as my primary preference, American English as my secondary preference and any variation of English as my final preference.
This may seem like a great way to detect language preference, but most users don’t change the default settings. Therefore, it’s always important to offer users the choice somewhere on the page.
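One way this could look in Java is sketched below: an explicit choice made on the page wins, and the Accept-Language header is only consulted when no choice has been made. The class LanguageNegotiator, the SUPPORTED list and the FALLBACK locale are assumptions made up for this example; Locale.LanguageRange.parse and Locale.lookup are standard JDK methods (Java 8 and later).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of any framework.
final class LanguageNegotiator {
    // Assumed list of locales the site has translations for.
    static final List<Locale> SUPPORTED = Arrays.asList(
            Locale.forLanguageTag("en-US"),
            Locale.forLanguageTag("fr-CA"));
    static final Locale FALLBACK = SUPPORTED.get(0);

    // userChoice is a tag such as "fr-CA" picked on the page, or null if none.
    static Locale resolve(String userChoice, String acceptLanguage) {
        if (userChoice != null) {
            Locale chosen = Locale.forLanguageTag(userChoice);
            if (SUPPORTED.contains(chosen)) {
                return chosen; // an explicit choice always wins
            }
        }
        if (acceptLanguage == null || acceptLanguage.trim().isEmpty()) {
            return FALLBACK;
        }
        try {
            // parse() orders the ranges by their q value, highest first.
            List<Locale.LanguageRange> ranges = Locale.LanguageRange.parse(acceptLanguage);
            Locale match = Locale.lookup(ranges, SUPPORTED);
            return match != null ? match : FALLBACK;
        } catch (IllegalArgumentException malformedHeader) {
            return FALLBACK; // never fail a request over a bad header
        }
    }

    public static void main(String[] args) {
        Locale locale = resolve(null, "en-ca,en-us;q=0.7,en;q=0.3");
        System.out.println(locale.toLanguageTag()); // en-US
    }
}
```

With the header from the example above, Canadian English is not in the supported list, so the lookup moves on to the second preference and settles on en-US.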
Separation of Content and Layout
It’s not unusual for us to separate page content from the rest of the layout. The theme system for almost any blog tool such as WordPress or Movable Type works this way. In multilingual sites, however, we have to plan for template translation on top of content translation. This includes things like navigation, copyright text, and search forms. These are elements that are found on most or all of the pages in the site.
In a Web application, the problem is considerably larger. Each screen of the application might have dozens of labels that need to be pulled into a localized version of the application.
Text Files
The most common approach is to store key/value pairs in a text file. Text files are easy to manipulate and can be very handy for sending to non-technical people (such as a translator) to edit. The translator can make the changes directly to the file, which can be quickly and easily dropped back into the project.
On the other hand, just sending off a file to a translator isn’t perfect. Text files often don’t provide translators with enough context to translate labels accurately, and sometimes space constraints are an issue that isn’t apparent in a text file. Take the word “e-mail,” for example. In French, this could be translated to courriel but a translator may decide to use the expanded form courrier électronique, which may extend well beyond the space provided in the design.
Use a separate text file for each language you wish to support—for example, labels.en.txt and labels.fr.txt.
In Java, labels can be stored in resource files (properties files). Each line contains a key and a value separated by an equals sign.
Labelname=This is my label
Then, it’s simply a matter of pulling in the label in the code:
rfile.getString("Labelname");
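To make the idea concrete, here is a small Java sketch with one bundle per language and a fallback to English when a key has not been translated yet. The class LabelBundles and the sample labels are made up for this example, and the inline strings stand in for the contents of labels.en.txt and labels.fr.txt; a real project would read the files instead. PropertyResourceBundle is the JDK class that parses key=value lines.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

// Hypothetical helper, not part of any framework.
final class LabelBundles {
    private static final Map<String, ResourceBundle> BUNDLES = new HashMap<>();

    static {
        try {
            // Stand-ins for labels.en.txt and labels.fr.txt (sample data only).
            BUNDLES.put("en", new PropertyResourceBundle(
                    new StringReader("Labelname=This is my label\nemail=E-mail\n")));
            // The French file has no Labelname yet, to show the fallback.
            BUNDLES.put("fr", new PropertyResourceBundle(
                    new StringReader("email=Courriel\n")));
        } catch (IOException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Returns the label for the language, falling back to English.
    static String label(String language, String key) {
        ResourceBundle english = BUNDLES.get("en");
        ResourceBundle bundle = BUNDLES.getOrDefault(language, english);
        return bundle.containsKey(key) ? bundle.getString(key) : english.getString(key);
    }

    public static void main(String[] args) {
        System.out.println(label("fr", "email"));     // Courriel
        System.out.println(label("fr", "Labelname")); // This is my label
    }
}
```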
Punctuation
Where to store punctuation may seem obvious, but if you’re a programmer, you’re always looking for a shortcut. Here’s a good example: You have a label that appears in two places on the page, one with a colon as a form label and one without as a page title. Since the text itself is the same, you might opt to only store the label once and keep the colon separate.
<h1><%=rfile.getString("Labelname")%></h1>
<label for="formfield"><%=rfile.getString("Labelname") + ":"%></label>
The problem here is that punctuation varies between languages, so you can quickly run into trouble. In Canadian French, a colon used in this way requires a space between the label and the colon. I recommend either avoiding punctuation where you can, as you could in this example, or including the punctuation in the label itself.
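A sketch of the second recommendation, with the punctuation stored in the label itself, might look like this in Java. The class PunctuatedLabels, the key names email.title and email.field and the sample translations are all invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical example: the title and the form label are separate keys,
// so each language carries its own punctuation.
final class PunctuatedLabels {
    static final Map<String, Map<String, String>> LABELS = new HashMap<>();

    static {
        Map<String, String> en = new HashMap<>();
        en.put("email.title", "E-mail");
        en.put("email.field", "E-mail:");
        LABELS.put("en", en);

        Map<String, String> fr = new HashMap<>();
        fr.put("email.title", "Courriel");
        fr.put("email.field", "Courriel :"); // space before the colon
        LABELS.put("fr", fr);
    }

    // Builds the same markup as the snippet above, without appending ":" in code.
    static String render(String language) {
        Map<String, String> labels = LABELS.get(language);
        return "<h1>" + labels.get("email.title") + "</h1>"
                + "<label for=\"formfield\">" + labels.get("email.field") + "</label>";
    }

    public static void main(String[] args) {
        System.out.println(render("en"));
        System.out.println(render("fr"));
    }
}
```

The cost is one extra key per label; the benefit is that the code never has to know how a given language punctuates a form label.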
Database
Another common place to store labels is in the database, especially if the rest of the site content is being stored there as well.
I suspect the majority of multilingual sites support only one additional language, in which case it is straightforward to create an additional field that contains the translation.
Post_id | Post_content_en | Post_content_fr |
---|---|---|
1 | In English | En Français |
This actually tends to be more of a hassle. Stored procedures would need forked code to pull in either the English or the French content, and building SQL strings on the fly is definitely not an ideal solution.
"SELECT Post_content_" + locale + " FROM table_name WHERE Post_id = 1"
A more flexible approach is to add an additional field that will become part of a composite key:
Post_id | Post_locale | Post_content |
---|---|---|
1 | en | In English |
1 | fr | En Français |
This way, we can add new locales without requiring changes to the database schema. It also simplifies database calls as we no longer have to worry about appending the locale to the field name.
"SELECT Post_content FROM table_name WHERE Post_id = 1 AND Post_locale = 'en'"
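Because the locale is now an ordinary value rather than part of a column name, it can be bound as a query parameter. The following JDBC sketch assumes the table and column names from the example above and a Connection obtained elsewhere; the class PostContentQuery is hypothetical.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical data-access helper for the composite-key table.
final class PostContentQuery {
    // Both parts of the composite key are bound parameters,
    // so no SQL is assembled from user-controlled strings.
    static final String SQL =
            "SELECT Post_content FROM table_name WHERE Post_id = ? AND Post_locale = ?";

    // Returns the content for the locale, or null when no row exists.
    static String find(Connection connection, int postId, String locale)
            throws SQLException {
        try (PreparedStatement statement = connection.prepareStatement(SQL)) {
            statement.setInt(1, postId);
            statement.setString(2, locale);
            try (ResultSet rows = statement.executeQuery()) {
                return rows.next() ? rows.getString("Post_content") : null;
            }
        }
    }
}
```

A caller might try find(connection, 1, "fr") first and repeat the call with "en" when it returns null, so untranslated posts still display.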
Precompile
Since language labels (like those in the header and footer) don’t change very often, you can save processing time by building them into the templates. Pulling labels into a template on every page call spends more processing time than is necessary.
This could certainly be done by hand if the number of languages and templates is low. For example, if you have one English template and one Spanish template, it wouldn’t take much to make changes to either of these files as needed. You still run the risk of introducing differences between the two templates but doing it programmatically may be excessive.
The alternative is to compile the templates programmatically. This can be done at runtime when the first template is requested and then cached for each subsequent request. Or you can precompile the templates as part of a separate process. Either way, you’d end up with a localized version that’s ready for content.
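The runtime variant, compile on first request and then cache, could be sketched in Java as follows. Everything here is an assumption made for illustration: the class TemplateCache, the {{key}} placeholder syntax, the {content} marker and the sample labels. COMPILE_COUNT exists only to show that the labels are merged once per locale.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical template cache, not a real templating library.
final class TemplateCache {
    // {{key}} marks a label; {content} is left in place for the page content.
    static final String TEMPLATE =
            "<h1>{{site.title}}</h1>{content}<footer>{{site.copyright}}</footer>";

    static final Map<String, Map<String, String>> LABELS = new HashMap<>();
    static final AtomicInteger COMPILE_COUNT = new AtomicInteger();
    private static final Map<String, String> COMPILED = new ConcurrentHashMap<>();

    static {
        Map<String, String> en = new HashMap<>();
        en.put("site.title", "My Site");
        en.put("site.copyright", "All rights reserved");
        LABELS.put("en", en);

        Map<String, String> es = new HashMap<>();
        es.put("site.title", "Mi sitio");
        es.put("site.copyright", "Todos los derechos reservados");
        LABELS.put("es", es);
    }

    // Merges the labels for one locale into the shared template.
    static String compile(String locale) {
        Map<String, String> labels = LABELS.get(locale);
        if (labels == null) {
            throw new IllegalArgumentException("No labels for locale " + locale);
        }
        COMPILE_COUNT.incrementAndGet();
        String result = TEMPLATE;
        for (Map.Entry<String, String> label : labels.entrySet()) {
            result = result.replace("{{" + label.getKey() + "}}", label.getValue());
        }
        return result;
    }

    // Compiles on the first request for a locale and reuses the result afterwards.
    static String localizedTemplate(String locale) {
        return COMPILED.computeIfAbsent(locale, TemplateCache::compile);
    }

    public static void main(String[] args) {
        System.out.println(localizedTemplate("es"));
        System.out.println(localizedTemplate("es")); // served from the cache
        System.out.println(COMPILE_COUNT.get());     // 1
    }
}
```

Precompiling as a separate process would run the same compile step for every locale at build time and write the results to disk instead of to a map.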
Date, Currency and Number Formats
Date, currency and number formats can vary by language and by country. Most current programming languages include support for setting the locale. Once the locale is set, the built-in functions for formatting dates, currencies and numbers return output in that locale’s format.
PHP:
// Note: strftime() is deprecated as of PHP 8.1.
setlocale(LC_ALL, 'C'); // start from the default "C" locale (English names)
echo strftime("%A %e %B %Y", mktime(0, 0, 0, 12, 22, 1978)) . '<br>';
// Friday 22 December 1978
setlocale(LC_ALL, 'fr_CA'); // this locale must be installed on the server
echo strftime("%A %e %B %Y", mktime(0, 0, 0, 12, 22, 1978)) . '<br>';
// vendredi 22 décembre 1978
Using PHP’s setlocale in a Linux environment may require some server configuration to have the required locales supported. Refer to the user notes in the PHP documentation for more information.
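For comparison, a rough Java equivalent is sketched below. Java does not rely on a process-wide locale setting here; the locale is passed to each formatter instead. The class LocaleFormats and its two helper methods are made up for this example, while DateTimeFormatter and NumberFormat are standard JDK classes. Exact output can vary slightly with the locale data shipped in a given JDK, particularly the spacing characters used in currency amounts.

```java
import java.text.NumberFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper, not part of any framework.
final class LocaleFormats {
    // Day name, day of month, month name and year, like "%A %e %B %Y" above.
    static String longDate(LocalDate date, Locale locale) {
        return DateTimeFormatter.ofPattern("EEEE d MMMM yyyy", locale).format(date);
    }

    // Formats an amount using the locale's currency symbol and separators.
    static String currency(double amount, Locale locale) {
        return NumberFormat.getCurrencyInstance(locale).format(amount);
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(1978, 12, 22);
        System.out.println(longDate(date, Locale.UK));
        System.out.println(longDate(date, Locale.CANADA_FRENCH));
        System.out.println(currency(1234.5, Locale.US));
        System.out.println(currency(1234.5, Locale.CANADA_FRENCH));
    }
}
```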