By Pavel Doronin, translated by Julia I Oden.
Having been a reader of Habr for a while now, I have noticed that there are only a handful of intelligible articles on software localization aimed at developers. Based on my experience managing localization projects, I can say that localization is not just about in-line translation and adapting an application to the context of this or that country, but is also a constant battle (or, ideally, a productive collaboration) with its developers.
In this article, I will try to show, with real examples, how to create so-called “localization-friendly code”: that is, how to organize resources so as to substantially ease software localization and reduce unnecessary time and financial expenditures.
I need to clarify upfront – the primary focus will be internationalization: the process of accounting for all linguistic particulars during the development stage. If your project resources did not account for localization from the beginning, yet you decided to brave the waters at a later time, “honing” them to localization standards can be much more costly than setting them as a goal from the start.
Use Unicode
In most cases, the question of encoding text in UTF-8 (or UTF-16) comes up when planning localization into Asian languages, where the number of characters can reach several thousand. Even if localization into Korean or Chinese is not currently planned, it is worthwhile to adopt a universal encoding ahead of time. If the localization strategy of your product changes, it will be much more difficult to switch to another encoding mid-stream.
Tip: for all resources, use Unicode as your default, even if the project for now is only in Russian/English/any other language.
By the way, JSON and YAML specifications (these formats are often used for saving the localized resources) assume the use of Unicode.
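A minimal Python sketch of why a single-language code page becomes a trap. The labels are the “Save as” strings this article uses as examples; the choice of cp1251 as the legacy encoding is my assumption for illustration.

```python
# "Save as" labels in three of the languages discussed in this article.
labels = {"ru": "Сохранить как", "fi": "Tallenna nimellä", "zh": "另存为"}

def encodable(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# A legacy Cyrillic code page copes with Russian only.
legacy = {lang: encodable(text, "cp1251") for lang, text in labels.items()}
# UTF-8 can represent every label.
utf8 = {lang: encodable(text, "utf-8") for lang, text in labels.items()}
```

Here `legacy` reports a failure for Finnish and Chinese, while `utf8` succeeds for all three languages.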
Beware of fonts
This seemingly minor detail is often a critical factor stalling localization. Make sure that the fonts you use have glyphs for the languages you are localizing into (primarily, again, those Asian languages, as well as Hebrew, Arabic and the diacritical marks of the European languages).
Remember that
ä, à or ą ≠ a
just as, in Russian, “е” does not always equal “ё.”
I worked on a case where the developers had drawn a font containing only English letters. When it came to localizing into German and Polish, they had to add letters with diacritical marks.
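A check like this can be automated before translations arrive. The sketch below assumes you can obtain the set of characters a font covers; here that set is a hypothetical stand-in for an English-only font, and `missing_glyphs` is an illustrative helper, not part of any library.

```python
import unicodedata

def missing_glyphs(text: str, supported: set[str]) -> list[str]:
    """List the characters in text that the font cannot draw (spaces ignored)."""
    normalized = unicodedata.normalize("NFC", text)
    return sorted({ch for ch in normalized if not ch.isspace() and ch not in supported})

# Hypothetical font that covers only the basic English letters.
english_only = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")

gaps = missing_glyphs("Bitte geben Sie einen gültigen Produktschlüssel ein", english_only)
names = [unicodedata.name(ch) for ch in gaps]
```

For the German sentence, `gaps` contains the single letter “ü”, which is exactly the kind of glyph that had to be drawn after the fact in the case above.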
Leave room for maneuvering
Besides fonts, translating the text strings of an application hides yet another pitfall.
Compare the translation of one menu item into different languages:
ru: Сохранить как
en: Save as
fi: Tallenna nimellä
zh: 另存为
The Chinese translation requires just three characters, while the Finnish translation requires a full sixteen! Besides the sheer number of characters, the specific characteristics of this or that font are also important.
Let’s compare the rendered length of the Finnish and Chinese lines (the font for both languages is Arial Unicode MS, size 12): the Finnish text (114 px) is 2.5 times longer than the Chinese (45 px).
It is highly important to have extra space in interface elements to avoid cutting off the text. If in certain cases there is not enough space, it is possible to use automated text sizing tools. Yet this decision will most likely lead to having the final text be of different sizes in different elements of the interface.
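As a rough early warning, translations can be checked against a length budget for each interface element. This is only a sketch: character count is a crude proxy for pixel width, and the 14-character budget is a hypothetical value.

```python
def overflowing(translations: dict[str, str], max_chars: int) -> dict[str, int]:
    """Return the locales whose text exceeds max_chars, with the excess length."""
    return {
        locale: len(text) - max_chars
        for locale, text in translations.items()
        if len(text) > max_chars
    }

save_as = {
    "ru": "Сохранить как",
    "en": "Save as",
    "fi": "Tallenna nimellä",
    "zh": "另存为",
}

# A hypothetical button sized for about 14 characters.
too_long = overflowing(save_as, max_chars=14)
```

With this budget only the Finnish label is flagged, two characters over the limit. A real check would measure the rendered width in the actual font.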
Pseudolocalization
Pseudolocalization helps catch problem spots before translation begins. It is a way to test whether an application is ready for localization. In place of translated text, the development resources are filled with a pseudo-language generated by an algorithm (which depends on the software at hand). In the most primitive case, the English text is replaced with its Cyrillic transliteration or transcription:
Save as -> Саве ас
Save as -> Сэйв аз
This method allows us to check for the following:
- Are the diacritical marks displayed correctly (e.g., German, Polish)?
- Are languages that use different scripts displayed correctly (e.g., Chinese, Russian)?
- Are there any issues presenting interface elements for languages with right-to-left text direction (e.g., Arabic)?
- Are there issues with presenting non-standard characters (e.g., usernames)?
- Are all localized resources extracted into separate files (using text directly within code carries numerous issues; see the part on hardcoding below)?
Often, machine translation into the target language is used for pseudolocalization. On the one hand, this is a simple solution when no special tools for generating pseudolocalization are available. On the other hand, I have seen more than once how developers confused localized resources with pseudolocalized ones and even replaced a proper translation with machine translation files from previously saved versions. Moreover, machine translation does not always exercise every character in a language (for example, the letter œ is not encountered frequently in texts, yet it is also important to test).
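If no dedicated tool is at hand, a pseudolocalizer can be very small. The sketch below is one possible algorithm, not a standard one: the letter mapping, the 30% padding and the bracket markers are all my assumptions.

```python
# Hypothetical mapping from ASCII letters to accented look-alikes.
ACCENTED = str.maketrans("aeiouyAEIOUYcnsz", "åéîöüýÅÉÎÖÜÝçñšž")

def pseudolocalize(text: str, expansion: float = 0.3) -> str:
    """Accent the letters, pad the string, and wrap it in brackets."""
    # Padding imitates longer translations; brackets reveal truncated text.
    padding = "~" * round(len(text) * expansion)
    return "[" + text.translate(ACCENTED) + padding + "]"

sample = pseudolocalize("Save as")
```

Any string that still appears in plain English in the interface was not taken from resources, and any string missing its closing bracket was cut off. Note that this naive version would also accent placeholder names inside strings, so a real implementation needs to skip them.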
For example, this is what the pseudo-translation plugin interface of the memoQ software looks like:
And this is what the result looks like with those settings:
External resources
In order to have a full overview of localization materials, it is necessary to extract all resources from the code base. Multimedia content containing text (most often images, as well as video and audio, as in games) should be stored separately, sorted by locale. Firstly, this will significantly simplify the job of content creators, as they will not have to dig through code when they need to correct some system message. Secondly, it will allow the localization manager to correctly calculate timelines and budget for each language. Thirdly, this will lead to significantly more flexibility in working with multilingual content.
The favorite formats for exchanging localization data are XLIFF and .po files. Through a variety of interfaces, modern translation automation systems are capable of transforming various files into formats usable by translators.
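In code, externalized resources boil down to a lookup by key and locale. The sketch below uses in-memory JSON strings as stand-ins for per-locale files; the file contents, the key names and the `tr` helper are hypothetical.

```python
import json

# Stand-ins for per-locale resource files (hypothetical keys and contents).
FILES = {
    "en": '{"save_as": "Save as", "confirm": "Are you sure?"}',
    "fi": '{"save_as": "Tallenna nimellä"}',
}

CATALOGS = {locale: json.loads(raw) for locale, raw in FILES.items()}

def tr(key: str, locale: str, default_locale: str = "en") -> str:
    """Look up key for locale, falling back to the default locale."""
    catalog = CATALOGS.get(locale, {})
    if key in catalog:
        return catalog[key]
    return CATALOGS[default_locale][key]
```

The fallback means that an unfinished translation shows the source-language text rather than breaking the interface, and the code never needs to change when a string is corrected.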
Google and Apple also insistently advise developers to extract all localization resources:
Hardcoding in internationalization
Localization assumes not only word translation, but also adaptation of numbers, units of measurement, date and time formats, as well as punctuation marks to fit local standards.
Punctuation marks
Many developers like to hard-code punctuation marks, thinking that surely periods and question marks are the same across languages. Yet compare the following:
ru:
Вы уверены?
en:
Are you sure?
fr:
Êtes-vous sûr ?
es:
¿Está seguro?
ar:
هل أنت متأكد؟
In French, the question mark is preceded by a space (incidentally, Habr insisted on removing the space before the question mark, so I had to get creative with tags). In Spanish, an inverted question mark opens the phrase and a regular one closes it, whereas in Arabic the mark sits at the left end of the phrase and is mirrored. If the question mark is generated by code, not every user will be comfortable reading the message (unless the code accounts for locale differences, but why resort to such contortions?).
Besides punctuation marks, it is important to be careful with spaces; trusting the code to insert them would be a mistake. There are languages, such as Japanese, that do not use spaces between words. It is said that localizing Japanese and Chinese applications into European languages can be pure hell if the developers did not account for differences in word spacing among languages.
Punctuation is part of the text, so it should be moved into external resources.
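The difference between the two approaches can be sketched as follows. The strings are the ones compared above; the function names are hypothetical.

```python
# Whole messages, punctuation and spacing included, exactly as a
# translator would deliver them.
CONFIRM = {
    "ru": "Вы уверены?",
    "en": "Are you sure?",
    "fr": "Êtes-vous sûr ?",
    "es": "¿Está seguro?",
    "ar": "هل أنت متأكد؟",
}

def confirm_hardcoded(words: str) -> str:
    """Anti-pattern: the code decides which punctuation to append."""
    return words + "?"

def confirm(locale: str) -> str:
    """Preferred: return the complete string from resources unchanged."""
    return CONFIRM[locale]
```

The hard-coded variant can never produce the opening mark in Spanish or the mirrored mark in Arabic, while the resource-based variant gets both for free.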
Numbers
Numbers, like words, require translation. Many developers forget that and simply carry numbers over in the format familiar to them. Let’s compare:
ru: 18 765,22
en: 18,765.22
de: 18.765,22
he: 18,765.22
el: 18.765,22
fa: 18٬765٫22
Notice which symbols serve as the decimal mark and the thousands separator. In English and Hebrew, the period and the comma play the opposite roles from those they play in German and Greek. In Russian, a space separates groups of digits in numbers above 9999. In Farsi, the decimal part is set off by a dedicated symbol, the momayyez (U+066B), and thousands by the Arabic thousands separator (U+066C), yet there is no particular standard for this language, so a comma or even a space can serve as separators. These may look like trifles, of course: “those who need to understand this, will understand it in any format.” However, such little things can sometimes lead to serious misunderstandings, especially when talking about prices and important engineering calculations.
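To make the idea concrete, here is a toy formatter that takes its separators from a per-locale table instead of hard-coding them. The table is hypothetical and covers only three locales; in a real project this data should come from a library backed by CLDR data (Babel is one such library for Python) rather than from a hand-written dictionary.

```python
# Hypothetical separator table: (thousands separator, decimal mark).
SEPARATORS = {
    "en": (",", "."),
    "de": (".", ","),
    "ru": ("\u00a0", ","),  # no-break space keeps the digit groups together
}

def format_number(value: float, locale: str) -> str:
    """Format value with two decimals using the separators of the locale."""
    thousands, decimal = SEPARATORS[locale]
    text = f"{value:,.2f}"  # English-style grouping at this point
    # Swap through a placeholder so the two replacements do not collide.
    return text.replace(",", "\x00").replace(".", decimal).replace("\x00", thousands)
```

This sketch always groups digits, so it does not reproduce finer rules such as the Russian habit of grouping only above 9999.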
Speaking of prices, let’s compare:
ru: 2,25 €
en: €2.25
de-at: € 2,25
de-de: 2,25 €
lv: € 2,25
lt: 2,25 €
Monetary units are positioned differently in different languages, which means that it is better not to hard-code these symbols, either. Especially since, as you can see, the norms differ not just among languages, but also among different variants of the same language (in Austria and Germany). Even neighboring countries, like Latvia and Lithuania, have different norms.
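The same table-driven approach works for prices. The patterns below are a hypothetical encoding of four of the rows compared above, not data from any library.

```python
# Hypothetical per-locale data: (pattern, decimal mark).
PRICE_PATTERNS = {
    "ru": ("{amount} €", ","),
    "en": ("€{amount}", "."),
    "de-at": ("€ {amount}", ","),
    "de-de": ("{amount} €", ","),
}

def format_price(value: float, locale: str) -> str:
    """Place the euro sign and the decimal mark according to the locale."""
    pattern, decimal = PRICE_PATTERNS[locale]
    amount = f"{value:.2f}".replace(".", decimal)
    return pattern.format(amount=amount)
```

Because the position of the currency sign lives in the pattern, the Austrian and German variants differ only in data, not in code.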
Units of measurement
Sometimes, it is necessary to adapt not only the outward appearance of a number to international standards, but also the very number itself. I am talking about units of measurement. If they are used in your project, it is always good to find out which system of measurement is used in a particular country in order to report intelligibly to a user about speed, length, mass, temperature, etc.
A statement “You are moving at a speed of 62 miles per hour” will mean nothing to a driver from Pskov [Russia]. Similarly, “You are moving at a speed of 100 kilometers per hour” may put a Chicago driver into a stupor.
In such cases, it is not enough to simply present a different numeric variable; one needs to dig deeper and change the calculation formula depending on the location of the user. An ideal solution would still be to let the user change this setting within the software, independently of the location. In either case, local units of measurement need to be accounted for.
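A small sketch of both ideas: the value is stored in one unit and converted for display, and a saved user preference overrides the guess made from the region. The function names and the list of regions are assumptions for illustration.

```python
from typing import Optional

KM_PER_MILE = 1.609344

def preferred_units(region: str, user_choice: Optional[str] = None) -> str:
    """Guess the unit system from the region unless the user chose one."""
    if user_choice:
        return user_choice
    # Hypothetical, incomplete list of regions that show road speed in mph.
    return "imperial" if region in {"US", "GB"} else "metric"

def speed_message(speed_kmh: float, units: str) -> str:
    """Render a speed stored in km/h in the requested unit system."""
    if units == "imperial":
        return f"{round(speed_kmh / KM_PER_MILE)} mph"
    return f"{round(speed_kmh)} km/h"
```

With these definitions, 100 km/h is shown as “62 mph” to the Chicago driver and stays “100 km/h” for the driver from Pskov. In a real application the unit labels themselves would also come from localized resources.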
Not all languages have the same grammar principles
Dividing text into semantic segments
When organizing text strings, some developers do not take into account the grammatical structure of other languages and divide the text of each string into semantic fragments. As a result, texts are pieced together according to the rules of Russian syntax (or of the developer’s native language). An English translation can sometimes be forced into that mold (although not always), but with German, for example, with its rigid rules for word order and sentence structure, this way of assembling a text yields complete nonsense. And with Arabic, which is written in the opposite direction, such a method of content organization is completely useless.
Here is a rather well-known example. A Russian-speaking user sees this text: «До окончания тестового периода осталось 5 дней. Пожалуйста, введите действительный ключ.» (literal translation: “5 days remaining until the end of the trial period. Please enter the valid [activation?] key.”)
In source code, this message may look similar to this:
'trialexpires_1': "До окончания тестового периода "
'trialexpires_2sg': "остался "
'trialexpires_2pl': "осталось "
'trialexpires_4sg': " день."
'trialexpires_4pl2': " дня."
'trialexpires_4pl3': " дней."
'enterkey': "Пожалуйста, введите действительный ключ."
Admittedly, it is possible to get creative and “sew” these text “swatches” together in English so that the translation reads quite soundly. But in Arabic, where the text direction is different, this trick will not work. In German, separable verb prefixes tend to move to the end of the sentence. Incidentally, pay attention again to the length of this phrase in different languages: the German version is 30% longer than the English one. Look at the verbs (expires and enter in English; läuft … ab and geben … ein in German). As you can see, a German verb can consist of two parts, one of which may stand quite a long way from the other.
en: Your trial period expires in 5 days. Please enter the valid [activation] key.
de: Ihre Testversion läuft in 5 Tagen ab. Bitte geben Sie einen gültigen Produktschlüssel ein.
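One alternative to assembling fragments is to keep whole sentences in resources, with a placeholder for the number and one variant per plural form. The sketch below does this for the Russian message; the resource layout and function names are hypothetical, and the plural rule is written by hand here, whereas a real project would normally take it from a localization library.

```python
def ru_plural_category(n: int) -> str:
    """Return the Russian plural category for a whole number."""
    if n % 10 == 1 and n % 100 != 11:
        return "one"
    if n % 10 in (2, 3, 4) and n % 100 not in (12, 13, 14):
        return "few"
    return "many"

# Whole sentences with a placeholder, one per plural category.
TRIAL_EXPIRES_RU = {
    "one": "До окончания тестового периода остался {n} день.",
    "few": "До окончания тестового периода осталось {n} дня.",
    "many": "До окончания тестового периода осталось {n} дней.",
}

def trial_expires(n: int) -> str:
    """Pick the sentence variant for n and fill in the number."""
    return TRIAL_EXPIRES_RU[ru_plural_category(n)].format(n=n)
```

Each variant is a complete sentence, so a translator into German or Arabic can reorder the words freely and move the placeholder wherever the grammar requires.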
Another deficiency of this method is that, with the text presented this way, the translator cannot always glean the logic of a sentence and create an accurate translation. Imagine how easy it is to get lost in these text strings when dealing with about five thousand of them. All this tells us that, if possible, it is best to put an entire line into resources, so it not only has a more universal