"One Does Not Simply Launch Digital Catalogs in China"

Apr 30, 2014 by Arka Der Stepanian


On March 14th, 2014, our first customer in China went live with their new digital catalogs. Seeing our product used on the other side of the world was a moment of great joy and pride for us, even more so considering the technical hurdles we needed to overcome to make this happen. But what were these issues and how did our mad scientists solve them? If you're dealing with Chinese translations for your app or you're just curious to see how we made Publitas compatible for Chinese use, you should check out this post.

Having METRO China join the 17 other countries in which METRO Cash & Carry is already using our software was a pretty cool milestone for us.


But getting there wasn't as easy as you may think.

We stumbled upon several challenges while making Publitas compatible for use in China, and three stood out in particular:

  1. Slugs
  2. Pluralization
  3. Search

We'll go over each of these problems in more detail below and explain how we solved them.


Problem #1: Slugs

To quote the Django glossary, a slug is:

"A short label for something, containing only letters, numbers, underscores or hyphens. They’re generally used in URLs."

For instance, when you have a group in Publitas called "Makro UK" and a publication called "Week 12", it's converted to a slug for the URL:
https://view.publitas.com/makro-uk/week-12

The last bit in this URL ("makro-uk/week-12") is the slug. Similarly, "V&D" is converted (by default) to "v-d".

In short, creating slugs like this is a human and search engine friendly way to build URLs. The process is very basic. Publitas:

  1. Transforms the string to lowercase.
  2. Removes special characters (e.g. parentheses, commas, exclamation marks) and replaces spaces with "-".
  3. In most cases, replaces & with "-" and @ with "at".

So "Mom & Dad were @ Home" becomes: "mom-dad-were-at-home".
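
To make the steps concrete, here's a minimal sketch in plain Ruby. The `slugify` helper is a hypothetical name we made up for this illustration; it is not the code Publitas runs, it just follows the three steps listed above.

```ruby
# Hypothetical helper that mirrors the three steps described above.
def slugify(text)
  text.downcase                    # step 1: lowercase
      .gsub("@", " at ")           # step 3: spell out "@" as "at"
      .gsub(/[^a-z0-9]+/, "-")     # steps 2 and 3: spaces, "&" and other specials become "-"
      .gsub(/\A-+|-+\z/, "")       # tidy up: no leading or trailing hyphens
end

puts slugify("Mom & Dad were @ Home")  # mom-dad-were-at-home
puts slugify("Makro UK")               # makro-uk
puts slugify("V&D")                    # v-d
```

Note that the sketch treats everything outside a-z and 0-9 as a special character, which is exactly the assumption that stops working once the input isn't written in the Latin alphabet.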

This works really well for languages using the Latin alphabet:
irb(main):005:0> "Mom & Dad".parameterize
=> "mom-dad"

But when it comes to Chinese, it just falls flat:
irb(main):003:0> "中文測試".parameterize
=> ""

Solution: Transliterate Unicode characters to ASCII

Luckily, there are already libraries out there that deal with this issue. The one we're using is called Stringex, which bundles three libraries: ActsAsUrl, Unidecoder, and StringExtensions.

So now when we transliterate Unicode characters, this is what happens:
"你好".to_ascii #=> "Ni Hao"

When you have that, you can build a proper slug. In short, you can't directly use off-the-shelf stuff for building slugs in Chinese and will need to transliterate Unicode characters to ASCII first.
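
Here's a rough sketch of that "transliterate first, then slug" order in plain Ruby. To keep it self-contained we don't load Stringex; instead, `ROMANIZATION` is a tiny hypothetical lookup table with just two characters, and `to_ascii_sketch` and `slug_from` are made-up names. A real library ships tables that cover far more of Unicode.

```ruby
# Hypothetical two-entry stand-in for a real transliteration table.
ROMANIZATION = { "你" => "Ni", "好" => "Hao" }.freeze

# Replace every character we know with its romanization plus a space.
def to_ascii_sketch(text)
  text.each_char
      .map { |ch| ROMANIZATION.key?(ch) ? "#{ROMANIZATION[ch]} " : ch }
      .join
      .squeeze(" ")
      .strip
end

# Transliterate first, then apply the usual slug rules to the ASCII result.
def slug_from(text)
  to_ascii_sketch(text)
    .downcase
    .gsub(/[^a-z0-9]+/, "-")
    .gsub(/\A-+|-+\z/, "")
end

puts to_ascii_sketch("你好")  # Ni Hao
puts slug_from("你好")        # ni-hao
```

The order matters: if you run the slug rules before transliterating, every Chinese character is stripped as "special" and you end up with an empty string again.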

Problem #2: Pluralization

The second problem has to do with translations. Say you want to translate, from English to Dutch: "1 thing, 2 things". In both languages, you can express "thing" in singular and plural forms, which makes the translation to Dutch very straightforward. In Chinese, however, that's not the case: nouns don't take separate singular and plural forms, so the singular "thing" doesn't have a distinct translation of its own. The only form you can translate is the one that also covers the plural, "things (事)". Have a look at the translation strings for our catalog viewers below:

English (EN)

      viewer:
        loading: loading...
        pages:
          zero: "No pages"
          one: "{count} page"
          other: "{count} pages"
    

Chinese (ZH)

      viewer:
        loading: 正在载入…
        pages:
          zero: 没有页面
          one: ""
          other: 网页
    

As you can see here, the 'zero' and 'other' strings are translated just fine from English to Chinese. But when it comes to 'one' (singular), there's nothing to fill in, because a separate singular form doesn't exist in Chinese.

Solution: Only translate plural words

The way to address pluralization in Chinese is by only translating the plural form (we first read about this here). Here's how we apply this in the translations for our catalog viewers:

English (EN)

      viewer:
        loading: loading...
        pages:
          zero: "No pages"
          one: "{count} page"
          other: "{count} pages"
    

Chinese (ZH)

      viewer:
        loading: 正在载入…
        pages:
          zero: 没有页面
          other: 网页
    

So instead of having translations for both "one" and "other", we just provide "other" (next to "zero") in Chinese.
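
If you're wondering how a lookup can cope with a missing "one" key, here's a minimal sketch in plain Ruby. It reuses the strings from the files above, but `plural_key` and `pages_label` are hypothetical helpers written for this post, not the API of any i18n library: the idea is simply to fall back to "other" whenever a locale doesn't define the requested form.

```ruby
# Strings copied from the translation files above.
TRANSLATIONS = {
  en: { zero: "No pages", one: "{count} page", other: "{count} pages" },
  zh: { zero: "没有页面", other: "网页" }
}.freeze

# Hypothetical rule: pick the key an English-style lookup would ask for.
def plural_key(count)
  case count
  when 0 then :zero
  when 1 then :one
  else :other
  end
end

# Use the requested form if the locale has it, otherwise fall back to :other.
def pages_label(locale, count)
  forms = TRANSLATIONS.fetch(locale)
  template = forms[plural_key(count)] || forms.fetch(:other)
  template.gsub("{count}", count.to_s)
end

puts pages_label(:en, 1)  # 1 page
puts pages_label(:en, 3)  # 3 pages
puts pages_label(:zh, 1)  # 网页
```

With a fallback like this, the Chinese file never needs an empty "one" entry.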

Problem #3: Search

For search to work properly, there's a process that words go through whenever they're indexed. The most commonly used method is called stemming. This basically means that different forms of a word are reduced to a common form. (Here's an article that explains stemming in more detail.) So, for instance, imagine you index a PDF for a pet shop that has the words cats, catlike and catty. All these words are reduced to the stem "cat", so whenever you search for "cat", they will be matched. This technique, however, isn't applicable to Chinese.

Solution: Search for both the stem and the exact word

We're using Amazon's search service CloudSearch, and luckily, they recently added Chinese support. The way our search engine previously worked was that it would always search for words that contain the stem. But since we wanted to support Chinese (where stemming isn't applicable), we had to implement a failsafe where whole words would also be searched. So, instead of us saying "Amazon, give us words that contain the stem", we started saying "Amazon, give us words that contain the stem OR are equal to the exact word".
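
The "stem OR exact word" idea can be illustrated with a toy in-memory search in plain Ruby. Nothing here talks to CloudSearch: `toy_stem`, `matches?` and `search` are hypothetical helpers, and the stemmer only knows three English suffixes. We also assume it returns nil for anything outside a-z, as a stand-in for "stemming isn't applicable".

```ruby
# Toy stemmer: strips a few English suffixes, gives up (nil) on non a-z input.
def toy_stem(word)
  return nil unless word.match?(/\A[a-z]+\z/)
  word.sub(/(like|ty|s)\z/, "")
end

# A token matches when it contains the query's stem OR equals the query exactly.
def matches?(query, token)
  q = query.downcase
  t = token.downcase
  stem = toy_stem(q)
  stem_hit = !stem.nil? && t.include?(stem)
  exact_hit = t == q
  stem_hit || exact_hit
end

def search(query, tokens)
  tokens.select { |token| matches?(query, token) }
end

p search("cats", %w[cat catlike catty dog])  # ["cat", "catlike", "catty"]
p search("中文", ["中文", "測試"])            # ["中文"]
```

In the second query the stem branch never fires, so the exact-word branch is the only thing that lets the Chinese term match. That's the failsafe in a nutshell.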

Conclusion

We hope this post shed some light on the challenges of making our catalogs compatible for use in China and on how you can address similar problems. Feel free to share this post and leave a comment if you have any questions for our mad scientists.


