
Solr vs Simplified/Traditional Chinese

Alex Lu

Simplified Chinese is used mainly and officially in China and Singapore. Traditional Chinese is used in Taiwan and Hong Kong (despite the fact that Hong Kong is officially part of China). The two are identical in pretty much every aspect except writing: many traditional characters are much more difficult to write than their simplified counterparts. However, not every character is converted into a simpler form. An example:

England

英格蘭 (Traditional Chinese)
英格兰 (Simplified Chinese)

As you can see, the first two characters remain unchanged as they are simple enough, but the much more difficult last character is simplified. A Chinese character is made of strokes, and the number of strokes determines the complexity of a character. The last character of England, "蘭" in traditional Chinese, consists of 21 strokes, whereas "兰" in simplified consists of 5.
For every character, there's a particular order in which the strokes are written. Below is an example of how to complete "Home".

Now, let's look into Solr vs. simplified/traditional Chinese

Sorting

You can sort Chinese text by number of strokes, but people don't find it useful for search results: anyone can count strokes, but no one memorizes the stroke count of every character. Usually contextual sorting is more applicable, e.g. sorting by price or promotion on an e-commerce site. However, number of strokes is often used in filing and categorisation, similar to A to Z, as it is the easiest and most popular method of organising files. It's also used quite often in directory listings. There's another way of sorting, by radicals, but it's pretty much the same concept - I won't bore you with the details.
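As a rough illustration of stroke-count ordering for something like a directory listing, here is a minimal Python sketch. The `STROKES` table and the `stroke_key` helper are made up for this example; a real system would need a complete stroke-count resource, and this is not how Solr itself sorts.

```python
# Hypothetical, hand-built stroke-count table (assumption for this sketch).
# A real listing would load counts from a full dictionary resource.
STROKES = {"丁": 2, "王": 4, "李": 7, "林": 8}


def stroke_key(name):
    """Sort key: stroke count per character, then the text as a tie-breaker.

    Characters missing from the table sort last.
    """
    counts = [STROKES.get(ch, float("inf")) for ch in name]
    return (counts, name)


surnames = ["林", "王", "丁", "李"]
by_strokes = sorted(surnames, key=stroke_key)
print(by_strokes)  # ['丁', '王', '李', '林']
```

The tie-breaker matters in practice because many characters share the same stroke count.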

Stemming

Stemming doesn't apply to Chinese.

Characters in Chinese don't change with tense, the way "go" changes to "going" in English. In Chinese, you only need to add characters that represent the time of your action, e.g.

I'm going tomorrow

我(I) 明天(tomorrow) 去(go)

There are characters which don't mean anything on their own but can be used to change the purpose of a sentence. For example:

你(you) 需要(need) 幫忙(help) (you need help)
你(you) 需要(need) 幫忙(help) 嗎 (do you need help?)

Adding "嗎" at the end changes the sentence to a question.

你(you) 可以(can)
你(you) 可以(can) 吧

Adding "吧" at the end reduces the level of confidence of "you can".

Spellcheck

Now for a brief introduction to how mistakes are usually made. There are plenty of input methods, and two of them are the most popular: Pinyin (for simplified Chinese) and Phonetic, also known as Zhuyin (for traditional Chinese). Both are phonetically based, and many characters share the same phonetic. The image below shows available characters for the same phonetic.

This is where mistakes start. I might choose the wrong character if I navigate the list too quickly or simply don't know the correct one.

There are three types of mistakes:

1. Same phonetic, similar pattern of strokes and grammatically correct, e.g.

採(pick) 草莓(strawberry)
踩(step on) 草莓(strawberry)

As you can see, "pick" and "step on" are almost identical apart from their radicals (the left side of the characters) and they both make valid terms. Do we treat these as synonyms? Probably not, because they have completely different meanings - although the latter is a rare activity :)

2. Same phonetic but grammatically incorrect, e.g. taking the strawberry example from above:
跐(slip) 草莓(strawberry)

3. Same phonetic, completely different character and grammatically correct, e.g.
電器 (electronics)
電氣 (electric)

There's not a lot of information about how Solr handles spell-check in Chinese. I did find a post that suggests using a pre-defined dictionary instead of building a dictionary when indexing.
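To make the pre-defined dictionary idea a bit more concrete, here is a small Python sketch that looks up other dictionary words sharing the same reading, which is where all three types of mistake come from. The `DICTIONARY` contents, the Pinyin-with-tone-number notation and the `homophones` helper are assumptions for illustration only; this is not Solr's spellcheck component.

```python
from collections import defaultdict

# Hypothetical pre-defined dictionary: word -> reading (Pinyin with tone numbers).
DICTIONARY = {
    "電器": "dian4 qi4",
    "電氣": "dian4 qi4",
    "採": "cai3",
    "踩": "cai3",
    "草莓": "cao3 mei2",
}

# Index the dictionary by reading so homophones can be looked up directly.
BY_READING = defaultdict(list)
for word, reading in DICTIONARY.items():
    BY_READING[reading].append(word)


def homophones(word):
    """Return other dictionary words that share the reading of `word`."""
    reading = DICTIONARY.get(word)
    if reading is None:
        return []
    return [w for w in BY_READING[reading] if w != word]


print(homophones("電器"))  # ['電氣']
print(homophones("採"))    # ['踩']
print(homophones("草莓"))  # []
```

Whether a homophone should be offered as a suggestion still depends on context, as the "pick" versus "step on" example shows.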

Encoding

The encoding must be the same from indexing through to querying, preferably UTF-8. If, for some reason, Solr cannot return correct results, check the encoding first: some characters are visually identical but are not stored as the same data.
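The Python sketch below shows two ways this can go wrong, independent of Solr: the same text produces different bytes under different encodings, and two characters can look the same while having different code points. The choice of Big5 and of the compatibility ideograph U+F900 is just for illustration.

```python
import unicodedata

text = "英格蘭"

# The same text is a different byte sequence in UTF-8 and in Big5.
utf8_bytes = text.encode("utf-8")
big5_bytes = text.encode("big5")
print(len(utf8_bytes), len(big5_bytes))  # 9 6

# Reading UTF-8 bytes as if they were Big5 does not give the original text back.
garbled = utf8_bytes.decode("big5", errors="replace")
print(garbled == text)  # False

# U+F900 is a compatibility ideograph that looks like U+8C48 (豈),
# but the two only compare equal after Unicode normalization.
compat = "\uf900"
unified = "\u8c48"
print(compat == unified)  # False
print(unicodedata.normalize("NFC", compat) == unified)  # True
```

If indexed text and query text go through different encodings or normalization steps, a term that looks correct on screen may simply never match.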

Analyser

CJK is the most popular analyser for indexing. However, it's based purely on grouping adjacent characters. It's worth mentioning that a Chinese character on its own could mean something or nothing; usually a group of characters forms a meaningful word. That's the idea CJK works on. There's another analyser called Paoding, which indexes using a large set of dictionaries. I haven't personally tried it yet, but it's reported to produce more accurate results. Check out this article for a comparison.
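To show the difference between the two approaches, here is a minimal Python sketch. `bigrams` mimics the overlapping two-character grouping that CJK-style analysers produce, and `max_match` is a simple forward maximum matching segmenter with a tiny made-up word list. The function names and the `WORDS` set are assumptions for this example, and `max_match` is a generic dictionary technique, not Paoding's actual algorithm.

```python
def bigrams(text):
    """Split text into overlapping two-character groups."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def max_match(text, dictionary):
    """Forward maximum matching: take the longest dictionary word at each position.

    Falls back to a single character when no dictionary word matches.
    """
    longest = max([len(word) for word in dictionary] + [1])
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical word list; a real dictionary-based analyser ships a large one.
WORDS = {"明天", "需要", "幫忙"}

print(bigrams("我明天去"))           # ['我明', '明天', '天去']
print(max_match("我明天去", WORDS))  # ['我', '明天', '去']
```

The bigram output contains groups such as "我明" and "天去" that are not words, which is one reason a dictionary-based approach can give more accurate matches.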

Recommended/Interesting Reading

http://people.w3.org/rishida/scripts/chinese/
http://java.dzone.com/articles/indexing-chinese-solr
http://www.slis.tsukuba.ac.jp/~hideo/development/installing_apache_solr_on_mac_os_x (Solr for Japanese, which is very similar to Chinese but could be even more complicated as they have three different sets of characters.)
