[MarkLogic Dev General] Tokenization Questions

Mary Holstege mary.holstege at marklogic.com
Fri Aug 17 10:53:09 PDT 2012

On Fri, 17 Aug 2012 10:24:06 -0700, Gabe Luchetta  
<gluchetta at catalystsecure.com> wrote:

> I have been assigned to testing the use of non-English languages for our  
> software that uses ML and have some questions about tokenization.
> According to the Search Developer's  
> Guide<http://developer.marklogic.com/pubs/4.1/books/search-dev-guide.pdf>,  
> "Asian or Middle Eastern characters will tokenize in a language  
> appropriate to the character set, even when they occur in elements that  
> are not in their language."
> During my testing I have found that if I tokenize a mixed  
> English/Japanese document using English as the tokenized language, it  
> DOES tokenize the Japanese, but I get different tokens than I do when I  
> process the same document using Japanese as the tokenized language. I  
> assume this is because tokens within the detected character set are  
> shared between multiple Asian languages, or that it is relying on  
> simpler segmentation methods instead of really tokenizing the text, but  
> would like to have some more detail so that we can properly explain this  
> to our clients.
> Since we are using the built-in language detection to identify languages  
> at document level, this is proving to be problematic. If a document only  
> has a bit of Japanese in it, the Japanese score returned will be lower  
> than the English score, and we will likely mark the document as English.  
> If a user then attempts to search the Japanese content using Japanese as  
> the language option in the search, they won't get a hit on this  
> document. They will only get a hit if they construct their search the  
> same way it was tokenized and select English as the search option.
> I know this is a complex topic, but would appreciate whatever guidance  
> you could provide.

The way tokenization works is that we look for runs of text in some
particular language. If we only have the script to go on, rather than
an explicit language identifier (e.g. xml:lang="ja"), then we use some
basic rules to make a guess. If the current language context is English
and the character we are looking at is a CJK character (one shared by
Chinese and Japanese), the default assumption is Chinese. If it were a
Japanese-only character (e.g. kana), the default assumption would be
Japanese. If we were in a Japanese language context and we saw a CJK
character, the assumption would be that we're still looking at Japanese.
So that is probably why you are seeing the difference: the same
characters are tokenized as Chinese in one case and as Japanese in the
other. This is especially an issue with Japanese, which uses multiple
scripts, and where some words will start with CJK characters. The
tokenizer doesn't look ahead to realize that some Japanese characters
are coming up and that Chinese might be a bad guess.
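
To make the rules above concrete, here is a rough sketch in Python. It is
not MarkLogic's implementation; the function names (script_of,
guess_language, language_runs), the code point ranges, and the choice to
leave Latin text and punctuation in the context language are all
assumptions made for illustration. It only models the guess described
above: kana implies Japanese, a shared CJK ideograph implies Japanese in a
Japanese context and Chinese otherwise, and there is no lookahead.

```python
def script_of(ch):
    """Coarse script bucket for one character (illustrative ranges only)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x30FF:      # Hiragana and Katakana blocks
        return "kana"
    if 0x4E00 <= cp <= 0x9FFF:      # CJK Unified Ideographs (shared)
        return "han"
    return "other"


def guess_language(ch, context_lang):
    """Guess a language for one character, given the language context."""
    script = script_of(ch)
    if script == "kana":
        return "ja"                 # Japanese-only script
    if script == "han":
        # Shared ideograph: stay Japanese in a Japanese context,
        # otherwise default to Chinese. No lookahead is attempted.
        return "ja" if context_lang == "ja" else "zh"
    return context_lang             # assumption: everything else keeps context


def language_runs(text, context_lang):
    """Group consecutive characters that received the same language guess."""
    runs = []
    for ch in text:
        lang = guess_language(ch, context_lang)
        if runs and runs[-1][0] == lang:
            runs[-1] = (lang, runs[-1][1] + ch)
        else:
            runs.append((lang, ch))
    return runs


print(language_runs("Tokyo 東京で", "en"))
print(language_runs("Tokyo 東京で", "ja"))
```

In an English context the sketch splits the Japanese text into a Chinese
run followed by a Japanese run, while in a Japanese context the whole
string stays in a single Japanese run. That is the kind of difference in
tokens described in the question.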


Mary.Holstege at marklogic.com
Principal Engineer
MarkLogic Corporation
