[MarkLogic Dev General] Tokenization Questions
gluchetta at catalystsecure.com
Fri Aug 17 10:24:06 PDT 2012
I have been assigned to test non-English language support in our software,
which uses MarkLogic, and I have some questions about tokenization.
According to the Search Developer's Guide:
"Asian or Middle Eastern characters will tokenize in a language appropriate
to the character set, even when they occur in elements that are not in
During my testing I have found that if I tokenize a mixed English/Japanese
document using English as the tokenization language, it DOES tokenize the
Japanese, but I get different tokens than I do when I process the same
document using Japanese as the tokenization language. I assume this is
either because tokens within the detected character set are shared between
multiple Asian languages, or because it relies on simpler segmentation
methods instead of really tokenizing the text, but I would like some more
detail so that we can properly explain this to our clients.
Since we are using the built-in language detection to identify languages at
the document level, this is proving to be problematic. If a document has only
a bit of Japanese in it, the Japanese score returned will be lower than the
English score, and we will likely mark the document as English. If a user
then attempts to search the Japanese content using Japanese as the language
option in the search, they won't get a hit on this document. They will only
get a hit if they construct their search the same way the document was
tokenized and select English as the search option.
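
To make the document-level problem concrete, here is a minimal Python sketch.
It is NOT MarkLogic's language detection; it is a toy stand-in that scores a
document by the share of letters in each script and picks the top score. The
function names (script_of, language_scores, document_language) and the sample
text are hypothetical and exist only for illustration.

```python
import unicodedata


def script_of(ch):
    """Return a rough language bucket for one character (illustrative only).

    Assumption: kanji are counted as "ja" here, even though the same
    ideographs are shared with Chinese and other languages.
    """
    if not ch.isalpha():
        return None  # skip spaces, digits, punctuation
    name = unicodedata.name(ch, "")
    if name.startswith(("CJK UNIFIED", "HIRAGANA", "KATAKANA")):
        return "ja"
    if name.startswith("LATIN"):
        return "en"
    return "other"


def language_scores(text):
    """Score each bucket by its share of the letters in the text."""
    counts = {}
    for ch in text:
        bucket = script_of(ch)
        if bucket is not None:
            counts[bucket] = counts.get(bucket, 0) + 1
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


def document_language(text):
    """Pick the single highest-scoring bucket for the whole document."""
    scores = language_scores(text)
    return max(scores, key=scores.get) if scores else None


# Hypothetical mixed document: mostly English, one Japanese word.
doc = "The quarterly report mentions 東京 only once in passing."
scores = language_scores(doc)
label = document_language(doc)
print(scores, label)
```

With a single label per document, the small Japanese portion is outvoted and
the whole document is treated as English, which is the situation described
above: a search run with Japanese as the language option is tokenized
differently from the stored content and can miss it.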
I know this is a complex topic, but would appreciate whatever guidance you