[MarkLogic Dev General] Tokenization Questions
gluchetta at catalystsecure.com
Fri Aug 31 13:41:14 PDT 2012
Mary: Thank you so much for your response earlier this month on the
tokenization questions. Can I assume that we would not have the same issue
with other languages, such as Traditional/Simplified Chinese or Korean? Are
there any other "Easter Eggs" regarding language we should be aware of?
Catalyst Repository Systems, Inc.
1860 Blake Street, Ste. 700
Denver, CO 80202
E: gluchetta at catalystsecure.com
*Powering Complex Legal Matters*
On Fri, Aug 17, 2012 at 11:24 AM, Gabe Luchetta <
gluchetta at catalystsecure.com> wrote:
> I have been assigned to testing the use of non-English languages for our
> software that uses ML and have some questions about tokenization.
> According to the Search Developer's Guide <http://developer.marklogic.com/pubs/4.1/books/search-dev-guide.pdf>,
> "Asian or Middle Eastern characters will tokenize in a language appropriate
> to the character set, even when they occur in elements that are not in
> their language."
> During my testing I have found that if I tokenize a mixed English/Japanese
> document using English as the tokenized language, it DOES tokenize the
> Japanese, but I get different tokens than I do when I process the same
> document using Japanese as the tokenized language. I assume this is because
> tokens within the detected character set are shared between multiple Asian
> languages, or that it is relying on simpler segmentation methods instead of
> really tokenizing the text, but would like to have some more detail so that
> we can properly explain this to our clients.
> Since we are using the built-in language detection to identify languages
> at document level, this is proving to be problematic. If a document only
> has a bit of Japanese in it, the Japanese score returned will be lower than
> the English score, and we will likely mark the document as English. If a
> user then attempts to search the Japanese content using Japanese as the
> language option in the search, they won't get a hit on this document. They
> will only get a hit if they construct their search the same way it was
> tokenized and select English as the search option.
> I know this is a complex topic, but would appreciate whatever guidance you
> could provide.
> Thank you,