[MarkLogic Dev General] Tokenization Questions
mary.holstege at marklogic.com
Fri Aug 31 14:19:42 PDT 2012
On Fri, 31 Aug 2012 13:41:14 -0700, Gabe Luchetta
<gluchetta at catalystsecure.com> wrote:
> Mary: Thank you so much for your response earlier this month on the
> tokenization questions. Can I assume that we would not have the same
> issue with other languages, such as Chinese traditional/simplified or
> Korean? Are there any other "Easter Eggs" regarding language we should
> be aware of?
> Thank you,
In theory you could see it with Korean, but it is less of an issue in
practice than it is for Japanese, because Korean use of the shared
CJK characters is much more limited than it is in Japanese.
The other issue to be aware of for languages and search is that
stemming is case and diacritic sensitive. In German, where the
basic form of a noun is capitalized, the stemmed form of the
lowercased form of a noun will probably not match the stemmed
form of the proper form of the noun. This can be an issue if you
do case-insensitive searches, as then we are looking at the stems
of the lowercase forms, which will probably just be the whole word
itself. So a case-insensitive stemmed search in German is going
to lose on recall. Similarly for a diacritic-insensitive search in
languages that care about accents, such as French. When you
do stemmed searches in these languages you should explicitly
set them to case-/diacritic-sensitive, or add that in as an alternative.
You don't see this in English because in English the proper form
of words has neither uppercase nor diacritics (in general, or the
form without diacritics is an accepted alternative).
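To make that suggestion concrete, here is a sketch in XQuery. The option
names are standard cts:word-query options; the German word and the idea of
searching the whole database are made up for illustration. It runs the
case-sensitive stemmed query (so "Häuser" can stem to "Haus") and ORs in a
case-insensitive variant as the fallback alternative:

```xquery
(: Sketch: stemmed German search that keeps case/diacritic
   sensitivity, with a case-insensitive alternative OR'd in.
   "Häuser" is an illustrative example word. :)
cts:search(
  fn:doc(),
  cts:or-query((
    (: proper capitalized form, case-sensitive, so the stemmer
       sees the form it expects for German nouns :)
    cts:word-query("Häuser",
      ("stemmed", "case-sensitive", "diacritic-sensitive", "lang=de")),
    (: case-insensitive fallback, in case the document itself
       uses a lowercased form :)
    cts:word-query("Häuser",
      ("stemmed", "case-insensitive", "lang=de"))
  ))
)
```

The or-query keeps recall up: hits found by either branch are returned, so
you get the better stemming of the sensitive branch without losing documents
that only match insensitively.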
> Gabe Luchetta
> Product Management
> Catalyst Repository Systems, Inc.
> 1860 Blake Street, Ste. 700
> Denver, CO 80202
> P: 303.824.0820
> C: 720.339.5085
> E: gluchetta at catalystsecure.com
> W: www.catalystsecure.com
> Powering Complex Legal Matters
> On Fri, Aug 17, 2012 at 11:24 AM, Gabe Luchetta
> <gluchetta at catalystsecure.com> wrote:
> I have been assigned to testing the use of non-English languages for our
> software that uses ML and have some questions about tokenization.
> According to the Search Developer's Guide:
> "Asian or Middle Eastern characters will tokenize in a language
> appropriate to the character set, even when they occur in elements that
> are not in their language."
> During my testing I have found that if I tokenize a mixed
> English/Japanese document using English as the tokenized language, it
> DOES tokenize the Japanese, but I get different tokens than I do when I
> process the same document using Japanese as the tokenized language. I
> assume this is because tokens within the detected character set are
> shared between multiple Asian languages, or that it is relying on
> simpler segmentation methods instead of really tokenizing the text, but
> would like to have some more detail so that we can properly explain this
> to our clients.
> Since we are using the built-in language detection to identify languages
> at document level, this is proving to be problematic. If a document only
> has a bit of Japanese in it, the Japanese score returned will be lower
> than the English score, and we will likely mark the document as English.
> If a user then attempts to search the Japanese content using Japanese as
> the language option in the search, they won't get a hit on this
> document. They will only get a hit if they construct their search the
> same way it was tokenized and select English as the search option.
> I know this is a complex topic, but would appreciate whatever guidance
> you could provide.
> Thank you,