[MarkLogic Dev General] Word Boundaries in Chinese?

Mary Holstege mary.holstege at marklogic.com
Wed May 7 15:47:15 PDT 2008


On Wed, 07 May 2008 14:31:16 -0700, Marc Moskowitz  
<mmoskowitz at ifactory.com> wrote:

> I'm seeing some odd behavior when searching for text in Chinese. It  
> seems that the server is making decisions about word boundaries based on  
> some internal criteria.
>
> This XQuery:
> let $q := '意',
> $doc := (
> <yo>好意思</yo>,
> <yo>意料</yo>,
> <yo>好意</yo>,
> <yo>词不达達意</yo>)
> for $d in $doc
> let $h := cts:highlight($d, $q, <hey>{$cts:text}</hey>)
> return (count($h//hey), $h)
>
> produces this result:
>
> 0
> <yo>好意思</yo>
> 1
> <yo><hey>意</hey>料</yo>
> 0
> <yo>好意</yo>
> 1
> <yo>词不达達<hey>意</hey></yo>
>
>
> Is there some way of affecting where these boundaries are placed? Or of  
> turning this functionality fully on or off?
> -Marc

You're seeing this because the rules of Chinese tokenization treat 意
as part of a longer token (word) in the first and third <yo> elements,
so a search for 意 on its own does not match there.
It is analogous to searching for "black" and expecting a hit on
"blackbird". If you use a license that has no Chinese support, the
non-language-aware tokenization kicks in instead and every character
is treated as a distinct word.
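
To see where the boundaries fall, one option is to ask the server for the
tokens directly. The following is only a sketch: it assumes your server
version provides cts:tokenize and that "zh" is the language code under
which Chinese tokenization is licensed. The <tokens> and <word> element
names are made up for the illustration.

```xquery
(: Sketch: list the word tokens produced for each sample string.
   Assumes cts:tokenize is available and that "zh" selects Chinese. :)
for $s in ("好意思", "意料", "好意", "词不达達意")
return
  <tokens text="{$s}">{
    (: keep only word tokens; punctuation and whitespace tokens are skipped :)
    for $t in cts:tokenize($s, "zh")
    where $t instance of cts:word
    return <word>{fn:string($t)}</word>
  }</tokens>
```

A string in which 意 comes back as its own <word> should be one where a
search for 意 alone can match. Where it comes back inside a longer word,
the search has to use that longer word.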

//Mary

