[MarkLogic Dev General] Strange stemming behavior

Marc Moskowitz mmoskowitz at ifactory.com
Fri Jul 20 13:04:57 PDT 2007


I have more questions about stemming. The query:

let $x := <text xml:lang="fr">sont es ès</text>,
$query1 := cts:word-query("être", ("lang=fr")),
$query2 := cts:word-query("suis", ("lang=fr"))
return (
cts:highlight($x, $query1, element hit {$cts:text}),
cts:highlight($x, $query2, element hit {$cts:text})
)

produces the results:

<text xml:lang="fr"><hit>sont</hit> <hit>es</hit> ès</text>
<text xml:lang="fr"><hit>sont</hit> <hit>es</hit> <hit>ès</hit></text>

This suggests that, when matching stemmed forms, diacritic sensitivity is
decided by whether the original search term contains diacritics. That
seems incorrect, since the stemmer in theory knows the correct diacritics
for each stemmed form. In this case in particular, ès is completely
unrelated to être. Is this behavior something we can change at the
database level, or in some other way, without specifying
"diacritic-sensitive" on the base query?
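
For reference, the per-query workaround I would like to avoid is, as far
as I understand it, passing "diacritic-sensitive" explicitly on the base
query. A minimal sketch, assuming that option also governs how the stemmed
forms are matched (I have not verified what it returns):

let $x := <text xml:lang="fr">sont es ès</text>,
(: explicit option, so sensitivity should no longer be inferred from
   whether the search term "suis" itself contains diacritics :)
$query := cts:word-query("suis", ("lang=fr", "diacritic-sensitive"))
return cts:highlight($x, $query, element hit {$cts:text})
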
Marc Moskowitz
Interactive Factory


