[MarkLogic Dev General] Strange stemming behavior
Marc Moskowitz
mmoskowitz at ifactory.com
Fri Jul 20 13:04:57 PDT 2007
I have more questions about stemming. The query:
let $x := <text xml:lang="fr">sont es ès</text>,
    $query1 := cts:word-query("être", ("lang=fr")),
    $query2 := cts:word-query("suis", ("lang=fr"))
return (
    cts:highlight($x, $query1, element hit {$cts:text}),
    cts:highlight($x, $query2, element hit {$cts:text})
)
produces the results:
<text xml:lang="fr"><hit>sont</hit> <hit>es</hit> ès</text>
<text xml:lang="fr"><hit>sont</hit> <hit>es</hit> <hit>ès</hit></text>
This suggests that, for stemmed matches, diacritic sensitivity is
inferred from whether the original search term contains diacritics:
"être" (which has one) matches diacritic-sensitively and skips ès, while
"suis" (which has none) matches diacritic-insensitively and also hits ès.
This seems incorrect, since the stemmer should in theory already know the
correct diacritics for each stemmed form. In this case in particular, ès
is completely unrelated to être. Is this behavior something we can change
at the database level, or in some other way that does not require
specifying "diacritic-sensitive" on the base query?
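For reference, the per-query workaround I would like to avoid looks roughly like the sketch below. It assumes that passing the "diacritic-sensitive" option alongside "lang=fr" is enough to override the inferred setting for stemmed matches; I have not confirmed that it changes which of the three words get highlighted.

```xquery
(: Sketch only: sets diacritic sensitivity explicitly on the query
   instead of letting it be inferred from the search term.
   Assumption: this option also governs the stemmed matches. :)
let $x := <text xml:lang="fr">sont es ès</text>,
    $query := cts:word-query("suis", ("lang=fr", "diacritic-sensitive"))
return
    cts:highlight($x, $query, element hit {$cts:text})
```

Doing this on every query is what I am hoping a database-level setting could replace.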
Marc Moskowitz
Interactive Factory