[MarkLogic Dev General] Element range index collation for Strings

Mary Holstege mary.holstege at marklogic.com
Wed Mar 21 09:03:08 PDT 2012


On Wed, 21 Mar 2012 08:11:55 -0700, Palani TT <palanitt at gmail.com> wrote:

> Hi,
>
> I am confused about which collation to use when defining a range
> index for a String type. I understand that the 'root collation' (the
> default collation for String) returns duplicate results. So, would
> the 'Unicode Codepoint' collation alone suffice for a String range
> index, or should I have both the 'root collation' and the 'Unicode
> Codepoint' collation defined for String range indexes?
>
> Thanks,
> Palani

I don't quite know what you mean by "returns duplicate results".

The collation you should use depends on what values you want to
consider equivalent and what order you want things to appear
in.

The Unicode codepoint collation will order all the uppercase
values before any of the lowercase values and will store
values beginning with a letter with a diacritic after all of
those.  It will store distinct entries for all the variants.
So "Resume", "resume", "résume", and "Résumé" are all different
entries in the range index. If you are doing a case/diacritic
insensitive match against that range index (say, for "resume"),
we'll need to scan through all the values starting with R, then
all the values starting with S, T, ..., Z, a, b, ..., and r to
check.
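
To make the ordering concrete, here is a rough sketch in Python, whose
built-in string comparison happens to be by code point. It is only a toy
stand-in for a codepoint-ordered index, not MarkLogic code; the sample
values and variable names are made up for illustration.

```python
# Toy stand-in for a codepoint-ordered range index (not MarkLogic code).
# \u00e9 is the precomposed (NFC) form of "e with acute accent".
values = [
    "resume", "R\u00e9sum\u00e9", "apple", "Resume",
    "Zebra", "r\u00e9sume", "\u00e9lan",
]

# Python compares str values code point by code point, so sorted()
# yields codepoint order: uppercase first, then lowercase, then
# values that begin with an accented letter.
codepoint_order = sorted(values)
print(codepoint_order)

# A case/diacritic insensitive match on "resume" has to walk from the
# first "R..." entry through the last "r..." entry, checking each one.
first = next(i for i, v in enumerate(codepoint_order) if v.startswith("R"))
last = max(i for i, v in enumerate(codepoint_order) if v.startswith("r"))
scanned = codepoint_order[first:last + 1]
print(len(scanned), "entries scanned:", scanned)
```

In this toy data the scanned span includes "Zebra" and "apple", which
have nothing to do with the match but sit between the R and r entries.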

The root collation will order all the words starting with 'a'
before any word starting with 'b', regardless of case or
diacritics on the a. The default strength on the root collation
is S3, so case and diacritical variants are still stored
separately. The root collation will collapse (treat as equivalent)
normalization variants (e.g. precomposed "é" vs "e" followed by a
combining accent), but for string range indexes this makes no
practical difference, as all the strings are normalized to NFC
before we put them in the index anyway. If you use strength S1
instead, then case and diacritic variants will be collapsed, so
for "Resume", "resume", "résume", and "Résumé" there will be only
one entry in the index. This can make case/diacritic insensitive
matching much more efficient (but case/diacritic *sensitive*
matching impossible).
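
As a rough illustration of the collapsing described above, the Python
sketch below approximates a strength-1 key by stripping diacritics and
folding case. This is an assumption-laden approximation, not the actual
collation algorithm; primary_key is a hypothetical helper name and the
sample words are made up.

```python
import unicodedata


def primary_key(text):
    """Hypothetical helper: approximate an S1-style key by dropping
    combining marks and folding case. Not a real collation key."""
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


variants = ["Resume", "resume", "r\u00e9sume", "R\u00e9sum\u00e9"]

# S3-like behaviour: every case/diacritic variant is its own entry.
s3_like_entries = set(variants)
# S1-like behaviour: all four variants collapse to a single entry.
s1_like_entries = {primary_key(v) for v in variants}
print(len(s3_like_entries), len(s1_like_entries))

# Normalization variants: precomposed "é" vs "e" + combining acute.
# They differ as raw strings but are identical once normalized to NFC.
precomposed = "\u00e9"
combining = "e\u0301"
same_after_nfc = unicodedata.normalize("NFC", combining) == precomposed

# Root-like ordering: letters group together regardless of case or
# accent; the original string is used only to break ties.
words = ["banana", "\u00c1rbol", "apple", "Avocado"]
root_like_order = sorted(words, key=lambda w: (primary_key(w), w))
print(root_like_order)
```

Note that once the variants have been collapsed to one key, there is no
way to tell from the key alone which variant was stored, which is why
sensitive matching is lost.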

If you are not collapsing values, the codepoint collation
is generally about 10% faster in its operations.

Note that when string ranges are used to optimize queries,
the collation on the range index has to match the query
collation, so you are generally better off picking a
consistent collation that matches your appserver default
collation.
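
The matching requirement can be pictured with a toy model like the one
below. It is a sketch under stated assumptions: the element name "title",
the configured_indexes set, and can_use_range_index are all hypothetical,
and the collation strings are used here only as opaque labels for the
codepoint and root collations.

```python
# Collation identifiers, treated purely as opaque labels in this sketch.
CODEPOINT = "http://marklogic.com/collation/codepoint"
ROOT = "http://marklogic.com/collation/"

# Hypothetical configuration: one string range index on <title>,
# defined with the codepoint collation only.
configured_indexes = {("title", CODEPOINT)}


def can_use_range_index(element, query_collation):
    """Hypothetical check: an index helps only if one exists for this
    element with exactly the collation the query is using."""
    return (element, query_collation) in configured_indexes


print(can_use_range_index("title", CODEPOINT))  # collations match
print(can_use_range_index("title", ROOT))       # collations differ
```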

//Mary

Mary.Holstege at marklogic.com
Principal Engineer
MarkLogic Corporation

