[MarkLogic Dev General] Element range index collation for Strings

Palani TT palanitt at gmail.com
Wed Mar 21 09:38:42 PDT 2012


Hi Mary,

Regarding the "duplicate results" comment in my previous email: I have read
posts on MarkLogic developer community forums in which developers reported
getting duplicate values out of the index while using the 'root collation',
and said the anomaly went away after they replaced it with the 'Unicode
codepoint collation'.

Coming back to our discussion on root collation vs. Unicode codepoint
collation, here is what I understand from your input:

For case/diacritic-insensitive matches, the 'root collation' would be the
better pick; for case/diacritic-sensitive matches, the 'Unicode codepoint
collation' would be.

Please correct me if my understanding is wrong.

Thanks,
Palani

On Wed, Mar 21, 2012 at 12:03 PM, Mary Holstege <mary.holstege at marklogic.com> wrote:

> On Wed, 21 Mar 2012 08:11:55 -0700, Palani TT <palanitt at gmail.com> wrote:
>
> > Hi,
> >
> > I am unsure which collation to use when defining a range index for a
> > String type. I understand that the 'root collation' (the default
> > collation for String) returns duplicate results. So, would the 'Unicode
> > Codepoint' collation alone suffice for a String range index, or
> > should I define both the 'root collation' and the 'Unicode
> > Codepoint' collation for String range indexes?
> >
> > Thanks,
> > Palani
>
> I don't quite know what you mean by "returns duplicate results".
>
> The collation you should use depends on what values you want to
> consider equivalent and what order you want things to appear
> in.
>
> The Unicode codepoint collation will order all the uppercase
> values before any of the lowercase values and will store
> values beginning with a letter with a diacritic after all of
> those.  It will store distinct entries for all the variants.
> So "Resume", "resume", "résume", and "Résumé" are all different
> entries in the range index. If you are doing a case/diacritic
> insensitive match against that range index, we'll need to
> scan through all the values starting with R then all the words
> starting with S, T,..., Z, a, b, ..., and r to check.
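
The ordering described above can be reproduced outside MarkLogic. The following is a minimal Python sketch, not MarkLogic code: it assumes that comparing NFC-normalized strings code point by code point is a fair stand-in for the Unicode codepoint collation, and the word list and the helper name `codepoint_order` are made up for illustration.

```python
import unicodedata

def codepoint_order(values):
    """Sort distinct strings by raw Unicode code point after NFC normalization."""
    # Normalize first, mirroring the statement that range index strings are
    # stored in NFC. The set keeps one entry per distinct normalized value.
    normalized = {unicodedata.normalize("NFC", v) for v in values}
    # Python compares str objects code point by code point.
    return sorted(normalized)

# Illustrative values only (not taken from any real index).
words = ["apple", "resume", "Zebra", "Resume", "Élan", "Résumé", "résume"]
ordered = codepoint_order(words)
print(ordered)
```

In this sketch every uppercase ASCII initial sorts before every lowercase one, the word with an accented initial lands last, and each case or diacritic variant remains a separate entry.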
>
> The root collation will order all the words starting with 'a'
> before any word starting with 'b', regardless of case or
> diacritics on the a. The default strength on the root collation
> is S3, so case and diacritical variants are still stored
> separately. The root collation will collapse (treat as equivalent)
> normalization variants (e.g. "é" vs "e"+accent) but for
> string range indexes this makes no practical difference as all
> the strings are normalized to NFC before we put them in the
> index anyway. If you use S1, then case and diacritic variants will
> be collapsed, so for "Resume", "resume", "résume", and "Résumé"
> there will be only one entry in the index. This can make case/
> diacritic insensitive matching much more efficient (but
> case/diacritic *sensitive* matching impossible).
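
To make the collapsing behaviour concrete, here is a rough Python approximation. It is not the ICU/UCA algorithm that a real collation uses: the hypothetical helper `primary_key` simply strips combining marks and case-folds, which is only an assumption about what a strength-1 (S1) comparison ignores. The sample words other than the four "resume" variants are invented.

```python
import unicodedata

def primary_key(value):
    """Approximate a strength-1 key: ignore case and diacritics (assumption)."""
    # Decompose so that accents become separate combining characters.
    decomposed = unicodedata.normalize("NFD", value)
    # Drop the combining marks, keeping only base characters.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()

variants = ["Resume", "resume", "résume", "Résumé"]

# With case and diacritics kept, all four variants stay distinct.
distinct_nfc = {unicodedata.normalize("NFC", v) for v in variants}
# With the S1-like key, they collapse to a single entry.
distinct_s1 = {primary_key(v) for v in variants}

# Normalization variants: "e" followed by a combining acute accent
# composes to the single precomposed character under NFC.
composed = unicodedata.normalize("NFC", "e\u0301")

# Ordering by the primary key puts every 'a' word before any 'b' word,
# regardless of case or accents. Ties keep input order here, which is
# NOT how a real collation breaks ties.
by_letter = sorted(["banana", "Apple", "árbol", "apple"], key=primary_key)

print(len(distinct_nfc), distinct_s1, by_letter)
```

The point of the sketch is the trade-off stated above: once the variants share one key, an insensitive lookup touches a single entry, but the information needed for a sensitive match is gone.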
>
> If you are not collapsing values, the codepoint collation
> is generally about 10% faster in its operations.
>
> Note that when string ranges are used to optimize queries,
> the collation on the range index has to match the query
> collation, so you are generally better off picking a
> consistent collation that matches your appserver default
> collation.
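
The matching rule can be pictured with a toy model. This Python sketch is purely illustrative and is not how MarkLogic resolves indexes internally: the element name `title`, the helper `usable_index`, and the idea of modelling indexes as (element, collation) pairs are all assumptions. The two collation URIs are the usual MarkLogic forms for the root and codepoint collations.

```python
# Collation URIs for the root and codepoint collations.
ROOT = "http://marklogic.com/collation/"
CODEPOINT = "http://marklogic.com/collation/codepoint"

def usable_index(indexes, element, query_collation):
    """Return True only if a range index exists with the exact same collation."""
    # Toy rule: an index on the right element with a different collation
    # does not help the query.
    return (element, query_collation) in indexes

# Hypothetical configuration: one string range index on <title>.
configured = {("title", CODEPOINT)}

print(usable_index(configured, "title", CODEPOINT))  # collations agree
print(usable_index(configured, "title", ROOT))       # collations differ
```

Under this model, keeping the index collation equal to the app server default collation means ordinary queries line up with the index without specifying a collation each time.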
>
> //Mary
>
> Mary.Holstege at marklogic.com
> Principal Engineer
> MarkLogic Corporation