[MarkLogic Dev General] Diacritical insensitive search

Danny Sokolsky dsokolsky at marklogic.com
Thu Aug 30 15:26:09 PDT 2007


Peter,

You are correct in your analysis.  The character "Ł" is not a diacritic, as it does not decompose to an L and a slash, as you say.  You can verify that by running the following:

xdmp:diacritic-less("Ł")

which returns "Ł".
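The same thing can be checked from the Unicode side by normalizing each character to NFD and listing the resulting codepoints. This is only a sketch: it assumes a server release that accepts the "1.0-ml" dialect, and it uses only the standard functions fn:normalize-unicode and fn:string-to-codepoints.

```xquery
xquery version "1.0-ml";

(: For each character, list the decimal codepoints of its NFD form.
   A character with a canonical decomposition yields a base letter
   followed by a combining mark; one without yields a single codepoint. :)
for $ch in ("Ł", "ó", "ź")
return fn:concat(
  $ch, " -> ",
  fn:string-join(
    for $cp in fn:string-to-codepoints(fn:normalize-unicode($ch, "NFD"))
    return fn:string($cp),
    " "))
```

"ó" and "ź" should each come back as a base letter plus 769 (U+0301, the combining acute accent), while "Ł" should come back as the single codepoint 321 (U+0141), because it has no canonical decomposition.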

Using collations, it is possible to choose a collation that will compare "Ł" and "L" as equal.  That is because collations have more complicated rules, and their notion of a diacritic takes in some characters that are not diacritics in the decomposition sense.  So the following returns true:

default collation = "http://marklogic.com/collation//S1/EO"

"Ł" eq "L"
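If you would rather not change the default collation for the whole module, the same comparison can be sketched with the collation passed explicitly to the standard fn:compare function. This assumes a release that accepts the "1.0-ml" dialect; the collation URI is the one given above, and the result should match the eq comparison.

```xquery
xquery version "1.0-ml";

(: Compare the two strings under the collation from the message above
   instead of the module's default collation. fn:compare returns 0
   when the collation treats the strings as equal. :)
let $collation := "http://marklogic.com/collation//S1/EO"
return fn:compare("Ł", "L", $collation) eq 0
```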

This means that if your text is an element value, then you can create a string range index with the above collation and element value lookups for lodz will match the text in question.  If your text is not an element value, then this may not help you.  There is no way to "add a rule".
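As a rough illustration of the range index approach, the lookup might look like the following. The element name city is hypothetical, and the query assumes a string range index has already been configured on that element with exactly this collation; without the index the query will raise an error. It also assumes a release that provides cts:element-range-query and the "1.0-ml" dialect.

```xquery
xquery version "1.0-ml";

(: Assumes a string range index on a hypothetical <city> element,
   configured with the collation named in the options below. :)
cts:search(
  fn:doc(),
  cts:element-range-query(
    xs:QName("city"),
    "=",
    "lodz",
    "collation=http://marklogic.com/collation//S1/EO"))
```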

The other option I can think of is to write some XQuery at the application level that expands your search terms into a cts:or-query of both Łodz and Lodz.
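A minimal sketch of that expansion follows. The function name local:stroke-l-query is made up for illustration, and the approach is deliberately simple: it swaps every L for Ł (and the reverse) in one go, so a word that mixes the two letters would need a more careful expansion. It assumes the "1.0-ml" dialect and the case-insensitive and diacritic-insensitive word query options.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: build an or-query over the term as typed,
   the term with every Ł/ł replaced by L/l, and the term with every
   L/l replaced by Ł/ł. Duplicate variants are harmless. :)
declare function local:stroke-l-query($term as xs:string) as cts:query
{
  let $options := ("case-insensitive", "diacritic-insensitive")
  let $variants := (
    $term,
    fn:translate($term, "Łł", "Ll"),
    fn:translate($term, "Ll", "Łł")
  )
  return cts:or-query(
    for $variant in $variants
    return cts:word-query($variant, $options))
};

(: A search for "lodz" now also looks for "łodz", which the
   diacritic-insensitive word query can match against "Łódź". :)
cts:search(fn:doc(), local:stroke-l-query("lodz"))
```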

Oh, and you are OK with the lowercase bit as

fn:lower-case("Ł")

returns ł

Here is a handy link for finding out all sorts of information about Unicode characters, such as their decompositions (if any), their lowercase mappings (if there are any), and so on:

http://people.w3.org/rishida/scripts/uniview/

Hope this helps,
-Danny



-----Original Message-----
From: general-bounces at developer.marklogic.com [mailto:general-bounces at developer.marklogic.com] On Behalf Of Peter Hickman
Sent: Thursday, August 30, 2007 6:26 AM
To: MarkLogic ML
Subject: [MarkLogic Dev General] Diacritical insensitive search


We have some data that contains the text "Łódź". The "Ł" is U0141.

However, when searching for "lodz" it does not match the entries with 
"Ł". "Łodz", however, does match, indicating that the "ó" and "ź" are 
being handled correctly by the diacritical insensitive search.

Am I correct in assuming that the "Ł" does not decompose to an L with a 
slash and is therefore not covered by a diacritical insensitive search? 
Looking at the Unicode book it would seem that "Ł" is not available in 
combined form.

If this is the case, is there a way to add extra translations such as 
U0141 => U004C?

Also, and I have not checked this, but will the case insensitive search 
for "Ł" match "ł" (the lowercase version)?

Again, if not can we add a rule?

-- 
Peter Hickman.

Semantico, Lees House, 21-23 Dyke Road, Brighton BN1 3FE
t: 01273 358223
f: 01273 723232
e: peter.hickman at semantico.com
w: www.semantico.com
