[MarkLogic Dev General] Diacritical insensitive search

Mike Sokolov sokolov at ifactory.com
Thu Aug 30 15:37:05 PDT 2007


We found that a fairly general way to insert character mappings is to build
a thesaurus containing all words in your corpus containing the characters to
be mapped.  The thesaurus maps the word to its equivalent (whatever that may
be, according to your rules).  Then you can apply that thesaurus to your
queries.  This is just a systematic way to do what Danny suggested.  It
requires updating the thesaurus whenever your data changes, and is a certain
amount of work, but we were able to use it to perform diacritic-insensitive
searching before Mark Logic provided that as a base feature.
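
A minimal sketch of how such a mapping could be generated, assuming a single folding rule (Ł/ł to L/l) and a hard-coded word list standing in for the words pulled from the corpus. The function local:fold and the <mappings>/<entry> layout are hypothetical, used only for illustration; they are not the Mark Logic thesaurus schema, so the output would still need converting to whatever format your thesaurus expects.

```xquery
xquery version "1.0";

(: Hypothetical mapping rule: fold Ł/ł to plain L/l. :)
declare function local:fold($word as xs:string) as xs:string {
  translate($word, "Łł", "Ll")
};

(: Stand-in word list; in practice this would come from your corpus. :)
let $words := ("Łódź", "Łodz", "Lublin", "Wałbrzych")
return
  <mappings>{
    for $w in $words
    let $folded := local:fold($w)
    (: Only words that actually contain a mapped character need an entry. :)
    where $folded ne $w
    return <entry from="{$folded}" to="{$w}"/>
  }</mappings>
```

Words without a mapped character ("Lublin" here) produce no entry, which keeps the mapping limited to the terms that need expanding.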

-Mike

-----Original Message-----
From: general-bounces at developer.marklogic.com
[mailto:general-bounces at developer.marklogic.com] On Behalf Of Danny Sokolsky
Sent: Thursday, August 30, 2007 6:26 PM
To: General Mark Logic Developer Discussion
Subject: RE: [MarkLogic Dev General] Diacritical insensitive search


Peter,

You are correct in your analysis.  The character "Ł" is not a diacritic, as
it does not decompose to an L and a slash, as you say.  You can verify that
by running the following:

xdmp:diacritic-less("Ł")

which returns "Ł".
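
As a cross-check that does not rely on any Mark Logic function, the standard normalize-unicode function can show the decompositions directly. This is only a sketch, and it assumes the server supports the NFD normalization form:

```xquery
xquery version "1.0";

(: List the code points of each character after canonical decomposition. :)
for $ch in ("Ł", "ó", "ź")
return
  <char value="{$ch}"
        nfd="{string-to-codepoints(normalize-unicode($ch, 'NFD'))}"/>
```

"ó" and "ź" each come back as two code points (the base letter followed by U+0301, the combining acute accent), while "Ł" comes back unchanged as the single code point 321 (U+0141).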

Using collations, it is possible to choose a collation that will compare "Ł"
and "L" as equal.  That is because collations have more complicated rules,
and their view of diacritics takes in some characters that are not strictly
diacritics.  So the following returns true:

default collation = "http://marklogic.com/collation//S1/EO"

"Ł" eq "L"

This means that if your text is an element value, then you can create a
string range index with the above collation and element value lookups for
lodz will match the text in question.  If your text is not an element value,
then this may not help you.  There is no way to "add a rule".

The other thing I can think of is to write some XQuery at the application
level that expands your search terms into a cts:or-query of both Łodz and
Lodz.
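
A rough sketch of that expansion, written in the newer 1.0-ml syntax.  The $VARIANTS table and the local:expand function are hypothetical names made up for this example, and the table is hand-written here; a real one would have to be built from your own data.  It also assumes word queries are already diacritic-insensitive, as Peter observed, so that "Łodz" matches "Łódź".

```xquery
xquery version "1.0-ml";

(: Hypothetical variant table, keyed on the lowercased search term. :)
declare variable $VARIANTS :=
  <variants>
    <entry key="lodz"><alt>Łodz</alt></entry>
  </variants>;

(: Build an or-query over the term as typed plus any listed variants. :)
declare function local:expand($term as xs:string) as cts:query {
  let $alts := $VARIANTS/entry[@key eq fn:lower-case($term)]/alt/fn:string(.)
  return
    cts:or-query(
      for $t in ($term, $alts)
      return cts:word-query($t)
    )
};

cts:search(fn:doc(), local:expand("lodz"))
```

A term with no entry in the table simply becomes an or-query with a single word query, so the same function can be applied to every search term.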

Oh, and you are OK with the lowercase bit as

fn:lower-case("Ł")

returns ł

Here is a handy link for finding out all sorts of info about unicode
characters such as what are (or are not) the decompositions, the lowercase
mapping (if there is one), etc:

http://people.w3.org/rishida/scripts/uniview/

Hope this helps,
-Danny



-----Original Message-----
From: general-bounces at developer.marklogic.com
[mailto:general-bounces at developer.marklogic.com] On Behalf Of Peter Hickman
Sent: Thursday, August 30, 2007 6:26 AM
To: MarkLogic ML
Subject: [MarkLogic Dev General] Diacritical insensitive search


We have some data that contains the text "Łódź". The "Ł" is U+0141.

However, when searching for "lodz" it does not match the entries with
"Ł". "Łodz" however does match, indicating that the "ó" and "ź" are
being handled correctly by the diacritical insensitive search.

Am I correct in assuming that the "Ł" does not decompose to an L with a
slash and is therefore not covered by a diacritical insensitive search?
Looking at the Unicode book it would seem that "Ł" is not available in
combined form.

If this is the case, is there a way to add extra translations such as
U+0141 => U+004C?

Also, and I have not checked this, but will the case insensitive search
for "Ł" match "ł" (the lowercase version)?

Again, if not can we add a rule?

-- 
Peter Hickman.

Semantico, Lees House, 21-23 Dyke Road, Brighton BN1 3FE
t: 01273 358223
f: 01273 723232
e: peter.hickman at semantico.com
w: www.semantico.com

_______________________________________________
General mailing list
General at developer.marklogic.com http://xqzone.com/mailman/listinfo/general