[MarkLogic Dev General] Unicode flattening for non combined characters.

Michael Blakeley michael.blakeley at marklogic.com
Mon Jul 21 10:59:21 PDT 2008


The fn:translate() function would probably be more efficient than using 
fn:contains() and fn:replace(). It can also be more easily extended to 
handle more than one conversion.

   ...
   return fn:translate($token, "ØŁé", "OLe")
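
As a rough sketch of how this might slot into Danny's example quoted below: the mapping strings $from and $to are illustrative assumptions, not a complete list of stroked letters, and the second cts:and-query is my own variation on his query shape rather than anything from the original example.

```xquery
xquery version "1.0-ml";

(: Assumed, illustrative mapping: the character at position N of $from
   is replaced by the character at position N of $to. Extend both
   strings together to cover more letters. :)
let $from := "ØøŁł"
let $to   := "OoLl"
let $search := "Jacob Ørn"
let $searchTokens := fn:tokenize($search, " ")
(: fn:translate() returns a token unchanged when it contains none of the
   mapped characters, so no fn:contains() test is needed. :)
let $flattenedTokens :=
  for $token in $searchTokens
  return fn:translate($token, $from, $to)
return
  cts:or-query((
    (: match the terms exactly as typed :)
    cts:and-query(
      for $tok in $searchTokens return cts:word-query($tok)),
    (: or match the same terms with the mapped characters flattened :)
    cts:and-query(
      for $tok in $flattenedTokens return cts:word-query($tok))
  ))
```

Because every token passes through fn:translate(), handling another character only means adding one character to each mapping string.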

-- Mike

Danny Sokolsky wrote:
> Hi Peter,
> 
> I can think of a few ways to do this.  One idea is to use a thesaurus and just add all of the terms to the thesaurus as expansions of the terms with the funny characters.  It might be hard to know all your terms before the search.
> 
> Another way is to just parse your search string for the offending characters and change the search to an or-query of the original term and the term with the replaced character.  I think this should work OK as long as there are not a huge number of replaced characters, and as long as the search strings are not very large.  Here is a hacky example of what I mean--if you have a search parser already, something like this would be relatively easy to add I think.
> 
> let $search := "Jacob Ørn"
> let $searchTokens := fn:tokenize($search, " ")
> let $replacedTokens := 
>   for $token in $searchTokens
>   return if ( fn:contains($token, "Ø") )
>          then (fn:replace($token, "Ø", "O") )
>          else ()
> return 
>   cts:or-query((
>        cts:and-query((
>            for $tok in $searchTokens return cts:word-query($tok)  )),
>        cts:or-query((
>            for $orTok in $replacedTokens return cts:word-query($orTok) ))
>       ))
> 
> I am sure there are other ways as well.  Hope this helps.
> 
> -Danny
> 
> -----Original Message-----
> From: general-bounces at developer.marklogic.com [mailto:general-bounces at developer.marklogic.com] On Behalf Of Peter Hickman
> Sent: Monday, July 21, 2008 6:52 AM
> To: MarkLogic ML
> Subject: [MarkLogic Dev General] Unicode flattening for non combined characters.
> 
> Our client has data such as "Jacob Ørn" that they want to search for. 
> They are expecting that searching for "orn" would match "Ørn" as they see 
> "Ø" as an accented character. According to the Unicode Standard 4.0 
> (always a good read :)) U+00D8 "Latin Capital Letter O With Stroke" is 
> not a combined character and therefore is not matched by "O" when doing 
> a case and diacritical insensitive search. This is what I expect and 
> understand as a developer.
> 
> Is there some way of getting the client's expected behaviour? I suspect that 
> the "Ø" is only one of several characters that have this problem, such 
> as the "Ł" (U+0141) in "Łodz".
> 