[MarkLogic Dev General] Collation Lexicon Frequency

Danny Sokolsky dsokolsky at marklogic.com
Wed Dec 17 14:39:53 PST 2008


One approach is to use a whitespace-insensitive collation for the range
index.  The three values would then compare as equal.  Here is a simple
example:

 

xquery version "1.0-ml";

declare default collation "http://marklogic.com/collation/en/S1/AS";

"hello there" = "hello    there"
(: returns true :)
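
To connect this to the lexicon calls in the question, here is a minimal sketch of passing the same collation to cts:element-values. It assumes a string element range index on elem1 has already been configured with this collation (the lexicon call fails without one) and that elem1 is in no namespace. With such an index, values that compare equal under the collation should collapse into a single lexicon entry whose frequency covers all matching documents.

```xquery
xquery version "1.0-ml";

(: Assumption: an element range index of type string exists on elem1
   with exactly this collation. :)
let $collation := "http://marklogic.com/collation/en/S1/AS"
for $value in cts:element-values(
    xs:QName("elem1"),                      (: assumed: no namespace :)
    "",                                     (: start at the first value :)
    (fn:concat("collation=", $collation),   (: pick the matching index :)
     "frequency-order"))                    (: most frequent values first :)
return
  <value frequency="{cts:frequency($value)}">{$value}</value>
```

Note that S1 is primary strength, so this collation also ignores case and diacritic differences, and the AS (alternate shifted) setting ignores punctuation as well as whitespace. That may be broader than intended for some data.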

 

-Danny

 

From: general-bounces at developer.marklogic.com
[mailto:general-bounces at developer.marklogic.com] On Behalf Of Paul M
Sent: Wednesday, December 17, 2008 1:58 PM
To: general at developer.marklogic.com
Subject: [MarkLogic Dev General] Collation Lexicon Frequency

 

Hi:

I have the following docs:

doc1
<elem1>dear sir</elem1>
doc2
<elem1>dear     sir</elem1>
doc3
<elem1>dear   sir       </elem1>

All three contain varying amounts of whitespace. I am using lib-search,
specifically these functions:
cts:element-values($element-qname, "", $options, $base-query)
(: the above three docs are returned :)
cts:frequency($value)
(: elem1 has three facets associated with $base-query, each with a value of 1 :)

Each doc contains elem1, each with a unique value. There does not seem
to be a simple way for the frequency function to treat the above three
elements as "the same". (They likely hash to different values?)

Is the only easy method to normalize the data by stripping whitespace
from the documents themselves?
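
For comparison, the normalization approach described above could be sketched roughly as follows. This is a hypothetical one-off cleanup, not part of lib-search: it assumes elem1 is in no namespace and holds only text (any attributes or child elements would be lost), and it collapses runs of whitespace and trims the ends with fn:normalize-space rather than removing whitespace entirely.

```xquery
xquery version "1.0-ml";

(: Hypothetical cleanup: rewrite every elem1 whose text changes
   when whitespace is normalized. :)
for $e in fn:collection()//elem1
let $normalized := fn:normalize-space($e)
(: codepoint comparison, so the check does not depend on the default collation :)
where fn:not(fn:codepoint-equal(fn:string($e), $normalized))
return
  xdmp:node-replace($e, <elem1>{$normalized}</elem1>)
```

After such an update the three documents would all hold "dear sir", so the existing range index would report one value with a frequency of 3.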

 
