[MarkLogic Dev General] hyphens and cts:element-value-query

Mary Holstege Mary.Holstege at marklogic.com
Tue Feb 28 13:52:42 PST 2017


Just to summarize the ins and outs here in one place, because I'm seeing a certain amount of confusion:

* xdmp:plan is your friend: it shows you the questions we ask the indexes. If some piece of information from your query is not reflected in the plan, that is a case where index resolution alone (i.e. an unfiltered search) may give you false positives.

* Punctuation and space tokens are not indexed as words in the universal index. Therefore, word queries involving whitespace or punctuation will not make use of whitespace or punctuation in index resolution, regardless of space or punctuation sensitivity.

* Punctuation and space tokens are generally not indexed in the universal index for value queries either. However, as a special exception, there are terms in the universal index for "exact" value queries (unstemmed, case-sensitive, whitespace-sensitive, punctuation-sensitive), so "exact" value queries should be properly resolvable from the index, but only if you have fast-case-sensitive-searches and fast-diacritic-sensitive-searches enabled in the database.

* For field word or value queries you can modify what counts as punctuation or whitespace via tokenizer overrides. This can turn what would have been a phrase into a single word.

* Outside of the special case given for exact value queries, all queries involving space or punctuation are phrase queries. Word and value search is not string matching.

* Space- and punctuation-insensitive does not mean tokenization-insensitive. "foo-bar" will not match "foobar" as a value query or a word query, regardless of your punctuation sensitivity. Word and value search is not string matching.

* String range queries are about string matching. Whether there is a match depends on the collation, but there is no tokenization happening, no stemming, ever.

* If the plan for cts:element-value-query(xs:QName("x"),"value-1","exact") doesn't include the hyphen, and you do have fast-case-sensitive-searches and fast-diacritic-sensitive-searches enabled in the database, that is a bug.
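
To make the points above concrete, here is a minimal sketch you could run in Query Console. It assumes an element named x (as in the example above) and default tokenization with no field tokenizer overrides. The local:kinds helper is hypothetical and only there for illustration; cts:tokenize, xdmp:plan, cts:search and cts:element-value-query are the built-ins being exercised. What the plan contains depends on your database settings, so treat this as something to run and inspect rather than a statement of what you will see.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: label each token of a string with its kind. :)
declare function local:kinds($s as xs:string) as xs:string*
{
  for $t in cts:tokenize($s)
  return
    typeswitch ($t)
      case cts:punctuation return fn:concat("punctuation: ", fn:string($t))
      case cts:space return fn:concat("space: ", fn:string($t))
      case cts:word return fn:concat("word: ", fn:string($t))
      default return fn:concat("other: ", fn:string($t))
};

(
  (: Compare how the two strings tokenize. :)
  local:kinds("foo-bar"),
  local:kinds("foobar"),

  (: Ask which index terms the exact value query resolves to. :)
  xdmp:plan(
    cts:search(fn:doc(),
      cts:element-value-query(xs:QName("x"), "value-1", "exact")))
)
```

If the two strings tokenize differently, that is why punctuation insensitivity cannot make one match the other. In the plan output, look for whether the hyphen in "value-1" is reflected in the term for the exact value query.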

So if you want to do exact queries you can either:
(1) Enable fast-case-sensitive-searches and fast-diacritic-sensitive-searches on your database and run them as value queries.
OR
(2) Create a field with custom overrides for the significant punctuation or whitespace and run them as field word or field value queries.
OR
(3) Create a string range index with the appropriate collation (codepoint, most likely) and run them as string-range equality queries.
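
As a rough sketch, the three options might look like this as queries. The element name x and the value come from the example above; the field name "exact-field" is a made-up placeholder for a field you would have to configure yourself with the tokenizer overrides you need. Option (3) assumes a string element range index on x with the codepoint collation already exists. Without that configuration the field and range queries will raise errors.

```xquery
xquery version "1.0-ml";

(: (1) Exact element value query; relies on fast-case-sensitive-searches
       and fast-diacritic-sensitive-searches being enabled. :)
let $by-value :=
  cts:element-value-query(xs:QName("x"), "value-1", "exact")

(: (2) Field value query; "exact-field" is a hypothetical field name. :)
let $by-field :=
  cts:field-value-query("exact-field", "value-1", "exact")

(: (3) String range equality query; assumes a matching range index. :)
let $by-range :=
  cts:element-range-query(xs:QName("x"), "=", "value-1",
    "collation=http://marklogic.com/collation/codepoint")

(: Swap in $by-field or $by-range to inspect the other plans. :)
return xdmp:plan(cts:search(fn:doc(), $by-value))
```

Whichever option you pick, checking the plan is the way to confirm that the hyphen is actually taking part in index resolution.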

Cheers

//Mary



