[MarkLogic Dev General] Regular Expression Support for Text Matching and cts:search
Gary Vidal
gvidal at alm.com
Tue Apr 15 19:10:00 PDT 2008
This is a general question about regular expression support, and I would welcome some input from MarkLogic.
Ideally, I think full support for regular expression matching and named groups in cts:highlight and other searches would be a nice feature to have.
I am working on a lot of regex matching to automatically assign markup to XML elements and to query for patterns using an index.
What I am looking for are specific patterns in the text for legal citations, and I cannot enumerate all the possible combinations of a pattern. My workaround is a regex-matching XQuery function that returns the matching and non-matching text once I have located the document.
I then collect all the matching phrases and create a cts:word-query for each match,
run cts:highlight over the matches to create the boundary elements,
and then iterate over the boundary elements to add metadata to each one.
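To make the first step of that workaround concrete, here is a minimal sketch outside MarkLogic, in Python rather than XQuery. It splits a string into matching and non-matching segments and collects the distinct matched phrases, which are the strings that would each be fed to a cts:word-query. The CITATION pattern and the function names are assumptions for illustration only; real legal citations vary far more than this pattern allows.

```python
import re

# Hypothetical, deliberately simplified citation pattern, e.g. "410 U.S. 113".
CITATION = re.compile(r"\b\d+ [A-Z][A-Za-z.]* \d+\b")


def split_matches(text, pattern):
    """Return (is_match, segment) pairs that cover all of text, in order."""
    segments = []
    pos = 0
    for m in pattern.finditer(text):
        if m.start() > pos:
            # Text between the previous match and this one.
            segments.append((False, text[pos:m.start()]))
        segments.append((True, m.group(0)))
        pos = m.end()
    if pos < len(text):
        # Trailing text after the last match.
        segments.append((False, text[pos:]))
    return segments


def distinct_phrases(segments):
    """Collect each matched phrase once, preserving first-seen order."""
    seen = []
    for is_match, segment in segments:
        if is_match and segment not in seen:
            seen.append(segment)
    return seen
```

Concatenating the segments in order reproduces the original text, so the non-matching parts can be passed through unchanged while the matching parts are wrapped in markup.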
Here are my limitations:
* I cannot capture named groups (ideally I could use non-capturing groups and just use replace functions).
* fn:replace only returns positions 1-9, as per the XQuery spec (again, non-capturing groups would muddy the regex, or I would have to regex the regex :-) to turn every unnamed group into a non-capturing group).
* Speed is a concern, and native functions would ideally perform better.
* I would like to use regex in cts:queries when searching for documents.
* I need to build an expression with access to cts:text and cts:node, as cts:highlight allows. My own function limits my ability to construct nodes to pass to the function the way cts:highlight does.
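The "regex the regex" idea from the list above can be sketched as follows, again in Python rather than XQuery. The helper name uncapture_unnamed is hypothetical, and it uses a small character scanner instead of a regex so that escapes and character classes are skipped. It is a simplification and does not cover every regex dialect corner case (for example a literal "]" placed first inside a character class). Note that Python spells named groups (?P<name>...) rather than (?<name>...).

```python
import re


def uncapture_unnamed(pattern):
    """Rewrite plain '(' groups as '(?:' groups; leave '(?...' constructs alone."""
    out = []
    i = 0
    in_class = False
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Copy an escape pair such as \( or \d untouched.
            out.append(pattern[i:i + 2])
            i += 2
            continue
        if in_class:
            # Inside [...] a parenthesis is a literal, so only watch for the end.
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
        elif ch == "(" and pattern[i + 1:i + 2] != "?":
            # Unnamed capturing group: make it non-capturing.
            out.append("(?:")
            i += 1
            continue
        out.append(ch)
        i += 1
    return "".join(out)


# Usage: only the named groups still capture after the rewrite.
converted = uncapture_unnamed(r"(?P<volume>\d+) (U\.S\.|S\. Ct\.) (?P<page>\d+)")
match = re.search(converted, "See 410 U.S. 113.")
named = match.groupdict() if match else {}
```

With the unnamed groups out of the way, the numbered positions are taken only by groups that matter, which eases the pressure on a 1-9 limit.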
Ideally, a function or set of functions for regex matching against indexes would be useful, or else general-purpose utilities that perform such functions:
Recommendation 1:
A highlight utility, or an enhancement of cts:highlight, that allows for regex:
cts:pattern-highlight($node, $query, $expression)
cts:text : the captured text
cts:group as element(cts:group) : captures named regex groups, e.g. (?<group>expr)
Recommendation 2:
A cts:query that allows for regex patterns:
cts:(regex|pattern)-query($patterns as xs:string*, $options, $weight)
$patterns : a regular expression or a sequence of expressions
$options : (case-sensitive|case-insensitive| (:i = regex ignore case:)
whitespace-sensitive|whitespace-insensitive| (:x = whitespace mode:)
single-mode|multiline| (:s = single-line mode:)
element-boundary| (:I guess preserve element boundaries:)
named-capture|indexed-capture) (:captures group names and returns them to cts:group:)
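As a rough illustration of how the proposed options line up with conventional regex flags, here is a sketch in Python. Nothing here is a MarkLogic API: the OPTION_FLAGS table, compile_patterns, and find_groups are hypothetical names, and the mapping of each option to a flag is my assumption about the intended meaning. The case-sensitive and whitespace-sensitive options are treated as the defaults and need no flag.

```python
import re

# Assumed mapping from the proposed option names to Python re flags.
OPTION_FLAGS = {
    "case-insensitive": re.IGNORECASE,      # i
    "whitespace-insensitive": re.VERBOSE,   # x: whitespace in the pattern is ignored
    "single-mode": re.DOTALL,               # s: "." also matches a newline
    "multiline": re.MULTILINE,              # m: ^ and $ match at line boundaries
}


def compile_patterns(patterns, options=()):
    """Compile every pattern with the flags implied by options.

    Option names that are not in OPTION_FLAGS are ignored.
    """
    flags = 0
    for option in options:
        flags |= OPTION_FLAGS.get(option, 0)
    return [re.compile(p, flags) for p in patterns]


def find_groups(text, compiled):
    """Return (matched text, named groups) for every match of every pattern."""
    results = []
    for rx in compiled:
        for m in rx.finditer(text):
            results.append((m.group(0), m.groupdict()))
    return results
```

The named groups returned by find_groups play the role that cts:group would play under the named-capture option.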
Also, could someone weigh in on the ramifications of regex searches with MarkLogic indexing, and on whether there is a possibility of native regex support for cts:search (beyond fn:matches, fn:replace, fn:tokenize)?