[MarkLogic Dev General] Regular Expression Support for Text Matching and cts:search

Gary Vidal gvidal at alm.com
Tue Apr 15 19:10:00 PDT 2008


This is a general question about regular expression support, and I would welcome some input from MarkLogic.
 
Ideally, I think full support for regular expression matching, including named groups, in cts:highlight and the other search functions would be a nice feature to have.
 
I am doing a lot of regex matching to automatically assign markup to XML elements, and I would like to query for patterns using an index.
What I am looking for are specific patterns in the text for legal citations, and I cannot enumerate all the possible combinations of a pattern. My workaround is an XQuery regex-matching function that, once I have located the document, returns the matching text and the non-matching text.
I then collect all the matching phrases and create a cts:word-query for each match,
then run cts:highlight over the matches, first to create the boundary element.
Finally, I iterate over the boundary elements again to add metadata to each one.
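
As a rough illustration of the segmenting step in that workaround, here is a minimal Python sketch (outside MarkLogic, so no cts: functions are involved). It splits a text into matching and non-matching runs and collects the distinct matching phrases, which are the strings that would each become a cts:word-query. The function names and the citation pattern are assumptions made up for this example; real citation patterns are far more varied.

```python
import re

# Hypothetical pattern, for illustration only: citations shaped like
# "410 U.S. 113". Real legal citations vary far more than this.
CITATION = re.compile(r"\b\d+ U\.S\. \d+\b")

def split_matches(text, pattern):
    """Return (is_match, segment) pairs that cover the text in order."""
    segments = []
    pos = 0
    for m in pattern.finditer(text):
        if m.start() > pos:
            # Text between the previous match and this one.
            segments.append((False, text[pos:m.start()]))
        segments.append((True, m.group(0)))
        pos = m.end()
    if pos < len(text):
        # Trailing text after the last match.
        segments.append((False, text[pos:]))
    return segments

def distinct_phrases(segments):
    """Collect each matching phrase once, in first-seen order."""
    seen = []
    for is_match, segment in segments:
        if is_match and segment not in seen:
            seen.append(segment)
    return seen

text = "See 410 U.S. 113 and 347 U.S. 483; compare 410 U.S. 113."
segments = split_matches(text, CITATION)
phrases = distinct_phrases(segments)
```

Joining the segments back together reproduces the original text, so the non-matching runs can be passed through untouched while only the matching runs get marked up.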
 
Here are the limitations I have run into:

*	I cannot capture named groups. (I could use non-capturing groups and just use the replace functions instead.)
*	fn:replace only returns group positions 1-9, as per the XQuery spec. (Again, non-capturing groups would muddy the regex, or I would have to "regex the regex" :-) to turn every unnamed group into a non-capturing group.)
*	Speed is a concern; native functions would ideally perform better.
*	I would like to use regexes in cts:queries to search for documents.
*	I need to build an expression with access to cts:text and cts:node, as in cts:highlight. My own function limits my ability to construct nodes the way cts:highlight allows.

 
Ideally, a function or set of functions for regex matching against the indexes would be useful, or else general-purpose utilities that perform such matching:
 
Recommendation 1:
A highlight utility, or an enhancement of cts:highlight, that allows for regex:
 
cts:pattern-highlight($node, $query, $expression)
   cts:text : the captured text
   cts:group as element(cts:group) : captures named regex groups, e.g. (?<group>:[expr])
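
To show roughly what such a pattern-based highlight could produce, here is a small Python sketch that wraps each match in an element and exposes the named groups as attributes. It is only an analogy for the proposed behaviour: `pattern_highlight`, the `cite` element name, and the pattern are all hypothetical, and Python spells named groups `(?P<name>...)`.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical pattern; note Python's (?P<name>...) named-group syntax.
CITATION = re.compile(r"(?P<volume>\d+) (?P<reporter>U\.S\.) (?P<page>\d+)")

def pattern_highlight(text, pattern, element="cite"):
    """Wrap each match in an element, with named groups as attributes."""
    out = []
    pos = 0
    for m in pattern.finditer(text):
        # Non-matching text is escaped and passed through unchanged.
        out.append(escape(text[pos:m.start()]))
        attrs = "".join(
            f" {name}={quoteattr(value)}"
            for name, value in m.groupdict().items()
            if value is not None  # skip groups that did not participate
        )
        out.append(f"<{element}{attrs}>{escape(m.group(0))}</{element}>")
        pos = m.end()
    out.append(escape(text[pos:]))
    return "".join(out)

marked = pattern_highlight("Roe v. Wade, 410 U.S. 113 (1973)", CITATION)
```

Because the groups are addressed by name, the metadata step no longer depends on counting group positions, which is the point of the named-capture request.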
 
Recommendation 2:
A cts:query that allows regex patterns:
 
cts:(regex|pattern)-query($patterns as xs:string*, $options, $weight)
    $patterns : a regular expression or a sequence of regular expressions
    $options  : (case-sensitive | case-insensitive |               (: i = regex ignore case :)
                 whitespace-sensitive | whitespace-insensitive |   (: x = whitespace mode :)
                 single-mode | multiline |                         (: s = mode :)
                 element-boundary |                                (: I guess: preserve element boundaries :)
                 named-capture | indexed-capture)                  (: captures group names and returns them to cts:group :)
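
For comparison, the option names above line up fairly naturally with the flags most regex engines already have. The Python sketch below assumes one possible mapping (the table and the function names are hypothetical) and simply scans a string it is handed; it says nothing about how an index could resolve such a query.

```python
import re

# Assumed mapping from the proposed option names to Python's re flags.
# Options with no entry (e.g. "case-sensitive") are the engine defaults.
OPTION_FLAGS = {
    "case-insensitive": re.IGNORECASE,      # i
    "whitespace-insensitive": re.VERBOSE,   # x
    "single-mode": re.DOTALL,               # s
    "multiline": re.MULTILINE,              # m
}

def compile_patterns(patterns, options=()):
    """Compile every pattern with the flags implied by the options."""
    flags = 0
    for option in options:
        flags |= OPTION_FLAGS.get(option, 0)
    return [re.compile(p, flags) for p in patterns]

def any_match(text, compiled):
    """True if at least one compiled pattern matches somewhere in text."""
    return any(p.search(text) for p in compiled)

strict = compile_patterns([r"\d+ u\.s\. \d+"])
relaxed = compile_patterns([r"\d+ u\.s\. \d+"], ["case-insensitive"])
```

Here `any_match("410 U.S. 113", relaxed)` succeeds while the same call with `strict` does not, since only the second set was compiled with the ignore-case flag.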
                    
Also, can someone weigh in on the ramifications of regex searches for MarkLogic indexing, and on whether native regex support for cts:search
(beyond fn:matches, fn:replace, and fn:tokenize) is a possibility?
 
                    
 