[MarkLogic Dev General] lib-parser

Michael Blakeley michael.blakeley at marklogic.com
Thu Oct 9 13:46:31 PDT 2008


Shannon,

Hmm... I think we may be talking at cross-purposes. As I mentioned
yesterday, I'm a little concerned about maintaining a distinction
between the cts:query term-level language and the language passed to
cts:tokenize() in lp:get-cts-query-element().

When I mentioned the idea of adding another parameter to
lp:get-cts-query(), I was thinking of the cts:tokenize() language
argument. But I think I jumped to a conclusion there. French and
English tokenize similarly enough that I can't think of a place where
the cts:tokenize() language would matter between them (as opposed to,
for example, Chinese).
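
As a rough way to check whether the tokenizer language matters for a
given string, something like the following could be run. This is only a
sketch: the sample text is made up, and whether the two lines of output
differ at all will depend on which languages the server is licensed for.

```xquery
xquery version "0.9-ml"

(: Hypothetical sample text; substitute a string from your own content. :)
let $text := "aujourd'hui l'auteur parle"
for $lang in ('en', 'fr')
return concat(
  $lang, ': ',
  string-join(
    (: keep only word tokens, dropping spaces and punctuation :)
    for $t in cts:tokenize($text, $lang)[. instance of cts:word]
    return string($t),
    ' | '))
```

If both languages produce the same word tokens for typical input, the
cts:tokenize() language is probably not where the difference lies.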

Based on this latest email, and the exchange with Danny, you'd like to
pass multiple languages to lp:get-cts-query() and get back, for each
input term, an internally expanded or-query covering every language.
This would work somewhat like thesaurus expansion. Is that correct?

If so, this does seem like a useful RFE (for lib-parser, or for 
MarkLogic Server). But you can also do this in your own code fairly easily:

xquery version "0.9-ml"

define function expand-languages($query as cts:query, $lang as xs:string*)
  as cts:query
{
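   (: With no languages requested, return the query unchanged. :)
   if (empty($lang)) then $query else
   typeswitch($query)
   (: and-query: recurse into each sub-query and keep the original options. :)
   case cts:and-query return cts:and-query(
     for $q in cts:and-query-queries($query)
     return expand-languages($q, $lang),
     cts:and-query-options($query)
   )
   (: word-query: emit one word-query per language, replacing any
      existing lang= option, and wrap the results in an or-query. :)
   case cts:word-query return cts:or-query((
     let $opts :=
       cts:word-query-options($query)[not(starts-with(., 'lang='))]
     let $word := cts:word-query-text($query)
     for $i in $lang
     return cts:word-query($word, ($opts, concat('lang=', $i)))
   ))
   (: Any other query type is not handled yet. :)
   default return error(
     'UNIMPLEMENTED', text { 'no support for', xdmp:describe($query) } )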
}

expand-languages(
   cts:and-query((
     cts:word-query('foo'),
     cts:word-query('bar')
   )), ('en', 'fr') )

=>
cts:and-query((
  cts:or-query((
    cts:word-query("foo", ("lang=en"), 1),
    cts:word-query("foo", ("lang=fr"), 1))),
  cts:or-query((
    cts:word-query("bar", ("lang=en"), 1),
    cts:word-query("bar", ("lang=fr"), 1)))), ())

Keep extending the typeswitch to cover the other cts:query types you
need.
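
As a sketch of what that extension might look like, here is a variant
with two more cases, for cts:or-query and cts:not-query. The name
expand-languages-ext is hypothetical (chosen only to avoid clashing with
the function above), and the sketch assumes the accessors
cts:or-query-queries() and cts:not-query-query() are available in your
server version.

```xquery
xquery version "0.9-ml"

(: Hypothetical variant of expand-languages with two extra cases. :)
define function expand-languages-ext(
  $query as cts:query, $lang as xs:string*)
  as cts:query
{
   if (empty($lang)) then $query else
   typeswitch($query)
   case cts:and-query return cts:and-query(
     for $q in cts:and-query-queries($query)
     return expand-languages-ext($q, $lang),
     cts:and-query-options($query)
   )
   (: or-query: recurse into each sub-query, as for and-query :)
   case cts:or-query return cts:or-query(
     for $q in cts:or-query-queries($query)
     return expand-languages-ext($q, $lang)
   )
   (: not-query: expand the single negated sub-query :)
   case cts:not-query return cts:not-query(
     expand-languages-ext(cts:not-query-query($query), $lang)
   )
   (: word-query: unchanged from the version above :)
   case cts:word-query return cts:or-query((
     let $opts :=
       cts:word-query-options($query)[not(starts-with(., 'lang='))]
     let $word := cts:word-query-text($query)
     for $i in $lang
     return cts:word-query($word, ($opts, concat('lang=', $i)))
   ))
   default return error(
     'UNIMPLEMENTED', text { 'no support for', xdmp:describe($query) } )
}

expand-languages-ext(
   cts:or-query((
     cts:word-query('foo'),
     cts:not-query(cts:word-query('bar'))
   )), ('en', 'fr') )
```

Note that the word-query case, here as above, does not carry over the
weight of the original query. If weights matter to you, passing
cts:word-query-weight($query) as a third argument to cts:word-query()
would be one way to keep them.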

-- Mike

Shannon wrote:
> Thank you, Mike--that's so very agreeable--yes, per-query control  
> language awareness would be most useful!  Given a form that accepts a  
> query string input and a language selector that includes an "all"  
> option, the desired behavior is language-specific tokenization, in  
> this case, for English and French; Danny demystified the search recall  
> logic, but lib-parser doesn't provide the full support, yet, to get  
> the most out of the French language module--currently I'm using the  
> overloaded lp:get-cts-query() that grabs $options at the 3rd argument;  
> maybe another overload with a 4th argument, or perhaps take the hint  
> from the lang option if supplied?
> Thanks,
> Shannon
> 
> On Oct 8, 2008, at 5:29 PM, Michael Blakeley wrote:
> 
>> Today, lib-parser calls cts:tokenize() without the language  
>> argument, so it always uses the database default language. So the  
>> tokenization is language-aware, but there's no per-query control  
>> over which language it uses.
>>
>> If per-query control over language awareness would be useful, how  
>> would you like to express it? As another (optional) argument to  
>> lp:get-cts-query()?
>>
>> I'm a little concerned about maintaining a distinction between  
>> cts:query term-level language, vs the language passed to  
>> cts:tokenize() in lp:get-cts-query-element(). But if it's useful  
>> functionality, let's figure out how to add it.
>>
>> -- Mike
>>
>> Shannon wrote:
>>> Hi,
>>> Does anyone know whether lib-parser has support for language-aware   
>>> tokenization, for lp:get-cts-query specifically?
>>> Thanks,
>>> __________________________________________________
>>> Shannon Scott Shiflett, programmer/analyst with ROTUNDA,
>>> The University of Virginia Press, Charlottesville, VA  USA
>>> http://rotunda.upress.virginia.edu
>>> _______________________________________________
>>> General mailing list
>>> General at developer.marklogic.com
>>> http://xqzone.com/mailman/listinfo/general


