[MarkLogic Dev General] lib-parser

Shannon Shiflett at virginia.edu
Fri Oct 10 07:09:45 PDT 2008


Good morning,

Mike, thanks for the free consulting :)

I agree, in the case of English and French, I don't think I need to be  
concerned with the tokenizing.

I may end up submitting an RFE on this, since future projects and
other customers may benefit; something similar to the thesaurus
expansion API sounds like a good approach to me. In the meantime I
will be fine with the help you and Danny provided. Speaking of which,
as always, a thank-you to you and everyone at Mark Logic for the
support in getting the most out of your product.

Best,
Shannon

On Oct 9, 2008, at 4:46 PM, Michael Blakeley wrote:

> Shannon,
>
> Hmm... I think we may be talking at cross-purposes. As I mentioned  
> yesterday, I'm a little concerned about maintaining a distinction  
> between cts:query term-level language, vs the language passed to  
> cts:tokenize() in lp:get-cts-query-element().
>
> When I mentioned the idea of adding another parameter to
> lp:get-cts-query(), I was thinking of the cts:tokenize() option. But
> I think I jumped to a conclusion there. French and English aren't
> that different, and I can't think of a place where the cts:tokenize()
> language would matter (as opposed to, for example, Chinese).
>
> Based on this latest email, and the exchange with Danny, you'd like  
> to pass multiple languages to lp:get-cts-query(), and get back an  
> internally-expanded or-query for every language for each input term.  
> This would work somewhat like thesaurus expansion. Is that correct?
>
> If so, this does seem like a useful RFE (for lib-parser, or for  
> MarkLogic Server). But you can also do this in your own code fairly  
> easily:
>
> xquery version "0.9-ml"
>
> define function expand-languages(
>   $query as cts:query, $lang as xs:string*)
> as cts:query
> {
>  (: with no languages requested, return the query unchanged :)
>  if (empty($lang)) then $query else
>  typeswitch($query)
>  (: and-query: expand each sub-query, keeping the original options :)
>  case cts:and-query return cts:and-query(
>    for $q in cts:and-query-queries($query)
>    return expand-languages($q, $lang),
>    cts:and-query-options($query)
>  )
>  (: word-query: one copy per language, replacing any lang= option :)
>  case cts:word-query return cts:or-query((
>    let $opts :=
>      cts:word-query-options($query)[not(starts-with(., 'lang='))]
>    let $word := cts:word-query-text($query)
>    for $i in $lang
>    return cts:word-query($word, ($opts, concat('lang=', $i)))
>  ))
>  (: any other query type is not handled yet :)
>  default return error(
>    'UNIMPLEMENTED', text { 'no support for', xdmp:describe($query) } )
> }
>
> expand-languages(
>  cts:and-query((
>    cts:word-query('foo'),
>    cts:word-query('bar')
>  )), ('en', 'fr') )
>
> =>
> cts:and-query((
>   cts:or-query((
>     cts:word-query("foo", ("lang=en"), 1),
>     cts:word-query("foo", ("lang=fr"), 1))),
>   cts:or-query((
>     cts:word-query("bar", ("lang=en"), 1),
>     cts:word-query("bar", ("lang=fr"), 1)))
> ), ())
>
> Keep expanding the typeswitch to cover all the possibilities.
>
> -- Mike
>
> Shannon wrote:
>> Thank you, Mike--that's so very agreeable--yes, per-query control
>> over language awareness would be most useful! Given a form that
>> accepts a query string input and a language selector that includes
>> an "all" option, the desired behavior is language-specific
>> tokenization, in this case for English and French. Danny
>> demystified the search recall logic, but lib-parser doesn't yet
>> provide the full support needed to get the most out of the French
>> language module. Currently I'm using the overloaded
>> lp:get-cts-query() that takes $options as the 3rd argument; maybe
>> another overload with a 4th argument, or perhaps take the hint from
>> the lang option if supplied?
>> Thanks,
>> Shannon
>> On Oct 8, 2008, at 5:29 PM, Michael Blakeley wrote:
>>> Today, lib-parser calls cts:tokenize() without the language   
>>> argument, so it always uses the database default language. So the   
>>> tokenization is language-aware, but there's no per-query control   
>>> over which language it uses.
>>>
>>> If per-query control over language awareness would be useful, how   
>>> would you like to express it? As another (optional) argument to   
>>> lp:get-cts-query()?
>>>
>>> I'm a little concerned about maintaining a distinction between   
>>> cts:query term-level language, vs the language passed to   
>>> cts:tokenize() in lp:get-cts-query-element(). But if it's useful   
>>> functionality, let's figure out how to add it.
>>>
>>> -- Mike
>>>
>>> Shannon wrote:
>>>> Hi,
>>>> Does anyone know whether lib-parser has support for
>>>> language-aware tokenization, for lp:get-cts-query specifically?
>>>> Thanks,
>>>> __________________________________________________
>>>> Shannon Scott Shiflett, programmer/analyst with ROTUNDA,
>>>> The University of Virginia Press, Charlottesville, VA  USA
>>>> http://rotunda.upress.virginia.edu
>>>> _______________________________________________
>>>> General mailing list
>>>> General at developer.marklogic.com
>>>> http://xqzone.com/mailman/listinfo/general

__________________________________________________
Shannon Scott Shiflett, programmer/analyst with ROTUNDA,
The University of Virginia Press, Charlottesville, VA  USA
http://rotunda.upress.virginia.edu
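
The expand-languages() function quoted in this thread handles only
cts:and-query and cts:word-query, and Mike's closing advice is to keep
expanding the typeswitch. As a rough sketch of what that could look
like, the version below adds cases for cts:or-query and cts:not-query.
Those two cases are not from the original message. They assume the
accessors cts:or-query-queries() and cts:not-query-query() are
available in your server version, and the not-query case does not carry
over any weight set on the original query. The word-query and default
branches are unchanged from Mike's code.

```xquery
xquery version "0.9-ml"

define function expand-languages($query as cts:query, $lang as xs:string*)
as cts:query
{
  (: with no languages requested, return the query unchanged :)
  if (empty($lang)) then $query else
  typeswitch($query)
  (: and-query: expand each sub-query, keeping the original options :)
  case cts:and-query return cts:and-query(
    for $q in cts:and-query-queries($query)
    return expand-languages($q, $lang),
    cts:and-query-options($query)
  )
  (: or-query (added, assumption): expand each alternative :)
  case cts:or-query return cts:or-query(
    for $q in cts:or-query-queries($query)
    return expand-languages($q, $lang)
  )
  (: not-query (added, assumption): expand the negated sub-query;
     any weight on the original not-query is not preserved :)
  case cts:not-query return cts:not-query(
    expand-languages(cts:not-query-query($query), $lang)
  )
  (: word-query: one copy per language, replacing any lang= option :)
  case cts:word-query return cts:or-query((
    let $opts :=
      cts:word-query-options($query)[not(starts-with(., 'lang='))]
    let $word := cts:word-query-text($query)
    for $i in $lang
    return cts:word-query($word, ($opts, concat('lang=', $i)))
  ))
  (: any other query type is still unsupported :)
  default return error(
    'UNIMPLEMENTED', text { 'no support for', xdmp:describe($query) } )
}

(: example call: 'foo' in either language, excluding 'bar' in both :)
expand-languages(
  cts:and-query((
    cts:word-query('foo'),
    cts:not-query(cts:word-query('bar'))
  )), ('en', 'fr') )
```

Other query types (element-word queries, near queries, and so on)
would still fall through to the UNIMPLEMENTED error until matching
cases are added in the same pattern.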


