[MarkLogic Dev General] stemmed searches

Danny Sokolsky dsokolsky at marklogic.com
Tue Feb 10 10:45:09 PST 2009


The basic approach is to expand your query so that it runs against each language you are interested in.  For example, if a user enters the search terms:

cat chat

and your content is in English and French, then you can expand into the following cts:query:

cts:or-query((
  cts:and-query((cts:word-query("cat", "lang=en"),  
                 cts:word-query("chat", "lang=en"))),
  cts:and-query((cts:word-query("cat", "lang=fr"),  
                 cts:word-query("chat", "lang=fr")))
))

It is up to you how you decide to parse the user input.
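
As a rough sketch of that expansion (one possible approach, not the only way to parse input), something like the following builds the same query shape from a whitespace-separated string and a list of language codes. The function name local:multi-lang-query and the choice to split on whitespace are assumptions made for illustration:

xquery version "1.0-ml";

(: Hypothetical helper: one and-query per language, OR'ed together. :)
declare function local:multi-lang-query(
  $input as xs:string,
  $langs as xs:string*
) as cts:query
{
  (: Assumption: terms are separated by whitespace. :)
  let $terms := fn:tokenize(fn:normalize-space($input), " ")
  return
    cts:or-query(
      for $lang in $langs
      return
        cts:and-query(
          for $term in $terms
          return cts:word-query($term, fn:concat("lang=", $lang))
        )
    )
};

cts:search(fn:doc(), local:multi-lang-query("cat chat", ("en", "fr")))

Note that empty input produces an and-query over no terms, which matches everything, so you may want to guard against that case.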

-Danny

-----Original Message-----
From: general-bounces at developer.marklogic.com [mailto:general-bounces at developer.marklogic.com] On Behalf Of Whitby, Rob, CMG
Sent: Tuesday, February 10, 2009 9:08 AM
To: General Mark Logic Developer Discussion
Subject: RE: [MarkLogic Dev General] stemmed searches

Can anyone help me with this issue? What is the best way to deal with content in multiple languages?

Thanks
Rob


-----Original Message-----
From: general-bounces at developer.marklogic.com [mailto:general-bounces at developer.marklogic.com] On Behalf Of Whitby, Rob, CMG
Sent: 06 February 2009 11:41
To: General Mark Logic Developer Discussion
Subject: RE: [MarkLogic Dev General] stemmed searches

Thanks for the replies.

I'm using 4.0-1 on 64-bit Windows 2003 Server.

I think it is a language issue. Setting the lang option in the stemmed query does change the number of results. I'm surprised that stemming limits the search to one language; I expected the search to still run against content in other languages, just without any benefit from stemming. Even better would be if the stemming were dynamic, based on the content language.

The consequences are worrying for general searching. I have content in multiple languages and would like the user to be able to enter search terms and receive results in any language. Is the only way to fix this to turn off stemming?

I guess I could set the xml:lang attribute to "en" for every article...

Thanks
Rob



-----Original Message-----
From: general-bounces at developer.marklogic.com [mailto:general-bounces at developer.marklogic.com] On Behalf Of Mary Holstege
Sent: 05 February 2009 20:13
To: General Mark Logic Developer Discussion
Subject: Re: [MarkLogic Dev General] stemmed searches

On Thu, 05 Feb 2009 09:58:19 -0800, Michael Blakeley <michael.blakeley at marklogic.com> wrote:

> Rob,
>
> It's always a good idea to state which server release you are using, 
> and on which OS.
>
> The behavior you've observed doesn't look right to me, but I couldn't 
> easily reproduce it either. That suggests that something 
> content-specific or version-specific is at work: if you have a support 
> contract, I'd suggest that you contact support.

One possibility:

Stemmed searches search within a particular language, in this case the default, most likely English.  If for some reason the element in question is in some other language (e.g. an xml:lang="fr" on the Article element), then that "2009" would be in some other language, and therefore wouldn't show up on a stemmed English word-query.
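
One way to check this, as a sketch only: list the xml:lang values in scope for the Year elements, then rerun the stemmed count with an explicit lang option. The "fr" below is just an example language, not something known about this content:

xquery version "1.0-ml";

(: Which xml:lang values apply to the Year elements or their ancestors? :)
fn:distinct-values(
  /Journal/Volume/Issue/Article/PublishDate/Year/ancestor-or-self::*/@xml:lang
),

(: Stemmed count restricted to one language; "fr" is only an example. :)
fn:count(
  cts:search(
    /Journal/Volume/Issue/Article/PublishDate/Year,
    cts:word-query("2009", ("stemmed", "lang=fr"), 1)
  )
)

If the per-language counts add up to the unstemmed total, the missing matches are explained by language.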

//Mary

>
> Meanwhile, you might try some other approaches. Would
> cts:element-value-query() be appropriate for this use-case? Or perhaps 
> a simple XPath?
>
>    /Journal/Volume/Issue/Article/PublishDate/Year[. eq 2009]
>
> If a word-query is what you want, it would be more efficient to write 
> this as an element-word-query:
>
>   cts:search(
>     /Journal/Volume/Issue/Article/PublishDate,
>     cts:element-word-query(xs:QName('Year'), "2009", ("unstemmed"), 1)
>   )
>
> thanks,
> -- Mike
>
> On 2009-02-05 07:14, Whitby, Rob, CMG wrote:
>> Can anyone explain why these 2 queries return different results?
>>
>> count(
>>    cts:search(
>>      /Journal/Volume/Issue/Article/PublishDate/Year,
>>      cts:word-query("2009", ("unstemmed"), 1)
>>    )
>> )
>>
>> = 3036 (the correct result)
>>
>> count(
>>    cts:search(
>>      /Journal/Volume/Issue/Article/PublishDate/Year,
>>      cts:word-query("2009", ("stemmed"), 1)
>>    )
>> )
>>
>> = 2757
>>
>> Why is the "stemmed" setting causing some matches to be missed?
>
> _______________________________________________
> General mailing list
> General at developer.marklogic.com
> http://xqzone.com/mailman/listinfo/general

