[MarkLogic Dev General] Query weights
Michael Blakeley
michael.blakeley at marklogic.com
Thu May 3 09:15:38 PDT 2007
Peter,
Have you read chapter 23 of the Developers' Guide? It is available at
http://developer.marklogic.com/pubs - section 23.1 talks about the
calculation that we perform for cts:score().
I think the interesting point for your question is that scores are
calculated based on inverse document frequency (IDF) as well as term
frequency (TF). If that doesn't suit your application, you can choose an
alternative scoring technique: try score-logtf or score-simple as
options to cts:search(). See
http://developer.marklogic.com/pubs/3.1/apidocs/SearchBuiltins.html#search
for more information.
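As a rough sketch of what that looks like with your query, the scoring
option goes in the third argument to cts:search(). This is untested here.
I am assuming dc is bound to the standard Dublin Core namespace, the opp
namespace URI is a placeholder I made up, and the prolog syntax may need
adjusting for your server version.

```xquery
(: Sketch only: the opp namespace URI is a placeholder, not the real one. :)
declare namespace dc = "http://purl.org/dc/elements/1.1/";
declare namespace opp = "http://example.com/opp";

for $x in cts:search(
  /doc,
  cts:or-query((
    cts:element-query(xs:QName("dc:title"), cts:word-query("bach", (), 16)),
    cts:element-query(xs:QName("opp:body"), cts:word-query("bach"))
  )),
  "score-simple"   (: swap in "score-logtf" to compare :)
)[1 to 13]
return <result>{ base-uri($x) } : { cts:score($x) }</result>
```

Comparing the cts:score() values under each option against the default
should show whether the scoring method is what is flattening your results.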
It may also be helpful to note that weight is a double. So if weights
are capped at 16.0, you can still weight other terms below 1.0 to dampen
them relative to the boosted term.
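For example, something along these lines keeps the title term at the
maximum weight and pushes the body term down. It is only a sketch: the
0.25 is an arbitrary value to experiment with, and the opp namespace URI
is a placeholder, not your real one.

```xquery
(: Sketch only: 0.25 is an arbitrary damping weight; the opp URI is a placeholder. :)
declare namespace dc = "http://purl.org/dc/elements/1.1/";
declare namespace opp = "http://example.com/opp";

for $x in cts:search(
  /doc,
  cts:or-query((
    (: boosted term: weight at the apparent maximum :)
    cts:element-query(xs:QName("dc:title"), cts:word-query("bach", (), 16)),
    (: dampened term: fractional weight below the default of 1.0 :)
    cts:element-query(xs:QName("opp:body"), cts:word-query("bach", (), 0.25))
  ))
)[1 to 13]
return <result>{ base-uri($x) } : { cts:score($x) }</result>
```

Whether that is enough to change the ordering will depend on your data,
so compare cts:score() before and after.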
thanks,
-- Mike
Peter Hickman wrote:
> This is a follow-up to my previous query about search weightings. The
> problem is a simple search for some text in the opp:body field. If the
> text is also in the dc:title element in addition to the opp:body, then
> boost the score of those results. Naively I entered the following query.
>
> cts:search(
>   /doc,
>   cts:or-query((
>     cts:element-query(xs:QName("dc:title"),cts:word-query("bach",(),16)),
>     cts:element-query(xs:QName("opp:body"),cts:word-query("bach"))
>   ))
> )
>
>
> What is happening, however, does not make any sense. Let me step you
> through my investigation. First I get a list of the first 13 entries
> that have "bach" in opp:body.
>
> <results>{
> for $x at $i in (cts:search(
> /doc, cts:element-query(xs:QName("opp:body"),cts:word-query("bach"))
> ))[1 to 13]
> return <result id="{$i}">
> { base-uri($x) } :
> { cts:score($x) } :
> { $x/opp:meta/dc:title/text() }
> </result>
> }</results>
>
> <opp:results>
> <opp:result id="1">/grove/music/19768 : 465 : Neue
> Bach-Gesellschaft.</opp:result>
> <opp:result id="2">/grove/music/01690 : 434 : Bach,
> Cecilia.</opp:result>
> <opp:result id="3">/grove/music/01696 : 434 : Bach Choir.</opp:result>
> <opp:result id="4">/grove/music/52274 : 434 : Bach,
> P.D.Q.</opp:result>
> <opp:result id="5">/grove/music/O007770 : 434 : Bach, P. D.
> Q.</opp:result>
> <opp:result id="6">/grove/music/52912 : 434 : Bach Guild.</opp:result>
> <opp:result id="7">/grove/music/01710 : 434 : Bach
> Society.</opp:result>
> <opp:result id="8">/opr/t76/e649 : 434 : Bach
> Gesellschaft</opp:result>
> <opp:result id="9">/opr/t114/e526 : 434 : Bach
> Revival</opp:result>
> <opp:result id="10">/opr/t76/e3128 : 403 : Estro armonico,
> L’</opp:result>
> <opp:result id="11">/grove/music/30356 : 403 : Williams, Peter
> (Frederic)</opp:result>
> <opp:result id="12">/grove/music/01689 : 403 : Bach, August
> Wilhelm</opp:result>
> <opp:result id="13">/grove/music/01692 : 403 : Bach, Vincent
> [Schrottenbach, Vinzenz]</opp:result>
> </opp:results>
>
> Then, just to make sure, I searched for "bach" in dc:title only:
>
> <results>{
> for $x at $i in (cts:search(
> /doc, cts:element-query(xs:QName("dc:title"),cts:word-query("bach"))
> ))[1 to 13]
> return <result id="{$i}">
> { base-uri($x) } :
> { cts:score($x) } :
> { $x/opp:meta/dc:title/text() }
> </result>
> }</results>
>
> <opp:results>
> <opp:result id="1">/grove/music/19768 : 465 : Neue
> Bach-Gesellschaft.</opp:result>
> <opp:result id="2">/grove/music/01690 : 434 : Bach,
> Cecilia.</opp:result>
> <opp:result id="3">/grove/music/01696 : 434 : Bach
> Choir.</opp:result>
> <opp:result id="4">/grove/music/52274 : 434 : Bach,
> P.D.Q.</opp:result>
> <opp:result id="5">/grove/music/O007770 : 434 : Bach, P. D.
> Q.</opp:result>
> <opp:result id="6">/grove/music/52912 : 434 : Bach
> Guild.</opp:result>
> <opp:result id="7">/grove/music/01710 : 434 : Bach
> Society.</opp:result>
> <opp:result id="8">/opr/t76/e649 : 434 : Bach
> Gesellschaft</opp:result>
> <opp:result id="9">/opr/t114/e526 : 434 : Bach
> Revival</opp:result>
> <opp:result id="10">/grove/music/01689 : 403 : Bach, August
> Wilhelm</opp:result>
> <opp:result id="11">/grove/music/01692 : 403 : Bach, Vincent
> [Schrottenbach, Vinzenz]</opp:result>
> <opp:result id="12">/grove/music/01693 : 403 : Bach-Abel
> Concerts.</opp:result>
> <opp:result id="13">/grove/music/O006539 : 403 : English Bach
> Festival.</opp:result>
> </opp:results>
>
> Now I combined the two searches with a cts:or-query and no weightings:
>
> <results>{
> for $x at $i in (cts:search(
> /doc, cts:or-query((
> cts:element-query(xs:QName("opp:body"),cts:word-query("bach")),
> cts:element-query(xs:QName("dc:title"),cts:word-query("bach"))
> ))
> ))[1 to 13]
> return <result id="{$i}">
> { base-uri($x) } :
> { cts:score($x) } :
> { $x/opp:meta/dc:title/text() }</result>
> }</results>
>
> <opp:results>
> <opp:result id="1">/grove/music/19768 : 465 : Neue
> Bach-Gesellschaft.</opp:result>
> <opp:result id="2">/grove/music/01690 : 434 : Bach,
> Cecilia.</opp:result>
> <opp:result id="3">/grove/music/01696 : 434 : Bach Choir.</opp:result>
> <opp:result id="4">/grove/music/52274 : 434 : Bach,
> P.D.Q.</opp:result>
> <opp:result id="5">/grove/music/O007770 : 434 : Bach, P. D.
> Q.</opp:result>
> <opp:result id="6">/grove/music/52912 : 434 : Bach Guild.</opp:result>
> <opp:result id="7">/grove/music/01710 : 434 : Bach
> Society.</opp:result>
> <opp:result id="8">/opr/t76/e649 : 434 : Bach
> Gesellschaft</opp:result>
> <opp:result id="9">/opr/t114/e526 : 434 : Bach
> Revival</opp:result>
> <opp:result id="10">/opr/t76/e3128 : 403 : Estro armonico,
> L’</opp:result>
> <opp:result id="11">/grove/music/30356 : 403 : Williams, Peter
> (Frederic)</opp:result>
> <opp:result id="12">/grove/music/01689 : 403 : Bach, August
> Wilhelm</opp:result>
> <opp:result id="13">/grove/music/01692 : 403 : Bach, Vincent
> [Schrottenbach, Vinzenz]</opp:result>
> </opp:results>
>
> The results to note are 10 and 11: these are documents that do not
> contain "bach" in the dc:title element but have identical scores to
> documents that do (results 12 and 13). So now I add some weighting to
> the query for the dc:title element.
>
> <results>{
> for $x at $i in (cts:search(
> /doc, cts:or-query((
> cts:element-query(xs:QName("dc:title"),cts:word-query("bach",(),16)),
> cts:element-query(xs:QName("opp:body"),cts:word-query("bach"))
> ))
> ))[1 to 13]
> return <result id="{$i}">
> { base-uri($x) } :
> { cts:score($x) } :
> { $x/opp:meta/dc:title/text() }</result>
> }</results>
>
> <opp:results>
> <opp:result id="1">/grove/music/19768 : 474 : Neue
> Bach-Gesellschaft.</opp:result>
> <opp:result id="2">/grove/music/01690 : 443 : Bach,
> Cecilia.</opp:result>
> <opp:result id="3">/grove/music/01696 : 443 : Bach Choir.</opp:result>
> <opp:result id="4">/grove/music/52274 : 443 : Bach,
> P.D.Q.</opp:result>
> <opp:result id="5">/grove/music/O007770 : 443 : Bach, P. D.
> Q.</opp:result>
> <opp:result id="6">/grove/music/52912 : 443 : Bach Guild.</opp:result>
> <opp:result id="7">/grove/music/01710 : 443 : Bach
> Society.</opp:result>
> <opp:result id="8">/opr/t76/e649 : 443 : Bach
> Gesellschaft</opp:result>
> <opp:result id="9">/opr/t114/e526 : 443 : Bach
> Revival</opp:result>
> <opp:result id="10">/opr/t76/e3128 : 411 : Estro armonico,
> L’</opp:result>
> <opp:result id="11">/grove/music/30356 : 411 : Williams, Peter
> (Frederic)</opp:result>
> <opp:result id="12">/grove/music/01689 : 411 : Bach, August
> Wilhelm</opp:result>
> <opp:result id="13">/grove/music/01692 : 411 : Bach, Vincent
> [Schrottenbach, Vinzenz]</opp:result>
> </opp:results>
>
> Result .:   1   2   3   4   5   6   7   8   9  10  11  12  13
> Before .: 465 434 434 434 434 434 434 434 434 403 403 403 403
> After ..: 474 443 443 443 443 443 443 443 443 411 411 411 411
>
> As you can see, the scores for all the results have changed, including
> those for results 10 and 11, which have received the same minuscule boost
> as 12 and 13. Remember that 10 and 11 do not have "bach" in the
> dc:title element, so I would have expected them not to receive a
> boost. So the net effect is that everything has changed and
> everything has stayed the same (probably sounds better in French).
>
> Whatever I do, the ordering remains the same. I have tried some
> completely insane values (only to discover that the max appears to be
> 16), and the only outcome is that all the results change by about the
> same amount and the ordering remains unaltered.
>
> I am beginning to suspect that the whole query weighting song and dance
> is just plain broken.
>
> Can someone please tell me what I am doing wrong or what else I might try?
>