[MarkLogic Dev General] search function and results per

Erik Zander Erik.Zander at studentlitteratur.se
Mon Aug 6 01:07:17 PDT 2012


Thank you Mike,

Your information sadly matches the small pieces of information I have found elsewhere (I was hoping for one miracle XQuery line :) ).

However, since we cannot achieve the result we want without splitting the books (documents) or introducing fragments (which, after some more research, sounds like a bad way to go), the main developers have extracted all the data into SQL ...

I could still get back to them if an easy solution turns up, but I won't hold my breath.

Thanks and regards
Erik


------------------------------

Message: 3
Date: Fri, 3 Aug 2012 08:26:11 -0700
From: Michael Blakeley <mike at blakeley.com>
Subject: Re: [MarkLogic Dev General] search function and results per
	document trunktation error
To: MarkLogic Developer Discussion <general at developer.marklogic.com>
Message-ID: <EE2C2CC0-AB11-449A-BEF7-0778010F14D1 at blakeley.com>
Content-Type: text/plain; charset=us-ascii

Relevance scores are calculated at the fragment level. There are no sub-fragment relevance calculations in MarkLogic. In my opinion the best approach is to live with this, and stop worrying about which sequence of words might be the most relevant to the user. The first reasonable snippet is likely to be from a summary, which should... summarize the document content. If not, let the users read the documents and figure that out for themselves.

But if you have to have sub-document scoring, there are basically three approaches.

In my opinion the least-bad strategy is to store your content at multiple levels of granularity. This is a typical disk space vs CPU trade-off: there might be hundreds or thousands of ancillary documents per main document, where each ancillary document stores a single element at whatever level you want to form your snippets. Once you have that, you would have to implement a fairly complicated two-pass search algorithm. Pass 1 would search the main documents and work out which main document URIs ought to be on the page. Then pass 2 would search the ancillary docs corresponding to those URIs, to gather up the N most relevant snippets for each URI. Some of this code would be highly performance-sensitive, but with the right implementation it should perform well.
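
The two passes described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a tuned implementation: the collection names "books" and "snippets", the convention that each ancillary document is also placed in a collection named after its main document's URI, the example search term, and the page and snippet limits are all hypothetical and do not come from this thread.

```xquery
xquery version "1.0-ml";

(: Sketch only. Assumed (hypothetical) layout, not taken from this thread:
   - main documents are in the collection "books"
   - ancillary single-element documents are in the collection "snippets"
   - each ancillary document is also in a collection named after the URI
     of the main document it was cut from :)
declare variable $query := cts:word-query("symtom");
declare variable $page-size := 10;
declare variable $snippets-per-doc := 3;

(: Pass 1: URIs of the main documents for this page, in relevance order. :)
let $main-uris :=
  for $doc in cts:search(fn:collection("books"), $query)[1 to $page-size]
  return xdmp:node-uri($doc)

(: Pass 2: for each of those URIs, the best-scoring ancillary documents. :)
for $uri in $main-uris
return
  <result uri="{$uri}">{
    for $snip in cts:search(
      fn:collection("snippets"),
      cts:and-query((cts:collection-query($uri), $query))
    )[1 to $snippets-per-doc]
    return
      <snippet score="{cts:score($snip)}">{xdmp:node-uri($snip)}</snippet>
  }</result>
```

Note that pass 2 as written runs one search per main document on the page, so this is the part where the performance-sensitive work would likely be.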

The next least-bad approach might be to fragment your documents. But relevance does not roll up from sub-fragments to the document level. So by fragmenting on title, you will remove title from the relevance calculations for the parent document. Because there is no way to automatically consider an element to live in two fragments at once, this approach sacrifices main-document relevance in the name of snippet-level relevance. So fragmenting on title and para is unlikely to yield the results you want.

Probably the worst strategy is to try to implement element-level scoring for yourself. This is especially difficult because XQuery does not have direct access to the TF or IDF data. This mostly has to be done at query time, too, so the scoring is likely to be slow, or bad, or both.

-- Mike

On 3 Aug 2012, at 07:29 , Erik Zander wrote:

> Hi Geert
> 
> Here's the function call, now without snippets. Since the problem
> appears even with this small amount of code, this is what we are
> currently working on:
> 
> let $result :=
>   cts:search(fn:collection("Allmanmedicin")//(title|para), $query, ("score-simple"))
> for $m in $result
> return
>   fn:concat(cts:score($m), ' - ', fn:base-uri($m))
> 
> 
> The problem we have is that, for one specific search, all results in one document have the same score.
> 
> What we need is for the score to be independent of the document, so that results are not returned per book but ranked by how well each individual match scores.
> 
> I was hoping that the score-simple option would do the trick and not care which document the match came from, but that does not appear to be the case.
> 
> Regards
> Erik
> 
> =============================
> 
> Message: 3
> Date: Fri, 3 Aug 2012 10:47:06 +0200
> From: Erik Zander <Erik.Zander at studentlitteratur.se>
> Subject: [MarkLogic Dev General] search function and results per
> 	document	trunktation error
> To: "general at developer.marklogic.com"
> 	<general at developer.marklogic.com>
> Message-ID:
> 	<666D23968830644D92011BDE450FBE8031E332D12D at DRSTUEX01.studentlitteratur.corp>
> Content-Type: text/plain; charset="iso-8859-1"
> 
> Hi All
> 
> I have a problem with the search functions, both cts:search and search:search.
> 
> The problem is that when searching over a collection, documents with many matches are prioritized first, and only after that are the custom weights applied.
> 
> As a result, the search has already truncated the result before we are able to influence the score of the matches.
> 
> What we need is for the matches to be returned independently of which
> document the matching element lies in, so that we could, for example,
> rank all relevant docbook:titles first, then docbook:blockquotes, and
> lastly single docbook:paras, across more than one document with the
> DocBook structure (see below for a very short example):
> 
> <chapter xml:id="isbn_9789144019895_ch_2" label="2">
> <title>Den kliniska undersökningen</title>
> <section>
> <title>Sjukhistorien</title>
> <para>En noggrant förd journal är givetvis av samma vikt vid
> hjärtsjukdomarna som i alla andra medicinska sammanhang. Vilket eller
> vilka symtom begränsar prestationsförmågan? De viktigaste och
> vanligaste symtomen hos hjärtsjuka är <emphasis
> role="italic">trötthet</emphasis>, som uttryck för låg
> hjärtminutvolym, <emphasis role="italic">andfåddhet</emphasis> framför
> allt orsakad av lungstas, <emphasis
> role="italic">bröstsmärta</emphasis> vid kärlkramp, samt <emphasis
> role="italic">arytmiupplevelse</emphasis>. Hjärtpatienter har ofta en
> anmärkningsvärd förmåga att):</para>
> <para>Indelningen enligt NYHA tillämpas framför allt i samband med
> hjärtsvikt. Vid ischemisk hjärtsjukdom klassificeras symtomen vanligen
> enligt Canadian Cardiological Society (CCS), vars indelning i de fyra
> klasserna i princip inte skiljer sig från den<example
> label="Faktaruta 2.1" xml:id="isbn_9789144019895_infobox_1" role="box">
> <title/>
> <blockquote><itemizedlist mark="none">
> <listitem><para><emphasis role="bold">Klass I</emphasis>
> 
> I'm at a loss as to where I should look for a solution to this problem. How can we search the documents so that results are scored independently of which document they are in?
> Is this a coding or configuration error, or is this the expected and only behavior?
> 
> Regards
> Erik
> ------------------------------
> 
> _______________________________________________
> General mailing list
> General at developer.marklogic.com
> http://developer.marklogic.com/mailman/listinfo/general


