[MarkLogic Dev General] search function and results per

Erik Zander Erik.Zander at studentlitteratur.se
Tue Aug 7 02:57:27 PDT 2012


Damon
Thank you for the encouragement. After a discussion with the developers, they agreed to give it a try if I do the splitting.
However, they suggested that I split the documents into text nodes to achieve search-index-like behavior, because they want to be able to score the results based on the following criteria:

Hierarchy of single words:
1)	Words in the chapter title that appear in the register
2)	Words in the chapter title
3)	Words in the section heading that appear in the register
4)	Words in the section heading
5)	Words set aside from the text flow that appear in the register
6)	Words set aside from the text flow
7)	Parts of words in the chapter title that appear in the register
8)	Parts of words in the chapter title
9)	Parts of words in the section heading that appear in the register
10)	Parts of words in the section heading
11)	Parts of words set aside from the text flow that appear in the register
12)	Parts of words set aside from the text flow

Hierarchy of phrases or multiple words:
First: exact phrase, ranked as in 1-12 above
Second: two or more words in accordance with 1-12 above. The greater the number of words that meet the requirements, the higher the relevance
Third: words set aside from the text flow, or parts of them, with higher priority for those nearer to each other within the text they are in


Because of this, I was thinking that a search index structure something like the following
<si:node xmlns:si="http://www.yadayada.com/searchindex">
	<si:text>THE TEXT NODE FROM DOCUMENT</si:text>
	<si:parent>THE ELEMENT TYPE OF THE PARENT NODE</si:parent>
	<si:path>THE ORIGINAL PATH OF THE NODE</si:path>
</si:node>

might be good to store as separate documents. We would then search in those fragments and later go back to the original document, returning the results scored as above.
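
A minimal sketch of how such index documents could be generated, assuming the placeholder namespace above. The source URI "/books/example.xml" and the "/searchindex" URI prefix are hypothetical names chosen for illustration, and whitespace-only text nodes are skipped as a guess at what is wanted:

```xquery
xquery version "1.0-ml";
declare namespace si = "http://www.yadayada.com/searchindex";

(: Hypothetical URI of one source book; replace with a real document URI. :)
let $doc-uri := "/books/example.xml"
(: One index document per non-empty text node. :)
for $text at $i in fn:doc($doc-uri)//text()[fn:normalize-space(.) ne ""]
return
  xdmp:document-insert(
    (: Hypothetical URI scheme: prefix + source URI + running number. :)
    fn:concat("/searchindex", $doc-uri, "/", $i, ".xml"),
    <si:node>
      <si:text>{fn:string($text)}</si:text>
      <si:parent>{fn:name($text/..)}</si:parent>
      <si:path>{xdmp:path($text)}</si:path>
    </si:node>)
```

Since every si:node is stored as its own document, each one would be scored on its own, and si:path together with the URI should be enough to find the way back to the original document.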

What disadvantages do you see with my approach, or could it work?

Would it be more effective to make multiple splits as you suggested?

Regards
Erik


Message: 2
Date: Mon, 6 Aug 2012 07:11:18 -0700
From: Damon Feldman <Damon.Feldman at marklogic.com>
Subject: Re: [MarkLogic Dev General] search function and results per
To: MarkLogic Developer Discussion <general at developer.marklogic.com>
Message-ID:
	<D20C296D14127D4EBD176AD949D8A75A20576EB1AC at EXCHG-BE.marklogic.com>
Content-Type: text/plain; charset="us-ascii"

Erik,

I recommend you work with the developers to make the solution work. If they move content to SQL they will be splitting the content far more severely than they would in MarkLogic. "Splitting" in XQuery may be as simple as (depending on the complexities of docbook):

for $section in $book//(db:para | db:title (: | other DocBook elements as needed :))
return
  xdmp:document-insert(
    concat("/separateBookElements/", xdmp:random(), "/", fn:base-uri($section)),
    $section)

Then you can search to your heart's content, and add relevance boosts for titles and other important elements. The overall book URI will be extractable from the searchable element URI. You still get all the advantages of relevance, tf/idf weights, fields, phrase-through, phrase-around, case, stemming, wildcards, transforms and the like. Your exact solution will vary depending on your needs; this is to illustrate that it is not that hard.
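
As a rough sketch of what such a boosted search over the split elements might look like: the DocBook 5 namespace binding, the search term and the weights 8.0 and 1.0 are assumptions made for illustration, not values from this thread.

```xquery
xquery version "1.0-ml";
(: Assumed namespace; adjust to whatever the stored DocBook actually uses. :)
declare namespace db = "http://docbook.org/ns/docbook";

let $term := "arytmiupplevelse"
(: Hypothetical weights: a title match counts far more than a para match. :)
let $boosted :=
  cts:or-query((
    cts:element-word-query(xs:QName("db:title"), $term, (), 8.0),
    cts:element-word-query(xs:QName("db:para"), $term, (), 1.0)
  ))
(: Restrict the search to the split-out element documents. :)
let $query :=
  cts:and-query((
    cts:directory-query("/separateBookElements/", "infinity"),
    $boosted
  ))
for $hit in cts:search(fn:doc(), $query)
return fn:concat(cts:score($hit), " - ", xdmp:node-uri($hit))
```

Because each split element is its own document, each hit gets its own score rather than sharing one score per book.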

"Splitting" in SQL means modeling the entire domain as a set of tables and keys, and developing a new mapping layer.

Yours,
Damon

-----Original Message-----
From: general-bounces at developer.marklogic.com [mailto:general-bounces at developer.marklogic.com] On Behalf Of Erik Zander
Sent: Monday, August 06, 2012 4:07 AM
To: general at developer.marklogic.com
Subject: Re: [MarkLogic Dev General] search function and results per

Thank you Mike,

Your information sadly matches the small pieces of information I have found elsewhere (I was hoping for one miracle XQuery line :) )

However, the result of not being able to achieve what we want without splitting the books (documents) or introducing fragments (which, after some more research, sounds like a bad way to go) is that the main developers have extracted all the data into SQL ...

I would still be able to get back to them with an easy solution, but I won't hold my breath.

Thanks and regards
Erik


------------------------------

Message: 3
Date: Fri, 3 Aug 2012 08:26:11 -0700
From: Michael Blakeley <mike at blakeley.com>
Subject: Re: [MarkLogic Dev General] search function and results per
	document trunktation error
To: MarkLogic Developer Discussion <general at developer.marklogic.com>
Message-ID: <EE2C2CC0-AB11-449A-BEF7-0778010F14D1 at blakeley.com>
Content-Type: text/plain; charset=us-ascii

Relevance scores are calculated at the fragment level. There are no sub-fragment relevance calculations in MarkLogic. In my opinion the best approach is to live with this, and stop worrying about which sequence of words might be the most relevant to the user. The first reasonable snippet is likely to be from a summary, which should... summarize the document content. If not, let the users read the documents and figure that out for themselves.

But if you have to have sub-document scoring, there are basically three approaches.

In my opinion the least-bad strategy is to store your content at multiple levels of granularity. This is a typical disk space vs CPU trade-off: there might be hundreds or thousands of ancillary documents per main document, where each ancillary document stores a single element at whatever level you want to form your snippets. Once you have that, you would have to implement a fairly complicated two-pass search algorithm. Pass 1 would search the main documents and work out which main document URIs ought to be on the page. Then pass 2 would search the ancillary docs corresponding to those URIs, to gather up the N most relevant snippets for each URI. Some of this code would be highly performance-sensitive, but with the right implementation it should perform well.
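
A very rough, unoptimized sketch of the two-pass idea, under stated assumptions: the main documents are in a hypothetical collection "main", and the ancillary documents for a main document at $uri live under the hypothetical directory "/ancillary" + $uri + "/". The page size of 10 and the 3 snippets per document are arbitrary.

```xquery
xquery version "1.0-ml";

let $query := cts:word-query("arytmiupplevelse")
(: Pass 1: URIs of the ten most relevant main documents. :)
let $main-uris :=
  for $doc in cts:search(fn:collection("main"), $query)[1 to 10]
  return xdmp:node-uri($doc)
(: Pass 2: the three best-scoring ancillary documents per main URI. :)
for $uri in $main-uris
let $snippets :=
  cts:search(fn:doc(),
    cts:and-query((
      cts:directory-query(fn:concat("/ancillary", $uri, "/"), "infinity"),
      $query
    )))[1 to 3]
return
  <result uri="{$uri}">{
    for $s in $snippets
    return <snippet score="{cts:score($s)}">{fn:string($s)}</snippet>
  }</result>
```

This runs one extra search per main document on the page, which is the performance-sensitive part; a real implementation would need tuning.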

The next least-bad approach might be to fragment your documents. But relevance does not roll up from sub-fragments to the document level. So by fragmenting on title, you will remove title from the relevance calculations for the parent document. Because there is no way to automatically consider an element to live in two fragments at once, this approach sacrifices main-document relevance in the name of snippet-level relevance. So fragmenting on title and para is unlikely to yield the results you want.

Probably the worst strategy is to try to implement element-level scoring for yourself. This is especially difficult because XQuery does not have direct access to the TF or IDF data. This mostly has to be done at query time, too, so the scoring is likely to be slow, or bad, or both.

-- Mike

On 3 Aug 2012, at 07:29 , Erik Zander wrote:

> Hi Geert
> 
> Here's the function call, now without snippets, but as the problem appears 
> with as small an amount of code as this, this is what we currently work on
> 
> let $result :=
>   cts:search(fn:collection("Allmanmedicin")//(title|para), $query, ("score-simple"))
> for $m in $result
> return
>   fn:concat(cts:score($m), ' - ', fn:base-uri($m))
> 
> 
> The problem that we have is that for one specific search, all results in one document have the same score.
> 
> What we need is for the score to be separate from the document, so that the results aren't returned per book but instead returned depending on the match.
> 
> I was hoping that the score-simple option would do the trick and not care which document the match came from, but it does not appear so.
> 
> Regards
> Erik
> 
> =============================
> 
> Message: 3
> Date: Fri, 3 Aug 2012 10:47:06 +0200
> From: Erik Zander <Erik.Zander at studentlitteratur.se>
> Subject: [MarkLogic Dev General] search function and results per
> 	document	trunktation error
> To: "general at developer.marklogic.com"
> 	<general at developer.marklogic.com>
> Message-ID:
> 	<666D23968830644D92011BDE450FBE8031E332D12D at DRSTUEX01.studentlitteratur.corp>
> Content-Type: text/plain; charset="iso-8859-1"
> 
> Hi All
> 
> I have a problem with the search functions, both cts:search and search:search.
> 
> The problem is that when searching over a collection, documents with many matches are prioritized first, and only after that are the custom weights applied.
> 
> As a result, the search has truncated the result before we are able to influence the score of the matches.
> 
> What we need is to have the matches returned independently 
> of which document the specific element lies in, so that we could 
> prioritize, for example, all relevant docbook:titles first, then go into 
> docbook:blockquotes, and lastly single docbook:paras, across more than one 
> document with the docbook structure (see below for a super short
> example)
> 
> <chapter xml:id="isbn_9789144019895_ch_2" label="2"> <title>Den 
> kliniska undersökningen</title> <section> <title>Sjukhistorien</title> 
> <para>En noggrant förd journal är givetvis av samma vikt vid 
> hjärtsjukdomarna som i alla andra medicinska sammanhang. Vilket eller 
> vilka symtom begränsar prestationsförmågan? De viktigaste och 
> vanligaste symtomen hos hjärtsjuka är <emphasis 
> role="italic">trötthet</emphasis>, som uttryck för låg 
> hjärtminutvolym, <emphasis role="italic">andfåddhet</emphasis> framför 
> allt orsakad av lungstas, <emphasis 
> role="italic">bröstsmärta</emphasis> vid kärlkramp, samt <emphasis 
> role="italic">arytmiupplevelse</emphasis>. Hjärtpatienter har ofta en 
> anmärkningsvärd förmåga att):</para> <para>Indelningen enligt NYHA 
> tillämpas framför allt i samband med hjärtsvikt. Vid ischemisk 
> hjärtsjukdom klassificeras symtomen vanligen enligt Canadian 
> Cardiological Society (CCS), vars indelning i de fyra klasserna i 
> princip inte skiljer sig från den<example label="Faktaruta 2.1"
> xml:id="isbn_9789144019895_infobox_1" role="box"> <title/> 
> <blockquote><itemizedlist mark="none"> <listitem><para><emphasis 
> role="bold">Klass I</emphasis>
> 
> I'm lost as to where I should be looking for a solution to this problem. How should we search in the documents so that results are scored independently of which document they are in?
> Is this a coding or a configuration error, or is this the expected and only behavior?
> 
> Regards
> Erik
> 
> ------------------------------
> 
> _______________________________________________
> General mailing list
> General at developer.marklogic.com
> http://developer.marklogic.com/mailman/listinfo/general
> 