[MarkLogic Dev General] RE: Restricting Search Hits To Immediate Parent Containers

John Craft jcraft at jonesmcclure.com
Thu Jul 3 08:37:35 PDT 2008


Chris-

Thanks for the information.  I'm reading through chapter 18 of the
Administrator's Guide to get a better understanding of fragments in
MarkLogic.

As for the prototype I'm working on, it's basically a replica of the
Newswire Analyst demo: a search engine with facets.  I've been working
with the code libraries available on the Mark Logic Developer site
(lib-search, Versi, search-ui, and lib-uitools) and I have a decent
grasp of how they work.  What will make my prototype different, though,
is the number of documents.  Rather than have hundreds of small
documents, I've got 10 documents that are larger in size (approximately
400KB each) and I would like to search through them and pull out small
segments of information.  The DTD for the documents is DocBook-like
except that it uses nested <section> elements to represent different
heading levels and there are a few inline elements that are specific to
our domain (legal publishing).  Each <section> element has a <title> and
a <p> and will eventually contain <indexterm> elements that will be used
for faceted search.  Obviously, some <section> elements will contain
other <section> elements.

I would like to search across <section> elements, weighting the <title>
higher than <p> and use the <title> in the display of the search hit.  I
would also like to get the <title> of the parent of the <section>
element, whether it be a <chapter>, <subchapter>, or another <section>
element and use that "parent title" as a facet.  I was planning on
getting the title of the parent container by using some XPath on each
search result, although I realize that could be pretty inefficient.
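
The "parent title" facet described above could be sketched roughly as follows. This is only an illustration under assumptions: the elements are taken to be unqualified <section>, <title>, and <p> as outlined in this thread, "searchTerm" is a placeholder, and the <facet> output element is a made-up name. The per-hit XPath step is exactly the part that may prove inefficient on large result sets.

```xquery
(: Sketch only: element names follow the structure outlined in this
   thread; "searchTerm" and the <facet> element are placeholders. :)
let $q := cts:word-query("searchTerm")

(: Keep only sections whose own <title> or <p> children match. :)
let $hits :=
  cts:search(fn:doc()//section, $q)[cts:contains((title, p), $q)]

(: One string per hit: the title of its immediate parent container,
   whether that is a <chapter>, <subchapter>, or <section>. :)
let $parent-titles :=
  for $hit in $hits
  return fn:string($hit/parent::*/title)

(: Count the hits under each distinct parent title. :)
for $t in fn:distinct-values($parent-titles)
order by $t
return <facet title="{$t}" count="{fn:count($parent-titles[. eq $t])}"/>
```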

Hopefully, that gives you a better understanding of what I'm trying to
achieve.  If you would like more details, just let me know.

I appreciate the help.

John Craft

-----Original Message-----
From: general-bounces at developer.marklogic.com
[mailto:general-bounces at developer.marklogic.com] On Behalf Of
Christopher Welch
Sent: Wednesday, July 02, 2008 4:37 PM
To: General Mark Logic Developer Discussion
Subject: RE: [MarkLogic Dev General] Restricting Search Hits To
Immediate Parent Containers

Hi John,

Glad that worked for you :)

Ordering by cts:score descending, as in the statement you reference, is
what MarkLogic already does by default with the results of a cts:search
expression. I suspect that you don't have fragmentation enabled, which
means that your results will be ordered first by the relevance score of
the containing document, then by the document order of the result
nodes.

Scores are calculated at the fragment level. When you return results
from cts:search, the cts:score will be one of the following:

1) If the resulting node is a fragment root, then the node score will
equal the fragment score. This is the ideal case.
2) If the resulting node is not a fragment root, and contains no
fragments, then the node score will equal the parent fragment score.
3) If the resulting node contains fragments, then the score is the
highest score of all the encompassed fragments (itself and its
constituent fragments)

Without fragmentation, the search you ran below falls under case #2. But
if you were searching at the document level you would experience case
#1.

If you choose to fragment, say at the section level, then you would
experience case #1 for the search below, and case #3 if you were to
search at the document level.
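
One way to see which of these cases applies is to print the score next to each hit. The following is a minimal sketch, assuming unqualified <section> and <title> elements and a placeholder search term; the <result> element is a made-up name. Without fragmentation, the cases above suggest that every section from the same document would report the same score.

```xquery
(: Sketch only: "searchTerm" and the <result> element are placeholders. :)
let $q := cts:word-query("searchTerm")
for $hit in cts:search(fn:doc()//section, $q)
return
  (: The document URI shows which hits share a document (and, when
     the database is unfragmented, a score). :)
  <result uri="{xdmp:node-uri($hit)}" score="{cts:score($hit)}">
    {fn:string($hit/title)}
  </result>
```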

These are trade-offs that you would need to make as part of your
indexing strategy, depending on which searches most need accurate
relevance calculations.

It *may* be worthwhile to try fragmenting at the section level to see
what happens, but keep in mind that it requires a reindex, so if you
have a lot of content, you may want to back up your database first
and let reindexing run overnight.

In terms of searching on title, I believe at that point you would find
it easier and faster to search at a section level, not a paragraph
level. Then perform a filtering step to remove any sections you're not
interested in displaying.
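
A section-level search with a heavier weight on <title>, followed by a filtering step, might look something like the sketch below. The weights 2.0 and 1.0 are arbitrary illustrative values, "searchTerm" is a placeholder, and the <hit> element is a made-up name. Because scores are computed per fragment, the weights would only separate individual sections if the content is fragmented at the section level.

```xquery
(: Sketch only: the weights and the <hit> element are assumptions. :)
let $term := "searchTerm"
let $q :=
  cts:or-query((
    (: A match in <title> counts for more than a match in <p>. :)
    cts:element-word-query(xs:QName("title"), $term, (), 2.0),
    cts:element-word-query(xs:QName("p"), $term, (), 1.0)
  ))
for $section in cts:search(fn:doc()//section, $q)
(: Filtering step: drop sections that match only through a nested
   section rather than through their own <title> or <p> children. :)
where cts:contains(($section/title, $section/p), cts:word-query($term))
return
  <hit score="{cts:score($section)}">{fn:string($section/title)}</hit>
```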

If you can share what kind of application or prototype you're trying to
build, that would be useful.

Cheers!
Chris

-----Original Message-----
From: general-bounces at developer.marklogic.com
[mailto:general-bounces at developer.marklogic.com] On Behalf Of John Craft
Sent: Wednesday, July 02, 2008 1:27 PM
To: General Mark Logic Developer Discussion
Subject: RE: [MarkLogic Dev General] Restricting Search Hits To
Immediate Parent Containers

Chris-

That is precisely what I was trying to do.  You nailed it.  And thank
you Danny and Mike; your input was also very helpful.

If I add a simple "order by cts:score($search-hit) descending" to the
FLWOR, will that sort the results fairly across all books?  Or is the
relevance of each book factored into the score, making the score not
something that can be sorted on consistently across books?

Also, if I plan on adding other elements to the search (ex. <title>),
would you recommend using a field as Danny suggested?  I have done a
little reading through the documentation and it seems like a good
approach.
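
For reference, a field-based search might be sketched as follows. It assumes a field has already been defined in the database configuration to include <title> and <p>; the field name "section-text" is hypothetical, and "searchTerm" is a placeholder.

```xquery
(: Sketch only: "section-text" is a hypothetical field name that would
   have to be created in the database configuration first. :)
let $q := cts:field-word-query("section-text", "searchTerm")
for $section in cts:search(fn:doc()//section, $q)
return $section/title
```

A filtering step would still be needed to drop ancestor sections that match only through a nested section.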

Thanks.

John Craft

-----Original Message-----
From: general-bounces at developer.marklogic.com
[mailto:general-bounces at developer.marklogic.com] On Behalf Of
Christopher Welch
Sent: Wednesday, July 02, 2008 11:39 AM
To: General Mark Logic Developer Discussion
Subject: RE: [MarkLogic Dev General] Restricting Search Hits To
Immediate Parent Containers

John,

To reiterate, I believe what you're asking for is the ability to
generate a list of search results that finds matching paragraphs but
returns the parent section. We do something similar to this in one of
our popular demos, Medbook. You might want to try an approach similar to
what Mike suggested:

fn:distinct-nodes( cts:search(fn:doc()//p,
cts:word-query("searchTerm"))/ancestor::section[1]  )

Bear in mind that if your results are not ordered and you do not have
fragmentation enabled, then the order of the sections will be based on
the relevance of the book each "p" element was contained in, and then
sub-sorted again by the document order of the matching elements.
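
If the ordering should be explicit instead, a FLWOR along the following lines is one option. This is a sketch under assumptions: "searchTerm" is a placeholder, and each section is emitted once by keeping only its first matching <p> child.

```xquery
(: Sketch only: one result per section, ordered by the score of the
   matching paragraph. :)
let $q := cts:word-query("searchTerm")
for $p in cts:search(fn:doc()//p, $q)
let $section := $p/parent::section
where fn:exists($section)
  (: Skip a <p> when an earlier sibling <p> already matched, so the
     same section is not returned twice. :)
  and fn:empty($p/preceding-sibling::p[cts:contains(., $q)])
order by cts:score($p) descending
return $section
```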

~ Chris

-----Original Message-----
From: general-bounces at developer.marklogic.com
[mailto:general-bounces at developer.marklogic.com] On Behalf Of Mike
Sokolov
Sent: Wednesday, July 02, 2008 9:28 AM
To: General Mark Logic Developer Discussion
Subject: Re: [MarkLogic Dev General] Restricting Search Hits To
Immediate Parent Containers

How about

cts:search(fn:doc()//p,
  cts:element-query(xs:QName("p"), "searchTerm"))/parent::section

or

cts:search(fn:doc()//p,
  cts:element-word-query(xs:QName("p"), "searchTerm"))/parent::section ?



Danny Sokolsky wrote:
> Do you want your searches to always return the top-level "section", but
> return it if the match is in a p tag child of *any* section element?
> Your concern about returning dups implies that.  If so, then you can
> rename your top-level section in your xml, and then perform a search
> something like:
>
> let $q := cts:word-query("searchterm")
> return
> cts:search(/path/to/top-level-section, $q)[( cts:contains(./p, $q) or
>   cts:contains(.//section/p, $q) )]
>
> Given that you want to search a more complicated set of elements,
> however, another option to consider is creating a field, specifying the
> needed included and excluded elements.  Then you could use
> cts:field-word-query to search.  I am not positive the field will work
> for your content, but it sounds like it is worth pursuing.  To find out
> more about fields, see the "Fields Database Settings" chapter of the
> Administrator's Guide
> (http://developer.marklogic.com/pubs/3.2/books/admin.pdf).
>
> -Danny
>
> -----Original Message-----
> From: general-bounces at developer.marklogic.com
> [mailto:general-bounces at developer.marklogic.com] On Behalf Of John Craft
> Sent: Tuesday, July 01, 2008 6:30 PM
> To: General Mark Logic Developer Discussion
> Subject: RE: [MarkLogic Dev General] Restricting Search Hits To
> Immediate Parent Containers
>
> Danny-
>
> Thanks for the suggestions.  One thing I didn't mention, as I was
> trying to keep the example simple, is that I would eventually like to
> search additional child elements of <section> (like a <title> element
> and possibly <indexterm>) in addition to <p>, weighting them
> appropriately.  That rules out your third suggestion and may rule out
> your first suggestion (not quite sure).
>
> The second approach won't work because there could be a <section> that
> contains a <p> that also contains a <section> that contains a <p> that
> contains the search terms.  Example:
>
> <section>
>  <p />
>  <section>
>   <p>search terms</p>
>  </section>
> </section>
>
> Using the predicate [fn:exists(./p)], the markup above would return two
> results when I would like for it to return one.
>
> If you think there is an approach that uses cts:query() I would be very
> interested.  Our content is pretty simple and I have included an
> outline of the basic structure below.  Of course, I could also send you
> more (or a file) if that would be more helpful.
>
> Content structure (nested sections can go eight levels deep):
>
> <chapter>
>  <title />
>  <subchapter>
>   <title />
>   <section>
>    <title />
>    <p />
>    <section>
>     <title />
>     <p />
>     <section>
>      <title />
>      <p />
>     </section>
>    </section>
>   </section>
>  </subchapter>
> </chapter>
>
> I'm willing to add/edit elements and attributes if necessary.  I just
> don't know what would make things easiest for MarkLogic.
>
> Thanks again.
>
> John Craft
>
>
>
> -----Original Message-----
> From: general-bounces at developer.marklogic.com
> [mailto:general-bounces at developer.marklogic.com] On Behalf Of Danny
> Sokolsky
> Sent: Tuesday, July 01, 2008 4:27 PM
> To: General Mark Logic Developer Discussion
> Subject: RE: [MarkLogic Dev General] Restricting Search Hits To
> Immediate Parent Containers
>
> Hi John,
>
> This can be a little tricky, as it sounds like your "section" elements
> can mean different things in different places in the document.  One
> approach can be to change your section element names for the ones that
> have p children to something different, and then search over those.  It
> would be relatively easy to write a transformation in XQuery to do
> that.  Ultimately, this might prove to make your content the most
> searchable for what you want.
>
> Another approach is to filter out the results that do not have a direct
> p child from the search results.  This will probably be OK if the
> number of results to filter is small relative to the number of results
> returned from the search.  This might look something like:
>
> cts:search(//section, "searchterm")[fn:exists(./p)]
>
> You can also search below the section element (//section/p), but that
> would return p elements.  Depending on your content, that might work.
>
> There may be a cts:query solution here, too, but without knowing your
> content very well, it is harder for me to see that.  
>
> -Danny
>
> -----Original Message-----
> From: general-bounces at developer.marklogic.com
> [mailto:general-bounces at developer.marklogic.com] On Behalf Of John Craft
> Sent: Tuesday, July 01, 2008 12:58 PM
> To: General Mark Logic Developer Discussion
> Subject: [MarkLogic Dev General] Restricting Search Hits To Immediate
> Parent Containers
>
> I am evaluating MarkLogic and have been playing around with the
> cts:element-query() and cts:element-word-query() expressions.  So far,
> I am having difficulty restricting search results to elements that are
> direct parents of the elements that contain the search terms.
>
> Our content is made up of nested <section> elements and most <section>
> elements contain <p> elements, which are our containers for paragraph
> text.  The <section> elements contain <title> elements and other
> information as well.  When performing a search, I would like to limit
> the results to only the <section> elements whose direct <p> children
> contain search terms.  I began by creating the following cts:search()
> string:
>
> cts:search(fn:doc()//section, cts:element-query(xs:QName("section"),
> cts:element-query(xs:QName("p"), "searchTerm") ))
>
> This approach was flawed because the search results included <section>
> elements that were further up the tree and didn't directly contain <p>
> elements (or, rather, <p> elements that contained the search terms).
>
> My next approach was to use cts:element-word-query() and create an
> element-word-query-through for the <p> element:
>
> cts:search(fn:doc()//section,
> cts:element-word-query(xs:QName("section"), "searchTerm") )
>
> Again, the search results contain <section> elements that aren't direct
> parents of <p> elements that contain search terms.  The end result is
> that I end up with a lot of <section> elements that are false
> positives.
>
> I'm beginning to think the path information on the first cts:search()
> argument may be the problem, but I'm not sure.  And if it is the
> problem, how else can I get search results returned as <section>
> elements?
>
> I appreciate any help or suggestions you can provide.
>
> Thanks.
>
> John Craft
> _______________________________________________
> General mailing list
> General at developer.marklogic.com
> http://xqzone.com/mailman/listinfo/general