temporary tree (was Re: [MarkLogic Dev General] performance question)

Michael Blakeley michael.blakeley at marklogic.com
Sun Jul 8 11:55:21 PDT 2007


James,

I can run a similar query rather quickly: 230 ms for 500 MedlineCitation 
fragments, with my laptop in powersave mode. I believe that the output 
is equivalent.

<result>{
   for $key in distinct-values(
     for $author in collection()/MedlineCitationSet/MedlineCitation
       /Article/AuthorList/Author[ LastName ]
     return string-join(($author/LastName, $author/ForeName), "|") )
   (: the tokenize pattern is a regular expression, so the pipe is escaped :)
   let $name := tokenize($key, '\|')
   order by $key
   return <author>{
     element surname { $name[1] },
     element fname { $name[2] } }</author>
}</result>

In this new query, the dominant costs are the rooted XPath, which brings 
the MedlineCitation fragments into memory, and the distinct-values 
calculation over the pipe-delimited keys. That's fine for small numbers 
of fragments, but the memory required grows rapidly as the content set 
gets larger. That's likely to be an issue with Saxon, too.

To scale up, it's better to use a range index of type string. We can 
access its values via cts:element-values() or 
cts:element-attribute-values(). This approach can deliver your answers 
in milliseconds.

For your desired output, though, this would involve some content 
enrichment as well: perhaps by adding a 'key' attribute on every Author 
element. The new "Corb" tool on http://developer.marklogic.com/code/ is 
a good resource for this sort of enrichment, and the example 
medline-iso8601.xqy module is very close to what you'd need 
(http://developer.marklogic.com/svn/corb/trunk/src/java/com/marklogic/developer/corb/medline-iso8601.xqy).
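As a rough sketch of the lexicon lookup, assuming the enrichment has 
already added a hypothetical 'key' attribute of the form 
"LastName|ForeName" to each Author element, and that a string range 
index has been configured on Author/@key (neither exists in the content 
as loaded), the query might look something like this:

```xquery
(: Sketch only. Assumes a 'key' attribute on every Author element,
   holding "LastName|ForeName", and a string range index on Author/@key.
   Both are assumptions for illustration, not part of the original data. :)
<result>{
  (: read the distinct keys straight from the range index :)
  for $key in cts:element-attribute-values(
    xs:QName("Author"), xs:QName("key"))
  (: the pattern is a regular expression, so the pipe is escaped :)
  let $name := tokenize($key, "\|")
  return <author>{
    element surname { $name[1] },
    element fname { $name[2] } }</author>
}</result>
```

The values come from the index rather than from the fragments, so no 
MedlineCitation fragments need to be brought into memory. As far as I 
know the values are returned in ascending order by default; add an 
explicit "order by $key" if you would rather not rely on that.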

-- Mike

James A. Robinson wrote:
> The question posed by Helen got me curious about the performance of
> MarkLogic when it comes to dealing with trees built during the query.
> 
> I put together a set of 636 pubmed articles (so, a tiny set compared to
> the amount Helen is dealing with). This set of articles contains 2847
> author elements, 2634 of which are unique.
> 
> I used the XQuery I sent to the list and timed the results.  The total
> execution time was about 20 seconds!  It looks like 19.7 or so of those
> seconds are spent weeding out the unique authors -- it only takes the
> server about .3 seconds to build the list of author names and the list
> of unique author keys, and the rest of the time is spent looking at
> the authors list for those unique author keys.
> 
> I timed this against Saxon, on the same machine, where Saxon loaded the
> files up from disk.  It takes Saxon about 1.5 seconds to load the
> files, but the amount of time to actually execute the query was only
> about .1 to .2 seconds (so under 2 seconds total execution time).
> 
> This struck me as odd; I wouldn't have expected this much of a difference.
> Since MarkLogic Server was blazingly fast at actually loading the
> documents (which makes sense, xdmp:query-meters() shows the documents were
> read from cache), I assume the difference is that Saxon is much better
> at building an index for the temporary tree -- Is MarkLogic not doing
> anything similar?  If not, is there a technique one can use to force it
> to?  Is there some other way one should approach a manipulation like this?
> 
> I ask because this seemed like a typical sort of problem one might need
> to solve in XQuery (when the documents don't have quite as granular a
> view as one needs, it seems reasonable to assume one should be able to
> build up the granular representation as part of the query).
> 
> <result>{
>   let $authors :=
>     for $author in collection()/MedlineCitationSet/MedlineCitation/Article/AuthorList/Author
>     let $surname := data($author/LastName)
>     let $fname   := data($author/FirstName)
>     let $key     := string-join(($surname,$fname), "|")
>     where exists($author/LastName)
>     return
>       <author key="{$key}">{
>         <surname>{$surname}</surname>,
>         if ($fname ne '') then <fname>{$fname}</fname> else ()
>       }</author>
>   let $unique :=
>     for $key in distinct-values($authors/@key)
>     order by $key
>     return $key
>   return
>     for $key in $unique
>     return <author>{$authors[@key=$key][1]/*}</author> (: the dreadfully slow part ... :)
> }</result>
> 
> 
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> James A. Robinson                       jim.robinson at stanford.edu
> Stanford University HighWire Press      http://highwire.stanford.edu/
> +1 650 7237294 (Work)                   +1 650 7259335 (Fax)
