temporary tree (was Re: [MarkLogic Dev General] performance question)
Michael Blakeley
michael.blakeley at marklogic.com
Sun Jul 8 11:55:21 PDT 2007
James,
I can run a similar query rather quickly: 230 ms for 500 MedlineCitation
fragments, with my laptop in powersave mode. I believe that the output
is equivalent.
<result>{
  for $key in distinct-values(
    for $author in collection()/MedlineCitationSet/MedlineCitation
      /Article/AuthorList/Author[ LastName ]
    return string-join(($author/LastName, $author/ForeName), "|") )
  (: tokenize() takes a regular expression, so the pipe must be escaped :)
  let $name := tokenize($key, '\|')
  order by $key
  return <author>{
    element surname { $name[1] },
    element fname { $name[2] } }</author>
}</result>
In this new query, the dominant cost is the rooted XPath, which brings
the MedlineCitation fragments into memory, plus the distinct-values()
computation over the pipe-delimited keys. That's fine for small numbers
of fragments, but memory use grows quickly with large content sets.
That's likely to be an issue with Saxon, too.
To scale up, it's better to use a range index of type string. We can
access its values via cts:element-values() or
cts:element-attribute-values(). This approach can deliver your answers
in milliseconds.
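As a rough sketch of the lexicon lookup, assuming a string range index
has been configured on the LastName element (an assumption; no such
index is described in this thread), the distinct surnames could be read
straight from the index, without touching the fragments:

```xquery
(: Assumes a string range index on the LastName element; without one,
   cts:element-values() raises an error instead of returning values. :)
<result>{
  for $surname in cts:element-values(xs:QName("LastName"))
  return <author><surname>{ $surname }</surname></author>
}</result>
```

This covers surnames only. Pairing each surname with its forename needs
a single indexed value per author.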
For your desired output, though, this would involve some content
enrichment as well: perhaps by adding a 'key' attribute on every Author
element. The new "Corb" tool on http://developer.marklogic.com/code/ is
a good resource for this sort of enrichment, and the example
medline-iso8601.xqy module is very close to what you'd need
(http://developer.marklogic.com/svn/corb/trunk/src/java/com/marklogic/developer/corb/medline-iso8601.xqy).
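A minimal sketch of that enrichment, assuming the key is the same
pipe-delimited LastName|ForeName string used in the query above and
that rewriting each Author element in place is acceptable. The
attribute name 'key' is only the suggestion made here, and a Corb job
would presumably apply this one document at a time rather than over the
whole collection in a single transaction:

```xquery
(: Add a key attribute to every Author that has a LastName and no key yet. :)
for $author in collection()/MedlineCitationSet/MedlineCitation
  /Article/AuthorList/Author[ LastName ][ not(@key) ]
return xdmp:node-replace($author,
  element Author {
    $author/@*,
    attribute key {
      string-join(($author/LastName, $author/ForeName), "|") },
    $author/node()
  })
```

With a string range index on that attribute (again an assumption), the
distinct keys could then come back from the index:

```xquery
(: Assumes a string range index on Author/@key. :)
<result>{
  for $key in cts:element-attribute-values(
    xs:QName("Author"), xs:QName("key"))
  let $name := tokenize($key, '\|')
  return <author>{
    element surname { $name[1] },
    element fname { $name[2] } }</author>
}</result>
```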
-- Mike
James A. Robinson wrote:
> The question posed by Helen got me curious about the performance of
> MarkLogic when it comes to dealing with trees built during the query.
>
> I put together a set of 636 pubmed articles (so, a tiny set compared to
> the amount Helen is dealing with). This set of articles contains 2847
> author elements, 2634 of which are unique.
>
> I used the XQuery I sent to the list and timed the results. The total
> execution time was about 20 seconds! It looks like 19.7 or so of those
> seconds are spent weeding out the unique authors -- it only takes the
> server about .3 seconds to build the list of author names and the list
> of unique author keys, and the rest of the time is spent looking at
> the authors list for those unique author keys.
>
> I timed this against Saxon, on the same machine, where Saxon loaded the
> files up from disk. It took Saxon about 1.5 seconds to load the
> files, but the amount of time to actually execute the query was only
> about .1 to .2 seconds (so under 2 seconds total execution time).
>
> This struck me as odd; I wouldn't have expected this much of a difference.
> Since MarkLogic Server was blazingly fast at actually loading the
> documents (which makes sense, xdmp:query-meters() shows the documents were
> read from cache), I assume the difference is that Saxon is much better
> at building an index for the temporary tree -- Is MarkLogic not doing
> anything similar? If not, is there a technique one can use to force it
> to? Is there some other way one should approach a manipulation like this?
>
> I ask because this seemed like a typical sort of problem one might need
> to solve in XQuery (when the documents don't have quite as granular a
> view as one needs it seems reasonable to assume one should be able to
> build up the granular representation as part of the query).
>
> <result>{
>   let $authors :=
>     for $author in collection()/MedlineCitationSet/MedlineCitation/Article/AuthorList/Author
>     let $surname := data($author/LastName)
>     let $fname := data($author/FirstName)
>     let $key := string-join(($surname,$fname), "|")
>     where exists($author/LastName)
>     return
>       <author key="{$key}">{
>         <surname>{$surname}</surname>,
>         if ($fname ne '') then <fname>{$fname}</fname> else ()
>       }</author>
>   let $unique :=
>     for $key in distinct-values($authors/@key)
>     order by $key
>     return $key
>   return
>     for $key in $unique
>     return <author>{$authors[@key=$key][1]/*}</author> (: the dreadfully slow part ... :)
> }</result>
>
>
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> James A. Robinson jim.robinson at stanford.edu
> Stanford University HighWire Press http://highwire.stanford.edu/
> +1 650 7237294 (Work) +1 650 7259335 (Fax)