[MarkLogic Dev General] derferencing documents with document-uri and base-uri?

Rachel Wilson Rachel.Wilson at bbc.co.uk
Tue Oct 22 07:06:24 PDT 2013


Very interesting and helpful thanks :)

Um, I hate to keep asking, but what is "dereferencing" then?  Or maybe it's a colloquial term you don't actually use.

Also, if we are iterating through nodes returned from a FLWOR expression, and these aren't document nodes themselves, would the whole document have been put in the expanded tree cache in order to resolve the query?  I am asking because this is our usual usage and I'm not sure when the document would be put into the cache in this case.  (The expanded tree cache is not very extensively documented in the server internals docs, you see.)


From: David Lee <David.Lee at marklogic.com>
Reply-To: MarkLogic Developer Discussion <general at developer.marklogic.com>
Date: Tuesday, 22 October 2013 14:46
To: MarkLogic Developer Discussion <general at developer.marklogic.com>
Subject: Re: [MarkLogic Dev General] derferencing documents with document-uri and base-uri?

The document URI is normally stored in the same disk block as the document data and properties, so getting the URI does
require loading the document into memory, provided you are referencing it through a document node.

If the document is pulled into memory for the sole purpose of getting its URI, it can be slow.
To test this I used a database with 1.6 million tweets.
Even after running them once, these calls are slow on my system:

count( doc()/fn:base-uri() )        1 min 25 sec
count( doc()/fn:document-uri() )    1 min 26 sec
count( doc()/xdmp:node-uri(.) )     1 min 22 sec


But if all you want are URIs, consider the URI lexicon.  This lexicon is stored separately from the documents, with all the URIs together,
so iterating through all the URIs is much faster.
Even without using the advanced filtering functions this can be fast:

count( cts:uris() )                      0.36 seconds

If you are dealing with billions of docs instead of a million, then you should definitely use the advanced options for this call
to retrieve only the URIs that you want.
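
A minimal sketch of narrowing the lexicon call, assuming the URI lexicon is enabled on the database and that the documents of interest live under a hypothetical "/tweets/" directory (that directory name is an assumption, not something from this thread):

```xquery
xquery version "1.0-ml";

(: Assumption: the URI lexicon is enabled on this database. :)
(: Assumption: "/tweets/" is a hypothetical directory; substitute your own. :)
(: The third argument restricts the lexicon lookup to URIs of documents
   matching the query, so only the wanted URIs are counted. :)
count(
  cts:uris((), (), cts:directory-query("/tweets/", "infinity"))
)
```

The first two arguments (start value and options) are left empty here; any cts:query can be passed as the third argument.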

If the document is already in memory, fetching its URI is fast (and I don't know of any way to do it other than one of the xxx-uri() functions above).

-----------------------------------------------------------------------------
David Lee
Lead Engineer
MarkLogic Corporation
dlee at marklogic.com
Phone: +1 812-482-5224
Cell:  +1 812-630-7622
www.marklogic.com


From: general-bounces at developer.marklogic.com On Behalf Of anoop raj p
Sent: Tuesday, October 22, 2013 6:15 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] derferencing documents with document-uri and base-uri?

Please remove me from email list.

On Tue, Oct 22, 2013 at 3:44 PM, Rachel Wilson <Rachel.Wilson at bbc.co.uk> wrote:
I didn't think it was a problem as such. I wasn't trying to prematurely
optimise, I promise, but I was curious about the workings under the hood
since we use these functions a lot, including in our slower-running queries -
investigating those is how this question came up.  Think of this as
settling a bet ;)

So, I'm still curious - what is dereferencing? Is that indeed what
happens?

Say we have a database node returned from a query, which isn't the
document node, and we call base-uri on it. Would the whole document itself
necessarily have been put in the expanded tree cache in order to resolve
the query?  I'm still learning about the roles of the different caches and
it's turning out to be very helpful to know.

PS.  We don't have subfragments


-----Original Message-----
From: Michael Blakeley <mike at blakeley.com>
Reply-To: MarkLogic Developer Discussion <general at developer.marklogic.com>
Date: Monday, 21 October 2013 18:39
To: MarkLogic Developer Discussion <general at developer.marklogic.com>
Subject: Re: [MarkLogic Dev General] derferencing documents with document-uri and base-uri?

I wouldn't worry about it unless it's clearly a problem: avoid premature
optimization. If you have a database node in memory, then it's in the
expanded tree cache. So repeated accessor calls for its URI can drive
cache lookups and CPU cycles, but should never result in cache misses.
Check the xdmp:query-meters output to see this for yourself: you should be
able to correlate the number of URI accesses to the
expanded-tree-cache-hit count.
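
A minimal sketch of that check, assuming "/example/doc.xml" is a hypothetical URI of a document that exists in your database:

```xquery
xquery version "1.0-ml";

(: Assumption: "/example/doc.xml" is a hypothetical URI; substitute a real one. :)
let $nodes := fn:doc("/example/doc.xml")//*
(: One URI accessor call per element node of the document. :)
let $uris := for $n in $nodes return xdmp:node-uri($n)
return (
  (: Number of URI accesses performed above. :)
  count($uris),
  (: Meters for the current request so far, including the cache hit and
     miss counters to compare against the access count. :)
  xdmp:query-meters()
)
```

Running it with and without the accessor loop should make it easier to see how much of the cache activity the URI calls account for.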

Things might get a little more expensive if you have subfragments, because
crossing fragment boundaries can be expensive. A call to base-uri inside a
subfragment might have to traverse to the parent fragment - or maybe not,
I'd have to design a test to say for certain. But the time to worry is
when you have a performance problem, and your test case shows the URI
accessor in the profiler output. Then you could think about ways to
minimize URI lookups.

Switching to functionality, I almost always use xdmp:node-uri rather than
document-uri or base-uri. I avoid document-uri simply because I don't want
to worry about traversing to root for document-uri, and base-uri because I
don't want the behavior where an ancestor element specifies its own
base-uri value. That's rare in most XML, but base-uri checks for it and
honors it. Checking for that probably slows things down a bit, and
honoring it generally doesn't do what I want. So I always use
xdmp:node-uri instead.
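
A minimal sketch of how the two accessors can be compared, assuming "/example/doc.xml" is a hypothetical database document in which some element carries an xml:base attribute:

```xquery
xquery version "1.0-ml";

(: Assumption: "/example/doc.xml" is a hypothetical document containing
   an element with an xml:base attribute. :)
(: Pick the first child of an element that declares its own xml:base. :)
let $node := (fn:doc("/example/doc.xml")//*[@xml:base]/*)[1]
return (
  (: URI of the document that contains the node. :)
  xdmp:node-uri($node),
  (: Base URI of the node, which takes the ancestor's xml:base into account. :)
  fn:base-uri($node)
)
```

If no element in the document has an xml:base attribute, $node is empty and both calls return the empty sequence.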

-- Mike

On 21 Oct 2013, at 09:54, Rachel Wilson <Rachel.Wilson at bbc.co.uk> wrote:

>
> I have heard on the grapevine that using the document-uri() or base-uri()
>functions is bad for performance, although I can't seem to find anything
>about that in MarkLogic's docs or elsewhere on the internet.  One of the
>reasons given was that those functions "dereference the document",
>or that MarkLogic Server has to go to disk to resolve the URI, although
>I'm not sure what is really meant by "dereference".
>
> Could someone clear this up?  Has the grapevine got the wrong end of the
>stick, or is it perhaps how the functions are used, perhaps in loops, that
>is the reason behind this thinking?  We use those two functions so much,
>particularly base-uri(), in our code that we would consider some rewrites
>if it really is something to minimise.
>
> Many thanks,
> Rachel
>
>
>
> ----------------------------
>
> http://www.bbc.co.uk
> This e-mail (and any attachments) is confidential and may contain
>personal views which are not the views of the BBC unless specifically
>stated.
> If you have received it in error, please delete it from your system.
> Do not use, copy or disclose the information in any way nor act in
>reliance on it and notify the sender immediately.
> Please note that the BBC monitors e-mails sent or received.
> Further communication will signify your consent to this.
>
> ---------------------
>
> _______________________________________________
> General mailing list
> General at developer.marklogic.com
> http://developer.marklogic.com/mailman/listinfo/general




--
anoop raj p


