[MarkLogic Dev General] "Joins" in search:search or cts:search

Jason Hunter jhunter at marklogic.com
Fri Nov 18 10:50:19 PST 2011


That sounds slow.  Make sure it's not doing any heavy disk work.

If it's really just that slow for you, then instead of being fancy I'd just increase the timeout for that particular request.  You can have a request ask for more time, given the right permissions.  There's a default time limit and a max time limit.  Set the max time limit to something high, let the default be what you want other requests to have, then have your request ask for more.

http://developer.marklogic.com/pubs/5.0/apidocs/AppServerBuiltins.html#xdmp:set-request-time-limit
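
A minimal sketch of that approach, assuming the app server's max time limit has already been raised to at least 3600 seconds and the caller has the privilege needed to change its own time limit. The one-hour value is only illustrative, and "/logs/1.xml" is the example collection name from the question below:

```xquery
xquery version "1.0-ml";

(: Ask for up to one hour for this request only. The value cannot exceed
   the app server's max time limit; other requests keep the default. :)
xdmp:set-request-time-limit(3600),

(: The long-running work then runs under the extended limit. :)
xdmp:collection-delete("/logs/1.xml")
```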

-jh-

On Nov 18, 2011, at 10:40 AM, Lee, David wrote:

> Thanks.
> On my system (a "large" EC2 instance on EBS) I'm seeing about 5,000 document deletes/sec.
> When I'm trying to delete 5 million documents, that still exceeds the timeout, so I have to do some fancy dancing to get it to work.
>  
>  
>  
> ----------------------------------------
> David A. Lee
> Senior Principal Software Engineer
> Epocrates, Inc.
> dlee at epocrates.com
> 812-482-5224
>  
> From: general-bounces at developer.marklogic.com [mailto:general-bounces at developer.marklogic.com] On Behalf Of Aaron Redalen
> Sent: Friday, November 18, 2011 12:53 PM
> To: General MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] "Joins" in search:search or cts:search
>  
> Just as a side note, collection-delete and directory-delete are able to run in "fast" mode if the following conditions are met:
> * The database config parameter "directory-creation" must be set to manual
> * No triggers
> * No auditing of updates
> * There should be no lock fragments at the time of the call to collection-delete (those established via xdmp:lock-acquire)
> If any of these conditions are not met, collection-delete must retrieve and delete each document, which may take a while for large collections.
>  
> However, if these conditions are met, collection-delete doesn't need to retrieve any fragments. Instead, it simply sets the deleted timestamp on the documents matching the collection. In this mode, we can delete tens of thousands of documents per second.
>  
>  
> Aaron Redalen
> Director, Professional Services
> MarkLogic Corporation
> aaron.redalen at marklogic.com
> Phone: +1 650 655-2349
> Cell:  +1 240 688-7433
> www.marklogic.com
>  
>  
> From: "Lee, David" <dlee at epocrates.com>
> Reply-To: General MarkLogic Developer Discussion <general at developer.marklogic.com>
> Date: Thu, 17 Nov 2011 13:57:28 -0800
> To: General MarkLogic Developer Discussion <general at developer.marklogic.com>
> Subject: Re: [MarkLogic Dev General] "Joins" in search:search or cts:search
>  
> Thanks, no, I have not measured the speed of collection-delete() on a nonexistent collection.
> But I *have* timed collection-delete() on a collection which contains millions of documents, and it exceeds the request time limit ... so I have to split it up. I wanted to avoid starting 100,000 spawned threads of which only 5 actually did anything.
>  
>  
>  
> From: general-bounces at developer.marklogic.com [mailto:general-bounces at developer.marklogic.com] On Behalf Of Geert Josten
> Sent: Thursday, November 17, 2011 3:24 PM
> To: General MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] "Joins" in search:search or cts:search
>  
> Not index-fast, true, but you’re not retrieving data from the database, so that saves a lot of fuss.
>  
> Have you tried measuring the speed of a collection-delete of an empty collection? Interesting case. :-)
>  
> Kind regards,
> Geert
>  
> From: general-bounces at developer.marklogic.com [mailto:general-bounces at developer.marklogic.com] On Behalf Of Lee, David
> Sent: Thursday, November 17, 2011 21:17
> To: General MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] "Joins" in search:search or cts:search
>  
> Deletes in ML have not been "fast" by any metric in my experience.
>  
> I'll try just using the estimate ... maybe that's faster.
>  
>  
>  
> From: general-bounces at developer.marklogic.com [mailto:general-bounces at developer.marklogic.com] On Behalf Of Geert Josten
> Sent: Thursday, November 17, 2011 2:54 PM
> To: General MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] "Joins" in search:search or cts:search
>  
> Hi Lee,
>  
> Actually, the exists() in your code might be the slowest part. The collection() call is likely backed by an index, so it is quick. The exists() call works on a sequence, however. It could be that it is optimized under the hood to use xdmp:estimate in this case, but I'm not sure. You could try to rewrite that. But actually, I wouldn't test at all.
>  
> A collection-delete of an empty collection won’t take time I’d say. So wouldn’t worry about that too much.
>  
> What remains is the initial collection() call, which returns a sequence. If you are not collecting the results, MarkLogic doesn’t need to keep it in memory. It could very well be that it is streamed in the outer for loop. Otherwise, try chunking it in batches of 10k. Remember that deletes in ML are fast! It’s just a flag on each fragment.
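>
> A minimal sketch of the xdmp:estimate rewrite suggested above, assuming the summary document URIs double as collection names, as in the original question. xdmp:estimate resolves its count from the indexes without fetching fragments; whether it is actually faster than exists() here is untested:
>
> ```xquery
> xquery version "1.0-ml";
>
> (: Walk the summary URIs and delete only the collections that the
>    indexes report as non-empty. :)
> for $url in fn:collection("/summaries")/fn:document-uri(.)
> where xdmp:estimate(fn:collection($url)) gt 0
> return xdmp:collection-delete($url)
> ```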
>  
> Kind regards,
> Geert
>  
>  
> From: general-bounces at developer.marklogic.com [mailto:general-bounces at developer.marklogic.com] On Behalf Of Lee, David
> Sent: Thursday, November 17, 2011 20:41
> To: General Mark Logic Developer Discussion (general at developer.marklogic.com)
> Subject: [MarkLogic Dev General] "Joins" in search:search or cts:search
>  
> I suspect the answer is "no" ... but just picking the brains out there ..
>
> For good or bad, I use this archetype.
>
> I have many "summary" documents, say "/logs/1.xml", "/logs/2.xml", which belong to the collection "/summaries".
>
> There can be many (100k+).
>
> Each summary document lists references to external URLs (in this case Amazon S3) from which data could be loaded.
> If I load the data, I put each group into a collection named by the URL of the summary.
> So say I have 10,000 XML documents referenced by doc("/logs/1.xml"). If I choose to load them, they will end up in the collection
> "/logs/1.xml". The summaries themselves are in the collection, say, "/summaries".
>  
> The reason for this is the ability to easily bulk delete blocks of documents based on their summaries.
> I can list the summaries and, by a simple
>                 exists( collection( $url) )
>
> can tell if any actual log documents have been loaded.
>  
>  
> NOW: I want to be able to delete all records by summary, but only if the documents have been loaded.
> Suppose I had 100k summary URLs. I could do
>  
>                 (: iterate over the summary URIs, not the document nodes :)
>                 for $url in collection("/summaries")/document-uri(.)
>                 return
>                                 if (exists(collection($url))) then
>                                                 xdmp:collection-delete($url)
>                                 else ()
>  
>  
> This works and all ... but suppose I want something more efficient.
> Overall there may be only, say, 1% of the summary documents actually loaded. Furthermore, if there were LOTS of them loaded, the above would time out.
>
> So I spawn a thread to delete, say, [1 to 10] of every summary collection ...
> but say I have 100k collections; most of the threads do nothing.
> So I have to revert to the above to first check if the collection has anything before spawning a thread.
>  
> Question: Is there a cts:search option which can do a collection query based on the results of the search itself?
> that is (pseudo code)
> in one cts:search
>  
>     for $c in collection("x")/document-uri(.)
>     where exists(collection($c))
>     return $c
>  
> Doing this in FLWOR is very slow ...
> but it's what I'm resorting to ....
>  
>  
