[MarkLogic Dev General] Re: Unfiltered ok, but what of fragment loading (Jason Hunter)
Jason Hunter
jhunter at marklogic.com
Thu Mar 18 16:01:20 PST 2010
It's not so much the cache you need to worry about, it's fragment reads off disk. To fetch data from the ten items starting at item #1,000,000 you really don't want to have to read the previous million fragments off disk. That's a lot of random seeks. That's where unfiltered helps you; it lets you jump ahead and read just the 10 that matter.
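To make the read counts concrete, here is a rough toy model in Python. It is only a sketch: ToyForest, page_unfiltered and page_filtered are made-up names for illustration, not MarkLogic APIs, and the "forest" is just a dict that counts how often a fragment is read. It assumes the index has already produced an ordered list of candidate fragment ids.

```python
from dataclasses import dataclass


@dataclass
class ToyForest:
    """Hypothetical stand-in for on-disk fragments; counts every read."""
    fragments: dict  # fragment id -> fragment text
    reads: int = 0

    def read(self, frag_id):
        self.reads += 1
        return self.fragments[frag_id]


def page_unfiltered(forest, candidates, start, end):
    # Trust the index: slice the candidate list, then read only that page.
    return [forest.read(f) for f in candidates[start - 1:end]]


def page_filtered(forest, candidates, start, end, matches):
    # Verify each candidate against the fragment itself, so every candidate
    # up to the end of the requested page has to be read.
    page, position = [], 0
    for frag_id in candidates:
        text = forest.read(frag_id)
        if not matches(text):
            continue
        position += 1
        if position >= start:
            page.append(text)
        if position == end:
            break
    return page


forest = ToyForest({i: f"doc {i}" for i in range(1, 1001)})
candidates = list(range(1, 1001))

deep_page = page_unfiltered(forest, candidates, 991, 1000)
unfiltered_reads = forest.reads

forest.reads = 0
same_page = page_filtered(forest, candidates, 991, 1000, lambda text: True)
filtered_reads = forest.reads
```

In this model the unfiltered page costs one read per item on the page, while the filtered page costs one read per candidate up to the end of the page, which is the "previous million fragments" problem at a smaller scale.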
The range index, on the other hand, is how a site like MarkMail.org can give you statistics about your result set (the authors, etc.) without having to read the result set off disk.
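As a loose illustration of that idea, the sketch below treats a range index as an in-memory mapping from fragment id to author value. This is a toy model in Python, not how MarkLogic actually stores or queries range indexes; author_index and author_counts are hypothetical names, and the sample data is invented.

```python
from collections import Counter

# Hypothetical in-memory range index: fragment id -> author value.
author_index = {1: "alice", 2: "bob", 3: "alice", 4: "carol", 5: "alice"}


def author_counts(index, result_ids):
    """Count authors for a result set using only the index, with no fragment reads."""
    return Counter(index[i] for i in result_ids if i in index)


# Result set as a list of fragment ids, as an index-only search might return it.
counts = author_counts(author_index, [1, 2, 3, 5])
```

The point is that the per-author counts come entirely from the index structure, so the size of the result set affects memory lookups rather than disk seeks.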
-jh-
On Mar 18, 2010, at 12:16 PM, Paul M wrote:
> So unfiltered lets one go deep in paging...
> 1,000,001 to 1,000,010
> filtered may max out the caches earlier
> 200,001 to 200,010 max next page out of cache.
>
> Memory and fragmentation are still the main factors affecting total records
> [990,000 to 1,000,010] // authors
> because if the fragments are small, KB vs MB, more can be loaded...
> P.S. expanded cache is the one that will be used, this is the one that is filled from disk, correct?
>
> And range indexes can be used to avoid disk access altogether (for small bits of information)
>
>
> --- On Thu, 3/18/10, general-request at developer.marklogic.com <general-request at developer.marklogic.com> wrote:
>
> From: general-request at developer.marklogic.com <general-request at developer.marklogic.com>
> Subject: General Digest, Vol 69, Issue 66
> To: general at developer.marklogic.com
> Date: Thursday, March 18, 2010, 11:46 AM
>
> Today's Topics:
>
> 1. RE: Unfiltered ok, but what of fragment loading (Kelly Stirman)
> 2. Re: "Hot Swapping" large data sets. (Jason Hunter)
> 3. Re: Unfiltered ok, but what of fragment loading (Jason Hunter)
> 4. MLSQL - JDOM version? (Wyatt VanderStucken)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 18 Mar 2010 10:33:34 -0700
> From: Kelly Stirman <Kelly.Stirman at marklogic.com>
> Subject: [MarkLogic Dev General] RE: Unfiltered ok, but what of
> fragment loading
> To: "general at developer.marklogic.com"
> <general at developer.marklogic.com>
> Message-ID:
> <D20C296D14127D4EBD176AD949D8A75A44BBF3C3 at EXCHG-BE.marklogic.com>
> Content-Type: text/plain; charset="us-ascii"
>
> If you want to get only the authors and their values, you should take a look at cts:element-values() or cts:element-attribute-values(). This will require creating a range index on the node where your authors are stored, but it will eliminate the need to pull all documents into memory.
>
> You can also use cts:frequency() to determine how frequently the author is mentioned across all 300 documents.
>
> Kelly
>
> Message: 2
> Date: Thu, 18 Mar 2010 07:07:16 -0700 (PDT)
> From: Paul M <pjmaip at yahoo.com>
> Subject: [MarkLogic Dev General] Unfiltered ok, but what of fragment
> loading
> To: general at developer.marklogic.com
> Message-ID: <548898.58699.qm at web44805.mail.sp1.yahoo.com>
> Content-Type: text/plain; charset="us-ascii"
>
> Say I perform an unfiltered search that resolves to 300 fragments. Since it was unfiltered, no fragments needed to be loaded into memory for the *search*; only the indexes were used. Now let's say I want the authors from all these fragments/docs (fragment=doc since there is no fragmentation policy). Does the data still need to be loaded into memory for all 300 docs even if I only need a small piece? i.e. will the expanded/compressed caches (not certain?) need to be filled with 300 docs?
> i.e. Even if a search can be performed without pagination, this does not save one from blowing out the caches when the data is retrieved from the docs? Pagination may still be required?
>
> Any information is appreciated...
>
>
> ------------------------------
>
> Message: 2
> Date: Thu, 18 Mar 2010 11:07:07 -0700
> From: Jason Hunter <jhunter at marklogic.com>
> Subject: Re: [MarkLogic Dev General] "Hot Swapping" large data sets.
> To: General Mark Logic Developer Discussion
> <general at developer.marklogic.com>
> Message-ID: <679B7A7F-DE1C-4A4C-9CF1-3BE853C006CD at marklogic.com>
> Content-Type: text/plain; charset="windows-1252"
>
> For a single batch load, I like that, but if you do repeated loads you'll have to be creating new roles for every batch to distinguish the new content from the old. It seems mentally cheaper/lighter to me to use collections. My 2c.
>
> -jh-
>
> On Mar 18, 2010, at 9:47 AM, Danny Sokolsky wrote:
>
> > The URI privilege does not control access to the document, it specifies whether you can create a document in that URI space.
> >
> > You can do what Keith suggests by putting a read permission on each document that is associated with a role. Then, when you are ready, grant that role to a role your users already have. To do this, you would have to add several permissions during the load. For example, you might add a read and update permission for a “loader” role, and also add a read permission for a “content-user” role. Then, after you are satisfied that your content is the way you want it, you can give the “content-user” role to the user of your application.
> >
> > -Danny
> >
> > From: general-bounces at developer.marklogic.com [mailto:general-bounces at developer.marklogic.com] On Behalf Of Keith L. Breinholt
> > Sent: Thursday, March 18, 2010 9:34 AM
> > To: General Mark Logic Developer Discussion
> > Subject: RE: [MarkLogic Dev General] "Hot Swapping" large data sets.
> >
> > Another way to allow you to load and update sets and then only make them visible when you are done is to load the content with a unique URI privilege that is assigned to your loader/enricher program.
> >
> > Then when you are done and the content is ready you can add that privilege to the role of any users/applications that need to see it. That way only completed content is visible and it appears ‘instantaneously’ when the privilege is added to the role.
> >
> > Keith L. Breinholt
> > breinholtkl at ldschurch.org
> >
> > From: general-bounces at developer.marklogic.com [mailto:general-bounces at developer.marklogic.com] On Behalf Of Jason Hunter
> > Sent: Thursday, March 18, 2010 12:10 AM
> > To: General Mark Logic Developer Discussion
> > Subject: Re: [MarkLogic Dev General] "Hot Swapping" large data sets.
> >
> > On Mar 17, 2010, at 5:23 AM, Lee, David wrote:
> >
> >
> > I need to be updating some largish (1G+) sets of documents fairly atomically.
> > That is, I'd like to update all the documents and perform some operations like adding properties etc,
> > then all at once make the updates visible. The update process could take several hours.
> > Currently this document set shares the same forest as other document sets.
> > It's not possible to split these up because the app needs to cross-query across all the document sets.
> >
> > Any suggestions on how to accomplish this ?
> >
> > What happens if you try loading everything as part of a single XCC call passing the large array of files?
> >
> > If you want to follow Wayne's advice on using collections, I suppose you'd want to put each batch of docs in a uniquely named collection. Then you can run your queries against fn:collection($seq), where $seq is the sequence of collections that have been loaded so far. Or, perhaps more simply, you can do a cts:not-query() against the cts:collection-query("latest") and thus exclude the most recent batch but allow all other docs that were loaded before. It basically keeps the new collection in the dark. Handy, efficient, and if each batch gets its own ID then you can easily exclude any batch.
> >
> > Point-in-time would do something similar, and is suitable if you're always doing just one bulk load at a time. Then you can use the point in time to control the visibility.
> >
> > -jh-
> >
> >
> >
> > _______________________________________________
> > General mailing list
> > General at developer.marklogic.com
> > http://xqzone.com/mailman/listinfo/general
>
> ------------------------------
>
> Message: 3
> Date: Thu, 18 Mar 2010 11:14:05 -0700
> From: Jason Hunter <jhunter at marklogic.com>
> Subject: Re: [MarkLogic Dev General] Unfiltered ok, but what of
> fragment loading
> To: General Mark Logic Developer Discussion
> <general at developer.marklogic.com>
> Message-ID: <2B55365F-6017-4290-8304-59F9A5199750 at marklogic.com>
> Content-Type: text/plain; charset="us-ascii"
>
> >
> > i.e. Even if a search can be performed without pagination, this does not save one from blowing out the caches when the data is retrieved from the docs? Pagination may still be required?
>
> Others have answered how you can use range indexes to pull the data from documents without fetching the documents, but in answer to this specific question, the perk of an unfiltered search is that you can jump ahead arbitrarily deep -- so you can get the authors of documents 1,000,001 to 1,000,010 even without range indexes, using only 10 fragment reads. So you won't blow out any caches.
>
> -jh-
>
> ------------------------------
>
> Message: 4
> Date: Thu, 18 Mar 2010 14:46:43 -0400
> From: Wyatt VanderStucken <marklogic at wylovan.com>
> Subject: [MarkLogic Dev General] MLSQL - JDOM version?
> To: General Mark Logic Developer Discussion
> <general at developer.marklogic.com>
> Message-ID: <4BA27513.80809 at wylovan.com>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> Greetings all (particularly -jh-),
>
> I've been experimenting w/ the latest MLSQL, and had a question
> regarding the jdom.jar file which is included w/ the MLSQL
> distribution. The MANIFEST.MF inside the .jar indicates that it is JDOM
> Implementation-Version: 1.0.1, but I don't see that version listed on
> the JDOM site (http://www.jdom.org/news/index.html) - it looks like it
> was built 9/14/2005...
>
> Where it gets tricky is that I'm trying to add the MLSQL servlet to an
> existing Java webapp where JDOM is already in use
> (Implementation-Version: 1.0beta10)...
>
> When I use the 1.0beta10 version I get the following error:
> java.lang.NoSuchMethodError:
> org.jdom.Element.addContent(Lorg/jdom/Content;)Lorg/jdom/Element;
>
> The version bundled with MLSQL remedies the problem (as does JDOM
> version 1.0), but I'm concerned that deploying a newer version will
> break something. Initial tests are good, but this is a large
> application with 30+ developers, so I'm not sure of all the code that is
> dependent on JDOM...
>
> Can you say with any degree of certainty that code written against JDOM
> 1.0beta10 will be compatible with JDOM version 1.0 or 1.0.1? If forced
> to, will MLSQL work with JDOM version 1.0?
>
> Thanks in advance,
> Wyatt
>
>
>
> ------------------------------
>
>
> End of General Digest, Vol 69, Issue 66
> ***************************************
>