[MarkLogic Dev General] Fwd: [1.0-ml] XDMP-EXPNTREECACHEFULL

Geert Josten geert.josten at dayon.nl
Mon Mar 26 08:15:25 PDT 2012


Don’t think there is a lexicon for element names. Could make sense though.



I liked the map:map approach, so thought to give it a go. I have a database
with about 400k small docs. The following code ran in a few sec on the
first 1000 docs:



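(: collect each element name once, as a key in the map :)
let $elems := map:map()
let $process-docs :=
  for $d in collection("feeds")[1 to 1000]
  for $elem in $d//*/local-name()
  where empty(map:get($elems, $elem))
  return
    map:put($elems, $elem, 1)
(: return the unique names in sorted order, followed by the elapsed time :)
for $elem in map:keys($elems)
order by $elem
return $elem,

xdmp:elapsed-time()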



A subsequent run executes in a quarter of a second, thanks to caching.



Profile shows there are 54k elements in the first 1000 docs, of which just
25 unique.

First 2000 takes 7 sec, walks 100k elements, still 25 unique.

First 3000 still takes 7 sec, walks 160k elements, still just 25 unique.



(I have pretty much always the same doc structure, so that explains the
consistent 25 unique element names.)



First 10k takes 23.7 sec, 540k elems. (Still not maxing out!)

First 50k takes 2m40s, 2.6m elems, still works. (Seems to scale quite
linearly too!)

All: 410k takes about 19m, 64m expressions, 21m elems walked, still just 25
unique names. (PS: had to increase the app server max time limit, which was
limited to 10 min)



Note: these times include profiling overhead.



So, it is doable, but not something you’d like to run each time,
particularly if the database size increases. Combining the above with a
post-commit trigger to keep a server-field maintained list up to date makes
most sense to me.
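
As a rough illustration of that combination, a post-commit trigger module could look something like the sketch below. The server-field name "element-names" is made up for this example. The trigger itself would still have to be registered separately, and the field initialized once by a full scan such as the one above. Server fields live per host, so each host keeps its own copy, and names are only ever added here, never removed when documents are deleted.

```xquery
xquery version "1.0-ml";

import module namespace trgr = "http://marklogic.com/xdmp/triggers"
  at "/MarkLogic/triggers.xqy";

(: URI of the document that fired the trigger, supplied by MarkLogic :)
declare variable $trgr:uri as xs:string external;

(: "element-names" is a hypothetical field name chosen for this sketch :)
let $known := xdmp:get-server-field("element-names")
(: names in the committed document that are not in the list yet :)
let $new := distinct-values(doc($trgr:uri)//*/local-name())[not(. = $known)]
return
  if (exists($new))
  then xdmp:set-server-field("element-names", ($known, $new))
  else ()
```

Reading the list back is then a single call to xdmp:get-server-field("element-names"), with no document access at all.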



Kind regards,

Geert





*From:* general-bounces at developer.marklogic.com
[mailto:general-bounces at developer.marklogic.com] *On Behalf Of *Brent Hartwig
*Sent:* Monday, March 26, 2012 15:45
*To:* sai shanker; MarkLogic Developer Discussion
*Subject:* Re: [MarkLogic Dev General] Fwd: [1.0-ml] XDMP-EXPNTREECACHEFULL



Curious there aren’t functions like this tapping into the universal index.



-Brent



*From:* general-bounces at developer.marklogic.com
[mailto:general-bounces at developer.marklogic.com] *On Behalf Of *sai shanker
*Sent:* Monday, March 26, 2012 9:23 AM
*To:* MarkLogic Developer Discussion
*Subject:* Re: [MarkLogic Dev General] Fwd: [1.0-ml] XDMP-EXPNTREECACHEFULL





hi,

you can loop across each document, grab all the child nodes and put them in
a map.

Thanks and Regards,
Sai.





*From:* Ryan Dew <ryan.j.dew at gmail.com>
*To:* MarkLogic Developer Discussion <general at developer.marklogic.com>
*Sent:* Monday, March 26, 2012 9:14 AM
*Subject:* Re: [MarkLogic Dev General] Fwd: [1.0-ml] XDMP-EXPNTREECACHEFULL



You could try a recursive function like the following. No guarantee it is
100% right, if you have sub elements that have the same names as your root
elements.



xquery version "1.0-ml";

declare function local:find-unique-qnames($found-qnames as xs:QName*) {
  let $next-qname := cts:search(collection()/*,
    if (exists($found-qnames))
    then cts:not-query(cts:element-query($found-qnames,cts:and-query(())))
    else cts:and-query(())
  )[1]/node-name(.)
  return if (exists($next-qname))
          then local:find-unique-qnames(($found-qnames,$next-qname))
          else $found-qnames
};

declare function local:find-unique-qnames() {
  for $qn in local:find-unique-qnames(())
  order by string($qn)
  return $qn
};

local:find-unique-qnames()



On Mon, Mar 26, 2012 at 6:36 AM, Geert Josten <geert.josten at dayon.nl> wrote:

Hi Vishnu,



It would help if you could explain why you need that list. But in general
the best option would be to pre-calculate the list I guess. You can save it
as a server-field (xdmp:set-server-field), to keep the list in memory on
each host. But you would need an algorithm to initialize it, and each doc
commit would have to check and update that list. The latter can be done
with a post-commit trigger. The first can be done best by the strategy I
already mentioned: divide all docs in chunks of 100 to 1000 docs, calculate
distinct names of each chunk, and merge that somehow to the final list.
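
The chunking strategy described above can be sketched roughly as follows. The chunk size of 1000 is an assumption taken from the upper end of the suggested range, and the range predicate on collection() relies on the same MarkLogic 1.0-ml behaviour as the `[1 to 1000]` predicate used elsewhere in this thread. Whether this stays within the expanded tree cache on a given database is not guaranteed; treat it as a starting point.

```xquery
xquery version "1.0-ml";

(: chunk size is an assumption; the advice above suggests 100 to 1000 docs :)
let $chunk-size := 1000
(: index-based document count, no documents are loaded for this :)
let $total := xdmp:estimate(collection())
let $names-per-chunk :=
  for $i in 0 to (($total - 1) idiv $chunk-size)
  let $first := $i * $chunk-size + 1
  let $last := $first + $chunk-size - 1
  (: reduce each chunk to its distinct names before moving on :)
  return distinct-values(collection()[$first to $last]//*/local-name())
(: merge the small per-chunk lists into the final list :)
for $name in distinct-values($names-per-chunk)
order by $name
return $name
```

The point of the design is that distinct-values only ever sees the names of one chunk, or the already reduced per-chunk lists, rather than every element name in the database at once.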



You could also raise the tree size setting temporarily to do that initial
calculation.



Kind regards,

Geert



*From:* general-bounces at developer.marklogic.com
[mailto:general-bounces at developer.marklogic.com] *On Behalf Of *VISH RAJPUT
*Sent:* Monday, March 26, 2012 14:29
*To:* MarkLogic Developer Discussion
*Subject:* Re: [MarkLogic Dev General] Fwd: [1.0-ml] XDMP-EXPNTREECACHEFULL



Thanks Geert,



Is there any alternate solution to find the unique elements within a
database?



Warm Regards,

Vishnu





On Mon, Mar 26, 2012 at 5:55 PM, Geert Josten <geert.josten at dayon.nl> wrote:

Hi Vishnu,



90 MB isn’t much indeed, but MarkLogic is configured to keep a low memory
footprint, even if there are 30 concurrent requests. To ensure that, the
tree size limit (look at the database setting in the admin interface) is
usually pretty low. I have 8 GB and still it is set to no more than 85 MB by
default. But you can increase it if you like.



A more streaming approach, which my advice attempts to achieve to some
extent, helps keep the footprint low and keeps MarkLogic fast.



Kind regards,

Geert



*From:* general-bounces at developer.marklogic.com
[mailto:general-bounces at developer.marklogic.com] *On Behalf Of *VISH RAJPUT
*Sent:* Monday, March 26, 2012 14:17
*To:* MarkLogic Developer Discussion
*Subject:* Re: [MarkLogic Dev General] Fwd: [1.0-ml] XDMP-EXPNTREECACHEFULL



Thanks Geert,



But it still shows *XDMP-EXPNTREECACHEFULL:
distinct-values(collection("ContentAnalysis")//*/local-name()) --
Expanded tree cache full on host.... *The overall size of the database is
only 90 MB; I don't think that is a huge amount of data for MarkLogic.





Regards,

Vishnu



On Mon, Mar 26, 2012 at 1:25 PM, Geert Josten <geert.josten at dayon.nl> wrote:

Hi Vishnu,



Your FLWOR expression won’t return distinct names, since you are applying
the function to each individual name. You should write:



distinct-values(
    for $a in //*
    return local-name($a)
)



Or better:



distinct-values(collection()//*/local-name())



But this still might not perform well, or still max out on list or tree
caches. This approach is creating a complete list of all element names
first, and starts applying distinct-values only thereafter. You might
consider taking multiple steps, like per doc first, and then clustering per
100 files, and only then all clusters. You could also just take 100 random
samples, and use that. That doesn’t guarantee a 100% complete list, but it
remains performant even if your database grows 10 or 100 fold.
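
A minimal sketch of the random-sampling idea, assuming a non-empty database and a sample size of 100 as mentioned above. Documents are picked by random position, so the same document may be drawn more than once, and the resulting list can miss names that occur only in rarely sampled documents.

```xquery
xquery version "1.0-ml";

let $sample-size := 100
(: index-based document count; assumed to be greater than zero :)
let $total := xdmp:estimate(collection())
let $names :=
  for $i in 1 to $sample-size
  (: bind the random position first so it is evaluated once per sample :)
  let $pos := xdmp:random($total - 1) + 1
  return collection()[$pos]//*/local-name()
for $name in distinct-values($names)
order by $name
return $name
```

The cost depends on the sample size rather than on the number of documents, which is why this approach keeps working as the database grows.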



Kind regards,

Geert



*From:* general-bounces at developer.marklogic.com
[mailto:general-bounces at developer.marklogic.com] *On Behalf Of *VISH RAJPUT
*Sent:* Monday, March 26, 2012 8:29
*To:* general at developer.marklogic.com
*Subject:* [MarkLogic Dev General] Fwd: [1.0-ml] XDMP-EXPNTREECACHEFULL



The total size of all the files is approximately 90 MB.

---------- Forwarded message ----------
From: *VISH RAJPUT* <svishnu.singh4 at gmail.com>
Date: Mon, Mar 26, 2012 at 11:56 AM
Subject: [1.0-ml] XDMP-EXPNTREECACHEFULL
To: general at developer.marklogic.com


Hi,



I have 2000 files in a MarkLogic database within a single forest, and I want
to find the unique element names across all 2000 files in this database.
For this I wrote the query below:



for $a in //*
return distinct-values($a/local-name())



but with this I got the error "*[1.0-ml] XDMP-EXPNTREECACHEFULL" *. What should
I do?





Regards,

Vishnu Singh




_______________________________________________
General mailing list
General at developer.marklogic.com
http://developer.marklogic.com/mailman/listinfo/general



