[MarkLogic Dev General] Maximum size for a map?

Will Thompson wthompson at jonesmcclure.com
Tue May 7 15:14:30 PDT 2013


Not to beat a dead horse, but I thought I had come up with a way to get
around the expanded-tree cache error when populating a large map from disk
by using spawned batches: create a map, pass it as an external variable
to multiple spawn tasks (using the "result" option to force the caller to
wait until they're done), each of which reads from the database and adds
to the map before the map is finally inserted. My assumption was that the
expanded-tree cache limits were per transaction, and that by batching the
map loading into spawned tasks the cache wouldn't fill up as long as each
batch never exceeded the limit. However, that doesn't seem to be the case,
and this still gives an expanded-tree cache error. Did I misunderstand
something?
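Roughly what I'm doing, sketched out (the module path, batch count, and
variable names are made up for illustration):

```xquery
xquery version "1.0-ml";

(: Shared in-memory map, handed to each spawned task as an external
   variable. /load-batch.xqy is a hypothetical module that reads one
   slice of the source docs and map:put()s entries into $batch-map. :)
let $m := map:map()
let $options :=
  <options xmlns="xdmp:eval">
    <result>true</result>  (: block until each task completes :)
  </options>
let $_ :=
  for $batch in (1 to 10)
  return
    xdmp:spawn('/load-batch.xqy',
      (xs:QName('batch-map'), $m,
       xs:QName('batch-no'), $batch),
      $options)
return xdmp:set-server-field('my-map', $m)
```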

-Will

On 5/7/13 11:41 AM, "Geert Josten" <geert.josten at dayon.nl> wrote:

>Hi Will,
>
>Yes, if you revert to lexicons and range indexes, you can only use atomic
>values, but if you are actually wrapping a fixed number of atomic values
>in XML, then you can easily look them up one by one (using separate
>indexes). The benefit is that using plain docs to look values up fits
>more naturally into MarkLogic, saving you from fuss around initialization
>and updating.
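For instance (element names and the URI scheme here are invented), each
fixed set of atomic values could live in its own small doc, addressable by
key, so no giant map ever has to be built:

```xquery
xquery version "1.0-ml";

declare variable $key    external;
declare variable $value1 external;
declare variable $value2 external;

(: One small doc per key... :)
xdmp:document-insert(
  fn:concat('/lookup/', $key, '.xml'),
  <entry>
    <key>{$key}</key>
    <v1>{$value1}</v1>
    <v2>{$value2}</v2>
  </entry>)

(: ...and later, a lookup is a single targeted doc() call:
   doc(fn:concat('/lookup/', $key, '.xml'))/entry/v1 :)
```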
>
>Cheers,
>Geert
>
>> -----Original Message-----
>> From: general-bounces at developer.marklogic.com [mailto:general-
>> bounces at developer.marklogic.com] On Behalf Of Will Thompson
>> Sent: Tuesday, May 7, 2013 7:18 PM
>> To: MarkLogic Developer Discussion
>> Subject: Re: [MarkLogic Dev General] Maximum size for a map?
>>
>> Damon, that's a good idea. Only problem is that the value would actually
>> be multiple values in some XML, so maybe it could be stored as JSON as a
>> means of shoehorning that into a string.
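Something like this is what I mean, assuming xdmp:to-json/xdmp:from-json
round-trip a sequence through a JSON string the way I expect on this
release:

```xquery
xquery version "1.0-ml";

(: Flatten several values into a single string for storage... :)
let $s := xdmp:to-json(('a', 'b', 'c'))  (: roughly ["a","b","c"] :)
(: ...and recover the individual values at lookup time. :)
return xdmp:from-json($s)
```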
>>
>> Ultimately I think serializing a giant map to the database won't be
>> workable due to limitations of the in-memory list and expanded tree
>> cache, especially if we want to grow the map. We can test 1) breaking
>> up the map, and 2) performance of doing it with a range index instead.
>>
>> -Will
>>
>>
>> On 5/6/13 7:01 PM, "Damon Feldman" <Damon.Feldman at marklogic.com>
>> wrote:
>>
>> >Will,
>> >
>> >You may be able to use range indexes either by using cts:element-values
>> >with an element-value-query to "key" the lookup and have the value in
>> >the index, or by range-indexing a value that has the key and value
>> >separated by a token. This may not be quite as fast as a map lookup,
>> >but it can simplify your code.
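A sketch of both variants (the QNames are illustrative, and a string range
index on the value element is assumed):

```xquery
xquery version "1.0-ml";

declare variable $key external;

(: Variant 1: pull <value> strings from the range index, restricted
   to fragments whose <key> matches, so the indexes do the lookup. :)
cts:element-values(
  xs:QName('value'),
  (),    (: no starting value :)
  (),    (: default options :)
  cts:element-value-query(xs:QName('key'), $key))

(: Variant 2: range index on <kv> elements holding "key|value";
   match by wildcard, then strip the key off. :)
(: for $v in cts:element-value-match(xs:QName('kv'),
                                     fn:concat($key, '|*'))
   return fn:substring-after($v, '|') :)
```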
>> >
>> >If you describe the nature of the lookup we can brainstorm other ideas.
>> >
>> >Yours,
>> >Damon
>> >
>> >--
>> >Damon Feldman
>> >Sr. Principal Consultant, MarkLogic
>> >
>> >
>> >-----Original Message-----
>> >From: general-bounces at developer.marklogic.com
>> >[mailto:general-bounces at developer.marklogic.com] On Behalf Of Will
>> >Thompson
>> >Sent: Monday, May 06, 2013 9:01 PM
>> >To: MarkLogic Developer Discussion
>> >Subject: Re: [MarkLogic Dev General] Maximum size for a map?
>> >
>> >The map won't need to be updated frequently, so the idea is to
>> >serialize it to the database and filesystem for portability. Then on
>> >first use, it gets loaded into a server field. My tests are showing
>> >you're pretty spot on for the deserializing time. But after that it's
>> >loaded in the field and always available. My worry is about that
>> >initial doc() call on boxes that may have a smaller expanded-tree
>> >cache. In this case, is my only option to ensure each box has an
>> >expanded-tree cache large enough to hold the 400MB deserialized map,
>> >or face XDMP-EXPNTREECACHEFULL? I could try/catch, and throw a
>> >friendlier error for the small systems.
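The try/catch I have in mind would look something like this (the local
error name is made up):

```xquery
xquery version "1.0-ml";

declare namespace error = "http://marklogic.com/xdmp/error";

try {
  xdmp:set-server-field('my-map',
    map:map(doc('/path/to/map.xml')/map:map))
} catch ($e) {
  (: Translate the cache error into something friendlier;
     rethrow anything unexpected. :)
  if ($e/error:code eq 'XDMP-EXPNTREECACHEFULL')
  then fn:error(xs:QName('LOCAL-MAPTOOBIG'),
    'The expanded tree cache is too small to load the lookup map.')
  else xdmp:rethrow()
}
```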
>> >
>> >I chose map for speed, but if that's too much trouble then I suppose
>> >the key/value pairs could also be stored in a non-map document with a
>> >range index on the keys and fragment root set to its children. Then
>> >there would be no need for doc(), although I'm not sure how much speed
>> >that would give up.
>> >
>> >-Will
>> >
>> >
>> >On 5/6/13 4:48 PM, "Michael Blakeley" <mike at blakeley.com> wrote:
>> >
>> >>Yes, any doc() call will use space in the expanded-tree cache. So you
>> >>might end up with X in the cache, plus Y for the deserialized map.
>> >>
>> >>I would also worry about how long it might take to deserialize a
>> >>400-MB map, even if the XML is already in cache. My guess is around
>> >>30 sec to construct the map. If the cache is cold that might double
>> >>because the fragment has to be read from disk and decoded. But those
>> >>are just guesses.
>> >>
>> >>There are a couple of approaches that might avoid that cost. One is to
>> >>break up the map into multiple small documents. You could query a
>> >>special directory or collection for documents that have the key(s) you
>> >>need, and let the expanded-tree cache handle the memory management.
>> >>Each map would be relatively small, so deserialization wouldn't be as
>> >>expensive.
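One way to sketch that (the shard count and URI scheme are invented):
shard the map by key hash, so any one lookup deserializes only a small
map.

```xquery
xquery version "1.0-ml";

declare variable $key external;

(: Hypothetical layout: 64 shard docs at /maps/shard-N.xml, each a
   serialized map:map holding only the keys that hash to N. :)
let $shard := xdmp:hash64($key) mod 64
let $m := map:map(doc(fn:concat('/maps/shard-', $shard, '.xml'))/map:map)
return map:get($m, $key)
```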
>> >>
>> >>Another approach is to keep the map in a server field. That would be
>> >>both powerful and dangerous, because the memory for a server field is
>> >>persistent. We are used to working with query allocations, which
>> >>disappear when the query ends. So a single query is limited in its
>> >>scope for damage. But a 400-MB server field allocates 400-MB per eval
>> >>host, for the lifetime of the host process.
>> >>
>> >>So you'd want to be very careful to ensure that each host has exactly
>> >>one of these huge server fields. You'd also have to be very careful
>> >>about updating the map, partly because of the size and also because
>> >>server fields do not offer much in the way of memory protection.
>> >>Depending on your needs you might be able to do some sort of A-B
>> >>switching when you need to update the map, or develop a locking
>> >>strategy, or both.
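An A-B switch might look roughly like this (the field names are invented):
keep two map fields plus a pointer field naming the live one, so readers
never see a half-built map.

```xquery
xquery version "1.0-ml";

(: A reader follows the pointer to the live copy, defaulting to A: :)
let $live := (xdmp:get-server-field('map-live'), 'map-a')[1]
return xdmp:get-server-field($live)

(: An updater builds the idle copy first, then flips the pointer:
   let $idle := if ($live eq 'map-a') then 'map-b' else 'map-a'
   let $_ := xdmp:set-server-field($idle, $new-map)
   return xdmp:set-server-field('map-live', $idle) :)
```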
>> >>
>> >>-- Mike
>> >>
>> >>On 6 May 2013, at 16:29, Will Thompson <wthompson at jonesmcclure.com>
>> >>wrote:
>> >>
>> >>> Mike - I should have been a little more specific about the use case.
>> >>>What
>> >>> if that map is serialized to the db; would calling doc() on that
>> >>>potentially overload the expanded tree cache?
>> >>>
>> >>> let $m := map:map(doc('/path/to/map.xml')/map:map)
>> >>> return xdmp:set-server-field('my-map', $m)
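And the first-use load would be guarded, roughly:

```xquery
xquery version "1.0-ml";

(: Deserialize into the server field only if it isn't populated yet;
   xdmp:set-server-field returns the value it stores. :)
let $cached := xdmp:get-server-field('my-map')
return
  if (fn:exists($cached)) then $cached
  else
    xdmp:set-server-field('my-map',
      map:map(doc('/path/to/map.xml')/map:map))
```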
>> >>>
>> >>> Best guess on the QA server is that ML was installed when its VM
>> >>> was allocated fewer resources. But that's a good point about
>> >>> catching bad queries.
>> >>>
>> >>> -Will
>> >>>
>> >>>
>> >>> On 5/6/13 4:05 PM, "Michael Blakeley" <mike at blakeley.com> wrote:
>> >>>
>> >>>> No, maps don't use expanded tree cache space. A really large map
>> >>>> might hit some per-eval limits, but I didn't find them when I
>> >>>> created a map around 800-MiB on my laptop, with 6.0-3. I used
>> >>>> xdmp:quote to try to make sure the map would really allocate more
>> >>>> space for each entry. This was fine at 80-MiB and took about
>> >>>> 5 sec. For 800-MiB it took a little longer, and the OS swapped
>> >>>> some pages out. So I conclude that it was working hard to allocate
>> >>>> all the memory.
>> >>>>
>> >>>> let $m := map:map()
>> >>>> let $n := doc()[1]
>> >>>> let $_ := (1 to 1000000) ! map:put(
>> >>>>   $m, xdmp:integer-to-hex(xdmp:random()), xdmp:quote($n))
>> >>>> return (
>> >>>>   map:count($m) * string-length(xdmp:quote($n)) div (1024 * 1024),
>> >>>>   xdmp:elapsed-time())
>> >>>> =>
>> >>>> 802.04010009765625
>> >>>> PT1M6.429219S
>> >>>>
>> >>>> On that QA system, you might have set the expanded tree cache size
>> >>>> to a smaller value on purpose. That can be a good way to catch
>> >>>> poorly-optimized queries.
>> >>>>
>> >>>> -- Mike
>> >>>>
>> >>>> On 6 May 2013, at 14:44, Will Thompson
>> >>>> <wthompson at jonesmcclure.com> wrote:
>> >>>>
>> >>>>> Here's another one related to the Expanded Tree Cache: say I want
>> >>>>> to load a giant map, 400MB or more. Will this always be dependent
>> >>>>> on the size of the Expanded Tree Cache? Most of our dev machines
>> >>>>> have an Expanded Tree Cache big enough to handle a map like this,
>> >>>>> but some don't, and for some reason our QA server is set to an
>> >>>>> inexplicably small value. Is it advisable to just manually
>> >>>>> increase that value so everything fits? Are there any other
>> >>>>> general rules for adjusting server spec values? I have mostly
>> >>>>> heard "look don't touch" with regard to these settings.
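If raising it is the right move, my reading of the Admin API is that it
would go something like this (group name and new size are just examples;
please double-check before running):

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Bump the group's expanded tree cache size (in MB). Saving the
   configuration may restart the hosts in the group. :)
let $config := admin:get-configuration()
let $group  := admin:group-get-id($config, 'Default')
let $config :=
  admin:group-set-expanded-tree-cache-size($config, $group, 1024)
return admin:save-configuration($config)
```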
>> >>>>>
>> >>>>> -Will
>> >>>>>
>> >>>>> _______________________________________________
>> >>>>> General mailing list
>> >>>>> General at developer.marklogic.com
>> >>>>> http://developer.marklogic.com/mailman/listinfo/general
>> >>>>>