[MarkLogic Dev General] Loading large sets of small files
Justin.Makeig at marklogic.com
Mon Sep 19 11:08:35 PDT 2011
Are the files stored individually in S3? Did you try accessing them one-by-one from S3 with MarkLogic’s HTTP client? I’d be curious about the performance characteristics of that relative to the two techniques you shared. For example, list the contents of a bucket, chunk the list into individual segments, and loop over each segment in parallel. This is the technique that Information Studio uses to walk a filesystem. I’d be curious whether this technique could apply efficiently to something like S3 as well. Thanks for the info.
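The list-chunk-parallelize pattern described above is easy to sketch outside of MarkLogic. Here is a minimal, hedged Python analogue; the key names and the `load_segment` worker are invented stand-ins for a real S3 listing and a real fetch-and-insert call:

```python
from concurrent.futures import ThreadPoolExecutor

def chunk(items, size):
    """Split a list into consecutive segments of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def load_segment(keys):
    # Hypothetical worker: a real implementation would fetch each object
    # from S3 and insert it into the database. Here we just count.
    return len(keys)

keys = [f"doc-{n}.xml" for n in range(10)]   # stand-in for a bucket listing
segments = chunk(keys, 4)                    # segments of 4, 4, and 2 keys

with ThreadPoolExecutor(max_workers=4) as pool:
    loaded = sum(pool.map(load_segment, segments))
```

The point of the chunking is to bound the work per task so segments can be retried or parallelized independently.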
Senior Product Manager
justin.makeig at marklogic.com
On Sep 19, 2011, at 10:47 AM, Lee, David wrote:
I wanted to share my experience trying several techniques of loading large sets of small files to MarkLogic.
My use case is loading many (100k+) very small XML files into ML.
Each file is typically 50-200 bytes. To make the job easier I've been batching them into chunks of up to 2000 files so I can incrementally load a batch of files on demand.
(I'd like to load ALL of them but for various reasons beyond this discussion I'm only loading a single 'batch' at a time).
I have the files stored in Amazon S3. To make life easier (and ideally more efficient) I experimented with several techniques. Ultimately I ended up with two techniques that are very similar but have amazingly different performance. Due to the architecture of the app I want to be able to 'pull' these files from an ML app directly.
When a request comes in to load the files, the ML app fetches them from Amazon (via a URL), unpacks them and does a document-insert.
1) Zip of many xml files
Zip the xml files (up to 2000) into a single zip file.
In ML, unzip the file, iterate over the manifest, extract the entries one by one, and load them.
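Technique #1 can be sketched with any zip library. A minimal Python analogue follows (file names and contents are invented; the extraction loop mirrors iterating the zip manifest in the ML app, not the actual XQuery code):

```python
import io
import zipfile

# Packaging side: many small XML files, one zip member each.
docs = {f"doc-{n}.xml": f"<r><id>{n}</id></r>" for n in range(5)}
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
    for name, body in docs.items():
        z.writestr(name, body)

# Consumer side: walk the manifest and extract each entry individually,
# paying the per-member extraction cost once per file.
out = {}
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as z:
    for name in z.namelist():
        out[name] = z.read(name).decode()
```

Each `z.read(name)` is a separate member lookup and decompression, which is where the per-file overhead accumulates.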
2) Zip of a single wrapped XML document.
Wrap the xml files into a single big XML document with a root element.
Zip that xml file.
In ML, unzip the file, then iterate over the children of the root and insert each child as a separate document.
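Technique #2 looks like this as a hedged Python sketch (the root element name and sample records are invented; in the real app the unwrapping and inserts happen in XQuery):

```python
import io
import zipfile
import xml.etree.ElementTree as ET

docs = [f"<r><id>{n}</id></r>" for n in range(5)]

# Packaging side: wrap all documents under one root element,
# then zip that single file.
wrapped = "<batch>" + "".join(docs) + "</batch>"
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("batch.xml", wrapped)

# Consumer side: one extraction, one parse, then iterate the root's
# children and treat each child as a separate document to insert.
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as z:
    root = ET.fromstring(z.read("batch.xml"))
children = [ET.tostring(c, encoding="unicode") for c in root]
```

The zip is touched exactly once, so the per-member extraction overhead of technique #1 disappears; the extra cost is one XML parse over the combined document.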
All this is running on Amazon so the network speed between ML and S3 is quite fast.
I first started with #1 and it worked ... but would take quite a while.
Fetching a 2000-'record' zip and extracting it would take up to a minute.
Some performance analysis gave me a clue to try #2.
For starters I was amazed to find that the zip of 2000 small documents didn't compress much.
On reflection it makes sense: zip compresses each document individually, and the documents are so small that compression doesn't work well. I'd rather use a tar/gz format, but ML doesn't have native methods for that (only zip).
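The compression gap is easy to reproduce with any zip library, since the zip format deflates each member independently and adds per-member headers. A minimal Python demonstration (the sample record and member names are invented):

```python
import io
import zipfile

doc = "<record><id>1234</id><value>abcd</value></record>"  # ~50-byte sample
n = 2000

# One member per tiny document: each carries its own header and is
# deflated on its own, so redundancy across documents is never exploited.
many = io.BytesIO()
with zipfile.ZipFile(many, "w", zipfile.ZIP_DEFLATED) as z:
    for i in range(n):
        z.writestr(f"doc-{i}.xml", doc)

# Same content concatenated into a single member: deflate now sees all
# the repetition and compresses dramatically better.
one = io.BytesIO()
with zipfile.ZipFile(one, "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("all.xml", "<root>" + doc * n + "</root>")

size_many = len(many.getvalue())
size_one = len(one.getvalue())
```

Running this shows the single-member archive is a small fraction of the size of the 2000-member one, consistent with the ~10x difference reported below.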
So that's why I tried #2.
The file size dropped by 10x and the load time dropped by 10x!
Performance traces showed that most of the time was the overhead of extracting individual files from the zip. Extracting 2000 small files took 10x longer than extracting one file (same total size of uncompressed data). And amazingly, the overhead of parsing that one big XML file and then running an xpath over it to pull out all the children was minimal compared to the unzip overhead.
Anyway just thought I'd share this in case anyone is hitting a similar issue.
Unzipping a zip of lots of small files is horribly expensive compared to unzipping a single big file.
David A. Lee
Senior Principal Software Engineer
dlee at epocrates.com