[MarkLogic Dev General] RE: "Smart" bulk updates

Kelly Stirman Kelly.Stirman at marklogic.com
Fri Mar 5 12:00:47 PST 2010


Hi David,

You could do the following:

1) calculate the MD5 hash of the file during load (MarkLogic can do this)
2) store the MD5 hash as a property (so it works for all file types)
3) before the insert, check whether the file already exists and, if it does, whether the MD5 hash is the same
4) if the hash is the same, do nothing; otherwise insert the file and set the MD5 hash property at the same time
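
As a rough illustration of the four steps above, here is a minimal
Python sketch. It is not MarkLogic code: `store` is a hypothetical
in-memory dict standing in for the database and its properties, and
`md5_hex` and `sync_document` are names made up for this example.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Step 1: compute the MD5 hash of the file content."""
    return hashlib.md5(data).hexdigest()

def sync_document(store: dict, uri: str, data: bytes) -> str:
    """Write `data` at `uri` only if it is new or its hash has changed.

    `store` is a hypothetical stand-in for the database: it maps a URI
    to a dict holding the content and its "md5" property.
    Returns "inserted", "updated", or "unchanged".
    """
    new_hash = md5_hex(data)
    existing = store.get(uri)  # step 3: does the document exist already?
    if existing is not None and existing["md5"] == new_hash:
        return "unchanged"     # step 4: same hash, do nothing
    # step 4 (otherwise) and step 2: write content and hash together
    store[uri] = {"content": data, "md5": new_hash}
    return "updated" if existing is not None else "inserted"
```

Calling `sync_document` twice with the same bytes should report
"inserted" and then "unchanged", so only changed files cause a write.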

I would use xdmp:exists(fn:doc("foo")) to see if the doc already exists.

Kelly

Date: Fri, 5 Mar 2010 11:48:35 -0800
From: "Lee, David" <dlee at epocrates.com>
Subject: [MarkLogic Dev General] "Smart" bulk updates
To: <General at developer.marklogic.com>

I have a task coming up where I need to update a large set of XML
and binary files from an outside source every day.

This is about 6,000 XML docs and 30,000 images, about 2GB total.

I get these from an outside source as one huge 1GB zip file.  I expect
maybe only 1% of the files to have changed in any drop, maybe even less
(0.1%?).

For any changed files I need to generate some additional data (outside
of ML), then upload the files and update some properties.

I *could* just update ALL files every day, but I'd like to be more
efficient than that, considering the likely change rate is so low.

I'm sure this is a common problem (not unlike, say, rsync) ...

What do people do in this case?

I was thinking of storing a checksum (MD5?) as a property of each file,
then comparing with the new files by listing the directory tree from ML.

Another idea is to keep a filesystem cache of what's in ML and do the
comparison there.

My guess is it would be just as (in)efficient to upload each file
to compare within ML as to just update the document,
or vice versa: to fetch each file from ML just to compare with the
filesystem.  So I don't want to go that route.

Then there is also the deletion issue ... I need to detect files that
are no longer in the dataset and delete them.
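
One way to picture the checksum comparison, including the deletions,
is the Python sketch below. It assumes a manifest (file name to MD5)
saved from the previous drop is available; `manifest_from_zip` and
`diff_manifests` are hypothetical names, and nothing here talks to ML.

```python
import hashlib
import zipfile

def manifest_from_zip(zip_source) -> dict:
    """Map each file name in the zip to the MD5 hex digest of its content.

    `zip_source` can be a path or a file-like object. Each member is read
    fully into memory, which is fine for a sketch but could be streamed.
    """
    manifest = {}
    with zipfile.ZipFile(zip_source) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directory entries have no content to hash
            manifest[info.filename] = hashlib.md5(zf.read(info)).hexdigest()
    return manifest

def diff_manifests(previous: dict, current: dict):
    """Return (added, changed, deleted) as sorted lists of file names."""
    added = sorted(current.keys() - previous.keys())
    deleted = sorted(previous.keys() - current.keys())
    changed = sorted(
        name for name in current.keys() & previous.keys()
        if current[name] != previous[name]
    )
    return added, changed, deleted
```

Only the names in `added` and `changed` would need the extra
processing and upload, and the names in `deleted` are the ones to
remove. The previous manifest could just as well be built from hashes
stored as properties rather than from a local cache.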

Any suggestions or ideas?  Has anyone done something like this before?
Are there built-in MarkLogic features that could help?

Thanks,

-David

----------------------------------------

David A. Lee
Senior Principal Software Engineer
Epocrates, Inc.
dlee at epocrates.com
812-482-5224

