[MarkLogic Dev General] mlcp ability to skip corrupt zip files?

Geert Josten Geert.Josten at marklogic.com
Wed Jul 22 06:19:23 PDT 2015


Hi Kristina,

I would have expected MLCP to skip corrupt files without crashing, but apparently it does not. It is not perfect, but one workaround is to wrap MLCP in another script that loops over the zip files itself and makes a new MLCP call for each zip. That makes parallelization more difficult (so it is likely slower), but at least it allows the rest of the processing to finish.
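
A minimal sketch of such a wrapper, in Python. Several things here are assumptions rather than anything confirmed in this thread: that mlcp.sh is on the PATH, that mlcp exits with a non-zero status when it terminates on a bad archive (worth verifying for your version), and that a pre-check with Python's zipfile module is a reasonable stand-in for "corrupt" (it may not agree exactly with what mlcp rejects). The names import_each_zip, run_mlcp and is_readable_zip are hypothetical helpers made up for illustration.

```python
import subprocess
import zipfile
import zlib
from pathlib import Path
from typing import Callable, List, Sequence, Tuple

# Errors that Python's zipfile can raise while opening or reading a damaged archive.
ZIP_ERRORS = (zipfile.BadZipFile, zlib.error, OSError, EOFError,
              RuntimeError, NotImplementedError)


def is_readable_zip(path: Path) -> bool:
    """Return True if the archive opens and every member passes its CRC check."""
    try:
        with zipfile.ZipFile(path) as archive:
            return archive.testzip() is None
    except ZIP_ERRORS:
        return False


def run_mlcp(zip_path: Path, extra_args: Sequence[str]) -> int:
    """Run one mlcp import for a single zip and return its exit status."""
    # Assumption: mlcp.sh is on the PATH; adjust for your installation.
    cmd = ["mlcp.sh", "import",
           "-input_file_path", str(zip_path),
           "-input_compressed", "true",
           *extra_args]
    return subprocess.run(cmd).returncode


def import_each_zip(
    root: str,
    extra_args: Sequence[str] = (),
    runner: Callable[[Path, Sequence[str]], int] = run_mlcp,
) -> Tuple[List[Path], List[Path]]:
    """Call the runner once per zip under root; return (loaded, skipped)."""
    loaded: List[Path] = []
    skipped: List[Path] = []
    for zip_path in sorted(Path(root).rglob("*.zip")):
        if not is_readable_zip(zip_path):
            # Never handed to mlcp, so it cannot abort the whole run.
            skipped.append(zip_path)
        elif runner(zip_path, extra_args) == 0:
            loaded.append(zip_path)
        else:
            # mlcp itself failed on this zip; record it and keep going.
            skipped.append(zip_path)
    return loaded, skipped
```

The remaining options (transform, collections, permissions and so on) would be passed through extra_args unchanged, and the skipped list tells you which archives to look at afterwards. Each call starts a fresh mlcp process, which is where the slowdown mentioned above comes from.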

Can you send me a small example of such a corrupt zip file off-list? I could use that to file a bug against MLCP internally.

Cheers,
Geert

From: <general-bounces at developer.marklogic.com> on behalf of "Morales-Martin, Kristina" <kmorales-martin at cas.org>
Reply-To: MarkLogic Developer Discussion <general at developer.marklogic.com>
Date: Tuesday, July 21, 2015 at 6:58 PM
To: MarkLogic Developer Discussion <general at developer.marklogic.com>
Subject: [MarkLogic Dev General] mlcp ability to skip corrupt zip files?


Dear all,

We are using the MarkLogic Content Pump to push content from many directories that contain zip files, which in turn contain .xml files.
Following the last communication with Geert, we are also using the transform option in order to ingest only xml content.  The suggested approach of filtering with a transform works.

Unfortunately, when mlcp encounters a corrupt zip file (which we may receive from our sources), the process terminates.  Is there an option to instruct mlcp to keep going, that is, to skip the corrupt zip file and continue processing the rest of the zip files in the large and deeply nested directories?  It looks like the -tolerate_errors option won’t work, given that we need to use a transform to ingest only xml files, and that forces the batch size to 1.

Please advise.

We are using the following options:
-input_file_path $inputFilePath \
-mode local -input_compressed true \
-output_uri_replace "(\/.+\/+)(?=.+\.zip),'/ourOverrideOfTheURIToRemoveTheLeadingNASPath/'" \
-output_collections "$collections" \
-database $dbName -output_permissions … \
-transform_module /ourNamespace/ourTransformModule.xqy \
-transform_namespace "http://cas.org/..." \
-xml_repair_level full

Thank you,
________________________________
Kristina Morales-Martin
Sr. Technical Information Specialist, e-Content Operations
CAS, a division of the American Chemical Society
2540 Olentangy River Road
Columbus, OH 43202
Phone: 614-447-3600, ext. 2322
Fax: 614-447-3827
www.cas.org

