[MarkLogic Dev General] Problem while using xdmp:tidy

Geert Josten Geert.Josten at marklogic.com
Tue Jan 12 00:29:09 PST 2016

Hi Rachit,

I think there are very few options here, and your best bet is probably string manipulation. I also tried this slightly more direct approach:

let $xml := "<p><???&dagger;?></p>"
return xdmp:unquote($xml, (), "repair-full")

But that fails with just the same message. I’d use a regex to rename such PIs so that they have a valid name. Something like fn:replace($xml, "<\?\?([^>]*)>", "<?bad ?$1>")
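A minimal sketch of that replacement, with the question marks escaped so the regex engine treats them as literal characters rather than quantifiers. The function name local:fix-pi-names and the target name "bad" are only placeholders, and the sample input is a simplified stand-in for the real content:

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: give PIs that start with "<??" a valid target name.
   "bad" is an arbitrary placeholder target, not anything MarkLogic requires. :)
declare function local:fix-pi-names($xml as xs:string) as xs:string
{
  (: "\?" matches a literal question mark; an unescaped "?" is a quantifier :)
  fn:replace($xml, "<\?\?([^>]*)>", "<?bad ?$1>")
};

(: Simplified sample input; yields the string <p><?bad ??x?></p> :)
local:fix-pi-names("<p><???x?></p>")
```

The result should then parse with xdmp:unquote, because the PI now has the target "bad" and everything after it is treated as PI content.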

If you prefer to keep bad data out of your database, try running xdmp:tidy or xdmp:unquote when ingesting data, if necessary in a pre-commit trigger. If that throws an exception, your data won’t get inserted, which keeps your database clean.
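As a rough sketch of that idea, assuming the content arrives as a string and that a parse failure in xdmp:unquote is the signal to reject it. The function name local:insert-if-well-formed is hypothetical, and the same check could live in a trigger module instead:

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: parse first, insert only when parsing succeeds.
   Returns true if the document was inserted, false if it was rejected. :)
declare function local:insert-if-well-formed(
  $uri as xs:string,
  $xml as xs:string
) as xs:boolean
{
  try {
    (: xdmp:unquote throws if the string is not well-formed XML :)
    let $doc := xdmp:unquote($xml)
    return (xdmp:document-insert($uri, $doc), fn:true())
  } catch ($e) {
    (: Log the rejection and skip the insert :)
    (xdmp:log(fn:concat("Rejected malformed content for ", $uri)), fn:false())
  }
};

(: Example URI and content are made up for illustration :)
local:insert-if-well-formed("/legacy/example.xml", "<p>some content</p>")
```

Catching the exception here lets a batch load continue past bad records; leaving it uncaught would instead abort the whole transaction.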

Kind regards,

From: general-bounces at developer.marklogic.com on behalf of Rachit Rampal <rachit.rampal at nagarro.com>
Reply-To: MarkLogic Developer Discussion <general at developer.marklogic.com>
Date: Tuesday, January 12, 2016 at 5:43 AM
To: general at developer.marklogic.com
Subject: [MarkLogic Dev General] Problem while using xdmp:tidy


I have a piece of content in my legacy database which is neither valid HTML nor valid XML. Since it would be difficult to clean up the legacy data, I want to tidy this content in MarkLogic (version 8.0-3) using xdmp:tidy.
The content looks like :
          [cid:image002.png at 01D14CC7.BC7F7E90]
Please find the attached query I’m executing on ML QConsole to tidy this up.

The problem here is that the response I’m getting after applying tidy is not valid XML (I verified it with an XML validator). Also, when I try to insert a document with the resulting XML body via POSTMAN or RESTClient, it throws an error saying ‘MALFORMED BODY | Invalid Processing Instruction names’.

Response XML :
          [cid:image001.png at 01D14D20.C3737130]

My expectation is that MarkLogic’s tidy functionality should refuse to tidy this type of content and throw an error instead, which it does not do in the current scenario. If I got the error from tidy itself, I would rather have this dirty or bad data removed from the legacy database.

Please help me get past this problem, or suggest a workaround to resolve it.

Things Tried So Far
I have tried the various options listed for xdmp:tidy, but they didn’t help much. I also looked into processing instructions but couldn’t find a way forward, as this doesn’t look like a valid PI either.

Kind Regards,
Rachit Rampal
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://developer.marklogic.com/pipermail/general/attachments/20160112/5f5a7042/attachment.html 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image002.png
Type: image/png
Size: 3927 bytes
Desc: image002.png
Url : http://developer.marklogic.com/pipermail/general/attachments/20160112/5f5a7042/attachment.png 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.png
Type: image/png
Size: 2623 bytes
Desc: image001.png
Url : http://developer.marklogic.com/pipermail/general/attachments/20160112/5f5a7042/attachment-0001.png 