[MarkLogic Dev General] Determining Whether Whitespace is In Data as Stored or A Result of Serialization?
Danny.Sokolsky at marklogic.com
Tue Nov 29 16:55:31 PST 2011
There were some changes made in later 4.2 releases to restore the behavior from earlier releases. The serialization is about how it is output, not how it is stored, so it should be stored correctly.
I recommend trying it on the latest 4.2 release (4.2-7 now, I think). I think it will then, by default, behave the same as in 4.1. In 4.2, there are some serialization options you can set at the query level to control this. In MarkLogic 5, you can also control these options' default values at the App Server level.
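As a rough illustration of the storage-versus-serialization distinction, here is a minimal Java sketch using only the JDK's DOM parser and Transformer. It is an analogy, not MarkLogic or XCC code: the class name SerializationDemo and the sample parent/child document are made up for the example. The same in-memory tree is serialized with indenting off and then on; only the output text differs, and the tree itself gains no nodes.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: JDK-only analogy, not the MarkLogic serializer.
final class SerializationDemo {

    // Parse an XML string into a DOM tree (the "stored" form in this analogy).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Serialize the tree; the indent flag affects only the produced text.
    static String serialize(Document doc, boolean indent) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            t.setOutputProperty(OutputKeys.INDENT, indent ? "yes" : "no");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize XML", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample document with no whitespace between elements.
        Document doc = parse("<parent><child>a</child><child>b</child></parent>");
        System.out.println("indent=no : " + serialize(doc, false));
        System.out.println("indent=yes: " + serialize(doc, true));
        // The tree still has exactly the two element children it started with.
        System.out.println("children  : "
                + doc.getDocumentElement().getChildNodes().getLength());
    }
}
```

The point of the sketch is only that whitespace seen in serialized output does not by itself prove that whitespace nodes exist in the underlying tree.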
Here is the 4.2 release note item that describes some of these changes:
From: general-bounces at developer.marklogic.com [mailto:general-bounces at developer.marklogic.com] On Behalf Of Eliot Kimber
Sent: Monday, November 28, 2011 3:04 PM
To: general at developer.marklogic.com
Subject: [MarkLogic Dev General] Determining Whether Whitespace is In Data as Stored or A Result of Serialization?
I have determined that content loaded through the XccRunner.load() method
has unwanted whitespace not in the original XML when subsequently accessed.
I've tested on 4.2-1. Earlier versions do not seem to have this behavior
(although I need to do more testing to confirm--but we certainly would have
noticed it if we had, as from our standpoint it constitutes a data
corruption issue as data being returned from ML is different from what was
given to ML).
I traced the DOM being loaded right to the call of load() and verified by
inspection that there were no whitespace nodes between two particular
elements, e.g., the original source was:
Accessing the loaded document using e.g.,:
(where there is multiple whitespace before the <child> start tags and before
the </parent> close tag).
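The by-inspection check described above (confirming that the DOM contains no whitespace nodes before it is handed to load()) could be automated roughly as follows. This is a sketch under assumptions: it uses only the JDK DOM API, the class and method names are hypothetical, and the parent/child sample is invented for illustration rather than taken from the actual source document.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: counts whitespace-only text nodes in a DOM tree.
final class WhitespaceNodes {

    // Parse an XML string with the default (non-validating) JDK parser,
    // which keeps whitespace-only text nodes between elements.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Recursively count text nodes whose value is empty after trimming.
    static int countWhitespaceOnlyTextNodes(Node node) {
        int count = 0;
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                if (child.getNodeValue().trim().isEmpty()) {
                    count++;
                }
            } else {
                count += countWhitespaceOnlyTextNodes(child);
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Assumed samples: one without and one with inter-element whitespace.
        Document compact = parse("<parent><child>a</child><child>b</child></parent>");
        Document indented = parse(
                "<parent>\n  <child>a</child>\n  <child>b</child>\n</parent>");
        System.out.println("compact : " + countWhitespaceOnlyTextNodes(compact));
        System.out.println("indented: " + countWhitespaceOnlyTextNodes(indented));
    }
}
```

Running the same count on the DOM just before load(), and on a DOM built from what the server returns, would give a like-for-like comparison that does not depend on how any viewer chooses to display the document.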
I tried various access routes, including CQ, access via our own product's
calls to the XccRunner API, OxygenXML via WebDAV and direct XQuery (via Xcc)
and get the same result. Some accesses show more indention than others, but
they all have indention.
From what I could find it appears that this is the result of a change in the
default serialization options.
My primary question is: how can I determine how the XML is stored in ML
without interference from any serialization options? Assuming ML is not
literally storing the bytes of the XML, I assume I can't just look inside the
forest, but is there a reliable way to see what the original whitespace was?
My first task is to prove that the XML is correct as provided to MarkLogic.
My secondary questions:
1. Is there any way that options on the load() method could affect
whitespace as stored? I didn't see any but I could have missed something.
2. If this is in fact a function of serialization options, where would we
control that in our Java code that uses Xcc to run XQueries? Is it simply a
matter of adding "declare option xdmp:output indent=no;" to our XQuery?
3. Is this default serialization behavior changed in ML 5?
Senior Solutions Architect
"Bringing Strategy, Content, and Technology Together"