[MarkLogic Dev General] Treating elements as byte strings
Lee, David
dlee at epocrates.com
Thu Oct 7 12:56:05 PDT 2010
An analogy would be to ask "How big is a relational database table" ...
Asking that may make it more apparent how meaningless the question is.
The common perception of XML as a "document", though, leads one to pull
in other concepts associated with documents, such as file size.
But even that is bad logic. Take a Word or PostScript document, load it
into an editor and save it, and it may change size even without
changing the content. Possibly dramatically. Even more so for things
like audio or video. Assuming there is a fixed serialized size for the
abstract concept of an "identical document" is fallacious. It's simply
not something that exists in reality, no matter how much we want it.
That said, there are good reasons to want to know things about a
particular instantiation of a document.
For example, every day I load about 1 GB of XML "documents", of which
maybe < 0.1% have changed since the previous day.
I store as a property in ML the length and checksum of the actual file
*before I upload it*.
Then I can query ML for the documents' properties and only upload
changed documents.
Being off by even 1 bit would break this. But it works fine as long as
I don't require that the document *retrieved* from MarkLogic is
byte-identical to what I put in.
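A minimal sketch of that check, in Python rather than XQuery. The names
`fingerprint` and `needs_upload` are hypothetical, SHA-256 is an assumed
choice of checksum, and a plain dict stands in for the properties that
would really be stored in and queried from MarkLogic:

```python
import hashlib
from typing import Optional


def fingerprint(data: bytes) -> dict:
    """Length and checksum of the exact bytes of the file on disk,
    computed before upload (not of anything read back from the server)."""
    return {"length": len(data), "sha256": hashlib.sha256(data).hexdigest()}


def needs_upload(data: bytes, stored: Optional[dict]) -> bool:
    """True if the file is new or its bytes differ from the last upload.

    `stored` is the fingerprint saved at the previous upload, or None if
    the document has never been uploaded.
    """
    return stored is None or fingerprint(data) != stored


# A dict stands in for the per-document properties kept on the server.
properties = {}
uri = "/docs/a.xml"
original = b"<doc>hi</doc>"

if needs_upload(original, properties.get(uri)):
    # ... upload the file here, then record its fingerprint ...
    properties[uri] = fingerprint(original)
```

The comparison only ever involves bytes that existed on the client side,
which is why it does not matter how the server re-serializes the XML.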
-David
-----Original Message-----
From: general-bounces at developer.marklogic.com
[mailto:general-bounces at developer.marklogic.com] On Behalf Of Geert
Josten
Sent: Thursday, October 07, 2010 3:47 PM
To: General Mark Logic Developer Discussion
Subject: Re: [MarkLogic Dev General] Treating elements as byte strings
Hi Karl,
David is making good points. You will need to define how you want to
calculate the size of XML. You could agree that the size is always
measured on the 'normalized' XML (tidied, unquoted/quoted, whitespace
stripped, whatever). You can also simply accept the fact that namespace
declarations are inserted; you can more or less predict how much is
added and perhaps even compensate for that. You could also *try* to
string-replace the namespace declarations out of the XML, but I
recommend against that. You could also follow David's idea of preserving
a text copy of the document, or of the relevant part of it, and use
that for the size count.
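To make the 'normalized' option concrete, here is a rough sketch in
Python rather than XQuery. It assumes Canonical XML (C14N 2.0, via the
standard library's `canonicalize`, Python 3.8+) is an acceptable
definition of "normalized"; the helper name `normalized_size` is
hypothetical, and whether this matches any particular spec's notion of
size is not something the sketch can decide:

```python
from xml.etree.ElementTree import canonicalize  # available since Python 3.8


def normalized_size(xml_text: str) -> int:
    """Size in UTF-8 bytes of the canonical form of xml_text.

    strip_text=True trims whitespace around text content, so
    whitespace-only text between elements disappears. Note that it also
    trims leading and trailing whitespace inside real text.
    """
    canonical = canonicalize(xml_text, strip_text=True)
    return len(canonical.encode("utf-8"))


# Two lexically different serializations of the same tree.
compact = "<a x='1'><b/></a>"
spread_out = '<a   x="1">\n  <b></b>\n</a>'

same_size = normalized_size(compact) == normalized_size(spread_out)
```

The raw strings differ in length, but both sides of the comparison are
measured on the same canonical form, so the count no longer depends on
quoting style, empty-element syntax, or indentation.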
I'd say that just living with the fact that the size is a few bytes off
sounds like something that would be acceptable in most cases.
Kind regards,
Geert
drs. G.P.H. (Geert) Josten
Consultant
Daidalos BV
Hoekeindsehof 1-4
2665 JZ Bleiswijk
T +31 (0)10 850 1200
F +31 (0)10 850 1199
mailto:geert.josten at daidalos.nl
http://www.daidalos.nl/
KvK 27164984
> From: general-bounces at developer.marklogic.com
> [mailto:general-bounces at developer.marklogic.com] On Behalf Of
> Karl Erisman
> Sent: Thursday, 7 October 2010 20:27
> To: General Mark Logic Developer Discussion
> Subject: [MarkLogic Dev General] Treating elements as byte strings
>
> I would like to take an element node and treat part of it as
> a string in the same way it was originally declared (lexical
> equivalence, not just semantic equivalence). Here is an
> example that does NOT do what I want:
>
> declare namespace ns="namespace";
> let $elem := <xml><ns:xml>hi</ns:xml></xml>
> return xdmp:quote($elem/*)
>
> => <ns:xml xmlns:ns="namespace">hi</ns:xml>
>
> This returns a string representing semantically equivalent
> XML, but it differs lexically from the original.
>
> After $elem is stored as an element node, only its tree
> structure is stored, correct? So the only way for me to do
> what I'm describing would be for *me* to save the string form
> of the element at the time it is declared. Is this correct?
>
> BTW: As background, the reason I need to do this is to comply
> with a spec that requires computing the "size" of incoming
> data, which may or may not be XML (and the "size" is specific
> to the way the XML is declared -- it is lexically
> significant). The data is sent as part of a larger XML
> element, and by the time it arrives at the module responsible
> for checking the size, it is already in XML. This is fine
> for text nodes (fn:string-length gives the "size"), but not
> for element nodes. If my understanding is correct, I'll need
> to make modifications to lower-level modules so the original
> XML is available.
>
> Thanks,
> Karl
> _______________________________________________
> General mailing list
> General at developer.marklogic.com
> http://developer.marklogic.com/mailman/listinfo/general
>