[MarkLogic Dev General] Treating elements as byte strings

Karl Erisman karl.erisman at gmail.com
Fri Oct 8 10:09:21 PDT 2010


The "size" of an XML document is a concept defined by the spec I
mentioned.  I'm fully aware how "meaningless" it is without context,
which is why I indicated (in the subject) that the *byte length* of the
original lexical form of the XML defines the size.  I
appreciate the responses, but the answers are getting carried away :)

My very simple question is whether MarkLogic has some
native/free/automatic way of storing the original, and I'm pretty sure
the answer is no.  You can ignore my mention of the spec -- it is
really irrelevant to the question (it was included only to satisfy
curiosity; I'm not developing it, I'm just implementing it).

My plan has been to store the original string form of the XML along
with the internal element form.  That will suffice for my purposes.
Storing it in a properties document should make it easy to associate
with the original.
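
A minimal sketch of that plan, written in Python rather than XQuery so it
can run without a MarkLogic server.  The helper name `store_with_original`
and the dict standing in for the database and its properties document are
hypothetical.  The only point is that the byte length is taken from the
original string, never from a re-serialized tree.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory stand-in for a document store with one properties
# record per URI; this is not a MarkLogic API.
store = {}

def store_with_original(uri, original_xml):
    """Keep the parsed tree and the untouched original string side by side."""
    raw = original_xml.encode("utf-8")
    store[uri] = {
        "tree": ET.fromstring(original_xml),  # internal element form
        "original": original_xml,             # original lexical form
        "size": len(raw),                     # byte length of the original
    }
    return store[uri]

original = '<xml xmlns:ns="namespace"><ns:xml>hi</ns:xml></xml>'
record = store_with_original("/example.xml", original)

# Re-serializing the tree does not reproduce the original bytes here
# (ElementTree renames the namespace prefix), so the stored string is the
# only reliable source for the size.
reserialized = ET.tostring(record["tree"])
print(record["size"], len(reserialized), reserialized != original.encode("utf-8"))
```

The size is computed once, at the point where the original string is still
available, which is the same constraint described above for the lower-level
modules.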

Thanks,
Karl

Date: Thu, 7 Oct 2010 12:56:05 -0700
From: "Lee, David" <dlee at epocrates.com>
Subject: Re: [MarkLogic Dev General] Treating elements as byte strings
To: "General Mark Logic Developer Discussion"
      <general at developer.marklogic.com>
Message-ID: <DD37F70D78609D4E9587D473FC61E0A71D68F7DD at postoffice>
Content-Type: text/plain;       charset="us-ascii"

An analogy would be to ask "How big is a relational database table" ...
Asking that may make it more apparent how meaningless the question is.

The common perception of XML as a "Document" leads one to pull
in other concepts associated with documents, such as file size.
But even that is bad logic.  Take a Word or PostScript document, load it
into an editor and save it, and it may change size even without
changing the content.  Possibly dramatically.  It is worse still for things
like audio or video.  Assuming there is a fixed serialized size for the
abstract concept of an "Identical Document" is fallacious.  It's simply
not something that exists in reality, no matter how much we want it.

That said, there are good reasons to want to know things about a
particular instantiation of a document.
For example, every day I load about 1 GB of XML "documents", of which
maybe < 0.1% have changed since the previous day.
I store as a property in ML the length and checksum of the actual file
*before I upload it*.
Then I can query ML for the documents' properties and upload only the
changed documents.
Being off by even 1 bit would break this.  But it works fine as
long as I don't require that the *retrieved document* from MarkLogic is
the same as what I put in.
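
A rough sketch of that check on the client side, in Python.  The checksum
algorithm is not named above, so MD5 is an assumption, and `stored_props`
is a hypothetical stand-in for the length and checksum queried back from
the server's properties.

```python
import hashlib
import os
import tempfile

def fingerprint(path):
    """Return (byte length, MD5 hex digest) of the file exactly as it is on disk."""
    with open(path, "rb") as f:
        data = f.read()
    # MD5 is an assumed choice of checksum; any stable digest would do.
    return len(data), hashlib.md5(data).hexdigest()

def needs_upload(path, stored):
    """`stored` is the (length, checksum) pair recorded at the last upload, or None."""
    return stored is None or fingerprint(path) != stored

# Demonstration with a temporary file.
with tempfile.NamedTemporaryFile("wb", suffix=".xml", delete=False) as tmp:
    tmp.write(b"<doc>hello</doc>")
    path = tmp.name

stored_props = fingerprint(path)              # recorded before the first upload
unchanged = needs_upload(path, stored_props)  # file is byte-identical

with open(path, "ab") as f:
    f.write(b"\n")                            # a single extra byte is a change
changed = needs_upload(path, stored_props)

os.remove(path)
print(unchanged, changed)
```

Both values are computed from the file on disk, so nothing depends on how
the server later serializes the stored document.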


-David



-----Original Message-----
From: general-bounces at developer.marklogic.com
[mailto:general-bounces at developer.marklogic.com] On Behalf Of Geert
Josten
Sent: Thursday, October 07, 2010 3:47 PM
To: General Mark Logic Developer Discussion
Subject: Re: [MarkLogic Dev General] Treating elements as byte strings

Hi Karl,

David is making good points. You will need to define how you want to
calculate the size of XML. You could agree that the size is
always counted on the 'normalized' XML (tidied, unquoted/quoted,
whitespace stripped, whatever). You can also simply accept the fact that
namespace declarations are inserted; you can more or less predict how
much is added and perhaps even compensate for that. You could also *try*
to string-replace the namespace declarations out of the XML, but I
recommend against that. You could also follow David's idea of preserving
some text copy of the document or relevant document part and use that
for the size count.

I'd say that just living with the fact that the size is a few bytes off
sounds like something that would be acceptable in most cases.
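
To illustrate the 'normalized' option, here is a small Python sketch.  No
particular normalization is prescribed above; Canonical XML 2.0 with
whitespace stripped from text is only one possible choice, assumed here,
and `normalized_size` is a hypothetical helper name.

```python
from xml.etree.ElementTree import canonicalize  # available in Python 3.8+

def normalized_size(xml_text):
    """Byte length of the C14N 2.0 form, with whitespace around text stripped."""
    canonical = canonicalize(xml_data=xml_text, strip_text=True)
    return len(canonical.encode("utf-8"))

# Two lexically different spellings of the same content.
doc_a = "<a x='1'>\n  <b/>\n</a>"
doc_b = '<a x="1"><b></b></a>'

same_normalized = normalized_size(doc_a) == normalized_size(doc_b)
same_raw = len(doc_a.encode("utf-8")) == len(doc_b.encode("utf-8"))
print(same_normalized, same_raw)
```

With this kind of agreement the count no longer depends on quoting style,
indentation or empty-element syntax, but it also no longer equals the byte
length of what was originally sent.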

Kind regards,
Geert



drs. G.P.H. (Geert) Josten
Consultant

Daidalos BV
Hoekeindsehof 1-4
2665 JZ Bleiswijk

T +31 (0)10 850 1200
F +31 (0)10 850 1199

mailto:geert.josten at daidalos.nl
http://www.daidalos.nl/

KvK 27164984



> From: general-bounces at developer.marklogic.com
> [mailto:general-bounces at developer.marklogic.com] On Behalf Of
> Karl Erisman
> Sent: Thursday, 7 October 2010 20:27
> To: General Mark Logic Developer Discussion
> Subject: [MarkLogic Dev General] Treating elements as byte strings
>
> I would like to take an element node and treat part of it as
> a string in the same way it was originally declared (lexical
> equivalence, not just semantic equivalence).  Here is an
> example that does NOT do what I want:
>
> declare namespace ns="namespace";
> let $elem := <xml><ns:xml>hi</ns:xml></xml>
> return xdmp:quote($elem/*)
>
> => <ns:xml xmlns:ns="namespace">hi</ns:xml>
>
> This returns a string representing semantically equivalent
> XML, but it differs lexically from the original.
>
> After $elem is stored as an element node, only its tree
> structure is stored, correct?  So the only way for me to do
> what I'm describing would be for *me* to save the string form
> of the element at the time it is declared.  Is this correct?
>
> BTW: As background, the reason I need to do this is to comply
> with a spec that requires computing the "size" of incoming
> data, which may or may not be XML (and the "size" is specific
> to the way the XML is declared -- it is lexically
> significant).  The data is sent as part of a larger XML
> element, and by the time it arrives at the module responsible
> for checking the size, it is already in XML.  This is fine
> for text nodes (fn:string-length gives the "size"), but not
> for element nodes.  If my understanding is correct, I'll need
> to make modifications to lower-level modules so the original
> XML is available.
>
> Thanks,
> Karl
> _______________________________________________
> General mailing list
> General at developer.marklogic.com
> http://developer.marklogic.com/mailman/listinfo/general
>

