[MarkLogic Dev General] document format

Mike Sokolov sokolov at ifactory.com
Mon Mar 31 10:30:13 PST 2008


OK, I am pursuing a solution along those general lines.  Just out of 
curiosity though: does this mean that internally there is no distinction 
between XML documents, text documents and binary documents?  It 
sounds as if a text document is simply a document whose only child is a 
single text node (and likewise for binary) - is that right?
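
As for the solution itself, here is the rough shape of what I am trying.
This is an untested sketch: doc-format is just a name I made up, and I am
assuming that the binary() kind test works as a path step the same way it
works with instance of in your function.  The idea is to check whether a
child of each kind exists at all, instead of looking at one position, so
a BOM text node, comments or processing instructions next to the root
element no longer change the answer:

```xquery
(: Hypothetical helper, not a built-in: best-guess format of a document node. :)
define function doc-format($doc as node()) as xs:string
{
  (: Any element child means XML, whatever else sits beside it. :)
  if (exists($doc/*))
  then "xml"
  (: Assumption: binary() is usable as a node test in a path step. :)
  else if (exists($doc/binary()))
       then "binary"
       else if (exists($doc/text()))
            then "text"
            else "not sure"
}

(: Same sampling loop as in the quoted message below. :)
for $x in doc()[1 to 100]
return doc-format($x)
```

An empty document, or one holding only comments, would still come back
as "not sure".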

-Mike

Danny Sokolsky wrote:
> Mike,
>
> I think your approach is the right idea, only it needs a little more
> logic to be more robust.  If you took the last() instead of the first in
> your node-kind test, that might work most of the time (or more often):
>
> node-kind(doc($uri)/node()[last()])
>
> Here is a similar idea using the instance of operator, performing a
> little logic to make a best-guess at the type:
>
> define function doctype($x as node()) as element()
> {
> <node>
>   <uri>{xdmp:node-uri($x)}</uri>
>   <type>{
>   if ($x/node() instance of binary())
>   then ("binary node") 
>   else if ( $x/node() instance of element() )
>        then ("XML node")
>        else if ( $x/node() instance of text() )
>             then  "text node"
>             else "not sure"
> }</type>
> </node>
> }
>
> for $x in doc()[1 to 100]
> return doctype($x)
>
> I have not found any of my documents that return "not sure" here, but I
> can imagine that you might be able to construct one.
>
> -Danny
>
> -----Original Message-----
> From: general-bounces at developer.marklogic.com
> [mailto:general-bounces at developer.marklogic.com] On Behalf Of Mike
> Sokolov
> Sent: Monday, March 31, 2008 10:34 AM
> To: General Mark Logic Developer Discussion
> Subject: [MarkLogic Dev General] document format
>
> I have been trying to come up with a way to determine the "format" of a 
> document in MarkLogic. The only API call that seems directly related is 
> xdmp:document-uri-format, but this seems to operate on the URI without 
> any reference to the contents of a document.  Instead, I tried testing:
>
> node-kind(doc($uri)/node()[1])
>
>
> but we just found an XML document for which this returns "text" - 
> apparently it has a BOM at the start, so the document node has two child
> nodes: one text (containing the BOM) and one element (the root element).
>
> Presumably there could be comments and processing instructions there 
> too, so this strategy is clearly flawed.
>
> Does anybody have a good way to determine whether a document in Mark 
> Logic is an XML document, a text document or a binary document?
>
> -Mike
>  
> _______________________________________________
> General mailing list
> General at developer.marklogic.com
> http://xqzone.com/mailman/listinfo/general

