[MarkLogic Dev General] Quirks of generating xhtml with xquery
Robert Koberg
rob at koberg.com
Thu Aug 28 08:02:21 PDT 2008
On Aug 28, 2008, at 10:20 AM, David Sewell wrote:
> xdmp:quote() takes whatever input you give it and returns it
> serialized as a string. So for example, taking some very ill-formed
> HTML input:
>
> let $html := xdmp:quote( (<p>par 1</p>, <p>par 2</p>) )
> return xdmp:tidy($html)[2]
>
> the output is
>
> <html version="-//W3C//DTD XHTML 1.1//EN" xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="generator" content="HTML Tidy for Linux/x86 (vers 1 September 2005), see www.w3.org"/>
> <title/>
> </head>
> <body>
> <p>par 1</p>
> <p>par 2</p>
> </body>
> </html>
>
> Note that this is the second node returned by xdmp:tidy(). The first
> node contains the error status (basically, what you'd get on stderr
> running a command-line version of tidy).
OK, but if this is using something like the tidy that is out in the wild,
it means you are building a DOM Document for each (uncacheable)
request. And you still won't get valid XHTML. That does not sound like
a good solution to me. A better approach might be to use John Cowan's
TagSoup, then at least you are using SAX.
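
For reference, the round trip being discussed can be sketched roughly as
below. This is only an illustration assembled from the calls mentioned in
this thread, not a tested recipe: $page is a hypothetical stand-in for
whatever node the application generates, and the order of the two
returned nodes is as David describes it above.

```xquery
(: Hypothetical page content; stands in for the generated XHTML node. :)
let $page :=
  <html xmlns="http://www.w3.org/1999/xhtml">
    <head><title>Example</title></head>
    <body><p>par 1</p><p>par 2</p></body>
  </html>
(: Step 1: serialize the in-memory node to a string. :)
let $serialized := xdmp:quote($page)
(: Step 2: tidy parses that string again and returns two nodes. :)
let $result := xdmp:tidy($serialized)
return (
  $result[1],  (: status node, comparable to tidy's stderr output :)
  $result[2]   (: the cleaned-up document :)
)
```

Both steps run for every request, which is the cost in question.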
best,
-Rob
>
>
> On Thu, 28 Aug 2008, Robert Koberg wrote:
>
>>
>> On Aug 28, 2008, at 9:54 AM, David Sewell wrote:
>>
>>> I don't think anyone else has mentioned it, but if you're generating
>>> a full HTML page via MarkLogic Server, you can use the xdmp:tidy()
>>> function to clean up your generated XHTML and control doctype:
>>>
>>> http://xqzone.com/pubs/3.2/apidocs/Document-Conversion.html#tidy
>>>
>>> xdmp:tidy() takes a string argument, however, so you need to wrap
>>> your HTML inside xdmp:quote():
>>>
>>> xdmp:tidy(xdmp:quote($my_html_node))
>>
>> Do you have to serialize the result to then pass through tidy (to
>> serialize again), or is it working in the DB's context?
>>
>> best,
>> -Rob
>>
>>
>>>
>>>
>>>
>>> On Wed, 27 Aug 2008, Eric Palmitesta wrote:
>>>
>>>> Aaron and I discussed this briefly at the training seminar, but I'd
>>>> like to get a sense of what other developers are doing to get around
>>>> the quirks of generating xhtml with xquery (rather than a java
>>>> servlet/jsp based website which pulls records from MarkLogic via
>>>> XDBC/XCC).
>>>>
>>>> One such quirk: Childless elements with no internal nodes and an
>>>> explicit closing tag are automatically folded into elements with no
>>>> closing tag. <div></div>, which is valid xhtml, will become <div />
>>>> after being processed by MarkLogic (breaks visual representation).
>>>> Some better examples are <script ...></script> and
>>>> <textarea></textarea>, which are expected to contain no internal
>>>> nodes in xhtml.
>>>>
>>>> I've taken to writing things like
>>>>
>>>> <script ... >{" "}</script>
>>>>
>>>> or
>>>>
>>>> <textarea> </textarea>
>>>>
>>>> which successfully preserves the explicit closing tag, keeping xhtml
>>>> happy. Is there a more elegant way to do this?
>>>>
>>>> Are there other banana-peels I should watch out for when generating
>>>> xhtml with xquery? Is creating an entire website by generating xhtml
>>>> with xquery generally frowned upon, or accepted? Admittedly, it seems
>>>> less flexible than a <web language>-based site, however the xdmp
>>>> namespace seems to provide sufficient functionality, and transforming
>>>> xml data into xhtml is incredibly easy with xquery.
>>>>
>>>> Cheers,
>>>>
>>>> Eric
>>>>
>>>>
>>>> PS
>>>> My vocabulary might be incorrect regarding words like 'tag' and
>>>> 'node', please correct me if necessary.
>>>>
>>>> PPS
>>>> I can see the archives at http://xqzone.marklogic.com/pipermail/general/
>>>> but are they searchable? I have a feeling newcomers such as myself
>>>> will be prone to asking questions which have already been discussed
>>>> at length.
>>>> _______________________________________________
>>>> General mailing list
>>>> General at developer.marklogic.com
>>>> http://xqzone.com/mailman/listinfo/general
>>>>
>>>
>>> --
>>> David Sewell, Editorial and Technical Manager
>>> ROTUNDA, The University of Virginia Press
>>> PO Box 801079, Charlottesville, VA 22904-4318 USA
>>> Courier: 310 Old Ivy Way, Suite 302, Charlottesville VA 22903
>>> Email: dsewell at virginia.edu Tel: +1 434 924 9973
>>> Web: http://rotunda.upress.virginia.edu/
>>
>