MarkLogic Server
XQUERY API DOCUMENTATION
3.2
This page was generated
October 21, 2008
5:17 PM
XQuery Built-In and Modules Function Reference

Built-In: Document Conversion

The conversion functions are built into the server and convert various document formats to XML. There are functions to convert HTML, PDF, Microsoft Word, Microsoft Excel, and Microsoft PowerPoint documents. The output of each of these functions is standards-compliant XHTML with cascading style sheets (CSS). Additionally, there are functions to zip and unzip documents, which can be used to support document formats that are zip archives (for example, the Microsoft Office 2007 docx format).
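
As a rough illustration of the zip support, the following sketch reads one part out of a docx archive using xdmp:zip-manifest and xdmp:zip-get. The file name "myFile.docx" is hypothetical, the part name "word/document.xml" is the conventional main part of an Office 2007 Word file, and the sketch assumes the file loads as a binary node:

```xquery
(: Sketch only: "myFile.docx" is a hypothetical file, assumed to load
   as a binary node. "word/document.xml" is the usual main document
   part inside a docx zip archive. :)
let $zip := xdmp:document-get("myFile.docx"),
    $manifest := xdmp:zip-manifest($zip)
return (
  (: the manifest lists every part in the archive :)
  $manifest,
  (: extract a single named part from the archive :)
  xdmp:zip-get($zip, "word/document.xml")
)
```

Inspecting the manifest first is a reasonable way to confirm which part names an archive actually contains before calling xdmp:zip-get.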

Includes the Microsoft Office convert functions using the AntennaHouse technology.

Includes the PDF convert functions using the Iceni technology.

Function Summary
xdmp:excel-convert Converts a Microsoft Excel document to XHTML.
xdmp:pdf-convert Converts a PDF file to XHTML.
xdmp:powerpoint-convert Converts a Microsoft PowerPoint document to XHTML.
xdmp:tidy Run tidy on the specified html document to convert the document to well-formed and clean XHTML.
xdmp:word-convert Converts a Microsoft Word document to XHTML.
xdmp:zip-create Create a zip file from a list of nodes.
xdmp:zip-get Get a named file from a zip document.
xdmp:zip-manifest Return a manifest for this zip file.
Function Detail
xdmp:excel-convert(
$doc as node(),
$filename as xs:string,
[$options as node()]
)  as  node()*
Summary:

Converts a Microsoft Excel document to XHTML. Returns several nodes, including a parts node, the converted document xml node, and any other document parts (for example, css files and images). The first node is the parts node, which contains a manifest of all of the parts generated as a result of the conversion.

Parameters:
$doc : Microsoft Office Excel document to convert to HTML, as a binary node().
$filename : The root for the name of the converted files and directories. If the specified filename includes an extension, then the extension is appended to the root with an underscore. The directory for other parts of the conversion (images, for example) has the string "_parts" appended to the root. For example, if you specify a filename of "myFile.xls", the generated names will be "myFile_xls.xhtml" for the xml node and "myFile_xls_parts" for the directory containing any other parts generated by the conversion (images, css files, and so on).
$options (optional): Options element for this conversion. The options element must be in the xdmp:excel-convert namespace. The default value is (). In addition to the options shown below, you can specify xdmp:tidy options by entering the tidy option elements in the xdmp:tidy namespace.

Options include:

<tidy>

Specify true to run tidy on the document and false not to run tidy. If you run tidy, you can also specify an xdmp:tidy options node.

<sheetID>

An integer specifying which sheet of the input Excel document to convert. If this option is not set, all sheets are converted.

<compact>

Specify true to produce "compact" HTML, that is, without style information. The default is false.

<print-area-only>

Specify true to convert only the print area of the sheet.

<sheet-by-sheet>

Specify true to produce one document for each sheet. The default is false.

Sample Options Node:

The following is a sample options node which specifies that tidy is used to clean the generated html, specifies to use the tidy "clean" option, and specifies to only convert sheet 2 of the document:
<options xmlns="xdmp:excel-convert"
         xmlns:tidy="xdmp:tidy">
  <tidy>true</tidy>
  <tidy:clean>yes</tidy:clean>
  <sheetID>2</sheetID>
</options>

Usage Notes:

The convert functions return several nodes. The first node is a manifest containing the various parts of the conversion. Typically there will be an xml part, a css part, and some image parts. Each part is returned as a separate node in the order shown in the manifest.

Therefore, given the following manifest:

<parts>
  <part>myFile_xls.xhtml</part>
  <part>myFile_xls_parts/conv.css</part>
  <part>myFile_xls_parts/toc.xml</part>
</parts>

the first node of the returned sequence is the manifest, the second is the "myFile_xls.xhtml" node, the third is the "myFile_xls_parts/conv.css" node, and the fourth is the "myFile_xls_parts/toc.xml" node.


Example:
let $results := xdmp:excel-convert( 
                         xdmp:document-get("myFile.xls"),
                         "myFile.xls" ),
    $manifest := $results[1]
return 
$results[2 to last()]

=> all of the converted nodes
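
Because the parts are returned in manifest order, each part name can be paired with its node by position. The following is a minimal sketch of that pairing, not a definitive recipe: the "/converted/" URI prefix is a hypothetical choice, and the wildcard step (*:part) is used because the namespace of the manifest elements is not stated here.

```xquery
(: Sketch only: store each converted part under the name listed in the
   manifest. The "/converted/" URI prefix is a hypothetical choice. :)
let $results := xdmp:excel-convert(
                    xdmp:document-get("myFile.xls"),
                    "myFile.xls" ),
    $manifest := $results[1],
    $parts := $results[2 to last()]
return
  (: the i-th part name in the manifest matches the i-th part node :)
  for $name at $i in $manifest//*:part
  return
    xdmp:document-insert(
      fn:concat("/converted/", fn:string($name)),
      $parts[$i] )
```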

xdmp:pdf-convert(
$doc as node(),
$filename as xs:string,
[$options as node()]
)  as  node()*
Summary:

Converts a PDF file to XHTML. Returns several nodes, including a parts node, the converted document xml node, and any other document parts (for example, css files and images). The first node is the parts node, which contains a manifest of all of the parts generated as a result of the conversion.

Parameters:
$doc : PDF document to convert to HTML, as a binary node().
$filename : The root for the name of the converted files and directories. If the specified filename includes an extension, then the extension is appended to the root with an underscore. The directory for other parts of the conversion (images, for example) has the string "_parts" appended to the root. For example, if you specify a filename of "myFile.pdf", the generated names will be "myFile_pdf.xhtml" for the xml node and "myFile_pdf_parts" for the directory containing any other parts generated by the conversion (images, css files, and so on).
$options (optional): Options element for this conversion. The options element must be in the xdmp:pdf-convert namespace. The default value is (). In addition to the options shown below, you can specify xdmp:tidy options by entering the tidy option elements in the xdmp:tidy namespace.

Options include:

<tidy>

Default value: true

Specify true to run tidy on the document and false not to run tidy. If you run tidy, you can also specify any xdmp:tidy options. Any tidy option elements must be in the xdmp:tidy namespace.

<config>

The configuration file for the conversion. You can specify an absolute path or a relative path. The relative path is relative to the <install_dir>/Converters/cvtpdf directory. The default configuration file is named PDFtoHTML.cfg; it produces a single reflowed XHTML document with CSS styling. Setting this parameter may override the remaining options.

<page-by-page>

Default value: false

Specify true to select a different default configuration file that produces one XHTML document per page with absolute positioning. The default paged configuration file is named PDFtoXHTML_pages.cfg. If a specific configuration file is selected with the config option, the page-by-page option has no effect.

<page-start-id>

Default value: 0

The index of the first page to convert. Page indices start at zero.

<page-end-id>

Default value: -1

The index of the last page to convert. Page indices start at zero. The default is -1, meaning to convert through the last page of the document.

<synth-bookmarks>

Default value: true

Enable/disable converter's internal font-based TOC inferences.

<image-output>

Default value: true

Enable/disable extraction and conversion of images.

<text-output>

Default value: true

Enable/disable extraction of text.

<zones>

Default value: false

Enable/disable zone controls. Using true produces better results when the PDF is annotated; using false produces better results in non-annotated tables.

<ignore-text>

Default value: true

Enable/disable extraction of text from images. Documents consisting of scanned pages can only have text extracted if this parameter is set to true; however, diagrams with embedded text labels may be less palatable. For page-by-page conversion, poor results from reflowing text and graphical elements within a diagram are less of a concern, so a value of false will probably be the better choice.

<remove-overprint>

Default value: false

Enable/disable removal of text overlays. Setting this parameter to true can sometimes clean up messy results stemming from reflowing of text that was not visible in the original PDF because it was covered by something else.

<illustrations>

Default value: true

Enable/disable extraction of illustrations. Setting this parameter to false can sometimes clean up messy results stemming from minor and unnecessary graphical ornaments.

<image-quality>

Default value: 75

Determines the quality of extracted and converted images: smaller values mean smaller image sizes (in bytes) but lossier rendering. The maximum is 100.

<page-start>

Default value:

Boilerplate text inserted at the start of every page. Any XML markup must be escaped. For example: &lt;p>PAGE START&lt;/p>

<page-end>

Default value:

Boilerplate text inserted at the end of every page. XML markup must be escaped.

<document-start>

Default value:

Boilerplate text inserted at the start of every document. XML markup must be escaped.

<document-end>

Default value:

Boilerplate text inserted at the end of every document. XML markup must be escaped.

<password>

Default value:

The password required to open a password-protected PDF.

Sample Options Node:

The following is a sample options node which specifies that tidy is used to clean the generated html, specifies to use the tidy "clean" option, and specifies a particular configuration file to use for the conversion:
<options xmlns="xdmp:pdf-convert"
         xmlns:tidy="xdmp:tidy">
  <tidy>true</tidy>
  <tidy:clean>yes</tidy:clean>
  <config>c:\myConfigFile.cfg</config>
</options>
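
As a further sketch, the options node can be built inline and passed as the third argument. The example below assumes a hypothetical "myFile.pdf"; it converts only the first three pages (indices 0 through 2), skips image extraction, and inserts escaped boilerplate markup at the start of each page:

```xquery
(: Sketch only: "myFile.pdf" is a hypothetical input file. :)
let $options :=
      <options xmlns="xdmp:pdf-convert">
        <page-start-id>0</page-start-id>
        <page-end-id>2</page-end-id>
        <image-output>false</image-output>
        <page-start>&lt;p&gt;PAGE START&lt;/p&gt;</page-start>
      </options>,
    $results := xdmp:pdf-convert(
                    xdmp:document-get("myFile.pdf"),
                    "myFile.pdf",
                    $options )
return
  (: the manifest of the generated parts :)
  $results[1]
```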

Usage Notes:

The convert functions return several nodes. The first node is a manifest containing the various parts of the conversion. Typically there will be an xml part, a css part, and some image parts. Each part is returned as a separate node in the order shown in the manifest.

Therefore, given the following manifest:

<parts>
  <part>myFile_pdf.xhtml</part>
  <part>myFile_pdf_parts/conv.css</part>
  <part>myFile_pdf_parts/toc.xml</part>
</parts>

the first node of the returned sequence is the manifest, the second is the "myFile_pdf.xhtml" node, the third is the "myFile_pdf_parts/conv.css" node, and the fourth is the "myFile_pdf_parts/toc.xml" node.


Example:
let $results := xdmp:pdf-convert( 
                         xdmp:document-get("myFile.pdf"),
                         "myFile.pdf" ),
    $manifest := $results[1]
return 
$results[2 to last()]

=> all of the converted nodes

xdmp:powerpoint-convert(
$doc as node(),
$filename as xs:string,
[$options as node()]
)  as  node()*
Summary:

Converts a Microsoft PowerPoint document to XHTML. Returns several nodes, including a parts node, the converted document xml node, and any other document parts (for example, css files and images). The first node is the parts node, which contains a manifest of all of the parts generated as a result of the conversion.

Parameters:
$doc : Microsoft PowerPoint document to convert to HTML, as a binary node().
$filename : The root for the name of the converted files and directories. If the specified filename includes an extension, then the extension is appended to the root with an underscore. The directory for other parts of the conversion (images, for example) has the string "_parts" appended to the root. For example, if you specify a filename of "myFile.ppt", the generated names will be "myFile_ppt.xhtml" for the xml node and "myFile_ppt_parts" for the directory containing any other parts generated by the conversion (images, css files, and so on).
$options (optional): Options element for this conversion. The options element must be in the xdmp:powerpoint-convert namespace. The default value is (). In addition to the options shown below, you can specify xdmp:tidy options by entering the tidy option elements in the xdmp:tidy namespace.

Options include:

<tidy>

Specify true to run tidy on the document and false not to run tidy. If you run tidy, you can also specify an xdmp:tidy options node.

<compact>

Specify true to produce "compact" HTML, that is, without style information. The default is false.

<slideID>

An integer specifying which slide of the input PowerPoint document to convert. If this option is not set, all slides are converted.

<slide-by-slide>

Specify true to produce one document for each slide. The default is false.

<speaker-notes>

Specify true to include speaker notes in the output. The default is false.

Sample Options Node:

The following is a sample options node which specifies that tidy is used to clean the generated html, specifies to use the tidy "clean" option, and specifies to only convert the second slide of the document:
<options xmlns="xdmp:powerpoint-convert"
         xmlns:tidy="xdmp:tidy">
  <tidy>true</tidy>
  <tidy:clean>yes</tidy:clean>
  <slideID>2</slideID>
</options>
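
A minimal sketch of passing such an options node follows. It assumes a hypothetical "myFile.ppt" and requests one document per slide with speaker notes included; the exact set of parts returned depends on the presentation:

```xquery
(: Sketch only: "myFile.ppt" is a hypothetical input file. :)
let $options :=
      <options xmlns="xdmp:powerpoint-convert">
        <slide-by-slide>true</slide-by-slide>
        <speaker-notes>true</speaker-notes>
      </options>,
    $results := xdmp:powerpoint-convert(
                    xdmp:document-get("myFile.ppt"),
                    "myFile.ppt",
                    $options )
return
  (: the manifest lists the parts produced for the slides :)
  $results[1]
```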

Usage Notes:

The convert functions return several nodes. The first node is a manifest containing the various parts of the conversion. Typically there will be an xml part, a css part, and some image parts. Each part is returned as a separate node in the order shown in the manifest.

Therefore, given the following manifest:

<parts>
  <part>myFile_ppt.xhtml</part>
  <part>myFile_ppt_parts/conv.css</part>
  <part>myFile_ppt_parts/toc.xml</part>
</parts>

the first node of the returned sequence is the manifest, the second is the "myFile_ppt.xhtml" node, the third is the "myFile_ppt_parts/conv.css" node, and the fourth is the "myFile_ppt_parts/toc.xml" node.


Example:
let $results := xdmp:powerpoint-convert( 
                         xdmp:document-get("myFile.ppt"),
                         "myFile.ppt" ),
    $manifest := $results[1]
return 
$results[2 to last()]

=> all of the converted nodes

xdmp:tidy(
$doc as xs:string,
[$options as node()]
)  as  node()+
Summary:

Run tidy on the specified html document to convert the document to well-formed and clean XHTML. Returns two nodes: the first is a status node indicating any errors or warning from tidy, and the second is an html node containing the cleaned xhtml.

Parameters:
$doc : A string representing the HTML document you want to tidy.
$options (optional): The options nodes for this operation. The node for the tidy options must be in the xdmp:tidy namespace. The default value is (). The options are based on the open source HTML Tidy configuration options, available at http://tidy.sourceforge.net/docs/quickref.html. Most of the tidy options are available through xdmp:tidy with the following exceptions:
  • The character encoding for the output is always UTF-8.
  • The filesystem options which allow you to specify where to save output are not supported (although there are many ways to achieve this through functions such as xdmp:save).
  • The output is always XHTML.
  • Entities except for the built-in HTML entities will always be output in numeric form.

Options include:

HTML, XHTML, and XML Options

<add-xml-decl>

Default Value: no

Description: This option specifies if Tidy should add the XML declaration when outputting XML or XHTML. Note that if the input already includes an <?xml ... ?> declaration then this option will be ignored.

<add-xml-space>

Default Value: no

Description: This option specifies if Tidy should add xml:space="preserve" to elements such as <PRE>, <STYLE> and <SCRIPT> when generating XML. This is needed if the whitespace in such elements is to be parsed appropriately without having access to the DTD.

<alt-text>

Default Value: n/a


Description: This option specifies the default "alt=" text Tidy uses for <IMG> attributes. This feature is dangerous as it suppresses further accessibility warnings. You are responsible for making your documents accessible to people who can not see the images!

<assume-xml-procins>

Default Value: no

Description: This option specifies if Tidy should change the parsing of processing instructions to require ?> as the terminator rather than >. This option is automatically set if the input is in XML.

<bare>

Default Value: no

Description: This option specifies if Tidy should strip Microsoft specific HTML from Word 2000 documents, and output spaces rather than non-breaking spaces where they exist in the input.

<clean>

Default Value: no

Description: This option specifies if Tidy should strip out surplus presentational tags and attributes replacing them by style rules and structural markup as appropriate. It works well on the HTML saved by Microsoft Office products.

<css-prefix>

Default Value: n/a

Description: This option specifies the prefix that Tidy uses for style rules. By default, "c" will be used.

<doctype>

Default Value: auto

Possible Values: auto, omit, strict, loose, transitional, or user-specified fpi string

Description: This option specifies the DOCTYPE declaration generated by Tidy. If set to omit the output won't contain a DOCTYPE declaration. If set to auto (the default) Tidy will use an educated guess based upon the contents of the document. If set to strict, Tidy will set the DOCTYPE to the strict DTD. If set to loose, the DOCTYPE is set to the loose (transitional) DTD. Alternatively, you can supply a string for the formal public identifier (FPI). For example:

doctype: "-//ACME//DTD HTML 3.14159//EN"

If you specify the FPI for an XHTML document, Tidy will set the system identifier to the empty string. Tidy leaves the DOCTYPE for generic XML documents unchanged. Specifying a doctype of omit implies that the numeric-entities option is set to yes.

<drop-empty-paras>

Default Value: yes

Description: This option specifies if Tidy should discard empty paragraphs. If set to no, empty paragraphs are replaced by a pair of <BR> elements as HTML4 precludes empty paragraphs.

<drop-font-tags>

Default Value: no

Description: This option specifies if Tidy should discard <FONT> and <CENTER> tags without creating the corresponding style rules. This option can be set independently of the clean option.

<drop-proprietary-attributes>

Default Value: no

Description: This option specifies if Tidy should strip out proprietary attributes, such as MS data binding attributes.

<enclose-block-text>

Default Value: no

Description: This option specifies if Tidy should insert a <P> element to enclose any text it finds in any element that allows mixed content for HTML transitional but not HTML strict.

<enclose-text>

Default Value: no

Description: This option specifies if Tidy should enclose any text it finds in the body element within a <P> element. This is useful when you want to take existing HTML and use it with a style sheet.

<escape-cdata>

Default Value: no

Description: This option specifies if Tidy should convert <![CDATA[]]> sections to normal text.

<fix-backslash>

Default Value: yes

Description: This option specifies if Tidy should replace backslash characters "\" in URLs by forward slashes "/".

<fix-bad-comments>

Default Value: yes

Description: This option specifies if Tidy should replace unexpected hyphens with "=" characters when it comes across adjacent hyphens. The default is yes. This option is provided for users of Cold Fusion which uses the comment syntax: <!--- --->

<fix-uri>

Default Value: yes

Description: This option specifies if Tidy should check attribute values that carry URIs for illegal characters and if such are found, escape them as HTML 4 recommends.

<hide-comments>

Default Value: no

Description: This option specifies if Tidy should print out comments.

<hide-endtags>

Default Value: no

Description: This option specifies if Tidy should omit optional end-tags when generating the pretty printed markup. This option is ignored if you are outputting to XML.

<indent-cdata>

Default Value: no

Description: This option specifies if Tidy should indent <![CDATA[]]> sections.

<input-xml>

Default Value: no

Description: This option specifies if Tidy should use the XML parser rather than the error correcting HTML parser.

<join-classes>

Default Value: no

Description: This option specifies if Tidy should combine class names to generate a single new class name, if multiple class assignments are detected on an element.

<join-styles>

Default Value: yes

Description: This option specifies if Tidy should combine styles to generate a single new style, if multiple style values are detected on an element.

<literal-attributes>

Default Value: no

Description: This option specifies if Tidy should ensure that whitespace characters within attribute values are passed through unchanged.

<logical-emphasis>

Default Value: no

Description: This option specifies if Tidy should replace any occurrence of <I> by <EM> and any occurrence of <B> by <STRONG>. In both cases, the attributes are preserved unchanged. This option can be set independently of the clean and drop-font-tags options.

<lower-literals>

Default Value: yes

Description: This option specifies if Tidy should convert the value of an attribute that takes a list of predefined values to lower case. This is required for XHTML documents.

<merge-divs>

Default Value: yes

Description: Can be used to modify behavior of setting the clean option to yes. This option specifies if Tidy should merge nested <div> such as <div><div>...</div></div>.

<ncr>

Default Value: yes

Description: This option specifies if Tidy should allow numeric character references.

<new-blocklevel-tags>

Default Value: none

Description: This option specifies new block-level tags. This option takes a space or comma separated list of tag names. Unless you declare new tags, Tidy will refuse to generate a tidied file if the input includes previously unknown tags. Note you can't change the content model for elements such as <TABLE>, <UL>, <OL> and <DL>.

<new-empty-tags>

Default Value: none

Description: This option specifies new empty inline tags. This option takes a space or comma separated list of tag names. Unless you declare new tags, Tidy will refuse to generate a tidied file if the input includes previously unknown tags. Remember to also declare empty tags as either inline or blocklevel.

<new-inline-tags>

Default Value: none

Description: This option specifies new non-empty inline tags. This option takes a space or comma separated list of tag names. Unless you declare new tags, Tidy will refuse to generate a tidied file if the input includes previously unknown tags.

<new-pre-tags>

Default Value: none

Description: This option specifies new tags that are to be processed in exactly the same way as HTML's <PRE> element. This option takes a space or comma separated list of tag names. Unless you declare new tags, Tidy will refuse to generate a tidied file if the input includes previously unknown tags. Note you can not as yet add new CDATA elements (similar to <SCRIPT>).

<numeric-entities>

Default Value: no

Description: This option specifies if Tidy should output entities other than the built-in HTML entities (&amp;, &lt;, &gt; and &quot;) in the numeric rather than the named entity form.

<output-html>

Default Value: no

Description: This option specifies if Tidy should generate pretty printed output, writing it as HTML.

<output-xhtml>

Default Value: yes

Description: This option specifies if Tidy should generate pretty printed output, writing it as extensible HTML. This option causes Tidy to set the DOCTYPE and default namespace as appropriate to XHTML. If a DOCTYPE or namespace is given they will be checked for consistency with the content of the document. In the case of an inconsistency, the corrected values will appear in the output. For XHTML, entities can be written as named or numeric entities according to the setting of the numeric-entities option. The original case of tags and attributes will be preserved, regardless of other options.

<output-xml>

Default Value: yes

Description: This option specifies if Tidy should pretty print output, writing it as well-formed XML. Any entities not defined in XML 1.0 will be written as numeric entities to allow them to be parsed by a XML parser. The original case of tags and attributes will be preserved, regardless of other options.

<quote-ampersand>

Default Value: yes

Description: This option specifies if Tidy should output unadorned & characters as &#38;.

<quote-marks>

Default Value: no

Description: This option specifies if Tidy should output " characters as &quot; as is preferred by some editing environments. The apostrophe character ' is written out as &#39; since many web browsers don't yet support &apos;.

<quote-nbsp>

Default Value: yes

Description: This option specifies if Tidy should output non-breaking space characters as entities, rather than as the Unicode character value 160 (decimal).

<repeated-attributes>

Default Value: keep-last

Possible Values: keep-first, keep-last

Description: This option specifies if Tidy should keep the first or last attribute, if an attribute is repeated (for example, if a tag has two align attributes).

<replace-color>

Default Value: no

Description: This option specifies if Tidy should replace numeric values in color attributes by HTML/XHTML color names where defined, e.g. replace "#ffffff" with "white".

<show-body-only>

Default Value: no

Description: This option specifies if Tidy should print only the contents of the body tag as an HTML fragment. Useful for incorporating existing whole pages as a portion of another page.

<uppercase-attributes>

Default Value: no

Description: This option specifies if Tidy should output attribute names in upper case. The default is no, which results in lower case attribute names, except for XML input, where the original case is preserved.

<uppercase-tags>

Default Value: no

Description: This option specifies if Tidy should output tag names in upper case. The default is no, which results in lower case tag names, except for XML input, where the original case is preserved.

<word-2000>

Default Value: no

Description: This option specifies if Tidy should go to great pains to strip out all the surplus stuff Microsoft Word 2000 inserts when you save Word documents as "Web pages". Doesn't handle embedded images or VML.

Diagnostic Options

<accessibility-check>

Default Value: 0

Possible Values: 0, 1, 2, or 3

Description: This option specifies what level of accessibility checking, if any, that Tidy should do. Level 0 is equivalent to Tidy Classic's accessibility checking. For more information on Tidy's accessibility checking, see the web site for the Adaptive Technology Resource Centre at the University of Toronto.

<show-errors>

Default Value: 6

Possible Values: Any integer.

Description: This option specifies the number Tidy uses to determine if further errors should be shown. If set to 0, then no errors are shown.

<show-warnings>

Default Value: yes

Description: This option specifies if Tidy should suppress warnings. This is useful when a few errors are hidden between many warning messages.

Pretty Print Options

<break-before-br>

Default Value: no

Description: This option specifies if Tidy should output a line break before each <BR> element.

<indent>

Default Value: no

Possible Values: no, yes, auto

Description: This option specifies if Tidy should indent block-level tags. If set to auto, this option causes Tidy to decide whether or not to indent the content of tags such as TITLE, H1-H6, LI, TD, or P depending on whether or not the content includes a block-level element. You are advised to avoid setting indent to yes as this can expose layout bugs in some browsers.

<indent-attributes>

Default Value: no

Description: This option specifies if Tidy should begin each attribute on a new line.

<indent-spaces>

Default Value: 2

Possible Values: Any integer.

Description: This option specifies the number of spaces Tidy uses to indent content, when indentation is enabled.

<markup>

Default Value: yes

Description: This option specifies if Tidy should generate a pretty printed version of the markup. Note that Tidy won't generate a pretty printed version if it finds significant errors (see force-output).

<punctuation-wrap>

Default Value: no

Description: This option specifies if Tidy should line wrap after some Unicode or Chinese punctuation characters.

<split>

Default Value: no

Description: This option specifies if Tidy should create a sequence of slides from the input, splitting the markup prior to each successive <H2>. The slides are written to "slide001.html", "slide002.html" etc.

<tab-size>

Default Value: 8

Possible Values: Any integer.

Description: This option specifies the number of columns that Tidy uses between successive tab stops. It is used to map tabs to spaces when reading the input. Tidy never outputs tabs.

<vertical-space>

Default Value: no

Description: This option specifies if Tidy should add some empty lines for readability.

<wrap>

Default Value: 68

Possible Values: Any integer.

Description: This option specifies the right margin Tidy uses for line wrapping. Tidy tries to wrap lines so that they do not exceed this length. Set wrap to zero if you want to disable line wrapping.

<wrap-asp>

Default Value: yes

Description: This option specifies if Tidy should line wrap text contained within ASP pseudo elements, which look as follows:
<% ... %>.

<wrap-attributes>

Default Value: no

Description: This option specifies if Tidy should line wrap attribute values, for easier editing. This option can be set independently of wrap-script-literals.

<wrap-jste>

Default Value: yes

Description: This option specifies if Tidy should line wrap text contained within JSTE pseudo elements, which look as follows:
<# ... #>.

<wrap-php>

Default Value: yes

Description: This option specifies if Tidy should line wrap text contained within PHP pseudo elements, which look as follows:
<?php ... ?>.

<wrap-script-literals>

Default Value: no

Description: This option specifies if Tidy should line wrap string literals that appear in script attributes. Tidy wraps long script string literals by inserting a backslash character before the line break.

<wrap-sections>

Default Value: yes

Description: This option specifies if Tidy should line wrap text contained within <![ ... ]> section tags.

Miscellaneous Options

<force-output>

Default Value: no

Description: This option specifies if Tidy should produce output even if errors are encountered. Use this option with care - if Tidy reports an error, this means Tidy was not able to, or is not sure how to, fix the error, so the resulting output may not be what you expect.

<keep-time>

Default Value: no

Description: This option specifies if Tidy should keep the original modification time of files that Tidy modifies in place. The default is no. Setting the option to yes allows you to tidy files without causing these files to be uploaded to a web server when using a tool such as SiteCopy. Note this feature is not supported on some platforms.

<quiet>

Default Value: no

Description: This option specifies if Tidy should output the summary of the numbers of errors and warnings, or the welcome or informational messages.

<tidy-mark>

Default Value: yes

Description: This option specifies if Tidy should add a meta element to the document head to indicate that the document has been tidied. Tidy won't add a meta element if one is already present.

Example:
let $html := "
<htm>
 <h1>This is a heading 1
 <p>This is paragraph tag
"
return
xdmp:tidy($html, <options xmlns="xdmp:tidy">
                 </options>)

=> a tidy-status node with any errors and warnings and 
   an html node containing the clean and well-formed XHTML.
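
The options above are passed as child elements of the options node. The following sketch, using a made-up input string, sets two of the documented options and separates the two returned nodes:

```xquery
(: Sketch only: the input string is a made-up fragment of untidy HTML. :)
let $html := "<p>Unclosed paragraph <b>bold text",
    $results := xdmp:tidy($html,
                  <options xmlns="xdmp:tidy">
                    <clean>yes</clean>
                    <tidy-mark>no</tidy-mark>
                  </options>),
    (: first node: status with any errors and warnings :)
    $status := $results[1],
    (: second node: the cleaned XHTML :)
    $xhtml := $results[2]
return
  $xhtml
```

Returning $status instead of $xhtml is a simple way to review what tidy reported about the input.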


xdmp:word-convert(
$doc as node(),
$filename as xs:string,
[$options as node()]
)  as  node()*
Summary:

Converts a Microsoft Word document to XHTML. Returns several nodes, including a parts node, the converted document xml node, and any other document parts (for example, css files and images). The first node is the parts node, which contains a manifest of all of the parts generated as a result of the conversion.

Parameters:
$doc : Microsoft Word document to convert to HTML, as binary node().
$filename : The root for the name of the converted files and directories. If the specified filename includes an extension, then the extension is appended to the root with an underscore. The directory for other parts of the conversion (images, for example) has the string "_parts" appended to the root. For example, if you specify a filename of "myFile.doc", the generated names will be "myFile_doc.xhtml" for the xml node and "myFile_doc_parts" for the directory containing any other parts generated by the conversion (images, css files, and so on).
$options (optional): Options element for this conversion. The options element must be in the xdmp:word-convert namespace. The default value is (). In addition to the options shown below, you can specify xdmp:tidy options by entering the tidy option elements in the xdmp:tidy namespace.

Options include:

<tidy>

Specify true to run tidy on the document and false not to run tidy. If you run tidy, you can also specify any xdmp:tidy options. Any tidy option elements must be in the xdmp:tidy namespace.

<compact>

Specify true to produce "compact" HTML, that is, without style information. The default is false.

Sample Options Node:

The following sample options node runs tidy on the generated html and enables the tidy "clean" option for the conversion:
<options xmlns="xdmp:word-convert"
         xmlns:tidy="xdmp:tidy">
  <tidy>true</tidy>
  <tidy:clean>yes</tidy:clean>
</options>

Usage Notes:

The convert functions return several nodes. The first node is a manifest containing the various parts of the conversion. Typically there will be an xml part, a css part, and some image parts. Each part is returned as a separate node in the order shown in the manifest.

Therefore, given the following manifest:

<parts>
  <part>myFile_doc.xhtml</part>
  <part>myFile_doc_parts/conv.css</part>
  <part>myFile_doc_parts/toc.xml</part>
</parts>

the first node of the returned sequence is the manifest, the second is the "myFile_doc.xhtml" node, the third is the "myFile_doc_parts/conv.css" node, and the fourth is the "myFile_doc_parts/toc.xml" node.


Example:
let $results := xdmp:word-convert( 
                         xdmp:document-get("myFile.doc"),
                         "myFile.doc" ),
    $manifest := $results[1]
return 
$results[2 to last()]

=> all of the converted nodes
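
Because the parts are returned in manifest order, each part name can be paired with its node by position. The following is a minimal sketch, not a definitive implementation: the "/converted/" URI prefix is hypothetical, and the wildcard step *:part is used because the namespace of the returned manifest is not stated in this section.

```xquery
(: Sketch only: convert with tidy enabled, then store each part under
   a hypothetical "/converted/" URI prefix. :)
let $results := xdmp:word-convert(
                  xdmp:document-get("myFile.doc"),
                  "myFile.doc",
                  <options xmlns="xdmp:word-convert"
                           xmlns:tidy="xdmp:tidy">
                    <tidy>true</tidy>
                    <tidy:clean>yes</tidy:clean>
                  </options>)
let $manifest := $results[1]
(: part number $i in the manifest corresponds to node $i + 1,
   because the manifest itself is the first node :)
for $part at $i in $manifest//*:part
return
  xdmp:document-insert(
    fn:concat("/converted/", fn:normalize-space($part)),
    $results[$i + 1])
```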

xdmp:zip-create(
$manifest as node(),
$nodes as node()+
)  as  binary()
Summary:

Create a zip file from a list of nodes.

Parameters:
$manifest : The zip manifest, which must be in the xdmp:zip namespace and conform to the zip.xsd schema, located in the MarkLogic_dir/Schemas directory. The manifest has the following basic form:
    <parts xmlns="xdmp:zip">
      <part>path1</part>
      <part>path2</part>
      ...more parts
    </parts>      
The uncompressed-size, compressed-size, and encrypted attributes in the manifest are ignored by xdmp:zip-create. Any other attribute causes an error.
$nodes : The nodes that you want to zip up. The nodes correspond to part elements in the manifest, where the first node corresponds to the first part element specified, the second node to the second part element, and so on. Specifying a different number of <part> elements than nodes will result in an error.

Usage Notes:

While you can create a zip file of encrypted content, xdmp:zip-create does not have the capability to encrypt the content to be zipped.

Example:
let $zip := xdmp:zip-create(
               <parts xmlns="xdmp:zip">
                 <part>/mydoc.xml</part>
                 <part>/mypicture.jpg</part>
               </parts>,
               (doc("/mydoc.xml"), doc("/mypicture.jpg")))
return
xdmp:save("c:/tmp/myzip.zip", $zip)

=> Creates a zip file that includes the documents "/mydoc.xml"
   and "/mypicture.jpg", then saves that to the filesystem.
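
The positional pairing of part elements and nodes can also be seen with constructed nodes. The following is a minimal sketch, assuming that in-memory element and text nodes are accepted as input; the part names and content are made up for illustration:

```xquery
(: Sketch only: two part elements, two nodes, matched by position. :)
let $zip := xdmp:zip-create(
              <parts xmlns="xdmp:zip">
                <part>greeting.xml</part>
                <part>notes.txt</part>
              </parts>,
              (<greeting>hello</greeting>, text { "some notes" }))
return
(: read the manifest back to check what was stored :)
xdmp:zip-manifest($zip)
```

The returned manifest should list the two part names in the order given. Supplying a third node without a third part element would result in an error, as described above.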


xdmp:zip-get(
$zipfile as binary(),
$name as xs:string,
[$options as node()]
)  as  node()+
Summary:

Get a named file from a zip document. Unzips and returns the file in memory as a document node (for XML formats), a text node (for text formats), or a binary node (for binary formats). The format is determined either by the mimetype from the file name or by the format option.

Parameters:
$zipfile : The zip file.
$name : The path of the file within the zip file, as shown in the zip manifest.
$options (optional): The options node for getting this file from the zip file. The default value is (). The node for the xdmp:zip-get options must be in the xdmp:zip-get namespace.

The xdmp:zip-get options include:

<default-namespace>

The namespace to use if there is no namespace at the root node of the document. The default value is "".

<repair>

A value of full specifies that malformed XML content be repaired. A value of none specifies that malformed XML content is rejected. This option has no effect on binary or text documents.

<format>

A value of text specifies to get the document as a text document, regardless of the URI specified. A value of binary specifies to get the document as a binary document, regardless of the URI specified. A value of xml specifies to get the document as an XML document, regardless of the URI specified.

<default-language>

The language to specify in an xml:lang attribute on the root element node if the root element node does not already have an xml:lang attribute. If default-language is not specified, then nothing is added to the root element node.

<encoding>

Specifies the encoding to use when reading the document into MarkLogic Server. Supported values include UTF-8 and ISO-8859-1. All encodings will be translated into UTF-8 from the specified encoding. The string specified for the encoding option will be matched to an encoding name according to the Unicode Charset Alias Matching rules (http://www.unicode.org/reports/tr22/#Charset_Alias_Matching). If no encoding option is specified, the encoding defaults to the encoding specified in the http header (if using with one of the http functions, for example, xdmp:http-get), otherwise it defaults to UTF-8.

Usage Notes:

The name of the document you are extracting will determine the default format in which the document is extracted, based on the mimetype settings. For example, if you are extracting a document with the name myDocument.xmlfile, it will by default extract that document as a text document (because it is an unknown mimetype, and unknown mimetypes default to text format). If you know this is an XML document, then specify a format of xml in the options node (see the third example below).


Example:
xdmp:zip-get(doc("/zip/tmp.zip"), "files/myxmlfile.xml")

=> the "files/myxmlfile.xml" node from the "/zip/tmp.zip" zip file

Example:
(: unzip all of the files in the zip archive :)
declare namespace zip="xdmp:zip"

for $x in xdmp:zip-manifest(doc("/zip/tmp.zip"))//zip:part/text()
return
xdmp:zip-get(doc("/zip/tmp.zip"), $x)

=> a sequence of all of the unzipped nodes in the "/zip/tmp.zip" zip file

Example:
xdmp:zip-get(doc("/zip/tmp.zip"), "myDocument.xmlfile",
    <options xmlns="xdmp:zip-get">
      <format>xml</format>
    </options>)

=> the "myDocument.xmlfile" node from the "/zip/tmp.zip"
   zip file, as an XML document
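
Several of the options described above can be combined in one options node. The following is a minimal sketch, assuming a hypothetical part named "files/legacy.xml" that is stored in ISO-8859-1 and may not be well-formed:

```xquery
(: Sketch only: "files/legacy.xml" is a hypothetical part name. :)
xdmp:zip-get(doc("/zip/tmp.zip"), "files/legacy.xml",
    <options xmlns="xdmp:zip-get">
      <format>xml</format>
      <repair>full</repair>
      <encoding>ISO-8859-1</encoding>
      <default-language>en</default-language>
    </options>)
```

With these settings the part should be read as XML, repaired if it is malformed, translated to UTF-8, and given an xml:lang attribute of "en" on the root element if it does not already have one.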


xdmp:zip-manifest(
$zipfile as binary()
)  as  node()
Summary:

Return a manifest for this zip file. The manifest contains information about what is in the zip file. The form of the manifest is:
  <parts xmlns="xdmp:zip">
    <part uncompressed-size="[size]" compressed-size="[size]" 
          encrypted="[true/false]">path1</part>
    <part uncompressed-size="[size]" compressed-size="[size]" 
          encrypted="[true/false]">path2</part>
    ...more parts
  </parts>      
Each <part> is a file within the zip. The attributes specify the uncompressed size for the file, the compressed size for that file, and whether or not the file is encrypted. Note that MarkLogic cannot extract encrypted files; attempting to do so will cause an error.

Parameters:
$zipfile : The zip document binary node.

Example:
xdmp:zip-manifest($myzip)
=> 
<parts xmlns="xdmp:zip">
  <part uncompressed-size="89246" compressed-size="4538" 
        encrypted="false">docProps/app.xml</part>
  <part uncompressed-size="2896" compressed-size="634" 
        encrypted="false">word/fontTable.xml</part>
  <part uncompressed-size="139914" compressed-size="12418" 
        encrypted="true">word/styles.xml</part>
</parts>
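
Because extracting an encrypted file causes an error, the encrypted attribute can be checked before calling xdmp:zip-get. The following is a minimal sketch, not a definitive implementation: as a precaution it treats both "true" and "yes" as encrypted, and it uses the prolog style of the examples in this document (newer XQuery dialects require a semicolon after the namespace declaration).

```xquery
(: Sketch only: extract every part that is not marked as encrypted. :)
declare namespace zip="xdmp:zip"

let $zipfile := doc("/zip/tmp.zip")
for $part in xdmp:zip-manifest($zipfile)//zip:part
(: skip parts flagged as encrypted, whichever spelling is used :)
where fn:not($part/@encrypted = ("true", "yes"))
return
(: normalize-space guards against whitespace around the path :)
xdmp:zip-get($zipfile, fn:normalize-space($part))
```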