[MarkLogic Dev General] loading XML documents with DTDs

Alan Darnell alan.darnell at utoronto.ca
Tue Jun 19 08:24:55 PDT 2007


I have a number of XML documents (sample below) that are not UTF-8
encoded and that reference an external DTD and a rendering
stylesheet.  What's the best way to get these documents into
MarkLogic so that:

- the encoding is changed to UTF-8
- any entities in the DTD are resolved to UTF-8 encoded characters
- any CDATA sections are removed with the content left intact,  
including markup embedded in the CDATA content

Do I need to pre-process the files before loading, or can MarkLogic
handle these kinds of conversions as part of its load functions?
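
For illustration, the pre-processing option could be sketched roughly as
below.  This is only a minimal Python sketch under stated assumptions, not
a MarkLogic feature: the function name to_utf8_xml and its source_encoding
parameter are hypothetical, and Python's HTML entity table stands in for
the entity declarations in the DTD, which is not available here.  It works
on the text of the file rather than parsing it, so any markup unwrapped
from a CDATA section must itself be well-formed for the result to load.

```python
import re
from html.entities import name2codepoint

# The five entities predefined by XML must stay escaped in the output.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")
CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)
ENCODING_RE = re.compile(r'(<\?xml\s[^>]*?encoding=)(["\'])[^"\']*\2')


def to_utf8_xml(raw: bytes, source_encoding: str = "iso-8859-1") -> bytes:
    """Transcode to UTF-8, unwrap CDATA sections, expand named entities."""
    text = raw.decode(source_encoding)

    # Make the XML declaration agree with the new encoding.
    text = ENCODING_RE.sub(r'\1"utf-8"', text, count=1)

    # Drop the CDATA wrappers but keep their content, markup included.
    text = CDATA_RE.sub(lambda m: m.group(1), text)

    def expand(match):
        name = match.group(1)
        # Assumption: entity names follow the HTML/ISO sets (nbsp, ensp, ...).
        # Unknown names and the XML built-ins are left untouched.
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return chr(name2codepoint[name])

    text = ENTITY_RE.sub(expand, text)
    return text.encode("utf-8")
```

Entities that the DTD declares under other names would be left as they
are, so the table would need to be extended from the DTD itself.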

Also, does anyone know of any good strategies for converting math in  
TeX format to MathML?

Thanks,

Alan

Alan Darnell
University of Toronto



<?xml version="1.0" encoding="iso-8859-1"?>
<?xml-stylesheet type="text/xsl" href="file://batchgate1\StyleS\bpg40.xsl"?>
<!DOCTYPE content PUBLIC "-//BLACKWELL PUBLISHING GROUP//DTD 4.0//EN"  
"\\Batchgate1\bpgdtd\4-0\bpg4-0.dtd">
<content dtdver="4.0" docfmt="xml">
         <publisherinfo>
                 <publisher>Blackwell Publishing Ltd</publisher>
                 <address format="inline">Oxford, UK</address>
         </publisherinfo>
         <contentinfo type="journal" language="en">
                 <contentcode>JTH</contentcode>
                 <titlegroup>
                         <title type="journal">Journal of Thrombosis  
and Haemostasis</title>
                 </titlegroup>
                 <issn>1538-7933</issn>
                 <copyright>2006 International Society on Thrombosis  
and Haemostasis</copyright>
         </contentinfo>
         <document type="primary_article" sequence="1"  
referencetype="vancouver">
                 <header>
                         <documentinfo language="en">
                                 <idgroup>
                                         <documentid type="doi"  
id="10.1111/j.1538-7836.2006.02000.x" status="live" />
                                         <documentid type="bpl"  
id="2000" status="live" />
                                         <documentid type="version"  
id="fi" status="live" />
                                 </idgroup>
                                 <relatedgroup>
                                         <related  
relationship="child" type="object" />
                                         <related relationship="self"  
type="primary_article">
                                                 <file  
name="jth_2000.xml" type="xml" />
                                         </related>
                                         <related  
relationship="sibling" type="pages">
                                                 <file  
name="jth_2000.pdf" type="pdf" />
                                         </related>
                                 </relatedgroup>
                                 <date date="2006-08">August 2006</date>
                                 <pagedetails>
                                         <volume>4</volume>
                                         <issue sequence="15">8</issue>
                                         <page type="first">1747</page>
                                         <page type="last">1755</page>
                                 </pagedetails>
                                 <countgroup>
                                         <count type="figure_total"  
count="6" />
                                         <count type="table_total"  
count="3" />
                                         <count type="page_total"  
count="9" />
                                 </countgroup>
                                 <trackinghistory>
                                         <trackingdate type="created"  
date="2006-04-25" />
                                         <trackingdate  
type="markedup" date="0000" by="SPS" software="preediting tool"  
version="4.0" />
                                         <trackingdate  
type="paginated" date="0000" by="FSS_SPS" />
                                         <trackingdate  
type="received" date="0000" />
                                         <trackingdate type="revised"  
date="0000"/>
                                         <trackingdate  
type="accepted" date="0000" />
                                         <trackingdate  
type="Delivered as FI" date="20060825" />
                                 </trackinghistory>
                                 <tocheading level="1">ORIGINAL  
ARTICLES</tocheading>
                                 <tocheading  
level="2"><i>Coagulation</i>
                                 </tocheading>
                                 <runningheadgroup>
                                         <runninghead  
type="title"><i>Tissue factor antigen in plasma</i>
                                         </runninghead>
                                         <runninghead  
type="author"><i>B. Parhami-Seren</i> et&nbsp;al
                                         </runninghead>
                                 </runningheadgroup>
                         </documentinfo>
                         <history>
                                 <p>Received 23 February 2006,  
accepted 12 April 2006</p>
                         </history>
                         <footnotegroup>
                                 <correspondent id="c1">Behnaz  
Parhami-Seren, Department of Biochemistry, College of Medicine,  
University of Vermont, 208 South Park Drive, Cholchester, VT  
05446-0068, USA.<br />Tel.:+1&nbsp;802&nbsp;656&nbsp;3286; fax:  
+1&nbsp;802&nbsp;656&nbsp;2256; e-mail:
                                         <externallink  
type="email">behnaz.parhami-seren at uvm.edu</externallink>
                                 </correspondent>
                         </footnotegroup>
                         <titlegroup>
                                 <title type="surtitle">ORIGINAL  
ARTICLE</title>
                                 <title type="document">Immunologic
quantitation of tissue factors</title>
                         </titlegroup>
                         <namegroup type="author">
                                 <name type="author">
                                         <forenames>B.</forenames><x>  
</x>
                                         <surname>PARHAMI-SEREN</surname>
                                 </name><x>, </x>
                                 <name type="author">
                                         <forenames>S.</forenames><x>  
</x>
                                         <surname>BUTENAS</surname>
                                 </name><x>, </x>
                                 <name type="author">
                                         <forenames>J.</forenames><x>  
</x>
                                         <surname>KRUDYSZ-AMBLO</surname>
                                 </name><x> and </x>
                                 <name type="author">
                                         <forenames>K. G.</forenames><x> </x>
                                         <surname>MANN</surname>
                                 </name>
                                 <address format="inline">Department  
of Biochemistry, College of Medicine, University of Vermont,  
Burlington, VT, USA</address>
                         </namegroup>
                         <summary language="en">
                                 <heading implicit="yes" id="h1"  
level="5" format="inline">Summary.&ensp;</heading>
                                 <p>The large number of conflicting
reports on the presence and concentration of circulating tissue factor
(TF) in blood generates uncertainties regarding its relevance to
hemostasis and association with specific diseases. We believe that the
source of these controversies lies in part in the assays used for TF
quantitation. We have developed a highly sensitive and sp
...
                                 </p>
                         </summary>
                         <keywordgroup language="en" format="display">
                                 <heading implicit="yes" id="h2"  
level="5" format="inline">Keywords:&ensp;</heading>
                                 <keyword>fluorescence immunoassay</keyword><x>,</x>
                                 <keyword>placenta</keyword><x>, </x>
                                 <keyword>plasma</keyword><x>, </x>
                                 <keyword>tissue factor</keyword><x>.</x>
                         </keywordgroup>
                 </header>
         </document>
</content>



