[MarkLogic Dev General] loading XML documents with DTDs

Alan Darnell alan.darnell at utoronto.ca
Sat Jun 23 10:41:59 PDT 2007


Thanks for these leads.  These tools do a great job of pulling data
out of CDATA sections, and then with xdmp:unquote I can convert
strings such as &lt;i&gt; into <i>.

That gets me a long way. But if the CDATA sections in the source XML
contain entities (e.g. &z.rtls;, which appears in some Elsevier
documents we load) that the associated DTD maps to a specific
code point, those entities remain untranslated in the loaded
document (e.g. &z.rtls; appears as &amp;z.rtls; rather than
&#642;).

I'm guessing that my options are either to pre-process the source
documents before loading them into MarkLogic, or to write a function
like xdmp:unquote that can reference a list of external entities and
map them to the correct numeric character references.  I know these
entities could be converted on display, but the goal in all of this
is to let users search for terms that contain these entities.
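
To make the first option concrete, here is a minimal sketch in Python
of a pre-processing pass over the raw source text, run before loading.
It is an illustration under assumptions, not a tested pipeline: the
table name ENTITY_CODEPOINTS and the function resolve_entities are
made up for this example, and the only mapping taken from the case
above is z.rtls to 642. A real table would have to be built from the
entity declarations in the publisher's DTD.

```python
import re

# Hypothetical mapping from DTD entity names to Unicode code points.
# Only z.rtls -> 642 comes from the example above; a real table would be
# generated from the entity declarations in the publisher's DTD.
ENTITY_CODEPOINTS = {
    "z.rtls": 642,
}

# XML's five predefined entities must be left alone.
PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches a named entity reference such as &z.rtls;
# Numeric references like &#642; do not match because of the leading '#'.
ENTITY_RE = re.compile(r"&([A-Za-z_][A-Za-z0-9._-]*);")


def resolve_entities(text, table=ENTITY_CODEPOINTS):
    """Replace known named entities with numeric character references."""
    def substitute(match):
        name = match.group(1)
        if name in PREDEFINED or name not in table:
            # Leave predefined and unknown entities untouched.
            return match.group(0)
        return "&#%d;" % table[name]
    return ENTITY_RE.sub(substitute, text)


if __name__ == "__main__":
    print(resolve_entities("x &z.rtls; y &amp; z"))
```

Emitting numeric character references rather than raw characters keeps
the output independent of the file's encoding, and a numeric reference
needs no DTD to be resolved when the string is later parsed.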

Is there some other function I may be missing to allow me to do this  
translation inside of MarkLogic?  Would using schemas for the target  
database be an option for handling entity translations?

Alan

On 19-Jun-07, at 1:08 PM, Michael Blakeley wrote:

> Alan,
>
> I'd recommend starting with xdmp:document-load():
> http://developer.marklogic.com/pubs/3.2/apidocs/UpdateBuiltins.html#document-load
>
> You might also be interested in
> http://developer.marklogic.com/howto/tutorials/2006-06-recordloader.xqy
>
> -- Mike
>
> Alan Darnell wrote:
>> I have a number of documents (sample below) in XML format but not  
>> UTF-8 encoding and with an externally referenced DTD and rendering  
>> stylesheet.  What's the best way to get these documents into  
>> MarkLogic so that:
>> - the encoding is changed to UTF-8
>> - any entities in the DTD are resolved to UTF-8 encoded characters
>> - any CDATA sections are removed with the content left intact,  
>> including markup embedded in the CDATA content
>> Do I need to pre-process the files before loading, or can MarkLogic
>> handle these kinds of conversion as part of the load functions?
>> Also, does anyone know of any good strategies for converting math  
>> in TeX format to MathML?
>> Thanks,
>> Alan
>> Alan Darnell
>> University of Toronto
>> <?xml version="1.0" encoding="iso-8859-1"?><?xml-stylesheet  
>> type="text/xsl" href="file://batchgate1\StyleS\bpg4
>> 0.xsl"?>
>> <!DOCTYPE content PUBLIC "-//BLACKWELL PUBLISHING GROUP//DTD 4.0//EN"
>> "\\Batchgate1\bpgdtd\4-0\bpg4-0.dtd">
>> <content dtdver="4.0" docfmt="xml">
>


