[MarkLogic Dev General] cleaning entities

Jason Hunter jhunter at marklogic.com
Sat Jan 20 10:22:27 PST 2007


Hi Jim,

You might have had an easier time running the text w/ character entities 
through xdmp:unquote rather than xdmp:tidy.
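
Something along these lines, as an untested sketch. The name unquoteText
is made up to mirror your tidyText, and I'm assuming two things: that
xdmp:unquote resolves the named entities in your data, and that each
text node parses as XML once wrapped in a dummy element (a stray < or a
bare & in a field would make the parse fail).

```xquery
(:~
 : Sketch only: unquoteText is a hypothetical helper, not part of mlsql.
 : Re-parses each text() node as XML so that escaped entity references
 : are resolved by the parser instead of by xdmp:tidy.
 : @param  $input element to clean.
 : @return $input with every text() node passed through xdmp:unquote.
 :)
define function unquoteText($input as element())
as element()
{
  element {node-name($input)} {
    for $node in $input/node()
    return
      if ($node instance of element())
      then unquoteText($node)
      else if ($node instance of text())
      (: Wrap the text in a throwaway <x> element so it parses as a
         document, then keep only the children of that wrapper. :)
      then xdmp:unquote(concat("<x>", string($node), "</x>"))/x/node()
      else $node
  }
}
```

It would be called the same way as yours:
unquoteText(sql:execute($query, $mlsqlserver, ()))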

-jh-

James A. Robinson wrote:
> So many years ago, when we were an SGML-only shop, somebody somewhere
> decided it would be a Great Idea to stick SGML entities, e.g., &uml;,
> into the "ascii" fields of our databases. For example, an author table
> might have a field fname_ascii to indicate a first name, and when one
> queries that one finds a mixture of what are indeed ASCII characters --
> which happen to require running through an SGML/XML entity resolver to
> be usable!
> 
> Of course, the idea at the time was to write to browsers, and not worry
> about the contents, and so nobody bothered to construct a mapping of
> the entities they used, they just used HTML entities. Bah.
> 
> So now it's our turn to clean up the mess. I was using mlsql, which was
> very neat.  Naturally it treats the various fields of the DB as text
> and slurps up '&uml;' as '&amp;uml;'.
> 
> So I wanted to see if xdmp:tidy could deal with it.  My first attempts
> at processing the entire sql:response were not productive: if I pass in
> an element() it translates &uml; into an NCR but strips all elements,
> and if I pass in the result of xdmp:quote($sql_result) it leaves the
> &amp;uml; declarations unmolested.
> 
> So I ended up writing this, but I was wondering if anyone has done
> something similar in perhaps a more efficient manner?
> 
> import module namespace sql = "http://xqdev.com/sql"
>   at "/modules/mlsql/sql.xqy"
> 
> declare namespace html="http://www.w3.org/1999/xhtml"
> 
> (:~
>  : Run the text() nodes of an element through xdmp:tidy (useful for
>  : translating escaped HTML entities into NCRs).
>  : @param  $input element to clean.
>  : @return $input with any text() nodes passed via xdmp:tidy.
>  :)
> define function tidyText($input as element())
> as element()
> {
>   element {node-name($input)} {
>     for $node in $input/node()
>     return
>       if ($node instance of element())
>       then tidyText($node)
>       else if ($node instance of text())
>       then normalize-space(xdmp:tidy($node)/html:html/html:body/text())
>       else $node
>   }
> }
> 
> So:
> 
>   tidyText(sql:execute($query, $mlsqlserver, ()))
> 
> will return the sql:result with its text passed via tidy.
> 
> Jim
> 
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> James A. Robinson                       jim.robinson at stanford.edu
> Stanford University HighWire Press      http://highwire.stanford.edu/
> +1 650 7237294 (Work)                   +1 650 7259335 (Fax)


