[MarkLogic Dev General] cleaning entities
Jason Hunter
jhunter at marklogic.com
Sat Jan 20 10:22:27 PST 2007
Hi Jim,
You might have had an easier time running the text w/ character entities
through xdmp:unquote rather than xdmp:tidy.
-jh-
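
A minimal sketch of that suggestion, not a tested recipe. It assumes the
field value arrives as a plain string containing an entity reference, that
the value holds no stray "<" or "&" characters that would break parsing, and
that the server's parser resolves the named entity (worth checking on your
version, possibly with a repair option). The sample value and the wrapper
element name "x" are hypothetical.

```xquery
(: Hypothetical field value; "&amp;" in the literal yields the string M&uuml;ller :)
let $raw := "M&amp;uuml;ller"
(: Wrap the text in a throwaway element so it parses as well-formed XML :)
let $xml := concat("<x>", $raw, "</x>")
(: xdmp:unquote parses the string; pull the text back out of the wrapper :)
return xdmp:unquote($xml)/x/text()
```

The wrapper element is only there because xdmp:unquote expects markup, not
bare text. Applied per text() node, the same call could stand in for the
xdmp:tidy step in the function quoted below.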
James A. Robinson wrote:
> So many years ago, when we were an SGML-only shop, somebody somewhere
> decided it would be a Great Idea to stick SGML entities, e.g., &uml;,
> into the "ascii" fields of our databases. For example, an author table
> might have a field fname_ascii to indicate a first name, and when one
> queries that one finds a mixture of what are indeed ASCII characters --
> which happen to require running through an SGML/XML entity resolver to
> be usable!
>
> Of course, the idea at the time was to write to browsers, and not worry
> about the contents, and so nobody bothered to construct a mapping of
> the entities they used, they just used HTML entities. Bah.
>
> So now it's our turn to clean up the mess. I was using mlsql, which was
> very neat. Naturally it treats the various fields of the DB as text
> and slurps up '&uml;' as '&amp;uml;'.
>
> So I wanted to see if xdmp:tidy could deal with it. My first attempts
> at processing the entire sql:response were not productive: if I pass in
> an element() it translates &uml; into an NCR but strips all elements;
> if I pass in the result of xdmp:quote($sql_result) it leaves the &amp;uml;
> declarations unmolested.
>
> So I ended up writing this, but I was wondering if anyone has done
> something similar in perhaps a more efficient manner?
>
> import module namespace sql = "http://xqdev.com/sql"
> at "/modules/mlsql/sql.xqy"
>
> declare namespace html="http://www.w3.org/1999/xhtml"
>
> (:~
> : Run the text() nodes of an element through xdmp:tidy (useful for
> : translating escaped HTML entities into NCRs).
> : @param $input element to clean.
> : @return $input with any text() nodes passed via xdmp:tidy.
> :)
> define function tidyText($input as element())
>   as element()
> {
>   element {node-name($input)} {
>     for $node in $input/node()
>     return
>       if ($node instance of element())
>       then tidyText($node)
>       else if ($node instance of text())
>       then normalize-space(xdmp:tidy($node)/html:html/html:body/text())
>       else $node
>   }
> }
>
> So:
>
> tidyText(sql:execute($query, $mlsqlserver, ()))
>
> will return the sql:result with its text passed via tidy.
>
> Jim
>
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> James A. Robinson jim.robinson at stanford.edu
> Stanford University HighWire Press http://highwire.stanford.edu/
> +1 650 7237294 (Work) +1 650 7259335 (Fax)