Problem

Pull a base64-encoded string out of an XML or JSON document and store it as a separate binary document. When content is extracted from other sources, an image or other binary is sometimes embedded as base64 text in an XML element or JSON property.

Solution

Applies to MarkLogic versions 8+

xquery version "1.0-ml";

let $xml :=
  <root>
    <filename>HD-keyhole-300px.png</filename>
    <base64>iVBORw0KGgoAAAANSUhEUgAAASwAAACqCAMAAAAgPYI2AAAACXBIWXMAAAsTAAALEwEAmpwYAAACWFBMVEVhYWEAAAABAQECAgIDAwMEBAQFBQUGBgYHBwcICAgJCQkKCgoLCwsMDAwNDQ0ODg4PDw8QEBARERESEhITExMWFhYXFxcYGBgZGRkaGhobGxscHBwdHR0eHh4fHx8gICAhISEiIiIjIyMkJCQnJycoKCgpKSkqKiorKyssLCwtLS0uLi4vLy8wMDAxMTEyMjIzMzM0NDQ1NTU2NjY3Nzc4ODg5OTk6Ojo9PT0+Pj5BQUFDQ0NERERGRkZHR0dISEhJSUlKSkpLS0tMTExOTk5QUFBRUVFSUlJVVVVYWFhZWVlbW1teXl5gYGBiYmJjY2NkZGRlZWVmZmZnZ2dpaWlsbGxtbW1ubm5wcHBxcXFzc3N0dHR1dXV2dnZ3d3d5eXl6enp8fHx9fX1/f3+AgICBgYGCgoKDg4OEhISHh4eIiIiKioqLi4uMjIyNjY2Ojo6QkJCRkZGSkpKTk5OUlJSVlZWXl5eZmZmdnZ2enp6fn5+goKChoaGioqKkpKSlpaWmpqanp6eoqKipqamqqqqrq6usrKytra2vr6+wsLCxsbGysrKzs7O0tLS1tbW2tra3t7e4uLi7u7u8vLy+vr7CwsLDw8PExMTFxcXHx8fKysrLy8vMzMzNzc3Pz8/Q0NDR0dHS0tLV1dXW1tbY2NjZ2dnb29vc3Nze3t7f39/g4ODi4uLj4+Pk5OTl5eXm5ubn5+fo6Ojp6enq6urr6+vs7Ozt7e3u7u7v7+/w8PDx8fHy8vLz8/P19fX29vb39/f4+Pj5+fn6+vr7+/v8/Pz9/f3+/v7///8xnZ2dAAAAAXRSTlObJzuSqgAAAAFiS0dEx40FSlsAAAS0SURBVHja7d15W1VVFMdxFpdBFJBBzRLBnIdQiTLnVCrHzCFFMidSU8pKnFBxyEKNUsIpidLMAa9lThmYqAzmflt5PBzvwUfPXmf/4/ZZv+9L+Dz3Ps89+561dlwc4keIHbCABSxgAQtYCFjAAhawgAUsBKwXHCuSnTPwtUxgaZ3eXLnn9D3ldLl2w7RkYD1TqmjnDdWppsqiJGA9pbTiBvWUoh8kAuuJEkua1DNqmBMBlr8Jv6uAjucB63FJ5Sq42+8Dq6PePylt33QBltOwq4rR0VRgEQ25rljVZQFr8DXF7FSqdKzMqGJXJR3roArRCtlYJWGs1H+TJWPltYTCUtEUwVh7VchK5WIVPgiLdSdHLNYJFbpKqViF4a1US7ZQrE0GWKpYJlbSTROsMzKx3lFGjRaJVWGGtUwk1mkzrEMSsVLazbBuRQRiFSrDRgrEmm+KNUsg1nJTrEUCsdaZYq0WiPWVKdYmgVgVpli7BGJtM8UqF4i1xhSrVCDWh6ZYCwVivWeKNV0g1uumWEMFYiXfMbP6J17iqcNRM6xqkUc0pWZYH4vEGmuGNUokVsIfJla/kUgss+/hIqFYLxucld7NFIpF+8Nj7SSpWCPvh/5g5YrFovKwWJ+QXKzuV8NZXeoqGItmhcN6lyRj0ddhrD4j2Vhp5/hWtRHhWDTsX67Vhec+3Pr8JyxG/c2zauhPwKLBlzlW9T0JWA8bwJje+dGCOSc75g1H39ZZHbBi9teOSdZJmne8D9sxVm7J2O+eQKv2vgQs3xcxEGsvActXfODvh3nA6lTgE/VbwOpUm51/q1qJlW7rHzo2YvULxBoHLH8Flr4KYiXW24FYC4Dlb3Yg1kpg+SsOxPoSWP7WWvrKrZVYW4LPZ4Dlb1/wwR+w/AW/2BYFlr9TgVjNwPKnOYZPAZbvhKYtGKsPsGJla47g84EVa4gGayKwYo23dnjVQqyZGqzFwIq1RIO1Blix1muwtgAr1nYN1rfAilWjwToGrFi/WrzayDos3UtHV4DFftpRbcBiP+0olQEsr0FarP7A8tJPHr4BLK/p
WqwiYHkt1GLhWobHrdJiLQWW10YtVhmwvPT7qCuA5XXE5mUOtmHpl3CeBJbXX1qs88DyatViNQKrowz96M4DjKN0NIAxFNYLWG5jGFjDgeXG2XU+EVhucxlYs4HlxlkcXAIstw0MrDJguVUysHYAy62GgVUDLLd6BtYvwHK7xMD6E1hunIWlrcB6VDfWwpAsYDnlsrAGAcupgIU1DlhOU1hYM4DlNMf6/aQWYX3EwvoUWE5lLKxtwHLiXZPyHbCcvmdh1QHLqY6FdRFYTg28S5GB5cRcJ5kKLKJk5ubNV4FF9AoTawywiPJfjL3KdmBNsP7WK4uwplp/n5pFWNw7wzYDi2gxE6sKWERfMLGOA4uomol1DlhEZ5lYTcCiSAt3eX4XYPVm38qQA6w8NlYBsF5iY00DFl3nYs0DFn3OxZoLLEr/mWfV2BNYRInFG7du311VXXvsZP2Z89ErN5tbWtufvJ/u7g8j8DsroPiEpOSUbmnpGVk9+gy0YpGdxVj2BSxgAQtYwAIWAhawgAUsYAELASsMFuL3P5EoVxup1KG6AAAAAElFTkSuQmCC</base64>
  </root>
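let $binary := binary { xs:hexBinary(xs:base64Binary($xml/base64/fn:string())) }
return
  (: Store the decoded content as a binary document, using the filename as its URI. :)
  xdmp:document-insert($xml/filename/fn:string(), $binary)
  (: $xml is constructed in memory here. When it is read from the database instead,
     follow the insert with xdmp:node-delete($xml/base64) to drop the base64 copy. :)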

This code can be used as part of an import transformation or with a Corb2 job.
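As a rough illustration of the CoRB case, a process module might look like the sketch below. It assumes the stored documents have the same root/filename/base64 shape as the example, with a single base64 element each, and the "/binaries/" URI prefix is a hypothetical convention, not part of the recipe. CoRB passes each selected URI to the process module as the external variable $URI.

```xquery
xquery version "1.0-ml";

(: CoRB supplies the URI of the document to process. :)
declare variable $URI as xs:string external;

let $doc := fn:doc($URI)
(: Assumption: at most one base64 element per document. :)
let $base64 := $doc/root/base64
where fn:exists($base64)
return (
  (: Write the decoded content as its own binary document.
     The "/binaries/" prefix is a hypothetical URI convention. :)
  xdmp:document-insert(
    fn:concat("/binaries/", $doc/root/filename/fn:string()),
    binary { xs:hexBinary(xs:base64Binary($base64/fn:string())) }
  ),
  (: Remove the base64 copy from the stored source document. :)
  xdmp:node-delete($base64)
)
```

Because the source document comes from the database here, xdmp:node-delete can remove the element in the same transaction that creates the binary.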

Discussion

There are good reasons to go to the trouble of extracting the base64 content and storing it separately as a binary.

Base64 is, by nature, large: every three bytes of binary data become four characters of text. That makes for large strings in XML or JSON documents. The example above has a string with length of 2560 characters. Whenever we update a document in MarkLogic, the MVCC approach means we make a copy of that document. The older versions will be removed by the merging process, but in the meantime, they’ll take up space without contributing much value. If we separate the binary content, then it won’t need to be copied over when the XML or JSON document gets updated.
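To see the size difference concretely, the following sketch compares the length of a base64 string with the number of bytes it encodes. The payload is an illustrative value (the text "hello world"), not taken from the recipe.

```xquery
xquery version "1.0-ml";

(: Illustrative payload: the 11 bytes of "hello world". :)
let $base64 := "aGVsbG8gd29ybGQ="
(: hexBinary renders each byte as two hex digits,
   so half the string length is the byte count. :)
let $hex := xs:string(xs:hexBinary(xs:base64Binary($base64)))
return (
  fn:string-length($base64),     (: characters stored as text: 16 :)
  fn:string-length($hex) idiv 2  (: bytes of underlying binary: 11 :)
)
```

The same two expressions can be pointed at a real element's string value to measure how much text a document carries for a given binary.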

Another impact of base64 strings is on indexing, because all text is normally included in the indexes. There’s no value in indexing values like this; it’s simply a large value that no one’s going to search for. It is possible to configure the database to exclude values based on an element or property name, but removing the string from the document avoids that extra configuration.
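For comparison, excluding the element from word queries through the Admin API might look roughly like this. It is a sketch under assumptions: the database name "Documents" and the no-namespace element name base64 are placeholders for whatever the real content uses.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
(: Assumption: the content lives in a database named "Documents". :)
let $db-id := xdmp:database("Documents")
(: Assumption: the element is named base64 and is in no namespace. :)
let $excluded := admin:database-excluded-element("", "base64")
return admin:save-configuration(
  admin:database-add-word-query-excluded-element($config, $db-id, $excluded)
)
```

This keeps the string out of word queries, but the text still travels with every copy of the document, which is why extracting it is the simpler route.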

Finally, let’s think about how we’d make use of the binary content. Most likely, we’d want to serve it up as an image, in its binary form. Doing so is simpler if the content has been extracted, converted back to a binary form, and stored that way. Retrieving it then becomes a simple matter of loading the binary content from disk and returning it to the client, rather than converting it from base64 to binary at run time.
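A minimal sketch of serving the stored binary from an HTTP app server main module follows. The request field name "uri" is an assumption made for illustration, and the content type is looked up from the URI's extension.

```xquery
xquery version "1.0-ml";

(: Assumption: the client passes the binary's URI in a "uri" request field. :)
let $uri := xdmp:get-request-field("uri")
return
  if (fn:exists($uri) and fn:doc-available($uri)) then (
    (: Derive the MIME type from the URI extension, e.g. image/png for .png. :)
    xdmp:set-response-content-type(xdmp:uri-content-type($uri)),
    fn:doc($uri)
  )
  else
    xdmp:set-response-code(404, "Not Found")
```

No decoding happens at request time; the module simply returns the document that was stored as a binary.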

Learn More

Importing Content Into MarkLogic Server

Read how to insert content into a MarkLogic Server database from flat files, compressed ZIP and GZIP files, aggregate XML files, and more.

Working with Content Transformations

Learn how to create custom content transformations and apply them during operations such as document ingestion and search with the REST Client API.

CORB2

Learn more about CORB2 capabilities and how the tool is used in MarkLogic, and find its associated repository on GitHub.

 
