Handling Whitespace in URIs

by Dave Cassel

A question came up recently about using MLCP to load documents with whitespace in their filenames or paths. For instance, suppose I have a directory on my filesystem called "white space dir" and I want to load files in that directory into MarkLogic. I have this setup:

I can load that with MLCP:

$ ~/software/mlcp-8.0-4/bin/mlcp.sh import -username admin -password admin -host localhost -port 8000 -input_file_path "/tmp/blog"

This results in document URIs like "/tmp/blog/white%20space%20dir/sample.json". Why? As the Character Encoding of URIs section of the MLCP User Guide explains, well-formed database document URIs may not contain whitespace, so MLCP encodes each space as "%20".

The impact of this change comes when you try to retrieve a document. If we build an application that expects to find documents based on their paths on the filesystem, we might try something like this:

That doesn't match any URI in the database, so we don't get any result. Likewise, if we try to access our document through the REST API, we get a 404:

http://localhost:8000/v1/documents?uri=/tmp/blog/white space dir/sample.json

Solutions

There are a couple of approaches we can use to resolve this. The right answer for an application depends on requirements.

Change the Path on the Filesystem

The first is simply to change the filesystem path to avoid whitespace, allowing the in-database URIs to match. If the filesystem path doesn't have spaces, MLCP won't need to adjust the paths, so the URIs will match the paths as they are.

Transform

Another way to adapt the paths is to transform them during the load. MLCP lets us execute a write transform. We'll start by writing an MLCP transform that converts the spaces into dashes. Note that by the time this runs, the spaces have already been encoded as "%20".
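As a minimal sketch of just the URI rewrite such a transform would perform, here is the string logic in plain Python. This is for illustration only: it is not a module MLCP can run, and the function name is hypothetical.

```python
def spaces_to_dashes(uri: str) -> str:
    """Replace each percent-encoded space ("%20") in a URI with a dash."""
    # The spaces have already been encoded by the time the transform runs,
    # so we match the encoded form "%20" rather than a literal space.
    return uri.replace("%20", "-")


if __name__ == "__main__":
    print(spaces_to_dashes("/tmp/blog/white%20space%20dir/sample.json"))
    # prints /tmp/blog/white-space-dir/sample.json
```

A real transform would apply the same replacement to the URI it is handed and return the document with the new URI.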

There are multiple ways to deploy our transform. I chose to do it through the REST API.

Now we can call the transform as we load our data.

The result is a URI that is accessible without encoding: "/tmp/blog/white-space-dir/sample.json".

Live With Encoded URIs

There may be cases where we simply need to live with the URIs as they are. Your application development will be simpler if you can avoid this, but there are ways to work with encoded URIs. When requesting a document, apply the necessary encoding to the path first. This isn't complicated, but it will need to be done wherever you request a document.
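For an application that sits outside MarkLogic, a rough sketch of that encoding step might look like the following. It assumes Python's urllib.parse.quote() is a close enough match for the encoding MLCP applied; that holds for spaces, but other special characters may be treated differently, so check against the MLCP User Guide. The helper name is hypothetical.

```python
from urllib.parse import quote


def to_database_uri(filesystem_path: str) -> str:
    """Turn a filesystem path into the encoded URI expected in the database."""
    # quote() leaves "/" untouched by default and encodes each space as "%20".
    return quote(filesystem_path)


if __name__ == "__main__":
    print(to_database_uri("/tmp/blog/white space dir/sample.json"))
    # prints /tmp/blog/white%20space%20dir/sample.json
```

Keeping this in one helper means the encoding rule lives in a single place, even though every document request has to go through it.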

When using the REST API, we can take the same approach, but we need an extra layer. If we ask for "/v1/documents?uri=/white%20space%20dir/sample.json", normal processing turns that back to "/white space dir/sample.json" (resulting in a 404). However, if we encode the % signs themselves (%25), it works:

http://localhost:8000/v1/documents?uri=/white%2520space%2520dir/sample.json
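The extra layer can be sketched in Python as follows. The idea is to take the URI exactly as it is stored in the database (with its literal "%20") and percent-encode it once more when building the request URL, so that the server's decoding step yields the stored URI again. The function name and its parameters are hypothetical.

```python
from urllib.parse import quote


def documents_url(host: str, port: int, database_uri: str) -> str:
    """Build a /v1/documents URL for a URI that itself contains "%20"."""
    # quote() encodes "%" as "%25", so a stored "%20" becomes "%2520".
    # "/" is left as-is, which is quote()'s default behavior.
    return f"http://{host}:{port}/v1/documents?uri={quote(database_uri)}"


if __name__ == "__main__":
    print(documents_url("localhost", 8000, "/white%20space%20dir/sample.json"))
    # prints http://localhost:8000/v1/documents?uri=/white%2520space%2520dir/sample.json
```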

Rely on Search

Encoding means that the URI change has effects throughout the application, which is not ideal. For cases where external systems need to directly address documents with predicted URIs, either adjusting the predicted URIs (changing the paths on the filesystem before the load) or encoding in the application will be necessary. For some applications, however, the solution is to forget the original URIs and rely on search. When you search in MarkLogic, you are given access to the URI of each matching document. If your application can rely on discovery instead, the problem goes away.

When searching with cts:search(), you can call xdmp:node-uri() on any result, getting the document URI.

With the Search API or REST API, the response includes the URI of each search result.
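As a sketch of the discovery approach from a client's point of view, the snippet below pulls the URIs out of a search response that has already been parsed from JSON. It assumes the response has a top-level "results" array whose entries each carry a "uri" field; the sample dictionary is hand-written and heavily abbreviated for illustration, not captured from a real server, and the function name is hypothetical.

```python
from typing import Any, Dict, List


def result_uris(search_response: Dict[str, Any]) -> List[str]:
    """Collect the document URI of every result in a parsed search response."""
    # Assumption: each entry under "results" carries the document's "uri".
    return [result["uri"] for result in search_response.get("results", [])]


if __name__ == "__main__":
    # Hand-written, abbreviated sample for illustration only.
    sample = {
        "total": 1,
        "results": [
            {"index": 1, "uri": "/tmp/blog/white%20space%20dir/sample.json"}
        ],
    }
    print(result_uris(sample))
```

Because the URIs come back exactly as stored, the application never has to reconstruct them from filesystem paths.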

Comments

  • You might wanna consider double-encoding URIs. That will make sure URIs won't change if you export/import with MLCP.