Ingesting Delimited Text with MLCP

by Matt Sun

Quite often we've seen customers run into an exception "invalid char between encapsulated token and delimiter" when they are ingesting delimited text into MarkLogic Server using MLCP. It sounds so technical and hard to understand -- what is wrong with the data? This article explains how it happens and the workaround to solve it.

What does that mean?

"Invalid char between encapsulated token and delimiter" means you have some characters between an encapsulator and a delimiter, which are invalid. What is an 'encapsulator'? It is the character that used to wrap the CSV field or column, which may contain special characters such as line breaks. In most cases, people use double quote as the encapsulator.

For more details about encapsulators, please refer to rfc4180 standard:

What's wrong with those characters?

Let's explain with a delimited text example.

  1. "foo"| "bar" | "foo"
  2. "foo"| X "bar" | "foo"
  3. "foo" |X "bar" Y | "foo"
  4. "foo" | "bar" Y |"foo"

Here the delimiter is "|" and the encapsulator is double quote. In rows 2, 3 and 4, there are some characters between a delimiter and an encapsulator. This is just what goes wrong. According to the rfc4180 standard, those columns are actually not valid delimited text format. Here is the quote:

Each field may or may not be enclosed in double quotes. If fields are not enclosed with double quotes, then double quotes may not appear inside the fields. 

So this statement has two points, 

  1. the double quotes are only used to enclose the whole column, not several characters in the column.
  2. If you don't enclose the column (field) with double quotes, then double quote(s) should not be present inside the field.

With that being said, rows 2, 3, 4 should be rejected by CSVParser as invalid CSV records. However, the CSVParser MLCP currently uses can actually handle cases 2 and 3 and parse them without any issue, but is not able to deal with case 4 and an exception with message "invalid char between encapsulated token and delimiter" will be given.

What to do with it?

The best way is to avoid having this kind of malformed CSV data. Other than that, escape the double quotes in the field if you really want them to be part of the string. But remember, escape double quotes by another DOUBLE QUOTE in CSV!

  1. "foo" | ""bar"" Y |"foo"

Meanwhile, MarkLogic's Engineering team is also looking at some changes in this area, which will handle this kind of issue better.