Solutions

MarkLogic Data Hub Service

Fast data integration + improved data governance and security, with no infrastructure to buy or manage.

Learn More

Learn

Stay On Top Of Everything MarkLogic

Be the first to know! News, product information, and events delivered straight to your inbox.

Sign Me Up

Community

Stay On Top Of Everything MarkLogic

Be the first to know! News, product information, and events delivered straight to your inbox.

Sign Me Up

Company

Stay On Top Of Everything MarkLogic

Be the first to know! News, product information, and events delivered straight to your inbox.

Sign Me Up

Ingesting Delimited Text with MLCP

by Matt Sun

Quite often we've seen customers run into an exception "invalid char between encapsulated token and delimiter" when they are ingesting delimited text into MarkLogic Server using MLCP. It sounds so technical and hard to understand -- what is wrong with the data? This article explains how it happens and the workaround to solve it.

What does that mean?

"Invalid char between encapsulated token and delimiter" means you have some characters between an encapsulator and a delimiter, which are invalid. What is an 'encapsulator'? It is the character that used to wrap the CSV field or column, which may contain special characters such as line breaks. In most cases, people use double quote as the encapsulator.

For more details about encapsulators, please refer to rfc4180 standard: https://www.ietf.org/rfc/rfc4180.txt

What's wrong with those characters?

Let's explain with a delimited text example.

  1. "foo"| "bar" | "foo"
  2. "foo"| X "bar" | "foo"
  3. "foo" |X "bar" Y | "foo"
  4. "foo" | "bar" Y |"foo"

Here the delimiter is "|" and the encapsulator is double quote. In rows 2, 3 and 4, there are some characters between a delimiter and an encapsulator. This is just what goes wrong. According to the rfc4180 standard, those columns are actually not valid delimited text format. Here is the quote:

Each field may or may not be enclosed in double quotes. If fields are not enclosed with double quotes, then double quotes may not appear inside the fields. 

So this statement has two points, 

  1. the double quotes are only used to enclose the whole column, not several characters in the column.
  2. If you don't enclose the column (field) with double quotes, then double quote(s) should not be present inside the field.

With that being said, rows 2, 3, 4 should be rejected by CSVParser as invalid CSV records. However, the CSVParser MLCP currently uses can actually handle cases 2 and 3 and parse them without any issue, but is not able to deal with case 4 and an exception with message "invalid char between encapsulated token and delimiter" will be given.

What to do with it?

The best way is to avoid having this kind of malformed CSV data. Other than that, escape the double quotes in the field if you really want them to be part of the string. But remember, escape double quotes by another DOUBLE QUOTE in CSV!

  1. "foo" | ""bar"" Y |"foo"

Meanwhile, MarkLogic's Engineering team is also looking at some changes in this area, which will handle this kind of issue better.

Stack Overflow iconStack Overflow: Get the most useful answers to questions from the MarkLogic community, or ask your own question.

Comments

The commenting feature on this page is enabled by a third party. Comments posted to this page are publicly visible.