[MarkLogic Dev General] Suggestions for data masking

David Lee David.Lee at marklogic.com
Tue Mar 24 02:17:33 PDT 2015

Unless the PI (personally identifiable) data in your documents is separated into different documents, you are going to need a custom transformation on each document, the details of which are very case specific (fill in SS#s with '???', remove last names, remove entire sections, or replace them with sample data?). Having worked in the medical and commerce worlds, I know that getting this right, and making it clearly auditable, is crucial.
Also consider whether you need to maintain any document properties or metadata (properties objects including mod dates, collections, permissions, DLS data, etc.),
and whether these are copied as-is or modified.

That breaks the question into parts:
1) Select the document subset to copy
2) Transform the document content itself (*prior* to leaving the 'trust zone')
3) Select/copy/filter the document metadata
4) Extract from the source DB
5) Possibly package for secure, reliable, or easy travel to the downstream sites (encrypt?)
6) Copy the data
... then reverse the process on the target site.
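
To make step 2 concrete, here is a minimal sketch of a per-document masking transform that also records what it changed, since auditability matters. It is written in Python purely for illustration and is not MarkLogic code. The element names (ssn, last-name), the replacement values, and the email pattern are all assumptions; a real schema will differ, and this version only looks at element text, not attributes or tail text.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical element names and replacement values; adjust to the real schema.
SENSITIVE_ELEMENTS = {"ssn": "???-??-????", "last-name": "REDACTED"}

# Deliberately simple email pattern, good enough for a sketch.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def mask_document(xml_text):
    """Return (masked_xml, audit_log) for a single XML document."""
    root = ET.fromstring(xml_text)
    audit = []
    for elem in root.iter():
        if elem.tag in SENSITIVE_ELEMENTS:
            # Replace the whole value of known-sensitive elements.
            elem.text = SENSITIVE_ELEMENTS[elem.tag]
            audit.append(f"replaced <{elem.tag}>")
        elif elem.text and EMAIL_RE.search(elem.text):
            # Mask email addresses embedded in free text.
            elem.text = EMAIL_RE.sub("user@example.com", elem.text)
            audit.append(f"masked email in <{elem.tag}>")
    return ET.tostring(root, encoding="unicode"), audit
```

Returning the audit log alongside the masked document makes it possible to store, per document, a record of which rules fired, which is one way to keep the process reviewable.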

You can do all this ad hoc, once maybe.
Making it reliable, scriptable, auditable, and never wrong is harder.

Geert's suggestion of FlexRep seems ideal for this, as it can accomplish all of these.

MLCP by itself can do quite a bit - but it may be hard to put all the pieces together.

Another way is to make a temporary DB and use CPF or your own code to do all the data transformation on-server (steps 1-4), then use any number of ways to copy the data (mlcp, replication, database export/import).

Or ... if you prefer offline tools (say you like xproc or xmlsh or other non-server products), you could dump the DB to local files, clean them in place,
then copy them over and reverse the process.
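
As a rough illustration of the offline "clean them in place" route, the sketch below walks a directory of dumped files and rewrites each one through a cleaning function. It uses Python only as a stand-in for whatever offline tool you prefer. The function name clean_in_place, the .xml extension, and the shape of the clean callback are assumptions, not part of any MarkLogic tooling.

```python
import os
import tempfile
from pathlib import Path


def clean_in_place(dump_dir, clean):
    """Apply clean(text) -> text to every .xml file under dump_dir.

    Returns the list of files whose content changed, as a simple audit trail.
    """
    changed = []
    for path in sorted(Path(dump_dir).rglob("*.xml")):
        original = path.read_text(encoding="utf-8")
        cleaned = clean(original)
        if cleaned == original:
            continue
        # Write to a temp file in the same directory, then rename over the
        # original, so an interrupted run never leaves a half-written file.
        fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(cleaned)
        os.replace(tmp_name, path)
        changed.append(str(path))
    return changed
```

The returned list can be written to a log next to the dump, so there is a record of which files were touched before they are copied to the target site.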

FlexRep is looking really good though  ... 

David Lee
Lead Engineer
MarkLogic Corporation
dlee at marklogic.com
Phone: +1 812-482-5224
Cell:  +1 812-630-7622

-----Original Message-----
From: general-bounces at developer.marklogic.com [mailto:general-bounces at developer.marklogic.com] On Behalf Of Geert Josten
Sent: Tuesday, March 24, 2015 2:00 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Suggestions for data masking

Hi Joel,

I haven't dealt with this personally, but could ask around. I guess though there are numerous ways to go about this, depending on the exact needs. The two that come to mind first:

You could create a permanent solution using Flexible Replication, which builds on top of CPF:

You could also use the MLCP copy feature together with an MLCP transform.

You already mentioned triggers and scheduled tasks, but MLCP will load faster, I think. CPF uses triggers underneath.

Kind regards,

On 3/24/15, 2:12 AM, "Joel Wilson Gunasekaran"
<joelwilson.gunasekaran at gmail.com> wrote:

>Once in a while, we refresh the dataset in lower environments with 
>production data for testing purposes.
>We have a requirement to mask all PII (personally identifiable
>information) data like email ID, phone number, etc. in lower 
>environments like DEV, QA.
>We were thinking about having a one-time script that does the masking, 
>which can be run when we do the data refresh.
>In addition to this, we also want an automated process that does this, 
>like either a scheduled task or a trigger, to avoid any sensitive data 
>being left unmasked accidentally.
>Can you please let me know if you have had to deal with similar cases 
>and any suggestions?
>General mailing list
>General at developer.marklogic.com
