[MarkLogic Dev General] Suggestions for data masking

David Lee David.Lee at marklogic.com
Tue Mar 24 02:17:33 PDT 2015


Unless your documents' PII is separated into different documents, you are going to need to do a custom transformation on each document. The details are very case-specific (fill in SS#s with '???', remove last names, remove entire sections, or replace them with sample data?). Having worked in the medical and commerce worlds, I know that getting this right, and making it clearly auditable, is crucial.
Also consider whether you need to maintain any document properties or metadata (properties objects including mod dates, collections, permissions, DLS data, etc.),
and whether these are copied as-is or modified.

That refines the question into parts:
1) Select the document subset to copy
2) Transform the document content itself (*prior* to leaving the 'trust zone')
3) Select/copy/filter the document metadata
4) Extract from the source DB
5) Possibly package for secure, reliable or easy travel to the downstream sites (encrypt?)
6) Copy the data
Then reverse the process on the target site.
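
To make step 2 concrete, here is a minimal sketch of a per-document masking transform in Python. It is only an illustration, not a MarkLogic API: the element names (ssn, email, phone, lastName) and the replacement values are assumptions, and a real schema and masking policy will differ.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical element names and replacement values; adjust to the real schema.
MASK_RULES = {
    "ssn": lambda text: "???-??-????",
    "email": lambda text: "user@example.invalid",
    "phone": lambda text: re.sub(r"\d", "0", text),  # keep the shape, drop the digits
}
REMOVE = {"lastName"}  # elements dropped entirely


def mask_document(xml_text):
    """Return a masked copy of one XML document (passed and returned as a string)."""
    root = ET.fromstring(xml_text)
    # Snapshot the tree first so removing children does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in REMOVE:
                parent.remove(child)
            elif child.tag in MASK_RULES and child.text:
                child.text = MASK_RULES[child.tag](child.text)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        "<person><firstName>Ann</firstName><lastName>Smith</lastName>"
        "<ssn>123-45-6789</ssn><phone>812-555-0100</phone></person>"
    )
    print(mask_document(sample))
```

Keeping the rules in one table means the masking policy can be reviewed in a single place, which helps with the audit requirement.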

You can do all this ad hoc, once maybe.
Getting this reliable, scriptable, auditable, and never screwing it up is harder.
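
One way to approach the "auditable" part is a check that runs after masking and flags anything that still looks like PII. The sketch below is an assumption-laden example in Python: the two regular expressions, the allowed placeholder domain, and the function name are all hypothetical, and pattern matching like this will miss PII that does not follow a recognizable format.

```python
import re

# Hypothetical patterns; real PII detection needs rules for your own data.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def find_residual_pii(docs, allowed_email_domains=("example.invalid",)):
    """Scan {uri: text} and return (uri, label) pairs for suspected unmasked PII.

    The matched value itself is deliberately not returned, so the audit
    record does not become a second copy of the sensitive data.
    """
    allowed_suffixes = tuple("@" + domain for domain in allowed_email_domains)
    findings = []
    for uri, text in docs.items():
        for label, pattern in PII_PATTERNS.items():
            for match in pattern.finditer(text):
                value = match.group(0).lower()
                if label == "email" and value.endswith(allowed_suffixes):
                    continue  # placeholder address written by the masking step
                findings.append((uri, label))
    return findings
```

A refresh script could refuse to copy anything while this list is non-empty, and log the list as the audit record.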

Geert's suggestion of FlexRep seems ideal for this, as it can accomplish all of these.

MLCP by itself can do quite a bit - but it may be hard to put all the pieces together.

Another way is to make a temporary DB and use CPF or your own code to do all the data transformation on-server (steps 1-4), then use any number of ways to copy the data (mlcp, replication, database export/import).

Or, if you prefer offline tools (say you like xproc or xmlsh or other non-server products), you could dump the DB to local files, clean them in place,
then copy them over and reverse it.
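
As a rough sketch of the "clean them in place" step, assuming the export is a directory tree of UTF-8 .xml files and that only SSN-shaped strings need masking (both are assumptions for illustration; clean_export is a hypothetical helper, not part of any tool named here):

```python
import os
import re
import tempfile
from pathlib import Path

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # assumed pattern, for illustration only


def clean_export(export_dir):
    """Mask SSN-like strings in every .xml file under export_dir, in place.

    Returns the relative paths of the files that were rewritten, which can
    double as a simple audit trail.
    """
    changed = []
    # sorted() materializes the file list before any temp files are created.
    for path in sorted(Path(export_dir).rglob("*.xml")):
        original = path.read_text(encoding="utf-8")
        masked = SSN.sub("???-??-????", original)
        if masked != original:
            # Write to a temp file in the same directory, then swap it in, so
            # an interrupted run never leaves a half-written document behind.
            fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
            with os.fdopen(fd, "w", encoding="utf-8") as handle:
                handle.write(masked)
            os.replace(tmp_name, path)
            changed.append(path.relative_to(export_dir).as_posix())
    return changed
```

Running it a second time should report no changed files, which is a cheap sanity check before the files are copied out.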

FlexRep is looking really good though  ... 
 

-----------------------------------------------------------------------------
David Lee
Lead Engineer
MarkLogic Corporation
dlee at marklogic.com
Phone: +1 812-482-5224
Cell:  +1 812-630-7622
www.marklogic.com

-----Original Message-----
From: general-bounces at developer.marklogic.com [mailto:general-bounces at developer.marklogic.com] On Behalf Of Geert Josten
Sent: Tuesday, March 24, 2015 2:00 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Suggestions for data masking

Hi Joel,

I haven't dealt with this personally, but could ask around. I guess though there are numerous ways to go about this, depending on the exact needs. The two that come to mind first:

You could create a permanent solution using Flexible Replication, which builds on top of CPF:
http://docs.marklogic.com/guide/flexrep/rep_intro#id_62963

You could also use the MLCP copy feature together with an MLCP transform.

You already mentioned triggers and scheduled tasks, but MLCP will load faster I think. CPF uses triggers underneath.

Kind regards,
Geert

On 3/24/15, 2:12 AM, "Joel Wilson Gunasekaran"
<joelwilson.gunasekaran at gmail.com> wrote:

>Hi,
>
>Once in a while, we refresh dataset in lower environments with 
>production data for testing purposes.
>We have a requirement to mask all PII (personally identifiable
>information) data like email id, phone number, etc. in lower 
>environments like DEV, QA.
>
>We were thinking about having a one-time script that does the masking, 
>which can be run when we do the data refresh.
>In addition to this, we also want an automated process that does this, 
>like either a scheduled task or a trigger, to avoid any sensitive data 
>left unmasked, accidentally.
>
>Can you please let me know if you have had to deal with similar cases 
>and any suggestions?
>
>Thanks
>Joel
>_______________________________________________
>General mailing list
>General at developer.marklogic.com
>http://developer.marklogic.com/mailman/listinfo/general
