Note: The MarkLogic Connector for Apache Spark has been upgraded to Apache Spark 3. This tutorial is for Apache Spark 2. For more information on the latest connector, please visit the Spark 3 Connector on GitHub.

In this tutorial, you will learn how to ingest data into a MarkLogic Data Hub Service instance running on AWS using the MarkLogic Connector for Apache Spark 2. You will load CSV files that were created by exporting data from RDBMS tables. When ingesting the data, you will perform a join to denormalize the data across the source tables and write the denormalized data into your MarkLogic Data Hub Service.

Requirements

In order to follow along with this tutorial, please be sure to do the following:

    1. Log in to the MarkLogic Data Hub Service portal for AWS.
    2. Click the name of your service in the table of services displayed on the Dashboard.
    3. On the Data Hub Service information page, in the “Endpoints” section, click the “Action” dropdown menu.
    4. Hover over “Gradle Config”, then select “Open in new tab”.
    5. In the new tab, copy the value of the “mlHost” property and paste it into a text file. You will use it later in this tutorial.

Create A Secret

As a best practice when working in AWS, we will use AWS Secrets Manager to store the MarkLogic credentials rather than hard-coding them in the job script:

  1. Log in to the AWS Management Console using your AWS account.
  2. Click on the “Services” dropdown link at the top of the AWS Management Console page.
  3. Under the “Security, Identity & Compliance” section, click “Secrets Manager”.
  4. On the left side of the page, click the “Secrets” link.
  5. On the “Secrets” page, click the “Store a new secret” button.
  6. For the “Select secret type” option, click on “Other type of secrets”.
  7. In the “Secret key/value” table, enter the following keys and values, adding a row as needed:
    Key          Value
    mlUsername   dan
    mlPassword   ML@test1
  8. Click the “Next” button at the bottom of the page.
  9. Enter “marklogic-spark-tutorial” for the “Secret name”.
  10. Click the “Next” button at the bottom of the page.
  11. On the “Store a new secret” page, leave the default options and click the “Next” button at the bottom of the page.
  12. On the “Store a new secret” page you will see a summary of information about your secret. Click the “Store” button to save the secret.

Note: Secrets are not deleted immediately. Instead, they are scheduled for deletion after a waiting period that you choose, which gives you time to update any code that uses them. As a result, a secret name cannot be reused until that secret has been completely deleted.
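
If you prefer to work with Secrets Manager from code, the sketch below shows how a script could read the secret back and, when you are completely finished, force-delete it so the name becomes reusable. This is a minimal boto3 sketch, not part of the tutorial's job script; the region name is an assumption, so substitute your own.

    # Minimal boto3 sketch for reading (and later force-deleting) the secret.
    # Assumptions: boto3 is installed, AWS credentials are configured, and
    # "us-east-1" is replaced with the region where you stored the secret.
    import json
    import boto3

    secrets = boto3.client("secretsmanager", region_name="us-east-1")

    # Read the secret created above; the value comes back as a JSON string.
    response = secrets.get_secret_value(SecretId="marklogic-spark-tutorial")
    credentials = json.loads(response["SecretString"])
    print(credentials["mlUsername"])  # "dan"

    # When you are done with the tutorial, you can skip the waiting period so
    # the secret name becomes reusable immediately:
    # secrets.delete_secret(
    #     SecretId="marklogic-spark-tutorial",
    #     ForceDeleteWithoutRecovery=True,
    # )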

Create an S3 Bucket

The purpose of creating an S3 bucket is to have a place on the AWS cloud to store the CSV files, the job scripts, and the MarkLogic Spark Connector JAR.

  1. Click on the “Services” dropdown link at the top of the AWS Console page.
  2. Under the “Storage” section, click “S3”.
  3. Click the “Create bucket” button.
  4. For the “Bucket name”, type “marklogic-spark-tutorial”.
  5. Leave the remaining settings at their default values.
  6. Click the “Create bucket” button at the bottom.
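
If you prefer to create the bucket from code rather than the console, a minimal boto3 sketch follows. Keep in mind that S3 bucket names are globally unique, so the call may fail if the name is already taken; the region shown is an assumption.

    # Minimal boto3 sketch for creating the tutorial bucket.
    # Assumptions: the bucket name is available and AWS credentials are configured.
    import boto3

    s3 = boto3.client("s3", region_name="us-east-1")

    # In us-east-1 the simple call below is enough; in other regions you must
    # also pass CreateBucketConfiguration={"LocationConstraint": "<region>"}.
    s3.create_bucket(Bucket="marklogic-spark-tutorial")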

Upload Data, Scripts, and the MarkLogic Connector

Now you will load the CSV data, example Python and Scala scripts, and the MarkLogic Connector for Apache Spark JAR file to the Amazon S3 bucket that you just created.

Upload the Job Scripts

You will upload two scripts into your S3 bucket. One is a Python example and the other is a Scala example. Both do the same thing: they read the CSV files, join the data to create a denormalized entity that represents a customer, and then write the denormalized data to a MarkLogic Data Hub Service instance.
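
The exact logic lives in the two example scripts you are about to upload; the fragment below is only a simplified PySpark sketch of the kind of join they perform. The join key names (address_id, city_id, country_id) are illustrative assumptions, not necessarily the column names in the tutorial data.

    # Simplified PySpark sketch of the denormalizing join (illustrative only;
    # the real work is done by spark-example-script.py / .scala).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("customer-denormalize").getOrCreate()

    base = "s3://marklogic-spark-tutorial/data"
    customer = spark.read.option("header", "true").csv(f"{base}/customer.csv")
    address = spark.read.option("header", "true").csv(f"{base}/address.csv")
    city = spark.read.option("header", "true").csv(f"{base}/city.csv")
    country = spark.read.option("header", "true").csv(f"{base}/country.csv")

    # Join keys below are assumptions about the exported RDBMS schema.
    denormalized = (
        customer
        .join(address, "address_id")
        .join(city, "city_id")
        .join(country, "country_id")
    )
    denormalized.show(5)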

  1. Click “marklogic-spark-tutorial” in the list of buckets. The bucket’s objects and other information display on a new page.
  2. From the location on your computer where you downloaded the resources for this tutorial, drag and drop the “scripts” folder onto the “marklogic-spark-tutorial” bucket page. The “Upload” page should display the two script files: spark-example-script.py and spark-example-script.scala.
  3. Click the “Upload” button at the bottom of the page. The files are uploaded and the “Status” should show “Succeeded” for each.
  4. Click the “Close” button to return to the “marklogic-spark-tutorial” bucket page.
Upload the CSV Files

For this tutorial, we have exported data from four RDBMS database tables into CSV files. Here, you will upload the following CSV files into the S3 bucket: address.csv, city.csv, country.csv, and customer.csv.

  1. From the location on your computer where you downloaded the resources for this tutorial, drag and drop the “data” folder on to the “marklogic-spark-tutorial” bucket page. The “Upload” page should display the four CSV files.
  2. Click the “Upload” button at the bottom of the page. The files are uploaded and the “Status” should show “Succeeded” for each.
  3. Click the “Close” button at the top-right of the page to return to the “marklogic-spark-tutorial” bucket page.
Upload the MarkLogic Spark Connector JAR

The final part of the S3 bucket upload is the JAR file for the connector.

  1. In the “marklogic-spark-tutorial” folder on your system, create a folder called “connector”. Download the JAR file and put it into the “connector” folder you created.
  2. Drag and drop the “connector” folder from your system to the “marklogic-spark-tutorial” bucket page. The “Upload” page should show the JAR file.
  3. Click the “Upload” button at the bottom of the page. The file is uploaded and the “Status” should show “Succeeded”.
  4. Click the “Close” button to return to the “marklogic-spark-tutorial” bucket page.
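
Drag-and-drop works fine, but if you would rather script the three uploads, a minimal boto3 sketch is shown below. The local paths are assumptions based on the folder layout described above, and the connector JAR file name is a placeholder for the file you downloaded.

    # Minimal boto3 sketch for uploading the tutorial files to the bucket.
    # Local paths and the JAR file name are placeholders; adjust them to match
    # where you placed the tutorial resources.
    import boto3

    s3 = boto3.client("s3")
    bucket = "marklogic-spark-tutorial"

    uploads = [
        "scripts/spark-example-script.py",
        "scripts/spark-example-script.scala",
        "data/address.csv",
        "data/city.csv",
        "data/country.csv",
        "data/customer.csv",
        "connector/marklogic-spark-connector.jar",  # placeholder JAR file name
    ]

    for path in uploads:
        # Upload each local file to the same key (folder/filename) in the bucket.
        s3.upload_file(path, bucket, path)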

Create AWS Glue Database, Tables, and Crawler

Create a Database
  1. Click on the “Services” dropdown link at the top of the “AWS Console” page.
  2. Under the “Analytics” section, click on “AWS Glue”.
  3. Click the “Databases” link under the “Data catalog” section on the left side of the page.
  4. Click the “Add database” button.
  5. For “Database name”, enter “customer-db”, then click the “Create” button.
Create a Crawler

A “crawler” reads metadata (schema) from a data source and creates tables in an AWS Glue database.

  1. Click on “Crawlers” in the “Data catalog” section.
  2. Click the “Add crawler” button.
  3. For “Crawler name”, enter “customer-crawler”, then click the “Next” button.
  4. On the “Specify crawler source type” page, keep the default values and then click the “Next” button.
  5. On the “Add a data store” page, fill in the following:
    • For “Choose a data store”, select “S3”.
    • Leave “Connection” blank.
    • For “Crawl data in”, select “Specified path in my account”.
    • For “Include path”, click the folder icon, then click the “+” button in front of the “marklogic-spark-tutorial” S3 bucket.
    • Select the “data” folder.
    • Click the “Select” button. The “Include path” should now be “s3://marklogic-spark-tutorial/data”.
  6. Click the “Next” button.
  7. On the “Add another data store” page, select “No” and then click the “Next” button.
  8. Next, you will create an IAM role with the permissions required for the crawler to access the data in the S3 bucket.
  9. On the “Choose an IAM role” page, select “Create an IAM role” and in the textbox type “marklogic-spark-tutorial”.
  10. Click the “Next” button.
  11. On the “Create a schedule for this crawler” page, set the “Frequency” to “Run on demand”.
  12. Click the “Next” button.
  13. On the “Configure the crawler’s output” page, set the “Database” to “customer-db”. Leave all other options at their defaults.
  14. Click the “Next” button.
  15. Click the “Finish” button on the “Crawler info” page.
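
The console steps above can also be scripted. The sketch below is a minimal boto3 equivalent, assuming the IAM role created in step 9 already exists; it is offered only as an alternative to the console walkthrough.

    # Minimal boto3 sketch of the Glue database and crawler created above.
    # Assumption: the AWSGlueServiceRole-marklogic-spark-tutorial role exists.
    import boto3

    glue = boto3.client("glue")

    # Create the "customer-db" Data Catalog database.
    glue.create_database(DatabaseInput={"Name": "customer-db"})

    # Create the crawler that points at the CSV data in the S3 bucket.
    glue.create_crawler(
        Name="customer-crawler",
        Role="AWSGlueServiceRole-marklogic-spark-tutorial",
        DatabaseName="customer-db",
        Targets={"S3Targets": [{"Path": "s3://marklogic-spark-tutorial/data"}]},
    )

    # Later, glue.start_crawler(Name="customer-crawler") runs it on demand.
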
Run the Crawler to Create Tables
  1. The “customer-crawler” should now appear in the list of crawlers. Select “customer-crawler”.
  2. Click the “Run crawler” button.
  3. Wait for the crawler to finish. When the “Status” column in the “customer-crawler” row displays “Stopping”, the crawler has completed.
  4. Click “Tables” in the “Data catalog” section, under “Databases”. The list of tables displays.
  5. Verify that the following four tables have been created: address_csv, city_csv, country_csv, and customer_csv.
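
Once these tables exist, a Glue job script can read them through the Data Catalog instead of pointing at raw CSV paths. The fragment below is a hedged sketch of that pattern using the standard awsglue API; it is not copied from the tutorial's example script.

    # Sketch of reading the crawled tables from the Glue Data Catalog inside a
    # Glue job (not taken verbatim from spark-example-script.py).
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    customer = glue_context.create_dynamic_frame.from_catalog(
        database="customer-db", table_name="customer_csv"
    ).toDF()
    address = glue_context.create_dynamic_frame.from_catalog(
        database="customer-db", table_name="address_csv"
    ).toDF()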

Configure IAM Role Permissions

You created the IAM role in the “Create a Crawler” section of this tutorial. This role needs access to S3 buckets in order to list objects in the bucket, read the CSV data, and load the MarkLogic Spark Connector JAR. It also needs access to Secrets Manager so that the Glue job can read the MarkLogic credentials you stored earlier.

  1. Click on the “Services” dropdown link at the top of the AWS Console page.
  2. Under the “Security, Identity & Compliance” section, click “IAM”.
  3. Click “Roles” under the “Access management” section. The list of IAM roles displays.
  4. Click the role name, “AWSGlueServiceRole-marklogic-spark-tutorial”. The list of permission policies displays.
  5. Verify that the following policy names are listed: AWSGlueServiceRole and AWSGlueServiceRole-marklogic-spark-tutorial.
  6. Click the “Attach policies” button.
  7. On the “Attach Permissions” page, type “amazons3” to filter to only Amazon S3-related policies.
  8. Check the “AmazonS3FullAccess” policy name.
  9. Click the “Attach policy” button.
  10. On the Summary page for your “AWSGlueServiceRole-marklogic-spark-tutorial” role, click the “Attach policies” button.
  11. On the “Attach Permissions” page, type “secrets” to filter to only AWS Secrets Manager-related policies.
  12. Check the “SecretsManagerReadWrite” policy name.
  13. Click the “Attach policy” button at the bottom of the page.
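
Steps 6 through 13 can also be done programmatically; a minimal boto3 sketch is below. AmazonS3FullAccess and SecretsManagerReadWrite are the same broad managed policies used in the console steps, chosen for simplicity.

    # Minimal boto3 sketch of attaching the two managed policies to the role.
    import boto3

    iam = boto3.client("iam")
    role_name = "AWSGlueServiceRole-marklogic-spark-tutorial"

    for policy_arn in (
        "arn:aws:iam::aws:policy/AmazonS3FullAccess",
        "arn:aws:iam::aws:policy/SecretsManagerReadWrite",
    ):
        iam.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)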

Create an AWS Glue Job

  1. Click on the “Services” dropdown link at the top of the AWS Console page.
  2. Under the “Analytics” section, click “AWS Glue”.
  3. In AWS Glue, click “Jobs” under the “ETL” section. The list of jobs displays.
  4. Click the “Add job” button.
  5. On the “Configure the job properties” page, fill in the following:
    • For “Name”, type “marklogic-spark-tutorial”.
    • For “IAM role”, select “AWSGlueServiceRole-marklogic-spark-tutorial”.
    • For “Type”, select “Spark”.
    • For “Glue version”, select “Spark 2.4, Python 3 with improved job startup times (Glue Version 2.0)”.
    • For “This job runs”, select “An existing script that you provide”.
    • Under “S3 path where the script is stored”, click the folder icon, expand the “marklogic-spark-tutorial” bucket, expand the “scripts” folder, then select “spark-example-script.py”. Click the “Select” button.
    • Expand the “Security configuration, script libraries, and job parameters (optional)” section.
    • In the “Dependent jars path”, click the folder icon, expand the “marklogic-spark-tutorial” bucket, expand the “connector” folder, and then choose the MarkLogic Spark connector JAR file that you uploaded earlier.
    • Click the “Select” button.
    • Leave all other items at their default settings.
  6. Click the “Next” button at the bottom of the page.
  7. On the “Connections” page, click the “Save job and edit script” button.
  8. Around line 125 in the script, a variable called “options_dataSinkTest” contains a JSON object with the connection information for your MarkLogic Data Hub Service instance (a simplified sketch of the two settings you change here appears after this list):
    • Update the value of the “mlHost” property with your DHS host name. This is the “mlHost” value you copied when completing the “Requirements” section of this tutorial.
    • Update the value of the “secretsId” property to the name of the secret you created earlier, “marklogic-spark-tutorial”.
  9. Click the “Save” button.
  10. Click the “Run job” button. In the “Parameters (optional)” pop up, click the “Run job” button.
  11. Wait for the job to run. Note that this can take a few minutes.
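
For reference, the two properties edited in step 8 end up looking roughly like the sketch below. This is not the full options_dataSinkTest object from the script, only the two values this tutorial changes; the host value is a placeholder, and the remaining properties in the real script stay at their defaults.

    # Hedged sketch of the two connection properties edited in step 8; the real
    # options_dataSinkTest object in spark-example-script.py contains more keys.
    options_dataSinkTest = {
        "mlHost": "PASTE-YOUR-mlHost-VALUE-HERE",   # from the Requirements section
        "secretsId": "marklogic-spark-tutorial",    # the secret created earlier
        # ...other properties from the example script, left at their defaults...
    }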

Validate the Results

When the job completes, we will validate that the data has been loaded into your MarkLogic Data Hub Service instance.

  1. Log in to the MarkLogic Data Hub Service portal.
  2. Select your “Dashboard” and click on the name of your service.
  3. In the “Endpoints” section, click the link for “Query Console”.
  4. Log in to Query Console using the MarkLogic credentials you stored in the secret earlier (dan / ML@test1).
  5. In Query Console, from the “Database” dropdown menu, choose the “data-hub-STAGING” database and then click the “Explore” button.
  6. You will see JSON documents that contain denormalized representations of customers from the CSV data that you loaded into MarkLogic from Amazon S3 storage.
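
If you prefer to check from code rather than Query Console, a hedged sketch using the MarkLogic REST search API is below. The port (8010 for the staging app server) and digest authentication are assumptions; use the endpoint details shown on your service's Endpoints page if your instance differs.

    # Hedged sketch: count staging documents via the MarkLogic REST search API.
    # Assumptions: port 8010 for the staging app server and digest auth; adjust
    # to match your DHS Endpoints page.
    import requests
    from requests.auth import HTTPDigestAuth

    ml_host = "PASTE-YOUR-mlHost-VALUE-HERE"
    response = requests.get(
        f"https://{ml_host}:8010/v1/search",
        params={"format": "json"},
        auth=HTTPDigestAuth("dan", "ML@test1"),
    )
    response.raise_for_status()
    print("Documents found:", response.json()["total"])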

Learn More

Documentation

Learn more about how to configure the MarkLogic Connector for Apache Spark 2 in its documentation, where you will also find documentation for the AWS Glue connector.

Free Training

Want to learn more? The Data Hub Service learning track includes courses to get you up and running with ingesting data into, and accessing data from, a MarkLogic Data Hub Service.

Why a MarkLogic Connector for Apache Spark?

In this blog post, Ankur Jain discusses what Apache Spark is and why you should use it with MarkLogic.
