MarkLogic Connector for AWS Glue - MarkLogic Community

In this tutorial, you will learn how to use the MarkLogic Connector for AWS Glue to load data from a CSV file stored on Amazon S3 storage to a MarkLogic Data Hub Service instance running on AWS.

Requirements

To follow along with this tutorial, complete the following steps first:

Create and start a Data Hub Service on AWS.
Download the sample data.
Create an internal user in your Data Hub Service with all roles assigned.
Note: In this tutorial, we will set up a user with the username “dan” and the password “ML@test1”, because this matches the example in our Data Hub training. You may choose different credentials, but if you do, be sure to remember them, because you will store them in a secret later.
Get the value of the “mlHost” property for your MarkLogic Data Hub Service by doing the following:

1. Log in to the MarkLogic Data Hub Service portal for AWS.
2. Click the name of your service in the table of services displayed in the Dashboard.
3. In the Data Hub Service information page, in the “Endpoints” section, click the “Action” dropdown menu.
4. Hover over “Gradle Config” then select “Open in new tab”.
5. In the new tab, copy the value of the “mlHost” property. You will use this in the next section when you create a secret.
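
If you would rather extract the value programmatically than copy it by hand, the following is a minimal sketch. It assumes the Gradle configuration is shown as Java-properties-style `key=value` text. The sample text and host name are hypothetical placeholders, not output from a real service.

```python
def read_property(properties_text: str, key: str) -> str:
    """Return the value of `key` from Java-properties-style text."""
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("#", "!")):
            continue
        name, separator, value = line.partition("=")
        if separator and name.strip() == key:
            return value.strip()
    raise KeyError(f"{key} not found in properties text")


# Hypothetical excerpt: the host below is a placeholder, not a real endpoint.
sample_config = """
# Data Hub Service connection settings
mlHost=example-host.invalid
mlUsername=
"""

ml_host = read_property(sample_config, "mlHost")
print(ml_host)
```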

Create a Secret

As a best practice when working in AWS, we will store the MarkLogic host and credentials in AWS Secrets Manager:

Log into the AWS Management Console.
Click on the “Services” dropdown link at the top of the AWS Console page.
Under the “Security, Identity & Compliance” section, click “Secrets Manager”.
On the left side of the page, click the “Secrets” link.
On the “Secrets” page, click the “Store a new secret” button.
For the “Select secret type” option, click on “Other type of secrets”.
In the “Secret key/value” table, enter the following keys and values, adding a row as needed:

Key           Value
mlHost        The value of your DHS instance mlHost property
mlUsername    dan
mlPassword    ML@test1

Click the “Next” button at the bottom of the page.
Enter “marklogic-glue-tutorial-secret” for the “Secret name”.
Click the “Next” button at the bottom of the page.
On the “Store a new secret” page, leave the default options and click the “Next” button at the bottom of the page.
On the “Store a new secret” page you will see a summary of information about your secret. Click the “Store” button to save the secret.
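
The same secret could also be created from a script. The sketch below is one possible approach, not part of the console workflow: it assumes `client` is a Secrets Manager client such as `boto3.client("secretsmanager")`, and the helper names `build_secret_request` and `store_secret` are made up for illustration.

```python
import json


def build_secret_request(name: str, ml_host: str, username: str, password: str) -> dict:
    """Build keyword arguments for the Secrets Manager create_secret call."""
    secret_value = {
        "mlHost": ml_host,
        "mlUsername": username,
        "mlPassword": password,
    }
    # The key/value pairs are stored as a single JSON string.
    return {"Name": name, "SecretString": json.dumps(secret_value)}


def store_secret(client, request: dict) -> str:
    """Create the secret and return its ARN.

    `client` is assumed to behave like boto3.client("secretsmanager").
    """
    response = client.create_secret(**request)
    return response["ARN"]
```

A call might look like `store_secret(boto3.client("secretsmanager"), build_secret_request("marklogic-glue-tutorial-secret", ml_host, "dan", "ML@test1"))`, where `ml_host` is the value you copied earlier.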

Note: Secrets are not deleted immediately. Instead, they are scheduled for deletion at a later time that you choose, which gives you time to change any code that uses them. As a result, a secret name cannot be reused until that secret has been completely deleted.

Create an S3 Bucket

The S3 bucket gives you a place on the AWS cloud to store the data (a CSV file) that you will load into your MarkLogic Data Hub Service.

Click on the “Services” dropdown link at the top of the AWS Console page.
Under the “Storage” section, click “S3”.
Click the “Create bucket” button.
For the “Bucket name”, type “marklogic-glue-tutorial-bucket”.
Leave the remaining settings at their default values.
Click the “Create bucket” button at the bottom.
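
For readers who script their setup, here is a hedged sketch of the equivalent API call. It assumes `client` is an S3 client such as `boto3.client("s3")`, and the helper names are illustrative only. One detail the console hides: outside `us-east-1`, the create_bucket call needs an explicit location constraint.

```python
def build_create_bucket_request(bucket_name: str, region: str) -> dict:
    """Build keyword arguments for the S3 create_bucket call."""
    request = {"Bucket": bucket_name}
    # us-east-1 is the default location and is not passed as a constraint;
    # every other region must be named explicitly.
    if region != "us-east-1":
        request["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return request


def create_bucket(client, bucket_name: str, region: str) -> dict:
    """Create the bucket. `client` is assumed to behave like boto3.client("s3")."""
    request = build_create_bucket_request(bucket_name, region)
    client.create_bucket(**request)
    return request
```

Bucket names are globally unique, so a script may need a different name if “marklogic-glue-tutorial-bucket” is already taken.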

Upload Data to S3

For this tutorial, we have provided you with data from an RDBMS database table exported as a CSV file, customer.csv. In this section of the tutorial you will upload that CSV file into your S3 bucket.

Click the “marklogic-glue-tutorial-bucket” name in the list of buckets. The “marklogic-glue-tutorial-bucket” objects and other information display in the new page.
From the location on your computer where you downloaded the resources for this tutorial, drag and drop the “data” folder onto the “marklogic-glue-tutorial-bucket” bucket page. When complete, the “Upload” page should display the CSV file.
Click the “Upload” button at the bottom of the page. The file will be uploaded and the “Status” should show “Succeeded”.
Click the “Close” button at the top-right of the page to return to the “marklogic-glue-tutorial-bucket” bucket page.
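
The drag-and-drop upload keeps the folder name as a key prefix, so the file lands at “data/customer.csv”. The sketch below shows one way to reproduce that from a script. It assumes `client` is an S3 client such as `boto3.client("s3")`. The helper names are illustrative, and the CSV content in the demo is a placeholder, not the tutorial's sample data.

```python
import tempfile
from pathlib import Path


def plan_uploads(local_folder: Path) -> list[tuple[Path, str]]:
    """Pair each file under `local_folder` with the S3 key it would receive."""
    root = local_folder.parent
    plans = []
    for path in sorted(local_folder.rglob("*")):
        if path.is_file():
            # Keep the folder name as a prefix, e.g. "data/customer.csv".
            plans.append((path, path.relative_to(root).as_posix()))
    return plans


def upload_folder(client, bucket: str, local_folder: Path) -> list[str]:
    """Upload every planned file. `client` is assumed to be boto3.client("s3")."""
    keys = []
    for path, key in plan_uploads(local_folder):
        client.upload_file(str(path), bucket, key)
        keys.append(key)
    return keys


# Local demonstration of the key layout; no AWS call is made here.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()
    (data_dir / "customer.csv").write_text("placeholder\n")
    planned_keys = [key for _, key in plan_uploads(data_dir)]

print(planned_keys)
```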

Create AWS Glue Database, Crawler, and Tables

Create a Database

Click on the “Services” dropdown link at the top of the “AWS Console” page.
Under the “Analytics” section, click on “AWS Glue”.
Click the “Databases” link under the “Data catalog” section on the left side of the page.
Click the “Add database” button.
For “Database name”, enter “marklogic-glue-tutorial-database” then click the “Create” button.

Create a Crawler

A “crawler” reads metadata (schema) from a data source and creates tables in an AWS Glue database.

Click on “Crawlers” in the “Data catalog” section.
Click the “Add crawler” button.
For “Crawler name”, enter “marklogic-glue-tutorial-crawler” then click the “Next” button.
On the “Specify crawler source type” page, keep the default values and then click the “Next” button.
On the “Add a data store” page, fill in the following:
- For “Choose a data store”, select “S3”.
- Leave “Connection” blank.
- For “Crawl data in”, select “Specified path in my account”.
- For “Include path”, click the folder icon, then click the “+” button in front of the “marklogic-glue-tutorial-bucket” S3 bucket.
- Select the “data” folder.
- Click the “Select” button. The “Include path” should now be “s3://marklogic-glue-tutorial-bucket/data”.
Click the “Next” button.
On the “Add another data store” page, select “No” and then click the “Next” button.
Next you will create an IAM role with the required permissions for the crawler to access the data in the S3 bucket.
On the “Choose an IAM role” page, select “Create an IAM role” and in the text box type “marklogic-glue-tutorial-role”.
Click the “Next” button.
On the “Create a schedule for this crawler” page, set the “Frequency” to “Run on demand”.
Click the “Next” button.
On the “Configure the crawler’s output” page, set the “Database” to “marklogic-glue-tutorial-database”. Leave all other options as the default.
Click the “Next” button.
Click the “Finish” button on the “Crawler info” page.
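
The database and crawler settings chosen in the console map onto two Glue API calls. The following is a rough sketch under stated assumptions: `client` is a Glue client such as `boto3.client("glue")`, the IAM role already exists (the console wizard creates it for you, a script would not), and the helper names are illustrative. Leaving out a schedule corresponds to “Run on demand”.

```python
def build_crawler_request(name: str, role: str, database: str, s3_path: str) -> dict:
    """Build keyword arguments for the Glue create_crawler call."""
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": database,
        # A single S3 target; no "Schedule" key means the crawler runs on demand.
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_database_and_crawler(client, crawler_request: dict) -> None:
    """Create the Glue database, then the crawler that writes tables into it.

    `client` is assumed to behave like boto3.client("glue").
    """
    client.create_database(DatabaseInput={"Name": crawler_request["DatabaseName"]})
    client.create_crawler(**crawler_request)


crawler_request = build_crawler_request(
    name="marklogic-glue-tutorial-crawler",
    role="AWSGlueServiceRole-marklogic-glue-tutorial-role",
    database="marklogic-glue-tutorial-database",
    s3_path="s3://marklogic-glue-tutorial-bucket/data",
)
```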

Run the Crawler to Create Tables

The “marklogic-glue-tutorial-crawler” should now appear in the list of crawlers. Select the “marklogic-glue-tutorial-crawler”.
Click the “Run crawler” button.
Wait for the crawler to finish. The “Status” column in the “marklogic-glue-tutorial-crawler” row displays “Stopping” while the run winds down and returns to “Ready” once the crawler has completed.
Click “Tables” in the “Data catalog” section, under “Databases”. The list of tables displays.
Verify that the data table was created, and click on it to see the schema that the crawler extracted from the CSV file.
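
Running the crawler and checking for the resulting table can also be scripted. This is a minimal sketch, assuming `client` is a Glue client such as `boto3.client("glue")` and that the crawler state leaves READY as soon as the start call returns. The Glue API reports crawler states as READY, RUNNING, and STOPPING, so the loop waits until the state is READY again. The function names and default intervals are illustrative choices.

```python
import time


def run_crawler_and_wait(client, name: str, poll_seconds: float = 15.0,
                         timeout_seconds: float = 900.0) -> None:
    """Start the crawler and block until it is idle (READY) again."""
    client.start_crawler(Name=name)
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = client.get_crawler(Name=name)["Crawler"]["State"]
        if state == "READY":
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"crawler {name} still in state {state}")
        time.sleep(poll_seconds)


def list_table_names(client, database: str) -> list[str]:
    """Return table names from the first page of get_tables results."""
    response = client.get_tables(DatabaseName=database)
    return [table["Name"] for table in response["TableList"]]
```

After the crawler finishes, `list_table_names(client, "marklogic-glue-tutorial-database")` should include the table named after the “data” folder.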

Configure IAM Role Permissions

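You created the IAM role in the “Create a Crawler” section of this tutorial. In addition to its AWS Glue permissions, this role needs permissions to access data stored in S3 buckets, to read the secret in AWS Secrets Manager, and to read from Amazon Elastic Container Registry (the “AmazonEC2ContainerRegistryReadOnly” policy).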

Click on the “Services” dropdown link at the top of the AWS Console page.
Under the “Security, Identity, & Compliance” section, click “IAM”.
Click “Roles” under the “Access management” section. The list of IAM roles displays.
Click the role name, “AWSGlueServiceRole-marklogic-glue-tutorial-role”. The list of permission policies displays.
Verify the following policy names are listed: AWSGlueServiceRole and AWSGlueServiceRole-marklogic-glue-tutorial-role.
Click the “Attach policies” button.
On the “Attach Permissions” page, type “amazons3” to filter to only Amazon S3-related policies.
Check the “AmazonS3FullAccess” policy name.
Click the “Attach policy” button.
On the Summary page for your “AWSGlueServiceRole-marklogic-glue-tutorial-role” role, click the “Attach policies” button.
On the “Attach Permissions” page, type “secrets” to filter to only Secrets Manager-related policies.
Check the “SecretsManagerReadWrite” policy name.
Click the “Attach policy” button at the bottom of the page.
On the Summary page for your “AWSGlueServiceRole-marklogic-glue-tutorial-role” role, click the “Attach policies” button.
On the “Attach Permissions” page, type “amazonec2” to filter to only Amazon EC2-related policies.
Check the “AmazonEC2ContainerRegistryReadOnly” policy name.
Click the “Attach policy” button at the bottom of the page.
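
The three attachments above can be expressed as a short loop over AWS managed policy ARNs. The sketch assumes `client` is an IAM client such as `boto3.client("iam")`. The function name is illustrative.

```python
# AWS managed policies attached in the steps above.
MANAGED_POLICY_ARNS = (
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
    "arn:aws:iam::aws:policy/SecretsManagerReadWrite",
    "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly",
)


def attach_policies(client, role_name: str, policy_arns=MANAGED_POLICY_ARNS) -> list[str]:
    """Attach each managed policy to the role and return the ARNs attached.

    `client` is assumed to behave like boto3.client("iam").
    """
    attached = []
    for arn in policy_arns:
        client.attach_role_policy(RoleName=role_name, PolicyArn=arn)
        attached.append(arn)
    return attached
```

These broad policies keep the tutorial simple. For anything beyond a tutorial, narrower policies scoped to the specific bucket and secret would be a safer choice.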

Configure the MarkLogic Connector for AWS Glue

Go to the AWS Marketplace and search for “marklogic connector for aws glue” (or find it directly on this page).
Click the “Continue to subscribe” button.
Click the “Accept Terms” button.
Once the subscription has been processed, click the “Continue to configuration” button.
Accept the default settings for delivery method and version, and then click the “Continue to Launch” button.
On the Launch this software page, click the “Usage Instructions” button.
In the pop-up that results, follow the provided link to activate the Glue connector from AWS Glue Studio.
On the Create connection page, enter “marklogic-glue-tutorial-connector” for the connector name.
On the Create connection page, select “marklogic-glue-tutorial-secret” as the secret to use for connection access.
Click on the “Create connection and activate connector” button.
From the “Connectors” page in AWS Glue Studio, expand the navigation links on the upper-left side of the page (if they are not already visible) and then click the “Jobs” link.
On the “Jobs” page, in the “Create job” table, set the source to “S3” (which should be the default) and set the target to “MarkLogic Connector for AWS Glue”.
Click the “Create” button.

Configure Job in AWS Glue Studio

In AWS Glue Studio, rename your new job from “Untitled” to “MarkLogic Glue Tutorial Job” by clicking the pencil icon to edit the job name.
In AWS Glue Studio, you will automatically start on the “Visual” tab, which contains the GUI for defining your job. From the visual modeler, click the “S3 bucket” data source to configure it.
Set the “Database” to “marklogic-glue-tutorial-database” and the “Table” to “data”.
From the visual modeler, click the “MarkLogic Connector for AWS Glue” data target to configure it.
Set the “Connection” to use the “marklogic-glue-tutorial-connector” that you created earlier.
Under “Connection options” click the “Add new option” button.
Connection options are key/value pairs that control how specific aspects of your data are written to MarkLogic. Create the following key/value pairs to set the URI prefix and collections for the resulting data in MarkLogic:

Key            Value
uriprefix      /customer/
collections    customer

Next, you will leave the visual modeler to configure additional aspects of the job. Click the “Job details” tab and configure the job to use the IAM role that you created earlier in this tutorial.
Click the “Save” button to save your job.
Click the “Run” button to run your job.
You will receive a message at the top of the screen indicating that your job has started. Click the “Run Details” link in that message.
When the job completes, you should see a status of success. If the job fails, links to the logs are provided to help you debug the issue.
To validate your results, check to see that the data was loaded to your MarkLogic Data Hub Service instance by using the Query Console endpoint. Your data should be present in the “staging” database associated with your Data Hub.
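
Once the job has been saved in AWS Glue Studio, later runs can be started and monitored from a script rather than the “Run” button. The sketch below assumes `client` is a Glue client such as `boto3.client("glue")`. The function name and polling defaults are illustrative, and the set of terminal states reflects the job run states documented for the Glue API.

```python
import time

# Job run states after which no further progress is expected.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_job_and_wait(client, job_name: str, poll_seconds: float = 30.0,
                     timeout_seconds: float = 3600.0) -> str:
    """Start a job run and return its final state, e.g. "SUCCEEDED"."""
    run_id = client.start_job_run(JobName=job_name)["JobRunId"]
    deadline = time.monotonic() + timeout_seconds
    while True:
        job_run = client.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = job_run["JobRunState"]
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_name} still in state {state}")
        time.sleep(poll_seconds)
```

A call such as `run_job_and_wait(client, "MarkLogic Glue Tutorial Job")` returns the final state. Anything other than "SUCCEEDED" is a cue to check the job logs.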
