
Getting Started with Data Hub Service (DHS) on AWS

Gabo Manuel
Last updated October 31, 2018

Amazon Web Services (AWS) has greatly assisted businesses in establishing and maintaining their internet presence by taking on hardware infrastructure management. AWS removes the need for companies to handle procurement, maintenance, monitoring, and replacement or upgrade of hardware. Even with AWS, however, system administrators must monitor their Elastic Compute Cloud (EC2) instances to guarantee availability, scaling, routing optimization, load balancing, software upgrades, security patches, and so on. MarkLogic Data Hub Service (DHS) takes over that system administrator role so your business can focus on actually taking advantage of your data rather than worrying about cloud computing concerns.

This guide is intended to get readers who are new to AWS up and running, and focuses on the different pieces needed along the way. It is divided into two parts: System Administrator and Developer.

System Administrator

The system administrator is responsible for getting the AWS and DHS environments up and running by provisioning servers and services. This role has no direct access to the data itself.

1. Understand Virtual Private Clouds (VPCs)

VPC diagram
Figure 1: A simplistic overview of the server and network resources to establish for DHS

A Virtual Private Cloud (VPC) is a virtual network of hosts that is isolated from all other virtual networks. This allows finer control over access to the hosts inside, reduces the risk of unauthorized communication to or from outside the network, and improves communication between hosts by removing the need to traverse the wider internet to reach a host sitting nearby.

In Figure 1, the VPC on the left is the "customer" VPC. This is the VPC that we will configure ourselves in this tutorial. We can control who can access these servers via internal authentication settings, LDAP servers, or other means. We can also install on this VPC our own applications that make use of MarkLogic and serve the end users' needs.

The VPC on the right in Figure 1 is the MarkLogic Service VPC. This is managed by MarkLogic and employs auto-scaling configurations to automatically increase and decrease the number of resources as usage spikes and drops. This handles devops-related issues for the DHS user.

The two VPCs communicate with each other via a peering connection that is set up between them. To allow smooth communication with the ever-changing number of MarkLogic servers, a load balancer sits in front of them to coordinate with the customer VPC.

A more detailed architectural diagram is available in the MarkLogic Data Hub Service Admin Online Help. Amazon AWS also provides information about their VPC and VPC Peering.

The MarkLogic Service VPC can be configured as public or private. Making the VPC private allows access only from pre-configured VPCs via peering roles. If you intend to host your customer-facing applications outside AWS, or simply want to try out the service, then private may not be the option for you. Sections in this guide are tagged [Private] if they are required only when using a private MarkLogic Service VPC. Note that selecting a publicly available MarkLogic Service VPC comes with the risk of unexpected traffic and usage.

Please look at the Resource Checklist in the Appendix before moving forward; this checklist will help in keeping track of the resource IDs we will generate and use.

2. Create AWS account

If you already have an AWS account, you may skip this section of the guide.

Go to the AWS account creation page, shown below, and complete the sign-up process, including providing your payment method and verifying your contact number.

Create AWS Account

Go to your AWS home page >> Support >> Support Center:

AWS Support Center

On the Support Center page, take note of your "AWS Account Number"; we will be using it later.

AWS Support Number
Figure 2: AWS Support Center with AWS Account Number

3. Create your key pairs [Private]

Create the public and private key pairs that will be used later in building the VPC.

  1. Go to https://console.aws.amazon.com/ec2/
  2. Under "Network and Security", click on "Key Pairs"
  3. Provide a name and save the certificate file somewhere safe.

You should end up with something like the following:

Create key pairs

Note: Create this key pair for each region in which you intend to create a VPC. Keep the generated certificate file(s) secure, as they grant root-level SSH access to the bastion server that gets provisioned.
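
If you prefer the command line, the AWS CLI can create an equivalent key pair; this is only an alternative sketch, and the key name is an example that must match whatever you supply later in the CloudFormation form:

aws ec2 create-key-pair \
    --key-name my-dhs-vpc-key \
    --query 'KeyMaterial' \
    --output text > my-dhs-vpc-key.pem
chmod 400 my-dhs-vpc-key.pem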

4. Create your VPC [Private]

If your company has its own policies for establishing VPCs, take note of the following information for use later in the guide:

  1. Bastion Host IP
  2. Public Route Table ID
  3. Private Route Table ID
  4. VPC ID
  5. VPC CIDR
  6. VPC Public and Private Subnet CIDRs

If you have no experience setting up VPCs, you may follow the steps below:

  1. Go to your Amazon console and look for "CloudFormation" under "Management Tools":
    AWS Cloud Formation
  2. Create a new stack by specifying an Amazon S3 template URL (example template).
    • If you decide to use more than three availability zones, download the above template and modify the file to add more entries for private and public subnets, route associations, etc.
    • In that case, use "Upload a template to Amazon S3" as the option when creating the CloudFormation stack instead of using the above link.

  3. Select three of your preferred availability zones. DHS requires at least three availability zones to ensure high availability (HA): should one availability zone fail, the cluster continues to function.
  4. Adjust the various Classless Inter-Domain Routing (CIDR) values as needed. CIDR is used to allocate an IP address for each subnet. More information about how these CIDR values are used is available in the AWS VPC User Guide.
  5. Provide the key name generated in the previous section, "3. Create your key pairs" (in this case, my-dhs-vpc-key).

VPC Configuration
Figure 3: Sample VPC Configuration

Specify your IAM roles on the options page that follows the above form, and proceed after reviewing the last page. Stack creation can take a while to complete, but you should eventually see something like the following. Take note of the public and private subnet route tables, the public and private CIDRs used to execute this stack, and the BastionHostIP; we will be using them later.

Create stack complete
Figure 4: Stack details, including public and private subnet route tables, the public and private CIDRs used to execute this stack, as well as the BastionHostIP

Your AWS VPC home page should look like the following:

VPC Dashboard
Figure 5: VPC Dashboard

Clicking on "VPCs" will display the VPC IDs, which we will use later:

VPC ID
Figure 6: VPC Details, including ID

5. Sign up for MarkLogic Cloud Service

If you have already subscribed to the MarkLogic Cloud Service, you may skip this section of the guide.

Go to the AWS Marketplace and search for "MarkLogic". Look for the "MarkLogic Cloud Service" entry as shown below and click "Subscribe" on the page that loads.

MarkLogic Cloud Service

6. Create MarkLogic Cloud Service account

After subscribing to the MarkLogic Cloud Service, you will be redirected to the MarkLogic Cloud Service home page. This is a separate account from your AWS account, so click on "Create new Account" to proceed. Once logged in, click on your name in the upper right hand corner of the page and note the MarkLogic Service ID.

MarkLogic Service ID
Figure 7: MarkLogic Service ID

7. Create the VPC peer role [Private]

As shown in the diagram in Figure 1, we need to allow the "customer" VPC to communicate with the MarkLogic VPC by creating a "peer role". More information about VPC peering is available in the AWS VPC Peering Documentation.

  1. Create a new CloudFormation stack using this template
  2. Complete the form below using the information we have gathered so far. Refer to Figure 7 from "6. Create MarkLogic Cloud Service account" for the MarkLogic Service ID and to Figure 6 from "4. Create your VPC" for the VPC ID.
    VPC Peer Role
  3. Click "Next" to proceed.
  4. Specify your IAM roles in the options page following the above form.
  5. Check the box for "I acknowledge that AWS CloudFormation might create IAM resources." and click "Create".
  6. This can take a while to complete, but you should eventually see something like the following. Note the "RoleARN" in the "Outputs" tab; we will be using it later.

VPC Peer Role Creation Complete
Figure 8: CloudFormation stack details with RoleARN in the Outputs tab

8. Configure your network

Return to cloudservices.marklogic.com and click on the "Network" tab. Supply the information we have gathered thus far. If you are using a private MarkLogic service VPC, select "No" for Public Accessibility:

Network Configuration
Figure 9: Network configuration

  • VPC ID is the AWS VPC ID from Figure 6 of the "4. Create your VPC" section.
  • AWS Account ID is the AWS Account Number from Figure 2 of the "2. Create AWS Account" section.
  • Peer Role ARN is from Figure 8 of the "7. Create the VPC Peer Role" section. Do not include the trailing space/tab when you copy from a web page to this form.
  • VPC CIDR is the CIDR for the MarkLogic VPC shown on the right-hand side of Figure 1 in the "1. Understand Virtual Private Clouds (VPCs)" section. This can be left as is if you used the default values during the steps in "4. Create your VPC". Make sure the value does not overlap with the User Subnet CIDRs below.
  • User Subnet CIDRs are all of the public and private subnet CIDRs from Figure 4 of the "4. Create your VPC" section.
  • Region is the region in which you executed the CloudFormation template in the "4. Create your VPC" section.

This may take a while to complete, so hit the refresh button on the right every now and then. Eventually, you will see that the network configuration has completed. Take note of the Peering Connection ID, e.g. pcx-070ec6719c1d60c8b, as well as the public and private CIDRs that were generated. We will be using these later.

Network Configuration Complete
Figure 10: Network configuration completed, with Peering Connection ID and public and private CIDRs generated.

If you intend to use a publicly available MarkLogic Service VPC, select "Yes" for Public Accessibility when configuring the network. Then you will only need to supply the region you wish the VPC to reside in. There is no need to keep track of the CIDRs since the load balancers are publicly available.

9. Create the DHS instance

On the MarkLogic Cloud Services homepage, click on the "+ Data Hub Service" tab and supply the following information:

Create Data Hub Service
Figure 11: Create Data Hub Service

Note that the cost adjusts depending on the capacity that you specify; the higher the capacity value, the higher the hourly cost. In this tutorial, we select the lowest capacity. Click "Create" to spawn the MarkLogic VPC as described in Figure 1. This can take a while, approximately 10 minutes or so; hit the "refresh" icon on the upper left periodically until you see the following confirmation:

Data Hub Service creation confirmation

Click on the "View" button to see the Data Hub Service Details:

Data Hub Service Details
Figure 12: Data Hub Service details

The available ports can be described as follows:

Endpoint | Port | Content Database | Description
Analytics | 8008 | data-hub-FINAL | Data Services First endpoint.
Operations | 8009 | data-hub-FINAL | Data Services First endpoint. This is hosted separately from Analytics.
Ingest | 8005 | data-hub-STAGING | XDBC server to be used by MLCP.
Flows | 8006 | data-hub-STAGING | Port to be used when running your ingest and harmonization flows. Supports MarkLogic's built-in REST API, which allows you to confirm what got loaded into your STAGING database; the URL parameter "database=data-hub-FINAL" lets you run a general search on the contents of the FINAL database.
Jobs | 8007 | data-hub-JOBS | Allows the user to view jobs and traces.
Manage | 8002 | App-Services | Port to be used when loading modules, updating indexes, and uploading your TDE templates.
REST | 8004 | data-hub-MODULES | Port to be used to view your curated data and to load the REST extensions developed by your team. Supports MarkLogic's built-in REST API, which allows you to confirm and review the code you have uploaded.
ODBC | 5432 | data-hub-FINAL | Port to be used by BI tools.
Figure 13: Available Data Hub Service endpoints

10. Manage DHS access

The links under "Endpoints" are disabled until users are created. Specify users to proceed with using your DHS instance. Note that the service admin does not have access to all actions by default. Click on the "Manage Access" button to add users with specific roles.

Manage Data Hub Service
Figure 14: Users and roles for the Data Hub Service

These roles are described as follows:

Role | Can do...
Flow Developer | Can load modules into the MarkLogic modules database, load TDE templates, and update indexes. Basically, your gradle task executor.
Flow Operator | Can ingest data and run your ingest flows; can execute harmonization flows as well.
Endpoint Developer | A subset of "Flow Developer". Can load modules, as long as they do not overwrite existing modules that he or she did not upload. Cannot upload TDE templates or update indexes. Meant for Data Services First (DSF) developers.
Endpoint User | Meant to be used for consuming the Analytics and Operations ports.
ODBC User | Meant to be used for port 5432.
Figure 15: DHS user roles

Note that the users created in the screens above do not have SSH access to your servers. These are purely MarkLogic accounts used to connect to the above-mentioned ports.

After creating the necessary users, go back to the service page, right-click on "Analytics", and copy the link's target URL, noting the value. Do the same for "Operations". "Ingest", "Flows", "Jobs", "Manage", and "REST" all point to the same host on different ports. Map them using the table below:

Endpoint Value
Analytics http://internal-mlaas-aalb-1f50xgvrnt8yn-911654599.us-east-1.elb.amazonaws.com:8008
Operation http://internal-mlaas-oalb-1cwwevqwhl8s7-1243328676.us-east-1.elb.amazonaws.com:8009
Ingest http://internal-mlaas-icalb-gaqd4thwz10v-1190497851.us-east-1.elb.amazonaws.com:8005
Flows http://internal-mlaas-icalb-gaqd4thwz10v-1190497851.us-east-1.elb.amazonaws.com:8006
Jobs http://internal-mlaas-icalb-gaqd4thwz10v-1190497851.us-east-1.elb.amazonaws.com:8007
Manage http://internal-mlaas-icalb-gaqd4thwz10v-1190497851.us-east-1.elb.amazonaws.com:8002
REST http://internal-mlaas-icalb-gaqd4thwz10v-1190497851.us-east-1.elb.amazonaws.com:8004
Figure 16: Data Hub Service endpoints and values

We can then collect the host names and label them as follows, since we will be using these values later for our tunnel settings.

Host Name Value
Analytics internal-mlaas-aalb-1f50xgvrnt8yn-911654599.us-east-1.elb.amazonaws.com
Operation internal-mlaas-oalb-1cwwevqwhl8s7-1243328676.us-east-1.elb.amazonaws.com
Curation internal-mlaas-icalb-gaqd4thwz10v-1190497851.us-east-1.elb.amazonaws.com
Figure 17a: Data Hub Service host name and values (for private VPC)

Trivia: these host names are load balancers, hence the naming patterns "aalb", "oalb", and "icalb", which translate to analytics application load balancer, operations application load balancer, and ingest/curation application load balancer, respectively.

If you opt to use a publicly accessible MarkLogic Service VPC, note that the values are different; in particular, the "internal-" prefix is not present:

Host Name Value
Analytics mlaas-aalb-524fm3bqdeuf-1078566978.us-east-1.elb.amazonaws.com
Operation mlaas-oalb-4wxdctlga9ym-1402952872.us-east-1.elb.amazonaws.com
Curation mlaas-icalb-ftj4dp27w4fo-1601708577.us-east-1.elb.amazonaws.com
Figure 17b: Data Hub Service host name and values (for public VPC)

11. Configure routing [Private]

At this point, our "customer" VPC and MarkLogic VPC are up and running. We have our peering role set up to allow communication between the two VPCs, but they do not yet know how to find each other. We need to configure route tables so that our customer VPC knows the IP addresses of the MarkLogic VPC. More information about VPC routing is available in the AWS documentation on VPC Peering Routing and VPC Route Tables.

Create a new CloudFormation stack using this template. The public and private subnet route table IDs should match the "PublicSubnetRouteTableID" and "PrivateSubnetRouteTableID" values from Figure 4 in the "4. Create your VPC" section.

Alternatively, you can identify the route tables yourself: your public route table will contain an "igw-" prefixed entry in the "Routes" tab, as shown below:

Public Route Tables
Figure 18: Public Route Tables

Your private route table, meanwhile, will contain a "nat-" prefixed entry in the "Routes" tab, as shown below:

Private Route Tables
Figure 19: Private Route Tables

The Service Public and Private Subnet CIDRs should match the Public and Private Subnet CIDRs from the network configuration in Figure 10 from the "8. Configure your network" section.
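
Under the hood, this routing stack adds routes that send traffic destined for the MarkLogic service subnets through the peering connection. A rough AWS CLI equivalent for a single such route, using placeholder values from the Resource Checklist (the template itself handles all the route tables and subnets for you), would be:

aws ec2 create-route \
    --route-table-id rtb-09cb034df9ee6b0e0 \
    --destination-cidr-block <SERVICE-SUBNET-CIDR> \
    --vpc-peering-connection-id pcx-070ec6719c1d60c8b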

You can specify your IAM roles on the options page that follows the above form and proceed after reviewing the last page. This can take a while to complete, but you should eventually see something like the following:

Data Hub Service routing confirmation

At this point, you have a MarkLogic cluster up and running. Your bastion server can now be hardened. Additional SSH user accounts can be created for the developers who need to connect or tunnel to the bastion server to deploy modules and other MarkLogic configuration. Note that these SSH user accounts are separate from the MarkLogic user accounts created in the "10. Manage DHS access" section; they are not AWS accounts either. Please do not share the ec2-user certificate with your peers.

Developer

The provisioned MarkLogic instances are accessed via the load balancers mentioned in the "10. Manage DHS access" section. If you chose a private MarkLogic Service VPC, these load balancers cannot be accessed directly; they can only be reached through the bastion server (see BastionHostIP in Figure 4 of the "4. Create your VPC" section). To load our modules or to push data via MLCP or DMSDK, we can either run the commands on the bastion server or run them locally through a tunnel. This guide covers the latter approach.

1. Tunnel setup [Private]

Windows using PuTTY

  1. Convert the certificate file generated in the "3. Create your key pairs" section of the System Administrator portion into a public/private key pair that PuTTY understands. There are several sets of instructions online on how to do this; a typical choice is to use PuTTYgen.exe.
  2. For Host Name, supply your bastion IP address (see BastionHostIP from Figure 4 in the "4. Create your VPC" section, i.e. 54.84.95.145 in this case).
  3. Under Connection >> Data, specify ec2-user as "Auto-login username"
  4. Under Connection >> SSH >> Tunnels, add the following entries:
Source Port Destination
8002 <CURATION>:8002
8004 <CURATION>:8004
8005 <CURATION>:8005
8006 <CURATION>:8006
8007 <CURATION>:8007
8008 <ANALYTICS>:8008
8009 <OPERATIONAL>:8009
Figure 20: SSH Tunnels - ANALYTICS, CURATION and OPERATIONAL values are taken from Figure 17 in the "10. Manage DHS access" section

Mac / Linux using SSH

Save the following into a file, e.g. dhs-tunnel.sh:

Figure 21: Save commands as dhs-tunnel.sh
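
A minimal sketch of what dhs-tunnel.sh can contain, assuming the same port mappings as the PuTTY table above (replace the placeholders and example values with your own):

#!/bin/bash
# Forward the DHS ports through the bastion host.
# <ANALYTICS>, <OPERATIONAL> and <CURATION> are the host names from Figure 17.
ssh -i /path/to/cert.pem \
    -L 8002:<CURATION>:8002 \
    -L 8004:<CURATION>:8004 \
    -L 8005:<CURATION>:8005 \
    -L 8006:<CURATION>:8006 \
    -L 8007:<CURATION>:8007 \
    -L 8008:<ANALYTICS>:8008 \
    -L 8009:<OPERATIONAL>:8009 \
    -N ec2-user@54.84.95.145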
  • Adjust "/path/to/cert.pem" to match where you saved the certificate file generated in the "3. Create your key pairs" section.
  • ANALYTICS, CURATION and OPERATIONAL values are taken from Figure 17 in the "10. Manage DHS access" section.
  • Adjust "@54.84.95.145" to match your Bastion IP address from Figure 4 in the "4. Create your VPC" section.

Change the file mode to allow execution, e.g. chmod +x dhs-tunnel.sh, and run the file, e.g. ./dhs-tunnel.sh.

Port conflicts

Developers with an existing local installation of MarkLogic will notice a potential conflict on port 8002. You can work around this either by using 18002 (or another free port) for the tunnel, or by changing the "Manage" port of your local MarkLogic installation. In this guide, we assume that port 8002 is used by the tunnel to communicate with the DHS instance.
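
For example, the Manage tunnel could forward local port 18002 instead (you would then use localhost:18002 and adjust mlManagePort accordingly); a sketch of the ssh form of that single mapping:

ssh -i /path/to/cert.pem -L 18002:<CURATION>:8002 -N ec2-user@54.84.95.145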

2. Checking the status page

At this point, load the "Manage" endpoint to check the initial configuration. For existing MarkLogic developers, this is your standard "Manage" app server page. If you are using a private MarkLogic Service VPC, use localhost:8002 (or the equivalent port configured in your tunnel setup). If you are using a publicly accessible MarkLogic Service VPC, use <CURATION>:8002 (e.g. mlaas-icalb-ftj4dp27w4fo-1601708577.us-east-1.elb.amazonaws.com:8002). When prompted for credentials, use an account configured with the "Flow Developer" role to access this page.

Configuration Manager
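
If you prefer the command line, a quick way to confirm that the tunnel and credentials work is to call the Manage REST API on the same port. This assumes the standard digest-authenticated /manage/v2 endpoint is reachable in your setup, and the credentials below are placeholders:

curl --digest -u flow-developer-user:password \
    "http://localhost:8002/manage/v2?format=json"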

3. Flow Developer and Operator

Note that you currently cannot use the Data Hub Framework QuickStart to develop directly on top of the DHS instance: most of the gradle tasks it executes under the hood require privileges not granted to the roles provisioned in DHS. The following sections provide guidelines on what needs to be done, and what can be done, to load your modules into the DHS instance.

Project configuration

Your DHF QuickStart-generated project, or gradle-generated project, will need some adjustments. You may use the following content as a template for your gradle-local.properties.

Figure 22: Template for gradle-local.properties
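
A minimal sketch of what gradle-local.properties might contain when tunneling to DHS. Exact property names depend on your DHF and ml-gradle versions, and the user and password are placeholders:

# Host: localhost when using the SSH tunnel, or the <CURATION> load balancer
# when you have direct access (public VPC, bastion host, or inside the customer VPC).
mlHost=localhost

# Account with the "Flow Developer" role created in "10. Manage DHS access".
mlUsername=your-flow-developer-user
mlPassword=your-password

# Adjust if your tunnel forwards a different local port (e.g. 18002).
mlManagePort=8002

# DHS sits behind load balancers.
mlIsHostLoadBalancer=true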

The mlManagePort needs to be adjusted if your tunnel is not using 8002, which would be the case if you have a local install of MarkLogic. Also, mlHost should point to the <CURATION> load balancer if you have direct access to it (e.g. when using a publicly accessible MarkLogic Service VPC, or when deploying modules from inside the customer VPC or the bastion host).

Note that we no longer need to issue gradle mlDeploy (and could not even if we tried): the provisioned role is not allowed to perform all the tasks that mlDeploy would do.

Gradle tasks

The table below lists the available tasks and their purposes. The assumption is that you are working on a DHF project rather than a traditional ml-gradle project (the working directories are different, among other things).

Task | Purpose
mlLoadModules / mlReloadModules | Deploy your code (or redeploy, if needed) under <project-root>/plugins/
mlUpdateIndexes | Deploy your indexes, as defined in <project-root>/src/hub-internal-config/databases/your-db.json
mlDeployViewSchemas | Deploy your TDE templates, as defined in <project-root>/src/hub-internal-config/schemas/tde/your-template.json
hubRunFlow | Run your harmonization flow
Figure 23: Gradle tasks and purposes
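
For example, after filling in gradle-local.properties, a typical deployment pass might chain several of the tasks above in a single invocation:

gradle mlLoadModules mlUpdateIndexes mlDeployViewSchemas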

Ingest of Data via MLCP

MarkLogic Content Pump (MLCP) has been the tool of choice for data ingest for years. The following -options_file can be used as a reference when running the MLCP command:

Figure 24: Options file for reference with MLCP
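
A rough sketch of such an options file, one option or value per line. The credentials, input path, transform module location, and flow parameters are placeholders that depend on your DHF version and project:

import
-host
localhost
-port
8005
-username
your-flow-operator-user
-password
your-password
-input_file_path
/path/to/your/data
-input_file_type
documents
-output_uri_replace
/path/to/your/data,''
-transform_module
/path/to/your/dhf/mlcp-flow-transform-module
-transform_param
entity-name=YourEntity,flow-name=YourInputFlow
-restrict_hosts
true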
  • Take note of the difference in the -transform_module.
  • -restrict_hosts is also a key factor to make the MLCP command execute via tunnel.
  • Port 8006 may also be used.
  • At least MLCP version 9.0.7 is required.

Ingest of Data via DMSDK

The following sample code can be used as reference when developing your ingest code:

Figure 25: Sample ingest code
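
A minimal sketch of DMSDK ingest through the tunnel, using the Java Client API's WriteBatcher. The host, port, credentials, and the server transform name and parameters are placeholders; the transform your DHF version expects may differ:

import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.datamovement.DataMovementManager;
import com.marklogic.client.datamovement.WriteBatcher;
import com.marklogic.client.document.ServerTransform;
import com.marklogic.client.io.Format;
import com.marklogic.client.io.StringHandle;

public class DhsIngest {
    public static void main(String[] args) {
        // localhost:8006 assumes the tunnel from Figure 20; point at the CURATION
        // load balancer directly if you are using a public VPC.
        DatabaseClient client = DatabaseClientFactory.newClient("localhost", 8006,
            new DatabaseClientFactory.DigestAuthContext("your-flow-operator-user", "your-password"));

        DataMovementManager dmm = client.newDataMovementManager();

        // Placeholder transform: point this at the input-flow transform your DHF version exposes.
        ServerTransform transform = new ServerTransform("your-input-flow-transform");
        transform.put("entity-name", "YourEntity");
        transform.put("flow-name", "YourInputFlow");

        WriteBatcher batcher = dmm.newWriteBatcher()
            .withBatchSize(100)
            .withThreadCount(4)
            .withTransform(transform)
            .onBatchSuccess(batch ->
                System.out.println(batch.getJobWritesSoFar() + " documents written so far"))
            .onBatchFailure((batch, throwable) -> throwable.printStackTrace());

        dmm.startJob(batcher);

        // Queue documents for writing; add as many as needed before flushing.
        batcher.add("/example/doc1.json",
            new StringHandle("{\"hello\":\"world\"}").withFormat(Format.JSON));

        batcher.flushAndWait();
        dmm.stopJob(batcher);
        client.release();
    }
}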
  • MarkLogic Java Client API v 4.1.1 is required.
  • Read more about the Data Movement SDK

Harmonization

Harmonization can be executed as normal:

gradle hubRunFlow -PentityName="YourEntity" -PflowName="YourHarmonizeFlowName" -PflowType="harmonize"

4. Data Services First (DSF) Developer


The Analytics server on port 8008 and the Operations server on port 8009 are both Data Services First (DSF) API app servers. See the previous section regarding deployment of MarkLogic modules. Read more about Data Services First.

Further Reading and Feedback

If you run into issues using MarkLogic Data Hub Service, please contact Support. MarkLogic engineers and enthusiasts are also active on Stack Overflow; be sure to tag your questions as ‘marklogic’.

Additional reading is also recommended.

Appendix: Resource Checklist

This table can be used to keep track of what we need for each stage of configuration. This is particularly useful for the System Administrator, given the number of actions that need to be taken.

Field | Created in | Requires | Example Value
AWS Account | 2. Create AWS Account | Email address; billing information |
AWS Account ID | 2. Create AWS Account | | 893017339836
key-pair-name | 3. Create your key pairs | AWS account | my-dhs-key-pair
AWS Certificate | 3. Create your key pairs | AWS account | my-dhs-key-pair.pem
VPC CIDR | | | 10.0.0.0/16
VPC Public and Private Subnet CIDRs | | | 10.0.0.0/23, 10.0.32.0/23, 10.0.64.0/23, 10.0.96.0/23, 10.0.128.0/23, 10.0.160.0/23
VPC ID | 4. Create your VPC | VPC CIDR; VPC Public and Private Subnet CIDRs; key-pair-name | vpc-09947d36ca86a6e54
Bastion Host IP | 4. Create your VPC | | 54.84.95.145
Public Route Table ID | 4. Create your VPC | | rtb-09cb034df9ee6b0e0
Private Route Table ID | 4. Create your VPC | | rtb-07aab23133475c680
MarkLogic Service ID | 5. Sign up for MarkLogic Cloud Service | MarkLogic Cloud Services subscription in AWS Marketplace; MarkLogic Cloud Service page signup | 092937385570
RoleARN | 7. Create the VPC peer role | VPC ID; MarkLogic Service ID | arn:aws:iam:893017339836:role/dhs-peer-role-stack-peerRole-AHK91EXK0T1E
Peering Connection ID | 8. Configure your network | VPC ID; AWS Account ID; Peer Role ARN; VPC Public and Private Subnet CIDRs; Region | pcx-070ec6719c1d60c8b
Analytics Load Balancer | 10. Manage DHS access | DHS users configured | internal-mlaas-aalb-1f50xgvrnt8yn-911654599.us-east-1.elb.amazonaws.com
Operations Load Balancer | 10. Manage DHS access | DHS users configured | internal-mlaas-oalb-1cwwevqwhl8s7-1243328676.us-east-1.elb.amazonaws.com
Curation Load Balancer | 10. Manage DHS access | DHS users configured | internal-mlaas-icalb-gaqd4thwz10v-1190497851.us-east-1.elb.amazonaws.com

Stack Overflow: Get the most useful answers to questions from the MarkLogic community, or ask your own question.
