Amazon Web Services (AWS) accelerates businesses’ ability to establish and maintain their internet presence by managing hardware infrastructure. This removes the need for companies to handle procurement, maintenance, monitoring, and replacement or upgrade of hardware. System administrators are instead tasked with monitoring these Elastic Compute Cloud (EC2) instances to guarantee availability, scaling, routing optimization, load balancing, software upgrades, and security patches. MarkLogic Data Hub Service makes this even easier. MarkLogic Data Hub Service is a fully automated cloud service for integrating data from silos. Delivered as a cloud service, it provides on-demand capacity, auto-scaling, automated database operations, and proven enterprise data security. This enables agile teams to immediately start the work that delivers business value: integrating and curating data for both operational and analytical use.
This guide gets non-AWS users up and running quickly, focusing on the specific components you need. It is not meant to be the ultimate reference for all the technologies involved; further reading is recommended.
MarkLogic will be hosted in the cloud via AWS. The diagram below provides a simplified overview of the server and network resources that we want to establish:
Figure 1: A simplified overview of the server and network resources to establish for Data Hub Service
A Virtual Private Cloud (VPC) is a virtual network of hosts that is isolated from all other virtual networks. This provides more control over the hosts inside, reduces the risk of unauthorized communication to and from outside the network, and improves communication between hosts by removing the need to traverse the wider internet to reach a host sitting nearby.
This VPC is managed by MarkLogic and employs auto-scaling configurations to automatically increase and decrease the number of resources as usage spikes and drops, handling DevOps-related issues for the Data Hub Service user. It can be configured to be publicly accessible or hidden behind a bastion server for compliance purposes. To allow smooth communication with the ever-changing number of MarkLogic servers, a load balancer sits in front to coordinate incoming requests.
A more detailed architectural diagram is available in the MarkLogic Data Hub Service Admin Online Help.
Please look at the Resource Checklist in the Appendix before moving forward; this checklist will help in keeping track of the resource IDs we will generate and use.
If you already have an AWS account, then you may skip this section of the guide.
Signing up for an AWS account will display the screen below. Complete the process to create an AWS account, including your payment method and verification of your contact number.
If you have already subscribed to MarkLogic Cloud Service, then you may skip this section of the guide.
Go to the Amazon marketplace and search for “MarkLogic”. Look for the “MarkLogic Cloud Service” entry as shown below and click “Subscribe” on the loaded page.
After subscribing, you should be redirected to the MarkLogic Cloud Service home page. Note that this account is separate from your AWS account, so click “Create a new Account” to proceed.
At this point, you can decide whether to have a publicly accessible MarkLogic cluster or a private cluster accessible only via a bastion server. A public cluster is easy to set up and is recommended for people getting familiar with the product. Private clusters, on the other hand, provide another layer of security by controlling access to the MarkLogic cluster.
This is the simplest and most straightforward way of creating a Data Hub instance.
You should end up with something like the following:
The diagram below provides a simplified overview of the server and network resources that we want to establish:
The VPC on the right is the VPC described in the “1. Server Overview” section at the beginning of this guide. The VPC on the left is a customer-managed VPC. Customers can control who can access these servers via internal authentication settings, LDAP servers, or other means. They can also install applications here that make use of MarkLogic and serve the end user’s needs. This section also gives a simple guide to setting up the peering and routing configurations required for communication between the VPCs.
Note: AWS charges for the customer VPC are separate from DHS charges.
Please use the “Resource Checklist” available at the end of this guide to assist you in keeping track of the various information we will be using.
Go to your AWS home page >> Support >> Support Center:
Click on Support Center and take note of your “AWS Account number”; we will be using this later.
Figure 2: AWS Support Center with AWS Account Number
Create the public and private key pairs that will be used later in building the VPC.
You should end up with something like the following; take note of the “Key pair name”, as we will be using this later:
Note: Create this key pair for each region in which you intend to create a VPC. Secure the generated certificate file(s), as they grant root-level SSH access to the bastion server that gets provisioned.
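If you prefer the AWS CLI to the console, key-pair creation can be sketched as below. The key name and region are placeholders, and the commands are printed for review rather than executed, so you can run them yourself once the AWS CLI is configured with your credentials:

```shell
# Sketch only -- key name and region are placeholders; repeat per
# region in which you intend to create a VPC.
KEY_NAME="my-dhs-key-pair"
REGION="us-east-1"

# create-key-pair emits the private key material; redirect it to a
# .pem file and lock the file down, since it grants root-level SSH
# access to the bastion host.
CREATE_CMD="aws ec2 create-key-pair --key-name $KEY_NAME --region $REGION --query KeyMaterial --output text > $KEY_NAME.pem"
SECURE_CMD="chmod 400 $KEY_NAME.pem"

# Print the commands for review; run them manually to execute.
echo "$CREATE_CMD"
echo "$SECURE_CMD"
```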
If your company has its own policies for establishing VPCs, take note of the following information for later in the guide, then proceed to “Create the VPC Peer Role” below:
For those without any experience in setting up VPCs, follow the steps below:
Figure 3: Sample VPC Configuration
This can take a while to complete, but you should eventually see something like the following. Take note of the public and private subnet route tables, the public and private CIDRs used to execute this stack, and the BastionHostIP; we will be using them later.
Figure 4: Stack details, including public and private subnet route tables, the public and private CIDRs used to execute this stack, as well as the BastionHostIP
Your AWS VPC home page should look like the following:
Figure 5: VPC Dashboard
Clicking on “VPCs” will display your VPCs; note the VPC IDs, which we will use later:
Figure 6: VPC Details, including ID
To obtain your MarkLogic Service ID, which we will be using later, go to the MarkLogic Cloud Service home page and click on your name in the upper right hand corner of the page.
Figure 7: MarkLogic Service ID
As shown in the diagram in Figure 1, we need to allow the “customer” VPC to communicate with the MarkLogic VPC by creating a “peer role”. More information about VPC peering is available in the AWS VPC Peering Documentation.
This can take a while to complete, but you should eventually see something like the following. Note the RoleARN in the “Outputs” tab; we will be using it later.
Figure 8: CloudFormation stack details with RoleARN in the Outputs tab
This may take a while to complete, so hit the refresh button on the right every now and then. Eventually, you will see that the network configuration has completed. Take note of the Peering Connection ID, e.g. pcx-079d5f1a12c607814. Additionally, take note of the public and private CIDRs generated. We will be using these later.
Figure 10: Network configuration completed, with Peering Connection ID and public and private CIDRs generated.
On the MarkLogic Cloud Services homepage, click on the “+ Data Hub Service” tab and supply the following information:
Figure 11: Create Data Hub Service
A “Service Type” of “Development” will have the lowest cost, but also the fewest resources. This is recommended for exploration and proof-of-concept deployments. This type of service still has all the other features of the Data Hub Service except the auto-scaling of resources.
“Production”, on the other hand, will have auto-scaling in effect. You can set the base “Capacity” using the slider as shown below:
Do note that the cost adjusts depending on the capacity you specify. The higher the capacity value, the higher the hourly cost. For the purposes of this guide, we will use the “Development” type of service.
“Private” access is only applicable to networks that have configured “Peering” information. Review the last few steps of the “Private Network” of the previous section for more information.
Clicking “Create” will spawn the MarkLogic VPC as described in “1. Server Overview”. This can take around ten minutes or so. You can hit the “refresh” icon on the upper left to get updates periodically until you see something like the following:
Click on the “View” button to see the Data Hub Service Details:
Figure 12: Data Hub Service details
The available ports can be described as follows:
| Appserver Name | Appserver Identifier | Port | Purpose |
|---|---|---|---|
| ODBC App Server | data-hub-ODBC | 5432 | ODBC: connecting BI tools to TDE views defined in the final schema database, for rows projected from documents in the final database |
| Data Hub Services Manage App Server | data-hub-MANAGE | 8002 | Used by: |
| XDBC Ingest App Server | data-hub-XDBC | 8005 | XDBC ingestion: MLCP ingestion into the staging database using transforms from the modules database |
| REST App Server | data-hub-ADMIN | 8004 | REST configuration: CRUD on the modules database, schemas database, and triggers database, as well as debugging and repair on the final database, using “data-hub-MODULES” as its default content database; specify the database parameter in the REST request for CRUD on the curation, schemas, or triggers databases |
| Data Hub Services Flows App Server | data-hub-STAGING | 8006 | Data Hub Framework curation: DMSDK and NiFi ingestion into the staging database and Data Hub harmonization from the staging database to the final database using transforms, flows, and the Data Hub Framework |
| Data Hub Services Jobs App Server | data-hub-JOBS | 8007 | Data Hub Framework jobs: access to the jobs database |
| Data Hub Services Analyzer App Server | data-hub-ANALYSYS | 8008 | Data Services analysis: invoking endpoints for analysis from the modules database on the final database using no rewriter |
| Data Hub Services Operate App Server | data-hub-OPERATION | 8009 | Data Services operations: invoking endpoints for operations from the modules database on the final database using no rewriter |
| Operations REST | data-hub-OPERATION-REST | 8010 | REST operations: invoking endpoints for operations from the modules database on the final database using the standard REST rewriter |
| Analytics REST | data-hub-ANALYSYS-REST | 8011 | REST analytics: invoking endpoints for analytics from the modules database on the final database using the standard REST rewriter |

Note: The appserver name and the appserver identifier do not match each other.
The links under “Endpoints” are disabled until users are created. To proceed with using your Data Hub Service instance, you need to specify users. Note that the service admin does not have access to all actions by default. Click the “Manage Access” button to add users with specific roles.
Figure 14: Users and roles for the Data Hub Service
These roles are described as follows:
| Role | Description |
|---|---|
| Flow Developer | Can load modules into the MarkLogic modules database, load TDE templates, and update indexes. Basically, your gradle task executor. |
| Flow Operator | Can ingest data and run your ingest flow. Can execute the harmonization flow as well. |
| Endpoint Developer | A subset of “Flow Developer”. Can load modules, but cannot overwrite existing modules that the user did not upload. Cannot upload TDE templates or update indexes. Meant for Data Services First developers. |
| Endpoint User | Meant to be used for consuming the analytics and operations ports. |
| ODBC User | Meant to be used for port 5432. |
Note that the users created in the screens above do not have SSH access to your servers. These are purely MarkLogic accounts used to connect to the above-mentioned ports.
After creating the necessary users, go back to the service page and click the “copy” icon on the right side of the endpoints to capture a JSON document in your clipboard. You can paste this document into your editor of choice. It contains the port configuration that is useful for configuring tunnels, with information similar to the table below.
We can then collect the hosts and label them as follows; we will be using these values later for our tunnel settings.
Trivia: These host names are load balancers, hence the naming pattern of “aalb”, “oalb”, and “icalb”, which translate to analytics application load balancer, operations application load balancer, and ingest/curation application load balancer, respectively.
If you opted for a publicly accessible MarkLogic Service VPC, note that there is a different set of values, since the “internal-” prefix is not present:
This is only applicable for customers who configured their MarkLogic VPC for private access only.
At this point, our “customer” VPC and MarkLogic VPC are up and running. We have our peering role set up to allow communication between the two VPCs, but they do not yet know how to find each other. We need to configure route tables so that our customer VPC knows how to reach the address ranges of the MarkLogic VPC. More information about VPC routing is available in the AWS documentation on VPC Peering Routing and VPC Route Tables.
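Conceptually, the routing configuration adds entries like the following to both the public and private route tables of the customer VPC. The CIDRs and peering connection ID shown here are placeholders; use the MarkLogic-side CIDRs and the Peering Connection ID you recorded earlier:

```
Destination       Target
10.1.0.0/23       pcx-079d5f1a12c607814    (MarkLogic public subnet CIDR -> peering connection)
10.1.2.0/23       pcx-079d5f1a12c607814    (MarkLogic private subnet CIDR -> peering connection)
```

Any traffic addressed to the MarkLogic subnets is thereby steered into the peering connection instead of the internet gateway or NAT gateway.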
Go to your Amazon console and look for “CloudFormation” under “Management Tools”. Create a new CloudFormation stack using this template. The public and private subnet route table IDs will match the “PublicSubnetRouteTableID” and “PrivateSubnetRouteTableID” from Figure 4 in the “Create your VPC” section.
Alternatively, your public route table may contain an “igw-” prefixed entry (not the route table ID) in the “Routes” tab, as shown below:
Figure 18: Public Route Tables
While your private route table would contain an “nat-” prefixed entry in the “Routes” tab, as shown below:
Figure 19: Private Route Tables
The Service Public and Private Subnet CIDRs should match the MarkLogic Network Public and Private Subnet CIDRs from the network configuration in Figure 10 from the “Configure your Network” section.
This can take a while to complete, but you should eventually see something like the following:
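As an alternative to the CloudFormation template, roughly equivalent routes can be added with the AWS CLI. The sketch below only prints the commands so you can review them before running them; every ID and CIDR is a placeholder for the values in your Resource Checklist:

```shell
# Placeholders from earlier steps: route table IDs from the VPC
# stack outputs, the peering connection ID from the network
# configuration, and the MarkLogic service-side CIDRs.
PUBLIC_RTB="rtb-0123456789abcdef0"
PRIVATE_RTB="rtb-0fedcba9876543210"
PEERING_ID="pcx-0123456789abcdef0"
SERVICE_PUBLIC_CIDR="10.1.0.0/23"
SERVICE_PRIVATE_CIDR="10.1.2.0/23"

# One route per route table per service CIDR, all pointing at the
# peering connection. Printed for review; run manually to apply.
for RTB in "$PUBLIC_RTB" "$PRIVATE_RTB"; do
  for CIDR in "$SERVICE_PUBLIC_CIDR" "$SERVICE_PRIVATE_CIDR"; do
    echo "aws ec2 create-route --route-table-id $RTB --destination-cidr-block $CIDR --vpc-peering-connection-id $PEERING_ID"
  done
done
```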
At this point, you already have a MarkLogic cluster up and running. Your bastion server can now be hardened. Additional SSH user accounts can be created for developers who need to connect or tunnel through the bastion server to deploy modules and other MarkLogic configuration. Note that these SSH user accounts are different from the MarkLogic user accounts created in the “2. Account Setup” section. Additionally, these SSH users are not AWS accounts. Please do not share the ec2-user certificate with your peers.
The provisioned MarkLogic instances are accessed via the load balancers mentioned in the “5. Manage MarkLogic Data Hub Service Access” section. If you chose a private MarkLogic Service VPC, these load balancers cannot be accessed directly and can only be reached through the bastion server (see BastionHostIP from Figure 4 in the “Create your VPC” section). To load our modules or to push data via MLCP or DMSDK, we can either execute them on the bastion server or locally through a tunnel. This guide will cover the latter approach.
Make the script executable and secure your certificate before running it:
chmod +x dhs-tunnel.sh
chmod 400 /path/to/cert.pem
Developers with an existing installation of MarkLogic will notice a potential conflict with the use of port 8002. You can work around this either by using 18002 (or another value) to avoid the conflict, or by changing the “Manage” port in your local MarkLogic installation. Here, we assume that 8002 will be used by our tunnel to communicate with our MarkLogic Data Hub Service instance.
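For reference, dhs-tunnel.sh boils down to SSH local port forwarding. A minimal hand-rolled equivalent might look like the sketch below; the bastion IP, certificate path, and load balancer host are placeholders for the BastionHostIP, key file, and curation (icalb) host captured earlier, and the command is printed so you can review it before opening the tunnel:

```shell
# Placeholders -- substitute your BastionHostIP, your key file, and
# the curation (icalb) load balancer host noted earlier.
BASTION_IP="203.0.113.10"
CERT="/path/to/my-dhs-key-pair.pem"
CURATION="internal-mlaas-icalb-example.us-east-1.elb.amazonaws.com"

# Forward local ports through the bastion to the private load
# balancer: 8002 (Manage), 8005 (XDBC ingest), 8006 (staging).
# Remap 8002 (e.g. -L 18002:$CURATION:8002) if a local MarkLogic
# install already uses it.
TUNNEL_CMD="ssh -i $CERT -N \
  -L 8002:$CURATION:8002 \
  -L 8005:$CURATION:8005 \
  -L 8006:$CURATION:8006 \
  ec2-user@$BASTION_IP"

echo "$TUNNEL_CMD"   # review, then run it to open the tunnel
```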
At this point, load the “Manage” endpoint to check the initial configuration available. For existing MarkLogic developers, this is your standard “Manage” app server page. If you are using a private MarkLogic Service VPC, use localhost:8002 (or the equivalent port configured in your tunnel setup). If you are using a publicly accessible MarkLogic Service VPC, use <CURATION>:8002 (e.g. mlaas-icalb-ftj4dp27w4fo-1601708577.us-east-1.elb.amazonaws.com:8002). When prompted for credentials, use your configured account with the “Flow Developer” role to access this page.
Note that you currently cannot use the Data Hub Framework Developer Quickstart to develop directly on top of the MarkLogic Data Hub Service instance: most of the gradle tasks executed under the hood require privileges not granted to the roles deployed in MarkLogic Data Hub Service. See Data Hub Framework Deploy to MarkLogic Data Hub Service for an up-to-date version of these instructions. The following sections provide some guidelines on what needs to be done and what can be done to load your modules into the Data Hub Service instance.
Your DHF Quickstart-generated project or gradle-generated project will need some adjustments. You may use the following content as a template for your gradle-local.properties:
```
# Put your overrides from gradle.properties here
# Don't check this in to version control
mlDHFVersion=4.0.1
mlHost=localhost
mlIsHostLoadBalancer=true
mlUsername=YOUR_FLOW_OPERATOR_USER
mlPassword=YOUR_FLOW_OPERATOR_PASSWORD
mlManageUsername=YOUR_FLOW_DEVELOPER_USER
mlManagePassword=YOUR_FLOW_DEVELOPER_PASSWORD
mlStagingAppserverName=data-hub-STAGING
mlStagingPort=8006
mlStagingDbName=data-hub-STAGING
mlStagingForestsPerHost=1
mlFinalAppserverName=data-hub-ADMIN
mlFinalPort=8004
mlFinalDbName=data-hub-FINAL
mlFinalForestsPerHost=1
mlJobAppserverName=data-hub-JOBS
mlJobPort=8007
mlJobDbName=data-hub-JOBS
mlJobForestsPerHost=1
mlModulesDbName=data-hub-MODULES
mlStagingTriggersDbName=data-hub-staging-TRIGGERS
mlStagingSchemasDbName=data-hub-staging-SCHEMAS
mlFinalTriggersDbName=data-hub-final-TRIGGERS
mlFinalSchemasDbName=data-hub-final-SCHEMAS
mlModulePermissions=flowDeveloper,read,flowDeveloper,execute,flowDeveloper,insert,flowOperator,read,flowOperator,execute,flowOperator,insert
mlIsProvisionedEnvironment=true
mlManagePort=8002
mlAppServicesPort=8002
```
The mlManagePort will need to be adjusted if your tunnel is not using 8002, which would be the case if you have a local install of MarkLogic. Also, mlHost should point to the <CURATION> load balancer if you have direct access (e.g. when using a publicly accessible MarkLogic Service VPC, or when deploying modules from inside the customer VPC or the bastion host).
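For example, if your tunnel maps the Manage port to local port 18002 (a hypothetical choice to dodge a local MarkLogic install on 8002), the relevant overrides in gradle-local.properties would be:

```
mlHost=localhost
mlManagePort=18002
mlAppServicesPort=18002
```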
Note that we no longer issue gradle mlDeploy (you wouldn’t be able to if you tried); the provisioned role is not allowed to perform all the tasks that mlDeploy would do.
The table below lists the tasks available and their purposes. The assumption is that you are working on a DHF project, not a traditional ml-gradle project (e.g. the working directories are different).
| Task | Purpose |
|---|---|
| mlLoadModules | Deploy your code (or redeploy, if needed) under <project-root>/plugins/ |
| mlUpdateIndexes | Deploy your indexes, as defined in <project-root>/src/hub-internal-config/databases/your-db.json |
| mlDeployViewSchemas | Deploy your TDE templates, as defined in <project-root>/src/hub-internal-config/schemas/tde/your-template.json |
| hubRunFlow | Run your harmonization flow |
MarkLogic Content Pump (MLCP) has been the tool of choice for data ingest for years. The -options_file below can be used as a reference when running an MLCP command. Note that -restrict_hosts is also a key factor in making the MLCP command execute via the tunnel.
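A minimal -options_file sketch for tunneled ingest into the XDBC ingest port might look like the following. The host, port mapping, credentials, and input path are placeholders for your own tunnel setup, and MLCP expects one option or value per line; the exact option set will depend on your data:

```
import
-host
localhost
-port
8005
-username
YOUR_FLOW_OPERATOR_USER
-password
YOUR_FLOW_OPERATOR_PASSWORD
-input_file_path
/path/to/input
-restrict_hosts
true
```

Invoke it with mlcp.sh -options_file ingest-options.txt (the file name here is arbitrary).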
The sample code below can be used as a reference when developing your ingest code. Note that MarkLogic Java Client API v4.1.1 is required. To explore more, check out the Data Movement SDK Tutorial.
Harmonization can be executed as normal:
gradle hubRunFlow -PentityName="YourEntity" -PflowName="YourHarmonizeFlowName" -PflowType="harmonize"
The Analytics (port 8008) and Operations (port 8009) app servers are both Data Services First API app servers. See the previous section regarding deployment of MarkLogic modules. Read more about Data Services First.
Additional reading is recommended as follows:
This table can be used to keep track of what we need at each stage of configuration. It is particularly useful for the system administrator, given the number of actions that need to be taken.
| Item | Prerequisites | Sample Value |
|---|---|---|
| AWS Account | Email address | |
| AWS Account ID | | 893017339836 |
| AWS Certificate | AWS account | my-dhs-key-pair.pem |
| VPC Public and Private Subnet CIDRs | | 10.0.0.0/23 |
| VPC ID | VPC CIDR; VPC Public and Private Subnet CIDRs | |
| Bastion Host IP | | 126.96.36.199 |
| Public Route Table ID | | rtb-09cb034df9ee6b0e0 |
| Private Route Table ID | | rtb-07aab23133475c680 |
| MarkLogic Service ID | MarkLogic Cloud Services subscription in AWS Marketplace; MarkLogic Cloud Service Page signup | |
| Peering Connection ID | MarkLogic Service ID; VPC ID; AWS Account ID; Peer Role ARN; VPC Public and Private Subnet CIDRs | |
| Analytics Load Balancer | Data Hub Service users configured | internal-mlaas-aalb-1f50xgvrnt8yn-911654599.us-east-1.elb.amazonaws.com |
| Operations Load Balancer | Data Hub Service users configured | internal-mlaas-oalb-1cwwevqwhl8s7-1243328676.us-east-1.elb.amazonaws.com |
| Curation Load Balancer | Data Hub Service users configured | internal-mlaas-icalb-gaqd4thwz10v-1190497851.us-east-1.elb.amazonaws.com |