Amazon Web Services (AWS) accelerates a business’s ability to establish and maintain its internet presence by managing hardware infrastructure. This removes the need for companies to manage procurement, maintenance, monitoring, and replacement/upgrade of hardware. System administrators are instead tasked with monitoring these Elastic Compute Cloud (EC2) instances to guarantee availability, scaling, routing optimization, load balancing, software upgrades, and security patches. MarkLogic Data Hub Service makes systems administration in the cloud even easier.

MarkLogic Data Hub Service is a fully-automated cloud service to integrate data from silos. Delivered as a cloud service, it provides on-demand capacity, auto-scaling, automated database operations, and proven enterprise data security. As a result, agile teams can immediately start delivering business value by integrating and curating data for both operational and analytical use.

This tutorial gets users new to AWS up and running quickly by focusing on the specific components you need to get started. Further reading is recommended to fully understand all the technologies involved.

Data Hub Service Architectural Overview

The MarkLogic Data Hub Service on AWS sets up a Virtual Private Cloud (VPC) to allow MarkLogic to manage server and network resources.

Figure 1: Overview of server and network resources managed by MarkLogic. A more detailed architectural diagram is available in the MarkLogic Data Hub Service Admin Online Help.

A VPC is a virtual network of hosts that is isolated from all other virtual networks, providing more control over the hosts inside. Using a VPC reduces the risk of unauthorized communication to and from outside the network, and improves communication between hosts by removing the need to traverse the wider internet to reach a neighboring host.

The MarkLogic Data Hub Service VPC employs auto-scaling configurations to automatically increase and decrease the number of resources as usage spikes and drops. This handles DevOps-related issues for the Data Hub Service user. A load balancer sits in front to coordinate incoming transactions, ensuring smooth communication with the ever-changing number of MarkLogic servers.

The MarkLogic Data Hub Service VPC can be configured as publicly accessible, or as private, which requires peering to establish a connection with another VPC. For our purposes, we will focus on a publicly accessible instance of the MarkLogic Data Hub Service. If you are interested in learning more about a private Data Hub Service instance, visit Setting Up MarkLogic Data Hub Service with a Private VPC.

Set Up Accounts

Amazon Web Services

You will need to create an AWS account before creating a MarkLogic Cloud Service account. If you have already signed up for the accounts below, you may skip this section.

Creating an AWS account will display the screen below. Complete the process to create an AWS account, including your payment method and verification of your contact number.

Creating an AWS account

Figure 2: Create an AWS account


Sign up to MarkLogic as a Service

If you have already subscribed to MarkLogic Cloud Service, then you may skip this section of the guide.

Go to the Amazon marketplace and search for “MarkLogic”. Look for the “MarkLogic Cloud Service” entry as shown below.

MarkLogic Cloud Service description

Click on “subscribe” on the loaded page.

MarkLogic Cloud Service Account

After subscribing, you should get redirected to the MarkLogic Cloud Service homepage. Note that this account is separate from your AWS account, so click on “Create a new Account” to proceed.

Create Network Configuration

A public cluster is easy to set up and is recommended for people getting familiar with the MarkLogic Data Hub Service.

  1. Go to MarkLogic Cloud Services and click on Network in the top navigation.
  2. Click on the “Add Network” button.
  3. Supply the “Name” and preferred “Region.” Do NOT check the VPC peering option.
  4. Click on the “Configure” button.
  5. Wait for the provisioning to complete. Make sure to click the refresh icon every so often.

You should end up with a NETWORK CREATED status, like the following:

Creating a network on https://cloudservices.marklogic.com

Figure 3: Network created confirmation screen

Create the MarkLogic Data Hub Service Instance

On the MarkLogic Cloud Services homepage, click on the “+ Data Hub Service” tab and supply the following information:

Create Data Hub Service

Figure 4: “Create Data Hub Service” interface, with Development and Public Access selected

A “Service Type” of “Development” incurs the lowest charges, but also provides the fewest resources. It is recommended for exploration and proofs of concept. This service type still has all the other features of DHS except auto-scaling of resources.

“Production,” on the other hand, has auto-scaling in effect. Note that the cost adjusts depending on the capacity you specify: the higher the capacity value, the higher the hourly cost. For the purposes of this guide, we will use the “Development” service type.

“Private” access is only applicable to networks that have configured “Peering” information. We will discuss details of the “Private Network” in part 2 of this series.

Clicking “Create” spawns the MarkLogic VPC described in the Data Hub Service Architectural Overview. This can take a while, around ten minutes or so. Click the “refresh” icon in the upper left for updates until you see something like the following:

Results of spawning the MarkLogic VPC

Clicking on the “view” button, you should eventually reach this page:

Figure 5: Data Hub Service details

The following table details the available endpoints provided by the MarkLogic Data Hub Service:

Manage
  LB ID: mlaas-ICAlb | Port: 8002 | Content DB: App-Services
  Port to be used when loading modules, updating indexes, and uploading your TDE templates.

REST
  LB ID: mlaas-ICAlb | Port: 8004 | Content DB: data-hub-MODULES
  Port to be used to view your curated data. This port is also used to load the REST extensions developed by your team. Supports MarkLogic’s built-in REST API, which allows you to confirm/review what code you have uploaded.

Ingest
  LB ID: mlaas-ICAlb | Port: 8005 | Content DB: data-hub-STAGING
  XDBC app server to be used by MLCP.

Curation Staging REST
  LB ID: mlaas-ICAlb | Port: 8010 | Content DB: data-hub-STAGING
  Port to be used when running your ingest and harmonization flows. Supports MarkLogic’s built-in REST API, which allows you to confirm what got loaded into your STAGING database. You need to access /v1/search directly, since the default landing page is not supported.

Curation Final REST
  LB ID: mlaas-ICAlb | Port: 8011 | Content DB: data-hub-FINAL

Jobs
  LB ID: mlaas-ICAlb | Port: 8013 | Content DB: data-hub-JOBS
  This port allows the user to view jobs and traces.

Query Console
  LB ID: mlaas-ICAlb | Port: 8002 | Content DB: App-Services
  Requires the “/qconsole” path to be specified.

Analytics
  LB ID: mlaas-AAlb | Port: 8008 | Content DB: data-hub-FINAL
  App server dedicated to the “Data Services First” approach. More information about this approach is available at https://github.com/marklogic/java-client-api/wiki/Data-Services

Analytics REST
  LB ID: mlaas-AAlb | Port: 8011 | Content DB: data-hub-FINAL
  Supports MarkLogic’s built-in REST API, which allows you to confirm what got loaded into your FINAL database.

Operations
  LB ID: mlaas-OAlb | Port: 8009 | Content DB: App-Services
  App server dedicated to the “Data Services First” approach (see the link above). The goal is to separate operations-related transactions from report-related functions.

Operations REST
  LB ID: mlaas-OAlb | Port: 8011 | Content DB: App-Services
  Supports MarkLogic’s built-in REST API, which allows you to confirm what got loaded into your FINAL database. The goal is to separate operations-related transactions from report-related functions.

ODBC
  LB ID: mlaas-f-Nlb | Port: 5432 | Content DB: data-hub-FINAL
  Port to be used by your BI tools.

Figure 6: Available Data Hub Service endpoints.
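As a concrete sketch of using the Ingest endpoint, an MLCP import might look like the following. The host, credentials, and input path are placeholders, and the command is echoed as a dry run; remove the echo to execute it.

```shell
# Placeholder values -- substitute your own Ingest endpoint host and a user
# with the "Flow Operator" role.
INGEST_HOST="mlaas-ICAlb-EXAMPLE.ap-southeast-2.elb.amazonaws.com"
INGEST_PORT=8005

# Echoed as a dry run so the command can be reviewed first; drop the echo
# to run it. -restrict_hosts keeps MLCP talking only to the load balancer.
echo mlcp.sh import \
  -host "${INGEST_HOST}" -port "${INGEST_PORT}" -restrict_hosts true \
  -username flow-operator -password YOUR_PASSWORD \
  -input_file_path ./data -output_collections raw-data
```

The collection name raw-data is an arbitrary example; use whatever grouping suits your staging data.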

Note that the acronyms “aalb”, “oalb”, and “icalb” stand for “analytics application load balancer”, “operations application load balancer”, and “ingest curation application load balancer”, respectively. Because the default landing pages are not supported, access the REST endpoints (/v1/search and /v1/documents) directly, for example:

MarkLogic REST Server

Purpose                            Path
Search and retrieve XML results    /v1/search?format=xml
Search and retrieve JSON results   /v1/search?format=json
Search example                     /v1/search?q=&start=10&pageLength=5
Query configuration                /v1/config/query
Transform configuration            /v1/config/transforms
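For example, a direct search against the Curation Staging REST endpoint can be issued with curl. The host and credentials below are placeholders, and the command is echoed as a dry run:

```shell
# Placeholder host -- use the Curation Staging REST endpoint from your
# service's details page (Figure 5).
STAGING_HOST="mlaas-ICAlb-EXAMPLE.ap-southeast-2.elb.amazonaws.com"
SEARCH_URL="http://${STAGING_HOST}:8010/v1/search?format=json&pageLength=5"

# Echoed as a dry run; remove the echo to query the STAGING database.
# MarkLogic app servers default to digest authentication.
echo curl --digest --user "flow-operator:YOUR_PASSWORD" "${SEARCH_URL}"
```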

Manage MarkLogic Data Hub Service Access

The links under “Endpoints” are disabled until users are created. To start using your Data Hub Service instance, you need to create users. Note that the service admin does not have access to all actions by default. Click the “Internal” button to add users with specific roles.

Users and roles for the Data Hub Service

Figure 7: Users and roles for the Data Hub Service

Note that the users created in Figure 7 above do not have SSH access to your servers. These users are MarkLogic accounts created to connect to the endpoints. The roles are described as follows:

Flow Developer
  Can load modules into the MarkLogic modules database, load TDE templates, and update indexes. Essentially your Gradle task executor.

Flow Operator
  Can ingest data and run your flows.

Endpoint Developer
  A subset of “Flow Developer”. Can load modules, but cannot overwrite existing modules that they did not upload. Cannot upload TDE templates or update indexes. Meant as a “Data Service” developer.

Endpoint User
  For users that consume the “Data Services” developed by the “Endpoint Developer”.

ODBC User
  Meant to be used for port 5432.

Service Security Admin
  For users that configure your external security via LDAP.

Figure 8: Data Hub Service User Roles

Developer Access to the Data Hub Service

The following sections provide guidelines on what you need to do, and what you can do, to load your modules into the Data Hub Service instance.

Note that you cannot currently use the Data Hub Developer Quickstart to directly develop on top of the MarkLogic Data Hub Service instance. Also note that this tutorial refers to Data Hub Framework v4.x; if you are looking to deploy Data Hub v5 to the Data Hub Service, please contact Support, or go to Data Hub Framework Deploy to MarkLogic Data Hub Service for an updated version of these instructions.

Confirm Initial Configuration

To confirm the availability of the initial configuration, load the Configuration Manager application at the “Manage” endpoint. If you are familiar with MarkLogic Server, this is the equivalent of the standard “Manage” app server page. When prompted for credentials, use the configured user account with the “Flow Developer” role.
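The same check can be done from the command line against the Management API on port 8002. The host is a placeholder and the command is echoed as a dry run; remove the echo to execute it with your “Flow Developer” credentials:

```shell
# Placeholder host -- substitute the "Manage" endpoint of your instance.
MANAGE_HOST="mlaas-ICAlb-EXAMPLE.ap-southeast-2.elb.amazonaws.com"
MANAGE_URL="http://${MANAGE_HOST}:8002/manage/v2?format=json"

# Echoed as a dry run; remove the echo to list the cluster's resources.
echo curl --digest --user "flow-developer:YOUR_PASSWORD" "${MANAGE_URL}"
```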

Project configuration

Your Data Hub QuickStart-generated or gradle-generated project needs some adjustments. Use the following gradle.properties sample as a template:

# This should match the “Manage” server mentioned previously
mlHost=mlaas-ICAlb-1T52YVG2MQ1Z-1133760120.ap-southeast-2.elb.amazonaws.com
mlIsHostLoadBalancer=true

mlUsername=YOUR_FLOW_OPERATOR_USER
mlPassword=YOUR_FLOW_OPERATOR_PASSWORD
mlManageUsername=YOUR_FLOW_DEVELOPER_USER
mlManagePassword=YOUR_FLOW_DEVELOPER_PASSWORD

mlStagingAppserverName=data-hub-STAGING
mlStagingPort=8010
mlStagingDbName=data-hub-STAGING
mlStagingForestsPerHost=1

mlFinalAppserverName=data-hub-FINAL
mlFinalPort=8011
mlFinalDbName=data-hub-FINAL
mlFinalForestsPerHost=1

mlJobAppserverName=data-hub-JOBS
mlJobPort=8013
mlJobDbName=data-hub-JOBS
mlJobForestsPerHost=1

mlModulesDbName=data-hub-MODULES
mlStagingTriggersDbName=data-hub-staging-TRIGGERS
mlStagingSchemasDbName=data-hub-staging-SCHEMAS

mlFinalTriggersDbName=data-hub-final-TRIGGERS
mlFinalSchemasDbName=data-hub-final-SCHEMAS

mlModulePermissions=flowDeveloper,read,flowDeveloper,execute,flowDeveloper,insert,flowOperator,read,flowOperator,execute,flowOperator,insert

mlIsProvisionedEnvironment=true
mlManagePort=8002

Gradle tasks

The table below lists the available Gradle tasks. These tasks assume a Data Hub Framework project, not a vanilla ml-gradle project.

Task                 Purpose
hubInstall           Install the DHF modules.
mlLoadModules        Deploy your custom code beyond the default DHF code.
mlUpdateIndexes      Deploy your indexes as defined in <project-root>/src/hub-internal-config/databases/your-db.json.
mlDeployViewSchemas  Deploy your TDE templates as defined in <project-root>/src/hub-internal-config/schemas/tde/your-template.json.
hubRunFlow           Run your harmonization flow.

Figure 9: Gradle tasks and purposes

The mlDeploy task is unavailable for MarkLogic Data Hub Service users, since users do not have the full admin role.
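Putting the tasks together, a typical deploy-and-run sequence might look like the sketch below. It assumes the properties shown earlier are saved as gradle-dhs.properties so that ml-gradle’s -PenvironmentName flag picks them up; the entity and flow names are hypothetical, and each command is echoed as a dry run.

```shell
ENV_FLAG="-PenvironmentName=dhs"

# Echoed as a dry run; remove the echos to execute against your instance.
echo ./gradlew hubInstall "${ENV_FLAG}"        # install the DHF modules
echo ./gradlew mlLoadModules "${ENV_FLAG}"     # deploy your custom code
echo ./gradlew mlUpdateIndexes "${ENV_FLAG}"   # push index configuration
# "Employee" and "harmonize-employees" are hypothetical example names.
echo ./gradlew hubRunFlow "${ENV_FLAG}" -PentityName=Employee -PflowName=harmonize-employees
```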

Access for Data Services

“Analytics” (<ANALYTICS>:8008) and “Operations” (<OPERATIONAL>:8009) are both Data Services First (DSF) API app servers. They do not support the built-in MarkLogic REST API.

If you run into issues using MarkLogic Data Hub Service, contact Support. MarkLogic engineers and enthusiasts are also active on Stack Overflow; just tag your questions ‘marklogic’.

Ready for part two? If you are interested in learning more about a private Data Hub Service instance, visit Setting Up MarkLogic Data Hub Service with a Private VPC.

Learn More

Data Hub Service with a Private VPC

Learn how to configure a private MarkLogic Data Hub Service VPC and the peering required to allow your VPC to communicate with the provisioned MarkLogic VPC.

CloudServices

Find out what Data Hub Service is, the prerequisites for it, and how to get started using DHS.

Data Hub Framework

Learn what the Data Hub Framework is, why you need it, how to get started with it, and where to send your questions around it.
