MarkLogic Server is a multi-model database with a complementary set of capabilities that lets you store information of various shapes and sizes side by side. You can store content as documents (XML, JSON), binaries (PDF, images, etc.), and semantic triples (RDF, Turtle, etc.), among other data types, and project tabular views of that data to be consumed like relational tables. All of these records can be stored and indexed within a single MarkLogic Server with ACID transactional consistency. Along with traditional database storage, MarkLogic has a rich search index built into the core product. You can leverage MarkLogic’s Universal Index for robust full-text search, or specialized indices for scalar operations, geospatial queries, or semantic graph queries. Without MarkLogic, these capabilities would be spread across multiple technologies, such as relational, document, and graph databases, as well as search index technologies.

In this tutorial, we will discuss how to manage full-text information within documents and supplement this data with facts expressed as semantic triples. Once the information is managed, we can then explore the data using search patterns with the Optic API. If you would like to follow along in your own environment, the source code for the demonstration is available on GitHub. Note that this tutorial was written for MarkLogic Data Hub v5.2.

Content, Taxonomies, and Ontologies

Within the publishing and research industries, there are many different schemas for expressing content, which comes in all shapes and sizes: short-form articles, books, conference proceedings, manuals, standards, and so on. While the creation of this content is left to the subject matter experts, the encoding and digital transfer of the data have well-established standards. XML and JSON are viable document formats that can be used to encode this data and produce sharable information. Each industry has its own standards: STEM has JATS, PubMed, and NLM, to name a few, while magazine and news publishing utilize Idealliance PRISM, IPTC NewsML, and ninjs. It is safe to say that if you are trying to capture content, there are well-structured industry standards available.

Now that we have these standards to capture the data, how do we classify it? Content creators produce millions of artifacts, resulting in a deluge of content. One effective way to classify content is with taxonomies or ontologies, which aid in the description of our physical world and the concepts that govern it. Taxonomies have a strictly hierarchical structure, while ontologies can express many kinds of relationships, such as broader, narrower, or related concepts. Many industries have publicly available ontologies or taxonomies, such as Medical Subject Headings (MeSH), IPTC Media Topics, or the large linked data set provided by the Library of Congress. If you are building your own ontology, SKOS, RDF Schema, and OWL provide vocabularies that aid in relationship building. You can even use these vocabularies to link your concepts to industry-standard ontologies.
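
For example, a minimal SKOS fragment in Turtle (with hypothetical concept IRIs, purely for illustration) might relate a narrower disease concept to a broader one like this:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/concepts/> .

ex:WolframSyndrome a skos:Concept ;
    skos:prefLabel "Wolfram Syndrome" ;
    skos:broader   ex:DiabetesMellitus .

ex:DiabetesMellitus a skos:Concept ;
    skos:prefLabel "Diabetes Mellitus" ;
    skos:related   ex:GlucoseMetabolismDisorders .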

Start up a MarkLogic Data Hub

MarkLogic offers a prepackaged platform for managing content from various streams in different shapes and sizes. The MarkLogic Data Hub allows you to load data as-is and harmonize it into canonical models that are usable by a broader audience. The data hub utilizes a pattern known as enveloping to wrap the as-is data in a non-destructive manner, letting you add metadata and fields that have been cleansed and prepared for consumption.
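
As a rough sketch (the exact structure is generated by the data hub), an enveloped XML record keeps the original source document intact alongside the harmonized instance and any added headers; here using the HubArticle entity defined later in this tutorial:

<envelope xmlns="http://marklogic.com/entity-services">
  <headers><!-- lineage, timestamps, and other added metadata --></headers>
  <instance>
    <info>
      <title>HubArticle</title>
      <version>0.0.1</version>
    </info>
    <HubArticle>
      <id>15268686</id>
      <!-- harmonized, canonical properties -->
    </HubArticle>
  </instance>
  <attachments><!-- the original document, unchanged --></attachments>
</envelope>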

Follow the documentation for installing and initiating a data hub project, then start up your MarkLogic Server and Data Hub. To follow along below, download the PubMed research articles and the MeSH data set for our knowledge graph. Remember, if you would like to follow along in your own environment, the source code for the demonstration is available on GitHub.
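
If you are managing the project with the Data Hub Gradle plugin, initializing and deploying typically looks something like this (standard plugin task names; adjust for your environment):

# Initialize a Data Hub project in the current directory
gradle hubInit

# Deploy the data hub configuration to your MarkLogic cluster
gradle mlDeploy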

Configure Entities, Flows, and Steps

We will create a flow to ingest the sample articles and harmonize them into a shape we see fit. We can design entities, flows, and steps within the data hub to harmonize data into a canonical model. See the Data Hub documentation for the exact configuration steps.

We can load data as-is with tools such as MarkLogic Content Pump (MLCP), and the data is available and searchable right away. Here, we will use the entity model to canonicalize our data into a single data definition. The entity definition allows us to define a set of properties that we will use for our API development, so we can consistently target a field no matter the shape of the original data. The original data will be encapsulated in an envelope so we do not lose the original fidelity.

Flows and steps can be created to process your data. In this flow, we have a mapping step to pull data from the original document and place the values in the canonicalized entity. Entity-based mapping allows us to target the canonical model and extract values from the source record using XPath and a built-in function library, which you can extend if you desire. Below are the HubArticle entity definition, the PubMedFlow flow configuration, and the mapping definition, in that order.

{
    "info": {
        "title": "HubArticle",
        "version": "0.0.1",
        "baseUri": "http://example.org/"
    },
    "definitions": {
        "HubArticle": {
            "required": [],
            "pii": [],
            "elementRangeIndex": ["id", "publicationYear", "wordCount"],
            "rangeIndex": [],
            "wordLexicon": [],
            "properties": {
                "id": {
                    "datatype": "string",
                    "collation": "http://marklogic.com/collation/codepoint"
                },
                "publication" : {
                    "datatype" : "string",
                    "collation" : "http://marklogic.com/collation/codepoint"
                },
                "title": {
                    "datatype": "string",
                    "collation": "http://marklogic.com/collation/codepoint"
                },
                "abstract": {
                    "datatype": "string",
                    "collation": "http://marklogic.com/collation/codepoint"
                },
                "publicationYear": {
                    "datatype": "int",
                    "collation" : "http://marklogic.com/collation/codepoint"
                },
                "wordCount" : {
                    "datatype" : "int",
                    "collation" : "http://marklogic.com/collation/codepoint"
                }
            }
        }
    }
}

{
    "name": "PubMedFlow",
    "description": "",
    "batchSize": 100,
    "threadCount": 4,
    "stopOnError": false,
    "options": {},
    "version": 0,
    "steps": {
        "1": {
            "name": "IngestArticles",
            "description": "",
            "options": {
                "additionalCollections": [],
                "headers": {
                    "sources": [{
                        "name": "PubMedFlow"
                    }],
                    "createdOn": "currentDateTime",
                    "createdBy": "currentUser"
                },
                "sourceQuery": "cts.collectionQuery([])",
                "collections": ["IngestArticles"],
                "permissions": "data-hub-operator,read,data-hub-operator,update",
                "outputFormat": "xml",
                "targetDatabase": "data-hub-STAGING"
            },
            "customHook": {
                "module": "",
                "parameters": {},
                "user": "",
                "runBefore": false
            },
            "retryLimit": 0,
            "batchSize": 100,
            "threadCount": 4,
            "stepDefinitionName": "default-ingestion",
            "stepDefinitionType": "INGESTION",
            "fileLocations": {
                "inputFilePath": "/Users/dwanczow/Documents/workspace/mesh/data/articles",
                "inputFileType": "xml",
                "outputURIReplacement": "",
                "separator": ""
            }
        },
        "2": {
            "name": "PubMedMapping",
            "description": "",
            "options": {
                "additionalCollections": [],
                "sourceQuery": "cts.collectionQuery(['IngestArticles'])",
                "mapping": {
                    "name": "PubMedFlow-PubMedMapping",
                    "version": 0
                },
                "targetEntity": "HubArticle",
                "sourceDatabase": "data-hub-STAGING",
                "collections": ["PubMedMapping", "HubArticle"],
                "permissions": "data-hub-operator,read,data-hub-operator,update",
                "validateEntity": false,
                "sourceCollection": "IngestArticles",
                "outputFormat": "xml",
                "targetDatabase": "data-hub-FINAL"
            },
            "customHook": {
                "module": "",
                "parameters": {},
                "user": "",
                "runBefore": false
            },
            "retryLimit": null,
            "batchSize": 100,
            "threadCount": 4,
            "stepDefinitionName": "entity-services-mapping",
            "stepDefinitionType": "MAPPING"
        }
    }
}

{
    "lang": "zxx",
    "name": "PubMedFlow-PubMedMapping",
    "description": "",
    "version": 0,
    "targetEntityType": "http://example.org/HubArticle-0.0.1/HubArticle",
    "sourceContext": "/",
    "sourceURI": "/data/pubmed/15268686.xml",
    "properties": {
        "citationCount" : {
            "sourcedFrom" : "count("
        },
        "wordCount" : {
            "sourcedFrom" : "count(PubmedArticle/MedlineCitation/Article/Abstract/AbstractText/tokenize(., " "))"
        },
        "publication" : {
            "sourcedFrom" : "PubmedArticle/MedlineCitation/Article/Journal/Title"
        },
        "publicationYear": {
            "sourcedFrom": "PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/Year"
        },
        "id": {
            "sourcedFrom": "PubmedArticle/MedlineCitation/PMID"
        },
        "abstract": {
            "sourcedFrom": "PubmedArticle/MedlineCitation/Article/Abstract/AbstractText"
        },
        "title": {
            "sourcedFrom": "PubmedArticle/MedlineCitation/Article/ArticleTitle"
        }
    },
    "namespaces": {}
}

Automate Ingestion and Harmonization

There are a variety of tools that can be utilized to automate the overall process. Let’s use MLCP and Gradle to ingest and harmonize our data using the flow configured above. Here are some Gradle tasks that can be added to your build.gradle for automating the load and harmonization of data:

task loadArticles(type: com.marklogic.gradle.task.MlcpTask) {
 
    classpath = configurations.mlcp
    command = "IMPORT"
    mode = "local"
 
    host = mlHost
    port = mlStagingPort.toInteger()
 
    ssl = sslFlag.toBoolean()
    restrict_hosts = mlIsHostLoadBalancer.toBoolean()
 
    input_file_path = "data/articles"
    input_file_type = "aggregates"
    aggregate_record_element = "PubmedArticle"
    uri_id = "PMID"
    output_uri_prefix = "/data/pubmed/"
    output_uri_suffix = ".xml"
    output_collections = "IngestArticles"
    output_permissions = "data-hub-operator,read,data-hub-operator,update"
    input_compression_codec = "gzip"
    input_compressed = true
}
 
task harmonizeArticles(type: com.marklogic.gradle.task.RunFlowTask) {
    doFirst {
        mlUsername = mlFlowOperatorUserName
        mlPassword = mlFlowOperatorPassword
    }
    flowName = "PubMedFlow"
    steps = ["2"]
    jobId = randomJobId
    showOptions = true
}

task loadMeSH(type: com.marklogic.gradle.task.MlcpTask) {
    classpath = configurations.mlcp
    command = "IMPORT"
    mode = "local"
    host = mlHost
    port = mlFinalPort.toInteger()
    ssl = sslFlag.toBoolean()
    restrict_hosts = mlIsHostLoadBalancer.toBoolean()
    input_file_path = "data/mesh/mesh.nt"
    input_file_type = "RDF"
    output_graph = "http://id.nlm.nih.gov/mesh/2020"
    output_permissions = "data-hub-operator,read,data-hub-operator,update"
}

Multi-Model Search with Optic

What is Optic?

The Optic API uses row, triple, and/or lexicon lenses over documents, and is powered by the new row index. With the Optic API, you can use full-text document search as a filter, perform row operations to join and aggregate data, and retrieve or construct documents on output. It is available in JavaScript and XQuery on the server, through the REST and Java APIs, and soon in Node.js. Each implementation adopts language-specific patterns, so it feels conceptually familiar if you have relational experience and syntactically natural given your existing programming knowledge. Check out the introductory Optic API tutorial for more information.

Search a Broad Corpus of Data

Optic exposes MarkLogic’s rich Universal Index and specialized indices through the fromSearch operator. This allows not only exact matches but also linguistic features for better recall: word and phrase searches leverage stemming and tokenization. Additionally, you can take advantage of specialized indices, such as scalar ranges over dates, text, or numeric values, as well as geospatial coordinates, to better narrow your results. In most traditional data stores, sorting is done on specific atomic values in the records. By leveraging MarkLogic’s search technology, however, you can incorporate score-based sorts: each result carries a score based on the query specified, allowing the most pertinent results to appear at the top of the result set. This snippet illustrates a search of articles using scalar range indices and full-text word querying, while ordering the results by relevance score.

'use strict';
 
const op = require('/MarkLogic/optic');
 
const query =
    cts.andQuery([
        cts.elementRangeQuery('publicationYear', '>=', 1970),
        cts.wordQuery('research')
    ]);
 
op.fromSearch(query, ['fragmentId', 'confidence', 'fitness', 'quality', 'score'])
    .joinDocUri('uri', op.fragmentIdCol('fragmentId'))
    .orderBy(op.desc('score'))
    .limit(100)
    .result()
xquery version "1.0-ml";
 
import module namespace op = "http://marklogic.com/optic" at "/MarkLogic/optic.xqy";
 
declare option xdmp:mapping "false";
 
let $query :=
    cts:and-query((
      cts:element-range-query(xs:QName("publicationYear"), ">=", 1970),
      cts:word-query("research")
    ))
 
return
  op:from-search($query, ("fragmentId", "confidence", "fitness", "quality", "score'))
    => op:join-doc-uri("uri",  op:fragment-id-col("fragmentId"))
    => op:order-by(op:desc("score"))
    => op:limit(100)
    => op:result()

Look at Views

MarkLogic Data Hub automatically creates projected views for you, based on the entity definitions you configure, by utilizing Template Driven Extraction (TDE). You can access these views with either SQL or Optic. These are good mechanisms for searching your data and producing results your users can consume. Views produce tabular results that are familiar to most users.

'use strict';
 
const op = require('/MarkLogic/optic');
 
op.fromView('HubArticle', 'HubArticle').limit(10).result()

-- query
 
select * from HubArticle
limit 10

Traverse the Graph

Given that our example includes the MeSH knowledge graph, we can easily traverse it using SPARQL, an industry-standard language. The MeSH graph contains concepts used for classification. This ontological structure also expresses broader and narrower concepts, so we can view all the underlying aspects. For example, if you just search for “diabetes”, you may not get all the underlying concepts, such as Wolfram Syndrome. By using SPARQL and the graph, we can expand the search term to find all of its underlying concepts.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
PREFIX mesh: <http://id.nlm.nih.gov/mesh/>
  
SELECT ?descriptor ?label
WHERE {
    ?descriptor meshv:broaderDescriptor* mesh:D003920 .
    ?descriptor rdfs:label ?label .
}
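
To experiment with this expansion directly in server-side JavaScript, a minimal sketch using sem.sparql (passing the descriptor as an external binding) might look like the following:

'use strict';
 
// Expand a MeSH descriptor to itself plus all narrower descriptors.
// The ?mesh variable is bound externally through the bindings map.
const results = sem.sparql(`
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
 
    SELECT ?descriptor ?label
    WHERE {
        ?descriptor meshv:broaderDescriptor* ?mesh .
        ?descriptor rdfs:label ?label .
    }`,
    {mesh: sem.iri('http://id.nlm.nih.gov/mesh/D003920')}
);
results;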

Now that we know we can traverse this knowledge graph and produce an expansion, how do we tie it back to the articles? We can define another TDE to create triples from article content. PubMed classifies articles with MeSH descriptors, which we can easily pull out and turn into triples, as shown below. Note that TDEs can be automatically deployed with your project by placing them in the src/main/ml-schemas/tde folder of your project.

<tde:template xml:lang="zxx"
    xmlns:tde="http://marklogic.com/xdmp/tde">
    <tde:description></tde:description>
    <tde:context>/*:envelope/*:instance[*:info/*:version = "0.0.1"][*:HubArticle]</tde:context>
    <tde:vars>
        <tde:var>
            <tde:name>RDF_DEFINED_BY</tde:name>
            <tde:val>sem:iri("http://www.w3.org/2000/01/rdf-schema#isDefinedBy")</tde:val>
        </tde:var>
        <tde:var>
            <tde:name>MESH_HAS_DESC</tde:name>
            <tde:val>sem:iri("http://id.nlm.nih.gov/mesh/vocab#hasDescriptor")</tde:val>
        </tde:var>
        <tde:var>
            <tde:name>DC_REFERENCES</tde:name>
            <tde:val>sem:iri("http://purl.org/dc/terms/references")</tde:val>
        </tde:var>
        <tde:var>
            <tde:name>ARTICLE-ID</tde:name>
            <tde:val>xs:string(./*:HubArticle/*:id)</tde:val>
        </tde:var>
        <tde:var>
            <tde:name>SUBJECT-IRI</tde:name>
            <tde:val>sem:iri(concat("http://example.org/HubArticle-0.0.1/HubArticle/", fn:encode-for-uri(xdmp:node-uri(.) || '#' || fn:position())))</tde:val>
        </tde:var>
    </tde:vars>
    <tde:path-namespaces>
        <tde:path-namespace>
            <tde:prefix>es</tde:prefix>
            <tde:namespace-uri>http://marklogic.com/entity-services</tde:namespace-uri>
        </tde:path-namespace>
    </tde:path-namespaces>
    <tde:templates>
        <tde:template>
            <tde:context>ancestor::es:envelope//PubmedArticle/MedlineCitation/MeshHeadingList/MeshHeading/DescriptorName</tde:context>
            <tde:triples>
                <tde:triple>
                    <tde:subject>
                        <tde:val>$SUBJECT-IRI</tde:val>
                        <tde:invalid-values>ignore</tde:invalid-values>
                    </tde:subject>
                    <tde:predicate>
                        <tde:val>$RDF_DEFINED_BY</tde:val>
                        <tde:invalid-values>ignore</tde:invalid-values>
                    </tde:predicate>
                    <tde:object>
                        <tde:val>$ARTICLE-ID</tde:val>
                        <tde:invalid-values>ignore</tde:invalid-values>
                    </tde:object>
                </tde:triple>
                <tde:triple>
                    <tde:subject>
                        <tde:val>$ARTICLE-ID</tde:val>
                        <tde:invalid-values>ignore</tde:invalid-values>
                    </tde:subject>
                    <tde:predicate>
                        <tde:val>$DC_REFERENCES</tde:val>
                        <tde:invalid-values>ignore</tde:invalid-values>
                    </tde:predicate>
                    <tde:object>
                        <tde:val>sem:iri(concat("http://id.nlm.nih.gov/mesh/", ./@UI))</tde:val>
                        <tde:invalid-values>ignore</tde:invalid-values>
                    </tde:object>
                </tde:triple>
            </tde:triples>
        </tde:template>
    </tde:templates>
</tde:template>

Now that we have triples linking to our articles, as well as the triples of the MeSH graph itself, we can update our SPARQL query to find articles that match the expanded terms:

## query
# Diabetes Mellitus: http://id.nlm.nih.gov/mesh/D003920
 
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
PREFIX mesh: <http://id.nlm.nih.gov/mesh/>
PREFIX dct: <http://purl.org/dc/terms/>
 
SELECT ?label ?descriptor ?articleId
WHERE {
  ?descriptor meshv:broaderDescriptor* mesh:D003920 .
  ?descriptor rdfs:label ?label .
  ?articleId dct:references ?descriptor
}

True Multi-Model Search

Now that you have a grasp of your views, search, and graph data, you can combine these aspects. The following queries really show the power of the Optic API. This example utilizes views, scalar indices, full-text search, and semantic triples to build a result set. Here, a user is searching for articles about the broad concept of “diabetes” (http://id.nlm.nih.gov/mesh/D003920) that were published in 1970 or later and contain the word “research” or any of its variations. This type of query takes advantage of a broad range of MarkLogic’s specialized indexes, all within a single server. Such queries would typically need to be federated across a number of systems, such as an RDBMS, a graph database, and a search index, and once all the results were returned, it would be up to the programmer to aggregate and filter them in the middle tier of the application. This is simply not needed in MarkLogic; it all happens within the single server.

'use strict';
 
const op = require('/MarkLogic/optic');
 
// These values can come in from a user's request. We will statically set them for this example.
const meshDesc = sem.iri('http://id.nlm.nih.gov/mesh/D003920');
const year = 1970;
const wordQuery = 'research';
const limit = 100;
 
// Search across the corpus of content given the particular MarkLogic Search Criteria
const search =
    op.fromSearch(cts.andQuery([
        cts.elementRangeQuery('publicationYear', '>=', year),
        cts.wordQuery(wordQuery)
    ]))
    .joinDocUri('uri', op.fragmentIdCol('fragmentId'))
 
// Pull pertinent metadata from the projected views housed in the row index.
const article = op.fromView('HubArticle', 'HubArticle', null, op.fragmentIdCol('viewDocId'));
 
// Set-up a SPARQL query for query expansion
let sparqlQuery = `
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
PREFIX mesh: <http://id.nlm.nih.gov/mesh/>
PREFIX dct: <http://purl.org/dc/terms/>
 
SELECT ?label ?descriptor ?articleId
WHERE {
  ?descriptor meshv:broaderDescriptor* @meshDesc .
  ?descriptor rdfs:label ?label .
  ?articleId dct:references ?descriptor
}
`
const sparql = op.fromSPARQL(sparqlQuery, 'MeSH');
 
// Join the three plans so the results can be further refined.
search
    .joinInner(article, op.on('fragmentId', 'viewDocId'))
    .joinInner(sparql, op.on('id', 'articleId'))
    .orderBy(op.desc('score'))
    .limit(limit)
    .result('object', {'meshDesc': meshDesc});
xquery version "1.0-ml";
 
import module namespace op="http://marklogic.com/optic" at "/MarkLogic/optic.xqy";
 
declare option xdmp:mapping "false";
  
(: These values can come in from a user's request. We will statically set them for this example. :)
let $meshDesc := sem:iri("http://id.nlm.nih.gov/mesh/D003920")
let $year := 1970
let $wordQuery := "research"
let $limit := 100
 
(: Search across the corpus of content given the particular MarkLogic Search Criteria :)
let $search :=
      op:from-search(cts:and-query((
        cts:element-range-query(xs:QName("publicationYear"), ">=", $year),
        cts:word-query($wordQuery)
      )))
       
(: Pull pertinent metadata from the projected views housed in the row index. :)     
let $article := op:from-view("HubArticle", "HubArticle", (), op:fragment-id-col("viewDocId"))
 
(: Set-up a SPARQL query for query expansion :)
(: Configure the SPARQL plan and set a view name. :)
let $mesh :=
  <sparql><![CDATA[
      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
      PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
      PREFIX mesh: <http://id.nlm.nih.gov/mesh/>
      PREFIX dct: <http://purl.org/dc/terms/>
 
      SELECT ?label ?descriptor ?articleId
      WHERE {
        ?descriptor meshv:broaderDescriptor* @meshDesc .
        ?descriptor rdfs:label ?label .
        ?articleId dct:references ?descriptor
      }
  ]]>
  </sparql>
  => op:from-sparql("MeSH")
 
(: Join the three plans so the results can be further refined. :)
return
    $search
    => op:join-inner($article, op:on("fragmentId", "viewDocId"))
    => op:join-inner($mesh, op:on("id", "articleId"))
    => op:order-by(op:desc("score"))
    => op:limit($limit)
    => op:result("object", map:new((map:entry("meshDesc", $meshDesc))) )

The Optic API can even run aggregations; here we run a groupBy with a count to confirm that we are getting hits on the newly expanded concepts. The code illustrated will execute the query and output a set of results.

'use strict';
 
const op = require('/MarkLogic/optic');
 
// These values can come in from a user's request. We will statically set them for this example.
const meshDesc = sem.iri('http://id.nlm.nih.gov/mesh/D003920');
 
// Set-up a SPARQL query for query expansion
let sparqlQuery = `
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
PREFIX mesh: <http://id.nlm.nih.gov/mesh/>
PREFIX dct: <http://purl.org/dc/terms/>
 
SELECT ?label ?descriptor ?articleId
WHERE {
  ?descriptor meshv:broaderDescriptor* @meshDesc .
  ?descriptor rdfs:label ?label .
  ?articleId dct:references ?descriptor
}
`
 
op.fromView('HubArticle', 'HubArticle')
    .joinInner(
        op.fromSPARQL(sparqlQuery, 'MeSH'),
        op.on(op.viewCol('HubArticle', 'id'), op.viewCol('MeSH', 'articleId'))
    )
    .groupBy('label', [ op.count('labelCount', 'label') ])
    .orderBy(op.desc('labelCount'))
    .result("object", { 'meshDesc': meshDesc })
xquery version "1.0-ml";
 
import module namespace op="http://marklogic.com/optic" at "/MarkLogic/optic.xqy";
 
declare option xdmp:mapping "false";
  
(: These values can come in from a user's request. We will statically set them for this example. :)
let $meshDesc := sem:iri("http://id.nlm.nih.gov/mesh/D003920")
 
(: Set-up a SPARQL query for query expansion :)
(: Configure the SPARQL plan and set a view name. :)
let $mesh :=
  <sparql><![CDATA[
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
        PREFIX mesh: <http://id.nlm.nih.gov/mesh/>
        PREFIX dct: <http://purl.org/dc/terms/>
 
        SELECT ?label ?descriptor ?articleId
        WHERE {
          ?descriptor meshv:broaderDescriptor* @meshDesc .
          ?descriptor rdfs:label ?label .
          ?articleId dct:references ?descriptor
        }
  ]]>
  </sparql>
  => op:from-sparql("MeSH")
 
return
   op:from-view("HubArticle", "HubArticle")
    => op:join-inner($mesh, op:on(op:view-col("HubArticle", "id"), op:view-col("MeSH", "articleId")))
    => op:group-by("label", (op:count("labelCount", "label")))
    => op:order-by(op:desc("labelCount"))
    => op:result("object", map:new((map:entry("meshDesc", $meshDesc))) )

The result set illustrates that, without expanding the query, we would have missed articles about other sub-topics such as Wolfram Syndrome or Fetal Macrosomia.

label                                        labelCount
Diabetes Mellitus, Type 2                           181
Diabetes Mellitus, Type 1                            96
Diabetes Mellitus                                    90
Diabetes Mellitus, Experimental                      49
Diabetic Angiopathies                                39
Diabetic Retinopathy                                 35
Diabetes Complications                               33
Diabetic Nephropathies                               30
Diabetic Neuropathies                                18
Diabetic Foot                                        12
Diabetic Ketoacidosis                                10
Diabetes, Gestational                                 8
Hyperglycemic Hyperosmolar Nonketotic Coma            2
Prediabetic State                                     2
Diabetic Coma                                         1
Fetal Macrosomia                                      1
Diabetes Mellitus, Lipoatrophic                       1
Wolfram Syndrome                                      1

Use Optic to Fetch Row Data

On occasion, you will need to pull row data into applications that do not have a native MarkLogic SDK. The MarkLogic /v1/rows API is a REST interface that allows you to use an Optic plan to fetch data. You can use Optic’s Query DSL to build your query the same way you would with the server-side JavaScript library. From there, you send the Query DSL to the server in the POST body with the appropriate Content-Type header. Here are some examples for pulling the data. If your plan defines placeholder parameters, use the binding pattern in the request to set them.

Query DSL

op.fromSearch(cts.andQuery([
    cts.elementRangeQuery('publicationYear', '>=', 1970),
    cts.wordQuery('research')
  ]))
  .joinDocUri('uri', op.fragmentIdCol('fragmentId'))
  .joinInner(op.fromView('HubArticle', 'HubArticle', null, op.fragmentIdCol('viewDocId')), op.on('fragmentId', 'viewDocId'))
  .joinInner(
    op.fromSPARQL(`
      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
      PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
      PREFIX mesh: <http://id.nlm.nih.gov/mesh/>
      PREFIX dct: <http://purl.org/dc/terms/>
 
      SELECT ?label ?descriptor ?articleId
      WHERE {
        ?descriptor meshv:broaderDescriptor* <http://id.nlm.nih.gov/mesh/D003920> .
        ?descriptor rdfs:label ?label .
        ?articleId dct:references ?descriptor
      }
      `, 'MeSH'), op.on('id', 'articleId'))
  .orderBy(op.desc('score'))
  .limit(100)

Rows Call with cURL

With cURL, you can fetch data with your Query DSL and present it as-is or redirect it to a file.

curl --location --request POST 'http://localhost:8011/v1/rows?column-types=header' \
--header 'Content-Type: application/vnd.marklogic.querydsl+javascript' \
--data-binary '@/Users/dwanczow/Documents/workspace/mesh/examples/query.dsl' \
--digest -u demo-user:demo123
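
If your Query DSL keeps a placeholder (such as the @meshDesc parameter used in the plans above) rather than a hard-coded IRI, you can supply the value on the request using the binding pattern; a sketch, assuming a bind:meshDesc request parameter with a URL-encoded value:

curl --location --request POST 'http://localhost:8011/v1/rows?column-types=header&bind:meshDesc=http%3A%2F%2Fid.nlm.nih.gov%2Fmesh%2FD003920' \
--header 'Content-Type: application/vnd.marklogic.querydsl+javascript' \
--data-binary '@/Users/dwanczow/Documents/workspace/mesh/examples/query.dsl' \
--digest -u demo-user:demo123
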
Rows Call with R

With R, we use the httr, jsonlite, and readr libraries to fetch the data from MarkLogic using the Query DSL. This data can easily be extracted and put into a data frame for further analysis.

library(jsonlite)
library(httr)
library(readr)
 
opticquery <- readr::read_file("/Users/dwanczow/Documents/workspace/mesh/examples/query.dsl")
req <-
  httr::POST("http://localhost:8011/v1/rows?column-types=header",
             httr::add_headers(
               "Content-Type" = "application/vnd.marklogic.querydsl+javascript"
             ),
             body = opticquery,
             authenticate("demo-user", "demo123", type = "digest")
  )
 
 
stop_for_status(req)
 
resp <- content(req, "text")
rows <- jsonlite::fromJSON(resp)$rows
View(rows)

Rows Call with Python

With Python, we use the requests, json, and pandas libraries to fetch the data from MarkLogic using the Query DSL. This data can easily be extracted and put into a pandas DataFrame for further analysis.

import requests
from requests.auth import HTTPDigestAuth

import pandas
import json
 
query = open('query.dsl', 'r').read()
 
resp = requests.post(
    'http://localhost:8011/v1/rows?column-types=header',
    data=query,
    headers={'Content-Type': 'application/vnd.marklogic.querydsl+javascript'},
    auth=HTTPDigestAuth('demo-user', 'demo123'))
 
data = json.loads(resp.text)
frame = pandas.DataFrame.from_dict(data['rows'])
 
frame

Next Steps

Now that you have an introduction to MarkLogic Server, the MarkLogic Data Hub, and the Optic API, you can start building your very own semantically rich applications. Learn more about these related technologies:

 

Learn More

Optic API Resources

Explore all technical resources related to the Optic API and how it can be used in MarkLogic.

Optic API Documentation

Product documentation for how to use Optic API to enable MarkLogic search features regardless of data structure.

Optic API Basics

A getting-started tutorial on how to perform relational operations on indexed values and documents using Optic.

Semantics Guide

Product documentation on using semantics within MarkLogic.

Data Hub Documentation

Product documentation for the MarkLogic Data Hub v5.2 (the version used here).

Application Developer Guide

The MarkLogic Server developer guide for all things application development.
