In this exercise, you'll load some triples for use in future exercises.
This exercise assumes you have mlcp installed and a relatively clean MarkLogic server to start from (we assume no server port or name conflicts with those used here). The loading scripts also assume you have the mlcp bin directory in your system PATH environment variable.
Specifically, in this exercise, you will create:
- A content database called tutsem-content, with its triple index enabled.
- A modules database called tutsem-modules.
- An HTTP REST instance called tutsem-rest on port 9910.
MarkLogic provides several ways to create and configure databases and app servers. For this exercise, we will use the Management API.
Open a text editor and save this JSON as tutsem-server.json. If you already have something running on port 9910, change it in this file (and remember to update the port through the rest of this tutorial):
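If you need to reconstruct the payload, a minimal sketch following the documented shape of MarkLogic's /v1/rest-apis instance-creation body would look like this (the names and port match this tutorial; adjust as needed):

```json
{
  "rest-api": {
    "name": "tutsem-rest",
    "port": "9910",
    "database": "tutsem-content",
    "modules-database": "tutsem-modules"
  }
}
```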
Save this one as tutsem-content.json:
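A sketch of the database-properties payload, using the documented property names for the two settings this exercise needs:

```json
{
  "triple-index": true,
  "collection-lexicon": true
}
```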
Use curl to send tutsem-server.json to the Management API. This instructs MarkLogic to create the application server, content database, and modules database:
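A sketch of that curl call, assuming the Management API is on its default port 8002 and admin/admin credentials (substitute your own):

```shell
# POST the server definition to the REST instance creation endpoint.
# This creates the app server, content database, and modules database.
curl --anyauth -u admin:admin \
  -X POST -H "Content-Type: application/json" \
  -d @tutsem-server.json \
  http://localhost:8002/v1/rest-apis
```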
If you're using MarkLogic 8, you can do the same for tutsem-content.json to turn on the triple index.

If you're using MarkLogic 7, you won't be able to do this step through the Management API, as that endpoint was added in MarkLogic 8. Instead, point your browser to the Admin UI (http://localhost:8001) and click on the tutsem-content database (under Databases). Find "triple index" and click the radio button to set it to true; do the same to set "collection lexicon" to true, then scroll back up and click the "ok" button.
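For the MarkLogic 8 path, a sketch of the curl call that applies tutsem-content.json to the new database's properties endpoint (again assuming port 8002 and admin/admin):

```shell
# MarkLogic 8+: PUT the property settings (triple index,
# collection lexicon) onto the tutsem-content database.
curl --anyauth -u admin:admin \
  -X PUT -H "Content-Type: application/json" \
  -d @tutsem-content.json \
  http://localhost:8002/manage/v2/databases/tutsem-content/properties
```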
You can manually load the 3 triples from our Hello World exercise into your new tutsem-content database, or you can use the provided script (load-livesIn.sh) as follows.
(NB: the following instructions apply to each of the loading scripts referenced below as well):
- If you haven't yet, download and install mlcp and add the mlcp bin subdirectory to your operating system's PATH environment variable.
- Download the entire semantics-exercises.zip and unzip it.
- In a shell, change directories to the load-scripts directory that was inside the zip.
- If you are on Windows, edit each .bat script and update the admin username and password as needed. If you are on Linux/OSX, you can set the MLUSER and MLPASS environment variables read by the shell scripts, or you can simply edit the scripts.
- If you are not running on localhost, you will also need to edit the hostname in the URL in the script.
- Run the appropriate script for your operating system.
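Each load script essentially wraps a one-line mlcp import. A hedged sketch of what load-livesIn.sh does — the file name, port, and URI prefix below are illustrative, so check the script itself for the exact values:

```shell
# Import a file of RDF triples into the tutsem-content database.
# mlcp parses RDF serializations when -input_file_type is rdf.
mlcp.sh import -host localhost -port 8000 \
  -username admin -password admin \
  -database tutsem-content \
  -input_file_path livesIn.ttl \
  -input_file_type rdf \
  -output_uri_prefix /triplestore/
```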
When you run load-livesIn.sh, you'll see output from mlcp like:
We have provided a collection of 60k triples taken from DBpedia 3.8, available under the terms of the Creative Commons Attribution-ShareAlike License and the GNU Free Documentation License. DBpedia is a crowd-sourced, community effort to extract structured information from Wikipedia.
This collection includes 10k triples each from:
| Dataset | Source |
| --- | --- |
| Ontology Infobox types | http://downloads.dbpedia.org/3.8/en/instance_types_en.nt.bz2 |
| Ontology Infobox properties | http://downloads.dbpedia.org/3.8/en/mappingbased_properties_en.nt.bz2 |
| Ontology Infobox properties (specific) | http://downloads.dbpedia.org/3.8/en/specific_mappingbased_properties_en.nt.bz2 |
To load the data, run the provided load-dbpedia.sh script. See above for how to run the script.
When you load RDF triples into MarkLogic, the triples are stored in MarkLogic-managed XML documents. Below are some questions you can answer by examining the database. You can import the Query Console package ts-loading-data.xml (also available in the semantics-exercises.zip). After importing, set the Content Source for each buffer to "tutsem-content (tutsem-modules: /)".
- Q. What documents got created? Under what URIs?
- A. You should see one document under /triplestore directly (the Hello World triples) and 600 under /triplestore/dbpedia/
- Q. How many triples are in the database?
- A. 60003
- Q. How many distinct triples are in the database?
- A. 58517
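One way to check counts like these yourself in Query Console — a sketch only; the imported workspace's own buffers are authoritative, and the two approaches below can report different numbers when duplicate triples are present:

```xquery
xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

(: count triples via the triple index :)
fn:count(cts:triples((), (), ())),

(: count triples via SPARQL :)
sem:sparql("SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }")
```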
- Download the Query Console Workspace ts-loading-data.xml and import it into Query Console.
- Point your browser to http://localhost:9910/v1/graphs/things to browse triples. (Replace 9910 if you chose a different port earlier.)
We started with each article as a single XHTML document. We then used OpenCalais to analyze the articles and find the entities (real-world things) within them. OpenCalais spotted entities like people, their roles, places (cities and countries) and organizations. On top of this it linked individuals with their role(s) and also determined the subject headings (categories) of the documents. For example, for one news article, OpenCalais generated triples for us that indicated the item was about war, identified the places mentioned in the article, and provided geo-location information for those places.
Our enrichment process generated modified copies of these source documents, along with an associated set of triples for us. To load the modified articles, run the load-news-content.sh script. (See above for how to run the script.)
To load the associated triples, run the load-news-graph.sh script. (See above for how to run the script.) You can ignore the errors about lexical forms. (As you will discover, it is not uncommon for triple data to be encoded out of spec. In this data set, the triples with such issues will still be loaded, but the incorrectly formatted "dates" will be treated as strings.)
Before we move on, let's talk a little about some of the modifications we made during enrichment. Specifically, during enrichment, each article was assigned an IRI and that IRI was embedded in the article itself. Each document's IRI was also linked to an OpenCalais identifier using the common owl:sameAs predicate. There's also a triple that links the document's assigned IRI to the document's database URI. This enables you to say things about the document, such as "database-document-X mentions IBM", by joining two triples (e.g., something like "assigned-URI mentions IBM" and "assigned-URI isDocument database-document-X").
Below is a query that will show you all the owl:sameAs triples in the database. In particular, this will show you all the IRIs linked to OpenCalais identifiers.
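A sketch of such a query (the LIMIT is just to keep the result list manageable):

```xquery
xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

(: list IRIs linked to their OpenCalais identifiers :)
sem:sparql("
  PREFIX owl: <http://www.w3.org/2002/07/owl#>
  SELECT ?iri ?calaisId
  WHERE { ?iri owl:sameAs ?calaisId }
  LIMIT 10
")
```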
We can also see how an individual document's IRI is embedded by looking at one via XQuery:
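For example — the document URI below is purely illustrative; substitute one of the URIs you saw in the "what documents got created?" query above:

```xquery
xquery version "1.0-ml";
(: fetch one enriched article and inspect the embedded IRI;
   "/news/some-article.xml" is a made-up placeholder URI :)
fn:doc("/news/some-article.xml")
```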
There are other ways to link triples and documents. For example, you can embed triples in the documents themselves. Such triples could include metadata such as "thisDocument publishDate today"; subjects or topics mentioned in the document, such as "thisDocument mentionsCity 'New York'"; or events such as "John wentTo China" or "Jack metWith Joe".
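MarkLogic recognizes triples embedded in XML documents via the sem:triple element in the http://marklogic.com/semantics namespace. A minimal sketch — the subject and predicate IRIs here are made up for illustration:

```xml
<article xmlns:sem="http://marklogic.com/semantics">
  <title>Example article</title>
  <sem:triple>
    <sem:subject>http://example.org/thisDocument</sem:subject>
    <sem:predicate>http://example.org/mentionsCity</sem:predicate>
    <sem:object datatype="http://www.w3.org/2001/XMLSchema#string">New York</sem:object>
  </sem:triple>
</article>
```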
You should now see:
- 155,980 distinct triples
- 2,021 documents containing triples
- 179,288 total triples
Want to do more before moving on to the next exercise? Using mlcp, sem:rdf-load(), or a REST endpoint, you can load any file containing triples that you can find on the Semantic Web! Also, because MarkLogic is a database, you can add, edit, or delete triples (in fully ACID-compliant transactions) as well!
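As a quick sketch of the sem:rdf-load() route — the file path is illustrative, and the file must be readable from the MarkLogic server's filesystem:

```xquery
xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

(: load a file of triples from the server's filesystem;
   "/tmp/my-triples.ttl" is a placeholder path :)
sem:rdf-load("/tmp/my-triples.ttl")
```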
- Query Console export: ts-loading-data.xml (also available inside the semantics-exercises.zip).
- Semantics Developer Guide section: Loading Triples
- Semantics Hello World