Loading Data

In this exercise, you'll load some triples for use in future exercises.

Prerequisites

This exercise assumes you have mlcp installed and a relatively clean MarkLogic server to start from (we assume no port or name conflicts with the ones used here). The loading scripts also assume you have the mlcp bin directory in your system PATH environment variable.

Specifically, in this exercise, you will create:

  • A content database called tutsem-content, with its triple index enabled.
  • A modules database called tutsem-modules.
  • An HTTP REST instance called tutsem-rest on port 9910.

Create the Databases and App Server

MarkLogic provides several ways to create and configure databases and App Servers. For this exercise, we will use the Management API.

Open a text editor and save this JSON as tutsem-server.json. If you already have something running on port 9910, change it in this file (and remember to update the port through the rest of this tutorial):
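A payload along the following lines should work; this is a sketch that assumes the Management API's "rest-api" bootstrap schema, which lets one request create the App Server and both databases at once. Verify the property names against your server's Management API documentation:

```json
{
  "rest-api": {
    "name": "tutsem-rest",
    "database": "tutsem-content",
    "modules-database": "tutsem-modules",
    "port": "9910"
  }
}
```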

Save this one as tutsem-content.json:
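This file carries the database properties we want to change. A minimal sketch, assuming the property names used by the Management API's database-properties endpoint:

```json
{
  "triple-index": true,
  "collection-lexicon": true
}
```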

Use curl to send tutsem-server.json to the Manage API. This instructs MarkLogic to create the application server, content database, and modules database:
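A sketch of the curl invocation, assuming admin/admin credentials and the default Management API port 8002; adjust both to match your installation:

```sh
# POST the payload to the REST API bootstrap endpoint.
# This creates tutsem-rest, tutsem-content, and tutsem-modules in one call.
curl --anyauth -u admin:admin -X POST \
  -H "Content-Type: application/json" -d @tutsem-server.json \
  "http://localhost:8002/v1/rest-apis"
```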

If you're using MarkLogic 8, you can do the same with tutsem-content.json to turn on the triple index. If you're using MarkLogic 7, you won't be able to do this step through the Management API, as that endpoint was added in MarkLogic 8. Instead, point your browser to the Admin UI (http://localhost:8001) and click on the tutsem-content database (under Databases). Find "triple index" and click the radio button to set it to true; do the same to set "collection lexicon" to true, then scroll back up and click the "ok" button.
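On MarkLogic 8, the properties update is a PUT against the database's properties endpoint; again a sketch, assuming admin/admin credentials and port 8002:

```sh
# PUT the properties payload to the tutsem-content database (MarkLogic 8+).
curl --anyauth -u admin:admin -X PUT \
  -H "Content-Type: application/json" -d @tutsem-content.json \
  "http://localhost:8002/manage/v2/databases/tutsem-content/properties"
```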

Load data from Hello World

You can manually load the three triples from our Hello World exercise into your new tutsem-content database, or you can use the provided load-livesIn.bat (or load-livesIn.sh) script as follows.

(NB: the following instructions also apply to each of the other loading scripts referenced below.)

  1. If you haven't yet, download and install mlcp and add the mlcp bin sub-directory to your operating system PATH environment variable.
  2. Download the entire semantics-exercises.zip and unzip it.
  3. In a shell, change directories to the load-scripts directory that was inside the zip.
  4. If you are on Windows, edit each .bat script and update the admin username and password as needed. If you are on Linux/OS X, you can set the MLUSER and MLPASS environment variables read by the shell scripts, or you can simply edit the scripts.
  5. If you are not running on localhost, you will also need to edit the hostname in the URL in the script.
  6. Run the appropriate script for your operating system.
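Under the hood, each load script is essentially a thin wrapper around an mlcp import command. A sketch of what such a command looks like (the file name, URI prefix, and credentials are assumptions; check the actual script):

```sh
# Import an RDF file into tutsem-content via the tutsem-rest instance.
# -input_file_type RDF tells mlcp to parse triples rather than load the
# file as-is; -output_uri_prefix controls where the managed documents land.
mlcp.sh import -host localhost -port 9910 \
  -username admin -password admin \
  -input_file_path livesIn.ttl -input_file_type RDF \
  -output_uri_prefix /triplestore
```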

After running load-livesIn.bat or load-livesIn.sh you'll see output from mlcp like:

Load Triples from DBpedia

We have provided a collection of 60k triples taken from DBpedia 3.8, available under the terms of the Creative Commons Attribution-ShareAlike License and the GNU Free Documentation License. DBpedia is a crowd-sourced, community effort to extract structured information from Wikipedia.

This collection includes 10k triples each from:

  • Ontology Infobox types: http://downloads.dbpedia.org/3.8/en/instance_types_en.nt.bz2
  • Ontology Infobox properties: http://downloads.dbpedia.org/3.8/en/mappingbased_properties_en.nt.bz2
  • Ontology Infobox properties (specific): http://downloads.dbpedia.org/3.8/en/specific_mappingbased_properties_en.nt.bz2
  • Short Abstracts: http://downloads.dbpedia.org/3.8/en/short_abstracts_en.nt.bz2
  • Geographic Coordinates: http://downloads.dbpedia.org/3.8/en/geo_coordinates_en.nt.bz2
  • Persondata: http://downloads.dbpedia.org/3.8/en/persondata_en.nt.bz2

To load the data, run the provided load-dbpedia.bat or load-dbpedia.sh script. See above for how to run the script.

What have you got so far?

When you load RDF triples into MarkLogic, the triples are stored in MarkLogic-managed XML documents. Below are some questions you can answer by examining the database. You can import the Query Console package ts-loading-data.xml (also available in the semantics-exercises.zip). After importing, set the Content Source for each buffer to "tutsem-content (tutsem-modules: /)".

Q. What documents got created? Under what URIs?
A. You should see one document directly under /triplestore (the Hello World triples) and 600 under /triplestore/dbpedia/.
Q. How many triples are in the database?
A. 60003
Q. How many distinct triples are in the database?
A. 58517

Hints:
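For example, the triple counts can be obtained with SPARQL aggregate queries run through sem:sparql in a Query Console buffer. A sketch, assuming MarkLogic 8's SPARQL 1.1 aggregate support (on MarkLogic 7 you can wrap a plain SELECT in fn:count() instead):

```xquery
xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

(: total number of triples in the database :)
sem:sparql("SELECT (COUNT(?s) AS ?total) WHERE { ?s ?p ?o }"),

(: number of distinct triples, via a SPARQL 1.1 subquery :)
sem:sparql("
  SELECT (COUNT(*) AS ?distinctTotal)
  WHERE { SELECT DISTINCT ?s ?p ?o WHERE { ?s ?p ?o } }
")
```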

Load data from BBC News

In this step, you will load a set of articles from the BBC News that we enriched using the OpenCalais Web Service.

We started with each article as a single XHTML document. We then used OpenCalais to analyze the articles and find the entities (real-world things) within them. OpenCalais spotted entities like people, their roles, places (cities and countries), and organizations. On top of this, it linked individuals with their role(s) and also determined the subject headings (categories) of the documents. For example, for one news article, OpenCalais generated triples indicating that the item was about war, identified the places mentioned in the article, and provided geo-location information for those places.

Our enrichment process also generated modified copies of these source documents and an associated set of triples for us. To load the modified articles, run the load-news-content.bat or load-news-content.sh script. (See above for how to run the script.)

To load the associated triples, run the load-news-graph.bat or load-news-graph.sh script. (See above for how to run the script). You can ignore the errors about lexical forms. (As you will discover, it is not uncommon for triple data to be encoded out of spec. In this data set, the triples with such issues will still be loaded. But, the "dates" that are incorrectly formatted will be treated as strings.)

Before we move on, let's talk a little about some of the modifications we made during enrichment. Specifically, during enrichment, each article was assigned an IRI, and that IRI was embedded in the article itself. Each document's IRI was also linked to an OpenCalais identifier using the common owl:sameAs predicate. There's also a triple that links the document's assigned IRI to the document's database URI. This enables you to say things about the document, such as "database-document-X mentions IBM", by joining two triples (e.g., something like "assigned-IRI mentions IBM" and "assigned-IRI isDocument database-document-X").

Below is a query that will show you all the owl:sameAs triples in the database. In particular, this will show you all the IRIs linked to OpenCalais identifiers.
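A sketch of such a query, run through sem:sparql (the variable names are illustrative):

```xquery
xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

(: list every IRI linked to another identifier via owl:sameAs :)
sem:sparql("
  PREFIX owl: <http://www.w3.org/2002/07/owl#>
  SELECT ?iri ?calaisId
  WHERE { ?iri owl:sameAs ?calaisId }
")
```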

We can also see how an individual document's IRI is embedded by looking at one via XQuery:
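For instance, something like the following fetches one of the enriched articles; the /news/ directory is an assumption, so substitute a URI or directory you saw in the earlier document listing:

```xquery
xquery version "1.0-ml";
(: grab the first document under the (assumed) news directory
   so we can inspect the embedded IRI :)
(cts:search(fn:doc(), cts:directory-query("/news/", "infinity")))[1]
```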

There are other ways to link triples and documents. For example, you can embed triples in the documents themselves. Such triples could include metadata such as "thisDocument publishDate today"; subjects or topics mentioned in the document, such as "thisDocument mentionsCity 'New York'"; or events such as "John wentTo China" or "Jack metWith Joe".
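For illustration, an embedded triple in MarkLogic's sem:triples markup might look like this (the article structure and example IRIs are invented; only the sem: element names follow MarkLogic's semantics namespace):

```xml
<article xmlns:sem="http://marklogic.com/semantics">
  <body>... article text mentioning New York ...</body>
  <sem:triples>
    <sem:triple>
      <sem:subject>http://example.org/docs/thisDocument</sem:subject>
      <sem:predicate>http://example.org/mentionsCity</sem:predicate>
      <sem:object datatype="http://www.w3.org/2001/XMLSchema#string">New York</sem:object>
    </sem:triple>
  </sem:triples>
</article>
```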

Verify your data

You should now see:

  • 155,980 distinct triples
  • 2,021 documents containing triples
  • 179,288 total triples

More?

Want to do more before moving on to the next exercise? Using mlcp, sem:rdf-load(), or a REST endpoint, you can load any file containing triples that you can find on the Semantic Web! And because MarkLogic is a database, you can also add, edit, or delete triples (in fully ACID-compliant transactions)!
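As a sketch, loading a triples file from the server's filesystem with sem:rdf-load() looks like this (the file path is an assumption; the format is inferred from the file extension):

```xquery
xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

(: parse and insert the triples in a single transaction :)
sem:rdf-load("/tmp/more-triples.ttl")
```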

References

Semantics Hello World

Introducing SPARQL
