Entity Services is a way to design applications around real-world concepts, or entities, such as Customers and Orders, or Trades and Counter Parties, or Providers and Outcomes. This provides better alignment between business analysts who define the entities and the developers who combine them in application code.
Entity Services is made up of three core capabilities that work together to simplify data integration and application development with MarkLogic.
- Entity Type Model: The entity type model describes entities, their properties, and relationships to other entities. Entity Services provides a vocabulary for describing, validating, and querying entity models. Data owners and domain experts describe the semantics, or meaning, of the data and the rules that govern it using the extensible entity model.
- Persistence convention: Using the entity type model, Entity Services provides a default data model for storing and versioning entity instances, their metadata, and even the raw data from which the entities are derived. This convention is designed for extensibility to accommodate different types of data and a wide range of data governance practices.
- Code Generation APIs: Using the entity type model, Entity Services can generate code and configuration to speed up development and reduce translation errors.
In a seminal paper published in 1976, Peter Chen, the renowned computer scientist, put forward the idea of capturing information about the real world as entities and relationships. The idea was that you could use this analysis to unify multiple storage and transaction models to better represent the real world across different systems.
However, the relational model that has dominated the database industry over the last 40 years is ill suited to handle the complexities and unpredictability of real-world concepts. Instead, pragmatic data modelers have focused on what is achievable within the constraints of relational, pushing the mapping of concepts off to external ETL processing and application code. Four decades of relational dominance have improved what is possible with the relational model, but they have not addressed the fundamental mismatch between relational structures and the increasingly complex concepts and interactions that developers code into software and analysts devise to run a business or achieve a mission.
The context of the data—its meaning—is stored in SharePoint or Excel or a ragged ERD printout in some DBA’s cube—everywhere (and nowhere) except the database where the data itself is stored. Making sense of it within one database is difficult. Across different databases, each purpose-built for a specific application, it can be impossible.
What does a Customer mean? What are its defining properties? How is it related to other things that are important to my business or mission? Which systems can generate customers? How do I represent customers to my applications? How many customers do I know about? Which ones do not adhere to the rules of my business?…
Answering these questions today typically involves complex point-to-point integration to bridge different representations of the data and different interpretations of its meaning. In today’s environment, dominated by relational databases, this is brittle and expensive.
With a single greenfield application, it is possible to build a data model that represents the concepts behind a particular business process. For example, in a relational database you might have Customer, Sales Orders, and Products tables to capture the fact that customers order products. From the perspective of this order fulfillment application, a customer has a shipping and a billing address and can be easily related to the orders she has placed. However, within a real organization a Customer is much more than just an address to ship sales. That customer may have interacted with marketing programs or technical support, or might even be selling services or products back to you. In most enterprises this data is spread throughout the organization, typically in silos that handle a particular aspect of a customer. The simple customer concept above is now made up of disparate, sometimes conflicting data, whose context and meaning is trapped in database queries, application code, and outdated application specs and entity-relationship diagrams (ERDs).
Entity Services with MarkLogic provides a better way to manage entities and the messy, changing data from which they are derived.
- Modeling as documents and triples: Define a basic entity model—entities, properties, and relationships—using the out-of-the-box JSON or XML vocabulary. Extend that model with your own ontologies and query it with SPARQL.
- Conversion templates: Generate code templates from entity models to build entity instances from any source data.
- Tuple indexing: Template-driven extraction (TDE) rules project tuples out of entity instance documents for querying from the Optic API or with SQL.
At the core of Entity Services is an entity type model, or “model” for short. The model describes concepts, like Customer, in as little or as much detail as you need. Unlike a relational schema, you do not need to model all of your data before you can load it or query it in MarkLogic. You use your model to characterize the specific parts of the data that you want to query by its canonical representation. This process is designed to be iterative, evolving as you better understand your data and requirements.
The model describes entities (“nouns”), their properties (“adjectives”), and the relationships between entities (“verbs”). For example, a Customer entity has a first name and is related to an Order by its id. (More later about how you can be more specific about the nature of a relationship.)
Entity Services provides a JSON or XML vocabulary to describe models. This vocabulary is the bare minimum you would need in a compact, readable format. You can persist this document, just like any other JSON or XML document, in your database along with your data. The document format also lends itself to editing workflows that speak JSON or XML, for example, a browser-based model editor.
The model document format is only a convenience for quickly and compactly describing an entity model. Once a model document is persisted to a configured MarkLogic database, it is automatically converted to semantic triples. Triples are a W3C-standard way of capturing atomic facts. Each assertion in the model is converted to a graph of related triples. You can query these graphs with SPARQL: for example, to discover all of the entity types, or to infer that a customer purchases products even though that relationship is never explicitly declared (customers place orders, orders have line items, line items contain products).
The following is a query that catalogs all of the entity types declared in a model.
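As a hedged sketch, such a catalog query might look like the following SPARQL, assuming the built-in Entity Services vocabulary exposes an `es:EntityType` class (verify against the triples your MarkLogic version actually generates):

```sparql
PREFIX es: <http://marklogic.com/entity-services#>

# List every entity type declared in any persisted model
SELECT ?entityType
WHERE {
  ?entityType a es:EntityType .
}
```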
A semantic model allows you to add your own facts to extend the built-in vocabulary. For example, the out-of-the-box model allows you to assert that a Customer is an entity type. You could add another fact to assert that all Customer types are subclasses of Party, along with Partner and Vendor types. You could then write a single query across all Party types.
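Those extra facts are just ordinary triples. A sketch in Turtle, where the `ex:` IRIs are illustrative stand-ins for your own ontology, not part of the built-in vocabulary:

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/ontology#> .

# Customer, Partner, and Vendor are all kinds of Party
ex:Customer rdfs:subClassOf ex:Party .
ex:Partner  rdfs:subClassOf ex:Party .
ex:Vendor   rdfs:subClassOf ex:Party .
```

Once these triples are loaded, a SPARQL query over `ex:Party` subclasses reaches all three types at once.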
The entity type model not only documents the canonical entities represented in your data, but, because it is data itself, it can be queried and transformed into other artifacts. This allows a true “model-driven” workflow and a means to separate the data rules from application code. Applications are ephemeral; data is forever. Your database and governance policies should reflect that.
Entity Services provides utilities for transforming models into application components, saving time and reducing translation errors.
- Generate code to transform source documents into canonical representations, by default wrapped in an “envelope” document to preserve the original source alongside the canonical representation.
- Generate template-driven extraction (TDE) templates to project canonical model instances out of documents into the new row index for use with the Optic API, SQL, or SPARQL.
- Generate XML Schemas to validate entity instances and provide data typing for XQuery code.
- Generate database index settings for range indexes and word lexicons. Use the Management REST API to apply these changes.
- Generate Search API options to customize the behavior of the Search API, specific to your entity model.
- Generate transformation code to translate from one version of an entity model to another, for example to handle breaking changes to an entity definition.
The following examples use the sample data included in the Entity Services repository on GitHub. By default, the Entity Services libraries, in the `es` namespace, are bundled with your MarkLogic 9 installation; you only need the GitHub repository to run through the examples.
The example Entity Services implementation included in the repository covers three entities: Runner, Race, and Run, sourced from two different data sources. Conceptually, a runner participates in a run. A race is made up of many runs and has a winning runner. (The data is partially synthesized, so actual times and distances may not be entirely realistic.) You can see the raw JSON and CSV in the `data` directory.
The following walk-through will create an entity type model to describe these key entities, transform the raw data to reflect the canonical model, and then query that model using the Optic API and SQL.
You will need MarkLogic 9, git, and Java 8 to run the samples below.
- Clone the marklogic/entity-services project from GitHub.
git clone https://github.com/marklogic/entity-services.git
- Navigate to the `entity-services-examples` directory. In `gradle.properties` there, change the `mlPassword` and related properties to reflect an admin user in your environment. The admin user is only used to configure the initial security settings.
- From within the entity-services-examples directory (not the top-level entity-services directory), run the Gradle script to bootstrap a new environment.
./gradlew -PexampleDir=example-races mlDeploy
You should see three new databases in your MarkLogic instance, including `entity-services-schemas`. Let’s put some data in there and use Entity Services to integrate it.
Staging Source Data
The raw source data for the examples is located in the `data` directory. It is formatted as JSON and CSV, as if you had exported it from a real race tracking system. For the sake of illustration, the data and business rules are simplified from what you would typically encounter in a complex enterprise integration project. However, all of these concepts extend to larger, more sophisticated projects.
From within the `entity-services-examples` directory, run the Gradle task that loads the examples.
This will run through the entire scenario: loading both data sets, generating the transformation artifacts, and bulk transforming the raw inputs to reflect the canonical entities.
To verify what you just loaded, open Query Console in your browser. From the workspace drop-down on the right, import a new workspace from `races-qc.xml` in the `entity-services-examples` directory. Run the query in the tab `01. load-report`. This gives you a summary of what you just loaded. It should return something like:
These examples use the new Data Movement SDK to load files from the file system. However, the same concepts apply for data that you load through a REST API, mlcp, or any of the other ingest interfaces that MarkLogic provides.
The initial load puts the JSON and processed CSV, representing two different data sources, into the `raw` collection in the `entity-services-examples-content` database. You can think of this as a staging area. (You could also stage raw data into a separate database; Entity Services does not dictate a processing workflow.) The raw data is persisted in the database, so it is secured and indexed for discovery queries, but it is not necessarily ready to be consumed by applications. In particular, “Runs” are represented differently in the two different source systems.
A Run captured in one data set has a different shape than a Run extracted from the other system.
Both convey similar information about the real-world concept of a “Run”, and the differences provide important context about how the data is used upstream. We don’t want to lose this context just to get a harmonized view.
A person reading this data could figure out that `"duration": 91.14` and `"Time": "1:44:30.8"` both indicate the amount of time it took to complete a particular run. However, computers (today, at least) don’t have enough context to connect those two keys, let alone interpret the values as time durations. As such, we need to tell the database to treat `duration` in one data set the same as `Time` in the other, so that we can query unambiguously for the concept of how long a run took, rather than requiring every query to be aware of the specifics of the source systems.
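For illustration, the two source shapes might look like the following. The `duration` and `Time` values come from the sample discussion above; the other field names are invented for this sketch. Source system A records durations as decimal minutes:

```json
{ "runner": "Jane Doe", "duration": 91.14 }
```

while source system B records times as formatted strings:

```json
{ "Name": "Jane Doe", "Time": "1:44:30.8" }
```

A harmonization step maps both onto a single canonical property so downstream queries never see the difference.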
Before we write any code, though, it is important to document the meaning of the data. Entity Services provides a set of tools and techniques to:
- Capture assertions about your data,
- Manage those assertions along with the data they describe, using the same security and governance policies, and
- Use that context to build better applications, faster.
Creating a Model
Now that you have loaded the source data as is, you can begin mapping it to the business entities that it represents. For our simple data and data models, the modeling step may feel like extra, unnecessary work. However, for more complex and messy data, the rigor of the modeling phase is vital. One important benefit that an entity type model brings is the separation between the rules that define the real-world concepts and physical representations in code and data. The entity type model allows a domain expert to specify the entities independent of any particular application. The model itself is data and thus can be used by developers to implement business logic in code.
Entity Services provides a document format to describe entity type models. This JSON or XML representation captures the three core aspects of the model:
- Entity types
- Entity properties
- Relationships between entities
Below is the fully formed model for the example race data. In practice, you would probably iterate over this model many times as you understand your data and business requirements. MarkLogic makes it possible to co-locate raw data along with transformed data and the models that describe them. This means that you can begin getting value out of your data even before you have modeled it and manage and govern all aspects of your data throughout its lifecycle.
The model description covers the basics of entities, properties, and relationships. This document is designed to be the starting point of your entity modeling.
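As a sketch of the shape such a descriptor takes—the entity names come from the sample, but the exact property names and datatypes here are illustrative and may differ from the shipped model—the race model could be outlined like this:

```json
{
  "info": {
    "title": "Race",
    "version": "0.0.1",
    "description": "Runners, races, and individual runs"
  },
  "definitions": {
    "Runner": {
      "properties": {
        "name": { "datatype": "string" },
        "age": { "datatype": "int" }
      },
      "required": ["name"]
    },
    "Run": {
      "properties": {
        "id": { "datatype": "string" },
        "date": { "datatype": "date" },
        "distance": { "datatype": "decimal" },
        "duration": { "datatype": "dayTimeDuration" },
        "runByRunner": { "$ref": "#/definitions/Runner" }
      },
      "primaryKey": "id"
    },
    "Race": {
      "properties": {
        "name": { "datatype": "string" },
        "comprisedOfRuns": {
          "datatype": "array",
          "items": { "$ref": "#/definitions/Run" }
        },
        "wonByRunner": { "$ref": "#/definitions/Runner" }
      }
    }
  }
}
```

The `$ref` properties express the relationships (a race comprises runs, a run is run by a runner), while plain `datatype` properties describe scalar values.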
The bootstrapping that you did above saved the model into the `entity-services-examples-content` database, the same one into which we loaded the source data earlier. This allows you to easily query across the raw data and the models, for example to build model-driven transformations. Also notice that the model document is loaded into the `http://marklogic.com/entity-services/entity-types` collection. This is a special collection: documents in it that conform to the model descriptor specification above are automatically processed into semantic triples as they are inserted or updated. You can extend the core model with your own triples.
Generating components from the entity type model
There are many benefits to loading the source data as is: It can be secured and queried without losing any of its original context. This is incredibly valuable for data whose structure and values are not fully understood or are changing. Being able to manage and query this data without having to do up-front modeling is a key benefit of MarkLogic. However, by projecting the source data through the lens of a well specified model, you can ensure consistency, centralizing the meaning of the data into a well defined set of assertions, rather than application code or requirements specified in a Word document.
Using the model you loaded earlier, Entity Services can generate model-driven components useful for development. The following example generates a conversion module, an extraction template (TDE), an XML Schema, and database configuration that support the race model defined above. It saves these artifacts to the file system, in the same place the Gradle bootstrap used above. Thus, `./gradlew mlDeploy` will deploy them to the correct places in MarkLogic—code to the modules database, schemas to the schemas database, and configuration applied through the appropriate management interface.
- `es:database-properties-generate`: Generates database index configuration that can be fed to the Management REST API.
- `es:extraction-template-generate`: Generates a TDE template that projects entities as rows into the row index. This is useful for querying entities with the Optic API or SQL.
- `es:schema-generate`: Generates an XML Schema that validates entity instances generated as XML from `es:instance-converter-generate`. The generated schema only reflects the information captured in the model; if your `es:instance-converter-generate` implementation changes how the canonical instances are materialized, you’ll need to customize your schema as well.
- `es:search-options-generate`: Generates default configuration for the Search API.
- `es:version-translator-generate`: Generates XQuery code to translate entity instances between two versions of the type model.
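A sketch of driving these generators from XQuery, assuming the model descriptor was persisted at `/models/race.json` (the URI is illustrative; check the `es:model-validate` call against your version’s API):

```xquery
xquery version "1.0-ml";

import module namespace es = "http://marklogic.com/entity-services"
  at "/MarkLogic/entity-services/entity-services.xqy";

(: Load and validate the persisted model descriptor :)
let $model := es:model-validate(fn:doc("/models/race.json"))
return (
  (: XQuery module that converts raw sources to canonical instances :)
  es:instance-converter-generate($model),
  (: TDE template that projects instances into the row index :)
  es:extraction-template-generate($model),
  (: XML Schema for validating XML entity instances :)
  es:schema-generate($model),
  (: Index settings to feed to the Management REST API :)
  es:database-properties-generate($model),
  (: Default Search API options :)
  es:search-options-generate($model)
)
```

In practice you would save each generated artifact to disk (as the example Gradle build does) rather than just returning it.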
Generating Canonical Entities from Raw Sources
For example, the `es:instance-converter-generate($d)` function generates the XQuery module below. This code transforms source data into a canonical representation based on the model and wraps the harmonized instance, together with the source data, in an “envelope” document. In a real-world application, you would change the code in the `race:extract-instance-Race()` function to accommodate the specific transformation logic that your source data requires. The default assumes simple field mappings. More realistic data might require reformatting values, merging multiple source documents, looking up values from other documents/entities, or any of the other sophisticated transformation tasks that MarkLogic handles with aplomb. The generated code runs as is, but it is designed as a template and intended to be modified. It provides some generally useful patterns, but it is by no means the only way to leverage an entity type model in MarkLogic. Feel free to use it and learn from it.
The default transformation logic defined in `es:instance-converter-generate` materializes canonical representations of entities, wraps them in an envelope XML document (in the `http://marklogic.com/entity-services` namespace), and attaches the raw source document as well. This allows you to use any of MarkLogic’s query capabilities against the canonical data, the raw data, or both. The envelope pattern also provides an extensible means to track other metadata about the entity, for example its provenance.
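The resulting envelope looks roughly like this (instance content abbreviated; the property values are illustrative):

```xml
<es:envelope xmlns:es="http://marklogic.com/entity-services">
  <es:instance>
    <es:info>
      <es:title>Run</es:title>
      <es:version>0.0.1</es:version>
    </es:info>
    <Run>
      <distance>13.1</distance>
      <!-- ...remaining canonical properties... -->
    </Run>
  </es:instance>
  <es:attachments>
    <!-- the raw source document, preserved verbatim -->
  </es:attachments>
</es:envelope>
```

Because the canonical instance and the raw source live in one document, a single query can search either view, and the envelope leaves room for additional metadata sections.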
It is not necessary to materialize canonical forms of the entities. Application code could rewrite a query against the entity type model to negotiate however the source data is stored. (Or do away with a fixed model altogether.) That would provide flexibility to store your data in whatever form is convenient, resolving it to a model at runtime. While possible, this approach has several drawbacks.
- Rewriting queries and mapping/joining/projecting at runtime imposes extra work, potentially slowing down queries.
- It also means your developers need to be aware of both source and canonical representations in order to write any queries against the logical entities.
- Most importantly, materialization allows you to keep an accurate history of your data. As your data changes, you can store the actual data before and after the change, along with metadata about how and why the change was made, all in one place. Queries can access any or all of those aspects.
By materializing canonical entities, developers can use all of MarkLogic’s existing query capabilities against the entity instances themselves. The materialization logic only needs to happen at ingest (or update) time, likely as part of a larger data processing workflow.
In this example, a Java application uses the Data Movement SDK to apply the transformation to the raw source documents, storing the results back into the content database. The actual harmonization logic is orchestrated from Java.
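A hedged sketch of such an orchestration with the Data Movement SDK follows. The host, credentials, and the `race-transform` transform name are assumptions for illustration; the sketch selects everything in the `raw` collection and rewrites each document in place with a server-side transform:

```java
import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.datamovement.ApplyTransformListener;
import com.marklogic.client.datamovement.DataMovementManager;
import com.marklogic.client.datamovement.QueryBatcher;
import com.marklogic.client.document.ServerTransform;
import com.marklogic.client.query.StructuredQueryBuilder;

public class HarmonizeRaw {
    public static void main(String[] args) {
        // Connection details are placeholders for your environment
        DatabaseClient client = DatabaseClientFactory.newClient(
            "localhost", 8000,
            new DatabaseClientFactory.DigestAuthContext("admin", "admin"));
        DataMovementManager dmm = client.newDataMovementManager();

        // Batch over every staged document in the "raw" collection
        StructuredQueryBuilder qb = new StructuredQueryBuilder();
        QueryBatcher batcher = dmm
            .newQueryBatcher(qb.collection("raw"))
            // Apply a server-side transform, replacing each document
            // with its harmonized envelope
            .onUrisReady(new ApplyTransformListener()
                .withTransform(new ServerTransform("race-transform"))
                .withApplyResult(ApplyTransformListener.ApplyResult.REPLACE));

        dmm.startJob(batcher);
        batcher.awaitCompletion();
        dmm.stopJob(batcher);
        client.release();
    }
}
```

The same pattern scales from this sample to large clusters, since the batcher parallelizes work across hosts.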
Running the simplified bulk transformation app yields Run and Race document types, with Runner entities denormalized into the Run documents.
Indexing and querying entities as rows
Because entities, by default, are materialized as documents, you can use all of MarkLogic’s query capabilities against them directly, such as the Search API or JSearch. As of MarkLogic 9, you can also index and query documents as rows.
As you saw above, Entity Services can generate an extraction template (TDE) from an entity type model. The template tells the indexer how to project rows from documents. Because the same model drives entity materialization, the generated template and the materialized entity documents stay in sync.
You can verify the TDE that you generated and loaded into the `entity-services-examples-schemas` database with a report query.
The above report lists the rows the TDE matches in your database. This is a handy technique for verifying that your (generated) TDE is, in fact, extracting the rows you expect.
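One way to spot-check a template from XQuery is `tde:node-data-extract`, which reports the rows the installed templates would project from the documents you hand it. The document URI below is illustrative:

```xquery
xquery version "1.0-ml";

import module namespace tde = "http://marklogic.com/xdmp/tde"
  at "/MarkLogic/tde.xqy";

(: Show what the installed templates extract from one entity document :)
tde:node-data-extract(fn:doc("/race/run-1.xml"))
```

This is useful while iterating on a template, before relying on the row index itself.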
Recall that the TDE above extracts rows from the entities that we stored as XML documents in the database. Using the Optic API, you can write sophisticated queries over these rows. Because the rows are stored in the indexes and distributed among data nodes in a MarkLogic cluster, most queries can be parallelized and evaluated directly out of the indexes.
The following two examples give you a brief taste of what is possible with the Optic API.
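As one hedged sketch in XQuery—the schema name, view name, and column names are assumptions based on the race model, so adjust them to your generated TDE:

```xquery
xquery version "1.0-ml";

import module namespace op = "http://marklogic.com/optic"
  at "/MarkLogic/optic.xqy";

(: Average run duration grouped by distance, evaluated from the row index :)
op:from-view("Race", "Run")
  => op:group-by("distance", op:avg("avgDuration", "duration"))
  => op:order-by("distance")
  => op:result()
```

Because the plan runs against indexed rows, the grouping and aggregation can be evaluated in parallel across the cluster.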
Querying with SQL
Finally, in addition to the Optic API, you can use standard SQL to query the rows extracted into the row index. For example, all “Half Marathon One” races,
or counts of runs grouped by distance,
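Hedged sketches of those two queries, assuming the generated TDE exposes `Race` and `Run` views with `name` and `distance` columns (adjust names to your template):

```sql
-- All "Half Marathon One" races
SELECT * FROM Race WHERE name = 'Half Marathon One';

-- Counts of runs grouped by distance
SELECT distance, COUNT(*) AS runs
FROM Run
GROUP BY distance;
```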