If you are new to working with geospatial data using Optic, check out the Query Geospatial Data using Optic tutorial for an introduction to Optic Geo. This tutorial is a follow-up that dives into some of the more advanced topics.
In Optic Geo, an indexed region is stored with its coordinate system. Thus, it is important to know which coordinate system your application should use and how Optic Geo determines the governing coordinate system for insertion and query.
MarkLogic supports the wgs84, wgs84/double, wgs84/radians, wgs84/radians/double, etrs89, etrs89/double, raw, and raw/double coordinate systems. The default and most commonly used is wgs84. This is a geographic coordinate system, which means it takes into account the curvature of the Earth. The double affix on a coordinate system name indicates the precision at which the latitude and longitude of points are stored; if the double affix is not present in the governing coordinate system, float precision is used.
Use a double precision coordinate system if your application requires very precise points; being able to tell which side of the street a feature is on is an example use case. Double-precision points use twice the disk space of float points, so consider this when choosing a coordinate system for your application.
As of MarkLogic 11, the radian angular unit is supported for wgs84 and wgs84/double. Use a radians coordinate system if your data's latitude and longitude are expressed in radians.
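If your source data is in degrees, the conversion is plain arithmetic before serializing the WKT. A minimal sketch (the coordinates are illustrative):

'use strict';

// Convert degrees to radians before serializing WKT for a
// wgs84/radians region. WKT orders coordinates (longitude latitude).
const toRadians = (deg) => deg * Math.PI / 180;

const lonDeg = -86.68220369547988;
const latDeg = 32.820833815855;
const wkt = `POINT(${toRadians(lonDeg)} ${toRadians(latDeg)})`;
wkt;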
etrs89 continues to be supported in MarkLogic 11, and models the Eurasian tectonic plate. See here for more information.
The raw coordinate system represents a Cartesian coordinate system and does not reflect any curvature of the Earth. See here for more information.
As mentioned above, sem:triple and TDE triples have a somewhat unconventional way of specifying a coordinate system.
You can specify a coordinate system IRI as a prefix to assign a region to a coordinate system.
<http://marklogic.com/cs/wgs84>POINT(50 50)
<http://marklogic.com/cs/wgs84/radians>POINT(50 50)
<http://marklogic.com/cs/wgs84/radians/double>POLYGON((-67.34709780707836 -2.734423830932904,-62.33733218207835 -2.734423830932904,-62.33733218207835 -5.714247209411817,-67.34709780707836 -5.714247209411817,-67.34709780707836 -2.734423830932904))
<http://marklogic.com/cs/raw>POLYGON((-67.34709780707836 -2.734423830932904,-62.33733218207835 -2.734423830932904,-62.33733218207835 -5.714247209411817,-67.34709780707836 -5.714247209411817,-67.34709780707836 -2.734423830932904))
<http://marklogic.com/cs/raw/double>@10 80,120
<http://marklogic.com/cs/etrs89>LINESTRING(-64.35881655707836 0.6922551010266869,-56.71233218207835 -6.849945334022303)
<http://marklogic.com/cs/etrs89/double>LINESTRING(-64.35881655707836 0.6922551010266869,-56.71233218207835 -6.849945334022303)
See below for how to specify a coordinate system for TDE region triples.
'use strict';
declareUpdate();
const tde = require("/MarkLogic/tde.xqy");

let node = {
  "template": {
    "description": "triple geom extraction",
    "context": "/Placemark",
    "triples": [
      {
        "subject": {
          "val": "sem:iri(fn:concat('http://example.org/ApplicationSchema#',fn:replace(name,' ','')))"
        },
        "predicate": {
          "val": "sem:iri('http://example.org/ApplicationSchema#hasExactGeometry')"
        },
        "object": {
          "val": "cts:polygon(fn:concat('<http://www.marklogic.com/cs/raw/double>',region))",
          "invalidValues": "ignore"
        }
      }
    ]
  }
}
tde.templateInsert('townsTriples.tdej', node)
The val of the object in the TDE above dictates that our regions be inserted in the raw/double coordinate system. This means that all regions extracted by the TDE will be in this coordinate system, and queries must specify it in the options argument of the geof:sf* functions in SPARQL.
The following demonstrates how to insert a sem:triple region into a non-default coordinate system.
declareUpdate();
const sem = require("/MarkLogic/semantics.xqy");

const triple = sem.triple(
  {
    "triple": {
      "subject": "http://example.org/ApplicationSchema#SugarloafVillage",
      "predicate": "http://example.org/ApplicationSchema#hasExactGeometry",
      "object": {
        "value": "<http://www.marklogic.com/cs/etrs89>POLYGON((-118.63779 35.829243,-118.63585 35.829242,-118.6356 35.828962,-118.63539 35.828729,-118.63494 35.828367,-118.63462 35.828122,-118.63445 35.827563,-118.63409 35.827341,-118.63341 35.827339,-118.63313 35.827396,-118.633 35.82771,-118.6327 35.827848,-118.63248 35.827782,-118.6325 35.826541,-118.63244 35.82516,-118.63621 35.825073,-118.63766 35.825015,-118.63779 35.829243))",
        "datatype": "http://www.opengis.net/ont/geosparql#wktLiteral"
      }
    }
  }
)
sem.rdfInsert(triple, null, null, "geograph")
The polygon above was inserted into the etrs89 coordinate system, and can only be discovered by queries issued against this coordinate system.
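A sketch of such a query, reusing the predicate from the insertion above (the point literal is illustrative, chosen to fall inside the polygon):

'use strict';
const sem = require("/MarkLogic/semantics.xqy");

let query = `
PREFIX my: <http://example.org/ApplicationSchema#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
SELECT *
WHERE { ?s my:hasExactGeometry ?o
FILTER geof:sfIntersects(?o,'POINT(-118.635 35.827)','coordinate-system=etrs89')}
`
sem.sparql(query);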
The governing coordinate system is the coordinate system being used for a given operation. MarkLogic determines the governing coordinate system for a region during insertion and query.
It is important to configure the Appserver's default coordinate system appropriately for your application.
During insert, if an indexed region has a coordinate system specified in the coordinateSystem element of a TDE column or in the data itself via an IRI, MarkLogic chooses this as the governing coordinate system. If no coordinateSystem is specified in the TDE or in the data, MarkLogic indexes the region into the Appserver's default coordinate system.
During query, the coordinate system of the relation's first argument is used. If it cannot be deduced, the DE-9IM relate function uses the 'coordinate-system=' option provided in its third argument. If this is also not present, the Appserver's coordinate system becomes the governing coordinate system for the query.
'use strict';
const sem = require("/MarkLogic/semantics.xqy");
let query =
`
PREFIX my: <http://example.org/ApplicationSchema#>
PREFIX geoml: <http://marklogic.com/geospatial#>
PREFIX cts: <http://marklogic.com/cts#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
SELECT *
WHERE { ?s my:hasExactGeometry ?o
FILTER geof:sfDisjoint(?o,'POINT(50 50)','coordinate-system=raw/double')}
`
sem.sparql(query);
If we did not specify 'coordinate-system=raw/double' above and raw/double is not our Appserver coordinate system, we would not get the triples generated by the TDE as results. Two regions must be in the same coordinate system for DE-9IM relationships to be evaluated against them.
As a best practice (if possible for your specific use case), decide on your application's coordinate system, set it at the Appserver level, and do not change it. Then, avoid specifying a coordinate system in your TDEs or as an IRI in your region data. This way, the Appserver's coordinate system is used everywhere, and coordinate systems can stay at the back of your mind during application development.
If you are building an application that uses Geo in Optic, it is advised to increase the in-memory triple index size and in-memory geospatial region index size of your database. This is important to avoid XDMP-FRAGTOOLARGE errors during data insertion.
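These settings can be changed in the Admin UI, or programmatically. Below is a sketch that assumes the Admin API setters admin.databaseSetInMemoryTripleIndexSize and admin.databaseSetInMemoryGeospatialRegionIndexSize are available in your version; the database name and sizes (in megabytes) are illustrative only:

'use strict';
const admin = require("/MarkLogic/admin.xqy");

let config = admin.getConfiguration();
const dbId = xdmp.database("Documents");  // your content database

// Illustrative values; tune for your data and available memory.
config = admin.databaseSetInMemoryTripleIndexSize(config, dbId, 256);
config = admin.databaseSetInMemoryGeospatialRegionIndexSize(config, dbId, 128);
admin.saveConfiguration(config);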
MarkLogic 11 comes with a new setting at the database level: triple index geohash precision. It is similar to the geohash precision setting on a geospatial region index’s configuration. This setting dictates the geohash precision of any triple region that is inserted into the database. The set of valid values for this setting is the range of integers from 1 to 12. It is not recommended to set this higher than 6, unless you are only storing points. The default is 5.
You will want to determine the best triple geohash precision for your application before inserting any triple regions, as modifying it requires a reindex.
There are various factors that determine the best triple geohash precision for your dataset. In general, the higher your triple geohash precision is, the more performant your queries will likely be, at the cost of your geospatial region indexes taking up more disk space.
If you are only storing regions and/or NOT using any of the DE-9IM functions for search (e.g. geo:within()), a triple index geohash precision of 1 would be best, as it will save the most disk space.
If your average region is a very large polygon (e.g. a region the size of Australia) it would be best to use a geohash precision of 2 or 3.
If your average region is a polygon the size of a small country, use a geohash precision of 4.
We’ve seen the best trade-off between disk space, memory usage, and performance for a dataset with buildings at a geohash precision of 5.
If you are not too highly constrained on disk space and require better performance for the calculation of DE-9IM relations, use a geohash precision of 6.
MarkLogic cannot optimize queries against two region variables as of version 11.0.0. Queries issued with a new geo builtin (e.g. geo:contains()) will be optimized if one region argument is a variable (e.g. a TDE column) and the other is a region literal.
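For example, a query shaped like the following can be optimized, because one region argument is a column and the other is a polygon literal. This is a sketch reusing the 'regions'/'towns' view from the examples later in this tutorial; the polygon coordinates are illustrative:

'use strict';
const op = require('/MarkLogic/optic');

// One argument is a column (variable), the other a literal region,
// so the relation can be resolved against the indexes.
op.fromView('regions', 'towns')
  .where(op.geo.coveredBy(op.col('interiorPoint'),
    cts.polygon('POLYGON((-87 32,-86 32,-86 33,-87 33,-87 32))')))
  .result();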
The lowest-level component of the geospatial search feature is the Relate Engine. Given two regions and a DE-9IM operation, it returns true if the first region satisfies the operation against the second, and false otherwise. This is powerful, but it is also expensive and slow.
When resolving a query with a geospatial constraint, we want to avoid using the Relate Engine at all costs because we care about performance. We can avoid the Relate Engine by using the Geohash Index in combination with Slice matching. If you are running into a slow query, it is likely falling into one of the unoptimizable cases described below.
To debug a slow query with a geospatial constraint, turn on the Optic Region Relate trace event. This will output log messages showing the status of an executing query. The most useful information is logged on the D-Nodes (the nodes hosting the forests). If the query can be optimized, the logs will show information about the number of "maybe", "definite", and "brute-force" matches.
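The trace event can be turned on in the Admin UI under Groups > Diagnostics, or programmatically. The following is a sketch that assumes the Admin API functions admin.groupTraceEvent and admin.groupSetTraceEvents are available in your version:

'use strict';
const admin = require("/MarkLogic/admin.xqy");

let config = admin.getConfiguration();
const groupId = admin.groupGetId(config, "Default");

// Register the trace event; trace events must also be activated
// for the group (Diagnostics > trace events activated).
config = admin.groupSetTraceEvents(config, groupId,
  [admin.groupTraceEvent("Optic Region Relate")]);
admin.saveConfiguration(config);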
There are four cases that definitely cannot use the indexes efficiently as of MarkLogic 11.0.0; the fourth of these is a cts:circle() or cts:box() literal in a geographic coordinate system.
After avoiding the four cases above, if you are still seeing many "brute-force" matches in the output of the Optic Region Relate trace event, try increasing the triple index geohash precision setting on your database.
The number of points in your polygons affects performance as well. The more points in your query polygon, the more likely it is to be slow. If you are passing in a polygon with a large number of points and are facing performance issues, see if you can represent the same region in fewer points.
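A quick way to gauge this is to count the coordinate pairs in your query polygon's WKT before issuing the query. The helper below is hypothetical, plain string handling rather than a MarkLogic API, and only handles simple single-ring polygons:

'use strict';

// Hypothetical helper: count coordinate pairs in a single-ring
// WKT polygon by splitting the outer ring on commas.
function countWktPoints(wkt) {
  const ring = wkt.substring(wkt.indexOf('((') + 2, wkt.indexOf('))'));
  return ring.split(',').length;
}

countWktPoints('POLYGON((-67.3 -2.7,-62.3 -2.7,-62.3 -5.7,-67.3 -5.7,-67.3 -2.7))');
// => 5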
This section relates to all coordinate systems except for raw and raw/double.
A very popular question to ask is: "Give me all of my stored regions in column Z that are within R kilometers of point P." At first thought, we would gravitate toward calling geo:within() with one region argument as column Z and the other as a cts:circle literal C with radius R around point P. Unfortunately, this is exactly the fourth unoptimizable case from the section above. But there is a way to work around this and turn our inefficient query into one that MarkLogic can resolve without too much pain.
There exists a built-in function, geo.circlePolygon(), that satisfies our requirement. Pass a cts:circle to this function, and we get back a cts:polygon that is a rough estimate of the circle. Polygon matching can be optimized in geographic coordinate systems, while circle matching cannot.
The first argument of geo.circlePolygon() takes a cts.circle representing the circle to convert into a polygon. The second argument is an xs:double representing arc tolerance. The closer this value is to zero, the more precise the output polygon will be; this affects the number of points in the output polygon. Keep in mind that the more points your polygon has, the more expensive the query will likely be, so if you prioritize performance, use an arc tolerance value closer to 1.
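As a quick illustration, both calls below approximate the same circle at different arc tolerances; compare the vertex counts of the two output polygons (a sketch, using the circle from the example that follows):

'use strict';

const circle = cts.circle(100, 'POINT(-86.68220369547988 32.820833815855)');

// Coarser arc tolerance: fewer vertices, cheaper to match ...
const coarse = geo.circlePolygon(circle, 0.5);
// ... finer arc tolerance: more vertices, more precise, more expensive.
const precise = geo.circlePolygon(circle, 0.01);

[coarse, precise];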
See below for an example of an optimized radius query.
'use strict';
const op = require('/MarkLogic/optic');

const result = op.fromView('regions', 'towns')
  .where(op.geo.coveredBy(op.col('interiorPoint'),
    geo.circlePolygon(cts.circle(100, 'POINT(-86.68220369547988 32.820833815855)'),
      0.01, ['tolerance=0.001', 'units=miles'])))
  .orderBy('geoid')
  .result()
result;
Aliceville, McMullen, and Petrey are within a 100-mile radius of this point, and they are returned as results. The more regions in our database, the greater the performance difference between circle matching and polygon matching.
Geospatial constraints on cts:box literals also cannot be optimized. But MarkLogic 11 includes a new built-in function similar to geo.circlePolygon(): geo.boxPolygon(). Pass a box to this function and it returns a polygon that is a rough estimate of the cts:box. Polygon matching can be optimized in geographic coordinate systems, while box matching cannot.
'use strict';
const op = require('/MarkLogic/optic');

const result = op.fromView('regions', 'towns')
  .where(op.geo.coveredBy(op.col('interiorPoint'),
    geo.boxPolygon(cts.box(30.293547827697072, -88.41804353922988,
      35.04469652065227, -85.01228182047988))))
  .orderBy('geoid')
  .result()
result;
Aliceville, McMullen, and Petrey are within the box polygon specified. The more regions in our database, the greater the performance difference between box matching and polygon matching.
There is a collection of functions that Geo in Optic does NOT optimize. Avoid calling the following in SQL, SPARQL, and Optic at all costs:
geo.boxIntersects()
geo.circleIntersects()
geo.complexPolygonContains()
geo.complexPolygonIntersects()
geo.polygonContains()
geo.polygonIntersects()
geo.regionContains()
geo.regionIntersects()
geo.regionRelate()
You can specify region as the scalarType for a TDE column. The region type encompasses all geometries: points, boxes, circles, linestrings, polygons, and complex polygons can all be stored in a region column. To use this facility, do not use a region constructor in the val of your TDE column. This implies that, within the document, the val to be extracted must be a string that is internally parseable as a region.
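You can sanity-check a candidate string with geo.parseWkt, assuming your region strings are serialized as WKT as in the documents below; if the string parses, it can be extracted into a region column:

'use strict';

// Parses into a cts region; throws an error if the string
// is not valid WKT.
geo.parseWkt('POINT(-118.6330952882809 35.80627543880975)');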
Assume we have this TDE, where the exactGeometry column has a region scalarType.
'use strict';
declareUpdate();
const tde = require("/MarkLogic/tde.xqy");

let node = {
  "template": {
    "description": "region table",
    "context": "/Placemark",
    "rows": [
      {
        "schemaName": "Regions",
        "viewName": "Places",
        "columns": [
          {
            "name": "geoid",
            "scalarType": "int",
            "val": "geoid",
            "nullable": false
          },
          {
            "name": "name",
            "scalarType": "string",
            "val": "name"
          },
          {
            "name": "exactGeometry",
            "scalarType": "region",
            "val": "exactGeometry",
            "invalidValues": "ignore",
            "coordinateSystem": "wgs84"
          }
        ]
      }
    ]
  }
}
tde.templateInsert('Places.tdej', node)
And the following two documents are in the database:
'use strict';
declareUpdate();

let node = xdmp.unquote(
`{
  "Placemark": {
    "name": "Panorama Heights",
    "geoid": 655506,
    "exactGeometry": "POLYGON((-118.63555 35.809731,-118.62843 35.809727,-118.62845 35.810692,-118.6284 35.810898,-118.61921 35.811218,-118.61917 35.808052,-118.61921 35.803162,-118.62955 35.802883,-118.6355 35.803001,-118.63566 35.80822,-118.63555 35.809731))"
  }
}`)
xdmp.documentInsert('PanoramaHeights.json', node)
'use strict';
declareUpdate();

let node = xdmp.unquote(
`{
  "Placemark": {
    "name": "Poso Fire Station",
    "geoid": 655507,
    "exactGeometry": "POINT(-118.6330952882809 35.80627543880975)"
  }
}`)
xdmp.documentInsert('PosoFireStation.json', node)
We should get a Polygon and a Point in the same column of the Places TDE view, which is observable by running the query below.
'use strict';
xdmp.sql('select * from places')
And indeed, we see Poso Fire Station's and Panorama Heights' point and polygon data, respectively, in the same column.
When used effectively, a query-based view (QBV) can help focus on the data that matters most. QBVs are at their best when created against already-indexed data, and when the query that creates them is drilled down to the exact business need. Along with MarkLogic 11 and Geo in Optic comes the flexibility to define any geospatial QBV column as a region, akin to the TDE method above.
Consider the example in the OpenGIS support section:
const view2ColDescription = [
  {
    "name": "tenKmRadiusPolygon",
    "type": "region",
    "invalid-values": "reject",
    "coordinate-system": "wgs84"
  }
]

const view2 = op.fromView('regions', 'towns')
  .bind(op.as('tenKmRadiusPolygon',
    op.geo.circlePolygon(op.col('tenKmRadius'), 0.01, 'tolerance=0.001')))
  .select([op.col('geoid'), op.col('tenKmRadiusPolygon')])
  .generateView('Towns', 'CirclePolygonView', view2ColDescription)

xdmp.eval('declareUpdate(); \
xdmp.documentInsert("circlePolygonQBV.xml", view, \
{collections: "http://marklogic.com/xdmp/qbv"})',
  {view: view2},
  {database: xdmp.database('Schemas')});
We defined view2 as a view that has a geoid and a polygon column that is actually a circle. We can take this a step further and set the type in the column description to a generic region. If we do this, any geometry or geography will be accepted in this column of the query-based view. With this capability, we can render just about any geometry in a single layer in our OpenGIS tool.
XDMP-GEOHASH-TOLERANCE
This error is thrown if a geometry cannot be geohashed, either during indexing or at query time. See here for more details about geohashing.
During ingest testing, we've found that this error is most likely to be thrown when trying to index a region whose latitude is too close to the poles. If you have regions whose latitude is greater than 85 or less than -85, try increasing the triple index geohash precision setting on the database (requires a reindex). If the error is still thrown after that, you may not be able to store these regions in a geographic coordinate system (e.g. wgs84).
During query testing, we've found that this error is most likely to be thrown when the 'tolerance=' option in a call to a geo relate function is too high. Try lowering this value in your query. If this does not work or the option is not present, a region in your query is likely too close to the poles. If you have regions whose latitude is greater than 85 or less than -85, try increasing the triple index geohash precision setting on the database (requires a reindex). If the error is still thrown after that, you may not be able to query with these regions in a geographic coordinate system (e.g. wgs84).
XDMP-FRAGTOOLARGE
This error is thrown when a document and its index content cannot fit in memory. If you are running into this issue while ingesting regions in TDE or sem:triple(), increase your in-memory triple index size and in-memory geospatial region index size in your database settings. If you are still running into this issue after increasing these settings, the error can be caused by massive geographic (e.g. wgs84) regions being inserted. Try decreasing the triple index geohash precision setting on your database (requires a reindex).
My geometries are showing up as -90,0 or POINT(0 -90)
If your geometries are unexpectedly being indexed with points at -90,0, they are being clipped because they fall outside the bounds of the geographic coordinate system. Double-check that your method of ingestion has latitude and longitude in the correct order.
Remember that in MarkLogic's internal serialization, points are ingested and output as (latitude, longitude), but in WKT, GeoJSON, and KML, points are (longitude, latitude).
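A quick sketch of the two orderings; both expressions below describe the same location:

'use strict';

// MarkLogic's internal serialization: (latitude, longitude)
const fromLatLon = cts.point(35.80627543880975, -118.6330952882809);

// WKT: (longitude latitude)
const fromWkt = fn.head(geo.parseWkt('POINT(-118.6330952882809 35.80627543880975)'));

[fromLatLon.toString(), fromWkt.toString()];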