The MarkLogic connector for Apache Spark is an Apache Spark 3 connector that supports reading data from and writing data to MarkLogic. Within any Spark 3 environment, the connector lets users query data in MarkLogic, manipulate it with familiar Spark operations, and then write the results back to MarkLogic or send them to another system. Data can also be imported into MarkLogic by first reading it from any data source that Spark supports and then writing it to MarkLogic.


Major Features

Reading Data:

  • Schema inference based on an Optic DSL query using fromView() (see the read sketch after this list)
  • Batch reads and micro-batch streaming
  • Performance tuning via the number of partitions and the batch size
  • Reading rows from MarkLogic via custom code
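
As a rough sketch of a batch read in PySpark (assuming the connector jar is on the Spark classpath; the format name and the spark.marklogic.read.* option names follow the connector's documented conventions, but treat the exact values here as illustrative and verify them against the documentation):

    from pyspark.sql import SparkSession

    # Assumes the MarkLogic connector jar is already on the Spark classpath.
    spark = SparkSession.builder.appName("marklogic-read-sketch").getOrCreate()

    # Read rows from a TDE view via an Optic DSL query; the connector infers
    # the Spark schema from the view. Connection details are placeholders.
    df = (spark.read.format("marklogic")
          .option("spark.marklogic.client.uri", "user:password@localhost:8000")
          .option("spark.marklogic.read.opticQuery",
                  "op.fromView('Example', 'Employees')")
          # Partition count and batch size are the main performance knobs.
          .option("spark.marklogic.read.numPartitions", "4")
          .option("spark.marklogic.read.batchSize", "10000")
          .load())
    df.show()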

Writing Data:

  • Writing rows as documents via the Data Movement SDK (DMSDK) (see the write sketch after this list)
  • Configurable document URIs, collections, and permissions
  • Streaming support
  • Performance tuning via the thread count and the batch size
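
A minimal write sketch in the same vein (option names follow the connector's spark.marklogic.write.* convention; the exact names and the permissions string format are assumptions to confirm against the documentation):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("marklogic-write-sketch").getOrCreate()

    # A tiny DataFrame to write; each row becomes a document in MarkLogic.
    df = spark.createDataFrame([(1, "Jane"), (2, "John")], ["id", "name"])

    (df.write.format("marklogic")
       .option("spark.marklogic.client.uri", "user:password@localhost:8000")
       # URI prefix, collections, and permissions for the written documents.
       .option("spark.marklogic.write.uriPrefix", "/employee/")
       .option("spark.marklogic.write.collections", "employee")
       .option("spark.marklogic.write.permissions",
               "rest-reader,read,rest-writer,update")
       # Thread count and batch size are the main performance knobs.
       .option("spark.marklogic.write.threadCount", "16")
       .option("spark.marklogic.write.batchSize", "100")
       .mode("append")
       .save())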

Reprocessing Data:

  • Processing rows via custom code in MarkLogic (see the sketch after this list)
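
A hedged sketch of reprocessing: custom JavaScript selects the items to process during the read, and custom JavaScript applied during the write processes each one inside MarkLogic. The option names and the URI external variable are assumptions based on the connector's custom-code pattern; confirm them in the documentation.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("marklogic-reprocess-sketch").getOrCreate()

    # Read: custom code returns one row per document URI to reprocess.
    df = (spark.read.format("marklogic")
          .option("spark.marklogic.client.uri", "user:password@localhost:8000")
          .option("spark.marklogic.read.javascript",
                  "cts.uris(null, null, cts.collectionQuery('employee'))")
          .load())

    # Write: custom code runs in MarkLogic once per row; 'URI' below is an
    # assumed external variable holding the row's value.
    (df.write.format("marklogic")
       .option("spark.marklogic.client.uri", "user:password@localhost:8000")
       .option("spark.marklogic.write.javascript",
               "declareUpdate(); xdmp.documentAddCollections(URI, 'reprocessed');")
       .mode("append")
       .save())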

Requirements

  • Apache Spark 3.3.0 or higher. The connector has been tested with the latest versions of Spark 3.3.x and 3.4.x.
  • For writing data, MarkLogic 9.0-9 or higher.
  • For reading data, MarkLogic 10.0-9 or higher.

Get Started

To learn more about the project and get started, visit the MarkLogic Spark documentation.

Related Resources

Get Started

In this tutorial, you will learn how to ingest data into a MarkLogic Data Hub Service instance running on AWS using the MarkLogic Connector for Apache Spark.

Why a MarkLogic Connector for Apache Spark?

In this blog post, Ankur Jain discusses what Apache Spark is and why you should use it with MarkLogic.

Documentation

Learn how to configure the MarkLogic Connector for Apache Spark in the documentation, which also covers the AWS Glue connector.
