Mitigating the Impact of Re-indexing

by Tyler Replogle

Your application is live in production, you have millions of documents, and now you want to change a database setting or add a custom index. You know that once these changes reach production, it will take hours, maybe days, to re-index all the affected documents, and that MarkLogic will consume system resources while it re-indexes. You need a way to mitigate the impact of re-indexing.

Re-indexing resource consumption

Before we go over the solution, we need to understand the problem: re-indexing can be a heavy consumer of your system resources. The "Understanding System Resources" white paper explains index utilization best with these two paragraphs.

"MarkLogic indexes consist of both in-memory and on-disk data structures. Range indexes and lexicons are stored in-memory-mapped files – if your application uses them, you’ll see equivalent memory usage. Term lists are part of what’s known as the Universal Index, and those are both in-memory (in the List Cache) and in files on disk. The Triple Index also uses memory and disk resources; although not memory mapped, the Triple Cache will grow and shrink as needed to support semantics queries.

Generally, utilization of indexes means you’ll need both more storage space on-disk, and potentially more space utilized in-memory, in the case of lexicons and range indexes. More indexes mean larger index files, and slower ingestion – more work needs to be done as content is ingested to create the index files. Of course, more indexes, particularly when residing in-memory, can result in query performance 100X-1000X faster than if the query needs to be resolved through additional work at query time."

(MarkLogic Performance: Understanding System Resources, p. 14)

From these two paragraphs, we know that indexing takes time: re-indexing a document takes just as long as indexing it did in the first place. The re-indexer queries the data to filter out documents that do not need to be re-indexed, using the same query features as cts:search. So to find out how many documents a change to the index settings will affect, run a cts:search on the element or attribute in question and wrap it in xdmp:estimate; the result is likely close to the number of documents affected by the change.
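
As a rough illustration of the estimate described above, the Python sketch below only assembles the XQuery expression as a string; running it (for example by pasting it into Query Console) is left out. The element name "publication-date" is a hypothetical example, and the empty cts:and-query, which matches everything, is one assumed way of asking for any document that contains the element.

```python
def estimate_query(element_name: str) -> str:
    """Build an XQuery expression that estimates how many documents
    contain the given element. Nothing is sent to MarkLogic here."""
    return (
        "xdmp:estimate(\n"
        "  cts:search(fn:doc(),\n"
        f'    cts:element-query(xs:QName("{element_name}"),\n'
        "      cts:and-query(()))))"
    )


# "publication-date" is a made-up element name used for illustration only.
print(estimate_query("publication-date"))
```

The number this returns comes from the indexes without filtering, so treat it as an approximation of the documents the re-indexer will touch.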

Re-indexing is like ingesting a document. The resource utilization will be very similar. Here is what the "Understanding System Resources" white paper explains about ingestion.

"There are multiple operations that consume resources in the ingest process. Some of those operations must happen in the foreground, immediately: writing to the journal, for example. Other operations happen in the background, prioritized behind foreground operations (and subject to throttling through administrative settings). You will find that MarkLogic utilizes resources – particularly I/O and CPU – even at times when no queries are issued. The system is constantly optimizing for the next read or write operation.

This means that you’ll observe the following, all of which is normal and means the system is operating properly:

  • Spiky I/O. This happens when periodic merges run and do big I/O operations to combine files on disk
  • CPU. Merges will show up as nice % in CPU statistics

When ingesting [or re-indexing] content, you should expect to see heavy I/O and CPU activity related to merges."

(MarkLogic Performance: Understanding System Resources, p. 7)

When we are re-indexing we need to keep in mind the resource consumption of:

  • The new indexes
  • Ingestion (re-indexing) of the documents affected
  • The merges that will happen because of the re-indexing of the affected documents

Mitigating Re-indexing Impact

As the previous section shows, re-indexing can be a resource-intensive operation, especially when you are re-indexing a large number of documents in a database that has a lot of custom indexes. We have seen re-indexing jobs take days to complete.

Most systems cannot afford to have their resources constrained for days. An even bigger issue is that some code requires indexes to be available in order to execute. If that code is deployed at the same time as the new database settings are applied, it will error out until the re-indexing is done.

To mitigate the resource usage and to resolve the code issue, deploy the new database settings before the code needs them. Lower the "reindexer throttle" to whatever level is comfortable, and make sure the merge priority is set to "lower". Re-indexing will still happen, just at a slower rate, and it can afford to take longer because you don’t need the indexes right away. If you are changing an index, say to a different collation, it’s a good idea to add the new index alongside the current one, because the current code still needs the current index to run. You can then clean up the old index once the re-indexing is done and the new code is deployed.
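
One way to apply the throttle and merge priority without clicking through the Admin UI is the Management REST API on port 8002. The sketch below only builds the request and does not send it. The endpoint path and the property names "reindexer-throttle" and "merge-priority" are assumptions that should be checked against the Management API documentation for your MarkLogic version, and the host and database names are hypothetical.

```python
import json


def throttle_request(host: str, database: str, throttle: int,
                     merge_priority: str = "lower") -> tuple[str, str]:
    """Return (url, json_body) for a PUT that lowers re-indexing pressure.

    Assumed, not verified here: the /manage/v2/databases/{name}/properties
    endpoint and the two property names used in the body.
    """
    if not 1 <= throttle <= 5:
        raise ValueError("reindexer throttle must be between 1 and 5")
    url = f"http://{host}:8002/manage/v2/databases/{database}/properties"
    body = json.dumps({
        "reindexer-throttle": throttle,
        "merge-priority": merge_priority,
    })
    return url, body


# Hypothetical host and database names.
url, body = throttle_request("prod-host", "content", throttle=2)
print(url)
print(body)
```

The resulting URL and body would then be sent as an authenticated PUT with a JSON content type, using whichever HTTP client you already have.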

Frequently Asked Questions

When you implement this approach to managing re-indexing, you’ll inevitably have questions like the ones below.

How far ahead do you deploy the settings?

There are a few different ways to handle this. One is a rule of thumb: if the database holds N documents and re-indexing all of them takes Y hours, deploy the settings Y hours plus a buffer before they are needed. For example, 24 million documents (with 100 custom indexes and many database indexes turned on) might take 72 hours to re-index, so to leave extra time we would deploy the index settings 5 days before they are needed.
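
The arithmetic behind that example can be sketched as follows. The 24 million documents and 72 hours come from the example above; the 48-hour buffer is what turns 72 hours into 5 days, and the go-live date is a made-up value for illustration.

```python
from datetime import datetime, timedelta


def deploy_by(needed_at: datetime, reindex_hours: float,
              buffer_hours: float) -> datetime:
    """Latest time to deploy the settings so that re-indexing,
    plus a safety buffer, finishes before the indexes are needed."""
    return needed_at - timedelta(hours=reindex_hours + buffer_hours)


docs = 24_000_000
reindex_hours = 72
docs_per_hour = docs / reindex_hours  # roughly 333,000 documents per hour

needed_at = datetime(2024, 6, 10, 9, 0)  # hypothetical go-live time
print(round(docs_per_hour))
print(deploy_by(needed_at, reindex_hours, buffer_hours=48))  # 2024-06-05 09:00:00
```

Keep in mind that the measured rate only carries over to another environment if the hardware, index settings, and throttle are comparable.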

That approach works well if you want to build it into your process for all deployments, but what if your deployments are more ad hoc? In that case, measure how long re-indexing takes on your pre-prod environment, which hopefully matches your production environment. If you do not have a pre-prod environment, you could estimate the number of affected documents with the cts:search and xdmp:estimate approach described in the re-indexing resource consumption section.

How do you find a good re-indexer throttle setting?

Unless it has been changed, the re-indexer throttle on most clusters is set to 5. The highest setting is 5 and the lowest is 1. There really isn’t a good way to find the best setting other than guess and check. The good news is that there are really only 4 options, because if you could use 5 you wouldn’t have this problem. You can go about it in two ways: start at 4, run the re-indexing, watch how the system resources are being used, and step down one level at a time until resource usage is fine; or start at 1 and step up toward 4 until resource consumption is no longer acceptable, then go back down one level.

Starting from 1 and going up to 4 is the safer option because you’ll be using fewer resources. One thing to note when changing the re-indexer throttle is that the effect of the old setting is still visible in the merges that are already happening. If you are coming down from a higher number, you’ll keep seeing more merges until they catch up with the current re-indexer throttle setting.
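
The step-up variant of guess and check can be sketched as a small loop. The is_acceptable callback is hypothetical: in practice it stands for applying the throttle level, giving merges time to settle, and then judging I/O and CPU usage by whatever monitoring you already have.

```python
from typing import Callable, Optional


def pick_throttle(is_acceptable: Callable[[int], bool],
                  lowest: int = 1, highest: int = 4) -> Optional[int]:
    """Step the throttle up from lowest to highest and return the last
    level whose resource usage was acceptable, or None if even the
    lowest level was too much."""
    chosen = None
    for level in range(lowest, highest + 1):
        if not is_acceptable(level):
            break  # this level is too heavy; keep the previous one
        chosen = level
    return chosen


# Stand-in check for illustration: pretend levels 1 to 3 are fine.
print(pick_throttle(lambda level: level <= 3))  # 3
```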

How do you know if there are index changes?

If you are using a version control system, you can compare versions of the database settings file to look for changes.
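
If the settings are kept as JSON, a comparison like this can also be scripted. The sketch below assumes two versions of a settings file have already been loaded into dictionaries (for example with json.load); the property names and values shown are illustrative and not taken from a real configuration.

```python
def changed_settings(old: dict, new: dict) -> dict:
    """Return {setting: (old_value, new_value)} for every top-level
    setting that was added, removed, or modified."""
    keys = set(old) | set(new)
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(keys)
        if old.get(key) != new.get(key)
    }


# Illustrative settings only; real files will have many more properties.
title_index = {"localname": "title", "scalar-type": "string"}
date_index = {"localname": "published", "scalar-type": "date"}

old = {"database-name": "content", "range-element-index": [title_index]}
new = {"database-name": "content",
       "range-element-index": [title_index, date_index]}

print(list(changed_settings(old, new)))  # ['range-element-index']
```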

If you have an environment that you deploy to before production, you can use the Configuration Manager located at port 8002, which lets you export database settings and import them. When you import, it shows you the changes, and you can review them without applying them.

How do you deploy just the database settings?

If you are using one of the common build tools like ml-gradle or Roxy, you can use their commands to deploy just the database settings. For ml-gradle, use the "mlDeployDatabases" task, which updates each database you have in the configuration directory. Roxy has a setting that lets you apply selected parts of the bootstrap command. If you just want to deploy indexes, you could run this command: ./ml local bootstrap --apply-changes=indexes

If you do not have a common build tool, you can use the Configuration Manager located at port 8002 to deploy indexes. You’ll need a cluster, such as a pre-prod environment, that already has the database settings you want to deploy. You can then export the configuration from the pre-prod environment and import it into the production environment.
