ServiceControl RavenDB Corax indexing engine bug can cause long shutdown times

Hi everyone,

We recently discovered an issue in the RavenDB Corax indexing engine. RavenDB is the storage technology used by ServiceControl.

No data is lost or corrupted, but indexed data can be stale.

Symptoms

  1. Stopping an instance takes a considerable amount of time or even times out.
  2. Significant storage I/O occurs when starting the instance, because the database runs a lengthy recovery operation.
  3. With larger databases (~500GB+), significant storage I/O can also be observed during normal operation.
  4. ServiceInsight does not return the expected results for audit messages that have already been ingested.

Cause

The issue can occur when the following conditions are met:

  • ServiceControl Audit instances are installed on Windows as a service
  • The audit database size is above 100GB
  • There is a constant load on the database due to:
    • Continuously ingesting messages from the audit queue
    • Deletion of expired audit messages
  • The database indexes use the Corax indexing engine

When ServiceControl is stopped due to a service stop or system shutdown, the database engine flushes index data to storage. This operation can take considerable time, and the OS terminates it if it runs too long.

Environments that are almost never idle and have continuous message ingestion are more likely to be affected.
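
To get a rough idea of whether an instance is above the 100GB size mentioned in the conditions, the on-disk size of the audit database folder can be summed. The following is a minimal sketch, not an official tool: the function names are hypothetical, the database folder path must be supplied by you, and it assumes that the folder size on disk is a reasonable approximation of the database size.

```python
import os
import sys

GB = 1000 ** 3  # decimal gigabyte; an assumption, adjust if you measure in GiB


def directory_size_bytes(path):
    """Sum the sizes of all files below `path` (hypothetical helper)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # Files may be locked or removed while the database is running.
                pass
    return total


def exceeds_threshold(path, threshold_gb=100):
    """Return True when the folder is larger than `threshold_gb` gigabytes."""
    return directory_size_bytes(path) > threshold_gb * GB


if __name__ == "__main__" and len(sys.argv) > 1:
    # The database folder path is passed as the first argument.
    db_path = sys.argv[1]
    size_gb = directory_size_bytes(db_path) / GB
    print(f"{db_path}: {size_gb:.1f} GB, above 100GB: {exceeds_threshold(db_path)}")
```

Size is only one of the conditions listed above, so a result above the threshold does not by itself mean the instance is affected.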

Mitigation options

If you are affected, we advise switching to the Lucene indexing engine.

Switch to the Lucene indexing engine

While waiting for Hibernating Rhinos to release a patch for the RavenDB Corax indexing engine, it’s recommended to switch indexes to the Lucene indexing engine if you are affected. To do so, follow the steps described in the ServiceControl Troubleshooting documentation.

Contact

If you have any questions, please don’t hesitate to contact us through the Particular support portal or reply to this message.

With thanks,
The team in Particular