We are currently on ServiceControl v1.41.3 and ServicePulse v1.9.1. We recently had a bug in one of our applications that threw thousands of error messages. This backed up ServiceControl and made ServicePulse unusable because of timeouts. We got past that by recycling the ServiceControl Windows service, and we archived most of the failed messages.
Right now we have two issues. The first is that ServiceControl/RavenDB appears to be rebuilding its indexes or something similar: when I look at ServiceInsight, it only shows messages from 3 months ago (our retention limit), but each time I refresh, more and more messages show up.
The second issue is in ServicePulse (with ServiceControl behind the scenes): I believe the failed message count does not reflect the number of failures that actually exist. When I look at the Failed Message grouping and add up the counts, I get about 10 at most, but the failed messages badge at the top shows a few thousand. Clicking All Failed Messages lists all of those few thousand errors, but when I try to archive a few of them, nothing happens. It makes me think the failed message grouping has the correct errors, but something is off with ServiceControl's embedded RavenDB.
Edit: We have also noticed that we can’t see any new errors in ServicePulse and haven’t been able to retry messages either. We think the messages will show up once everything gets “reprocessed” and appears in ServiceInsight. It looks like this will take some time, as there are a lot of messages it has to go through to catch up to today.
As you likely know, ServicePulse connects to ServiceControl, and ServiceControl processes all messages from the error and audit queues.
Inside ServiceControl we use RavenDB, a document database. Where SQL Server inserts data and updates its indexes at the same time, document databases usually don’t. So even though the data is already stored in the database, the indexes might not be updated yet. As a result, queries can return data that isn’t 100% in sync with the actual data inside the database. That is what you are experiencing.
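To make the stale-index behaviour a bit more concrete, here is a toy sketch in Python. It is not how RavenDB or ServiceControl is implemented; the class TinyDocumentStore and all of its methods are invented purely for illustration. It only shows the general idea: writes land immediately, index updates happen later, and queries that read the index lag behind until indexing catches up.

```python
from collections import deque


class TinyDocumentStore:
    """Hypothetical toy store: writes are immediate, indexing is deferred."""

    def __init__(self):
        self.documents = {}     # doc id -> status (the actual stored data)
        self.index = {}         # status -> set of doc ids (what queries read)
        self.pending = deque()  # doc ids written but not yet indexed

    def store(self, doc_id, status):
        # The document is saved right away; the index is NOT touched here.
        self.documents[doc_id] = status
        self.pending.append(doc_id)

    def run_indexing(self, batch_size):
        # Index at most batch_size pending documents, oldest first.
        for _ in range(min(batch_size, len(self.pending))):
            doc_id = self.pending.popleft()
            for ids in self.index.values():
                ids.discard(doc_id)  # drop any outdated index entry
            self.index.setdefault(self.documents[doc_id], set()).add(doc_id)

    def is_stale(self):
        # The index is stale while any written document is still unindexed.
        return bool(self.pending)

    def count_by_status(self, status):
        # Queries only see what has been indexed so far.
        return len(self.index.get(status, set()))


store = TinyDocumentStore()
for i in range(1000):
    store.store(f"msg-{i}", "failed")

store.run_indexing(100)
print(store.count_by_status("failed"), store.is_stale())  # 100 True

store.run_indexing(900)
print(store.count_by_status("failed"), store.is_stale())  # 1000 False
```

In this sketch, a count taken while the index is stale (100) disagrees with the data that is actually stored (1000), which is the same kind of mismatch as seeing more messages appear on every refresh.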
Depending on the message throughput (and in the case you mentioned, thousands of messages), ServiceControl requires a lot of resources. RavenDB is a database and, like SQL Server, it needs a fair amount of memory and disk I/O. For more information on hardware considerations and performance planning, see our documentation: "Optimizing ServiceControl for use in different environments" (ServiceControl, Particular Docs).
Check the RavenDB logs to see if anything odd is happening. When there are errors, we usually see that RavenDB is unable to update its indexes. You can put ServiceControl in maintenance mode, access the RavenDB database, and remove indexes. RavenDB should then be able to rebuild them. If there are tons of messages in there, though, the rebuild might take a while. See "ServiceControl maintenance mode" (ServiceControl, Particular Docs).