Shouldn't heartbeat grace period be set per endpoint?

I’m trying to configure heartbeats on all endpoints in a generic way, so every endpoint sends heartbeats, and I want to have critical endpoints with a fast heartbeat (let’s say 30 seconds) and less critical endpoints with a slower one (let’s say 2 minutes).
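For context, this is roughly how I’m enabling heartbeats per endpoint. A minimal sketch, assuming the NServiceBus.Heartbeat plugin and its SendHeartbeatTo API; the endpoint and queue names here are just examples:

```csharp
using System;
using NServiceBus;

// A critical endpoint sends a heartbeat every 30 seconds...
var criticalConfig = new EndpointConfiguration("Sales.OrderProcessing");
criticalConfig.SendHeartbeatTo(
    serviceControlQueue: "Particular.ServiceControl",
    frequency: TimeSpan.FromSeconds(30));

// ...while a less critical endpoint sends one every 2 minutes.
var lowPriorityConfig = new EndpointConfiguration("Reporting.Exports");
lowPriorityConfig.SendHeartbeatTo(
    serviceControlQueue: "Particular.ServiceControl",
    frequency: TimeSpan.FromMinutes(2));
```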

But it seems that configuring the heartbeat this way doesn’t really have the intended effect, because what defines whether an endpoint is down is a single setting in ServiceControl: HeartbeatGracePeriod. The documentation has the following note:

When monitoring multiple endpoints, ensure that heartbeat grace period is larger than any individual heartbeat interval set by the endpoints.

So, I will need to set this value to, let’s say, 2 minutes and 10 seconds. Doesn’t this mean that I am actually adjusting the alarms to the less critical endpoints and ignoring the critical ones?
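(For reference, my understanding is that this value lives in ServiceControl’s app.config under the documented ServiceControl/HeartbeatGracePeriod key; a sketch with the 2 minutes 10 seconds from above:)

```xml
<appSettings>
  <!-- slowest heartbeat interval (2 min) plus a small buffer (10 s) -->
  <add key="ServiceControl/HeartbeatGracePeriod" value="00:02:10" />
</appSettings>
```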

An alternative implementation would be for each heartbeat message to include when the next heartbeat is expected. This way every endpoint would trigger the alarm at the right time, and it could even be scheduled (making heartbeats faster during certain periods of the day), although I’m not sure whether this would be useful or not.
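To make the idea concrete, here’s a hypothetical message shape (this is not an existing contract; NextExpectedAt is the proposed addition):

```csharp
using System;

// Hypothetical heartbeat contract for the suggestion above: each beat
// announces when the next one should arrive, so ServiceControl could
// raise the alarm per endpoint instead of using one global grace period.
public class EndpointHeartbeat
{
    public string EndpointName { get; set; }
    public DateTime ExecutedAt { get; set; }
    public DateTime NextExpectedAt { get; set; } // the proposed addition
}
```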

Hey @fcastells

If you configure the HeartbeatGracePeriod setting to 2 minutes, this will only show endpoints as offline if SC hasn’t received any heartbeats from them within 2 minutes.

I want to have critical endpoints with a fast heartbeat (let’s say 30 seconds)

Do I understand correctly that you want those critical endpoints to appear offline if no heartbeat message has been received for the last 30 seconds?

In general, do you see any issues in configuring all endpoints (heartbeat interval + grace period) with the requirements of the more critical endpoints?

Do I understand correctly that you want those critical endpoints to appear offline if no heartbeat message has been received for the last 30 seconds?

Yes

In general, do you see any issues in configuring all endpoints (heartbeat interval + grace period) with the requirements of the more critical endpoints?

The only issue I see is that a non-critical endpoint might have a single instance running in production, and it could crash and take a while to come back. This wouldn’t be critical, but it would generate an alarm. On the other hand, critical endpoints will have several instances running at any given time, so it’s only critical if they all go down.

Also, in a support conversation I was told the following:

Heartbeat durations should be set to values that make sense to your organisation and how critical availability of an instance is.

So, given this wording, I assumed that it was possible to configure the duration per endpoint, but then I found the HeartbeatGracePeriod which seemed to contradict this assumption.

Individual grace period settings per endpoint are currently not supported. I’ll propose your suggestion to our feature backlog, though; this is definitely valuable input :+1:

@fcastells a little follow-up question: As far as I understand, you’re mostly concerned about not triggering alarms for low-priority endpoints in the same manner as for the critical endpoints? How do you trigger those alarms?

In general, I’d argue that triggering an alarm is a different kind of action from marking an endpoint as offline. It seems like it would be fair to mark any endpoint as offline quickly but define different delays until an alarm is triggered? What are your thoughts?

@Tim we are just introducing NSB right now and we don’t even have ServiceControl and ServicePulse installed in production yet. So, all scenarios are theoretical.

I understand your point about the alarms. I’ve already seen the value of creating an endpoint that listens to ServiceControl events and triggers different actions depending on the event. It makes sense for us to post a message on Slack whenever an order fails or a critical endpoint goes down, so both Customer Service and the tech team can act quickly. But initially, I was thinking of an “alarm” as just something red popping up in ServicePulse.
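As a sketch of what I have in mind, assuming the ServiceControl.Contracts package and its HeartbeatStopped event (the endpoint names and the PostToSlack helper are made up):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using NServiceBus;
using ServiceControl.Contracts;

// An endpoint subscribed to ServiceControl's external integration events.
// Only critical endpoints trigger an immediate Slack alert; the rest are
// just logged, matching the "red light can wait" approach below.
public class HeartbeatStoppedHandler : IHandleMessages<HeartbeatStopped>
{
    // Hypothetical list of our critical endpoints.
    static readonly HashSet<string> CriticalEndpoints = new HashSet<string>
    {
        "Sales.OrderProcessing",
        "Billing.PaymentGateway"
    };

    public Task Handle(HeartbeatStopped message, IMessageHandlerContext context)
    {
        if (CriticalEndpoints.Contains(message.EndpointName))
        {
            return PostToSlack(
                $"CRITICAL: {message.EndpointName} stopped sending heartbeats " +
                $"(last one received at {message.LastReceivedAt:u})");
        }

        // Non-critical: the red light in ServicePulse is enough for now.
        Console.WriteLine($"Heartbeat stopped for {message.EndpointName}");
        return Task.CompletedTask;
    }

    // Placeholder for an actual Slack webhook call.
    static Task PostToSlack(string text) => Task.CompletedTask;
}
```

Tim’s idea of separate alarm delays could be layered on here too, e.g. by sending ourselves a delayed follow-up message for non-critical endpoints instead of alerting immediately.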

In any case, given the current functionality, it makes sense to set all endpoints to a relatively fast heartbeat and just be aware that a red light on a non-critical endpoint might not be a problem for a couple of minutes.

That sounds like a good approach to cover more “application-specific monitoring logic” (determining when an “alarm” is critical and when not, or who should be handling a specific error). Please let us know about your experience with writing such an endpoint :slight_smile:

Thanks for the additional insights, this is valuable information for us to improve SC and the monitoring capabilities :+1:

Thanks, sure I will let you know how it goes.