When an application is deployed to Kubernetes, the machine name changes with each new pod (as far as I understand), so deriving the unique ID that way would be problematic: it would create a new queue for each new deployment.
“To avoid creating an excessive number of queues, the Id needs to be kept stable. For example, it may be retrieved from a configuration file or from the environment (e.g. role Id in Azure or machine name for on-premises deployments).”
But wouldn’t this give the same ID to all instances? I’m a bit confused as this seems to contradict itself. Could someone clarify/explain?
If you are using RabbitMQ: I just changed the routing behavior to set auto-delete on the instance-specific queues. :shrug:
But if there’s a better solution for a stable instance-specific queue, I’d be interested as well. My instances don’t really care who exists and who doesn’t, so auto-delete works fine for me.
How does that auto-delete work? Without knowing much about it, my concern would be unintentional deletion of the queue. For example, if I shut down a service for maintenance and the queue gets deleted, it’s no longer receiving any new messages, which basically breaks the resilience of the architecture.
It only deletes the instance-specific queue. For example, I have an “elastic” queue which takes all messages for the Elasticsearch database, and a bunch of “elastic-013b-a9292…” queues for the individual instances.
When the instance goes down due to a restart or whatever, it deletes the instance-specific queue but leaves the “elastic” one.
If you have outstanding callbacks on the instance-specific queue, I would think they’ll be deleted with it. But you can mitigate that by disconnecting from the main queue first, then waiting for the instance-specific one to drain.
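In case it helps to see it in broker terms: an auto-delete queue in RabbitMQ is removed by the broker once its last consumer disconnects, while the shared durable queue stays put. Here’s a rough sketch using the raw RabbitMQ.Client API; the queue names and the INSTANCE_ID environment variable are just placeholders (the NServiceBus transport normally declares these queues for you):

```csharp
using System;
using RabbitMQ.Client;

class QueueSetupSketch
{
    static void Main()
    {
        var factory = new ConnectionFactory { HostName = "localhost" };
        using var connection = factory.CreateConnection();
        using var channel = connection.CreateModel();

        // Shared queue: durable, survives restarts and consumer disconnects.
        channel.QueueDeclare(
            queue: "elastic",
            durable: true,
            exclusive: false,
            autoDelete: false,
            arguments: null);

        // Instance-specific queue: the broker removes it once the last
        // consumer (this instance) disconnects, e.g. when the pod is killed.
        var discriminator = Environment.GetEnvironmentVariable("INSTANCE_ID")
                            ?? Guid.NewGuid().ToString("N");
        channel.QueueDeclare(
            queue: $"elastic-{discriminator}",
            durable: false,
            exclusive: false,
            autoDelete: true,
            arguments: null);
    }
}
```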
That was my first instinct, but I believe to get that, I need to pull Environment.MachineName. The problem with that is that it’s going to change for every container on every deployment. Also, when autoscaling scales up and down, it will introduce new IDs (and thus new queues) as well.
Each endpoint instance has to have its own unique ID. Maybe it is a bit misleading to mention the machine name? The assumption of that documentation page might be that scaled-out instances will be placed on different machines, in which case the machine name is feasible.
I’m not an expert in Kubernetes, so take what I wrote with more than one grain of salt. How about leveraging the pod ID and exposing that to the container via environment variables?
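For example (just a sketch, and assuming I remember the NServiceBus API correctly): Kubernetes can inject the pod name into the container via the Downward API (fieldRef: metadata.name), and that value could then be used as the instance discriminator so the instance-specific queue gets a per-pod name. POD_NAME and the endpoint name here are placeholders:

```csharp
using System;
using System.Threading.Tasks;
using NServiceBus;

class EndpointSetupSketch
{
    static async Task Main()
    {
        var endpointConfiguration = new EndpointConfiguration("elastic");

        // Assumes the pod name is injected as POD_NAME; falls back to the
        // machine name outside Kubernetes.
        var discriminator = Environment.GetEnvironmentVariable("POD_NAME")
                            ?? Environment.MachineName;

        // Creates the instance-specific queue, e.g. "elastic-<pod-name>".
        endpointConfiguration.MakeInstanceUniquelyAddressable(discriminator);

        var endpointInstance = await Endpoint.Start(endpointConfiguration);
    }
}
```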
And the last question is: why do you really need callbacks?
I really am trying to avoid the callbacks tbh. It’s difficult to explain and it smells to me.
The use case is that the UI is creating a configuration for a customer. The server can accept configurations that are invalid (i.e. we save it, it just isn’t usable – it’s really just a rule set for a worker-type service). The desire is for the server to respond to the “save configuration” command with a validation result that can then be used by the UI to display a warning if the configuration is unusable.
My initial thought was to have the API respond with a location header and then have the UI immediately query that resource to get the final object which will include a validation result. That way the server can just accept the message, save the configuration, validate it, and then update the resource with that result. A potential issue here could be a race condition between us getting that data updated (we’re eventually consistent) and the UI querying for the result.
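Roughly what I had in mind, as a sketch only (the route, DTO, and lookup are made-up placeholders):

```csharp
using Microsoft.AspNetCore.Mvc;

[ApiController]
[Route("configurations")]
public class ConfigurationsController : ControllerBase
{
    [HttpPut("{id}")]
    public IActionResult Save(string id, [FromBody] ConfigurationDto dto)
    {
        // Send the "save configuration" command to the backend here.
        // 202 Accepted with a Location header the UI can query for the
        // final object, including the validation result.
        return Accepted($"/configurations/{id}");
    }

    [HttpGet("{id}")]
    public IActionResult Get(string id)
    {
        // Eventually-consistent read side: the validation result may not be
        // there yet, which is the race condition mentioned above.
        var configuration = FindConfiguration(id);
        if (configuration is null)
        {
            return NotFound();
        }
        return Ok(configuration);
    }

    private object FindConfiguration(string id) => null; // placeholder lookup

    public class ConfigurationDto { }
}
```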
It’s not a huge deal at the moment. But I’d also like to understand how to make this work for the future where we simply cannot get around the need for a synchronous operation like this. I’d like to be able to achieve it using the bus if possible so that I don’t have to expose APIs on my command model (they run as internal services as it is, so it would change too much about the architecture).
As for exposing the container ID through environment variables, I think that would have the same effect as the machine name, no? The container ID is basically the machine name. The problem is cleaning up those queues as they become unused. For example, a container gets added during autoscaling, so a new queue gets added. Then the container gets removed during autoscaling, but the queue remains.
The suggestion above was to auto-delete the queue, which sounded OK at first (despite the custom configuration), but it seems flawed: if we stop the container without intending to delete it, the queue would still be deleted, and those messages would be lost. But I guess that’s a moot point, because such is the nature of synchronous messaging… I’m still thinking through this…
Well, you could make creating and deleting the instance-specific queues part of your scale-up and scale-down infrastructure concerns. Then your problem would be solved, right?
How do you mean? The scaling is done by Kubernetes, which scales out by adding or removing containers. The queues live inside RabbitMQ, which is not running in a container.
I think for that to work, Kubernetes would have to provide hooks that fire when scaling up and down, and some other manager would then have to send in commands to clean up the queues. >:/
Maybe it’s just easier to say that NSB can’t cover this elastic scaling scenario?
It seems that charles’s solution is probably the only viable workaround for this. It may even be worth adding as a provision to the RabbitMQ transport package. Not sure how other transports would have to deal with it.
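For what it’s worth, the cleanup step itself could be tiny; the hard part is triggering it on scale-down (a preStop hook, an operator, or a manual script). A sketch using the raw RabbitMQ.Client API, where the queue naming and the POD_NAME variable are assumptions:

```csharp
using System;
using RabbitMQ.Client;

class QueueCleanupSketch
{
    static void Main()
    {
        var factory = new ConnectionFactory { HostName = "rabbitmq" };
        using var connection = factory.CreateConnection();
        using var channel = connection.CreateModel();

        // Delete the instance-specific queue for the pod being removed.
        var discriminator = Environment.GetEnvironmentVariable("POD_NAME");
        var remaining = channel.QueueDelete(
            queue: $"elastic-{discriminator}",
            ifUnused: false,
            ifEmpty: false);

        // QueueDelete returns the number of messages still in the queue.
        Console.WriteLine($"Deleted instance queue, {remaining} message(s) dropped.");
    }
}
```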
For my curiosity, what are you using callbacks for in your scenario?
Callbacks were originally designed with legacy systems migration in mind. Callbacks allow senders to wait for a reply, and to force a reply to be delivered to the same instance that sent the original message. In what seems to be a fully stateless environment (containers + Kubernetes), I’m just wondering what you are using callbacks for.
For this particular scenario, I’ve instructed the developer to drop it completely. The server will now validate the configuration separately from the front end.
Found, thanks.
What about an approach like the following:
- client sends the configuration
- API server accepts it and returns HTTP 202
- API server sends a message to the backend
- backend validates the incoming configuration
- backend publishes an event: ConfigurationAccepted or ConfigurationRejected
- API server is subscribed to the aforementioned events; when an event is received:
  - API forwards the event to client(s) via WebSocket
  - API stores the event result locally so clients can query for status in case the WebSocket connection is lost (e.g. the user hits refresh)
Callbacks are not required with such an approach. The user experience at the UI level can be designed to adapt to a task-based environment such as the above. All actors in the system are disposable: the API can now fail and be recycled without losing any bit of information, whereas with callbacks, if the API instance waiting for the callback dies, there is no easy way to recover.
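If it helps, here’s roughly what the backend piece of that flow could look like as an NServiceBus handler. All message and type names below are placeholders, not something the framework provides:

```csharp
using System.Threading.Tasks;
using NServiceBus;

public class SaveConfiguration : ICommand
{
    public string ConfigurationId { get; set; }
}

public class ConfigurationAccepted : IEvent
{
    public string ConfigurationId { get; set; }
}

public class ConfigurationRejected : IEvent
{
    public string ConfigurationId { get; set; }
    public string Reason { get; set; }
}

public class SaveConfigurationHandler : IHandleMessages<SaveConfiguration>
{
    public async Task Handle(SaveConfiguration message, IMessageHandlerContext context)
    {
        // Persist the configuration even if it turns out to be unusable...

        var (isValid, reason) = Validate(message);

        if (isValid)
        {
            await context.Publish(new ConfigurationAccepted
            {
                ConfigurationId = message.ConfigurationId
            });
        }
        else
        {
            await context.Publish(new ConfigurationRejected
            {
                ConfigurationId = message.ConfigurationId,
                Reason = reason
            });
        }
    }

    // Placeholder validation logic.
    static (bool isValid, string reason) Validate(SaveConfiguration message)
        => (true, null);
}
```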
Yeah, it’s a good design, and it’s similar to what I was suggesting as a potential solution, only using sockets instead. I think this is probably a great solution for a more mature product, and we’ll likely move towards something like this before too long. Plus, I believe SignalR isn’t due out for Core until 2.1, so we’ve got some time. Our front-end guys are just getting Redux set up in there, so we’ll likely be able to leverage that for this at some point. In the meanwhile, I’ve removed the callback in favor of client-side validation.
So I have another scenario here where we don’t have control of the front end, so I don’t think WebSockets is an option. The scenario is Salesforce. The business wants to use it as a front end for provisioning SMS numbers. It can send the request to do so just fine, but it needs a response so that it can store the value in the account. Without a way to make callbacks work with autoscaling, it seems my only option is to push the data into SFDC out of band. This is still not ideal, however, because the human working in SFDC won’t be able to tell the customer what their SMS number is.
This is a larger design problem that we will eventually solve by moving that process into a system that we own, but in the meanwhile…