Pattern to avoid aggregating the same data multiple times

fcastells · April 2, 2019, 8:21pm

Let’s imagine we have an endpoint that collects Seller ratings (1 to 5 stars) for every order and publishes SellerRatedEvent, with SellerId, OrderId and Rating. There is another endpoint subscribed to SellerRatingEvent which aggregates the data per day, so that there is a row in a SQL db table which has SellerId, Date, NumberOfRatings and AggregatedRating, so that we can calculate the average rating per day.

The implementation of the SellerRatingEvent Handler is something like (pseudocode):

var aggregatedRating = unitOfWork.GetSellerRatingForToday(message.SellerId);
aggregatedRating.AddRating(message.Rating);
uow.Commit();

This works fine, except if there is a failure after committing the changes. For example, NSB fails sending the message to the Audit queue, because it’s full (real scenario). What will happen is that the event handler will be retried a number of times, aggregating every time the same rating.

What are good patterns to avoid aggregating duplicated data? What I can think of is:

Use outbox in the aggregation endpoind, so that the db transaction fails if the message handler doesn’t complete successfully
Store the ratings in a db table, transactionally with the aggregated data, and verify that the rating for that orderId is not there before aggregating it.

Any other options?

What I can see with the above approaches is that 1 is mainly putting technology on it and leaving the business logic very simple, whether in option 2, the deduplication is part of the actual feature implementation. I can see good and bad things in both approaches.

mauroservienti · April 3, 2019, 7:49am

My personal preference would be Outbox if there are no good business reasons to manually run deduplication (e.g. you want to be aware for business reasons about deduped messages).

Another technical solution would be to use an append only model at the aggregator endpoint and aggregate ratings at query time. In this case you could use the message ID in the append only model to run dedup logic, e.g. by simply having a unique index on the message id column, catching the exception, and discarding those messages.

.m

fcastells · April 3, 2019, 7:58pm

OK, it makes sense. I will try the Outbox approach. Although I like the concept of Outbox and I used in the past, for my current project I preferred not to use it and use Send Atomic with Receive plus idempotent handlers. So, for scenarios like this, I will try only enabling Outbox for the affected endpoint.

Does it seem a correct approach to you?

Regarding the append only, in my case it’s not desired, because I need running averages for windows of several months. What I came up with to avoid aggregating tens of thousands of ratings every time was to do aggregates per day, so at query time, I only have to aggregate one value per day.

mauroservienti · April 5, 2019, 1:24pm

Makes sense to me.

Given that I guess that ratings are not really required to be strictly consistent, another approach could be to use a saga to aggregate those. For example a saga that is time based and aggregate ratings on a daily basis.