Delay in event processing

Major incident · KS Core API, KS Connect API, app.saltoks.com
2025-02-10 06:30 UTC · 3 days, 17 minutes

Updates

Resolved

Good morning!

Thanks to our engineers’ efforts (very early) this morning, we’re now fully operational again, with all of our events back in place.

We will keep monitoring for a few hours, but everything appears to be running well, with no latency or other issues.

Please contact Support if you’re still seeing issues or have further questions.

February 13, 2025 · 06:42 UTC
Update

Quick update from our side: we didn’t proceed with the migration last night because the test run showed it would significantly increase latency unless we transferred old events very slowly, which could take days.

We will attempt another option that should allow us to repopulate old events in a shorter time span, hopefully by tomorrow morning. That would once again require integration partners to request a new continuation token.

We’re still trying to estimate how long that migration will take, and we may need to pause it before we hit peak hours again to ensure new events are not delayed. We will keep you posted.

February 12, 2025 · 11:08 UTC
Update

We’ve managed to catch up on all events, so there is currently no latency on real-time events coming in. Event lists currently show all events since 5AM on Feb 10th.

The rest of the (older) events will be migrated to the new CosmosDB after midnight CET today, because we want to avoid running this operation during peak hours.

The next update will be tomorrow morning, once we have completed the operation, and hopefully we can then finally close this incident.

February 11, 2025 · 16:58 UTC
Monitoring

We’ve now made all the changes, and the new CosmosDB is up and running and ingesting events. A large number of events are still waiting to be processed, and that backlog will take around 3 hours to clear, so customers may still experience some delays during that time.

We have noticed that, as events come in, they don’t always show up in the events list in chronological order, but a page refresh can fix that. Since more events are coming in and appearing on top of other events, this issue will be completely resolved once all the events are processed. We expect everything to normalise by 5PM CET today.
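For integration partners who want to picture why a refresh fixes the ordering, here is a minimal sketch. It assumes, hypothetically, that each event carries an occurrence timestamp and that a refreshed list is sorted by that timestamp, newest first. The `Event` type, its field names, and the `refresh` helper are illustrative only and are not taken from the Salto KS apps or APIs.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    # Hypothetical event shape; not the real Salto KS schema.
    event_id: str
    occurred_at: datetime


def refresh(events):
    """Return events newest-first by occurrence time, as a refreshed list would."""
    return sorted(events, key=lambda e: e.occurred_at, reverse=True)


# Backlogged events arrive after newer ones, so arrival order is not chronological.
arrival_order = [
    Event("c", datetime(2025, 2, 11, 12, 30, tzinfo=timezone.utc)),
    Event("a", datetime(2025, 2, 11, 9, 15, tzinfo=timezone.utc)),  # delayed
    Event("b", datetime(2025, 2, 11, 11, 0, tzinfo=timezone.utc)),  # delayed
]

ordered_ids = [e.event_id for e in refresh(arrival_order)]
print(ordered_ids)
```

Sorting by occurrence time rather than arrival time is what makes the result independent of how the backlog drains.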

At this time we are monitoring. The next step is to populate the new setup with events that are older than two days, so that everything can be shown together. Microsoft are now assisting us with doing this in the fastest and most straightforward way possible.

As usual, we will keep you posted until everything has been completely resolved.

February 11, 2025 · 13:20 UTC
Investigating

Hello all,

This is to notify you that we’ve already made most of the infrastructure and development changes and expect events to start showing up within the next hour. This applies to event lists in both the Salto KS app and the Larry Support app.

We will also add a banner in both apps to notify users that we’re experiencing issues with events and are actively working on them. That should reduce the number of customers contacting BUs and Support, which will give some relief.

The next update will follow once we have the fix in place, hopefully by 2PM CET.

February 11, 2025 · 11:47 UTC
Investigating

Good morning,

We’re changing the plan as the fix Microsoft delivered did not resolve the core issues with our CosmosDB. Our mitigation from yesterday is not optimal, and today we are choosing a different path in order to restore normal operations.

Our main priority right now is to recover real-time events and show event lists in all applications, so we’re going to spin up a new CosmosDB instance, start saving new events there, and serve them to the applications from it.

We will need a few hours for this, as it’s a fairly complex operation with several development teams involved.

That means we won’t be able to serve historical events older than two days during this time, but we will work on this in parallel, so you can expect all events to be recovered in the coming days.

Thank you for your patience. We understand customers’ frustration and are working hard to come up with an acceptable solution. We will send an update in two hours to let you know how the recovery effort is going.

February 11, 2025 · 09:02 UTC
Investigating

We’ve implemented the mitigation, and event streaming and analytics are now back on track. We can see the message throughput increasing and the event processing queue going down. It will take around 2 more hours for all the events to come through.

Microsoft have not come back to us with any meaningful information at this point. Until they resolve the CosmosDB issue, event lists will not be repopulated.

February 10, 2025 · 17:27 UTC
Investigating

Hello,

We’re expecting an update from the Microsoft CosmosDB team at around 5PM CET. In the meantime, we have prepared a mitigation that temporarily bypasses CosmosDB so that notifications can be sent out and event streaming keeps working. Analytics and reporting will also be restored.

Event lists will not be available until CosmosDB can be restored to a healthy state. All events since 08:50AM CET that are missing from the event lists will be restored once Microsoft make CosmosDB operational again.

We will keep you updated.

February 10, 2025 · 15:40 UTC
Update

We’ve discovered that the delay in event processing is due to issues with Microsoft Azure’s Cosmos DB. Customers may have trouble seeing their events from today.

We’re currently trying different options with their support in order to normalise operations. We will let you know when there is progress and once we’ve resolved the issue.

February 10, 2025 · 12:58 UTC
Investigating

We are experiencing delays in event processing, so events will take longer than expected to appear.

We are currently investigating.

This does not affect any access-related actions, and events will be processed once the delay is over.

February 10, 2025 · 09:22 UTC
