Presence Server Errors and Elevated Latency
Incident Report for PubNub
Postmortem

Problem Description, Impact, and Resolution 

On October 24, 2024 at 11:15 AM UTC, we observed elevated latency and errors in the Presence service across our global points of presence. Affected customers may have experienced slow responses to Presence requests, failed requests returning 5XX server errors, or both. After investigating, we identified the cause and blocked the source of the traffic responsible, and the issue was resolved on October 24, 2024 at 2:00 PM UTC. The issue occurred because our services did not autoscale appropriately in response to an unexpected spike in traffic from non-standard usage of the Presence service.

Mitigation Steps and Recommended Future Preventative Measures 

To prevent a similar issue from occurring in the future, we have addressed the source of the unexpected traffic spike directly and ensured that changes were made to align its usage with our prescribed methods for Presence. Additionally, over the coming days we are deploying sharding in the Presence infrastructure to improve scalability and better handle similar traffic surges, should they recur.

Posted Oct 25, 2024 - 19:22 UTC

Resolved
Beginning at around 11:00 UTC, we observed elevated latency and server errors for our Presence service across all of our server endpoints. The issue has been resolved as of 14:11 UTC. We will continue to monitor the incident to ensure service stability has been fully restored. Your trust is our top priority, and we are committed to ensuring smooth operations.
Posted Oct 24, 2024 - 14:58 UTC
Monitoring
We have taken remediation actions, and our engineers are monitoring the situation to confirm that stability is fully restored.
Posted Oct 24, 2024 - 14:11 UTC
Identified
We have successfully identified the issue, and our dedicated engineers are actively working to resolve it. We are seeing positive trends, with both latency and error rates improving significantly.
Posted Oct 24, 2024 - 13:46 UTC
Update
We are continuing to investigate this issue.
Posted Oct 24, 2024 - 13:22 UTC
Update
We are continuing to investigate this issue.
Posted Oct 24, 2024 - 12:37 UTC
Update
We are continuing to investigate this issue.
Posted Oct 24, 2024 - 12:11 UTC
Investigating
At about 11:00 AM UTC, the Presence service began experiencing elevated latency and server errors in all PoPs. PubNub technical staff are currently investigating, and more updates will follow as they become available.

If you are experiencing issues and believe them to be related to this incident, please report them to PubNub Support at support@pubnub.com.
Posted Oct 24, 2024 - 12:01 UTC
This incident affected: Realtime Network (Presence Service).