Elevated latency in our US-East PoP
Incident Report for PubNub
Postmortem

Problem Description, Impact, and Resolution

In our US-EAST-1 Point of Presence, we received an unusually large number of connections, which caused an upstream backlog within our internal network. This manifested as delays in subscription delivery for some clients. Our normal catch-up mechanism kept subscription deliveries functioning, with delays, throughout the incident. Using our manual operations tooling, we deployed mechanisms that alleviated the back pressure on connection creation, bringing the incident to full resolution.

Mitigation Steps and Recommended Future Preventative Measures

There are several areas we intend to improve. First, we identified a gap in our internal monitoring that hid some connection and channel creation activity from our system. This caused us to reach a resolution more slowly than expected. This gap will be resolved in the coming days. Additionally, we have automatic throttles that closely guard connection creation, and we are always trying to improve them. In this scenario, we believe the specific pattern of the problem kept the influx of connections underneath our rate limits. We are analyzing the exact pattern so that in the future our connection rate limiting will take more sophisticated usage patterns into account.
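
For readers unfamiliar with how an influx can stay underneath a rate limit, the sketch below may help. It is a generic token-bucket limiter, not a description of our actual throttles, and the class name, the limit of 10 connections per second, and the burst capacity of 20 are all hypothetical values chosen for illustration. It compares a sudden burst of connection attempts with the same number of attempts arriving steadily at exactly the permitted rate.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Generic token-bucket limiter (illustrative only, not PubNub's throttle)."""
    rate: float       # tokens refilled per second (hypothetical limit)
    capacity: float   # maximum burst size (hypothetical)
    tokens: float = 0.0
    last: float = 0.0

    def __post_init__(self) -> None:
        # Start with a full bucket so an initial burst up to `capacity` passes.
        self.tokens = self.capacity

    def allow(self, now: float) -> bool:
        """Return True if a connection arriving at time `now` is admitted."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def admitted(bucket: TokenBucket, arrival_times: list[float]) -> int:
    """Count how many arrivals the limiter lets through."""
    return sum(1 for t in arrival_times if bucket.allow(t))


# 100 connection attempts all at once: the limiter rejects most of them.
burst = admitted(TokenBucket(rate=10, capacity=20), [0.0] * 100)

# 100 attempts spaced 0.1 s apart (exactly 10 per second): none are rejected.
steady = admitted(TokenBucket(rate=10, capacity=20), [i * 0.1 for i in range(100)])

print(f"burst admitted:  {burst} of 100")
print(f"steady admitted: {steady} of 100")
```

In this sketch the burst is capped at the bucket capacity, while the steady stream is admitted in full. A limiter that only watches the instantaneous rate therefore cannot tell a sustained influx paced at the limit from ordinary traffic, which is the kind of blind spot a more pattern-aware rate limiter would need to address.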

Posted 8 months ago. Feb 05, 2019 - 19:44 UTC

Resolved
The root cause was identified, the mitigation actions were successful, and all latencies have returned to normal.
Posted 8 months ago. Feb 04, 2019 - 03:58 UTC
Monitoring
We were able to mitigate the connection issues that we were experiencing and will monitor closely over the next several hours until we are confident the issue is resolved.
Posted 8 months ago. Feb 04, 2019 - 03:18 UTC
Identified
We have identified a networking condition where connections are taking longer than expected. We are implementing multiple strategies to mitigate the impact.
Posted 8 months ago. Feb 04, 2019 - 02:57 UTC
Update
We currently understand the impact and are discussing the best actions to take to mitigate the latencies.
Posted 8 months ago. Feb 04, 2019 - 02:42 UTC
Investigating
We are currently experiencing elevated latencies in our US-East PoP and are investigating the root cause.
Posted 8 months ago. Feb 04, 2019 - 02:22 UTC
This incident affected: Points of Presence (North America Points of Presence) and Realtime Network (Publish/Subscribe Service, Storage and Playback Service).