Our faulty global POP remains out of service, but we have ample capacity. We are marking the public-facing incident as complete whilst our NOC team investigates a suspected network port failure, which is visible in their MRTG monitoring.
Posted Jul 06, 2022 - 14:51 PDT
With a faulty global POP removed, errors in all categories of traffic are nominal and approaching 0.0%.
A specialist network team is running diagnostics on UDP traffic at the affected location.
With a global POP out of service, we do not consider the incident closed, and our teams are continuing to investigate the root cause.
Updates on this issue may be less frequent now that service is largely restored by the applied mitigations and we move into deeper analysis of the root cause.
Posted Jul 06, 2022 - 14:14 PDT
A few minutes ago we applied a strong mitigation by removing one faulty global POP; this has halved the error rate, although we do not yet have a fix in place. We will now drain all traffic from this POP (until now we had drained only edge-to-global traffic) and expect the error rate to approach 0.0% within 3-5 minutes.
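As an illustrative sketch only (the POP names, weights, and selection logic below are invented, not our actual control plane), draining a POP can be thought of as zeroing its routing weight so that edge POPs stop selecting it for new traffic:

```python
import random

# Hypothetical weighted POP selection. Draining a POP sets its weight to
# zero, so it is never chosen for new traffic while existing flows wind down.
pop_weights = {
    "global-pop-a": 100,
    "global-pop-b": 100,
    "global-pop-c": 100,  # the faulty POP
}

def drain(pop):
    # A drained POP receives no new traffic.
    pop_weights[pop] = 0

def pick_pop():
    # Standard weighted random selection over the remaining weights.
    total = sum(pop_weights.values())
    r = random.uniform(0, total)
    for pop, weight in pop_weights.items():
        r -= weight
        if r <= 0:
            return pop
    # Fallback for floating-point edge cases: any POP with weight left.
    return next(p for p, w in pop_weights.items() if w > 0)

drain("global-pop-c")
# After the drain, the faulty POP is never selected.
assert all(pick_pop() != "global-pop-c" for _ in range(1000))
```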
Our working theory is that one of our global POPs is for some reason not reachable by our edge POPs, regardless of which DNS service is in use.
With fewer global POPs in service we have reduced our capacity, but we are closely monitoring capacity and have plenty of headroom to continue our investigation and provide a definitive fix.
Posted Jul 06, 2022 - 14:06 PDT
We are applying a fix intended as a mitigation by extending DNS caching lifetimes within our infrastructure. We expect to have data on the success of this mitigation in the next 10 minutes.
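As a minimal, hypothetical sketch of this kind of mitigation (the class, names, and TTL values are illustrative, not our actual resolver configuration): a resolver-side cache that enforces a TTL floor keeps records usable for longer when short upstream TTLs would otherwise force frequent re-resolution:

```python
import time

# Illustrative TTL floor, e.g. raised from a typical 60s during an incident.
TTL_FLOOR_SECONDS = 3600

class ExtendedTTLCache:
    """A DNS-style cache that extends short upstream TTLs up to a floor."""

    def __init__(self, ttl_floor=TTL_FLOOR_SECONDS):
        self.ttl_floor = ttl_floor
        self._entries = {}  # name -> (address, expires_at)

    def put(self, name, address, upstream_ttl):
        # Records with short authoritative TTLs are kept for at least the floor.
        effective_ttl = max(upstream_ttl, self.ttl_floor)
        self._entries[name] = (address, time.monotonic() + effective_ttl)

    def get(self, name):
        entry = self._entries.get(name)
        if entry is None:
            return None
        address, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._entries[name]
            return None
        return address

cache = ExtendedTTLCache()
# An upstream TTL of 60s is extended to the 3600s floor.
cache.put("global-pop.example.internal", "203.0.113.10", upstream_ttl=60)
print(cache.get("global-pop.example.internal"))  # -> 203.0.113.10
```

The trade-off is the usual one: longer cache lifetimes reduce dependence on the upstream DNS service, at the cost of slower propagation of legitimate record changes.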
Posted Jul 06, 2022 - 13:50 PDT
We still consider the issue to be DNS, affecting ~1.5% of traffic (a higher percentage of cache misses is affected, depending on the domain); however, switching the DNS provider did not improve the overall error rate.
The errors are more pronounced between our edge and global POPs, but elevated error rates are also visible between our global POPs and out to the wider WAN.
We are presently proceeding on two fronts, with independent teams checking the network and DNS in depth to find a root cause.
Next update will follow in ~15 minutes.
Posted Jul 06, 2022 - 13:35 PDT
The DNS issue is confirmed, and the switch to another provider has been undertaken. We need to collect a few minutes of logs to measure the impact of the change; we will then provide a more comprehensive update and, we hope, an estimated time to resolution.
Posted Jul 06, 2022 - 13:18 PDT
The total rate of errors is still approximately 1.5% across the board. Traffic from the decommissioned POPs moved to other locations as anticipated; however, errors in those locations increased in line with the new traffic they were receiving.
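To illustrate why the aggregate rate can stay flat when traffic moves (the numbers below are invented, not our real metrics): if errors follow the traffic rather than being tied to a location, redistributing requests between POPs leaves the overall error rate unchanged:

```python
# Invented request/error counts illustrating the observation above.
def aggregate_error_rate(pops):
    # pops maps POP name -> (requests, errors)
    total_requests = sum(requests for requests, _ in pops.values())
    total_errors = sum(errors for _, errors in pops.values())
    return total_errors / total_requests

before = {"pop-a": (1_000_000, 15_000), "pop-b": (1_000_000, 15_000)}
# pop-a is taken out of service; its traffic (and its errors) move to pop-b.
after = {"pop-b": (2_000_000, 30_000)}

print(aggregate_error_rate(before))  # 0.015 (~1.5%)
print(aggregate_error_rate(after))   # 0.015 -- unchanged, as observed
```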
We believe we have isolated the issue to a particular DNS service and are switching to another provider imminently.
The next status update will follow in ~10 minutes.
In parallel, we are seeking resolution via new DNS services while our network team tries to isolate whether the problem lies with the DNS service itself or with the network path to that service.
Posted Jul 06, 2022 - 13:05 PDT
Two POPs have been identified as the source of most of the errors. We have restarted the affected service at one of the POPs and are preparing to take the other out of service. We are closely examining the recovery metrics of the restarted POP.
Posted Jul 06, 2022 - 12:46 PDT
Incident upgraded to "Partial Outage". Our initial research indicates a DNS failure localized to the US-East region.
Approximately 1.5% of global traffic is affected. Investigation is continuing apace, and more engineering resources are being added to the incident response.
Posted Jul 06, 2022 - 12:34 PDT
We're experiencing an elevated level of API errors and are currently looking into the issue.