Analysis of the disruption since the fix was deployed has proven definitive; we are confident that the issue is resolved.
Across all categories of traffic, the mean error rate rose to 0.15%; for cache misses, the mean error rate was 0.53%. Both figures are based on logs for 2022-07-21 between 05:00 PDT and 12:00 PDT.
Errors were clustered around the top of each hour. The fault manifested after approximately 60 minutes of otherwise healthy operation of a POP, with some variation depending on the size of the POP and the volume of traffic it was serving.
Detailed information about error rates (in particular “HTTP 503 Service Unavailable”) is available via the Layer0 console.
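For customers who want to run a similar breakdown on their own logs, the following is a minimal sketch only. It is not the analysis used for the figures above. It assumes a hypothetical record format of `(timestamp, status, cache_hit)` tuples, which does not correspond to any specific Layer0 log schema, and it treats any 5xx status as an error. The function names are illustrative.

```python
from collections import Counter
from datetime import datetime


def is_error(status):
    """Treat any 5xx response as an error (an assumption for this sketch)."""
    return 500 <= status <= 599


def error_rate(records):
    """Return the fraction of records that are errors (0.0 if there are none)."""
    records = list(records)
    if not records:
        return 0.0
    errors = sum(1 for _, status, _ in records if is_error(status))
    return errors / len(records)


def errors_by_minute_of_hour(records):
    """Count errors per minute-of-hour (0-59) to show top-of-hour clustering."""
    counts = Counter()
    for timestamp, status, _ in records:
        if is_error(status):
            counts[timestamp.minute] += 1
    return counts


# Hypothetical sample records: (timestamp, HTTP status, served from cache?)
sample = [
    (datetime(2022, 7, 21, 6, 0, 5), 503, False),
    (datetime(2022, 7, 21, 6, 0, 40), 200, True),
    (datetime(2022, 7, 21, 6, 30, 0), 200, False),
    (datetime(2022, 7, 21, 7, 0, 10), 503, False),
]

overall = error_rate(sample)
cache_miss_only = error_rate(r for r in sample if not r[2])
clustering = errors_by_minute_of_hour(sample)
```

Comparing `overall` with `cache_miss_only` shows whether cache misses are disproportionately affected, and a `clustering` count concentrated at minute 0 corresponds to the top-of-hour pattern described above.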
Posted Jul 21, 2022 - 12:20 PDT
A fix has been implemented and is currently being rolled out to all POPs following a successful test on an isolated POP.
Posted Jul 21, 2022 - 10:39 PDT
The first patched machines are coming into service now; we will continue to provide updates as we patch all of our infrastructure. The investigation into impacted customer traffic continues to indicate low overall disruption, localized to a few minutes around the top of each hour. Certain customers may see outsized impact, especially for cache misses and highly re-entrant projects; more detailed reports will follow.
Posted Jul 21, 2022 - 10:11 PDT
The issue has been confirmed, as previously communicated, to be a TCP driver bug introduced in the kernel. Our SRE team are testing downgrades of the affected packages and putting patched machines into service. We expect this to be the last phase of the disruption. The next update should include our expected time to resolution. In parallel, our teams are analysing the impact on customer traffic; we hope to be able to report on this shortly.
Posted Jul 21, 2022 - 09:54 PDT
We are still working on rolling out a fix.
Posted Jul 21, 2022 - 09:40 PDT
We continue to work on rolling out a fix.
Posted Jul 21, 2022 - 09:25 PDT
We continue to experience increased HTTP errors in the US East region, occurring for 1 minute every hour, due to a newly reported bug in the Linux kernel. We are currently rolling back to a prior version to mitigate the impact.
Posted Jul 21, 2022 - 09:10 PDT
We are still investigating the issue.
Posted Jul 21, 2022 - 06:50 PDT
We're experiencing an elevated level of HTTP errors on one of the POPs in the US East region and are currently looking into the issue.