On March 27, 2018, at approximately 11:30 a.m. Eastern Daylight Time (EDT), an issue with our DNS management system pushed invalid records to our DNS service providers. As a result, sites using our name servers experienced intermittent resolution problems. This affected nexcess.net, the Client Portal, and other internal domains.
Earlier that morning, at 10:28 a.m., our internal monitoring identified an issue with the database servers driving some of our internal systems. Our System Operations team was notified and quickly resolved the underlying issue. However, during this time the records we use for DNS synchronization were malformed, and the problem went undetected.
At 11:55 a.m., it appeared that one of our DNS providers, Amazon Route 53, was experiencing intermittent problems with record resolution. Pending further testing, we disabled Route 53 and pushed all traffic to Dyn, our other DNS provider. Dyn soon experienced similar intermittent resolution problems, and we isolated the root cause as the aforementioned invalid record synchronization.
Over the next hour, we worked to manually restore DNS services to both providers from backups. This effort was complicated by Dyn identifying the increased traffic as anomalous and limiting our rate of DNS queries. Service was fully restored by 1:25 p.m.
Remediation efforts are ongoing, but we have already redesigned, tested, and released changes to our internal systems so that they detect errant records and zone files before publishing them globally. Further changes to our database infrastructure will increase availability and provide cleaner failure detection, helping to ensure zone integrity before synchronization to our DNS providers. We have also contacted Dyn to verify our capacity and the controls currently in place on our account, and we will continue to review their relevance to the above events.
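As a rough illustration of the kind of pre-publish check described above, the following Python sketch validates a zone's records before they would be handed to a DNS provider. It is not our production code: the `Record` shape, the function names, and the `max_shrink` threshold are all hypothetical and chosen only for this example.

```python
import ipaddress
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical record shape, assumed for this sketch only.
@dataclass(frozen=True)
class Record:
    name: str    # owner name, e.g. "www.example.com."
    rtype: str   # record type, e.g. "A", "AAAA", "CNAME", "NS", "SOA"
    value: str   # record data as text
    ttl: int     # time to live, in seconds

# One DNS label: 1-63 characters, no leading or trailing hyphen.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9_-]{1,63}(?<!-)")

def is_valid_hostname(name: str) -> bool:
    """Return True if name looks like a well-formed domain name."""
    if name.endswith("."):
        name = name[:-1]          # drop a single trailing root dot
    if not name or len(name) > 253:
        return False
    labels = name.split(".")
    if labels[0] == "*":          # allow a leading wildcard label
        labels = labels[1:]
    return bool(labels) and all(_LABEL.fullmatch(label) for label in labels)

def validate_record(rec: Record) -> List[str]:
    """Return a list of human-readable problems found in one record."""
    errors = []
    if not is_valid_hostname(rec.name):
        errors.append(f"{rec.name!r}: invalid owner name")
    if not 0 <= rec.ttl <= 2**31 - 1:
        errors.append(f"{rec.name!r}: TTL {rec.ttl} out of range")
    if rec.rtype == "A":
        try:
            ipaddress.IPv4Address(rec.value)
        except ValueError:
            errors.append(f"{rec.name!r}: bad IPv4 address {rec.value!r}")
    elif rec.rtype == "AAAA":
        try:
            ipaddress.IPv6Address(rec.value)
        except ValueError:
            errors.append(f"{rec.name!r}: bad IPv6 address {rec.value!r}")
    elif rec.rtype in ("CNAME", "NS"):
        if not is_valid_hostname(rec.value):
            errors.append(f"{rec.name!r}: bad {rec.rtype} target {rec.value!r}")
    return errors

def validate_zone(records: List[Record],
                  previous_count: Optional[int] = None,
                  max_shrink: float = 0.5) -> List[str]:
    """Validate a whole zone; an empty result means it may be published."""
    errors = []
    for rec in records:
        errors.extend(validate_record(rec))
    types = {rec.rtype for rec in records}
    if "SOA" not in types:
        errors.append("zone has no SOA record")
    if "NS" not in types:
        errors.append("zone has no NS record")
    # Guard against a truncated export: refuse a zone that shrank sharply.
    if previous_count and len(records) < previous_count * (1 - max_shrink):
        errors.append(
            f"zone shrank from {previous_count} to {len(records)} records"
        )
    return errors
```

In a sketch like this, the synchronization job would call `validate_zone` on each zone and skip publishing whenever the returned list is non-empty, leaving the last known-good records in place at the provider.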
We apologize for any inconvenience and thank you for your patience while we work to improve our systems.