Starting December 28, 2020 at around 19:38 UTC, our upstream storage provider reported DNS resolution issues with its domain names. We first noticed the problem as erratic performance when accessing the storage drives over the network, around 23:30 UTC.
According to Wasabi, the issue arose because someone had uploaded malicious content to the service, and the relevant domain names were subsequently blacklisted by DNS providers. You can find the full incident report published by Wasabi here.
We had not anticipated this failure mode, so we had nothing in place to handle it automatically. Luckily for us, resolution wasn't completely broken, only mostly broken. We set up a script to continuously poll their domain name until we finally got a hit on an IP address. We then quickly checked against a dummy drive whether the IP we got actually led to our data, or was just a faulty resolution.
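The polling step can be sketched roughly as follows. This is an illustrative reconstruction, not our production script: the retry interval and the injectable `resolve` callable are assumptions added so the logic is testable.

```python
import socket
import time

def poll_until_resolved(hostname, interval=5.0, max_attempts=None,
                        resolve=socket.gethostbyname):
    """Repeatedly try to resolve `hostname`; return the first IP we get.

    `resolve` defaults to a real DNS lookup, but can be swapped for a
    stub. Returns None if `max_attempts` is exhausted without a hit.
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        try:
            return resolve(hostname)
        except socket.gaierror:
            # Lookup failed; wait and try again.
            attempt += 1
            time.sleep(interval)
    return None
```

In practice the returned IP still has to be verified (as we did with a dummy drive), since an intermittently failing resolver can also hand back stale or wrong answers.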
We hard-coded this IP address into our server's local hosts file as a workaround, and remounted the affected drives. As a result, we were able to mitigate the upstream issue, which lasted until 17:57 UTC on 29 December 2020 (22 hours, 19 minutes). Our services experienced degraded performance for just under 5 hours, and the mitigation was applied within an hour of noticing the performance drops.
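The hosts-file pin looked something along these lines (the IP shown is a documentation-range placeholder, and the hostname is assumed for illustration):

```
# /etc/hosts — pin the polled IP so lookups bypass the failing upstream DNS
203.0.113.7  s3.wasabisys.com
```

Entries in the hosts file take precedence over DNS for most resolvers, so once the line was in place, remounting the drives went through the pinned address. The obvious caveat is that a hard-coded IP silently goes stale if the provider later moves the endpoint, so such a pin should be removed once upstream resolution recovers.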
This incident has motivated us to consider a fallback DNS resolution mechanism that returns the last known good value upon an upstream failure. It has also prompted us to explore mirroring our data to an alternative provider, so we can be more resilient against a drawn-out upstream issue.
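A minimal sketch of that fallback idea, assuming a simple in-process cache (the class name and injectable `resolve` callable are our own illustration, not an existing library API):

```python
import socket

class LastKnownGoodResolver:
    """Resolve hostnames, falling back to the last successful answer
    when the upstream lookup fails."""

    def __init__(self, resolve=socket.gethostbyname):
        self._resolve = resolve
        self._cache = {}

    def lookup(self, hostname):
        try:
            ip = self._resolve(hostname)
        except socket.gaierror:
            cached = self._cache.get(hostname)
            if cached is None:
                raise  # never resolved successfully; nothing to fall back to
            return cached  # serve the last known good value
        self._cache[hostname] = ip  # remember the good answer
        return ip
```

A production version would persist the cache across restarts and cap how long a stale answer may be served, but even this shape would have bridged an outage like the one above.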