Over the weekend, file syncing and backup service Dropbox suffered an extensive outage. While initial reports suggested a denial-of-service attack, the actual cause was rather more prosaic: a failure in a script designed to automatically update operating systems on its machines.
In a post on the Dropbox blog, head of infrastructure Akhil Gupta explained what went wrong:
On Friday at 5:30 PM PT, we had a planned maintenance scheduled to upgrade the OS on some of our machines. During this process, the upgrade script checks to make sure there is no active data on the machine before installing the new OS. A subtle bug in the script caused the command to reinstall a small number of active machines. Unfortunately, some master-slave pairs were impacted which resulted in the site going down.
The lesson everyone can learn?
When running infrastructure at large scale, the standard practice of running multiple slaves provides redundancy. However, should those slaves fail, the only option is to restore from backup. The standard tool used to recover MySQL data from backups is slow when dealing with large data sets.
Gupta also said that Dropbox plans to open source a tool for "parallelising the replay of binary logs", which speeds up the process of restoring large MySQL data sets from backup. We await it with interest.
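The intuition behind parallelised replay is that binary-log events touching different databases are independent, so each database's stream can be applied concurrently while preserving order within each stream. A minimal, hypothetical sketch of that idea in Python follows; the event format, `replay_event` placeholder, and the per-database independence assumption are ours for illustration, not details of Dropbox's tool:

```python
# Hypothetical sketch: replay binary-log events in parallel, grouped by
# database. Order is preserved within each database's event stream, but
# different databases are replayed concurrently (the real workload would
# be I/O-bound against a MySQL server, so threads are a reasonable fit).
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def replay_event(event):
    # Placeholder for applying one binlog event to the recovering server.
    return f"applied {event['db']}:{event['stmt']}"


def replay_group(events):
    # Events within one database must be applied strictly in order.
    return [replay_event(e) for e in events]


def parallel_replay(events, workers=4):
    # Group events by database; groups are assumed independent.
    by_db = defaultdict(list)
    for ev in events:
        by_db[ev["db"]].append(ev)

    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {db: pool.submit(replay_group, evs)
                   for db, evs in by_db.items()}
        for db, fut in futures.items():
            results[db] = fut.result()
    return results


events = [
    {"db": "users", "stmt": "INSERT ..."},
    {"db": "files", "stmt": "UPDATE ..."},
    {"db": "users", "stmt": "DELETE ..."},
]
print(parallel_replay(events))
```

A serial replay would walk the whole log in one pass; the sketch above instead pays only for the longest single database's stream, which is where the speed-up on large, multi-database data sets would come from.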
Outage post-mortem [Dropbox Tech Blog]