UESPWiki:Database Outage (July 2009)

This article discusses the site's database outage that occurred during July 2009, along with steps that can be taken to prevent or reduce downtime if or when a similar outage occurs again.

Outage Summary

  • The first outage occurred on July 6th.
  • Nephele and Rpeh exchanged emails with iWeb8 support, and db1 was eventually restarted on July 8th.
  • The second outage occurred on July 12th and was restored by Daveh restarting the server.
  • The third outage occurred on July 20th.
  • After the third outage iWeb8 reported the cause to be due to a bad hard disk.
  • The fourth outage occurred on August 1st and was restored by Daveh restarting the server.
  • The hard drive was scheduled for replacement on August 1st.
  • The hard drive was replaced at 9am on August 3rd.
  • The attempt to restore db1 was begun at 10am on August 3rd. It was immediately noticed that the old hard drive was not connected or mounted.
  • A support request to iWeb8 was answered around 11:30am, in which they reported that the old drive was smoking and not recoverable.
  • Downloading of the offsite backup from the morning of August 2nd was begun. Download speed was limited to under 1Mbps by Daveh's home DSL connection.
  • Downloading of the db1 backups finally completed around 3:30pm on August 3rd. During the download, MySQL was installed and set up.
  • The site was back online around 4pm August 3rd.
  • All database site content added/changed between about 4am August 2nd and 9am August 3rd was lost.
  • Overall downtime for July was approximately 4 days.

Outage Causes

Two underlying causes account for the database outage and the length of time required to resolve it:

  1. Hard drive failure on db1
  2. Both server admins inaccessible during outage (one injured, one on vacation/work trips)

Good Things

The following things were in place and helped keep the outage from becoming a complete disaster:

  • Offsite backups saved the data. If there had been no recent offsite backups, the only option after the old hard drive started smoking would have been to contact a data recovery company and cross our fingers.
  • Having multiple admins capable of server maintenance helped reduce the recovery time. Even though one admin was injured, she was able to restore the db1 server while the other admin was inaccessible.

Preventing and Mitigating

The following are ideas on preventing or mitigating site downtime if or when a similar failure occurs again:

  • Hard drive failures, while rare, are essentially unpreventable. Certain RAID configurations might have been able to mitigate the damage (need to investigate the technical details/limitations and the cost to implement in both time and money).
  • Add more admins to the iWeb8 server contact list. This will allow them to submit technical requests without anyone at iWeb8 questioning their authority to do so.
  • More admins capable and willing to be server administrators (at least in a backup role for limited actions).
  • The ability for the site to display cached pages if the database is inaccessible. This could be handled by either the Squid cache or the MediaWiki file cache, if at all possible.
  • Add a db2 server as a slave to db1, which could be used both to back up the live database and to serve database requests should db1 go down. Currently backup1 can mirror the database and act as an offsite backup, but cannot actually serve as a mirror database.
  • Make it possible for any admin to access offsite backups. Currently only Daveh has access to the backup1 machine, and only when not travelling.
  • Server monitoring that allows anyone to check the status of the UESP servers (see the sketch after this list).
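
As a stopgap until a full monitoring package such as Nagios or Zabbix is set up, even a small script run regularly from an outside machine could tell the admins when a server stops answering. The following is only a rough sketch: the host names and ports are assumptions and would need to be replaced with the real UESP server addresses.

 #!/usr/bin/env python
 """Minimal availability check for the UESP servers (sketch only)."""
 import socket
 
 # Hypothetical host/port pairs -- replace with the real server addresses.
 SERVERS = [
     ("content1.uesp.net", 80),   # web server
     ("db1.uesp.net", 3306),      # MySQL
     ("backup1.uesp.net", 22),    # offsite backup machine (SSH)
 ]
 
 def check(host, port, timeout=10):
     """Return True if a TCP connection to host:port succeeds."""
     try:
         sock = socket.create_connection((host, port), timeout)
         sock.close()
         return True
     except (socket.error, socket.timeout):
         return False
 
 if __name__ == "__main__":
     for host, port in SERVERS:
         print("%s:%d %s" % (host, port, "OK" if check(host, port) else "DOWN"))

Run from cron, the DOWN lines could be mailed to the admins; the script would be retired once one of the real monitoring packages is in place.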

Current Plan and Status

The following is the current plan for improving the site's backup and recovery strategy along with its implementation status:

Item: Add admins to the iWeb8 server contact list
Status:
  • Added Nephele to the list

Item: Investigate the feasibility of using RAID on db1
Status:
  • RAID1 uses disk mirroring, which is the simplest setup that would benefit us in this type of failure (it would keep operating after one disk failed).
  • Software RAID1 is free but requires another hard drive: 160GB SATA2 @ $15/month ($180/year), 320GB SATA2 @ $20/month ($240/year)
  • Hardware RAID1 is an additional $40/month ($480/year)

Item: Determine whether cached content can be displayed at all if the database is inaccessible
Status: -

Item: Look into a db2 purchase
Status:
  • A 24-month server would be $1242 (2GB RAM, 300GB HDD, Celeron E Dual Core)
  • Could also use content3

Item: Give admins access to offsite backups
Status: -

Item: Look into and set up server monitoring for all UESP servers
Status:
  • Many free monitoring options: Nagios, Zabbix, Hyperic, Groundwork

Item: Set up servers to allow backup/alternate server admins to perform certain tasks
Status: -

Item: Find more server admins willing/capable
Status: -

Item: Look into db1 backups using binlogs and slave backups (no daily downtime needed for backups; see the sketch below)
Status: -

Item: Document the backup and recovery setup and procedure (useful for new or backup/alternate admins)
Status: -

Item: See how much of the db1 recovery could have been done by iWeb8 support (and how much it might have cost)
Status: -
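
Regarding the binlog/slave backup item above: with binary logging (log-bin) enabled in my.cnf and the wiki tables on InnoDB, mysqldump can take a consistent snapshot while db1 stays online and record the binlog position inside the dump, so no daily downtime is needed and binlogs written after the dump can be replayed for point-in-time recovery. The following is only a rough sketch under those assumptions; the user name, password, and backup path are placeholders, not the real UESP values.

 #!/usr/bin/env python
 """Sketch of a no-downtime db1 dump (assumes InnoDB tables and log-bin enabled)."""
 import gzip
 import subprocess
 import time
 
 # Placeholder path and credentials -- not the real UESP values.
 backup_file = "/backup/db1-%s.sql.gz" % time.strftime("%Y%m%d-%H%M")
 cmd = [
     "mysqldump",
     "--user=backup", "--password=CHANGEME",
     "--single-transaction",   # consistent InnoDB snapshot without a global lock
     "--master-data=2",        # write the binlog file/position into the dump as a comment
     "--all-databases",
 ]
 
 dump = subprocess.Popen(cmd, stdout=subprocess.PIPE)
 out = gzip.open(backup_file, "wb")
 for chunk in iter(lambda: dump.stdout.read(65536), b""):
     out.write(chunk)
 out.close()
 if dump.wait() != 0:
     raise RuntimeError("mysqldump failed")
 print("wrote %s" % backup_file)

The gzipped dump, plus the binlogs written after it, could then be copied to backup1 or another offsite machine, narrowing the window of lost edits from roughly a day to however often the binlogs are shipped.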