I'm with you. Problems and their consequences are often not entirely avoidable. But you can mitigate outages by preparing for them (and spending money on that preparation).
Assuming they're using a SAN from a vendor like NetApp, IBM, EMC, or Hitachi, they should have a support contract. FFS, when just a disk dies on one of my SANs, the vendor is willing to send a warm body to swap the SOB. And a good support contract also means parts are on the way before you even have to ask. If it really hit the fan, there would be a team of people driving gear to one of my facilities, pulling it out of labs at their corporate HQ, or otherwise doing whatever they had to for a return to service.
Given the nature of the outage (not just customer data: their DNS servers appear to be offline too, so there are no mailhost/MX records), it looks like they stored everything on the SAN (bad move #1), had no secondary site (bad move #2), had no disaster recovery (DR) plan (stupid stupid move #1), and basically put zero thought into recovering from a worst-case scenario.
I'd still put $5 on "everything lives in virtual machines and every virtual machine lives on one SAN". Derp.
I'd agree that nearly everything is probably running from VMs on shared storage. But they alluded to a cascading hardware failure and also to having to do restores. It could be that their chassis cooling failed, causing several disks to die off - more than the RAID level they use could tolerate - necessitating a complete restore once they get spare equipment. It could have been the data center cooling too. I'm just spitballing based on the limited info we have. If it were just a failed SAN switch, a restore would not be required.
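To put rough numbers on the "more disks than the RAID level could tolerate" guess, here is a minimal Python sketch. The table and the function name are mine, purely for illustration; we have no idea what RAID level they actually run. It only covers levels where the tolerated failure count is fixed.

```python
# Simultaneous disk failures each RAID level survives within one array.
# Illustrative table only; nested levels like RAID 10 are left out because
# their tolerance depends on which disks fail.
TOLERATED_FAILURES = {
    "RAID0": 0,  # striping, no redundancy
    "RAID1": 1,  # two-disk mirror
    "RAID5": 1,  # single parity
    "RAID6": 2,  # double parity
}

def needs_full_restore(raid_level: str, failed_disks: int) -> bool:
    """Return True when more disks failed than the array can survive."""
    return failed_disks > TOLERATED_FAILURES[raid_level]

# Two dead disks: RAID 6 keeps running degraded, RAID 5 has lost the array.
print(needs_full_restore("RAID6", 2))
print(needs_full_restore("RAID5", 2))
```

Once the failure count crosses that line, no amount of spare disks brings the array back; you are restoring from backup, which matches what they described.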
I am a believer in having your own DNS hosted by a 3rd party. At least that way, when things go tango uniform in your data center, you can quickly point A records at a backup site to give clients status updates. This is a lot faster than changing your name servers at the registrar and waiting for propagation, just to give status updates or receive mail. It also allows you to always publish a lower priority MX record (that is, one with a higher preference number) pointing at a backup store-and-forward mail service, to be used when you have problems like this.
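For anyone unsure how the backup MX gets picked, here is a rough Python sketch of the ordering a sending mail server applies. The hostnames and the `is_up` check are made up for the example, and real MTAs also randomize among records of equal preference, which this skips.

```python
from typing import Callable, List, Optional, Tuple

# (preference, hostname); a LOWER preference number is tried FIRST.
MxRecord = Tuple[int, str]

def delivery_order(records: List[MxRecord]) -> List[str]:
    """Hosts in the order a sender would try them."""
    return [host for _, host in sorted(records)]

def first_reachable(records: List[MxRecord],
                    is_up: Callable[[str], bool]) -> Optional[str]:
    """First host in preference order that accepts connections, else None."""
    for host in delivery_order(records):
        if is_up(host):
            return host
    return None

# Hypothetical records: primary in your data center, backup at a 3rd party.
records = [(20, "backup-mx.example.net"), (10, "mail.example.com")]

# Simulate the data center being down: only the backup answers.
print(first_reachable(records, lambda host: host != "mail.example.com"))
```

With the primary unreachable, mail lands at the backup service and is held there until the primary returns, instead of sitting in senders' queues or bouncing.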
They should also revive their Twitter account; it is a great way to give status updates.