There is no question in my mind that this problem had been brewing for possibly several weeks. At least one drive was probably already failing, and from what I remember reading about the heat issues, it was unfortunately just a matter of time. I am certainly not blaming anyone; it is just very unfortunate. I noticed the site being slow for at least a week before the crash, though that could have been coincidental.
It was the job of the hosting company to make sure things like this don't happen. System failures this massive do not happen suddenly overnight; there has to have been some chain of events. In my experience with NOCs, this usually points to a failure of the system managers to monitor the health of the system. Most NOCs monitor only basic functionality and do not look closely at the hardware.
We had a drive failure where I work (a school district) nearly a year ago, over the Christmas break. It was a similar situation, with a failed drive in the SAN. The array was rebuilding onto a hot spare while the vendor overnighted a replacement drive.
Of course, during the holidays, overnight shipping does not necessarily mean 'overnight'. By the time the replacement drive arrived two days later, the rebuild had already failed because of another drive failure, this time the parity drive for the original failed pair.
Between waiting for the second replacement to arrive and the mismatch between our holiday schedule and the vendor's, restoring the hardware and the data took a little over two weeks.
Of course, we are a minor customer, with only 17 or 18 VM servers running on the SAN. I am sure larger customers receive better service.