Date: Fri, 15 Jun 2007 12:15:39 -0400
While we do post outage reports to the ICIT outages page, they are very short and to the point. I
thought it would be helpful to give more details about today's major outage and outline some of
the lessons learned.
There was a ConEd power surge this morning at 6:07 am. It knocked out some of the power at Hunter
and elsewhere around the city. Unfortunately, the power in the machine room went down and the
back-up generator system did not kick in as it should. The component that failed was the transfer
switch, which automatically senses a power outage and changes the power feed from the line to
the generator. Our UPS did immediately take over the electrical supply, but it is only capable of
covering the time it takes for the transfer switch to route the power from the generator. After
about 10 minutes the UPS ran out of power, and since the generator was not activated, all of our
systems crashed, including the network switches. Power came back on at 6:28, and systems
started to reboot. The network switches rebooted and the network came back about 7:00 am. The
Proofpoint server also rebooted immediately and started queueing email. Unfortunately, the
servers for email and the web did not come back up for different reasons. The email server
rebooted faster than the storage system and hung waiting for the storage system. A switch into
the main web server blew out when the outage occurred. Our staff was in very soon after the
problems appeared, but it took some time to get all of the systems up and to troubleshoot
specific problems. Many systems needed attention. Everything was back up by about 11:00 am.
No email was lost during this outage; email was queued by Proofpoint as soon as the Proofpoint
server came up. Email that was sent between the start of the power failure and the
time the Proofpoint server came up was bounced, but the period of time was short enough
that it would have been resent by the originating mailserver. Thanks go to our staff who got
everything going. This was the first outage for some of our systems, and the staff got to the root of the
problems and solved them quickly.
Lessons learned: It is clear that the transfer switch needs to be repaired immediately. The backup
power systems are tested monthly, but we need to work with Hunter Facilities Management to
carefully monitor what happens in between. Our staff learned about ways to deal with problems
when our new systems crash, which should make future outages (which we hope do not happen)
shorter. We will also look at the possibility of getting more battery power to keep the UPS systems
working longer.