ICIT-ANNOUNCE-L Archives

June 2007

ICIT-ANNOUNCE-L@HUNTER.LISTSERV.CUNY.EDU

Subject:
From:
Frank Steen <[log in to unmask]>
Reply To:
Frank Steen <[log in to unmask]>
Date:
Fri, 15 Jun 2007 12:15:39 -0400
While we do post outage reports to the ICIT outages page, they are very short and to the point. I 
thought it would be helpful to give more details about today's major outage and outline some of 
the lessons learned. 

There was a ConEd power surge this morning at 6:07 am. It knocked out some of the power at Hunter 
and elsewhere around the city. Unfortunately, the power in the machine room went down and the 
back-up generator system did not kick in as it should. The component that failed was the transfer 
switch, which automatically senses a power outage and changes the power feed from the line to 
the generator. Our UPS did immediately take over the electrical supply, but it is only capable of 
covering the time it takes for the transfer switch to route the power from the generator. After 
about 10 minutes the UPS ran out of power, and since the generator was not activated, all of our 
systems crashed, including the network switches. Power came back on at 6:28, and systems 
started to reboot. The network switches rebooted and the network came back about 7:00 am. The 
Proofpoint server also rebooted immediately and started queueing email. Unfortunately, the 
servers for email and the web did not come back up for different reasons. The email server 
rebooted faster than the storage system and hung waiting for the storage system. A switch into 
the main web server blew out when the outage occurred. Our staff was in very soon after the 
problems appeared, but it took some time to get all of the systems up and to trouble-shoot 
specific problems. Many systems needed attention. Everything was back up by about 11:00 am.  
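
The email server hang described above is a boot-ordering problem: one system came up before the storage it depends on. A minimal sketch of one common mitigation follows, assuming a service can poll for its dependency and give up after a deadline instead of hanging indefinitely. The function name, the probe, and the timeout values are hypothetical illustrations and do not describe how the Hunter systems are actually configured.

```python
import time
from typing import Callable


def wait_for_dependency(
    is_ready: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll a readiness probe until it succeeds or the deadline passes.

    is_ready is a hypothetical probe (for example, "is the storage
    volume mounted?"). Returns True if the dependency became ready,
    False if the deadline passed first.
    """
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated probe: the "storage system" becomes ready on the third check.
    attempts = {"count": 0}

    def storage_ready() -> bool:
        attempts["count"] += 1
        return attempts["count"] >= 3

    ok = wait_for_dependency(storage_ready, timeout_s=10.0, interval_s=0.1)
    print("storage ready:", ok, "after", attempts["count"], "checks")
```

With a bounded wait like this, a service that starts too early either proceeds once its dependency is ready or fails with a clear result that staff can act on.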

No email was lost during this outage; email was queued by Proofpoint as soon as the Proofpoint 
server came up. Email that was sent between the start of the power failure and the 
time the Proofpoint server came up was bounced, but the period was short enough 
that it would have been resent by the originating mailserver.  Thanks go to our staff who got 
everything going. This was the first outage for some of our systems, and the staff got to the root 
of the problems and solved them quickly.

Lessons learned: It is clear that the transfer switch needs to be repaired immediately. The backup 
power systems are tested monthly, but we need to work with Hunter Facilities Management to 
carefully monitor what happens in between tests. Our staff learned ways to deal with problems 
when our new systems crash, which should make future outages (which we hope do not happen) 
shorter. We will also look at the possibility of getting more battery power to keep the UPS systems 
working longer. 
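
To put the battery question in numbers, the times in this report can be worked through as below. This is a rough sketch only: the outage start (6:07 am) and power restoration (6:28 am) come from the report, while the UPS runtime is the approximate "about 10 minutes" figure, so the shortfall is an estimate and not a measured value.

```python
from datetime import datetime, timedelta


def at(hhmm: str) -> datetime:
    """Parse a time of day on the date of the outage (15 June 2007)."""
    return datetime.strptime("2007-06-15 " + hhmm, "%Y-%m-%d %H:%M")


power_lost = at("06:07")              # ConEd surge, from the report
power_restored = at("06:28")          # line power back, from the report
ups_runtime = timedelta(minutes=10)   # approximate ("about 10 minutes")

# Total time the machine room was without line power.
outage = power_restored - power_lost

# Extra battery time that would have been needed to ride through
# this outage with no generator at all.
shortfall = max(outage - ups_runtime, timedelta(0))

print("outage length:", outage)
print("UPS shortfall:", shortfall)
```

On these figures the outage lasted 21 minutes, so the UPS would have needed roughly 11 more minutes of runtime to carry the load without the generator.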
