


Site Outage August 10th, 2011


6 replies to this topic

#1 HSS-Dave

HSS-Dave

    Founder

  • Administrators
  • 1,569 posts
  • Location: Indiana

Posted 10 August 2011 - 06:30 PM

HSS experienced an outage between noon and 7 PM on August 10th, 2011. My provider's provider had an issue with power, aux power, and a transfer switch in their Dallas, TX colocation facility today. A temporary solution is in place, and it will probably require a maintenance period sometime in the near future. Hopefully you got through the day without the site!

Excerpt of today's activity for the curious.

Updates

Current Update:

The power has been restored fully and all customers should be up. If you are a customer and have not come online yet, please open a help ticket for us to handle directly.

In addition, we have deployed a team member to walk the data center and look for any cabinets that are not powered up. If we find any during this check, we will reach out to you and coordinate getting your equipment live.

We will provide customers with a more detailed update upon completion of our after-action review for this incident. Our first goal at this time is to ensure everyone is up safely and all connectivity is restored.

Thank you again for your patience.

Previous:

17:38

Power has been restored to the distribution gear from the temporary ATS. HVAC units are now all online, and we will be beginning the process of restoring power to UPSs soon, then PDUs, and then customer equipment.

We will update you as the other areas come online. Thank you again for your patience.

16:46

Our team and electricians are working diligently to get the temporary ATS installed, wired and tested to allow power to be restored. As the ATS involves high-voltage power, we are following the necessary steps to ensure the safety of our personnel and your equipment housed in our facility.

Based on current progress, the electricians expect to start powering the equipment on between 6:15 – 7:00pm Central. This is currently our best estimate. We have tested thoroughly and don’t anticipate any issues in powering up, but there is always the potential for unforeseen issues that could affect the ETA, so we will keep you posted as we get progress reports. Our UPS vendor has checked every UPS, and the HVAC vendor has checked every unit and found no issues. Our electrical contractor has also checked everything.

We realize how challenging and frustrating it has been to not have an ETA for you or your customers, but we wanted to ensure we shared accurate and realistic information. We are working as fast as possible to get our customers back online and to ensure it is done safely and accurately. We will provide an update again within the hour.

While the team is working on the fix, I’ve answered some of the questions or comments that have been raised:

1. ATSs are pieces of equipment and can fail, as equipment sometimes does, which is why we provide 2N power in the facility in case the worst-case scenario happens.

2. There is no problem with the electrical grid in Dallas or the heat in Dallas that caused the issue.

3. Our website and one switch were connected to two PDUs, but ultimately the same service entrance. This was a mistake that has been corrected.

4. Bypassing an ATS is not a simple fix, like putting on jumper cables. It is detailed and hard work. Given the size and power of the ATS, the safety of our people and our contractors must remain the highest priority.

5. Our guys are working hard. While we all prepare for emergencies, it is still quite difficult when one is in effect. We could have done a better job keeping you informed. We know our customers are also stressed.

6. The ATS could be repaired, but we have already made the decision to order a replacement. This is certainly not the cheapest route to take, but is the best solution for the long-term stability.

7. While the solution we have implemented is technically a temporary fix, we are taking great care and wiring as if it were permanent.

8. Colo4 does have A/B power for our routing gear. We identified one switch that was connected to A only which was a mistake. It was quickly corrected earlier today but did affect service for a few customers.

9. Some customers with A/B power had overloaded their circuits, which is a separate, customer-specific issue rather than a network issue. (For example, if we offer A/B 20 amp feeds and the customer draws 12 amps on each, then when one trips, the other cannot handle the combined 24 amp load.)
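
The arithmetic in point 9 can be sketched as a quick check. This is only an illustration, not anything Colo4 published: the function name and the optional `derate` headroom factor are hypothetical, and the numbers are the ones from the example above.

```python
def survives_single_feed_loss(load_a_amps, load_b_amps, feed_rating_amps, derate=1.0):
    """Return True if one feed alone could carry the load of both feeds.

    derate is a hypothetical headroom factor (for example 0.8 to keep the
    surviving feed at or below 80% of its rating). It is an assumption for
    illustration, not a figure from the provider's update.
    """
    combined_amps = load_a_amps + load_b_amps
    return combined_amps <= feed_rating_amps * derate


# The case from point 9: two 20 A feeds carrying 12 A each.
print(survives_single_feed_loss(12, 12, 20))  # False: 24 A will not fit on 20 A
# A lighter setup: 8 A on each feed fits on a single 20 A feed.
print(survives_single_feed_loss(8, 8, 20))    # True
```

Put another way, A/B power only protects you if the sum of both feeds stays within what a single feed can deliver.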

As you could imagine, this is the top priority for everyone in our facility. We will provide an update as quickly as possible.


14:53

Thank you for your patience as we work to address the ATS issue with our #2 service entrance. We apologize for the situation and are working as quickly as possible to restore service.

We have determined that the repairs for the ATS will take more time than anticipated, so we are putting into service a backup ATS that we have on-site as part of our emergency recovery plan. We are working with our power team to safely bring the replacement ATS into operation. We will update you as soon as we have an estimated time that the replacement ATS will be online.

Later, once we have repaired the main ATS, we will schedule an update window to transition from the temporary power solution. We will provide advance notice and timelines to minimize any disruption to your business.

Again, we apologize for the loss of connectivity and impact to your business. We are working diligently to get things back online for our customers. Please expect another update within the hour.

13:34

It has been determined that the ATS will need repairs that will take time to perform. Fortunately, Colo4 has another ATS on-site that can be used as a spare. Contractors are working on a solution right now that will allow us to safely bring that ATS into service while the repair is happening.

That plan is being developed now and we should have an update soon as to the time frame to restore temporary power. We will need to schedule another window when the temp ATS is brought offline and replaced by the repaired ATS.

13:05

There has been an issue affecting one of our 6 service entrances. The actual ATS (Automatic Transfer Switch) is having an issue and all vendors are on site. Unfortunately, this is affecting service entrance 2 in the 3000 Irving facility so it is affecting a lot of the customers that have been here the longest.

The other entrance in 3000 is still up and working fine as well as the 4 entrances in 3004. Customers utilizing A/B should have access to their secondary link. It does appear that some customers were affected by a switch that had a failure in 3000. That has been addressed and should be up now.
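
A/B power only helps when the two feeds trace back to different service entrances. A rough sketch of how one might audit that is shown here. The device names, PDU names, and entrance labels are all made up for illustration and are not Colo4's actual topology.

```python
def shared_entrance_devices(device_pdus, pdu_entrance):
    """Return devices whose PDUs all trace back to a single service entrance."""
    at_risk = {}
    for device, pdus in device_pdus.items():
        # Collect the distinct service entrances feeding this device.
        entrances = {pdu_entrance[pdu] for pdu in pdus}
        if len(entrances) < 2:
            at_risk[device] = sorted(entrances)
    return at_risk


# Hypothetical topology: two of the three PDUs hang off the same entrance.
pdu_entrance = {"PDU-1": "SE-1", "PDU-2": "SE-2", "PDU-3": "SE-2"}
device_pdus = {
    "website": ["PDU-2", "PDU-3"],  # two PDUs, but one entrance
    "switch-a": ["PDU-2"],          # A side only
    "router": ["PDU-1", "PDU-2"],   # genuinely redundant
}

print(shared_entrance_devices(device_pdus, pdu_entrance))
# {'website': ['SE-2'], 'switch-a': ['SE-2']}
```

Anything the check returns would go dark if that one entrance failed, even though it may look dual-fed at the rack.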

This is not related to the PDU maintenance we had in 3004 last night. Separate building, service entrance, UPS, PDU, etc.

We will be updating customers as we get information from our vendors so that they know the estimated time for the outage. Once this has been resolved, we will also distribute a detailed RFO to those affected.

Our electrical contractors, UPS maintenance team and generator contractor are all on-site and working to determine the best course of action to get this back up.

12:40

Colo4 is currently experiencing a power issue.


Host of The Home Server Show Podcast.
Windows Home Server MVP - 2009 - 2012

#2 dvn

dvn

    HSS Elite

  • Moderators
  • 1,632 posts

Posted 10 August 2011 - 06:33 PM

Can you say pro-rated bill for the month? :D
  • Rich's Random Podcast Generator
  • Desktop - i7-2600K | ASRock Z68 | 128GB Crucial RealSSD C300 | Cooler Master Silent Pro 600W | Lian Li K60B case
  • VM Server - Q9550 | Gigabyte EP45T-USB3 | 2 x 4 GB DDR3 1333 | Lian Li KB60
  • HTPC - Revo 3610
  • WHS 2011 - Core i3-540 system | Lian Li K60B case
  • HP MSS EX495

#3 jmwills

jmwills

    HSS Genius

  • Donating Member
  • 5,081 posts
  • Location: Huntsville, AL

Posted 10 August 2011 - 10:44 PM

Okay...glad to hear that or at least I know it hasn't been me. I've noticed several outages over the last few weeks so you might need to ask for that prorated bill. On the other hand, when I saw Dallas, it all made sense with all the heat related issues down there right now.

My faith in ATS's still is not restored. That's the "big excuse" for any power related issue in Iraq.
Windows 7 Desktop - Antec 100 Case, Intel D8H67BL, OCZ 550W PSU, Intel i3-530 CPU w/16GB G-Skill DDR3 1333 RAM
Server 2012 - Fractal Arc Midi, CoolerMaster M600 PSU, ASUS P8H67V, Intel i5-2500 CPU w/32GBG-Skill DDR3 1333 RAM, 90 GIG OCZ SSD OS Drive – Roles: Hyper-V (WHS-SharePoint-DC-SQL-Exchange-WSE 2012), Print Server - Rocket RAID 2720 5x2TB
HTPC Build - Silverstone GD05 Case, ASUS P7H55-M PRO, CoolerMaster M600W PSU, Intel i3-530 CPU w/4GB G-Skill DDR3 1333 RAM. OCZ 60GB SSD Drive for the OS with a 120GB WD 2.5" Blue drive for data storage.
Travel Laptop: Dell XPSL502X 15.6"

#4 JediTim

JediTim

    HSS Advanced

  • Members
  • 504 posts
  • Location: Long Island, New York

Posted 11 August 2011 - 05:12 AM

And I thought I was having issues on my end. I must say that it is nice to see they gave you a detailed report of the issues.

#5 timekills

timekills

    HSS Advanced

  • Donating Member
  • 615 posts
  • Location: FBTX

Posted 11 August 2011 - 10:35 AM

jmwills said:

    Okay...glad to hear that or at least I know it hasn't been me. I've noticed several outages over the last few weeks so you might need to ask for that prorated bill. On the other hand, when I saw Dallas, it all made sense with all the heat related issues down there right now.

    My faith in ATS's still is not restored. That's the "big excuse" for any power related issue in Iraq.


How many times have I heard that one in Kuwait or Afghanistan too. Triple generator backup which fails because the ATS is a single point of failure. That or shelter grounding...or HVAC...

#6 jmwills

jmwills

    HSS Genius

  • Donating Member
  • 5,081 posts
  • Location: Huntsville, AL

Posted 11 August 2011 - 11:24 AM

Ahhhhhh...good times :)

#7 Joe_Miner

Joe_Miner

    HSS Elite

  • Donating Member
  • 1,711 posts
  • Location: Central Illinois

Posted 11 August 2011 - 11:44 AM

It would be interesting to see a single-line diagram of their system.

If the utility is supplying the separate service feeds, the ATS should be theirs, and it should be maintained at least annually before the summer season starts.

WHS-V1: HP EX-487: 4*WD20EARX, Athena AP-MFATX30, 4GB G.Skill, E5200, Stablebit Scanner-|-
WHS-2011: HP N54L G7, Kingston ECC 8GB KVR1333D3E9SK2/8G, OS: 256GB M4, 5*ST3000DM001, WD PCIe USB3, R640L, Stablebit DrivePool & Scanner -|-
Test Labs: HP N40L, G.Skill 16GB F3-1333C9D-16GAO, rr2720 -|- HP N40L, Kingston ECC 16GB KVR1333D3E9SK2/16G -|-
S2012 Hyper-V Lab: Lian-Li K9WX, GA-Z77X-UD5H, i7-3770, 32GB G.Skill, 240GB Corsair GT + various HDD's-|-
Desktop W8P64: HAF 932,GA-X58A-UD3R,i7-930,12GB,240GB Corsair GS + various HDD's,HD5850,Samsung Series9 & 213T+Planar PX2710MW,C920 -|-
HTPC3 W8P64WMC: GD05B, GA-Z68MX-UD2H-B3, i5-2500K, 16GB G.Skill, Corsair GTX 240GB, Crucial M4 256GB , C910, Camtasia-|-
Laptop W8P64WMC: Acer 1810T, 4GB RAM, 240GB Corsair GT SSD-|-




