Response to Yesterday's Network Outage
Posted by Karl Zimmerman on 13 September 2010 07:21 PM
This is being sent to the primary address on the account. If anyone else in your company or department needs this information, please forward it to them.

First of all, I am very sorry. I cannot say that enough. I truly apologize for the damage the outage has done to your business, for the calls you had to field, and for the irate customers you had to handle. We fully understand the severity of this situation and that it negatively affects your business. Providing reliable connectivity and data center services is the core of our business, and events like this damage our business as well. We fully understand the need for reliability, and we feel that our uptime record up to this point is a testament to that. The upgrade was a significant investment on our part to maintain that stability and reliability as our network continues to grow.

We have let you down and have not lived up to our promises. People come to us because of the reliability and level of service we provide, and in this case we did not provide the service that you expect of us, or that we expect of ourselves. Simply put, we got in over our heads. We have completed similar migrations in the past without issue, including swapping out the Juniper router in New York for a Cisco in early July, and our previous swap from Juniper core gear to Cisco core gear in Chicago. After months of evaluating options, and weeks of having the hardware in hand for testing, we were confident we could perform such a migration again. This time, given the size of our customer base in Chicago, the migration was simply too large and too risky. One of the many reasons customers like us is that we are a small company, nimble to the needs of our customers; in this case, we were too small to handle the demands of a migration of this scale.

Several hours before the upgrade began, we made sure all of our in-house engineering staff were on-site for the maintenance, along with third-party network engineers for support and additional supervision. Initial setup took longer than expected, which is why the original window was extended to 7:00 AM. At roughly 6:15 AM everything was up and running on the new Brocade equipment: all BGP sessions were established, customer traffic was flowing, and everything looked great, with just some minor things left to touch up. Then, at 6:50 AM, once we thought things were done and settled in, everything collapsed. We still do not know what caused this collapse, but CPU load spiked across all of our core equipment, even the remaining Ciscos we had in place. Because that configuration had been working without issue, we decided to push forward and resolve the problems in place, figuring it would be a relatively quick fix. On several occasions the network seemed to stabilize, only for more issues to appear. We eventually decided to go to plan B and revert to our old configuration. The rollback hit problems right away: a failed chassis and management card caused an immediate delay, and the replacement management module had hardware problems of its own and also needed to be replaced. Even with the known-good configuration restored, we experienced routing loops, general routing problems, BGP convergence issues, and more.
During all of this there was also a Cisco VTP issue, so we had to go around to dozens of customer switches and manually reconfigure them to ensure they had valid VLAN tables.

Stability has been restored and network operations are back to normal. If you have any network issues at this time, please contact us immediately.

We thought we had everything prepared, and we had spent weeks on configuration and testing, but it appears we were wrong. I do not need to tell you that things did not go as planned. Throughout the event we worked extensively with third-party engineers as well as engineers from both Cisco and Brocade. We are not putting the blame on any specific gear or vendor; everything was a part of the problem, and we are responsible for all of it. Mistakes were made, but we have certainly learned from them. Learning from this experience will make us a better company in the long term and will greatly affect how we plan in the future. We need to assume that if things can go wrong, they will go wrong, even though we are normally a hopeful and optimistic bunch. The following changes will be implemented to ensure these issues never happen again:

1) The new Brocade routers will not be used for their initially planned purpose. Instead, we will invest in a new configuration that completely separates the core and distribution layers of our network. This means no changes to the current Cisco configuration, other than gradually and gracefully moving individual BGP sessions over to the Brocades. These should be entirely non-invasive maintenances, just gracefully draining and moving BGP sessions (see the sketch after the notes at the end of this message).

2) We commit to building a network infrastructure and maintenance policy under which we never have to force a widespread outage. The separate aggregation, distribution, core, and edge structure of the new network will greatly assist in that goal. Backup/rollback gear will always be left in place, as-is, and transitions will be made slowly, over time. Spreading a maintenance gradually over six months is a much better option than taking any risk of an event like this happening again. The primary objective in future planning is to mitigate as much risk as possible.

3) We will send an email to customers about major maintenance windows, even though we have previously received many complaints about doing so. If you do not care about a maintenance, delete the email; it affects all of our customers, and we want to be sure everyone knows. We will continue to post maintenances on the announcements page (https://support.steadfast.net/index.php?_m=news) as per our terms of service. All future announcements will include a maximum risk assessment, not an estimate of the expected downtime. We will assume the worst case so that you can take the actions necessary to prepare for it.

4) For future maintenance windows we will bring in an extra staff member specifically to manage communication, keeping the site and forums up to date with as much information as possible. This will not be necessary 95% of the time, but we need to plan for the worst. Updates will be made regularly, even if there is little to no change.

5) We will change the structure of our network engineering department and its management. All network engineering decisions will be made solely by network engineers, not by management or accountants.
6) Colocation customers can talk with our sales department (sales@steadfast.net) for free consulting and cross connects to reach our other bandwidth partners, giving you a redundant, multi-homed network configuration of your own. We even have no-commit pricing available from these partners, perfect for a redundant/backup link and for use during any future scheduled maintenance windows.

I know it may not be easy, but I am asking you to stick with us through these times. We have provided robust and friendly service up to this point; do not let this one incident, severe as it was, destroy the quality business relationship we have built together. If you help us through this time with your understanding, we can assure you it will pay long-term dividends. We have learned, and these mistakes will not be made again. Let's grow together. If you have any additional questions or comments, you can address them through our standard support channels or by contacting our management directly at management@steadfast.net.

Notes:

1) The InterNAP FCP was removed from the network to prevent possible BGP issues. It will be re-added within the next 48 hours, but you may see some sub-optimal routes until all routes have been updated through the InterNAP FCP.

2) We will honor our SLA. Instead of an SLA credit, we can provide free upgrades of RAM and bandwidth. These upgrades can easily add up to a much larger long-term benefit than a single credit, while also being a short-term benefit to us.

3) If you have a Cisco switch, make sure VTP is set to transparent mode (vtp mode transparent) or is otherwise properly configured. By default, a switch likely has VTP active with no authentication, meaning any switch you are connected to can affect your VLAN table and potentially bring down your network. A brief configuration example follows below.
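To illustrate note 3, here is a minimal sketch of locking down VTP on a Cisco IOS switch. The domain name and password shown are placeholders for illustration, not values specific to our network; adapt them to your own configuration.

    ! Check the current VTP state first.
    show vtp status

    ! Safest option: stop participating in VTP entirely.
    configure terminal
     vtp mode transparent
    end

    ! Alternatively, if you rely on VTP, set a domain and password so a
    ! neighboring switch cannot silently rewrite your VLAN table.
    ! (EXAMPLE-DOMAIN and example-secret are placeholders.)
    configure terminal
     vtp domain EXAMPLE-DOMAIN
     vtp password example-secret
    end

    ! Confirm the change took effect.
    show vtp status

Transparent mode means the switch forwards VTP advertisements but never applies them to its own VLAN table, so a misconfigured neighbor cannot wipe out your VLANs.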
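And to illustrate the graceful BGP session moves described in change 1 above, below is a rough sketch of one common way to drain a session on Cisco IOS before shutting it down. The neighbor address (192.0.2.1), AS number (64500), and route-map name are placeholder examples only, not our actual configuration.

    ! Make our advertisements unattractive so traffic drains away first.
    ! (DRAIN, 64500, and 192.0.2.1 are placeholder values.)
    configure terminal
     route-map DRAIN permit 10
      set as-path prepend 64500 64500 64500
     router bgp 64500
      neighbor 192.0.2.1 route-map DRAIN out
    end

    ! Resend our routes with the new policy and wait for traffic to shift.
    clear ip bgp 192.0.2.1 soft out

    ! Once the link is idle, shut the session down cleanly and move it.
    configure terminal
     router bgp 64500
      neighbor 192.0.2.1 shutdown
    end

Because traffic has already rerouted by the time the session is shut down, the move itself causes no packet loss; that is what makes this style of migration non-invasive.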
Karl Zimmerman
President/CEO, Steadfast Networks