February 12th, 2009 - Doug Ayen's Blacksmithing Blog
Doug Ayen

February 12th, 2009

In the last 48 hours . . . [Feb. 12th, 2009|10:41 pm]
Doug Ayen
In the last 48 hours, I have:
*Discover my cell phone is dead right at the start of a maintenance where having two phone lines would be really handy.
*Migrate 3 customers, 6 SONET circuits, twelve GigE circuits, and a half dozen peers onto new equipment.
*Perform the first field trial of the switch memory upgrade process on two switches, with our best and most experienced field engineer acting as remote hands. Both switches fail and require intervention to recover using the standard process. What was claimed to take 5 minutes, and proven in the lab to take ten, took over half an hour in the field.
*Troubleshoot the ten percent or so of customers who, as always, failed to recover after the maintenance.
*Drag my tired ass into work, arrange for a replacement cell, organize and coordinate everything for the night's maintenance, all while dealing with escalations every 9 minutes.
*Realize that I'm going to need an additional field engineer due to mission creep and the spectacular failure of last night's test. This is finally arranged at 9pm.
*Upgrade memory using remote hands for eleven switches.
*Troubleshoot the four switches that fail during the upgrade process.
*Regenerate and restore configs from scratch on one of said switches using the OOB.
*On another of those switches, the out-of-band management connection went first to another switch, then back through the failed switch before hitting the network. So, when the switch failed, so did its out-of-band access. So, I also spent several hours talking the field engineer through the whole configuration from scratch and then the troubleshooting processes until I could get into the switch.
*Re-do most of the configuration to get most of the customers back up.
*Troubleshoot the ten percent or so of customers who, as always, failed to recover after the maintenance.
*Try, and fail, to nap.
*Come back to work for just a few hours to pick up the new cell phone (BlackBerry 8330).
*Discover I'm the only sr. engineer in the house, and my boss is out of the office, so immediately get hit with escalations, questions, and calls.
*Watch half the backbone melt down when a single fiber cut destroys the other half.
*Nervously resist the urge to interfere while the new router load balancing does, eventually, figure out how to shoehorn a hundred gigs of traffic into 95 gigs of capacity, somehow, without significant packet loss. (Hint: traffic doesn't have to take an optimal path, it just has to get there, maybe through Phoenix on its way from Seattle to Chicago.)
*Discover that one of the upgraded switches has lost access to half its memory, rendering the upgrade useless and requiring another maintenance window to fix.
*Deal with another switch as it goes tits up, forcing it over to the backup supervisor card. Arrange for an FE to go out and attempt to reseat the card and, when that doesn't work, move the connections on that card to another.
*Get a reminder memo from my boss that I need to schedule another 60+ switch upgrades ASAP.
*Go home, write this, get ready to go to bed.

Notice the lack of any actual sleep in there.

So, what did you do in the last 48 hours?