As a BlackBerry Deployment Engineer for Microsoft’s Office 365 cloud service, I am sometimes privy to confidential information. In this case, I will leak to you that RIM had a huge BlackBerry service outage this week. OK, so maybe you already heard that. While the full root cause analysis (RCA) will take time to complete, RIM did report that it came down to a network switch that failed and a backup that did not take over as expected. The result was a cascade of system failures. Right now it sucks to be RIM. It is easy to sit back and admonish RIM for not having been better prepared, and I’m sure they will learn from this mistake.
When I was growing up, as my parents sent me off to school, they would always say “Have a great day and make lots of mistakes!” Why? Because they knew that we all learn from our mistakes. Since then I have come to a new conclusion: I can’t afford to make all the mistakes I need to learn from. So I have adopted a new philosophy:
If Intelligence is the ability to learn from your mistakes, then Wisdom is the ability to learn from the mistakes of others.
In this case, I really don’t want to make the same mistake RIM made. So what can we learn from it? When it comes to the most critical systems, have multiple redundancies, not just one backup system as was the case at RIM. Cave divers always have 3 systems to keep them alive. Medical systems often have 3 redundant systems. Football teams have third-string players for key positions. The space shuttle had 3 to 5 redundancies for its most critical systems! Murphy’s Law states, “Anything that can go wrong will go wrong,” and one of its many corollaries states, “Everything goes wrong at once.”
Take a moment to learn from RIM’s mistake. For the most critical of your mission-critical systems, have multiple redundancies. If that is a hard sell to management, just point them to Black(Berry) Monday, October 10, 2011.