The fundamental root-cause of data center failure
We have seen so many papers and opinions on this redundant topic: “top 10 reasons why your data center goes down”, “top 5 things you should do to prevent downtime”, “top root-causes of data center outage”. It has become more of an entertainment and a game of numbers to see who comes up with a newer version or more reasons for the same question. My intent in this blog is to put an end to the rhetoric. The truth is, data center failures have many facets but only one root-cause.
There is one root-cause of data center failures, and one only: People. “The people” who design, plan, run, operate and own data centers are at the bottom of all the problems that surface in different forms in the event of failure:
- “People” not understanding the business – It is very typical of data centers to deliver something different from, or in some cases in total contradiction with, the business needs. Most people don’t even know what they really want when they attempt to translate their needs into data center metrics. Not understanding the nature of the business, its requirements and sensitivities is the beginning of all problems.
- “People” not sizing the right budget – When you need four nines (99.99%) of availability but your budget is sized for only about two and a half nines (99.45%); when you need 2 MW of power but your budget only allows you to procure 1.5 MW; when you know the effective cooling for your specific case is to supply 15 kW per rack and 2,400 CFM, but your budget only gives you purchasing power for a fraction of what you really need; who is at fault but the people who diverted the budget from the actual needs and the people who accepted this unrealistic compromise and provided false guarantees?
- “People” selecting the wrong sites – If an earthquake, flood, tornado or any other weather anomaly takes a data center down; if extreme temperatures, dust, EMI, etc. burden a data center with hefty energy bills and potential downtime; or if you lack redundant utilities and access providers, it’s because the people involved in the site selection didn’t do their homework, or compromised on an inferior site but failed to relay the potential consequences up the management ladder.
- “People” crafting fallacious designs – Design flaws are a common source of problems that surface on the operations floor many times over, when and where you need the data center the most. Inappropriate engineering and inadequate conceptual and detailed design lead to a chain of deficiencies that are too expensive to rectify once construction is completed and nearly impossible to handle in the event of failure.
- “People” buying inappropriate technologies – When people select and procure products based on brands and not their substance; when they try to build their data center around a specific product or technology, rather than building the data center and its encompassing technologies around the application and the business; or when a selected technology is not optimal for the specifications of the data center, the problem is the people in the chain of command who ignore the basic principles of building a data center.
- “People” doing incorrect implementations – Ignoring correct and methodical implementation principles, cutting corners, overlooking details, not being punctual, and missing critical installation procedures and configuration phases are only a fraction of the disasters piling up for the to-be-live data center and, more so, for its stakeholders and the end-users of the application.
- “People” failing to test – Proper testing is a critical step in data center development. If not planned and executed with expertise, it will result in false promises and wrong beliefs about what the data center claims to deliver and is actually able to deliver. The people involved in test engineering, planning, execution, supervision, post-test report evaluation and decision-making are key to ensuring effective tests and corrective measures that avoid serious shortcomings and failures over time.
- “People” depending on the wrong people – When a DBA moves up to manage a data center without proper exposure and training, when a network engineer assumes the role of data center ops manager, when an electrical expert takes the role of data center strategist or when a mechanical engineer runs a helpdesk, then “Houston, we have a problem.” Many times, people grow into positions that are a far cry from their true capabilities and what they can deliver. Depending on the wrong people can prove disastrous for data centers. Data centers need to be planned, implemented and run by data center people, period.
- “People” failing to document – The lack of concise and cohesive operation manuals, consolidated SOPs, policies and procedures, effective change control, and solid yet simple and practical documentation can take a whole system down, making poor documentation a very expensive proposition. Again, who else is behind this shortcoming but the people involved?
- “People” misaligning IT and Facilities – The misalignment of data center facilities (power, cooling, etc.) with IT (network, storage, etc.) has caused damage and wasted resources and operational hours beyond description. This traditional and unjustified misalignment should have become obsolete long ago. Unfortunately, it is still widely practiced and continues to be a visible and consistent gap across organizations.
- “People” failing to plan ahead – Data centers are supposed to empower and sustain business growth. Thus, being forward-looking in vision and practice, and planning for upcoming needs (as best as they can be projected) given the dynamism of today’s businesses and applications, is crucial. Failing to plan ahead, not just for growth and expansion but also for consolidation, disaster and recovery management, and many other aspects, is caused by the short-sightedness of people.
- “People” not implementing true cloud – Going cloud is a way of reaching new horizons and satisfying extreme RPO and RTO values. But cloud cannot be taken solely in its virtual sense. True cloud is an essential requirement for catering to the needs of today’s mission-critical applications. When people don’t understand what cloud is, in physical and logical terms, and how it is to be implemented, the virtualization realm can mislead IT managers into believing in redundancies that in reality don’t exist.
- “People” being a single point of failure – When a critical function of a data center becomes dependent on a single individual, that individual is a single point of failure of your data center. In the event of failure, whether due to gross negligence, an honest mistake or simply the absence of that one person, the entire data center, infrastructure, and operations of the business can come to a halt, all due to the lack of redundancy and resiliency management in people.
- “People” not understanding the Application – The application is the ultimate product, delivery or output of the data center. The application is the reason we build data centers. Unfortunately, people’s lack of tangible comprehension of the application’s true needs drives organizations to build data centers that are geared for anything but the application they intend to deliver. The results are financially severe and easily quantifiable.
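The budget gap described in the list above can be put into numbers. The following is a minimal sketch, assuming a 365-day year and using hypothetical helper names (`annual_downtime_hours`, `shortfall_fraction`) that are purely illustrative; only the 99.99%, 99.45%, 2 MW and 1.5 MW figures come from the text.

```python
# Hypothetical helpers for illustration only; the input figures are the
# ones quoted in the budget bullet above.
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def annual_downtime_hours(availability_pct: float) -> float:
    """Hours per year a facility may be down at the given availability."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


def shortfall_fraction(required: float, budgeted: float) -> float:
    """Fraction of the requirement left unfunded (0.0 if fully covered)."""
    return max(0.0, (required - budgeted) / required)


if __name__ == "__main__":
    needed = annual_downtime_hours(99.99)   # what the business requires
    funded = annual_downtime_hours(99.45)   # what the budget pays for
    print(f"Required: {needed * 60:.1f} minutes of downtime per year")
    print(f"Funded:   {funded:.1f} hours of downtime per year")
    print(f"Power shortfall: {shortfall_fraction(2.0, 1.5):.0%}")
```

Under these assumptions, four nines allows under an hour of downtime per year, while 99.45% allows roughly two full days, and the power budget falls a quarter short of the requirement.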
This list can continue beyond the above notes, linking all shortcomings of the data center to the people planning and executing data centers. The lack of knowledge, expertise, knowhow, passion, precision, devotion, attention, consultation, teamwork, effort and investment drives data centers to the edge of destruction, causing catastrophes for their stakeholders.
The resulting mishaps and mistakes caused by those who are simply not qualified to design, implement and run multi-MW facilities and/or lack the comprehensive approach to all data center components such as power, cooling, civil, telecom, IT, safety, security, application, etc., have proven devastating to data centers, people, infrastructure and businesses. All these and more are just some of the defects of weak “people” management.
The reverse is equally true. The same root-cause of failures can be the organization’s data center edge: having the right people in the right places will get the data center to successful results in capacity, efficiency, security, safety, availability and resilience. The most state-of-the-art data center infrastructure without the appropriate people planning, managing and operating it is a recipe for destruction, while the same infrastructure with the appropriate people is a clear recipe for success.
This notion applies to all walks of life, where our public policies, political system, economy and social system, education, welfare, health and safety are directed and driven by people, hopefully for the people. If we had subject matter experts who are qualified, passionate, sincere and committed individuals in the right positions, the lives of our citizens would take more productive and fruitful journeys.
Back to data centers: the most brilliant people can’t deliver all the availability, efficiency and security you ask for without the proper infrastructure. Nevertheless, if you invest more in infrastructure than you do in your people, the deficit you thereby create in your overall data center will give you false assurances and widen the gap between the infrastructure and your people even more. So next time you plan to buy infrastructure, take a moment to think, evaluate your human resources, balance your emphasis and always invest in your people first!