The Role of Maintenance in the Data Center

Editor’s Note: The following article highlights the role of continuous monitoring in helping to predict and assess maintenance needs.

From: The Data Center Journal

Billions of dollars have been spent building highly redundant data center facilities to deliver high-availability IT solutions to an increasingly information-reliant world. These large investments have produced a variety of sophisticated facility infrastructure designs that are inherently reliable and progressively more energy efficient. No facility design, however, regardless of how well planned and constructed, can withstand the disruption of an improperly implemented operations and maintenance (O&M) program. Poor maintenance and risk mitigation processes can quickly undermine the facility design intent. It is therefore crucial to understand and evaluate how O&M programs are organized to achieve the level of performance for which the facility has been configured. This article identifies a method for aligning the operational requirements of the business with maintenance program standards that can be easily understood and communicated throughout the organization.

The Need for Maintenance Program Standards

It can be exceedingly difficult for non-maintenance professionals to evaluate the quality and effectiveness of their data center maintenance program. The presence of activity by qualified individuals is not in itself a reliable indicator. Practices that deliver good results in non-critical facilities are not always suitable for high-availability environments, but they may appear adequate until a service outage occurs. Business and IT managers can develop a false sense of security on the basis of incomplete knowledge of how maintenance programs need to be structured to achieve the required business continuity goals. The problem is pervasive in part because no common language exists that would allow all of the various stakeholders in the data center to identify specific O&M program elements that align with the goals of the enterprise.

In an effort to overcome this barrier, it is useful to develop a set of standards for describing maintenance levels that is specific to data center operations and other mission-critical facilities. In the standards described below, each level is based on fundamental characteristics of service performance that are both easily comprehended and have predictable effects on system availability.

Defining the Standards

Four Tiered Infrastructure Maintenance Standards (TIMS) have been established:
• TIMS-1: Run to Fail
• TIMS-2: Unstructured
• TIMS-3: Structured
• TIMS-4: Facilitated

TIMS-1 Run to Fail

This level of service reflects the old adage, “If it isn’t broken, don’t fix it.” Maintenance is purely reactive at this level; when equipment fails, a technician is summoned to perform the repair. In areas where the system has redundancy, there may be little or no effect on the critical load for an isolated failure. The lack of a preventive maintenance program, however, will increase the likelihood of simultaneous failures, which can take down even redundant systems.

Operating at TIMS-1 implies that the perceived cost of an outage is low compared with the cost of preventative maintenance. And in a time of tight IT budgets, deferring maintenance is often viewed as an easy way to cut costs. But any perceived short-term savings in maintenance costs will likely be overshadowed in the long term by more costly outages and expensive repairs.

A lack of system redundancy may also invoke a run-to-fail strategy, where maintenance on a non-redundant component would necessitate removing a portion of the critical load from service. Ironically, the same lack of redundancy will guarantee an unplanned outage when (not if) a failure occurs.

TIMS-2 Unstructured Maintenance

TIMS-2 maintenance is characterized by the performance of routine preventative maintenance tasks without an overarching set of processes and procedures to ensure effectiveness and predictability. The fact that it is commonly performed by qualified manufacturers’ service representatives or trusted in-house technical staff can create a false sense of security. Even qualified personnel can make mistakes or focus too intently on individual system components without considering the system as a whole. This approach may deliver adequate results in some environments, but it does not meet the expectations of mission-critical data centers. Unfortunately, this level of service is the industry norm. Service contracts for preventative maintenance are commonly bid low, with the difference recovered on lucrative follow-up corrective maintenance work.

Simply following manufacturer’s recommendations is no guarantee that all necessary steps are being taken to maximize availability. If the maintenance program lacks a detailed scope of work for each piece of equipment that factors in system interdependencies, chances are that important steps are being neglected. If methods of procedure (MOPs) are not employed on critical systems to detail each step in the maintenance process, the risk of human error occurring during maintenance events is elevated.

A common characteristic of Unstructured Maintenance is an over-reliance on individual effort. It is reassuring to rely on a trusted individual who has been providing maintenance services for years, but an organization takes on a high degree of risk when its facility-maintenance knowledge resides only in the heads of individual technicians, who are susceptible to making mistakes no matter how experienced.

Unstructured, under-documented maintenance programs create an environment in which equipment failure is tolerated and the risk of human error is elevated.

TIMS-3 Structured Maintenance

Structured Maintenance is designed to maximize uptime by removing guesswork and minimizing the negative effects of human error. TIMS-3-level maintenance is a complicated task that requires discipline and experience to execute. Each component of the maintenance process is closely controlled; policies are established to control how information is gathered, acted upon and recorded, precisely managing how and when work is performed. Identifying and training qualified personnel is part of a formal program, as are supervision and performance evaluation.

Structured Maintenance is an extremely proactive process that unites best practices for each maintenance element, integrating them into a program that is more than the sum of its components. The goal is to systematically eliminate variables that can introduce errors.

Structured Maintenance programs include a formal staff training program; a document library that includes a scope of service and standard operating procedures (SOPs) for all site equipment; a change management program that uses methods of procedure (MOPs) for maintenance activities along with a formal work process; a strong vendor management program; rigorous quality control procedures; specialized support systems such as a computerized maintenance management system (CMMS) and an electronic document management system (EDMS); and 24/7 on-site staffing.

Importantly, a facility does not need a high Uptime Institute Tier rating to enact a Structured Maintenance program. Rather, the critical systems must simply be maintained to the program standards. In the event that concurrent maintenance is not possible, data center managers may have to organize a controlled shutdown of some services, but this is significantly better than an unplanned, uncontrolled shutdown that was preventable.

TIMS-4 Facilitated Maintenance

Facilitated Maintenance is the highest level of maintenance service. It combines a Structured Maintenance program with a system topology that facilitates maintenance by providing multiple power and cooling distribution paths with redundant components. Such a design allows individual pieces of equipment to be isolated and maintained without a disruption in services. Another important component is a building management system (BMS), which continually monitors the critical infrastructure, trends equipment performance, alerts operators when conditions fall outside allowable parameters and allows automated control of equipment sequencing.

Data center operations achieve the highest possible level of reliability for their assets when Structured Maintenance is performed in this environment. Automated systems eliminate much of the risk of human error and can respond more quickly and appropriately to sudden changes. Continuous monitoring of the critical systems, and the ability to trend specific operating parameters, facilitates predictive maintenance of critical systems. Operating in a Facilitated Maintenance model enables managers to easily isolate redundant system components for comprehensive testing and maintenance, greatly increasing reliability while minimizing the risk of downtime.
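As a rough illustration of the monitoring and trending described above, the following Python sketch flags readings that fall outside an allowable band and estimates how quickly a parameter is drifting toward its limit. It is a minimal sketch, not a real BMS interface: the function names, the supply-air temperature readings and the 18 to 27 degree band are all hypothetical values chosen for the example.

```python
def out_of_range(readings, low, high):
    """Return (index, value) pairs for readings outside the allowable band."""
    return [(i, v) for i, v in enumerate(readings) if not low <= v <= high]


def trend_slope(readings):
    """Least-squares slope of readings against sample index (units per sample)."""
    n = len(readings)
    if n < 2:
        raise ValueError("need at least two readings to compute a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(readings) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(readings))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def samples_until_limit(readings, high):
    """Estimate samples remaining before the upper limit is reached.

    Returns None when the trend is flat or falling. This is a simple linear
    extrapolation, which is an assumption; real equipment may not degrade
    linearly.
    """
    slope = trend_slope(readings)
    if slope <= 0:
        return None
    return max(0.0, (high - readings[-1]) / slope)


# Hypothetical supply-air temperatures (degrees C), one reading per interval.
supply_air = [20.0, 21.0, 22.0, 23.0]
alerts = out_of_range(supply_air, low=18.0, high=27.0)
remaining = samples_until_limit(supply_air, high=27.0)
print(alerts, remaining)
```

In this example no reading is out of range yet, but the steady upward trend suggests the limit will be reached in a few more intervals, which is the kind of early signal that lets maintenance be scheduled before an alarm condition occurs.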

Evaluating Your Maintenance Program

Few maintenance programs fall neatly into a single category when viewed in their entirety. More often, there will be elements of two or more TIMS levels in use. For example, a program might embrace Structured Maintenance on the electrical systems but engage in Unstructured Maintenance practices on the HVAC equipment. Another facility might exhibit pervasive Structured Maintenance practices but have a single switchboard that is not maintained because it cannot be removed from service without powering down a portion of the critical load. In cases such as these, the weakest link principle applies: the overall rating for a given facility is no higher than the least stringent aspect of the O&M program. A facility with a critical component in run-to-fail mode is in effect operating at the TIMS-1 (Run to Fail) level. A single technician operating without proper training or procedures on a critical system will place the facility into TIMS-2 (Unstructured Maintenance) mode.
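The weakest link principle can be expressed as taking the minimum TIMS level across all critical systems. The Python sketch below illustrates this under stated assumptions: the enum, the function names and the example system inventory are hypothetical and are not part of the TIMS definitions themselves.

```python
from enum import IntEnum


class TIMS(IntEnum):
    """The four maintenance levels, ordered from least to most stringent."""
    RUN_TO_FAIL = 1
    UNSTRUCTURED = 2
    STRUCTURED = 3
    FACILITATED = 4


def facility_rating(system_levels):
    """Overall rating is the lowest level found among the critical systems."""
    if not system_levels:
        raise ValueError("at least one critical system is required")
    return min(system_levels.values())


def weakest_systems(system_levels):
    """Names of the systems that hold the overall rating down."""
    rating = facility_rating(system_levels)
    return sorted(name for name, level in system_levels.items() if level == rating)


# Hypothetical assessment: structured electrical maintenance, unstructured
# HVAC practices, and one switchboard that is never taken out of service.
assessment = {
    "electrical": TIMS.STRUCTURED,
    "hvac": TIMS.UNSTRUCTURED,
    "switchboard_a": TIMS.RUN_TO_FAIL,
}
overall = facility_rating(assessment)
print(overall.name, weakest_systems(assessment))
```

Listing the weakest systems alongside the overall rating shows where remediation effort would raise the facility's level first.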

Using the TIMS System

Now that you have a framework for evaluating maintenance effectiveness, how should you use it? If you have an existing data center (or centers), the first step is to perform a detailed evaluation of the O&M program. This can take some time, because it needs to be comprehensive and detailed to be accurate. It may appear that everything is in order, but a deep dive will expose where the real weaknesses lie. Every program has them, and it takes time even for industry experts to peel back the onion. Remember that beauty may be only skin deep.

The next step is to correlate the level of maintenance with the acceptable level of risk for the facility. A low-Tier, geographically redundant facility may tolerate a less stringent level of maintenance than a high-Tier facility with very high availability requirements. Tier level alone is not necessarily the deciding factor, however. Applying TIMS-3 principles will minimize risk and maximize availability in any environment. It could even be argued that a Tier II facility operating at TIMS-3 can be operated more reliably than a TIMS-2, Tier III facility.

Program Implementation

Whether creating an O&M program for a brand new facility or upgrading an existing program, a few fundamental points to consider are:
1. Scope: What specific actions need to be taken to achieve the desired TIMS level?
2. Budget: Does your budget allow you to meet your chosen goals?
3. Skills: Do you have the internal skills to manage and perform the activities required?
4. Impact: What is the impact on your business operation to implement the plan, and what are the risks?

Conclusion

When evaluating the health of the mission-critical enterprise, the effectiveness of the maintenance program is one of the key components that must be factored in to determine the true measure of sustained reliability. The tremendous variability in how maintenance is implemented can make it difficult to judge what constitutes the proper level of service in a given situation. Defining maintenance levels is a tool for achieving such an understanding. Matching up high-reliability systems with high maintenance service levels will allow organizations to achieve the highest levels of reliability and uptime.

Tiered Infrastructure Maintenance Standards offer a systematic approach to aligning the operations and maintenance effort with the business goals of the data center. Applying these principles to your maintenance program will create the framework for achieving a uniform standard of excellence across all elements of facility design, operation and maintenance.

About the Author

Bob Woolley is Senior Vice President, Critical Environment Services, for Lee Technologies, a Schneider Electric company. Mr. Woolley has been involved in the critical facilities management field for over 20 years. He served as Vice President of Data Center Operations for Navisite, as well as Vice President of Engineering for COLO.COM. He was also a Regional Manager for the Securities Industry Automation Corporation (SIAC) telecommunications division and operated his own critical facilities consulting practice. Mr. Woolley has extensive experience in building technical service programs and developing operations programs for mission-critical operations in both the telecommunications and data center environments.

