
Back in the day, to most people debugging meant identifying and sorting out the errors in a bit of program code. Nowadays, it’s a good bit more than that: it extends to the data centre itself, whose effective functioning according to accepted standards is crucial to the business.
Most businesses spend time and effort gathering information and statistics about the operation of their data centre, but rarely use the collected data proactively to keep the data centre running at optimal capacity. The management process is becoming increasingly important for “Green” data centres, which are expected to operate in an environmentally friendly way.
What needs to be measured, monitored and, if necessary, debugged?
The first step is to set a baseline of normal activity against which potential breaches of normal operating standards can be detected.
The various types of baseline standard include performance criteria, power, the environment (particularly temperature and humidity) and, last but not least, network performance.
In regard to performance, many sites have performance criteria based on transaction processing rates and latency or throughput times.
Measurements of system performance under normal conditions allow construction of a processing model against which daily performance can be measured. There are then two indicators to be observed and monitored. A sudden catastrophic change in performance indicates something serious has happened, perhaps equipment failure in a server or network component. More insidious are longer-term changes in performance, caused perhaps by increases in transaction volumes or types, which need to be identified and managed. Care also needs to be taken to ensure that the measuring tools themselves don’t affect the results.
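The two indicators described above can be sketched in a few lines of Python. This is only a minimal illustration, not a production monitoring tool: the function names, the sample latencies, the three-standard-deviation threshold and the 3% drift tolerance are all assumptions made for the example, and a real site would choose values to suit its own baseline.

```python
from statistics import mean, stdev

def build_baseline(samples):
    """Summarise normal-condition measurements as (mean, standard deviation)."""
    return mean(samples), stdev(samples)

def sudden_change(value, baseline, threshold=3.0):
    """True when a single reading is more than `threshold` standard
    deviations away from the baseline mean (assumed threshold)."""
    mu, sigma = baseline
    return abs(value - mu) > threshold * sigma

def gradual_drift(recent, baseline, tolerance=0.03):
    """True when the average of recent readings has moved more than
    `tolerance` (a fraction of the baseline mean) away from the baseline."""
    mu, _ = baseline
    return abs(mean(recent) - mu) / mu > tolerance

# Hypothetical transaction latencies in milliseconds under normal load.
normal_latency_ms = [100, 102, 98, 101, 99, 100, 103, 97]
baseline = build_baseline(normal_latency_ms)

print(sudden_change(250, baseline))                    # True: one very slow reading
print(sudden_change(105, baseline))                    # False: within normal variation
print(gradual_drift([104, 105, 104, 105], baseline))   # True: a slow upward creep
```

Note that none of the readings in the last line would trip the sudden-change check on its own, which is exactly why the longer-term trend needs a separate test.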
Most computer equipment is sensitive to poor-quality power, and power consumption that differs from what is expected is often indicative of actual or impending equipment failure. Manufacturers, including Hewlett Packard, provide modelling tools that estimate the expected power consumption of various equipment configurations. They also provide real-time power monitoring tools which can be used to monitor the actual moment-to-moment power consumption at site and equipment level, and which will issue alerts if limits are breached. Again, data centre users need to monitor for catastrophic power events requiring immediate action, and for longer-term changes in power requirements. They also need to model the changes in power usage that will follow potential and actual changes to equipment.
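As a rough illustration of comparing modelled and actual power draw, the Python sketch below flags any device whose consumption is further from the expected figure than a chosen tolerance. It does not use any vendor's tool or API: the function name `check_power`, the rack names, the wattages and the 15% tolerance are all hypothetical, and in practice the figures would come from the manufacturer's modelling and monitoring tools.

```python
def check_power(expected_watts, actual_watts, tolerance=0.15):
    """Return the devices whose measured draw deviates from the modelled
    value by more than `tolerance` (a fraction), or that have no reading."""
    alerts = []
    for device, expected in expected_watts.items():
        actual = actual_watts.get(device)
        if actual is None:
            # A missing reading is itself worth investigating.
            alerts.append(device)
        elif abs(actual - expected) / expected > tolerance:
            alerts.append(device)
    return alerts

# Hypothetical modelled and measured figures, in watts.
expected = {"rack-a1": 4200, "rack-a2": 3900, "rack-b1": 5100}
actual = {"rack-a1": 4310, "rack-a2": 2950, "rack-b1": 5900}

print(check_power(expected, actual))  # ['rack-a2', 'rack-b1']
```

The check deliberately looks at consumption that is too low as well as too high, since a rack drawing far less than expected may have lost a power supply or a server.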
So when is the right time to debug?
The time to debug is therefore immediately when an alert is issued, and on a regular basis to make sure that power consumption and quality are within acceptable norms.
Temperature and humidity also need to be managed and monitored. Early data centres had an instrument that recorded environmental data, including minimum and maximum temperatures and humidity, on paper disks. The disks were generally filed and forgotten. Today, it’s all digital, which makes for easier analysis. Temperature fluctuations and consistently high temperatures indicate a need to check the aircon system for faults and to make sure that airflow channels are clear. It may be beneficial to install environmental management software that gives early warnings of potential faults in the aircon system. This is particularly important in a Green environment where compliance with environmental standards is a prerequisite for certification.
Again, debug immediately when an alert is issued, or simply if the room feels too hot or cold, and regularly monitor temperatures and trends in temperature fluctuations.
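Because the readings are now digital, the two warning signs mentioned above (consistently high temperatures and wide fluctuations) are easy to test for. The Python sketch below is one simple way of doing so; the function name, the 27 °C limit, the 5 °C swing and the "more than half the readings" rule are illustrative assumptions, not values taken from any standard or from a particular environmental management product.

```python
def review_temperatures(readings_c, high_limit_c=27.0, max_swing_c=5.0):
    """Flag consistently high temperatures and wide fluctuations in a
    series of readings (degrees Celsius). Limits are assumed placeholders."""
    swing = max(readings_c) - min(readings_c)
    share_high = sum(1 for t in readings_c if t > high_limit_c) / len(readings_c)
    return {
        "consistently_high": share_high > 0.5,  # over the limit most of the time
        "fluctuating": swing > max_swing_c,     # wide gap between min and max
    }

# Hypothetical hourly readings from a single sensor.
print(review_temperatures([22.0, 23.0, 22.5, 23.0]))
# {'consistently_high': False, 'fluctuating': False}
print(review_temperatures([28.0, 29.0, 28.5, 26.0]))
# {'consistently_high': True, 'fluctuating': False}
print(review_temperatures([20.0, 27.5, 21.0, 26.0]))
# {'consistently_high': False, 'fluctuating': True}
```

Either flag would be a prompt to check the aircon system and the airflow channels; the same approach could be applied to humidity readings.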
Network usage and capacity are another area to debug.
Very often, it is only when users complain about a slow network that any analysis of capacity and usage happens. The results can be very surprising.
Quite often it is the users themselves who are contributing to the apparent slowness of the network. If the complaints are of a slow Internet connection, downloads of media from music, TV and film sites, particularly YouTube, can use significant amounts of network capacity.
Upgrading Internet capacity usually costs money, and management and analysis of network traffic volumes and types show when remedial steps need to be taken and when further investment in Internet bandwidth is justified. One popular interim technique (though not popular with users) is to block media downloads during core working hours.
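A breakdown of traffic by type is often all that is needed to show where the capacity is going. The Python sketch below assumes the traffic has already been exported as simple (category, bytes) pairs; the function name, the category labels and the byte counts are hypothetical, and how the records are obtained will depend on the monitoring tools in use at the site.

```python
from collections import Counter

def traffic_share(flows):
    """Return each traffic category's share of total bytes, largest first.

    `flows` is an iterable of (category, num_bytes) pairs.
    """
    totals = Counter()
    for category, num_bytes in flows:
        totals[category] += num_bytes
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []  # nothing recorded, so nothing to report
    return [(category, total / grand_total)
            for category, total in totals.most_common()]

# Hypothetical records for one working day, in megabytes.
flows = [("video", 600), ("email", 100), ("video", 200), ("business-app", 100)]

for category, share in traffic_share(flows):
    print(f"{category}: {share:.0%}")
# video: 80%
# email: 10%
# business-app: 10%
```

Figures like these make it much easier to decide whether blocking media downloads would be enough, or whether more bandwidth really is needed.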
The answer to the question, “When to debug the data centre?” is that debugging is a process, not an event, and needs to happen continuously.