Availability is one of the key metrics that demonstrates the overall performance of an information technology (IT) system. Defining and calculating the availability of an IT system from a business perspective, however, is a challenging task. IT departments often report high availability values (such as more than 99 percent), but business stakeholders may not trust them, especially when outages affect applications that support critical business functions or occur during core business hours.
Although the availability numbers may be numerically correct, they may not be a true representation of the real business situation. This misrepresentation can be addressed using two concepts:
- The outside in approach – Defining availability from a business perspective
- The business throughput approach – Availability calculation based on resultant value as experienced by the business
The Problem With Traditional Availability Definition
In most business environments, any business function is supported by several IT applications. Consider, for example, money collection in a credit card business. End users have several ways of making payments, such as a cash or check deposit, a wire transfer, online payment, or payment by phone. Different IT applications, such as secure login enabling online payment and voice-recognition applications enabling payments by phone, support these different ways of money collection. Each of these applications has a different set of business core hours (e.g., websites may be available 24/7, whereas voice-recognition applications may only be used from 8 a.m. to 6 p.m.).
In the traditional IT availability calculation, service level agreements (SLAs) are set for application uptime, and application availability is calculated against those SLAs. In the money collection example, availability is calculated for every end-user application (websites, voice recognition, etc.). The calculation must be based on core business hours rather than total application uptime; the latter leaves room to inflate availability by counting uptime outside business hours. Many organizations base SLA definitions and availability calculations on core hours. Table 1 shows the availability values in the money collection example, including the amount of time that applications were unavailable due to outages. (Note: the impact of capacity issues on availability is not considered in this analysis; capacity is assumed to be a non-issue for availability.)
Table 1: Availability Values of Money Collection Options

| Application or Component | SLA Minutes (based on core hours) | Outage Minutes | Availability Percent |
| --- | --- | --- | --- |
| Banking application | 8 a.m. to 6 p.m. = 600 minutes | 5 minutes | 99.17 percent |
| Third-party administration application | 8 a.m. to 6 p.m. = 600 minutes | 10 minutes | 98.33 percent |
| Voice recognition application | 8 a.m. to 6 p.m. = 600 minutes | 5 minutes | 99.17 percent |
| Website | 24 hours (24/7 service) = 1,440 minutes | 15 minutes | 98.96 percent |
| Payment system database | 24 hours (24/7 service) = 1,440 minutes | 6 minutes | 99.58 percent |
| Account information system database | 24 hours (24/7 service) = 1,440 minutes | 3 minutes | 99.79 percent |
| Overall | 6,120 minutes | 44 minutes | 99.28 percent |

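The arithmetic behind Table 1 can be reproduced with a short script. This is a minimal sketch of the traditional (simple aggregation) calculation, assuming all figures cover a single day; the names `COMPONENTS`, `availability_percent` and `simple_aggregate` are illustrative and do not come from any particular tool.

```python
# Core-hour (SLA) minutes and outage minutes per component, taken from Table 1.
COMPONENTS = {
    "Banking application": (600, 5),
    "Third-party administration application": (600, 10),
    "Voice recognition application": (600, 5),
    "Website": (1440, 15),
    "Payment system database": (1440, 6),
    "Account information system database": (1440, 3),
}


def availability_percent(sla_minutes, outage_minutes):
    """Return availability as a percentage of SLA (core-hour) minutes."""
    return 100.0 * (sla_minutes - outage_minutes) / sla_minutes


def simple_aggregate(components):
    """Traditional method: pool all SLA minutes and all outage minutes."""
    total_sla = sum(sla for sla, _ in components.values())
    total_outage = sum(outage for _, outage in components.values())
    return total_sla, total_outage, availability_percent(total_sla, total_outage)


for name, (sla, outage) in COMPONENTS.items():
    print(f"{name}: {availability_percent(sla, outage):.2f} percent")

total_sla, total_outage, overall = simple_aggregate(COMPONENTS)
print(f"Overall: {total_sla} minutes, {total_outage} outage minutes, {overall:.2f} percent")
```

Because the pooled denominator (6,120 minutes) counts the same clock time several times over, the overall figure stays high even when individual components have outages.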
Outside In View
The data in Table 1 is the availability from the IT systems’ perspective. If the front-end applications are up all the time, but users are unable to complete transactions because of infrastructure failures or database issues, the system is still unavailable to users and the business. In such a case, claiming high availability values for the front-end applications is misleading. For the end user, the system is available only if the entire business process is completed successfully. Because of this, businesses must use the outside in view of the availability metric to define availability SLAs at the business-function level. Using this approach, in the process of collecting money through a website, the SLA should be met only if all components in the business process are up and running and users are able to execute the business process successfully.
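One way to express the outside in view is to treat the business function as down whenever any component it depends on is down. The sketch below merges component outage windows into function-level downtime. The outage durations match the website, payment database and account database rows of Table 1, but the start times are invented for illustration, and the function names are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) minute intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous window: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def function_downtime(outages_by_component):
    """Minutes during which at least one required component was down."""
    all_intervals = [iv for ivs in outages_by_component.values() for iv in ivs]
    return sum(end - start for start, end in merge_intervals(all_intervals))


# Hypothetical outage windows, in minutes since midnight. Durations are the
# 15, 6 and 3 minutes from Table 1; the start times are assumptions.
outages = {
    "Website": [(120, 135)],
    "Payment system database": [(130, 136)],
    "Account information system database": [(900, 903)],
}

SLA_MINUTES = 1440  # 24-hour core hours for the website channel
downtime = function_downtime(outages)
availability = 100.0 * (SLA_MINUTES - downtime) / SLA_MINUTES
print(f"Function-level downtime: {downtime} minutes, availability {availability:.2f} percent")
```

With these assumed windows, the website and payment database outages partly overlap, so the user-visible downtime is shorter than the sum of the component outages but longer than any single one.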
Rolled Throughput Method
To meet the business-function-level SLA, all components need to meet their SLAs individually as well as collectively. If one component is down, the process cannot be completed, so the system must be treated as unavailable to users. That means businesses must use the rolled throughput method for calculating function-level availability, instead of simply aggregating all SLA minutes and outage minutes for all components. In the case of money collection through a website, the SLA is 24/7 availability; the website itself had 15 minutes of outages, and the two databases it depends on had a further 6 and 3 minutes. If none of these outages overlap, the process was unavailable for 24 minutes; if they all overlap, it was unavailable for 15 minutes. Table 2 illustrates the difference between the simple aggregation method and the rolled throughput method for calculating availability at the function level.
Table 2: Traditional and Rolled Throughput Availability Values

| Calculation Method | Application or Component | SLA Minutes (based on core hours) | Outage Minutes | Availability Percent |
| --- | --- | --- | --- | --- |
| Traditional – simple aggregation | Website | 24 hours (24/7 service) = 1,440 minutes | 15 minutes | 98.96 percent |
| Traditional – simple aggregation | Payment system database | 24 hours (24/7 service) = 1,440 minutes | 6 minutes | 99.58 percent |
| Traditional – simple aggregation | Account information system database | 24 hours (24/7 service) = 1,440 minutes | 3 minutes | 99.79 percent |
| Traditional – simple aggregation | Money collection process using website | 4,320 minutes | 24 minutes | 99.44 percent |
| Rolled throughput | Website | 24 hours (24/7 service) = 1,440 minutes | 15 minutes | 98.96 percent |
| Rolled throughput | Payment system database | 24 hours (24/7 service) = 1,440 minutes | 6 minutes | 99.58 percent |
| Rolled throughput | Account information system database | 24 hours (24/7 service) = 1,440 minutes | 3 minutes | 99.79 percent |
| Rolled throughput | Money collection process using website (no overlapping outages) | 1,440 minutes | 24 minutes | 98.33 percent |
| Rolled throughput | Money collection process using website (all outages overlapping) | 1,440 minutes | 15 minutes | 98.96 percent |

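The two rolled throughput rows in Table 2 are the bounds on function-level availability when only outage durations are known, not their timing. The following sketch, with a hypothetical helper named `rolled_throughput_bounds`, computes both bounds from the website-channel figures and compares them with the simple aggregation result.

```python
def availability_percent(sla_minutes, outage_minutes):
    """Return availability as a percentage of SLA (core-hour) minutes."""
    return 100.0 * (sla_minutes - outage_minutes) / sla_minutes


def rolled_throughput_bounds(sla_minutes, component_outages):
    """Return (worst_case, best_case) function-level availability.

    Worst case: no outages overlap, so downtime is the sum of all outages
    (capped at the SLA window). Best case: all outages overlap, so downtime
    is the longest single outage.
    """
    worst_downtime = min(sum(component_outages), sla_minutes)
    best_downtime = max(component_outages)
    return (
        availability_percent(sla_minutes, worst_downtime),
        availability_percent(sla_minutes, best_downtime),
    )


# Website, payment system database, account information system database.
website_channel_outages = [15, 6, 3]

worst, best = rolled_throughput_bounds(1440, website_channel_outages)
traditional = availability_percent(3 * 1440, sum(website_channel_outages))

print(f"Simple aggregation:          {traditional:.2f} percent")
print(f"Rolled throughput (worst):   {worst:.2f} percent")
print(f"Rolled throughput (best):    {best:.2f} percent")
```

The simple aggregation figure is higher than both bounds because it divides the same outage minutes by three times as many SLA minutes.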
Finally, the availability of the overall money collection process should be calculated by aggregating availability across all the different channels. This aggregation must be done using the weighted average method (Table 3), so that channels that matter more to the business count for more in the overall figure.
Table 3: Weighted Availability Values

| Application or Component | Weight (1 to 5, 5 is max) | SLA Minutes (based on core hours) | Outage Minutes (no overlaps) | Availability Percent |
| --- | --- | --- | --- | --- |
| Money collection using banking application | 4 | 600 minutes | 14 minutes | 97.66 percent |
| Money collection using third-party administration application | 2 | 600 minutes | 19 minutes | 96.83 percent |
| Money collection using voice recognition application | 4 | 600 minutes | 14 minutes | 97.66 percent |
| Money collection using website | 5 | 1,440 minutes | 24 minutes | 98.33 percent |
| Overall money collection function | | | | [(4 x 97.66) + (2 x 96.83) + (4 x 97.66) + (5 x 98.33)] ÷ (4 + 2 + 4 + 5) = 1,466.59 ÷ 15 = 97.77 percent |

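The weighted average in Table 3 can be checked with a few lines of code. This is a sketch using the weights, SLA minutes and no-overlap outage minutes from the table; the names `CHANNELS` and `weighted_availability` are illustrative.

```python
# (channel, weight, SLA minutes, outage minutes assuming no overlaps), from Table 3.
CHANNELS = [
    ("Banking application", 4, 600, 14),
    ("Third-party administration application", 2, 600, 19),
    ("Voice recognition application", 4, 600, 14),
    ("Website", 5, 1440, 24),
]


def channel_availability(sla_minutes, outage_minutes):
    """Rolled throughput availability of one channel, as a percentage."""
    return 100.0 * (sla_minutes - outage_minutes) / sla_minutes


def weighted_availability(channels):
    """Weighted average of per-channel availability percentages."""
    total_weight = sum(weight for _, weight, _, _ in channels)
    weighted_sum = sum(
        weight * channel_availability(sla, outage)
        for _, weight, sla, outage in channels
    )
    return weighted_sum / total_weight


overall = weighted_availability(CHANNELS)
print(f"Overall money collection function: {overall:.2f} percent")
```

Run on unrounded per-channel values, this gives roughly 97.78 percent. Table 3 reports 97.77 percent because its per-channel figures are cut to two decimals (for example, 97.66 rather than 97.67) before averaging.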
For the same number of outage minutes, the overall percentage of availability for a business function (97.77 percent) is much lower than the overall percentage of availability calculated by the traditional method (99.28 percent).
That explains why businesses’ experiences with and perceptions of availability numbers are not always in line with those of the IT systems. By using the outside in and rolled throughput concepts, however, measurement errors in availability calculations can be minimized and a method that is closer to the business and user experience can be devised.