Unlike other shifts in IT, the broad concept of uptime is well understood. What confuses many IT pros is the myriad options for ensuring application availability and achieving faster recovery times, consistent performance, and business continuity. From backup and disaster recovery to high availability, fault tolerance and more, there is widespread confusion about how the types of availability solution differ and how IT teams can determine the best option for their business.
This gets even more complex because each category is defined very broadly within the industry. Take “high availability,” for instance. Last year, a highly respected analyst firm published a survey in which the majority of respondents said they believed that high availability meant having a disaster recovery plan in place. To someone who lives and breathes availability solutions, these are two very different things. Definitions also vary from person to person based on their history with different computing platforms: IT professionals working on mainframes will define high availability very differently than those working in DevOps. On top of this, research firm IDC has for years used its own set of availability levels (AL1-AL4), but these are very broad. Most technologies fall within just one of the levels, and the levels haven’t changed as technology has evolved.
In this slideshow, Jason Andersen, vice president of business line management, Stratus Technologies, will demystify the seven key types of availability solutions by clarifying what each one actually means. Keep in mind that when evaluating what you need, you should consider not only which data is being protected but also the recovery time and infrastructure costs (mainly processing and networking) that your business can support.
Availability Solutions Defined
Unprotected
This one is easy to understand: it is a workload with no special reliability features implemented at the application, hypervisor or infrastructure layer. If it goes down, it’s down. A failure has significant end-user impact, generally measured in hours of downtime.
Backup
This is a workload that is periodically copied (or snapshotted) to a different node or data center. It is a good compliance measure and can help recover data, provided you can wait hours or more. This solution also has significant end-user impact, generally measured in hours of downtime.
Disaster Recovery
This is a more robust form of backup that is automated for quicker recovery after a major failure, such as one caused by human error or a weather-related data center outage. Again, this solution has significant end-user impact, generally measured in hours of downtime.
Automated High Availability
This type of availability solution is very common in the virtualized world. When there is a failure, a new instance of the workload is redeployed to a new node or data center. A common implementation of this is VMware’s High Availability (HA) feature. This approach has minimal infrastructure impact, but user interruption is fairly high and all in-flight data is lost. Downtime is generally measured in seconds to minutes. Automated high availability is a good solution for load-balanced, scaled-out applications such as web servers.
Instant High Availability
This type of solution is common among clusters in the bare metal world, or redundant instances and replicated storage in the virtualized world. By definition, a cluster is a collection of servers that operate as if they were a single machine, with the primary purpose of providing uninterrupted access to the data, even if the server, or an application running on the server, loses connectivity or fails completely. With this type of availability solution, there is very little end-user impact because the interruption of service is minimal (sub-second in some cases). However, any in-flight data and/or transactions are lost if the system goes down. If your application is stateless but not load-balanced, this is a great solution for you. Instant high availability clusters are commonly used for file sharing and collaboration apps (e.g., email, document management).
Fault Tolerance
Fault tolerance is complete redundancy of the workload, including its in-flight data and application state. This means that operation is continuous and uninterrupted even in the event of a failure. Put simply, there is zero end-user impact because there is no downtime, period. This capability was once found only in the mainframe and minicomputer world. However, software and cloud solutions are now available that provide this level of protection to off-the-shelf operating systems and hypervisors at a price point comparable to lower protection levels. Fault-tolerant solutions are used in situations where system failure could be devastating, such as 911/public safety emergency services, or in the financial services industry for transaction processing systems that support the global economy.
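The difference between losing and keeping in-flight data can be illustrated with a toy model. The sketch below is not how any particular product works; the function name, the batch-commit behavior and the shared storage list are all assumptions made for illustration. It treats committed transactions as living on replicated storage, and it treats "fault tolerance" as additionally mirroring uncommitted transactions to the standby.

```python
def survivor_state(transactions, commit_every, fail_after, mirror_in_flight):
    """Toy model: return the transactions still known after the primary fails.

    The primary accepts `fail_after` transactions and then fails.
    Every `commit_every` accepted transactions are committed to storage
    that both nodes can read (hypothetical replicated storage).
    If `mirror_in_flight` is True, uncommitted transactions are also
    mirrored to the standby, as in the fault tolerance description.
    """
    shared_storage = []      # survives the primary's failure
    primary_in_flight = []   # lost when the primary fails
    standby_in_flight = []   # populated only when state is mirrored

    for tx in transactions[:fail_after]:
        primary_in_flight.append(tx)
        if mirror_in_flight:
            standby_in_flight.append(tx)
        if len(primary_in_flight) == commit_every:
            shared_storage.extend(primary_in_flight)
            primary_in_flight.clear()
            standby_in_flight.clear()

    # The primary fails here; only storage and the standby's memory remain.
    return shared_storage + standby_in_flight


txs = ["t1", "t2", "t3", "t4", "t5"]
ha_result = survivor_state(txs, commit_every=2, fail_after=5, mirror_in_flight=False)
ft_result = survivor_state(txs, commit_every=2, fail_after=5, mirror_in_flight=True)
print(ha_result)  # ['t1', 't2', 't3', 't4']
print(ft_result)  # ['t1', 't2', 't3', 't4', 't5']
```

In this model the high availability standby recovers everything that was committed but loses the transaction that was still in flight, while the mirrored standby loses nothing.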
Multi-Site Fault Tolerance
This is the highest level of protection a workload can get, as it provides fault-tolerant availability across two different geographic locations. With this type of fault tolerance, there is zero end-user impact and no loss of state or data, and the redundant workloads are hosted at different sites (the sites can be in different rooms or floors of the same building, in separate buildings on a campus, or even in different cities). This ensures your applications continue to run even if one of the sites fails due to power issues, flooding, etc. Naturally, this type of solution carries a higher network cost, but when no downtime at all is acceptable, it is the best solution available. In regulated industries like pharmaceuticals, manufacturing and financial services, site-wide downtime can lead to breakdowns in a distributed supply chain and place process compliance at risk, which compounds your downtime costs. That’s why multi-site fault tolerance is the best availability option to ensure that all in-flight data is safely replicated and remains available at all times.
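The seven categories above can be summarized as a small lookup table. The following is only a sketch: the names `Downtime`, `Tier` and `cheapest_tier` are hypothetical, the downtime classes are coarse readings of the descriptions above, and the table assumes that the order in which the categories are presented is also the order of increasing infrastructure cost. It does not model data recoverability, which is what separates the first three categories from one another.

```python
from dataclasses import dataclass
from enum import IntEnum


class Downtime(IntEnum):
    """Coarse downtime classes, ordered from best (0) to worst (3)."""
    ZERO = 0
    MINIMAL = 1              # sub-second in some cases
    SECONDS_TO_MINUTES = 2
    HOURS = 3


@dataclass(frozen=True)
class Tier:
    name: str
    downtime: Downtime
    keeps_in_flight_data: bool
    runs_through_site_failure: bool


# Assumption: listed in order of increasing infrastructure cost.
TIERS = [
    Tier("Unprotected", Downtime.HOURS, False, False),
    Tier("Backup", Downtime.HOURS, False, False),
    Tier("Disaster Recovery", Downtime.HOURS, False, False),
    Tier("Automated High Availability", Downtime.SECONDS_TO_MINUTES, False, False),
    Tier("Instant High Availability", Downtime.MINIMAL, False, False),
    Tier("Fault Tolerance", Downtime.ZERO, True, False),
    Tier("Multi-Site Fault Tolerance", Downtime.ZERO, True, True),
]


def cheapest_tier(max_downtime, need_in_flight_data=False,
                  need_site_redundancy=False):
    """Return the first tier in TIERS that meets every stated requirement."""
    for tier in TIERS:
        if tier.downtime > max_downtime:
            continue
        if need_in_flight_data and not tier.keeps_in_flight_data:
            continue
        if need_site_redundancy and not tier.runs_through_site_failure:
            continue
        return tier
    raise ValueError("no tier satisfies the requirements")


print(cheapest_tier(Downtime.SECONDS_TO_MINUTES).name)
print(cheapest_tier(Downtime.HOURS, need_in_flight_data=True).name)
```

Picking the first match rather than the strongest tier reflects the advice to weigh protection against the recovery time and infrastructure costs the business can support.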










