SLA Tips and Tricks

Alex Bewley
An infrastructure service-level agreement (SLA) means different things to different people. As recently as a few years ago, it meant being fully ITIL compliant, with complicated contracts, protracted negotiations between IT and the business, and rigorous (almost excessive) reporting. This pervasive mindset proved to be a great barrier to anybody in the mid-enterprise trying to implement SLAs. It was compounded by the fact that the metrics relevant to the lines of business (that is, the application owners) were frequently different from those collected by IT. Think of a typical corporate application like e-mail, where users simply state, 'E-mail doesn't work, what's wrong?' and IT provides responses like, 'Well, the MTA gateway is fine, the network links are within acceptable bandwidth ranges, and I can ping the Net.' These metrics are nice, but in the eyes of the end-user the response is, 'E-mail doesn't work! Gee thanks, IT, that's not useful information!' There needs to be a consensus between end-users and IT about what 'working' means.

Nowadays, there is a more collaborative mindset between IT and the lines-of-business, and a greater communal understanding of relevant reporting metrics. This collaboration is extremely important, as virtualization and now cloud computing have made understanding and monitoring applications far more complex. Not only does IT have to worry about monitoring applications that exist on internal physical infrastructure, but the applications can now be spread over physical, virtual and cloud-based assets. Add in the extra complication that the infrastructure may dynamically change, and it's "Advil time" for IT.

Applications and their successful delivery have historically been defined by performance and availability, meaning that applications must not only be functional but also usable and responsive. E-commerce vendors such as Amazon, Zappos and Neiman Marcus are acutely aware of this: any degradation in performance means that customers go spend their money somewhere else. Their SLAs would encompass things like the availability of websites, network infrastructure, back-end databases and transaction gateways, as well as the performance of specific actions that an end-user would perform on a website, for example checking out, looking up a price, listing items in the shopping cart or comparing items. A very easy way to capture a general sense of how the complete infrastructure is performing would be to monitor the dollars spent per minute by customers. If this number drops below the expected range, then something is wrong. No amount of per-service up or down status from IT is going to give you this holistic overview.
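
As a rough illustration of the dollars-per-minute idea, here is a minimal Python sketch. The function name, the sample figures and the "three standard deviations below the mean" rule are all assumptions made for this example; a real site would also need to account for time-of-day and seasonal patterns in spending.

```python
from statistics import mean, stdev

def revenue_alert(history, current, num_sigmas=3.0):
    """Return True when the current dollars-per-minute figure falls
    below the band expected from recent history.

    history    -- recent dollars-per-minute samples (hypothetical data)
    current    -- the latest dollars-per-minute reading
    num_sigmas -- how far below the mean counts as abnormal (an assumption)
    """
    expected = mean(history)
    spread = stdev(history)
    return current < expected - num_sigmas * spread

# Made-up sample data, for illustration only.
history = [1000.0, 1040.0, 980.0, 1010.0, 995.0, 1025.0]

print(revenue_alert(history, 1005.0))  # a normal minute
print(revenue_alert(history, 600.0))   # a sharp drop in spending
```

The point of the sketch is that a single business-level number can flag trouble even when every individual component reports itself as healthy.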

An additional dimension around SLAs that will emerge over the next few years is cost. As applications are deployed across the cloud, it will be possible to dynamically move them onto more cost-effective platforms as cloud pricing changes. This concept of economic computing will become very interesting and will be explored in future blog postings.

Now, getting down to the details, what are the things that we need to do to create SLAs quickly and effectively?

Knowing that we need to keep users' expectations in check, where do you start? People seldom realize that five nines availability (99.999 percent availability) is ridiculously hard to achieve and comes at a prohibitive cost, which most business users will not pay. Additionally, not all applications are 24/7; many run on reduced work schedules, and you'll need to account for this in the reporting. A common problem that people have when going down the SLA path is simply figuring out where to begin. For IT, it's important to be able to back-test an SLA against historical performance and availability data. Sometimes 'good enough' is, well, good enough.
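
To put numbers on this, the short Python sketch below converts an availability target into a yearly downtime allowance. The function name and the 52-week year are simplifications chosen for this example, and the 50-hour service week is a hypothetical reduced schedule.

```python
def allowed_downtime_minutes(availability_pct,
                             service_hours_per_week=168.0,
                             weeks=52.0):
    """Minutes of downtime per year permitted by an availability target.

    service_hours_per_week defaults to 24/7 (168 hours); pass a smaller
    value for applications that only run on a reduced work schedule.
    """
    total_minutes = service_hours_per_week * weeks * 60
    return total_minutes * (1 - availability_pct / 100)

# Five nines on a 24/7 service: a little over five minutes per year.
print(round(allowed_downtime_minutes(99.999), 2))

# Three nines on a hypothetical 50-hour service week.
print(round(allowed_downtime_minutes(99.9, service_hours_per_week=50), 2))
```

Seen this way, five nines leaves roughly five minutes a year for every outage, patch and failover combined, which is why the cost is so high.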

Defining Metrics - Before even creating an SLA document or report, we need to make sure we've defined which metrics indicate the availability and performance of the service we're providing. We also have to make sure our monitoring solution can actively monitor those metrics.

Baselining Current Service Level - Once monitoring of key business applications has been defined, we need to get a baseline of how we're currently performing before we commit to providing a level of service that we cannot possibly achieve. Being able to back-test SLAs for a given set of performance and availability metrics is key. For example, if end-user response time is a key component of an SLA, but it varies significantly over a period of time, it is easy to back-test when the SLA would be violated for a given performance threshold. This allows us to negotiate with an application owner for values that 'make sense.' 
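
A back-test of this kind can be sketched in a few lines of Python. Everything here is hypothetical: the function name, the weekly grouping and the sample response times are invented for illustration, and the rule used (a period fails when too few requests finish within the threshold) is only one of several reasonable ways to define a violation.

```python
def backtest_sla(samples, threshold_ms, target_pct):
    """Replay historical response times against a proposed SLA.

    samples      -- list of (period_label, [response times in ms])
    threshold_ms -- the response time the SLA would promise
    target_pct   -- percentage of requests that must meet the threshold
    Returns the periods that would have violated the SLA, with the
    percentage of requests that actually met the threshold.
    """
    violations = []
    for label, times in samples:
        within = sum(1 for t in times if t <= threshold_ms)
        pct = 100.0 * within / len(times)
        if pct < target_pct:
            violations.append((label, pct))
    return violations

# Made-up history, for illustration only.
history = [
    ("week 1", [200, 250, 300, 900]),
    ("week 2", [210, 220, 230, 240]),
]

print(backtest_sla(history, threshold_ms=500, target_pct=90))
print(backtest_sla(history, threshold_ms=1000, target_pct=90))
```

Running the same history against several candidate thresholds shows which values the service has actually met in the past, which is a useful starting point for the negotiation.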

Proactive SLA Management - Once an SLA is created with objectives (SLOs) that define the availability and performance of an application, it's useful to get instant visibility on an SLA dashboard. It's also possible to set up SLA alerting, so that alerts are generated when an issue occurs and starts to affect SLA performance. This is especially important when an SLA is trending toward violation, as it gives IT a chance to rectify an operational situation before getting penalized.
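
One simple way to detect an SLA that is trending toward violation is to project the downtime seen so far across the whole reporting period. The Python sketch below assumes a straight-line projection and a monthly availability target; the function name and status labels are invented for this example, and a monitoring product may use a different method.

```python
def sla_status(downtime_min_so_far, elapsed_days, period_days, target_pct):
    """Classify an availability SLA as 'ok', 'at risk' or 'violated'.

    The downtime budget is the number of minutes the target allows over
    the full period. The projection assumes downtime keeps accumulating
    at the rate seen so far (a simplifying assumption).
    """
    budget = period_days * 24 * 60 * (1 - target_pct / 100)
    projected = downtime_min_so_far * period_days / elapsed_days
    if downtime_min_so_far > budget:
        return "violated"
    if projected > budget:
        return "at risk"
    return "ok"

# A 99.9 percent target over 30 days allows about 43 minutes of downtime.
print(sla_status(10, elapsed_days=10, period_days=30, target_pct=99.9))
print(sla_status(20, elapsed_days=10, period_days=30, target_pct=99.9))
print(sla_status(50, elapsed_days=10, period_days=30, target_pct=99.9))
```

An "at risk" result is the useful one operationally: the SLA has not been broken yet, so there is still time to act.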

Quick SLA Reports - There are two kinds of reporting around SLAs: historical over a long period of time and day-to-day operational. In the first case, IT can demonstrate how well IT services were doing on a month-to-month basis over a year or two. In the second case, IT operations can do daily scrums to identify common infrastructure problems that might impact many different applications and SLAs. Prioritization of root-cause analysis then becomes very easy to do.
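
For the historical, month-to-month view, a report can be as simple as rolling outage records up into one availability figure per month. The Python sketch below assumes outages are already recorded as (month, downtime minutes) pairs; the function name, record layout and sample figures are all hypothetical.

```python
from collections import defaultdict

def monthly_availability(outages, minutes_in_month):
    """Summarize outage records as availability percentages per month.

    outages          -- list of ('YYYY-MM', downtime in minutes)
    minutes_in_month -- dict of 'YYYY-MM' -> scheduled service minutes
    """
    downtime = defaultdict(float)
    for month, minutes in outages:
        downtime[month] += minutes
    return {
        month: round(100.0 * (1 - downtime[month] / minutes_in_month[month]), 3)
        for month in sorted(minutes_in_month)
    }

# Made-up records, for illustration only.
minutes_in_month = {"2011-04": 43200, "2011-05": 44640}
outages = [("2011-04", 20.0), ("2011-04", 23.2), ("2011-05", 0.0)]

for month, pct in monthly_availability(outages, minutes_in_month).items():
    print(month, pct)
```

Passing scheduled service minutes rather than calendar minutes lets the same report cover applications that do not run 24/7.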

In summary, SLAs have become much easier to implement, manage and enjoy than in the past. With appropriate tooling and a collaborative mindset, IT can quickly demonstrate value around keeping applications available and performing well. Application environments are only going to get more complicated, so make sure you're staying on top.

Jun 27, 2011 7:06 AM Miller Palaniswamy says:
nice article thanks