If IT's Down, the Heat Is Up

Michael Hamelin
Ideal metrics don't just make you look good -- they provide some indication of what is wrong or out of control as soon as, or even before, problems arise so you can correct them and improve your operational efficiency. By combining myriad metric results, you can generate a "risk score" that provides instant visibility into the security and compliance posture of all your firewalls, including regulatory compliance such as PCI-DSS reporting. The same score can be used proactively for enhanced workflow automation. Sounds good, doesn't it? So where do you start?

Performance Monitoring

The first step is to examine your rules, routers and firewalls to identify which are most susceptible to risk, which you rely on the most and which you need the most from. This will help you prioritize where to focus your resources.

A standard firewall metric that will probably spring to mind is "availability." It tells you about the performance of the box, for example, 99.9 percent up. This is obviously a good metric to track; however, in my opinion, it has limited applicability on its own. Although it conveys that everything's fine with only 0.1 percent downtime, it doesn't tell you what went wrong, how to fix it or how to improve performance and avoid it happening again. It simply states the obvious -- that valuable uptime was missed.
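To put the 99.9 percent figure in perspective, here is a minimal Python sketch that converts an availability percentage into the downtime it implies. The function name and the 365-day default period are assumptions made for illustration, not part of any particular monitoring product.

```python
def downtime_minutes(availability_percent: float, period_days: float = 365.0) -> float:
    """Return the minutes of downtime implied by an availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    # The unavailable fraction of the period is (100 - availability) / 100.
    return total_minutes * (100.0 - availability_percent) / 100.0


if __name__ == "__main__":
    print(f"{downtime_minutes(99.9):.1f} minutes per year")
    print(f"{downtime_minutes(99.9, period_days=30):.1f} minutes per 30 days")
```

Under these assumptions, 99.9 percent works out to roughly 525.6 minutes, or close to nine hours, of downtime over a year. The number still says nothing about when those minutes occurred or what caused them.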

That's not to say that basic metrics aren't valuable. Some standard baseline performance metrics that deliver exceptionally useful data, and that every firewall team should be tracking, are CPU utilization, memory utilization, connections passed, connections dropped and simultaneous connections. These are all dimensions that are important when examining your firewall's current performance and comparing it with how it behaved previously -- yesterday, last week or last month -- to determine if there's a significant change warranting further investigation. These are also key components for a capacity-planning exercise to pinpoint if a firewall is overloaded. Performance metrics may indicate that a hardware upgrade is needed, but it is worth first checking whether the firewall configuration can be optimized, as there may be underutilized capacity elsewhere.
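One simple way to decide whether a reading is a "significant change" is to compare it with the spread of earlier readings. The sketch below is only an illustration: the function name, the sample CPU figures and the three-standard-deviation threshold are all assumptions, and a real deployment would tune the threshold per metric.

```python
from statistics import mean, stdev


def deviates_from_baseline(history, current, threshold=3.0):
    """Return True when `current` is more than `threshold` standard
    deviations away from the mean of the historical samples."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any different value counts as a change.
        return current != baseline
    return abs(current - baseline) / spread > threshold


if __name__ == "__main__":
    # Hypothetical daily CPU utilization percentages for the past week.
    cpu_last_week = [41, 39, 40, 42, 38, 40, 41]
    print(deviates_from_baseline(cpu_last_week, 78))
    print(deviates_from_baseline(cpu_last_week, 42))
```

The same check can be applied to memory utilization, dropped connections or simultaneous connections by feeding in the corresponding history.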

A more sophisticated way to track firewall performance is to use an external testing product that streams traffic through the firewall to a collector and records the throughput, latency and jitter that the firewall and network impose on this packet stream. This live bandwidth monitoring can be an important part of understanding whether a firewall is cleanly passing performance-sensitive traffic such as VoIP and videoconferencing traffic.
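As a rough illustration of what such a collector computes, the following Python sketch derives the three figures from a list of probe packets. The record layout (send time, receive time, size) is a hypothetical one, and jitter is taken here as the mean absolute difference between consecutive latencies, a simplification rather than the smoothed estimator that commercial test tools may use.

```python
from statistics import mean


def stream_stats(packets):
    """Summarize a probe stream.

    `packets` is a list of (sent_s, received_s, size_bytes) tuples,
    ordered by send time, with times in seconds.
    """
    if len(packets) < 2:
        raise ValueError("need at least two packets")
    latencies = [received - sent for sent, received, _ in packets]
    # Simplified jitter: average change in latency between neighbours.
    jitter = mean(abs(b - a) for a, b in zip(latencies, latencies[1:]))
    # Measure from the first send to the last arrival.
    duration = max(received for _, received, _ in packets) - packets[0][0]
    total_bits = 8 * sum(size for _, _, size in packets)
    return {
        "latency_ms": mean(latencies) * 1000,
        "jitter_ms": jitter * 1000,
        "throughput_bps": total_bits / duration,
    }


if __name__ == "__main__":
    # Three hypothetical 1,000-byte probes sent 20 ms apart.
    probes = [(0.00, 0.010, 1000), (0.02, 0.032, 1000), (0.04, 0.049, 1000)]
    for name, value in stream_stats(probes).items():
        print(f"{name}: {value:.2f}")
```

For voice and video traffic, the jitter figure is often the most telling of the three, since a stream can have acceptable average latency and still sound poor if the latency varies from packet to packet.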

Change Management Monitoring

Nothing stays the same for long, and as your IT environment changes, so does your firewall. You need to change, create, disable or even delete rules. Change can affect availability, either positively or negatively, and as this is one of the main things a firewall must provide, metrics that provide meaningful data that can be acted upon are invaluable.

Configuration updates happen in a number of ways:

  • planned changes.
  • unplanned or emergency (out-of-cycle) changes.
  • changes with no authorization, sometimes referred to as "cowboy changes" -- someone logs in, makes a change and leaves no documentation, either before or afterward, of what was done.

Firewalls do not have a change-management process built into them, so documenting changes has never become a best (or even a standard) practice for many organizations. If a firewall administrator makes a change because of an emergency or some other business disruption, chances are he is under pressure to make it happen as quickly as possible, and process goes out the window. But what if this change cancels out a prior policy change, resulting in downtime? By monitoring the number of planned versus unplanned changes, you can determine how well the team is pre-empting the users' requirements and proactively managing the firewalls versus "seat of the pants" updates. A great metric is the percentage of changes resulting in outages, as this provides feedback on how well the operational team understands the changes they're making and their impact, and whether they're using some method or tool to verify changes before they're made.
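Both change metrics can be computed from a simple change log. The sketch below assumes a hypothetical record with two fields, the kind of change and whether it caused an outage; the field names and category labels are illustrative, not taken from any specific change-management tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Change:
    kind: str            # "planned", "emergency" or "unauthorized"
    caused_outage: bool


def change_metrics(changes):
    """Return planned/unplanned percentages and the share of changes
    that resulted in an outage."""
    if not changes:
        raise ValueError("no changes recorded")
    total = len(changes)
    by_kind = Counter(change.kind for change in changes)
    outages = sum(1 for change in changes if change.caused_outage)
    return {
        "planned_pct": 100.0 * by_kind["planned"] / total,
        # Anything that was not planned counts as unplanned here.
        "unplanned_pct": 100.0 * (total - by_kind["planned"]) / total,
        "outage_pct": 100.0 * outages / total,
    }


if __name__ == "__main__":
    # Hypothetical month: 7 planned, 2 emergency, 1 unauthorized change.
    log = (
        [Change("planned", False)] * 6
        + [Change("planned", True)]
        + [Change("emergency", False)] * 2
        + [Change("unauthorized", True)]
    )
    print(change_metrics(log))
```

Tracking these figures month over month shows whether the share of unplanned changes, and of changes that break something, is shrinking or growing.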

Another really useful metric, although rarely tracked, is the mean time to recovery (MTTR) -- in other words, how fast the team restored service after each outage. This metric is a good gauge of your team's familiarity with and understanding of the firewall's configuration, and of whether that understanding is improving or diminishing. A rising MTTR could also be an indicator that the configuration is getting complex or unruly. If you've read "The Visible Ops Handbook," you'll remember that 80 percent of all outages are caused by configuration adjustments and that 80 percent of the MTTR is spent identifying what changed. It therefore stands to reason that if the team understands exactly what happened, they should be able to isolate the failure point within a minute and restore service in less than five. Ultimately the goal is to eliminate downtime in the first place.
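MTTR is straightforward to compute once outage start and end times are recorded. The following Python sketch assumes a hypothetical list of (start, end) timestamp pairs; the function name and sample times are made up for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(outages):
    """Return the average outage duration as a timedelta.

    `outages` is a list of (start, end) datetime pairs.
    """
    if not outages:
        raise ValueError("no outages recorded")
    total = sum((end - start for start, end in outages), timedelta())
    return total / len(outages)


if __name__ == "__main__":
    # Three hypothetical outages lasting 12, 4 and 8 minutes.
    outages = [
        (datetime(2012, 8, 1, 9, 0), datetime(2012, 8, 1, 9, 12)),
        (datetime(2012, 8, 9, 14, 30), datetime(2012, 8, 9, 14, 34)),
        (datetime(2012, 8, 20, 23, 50), datetime(2012, 8, 20, 23, 58)),
    ]
    print(mean_time_to_recovery(outages))
```

Comparing this figure across quarters gives a trend line for how quickly the team is able to find and reverse the change behind each outage.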
