Bullet point #1 in your executive-friendly PowerPoint about “Achieving Operational Excellence in IT” covers Process and Procedure; so how do we measure our effectiveness? I’m a big proponent of Metrics and Measurements as well – but oftentimes the biggest challenge is knowing where to start.
Measure the Unmeasured
In most organizations (especially manufacturers!), the business has plenty of Key Performance Indicators (KPIs) that tell them how much productivity they are seeing, how much money they are saving, and how they are driving out variable costs. IT metrics don’t need to focus on those things – and it’s often difficult to get the business to share their control of the message.
Better to just focus on the behavior, performance, and availability of the system. For a start, try tracking uptime/availability. What percent of the year is the system available, with no problems? To be fair, you should define the actual expected hours of service: is the system really expected to be 24×7? Or is it critical that it be available and working from 4am to 9am – to get the day started, schedule the shift, print the reports, etc.? These metrics help tremendously when the system does experience a hiccup; for end users up in arms over the lack of computing services this morning, point out that the system has had 99.9% uptime over the past year or so. Most folks understand the “five 9’s” concept, where each additional nine of uptime costs roughly an order of magnitude more $$.
For example … this system is only used from 6 to 6, and never on the weekends. You didn’t budget for high-availability / clustered / failover / megaservers, did you?
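To make the uptime idea concrete, here is a minimal sketch of availability measured only against the expected hours of service, using a 6-to-6, weekdays-only window like the one in this example. The function names, the outage list, and the dates are all hypothetical; it also assumes outages do not overlap one another.

```python
from datetime import datetime, time, timedelta


def service_seconds(start, end, open_t, close_t, weekdays_only=True):
    """Seconds between start and end that fall inside the daily service window."""
    total = 0.0
    day = start.date()
    while day <= end.date():
        # weekday() returns 5 for Saturday and 6 for Sunday
        if not (weekdays_only and day.weekday() >= 5):
            lo = max(start, datetime.combine(day, open_t))
            hi = min(end, datetime.combine(day, close_t))
            if hi > lo:
                total += (hi - lo).total_seconds()
        day += timedelta(days=1)
    return total


def availability(period_start, period_end, outages,
                 open_t=time(6), close_t=time(18)):
    """Percent of expected service time with no outage.

    outages is a list of (start, end) datetimes, assumed not to overlap.
    Downtime outside the service window does not count against the score.
    """
    expected = service_seconds(period_start, period_end, open_t, close_t)
    down = sum(
        service_seconds(max(s, period_start), min(e, period_end), open_t, close_t)
        for s, e in outages
    )
    return 100.0 * (expected - down) / expected


# Made-up sample data: one week, Monday to Monday.
week_start = datetime(2024, 1, 1)
week_end = datetime(2024, 1, 8)
sample_outages = [
    (datetime(2024, 1, 2, 5, 0), datetime(2024, 1, 2, 7, 30)),   # Tuesday morning
    (datetime(2024, 1, 6, 10, 0), datetime(2024, 1, 6, 14, 0)),  # Saturday, outside service hours
]
print(f"{availability(week_start, week_end, sample_outages):.2f}% available")
```

In this made-up week the Saturday outage costs nothing, and only the 90 minutes of the Tuesday outage that fall after 6am count. For a sense of scale, 99.9% uptime against a true 24×7 year (8,760 hours) leaves room for about 8.76 hours of downtime.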
Another trick you can do with usage reports: if 30% of the reports on your server never get executed, consider taking the first set of reporting requirements for your next project, asking the user to prioritize them all – and postponing work on the bottom 30% of the reports! You will cut a ton of time off the development phase of your project, and the metrics suggest that most of the stuff you cut would never be used anyway! Note that I said we’d postpone the work – we can always go back and add critical missing reports later.
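Finding the never-executed reports can be as simple as comparing the report catalog against an execution log. The sketch below assumes you can export both as plain lists of report names; the function name and the sample data are hypothetical, and a real report server will have its own way of exposing this information.

```python
from collections import Counter


def unused_reports(catalog, execution_log):
    """Return (list of reports never executed, percent of catalog unused).

    catalog: every report name deployed on the server.
    execution_log: one entry per execution, containing the report name.
    """
    counts = Counter(execution_log)
    unused = [name for name in catalog if counts[name] == 0]
    percent_unused = 100.0 * len(unused) / len(catalog)
    return unused, percent_unused


# Made-up sample data: ten deployed reports, seven of which were ever run.
catalog = [f"report_{n:02d}" for n in range(1, 11)]
log = ["report_01"] * 40 + ["report_02"] * 12 + [
    "report_03", "report_04", "report_05", "report_06", "report_07",
]

unused, pct = unused_reports(catalog, log)
print(f"{pct:.0f}% of reports were never executed: {unused}")
```

The same counts, sorted from most to least used, also give you a starting point for the prioritization conversation with the user.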
Visibility: as Important as Readability
This framework should give you a jump start on what to measure; now you really need to focus on how you will deliver your pictures to the target eyeballs. Nice stats, but how are you going to let folks know the score? If you are fortunate enough to have a robust portal environment, and can configure plug-ins with graphs and such, your job should be easy. You’ll still have to learn to configure the plug-ins, feed them the data, and automate the whole process – otherwise, the administrivia hassle will lead to neglect.
If your portal platform doesn’t do the graph thing – or if the plug-in renders unreadable graphics (go read Tufte!) – you may need to fall back on charts driven from spreadsheets. These can look great, but the mechanics of getting the finished picture onto a web page can be a bit tricky. Start small – take your first baby steps with a simple uptime graph, and figure out how to publish and distribute it with minimal effort. Once you get the hang of it, you can move on to more challenging metrics / communications.
Lies, Damn Lies, and Statistics
When dealing with metrics, you need to be careful and thoughtful when drawing conclusions or postulating cause-and-effect. Consider this first picture, showing the breakdown of help desk tickets between “just-in-time training” and true break-fix issues.
One might infer that the user base has slowly but surely devolved over time. Trained employees leave the company or move on to other roles, and new folks take their place. Training classes no longer exist, and little knowledge transfer takes place. The company is getting progressively dumber, and no one can stop the madness …
Well, maybe not. Let’s look at the same metrics, but presented as actual volumes …
This is a completely different picture; the marked decline is almost entirely in “break-fix” issues. Clearly, IT has been spending much of their time fixing the nagging little bugs and annoyances that lead to user problems. The number of “How-to” calls has been reasonably steady … maybe this means that IT could stop programming and start working with the business on knowledge capture and retention …
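The gap between the two pictures is just arithmetic: when one category shrinks, the other’s share of the total grows even if its own volume never moves. Here is a small sketch with made-up monthly ticket counts (not the numbers behind the charts above) showing the “How-to” share climbing while its volume stays flat.

```python
def percent_share(category, other):
    """Percent of total tickets that fall in `category`, period by period."""
    return [100.0 * c / (c + o) for c, o in zip(category, other)]


# Hypothetical ticket counts for four consecutive periods.
how_to = [40, 41, 39, 40]       # roughly steady volume
break_fix = [160, 110, 70, 40]  # steadily declining volume

shares = percent_share(how_to, break_fix)
for period, (volume, share) in enumerate(zip(how_to, shares), start=1):
    print(f"period {period}: {volume} how-to tickets, {share:.0f}% of all tickets")
```

With these numbers the how-to share goes from 20% to 50% while the how-to volume barely changes, which is why it pays to look at both views before drawing a conclusion.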