I know. If sweeping automation is the shiny new Ferrari you’re roaring into your shop, health checks are the annoying finance guy droning on about APR and insurance. Being able to push a button and watch a full environment rev to life from nothing is cool. Health checks aren’t. Diagnostics are a little sexier, but they’re still an afterthought, something you hope you don’t need… and if you’ve already got your commit, build, release, and deploy processes tuned up and working right, it’s seductively easy to believe you won’t.
Maybe you’re right. There are services out there that have run for years without interruption, and maybe you have one of those. Maybe your hardware is as reliable as gravity and your changes are so minute and infrequent that even a bad change doesn’t cause a noticeable failure. Maybe you were born under an auspicious star and you’ll go your whole life without a single preventable failure.
Probably not.
Sooner or later, almost inevitably, you’ll get the doomsday call. Something is wrong, it’s getting worse, and it’s not going to correct itself. We’re not even trying to stop that – it’s like trying to stop the tide with your bare hands. All we’re trying to do is make sure that the call comes from the system while there’s still time to intervene. It’s worse to have an angry manager call as your first notification that there’s trouble, and it’s worse than that to have customers simply disappear because your rig doesn’t work.
This can get obscenely complicated and it can do it in a hurry. Full-bore monitoring is its own discipline; alert tuning is a moving target between “Nothing squeaks when I need it to” on one end and “This thing screams all the time about nothing” on the other. Plenty has been written on the topic, and there are as many schools of thought as there are monitoring operators. I’m not here to try and summarize the entire field; I’m only here to suggest some of the basics and highlight some of the easy mistakes.
The Basics
- Know thy environment. If you fail at this step, nothing else you do really matters. Before you can start putting watchful eyes on your environment, you have to know what it looks like when it’s running right and where the possible failure points are.
- Armed with the knowledge from point 1, figure out your tooling. AppInsights, NewRelic, SumoLogic, and Stackify’s Retrace product (for example) all do similar things, but each has its strengths and weaknesses. If you’re starting from “blind and deaf”, you probably want something with a lot of features and places to put eyes. If you’re already really good about logging, you might be better off with a log aggregator.
- Build out some plans for how to respond when your tools do their jobs (there’s a rough sketch of one such pairing after this list). Some of this is easy and intuitive – if you’re consistently spiking proc, scale out or up; if you’re stalling at disk seek, re-think your disk allocations – and some of it is intrusive and difficult. It all matters, though; if you have alerts and no response to them, that’s not much different from not having them at all.
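To make that third point a little more concrete, here’s a rough sketch of what an alert-and-response pair can look like at its simplest. It isn’t tied to any particular monitoring product; the scale_out() function is a stand-in for whatever your platform actually exposes (a cloud API call, an autoscaling group adjustment, a page to whoever owns capacity), the thresholds are made up, and the processor reading comes from the third-party psutil library.

```python
# Minimal "alert plus response" sketch: watch processor utilization and,
# if it stays hot for several consecutive samples, run the documented response.
import psutil  # third-party: pip install psutil

CPU_THRESHOLD = 85.0   # percent; pick a number that matches *your* baseline
SUSTAINED_SAMPLES = 5  # consecutive hot readings before we act
SAMPLE_INTERVAL = 60   # seconds per reading

def scale_out():
    """Placeholder for the documented response: call your cloud API,
    bump an autoscaling group, or page whoever owns the capacity decision."""
    print("CPU has been pegged for a while; executing the scale-out plan.")

def watch_cpu():
    hot_samples = 0
    while True:
        # cpu_percent(interval=N) blocks for N seconds and returns average utilization.
        usage = psutil.cpu_percent(interval=SAMPLE_INTERVAL)
        hot_samples = hot_samples + 1 if usage >= CPU_THRESHOLD else 0
        if hot_samples >= SUSTAINED_SAMPLES:
            scale_out()
            hot_samples = 0  # don't re-fire on every subsequent hot sample

if __name__ == "__main__":
    watch_cpu()
```

The script itself is almost beside the point; what matters is that the threshold and the response were decided before the monitor went live.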
The Easy Mistakes
- Rolling out blind. Monitoring tools vary in their capabilities and costs, but they are universally overwhelmed with joy to provide you with so much information that you drown in it. You can reach so deep into performance counters that the big picture vanishes in a sea of transient counter spikes. Cleaning up that situation is hateful and painful, bad enough that sometimes it just never gets done. Figure out what your pain points are before you start turning monitors on.
- Taking it too far. The dream is an early warning of anything that stands to be a serious problem and no noise other than that. Before you flip a monitor on, decide what you’ll do when it sings; if the answer is “nothing”, don’t flip it on.
- Neglecting corrective measures. Some problems are easy victims for good automation – if-x-then-y is a perfect use case for a good script (there’s a minimal sketch of one after this list). That said, if you have any given problem so often that it’s worth writing a script to fix it, then you really should ask why that is and fix the underlying problem instead of scripting around it.
- Fire-and-forget. Putting something in place and then letting it be; if it’s done a good job so far, why change it? Sometimes that’s even okay: if (for example) your app does a really good job of detailed logging and your aggregator parses those logs into actionable issues pretty well, you may not need to adjust your monitors very often. Usually, though, it isn’t. As functionality comes and goes, as features are added, subtracted, or modified, the idea of peak performance changes and your monitors need to reflect current reality.
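As promised under “Neglecting corrective measures”, here’s a minimal if-x-then-y sketch: probe a health endpoint, and if it fails a few times in a row, run the documented corrective action. The URL, the retry counts, and the restart command are all placeholders; on an IIS box the action might be an app pool recycle, on Linux a service restart, and so on.

```python
# If-x-then-y in its simplest form: probe a health endpoint and, after a few
# consecutive failures, run the documented corrective action.
import subprocess
import time
import urllib.error
import urllib.request

HEALTH_URL = "http://localhost:8080/health"   # placeholder endpoint
FAILURES_BEFORE_ACTION = 3
PROBE_INTERVAL = 30                           # seconds between probes

# Placeholder corrective action; substitute whatever your platform uses
# (an IIS app pool recycle, a systemd restart, a container bounce, etc.).
RESTART_COMMAND = ["systemctl", "restart", "myapp.service"]

def is_healthy():
    """Treat anything other than a timely HTTP 200 as a failed probe."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def main():
    failures = 0
    while True:
        failures = 0 if is_healthy() else failures + 1
        if failures >= FAILURES_BEFORE_ACTION:
            print("Health check failed repeatedly; running corrective action.")
            subprocess.run(RESTART_COMMAND, check=False)
            failures = 0
        time.sleep(PROBE_INTERVAL)

if __name__ == "__main__":
    main()
```

And per the caveat above: if this thing fires every week, the real work is finding out why the service keeps falling over, not polishing the script that revives it.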
A Couple of Rules-of-Thumb
- Start small. Your basic health check should reassure you that your application is alive, working as intended, and performing to whatever tolerance your userbase needs. You can get into telemetry for perf improvement or system optimization later; at the start, you just need to know that everything works more or less correctly (a bare-bones example of that kind of check follows this list).
- Peek at all of your failure points. For VMs, memory utilization, processor utilization, and disk access are the common picks, and they’re common for a reason. That’s probably not enough, though – you’ll also want eyes on database access time from the application, network I/O, and application pools. A terrifying number of application problems end up being database problems, and a lot of those are comparatively easy to fix if you get a better notification than an ASP “could not connect to database” error.
- Have plans, preferably documented, for any alert that sounds. It’s okay if those plans are fuzzy at first; “troubleshoot app pool issue in response to excessive app pool recycles” may be what you have out of the gate, and that’s fine. Eventually, you’ll get a better feel for what’s likely to go wrong and you can narrow down your plans, but the sooner you get into the habit of adding monitors and plans in pairs, the more work you save trying to retrofit plans to monitors.
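“Start small” really can mean small. The sketch below is a bare-bones health endpoint built on nothing but the Python standard library; the database host and port are placeholders, and the database “check” is just a timed TCP connect rather than a real query, which is about the minimum that still tells you the app is alive and can reach its database.

```python
# Bare-bones health endpoint: reports whether the app is up, whether it can
# reach its database, and how long that check took.
import json
import socket
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

DB_HOST, DB_PORT = "db.example.internal", 1433   # placeholders; use your own
SLOW_THRESHOLD = 0.5                             # seconds before we call it "degraded"

def check_database():
    """Time a simple TCP connect to the database; a fuller check might run a trivial query."""
    start = time.monotonic()
    try:
        with socket.create_connection((DB_HOST, DB_PORT), timeout=2):
            pass
        return True, time.monotonic() - start
    except OSError:
        return False, time.monotonic() - start

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        db_ok, elapsed = check_database()
        status = "ok" if db_ok and elapsed < SLOW_THRESHOLD else ("degraded" if db_ok else "failing")
        body = json.dumps({"status": status,
                           "db_reachable": db_ok,
                           "db_check_seconds": round(elapsed, 3)}).encode()
        self.send_response(200 if db_ok else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), HealthHandler).serve_forever()
```

Anything that can poll a URL (an uptime checker, a load balancer probe, any of the APM products mentioned earlier) can watch that endpoint, and that alone covers the “alive and can reach its database” half of the rules above.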
Summing Up
Monitoring is a big, scary topic. Don’t get overwhelmed; doing this right isn’t nearly as hard as it sounds, and it doesn’t usually take a massive investment in capital or manpower. Done right, it gets you out in front of problems, which is where every troubleshooter always wants to be. It’s the final stage in the zero-downtime deployment; it’s no good to flawlessly and seamlessly deploy a release and then go down on a memory problem in production.
Like most things in IT, it comes down to planning. Plan out what you need to look at, how you’re going to look at it, and what you’ll do if you look and see something sub-optimal. It’s a little more insidious than some IT facets because a poorly planned implementation will still function and won’t send up red flags, but it’s that much harder to fix precisely because cleaning it up somehow never makes it to the top of the priority list.