The second pillar of the Clear Measure Way is Achieving a Stability Plan. If you missed it, also check out the first pillar, Establishing Quality. Once the team has confidence in the quality of the code, it can work on running that code in a stable production environment. This is the next stepping stone toward increasing your software delivery speed.
Most likely you have not just one new piece of software but many existing apps in production. Several of these may have stability issues from time to time, showing one or more of these common symptoms:
- Sluggishness
- Outages
- Offline error messages
- Frozen screens
- Abnormal behavior
- Bugs
Sometimes your engineers cannot reproduce these issues because they occur only in production.
When users report any of these symptoms, you have a production issue. Having good “language” around the symptoms gives you clarity in your oversight duties. Remember, you’re the software executive; you oversee all of this. And you want to make sure appropriate measures are in place to track stability.
If the software is not stable, it’s not going to do its job. You need to ensure the stability of your software as it runs in production.
Why You Need a Software Stability Plan
Sometimes teams can be a little gun shy of production deployments. They might advocate for monthly deployments, or maybe after-hours deployment events with all-hands-on-deck. This is technically not required and is usually caused by an unpleasant experience during live updates.
They’re gun shy because they’ve been shot before. After a deployment goes bad, developers can become hesitant. They can become wary and distrustful of the process because they consider it dangerous.
You need a stability plan because having a lot of undeployed software is costly. Undeployed software isn’t generating any return for the business. Plus, it carries a growing risk, because its changes remain unproven in production.
Now, think about other departments; all departments that manage throughput understand the power of limiting work in process (WIP).
Infrequent deployments queue up way too many changes that are waiting for a big, stressful, error-prone deployment event.
Ultimately, your two goals to achieve stability are (1) prevent production issues, and (2) minimize undeployed software. In other words, get the software out there. The software needs to yield the business a return on the investment you’ve made in it.
Measure production issues and software deployments on the team scorecard during weekly metric reporting. For example:
- Number of deployments for the week
- Number of production issues for the week, broken down by severity
- MTTR: mean time to recovery (average time to resolve an issue)
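As a rough illustration of how those three scorecard numbers could be computed, here is a minimal Python sketch. The `Incident` record, its severity labels, and the `weekly_scorecard` function are hypothetical names chosen for this example; your own ticketing or deployment tooling will supply the real data.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Incident:
    """One production issue (hypothetical record shape)."""
    severity: str        # assumed labels such as "high" or "low"
    opened: datetime     # when the issue was reported
    resolved: datetime   # when service was restored


def weekly_scorecard(deployments: int, incidents: List[Incident]) -> dict:
    """Summarize one week: deployments, issues by severity, and MTTR."""
    by_severity = Counter(incident.severity for incident in incidents)

    mttr: Optional[timedelta] = None
    if incidents:
        # MTTR = total time spent recovering / number of incidents
        total = sum((i.resolved - i.opened for i in incidents), timedelta())
        mttr = total / len(incidents)

    return {
        "deployments": deployments,
        "issues_by_severity": dict(by_severity),
        "mttr": mttr,
    }
```

For example, a week with two issues that took one hour and three hours to resolve would report an MTTR of two hours. A week with no issues reports no MTTR at all rather than zero, so the scorecard does not imply that recovery was instantaneous.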
Stability Questions to Ask Your Team
You can ask your team questions to drive the right behavior when seeking to achieve your stability plan. For example:
- What features have been changed and tested and are ready for production?
- Have you considered some way to prevent this type of issue from happening again?
- What was the root cause of that production issue?
- What should we strengthen about our environment, so that we’re able to resolve issues faster next time?
Minimum Standards for Achieving Your Stability Plan
Just like with quality, there’s a minimum set of practices every team should employ, if you expect to run a stable software system in a production environment.
First, you want automated DevOps from day one of a new project. The goal is to eliminate manual monthly deployments. You want automation all the way from code through build, test, and release candidate deployment.
Second, you want to have small releases. You want to deploy when something is ready no matter what time of day.
Third, you want runtime automated health checks. This is like your car’s built-in self-diagnostic tool. It can notify you when there is an issue. Then a technician can plug in a diagnostic tool and receive the error codes. Likewise, your software can have built-in health checks. You also want explicit secrets management, because security is such an ongoing risk.
That’s just the minimum. There are going to be other practices to prevent issues. When production issues arise (and they will from time to time), the following practices will help the team diagnose and solve them faster.
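To make the health check idea concrete, here is a minimal Python sketch of a self-diagnostic routine. The function name `run_health_checks` and the idea of passing in named check functions are assumptions made for this illustration, not a specific framework’s API. Most web frameworks offer their own health check features that a team would more likely use in practice.

```python
from typing import Callable, Dict


def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> dict:
    """Run each named check and report overall status plus per-check detail.

    Each check is a function returning True when its dependency is healthy.
    A check that raises is recorded as an error instead of crashing the report,
    much like a car reporting an error code rather than stalling.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:
            results[name] = f"error: {exc}"

    healthy = all(status == "ok" for status in results.values())
    return {"status": "healthy" if healthy else "unhealthy", "checks": results}
```

A team might register one check per dependency (database, message queue, disk space; all hypothetical here) and expose the resulting report on an endpoint that monitoring tools poll on a schedule.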
Stability Plan Practices
First, centralized logging, metrics, and traces built on OpenTelemetry. If you don’t have centralized telemetry, look into OpenTelemetry, the open standard that the major tool vendors support.
Second, an APM (Application Performance Management) tool with a shared operations dashboard.
Third, a formal support desk tool with ticket tracking, anomaly alerts, and emergency alarms.
If some of this sounds familiar, it’s because many of these practices are the software parallel of the practices used to operate any factory or assembly line. When a production line has an issue, the staff quickly fixes it to prevent a factory shutdown. And for more serious problems, emergency alarms sound, stopping the line to call everyone’s attention. While the tools are different, the way of thinking is the same.
More Stability Questions to Ask Your Team
Here are some more questions to ask your team. To gain insight into how the practices may or may not be implemented, you can ask:
- Would you please give me a tour of our logs and telemetry that allow me to see how the users are using our software and what they’re doing?
- How do we currently train a new team member to be on call for production support?
- What dashboards should a team member look at to ensure software stability?
- What events currently trigger an alert and what events currently trigger an emergency alarm?
- Who receives the alerts?
- By what means are alarms received?
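When the team answers the questions about alerts and alarms, the rules they describe may look something like the following Python sketch. Everything here is illustrative: the threshold values, the `classify` function, and the notification routes are assumptions for the example, and real rules would live in your APM or support desk tool.

```python
from enum import Enum


class Signal(Enum):
    NONE = "none"
    ALERT = "alert"    # an anomaly worth a look
    ALARM = "alarm"    # an emergency that needs immediate attention


# Hypothetical thresholds; a real team would tune these to its own system.
ALERT_ERROR_RATE = 0.01   # 1% of requests failing
ALARM_ERROR_RATE = 0.05   # 5% of requests failing

# Hypothetical routing: who hears about each kind of signal, and how.
ROUTES = {
    Signal.NONE: [],
    Signal.ALERT: ["support desk ticket"],
    Signal.ALARM: ["support desk ticket", "page on-call engineer"],
}


def classify(error_rate: float, health_check_passing: bool) -> Signal:
    """Decide whether current conditions warrant nothing, an alert, or an alarm."""
    if not health_check_passing or error_rate >= ALARM_ERROR_RATE:
        return Signal.ALARM
    if error_rate >= ALERT_ERROR_RATE:
        return Signal.ALERT
    return Signal.NONE
```

Writing the rules down this explicitly, in whatever tool holds them, is what lets the team answer "who receives the alerts?" and "by what means?" without guessing.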
Are you ready to put this process into action? Let us guide you.