On-call rotations are a team sport
No one ever says: “I want to ack PagerDuty alerts when I grow up”. Lost sleep. Diverted focus. Scrambling under pressure. There’s no shortage of downsides to being on the frontline for a business’s technology.
Still, a well-run on-call rotation is key to the proper care and feeding of any company’s technology stack. A healthy on-call rotation shines a bright light on weaknesses and implements improvements with urgency. Doing it well requires a team with sharp communication practices.
Over time, how your team handles on-call can make all the difference between a business taking flight or drowning in technical debt.
You built it, you bought it
Limiting on-call rotations to a single person or a small handful of people in an organization is an anti-pattern. This practice tends to shift problems away from the people who can act on them. When the people fielding alerts aren’t empowered to make improvements, they will ultimately optimize for silencing alerts rather than addressing underlying issues. Those people will also burn out in short order.
Teams owning alerts for the systems that they create and/or maintain is a much healthier pattern. Teams will be much more likely to spend cycles on performance improvements and bug fixes when they’re experiencing the live fire of their work. They will roll lessons learned into improving their development process for new features. Over time this practice will also enhance a team’s ability to collect metrics and monitoring points relevant to the business.
Turn on the flood lights
Hidden problems tend to fester until they turn into emergencies. Creating a culture that prefers exposing vulnerabilities over hiding them will keep problems small in scope. Pushing issues out into the open for all to see lessens the chance that they end up swept under the rug.
Metrics are like a light switch. Tracking the occurrences and nature of on-call alerts will show which areas of a system need attention. Metrics will also reveal patterns that may not be apparent to the people fielding alerts.
A shared on-call log is one way to collect and expose these metrics. Many monitoring tools, along with alerting services like PagerDuty, expose some sort of API. These APIs can be mined for alert history data. Regular review of this type of alerting data in a team setting will help prioritize attention.
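As a rough illustration of what reviewing alert history can look like, here is a minimal Python sketch. It assumes the alert history has already been exported into a list of simple records. The field names (`service`, `summary`, `triggered_at`), the sample data, and the overnight window are all hypothetical and do not reflect any particular service's API.

```python
from collections import Counter
from datetime import datetime

# Hypothetical alert records, shaped the way a flattened export
# from an alerting service might look. All values are made up.
ALERTS = [
    {"service": "checkout-api", "summary": "High latency", "triggered_at": "2015-06-01T03:12:00"},
    {"service": "checkout-api", "summary": "High latency", "triggered_at": "2015-06-02T14:40:00"},
    {"service": "search", "summary": "Index lag", "triggered_at": "2015-06-02T23:05:00"},
    {"service": "checkout-api", "summary": "5xx spike", "triggered_at": "2015-06-03T02:47:00"},
    {"service": "billing-worker", "summary": "Queue backlog", "triggered_at": "2015-06-04T10:15:00"},
]


def alerts_per_service(alerts):
    """Count how many alerts each service produced."""
    return Counter(alert["service"] for alert in alerts)


def off_hours_alerts(alerts, start_hour=22, end_hour=7):
    """Return alerts triggered overnight, between start_hour and end_hour."""
    overnight = []
    for alert in alerts:
        hour = datetime.fromisoformat(alert["triggered_at"]).hour
        if hour >= start_hour or hour < end_hour:
            overnight.append(alert)
    return overnight


if __name__ == "__main__":
    # Services listed from most to least alerts.
    for service, count in alerts_per_service(ALERTS).most_common():
        print(f"{service}: {count} alerts")
    print(f"Overnight alerts: {len(off_hours_alerts(ALERTS))}")
```

Even a summary this simple gives a team review something concrete to discuss: which service pages the most, and how many of those pages are costing someone sleep.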
A chat room is another great place to shine a light on issues. Most monitoring systems can send alerts to a variety of chat services. The squeaky wheel gets the grease. Putting a stream of alerts front and center for the entire company to see will make it much harder for everyone to ignore squeaky wheels.
Build a culture of improvement
Knowing is half the battle. The other half is actually fixing stuff.
A sustainable on-call process includes time allotted for making improvements. In some situations this work may be doable during a person’s on-call rotation. In other cases it may require time after the rotation to follow up on issues. Consider what works within your team’s responsibilities and plan accordingly.
Solidify your team’s process for ticketing issues as they arise. Problem tickets must contain enough information to be actionable. Groom the backlog of issue tickets regularly, and work improvement tickets into the team’s pipeline just as regularly. Anything less becomes a vanity exercise that produces little more than an ever-increasing ticket backlog.
Pay particular attention to systemic issues. Problems in core infrastructure tend to replicate out, compounding the number of alerts. Be on the lookout for false-positive and unactionable alerts. Prioritize fixing these types of issues quickly, before recurring alerts result in pager fatigue across a team.
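One way to spot candidates for cleanup is to look for alerts that fire often but rarely lead to any action. The Python sketch below assumes a hypothetical on-call log in which each entry records the alert name and whether the responder had to do anything. The field names (`alert`, `action_taken`), the sample data, and the thresholds are illustrative assumptions, not recommendations.

```python
from collections import defaultdict

# Hypothetical on-call log entries. "action_taken" records whether the
# responder actually had to do something to resolve the alert.
LOG = (
    [{"alert": "disk-usage-warning", "action_taken": False}] * 4
    + [
        {"alert": "db-replication-lag", "action_taken": True},
        {"alert": "db-replication-lag", "action_taken": True},
        {"alert": "db-replication-lag", "action_taken": False},
        {"alert": "cert-expiry", "action_taken": False},
    ]
)


def noisy_alerts(log, min_count=3, max_action_rate=0.25):
    """Flag alerts that recur at least min_count times while responders
    acted on no more than max_action_rate of the occurrences.

    Returns (alert name, occurrence count, action rate) tuples,
    most frequent first.
    """
    stats = defaultdict(lambda: {"count": 0, "acted": 0})
    for entry in log:
        record = stats[entry["alert"]]
        record["count"] += 1
        if entry["action_taken"]:
            record["acted"] += 1

    flagged = []
    for name, record in stats.items():
        rate = record["acted"] / record["count"]
        if record["count"] >= min_count and rate <= max_action_rate:
            flagged.append((name, record["count"], rate))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, count, rate in noisy_alerts(LOG):
        print(f"{name}: fired {count} times, acted on {rate:.0%}")
```

With the sample data, only `disk-usage-warning` is flagged. `db-replication-lag` recurs but usually needs action, and `cert-expiry` has fired too few times to judge.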
Use metrics to publicize the improvements your team makes. Wins are contagious. Every time you speed up a page load by a meaningful percentage or eliminate a flaw that has been waking up an engineer for days, shout it from the rooftops. Sharing these kinds of measured wins will reinforce a culture of improvement throughout your organization.
Fewer beeps. More sleeps. And much happier customers.
Few people get excited about taking on an on-call rotation. Handled properly, though, on-call rotations can help teams build a culture of improvement. The visibility that comes from metrics, combined with clear pathways of action, can be a powerful force for positive change in any organization.
June 08, 2015