Ever wanted to know how Xero does incident management?

Wed, 4th Oct 2017


DevOpsDays is an event that takes place all over the world, from Paris and Berlin to Cape Town, Oslo, and across the United States.

Now, it's Auckland's turn.

The event kicked off in Ellerslie yesterday, with another DevOps-filled schedule set for today. Attendees range from QA staff and incident managers to national and global cloud companies.

Xero's Anthony Angell is one of the event's speakers. He's the team lead for site reliability engineering at the company's Auckland sector and he's here to discuss how Xero's using chatbots to improve incident management.

“We used to deliver incidents in a single channel,” explains Angell. Xero initially had a single operations team that managed all production incidents.

“It used to be hectic trying to figure out what [the incident] was.”

As the company grew, it was crucial that the way product teams worked evolved too.

To empower product teams to support their own services, Xero's Site Reliability Engineering (SRE) team developed a set of best practices around incident management.

“I don't want to be here at 2 o'clock every night, dealing with someone else's problem.

“We thought, how can we make it easy for people?

Introducing Xero's incident management chatbot.

“Our chatbot's called Multivac – someone pulled the name out of a hat.

“We had this AWS chatbot that really didn't do much except rack up bills – so we decided to see what we could do with it. We got the framework we designed and put it into code.

This was not an easy job.

“Trying to put a framework into CoffeeScript was one of the hardest experiences of my life. Trying to integrate that with Slack was even harder.

“But, it worked. We built the chatbot and it integrated with everything. If it's got a rest API, we've got it hooked up to it. It's awesome.

Being able to put the framework into code brings a lot of flexibility, explains Angell. One of the problems Xero used to have was clashes caused by miscommunication between teams.

“The dev team's like – 'I didn't know when the other team released, I released at the same time and broke their dependency and brought stuff down without knowing it'.

Merely looking at a calendar to watch out for these situations wasn't going to cut it.

“Who looks at their calendar halfway through the day?

“The other thing is, using a bot to do all this reduces the time to restore. That's a huge thing with incident management. It's not about trying to get Harry to bang out commands left, right and centre to get the quickest restoration. It's a matter of delivering a framework, in the fastest possible time, consistently.

“But incident management isn't just about the incident,” warns Angell. “There's post-mortems and triage work up front.

Xero's triage process

“At Xero, we've got this Issues Report form. It's quite 1900s, but it's better than you ringing me at 2 am, and then me ringing an engineer, wondering, ‘what did they just say?'”

The details entered in the form are used to define severity and scalability, based on how many users have the problem and where they are, globally or locally.
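As a rough illustration of how form details could be turned into a severity label, here is a minimal Python sketch. It is not Xero's implementation: the `IssueReport` fields, the `classify_severity` function, the severity names and every threshold are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IssueReport:
    """Fields a triage form might capture (all names are hypothetical)."""
    reporter: str
    summary: str
    users_affected: int
    regions: List[str]  # for example ["NZ"] or ["NZ", "AU", "UK"]


def classify_severity(report: IssueReport) -> str:
    """Map user count and geographic spread to a severity label.

    The thresholds are invented for illustration only.
    """
    widespread = len(report.regions) > 1
    if report.users_affected >= 1000 or (widespread and report.users_affected >= 100):
        return "sev1"
    if report.users_affected >= 100 or widespread:
        return "sev2"
    return "sev3"


report = IssueReport("julie", "Login fails", users_affected=150, regions=["NZ", "AU"])
print(classify_severity(report))
```

Keeping the rules in one small function means the same report always gets the same severity, whoever is on call.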

Once the form is completed, this is where Multivac comes in.

“One of the bots picks it up and says, ‘someone has submitted that there's a problem'. And it picks up the user who submitted it, and it then tags the user in Slack, and says, ‘Hi, I'll find the on-call person and send them to work'.

From here, it finds the person who is on call, gives the user that person's phone number in case they need to call, and sends cell phone alerts to the on-call person.
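The hand-off described here can be sketched in a few lines of Python. This is only an assumption about the shape of the flow, not Multivac's code: the `ON_CALL` roster, the `Engineer` type, the `handle_report` function and the placeholder phone number are all hypothetical, and the function returns the messages rather than calling any real chat or paging API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Engineer:
    name: str
    phone: str


# Hypothetical roster: service name -> engineer currently on call.
ON_CALL: Dict[str, Engineer] = {
    "payments": Engineer("Harry", "+00 000 000 0000"),  # placeholder number
}


def handle_report(reporter: str, service: str, form: Dict[str, str]) -> List[str]:
    """Return the messages the bot would send, in order."""
    engineer = ON_CALL[service]
    details = ", ".join(f"{key}: {value}" for key, value in form.items())
    return [
        # 1. Acknowledge the person who submitted the form.
        f"@{reporter} Hi, I'll find the on-call person and send them to work.",
        # 2. Give the reporter a way to reach the on-call engineer.
        f"@{reporter} {engineer.name} is on call; phone {engineer.phone} if you need to ring them.",
        # 3. Page the engineer with everything that was on the form.
        f"PAGE {engineer.name}: new issue report from {reporter} ({details})",
    ]


for message in handle_report("julie", "payments", {"summary": "Login fails"}):
    print(message)
```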

“One of the cool things is that it gives us all the data that's on the form – so the on-call engineer no longer has to worry about what Julie said to them at 2 o'clock in the morning.

“Once we get this through to the engineer, we can tell the bot to start incident.”

One of the things Xero targeted with their new incident management system is moving away from having incident chatter and normal chatter in the same channel.

“So, what [the bot] has done is that it's split off to another channel in Slack and invites me, and makes me the incident controller.

The incident controller, the puppet master

The incident controller is the person in control of the incident, explains Angell: not the one fixing it, but the puppet master.

“I'm orchestrating the resolution, just trying to help everyone out.

“The cool thing about the bot is that I don't need to know the incident management process – and neither do you guys.

A green tick shows up under each command the user types in, indicating that the bot has seen the command and run it.

The post-mortem: A time for positive reinforcement

“When it's all done, we tell the bot it's over. From here, we want to know how long it took to fix it. In this case [the live demo] took 7 minutes and 50.2 seconds.
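Measuring time to restore comes down to subtracting two timestamps. The Python sketch below shows one way a bot might report it; the `format_duration` function and the example timestamps are assumptions for illustration, chosen so that the result matches the demo figure quoted above.

```python
from datetime import datetime, timedelta


def format_duration(started: datetime, resolved: datetime) -> str:
    """Describe the time between start and resolution in minutes and seconds."""
    # Work in whole tenths of a second so rounding never produces "60.0 seconds".
    tenths = round((resolved - started).total_seconds() * 10)
    minutes, remainder = divmod(tenths, 600)
    return f"{minutes} minutes and {remainder / 10:.1f} seconds"


started = datetime(2017, 10, 3, 14, 0, 0)  # hypothetical start time
resolved = started + timedelta(minutes=7, seconds=50, milliseconds=200)
print(format_duration(started, resolved))  # prints "7 minutes and 50.2 seconds"
```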

“Now, it will automatically archive this channel so we no longer have stale channels lying around everywhere.

“And then – it gives us the next task and we start a post-mortem – because we all want to know why it broke.

“We run the post-mortem command and we use Jira which tracks our incidents and creates a Google doc with all the information from the incident."

The Google doc, continues Angell, includes the incident title, how long the incident took and the people involved, as well as a number of fields the bot cannot fill out, such as summary, trigger and impact.
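The split between fields the bot can fill in and fields people must write could look something like the Python sketch below. It builds plain text only and does not touch Jira or Google Docs; the `render_postmortem` function, the layout and the `TODO` markers are assumptions for illustration.

```python
from typing import List


def render_postmortem(title: str, duration: str, participants: List[str]) -> str:
    """Build a post-mortem skeleton: known facts filled in, the rest left for people."""
    lines = [
        f"Incident: {title}",
        f"Duration: {duration}",
        f"People involved: {', '.join(participants)}",
        "",
    ]
    # Sections a bot cannot write; the team completes them in the review.
    for heading in ("Summary", "Trigger", "Impact", "What went well"):
        lines.append(f"{heading}: TODO")
    return "\n".join(lines)


print(render_postmortem("Login failures", "7 minutes and 50.2 seconds", ["anthony", "julie"]))
```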

“After that, we go through things that went well because positive reinforcement is the key to resolving things as opposed to focusing on the negative.

Where to from here?

“We are building a report card system so we can call out teams that are not doing stuff the right way.

“And, incident and alert portals – collating incidents over the last 24 hours and populating it in a channel."

“Our team can find out what's happened in the last 24 hours, instead of sending a whole handover email of what's happened, which people automatically file in the ‘I don't care' pile.
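A 24-hour roll-up of this kind is essentially a filter over timestamped incidents. The Python sketch below shows the idea under stated assumptions: the `daily_digest` function, the tuple layout and the sample incidents are hypothetical, and posting the result to a channel is left out.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def daily_digest(incidents: List[Tuple[datetime, str]], now: datetime) -> str:
    """Summarise incidents that started in the 24 hours before `now`."""
    cutoff = now - timedelta(hours=24)
    recent = sorted((ts, title) for ts, title in incidents if cutoff <= ts <= now)
    if not recent:
        return "No incidents in the last 24 hours."
    lines = [f"Incidents in the last 24 hours: {len(recent)}"]
    lines += [f"- {ts:%Y-%m-%d %H:%M} {title}" for ts, title in recent]
    return "\n".join(lines)


now = datetime(2017, 10, 4, 9, 0)
incidents = [
    (datetime(2017, 10, 4, 2, 0), "Login failures"),  # inside the window
    (datetime(2017, 10, 2, 2, 0), "Old outage"),      # older than 24 hours
]
print(daily_digest(incidents, now))
```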

“Incident management isn't a one-stop shop, it's an organically rolling framework. The more incidents you're exposed to, and the more you actually go and look at them, you're building a beautiful framework that best suits your needs.