IT Brief New Zealand
Technology news for New Zealand's largest enterprises

Ever wanted to know how Xero does incident management?

By Julia Gabel
Wed 4 Oct 2017

DevOpsDays is an event that takes place all over the world, from Paris and Berlin to Cape Town, Oslo, and across the United States.

Now, it’s Auckland’s turn.

The event kicked off in Ellerslie yesterday, with another DevOps-filled schedule set for today. It has drawn everyone from QA engineers and incident managers to national and global cloud companies.

Xero’s Anthony Angell is one of the event’s speakers. He is the team lead for site reliability engineering at the company in Auckland, and he is here to discuss how Xero is using chatbots to improve incident management.

“We used to deliver incidents in a single channel,” explains Angell. Xero initially had a single operations team that managed all production incidents.

“It used to be hectic trying to figure out what [the incident] was.”

As the company grew, it was crucial that the way product teams worked evolved too.

To empower product teams to support their own services, Xero’s Site Reliability Engineering (SRE) team developed a set of best practices around incident management.

“I don’t want to be here at 2 o’clock every night, dealing with someone else’s problem.”

“We thought, how can we make it easy for people?”

Introducing Xero’s incident management chatbot.

“Our chatbot’s called Multivac – someone pulled the name out of a hat.”

“We had this AWS chatbot that really didn’t do much except rack up bills – so we decided to see what we could do with it. We got the framework we designed and put it into code.”

This was not an easy job.

“Trying to put a framework into CoffeeScript was one of the hardest experiences of my life. Trying to integrate that with Slack was even harder.”

“But, it worked. We built the chatbot and it integrated with everything. If it’s got a REST API, we’ve got it hooked up to it. It’s awesome.”

Being able to put the framework into code brings a lot of flexibility, explains Angell. One of the problems Xero used to have was releases clashing because of miscommunication between teams.

“The dev team’s like – 'I didn’t know when the other team released, I released at the same time and broke their dependency and brought stuff down without knowing it'.”

Merely looking at a calendar to watch out for these situations wasn’t going to cut it.

“Who looks at their calendar halfway through the day?”

“The other thing is, using a bot to do all this reduces the time to restore. That’s a huge thing with incident management. It’s not about trying to get Harry to bang out commands left, right and centre to get the quickest restoration. It’s a matter of delivering a framework, in the fastest possible time, consistently.”

“But incident management isn’t just about the incident,” warns Angell. “There’s post-mortems and triage work up front.”

Xero’s triage process

“At Xero, we’ve got this Issues Report form. It’s quite 1900s, but it’s better than you ringing me at 2 am, and then me ringing an engineer wondering, ‘What did they just say?’”

Once the form is filled in, its details are used to define severity and scalability, based on how many users have the problem and whether they are spread globally or concentrated in one location.
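The article does not show how Multivac scores a report, but the idea can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the field names, the "SEV" labels and the thresholds are hypothetical, not Xero's.

```python
from dataclasses import dataclass


@dataclass
class IssueReport:
    """Fields a hypothetical issue report form might collect."""
    submitter: str
    description: str
    users_affected: int
    region: str  # "global" or the name of a local region (assumed values)


def classify_severity(report: IssueReport) -> str:
    """Map a report to a severity label using assumed thresholds."""
    if report.region == "global" or report.users_affected >= 1000:
        return "SEV1"
    if report.users_affected >= 50:
        return "SEV2"
    return "SEV3"


report = IssueReport("julie", "Invoices will not load", 1200, "NZ")
print(classify_severity(report))
```

Keeping the rule in one small function means the thresholds can be tuned as a team learns from real incidents, without touching the rest of the bot.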

Once the form is completed, this is where Multivac comes in.

“One of the bots picks it up and says, ‘someone has submitted that there’s a problem’. And it picks up the user who submitted it, and it then tags the user in Slack, and says, ‘Hi, I’ll find the on-call person and send them to work’.”

From here, the bot finds the person who is on call, gives the user that person’s phone number in case they need to ring them, and sends alerts to the on-call person’s cell phone.

“One of the cool things is that it gives us all the data that’s on the form – so the on-call engineer no longer has to worry about what Julie said to them at 2 o’clock in the morning.”
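As a rough illustration of this handoff, the following Python sketch builds the two messages such a bot might send: one back to the person who submitted the form, and one to the on-call engineer carrying every field from the form. The roster, the names, the phone number and the message wording are all made up for the example, and nothing is actually sent to Slack or to a phone.

```python
from dataclasses import dataclass


@dataclass
class Engineer:
    name: str
    phone: str


# Hypothetical on-call roster keyed by service name.
ON_CALL = {"payments": Engineer("harry", "+64 21 000 0000")}


def hand_off(submitter: str, service: str, form: dict) -> dict:
    """Return the messages a triage bot might send for one report."""
    engineer = ON_CALL[service]
    # Flatten the form so the engineer sees exactly what was submitted.
    details = "\n".join(f"{key}: {value}" for key, value in form.items())
    return {
        # Reply to the submitter, tagging them and sharing the on-call number.
        "to_submitter": (
            f"Hi @{submitter}, I'll find the on-call person. "
            f"{engineer.name} is on call, phone {engineer.phone}."
        ),
        # Alert the on-call engineer with the full form contents.
        "to_on_call": f"@{engineer.name} new issue report from @{submitter}\n{details}",
    }


messages = hand_off("julie", "payments", {"summary": "Invoices will not load"})
print(messages["to_submitter"])
print(messages["to_on_call"])
```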

“Once we get this through to the engineer, we can tell the bot to start incident.”  

One of the things Xero targeted with their new incident management system is moving away from having incident chatter and normal chatter in the same channel.

“So, what [the bot] has done is that it’s split off to another channel in Slack and invites me, and makes me the incident controller.”
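A minimal sketch of what a "start incident" command could do is shown here in Python. It only models the bookkeeping in memory; a real bot would call the chat platform's API to create the channel and send the invitation. The channel naming scheme and the field names are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Incident:
    number: int
    title: str
    controller: str
    channel: str
    started_at: datetime
    members: list = field(default_factory=list)


def start_incident(number: int, title: str, requested_by: str) -> Incident:
    """Open a dedicated channel and make the requester incident controller."""
    channel = f"incident-{number}"  # assumed naming scheme
    return Incident(
        number=number,
        title=title,
        controller=requested_by,
        channel=channel,
        started_at=datetime.now(timezone.utc),  # kept for timing the fix later
        members=[requested_by],  # the controller is the first person invited
    )


incident = start_incident(42, "Invoices will not load", "anthony")
print(incident.channel, incident.controller)
```

Recording the start time at this point is what later lets a bot report how long the incident took to resolve.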

The incident controller, the puppet master

The incident controller is in control of the incident, explains Angell: not the person fixing it, but the puppet master.

“I’m orchestrating the resolution, just trying to help everyone out.”

“The cool thing about the bot is that I don’t need to know the incident management process – and neither do you guys.”

A green tick shows up under each command the user types in, indicating that the bot has seen the command and run it.

The post-mortem: A time for positive reinforcement 

“When it’s all done, we tell the bot it’s over. From here, we want to know how long it took to fix it. In this case [the live demo] took 7 minutes and 50.2 seconds.”

“Now, it will automatically archive this channel so we no longer have stale channels lying around everywhere.”
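Reporting the time to restore is simple arithmetic on the start and end timestamps. This Python sketch shows one way to turn an elapsed time into the kind of message quoted from the demo; the function name and the output wording are assumptions, not Multivac's actual output format.

```python
from datetime import timedelta


def format_duration(elapsed: timedelta) -> str:
    """Render an elapsed time as 'M minutes and S.s seconds'."""
    # Round first so that 59.96 seconds rolls over to the next minute.
    total = round(elapsed.total_seconds(), 1)
    minutes, seconds = divmod(total, 60)
    return f"{int(minutes)} minutes and {seconds:.1f} seconds"


print(format_duration(timedelta(minutes=7, seconds=50.2)))
# prints "7 minutes and 50.2 seconds"
```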

“And then – it gives us the next task and we start a post-mortem – because we all want to know why it broke.”

“We run the post-mortem command and we use Jira, which tracks our incidents and creates a Google doc with all the information from the incident.”

The Google doc, continues Angell, includes the incident title, how long the incident took and the people involved, as well as a number of fields the bot cannot fill out, such as summary, trigger and impact.

“After that, we go through things that went well because positive reinforcement is the key to resolving things as opposed to focusing on the negative.”
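The split between what a bot can fill in and what people must write can be sketched as a simple template. This Python example is illustrative only: it returns plain text rather than creating a Jira issue or a Google doc, and the layout and the "TODO" placeholders are assumptions.

```python
def post_mortem_doc(title: str, duration: str, participants: list) -> str:
    """Build a post-mortem skeleton with the bot-known fields filled in."""
    lines = [
        # Fields the bot already knows from the incident itself.
        f"Incident: {title}",
        f"Duration: {duration}",
        f"People involved: {', '.join(participants)}",
        "",
        # Fields left for the people in the post-mortem to complete.
        "Summary: TODO",
        "Trigger: TODO",
        "Impact: TODO",
        "What went well: TODO",
    ]
    return "\n".join(lines)


print(post_mortem_doc("Invoices will not load", "7 minutes and 50.2 seconds", ["anthony", "harry"]))
```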

Where to from here?

“We are building a report card system so we can call out teams that are not doing stuff the right way.”

“And, incident and alert portals – collating incidents over the last 24 hours and populating it in a channel.”

“Our team can find out what’s happened in the last 24 hours, instead of sending a whole handover email of what’s happened, which people automatically file in the ‘I don’t care’ pile.”
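A 24-hour digest of this kind comes down to filtering incidents by start time and formatting the result as one channel message. The Python sketch here assumes each incident is a dictionary with "title" and "started_at" keys; that shape and the message wording are hypothetical.

```python
from datetime import datetime, timedelta


def daily_digest(incidents: list, now: datetime) -> str:
    """Summarise incidents that started in the 24 hours before `now`."""
    cutoff = now - timedelta(hours=24)
    recent = [inc for inc in incidents if inc["started_at"] >= cutoff]
    if not recent:
        return "No incidents in the last 24 hours."
    lines = [f"{len(recent)} incident(s) in the last 24 hours:"]
    # List the oldest incident first so the digest reads chronologically.
    for inc in sorted(recent, key=lambda inc: inc["started_at"]):
        lines.append(f"- {inc['started_at']:%H:%M} {inc['title']}")
    return "\n".join(lines)


now = datetime(2022, 1, 2, 12, 0)
incidents = [
    {"title": "Invoices will not load", "started_at": datetime(2022, 1, 2, 9, 30)},
    {"title": "Slow logins", "started_at": datetime(2021, 12, 30, 9, 0)},
]
print(daily_digest(incidents, now))
```

Passing `now` in as an argument, rather than reading the clock inside the function, keeps the digest easy to test.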

“Incident management isn’t a one-stop shop, it’s an organically rolling framework. The more incidents you're exposed to, and the more you actually go and look at them, you’re building a beautiful framework that best suits your needs.”
