XT - What went wrong? How are they fixing it?
The root cause of Wednesday’s crash of the XT network, which continues to affect customers, is still unknown, but Telecom has revealed the steps it is taking to restore service.
Late last night Gen-i CEO Chris Quin sent a letter to Telecom’s corporate clients explaining what has occurred. He wrote that the cause is “suspected to be within the physical and logical paths in the transport layer between the cell site and the Christchurch switch.”
The Christchurch RNC (Radio Network Controller) switch is the main exchange for all calls made south of Taupo and it routes the traffic from 453 cell sites. There are 986 cell sites nationwide, and in the letter Quin deals upfront with the queries he’s received about the number of RNCs on the XT Network.
“Design of resiliency and redundancy of a mobile network is not only to do with RNCs. In fact within the UTRAN there are several factors including the number of RNCs, routers, TMUs, links as well as the number of users and type of devices in use.”
“We have always expected we would increase the capacity of the XT network as we add more clients on the XT network, and we have seen a large number of clients move to XT in a relatively short space of time. However, more RNCs alone do not deliver greater redundancy – each RNC is equipped with its own hardware redundancy but more RNCs will increase resiliency as we spread the load and that’s our plan.”
Translation? Telecommunications Review sought the advice of an expert who’s built several mobile networks around the world.
Basically, the XT Network is a sophisticated packet-based network, as opposed to a circuit-based network. Information is broken up into small pieces, each of which travels the fastest available route to its destination, where the pieces are reassembled. For example, if you want to send a 1MB photo, instead of travelling as a single 1MB file down one route, it gets chopped up into small pieces that can be sent via a number of routes and then reassembled at the destination in one piece. This is why the XT advertising is always claiming that it’s the fastest – the network architects have taken full advantage of the fact that fibre is connected to every cell site by installing an IP-data delivery network.
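To make the photo example concrete, here is a minimal sketch in Python of the chop-up-and-reassemble idea. It is an illustration only, not how the XT network is implemented: the function names, the 1,400-byte packet size and the shuffle standing in for "different routes" are all assumptions made for the example.

```python
import random

def split_into_packets(payload: bytes, packet_size: int) -> list[tuple[int, bytes]]:
    """Chop a payload into (sequence number, chunk) pairs."""
    return [
        (seq, payload[start:start + packet_size])
        for seq, start in enumerate(range(0, len(payload), packet_size))
    ]

def reassemble(packets: list[tuple[int, bytes]]) -> bytes:
    """Put the chunks back in sequence order and join them."""
    return b"".join(chunk for _, chunk in sorted(packets, key=lambda p: p[0]))

# A stand-in for the 1MB photo (1,048,576 bytes of dummy data).
photo = bytes(range(256)) * 4096

# Hypothetical packet size of 1,400 bytes, chosen only for illustration.
packets = split_into_packets(photo, 1400)

# Packets taking different routes may arrive out of order; a shuffle mimics that.
random.Random(0).shuffle(packets)

restored = reassemble(packets)
print(len(packets), restored == photo)
```

The sequence number attached to each piece is what lets the receiving end rebuild the photo no matter which order the pieces turn up in.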
It’s bleeding-edge technology, according to our expert. He says he doesn’t know how many other networks like it there are around the world.
Telecom may be among the first – which makes it so much more difficult to solve issues. Its technology partner Alcatel-Lucent is flying in experts – and equipment – from around the world to diagnose and solve the problem. Quin writes that Alcatel-Lucent’s CEO is personally engaged in the problem solving.
This is a problem being dealt with at the highest level of global telecommunications.
In the meantime, here’s what they’ve done to restore service:
Rewriting the MIB, which requires an RNC reset to install
Resetting the UTRAN router to clear a path and reset the interface
Locking out the base stations connected to suspect links
Managing the registration “storms” resulting from mobile re-starts from these base stations
Our expert translates:
The MIB is basically the interface that tells you what’s going on – it monitors and sends the information back to the core network. The UTRAN (UMTS Terrestrial Radio Access Network) is the underlying architecture – a collective term for the RNCs and the cell sites within a network.
Resetting the RNC, which connects the 453 cell sites south of Taupo, is probably what occurred on Wednesday when service went down at 10.30am, and why the initial Twitter reports were so confident that service would be restored at 1pm.
But then they discovered suspect links and had to shut down some cell sites. Adding to the problem, the rush of users reconnecting resulted in an overload. When you shut down an RNC, you biff off the users, and when they reconnect they have to be re-authenticated. A shut-down on the scale of Wednesday’s meant that tens of thousands of callers were being re-authenticated at the same time. It must have put enormous pressure on the network.
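A rough way to picture such a registration storm is to count how many re-authentication requests land in the busiest second. The Python sketch below is a toy model under stated assumptions, not Telecom’s or Alcatel-Lucent’s method: the 50,000 devices, the 300-second spread and the function name are all hypothetical, picked only to show why simultaneous reconnection is so much harsher than staggered reconnection.

```python
import random
from collections import Counter

def peak_registrations_per_second(device_count: int, spread_seconds: int, seed: int = 0) -> int:
    """Return the busiest second's request count when each device
    re-registers at a random whole second within the spread window."""
    rng = random.Random(seed)
    arrivals = Counter(rng.randrange(spread_seconds) for _ in range(device_count))
    return max(arrivals.values())

# Hypothetical figure standing in for "tens of thousands of callers".
devices = 50_000

# Everyone reconnects in the same second (a spread of one second).
all_at_once = peak_registrations_per_second(devices, spread_seconds=1)

# The same devices reconnect spread over five minutes (assumed window).
staggered = peak_registrations_per_second(devices, spread_seconds=300)

print(all_at_once, staggered)
```

In this toy model the all-at-once case puts every one of the 50,000 requests into a single second, while spreading them over five minutes cuts the busiest second to a small fraction of that.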
On the radio this morning Telecom CEO Paul Reynolds said 3% of the network is affected. In his letter Quin provides a service update and says he will be leading a webcast with his clients at 12.30pm today.
“While the remaining 94% of our cell sites nationally have not been affected, traffic loading on the network may mean that a percentage of calls in the Taupo south area may not get through,” writes Quin.
“Our average statistics for call accessibility in the southern region and retention in the area have been above 90% in the last while.
“Our absolute focus right now remains on restoring service in all these locations, and ensuring that restoration is sustained and that we deliver a stable and world class XT network as we know we can.”
He lists the following 38 sites as still down:
Dunedin: St Clair, Brighton, Carisbrook, Corstorphine Radio, Ardlui, Signal Hill, Dunedin Oval, Mosgiel, Tennyson Street, Waldronville, Swampy Summit, Waitati
Clutha: Clinton
Central Otago: Clyde Cell Site
Otago: Lake Hawea, Halfway Bush, Luggate, Otago Polytech, Brothers Peak, Millers Flat
Invercargill: New Hospital, Invercargill State Ins., Waikiwi, Surrey
Queenstown: Peninsula Hill, Glencoe, Camp Street
Timaru: Gleniti, Grantlea, Cantec, Mt Horrible, Timaru South
Southland: Winton, Te Anau
Canterbury: Waimate
Ashburton: Mt Alford
Waitaki: Oamaru
Wellington: Majestic Tower