On 12 July 2020, Jersey Telecom (JT) suffered a major outage that affected almost all services in Jersey and Guernsey.
“The service problem of the 12th July was the worst in JT’s 132-year history,” said JT’s CEO Graeme Millar. “We have already looked in depth at what happened, and taken the steps needed to prevent any repetition.
“We welcome this decision by the Jersey Competition Regulatory Authority (JCRA) to independently assess the measures we have taken, as we share their prime objective of making sure the islands have the most resilient possible networks. It’s important to note that throughout the incident, 999 calls made from ANY mobile device worked as normal.
“We have never encountered anything like this incident before, and neither has our equipment supplier, Cisco, who work with firms like JT all over the world. We look forward to sharing the results of our detailed and comprehensive investigation with the JCRA, along with the measures we have already taken to further protect the island’s networks in the future. Our preliminary report was published within a few days of the incident happening, and the final version is attached, so that everyone is clear on what happened.
“Our Board has also taken the step of appointing our former CFO John Kent, who is very well-respected within the sector, to conduct his own review of the steps JT have taken, and report directly back to them. The main reason that our networks have proved to be so resilient over the years is that we learn from any incidents, and make whatever improvements are necessary. Those incidents are rare, and we work very hard to keep it that way.”
What happened? A technical summary
JT’s Chief Technology & Information Officer Thierry Berthouloux shares the technical details of the cause of the issue, together with the rectifying actions:
“In common with most telecommunication operators, JT’s services rely on a fully resilient IP (Internet Protocol) network. JT operates a network composed of around 100 IP routers provided by Cisco and configured to a Cisco-approved design. Those routers are connected through the IP network to two clock sources (NTP “Network Time Protocol” servers, managed as primary and secondary for resilience).
“On July 12th at 18:55 BST, one of the two NTP servers generated a wrong date (specifically ‘27/11/2000’). Because this clock source still appeared available from a service point of view, the routers which had it as their primary did not switch to the secondary clock source and instead started to propagate the incorrect timestamp to JT’s other network routers.
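The failure mode described here, a time source that is reachable but wrong, can be illustrated with a minimal Python sketch. It is not how JT's routers or any particular NTP implementation behave; the function names, the source labels and the 1000-second threshold are assumptions made purely for illustration. The idea is that a source is only accepted if it is both reachable and plausibly close to the local clock, so an implausible primary is skipped in favour of the secondary.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold (an assumption for this sketch): reject any source
# that disagrees with the local clock by more than this amount.
MAX_PLAUSIBLE_OFFSET = timedelta(seconds=1000)


def is_plausible(source_time, local_time, max_offset=MAX_PLAUSIBLE_OFFSET):
    """Return True if the source's time is within max_offset of the local clock."""
    return abs(source_time - local_time) <= max_offset


def select_time_source(sources, local_time):
    """Return the name of the first reachable and plausible source, or None.

    `sources` is an ordered list of (name, reported_time) pairs, primary first.
    A reported_time of None means the source is unreachable.
    """
    for name, reported in sources:
        if reported is not None and is_plausible(reported, local_time):
            return name
    return None


# 18:55 BST on 12 July 2020 is 17:55 UTC.
local = datetime(2020, 7, 12, 17, 55, tzinfo=timezone.utc)
bad_primary = datetime(2000, 11, 27, tzinfo=timezone.utc)  # reachable but wrong
good_secondary = local + timedelta(milliseconds=20)

chosen = select_time_source(
    [("primary", bad_primary), ("secondary", good_secondary)], local
)
print(chosen)  # secondary
```

A check of this kind treats "available" and "correct" as separate conditions, which is the distinction the incident turned on.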
“As a result, 15 of our 100 routers received the wrong date and isolated themselves from the rest of the network. By doing so, they made 35 other routers unreachable. With around half of the network lost, the inherent resilience and redundancy of the network design was lost too, and the network failed, resulting in the consequences described above for end-user services.
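How a small number of isolated routers can strand a larger number of healthy ones can be sketched with a reachability check over a toy graph. The topology below is entirely hypothetical and far simpler than JT's network; the router names and the `reachable` helper are assumptions for illustration only.

```python
from collections import deque


def reachable(adjacency, start, failed):
    """Return the set of nodes reachable from start, skipping failed nodes."""
    if start in failed:
        return set()
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen and neighbour not in failed:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


# Hypothetical toy topology: two aggregation routers, each serving two edges.
adjacency = {
    "core": ["agg1", "agg2"],
    "agg1": ["core", "edge1", "edge2"],
    "agg2": ["core", "edge3", "edge4"],
    "edge1": ["agg1"],
    "edge2": ["agg1"],
    "edge3": ["agg2"],
    "edge4": ["agg2"],
}

failed = {"agg1"}  # one router isolates itself
up = reachable(adjacency, "core", failed)
stranded = set(adjacency) - up - failed  # healthy but unreachable routers
print(sorted(stranded))  # ['edge1', 'edge2']
```

In this toy case one failed router strands two others; the incident report describes the same effect at larger scale, with 15 isolated routers making 35 more unreachable.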
“Amongst those impacted routers, two terminate our submarine cable connections to the UK (London), and one terminates our submarine cable connection to France (Paris). Also amongst them were the four routers which are used as gateways to our geo-redundant mobile network core systems located in Jersey and Guernsey.
“This caused JT to lose access to all our corporate services, meaning that we could not reach our customer databases or use our email services from the Channel Islands. Some JT personnel located outside the Channel Islands had access to emails but not to our central databases or to the Business Continuity team who were working on the resolution in our offices.
“It was around 23:30 BST on July 12th that communication between JT locations started to resume. We chose to prioritise restoring services and monitoring to prevent any recurrence over customer communication in the hours immediately following the service incident. The need for more robust customer communications during service incidents forms one of many learnings for JT following this incident.
“In order to restore service, our engineers had to physically attend the multiple sites where the routers are located. We needed to manually change the time on each affected router to replace the incorrect date. This took considerable time, especially to reach the routers located outside of the Channel Islands. Our last router in Paris was corrected on July 13th at 16:00 BST.
“Once the time had been updated on the isolated routers, most of the Channel Islands services were restored. However, as noted above, it took up to a further 36 hours for all internationally located devices to reconnect. This can be explained by the sudden return of connectivity generating a spike of activity on some of our own and our partners’ platforms. Those platforms, even though largely over-dimensioned, were not sized to recover from a full outage. Some of our telco partners also interpreted these spikes as abnormal and suspicious behaviour and automatically shut down the links to JT as a precaution.
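A reconnection spike of this kind is commonly softened by having clients retry with exponential backoff and random jitter, so that devices do not all reconnect at the same instant. The source does not say whether JT or its partners use this technique; the sketch below is a generic illustration, and the function name and the `base` and `cap` values are assumptions.

```python
import random


def backoff_delays(attempts, base=1.0, cap=300.0, rng=None):
    """Return one randomised delay in seconds for each reconnect attempt.

    The upper bound doubles with every attempt (base, 2*base, 4*base, ...)
    up to `cap`, and each delay is drawn uniformly between 0 and that bound.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        delays.append(rng.uniform(0, ceiling))
    return delays


# A seeded generator makes the example repeatable.
delays = backoff_delays(5, rng=random.Random(42))
for attempt, delay in enumerate(delays):
    print(f"attempt {attempt}: wait {delay:.2f}s")
```

Because every device draws its own random delay, the load on a recovering platform is spread over time rather than arriving in one burst.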
“We understand and apologise for the impact this outage has had on all our customers. Whilst the cause of the outage was a sequence of events that was almost impossible to foresee, we recognise that we have much to learn from both the failures and successes of how we recovered the situation.
“Clearly our number one focus is on network resilience and reliability, as shown by the fact that we have not had an outage of this scale since our foundation in 1888. We will redouble our efforts to ensure that such an event never happens again during our lifetimes. We will also document all the learnings about how we can respond more quickly, restore services faster and provide better support and communication to our customers throughout any future incidents. Thank you very much for your patience, support and understanding.”