The entire Auckland train network was shut down during the evening peak today due to a power failure in Wellington. In this case it wasn’t a failure of political power from government ministers, who would probably like nothing more than to close the network down, but a problem at KiwiRail’s train control centre. The Herald reports:

KiwiRail has apologised for the widespread disruption, which was caused by a power outage in its National Train Control Centre in Wellington.

The outage affected all radio and signals in Auckland, requiring all train services to be halted.

KiwiRail chief executive Jim Quinn said KiwiRail was working closely with Veolia in order to get trains moving again.

Mr Quinn said an urgent investigation is underway to understand the cause of the outage. “We are taking the situation very seriously,” he said.

The outage only affected the Auckland metropolitan network and train services elsewhere in the country were operating as scheduled.

Trains caught between stations were forced to pull in to platforms after the outage, allowing passengers to disembark.

Veolia Transport said it was trying to source as many rail bus services and taxis for customers on the network as it could.

The fault at KiwiRail’s National Train Control Centre in Wellington, which handles Auckland’s signals and radio control, occurred at about 4pm.

Limited services on the Eastern line started running just after 5pm, with all signals back up and running by 5.30pm.

KiwiRail only recently moved control of the Auckland network down to Wellington, as part of the new, supposedly state-of-the-art signalling system. Some immediate questions that come to mind are:

  • Where are the back-up systems and why did they fail?
  • Why was Auckland the only region affected?
  • What are the disaster recovery plans and how long would Auckland be without train services if there was a serious earthquake in Wellington?
  • Is KiwiRail going to compensate Auckland Transport for this? After all, we are paying them ever-increasing amounts in track access fees.
  • Does this get added as just one fault to the list?

I’m sure these and other questions will be answered as part of the review; hopefully that review is made public. About the only positive thing to come out of this so far is that, for a brief time today, we actually had integrated ticketing.

Rail tickets are accepted on the following Buses: NZ Bus, Metro link, Waka Pacific, Howick & Eastern and Go West.


28 comments

  1. I certainly hope that the contract between Auckland Transport and KiwiRail contains some pretty nasty financial penalties for events like this.

  2. I guess it doesn’t really matter in this age of telecommunications, but why is the Auckland network operated from Wellington? Do they do the whole country from there?

  3. To quote Dr Roy Lange’s comment to his son David when observing a similar situation some years ago: “Disgusting!”

  4. This last line is quite interesting:

    Rail tickets are accepted on the following Buses: NZ Bus, Metro link, Waka Pacific, Howick & Eastern and Go West.

    Hey look, integrated ticketing is actually possible and happening? Could someone please explain why we can’t do this the other 364 days a year?

  5. I think an important thing to note is that this was not “a major power outage”. It appears to have affected only a couple of workstations in the National Train Control Centre. I don’t know whether it was the workstation itself (i.e. the “Auckland desk”) or something downstream of that point in the servers or data transmission area. Questions do need to be answered about why the back-up failed. But as only one part of one office on one floor of one building was affected, it is very much a gross overstatement to describe it as a major power outage.

    Yes, the entire country is controlled from the one control centre. It’s been that way for some time and works pretty well overall. It allows great flexibility for different workstations to add or shed workload according to demand. For example, I know that during the weekends, when things are quieter, the Wellington operator also takes on Taranaki and the North Island east coast, but loses these areas to their own operator during the busy working days. It is a very practical and efficient way of working. It happens to be in Wellington because that was Tranz Rail’s head office at the time it was established, but it wouldn’t really matter where it is located.

    Back-up for a major disaster is relatively easy to put in place – the whole shebang is PC based. Getting the staff on the ground at the back-up location is the most time-consuming part of that exercise. So no need to worry about an earthquake.

  6. I think Joyce was showing Brownlee around the control centre and said “Look, this is what happens when you push the red button”... :)

  7. @grunter –

    With all due respect, claiming it was a minor power outage because it only affected one part of one office, when the practical effect was to disrupt tens of thousands of commuters during the crucial evening rush hour, is a very worrying bit of denial. It doesn’t matter if the outage was to just one PC caused by a toaster tripping the circuit breaker that supplies power to them both – it is still a major customer-affecting outage (heard of them?) and deserves to be taken a bit more seriously than your breezy “nothing to see here” dismissal.

    1. Sanctuary –

      Fair point, I’m not denying there was a major effect, and it is not good that the back-ups meant to sort this out failed to kick in. I guess what I was trying to answer was the question asked: why was only Auckland affected if there was a major power outage at the NTCC? Well, the answer to that is exactly as I said: there in fact was not a major power outage. It was a relatively minor one, affecting a small area of the building and leaving other work (and therefore other areas of the country) unaffected. There is no denial that it was a serious problem and had a big effect. It is being taken seriously, and I don’t think I dismissed it as nothing to see. In my original post I did say “Questions do need to be asked.”
      Matt L asked some specific questions about this, and a couple – “Why was only Auckland affected?” and “What are the disaster recovery plans and how long would Auckland be without train services if there was a serious earthquake in Wellington?” – I attempted to answer by saying 1/ it was actually a very localised fault, and not, as the media reported, a major outage, and 2/ it is all PC based and transferring control is relatively easy, with getting the staff on site being the most time-consuming part.
      I’m sorry if attempting to answer questions is seen as being dismissive. It was certainly not intended that way, but I do feel that the most important issue, before debating the merits or otherwise of the NTCC and back-up plans, is to understand exactly what went wrong. For example, people have asked where the stand-by generators are. Well, I know the answer to that too – they’re on the ground floor near platform 2 at Wellington Station. But they would not have helped in this instance, due to the nature of the fault, and it is important for people to understand the nature of the fault to realise that.

      1. Grunter, thanks for your perspective.

        What I don’t understand though is: if the system is just PC based and a few workstations were out, why couldn’t Auckland train control have been given over to one of the less loaded workstations to manage while the (presumably) “smouldering heap” of the (old) Auckland workstations was cleaned up?

        I can accept that a transfer between control centres raises a manpower issue, but not being able to transfer within a working control centre smacks of a bigger issue at play, one that is way wider than just a couple of dead workstations.

        1. I was a victim last night and a bit of an expert on delivering high-availability IT systems, and nothing I have heard gives me any confidence that our rail system is fault tolerant! Two PCs fail and we are all left to walk home. Please!
          Just stopped in the tunnel – here we go again!

        2. Greg – The Auckland system is unique for NZ and far more advanced than what is used everywhere else (for some parts of it we are even the first place in the world to have them), which all means it probably isn’t as simple as just shifting the load over to another workstation.

        3. Matt,
          Accept what you’re saying, but if it’s that “world class”, then why is it not possible to transfer functions within the centre to other locations? Or are you saying that the “Auckland train system management consoles” are unique in the world and cannot be replicated to other devices in the same centre quickly and easily?

          Seriously, if I had designed or built a world-beating system like this and it failed (albeit with a possibly fairly unique set of issues, to be sure), but the actual design was such that it effectively had no redundancy, then I’d be out on my ear and looking for a new career come Monday, I can tell you.

          It’s not like a 50-cent shackle failed on an earthing wire attached to a power pylon (or is it?).

        4. I’m saying that, from what I have heard, it isn’t possible to run the system on the other workstations in the centre as the Auckland ones are unique (the system is in other countries already; the world-first part is actually some of the gear in the trackside cabinets). Reports today have said the issue was a faulty UPS, and I have already seen questions from the public about why there wasn’t more than one, which is standard in many environments where high reliability is needed.

        5. Because the controlling PC has to surrender control, and as it had shut down, effectively “freezing” the outside sites, it wasn’t going to go as easily as a normal transfer. It could have been done, and ways of making that quicker in a failure situation will be part of the debrief, I expect.
          The Auckland system currently runs different software from the rest of the country (all that bi-directional dual-main operation), although that is changing as other parts of the country get upgraded, but some of the other desks could have taken control had it been possible to issue the commands from the station that already had control. This is nothing new – even the old local control panels had to be “given” control, rather than being able to “take” it (a rough sketch of this give-not-take pattern follows at the end of this thread).
          I hope that makes sense.

        6. Correct, the software that runs the Auckland signalling system is not available on the other desks – it doesn’t even use the same data link!
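
To illustrate the “control must be given, not taken” behaviour described in the replies above, here is a minimal, purely hypothetical sketch in Python. None of the names (ControlRegistry, the desk names) come from KiwiRail’s actual software; it only shows why a desk that has shut down leaves its territory frozen until someone with authority releases it.

```python
# Hypothetical sketch of a "give, don't take" control hand-over.
# Not KiwiRail's software; names and behaviour are illustrative only.

class HandoverError(Exception):
    pass


class ControlRegistry:
    """Tracks which desk currently controls each territory."""

    def __init__(self):
        self._owner = {}  # territory -> controlling desk

    def assign(self, territory, desk):
        self._owner[territory] = desk

    def owner_of(self, territory):
        return self._owner.get(territory)

    def hand_over(self, territory, from_desk, to_desk):
        """Control must be surrendered by its current owner; another desk
        cannot simply take it. A desk that has shut down can never make this
        call, which is why its territory stays frozen until a supervised
        recovery procedure releases it."""
        if self._owner.get(territory) != from_desk:
            raise HandoverError(f"{from_desk} does not control {territory}")
        self._owner[territory] = to_desk


if __name__ == "__main__":
    registry = ControlRegistry()
    registry.assign("Auckland metro", "Auckland desk")

    # Another desk cannot grab the territory while its owner is unreachable:
    try:
        registry.hand_over("Auckland metro", "Wellington desk", "Wellington desk")
    except HandoverError as err:
        print("refused:", err)

    # The normal path: the owning desk actively gives control away,
    # e.g. when desks are consolidated for a quiet weekend shift.
    registry.hand_over("Auckland metro", "Auckland desk", "Wellington desk")
    print("now controlled by:", registry.owner_of("Auckland metro"))
```

The same asymmetry is what makes the routine off-peak consolidation described earlier in the comments easy, and an unplanned failure of the owning desk hard.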

  8. What concerns me is the lack of redundancy in critical systems. They had only one UPS, so when it failed all downstream systems failed. Critical systems should have dual power feeds from two independent power sources/UPSs. Anything else is just poor design and/or points to doing things on the cheap.

    Did they understand the risks and the consequences of a potential single point of failure?

    1. It clearly never occurred to anyone that they were creating an SPF with this design. Even if they used commodity workstations with non-redundant power supplies, they still should’ve had at least two workstations covering every track section, with no UPS supplying more than one machine covering any given track section and no power phase supplying the same UPSs for any given track section (see the sketch at the end of this thread). No degree of protection is too great when it’s so cheap and easy to do it properly and there are lives at stake.

      1. There were no lives at stake – when the system fails, everything stops, as you may have noticed. Inconvenient and frustrating, very much so, but not unsafe. There would be a great deal more to worry about if the signals defaulted to a proceed indication. In fact such an occurrence is called a wrong-side failure, and they are taken very, very seriously. So much so that even the merest suspicion that a signal may have failed wrong-side will see it locked down at red and a bulletin issued to drivers instructing that it be treated as displaying a red light, even if it is not, until complete testing confirms there is no problem.

        1. Grunter, whilst there might not be a collision danger from a total system stop where everything fails to red, the consequences can still potentially be fatal. Sure, you might be able to get away with it in a low-density suburban environment, where the train can run to the nearest station and people can take the bus home. But in a crowded urban environment, particularly where you have large volumes of passengers and underground operation, when the service stops, you exponentially increase the risk that something, somewhere is going to go wrong.

          Sure, things go wrong sometimes. But if Auckland’s signalling has not been designed to give the degree of fault tolerance that the growing levels of patronage require, then it needs to be understood and resolved quickly, rather than swept under the carpet.

          On the subject of compensation, in the UK there’s a fairly extensive system of delay attribution, where faults – and the number of minutes by which services are consequently delayed – are attributed to operators or Network Rail, with compensation paid as required.
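
As a rough illustration of how a delay-attribution scheme like that turns into a compensation figure – all the incidents, parties and the per-minute rate below are invented, not real UK or NZ numbers – the arithmetic is essentially:

```python
# Invented example of delay attribution: total delay minutes per responsible
# party, converted to compensation at a made-up flat rate.
from collections import defaultdict

incidents = [
    ("infrastructure", 95),  # e.g. a signalling/power fault
    ("infrastructure", 40),
    ("operator", 12),        # e.g. a late crew
]

RATE_PER_MINUTE = 5.00  # hypothetical dollars per attributed delay minute

totals = defaultdict(int)
for party, minutes in incidents:
    totals[party] += minutes

for party, minutes in totals.items():
    print(f"{party}: {minutes} delay minutes -> ${minutes * RATE_PER_MINUTE:.2f}")
```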

        2. Of course there are lives at stake. We’re talking about dozens of tonnes of moving metal. Every time a person gets in a car their life is at stake. There may be all the safety precautions in the world in place but there is still risk, and all the more so with rail signalling and train communications. Bad enough to lose signal control, but losing radio communications as well means an inattentive driver could easily lead to many deaths. This was absolutely a life-risk situation, no matter how you try and spin it otherwise.

        3. Matt, now you are getting emotive.
          I’m puzzled by your response. At first you state that there is always an element of risk, no matter how much protection there is, and up to that point I do not disagree. An inattentive driver (road or rail) can cause an accident. But that risk did not change with the failure. The chance of a driver missing a red light does not change depending on the reason why it is red.
          The system fails to a safe state: loss of command will cause signals to default to a stop setting (a toy sketch of this follows at the end of this reply). An all too familiar experience, but normally only in ones or twos, sometimes a section of line. In this rather extreme case, the entire region.
          Now, the radio problem compounded that by making recovery operations a great deal more cumbersome, and thus slowing the process even more. But that, to a degree, is exactly what should happen. Safe operation could not be automated and managed by the normal means, therefore everything is stopped, and remains stopped, until safe operation is able to continue. It is quite misleading to state that this could easily have led to many deaths. That is quite frankly wrong, and you know it. The system could be working absolutely fine and an inattentive driver could miss a light and cause many deaths – it happened last month in Germany, and last week in Amsterdam, both places people here frequently hold up as the standard we should be aiming for. I’m not saying, and don’t believe I said, that there is absolutely no risk whatsoever. I stand by the statement, though, that this failure did not materially change the level of risk. Attempting to carry on regardless, and bowing to public pressure to resume operations before systems were restored, would have been risky, which is why that pressure was resisted.
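
The fail-to-stop principle described above can be shown with a toy sketch (illustrative Python only, not real interlocking code): on loss of the command link a signal falls back to its most restrictive aspect instead of holding or guessing a proceed aspect.

```python
# Toy illustration of fail-safe signalling: loss of command defaults to STOP.
# Not real interlocking code; names and behaviour are illustrative only.
from enum import Enum


class Aspect(Enum):
    STOP = "red"
    CAUTION = "yellow"
    PROCEED = "green"


class Signal:
    def __init__(self, name):
        self.name = name
        self.aspect = Aspect.STOP  # start in the safe state

    def command(self, aspect, link_up):
        """Apply a commanded aspect only while the command link is up.
        Losing the link defaults the signal to STOP (a 'right-side' failure);
        the dangerous case would be defaulting to PROCEED."""
        self.aspect = aspect if link_up else Aspect.STOP
        return self.aspect


if __name__ == "__main__":
    sig = Signal("hypothetical Auckland signal")
    print(sig.command(Aspect.PROCEED, link_up=True))   # Aspect.PROCEED
    print(sig.command(Aspect.PROCEED, link_up=False))  # Aspect.STOP: link lost
```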

    2. Actually they did; there is a crapload of redundancy in that office – it is a sparkie’s nightmare! There wasn’t just one UPS; however, on this occasion the backups didn’t work as they should. That’s not to say there weren’t any, but there’s a lot more than you say there were.

      By the way, some of you are commenting on why KiwiRail didn’t have this, or why they did that. Fact is, most of you are wrong and don’t know what you are talking about!

      1. Ah yes…who needs an investigation when all the information the armchair experts need to analyse the problem and conclude the root cause is in the Herald!?

      2. Clearly the redundancy designed in was insufficient. It should be impossible for anything less than physical incursion into the integrity of the operations centre to render such a critical system unavailable.
        Some of my past work has been the design and audit of high-availability IT systems, so I do actually know what’s possible, and I also know that this outage clearly demonstrates a lack of sufficient redundancy somewhere in a system that has life-critical implications.
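
Pulling together the redundancy points raised in this thread, here is a back-of-the-envelope sketch. The power-feed layout and the availability figures are invented for illustration (the real NTCC layout is not public in this thread); it shows a trivial check for a power source whose loss takes out every workstation on a desk, and why a second independent UPS changes the arithmetic so much.

```python
# Invented example: spotting a single point of failure in a power-feed layout,
# plus the availability arithmetic for one UPS versus two independent ones.

# Which power sources feed each workstation covering the (hypothetical) desk.
feeds = {
    "auckland_desk_primary":   {"UPS-A"},
    "auckland_desk_secondary": {"UPS-A"},  # shares the same UPS
}

# A power source whose loss takes out every workstation is a single point of
# failure for the desk.
all_sources = set().union(*feeds.values())
spofs = [s for s in all_sources
         if all(s in sources for sources in feeds.values())]
print("single points of failure:", spofs)  # ['UPS-A']

# Why a second, independent UPS matters: if one UPS is available 99.9% of the
# time, two independent ones only fail together (1 - 0.999)^2 of the time.
a_single = 0.999
a_dual = 1 - (1 - a_single) ** 2
print(f"single UPS availability: {a_single:.4%}")
print(f"dual independent UPS availability: {a_dual:.6%}")
```

None of this says anything about what actually failed on the day; it is only the shape of check an audit like the one described above would run.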
