The greatest crisis of modern times that fortunately never happened
On the second weekend of February, the Baltic countries disconnected from the Russian power grid. The electricity systems of Estonia, Latvia, and Lithuania were successfully synchronized with the Continental European network. Although for the average person nothing seemed to change, months of preparation had been going on behind the scenes.
In an age when technology surrounds us everywhere, we explored how this landmark weekend went at the largest data center in the Baltics and what a large-scale power outage actually means for Estonia. In an interview, Toomas Kell – Greenergy Data Centers’ Operations and Technology Director and Management Board member – shed light on the preparations and outcomes.
When everyday people were urged to stockpile food, water, and other essentials due to the threat of a crisis, how did you prepare?
We considered it important to verify that our electrical infrastructure was functioning properly and that the backup systems designed to kick in if the power fails were in working order. That was the main technical task on our side.
On the other hand, we went over our procedures related to crisis management. All institutions providing critical services have a crisis plan – essentially a plan for how to cope with a crisis. An inseparable part of this is a recovery plan – meaning if something does happen, how do we act, what do we do, and how do we resolve the situation. We reviewed these plans and updated them according to the circumstances.
What was the most difficult aspect of that?
We had to look very closely into all the tiny routines and details and understand them, to ensure we wouldn’t drop the ball anywhere, because a small mistake can escalate into a major blunder rather quickly.
Speaking of small details as an example: one potential scenario was that power goes out across all of Estonia and all communications fail – how would we continue executing the recovery plan? Our internal agreement is that if the crisis management team can no longer communicate among themselves, then within at most 90 minutes everyone will gather at the data center.
How did you check the backup systems?
We have UPS units and batteries on site that take the first hit when the power fails and provide electricity to critical systems until the generators start up. This means that from the client’s perspective the service doesn’t go down – power is supplied from the backup systems.
To review this backup infrastructure, we inspected the batteries, looked at how they hold up, and checked how the UPS systems perform. Secondly, we tested the generators, which we do regularly anyway. We started up the generators and observed how they run. Generators need fuel – we performed fuel quality analyses.
Finally, we carried out a so-called half-blackout test. We cut off one of our two power feeds and watched how all the different components responded to the event. In other words, does the power supply transition to the UPS batteries, and after that do the generators start up, etc. We do short tests weekly, where the generators are run for up to 15 minutes, but this time we ran the entire facility on generator power for over an hour. We performed the same kind of test on the other feed as well.
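To make that sequence concrete, here is a minimal illustrative sketch of the failover timeline described above: the UPS batteries have to bridge the gap between losing a feed and the generators picking up the load. This is not GDC’s actual control logic, and all figures (UPS autonomy, generator start time, safety margin) are hypothetical placeholders.

```python
# Illustrative sketch of a mains-failure failover timeline.
# All numbers are hypothetical; real UPS autonomy and generator
# start times depend on the installed equipment and the load.

UPS_AUTONOMY_S = 600       # assumed battery runtime at current load (seconds)
GENERATOR_START_S = 60     # assumed time for generators to start and accept load
TRANSFER_MARGIN_S = 30     # extra headroom required before we call it safe


def failover_is_safe(ups_autonomy_s: float,
                     generator_start_s: float,
                     margin_s: float) -> bool:
    """Check that the UPS can carry the load long enough for the
    generators to start, plus a safety margin."""
    return ups_autonomy_s >= generator_start_s + margin_s


if __name__ == "__main__":
    if failover_is_safe(UPS_AUTONOMY_S, GENERATOR_START_S, TRANSFER_MARGIN_S):
        print("UPS bridges the gap until the generators take over.")
    else:
        print("Gap too large: load would drop before the generators are online.")
```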
How long could you have operated on your fuel reserves?
Our reserves are sized according to industry standards: if the facility is running at full load, we must ensure power for 72 hours straight. That is our absolute minimum. Based on our current load, we could manage independently for over a month.
In addition, we have guaranteed fuel delivery agreements with two different suppliers. This means we would be refueled by the next business day – as long as fuel is available at all. In other words, even in a larger disruption lasting weeks or months, we would have done everything on our side to stay operational.
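As a rough back-of-the-envelope check of how the 72-hour full-load requirement translates into runtime at lower loads, the sketch below assumes fuel consumption scales roughly linearly with load; the load figure used in the example is hypothetical, not GDC’s actual utilization.

```python
# Back-of-the-envelope fuel autonomy estimate.
# Assumption: generator fuel burn scales roughly linearly with load,
# which is a simplification; real generator fuel curves are non-linear.

FULL_LOAD_AUTONOMY_H = 72.0   # reserves must cover 72 h at full load (stated minimum)


def autonomy_hours(load_fraction: float,
                   full_load_autonomy_h: float = FULL_LOAD_AUTONOMY_H) -> float:
    """Estimate runtime on reserves at a given fraction of full load."""
    if not 0 < load_fraction <= 1:
        raise ValueError("load_fraction must be in (0, 1]")
    return full_load_autonomy_h / load_fraction


if __name__ == "__main__":
    # At a hypothetical ~10 % of full load, 72 h of full-load fuel stretches
    # to roughly 720 h, i.e. about a month - the same order of magnitude
    # mentioned in the interview.
    print(f"{autonomy_hours(0.10):.0f} hours, about {autonomy_hours(0.10) / 24:.0f} days")
```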
During these reviews and tests, did you discover any shortcomings? Or did everything work flawlessly?
The technical side worked perfectly. Since we conduct regular preventative testing at the facility, any faults are usually eliminated before we even perform system tests. It was more the procedures that needed reviewing. The plans existed, but they needed to be thought through and the roles reviewed – who needs to do what and where if a given scenario unfolds – all the way to crisis communication. In a worst-case scenario, our first priority is to restore normal operations and then communicate with clients, not the other way around. There are a lot of little nuances.
Have you had to think about these kinds of nuances before?
We conduct major drills once a year, where we run through all the theory in practice. We take a real scenario from our crisis action plan – for example, no more power can be drawn from the grid or someone gets injured – and we walk through that process. Communication is also part of all of it.
In the end everything went well, but you had to be prepared for the worst. What scenarios did you prepare for?
The main keywords were loss of power and power quality. If power quality is compromised, it affects the equipment. Since our clients’ hardware is quite expensive, any deviations can turn out to be extremely costly.
What about physical security?
Even though our complex is already secure as is, we ended up with an additional layer of protection. Although we are not yet officially listed as critical infrastructure for the state, right next door to us is the Harku electrical substation, which was being very heavily guarded. Reinforced concrete barriers were brought in, barbed wire was set up. The Estonian Defence League (Kaitseliit) and the police were on site. We knew theoretically before that our neighbor would be protected, but now we saw it happen for real.
This in turn created some interesting challenges. Quite often, clients come to our facility at night if something has broken. If you get stopped at 1 AM by police armed with automatic weapons, you can imagine the emotional reactions vary. Some feel great pride that the matter is being taken so seriously. Others might wonder, “Why are they giving me a hard time? I just wanted to pop in and quickly fix my problem like usual.”
What did that weekend look like in reality? Did any situations come up that you had to solve on the fly?
We gathered our “troops” on-site. Together with the operations control center, we monitored both the disconnection from the Russian grid and the synchronization with Continental Europe. We kept a keen eye on the critical events.
Even though we went into a state of heightened mental readiness, the goal was not to sow panic among our own team. My message to the control center was also to stay calm and operate in a normal mode. Fortunately, that’s how it went – we didn’t have a single technical hiccup. Everything went smoothly.
Did you guys clink glasses in the control room when the connection to Continental Europe succeeded?
We didn’t have moments quite that celebratory. There was no clinking of champagne glasses – but we did have warm pastries at the ready. Honestly, we were just satisfied at heart that it all went well, both for us and at the country level.
Let’s be honest, the media managed to raise the nationwide level of anxiety quite a bit. I’ll give an example from personal experience: I go to a large supermarket every Friday, and that Friday – the day before the desynchronization – the store was packed and there wasn’t a single shopping cart available. It felt like there would be no tomorrow.
On the other hand, this may have been a good drill and practical preparation for the people and society. The threat of war is still theoretical, but a widespread power outage was a tangible event. Everyone thought through what they would do if some crisis hit. As a result, as organizations, businesses, and as a country, we are now much better prepared.
So what was the main lesson learned?
For institutions and organizations who went through a crisis or recovery plan for the first time now, it was extremely difficult. They had to step out of their comfort zone – their normal routine was disrupted and people were acting rather nervously. If you do this regularly, the routines are clear: you don’t rush, you move at a normal pace, you’re not all flustered. Sometimes it’s smarter to stand still than to run – because if you act without thinking, you might run in the wrong direction.
Previously, when we talked about our backup systems – underground fuel tanks, backup generators and so on – it was all in theoretical terms. Now that there was actually a threat in the air, people started to understand why we do things the way we do. The fact that we are built to be crisis-resilient was likely one of the reasons why that weekend passed so uneventfully for us.
When might you need this kind of preparation again?
I’d like to say never, but that’s probably not true. On the other hand, I want to emphasize again that the role of a proper data center is to be ready for crises. Our center is built in such a way that it withstands crises. Our team is put together in a way that we can withstand crises.
“There will always be uncertainty in the air”
The preparations at the State IT Center (RIT) are described by its director, Ergo Tars:
“Preparing for desynchronization gave us an opportunity to better understand how important RIT’s services are and how high our clients’ expectations are. The process highlighted how crucial it is for public sector institutions and providers of vital services to have up-to-date crisis management and service restoration plans. Since the world and the security situation are changing rapidly, testing and improving such plans must be on the agenda every year.
We prepared in two directions: ensuring RIT’s own continuity and equipping clients with computer workstations in case the risks materialize. More broadly, we learned how to secure critical IT services for the state in a situation with power and communication outages. This required thinking through many details: what kind of computer workstations and access to provide to people coming from the private sector to help the government, how to ensure access to data, how to address the need for printing, how to maintain internal information exchange (for example via satellite connection), and much more.
The biggest challenge is to ensure that all of RIT’s clients whose tasks are critical to the continuity of the state receive the necessary IT support in an unexpected situation. We have clients across Estonia, but resources – both money and people – are limited. The challenge is finding balance: how to prioritize the most critical needs and use the resources at hand as effectively as possible.
By mapping risks, streamlining processes, and developing alternative solutions, we can boost our preparedness, but some uncertainty always remains in the air, because every crisis situation is different and requires flexible solutions. Thus, the challenge is to ensure flexibility within the rules, so that we can support our clients regardless of the nature of the crisis.
In addition to ensuring the continuity of RIT’s services, during our preparations we paid great attention to our employees’ awareness and readiness, so that our people know how to act in crisis situations and can, if necessary, support their families and loved ones. At the same time, we saw that this kind of awareness and practice needs to be continuously developed so that responses are smooth and confident.
We highly value the initiative of our clients to think through not only the synchronization event, but also more broadly the impact of crises on their services – how those services depend on IT solutions, electricity, and communications. We encourage clients to systematically map their critical dependencies, and we are always ready to support them in drills and provide input for running through various scenarios, so that together we can further improve the nation’s resilience.”
Planning should have started earlier
Tarmo Tulva, Head of IT Infrastructure at the energy company Enefit, shares the company’s experience:
“More thorough preparations for the desynchronization project started about four months before the deadline. From the whole group’s perspective, we reviewed both the information security measures already implemented and the additional measures planned specifically for the desynchronization period. We communicated with our telecom partners and agreed on crisis contact points and methods for exchanging information. We also requested additional support from the partners who supply us with network equipment.
The primary prerequisite for the success of such projects is proper planning, and the lesson usually is that one could and should always start that planning a bit earlier than one eventually does. In this project as well, we should have begun the planning somewhat sooner in order to analyze all risks and possible mitigation measures more thoroughly.
The biggest risk factors were definitely related to cybersecurity. Unfortunately, you can never be completely sure that malicious actors didn’t begin their work years ago and aren’t simply waiting for the right moment to strike. Therefore, we extensively analyzed various possible scenarios and mapped out both preventive actions and possible responses in case something were to happen on the critical day. One risk mitigation measure was setting up a ‘backup command post’ at the GDC data center – partly to be as close as possible to our physical infrastructure, and partly to be confident in the availability of power and data connectivity. In reality we didn’t end up needing this command post, but running through that scenario as an exercise was certainly useful for us. We now have a clear understanding of how to relocate our operations center to another site if needed, and what steps must be taken beforehand to do so.
The actual desynchronization itself went unnoticed for us and there were no cyber incidents. The takeaway for the future is that planning should be started even earlier. This project also helped draw attention to additional cybersecurity measures that need to be implemented.”