The blog entry about the outage makes good reading, and Google Translate does a good job of it!
https://www.sipgate.de/blog/infos-zur-stoerung-der-s...
Good shout. Google Translation below for ease...
Disruption in one of our data centers
Steffen, 06/23/2022
There was an outage yesterday, June 22nd, that severely limited our telephony and our logins. The reason: after an excavator severed the power supply to one of the data centers we use, a series of failures took the data center offline, and our telephony with it. Our sipgate emergency team got to work immediately, of course, but it took from around 4 p.m. until just after midnight before all systems were running correctly again.
What happened at the data center?
Our contingency plans provide for uninterrupted operation in the data centers we use. Through a chain of unfortunate circumstances, however, exactly these plans failed. At 4:18 p.m. it was clear to us that something was wrong and that the data center was without power. About an hour earlier, an excavator had cut the power cable. When the power goes out, the uninterruptible power supply (UPS) steps in to bridge the gap with battery power, and then a diesel generator at the data center takes over. However, the generator heated up so much that after about an hour of operation it triggered a fire alarm and then shut itself down completely as an emergency measure. Our data center service provider is currently investigating how this could happen.
Our emergency and redundancy system
Each of the two data centers we use carries around half of the sipgate telephony load. When one failed and calls routed there could no longer be put through, the other was unable to take over all the traffic as planned. We have a variety of failover mechanisms for emergencies, but unfortunately not all of them worked.
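To illustrate the general arithmetic behind a two-site setup like this, here is a minimal Python sketch. It is not sipgate's tooling, and the post does not say that raw capacity was what failed. The site names, load figures and capacities are hypothetical, in arbitrary units. It only shows that when each site carries half the load, the surviving site needs capacity for the full load before failover can work at all.

```python
def surviving_headroom(loads, capacities, failed_site):
    """Spare capacity on the remaining sites if `failed_site` goes down.

    A negative result means the survivors cannot absorb all the traffic.
    """
    total_load = sum(loads.values())
    surviving_capacity = sum(
        capacity for site, capacity in capacities.items() if site != failed_site
    )
    return surviving_capacity - total_load


def tolerates_any_single_failure(loads, capacities):
    """True if the loss of any one site still leaves enough capacity."""
    return all(
        surviving_headroom(loads, capacities, site) >= 0 for site in capacities
    )


# Hypothetical figures: two sites, each carrying half of the load.
loads = {"dc_a": 50, "dc_b": 50}
sized_for_own_share = {"dc_a": 70, "dc_b": 70}    # some headroom, but not 2x
sized_for_full_load = {"dc_a": 100, "dc_b": 100}  # each site can carry everything

print(surviving_headroom(loads, sized_for_own_share, "dc_a"))
print(tolerates_any_single_failure(loads, sized_for_own_share))
print(tolerates_any_single_failure(loads, sized_for_full_load))
```

Capacity is only a necessary condition. As the post describes, the failover mechanisms that move traffic to the surviving site also have to work.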
What did we do?
Our emergency team immediately rerouted traffic in many different places, redistributed load and traffic to the proxy servers, changed the deployment and did everything we could, under high pressure, until late at night. We kept our customers up to date in our status blog. Around 5:30 p.m., VoIP telephony was looking better again and most calls were going through. The problem here: our connection to the Telekom network was offline. We were able to gradually reroute outgoing calls to other carriers. Incoming calls were a different matter: they all passed through equipment that was still disrupted, so they did not work. Nevertheless, VoIP telephony recovered relatively quickly. Mobile communications were quite the opposite: there we had problems for a total of six hours because the network log-on or the switchover to the second data center did not work as intended.
Around midnight, all of our components in the data center that had lost power were supplied with power again. At 1:23 a.m. the data center was completely online and most of our telephony was restored.
What did we learn?
We had no influence over what triggered yesterday's exceptional situation. But our emergency and redundancy system was not up to the situation and needs to be improved. We are now examining exactly what did not work well and where, and where we found bottlenecks, dead ends and missing redundancies. Then we will draw conclusions from it. In other words: we will make ourselves even more fail-safe, we will have more machines, and we will add more connections. We are sorry that yesterday afternoon went the way it did for our customers. Thanks for your understanding!
You can find more information about yesterday's outage and general disruptions at sipgate in our status blog.
Update from June 23, 2:30 p.m.: The sudden, massive failure inevitably broke things, and we are currently repairing them. Not all services are working as we would like. Among other things, we still have problems in the account area with announcements, the call queue ("waiting field"), the event list, fax and e-mail notifications.