OVH lost a datacenter to a fire, what do we learn?

We all make mistakes, but it’s easy to point fingers at others. Twitter and other social media platforms have been full of hate against OVH over the outage, with very little empathy for the people working on the problem. I’m afraid we will not learn anything, and this will repeat itself in the future.

The problem is not OVH, but us. We think we’re smart because we’re saving pennies, but in the end we fail at calculating the risks and costs of such events; we deem them impossible, because anything with a probability lower than 0.001% is just 0% for the human brain.

It is also not about choosing a different provider; there’s nothing special about OVH that makes it riskier than other solutions when purchasing the same type of products. Cloud services carry the same risks when you do the same (naive) things.

And we should stop pretending that we all do things the right way. Because we don’t, and we know it; we’re just afraid to admit it.

So yes, I am also at fault. I have more things without a proper backup than I would like to admit. (I am talking at a personal level here; at my workplace it’s handled quite rigorously.)

A short story – I also got impacted by OVH

A few weeks ago (or maybe a month, I don’t recall anymore) I started playing Valheim with friends. It’s a nice game to pass the time with; ideal for these days when we can’t do much outside our homes. Because we had to rely on a single player being online for the rest to join, I thought: let’s create a proper Valheim server.

So I went and got a small VPS on OVH, followed the process to set up Valheim on it, configured a DNS record so it would be easy to type in, and we began gaming together on the server.

And a week or so later I saw tweets from several people about the OVH data center being on fire. I checked my VPS and, surprise surprise… it didn’t work.

I checked and it seems that the VPS is in SBG3, so the data is still there. I decided to just wait patiently until they manage to put things back into service.

I didn’t make a single backup, and we could have lost the 10+ hours of gaming between the two of us there. Sure, this is almost nothing: I could have just started another server from scratch, not much of a problem. We would have lost some items and progress, but considering that I already have around 170 hours in-game, 10 hours doesn’t seem like much.

But as unlucky as I might have been to get hit by the incident so soon after starting, I also see it as a blessing; other people and companies lost a lot of critical data with no means of ever recovering it, because even if they shipped the remains of the drives to you, no amount of money would let any data recovery company restore anything after such a fire.

The moral of the story here for me is that off-site backups are cheap. Don’t wait for tomorrow to take care of it.

I even recall seeing the “backup” option on OVH when purchasing and dismissing it. And now I’m like… why? Why did I do that?

Backups are cheap and easy

We have lots of complex problems in IT, and backups are not one of them. Still, we lack the diligence to perform them, automate them and verify them. I don’t understand why, but it happens everywhere, to everyone.

Maybe this is something that should be managed by default, by a third party. It’s money well spent. And when you want to point fingers, you have every right to do so.

A backup is not going to save us from downtime; in fact, in such an event it could take anywhere from hours to days for a company to get the service back up. But this scenario is still way better than losing all the data; a single event like this could put a company out of business if it weren’t for backups.

I see a lot of people blaming others for keeping their backups “in-server”. I know this seems obvious, but anyway: the problem is not that they have on-site backups, the problem is the lack of off-site ones.

This subtle distinction is important, as on-site backups should be done as well. The reason is that the same server is more reliable than another server: you don’t know if the other server will be up when the backup is scheduled, or whether the credentials have expired. Off-site backups can fail for more reasons than on-site ones. Also, on-site ones are faster to retrieve and manage/edit. Overall this leads to a lower MTTR (mean time to recovery) on typical failures.

Once you have the on-site backup done, you just copy it over to another server, service, tape, or whatever you like.

A very cheap way to do this is to just have a small server at the office and copy the backups from your server to it. If the server you want to back up is at the office, then the backup needs to be sent elsewhere: you can take a disk home or use the internet connection to upload it to a shared drive.
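
As a rough sketch of what that could look like (the hostnames and paths here are made up, and it assumes an SSH key is already set up for passwordless access), the small office box can simply pull the latest on-site backups every night:

    #!/bin/bash
    # pull-backups.sh, run on the office box from cron, e.g. 30 3 * * *
    # Pulls the server's backup folder over SSH; --archive keeps permissions
    # and timestamps, --delete mirrors removals so the copy stays identical.
    set -euo pipefail
    rsync --archive --delete \
        myserver.example.com:/var/backups/myapp/ \
        /srv/offsite-backups/myapp/

That single cron job already gives you a second copy of your data outside the datacenter.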

As it should become clear from the tone of this post, I’m trying to target people that don’t have a proper Disaster Recovery strategy. I hope this helps someone to avoid this kind of scenario in the future.

Just to be clear, this post is not about what the proper practice should be. It’s about the bare minimum. If I ranted here about the proper practices as an SRE, I would scare a lot of people off. And that’s not what I want.

So let’s keep things simple: If you’re not doing an off-site backup, start ASAP. Just add something. If you don’t know or don’t want to mess with this, ask your provider to do it for you for a price.

My opinion on how OVH handled this incident

These things happen. A major incident in a datacenter is not unheard of. Fires, lightning, wildlife… they happen, just not frequently enough for most people to remember.

OVH handled this really well publicly. They were transparent, shared a lot of details and tried to help all customers. It sounds like they also went on a per-customer basis when customers asked for support. That’s very good.

The velocity at which they’re solving, fixing and rebuilding is also outstanding. Congrats to all the teams involved, that’s really hard work.

And again, the outage and data loss are not OVH’s fault. They’re the customers’ fault. That is as clear as day to me.

What I think wasn’t that good is that the customer portal (dashboards, etc.) wasn’t working properly until several days later. That suggests that OVH’s own systems aren’t completely prepared to handle an event of this scale.

Also, the photos of the site show that the different buildings are really crammed together. What’s the point of splitting the data center into separate buildings if they’re so close that they will surely be impacted at the same time?

The building that caught fire looks like it’s made from shipping containers. I have zero insight into what they’re like on the inside, but this suggests that OVH cheaped out when building their sites. How that might have contributed to this event, I have no idea. They have an investigation ongoing and I hope they also make some details public.

Something I’m quite sure of is that this event will hit OVH’s reputation, and they will have to do things way better than the competition to regain it. Because of this, in the next 5 years OVH will probably be better prepared than most ISPs to prevent and handle these incidents.

They also stated that they’re going to provide customer backups for free, without the customer having to ask. This is a great move; I hope it gains traction and pushes other companies in the same direction.

This incident made me realize that not only is a single datacenter susceptible to major events, but the whole site is too. And a major event might not only take the site off the grid, it might also destroy all the data inside.

Think for example of a scenario where violent lightning strikes the site. All power lines are usually connected to every building. Then it’s not hard to imagine a UPS catching fire because of it. If the buildings are built identically and the lightning is bad enough, it could set fire to several UPSes in every building. By the time the firefighters arrive, the whole site is already engulfed in flames. Say bye-bye to any data you had in there. (This is a far-fetched scenario from someone with zero idea of how DCs are built.)

What this made me realize is that backups that don’t leave the area are risky, and replication across buildings is, at least, not enough. Surely OVH took note as well.

How to keep these events from impacting you

Backups are great, but when something like this happens, you’ll have a hard time bringing stuff back up. In some cases this is an acceptable outcome, in others it isn’t.

I’m sure everyone has heard by now about Docker, and maybe Kubernetes too. This is one of the scenarios where they come in handy. Twitter also sometimes throws hate at Kubernetes, but those who were using it properly when this happened probably had zero downtime.

First, we need to understand that any running application has installation/dependencies, code/binaries, configs, assets, and state. Maybe your application doesn’t have all of those; that’s ok. But it probably has state, for example a database, or a folder with user-uploaded data.

State is anything that changes upon user interaction and records the current condition of the application. We need the state to be isolated from everything else. That means our application code should not be in the database, and user-uploaded content should not be inside the application code folder.

We should be able to say: this folder is code, this other folder is user data (state), this database is state, and so on, without mixing them up, and be able to back them up separately.
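
As a minimal sketch of what “separately” means in practice (the layout, paths and database name below are made up), each piece gets its own backup step:

    #!/bin/bash
    # backup-pieces.sh: back up each piece on its own (hypothetical layout).
    #   /srv/myapp    -> code/binaries, rebuildable from Git, no backup needed
    #   /srv/uploads  -> user-uploaded content (state)
    #   "myapp" DB    -> state (assumes PostgreSQL here)
    set -euo pipefail
    STAMP=$(date +%F)
    tar czf "/var/backups/myapp/uploads-$STAMP.tar.gz" /srv/uploads
    pg_dump myapp | gzip > "/var/backups/myapp/db-$STAMP.sql.gz"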

There are lots of reasons to do this, but the one I want to highlight is that the state changes much more frequently than anything else and it’s outside of our control. The other pieces can usually be restored in an easier way.

For example, let’s say our application code on the server gets deleted. Is it a problem? Not much; I bet you have a copy at the office, in a Git repo, or on a co-worker’s workstation. Did you lose 10-20 commits? Well, that’s bad, but I bet you can more or less remember the last things you did and code them again.

On the other hand, say some user photos got corrupted. Can you get them back? Without a proper backup, you can’t. You aren’t the one who created that data, so without a backup all you can do is ask your users to re-upload it. This is bad.

For backups you can get away with backing up everything together, but if we want to go further, to replication, this is no longer an option. The state needs to be treated differently.

With Docker you can bundle the dependencies of your application. Deploying it is then just a matter of running the code in the Docker container.
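
For illustration only (the base image, files and port below are hypothetical; swap in whatever your app actually uses), bundling the dependencies can be as small as a Dockerfile plus two commands:

    # Write a minimal Dockerfile for a hypothetical Python web app.
    cat > Dockerfile <<'EOF'
    FROM python:3.9-slim
    WORKDIR /app
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt
    COPY . .
    CMD ["python", "app.py"]
    EOF

    # Build the image once, then run it on any server that has Docker.
    docker build -t myapp:latest .
    docker run -d --name myapp -p 8080:8080 myapp:latest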

The server installation can also be automated with Ansible, if you need it. But if you’re not installing several servers, doing it manually can be less toilsome. A bash script can be a good middle ground as well.
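
That bash middle ground could be as simple as a script you keep next to the code and re-run on any fresh machine; a sketch assuming a Debian/Ubuntu server and a hypothetical image name:

    #!/bin/bash
    # setup-server.sh: rough setup for a fresh Debian/Ubuntu box.
    set -euo pipefail
    apt-get update
    apt-get install -y docker.io rsync
    systemctl enable --now docker
    # Pull and start the application container (hypothetical registry/image).
    docker pull myregistry.example.com/myapp:latest
    docker run -d --restart unless-stopped --name myapp -p 8080:8080 \
        myregistry.example.com/myapp:latest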

These two parts are very easy to spread across a fleet of servers; they’re usually constant. And when you want to update the app, you either change the Docker image or replace the code (depending on your approach). The problem, as you will see, is the state.

Files or user-uploaded content can sometimes just be rsync’ed over. The problem is that even if you synchronize them every minute, there will be a time window where one server does not have the files and might fail requests.
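
A sketch of that kind of sync (the hosts and paths are made up), as a crontab entry on the serving machine:

    # Mirror the uploads folder to the standby every minute; --delete keeps
    # both copies identical. Anything uploaded after the last run is the
    # window you can lose.
    * * * * * rsync --archive --delete /srv/uploads/ standby.example.com:/srv/uploads/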

Databases can run in their own Docker container, but the state needs to be stored outside, because Docker will remove it upon restart and/or limit its maximum size. Then the problem becomes how to replicate this data. Having two servers with diverging state can be very harmful, and state can’t just be copied over.
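
Keeping the data outside the container just means mounting a host directory as the data volume; a sketch assuming PostgreSQL (the host path and password are placeholders):

    # The container is disposable; /srv/pgdata on the host is the state that
    # survives restarts and is what you back up or replicate.
    docker run -d --name db \
        -e POSTGRES_PASSWORD=change-me \
        -v /srv/pgdata:/var/lib/postgresql/data \
        -p 5432:5432 \
        postgres:13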

For databases, primary-secondary replication can be useful for nearby replication. For a truly off-site setup, in a different region or site, it becomes a problem because routing write queries to the primary adds latency and a possible network bottleneck. Another option in this case is to keep the secondary as a hot standby: it doesn’t process any queries, it’s just there waiting until an incident happens; in case of a failure, it can be quickly reconfigured as the primary and take over.
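
As a sketch of that hot-standby idea with PostgreSQL (the hostname, path and replication user are assumptions, and the primary must already allow replication connections; other databases have their own equivalents):

    # On the standby: clone the primary and write the standby configuration
    # (-R creates standby.signal plus the connection settings, PostgreSQL 12+).
    pg_basebackup -h primary.example.com -U replicator -D /srv/pgdata -R -P

    # During an incident: promote the standby so it starts accepting writes.
    pg_ctl promote -D /srv/pgdata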

With this approach, the uploaded content can also use a simple rsync as stated before. As both servers are not actually serving at the same time, there’s simply no consistency problem. And in the event of a total failure, yes, you might lose a few seconds of database changes and a minute or two of user-uploaded content. While this is bad, as long as you’re not handling user payments or anything similar, it should be fine.

This setup assumes that someone will manually reconfigure the secondary when an event happens. As with everything, it can be automated. But compared to backups alone, this, even as manual as it is, is a big improvement: a configuration change may take 5-15 minutes, while restoring a backup and potentially installing a new server takes many hours. And the backup is going to be older than 2-3 minutes; probably a day old. So I’d say this is way better.

And then we have Kubernetes and similar tools. These can handle application updates, load balancing and so on. Investing in them can make these events practically invisible; not only to your users, but also to your own developers. If something happens over a weekend, automation will take care of it. No need to wake anyone up in the middle of the night.
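
As a tiny sketch of why that helps (the image name is hypothetical): running several replicas of the same Deployment means Kubernetes recreates them on healthy nodes if a machine disappears:

    # Create the deployment, then run three replicas of it; the scheduler
    # normally spreads them across nodes.
    kubectl create deployment myapp --image=myregistry.example.com/myapp:latest
    kubectl scale deployment myapp --replicas=3
    kubectl expose deployment myapp --port=80 --target-port=8080
    # If a node dies, the missing pods are recreated elsewhere; check with:
    kubectl get pods -o wide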

Ideally, we would use more sophisticated approaches such as master-master replication and special filesystems in order to have true replication working. NoSQL databases can also help simplify the master-master replica.

If we deploy on a cloud service, such as AWS or Google Cloud, they also have their own products for doing a lot of this; so it’s easier to just use their database and storage options, which have the features needed to do all of this seamlessly, without having to worry.

The benefit of cloud products is not that they’re more resilient than other ISPs; as I said at the beginning, this kind of issue happens to everyone. The true difference is that you get access to extra services that make your life much easier when building a true zero-downtime application.

The Disaster Recovery Server

Sure, backups are easy but have drawbacks. Replication is nice but hard to set up. Cloud is expensive. Is there anything in the middle?

Yes! Disaster Recovery is quite a simple concept. You just have another server, in another location. On this server we basically repeat the whole installation of our application. It should receive constant backups (or be the secondary in a replica if we want to do it better) and be more or less ready to take over.

This is quite straightforward to set up: if you were able to set up one server, you can set up two. The DR server just sits there, getting updated every so often, more or less ready to take over.
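
A sketch of that periodic refresh (hosts, paths and the database name are made up, and it assumes the PostgreSQL-style dumps from earlier), run from cron on the DR server every few hours:

    #!/bin/bash
    # refresh-dr.sh: keep the DR server roughly up to date from backups.
    set -euo pipefail
    # Pull the latest backups from the primary.
    rsync --archive --delete primary.example.com:/var/backups/myapp/ /var/backups/myapp/
    # Reload the most recent database dump into a fresh database.
    latest=$(ls -t /var/backups/myapp/db-*.sql.gz | head -n 1)
    dropdb --if-exists myapp && createdb myapp
    gunzip -c "$latest" | psql myapp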

When an event strikes, we just update the DNS to point to the DR server and make sure everything is up. This might be around 30 minutes of work/downtime.
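
The DNS change itself happens in your DNS provider’s panel or API, but the “make sure everything is up” part can be a couple of quick checks (the domain and health path are hypothetical):

    # Confirm the name now resolves to the DR server's address:
    dig +short myapp.example.com
    # Confirm the application actually answers there:
    curl -fsS https://myapp.example.com/healthz

Keeping a low TTL on that DNS record ahead of time also shortens how long old clients keep hitting the dead server.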

Because this server is only used for short periods of time, if you want to save even more money you could get something smaller than the regular server. All you need is for it to be able to run the application without crashing or freezing, so a slower hard drive and processor might be acceptable. Just be careful: when the load shifts there it might overload the server, and that would defeat the point of having it.

If you have several applications, you could host half of them on one server and the other half on the other. In the event that one fails, just enable them all on the remaining server.

Then, the only thing remaining is to schedule a quarterly test of the DR server and simulate this scenario. Shift the load to the DR server for a few hours and inspect how it performs.

If it does worse than the main server, that’s okay, as long as it can deliver. After testing this, just consider: could we run on this server for a month if we needed to, or would it be a problem? If it’s fine for a month, it’s fine for a DR server.

I think this approach is the easiest and cheapest of them all. If you have any system that sits below this bar, you should seriously consider at least doing this.

Conclusions

  • If you lost data, it’s your fault, not your provider’s.
  • If you don’t have automated off-site backups, start doing them now. Maybe you weren’t impacted this time, but you might be by the next one, with this provider or any other.
  • If you don’t have a Disaster Recovery server (or better), think seriously about setting one up, and schedule tests quarterly.
  • The next time someone makes fun of containers, Kubernetes or similar I’ll ask them about their DR plan.

Bonus: If you ask me how it really should be done…

  • Follow 3-2-1 strategy or better: https://www.backblaze.com/blog/the-3-2-1-backup-strategy/
  • Test your backups regularly (monthly)
  • Set up monitoring for backups, and test that you actually get an email if a backup fails or fails to copy off-site (see the sketch below).
  • N+2 Replication. If you need just one server, get three and set up replication and load balancing.
  • Neither RAID nor replication is a backup. Do backups, always.
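
As a minimal sketch of that monitoring point (the paths, size threshold, address and mail setup are assumptions; it presumes a working local mail command), a daily cron job on the off-site box can check that a recent backup actually arrived and is not suspiciously small:

    #!/bin/bash
    # check-backup.sh: run daily after the off-site copy window.
    set -euo pipefail
    latest=$(find /srv/offsite-backups/myapp -name '*.tar.gz' -mtime -1 | head -n 1)
    if [ -z "$latest" ] || [ "$(stat -c%s "$latest")" -lt 1024 ]; then
        echo "Backup missing or too small on $(hostname)" \
            | mail -s "BACKUP FAILED" admin@example.com
    fi

And break it on purpose at least once, to confirm the alert email actually arrives.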