Three years ago I wrote “The Cloud is overrated!”. Since then I joined Google as an SRE, and I’ve been asking myself whether Cloud makes sense for me or not. Even before joining Google, GCP was my first option for Cloud; it seems quite good, and of the three major providers (along with AWS and Azure) it is the cheapest option. And let’s be fair, my main complaint about Cloud is price. Vendor lock-in is my second concern, and Google again seems to be the fairest of the three. Anyway, this isn’t about which provider is better, but about when Cloud is a good idea.
Proper Cloud deployments are pricey and also demand a lot of developer time; if it has to be done right, it’s not just about deploying a WordPress on a VPS-style service in the cloud.
What is Cloud about?
Cloud is about having easy-to-use tools to deploy scalable and reliable applications without having to worry about how to implement the details.
We need to think about scaling and zero downtime. These are the only two factors that will determine if you should pay the extra cost or not.
Everything else is extra services that they provide for you, such as Machine Learning. If you want to use those services, you could always set up the minimum on the given Cloud to make them work and call them from the outside, no problem. So these are out of my analysis here.
Vertical scaling

When you deploy an application on a server and later need more resources, you’ll have to migrate it to a different, beefier server if it no longer fits. On a VPS you usually have the option to upgrade to more compute resources as well.
In Cloud, the range of machines you can run code on is quite big. From tiny (1/2 CPU, 1 GiB RAM) to enormous (32 CPU, 512 GiB RAM). This gives quite the flexibility to keep growing any service as needed.
The other thing is that they allow for fast upgrades and downgrades, and can even automate them. This can be used to reduce the cost overnight when there is less load. But be aware that even with this, it’s highly unlikely that you’ll get a cheaper option than a bare-metal server.
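To see why overnight downscaling rarely beats a flat bare-metal price, a back-of-the-envelope comparison helps. All prices below are made-up illustrative numbers, not real Cloud or hosting quotes; plug in your own.

```python
# Back-of-the-envelope: does scaling down overnight beat a flat bare-metal price?
# The rates here are hypothetical, only the shape of the comparison matters.

def monthly_cost(day_rate, night_rate, day_hours=16, days=30):
    """Cost of a big instance during the day plus a small one at night ($/month)."""
    night_hours = 24 - day_hours
    return days * (day_rate * day_hours + night_rate * night_hours)

cloud = monthly_cost(day_rate=0.40, night_rate=0.10)  # $/hour, hypothetical
bare_metal = 150.0                                    # flat monthly price, hypothetical

print(f"cloud with overnight downscale: ${cloud:.2f}/month")
print(f"bare metal:                     ${bare_metal:.2f}/month")
```

Even paying the cheap night rate two thirds of the day, the hourly billing can still end up above a fixed bare-metal invoice; the saving is real but rarely enough to flip the comparison.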
As with a VPS, Cloud services usually guarantee data durability; no need to do maintenance or migrations because disks fail. This is the downside of bare-metal servers: you need to handle the maintenance yourself and migrate to a new server if the disks start to fail, risking data loss.
Horizontal scaling

This kind of scaling refers to splitting the service into different partial copies that work together in parallel. This is especially needed when the service won’t fit on a single machine.
The problem here is that most of the time applications are stateful, and this means that the state needs to be split or replicated across the different instances.
Cloud helps here by providing database and file-sharing services that do this for you, so your service can be stateless, leaving the complexity to the Cloud provider.
In Cloud, you can also spawn dynamically more instances of your services to handle the load.
Reducing downtime to zero
This is basically done by replicating data and services across different data centers. If one goes down, your service will still be up somewhere else.
This is the most important part I believe, so I’ll leave the details for later.
When should we think about using Cloud?
This is an important decision, as it’s hard to convert a typical service (monolithic, a single thing that does it all) into something that is going to make good use of the Cloud’s benefits. It’s better to decide this in the design phase if possible.
In recent years there has been a boom around “Big Data” and Cloud, and everyone talks about NoSQL, sharding (horizontal scaling), etc. But a lot of this has been buzzwords, a way of looking cool. Is it really that cool for everyone?
All these things are meant for horizontal scaling (sharding), which means that we expect to use more than one machine for one of the services (e.g. the database).
It sounds really cool, but it’s not really worth it for the majority of cases. Unless you have a big project on hand, chances are that it fits in an average server.
Why not use sharding anyway? Well, it’s usually more expensive to run 5 machines than a single one with all that power combined. Sharding also imposes a lot of design restrictions that are quite hard to handle, so it will substantially increase the time needed to develop the application. Unexpected requirements along the way will sometimes force a full redesign, because sharding requires certain premises to hold (how to split the service) that cannot be changed midway without a lot of effort.
The other problem with sharding is that using X machines is always less efficient than using X threads, and X threads are less efficient than a single-threaded CPU X times more powerful. Parallelizing does not scale linearly; there’s a trade-off, so always think about this.
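This trade-off can be made concrete with Amdahl’s law: the speedup from N workers is capped by whatever fraction of the work stays serial. The 90% figure below is just an example fraction, not a claim about any particular workload.

```python
# Amdahl's law: speedup from n workers when a fraction p of the work parallelizes.
# Even a small serial fraction caps the benefit of adding machines.

def speedup(n_workers, parallel_fraction):
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_workers)

for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} workers -> {speedup(n, 0.9):.2f}x")  # 90% of the work parallelizes
```

With 90% parallel work, 16 machines give only a 6.4x speedup, and the curve flattens from there; one machine 16 times as powerful would give the full 16x.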
Cloud is not (only) sharding, and sharding is not Cloud. If your service will never need to span more than one computer, there’s no point of adding the complexity.
I would recommend plotting a growth forecast for your service over 5-10 years. Also plot the forecast for server growth; it usually increases 2x every two years (see Moore’s law). If your growth seems close to that, you definitely need to consider sharding from the start. Also remember that there are periods of stagnation, where certain areas see no improvement for years.
If you go for sharding, the databases provided by the Cloud provider will make your life much easier, but they will be your vendor lock-in. Once the application is coded with a particular Cloud DB in mind, it will be quite hard to move away from that provider later. If this is a concern, look into making it generic enough: there are usually projects that let you swap the DB, or plugins to connect to these DBs, so you can migrate later with less effort.
If in doubt, go for sharding. If you already need >25% of the biggest machine available, go for sharding. Better safe than sorry.
For me, here lies what applies to most applications and companies: how much is your downtime worth? How much is your data worth?
A server can fail; an entire data center can be struck by lightning or engulfed in flames. Assuming you have off-site backups, how much data is lost in this scenario? Hours, a day, a week? How much time would be needed to get everything back up and running on a new server?
For example, on a server I use for a personal project I do an on-site database backup every two days, and an off-site full-disk backup every day. This means I could lose one or two days of data. And if it happens, it would take me 5 days to get it up and running again (because it’s a weird setup and I can only use my spare time). In this case the downtime and the data are worth almost zero, as the project generates no revenue while costing money. Still, the amount of time needed to set it back up is something I need to fix.
To minimize these scenarios we use replication, and it should always be off-site replication. Sharding must stay on-site (same data center), while replication is better off-site.
If you manage the database yourself and use sharding, you can dedicate a fraction of the servers to redundancy. In this case, N+2 is always recommended: if you need 5 servers to handle the load, run 7 so at least 2 servers can fail. When managing RAID yourself, I would recommend RAID 6. In most cases this will not apply.
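The N+2 rule above is simple enough to encode; the peak load and per-server capacity below are hypothetical request rates, not measurements from any real system:

```python
# N+2 provisioning: size the fleet so two servers can fail (or be drained
# for maintenance) while the remaining ones still handle peak load.

import math

def fleet_size(peak_load, per_server_capacity, spares=2):
    needed = math.ceil(peak_load / per_server_capacity)  # N servers for the load
    return needed + spares                               # plus redundancy

# Hypothetical: 50k req/s peak, 10k req/s per server -> 5 needed, run 7.
print(fleet_size(peak_load=50_000, per_server_capacity=10_000))
```

Note that `spares=2` covers one unplanned failure during a planned drain, which is why N+1 is usually considered too tight for sharded fleets.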
Regardless, you need a full working copy elsewhere. Here you can go N+1 or N+2. Having another set of servers far away running the software in parallel avoids an outage that could last weeks.
When using Cloud you can take advantage of the huge network between the different data centers. They usually have a separate network, not the internet, that is blazing fast with small ping times, which you can use to communicate between data centers, making real-time replication across servers possible. Still, don’t go crazy and don’t set up the different servers too far apart: as fast as those networks can be, they still have to obey physics and are bound by the speed of light (no kidding here: light travels at roughly 50% of c in fiber, and this can be used to estimate ping times).
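The speed-of-light estimate is easy to do yourself. Using the rough 50%-of-c figure from above, the physical lower bound on round-trip time between two sites is:

```python
# Lower bound on round-trip time between two data centers, assuming signals
# propagate through fiber at roughly 50% of the speed of light (rule of thumb
# from the text; real paths are longer than the straight-line distance).

C = 299_792_458          # speed of light in vacuum, m/s
FIBER_SPEED = 0.5 * C    # rough propagation speed in fiber

def min_rtt_ms(distance_km):
    one_way_s = (distance_km * 1000) / FIBER_SPEED
    return 2 * one_way_s * 1000   # round trip, in milliseconds

for km in (100, 1000, 8000):
    print(f"{km:5d} km -> at least {min_rtt_ms(km):.1f} ms")
```

So two data centers 1000 km apart cannot do better than roughly 13 ms per round trip, no matter how good the network, which puts a hard floor under synchronous replication latency.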
If you want to use a regular ISP with VPS services, check whether they also have an internal network interconnecting the data centers; this is starting to become the norm lately.
The problem with replication is that the cost of running the service is now 2x or 3x, as you need far more space and servers than before.
If cost is a problem, I would recommend doing only a primary plus a “warm” read-only secondary. This means that all writes go to the primary, and the secondary only applies those changes in real time. In an incident, you might lose a few seconds of data that haven’t reached the secondary yet. If this is a problem, check whether the database can wait until the secondary confirms the data is there, but this comes with a huge penalty on write speed and latency.
The secondary could be smaller than the primary, or be used for other stuff, since only applying changes uses a very small amount of resources (but the same amount of disk space). In that case, though, if the secondary needs to be promoted to primary, it may suffocate under the load, and the application would be almost unavailable until a new server is brought up. So it’s best to avoid small secondaries if possible: that approach only serves to back up data with a resolution of seconds, but it will not be good enough for taking over.
On Cloud, they can also automate this replication for your database and files, and even automate the promotion from secondary replica to primary when things fail. Sharded databases do this best.
My final thoughts
I find Cloud products prohibitively expensive for my personal projects, and adding proper replication puts them even further out of reach.
But on the other hand, I find it extremely difficult to properly build the automation for replication and takeover. These things are hard to implement and to test well enough to ensure they will help rather than hurt.
So it seems that either there is not much money involved and the risk of data loss or downtime is not a big deal, or there is, and then the Cloud’s price is quite justified.
In the end this is about whether you want to take the risks yourself or pay extra so someone else deals with them. Generally I would go with the second and rest easy.