Clusters, Sharding and Replication: When is it sensible to have them?

In recent years I have heard a lot of people talking about cool projects like Cassandra, MongoDB, Hadoop and Spark. While these are really great tools, I’m a bit worried that people jump on the sharding trend too easily.

Everyone wants to brag about how they use Map&Reduce or how they replicate their data. But we should be wary of those things: for some use cases they are not as performant as we think.

Horizontal scaling, or sharding, brings a huge performance benefit, but it also comes with lots of problems. Knowing the trade-offs and the costs involved is a must before going all the way.

The same goes for replication and things like Kubernetes. While they offer a lot of benefits, there are also costs and trade-offs.

Many machines VS one beefy machine

Parallelism has drawbacks. The first is a loss of performance, or efficiency. While a single-threaded task uses its core at 100% efficiency, several threads have to compete for resources and spend time syncing up. When there are several machines to sync and jobs to partition, the losses are even higher.
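This efficiency loss can be sketched with Amdahl’s law: if a fraction p of the work parallelizes, n workers give at most 1/((1−p) + p/n) speedup. A quick Python illustration, where the 90% figure is just an assumed example:

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Best-case speedup with n workers when a fraction p of the work
    parallelizes; the (1 - p) serial part caps the gain (Amdahl's law)."""
    return 1.0 / ((1.0 - p) + p / n)

# Even assuming 90% of the job parallelizes perfectly -- optimistic,
# since this ignores network sync and partitioning costs entirely:
for n in (3, 10, 100):
    print(n, round(amdahl_speedup(0.9, n), 2))  # 3 -> ~2.5, 10 -> ~5.26, 100 -> ~9.17
```

Note how 100 machines don’t even reach a 10x boost here: the serial 10% dominates, and real coordination costs only make it worse.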

Don’t even think about spreading the load across 2-3 average servers without first checking whether one single awesome server can do it in time. Partitioning is hard to get right: you’ll need several good engineers, and even then it can go wrong in several ways. Having all the data in one place is easier to manage, back up and restore, and you can query it in all kinds of strange ways. Sharding will heavily limit how the data can be accessed: while some queries will be blazing fast, others will be almost impossible.

The simplest way of using several machines properly is to dedicate each machine to one task: the database on one, the programs on another.

I can’t stress it enough. Keep things simple.

Also, keep in mind that horizontal scaling without replication is a huge risk. If you want to spread the load across 5 machines, you need to make sure the data is duplicated; RAID is not enough in those cases. So you’ll end up with 10 machines, or some kind of setup that duplicates the data across the same servers. You need to make sure that at least two of those servers can blow up and your service keeps working. Even if you don’t care whether your service is down for three hours, you must ensure this when sharding, because making proper backups and restores of a sharded system is hard, so an incident could mean serious data loss.

Replication and fail-overs

If you care about downtime, the first thing to consider is proper automated backups. You need three types of backup running constantly. Namely:

  1. On-line backups, accessible by software. Make sure your engineers can access, inspect and restore backups, completely or partially, without on-site human intervention, even when working remotely from home.
  2. On-site backups. A portion of (or all) the backups needs to go somewhere else that a human can easily reach and plug in, somewhere not affected by the same conditions as the server: not in the same location (room), and as far away as possible while staying on-site. Also give it proper electrical protection, so a spike that might fry your server will not fry your backup too; or plug the drives in and out manually each time.
  3. Off-site backups. Get a server somewhere on the internet and upload the backups as often as possible. Using incremental backups will help reduce the bandwidth needed.
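The incremental idea boils down to: keep a manifest of file checksums and upload only what changed since the last run. A minimal sketch in Python; all the file and directory names here are made up for illustration:

```python
import hashlib
import json
import tempfile
from pathlib import Path

def checksum(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def files_to_upload(source: Path, manifest: Path) -> list:
    """Return files that are new or changed since the last backup run,
    and update the manifest for the next run."""
    old = json.loads(manifest.read_text()) if manifest.exists() else {}
    new = {str(p): checksum(p) for p in source.rglob("*") if p.is_file()}
    manifest.write_text(json.dumps(new))
    return [Path(p) for p, h in new.items() if old.get(p) != h]

# Demo: the second run uploads nothing; a change re-uploads one file.
work = Path(tempfile.mkdtemp())
src = work / "data"
src.mkdir()
(src / "a.txt").write_text("hello")
manifest = work / "manifest.json"  # kept outside the backed-up tree
first = files_to_upload(src, manifest)   # everything is new
second = files_to_upload(src, manifest)  # nothing changed
(src / "a.txt").write_text("hello again")
third = files_to_upload(src, manifest)   # only the changed file
print(len(first), len(second), len(third))  # -> 1 0 1
```

Real tools handle this for you, of course; the sketch is only to show why the daily upload stays small even when the dataset is big.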

You need to test backups from time to time. Set up a virtual machine, or a separate bare-metal one, to restore them onto. If you can automate this and send emails when there are problems, all the better. The more frequently you test the backups, the better.
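One way to automate the verification half of that test, as a hedged sketch: after restoring the backup onto the scratch machine, compare a checksum map of the restored tree against the original, and email the mismatch list if it’s non-empty. Paths and names are illustrative:

```python
import hashlib
import tempfile
from pathlib import Path

def tree_digest(root: Path) -> dict:
    """Map each file's relative path to its SHA-256, for comparing two trees."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in root.rglob("*") if p.is_file()
    }

def restore_problems(original: Path, restored: Path) -> list:
    """Relative paths that are missing or differ after a test restore."""
    a, b = tree_digest(original), tree_digest(restored)
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))

# Demo with two toy directory trees standing in for "live data" versus
# "data restored onto the scratch machine".
work = Path(tempfile.mkdtemp())
live, scratch = work / "live", work / "scratch"
live.mkdir()
scratch.mkdir()
(live / "db.dump").write_text("all the rows")
(scratch / "db.dump").write_text("all the rows")
print(restore_problems(live, scratch))  # -> [] means the restore is good
(scratch / "db.dump").write_text("truncated")
print(restore_problems(live, scratch))  # -> ['db.dump']
```

Wire the non-empty case into whatever alerting you already have and the restore test runs itself.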

There’s no point in thinking about downtime while you still have a risk of data loss.

Once you have proper backups and a restore test in place, the next thing to consider is a master-slave replica for fail-over. Forget about master-master.

Fail-overs are really easy to set up and not much can go wrong. If you can allow for some downtime, you don’t even need to care about promoting the slave to master automatically: an engineer can do it on short notice when needed.

For the fail-over, a smaller server is usually enough. Just make sure it is good enough to survive a week or two under the load without saturating. It might be slower, but the point of the fail-over is to keep some service running while the primary is being repaired, or while a new primary is being purchased and installed. This avoids huge downtimes of days or weeks, which might be critical.

Make sure the slave/secondary is not identical to the primary and is not subjected to the same load. Also place them in separate locations (different rooms) and use a different UPS for each one. The logic behind this is really simple: the same hardware under the same load over the same period tends to fail at the same moment. You don’t want both servers to fail in the same time-frame, do you?


You might want to use RAID, but be careful: it is another double-edged sword. RAID is not a backup, so you still need backups and fail-over replicas on top of having RAID. Never use fake RAID (from the motherboard); use either software RAID or dedicated PCI-Express cards with a battery. Software RAID is good if you need compatibility, inspection and remote checks. Hardware RAID is good if you want hot-swappable disks and have on-site engineers checking its state regularly.

Don’t use RAID 5: it degrades very badly and might lose data. Use RAID 1 or RAID 6 instead; RAID 0+1 and RAID 10 are also good options. Always add at least one spare disk to the RAID.

Don’t wait until a disk is unusable to change it. Check the health status regularly, and as soon as a disk starts to show signs of wear, plan to replace it. If you wait too long, the RAID will have to rebuild a full disk of data, and other disks might fail in the process. For the same reason, schedule disk swaps ahead of time. A disk may live between 4 and 12 years, depending on load, temperature and how well it was manufactured. So with a RAID 6 of 8 disks, if each disk lasts 4 years, you need to replace a disk every 6 months on average. Schedule this from day one, and start by replacing disks that are 6, 12 and 18 months old. You might want to reuse those for workstations or other computers, as they are still good; or wait the first 2 years before starting the schedule. The point is that the disks should not all age together, or they will fail together.
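The replacement-cadence arithmetic above is easy to sanity-check:

```python
def months_between_swaps(disks: int, lifespan_years: float) -> float:
    """With `disks` drives each lasting `lifespan_years`, on average one
    wears out every lifespan/disks years; expressed here in months."""
    return lifespan_years * 12 / disks

# 8-disk RAID 6, 4-year disks: plan a replacement roughly every 6 months.
print(months_between_swaps(8, 4))  # -> 6.0
```

The same one-liner tells you that longer-lived disks or smaller arrays stretch the cadence, which is worth knowing before committing to a big array.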

Think about the cost of setting RAID up properly. If you want to renew your servers every 5 years anyway, it might not be a good idea to invest in a complex RAID; a simple RAID 1 of two disks might be enough until you replace everything. The fewer parts involved, the less chance of something breaking. Use fail-overs to avoid downtime instead: they’re more reliable.

How much data is Big Data?

Just think about those RAID 6 setups. We could easily have 10 drives of 10 TB each, plus the 2 redundancy drives RAID 6 needs, plus 1 spare drive. That gives us 100 TB to work with.
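The capacity arithmetic, spelled out (the drive counts are the ones from the example above):

```python
def usable_tb(total_drives: int, parity_drives: int, spares: int, drive_tb: float) -> float:
    """Usable space of an array: only data drives count;
    parity drives and hot spares do not add capacity."""
    return (total_drives - parity_drives - spares) * drive_tb

# 13 bays: 10 data drives of 10 TB, 2 RAID 6 parity drives, 1 hot spare.
print(usable_tb(13, 2, 1, 10))  # -> 100
```

So a 13-bay chassis already puts 100 TB behind a single hostname.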

Are you going to work with 50 TB or more in the next two years? If the answer is no, one server is probably fine. Even more data can fit around a single machine: there are dedicated NAS systems that carry lots of data at high speed, and a 10-gigabit network makes even large transfers fast.

Is your task constrained by disk speed or by CPU speed? If it’s disk, consider going SSD; also, the RAID you use to get the amount of space will give you even more disk speed than the equivalent separate machines. If it’s CPU, think ahead about whether it is worth investing in a better CPU. There are beasts with more than 50 cores nowadays, and staying on one machine keeps things easier to manage. Yes, they’re expensive, but the cost of your engineers designing, implementing and properly supporting a sharded system can exceed that in a couple of months.

It could also be that your task is constrained by RAM. Just remember that setups with 768 GB or more are supported nowadays. Again, one beefy server with lots of RAM can outperform 2-3 average servers.

Horizontal scaling is really worth it when you know your data will keep growing non-stop, and that at some point in the near future you will face the problem again and again. With proper sharding you can then grow just by adding more machines, but you have to be willing to spend a lot of money on dozens of machines and to constrain your queries so they fit the sharding scheme. You might also want to check whether you really need all that data; sometimes a good plan for disposing of or repacking data can free most of the space used.

Hot data VS Cold data

One thing I don’t want to miss here is the concept of hot data. A lot of the benefits of sharding can be achieved just by separating the hot data from the cold, even on a single server.

When working with big tables it’s easy to miss the point that most index scans will, one way or another, read the whole index. Focusing reads on the interesting part of the data is mandatory. This can be done with partial indexes in PostgreSQL; partitioning can also help when a table grows too much.
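SQLite supports the same `CREATE INDEX ... WHERE` feature as PostgreSQL, so the idea can be shown in a self-contained Python sketch. The schema and the `status = 'hot'` predicate are made up for illustration; the point is that the index only covers the rows you actually query, so it stays tiny no matter how much cold data accumulates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, total REAL)")
# 10,000 rows, of which only 1% are "hot" (the rest are archived history).
conn.executemany(
    "INSERT INTO orders (status, total) VALUES (?, ?)",
    [("hot" if i % 100 == 0 else "archived", i * 1.0) for i in range(10_000)],
)

# Partial index: only the hot rows are indexed, so look-ups on current
# data never touch the bulk of cold entries.
conn.execute("CREATE INDEX idx_hot_total ON orders (total) WHERE status = 'hot'")

hot_count = conn.execute(
    "SELECT count(*) FROM orders WHERE status = 'hot' AND total > 5000"
).fetchone()[0]
print(hot_count)  # -> 49
```

Because the query’s WHERE clause implies the index predicate, the planner can use the small partial index; `EXPLAIN QUERY PLAN` will confirm it on your SQLite version.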

But the point here is to remember that the OS and the hardware in your computer cache data at different levels. Minimizing reads to the interesting part will improve performance a lot. If this isn’t done, our old data never really gets cold: it gets accidentally revisited when searching for new entries.

Make sure your system, database, or whatever it is, actually skips all unrelated data when performing look-ups. This is a huge performance boost, no kidding. Use heavy caching for old data and add the tricks needed to avoid computing things twice.

Instead of running a Map&Reduce over a trillion records each time, it can be several times faster to “archive” old entries into buckets, save the pre-computed Map&Reduce result of each bucket, and then append the Map&Reduce of the latest entries on top.
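A minimal sketch of that pattern; the per-user totals and the yearly buckets are assumptions for illustration. Closed buckets are aggregated once, and only the live tail is re-aggregated on each query:

```python
from collections import defaultdict

def aggregate(records):
    """The 'reduce' step: total amount per user for a batch of (user, amount)."""
    totals = defaultdict(float)
    for user, amount in records:
        totals[user] += amount
    return dict(totals)

def merge(*partials):
    """Combine pre-computed per-bucket results; never re-reads archived records."""
    out = defaultdict(float)
    for part in partials:
        for user, amount in part.items():
            out[user] += amount
    return dict(out)

# Archived buckets are aggregated once and stored; each query only
# re-aggregates the recent entries and merges them on top.
archived_2023 = aggregate([("ana", 10.0), ("bob", 5.0)])
archived_2024 = aggregate([("ana", 1.0)])
recent = [("bob", 2.0)]
result = merge(archived_2023, archived_2024, aggregate(recent))
print(result)  # -> {'ana': 11.0, 'bob': 7.0}
```

This only works for aggregations that merge cleanly (sums, counts, min/max); medians and the like need more care, which is worth checking before adopting the pattern.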

Also, do you really need that old data at all? You might want to precompute all the statistics and drop it instead, or move it to a separate server in case you need to query it from time to time.

Keeping the data small is key to performance. The fastest work is the work that is never done. Keep hot data packed together: if it is scattered, the disk buffers, RAM and processor caches will read cold data even when it is not requested.

Think about it this way: you might add 10 machines to process your load in parallel and get a 6x boost. But if instead the data is stored properly and tidily on a single machine, it can be a 100x boost or more.

The take-away point here is: before going all-out on sharding, learn properly how your computers work and optimize. Then, if it is still not enough, get those 20 machines and add another 15x boost on top of that, plus the peace of mind that you can keep growing at any time.