Yesterday I compared the basics of PostgreSQL with SQLServer, but today I want to talk a bit about the foundations of PostgreSQL. It is an interesting piece of technology, and since it is free software, we can inspect it, fiddle with it, and learn from it.
This post may not seem as useful as the last one at first, but trust me: when you want to extract all the juice from this fantastic DB, this information will be very helpful.
PostgreSQL uses one of the cleverest approaches to memory handling. Instead of holding records in its own memory, it delegates that task to the operating system. This makes PostgreSQL one of the most memory-friendly applications on your server.
To do this, it does something similar to an “mmap” (the details depend on the PostgreSQL version). It tells the operating system that it wants to access the files as if they were in memory, so the OS can page them in as needed. It also gives the OS hints about which data is likely to be accessed soon and which is probably not needed anymore. The OS, of course, does whatever it prefers; it just listens to those hints and tries to do its best.
This is one of the reasons why PostgreSQL is not as performant on Windows platforms: that OS family does not handle memory and file caching as well as *nix systems do.
The main benefit is that PostgreSQL consumes very little memory of its own, and you can run it with the default config either in a small VM or on a big server, and it performs well on both. Of course, if you know how much memory should be available to PostgreSQL, you can tweak the config to make it even faster. But even then, PostgreSQL still manages to be friendly with the system's memory.
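Two real settings drive this behaviour: `shared_buffers` (PostgreSQL's own cache) and `effective_cache_size` (a hint about how much RAM the OS can use to cache files on PostgreSQL's behalf). A sketch, with hypothetical values for a server with 16 GB of RAM dedicated to the database:

```sql
-- Written to postgresql.auto.conf; the values here are illustrative only.
ALTER SYSTEM SET shared_buffers = '4GB';         -- PostgreSQL's own buffer cache
ALTER SYSTEM SET effective_cache_size = '12GB';  -- hint: RAM the OS cache can use
SELECT pg_reload_conf();  -- note: shared_buffers only applies after a restart
```

Notice that `effective_cache_size` is exactly the kind of hint described above: it changes nothing by itself, it just tells the planner how much the OS is likely to keep cached.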
This differs from MySQL, where if you have a 6 GB database you had better have 6 GB of RAM and configure it carefully to use the memory available. If any other process needs that memory, the OS will start swapping heavily to disk.
The MySQL approach is slightly faster, because in PostgreSQL an unpredicted access to data always causes a page fault, and the OS has to load the page. But the impact is low anyway. Uber ran into this and switched from PostgreSQL to MySQL, because they have servers with 768 GB of RAM and felt that PostgreSQL did not make proper use of it.
The main benefit of this design is that your available RAM does not have to be proportional to the amount of data you’re handling. Databases with 100 TB of data work perfectly well on PostgreSQL. Maybe even petabytes; I never had a server with enough space to test that.
MVCC model and locking
Despite being one of the most robust databases around, PostgreSQL uses optimistic locking and MVCC. This means that locks are less likely to freeze your app, and the database handles more parallelism.
MVCC stands for Multi-Version Concurrency Control, and I like to think about it as a VCS like Git, one that almost has branches and commits. No, MVCC has no real concept of commits and branches, but it resembles them a lot. And of course this machinery is completely hidden; you can only sort of feel it when you run long transactions in parallel and look carefully at the data they are reading. It is the most consistent model I know for database transactions.
Some say that SQLServer uses pessimistic locking instead, but I’m not sure that’s the case. I don’t have much information on the topic, and chances are that both use a similar approach.
So first, what is pessimistic locking? When a transaction reads a row, no modifications to it are allowed from others until the transaction finishes. When a transaction changes a row, no reads of it are allowed until the transaction commits. I’m sure I saw this behaviour in SQLServer back in 2003, but today I’m not so sure anymore, although I do sometimes see more lock waits there than PostgreSQL would have.
What is optimistic locking? It basically means that only parallel writes to the same row are disallowed. You can still read rows while another transaction is updating them, or write rows while another transaction is reading them. This causes an obvious problem: if a transaction has updated something but not committed, and you try to read it, you obviously should not be allowed to see the changes. The only valid result for your query is the old value, as if the update from the other transaction never took place, because from your point of view it never happened. MVCC is the algorithm that solves this mess.
MVCC records the transaction ID within each row. When you perform an update, instead of replacing the record in place, a new record is added, so the table now holds two physical versions (tuples) of the same logical row. When another transaction queries, PostgreSQL knows it is not allowed to see the new version and returns the old one instead. Once the updating transaction commits, the old version stops being visible to new transactions and will be reclaimed by the next VACUUM.
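You can actually watch these versions through the `xmin` and `xmax` system columns, which every PostgreSQL table has (the `accounts` table below is just a made-up example):

```sql
CREATE TABLE accounts (id int PRIMARY KEY, balance numeric);
INSERT INTO accounts VALUES (1, 100);

BEGIN;
-- The UPDATE does not overwrite the old tuple: it writes a new version
-- stamped with this transaction's ID (xmin) and marks the old one as dead.
UPDATE accounts SET balance = 90 WHERE id = 1;
SELECT xmin, xmax, ctid, balance FROM accounts WHERE id = 1;
COMMIT;
-- The dead old version still takes up space until VACUUM reclaims it.
```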
Why do I like to think of this as a Git repository? Well, think of starting a transaction as creating a new branch: every subsequent command commits to it, and when you commit the transaction, the branch is merged into the mainline (the master branch). Seen from a graph perspective, this is more or less what actually happens.
PostgreSQL also allows for “subtransactions”, which are exposed as savepoints. Inside your transaction you can create a savepoint that you can later roll back to, or discard. This is much like branching off a development branch and merging it back. Some applications and frameworks expose a nested-transaction API which, under the hood, uses these savepoints to emulate it perfectly.
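A minimal savepoint session, reusing the hypothetical `accounts` table from before:

```sql
BEGIN;
INSERT INTO accounts VALUES (2, 50);
SAVEPOINT before_risky_part;              -- "branch off" inside the transaction
UPDATE accounts SET balance = balance - 500 WHERE id = 2;
ROLLBACK TO SAVEPOINT before_risky_part;  -- undo only the risky part
COMMIT;  -- the INSERT survives, the UPDATE does not
```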
This also means that, by default, an open transaction can query the same table multiple times and get different results each time as others commit changes to it, as long as you didn’t write to those rows yourself. You can see a mixture of your updates and their updates at the same time: a row is “yours” if you have written to it, so it doesn’t change under you anymore, and “public” if you didn’t.
Transaction Isolation Levels
So by default PostgreSQL allows transactions to read committed changes from a table. This is referred to in the SQL standard as READ COMMITTED. The standard defines four isolation levels, but PostgreSQL only implements three, leaving out the most inconsistent one:
- READ UNCOMMITTED: Not supported in PostgreSQL; if you ask for it, you get READ COMMITTED instead. It would mean being able to see changes that other transactions have not yet committed.
- READ COMMITTED: The default behaviour in PostgreSQL, as explained above.
- REPEATABLE READ: When you SELECT certain rows, they are guaranteed to stay the same throughout your transaction. Any changes to them from other transactions are hidden until you commit. In case of a conflict the transaction may fail, so applications should be prepared to retry their commands.
- SERIALIZABLE: On top of the previous level, the DB monitors parallel transactions and ensures that the outcome is one they could have produced running one after another, in some order. Same locks, but PostgreSQL may fail a transaction sooner if it sees any chance of a dependency problem. The good thing about this one: you get a strong guarantee that each transaction behaves as if it had run all its commands completely on its own.
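Picking a level is a one-liner at the start of the transaction; for example, with the same hypothetical `accounts` table:

```sql
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT balance FROM accounts WHERE id = 1;
-- ... whatever others commit in the meantime, this same SELECT
-- returns the same value for the rest of this transaction ...
SELECT balance FROM accounts WHERE id = 1;
COMMIT;  -- may raise a serialization failure; be ready to retry
```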
In my experience, READ COMMITTED is what we need. The few inconsistency issues it has can be solved in a smart way with a SELECT FOR UPDATE. You don’t need to care about replaying transactions, and your application will wait only if there’s a true conflict.
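The classic case is a read-modify-write cycle: FOR UPDATE row-locks what you read, so two concurrent sessions queue up instead of clobbering each other (again, `accounts` is a made-up example table):

```sql
BEGIN;
SELECT balance FROM accounts WHERE id = 1 FOR UPDATE;  -- lock just this row
-- compute the new balance in the application, then:
UPDATE accounts SET balance = 90 WHERE id = 1;
COMMIT;  -- a concurrent writer simply waits here; no retry loop needed
```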
Write Ahead Log and Checkpoints
PostgreSQL is probably the most reliable database ever. No kidding here. At one point I was responsible for hundreds of servers in remote locations, most of them with no UPS, where electricity was cut frequently, and PostgreSQL was the last thing to worry about. Even with severe disk damage, PostgreSQL managed to keep working and stay as consistent as possible. I cannot ask for more. Sometimes I wonder if filesystems could learn from it, as they gave me far more headaches than PostgreSQL, and they are even more critical.
How does PostgreSQL manage to be so resilient to failures? It is a combination of several things, but the main one is its Write-Ahead Log, or WAL.
Every time PostgreSQL wants to write anything, it never touches the database files directly; instead, it appends records to the WAL describing what would have been written. These changes are temporarily held in a memory buffer and dynamically merged with the real files when anyone asks for the data.
Periodically (every 5 minutes by default), or when the WAL reaches a configured size limit, a checkpoint operation is triggered. Its job is to actually apply to the database files, in the background, what was written into the WAL.
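Both triggers map to real settings, `checkpoint_timeout` and `max_wal_size`, and both can be changed without a restart (the new values below are only examples):

```sql
SHOW checkpoint_timeout;                        -- '5min' out of the box
SHOW max_wal_size;                              -- '1GB' on recent versions
ALTER SYSTEM SET checkpoint_timeout = '15min';  -- checkpoint less often
ALTER SYSTEM SET max_wal_size = '4GB';          -- allow more WAL between them
SELECT pg_reload_conf();
```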
It goes more or less like this: first it appends a record to the WAL saying “I’m going to checkpoint from WAL record 113 to record 455”. Then it starts performing the task. Any write to PostgreSQL while this happens is just appended to the WAL, with no locks or waits. When it completes, it appends a new WAL record saying “I completed the checkpoint up to record 455”.
If the server stops abruptly, then on startup PostgreSQL will scan the WAL records to find out whether any checkpoint was left half done; if so, it will redo that checkpoint. Then it determines the latest consistent state in the WAL and checkpoints up to that point. And it’s ready to go!
Any change that wasn’t committed is lost, and any half-written WAL record is discarded as well. Your DB will be ready and consistent in a matter of seconds, transparently, with no user interaction.
With the default configuration of PostgreSQL you are guaranteed that no transaction that committed will ever be lost.
This is because PostgreSQL does an fsync to ensure the data is properly stored on disk before returning from a COMMIT command. At least, stored in the WAL records on disk. So if your REST API performs a payment operation and returns OK to the client, the change is safely stored by the time that happens, and you have a guarantee that it will never return OK for something that could be lost later.
Of course, this has a performance penalty: syncing to disk is a costly operation and makes writes slower. For massive operations, it is usually recommended to bundle them into reasonably sized transactions, or to use the COPY command.
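For example, a bulk load pays the fsync price once per transaction rather than once per row (the CSV path is hypothetical and must be readable by the server process):

```sql
BEGIN;
-- One COPY instead of thousands of single-row INSERTs, one fsync at COMMIT.
COPY accounts (id, balance) FROM '/tmp/accounts.csv' WITH (FORMAT csv);
COMMIT;
```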
If you need extra speed at the cost of losing the last few seconds of transactions after a crash, you can disable the synchronous_commit option and get a huge speed boost on writes. It can be changed on the fly without restarting the server, so you can turn it off for a few hours and, when the batch finishes, enable it again. Even with this option off, the most you can lose is less than a minute in the worst case.
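synchronous_commit can even be toggled per session, so only the batch job takes the risk while everyone else keeps full durability:

```sql
SET synchronous_commit = off;  -- this session only: COMMIT returns before fsync
-- ... run the big batch here ...
SET synchronous_commit = on;   -- back to full durability
```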
If you want speed at all costs, fsync and other safety measures can easily be disabled, and then you’ll get faster write speeds than any other DB, even compared with NoSQL databases. But it is not recommended, as a failure could render your database unrecoverable (unlikely, but possible).
Instead of risking your data, there’s a much easier way: buy a proper RAID card with a big write cache and a battery backup. You’ll get even more speed, and in case of a power failure the battery keeps the cache alive until the data is safely written to disk. And don’t forget to change the battery every two years!
Hope you liked it. This is long enough already, so maybe I’ll cover other parts of the architecture, like vacuuming, TOAST, or HOT updates, in another post.