Angular 7 is almost here but Ivy is still far from ready

Yesterday I searched again for the status of Angular and Ivy, and it seems clear that Ivy is not going to be ready on time.

It also seems that Angular 7 is not going to come with any substantial changes.

Maybe pledging to release every 6 months was too ambitious, but this is good: it means that JavaScript development is slowing down.

One of the main problems with JavaScript applications is that the ecosystem changes too fast. When building an application you want a stable foundation to work with. If something changes every few months, you end up having to migrate and rewrite several times, sometimes even during the development period.

Most software is written once and then left running for years. Node.js library authors may not have to support their code for long, but some apps written on top of them probably have to be supported for 5 years or so.

I hope they manage to get a working beta of Ivy soon, and hopefully it will be stable enough to start something with it when Angular 8 is released.

Why am I so desperate to use Ivy? Well, its main promise is smaller bundle sizes, and that means faster startup times. When building a Single Page App with Angular that is indexed by Google, you can assume that lots of new people will land on the site. If they click a search result and then have to download 5 MB and wait several seconds before they can start navigating, that’s bad. Ivy can make this more reasonable.

The Cloud is overrated!

I have to be honest about this: cloud services are completely overrated, they give far less value than perceived, and you should probably think twice before moving to them.

I know this is an unpopular opinion. But when everyone is saying “move it to the cloud”, “if your service were on the cloud it would be even better”, and so on, I feel I have to be the one to bring them back down to earth. Sorry, it’s not that good.

As usual with my posts, I’m not saying cloud services should be completely avoided. They have their use cases and their strengths, which are not few. But they also have drawbacks, and we should do our research instead of going with the new fancy thing just because everyone says so.

The good parts

  • Well-known providers such as Google, Amazon or Microsoft
  • Easy to set up
  • Lots of built-in tools
  • Option to grow or shrink dynamically
  • Pay per hour of actual use
  • Lots of community and paid support

The bad parts

  • Pricey – not as cheap as they appear compared to classic solutions
  • Legal issues in Europe, as the provider is usually a company based in America
  • Uptime not as good as the average dedicated server
  • Vendor lock-in
  • Bad for data privacy
  • Performance way worse than bare metal

As you can see, many good things and many bad things, like everything else. There’s really nothing special about the cloud.

Since everyone praises the good things, I guess I just need to talk about the bad things to balance this out. So, let’s get started!

What are we, the customers, to these companies

Maybe you thought, like me, that they built those cloud systems to diversify their income sources and stop depending on their classic ones. After some time I realized how wrong I was.

For most of them, the cloud probably came about because they already had tons of servers in different locations, they had to build a system to manage them as a whole, and they found out that a lot of the computing power was sitting idle, being wasted anyway. Or, in the case of retired servers, hardware that no longer had any purpose.

So you can imagine some clever guy thinking: what if we sell this extra computing power, so it doesn’t simply go to waste? And that’s what they did. (I still wonder why Facebook is not doing cloud at the moment…)

And instead of being treated as a secondary, alternative product, it should be seen for what it really is: leftovers. We’re just purchasing their leftovers.

And while this isn’t necessarily bad in itself, the truth is that if you’re not a big spender there (say 50k/year and up) you probably mean nothing to them. They don’t care if they lose you.

They are like those overpriced restaurants we have in Spain in very touristy spots: they are there to catch newcomers, they aren’t really that good, and by the time a customer realizes this and never comes back, there are two more walking in the door to replace them.

So, what quality of service do you expect if you’re a grain of sand on the beach, inside a product whose purpose is to sell leftovers? Yeah, sure, as soon as what you spend with them reaches several figures, they may want to retain you. But is that your case?

Also, as they’re selling idle power, if they have a peak of usage on their main sites, who do you think is going to be impacted?

Finally, they have the ability to peek at your data without you noticing, and they are probably subject to the American Patriot Act.

And because the cloud is so trendy now and they are so popular, they work like those Spanish restaurants I was referring to before. Even if some customers have a bad experience, that is not going to change anything; new customers will keep walking in the door.

Someone on the internet said “no one has ever been fired for putting an expensive RAID card in a server”, but data has been (or could have been) lost because that card wrote in a proprietary format and the model was no longer sold. But hey, no one was fired for that! I wonder if the same happens with the cloud: it may go down and cause losses, but hey, the cloud is the best!

Specifics on the bad parts

Let’s have a deeper look at the points I listed before:

  • Performance: The software we run there sits on top of a virtualization layer, a very secure one, which reduces performance significantly. On top of that, they run their own software layer for sharing resources and treating the server farm as a whole. And finally, on top of everything, there’s networking between the different servers of the farm. In practice I have found that the average VPS on the market outperforms the cloud by far under similar specs.
  • Data Privacy: On a dedicated server, if they want to peek at your data they have to unplug the disks, and that would cause downtime. Or install hidden software on your server, which you could detect and even remove. On a VPS and on the cloud, as they are virtualized, your disks are just files on other servers. Reading the data is as easy as reading those files, with no downtime. And the memory can be read too. I would say, encrypt your data, but that seems unrealistic and not very effective.
  • Vendor lock-in: Yeah, they provide lots of fancy stuff, extra satellite services, APIs, etc. But in the end, you’ll code with those and for those. When you find out that it is not worth your money, you’ll find yourself having to redo lots of parts in order to migrate to any other provider.
  • Uptime: Contrary to what would seem logical, cloud services have a history of big downtimes. Those who go with them need to be prepared for it, having a backup or a secondary server in a different region.
  • Legal issues: These companies are based in America, their government can ask them to hand over data from any of their services at any time, and they can. In Europe we have other rights, and that may not comply with our regulations depending on what type of data you are storing there.
  • Price: It’s usually paid by the hour at a very cheap rate. But do the maths. In the end you need at least one server running 24/7 for the service to be available. Multiply that tiny price per hour by the number of hours in a year and compare it with any VPS provider that gives you the same. Even dedicated servers are cheaper. (A quick sketch follows this list.)
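A quick back-of-the-envelope sketch of that last point, with made-up prices (plug in whatever your providers actually charge):

    # Hypothetical prices, only to illustrate the arithmetic.
    cloud_price_per_hour = 0.10          # "just ten cents an hour"
    vps_price_per_month = 20.00          # flat-rate VPS with similar specs

    hours_per_year = 24 * 365            # the instance must run 24/7 anyway

    cloud_per_year = cloud_price_per_hour * hours_per_year   # 876.00
    vps_per_year = vps_price_per_month * 12                  # 240.00

    print(f"cloud: {cloud_per_year:.2f}/year vs VPS: {vps_per_year:.2f}/year")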

Conclusion

What the cloud gives you is not performance, stability or reliability. What it gives you is huge flexibility within it. And that comes at a cost.

Other services, like VPSs, have some degree of flexibility, as you can grow or shrink them. But they have no API. Do you really need an API to grow and shrink dynamically in seconds?

Other, lesser-known providers have to fight to build their reputation. Search for reviews on the forums and try a recommended one. Those smaller businesses actually depend on their customers being happy. Some of them compete on price, others are expensive but have excellent service. I have used both over the last 15 years, and I am very satisfied with them. Getting a friendly call on your mobile telling you that your server has been hacked and advising you on what to do is priceless.

If you want true reliability and 100% uptime, you need to build your own mini-cloud. Rent servers or VPSs from different providers in different countries and make sure your services keep working no matter which one goes down.

If you care a lot about data privacy (or are required by law to), avoid the cloud and VPSs. Rent dedicated servers. If you’re just starting out and don’t have the budget, there are providers that rent old servers or cheap computers at ridiculously low prices.

And to conclude this conclusion: do your homework, research, look at the different options and their pros and cons, and avoid just going with the latest trendy thing. Understand why you decided to go with a particular option!

Python threading can use all cores despite the GIL

In most places we can read that Python cannot use all cores and that its threads are useless because of something called the GIL, which prevents two Python instructions from being executed at the same time.

While this is true, as with many things in Python, in practice it is a bit different. The Python language is well thought out and very clever in ways that most people would not expect at first.

What is this infamous GIL? GIL stands for Global Interpreter Lock, and it is a lock that prevents two threads from executing Python instructions at the same time. The reasoning is that Python cannot guarantee any safety while the same variables are being written and read at the same time, and the garbage collector (GC) also relies on it to work properly.

But Python does not only execute Python instructions, it also executes C functions, and of course there are I/O operations. This is one common thing that people miss when reasoning about Python performance. Python is not like Java; it does not run your code on a virtual machine or anything similar. It is an actual interpreter (which makes it way slower), written in C, that also interfaces with C, C++ and Fortran.

This might seem like a disadvantage, but since it is C code running, as soon as no Python variables are involved anymore (because they were copied into C memory space), the GIL can be released. Yes! The GIL can indeed be acquired and released from C code. And as soon as this happens, other Python code can run in parallel.

It turns out that a large number of community and built-in Python libraries are partly or entirely written in C; mainly the parts that are computationally expensive.
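A minimal sketch of this using only the standard library: CPython’s hashlib releases the GIL while hashing large buffers, so plain threads can keep several cores busy. The buffer sizes and thread count here are arbitrary; measure on your own machine.

    import hashlib
    import time
    from concurrent.futures import ThreadPoolExecutor

    # Four large in-memory buffers to hash (arbitrary sizes, just for the demo).
    buffers = [bytes(100 * 1024 * 1024) for _ in range(4)]

    def digest(buf):
        # hashlib's C code releases the GIL for large inputs,
        # so several of these calls can run truly in parallel.
        return hashlib.sha256(buf).hexdigest()

    start = time.perf_counter()
    [digest(b) for b in buffers]
    print("sequential:", round(time.perf_counter() - start, 2), "s")

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(digest, buffers))
    print("threaded:  ", round(time.perf_counter() - start, 2), "s")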

So in practice Python tends to be way faster than expected, and parallelism is actually better than theory would predict.

If you’re using Python, you should be aware that it works best as a director: it handles the complex logic and tells other libraries to do the actual work. Don’t try to hand-roll a fancy, supposedly fast algorithm that looks complex; in most cases like this in Python it ends up being slower, because you’re doing it in Python instead of C.

Is threading useless in Python? Definitely not. Networking, I/O or parallelizing other libraries are good examples of how to use it properly. For I/O there are also asynchronous functions that can do more or less the same as threads, provided the operating system and the particular I/O task are supported.

Can Python use all cores? Not in the conventional way, but if we’re smart and push the heavy tasks out, definitely yes. Also check out Cython and Numba for creating your own solutions that interface with C and for how to release the GIL.

And finally, if we go interfacing with C, our program is going to be faster than any Java or Scala program, because, well, it is C.

Designing your project DB for scaling up (II) – Unforeseen problems of Replication

Continuing the series on scaling up, I want to talk about what happens when you start replicating. Things that used to be common practice now become a problem and are mostly forbidden.

First of all, why replicate? The most common answer is that we want a failover server or a more fine-grained backup.

When your application or website is critical for the business, you cannot afford downtime, because downtime is money, and in some cases lots of money.

The two scenarios I’ve seen for this are: first, you run a site that does most of your sales; if you shut the site down you simply stop selling, as the site makes up more than 50% of your income. Second, you have an application that drives all your business logic; if the application fails, your employees just can’t work.

In those cases, downtime is not an option. If your server completely fails, waiting a few days to get a new server, install it, restore the database and set everything up again is just too much. You need a backup plan that is more than just backups.

A failover server is probably the cheapest solution for this. You get a second server in a different place (where it can’t be affected by the same problems as the primary) and install it almost as if it were production, so that in case of need all that is required is a DB restore and a few DNS changes. The failover server can indeed be cheaper and slower than the primary; in case of failure, having something slower while you fix the primary is not that bad, as long as it does not blow up when it gets the full load.

Restoring a database can take hours, so to avoid that we run a replicating failover, using streaming replication if possible. When this is configured, the failover receives the changes up to the same second: no data loss, and no database restore is required. The downtime is just changing DNS and some other basic setup, which can be done in minutes. (If it is really critical, this is usually automated as well.)

Scenario 2 – Unforeseen problems of Replication

So now all your writes to the DB are also sent to the failover server. Some WAL archiving is good too, just in case you need Point-In-Time Recovery (PITR). But a side effect comes with it: every single write gets shipped and archived.

So common things we did in the past are no longer a good approach.

For example, I used to restore the DB on the production server under a different name to run tests there, when the database was too big to move easily to my local environment. With replication, this whole operation is transmitted to the failover server.

Another common case was caching in the DB. While this is common in sites built on WordPress and Drupal, now those caches are transmitted to the failover too, and they pollute my PITR archives with a huge amount of data.

Or statistics. Every click on the site produced one or several rows written to the database to record speed, visits, referrers, and so on.

The first symptom appears when the failover can’t keep up with the writes; maybe it has cheap disks while the primary runs on SSDs, or maybe the data has to travel over a slow internet connection. The result is that in certain scenarios the failover lags behind for hours or days. This can be a disaster if something happens to the primary at the worst moment.

The second symptom appears when you try a Point-In-Time Recovery. Say, for example, you need to do a post-mortem and you need the database state at a particular time for inspection. What do you find? Lots of gigabytes that need to be processed in order to restore up to that point. Sometimes it can take days to replay all those writes, and if you look closely, 95% of those operations are caches being rewritten, test databases, and other stupid stuff.

The last thing to mention here: say you restored a particular point in time, or even a regular SQL backup, and you run the site on a separate server for inspection, only to find that the files the DB was referring to no longer exist; they were deleted at a later time on the production server.

How to mitigate these problems

The first thing we need to understand is that a database is meant to hold the current state of our application. If you restore a database and it is not everything you need to get back to a particular state, you have a problem. If it contains way more than what is needed to get to that state, you also have a problem.

So the DB should contain the current state, nothing more, nothing less. Temporary/cached data does not belong there; statistics, logging and debugging info don’t either. And forget about playing around with other databases on the same server for testing purposes.

The first problem I want to talk about is file uploads. Think about whether it is a good idea to move them into the database. Most databases handle them nicely. If they are not huge, it may be a good idea to keep them in a table. This way they are shipped to the failover like any other data, and point-in-time recoveries will be consistent.

If you can’t (or won’t), you need to plan ahead how to deal with them. Rsync them to the failover and plan an incremental backup in a similar fashion, so that whatever you do with the DB, you can do more or less the same with the uploads. It is going to be a bit more complex and less consistent, but hey, if that’s what you want, go ahead.

Next, let’s talk about caching and temporary data. Get another database to store these, or another kind of software that is better suited to the job, for example Redis or MongoDB. NoSQL is the best tool for this problem, and most CMSs support caching their data in them.
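A minimal sketch of the idea with Redis, assuming the redis-py package and a Redis server on localhost; the key name and TTL are made up:

    import redis

    # Volatile cache entries live in Redis, not in a database table,
    # so they never touch the WAL, the failover or the PITR archives.
    r = redis.Redis(host="localhost", port=6379)

    def cached_fragment(key, render, ttl=3600):
        value = r.get(key)
        if value is not None:
            return value.decode()
        value = render()            # the expensive query / template rendering
        r.setex(key, ttl, value)    # expires on its own, no cleanup job needed
        return value

    html = cached_fragment("home:sidebar", lambda: "<ul>...</ul>")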

In PostgreSQL you can mark some tables as unlogged, so they are not written to the WAL, which keeps them out of your PITR archives and out of the failover. I don’t think this is a great idea in general, but for some types of data it may be the best option available. Foreign Data Wrappers may also help, as you can “emulate” a table backed by another database, which may be another PostgreSQL, Redis, or even a text file. Again, not a great idea, but maybe the only option in certain cases.
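For the unlogged-table option, a sketch using psycopg2 against a hypothetical “app” database; remember that an unlogged table is emptied after a crash, which is acceptable for a cache:

    import psycopg2

    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:
        # UNLOGGED: not written to the WAL, therefore not replicated to the
        # failover and not present in the PITR archives. Truncated after a crash.
        cur.execute("""
            CREATE UNLOGGED TABLE IF NOT EXISTS page_cache (
                cache_key  text PRIMARY KEY,
                payload    text NOT NULL,
                expires_at timestamptz NOT NULL
            )
        """)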

How do you know if your data is really a cache or temporary? Just ask yourself a few questions: Can your app run without it? Will it be reconstructed automatically when needed? Can you run a command or some SQL to regenerate it from the other data? If you answered yes to any of those questions, your data is probably either a cache or temporary. If you move that data out, you can plan a script to repopulate it on demand. This becomes useful when you switch to the failover: you run this command straight away and it will populate all the caches to make the site or app fast as soon as possible.

If the application needs that cache to work, and it takes more than 10 minutes to recreate it, it may be a better idea to just leave it in the DB to avoid more downtime.

Also, if the application/website will just run too slowly until the cache is fully recreated, and this takes hours, then as before it may be best to leave it in the DB.

In those cases, another plan could be to simply dump the caches regularly and have them ready to restore when the failover comes up. Restoring should be way faster than recalculating. Redis and MongoDB support replicas as well, so you can set those up accordingly.

That said, keep your developers off the production databases and servers. Or train them properly so they avoid doing stupid stuff like restoring DBs under other names. You could also use Docker to create a playground on the same server, where that should be safe (not the best option, but sometimes we need something cheap that does not involve a new server or employees waiting for 10 GB of data to download).

I used to run VACUUM ANALYZE daily on everything on the server. This also generates load. I changed that to a daily ANALYZE, which does not write, and the VACUUM is done on weekends. You can also just rely on the autovacuum daemon, but I prefer to keep it from kicking in during high-load periods, so I run it with a higher tolerance and run VACUUM manually as well.

Finally, about logging, statistics and other fast-writing stuff. Databases are not meant to take a constant stream of small writes like this. They can, and they perform well, but they are not meant for it. If you can keep those records from actually hitting the DB, all the better. If you need those records in the database, think about whether you can pack them before saving; for example, writing them into Redis first, then collecting them every minute, packing them, and saving them into the database (see the sketch below). You can also make use of Foreign Data Wrappers in PostgreSQL to keep everything accessible with SQL commands while the reading/writing happens outside.
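Here is a rough sketch of that packing idea, assuming Redis on localhost, psycopg2, and a hypothetical page_hits(ts, url, referrer) table:

    import json
    from datetime import datetime, timezone

    import psycopg2
    import redis
    from psycopg2.extras import execute_values

    r = redis.Redis()

    def record_hit(url, referrer):
        # Called from the request handler: one cheap Redis push, no DB write.
        hit = {"ts": datetime.now(timezone.utc).isoformat(),
               "url": url, "referrer": referrer}
        r.rpush("hits", json.dumps(hit))

    def flush_hits():
        # Called from cron every minute: drain the list, write one batched INSERT.
        rows = []
        while True:
            raw = r.lpop("hits")
            if raw is None:
                break
            hit = json.loads(raw)
            rows.append((hit["ts"], hit["url"], hit["referrer"]))
        if rows:
            with psycopg2.connect("dbname=app") as conn, conn.cursor() as cur:
                execute_values(cur,
                    "INSERT INTO page_hits (ts, url, referrer) VALUES %s", rows)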

Of course, all these techniques are only worthwhile if the data we move out is a significant amount compared to the real writes of the current state. I would never advise moving anything out of the database if it writes far less than the current-state updates do.

And as always, measure, test and measure again on development environments before changing anything on production!

Designing your project DB for scaling up (I) – When data does not fit on RAM

I’ve been writing software for running websites since 2005, and since then I’ve seen the same pattern appear again and again, probably in 99% of sites and ready-to-use products like Drupal or WordPress.

When a site and its DB are designed, there is almost no data and almost no load on the system, so no matter what you do, it will probably work nicely without problems. Until you need to scale it up. By that point it is probably too late.

Over those years I tried DB replication, incremental backups, failover servers and other interesting stuff. And that’s when you see that the whole approach was wrong. We’ll look at a few scenarios in different posts.

Scenario 1 – Data size not fitting into memory

This is probably the first thing I hit in every project. At first, the data was big but not huge, so the servers were able to handle it. Even if I needed to play with indexes to keep things fast, it was doable.

But at some point the data just grew too big to fit in the server’s memory. Depending on how much you spent on your server, this takes more or less time. Some servers are just a VPS with 1 GB of RAM; others are monsters with almost a terabyte. Changing the server is the easiest solution, but if the project does not bring in enough revenue to get a bigger machine, this can be a problem.

The first solution to this is implementing a few things to allow for “data archiving”, that is, unused data should remain unread. It sounds obvious, but in most database designs, for complex queries the database will end up reading the whole table, or the whole index, just to discard that data.

While this is “fast”, it prevents that unused data from becoming truly cold (not read for days), and this data being read from time to time pollutes several caches, namely the disk caches, OS caches and database caches. This pollution, which seems harmless, pushes genuinely important data out of the cache, making the whole application slower.

This is also hard to detect, because if you analyse any query on its own it will appear really fast, yet the whole system is slow. Untrained DBAs and sysadmins can easily fail to see this and just say “we need a bigger server”, without realizing that what they actually need is just more RAM.

To detect this, the main tool is enabling slow-query logging, tracking those queries weekly, and trying to improve their performance. If at some point you get a lot of queries that tend to be logged as slow but are really fast when you try them by hand, then you know you’re hitting the problem I’m describing here.
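In PostgreSQL this is the log_min_duration_statement setting. A sketch with psycopg2, assuming a superuser connection; the one-second threshold is just an example:

    import psycopg2

    conn = psycopg2.connect("dbname=postgres")
    conn.autocommit = True   # ALTER SYSTEM cannot run inside a transaction block
    with conn.cursor() as cur:
        # Log every statement that takes longer than 1000 ms.
        cur.execute("ALTER SYSTEM SET log_min_duration_statement = 1000")
        cur.execute("SELECT pg_reload_conf()")  # apply without restarting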

How to solve it without getting more RAM

At some point you can’t get more RAM, so what do you do now? Archiving.

The basic concept is to move the unused data out of the way so it doesn’t get read by accident.

For example, dump rows to files and delete them from the DB, or move them to another database. Straightforward, but it makes them very hard to read if you ever need them, so this approach is only useful for data we could afford to delete. I never encountered a case where I could do this.

In practice, the first thing we should think about is efficient indexing and clustering. On indexing, you need indexes that skip the unused rows, and all your queries must use one of those indexes. PostgreSQL and other advanced databases have the concept of partial indexes: they contain a WHERE clause, and not every row is indexed. As long as you put the same condition in your queries, those indexes can be used. Doing this ensures that the DB skips the archived rows straight away.

The easiest way to do this is to filter by primary key, just adding “AND id > $VALUE” to all queries. This value must not be computed per query, but be something you store as a local variable and change from time to time. It does not need to cut exactly at the right point; it is enough if it cuts out 80% of the records and you are sure you’re not excluding any potentially useful rows. Later on we can add a scheduled task to update this number weekly to a fresher one. The index will still be cached in RAM, but if it is just a number and the table contains lots of other data, this should be fine. A quick sketch follows.
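Here it is with psycopg2 against a hypothetical orders table; the cut-off value is arbitrary and would be refreshed by a scheduled job:

    import psycopg2

    ARCHIVE_CUTOFF = 5_000_000   # stored app-side, refreshed from time to time

    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:
        # Partial index: only rows above the cut-off are indexed.
        cur.execute(f"""
            CREATE INDEX IF NOT EXISTS orders_recent_customer_idx
            ON orders (customer_id)
            WHERE id > {ARCHIVE_CUTOFF}
        """)
        # Queries repeat a condition at least as strict as the index predicate,
        # so the planner can pick the partial index and skip archived rows.
        cur.execute("SELECT * FROM orders WHERE customer_id = %s AND id > %s",
                    (42, ARCHIVE_CUTOFF))
        recent_orders = cur.fetchall()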

I also mentioned clustering. We also need to ensure that fresh data is packed together and archived data is not mixed in with it. If the disk pages contain a lot of mixed data, even with good indexing, the archived data will end up being cached. Forget about clustered indexes or clustering the table (in PostgreSQL); those tend to have more costs than benefits, unless you almost never write to the tables, which is almost never the case.

If you’re archiving just by old IDs, then it should already be fine: most new rows will be added at the end.

If what you archive is not simply old records (by time of insertion), then you may have the problem that most pages contain both archived and fresh data. Also, if your table is basically a few integer columns, reading the full index can cost almost the same as reading the full table.

To solve this you need a more aggressive approach, and partitioning the tables into different child tables is the way to go in this case. Partitioning is just splitting a table into several tables, so that each one contains only a certain type or range of data.

In PostgreSQL you just create the partition tables and the DB will manage it for you:

https://www.postgresql.org/docs/10/static/ddl-partitioning.html
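A minimal sketch of declarative partitioning (PostgreSQL 10 and later), with hypothetical table and column names:

    import psycopg2

    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:
        # Parent table: holds no data of its own, only routes rows to partitions.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS readings (
                ts     timestamptz NOT NULL,
                sensor text        NOT NULL,
                value  double precision
            ) PARTITION BY RANGE (ts)
        """)
        # Archived data and fresh data end up in physically separate tables.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS readings_archive PARTITION OF readings
                FOR VALUES FROM ('2000-01-01') TO ('2018-01-01')
        """)
        cur.execute("""
            CREATE TABLE IF NOT EXISTS readings_2018 PARTITION OF readings
                FOR VALUES FROM ('2018-01-01') TO ('2019-01-01')
        """)

Queries that filter on the partition key only touch the relevant partition, so cold partitions stay cold.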

If that’s not an option, then depending on the capabilities of your DB you may want to add triggers or scheduled jobs to move the data into the right table, and/or views that just do a UNION of all the child tables.

This way, even when reading indexes or parts of the table, you will never pull in data just to discard it. No need for clustering, as the data lives in different tables. And whenever your app needs to read the archived data, it is just a SELECT away.

I’ve implemented this approach in production with huge speed-ups as a result. But always try it first in a development environment and get used to moving the data around, as moving data around in production is dangerous.

PostgreSQL – Things that I miss – special index types

Speaking of things I miss from PostgreSQL, let’s take a look at its index types. There are lots of things to discover here if you’re new to PostgreSQL. But before going deep into the rabbit hole, let’s first review our understanding of the two basic index types that most databases have:

Btree & hash

Btree is the regular old index type that is available in all databases. Do not confuse them with binary trees; the B does not stand for binary. They can be thought of as trees optimized for disk block access. If you’re interested in the actual details, Wikipedia has a great article about them: https://en.wikipedia.org/wiki/B-tree

In a btree index, every record gets an entry in the index, regardless of the cardinality. So if you are indexing a boolean column over 1000 rows, you get 1000 index entries. This can be confusing at first.

The index is sorted first by the columns specified and then by block id (physical disk layout). So when records are retrieved, they come out in disk order. This is important because, in 99% of cases, for every index entry hit the database needs to fetch the related table row, so it helps that they are sorted as in the disk layout.

Btrees are the fastest indexes in most cases, and we usually never bother with anything else. Their main downsides are their size, as they can take up a significant fraction of the table size, and their inability to do certain operations with arrays. But apart from that, they are great. They can speed up operations like greater/less than and equals. They can also speed up LIKE searches as long as there is no wildcard at the start, so for example “col1 LIKE ‘ASTR%’” will use the index.

Hash indexes, on the other hand, were for a long time just a demonstration of how to implement an index, with little practical use. But later versions of PostgreSQL added WAL support for them, so they can now be used normally. Their use case is very specific, though: they only support pure equality, so they cannot speed up range or LIKE queries. Their main benefit is their size; as they do not contain a tree, they are smaller (around 50%). So for really big tables it may make a difference. As for speed, they are roughly the same as a btree; but of course, if a btree does not fit in memory and a hash index does, you can guess what will happen, right?

Gin & gist

One of the features that isn’t in other databases is indexes for array types. While this might seem useful only if you use arrays, it also has some interesting uses on regular tables.

GIN and GiST are PostgreSQL-specific index types and they are inverted indexes: they store the data by content (key values) directly, instead of having an entry for each row in the table. They are mostly used on array types, but there’s an extension to use them on regular columns.

They can also be extended programmatically, so you can build on top of their logic to support your own data types.

GIN, which stands for Generalized Inverted Index, is best when you are indexing a regular array of anything, like text or ints, or even just a jsonb column (not an array of them). It has support for the following (a small sketch follows the list):

  • arrays: overlaps with, contains, contained by, equals
  • jsonb: key exists, all keys in a list exist, any key in a list exists, contains a given JSON document
  • tsvector (full text search): matching an FTS query
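A tiny sketch of the jsonb case, assuming a hypothetical events(id, payload jsonb) table:

    import psycopg2
    from psycopg2.extras import Json

    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:
        # One GIN index covers all keys and values of the jsonb column.
        cur.execute(
            "CREATE INDEX IF NOT EXISTS events_payload_gin "
            "ON events USING gin (payload)")
        # The containment operator @> can now be answered from the index.
        cur.execute("SELECT id FROM events WHERE payload @> %s",
                    (Json({"type": "login"}),))
        login_ids = cur.fetchall()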

GiST, the Generalized Search Tree, is best when we’re dealing with spatial data, as it can do more complex operations. Full text search may be one of them.

It has some of GIN’s capabilities (not all), and includes support for a lot of geometric operations like strictly left/right of, strictly below/above, distance between, same as, etc.

So if your data is not just a collection of numbers, but more like points in a space (or lines, circles, …), GiST may be the best fit. It even works for internet addresses!

If you’re interested in these, have a look at PostGIS, a PostgreSQL extension with tons of tools for this kind of data.

There is also SP-GiST, which is more or less like GiST but for unbalanced structures. I have never used it (yet), so I have no examples at the moment.

Brin indexes

Imagine we want to record several sensors (temperature, humidity, etc.) into a single table, just logging them. This table would have a timestamp, a sensor code and a sensor value. If we register several entries per second, in a year this table will contain 500 million rows or more. Adding a regular index here could use almost as much space as the table itself.

This is the type of scenario where BRIN indexes may help. BRIN stands for Block Range INdex. They record the minimum and maximum values of the specified columns per range of disk blocks.

It sounds a bit simplistic at first, but it is very clever. If the layout of the table is indeed sorted by time, and we just record the time range covered by each block range (which might be 1 MB, for example), then in a few kilobytes we can store a sort of histogram of the table, and in a later SELECT that filters by timestamp we can decide which blocks of the disk are actually worth reading.

These indexes don’t care much about the size of your table, as the index itself stays tiny; it is the amount of data covered by each entry that grows. The range size is configurable, but PostgreSQL has a sensible default that is difficult to beat manually.

They are not faster than btrees; actually they are a bit slower. But again, if you cannot fit the index in memory, your searches will be very slow. (The full index does not need to fit in memory, but at least the disk blocks used to navigate it for your queries should.) So in some scenarios BRIN indexes can end up being faster. And since they are so small, you can also just trade speed for size, like with any compression algorithm.
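A sketch for the sensor log described above; the table and column names are made up, and it assumes rows are inserted in roughly chronological order:

    import psycopg2

    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS sensor_log (
                ts     timestamptz NOT NULL,
                sensor text        NOT NULL,
                value  double precision
            )
        """)
        # The BRIN index stores only min/max of ts per block range:
        # a few kilobytes even for hundreds of millions of rows.
        cur.execute(
            "CREATE INDEX IF NOT EXISTS sensor_log_ts_brin "
            "ON sensor_log USING brin (ts)")
        # Time-range queries can now skip most of the table on disk.
        cur.execute(
            "SELECT avg(value) FROM sensor_log "
            "WHERE ts >= now() - interval '1 day'")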

To make them work, be sensible: cluster your table so that different keys land in different disk blocks. For example, if you also want to filter by sensor, don’t order by timestamp>sensorcode, as the timestamp changes constantly and sensorcode has a small cardinality. Either order by sensorcode>timestamp, or even by year>week>sensorcode>timestamp. If every block contains all sensor codes, a BRIN index will never be able to filter by them.

If you have this kind of scenario, also have a look at table partitioning and partial indexes. They are helpful too.

PostgreSQL – Architecture

Yesterday I compared the basics of PostgreSQL with SQL Server, but today I want to talk a bit about the foundations of PostgreSQL. It is an interesting piece of technology, and since it is free software, we can inspect it, fiddle with it and learn from it.

This post may not seem as useful as the last one at first, but trust me, when you want to extract all the juice from this fantastic DB, this information will be very helpful in the future.

Memory usage

PostgreSQL uses one of the cleverest approaches to memory handling. Instead of holding records in its own memory, it delegates much of that task to the operating system. This makes PostgreSQL one of the friendliest applications towards the memory of your server.

To do this, it does something similar to an “mmap” (depending on the PostgreSQL version). It tells the operating system that it wants to access the files through memory, so the OS can page them into memory as needed. It also gives hints to the OS about which data is likely to be accessed and which data is probably not needed anymore. The OS, of course, does whatever it prefers; it just listens to those hints and tries to do its best.

This is one of the reasons why PostgreSQL is not as performant on Windows, as that OS family does not handle memory as well as *nix systems do.

The main benefit is that PostgreSQL itself consumes almost no memory, and you can run it with the default config either on a small VM or on a big server and it performs well on both. Of course, if you know how much memory should be available to PostgreSQL, you can tweak the config to make it even faster; but even then, PostgreSQL manages to stay friendly with memory.

This differs from MySQL, where if you have a 6 GB database you had better have 6 GB of RAM and configure it carefully to use the available memory. If any other process needs that memory, the OS will start swapping heavily to disk.

The MySQL approach is slightly faster, as in PostgreSQL an unpredicted access to data always causes a page fault and the OS has to load it. But the impact is low anyway. Uber had problems with this and switched from PostgreSQL to MySQL, because they have servers with 768 GB of RAM and felt that PostgreSQL was not making proper use of it.

The main benefit of this design is that your available RAM does not have to be proportional to the amount of data you’re handling. So databases with 100 TB of data work perfectly well on PostgreSQL. Even petabytes; I just never had a server with enough space to test that out.

MVCC model and locking

PostgreSQL, despite being one of the most robust databases, uses optimistic locking and MVCC. This means that locks are less likely to freeze your app, and it handles more parallelism.

MVCC stands for Multi-Version Concurrency Control, and I like to think of it as a VCS like Git, one that almost handles branches and commits. No, MVCC does not really have the concepts of commits and branches, but it resembles them a lot. And of course this machinery is completely hidden; you can only sort of feel it when you have long transactions running in parallel and you carefully analyse the data they’re reading. It is the most consistent model I know of for database transactions.

Some say that SQL Server uses pessimistic locking instead, but I’m not so sure that’s the case. I don’t have much information on the topic, but chances are both use a similar approach.

So first, what is pessimistic locking? When a transaction reads a row, no modifications to it are allowed from other transactions until the transaction finishes. When a transaction changes a row, no reads of it are allowed until the transaction commits. I’m sure I saw this behaviour in SQL Server back in 2003, but today I’m not so sure anymore, although I do sometimes see more lock waits than PostgreSQL would have.

What is optimistic locking? It basically means that only parallel writes to the same row are disallowed. You can still read rows while another transaction is updating them, or write rows while another transaction is reading them. This causes an obvious problem: if a transaction has updated something but not committed, and you’re trying to read it, you obviously should not be allowed to see the changes. So the only valid result for that query is the old value, as if the update from the other transaction never took place, because from your point of view it never happened. MVCC is the algorithm that solves this mess.

MVCC records the transaction ID within each row. When you perform an update, instead of replacing the record in place, a new record is added, so now you have two tuples in the table that refer to the same logical row. When the other transaction queries it, PostgreSQL knows that it is not allowed to see the new tuple and returns the old one instead. When the updating transaction commits, the old tuple eventually becomes invisible to everyone and will be removed by the next VACUUM.

Why do I like to think of this as a Git repository? Well, think of starting a transaction as creating a new branch, every subsequent command as a commit to it, and committing the transaction as merging that branch into the mainline (master branch). This is more or less what is actually happening, seen from a graph perspective.

PostgreSQL also allows for “subtransactions”, which are referred to as savepoints. Inside your transaction you can create a savepoint which you can later roll back to, or discard. This is much like branching off a development branch and merging it back. Some applications and frameworks expose a subtransaction API which under the hood uses savepoints to emulate them.
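A minimal sketch of savepoints from psycopg2, with hypothetical import_log and items tables (items has a unique constraint on sku); only the failed step is rolled back, the surrounding transaction survives:

    import psycopg2

    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:
        cur.execute("INSERT INTO import_log (msg) VALUES ('import started')")
        cur.execute("SAVEPOINT one_row")          # like branching off
        try:
            cur.execute("INSERT INTO items (sku) VALUES ('DUPLICATE-SKU')")
        except psycopg2.IntegrityError:
            cur.execute("ROLLBACK TO SAVEPOINT one_row")  # discard only this step
        cur.execute("RELEASE SAVEPOINT one_row")  # like merging the branch back
    # leaving the block commits the outer transaction, including the log row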

This also means that, by default, an open transaction can query the same table multiple times and get different results each time, as others commit changes to it, provided you didn’t write to those rows. You can see a mixture of your updates and their updates at the same time: a row will be “yours” if you have written to it, so it doesn’t change under you anymore, or it will be “public” if you didn’t write to it.

Transaction Isolation Levels

So by default PostgreSQL allows transactions to see committed changes from others as they happen. This is referred to in the SQL Standard as READ COMMITTED. There are four isolation levels, but PostgreSQL only implements three, leaving out the most inconsistent one:

  • READ UNCOMMITTED: Not supported in PostgreSQL; if you ask for it, you get READ COMMITTED. It would mean being able to see changes from other transactions that have not yet been committed.
  • READ COMMITTED: The default behaviour in PostgreSQL, as explained before.
  • REPEATABLE READ: When you SELECT certain rows, they are guaranteed to stay the same throughout your transaction. Any changes to them from other transactions are hidden until you commit. Also, in case of inconsistency it may fail at commit time, so applications should be prepared to retry their commands in this case.
  • SERIALIZABLE: On top of the previous one, the DB monitors and ensures that no set of parallel transactions could have produced a different result had they run in a different order. Same locks, but PostgreSQL may fail the transaction sooner if it sees any chance of a dependency problem. The good thing about this one: you get a strong guarantee that a single transaction running all the commands one after another would produce exactly the same output.

In my experience, READ COMMITTED is what we need. There are some inconsistency issues with it that can be solved in a smart way with SELECT FOR UPDATE (see the sketch below). You don’t need to care about replaying transactions, and your application will only wait if there’s a genuine conflict.
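A sketch of that pattern under READ COMMITTED, with a hypothetical accounts(id, balance) table; the row lock makes the read-modify-write safe against a concurrent transaction doing the same:

    import psycopg2

    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:
        # FOR UPDATE locks the row; a concurrent transaction doing the same
        # waits here instead of reading a stale balance.
        cur.execute("SELECT balance FROM accounts WHERE id = %s FOR UPDATE", (42,))
        balance = cur.fetchone()[0]
        cur.execute("UPDATE accounts SET balance = %s WHERE id = %s",
                    (balance - 10, 42))
    # commit on leaving the block releases the row lock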

Write Ahead Log and Checkpoints

PostgreSQL is probably the most reliable database ever. No kidding. At some point I was responsible for hundreds of servers in remote locations, most of them with no UPS, where electricity was cut frequently, and PostgreSQL was the last thing to worry about. Even with severe disk damage, PostgreSQL managed to keep working and stay as consistent as possible. I cannot ask for more. Sometimes I wonder if filesystems could learn from it, as they gave me far more headaches than PostgreSQL and they are even more critical.

How does PostgreSQL manage to be so resilient to failures? It is a combination of several things, but the main one is its Write Ahead Log, or WAL.

Every time PostgreSQL wants to write anything, it never touches the database files directly; instead, it appends records to the WAL describing what should be written. These changes are temporarily held in memory in a buffer and dynamically merged with the real files whenever anyone asks for the data.

Every few minutes (the interval is configurable), or when the buffer reaches a configured limit, a checkpoint operation is triggered. Its job is to actually update the database files in the background, following what was written into the WAL.

It goes more or less like this: first it appends a record to the WAL saying “I’m going to checkpoint from WAL record 113 to record 455”. Then it starts performing the task. Any write to PostgreSQL while this happens is just appended to the WAL, with no locks or waits. When it finishes, it appends a new WAL record saying “I completed the checkpoint up to record 455”.

If the server stops abruptly, on startup PostgreSQL will go through the WAL records to find out whether any checkpoint was left half done; if that was the case, it will redo that checkpoint. Then it will determine the latest consistent state in the WAL and checkpoint up to that point. And it’s ready to go!

Any change that wasn’t committed is lost, and any half-written WAL record is discarded as well. And your DB will be back, consistent, in a matter of seconds, transparently and with no user interaction.

With the default PostgreSQL configuration you are guaranteed that no transaction that committed will ever be lost.

This is because PostgreSQL does an fsync and ensures the data is properly stored on disk before returning from a COMMIT command; at least, stored in the WAL records on disk. So if your REST API performs a payment operation and returns OK to the client, that change is safely stored by the time it happens, and you have the guarantee that it would never return OK if the data could be lost later.

Of course, this has a performance penalty. Syncing to disk is a costly operation and slows down writes. For massive operations, it is usually recommended to bundle them into reasonably sized transactions or to use the COPY command.

If you need extra speed at the cost of losing the last few seconds of work in case of a failure, you can disable the synchronous_commit option and you’ll get a huge speed boost on writes. This can be changed on the fly without restarting the server, so you can turn it off for a few hours and, when the batch finishes, enable it again (a small sketch follows). Still, with this option off, the most you can lose is well under a minute, even in the worst case.
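A sketch of that, relaxing durability only for one bulk-load session; the table and rows are hypothetical:

    import psycopg2

    rows = [("2018-10-01T00:00:00+00:00", "t1", 21.5),
            ("2018-10-01T00:00:01+00:00", "t1", 21.6)]

    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:
        # Only this session skips waiting for the WAL flush on each commit.
        # A crash may lose the last few seconds of these inserts, but it can
        # never corrupt the database.
        cur.execute("SET synchronous_commit TO off")
        cur.executemany(
            "INSERT INTO readings (ts, sensor, value) VALUES (%s, %s, %s)", rows)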

If you want speed at all costs, fsync and other safety measures can easily be disabled, and then you’ll get faster write speeds than almost any other DB, even compared with NoSQL databases. But it is not recommended, as it could render your database unrecoverable in case of failure (although that is still unlikely).

Instead of risking your data, there’s a much easier way: buy a proper RAID card with a big write cache and a battery backup. You’ll get even more speed, and in case of failure the data will be safely written to disk thanks to the battery. And don’t forget to change the battery every 2 years!

 

Hope you liked it. This is long enough already; maybe I’ll talk about other parts of its architecture, like vacuuming, TOAST or HOT updates, in another post.