When it comes to perform database queries in an application I hear things from “I stick with SQL-92 so I’m not stuck with one database” to “Everyone should drop ORM’s because they don’t use the full database potential and are slower”. And I believe both points are wrong. Every project should use an ORM: Change my mind.
Sticking with SQL-92, a 25 years old standard (even older, do the math) is not going to make your application run against any database. Things like character escaping, string quoting and name quoting have different requirements on different databases. Also, no database implements a SQL standard fully and properly; some products are more strict than others, but even then there are always some corners where the standard has to be bent a bit to fit the database design. If you do this, you’re going to get stuck with one database anyway, and your application is going to be slower because you would be rejecting to use any powerful feature. For me, this sounds like someone that doesn’t know SQL well and doesn’t want to learn.
Dropping the ORM entirely and go only with raw SQL, using all features that your chosen database has will give you the most performance that you can get, but you’ll end with a lot of very specific SQL everywhere that are hard to translate to other products. This is fine by me, but what worries me is that your code will heavily depend on the specific query performed, so you’ll end with a lot of code duplication.
I have seen this lots of times. In one part of the program we need to display users, so there’s a query and a template. In other parts of the program there’s users by post, so there’s another query and another part of glue code. And again and again there’s more and more glue code everywhere to do the same thing from different sources. Even sometimes we needed to emulate records in the database that aren’t created by then, so there’s more glue code.
Proper approach to database querying
The best solution is to go with a fast ORM that allows use of Raw SQL when needed, in parts or the full query. If the performance is good enough with the default ORM implementation, go with it. Whenever it falls short, try to add some raw strings in the clause where you need it. If for some case the ORM is too slow anyway, then go for full raw query just for those cases.
As I said before, I’m not specially worried of getting stuck with one database product. When I design the application I carefully choose the best possible option and usually I don’t need to change it later. The problem comes on unit testing testing.
Setting up a full-blown database like MySQL or PostgreSQL is hard, and eat lots of resources. SQLite on the other hand offers a really good SQL support and it can be created on memory, so it’s really convenient.
When using an ORM, most of the logic can be tested against SQLite. Even I found that my raw queries with CTE expressions are understood by SQLite, and that’s amazing. For the remaining ones, because they’re just a few, they can have either another raw query just for SQLite or we can mock the call entirely. Even we could have another test folder for integration tests and in those we could use a full PostgreSQL server for test purposes.
But this is not the main benefit of an ORM. And no one should expect being able to use ORM without knowing SQL language properly.
The main benefit of ORM is “Don’t Repeat Yourself” or DRY. Avoiding to write duplicated logic depending on context.
ORM are really powerful in this matter, as they create objects for us and when we create functions for those objects we don’t need to know which data was queried actually, or under which column names. So functions can be reused everywhere to perform the same logic across the application. This is for me my main motivation for using ORM.
Sometimes we can query partial data for a table, and the receiving function would never notice. Sometimes we can prefetch related data and the function will use it just fine without knowing it.
I believe that there is a relation or proportion between code and bugs. The more code you have, more chances of having bugs in it. Having your code as concise as possible makes the intention clearer and is simpler to reason, inspect, and prove correct.
In other cases, a bigger SQL is simpler to prove correct because SQL is closer to a declarative or functional language and expresses terms in mathematical relations that are also a good tool to prove correctness. But if when the results land on your code you still have to do complex calculations on it, then you’re doing something wrong or the problem is really difficult to solve while being fast.
Some ORM have the ability to map raw SQL to objects and we should use these to avoid duplicating logic again and again.
The main problem with ORM is performance. Everywhere I saw developers saying that ORM performance does not matter, we should move the work in the database anyway. This is true and false at the same time. From one side, yes, we should move all possible calculations to the database unless there is a legitimate reason to not to do so. But on the other part, ORM performance does matter.
Guess what happens when we use 10000 rows from a complex database query in most ORMs: The database handles the load, with filters and joins neatly and quickly, responding in 10ms or less. The ORM can add another 10ms on top on some cases, making the responses twice as slow.
This is the main reason why so many people is claiming against ORM, because the different libraries fail to realize how important is to be fast and they become the main bottleneck. They fail to see that databases are not fast, they’re blazing fast: they can handle millions of records in seconds. Any slowdown in our application and we’ll be impacted heavily.
Some clear indicators on how overlooked performance is are the lack of compiled Python ORMs (using C for the parts where speed is required). I also compared Django (Python) against GOrm (Go) and while GOrm queries slightly faster, Django seems to retrieve slightly more rows per second. How’s this even possible if Python is 10x slower? This is because most or all Orm in Go use reflection (like introspection in Python) instead of generating specific code for the types; Go ends doing much more work than it should.
Common Missing Features in ORM
We have a lot of features to mimic more and more stuff from SQL, but as stated before, performance is missing. No matter which ORM I always find the following features missing or incomplete: query plan caching, result caching, proper cache eviction tools and the ability to inject generated/mock results.
Query Plan Caching
There’s an associated cost converting the ORM representation of the query to the database dialect SQL. When a query is complex enough and is executed often, this becomes a small bottleneck.
But most of the query is usually stated already in code, and in fact, at compile-time it could be partially generated for a particular dialect speeding up later calls.
Some databases allow as well prepared statements. On those, you just prepare the calls at the beginning and afterwards you only need to provide the required data, without nay effort. This not only saves you from building a SQL each time, it also saves time to the database as it does not have to parse and plan the query each time.
Not only saving the cost of creating a SQL or unmarshalling the records, caching results also saves a database query with its round-trip. Most of the time I had to do it manually without much help from the ORM which is exhausting.
Cache eviction tooling
So we cached the data, good. But now it’s stale: not so good. The ORM in fact knows or has methods to know if the application itself invalidated the data. Because has knowledge on which tables are touched on each query or update, they could easily perform a cache eviction mechanism à la MySql, where whenever a table is modified, all caches using that table are evicted. It’s really simple and it does remove a lot of useless queries.
For external modifications outside the app there are also tricks, like leveraging columns like last_modified, reading internal statistics of the database, or having an external communication channel that allows us to tell the app to evict cache for a certain table.
How many ORM implement this idea? Zero. Good luck implementing this yourself without touching the ORM.
Inject generated results
This pattern is a bit more convoluted. Imagine we have a function that works with data from a query. Now we want to use it without querying. In most cases replacing the query with a list of created objects is good enough.
But what if the function expects to have working relationships? For example in Django, with select_related or prefetch_related. Those are near to impossible to mock up properly.
And now to the final boss: Perform a query to database and fill some Many-to-Many relationships without querying, from cache or some generated data. This makes more sense than might seem. The typical use-case scenario is: We want to output a list of products, but with it a huge bunch of extra data from relationships which is very costly to fetch and process. On the website, we want the products to be fresh, but for the extra data we really don’t care that much as it rarely changes and it’s barely noticeable. So we want to join a fresh query with a potentially stale cache. If the previous example was near impossible, imagine how “easy” is this one.
ORMs are needed but a pain point
Some people might think that I’m obsessed with performance, but I believe that those that have seen me working would agree that I only focus on performance when actually is important. I do not care unless I see potential problems.
We should be using an ORM to avoid code duplication and having logic concentrated in a single point. But when doing that we lose a lot of performance. They also lack tooling to avoid calculating things twice, which makes things even worse.
Still I would recommend everyone using one, because it’s better than raw queries in so many senses. They can provide correctness, slimmer code and reduced chances of bugs. Just hope that someday they’ll have features to make caching easier.
At the moment I’m trying to benchmark different ORM over Python, Go and Rust and hope I can get it out in some weeks.