Benchmarking Python vs PyPy vs Go vs Rust

Since I learned Go I started wondering how well it performs compared to Python in a HTTP REST service. There are lots and lots of benchmarks already out there, but the main problem on those benchmarks is that they’re too synthetic; mostly a simple query and far from real world scenarios.

Some frameworks like Japronto exploit this by making the connection and the plain response blazing fast, but of course, as soon as you have to do some calculation (and you have to, if not what’s the point on having a server?) they fall apart pretty easily.

To put a baseline here, Python is 50 times slower than C++ on most benchmarks, while Go is 2-3 times slower than C++ on those and Rust some times even beats C++.

But those benchmarks are pure CPU and memory bound for some particular problems. Also, the people who submitted the code did a lot of tricks and optimizations that will not happen on the code that we use to write, because safety and readability is more important.

Other type of common benchmarks are the HTTP framework benchmarks. In those, we can get a feel of which languages outperform to others, but it’s hard to measure. For example in JSON serialization Rust and C++ dominate the leader board, with Go being only 4.4% slower and Python 10.6% slower.

In multiple queries benchmark, we can appreciate that the tricks used by the frameworks to “appear fast” no longer are useful. Rust is on top here, C++ is 41% slower, and Go is 43.7% slower. Python is 66.6% slower. Some filtering can be done to put all of them in the same conditions.

While in that last test which looks more realistic, is interesting to see that Python is 80% slower, which means 5x from Rust. That’s really really far better from the 50x on most CPU benchmarks that I pointed out first. Go on the other hand does not have any benchmark including any ORM, so it’s difficult to compare the speed.

The question I’m trying to answer here is: Should we drop Python for back-end HTTP REST servers? Is Go or Rust a solid alternative?

The reasoning is, a REST API usually does not contain complicated logic or big programs. They just reply to more or less simple queries with some logic. And then, this program can be written virtually with anything. With the container trend, it is even more appealing to deploy built binaries, as we no longer need to compile for the target machine in most cases.

Benchmark Setup

I want to try out a crafted example of something slightly more complicated, but for now I didn’t find the time to craft a proper thing. For now I have to fall back into the category of “too synthetic benchmarks” and release my findings up to this point.

The base is to implement the fastest possible for the following tests:

  • HTTP “Welcome!\n” test: Just the raw minimum to get the actual overhead of parsing and creating HTTP messages.
  • Parse Message Pack: Grab 1000 pre-encoded strings, and decode them into an array of dicts or structs. Return just the number of strings decoded. Aims to get the speed of a library decoding cache data previously serialized into Redis.
  • Encode JSON: Having cached the previous step, now encode everything as a single JSON. Return the number characters in the final string. Most REST interfaces will have to output JSON, I wanted to get a grasp how fast is this compared to other steps.
  • Transfer Data: Having cached the previous step, now send this data over HTTP (133622 bytes). Sometimes our REST API has to send big chunks over the wire and it contributes to the total time spent.
  • One million loop load: A simple loop over one million doing two simple math operations with an IF condition that returns just a number. Interpreted languages like Python can have huge impact here, if our REST endpoint has to do some work like ORM do, it can be impacted by this.

The data being parsed and encoded looks like this:

{"id":0,"name":"My name","description":"Some words on here so it looks full","type":"U","count":33,"created_at":1569882498.9117897}

The test has been performed on my old i7-920 capped at 2.53GHz. It’s not really rigorous, because I had to have some applications open while testing so assume a margin of error of 10%. The programs were done by minimal effort possible in each language selecting the libraries that seemed the fastest by looking into several benchmarks published.

Python and PyPy were run under uwsgi, sometimes behind NGINX, sometimes with the HTTP server included in uwsgi; whichever was faster for the test. (If anyone knows how to test them with less overhead, let me know)

The measures have been taken with wrk:

$ ./wrk -c 256 -d 15s -t 3 http://localhost:8080/transfer-data

For Python and PyPy the number of connections had to be lowered to 64 in order to perform the tests without error.

For Go and Rust, the webserver in the executables was used directly without NGINX or similar. FastCGI was considered, but seems it’s slower than raw HTTP.

Python and PyPy were using Werkzeug directly with no url routing. I used the built-in json library and msgpack from pip. For PyPy msgpack turned out to be awfully slow so I switched to msgpack_pypy.

Go was using “” and “” for serving HTTP with url routing. For JSON I used “encoding/json” and for MessagePack I used “”.

For Rust I went with “actix-web” for the HTTP server with url routing, “serde_json” for JSON and “rmp-serde” for MessagePack.

Benchmark Results

As expected, Rust won this test; but surprisingly not in all tests and with not much difference on others. Because of the big difference on the numbers, the only way of making them properly readable is with a logarithmic scale; So be careful when reading the following graph, each major tick means double performance:

Here are the actual results in table format: (req/s)

HTTPparse mspencode jsontransfer data1Mill load

Also, for the Transfer Data test, it can be translated into MiB/s:

transfer speed
Rust2,491.53 MiB/s
Go2,897.66 MiB/s
PyPy701.15 MiB/s
Python897.27 MiB/s

And, for the sake of completeness, requests/s can be translated into mean microseconds per request:

HTTPtransfer dataparse mspencode json1Mill load

As per memory footprint: (encoding json)

  • Rust: 41MB
  • Go: 132MB
  • PyPy: 85MB * 8proc = 680MB
  • Python: 20MB * 8proc = 160MB

Some tests impose more load than others. In fact, the HTTP only test is very challenging to measure as any slight change in measurement reflects a complete different result.

The most interesting result here is Python under the tight loop; for those who have expertise in this language it shouldn’t be surprising. Pure Python code is 50x times slower than raw performance.

PyPy on the other hand managed under the same test to get really close to Go, which proves that PyPy JIT compiler actually can detect certain operations and optimize them close to C speeds.

As for the libraries, we can see that PyPy and Python perform roughly the same, with way less difference to the Go counterparts. This difference is caused by the fact that Python objects have certain cost to read and write, and Python cannot optimize the type in advance. In Go and Rust I “cheated” a bit by using raw structs instead of dynamically creating the objects, so they got a huge advantage by knowing in advance the data that they will receive. This implies that if they receive a JSON with less data than expected they will crash while Python will be just fine.

Transferring data is quite fast in Python, and given that most API will not return huge amounts of it, this is not a concern. Strangely, Go outperformed Rust here by a slight margin. Seems that Actix does an extra copy of the data and a check to ensure UTF-8 compatibility. A low-level HTTP server probably will be slightly faster. Anyway, even the slowest 700MiB/s should be fine for any API.

On HTTP connection test, even if Rust is really fast here, Python only takes 50 microseconds. For any REST API this should be more than enough and I don’t think it contributes at all.

On average, I would say that Rust is 2x faster than Go, and Go is 4x faster than PyPy. Python is from 4x to 50x slower than Go depending on the task at hand.

What is more important on REST API is the library selection, followed by raw CPU performance. To get better results I will try to do another benchmark with an ORM, because those will add a certain amount of CPU cycles into the equation.

A word on Rust

Before going all the way into developing everything in Rust because is the fastest, be warned: It’s not that easy. Of all four languages tested here, Rust was by far, the most complex and it took several hours for me, untrained, to get it working at the proper speed.

I had to fight for a while with lifetimes and borrowing values; I was lucky to have the Go test for the same, so I could see clearly that something was wrong. If I didn’t had these I would had finished earlier and call it a day, leaving code that copies data much more times than needed, being slower than regular Go programs.

Rust has more opportunities and information to optimize than C++, so their binaries can be faster and it’s even prepared to run on crazier environments like embedded, malloc-less systems. But it comes with a price to pay.

It requires several weeks of training to get some proficiency on it. You need also to benchmark properly different parts to make sure the compiler is optimizing as you expect. And there is almost no one in the market with Rust knowledge, hiring people for Rust might cost a lot.

Also, build times are slow, and in these test I had always to compile with “–release”; if not the timings were horribly bad, sometimes slower than Python itself. Release builds are even slower. It has a nice incremental build that cuts down this time a lot, but changing just one file requires 15 seconds of build time.

Its speed it’s not that far away from Go to justify all this complexity, so I don’t think it’s a good idea for REST. If someone is targeting near one million requests per second, cutting the CPU by half might make sense economically; but that’s about it.

Update on Rust (January 18 2020): This benchmark used actix-web as webserver and it has been a huge roast recently about their use on “unsafe” Rust. I’m had more benchmarks prepared to come with this webserver, but now I’ll redo them with another web server. Don’t use actix.

About PyPy

I have been pleased to see that PyPy JIT works so well for Pure Python, but it’s not an easy migration from Python.

I spent way more time than I wanted on making PyPy work properly for Python3 code under uWSGI. Also I found the problem with MsgPack being slow on it. Not all Python libraries perform well in PyPy, and some of them do not work.

PyPy also has a high load time, followed by a warm-up. The code needs to be running a few times for PyPy to detect the parts that require optimization.

I am also worried that complex Python code cannot be optimized at all. The loop that was optimized was really straightforward. Under a complex library like SQLAlchemy the benefit could be slim.

If you have a big codebase in Python and you’re wiling to spend several hours to give PyPy a try, it could be a good improvement.

But, if you’re thinking on starting a new project in PyPy for performance I would suggest looking into a different language.

Conclusion: Go with Go

I managed to craft the Go tests in no time with almost no experience with Go, as I learned it several weeks ago and I only did another program. It takes few hours to learn it, so even if a particular team does not know it, it’s fairly easy to get them trained.

Go is a language easy to develop with and really productive. Not as much as Python is, but it gets close. Also, it’s quick build times and the fact that builds statically, makes very easy to do iterations of code-test-code, being attractive as well for deployments.

With Go, you could even deploy source code if you want and make the server rebuild it each time that changes if this makes your life easier, or uses less bandwidth thanks to tools like rsync or git that only transfer changes.

What’s the point of using faster languages? Servers, virtual private servers, server-less or whatever technology incurs a yearly cost of operation. And this cost will have to scale linearly (in the best case scenario) with user visits. Using a programming language, frameworks and libraries that use as less cycles and as less memory as possible makes this year cost low, and allows your site to accept way more visits at the same price.

Go with Go. It’s simple and fast.

HTTP Pipelining is useless

…and please stop publishing benchmarks with Pipelining enabled. It’s just lying about real-world performance.

Today I just found out that one of my favorite sources for HTTP framework benchmarks is indeed using pipelining to score the different programming languages and frameworks and I’m mad about it:

The first time I saw this was with Japronto, which claimed one freaking million of requests per second, and of course this wasn’t replicable unless you had a specific benchmarking method with pipelining enabled.

Before HTTP/2 I was in favor of pipelining because we were so limited on parallel requests and TCP connections were so costly that it made sense. Now, with H2 supported on all major browsers and servers, pipelining should be banned from benchmarks.

What is HTTP pipelining?

In classic HTTP/1, we had to open a TCP connection for a single request. Open the socket, send the request, wait for response and close the socket. TCP connections have a big cost to open, so this was a real problem back in the days.

With HTTP/1.1 we had keep-alive, where after the request was completed, we can feed another request on the same TCP socket. This alleviated the problem. But still, if your computer is far from the server (usually is), the server will sit idle waiting for the last packet sent to arrive to your computer, then waiting for your next request back. In most servers this is 80ms of delay from one request to the following one.

So here enters the scenario the named HTTP pipelining, where we could send another request before the response was received, effectively queuing the requests on the server and receiving them in order. Wikipedia has a nice graph on this:

This looks nice, but HTTP/1.1 never got Pipelining working; it was there, not mandatory, with some clients and servers supporting it; but as it seemed that most web servers at the moment were failing to reply properly with pipelining, and there was no reliable way for the client to tell if the server actually supports pipelining, all major browsers didn’t add the support at all. What a shame!

It was a really good idea, but then HTTP/2 came with multiplexing and this problem vanished. There are still challenges in this area, but nothing that Pipelining will solve. So now, we’re happy with multiplexing.

HTTP/2 does not have Pipelining

This is a common misunderstanding. Yes, you can send several requests; even several thousands without waiting to receive anything. This is really good, but it’s not pipelining. Why?

Pipelining, as it’s name implies, acts like a pipe: First In, First Out. The request will be queued in the server in order and will be replied in order.

HTTP/2 has instead multiplexing, which seems similar, but better. Multiplexing means that you get several streams inside one at the same time, so you can receive data as it is produced. The requests are not queued and are not returned in the same order. They come back at the same time.

Why pipelining gives so good results

Because it’s equivalent to copy a file over the network, specially under synthetic benchmarks where localhost is the target, pipelining reduces a lot of effort to get the different packets.

Instead of grabbing a packet for a request and processing it, you can let it buffer, then grab a big chunk in one go that might contain hundreds of requests, and reply back without caring at all if the client is getting the data or not.

Even more, as the benchmark is synthetic, servers might know beforehand what to serve more or less, reducing time for look what is requested and just replying back approximately the same data again and again.

The benchmark clients also do way less effort, because they only need to fill a connection with the same string repeated millions of times.

If you think about it carefully, this is even faster than copying files over localhost: You don’t even need to read a file in the first place.

HTTP/2 multiplexing is slower

Compared to pipelining, of course. Because you’re not serving a clear stream of data but thousands of interleaved streams, your server has to do more work. This is obvious.

Of course, we could craft a cleartext HTTP/2 server that does multiplexing in effectively one single stream, replying in order. This will result in closer performance to pipelining because it’s indeed pipelining.

But this will be naive to be implemented on a production site, as the same applies if HTTP/1.1 pipelining was a thing. HTTP/2 proper multiplexing is far superior in real world scenarios.

And my question is, do you want your benchmark to return higher results or do you want your users to have the best experience possible?

Because if you only care on benchmarks, maybe is just easier to change the benchmark so it returns better results for your servers, right?

Pipelining will not help serve more requests

I can hear some of you saying “If we enable Pipelining in our production, we will be able to serve millions of results!”. And… surprise!

Why? you might ask. Well, depending on the scenario the problem is different, but it will conclude always to the same two: You need to be able to reply out-of-order to avoid bottlenecks and a single user will never cause thousands of pipelined requests like your benchmark tool.

First pitfall: Requests are heterogeneous, not homogeneous.

Requests will not have the same size, nor reply size. They will have different computing times or wait times to reply. Does your production site reply with a fortune cookie for every single request? Even CSS and JPEG queries? No, I don’t think so.

Why this matters? Well, say your client is asking for a CSS and a JPEG for the same page and you’re replying back with pipelining. If the JPEG was requested first, the CSS will stall until the image completed, making the page not render for some time.

Imagine now we have a REST API, and we get thousands of requests from a client. One of the requests contains an expensive search on the database. When that one is processed, the channel will sit idle and your client will be frozen.

Second pitfall: Real users will never pipeline thousands of requests.

Unless your site is really bad designed, you’ll see that more than 50 parallel request do not make much sense. I tried myself HTTP/2 with an Angular site aggressively sending requests for tiny resources, and the results were quite good, but less than 100 requests in parallel. And the approach was pretty stupid. Aside of this, popular servers and browser lack support for HTTP/1.1 pipelining, so enabling it in your product will not make any difference.

Let’s consider this for a second. Why do we want to pipeline in the first place? Because the client is far from the server and we want to reduce the impact of round-trip time. So, say our ping time to the server is 100ms (which is higher than the usual), and we pipeline 100 requests at a time.

Effectively, in one round-trip, we served 100 requests, so this equates to 1ms RTT per HTTP response. What haves 1ms RTT? Local network! So when you reach this parallelism, the client works as fast as from your local network given the same bandwidth is available. Try the same math for one thousand and ten thousand requests pipelined: 0.1ms and 0.01ms respectively.

So now the question is: Are you trying to save 0.9ms per request to the client, or are you just trying to get your benchmark numbers look better?

Scenario 1: API behind reverse proxy

Assume we have our shiny Japronto in port 8001 in localhost, but you want to serve it along the rest of the site, in port 80. So we put it behind a reverse proxy configuration; this might be Apache, Nginx or Varnish.

Here’s the problem: None of the popular web servers or reverse proxies support pipelining. In fact, even serving static data they will be slower than what your shiny pipelining framework claims it can do.

Even if they did, when they proxy the request, they don’t do pipeline on the proxied server either.

This approach renders pipelining useless.

Scenario 2: Main Web Server

So let’s put our framework directly facing public internet, over another port, who cares? We can send the requests from Angular/React/Vue to whatever port and the user will not notice. Of course this will add a bit of complexity as we need to add some headers here and there to tell the browser to trust our application running in a different port than the main page.

Nice! Does this work? Well, yes and no.

The main concern here is that we’re exposing a non well-tested server to the internet and this can be incredibly harmful. Bugs are most probably sitting there unnoticed, until someone actually notices and exploits them to gain access to our data.

If you want seriously to do that, please, put it inside a Docker container with all permissions cut down, with most mounting points as read only. Including the initial docker container image.

Did we enable HTTP/2 with encryption? If we’re lucky enough that our framework supports it, then it will consume extra CPU doing the encryption and multiplexing.

HTTP/2 over clear text does not work in any browser, so if you try, most users will just go with HTTP/1.1.

If we don’t use HTTP/2 at all, 99% of users have browsers that do not use pipelining at all.

For those cases where they do, the routers and hardware that makes internet itself work will mess up sometimes the data because they see HTTP in clear and they want to “manage” it because “they know the standard”. And they’re pretty old.

Scenario 3: Pipelining reverse proxy

I had an excellent idea: Let’s have our main web server to collect all requests from different users and pipeline them under a single stream! Then we can open several processes or threads to further use the CPU power and with pipelining, the amount of requests per second served will be astonishing!

Sounds great, and a patch to Nginx might do the trick. In practice this is going to be horrible. As before, we will have bottlenecks but now one user can freeze every other user because they asked a bunch of costly operations.

Conclusion 1

The only way this can work is if the framework supports HTTP/2 encrypted and is fast doing it. In this case you should have benchmarked frameworks with HTTP/2 multiplexing.

If your framework does not multiplex properly so indeed, pipelines the data, then users will see unexplainable delays under certain loads that are hard to reproduce.

Conclusion 2

In some scenarios, the client is not a user browser. For example for RPC calls if we implement the microservices approach. In this case, pipeline indeed works given the responses are homogeneous.

But, just it turns out that HTTP is not the best protocol for those applications. There are tons of RPC protocols and not all of these use HTTP. In fact, if you search for the fast ones, you’ll see that HTTP is the first thing they drop out.

I did in the past an RPC protocol myself called bjsonrpc. I wanted speed, and dropping HTTP was my main motivation to create it.

If you need HTTP for compatibility, just have two ports open, one for each protocol. Clients that can’t understand a specific protocol are likely to not understand pipelining either. Having a port for each thing will give you the best performance in the clients that support it while still allowing other software to connect.

Brief word on QUIC

The new old QUIC protocol by Google is being standarized at the moment by the IETF as a base for the future HTTP/3 protocol. QUIC does support fast encryption (less round trips) and has a lot of tolerance against packet loss as well supporting IP Address changes.

This is indeed the best protocol possible for RPC calls, except for its massive use of CPU compared to raw TCP. I really hope that someone standarizes a non-HTTP protocol on top of it aimed to application connections, to be supported by browsers.

The takeaway: managing the protocol takes a lot of CPU, we have to do it in production, and skipping part of that for some frameworks that support it is unfair for the others. Please be considerate and disable pipelining when publishing benchmarks, otherwise a lot of people will be taking the wrong decision based on YOUR results.