Rust vs Python: Rust will not replace Python

I love Python; I’ve used it for 10+ years. I also love Rust; I’ve been learning it for the last year. I wanted a language to replace Python, looked into Go, and came away disappointed. I’m excited about Rust, but it’s clear to me that it’s not going to replace Python.

In some areas, yes. There are small niches where Rust can be better than Python and replace it. Games and microservices seem like some of the best candidates, but Rust will need a lot of time to get there. GUI programs are also a good opportunity, but the fact that Rust’s model is so different from regular OOP makes it hard to integrate with existing toolkits, and a GUI toolkit is not something easy to build from scratch.

For CLI programs and utilities, Go will probably prevent Rust from gaining much ground. Go is clearly targeted at this particular scenario, it’s really simple to learn and code in, and it does this really well.

What Python lacks

To understand what opportunities other languages have to replace Python, we should first look at Python’s shortfalls.

Static Typing

There are lots of things that Python could improve, but lately I feel that types are one of the top problems that need to be fixed, and it actually looks fixable.

Python, like JavaScript, is not statically typed. You can’t easily control what the input and output types of functions are, or what the types of local variables are.

There’s now the option to annotate your variables with types and check them with tools like MyPy or PyType. This is good and a huge step forward, but insufficient.

Having IDE autocompletion, suggestions and inspection helps a lot when writing code, as it speeds up the developer by reducing round-trips to the documentation. On complex codebases it matters even more, because you don’t need to navigate through lots of files to determine the type you’re trying to access.

Without types, an IDE can barely determine what a variable contains. It has to guess, and that’s not good. Currently, I don’t know of any Python autocompletion based solely on MyPy.

If types were enforced by Python, then the compiler/interpreter could do some extra optimizations that aren’t possible now.

Also, there’s the problem of big Python codebases with contributions from non-senior programmers. A senior developer will assume a “contract” for functions and objects: which inputs are valid, and which outputs the caller must check. Strict types are a good reminder for less experienced people to keep designs and checks consistent.

Just look at how TypeScript improved upon JavaScript by requiring types. Going a step further and making Python enforce a minimum, so that the developer has to explicitly opt out of typing something, would make programs easier to maintain overall. Of course this needs a way to disable it, as forcing it in every scenario would kill a lot of what’s good about Python.

And this needs to be enforced down to libraries. The current problem is that a lot of libraries just don’t care, and if someone wants to enforce it, it gets painful as the number of dependencies increases.

Static analysis in Python exists, but it is weak. Enforced types would allow better, faster and more comprehensive static analysis tools to appear. This is a strong point of Rust, as the compiler itself already does a lot of static analysis; add tools like Cargo Clippy and it gets even better.
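As a tiny illustration of that contrast, here’s a minimal Rust sketch (the function and values are made up for the example) showing how a typed signature acts as an enforced contract that the compiler checks before anything runs:

```rust
// Made-up example: the signature is a contract that the compiler enforces
// before the program ever runs.
fn apply_discount(price: f64, percent: f64) -> f64 {
    price * (1.0 - percent / 100.0)
}

fn main() {
    println!("{}", apply_discount(40.0, 15.0)); // 34
    // apply_discount("40", 15.0); // rejected at compile time: expected `f64`, found `&str`
}
```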

All of this is important to keep the codebase clean and neat, and to catch bugs before running the code.

Performance

The fact that Python is one of the slowest programming languages in use shouldn’t be news to anyone. But as I covered before in this blog, this is more nuanced than it seems at first.

Python makes heavy use of integration with C libraries, and that’s where its power is unleashed. C code called from Python still runs at C speed, and while it is running the GIL is released, allowing a bit of multithreading.

The slowness of Python comes from the amount of magic it can do, the fact that almost anything can be replaced, mocked, whatever you want. This makes Python especially good when designing complex logic, as it is able to hide it very nicely. And monkey-patching is very useful in several scenarios.

Python works really well with Machine Learning tooling, as it is a good interface for describing what the ML libraries should do. It might be slow, but the few lines of code that configure the underlying libraries take almost zero time, and those libraries do the hard work. So ML in Python is really fast and convenient.

Also, don’t forget that when such levels of introspection and “magic” are needed, it is slow regardless of the language. This can be seen when comparing ORMs in Python and Go: as soon as the ORM does the magic for you, it becomes slow, in any language. To avoid this you need an ORM that is simple, and not as automatic and convenient.

The problem arises when we need to do something for which a library (that interfaces with C) doesn’t exist. We end up coding the thing by hand, and that becomes painfully slow.

PyPy solves part of the problem. It can optimize some pure Python code and run it at speeds near JavaScript and Go (note that JavaScript is really fast). There are two problems with this approach: the first is that most Python code can’t be optimized enough to get good performance; the second is that PyPy is not compatible with all libraries, since the libraries need to be compiled against PyPy instead of CPython.

If Python were stricter by default, allowing the wizardry only when the developer really needs it and marking this via annotations (types and so on), I guess both PyPy and CPython could optimize further, as they could make better assumptions about how the code is supposed to run.

The ML libraries and similar ones are able to build C code on the fly, and that should be possible for CPython itself too. If Python included a sub-language for high-performance work, even if it made programs slower to start, it would allow programmers to optimize the critical parts of the code that are especially slow. But this needs to be included in the main language and bundled with every Python installation. That would also mean that some libraries could get away with pure Python, without having to release binaries, which in turn would increase their compatibility with other interpreters like PyPy.

There are Cython and Pyrex, which I used in the past, but the problem with these is that they force you to build the code for different CPU targets and Python versions, and that’s hard to maintain. Building anything on Windows is quite painful.

The GIL is another front here. Since it only allows Python to execute one instruction at a time, threads cannot be used to distribute pure-Python CPU-intensive operations across cores. Better Python optimizations could in fact relieve this, by determining that function A is totally independent of function B and allowing them to run in parallel, or even by compiling them into non-Python instructions if the code clearly doesn’t use any Python magic. That would allow the GIL to be released and the code to parallelize much better.

Python & Rust together via WASM

This could solve a great part of the problem if it turns out to be easy and simple. WebAssembly (WASM) was conceived as a way to replace JavaScript in browsers, but the neat thing is that it produces code that can be run from any programming language and is independent of the CPU target.

I haven’t explored this myself, but if it can deliver what it promises, it means that you only need to build Rust code once and bundle the WASM. This should work on all CPUs and Python interpreters.
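As a hedged sketch of what that single build could look like, here’s a trivial Rust function exported with a C ABI, assuming the crate is compiled once for a target like wasm32-unknown-unknown and the resulting .wasm is loaded from Python through some WASM runtime (the function name and its logic are invented for the example):

```rust
// Illustrative sketch only: a pure computation exported with a C ABI so the crate
// can be built once (e.g. `cargo build --target wasm32-unknown-unknown`) and the
// resulting .wasm loaded from Python through a WASM runtime.
#[no_mangle]
pub extern "C" fn checksum(a: u32, b: u32) -> u32 {
    a.wrapping_mul(31).wrapping_add(b)
}
```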

The problem, I believe, is that the WASM loader for Python will need to be compiled for each combination of CPU, OS and Python interpreter. It’s far from perfect, but at least it’s easier to get a small common library to support everything and then let other libraries and code build on top of it. So this could relieve some maintenance burden from other libraries by diverting that work onto the WASM maintainers.

Another possible problem is that WASM will have a hard time doing anything that is not strictly CPU computation, for example managing sockets, files, or communicating with the OS. As WASM was designed to run inside a browser, I expect all OS communication to require a common API, and that will have some caveats for sure. While I expect the tasks mentioned above to be usable from WASM, things like OpenGL and talking directly to a GPU will surely lack support for a long time.

What Rust Lacks

While most people will think that Rust needs to be easier to code in, that it is a complex language that requires a lot of human hours to get code working, let me heavily disagree.

Rust is one of the most pleasant languages to code in once you have expertise in it. It is quite productive, almost on the level of Python, and very readable.

The problem is gaining this expertise. It takes way too much effort for newcomers, especially those who are already seasoned in dynamically typed languages.

An easier way to get started in Rust

And I know this has been said a lot by novice people, and it has been discussed ad infinitum: we need a RustScript language.

For the sake of simplicity, I’ll call this hypothetical language RustScript. To my knowledge, the name is not in use and RustScript does not exist, even if I talk as though it does.

I have read about others proposing this, so please keep reading: I already know more or less what has been proposed and how those discussions went.

The main problem with learning Rust is the borrow-checking rules; (almost) everyone knows that. A RustScript language must have a garbage collector built in.

But the other problem, not talked about as much, is the complexity of reading and properly understanding Rust code. Because people come in, try a few things, and the compiler keeps complaining everywhere, they never get to learn the basic stuff that would allow them to read code easily. These people will struggle even to remember whether the type was f32, float or numeric.

A RustScript language must serve as a bootstrap into Rust syntax and features, while keeping the hard/puzzling stuff away. That way, once someone can use RustScript easily, they will be able to learn proper Rust with a smaller learning curve, already feeling familiar and knowing how the code should look.

So it should smooth out Rust’s current steep learning curve into something much more gradual.

Here’s the problem: Rust takes months of learning to be minimally productive. Without properly knowing a lot of complex stuff, you can’t really do much with it, and that turns into frustration.

Some companies already require 6 months of training before a new hire is productive. Do we really expect them to add another 6 months on top of that?

What’s good about Python is that newcomers are productive from day zero. Rust doesn’t need to target that, but the current situation is way too bad and it’s hurting its success.

A lot of languages and changes have been proposed, or even implemented, but they fail to solve this problem completely.

This hypothetical language must:

  • Include a Garbage Collector (GC) or any other solution that avoids requiring a borrow checker.
    Why? Removing this complexity is the main reason for RustScript to exist.
  • Have almost the same syntax as Rust, at least for the features they have in common.
    Why? Because if newcomers don’t learn the same syntax, they aren’t making any progress towards learning Rust.
  • Be binary- and linker-compatible with Rust; all libraries and tooling must work from RustScript.
    Why? Having a completely different set of libraries would be a headache and would require a completely different ecosystem. Newcomers should familiarize themselves with Rust libraries, not RustScript-specific ones.
  • Allow Rust sample code to be machine-translated into RustScript, much like Python 2 can be translated into Python 3 with the 2to3 tool. (Some things, like macro declarations, might not work as they might not have a replacement in RustScript.)
    Why? Documentation is key. Having a way to automatically translate documentation into RustScript will make everyone’s life easier. I don’t want the API-guessing game that happens with PyQT.
  • Officially supported by the Rust team itself, and bundled with Rust when installing via RustUp.
    Why? People will install Rust via RustUp. Ideally, RustScript should be part of it, allowing for easy integration between both languages.

Almost any of these requirements alone is going to be hard to meet. Getting a language that does everything needed, with all the support… it’s not something I expect to ever happen.

I mean, Python has it easier. What I would ask of Python is far more achievable than what I’m asking here, and yet in 10 years there have only been slight changes in the right direction. With that in mind, I don’t expect Rust to ever have a proper RustScript, but if it happens, well, I would love to see it.

What would be even better is if RustScript were almost a superset of Rust, making Rust programs mostly valid RustScript, with few exceptions such as macro creation. This would allow developers to move to Rust incrementally, facing the borrow checker in small, easy-to-digest doses. But even having to declare a whole file or module as RustScript would still work, as it would allow devs to migrate file by file or module by module. That’s still better than having to choose between language X or Y for a whole project.

Anyway, I’d better stop talking about this, as it’s not gonna happen, and it would require a full post (or several) anyways to describe such a language.

Proper REPL

Python’s REPL is really good, and a lot of tools make use of it. Rust REPLs exist, but they’re not officially supported and far from perfect.

A REPL is useful for ML work and for trying out small things. The fact that Rust needs to compile everything makes this quite painful: it needs boilerplate to work, and every instruction takes time to build interactively.

If Rust had a scripting language this would be simpler, as a REPL for a scripting language tends to be straightforward.

Simpler integration with C++ libraries

The fact that both Rust and Python integrate only with C, and not C++, would make anyone think they are on the same level here; but no. Because Python’s OOP is quite similar to C++’s, and its magic can make up for the missing parts (method overloading), in the end Python has far better C++ integration than Rust.

There are a lot of ongoing efforts to make C++ integration easier in Rust, but I’m not sure they will ever produce something straightforward to use. There’s a lot of pressure on this and I expect it to get much, much better in the coming years.

Still, the fact that Rust has strict borrowing rules while C++ doesn’t, and that C++ exceptions really don’t mix with anything in Rust, will make this hard to get right.

Maybe the solution is a C++ compiler written in Rust, made part of the Cargo suite, so the sources can be copied into the project and the library built for Rust, entirely using Rust. This might allow extra insights and automation that make things easier, but C++ is quite a beast nowadays, and a compiler that supports the newest standards is a lot of work. This solution would also conflict with Linux distributions, as the same C++ library would need to be shipped twice in different versions: a standard one and a Rust-compatible one.

Lack of binary libraries and dynamic linking

All Rust dependencies currently rely on downloading and building the sources for each project. Because there are so many dependencies, building a project takes a long time. And distributing our build means shipping a big binary that contains everything inside. Linux distributions don’t like this.

Having pre-built libraries for common targets would be nice; or if not a full build, maybe some kind of halfway artifact with the most complex part already done, requiring only the final optimization stages targeting the specific CPU, similar to WASM, *.pyc files or JVM bytecode. This would reduce build times by a huge amount and make development more pleasant.

Dynamic linking is another commonly overlooked point. I believe it can be done in Rust, but it’s not something the regular books explain. It’s complex and tricky, whereas the regular static approach is quite straightforward. This means that any update to any of your libraries requires a full build and a full release of all your components.

If an automated way to do this existed in Cargo, even if it built the libraries in some format that can’t be shared across different applications, it would already bring benefits over what we have. For example, the linking stage could take less time, as most of the build time seems to be spent gluing everything together. Another possible benefit: since it would produce N files instead of 1 (say, 10), if your application has a way to auto-update, it could selectively update only the files needed, instead of re-downloading one big fat binary.

To make this work across different applications, as Linux distributions do, the Rust compiler needs better standards and compatibility between builds, so that if a library is built with rustc 1.50.0 and the application with 1.49.0, they still work together. I believe this currently doesn’t work well and there are no guarantees of binary compatibility across versions. (I might be wrong.)

On devices where disk space and memory are constrained, such as microcontrollers or small computers, dynamic libraries shared across applications might help a lot in fitting the different projects in. For current desktop computers and phones, this isn’t a big deal.

The other reason Linux distributions want these pieces separated is that when a library gets a security patch, usually all it takes is replacing the library on the filesystem and you’re safe. With Rust applications you depend on the maintainer of each project to rebuild and release updated versions. Then a security patch for an OS, instead of being, say, 10 MiB, could be 2 GiB because of the number of projects that use the same library.

No officially supported libraries aside from std

In a past article, Someone stop NodeJS package madness, please!!, I talked about how bad the JavaScript ecosystem is. Because everyone publishes packages and there’s no control, there’s a lot of cross-dependency hell.

This can happen to Rust as it has the same system. The difference is that Rust comes with “std”, which contains a lot of common tooling that prevents this from getting completely out of hand.

Python has the same thing in PyPI, but it turns out that the Python standard library covers a lot more functionality than “std”. So PyPI is quite a bit saner than other repositories.

Rust has its reasons for a thin std library, and it’s probably for the best. But something has to be done about the common functionality it doesn’t cover.

There are lots of possible solutions. For example, a second standard library bundling the remaining common stuff (call it “extra_std” or whatever); then everyone building libraries would tend to depend on that one instead of a myriad of different dependencies.

Another option is to promote specific libraries as “semi-official”, to point people to use these over other options if possible.

The main problem with everyone uploading and cross-depending on each other is that these libraries might have just one maintainer, and that maintainer might move on and forget about them forever; then you have a lot of programs and libraries depending on something that has long been obsolete without knowing it. Forking the library doesn’t solve the problem, because no one has access to the original repo to say “deprecated, please use X”.

Another problem is the security implications. You depend on a project that might have been audited in the past or never, but the new version is surely not audited. What state is the code in? Is it sound, or does it abuse unsafe to worrying levels? We’d need to inspect it ourselves, and we all know most of us will never do that.

So if I were to fix this, I would say a Rust committee with security expertise should select and promote the libraries that are “common” and “sane enough”, fork them under a slightly different name, audit them, and only ever upload audited code. Having a group watching those forked libraries means that if a library is deprecated, they will correctly update its status and send people to the right replacement. If someone forks a library and that fork becomes preferred, the security fork should migrate and follow it, so everyone depending on it is smoothly migrated.

In this way, “serde” would have a fork called something like “serde-audited” or “rust-audit-group/serde”. Yes, it would always be a few versions behind, but it would be safer to depend on than upstream.

No introspection tooling in std

Python is heavy on introspection and it’s super nice for automating things. Even Go has some introspection capabilities for its interfaces. Rust, on the other hand, has to rely on macros, and the sad part is that there aren’t any officially supported macros that make this more or less work. Even contributed packages are quite ugly to use.

Something quite common in Python is iterating through the fields of an object/struct: their names and their values.

I would like to see a derive macro in std that adds methods able to list the names of the different fields, and have this standardized for things like Serde. Because if Serde is overkill for some program, right now you have to cook up these macros yourself.
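To make the idea concrete, here’s a hand-written sketch of the kind of field introspection I mean; the trait and struct are invented for the example, and the point is that a derive macro in std could generate this boilerplate automatically:

```rust
// Invented example: a trait that exposes field names, implemented by hand.
// A derive macro in std could generate this impl automatically.
trait FieldNames {
    fn field_names() -> &'static [&'static str];
}

#[allow(dead_code)]
struct Config {
    host: String,
    port: u16,
}

impl FieldNames for Config {
    fn field_names() -> &'static [&'static str] {
        &["host", "port"]
    }
}

fn main() {
    // Prints: ["host", "port"]
    println!("{:?}", Config::field_names());
}
```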

The other problem is the lack of standard variadic types. If I want to iterate through the value of each field, it becomes tiresome and inconvenient, because you need to know in advance which types you might receive, and add boilerplate to support all of them.

The standard traits also lack supertraits that would let you classify types easily. If you want a generic function that works with any integer, you need to figure out all the traits you need, when in reality I would just like to say that type T is “int-alike”.
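Here’s a small sketch of the problem using only std; the function is made up, but it shows how every bound has to be spelled out by hand instead of just saying “int-alike”:

```rust
use std::ops::Add;

// Made-up helper: sums the positive values of any numeric slice. Without an
// "int-alike" supertrait, every capability has to be listed as a separate bound.
fn sum_positive<T>(values: &[T]) -> T
where
    T: Copy + Default + PartialOrd + Add<Output = T>,
{
    let zero = T::default();
    values
        .iter()
        .copied()
        .filter(|v| *v > zero)
        .fold(zero, |acc, v| acc + v)
}

fn main() {
    println!("{}", sum_positive(&[1i32, -2, 3])); // 4
    println!("{}", sum_positive(&[1u64, 2, 3]));  // 6
}
```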

Personal hate against f32 and f64 traits

This might just be me, but every time I add a float in Rust it makes my life hard. The fact that floats don’t support total ordering and proper equality makes them unusable in lots of collection types (HashMaps, etc.).

Yes, I know that these types don’t handle equality (due to imprecision) and comparing them is also tricky (due to NaN and friends). But, c’mon… can’t we have a “simple float”?

In some cases, like configs, decimal numbers are convenient. I wouldn’t mind using a slower type for those cases, one that more or less handles equality (with a built-in epsilon) and handles comparison (with a strict ordering for NaN and Inf, or by disallowing them entirely).
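For reference, here’s a minimal sketch of what std offers today (assuming a reasonably recent Rust, since total_cmp is a fairly new addition); it gives a total ordering, but equality still needs a hand-rolled epsilon:

```rust
fn main() {
    let mut xs = vec![2.5_f32, f32::NAN, 1.0, 0.5];
    // xs.sort(); // does not compile: f32 is not `Ord`
    xs.sort_by(f32::total_cmp); // total order; NaN sorts after all regular values
    println!("{:?}", xs);

    // Equality still needs a hand-rolled epsilon:
    let close = (0.1_f32 + 0.2 - 0.3).abs() < 1e-6;
    println!("{close}");
}
```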

This is something that causes pain to me every time I use floats.

Why I think Rust will not replace Python

Bear in mind that I’m still learning Rust; I might have missed things or be wrong about some of the above. One year of practising on my own is not enough context for all of this, so take this article with a pinch of salt.

Rust is way too different from Python. I really would like Rust to replace my use of Python, but seeing that there are some irreconcilable differences makes me believe that this will never happen.

WASM might be able to bridge some gaps, and Diesel and other ORMs might make Rust a better replacement for Python for REST APIs in the future.

In general terms, I don’t see a lot of people migrating from Python to Rust. The learning curve is too steep, and for most of those use cases Go might be enough, so people will skip Rust altogether. And this is sad, because Rust has a lot of potential on lots of fronts; it just requires more attention than it gets.

I’m sad and angry because this isn’t the article I wanted to write. I would like to say that Rust will replace Python at some point, but if I’m realistic, that’s not going to happen. Ever.

References

https://blog.logrocket.com/rust-vs-python-why-rust-could-replace-python/

https://www.reddit.com/r/functionalprogramming/comments/kwgiof/why_do_you_think_data_scientists_prefer_python_to/glzce8e/?utm_source=share&utm_medium=web2x&context=3

Threading is not a magic wand for performance

I have strong opinions about threading inside applications. Most of the time threads are understood as a way of getting 100% out of your CPU, but… things aren’t that simple.

From the end user’s perspective

…the more cores in a CPU the better. Well, the problem, as I stated in other posts, is that most applications will use 1-16 threads. Once you go above 8 cores with SMT/Hyperthreading, the number of applications making use of every one of them gets quite low. And those applications aren’t going to use them all the time, probably only for some operations.

You could go overboard and get a 32-logical-core CPU, which would let you game, stream, encode video, compress files, run a database, browse and compile, all at once. But seriously? (Yes, some video encoding will use all 32 threads, but it’s not as effective as you might think.) More cores allow more applications to run in parallel at full speed, that’s for sure; but at some point the benefits diminish because it makes less and less sense, and most of the time you’re not going to be doing all those things at once, so your CPU sits unused.

Another thing to take into account is the TDP limit and cooling. Running expensive instructions like SIMD on all cores is likely to go over the CPU’s TDP design, at which point it will throttle itself (Windows and other tools will not report this as throttling because it’s not thermal throttling, it’s a TDP limit). In overclocking scenarios we could raise this TDP limit, sure. But that moves us to a different realm: can we cool that?

My new NH-D15, which is oversized for the 5800X, can keep it at 90ºC with the Noctua low-speed adapters fitted, at low noise. I have quite a low tolerance for noise. But now imagine a 5950X, which has double the cores. Running that at full capacity would require removing the adapters and running the fans at full speed to keep things under control. Even so, that chip will go over its TDP if SIMD instructions run simultaneously on all cores. Going up from there and raising the chip’s TDP would mean custom water cooling is the only way to keep things under control.

I’m getting a bit off-topic, but the point is that using too much CPU power will lead to tons of heat that needs to be dissipated or the CPU will start underperforming. TDP limit is a thing too. Keep that in mind.

Now we might think… well, over time applications will use more threads as these CPUs become widespread. Heh, yes, but… no. Yes, because some applications (games) are traditionally single-threaded and will unlock a huge amount of performance on almost every computer by adding just a few threads. But also no, because not all operations can be parallelized across threads (or processes, etc.), and no, because threading is hard to do right. So I would expect a delay of years until most applications and games can make use of these CPUs, and they will most probably target the low end of core counts. If AMD keeps popularizing cheap 4-core CPUs and forces Intel to follow, we might see optimizations for 4-8 threads in most places in 5-10 years.

Why would they optimize for the lower end? Because threads are not free. For starters, take an application that does a CPU-intensive job in a single thread versus the same work split into four threads: run on a single-core CPU (no HT/SMT), the single-threaded one will finish faster. Unless you have a clever way of enabling/disabling threads and changing their number, which is difficult in some cases, you’re better off targeting the lower end (unless your application really benefits from, and requires, 100% of the CPU).

Why can’t we parallelize everything?

It depends on the task that the application is trying to perform.

Imagine your boss says you need to build a new PC from parts; because it’s time-critical, they assign a team of 20 people to help you. With this team, you are expected to have the computer built 20 times faster than a single person would.

Would that work out? I’d imagine that 20 people would create more hassle than improvement, and hence the computer would be built even more slowly than if you were assigned to do it alone. Just having to talk to everyone takes a lot of time that is not spent building the thing.

Now imagine that you’re asked to build not one but 40 computers from parts, and you’re also given a team of 5 people. Will this work? Surely, and you can even assign roles, so for example one person mounts the CPU and RAM and another prepares the case. That will get the work finished sooner than doing it alone.

A similar thing happens in programs. For some types of problems there’s not much room for threads to work cooperatively. If what you ask for is 10x the same task at the same time, it is simple to spread the load. But since we usually ask for a single task, it depends on the internals of that task.

Any kind of compression is usually hard to parallelize. This includes ZIP/RAR/GZIP compression (lossless archiving) as well as audio/video encoding (lossy). Compression basically works by avoiding repeating the same thing twice: if we already said that the previous image was black and the new one is also black, repeating that should be avoided. But the problem is, the program doesn’t know the image was black until it gets there. It’s hard for a program to send threads ahead and split this work. They still do it, but there are limitations.

The same applies to games. Calculating the next frame, or what the game does next, requires the current state. We can’t compute future states from a past state; it has to be done one step at a time. Still, games can use threads to manage different aspects: one thread might handle physics, another enemies, another lighting, etc. But there’s a limit to how many aspects you can split the work into.

There’s non-stop research in these areas to improve parallelism. Physics engines might learn how to split the computation into chunks and then merge the simulation back together. Video encoders can compute some of the work ahead of time, for example motion estimation, or split the video into blocks to encode in parallel.

Even with all of this, there are losses from parallelizing. These tasks are not perfectly isolated; they need to be coordinated and merged back, and that’s a cost that simply isn’t there if we use a single thread.

Because of these losses, some developers might choose not to parallelize further, as it would hurt performance on lower-end CPUs. If the application targets CPUs with a minimal number of cores, they might be forced to do so to keep the minimum requirements under control.

Diving into the details

Threads mostly require shared memory to communicate; if not, how do we expect to give them work and retrieve the results? Sure, there are exceptions, for example if the input comes from the network or the output goes directly to disk. But that’s atypical.

If two threads access the same piece of data, it needs to be guarded to prevent concurrent access. This is not only about data consistency (as in databases), but mainly because actual data corruption could happen. Data can be mid-write when a read happens, or it can cause divergence, where each thread sees different data for short periods.

These guards are usually mutexes, which produce a lock, both at the software and hardware level, that prevents concurrent access. If a thread tries to acquire a locked mutex it will usually block and wait.
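A minimal Rust sketch of that guard, using a counter invented for the example: the mutex makes the concurrent increments safe, at the cost of threads waiting on each other:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Shared counter guarded by a mutex so concurrent writes can't corrupt it.
    let counter = Arc::new(Mutex::new(0u64));
    let mut handles = Vec::new();
    for _ in 0..4 {
        let counter = Arc::clone(&counter);
        handles.push(thread::spawn(move || {
            for _ in 0..1_000 {
                // lock() blocks if another thread currently holds the mutex.
                *counter.lock().unwrap() += 1;
            }
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    println!("{}", *counter.lock().unwrap()); // 4000
}
```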

Mutexes are expensive in CPU terms; we’re actually burning several CPU cycles on them. Some libraries like Qt (at least the old versions) have a compile flag to enable or disable thread-safety support (mutexes), because enabling it without need incurs a performance penalty.

Threads themselves aren’t free either. Creating them requires memory and CPU, and having them challenges the OS scheduler: as the number of threads/processes increases, the number of context switches increases too, and a context switch is expensive for a CPU. This also increases pressure on the CPU cache, pushing data out of registers and the L1 and L2 caches.

You might think this is “too technical”, “it doesn’t matter that much”, “there’s not that much difference in performance”. Well, it turns out that CPU registers are the fastest thing by a wide margin, followed by the L1 cache. A program running entirely out of L1 cache can go 100x faster than a program that needs to pull data from RAM for every single instruction. Yes, RAM is super fast, but the L1 cache is like a warp drive. CPU registers are accessible within the same CPU clock cycle, while a RAM access can take around 240 cycles (depending on the CPU). So if the requested data produces a cache miss, your program stalls for 240 cycles. If that happens every single time, your program will be waiting 99.5% of the time. (This usually doesn’t happen so often, as CPUs and compilers are smart, but going all out on threads can cause similar scenarios if the access pattern looks random.)

Don’t use too many threads!

Ideally you want the same number of threads as the logical cores you have. It’s not going to get any faster if you already hit 100% of all cores with those.

Of course there are exceptions, and a good rule of thumb could be N+1 or N+2 threads. Sometimes threads need to wait or are blocked (by mutexes, say), so a few extra can help fill the gaps here and there, but the gains are minor.
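In Rust, a sketch of sizing the pool could look like this (available_parallelism is a relatively recent std addition; the +1 is just the rule of thumb from above):

```rust
use std::thread;

fn main() {
    // Size the worker pool from the logical core count instead of hard-coding it.
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let workers = cores + 1; // one extra to cover occasional blocking, as discussed above
    println!("spawning {workers} workers for {cores} logical cores");
}
```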

Having a collection of 500 threads, each one waiting on a network request, isn’t exactly efficient. First of all we might face mutex contention: with so many threads, the probability of two of them trying to access the same data at the same time rises very quickly. This is related to the birthday problem.

The second problem is OS scheduler overhead. The OS doesn’t know which thread is ready to do work right now, so it will wake threads up almost randomly. A thread will probably just wake up, check the mutex, and go back to sleep. We’ve wasted two context switches and a mutex check. This keeps repeating very fast, producing CPU heat without any useful work.

We might fool ourselves thinking “I’m using 100% of the CPU, it’s as fast as it gets”. Wrong. It’s using 100% of the CPU, but mostly hogging resources and wasting CPU cycles that could have been put to better use by another application. We’re starving the system, causing a DoS-like effect on the other programs that want to run on it.

When optimizing a program we need to think about CPU efficiency: how much work we get done per CPU core cycle. And we need the right metrics: how much actual work was done per unit of time.

Threading is actually a trade-off. We’re paying extra CPU cycles to parallelize the work and hopefully get it done faster. Don’t forget that! With every thread and every mutex, we’re spending CPU resources.

Use non-blocking calls if possible

One of the reasons to spawn hundreds of threads is that the application is doing I/O operations that block and make the thread wait: for example, reading or writing to disk, network requests, or waiting for the GPU.

In case you’re not aware, there are blocking and non-blocking calls for most of these things. A blocking call is the traditional way: you ask to read a file and the function blocks until the data is read, returning it when finished. With a non-blocking call you ask to read X amount of data and the function returns immediately, without retrieving anything. In the background the OS fills the buffers with the data being read. Then, with another function (or a callback), you get the data that has been read. This allows your thread to do other things in the meantime.
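Here’s a minimal Rust sketch of that pattern with a UDP socket from std; the address and buffer size are arbitrary for the example:

```rust
use std::io::ErrorKind;
use std::net::UdpSocket;

fn main() -> std::io::Result<()> {
    let socket = UdpSocket::bind("127.0.0.1:0")?;
    socket.set_nonblocking(true)?;

    let mut buf = [0u8; 1500];
    match socket.recv_from(&mut buf) {
        Ok((n, addr)) => println!("got {n} bytes from {addr}"),
        // Nothing has arrived yet: the call returns immediately instead of blocking,
        // so the thread is free to do other work and retry later.
        Err(e) if e.kind() == ErrorKind::WouldBlock => println!("nothing yet, do other work"),
        Err(e) => return Err(e),
    }
    Ok(())
}
```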

The downside of non-blocking calls is that in most cases the flow of the program gets broken into pieces in different places and no longer reads naturally. To solve this there’s async programming, which makes your program look like it’s running serially while under the hood it switches to other tasks. This is also known as cooperative threading or green threads.

I haven’t used async programming much myself, just a bit in Python. It’s one of the pending things I have to try in Rust. But it’s quite neat.

In zzping I just used non-blocking calls, because the design there is quite optimal for that type of approach and the code looks really neat.

You’d be surprised by the amount of work a single thread can do with cooperative/async programming.

The key thing about async programming is that task switching is done inside the application instead of by the OS and hardware. This might sound worse, but in fact it’s better. Because the application knows which tasks are ready to run, switching is fully efficient; no switch ever goes to an idle task. And there are no OS context switches at all, because it’s a single thread.

On a single-core CPU, async programming gets a lot more work done than threads in these scenarios.

The final trick is to spawn one OS thread per CPU thread, each using async programming to queue tasks. This outperforms plain threads by a wide margin and keeps the computer responsive.
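As a rough sketch of that model, here’s what it could look like with the tokio crate (this assumes tokio as a dependency and is purely illustrative, not how zzping is built); its multi-threaded runtime spawns roughly one worker thread per logical core and schedules async tasks onto them:

```rust
// Sketch only: assumes the `tokio` crate as a dependency. The multi-threaded
// runtime spawns (by default) one worker thread per logical core; each worker
// runs async tasks cooperatively, so a waiting task just yields its thread.
use std::time::Duration;

#[tokio::main(flavor = "multi_thread")]
async fn main() {
    let mut handles = Vec::new();
    for i in 0..100u64 {
        handles.push(tokio::spawn(async move {
            // Simulated I/O wait: while this task sleeps, the worker thread
            // is free to run other tasks instead of blocking.
            tokio::time::sleep(Duration::from_millis(10)).await;
            i * 2
        }));
    }
    let mut total = 0;
    for h in handles {
        total += h.await.unwrap();
    }
    println!("total = {total}");
}
```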

I was wrong to reject threads in zzping

I initially designed zzping to avoid threads entirely and use only non-blocking calls. Since I didn’t need the extra CPU performance, non-blocking calls should have sufficed to do the pings at quite high rates. I was both right and wrong at the same time.

On one hand, it was correct: I could perform hundreds of pings per second with a single thread doing all the required tasks. But when I tried to perfect it, it became clear that something was wrong.

At least with the Rust library I was using for sockets, a non-blocking recv was incurring random waits that couldn’t be predicted or accounted for. So the ping rate was inconsistent and annoying.

It seems that if something is mid-way through the network card it might block, or maybe it happens when there’s nothing left to read. I’m not sure. But the point is that this non-blocking call wasn’t exactly non-blocking.

So what I did in the end was spawn a single OS thread to take care of receiving data from the network. The main thread still takes care of sending data and computing everything else. Now the metrics are really accurate and it does exactly what I ask of it.

I don’t recall this happening to me in other programs, so it might be the Rust library that I’m using, or maybe it’s because this manages ICMP which is a bit special.

Avoid sharing memory at all costs!

A thread runs optimally when it works completely independently of others and knows almost nothing about other parts of the application. It has all the data it needs from the start, and it can spend a long time working on its own.

It’s quite easy to fall into communicating through shared memory whenever we like inside the thread, locking it and so on. The problem is that, as I said, mutexes are expensive: lots of threads accessing the memory will create contention, and things will slow down a lot.

Threads require careful design. We should envision them in a producer-consumer fashion: a queue of “tasks to do” that threads pick from, and a queue of “tasks done” where they put their results. This reduces contention to a very low level. If communication is needed mid-way, think about caching some of the data in thread-local memory for a while. Most of the time you don’t strictly need the latest value, just a fairly recent one. A cache of just a few milliseconds (even 1 ms) can deliver real benefits while avoiding work on completely stale data.
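A minimal std-only sketch of that producer-consumer shape, with made-up work; the task queue is shared behind an Arc<Mutex<…>> because std’s mpsc receiver has a single consumer:

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

fn main() {
    let (task_tx, task_rx) = mpsc::channel::<u64>();
    let (result_tx, result_rx) = mpsc::channel::<u64>();

    // std's mpsc receiver has a single consumer, so the workers share it behind a mutex.
    let task_rx = Arc::new(Mutex::new(task_rx));

    let mut workers = Vec::new();
    for _ in 0..4 {
        let task_rx = Arc::clone(&task_rx);
        let result_tx = result_tx.clone();
        workers.push(thread::spawn(move || loop {
            // Threads only touch shared state while picking a task or sending a result.
            let task = task_rx.lock().unwrap().recv();
            match task {
                Ok(n) => result_tx.send(n * n).unwrap(),
                Err(_) => break, // channel closed and empty: no more work
            }
        }));
    }
    drop(result_tx); // keep only the workers' clones so result_rx ends when they finish

    for n in 0..100 {
        task_tx.send(n).unwrap();
    }
    drop(task_tx); // close the task queue so workers exit when it drains

    let total: u64 = result_rx.iter().sum();
    for w in workers {
        w.join().unwrap();
    }
    println!("sum of squares = {total}"); // 328350
}
```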

Even with these ideas there might be contention. For example, if there are hundreds of threads and the tasks are quick, they may end up locking the queues very often, and the probability of contention rises. In these cases, consider whether a thread can work in batches: retrieve N tasks at once into local memory, accumulate the output in local memory too, and write it back to shared memory in batches.

There are other possible tricks too; for example, there are specialized queue libraries that can lock partially. The queues could also be sharded to reduce the number of threads that can access the same data, with a thread dedicated to moving data between the main queue and the shards.

Closing thoughts

Threads are hard to do right. I did not cover the problems of debugging threads and unexpected behaviors that arise from them, as it’s a bit too much for this article.

I think it’s quite clear that not all tasks fit this producer-consumer design, and some of them are hard to parallelize efficiently or at all. Expecting that all applications will use all cores at some point is naive. Threading is hard to do right, and most programmers will avoid it if possible.

Threads also consume precious CPU resources: 2x threads don’t give 2x performance in return. While in some cases they can reduce latency, they usually increase it when abused.

I hope this was useful for understanding why applications don’t use all CPU cores and the common challenges of doing it right. Let me know if I should cover anything in more detail.

Finally upgraded my computer to Ryzen 7 5800X

I waited far too long, and the performance jump is awesome. I was using an i7-920 up to last week, that is, the very first generation of i7. The leap is so big that it feels like I don’t deserve it, that I’m not making good use of it, because most of the time it sits idle.

The good thing about running Linux is that I can manage quite efficiently how the CPU is used and get the most out of it. Running a 12-year-old CPU wasn’t a problem for most tasks at all, and the system performed properly in most cases. I basically only had problems with video compression (streaming, capturing, editing) and when compiling a full Rust project from scratch. Everything else ran more or less smoothly.

After the stock problems with the 5000 series processors I had to wait even longer, and I was expecting the 5900X to get cheaper, but that didn’t happen in time, so I settled for the 5800X, which should be good enough. And it is. No regrets.

The processor arrived on Saturday morning. I spent most of the day building the machine neatly, and then, on the first boot attempt, I ran into my first problem: no POST, the motherboard stuck on the ‘0d’ code. Kind of expected; these motherboards require a BIOS update before running a 5000 series processor. Luckily for me, I was careful to buy a motherboard that can update the BIOS without a CPU. So I followed the procedure and, after a few mistakes, it finally booted. Then I had a problem with some fans ramping up too fast. It turns out I had connected them to a header meant for a pump (for water cooling), but the way I routed that cable made it almost impossible to change; I would have needed to remove the motherboard entirely. I decided to disconnect the fans from the case’s fan hub and plug them in directly, and that fixed it. Then a CPU fan wasn’t spinning; it turns out that in trying to hide the cables I had placed them too close to the fan and they were interfering. So another thing fixed.

Finally I got to the last problem: Grub2 (the bootloader) was freezing on start. I did not buy any SSD/HDD for this PC; I wanted to migrate all the drives from the old computer to the new one. Because it did nothing and didn’t let me do anything with it, it was impossible to debug. It took me hours of trial and error; reinstalling Grub did nothing. Since I noticed that a USB stick with Ubuntu launched Grub successfully, I thought that maybe installing Ubuntu would fix the Grub issue. And it did. From the new bootloader I can launch my old Debian installation without issue. The problem will come when I upgrade this Debian and it tries to write Grub again; I bet it will break.

The underlying problem is that this installation was already borked: the SSD lacks a UEFI partition, which lives on my HDD instead. This strange setup makes booting quite hard, as it needs to jump around. At some point I’ll buy a new NVMe SSD and do a clean install on it. For now I want to hold off, because I don’t need it (yet) and SSDs are getting cheaper, so when the time comes I’ll get more speed and capacity for the same money. It works now; I don’t much care about fixing it. Maximum laziness.

Configuration of this new machine

PCPartPicker Part List: https://ie.pcpartpicker.com/list/p8pfGq

  • CPU: AMD Ryzen 7 5800X 3.8 GHz 8-Core Processor
  • CPU Cooler: Noctua NH-D15 CHROMAX.BLACK 82.52 CFM CPU Cooler
  • Motherboard: Gigabyte X570 AORUS MASTER ATX AM4 Motherboard
  • Memory: Corsair Vengeance LPX 64 GB (2 x 32 GB) DDR4-3600 CL18 Memory
  • Storage: Samsung 850 EVO-Series 1 TB 2.5″ Solid State Drive (migrated)
  • Storage: Western Digital Caviar Green 2 TB 3.5″ 5400RPM (migrated)
  • Video Card: EVGA GeForce GTX 1060 6GB GAMING Video Card (migrated)
  • Case: Fractal Design Meshify S2 ATX Mid Tower Case
  • Power Supply: Corsair HX Platinum 850W 80+ Platinum Fully Modular ATX

There’s no RGB at all in this build. While it looks cool, I feel that lights are a problem overnight, and running exclusively Linux it might be hard to control some of the RGB. I don’t need the hassle. Dark is good, dark is simple.

As for the cooler, I considered water cooling, but seeing that the old build lasted 12 years with zero maintenance, and water cooling may require some, I went with air cooling. I like big fat coolers, as I want to keep noise as low as possible. I used the Noctua adapter cables to lower the fan speed. This gives me around 32ºC idling, 40ºC browsing, 70ºC on video encoding, and some peaks at 90ºC when it goes all out. All of this while making almost no sound.

For the case, I wanted something slightly big that would keep things cool. After watching Gamers Nexus, I settled on the Meshify S2, as it is one of the best cases for airflow. I’m really happy with it: not only does it keep things really cool and look really good, it’s also a pleasure to build in.

For the motherboard, I started looking at B550 boards, but I didn’t like the connectivity. The case has a front USB-C port, which I liked, and most B550 boards lack the header for it. So I went for something high-end in the X570 range. It’s still mostly unused, but over the years I can keep upgrading around it.

I’m aware that this generation might be the last one on the AM4 socket, but at some point I might still get a cheap deal on a better chip if needed. I also expect AMD to release a refresh of these CPUs, as they did with the 3000 series. But the CPU isn’t really what I expect to upgrade here; I’m more interested in the PCI Express stuff that might appear over the next years. PCIe 4.0 is opening the door to other cool things, like blazing-fast drives; currently it makes almost no sense because SSDs are not fast enough to make proper use of it, but in a few years… we’ll see. (Yes, I’m aware NVMe drives reach these speeds, but on random reads they’re just too slow to make it worthwhile.)

The power supply is a bit oversized as well, not only because it’s 850 W (I wanted a good extra margin in case I fit a stupidly big GPU), but mainly because it’s 80+ Platinum certified. I found out later that this actually removes heat from the system and makes things cooler. Hah, who would have thought: a better PSU actually makes the system produce less heat.

On memory, I was using 16 GB before and had problems because I run too many services. As I like to tinker, I end up with a lot of stuff over time (MySQL, PostgreSQL, Docker, …), and then I’d have to go back and shut some of them down… only to end up debugging a few weeks later why service X isn’t running, where it went, etc. So I thought: let’s put in so much memory that I don’t need to shut anything down! Well… yes, so far it’s using less than 48 GB even counting disk caching.

The graphics card definitely needs an upgrade, but in the current situation… I’ll have to wait. What can I do? I refuse to pay a thousand euros for a GPU.

How fast is this 5800X

The first thing I tried was building zzping-gui from scratch. This is a Rust program I’m working on. Because it has a GUI, it has way too many dependencies, and it took around 15 minutes to build on the old computer. And now? 48 seconds! I also have sccache set up, and with it it’s barely 26 seconds (down from 5 minutes). Building a small change takes 6 seconds (down from 21). Coding in Rust is now quite a pleasure.

The next thing I tried was video capture. Before, I had to limit it to Full HD (1920×1080) at 30 fps or it would skip a lot of frames. Now I can do 1440p at 60 fps, and even use slower settings in ffmpeg, at roughly 20% CPU.

Kerbal Space Program used to give me problems with big craft and with aerodynamic FX. I had to mostly disable aerodynamic FX or the game would drop to 10 fps when moving fast through the atmosphere. Not anymore. I ramped up all settings and get a smooth 60 fps no matter the craft. (Well, I haven’t tried building anything too crazy yet, but I did try craft that were problematic before and it’s smooth.)

Finally I tried some video editing. It’s quite pleasant now, and for some encodes the CPU can encode at 1x speed (taking as long as the video takes to play). At 1440p 60 fps it’s more like a 1:2 ratio (taking twice as long). I still need to figure out why most of these encoders use only 70% of the CPU. Not entirely sure it’s fixable.

One thing I noticed is that YouTube is smoother now at 1440p and 4K. Before, it was quite good, but there was a bit of stutter I couldn’t explain. Now it’s buttery smooth, nice!

The bad thing is that now I notice the GPU bottlenecks. If I run KSP fullscreen with a video playing on top (picture-in-picture), everything stutters a lot unless I drop the video to 720p or lower. Not a big issue, since for picture-in-picture I use a small frame that looks fine even at 480p. Basically the GPU can’t keep up, and I also noticed it ramps the fans up a lot, making noise (and I hate that). So by upgrading the CPU I made my GPU’s life hard.

I will need a new GPU, as I also plan to change the desk and add another monitor. It will be even harder for the 1060 to keep up with all that data. Hopefully by August things will settle down and I’ll be able to get a good deal on a Radeon 6800.

On single-thread performance

Now I realize that 99% of the time the machine is under-utilized. Compiling, video editing… it doesn’t matter. Most of the time is spent waiting for a few threads to complete. Yes, more cores would shave these times down, but not by much.

Because there are 16 logical cores (threads), when they’re all used they finish in record time, and what’s left are a few tasks that cannot be parallelized further.

This means that for better performance, CPUs need to get better at single-thread work. That’s why I returned the 3800X I bought by mistake and waited for the 5800X to be available. The ~20% jump in single-thread performance actually matters a lot.

It’s not about the performance of a single thread in a single core, don’t get me wrong. It’s about how fast 4-6 threads can go.

A 10% increase in single-thread performance affects 100% of the workloads we run on the computer (unless the combined load exceeds the TDP). A 50% increase in multi-thread performance from extra cores seems to affect only about 5% of the waiting time (if the CPU being compared against already has at least 8 cores). Unless you do a lot of video encoding and your programs are fully optimized, multi-core performance is becoming less and less important given the number of cores most people already have. Adding more is not going to bring that much benefit.

It seems to me that the 5800X is the sweet spot for intensive use by hobbyists like me: enough cores for video encoding and relatively fast compiles, yet with quite fast single-thread performance for day-to-day use.

Sure, the 5950X would encode way faster. But am I uploading 4K videos? No. Am I uploading hour-long ones? Nope. So would it benefit me? Nope.

In contrast, if a new generation of CPUs gains another 20% in single-thread performance, that makes all programs run 20% faster. It won’t be worth upgrading from a 5800X for one generation, but over 3 generations this compounds to about 73% faster (1.2³ ≈ 1.73) and is definitely worth the upgrade. At a rate of 2 years per generation, that’s 6 years away.

If you have an old computer but can hold on for another 2 years (unlike me, who was holding onto a 12-year-old computer), it might be worth waiting to see if DDR5 and the AM5 socket appear. That will give you an upgrade path for a long time. AM4 has already lived 4 years; if AM5 has the same run, you could potentially upgrade the CPU one or two generations later.

For those who want to upgrade “now”, let me again recommend the 5600X. It’s an awesome CPU, the main difference being that it’s about 30% slower at video encoding. For streaming there shouldn’t be any difference, and waiting 13 minutes instead of 10 for a video to encode isn’t going to matter at all.

And that’s it… for now. I’ll be back with more adventures on this new computer.