Sedice – Adding FTS with PostgreSQL was really easy

Last weekend I added an internal search to my forum using Postgres Full Text Search, and it was much easier than I expected. But let me tell you the full story in short.

This project started in 2005 as a PHP Nuke site running over MySQL. As you can guess, by now everything on that site is really, really outdated. I never had the time or people to replace it properly, but that's a longer story for another post. It grew quickly and its built-in search became really slow; we kept upgrading to bigger servers and it was never enough. So we disabled it and replaced it with a custom Google Search. Until now.

Over the last few months, users of the site started complaining that they could no longer find old posts with that search. The reason seems to be that the site is not widely used anymore, and Google no longer thinks it's worth indexing the whole thing.

To fix this I can't just re-enable the old search, as it would be devastating for performance. Also, in recent years I downgraded the server to a really cheap one: less than 10 euros per month.

The only way out is to build a proper FTS service, and I thought it could take me weeks, as I'm out of practice: I had only built something like it twice before, both times with PostgreSQL.

I had been really busy with other stuff and didn't want to start on this only to leave it unfinished. I have some time now, so I decided to give it a try.

First thing, I researched which database to use. Reusing the same database (MySQL) would be handy, but it turns out MySQL is known for not having good FTS support. So it wasn't an option.

I searched the web for the best options for FTS, but all I got was ElasticSearch and Lucene, with barely any mention of alternatives. It's kind of sad, because ElasticSearch is built on top of Lucene, so in practice we have only one recommended product.

A bit more research revealed that they are based on Java, so they weren't an option for me either. Java is going to take at least four times more CPU and several times more memory than a database written in C. I don't have any JVM installed on my tiny server, and I don't particularly like the idea of installing one just for this.

So I went back to my dear PostgreSQL. Maybe it's not the best for FTS, but it has good support, it's fast, and it is very friendly with memory usage.

My next question was whether the data and indexes would fit on the server, but to answer that I needed to export a sample and run some tests.

I started by spinning up PostgreSQL 11 on my home computer and installing the MySQL foreign data wrapper (FDW). If FDW is new to you, let me tell you it is something really neat: a way to connect PostgreSQL directly to other databases and run cross-database queries. Of course these are not as performant as having everything in one database, but they're really good for data import/export, and for some fine-tuned queries as well.

Just by using the samples on the GitHub main page I was able to set up foreign tables for topics, posts and posts_text (Nuke has a 1-1 relationship to separate text from posts). As querying those is slow, I created materialized views over them, with indexes.
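Those samples boil down to something like the following. This is a sketch, not my exact setup: the host, credentials, and the abridged column list are placeholders; the real PHP Nuke tables have more columns.

```sql
CREATE EXTENSION mysql_fdw;

CREATE SERVER mysql_svr
  FOREIGN DATA WRAPPER mysql_fdw
  OPTIONS (host '127.0.0.1', port '3306');

CREATE USER MAPPING FOR postgres
  SERVER mysql_svr
  OPTIONS (username 'forum', password 'secret');

-- column list abridged for illustration
CREATE FOREIGN TABLE ft_nuke_bbposts (
  post_id   int,
  topic_id  int,
  poster_id int
) SERVER mysql_svr
  OPTIONS (dbname 'nuke', table_name 'nuke_bbposts');
```

After this, `SELECT * FROM ft_nuke_bbposts` transparently queries the MySQL table from PostgreSQL.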

Materialized views are a mixture between a view and a table. They are stored like tables, but the data they contain comes from a view definition and can be refreshed at will:

CREATE MATERIALIZED VIEW mv_nuke_bbposts AS SELECT * FROM ft_nuke_bbposts;
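Refreshing later is a single statement that re-runs the stored query and replaces the contents:

```sql
REFRESH MATERIALIZED VIEW mv_nuke_bbposts;

-- or, to keep serving readers during the refresh
-- (requires a unique index on the materialized view):
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_nuke_bbposts;
```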

This worked really nicely and I could start playing around a bit, but it became clear that for posts_text I was missing around 30% of the data. It seems the MySQL FDW has a limit (or a bug) on how much data can be transferred: once it hits roughly 500 MB, it stops consuming data. So for that table I resorted to a rudimentary approach:

CREATE TABLE tmp_nuke_bbposts_text AS
  SELECT * FROM ft_nuke_bbposts_text WHERE id < 300000;
INSERT INTO tmp_nuke_bbposts_text
  SELECT * FROM ft_nuke_bbposts_text
   WHERE id BETWEEN 300000 AND 600000;
INSERT INTO tmp_nuke_bbposts_text
  SELECT * FROM ft_nuke_bbposts_text
   WHERE id BETWEEN 600001 AND 900000;
INSERT INTO tmp_nuke_bbposts_text
  SELECT * FROM ft_nuke_bbposts_text
   WHERE id BETWEEN 900001 AND 1200000;
INSERT INTO tmp_nuke_bbposts_text
  SELECT * FROM ft_nuke_bbposts_text
   WHERE id BETWEEN 1200001 AND 1500000;

This worked and I could play around with it for a while. The original text table is 1 GB in MySQL but only 440 MB in PostgreSQL. So far so good.

The next step was reading the PostgreSQL FTS docs and applying those recipes here.

I decided to create a GIN index over the text column, parsed into a tsvector:

CREATE INDEX ON tmp_nuke_bbposts_text 
    USING GIN (to_tsvector('spanish', post_text));

This created a 180 MB index, and I could search it really quickly:

SELECT post_id FROM tmp_nuke_bbposts_text
 WHERE to_tsvector('spanish', post_text) @@ plainto_tsquery('search term')

This took between 0.2 ms and 6 ms depending on the number of hits, which is really awesome for searching 1 GB of text across 1.5 million entries. The problem is, we only get "hits" from this, not any particular order.

For a search to be most useful we need ranking. PostgreSQL has ts_rank and ts_rank_cd for that purpose. The first one looks at the number of hits in the document, while the second also accounts for the positions of the words: how close the matches are to each other, as in the original query string.
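In query form, ranking is just one more expression in the SELECT plus an ORDER BY. A sketch against the same test table as above:

```sql
SELECT post_id,
       ts_rank_cd(to_tsvector('spanish', post_text),
                  plainto_tsquery('spanish', 'search term')) AS rank
  FROM tmp_nuke_bbposts_text
 WHERE to_tsvector('spanish', post_text)
       @@ plainto_tsquery('spanish', 'search term')
 ORDER BY rank DESC
 LIMIT 10;
```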

When I added this, performance dropped significantly, to 600 ms for a single query. This is because PostgreSQL has to convert the text to a tsvector on the fly. While in theory it could reuse the value stored in the index, that might not be possible with a GIN index. So it becomes critical to store the conversion in its own column.
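One way to do that (the column name tsv and the one-shot UPDATE are my choices for illustration; the post doesn't show the exact statements):

```sql
ALTER TABLE tmp_nuke_bbposts_text ADD COLUMN tsv tsvector;

UPDATE tmp_nuke_bbposts_text
   SET tsv = to_tsvector('spanish', post_text);

CREATE INDEX ON tmp_nuke_bbposts_text USING GIN (tsv);

-- ranking now reads the stored vector instead of recomputing it:
-- ts_rank_cd(tsv, plainto_tsquery('spanish', 'search term'))
```

On a live table you would also want a trigger (e.g. the built-in tsvector_update_trigger) to keep the column fresh on writes.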

After creating the new column, plus joining the select with posts and topics to get an idea of the finished product, the search took between 6 ms and 15 ms, which is really good.

I also considered using GiST indexes instead of GIN, but after a quick read of the docs it seems that GIN is faster for reads and slower for updates, while GiST has faster updates but slower reads. While in some scenarios faster updates are desirable, in our case we are building something read-heavy and don't care that much about update cost. And the cost wasn't bad at all: creating the index over the full data set usually took less than a minute on a single thread.

Still, I wasn't sure this approach was the best one, so I tried another design before settling on anything. How would it perform if I indexed by thread instead of by post text?

As I already had the tsvector built for the post_text table, I concatenated the vectors together grouped by thread_id and created a new table. PostgreSQL can concatenate tsvector values but does not have aggregate functions for them, so I couldn't group them easily without writing my own. I was lazy, and this is just a test, so instead I converted each tsvector into an array, unnested it into rows, then grouped and aggregated back into an array, and finally back into a tsvector. This loses the position information, and also compacts duplicate words into one:

CREATE TABLE fts_topics AS
SELECT x.topic_id,
       array_to_tsvector(array_agg(x.term)) AS vector
  FROM (
        SELECT topic_id, unnest(tsvector_to_array(vector)) AS term
          FROM fts_posts
         WHERE topic_id IS NOT NULL
       ) x
 WHERE x.term ~ '^[a-z]{3,30}$'
 GROUP BY x.topic_id;

Array support in PostgreSQL is really neat and saved me lots of time. I also removed some non-useful terms in the process (the regex filter above). This is something I was planning for later anyway: exporting the posts from Python with proper parsing. For now, this is more than good enough to test the idea.

This table turned out to be only 22 MB, with a 56 MB index. I could get results from it in 0.2 ms, which was impressive, but the ranking functions weren't useful anymore: since I was storing neither positions nor repetitions, every result ranked the same. I also tested joining these results with the topics data and found the timings really similar to the ones from indexing posts_text. And here's the problem: if I'm not gaining any time from this, and I still need to find the actual message, which will take more space and time, then this design does not seem to be a good idea. So I went with indexing post text directly.

Experimenting a bit more, I noticed I was still missing one big, important piece of a proper search: a snippet of the post in each result, showing and highlighting where the text was found, so the user can see at a glance whether that is what they were searching for. PostgreSQL has a neat function for this called ts_headline, which even outputs <b> tags in an HTML-like fashion, very convenient for use on any website.

So I wrote a huge SQL query that, for a particular search string, outputs bbcode for PHP Nuke. The reason is, I wanted to show users how this could look in the final product by posting it as a forum message, without actually building it. Here is the monster:

SELECT array_to_string(array_agg(t2), '                       
') FROM (
SELECT '- - - [b]#' || (row_number() over ()) || ' (' || rank || '%) ' || '[url=' || || '#' || || ']' || CASE WHEN LENGTH(k.subject) > 2 THEN k.subject ELSE t.topic_title END || '[/url][/b]
' || REPLACE(REPLACE(REPLACE(k.head, '<', '['), '>', ']'),'
', ' ') as t2
FROM (
SELECT (ts_rank_cd(vector,
  plainto_tsquery('spanish', t.text1))*100)::numeric(8,2) as rank,
  ts_headline('spanish', text_out, plainto_tsquery('spanish', t.text1), 'MaxFragments=1,MaxWords=100,MinWords=50') as head, *
    FROM fts_docs,
    (SELECT 'cazador de sueños' as text1) t
    WHERE vector @@ plainto_tsquery('spanish', t.text1)
LIMIT 5) k
INNER JOIN mv_nuke_bbtopics t ON t.topic_id = k.topic) j

Really ugly, but it got the job done, and I could see for myself how it would look in the finished product.

Now it was clear that this is really possible and should be easy to deploy. The last challenge was to index the data properly on the server. For this I used my backend code (a Flask+Angular experiment for replacing the old site) to write a command that reads posts one by one and dumps them into PostgreSQL after processing. Having SQLAlchemy already there helped a lot with this task.

The post-processing was slow for my taste, around 1000 rows per second even after some optimization. I can't do much more, because for each row I have to conditionally process bbcode, HTML and emoticons; convert it to HTML; and then parse that HTML to retrieve only the text part. Maybe in programming languages other than Python this could be faster, but this is a one-time task; afterwards it will run only for updates from time to time, so it is fine. I ended up adding a trick: each process parses only a fraction of the rows by filtering on "post_id % 8 == N", so locally I could start eight processes in parallel, use all my resources, and not wait that long.
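The partitioning trick itself is trivial. Here is a sketch of the selection logic in Python; the function name and the id list are mine for illustration (the real command applied the filter in its query):

```python
NUM_WORKERS = 8  # one process per local core

def worker_post_ids(post_ids, worker_n, num_workers=NUM_WORKERS):
    """Select the posts one worker processes: post_id % 8 == N."""
    return [pid for pid in post_ids if pid % num_workers == worker_n]

# every post lands in exactly one worker's partition,
# so the eight processes together cover everything once
all_ids = list(range(1, 101))
partitions = [worker_post_ids(all_ids, n) for n in range(NUM_WORKERS)]
assert sorted(pid for part in partitions for pid in part) == all_ids
```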

I decided to have only a single table in PostgreSQL with almost all the data in it. The only remaining question was the thread subject; I finally decided not to include it, as it is difficult to track thread edits while it is easy to track post edits. This incurs one MySQL database call per result, which is not ideal, but should be within my performance margins.

This table ended up having 810 MB of data plus 360 MB of index. I redid the query for the new table and, after checking that performance was still within my expected values, I moved the code to the server to start indexing there. On the server I didn't want any mess, so I disabled the parallel load and just went out for beers while the load took place. At 1000 rows/s it should have been done in 30-50 minutes, but I never checked.

When I got back I saw that it was almost done. All I needed was a web page that connects to both databases and performs the query; that's it. While tempted to add this to my Angular site, I was discouraged by the fact that I had forgotten to add proper support for post lookup. It has some, but it redirects to the last post of the thread instead, and it still lacks paging. For the time being, pointing to the old site was the only reliable option.

So I went and created a damn simple PHP page, using the old site as a starting point: just copying the template over and using the CDN. A bit of markup and logic, and voilà, done!

It has been a great weekend for coding and learning new stuff. PostgreSQL makes things easy and I’m always grateful to the project.

A few notes if you want to try this:

  • to_tsvector defaults to the database language if you don't pass a dictionary. While this is neat, it makes the function only "STABLE" rather than "IMMUTABLE", so it can't be used in an index expression. Use "to_tsvector('spanish', col1)", as the two-argument form is "IMMUTABLE".
  • PostgreSQL knows the concepts of stop words (words that are meaningless for the search because they're too common), lexemes (root parts of a word that convey the same meaning), synonyms and thesauri. They come preconfigured for each language and are really powerful without further work, but they can also be tuned, and that improves both search speed and result quality.
  • PostgreSQL FTS does not have fuzzy search built in. It uses lexemes, and that does help, but if a user mistypes one letter it can return zero results. You can get around this by building a similarity search with pg_trgm. It's not hard to do, just more work. I skipped this step in this project, but I have used it other times.
  • You can also assign one of four weights, from A to D, to different parts of the text. This is useful to give titles more weight than content. It wasn't a good fit in this particular case.
  • It does lack other cool features like attributes; you can "emulate" them by pushing manually crafted lexemes that can't exist on their own, like "author:deavid". It needs some manual effort. Some of those things can also be handled with regular column searches.
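For the fuzzy-search point above, a pg_trgm similarity lookup is only a few lines. A sketch, assuming a hypothetical one-column words table holding the distinct indexed terms:

```sql
CREATE EXTENSION pg_trgm;

-- "words" is a hypothetical table you'd maintain yourself
CREATE INDEX ON words USING GIN (term gin_trgm_ops);

-- '%' is pg_trgm's similarity operator; it matches despite the typo
SELECT term
  FROM words
 WHERE term % 'cazadro'
 ORDER BY similarity(term, 'cazadro') DESC
 LIMIT 5;
```

The closest real terms come back first, so you can suggest "did you mean…?" corrections before running the actual FTS query.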

Overall, working with PostgreSQL Full Text Search is a great experience with really good results. If your project already uses PostgreSQL for its data, consider using its FTS support before adding an external database. And if you are short on hardware like me, PostgreSQL is a great choice.

Why Java is faster than Python

And why C is faster than Java.

Among the things you'll hear, there is the classic "there are no faster or slower languages; it depends on what you want to use them for, and some languages are a better fit than others". Part of this is true (some languages are a better fit for some tasks than others), but the other part is false. There are faster and slower languages. That is a fact.

The other thing I hear is "Java is a compiled language and Python is interpreted, therefore Java is much faster". Also false. Java is as interpreted as Python, or Python as compiled as Java. Both languages compile (or transpile) to bytecode, and an interpreter then executes those instructions.
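This is easy to see with CPython's standard dis module, which prints the bytecode a function was compiled to:

```python
import dis

def add(a, b):
    return a + b

# add() was compiled to bytecode the moment it was defined;
# the raw instructions live on the function's code object
print(type(add.__code__.co_code))  # <class 'bytes'>

# dis prints those instructions in readable form
dis.dis(add)
```

On Python 3.11+ the addition shows up as a generic BINARY_OP instruction, on older versions as BINARY_ADD; either way it is bytecode being fed to an interpreter loop, much like what javap shows for a Java class file.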

But Java has a JIT! Well, PyPy also has a JIT. So what?

And Java is not older than Python either; they're roughly the same age.

Regardless of these typical comments and counter-arguments, the fact is that Java is roughly 4x slower than C, and Python roughly 40x slower than Java (more or less; it depends greatly on the benchmark). (And good old C doesn't even have a JIT, hah!)

So, why is Java faster than Python?

It's simply because Java puts far more work and responsibility on the developer than Python does. It is a trade-off between computer performance and developer performance. How is this possible?

First, we have to understand what compilers do and how optimizations work. The compiler serves basically one fundamental purpose, and it is not generating an executable or bytecode. The purpose is to do as much of the work as possible ahead of time. As a side effect, it writes out an executable or bytecode that can be read later to follow the instructions. To be clear: the compiler's job is to remove complexity and uncertainty from the source code and write an output that is as dumb and straightforward as possible, so it can be followed blindly. The more you remove up front, the faster the result runs later.

What kind of stuff can we remove or simplify ahead of time? Well, the first candidate is the parsing stage: parsing a source file takes a lot of time, so the most basic compiler would read the code and output a binary abstract syntax tree that can be loaded into memory really quickly. Next would come doing simple math ahead of time, and removing dead code.

From here it gets tricky. For the CPU to handle the instructions directly, we need to know ahead of time what we're doing: data types, sizes, expected result types, and so on. Depending on the language this may be possible, but it is not the case for Python. It has far too much abstraction and craziness built in for us to ever know what to expect at a particular point in the program. Every variable, even if it looks simple, can be replaced by something different through mocking or other weird techniques. So the only way around it is to wrap Python values in a complex structure that can track all those changes. In doing so, we lose runtime performance in exchange for a friendlier (and crazier) language.

Java has static typing, and this helps a lot in translating all instructions into real CPU instructions. But for that step to happen we need Just In Time compilation (JIT); without it, we would still be feeding the instructions one by one through the interpreter.

But C is faster still. Its trick is moving even more of the burden from the language to the developer. We not only require the developer to type everything and define every behaviour ahead of time; we also make the developer responsible for memory access and program behaviour. So in C, if a program crashes unexpectedly, it is never C's fault; it is always the developer's responsibility to check everything.

In C this is called "Undefined Behaviour", and it describes the grey areas where the C compiler simply doesn't care: it assumes everything is fine and optimizes as much as possible. The C compiler can even replace function calls with their expected results in the final executable. It can also decide to "inline" a function into its caller, because it believes the result is the same, and faster.

So, in short, the flexibility of a language and its "auto-magic" carry huge performance trade-offs. Writing a Java interpreter or compiler that is faster than Python should be easy. But writing a Python interpreter fast enough to be compared with Java is almost impossible.

I want to add a special mention of JavaScript, probably the most hated language lately. Being a bit less auto-magic than Python, and benefiting from huge efforts to build faster JS engines for browsers, JavaScript has a JIT and all sorts of optimizations, leaving us with one of the fastest interpreted languages in existence.

JavaScript lacks thread support because browsers don't want scripts messing with them, for security and stability reasons. Still, if you account for the fact that in most of the benchmarks I mentioned earlier Java was using all CPUs and JavaScript only one, JavaScript's performance is somewhat close to Java's, which is impressive.

Still, I would not (yet) recommend JavaScript for server-side applications. Nonetheless, it is an interesting outcome of being the only standard scripting language on the web.