How HTTP/2 makes your site faster

In Spanish – Cómo HTTP/2 acelera tu sitio web

Most people have never heard of HTTP/2. It was published in 2015 and went almost unnoticed. Three years later, it seems like the vast majority of websites are still using HTTP/1.1.

This new version of the HTTP protocol comes with lots of new features, almost all of them aimed at improving the perceived performance for the visitors of any website. Configuring the web server for HTTP/2 is simple, as both Apache and Nginx have supported it for quite a while. Any browser that does not support it will gracefully fall back to HTTP/1.1 without any problem or performance penalty.

The main obstacle that keeps people from using HTTP/2 is that, in practice, it only works over SSL encryption, that is, HTTPS. Almost all browsers have decided they will not enable HTTP/2 unless the connection goes through HTTPS.

Using SSL nowadays is easier than ever. HTTPS certificates usually have to be paid for annually, but there is an initiative called Let's Encrypt that issues basic HTTPS certificates completely for free, and they work in every browser. They have to be renewed every three months or less, and we do have to prove that we control the domain and the web server, but Let's Encrypt provides scripts that automate this and make the task really easy.

Performance Improvements on HTTP/2

The main difference from the previous version is unlimited parallelism. Until now, browsers fetched only a handful of URLs at a time (around four) and had to wait for one of those requests to finish before starting the next one.

It works this way because in HTTP/1.1 each request gets its own TCP connection. Keep-Alive helps by reusing the connection once a request ends, but the browser still has to wait for it to finish before asking for a new resource. To avoid denial-of-service attacks, it was established that browsers should limit the number of concurrent connections.

Every TCP connection fights to get as much bandwidth as possible. Network cards, switches and routers distribute the bandwidth among the different connections. If an application opens 2000 connections, in total it gets roughly 2000 times the priority of a program that makes a single request at a time. This is considered bad practice and it tends to saturate the network. Some of you probably remember eMule or BitTorrent saturating the whole network whenever someone ran them. That is because P2P applications use thousands of concurrent connections, which generates a huge network load and consumes resources unfairly.

To work around this limit on concurrent connections, those of us who build websites were more or less forced to bundle things as much as possible, so we used as few connections as we could. This goes against the HTTP standards, because caching policies now apply to the whole bundle: whenever you change a single line of CSS or one image inside it, the whole bundle has to be transmitted again to every user.

HTTP/2 uses a single TCP connection regardless of how many concurrent requests it handles, so it allows virtually unlimited parallel downloads. Browsers have taken advantage of this and lift the concurrency limit once they know they are on HTTP/2, so in practice we see around one hundred requests being handled concurrently.

So bundling has become less important, although it still plays a role. We can go back to the old days when every small file was served independently, without having to give up performance, although bundling still brings a small benefit even on HTTP/2.

If your website already bundles CSS, JavaScript and/or images, HTTP/2 still brings noticeable performance benefits, because even with bundling the browser usually has to handle several concurrent requests.

Other benefits of HTTP/2

Aside from request concurrency, HTTP/2 is no longer a text protocol but a binary one. This reduces the CPU spent parsing HTTP data compared with text mode. It also avoids retransmitting redundant information across requests, so less bandwidth is wasted on HTTP messages with no content. The TCP connection stays open for long periods, so subsequent clicks on the website reuse the same connection, leading to quicker responses.

It also supports content prioritization. The browser can ask for resources with different priorities: say it knows that a CSS file is blocking the page from rendering, but it also has images to load that are not blocking, just nice to have. The browser can issue the image requests with low priority and the CSS with high priority. The web server should honour this and send the CSS first if possible; but if that CSS has to wait for a backend (PHP, Python, …), the server may decide to send the images in the meantime.

They also added a header compression algorithm called HPACK. HTTP headers are fairly bulky compared with small payloads, and with this feature sending responses without a body (like 304 Not Modified) becomes far more efficient.

And last, there is the new “Server Push”. It allows a web server to send data the browser has not requested yet, alongside a particular response. This is helpful when the browser asks for a page for the first time: the server can start sending the required JavaScript and CSS before the client has even received the HTML and worked out what it needs. By the time the browser finishes parsing the HTML and figures out which additional resources are required, with an appropriate Server Push those resources should already be in the browser. This could shave maybe a couple of seconds off the initial load.
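
For example, one common way to trigger a push is to have the backend emit a Link: rel=preload header and let an HTTP/2-aware front end such as nginx (with http2_push_preload enabled) convert it into an actual push. A minimal sketch in TypeScript on Node.js, with hypothetical asset paths:

import { createServer } from "http";

createServer((req, res) => {
  if (req.url === "/") {
    // Hint the critical assets before the browser has even parsed the HTML;
    // an HTTP/2-aware front end can turn these hints into a Server Push.
    res.setHeader("Link", [
      "</static/app.css>; rel=preload; as=style",
      "</static/app.js>; rel=preload; as=script",
    ]);
    res.setHeader("Content-Type", "text/html");
    res.end("<html><head><link rel=\"stylesheet\" href=\"/static/app.css\"></head><body>...</body></html>");
  } else {
    res.statusCode = 404;
    res.end();
  }
}).listen(8080);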

I have been trying out HTTP/2 for a while now with great results. The only complex part was the SSL certificate, which is a bit cumbersome to configure the first time.

So, what do you think? Are you using it already?

Properly Securing Web Tokens

For those following my Spanish post series on how to build a blazing-fast website, you already know that I am researching the use of JSON Web Tokens (JWT) or something similar to allow faster authorization of resources, which could even be cacheable.

Using Angular 6, I get a handful of tokens (which are not JWTs but plain text) that allow the browser to get authorization for different types of resources without needing to reveal which particular user is requesting access. Those tokens are only used on GET requests for retrieving data. All data-changing requests such as POST or PATCH use a token that also authenticates who the user is. All these tokens are short-lived: currently 1-2 hours, but depending on security concerns this could be reduced.
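
To make the idea concrete, the bundle of tokens could look roughly like this; the field names are placeholders, not the actual implementation:

// Hypothetical shape of the token bundle; names are illustrative only.
interface TokenBundle {
  // Anonymous read tokens, one per resource type, used only on GET requests.
  readTokens: { [resourceType: string]: string };
  // Short-lived token that also identifies the user, for POST/PATCH requests.
  writeToken: string;
  // Expiry as a Unix timestamp; currently 1-2 hours in the future.
  expiresAt: number;
}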

To renew these tokens we need a “master” token that is long-lived, which effectively turns it into the classic session cookie. I have more security concerns about this master token in particular, not only because it lives for weeks, but also because it can renew all the tokens, including itself, indefinitely.

If this token is ever copied by an attacker, they could keep a session open forever as long as they remember to refresh it every few days. And most actions that can be done with a username and password can also be done with this token.

Some sites add an extra security measure and ask for your password when you want to do “dangerous things”, for example changing your password or other sensitive data in your account settings. Another common measure is to store the IP address, browser and some cookies in order to track the user. This tracking, which is mostly useful for ads and selling user data, also becomes useful for detecting unexpected access.

I don't like the idea of tracking users. The protection it gives is good but still not flawless, it can cause headaches for the user in certain scenarios, and implementing it carries a significant cost. I wanted my site to be quick to implement and not to require a big server, so for me this approach is out.

My approach is to softly track the life of the master token. The master token should never be copied, and we can borrow a simple technique from anti-piracy and DRM to detect unwanted copies. It goes like this: when the user logs in with a password, we issue the master token with serial number #001. When the user refreshes the tokens, we issue a new master token with serial number #002, and each refresh increases the number. We store the last serial number on the server, and it must match the one in the incoming request.

With this, if an attacker copies the master token and tries to get the authorization tokens (or renew), they will consume a new serial number, so eventually someone will request a renewal with an old serial number. When that happens we can reject the request, and because we don't know whether it came from the user or the attacker, we should also revoke the currently valid token. The result is that both the user and the attacker get kicked out of their sessions and are asked for a password again. Inconvenient for the user, but the user knows their password and the attacker (hopefully) does not. Also, if you keep getting kicked out of your account, that should raise some alarms, or at least the user will contact the website support for help.

As I said earlier, this is a very soft way of tracking the user, since we don't store any personal information. The downside is that the user can only be logged in on one device at a time: logging in from another device will effectively kick you out of every other session.

To fix this, the master token also carries a “batch number”, a random string created when the user logs in with a password. This lets us track the different valid sessions and keep a separate serial number for each batch. With this implementation, logging in on two devices does not log either of them out, and we still protect against token copying.
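
A minimal sketch of how the server-side check could look; all the names here (MasterToken, lastSerialByBatch, the helper functions) are hypothetical, it only illustrates the logic described above:

interface MasterToken {
  userId: string;
  batch: string;   // random string created at password login
  serial: number;  // increases by one on every refresh
}

// Last serial number seen per (user, batch), kept on the server.
const lastSerialByBatch = new Map<string, number>();

function loginWithPassword(userId: string): MasterToken {
  // In practice the batch should come from a cryptographic RNG.
  const token: MasterToken = { userId, batch: Math.random().toString(36).slice(2), serial: 1 };
  lastSerialByBatch.set(`${userId}:${token.batch}`, token.serial);
  return token;
}

function refreshMasterToken(token: MasterToken): MasterToken | null {
  const key = `${token.userId}:${token.batch}`;
  const expected = lastSerialByBatch.get(key);

  if (expected === undefined || token.serial !== expected) {
    // The token was copied (or is stale): forget the whole batch, so both the
    // legitimate user and the attacker get kicked out and must log in again.
    lastSerialByBatch.delete(key);
    return null;
  }

  // Valid refresh: bump the serial and issue a new master token for this batch.
  const next: MasterToken = { ...token, serial: token.serial + 1 };
  lastSerialByBatch.set(key, next.serial);
  return next;
}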

If you think the attacker could simply guess the next serial number and send it, that's not possible. The tokens are signed by the server, and even if the attacker forges the next number they would have to redo the signature, otherwise the server will mark the token as invalid. In my case the signature is an HMAC using SHA2-224 and a long random secret stored on the server. Even a single-bit change produces a completely different signature, and reversing HMAC-SHA224 is practically impossible. I could even implement automatic rotation of the secret to enhance security, but maybe (just maybe) that is overkill.
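
For reference, signing and verifying with HMAC-SHA224 is essentially a one-liner with Node's built-in crypto module. A sketch, assuming the secret lives in an environment variable and the token has the form "payload.signature":

import { createHmac, timingSafeEqual } from "crypto";

const SERVER_SECRET = process.env.SERVER_SECRET || "a-long-random-secret-stored-on-the-server";

function sign(payload: string): string {
  // HMAC-SHA224 over the payload; a single flipped bit in the payload
  // produces a completely different signature.
  const mac = createHmac("sha224", SERVER_SECRET).update(payload).digest("hex");
  return `${payload}.${mac}`;
}

function verify(token: string): string | null {
  const idx = token.lastIndexOf(".");
  if (idx < 0) return null;
  const payload = token.slice(0, idx);
  const mac = Buffer.from(token.slice(idx + 1));
  const expected = Buffer.from(createHmac("sha224", SERVER_SECRET).update(payload).digest("hex"));
  // Constant-time comparison so the check does not leak timing information.
  if (mac.length !== expected.length || !timingSafeEqual(mac, expected)) return null;
  return payload;
}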

Then I got to the point of storing the tokens in the client's browser. I started with in-memory storage using a BehaviorSubject from RxJS. It seemed logical, since most tokens have to be used constantly on almost every request. But whenever I refreshed the page my credentials were lost and I had to log in again and again. Expected, of course, but annoying. So I had to think about where to store them permanently, so they survive not only page refreshes but also closing the browser and reopening the page the following day, when I would expect to still be logged in.
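
The in-memory store is essentially a BehaviorSubject holding the current bundle of tokens; a simplified sketch (the service and field names are mine, not the actual code):

import { BehaviorSubject } from "rxjs";

// Hypothetical shape: anonymous read tokens plus the user-authenticating write token.
interface TokenBundle {
  readTokens: Record<string, string>;
  writeToken: string | null;
}

const emptyBundle: TokenBundle = { readTokens: {}, writeToken: null };

export class TokenStore {
  // Holds the current tokens in memory only; everything is lost on page refresh.
  private readonly tokens$ = new BehaviorSubject<TokenBundle>(emptyBundle);

  // Interceptors/services subscribe to this to attach tokens to outgoing requests.
  readonly changes = this.tokens$.asObservable();

  set(bundle: TokenBundle): void {
    this.tokens$.next(bundle);
  }

  current(): TokenBundle {
    return this.tokens$.value;
  }

  clear(): void {
    this.tokens$.next(emptyBundle);
  }
}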

The first idea that came to mind was LocalStorage. Why not? It's simple: I can store a JSON-ified version of the contents of my BehaviorSubject there, and on reload I only need to decode the JSON, use it as the initial value of the BehaviorSubject, and I'm ready to go.

But wait a second… who can access that LocalStorage? I got concerned again about the likelihood of a third party messing around and getting this precious, sensitive data. So I searched for LocalStorage security and found tons of articles advising against using it for sensitive data. The reason is that any JavaScript running on my page has access to it, regardless of which domain it was imported from. This is a problem if you import JavaScript from other domains, as Google Analytics requires. Maybe that's not my case, but I was still concerned. Also, if an XSS attack manages to execute JavaScript on the page, it would be easy for it to grab the master token. We shouldn't have XSS vulnerabilities in the first place, but having several layers of security and not depending on a single one is what makes things actually secure.

This made me think hard about the proper way to store this data. Cookies! Everyone recommends cookies for this kind of data. But I don't want cookies, because they are sent on every request, effectively defeating most caching systems, and they also enable CSRF attacks. Yeah, sure, most frameworks include CSRF protections, which usually are… more cookies! Whatever. Simple is beautiful, and I still think this can be accomplished in an elegant way.

Cookies have a particular option called HttpOnly which is very interesting: it hides the cookie from JavaScript entirely, so no XSS attack can ever read its value, because it is simply never available to scripts. To push this scheme further, I believe the master token should not even be in memory in Angular, so XSS attacks cannot inspect the memory and find the token there.

I have an endpoint called /api/login which handles both password logins and token renewals. What I want to do (and I think it is the best option available) is: send all the short-lived tokens in the response body, so Angular can cache them in memory and use them for GET calls. For POST and PATCH I will also use a short-lived token, but a different one that authenticates the user. Those tokens will be sent in a header set manually by Angular, so no CSRF is possible. The master token will be sent in the same response, but as an HttpOnly cookie, and Angular will never handle it in any way.

When the user closes the browser and opens the site the next day, Angular should blindly call /api/login without parameters, just in case there is a hidden cookie that lets us get the tokens again. So instead of storing them locally, we trust the server to “store” them for us. (Actually they are not stored but generated each time, in a way that produces the same response if the token is recent enough; this only applies to the short-lived tokens, as the master token does need to be stored, as explained before.)
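
The startup call can be a blind POST to /api/login with withCredentials enabled, so the browser attaches the hidden cookie if it exists. A sketch of that service (the wiring around it is hypothetical; only the endpoint comes from the post):

import { HttpClient } from "@angular/common/http";
import { Injectable } from "@angular/core";

// Hypothetical response shape: the short-lived tokens come in the body,
// the master token only travels as an HttpOnly cookie.
interface LoginResponse {
  readTokens: Record<string, string>;
  writeToken: string;
}

@Injectable({ providedIn: "root" })
export class AuthService {
  constructor(private http: HttpClient) {}

  // Called once at application startup: if the hidden master-token cookie is
  // still valid, the server answers with fresh short-lived tokens.
  silentLogin() {
    return this.http.post<LoginResponse>(
      "/api/login",
      {},                        // no body: we rely solely on the cookie
      { withCredentials: true }, // required so the browser attaches the cookie
    );
  }
}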

To ensure that the master token cookie is never sent to any other API endpoint (so those can be cached properly), the cookie needs a “Path” attribute stating that it should only be sent to “/api/login”.
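
On the server side, that translates into a Set-Cookie header roughly like the following sketch; the cookie name and max-age are made up, while HttpOnly, Secure and the Path restriction are the important parts:

import { ServerResponse } from "http";

function setMasterTokenCookie(res: ServerResponse, masterToken: string): void {
  res.setHeader(
    "Set-Cookie",
    [
      `master_token=${masterToken}`,
      "HttpOnly",                     // invisible to JavaScript, so XSS cannot read it
      "Secure",                       // only ever sent over HTTPS
      "Path=/api/login",              // never attached to any other endpoint
      `Max-Age=${60 * 60 * 24 * 30}`, // long-lived, as described above
    ].join("; "),
  );
}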

There are more security enhancements that can be applied to cookies; if you are interested, have a look at the MDN page for Set-Cookie and at this helpful article on why we shouldn't use LocalStorage for sensitive data.

I believe this approach gets the best of both worlds: better security than LocalStorage or regular cookie usage, while staying convenient and simple. The main drawback is that nobody seems to be doing this at the moment, so it requires a custom implementation. So while it is conceptually simpler than other approaches, today it is likely to require a good amount of work to set up properly compared with the old, battle-tested ones.

We could use LocalStorage to store the short-lived tokens and avoid the round-trip to the server on page load, but I don't think there is much performance benefit in doing that, and more importantly it increases the attack surface a lot. If you want to go this way, store only the user details and which permissions/tokens the user had, but never the actual tokens. That makes the site quicker on first load, the user appears consistently logged in (no flicker), and it avoids that pointless first login attempt without a cookie for anonymous users.
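
If you do go that way, the only thing worth persisting is a harmless summary, something like this sketch (field names are placeholders):

// Persist a harmless profile summary, never the tokens themselves.
interface UserProfile {
  displayName: string;
  permissions: string[]; // which kinds of tokens the user had, not their values
}

const PROFILE_KEY = "profile";

export function saveProfile(profile: UserProfile): void {
  localStorage.setItem(PROFILE_KEY, JSON.stringify(profile));
}

export function loadProfile(): UserProfile | null {
  const raw = localStorage.getItem(PROFILE_KEY);
  return raw ? (JSON.parse(raw) as UserProfile) : null;
}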

The only attack still possible here, I believe, is an XSS attack scanning the JavaScript memory and extracting the short-lived tokens. It is difficult, and it would only be useful to the attacker for an hour or so. I could try to mitigate it by tightening the requirements on POST and PATCH requests, either by also sending the cookie on those or by making the authenticating token live only a minute or two, so it is effectively always expired on the client and the browser has to query /api/login almost every time before issuing a POST or PATCH. But well, I'm lazy, and even though I do enjoy going paranoid about security, I don't think the cost of implementing this would ever pay off.

One final thing: I had lots of problems getting the cookies into my XHR requests. Silly problems born of my own ignorance, as I'm new to this: I didn't know about all those Access-Control-* HTTP headers, and I didn't know that XHR requests need withCredentials=true set in JavaScript in order to work.

So, in a nutshell, these are the HTTP headers required on the server (a sketch putting them together follows the list):

  • Send the Access-Control-* headers on OPTIONS requests as well, since the browser issues a preflight OPTIONS request before trying the actual POST or PATCH.
  • Access-Control-Allow-Origin: a star (*) is not a good value here, as it is not compatible with sending cookies (credentials). Specify the actual origin of your front end (e.g. http://localhost:9999).
  • Access-Control-Allow-Methods: don’t forget it, and don’t forget to include OPTIONS; the browser gets pretty confused when it asks via OPTIONS and receives a successful response saying that OPTIONS wasn’t an option after all.
  • Access-Control-Max-Age: set it, always, and set it to a generously high value. That prevents the browser from sending the OPTIONS preflight again and again.
  • Access-Control-Allow-Credentials: set it to “true” if you want to be able to set or read cookies on that endpoint.
  • Access-Control-Allow-Headers: if you plan to send an Authentication header, send “content-type, authentication, *”.
  • Access-Control-Expose-Headers: if you want to read the ETag from JavaScript, you need to list “ETag” in this header.
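
Putting the list above together, a bare-bones Node.js/TypeScript handler could apply the headers like this (the origin, methods and header names are examples for this particular setup):

import { IncomingMessage, ServerResponse, createServer } from "http";

const ALLOWED_ORIGIN = "http://localhost:9999"; // the actual front-end origin, never "*" with credentials

function applyCors(req: IncomingMessage, res: ServerResponse): boolean {
  res.setHeader("Access-Control-Allow-Origin", ALLOWED_ORIGIN);
  res.setHeader("Access-Control-Allow-Credentials", "true");
  res.setHeader("Access-Control-Allow-Methods", "GET, POST, PATCH, OPTIONS");
  res.setHeader("Access-Control-Allow-Headers", "content-type, authentication");
  res.setHeader("Access-Control-Expose-Headers", "ETag");
  res.setHeader("Access-Control-Max-Age", "86400"); // cache the preflight for a day

  if (req.method === "OPTIONS") {
    // Preflight request: answer with the headers above and an empty body.
    res.statusCode = 204;
    res.end();
    return true;
  }
  return false;
}

createServer((req, res) => {
  if (applyCors(req, res)) return; // preflight already answered
  res.setHeader("Content-Type", "application/json");
  res.end(JSON.stringify({ ok: true }));
}).listen(8000);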

On the front end, when performing an XHR request, set withCredentials to true in the request options (the third argument of post, not the request body). In Angular 6:

this.http.post<Hero>(
    this.baseurl + '/hero',
    hero,                         // the Hero payload to send (request body)
    { withCredentials: true })    // options object: include cookies in the request

Hopefully this covers everything, and I hope I don't run into more surprises. I'm starting to get exhausted by all this pesky security fuss around XHR requests. I like that JavaScript is kept from doing what it is not supposed to do, but it is starting to feel cluttered.

Sedice: HTTPS and automatic certificate renewal

As I mentioned in the previous post, this weekend I spent some time on Sedice. I set up HTTPS for the site, with automatic certificate renewal.

Sedice runs on an old Debian release, specifically Debian 6, because the site depends on an old PHP version and newer ones would require patching the code.

The web server is lighttpd, which was my preferred replacement for Apache for many years. These days nginx is my favourite, since it has become very simple to configure and very flexible.

To serve HTTPS I use a chroot with Debian Stretch, and inside it runs an nginx acting as a reverse proxy. That way nginx takes the server's HTTPS port 443 and lighttpd keeps HTTP port 80. When someone accesses the site over HTTPS, the request reaches nginx, which forwards it to lighttpd.

You can do the same with Apache: if it falls short, or your Apache version does not support HTTP/2, you can always put nginx in front of it, which will also reduce the CPU used by Apache and speed up the site. With some caching in nginx for certain pages, it gets even faster.

The tricky part of HTTPS is getting the right SSL certificate for the server. For that we have Let's Encrypt, which I recommend wholeheartedly: they hand out free certificates in exchange for verifying that we control the domain and the web server.

For non-standard setups, like a reverse proxy in this case, Let's Encrypt is not smart enough to figure out where to drop its verification file, so you have to use the more manual options and configure it by hand. But it is really easy, I promise!

Go to the following page and enter your setup; it will give you exact instructions:

LetsEncrypt – CertBot

If you have a more peculiar configuration, like mine, you have to select “None of the Above” as the software and go to the “Advanced” tab. There, at the end, you will find the key command:

$ sudo certbot certonly --webroot \
    -w /var/www/example -d example.com -d www.example.com \
    -w /var/www/thing -d thing.is -d m.thing.is

The idea here is to define which folder of the web server serves which domain. You simply tell it with -w where the folder is, and then with -d you list the domains you want the certificate for. This always works, and it prints back the folder (or folders) where it left the certificates. In this case the installation itself is done by hand.

If you have a more conventional setup, you follow the basic steps for your software and certbot takes care of guessing which domains you have, which folders they are served from and how to install the certificates. Newbie mode, I would call it.

Then comes the catch: the certificates have to be renewed every three months instead of every year or two. That is a bit of a pain, so rather than doing it by hand every time, it has to be automated. This is the part I wanted to tackle this weekend, and it turned out to be much simpler than I expected.

Let's Encrypt remembers which domains are served from where, so the initial configure-and-install dance happens only once, and renewing is extremely simple:

sudo certbot renew

I thought I would have to build some complex scheduling to decide whether to ask for a new certificate or not… but it turns out certbot already keeps track of that and issues nothing if nothing is needed. It is crontab friendly!

So I open the crontab and add it directly:

# m  h dom mon dow   command
  0  0  *   *   0    certbot renew >> /var/log/certbot.log

And that's it. With this, once a week it renews whatever certificates need renewing, without making unnecessary requests.

It really is simpler than paid certificates; anyone without HTTPS these days simply doesn't want it!

I was sick of getting the Let's Encrypt renewal reminder e-mail every month and a half. One more thing fixed.

Oh, and don't forget! If you run a reverse proxy like in my case, remember to pass the original client IP to the second server using the X-Forwarded-For header, and to decode that header on that server. Otherwise every visit will appear to come from 127.0.0.1 and you will lose track of which IP it came from.

Sedice: Backups and incremental replication

This weekend I got back to Sedice.com and put a few hours into it. The plan was to sort out the HTTPS situation, auto-renew the certificates, and get an incremental replica on my own machine without filling up the server's disks.

Where do I start? Sedice runs on an ultra-low-cost server from a fairly unknown provider called IperWeb. The hosting plan is an OverZold, whose particularity is that it runs on OpenVZ, on very powerful servers, but with overbooking.

The result: 12 euros a month, 6 GB of RAM, 60 GB of disk and lots and lots of CPU. The catch is that the server is not reliable: because of the overbooking, far more people share the machine than the resources it can actually deliver. The hard drive gets saturated easily and the CPU sometimes slows down with no explanation.

By contrast, the personal server I keep for backups costs the same; its CPU is a very slow Atom and it has little RAM, but it comes with a 2 TB hard drive.

After some fighting when I installed it, I managed to reduce the disk accesses and make the site run optimally on this server. Which is remarkable, because any alternative (including Amazon AWS) would cost more than 40 euros a month for the same performance. With no source of income beyond a few donations, and given that I don't even use the site personally, the more economical it is, the better.

The problem is that I don't expect much from this server, and I worry that one day it will lose data or something will go wrong. After all, to them I am just one number among the thousands of customers they host on the same machine. IperWeb's service and support have always been excellent and I am very happy with them, but one thing does not exclude the other.

I had to cut the backups down to three per week, because taking one every day and keeping a seven-day history used too much disk. So when IperWeb rebooted the machine without warning (and it was my fault for over-optimizing MySQL's disk access), the tables got corrupted so badly that all I could do was restore everything from scratch, with the bad luck that the latest backup was two days old. We lost two whole days of forum posts.

MySQL Binary logs

It turns out MySQL has a feature that is very interesting for me, called binlogs. These are files that record modifications to the database as they happen. With a small tool called mysqlbinlog you can view such a file as SQL and execute it to roll your database forward to a particular point.

This suits me perfectly, because I am also doing web development experiments on my local machine and I am starting to need a copy of the database that is updated frequently. And if I ever have to restore a backup on Sedice, I can use these logs to bring it up to date and lose far less data. Two birds with one stone.

The first step was to enable them in the MySQL configuration file (/etc/mysql/my.cnf):

server-id = 101
log_bin = /var/log/mysql/mysql-bin.log
expire_logs_days = 60
max_binlog_size = 60M

With this, once MySQL is restarted, it starts writing the log files in that folder. The files have names like this: mysql-bin.000063

When a file goes past 60 MB, MySQL rolls over to a new one. Great!

The next step is to compress these files regularly, because they grow quite a bit. Compressed, they shrink to 7-8 MB. For that I set up a crontab entry that runs every 15 minutes:

*/15 * * * * /var/log/mysql/compress_binlog.sh

And the script in question contains the following:

#!/bin/bash
# Pick the oldest uncompressed binlog, skipping the newest one
# because MySQL is still writing to it.
NEXTFILE=$(find /var/log/mysql -iname "mysql-bin.*" \
    \! -iname "*.xz" \! -iname "*.index" \
    | sort | head -n-1 | head -n1 | grep mysql)
RESULT=$?

if [ "$RESULT" != "0" ]; then
    echo "No input files left for compressing."
    exit 1
fi

echo "Compressing $NEXTFILE . . . "
# Compress at the lowest CPU priority so the website is not affected.
nice -n19 xz -v -9 "$NEXTFILE"
# Move the finished file to the archive folder, so rsync never sees
# a half-compressed file.
mv "$NEXTFILE.xz" /var/log/mysql/archive/

# Ship the archived binlogs to the backup server.
rsync -aviP /var/log/mysql/archive/ \
    sedice@mi-servidor-ovh:mysql/archive

# Local archives older than three days are no longer needed here;
# the backup server keeps its own copy.
find /var/log/mysql/archive/ -iname "*xz" -mtime +3 -delete

The mechanics, although it may not look like it, are very simple. First I list the files that are candidates for compression, drop the newest one because MySQL is still writing to it (I don't want to compress a half-written file), and then pick the oldest of the remaining ones.

Then it compresses at minimum priority (nice -n19) so the rest of the website is not affected. When it finishes, the file is moved to a dedicated folder. This matters because, if you rsync from another machine against this server, you cannot tell whether a file is still being compressed, and you don't want to transfer a half-finished file. Moving it to another folder once it is done solves that problem.

I chose the file size (60 MB) because it is big enough for aggressive compression (xz -9) to work well, but small enough that new files are produced fairly often; normally a new one appears every 45 minutes or so. If they were 10 MB they would be produced very frequently, which is good for keeping the copies fresh, but they would compress worse and the archive would end up taking much more space. And I don't think larger files would compress any better.

I transfer them via rsync to my personal backup server, so there is an off-site copy and I can sleep easier. The usual SQL backup job also sends the full backups to this backup server.

Finally, files older than three days are of no use to me on the main server, so I delete them there. They stay archived on my backup server just in case.

And well, that's it. With this, if the server fails one day I can recover up to roughly the last hour.

Well, no. Backups are like Schrödinger's cat, alive and dead at the same time until you open the box. Backups work and don't work at the same time, until you test them. And when you test them… the surprises arrive.

Restoring the binary logs

As I mentioned, I have the same database running locally. Ideally I connect every so often, download the updates and apply them. That way, in my tests, I can search posts from the last few days and so on, which I need for the post series I am writing about the ultra-fast website.

As always, another script to download them:

#!/bin/bash
rsync -aiP sedice@mi-servidor-ovh:mysql/archive/ archive/

Simple. But not having to remember the command every time is nice, and it also lets me automate it with crontab if I ever want to. I download from my backup server rather than from the original server: it lags a little more, but that server never deletes anything, so I always have the latest files.

Once the changes are downloaded, it is time to apply them. Another script!

#!/bin/bash -e
# Apply the downloaded binlogs to the local copy, in order, skipping the
# ones already applied (marked with an empty .done file).
for file in $(find . -maxdepth 1 -iname "*.xz" | sort)
do
    if [ ! -f "$file.done" ]; then
        echo "$file"
        # mysqlbinlog does not play well with streams, so decompress
        # to a temporary file and feed it whole.
        xz -dkc "$file" > tmp/binlog
        mysqlbinlog tmp/binlog > tmp/sql
        # Ignore mysql errors so one bad statement does not stop the loop.
        mysql -h sedice.loc -u usuario -ppassword < tmp/sql || /bin/true
        unlink tmp/binlog
        unlink tmp/sql
        # Mark this file as done so the next run skips it.
        touch "$file.done"
    fi
done

This one takes a bit more work. The main problem is that the mysqlbinlog utility does not work well with streams, which I discovered after several attempts and plenty of errors. The simple approach would have been to pipe the data in and skip all these steps, but it returns errors, as if it wanted to rewind the file and couldn't. So it has to be done the brute-force way: decompress the whole file and feed it in complete.

The mechanics are: decompress a file, run it through mysqlbinlog to obtain the SQL statements, connect to the local database and feed it the file, and finally delete the temporary files.

The tricky part is detecting where we left off. The strategy used is one of the dumbest but most effective: write an empty marker file indicating that a given file has already been processed. I tried moving processed files to another folder, but then rsync downloads them again. I also tried keeping a timestamp file, but this solution is simpler and it works.

It takes a while to process, because Drupal writes a ton of cache data to MySQL, which means a lot of writes. Disabling the cache is not an option because Drupal is already slow enough on its own, and I cannot move it to Redis or Memcache, because it is a Drupal 4.7 and I don't dare touch it.

And to wrap up, proof that it works: querying the posts tables, I can see that the latest one is indeed from a short while ago. Mission accomplished!

Let's see if this helps me finish the post series, where I will show you how authentication and cache invalidation work, with speed measurements, using a real database of more than 2 GB that gets updated every so often.