#PostgreSQL live aggregate cache (II)

I recently published an entry about aggregate caches that are updated automatically, minimizing the risk of wrong aggregate results even inside a transaction that performs modifications.

But what happens if someone modifies old rows? We get wrong aggregates. That is the only drawback of the proposed solution, yet in certain circumstances it won't be enough. Maybe your table isn't modified only in its newer parts. Or maybe you always need a correct aggregate, no matter what. What can we do?

Let’s suppose a table can be divided into segments that are contiguous on disk. If we aggregate every segment independently, we can later compute a grand total by aggregating the partial aggregates. Because each segment is relatively big (16384 rows in my examples), the number of partials is small and the grand total can be computed almost instantly. And if we know which segments have changed recently, we can invalidate them and go to the actual table for just those records.
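To make the idea concrete outside SQL, here is a small Python sketch (names and data are illustrative, not from the post) of per-segment partial aggregates and a grand total computed from them, using the same 14-bit shift:

```python
# Sketch of the segment idea: partial aggregates per segment,
# grand total computed from the partials. Illustrative only.
SEGMENT_BITS = 14  # 2**14 = 16384 rows per segment

def segment_of(lineid: int) -> int:
    return lineid >> SEGMENT_BITS

def partial_aggregates(rows):
    """rows: iterable of (lineid, quantity). Returns {segment: sum}."""
    partials = {}
    for lineid, quantity in rows:
        seg = segment_of(lineid)
        partials[seg] = partials.get(seg, 0) + quantity
    return partials

rows = [(0, 5), (16383, 2), (16384, 7), (40000, 1)]
partials = partial_aggregates(rows)
print(partials)                 # {0: 7, 1: 7, 2: 1}
print(sum(partials.values()))   # grand total from the partials: 15
```

The grand total only ever touches one row per segment, which is why it stays nearly instant even for millions of base rows.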

To make faster access to the segment calculation, I add one index:

CREATE INDEX ON invoicelines ((lineid >> 14), prod);

Then the table holding the cache could be defined as:

DROP TABLE IF EXISTS agg_invoicelines_prod2 CASCADE;
CREATE TABLE agg_invoicelines_prod2 AS 
SELECT lineid>>14 as segment, l.prod, sum(quantity) as quantity, 
count(*) as count1, sum(l.totalprice) as totalprice
FROM invoicelines l
GROUP BY lineid>>14, l.prod;
ALTER TABLE agg_invoicelines_prod2 ADD PRIMARY KEY (prod,segment);

In this case I’m not using materialized views, because now we are taking care of how the cache is refreshed ourselves.

Now we can create a table for segment invalidation. If a record lands in this table, it means that part of the cached table is no longer valid.

CREATE TABLE agg_invoicelines_prod2_invalid ( 
  id serial not null, segment int not null, 
  prod varchar not null, primary key (id) );
CREATE INDEX ON agg_invoicelines_prod2_invalid (segment, prod, id);

Why another table and not a simple flag column on the previous one? Well, if two transactions tried to invalidate the same cached record in place, PostgreSQL would lock one until the other commits. An append-only table allows parallel invalidation of segments.

But we don’t yet have a view for the grand total. Let’s write one, even if it returns stale data:

CREATE OR REPLACE VIEW view_agg_invoicelines_prod2_stale AS
SELECT prod, sum(quantity) as quantity, sum(count1) as count1, 
       sum(totalprice) as totalprice
FROM agg_invoicelines_prod2
GROUP BY prod;

How fast is this method compared with the previous one up to this point? Let’s execute an example query:

SELECT * FROM view_agg_invoicelines_prod2_stale   -- 0.039s
SELECT * FROM view_agg_invoicelines_prod          -- 0.018s
SELECT * FROM agg_invoicelines_prod               -- 0.0001s

As you can see, this method is slower than reading the cached aggregates directly. That’s because we hold 123 segments, so on average it has to process 123 rows to produce a single aggregated row.

Even though it is 400 times slower, it’s still very useful, because it returns in 40ms, which is fast in absolute terms. For larger tables you can use a bigger segment size to avoid getting too slow.

Remember, we did all this because we want to invalidate segments. So let’s do it. We’ll need a trigger to catch every modification.

CREATE OR REPLACE FUNCTION func_trg_invoicelines_invalidate()
 RETURNS trigger AS $BODY$
BEGIN
 IF TG_OP IN ('INSERT','UPDATE') THEN
  INSERT INTO agg_invoicelines_prod2_invalid (segment, prod)
   VALUES (NEW.lineid >> 14, NEW.prod);
 END IF;
 IF TG_OP IN ('DELETE','UPDATE') THEN
  INSERT INTO agg_invoicelines_prod2_invalid (segment, prod)
   VALUES (OLD.lineid >> 14, OLD.prod);
 END IF;
 RETURN NULL;
END; $BODY$ LANGUAGE plpgsql;

CREATE TRIGGER trg_invoicelines_invalidate
 AFTER UPDATE ON invoicelines
 FOR EACH ROW
 WHEN (OLD.* IS DISTINCT FROM NEW.*)
 EXECUTE PROCEDURE func_trg_invoicelines_invalidate();
CREATE TRIGGER trg_invoicelines_invalidate2
 AFTER INSERT OR DELETE ON invoicelines
 FOR EACH ROW
 EXECUTE PROCEDURE func_trg_invoicelines_invalidate();

I use a separate trigger for UPDATE so I can use the “WHEN” trick and avoid invalidating when nothing actually changed. Let’s test by inserting new invoices:

INSERT INTO invoicelines (invoiceid, lineno, prod, quantity, totalprice)
SELECT invoiceid, generate_series(1,100) as lineno, 
 'prod' || to_char((random()*250)::int,'FM00000') as prod, 
 (random()*25)::numeric(12,0) as quantity,
 (random()*50)::numeric(12,2) as totalprice
FROM generate_series(1,2) invoiceid; -- 200 rows.

SELECT * FROM agg_invoicelines_prod2_invalid 
-- segment: 122; 131 products. (131 rows)

As you can see, the new invoices all ended up in segment 122: 200 rows collapsed into 131 invalidations. With more rows inserted we could expect the invalidated segments to collapse even better.

The problem lies in random updates in the middle of the table. If the changes aren’t close together on disk, we end up with one invalidation per modified row. For example, updating the totalprice of random rows (0.01%):

UPDATE invoicelines SET totalprice = totalprice * 1.10 
WHERE random() < 0.01/100.0

202 rows were updated and now the table for invalidations has 333 rows. Exactly 202 more. Every row updated randomly caused an invalidated row. This was expected.

But don’t worry, the invalidation table isn’t going to grow indefinitely. Once every (segment, product) pair has been invalidated, the table stops growing. In this example the theoretical maximum is 30867 rows (in practice it can be slightly more, because two concurrent transactions can invalidate the same segment at the same time).
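A quick sanity check on that upper bound, assuming the ~251 distinct products generated by the population script ('prod00000' through 'prod00250'):

```python
# Upper bound on invalidation rows: one row per (segment, prod) pair.
segments = 123   # number of 16384-row segments in this example table
products = 251   # distinct products generated by the population script
print(segments * products)  # 30873 pairs at most
```

The observed 30867 is slightly lower because not every product actually appears in every segment.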

Now it’s time to create the magic view where we mix cached data with real data. But before that, I’ll create a small function to get the invalidated segments. This function exists mainly to hide its query from the planner later: in my tests, using the function forces the planner to execute it first and plan correctly with its results. (Your mileage may vary.)

CREATE OR REPLACE FUNCTION fn_get_invoicelines_prod2_invalid(
      out segment integer, 
      out prod varchar) returns setof record
 AS $$ 
SELECT DISTINCT segment, prod 
FROM agg_invoicelines_prod2_invalid 
$$ LANGUAGE sql;

For the “online” data view, the query goes like this: first take all the still-valid rows from the cache, then fetch live data for the segments that were invalidated. Append the two sets and aggregate them into a single result.
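The same mix-and-aggregate logic, sketched in Python with plain dicts (hypothetical names; keys are (segment, prod) pairs):

```python
# Sketch: combine still-valid cached partials with freshly computed
# partials for invalidated segments. Keys are (segment, prod) pairs.
def live_totals(cache, invalid, live_rows, shift=14):
    # 1) recompute partials for invalidated pairs from the live rows
    fresh = {}
    for lineid, prod, qty in live_rows:
        key = (lineid >> shift, prod)
        if key in invalid:
            fresh[key] = fresh.get(key, 0) + qty
    # 2) keep only cached partials that were NOT invalidated
    valid = {k: v for k, v in cache.items() if k not in invalid}
    # 3) aggregate both sets per product
    totals = {}
    for (seg, prod), qty in {**valid, **fresh}.items():
        totals[prod] = totals.get(prod, 0) + qty
    return totals

cache = {(0, 'a'): 10, (0, 'b'): 4, (1, 'a'): 3}  # stale: (1,'a')
invalid = {(1, 'a')}
live_rows = [(16384, 'a', 5), (16385, 'a', 2)]     # both in segment 1
print(live_totals(cache, invalid, live_rows))      # {'a': 17, 'b': 4}
```

Steps 1–3 correspond to the three SELECTs of the SQL view: live re-aggregation, anti-join against the invalidation list, and the final GROUP BY.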

CREATE OR REPLACE VIEW view_agg_invoicelines_prod2_live AS
SELECT y.prod, sum(quantity) as quantity, 
       sum(count1) as count1, sum(totalprice) as totalprice
FROM (
 SELECT (lineid >> 14) as segment, l.prod, 
        sum(quantity) as quantity, sum(count1) as count1, 
        sum(totalprice) as totalprice
 FROM (
  SELECT l.lineid, l.prod, l.quantity, 1 as count1, l.totalprice 
  FROM invoicelines l
  INNER JOIN fn_get_invoicelines_prod2_invalid() x 
     ON x.segment = (l.lineid >> 14) AND x.prod = l.prod
 ) l
 GROUP BY (lineid >> 14), l.prod
 UNION ALL
 SELECT t.* FROM agg_invoicelines_prod2 t
 LEFT JOIN fn_get_invoicelines_prod2_invalid() x 
    ON x.segment = t.segment AND x.prod = t.prod
 WHERE x.segment is null
) y
GROUP BY y.prod;

Let’s try to query it. I still have 333 invalidation rows.

SELECT * FROM view_agg_invoicelines_prod2_live -- 0.132s

Looking at the query plan, I see PostgreSQL aggregating 30536 cached rows and 14180 live rows. Of course it takes longer than the simpler approaches, because there are more rows to aggregate. But compared to the 2.4 seconds that a regular query over invoicelines takes, we’re still getting roughly a 20x speedup.

From time to time we have to recalculate the stale data to avoid performance degradation. This should be done outside of any long transaction, during periods of low load, although it doesn’t take much CPU anyway.

First, we delete the stale rows from the agg_invoicelines_prod2 table, so we can insert fresh ones later.

DELETE FROM agg_invoicelines_prod2 p 
USING agg_invoicelines_prod2_invalid i
WHERE p.prod = i.prod AND p.segment = i.segment;

Now we delete the invalidated rows and use that (putting the DELETE into a CTE) to drive an insert:

WITH d as (
  DELETE FROM agg_invoicelines_prod2_invalid x RETURNING x.*
)
INSERT INTO agg_invoicelines_prod2 
   (segment, prod, quantity, count1, totalprice)
SELECT lineid>>14 as segment, l.prod, 
    sum(quantity) as quantity, count(*) as count1, 
    sum(l.totalprice) as totalprice
FROM invoicelines l 
INNER JOIN (SELECT DISTINCT segment, prod FROM d) d
    ON l.prod = d.prod AND (l.lineid>>14) = d.segment
GROUP BY lineid>>14, l.prod;  -- 0.063s for 333 rows.

For me, it took 63ms. Very fast indeed.

Let’s see how long it takes to query the view now with a full valid cache:

SELECT * FROM view_agg_invoicelines_prod2_live -- 0.086s

Slightly faster: from 132ms to 86ms, a 53% improvement. But the previous method gave us 18ms, almost 5 times faster still. Is this the price to pay for a live aggregate cache? We could probably use bigger segment sizes, because we have seen that invalidations and cache updates happen much faster than expected while selects are much slower. We currently have 123 different segments. If we add 3 more bits to the shift (17 bits in total), each segment becomes 8 times bigger and we could expect about 15-16 different segments.
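The segment arithmetic for this ~2-million-row example table can be checked quickly (a sketch; `total_rows` comes from the earlier population script):

```python
# How segment size and segment count change with the shift width,
# for the ~2 million rows of the example table.
total_rows = 2_000_000

for shift in (14, 17, 18):
    seg_size = 1 << shift
    n_segments = -(-total_rows // seg_size)  # ceiling division
    print(f"shift {shift}: {seg_size} rows/segment, {n_segments} segments")
# shift 14:  16384 rows/segment, 123 segments
# shift 17: 131072 rows/segment,  16 segments
# shift 18: 262144 rows/segment,   8 segments
```

This matches the counts observed in practice: a 17-bit shift yields 16 segments, and 8 segments would require an 18-bit shift.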

I tried this myself and got 16 segments, with good results. The invalidations table behaves more or less as before, with the same number of rows. With ~300 invalidated cache rows the view takes 77ms to query, but with a fully valid cache it takes 12ms. Even faster than the previous method.


Well, the previous method had a fixed amount of 10,000 live rows to aggregate; now we’re aggregating fewer records, which explains how we’re getting even faster. The previous method has no triggers and works on a guess; this one uses triggers, so data modification is a bit slower, but the gathered information lets us retrieve exactly what is needed.

In some cases the GROUP BY columns have such high cardinality that segmenting the table no longer makes sense. But segmenting should help in other cases, especially if you expect modifications. If aggregating the whole table takes 2 seconds and a single insertion touches all 250 products, you’ll end up with a fully invalidated cache unless you use segments.

The main point of segmenting the table is querying regions of disk that are near one another. This behavior may show up even better with more rows. Segments can also be useful in table partitioning schemes, where we could avoid reading other child tables entirely.

Hope you liked the post. Maybe I’ll come up with a new entry on this topic if I get new ideas.

Thanks for reading!


#PostgreSQL Speeding Up Queries (I)

Do you have a project with lots of data and queries, and are you struggling to throw hardware and caching systems at it to keep it running? Was it working fast last year, but now gets slower and slower?

I know a lot about optimizing response times in production, and I can share several tricks. Remember that most of the time it’s not about computing faster, it’s about computing less. Imagine you had to do by hand everything your CPU does, and you’re super lazy: which things could be avoided? Which tasks could be skipped? If our computers could yell at us, I’m pretty sure they would. Let’s make our computers’ lives easier.

Understanding Indexes.

Let’s start with the simplest part. Without indexes, tables are like plain text files: if you need a particular line, or want to filter by value, you always have to read the full file just to retrieve the rows you need.

Indexes are useful because they are lookup tables for certain keys. The database uses them automatically whenever they will speed up the query. They’re mainly used by SELECT queries, but they also help certain INSERT, UPDATE and DELETE queries.

First of all, indexes don’t come for free. You shouldn’t place an index on every column of every table. They speed up finding rows, but they also slow down every row modification (insert, update and delete), because every change must be reflected in the index.

People find it hard to know which columns to index. These rules of thumb may help:

  • Primary Key constraints already have indexes in PostgreSQL, so there’s no need to create another index on a Primary Key column.
  • Foreign Keys need an index. Every column that you know is going to be used as a condition in a JOIN needs an index. (Unless PostgreSQL creates it internally, see the previous point.)
  • Columns that classify the data and could be used in a GROUP BY expression need an index. If you imagine a SUM(x) on your table, which columns could appear in the GROUP BY? Just create indexes for some of the common cases.
  • Columns that can be used to cut off the amount of data, like dates. If you plan to run queries for the last year, an index on that column is going to be useful.
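The intuition behind these rules can be sketched with a Python dict standing in for an index (a toy model: real B-tree indexes also support range scans and cost disk writes to maintain):

```python
# A dict as a toy "index": full scan vs indexed lookup.
rows = [(1, 'ES'), (2, 'FR'), (3, 'ES'), (4, 'DE')]  # (id, country)

def full_scan(rows, country):
    return [r for r in rows if r[1] == country]  # reads every row

# Building the index costs time up front, and every insert/update/delete
# must maintain it -- that is why indexes are not free...
index = {}
for row in rows:
    index.setdefault(row[1], []).append(row)

# ...but each lookup becomes a single probe instead of a scan.
print(full_scan(rows, 'ES'))   # [(1, 'ES'), (3, 'ES')]
print(index.get('ES', []))     # same result, without scanning
```

The maintenance cost of `index` on every write is exactly the insertion slowdown mentioned above.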

If you don’t care about insertion delays for a million rows, maybe you can have 30 indexes or more on a table. If you do care, you’ll have to choose the most important ones and cut down to 5-10.

Speeding up slow queries

After you have created most basic indexes in every table, you should be fine until your database grows. Then there will be queries that start to run slowly. This is perfectly normal. Unless you plan ahead and know how the PostgreSQL planner works, the easiest way is to wait until some query slows down and then optimize those queries.

To do that, it’s best to configure PostgreSQL to log the problematic queries. Depending on your application, a slow query may be 300ms or 15 seconds: web applications require fast responses, GUI applications usually less so, and batch operations can last minutes. It depends. I recommend always logging everything over 2 seconds; that’s a good starting value.

So, open postgresql.conf (usually in /etc/postgresql/9.6/main/postgresql.conf) and change:

 # log_min_duration_statement = -1   # -1 is disabled, 0 logs all statements
                                     # and their durations, > 0 logs only
                                     # statements running at least this number
                                     # of milliseconds

uncomment it and set “2000” for 2 seconds:

log_min_duration_statement = 2000

Then a reload is enough:

$ sudo /etc/init.d/postgresql reload

You can see your logs at /var/log/postgresql/*.log

After hours or days, the problematic queries will have been logged. If you know which parts of your app are slow, go and exercise them, and the queries will be logged immediately. Then open the log, copy the SQL into pgAdmin and press SHIFT+F7 (EXPLAIN ANALYZE). It will re-execute the query, so it can take several seconds to complete. You’ll see a nice graph like this one:


This is telling us how PostgreSQL is executing our query. You’ll have to learn to read it, but it is pretty easy. When you place the mouse on top of any icon, information will appear; and if you pressed SHIFT+F7 (and not just F7) there will be precious timing information.

The above graph says: read every row in “towns” table, and for every row (nested loop) query the index idx_schools_the_geom2; then aggregate the result.

This is fine for a small number of towns, but if you have 100000 towns this is a bad plan, because the index will be queried 100000 times. Indexes are fast, but a fast operation multiplied by 100000 adds up to very, very slow. (0.1ms * 100,000 = 10s)

So you have to look for a way to simplify the database’s job. It would be better if the two tables could be joined using another method. The best join method is the merge join, which is nearly cost-less, but it requires both inputs to be pre-sorted on the JOIN ON condition. If we create a multicolumn index on towns that matches the JOIN clause, the plan may change and the query may get faster.
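As a sketch of why pre-sorted inputs matter, here is a minimal merge join in Python (illustrative names; it assumes unique keys on the left side, which real planners do not require):

```python
# Minimal merge join: walk two inputs pre-sorted by their join key
# in lockstep, never revisiting a row.
def merge_join(left, right, key_l, key_r):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        kl, kr = key_l(left[i]), key_r(right[j])
        if kl == kr:
            out.append((left[i], right[j]))
            j += 1          # advance right; assumes unique keys on the left
        elif kl < kr:
            i += 1
        else:
            j += 1
    return out

towns   = [(1, 'Alba'), (2, 'Bera')]                  # sorted by town id
schools = [(1, 'North'), (1, 'South'), (2, 'Main')]   # sorted by town id
print(merge_join(towns, schools, lambda t: t[0], lambda s: s[0]))
# 3 matched pairs, produced in a single pass over each input
```

Each input is read exactly once, which is why the database favors this plan when an index already provides the required sort order.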

When you have done your changes, try to re-execute the query in pgAdmin using SHIFT+F7 (EXPLAIN ANALYZE). See if the plan changes, and see if it is faster. If not, undo what you did and try other things. If you succeed, take note of the problem, what you did to solve it, and how PostgreSQL behaved after your changes.

After several query optimizations, you’ll start to learn by example how this works. It’s not that hard, I promise.



Numpy, Cython and other tricks for speed in #Python

In a previous post from a few years back I commented on how Python was very useful, but since its execution is slow there seemed to be an invisible barrier that made Python not very useful, especially when there were many iterations to perform.

A lot of time has passed, I use Python more than ever, and I have to say I’m starting to disagree with my 5-years-ago self. Python is becoming very viable in a great number of scenarios.

First of all, let’s state the obvious: libraries should be written in C. Also, Java is still at the head of the not-purely-compiled languages in terms of speed. But everything is relative.

Writing Python that is also efficient is not trivial (it probably isn’t in any language), but for those used to low-level languages, the way you work in Python can feel counter-intuitive when it comes to optimization.

Python is one of the slowest languages out there. You have to be aware of this before starting to write code that needs to be efficient to any degree. A Python instruction takes about 150 times longer than the same one in C. Java, for example, is usually 4 or 5 times slower than C. But this doesn’t make a Python program 150 times slower than a C one; that extrapolation works for Java, but not for Python. Why?

Java is a language that compiles to bytecode, and that bytecode is executed by a VM. If the VM and the bytecode inherently add 4-5 processor instructions, or cannot optimize for a given processor, then every process, whatever it is, suffers the same penalty. Nothing new so far.

Python, on the other hand, is quite open about how it works. Python libraries (both those included in the standard distribution and third-party ones) can be written in Python or in C. This means a large number of Python operations can be executed directly in C without any problem. Of course, when manipulating Python objects from C you get locks (the GIL) and higher resource usage, but libraries try to reduce both.

To give two examples, the Qt library for Python links to C++, and the .sort() method of lists runs in C. The result is that programs written in Python tend to be lighter, using less code and with less thought given to efficiency. The dilemma comes when we want to optimize.

We usually think that by doing things manually ourselves, taking advantage of a couple of properties of our data, we can improve the algorithm’s running time. That’s true in C++ and Java. In Python we end up frustrated, seeing that the more control we exert, the slower it goes: that’s because we are forcing more Python code to execute.

Paradoxically, in Python the fastest code is usually the most Pythonic. The less and simpler the code you write, the faster it runs. It often happens that an O(n log n) solution is faster than an O(n) one, simply because the former runs in C and the latter in Python, and the O(n) method would need an N of several million to actually be faster. This is contradictory and, at times, frustrating.
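A small experiment illustrating this effect (the exact ratio varies by machine; the point is only that the C-backed builtin wins despite doing the “same” O(n) work):

```python
# The C-backed builtin vs an equivalent hand-written Python loop.
import timeit

data = list(range(100_000))

def manual_sum(xs):
    total = 0
    for x in xs:        # every iteration executes Python bytecode
        total += x
    return total

assert manual_sum(data) == sum(data)  # same result...

t_manual  = timeit.timeit(lambda: manual_sum(data), number=20)
t_builtin = timeit.timeit(lambda: sum(data), number=20)
# ...but sum() runs its loop in C and is typically several times faster.
print(f"manual: {t_manual:.3f}s  builtin: {t_builtin:.3f}s")
```

The same pattern holds for sorted(), min(), max(), str.join() and most other builtins: the loop you didn’t write in Python is the loop you didn’t pay for.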

The summary of why this is so, and why “Python rocks”, is that Python is a “director” language. It is designed to make writing complex logic simple and practical. Its goal is not to execute efficiently, but to direct others to do it on its behalf.

Every year more and more libraries appear that exploit this aspect. NumPy, for example, is well known: it enables matrix computation using very simple Python, while using the same memory and CPU as if you did it in C. Theano, to give another example, is a library that solves equations and turns Python into a small Matlab. It even compiles the equations and lets you run them in parallel and on the graphics card!

The trick is always to hand work over to others, to delegate. For example, you can delegate computation to a database. While you wait for the database you can keep executing more Python code; there, the locks of the famous GIL are not a problem.

And speaking of the GIL, I just read on Twitter that PyPy is working on removing it. Good luck!

PyPy, for those who don’t know it, is a Python interpreter written in Python. They compile the result, so the final executable is fast. The good thing is that they are trying to make Python faster, and after several years they are not only succeeding, they have also achieved fairly good compatibility with existing code.

I think that if speed were really the problem, the answer would be neither Java nor C#. Maybe not even C++. All the code I’ve seen that truly needs to be fast is pure C.

Most of us programmers don’t want to accept it, but we have no damn clue about code optimization (and we believe we do). We should stop worrying about writing fast code and learn to write good code. And if you think I’m wrong, read “What Every Programmer Should Know About Memory.pdf” and then tell me.






#PostgreSQL live aggregate cache (I)

Common problem: you have a big table and your application usually reads aggregate data from it. Each time you execute a SUM+GROUP BY PostgreSQL spends lots of time reading the whole table and computing the aggregate.

Common solution: put a MATERIALIZED VIEW and query that one instead.

Common pitfall: The data in the materialized view gets old and your application reads old, stale data which isn’t correct any longer.

There are two known workarounds for this. One is to refresh the materialized view often, but then you don’t get last-minute changes, and never your own transaction’s changes. The second is to cache only up to a certain date or record: using a view and a UNION you get the best of both worlds, cached aggregates for old data and fresh ones for new data. The problem with this approach is that you can get duplicate or hidden tuples by accident, and from time to time you have to rebuild everything to set a new cut-off point and avoid calculating too much “fresh data”.

Even with that, what happens if someone modifies old data? Most probably the changes won’t enter the cached table until the next day or week (depending on your cron jobs).

I’ve always asked myself: would it be possible to have everything? Could I set up something that exactly mimics the behavior of a fresh query in all situations while being very fast? It seems difficult, but after several hours of testing I got a perfectly working solution. And I would like to share it with you. Yes, you!

Before explaining the hard part, I’ll start with the simplest. Let’s define a simple use case scenario for this problem: statistical gathering from invoice lines (product, quantity, totalprice). We have several million rows there. This could be defined as:

DROP TABLE IF EXISTS invoicelines ;
CREATE TABLE invoicelines (
lineid serial not null,
invoiceid int not null,
lineno int not null,
prod varchar not null,
quantity numeric(12,4) not null default 1,
totalprice numeric(12,2) not null default 0,
PRIMARY KEY (lineid)
);
CREATE INDEX ON invoicelines (invoiceid,lineid);
CREATE INDEX ON invoicelines (prod);

If you want to populate this example, you could execute:
(beware: this creates 2 million rows and takes 35 seconds on an i7-920@2.6GHz. Remember to drop the indexes and re-create them afterwards)

INSERT INTO invoicelines (invoiceid, lineno, prod, quantity, totalprice)
SELECT invoiceid, generate_series(1,100) as lineno, 
  'prod' || to_char((random()*250)::int,'FM00000') as prod, 
 (random()*25)::numeric(12,0) as quantity,
   (random()*50)::numeric(12,2) as totalprice
FROM generate_series(1,20000) invoiceid; -- 35s

So if our application wants to read a total quantity sold for all time, it does:

SELECT prod, sum(quantity) FROM invoicelines GROUP BY prod; -- 1.9s

And that works fast only until you have more than ~100k rows in the table. So we create a materialized view for it:

CREATE MATERIALIZED VIEW agg_invoicelines_prod AS
SELECT prod, sum(quantity) as quantity FROM invoicelines GROUP BY prod;
-- 6.1s

Then the data gets old. We add a refresh on cron.weekly:

REFRESH MATERIALIZED VIEW agg_invoicelines_prod;
-- 3.2s

Now, as I mentioned earlier, the problem is that we won’t see the most recent changes, especially the changes inside our own transaction. There may be several valid values for the same aggregate across different live transactions, because they see different data. It is not feasible to refresh the view inside a transaction, not only because of the wasted CPU time, but mainly because updating rows requires locking them, and if another transaction tries to update the very same rows, one of the two will be blocked. Bad idea. (To be exact, REFRESH MATERIALIZED VIEW locks the whole view unless CONCURRENTLY is specified.)

So the next logical move is to create a view that cuts the data in two: stale data and fresh data. How can this be done? Easy:

DROP MATERIALIZED VIEW agg_invoicelines_prod;
CREATE MATERIALIZED VIEW agg_invoicelines_prod AS
WITH maximums as (
  SELECT max(lineid) - 10000 as max_lineid FROM invoicelines)
SELECT prod, sum(quantity) as quantity, max(lineid) as lineid 
FROM invoicelines, maximums 
WHERE lineid <= max_lineid GROUP BY prod; -- 4.5s
CREATE INDEX ON agg_invoicelines_prod (lineid); -- 0.2s

CREATE VIEW view_agg_invoicelines_prod AS
WITH maximums as (
   SELECT max(lineid) as max_lineid FROM agg_invoicelines_prod)
SELECT prod, sum(quantity) FROM (
  SELECT prod, quantity FROM agg_invoicelines_prod
  UNION ALL
  SELECT prod, quantity FROM invoicelines, maximums 
  WHERE lineid > max_lineid 
) subq_1
GROUP BY prod; -- 0.2s

SELECT * FROM view_agg_invoicelines_prod -- 0.013s

If you read those queries carefully you will notice I have already taken care of a lot of things:

  • Common Table Expression “maximums”: instead of using a fixed number to cut by lineid, I use a rolling one. The CTE here makes sure the planner sees a simple query and optimizes it the way we want. An index on lineid is required, but since it is the primary key there’s no need to add another one. By using a rolling lineid I avoid rebuilding the views every month just to keep from calculating too much fresh data.
  • max(lineid) - 10000: the number controls how much new data is excluded from caching. This allows the user to modify not-so-new rows and still get correct aggregates.
  • CTE “maximums” for view_agg_invoicelines_prod: instead of repeating the previous calculation, I take the old cut-off from the cache. This is because max_lineid changes over time, and this approach requires both parts to use the very same number to cut the table. The trick avoids duplicated or hidden rows over time.

I hope everything up to here was clear; it should be straightforward. The last query block is more or less the recipe I was using in most situations. If you didn’t know these tricks, I strongly recommend using them. They are easy to understand and need almost no maintenance at all. You won’t be disappointed.

Now, what happens if someone modifies ancient rows? We get wrong aggregates. It is the only drawback of this solution, and because the solution is so simple, in most cases it will still be the best option.

In certain circumstances, though, it won’t be enough. Maybe your table isn’t modified only in its newer parts. Or maybe you always need a correct aggregate, no matter what. What can we do? Stay tuned! A new blog post is coming to answer those questions.


SSDs on their way to replacing the HDD

Five years ago I wrote the entry DISCOS DE ESTADO SÓLIDO (SSD) (PARTE I). How has the landscape changed in all this time?

The main thing is that now almost everyone knows what an SSD is; most shops and brands offer machines with this type of drive. They arrived first in laptops, but now you see them even in desktop towers. People use them mainly to speed up boot times.

We are at a fairly historic point, where SSD prices are going to start truly paying off, to the point that soon they could leave HDDs looking as old as cassette tapes. Exaggerated? I don’t think so.

At the current rate, about 5 more years are needed for SSDs to be as cheap as HDDs of the same capacity. At that point nobody will fit an HDD for anything at all, and they will probably become a thing of the past, perhaps relegated to very specific uses, just like backup tapes.

But this is already happening today. Current prices, although quite a bit higher, are starting to be very attractive for almost any application. Something else interesting is happening in servers: 2.5″ SSDs are appearing with considerably more capacity than the biggest HDD of the same size. So some companies are considering filling their JBODs (arrays of 20 or more 2.5″ drives) with SSDs to get more storage in the same physical space. Very expensive, but also much faster.

What happened this past autumn is that, apparently, SSD demand grew far beyond expectations, and manufacturers are realizing they can exploit the current price much more than they thought. So, despite the competition, prices have risen slightly again (in some cases blatantly) and have stayed more or less stable since. We may need to wait 1 or 2 years for prices to keep falling at the previous rate, if they do at all. If they do drop a bit more, I’m sure that instead of RAID, a simple SSD + HDD setup will work considerably better for most people. RAID cards are expensive, and if you’re not going to use hot-swap, in the end the SSD gives better performance and data safety is even better.

Speaking of data safety on SSDs, I don’t think I covered this in the previous entry… mainly because I didn’t know it yet. There are a few things to keep in mind when buying an SSD to handle critical data. All manufacturers advertise SSD speeds, but durability is given as MTBF, which for an SSD is completely useless. The real difference between one SSD and another is reliability. Basically:


TBW / DWPD (Terabytes Written / Device Writes Per Day). An SSD has a limited number of writes. Due to its architecture, the signal gets weaker with each write, until it becomes unreadable. To avoid this, drives lock themselves one way or another before reaching that point. The problem is that most of the time they go from working perfectly to completely “bricking” themselves in an instant. The manufacturer’s estimate of how much you can write to a drive comes in one of these two measures. DWPD depends on the drive’s capacity and warranty period; for 5 years and a 480GB drive, each DWPD equals 876TBW. For server drives we normally look for around 5,000TBW or more, or 5DWPD. Most consumer drives, however good, don’t even reach 0.5DWPD. In any case, there is usually a big margin between what the manufacturer states and what the drive actually has to endure before bricking; they play it safe here.

Data In-flight Protection

Basically: what happens to the data you wrote when the power goes out? Drives have RAM caches where they store and update translation tables, as well as the data they accumulate to write in bursts. This cache has two purposes: reducing the number of rewrites (and thereby extending the drive’s lifespan) and increasing the write speed perceived by the host. The problem is that when power is cut abruptly a lot of data can be lost; you can end up with empty files just because you tried to append a line to them in the preceding minutes. This happens with most consumer drives and is more noticeable with some brands and models. Server models claim to protect “in-flight” data: they usually carry capacitors that act as a battery for a few seconds, flushing the cache if power is lost. If the drive is for a server, or you need to store anything critical, it should come with something like this. As with TBW, manufacturers don’t advertise its absence; they only mention it on drives that have it, and it’s hard to spot.

Cold data retention

This one as a bonus: SSDs corrupt data if they are left unplugged from a computer for too long. Incredible, right? Especially if the drives are stored somewhere hot, they can lose data in a matter of weeks (7 weeks at 45ºC according to one article). At normal temperatures it’s about a year, and somewhere a bit cold maybe a couple. This is the exact opposite of HDDs, which retain cold data very well for a long time, as long as they aren’t exposed to magnetic fields.

What about you, do you use SSDs?