[Tile-serving] Osm2pgsql table partitioning

Sat Jun 1 17:17:07 UTC 2013

Hello everyone,

with the continued rapid growth of osm data, the tables osm2pgsql 
generates are becoming rather large and unwieldy as well. Although for 
high zoom rendering the spatial indices appear to be doing a pretty good 
job, allowing for fast high zoom queries, doing a low zoom query on a > 
40GB table just takes time, even with SSDs.

For the largest table, the planet_osm_line table, there has long been a 
solution to this by using the planet_osm_roads table, with only a small 
subset of data from the lines table needed for low zoom rendering. 
However, the polygon table, that is also pretty large by now, so far 
remains monolithic and furthermore, the roads table duplicates data, 
making the database even bigger than it already is.

Once again trying to address this issue, I have created a proof of 
concept [1] patch that allows to specify arbitrary table partitionings. 
This should have advantages over both the use of partial special 
indicies and completely separate tables as the planet_osm_roads table.

Partitioning shouldn't increase the data size (by much), as it doesn't 
duplicate any data (neither actual data, nor index data). As SSDs are 
still not super cheap, this should be an advantage over both additional 
partial spacial indices and entirely separate tables.

As queries can (and should) still be run against the parent table, the 
database schema remains the same from the perspective of the data users. 
Therefore this should remain backwards (and forwards) compatible and 
should continue to allow easy sharing of style sheets between tile 
server instances (even though it might loose the efficiency if the style 
and partitioning don't fit together).

So hopefully table partitioning should be a viable option for speeding 
up rendering without two many downsides.

That's however the theory. The question is, does this work in practice 
as well and does it actually gain any significant performance 
improvements making it worth the extra complexity?

What would be the most useful partitionings to test this on? What would 
likely show the biggest speed improvements with the default style sheet?

One possibility would be to partition the polygon table into "buillding 
is null" and "building is not null". That would split the tables roughly 
6:1, and many of the slower queries in the default stylesheet already 
have a "where building is null" clause. Another possibility is to split 
the polygon table by area_size. Then one can add a where condition to 
the stylesheets to only retrieve polygons with a certain minimum size 
(e.g. the size of a pixel).

For the point table, one could potentially partition into place is null 
and not null, as place points are more likely used in low zoom rendering.

So what type of partitionings would be worth testing? And what would the 
interface ideally look like with which to configure these partitionings? 
At the moment they are still hard coded in a config array at the top of 
output-pgsql.c ( 
https://github.com/apmon/osm2pgsql/blob/partitioning/output-pgsql.c#L70 ).

Any other thoughts and suggestions?

Kai

[1] https://github.com/apmon/osm2pgsql/tree/partitioning
[2] http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html