[Tile-serving] Performance of COORDS at updates

Wed Mar 18 08:24:14 UTC 2015

On 3/18/2015 12:56 AM, Robert Buchholz wrote:
> (I was using this mailing list in digest mode, and couldn't find a way 
> to directly reply to the messages of Paul and Lynn in that mode).
>
> As Lynn pointed out, updates themselves are not yet implemented in 
> COORDS. However, all data structures to support updates are in place 
> (flat files of all node, way and relation data, indexed by their 
> respective entity id). The six hours for data import already include 
> writing out these files (about 191GB for a planet dump) as well as 
> creating the actual geometry tiles. 
What data structure is used to do a lookup of ways that reference a 
particular node?

For those following along but not deeply involved in writing converters 
for whole-planet scale OSM data: the increased size and decreased speed 
of an import that can be updated are not caused by needing to find the 
properties of an object by ID, but by needing to find the parent ways of 
a node when the node has moved, or the comparable question for relations.

This is solved in a few ways.

pgsnapshot and apidb have a way_nodes table which stores the way id, 
node id and position in way. Indexes allow lookups to be done by node id 
or way id. 
The disadvantages of this method stem from the size of the table needed.
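
A minimal sketch of the separate-table approach, using SQLite from Python 
as a stand-in for PostgreSQL. The column and index names here are 
assumptions made for illustration, not the actual pgsnapshot or apidb 
schema, and the three ways are made-up sample data.

```python
import sqlite3

# In-memory database standing in for the real PostgreSQL instance.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE way_nodes (
        way_id      INTEGER NOT NULL,
        node_id     INTEGER NOT NULL,
        sequence_id INTEGER NOT NULL,  -- position of the node within the way
        PRIMARY KEY (way_id, sequence_id)  -- serves lookups by way id
    );
    -- Second index: serves the reverse lookup from a node to its parent ways.
    CREATE INDEX idx_way_nodes_node_id ON way_nodes (node_id);
""")

# Hypothetical sample data: way id -> ordered list of node ids.
ways = {100: [1, 2, 3], 101: [3, 4], 102: [5, 6]}
rows = [(way_id, node_id, pos)
        for way_id, nodes in ways.items()
        for pos, node_id in enumerate(nodes)]
conn.executemany("INSERT INTO way_nodes VALUES (?, ?, ?)", rows)


def parent_ways(node_id):
    """Return the ids of all ways that reference the given node."""
    cur = conn.execute(
        "SELECT DISTINCT way_id FROM way_nodes "
        "WHERE node_id = ? ORDER BY way_id",
        (node_id,),
    )
    return [row[0] for row in cur]


def nodes_of_way(way_id):
    """Return the node ids of a way, in order."""
    cur = conn.execute(
        "SELECT node_id FROM way_nodes "
        "WHERE way_id = ? ORDER BY sequence_id",
        (way_id,),
    )
    return [row[0] for row in cur]
```

The table holds one row per node reference, which is where the size 
problem comes from: a planet import needs a row, plus index entries, for 
every node of every way.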

osm2pgsql stores an array of nodes with each way and does an array 
overlap query (&&) which uses a GIN index built on the nodes column. The 
disadvantage of this is that GIN indexes are comparatively slow to build 
and rely on random IO when building. On a machine with a particularly 
fast CPU, fast sequential disk IO and slow random disk IO, building the 
GIN index takes the majority of the import time. This can be avoided 
with --slim --drop, which does not build the index and substantially 
reduces the import time, but the resulting database can then no longer 
be updated.
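
The idea behind the array-plus-GIN approach can be sketched as an 
inverted index from node id to the ways containing it. This is only an 
in-memory illustration in Python, not how PostgreSQL implements GIN; the 
function names and sample data are made up for the example.

```python
from collections import defaultdict

# Hypothetical sample data: each way stores its own array of node ids.
ways = {100: [1, 2, 3], 101: [3, 4], 102: [5, 6]}


def build_inverted_index(ways):
    """Map each node id to the set of way ids whose node array contains it."""
    index = defaultdict(set)
    for way_id, nodes in ways.items():
        for node_id in nodes:
            index[node_id].add(way_id)
    return index


def ways_overlapping(index, changed_nodes):
    """Return ways whose node array shares at least one element with
    changed_nodes, i.e. the question an array overlap (&&) query answers."""
    hits = set()
    for node_id in changed_nodes:
        # .get() avoids creating empty entries for unknown nodes.
        hits |= index.get(node_id, set())
    return sorted(hits)


index = build_inverted_index(ways)
```

For example, ways_overlapping(index, [3, 5]) finds every way touched by a 
change to node 3 or node 5. Note that building the index touches one 
entry per node reference, scattered across the whole key space.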

For assorted reasons, the separate table and B-tree index method is 
faster, but this turns out not to be particularly important with osm2pgsql.

With both of these you also have to update the data structure (index) as 
data changes, leading to bloat. GIN indexes are significantly worse for 
bloating than B-tree indexes.