[Tile-serving] Approaches to enable database updates
Robert Buchholz
robertbuchholz at gmx.de
Wed Mar 18 16:19:28 UTC 2015
On 18.03.2015 09:24, Paul Norman wrote:
> What data structure is used to do a lookup of ways that reference a
> particular node?
>
> For those following along but not deeply involved in writing
> converters for whole-planet scale OSM data, the increased size and
> decreased speed of an import that can be updated is not caused by
> needing to find the properties of an object by ID, but by finding the
> parent ways of a node when the node has moved, or by answering a
> comparable question for relations.
>
> This is solved in a few ways.
>
> pgsnapshot and apidb have a way_nodes table which stores way id, node
> id and position in way. Indexes allow lookups to be done by node id or
> way id. The disadvantages of this method stem from the size of the
> table needed.
> id. The disadvantages of this method stem from the size of the table
> needed.
I am going with a reverse index approach similar to what you describe
pgsnapshot and apidb as doing: I build indexes (though that code
is currently disabled) that map each node, way and relation id to a list
of all way and relation IDs that refer to it. AFAIR, this takes less
than 30GB of additional space (a node referenced by at most a single way
or relation needs 8 bytes of additional storage; one referenced by more
than one way or relation needs 8 bytes per reference plus a constant 12
bytes for book-keeping). Each index is mostly a giant flat array stored
on disk as a single file, so looking up which ways refer to a given node
can be done with a single seek (whereas the database indices used in
pgsnapshot and apidb should need O(log(n)) seeks).
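To make the lookup concrete: in rough Python terms, it might look like
the sketch below. The high-bit tag and the 4-byte count are illustrative
guesses, not the actual on-disk format (though they would roughly
account for the constant 12 bytes of book-keeping mentioned above).

    import struct

    SLOT = 8  # bytes per entry in the flat array; entry i belongs to node id i

    def ways_referencing(index_file, overflow_file, node_id):
        # Single seek into the flat array.
        index_file.seek(node_id * SLOT)
        (slot,) = struct.unpack("<Q", index_file.read(SLOT))
        if slot == 0:
            return []                 # node not referenced by any way
        if slot & (1 << 63):          # high bit set: slot is an offset
            off = slot ^ (1 << 63)    # into an overflow list of way ids
            overflow_file.seek(off)
            (count,) = struct.unpack("<I", overflow_file.read(4))
            return list(struct.unpack("<%dQ" % count,
                                      overflow_file.read(8 * count)))
        return [slot]                 # exactly one referencing way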
Currently, I am building these indices the same way I resolve the
locations of nodes referenced by ways: either by linearly reading all
ways and randomly accessing the nodes each way references, hoping they
are still in RAM (for small extracts), or by partitioning the set of
nodes into subsets that fit in RAM and then, for each subset, linearly
reading all ways but resolving only those node references that fall
into the current subset (for planet dumps).
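For illustration, the partitioned variant might be sketched in Python
like this (read_nodes, read_ways and store are hypothetical helpers
standing in for whatever I/O layer is actually used):

    def resolve_locations_multipass(num_nodes, nodes_per_pass,
                                    read_nodes, read_ways, store):
        for lo in range(0, num_nodes, nodes_per_pass):
            hi = lo + nodes_per_pass
            # Keep only the locations of nodes in [lo, hi) in RAM.
            in_ram = {nid: loc for nid, loc in read_nodes()
                      if lo <= nid < hi}
            # Linear pass over all ways; resolve only the references
            # that fall into the current partition.
            for way_id, node_ids in read_ways():
                for pos, nid in enumerate(node_ids):
                    if lo <= nid < hi:
                        store(way_id, pos, in_ram[nid])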
But I have a third alternative in mind (not yet implemented) that should
dramatically reduce RAM consumption for creating these indices, and
still require only sequential reads and writes: Create a set of
temporary files, each of which covers about the same number of *nodes*
(e.g. file000 covers nodes with IDs 1 to 10M, file001 covers nodes with
IDs 10M to 20M, etc.). Then, sequentially read the set of all ways, and
for each node referenced by a way, store the tuple (wayId, nodeId) in
the one temporary file that covers that nodeId (OS disk caches should
ensure that the actual I/O needed to write out these temporary files is
essentially sequential). This leaves you with 300-odd files, each less
than one GB in size and each covering a range of 10M node IDs. Then, you
just have to read each file in turn sequentially, sort the set of
(wayId, nodeId) tuples it contains by nodeId (which can be done using
only a few hundred MB of RAM), and you have your list of reverse node
references - sorted by nodeId - that can be written to the final flat
index file sequentially.
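A minimal Python sketch of that two-pass scheme, assuming a simplified
final format of sorted (nodeId, wayId) pairs rather than the flat array
described above (read_ways is again a hypothetical helper):

    import os
    import struct

    BUCKET_SPAN = 10 * 1000 * 1000  # node ids covered per temporary file
    REC = struct.Struct("<QQ")      # one (wayId, nodeId) record, 16 bytes

    def build_reverse_index(read_ways, tmp_dir, out_path):
        buckets = {}
        # Pass 1: sequential scan over all ways; append each
        # (wayId, nodeId) tuple to the bucket file covering nodeId.
        for way_id, node_ids in read_ways():
            for nid in node_ids:
                b = nid // BUCKET_SPAN
                if b not in buckets:
                    name = os.path.join(tmp_dir, "file%03d" % b)
                    buckets[b] = open(name, "wb")
                buckets[b].write(REC.pack(way_id, nid))
        for f in buckets.values():
            f.close()
        # Pass 2: each bucket fits in RAM; sorting it by nodeId puts
        # the reverse references into final file order.
        with open(out_path, "wb") as out:
            for b in sorted(buckets):
                path = os.path.join(tmp_dir, "file%03d" % b)
                with open(path, "rb") as f:
                    data = f.read()
                recs = [REC.unpack_from(data, i)
                        for i in range(0, len(data), REC.size)]
                recs.sort(key=lambda r: r[1])         # sort by nodeId
                for way_id, nid in recs:
                    out.write(REC.pack(nid, way_id))  # store nodeId first
                os.remove(path)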
If I am not mistaken, a very similar approach could then also be used to
resolve the node locations each way refers to, based on the just-created
reverse index.
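One plausible way to read that: the reverse index is sorted by nodeId,
and if the node locations are stored sorted by nodeId as well, the two
can be merge-joined in a single pair of sequential scans (the helper
names below are hypothetical):

    def locations_for_reverse_refs(rev_refs, node_locs, emit):
        # rev_refs:  iterates (nodeId, wayId) pairs, sorted by nodeId
        # node_locs: iterates (nodeId, lat, lon), sorted by nodeId
        # Both inputs are consumed strictly sequentially; the emitted
        # (wayId, nodeId, lat, lon) tuples can then be bucketed and
        # sorted by wayId just like the temporary files above.
        locs = iter(node_locs)
        cur = next(locs, None)
        for nid, way_id in rev_refs:
            while cur is not None and cur[0] < nid:
                cur = next(locs, None)
            if cur is not None and cur[0] == nid:
                emit(way_id, nid, cur[1], cur[2])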