[Tile-serving] Approaches to enable database updates
Robert Buchholz
robertbuchholz at gmx.de
Wed Mar 18 16:19:28 UTC 2015
On 18.03.2015 09:24, Paul Norman wrote:
> What data structure is used to do a lookup of ways that reference a
> particular node?
>
> For those following along but not deeply involved in writing
> converters for whole-planet scale OSM data, the increased size and
> decreased speed of an import that can be updated is not caused by
> needing to find the properties of an object by ID, but by finding the
> parent ways of a node when the node has moved, or by answering a
> comparable question for relations.
>
> This is solved in a few ways.
>
> pgsnapshot and apidb have a way_nodes table which stores way id, node
> id and position in way. Indexes allow lookups to be done by node id or
> way id. The disadvantages of this method stem from the size of the
> table needed.
> id. The disadvantages of this method stem from the size of the table
> needed.
I am going with a reverse index approach similar to what you describe
pgsnapshot and apidb as doing: I build indexes (though that code
is currently disabled) that map each node, way and relation id to a list
of all way and relation IDs that refer to it. AFAIR, this takes less
than 30GB of additional space (a node referenced by at most a single way
or relation needs 8 bytes of additional storage; one referenced by more
than one way or relation needs 8 bytes per reference plus a constant 12
bytes for book-keeping). Each index is mostly a giant flat array stored
on disk as a single file, so looking up which ways refer to a given node
can be done with a single seek (whereas the database indices used in
pgsnapshot and apidb should need O(log(n)) seeks).
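To make the lookup concrete: in rough Python terms, it might look like
the sketch below. The high-bit tag and the 4-byte count are illustrative
guesses, not the actual on-disk format (though they would roughly
account for the constant 12 bytes of book-keeping mentioned above).

    import struct

    SLOT = 8  # bytes per entry in the flat array; entry i belongs to node id i

    def ways_referencing(index_file, overflow_file, node_id):
        # Single seek into the flat array.
        index_file.seek(node_id * SLOT)
        (slot,) = struct.unpack("<Q", index_file.read(SLOT))
        if slot == 0:
            return []                 # node not referenced by any way
        if slot & (1 << 63):          # high bit set: slot is an offset
            off = slot ^ (1 << 63)    # into an overflow list of way ids
            overflow_file.seek(off)
            (count,) = struct.unpack("<I", overflow_file.read(4))
            return list(struct.unpack("<%dQ" % count,
                                      overflow_file.read(8 * count)))
        return [slot]                 # exactly one referencing way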
Currently, I am building these indices the same way I resolve the
locations of nodes referenced by ways: either by linearly reading all
ways and randomly accessing the nodes each way references, hoping they
are still in RAM (for small extracts), or by partitioning the set of
nodes into subsets that fit in RAM and then, for each subset, linearly
reading all ways but resolving only those node references that fall
into the current subset (for planet dumps).
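For illustration, the partitioned variant might be sketched in Python
like this (read_nodes, read_ways and store are hypothetical helpers
standing in for whatever I/O layer is actually used):

    def resolve_locations_multipass(num_nodes, nodes_per_pass,
                                    read_nodes, read_ways, store):
        for lo in range(0, num_nodes, nodes_per_pass):
            hi = lo + nodes_per_pass
            # Keep only the locations of nodes in [lo, hi) in RAM.
            in_ram = {nid: loc for nid, loc in read_nodes()
                      if lo <= nid < hi}
            # Linear pass over all ways; resolve only the references
            # that fall into the current partition.
            for way_id, node_ids in read_ways():
                for pos, nid in enumerate(node_ids):
                    if lo <= nid < hi:
                        store(way_id, pos, in_ram[nid])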
But I have a third alternative in mind (not yet implemented) that should
dramatically reduce RAM consumption for creating these indices, and
still require only sequential reads and writes: Create a set of
temporary files, each of which covers about the same number of *nodes*
(e.g. file000 covers nodes with IDs 1 to 10M, file001 covers nodes with
IDs 10M to 20M, etc.). Then, sequentially read the set of all ways, and
for each node referenced by a way, store the tuple (wayId, nodeId) in
the one temporary file that covers that nodeId (OS disk caches should
ensure that the actual I/O needed to write out these temporary files is
essentially sequential). This leaves you with 300-odd files, each less
than one GB in size and each covering a range of 10M node IDs. Then, you
just have to read each file in turn sequentially, sort the set of
(wayId, nodeId) tuples it contains by nodeId (which can be done using
only a few hundred MB of RAM), and you have your list of reverse node
references - sorted by nodeId - that can be written to the final flat
index file sequentially.
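A minimal Python sketch of that two-pass scheme, assuming a simplified
final format of sorted (nodeId, wayId) pairs rather than the flat array
described above (read_ways is again a hypothetical helper):

    import os
    import struct

    BUCKET_SPAN = 10 * 1000 * 1000  # node ids covered per temporary file
    REC = struct.Struct("<QQ")      # one (wayId, nodeId) record, 16 bytes

    def build_reverse_index(read_ways, tmp_dir, out_path):
        buckets = {}
        # Pass 1: sequential scan over all ways; append each
        # (wayId, nodeId) tuple to the bucket file covering nodeId.
        for way_id, node_ids in read_ways():
            for nid in node_ids:
                b = nid // BUCKET_SPAN
                if b not in buckets:
                    name = os.path.join(tmp_dir, "file%03d" % b)
                    buckets[b] = open(name, "wb")
                buckets[b].write(REC.pack(way_id, nid))
        for f in buckets.values():
            f.close()
        # Pass 2: each bucket fits in RAM; sorting it by nodeId puts
        # the reverse references into final file order.
        with open(out_path, "wb") as out:
            for b in sorted(buckets):
                path = os.path.join(tmp_dir, "file%03d" % b)
                with open(path, "rb") as f:
                    data = f.read()
                recs = [REC.unpack_from(data, i)
                        for i in range(0, len(data), REC.size)]
                recs.sort(key=lambda r: r[1])         # sort by nodeId
                for way_id, nid in recs:
                    out.write(REC.pack(nid, way_id))  # store nodeId first
                os.remove(path)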
If I am not mistaken, a very similar approach could then also be used to
resolve the node locations each way refers to, based on the just-created
reverse index.
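One plausible way to read that: the reverse index is sorted by nodeId,
and if the node locations are stored sorted by nodeId as well, the two
can be merge-joined in a single pair of sequential scans (the helper
names below are hypothetical):

    def locations_for_reverse_refs(rev_refs, node_locs, emit):
        # rev_refs:  iterates (nodeId, wayId) pairs, sorted by nodeId
        # node_locs: iterates (nodeId, lat, lon), sorted by nodeId
        # Both inputs are consumed strictly sequentially; the emitted
        # (wayId, nodeId, lat, lon) tuples can then be bucketed and
        # sorted by wayId just like the temporary files above.
        locs = iter(node_locs)
        cur = next(locs, None)
        for nid, way_id in rev_refs:
            while cur is not None and cur[0] < nid:
                cur = next(locs, None)
            if cur is not None and cur[0] == nid:
                emit(way_id, nid, cur[1], cur[2])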