[osmosis-dev] Improving pgsnapshot

Sat Jul 20 02:06:07 UTC 2013

From: Brett Henderson [mailto:brett at bretth.com] 
Subject: Re: [osmosis-dev] Improving pgsnapshot

> > Schema details:
> > 
> > I want to add a nodes.ways bigint[] column to the nodes table and use it for
> > node->ways lookups, then use the existing ways.nodes bigint[] column for
> > way->nodes
> 
> What is the motivation for this?  I'm guessing that performance is likely to 
> increase because we get to make better use of geo-spatial indexes ... It seems 
> like a good idea.  I haven't had a look at the existing queries to determine 
> the impact of making this change.

I'm explaining this in a good amount of detail because it helps me to write 
it out for planning purposes, so you get one of my long technical emails :) 

I'll use the example of a map? query, but ignore relations. You can 
generalize this to almost any task that has to use way_nodes right now

The way to do this query is: 

1. Add all nodes in the bounding box to a nodes list (nodes_from_bbox)

2. Find all parent ways of those nodes and add them to a ways list 
(ways_from_nodes)

3. Add to the nodes list any child nodes of those ways where the child nodes 
are not already in the nodes list (nodes_from_way_nodes)

4. Serialize the nodes in the nodes list and ways in the ways list

See the cgimap code for details:
https://github.com/pnorman/openstreetmap-cgimap/blob/gsoc13/src/api06/map_handler.cpp#L22-36
https://github.com/pnorman/openstreetmap-cgimap/blob/gsoc13/src/backend/pgsnapshot/snapshot_selection.cpp#L382-477

A couple of points that might not be immediately obvious

A. Because you need the full node/way row in 4 it is best to insert the 
entire way/node row into the temp tables in 1-3.

B. The ways selected by 2 are a subset of those meeting ST_Intersects(bbox,way.linestring).

Point A is what matters the most.

A nodes.parent_ways column would speed up 2, and not impact 1, 3 or 4.

Before going into why, the typical indexes of relevance are:

On way_nodes: "idx_way_nodes_node_id" btree (node_id) [52GB]

On ways: "pk_ways" PRIMARY KEY, btree (id) [4.5GB]

Currently, the best query plan for a sanely sized bbox involves 

- a sequential scan on tmp_nodes
- index scanning using idx_way_nodes_node_id
- sorting the resulting list of way_ids to remove duplicates 
- index scanning using pk_ways

By *far* the slowest part is using idx_way_nodes_node_id.

With my proposed parent_ways column the best query plan should be

- a sequential scan on tmp_nodes
- unnest(parent_ways)
- sorting the resulting list of way_ids to remove duplicates 
- index scanning using pk_ways

My initial tests show a ~10x increase in speed, but this may scale 
upwards on a large database

The second reason is for one of space reasons.

way_nodes is 240GB total, with 120GB data and 120GB indexes

Adding a parent_ways column would add an estimated 66GB to the nodes 
table, bringing it from 194GB to 260GB, for a total DB size reduction 
of 174GB.

The sole concern of mine is that by making the nodes table wider it 
increases sort memory requirements and decrease sequential scan speed. 
This could be addressed by instead making a node_ways table with 
columns (node_id bigint PRIMARY KEY, parent_ways bigint[])