[osmosis-dev] Improving pgsnapshot

Sun Jul 21 10:37:48 UTC 2013

On 20 July 2013 12:06, Paul Norman <penorman at mac.com> wrote:

> From: Brett Henderson [mailto:brett at bretth.com]
> Subject: Re: [osmosis-dev] Improving pgsnapshot
>
> > > Schema details:
> > >
> > > I want to add a nodes.ways bigint[] column to the nodes table and use
> it for
> > > node->ways lookups, then use the existing ways.nodes bigint[] column
> for
> > > way->nodes
> >
> > What is the motivation for this?  I'm guessing that performance is
> likely to
> > increase because we get to make better use of geo-spatial indexes ... It
> seems
> > like a good idea.  I haven't had a look at the existing queries to
> determine
> > the impact of making this change.
>
> I'm explaining this in a good amount of detail because it helps me to write
> it out for planning purposes, so you get one of my long technical emails :)
>
> I'll use the example of a map? query, but ignore relations. You can
> generalize this to almost any task that has to use way_nodes right now
>

Do you have the linestring column available in your ways table?  If so, I
don't believe you need to touch the way_nodes table for map queries.

>
> The way to do this query is:
>
> 1. Add all nodes in the bounding box to a nodes list (nodes_from_bbox)
>
> 2. Find all parent ways of those nodes and add them to a ways list
> (ways_from_nodes)
>

If you have the linestring column available, you can query the ways table
directly and avoid touching the way_nodes table.  There should be no need
to query for the parent ways of nodes.  I might be missing something in
your explanation though because you mention the linestring column further
down in your email.

>
> 3. Add to the nodes list any child nodes of those ways where the child
> nodes
> are not already in the nodes list (nodes_from_way_nodes)
>

Are you using the ways.nodes column here, or the way_nodes table?

> 4. Serialize the nodes in the nodes list and ways in the ways list
>
> See the cgimap code for details:
>
> https://github.com/pnorman/openstreetmap-cgimap/blob/gsoc13/src/api06/map_handler.cpp#L22-36
>
> https://github.com/pnorman/openstreetmap-cgimap/blob/gsoc13/src/backend/pgsnapshot/snapshot_selection.cpp#L382-477
>
> A couple of points that might not be immediately obvious
>
> A. Because you need the full node/way row in 4 it is best to insert the
> entire way/node row into the temp tables in 1-3.
>
> B. The ways selected by 2 are a subset of those meeting
> ST_Intersects(bbox,way.linestring).
>

So why not just use the above query instead of using 2?

> Point A is what matters the most.
>
> A nodes.parent_ways column would speed up 2, and not impact 1, 3 or 4.
>
> Before going into why, the typical indexes of relevance are:
>
> On way_nodes: "idx_way_nodes_node_id" btree (node_id) [52GB]
>
> On ways: "pk_ways" PRIMARY KEY, btree (id) [4.5GB]
>
> Currently, the best query plan for a sanely sized bbox involves
>
> - a sequential scan on tmp_nodes
> - index scanning using idx_way_nodes_node_id
> - sorting the resulting list of way_ids to remove duplicates
> - index scanning using pk_ways
>
> By *far* the slowest part is using idx_way_nodes_node_id.
>
> With my proposed parent_ways column the best query plan should be
>
> - a sequential scan on tmp_nodes
> - unnest(parent_ways)
> - sorting the resulting list of way_ids to remove duplicates
> - index scanning using pk_ways
>
> My initial tests show a ~10x increase in speed, but this may scale
> upwards on a large database
>

I also found ~10x increase in speed when I added the ways.nodes column,
although that was mainly due to being able to use only tables clustered by
geo-spatial indexes.  That allowed me to avoid hitting way_nodes in my
queries.

>
> The second reason is for one of space reasons.
>
> way_nodes is 240GB total, with 120GB data and 120GB indexes
>
> Adding a parent_ways column would add an estimated 66GB to the nodes
> table, bringing it from 194GB to 260GB, for a total DB size reduction
> of 174GB.
>

That sounds great.

>
> The sole concern of mine is that by making the nodes table wider it
> increases sort memory requirements and decrease sequential scan speed.
> This could be addressed by instead making a node_ways table with
> columns (node_id bigint PRIMARY KEY, parent_ways bigint[])
>
> The problem with creating a separate table is that you'll no longer be
able to cluster the table according to a geo-spatial index which kills IO
throughput, at least on spinning disks.

I'm not sure I've followed everything in your email so my comments may not
be accurate.  If case it's useful, the Osmosis bounding box query
implementation is in the following class in the iterateBoundingBox method.
Note that it only creates three temporary tables containing the node, way
and relation ids that are subsequently retrieved via joins back onto the
nodes, ways and relations tables to retrieve the result data.  Also note
that it has three way query implementations that use either way.linestring
column if available, or way.bbox column if available, or fallback to
way_nodes query if neither column is available.
https://github.com/brettch/osmosis/blob/master/osmosis-pgsnapshot/src/main/java/org/openstreetmap/osmosis/pgsnapshot/v0_6/impl/PostgreSqlDatasetContext.java

Brett
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openstreetmap.org/pipermail/osmosis-dev/attachments/20130721/53733e8a/attachment.html>