[osmosis-dev] Improving pgsnapshot

Sun Jul 21 11:44:02 UTC 2013

> From: Brett Henderson [mailto:brett at bretth.com] 
> Sent: Sunday, July 21, 2013 3:38 AM
> To: Paul Norman
> Cc: osmosis-dev
> Subject: Re: [osmosis-dev] Improving pgsnapshot
> 
> > 2. Find all parent ways of those nodes and add them to a ways list
> > (ways_from_nodes)
> 
> If you have the linestring column available, you can query the ways 
> table directly and avoid touching the way_nodes table.  There should 
> be no need to query for the parent ways of nodes.  I might be missing 
> something in your explanation though because you mention the linestring 
> column further down in your email.

I don't want to require the linestring column. I'm also wondering what 
the linestring column does to update speed, dead tuples (and index bloat 
and autovacuum frequency). If you can do away with it you also save 40GB 
for data and 10GB for the index.

> > 3. Add to the nodes list any child nodes of those ways where the child nodes
> > are not already in the nodes list (nodes_from_way_nodes)
> 
> Are you using the ways.nodes column here, or the way_nodes table?

ways.nodes, you should pretty much never use way_nodes in this direction, 
particularly if you've already had to select the ways out. Since I've 
selected the ways already it's only an unnest(tmp_ways.nodes) which runs 
without having to hit disk at all in something like a 100th of the time.

I would **never** use way_nodes for ways->nodes without a solid set of 
benchmarks showing it was faster for my case. There are some topology 
analysis tasks that *might* be faster on way_nodes.

Of note: you can actually generate way_nodes fairly easily from ways.nodes 
by unnesting it.

SELECT id,unnest(nodes) FROM ways; If you need sequences you can get them 
but the SQL is a bit more complex. Of course, if you take this, create it 
as a view, then try to run nodes->ways on it you'll get horrible performance.

> > B. The ways selected by 2 are a subset of those meeting ST_Intersects(bbox,way.linestring).
> 
> So why not just use the above query instead of using 2?

Two reasons

1. ST_Intersects(bbox,way.linestring) is way faster for large bboxes, 
but I haven't benchmarked it for small ones.

2. It changes the results. This absolutely rules it out, but the 
performance is complicated.

I am not longer returning a map query, and the differences for bounding boxes 
with a high aspect ratio are significant. Also, ST_Intersects is really 
_ST_Intersects AND &&. The && is fast, but for a small bbox the ratio of 
rows matching && but not matching _ST_Intersects becomes high, so you end 
up fetching unnecessary rows. This may not be a serious issue, but I'd want
to benchmark it against a real set of map? calls.

That being said, one option is to get a list of ways from ST_Intersects 
and then filter them by ways.nodes against tmp_nodes, making use of the subset
fact to get the desired results.

> The problem with creating a separate table is that you'll no longer be able 
> to cluster the table according to a geo-spatial index which kills IO 
> throughput, at least on spinning disks.

It's a moot point for me as, I've never been able to CLUSTER the nodes 
table by geom on a full planet, it takes too long and postgres tends to 
error out. Also, you have to recluster periodically and that kind of 
downtime is excessive.

As its own table it would not impact performance of WHERE nodes.geom && bbox 
queries as you could still cluster that the same way as now. It would impact
joins because the rows joined to would be scattered.

> Note that it only creates three temporary tables containing the node, way and 
> relation ids that are subsequently retrieved via joins back onto the nodes, 
> ways and relations tables to retrieve the result data.  

My cgimap queries are similar, except I store the full row to avoid having to go 
through the indexes a second time. Yes, the indexes and rows from the big tables 
are memory, but I know I'll need the full row to serialize and for ways->nodes.

> Also note that it has three way query implementations that use either 
> way.linestring column if available, or way.bbox column if available, or 
> fallback to way_nodes query if neither column is available.

I would suggest never using way.bbox by itself for a map?-like query. For example, 
a nodes.geom && bbox condition in China will often return the Chinese border, 
even if you're nowhere near it. You then have to pull out that giant relation 
and handle it.

I will have a look at the SQL though, see if I can gain anything, or contribute 
any of my optimizations.