[osmosis-dev] Bad PostgreSQL plans with pgsnapshot

Paul Norman penorman at mac.com
Tue Apr 16 08:31:55 UTC 2013


I ran across a PostgreSQL query planner bug today when benchmarking
pgsnapshot. This may affect others so I'm documenting it.

A common task is to extract a bounding box or polygon from the database,
equivalent to an API map? call.

The way I was benchmarking was a multi-step process, but there are two steps
that are relevant here.

The first is getting all the nodes in the area. The query for this is

CREATE TEMPORARY TABLE bbox_nodes AS
  SELECT * FROM nodes
  WHERE geom && ST_SetSRID(
    ST_MakeBox2D(ST_Point(-87.75,41.80), ST_Point(-87.65,41.85)), 4326);
ANALYZE bbox_nodes;

The next is to get the ways with any of those nodes as members.

I started this with 

SELECT DISTINCT ... FROM ways
  JOIN way_nodes ON (ways.id=way_nodes.way_id)
  JOIN bbox_nodes ON (way_nodes.node_id=bbox_nodes.id);

This was basically the query I intended to benchmark, but I was getting
horrible query plans and run times, so I split the query up, arriving at
this:

SELECT DISTINCT way_nodes.way_id
  FROM way_nodes 
  JOIN bbox_nodes ON (way_nodes.node_id=bbox_nodes.id);

I was still getting bad query plans involving a sequential scan of
way_nodes, a 2.2 billion row table.

After much help from RhodiumToad in #postgresql, the cause was tracked down
to a bug in the query planner where a fixed fudge factor accounted for
almost all of the estimated cost of fetching random pages. The cost estimate
was about 356, of which 348 came from the fudge factor. This caused the
planner to vastly overestimate the cost of the index lookups, making it fall
back to a sequential scan.
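You can check which plan your own server picks with EXPLAIN (a sketch: the
plan shape and cost numbers will vary with your data, version, and settings):

```sql
-- Show the plan chosen for the way lookup.
EXPLAIN
SELECT DISTINCT way_nodes.way_id
  FROM way_nodes
  JOIN bbox_nodes ON (way_nodes.node_id = bbox_nodes.id);
-- On affected versions the output shows a Seq Scan on way_nodes
-- instead of an index scan on its node_id index.
```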

This bug has already been fixed in 9.3
(http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;f=src/backend/utils/adt/selfuncs.c;h=31f38f28b00cbe2b9267205359e3cf7bafa1cb97)
but not backported to earlier versions.

The workaround was to increase cpu_tuple_cost to 0.1. This makes the planner
use the indexes, resulting in a 50s query for a 350k-node, 60k-way bbox
instead of a query aborted at 10 minutes.
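Concretely, the workaround looks like this (0.1 is the value that worked
here; the default is 0.01, and the right value may differ on other setups):

```sql
-- Raise cpu_tuple_cost for this session so index scans look
-- cheaper relative to the inflated random-page estimate.
SET cpu_tuple_cost = 0.1;

SELECT DISTINCT way_nodes.way_id
  FROM way_nodes
  JOIN bbox_nodes ON (way_nodes.node_id = bbox_nodes.id);
```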

If you are running a query against a small bbox (50k nodes), this bug does
not matter, as the fudge factor is not large enough relative to the total
cost to change the plan.

If you are running it against a very large bbox (~10-100 million nodes?), it
also does not matter; a sequential scan will be faster there anyway.

I do not know if this bug impacts the osmosis --dataset-bounding-box task,
jxapi or pyxapi. 

I do not know if set enable_seqscan=false is an appropriate solution. If you
have an index on bbox_nodes (id), it is not, as it will force index usage on
that table even when the query needs to fetch 100% of the rows anyway.
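If you do experiment with enable_seqscan, it is safer to scope it to a
single transaction with SET LOCAL so the setting resets automatically at
COMMIT or ROLLBACK rather than lingering for the whole session:

```sql
BEGIN;
-- Disable sequential scans for this transaction only.
SET LOCAL enable_seqscan = off;
SELECT DISTINCT way_nodes.way_id
  FROM way_nodes
  JOIN bbox_nodes ON (way_nodes.node_id = bbox_nodes.id);
COMMIT;  -- enable_seqscan reverts here
```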



