[OSM-dev] invalid geometry in planet_polygon

Tue Aug 24 20:36:15 BST 2010

On Tue, Aug 24, 2010 at 11:42 AM, Stefan de Konink <stefan at konink.de> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA512
>
> Op 24-08-10 18:37, nimix schreef:
>> Your right if you want to do a ... WHERE St_IsValid(geom) query, the planner
>> don't uses the index because 99% of geometries are valid, so an index scan
>> seems to be useless. In that case I made postgres use the index by double
>> negation, but that only works fast if there are few invalid geometries:
>
> Then the planner is stupid, because it is actually checking stuff that
> is already indexed. It should rewrite the query to use the index, that
> is the index for.
>
> If PostgreSQL pretends that it accepts manual indices then it should use
> them.
>

Postgres probably knows from statistics that almost all rows pass the
test, therefore. If it did a sequential scan on the index, to get
St_IsValid rows, it expects it would end up doing an extra O(n) random
accesses within the table to actually fetch the rows and it won't pay
the extra seeks unless the table happens to be well clustered with
respect to that index. (Or conversely, if it has another small row
set, generated with another index, it won't consider looking them up
in the St-IsValid index to be worth the O(n) random seeks). As the
table probably isn't clustered with respect to that index, it thinks
that this is really inefficient (and may be right). You might be able
to force the issue by defining St_IsValid() function to have a high
cost.

Would something like this work?

  CREATE INDEX idx_ploygon_geom_with_valid ON (St_IsValid(geom),
<other columns>) planet_osm_polygon USING gist (way);

Here, for any search that uses <other columns>, it gets the
St_IsValid() test for free with no additional seeks.

More pragmatically, if there are only a few 100k bad ways out of 50m,
I think the simplest way is to cache the validity information in an
extra column, updated either by triggers or using a view&rewrite rule.
Filtering off the small fraction of bad rows will be cheap.

Scott