[osmosis-dev] pgsnapshot composite index results

Tue Jan 22 09:53:31 GMT 2013

I frequently use my pgsnapshot database for unusual purposes and end up
running non-standard queries.

The standard indexes for pgsnapshot nodes include a GiST index on geom.
Another common index suggested by the jxapi installation instructions[1] is
a GIN index on tags.

These indexes work well when you have a query that is highly selective
spatially or against the tags but are frequently not ideal against queries
combining a medium selective spatial condition with a medium selective tag
condition.

While working on addressmerge[2] I encountered a situation where the query
SELECT * FROM local_all; was quicker than SELECT * FROM local_all WHERE tags
? 'addr:housenumber'; local_all was a view of the nodes, ways and
multipolygons[3] in the local area. The speed difference was caused by a
non-optimal query plan of a query of the form SELECT * FROM nodes WHERE
st_intersects (geom,'my_geom'::geometry) AND tags ? 'addr:housenumber';
where my_geom was the EWKT for a polygon covering the area of interest.

The query plan for the first query involved an index scan of the geom gist
index. The second involved a bitmap and of the geom gist and tags gin
indexes. Unfortunately, due to the limitations of hstore statistics this was
likely not the optimal plan. An exploration of options in #postgresql lead
to the discussion of a composite gist index on (geom, tags) as an
alternative indexing strategy, which is what this message is about (after
this rather lengthy preamble.)

A composite index would be created with a statement like CREATE INDEX [
CONCURRENTLY ] ON nodes USING gist (geom, tags); This index can benefit
statements that are moderately selective in both geom and tags, but is more
important that geom be selective than that tags be selective.

All tests were done with replication stopped on my home server on a 6 7200
RPM drive RAID10 array, 32GB RAM, queries repeated to ensure consistent
caching (i.e. everything in memory). The initial runs of the queries were
obviously substantially slower from disk, but similar behavior was observed
there.

The creation of the composite index took 24 hours, non-currently. I do not
have the creation time for the non-composite index, but I would estimate it
at 18 hours. The indexes are 153GB and 84GB respectively.

With use of transactions it is possible to drop an index then ROLLBACK the
transaction, allowing for easy testing of different combinations of queries
and indexes.

For the following table to make sense, use a fixed-width font.

With the WHERE tags ? 'addr:housenumber' restriction:
                geom index   (geom, tags) index
Total time:       3000ms        222ms
Total cost:      61622         1166
Estimated rows:     28           28
Actual rows:     78873        78873

Without the WHERE restriction:
                geom index  (geom, tags) index
Total time:         386ms       400ms
Total cost:      345222      347339
Estimated rows:   27986       27986
Actual rows:     184644      184644
Index scan time:     47ms        59ms

The run to run variation in total speed without the tags restriction is
greater than the different in indexes, but there is a noticeable difference
in index scan time.

Using a rectangle covering the southwest of BC I ran some further queries to
investigate the index scan time. Total query time was about 9 seconds, but
it's the index scan part we're interested in.

Forcing the composite index to be used increased the scan time from 1.28s to
1.58s, an approximately 20% increase. The rest of the query time remained
approximately constant.

Putting these results into an xapi context, the use of a composite index
would slow down map? type queries. The index scan is a small part of the
total response time. If most of the time is spent retrieving nodes for
backfilling (done by ID), serializing XML or doing joins with way_nodes or
relation_members, the time spent scanning the index is a minor issue. 

As a gist composite index is substantially slower than a gin index it would
not replace it. This would mean there would be essentially no speed change
for *[key=value] queries without a bbox restriction.

Where it would substantially speed up queries is for moderately selective
ones, e.g. fetch all Starbucks in the bounding polygon for the US, or the
case where I fetched all addresses in a city.

Something that I haven't touched on yet is updates. The speed of osmosis
updates to pgsnapshot databases is not well explored. Toby has investigated
slow queries that occur during diff processing[4] and the queries he
investigated did not involve any use of geometry indexes, but the statement
that he looked at does involve an update of the ways linestring. These
updates would require updates to the geometry index which would presumably
be slower with a composite index. I don't have much experience in reading
EXPLAIN results for updates, but I think about 25% of the time is spent
updating the row and indexes. I have no idea how much of this is spent on
linestring index updates.

What I also don't know is how much time is spent inserting nodes. I would
expect that most changes to OSM are the creation of nodes and even if these
queries are individually quick they may compose a significant portion of the
overall update time by sheer volume. 

Is a composite index worth it? It depends on your use case. If you are
purely using a pgsnapshot database from osmosis which never uses the tags in
queries then it is clearly not worth it. For xapi map? queries it is also
not worth it. Anything involving both geographic filters and tag filters may
benefit from it, but at cost of potentially slower queries for purely
spatial queries and an unknown impact on updates. There is also a disk space
hit to consider, although an additional 70GB of indexes on a database that
is already 750GB may not be a huge issue.

Something to keep in mind for an xapi situation is that the more IO time
spent on updates the less that can be spent on queries, balancing out the
speed increase from faster queries. On the other hand, a 10x increase (or
better!) on the right queries is significant.

On case where it's a clear winner is where *all* queries involve both a
spatial and tag component and there isn't a need for a separate gin index if
the composite index is used. The separate gist geom and gin tag indexes
could then be replaced by one gist (geom, tags) index, saving space and not
slowing down updates with additional index updates.

[1]: https://github.com/iandees/xapi-servlet

[2]:
https://github.com/pnorman/addressmerge/blob/c4a26eb6/addressmerge.py#L56

[3]: Complete MP tag handling is not required for this application so
https://github.com/pnorman/addressmerge/blob/c4a26eb6/addressmerge.py#L66 is
sufficient

[4]:
http://lists.openstreetmap.org/pipermail/osmosis-dev/2013-January/001478.htm
l