[Tile-serving] [openstreetmap/osm2pgsql] [EXPERIMENTAL] Reduced way node index (#1058)

Sat Jan 11 08:49:21 UTC 2020

Following #1045, I've been experimenting with reducing the size of the way-node index by creating a lossy index.

OSM ways have a locality property that we can use to reduce the index size: they are often made up of sequential node ids. If node N is contained in the way, then there is a good chance that N+1, N+2, ... are contained in it as well. Thus, if we group nearby nodes and create an index from node groups to ways, the index will be significantly smaller. The drawback is that a lookup in such an index returns false positives, i.e. ways that do not contain the node of interest. So the smaller index is paid for with a performance loss for updates.

#### How it works

Creating buckets from node IDs is simple. Just divide through the expected bucket size, i.e. for bucket size 32: node_bucket = node_id/32. To get all buckets that a way has nodes in, take all nodes, divide through bucket size and remove the duplicates. In SQL that works like this:

	ARRAY(SELECT DISTINCT unnest(nodes)/32)

We could create an extra column in the place_osm_ways table to save our bucket array. As it turns out ([thank you Stackoverflow](https://stackoverflow.com/questions/11165450/store-common-query-as-column/11166268#11166268)), this is not really necessary. We can simply create a function that returns a semi-virtual column with our values:

```
   CREATE FUNCTION bucket_32(planet_osm_ways) RETURNS bigint[] AS
   $func$
	   SELECT array(select distinct unnest(nodes)/32)
	   FROM   planet_osm_ways b
	   WHERE  b.id = $1.id
  $func$ LANGUAGE SQL IMMUTABLE;
```

As long as the function is immutable, we can even create an index over that function:

```
CREATE INDEX planet_osm_bucket_32 on planet_osm_ways using gin(bucket_32(planet_osm_ways));
```

Now we can look up which ways a node is a member by looking up the bucket and the node:

```
  SELECT id FROM planet_osm_ways w
  WHERE 32001039= any(w.nodes) and ARRAY[32001039::bigint/32]] && w.bucket_32;
```

This uses our shiny new index:

```
QUERY PLAN                                                             
--------------------

 Bitmap Heap Scan on planet_osm_ways w  (cost=13.03..402.48 rows=7 width=198) (actual time=0.045..0.063 rows=2 loops=1)
   Recheck Cond: ('{1000032}'::bigint[] && bucket_32(w.*))
   Filter: (32001039 = ANY (nodes))
   Rows Removed by Filter: 8
   Heap Blocks: exact=4
   ->  Bitmap Index Scan on planet_osm_bucket_32  (cost=0.00..13.03 rows=137 width=0) (actual time=0.026..0.026 rows=10 loops=1)
		 Index Cond: ('{1000032}'::bigint[] && bucket_32(w.*))
 Planning Time: 0.579 ms
 Execution Time: 0.113 ms
```

### Performance

I've tested this with a recent planet against a database using the `null` output for bucket sizes 16.32 and 64.

Bucket size | none | 16 | 32 | 64
---------------|------|----|----|----
Time to create index (hh:mm) | 07:30 | 2:50 | 2:35 | 2:25
Index size (GB) | 276 | 29 | 17 | 11 |
Average number of ways retrieved | 1 | 6.3 | 10.6 | 18.3
time to update 1 day (s) | 361 | 419 | 428 | 431

"Average number of ways retrieved" is a rough estimate of how many ways need to be loaded from planet_osm_ways for each actual way result we expect.

The updates times are computed by running the same day update 4 times and throwing away the first result. So they are on a hot database and thus on the optimistic side. On a cold database I've seen up to 70% increase in the time. However, it should be noted that all this is done on a null output. Given that the update of the output tables is the dominating factor for updates, I'd expect a lower overall impact for a real world update of, say, an openstreetmap-carto database.

### TODO

We need proper tests for the invalidation first. I got the query wrong at the first try and the tests still past happily.

The bucket indexes still have the same issue with the query planner as described in #1045. So we likely still run into jit/parallel worker slowness.

The queries only started to be reasonably fast after a full 'VACUUM ANALZE' on the planet_osm_ways table. This may be because I created the indexes on top of an existing database. It may also be that the vacuum is always necessary.
You can view, comment on, or merge this pull request online at:

  https://github.com/openstreetmap/osm2pgsql/pull/1058

-- Commit Summary --

  * use a bucketed way-node index
  * fix prepared statement
  * fix index lookup id

-- File Changes --

    M src/middle-pgsql.cpp (19)
    M src/middle-pgsql.hpp (6)
    M tests/regression.py (2)

-- Patch Links --

https://github.com/openstreetmap/osm2pgsql/pull/1058.patch
https://github.com/openstreetmap/osm2pgsql/pull/1058.diff

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/openstreetmap/osm2pgsql/pull/1058
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openstreetmap.org/pipermail/tile-serving/attachments/20200111/d345b083/attachment-0001.htm>