[OSM-dev] keeping thematic planet extract up to date

Igor Podolskiy igor.podolskiy at vwi-stuttgart.de
Tue Oct 18 18:06:43 BST 2011


Hi Martijn,

> this seems to run OK, but invariably, after letting this run for a few
> hours with a 5 minute interval (to catch up, my initial extract is a
> couple of months old) the database table only holds a small number (less
> than 20) nodes. What is going wrong here?

well, sorry to say that, but it has multiple problems :)

1. Your filter (--tf accept-ways x=* --tf accept-nodes x=*) doesn't do 
what you want, because it filters out _all_ nodes that aren't tagged 
with gnis:id=*, including the nodes that make up the gnis:id=* ways. So 
you end up with a bunch of ways in your stream that have empty 
geometries. This is probably the main reason you see only a small 
number of nodes in the DB: there's nothing else --wp or --wpc can write 
to the DB. See my earlier post [1] about how to do tag-based filtering 
with osmosis (there's also a small sketch after point 2 below) :)

2. In the long run, you'll get wrong data if you only store the filtered 
data. Consider this scenario:

T+0: you get your initial extract; way 12345 has no gnis:id, so it gets 
filtered out and is not stored in the DB
T+1: somebody sets gnis:id=foo on way 12345
T+2: you get a change stream from replication which says "update way 
12345 with these tags", but you have nothing to update. From your point 
of view, this "update" is really a "create" - but nobody except you 
knows that. Worse still, you have no nodes for this way because they 
were filtered out at T+0 and are not included in the change stream. No 
nodes -> no geometry, even if you manage to sneak the way object into 
your DB somehow.
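Coming back to point 1: if what you actually want is "all ways tagged 
gnis:id=* plus the nodes they reference", one way to express that with 
osmosis is to accept the ways by tag and then pull in the referenced 
nodes with --used-node. This is only a sketch with placeholder file 
names; adapt it to your own input:

-------
#!/bin/bash
# sketch: keep ways tagged gnis:id=*, drop relations, and keep only the
# nodes that those ways reference (--used-node does the last part)
osmosis --rb planet-extract.osm.pbf \
    --tf accept-ways gnis:id=* \
    --tf reject-relations \
    --used-node \
    --wb gnis-ways.osm.pbf
-------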

I maintain some thematic extracts for my work myself. Here's what I do:

-------
#!/bin/bash
# archive the last known good version
mv germany-railways.osm.pbf germany-railways.osm.pbf.1

# replicate the full extract, calls osmosis --rri
$HOME/scripts/get-changes.sh germany-boxed.osm.pbf state

# "thematic filtering", calls osmosis to filter out the railways
$HOME/scripts/filter-railways.sh germany-boxed.osm.pbf \
    germany-railways.osm.pbf

# derive a change file for the railways
osmosis --rb germany-railways.osm.pbf --sort \
    --rb germany-railways.osm.pbf.1 --sort \
    --derive-change bufferCapacity=10000 \
    --lpc --wxc railways.osc

# update the DB (this is the osm2pgsql equivalent of --wpc)
osm2pgsql -U podolsir -d gis --prefix osm_railways -a -m -s \
    -S $HOME/scripts/railways.style railways.osc
-------
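The helper scripts aren't shown here; filter-railways.sh follows the 
same pattern as the gnis:id sketch above, just with railway=*, and 
get-changes.sh boils down to something like this (a simplified sketch, 
not my exact script - paths and bbox handling are placeholders):

-------
#!/bin/bash
# get-changes.sh (sketch): fetch replication diffs with --rri and apply
# them to the full extract; $1 = extract file, $2 = replication working
# directory (a --bounding-box step could be added to re-clip the result)
osmosis --rri workingDirectory="$2" --simplify-change \
    --rb "$1" --apply-change --wb "$1".new
mv "$1".new "$1"
-------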

Basically, this way you keep your replication target compatible with 
its replication source (more or less - a bbox-based extract is not 
fully watertight either, but it works with a reasonably generous bbox). 
On top of that, you do your tag-based filtering and derive a change 
which contains the right "updates" and "creates".

Yes, this _is_ much slower than the "intuitive" way (I started out with 
that, too :)), because this way --apply-change has to process _all_ the 
data you keep, not just the filtered part.

You could try to keep everything in your PostGIS database and then just 
SELECT the stuff that has "gnis:id" for the actual processing. I don't 
know what that means in performance terms, though, as I haven't used 
that kind of database at any scale worth mentioning yet. My guess would 
be that --apply-change gets faster, but you'll need much more disk 
space.
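For illustration, the SELECT could look something like this (assuming 
the standard osm2pgsql planet_osm_* tables and that gnis:id is one of 
the columns defined in your .style - untested, just to show the idea):

-------
# hypothetical example: pull only the features that carry a gnis:id
# out of a full osm2pgsql import
psql -d gis -c \
    'SELECT osm_id, name, "gnis:id" FROM planet_osm_point WHERE "gnis:id" IS NOT NULL;'
-------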

In any case: if you replicate, you need a source and a target that are 
compatible. Since your replication source is the planet, ideally you 
should have a complete planet as the target. Large geographic extracts 
work more or less well; tag-based extracts almost never work as 
replication targets.

Hope that helps
Igor

[1] http://lists.openstreetmap.org/pipermail/dev/2011-April/022394.html


